
Coverage-Guaranteed Prediction Sets for Out-of-Distribution Data

Xin Zou, Weiwei Liu (corresponding author)
Abstract

Out-of-distribution (OOD) generalization has attracted increasing research attention in recent years, due to its promising experimental results in real-world applications. In this paper, we study the confidence set prediction problem in the OOD generalization setting. Split conformal prediction (SCP) is an efficient framework for handling the confidence set prediction problem. However, the validity of SCP requires the examples to be exchangeable, which is violated in the OOD setting. Empirically, we show that trivially applying SCP results in a failure to maintain the marginal coverage when the unseen target domain is different from the source domain. To address this issue, we develop a method for forming confident prediction sets in the OOD setting and theoretically prove the validity of our method. Finally, we conduct experiments on simulated data to empirically verify the correctness of our theory and the validity of our proposed method.

1 Introduction

Recent years have witnessed the remarkable success of modern machine learning techniques in many applications. A fundamental assumption of most machine learning algorithms is that the training and test data are drawn from the same underlying distribution. However, this assumption is consistently violated in many practical applications. In reality, the test environment is influenced by a range of factors, such as the distributional shifts across photos caused by the use of different cameras in image classification tasks, the voices of different persons in voice recognition tasks, and the variations between scenes in self-driving tasks (Nagarajan, Andreassen, and Neyshabur 2021). As a result, there is now a rapidly growing body of research with a focus on generalizing to unseen target domains with the help of the source domains, namely OOD generalization (Shen et al. 2021).

Existing OOD generalization methods focus on improving worst-case performance on the target domains, i.e., improving the average test accuracy of the model on the worst target domain. However, in some systems that require high security (such as medical diagnosis), even a single mistake may have disastrous consequences. In these cases, it is important to quantify the uncertainty of the predictions. One way to perform uncertainty estimation (Amodei et al. 2016; Jiang et al. 2012, 2018; Angelopoulos et al. 2021) is to create confident prediction sets that provably contain the correct answer with high probability. Let $X_{n+1}\in\mathcal{X}$ be a new test example for which we would like to predict the corresponding label $Y_{n+1}\in\mathcal{Y}$, where $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the label space. For any given $\alpha\in(0,1)$, the aim of confidence set prediction is to construct a set-valued output $\mathcal{C}(X_{n+1})$ that contains the label $Y_{n+1}$ with distribution-free marginal coverage at significance level $\alpha$, i.e., $\mathbb{P}\left(Y_{n+1}\in\mathcal{C}(X_{n+1})\right)\geq 1-\alpha$. A confidence set predictor $\mathcal{C}^{\alpha}$ is said to be valid if $\mathbb{P}\left(Y_{n+1}\in\mathcal{C}^{\alpha}(X_{n+1})\right)\geq 1-\alpha$ for any $\alpha\in(0,1)$, where $\alpha$ is a hyper-parameter of the predictor. To simplify the notation, we omit the superscript $\alpha$ in the remainder of this paper.

Conformal prediction (Vovk, Gammerman, and Shafer 2005; Shafer and Vovk 2008, CP) is a model-agnostic, non-parametric and distribution-free (the coverage guarantee holds for any distribution) framework for creating confident prediction sets. Split conformal prediction (Vovk 2013; Vovk, Gammerman, and Shafer 2005, SCP), a special type of CP, has been shown to be computationally efficient. SCP reserves a set of data as the calibration set, and then uses the relative value of scores of the calibration set and that of a new test example to construct the prediction set. The validity of SCP relies on the assumption that the examples are exchangeable. However, in the OOD setting, the distributional shift between the training and test distributions leads to the violation of the exchangeability assumption. We empirically evaluate the performance of SCP in the OOD setting in Section 4. Unfortunately, we find that trivially applying SCP results in a failure to maintain marginal coverage in the OOD setting.

To address this issue, we construct a set predictor based on the $f$-divergence (Alfréd 1961) between the test distribution (target domain) and the convex hull of the training distributions (source domains). We theoretically show that our set predictor is guaranteed to maintain the marginal coverage (Corollary 9). We then conduct simulation experiments to verify our theory.

The remainder of this article is structured as follows: §2 introduces related works; §3 presents the notation and preliminaries; §4 presents experiments showing the failure of SCP in the OOD generalization setting; §5 constructs a corrected confidence set predictor for the OOD generalization setting; §6 provides our experimental results; §7 discusses the most closely related work. Finally, the conclusions are presented in §8. All of our proofs are provided in Appendix A.

2 Related Works

OOD generalization. OOD generalization aims to train a model with data from the source domains so that it is capable of generalizing to an unseen target domain. A large number of algorithms have been developed to improve OOD generalization. One series of works focuses on minimizing the discrepancies between the source domains (Li et al. 2018b; Ganin et al. 2016; Li et al. 2018c; Sun and Saenko 2016). Meta-learning domain generalization (Li et al. 2018a, MLDG) leverages the meta-learning approach and simulates train/test distributional shift during training by synthesizing virtual testing domains within each mini-batch. Another line of works (Xin et al. 2022; Wang et al. 2022) conducts adversarial training (Madry et al. 2018) to improve the OOD generalization performance. (Zou and Liu 2023) considers improving the adversarial robustness of the unseen target domain. Notably, the above works all focus on improving the average performance on the target domain; in contrast, we focus on designing valid confidence set predictors for data from the unseen target domains, as this is a crucial element of making high-stakes decisions in systems that require high security.

Conformal prediction. As introduced in §1, conformal prediction is a model-agnostic, non-parametric, and distribution-free framework that provides valid confidence set predictors. Generally speaking, examples are assumed to be exchangeable in a CP context. Most pertinent to our work, (Gendler et al. 2022; Tibshirani et al. 2019; Fisch et al. 2021; Cauchois et al. 2020; Gibbs and Candès 2021; Oliveira et al. 2022) all consider various situations in which the exchangeability of the examples is violated to some extent. (Gendler et al. 2022) considers the case in which the test examples may be adversarially attacked (Szegedy et al. 2014; Goodfellow, Shlens, and Szegedy 2015; Madry et al. 2018); (Tibshirani et al. 2019) investigates the situation in which the density ratio between the target domain and the source domain is known; (Fisch et al. 2021) studies the few-shot learning setting and assumes that the source domains and the target domain are drawn i.i.d. from some distribution on the domains; (Gibbs and Candès 2021) considers an online learning setting; and (Oliveira et al. 2022) provides results when the examples are mixing (Achim 2013; Xiaohong, Lars Peter, and Marine 2010; Bin 1994). Different from all the works discussed above, we consider the OOD generalization setting in which the $f$-divergence between the target domain and the convex hull of the source domains is constrained. The most closely related work is (Cauchois et al. 2020), which studies the worst-case coverage guarantee over an $f$-divergence ball centered at a single source domain. For a discussion of the similarities and differences with (Cauchois et al. 2020), please refer to Section 7.

3 Preliminaries

We begin with the OOD setups and a review of conformal prediction.

Notations. We denote $\{1,2,\dots,n\}$ by $[n]$ for $n\in\mathbb{N}_{+}$. For a distribution $P$ on $\mathbb{R}$, we define the quantile function of $P$ as $\mathcal{Q}(\beta;P)\coloneqq\inf\{s\in\mathbb{R}\,|\,P(S\leq s)\geq\beta\}$. Similarly, for a cumulative distribution function (c.d.f.) $F$ on $\mathbb{R}$, we define $\mathcal{Q}(\beta;F)\coloneqq\inf\{s\in\mathbb{R}\,|\,F(s)\geq\beta\}$. For $n$ distributions $P_{1},\dots,P_{n}$, we define $\mathcal{CH}\left(P_{1},\dots,P_{n}\right)\coloneqq\left\{\sum_{i=1}^{n}\lambda_{i}P_{i}\,|\,\lambda_{1},\dots,\lambda_{n}\geq 0;\sum_{i=1}^{n}\lambda_{i}=1\right\}$ as the convex hull of the distributions $P_{1},\dots,P_{n}$. We further define $\mathcal{N}(\mu,\Sigma)$ as the multivariate Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$. For a set $A$, we define the indicator function as $\mathbb{I}_{A}(\cdot)$, where $\mathbb{I}_{A}(x)=1$ if $x\in A$ and $\mathbb{I}_{A}(x)=0$ otherwise.
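Since the quantile function $\mathcal{Q}(\beta;P)$ is used throughout, the following small sketch (in Python, with illustrative names; it is not the authors' code) shows how it could be evaluated for an empirical distribution supported on a finite sample.

```python
import numpy as np

# A small sketch of the empirical quantile
# Q(beta; P_hat) = inf{ s : P_hat(S <= s) >= beta } for a sample-based P_hat.
def empirical_quantile(samples, beta):
    s = np.sort(np.asarray(samples, dtype=float))
    n = len(s)
    k = int(np.ceil(beta * n))      # smallest k with k/n >= beta
    k = min(max(k, 1), n)           # clamp to a valid index
    return s[k - 1]

# e.g. empirical_quantile([0.2, 0.5, 0.9, 1.3], 0.75) returns 0.9
```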

3.1 Out-of-Distribution Generalization

We define the input space as $\mathcal{X}$ and the label space as $\mathcal{Y}$. We set $\mathcal{Y}=\{\pm 1\}$, $\mathcal{Y}=\{1,2,\dots,K\}$ (where $K$ is the number of classes), and $\mathcal{Y}=\mathbb{R}$ for the binary classification problem, the multi-class classification problem, and the regression problem, respectively. Let $\mathcal{S}\coloneqq\left\{S_{1},\dots,S_{d}\right\}$ be the set of source domains, where $d$ is the number of source domains. $S_{1},\dots,S_{d}$ are distributions on $\mathcal{Z}\coloneqq\mathcal{X}\times\mathcal{Y}$, and we use the terms "domain" and "distribution" interchangeably in this paper. Let $T$ denote the target domain. The goal of OOD generalization is to obtain good performance on all $T\in\mathcal{T}$, where $\mathcal{T}$ is the set of all possible target domains; we usually assume $\mathcal{S}\subseteq\mathcal{T}$.

In a standard OOD generalization setting, we learn a predictor $h\in\mathcal{H}\subseteq\{h:\mathcal{X}\to\mathcal{Y}\}$ from the source domains $\mathcal{S}$ and define a loss function $\ell:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}_{*}$, where $\mathbb{R}_{*}=[0,+\infty)$. We aim to minimize the worst-case population risk of the predictor $h$ on the unseen target domain as follows:

\mathcal{R}_{\mathcal{T}}(h)=\max_{T\in\mathcal{T}}\ \mathbb{E}_{(X,Y)\sim T}\left[\ell\left(h(X),Y\right)\right].

However, in some systems that require high security, a mistake may lead to serious disasters. In these cases, a good solution is to output a prediction set with a marginal coverage guarantee. For a predefined confidence level $1-\alpha\in(0,1)$, we wish to output a prediction set $\mathcal{C}(x)\subseteq\mathcal{Y}$ such that, for any $T\in\mathcal{T}$:

\underset{(X,Y)\sim T,\mathcal{C}}{\mathbb{P}}\left[Y\in\mathcal{C}(X)\right]\geq 1-\alpha, \qquad (1)

where the probability is over the randomness of the test example $(X,Y)\sim T$ and the randomness of the prediction set $\mathcal{C}$. To achieve (1), we follow the idea of SCP (Vovk 2013; Vovk, Gammerman, and Shafer 2005) to construct $\mathcal{C}(x)$. The next section introduces the main idea of SCP.

3.2 Split Conformal Prediction

Nonconformity score. In SCP, we consider a supervised learning problem that involves predicting the label $y\in\mathcal{Y}$ of the input $x\in\mathcal{X}$. We assume that we have a predictive model $s:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$, which outputs the nonconformity score $s(x,y)$. The nonconformity score function $s(\cdot,\cdot)$ is usually trained with a set of training data. $s(x,y)<s(x,y^{\prime})$ means that for the input $x$, $y$ is more likely than $y^{\prime}$ to be the label. Some examples of nonconformity scores are as follows: for a probabilistic model $p(y|x)$, we can take the negative log-likelihood as the score, $s(x,y)=-\log(p(y|x))$; for a regression model $h:\mathcal{X}\to\mathcal{Y}$, a typical choice is $s(x,y)=|h(x)-y|$; for a multi-class classifier $h:\mathcal{X}\to\Delta^{K-1}$, where $\Delta^{K-1}$ is the $(K-1)$-dimensional simplex in $\mathbb{R}^{K}$, we can take $s(x,y)=1-h(x)_{y}$.
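As a concrete illustration, the following minimal sketch shows how these three scores might be computed from a fitted model's outputs; the function names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

# A minimal sketch of the three nonconformity scores mentioned above.
def nll_score(p_y_given_x):
    """Negative log-likelihood of the candidate label under a probabilistic model."""
    return -np.log(p_y_given_x)

def regression_score(h_x, y):
    """Absolute residual |h(x) - y| for a regression model."""
    return np.abs(h_x - y)

def classification_score(probs, y):
    """1 - h(x)_y for a K-class classifier outputting a probability vector."""
    return 1.0 - probs[y]
```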

In SCP, we assume that the examples $\{(X_{i},Y_{i})\}_{i=1}^{n+1}\subseteq\mathcal{X}\times\mathcal{Y}$ are exchangeable (Definition 1). For a predefined significance level $\alpha\in(0,1)$, the goal is to provide a valid confidence set $\widehat{\mathcal{C}}(X_{n+1})$. CP methods (Shafer and Vovk 2008) take advantage of the exchangeability of the data and the properties of the quantile function to make such a construction possible.

Definition 1 ((Shafer and Vovk 2008, Exchangeability)).

The random variables $Z_{1},\dots,Z_{n}$ are exchangeable if for every permutation $\tau$ of the integers $1,\dots,n$, the variables $W_{1},\dots,W_{n}$, where $W_{i}=Z_{\tau(i)}$, have the same joint probability distribution as $Z_{1},\dots,Z_{n}$.

Let $V_{i}=s(X_{i},Y_{i})$ for $i\in[n+1]$ be the nonconformity scores corresponding to the examples $\{(X_{i},Y_{i})\}_{i=1}^{n+1}$, where $s(\cdot,\cdot)$ is independent of $\{(X_{i},Y_{i})\}_{i=1}^{n+1}$. This independence is useful because, together with the exchangeability of $\{(X_{i},Y_{i})\}_{i=1}^{n+1}$, it implies that the scores $\{V_{i}\}_{i=1}^{n+1}$ are exchangeable. Next, define $\text{rank}(V_{i})$ as the rank of $V_{i}$ among $\{V_{i}\}_{i=1}^{n+1}$ for any $i\in[n+1]$ (in ascending order; we assume that ties are broken randomly). By the exchangeability of $\{V_{i}\}_{i=1}^{n+1}$, $\text{rank}(V_{i})$ is uniform on $[n+1]$, which is used to prove the validity of SCP in Lemma 2. We use $\widehat{P}\left(\{V_{i}\}_{i=1}^{n}\right)$ to denote the empirical distribution determined by the examples $V_{1},\dots,V_{n}$. Let

\widehat{\mathcal{C}}_{n}(x)\coloneqq\left\{y\in\mathcal{Y}\,\Big{|}\,s(x,y)\leq\mathcal{Q}\left(\frac{n+1}{n}(1-\alpha);\widehat{P}\left(\{V_{i}\}_{i=1}^{n}\right)\right)\right\}, \qquad (2)

we then have the following marginal coverage guarantee.

Lemma 2 (The validity of SCP).

Assume that the examples $\{(X_{i},Y_{i})\}_{i=1}^{n+1}$ are exchangeable. For any nonconformity score $s(\cdot,\cdot)$ and any $\alpha\in(0,1)$, the prediction set defined in Equation 2 satisfies:

\mathbb{P}\left(Y_{n+1}\in\widehat{\mathcal{C}}_{n}(X_{n+1})\right)\geq 1-\alpha, \qquad (3)

where the probability is over the randomness of $\{(X_{i},Y_{i})\}_{i=1}^{n+1}$.
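A minimal sketch of the split conformal construction in Equation 2 is given below, assuming a pre-fitted score function and a finite candidate label set; the helper names are illustrative, not the authors' code.

```python
import numpy as np

# A minimal sketch of the split conformal set in Equation 2; `score(x, y)` is an
# already-fitted nonconformity score and `labels` a finite candidate label set.
def scp_threshold(cal_scores, alpha):
    v = np.sort(np.asarray(cal_scores, dtype=float))
    n = len(v)
    # the (n+1)(1-alpha)/n empirical quantile, i.e. the ceil((n+1)(1-alpha))-th smallest score
    k = int(np.ceil(min((n + 1) * (1 - alpha), n)))
    return v[max(k, 1) - 1]

def scp_prediction_set(x, labels, score, cal_scores, alpha):
    t = scp_threshold(cal_scores, alpha)
    return [y for y in labels if score(x, y) <= t]
```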

In the OOD generalization setting, we likewise want a set predictor that is valid for any $T\in\mathcal{T}$. In light of this, some natural questions arise:

Does the set predictor defined in Equation 2 remain valid when the unseen target domain is different from the source domains? If not, can we construct a new set predictor that is valid in the OOD generalization setting?

Unfortunately, the answer to the first question is negative. Theoretically, as shown in Appendix A.1, the proof of Lemma 2 is highly dependent on the exchangeability of the examples $\{(X_{i},Y_{i})\}_{i=1}^{n+1}$, which is easily violated if there is any distributional shift between the distribution of $\{(X_{i},Y_{i})\}_{i=1}^{n}$ and the distribution of $(X_{n+1},Y_{n+1})$. This means that in the OOD setting, the proof technique of Lemma 2 cannot be applied. Empirically, in Section 4, we provide a toy example to show that the set predictor $\widehat{\mathcal{C}}_{n}(x)$ is no longer valid in the OOD setting.

In Section 5, we give an affirmative answer to the second question. We first construct a new set predictor based on the $f$-divergence between the target domain and the convex hull of the source domains, and then provide marginal coverage guarantees for the constructed predictor.

4 SCP Fails in the OOD Setting

In this section, we construct a toy example to show that for the OOD confidence set prediction problem, SCP is no longer valid, even under a slight distributional shift.

For simplicity, we consider a single-domain case. Specifically, we consider the regression problem and set $\mathcal{X}=\mathbb{R}^{l}$, $\mathcal{Y}=\mathbb{R}$. We define the source domain $S$ as follows. Given a linear predictor $L(x)=\left<w^{\star},x\right>+b^{\star}$, where $w^{\star}\in\mathbb{R}^{l}$ and $b^{\star}\in\mathbb{R}$, the marginal distribution of $X$ and the conditional distribution of $Y$ given $X$ are defined as:

X\sim\mathcal{N}(\mu_{s},\sigma_{s,x}^{2}I_{l}),\ \ \ \ Y|X=x\sim\mathcal{N}(L(x),\sigma_{s,y}^{2}),

where $\mu_{s}\in\mathbb{R}^{l}$ is the mean vector of $X$, $\sigma_{s,x},\sigma_{s,y}$ are positive scalars, and $I_{l}\in\mathbb{R}^{l\times l}$ is the identity matrix. Similarly, for the target domain $T$, we define:

X\sim\mathcal{N}(\mu_{t},\sigma_{t,x}^{2}I_{l}),\ \ \ \ Y|X=x\sim\mathcal{N}(L(x),\sigma_{t,y}^{2}),

where $\mu_{t}\in\mathbb{R}^{l}$ is the mean vector of $X$ and $\sigma_{t,x},\sigma_{t,y}$ are positive scalars. For simplicity, we set $\mu_{s}=\mu_{t}$, $\sigma_{s,x}=\sigma_{t,x}$ and $\sigma_{s,y}\neq\sigma_{t,y}$. We sample $m_{\text{train}}$ training examples from $S$ to train a linear predictor $\hat{L}(x)=\left<\hat{w},x\right>+\hat{b}$, where $\hat{w}\in\mathbb{R}^{l}$ and $\hat{b}\in\mathbb{R}$. We then define the nonconformity score as $s(x,y)=|\hat{L}(x)-y|$. We sample $n$ examples from $S$ to construct the prediction set $\widehat{\mathcal{C}}_{n}(x)$ in Equation 2 and sample $m_{\text{test}}$ examples from $T$ to form the test data. We repeat this procedure $1000$ times with different random seeds. The results for the coverage (left) and length (right) of the prediction sets are presented as box plots in Figure 1. Here, the coverage is the fraction of test examples with $y_{i}\in\widehat{\mathcal{C}}_{n}(x_{i})$, and the red lines mark the desired marginal coverage. Since the boxes lie below the red coverage lines, we conclude that SCP fails to provide a prediction set with the desired coverage when there is a distributional shift between the source domain and the target domain.
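For concreteness, a rough sketch of one run of this toy experiment is given below; the dimension, sample sizes, and noise levels are placeholder choices rather than the exact values used for Figure 1.

```python
import numpy as np

# A rough sketch of one run of the toy experiment (constants are placeholders).
rng = np.random.default_rng(0)
l, m_train, n, m_test, alpha = 5, 2000, 1000, 1000, 0.1
w_star, b_star = rng.normal(size=l), 0.5
sigma_sx, sigma_sy, sigma_ty = 1.0, 1.0, 2.0   # target label noise differs from source

def sample(m, sigma_y):
    X = rng.normal(size=(m, l)) * sigma_sx      # mu_s = mu_t = 0 for simplicity
    Y = X @ w_star + b_star + rng.normal(size=m) * sigma_y
    return X, Y

# Fit the linear predictor L_hat on source training data via least squares.
X_tr, Y_tr = sample(m_train, sigma_sy)
A = np.hstack([X_tr, np.ones((m_train, 1))])
coef, *_ = np.linalg.lstsq(A, Y_tr, rcond=None)
w_hat, b_hat = coef[:-1], coef[-1]

# Calibrate the SCP threshold on source scores, then evaluate coverage on the target.
X_cal, Y_cal = sample(n, sigma_sy)
v = np.sort(np.abs(X_cal @ w_hat + b_hat - Y_cal))
k = int(np.ceil(min((n + 1) * (1 - alpha), n)))
t = v[k - 1]
X_te, Y_te = sample(m_test, sigma_ty)
coverage = np.mean(np.abs(X_te @ w_hat + b_hat - Y_te) <= t)
print(coverage)   # typically falls below 1 - alpha because sigma_ty > sigma_sy
```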

Figure 1: Box plots of the results over the $1000$ runs. We show results for $\alpha\in\{0.05,0.1,0.15,0.2,0.25,0.3\}$; the horizontal axis represents the value of $\alpha$. The left plot shows the coverage of the prediction sets, and the red lines are the marginal coverage guarantees that we wish to achieve. The right plot shows the length of the prediction sets.

5 Corrected SCP for OOD Data

In this section, we consider correcting SCP for OOD data. We first consider the case in which we have access to the population distributions of the scores from the source domains. We then consider the case in which we only have access to the empirical distributions and correct the prediction set to obtain a marginal coverage guarantee in Equation 3.

5.1 Target Distribution Set and Confidence Sets

It is obvious that obtaining a marginal coverage guarantee for an arbitrary target distribution is impossible unless we set $\widehat{\mathcal{C}}(x)=\mathcal{Y}$ for all $x\in\mathcal{X}$, which is a trivial confidence set predictor and does not provide any useful information. In this paper, we consider the case in which the $f$-divergence (Alfréd 1961) between the target domain and the convex hull of the source domains does not exceed a predefined value. The well-known KL divergence and TV distance are both special cases of the $f$-divergence.

As Section 4 shows, when the target domain differs from the source domain, the marginal coverage does not hold for the predictor (2). Below, we construct new prediction sets where the marginal coverage (3) holds.

We define a set of distributions $\mathcal{T}\subseteq\{Q\,|\,Q \text{ is a distribution on } \mathcal{X}\times\mathcal{Y}\}$. For each $T\in\mathcal{T}$, the distribution of the score for data from $T$ is defined as the push-forward distribution $s\#T$, where $(s\#T)(A)=T(s^{-1}(A))$ for any measurable set $A\subseteq\mathbb{R}$. We define the distribution set of the scores as $\mathcal{P}\coloneqq\{s\#T:T\in\mathcal{T}\}$. For a given $\alpha\in(0,1)$, our goal is to choose a threshold $t\in\mathbb{R}$ such that the confidence set $\widetilde{\mathcal{C}}(x)\coloneqq\{y\in\mathcal{Y}\,|\,s(x,y)\leq t\}$ satisfies (3) when $(X_{n+1},Y_{n+1})$ is drawn from any target domain $T\in\mathcal{T}$. The following lemma provides a proper choice of $t$.

Lemma 3.

For any unknown target distribution $T\in\mathcal{T}$, assume that $(X_{n+1},Y_{n+1})$ is drawn from $T$. If we set $t\geq\max_{P\in\mathcal{P}}\ \mathcal{Q}(1-\alpha;P)$, then:

\mathbb{P}\left(Y_{n+1}\in\widetilde{\mathcal{C}}(X_{n+1})\right)\geq 1-\alpha. \qquad (4)

For a given set $\mathcal{P}$ of distributions for the score, Lemma 3 reduces the problem of finding a valid confidence set predictor to the following optimization problem:

\max\ \mathcal{Q}(1-\alpha;P)\quad\text{s.t.}\ P\in\mathcal{P}. \qquad (5)

Next, we formulate the set $\mathcal{T}$ through the lens of the $f$-divergence.

Definition 4 ($f$-divergence).

Let $f:\mathbb{R}\to\mathbb{R}$ be a closed convex function satisfying $f(1)=0$ and $f(t)=+\infty$ for $t<0$. Let $P,Q$ be two probability distributions such that $P\ll Q$ ($P$ is absolutely continuous with respect to $Q$). The $f$-divergence between $P$ and $Q$ can then be defined as follows:

D_{f}(P\|Q)\coloneqq\int f\left(\frac{dP}{dQ}\right)dQ,

where $\frac{dP}{dQ}$ is the Radon-Nikodym derivative (Patrick 2008).

Remark 1.

For a given function $f$ that satisfies the conditions in Definition 4, define $f_{0}(t)\coloneqq f(t)-f^{\prime}(1)(t-1)$. We then obtain that, for any $P\ll Q$:

D_{f_{0}}(P\|Q)=\int f_{0}\left(\frac{dP}{dQ}\right)dQ=D_{f}(P\|Q).

By the convexity of $f$, it can easily be observed that $f_{0}(t)\geq 0$ for all $t\in\mathbb{R}$. Moreover, $\inf_{t}f_{0}(t)=f_{0}(1)=0$ and $f_{0}^{\prime}(1)=0$. Since $f_{0}$ produces the same $f$-divergence as $f$, without loss of generality, we can assume that $f^{\prime}(1)=f(1)=0$ and $f\geq 0$.

Equipped with the $f$-divergence, we can now define our target distribution set $\mathcal{T}$ for a given threshold $\rho>0$:

\mathcal{T}_{f,\rho}(S_{1},\cdots,S_{d})\coloneqq\left\{T\,|\,\exists Q\in\mathcal{CH}(S_{1},\cdots,S_{d})\ \text{s.t.}\ D_{f}(T\|Q)\leq\rho\right\}.

We omit $S_{1},\dots,S_{d}$ and use $\mathcal{T}_{f,\rho}$ for simplicity. The corresponding distribution set for the scores is then:

\mathcal{P}\coloneqq\{s\#T\,|\,T\in\mathcal{T}_{f,\rho}\}. \qquad (6)

However, it is hard to obtain the precise relationship between $\mathcal{P}$ and the distributions $s\#S_{1},\dots,s\#S_{d}$, which makes it difficult to analyze $\mathcal{P}$. We instead consider the following distribution set of scores:

\mathcal{P}_{f,\rho}\coloneqq\left\{S\text{ is a distribution on }\mathbb{R}\,\Big|\,\exists S_{0}\in\mathcal{CH}(s\#S_{1},\cdots,s\#S_{d})\ \text{s.t.}\ D_{f}(S\|S_{0})\leq\rho\right\}. \qquad (7)

The following lemma reveals the relationship between $\mathcal{P}$ and $\mathcal{P}_{f,\rho}$.

Lemma 5.

Let $\mathcal{P}$ and $\mathcal{P}_{f,\rho}$ be defined as in (6) and (7), respectively. Then, $\mathcal{P}\subseteq\mathcal{P}_{f,\rho}$.

Remark 2.

According to Lemma 5, $\sup_{P\in\mathcal{P}_{f,\rho}}\mathcal{Q}(1-\alpha;P)\geq\sup_{P\in\mathcal{P}}\mathcal{Q}(1-\alpha;P)$. Lemma 3 accordingly tells us that if we set $t=\sup_{P\in\mathcal{P}_{f,\rho}}\mathcal{Q}(1-\alpha;P)$, then for $(X_{n+1},Y_{n+1})$ drawn from any target distribution $T\in\mathcal{T}_{f,\rho}$, we have $\mathbb{P}\left(Y_{n+1}\in\widetilde{\mathcal{C}}(X_{n+1})\right)\geq 1-\alpha$. Our goal is now to solve Problem (5) for the set $\mathcal{P}_{f,\rho}$.

According to Remark 2, we define the worst-case quantile function for the distribution set $\mathcal{P}_{f,\rho}$ as $\widetilde{\mathcal{Q}}(\alpha;\mathcal{P}_{f,\rho})\coloneqq\sup_{P\in\mathcal{P}_{f,\rho}}\mathcal{Q}(\alpha;P)$. Remark 2 tells us that taking $t=\widetilde{\mathcal{Q}}(1-\alpha;\mathcal{P}_{f,\rho})$ produces a valid confidence set $\widetilde{\mathcal{C}}$. The next theorem allows us to express the worst-case quantile function in terms of the standard quantile function, which helps us to calculate the worst-case quantile efficiently.

Theorem 6.

Let $F_{1},\dots,F_{d}$ be the c.d.f.'s of the distributions $s\#S_{1},\dots,s\#S_{d}$. Define the function $g_{f,\rho}:[0,1]\to[0,1]$ as

g_{f,\rho}(\beta)\coloneqq\inf\left\{z\in[0,1]\,\bigg{|}\,\beta f\left(\frac{z}{\beta}\right)+(1-\beta)f\left(\frac{1-z}{1-\beta}\right)\leq\rho\right\}

and define the inverse of $g_{f,\rho}$ as $g_{f,\rho}^{-1}(\tau)=\sup\{\beta\in[0,1]\,\big{|}\,g_{f,\rho}(\beta)\leq\tau\}$. Let $F_{\min}(x)\coloneqq\min_{1\leq i\leq d}F_{i}(x)$, which is a c.d.f. Then the following holds for all $\alpha\in(0,1)$:

\widetilde{\mathcal{Q}}(\alpha;\mathcal{P}_{f,\rho})=\mathcal{Q}(g_{f,\rho}^{-1}(\alpha);F_{\min}).
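As an illustration of how Theorem 6 could be used in practice, the sketch below evaluates $\mathcal{Q}(\beta;F_{\min})$ from per-domain score samples (empirical c.d.f.'s standing in for $F_{1},\dots,F_{d}$); the level $\beta=g_{f,\rho}^{-1}(\alpha)$ can be obtained from the closed forms in Section 5.3 or from the bisection sketch given after Lemma 10. The function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

# A sketch of evaluating Q(beta; F_min) with empirical per-domain c.d.f.'s.
def quantile_of_Fmin(beta, per_domain_scores):
    """inf{ s : min_i F_i(s) >= beta }, searched over the pooled score values."""
    pooled = np.sort(np.concatenate([np.asarray(v, dtype=float) for v in per_domain_scores]))
    for s in pooled:
        if min(np.mean(np.asarray(v) <= s) for v in per_domain_scores) >= beta:
            return s
    return pooled[-1]   # beta is not attained below the largest observed score
```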

5.2 Marginal Coverage Guarantee for Empirical Source Distributions

In the previous section, we presented marginal coverage guarantees when we have access to the population distributions of the scores for source domains. However, in practice, it is difficult or even impossible to access these population distributions. In this section, we provide marginal coverage guarantees even when we only have access to the empirical distributions, which is useful in practice.

For any $i\in[d]$, assume we have $m_{i}$ i.i.d. examples $\left\{V_{ij}=s(X_{ij},Y_{ij})\right\}_{j=1}^{m_{i}}$ from the source distribution $S_{i}$. Further, suppose that $\hat{F}_{i}$ is the empirical c.d.f. corresponding to $F_{i}$, defined as $\hat{F}_{i}(x)=\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}\mathbb{I}_{(-\infty,x]}(V_{ij})$. Define $\hat{F}_{\min}(x)=\min_{1\leq i\leq d}\hat{F}_{i}(x)$. We first provide an error bound when we estimate $F_{\min}$ with $\hat{F}_{\min}$.

Proposition 7.

Let $F_{1},\dots,F_{d}$ be c.d.f.'s on $\mathbb{R}$ and define $F_{\min}(x)=\min_{1\leq i\leq d}F_{i}(x)$. Suppose $\hat{F}_{1},\dots,\hat{F}_{d}$ are the empirical c.d.f.'s corresponding to $F_{1},\dots,F_{d}$, defined with $m_{1},\dots,m_{d}$ examples, respectively. Define $\hat{F}_{\min}(x)=\min_{1\leq i\leq d}\hat{F}_{i}(x)$. Then, for any $\epsilon>0$,

\mathbb{P}\left(\sup_{x\in\mathbb{R}}\left|F_{\min}(x)-\hat{F}_{\min}(x)\right|>\epsilon\right)\leq 2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}},

where the probability is over the randomness of the examples that define the empirical c.d.f.’s.

Proposition 7 allows us to quantify the error caused by replacing the population distributions with the empirical distributions, which leads to the following marginal coverage guarantee for the prediction set $\widetilde{\mathcal{C}}$ defined above.

Theorem 8 (Marginal coverage guarantee for the empirical estimations).

Assume that $V_{n+1}=s(X_{n+1},Y_{n+1})\sim P\in\mathcal{P}_{f,\rho}$ is independent of $\{V_{ij}\}_{i,j=1}^{d,m_{i}}$, where $\{V_{ij}\}_{j=1}^{m_{i}}\overset{i.i.d.}{\sim}s\#S_{i}$ for $i\in[d]$. Suppose $\rho^{\star}=\inf_{P_{0}\in\mathcal{CH}_{s}}D_{f}(P\|P_{0})\leq\rho$, where $\mathcal{CH}_{s}=\mathcal{CH}(s\#S_{1},\cdots,s\#S_{d})$. Let $\hat{F}_{\min}$ be defined as in Proposition 7 and let $\hat{S}_{1},\dots,\hat{S}_{d}$ be the empirical distributions of $S_{1},\dots,S_{d}$, respectively. If we set $t=\widetilde{\mathcal{Q}}(1-\alpha;\hat{\mathcal{P}}_{f,\rho})=\mathcal{Q}(g_{f,\rho}^{-1}(1-\alpha);\hat{F}_{\min})$, then for any $\epsilon>0$, we obtain the following marginal coverage guarantee for $\widetilde{\mathcal{C}}$:

\mathbb{P}\left(Y_{n+1}\in\widetilde{\mathcal{C}}(X_{n+1})\right)\geq\left(1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}\right)g_{f,\rho^{\star}}\left(g_{f,\rho}^{-1}(1-\alpha)-\epsilon\right),

where the randomness is over the choice of the source examples and $(X_{n+1},Y_{n+1})$, and

\hat{\mathcal{P}}_{f,\rho}\coloneqq\left\{S\,\Big|\,\exists S_{0}\in\mathcal{CH}(s\#\hat{S}_{1},\cdots,s\#\hat{S}_{d})\ \text{s.t.}\ D_{f}(S\|S_{0})\leq\rho\right\}.

By Lemma 14 in the Appendix, $g_{f,\rho}(\beta)$ is non-increasing in $\rho$ and non-decreasing in $\beta$, so $g_{f,\rho^{\star}}(g_{f,\rho}^{-1}(1-\alpha)-\epsilon)\geq g_{f,\rho}(g_{f,\rho}^{-1}(1-\alpha)-\epsilon)$. In practice, we do not know $\rho^{\star}$, so we use $g_{f,\rho}(g_{f,\rho}^{-1}(1-\alpha)-\epsilon)$ instead. Since $g_{f,\rho}(g_{f,\rho}^{-1}(1-\alpha)-\epsilon)\leq g_{f,\rho}(g_{f,\rho}^{-1}(1-\alpha))=1-\alpha$, the resulting guaranteed coverage level $\left(1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}\right)g_{f,\rho}(g_{f,\rho}^{-1}(1-\alpha)-\epsilon)$ is below $1-\alpha$. To achieve a marginal coverage level of at least $1-\alpha$, we need to correct the output set by replacing $\alpha$ with some $\alpha^{\prime}<\alpha$ when running our confidence set predictor. The following corollary tells us how to choose $\alpha^{\prime}$.

Corollary 9 (Correcting the prediction set to obtain a $(1-\alpha)$ marginal coverage).

Let $(X_{n+1},Y_{n+1})$, $\hat{F}_{\min}$, and $\hat{\mathcal{P}}_{f,\rho}$ be defined as in Theorem 8. For arbitrary $\epsilon>0$, if we set $t=\widetilde{\mathcal{Q}}(1-\alpha^{\prime};\hat{\mathcal{P}}_{f,\rho})=\mathcal{Q}\left(g_{f,\rho}^{-1}(1-\alpha^{\prime});\hat{F}_{\min}\right)$, where

\alpha^{\prime}=1-g_{f,\rho}\left(\epsilon+g_{f,\rho}^{-1}\left(\frac{1-\alpha}{1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}}\right)\right),

then we obtain the following marginal coverage guarantee:

\mathbb{P}\left(Y_{n+1}\in\widetilde{\mathcal{C}}(X_{n+1})\right)\geq 1-\alpha.
Remark 3.

Corollary 9 tells us that we can take $t=\mathcal{Q}\left(g_{f,\rho}^{-1}(1-\alpha^{\prime});\hat{F}_{\min}\right)=\mathcal{Q}\left(\epsilon+g_{f,\rho}^{-1}\left(\frac{1-\alpha}{1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}}\right);\hat{F}_{\min}\right)$ to obtain a marginal coverage guarantee at confidence level $1-\alpha$. When $f(\cdot)$ and $s(\cdot,\cdot)$ are chosen and the numbers of examples used to estimate the source distributions, i.e., $m_{1},\dots,m_{d}$, are given, we solve the following optimization problem to find a desired $t$:

\min_{0<\epsilon\leq 1}\ \mathcal{Q}\left(\epsilon+g_{f,\rho}^{-1}\left(\frac{1-\alpha}{1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}}\right);\hat{F}_{\min}\right),\quad\text{s.t.}\ \epsilon+g_{f,\rho}^{-1}\left(\frac{1-\alpha}{1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}}\right)\leq 1.

Since the quantile function $\mathcal{Q}\left(\cdot;\hat{F}_{\min}\right)$ is non-decreasing, letting $h(\epsilon)=\epsilon+g_{f,\rho}^{-1}\left(\frac{1-\alpha}{1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}}\right)$, we solve the following problem instead:

\min\ h(\epsilon)\quad\text{s.t. }0<\epsilon\leq 1,\ h(\epsilon)\leq 1.
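Since $h$ is a scalar function of $\epsilon$ on $(0,1]$, one simple option is a grid search. The sketch below assumes a routine `g_inv` computing $g_{f,\rho}^{-1}$ (see the bisection sketch after Lemma 10) and is an illustrative implementation choice, not the authors' procedure.

```python
import numpy as np

# A sketch of choosing epsilon by grid search; `m` collects the per-domain
# sample sizes m_1, ..., m_d. Names and the grid resolution are illustrative.
def corrected_level(alpha, rho, f, m, g_inv, grid_size=1000):
    """Return the smallest feasible h(eps), i.e. the quantile level to use with F_hat_min."""
    best = None
    for eps in np.linspace(1e-4, 1.0, grid_size):
        denom = 1.0 - 2.0 * np.sum(np.exp(-2.0 * np.asarray(m) * eps ** 2))
        if denom <= 0:
            continue                     # the concentration bound is vacuous for this eps
        target = (1.0 - alpha) / denom
        if target > 1.0:
            continue                     # g_{f,rho}^{-1} is only defined on [0, 1]
        level = eps + g_inv(target, f, rho)
        if level <= 1.0 and (best is None or level < best):
            best = level
    return best                          # None if the problem is infeasible

# The corrected threshold is then t = Q(corrected_level(...); F_hat_min), e.g. via
# the quantile_of_Fmin sketch above applied to the per-domain calibration scores.
```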

For some choices of $f$, the functions $g_{f,\rho}$ and $g_{f,\rho}^{-1}$ have closed forms (please refer to the examples in Section 5.3). For a general $f$ for which $g_{f,\rho}^{-1}$ has no closed form, the following lemma tells us that we can use a binary search algorithm to efficiently compute the value of $g_{f,\rho}^{-1}(\tau)$ for a given $\tau$.

Lemma 10 ((Cauchois et al. 2020), a form of $g_{f,\rho}^{-1}$ that can be evaluated efficiently).

Let $g_{f,\rho},g_{f,\rho}^{-1}$ be defined as in Theorem 6. Then, for any $\tau\in[0,1]$, we have:

g_{f,\rho}^{-1}(\tau)=\sup\left\{\beta\in[\tau,1]\,\bigg{|}\,\beta f\left(\frac{\tau}{\beta}\right)+(1-\beta)f\left(\frac{1-\tau}{1-\beta}\right)\leq\rho\right\}.
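A bisection sketch for evaluating $g_{f,\rho}^{-1}(\tau)$ based on Lemma 10 is given below; it assumes a generic convex generator $f$ with $f(1)=0$ and uses the fact that the constraint function is non-decreasing in $\beta$ on $[\tau,1]$. This is an illustrative routine, not the authors' implementation.

```python
# A bisection sketch for g_{f,rho}^{-1}(tau) based on Lemma 10.
def h(z, beta, f):
    """beta*f(z/beta) + (1-beta)*f((1-z)/(1-beta)), dropping the zero-weight terms."""
    val = 0.0
    if beta > 0:
        val += beta * f(z / beta)
    if beta < 1:
        val += (1 - beta) * f((1 - z) / (1 - beta))
    return val

def g_inv(tau, f, rho, tol=1e-8):
    """sup{ beta in [tau, 1] : h(tau, beta) <= rho }."""
    if h(tau, 1.0 - tol, f) <= rho:   # feasible all the way up to the boundary
        return 1.0
    lo, hi = tau, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(tau, mid, f) <= rho:
            lo = mid
        else:
            hi = mid
    return lo

# e.g. with the chi-square generator f(t) = (t - 1)**2:
# g_inv(0.9, lambda t: (t - 1.0) ** 2, rho=0.05)
```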

5.3 Examples

In this section, we present some examples of calculating $g_{f,\rho}$ and $g_{f,\rho}^{-1}$ for some important $f$-divergences.

Example 1 ($\chi^{2}$-divergence).

Let $f(t)=(t-1)^{2}$; then $D_{f}(P\|Q)=\mathbb{E}_{Q}\left[\left(\frac{dP}{dQ}-1\right)^{2}\right]=\mathbb{E}_{Q}\left[\left(\frac{dP}{dQ}\right)^{2}-1\right]$ is the $\chi^{2}$-divergence. In this case, we have $g_{f,\rho}(\beta)=\left(\beta-\sqrt{\rho\beta(1-\beta)}\right)_{+}$, where $(x)_{+}=\max\{0,x\}$. $g_{f,\rho}^{-1}(\tau)$ is the solution of the following optimization problem:

\max\ \beta\quad\text{s.t.}\quad\frac{\rho}{\rho+1}\leq\beta\leq 1,\qquad\beta-\sqrt{\rho\beta(1-\beta)}\leq\tau.
Example 2 (Total variation distance, (Cauchois et al. 2020)).

Let $f(t)=\frac{1}{2}|t-1|$; then $D_{f}(P\|Q)=\mathbb{E}_{Q}\left[\frac{1}{2}\left|\frac{dP}{dQ}-1\right|\right]$ is the total variation distance. In this case, we can provide analytic forms for $g_{f,\rho}$ and $g_{f,\rho}^{-1}$:

g_{f,\rho}(\beta)=(\beta-\rho)_{+},\ \ \ \ g_{f,\rho}^{-1}(\tau)=\min\{\tau+\rho,1\}.
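These analytic forms translate directly into code; a tiny illustrative sketch:

```python
# Analytic g and g^{-1} for the total variation distance, f(t) = |t - 1| / 2.
def g_tv(beta, rho):
    """g_{f,rho}(beta) = (beta - rho)_+."""
    return max(beta - rho, 0.0)

def g_tv_inv(tau, rho):
    """g_{f,rho}^{-1}(tau) = min(tau + rho, 1)."""
    return min(tau + rho, 1.0)
```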
Example 3 (Kullback-Leibler divergence).

Let $f(t)=t\log t$; then $D_{f}(P\|Q)=\mathbb{E}_{Q}\left[\frac{dP}{dQ}\log\left(\frac{dP}{dQ}\right)\right]$ is the Kullback-Leibler (KL) divergence (Solomon and Richard A 1951). Unfortunately, we cannot provide analytic forms of $g_{f,\rho}$ and $g_{f,\rho}^{-1}$ for the KL divergence. Fortunately, according to Theorem 6 and Remark 3, we can compute the values $g_{f,\rho}(\beta)$ and $g_{f,\rho}^{-1}(\tau)$ by solving a one-dimensional convex optimization problem, which can be solved efficiently using binary search.

Figure 2: Violin plots of the coverage over the $1000$ runs under the same data generation settings as in Section 4. We show results for $\alpha\in\{0.05,0.1,0.15,0.2,0.25,0.3\}$. The red lines are the marginal coverage guarantees that we wish to achieve. The white point represents the median, while the two endpoints of the thick line are the $0.25$ and $0.75$ quantiles.
Figure 3: Violin plots of the coverage over the $1000$ runs for the multi-source OOD confidence set prediction task. We show results for $\alpha\in\{0.05,0.1,0.15,0.2,0.25,0.3\}$. The red lines are the marginal coverage guarantees that we wish to achieve. The white point represents the median, while the two endpoints of the thick line are the $0.25$ and $0.75$ quantiles.

6 Experiments

In this section, we use simulated data to verify our theory and the validity of our constructed confidence set predictor (referred to as OOD-SCP in the remainder of this paper). We consider two cases: first, we verify the validity of OOD-SCP using the same settings as in Section 4; then, we construct a multi-source OOD confidence set prediction task and show that OOD-SCP is valid for this task.

According to Figure 2, unlike standard SCP, for all values of α\alpha, the violin for OOD-SCP is above the desired coverage line, which shows that OOD-SCP is empirically valid.

We next consider a multi-source OOD confidence set prediction task. Similar to Section 4, we consider the regression problem and set $\mathcal{X}=\mathbb{R}^{l}$, $\mathcal{Y}=\mathbb{R}$. Define the oracle linear predictor $L:\mathcal{X}\to\mathcal{Y}$ as $L(x)=\left<w^{\star},x\right>+b^{\star}$, where $w^{\star}\in\mathbb{R}^{l}$ and $b^{\star}\in\mathbb{R}$. We define the marginal distributions of $X$ for the source domains $S_{1}$ and $S_{2}$ as

S_{1X}=\mathcal{N}(\mu_{1},\sigma_{s,x}^{2}I_{l}),\ \ \ \ S_{2X}=\mathcal{N}(\mu_{2},\sigma_{s,x}^{2}I_{l})

respectively, where $\mu_{1},\mu_{2}\in\mathbb{R}^{l}$ are the mean vectors, $\sigma_{s,x}>0$ is a scalar, and $I_{l}\in\mathbb{R}^{l\times l}$ is the identity matrix of dimension $l\times l$. We define $Y|X=x\sim\mathcal{N}(L(x),\sigma_{s,y}^{2})$ for both $S_{1}$ and $S_{2}$. For the target domain $T$, we define the marginal distribution of $X$ as $T_{X}=\frac{S_{1X}+S_{2X}}{2}$ and the conditional distribution of $Y$ given $X$ as $Y|X=x\sim\mathcal{N}(L(x),\sigma_{t,y}^{2})$. Here, $\sigma_{s,y},\sigma_{t,y}>0$ are the standard deviations and $\sigma_{s,y}\neq\sigma_{t,y}$.

Similar to Section 4, we sample $\frac{m_{\text{train}}}{2}$ examples from $S_{1}$ and $\frac{m_{\text{train}}}{2}$ examples from $S_{2}$ to train a linear predictor $\hat{L}(x)=\left<\hat{w},x\right>+\hat{b}$, where $\hat{w}\in\mathbb{R}^{l}$ and $\hat{b}\in\mathbb{R}$. We then define the nonconformity score as $s(x,y)=|\hat{L}(x)-y|$. We sample $\frac{n}{2}$ examples from $S_{1}$ and $\frac{n}{2}$ examples from $S_{2}$ to construct the prediction set $\widetilde{\mathcal{C}}(x)$, and sample $m_{\text{test}}$ examples from $T$ to form the test data.
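Putting the pieces together, the following rough sketch runs OOD-SCP once on a task of this form, reusing the routines sketched in Section 5 (`g_inv`, `corrected_level`, `quantile_of_Fmin`) with the total variation distance; every constant, including $\rho$, is a placeholder assumption rather than the paper's experimental setting.

```python
import numpy as np

# A rough end-to-end sketch of OOD-SCP on a two-source task (placeholder constants).
rng = np.random.default_rng(0)
l, n, m_test, alpha, rho = 5, 10000, 2000, 0.2, 0.1
w_star, b_star = rng.normal(size=l), 0.5                # oracle predictor
w_hat, b_hat = w_star + 0.05 * rng.normal(size=l), b_star  # stand-in for the fitted L_hat
mu1, mu2, sigma_sy, sigma_ty = np.ones(l), -np.ones(l), 1.0, 1.2

def abs_residual_scores(mu, sigma_y, m):
    X = mu + rng.normal(size=(m, l))
    Y = X @ w_star + b_star + rng.normal(size=m) * sigma_y
    return np.abs(X @ w_hat + b_hat - Y)

# Per-domain calibration scores from S_1 and S_2.
cal = [abs_residual_scores(mu1, sigma_sy, n // 2),
       abs_residual_scores(mu2, sigma_sy, n // 2)]

f_tv = lambda t: 0.5 * abs(t - 1.0)
level = corrected_level(alpha, rho, f_tv, [len(v) for v in cal], g_inv)
assert level is not None, "correction infeasible for these m_i, alpha, rho"
threshold = quantile_of_Fmin(level, cal)

# Target: X from the even mixture of the two source marginals, with different label noise.
m1 = int(rng.binomial(m_test, 0.5))
test_scores = np.concatenate([abs_residual_scores(mu1, sigma_ty, m1),
                              abs_residual_scores(mu2, sigma_ty, m_test - m1)])
print(np.mean(test_scores <= threshold))   # should be >= 1 - alpha when T lies in T_{f,rho}
```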

Figure 3 shows the results for the multi-source OOD confidence set prediction task. From the figure, we can see that the violins for the standard SCP are under the desired coverage lines, which means that the standard SCP is invalid in this case. By contrast, the violins for OOD-SCP are above the desired coverage lines, indicating that OOD-SCP is valid, which validates Corollary 9.

The reason we do not conduct experiments on real datasets is that we do not know how to set the value of $\rho$ for existing OOD datasets. Our main claim is that when the target domain satisfies $T\in\mathcal{T}_{f,\rho}$, the coverage of our method is guaranteed. However, we believe this limitation is acceptable, as many other fields face the same problem.

In adversarial robustness (Szegedy et al. 2014), existing theories (for example, (Montasser, Hanneke, and Srebro 2019)) provide an upper bound on the test adversarial risk $\underset{(x,y)\sim D}{\mathbb{P}}[\exists\|\delta\|\leq\epsilon:h(x+\delta)\neq y]$, where $h$ is a classifier. The results only guarantee the test accuracy when the test perturbation $\delta$ satisfies $\|\delta\|\leq\epsilon$. However, what if $\|\delta\|>\epsilon$? That is outside the scope of their theories.

For distributionally robust optimization (DRO), the theory of (Lee and Raginsky 2018) proves that if the test distribution lies in a Wasserstein ball of radius $r$, then the test risk can be upper bounded. Formally, $\max_{D\in W(r)}\underset{(x,y)\sim D}{\mathbb{P}}[h(x)\neq y]$ is upper bounded, where $W(r)$ is a Wasserstein ball of radius $r$. It is likewise unknown how to set $r$ so that $W(r)$ contains the test distribution; however, this does not overshadow the contribution to the DRO community. In other words, the issue of choosing $\rho$ does not overshadow our contribution to the OOD community.

7 Discussions

Our work is an extension of (Cauchois et al. 2020) to the multi-domain case. In this section, we discuss the differences between our work and (Cauchois et al. 2020).

7.1 The Necessity of Our Extension

In the multi-source setting, to make use of all the source domains, a trivial method is to regard the mixture of the source domains as a single domain $S=\sum_{i=1}^{d}\lambda_{i}S_{i}$ and apply the method in (Cauchois et al. 2020). However, there are two issues:

  • Given the empirical data from $S_{1},\dots,S_{d}$, we do not know the exact values of $\lambda_{1},\dots,\lambda_{d}$ for the mixed domain, so we do not know the set $\bar{\mathcal{P}}=\left\{s\#T\,|\,D_{f}(T\|S)\leq\rho\right\}$ for a given $\rho$. Thus we do not know the set for which we are providing a coverage guarantee.

  • We may not be able to provide a coverage guarantee for data from one of the source domains. Take the KL divergence as an example: drawing from $S$ can be regarded as first drawing an index $I$ from $\boldsymbol{\lambda}$ and then drawing an example from $S_{I}$. $S_{i}$ can be seen as drawn from the same process with $\boldsymbol{\lambda}=e_{i}$, where $e_{i}=(0,\dots,0,1,0,\dots,0)\in\mathbb{R}^{d}$ has only the $i$-th element equal to $1$. By the chain rule, $KL(S_{i}\|S)=KL(e_{i}\|\boldsymbol{\lambda})+\mathbb{E}_{j\sim e_{i}}KL(S_{j}\|S_{j})=\log(1/\lambda_{i})$. If $\rho<\max_{i}\log(1/\lambda_{i})$, then there exists an $S_{i}$ such that $KL(S_{i}\|S)>\rho$, i.e., we cannot even obtain a coverage guarantee for the source domain $S_{i}$, which is unacceptable. The problem gets worse as the number of source domains $d$ grows, since $\max_{i}\log(1/\lambda_{i})\geq\log d$. In our generalization, by contrast, there is no such problem even if we choose $\rho=0$. This means that the method in (Cauchois et al. 2020) is not compatible with the multi-source setting, so our extension is necessary.

7.2 The Difference in Proof Skills

In fact, our Theorem 6 is an extension of (Cauchois et al. 2020) to the OOD setting and mainly depends on Lemmas 17 and 18. Lemma 17 helps us reduce the multi-input $g_{f,\rho}$ to the single-input case. The main idea of the proof of Lemma 18 comes from the argument in (Cauchois et al. 2020); however, the extension is non-trivial. In Lemma 18, letting $h(z,\beta)=\beta f(z/\beta)+(1-\beta)f((1-z)/(1-\beta))$, we use the multi-input $g_{f,\rho}(\beta_{1},\cdots,\beta_{d})=\inf\{z\in[0,1]\,|\,\inf_{\lambda\in\Delta^{d-1}}h(z,\sum_{i=1}^{d}\lambda_{i}\beta_{i})\leq\rho\}$, which involves taking an infimum with respect to $\lambda$ and is considerably more complicated than the single-input case in (Cauchois et al. 2020). We construct a set $\mathcal{P}_{f,\rho}^{*}$ that is more complicated than that in (Cauchois et al. 2020), and the proof is more difficult. Moreover, due to the multiple inputs and the $\inf_{\lambda\in\Delta^{d-1}}$ operator, we need to consider four cases according to whether each $F_{i}(t)$ is $0$ or $1$.

Our Theorem 8 and its corresponding Corollary 9 are novel and quite different from Corollaries 2.1 and 2.2 in (Cauchois et al. 2020). The common point is that they all consider finite-sample approximation. The proof of Corollary 2.1 in (Cauchois et al. 2020) relies on the exchangeability of the source examples; however, in the OOD setting, examples are drawn from different source domains and are not exchangeable, so the analysis techniques in (Cauchois et al. 2020) cannot be applied in our case. To fill this gap, we use a decomposition technique and concentration inequalities.

8 Conclusion

We study the confidence set prediction problem in the OOD generalization setting. We first empirically show that SCP is not valid in the OOD generalization setting. We then develop a method for forming valid confident prediction sets in the OOD setting and theoretically prove the validity of our proposed method. Finally, we conduct experiments on simulated data to empirically verify both the correctness of our theory and the validity of our proposed method.

Acknowledgements

This work is supported by the National Key R&D Program of China under Grant 2023YFC3604702, the National Natural Science Foundation of China under Grant 61976161, the Fundamental Research Funds for the Central Universities under Grant 2042022rc0016.

References

  • Achim (2013) Achim, K. 2013. Probability theory: a comprehensive course. Springer Science & Business Media.
  • Alfréd (1961) Alfréd, R. 1961. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, volume 4, 547–562.
  • Amodei et al. (2016) Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P. F.; Schulman, J.; and Mané, D. 2016. Concrete Problems in AI Safety. CoRR, abs/1606.06565.
  • Angelopoulos et al. (2021) Angelopoulos, A. N.; Bates, S.; Jordan, M. I.; and Malik, J. 2021. Uncertainty Sets for Image Classifiers using Conformal Prediction. In ICLR.
  • Bin (1994) Bin, Y. 1994. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 94–116.
  • Boyd, Boyd, and Vandenberghe (2004) Boyd, S.; Boyd, S. P.; and Vandenberghe, L. 2004. Convex optimization. Cambridge university press.
  • Cauchois et al. (2020) Cauchois, M.; Gupta, S.; Ali, A.; and Duchi, J. C. 2020. Robust Validation: Confident Predictions Even When Distributions Shift. CoRR, abs/2008.04267.
  • Fisch et al. (2021) Fisch, A.; Schuster, T.; Jaakkola, T. S.; and Barzilay, R. 2021. Few-Shot Conformal Prediction with Auxiliary Tasks. In ICML, 3329–3339.
  • Ganin et al. (2016) Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. S. 2016. Domain-Adversarial Training of Neural Networks. J. Mach. Learn. Res., 17: 59:1–59:35.
  • Gendler et al. (2022) Gendler, A.; Weng, T.; Daniel, L.; and Romano, Y. 2022. Adversarially Robust Conformal Prediction. In ICLR.
  • Gibbs and Candès (2021) Gibbs, I.; and Candès, E. J. 2021. Adaptive Conformal Inference Under Distribution Shift. In NeurIPS, 1660–1672.
  • Goodfellow, Shlens, and Szegedy (2015) Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2015. Explaining and Harnessing Adversarial Examples. In ICLR.
  • Jiang et al. (2018) Jiang, H.; Kim, B.; Guan, M. Y.; and Gupta, M. R. 2018. To Trust Or Not To Trust A Classifier. In NeurIPS, 5546–5557.
  • Jiang et al. (2012) Jiang, X.; Osl, M.; Kim, J.; and Ohno-Machado, L. 2012. Calibrating predictive model estimates to support personalized medicine. J. Am. Medical Informatics Assoc., 19(2): 263–274.
  • Lee and Raginsky (2018) Lee, J.; and Raginsky, M. 2018. Minimax Statistical Learning with Wasserstein distances. In NeurIPS, 2692–2701.
  • Li et al. (2018a) Li, D.; Yang, Y.; Song, Y.; and Hospedales, T. M. 2018a. Learning to Generalize: Meta-Learning for Domain Generalization. In AAAI, 3490–3497.
  • Li et al. (2018b) Li, H.; Pan, S. J.; Wang, S.; and Kot, A. C. 2018b. Domain Generalization With Adversarial Feature Learning. In CVPR, 5400–5409.
  • Li et al. (2018c) Li, Y.; Tian, X.; Gong, M.; Liu, Y.; Liu, T.; Zhang, K.; and Tao, D. 2018c. Deep Domain Generalization via Conditional Invariant Adversarial Networks. In ECCV, volume 11219, 647–663.
  • Madry et al. (2018) Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2018. Towards Deep Learning Models Resistant to Adversarial Attacks. In ICLR.
  • Massart (1990) Massart, P. 1990. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The annals of Probability, 1269–1283.
  • Montasser, Hanneke, and Srebro (2019) Montasser, O.; Hanneke, S.; and Srebro, N. 2019. VC Classes are Adversarially Robustly Learnable, but Only Improperly. In COLT, 2512–2530.
  • Nagarajan, Andreassen, and Neyshabur (2021) Nagarajan, V.; Andreassen, A.; and Neyshabur, B. 2021. Understanding the failure modes of out-of-distribution generalization. In ICLR.
  • Oliveira et al. (2022) Oliveira, R. I.; Orenstein, P.; Ramos, T.; and Romano, J. V. 2022. Split Conformal Prediction for Dependent Data. arXiv:2203.15885.
  • Patrick (2008) Patrick, B. 2008. Probability and measure. John Wiley & Sons.
  • Shafer and Vovk (2008) Shafer, G.; and Vovk, V. 2008. A Tutorial on Conformal Prediction. J. Mach. Learn. Res., 9: 371–421.
  • Shen et al. (2021) Shen, Z.; Liu, J.; He, Y.; Zhang, X.; Xu, R.; Yu, H.; and Cui, P. 2021. Towards Out-Of-Distribution Generalization: A Survey. CoRR, abs/2108.13624.
  • Solomon and Richard A (1951) Solomon, K.; and Richard A, L. 1951. On information and sufficiency. The annals of mathematical statistics, 22(1): 79–86.
  • Sun and Saenko (2016) Sun, B.; and Saenko, K. 2016. Deep CORAL: Correlation Alignment for Deep Domain Adaptation. In ECCV, volume 9915, 443–450.
  • Szegedy et al. (2014) Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I. J.; and Fergus, R. 2014. Intriguing properties of neural networks. In ICLR.
  • Tibshirani et al. (2019) Tibshirani, R. J.; Barber, R. F.; Candès, E. J.; and Ramdas, A. 2019. Conformal Prediction Under Covariate Shift. In NeurIPS, 2526–2536.
  • Vovk (2013) Vovk, V. 2013. Conditional validity of inductive conformal predictors. Mach. Learn., 92(2-3): 349–376.
  • Vovk, Gammerman, and Shafer (2005) Vovk, V.; Gammerman, A.; and Shafer, G. 2005. Algorithmic Learning in a Random World. Springer-Verlag. ISBN 0387001522.
  • Wang et al. (2022) Wang, Q.; Wang, Y.; Zhu, H.; and Wang, Y. 2022. Improving Out-of-Distribution Generalization by Adversarial Training with Structured Priors. CoRR, abs/2210.06807.
  • Xiaohong, Lars Peter, and Marine (2010) Xiaohong, C.; Lars Peter, H.; and Marine, C. 2010. Nonlinearity and temporal dependence. Journal of Econometrics, 155: 155–169.
  • Xin et al. (2022) Xin, S.; Wang, Y.; Su, J.; and Wang, Y. 2022. Domain-wise Adversarial Training for Out-of-Distribution Generalization.
  • Zou and Liu (2023) Zou, X.; and Liu, W. 2023. On the Adversarial Robustness of Out-of-distribution Generalization Models. In Thirty-seventh Conference on Neural Information Processing Systems.

Appendix A Proofs

A.1 Proof of Lemma 2

Lemma 2.

Assume that the examples $\{(X_{i},Y_{i})\}_{i=1}^{n+1}$ are exchangeable. For any nonconformity score $s$ and any $\alpha\in(0,1)$, the prediction set defined in Equation 2 satisfies:

\mathbb{P}\left(Y_{n+1}\in\widehat{\mathcal{C}}_{n}(X_{n+1})\right)\geq 1-\alpha,

where the probability is over the randomness of $\{(X_{i},Y_{i})\}_{i=1}^{n+1}$.

Proof of Lemma 2.

Let $V_{i}=s(X_{i},Y_{i})$; then the following holds:

\mathbb{P}\left(Y_{n+1}\in\widehat{\mathcal{C}}_{n}(X_{n+1})\right)\overset{(\text{i})}{=}\mathbb{P}\left(V_{n+1}\leq\mathcal{Q}\left(\frac{n+1}{n}(1-\alpha);\widehat{P}\left(\{V_{i}\}_{i=1}^{n}\right)\right)\right)
\overset{(\text{ii})}{=}\mathbb{P}\left(\mathbb{P}(S\leq V_{n+1})\leq\frac{n+1}{n}(1-\alpha)\right)
\geq\mathbb{P}\left(\text{rank}(V_{n+1})\leq\lceil(n+1)(1-\alpha)\rceil\right)
\geq 1-\alpha,

where $(\text{i})$ follows from the definition of $\widehat{\mathcal{C}}_{n}$ and $(\text{ii})$ follows from the definition of the quantile function $\mathcal{Q}$; the inner probability is over the randomness of $S$ and the outer probability is over the randomness of $\{(X_{i},Y_{i})\}_{i=1}^{n+1}$. ∎

A.2 Proof of Lemma 3

Lemma 3.

For any unknown target distribution $T\in\mathcal{T}$, assume that $(X_{n+1},Y_{n+1})$ is drawn from $T$. If we set $t\geq\max_{P\in\mathcal{P}}\ \mathcal{Q}(1-\alpha;P)$, then:

\mathbb{P}\left(Y_{n+1}\in\widetilde{\mathcal{C}}(X_{n+1})\right)\geq 1-\alpha.
Proof of Lemma 3.
\mathbb{P}\left(Y_{n+1}\in\widetilde{\mathcal{C}}(X_{n+1})\right)=\mathbb{P}\left(s(X_{n+1},Y_{n+1})\leq t\right)=\mathbb{P}\left(V_{n+1}\leq t\right)
\overset{(\text{i})}{\geq}\mathbb{P}\left(V_{n+1}\leq\max_{P\in\mathcal{P}}\ \mathcal{Q}(1-\alpha;P)\right)
\overset{(\text{ii})}{\geq}\mathbb{P}\left(V_{n+1}\leq\mathcal{Q}(1-\alpha;s\#T)\right)
\overset{(\text{iii})}{\geq}1-\alpha,

where $(\text{i})$ is from the fact that $t\geq\max_{P\in\mathcal{P}}\ \mathcal{Q}(1-\alpha;P)$; $(\text{ii})$ is because $\max_{P\in\mathcal{P}}\ \mathcal{Q}(1-\alpha;P)\geq\mathcal{Q}(1-\alpha;s\#T)$; and $(\text{iii})$ is a result of the definition of the quantile function. ∎

A.3 Proof of Lemma 5

Lemma 5.

Let $\mathcal{P}$ and $\mathcal{P}_{f,\rho}$ be defined as in (6) and (7), respectively. Then, $\mathcal{P}\subseteq\mathcal{P}_{f,\rho}$.

The proof of Lemma 5 relies on the data processing inequality for the $f$-divergence.

Lemma 11 (Data processing inequality for $f$-divergences).

Let $P_{X},Q_{X}$ be distributions on a measurable space $(\mathcal{X},\mathcal{F})$ and let $P_{Y|X}$ be a transition kernel from $(\mathcal{X},\mathcal{F})$ to $(\mathcal{Y},\mathcal{G})$. Let $P_{Y},Q_{Y}$ be the distributions on the measurable space $(\mathcal{Y},\mathcal{G})$ obtained by pushing $P_{X},Q_{X}$ through $P_{Y|X}$, i.e., $P_{Y}(B)=\int_{\mathcal{X}}P_{Y|X}(B|x)\,dP_{X}(x)$. Then, for any $f$-divergence, we have that:

D_{f}(P_{X}\|Q_{X})\geq D_{f}(P_{Y}\|Q_{Y}).

To prove Lemma 11, we need some properties of $f$-divergences.

Lemma 12 (Properties of $f$-divergences).

For any distributions $P,Q$ on $\mathcal{Z}$ dominated by a common measure $\lambda$, i.e., $P,Q\ll\lambda$ (such a $\lambda$ always exists, since we can take $\lambda=\frac{1}{2}P+\frac{1}{2}Q$), we have:

  • Non-negativity: $D_{f}(P\|Q)\geq 0$. Furthermore, if $f$ is strictly convex around $1$, then equality is obtained if and only if $P=Q$.

  • Convexity: The mapping $(P,Q)\mapsto D_{f}(P\|Q)$ is jointly convex. Consequently, $P\mapsto D_{f}(P\|Q)$ is convex for fixed $Q$, and $Q\mapsto D_{f}(P\|Q)$ is convex for fixed $P$.

  • Joint vs. marginal: If $P_{XY}=P_{X}P_{Y|X}$ and $Q_{XY}=Q_{X}P_{Y|X}$, then $D_{f}\left(P_{XY}\|Q_{XY}\right)=D_{f}\left(P_{X}\|Q_{X}\right)$.

Proof of Lemma 12.

To prove the non-negativity, by the definition of the $f$-divergence we have:

D_{f}(P\|Q)=\mathbb{E}_{Q}\left[f\left(\frac{dP}{dQ}\right)\right]\overset{(\text{i})}{\geq}f\left(\mathbb{E}_{Q}\left[\frac{dP}{dQ}\right]\right)=f(1)=0,

where $(\text{i})$ follows from Jensen's inequality. Next, we prove that if $f$ is strictly convex around $1$, then equality is obtained if and only if $P=Q$. It is obvious that if $P=Q$, then $f\left(\frac{dP}{dQ}\right)=f(1)=0$, so $D_{f}(P\|Q)=0$. For the other direction, as stated in Remark 1, we assume $f\geq 0$, so $D_{f}(P\|Q)=0$ implies that $f\left(\frac{dP}{dQ}\right)=0$ almost everywhere. Since $f$ is strictly convex around $1$ and $f(1)=0$, $f\left(\frac{dP}{dQ}\right)=0$ implies $\frac{dP}{dQ}=1$, which means that $P=Q$.

To prove the convexity, let =[0,+)\mathbb{R}_{*}=[0,+\infty), for f:f:\mathbb{R}_{*}\xrightarrow{}\mathbb{R}, we define fperspective:×f_{\text{perspective}}:\mathbb{R}_{*}\times\mathbb{R}_{*}\xrightarrow{}\mathbb{R} as:

fperspective(x,y)=yf(xy),dom(fperspective)={(x,y)×|xydom(f)}.f_{\text{perspective}}(x,y)=yf\left(\frac{x}{y}\right),\ \ \ \text{dom}(f_{\text{perspective}})=\left\{(x,y)\in\mathbb{R}_{*}\times\mathbb{R}_{*}|\frac{x}{y}\in\text{dom}(f)\right\}.

Since ff is convex, according to (Boyd, Boyd, and Vandenberghe 2004, Section 3.2.6, Page 89), fperspectivef_{\text{perspective}} is also convex. Denote Df(P,Q)=Df(PQ)D_{f}^{*}(P,Q)=D_{f}(P\|Q), let (𝒳)\mathcal{M}(\mathcal{X}) be the domain of Df()D_{f}(\cdot\|\cdot), for (P1,Q1),(P2,Q2)(𝒳)(P_{1},Q_{1}),(P_{2},Q_{2})\in\mathcal{M}(\mathcal{X}), suppose that P1,Q1,P2,Q2λP_{1},Q_{1},P_{2},Q_{2}\ll\lambda (such λ\lambda exists, e.g., we can set λ=P1+P2+Q1+Q24\lambda=\frac{P_{1}+P_{2}+Q_{1}+Q_{2}}{4}). Then, for any α[0,1]\alpha\in[0,1] we have:

Df(α(P1,Q1)\displaystyle D_{f}^{*}(\alpha(P_{1},Q_{1}) +(1α)(P2,Q2))=𝒳f(αdP1dλ+(1α)dP2dλαdQ1dλ+(1α)dQ2dλ)(αdQ1dλ+(1α)dQ2dλ)dλ\displaystyle+(1-\alpha)(P_{2},Q_{2}))\!=\!\!\int_{\mathcal{X}}f\!\left(\!\frac{\alpha\frac{dP_{1}}{d\lambda}+(1-\alpha)\frac{dP_{2}}{d\lambda}}{\alpha\frac{dQ_{1}}{d\lambda}+(1-\alpha)\frac{dQ_{2}}{d\lambda}}\!\right)\left(\alpha\frac{dQ_{1}}{d\lambda}+(1-\alpha)\frac{dQ_{2}}{d\lambda}\right)d\lambda
=𝒳fperspective(αdP1dλ+(1α)dP2dλ,αdQ1dλ+(1α)dQ2dλ)𝑑λ\displaystyle=\int_{\mathcal{X}}f_{\text{perspective}}\left(\alpha\frac{dP_{1}}{d\lambda}+(1-\alpha)\frac{dP_{2}}{d\lambda},\alpha\frac{dQ_{1}}{d\lambda}+(1-\alpha)\frac{dQ_{2}}{d\lambda}\right)d\lambda
(i)𝒳αfperspective(dP1dλ,dQ1dλ)+(1α)fperspective(dP2dλ,dQ2dλ)dλ\displaystyle\overset{(\text{i})}{\leq}\int_{\mathcal{X}}\alpha f_{\text{perspective}}\left(\frac{dP_{1}}{d\lambda},\frac{dQ_{1}}{d\lambda}\right)+(1-\alpha)f_{\text{perspective}}\left(\frac{dP_{2}}{d\lambda},\frac{dQ_{2}}{d\lambda}\right)d\lambda
=α𝒳dQ1dλf(dP1dλdQ1dλ)𝑑λ+(1α)𝒳dQ2dλf(dP2dλdQ2dλ)𝑑λ\displaystyle=\alpha\int_{\mathcal{X}}\frac{dQ_{1}}{d\lambda}f\left(\frac{\frac{dP_{1}}{d\lambda}}{\frac{dQ_{1}}{d\lambda}}\right)d\lambda+(1-\alpha)\int_{\mathcal{X}}\frac{dQ_{2}}{d\lambda}f\left(\frac{\frac{dP_{2}}{d\lambda}}{\frac{dQ_{2}}{d\lambda}}\right)d\lambda
=αDf(P1,Q1)+(1α)Df(P2,Q2),\displaystyle=\alpha D_{f}^{*}(P_{1},Q_{1})+(1-\alpha)D_{f}^{*}(P_{2},Q_{2}),

where (i)(\text{i}) is from the convexity of fperspectivef_{\text{perspective}}.

To prove the relationship between ff-divergence of joint distributions and marginal distributions, we have:

Df(PXYQXY)\displaystyle D_{f}(P_{XY}\|Q_{XY}) =𝔼QXY[f(dPXYdQXY)]=𝔼QXY[f(dPXPY|XdQXPY|X)]\displaystyle=\underset{Q_{XY}}{\mathbb{E}}\left[f\left(\frac{dP_{XY}}{dQ_{XY}}\right)\right]=\underset{Q_{XY}}{\mathbb{E}}\left[f\left(\frac{dP_{X}P_{Y|X}}{dQ_{X}P_{Y|X}}\right)\right]
=𝔼QXY[f(dPXdQX)]=𝔼QX[f(dPXdQX)]=Df(PXQX).\displaystyle=\underset{Q_{XY}}{\mathbb{E}}\left[f\left(\frac{dP_{X}}{dQ_{X}}\right)\right]=\underset{Q_{X}}{\mathbb{E}}\left[f\left(\frac{dP_{X}}{dQ_{X}}\right)\right]=D_{f}(P_{X}\|Q_{X}).

Proof of Lemma 11.

By Lemma 12, if PXY=PXPY|XP_{XY}=P_{X}P_{Y|X} and QXY=QXPY|XQ_{XY}=Q_{X}P_{Y|X}, then we have:

Df(PXQX)=Df(PXYQXY)=𝔼QXY[f(dPXYdQXY)].D_{f}(P_{X}\|Q_{X})=D_{f}(P_{XY}\|Q_{XY})=\underset{Q_{XY}}{\mathbb{E}}\left[f\left(\frac{dP_{XY}}{dQ_{XY}}\right)\right].

Using the law of total expectation, we get:

𝔼QXY[f(dPXYdQXY)]=𝔼QY[𝔼QX|Y[f(dPXYdQXY)|Y]].\underset{Q_{XY}}{\mathbb{E}}\left[f\left(\frac{dP_{XY}}{dQ_{XY}}\right)\right]=\underset{Q_{Y}}{\mathbb{E}}\left[\underset{Q_{X|Y}}{\mathbb{E}}\left[f\left(\frac{dP_{XY}}{dQ_{XY}}\right)\bigg{|}Y\right]\right].

Since ff is convex, Jensen’s inequality tells us that:

𝔼QY[𝔼QX|Y[f(dPXYdQXY)|Y]]𝔼QY[f(𝔼QX|Y[dPXYdQXY|Y])].\underset{Q_{Y}}{\mathbb{E}}\left[\underset{Q_{X|Y}}{\mathbb{E}}\left[f\left(\frac{dP_{XY}}{dQ_{XY}}\right)\bigg{|}Y\right]\right]\geq\underset{Q_{Y}}{\mathbb{E}}\left[f\left(\underset{Q_{X|Y}}{\mathbb{E}}\left[\frac{dP_{XY}}{dQ_{XY}}\bigg{|}Y\right]\right)\right].

Then we consider the 𝔼QX|Y[dPXYdQXY|Y]\underset{Q_{X|Y}}{\mathbb{E}}\left[\frac{dP_{XY}}{dQ_{XY}}\bigg{|}Y\right] term, we have:

𝔼QX|Y[dPXYdQXY|Y]=𝒳dPXYdQXY𝑑QX|Y=𝒳dPYdPX|YdQYdQX|Y𝑑QX|Y=(i)𝒳dPYdQY𝑑PX|Y=dPYdQY,\underset{Q_{X|Y}}{\mathbb{E}}\left[\frac{dP_{XY}}{dQ_{XY}}\bigg{|}Y\right]=\int_{\mathcal{X}}\frac{dP_{XY}}{dQ_{XY}}dQ_{X|Y}=\int_{\mathcal{X}}\frac{dP_{Y}dP_{X|Y}}{dQ_{Y}dQ_{X|Y}}dQ_{X|Y}\overset{(\text{i})}{=}\int_{\mathcal{X}}\frac{dP_{Y}}{dQ_{Y}}dP_{X|Y}=\frac{dP_{Y}}{dQ_{Y}},

where (i)(\text{i}) holds because dPX|YdQX|YdQX|Y=dPX|Y\frac{dP_{X|Y}}{dQ_{X|Y}}dQ_{X|Y}=dP_{X|Y}, and the last equality uses 𝒳𝑑PX|Y=1\int_{\mathcal{X}}dP_{X|Y}=1. Then we have:

Df(PXQX)=𝔼QXY[f(dPXYdQXY)]𝔼QY[f(dPYdQY)]=Df(PYQY).D_{f}(P_{X}\|Q_{X})=\underset{Q_{XY}}{\mathbb{E}}\left[f\left(\frac{dP_{XY}}{dQ_{XY}}\right)\right]\geq\underset{Q_{Y}}{\mathbb{E}}\left[f\left(\frac{dP_{Y}}{dQ_{Y}}\right)\right]=D_{f}(P_{Y}\|Q_{Y}).
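
As a small numerical sanity check of Lemma 11 (not used anywhere in the proofs), the sketch below evaluates an f-divergence before and after pushing two distributions through a common transition kernel; the finite sample spaces, the chi-square generator f(t)=(t-1)^2, and the randomly drawn distributions are illustrative assumptions.

import numpy as np

def f_chi2(t):
    # chi-square generator f(t) = (t - 1)^2
    return (t - 1.0) ** 2

def f_divergence(p, q, f):
    # D_f(P || Q) = E_Q[f(dP/dQ)] for discrete distributions with q > 0
    return float(np.sum(q * f(p / q)))

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(5))           # P_X
q = rng.dirichlet(np.ones(5))           # Q_X
K = rng.dirichlet(np.ones(3), size=5)   # transition kernel P_{Y|X}; each row sums to 1

p_y, q_y = p @ K, q @ K                 # pushforwards P_Y, Q_Y
print(f_divergence(p, q, f_chi2), ">=", f_divergence(p_y, q_y, f_chi2))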

Proof of Lemma 5.

For any S𝒫S\in\mathcal{P}, by the definition of 𝒫\mathcal{P}, we know that there exists T𝒯f,ρT\in\mathcal{T}_{f,\rho} such that S=s#TS=s\#T.

By the definition of 𝒯f,ρ\mathcal{T}_{f,\rho}, we know that there exists Q𝒞(S1,,Sd)s.t.Df(TQ)ρQ\in\mathcal{CH}(S_{1},\cdots,S_{d})\ \text{s.t.}\ D_{f}(T\|Q)\leq\rho. By Lemma 11, we have: Df(s#Ts#Q)Df(TQ)ρD_{f}(s\#T\|s\#Q)\leq D_{f}(T\|Q)\leq\rho. Moreover, Q𝒞(S1,,Sd)Q\in\mathcal{CH}(S_{1},\cdots,S_{d}) implies that s#Q𝒞(s#S1,,s#Sd)s\#Q\in\mathcal{CH}(s\#S_{1},\cdots,s\#S_{d}), by the definition of 𝒫f,ρ\mathcal{P}_{f,\rho}, we know that S𝒫f,ρS\in\mathcal{P}_{f,\rho}. So 𝒫𝒫f,ρ\mathcal{P}\subseteq\mathcal{P}_{f,\rho}. ∎

A.4 Proof of Theorem 6

Theorem 6.

Let F1,,FdF_{1},\dots,F_{d} be the c.d.f.’s of the distributions s#S1,,s#Sds\#S_{1},\dots,s\#S_{d}. Define the function gf,ρ:[0,1][0,1]g_{f,\rho}:[0,1]\xrightarrow{}[0,1] as:

gf,ρ(β)inf{z[0,1]|βf(zβ)+(1β)f(1z1β)ρ},g_{f,\rho}(\beta)\coloneqq\inf\left\{z\in[0,1]\bigg{|}\beta f\left(\frac{z}{\beta}\right)+(1-\beta)f\left(\frac{1-z}{1-\beta}\right)\leq\rho\right\},

then for inverse of gf,ρg_{f,\rho}:

gf,ρ1(τ)=sup{β[0,1]|gf,ρ(β)τ},g_{f,\rho}^{-1}(\tau)=\sup\{\beta\in[0,1]\big{|}g_{f,\rho}(\beta)\leq\tau\},

the following holds for all α(0,1)\alpha\in(0,1):

𝒬~(α;𝒫f,ρ)=𝒬(gf,ρ1(α);Fmin),\widetilde{\mathcal{Q}}(\alpha;\mathcal{P}_{f,\rho})=\mathcal{Q}(g_{f,\rho}^{-1}(\alpha);F_{\min}),

where Fmin(x)min1idFi(x)F_{\min}(x)\coloneqq\underset{1\leq i\leq d}{\min}F_{i}(x) is a c.d.f.
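
Before turning to the proof, here is a minimal sketch of how g_{f,ρ}(β) can be evaluated numerically. It assumes the chi-square generator f(t)=(t-1)^2 and 0<β<1 (illustrative choices), and uses the fact, established in the proof of Lemma 17 below, that h(z,β)=βf(z/β)+(1-β)f((1-z)/(1-β)) is convex in z with minimum 0 at z=β, so the infimum defining g_{f,ρ}(β) lies in [0,β] and can be located by bisection.

import numpy as np

def h(z, beta, f):
    return beta * f(z / beta) + (1.0 - beta) * f((1.0 - z) / (1.0 - beta))

def g(beta, rho, f, iters=60):
    # g_{f,rho}(beta) = inf{ z in [0,1] : h(z, beta) <= rho }
    if h(0.0, beta, f) <= rho:
        return 0.0                       # even z = 0 is feasible
    lo, hi = 0.0, beta                   # invariant: h(lo) > rho >= 0 = h(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid, beta, f) > rho else (lo, mid)
    return hi

f_chi2 = lambda t: (t - 1.0) ** 2
beta, rho = 0.9, 0.05
# the bisection should agree with the closed form derived for Example 1 later in this appendix
print(g(beta, rho, f_chi2), beta - np.sqrt(rho * beta * (1.0 - beta)))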

Firstly, we prove that if F1,,FdF_{1},\dots,F_{d} are c.d.f.’s, then Fmin(x)=min1idFi(x)F_{\min}(x)=\underset{1\leq i\leq d}{\min}F_{i}(x) is a c.d.f.

Lemma 13.

Suppose that F1,,FdF_{1},\dots,F_{d} are c.d.f.’s, let Fmin(x)=min1idFi(x)F_{\min}(x)=\underset{1\leq i\leq d}{\min}F_{i}(x), then FminF_{\min} is a c.d.f.

Proof of Lemma 13.

We need to prove that FminF_{\min} satisfies the following properties:

  • P1

    FminF_{\min} is non-decreasing.

  • P2

    0Fmin(x)10\leq F_{\min}(x)\leq 1, limxFmin(x)=0\underset{x\to-\infty}{\lim}F_{\min}(x)=0 and limx+Fmin(x)=1\underset{x\to+\infty}{\lim}F_{\min}(x)=1.

  • P3

    FminF_{\min} is right continuous.

Proof of P1: Since F1,,FdF_{1},\dots,F_{d} are all c.d.f.’s, xy\forall x\leq y, we have Fi(x)Fi(y)F_{i}(x)\leq F_{i}(y) for all i[d]i\in[d].

Suppose j=argmin1idFi(y)j=\underset{1\leq i\leq d}{\arg\min}F_{i}(y), then Fj(y)=min1idFi(y)F_{j}(y)=\underset{1\leq i\leq d}{\min}F_{i}(y). Since Fj(x)Fj(y)F_{j}(x)\leq F_{j}(y), we have that:

Fmin(x)=min1idFi(x)Fj(x)Fj(y)=min1idFi(y)=Fmin(y),F_{\min}(x)=\underset{1\leq i\leq d}{\min}F_{i}(x)\leq F_{j}(x)\leq F_{j}(y)=\underset{1\leq i\leq d}{\min}F_{i}(y)=F_{\min}(y),

which means that the function FminF_{\min} is non-decreasing.

Proof of P2: Since F1,,FdF_{1},\dots,F_{d} are all c.d.f.’s, then 0Fi(x)10\leq F_{i}(x)\leq 1, limxFi(x)=0\underset{x\to-\infty}{\lim}F_{i}(x)=0 for all i[d]i\in[d]. It is obvious that 0Fmin(x)10\leq F_{\min}(x)\leq 1.

Since limxFi(x)=0\underset{x\to-\infty}{\lim}F_{i}(x)=0, ϵ>0,i[d]\forall\epsilon>0,\forall i\in[d], there exists δi\delta_{i} such that xδi\forall x\leq\delta_{i}, we have |Fi(x)|<ϵ|F_{i}(x)|<\epsilon, i.e., 0Fi(x)<ϵ0\leq F_{i}(x)<\epsilon. Let δ=min1idδi\delta=\underset{1\leq i\leq d}{\min}\delta_{i}, then, if xδx\leq\delta, 0Fi(x)<ϵ0\leq F_{i}(x)<\epsilon holds for all i[d]i\in[d], so Fmin(x)=min1idFi(x)<ϵF_{\min}(x)=\underset{1\leq i\leq d}{\min}F_{i}(x)<\epsilon, which means that limxFmin(x)=0\underset{x\to-\infty}{\lim}F_{\min}(x)=0. We can prove that limx+Fmin(x)=1\underset{x\to+\infty}{\lim}F_{\min}(x)=1 similarly.

Proof of P3: Since F1,,FdF_{1},\dots,F_{d} are all c.d.f.’s, they are all right continuous. Fix any x0x_{0}, then by the definition of right continuity, we know that: ϵ>0\forall\epsilon>0, i[d]\forall i\in[d], there exists δi>0\delta_{i}>0 such that, for any xx satisfying 0<xx0<δi0<x-x_{0}<\delta_{i}, we have:

|Fi(x)Fi(x0)|<ϵ,i.e.,Fi(x0)ϵ<Fi(x)<Fi(x0)+ϵ.|F_{i}(x)-F_{i}(x_{0})|<\epsilon,\ \text{i.e.},F_{i}(x_{0})-\epsilon<F_{i}(x)<F_{i}(x_{0})+\epsilon.

Let δ=min1idδi\delta=\underset{1\leq i\leq d}{\min}\delta_{i}, set j=argmin1idFi(x)j=\underset{1\leq i\leq d}{\arg\min}F_{i}(x), k=argmin1idFi(x0)k=\underset{1\leq i\leq d}{\arg\min}F_{i}(x_{0}), then:

Fmin(x)=Fj(x)=min1idFi(x),Fmin(x0)=Fk(x0)=min1idFi(x0).F_{\min}(x)=F_{j}(x)=\underset{1\leq i\leq d}{\min}F_{i}(x),\ F_{\min}(x_{0})=F_{k}(x_{0})=\underset{1\leq i\leq d}{\min}F_{i}(x_{0}).

So, for all 0<xx0<δ0<x-x_{0}<\delta, we have:

Fmin(x0)ϵ=min1idFi(x0)ϵFj(x0)ϵ<Fj(x)=Fmin(x).F_{\min}(x_{0})-\epsilon=\underset{1\leq i\leq d}{\min}F_{i}(x_{0})-\epsilon\leq F_{j}(x_{0})-\epsilon<F_{j}(x)=F_{\min}(x).

Similarly, we have:

Fmin(x)=min1idFi(x)Fk(x)<Fk(x0)+ϵ=Fmin(x0)+ϵ.F_{\min}(x)=\underset{1\leq i\leq d}{\min}F_{i}(x)\leq F_{k}(x)<F_{k}(x_{0})+\epsilon=F_{\min}(x_{0})+\epsilon.

So we have that |Fmin(x)Fmin(x0)|<ϵ|F_{\min}(x)-F_{\min}(x_{0})|<\epsilon, which shows that FminF_{\min} is right continuous. ∎

We now collect some properties of gf,ρg_{f,\rho} that will be useful in our proof.

Lemma 14 (Lemma A.1, page 33 of (Cauchois et al. 2020)).

Let ff be a function that satisfies the conditions in Definition 4, then the function gf,ρg_{f,\rho} defined in Theorem 6 satisfies the following properties:

  1. (1)

    The function (β,ρ)gf,ρ(β)(\beta,\rho)\xrightarrow{}g_{f,\rho}(\beta) is a convex function.

  2. (2)

    The function (β,ρ)gf,ρ(β)(\beta,\rho)\xrightarrow{}g_{f,\rho}(\beta) is continuous for β[0,1]\beta\in[0,1] and ρ(0,)\rho\in(0,\infty).

  3. (3)

    gf,ρg_{f,\rho} is non-increasing in ρ\rho and non-decreasing in β\beta. Moreover, for all ρ>0\rho>0, there exists β0(ρ)sup{β(0,1)|gf,ρ(β)=0}\beta_{0}(\rho)\coloneqq\sup\{\beta\in(0,1)|g_{f,\rho}(\beta)=0\}, gf,ρg_{f,\rho} is strictly increasing for β>β0(ρ)\beta>\beta_{0}(\rho).

  4. (4)

    For β[0,1]\beta\in[0,1] and ρ>0\rho>0, gf,ρ(β)βg_{f,\rho}(\beta)\leq\beta. For ρ>0\rho>0, equality holds for β=0\beta=0, strict inequality holds for β(0,1)\beta\in(0,1) and ρ>0\rho>0, and gf,ρ(1)=1g_{f,\rho}(1)=1 if and only if f()=f^{\prime}(\infty)=\infty.

  5. (5)

    Let gf,ρ1(t)=sup{β|gf,ρ(β)t}g_{f,\rho}^{-1}(t)=\sup\{\beta|g_{f,\rho}(\beta)\leq t\} as in the statement of Theorem 6. Then for τ(0,1)\tau\in(0,1), gf,ρ(β)τg_{f,\rho}(\beta)\geq\tau if and only if gf,ρ1(τ)βg_{f,\rho}^{-1}(\tau)\leq\beta.

We then find the corresponding c.d.f. of the worst-case quantile function 𝒬~\widetilde{\mathcal{Q}} by the following Lemma.

Lemma 15.

The c.d.f. that corresponds to the worst-case quantile function 𝒬~\widetilde{\mathcal{Q}} (we call it worst-case c.d.f.) has the formulation that:

F~(s;𝒫f,ρ)=infP𝒫f,ρP(Ss)\widetilde{F}(s;\mathcal{P}_{f,\rho})=\underset{P\in\mathcal{P}_{f,\rho}}{\inf}P(S\leq s) (A.1)
Proof of Lemma 15.

By the definition of the quantile function, suppose the corresponding c.d.f. of 𝒬~(;𝒫f,ρ)\widetilde{\mathcal{Q}}(\cdot;\mathcal{P}_{f,\rho}) is F~\widetilde{F}, then we have the following:

𝒬~(β;𝒫f,ρ)=inf{s|F~(s;𝒫f,ρ)β}.\widetilde{\mathcal{Q}}(\beta;\mathcal{P}_{f,\rho})=\inf\{s\in\mathbb{R}|\widetilde{F}(s;\mathcal{P}_{f,\rho})\geq\beta\}. (A.2)

According to the definition of 𝒬~(β;𝒫f,ρ)\widetilde{\mathcal{Q}}(\beta;\mathcal{P}_{f,\rho}), for any β[0,1]\beta\in[0,1], we have:

𝒬~(β;𝒫f,ρ)\displaystyle\widetilde{\mathcal{Q}}(\beta;\mathcal{P}_{f,\rho}) =supP𝒫f,ρ𝒬(β;P)=supP𝒫f,ρinf{t|P(St)β}\displaystyle=\underset{P\in\mathcal{P}_{f,\rho}}{\sup}\mathcal{Q}(\beta;P)=\underset{P\in\mathcal{P}_{f,\rho}}{\sup}\inf\{t\in\mathbb{R}|P(S\leq t)\geq\beta\} (A.3)
=(i)supP𝒫f,ρ{t|P(S>t)1β}=sup{t|supP𝒫f,ρP(S>t)1β}\displaystyle\overset{(\text{i})}{=}\underset{P\in\mathcal{P}_{f,\rho}}{\sup}\{t\in\mathbb{R}|P(S>t)\geq 1-\beta\}=\sup\left\{t\in\mathbb{R}\Bigg{|}\underset{P\in\mathcal{P}_{f,\rho}}{\sup}P(S>t)\geq 1-\beta\right\}
=inf{t|supP𝒫f,ρP(S>t)1β}=inf{t|1supP𝒫f,ρP(S>t)β}\displaystyle=\inf\left\{t\in\mathbb{R}\Bigg{|}\underset{P\in\mathcal{P}_{f,\rho}}{\sup}P(S>t)\leq 1-\beta\right\}=\inf\left\{t\in\mathbb{R}\Bigg{|}1-\underset{P\in\mathcal{P}_{f,\rho}}{\sup}P(S>t)\geq\beta\right\}
=inf{t|infP𝒫f,ρ(1P(S>t))β}=inf{t|infP𝒫f,ρP(St)β}\displaystyle=\inf\left\{t\in\mathbb{R}\Bigg{|}\underset{P\in\mathcal{P}_{f,\rho}}{\inf}\left(1-P(S>t)\right)\geq\beta\right\}=\inf\left\{t\in\mathbb{R}\Bigg{|}\underset{P\in\mathcal{P}_{f,\rho}}{\inf}P(S\leq t)\geq\beta\right\}

where (i)(\text{i}) follows from the fact that for all tt\in\mathbb{R}, t<𝒬(β;P)t<\mathcal{Q}(\beta;P) if and only if P(St)<βP(S\leq t)<\beta, i.e., if and only if P(S>t)1βP(S>t)\geq 1-\beta. Comparing Equation A.2 with Equation A.3, we conclude that Equation A.1 holds. ∎

Since the distribution set 𝒫f,ρ\mathcal{P}_{f,\rho} involves dd distributions s#S1,s#S2,,s#Sds\#S_{1},s\#S_{2},\dots,s\#S_{d}, with a slight abuse of notation, we extend gf,ρg_{f,\rho} to a function gf,ρ:[0,1]+[0,1]g_{f,\rho}:[0,1]^{+}\xrightarrow{}[0,1], where [0,1]+[0,1]^{+} means that the number of arguments is variable and can be 1,2,1,2,\dots.

Definition 16.

We define the adaptive version of gf,ρ:[0,1]+[0,1]g_{f,\rho}:[0,1]^{+}\xrightarrow{}[0,1] as:

gf,ρ(β1,,βd)inf{z[0,1]|infλ1,,λd0i=1dλi=1h(z,i=1dλiβi)ρ},g_{f,\rho}(\beta_{1},\cdots,\beta_{d})\coloneqq\inf\left\{z\in[0,1]\Bigg{|}\underset{\sum_{i=1}^{d}\lambda_{i}=1}{\underset{\lambda_{1},\dots,\lambda_{d}\geq 0}{\inf}}h\left(z,\sum_{i=1}^{d}\lambda_{i}\beta_{i}\right)\leq\rho\right\},

where

h(z,β)=βf(zβ)+(1β)f(1z1β).h\left(z,\beta\right)=\beta f\left(\frac{z}{\beta}\right)+(1-\beta)f\left(\frac{1-z}{1-\beta}\right).

It is obvious that when there is a single argument, the function gf,ρg_{f,\rho} in Definition 16 reduces to the gf,ρg_{f,\rho} in Theorem 6. The following Lemma reduces gf,ρ(β1,,βd)g_{f,\rho}(\beta_{1},\dots,\beta_{d}) to gf,ρ(min1idβi)g_{f,\rho}\left(\underset{1\leq i\leq d}{\min}\beta_{i}\right).

Lemma 17.

For any d+d\in\mathbb{N}_{+} and β1,,βd[0,1]\beta_{1},\dots,\beta_{d}\in[0,1], we have that:

gf,ρ(β1,,βd)=gf,ρ(min1idβi).g_{f,\rho}(\beta_{1},\dots,\beta_{d})=g_{f,\rho}\left(\underset{1\leq i\leq d}{\min}\beta_{i}\right).
Proof of Lemma 17.

For all 0a1a210\leq a_{1}\leq a_{2}\leq 1, we define:

gf,ρ,(a1,a2)=inf{z[0,1]|infβ[a1,a2]βf(zβ)+(1β)f(1z1β)ρ}.g_{f,\rho,*}(a_{1},a_{2})=\inf\left\{z\in[0,1]\bigg{|}\underset{\beta\in[a_{1},a_{2}]}{\inf}\beta f\left(\frac{z}{\beta}\right)+(1-\beta)f\left(\frac{1-z}{1-\beta}\right)\leq\rho\right\}.

For fixed dd and β1,,βd[0,1]\beta_{1},\dots,\beta_{d}\in[0,1], it is obvious that:

{idλiβi|λ1,,λd0,idλi=1}=[min1idβi,max1idβi],\left\{\sum_{i}^{d}\lambda_{i}\beta_{i}\bigg{|}\lambda_{1},\dots,\lambda_{d}\geq 0,\sum_{i}^{d}\lambda_{i}=1\right\}=\left[\underset{1\leq i\leq d}{\min}\beta_{i},\underset{1\leq i\leq d}{\max}\beta_{i}\right],

which implies that gf,ρ(β1,,βd)=gf,ρ,(min1idβi,max1idβi)g_{f,\rho}(\beta_{1},\dots,\beta_{d})=g_{f,\rho,*}\left(\underset{1\leq i\leq d}{\min}\beta_{i},\underset{1\leq i\leq d}{\max}\beta_{i}\right). Now, it is sufficient to prove that for all 0a1a210\leq a_{1}\leq a_{2}\leq 1, we have:

gf,ρ,(a1,a2)=gf,ρ(a1).g_{f,\rho,*}(a_{1},a_{2})=g_{f,\rho}(a_{1}).

Recall that h(z,β)=βf(zβ)+(1β)f(1z1β)h\left(z,\beta\right)=\beta f\left(\frac{z}{\beta}\right)+(1-\beta)f\left(\frac{1-z}{1-\beta}\right), we now analyze the properties of h(z,β)h\left(z,\beta\right).

Let fperspective(z,β)=βf(zβ)f_{\text{perspective}}(z,\beta)=\beta f\left(\frac{z}{\beta}\right); according to (Boyd, Boyd, and Vandenberghe 2004, Section 3.2.6, Page 89), fperspective(z,β)f_{\text{perspective}}(z,\beta) is also convex. Let A=(1001)A=\begin{pmatrix}-1&0\\ 0&-1\end{pmatrix} and b=(1,1)Tb=(1,1)^{T}; then we know that (1z,1β)T=A(z,β)T+b(1-z,1-\beta)^{T}=A(z,\beta)^{T}+b. According to (Boyd, Boyd, and Vandenberghe 2004, Section 3.2.2, Page 79), convexity is preserved when the input vector is composed with an affine mapping. So the function (z,β)fperspective(1z,1β)(z,\beta)\xrightarrow{}f_{\text{perspective}}(1-z,1-\beta) is also convex. Then we know that h(z,β)=fperspective(z,β)+fperspective(1z,1β)h(z,\beta)=f_{\text{perspective}}(z,\beta)+f_{\text{perspective}}(1-z,1-\beta) is convex.

Recall that in Remark 1, we show that we can assume, without loss of generality, that f(1)=0f^{\prime}(1)=0 and f0f\geq 0. Now we take the partial derivative of hh with respect to β\beta and get:

hβ|(z,β)\displaystyle\frac{\partial h}{\partial\beta}\bigg{|}_{(z,\beta)} =f(zβ)+βf(zβ)(zβ2)f(1z1β)+(1β)f(1z1β)(1z(1β)2)\displaystyle=f\left(\frac{z}{\beta}\right)+\beta f^{\prime}\left(\frac{z}{\beta}\right)\left(-\frac{z}{\beta^{2}}\right)-f\left(\frac{1-z}{1-\beta}\right)+(1-\beta)f^{\prime}\left(\frac{1-z}{1-\beta}\right)\left(\frac{1-z}{(1-\beta)^{2}}\right)
=f(zβ)zβf(zβ)f(1z1β)+1z1βf(1z1β).\displaystyle=f\left(\frac{z}{\beta}\right)-\frac{z}{\beta}f^{\prime}\left(\frac{z}{\beta}\right)-f\left(\frac{1-z}{1-\beta}\right)+\frac{1-z}{1-\beta}f^{\prime}\left(\frac{1-z}{1-\beta}\right).

Taking the partial derivative of hh with respect to zz shows:

hz|(z,β)=βf(zβ)1β+(1β)f(1z1β)11β=f(zβ)f(1z1β)\frac{\partial h}{\partial z}\bigg{|}_{(z,\beta)}=\beta f^{\prime}\left(\frac{z}{\beta}\right)\frac{1}{\beta}+(1-\beta)f^{\prime}\left(\frac{1-z}{1-\beta}\right)\frac{-1}{1-\beta}=f^{\prime}\left(\frac{z}{\beta}\right)-f^{\prime}\left(\frac{1-z}{1-\beta}\right)

Fix z[0,1]z\in[0,1], solving the equation hβ|(z,β)=0\frac{\partial h}{\partial\beta}\big{|}_{(z,\beta)}=0 gives β=z\beta=z. Similarly, fix β[0,1]\beta\in[0,1], solving the equation hz|(z,β)=0\frac{\partial h}{\partial z}\big{|}_{(z,\beta)}=0 gives z=βz=\beta. So we know that:

infβ[a1,a2]h(z,β)={h(z,z)=0z[a1,a2]h(z,a1)z<a1h(z,a2)z>a2.\underset{\beta\in[a_{1},a_{2}]}{\inf}h(z,\beta)=\left\{\begin{array}[]{lcl}h(z,z)=0&&{z\in[a_{1},a_{2}]}\\ h(z,a_{1})&&{z<a_{1}}\\ h(z,a_{2})&&{z>a_{2}}\end{array}\right..

Then we discuss three situations:

  1. (1)

    If gf,ρ,(a1,a2)g_{f,\rho,*}(a_{1},a_{2}) is attained when z<a1z<a_{1}, then gf,ρ,(a1,a2)=inf{t[0,a1)|h(t,a1)ρ}g_{f,\rho,*}(a_{1},a_{2})=\inf\{t\in[0,a_{1})|h(t,a_{1})\leq\rho\}.

  2. (2)

    If gf,ρ,(a1,a2)g_{f,\rho,*}(a_{1},a_{2}) is attained when z[a1,a2]z\in[a_{1},a_{2}], then gf,ρ,(a1,a2)=inf{t[a1,a2]|h(t,t)ρ}=a1g_{f,\rho,*}(a_{1},a_{2})=\inf\{t\in[a_{1},a_{2}]|h(t,t)\leq\rho\}=a_{1}.

  3. (3)

    If gf,ρ,(a1,a2)g_{f,\rho,*}(a_{1},a_{2}) is attained when z>a2z>a_{2}, then gf,ρ,(a1,a2)=inf{t(a2,1)|h(t,a2)ρ}a1g_{f,\rho,*}(a_{1},a_{2})=\inf\{t\in(a_{2},1)|h(t,a_{2})\leq\rho\}\geq a_{1}.

In conclusion, we know that gf,ρ,(a1,a2)a1g_{f,\rho,*}(a_{1},a_{2})\leq a_{1}, so we have:

gf,ρ,(a1,a2)=inf{z[0,a1]|h(z,a1)ρ}=inf{z[0,1]|h(z,a1)ρ}=gf,ρ(a1),g_{f,\rho,*}(a_{1},a_{2})=\inf\{z\in[0,a_{1}]|h(z,a_{1})\leq\rho\}=\inf\{z\in[0,1]|h(z,a_{1})\leq\rho\}=g_{f,\rho}(a_{1}),

which finishes the proof. ∎
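
As a quick numerical check of Lemma 17 (again with the chi-square generator as an illustrative assumption), evaluating the adaptive definition by brute force over a grid of mixture weights agrees with g_{f,ρ} evaluated at the smallest argument:

import numpy as np

f = lambda t: (t - 1.0) ** 2                                   # chi-square generator (assumed)
h = lambda z, b: b * f(z / b) + (1.0 - b) * f((1.0 - z) / (1.0 - b))

zs = np.linspace(1e-6, 1.0 - 1e-6, 2001)

def g_adaptive(betas, rho):
    # convex combinations of the betas sweep the interval [min beta, max beta]
    bs = np.linspace(min(betas), max(betas), 101)
    return min(z for z in zs if min(h(z, b) for b in bs) <= rho)

def g_single(beta, rho):
    return min(z for z in zs if h(z, beta) <= rho)

betas, rho = [0.6, 0.8, 0.95], 0.1
print(g_adaptive(betas, rho), g_single(min(betas), rho))       # agree up to the grid resolution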

Lemma 18.

Let F1,,FdF_{1},\dots,F_{d} be the c.d.f.’s of the distributions s#S1,,s#Sds\#S_{1},\dots,s\#S_{d} and define Fmin(x)=min1idFi(x)F_{\min}(x)=\underset{1\leq i\leq d}{\min}F_{i}(x). Then we have:

F~(t;𝒫f,ρ)=gf,ρ(F1(t),,Fd(t))=gf,ρ(Fmin(t))\widetilde{F}(t;\mathcal{P}_{f,\rho})=g_{f,\rho}\left(F_{1}(t),\cdots,F_{d}(t)\right)=g_{f,\rho}\left(F_{\min}(t)\right)
Proof of Lemma 18.

The second equality is a direct result of Lemma 17. We prove the first equality in four cases.

  1. (1)

    When there exists j[d]j\in[d] such that Fj(t)=0F_{j}(t)=0. Recall that F~(t;𝒫f,ρ)=infP𝒫f,ρP(St)\widetilde{F}(t;\mathcal{P}_{f,\rho})=\underset{P\in\mathcal{P}_{f,\rho}}{\inf}P(S\leq t), so we know that 0F~(t;𝒫f,ρ)min1idFi(t)=00\leq\widetilde{F}(t;\mathcal{P}_{f,\rho})\leq\underset{1\leq i\leq d}{\min}F_{i}(t)=0, i.e., F~(t;𝒫f,ρ)=0\widetilde{F}(t;\mathcal{P}_{f,\rho})=0.

    Setting z=0z=0 and βi=Fi(t)\beta_{i}=F_{i}(t) for each i[d]i\in[d], since Fj(t)=0F_{j}(t)=0, we can take λj=1\lambda_{j}=1 and λi=0\lambda_{i}=0 for all iji\neq j; then we get that:

    infλ1,,λd0i=1dλi=1(i=1dλiβi)f(zi=1dλiβi)+(1i=1dλiβi)f(1z1i=1dλiβi)\displaystyle\underset{\sum_{i=1}^{d}\lambda_{i}=1}{\underset{\lambda_{1},\dots,\lambda_{d}\geq 0}{\inf}}\left(\sum_{i=1}^{d}\lambda_{i}\beta_{i}\right)f\left(\frac{z}{\sum_{i=1}^{d}\lambda_{i}\beta_{i}}\right)+\left(1-\sum_{i=1}^{d}\lambda_{i}\beta_{i}\right)f\left(\frac{1-z}{1-\sum_{i=1}^{d}\lambda_{i}\beta_{i}}\right)
    βjf(zβj)+(1βj)f(1z1βj)=0f(00)+f(1)=0,\displaystyle\leq\beta_{j}f\left(\frac{z}{\beta_{j}}\right)+(1-\beta_{j})f\left(\frac{1-z}{1-\beta_{j}}\right)=0f\left(\frac{0}{0}\right)+f(1)=0,

    which means that gf,ρ(F1(t),,Fd(t))=0g_{f,\rho}\left(F_{1}(t),\cdots,F_{d}(t)\right)=0. So, in this case, we have: F~(t;𝒫f,ρ)=gf,ρ(F1(t),,Fd(t))\widetilde{F}(t;\mathcal{P}_{f,\rho})=g_{f,\rho}\left(F_{1}(t),\cdots,F_{d}(t)\right). Here, we can define 0f(00)0f\left(\frac{0}{0}\right) by taking βj=z\beta_{j}=z and let βj,z0\beta_{j},z\to 0, which means that 0f(00)=00f\left(\frac{0}{0}\right)=0 here.

  2. (2)

    When 0<Fi(t)<10<F_{i}(t)<1 for all i[d]i\in[d]. In this proof, we denote 𝒞(s#S1,,s#Sd)\mathcal{CH}(s\#S_{1},\cdots,s\#S_{d}) by 𝒞s\mathcal{CH}_{s} for simplicity. Note that the set

    𝒫f,ρ{P|P0𝒞s s.t. Df(PP0)ρ,dPdP0 is constant on {St} and {S>t}}\mathcal{P}_{f,\rho}^{*}\coloneqq\left\{P\bigg{|}\exists P_{0}\in\mathcal{CH}_{s}\text{ s.t. }\ D_{f}(P\|P_{0})\leq\rho,\frac{dP}{dP_{0}}\text{ is constant on }\{S\leq t\}\text{ and }\{S>t\}\right\}

    is a subset of 𝒫f,ρ\mathcal{P}_{f,\rho}, i.e., 𝒫f,ρ𝒫f,ρ\mathcal{P}_{f,\rho}^{*}\subseteq\mathcal{P}_{f,\rho}. Now we consider what is meant by Df(PP0)ρ,dPdP0 is constant on {St} and {S>t}D_{f}(P\|P_{0})\leq\rho,\frac{dP}{dP_{0}}\text{ is constant on }\{S\leq t\}\text{ and }\{S>t\}. Suppose that dPdP0=p1\frac{dP}{dP_{0}}=p_{1} on {St}\{S\leq t\} and dPdP0=p2\frac{dP}{dP_{0}}=p_{2} on {S>t}\{S>t\}, on the one hand:

    {St}dPdP0𝑑P0={St}𝑑P=P(St);\int_{\{S\leq t\}}\frac{dP}{dP_{0}}dP_{0}=\int_{\{S\leq t\}}dP=P(S\leq t);

    on the other hand:

    {St}dPdP0𝑑P0={St}p1𝑑P0=p1P0(St).\int_{\{S\leq t\}}\frac{dP}{dP_{0}}dP_{0}=\int_{\{S\leq t\}}p_{1}dP_{0}=p_{1}\cdot P_{0}(S\leq t).

    The above two equations imply that P(St)=p1P0(St)P(S\leq t)=p_{1}\cdot P_{0}(S\leq t), i.e., p1=P(St)P0(St)p_{1}=\frac{P(S\leq t)}{P_{0}(S\leq t)}. Similarly we can prove that p2=P(S>t)P0(S>t)p_{2}=\frac{P(S>t)}{P_{0}(S>t)}. Then we can formulate Df(PP0)D_{f}(P\|P_{0}) as:

    Df(PP0)\displaystyle D_{f}(P\|P_{0}) =f(dPdP0)𝑑P0={St}f(P(St)P0(St))𝑑P0+{S>t}f(P(S>t)P0(S>t))𝑑P0\displaystyle=\int f\left(\frac{dP}{dP_{0}}\right)dP_{0}=\int_{\{S\leq t\}}f\left(\frac{P(S\leq t)}{P_{0}(S\leq t)}\right)dP_{0}+\int_{\{S>t\}}f\left(\frac{P(S>t)}{P_{0}(S>t)}\right)dP_{0}
    =P0(St)f(P(St)P0(St))+P0(S>t)f(P(S>t)P0(S>t)).\displaystyle=P_{0}(S\leq t)f\left(\frac{P(S\leq t)}{P_{0}(S\leq t)}\right)+P_{0}(S>t)f\left(\frac{P(S>t)}{P_{0}(S>t)}\right).

    Let z=P(St)z=P(S\leq t) and β=P0(St)\beta=P_{0}(S\leq t), then:

    Df(PP0)ρβf(zβ)+(1β)f(1z1β)ρh(z,β)ρ,D_{f}(P\|P_{0})\leq\rho\Longleftrightarrow\beta f\left(\frac{z}{\beta}\right)+(1-\beta)f\left(\frac{1-z}{1-\beta}\right)\leq\rho\Longleftrightarrow h(z,\beta)\leq\rho,

    which matches the expression in the definition of gf,ρg_{f,\rho}. Let Δd1={𝝀=(λ1,,λd)|λi0,i[d];i=1dλi=1}\Delta^{d-1}=\{\boldsymbol{\lambda}=(\lambda_{1},\cdots,\lambda_{d})|\lambda_{i}\geq 0,\forall i\in[d];\sum_{i=1}^{d}\lambda_{i}=1\} be the (d1)(d-1)-dimensional simplex; we can analogously define:

    𝒫f,ρ={P|𝝀Δd1 s.t. Df(Pi=1dλiSi)ρ,dPdi=1dλiSi is constant on {St} and {S>t}}\mathcal{P}_{f,\rho}^{*}\!=\!\!\left\{\!P\bigg{|}\exists\boldsymbol{\lambda}\!\in\!\Delta^{d-1}\text{ s.t. }D_{f}\!\left(\!\!P\bigg{\|}\!\sum_{i=1}^{d}\!\lambda_{i}S_{i}\!\!\right)\!\!\leq\!\rho,\frac{dP}{d\sum_{i=1}^{d}\!\lambda_{i}S_{i}}\text{ is constant on }\{S\!\!\leq\!\!t\}\text{ and }\{S\!\!>\!\!t\}\!\!\right\}

    and

    F~(t;𝒫f,ρ)\displaystyle\widetilde{F}(t;\mathcal{P}_{f,\rho}^{*}) =inf{P(St)|P𝒫f,ρ}\displaystyle=\inf\left\{P(S\leq t)|P\in\mathcal{P}_{f,\rho}^{*}\right\} (A.4)
    =inf{P(St)|𝝀Δd1 s.t. Df(Pi=1dλiSi)ρ,\displaystyle=\inf\left\{P(S\leq t)\bigg{|}\exists\boldsymbol{\lambda}\in\Delta^{d-1}\text{ s.t. }D_{f}\left(P\bigg{\|}\sum_{i=1}^{d}\lambda_{i}S_{i}\right)\leq\rho,\right.
    dPdi=1dλiSi is constant on {St} and {S>t}}\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \left.\frac{dP}{d\sum_{i=1}^{d}\lambda_{i}S_{i}}\text{ is constant on }\{S\leq t\}\text{ and }\{S>t\}\right\}
    =inf{P(St)|𝝀Δd1 s.t. dPdi=1dλiSi is constant on {St} and {S>t},\displaystyle=\inf\left\{\!P(S\leq t)\bigg{|}\exists\boldsymbol{\lambda}\in\Delta^{d-1}\text{ s.t. }\frac{dP}{d\sum_{i=1}^{d}\lambda_{i}S_{i}}\text{ is constant on }\{S\!\leq\!t\}\text{ and }\{S\!>\!t\},\right.
    (i=1dλiFi(t))f(P(St)i=1dλiFi(t))+(1i=1dλiFi(t))f(1P(St)1i=1dλiFi(t))ρ}\displaystyle\left.\left(\sum_{i=1}^{d}\lambda_{i}F_{i}(t)\!\!\right)f\left(\!\frac{P(S\leq t)}{\sum_{i=1}^{d}\lambda_{i}F_{i}(t)}\!\right)+\left(\!1\!-\!\sum_{i=1}^{d}\lambda_{i}F_{i}(t)\!\!\right)f\left(\!\frac{1-P(S\leq t)}{1\!-\!\sum_{i=1}^{d}\lambda_{i}F_{i}(t)}\!\!\right)\!\leq\!\rho\!\right\}
    =inf{P(St)|inf𝝀Δd1h(P(St),i=1dλiFi(t))ρ}\displaystyle=\inf\left\{P(S\leq t)\bigg{|}\underset{\boldsymbol{\lambda}\in\Delta^{d-1}}{\inf}h\left(P(S\leq t),\sum_{i=1}^{d}\lambda_{i}F_{i}(t)\right)\leq\rho\right\}
    =gf,ρ(F1(t),,Fd(t)).\displaystyle=g_{f,\rho}\left(F_{1}(t),\dots,F_{d}(t)\right).

    So we have that F~(t;𝒫f,ρ)F~(t;𝒫f,ρ)=gf,ρ(F1(t),,Fd(t))\widetilde{F}(t;\mathcal{P}_{f,\rho})\leq\widetilde{F}(t;\mathcal{P}_{f,\rho}^{*})=g_{f,\rho}\left(F_{1}(t),\dots,F_{d}(t)\right). For the inverse direction, fix tt\in\mathbb{R}, for any P𝒫f,ρP\in\mathcal{P}_{f,\rho}, according to the definition of 𝒫f,ρ\mathcal{P}_{f,\rho}, there exists P0𝒞sP_{0}\in\mathcal{CH}_{s} such that Df(PP0)ρD_{f}(P\|P_{0})\leq\rho. Consider the following Markov kernel KK:

    K(ds|s)={dP0(s)𝕀{st}P0(St)stdP0(s)𝕀{s>t}P0(S>t)s>t.K(ds^{\prime}|s)=\left\{\begin{array}[]{lcl}\frac{dP_{0}(s^{\prime})\mathbb{I}\{s^{\prime}\leq t\}}{P_{0}(S\leq t)}&&{s\leq t}\\ \frac{dP_{0}(s^{\prime})\mathbb{I}\{s^{\prime}>t\}}{P_{0}(S>t)}&&{s>t}\end{array}\right..

    Let P1=KPP_{1}=K\cdot P, then we have:

    dP1(s)\displaystyle dP_{1}(s) =P1(ds)=(i)+K(ds|y)𝑑P(y)\displaystyle=P_{1}(ds)\overset{(\text{i})}{=}\int_{-\infty}^{+\infty}K(ds|y)dP(y)
    =tdP0(s)𝕀{st}P0(St)𝑑P(y)+t+dP0(s)𝕀{s>t}P0(S>t)𝑑P(y)\displaystyle=\int_{-\infty}^{t}\frac{dP_{0}(s)\mathbb{I}\{s\leq t\}}{P_{0}(S\leq t)}dP(y)+\int_{t}^{+\infty}\frac{dP_{0}(s)\mathbb{I}\{s>t\}}{P_{0}(S>t)}dP(y)
    =dP0(s)𝕀{st}P0(St)P(St)+dP0(s)𝕀{s>t}P0(S>t)P(S>t)\displaystyle=\frac{dP_{0}(s)\mathbb{I}\{s\leq t\}}{P_{0}(S\leq t)}P(S\leq t)+\frac{dP_{0}(s)\mathbb{I}\{s>t\}}{P_{0}(S>t)}P(S>t)
    =(P(St)P0(St)𝕀{st}+P(S>t)P0(S>t)𝕀{s>t})dP0(s),\displaystyle=\left(\frac{P(S\leq t)}{P_{0}(S\leq t)}\mathbb{I}\{s\leq t\}+\frac{P(S>t)}{P_{0}(S>t)}\mathbb{I}\{s>t\}\right)dP_{0}(s),

    where (i)(\text{i}) comes from the definition of transition kernels. So dP1dP0=P(St)P0(St)𝕀{st}+P(S>t)P0(S>t)𝕀{s>t}\frac{dP_{1}}{dP_{0}}=\frac{P(S\leq t)}{P_{0}(S\leq t)}\mathbb{I}\{s\leq t\}+\frac{P(S>t)}{P_{0}(S>t)}\mathbb{I}\{s>t\}. Similarly, let P2=KP0P_{2}=K\cdot P_{0}, we have: dP2dP0=P0(St)P0(St)𝕀{st}+P0(S>t)P0(S>t)𝕀{s>t}=1\frac{dP_{2}}{dP_{0}}=\frac{P_{0}(S\leq t)}{P_{0}(S\leq t)}\mathbb{I}\{s\leq t\}+\frac{P_{0}(S>t)}{P_{0}(S>t)}\mathbb{I}\{s>t\}=1, so P2=P0P_{2}=P_{0}. Furthermore,

    P1(St)\displaystyle P_{1}(S\leq t) ={St}𝑑P1={St}(P(St)P0(St)𝕀{st}+P(S>t)P0(S>t)𝕀{s>t})𝑑P0\displaystyle=\int_{\{S\leq t\}}dP_{1}=\int_{\{S\leq t\}}\left(\frac{P(S\leq t)}{P_{0}(S\leq t)}\mathbb{I}\{s\leq t\}+\frac{P(S>t)}{P_{0}(S>t)}\mathbb{I}\{s>t\}\right)dP_{0}
    =P(St)P0(St)P0(St)=P(St).\displaystyle=\frac{P(S\leq t)}{P_{0}(S\leq t)}P_{0}(S\leq t)=P(S\leq t).

    By Lemma 11, we have: Df(P1P0)=Df(P1P2)Df(PP0)ρD_{f}(P_{1}\|P_{0})=D_{f}(P_{1}\|P_{2})\leq D_{f}(P\|P_{0})\leq\rho. In conclusion, for any P𝒫f,ρP\in\mathcal{P}_{f,\rho}, we can find P1P_{1} such that: there exists P0𝒞sP_{0}\in\mathcal{CH}_{s} such that Df(P1P0)ρD_{f}(P_{1}\|P_{0})\leq\rho, dP1dP0\frac{dP_{1}}{dP_{0}} is constant on {St}\{S\leq t\} and {S>t}\{S>t\}, and P1(St)=P(St)P_{1}(S\leq t)=P(S\leq t). In other words, for any P𝒫f,ρP\in\mathcal{P}_{f,\rho}, we can find P1𝒫f,ρP_{1}\in\mathcal{P}_{f,\rho}^{*} such that P1(St)=P(St)P_{1}(S\leq t)=P(S\leq t), which means that:

    inf{P(St)|P𝒫f,ρ}inf{P(St)|P𝒫f,ρ}.\inf\left\{P(S\leq t)|P\in\mathcal{P}_{f,\rho}^{*}\right\}\leq\inf\left\{P(S\leq t)|P\in\mathcal{P}_{f,\rho}\right\}.

    On the other hand, since 𝒫f,ρ𝒫f,ρ\mathcal{P}_{f,\rho}^{*}\subseteq\mathcal{P}_{f,\rho}, we have:

    inf{P(St)|P𝒫f,ρ}inf{P(St)|P𝒫f,ρ}.\inf\left\{P(S\leq t)|P\in\mathcal{P}_{f,\rho}\right\}\leq\inf\left\{P(S\leq t)|P\in\mathcal{P}_{f,\rho}^{*}\right\}. (A.5)

    So we have F~(t;𝒫f,ρ)=F~(t;𝒫f,ρ)\widetilde{F}(t;\mathcal{P}_{f,\rho})=\widetilde{F}(t;\mathcal{P}_{f,\rho}^{*}). Combining Equation A.4 with Equation A.5 implies that: F~(t;𝒫f,ρ)=gf,ρ(F1(t),,Fd(t))\widetilde{F}(t;\mathcal{P}_{f,\rho})=g_{f,\rho}\left(F_{1}(t),\dots,F_{d}(t)\right).

  3. (3)

    When Fi(t)=1F_{i}(t)=1 for all i[d]i\in[d]. Then for any z(gf,ρ(1,,1),1]z\in\left(g_{f,\rho}(1,\dots,1),1\right], let Sz,j(1z)δt+1+zSjS_{z,j}\coloneqq(1-z)\delta_{t+1}+zS_{j}. According to the proof of Lemma 17, h(z,1)=f(z)h(z,1)=f(z) is non-increasing in [0,1][0,1], so z>gf,ρ(1)z>g_{f,\rho}(1) implies that f(z)ρf(z)\leq\rho. So we have:

    Df(Sz,jSj)\displaystyle D_{f}(S_{z,j}\|S_{j}) =f(d(1z)δt+1+zSjdSj)𝑑Sj\displaystyle=\int f\left(\frac{d(1-z)\delta_{t+1}+zS_{j}}{dS_{j}}\right)dS_{j}
    =+f(d(1z)δt+1dSj+z)𝑑Sj(y)\displaystyle=\int_{-\infty}^{+\infty}f\left(\frac{d(1-z)\delta_{t+1}}{dS_{j}}+z\right)dS_{j}(y)
    =tf(d(1z)δt+1dSj+z)𝑑Sj(y)\displaystyle=\int_{-\infty}^{t}f\left(\frac{d(1-z)\delta_{t+1}}{dS_{j}}+z\right)dS_{j}(y)
    =tf(z)𝑑Sj(y)=f(z)ρ.\displaystyle=\int_{-\infty}^{t}f(z)dS_{j}(y)=f(z)\leq\rho.

    Moreover, Sz,j(St)=(1z)δt+1(St)+zSj(St)=zSj(St)=zS_{z,j}(S\leq t)=(1-z)\delta_{t+1}(S\leq t)+zS_{j}(S\leq t)=zS_{j}(S\leq t)=z. So for any z>gf,ρ(F1(t),,Fd(t))z>g_{f,\rho}\left(F_{1}(t),\dots,F_{d}(t)\right), there exists Sz,j𝒫f,ρS_{z,j}\in\mathcal{P}_{f,\rho} such that Sz,j(St)=zS_{z,j}(S\leq t)=z, which means that F~(t;𝒫f,ρ)gf,ρ(F1(t),,Fd(t))\widetilde{F}(t;\mathcal{P}_{f,\rho})\leq g_{f,\rho}\left(F_{1}(t),\dots,F_{d}(t)\right). For another direction, we follow the same argument except that we use the following Markov kernel to account for the fact that i=1dλiSi(S>t)=0\sum_{i=1}^{d}\lambda_{i}S_{i}(S>t)=0 for any 𝝀Δd1\boldsymbol{\lambda}\in\Delta^{d-1}:

    K(ds|s)={dP0(s)𝕀{st}P0(St)stδs=t+1s>t.K(ds^{\prime}|s)=\left\{\begin{array}[]{lcl}\frac{dP_{0}(s^{\prime})\mathbb{I}\{s^{\prime}\leq t\}}{P_{0}(S\leq t)}&&{s\leq t}\\ \delta_{s^{\prime}=t+1}&&{s>t}\end{array}\right..
  4. (4)

    When Fi(t)>0F_{i}(t)>0 for all i[d]i\in[d] and Fi(t)=1F_{i}(t)=1 for at least one and at most d1d-1 indices i[d]i\in[d]. Then we know that 0<min1idFi(t)<10<\underset{1\leq i\leq d}{\min}F_{i}(t)<1; without loss of generality, we assume that j=argmin1idFi(t)j=\underset{1\leq i\leq d}{\arg\min}F_{i}(t). We define Δ={𝝀|𝝀Δd1,i=1dλiFi(t)<1}\Delta^{\prime}=\left\{\boldsymbol{\lambda}|\boldsymbol{\lambda}\in\Delta^{d-1},\sum_{i=1}^{d}\lambda_{i}F_{i}(t)<1\right\}, and then we define:

    𝒫f,ρ={P|𝝀Δ s.t. Df(Pi=1dλiSi)ρ,dPdi=1dλiSi is constant on {St} and {S>t}}.\mathcal{P}_{f,\rho}^{\prime}\!=\!\!\left\{\!P\bigg{|}\exists\boldsymbol{\lambda}\!\in\!\Delta^{\prime}\text{ s.t. }D_{f}\!\left(\!\!P\bigg{\|}\!\sum_{i=1}^{d}\!\lambda_{i}S_{i}\!\!\right)\!\!\leq\!\rho,\frac{dP}{d\sum_{i=1}^{d}\!\lambda_{i}S_{i}}\text{ is constant on }\{S\!\!\leq\!\!t\}\text{ and }\{S\!\!>\!\!t\}\!\!\right\}.

    It’s obvious that 𝒫f,ρ𝒫f,ρ\mathcal{P}_{f,\rho}^{\prime}\subseteq\mathcal{P}_{f,\rho}^{*} and similar to the proof in situation (2), we have:

    F~(t;𝒫f,ρ)\displaystyle\widetilde{F}(t;\mathcal{P}_{f,\rho}^{\prime}) =inf{P(St)|P𝒫f,ρ}\displaystyle=\inf\left\{P(S\leq t)|P\in\mathcal{P}_{f,\rho}^{\prime}\right\} (A.6)
    =inf{P(St)|𝝀Δ s.t. Df(Pi=1dλiSi)ρ,\displaystyle=\inf\left\{P(S\leq t)\bigg{|}\exists\boldsymbol{\lambda}\in\Delta^{\prime}\text{ s.t. }D_{f}\left(P\bigg{\|}\sum_{i=1}^{d}\lambda_{i}S_{i}\right)\leq\rho,\right.
    dPdi=1dλiSi is constant on {St} and {S>t}}\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \left.\frac{dP}{d\sum_{i=1}^{d}\lambda_{i}S_{i}}\text{ is constant on }\{S\leq t\}\text{ and }\{S>t\}\right\}
    =inf{P(St)|𝝀Δ s.t. dPdi=1dλiSi is constant on {St} and {S>t},\displaystyle=\inf\left\{\!P(S\leq t)\bigg{|}\exists\boldsymbol{\lambda}\in\Delta^{\prime}\text{ s.t. }\frac{dP}{d\sum_{i=1}^{d}\lambda_{i}S_{i}}\text{ is constant on }\{S\!\leq\!t\}\text{ and }\{S\!>\!t\},\right.
    (i=1dλiFi(t))f(P(St)i=1dλiFi(t))+(1i=1dλiFi(t))f(1P(St)1i=1dλiFi(t))ρ}\displaystyle\left.\left(\sum_{i=1}^{d}\lambda_{i}F_{i}(t)\!\!\right)f\left(\!\frac{P(S\leq t)}{\sum_{i=1}^{d}\lambda_{i}F_{i}(t)}\!\right)+\left(\!1\!-\!\sum_{i=1}^{d}\lambda_{i}F_{i}(t)\!\!\right)f\left(\!\frac{1-P(S\leq t)}{1\!-\!\sum_{i=1}^{d}\lambda_{i}F_{i}(t)}\!\!\right)\!\leq\!\rho\!\right\}
    =inf{P(St)|inf𝝀Δh(P(St),i=1dλiFi(t))ρ}.\displaystyle=\inf\left\{P(S\leq t)\bigg{|}\underset{\boldsymbol{\lambda}\in\Delta^{\prime}}{\inf}h\left(P(S\leq t),\sum_{i=1}^{d}\lambda_{i}F_{i}(t)\right)\leq\rho\right\}.

    According to the definition of Δ\Delta^{\prime}, we have:

    {i=1dλiFi(t)|𝝀Δ}=[Fj(t),1).\left\{\sum_{i=1}^{d}\lambda_{i}F_{i}(t)\bigg{|}\boldsymbol{\lambda}\in\Delta^{\prime}\right\}=\left[F_{j}(t),1\right).

    So we have:

    F~(t;𝒫f,ρ)=inf{P(St)|infβ[Fj(t),1)h(P(St),β)ρ}.\widetilde{F}(t;\mathcal{P}_{f,\rho}^{\prime})=\inf\left\{P(S\leq t)\bigg{|}\underset{\beta\in[F_{j}(t),1)}{\inf}h\left(P(S\leq t),\beta\right)\leq\rho\right\}.

    A similar argument to that in the proof of Lemma 17 gives F~(t;𝒫f,ρ)=gf,ρ(Fj(t))\widetilde{F}(t;\mathcal{P}_{f,\rho}^{\prime})=g_{f,\rho}(F_{j}(t)). Since Fj(t)=min1idFi(t)F_{j}(t)=\underset{1\leq i\leq d}{\min}F_{i}(t), Lemma 17 tells us that gf,ρ(Fj(t))=gf,ρ(F1(t),,Fd(t))g_{f,\rho}(F_{j}(t))=g_{f,\rho}\left(F_{1}(t),\cdots,F_{d}(t)\right), which means that F~(t;𝒫f,ρ)F~(t;𝒫f,ρ)=gf,ρ(F1(t),,Fd(t))\widetilde{F}(t;\mathcal{P}_{f,\rho})\leq\widetilde{F}(t;\mathcal{P}_{f,\rho}^{\prime})=g_{f,\rho}\left(F_{1}(t),\cdots,F_{d}(t)\right).

    The other direction follows from arguments similar to those in situations (2) and (3): when P0P_{0} satisfies P0(St)<1P_{0}(S\leq t)<1, we use the Markov kernel defined in situation (2); otherwise, we use the Markov kernel defined in situation (3). ∎

We now proceed to the proof of Theorem 6, which is a direct consequence of Lemma 18.

Proof of Theorem 6.
𝒬~(α;𝒫f,ρ)\displaystyle\widetilde{\mathcal{Q}}(\alpha;\mathcal{P}_{f,\rho}) =(i)inf{q|F~(q;𝒫f,ρ)α}=(ii)inf{q|gf,ρ(Fmin(q))α}\displaystyle\overset{(\text{i})}{=}\inf\left\{q\in\mathbb{R}\Big{|}\widetilde{F}(q;\mathcal{P}_{f,\rho})\geq\alpha\right\}\overset{(\text{ii})}{=}\inf\left\{q\in\mathbb{R}\Big{|}g_{f,\rho}\left(F_{\min}(q)\right)\geq\alpha\right\}
=(iii)inf{q|Fmin(q)gf,ρ1(α)}=𝒬(gf,ρ1(α);Fmin),\displaystyle\overset{(\text{iii})}{=}\inf\left\{q\in\mathbb{R}\Big{|}F_{\min}(q)\geq g_{f,\rho}^{-1}(\alpha)\right\}=\mathcal{Q}\left(g_{f,\rho}^{-1}(\alpha);F_{\min}\right),

where (i)(\text{i}) follows the definition of the quantile function; (ii)(\text{ii}) comes from Lemma 18; (iii)(\text{iii}) is a result of item (5) in Lemma 14. ∎
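
In practice, Theorem 6 says the worst-case quantile is just an ordinary quantile of F_min, the pointwise minimum of the per-source score c.d.f.'s. The sketch below computes the empirical version of this quantile from hypothetical per-source calibration scores; tau stands for an already-computed value of g_{f,ρ}^{-1}(1-α) (see the sketch after Lemma 10 below for one way to obtain it).

import numpy as np

def quantile_of_F_min(score_samples, tau):
    # score_samples: one 1-d array of calibration scores per source domain
    candidates = np.sort(np.concatenate(score_samples))
    for s in candidates:
        F_min = min(np.mean(samples <= s) for samples in score_samples)
        if F_min >= tau:
            return s                      # smallest score with hat{F}_min(s) >= tau
    return candidates[-1]                 # tau not reached within the samples

rng = np.random.default_rng(2)
scores = [rng.normal(mu, 1.0, size=500) for mu in (0.0, 0.3, 0.8)]   # three hypothetical sources
print(quantile_of_F_min(scores, tau=0.95))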

A.5 Proof of Proposition 7

Proposition 7.

Let F1,,FdF_{1},\dots,F_{d} be c.d.f.’s on \mathbb{R}, define Fmin(x)=min1idFi(x)F_{\min}(x)=\underset{1\leq i\leq d}{\min}F_{i}(x). Suppose F^1,,F^d\hat{F}_{1},\dots,\hat{F}_{d} are the empirical c.d.f.’s corresponding to F1,,FdF_{1},\dots,F_{d} with m1,,mdm_{1},\dots,m_{d} examples, respectively. Define F^min(x)=min1idF^i(x)\hat{F}_{\min}(x)=\underset{1\leq i\leq d}{\min}\hat{F}_{i}(x), then for any ϵ>0\epsilon>0,

(supx|Fmin(x)F^min(x)|>ϵ)2i=1de2miϵ2,\mathbb{P}\left(\underset{x\in\mathbb{R}}{\sup}\left|F_{\min}(x)-\hat{F}_{\min}(x)\right|>\epsilon\right)\leq 2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}},

where the probability is over the randomness of the examples that define the empirical c.d.f.’s.

We need the following famous Dvoretzky–Kiefer–Wolfowitz inequality (DKW inequality for short) as a tool.

Lemma 19 ((Massart 1990, DKW inequality)).

Given a natural number nn, let X1,X2,,XnX_{1},X_{2},\dots,X_{n} be real-valued i.i.d. random variables with c.d.f. F()F(\cdot). Let FnF_{n} denote the associated empirical distribution function defined by:

Fn(x)=1ni=1n𝕀(,x](Xi),x.F_{n}(x)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}_{(-\infty,x]}(X_{i}),\ \ \ \ \ \ x\in\mathbb{R}.

Then, for any ϵ12nln2\epsilon\geq\sqrt{\frac{1}{2n}\ln{2}}, we have:

(supx(Fn(x)F(x))>ϵ)e2nϵ2,\mathbb{P}\left(\underset{x\in\mathbb{R}}{\sup}\left(F_{n}(x)-F(x)\right)>\epsilon\right)\leq e^{-2n\epsilon^{2}},

moreover, for any ϵ>0\epsilon>0, we have:

(supx|Fn(x)F(x)|>ϵ)2e2nϵ2.\mathbb{P}\left(\underset{x\in\mathbb{R}}{\sup}\left|F_{n}(x)-F(x)\right|>\epsilon\right)\leq 2e^{-2n\epsilon^{2}}.
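
The following is a quick Monte Carlo illustration of the two-sided DKW bound for the uniform distribution (the sample size, tolerance, and number of trials are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(3)
n, eps, trials = 200, 0.08, 2000
exceed = 0
for _ in range(trials):
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    # exact sup_x |F_n(x) - F(x)| for the uniform c.d.f. F(x) = x
    d = max(np.max(i / n - x), np.max(x - (i - 1) / n))
    exceed += d > eps

print("empirical exceedance:", exceed / trials, "  DKW bound:", 2 * np.exp(-2 * n * eps ** 2))
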
Proof of Proposition 7.

Fix ϵ>0\epsilon>0, we define the events Ai,i[d]A_{i},i\in[d] as: Ai={supx|F^i(x)Fi(x)|ϵ}A_{i}=\left\{\underset{x\in\mathbb{R}}{\sup}\left|\hat{F}_{i}(x)-F_{i}(x)\right|\leq\epsilon\right\}. Similarly, define A={supx|F^min(x)Fmin(x)|ϵ}A=\left\{\underset{x\in\mathbb{R}}{\sup}\left|\hat{F}_{\min}(x)-F_{\min}(x)\right|\leq\epsilon\right\}. If i=1dAi\cap_{i=1}^{d}A_{i} holds, then, for all i[d]i\in[d]:

supx|F^i(x)Fi(x)|ϵ,\underset{x\in\mathbb{R}}{\sup}\left|\hat{F}_{i}(x)-F_{i}(x)\right|\leq\epsilon,

which means that:

i[d],x:Fi(x)ϵF^i(x)Fi(x)+ϵ.\forall i\in[d],\forall x\in\mathbb{R}:\ \ F_{i}(x)-\epsilon\leq\hat{F}_{i}(x)\leq F_{i}(x)+\epsilon.

For an arbitrary xx\in\mathbb{R}, assume that j=argmin1idF^i(x)j=\underset{1\leq i\leq d}{\arg\min}\hat{F}_{i}(x), then:

F^min(x)=F^j(x)Fj(x)ϵmin1idFi(x)ϵFmin(x)ϵ.\hat{F}_{\min}(x)=\hat{F}_{j}(x)\geq F_{j}(x)-\epsilon\geq\underset{1\leq i\leq d}{\min}F_{i}(x)-\epsilon\geq F_{\min}(x)-\epsilon. (A.7)

Suppose k=argmin1idFi(x)k=\underset{1\leq i\leq d}{\arg\min}F_{i}(x), then:

Fmin(x)+ϵ=Fk(x)+ϵF^k(x)min1idF^i(x)=F^min(x).F_{\min}(x)+\epsilon=F_{k}(x)+\epsilon\geq\hat{F}_{k}(x)\geq\underset{1\leq i\leq d}{\min}\hat{F}_{i}(x)=\hat{F}_{\min}(x). (A.8)

Combining Equation A.7 and Equation A.8 shows that:

x:Fmin(x)ϵF^min(x)Fmin(x)+ϵ,\forall x\in\mathbb{R}:\ \ F_{\min}(x)-\epsilon\leq\hat{F}_{\min}(x)\leq F_{\min}(x)+\epsilon,

which means that:

supx|F^min(x)Fmin(x)|ϵ.\underset{x\in\mathbb{R}}{\sup}\left|\hat{F}_{\min}(x)-F_{\min}(x)\right|\leq\epsilon.

So we have proved that i=1dAiA\cap_{i=1}^{d}A_{i}\subseteq A, which means that Ac(i=1dAi)c=i=1dAicA^{c}\subseteq\left(\cap_{i=1}^{d}A_{i}\right)^{c}=\cup_{i=1}^{d}A_{i}^{c}, where AcA^{c} denotes the complement of the set AA. So we have:

(Ac)(i=1dAic)(i)i=1d(Aic),\mathbb{P}\left(A^{c}\right)\leq\mathbb{P}\left(\cup_{i=1}^{d}A_{i}^{c}\right)\overset{(\text{i})}{\leq}\sum_{i=1}^{d}\mathbb{P}\left(A_{i}^{c}\right), (A.9)

where (i)(\text{i}) comes from the subadditivity of the probability measure. Equation A.9 implies that:

(supx|Fmin(x)F^min(x)|>ϵ)\displaystyle\mathbb{P}\left(\underset{x\in\mathbb{R}}{\sup}\left|F_{\min}(x)-\hat{F}_{\min}(x)\right|>\epsilon\right) i=1d(supx|Fi(x)F^i(x)|>ϵ)\displaystyle\leq\sum_{i=1}^{d}\mathbb{P}\left(\underset{x\in\mathbb{R}}{\sup}\left|F_{i}(x)-\hat{F}_{i}(x)\right|>\epsilon\right)
(ii)2i=1de2miϵ2.\displaystyle\overset{(\text{ii})}{\leq}2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}.

where (ii)(\text{ii}) follows from the DKW inequality (Lemma 19). ∎

A.6 Proof of Theorem 8

Theorem 8 (Marginal coverage guarantee for the empirical estimations).

Assume Sn+1=s(Xn+1,Yn+1)P𝒫f,ρS_{n+1}=s(X_{n+1},Y_{n+1})\sim P\in\mathcal{P}_{f,\rho} is independent of {Sij}i,j=1d,mi\{S_{ij}\}_{i,j=1}^{d,m_{i}} where {Sij}j=1mii.i.d.s#Si\{S_{ij}\}_{j=1}^{m_{i}}\overset{i.i.d.}{\sim}s\#S_{i} for i[d]i\in[d]. Suppose ρ=infP0𝒞sDf(PP0)ρ\rho^{\star}=\underset{P_{0}\in\mathcal{CH}_{s}}{\inf}D_{f}(P\|P_{0})\leq\rho where 𝒞s=𝒞(s#S1,,s#Sd)\mathcal{CH}_{s}=\mathcal{CH}(s\#S_{1},\cdots,s\#S_{d}). Let F^min\hat{F}_{\min} be defined as in Proposition 7 and let S^1,,S^d\hat{S}_{1},\dots,\hat{S}_{d} be the empirical distributions of S1,,SdS_{1},\dots,S_{d} respectively. Define

𝒫^f,ρ{S is a distribution on |S0𝒞(s#S^1,,s#S^d)s.t.Df(SS0)ρ}.\hat{\mathcal{P}}_{f,\rho}\coloneqq\left\{S\text{ is a distribution on }\mathbb{R}\Big{|}\exists S_{0}\in\mathcal{CH}(s\#\hat{S}_{1},\cdots,s\#\hat{S}_{d})\ \text{s.t.}\ D_{f}(S\|S_{0})\leq\rho\right\}.

If we set t=𝒬~(1α;𝒫^f,ρ)=𝒬(gf,ρ1(1α);F^min)t=\widetilde{\mathcal{Q}}(1-\alpha;\hat{\mathcal{P}}_{f,\rho})=\mathcal{Q}\left(g_{f,\rho}^{-1}(1-\alpha);\hat{F}_{\min}\right), then for any ϵ>0\epsilon>0, we can get the following marginal coverage guarantee for 𝒞~\widetilde{\mathcal{C}}:

(Yn+1𝒞~(Xn+1))\displaystyle\mathbb{P}\left(Y_{n+1}\in\widetilde{\mathcal{C}}(X_{n+1})\right) (12i=1de2miϵ2)gf,ρ(gf,ρ1(1α)ϵ)\displaystyle\geq\left(1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}\right)g_{f,\rho^{\star}}\left(g_{f,\rho}^{-1}(1-\alpha)-\epsilon\right)
(12i=1de2miϵ2)(1αϵgf,ρ(gf,ρ1(1α))),\displaystyle\geq\left(1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}\right)\left(1-\alpha-\epsilon\cdot g_{f,\rho}^{\prime}\left(g_{f,\rho}^{-1}(1-\alpha)\right)\right),

where the randomness is over the choice of the source examples and (Xn+1,Yn+1)(X_{n+1},Y_{n+1}).

Proof of Theorem 8.

Suppose FF is the c.d.f. of the target score distribution PP. Recall that by Lemma 15, F~(t;𝒫f,ρ)=infP𝒫f,ρP(St)\widetilde{F}(t;\mathcal{P}_{f,\rho^{\star}})=\underset{P^{\prime}\in\mathcal{P}_{f,\rho^{\star}}}{\inf}P^{\prime}(S\leq t); since P𝒫f,ρP\in\mathcal{P}_{f,\rho^{\star}}, we have, for any tt\in\mathbb{R}:

F(t)F~(t;𝒫f,ρ)=(i)gf,ρ(Fmin(t)),F(t)\geq\widetilde{F}(t;\mathcal{P}_{f,\rho^{\star}})\overset{(\text{i})}{=}g_{f,\rho^{\star}}\left(F_{\min}(t)\right),

where (i)(\text{i}) is from Lemma 18. Recall that 𝒞~(x)={y𝒴|s(x,y)t}\widetilde{\mathcal{C}}(x)=\{y\in\mathcal{Y}|s(x,y)\leq t\}, set t=𝒬~(1α;𝒫^f,ρ)t=\widetilde{\mathcal{Q}}(1-\alpha;\hat{\mathcal{P}}_{f,\rho}), then we have:

\displaystyle\mathbb{P} (Yn+1𝒞~(Xn+1)|{(Xij,Yij)}i,j=1d,mi)=(ii)(s(Xn+1,Yn+1)𝒬~(1α;𝒫^f,ρ)|{(Xij,Yij)}i,j=1d,mi)\displaystyle\left(Y_{n+1}\in\widetilde{\mathcal{C}}(X_{n+1})\Big{|}\{(X_{ij},Y_{ij})\}_{i,j=1}^{d,m_{i}}\right)\overset{(\text{ii})}{=}\mathbb{P}\left(s(X_{n+1},Y_{n+1})\leq\widetilde{\mathcal{Q}}(1-\alpha;\hat{\mathcal{P}}_{f,\rho})\Big{|}\{(X_{ij},Y_{ij})\}_{i,j=1}^{d,m_{i}}\right)
=(iii)F(𝒬~(1α;𝒫^f,ρ))gf,ρ(Fmin(𝒬~(1α;𝒫^f,ρ)))\displaystyle\overset{(\text{iii})}{=}F\left(\widetilde{\mathcal{Q}}(1-\alpha;\hat{\mathcal{P}}_{f,\rho})\right)\geq g_{f,\rho^{\star}}\left(F_{\min}\left(\widetilde{\mathcal{Q}}(1-\alpha;\hat{\mathcal{P}}_{f,\rho})\right)\right)
=(iv)gf,ρ(Fmin(𝒬(gf,ρ1(1α);F^min))),\displaystyle\overset{(\text{iv})}{=}g_{f,\rho^{\star}}\left(F_{\min}\left(\mathcal{Q}\left(g_{f,\rho}^{-1}(1-\alpha);\hat{F}_{\min}\right)\right)\right),

where (ii)(\text{ii}) is from the definition of 𝒞~\widetilde{\mathcal{C}} and tt; (iii)(\text{iii}) is from the fact that (Xn+1,Yn+1)(X_{n+1},Y_{n+1}) is independent of {(Xij,Yij)}i,j=1d,mi\{(X_{ij},Y_{ij})\}_{i,j=1}^{d,m_{i}} and (iv)(\text{iv}) is a result of Theorem 6. For ϵ>0\epsilon>0, when supx|Fmin(x)F^min(x)|ϵ\underset{x\in\mathbb{R}}{\sup}\left|F_{\min}(x)-\hat{F}_{\min}(x)\right|\leq\epsilon, we have:

Fmin(𝒬(gf,ρ1(1α);F^min))F^min(𝒬(gf,ρ1(1α);F^min))ϵgf,ρ1(1α)ϵ.F_{\min}\left(\mathcal{Q}\left(g_{f,\rho}^{-1}(1-\alpha);\hat{F}_{\min}\right)\right)\geq\hat{F}_{\min}\left(\mathcal{Q}\left(g_{f,\rho}^{-1}(1-\alpha);\hat{F}_{\min}\right)\right)-\epsilon\geq g_{f,\rho}^{-1}(1-\alpha)-\epsilon.

where the last inequality holds since F^min(𝒬(τ;F^min))τ\hat{F}_{\min}(\mathcal{Q}(\tau;\hat{F}_{\min}))\geq\tau for any τ(0,1]\tau\in(0,1], by the right continuity of F^min\hat{F}_{\min}.

For any fixed ϵ>0\epsilon>0, define A={supx|Fmin(x)F^min(x)|ϵ}A=\left\{\underset{x\in\mathbb{R}}{\sup}\left|F_{\min}(x)-\hat{F}_{\min}(x)\right|\leq\epsilon\right\} and let PP^{*} be the probability measure with respect to {(Xij,Yij)}i,j=1d,mi\{(X_{ij},Y_{ij})\}_{i,j=1}^{d,m_{i}}. Taking expectation with respect to {(Xij,Yij)}i,j=1d,mi\{(X_{ij},Y_{ij})\}_{i,j=1}^{d,m_{i}} yields:

\displaystyle\mathbb{P} (Yn+1𝒞~(Xn+1))=𝔼{(Xij,Yij)}[(Yn+1𝒞~(Xn+1)|{(Xij,Yij)}i,j=1d,mi)]\displaystyle\left(Y_{n+1}\in\widetilde{\mathcal{C}}(X_{n+1})\right)=\underset{\{(X_{ij},Y_{ij})\}}{\mathbb{E}}\left[\mathbb{P}\left(Y_{n+1}\in\widetilde{\mathcal{C}}(X_{n+1})\Big{|}\{(X_{ij},Y_{ij})\}_{i,j=1}^{d,m_{i}}\right)\right]
=A(Yn+1𝒞~(Xn+1)|{(Xij,Yij)}i,j=1d,mi)𝑑P\displaystyle=\int_{A}\mathbb{P}\left(Y_{n+1}\in\widetilde{\mathcal{C}}(X_{n+1})\Big{|}\{(X_{ij},Y_{ij})\}_{i,j=1}^{d,m_{i}}\right)dP^{*}
+Ac(Yn+1𝒞~(Xn+1)|{(Xij,Yij)}i,j=1d,mi)𝑑P\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ +\int_{A^{c}}\mathbb{P}\left(Y_{n+1}\in\widetilde{\mathcal{C}}(X_{n+1})\Big{|}\{(X_{ij},Y_{ij})\}_{i,j=1}^{d,m_{i}}\right)dP^{*}
A(Yn+1𝒞~(Xn+1)|{(Xij,Yij)}i,j=1d,mi)𝑑P\displaystyle\geq\int_{A}\mathbb{P}\left(Y_{n+1}\in\widetilde{\mathcal{C}}(X_{n+1})\Big{|}\{(X_{ij},Y_{ij})\}_{i,j=1}^{d,m_{i}}\right)dP^{*}
Agf,ρ(Fmin(𝒬(gf,ρ1(1α);F^min)))𝑑P\displaystyle\geq\int_{A}g_{f,\rho^{\star}}\left(F_{\min}\left(\mathcal{Q}\left(g_{f,\rho}^{-1}(1-\alpha);\hat{F}_{\min}\right)\right)\right)dP^{*}
(v)Agf,ρ(gf,ρ1(1α)ϵ)𝑑P=(A)gf,ρ(gf,ρ1(1α)ϵ)\displaystyle\overset{(\text{v})}{\geq}\int_{A}g_{f,\rho^{\star}}\left(g_{f,\rho}^{-1}(1-\alpha)-\epsilon\right)dP^{*}=\mathbb{P}(A)\ g_{f,\rho^{\star}}\left(g_{f,\rho}^{-1}(1-\alpha)-\epsilon\right)
(vi)(12i=1de2miϵ2)gf,ρ(gf,ρ1(1α)ϵ)\displaystyle\overset{(\text{vi})}{\geq}\left(1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}\right)g_{f,\rho^{\star}}\left(g_{f,\rho}^{-1}(1-\alpha)-\epsilon\right)

where (v)(\text{v}) comes from the fact that gf,ρ(β)g_{f,\rho}(\beta) is non-decreasing in β\beta (Lemma 14) and (vi)(\text{vi}) is a result of Proposition 7. Since gf,ρ(β)g_{f,\rho}(\beta) is non-increasing (Lemma 14) in ρ\rho and ρρ\rho^{\star}\leq\rho, we know that:

gf,ρ(gf,ρ1(1α)ϵ)gf,ρ(gf,ρ1(1α)ϵ).g_{f,\rho^{\star}}\left(g_{f,\rho}^{-1}(1-\alpha)-\epsilon\right)\geq g_{f,\rho}\left(g_{f,\rho}^{-1}(1-\alpha)-\epsilon\right).

Lemma 14 shows that gf,ρ(β)g_{f,\rho}(\beta) is convex with respect to β\beta, so for any x,y[0,1]x,y\in[0,1]:

gf,ρ(y)gf,ρ(x)+gf,ρ(x)(yx).g_{f,\rho}(y)\geq g_{f,\rho}(x)+g_{f,\rho}^{\prime}(x)(y-x).

Taking x=gf,ρ1(1α)x=g_{f,\rho}^{-1}(1-\alpha) and y=gf,ρ1(1α)ϵy=g_{f,\rho}^{-1}(1-\alpha)-\epsilon yields:

gf,ρ(gf,ρ1(1α)ϵ)\displaystyle g_{f,\rho}\left(g_{f,\rho}^{-1}(1-\alpha)-\epsilon\right) gf,ρ(gf,ρ1(1α))gf,ρ(gf,ρ1(1α))ϵ\displaystyle\geq g_{f,\rho}\left(g_{f,\rho}^{-1}(1-\alpha)\right)-g_{f,\rho}^{\prime}\left(g_{f,\rho}^{-1}(1-\alpha)\right)\cdot\epsilon
=1αϵgf,ρ(gf,ρ1(1α)).\displaystyle=1-\alpha-\epsilon\cdot g_{f,\rho}^{\prime}\left(g_{f,\rho}^{-1}(1-\alpha)\right).
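
To get a feel for how this bound behaves, the sketch below evaluates it under the total-variation generator f(t)=½|t-1| (so that g_{f,ρ} and g_{f,ρ}^{-1} have the closed forms derived in Example 2 at the end of this appendix), taking ρ*=ρ for simplicity; the sample sizes and the grid of ϵ values are illustrative, and the printout shows the trade-off between the DKW slack and the shift of the quantile level.

import numpy as np

def bound_tv(alpha, rho, m_sizes, eps):
    g = lambda b: max(b - rho, 0.0)          # Example 2: g_{f,rho}(beta) = (beta - rho)_+
    g_inv = lambda t: min(t + rho, 1.0)      # Example 2: g_{f,rho}^{-1}(tau) = min(tau + rho, 1)
    slack = 1.0 - 2.0 * sum(np.exp(-2.0 * m * eps ** 2) for m in m_sizes)
    return slack * g(g_inv(1.0 - alpha) - eps)

for eps in (0.005, 0.01, 0.02, 0.05):
    print(eps, bound_tv(alpha=0.1, rho=0.1, m_sizes=[50000, 50000, 50000], eps=eps))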

A.7 Proof of Corollary 9

Corollary 9 (Correct the prediction set to get a (1α)(1-\alpha) marginal coverage guarantee).

Let (Xn+1,Yn+1)(X_{n+1},Y_{n+1}), F^min\hat{F}_{\min}, 𝒫^f,ρ\hat{\mathcal{P}}_{f,\rho} be defined as in Theorem 8. For arbitrary ϵ>0\epsilon>0, if we set t=𝒬~(1α;𝒫^f,ρ)=𝒬(gf,ρ1(1α);F^min)t=\widetilde{\mathcal{Q}}(1-\alpha^{\prime};\hat{\mathcal{P}}_{f,\rho})=\mathcal{Q}\left(g_{f,\rho}^{-1}(1-\alpha^{\prime});\hat{F}_{\min}\right), where

α=1gf,ρ(ϵ+gf,ρ1(1α12i=1de2miϵ2)),\alpha^{\prime}=1-g_{f,\rho}\left(\epsilon+g_{f,\rho}^{-1}\left(\frac{1-\alpha}{1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}}\right)\right),

then we can get the following marginal coverage guarantee:

(Yn+1𝒞~(Xn+1))1α.\mathbb{P}\left(Y_{n+1}\in\widetilde{\mathcal{C}}(X_{n+1})\right)\geq 1-\alpha.
Proof of Corollary 9.

By Lemma 14, gf,ρ(β)g_{f,\rho}(\beta) is non-increasing in ρ\rho; since ρρ\rho^{\star}\leq\rho, we have gf,ρ(β)gf,ρ(β)g_{f,\rho^{\star}}(\beta)\geq g_{f,\rho}(\beta) for any β[0,1]\beta\in[0,1]. Then by Theorem 8 we know that:

\displaystyle\mathbb{P} (Yn+1𝒞~(Xn+1))(12i=1de2miϵ2)gf,ρ(gf,ρ1(1α)ϵ)\displaystyle\left(Y_{n+1}\in\widetilde{\mathcal{C}}(X_{n+1})\right)\geq\left(1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}\right)g_{f,\rho^{\star}}\left(g_{f,\rho}^{-1}(1-\alpha^{\prime})-\epsilon\right)
(12i=1de2miϵ2)gf,ρ(gf,ρ1(1α)ϵ).\displaystyle\geq\left(1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}\right)g_{f,\rho}\left(g_{f,\rho}^{-1}(1-\alpha^{\prime})-\epsilon\right).

Since α=1gf,ρ(ϵ+gf,ρ1(1α12i=1de2miϵ2))\alpha^{\prime}=1-g_{f,\rho}\left(\epsilon+g_{f,\rho}^{-1}\left(\frac{1-\alpha}{1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}}\right)\right), we have:

\displaystyle\mathbb{P} (Yn+1𝒞~(Xn+1))(12i=1de2miϵ2)gf,ρ(gf,ρ1(1α)ϵ)\displaystyle\left(Y_{n+1}\in\widetilde{\mathcal{C}}(X_{n+1})\right)\geq\left(1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}\right)g_{f,\rho}\left(g_{f,\rho}^{-1}(1-\alpha^{\prime})-\epsilon\right)
=(12i=1de2miϵ2)gf,ρ(gf,ρ1(gf,ρ(ϵ+gf,ρ1(1α12i=1de2miϵ2)))ϵ)\displaystyle=\left(1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}\right)g_{f,\rho}\left(g_{f,\rho}^{-1}\left(g_{f,\rho}\left(\epsilon+g_{f,\rho}^{-1}\left(\frac{1-\alpha}{1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}}\right)\right)\right)-\epsilon\right)
=(12i=1de2miϵ2)gf,ρ(ϵ+gf,ρ1(1α12i=1de2miϵ2)ϵ)\displaystyle=\left(1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}\right)g_{f,\rho}\left(\epsilon+g_{f,\rho}^{-1}\left(\frac{1-\alpha}{1-2\sum_{i=1}^{d}e^{-2m_{i}\epsilon^{2}}}\right)-\epsilon\right)
=1α.\displaystyle=1-\alpha.
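
A sketch of computing the corrected level α′, assuming the chi-square generator so that g_{f,ρ} has the closed form of Example 1; g_{f,ρ}^{-1} is obtained by bisection, and the sample sizes and ϵ are illustrative. The resulting α′ is slightly smaller than α, which is the price paid for estimating F_min from finitely many source examples.

import numpy as np

def g(beta, rho):
    # Example 1 closed form for f(t) = (t - 1)^2
    return max(beta - np.sqrt(rho * beta * (1.0 - beta)), 0.0)

def g_inv(tau, rho, iters=60):
    # g is non-decreasing in beta, so invert it by bisection
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid, rho) <= tau else (lo, mid)
    return lo

def corrected_alpha(alpha, rho, m_sizes, eps):
    slack = 1.0 - 2.0 * sum(np.exp(-2.0 * m * eps ** 2) for m in m_sizes)
    assert slack > 1.0 - alpha, "epsilon is too small for these sample sizes"
    beta = min(eps + g_inv((1.0 - alpha) / slack, rho), 1.0)
    return 1.0 - g(beta, rho)

print(corrected_alpha(alpha=0.1, rho=0.05, m_sizes=[20000, 20000, 20000], eps=0.02))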

A.8 Proof of Lemma 10

Lemma 10 (The form of gf,ρ1g_{f,\rho}^{-1} that can be efficiently solved).

Let gf,ρ,gf,ρ1g_{f,\rho},g_{f,\rho}^{-1} be defined as in Theorem 6, then for τ[0,1]\tau\in[0,1], we have:

gf,ρ1(τ)=sup{β[τ,1]|βf(τβ)+(1β)f(1τ1β)ρ}.g_{f,\rho}^{-1}(\tau)=\sup\left\{\beta\in[\tau,1]\left|\beta f\left(\frac{\tau}{\beta}\right)+(1-\beta)f\left(\frac{1-\tau}{1-\beta}\right)\leq\rho\right.\right\}.
Proof of Lemma 10.

As in the proof of Lemma 17, we define h(β,z)=βf(zβ)+(1β)f(1z1β)h(\beta,z)=\beta f\left(\frac{z}{\beta}\right)+(1-\beta)f\left(\frac{1-z}{1-\beta}\right) for simplicity (note that the order of the arguments is swapped relative to the proof of Lemma 17); moreover, h(β,z)h(\beta,z) is convex.

For any ρ>0\rho>0, it is obvious that h(τ,τ)=0<ρh(\tau,\tau)=0<\rho, so gf,ρ(τ)τg_{f,\rho}(\tau)\leq\tau, which means that gf,ρ1(τ)τg_{f,\rho}^{-1}(\tau)\geq\tau. So gf,ρ1(τ)=sup{β[τ,1]|gf,ρ(β)τ}g_{f,\rho}^{-1}(\tau)=\sup\left\{\beta\in[\tau,1]\left|g_{f,\rho}(\beta)\leq\tau\right.\right\}. Moreover, for β[τ,1]\beta\in[\tau,1], since the minimum of h(β,z)h(\beta,z) is achieved at z=βz=\beta for a given β\beta, we know that inf0zτh(β,z)=h(β,τ)\underset{0\leq z\leq\tau}{\inf}h(\beta,z)=h(\beta,\tau). Then we have:

gf,ρ1(τ)\displaystyle g_{f,\rho}^{-1}(\tau) =sup{β[τ,1]|inf{z[0,1]|h(β,z)ρ}τ}\displaystyle=\sup\left\{\beta\in[\tau,1]\left|\inf\left\{z\in[0,1]\left|h(\beta,z)\leq\rho\right.\right\}\leq\tau\right.\right\}
=(i)sup{β[τ,1]|zτ s.t. h(β,z)ρ}\displaystyle\overset{(\text{i})}{=}\sup\left\{\beta\in[\tau,1]\left|\exists z\leq\tau\text{ s.t. }h(\beta,z)\leq\rho\right.\right\}
=sup{β[τ,1]|inf0zτh(β,z)ρ}\displaystyle=\sup\left\{\beta\in[\tau,1]\left|\underset{0\leq z\leq\tau}{\inf}h(\beta,z)\leq\rho\right.\right\}
=(ii)sup{β[τ,1]|h(β,τ)ρ},\displaystyle\overset{(\text{ii})}{=}\sup\left\{\beta\in[\tau,1]\left|h(\beta,\tau)\leq\rho\right.\right\},

where (i)(\text{i}) is from the fact that, when β\beta is fixed, h(β,z)h(\beta,z) is continuous in zz and the infimum is achieved; and (ii)(\text{ii}) is implied by the fact that inf0zτh(β,z)=h(β,τ)\underset{0\leq z\leq\tau}{\inf}h(\beta,z)=h(\beta,\tau) when β[τ,1]\beta\in[\tau,1]. ∎
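
Lemma 10 reduces the computation of g_{f,ρ}^{-1}(τ) to a one-dimensional feasibility problem: on [τ,1] the map β ↦ h(β,τ) is non-decreasing (its minimum over β is attained at β=τ), so the supremum of the feasible set can be found by bisection. A minimal sketch under the chi-square generator assumption:

import numpy as np

f = lambda t: (t - 1.0) ** 2           # chi-square generator (assumed)

def h(beta, tau):
    return beta * f(tau / beta) + (1.0 - beta) * f((1.0 - tau) / (1.0 - beta))

def g_inv(tau, rho, iters=60):
    if h(1.0 - 1e-12, tau) <= rho:     # the whole interval [tau, 1] is feasible
        return 1.0
    lo, hi = tau, 1.0                  # invariant: h(lo, tau) <= rho < h(near hi, tau)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid, tau) <= rho else (lo, mid)
    return lo

tau, rho = 0.9, 0.05
beta = g_inv(tau, rho)
print(beta, beta - np.sqrt(rho * beta * (1.0 - beta)))   # the second value should be close to tau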

A.9 Proof of the examples

Proof of Example 1.

For gf,ρg_{f,\rho}, when f(t)=(t1)2f(t)=(t-1)^{2}, we have that, for 0<β<10<\beta<1:

gf,ρ(β)\displaystyle g_{f,\rho}(\beta) =inf{z[0,1]|βf(zβ)+(1β)f(1z1β)ρ}\displaystyle=\inf\left\{z\in[0,1]\left|\beta f\left(\frac{z}{\beta}\right)+(1-\beta)f\left(\frac{1-z}{1-\beta}\right)\leq\rho\right.\right\}
=inf{z[0,1]|β(zβ1)2+(1β)(1z1β1)2ρ}\displaystyle=\inf\left\{z\in[0,1]\left|\beta\left(\frac{z}{\beta}-1\right)^{2}+(1-\beta)\left(\frac{1-z}{1-\beta}-1\right)^{2}\leq\rho\right.\right\}
=inf{z[0,1]|(zβ)2ρβ(1β)}\displaystyle=\inf\left\{z\in[0,1]\left|\left(z-\beta\right)^{2}\leq\rho\beta(1-\beta)\right.\right\}
=(βρβ(1β))+.\displaystyle=\left(\beta-\sqrt{\rho\beta(1-\beta)}\right)_{+}.

When β0\beta\downarrow 0, βf(zβ)=(zββ)2+\beta f\left(\frac{z}{\beta}\right)=\left(\frac{z}{\sqrt{\beta}}-\sqrt{\beta}\right)^{2}\to+\infty unless z=0z=0. By Lemma 14, when ρ>0\rho>0, gf,ρ(β)g_{f,\rho}(\beta) is continuous for β[0,1]\beta\in[0,1], so we have gf,ρ(0)=limβ0gf,ρ(β)=0g_{f,\rho}(0)=\underset{\beta\downarrow 0}{\lim}g_{f,\rho}(\beta)=0. Similarly, we have: gf,ρ(1)=1g_{f,\rho}(1)=1. Since the value of gf,ρg_{f,\rho} on 0,10,1 is consistent with the formula gf,ρ(β)=(βρβ(1β))+g_{f,\rho}(\beta)=\left(\beta-\sqrt{\rho\beta(1-\beta)}\right)_{+}, we conclude that gf,ρ(β)=(βρβ(1β))+g_{f,\rho}(\beta)=\left(\beta-\sqrt{\rho\beta(1-\beta)}\right)_{+} for β[0,1]\beta\in[0,1].

For gf,ρ1g_{f,\rho}^{-1}, we first solve the equation βρβ(1β)=0\beta-\sqrt{\rho\beta(1-\beta)}=0 and get a solution β=ρρ+1\beta^{\star}=\frac{\rho}{\rho+1}. By Lemma 14, gf,ρ(β)g_{f,\rho}(\beta) is non-decreasing and continuous when β[0,1]\beta\in[0,1], so gf,ρ(β)=βρβ(1β)g_{f,\rho}(\beta)=\beta-\sqrt{\rho\beta(1-\beta)} when ββ\beta\geq\beta^{\star}. By the definition of gf,ρ1g_{f,\rho}^{-1}, it’s obvious that βgf,ρ1(τ)\beta^{\star}\leq g_{f,\rho}^{-1}(\tau), so we can compute gf,ρ1(τ)g_{f,\rho}^{-1}(\tau) by solving the following optimization problem:

maxβ s.t. {ρρ+1β1βρβ(1β)τ.\max\beta\text{ s.t. }\left\{\begin{array}[]{l}\frac{\rho}{\rho+1}\leq\beta\leq 1\\ \beta-\sqrt{\rho\beta(1-\beta)}\leq\tau\end{array}\right..
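
For this chi-square case, the constrained problem above can also be solved in closed form: squaring β-√(ρβ(1-β))=τ gives the quadratic (1+ρ)β²-(2τ+ρ)β+τ²=0, whose larger root is the desired supremum. This reformulation is ours and not spelled out in the text, so the sketch below double-checks it numerically by plugging the root back into g_{f,ρ}.

import numpy as np

def g_inv_chi2(tau, rho):
    # larger root of (1 + rho) * beta^2 - (2 * tau + rho) * beta + tau^2 = 0
    disc = rho * (4.0 * tau * (1.0 - tau) + rho)
    return ((2.0 * tau + rho) + np.sqrt(disc)) / (2.0 * (1.0 + rho))

tau, rho = 0.9, 0.05
beta = g_inv_chi2(tau, rho)
print(beta, beta - np.sqrt(rho * beta * (1.0 - beta)))   # the second value should equal tau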

Proof of Example 2.

For gf,ρg_{f,\rho}, when f(t)=12|t1|f(t)=\frac{1}{2}|t-1|, we have that, for 0<β<10<\beta<1:

gf,ρ(β)\displaystyle g_{f,\rho}(\beta) =inf{z[0,1]|βf(zβ)+(1β)f(1z1β)ρ}\displaystyle=\inf\left\{z\in[0,1]\left|\beta f\left(\frac{z}{\beta}\right)+(1-\beta)f\left(\frac{1-z}{1-\beta}\right)\leq\rho\right.\right\}
=inf{z[0,1]|12β|zβ1|+12(1β)|1z1β1|ρ}\displaystyle=\inf\left\{z\in[0,1]\left|\frac{1}{2}\beta\left|\frac{z}{\beta}-1\right|+\frac{1}{2}(1-\beta)\left|\frac{1-z}{1-\beta}-1\right|\leq\rho\right.\right\}
=inf{z[0,1]||zβ|ρ}\displaystyle=\inf\left\{z\in[0,1]\left|\left|z-\beta\right|\leq\rho\right.\right\}
=(βρ)+.\displaystyle=\left(\beta-\rho\right)_{+}.

According to Lemma 14, gf,ρ(β)g_{f,\rho}(\beta) is continuous when β[0,1]\beta\in[0,1], so gf,ρ(β)=(βρ)+g_{f,\rho}(\beta)=\left(\beta-\rho\right)_{+} when β[0,1]\beta\in[0,1].

For gf,ρ1g_{f,\rho}^{-1}, we have:

gf,ρ1(τ)=sup{β[0,1]|gf,ρ(β)τ}=sup{β[0,1]|(βρ)+τ}=min{τ+ρ,1}.g_{f,\rho}^{-1}(\tau)=\sup\left\{\beta\in[0,1]\left|g_{f,\rho}(\beta)\leq\tau\right.\right\}=\sup\left\{\beta\in[0,1]\left|\left(\beta-\rho\right)_{+}\leq\tau\right.\right\}=\min\{\tau+\rho,1\}.
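
The total-variation case yields the simplest expressions of all; as a two-line sketch (the example values are arbitrary):

def g_tv(beta, rho):
    return max(beta - rho, 0.0)          # g_{f,rho}(beta) = (beta - rho)_+

def g_tv_inv(tau, rho):
    return min(tau + rho, 1.0)           # g_{f,rho}^{-1}(tau) = min(tau + rho, 1)

print(g_tv(0.95, 0.1), g_tv_inv(0.9, 0.1))   # 0.85 1.0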

Appendix B B Additional experimental results

In this section, we provide additional experimental results. We provide the results of the average length of the predicted confidence sets in Figures B.1 and B.2, corresponding to the results of the coverage in Figures 2 and 3, respectively.

Figure B.1: The violin plots for the average length of the 10001000 runs under the same data generation settings as in Section 4. We show results for α={0.05,0.1,0.15,0.2,0.25,0.3}\alpha=\{0.05,0.1,0.15,0.2,0.25,0.3\}. The white point represents the median, while the two endpoints of the thick line are the 0.250.25 quantile and the 0.750.75 quantile.
Figure B.2: The violin plots for the average length of the 10001000 runs for the multi-source OOD confidence set prediction task. We show results for α={0.05,0.1,0.15,0.2,0.25,0.3}\alpha=\{0.05,0.1,0.15,0.2,0.25,0.3\}. The white point represents the median, while the two endpoints of the thick line are the 0.250.25 quantile and the 0.750.75 quantile.