
Conformal Prediction with Conditional Guarantees

Isaac Gibbs, John J. Cherian, and Emmanuel J. Candès. Isaac Gibbs and John J. Cherian: Department of Statistics, Stanford University; Emmanuel J. Candès: Departments of Mathematics and Statistics, Stanford University. Address for correspondence: Sequoia Hall, Stanford University, 390 Serra Mall, Stanford, CA 94305, USA. Emails: [email protected], [email protected].
Abstract

We consider the problem of constructing distribution-free prediction sets with finite-sample conditional guarantees. Prior work has shown that it is impossible to provide exact conditional coverage universally in finite samples. Thus, most popular methods only guarantee marginal coverage over the covariates or are restricted to a limited set of conditional targets, e.g. coverage over a finite set of pre-specified subgroups. This paper bridges this gap by defining a spectrum of problems that interpolate between marginal and conditional validity. We motivate these problems by reformulating conditional coverage as coverage over a class of covariate shifts. When the target class of shifts is finite-dimensional, we show how to simultaneously obtain exact finite-sample coverage over all possible shifts. For example, given a collection of subgroups, our prediction sets guarantee coverage over each group. For more flexible, infinite-dimensional classes where exact coverage is impossible, we provide a procedure for quantifying the coverage errors of our algorithm. Moreover, by tuning interpretable hyperparameters, we allow the practitioner to control the size of these errors across shifts of interest. Our methods can be incorporated into existing split conformal inference pipelines, and thus can be used to quantify the uncertainty of modern black-box algorithms without distributional assumptions.

Keywords— conditional coverage, conformal inference, covariate shift, distribution-free prediction, prediction sets, black-box uncertainty quantification.

1 Introduction

Consider a training dataset {(Xi,Yi)}i=1n\{(X_{i},Y_{i})\}_{i=1}^{n}, and a test point (Xn+1,Yn+1)(X_{n+1},Y_{n+1}), all drawn i.i.d. from an unknown, arbitrary distribution PP. We study the problem of using the observed data {(Xi,Yi)}i=1n{Xn+1}\{(X_{i},Y_{i})\}_{i=1}^{n}\cup\{X_{n+1}\} to construct a prediction set C^(Xn+1)\hat{C}(X_{n+1}) that includes Yn+1Y_{n+1} with probability 1α1-\alpha over the randomness in the training and test points.

Ideally, we would like this prediction set to meet three competing goals: it should (1) make no assumptions on the underlying data-generating mechanism, (2) be valid in finite samples, and (3) satisfy the conditional coverage guarantee, (Yn+1C^(Xn+1)Xn+1=x)=1α\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}=x)=1-\alpha. Unfortunately, prior work has shown that it is impossible to obtain all three of these conditions simultaneously (Vovk (2012), Barber et al. (2020)). Perhaps the closest method to achieving these goals is conformal prediction, which relaxes the third criterion to the marginal coverage guarantee, (Yn+1C^(Xn+1))=1α\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1}))=1-\alpha.

[Figure 1 flowchart: an ML model yields accurate point predictions; a conformity score measures prediction errors on a hold-out set; split conformal calibration then gives marginal coverage (Theorem 1), while our conditional calibration gives conditional coverage (Theorems 2 and 3).]
Figure 1: Predicting with finite-sample guarantees: our conditionally valid pipeline vs. split conformal prediction.

The gap between conditional and marginal coverage can be extremely consequential in high-stakes decision-making. Marginal validity does not preclude substantial variation in coverage among relevant subpopulations. For example, a conformal prediction set for predicting a drug candidate’s binding affinity achieves marginal 1α1-\alpha coverage even if it underestimates the predictive uncertainty over the most promising subset of compounds. In human-centered applications, marginally valid prediction sets can be untrustworthy for certain legally protected groups (e.g., those defined by sensitive attributes such as race, gender, age, etc.) (Romano et al. (2020a)).

Achieving practical, finite-sample results requires weakening our desideratum from exact conditional coverage. In this article, we pursue a goal that is motivated by the following equivalence:

(Yn+1C^(Xn+1)Xn+1=x)=1α,for all x\displaystyle\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}=x)=1-\alpha,\quad\text{for all $x$}
\displaystyle\iff
𝔼[f(Xn+1)(𝟏{Yn+1C^(Xn+1)}(1α))]=0,for all measurable f.\displaystyle\mathbb{E}[f(X_{n+1})(\mathbf{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha))]=0,\quad\text{for all measurable $f$}.

In particular, we define a relaxed coverage objective by replacing “all measurable ff” with “all ff belonging to some (potentially infinite) class \mathcal{F}.” At the least complex end, taking ={x1}\mathcal{F}=\{x\mapsto 1\} recovers marginal validity, while richer, intermediate choices interpolate between marginal and conditional coverage.
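To make the group-conditional end of this spectrum concrete, note that for any group $G$ with $\mathbb{P}(X_{n+1}\in G)>0$, taking $f(x)=\mathbbm{1}\{x\in G\}$ reduces the requirement to coverage conditional on that group:

$$\mathbb{E}\big[\mathbbm{1}\{X_{n+1}\in G\}\big(\mathbf{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha)\big)\big]=0\iff\mathbb{P}\big(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G\big)=1-\alpha.$$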

This generic approach to relaxing conditional validity was first popularized by the “conditional-to-marginal” moment testing literature (Andrews and Shi (2013)). Our relaxation is also referred to as a “multi-accuracy” objective in theoretical computer science (Hébert-Johnson et al. (2018), Kim et al. (2019)). We remark that Deng et al. (2023) have concurrently proposed the same objective for the conditional coverage problem. However, their results focus on the infinite data regime, where the distribution of (X,Y)(X,Y) is known exactly, and their algorithm requires access to an unspecified black-box optimization oracle.

By contrast, our proposed method preserves the attractive assumption-free and computationally efficient properties of split conformal prediction. Emulating split conformal, we design our procedure as a wrapper that takes any black-box machine learning model as input. We then compute conformity scores that measure the accuracy of this model’s predictions on new test points. Finally, by calibrating bounds for these scores, we obtain prediction sets with finite-sample conditional guarantees. Figure 1 displays our workflow.

Unlike split conformal, our approach adaptively compensates for poor initial modeling decisions. In particular, if the prediction rule and conformity scores were well-designed at the outset, our procedure may only make small adjustments. This could happen, for instance, if the scores were derived from a well-specified parametric model. More often, however, the user will begin with an inaccurate or incomplete model that fails to fully capture the distribution of YXY\mid X. In these cases, our procedure will improve on split conformal by recalibrating the conformity score to provide exact conditional guarantees. In the predictive inference literature, conformal inference is often described as a protective layer that lies on top of a black-box machine learning model and transforms its point predictions into valid prediction sets. With this in mind, one might view our method as an additional protective layer that lies on top of a conformal method and transforms its (potentially poor) marginally valid outputs into richer, conditionally valid, prediction sets.

A number of prior works have also considered either modifying the split conformal calibration step (Lei and Wasserman (2014), Guan (2022), Barber et al. (2023)) or the initial prediction rule (Romano et al. (2019), Sesia and Romano (2021), Chernozhukov et al. (2021)) to better model the distribution of YXY\mid X. Crucially, despite heuristic improvements in the quality of the resulting prediction sets, all of the aforementioned approaches obtain at most weak asymptotic guarantees that rely on slow, non-parametric convergence rates.

Perhaps the only procedures to obtain a practical coverage guarantee are those of Barber et al. (2020), Vovk et al. (2003), and Jung et al. (2023). All of these methods guarantee a form of group-conditional coverage, i.e. (Yn+1C^(Xn+1)Xn+1G)1α\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G)\geq 1-\alpha for all sets GG in some pre-specified class 𝒢\mathcal{G}. However, the approach of Barber et al. (2020) can be computationally infeasible and severely conservative, yielding wide intervals with coverage probability far above the target level. On the other hand, the method of Vovk et al. (2003), Mondrian conformal prediction, provides exact coverage in finite samples, but does not allow the groups in 𝒢\mathcal{G} to overlap. Finally, Jung et al. (2023) propose running quantile regression over the linear function class {G𝒢βG𝟙{xG}:β|𝒢|}\{\sum_{G\in\mathcal{G}}\beta_{G}\mathbbm{1}\{x\in G\}:\beta\in\mathbb{R}^{|\mathcal{G}|}\}. This method is both practical and allows for overlapping groups. In this work, we propose a new method for achieving conditional coverage that improves upon the method of Jung et al. (2023) in three ways: (1) by providing tighter finite-sample coverage, (2) by requiring no assumptions on the data-generating distribution (in particular, unlike Jung et al. (2023), we allow for discrete outcomes), and (3) by providing conditional coverage guarantees far beyond the group setting.

1.1 Preview of contributions

To motivate and summarize the main contributions of our paper, we preview some applications. In particular, we show how our method can be used to satisfy two popular coverage desiderata: group-conditional coverage and coverage under covariate shift. Additionally, we demonstrate the improved finite sample performance of our method compared to previous approaches.

1.1.1 Group-conditional coverage

Refer to caption
Figure 2: Comparison of split conformal prediction (blue, left-most panel) and the randomized implementation of our method (orange, center panel) on a simulated dataset first considered by Romano et al. (2019). Black curves denote an estimate of the conditional mean, while the blue and orange shaded regions indicate the fitted prediction intervals. For this experiment, our method is implemented using the procedure outlined in Section 2 with :={G𝒢βG𝟙{xG}:β|𝒢|}\mathcal{F}:=\{\sum_{G\in\mathcal{G}}\beta_{G}\mathbbm{1}\{x\in G\}:\beta\in\mathbb{R}^{|\mathcal{G}|}\}. The rightmost panel shows the miscoverage of the two methods marginally over the x-axis and conditionally on x falling in the two grey shaded bands; the red line indicates the target level of α=0.1\alpha=0.1.

Group-conditional coverage requires that C^()\hat{C}(\cdot) satisfy (Yn+1C^(Xn+1)Xn+1G)=1α\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G)=1-\alpha for all GG belonging to some collection of pre-specified (potentially overlapping) groups 𝒢2Domain(X)\mathcal{G}\subseteq 2^{\text{Domain}(X)} (Barber et al. (2020)). This corresponds to a special case of our guarantee in which ={G𝒢βG𝟙{xG}:β|𝒢|}\mathcal{F}=\{\sum_{G\in\mathcal{G}}\beta_{G}\mathbbm{1}\{x\in G\}:\beta\in\mathbb{R}^{|\mathcal{G}|}\}.

Figure 2 illustrates the coverage guarantee on a simulated dataset. Here, xx is univariate and we have taken the groups 𝒢\mathcal{G} to be the collection of all sub-intervals with endpoints belonging to {0,0.5,1,,5}\{0,0.5,1,\dots,5\}. Two of these sub-intervals, [1,2][1,2] and [3,4][3,4], are shaded in grey. We compare two procedures, split conformal prediction and our conditional calibration method. As is standard, we implement split conformal using conformity score S(x,y):=|yμ^(x)|S(x,y):=|y-\hat{\mu}(x)| where μ^(x)\hat{\mu}(x) is an estimate of the conditional mean 𝔼[YX]\mathbb{E}[Y\mid X], while for our method, we take a two-sided approach in which upper and lower bounds on yμ^(x)y-\hat{\mu}(x) are computed separately. We see that while split conformal prediction only provides marginal validity, our method returns prediction sets that are adaptive to the shape of the data and thus obtain exact coverage over all subgroups.
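To make this choice of \mathcal{F} concrete, the short sketch below builds the indicator basis Φ(x)=(𝟙{x∈G})_{G∈𝒢} for the sub-interval groups of Figure 2. The grid construction and the helper name group_features are our own illustration and are not part of the conditionalconformal package.

```python
import numpy as np

# Sub-intervals with endpoints in {0, 0.5, 1, ..., 5}, as in Figure 2.
endpoints = np.arange(0.0, 5.01, 0.5)
groups = [(a, b) for i, a in enumerate(endpoints) for b in endpoints[i + 1:]]

def group_features(x):
    """Map a scalar covariate x to the indicator vector (1{x in G})_{G in groups}."""
    return np.array([float(a <= x <= b) for (a, b) in groups])
```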

Refer to caption
Figure 3: Marginal calibration-conditional miscoverage (left panel) and length (right panel) of quantile regression (green) and the randomized (red) and unrandomized (orange) implementations of our conditional-calibration method on a simulated dataset. All methods are implemented in their two-sided form with conformity score S(x,y)=yS(x,y)=y, i.e. we estimate the α/2\alpha/2 and 1α/21-\alpha/2 quantiles of YXY\mid X separately and define the prediction set to be the values of yy that fall between the two bounds (see Section A.7 for details). Data for this simulation are generated i.i.d. from Yi=Xiw+ϵiY_{i}=X_{i}^{\top}w+\epsilon_{i} where Xi𝒩(0,Id)X_{i}\sim\mathcal{N}(0,I_{d}), ϵi𝒩(0,1)\epsilon_{i}\sim\mathcal{N}(0,1), and wUnif(𝒮d1)w\sim\text{Unif}(\mathcal{S}^{d-1}). We implement both vanilla quantile regression (Jung et al., 2023) and our conditional-calibration methods on the function class :={β0+i=1dβi𝟙{xi>0}:βd}\mathcal{F}:=\{\beta_{0}+\sum_{i=1}^{d}\beta_{i}\mathbbm{1}\{x_{i}>0\}:\beta\in\mathbb{R}^{d}\}. Boxplots show empirical estimates obtained by averaging over 1000 test points for each of 100 calibration datasets. The red line in the left panel indicates the target coverage level of 1α=0.91-\alpha=0.9.

In this paper, we improve upon existing group-conditional coverage results in two crucial aspects: (1) we obtain tighter finite-sample coverage and (2) we make no assumptions on the distribution of (Xi,Yi)(X_{i},Y_{i}) or the overlap of the groups 𝒢\mathcal{G}. Concretely, given an arbitrary finite collection of groups, our randomized conditional calibration method guarantees exact coverage,

(Yn+1C^(Xn+1)Xn+1G)=1α,G𝒢,\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G)=1-\alpha,\ \forall G\in\mathcal{G},

while our unrandomized procedure obeys the inequalities,

1α(Yn+1C^(Xn+1)Xn+1G)1α+|𝒢|(n+1)(XG),G𝒢.1-\alpha\leq\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G)\leq 1-\alpha+\frac{|\mathcal{G}|}{(n+1)\mathbb{P}(X\in G)},\ \forall G\in\mathcal{G}.

Figure 3 shows the improved finite-sample coverage of these methods. For simplicity, this plot only displays the marginal miscoverage. Boxplots in the figure show the estimated distributions of the calibration-conditional miscoverage and length, i.e. the quantities

(Yn+1C^(Xn+1){(Xi,Yi)}i=1n)and𝔼[length(C^(Xn+1)){(Xi,Yi)}i=1n],\mathbb{P}(Y_{n+1}\notin\hat{C}(X_{n+1})\mid\{(X_{i},Y_{i})\}_{i=1}^{n})\qquad\text{and}\qquad\mathbb{E}[\,\text{length}(\hat{C}(X_{n+1}))\mid\{(X_{i},Y_{i})\}_{i=1}^{n}],

as the sample size varies. In agreement with our theory, we find that the unrandomized version of our procedure guarantees conservative coverage, while our randomized variant offers exact coverage regardless of the sample size. On the other hand, the method of Jung et al. (2023), i.e. linear quantile regression over the set of subgroup-inclusion indicator functions, can severely undercover even at what might be considered large sample sizes.

1.1.2 Coverage under covariate shift

Given an appropriate choice of \mathcal{F}, our prediction set also achieves coverage under covariate shift. To define this objective, fix any non-negative function ff and let f\mathbb{P}_{f} denote the setting in which {(Xi,Yi)}i=1n\{(X_{i},Y_{i})\}_{i=1}^{n} is sampled i.i.d. from PP, while (Xn+1,Yn+1)(X_{n+1},Y_{n+1}) is sampled independently from the distribution in which PXP_{X} is “tilted” by ff, i.e.,

Xn+1f(x)𝔼P[f(X)]dPX(x),Yn+1Xn+1PYX.X_{n+1}\sim\frac{f(x)}{\mathbb{E}_{P}[f(X)]}\cdot dP_{X}(x),\quad Y_{n+1}\mid X_{n+1}\sim P_{Y\mid X}.

Then, our method guarantees coverage under f\mathbb{P}_{f} so long as ff\in\mathcal{F}. For example, when \mathcal{F} is a finite-dimensional linear function class, our prediction set satisfies

f(Yn+1C^(Xn+1))=1α,for all non-negative functions f.\displaystyle\mathbb{P}_{f}({Y}_{n+1}\in\hat{C}({X}_{n+1}))=1-\alpha,\qquad\text{for all non-negative functions $f\in\mathcal{F}$}.

A number of previous works also establish coverage under covariate shift. However, these works assume that there is a single covariate shift of interest that is either known a priori (Tibshirani et al. (2019)) or estimated from unlabeled data (Qiu et al. (2022), Yang et al. (2022)). Our method captures this setting as a special case in which \mathcal{F} is chosen to be a singleton. On the other hand, when \mathcal{F} is non-singleton, our guarantee is more general and ensures coverage over all shifts ff\in\mathcal{F} simultaneously.
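As an aside, readers who wish to check this guarantee in simulation need test covariates drawn from the f-tilted distribution. One simple way to produce such draws is rejection sampling; the sketch below is our own illustration (not part of the paper's method) and assumes f is non-negative and bounded above by a known constant f_max.

```python
import numpy as np

def sample_tilted_covariate(sample_px, f, f_max, rng):
    """Draw one X whose density is proportional to f(x) dP_X(x), via rejection sampling.

    sample_px: callable rng -> one draw from P_X
    f:         non-negative tilt function
    f_max:     an upper bound on f over the support of P_X (assumed known)
    """
    while True:
        x = sample_px(rng)
        if rng.uniform() <= f(x) / f_max:
            return x

# Example: tilt a standard normal covariate by a Gaussian bump centered at 1.5.
rng = np.random.default_rng(0)
tilt = lambda x: np.exp(-0.5 * (x - 1.5) ** 2 / 0.2 ** 2)  # bounded above by 1
x_shifted = sample_tilted_covariate(lambda r: r.normal(), tilt, 1.0, rng)
```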

Refer to caption
Figure 4: Comparison of split conformal prediction (blue, left-most panel) and the randomized implementation of our method (orange, center panel) on a simulated dataset first considered by Romano et al. (2019). Black curves denote an estimate of the conditional mean, while the blue and orange shaded regions indicate the fitted prediction intervals. We consider coverage under three scenarios: marginally over the whole x-axis and locally under two Gaussian tilts (denoted by f1f_{1} and f2f_{2} and plotted as grey dotted lines). For this experiment, our method is implemented using the procedure outlined in Section 2 with :={β0+i=15βiwi(x):β6}\mathcal{F}:=\{\beta_{0}+\sum_{i=1}^{5}\beta_{i}w_{i}(x):\beta\in\mathbb{R}^{6}\}, where the wiw_{i} correspond to the Gaussian tilts with parameters (μ,σ){(0.5,1),(1.5,0.2),(2.5,1),(3.5,0.2),(4.5,1)}(\mu,\sigma)\in\{(0.5,1),(1.5,0.2),(2.5,1),(3.5,0.2),(4.5,1)\}. The rightmost panel indicates the miscoverage of both methods under all three settings, with a red line denoting the target level of α=0.1\alpha=0.1.

Figure 4 illustrates a simple example of this guarantee on a synthetic dataset. Once again, the covariate xx is a scalar and the conformity score is taken to be S(x,y)=|μ^(x)y|S(x,y)=|\hat{\mu}(x)-y|; following the previous example, we implement the two-sided version of our method. We consider five covariate shifts in which PXP_{X} is tilted by the Gaussian density fμ,σ(x)=exp(12σ2(xμ)2)f_{\mu,\sigma}(x)=\exp(-\frac{1}{2\sigma^{2}}(x-\mu)^{2}) for (μ,σ){(0.5,1),(1.5,0.2),(2.5,1),(3.5,0.2),(4.5,1)}(\mu,\sigma)\in\{(0.5,1),(1.5,0.2),(2.5,1),(3.5,0.2),(4.5,1)\}. The shifts centered at 1.51.5 and 3.53.5 are plotted in grey and denoted as f1f_{1} and f2f_{2} in the figure. As the left and center panels show, split conformal gives a constant-width prediction band over the entire x-axis, while our method adapts to the shape of the data around the covariate shifts. The right-most panel of the figure validates that this correction is sufficient to expand the marginal coverage guarantee of split conformal inference to exact coverage under all three scenarios: no shift, shift by f1f_{1}, and shift by f2f_{2}.
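For reference, the finite-dimensional class \mathcal{F} used in this experiment can be encoded by a simple feature map consisting of an intercept plus the five Gaussian tilts; the helper name tilt_features below is ours.

```python
import numpy as np

# Gaussian tilt parameters (mu, sigma) from Figure 4.
TILT_PARAMS = [(0.5, 1.0), (1.5, 0.2), (2.5, 1.0), (3.5, 0.2), (4.5, 1.0)]

def tilt_features(x):
    """Feature map Phi(x) spanning the intercept and the five Gaussian tilts f_{mu,sigma}."""
    tilts = [np.exp(-0.5 * (x - mu) ** 2 / sigma ** 2) for mu, sigma in TILT_PARAMS]
    return np.array([1.0] + tilts)
```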

Refer to caption
Figure 5: Demonstration of the unrandomized implementation of our shift-agnostic method on a simulated dataset first considered by Romano et al. (2019). The orange shaded region in the left panel depicts the prediction interval output by our method when \mathcal{F} is chosen to be the Gaussian reproducing kernel Hilbert space given by kernel K(x,y)=exp(12.5|xy|2)K(x,y)=\exp(-12.5|x-y|^{2}) and the hyperparameter λ\lambda is set equal to 0.0050.005 (see Section 3 for details). Hatched and solid bars in the right panel show the estimated coverage returned by our method and the true realized empirical coverage, respectively. Finally, the red line indicates the target level of α=0.1\alpha=0.1.

Extending even further beyond existing approaches, our method can provide guarantees even when no prior information about the shift is available. By fitting over a so-called “universal” function class, e.g., taking \mathcal{F} to be a suitable reproducing kernel Hilbert space, we provide coverage guarantees under any covariate shift. Due to the complexity of these classes, our coverage guarantee is no longer exactly 1α1-\alpha for all tilts ff. Instead, we obtain an accurate estimate of the (improved) finite-sample coverage of our method. For example, as seen in Figure 5, if we run our shift-agnostic method for the two plotted Gaussian tilts, our estimated coverage precisely matches the empirically observed values. Thus, even though we cannot guarantee exact 1α1-\alpha coverage for all shifts in this example, we are still able to accurately report the true performance that users can expect from our method. For more information on how these estimates are computed, we refer the interested reader to Section 3 and, in particular, to (3.4) and Proposition 2.

1.2 Outline

The remainder of this article is structured as follows. In Section 2, we introduce our method and give coverage results for the case in which \mathcal{F} is finite dimensional. These results are expanded on in Section 3, where we consider infinite dimensional classes. Computational difficulties that arise in both the finite and infinite dimensional cases are addressed in Section 4, and an efficient implementation of our method is given. In Section 5, we apply our method to two datasets and show that our approach attains tighter finite-sample conditional coverage and more predictable failure modes than competing alternatives. Finally, we conclude in Section 6 with a discussion of considerations that arise when choosing the function class (and associated regularization) to use in our method.

A Python package, conditionalconformal, implementing our methods is available on PyPI, and notebooks reproducing the experimental results in this paper can be found at github.com/jjcherian/conditional-conformal.

Notation: In this paper, we consider two settings. In the first, we take {(Xi,Yi)}i=1n+1i.i.d.P\{(X_{i},Y_{i})\}_{i=1}^{n+1}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P. In the second, {(Xi,Yi)}i=1ni.i.d.P\{(X_{i},Y_{i})\}_{i=1}^{n}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P, while (Xn+1,Yn+1)(X_{n+1},Y_{n+1}) is sampled independently from the tilted distribution Xn+1(f(x)/𝔼P[f(X)])dPX(x)X_{n+1}\sim(f(x)/\mathbb{E}_{P}[f(X)])\cdot dP_{X}(x) and Yn+1Xn+1PYXY_{n+1}\mid X_{n+1}\sim P_{Y\mid X}. We write \mathbb{P} and 𝔼\mathbb{E} with no subscript when referring to the first scenario, while we use the subscript ff to denote the second. Additionally, note that throughout this article we use (𝒳,𝒴)(\mathcal{X},\mathcal{Y}) to denote the domain of the (X,Y)(X,Y) pairs.

2 Protection against finite dimensional shifts

2.1 Warm-up: marginal coverage

As a starting point to motivate our approach, we show that split conformal prediction is a special case of our method. Before explaining this result in detail, it is useful to first review the details of the split conformal algorithm.

Recall that the conformity score function, S:𝒳×𝒴S:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}, measures how well the prediction of some model at XX “conforms” to the target YY. For instance, given an estimate μ^()\hat{\mu}(\cdot) of 𝔼[YX]\mathbb{E}[Y\mid X], we may take S(,)S(\cdot,\cdot) to be the absolute residual S(x,y)=|yμ^(x)|S(x,y)=|y-\hat{\mu}(x)|. In a typical implementation of split conformal, we would need to split the training data {(Xi,Yi)}i=1n\{(X_{i},Y_{i})\}_{i=1}^{n} into two parts, using one part to train μ^\hat{\mu} and reserving the second part as the calibration set. Because our method provides the same coverage guarantees regardless of the initial choice of S(,)S(\cdot,\cdot), we will not discuss this first step in detail. Instead, we will assume that the conformity score function is fixed, and we are free to use the entire dataset {(Xi,Yi)}i=1n\{(X_{i},Y_{i})\}_{i=1}^{n} to calibrate the scores. In practice, the initial step of fitting the conformity score can be critical for getting a good baseline predictor. For example, in our experiment in Section 5.3, we use a neural network to obtain an initial prediction of the genetic treatment given to cells based on fluorescent microscopy images.

Given a conformity score function and calibration set, the split conformal algorithm outputs the set of values yy for which S(Xn+1,y)S(X_{n+1},y) is sufficiently small, i.e., the set of values yy that conform with the prediction at Xn+1X_{n+1}. The threshold for this prediction set, which we denote by SS^{*}, is set to be the ((n+1)(1α)/n)(\lceil(n+1)\cdot(1-\alpha)\rceil/n)-quantile of the conformity scores evaluated on the calibration set. In summary, the split conformal prediction set is formally defined as

C^split(Xn+1)={y:S(Xn+1,y)S}.\hat{C}_{\text{split}}(X_{n+1})=\{y:S(X_{n+1},y)\leq S^{*}\}. (2.1)
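For readers who prefer code, a minimal sketch of this calibration step is given below; the helper names, the user-supplied score_fn, and the grid of candidate labels y_grid are our own illustrative choices rather than part of the conditionalconformal package.

```python
import numpy as np

def split_conformal_threshold(scores, alpha):
    """S*: the ceil((n+1)(1-alpha))-th smallest calibration score (infinite if that rank exceeds n)."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else np.sort(scores)[k - 1]

def split_conformal_set(x_test, score_fn, threshold, y_grid):
    """Candidate labels whose conformity score at x_test does not exceed S*."""
    return [y for y in y_grid if score_fn(x_test, y) <= threshold]
```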

The standard method for proving the marginal coverage of C^split()\hat{C}_{\text{split}}(\cdot) is to appeal to the exchangeability of the conformity scores. Namely, let S1,,Sn+1S_{1},\dots,S_{n+1} denote the scores S(X1,Y1),,S(Xn+1,Yn+1)S(X_{1},Y_{1}),\dots,S(X_{n+1},Y_{n+1}). Since the (n+1)(n+1)-th conformity score is drawn i.i.d. from the same distribution as the first nn scores, the location of Sn+1S_{n+1} among the order statistics of (S1,,Sn+1)(S_{1},\dots,S_{n+1}) is drawn uniformly at random from the n+1n+1 possible indices. So, recalling that SS^{*} is chosen to be the ((n+1)(1α)/n)(\lceil(n+1)\cdot(1-\alpha)\rceil/n)-quantile, i.e., the smallest order statistic satisfying (Sn+1S)1α\mathbb{P}(S_{n+1}\leq S^{*})\geq 1-\alpha, we arrive at the coverage guarantee (Yn+1C^split(Xn+1))=(Sn+1S)1α\mathbb{P}(Y_{n+1}\in\hat{C}_{\text{split}}(X_{n+1}))=\mathbb{P}(S_{n+1}\leq S^{*})\geq 1-\alpha. The following theorem summarizes the formal consequences of these observations.

Theorem 1 (Romano et al. (2019), Theorem 1, see also Vovk et al. (2005)).

Assume that {(Xi,Yi)}i=1n+1\{(X_{i},Y_{i})\}_{i=1}^{n+1} are independent and identically distributed. Then, the split conformal prediction set (2.1) satisfies,

(Yn+1C^split(Xn+1))1α.\mathbb{P}(Y_{n+1}\in\hat{C}_{\textup{split}}(X_{n+1}))\geq 1-\alpha.

If S(Xn+1,Yn+1)S(X_{n+1},Y_{n+1}) has a continuous distribution, it also holds that

(Yn+1C^split(Xn+1))1α+1n+1.\mathbb{P}(Y_{n+1}\in\hat{C}_{\textup{split}}(X_{n+1}))\leq 1-\alpha+\frac{1}{n+1}.

Our first insight is that this prediction set and marginal coverage guarantee can also be obtained by re-interpreting split conformal as an intercept-only quantile regression. Recall the definition of the “pinball” loss,

α(θ,S):={(1α)(Sθ) if Sθ,α(θS) if S<θ.\displaystyle\ell_{\alpha}(\theta,S):=\begin{cases}(1-\alpha)(S-\theta)\ \text{ if }S\geq\theta,\\ \alpha(\theta-S)\ \text{ if }S<\theta.\end{cases}

It is well-known that minimizing α\ell_{\alpha} over θ\theta will produce a (1α)(1-\alpha)-quantile of the training data, i.e., θ=argminθi=1nα(θ,Si)\theta^{*}=\mathop{\rm argmin}_{\theta\in\mathbb{R}}\sum_{i=1}^{n}\ell_{\alpha}(\theta,S_{i}) is a (1α)(1-\alpha)-quantile of {Si}i=1n\{S_{i}\}_{i=1}^{n} (Koenker and Bassett Jr (1978)).

In our exchangeability proof, recall that we upper bounded Sn+1S_{n+1} not by the (1α)(1-\alpha)-quantile of {Si}i=1n\{S_{i}\}_{i=1}^{n}, but by the ((n+1)(1α)/n)(\lceil(n+1)\cdot(1-\alpha)\rceil/n)-quantile. The latter value was obtained by considering an augmented dataset that included all of the scores S1,,Sn+1S_{1},\dots,S_{n+1}. To similarly account for the (unobserved) conformity score in a quantile regression, we will now fit θ\theta^{*} using a dataset that includes a guess for Sn+1S_{n+1}. Namely, let θ^S\hat{\theta}_{S} be a solution to the quantile regression problem in which we impute SS for the unknown conformity score, i.e.,

θ^S:=argminθ1n+1i=1nα(θ,Si)+1n+1α(θ,S).\hat{\theta}_{S}:=\mathop{\rm argmin}_{\theta\in\mathbb{R}}\frac{1}{n+1}\sum_{i=1}^{n}\ell_{\alpha}(\theta,S_{i})+\frac{1}{n+1}\ell_{\alpha}(\theta,S).

Then, one can verify that

C^split(Xn+1)={y:Sn+1(Xn+1,y)θ^Sn+1(Xn+1,y)},\hat{C}_{\text{split}}(X_{n+1})=\{y:S_{n+1}(X_{n+1},y)\leq\hat{\theta}_{S_{n+1}(X_{n+1},y)}\}, (2.2)

or said more informally, C^split(Xn+1)\hat{C}_{\text{split}}(X_{n+1}) includes any yy such that S(Xn+1,y)S(X_{n+1},y) is smaller than the (1α)(1-\alpha)-quantile of the augmented calibration set {Si}i=1n{S(Xn+1,y)}\{S_{i}\}_{i=1}^{n}\cup\{S(X_{n+1},y)\}. As an aside, we note there is some small subtlety here due to the non-uniqueness of θ^S\hat{\theta}_{S}. To get exact equality in (2.2), one should choose θ^S\hat{\theta}_{S} to be the largest minimizer of the quantile regression. Readers familiar with conformal inference will also recognize this method of imputing a guess for the missing (n+1)(n+1)-th datapoint as a type of full conformal prediction (Vovk et al. (2005)).
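The equivalence (2.2) is straightforward to check numerically. In the sketch below (our own illustration, with a user-supplied score_fn), we take the augmented (1−α)-quantile to be the ⌈(n+1)(1−α)⌉-th smallest element of the augmented score set; comparing S(X_{n+1},y) against this value reproduces the split conformal set (2.1).

```python
import numpy as np

def augmented_quantile(cal_scores, s_guess, alpha):
    """A (1-alpha)-quantile of {S_1,...,S_n, s_guess}: the ceil((n+1)(1-alpha))-th order statistic."""
    aug = np.append(cal_scores, s_guess)
    k = int(np.ceil(len(aug) * (1 - alpha)))
    return np.inf if k > len(aug) else np.sort(aug)[k - 1]

def in_split_set(x_test, y, score_fn, cal_scores, alpha):
    """Membership check in the style of (2.2): impute the candidate's own score, then compare."""
    s = score_fn(x_test, y)
    return s <= augmented_quantile(cal_scores, s, alpha)
```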

Having established that split conformal prediction can be derived via quantile regression, our generalization of this procedure to richer function classes naturally follows. Namely, we will replace the single score threshold θ\theta with a function f(X)f(X) that estimates the conditional quantiles of YXY\mid X. We then prove a generalization of Theorem 1 showing that the resulting prediction set attains a conditional coverage guarantee.

2.2 Finite dimensional classes

Recall our objective:

𝔼[f(Xn+1)(𝟙{Yn+1C^(Xn+1)}(1α))]=0,f.\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha))]=0,\ \forall f\in\mathcal{F}. (2.3)

In the previous section, we constructed a prediction set with marginal coverage, i.e., (2.3) for ={θ:θ}\mathcal{F}=\{\theta:\theta\in\mathbb{R}\}, by fitting an augmented quantile regression over the same function class ={θ:θ}\mathcal{F}=\{\theta:\theta\in\mathbb{R}\}. Here, we generalize this observation to any finite-dimensional linear class.

To formally define our method, let ={Φ()β:βd}\mathcal{F}=\{\Phi(\cdot)^{\top}\beta:\beta\in\mathbb{R}^{d}\} denote the class of linear functions over the basis Φ:𝒳d\Phi:\mathcal{X}\to\mathbb{R}^{d}. Our goal is to construct a C^\hat{C} satisfying (2.3) for this choice of \mathcal{F}. Imitating our re-derivation of split conformal prediction, we define the augmented quantile regression estimate g^S\hat{g}_{S} as

g^S:=argming1n+1i=1nα(g(Xi),Si)+1n+1α(g(Xn+1),S).\hat{g}_{S}:=\mathop{\rm argmin}_{g\in\mathcal{F}}\frac{1}{n+1}\sum_{i=1}^{n}\ell_{\alpha}(g(X_{i}),S_{i})+\frac{1}{n+1}\ell_{\alpha}(g(X_{n+1}),S). (2.4)

Then, we take our prediction set to be

C^(Xn+1):={y:S(Xn+1,y)g^S(Xn+1,y)(Xn+1)}.\hat{C}(X_{n+1}):=\{y:S(X_{n+1},y)\leq\hat{g}_{S(X_{n+1},y)}(X_{n+1})\}. (2.5)

Critically, we emphasize that g^S\hat{g}_{S} is fit using the same function class \mathcal{F} that appears in our coverage target. This fact will be crucial to the theoretical results that follow. To keep our notation clear under this recycling of \mathcal{F}, we will always use gg to denote quantile estimates and ff to denote re-weightings.

Before discussing the coverage properties of this method there are two technical issues that we must address. First, astute readers may have noticed that (2.5) appears to be intractable. Indeed, a naive computation of C^(Xn+1)\hat{C}(X_{n+1}) would require us to compute g^S\hat{g}_{S} for all SS\in\mathbb{R}. In Section 4, we will give an efficient algorithm for computing the prediction set that overcomes this naive approach. To ease exposition we defer the details of this method for now. The second issue that we must address is the non-uniqueness of the estimate g^S\hat{g}_{S}. In all subsequent results of this article, we will assume that g^S\hat{g}_{S} is computed using an algorithm that is invariant under re-orderings of the input data. This assumption is relevant because quantile regression can admit multiple optima; in theory, the selected optimum might systematically depend on the indices of the scores. In practice, this assumption is inconsequential because any commonly used algorithm, e.g., an interior point solver, satisfies this invariance condition.
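To fix ideas before presenting the efficient algorithm of Section 4, the deliberately naive sketch below implements (2.4)-(2.5) over a finite grid of candidate labels using the cvxpy modeling library. The function names, the grid-based membership check, and the solver defaults are our own choices; the sketch refits the quantile regression once per candidate, which is exactly the cost that Section 4 eliminates.

```python
import cvxpy as cp
import numpy as np

def pinball(resid, alpha):
    """Pinball loss ell_alpha applied to residuals S - g(X)."""
    return cp.sum(cp.maximum((1 - alpha) * resid, -alpha * resid))

def fit_augmented_qr(Phi_cal, S_cal, phi_test, s_imputed, alpha):
    """Solve (2.4): quantile regression on the calibration scores plus one imputed test score."""
    beta = cp.Variable(Phi_cal.shape[1])
    loss = pinball(S_cal - Phi_cal @ beta, alpha) + pinball(s_imputed - phi_test @ beta, alpha)
    cp.Problem(cp.Minimize(loss / (len(S_cal) + 1))).solve()
    return beta.value

def prediction_set(x_test, y_grid, score_fn, phi_fn, Phi_cal, S_cal, alpha):
    """Naive construction of (2.5): keep y whenever its own score lies below the refit threshold."""
    included = []
    for y in y_grid:
        s = score_fn(x_test, y)
        beta_hat = fit_augmented_qr(Phi_cal, S_cal, phi_fn(x_test), s, alpha)
        if s <= phi_fn(x_test) @ beta_hat:
            included.append(y)
    return included
```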

With these issues out of the way, we are now ready to state the main result of this section, Theorem 2, which summarizes the coverage properties of (2.5). When interpreting this theorem it may be useful to recall that for non-negative ff, f()\mathbb{P}_{f}(\cdot) denotes the setting in which (X1,Y1),,(Xn,Yn)i.i.d.P(X_{1},Y_{1}),\dots,(X_{n},Y_{n})\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P, while (Xn+1,Yn+1)(X_{n+1},Y_{n+1}) is sampled independently from Xn+1f(x)𝔼P[f(X)]dPX(x)X_{n+1}\sim\frac{f(x)}{\mathbb{E}_{P}[f(X)]}dP_{X}(x) and Yn+1Xn+1PYXY_{n+1}\mid X_{n+1}\sim P_{Y\mid X}.

Theorem 2.

Let ={Φ()β:βd}\mathcal{F}=\{\Phi(\cdot)^{\top}\beta:\beta\in\mathbb{R}^{d}\} denote the class of linear functions over the basis Φ:𝒳d\Phi:\mathcal{X}\to\mathbb{R}^{d}. Then, for any non-negative ff\in\mathcal{F} with 𝔼P[f(X)]>0\mathbb{E}_{P}[f(X)]>0, the prediction set given by (2.5) satisfies

f(Yn+1C^(Xn+1))1α.\mathbb{P}_{f}(Y_{n+1}\in\hat{C}(X_{n+1}))\geq 1-\alpha. (2.6)

On the other hand, if (X1,Y1),,(Xn+1,Yn+1)i.i.d.P(X_{1},Y_{1}),\dots,(X_{n+1},Y_{n+1})\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P and the distribution of SXS\mid X is continuous, then for all ff\in\mathcal{F}, we additionally have the two-sided bound,

|𝔼[f(Xn+1)(𝟙{Yn+1C^(Xn+1)}(1α))]|dn+1𝔼[max1in+1|f(Xi)|].\left|\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha))]\right|\leq\frac{d}{n+1}\mathbb{E}\left[\max_{1\leq i\leq n+1}|f(X_{i})|\right].

This type of two-part result is typical in conformal inference. Namely, while the assumption that the distribution of SXS\mid X is continuous may seem overly restrictive, it is standard in conformal inference that upper bounds require a mild continuity assumption, while lower bounds are fully distribution-free. For example, the canonical coverage guarantee for split conformal described in Theorem 1 also gives separate upper and lower bounds for continuous and discrete data. Notably, in the case of split conformal inference this two-part result can be avoided and replaced by an exact 1α1-\alpha coverage guarantee by randomizing the prediction set. We will show in Section 4 that an analogous result also holds for our method: without any assumptions on the continuity of SXS\mid X, we show that randomizing C^(Xn+1)\hat{C}(X_{n+1}) yields 𝔼[f(Xn+1)(𝟙{Yn+1C^(Xn+1)}(1α))]=0\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha))]=0 for all ff\in\mathcal{F}. Because the randomization scheme we employ leverages the algorithms developed in Section 4, we defer a precise statement of this result for now.

Our next result, Corollary 1, relates the more abstract guarantee of Theorem 2 to the group-conditional coverage example previewed in the introduction.

Corollary 1.

Suppose {(Xi,Yi)}i=1n+1\{(X_{i},Y_{i})\}_{i=1}^{n+1} are independent and identically distributed and the prediction set given by (2.5) is implemented with ={xG𝒢βG𝟙{xG}:βG,G𝒢}\mathcal{F}=\{x\mapsto\sum_{G\in\mathcal{G}}\beta_{G}\mathbbm{1}\{x\in G\}:\beta_{G}\in\mathbb{R},\ \forall G\in\mathcal{G}\} for some finite collection of groups 𝒢2𝒳\mathcal{G}\subseteq 2^{\mathcal{X}}. Then, for any G𝒢G\in\mathcal{G},

(Yn+1C^(Xn+1)Xn+1G)1α.\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G)\geq 1-\alpha.

If the distribution of SXS\mid X is continuous, then we have the matching upper bound,

(Yn+1C^(Xn+1)Xn+1G)1α+|𝒢|(n+1)(Xn+1G).\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G)\leq 1-\alpha+\frac{|\mathcal{G}|}{(n+1)\cdot\mathbb{P}(X_{n+1}\in G)}.

The methods described above only estimate the upper (1α)(1-\alpha)-quantile of the conformity score. If desired, our procedure can also be generalized to give both lower and upper bounds on S(Xn+1,Yn+1)S(X_{n+1},Y_{n+1}). In particular, letting g^Sτ()\hat{g}_{S}^{\tau}(\cdot) denote our estimate of the τ\tau-th quantile, we can define the two-sided prediction set

C^two-sid.(Xn+1):={y:g^Sn+1(Xn+1,y)α/2(Xn+1)Sn+1(Xn+1,y)g^Sn+1(Xn+1,y)1α/2(Xn+1)}.\hat{C}_{\text{two-sid.}}(X_{n+1}):=\{y:\hat{g}_{S_{n+1}(X_{n+1},y)}^{\alpha/2}(X_{n+1})\leq S_{n+1}(X_{n+1},y)\leq\hat{g}_{S_{n+1}(X_{n+1},y)}^{1-\alpha/2}(X_{n+1})\}. (2.7)

As an example of this, Figures 2, 4 and 5 show results from an implementation of our method in which we fit the lower and upper quantiles of yμ^(x)y-\hat{\mu}(x) separately. Because these two-sided prediction sets have coverage properties identical to their one-sided analogues, we will, for simplicity, focus on the one-sided version in the remainder of this article. Readers interested in the two-sided instantiation of our approach should see Section A.7 for additional information about the implementation and formal coverage guarantees of these methods.

We conclude this section by giving a brief proof sketch of Theorem 2, leaving formal details to the Appendix. The main idea is to examine the first order conditions of the quantile regression (2.4) and then exploit the fact that this regression treats the test point identically to the calibration data. This connection between the derivative of the pinball loss and coverage was first made by Jung et al. (2023).

Proof sketch of Theorem 2.

We examine the first order conditions of (2.4). By a direct computation we have that for any ff\in\mathcal{F},

ddϵα(g^S(Xi)+ϵf(Xi),Si)|ϵ=0={(1α)f(Xi), if Si>g^S(Xi),αf(Xi), if Si<g^S(Xi),undefined, if Si=g^S(Xi).\frac{d}{d\epsilon}\ell_{\alpha}(\hat{g}_{S}(X_{i})+\epsilon f(X_{i}),S_{i})\bigg{|}_{\epsilon=0}=\begin{cases}-(1-\alpha)f(X_{i}),\text{ if }S_{i}>\hat{g}_{S}(X_{i}),\\ \alpha f(X_{i}),\text{ if }S_{i}<\hat{g}_{S}(X_{i}),\\ \text{undefined, if }S_{i}=\hat{g}_{S}(X_{i}).\end{cases}

For simplicity, suppose that for all ii, Sig^S(Xi)S_{i}\neq\hat{g}_{S}(X_{i}). This assumption does not hold in general and by adding it here we will obtain the stronger result 𝔼[f(Xn+1)(𝟙{Yn+1C^(Xn+1)}(1α))]=0\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha))]=0. In the full proof of Theorem 2, we remove this simplification and incur an additional error term.

For now, making this assumption gives the first order condition

1n+1i=1n+1αf(Xi)𝟙{Sig^Sn+1(Xi)}(1α)f(Xi)𝟙{Si>g^Sn+1(Xi)}=0\displaystyle\frac{1}{n+1}\sum_{i=1}^{n+1}\alpha f(X_{i})\mathbbm{1}\{S_{i}\leq\hat{g}_{S_{n+1}}(X_{i})\}-(1-\alpha)f(X_{i})\mathbbm{1}\{S_{i}>\hat{g}_{S_{n+1}}(X_{i})\}=0
1n+1i=1n+1f(Xi)(𝟙{Sig^Sn+1(Xi)}(1α))=0.\displaystyle\iff\frac{1}{n+1}\sum_{i=1}^{n+1}f(X_{i})(\mathbbm{1}\{S_{i}\leq\hat{g}_{S_{n+1}}(X_{i})\}-(1-\alpha))=0.

Taking expectations, we arrive at our coverage guarantee

𝔼[f(Xn+1)(𝟙{Yn+1C^(Xn+1)}(1α))]\displaystyle\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha))] =𝔼[f(Xn+1)(𝟙{Sn+1g^Sn+1(Xn+1)}(1α))]\displaystyle=\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{S_{n+1}\leq\hat{g}_{S_{n+1}}(X_{n+1})\}-(1-\alpha))]
=𝔼[1n+1i=1n+1f(Xi)(𝟙{Sig^Sn+1(Xi)}(1α))]\displaystyle=\mathbb{E}[\frac{1}{n+1}\sum_{i=1}^{n+1}f(X_{i})(\mathbbm{1}\{S_{i}\leq\hat{g}_{S_{n+1}}(X_{i})\}-(1-\alpha))]
=0,\displaystyle=0,

where the first equality uses the definition of C^(Xn+1)\hat{C}(X_{n+1}) and the second equality applies the fact that the triples (X1,S1,g^Sn+1(X1)),,(Xn+1,Sn+1,g^Sn+1(Xn+1))(X_{1},S_{1},\hat{g}_{S_{n+1}}(X_{1})),\dots,(X_{n+1},S_{n+1},\hat{g}_{S_{n+1}}(X_{n+1})) are exchangeable.

2.3 Related work

The method proposed above can be briefly summarized as a modified quantile regression procedure in which the new test point is incorporated into the fit. Given the popularity of vanilla quantile regression, one might reasonably ask how this compares to the more standard approach in which one fits

g^qr:=argming1ni=1nα(g(Xi),Si),\hat{g}_{\text{qr}}:=\mathop{\rm argmin}_{g\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\ell_{\alpha}(g(X_{i}),S_{i}),

on the training data and then forms the prediction set,

C^qr(Xn+1):={y:S(Xn+1,y)g^qr(Xn+1)}.\hat{C}_{\text{qr}}(X_{n+1}):=\{y:S(X_{n+1},y)\leq\hat{g}_{\text{qr}}(X_{n+1})\}.
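Written in the same style as the sketch of Section 2.2, this unaugmented baseline looks as follows (again our own illustration rather than the implementation of Jung et al. (2023)); note that a single fitted threshold is reused for every candidate label, so no refitting is needed.

```python
import cvxpy as cp

def fit_vanilla_qr(Phi_cal, S_cal, alpha):
    """Plain quantile regression on the calibration scores only; no imputed test point."""
    beta = cp.Variable(Phi_cal.shape[1])
    resid = S_cal - Phi_cal @ beta
    loss = cp.sum(cp.maximum((1 - alpha) * resid, -alpha * resid)) / len(S_cal)
    cp.Problem(cp.Minimize(loss)).solve()
    return beta.value

def qr_prediction_set(x_test, y_grid, score_fn, phi_fn, beta_hat):
    """C_qr: a single fitted threshold phi(x)^T beta shared by every candidate label."""
    threshold = phi_fn(x_test) @ beta_hat
    return [y for y in y_grid if score_fn(x_test, y) <= threshold]
```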

Jung et al. (2023) analyze this approach in the case where :={i=1dβi𝟙{xGi}:βd}\mathcal{F}:=\{\sum_{i=1}^{d}\beta_{i}\mathbbm{1}\{x\in G_{i}\}:\beta\in\mathbb{R}^{d}\} is the space of linear combinations of subgroup indicator functions. Under appropriate assumptions on the distribution of (Xi,S(Xi,Yi))(X_{i},S(X_{i},Y_{i})), they show that for all groups 1jd1\leq j\leq d and constants δ>0\delta>0, this prediction set satisfies the PAC coverage guarantee,

(|(Yn+1C^qr(Xn+1)Xn+1Gj,{(Xi,Yi)}i=1n)(1α)|O((log(1/δ)+dlog(n)n(Xn+1Gj)2)1/4))1δ.\mathbb{P}\left(\left|\mathbb{P}\left(Y_{n+1}\in\hat{C}_{\text{qr}}(X_{n+1})\mid X_{n+1}\in G_{j},\{(X_{i},Y_{i})\}_{i=1}^{n}\right)-(1-\alpha)\right|\leq O\left(\left(\frac{\log(1/\delta)+d\log(n)}{n\mathbb{P}(X_{n+1}\in G_{j})^{2}}\right)^{1/4}\right)\right)\geq 1-\delta.

On the other hand, in the same setting, Corollary 1 states the following guarantee for our method,

1α(Yn+1C^(Xn+1)Xn+1Gj)1α+d(n+1)(Xn+1Gj).1-\alpha\leq\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G_{j})\leq 1-\alpha+\frac{d}{(n+1)\mathbb{P}(X_{n+1}\in G_{j})}.

The proofs of both of these results are based on the first order conditions of quantile regression, which, as we showed above, can be exploited to guarantee conditional coverage. That said, the approaches differ substantially and the final results are not directly comparable. At a high level, we find that the former result targets a stronger notion of coverage, ensuring concentration conditional on the calibration data, while the latter result has a much faster convergence rate (d/nd/n versus (dlog(n)/n)1/4(d\log(n)/n)^{1/4}).

While these methods are difficult to compare theoretically, a much clearer picture emerges on simulated data. In particular, the results in Figure 3 show that our method is far more robust to small sample sizes or large dimensions. Although we do not provide a PAC guarantee, we find that the coverage of our procedure concentrates tightly at the target level across a wide range of values for d/nd/n. On the other hand, the vanilla quantile regression approach taken in Jung et al. (2023) shows notable undercoverage at moderate dimensions, e.g., d/n{0.05,0.1}d/n\in\{0.05,0.1\}.

3 Extension to infinite dimensional classes

Turning back to our method, we now consider settings in which we do not have a small, finite-dimensional function class of interest. In particular, if we view the coverage target (2.3) as an interpolation between marginal and conditional coverage, then it is natural to ask what guarantees can be provided when \mathcal{F} is a rich, and potentially even infinite dimensional, function class. We know from previous work that exact coverage over an arbitrary infinite dimensional class is impossible (Vovk (2012), Barber et al. (2020)). Thus, just as we relaxed the definition of conditional coverage above, here we will construct prediction sets that satisfy a relaxed version of (2.3).

First, note that we cannot directly implement our method over an infinite dimensional class. Indeed, running quantile regression in dimension dn+1d\geq n+1 will simply interpolate the input data. In our context, this means that every value SS\in\mathbb{R} will satisfy S=g^S(Xn+1)S=\hat{g}_{S}(X_{n+1}) and our method will always output C^(Xn+1)=\hat{C}(X_{n+1})=\mathbb{R}. To circumvent this issue and obtain informative prediction sets, we must add regularization. This leads us to the definition

g^S:=argming1n+1i=1nα(g(Xi),Si)+1n+1α(g(Xn+1),S)+(g),\hat{g}_{S}:=\mathop{\rm argmin}_{g\in\mathcal{F}}\frac{1}{n+1}\sum_{i=1}^{n}\ell_{\alpha}(g(X_{i}),S_{i})+\frac{1}{n+1}\ell_{\alpha}(g(X_{n+1}),S)+\mathcal{R}(g), (3.1)

for some appropriately chosen penalty ()\mathcal{R}(\cdot). Having made this adjustment, we may now proceed identically to the previous section. Namely, we set

C^(Xn+1):={y:S(Xn+1,y)g^S(Xn+1,y)(Xn+1)},\hat{C}(X_{n+1}):=\{y:S(X_{n+1},y)\leq\hat{g}_{S(X_{n+1},y)}(X_{n+1})\}, (3.2)

and by examining the first order conditions of (3.1), we obtain the following generalization of Theorem 2.

Theorem 3.

Let \mathcal{F} be any vector space, and assume that for all f,gf,g\in\mathcal{F}, the derivative of ϵ(g+ϵf)\epsilon\mapsto\mathcal{R}(g+\epsilon f) exists. If ff is non-negative with 𝔼P[f(X)]>0\mathbb{E}_{P}[f(X)]>0, then the prediction set given by (3.2) satisfies the lower bound

f(Yn+1C^(Xn+1))1α1𝔼P[f(X)]𝔼[ddϵ(g^Sn+1+ϵf)|ϵ=0].\mathbb{P}_{f}(Y_{n+1}\in\hat{C}(X_{n+1}))\geq 1-\alpha-\frac{1}{\mathbb{E}_{P}[f(X)]}\mathbb{E}\left[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}\right].

On the other hand, suppose (X1,Y1),,(Xn+1,Yn+1)i.i.d.P(X_{1},Y_{1}),\dots,(X_{n+1},Y_{n+1})\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P. Then, for all ff\in\mathcal{F}, we additionally have the two-sided bound,

𝔼[f(Xn+1)(𝟙{Yn+1C^(Xn+1)}(1α))]=𝔼[ddϵ(g^Sn+1+ϵf)|ϵ=0]+ϵint,\begin{split}\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha))]=-\mathbb{E}\left[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}\right]+\epsilon_{\textup{int}},\end{split} (3.3)

where ϵint\epsilon_{\textup{int}} is an interpolation error term satisfying |ϵint|𝔼[|f(Xi)|𝟙{Si=g^Sn+1(Xi)}]|\epsilon_{\textup{int}}|\leq\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}].

Similar to our results in the previous section, the interpolation term ϵint\epsilon_{\text{int}} can be removed if we allow the prediction set to be randomized (see Section 4 for a precise statement).

To more accurately interpret Theorem 3, we will need to develop additional understanding of the two quantities appearing on the right-hand side of (3.3). The following two sections are devoted to this task for two different choices of \mathcal{F}. At a high level, the results in these sections will show that the interpolation error ϵint\epsilon_{\text{int}} is of negligible size and thus the coverage properties of our method are primarily governed by the derivative term 𝔼[ddϵ(g^Sn+1+ϵf)|ϵ=0]-\mathbb{E}[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)|_{\epsilon=0}]. Informally, we interpret this derivative as providing a quantitative estimate of the difficulty of achieving conditional coverage in the direction ff. More practically, we will see that the derivative can be used to obtain accurate estimates of the coverage properties of our method. Critically, these estimates adapt to both the specific choice of tilt ff and the distribution of (X,Y)(X,Y). Thus, they determine the difficulty of obtaining conditional coverage in a way that is specific to the dataset at hand.

3.1 Specialization to functions in a reproducing kernel Hilbert space

Our first specialization of Theorem 3 is to the case where \mathcal{F} is constructed using functions from a reproducing kernel Hilbert space (RKHS). More precisely, let K:𝒳×𝒳K:\mathcal{X}\times\mathcal{X}\to\mathbb{R} be a positive definite kernel and K\mathcal{F}_{K} denote the associated RKHS with inner product ,K\langle\cdot,\cdot\rangle_{K} and norm K\|\cdot\|_{K}. Let Φ:𝒳d\Phi:\mathcal{X}\to\mathbb{R}^{d} denote any finite dimensional feature representation of 𝒳\mathcal{X}. Then, we consider implementing our method with function class ={fK()+Φ()β:fKK,βd}\mathcal{F}=\{f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta:f_{K}\in\mathcal{F}_{K},\ \beta\in\mathbb{R}^{d}\} and penalty (fK()+Φ()β)=λfKK2\mathcal{R}(f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta)=\lambda\|f_{K}\|_{K}^{2}. Here, λ>0\lambda>0 is a hyperparameter that controls the flexibility of the fit. For now, we take this hyperparameter to be fixed, although later in our practical experiments in Section 5.1, we will choose it by cross-validation.

Some examples of RKHSes that may be of interest include the space of radial basis functions given by K(x,y)=exp(γxy22)K(x,y)=\exp(-\gamma\|x-y\|_{2}^{2}), which allows us to give coverage guarantees over localizations of the covariates, and the polynomial kernel K(x,y)=(xy+c)mK(x,y)=(x^{\top}y+c)^{m} for mm\in\mathbb{N}, c0c\geq 0, which allows us to investigate coverage over smooth polynomial re-weightings. Additional examples and background material on reproducing kernel Hilbert spaces can be found in Paulsen and Raghupathi (2016).
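To give a sense of what the regularized fit looks like in this setting, the sketch below solves the n-sample analogue of (3.1) (the pair (ĝ_{n,K}, β̂_{n}) appearing in Proposition 2 below) for a Gaussian kernel, using the representer parameterization f_K(·)=Σ_j a_j K(·,X_j), under which ‖f_K‖_K² = aᵀKa; the augmented fit simply adds one more pinball term for the imputed test score. The cvxpy-based implementation and the function name fit_kernel_qr are our own.

```python
import cvxpy as cp
import numpy as np

def fit_kernel_qr(X, S, Phi, alpha, lam, gamma):
    """Regularized quantile regression over {f_K + Phi(.)^T beta} with penalty lam * ||f_K||_K^2."""
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)                     # Gaussian kernel Gram matrix
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))      # small jitter for numerical stability
    a, beta = cp.Variable(n), cp.Variable(Phi.shape[1])
    resid = S - (K @ a + Phi @ beta)                  # S_i - g(X_i)
    pinball = cp.sum(cp.maximum((1 - alpha) * resid, -alpha * resid)) / n
    # ||L^T a||^2 approximates a^T K a, i.e. the squared RKHS norm of f_K.
    cp.Problem(cp.Minimize(pinball + lam * cp.sum_squares(L.T @ a))).solve()
    return a.value, beta.value, K
```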

To obtain a coverage guarantee for \mathcal{F}, we must understand the two terms appearing on the right-hand side of (3.3). Let f()=fK()+Φ()βf(\cdot)=f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta denote the re-weighting of interest and g^Sn+1()=g^Sn+1,K()+Φ()β^Sn+1\hat{g}_{S_{n+1}}(\cdot)=\hat{g}_{S_{n+1},K}(\cdot)+\Phi(\cdot)^{\top}\hat{\beta}_{S_{n+1}} denote the fitted quantile estimate. Then, a short calculation shows that 𝔼[ddϵ(g^Sn+1+ϵf)|ϵ=0]=2λ𝔼[g^Sn+1,K,fKK]\mathbb{E}[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)|_{\epsilon=0}]=2\lambda\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]. So, applying Theorem 3, we find that for all non-negative ff\in\mathcal{F} with 𝔼P[f(X)]>0\mathbb{E}_{P}[f(X)]>0,

f(Yn+1C^(Xn+1))1α2λ𝔼[g^Sn+1,K,fKK]𝔼P[f(X)],and f(Yn+1C^(Xn+1))1α2λ𝔼[g^Sn+1,K,fKK]𝔼P[f(X)]+ϵint𝔼P[f(X)].\begin{split}&\mathbb{P}_{f}(Y_{n+1}\in\hat{C}(X_{n+1}))\geq 1-\alpha-2\lambda\frac{\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]}{\mathbb{E}_{P}[f(X)]},\\ \text{and }&\mathbb{P}_{f}(Y_{n+1}\in\hat{C}(X_{n+1}))\leq 1-\alpha-2\lambda\frac{\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]}{\mathbb{E}_{P}[f(X)]}+\frac{\epsilon_{\text{int}}}{\mathbb{E}_{P}[f(X)]}.\end{split} (3.4)

Controlling the interpolation error is more challenging and is done by the following proposition.

Proposition 1.

Assume that (X1,Y1),,(Xn+1,Yn+1)i.i.d.P(X_{1},Y_{1}),\dots,(X_{n+1},Y_{n+1})\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P and that KK is uniformly bounded. Furthermore, suppose (Φ(X),S)(\Phi(X),S) has uniformly upper and lower bounded first three moments (Assumption 1 in the Appendix) and that the distribution of SXS\mid X is continuous with a uniformly bounded density. Then, for any ff\in\mathcal{F},

|ϵint|𝔼P[|f(X)|]O(dlog(n)λn)𝔼[max1in+1|f(Xi)|]𝔼P[|f(X)|].\frac{|\epsilon_{\textup{int}}|}{\mathbb{E}_{P}[|f(X)|]}\leq O\left(\frac{d\log(n)}{\lambda n}\right)\frac{\mathbb{E}\left[\max_{1\leq i\leq n+1}|f(X_{i})|\right]}{\mathbb{E}_{P}[|f(X)|]}.

Critically, the interpolation error term decays to zero at a faster-than-parametric rate. As a result, for even moderately large nn, we expect this term to be of small size. In support of this intuition, we will show an experiment in Section 5.1 in which with sample size n=650n=650, linear dimension d=5d=5, and non-linear hyperparameter λ1\lambda\cong 1, we observe interpolation error ϵint𝔼P[|f(X)|]0.005\frac{\epsilon_{\textup{int}}}{\mathbb{E}_{P}[|f(X)|]}\cong 0.005. Thus, at appropriate sample sizes, the interpolation error has minimal effect on the coverage.

We remark that achieving this faster-than-parametric rate requires technical insights beyond existing tools. Two standard ways to establish interpolation bounds are to exploit either the finite-dimensional character of the model class \mathcal{F} or the algorithmic stability of the fitting procedure. Unfortunately, quantile regression with both kernel and linear components satisfies neither property. To get around this problem, we give a three-part argument in which we first separate out the linear component of the fit by discretizing over β\beta. Then, with β\beta fixed, we are able to exploit known stability results to control the kernel component of the fit (Bousquet and Elisseeff (2002)). Finally, we combine the two previous steps by giving a smoothing argument that shows that the discretization can be extended to the entire function class. This result may be useful in other applications. For instance, the interpolation error determines the derivative of the loss at the empirical minimizer and, thus, may play a key role in central limit theorems for quantile regressors of this type.

Moving away from these technical issues and returning to our coverage guarantee, (3.4), we find that once the interpolation error is removed, the conditional coverage is completely dictated by the inner product between fKf_{K} and g^Sn+1,K\hat{g}_{S_{n+1},K}. Critically, this implies that in the special case where the target re-weighting ff lies completely in the unpenalized part of the function class (i.e. when fK=0f_{K}=0), we have 𝔼[g^Sn+1,K,fKK]=0\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]=0 and thus C^(Xn+1)\hat{C}(X_{n+1}) obtains (nearly) exact coverage under ff. On the other hand, our next proposition shows that when fK0f_{K}\neq 0, we can use a plug-in estimate to accurately estimate 𝔼[g^Sn+1,K,fKK]\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]. Thus, even when exact coverage is impossible, a simple examination of the quantile regression fit is sufficient to determine the degradation in coverage under any re-weighting of interest.

Proposition 2.

Assume that (X1,S1),,(Xn+1,Sn+1)i.i.d.P(X_{1},S_{1}),\dots,(X_{n+1},S_{n+1})\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P and that KK is uniformly bounded. Suppose further that the population loss is locally strongly convex near its minimizer (Assumption 2 in the Appendix) and (Φ(Xi),Si)(\Phi(X_{i}),S_{i}) has uniformly bounded upper and lower first and second order moments (Assumption 3 in the Appendix). Define the nn-sample quantile regression estimate

(g^n,K,β^n):=argmingKK,βd1ni=1nα(gK(Xi)+Φ(Xi)β,Si)+λgKK2,(\hat{g}_{n,K},\hat{\beta}_{n}):=\mathop{\rm argmin}_{g_{K}\in\mathcal{F}_{K},\ \beta\in\mathbb{R}^{d}}\frac{1}{n}\sum_{i=1}^{n}\ell_{\alpha}(g_{K}(X_{i})+\Phi(X_{i})^{\top}\beta,S_{i})+\lambda\|g_{K}\|^{2}_{K},

and for any δ>0\delta>0, let δ:={f()=fK()+Φ()β:fK+β21,𝔼P[|f(X)|]δ}\mathcal{F}_{\delta}:=\{f(\cdot)=f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta\in\mathcal{F}:\|f\|_{K}+\|\beta\|_{2}\leq 1,\ \mathbb{E}_{P}[|f(X)|]\geq\delta\}. Then,

supfδ|2λg^n,K,fKK1ni=1n|f(Xi)|2λ𝔼[g^Sn+1,K,fKK]𝔼P[|f(X)|]|O(dlog(n)n).\sup_{f\in\mathcal{F}_{\delta}}\left|2\lambda\frac{\langle\hat{g}_{n,K},f_{K}\rangle_{K}}{\frac{1}{n}\sum_{i=1}^{n}|f(X_{i})|}-2\lambda\frac{\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]}{\mathbb{E}_{P}[|f(X)|]}\right|\leq O\left(\sqrt{\frac{d\log(n)}{n}}\right). (3.5)
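Under the same representer parameterization used in the sketch earlier in this section, the plug-in quantity on the left-hand side of (3.5) reduces to a few lines of code. Here we additionally assume, purely for illustration, that the kernel component of the tilt is represented on the calibration points as f_K(·)=Σ_j c_j K(·,X_j), so that ⟨ĝ_{n,K},f_K⟩_K = aᵀKc; the function name is ours.

```python
import numpy as np

def coverage_error_estimate(a, beta_f, c, K, Phi, lam):
    """Plug-in estimate 2*lam*<g_hat_{n,K}, f_K>_K / ((1/n) sum_i |f(X_i)|) from (3.5)."""
    inner = a @ K @ c                 # <g_hat_{n,K}, f_K>_K under the representer parameterization
    f_vals = K @ c + Phi @ beta_f     # f = f_K + Phi(.)^T beta_f evaluated at the calibration points
    return 2 * lam * inner / np.mean(np.abs(f_vals))
```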

3.2 Specialization to the class of Lipschitz functions

As a second specialization of Theorem 3, we now aim to provide valid coverage over all sufficiently smooth re-weightings of the data. We will do this by examining the set of all Lipschitz functions on 𝒳\mathcal{X}. Namely, suppose 𝒳p\mathcal{X}\subseteq\mathbb{R}^{p} and define the Lipschitz norm of functions f:𝒳f:\mathcal{X}\to\mathbb{R} as

Lip(f):=supx,y𝒳,xy|f(x)f(y)|xy2.\text{Lip}(f):=\sup_{x,y\in\mathcal{X},\ x\neq y}\frac{|f(x)-f(y)|}{\|x-y\|_{2}}.

Analogous to the previous section, let L:={f:Lip(f)<}\mathcal{F}_{L}:=\{f:\text{Lip}(f)<\infty\}, Φ:𝒳d\Phi:\mathcal{X}\to\mathbb{R}^{d} be any finite dimensional feature representation of 𝒳\mathcal{X}, and consider implementing our method with the function class ={fL()+Φ()β:fLL,βd}\mathcal{F}=\{f_{L}(\cdot)+\Phi(\cdot)^{\top}\beta:f_{L}\in\mathcal{F}_{L},\ \beta\in\mathbb{R}^{d}\} and penalty (fL()+Φ()β)=λLip(fL)\mathcal{R}(f_{L}(\cdot)+\Phi(\cdot)^{\top}\beta)=\lambda\text{Lip}(f_{L}).

The astute reader may notice that the Lipschitz norm is not differentiable and thus Theorem 3 is not directly applicable to this setting. Nevertheless, it is not difficult to show that Theorem 3 can be extended by replacing ddϵ(g^Sn+1+ϵf)\frac{d}{d\epsilon}\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f) with a subgradient. So, after observing that |ϵ(g^Sn+1+ϵf)|ϵ=0|λLip(f)|\partial_{\epsilon}\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)|_{\epsilon=0}|\leq\lambda\text{Lip}(f), we can apply this analogue of Theorem 3 to find that for any non-negative ff\in\mathcal{F} with 𝔼P[f(X)]>0\mathbb{E}_{P}[f(X)]>0,

1αλLip(f)𝔼P[f(X)]f(Yn+1C^(Xn+1))1α+λLip(f)𝔼P[f(X)]+ϵint𝔼P[f(X)].1-\alpha-\lambda\frac{\text{Lip}(f)}{\mathbb{E}_{P}[f(X)]}\leq\mathbb{P}_{f}(Y_{n+1}\in\hat{C}(X_{n+1}))\leq 1-\alpha+\lambda\frac{\text{Lip}(f)}{\mathbb{E}_{P}[f(X)]}+\frac{\epsilon_{\text{int}}}{\mathbb{E}_{P}[f(X)]}.

Control of the interpolation error is handled in the following proposition.

Proposition 3.

Assume that (X1,Y1),,(Xn+1,Yn+1)p×(X_{1},Y_{1}),\dots,(X_{n+1},Y_{n+1})\in\mathbb{R}^{p}\times\mathbb{R} are i.i.d. and that XX, Φ(X)\Phi(X), and SS have bounded domains and uniformly upper and lower bounded first and second moments (Assumption 4 in the Appendix). Furthermore, assume that the distribution of SXS\mid X is continuous with a uniformly bounded density and that Φ()\Phi(\cdot) contains an intercept term. Then for any ff\in\mathcal{F},

|ϵint|𝔼P[|f(X)|]O((plog(n)λnmin{1/2,1/p})12+(dpλ2n)14).\frac{|\epsilon_{\textup{int}}|}{\mathbb{E}_{P}[|f(X)|]}\leq O\left(\left(\frac{p\log(n)}{\lambda n^{\min\{1/2,1/p\}}}\right)^{\frac{1}{2}}+\left(\frac{dp}{\lambda^{2}n}\right)^{\frac{1}{4}}\right).

This result is considerably weaker than our RKHS bound. While our proof for RKHS function classes made careful use of the stability of RKHS fitting (Bousquet and Elisseeff, 2002), here we take a more brute-force approach and directly examine the uniform concentration properties of the number of interpolated points, 1n+1i=1n+1𝟙{Si=gL(Xi)+Φ(Xi)β}\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=g_{L}(X_{i})+\Phi(X_{i})^{\top}\beta\}. We defer a detailed description of this approach to the Appendix. For now, we simply remark that we do not believe this proof technique yields a tight bound and it is possible that significant improvements could be made to Proposition 3 with more careful arguments.

Regardless of the tightness of the bound, we still find that when XX is low-dimensional, the interpolation error will be small and the miscoverage of C^(Xn+1)\hat{C}(X_{n+1}) under ff will be primarily driven by its Lipschitz norm. In light of the impossibility of exact conditional coverage, this result gives a natural interpolation between marginal coverage (in which Lip(f)=Lip(1)=0\text{Lip}(f)=\text{Lip}(1)=0) and conditional coverage (in which Lip(f)\text{Lip}(f) can be arbitrarily large). On the other hand, in moderate to high dimensions the interpolation error term will not be negligible and the coverage can be highly conservative. For this reason, in our real data examples, we will prefer to use RKHS functions for which we have much faster convergence rates.

4 Computing the prediction set

In order to practically implement any of the methods discussed above, we need to be able to efficiently compute C^(Xn+1)={y:S(Xn+1,y)g^S(Xn+1,y)(Xn+1)}\hat{C}(X_{n+1})=\{y:S(X_{n+1},y)\leq\hat{g}_{S(X_{n+1},y)}(X_{n+1})\}. Naively, this recursive definition requires us to fit g^S\hat{g}_{S} for all possible values of SS\in\mathbb{R}. We will now show that by exploiting the monotonicity properties of quantile regression, this naive computation can be overcome and a valid prediction set can be computed efficiently using only a small number of fits.

The main subtlety that we will have to contend with is that g^S\hat{g}_{S} may not be uniquely defined. For example, consider computing the median of the dataset {S1,S2,S3,S4}={1,2,3,4}\{S_{1},S_{2},S_{3},S_{4}\}=\{1,2,3,4\}. It is easy to show that any value in the interval [2,3][2,3] is a valid solution to the median quantile regression minimizeθi=141/2(θ,i)\text{minimize}_{\theta}\sum_{i=1}^{4}\ell_{1/2}(\theta,i). Critically, this means that it is ambiguous whether or not 33 lies below or above the median. More generally, in our context, it can be ambiguous whether or not Sg^SS\leq\hat{g}_{S}. In the earlier sections of this article we have elided such non-uniqueness in the definition of g^S\hat{g}_{S}. We do this because the choice of g^S\hat{g}_{S} is not critical to the theory and, in particular, all sensible definitions will give the same coverage properties (recall that all of our theoretical results go through so long as g^S\hat{g}_{S} is computed using an algorithm that is invariant under re-orderings of the input data). However, while not theoretically relevant, this ambiguity can cause practical issues to arise in the computation.
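
To make this concrete, the following minimal Python sketch (an illustration only; the function and variable names are not part of our method) checks numerically that θ = 2, 2.5, and 3 all attain the same pinball loss on this toy dataset:

import numpy as np

# Pinball (quantile) loss at level tau, summed over the toy scores {1, 2, 3, 4}.
def pinball_loss(theta, scores, tau=0.5):
    r = scores - theta
    return np.sum(np.maximum(tau * r, (tau - 1) * r))

scores = np.array([1.0, 2.0, 3.0, 4.0])
print([pinball_loss(t, scores) for t in (2.0, 2.5, 3.0)])  # identical losses: the median is not unique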

The main insight of this section is that these technical issues can be resolved by re-defining C^(Xn+1)\hat{C}(X_{n+1}) in terms of the dual formulation of the quantile regression. This will give us a new prediction set, C^dual(Xn+1)\hat{C}_{\text{dual}}(X_{n+1}), that can be computed efficiently and satisfies the same coverage guarantees as C^(Xn+1)\hat{C}(X_{n+1}). At a high level, C^dual(Xn+1)\hat{C}_{\text{dual}}(X_{n+1}) is obtained from C^(Xn+1)\hat{C}(X_{n+1}) by removing a small portion of the points yy that lie on the interpolation boundary {y:S(Xn+1,y)=g^S(Xn+1,y)(Xn+1)}\{y:S(X_{n+1},y)=\hat{g}_{S(X_{n+1},y)}(X_{n+1})\}. Thus, one should simply think of C^dual(Xn+1)\hat{C}_{\text{dual}}(X_{n+1}) as a trimming of the original prediction set that removes some extraneous edge cases.

To define our dual optimization more formally, recall that throughout this article we have considered quantile regressions of the form,

minimizeg1n+1i=1nα(g(Xi),Si)+1n+1α(g(Xn+1),S)+(g).\underset{{g\in\mathcal{F}}}{\text{minimize}}\ \frac{1}{n+1}\sum_{i=1}^{n}\ell_{\alpha}(g(X_{i}),S_{i})+\frac{1}{n+1}\ell_{\alpha}(g(X_{n+1}),S)+\mathcal{R}(g).

Instead of directly computing the dual of this program, we first re-formulate this optimization into the identical procedure,

minimizep,qn+1,gi=1n+1(1α)pi+αqi+(n+1)(g),subject toSig(Xi)pi+qi=0,Sg(Xn+1)pn+1+qn+1=0,pi,qi0.\begin{split}\underset{p,q\in\mathbb{R}^{n+1},\ g\in\mathcal{F}}{\text{minimize}}\quad&\sum_{i=1}^{n+1}(1-\alpha)p_{i}+\alpha q_{i}+(n+1)\cdot\mathcal{R}(g),\\ \text{subject to}\quad&S_{i}-g(X_{i})-p_{i}+q_{i}=0,\\ &S-g(X_{n+1})-p_{n+1}+q_{n+1}=0,\\ &p_{i},q_{i}\geq 0.\end{split} (4.1)

Then, after some standard calculations, this yields the desired dual formulation,

maximizeηn+1i=1nηiSi+ηn+1S(η)subject toαηi1α,\begin{split}\underset{\eta\in\mathbb{R}^{n+1}}{\text{maximize}}\quad&\sum_{i=1}^{n}\eta_{i}S_{i}+\eta_{n+1}S-\mathcal{R}^{*}\left(\eta\right)\\ \text{subject to}\quad&-\alpha\leq\eta_{i}\leq 1-\alpha,\end{split} (4.2)

where ()\mathcal{R}^{*}(\cdot) denotes the function (η):=ming{(n+1)(g)i=1n+1ηig(Xi)}\mathcal{R}^{*}(\eta):=-\min_{g\in\mathcal{F}}\,\{(n+1)\mathcal{R}(g)-\sum_{i=1}^{n+1}\eta_{i}g(X_{i})\}; heuristically, we can think of ()\mathcal{R}^{*}(\cdot) as the convex conjugate for ()\mathcal{R}(\cdot).

Crucially, the KKT conditions for (4.1) allow for a more tractable definition of our prediction set. Letting ηS\eta^{S} denote any solution to (4.2) and applying the complementary slackness conditions of this primal-dual pair we find that

ηn+1S{αif S<g^S(Xn+1),[α,1α]if S=g^S(Xn+1),1αif S>g^S(Xn+1).\displaystyle\eta^{S}_{n+1}\in\begin{cases}-\alpha&\text{if $S<\hat{g}_{S}(X_{n+1}),$}\\ [-\alpha,1-\alpha]&\text{if $S=\hat{g}_{S}(X_{n+1}),$}\\ 1-\alpha&\text{if $S>\hat{g}_{S}(X_{n+1}).$}\end{cases}

As a consequence, checking whether ηn+1S<1α\eta_{n+1}^{S}<1-\alpha is nearly equivalent to checking that Sg^S(Xn+1)S\leq\hat{g}_{S}(X_{n+1}), albeit with a minor discrepancy on the interpolation boundary. This enables us to define the efficiently computable prediction set,

C^dual(Xn+1):={y:ηn+1S(Xn+1,y)<1α}.\hat{C}_{\text{dual}}(X_{n+1}):=\{y:\eta_{n+1}^{S(X_{n+1},y)}<1-\alpha\}. (4.3)

In practice, C^dual(Xn+1)\hat{C}_{\text{dual}}(X_{n+1}) can be mildly conservative. If we are willing to allow the prediction set to be randomized then exact coverage can be obtained using the prediction set,

C^dual, rand.:={y:ηn+1S(Xn+1,y)<U},\hat{C}_{\text{dual, rand.}}:=\{y:\eta_{n+1}^{S(X_{n+1},y)}<U\},

where UUnif([α,1α])U\sim\text{Unif}([-\alpha,1-\alpha]) is drawn independent of the data.
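
As an illustration (and not a reference implementation), the sketch below solves (4.2) with an off-the-shelf LP solver in the special case of an unregularized finite-dimensional class \mathcal{F}=\{\Phi(\cdot)^{\top}\beta\}, for which \mathcal{R}^{*}(\eta) reduces to the linear constraint \sum_{i}\eta_{i}\Phi(X_{i})=0, and then applies the membership check (4.3) to a single imputed score; the toy data and names are purely for exposition.

import numpy as np
from scipy.optimize import linprog

def dual_eta_test_point(Phi, cal_scores, s_test, alpha):
    # Dual LP (4.2) for the unregularized class F = {Phi(.)^T beta}:
    # maximize sum_i eta_i S_i + eta_{n+1} S  subject to  Phi^T eta = 0,  -alpha <= eta_i <= 1 - alpha.
    c = -np.append(cal_scores, s_test)                   # linprog minimizes, so negate the objective
    res = linprog(c, A_eq=Phi.T, b_eq=np.zeros(Phi.shape[1]),
                  bounds=[(-alpha, 1 - alpha)] * Phi.shape[0], method="highs")
    return res.x[-1]                                     # eta_{n+1}^S

# Toy usage: an intercept-only feature map (which targets marginal coverage).
rng = np.random.default_rng(0)
cal_scores, alpha = rng.normal(size=200), 0.1
Phi = np.ones((201, 1))                                  # rows Phi(X_i); the last row is the test point
eta = dual_eta_test_point(Phi, cal_scores, s_test=0.5, alpha=alpha)
in_dual_set = eta < 1 - alpha                            # membership check from (4.3)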

Our first result verifies that C^dual(Xn+1)\hat{C}_{\text{dual}}(X_{n+1}) obtains the same coverage guarantees as our non-randomized primal set C^(Xn+1)\hat{C}(X_{n+1}), while C^dual, rand.\hat{C}_{\text{dual, rand.}} realizes exact, non-conservative coverage.

Proposition 4.

Assume that the primal-dual pair (4.1)-(4.2) satisfies strong duality and the dual solutions {ηS}S\{\eta^{S}\}_{S\in\mathbb{R}} are computed using an algorithm that is symmetric in the input data. Then,

  1. 1.

    The statement of Theorem 3 is valid with C^(Xn+1)\hat{C}(X_{n+1}) replaced by C^dual(Xn+1)\hat{C}_{\textup{dual}}(X_{n+1}),

  2. 2.

    The statement of Theorem 3 is valid with C^(Xn+1)\hat{C}(X_{n+1}) replaced by C^dual, rand.(Xn+1)\hat{C}_{\textup{dual, rand.}}(X_{n+1}) and ϵint\epsilon_{\text{int}} set to 0.

The assumption that the primal-dual pair satisfies strong duality is very minor and we verify in Section A.1 that it holds for all the function classes and penalties considered in this article. Moreover, we also verify in Section A.1 that for all the function classes and penalties considered in this article, ()\mathcal{R}^{*}(\cdot) is tractable and thus solutions to (4.2) can be computed efficiently.

The main result of this section is Theorem 4, which states that Sηn+1SS\mapsto\eta^{S}_{n+1} is non-decreasing. Critically, this implies that membership in the dual prediction set (4.3) is monotone in the imputed score S(Xn+1,y)S(X_{n+1},y).

Theorem 4.

For all maximizers {ηn+1S}S\{\eta^{S}_{n+1}\}_{S\in\mathbb{R}} of (4.2), Sηn+1SS\mapsto\eta^{S}_{n+1} is non-decreasing in SS.

Leveraging Theorem 4, we may compute C^dual(Xn+1)\hat{C}_{\text{dual}}(X_{n+1}) (or C^dual, rand.(Xn+1)\hat{C}_{\text{dual, rand.}}(X_{n+1})) using the following two-step procedure. First, we identify the largest value of SS such that ηn+1S<1α\eta^{S}_{n+1}<1-\alpha (or ηn+1S<U\eta^{S}_{n+1}<U). Second, denoting this upper bound by Sn+1S^{*}_{n+1}, we output all yy such that S(Xn+1,y)Sn+1S(X_{n+1},y)\leq S^{*}_{n+1}. The second step is straightforward for all commonly used conformity scores. For example, if S(Xn+1,y)=|μ^(Xn+1)y|S(X_{n+1},y)=|\hat{\mu}(X_{n+1})-y|, the prediction set becomes μ^(Xn+1)±Sn+1\hat{\mu}(X_{n+1})\pm S^{*}_{n+1}.

The monotonicity of the dual variable in SS allows for more than one approach to the first step of this procedure. If we have no additional information about the structure of the optimization problem over η\eta, it is always possible to run a binary search over SS to find the largest value such that ηn+1S\eta^{S}_{n+1} is less than the targeted cutoff. On the other hand, for the typical use-case of this method, i.e., when we fit an unregularized quantile regression over a finite-dimensional function class, it is considerably more efficient to compute this cutoff for SS by applying standard tools from linear program sensitivity analysis. We defer a detailed description of our approach to Algorithm 2.

Figure 6: Evaluation of the computational efficiency of our prediction set construction. The left panel displays the time taken to fit a single test point, while the right panel shows the ratio between the time taken by our method versus the cost of fitting a single quantile regression per test point. Both plots display results for the unrandomized version of our conditional calibration method implemented in its one-sided form with conformity score S(x,y)=yS(x,y)=y and 1α=0.91-\alpha=0.9, i.e., we estimate the 0.90.9-quantile of YXY\mid X. Data for this simulation is generated from the Gaussian linear model, Yi=Xiw+ϵiY_{i}=X_{i}^{\top}w+\epsilon_{i} where Xi𝒩(0,Id)X_{i}\sim\mathcal{N}(0,I_{d}), ϵi𝒩(0,1)\epsilon_{i}\sim\mathcal{N}(0,1), and wUnif(𝒮d1)w\sim\text{Unif}(\mathcal{S}^{d-1}), and our method is implemented using the linear function class :={β0+Xβ1:β0,β1d}\mathcal{F}:=\{\beta_{0}+X^{\top}\beta_{1}:\beta_{0}\in\mathbb{R},\ \beta_{1}\in\mathbb{R}^{d}\}. Dots and error bars show means and confidence intervals from 25 trials.

Figure 6 displays the computational efficiency of this sensitivity analysis method on a 2020 MacBook Pro with an Intel Core i5 processor and 16GB of RAM. The left panel of the figure displays the estimated time to construct one prediction set over varying function classes and calibration set sizes. We find that our method is computationally efficient, and that its time complexity depends much more on the dimension than the sample size. In general, the bulk of the computational cost consists of fitting a single quantile regression on the calibration set. From there, predictions for new test points are obtained by an efficient procedure for updating the quantile fit (see Algorithm 2 in the Appendix for details). The computational benefit of this approach is displayed in the right panel of Figure 6, which compares the cost of running our procedure against the time it would take to run a single quantile regression for each test point. We see that our updating procedure is substantially faster across a wide range of calibration set sizes and function class dimensions.

5 Real data experiments

5.1 Communities and crime data

We now illustrate our methods on two real datasets. For our first experiment, we consider the Communities and Crime dataset (Dua and Graff (2017), Redmond and Baveja (2002)). In this task, the goal is to use the demographic features of a community to predict its per capita violent crime rate. To make our analysis as interpretable as possible, we use only a small subset of the covariates in this dataset: population size, unemployment rate, median income, racial make-up (given as four columns indicating the percentage of the population that is Black, White, Hispanic, and Asian), and age demographics (given as two columns indicating the percentage of people in the 12-21 and 65+ age brackets).

Our goal in this section is not to give the best possible prediction intervals. Instead, we perform a simple expository analysis that is designed to demonstrate some possible use cases of our method. For this purpose, we assume that the practitioner’s primary concern is that standard prediction sets will provide unequal coverage across communities with differing racial make-ups. Additionally, we suppose that the practitioner does not have any particular predilections for achieving coverage conditional on age, unemployment rate, or median income. We encode these preferences as follows: let Φ(Xi)\Phi(X_{i}) denote the length five vector consisting of an intercept term and the four racial features and define K\mathcal{F}_{K} to be the RKHS given by the Gaussian kernel K(Xi,Xj)=exp(4XiXj22)K(X_{i},X_{j})=\exp(-4\|X_{i}-X_{j}\|_{2}^{2}) (note that since all variables in this dataset have been previously normalized to lie in [0,1][0,1], we do not make any further modifications before computing the kernel). Then, proceeding exactly as in Section 3.1 we run our method with the function class :={fK()+Φ()β:fKK,β5}\mathcal{F}:=\{f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta:f_{K}\in\mathcal{F}_{K},\ \beta\in\mathbb{R}^{5}\} and penalty (fK()+Φ()β):=λfKK2\mathcal{R}(f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta):=\lambda\|f_{K}\|^{2}_{K}. To get λ\lambda, we run cross-validation on the calibration set. While this does not strictly follow the theory of Section 3, our results indicate that choosing λ\lambda in this manner does not negatively impact the coverage properties of our method in practice (see Figures 9 and 13 below). Finally, we set the conformity scores to be the absolute residuals of a linear regression of YY on XX and the target coverage level to be 1α=0.91-\alpha=0.9. The sizes of the training, calibration, and test sets are taken to be 650, 650, and 694, respectively and we use the unrandomized prediction set, C^dual\hat{C}_{\text{dual}} throughout.

Figure 7: Empirical miscoverage achieved by split conformal inference (blue) and the unrandomized version of our method (orange) both marginally over all data points and across linear re-weightings of the four racial categories. The red line shows the target level of α=0.1\alpha=0.1 and the error bars indicate 95% confidence intervals for the value of (Yn+1C^(Xn+1))\mathbb{P}(Y_{n+1}\notin\hat{C}(X_{n+1})) obtained by averaging over 200 train-calibration-test splits.

Figure 7 shows the empirical miscoverages obtained by our algorithm and split conformal prediction under the linearly re-weighted distributions f\mathbb{P}_{f}, for f{x1,xx%Black,xx%White,xx%Hispanic,xx%Asian}f\in\{x\mapsto 1,\ x\mapsto x_{\rm\%Black},\ x\mapsto x_{\rm\%White},\ x\mapsto x_{\rm\%Hispanic},\ x\mapsto x_{\rm\%Asian}\}. As expected, our method obtains the desired coverage level under all five settings, while split conformal is only able to deliver marginal validity.

Figure 8: Empirical miscoverage achieved by split conformal prediction (blue) and the unrandomized version of our method (orange) across high racial representation subgroups. Panels from left to right show the miscoverage when the percentage of the community that falls into a racial group is in the top pp-percentile, for p{50,70,90}p\in\{50,70,90\}. To conserve space, the short forms B, W, A, and H are used to indicate the racial categories Black, White, Asian, and Hispanic, respectively. Red lines show the target level of α=0.1\alpha=0.1 and the black error bars indicate 95% confidence intervals for the value of (Yn+1C^(Xn+1))\mathbb{P}(Y_{n+1}\notin\hat{C}(X_{n+1})) obtained by averaging over 200 train-calibration-test splits.

In practice, the user is unlikely to only care about the performance under linear re-weightings. For example, they may also want to have prediction sets that are accurate on the communities with the highest representation of a particular racial group, e.g. the communities whose percentage of the population that is Black is in the top 90th percentile. Strictly speaking, our method will only provide a guarantee on these high racial representation groups if the corresponding subgroup indicators are included in \mathcal{F}. However, intuitively, even without explicit indicators, linear re-weightings should already push the method to accurately cover communities with high racial representation. Thus, we may expect that even without adjusting \mathcal{F}, the method implemented above will already perform well on these groups. To investigate this, Figure 8 shows the miscoverage of our method across high racial representation subgroups. Formally, we say that a community has high representation of a particular racial group if the percentage of the community that falls in that group is in the top pp-percentile. The three panels of the figure then show results for p=50, 70, and 90, respectively. We find that even without any explicit indicators for the subgroups, our method is able to correct the errors of split conformal prediction and provide improved coverage in all settings.

Figure 9: Estimated coverages for re-weightings of the data according to a Gaussian kernel centered at each of the plotted data points. The bottom-left panel shows the estimated coverages output by our method, while the top panel gives level curves of the kernel at two specific points highlighted in red. Finally, the bottom-right panel compares boxplots of the empirical (solid) and estimated (striped) values of fx(Yn+1C^(Xn+1))\mathbb{P}_{f^{x}}(Y_{n+1}\notin\hat{C}(X_{n+1})) obtained over 200 train–calibration-test splits for the two highlighted values of xx. The red line denotes the target level of α=0.1\alpha=0.1.

In addition to providing exact coverage over racial re-weightings, our method also provides a simple procedure for evaluating the coverage across the other, more flexibly fit, covariates. More precisely, let

(β^n,g^n,K)=argminβd,gKK1ni=1nα(gK(Xi)+Φ(Xi)β,Si)+λgKK2,(\hat{\beta}_{n},\hat{g}_{n,K})=\mathop{\rm argmin}_{\beta\in\mathbb{R}^{d},g_{K}\in\mathcal{F}_{K}}\frac{1}{n}\sum_{i=1}^{n}\ell_{\alpha}(g_{K}(X_{i})+\Phi(X_{i})^{\top}\beta,S_{i})+\lambda\|g_{K}\|^{2}_{K},

denote the function fit on the calibration data and recall that by the results of Section 3 we expect that for any non-negative re-weighting f()=fK()+Φ()βf(\cdot)=f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta,

f(Yn+1C^(Xn+1))1α2λg^n,K,fK1ni=1nf(Xi).\mathbb{P}_{f}(Y_{n+1}\in\hat{C}(X_{n+1}))\cong 1-\alpha-2\lambda\frac{\langle\hat{g}_{n,K},f_{K}\rangle}{\frac{1}{n}\sum_{i=1}^{n}f(X_{i})}. (5.1)

For the Gaussian RKHS, a natural set of non-negative weight functions are the local re-weightings fx(y):=K(x,y)=exp(4xy2)f^{x}(y):=K(x,y)=\exp(-4\|x-y\|^{2}), which emphasize coverage in a neighbourhood around the fixed point xx.

The bottom-left panel of Figure 9 plots the values of 1α2λg^n,K,fx1ni=1nfx(Xi)1-\alpha-2\lambda\frac{\langle\hat{g}_{n,K},f^{x}\rangle}{\frac{1}{n}\sum_{i=1}^{n}f^{x}(X_{i})} for all points xx appearing in the training and calibration sets. We see immediately that our prediction sets will undercover older communities and overcover communities with high median incomes. To aid in the interpretation of this result, the top panel of the figure indicates the level curves of K(x,)K(x,\cdot) for two specific choices of xx. Finally, the bottom-right panel compares the estimates (5.1) to the realized empirical coverages

i=1694fx(X~i)1694j=1694fx(X~j)𝟙{Y~iC^(X~i)},\sum_{i=1}^{694}\frac{f^{x}(\tilde{X}_{i})}{\frac{1}{694}\sum_{j=1}^{694}f^{x}(\tilde{X}_{j})}\mathbbm{1}\{\tilde{Y}_{i}\in\hat{C}(\tilde{X}_{i})\},

for the same two values of xx. Here, {(X~i,Y~i)}i=1694\{(\tilde{X}_{i},\tilde{Y}_{i})\}_{i=1}^{694} denotes the test set. We see that as expected (5.1) is a highly accurate estimate of the true coverage at both values of xx. To further understand the degree of localization in these plots, it may be useful to note that this re-weighting yields an effective sample size of (i=1694fx(X~i))2i=1694fx(X~i)2275\frac{(\sum_{i=1}^{694}f^{x}(\tilde{X}_{i}))^{2}}{\sum_{i=1}^{694}f^{x}(\tilde{X}_{i})^{2}}\cong 275 at the two red points.
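
For readers who wish to reproduce this type of diagnostic, a minimal sketch of the two computations above is given below; it assumes access to the fitted kernel coefficients \hat{\gamma} (so that \hat{g}_{n,K}(\cdot)=\sum_{i=1}^{n}\hat{\gamma}_{i}K(X_{i},\cdot)) and uses the reproducing property \langle\hat{g}_{n,K},f^{x}\rangle_{K}=\hat{g}_{n,K}(x). All names are illustrative.

import numpy as np

def gaussian_kernel(X, x, scale=4.0):
    # K(X_i, x) = exp(-scale * ||X_i - x||_2^2), the kernel used in this section
    return np.exp(-scale * np.sum((X - x) ** 2, axis=-1))

def estimated_local_coverage(x, X_cal, gamma_hat, lam, alpha=0.1):
    # Estimate (5.1) for the local re-weighting f^x = K(x, .)
    k_x = gaussian_kernel(X_cal, x)
    return 1 - alpha - 2 * lam * (gamma_hat @ k_x) / np.mean(k_x)

def effective_sample_size(x, X_test):
    # (sum_i f^x(X_i))^2 / sum_i f^x(X_i)^2 under the same re-weighting
    w = gaussian_kernel(X_test, x)
    return np.sum(w) ** 2 / np.sum(w ** 2)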

Overall, we find that our procedure provides the user with a highly accurate picture of the coverage properties of their prediction sets. In many practical settings, plots like the bottom-left panel of Figure 9 may prompt practitioners to adjust the quantile regression to protect against observed directions of miscoverage. While such an adjustment will not strictly follow the theory of Sections 2 and 3, so long as the user is careful not to run the procedure so many times as to induce direct over-fitting to the observed miscoverage, small adjustments will likely be permissible. In practice, this type of exploratory analysis may allow practitioners to discover important patterns in their data and tune the prediction sets to reflect their coverage needs.

5.2 Comparison against existing methods

A variety of alternative methods for obtaining conditional coverage in conformal inference have been proposed in the literature. Many of these methods are designed to asymptotically achieve exact conditional coverage, i.e., (Yn+1C^(Xn+1)Xn+1)1α\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1})\stackrel{{\scriptstyle\mathbb{P}}}{{\longrightarrow}}1-\alpha. Here, we compare against two such approaches and demonstrate why the more precise finite-sample guarantees of our method may be preferable.

5.2.1 Comparison against conformalized quantile regression

Our first comparison is to the conformalized quantile regression (CQR) method of Romano et al. (2019). They consider a version of split conformal inference in which the training set is used to fit estimates of the conditional quantiles of YXY\mid X and the calibration set is used to adjust these estimates to guarantee marginal coverage. Here, we implement a two-sided version of their procedure in which the upper and lower quantiles are calibrated separately. In particular, let q^τ(X)\hat{q}_{\tau}(X) denote an estimate of the τ\tau-quantile of YXY\mid X. Let Sτ(X,Y):=Yq^τ(X)S_{\tau}(X,Y):=Y-\hat{q}_{\tau}(X) be a conformity score measuring the distance between YY and the quantile estimate, and let cτc_{\tau} denote the τ\tau-quantile of the calibration scores {Sτ(Xi,Yi)}i=1n\{S_{\tau}(X_{i},Y_{i})\}_{i=1}^{n}. Then, we define the two-sided CQR prediction set as

C^CQR(Xn+1):={y:Sα/2(Xn+1,y)>cα/2,S1α/2(Xn+1,y)c1α/2}.\hat{C}_{CQR}(X_{n+1}):=\{y:S_{\alpha/2}(X_{n+1},y)>c_{\alpha/2},\ S_{1-\alpha/2}(X_{n+1},y)\leq c_{1-\alpha/2}\}. (5.2)

As alluded to above, if q^τ(X)\hat{q}_{\tau}(X) is a consistent estimate of the true conditional quantile function of YXY\mid X, then this set will asymptotically achieve exact conditional coverage (Sesia and Candès, 2020).
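
For reference, a minimal sketch of the separately calibrated interval in (5.2) is given below. It uses a plain empirical quantile for cτc_{\tau} and omits the exact finite-sample index convention, so it should be read as an illustration rather than a fully calibrated implementation; all names are ours.

import numpy as np

def two_sided_cqr_interval(q_lo_test, q_hi_test, q_lo_cal, q_hi_cal, y_cal, alpha):
    # Signed scores S_tau(X, Y) = Y - q_hat_tau(X) on the calibration set
    c_lo = np.quantile(y_cal - q_lo_cal, alpha / 2)        # lower correction c_{alpha/2}
    c_hi = np.quantile(y_cal - q_hi_cal, 1 - alpha / 2)    # upper correction c_{1-alpha/2}
    # Interval {y : S_{alpha/2}(x, y) > c_{alpha/2}, S_{1-alpha/2}(x, y) <= c_{1-alpha/2}}
    return q_lo_test + c_lo, q_hi_test + c_hi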

It is important to note that our approach is not in direct competition with CQR and both methods can be used in combination. Namely, here we will consider an implementation of our procedure in which the quantiles cα/2c_{\alpha/2} and c1α/2c_{1-\alpha/2} are replaced by adaptive estimates from our method. Implementing two-sided fitting requires minor modifications of the dual formulation of our prediction set and we refer the reader to Section A.7 for details.

To compare these methods we once again employ the Communities and Crime dataset. We consider three baseline approaches: 1) split conformal prediction with linear regression and the residual conformity score, S(X,Y):=Yγ^0Xγ^1S(X,Y):=Y-\hat{\gamma}_{0}-X^{\top}\hat{\gamma}_{1} for (γ^0,γ^1)(\hat{\gamma}_{0},\hat{\gamma}_{1}) obtained by ordinary least squares, 2) split conformal prediction implemented using the CQR procedure of Romano et al. (2019) where the estimates q^τ(X)\hat{q}_{\tau}(X) are obtained using linear quantile regression, 3) the same procedure but with q^τ(X)\hat{q}_{\tau}(X) obtained using a quantile random forest. In all three cases we calibrate the upper and lower quantiles separately using either the formulation in (5.2) or, in the case of ordinary least squares, by estimating the upper and lower quantiles of the residuals separately. We compare each of these baseline approaches against implementations in which the split conformal calibration step is replaced by our method with function class :={β0+Xβ1:β0,β1d}\mathcal{F}:=\{\beta_{0}+X^{\top}\beta_{1}:\beta_{0}\in\mathbb{R},\ \beta_{1}\in\mathbb{R}^{d}\}, defined as the set of affine functions over the covariates from the previous section (i.e. population size, unemployment rate, median income, as well as a set of race- and age-based features).

Figure 10: Comparison of split conformal inference (blue) against the unrandomized (orange) and randomized (red) implementations of our conditional calibration method. All three procedures are implemented with three different conformity scores derived from ordinary least squares (OLS), linear quantile regression (QR), and quantile random forests (QRF). Panels from left to right show the average marginal miscoverage, largest and smallest miscoverages over linear re-weightings, and prediction set lengths. Bars in the figure display averages over 200 trials, where in each trial the points in the Communities and Crime dataset are divided uniformly at random into a training set of size 650, a calibration set of size 650, and a test set of size 694.

Figure 10 displays the results of this experiment. We find that all implementations of both split conformal and our randomized method achieve exact marginal coverage, while our unrandomized variant can slightly overcover (left panel). To evaluate the conditional coverage, the center two panels of the figure display estimates of the largest over- and undercoverages obtained across linear re-weightings of the covariate space. Namely, letting Z=(1,X)Z=(1,X) denote the augmented covariates, we estimate the smallest and largest values of

𝔼[Zn+1,j𝟙{Yn+1C^(Xn+1)}]𝔼[Zn+1,j],\frac{\mathbb{E}[Z_{n+1,j}\mathbbm{1}\{Y_{n+1}\notin\hat{C}(X_{n+1})\}]}{\mathbb{E}[Z_{n+1,j}]},

over all features jj (note that here all the features are non-negative). As expected from our theory, the unrandomized variant of our method always slightly overcovers, while the randomized version obtains exact coverage throughout. On the other hand, while CQR asymptotically achieves exact coverage in this setting, our simulations show deviations from the target level using both linear quantile regression and quantile forests. Importantly, we note that while the coverage deviations of linear quantile regression are relatively small here, its performance can worsen significantly in higher dimensions (see Figure 3). Finally, the rightmost panel of the figure shows the lengths of the prediction sets returned by each method. We find that all of the prediction sets are of a similar size with the exception of the OLS implementation of split conformal prediction, whose prediction sets are relatively wide.

5.2.2 Comparison against localized conformal prediction

Our second comparison is to the localized conformal prediction method of Guan (2022). In this procedure, a localized kernel is used to re-weight the calibration data and focus on the data whose covariates are most similar to the test point. The exact method for accomplishing this is somewhat technical and we refer the reader to Theorem 3.2 of Guan (2022) for details. Similar to conformalized quantile regression, localized conformal prediction guarantees exact marginal coverage in finite samples and asymptotic conditional coverage under appropriate choices of the kernel.

To compare localized conformal prediction against our approach, we once again employ the Communities and Crime dataset. We consider the Gaussian kernel K(x1,x2):=exp(4x1x222)K(x_{1},x_{2}):=\exp(-4\|x_{1}-x_{2}\|_{2}^{2}) and take the conformity score S(X,Y):=|Yγ^0Xγ^|S(X,Y):=|Y-\hat{\gamma}_{0}-X^{\top}\hat{\gamma}| to be the absolute values of the residuals from a linear regression. We then compare two methods: localized conformal prediction, and our method implemented with the function class given by the same kernel. More specifically, we implement our method with function class :={β0+fK(X):β0,fKK}\mathcal{F}:=\{\beta_{0}+f_{K}(X):\beta_{0}\in\mathbb{R},f_{K}\in\mathcal{F}_{K}\} and penalty (fK)=λfKK2\mathcal{R}(f_{K})=\lambda\|f_{K}\|_{K}^{2}, where λ\lambda is chosen by cross-validation.

Figure 11: Comparison of the coverage properties of localized conformal prediction (purple) and our randomized conditional calibration method (red). The solid boxplots from left to right show the distributions of the calibration-conditional marginal coverage and kernel re-weighted coverage according to the shifts considered in Figure 9, while hatched boxes display the corresponding coverage estimates obtained using the method of Proposition 2. The feature space employed here is the same as in the previous sections and boxplots in the figure once again show results from 200 trials each containing 650 training points, 650 calibration points, and 694 test points.

The results of this experiment are shown in Figure 11. We find that, as expected, both localized conformal prediction and the randomized version of our method achieve exact marginal coverage. To evaluate their conditional properties, we compare the coverages obtained under the two kernel re-weightings from the previous section (namely the shifts considered in Figure 9). We find that, as expected, the coverage of our method deviates from the target level, but critically this deviation is predictable and well approximated by the estimation procedure given in Proposition 2. On the other hand, although localized conformal prediction obtains similar empirical results to our method, it does not offer estimates of its coverage deviation. Thus, a priori based on the theory of Guan (2022), one may expect localized conformal prediction to give exact coverage under these shifts, while in reality, it shows notable deviations away from the target level.

5.3 RxRx1 data

Our next experiment examines the RxRx1 dataset (Taylor et al. (2019)) from the WILDS repository (Koh et al. (2021)). This repository contains a collection of commonly used datasets for benchmarking performance under distribution shift. In the RxRx1 task, we are given images of cells obtained using fluorescent microscopy and we must predict which one of the 1339 genetic treatments the cells received. These images come from 51 different experiments run across four different cell types. It is well known that even in theoretically identical experiments, minor variations in the execution and environmental conditions can induce observable variation in the final data. Thus, we expect to see heterogeneity in the quality of the predictions across both experiments and cell types.

Perhaps the most obvious method for correcting for this heterogeneity would be to treat the experiments and cell types as known categories and apply the group-conditional coverage method outlined in Section 2. However, if we did this, we would be unable to make predictions for new unlabeled images. Thus, here we take a more data-driven approach and attempt to learn a good feature representation directly from the training set.

To predict the genetic treatments of the cells we use a ResNet50 architecture trained on 37 of the 51 experiments by the original authors of the WILDS repository. We then uniformly divide the samples from the remaining 14 experiments into a training set and a test set. The training set is further split into two halves: one for learning a feature representation, and one to be used as the calibration set. To construct the feature representation, we take the feature map from the pre-trained neural network as input and run a 2\ell_{2}-regularized multinomial linear regression that predicts which experiment each image comes from. We then define our features to be the predicted probabilities of experiment membership output by this model and construct prediction sets using the linear quantile regression method of Section 2. To define the conformity scores for this experiment, let {w^i(x)}i=11339\{\hat{w}_{i}(x)\}_{i=1}^{1339} denote the weights output by the neural network at input xx. Typically, we would use these weights to compute the predicted probabilities of class membership, π^i(x):=exp(w^i(x))/(jexp(w^j(x)))\hat{\pi}_{i}(x):=\exp(\hat{w}_{i}(x))/(\sum_{j}\exp(\hat{w}_{j}(x))). Here, we add an extra step in which we use multinomial logistic regression and the calibration data to fit a parameter TT that re-scales the weights. This procedure is known as temperature scaling and it has been found to increase the accuracy of probabilities output by neural networks (Angelopoulos et al. (2021), Guo et al. (2017)). After running this regression, we set π^i(x):=exp(Tw^i(x))/(jexp(Tw^j(x)))\hat{\pi}_{i}(x):=\exp(T\hat{w}_{i}(x))/(\sum_{j}\exp(T\hat{w}_{j}(x))) and following Romano et al. (2020b) define the conformity score function to be

S(x,y):=i:π^i(x)>π^y(x)π^i(x).S(x,y):=\sum_{i:\hat{\pi}_{i}(x)>\hat{\pi}_{y}(x)}\hat{\pi}_{i}(x).
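
A minimal sketch of this score computation is given below, assuming the temperature TT has already been fit as described above; the function name and argument conventions are illustrative only.

import numpy as np

def conformity_score(weights, y, T):
    # Temperature-scaled softmax pi_hat_i(x) = exp(T w_hat_i(x)) / sum_j exp(T w_hat_j(x))
    z = T * np.asarray(weights, dtype=float)
    z -= np.max(z)                                  # numerical stability; leaves the ratios unchanged
    probs = np.exp(z) / np.sum(np.exp(z))
    # S(x, y): total probability mass strictly larger than that of the label y
    return np.sum(probs[probs > probs[y]])
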
Figure 12: Empirical conditional miscoverage of the unrandomized version of our method (orange) and split conformal (blue) across cell types and experiments. Red lines indicate the target level of α=0.1\alpha=0.1 and black error bars show 95% binomial confidence intervals for the calibration-conditional miscoverage (Yn+1C^(Xn+1)|Dtrain,Dcal)\mathbb{P}(Y_{n+1}\notin\hat{C}(X_{n+1})|D_{\text{train}},D_{\text{cal}}), where DtrainD_{\text{train}} and DcalD_{\text{cal}} denote the training dataset used to learn the feature representation and the calibration dataset used to implement our method, respectively.

The coverage properties of our unrandomized prediction sets are outlined in Figure 12. We see that while split conformal prediction has very heterogeneous coverage across experiments and cell types, our approach performs well on all groups. Thus, the learned feature representation successfully captures the batch effects in the data and thereby enables our method to provide the desired group-conditional coverage.

The method described above is hardly the only way to construct a feature representation for this dataset. In Section A.2, we consider an alternative approach in which we implement our method using the top principal components of the feature layer of the neural network as input. The idea here is that the batch effects (i.e. the cell types and experiment memberships) should induce large variations in the images that are visible on the top principal components. Thus, correcting the coverage along these directions will provide the desired conditional validity. In agreement with this hypothesis, we find that this procedure produces nearly identical results to those seen above, i.e., good coverage across all groups.

6 Choosing the function class and regularizer

In order to implement our method in practice, users must choose both the function class, \mathcal{F}, and regularizer, ()\mathcal{R}(\cdot). In the real data examples above we have considered some sample choices for these quantities that were designed to illustrate the guarantees of our method. A practitioner may reasonably disagree with our selections, and through their choice of \mathcal{F} and ()\mathcal{R}(\cdot), look to prioritize different conditional targets. As in many statistical estimation problems, we expect the best performance of our method to be obtained when the choices of \mathcal{F} and \mathcal{R} are guided by domain-specific knowledge of the most important features to the prediction task at hand. To help practitioners in making these choices, we highlight a few considerations that may arise.

First, we remark that although our theory requires fixed choices of \mathcal{F} and \mathcal{R}, we find empirically that running cross-validation on the training set does not significantly harm the coverage. To demonstrate this, we consider two different penalized implementations of our method. In the first, we run ridge-regularized linear quantile regression with :={β0+Xβ1:β0,β1d}\mathcal{F}:=\{\beta_{0}+X^{\top}\beta_{1}:\beta_{0}\in\mathbb{R},\ \beta_{1}\in\mathbb{R}^{d}\} and (β):=λβ2\mathcal{R}(\beta):=\lambda\|\beta\|^{2}, while in the second we run kernel quantile regression with Gaussian kernel, K(x,y):=exp(0.025xy22)K(x,y):=\exp(-0.025\,\|x-y\|_{2}^{2}) and corresponding kernel-norm regularizer, (g):=λgK2\mathcal{R}(g):=\lambda\|g\|_{K}^{2}. Figure 13 shows a comparison of the marginal coverage of our method obtained when λ\lambda is either fixed in advance, or estimated using cross-validation. We see that although our theory requires λ\lambda to be fixed, both approaches give near exact coverage empirically.

Figure 13: Marginal coverage obtained with (vertical and horizontal hatches) and without (diagonal hatches) cross-validation for the randomized version of our method. The red line shows the target level of α=0.1\alpha=0.1. Data for this experiment were generated from the Gaussian linear model in which (Xi,ϵi)N(0,Id)(X_{i},\epsilon_{i})\sim N(0,I_{d}), Yi=Xiw+ϵiY_{i}=X_{i}^{\top}w+\epsilon_{i}, and ww is a unit vector sampled uniformly at random. The conformity score is taken to be simply equal to YY so that all of the data is used in the calibration set. Barplots show averages from 20 trials where in each trial we sample n=200 calibration points and 100 test points in dimension d=50.
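
As a point of reference, one plausible reading of this cross-validation step for the ridge-penalized linear implementation is sketched below: λ is chosen from a grid by minimizing held-out pinball loss. This is our illustration of the procedure rather than a prescribed implementation, and the helper names are ours.

import numpy as np
import cvxpy as cp

def fit_ridge_pinball(X, s, tau, lam):
    # Linear quantile regression at level tau with penalty R(beta) = lam * ||beta||^2
    n, d = X.shape
    beta0, beta = cp.Variable(), cp.Variable(d)
    resid = s - (beta0 + X @ beta)
    pinball = cp.sum(cp.maximum(tau * resid, (tau - 1) * resid)) / n
    cp.Problem(cp.Minimize(pinball + lam * cp.sum_squares(beta))).solve()
    return beta0.value, beta.value

def select_lambda_by_cv(X, s, tau, lambdas, n_folds=5, seed=0):
    # Pick the lambda with the smallest average held-out pinball loss
    folds = np.random.default_rng(seed).integers(n_folds, size=len(s))
    cv_loss = []
    for lam in lambdas:
        losses = []
        for k in range(n_folds):
            tr, te = folds != k, folds == k
            b0, b = fit_ridge_pinball(X[tr], s[tr], tau, lam)
            r = s[te] - (b0 + X[te] @ b)
            losses.append(np.mean(np.maximum(tau * r, (tau - 1) * r)))
        cv_loss.append(np.mean(losses))
    return lambdas[int(np.argmin(cv_loss))]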

Perhaps even more critical than the regularization level is the choice of function class \mathcal{F}. Of particular interest is the trade-off that occurs as the dimension of \mathcal{F} increases. Indeed, while larger function spaces allow for a richer set of conditional guarantees, we find empirically that they also lead to larger prediction sets. Thus, adding spurious features to \mathcal{F} can harm the efficiency of our predictions.

Figure 14: Estimated distributions of the calibration-conditional marginal miscoverage, worst-case conditional coverage deviation, and prediction set length of quantile regression (green), as well as the unrandomized (orange) and randomized (red) variants of our conditional calibration method. Boxplots show results from 100 trials, where in each trial the coverages and lengths are evaluated empirically on 1000 test points.

To demonstrate this phenomenon, we consider a dataset with many irrelevant features that have no relationship with the response. In particular, we work with a Gaussian model in which XiN(0,Id)X_{i}\sim N(0,I_{d}), YiN(0,1)Y_{i}\sim N(0,1), and YiY_{i} is independent of XiX_{i}. We implement our method with conformity score S(X,Y)=YS(X,Y)=Y and use linear quantile regression with ={β0+Xβ1:β0,β1d}\mathcal{F}=\{\beta_{0}+X^{\top}\beta_{1}:\beta_{0}\in\mathbb{R},\ \beta_{1}\in\mathbb{R}^{d}\}, (g)=0\mathcal{R}(g)=0. Figure 14 plots estimated distributions of the calibration-conditional marginal miscoverage, worst-case conditional coverage deviation, and length of a two-sided implementation of our method in which the lower α/2\alpha/2 and upper 1α/21-\alpha/2 quantiles are estimated separately (see Section A.7), i.e. the quantities,

(Yn+1C^(Xn+1){(Xi,Yi)}i=1n),sup1jd+1|𝔼[(1,Xn+1)j(𝟙{Yn+1C^(Xn+1)}α){(Xi,Yi)}i=1n]𝔼[|(1,Xn+1)j|]|, and 𝔼[length(C^(Xn+1)){(Xi,Yi)}i=1n].\begin{gathered}\mathbb{P}(Y_{n+1}\notin\hat{C}(X_{n+1})\mid\{(X_{i},Y_{i})\}_{i=1}^{n}),\ \sup_{1\leq j\leq d+1}\left|\frac{\mathbb{E}[(1,X_{n+1})_{j}(\mathbbm{1}\{Y_{n+1}\not\in\hat{C}(X_{n+1})\}-\alpha)\mid\{(X_{i},Y_{i})\}_{i=1}^{n}]}{\mathbb{E}[|(1,X_{n+1})_{j}|]}\right|,\\ \text{ and }\mathbb{E}[\text{length}(\hat{C}(X_{n+1}))\mid\{(X_{i},Y_{i})\}_{i=1}^{n}].\end{gathered}
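
For completeness, the empirical analogues of these three quantities can be computed on a held-out test set as in the short sketch below; this is an illustration only, where `miscovered` is the 0/1 vector of test miscoverage events and `lengths` the realized interval lengths.

import numpy as np

def summarize_prediction_sets(miscovered, lengths, X_test, alpha):
    Z = np.column_stack([np.ones(len(X_test)), X_test])      # augmented covariates (1, X)
    marginal_miscoverage = np.mean(miscovered)
    # max_j |mean(Z_j * (1{miss} - alpha))| / mean(|Z_j|)
    worst_case_deviation = np.max(
        np.abs(Z.T @ (miscovered - alpha)) / len(Z) / np.mean(np.abs(Z), axis=0)
    )
    return marginal_miscoverage, worst_case_deviation, np.mean(lengths)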

As a baseline, the figure also displays results for vanilla quantile regression. We find that while the marginal coverage of our method is robust to large function classes, both the worst-case coverage deviation and the average length of the prediction sets increase as the dimension grows. Recall that all the covariates here are spurious; therefore, the increase in these quantities is purely a result of over-fitting in high dimensions and not a reflection of the underlying complexity of the relationship between XX and YY. Finally, we remark that the growth in the worst-case coverage error is substantially worse for quantile regression compared to our method. Thus, while we cannot guarantee exact calibration-conditional validity over all ff\in\mathcal{F} in high dimensions, our method is practically superior to sensible alternatives.

7 Acknowledgments

E.J.C. was supported by the Office of Naval Research grant N00014-20-1-2157, the National Science Foundation grant DMS-2032014, the Simons Foundation under award 814641, and the ARO grant 2003514594. I.G. was also supported by the Office of Naval Research grant N00014-20-1-2157 and the Simons Foundation award 814641, as well as additionally by the Overdeck Fellowship Fund. J.J.C. was supported by the John and Fannie Hertz Foundation. The authors are grateful to Kevin Guo and Tim Morrison for helpful discussion on this work.

References

  • Andrews and Shi (2013) Andrews, D. W. and Shi, X. (2013) Inference based on conditional moment inequalities. Econometrica, 81, 609–666.
  • Angelopoulos et al. (2021) Angelopoulos, A. N., Bates, S., Jordan, M. and Malik, J. (2021) Uncertainty sets for image classifiers using conformal prediction. In International Conference on Learning Representations.
  • Barber et al. (2023) Barber, R. F., Candes, E. J., Ramdas, A. and Tibshirani, R. J. (2023) Conformal prediction beyond exchangeability. arXiv preprint. ArXiv:2202.13415.
  • Barber et al. (2020) Barber, R. F., Candès, E. J., Ramdas, A. and Tibshirani, R. J. (2020) The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 10, 455–482.
  • Boucheron et al. (2005) Boucheron, S., Bousquet, O. and Lugosi, G. (2005) Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9, 323–375.
  • Bousquet and Elisseeff (2002) Bousquet, O. and Elisseeff, A. (2002) Stability and generalization. Journal of Machine Learning Research, 2, 499–526.
  • Chernozhukov et al. (2021) Chernozhukov, V., Wüthrich, K. and Zhu, Y. (2021) Distributional conformal prediction. Proceedings of the National Academy of Sciences, 118, e2107794118.
  • Dantzig and Thapa (2003) Dantzig, G. B. and Thapa, M. N. (2003) Linear programming: Theory and extensions, vol. 2. Springer.
  • Deng et al. (2023) Deng, Z., Dwork, C. and Zhang, L. (2023) Happymap: A generalized multi-calibration method. arXiv preprint arXiv:2303.04379.
  • Dua and Graff (2017) Dua, D. and Graff, C. (2017) UCI machine learning repository. URL: http://archive.ics.uci.edu/ml.
  • Guan (2022) Guan, L. (2022) Localized conformal prediction: a generalized inference framework for conformal prediction. Biometrika, 110, 33–50.
  • Guo et al. (2017) Guo, C., Pleiss, G., Sun, Y. and Weinberger, K. Q. (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, 1321–1330. JMLR.org.
  • Hébert-Johnson et al. (2018) Hébert-Johnson, U., Kim, M., Reingold, O. and Rothblum, G. (2018) Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, 1939–1948. PMLR.
  • Jung et al. (2023) Jung, C., Noarov, G., Ramalingam, R. and Roth, A. (2023) Batch multivalid conformal prediction. In International Conference on Learning Representations.
  • Kim et al. (2019) Kim, M. P., Ghorbani, A. and Zou, J. (2019) Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 247–254.
  • Koenker and Bassett Jr (1978) Koenker, R. and Bassett Jr, G. (1978) Regression quantiles. Econometrica: journal of the Econometric Society, 33–50.
  • Koh et al. (2021) Koh, P. W., Sagawa, S., Marklund, H., Xie, S. M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R. L., Gao, I., Lee, T., David, E., Stavness, I., Guo, W., Earnshaw, B., Haque, I., Beery, S. M., Leskovec, J., Kundaje, A., Pierson, E., Levine, S., Finn, C. and Liang, P. (2021) Wilds: A benchmark of in-the-wild distribution shifts. In Proceedings of the 38th International Conference on Machine Learning (eds. M. Meila and T. Zhang), vol. 139 of Proceedings of Machine Learning Research, 5637–5664. PMLR.
  • Lei and Wasserman (2014) Lei, J. and Wasserman, L. (2014) Distribution‐free prediction bands for non‐parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76.
  • McShane (1934) McShane, E. J. (1934) Extension of range of functions. Bulletin of the American Mathematical Society, 40, 837 – 842.
  • Mendelson (2014) Mendelson, S. (2014) Learning without concentration. In Proceedings of The 27th Conference on Learning Theory (eds. M. F. Balcan, V. Feldman and C. Szepesvári), vol. 35 of Proceedings of Machine Learning Research, 25–39. Barcelona, Spain: PMLR.
  • Paulsen and Raghupathi (2016) Paulsen, V. I. and Raghupathi, M. (2016) An Introduction to the Theory of Reproducing Kernel Hilbert Spaces. Cambridge Studies in Advanced Mathematics. Cambridge University Press.
  • Qiu et al. (2022) Qiu, H., Dobriban, E. and Tchetgen, E. T. (2022) Prediction sets adaptive to unknown covariate shift. arXiv preprint. ArXiv:2203.06126.
  • Redmond and Baveja (2002) Redmond, M. and Baveja, A. (2002) A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141, 660–678.
  • Romano et al. (2020a) Romano, Y., Barber, R. F., Sabatti, C. and Candès, E. (2020a) With Malice Toward None: Assessing Uncertainty via Equalized Coverage. Harvard Data Science Review, 2.
  • Romano et al. (2019) Romano, Y., Patterson, E. and Candes, E. (2019) Conformalized quantile regression. In Advances in Neural Information Processing Systems (eds. H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox and R. Garnett), vol. 32. Curran Associates, Inc.
  • Romano et al. (2020b) Romano, Y., Sesia, M. and Candes, E. (2020b) Classification with valid and adaptive coverage. In Advances in Neural Information Processing Systems (eds. H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan and H. Lin), vol. 33, 3581–3591. Curran Associates, Inc.
  • Sesia and Candès (2020) Sesia, M. and Candès, E. J. (2020) A comparison of some conformal quantile regression methods. Stat, 9, e261. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/sta4.261. E261 sta4.261.
  • Sesia and Romano (2021) Sesia, M. and Romano, Y. (2021) Conformal prediction using conditional histograms. In Advances in Neural Information Processing Systems (eds. M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang and J. W. Vaughan), vol. 34, 6304–6315. Curran Associates, Inc.
  • Taylor et al. (2019) Taylor, J., Earnshaw, B., Mabey, B., Victors, M. and Yosinski, J. (2019) Rxrx1: An image set for cellular morphological variation across many experimental batches. In International Conference on Learning Representations (ICLR).
  • Tibshirani et al. (2019) Tibshirani, R. J., Foygel Barber, R., Candes, E. and Ramdas, A. (2019) Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems (eds. H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox and R. Garnett), vol. 32. Curran Associates, Inc.
  • Vaart and Wellner (1996) Vaart, A. W. and Wellner, J. A. (1996) Weak Convergence and Empirical Processes. Springer New York, NY, 1 edn.
  • Vovk (2012) Vovk, V. (2012) Conditional validity of inductive conformal predictors. In Proceedings of the Asian Conference on Machine Learning (eds. S. C. H. Hoi and W. Buntine), vol. 25 of Proceedings of Machine Learning Research, 475–490. Singapore Management University, Singapore: PMLR.
  • Vovk et al. (2005) Vovk, V., Gammerman, A. and Shafer, G. (2005) Algorithmic Learning in a Random World. Berlin, Heidelberg: Springer-Verlag.
  • Vovk et al. (2003) Vovk, V., Lindsay, D., Nouretdinov, I. and Gammerman, A. (2003) Mondrian confidence machine. Tech. rep., Royal Holloway University of London.
  • Whitney (1934) Whitney, H. (1934) Analytic extensions of differentiable functions defined in closed sets. Transactions of the American Mathematical Society, 36, 63–89.
  • Yang et al. (2022) Yang, Y., Kuchibhotla, A. K. and Tchetgen, E. T. (2022) Doubly robust calibration of prediction sets under covariate shift. arXiv preprint arXiv:2203.01761.

Appendix A Appendix

A.1 Computational details for Section 4

A.1.1 Verification of the conditions of Section 4

In this section we verify that all of the quantile regression problems discussed in this paper satisfy the conditions of Section 4. To do this we will check that 1) each quantile regression admits an equivalent finite dimensional representation, and 2) all of these finite dimensional representations (and thus also their infinite dimensional counterparts) satisfy strong duality.

The fact that the linear quantile regression of Section 2 satisfies 1) and 2) is clear. For RKHS functions, let Kn+1×n+1K\in\mathbb{R}^{n+1\times n+1} denote the kernel matrix with entries Kij=K(Xi,Xj)K_{ij}=K(X_{i},X_{j}). Let KiK_{i} denote the ithi_{\text{th}} row (equivalently column) of KK. Then, the primal problem (3.1) can be fit by solving the equivalent convex optimization program

minimizeγn+1,βd1n+1i=1nα(Kiγ+Φ(Xi)β,Si)+1n+1α(Kn+1γ+Φ(Xn+1)β,S)+λγKγ,\displaystyle\underset{\gamma\in\mathbb{R}^{n+1},\ \beta\in\mathbb{R}^{d}}{\text{minimize}}\ \frac{1}{n+1}\sum_{i=1}^{n}\ell_{\alpha}(K_{i}^{\top}\gamma+\Phi(X_{i})^{\top}\beta,S_{i})+\frac{1}{n+1}\ell_{\alpha}(K_{n+1}^{\top}\gamma+\Phi(X_{n+1})^{\top}\beta,S)+\lambda\gamma^{\top}K\gamma,

and setting g^S()=i=1n+1γ^iK(Xi,)+Φ()β^\hat{g}_{S}(\cdot)=\sum_{i=1}^{n+1}\hat{\gamma}_{i}K(X_{i},\cdot)+\Phi(\cdot)^{\top}\hat{\beta}, for (γ^,β^)(\hat{\gamma},\hat{\beta}) any optimal solutions. With this finite-dimensional representation in hand, it now follows that the primal-dual pair must satisfy strong-duality by Slater’s condition. Moreover, it is also easy to see that in this context

(η)\displaystyle\mathcal{R}^{*}(\eta) =mingKK,βdλgKK2i=1n+1ηi(gK(Xi)+Φ(Xi)β)\displaystyle=-\min_{g_{K}\in\mathcal{F}_{K},\ \beta\in\mathbb{R}^{d}}\lambda\|g_{K}\|_{K}^{2}-\sum_{i=1}^{n+1}\eta_{i}(g_{K}(X_{i})+\Phi(X_{i})^{\top}\beta)
=minγn+1,βdλγKγi=1n+1ηi(Kiγ+Φ(Xi)β),\displaystyle=-\min_{\gamma\in\mathbb{R}^{n+1},\ \beta\in\mathbb{R}^{d}}\lambda\gamma^{\top}K\gamma-\sum_{i=1}^{n+1}\eta_{i}(K_{i}^{\top}\gamma+\Phi(X_{i})^{\top}\beta),

which is a tractable function that we can compute.
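
For concreteness, this inner minimization can be evaluated in closed form under the formulation displayed above: minimizing over β\beta is unbounded below unless i=1n+1ηiΦ(Xi)=0\sum_{i=1}^{n+1}\eta_{i}\Phi(X_{i})=0, and the stationarity condition in γ\gamma, 2λKγ=Kη2\lambda K\gamma=K\eta, is satisfied by γ=η/(2λ)\gamma=\eta/(2\lambda), giving

\mathcal{R}^{*}(\eta)=\begin{cases}\dfrac{\eta^{\top}K\eta}{4\lambda},&\text{if }\sum_{i=1}^{n+1}\eta_{i}\Phi(X_{i})=0,\\ +\infty,&\text{otherwise.}\end{cases}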

Finally, to fit Lipschitz functions we can solve the optimization program,

minimizeγn+1,βd1n+1i=1nα(γi+Φ(Xi)β,Si)+1n+1α(γn+1+Φ(Xn+1)β,S)+λmaxij|γiγj|XiXj2.\displaystyle\underset{\gamma\in\mathbb{R}^{n+1},\ \beta\in\mathbb{R}^{d}}{\text{minimize}}\ \frac{1}{n+1}\sum_{i=1}^{n}\ell_{\alpha}(\gamma_{i}+\Phi(X_{i})^{\top}\beta,S_{i})+\frac{1}{n+1}\ell_{\alpha}(\gamma_{n+1}+\Phi(X_{n+1})^{\top}\beta,S)+\lambda\max_{i\neq j}\frac{|\gamma_{i}-\gamma_{j}|}{\|X_{i}-X_{j}\|_{2}}.

The idea here is that γ1,,γn+1{\gamma}_{1},\dots,{\gamma}_{n+1} act as proxies for the values of fL(X1),,fL(Xn+1){f}_{L}(X_{1}),\dots,{f}_{L}(X_{n+1}). These proxies can always be extended to a complete function on all of 𝒳\mathcal{X} using the methods of McShane (1934), Whitney (1934). Once again, with this finite-dimensional representation in hand it is now easy to check that the primal-dual pair satisfies strong-duality by Slater’s condition. Finally, in this context we have

(η)\displaystyle\mathcal{R}^{*}(\eta) =mingLL,βdλLip(gL)i=1n+1ηi(gL(Xi)+Φ(Xi)β)\displaystyle=-\min_{g_{L}\in\mathcal{F}_{L},\ \beta\in\mathbb{R}^{d}}\lambda\text{Lip}(g_{L})-\sum_{i=1}^{n+1}\eta_{i}(g_{L}(X_{i})+\Phi(X_{i})^{\top}\beta)
=minγn+1,βdλmaxij|γiγj|XiXj2i=1n+1ηiγi,\displaystyle=-\min_{\gamma\in\mathbb{R}^{n+1},\ \beta\in\mathbb{R}^{d}}\lambda\max_{i\neq j}\frac{|\gamma_{i}-\gamma_{j}|}{\|X_{i}-X_{j}\|_{2}}-\sum_{i=1}^{n+1}\eta_{i}\gamma_{i},

which is a tractable function.
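
As a small illustration of the extension step mentioned above, one standard construction (McShane, 1934) extends the fitted proxies γ^1,,γ^n+1\hat{\gamma}_{1},\dots,\hat{\gamma}_{n+1} to an arbitrary point without increasing the Lipschitz constant; a minimal sketch, with illustrative names, is:

import numpy as np

def mcshane_extension(x_new, X, gamma_hat, lip):
    # f(x_new) = min_i { gamma_hat_i + lip * ||X_i - x_new||_2 } is lip-Lipschitz and
    # interpolates the fitted values whenever they are themselves lip-Lipschitz on {X_i}.
    return np.min(gamma_hat + lip * np.linalg.norm(X - x_new, axis=1))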

A.1.2 Additional algorithmic details

The results below elucidate various approaches to computing the prediction set. Algorithm 1 describes how we might use a binary search algorithm to discover the critical SS at which ηn+1S\eta^{S}_{n+1} equals the target cutoff.

Data: Observed data {(X1,S1),,(Xn,Sn)}{Xn+1}\{(X_{1},S_{1}),\dots,(X_{n},S_{n})\}\cup\{X_{n+1}\}, numerical error tolerance ϵ\epsilon, target cutoff C{1α,U}C\in\{1-\alpha,U\}, range [a,b][a,b] for Sn+1S_{n+1} (a,ba,b = None indicates that no bounds are known for Sn+1S_{n+1}).
if b=b= None then
       b=max{max1inSi,1}b=\max\{\max_{1\leq i\leq n}S_{i},1\};
       while ηn+1b<C\eta^{b}_{n+1}<C do
             b=2bb=2b;
      
if a=a= None then
       a=min{min1inSi,1}a=\min\{\min_{1\leq i\leq n}S_{i},-1\};
       while ηn+1aC\eta^{a}_{n+1}\geq C do
             a=2aa=2a;
      
while ba>ϵb-a>\epsilon do
       if ηn+1(a+b)/2<C\eta^{(a+b)/2}_{n+1}<C then
             a=(a+b)/2a=(a+b)/2;
            
      else
             b=(a+b)/2b=(a+b)/2;
            
      
return (a+b)/2(a+b)/2.
Algorithm 1 Binary Search Computation of Sn+1S^{*}_{n+1}
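
A direct Python rendering of Algorithm 1 is sketched below; `eta_np1` stands for any routine returning ηn+1S\eta^{S}_{n+1} for a candidate score SS (e.g., the dual LP solve from Section 4), and the toy stand-in at the end is included only so the search mechanics can be run end-to-end.

import numpy as np

def binary_search_cutoff(eta_np1, cal_scores, cutoff, eps=1e-6):
    # Locate the critical score at which the (non-decreasing, by Theorem 4) map
    # S -> eta_{n+1}^S reaches the target cutoff.
    b = max(np.max(cal_scores), 1.0)
    while eta_np1(b) < cutoff:          # expand upward until the cutoff is reached
        b *= 2
    a = min(np.min(cal_scores), -1.0)
    while eta_np1(a) >= cutoff:         # expand downward until we are strictly below it
        a *= 2
    while b - a > eps:                  # standard bisection on the monotone map
        mid = (a + b) / 2
        a, b = (mid, b) if eta_np1(mid) < cutoff else (a, mid)
    return (a + b) / 2

toy_eta = lambda s: np.clip(0.2 * s, -0.1, 0.9)    # toy monotone stand-in for eta_{n+1}^S
print(binary_search_cutoff(toy_eta, np.array([-1.0, 0.5, 2.0]), cutoff=0.9))   # ~4.5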

As discussed in the main text, Algorithm 1 may be inefficient if the optimization problem possesses additional structure. In particular, a common implementation of our method is to fit an unregularized quantile regression over a finite-dimensional function class. In this case, it is possible to exactly compute sup{S:ηn+1S<C}\sup\{S:\eta^{S}_{n+1}<C\}. Our approach to this problem, which leverages standard tools from LP sensitivity analysis, relies on the following observation.

Proposition 5.

Assume strong duality holds for (4.1). Let (g^n,η^n)×n(\hat{g}_{n},\hat{\eta}^{n})\in\mathcal{F}\times\mathbb{R}^{n} denote a valid primal-dual solution of the analogues of (4.1), (4.2) fit on the calibration data alone, and let S^n+1:=g^n(Xn+1)\hat{S}_{n+1}:=\hat{g}_{n}(X_{n+1}). Then, (η^n,0)(\hat{\eta}^{n},0) is a valid dual solution of the dual program with data {(Xi,Si)}i=1n{(Xn+1,S^n+1)}\{(X_{i},S_{i})\}_{i=1}^{n}\cup\{(X_{n+1},\hat{S}_{n+1})\}.

Proof.

This result follows immediately from the fact that (η^n,0)(\hat{\eta}^{n},0) is feasible and the duality gap of g^n\hat{g}_{n} and (η^n,0)(\hat{\eta}^{n},0) is zero. ∎

We now recall some basic linear programming (LP) facts. First, assuming without loss of generality that Φ\Phi is full-rank, it follows immediately from standard LP theory that there exists a solution in which all but pp indices in η\eta lie at the extremes {α,1α}\{-\alpha,1-\alpha\} (Dantzig and Thapa (2003)). Moreover, the remaining non-trivial coordinates of η\eta are piecewise linear in SS, and the slope of the function given by Sηn+1SS\mapsto\eta^{S}_{n+1} can be obtained by solving a linear equation (Dantzig and Thapa (2003)). Thus, obtaining the critical value of SS at which ηn+1S\eta^{S}_{n+1} exceeds our cutoff amounts to tracing a piecewise linear function from S^n+1\hat{S}_{n+1} until the cutoff is reached. Algorithm 2 provides an explicit description of this procedure, which closely resembles the simplex method. Note that the algorithm is initialized at S^n+1\hat{S}_{n+1} since the dual solution at that point is known to be zero (Proposition 5).

Data: Observed data {(X1,S1),,(Xn,Sn)}{Xn+1}\{(X_{1},S_{1}),\dots,(X_{n},S_{n})\}\cup\{X_{n+1}\}, target cutoff CC, quantile qq.
g^n()=LPSolver({(Xi,Si)}i=1n)\hat{g}_{n}(\cdot)=\text{LPSolver}(\{(X_{i},S_{i})\}_{i=1}^{n}) ;
Sn+1=g^n(Xn+1)S^{*}_{n+1}=\hat{g}_{n}(X_{n+1}), ηn+1=0\eta_{n+1}=0;
ic=n+1i_{c}=n+1 ;
  /* candidate to enter basis */
while ηn+1C\eta_{n+1}\neq C do
       B=ActiveSet({1,,n+1})B=\text{ActiveSet}(\{1,\dots,n+1\});
d=(ΦB)1Φicd=-(\Phi_{B}^{\top})^{-1}\Phi_{i_{c}}^{\top};
       if C>ηn+1C>\eta_{n+1} then
             Δ1:|B|=max(qηBd,q1ηBd)\Delta_{1:|B|}=\max\left(\frac{q-\eta_{B}}{d},\frac{q-1-\eta_{B}}{d}\right) ;
              /* element-wise division and maximum */
             i=argmini[|B|](Δi)i^{*}=\arg\min_{i\in[|B|]}(\Delta_{i}) ;
              /* candidate to exit basis */
             δ=Δi\delta=\Delta_{i^{*}};
            
      else
             Δ1:|B|=min(qηBd,q1ηBd)\Delta_{1:|B|}=\min\left(\frac{q-\eta_{B}}{d},\frac{q-1-\eta_{B}}{d}\right) ;
              /* element-wise division and minimum */
             i=argmaxi[|B|](Δi)i^{*}=\arg\max_{i\in[|B|]}(\Delta_{i}) ;
              /* candidate to exit basis */
             δ=Δi\delta=\Delta_{i^{*}} ;
            
      δc=max(min(δ,q1ηic),qηic)\delta_{c}=\max(\min(\delta,q-1-\eta_{i_{c}}),q-\eta_{i_{c}});
       ηB=ηB+δcd\eta_{B}=\eta_{B}+\delta_{c}d;
       ηic=ηic+δc\eta_{i_{c}}=\eta_{i_{c}}+\delta_{c};
       if δc=δ\delta_{c}=\delta then
             B=BiB=B\setminus i^{*} ;
              /* update basis */
             B=B{ic}B=B\cup\{i_{c}\} ;
              /* update basis */
            
      A=(ΦB)1ΦBcA=(\Phi_{B}^{\top})^{-1}\Phi_{B^{c}}^{\top};
       c=SBcSBAc=S_{B^{c}}^{\top}-S_{B}^{\top}A;
       ΔS=c/A1\Delta^{S}=c/A_{-1} ;
        /* element-wise division by last row of AA */
B^{*}=\{i\in B^{c}\mid c_{i}\neq 0\} ;
       if |B|=0|B^{*}|=0 then
             Sn+1=S_{n+1}=\infty ;
             ηn+1=C\eta_{n+1}=C;
            
      else
             if C>0C>0 then
                   ic=argminiBΔiSi_{c}=\arg\min_{i\in B^{*}}\Delta^{S}_{i} ;
                    /* candidate to enter basis */
                   Sn+1=Sn+1+ΔicSS^{*}_{n+1}=S^{*}_{n+1}+\Delta^{S}_{i_{c}} ;
                  
            else
                   ic=argmaxiBΔiSi_{c}=\arg\max_{i\in B^{*}}\Delta^{S}_{i} ;
                    /* candidate to enter basis */
                   Sn+1=Sn+1+ΔicSS^{*}_{n+1}=S^{*}_{n+1}+\Delta^{S}_{i_{c}} ;
                  
            
      
return Sn+1S^{*}_{n+1}.
Algorithm 2 LP Sensitivity Computation of Sn+1S^{*}_{n+1}

We conclude this section with one final method for computing a conservative alternative to our prediction set that requires only a single quantile regression fit. For this, suppose we know an upper bound M for S_{n+1}. Then, we may consider the conservative prediction set \hat{C}_{\text{con}}(X_{n+1},M):=\{y:S(X_{n+1},y)\leq\hat{g}_{M}(X_{n+1})\}. Our final proposition shows that whenever M is a valid upper bound on S_{n+1}, \hat{C}_{\text{con}}(X_{n+1},M) provides a conservative coverage guarantee. To avoid subtle issues related to the non-uniqueness of the optimal quantile function, we will assume that \hat{g}_{M}(X_{n+1}) is a min-norm solution of (4.1) (the choice of norm is not important).

Proposition 6.

Assume that \mathcal{F} admits a strictly-convex norm \|\cdot\|_{\mathcal{F}} and let {g^S}S\{\hat{g}_{S}\}_{S\in\mathbb{R}} denote the corresponding min-norm solutions to (4.1). Then, for all M>0M>0,

\{y:S(X_{n+1},y)\leq\hat{g}_{S(X_{n+1},y)}(X_{n+1}),\ S(X_{n+1},y)\leq M\}\subseteq\{y:S(X_{n+1},y)\leq\hat{g}_{M}(X_{n+1})\}.
Proof of Proposition 6.

Let

Ln(g):=i=1nα(g(Xi),Si) and LS(g)=Ln(g)+α(g(Xn+1),S).L_{n}(g):=\sum_{i=1}^{n}\ell_{\alpha}(g(X_{i}),S_{i})\ \ \text{ and }\ \ L_{S}(g)=L_{n}(g)+\ell_{\alpha}(g(X_{n+1}),S).

Assume for the sake of contradiction that S<MS<M, Sg^S(Xn+1)S\leq\hat{g}_{S}(X_{n+1}), and g^M(Xn+1)<S\hat{g}_{M}(X_{n+1})<S. To obtain a contradiction, we claim that it is sufficient to prove that

LS(g^M)LS(g^S)LM(g^M)LM(g^S).\displaystyle L_{S}(\hat{g}_{M})-L_{S}(\hat{g}_{S})\leq L_{M}(\hat{g}_{M})-L_{M}(\hat{g}_{S}). (A.1)

To see why, note that since g^M\hat{g}_{M} is a global optimum of gLM(g)+(n+1)(g)g\ \mapsto L_{M}(g)+(n+1)\mathcal{R}(g), we must have that

LM(g^M)+(n+1)(g^M)\displaystyle L_{M}(\hat{g}_{M})+(n+1)\mathcal{R}(\hat{g}_{M}) LM(g^S)+(n+1)(g^S).\displaystyle\leq L_{M}(\hat{g}_{S})+(n+1)\mathcal{R}(\hat{g}_{S}).

Rearranging this and applying (A.1) gives the inequality

(n+1)(g^S)(n+1)(g^M)\displaystyle(n+1)\mathcal{R}(\hat{g}_{S})-(n+1)\mathcal{R}(\hat{g}_{M}) LM(g^M)LM(g^S)\displaystyle\geq L_{M}(\hat{g}_{M})-L_{M}(\hat{g}_{S})
LS(g^M)LS(g^S),\displaystyle\geq L_{S}(\hat{g}_{M})-L_{S}(\hat{g}_{S}),

or equivalently,

LS(g^M)+(n+1)(g^M)\displaystyle L_{S}(\hat{g}_{M})+(n+1)\mathcal{R}(\hat{g}_{M}) LS(g^S)+(n+1)(g^S).\displaystyle\leq L_{S}(\hat{g}_{S})+(n+1)\mathcal{R}(\hat{g}_{S}).

Since \hat{g}_{S} is the unique min-norm minimizer of g\mapsto L_{S}(g)+(n+1)\mathcal{R}(g), this implies that \|\hat{g}_{S}\|_{\mathcal{F}}<\|\hat{g}_{M}\|_{\mathcal{F}}.

Now, by a completely symmetric argument reversing the roles of g^M\hat{g}_{M} and g^S\hat{g}_{S} we also have that

LM(g^S)+(n+1)(g^S)LM(g^M)+(n+1)(g^M),L_{M}(\hat{g}_{S})+(n+1)\mathcal{R}(\hat{g}_{S})\leq L_{M}(\hat{g}_{M})+(n+1)\mathcal{R}(\hat{g}_{M}),

which by identical reasoning implies that g^M<g^S\|\hat{g}_{M}\|_{\mathcal{F}}<\|\hat{g}_{S}\|_{\mathcal{F}}. Thus, we have arrived at our desired contradiction.

To prove (A.1) we break into two cases.

Case 1:

g^M(Xn+1)<Sg^S(Xn+1)M\hat{g}_{M}(X_{n+1})<S\leq\hat{g}_{S}(X_{n+1})\leq M.

LS(g^M)LS(g^S)\displaystyle L_{S}(\hat{g}_{M})-L_{S}(\hat{g}_{S}) =(1α)(Sg^M(Xn+1))α(g^S(Xn+1)S)+Ln(g^M)Ln(g^S)\displaystyle=(1-\alpha)(S-\hat{g}_{M}(X_{n+1}))-\alpha(\hat{g}_{S}(X_{n+1})-S)+L_{n}(\hat{g}_{M})-L_{n}(\hat{g}_{S})
=(1α)(g^S(Xn+1)g^M(Xn+1))+Sg^S(Xn+1)+Ln(g^M)Ln(g^S)\displaystyle=(1-\alpha)(\hat{g}_{S}(X_{n+1})-\hat{g}_{M}(X_{n+1}))+S-\hat{g}_{S}(X_{n+1})+L_{n}(\hat{g}_{M})-L_{n}(\hat{g}_{S})
(1α)(g^S(Xn+1)g^M(Xn+1))+Ln(g^M)Ln(g^S)\displaystyle\leq(1-\alpha)(\hat{g}_{S}(X_{n+1})-\hat{g}_{M}(X_{n+1}))+L_{n}(\hat{g}_{M})-L_{n}(\hat{g}_{S})
=LM(g^M)LM(g^S)\displaystyle=L_{M}(\hat{g}_{M})-L_{M}(\hat{g}_{S})
Case 2:

g^M(Xn+1)<S<Mg^S(Xn+1)\hat{g}_{M}(X_{n+1})<S<M\leq\hat{g}_{S}(X_{n+1}).

LM(g^M)LM(g^S)\displaystyle L_{M}(\hat{g}_{M})-L_{M}(\hat{g}_{S}) =(1α)(Mg^M(Xn+1))α(g^S(Xn+1)M)+Ln(g^M)Ln(g^S)\displaystyle=(1-\alpha)(M-\hat{g}_{M}(X_{n+1}))-\alpha(\hat{g}_{S}(X_{n+1})-M)+L_{n}(\hat{g}_{M})-L_{n}(\hat{g}_{S})
=α(g^M(Xn+1)g^S(Xn+1))+Mg^M(Xn+1)+Ln(g^M)Ln(g^S)\displaystyle=\alpha(\hat{g}_{M}(X_{n+1})-\hat{g}_{S}(X_{n+1}))+M-\hat{g}_{M}(X_{n+1})+L_{n}(\hat{g}_{M})-L_{n}(\hat{g}_{S})
>α(g^M(Xn+1)g^S(Xn+1))+Sg^M(Xn+1)+Ln(g^M)Ln(g^S)\displaystyle>\alpha(\hat{g}_{M}(X_{n+1})-\hat{g}_{S}(X_{n+1}))+S-\hat{g}_{M}(X_{n+1})+L_{n}(\hat{g}_{M})-L_{n}(\hat{g}_{S})
=(1α)(Sg^M(Xn+1))α(g^S(Xn+1)S)+Ln(g^M)Ln(g^S)\displaystyle=(1-\alpha)(S-\hat{g}_{M}(X_{n+1}))-\alpha(\hat{g}_{S}(X_{n+1})-S)+L_{n}(\hat{g}_{M})-L_{n}(\hat{g}_{S})
=LS(g^M)LS(g^S).\displaystyle=L_{S}(\hat{g}_{M})-L_{S}(\hat{g}_{S}).
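To make the construction of \hat{C}_{\text{con}}(X_{n+1},M) concrete, here is a minimal Python sketch for a linear-in-\Phi function class. It uses scikit-learn's QuantileRegressor as a generic pinball-loss solver (any quantile regression routine would do), and treats the feature map phi, the conformity score score, the candidate grid y_grid, and the upper bound M as user-supplied inputs. It is a sketch under those assumptions rather than our reference implementation, and it makes no attempt to select the min-norm solution discussed above.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

def conservative_set(phi, score, X_cal, S_cal, x_test, y_grid, M, alpha=0.1):
    """One-shot conservative prediction set C_con(x_test, M).

    Fits a single (unregularized) pinball regression of the calibration scores
    on phi(X), with the test covariate appended at the imputed score M, and
    keeps every candidate y whose score lies below the fitted quantile.
    """
    Phi_aug = np.vstack([phi(X_cal), phi(x_test[None, :])])   # augmented design
    S_aug = np.concatenate([S_cal, [M]])                      # scores with M imputed

    # quantile level 1 - alpha matches the pinball loss ell_alpha used above
    qr = QuantileRegressor(quantile=1 - alpha, alpha=0.0, fit_intercept=False)
    qr.fit(Phi_aug, S_aug)
    g_M = qr.predict(phi(x_test[None, :]))[0]                 # \hat{g}_M(x_test)

    return [y for y in y_grid if score(x_test, y) <= g_M]
```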

A.2 Additional experiments on the Rxrx1 data

As an alternative to estimating the probabilities of experimental membership, here we consider constructing a feature representation for the Rxrx1 data using principal component analysis. Namely, we implement the linear quantile regression method of Section 2 using the top principal components of the feature layer of the neural network as input. We choose the number of principal components for this analysis to be 70 based on a visual inspection of a scree plot (Figure 15). All other steps of this experiment are kept identical to the procedure described in Section 5.3.
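For reference, the feature construction just described can be reproduced with a few lines of Python. The sketch below assumes that features is an array holding the penultimate-layer activations of the pretrained network; the variable names are illustrative, and the output Z is what is fed into the linear quantile regression method of Section 2.

```python
from sklearn.decomposition import PCA

def pca_features(features, n_components=70):
    """Project the network's feature layer onto its top principal components."""
    pca = PCA(n_components=n_components)
    Z = pca.fit_transform(features)   # covariates for the linear quantile regression
    # pca.explained_variance_ratio_ reproduces the scree plot used to choose 70,
    # and pca.transform maps held-out feature vectors into the same coordinates.
    return pca, Z
```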

Similar to the results of Section 5.3, we find that the empirical coverage of this method is close to the target level across all cell types and experiments (Figure 16).

Figure 15: Scree plot for the singular value decomposition of the feature layer of the pretrained neural network. The horizontal line indicates the value of the 70th largest component.
Figure 16: Empirical conditional miscoverage of the unrandomized version of our method (orange) and split conformal (blue) across cell types and experiments when principal component analysis is used to construct the features. Red lines indicate the target level of α=0.1\alpha=0.1 and black error bars show 95% binomial confidence intervals for the calibration-conditional miscoverage (Yn+1C^(Xn+1)|Dtrain,Dcal)\mathbb{P}(Y_{n+1}\notin\hat{C}(X_{n+1})|D_{\text{train}},D_{\text{cal}}), where DtrainD_{\text{train}} and DcalD_{\text{cal}} denote the training dataset used to learn the feature representation and the calibration dataset used to implement our method, respectively.

A.3 Proofs of the main coverage guarantees

In this section we prove the top-level coverage guarantees of our method. We begin by proving our most general result, Theorem 3, which considers a generic function class \mathcal{F} and penalty \mathcal{R}. Then, by restricting the choices of \mathcal{F} and \mathcal{R}, we obtain Theorem 2 and Corollary 1 as special cases.

Proof of Theorem 3.

We begin by examining the first order conditions of the convex optimization problem (3.1). Namely, since g^Sn+1\hat{g}_{S_{n+1}} is a minimizer of

g1n+1i=1n+1α(g(Xi),Si)+(g)g\mapsto\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(g(X_{i}),S_{i})+\mathcal{R}(g)

we must have that for any fixed ff\in\mathcal{F},

0ϵ(1n+1i=1n+1α(g^Sn+1(Xi)+ϵf(Xi),Si)+(g^Sn+1+ϵf))|ϵ=0.0\in\partial_{\epsilon}\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(\hat{g}_{S_{n+1}}(X_{i})+\epsilon f(X_{i}),S_{i})+\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\right)\bigg{|}_{\epsilon=0}.

By a straightforward computation, the subgradients of the pinball loss are given by

\partial_{\epsilon}\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(\hat{g}_{S_{n+1}}(X_{i})+\epsilon f(X_{i}),S_{i})\right)\\ =\left\{\frac{1}{n+1}\left(\sum_{S_{i}\neq\hat{g}_{S_{n+1}}(X_{i})}f(X_{i})\left(\alpha-\mathbbm{1}\left\{S_{i}>\hat{g}_{S_{n+1}}(X_{i})\right\}\right)+\sum_{S_{i}=\hat{g}_{S_{n+1}}(X_{i})}s_{i}f(X_{i})\right)\,\biggr{\lvert}\,s_{i}\in[\alpha-1,\alpha]\right\}.

Let si[α1,α]s_{i}^{*}\in[\alpha-1,\alpha] be the values setting the subgradient to 0. Rearranging, we obtain

\frac{1}{n+1}\sum_{i=1}^{n+1}f(X_{i})\left(\alpha-\mathbbm{1}\left\{S_{i}>\hat{g}_{S_{n+1}}(X_{i})\right\}\right)=\frac{1}{n+1}\sum_{S_{i}=\hat{g}_{S_{n+1}}(X_{i})}(\alpha-s^{*}_{i})f(X_{i})-\frac{d}{d\epsilon}R\left(\hat{g}_{S_{n+1}}+\epsilon f\right)\bigg{|}_{\epsilon=0}.

Our desired result now follows from a few observations. First, the left-hand side above can be related to our desired coverage guarantee through the equation,

𝔼[f(Xn+1)(𝟙{YC^n+1}(1α))]\displaystyle\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y\in\hat{C}_{n+1}\}-(1-\alpha))] =𝔼[f(Xn+1)(α𝟙{YC^n+1})]\displaystyle=\mathbb{E}[f(X_{n+1})(\alpha-\mathbbm{1}\{Y\notin\hat{C}_{n+1}\})]
=𝔼[f(Xn+1)(α𝟙{Sn+1>g^Sn+1(Xn+1)})].\displaystyle=\mathbb{E}[f(X_{n+1})(\alpha-\mathbbm{1}\{S_{n+1}>\hat{g}_{S_{n+1}}(X_{n+1})\})].

Moreover, since g^Sn+1\hat{g}_{S_{n+1}} is fit symmetrically, i.e., is invariant to permutations of the input data, we have that {(f(Xi),g^Sn+1(Xi),Si)}i=1n+1\{(f(X_{i}),\hat{g}_{S_{n+1}}(X_{i}),S_{i})\}_{i=1}^{n+1} are exchangeable. Thus, we additionally have that

𝔼[f(Xn+1)(α𝟙{Sn+1>g^Sn+1(Xn+1)})]=𝔼[1n+1i=1n+1f(Xi)(α𝟙{Si>g^Sn+1(Xi)})]=𝔼[1n+1i=1n+1(αsi)f(Xi)𝟙{Si=g^Sn+1(Xi)}]𝔼[ddϵR(g^Sn+1+ϵf)|ϵ=0].\begin{split}\mathbb{E}[f(X_{n+1})(\alpha-&\mathbbm{1}\{S_{n+1}>\hat{g}_{S_{n+1}}(X_{n+1})\})]=\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}f(X_{i})\left(\alpha-\mathbbm{1}\left\{S_{i}>\hat{g}_{S_{n+1}}(X_{i})\right\}\right)\right]\\ &=\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}(\alpha-s_{i}^{*})f(X_{i})\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}\right]-\mathbb{E}\left[\frac{d}{d\epsilon}R\left(\hat{g}_{S_{n+1}}+\epsilon f\right)\bigg{|}_{\epsilon=0}\right].\end{split} (A.2)

Finally, since αsi[0,1]\alpha-s_{i}^{*}\in[0,1], we can bound the first term as

|𝔼[1n+1i=1n+1(αsi)f(Xi)𝟙{Si=g^Sn+1(Xi)}]|\displaystyle\left|\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}(\alpha-s_{i}^{*})f(X_{i})\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}\right]\right| 𝔼[1n+1i=1n+1|f(Xi)|𝟙{Si=g^Sn+1(Xi)}]\displaystyle\leq\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}\right]
=𝔼[|f(Xi)|𝟙{Si=g^Sn+1(Xi)}],\displaystyle=\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}],

where the last step follows by exchangeability. This proves the second claim of Theorem 3.

To get the first claim of Theorem 3, note that when ff is non-negative we can lower bound 1n+1i=1n+1(αsi)f(Xi)𝟙{Si=g^Sn+1(Xi)}\frac{1}{n+1}\sum_{i=1}^{n+1}(\alpha-s_{i}^{*})f(X_{i})\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\} by 0. Plugging this into (A.2) gives us the desired inequality,

𝔼[f(Xn+1)(𝟙{YC^n+1}(1α))]𝔼[ddϵR(g^Sn+1+ϵf)|ϵ=0].\displaystyle\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y\in\hat{C}_{n+1}\}-(1-\alpha))]\geq-\mathbb{E}\left[\frac{d}{d\epsilon}R\left(\hat{g}_{S_{n+1}}+\epsilon f\right)\bigg{|}_{\epsilon=0}\right].

With the proof of Theorem 3 in hand, we are now ready to prove the special cases stated in Theorem 2 and Corollary 1.

Proof of Theorem 2.

The first statement of Theorem 2 follows immediately from the first statement of Theorem 3 in the special case where ()=0\mathcal{R}(\cdot)=0.

To get the second statement of Theorem 2 note that the second statement of Theorem 3 tells us that for any ff\in\mathcal{F},

𝔼[f(Xn+1)(𝟙{Yn+1C^(Xn+1)}(1α))]𝔼[|f(Xi)|𝟙{Si=g^Sn+1(Xi)}].\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha))]\leq\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}].

So, to complete the proof we just need to show that when the distribution of SXS\mid X is continuous

\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}]\leq\frac{d}{n+1}\mathbb{E}\left[\max_{1\leq i\leq n+1}|f(X_{i})|\right].

To do this, first note that by the exchangeability of {(f(Xi),g^Sn+1(Xi),Si)}i=1n+1\{(f(X_{i}),\hat{g}_{S_{n+1}}(X_{i}),S_{i})\}_{i=1}^{n+1} we have

𝔼[|f(Xi)|𝟙{Si=g^Sn+1(Xi)}]\displaystyle\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}] =𝔼[1n+1i=1n+1|f(Xi)|𝟙{Si=g^Sn+1(Xi)}]\displaystyle=\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}\right]
𝔼[(max1jn+1|f(Xj)|)1n+1i=1n+1𝟙{Si=g^Sn+1(Xi)}].\displaystyle\leq\mathbb{E}\left[\left(\max_{1\leq j\leq n+1}|f(X_{j})|\right)\cdot\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}\right].

Moreover, recalling that g^Sn+1(Xi)=Φ(Xi)β^\hat{g}_{S_{n+1}}(X_{i})=\Phi(X_{i})^{\top}\hat{\beta} for some β^d\hat{\beta}\in\mathbb{R}^{d} we additionally have that

\displaystyle\mathbb{P}\left(\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}>d\mid X_{1},\dots,X_{n+1}\right)
=( 1j1<<jd+1n+1 such that  1id+1,Sji=g^Sn+1(Xji)X1,,Xn+1)\displaystyle=\mathbb{P}\left(\exists\,1\leq j_{1}<\dots<j_{d+1}\leq n+1\text{ such that }\forall\,1\leq i\leq d+1,\ S_{j_{i}}=\hat{g}_{S_{n+1}}(X_{j_{i}})\mid X_{1},\dots,X_{n+1}\right)
1j1<<jd+1n+1(βd,such that  1id+1,Sji=Φ(Xji)βX1,,Xn+1)\displaystyle\leq\sum_{1\leq j_{1}<\dots<j_{d+1}\leq n+1}\mathbb{P}\left(\exists\,\beta\in\mathbb{R}^{d},\ \text{such that }\forall\,1\leq i\leq d+1,\ S_{j_{i}}=\Phi(X_{j_{i}})^{\top}\beta\mid X_{1},\dots,X_{n+1}\right)
1j1<<jd+1n+1((Sj1,,Sjd+1)RowSpace([Φ(Xj1)||Φ(Xjd+1)])X1,,Xn+1)\displaystyle\leq\sum_{1\leq j_{1}<\dots<j_{d+1}\leq n+1}\mathbb{P}\left((S_{j_{1}},\dots,S_{j_{d+1}})\in\text{RowSpace}([\Phi(X_{j_{1}})|\dots|\Phi(X_{j_{d+1}})])\mid X_{1},\dots,X_{n+1}\right)
=0,\displaystyle=0,

where the last line follows from the fact that (S_{j_{1}},\dots,S_{j_{d+1}})\mid X_{1},\dots,X_{n+1} are independent and continuously distributed and \text{RowSpace}([\Phi(X_{j_{1}})|\dots|\Phi(X_{j_{d+1}})]) is an at most d-dimensional subspace of \mathbb{R}^{d+1}. From this, we conclude that with probability 1,

1n+1i=1n+1𝟙{Si=g^Sn+1(Xi)}dn+1,\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}\leq\frac{d}{n+1},

and plugging this into our previous calculation we arrive at the desired inequality

𝔼[|f(Xi)|𝟙{Si=g^Sn+1(Xi)}]\displaystyle\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}] 𝔼[(max1jn+1|f(Xj)|)1n+1i=1n+1𝟙{Si=g^Sn+1(Xi)}]\displaystyle\leq\mathbb{E}\left[\left(\max_{1\leq j\leq n+1}|f(X_{j})|\right)\cdot\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}\right]
dn+1𝔼[max1in+1|f(Xi)|].\displaystyle\leq\frac{d}{n+1}\mathbb{E}\left[\max_{1\leq i\leq n+1}|f(X_{i})|\right].
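The combinatorial fact driving this bound, namely that an unregularized quantile regression over a d-dimensional basis interpolates at most d of the points when the scores are continuously distributed, is easy to check numerically. The sketch below is illustrative only (simulated data, scikit-learn's QuantileRegressor as the pinball solver, and an arbitrary tolerance for detecting interpolation); it is not part of the proof.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(0)
n, d, alpha = 500, 5, 0.1
Phi = rng.normal(size=(n, d))                        # basis expansion Phi(X_i)
S = Phi @ rng.normal(size=d) + rng.normal(size=n)    # continuously distributed scores

qr = QuantileRegressor(quantile=1 - alpha, alpha=0.0, fit_intercept=False)
qr.fit(Phi, S)
n_interp = int(np.sum(np.isclose(S, qr.predict(Phi), atol=1e-6)))
print(n_interp, "interpolated points; the argument above says this is at most", d)
```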

Proof of Corollary 1.

This follows immediately by applying Theorem 2 in the special case where ={G𝒢βG𝟙{XG}:βG}\mathcal{F}=\{\sum_{G\in\mathcal{G}}\beta_{G}\mathbbm{1}\{X\in G\}:\beta_{G}\in\mathbb{R}\}. ∎
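To unpack what this gives in the group setting: for any G\in\mathcal{G} with \mathbb{P}(X_{n+1}\in G)>0, the function f(x)=\mathbbm{1}\{x\in G\} is a non-negative element of \mathcal{F}, so the inequality just established (the first statement of Theorem 3 with \mathcal{R}\equiv 0) reads

\mathbb{E}\left[\mathbbm{1}\{X_{n+1}\in G\}\left(\mathbbm{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha)\right)\right]\geq 0,\quad\text{i.e.,}\quad\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G)\geq 1-\alpha,

which is precisely coverage over the group G.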

A.4 Proofs for RKHS functions

In this section we prove Propositions 1 and 2. Throughout both proofs we will let \kappa^{2}:=\sup_{x}K(x,x) denote our upper bound on the kernel and (when applicable) C_{S\mid X}:=\sup_{x,s}p_{S_{i}|X_{i}=x}(s) denote our upper bound on the density of S_{i}|X_{i}. Moreover, we will assume that the data satisfy the following set of moment conditions.

Assumption 1 (Moment conditions for RKHS bounds).

There exist constants C_{3},C_{2},c_{2},C_{S},C_{f},\rho>0 such that

𝔼[Φ(Xi)22]C2d,supf𝔼[|f(Xi)|Φ(Xi)22]C3𝔼[|f(X)|]d,supβ:β2=1𝔼[|Φ(Xi)β|2]c2,\displaystyle\mathbb{E}[\|\Phi(X_{i})\|^{2}_{2}]\leq C_{2}d,\ \sup_{f\in\mathcal{F}}\mathbb{E}[|f(X_{i})|\cdot\|\Phi(X_{i})\|_{2}^{2}]\leq C_{3}\mathbb{E}[|f(X)|]d,\ \sup_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|^{2}]\leq c_{2},
supf𝔼[|f(Xi)|Si2]CS𝔼[|f(Xi)|],supf𝔼[|f(Xi)|2]Cf𝔼[|f(Xi)|], and infβ:β2=1𝔼[|Φ(Xi)β|]ρ.\displaystyle\sup_{f\in\mathcal{F}}\mathbb{E}[|f(X_{i})|S_{i}^{2}]\leq C_{S}\mathbb{E}[|f(X_{i})|],\ \sup_{f\in\mathcal{F}}\sqrt{\mathbb{E}[|f(X_{i})|^{2}]}\leq C_{f}\mathbb{E}[|f(X_{i})|],\text{ and }\inf_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|]\geq\rho.

Furthermore, we also have that 𝔼[|Si|2]<\mathbb{E}[|S_{i}|^{2}]<\infty.

A.4.1 Proof of Proposition 1

Our main idea is to exploit the stability of RKHS regression. We will do this using two main lemmas. The first lemma is a canonical stability result first proven in Bousquet and Elisseeff (2002) that bounds the sensitivity of the RKHS fit to changes of a single data point. While this result is quite powerful, it is not sufficient for our context because it does not account for the extra linear term Φ(Xi)β\Phi(X_{i})^{\top}\beta. Thus, we will also develop a second lemma that controls the stability of the fit to changes in β\beta.

When formalizing these ideas it will be useful to have some additional notation that explicitly separates the dependence of the fit on \beta from the dependence of the fit on the data. Let

g^β:=argmingKK1n+1i=1n+1α(gK(Xi)+Φ(Xi)β,Si)+λgKK2,\hat{g}_{\beta}:=\underset{g_{K}\in\mathcal{F}_{K}}{\text{argmin}}\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(g_{K}(X_{i})+\Phi(X_{i})^{\top}\beta,S_{i})+\lambda\|g_{K}\|_{K}^{2},

denote the result of fitting the RKHS part of the function class with βd\beta\in\mathbb{R}^{d} held fixed. Additionally, let {(X~i,S~i)}i=1n+1\{(\tilde{X}_{i},\tilde{S}_{i})\}_{i=1}^{n+1} denote an independent copy of {(Xi,Si)}i=1n+1\{({X}_{i},{S}_{i})\}_{i=1}^{n+1} and for any A{1,,n+1}A\subseteq\{1,\dots,n+1\} define

g^Aβ:=argmingKK1n+1iAα(gK(Xi)+Φ(Xi)β,Si)+1n+1iAα(gK(X~i)+Φ(X~i)β,S~i)+λgKK2,\hat{g}^{-A}_{\beta}:=\underset{g_{K}\in\mathcal{F}_{K}}{\text{argmin}}\frac{1}{n+1}\sum_{i\notin A}\ell_{\alpha}(g_{K}(X_{i})+\Phi(X_{i})^{\top}\beta,S_{i})+\frac{1}{n+1}\sum_{i\in A}\ell_{\alpha}(g_{K}(\tilde{X}_{i})+\Phi(\tilde{X}_{i})^{\top}\beta,\tilde{S}_{i})+\lambda\|g_{K}\|_{K}^{2},

to be the leave-A-out version of \hat{g}_{\beta} obtained by swapping out \{(X_{i},S_{i})\}_{i\in A} for \{(\tilde{X}_{i},\tilde{S}_{i})\}_{i\in A}. Our first lemma bounds the difference between \hat{g}_{\beta} and \hat{g}^{-A}_{\beta}.

Lemma 1.

Assume that \sup_{x}K(x,x)=\kappa^{2}<\infty. Then, for any two datasets \{(X_{i},S_{i})\}_{i=1}^{n+1} and \{(\tilde{X}_{i},\tilde{S}_{i})\}_{i=1}^{n+1},

g^βg^Aβκ2|A|2λ(n+1).\|\hat{g}_{\beta}-\hat{g}^{-A}_{\beta}\|_{\infty}\leq\frac{\kappa^{2}|A|}{2\lambda(n+1)}.
Proof.

By a straightforward calculation one can easily check that α(gK(Xi)+Φ(Xi)β,Si)\ell_{\alpha}(g_{K}(X_{i})+\Phi(X_{i})^{\top}\beta,S_{i}) is a 1-Lipschitz function of SigK(Xi)Φ(Xi)βS_{i}-g_{K}(X_{i})-\Phi(X_{i})^{\top}\beta (see Lemma 11 for details). Thus, we may apply Theorem 22 of Bousquet and Elisseeff (2002) to conclude that

g^βg^AβKκ|A|2λ(n+1).\|\hat{g}_{\beta}-\hat{g}^{-A}_{\beta}\|_{K}\leq\frac{\kappa|A|}{2\lambda(n+1)}.

Then by applying the reproducing property of the RKHS and our bound on the kernel we arrive at the desired inequality,

g^βg^Aβ\displaystyle\|\hat{g}_{\beta}-\hat{g}^{-A}_{\beta}\|_{\infty} =supx|g^βg^Aβ,K(x,)|supxg^βg^AβKK(x,)K\displaystyle=\sup_{x}|\langle\hat{g}_{\beta}-\hat{g}^{-A}_{\beta},K(x,\cdot)\rangle|\leq\sup_{x}\|\hat{g}_{\beta}-\hat{g}^{-A}_{\beta}\|_{K}\|K(x,\cdot)\|_{K}
=g^βg^AβKsupxK(x,x)1/2κ2|A|2λ(n+1).\displaystyle=\|\hat{g}_{\beta}-\hat{g}^{-A}_{\beta}\|_{K}\sup_{x}K(x,x)^{1/2}\leq\frac{\kappa^{2}|A|}{2\lambda(n+1)}.

Our second lemma bounds the stability of the fit in β\beta.

Lemma 2.

Assume that \sup_{x}K(x,x)=\kappa^{2}<\infty. Then for any dataset \{(X_{i},S_{i})\}_{i=1}^{n+1},

g^β1g^β24κ2λ1n+1i=1n+1|Φ(Xi)(β1β2)|,β1,β2d.\|\hat{g}_{\beta_{1}}-\hat{g}_{\beta_{2}}\|_{\infty}\leq\sqrt{\frac{4\kappa^{2}}{\lambda}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}(\beta_{1}-\beta_{2})|},\quad\forall\beta_{1},\beta_{2}\in\mathbb{R}^{d}.
Proof.

The proof of this lemma is quite similar to the proof of Theorem 22 in Bousquet and Elisseeff (2002). For ease of notation let

Lβ(gK):=1n+1i=1n+1α(gK(Xi)+Φ(Xi)β,Si).L_{\beta}(g_{K}):=\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(g_{K}(X_{i})+\Phi(X_{i})^{\top}\beta,S_{i}).

By the optimality of g^β1\hat{g}_{\beta_{1}} and g^β2\hat{g}_{\beta_{2}} we have

Lβ1(g^β1)+λg^β1K2Lβ1(12g^β1+12g^β2)+λ12g^β1+12g^β2K2,\displaystyle L_{\beta_{1}}(\hat{g}_{\beta_{1}})+\lambda\|\hat{g}_{\beta_{1}}\|_{K}^{2}\leq L_{\beta_{1}}\left(\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right)+\lambda\left\|\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right\|_{K}^{2},
and Lβ2(g^β2)+λg^β2K2Lβ2(12g^β1+12g^β2)+λ12g^β1+12g^β2K2.\displaystyle L_{\beta_{2}}(\hat{g}_{\beta_{2}})+\lambda\|\hat{g}_{\beta_{2}}\|_{K}^{2}\leq L_{\beta_{2}}\left(\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right)+\lambda\left\|\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right\|_{K}^{2}.

Moreover, by the convexity of Lβ1()L_{\beta_{1}}(\cdot) and Lβ2()L_{\beta_{2}}(\cdot) it also holds that

Lβ1(12g^β1+12g^β2)12Lβ1(g^β1)+12Lβ1(g^β2),\displaystyle L_{\beta_{1}}\left(\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right)\leq\frac{1}{2}L_{\beta_{1}}(\hat{g}_{\beta_{1}})+\frac{1}{2}L_{\beta_{1}}(\hat{g}_{\beta_{2}}),
and Lβ2(12g^β1+12g^β2)12Lβ2(g^β1)+12Lβ2(g^β2).\displaystyle L_{\beta_{2}}\left(\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right)\leq\frac{1}{2}L_{\beta_{2}}(\hat{g}_{\beta_{1}})+\frac{1}{2}L_{\beta_{2}}(\hat{g}_{\beta_{2}}).

Putting all four of these inequalities together we find that

λ2g^β1g^β22K\displaystyle\frac{\lambda}{2}\left\|\hat{g}_{\beta_{1}}-\hat{g}_{\beta_{2}}\right\|^{2}_{K} =λg^β1K2+λg^β2K22λ12g^β1+12g^β2K2\displaystyle=\lambda\|\hat{g}_{\beta_{1}}\|_{K}^{2}+\lambda\|\hat{g}_{\beta_{2}}\|_{K}^{2}-2\lambda\left\|\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right\|_{K}^{2}
Lβ1(12g^β1+12g^β2)+Lβ2(12g^β1+12g^β2)Lβ1(g^β1)Lβ2(g^β2)\displaystyle\leq L_{\beta_{1}}\left(\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right)+L_{\beta_{2}}\left(\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right)-L_{\beta_{1}}(\hat{g}_{\beta_{1}})-L_{\beta_{2}}(\hat{g}_{\beta_{2}})
12Lβ1(g^β1)+12Lβ1(g^β2)+12Lβ2(g^β1)+12Lβ2(g^β2)Lβ1(g^β1)Lβ2(g^β2)\displaystyle\leq\frac{1}{2}L_{\beta_{1}}(\hat{g}_{\beta_{1}})+\frac{1}{2}L_{\beta_{1}}(\hat{g}_{\beta_{2}})+\frac{1}{2}L_{\beta_{2}}(\hat{g}_{\beta_{1}})+\frac{1}{2}L_{\beta_{2}}(\hat{g}_{\beta_{2}})-L_{\beta_{1}}(\hat{g}_{\beta_{1}})-L_{\beta_{2}}(\hat{g}_{\beta_{2}})
=12(Lβ2(g^β1)Lβ1(g^β1)+Lβ1(g^β2)Lβ2(g^β2))\displaystyle=\frac{1}{2}\left(L_{\beta_{2}}(\hat{g}_{\beta_{1}})-L_{\beta_{1}}(\hat{g}_{\beta_{1}})+L_{\beta_{1}}(\hat{g}_{\beta_{2}})-L_{\beta_{2}}(\hat{g}_{\beta_{2}})\right)
1n+1i=1n+1|Φ(Xi)(β1β2)|,\displaystyle\leq\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}(\beta_{1}-\beta_{2})|,

where the last inequality follows from the Lipschitz property of α(,)\ell_{\alpha}(\cdot,\cdot) (see Lemma 11). To conclude the proof one simply notes that by the reproducing property of the RKHS we have that

g^β1g^β2κg^β1g^β2K4κ2λ1n+1i=1n+1|Φ(Xi)(β1β2)|,\|\hat{g}_{\beta_{1}}-\hat{g}_{\beta_{2}}\|_{\infty}\leq\kappa\|\hat{g}_{\beta_{1}}-\hat{g}_{\beta_{2}}\|_{K}\leq\sqrt{\frac{4\kappa^{2}}{\lambda}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}(\beta_{1}-\beta_{2})|},

as desired. ∎

In order to apply this lemma we will need to control the size of |\Phi(X_{i})^{\top}(\beta_{1}-\beta_{2})|. This is done in our next result. The statement of this lemma may look somewhat peculiar due to the presence of a re-weighting function f\in\mathcal{F}. To aid intuition it may be useful to keep in mind the special case f=1, which turns the expectation below into a simple tail probability. Our reason for stating the lemma in this form is that it will fit seamlessly into our later calculations without the need for additional exposition.

Lemma 3.

Assume that X_{1},\dots,X_{n+1}\stackrel{i.i.d.}{\sim}P_{X}. Let f:\mathcal{X}\to\mathbb{R} and assume that there exist constants C_{2},C_{3}\geq 1 such that \mathbb{E}[\|\Phi(X_{i})\|_{2}^{2}]\leq C_{2}d and \mathbb{E}[|f(X_{i})|\cdot\|\Phi(X_{i})\|_{2}^{2}]\leq C_{3}\mathbb{E}[|f(X)|]d. Then for any \epsilon>0 and 1\leq j\leq n+1,

𝔼[|f(Xj)|𝟙{supβ:βϵ1n+1i=1n+1|Φ(Xi)β|>2ϵC2d}]O(𝔼[|f(X)|]n).\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\sup_{\beta:\|\beta\|\leq\epsilon}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|>2\epsilon\sqrt{C_{2}d}\right\}\right]\leq O\left(\frac{\mathbb{E}[|f(X)|]}{n}\right).
Proof.

By the Cauchy–Schwarz inequality we have

supβ:βϵ1n+1i=1n+1|Φ(Xi)β|supβ:βϵ1n+1i=1n+1Φ(Xi)2β2=ϵn+1i=1n+1Φ(Xi)2.\sup_{\beta:\|\beta\|\leq\epsilon}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|\leq\sup_{\beta:\|\beta\|\leq\epsilon}\frac{1}{n+1}\sum_{i=1}^{n+1}\|\Phi(X_{i})\|_{2}\|\beta\|_{2}=\frac{\epsilon}{n+1}\sum_{i=1}^{n+1}\|\Phi(X_{i})\|_{2}.

Additionally, by Jensen’s inequality it also holds that \mathbb{E}[\|\Phi(X_{i})\|_{2}]\leq\sqrt{\mathbb{E}[\|\Phi(X_{i})\|^{2}_{2}]}\leq\sqrt{C_{2}d}. So, putting these two inequalities together we arrive at

𝔼[|f(Xj)|𝟙{supβ:βϵ1n+1i=1n+1|Φ(Xi)β|>2ϵC2d}]\displaystyle\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\sup_{\beta:\|\beta\|\leq\epsilon}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|>2\epsilon\sqrt{C_{2}d}\right\}\right]
𝔼[|f(Xj)|𝟙{1n+1i=1n+1Φ(Xi)2𝔼[Φ(Xi)2]>C2d}]\displaystyle\leq\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\frac{1}{n+1}\sum_{i=1}^{n+1}\|\Phi(X_{i})\|_{2}-\mathbb{E}[\|\Phi(X_{i})\|_{2}]>\sqrt{C_{2}d}\right\}\right]
1C2d𝔼[|f(Xj)|(1n+1i=1n+1Φ(Xi)2𝔼[Φ(Xi)2])2]=O(𝔼[|f(X)|]n).\displaystyle\leq\frac{1}{C_{2}d}\mathbb{E}\left[|f(X_{j})|\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\|\Phi(X_{i})\|_{2}-\mathbb{E}[\|\Phi(X_{i})\|_{2}]\right)^{2}\right]=O\left(\frac{\mathbb{E}[|f(X)|]}{n}\right).
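In the special case f\equiv 1 mentioned before the lemma, the conclusion reduces to the tail bound

\mathbb{P}\left(\sup_{\beta:\|\beta\|\leq\epsilon}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|>2\epsilon\sqrt{C_{2}d}\right)\leq O\left(\frac{1}{n}\right),

so with probability 1-O(1/n) every coefficient vector in the \epsilon-ball induces an average linear effect of size at most 2\epsilon\sqrt{C_{2}d}.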

The final preliminary lemmas that we will require control the maximum possible sizes of \hat{g}_{S_{n+1},K} and \hat{\beta}_{S_{n+1}}. Once again these lemmas involve re-weighting functions, whose purpose is to ease our calculations further on.

Lemma 4.

It holds deterministically that

g^Sn+1,KK1λ1n+1i=1n+1|Si|.\|\hat{g}_{S_{n+1},K}\|_{K}\leq\frac{1}{\sqrt{\lambda}}\sqrt{\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|}.

If in addition, (X1,S1),,(Xn+1,Sn+1)i.i.d.P(X_{1},S_{1}),\dots,(X_{n+1},S_{n+1})\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P, 𝔼[Si2]<\mathbb{E}[S_{i}^{2}]<\infty, and f:𝒳f:\mathcal{X}\to\mathbb{R} is a function satisfying 𝔼[|f(Xi)|Si2]Cf,S𝔼[|f(Xi)|]\mathbb{E}[|f(X_{i})|S_{i}^{2}]\leq C_{f,S}\mathbb{E}[|f(X_{i})|] for some Cf,S>0C_{f,S}>0, then we also have that for all 1jn+11\leq j\leq n+1,

\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\|\hat{g}_{S_{n+1},K}\|_{K}\geq\frac{\sqrt{2\mathbb{E}[|S_{i}|]}}{\sqrt{\lambda}}\right\}\right]\leq O\left(\frac{\mathbb{E}[|f(X_{i})|]}{n}\right)
Proof.

Taking \beta=0 and g_{K}=0 gives objective value

1n+1i=1n+1α(0,Si)+λ0K21n+1i=1n+1|Si|.\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(0,S_{i})+\lambda\|0\|_{K}^{2}\leq\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|.

So, since (g^Sn+1,K,β^Sn+1)(\hat{g}_{S_{n+1},K},\hat{\beta}_{S_{n+1}}) is a minimizer of the quantile regression objective we must have that

λg^Sn+1,K2K1n+1i=1n+1α(g^Sn+1,K(Xi)+Φ(Xi)β^Sn+1,Si)+λg^Sn+1,K2K1n+1i=1n+1|Si|.\lambda\|\hat{g}_{S_{n+1},K}\|^{2}_{K}\leq\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(\hat{g}_{S_{n+1},K}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}},S_{i})+\lambda\|\hat{g}_{S_{n+1},K}\|^{2}_{K}\leq\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|.

This proves the first part of the lemma. To get the second part we simply note that

𝔼[|f(Xj)|𝟙{g^Sn+1,KK2𝔼[|Si|]λ}]\displaystyle\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\|\hat{g}_{S_{n+1},K}\|_{K}\geq\frac{\sqrt{2\mathbb{E}[|S_{i}|]}}{\sqrt{\lambda}}\right\}\right] 𝔼[|f(Xj)|𝟙{1n+1i=1n+1|Si|𝔼[|Si|]𝔼[|Si|]}]\displaystyle\leq\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|-\mathbb{E}[|S_{i}|]\geq\mathbb{E}[|S_{i}|]\right\}\right]
\displaystyle\leq\frac{1}{\mathbb{E}[|S_{i}|]^{2}}\mathbb{E}\left[|f(X_{j})|\left(\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|-\mathbb{E}[|S_{i}|]\right)^{2}\right]
=O(𝔼[|f(X)|]n).\displaystyle=O\left(\frac{\mathbb{E}[|f(X)|]}{n}\right).

Lemma 5.

Let (X_{1},S_{1}),\dots,(X_{n+1},S_{n+1})\stackrel{i.i.d.}{\sim}P and f:\mathcal{X}\to\mathbb{R}. Assume that \mathbb{E}[S_{i}^{2}]<\infty, \sup_{x}K(x,x)=\kappa^{2}<\infty, and there exist constants C_{2},c_{2},C_{f},C_{f,S},\rho>0 such that \mathbb{E}[|f(X_{i})|S_{i}^{2}]\leq C_{f,S}\mathbb{E}[|f(X_{i})|], \sqrt{\mathbb{E}[|f(X_{i})|^{2}]}\leq C_{f}\mathbb{E}[|f(X_{i})|], \inf_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|]\geq\rho, \sup_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|^{2}]^{1/2}\leq c_{2}, and \mathbb{E}[\|\Phi(X_{i})\|_{2}^{2}]\leq C_{2}d. Then there exists a constant c_{\beta}>0 such that for all 1\leq j\leq n+1,

𝔼[|f(Xj)|𝟙{β^Sn+12>1λcβ}]O(d𝔼[|f(Xi)|]n).\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\|\hat{\beta}_{S_{n+1}}\|_{2}>\frac{1}{\sqrt{\lambda}}c_{\beta}\right\}\right]\leq O\left(\frac{d\mathbb{E}[|f(X_{i})|]}{n}\right).
Proof.

Observe that

1n+1i=1n+1α(g^Sn+1,K(Xi)+Φ(Xi)β^Sn+1,Si)+λg^Sn+1,K2K\displaystyle\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(\hat{g}_{S_{n+1},K}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}},S_{i})+\lambda\|\hat{g}_{S_{n+1},K}\|^{2}_{K}
1n+1i=1n+1min{α,1α}|Sig^Sn+1,K(Xi)Φ(Xi)β^Sn+1|\displaystyle\geq\frac{1}{n+1}\sum_{i=1}^{n+1}\min\{\alpha,1-\alpha\}|S_{i}-\hat{g}_{S_{n+1},K}(X_{i})-\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}|
min{α,1α}(1n+1i=1n+1|Φ(Xi)β^Sn+1|1n+1i=1n+1|Si|1n+1i=1n+1|g^Sn+1,K(Xi)|).\displaystyle\geq\min\{\alpha,1-\alpha\}\left(\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}|-\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|-\frac{1}{n+1}\sum_{i=1}^{n+1}|\hat{g}_{S_{n+1},K}(X_{i})|\right).

Moreover, by the reproducing property of the RKHS

1n+1i=1n+1|g^Sn+1,K(Xi)|g^Sn+1,Kκg^Sn+1,KK,\frac{1}{n+1}\sum_{i=1}^{n+1}|\hat{g}_{S_{n+1},K}(X_{i})|\leq\|\hat{g}_{S_{n+1},K}\|_{\infty}\leq\kappa\|\hat{g}_{S_{n+1},K}\|_{K},

and by Lemma 4

g^Sn+1,KK1λ1n+1i=1n+1|Si|.\|\hat{g}_{S_{n+1},K}\|_{K}\leq\frac{1}{\sqrt{\lambda}}\sqrt{\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|}.

So, combining these two facts we find that

1n+1i=1n+1|Φ(Xi)β^Sn+1|1min{α,1α}(κλ1n+1i=1n+1|Si|+1n+1i=1n+1|Si|).\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}|\leq\frac{1}{\min\{\alpha,1-\alpha\}}\left(\frac{\kappa}{\sqrt{\lambda}}\sqrt{\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|}+\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|\right).

To use this inequality to bound \|\hat{\beta}_{S_{n+1}}\|_{2} we will need to lower bound the empirical mean of |\Phi(X_{i})^{\top}\beta| uniformly over unit-norm \beta. This is done in Lemma 12 below, where we show that there exist constants c,c^{\prime}>0 such that

(infβ:β=11n+1i=1n+1|Φ(Xi)β|c)1cd2(n+1)2.\mathbb{P}\left(\inf_{\beta:\|\beta\|=1}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|\geq c\right)\geq 1-c^{\prime}\frac{d^{2}}{(n+1)^{2}}.

Thus,

𝔼[|f(Xj)|𝟙{β^Sn+12>1λcβ}]𝔼[|f(Xj)|𝟙{1n+1i=1n+1|Si|𝔼[|Si|]>Ω(1)}]\displaystyle\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\|\hat{\beta}_{S_{n+1}}\|_{2}>\frac{1}{\sqrt{\lambda}}c_{\beta}\right\}\right]\leq\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|-\mathbb{E}[|S_{i}|]>\Omega(1)\right\}\right]
\displaystyle\ \ \ \ +\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\inf_{\beta:\|\beta\|=1}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|<c\right\}\right]
O(1)𝔼[|f(Xj)|(1n+1i=1n+1|Si|𝔼|Si|)2]+𝔼[|f(Xi)|2]1/2cdn+1O(d𝔼[|f(Xi)|]n+1).\displaystyle\leq O(1)\mathbb{E}\left[|f(X_{j})|\left(\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|-\mathbb{E}|S_{i}|\right)^{2}\right]+\mathbb{E}[|f(X_{i})|^{2}]^{1/2}\sqrt{c^{\prime}}\frac{d}{n+1}\leq O\left(\frac{d\mathbb{E}[|f(X_{i})|]}{n+1}\right).

With all of these results in hand we are ready to prove Proposition 1.

Proof of Proposition 1.

We will exploit the stability of the RKHS fit. Let ϵ=O(λn2d)\epsilon=O\left(\frac{\lambda}{n^{2}\sqrt{d}}\right) and define the event

E:={supβ:β2ϵ1n+1i=1n+1|Φ(Xi)β|2ϵC2d,β^Sn+12cβλ}.E:=\left\{\sup_{\beta:\|\beta\|_{2}\leq\epsilon}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|\leq 2\epsilon\sqrt{C_{2}d},\ \|\hat{\beta}_{S_{n+1}}\|_{2}\leq\frac{c_{\beta}}{\sqrt{\lambda}}\right\}.

By Lemmas 3 and 5 we know that

𝔼[|f(Xi)|𝟙{Si=g^Sn+1,K(Xi)+Φ(Xi)β^Sn+1}]\displaystyle\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},K}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}]
𝔼[|f(Xi)|𝟙{Si=g^Sn+1,K(Xi)+Φ(Xi)β^Sn+1}𝟙{E}]+O(d𝔼[|f(Xi)|]n).\displaystyle\leq\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},K}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\mathbbm{1}\{E\}]+O\left(\frac{d\mathbb{E}[|f(X_{i})|]}{n}\right).

Thus, we just need to focus on what happens on the event EE. By applying the exchangeability of the quadruples (g^Sn+1,K(Xi),β^Sn+1,Xi,Si)(\hat{g}_{S_{n+1},K}(X_{i}),\hat{\beta}_{S_{n+1}},X_{i},S_{i}) we have that

𝔼[|f(Xi)|𝟙{Si=g^Sn+1,K(Xi)+Φ(Xi)β^Sn+1}𝟙{E}]\displaystyle\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},K}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\mathbbm{1}\{E\}]
=𝔼[(1n+1i=1n+1|f(Xi)|𝟙{Si=g^Sn+1,K(Xi)+Φ(Xi)β^Sn+1})𝟙{E}]\displaystyle=\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},K}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\right)\mathbbm{1}\{E\}\right]
𝔼[max1in+1|f(Xi)|𝔼[(1n+1i=1n+1𝟙{Si=g^β^Sn+1(Xi)+Φ(Xi)β^Sn+1})𝟙{E}(Xi)i=1n+1]].\displaystyle\leq\mathbb{E}\left[\max_{1\leq i\leq n+1}|f(X_{i})|\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{\hat{\beta}_{S_{n+1}}}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\right)\mathbbm{1}\{E\}\mid(X_{i})_{i=1}^{n+1}\right]\right].

To bound this quantity we just need to control the inner expectation. We will begin by fixing a large integer m>1m>1 and applying the inequality

𝔼[(1n+1i=1n+1𝟙{Si=g^β^Sn+1(Xi)+Φ(Xi)β^Sn+1})𝟙{E}(Xi)i=1n+1]\displaystyle\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{\hat{\beta}_{S_{n+1}}}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\right)\mathbbm{1}\{E\}\mid(X_{i})_{i=1}^{n+1}\right]
𝔼[(1n+1i=1n+1𝟙{Si=g^β^Sn+1(Xi)+Φ(Xi)β^Sn+1})m𝟙{E}(Xi)i=1n+1]1/m.\displaystyle\leq\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{\hat{\beta}_{S_{n+1}}}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\right)^{m}\mathbbm{1}\{E\}\mid(X_{i})_{i=1}^{n+1}\right]^{1/m}.

Our motivation for applying this bound is that by choosing m sufficiently large we will be able to swap a sum and a maximum without losing too much. More precisely, let \mathcal{N}\subseteq\mathbb{R}^{d} be a minimal-size \epsilon-net of \{\beta\in\mathbb{R}^{d}:\|\beta\|_{2}\leq c_{\beta}/\sqrt{\lambda}\}. It is well known that there exists an absolute constant C_{N}>0 such that |\mathcal{N}|\leq\exp(C_{N}d\log(\frac{c_{\beta}}{\sqrt{\lambda}\epsilon})). Then, using this \epsilon-net we compute that

𝔼[(1n+1i=1n+1𝟙{Si=g^β^Sn+1(Xi)+Φ(Xi)β^Sn+1})m𝟙{E}(Xi)i=1n+1]\displaystyle\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{\hat{\beta}_{S_{n+1}}}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\right)^{m}\mathbbm{1}\{E\}\mid(X_{i})_{i=1}^{n+1}\right]
𝔼[supβ:β2cβ/λ(1n+1i=1n+1𝟙{Si=g^β(Xi)+Φ(Xi)β})m𝟙{E}|(Xi)i=1n+1]\displaystyle\leq\mathbb{E}\left[\sup_{\beta:\|\beta\|_{2}\leq c_{\beta}/\sqrt{\lambda}}\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{\beta}(X_{i})+\Phi(X_{i})^{\top}\beta\}\right)^{m}\mathbbm{1}\{E\}|(X_{i})_{i=1}^{n+1}\right]
𝔼[supβ𝒩(1n+1i=1n+1𝟙{|Sig^β(Xi)Φ(Xi)β|O(1/n)})m𝟙{E}(Xi)i=1n+1]\displaystyle\leq\mathbb{E}\left[\sup_{\beta\in\mathcal{N}}\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{|S_{i}-\hat{g}_{\beta}(X_{i})-\Phi(X_{i})^{\top}\beta|\leq O\left(1/n\right)\}\right)^{m}\mathbbm{1}\{E\}\mid(X_{i})_{i=1}^{n+1}\right]
β𝒩𝔼[(1n+1i=1n+1𝟙{|Sig^β(Xi)Φ(Xi)β|O(1/n)})m(Xi)i=1n+1],\displaystyle\leq\sum_{\beta\in\mathcal{N}}\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{|S_{i}-\hat{g}_{\beta}(X_{i})-\Phi(X_{i})^{\top}\beta|\leq O\left(1/n\right)\}\right)^{m}\mid(X_{i})_{i=1}^{n+1}\right],

where the first inequality follows from the definition of EE and the second inequality uses both Lemma 2 and the fact that on the event EE

supβ:βϵ1n+1i=1n+1|Φ(Xi)β|2ϵC2d\displaystyle\sup_{\beta:\|\beta\|\leq\epsilon}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|\leq 2\epsilon\sqrt{C_{2}d}
\displaystyle\implies supβ:βϵmax1in+1|Φ(Xi)β|supβ:βϵn+1n+1i=1n+1|Φ(Xi)β|(n+1)2ϵC2dO(1n).\displaystyle\sup_{\beta:\|\beta\|\leq\epsilon}\max_{1\leq i\leq n+1}|\Phi(X_{i})^{\top}\beta|\leq\sup_{\beta:\|\beta\|\leq\epsilon}\frac{n+1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|\leq(n+1)2\epsilon\sqrt{C_{2}d}\leq O\left(\frac{1}{n}\right).

Continuing this calculation directly we see that,

𝔼[(1n+1i=1n+1𝟙{|Sig^β(Xi)Φ(Xi)β|O(1/n)})m(Xi)i=1n+1]\displaystyle\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{|S_{i}-\hat{g}_{\beta}(X_{i})-\Phi(X_{i})^{\top}\beta|\leq O\left(1/n\right)\}\right)^{m}\mid(X_{i})_{i=1}^{n+1}\right]
=k=1m(n+1k)(mk)k!kmk1(n+1)m𝔼[i=1k𝟙{|Sig^β(Xi)Φ(Xi)β|O(1/n)}(Xi)i=1n+1]\displaystyle=\sum_{k=1}^{m}\binom{n+1}{k}\binom{m}{k}k!k^{m-k}\frac{1}{(n+1)^{m}}\mathbb{E}\left[\prod_{i=1}^{k}\mathbbm{1}\{|S_{i}-\hat{g}_{\beta}(X_{i})-\Phi(X_{i})^{\top}\beta|\leq O\left(1/n\right)\}\mid(X_{i})_{i=1}^{n+1}\right]
k=1m(n+1)k(mk)mmk(n+1)m𝔼[i=1k𝟙{|Sig^{1,,k}β(Xi)Φ(Xi)β|O(kλn)}(Xi)i=1n+1],\displaystyle\leq\sum_{k=1}^{m}(n+1)^{k}\binom{m}{k}\frac{m^{m-k}}{(n+1)^{m}}\mathbb{E}\left[\prod_{i=1}^{k}\mathbbm{1}\left\{|S_{i}-\hat{g}^{-\{1,\dots,k\}}_{\beta}(X_{i})-\Phi(X_{i})^{\top}\beta|\leq O\left(\frac{k}{\lambda n}\right)\right\}\mid(X_{i})_{i=1}^{n+1}\right],

where the last line applies Lemma 1. Finally, using the fact that Si|XiS_{i}|X_{i} has a bounded density we may upper bound the above display by

k=1m(n+1)k(mk)mmk(n+1)mO((kλn)k)O((mλn)m)k=1m(mk)O(2m(mλn)m).\displaystyle\sum_{k=1}^{m}(n+1)^{k}\binom{m}{k}\frac{m^{m-k}}{(n+1)^{m}}O\left(\left(\frac{k}{\lambda n}\right)^{k}\right)\leq O\left(\left(\frac{m}{\lambda n}\right)^{m}\right)\sum_{k=1}^{m}\binom{m}{k}\leq O\left(2^{m}\left(\frac{m}{\lambda n}\right)^{m}\right).

Putting this all together we conclude that

𝔼[(1n+1i=1n+1𝟙{Si=g^β^Sn+1(Xi)+Φ(Xi)β^Sn+1})m(Xi)i=1n+1]1mO(exp(CNdlog(1λϵ)m)mλn).\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{\hat{\beta}_{S_{n+1}}}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\right)^{m}\mid(X_{i})_{i=1}^{n+1}\right]^{\frac{1}{m}}\leq O\left(\exp\left(\frac{C_{N}d\log\left(\frac{1}{\sqrt{\lambda}\epsilon}\right)}{m}\right)\frac{m}{\lambda n}\right).

The desired result then follows by taking m=dlog(1λϵ)m=d\log(\frac{1}{\sqrt{\lambda}\epsilon}) and plugging in our definition for ϵ\epsilon.
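Spelling out this last step: with m=d\log(\frac{1}{\sqrt{\lambda}\epsilon}) the prefactor satisfies \exp(C_{N}d\log(\frac{1}{\sqrt{\lambda}\epsilon})/m)=e^{C_{N}}=O(1), and with \epsilon chosen of order \lambda/(n^{2}\sqrt{d}) as above we have \log(\frac{1}{\sqrt{\lambda}\epsilon})=O(\log(\frac{n^{2}\sqrt{d}}{\lambda^{3/2}})), so the bound in the previous display becomes

O\left(\frac{m}{\lambda n}\right)=O\left(\frac{d}{\lambda n}\log\left(\frac{n^{2}\sqrt{d}}{\lambda^{3/2}}\right)\right).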

A.4.2 Proof of Proposition 2

To simplify the notation let

Ln(β,gK):=1n+1i=1nα(Φ(Xi)β+gK(Xi),Si)\displaystyle L_{n}(\beta,g_{K}):=\frac{1}{n+1}\sum_{i=1}^{n}\ell_{\alpha}(\Phi(X_{i})^{\top}\beta+g_{K}(X_{i}),S_{i})
and L(β,gK):=𝔼[α(Φ(Xi)β+gK(Xi),Si)],\displaystyle L_{\infty}(\beta,g_{K}):=\mathbb{E}[\ell_{\alpha}(\Phi(X_{i})^{\top}\beta+g_{K}(X_{i}),S_{i})],

denote the empirical and population losses and let

Mn(β,gK):=Ln(β,gK)+λgKK2\displaystyle M_{n}(\beta,g_{K}):=L_{n}(\beta,g_{K})+\lambda\|g_{K}\|_{K}^{2}
and M(β,gK):=L(β,gK)+λgKK2,\displaystyle M_{\infty}(\beta,g_{K}):=L_{\infty}(\beta,g_{K})+\lambda\|g_{K}\|_{K}^{2},

denote the corresponding empirical and population objectives. Note that M_{n} and M_{\infty} are strictly convex in g_{K} and convex in \beta. Thus, we may let (\hat{B}_{n},\hat{g}_{n,K}),(B^{*},g_{K}^{*})\in 2^{\mathbb{R}^{d}}\times\mathcal{F}_{K} denote the minimizers of M_{n} and M_{\infty}, respectively. To further ease notation in what follows we will sometimes use \hat{\beta}_{n} and \beta^{*} to denote arbitrary elements of \hat{B}_{n} and B^{*}. Finally, we will let \Pi_{\hat{B}_{n}},\Pi_{B^{*}}:\mathbb{R}^{d}\to\mathbb{R}^{d} denote the projection operators onto \hat{B}_{n} and B^{*}, respectively.

With these preliminary definitions in hand we now formally state the assumptions of Proposition 2. Our first assumption is that MM_{\infty} is locally strongly convex around its minimum.

Assumption 2 (Population Strong Convexity).

Let d(\beta,g_{K}):=\inf_{\beta^{\prime}\in B^{*}}\|\beta-\beta^{\prime}\|_{2}+\|g_{K}-g^{*}_{K}\|_{K} denote the distance from (\beta,g_{K}) to the nearest population minimizer. Then, there exist constants C_{M},\delta_{M}>0 such that

d(\beta,g_{K})\leq\delta_{M}\implies M_{\infty}(\beta,g_{K})-M_{\infty}(\beta^{*},g^{*}_{K})\geq C_{M}d(\beta,g_{K})^{2}.

Overall, we believe that this assumption is mild and should hold for all distributions of interest. For instance, for continuous data it is easy to check that this condition holds whenever S\mid X has a positive density on \mathbb{R}. On the other hand, for discrete data we expect the even stronger inequality M_{\infty}(\beta,g_{K})-M_{\infty}(\beta^{*},g^{*}_{K})\geq C_{M}d(\beta,g_{K}) to hold. This is due to the fact that for discrete data L_{\infty}(\cdot,\cdot) has sharp kinks (jump discontinuities in its derivative) that give rise to large increases in the loss when (\hat{\beta},\hat{g}_{n,K}) moves away from (B^{*},g^{*}).
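To expand on the continuous case, here is a short sketch (under enough regularity to differentiate under the expectation, which is an assumption of this remark rather than of the proposition) of why a positive conditional density yields the required local curvature. Since the pinball loss satisfies

\frac{\partial^{2}}{\partial\theta^{2}}\,\mathbb{E}\left[\ell_{\alpha}(\theta,S_{i})\mid X_{i}=x\right]=p_{S_{i}\mid X_{i}=x}(\theta),

perturbing the population fit along (\dot{\beta},\dot{g}_{K}) with h(x):=\Phi(x)^{\top}\dot{\beta}+\dot{g}_{K}(x) and q^{*}(x):=\Phi(x)^{\top}\beta^{*}+g^{*}_{K}(x) gives

\frac{d^{2}}{dt^{2}}M_{\infty}(\beta^{*}+t\dot{\beta},g^{*}_{K}+t\dot{g}_{K})\bigg{|}_{t=0}=\mathbb{E}\left[h(X_{i})^{2}\,p_{S_{i}\mid X_{i}}(q^{*}(X_{i}))\right]+2\lambda\|\dot{g}_{K}\|_{K}^{2},

which is strictly positive in every non-zero direction whenever the conditional density is positive near q^{*} (using the moment conditions of Assumption 3 when \dot{g}_{K}=0). This is the calculation behind the claim that Assumption 2 holds for such distributions.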

The second assumption we will need is a set of moment conditions on SS and XX.

Assumption 3 (Moment Conditions).

There exist constants C_{2},\rho>0 such that

𝔼[Φ(Xi)22]C2d and infβ:β2=1𝔼[|Φ(Xi)β|]ρ.\displaystyle\mathbb{E}[\|\Phi(X_{i})\|^{2}_{2}]\leq C_{2}d\ \ \text{ and }\ \ \inf_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|]\geq\rho.

Furthermore, we also have that 𝔼[|Si|2]<\mathbb{E}[|S_{i}|^{2}]<\infty.

With these assumptions in hand we are ready to prove Proposition 2. We begin by giving a technical lemma that controls the concentration of LnL_{n} around LL_{\infty}.

Lemma 6.

Assume that (X_{1},S_{1}),\dots,(X_{n},S_{n})\stackrel{i.i.d.}{\sim}P and that there exist constants C_{2},\kappa>0 such that \mathbb{E}[\|\Phi(X_{i})\|^{2}_{2}]\leq C_{2}d and \sup_{x}K(x,x)=\kappa^{2}<\infty. Then for any \delta_{1},\delta_{2}>0,

𝔼[supβΠBβ2δ1,gKgKKδ2|Ln(β,gK)Ln(ΠBβ,gK)(L(β,gK)L(ΠBβ,gK))|]\displaystyle\mathbb{E}\left[\sup_{\|\beta-\Pi_{B^{*}}\beta\|_{2}\leq\delta_{1},\ \|g_{K}-g^{*}_{K}\|_{K}\leq\delta_{2}}\left|L_{n}(\beta,g_{K})-L_{n}(\Pi_{B^{*}}\beta,g^{*}_{K})-(L_{\infty}(\beta,g_{K})-L_{\infty}(\Pi_{B^{*}}\beta,g^{*}_{K}))\right|\right]
O(δ1dn+δ21n)\displaystyle\hskip 28.45274pt\leq O\left(\delta_{1}\sqrt{\frac{d}{n}}+\delta_{2}\sqrt{\frac{1}{n}}\right)
Proof.

Let E:={(β,gK)d×K:βΠBβ2δ1,gKgKKδ2}E:=\{(\beta,g_{K})\in\mathbb{R}^{d}\times\mathcal{F}_{K}:\|\beta-\Pi_{B^{*}}\beta\|_{2}\leq\delta_{1},\ \|g_{K}-g^{*}_{K}\|_{K}\leq\delta_{2}\} and σ1,,σn+1i.i.dUnif({±1})\sigma_{1},\dots,\sigma_{n+1}\stackrel{{\scriptstyle i.i.d}}{{\sim}}\text{Unif}(\{\pm 1\}) be Rademacher random variables. Since the pinball loss is 1-Lipschitz (see Lemma 11) we may apply the symmetrization and contraction properties of Rademacher complexity to conclude that

𝔼[sup(β,gK)E|Ln(β,gK)Ln(ΠBβ,gK)(L(β,gK)L(ΠBβ,gK))|]\displaystyle\mathbb{E}\left[\sup_{(\beta,g_{K})\in E}\left|L_{n}(\beta,g_{K})-L_{n}(\Pi_{B^{*}}\beta,g^{*}_{K})-(L_{\infty}(\beta,g_{K})-L_{\infty}(\Pi_{B^{*}}\beta,g^{*}_{K}))\right|\right]
2𝔼[sup(β,gK)E|1ni=1nσi((Φ(Xi)β+gK(Xi),Si)(Φ(Xi)ΠBβ+gK(Xi),Si))|]\displaystyle\leq 2\mathbb{E}\left[\sup_{(\beta,g_{K})\in E}\left|\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}(\ell(\Phi(X_{i})^{\top}\beta+g_{K}(X_{i}),S_{i})-\ell(\Phi(X_{i})^{\top}\Pi_{B^{*}}\beta+g^{*}_{K}(X_{i}),S_{i}))\right|\right]
\displaystyle\leq 2\mathbb{E}\left[\sup_{(\beta,g_{K})\in E}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}(\Phi(X_{i})^{\top}(\beta-\Pi_{B^{*}}\beta)+g_{K}(X_{i})-g^{*}_{K}(X_{i}))\right]
\displaystyle\leq 2\mathbb{E}\left[\sup_{\|\beta-\Pi_{B^{*}}\beta\|_{2}\leq\delta_{1}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}\Phi(X_{i})^{\top}(\beta-\Pi_{B^{*}}\beta)\right]
+2𝔼[supgKgKKδ21ni=1nσi(gK(Xi)gK(Xi))]\displaystyle\ \ \ \ +2\mathbb{E}\left[\sup_{\|g_{K}-g^{*}_{K}\|_{K}\leq\delta_{2}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}(g_{K}(X_{i})-g^{*}_{K}(X_{i}))\right]
O(δ1dn+δ21n),\displaystyle\leq O\left(\delta_{1}\sqrt{\frac{d}{n}}+\delta_{2}\sqrt{\frac{1}{n}}\right),

where the last inequality follows from standard bounds on the Rademacher complexities of linear and kernel function classes (see e.g. Section 4.1.2 of Boucheron et al. (2005)). ∎

We now prove the main proposition.

Proof of Proposition 2.

We will show that

  1. \sup_{f\in\mathcal{F}_{\delta}}\left|\frac{1}{n}\sum_{i=1}^{n}|f(X_{i})|-\mathbb{E}_{P}[|f(X)|]\right|=O_{\mathbb{P}}(\sqrt{d/n}),

  2. \sup_{f_{K}\in\mathcal{F}_{K}:\|f_{K}\|_{K}\leq 1}\lambda|\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]|\leq O(1),

  3. \sup_{f_{K}\in\mathcal{F}_{K}:\|f_{K}\|_{K}\leq 1}\lambda|\langle\hat{g}_{n,K},f_{K}\rangle_{K}-\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]|\leq O_{\mathbb{P}}(\sqrt{d\log(n)/n}).

Our desired result will then follow by writing

supf()=Φ()β+fK()δ|2λg^n,K,fK1ni=1nf(Xi)2λ𝔼[g^Sn+1,K,fK]𝔼P[|f(X)|]|\displaystyle\sup_{f(\cdot)=\Phi(\cdot)^{\top}\beta+f_{K}(\cdot)\in\mathcal{F}_{\delta}}\left|2\lambda\frac{\langle\hat{g}_{n,K},f_{K}\rangle}{\frac{1}{n}\sum_{i=1}^{n}f(X_{i})}-2\lambda\frac{\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle]}{\mathbb{E}_{P}[|f(X)|]}\right|
supfKK:fKK12λ|g^n,K,fK𝔼[g^Sn+1,K,fK]|𝔼P[|f(X)|]\displaystyle\leq\sup_{f_{K}\in\mathcal{F}_{K}:\|f_{K}\|_{K}\leq 1}2\lambda\frac{|\langle\hat{g}_{n,K},f_{K}\rangle-\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle]|}{\mathbb{E}_{P}[|f(X)|]}
+2λsupf()=Φ()β+fK()δ|g^n,K,fK1ni=1n|f(Xi)|g^n,K,fK𝔼P[|f(Xi)|]|\displaystyle\ \ \ \ \ +2\lambda\sup_{f(\cdot)=\Phi(\cdot)^{\top}\beta+f_{K}(\cdot)\in\mathcal{F}_{\delta}}\left|\frac{\langle\hat{g}_{n,K},f_{K}\rangle}{\frac{1}{n}\sum_{i=1}^{n}|f(X_{i})|}-\frac{\langle\hat{g}_{n,K},f_{K}\rangle}{\mathbb{E}_{P}[|f(X_{i})|]}\right|
\displaystyle\leq O_{\mathbb{P}}\left(\sqrt{\frac{d\log(n)}{n}}\right)+\frac{\sup_{f_{K}\in\mathcal{F}_{K}:\|f_{K}\|_{K}\leq 1}|\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle]|+O_{\mathbb{P}}(\sqrt{\frac{d\log(n)}{n}})}{\delta^{2}-O_{\mathbb{P}}(\sqrt{d/n})}\sup_{f\in\mathcal{F}_{\delta}}\left|\frac{1}{n}\sum_{i=1}^{n}|f(X_{i})|-\mathbb{E}_{P}[|f(X)|]\right|
=O(dlog(n)n).\displaystyle=O_{\mathbb{P}}\left(\sqrt{\frac{d\log(n)}{n}}\right).

We establish each of these three facts in order.

Step 1: By the results of Section 4.1.2 in Boucheron et al. (2005) we know that \{f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta:\|f_{K}\|_{K}+\|\beta\|_{2}\leq 1\} has Rademacher complexity at most O(\sqrt{d/n}). By the contraction property this also implies that \{|f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta|:\|f_{K}\|_{K}+\|\beta\|_{2}\leq 1\} has Rademacher complexity at most O(\sqrt{d/n}). So, by Markov's inequality and symmetrization we have that for any C>0,

(supfδ|1ni=1n|f(Xi)|𝔼P[|f(X)|]|>C)1C𝔼[supfδ|1ni=1n|f(Xi)|𝔼P[|f(X)|]|]\displaystyle\mathbb{P}\left(\sup_{f\in\mathcal{F}_{\delta}}\left|\frac{1}{n}\sum_{i=1}^{n}|f(X_{i})|-\mathbb{E}_{P}[|f(X)|]\right|>C\right)\leq\frac{1}{C}\mathbb{E}\left[\sup_{f\in\mathcal{F}_{\delta}}\left|\frac{1}{n}\sum_{i=1}^{n}|f(X_{i})|-\mathbb{E}_{P}[|f(X)|]\right|\right]
2CRadComplexn({|fK()+Φ()β|:fK+β21})1CO(dn).\displaystyle\hskip 85.35826pt\leq\frac{2}{C}\text{RadComplex}_{n}(\{|f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta|:\|f\|_{K}+\|\beta\|_{2}\leq 1\})\leq\frac{1}{C}O\left(\sqrt{\frac{d}{n}}\right).

This proves that supfδ|1ni=1n|f(Xi)|𝔼P[|f(X)|]|=O(d/n)\sup_{f\in\mathcal{F}_{\delta}}\left|\frac{1}{n}\sum_{i=1}^{n}|f(X_{i})|-\mathbb{E}_{P}[|f(X)|]\right|=O_{\mathbb{P}}(\sqrt{d/n}), as desired.

Step 2: By Lemma 4 we know that

supfK:fKK1|𝔼[g^Sn+1,K,fKK]|𝔼[g^Sn+1,KK]𝔼[1λ1n+1i=1n+1|Si|]𝔼[|Si|]λ.\displaystyle\sup_{f\in\mathcal{F}_{K}:\|f_{K}\|_{K}\leq 1}|\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]|\leq\mathbb{E}[\|\hat{g}_{S_{n+1},K}\|_{K}]\leq\mathbb{E}\left[\frac{1}{\sqrt{\lambda}}\sqrt{\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|}\right]\leq\sqrt{\frac{\mathbb{E}[|S_{i}|]}{\lambda}}.

Multiplying both sides by λ\lambda gives the desired result.

Step 3: This step is considerably more involved than the previous two. To begin write

supfK:fKK1g^n,K,fK𝔼[g^Sn+1,K,fK]\displaystyle\sup_{f_{K}:\|f_{K}\|_{K}\leq 1}\langle\hat{g}_{n,K},f_{K}\rangle-\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle] =supfK:fKK1g^n,KgK,fK+𝔼[gKg^Sn+1,K,fK]\displaystyle=\sup_{f_{K}:\|f_{K}\|_{K}\leq 1}\langle\hat{g}_{n,K}-g^{*}_{K},f_{K}\rangle+\mathbb{E}[\langle g^{*}_{K}-\hat{g}_{S_{n+1},K},f_{K}\rangle]
g^n,KgKK+𝔼[gKg^Sn+1,KK].\displaystyle\leq\|\hat{g}_{n,K}-g^{*}_{K}\|_{K}+\mathbb{E}[\|g^{*}_{K}-\hat{g}_{S_{n+1},K}\|_{K}].

We will bound each of the two terms on the right hand side separately. To do this we will use a two-step peeling argument where each step gives a tighter bound on g^n,KgKK\|\hat{g}_{n,K}-g^{*}_{K}\|_{K} than the previous one.

Our first step will show that with high probability (β^n,g^n,K)(\hat{\beta}_{n},\hat{g}_{n,K}) must be within δM\delta_{M} of (B,gK)(B^{*},g^{*}_{K}). Let cβc_{\beta} be the constant appearing in Lemma 5. Then, by a direct computation we have that

(d(β^n,g^n,K)>δM)=(Mn(β^n,g^n,K)Mn(ΠBβ^n,gK)0,d(β^n,g^n,K)>δM)\displaystyle\mathbb{P}(d(\hat{\beta}_{n},\hat{g}_{n,K})>\delta_{M})=\mathbb{P}(M_{n}(\hat{\beta}_{n},\hat{g}_{n,K})-M_{n}(\Pi_{B^{*}}\hat{\beta}_{n},g^{*}_{K})\leq 0,\ d(\hat{\beta}_{n},\hat{g}_{n,K})>\delta_{M})
(Mn(β^n,g^n,K)Mn(ΠBβ^n,gK)(M(β^n,g^n,K)M(ΠBβ^n,gK))δM2)\displaystyle\leq\mathbb{P}(M_{n}(\hat{\beta}_{n},\hat{g}_{n,K})-M_{n}(\Pi_{B^{*}}\hat{\beta}_{n},g^{*}_{K})-(M_{\infty}(\hat{\beta}_{n},\hat{g}_{n,K})-M_{\infty}(\Pi_{B^{*}}\hat{\beta}_{n},g^{*}_{K}))\leq-\delta_{M}^{2})
(supβ2cβλ,gK2𝔼[|Si|]λ|Mn(β,gK)Mn(ΠBβ,gK)(M(β,gK)M(ΠBβ,gK))|δM2)\displaystyle\leq\mathbb{P}\left(\sup_{\|\beta\|_{2}\leq\frac{c_{\beta}}{\sqrt{\lambda}},\|g_{K}\|\leq\sqrt{\frac{2\mathbb{E}[|S_{i}|]}{\lambda}}}|M_{n}(\beta,g_{K})-M_{n}(\Pi_{B^{*}}\beta,g^{*}_{K})-(M_{\infty}(\beta,g_{K})-M_{\infty}(\Pi_{B^{*}}\beta,g^{*}_{K}))|\geq\delta_{M}^{2}\right)
+(β^n2cβλ)+(g^n,KK2𝔼[|Si|]λ)\displaystyle\quad+\mathbb{P}\left(\|\hat{\beta}_{n}\|_{2}\geq\frac{c_{\beta}}{\sqrt{\lambda}}\right)+\mathbb{P}\left(\|\hat{g}_{n,K}\|_{K}\geq\sqrt{\frac{2\mathbb{E}[|S_{i}|]}{\lambda}}\right)
\displaystyle\leq\frac{1}{\delta_{M}^{2}}\mathbb{E}\left[\sup_{\|\beta\|_{2}\leq\frac{c_{\beta}}{\sqrt{\lambda}},\|g_{K}\|_{K}\leq\sqrt{\frac{2\mathbb{E}[|S_{i}|]}{\lambda}}}|L_{n}(\beta,g_{K})-L_{n}(\Pi_{B^{*}}\beta,g^{*}_{K})-(L_{\infty}(\beta,g_{K})-L_{\infty}(\Pi_{B^{*}}\beta,g^{*}_{K}))|\right]+O\left(\frac{d}{n}\right),

where the last line follows by applying Lemmas 4 and 5 with f()=1f(\cdot)=1. Finally, by Lemma 6 we can additionally bound the first term above as

\mathbb{E}\left[\sup_{\|\beta\|_{2}\leq\frac{c_{\beta}}{\sqrt{\lambda}},\|g_{K}\|_{K}\leq\sqrt{\frac{2\mathbb{E}[|S_{i}|]}{\lambda}}}|L_{n}(\beta,g_{K})-L_{n}(\Pi_{B^{*}}\beta,g^{*}_{K})-(L_{\infty}(\beta,g_{K})-L_{\infty}(\Pi_{B^{*}}\beta,g^{*}_{K}))|\right]\leq O\left(\sqrt{\frac{d}{\lambda n}}\right).

So, in total we find that

(d(β^n,g^n,K)>δM)O(dλn).\mathbb{P}(d(\hat{\beta}_{n},\hat{g}_{n,K})>\delta_{M})\leq O\left(\sqrt{\frac{d}{\lambda n}}\right).

This concludes the proof of our first concentration inequality for (β^n,g^n,K)(\hat{\beta}_{n},\hat{g}_{n,K}).

In our second step we will use this preliminary bound to get an even tighter control on d(\hat{\beta}_{n},\hat{g}_{n,K}). Fix any C>0 with C\sqrt{d/(\lambda n)}<\delta_{M}. For any j\in\mathbb{Z} let A_{j}:=\{(\beta,g_{K}):2^{j-1}<\sqrt{\frac{\lambda n}{d}}d(\beta,g_{K})\leq 2^{j}\}. Then,

\displaystyle\mathbb{P}\left(d(\hat{\beta}_{n},\hat{g}_{n,K})>C\sqrt{d/(\lambda n)}\right)
\displaystyle\leq\sum_{j:\frac{1}{2}C\leq 2^{j}\leq 2\sqrt{\frac{n\lambda}{d}}\delta_{M}}\mathbb{P}\left(\inf_{(\beta,g_{K})\in A_{j}}M_{n}(\beta,g_{K})-M_{n}(\Pi_{B^{*}}\beta,g^{*}_{K})\leq 0\right)+\mathbb{P}(d(\hat{\beta}_{n},\hat{g}_{n,K})>\delta_{M})
\displaystyle\leq\sum_{j:\frac{1}{2}C\leq 2^{j}\leq 2\sqrt{\frac{n\lambda}{d}}\delta_{M}}\mathbb{P}\left(\sup_{(\beta,g_{K})\in A_{j}}|M_{n}(\beta,g_{K})-M_{n}(\Pi_{B^{*}}\beta,g^{*}_{K})-(M_{\infty}(\beta,g_{K})-M_{\infty}(\Pi_{B^{*}}\beta,g^{*}_{K}))|\geq\frac{2^{2j-2}d}{n\lambda}\right)
\ \ \ \ +O\left(\sqrt{\frac{d}{\lambda n}}\right)
\displaystyle\leq\sum_{j:\frac{1}{2}C\leq 2^{j}\leq 2\sqrt{\frac{n\lambda}{d}}\delta_{M}}\frac{n\lambda}{2^{2j-2}d}\mathbb{E}\left[\sup_{(\beta,g_{K})\in A_{j}}\left|L_{n}(\beta,g_{K})-L_{n}(\Pi_{B^{*}}\beta,g^{*}_{K})-(L_{\infty}(\beta,g_{K})-L_{\infty}(\Pi_{B^{*}}\beta,g^{*}_{K}))\right|\right]
\ \ \ \ +O\left(\sqrt{\frac{d}{\lambda n}}\right)
\displaystyle\leq\sum_{j:\frac{1}{2}C\leq 2^{j}\leq 2\sqrt{\frac{n\lambda}{d}}\delta_{M}}O(\sqrt{\lambda}2^{-j})+O\left(\sqrt{\frac{d}{\lambda n}}\right)\leq O\left(\frac{\sqrt{\lambda}}{C}\right)+O\left(\sqrt{\frac{d}{\lambda n}}\right).

This proves that g^n,KgKK=O(dλn)\|\hat{g}_{n,K}-g^{*}_{K}\|_{K}=O_{\mathbb{P}}(\sqrt{\frac{d}{\lambda n}}). To get a similar bound on the expectation write

\displaystyle\mathbb{E}[\|\hat{g}_{S_{n+1},K}-g^{*}_{K}\|_{K}]\leq\int_{0}^{\delta_{M}}\mathbb{P}(\|\hat{g}_{S_{n+1},K}-g^{*}_{K}\|_{K}>t)dt
+\mathbb{E}[(\|\hat{g}_{S_{n+1},K}\|_{K}+\|g^{*}_{K}\|_{K})\mathbbm{1}\{\|\hat{g}_{S_{n+1},K}-g^{*}_{K}\|_{K}>\delta_{M}\}]
0δMmin{1,O(1tdλn)+O(dλn)}dt\displaystyle\leq\int_{0}^{\delta_{M}}\min\left\{1,O\left(\frac{1}{t}\sqrt{\frac{d}{\lambda n}}\right)+O\left(\sqrt{\frac{d}{\lambda n}}\right)\right\}dt
+𝔼[(1λ(n+1)i=1n+1|Si|+𝔼[|Si|]λ)𝟙{g^Sn+1,KgKK>δM}]\displaystyle\hskip 142.26378pt+\mathbb{E}\left[\left(\sqrt{\frac{1}{\lambda(n+1)}\sum_{i=1}^{n+1}|S_{i}|}+\sqrt{\frac{\mathbb{E}[|S_{i}|]}{\lambda}}\right)\mathbbm{1}\{\|\hat{g}_{S_{n+1},K}-g^{*}_{K}\|_{K}>\delta_{M}\}\right]
O(dlog(n)λn)+𝔼[(1λ(n+1)i=1n+1(|Si|𝔼[|Si|])+2𝔼[|Si|]λ)𝟙{g^Sn+1,KgKK>δM}]\displaystyle\leq O\left(\sqrt{\frac{d\log(n)}{\lambda n}}\right)+\mathbb{E}\left[\left(\sqrt{\frac{1}{\lambda(n+1)}\sum_{i=1}^{n+1}(|S_{i}|-\mathbb{E}[|S_{i}|])}+2\sqrt{\frac{\mathbb{E}[|S_{i}|]}{\lambda}}\right)\mathbbm{1}\{\|\hat{g}_{S_{n+1},K}-g^{*}_{K}\|_{K}>\delta_{M}\}\right]
O(dlog(n)λn)+𝔼[|1λ(n+1)i=1n+1(Si𝔼[Si])|]1/2(g^Sn+1,KgKK>δM)1/2\displaystyle\leq O\left(\sqrt{\frac{d\log(n)}{\lambda n}}\right)+\mathbb{E}\left[\left|\frac{1}{\lambda(n+1)}\sum_{i=1}^{n+1}(S_{i}-\mathbb{E}[S_{i}])\right|\right]^{1/2}\mathbb{P}(\|\hat{g}_{S_{n+1},K}-g^{*}_{K}\|_{K}>\delta_{M})^{1/2}
+2𝔼[|Si|]λ(g^Sn+1,KgKK>δM)\displaystyle\hskip 142.26378pt+2\sqrt{\frac{\mathbb{E}[|S_{i}|]}{\lambda}}\mathbb{P}(\|\hat{g}_{S_{n+1},K}-g^{*}_{K}\|_{K}>\delta_{M})
=O(1λdlog(n)n),\displaystyle=O\left(\frac{1}{\lambda}\sqrt{\frac{d\log(n)}{n}}\right),

as desired.

A.5 Proofs for Lipschitz Functions

In this section we prove Proposition 3. Throughout, we make the following set of technical assumptions.

Assumption 4.

There exist constants $C_{X},C_{\Phi},C_{S},C_{f},\rho>0$ such that $\sup_{f\in\mathcal{F}}\sqrt{\mathbb{E}[|f(X_{i})|^{2}]}\leq C_{f}\mathbb{E}[|f(X_{i})|]$, $\inf_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|]\geq\rho$, and, with probability 1, $\|X_{i}\|^{2}_{2}\leq C_{X}p$, $\|\Phi(X_{i})\|^{2}_{2}\leq C_{\Phi}d$, and $|S_{i}|\leq C_{S}$ for all $i$.

The primary technical tool that we will need for the proof is a covering number bound for Lipschitz functions. This result is well known in the prior literature, and we re-state it here for clarity. In the work that follows, we use $B_{p}(0,C):=\{x\in\mathbb{R}^{p}:\|x\|_{2}\leq C\}$ to denote the ball of radius $C$ in $\mathbb{R}^{p}$.

Definition 1.

The covering number 𝒩(,ϵ,)\mathcal{N}(\mathcal{F},\epsilon,\|\cdot\|) of a set \mathcal{F} under norm \|\cdot\| is the minimum number of balls B(f,ϵ)={g:fgϵ}B(f,\epsilon)=\{g\in\mathcal{F}:\|f-g\|\leq\epsilon\} of radius ϵ\epsilon needed to cover \mathcal{F}.

Lemma 7 (Covering number of Lipschitz functions, Theorem 2.7.1 in Vaart and Wellner (1996)).

Let lipL,B1,B2,p:={f:Bp(0,B1)Lip(f)L,fB2}\mathcal{F}^{\textup{lip}}_{L,B_{1},B_{2},p}:=\{f:B_{p}(0,B_{1})\to\mathbb{R}\mid\textup{Lip}(f)\leq L,\ \|f\|_{\infty}\leq B_{2}\} denote the space of bounded Lipschitz functions on Bp(0,B1)B_{p}(0,B_{1}). Then there exists a constant C>0C>0 such that for any ϵ>0\epsilon>0,

log(𝒩(lipL,B1,B2,p,ϵ,))(CB1max{L,B2}ϵ)p.\log(\mathcal{N}(\mathcal{F}^{\textup{lip}}_{L,B_{1},B_{2},p},\epsilon,\|\cdot\|_{\infty}))\leq\left(\frac{CB_{1}\max\{L,B_{2}\}}{\epsilon}\right)^{p}.
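For intuition, consider the simplest instance of this bound (this example is illustrative only and is not used in the proofs below): taking $p=1$ and $B_{1}=B_{2}=L=1$, i.e., $1$-Lipschitz functions on $[0,1]$ that are bounded by one, Lemma 7 gives

\log(\mathcal{N}(\mathcal{F}^{\textup{lip}}_{1,1,1,1},\epsilon,\|\cdot\|_{\infty}))\leq\frac{CB_{1}\max\{L,B_{2}\}}{\epsilon}=\frac{C}{\epsilon},

recovering the classical $1/\epsilon$ growth of the metric entropy of bounded Lipschitz functions on the unit interval.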

In the present context we need to control the behaviour of the Lipschitz fit under a re-weighting $f$. To account for this, we will require a more general covering number bound for a weighted class of Lipschitz functions, which is given in the following lemma. In the work that follows, recall that for a probability measure $Q$, the $L_{2}(Q)$ norm of a function $f$ is defined as $\|f\|_{L_{2}(Q)}:=\mathbb{E}_{X\sim Q}[f(X)^{2}]^{1/2}$.

Lemma 8.

Let $f:B_{p}(0,B_{1})\to\mathbb{R}$ be a fixed function and let $\mathcal{F}^{\textup{wlip}}_{L,B_{1},B_{2},p,f}:=\{fg\mid g:B_{p}(0,B_{1})\to\mathbb{R},\ \textup{Lip}(g)\leq L,\ \|g\|_{\infty}\leq B_{2}\}$ denote the space of bounded Lipschitz functions on $B_{p}(0,B_{1})$ multiplied by $f$. Then, for any probability measure $Q$, there exists a constant $C>0$ such that for any $\epsilon>0$,

log(𝒩(wlipL,B1,B2,p,f,ϵ,L2(Q)))(CB1max{L,B2}fL2(Q)ϵ)p.\log(\mathcal{N}(\mathcal{F}^{\textup{wlip}}_{L,B_{1},B_{2},p,f},\epsilon,\|\cdot\|_{L_{2}(Q)}))\leq\left(\frac{CB_{1}\max\{L,B_{2}\}\|f\|_{L_{2}(Q)}}{\epsilon}\right)^{p}.
Proof.

Recall that we defined lipL,B1,B2,p:={g:Bp(0,B1)Lip(g)L,gB2}\mathcal{F}^{\textup{lip}}_{L,B_{1},B_{2},p}:=\{g:B_{p}(0,B_{1})\to\mathbb{R}\mid\textup{Lip}(g)\leq L,\ \|g\|_{\infty}\leq B_{2}\}. Fix any ϵ>0\epsilon>0 and let AlipL,B1,B2,pA\subseteq\mathcal{F}^{\textup{lip}}_{L,B_{1},B_{2},p} be a minimal \|\cdot\|_{\infty}-norm, ϵ/fL2(Q)\epsilon/\|f\|_{L_{2}(Q)}-covering of lipL,B1,B2,p\mathcal{F}^{\textup{lip}}_{L,B_{1},B_{2},p}. Fix any hlipL,B1,B2,ph\in\mathcal{F}^{\textup{lip}}_{L,B_{1},B_{2},p} and let hAh^{\prime}\in A be such that hhϵ/fL2(Q)\|h-h^{\prime}\|_{\infty}\leq\epsilon/\|f\|_{L_{2}(Q)}. Then,

\|fh-fh^{\prime}\|_{L_{2}(Q)}=\mathbb{E}_{X\sim Q}[|f(X)|^{2}\cdot|h(X)-h^{\prime}(X)|^{2}]^{1/2}\leq\mathbb{E}_{X\sim Q}\left[|f(X)|^{2}\left(\frac{\epsilon}{\|f\|_{L_{2}(Q)}}\right)^{2}\right]^{1/2}=\epsilon.

In particular, we find that {fh:hA}\{fh:h\in A\} is an L2(Q)\|\cdot\|_{L_{2}(Q)}-norm, ϵ\epsilon-covering of wlipL,B1,B2,p,f\mathcal{F}^{\textup{wlip}}_{L,B_{1},B_{2},p,f}. The desired result then immediately follows by applying Lemma 7 to get a bound on |A||A|. ∎

The previous two lemmas only apply to bounded Lipschitz functions with maximum Lipschitz norm $L$. To apply these results in our current setting we will need to bound $\text{Lip}(\hat{g}_{S_{n+1},L})$ and $\|\hat{g}_{S_{n+1},L}\|_{\infty}$. Our next lemma does exactly this.

Lemma 9.

Assume that there exist constants CX,CS>0C_{X},C_{S}>0 such that with probability one Xi22pCX\|X_{i}\|^{2}_{2}\leq pC_{X} and |Si|CS|S_{i}|\leq C_{S} for all ii. Then with probability one, Lip(g^Sn+1,L)CSλ\text{Lip}(\hat{g}_{S_{n+1},L})\leq\frac{C_{S}}{\lambda} and g^Sn+1,LCXpCSλ\|\hat{g}_{S_{n+1},L}\|_{\infty}\leq\frac{\sqrt{C_{X}p}C_{S}}{\lambda} .

Proof.

Since (g^Sn+1,L,β^Sn+1)(\hat{g}_{S_{n+1},L},\hat{\beta}_{S_{n+1}}) is a minimizer of the quantile regression objective we must have

λLip(g^Sn+1,L)\displaystyle\lambda\text{Lip}(\hat{g}_{S_{n+1},L}) 1n+1i=1n+1α(g^Sn+1,L(Xi)+Φ(Xi)β^Sn+1,Si)+λLip(g^Sn+1,L)\displaystyle\leq\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}\left(\hat{g}_{S_{n+1},L}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}},S_{i}\right)+\lambda\text{Lip}(\hat{g}_{S_{n+1},L})
1n+1i=1n+1α(0,Si)+λLip(0)1n+1i=1n+1|Si|CS.\displaystyle\leq\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(0,S_{i})+\lambda\text{Lip}(0)\leq\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|\leq C_{S}.

This proves the first part of the lemma. To get the second part, note that since $\Phi(\cdot)$ has an intercept term we may assume without loss of generality that $\hat{g}_{S_{n+1},L}(0)=0$. Thus,

g^Sn+1,LCXpsupxBp(0,CXp)g^Sn+1,L(x)g^Sn+1,L(0)x2CXpLip(g^Sn+1,L)CXpCSλ,\|\hat{g}_{S_{n+1},L}\|_{\infty}\leq\sqrt{C_{X}p}\sup_{x\in B_{p}(0,\sqrt{C_{X}p})}\frac{\|\hat{g}_{S_{n+1},L}(x)-\hat{g}_{S_{n+1},L}(0)\|}{\|x\|_{2}}\leq\sqrt{C_{X}p}\text{Lip}(\hat{g}_{S_{n+1},L})\leq\sqrt{C_{X}p}\frac{C_{S}}{\lambda},

as desired. ∎

Our final preliminary lemma gives control over the norm of the linear part of the fit. As we did for RKHS functions above, we state this result under an arbitrary re-weighting $f$.

Lemma 10.

Let $f:\mathcal{X}\to\mathbb{R}$ and $(X_{1},S_{1}),\dots,(X_{n+1},S_{n+1})\stackrel{i.i.d.}{\sim}P$. Assume that there exist constants $C_{X},C_{\Phi},C_{S},C_{f},\rho>0$ such that $\sqrt{\mathbb{E}[|f(X_{i})|^{2}]}\leq C_{f}\mathbb{E}[|f(X_{i})|]$, $\inf_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|]\geq\rho$, and, with probability 1, $\|X_{i}\|_{2}^{2}\leq C_{X}p$, $\|\Phi(X_{i})\|^{2}_{2}\leq C_{\Phi}d$, and $|S_{i}|\leq C_{S}$ for all $i$. Then there exists a constant $c_{\beta}>0$ such that

𝔼[|f(Xi)|𝟙{β^Sn+12>cβpλ}]O(d𝔼[|f(Xi)|]n).\mathbb{E}\left[|f(X_{i})|\mathbbm{1}\left\{\|\hat{\beta}_{S_{n+1}}\|_{2}>c_{\beta}\frac{\sqrt{p}}{\lambda}\right\}\right]\leq O\left(\frac{d\mathbb{E}[|f(X_{i})|]}{n}\right).
Proof.

By Lemma 9 we know that without loss of generality g^Sn+1,LCXpCSλ\|\hat{g}_{S_{n+1},L}\|_{\infty}\leq\frac{\sqrt{C_{X}p}C_{S}}{\lambda}. Moreover, by assumption we have that deterministically 1n+1i=1n+1|Si|CS\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|\leq C_{S}. With these preliminary facts in hand the desired result follows by repeating the proof of Lemma 5. ∎

With these preliminaries out of the way we are now ready to prove Proposition 3.

Proof of Proposition 3.

The main idea of this proof is to show that $\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=g_{L}(X_{i})+\Phi(X_{i})^{\top}\beta\}$ concentrates uniformly around its expectation. Since $\mathbb{E}[\mathbbm{1}\{S=g_{L}(X)+\Phi(X)^{\top}\beta\}]=0$ for any fixed $(g_{L},\beta)$, this will imply that

1n+1i=1n+1𝟙{Si=g^Sn+1,L(Xi)+Φ(Xi)β^Sn+1}𝔼[𝟙{S=g^Sn+1(X)+Φ(X)β^Sn+1}]=0.\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},L}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\cong\mathbb{E}[\mathbbm{1}\{S=\hat{g}_{S_{n+1}}(X)+\Phi(X)^{\top}\hat{\beta}_{S_{n+1}}\}]=0.

We now formalize this idea. Define the event

E:=\left\{\|\hat{g}_{S_{n+1},L}\|_{\infty}\leq\frac{\sqrt{C_{X}p}C_{S}}{\lambda},\ \text{Lip}(\hat{g}_{S_{n+1},L})\leq\frac{C_{S}}{\lambda},\ \|\hat{\beta}_{S_{n+1}}\|_{2}\leq c_{\beta}\frac{\sqrt{p}}{\lambda}\right\}.

By Lemmas 9 and 10 we know that

𝔼[|f(Xi)|𝟙{Si=g^Sn+1,L(Xi)+Φ(Xi)β^Sn+1}]\displaystyle\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},L}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}]
𝔼[|f(Xi)|𝟙{Si=g^Sn+1,L(Xi)+Φ(Xi)β^Sn+1}𝟙{E}]+O(d𝔼[|f(Xi)|]n).\displaystyle\leq\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},L}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\mathbbm{1}\{E\}]+O\left(\frac{d\mathbb{E}[|f(X_{i})|]}{n}\right).

Thus, we just need to focus on what happens on the event EE. By the exchangeability of the quadruples (g^Sn+1,L(Xi),β^Sn+1,Xi,Si)(\hat{g}_{S_{n+1},L}(X_{i}),\hat{\beta}_{S_{n+1}},X_{i},S_{i}) we have

𝔼[|f(Xi)|𝟙{Si=g^Sn+1,L(Xi)+Φ(Xi)β^Sn+1}𝟙{E}]\displaystyle\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},L}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\mathbbm{1}\{E\}]
=𝔼[(1n+1i=1n+1|f(Xi)|𝟙{Si=g^Sn+1,L(Xi)+Φ(Xi)β^Sn+1})𝟙{E}].\displaystyle=\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},L}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\right)\mathbbm{1}\{E\}\right].

Let $\delta>0$ be a small constant that we will specify later, and let $h$ denote the tent function

h(x)={0,|x|>δ1δ1|x|,|x|δ.h(x)=\begin{cases}0,\ |x|>\delta\\ 1-\delta^{-1}|x|,\ |x|\leq\delta.\end{cases}

Let 𝒢:={g:g()=gL()+Φ()β,β2cβpλ,gLCXpCSλ,Lip(gL)CSλ}\mathcal{G}:=\{g:g(\cdot)=g_{L}(\cdot)+\Phi(\cdot)^{\top}\beta,\ \|\beta\|_{2}\leq\frac{c_{\beta}\sqrt{p}}{\lambda},\ \|g_{L}\|_{\infty}\leq\frac{\sqrt{C_{X}p}C_{S}}{\lambda},\ \text{Lip}(g_{L})\leq\frac{C_{S}}{\lambda}\} and σ1,,σni.i.d.Unif({±1})\sigma_{1},\dots,\sigma_{n}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}\text{Unif}(\{\pm 1\}). Then,

𝔼[1n+1i=1n+1|f(Xi)|𝟙{Si=g^Sn+1(Xi)+Φ(Xi)β^Sn+1}𝟙{E}]\displaystyle\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\mathbbm{1}\{E\}\right]
𝔼[1n+1i=1n+1|f(Xi)|h(Sig^Sn+1(Xi)Φ(Xi)β^Sn+1)𝟙{E}]\displaystyle\leq\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}|f(X_{i})|h(S_{i}-\hat{g}_{S_{n+1}}(X_{i})-\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}})\mathbbm{1}\{E\}\right]
𝔼[supg𝒢1n+1i=1n+1|f(Xi)|h(Sig(Xi))𝔼[|f(X1)|h(S1g(X1))]]\displaystyle\leq\mathbb{E}\left[\sup_{g\in\mathcal{G}}\frac{1}{n+1}\sum_{i=1}^{n+1}|f(X_{i})|h(S_{i}-g(X_{i}))-\mathbb{E}[|f(X_{1})|h(S_{1}-g(X_{1}))]\right]
+supg𝒢𝔼[|f(X1)|h(S1g(X1))]\displaystyle\quad+\sup_{g\in\mathcal{G}}\mathbb{E}[|f(X_{1})|h(S_{1}-g(X_{1}))]
\displaystyle\leq 2\mathbb{E}\left[\sup_{g\in\mathcal{G}}\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|h(S_{i}-g(X_{i}))\right]+\sup_{g\in\mathcal{G}}\mathbb{E}[|f(X_{1})|\mathbb{P}(|S_{1}-g(X_{1})|\leq\delta\mid X_{1})]
2δ1𝔼[supg𝒢1n+1i=1n+1σi|f(Xi)|(Sig(Xi))]+O(δ𝔼[|f(Xi)|])\displaystyle\leq 2\delta^{-1}\mathbb{E}\left[\sup_{g\in\mathcal{G}}\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|(S_{i}-g(X_{i}))\right]+O(\delta\mathbb{E}[|f(X_{i})|])
2δ1𝔼[|1n+1i=1n+1σi|f(Xi)|Si|]+2δ1𝔼[supβ:β2cβpλ|1n+1i=1n+1σi|f(Xi)|Φ(Xi)β|]\displaystyle\leq 2\delta^{-1}\mathbb{E}\left[\left|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|S_{i}\right|\right]+2\delta^{-1}\mathbb{E}\left[\sup_{\beta:\|\beta\|_{2}\leq\frac{c_{\beta}\sqrt{p}}{\lambda}}\left|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|\Phi(X_{i})^{\top}\beta\right|\right]
+2δ1𝔼[supgL:gLCXpCSλ,Lip(gL)CSλ|1n+1i=1n+1σi|f(Xi)|gL(Xi)|]+O(δ𝔼[|f(Xi)|]),\displaystyle\quad+2\delta^{-1}\mathbb{E}\left[\sup_{g_{L}:\|g_{L}\|_{\infty}\leq\frac{\sqrt{C_{X}p}C_{S}}{\lambda},\ \text{Lip}(g_{L})\leq\frac{C_{S}}{\lambda}}\left|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|g_{L}(X_{i})\right|\right]+O(\delta\mathbb{E}[|f(X_{i})|]),

where the third inequality follows by symmetrization, and the fourth inequality uses the fact that Si|XiS_{i}|X_{i} has a bounded density, the contraction inequality, and the fact that h()h(\cdot) is δ1\delta^{-1}-Lipschitz.

To conclude the proof we bound each of the first three terms appearing on the last line above. We have that

𝔼[|1n+1i=1n+1σi|f(Xi)|Si|]\displaystyle\mathbb{E}\left[\left|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|S_{i}\right|\right] Var(1n+1i=1n+1σi|f(Xi)|Si)\displaystyle\leq\sqrt{\text{Var}\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|S_{i}\right)}
CS2𝔼[f(Xi)2]1/2n+1=O(𝔼[|f(Xi)|]n), (by Assumption 4),\displaystyle\leq\sqrt{\frac{C_{S}^{2}\mathbb{E}[f(X_{i})^{2}]^{1/2}}{n+1}}=O\left(\sqrt{\frac{\mathbb{E}[|f(X_{i})|]}{n}}\right),\text{ (by Assumption \ref{ass:lip_tech_conditions})},

while

𝔼[supβ:β2cβpλ|1n+1i=1n+1σi|f(Xi)|Φ(Xi)β|]\displaystyle\mathbb{E}\left[\sup_{\beta:\|\beta\|_{2}\leq\frac{c_{\beta}\sqrt{p}}{\lambda}}\left|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|\Phi(X_{i})^{\top}\beta\right|\right] cβpλ𝔼[1n+1i=1n+1σi|f(Xi)|Φ(Xi)2]\displaystyle\leq\frac{c_{\beta}\sqrt{p}}{\lambda}\mathbb{E}\left[\left\|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|\Phi(X_{i})\right\|_{2}\right]
cβpλ𝔼[1n+1i=1n+1σi|f(Xi)|Φ(Xi)22]1/2\displaystyle\leq\frac{c_{\beta}\sqrt{p}}{\lambda}\mathbb{E}\left[\left\|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|\Phi(X_{i})\right\|_{2}^{2}\right]^{1/2}
=O(dp𝔼[|f(Xi)|2]λ2n)\displaystyle=O\left(\sqrt{\frac{dp\mathbb{E}[|f(X_{i})|^{2}]}{\lambda^{2}n}}\right)
=O(dpλ2n𝔼[|f(Xi)|]), (by Assumption 4).\displaystyle=O\left(\sqrt{\frac{dp}{\lambda^{2}n}}\mathbb{E}[|f(X_{i})|]\right),\text{ (by Assumption \ref{ass:lip_tech_conditions})}.

Finally, by Lemma 8 we have the covering number bound

log(𝒩({|f|gL:Bp(0,CXp)gLCXpCSλ,Lip(gL)CSλ},ϵ,L2(PX)))\displaystyle\log\left(\mathcal{N}\left(\left\{|f|g_{L}:B_{p}(0,\sqrt{C_{X}p})\to\mathbb{R}\mid\|g_{L}\|_{\infty}\leq\frac{\sqrt{C_{X}p}C_{S}}{\lambda},\ \text{Lip}(g_{L})\leq\frac{C_{S}}{\lambda}\right\},\epsilon,\|\cdot\|_{L_{2}(P_{X})}\right)\right)
O(p𝔼[f(Xi)2]1/2λϵ)p=O(p𝔼[|f(Xi)|]λϵ)p,\displaystyle\leq O\left(\frac{p\mathbb{E}[f(X_{i})^{2}]^{1/2}}{\lambda\epsilon}\right)^{p}=O\left(\frac{p\mathbb{E}[|f(X_{i})|]}{\lambda\epsilon}\right)^{p},

and so by Dudley’s entropy integral

𝔼[supgL:gLCXpCSλ,Lip(gL)CSλ|1n+1i=1n+1σi|f(Xi)|gL(Xi)|]\displaystyle\mathbb{E}\left[\sup_{g_{L}:\|g_{L}\|_{\infty}\leq\frac{\sqrt{C_{X}p}C_{S}}{\lambda},\ \text{Lip}(g_{L})\leq\frac{C_{S}}{\lambda}}\left|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|g_{L}(X_{i})\right|\right] O(p𝔼[|f(Xi)|]log(n)λnmin{1/2,1/p}).\displaystyle\leq O\left(\frac{p\mathbb{E}[|f(X_{i})|]\log(n)}{\lambda n^{\min\{1/2,1/p\}}}\right).

Putting all of these results together gives the final bound

\displaystyle\mathbb{E}\left[|f(X)|\mathbbm{1}\{S=\hat{g}_{S_{n+1}}(X)+\Phi(X)^{\top}\hat{\beta}_{S_{n+1}}\}\mathbbm{1}\{E\}\right]
δ1(O(p𝔼[|f(Xi)|]log(n)λnmin{1/2,1/p})+O(dpλ2n𝔼[|f(Xi)|]))+O(δ𝔼[|f(Xi)|]).\displaystyle\leq\delta^{-1}\left(O\left(\frac{p\mathbb{E}[|f(X_{i})|]\log(n)}{\lambda n^{\min\{1/2,1/p\}}}\right)+O\left(\sqrt{\frac{dp}{\lambda^{2}n}}\mathbb{E}[|f(X_{i})|]\right)\right)+O(\delta\mathbb{E}[|f(X_{i})|]).

The desired result then follows by optimizing over $\delta$: choosing $\delta^{2}$ equal to the ratio of the $\delta^{-1}$ and $\delta$ coefficients balances the two terms and yields the stated bound. ∎

A.6 Proofs for Section 4

In this section we prove the results appearing in Section 4 of the main text. We start with a proof of Theorem 4.

Proof of Theorem 4.

We begin by giving a more careful derivation of the dual program. Recall that our primal optimization problem is

minimizeg\displaystyle\text{minimize}_{g\in\mathcal{F}} (1α)𝟏p+α𝟏q+(n+1)(g)\displaystyle(1-\alpha)\cdot\mathbf{1}^{\top}p+\alpha\cdot\mathbf{1}^{\top}q+(n+1)\cdot\mathcal{R}(g)
s.t. Sig(Xi)pi+qi=0\displaystyle S_{i}-g(X_{i})-p_{i}+q_{i}=0
Sg(Xn+1)pn+1+qn+1=0\displaystyle S-g(X_{n+1})-p_{n+1}+q_{n+1}=0
pi,qi0, 1in+1.\displaystyle p_{i},q_{i}\geq 0,\ 1\leq i\leq n+1.

The Lagrangian for this program is

(1α)𝟏p+α𝟏q+(n+1)(g)+i=1nηi(Sig(Xi)pi+qi)+ηn+1(Sg(Xn+1)pn+1+qn+1)i=1n+1(γipi+ξiqi).\begin{split}&(1-\alpha)\cdot\mathbf{1}^{\top}p+\alpha\cdot\mathbf{1}^{\top}q+(n+1)\cdot\mathcal{R}(g)+\sum_{i=1}^{n}\eta_{i}\left(S_{i}-g(X_{i})-p_{i}+q_{i}\right)\\ &\hskip 56.9055pt+\eta_{n+1}\left(S-g(X_{n+1})-p_{n+1}+q_{n+1}\right)-\sum_{i=1}^{n+1}(\gamma_{i}p_{i}+\xi_{i}q_{i}).\end{split} (A.3)

For ease of notation, let (η):=ming(n+1)(g)i=1n+1ηig(Xi)\mathcal{R}^{*}(\eta):=-\min_{g\in\mathcal{F}}(n+1)\mathcal{R}(g)-\sum_{i=1}^{n+1}\eta_{i}g(X_{i}). Then, minimizing with respect to gg gives,

(1α)𝟏p+α𝟏q(η)+i=1nηiSi+ηn+1Sηp+ηqγpξq.\displaystyle(1-\alpha)\cdot\mathbf{1}^{\top}p+\alpha\cdot\mathbf{1}^{\top}q-\mathcal{R}^{*}\left(\eta\right)+\sum_{i=1}^{n}\eta_{i}S_{i}+\eta_{n+1}S-\eta^{\top}p+\eta^{\top}q-\gamma^{\top}p-\xi^{\top}q.

So, taking derivatives of this function with respect to $p$ and $q$, we arrive at the constraints,

γ\displaystyle\gamma =(1α)𝟏η\displaystyle=(1-\alpha)\cdot\mathbf{1}-\eta
ξ\displaystyle\xi =α𝟏+η.\displaystyle=\alpha\cdot\mathbf{1}+\eta.

Since the only restriction on ξ\xi and γ\gamma is that they are non-negative, this can be simplified to,

η\displaystyle\eta (1α)𝟏\displaystyle\leq(1-\alpha)\cdot\mathbf{1}
η\displaystyle\eta α𝟏.\displaystyle\geq-\alpha\cdot\mathbf{1}.

Thus, we arrive at the desired dual formulation,

maximizeηi=1nηiSi+ηn+1S(η)subject to αηi1α, 1in+1.\begin{split}&\text{maximize}_{\eta}\quad\sum_{i=1}^{n}\eta_{i}S_{i}+\eta_{n+1}S-\mathcal{R}^{*}\left(\eta\right)\\ &\text{subject to }\quad-\alpha\leq\eta_{i}\leq 1-\alpha,\ 1\leq i\leq n+1.\end{split} (A.4)
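As a concrete illustration of this dual (an example only, not used in the remainder of the proof), suppose that $\mathcal{F}=\{\Phi(\cdot)^{\top}\beta:\beta\in\mathbb{R}^{d}\}$ and that $\mathcal{R}(\Phi(\cdot)^{\top}\beta)=\frac{\lambda}{2}\|\beta\|_{2}^{2}$ is a ridge penalty. Minimizing $(n+1)\mathcal{R}(g)-\sum_{i=1}^{n+1}\eta_{i}g(X_{i})$ over $\beta$ gives $\beta=\frac{1}{\lambda(n+1)}\sum_{i=1}^{n+1}\eta_{i}\Phi(X_{i})$, and hence

\mathcal{R}^{*}(\eta)=\frac{1}{2\lambda(n+1)}\left\|\sum_{i=1}^{n+1}\eta_{i}\Phi(X_{i})\right\|_{2}^{2},

so that (A.4) becomes a box-constrained concave quadratic program in $\eta$.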

Now, recall that we used the notation ηS\eta^{S} to denote the dual-optimal η\eta for a particular choice of SS. Assume for the sake of contradiction that there exists S~>S\tilde{S}>S such that ηS~n+1<ηSn+1\eta^{\tilde{S}}_{n+1}<\eta^{S}_{n+1}. Observe that we can write the dual objective as

h(ηS)+SηSn+1,\displaystyle h(\eta^{S})+S\cdot\eta^{S}_{n+1},

where hh does not depend on SS. Our assumption implies that

(S~S)(ηS~n+1ηSn+1)<0,\displaystyle(\tilde{S}-S)\cdot\left(\eta^{\tilde{S}}_{n+1}-\eta^{S}_{n+1}\right)<0,

or equivalently,

S~(ηS~n+1ηSn+1)<S(ηS~n+1ηSn+1).\displaystyle\tilde{S}\cdot\left(\eta^{\tilde{S}}_{n+1}-\eta^{S}_{n+1}\right)<S\cdot\left(\eta^{\tilde{S}}_{n+1}-\eta^{S}_{n+1}\right).

On the other hand, by the optimality of $\eta^{\tilde{S}}$ for the dual program at the input $\tilde{S}$, we have that

h(ηS~)+S~ηS~n+1h(ηS)+S~ηSn+1S~(ηS~n+1ηSn+1)h(ηS)h(ηS~).\displaystyle h(\eta^{\tilde{S}})+\tilde{S}\cdot\eta^{\tilde{S}}_{n+1}\geq h(\eta^{S})+\tilde{S}\cdot\eta^{S}_{n+1}\quad\iff\quad\tilde{S}\cdot\left(\eta^{\tilde{S}}_{n+1}-\eta^{S}_{n+1}\right)\geq h(\eta^{S})-h(\eta^{\tilde{S}}).

Applying our assumption, we conclude that

S(ηS~n+1ηSn+1)>h(ηS)h(ηS~),\displaystyle S\cdot\left(\eta^{\tilde{S}}_{n+1}-\eta^{S}_{n+1}\right)>h(\eta^{S})-h(\eta^{\tilde{S}}),

which by rearranging yields the desired contradiction

\displaystyle h(\eta^{\tilde{S}})+S\cdot\eta^{\tilde{S}}_{n+1}>h(\eta^{S})+S\cdot\eta^{S}_{n+1}.

This contradicts the optimality of $\eta^{S}$ for the dual program at the input $S$, completing the proof. ∎
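To make the dual program (A.4) and the monotonicity just established concrete, the following is a minimal numerical sketch and not the paper's implementation: it assumes the hypothetical ridge penalty from the illustration above (so that $\mathcal{R}^{*}(\eta)$ is an explicit quadratic), simulated data, and the cvxpy modeling library; all variable names are ours.

```python
# Minimal sketch of the dual (A.4) under an assumed ridge penalty R(g) = (lam/2)*||beta||_2^2
# for linear g(x) = Phi(x)^T beta, so that R*(eta) = ||sum_i eta_i Phi(X_i)||_2^2 / (2*lam*(n+1)).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, alpha, lam = 50, 3, 0.1, 1.0
Phi = rng.normal(size=(n + 1, d))   # Phi(X_1), ..., Phi(X_{n+1}); the test point is last
S = rng.normal(size=n)              # calibration scores S_1, ..., S_n

def eta_test_coordinate(S_test):
    """Solve (A.4) with the test score imputed as S_test and return eta_{n+1}."""
    scores = np.append(S, S_test)
    eta = cp.Variable(n + 1)
    conjugate = cp.sum_squares(Phi.T @ eta) / (2 * lam * (n + 1))  # R*(eta) in the ridge case
    problem = cp.Problem(cp.Maximize(scores @ eta - conjugate),
                         [eta >= -alpha, eta <= 1 - alpha])
    problem.solve()
    return eta.value[-1]

# By the argument above, S -> eta_{n+1}^S is non-decreasing, so the dual prediction set
# {y : eta_{n+1}^{S(X_{n+1}, y)} < 1 - alpha} amounts to thresholding the imputed score.
grid = np.linspace(-3.0, 3.0, 13)
etas = [eta_test_coordinate(s) for s in grid]
assert all(b >= a - 1e-4 for a, b in zip(etas, etas[1:]))  # monotone up to solver tolerance
print(np.round(etas, 3))
```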

We now turn to the proof of Proposition 4, which states that the coverage properties of C^dual(Xn+1)\hat{C}_{\text{dual}}(X_{n+1}) are the same as C^(Xn+1)\hat{C}(X_{n+1}).

Proof of Proposition 4.

The proof of this proposition is nearly identical to the proof of Theorem 3, except that instead of examining the first-order conditions of the primal, we investigate the first-order conditions of the Lagrangian (A.3) at the optimal dual variables. We keep the notation of the proof of Theorem 4.

We begin by proving the first statement pertaining to the coverage properties of C^dual(Xn+1)\hat{C}_{\text{dual}}(X_{n+1}). Let (g^Sn+1,pSn+1,qSn+1,ηSn+1,γSn+1i,ξiSn+1)(\hat{g}_{S_{n+1}},p^{S_{n+1}},q^{S_{n+1}},\eta^{S_{n+1}},\gamma^{S_{n+1}}_{i},\xi_{i}^{S_{n+1}}) denote an optimal primal-dual solution at the input S=Sn+1S=S_{n+1}. Recall from the proof of Theorem 4 that the Lagrangian for the optimization is

(1α)𝟏pSn+1+α𝟏qSn+1+(n+1)(g^Sn+1)+i=1n+1ηiSn+1(Sig^Sn+1(Xi)pSn+1i+qSn+1i)\displaystyle(1-\alpha)\cdot\mathbf{1}^{\top}p^{S_{n+1}}+\alpha\cdot\mathbf{1}^{\top}q^{S_{n+1}}+(n+1)\mathcal{R}(\hat{g}_{S_{n+1}})+\sum_{i=1}^{n+1}\eta_{i}^{S_{n+1}}(S_{i}-\hat{g}_{S_{n+1}}(X_{i})-p^{S_{n+1}}_{i}+q^{S_{n+1}}_{i})
i=1n+1(γSn+1pSn+1i+ξiSn+1qiSn+1).\displaystyle\quad\quad-\sum_{i=1}^{n+1}(\gamma^{S_{n+1}}p^{S_{n+1}}_{i}+\xi_{i}^{S_{n+1}}q_{i}^{S_{n+1}}).

Fix any re-weighting function ff\in\mathcal{F}. By assumption we know that strong duality (and thus the KKT conditions) hold. So, by considering the derivative of the Lagrangian in the direction ff and applying the KKT stationarity condition we find that

0=ddϵ(n+1)(g^Sn+1+ϵf)|ϵ=0i=1n+1ηSn+1if(Xi).0=\frac{d}{d\epsilon}(n+1)\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}-\sum_{i=1}^{n+1}\eta^{S_{n+1}}_{i}f(X_{i}). (A.5)

To further unpack this equality, note that the complementary slackness conditions require $p^{S_{n+1}}_{i}\gamma^{S_{n+1}}_{i}=0$ and $q^{S_{n+1}}_{i}\xi^{S_{n+1}}_{i}=0$. Thus, when $S_{i}-\hat{g}_{S_{n+1}}(X_{i})>0$ we must have $\gamma^{S_{n+1}}_{i}=0$, or equivalently $\eta^{S_{n+1}}_{i}=1-\alpha$, and when $S_{i}-\hat{g}_{S_{n+1}}(X_{i})<0$ we must have $\eta^{S_{n+1}}_{i}=-\alpha$. Lastly, when the residual is exactly $0$, the corresponding $\eta^{S_{n+1}}_{i}$ can take any value in $[-\alpha,1-\alpha]$. Plugging these observations into (A.5), we obtain

0\displaystyle 0 =ddϵ(n+1)(g^Sn+1+ϵf)|ϵ=0+i:Si<g^Sn+1(Xi)αf(Xi)\displaystyle=\frac{d}{d\epsilon}(n+1)\cdot\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}+\sum_{i:S_{i}<\hat{g}_{S_{n+1}}(X_{i})}\alpha\cdot f(X_{i})
i:Si>g^Sn+1(Xi)(1α)f(Xi)i:Si=g^Sn+1(Xi)ηiSn+1f(Xi)\displaystyle\quad\quad-\sum_{i:S_{i}>\hat{g}_{S_{n+1}}(X_{i})}(1-\alpha)f(X_{i})-\sum_{i:S_{i}=\hat{g}_{S_{n+1}}(X_{i})}\eta_{i}^{S_{n+1}}f(X_{i})
=ddϵ(n+1)(g^Sn+1+ϵf)|ϵ=0+i:ηSn+1i<1ααf(Xi)\displaystyle=\frac{d}{d\epsilon}(n+1)\cdot\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}+\sum_{i:\eta^{S_{n+1}}_{i}<1-\alpha}\alpha f(X_{i})
i:ηSn+1i=1α(1α)f(Xi)i:Si=g^Sn+1(Xi),ηiSn+1<1α(ηiSn+1+α)f(Xi)\displaystyle\quad\quad-\sum_{i:\eta^{S_{n+1}}_{i}=1-\alpha}(1-\alpha)f(X_{i})-\sum_{i:S_{i}=\hat{g}_{S_{n+1}}(X_{i}),\ \eta_{i}^{S_{n+1}}<1-\alpha}\left(\eta_{i}^{S_{n+1}}+\alpha\right)f(X_{i})
=ddϵ(n+1)(g^Sn+1+ϵf)|ϵ=0+i=1n+1(α𝟙{ηSn+1i=1α})f(Xi)\displaystyle=\frac{d}{d\epsilon}(n+1)\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}+\sum_{i=1}^{n+1}\left(\alpha-\mathbbm{1}\left\{\eta^{S_{n+1}}_{i}=1-\alpha\right\}\right)f(X_{i})
i:Si=g^Sn+1(Xi),ηiSn+1<1α(ηiSn+1+α)f(Xi).\displaystyle\quad\quad-\sum_{i:S_{i}=\hat{g}_{S_{n+1}}(X_{i}),\ \eta_{i}^{S_{n+1}}<1-\alpha}\left(\eta_{i}^{S_{n+1}}+\alpha\right)f(X_{i}).

To relate this stationarity condition to the coverage, note that

𝔼[f(Xn+1)(𝟙{Yn+1C^dual(Xn+1)}(1α))]=𝔼[f(Xn+1)(α𝟙{Yn+1C^dual(Xn+1)})]\displaystyle\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}_{\text{dual}}(X_{n+1})\}-(1-\alpha))]=\mathbb{E}[f(X_{n+1})(\alpha-\mathbbm{1}\{Y_{n+1}\notin\hat{C}_{\text{dual}}(X_{n+1})\})]
=𝔼[f(Xn+1)(α𝟙{ηSn+1n+1=1α})]\displaystyle=\mathbb{E}\left[f(X_{n+1})\left(\alpha-\mathbbm{1}\left\{\eta^{S_{n+1}}_{n+1}=1-\alpha\right\}\right)\right]
=𝔼[1n+1i=1n+1f(Xi)(α𝟙{ηSn+1i=1α})]\displaystyle=\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}f(X_{i})\left(\alpha-\mathbbm{1}\left\{\eta^{S_{n+1}}_{i}=1-\alpha\right\}\right)\right]
=𝔼[ddϵ(g^Sn+1+ϵf)|ϵ=0]+𝔼[1n+1i:Si=g^Sn+1(Xi),ηiSn+1<1α(ηiSn+1+α)f(Xi)].\displaystyle=-\mathbb{E}\left[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}\right]+\mathbb{E}\left[\frac{1}{n+1}\sum_{i:S_{i}=\hat{g}_{S_{n+1}}(X_{i}),\ \eta_{i}^{S_{n+1}}<1-\alpha}\left(\eta_{i}^{S_{n+1}}+\alpha\right)f(X_{i})\right].

Finally, since ηiSn+1[α,1α]\eta_{i}^{S_{n+1}}\in[-\alpha,1-\alpha] the second term above can be bounded as

|𝔼[1n+1i:Si=g^Sn+1(Xi),ηiSn+1<1α(ηiSn+1+α)f(Xi)]|\displaystyle\left|\mathbb{E}\left[\frac{1}{n+1}\sum_{i:S_{i}=\hat{g}_{S_{n+1}}(X_{i}),\ \eta_{i}^{S_{n+1}}<1-\alpha}\left(\eta_{i}^{S_{n+1}}+\alpha\right)f(X_{i})\right]\right| 𝔼[1n+1i=1n+1𝟙{Si=g^Sn+1(Xi)}|f(Xi)|]\displaystyle\leq\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}|f(X_{i})|\right]
=𝔼[𝟙{Si=g^Sn+1(Xi)}|f(Xi)|],\displaystyle=\mathbb{E}[\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}|f(X_{i})|],

while when f0f\geq 0, we additionally have the lower bound

𝔼[1n+1i:Si=g^Sn+1(Xi),ηiSn+1<1α(ηiSn+1+α)f(Xi)]\displaystyle\mathbb{E}\left[\frac{1}{n+1}\sum_{i:S_{i}=\hat{g}_{S_{n+1}}(X_{i}),\ \eta_{i}^{S_{n+1}}<1-\alpha}\left(\eta_{i}^{S_{n+1}}+\alpha\right)f(X_{i})\right]
𝔼[1n+1i:Si=g^Sn+1(Xi),ηiSn+1<1α(α+α)f(Xi)]0.\displaystyle\geq\mathbb{E}\left[\frac{1}{n+1}\sum_{i:S_{i}=\hat{g}_{S_{n+1}}(X_{i}),\ \eta_{i}^{S_{n+1}}<1-\alpha}\left(-\alpha+\alpha\right)f(X_{i})\right]\geq 0.

This concludes the proof of the first part of the proposition. For the second part of the proposition one simply notes that for any ff\in\mathcal{F},

𝔼[f(Xn+1)(𝟙{Yn+1C^dual, rand.(Xn+1)}(1α))]\displaystyle\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}_{\text{dual, rand.}}(X_{n+1})\}-(1-\alpha))] =𝔼[f(Xn+1)(𝟙{ηSn+1n+1<U}(1α))]\displaystyle=\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{\eta^{S_{n+1}}_{n+1}<U\}-(1-\alpha))]
=𝔼[f(Xn+1)(𝔼[𝟙{ηSn+1n+1<U}Xn+1,ηSn+1n+1](1α))]\displaystyle=\mathbb{E}[f(X_{n+1})(\mathbb{E}[\mathbbm{1}\{\eta^{S_{n+1}}_{n+1}<U\}\mid X_{n+1},\eta^{S_{n+1}}_{n+1}]-(1-\alpha))]
=𝔼[f(Xn+1)ηSn+1n+1]\displaystyle=-\mathbb{E}[f(X_{n+1})\eta^{S_{n+1}}_{n+1}]
=𝔼[1n+1i=1n+1f(Xn+1)ηSn+1i]\displaystyle=-\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}f(X_{n+1})\eta^{S_{n+1}}_{i}\right]
=𝔼[ddϵ(g^Sn+1+ϵf)],\displaystyle=-\mathbb{E}\left[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\right],

where the last equality is simply our first-order condition (A.5). ∎

A.7 Two-sided fitting

Recall that in the main text we defined the two-sided prediction set

C^two-sid.(Xn+1)={y:g^α/2S(Xn+1,y)(Xn+1)S(Xn+1,y)g^1α/2S(Xn+1,y)(Xn+1)}.\hat{C}_{\textup{two-sid.}}(X_{n+1})=\{y:\hat{g}^{\alpha/2}_{S(X_{n+1},y)}(X_{n+1})\leq S(X_{n+1},y)\leq\hat{g}^{1-\alpha/2}_{S(X_{n+1},y)}(X_{n+1})\}.

Analogously to our treatment of one-sided prediction sets in the main text, in this section we outline the coverage properties and a computationally efficient implementation of $\hat{C}_{\textup{two-sid.}}(X_{n+1})$.

To begin, let $\eta^{S,\tau}$ denote an optimal solution to the dual program (4.2) when $\alpha$ is replaced by $1-\tau$. That is, recalling the definition $\mathcal{R}^{*}(\eta):=-\min_{g\in\mathcal{F}}(n+1)\mathcal{R}(g)-\sum_{i=1}^{n+1}\eta_{i}g(X_{i})$, let $\eta^{S,\tau}$ be a solution to

maximizeηn+1i=1nηiSi+ηn+1S(η)subject to(1τ)ηiτ.\begin{split}\underset{\eta\in\mathbb{R}^{n+1}}{\text{maximize}}\quad&\sum_{i=1}^{n}\eta_{i}S_{i}+\eta_{n+1}S-\mathcal{R}^{*}\left(\eta\right)\\ \text{subject to}\quad&-(1-\tau)\leq\eta_{i}\leq\tau.\end{split}

Then, as with our one-sided sets, $\hat{C}_{\textup{two-sid.}}(X_{n+1})$ admits the analogous dual formulation

C^dual, two-sid.(Xn+1)={y:ηS(Xn+1,y),α/2>(1α/2) and ηS(Xn+1,y),1α/2<1α/2},\hat{C}_{\textup{dual, two-sid.}}(X_{n+1})=\{y:\eta^{S(X_{n+1},y),\alpha/2}>-(1-\alpha/2)\text{ and }\eta^{S(X_{n+1},y),1-\alpha/2}<1-\alpha/2\}, (A.6)

and the analogous randomized prediction set

C^dual, two-sid., rand.(Xn+1)={y:ηS(Xn+1,y),α/2>U1 and ηS(Xn+1,y),1α/2<U2},\hat{C}_{\textup{dual, two-sid., rand.}}(X_{n+1})=\{y:\eta^{S(X_{n+1},y),\alpha/2}>U_{1}\text{ and }\eta^{S(X_{n+1},y),1-\alpha/2}<U_{2}\}, (A.7)

where U1Unif([(1α/2),α/2])U_{1}\sim\text{Unif}([-(1-\alpha/2),\alpha/2]) and independently U2Unif([α/2,1α/2])U_{2}\sim\text{Unif}([-\alpha/2,1-\alpha/2]).
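For concreteness, the membership test in (A.6) can be evaluated by solving two small convex programs, one at level $\alpha/2$ and one at level $1-\alpha/2$. The sketch below is illustrative only: it again assumes the hypothetical ridge penalty used in the one-sided sketch above, simulated data, and our own variable names.

```python
# Minimal sketch of the two-sided membership test (A.6) under an assumed ridge penalty,
# so that R*(eta) = ||sum_i eta_i Phi(X_i)||_2^2 / (2*lam*(n+1)).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, alpha, lam = 50, 3, 0.1, 1.0
Phi = rng.normal(size=(n + 1, d))
S = rng.normal(size=n)

def eta_test_coordinate(S_test, tau):
    """Solve the level-tau dual (box constraints -(1 - tau) <= eta_i <= tau); return eta_{n+1}."""
    scores = np.append(S, S_test)
    eta = cp.Variable(n + 1)
    conjugate = cp.sum_squares(Phi.T @ eta) / (2 * lam * (n + 1))
    cp.Problem(cp.Maximize(scores @ eta - conjugate),
               [eta >= -(1 - tau), eta <= tau]).solve()
    return eta.value[-1]

def in_two_sided_set(S_test, tol=1e-4):
    # (A.6): keep y iff eta_{n+1} stays strictly inside the relevant bound at both levels
    # (checked up to a small solver tolerance).
    lower_ok = eta_test_coordinate(S_test, alpha / 2) > -(1 - alpha / 2) + tol
    upper_ok = eta_test_coordinate(S_test, 1 - alpha / 2) < (1 - alpha / 2) - tol
    return lower_ok and upper_ok

print({s: in_two_sided_set(s) for s in (-4.0, 0.0, 4.0)})  # extreme imputed scores are typically excluded
```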

As the next theorem states formally, these two-sided prediction sets have coverage properties identical to those of their one-sided analogs.

Theorem 5.

Let \mathcal{F} be any vector space, and assume that for all f,gf,g\in\mathcal{F}, the derivative of ϵ(g+ϵf)\epsilon\mapsto\mathcal{R}(g+\epsilon f) exists. If ff is non-negative with 𝔼P[f(X)]>0\mathbb{E}_{P}[f(X)]>0, then the unrandomized prediction set given by (A.6) satisfies the lower bound

f(Yn+1C^dual, two-sid.(Xn+1))1α\displaystyle\mathbb{P}_{f}(Y_{n+1}\in\hat{C}_{\textup{dual, two-sid.}}(X_{n+1}))\geq 1-\alpha 1𝔼P[f(X)]𝔼[ddϵ(g^1α/2Sn+1+ϵf)|ϵ=0]\displaystyle-\frac{1}{\mathbb{E}_{P}[f(X)]}\mathbb{E}\left[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}^{1-\alpha/2}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}\right]
1𝔼P[f(X)]𝔼[ddϵ(g^α/2Sn+1+ϵf)|ϵ=0].\displaystyle-\frac{1}{\mathbb{E}_{P}[f(X)]}\mathbb{E}\left[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}^{\alpha/2}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}\right].

On the other hand, suppose (X1,Y1),,(Xn+1,Yn+1)i.i.d.P(X_{1},Y_{1}),\dots,(X_{n+1},Y_{n+1})\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P. Then, for all ff\in\mathcal{F}, we additionally have the two-sided bound,

𝔼[f(Xn+1)(𝟙{Yn+1C^dual, two-sid.(Xn+1)}(1α))]=1𝔼P[f(X)]𝔼[ddϵ(g^1α/2Sn+1+ϵf)|ϵ=0]1𝔼P[f(X)]𝔼[ddϵ(g^α/2Sn+1+ϵf)|ϵ=0]+ϵint,\begin{split}&\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}_{\textup{dual, two-sid.}}(X_{n+1})\}-(1-\alpha))]\\ &=-\frac{1}{\mathbb{E}_{P}[f(X)]}\mathbb{E}\left[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}^{1-\alpha/2}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}\right]-\frac{1}{\mathbb{E}_{P}[f(X)]}\mathbb{E}\left[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}^{\alpha/2}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}\right]+\epsilon_{\textup{int}},\end{split} (A.8)

where ϵint\epsilon_{\textup{int}} is an interpolation error term satisfying |ϵint|𝔼[|f(Xi)|𝟙{Si=g^1α/2Sn+1(Xi)}]+𝔼[|f(Xi)|𝟙{Si=g^α/2Sn+1(Xi)}]|\epsilon_{\textup{int}}|\leq\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}^{1-\alpha/2}_{S_{n+1}}(X_{i})\}]+\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}^{\alpha/2}_{S_{n+1}}(X_{i})\}]. Furthermore, the same results also hold for the randomized set (A.7) with ϵint\epsilon_{\text{int}} replaced by 0.

Proof.

Note that

𝔼[f(Xn+1)(𝟙{Yn+1C^dual, two-sid.(Xn+1)}(1α))]\displaystyle\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}_{\textup{dual, two-sid.}}(X_{n+1})\}-(1-\alpha))]
=𝔼[f(Xn+1)(α𝟙{Yn+1C^dual, two-sid.(Xn+1)})]\displaystyle=\mathbb{E}[f(X_{n+1})(\alpha-\mathbbm{1}\{Y_{n+1}\notin\hat{C}_{\textup{dual, two-sid.}}(X_{n+1})\})]
=𝔼[f(Xn+1)(α/2𝟙{ηα/2,Sn+1n+1=(1α/2)})]+𝔼[f(Xn+1)(α/2𝟙{η1α/2,Sn+1n+1=(1α/2)})].\displaystyle=\mathbb{E}[f(X_{n+1})(\alpha/2-\mathbbm{1}\{\eta^{\alpha/2,S_{n+1}}_{n+1}=-(1-\alpha/2)\})]+\mathbb{E}[f(X_{n+1})(\alpha/2-\mathbbm{1}\{\eta^{1-\alpha/2,S_{n+1}}_{n+1}=(1-\alpha/2)\})].

The result then follows by repeating the steps of the proof of Proposition 4 twice to bound the two terms above separately. A similar argument establishes the coverage of $\hat{C}_{\textup{dual, two-sid., rand.}}(X_{n+1})$. ∎

A.8 Proofs of additional technical lemmas

Lemma 11 (Lipschitz property of the pinball loss).

The pinball loss is 1-Lipschitz in the sense that for any y1,y2,y3,y4y_{1},y_{2},y_{3},y_{4}\in\mathbb{R},

|α(y1,y2)α(y3,y4)||(y1y2)(y3y4)|.|\ell_{\alpha}(y_{1},y_{2})-\ell_{\alpha}(y_{3},y_{4})|\leq|(y_{1}-y_{2})-(y_{3}-y_{4})|.
Proof.

We will show that α(y1,y2)α(y3,y4)|(y1y2)(y3y4)|\ell_{\alpha}(y_{1},y_{2})-\ell_{\alpha}(y_{3},y_{4})\leq|(y_{1}-y_{2})-(y_{3}-y_{4})|. The reverse inequality will then follow by symmetry. There are four cases.

Case 1:

y1y2,y3y4y_{1}\geq y_{2},\ y_{3}\geq y_{4}.

α(y1,y2)α(y3,y4)=α(y1y2)α(y3y4)|(y1y2)(y3y4)|.\ell_{\alpha}(y_{1},y_{2})-\ell_{\alpha}(y_{3},y_{4})=\alpha(y_{1}-y_{2})-\alpha(y_{3}-y_{4})\leq|(y_{1}-y_{2})-(y_{3}-y_{4})|.
Case 2:

y1<y2,y3<y4y_{1}<y_{2},\ y_{3}<y_{4}.

α(y1,y2)α(y3,y4)=(1α)(y2y1)(1α)(y4y3)|(y1y2)(y3y4)|.\ell_{\alpha}(y_{1},y_{2})-\ell_{\alpha}(y_{3},y_{4})=(1-\alpha)(y_{2}-y_{1})-(1-\alpha)(y_{4}-y_{3})\leq|(y_{1}-y_{2})-(y_{3}-y_{4})|.
Case 3:

y1y2,y3<y4y_{1}\geq y_{2},\ y_{3}<y_{4}.

α(y1,y2)α(y3,y4)\displaystyle\ell_{\alpha}(y_{1},y_{2})-\ell_{\alpha}(y_{3},y_{4}) =α(y1y2)(1α)(y4y3)\displaystyle=\alpha(y_{1}-y_{2})-(1-\alpha)(y_{4}-y_{3})
=α(y1y2(y3y4))+(y3y4)|(y1y2)(y3y4)|.\displaystyle=\alpha(y_{1}-y_{2}-(y_{3}-y_{4}))+(y_{3}-y_{4})\leq|(y_{1}-y_{2})-(y_{3}-y_{4})|.
Case 4:

y1<y2,y3y4y_{1}<y_{2},\ y_{3}\geq y_{4}.

α(y1,y2)α(y3,y4)\displaystyle\ell_{\alpha}(y_{1},y_{2})-\ell_{\alpha}(y_{3},y_{4}) =(1α)(y2y1)α(y3y4)\displaystyle=(1-\alpha)(y_{2}-y_{1})-\alpha(y_{3}-y_{4})
\displaystyle=(1-\alpha)(y_{2}-y_{1}-(y_{4}-y_{3}))+(y_{4}-y_{3})\leq|(y_{1}-y_{2})-(y_{3}-y_{4})|. ∎
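As a quick numerical sanity check of Lemma 11 (separate from the proof above), the sketch below assumes the convention $\ell_{\alpha}(y_{1},y_{2})=\alpha(y_{1}-y_{2})_{+}+(1-\alpha)(y_{2}-y_{1})_{+}$ that is implicit in the case analysis.

```python
# Numerical sanity check of the 1-Lipschitz property of the pinball loss (Lemma 11),
# assuming the convention ell_alpha(y1, y2) = alpha*(y1 - y2)^+ + (1 - alpha)*(y2 - y1)^+.
import numpy as np

def pinball(y1, y2, alpha):
    return alpha * max(y1 - y2, 0.0) + (1 - alpha) * max(y2 - y1, 0.0)

rng = np.random.default_rng(1)
alpha = 0.3
for _ in range(10_000):
    y1, y2, y3, y4 = rng.normal(size=4)
    gap = abs(pinball(y1, y2, alpha) - pinball(y3, y4, alpha))
    assert gap <= abs((y1 - y2) - (y3 - y4)) + 1e-12  # the bound of Lemma 11
```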

Lemma 12.

Let $\Phi(X_{1}),\dots,\Phi(X_{n+1})\in\mathbb{R}^{d}$ be i.i.d. and assume that there exist constants $C_{2},c_{2},\rho>0$ such that $\sup_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|^{2}]^{1/2}\leq c_{2}$, $\mathbb{E}[\|\Phi(X_{i})\|^{2}]\leq C_{2}d$, and $\inf_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|]\geq\rho$. Then there exist constants $c,c^{\prime}>0$ (depending only on $C_{2}$, $c_{2}$, and $\rho$) such that

(infβ:β2=11n+1i=1n+1|Φ(Xi)β|c)1cd2(n+1)2\mathbb{P}\left(\inf_{\beta:\|\beta\|_{2}=1}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|\geq c\right)\geq 1-c^{\prime}\frac{d^{2}}{(n+1)^{2}}
Proof.

The main idea of this proof is to apply Theorem 5.4 of Mendelson (2014) to conclude that, for some constant $a>0$, $|\{i:|\Phi(X_{i})^{\top}\beta|>a\}|$ is large uniformly in $\beta$. To apply this theorem we need to check two technical conditions. Namely, we need to show that the class of functions $x\mapsto|\Phi(x)^{\top}\beta|$ has bounded Rademacher complexity and that $\mathbb{P}(|\Phi(X_{i})^{\top}\beta|>\Omega(a))$ is not too small. We now check these conditions.

Let σ1,,σn+1\sigma_{1},\dots,\sigma_{n+1} denote i.i.d. Rademacher random variables. Then, the Rademacher complexity of x|Φ(x)β|x\mapsto|\Phi(x)^{\top}\beta| can be bounded as

𝔼[supβ:β2=1|1n+1i=1n+1σi|Φ(Xi)β||]\displaystyle\mathbb{E}\left[\sup_{\beta:\|\beta\|_{2}=1}\left|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|\Phi(X_{i})^{\top}\beta|\right|\right] =𝔼[1n+1i=1n+1σiΦ(Xi)2]\displaystyle=\mathbb{E}\left[\left\|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}\Phi(X_{i})\right\|_{2}\right]
𝔼[1n+1i=1n+1σiΦ(Xi)22]1/2C2dn+1.\displaystyle\leq\mathbb{E}\left[\left\|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}\Phi(X_{i})\right\|_{2}^{2}\right]^{1/2}\leq\sqrt{\frac{C_{2}d}{n+1}}.

This verifies the first technical condition. For the second condition, note that by the Paley–Zygmund inequality (applied to the non-negative random variable $|\Phi(X_{i})^{\top}\beta|$ with threshold equal to half its mean, together with $\mathbb{E}[|\Phi(X_{i})^{\top}\beta|]\geq\rho$),

(|Φ(Xi)β|>12ρ)𝔼[|Φ(Xi)β|]24𝔼[|Φ(Xi)β|2]ρ24c22.\displaystyle\mathbb{P}\left(|\Phi(X_{i})^{\top}\beta|>\frac{1}{2}\rho\right)\geq\frac{\mathbb{E}[|\Phi(X_{i})^{\top}\beta|]^{2}}{4\mathbb{E}[|\Phi(X_{i})^{\top}\beta|^{2}]}\geq\frac{\rho^{2}}{4c^{2}_{2}}.

Thus, by Theorem 5.4 in Mendelson (2014), there exist constants $a,b>0$ such that, with probability at least $1-c^{\prime}d^{2}/(n+1)^{2}$,

\inf_{\beta:\|\beta\|=1}|\{i:|\Phi(X_{i})^{\top}\beta|>a\}|\geq(n+1)b,

and thus in particular,

infβ:β=11n+1i=1n+1|Φ(Xi)β|ab.\inf_{\beta:\|\beta\|=1}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|\geq ab.

Taking c=abc=ab gives the desired result. ∎