
Conformal Prediction with Conditional Guarantees

Isaac Gibbs, John J. Cherian, and Emmanuel J. Candès. Isaac Gibbs and John J. Cherian: Department of Statistics, Stanford University; Emmanuel J. Candès: Departments of Mathematics and Statistics, Stanford University. Address for correspondence: Sequoia Hall, Stanford University, 390 Serra Mall, Stanford, CA 94305, USA. Emails: [email protected], [email protected].
Abstract

We consider the problem of constructing distribution-free prediction sets with finite-sample conditional guarantees. Prior work has shown that it is impossible to provide exact conditional coverage universally in finite samples. Thus, most popular methods only guarantee marginal coverage over the covariates or are restricted to a limited set of conditional targets, e.g. coverage over a finite set of pre-specified subgroups. This paper bridges this gap by defining a spectrum of problems that interpolate between marginal and conditional validity. We motivate these problems by reformulating conditional coverage as coverage over a class of covariate shifts. When the target class of shifts is finite-dimensional, we show how to simultaneously obtain exact finite-sample coverage over all possible shifts. For example, given a collection of subgroups, our prediction sets guarantee coverage over each group. For more flexible, infinite-dimensional classes where exact coverage is impossible, we provide a procedure for quantifying the coverage errors of our algorithm. Moreover, by tuning interpretable hyperparameters, we allow the practitioner to control the size of these errors across shifts of interest. Our methods can be incorporated into existing split conformal inference pipelines, and thus can be used to quantify the uncertainty of modern black-box algorithms without distributional assumptions.

Keywords— conditional coverage, conformal inference, covariate shift, distribution-free prediction, prediction sets, black-box uncertainty quantification.

1 Introduction

Consider a training dataset {(Xi,Yi)}i=1n\{(X_{i},Y_{i})\}_{i=1}^{n}, and a test point (Xn+1,Yn+1)(X_{n+1},Y_{n+1}), all drawn i.i.d. from an unknown, arbitrary distribution PP. We study the problem of using the observed data {(Xi,Yi)}i=1n{Xn+1}\{(X_{i},Y_{i})\}_{i=1}^{n}\cup\{X_{n+1}\} to construct a prediction set C^(Xn+1)\hat{C}(X_{n+1}) that includes Yn+1Y_{n+1} with probability 1α1-\alpha over the randomness in the training and test points.

Ideally, we would like this prediction set to meet three competing goals: it should (1) make no assumptions on the underlying data-generating mechanism, (2) be valid in finite samples, and (3) satisfy the conditional coverage guarantee, (Yn+1C^(Xn+1)Xn+1=x)=1α\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}=x)=1-\alpha. Unfortunately, prior work has shown that it is impossible to obtain all three of these conditions simultaneously (Vovk (2012), Barber et al. (2020)). Perhaps the closest method to achieving these goals is conformal prediction, which relaxes the third criterion to the marginal coverage guarantee, (Yn+1C^(Xn+1))=1α\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1}))=1-\alpha.

[Figure 1 flowchart: an ML model yields accurate point predictions; a conformity score measures prediction errors on a hold-out set; split conformal calibration then gives marginal coverage (Theorem 1), while our conditional calibration gives conditional coverage (Theorems 2 and 3).]
Figure 1: Predicting with finite-sample guarantees: our conditionally valid pipeline vs. split conformal prediction.

The gap between conditional and marginal coverage can be extremely consequential in high-stakes decision-making. Marginal validity does not preclude substantial variation in coverage among relevant subpopulations. For example, a conformal prediction set for predicting a drug candidate’s binding affinity achieves marginal 1α1-\alpha coverage even if it underestimates the predictive uncertainty over the most promising subset of compounds. In human-centered applications, marginally valid prediction sets can be untrustworthy for certain legally protected groups (e.g., those defined by sensitive attributes such as race, gender, age, etc.) (Romano et al. (2020a)).

Achieving practical, finite-sample results requires weakening our desideratum from exact conditional coverage. In this article, we pursue a goal that is motivated by the following equivalence:

(Yn+1C^(Xn+1)Xn+1=x)=1α,for all x\displaystyle\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}=x)=1-\alpha,\quad\text{for all $x$}
\displaystyle\iff
𝔼[f(Xn+1)(𝟏{Yn+1C^(Xn+1)}(1α))]=0,for all measurable f.\displaystyle\mathbb{E}[f(X_{n+1})(\mathbf{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha))]=0,\quad\text{for all measurable $f$}.

In particular, we define a relaxed coverage objective by replacing “all measurable ff” with “all ff belonging to some (potentially infinite) class \mathcal{F}.” At the least complex end, taking ={x1}\mathcal{F}=\{x\mapsto 1\} recovers marginal validity, while richer, intermediate choices interpolate between marginal and conditional coverage.
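To make the group-conditional end of this spectrum concrete, note that for any group $G$ with $\mathbb{P}(X_{n+1}\in G)>0$, taking $f(x)=\mathbbm{1}\{x\in G\}$ reduces the requirement to coverage conditional on that group:

$$\mathbb{E}\big[\mathbbm{1}\{X_{n+1}\in G\}\big(\mathbf{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha)\big)\big]=0\iff\mathbb{P}\big(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G\big)=1-\alpha.$$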

This generic approach to relaxing conditional validity was first popularized by the “conditional-to-marginal” moment testing literature (Andrews and Shi (2013)). Our relaxation is also referred to as a “multi-accuracy” objective in theoretical computer science (Hébert-Johnson et al. (2018), Kim et al. (2019)). We remark that Deng et al. (2023) have concurrently proposed the same objective for the conditional coverage problem. However, their results focus on the infinite data regime, where the distribution of (X,Y)(X,Y) is known exactly, and their algorithm requires access to an unspecified black-box optimization oracle.

By contrast, our proposed method preserves the attractive assumption-free and computationally efficient properties of split conformal prediction. Emulating split conformal, we design our procedure as a wrapper that takes any black-box machine learning model as input. We then compute conformity scores that measure the accuracy of this model’s predictions on new test points. Finally, by calibrating bounds for these scores, we obtain prediction sets with finite-sample conditional guarantees. Figure 1 displays our workflow.

Unlike split conformal, our approach adaptively compensates for poor initial modeling decisions. In particular, if the prediction rule and conformity scores were well-designed at the outset, our procedure may only make small adjustments. This could happen, for instance, if the scores were derived from a well-specified parametric model. More often, however, the user will begin with an inaccurate or incomplete model that fails to fully capture the distribution of YXY\mid X. In these cases, our procedure will improve on split conformal by recalibrating the conformity score to provide exact conditional guarantees. In the predictive inference literature, conformal inference is often described as a protective layer that lies on top of a black-box machine learning model and transforms its point predictions into valid prediction sets. With this in mind, one might view our method as an additional protective layer that lies on top of a conformal method and transforms its (potentially poor) marginally valid outputs into richer, conditionally valid, prediction sets.

A number of prior works have also considered either modifying the split conformal calibration step (Lei and Wasserman (2014), Guan (2022), Barber et al. (2023)) or the initial prediction rule (Romano et al. (2019), Sesia and Romano (2021), Chernozhukov et al. (2021)) to better model the distribution of YXY\mid X. Crucially, despite heuristic improvements in the quality of the resulting prediction sets, all of the aforementioned approaches obtain at most weak asymptotic guarantees that rely on slow, non-parametric convergence rates.

Perhaps the only procedures to obtain a practical coverage guarantee are those of Barber et al. (2020), Vovk et al. (2003), and Jung et al. (2023). All of these methods guarantee a form of group-conditional coverage, i.e. (Yn+1C^(Xn+1)Xn+1G)1α\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G)\geq 1-\alpha for all sets GG in some pre-specified class 𝒢\mathcal{G}. However, the approach of Barber et al. (2020) can be computationally infeasible and severely conservative, yielding wide intervals with coverage probability far above the target level. On the other hand, the method of Vovk et al. (2003), Mondrian conformal prediction, provides exact coverage in finite samples, but does not allow the groups in 𝒢\mathcal{G} to overlap. Finally, Jung et al. (2023) propose running quantile regression over the linear function class {G𝒢βG𝟙{xG}:β|𝒢|}\{\sum_{G\in\mathcal{G}}\beta_{G}\mathbbm{1}\{x\in G\}:\beta\in\mathbb{R}^{|\mathcal{G}|}\}. This method is both practical and allows for overlapping groups. In this work, we propose a new method for achieving conditional coverage that improves upon the method of Jung et al. (2023) in three ways: (1) by providing tighter finite-sample coverage, (2) by requiring no assumptions on the data-generating distribution (in particular, unlike Jung et al. (2023), we allow for discrete outcomes), and (3) by providing conditional coverage guarantees far beyond the group setting.

1.1 Preview of contributions

To motivate and summarize the main contributions of our paper, we preview some applications. In particular, we show how our method can be used to satisfy two popular coverage desiderata: group-conditional coverage and coverage under covariate shift. Additionally, we demonstrate the improved finite sample performance of our method compared to previous approaches.

1.1.1 Group-conditional coverage

Refer to caption
Figure 2: Comparison of split conformal prediction (blue, left-most panel) and the randomized implementation of our method (orange, center panel) on a simulated dataset first considered by Romano et al. (2019). Black curves denote an estimate of the conditional mean, while the blue and orange shaded regions indicate the fitted prediction intervals. For this experiment, our method is implemented using the procedure outlined in Section 2 with :={G𝒢βG𝟙{xG}:β|𝒢|}\mathcal{F}:=\{\sum_{G\in\mathcal{G}}\beta_{G}\mathbbm{1}\{x\in G\}:\beta\in\mathbb{R}^{|\mathcal{G}|}\}. The rightmost panel shows the miscoverage of the two methods marginally over the x-axis and conditionally on x falling in the two grey shaded bands; the red line indicates the target level of α=0.1\alpha=0.1.

Group-conditional coverage requires that C^()\hat{C}(\cdot) satisfy (Yn+1C^(Xn+1)Xn+1G)=1α\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G)=1-\alpha for all GG belonging to some collection of pre-specified (potentially overlapping) groups 𝒢2Domain(X)\mathcal{G}\subseteq 2^{\text{Domain}(X)} (Barber et al. (2020)). This corresponds to a special case of our guarantee in which ={G𝒢βG𝟙{xG}:β|𝒢|}\mathcal{F}=\{\sum_{G\in\mathcal{G}}\beta_{G}\mathbbm{1}\{x\in G\}:\beta\in\mathbb{R}^{|\mathcal{G}|}\}.

Figure 2 illustrates the coverage guarantee on a simulated dataset. Here, xx is univariate and we have taken the groups 𝒢\mathcal{G} to be the collection of all sub-intervals with endpoints belonging to {0,0.5,1,,5}\{0,0.5,1,\dots,5\}. Two of these sub-intervals, [1,2][1,2] and [3,4][3,4], are shaded in grey. We compare two procedures, split conformal prediction and our conditional calibration method. As is standard, we implement split conformal using conformity score S(x,y):=|yμ^(x)|S(x,y):=|y-\hat{\mu}(x)| where μ^(x)\hat{\mu}(x) is an estimate of the conditional mean 𝔼[YX]\mathbb{E}[Y\mid X], while for our method, we take a two-sided approach in which upper and lower bounds on yμ^(x)y-\hat{\mu}(x) are computed separately. We see that while split conformal prediction only provides marginal validity, our method returns prediction sets that are adaptive to the shape of the data and thus obtain exact coverage over all subgroups.
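To make this choice of \mathcal{F} concrete, the short sketch below builds the indicator basis Φ(x)=(𝟙{x∈G})_{G∈𝒢} for the sub-interval groups of Figure 2. The grid construction and the helper name group_features are our own illustration and are not part of the conditionalconformal package.

```python
import numpy as np

# Sub-intervals with endpoints in {0, 0.5, 1, ..., 5}, as in Figure 2.
endpoints = np.arange(0.0, 5.01, 0.5)
groups = [(a, b) for i, a in enumerate(endpoints) for b in endpoints[i + 1:]]

def group_features(x):
    """Map a scalar covariate x to the indicator vector (1{x in G})_{G in groups}."""
    return np.array([float(a <= x <= b) for (a, b) in groups])
```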

Refer to caption
Figure 3: Marginal calibration-conditional miscoverage (left panel) and length (right panel) of quantile regression (green) and the randomized (red) and unrandomized (orange) implementations of our conditional-calibration method on a simulated dataset. All methods are implemented in their two-sided form with conformity score S(x,y)=yS(x,y)=y, i.e. we estimate the α/2\alpha/2 and 1α/21-\alpha/2 quantiles of YXY\mid X separately and define the prediction set to be the values of yy that fall between the two bounds (see Section A.7 for details). Data for this simulation are generated i.i.d. from Yi=Xiw+ϵiY_{i}=X_{i}^{\top}w+\epsilon_{i} where Xi𝒩(0,Id)X_{i}\sim\mathcal{N}(0,I_{d}), ϵi𝒩(0,1)\epsilon_{i}\sim\mathcal{N}(0,1), and wUnif(𝒮d1)w\sim\text{Unif}(\mathcal{S}^{d-1}). We implement both vanilla quantile regression (Jung et al., 2023) and our conditional-calibration methods on the function class :={β0+i=1dβi𝟙{xi>0}:βd}\mathcal{F}:=\{\beta_{0}+\sum_{i=1}^{d}\beta_{i}\mathbbm{1}\{x_{i}>0\}:\beta\in\mathbb{R}^{d}\}. Boxplots show empirical estimates obtained by averaging over 1000 test points for each of 100 calibration datasets. The red line in the left panel indicates the target coverage level of 1α=0.91-\alpha=0.9.

In this paper, we improve upon existing group-conditional coverage results in two crucial aspects: (1) we obtain tighter finite-sample coverage and (2) we make no assumptions on the distribution of (Xi,Yi)(X_{i},Y_{i}) or the overlap of the groups 𝒢\mathcal{G}. Concretely, given an arbitrary finite collection of groups, our randomized conditional calibration method guarantees exact coverage,

(Yn+1C^(Xn+1)Xn+1G)=1α,G𝒢,\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G)=1-\alpha,\ \forall G\in\mathcal{G},

while our unrandomized procedure obeys the inequalities,

1α(Yn+1C^(Xn+1)Xn+1G)1α+|𝒢|(n+1)(XG),G𝒢.1-\alpha\leq\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G)\leq 1-\alpha+\frac{|\mathcal{G}|}{(n+1)\mathbb{P}(X\in G)},\ \forall G\in\mathcal{G}.

Figure 3 shows the improved finite-sample coverage of these methods. For simplicity, this plot only displays the marginal miscoverage. Boxplots in the figure show the estimated distributions of the calibration-conditional miscoverage and length, i.e. the quantities

(Yn+1C^(Xn+1){(Xi,Yi)}i=1n)and𝔼[length(C^(Xn+1)){(Xi,Yi)}i=1n],\mathbb{P}(Y_{n+1}\notin\hat{C}(X_{n+1})\mid\{(X_{i},Y_{i})\}_{i=1}^{n})\qquad\text{and}\qquad\mathbb{E}[\,\text{length}(\hat{C}(X_{n+1}))\mid\{(X_{i},Y_{i})\}_{i=1}^{n}],

as the sample size varies. In agreement with our theory, we find that the unrandomized version of our procedure guarantees conservative coverage, while our randomized variant offers exact coverage regardless of the sample size. On the other hand, the method of Jung et al. (2023), i.e. linear quantile regression over the set of subgroup-inclusion indicator functions, can severely undercover even at what might be considered large sample sizes.

1.1.2 Coverage under covariate shift

Given an appropriate choice of \mathcal{F}, our prediction set also achieves coverage under covariate shift. To define this objective, fix any non-negative function ff and let f\mathbb{P}_{f} denote the setting in which {(Xi,Yi)}i=1n\{(X_{i},Y_{i})\}_{i=1}^{n} is sampled i.i.d. from PP, while (Xn+1,Yn+1)(X_{n+1},Y_{n+1}) is sampled independently from the distribution in which PXP_{X} is “tilted” by ff, i.e.,

Xn+1f(x)𝔼P[f(X)]dPX(x),Yn+1Xn+1PYX.X_{n+1}\sim\frac{f(x)}{\mathbb{E}_{P}[f(X)]}\cdot dP_{X}(x),\quad Y_{n+1}\mid X_{n+1}\sim P_{Y\mid X}.

Then, our method guarantees coverage under f\mathbb{P}_{f} so long as ff\in\mathcal{F}. For example, when \mathcal{F} is a finite-dimensional linear function class, our prediction set satisfies

f(Yn+1C^(Xn+1))=1α,for all non-negative functions f.\displaystyle\mathbb{P}_{f}({Y}_{n+1}\in\hat{C}({X}_{n+1}))=1-\alpha,\qquad\text{for all non-negative functions $f\in\mathcal{F}$}.

A number of previous works also establish coverage under covariate shift. However, these works assume that there is a single covariate shift of interest that is either known a priori (Tibshirani et al. (2019)) or estimated from unlabeled data (Qiu et al. (2022), Yang et al. (2022)). Our method captures this setting as a special case in which \mathcal{F} is chosen to be a singleton. On the other hand, when \mathcal{F} is non-singleton, our guarantee is more general and ensures coverage over all shifts ff\in\mathcal{F} simultaneously.
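As an aside, readers who wish to check this guarantee in simulation need test covariates drawn from the f-tilted distribution. One simple way to produce such draws is rejection sampling; the sketch below is our own illustration (not part of the paper's method) and assumes f is non-negative and bounded above by a known constant f_max.

```python
import numpy as np

def sample_tilted_covariate(sample_px, f, f_max, rng):
    """Draw one X whose density is proportional to f(x) dP_X(x), via rejection sampling.

    sample_px: callable rng -> one draw from P_X
    f:         non-negative tilt function
    f_max:     an upper bound on f over the support of P_X (assumed known)
    """
    while True:
        x = sample_px(rng)
        if rng.uniform() <= f(x) / f_max:
            return x

# Example: tilt a standard normal covariate by a Gaussian bump centered at 1.5.
rng = np.random.default_rng(0)
tilt = lambda x: np.exp(-0.5 * (x - 1.5) ** 2 / 0.2 ** 2)  # bounded above by 1
x_shifted = sample_tilted_covariate(lambda r: r.normal(), tilt, 1.0, rng)
```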

Refer to caption
Figure 4: Comparison of split conformal prediction (blue, left-most panel) and the randomized implementation of our method (orange, center panel) on a simulated dataset first considered by Romano et al. (2019). Black curves denote an estimate of the conditional mean, while the blue and orange shaded regions indicate the fitted prediction intervals. We consider coverage under three scenarios: marginally over the whole x-axis and locally under two Gaussian tilts (denoted by f1f_{1} and f2f_{2} and plotted as grey dotted lines). For this experiment, our method is implemented using the procedure outlined in Section 2 with :={β0+i=15βiwi(x):β6}\mathcal{F}:=\{\beta_{0}+\sum_{i=1}^{5}\beta_{i}w_{i}(x):\beta\in\mathbb{R}^{6}\}, where the wiw_{i} correspond to the Gaussian tilts with parameters (μ,σ){(0.5,1),(1.5,0.2),(2.5,1),(3.5,0.2),(4.5,1)}(\mu,\sigma)\in\{(0.5,1),(1.5,0.2),(2.5,1),(3.5,0.2),(4.5,1)\}. The rightmost panel indicates the miscoverage of both methods under all three settings, with a red line denoting the target level of α=0.1\alpha=0.1.

Figure 4 illustrates a simple example of this guarantee on a synthetic dataset. Once again, the covariate xx is a scalar and the conformity score is taken to be S(x,y)=|μ^(x)y|S(x,y)=|\hat{\mu}(x)-y|; following the previous example, we implement the two-sided version of our method. We consider five covariate shifts in which PXP_{X} is tilted by the Gaussian density fμ,σ(x)=exp(12σ2(xμ)2)f_{\mu,\sigma}(x)=\exp(-\frac{1}{2\sigma^{2}}(x-\mu)^{2}) for (μ,σ){(0.5,1),(1.5,0.2),(2.5,1),(3.5,0.2),(4.5,1)}(\mu,\sigma)\in\{(0.5,1),(1.5,0.2),(2.5,1),(3.5,0.2),(4.5,1)\}. The shifts centered at 1.51.5 and 3.53.5 are plotted in grey and denoted as f1f_{1} and f2f_{2} in the figure. As the left and center panels show, split conformal gives a constant-width prediction band over the entire x-axis, while our method adapts to the shape of the data around the covariate shifts. The right-most panel of the figure validates that this correction is sufficient to expand the marginal coverage guarantee of split conformal inference to exact coverage under all three scenarios: no shift, shift by f1f_{1}, and shift by f2f_{2}.
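For reference, the finite-dimensional class \mathcal{F} used in this experiment can be encoded by a simple feature map consisting of an intercept plus the five Gaussian tilts; the helper name tilt_features below is ours.

```python
import numpy as np

# Gaussian tilt parameters (mu, sigma) from Figure 4.
TILT_PARAMS = [(0.5, 1.0), (1.5, 0.2), (2.5, 1.0), (3.5, 0.2), (4.5, 1.0)]

def tilt_features(x):
    """Feature map Phi(x) spanning the intercept and the five Gaussian tilts f_{mu,sigma}."""
    tilts = [np.exp(-0.5 * (x - mu) ** 2 / sigma ** 2) for mu, sigma in TILT_PARAMS]
    return np.array([1.0] + tilts)
```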

Refer to caption
Figure 5: Demonstration of the unrandomized implementation of our shift-agnostic method on a simulated dataset first considered by Romano et al. (2019). The orange shaded region in the left panel depicts the prediction interval output by our method when \mathcal{F} is chosen to be the Gaussian reproducing kernel Hilbert space given by kernel K(x,y)=exp(12.5|xy|2)K(x,y)=\exp(-12.5|x-y|^{2}) and the hyperparameter λ\lambda is set equal to 0.0050.005 (see Section 3 for details). Hatched and solid bars in the right panel show the estimated coverage returned by our method and the true realized empirical coverage, respectively. Finally, the red line indicates the target level of α=0.1\alpha=0.1.

Extending even further beyond existing approaches, our method can provide guarantees even when no prior information about the shift is available. By fitting over a so-called “universal” function class, e.g., taking \mathcal{F} to be a suitable reproducing kernel Hilbert space, we provide coverage guarantees under any covariate shift. Due to the complexity of these classes, our coverage guarantee is no longer exactly 1α1-\alpha for all tilts ff. Instead, we obtain an accurate estimate of the (improved) finite-sample coverage of our method. For example, as seen in Figure 5, if we run our shift-agnostic method for the two plotted Gaussian tilts, our estimated coverage precisely matches the empirically observed values. Thus, even though we cannot guarantee exact 1α1-\alpha coverage for all shifts in this example, we are still able to accurately report the true performance that users can expect from our method. For more information on how these estimates are computed, we refer the interested reader to Section 3 and, in particular, to (3.4) and Proposition 2.

1.2 Outline

The remainder of this article is structured as follows. In Section 2, we introduce our method and give coverage results for the case in which \mathcal{F} is finite dimensional. These results are expanded on in Section 3, where we consider infinite dimensional classes. Computational difficulties that arise in both the finite and infinite dimensional cases are addressed in Section 4, and an efficient implementation of our method is given. In Section 5, we apply our method to two datasets and show that our approach attains tighter finite-sample conditional coverage and more predictable failure modes than competing alternatives. Finally, we conclude in Section 6 with a discussion of considerations that arise when choosing the function class (and associated regularization) to use in our method.

A Python package, conditionalconformal, implementing our methods is available on PyPI, and notebooks reproducing the experimental results in this paper can be found at github.com/jjcherian/conditional-conformal.

Notation: In this paper, we consider two settings. In the first, we take {(Xi,Yi)}i=1n+1i.i.d.P\{(X_{i},Y_{i})\}_{i=1}^{n+1}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P. In the second, {(Xi,Yi)}i=1ni.i.d.P\{(X_{i},Y_{i})\}_{i=1}^{n}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P, while (Xn+1,Yn+1)(X_{n+1},Y_{n+1}) is sampled independently from the tilted distribution Xn+1(f(x)/𝔼P[f(X)])dPX(x)X_{n+1}\sim(f(x)/\mathbb{E}_{P}[f(X)])\cdot dP_{X}(x) and Yn+1Xn+1PYXY_{n+1}\mid X_{n+1}\sim P_{Y\mid X}. We write \mathbb{P} and 𝔼\mathbb{E} with no subscript when referring to the first scenario, while we use the subscript ff to denote the second. Additionally, note that throughout this article we use (𝒳,𝒴)(\mathcal{X},\mathcal{Y}) to denote the domain of the (X,Y)(X,Y) pairs.

2 Protection against finite dimensional shifts

2.1 Warm-up: marginal coverage

As a starting point to motivate our approach, we show that split conformal prediction is a special case of our method. Before explaining this result in detail, it is useful to first review the details of the split conformal algorithm.

Recall that the conformity score function, S:𝒳×𝒴S:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}, measures how well the prediction of some model at XX “conforms” to the target YY. For instance, given an estimate μ^()\hat{\mu}(\cdot) of 𝔼[YX]\mathbb{E}[Y\mid X], we may take S(,)S(\cdot,\cdot) to be the absolute residual S(x,y)=|yμ^(x)|S(x,y)=|y-\hat{\mu}(x)|. In a typical implementation of split conformal, we would need to split the training data {(Xi,Yi)}i=1n\{(X_{i},Y_{i})\}_{i=1}^{n} into two parts, using one part to train μ^\hat{\mu} and reserving the second part as the calibration set. Because our method provides the same coverage guarantees regardless of the initial choice of S(,)S(\cdot,\cdot), we will not discuss this first step in detail. Instead, we will assume that the conformity score function is fixed, and we are free to use the entire dataset {(Xi,Yi)}i=1n\{(X_{i},Y_{i})\}_{i=1}^{n} to calibrate the scores. In practice, the initial step of fitting the conformity score can be critical for getting a good baseline predictor. For example, in our experiment in Section 5.3, we use a neural network to obtain an initial prediction of the genetic treatment given to cells based on fluorescent microscopy images.

Given a conformity score function and calibration set, the split conformal algorithm outputs the set of values yy for which S(Xn+1,y)S(X_{n+1},y) is sufficiently small, i.e., the set of values yy that conform with the prediction at Xn+1X_{n+1}. The threshold for this prediction set, which we denote by SS^{*}, is set to be the ((n+1)(1α)/n)(\lceil(n+1)\cdot(1-\alpha)\rceil/n)-quantile of the conformity scores evaluated on the calibration set. In summary, the split conformal prediction set is formally defined as

C^split(Xn+1)={y:S(Xn+1,y)S}.\hat{C}_{\text{split}}(X_{n+1})=\{y:S(X_{n+1},y)\leq S^{*}\}. (2.1)
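For readers who prefer code, a minimal sketch of this calibration step is given below; the helper names, the user-supplied score_fn, and the grid of candidate labels y_grid are our own illustrative choices rather than part of the conditionalconformal package.

```python
import numpy as np

def split_conformal_threshold(scores, alpha):
    """S*: the ceil((n+1)(1-alpha))-th smallest calibration score (infinite if that rank exceeds n)."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else np.sort(scores)[k - 1]

def split_conformal_set(x_test, score_fn, threshold, y_grid):
    """Candidate labels whose conformity score at x_test does not exceed S*."""
    return [y for y in y_grid if score_fn(x_test, y) <= threshold]
```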

The standard method for proving the marginal coverage of C^split()\hat{C}_{\text{split}}(\cdot) is to appeal to the exchangeability of the conformity scores. Namely, let S1,,Sn+1S_{1},\dots,S_{n+1} denote the scores S(X1,Y1),,S(Xn+1,Yn+1)S(X_{1},Y_{1}),\dots,S(X_{n+1},Y_{n+1}). Since the (n+1)(n+1)-th conformity score is drawn i.i.d. from the same distribution as the first nn scores, the location of Sn+1S_{n+1} among the order statistics of (S1,,Sn+1)(S_{1},\dots,S_{n+1}) is drawn uniformly at random from the n+1n+1 possible indices. So, recalling that SS^{*} is chosen to be the ((n+1)(1α)/n)(\lceil(n+1)\cdot(1-\alpha)\rceil/n)-quantile, i.e., the smallest order statistic satisfying (Sn+1S)1α\mathbb{P}(S_{n+1}\leq S^{*})\geq 1-\alpha, we arrive at the coverage guarantee (Yn+1C^split(Xn+1))=(Sn+1S)1α\mathbb{P}(Y_{n+1}\in\hat{C}_{\text{split}}(X_{n+1}))=\mathbb{P}(S_{n+1}\leq S^{*})\geq 1-\alpha. The following theorem summarizes the formal consequences of these observations.

Theorem 1 (Romano et al. (2019), Theorem 1, see also Vovk et al. (2005)).

Assume that {(Xi,Yi)}i=1n+1\{(X_{i},Y_{i})\}_{i=1}^{n+1} are independent and identically distributed. Then, the split conformal prediction set (2.1) satisfies,

(Yn+1C^split(Xn+1))1α.\mathbb{P}(Y_{n+1}\in\hat{C}_{\textup{split}}(X_{n+1}))\geq 1-\alpha.

If S(Xn+1,Yn+1)S(X_{n+1},Y_{n+1}) has a continuous distribution, it also holds that

(Yn+1C^split(Xn+1))1α+1n+1.\mathbb{P}(Y_{n+1}\in\hat{C}_{\textup{split}}(X_{n+1}))\leq 1-\alpha+\frac{1}{n+1}.

Our first insight is that this prediction set and marginal coverage guarantee can also be obtained by re-interpreting split conformal as an intercept-only quantile regression. Recall the definition of the “pinball” loss,

α(θ,S):={(1α)(Sθ) if Sθ,α(θS) if S<θ.\displaystyle\ell_{\alpha}(\theta,S):=\begin{cases}(1-\alpha)(S-\theta)\ \text{ if }S\geq\theta,\\ \alpha(\theta-S)\ \text{ if }S<\theta.\end{cases}

It is well-known that minimizing α\ell_{\alpha} over θ\theta will produce a (1α)(1-\alpha)-quantile of the training data, i.e., θ=argminθi=1nα(θ,Si)\theta^{*}=\mathop{\rm argmin}_{\theta\in\mathbb{R}}\sum_{i=1}^{n}\ell_{\alpha}(\theta,S_{i}) is a (1α)(1-\alpha)-quantile of {Si}i=1n\{S_{i}\}_{i=1}^{n} (Koenker and Bassett Jr (1978)).

In our exchangeability proof, recall that we upper bounded Sn+1S_{n+1} not by the (1α)(1-\alpha)-quantile of {Si}i=1n\{S_{i}\}_{i=1}^{n}, but by the ((n+1)(1α)/n)(\lceil(n+1)\cdot(1-\alpha)\rceil/n)-quantile. The latter value was obtained by considering an augmented dataset that included all of the scores S1,,Sn+1S_{1},\dots,S_{n+1}. To similarly account for the (unobserved) conformity score in a quantile regression, we will now fit θ\theta^{*} using a dataset that includes a guess for Sn+1S_{n+1}. Namely, let θ^S\hat{\theta}_{S} be a solution to the quantile regression problem in which we impute SS for the unknown conformity score, i.e.,

θ^S:=argminθ1n+1i=1nα(θ,Si)+1n+1α(θ,S).\hat{\theta}_{S}:=\mathop{\rm argmin}_{\theta\in\mathbb{R}}\frac{1}{n+1}\sum_{i=1}^{n}\ell_{\alpha}(\theta,S_{i})+\frac{1}{n+1}\ell_{\alpha}(\theta,S).

Then, one can verify that

C^split(Xn+1)={y:Sn+1(Xn+1,y)θ^Sn+1(Xn+1,y)},\hat{C}_{\text{split}}(X_{n+1})=\{y:S_{n+1}(X_{n+1},y)\leq\hat{\theta}_{S_{n+1}(X_{n+1},y)}\}, (2.2)

or said more informally, C^split(Xn+1)\hat{C}_{\text{split}}(X_{n+1}) includes any yy such that S(Xn+1,y)S(X_{n+1},y) is smaller than the (1α)(1-\alpha)-quantile of the augmented calibration set {Si}i=1n{S(Xn+1,y)}\{S_{i}\}_{i=1}^{n}\cup\{S(X_{n+1},y)\}. As an aside, we note there is some small subtlety here due to the non-uniqueness of θ^S\hat{\theta}_{S}. To get exact equality in (2.2), one should choose θ^S\hat{\theta}_{S} to be the largest minimizer of the quantile regression. Readers familiar with conformal inference will also recognize this method of imputing a guess for the missing (n+1)(n+1)-th datapoint as a type of full conformal prediction (Vovk et al. (2005)).
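The equivalence (2.2) is straightforward to check numerically. In the sketch below (our own illustration, with a user-supplied score_fn), we take the augmented (1−α)-quantile to be the ⌈(n+1)(1−α)⌉-th smallest element of the augmented score set; comparing S(X_{n+1},y) against this value reproduces the split conformal set (2.1).

```python
import numpy as np

def augmented_quantile(cal_scores, s_guess, alpha):
    """A (1-alpha)-quantile of {S_1,...,S_n, s_guess}: the ceil((n+1)(1-alpha))-th order statistic."""
    aug = np.append(cal_scores, s_guess)
    k = int(np.ceil(len(aug) * (1 - alpha)))
    return np.inf if k > len(aug) else np.sort(aug)[k - 1]

def in_split_set(x_test, y, score_fn, cal_scores, alpha):
    """Membership check in the style of (2.2): impute the candidate's own score, then compare."""
    s = score_fn(x_test, y)
    return s <= augmented_quantile(cal_scores, s, alpha)
```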

Having established that split conformal prediction can be derived via quantile regression, our generalization of this procedure to richer function classes naturally follows. Namely, we will replace the single score threshold θ\theta with a function f(X)f(X) that estimates the conditional quantiles of YXY\mid X. We then prove a generalization of Theorem 1 showing that the resulting prediction set attains a conditional coverage guarantee.

2.2 Finite dimensional classes

Recall our objective:

𝔼[f(Xn+1)(𝟙{Yn+1C^(Xn+1)}(1α))]=0,f.\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha))]=0,\ \forall f\in\mathcal{F}. (2.3)

In the previous section, we constructed a prediction set with marginal coverage, i.e., (2.3) for ={θ:θ}\mathcal{F}=\{\theta:\theta\in\mathbb{R}\}, by fitting an augmented quantile regression over the same function class ={θ:θ}\mathcal{F}=\{\theta:\theta\in\mathbb{R}\}. Here, we generalize this observation to any finite-dimensional linear class.

To formally define our method, let ={Φ()β:βd}\mathcal{F}=\{\Phi(\cdot)^{\top}\beta:\beta\in\mathbb{R}^{d}\} denote the class of linear functions over the basis Φ:𝒳d\Phi:\mathcal{X}\to\mathbb{R}^{d}. Our goal is to construct a C^\hat{C} satisfying (2.3) for this choice of \mathcal{F}. Imitating our re-derivation of split conformal prediction, we define the augmented quantile regression estimate g^S\hat{g}_{S} as

g^S:=argming1n+1i=1nα(g(Xi),Si)+1n+1α(g(Xn+1),S).\hat{g}_{S}:=\mathop{\rm argmin}_{g\in\mathcal{F}}\frac{1}{n+1}\sum_{i=1}^{n}\ell_{\alpha}(g(X_{i}),S_{i})+\frac{1}{n+1}\ell_{\alpha}(g(X_{n+1}),S). (2.4)

Then, we take our prediction set to be

C^(Xn+1):={y:S(Xn+1,y)g^S(Xn+1,y)(Xn+1)}.\hat{C}(X_{n+1}):=\{y:S(X_{n+1},y)\leq\hat{g}_{S(X_{n+1},y)}(X_{n+1})\}. (2.5)

Critically, we emphasize that g^S\hat{g}_{S} is fit using the same function class \mathcal{F} that appears in our coverage target. This fact will be crucial to the theoretical results that follow. To keep our notation clear under this recycling of \mathcal{F}, we will always use gg to denote quantile estimates and ff to denote re-weightings.

Before discussing the coverage properties of this method there are two technical issues that we must address. First, astute readers may have noticed that (2.5) appears to be intractable. Indeed, a naive computation of C^(Xn+1)\hat{C}(X_{n+1}) would require us to compute g^S\hat{g}_{S} for all SS\in\mathbb{R}. In Section 4, we will give an efficient algorithm for computing the prediction set that overcomes this naive approach. To ease exposition we defer the details of this method for now. The second issue that we must address is the non-uniqueness of the estimate g^S\hat{g}_{S}. In all subsequent results of this article, we will assume that g^S\hat{g}_{S} is computed using an algorithm that is invariant under re-orderings of the input data. This assumption is relevant because quantile regression can admit multiple optima; in theory, the selected optimum might systematically depend on the indices of the scores. In practice, this assumption is inconsequential because any commonly used algorithm, e.g., an interior point solver, satisfies this invariance condition.
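To fix ideas before presenting the efficient algorithm of Section 4, the deliberately naive sketch below implements (2.4)-(2.5) over a finite grid of candidate labels using the cvxpy modeling library. The function names, the grid-based membership check, and the solver defaults are our own choices; the sketch refits the quantile regression once per candidate, which is exactly the cost that Section 4 eliminates.

```python
import cvxpy as cp
import numpy as np

def pinball(resid, alpha):
    """Pinball loss ell_alpha applied to residuals S - g(X)."""
    return cp.sum(cp.maximum((1 - alpha) * resid, -alpha * resid))

def fit_augmented_qr(Phi_cal, S_cal, phi_test, s_imputed, alpha):
    """Solve (2.4): quantile regression on the calibration scores plus one imputed test score."""
    beta = cp.Variable(Phi_cal.shape[1])
    loss = pinball(S_cal - Phi_cal @ beta, alpha) + pinball(s_imputed - phi_test @ beta, alpha)
    cp.Problem(cp.Minimize(loss / (len(S_cal) + 1))).solve()
    return beta.value

def prediction_set(x_test, y_grid, score_fn, phi_fn, Phi_cal, S_cal, alpha):
    """Naive construction of (2.5): keep y whenever its own score lies below the refit threshold."""
    included = []
    for y in y_grid:
        s = score_fn(x_test, y)
        beta_hat = fit_augmented_qr(Phi_cal, S_cal, phi_fn(x_test), s, alpha)
        if s <= phi_fn(x_test) @ beta_hat:
            included.append(y)
    return included
```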

With these issues out of the way, we are now ready to state the main result of this section, Theorem 2, which summarizes the coverage properties of (2.5). When interpreting this theorem it may be useful to recall that for non-negative ff, f()\mathbb{P}_{f}(\cdot) denotes the setting in which (X1,Y1),,(Xn,Yn)i.i.d.P(X_{1},Y_{1}),\dots,(X_{n},Y_{n})\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P, while (Xn+1,Yn+1)(X_{n+1},Y_{n+1}) is sampled independently from Xn+1f(x)𝔼P[f(X)]dPX(x)X_{n+1}\sim\frac{f(x)}{\mathbb{E}_{P}[f(X)]}dP_{X}(x) and Yn+1Xn+1PYXY_{n+1}\mid X_{n+1}\sim P_{Y\mid X}.

Theorem 2.

Let ={Φ()β:βd}\mathcal{F}=\{\Phi(\cdot)^{\top}\beta:\beta\in\mathbb{R}^{d}\} denote the class of linear functions over the basis Φ:𝒳d\Phi:\mathcal{X}\to\mathbb{R}^{d}. Then, for any non-negative ff\in\mathcal{F} with 𝔼P[f(X)]>0\mathbb{E}_{P}[f(X)]>0, the prediction set given by (2.5) satisfies

f(Yn+1C^(Xn+1))1α.\mathbb{P}_{f}(Y_{n+1}\in\hat{C}(X_{n+1}))\geq 1-\alpha. (2.6)

On the other hand, if (X1,Y1),,(Xn+1,Yn+1)i.i.d.P(X_{1},Y_{1}),\dots,(X_{n+1},Y_{n+1})\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P and the distribution of SXS\mid X is continuous, then for all ff\in\mathcal{F}, we additionally have the two-sided bound,

|𝔼[f(Xn+1)(𝟙{Yn+1C^(Xn+1)}(1α))]|dn+1𝔼[max1in+1|f(Xi)|].\left|\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha))]\right|\leq\frac{d}{n+1}\mathbb{E}\left[\max_{1\leq i\leq n+1}|f(X_{i})|\right].

This type of two-part result is typical in conformal inference. Namely, while the assumption that the distribution of SXS\mid X is continuous may seem overly restrictive, it is standard in conformal inference that upper bounds require a mild continuity assumption, while lower bounds are fully distribution-free. For example, the canonical coverage guarantee for split conformal described in Theorem 1 also gives separate upper and lower bounds for continuous and discrete data. Notably, in the case of split conformal inference this two-part result can be avoided and replaced by an exact 1α1-\alpha coverage guarantee by randomizing the prediction set. We will show in Section 4 that an analogous result also holds for our method: without any assumptions on the continuity of SXS\mid X, we show that randomizing C^(Xn+1)\hat{C}(X_{n+1}) yields 𝔼[f(Xn+1)(𝟙{Yn+1C^(Xn+1)}(1α))]=0\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha))]=0 for all ff\in\mathcal{F}. Because the randomization scheme we employ leverages the algorithms developed in Section 4, we defer a precise statement of this result for now.

Our next result, Corollary 1, relates the more abstract guarantee of Theorem 2 to the group-conditional coverage example previewed in the introduction.

Corollary 1.

Suppose {(Xi,Yi)}i=1n+1\{(X_{i},Y_{i})\}_{i=1}^{n+1} are independent and identically distributed and the prediction set given by (2.5) is implemented with ={xG𝒢βG𝟙{xG}:βG,G𝒢}\mathcal{F}=\{x\mapsto\sum_{G\in\mathcal{G}}\beta_{G}\mathbbm{1}\{x\in G\}:\beta_{G}\in\mathbb{R},\ \forall G\in\mathcal{G}\} for some finite collection of groups 𝒢2𝒳\mathcal{G}\subseteq 2^{\mathcal{X}}. Then, for any G𝒢G\in\mathcal{G},

(Yn+1C^(Xn+1)Xn+1G)1α.\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G)\geq 1-\alpha.

If the distribution of SXS\mid X is continuous, then we have the matching upper bound,

(Yn+1C^(Xn+1)Xn+1G)1α+|𝒢|(n+1)(Xn+1G).\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G)\leq 1-\alpha+\frac{|\mathcal{G}|}{(n+1)\cdot\mathbb{P}(X_{n+1}\in G)}.

The methods described above only estimate the upper (1α)(1-\alpha)-quantile of the conformity score. If desired, our procedure can also be generalized to give both lower and upper bounds on S(Xn+1,Yn+1)S(X_{n+1},Y_{n+1}). In particular, letting g^Sτ()\hat{g}_{S}^{\tau}(\cdot) denote our estimate of the τ\tau-th quantile, we can define the two-sided prediction set

C^two-sid.(Xn+1):={y:g^Sn+1(Xn+1,y)α/2(Xn+1)Sn+1(Xn+1,y)g^Sn+1(Xn+1,y)1α/2(Xn+1)}.\hat{C}_{\text{two-sid.}}(X_{n+1}):=\{y:\hat{g}_{S_{n+1}(X_{n+1},y)}^{\alpha/2}(X_{n+1})\leq S_{n+1}(X_{n+1},y)\leq\hat{g}_{S_{n+1}(X_{n+1},y)}^{1-\alpha/2}(X_{n+1})\}. (2.7)

As an example of this, Figures 2, 4 and 5 show results from an implementation of our method in which we fit the lower and upper quantiles of yμ^(x)y-\hat{\mu}(x) separately. Because these two-sided prediction sets have coverage properties identical to their one-sided analogues, we will, for simplicity, focus on the one-sided version in the remainder of this article. Readers interested in the two-sided instantiation of our approach should see Section A.7 for additional information about the implementation and formal coverage guarantees of these methods.

We conclude this section by giving a brief proof sketch of Theorem 2, leaving formal details to the Appendix. The main idea is to examine the first order conditions of the quantile regression (2.4) and then exploit the fact that this regression treats the test point identically to the calibration data. This connection between the derivative of the pinball loss and coverage was first made by Jung et al. (2023).

Proof sketch of Theorem 2.

We examine the first order conditions of (2.4). By a direct computation we have that for any ff\in\mathcal{F},

ddϵα(g^S(Xi)+ϵf(Xi),Si)|ϵ=0={(1α)f(Xi), if Si>g^S(Xi),αf(Xi), if Si<g^S(Xi),undefined, if Si=g^S(Xi).\frac{d}{d\epsilon}\ell_{\alpha}(\hat{g}_{S}(X_{i})+\epsilon f(X_{i}),S_{i})\bigg{|}_{\epsilon=0}=\begin{cases}-(1-\alpha)f(X_{i}),\text{ if }S_{i}>\hat{g}_{S}(X_{i}),\\ \alpha f(X_{i}),\text{ if }S_{i}<\hat{g}_{S}(X_{i}),\\ \text{undefined, if }S_{i}=\hat{g}_{S}(X_{i}).\end{cases}

For simplicity, suppose that for all ii, Sig^S(Xi)S_{i}\neq\hat{g}_{S}(X_{i}). This assumption does not hold in general and by adding it here we will obtain the stronger result 𝔼[f(Xn+1)(𝟙{Yn+1C^(Xn+1)}(1α))]=0\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha))]=0. In the full proof of Theorem 2, we remove this simplification and incur an additional error term.

For now, making this assumption gives the first order condition

1n+1i=1n+1αf(Xi)𝟙{Sig^Sn+1(Xi)}(1α)f(Xi)𝟙{Si>g^Sn+1(Xi)}=0\displaystyle\frac{1}{n+1}\sum_{i=1}^{n+1}\alpha f(X_{i})\mathbbm{1}\{S_{i}\leq\hat{g}_{S_{n+1}}(X_{i})\}-(1-\alpha)f(X_{i})\mathbbm{1}\{S_{i}>\hat{g}_{S_{n+1}}(X_{i})\}=0
1n+1i=1n+1f(Xi)(𝟙{Sig^Sn+1(Xi)}(1α))=0.\displaystyle\iff\frac{1}{n+1}\sum_{i=1}^{n+1}f(X_{i})(\mathbbm{1}\{S_{i}\leq\hat{g}_{S_{n+1}}(X_{i})\}-(1-\alpha))=0.

Taking expectations, we arrive at our coverage guarantee

𝔼[f(Xn+1)(𝟙{Yn+1C^(Xn+1)}(1α))]\displaystyle\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha))] =𝔼[f(Xn+1)(𝟙{Sn+1g^Sn+1(Xn+1)}(1α))]\displaystyle=\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{S_{n+1}\leq\hat{g}_{S_{n+1}}(X_{n+1})\}-(1-\alpha))]
=𝔼[1n+1i=1n+1f(Xi)(𝟙{Sig^Sn+1(Xi)}(1α))]\displaystyle=\mathbb{E}[\frac{1}{n+1}\sum_{i=1}^{n+1}f(X_{i})(\mathbbm{1}\{S_{i}\leq\hat{g}_{S_{n+1}}(X_{i})\}-(1-\alpha))]
=0,\displaystyle=0,

where the first equality uses the definition of C^(Xn+1)\hat{C}(X_{n+1}) and the second equality applies the fact that the triples (X1,S1,g^Sn+1(X1)),,(Xn+1,Sn+1,g^Sn+1(Xn+1))(X_{1},S_{1},\hat{g}_{S_{n+1}}(X_{1})),\dots,(X_{n+1},S_{n+1},\hat{g}_{S_{n+1}}(X_{n+1})) are exchangeable.

2.3 Related work

The method proposed above can be briefly summarized as a modified quantile regression procedure in which the new test point is incorporated into the fit. Given the popularity of vanilla quantile regression, one might reasonably ask how this compares to the more standard approach in which one fits

g^qr:=argming1ni=1nα(g(Xi),Si),\hat{g}_{\text{qr}}:=\mathop{\rm argmin}_{g\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\ell_{\alpha}(g(X_{i}),S_{i}),

on the training data and then forms the prediction set,

C^qr(Xn+1):={y:S(Xn+1,y)g^qr(Xn+1)}.\hat{C}_{\text{qr}}(X_{n+1}):=\{y:S(X_{n+1},y)\leq\hat{g}_{\text{qr}}(X_{n+1})\}.
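Written in the same style as the sketch of Section 2.2, this unaugmented baseline looks as follows (again our own illustration rather than the implementation of Jung et al. (2023)); note that a single fitted threshold is reused for every candidate label, so no refitting is needed.

```python
import cvxpy as cp

def fit_vanilla_qr(Phi_cal, S_cal, alpha):
    """Plain quantile regression on the calibration scores only; no imputed test point."""
    beta = cp.Variable(Phi_cal.shape[1])
    resid = S_cal - Phi_cal @ beta
    loss = cp.sum(cp.maximum((1 - alpha) * resid, -alpha * resid)) / len(S_cal)
    cp.Problem(cp.Minimize(loss)).solve()
    return beta.value

def qr_prediction_set(x_test, y_grid, score_fn, phi_fn, beta_hat):
    """C_qr: a single fitted threshold phi(x)^T beta shared by every candidate label."""
    threshold = phi_fn(x_test) @ beta_hat
    return [y for y in y_grid if score_fn(x_test, y) <= threshold]
```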

Jung et al. (2023) analyze this approach in the case where :={i=1dβi𝟙{xGi}:βd}\mathcal{F}:=\{\sum_{i=1}^{d}\beta_{i}\mathbbm{1}\{x\in G_{i}\}:\beta\in\mathbb{R}^{d}\} is the space of linear combinations of subgroup indicator functions. Under appropriate assumptions on the distribution of (Xi,S(Xi,Yi))(X_{i},S(X_{i},Y_{i})), they show that for all groups 1jd1\leq j\leq d and constants δ>0\delta>0, this prediction set satisfies the PAC coverage guarantee,

(|(Yn+1C^qr(Xn+1)Xn+1Gj,{(Xi,Yi)}i=1n)(1α)|O((log(1/δ)+dlog(n)n(Xn+1Gj)2)1/4))1δ.\mathbb{P}\left(\left|\mathbb{P}\left(Y_{n+1}\in\hat{C}_{\text{qr}}(X_{n+1})\mid X_{n+1}\in G_{j},\{(X_{i},Y_{i})\}_{i=1}^{n}\right)-(1-\alpha)\right|\leq O\left(\left(\frac{\log(1/\delta)+d\log(n)}{n\mathbb{P}(X_{n+1}\in G_{j})^{2}}\right)^{1/4}\right)\right)\geq 1-\delta.

On the other hand, in the same setting, Corollary 1 states the following guarantee for our method,

1α(Yn+1C^(Xn+1)Xn+1Gj)1α+d(n+1)(Xn+1Gj).1-\alpha\leq\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G_{j})\leq 1-\alpha+\frac{d}{(n+1)\mathbb{P}(X_{n+1}\in G_{j})}.

The proofs of both of these results are based on the first order conditions of quantile regression, which, as we showed above, can be exploited to guarantee conditional coverage. That said, the approaches differ substantially and the final results are not directly comparable. At a high level, we find that the former result targets a stronger notion of coverage, ensuring concentration conditional on the calibration data, while the latter result has a much faster convergence rate (d/nd/n versus (dlog(n)/n)1/4(d\log(n)/n)^{1/4}).

While these methods are difficult to compare theoretically, a much clearer picture emerges on simulated data. In particular, the results in Figure 3 show that our method is far more robust to small sample sizes or large dimensions. Although we do not provide a PAC guarantee, we find that the coverage of our procedure concentrates tightly at the target level across a wide range of values for d/nd/n. On the other hand, the vanilla quantile regression approach taken in Jung et al. (2023) shows notable undercoverage at moderate dimensions, e.g., d/n{0.05,0.1}d/n\in\{0.05,0.1\}.

3 Extension to infinite dimensional classes

Turning back to our method, we now consider settings in which we do not have a small, finite-dimensional function class of interest. In particular, if we view the coverage target (2.3) as an interpolation between marginal and conditional coverage, then it is natural to ask what guarantees can be provided when \mathcal{F} is a rich, and potentially even infinite dimensional, function class. We know from previous work that exact coverage over an arbitrary infinite dimensional class is impossible (Vovk (2012), Barber et al. (2020)). Thus, just as we relaxed the definition of conditional coverage above, here we will construct prediction sets that satisfy a relaxed version of (2.3).

First, note that we cannot directly implement our method over an infinite dimensional class. Indeed, running quantile regression in dimension dn+1d\geq n+1 will simply interpolate the input data. In our context, this means that every value SS\in\mathbb{R} will satisfy S=g^S(Xn+1)S=\hat{g}_{S}(X_{n+1}) and our method will always output C^(Xn+1)=\hat{C}(X_{n+1})=\mathbb{R}. To circumvent this issue and obtain informative prediction sets, we must add regularization. This leads us to the definition

g^S:=argming1n+1i=1nα(g(Xi),Si)+1n+1α(g(Xn+1),S)+(g),\hat{g}_{S}:=\mathop{\rm argmin}_{g\in\mathcal{F}}\frac{1}{n+1}\sum_{i=1}^{n}\ell_{\alpha}(g(X_{i}),S_{i})+\frac{1}{n+1}\ell_{\alpha}(g(X_{n+1}),S)+\mathcal{R}(g), (3.1)

for some appropriately chosen penalty ()\mathcal{R}(\cdot). Having made this adjustment, we may now proceed identically to the previous section. Namely, we set

C^(Xn+1):={y:S(Xn+1,y)g^S(Xn+1,y)(Xn+1)},\hat{C}(X_{n+1}):=\{y:S(X_{n+1},y)\leq\hat{g}_{S(X_{n+1},y)}(X_{n+1})\}, (3.2)

and by examining the first order conditions of (3.1), we obtain the following generalization of Theorem 2.

Theorem 3.

Let \mathcal{F} be any vector space, and assume that for all f,gf,g\in\mathcal{F}, the derivative of ϵ(g+ϵf)\epsilon\mapsto\mathcal{R}(g+\epsilon f) exists. If ff is non-negative with 𝔼P[f(X)]>0\mathbb{E}_{P}[f(X)]>0, then the prediction set given by (3.2) satisfies the lower bound

f(Yn+1C^(Xn+1))1α1𝔼P[f(X)]𝔼[ddϵ(g^Sn+1+ϵf)|ϵ=0].\mathbb{P}_{f}(Y_{n+1}\in\hat{C}(X_{n+1}))\geq 1-\alpha-\frac{1}{\mathbb{E}_{P}[f(X)]}\mathbb{E}\left[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}\right].

On the other hand, suppose (X1,Y1),,(Xn+1,Yn+1)i.i.d.P(X_{1},Y_{1}),\dots,(X_{n+1},Y_{n+1})\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P. Then, for all ff\in\mathcal{F}, we additionally have the two-sided bound,

𝔼[f(Xn+1)(𝟙{Yn+1C^(Xn+1)}(1α))]=𝔼[ddϵ(g^Sn+1+ϵf)|ϵ=0]+ϵint,\begin{split}\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha))]=-\mathbb{E}\left[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}\right]+\epsilon_{\textup{int}},\end{split} (3.3)

where ϵint\epsilon_{\textup{int}} is an interpolation error term satisfying |ϵint|𝔼[|f(Xi)|𝟙{Si=g^Sn+1(Xi)}]|\epsilon_{\textup{int}}|\leq\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}].

Similar to our results in the previous section, the interpolation term ϵint\epsilon_{\text{int}} can be removed if we allow the prediction set to be randomized (see Section 4 for a precise statement).

To more accurately interpret Theorem 3, we will need to develop additional understanding of the two quantities appearing on the right-hand side of (3.3). The following two sections are devoted to this task for two different choices of \mathcal{F}. At a high level, the results in these sections will show that the interpolation error ϵint\epsilon_{\text{int}} is of negligible size and thus the coverage properties of our method are primarily governed by the derivative term 𝔼[ddϵ(g^Sn+1+ϵf)|ϵ=0]-\mathbb{E}[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)|_{\epsilon=0}]. Informally, we interpret this derivative as providing a quantitative estimate of the difficulty of achieving conditional coverage in the direction ff. More practically, we will see that the derivative can be used to obtain accurate estimates of the coverage properties of our method. Critically, these estimates adapt to both the specific choice of tilt ff and the distribution of (X,Y)(X,Y). Thus, they determine the difficulty of obtaining conditional coverage in a way that is specific to the dataset at hand.

3.1 Specialization to functions in a reproducing kernel Hilbert space

Our first specialization of Theorem 3 is to the case where \mathcal{F} is constructed using functions from a reproducing kernel Hilbert space (RKHS). More precisely, let K:𝒳×𝒳K:\mathcal{X}\times\mathcal{X}\to\mathbb{R} be a positive definite kernel and K\mathcal{F}_{K} denote the associated RKHS with inner product ,K\langle\cdot,\cdot\rangle_{K} and norm K\|\cdot\|_{K}. Let Φ:𝒳d\Phi:\mathcal{X}\to\mathbb{R}^{d} denote any finite dimensional feature representation of 𝒳\mathcal{X}. Then, we consider implementing our method with function class ={fK()+Φ()β:fKK,βd}\mathcal{F}=\{f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta:f_{K}\in\mathcal{F}_{K},\ \beta\in\mathbb{R}^{d}\} and penalty (fK()+Φ()β)=λfKK2\mathcal{R}(f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta)=\lambda\|f_{K}\|_{K}^{2}. Here, λ>0\lambda>0 is a hyperparameter that controls the flexibility of the fit. For now, we take this hyperparameter to be fixed, although later in our practical experiments in Section 5.1, we will choose it by cross-validation.

Some examples of RKHSes that may be of interest include the space of radial basis functions given by K(x,y)=exp(γxy22)K(x,y)=\exp(-\gamma\|x-y\|_{2}^{2}), which allows us to give coverage guarantees over localizations of the covariates, and the polynomial kernel K(x,y)=(xy+c)mK(x,y)=(x^{\top}y+c)^{m} for mm\in\mathbb{N}, c0c\geq 0, which allows us to investigate coverage over smooth polynomial re-weightings. Additional examples and background material on reproducing kernel Hilbert spaces can be found in Paulsen and Raghupathi (2016).
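To give a sense of what the regularized fit looks like in this setting, the sketch below solves the n-sample analogue of (3.1) (the pair (ĝ_{n,K}, β̂_{n}) appearing in Proposition 2 below) for a Gaussian kernel, using the representer parameterization f_K(·)=Σ_j a_j K(·,X_j), under which ‖f_K‖_K² = aᵀKa; the augmented fit simply adds one more pinball term for the imputed test score. The cvxpy-based implementation and the function name fit_kernel_qr are our own.

```python
import cvxpy as cp
import numpy as np

def fit_kernel_qr(X, S, Phi, alpha, lam, gamma):
    """Regularized quantile regression over {f_K + Phi(.)^T beta} with penalty lam * ||f_K||_K^2."""
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)                     # Gaussian kernel Gram matrix
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))      # small jitter for numerical stability
    a, beta = cp.Variable(n), cp.Variable(Phi.shape[1])
    resid = S - (K @ a + Phi @ beta)                  # S_i - g(X_i)
    pinball = cp.sum(cp.maximum((1 - alpha) * resid, -alpha * resid)) / n
    # ||L^T a||^2 approximates a^T K a, i.e. the squared RKHS norm of f_K.
    cp.Problem(cp.Minimize(pinball + lam * cp.sum_squares(L.T @ a))).solve()
    return a.value, beta.value, K
```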

To obtain a coverage guarantee for \mathcal{F}, we must understand the two terms appearing on the right-hand side of (3.3). Let f()=fK()+Φ()βf(\cdot)=f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta denote the re-weighting of interest and g^Sn+1()=g^Sn+1,K()+Φ()β^Sn+1\hat{g}_{S_{n+1}}(\cdot)=\hat{g}_{S_{n+1},K}(\cdot)+\Phi(\cdot)^{\top}\hat{\beta}_{S_{n+1}} denote the fitted quantile estimate. Then, a short calculation shows that 𝔼[ddϵ(g^Sn+1+ϵf)|ϵ=0]=2λ𝔼[g^Sn+1,K,fKK]\mathbb{E}[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)|_{\epsilon=0}]=2\lambda\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]. So, applying Theorem 3, we find that for all non-negative ff\in\mathcal{F} with 𝔼P[f(X)]>0\mathbb{E}_{P}[f(X)]>0,

f(Yn+1C^(Xn+1))1α2λ𝔼[g^Sn+1,K,fKK]𝔼P[f(X)],and f(Yn+1C^(Xn+1))1α2λ𝔼[g^Sn+1,K,fKK]𝔼P[f(X)]+ϵint𝔼P[f(X)].\begin{split}&\mathbb{P}_{f}(Y_{n+1}\in\hat{C}(X_{n+1}))\geq 1-\alpha-2\lambda\frac{\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]}{\mathbb{E}_{P}[f(X)]},\\ \text{and }&\mathbb{P}_{f}(Y_{n+1}\in\hat{C}(X_{n+1}))\leq 1-\alpha-2\lambda\frac{\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]}{\mathbb{E}_{P}[f(X)]}+\frac{\epsilon_{\text{int}}}{\mathbb{E}_{P}[f(X)]}.\end{split} (3.4)

Controlling the interpolation error is more challenging and is done by the following proposition.

Proposition 1.

Assume that (X1,Y1),,(Xn+1,Yn+1)i.i.d.P(X_{1},Y_{1}),\dots,(X_{n+1},Y_{n+1})\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P and that KK is uniformly bounded. Furthermore, suppose (Φ(X),S)(\Phi(X),S) has uniformly upper and lower bounded first three moments (Assumption 1 in the Appendix) and that the distribution of SXS\mid X is continuous with a uniformly bounded density. Then, for any ff\in\mathcal{F},

|ϵint|𝔼P[|f(X)|]O(dlog(n)λn)𝔼[max1in+1|f(Xi)|]𝔼P[|f(X)|].\frac{|\epsilon_{\textup{int}}|}{\mathbb{E}_{P}[|f(X)|]}\leq O\left(\frac{d\log(n)}{\lambda n}\right)\frac{\mathbb{E}\left[\max_{1\leq i\leq n+1}|f(X_{i})|\right]}{\mathbb{E}_{P}[|f(X)|]}.

Critically, the interpolation error term decays to zero at a faster-than-parametric rate. As a result, for even moderately large nn, we expect this term to be of small size. In support of this intuition, we will show an experiment in Section 5.1 in which with sample size n=650n=650, linear dimension d=5d=5, and non-linear hyperparameter λ1\lambda\cong 1, we observe interpolation error ϵint𝔼P[|f(X)|]0.005\frac{\epsilon_{\textup{int}}}{\mathbb{E}_{P}[|f(X)|]}\cong 0.005. Thus, at appropriate sample sizes, the interpolation error has minimal effect on the coverage.

We remark that achieving this faster-than-parametric rate requires technical insights beyond existing tools. Two standard ways to establish interpolation bounds are to exploit either the finite-dimensional character of the model class \mathcal{F} or the algorithmic stability of the fitting procedure. Unfortunately, quantile regression with both kernel and linear components satisfies neither property. To get around this problem, we give a three-part argument in which we first separate out the linear component of the fit by discretizing over β\beta. Then, with β\beta fixed, we are able to exploit known stability results to control the kernel component of the fit (Bousquet and Elisseeff (2002)). Finally, we combine the two previous steps by giving a smoothing argument that shows that the discretization can be extended to the entire function class. This result may be useful in other applications. For instance, the interpolation error determines the derivative of the loss at the empirical minimizer and, thus, may play a key role in central limit theorems for quantile regressors of this type.

Moving away from these technical issues and returning to our coverage guarantee, (3.4), we find that once the interpolation error is removed, the conditional coverage is completely dictated by the inner product between fKf_{K} and g^Sn+1,K\hat{g}_{S_{n+1},K}. Critically, this implies that in the special case where the target re-weighting ff lies completely in the unpenalized part of the function class (i.e. when fK=0f_{K}=0), we have 𝔼[g^Sn+1,K,fKK]=0\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]=0 and thus C^(Xn+1)\hat{C}(X_{n+1}) obtains (nearly) exact coverage under ff. On the other hand, our next proposition shows that when fK0f_{K}\neq 0, we can use a plug-in estimate to accurately estimate 𝔼[g^Sn+1,K,fKK]\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]. Thus, even when exact coverage is impossible, a simple examination of the quantile regression fit is sufficient to determine the degradation in coverage under any re-weighting of interest.

Proposition 2.

Assume that (X1,S1),,(Xn+1,Sn+1)i.i.d.P(X_{1},S_{1}),\dots,(X_{n+1},S_{n+1})\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P and that KK is uniformly bounded. Suppose further that the population loss is locally strongly convex near its minimizer (Assumption 2 in the Appendix) and (Φ(Xi),Si)(\Phi(X_{i}),S_{i}) has uniformly bounded upper and lower first and second order moments (Assumption 3 in the Appendix). Define the nn-sample quantile regression estimate

(g^n,K,β^n):=argmingKK,βd1ni=1nα(gK(Xi)+Φ(Xi)β,Si)+λgKK2,(\hat{g}_{n,K},\hat{\beta}_{n}):=\mathop{\rm argmin}_{g_{K}\in\mathcal{F}_{K},\ \beta\in\mathbb{R}^{d}}\frac{1}{n}\sum_{i=1}^{n}\ell_{\alpha}(g_{K}(X_{i})+\Phi(X_{i})^{\top}\beta,S_{i})+\lambda\|g_{K}\|^{2}_{K},

and for any δ>0\delta>0, let δ:={f()=fK()+Φ()β:fK+β21,𝔼P[|f(X)|]δ}\mathcal{F}_{\delta}:=\{f(\cdot)=f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta\in\mathcal{F}:\|f\|_{K}+\|\beta\|_{2}\leq 1,\ \mathbb{E}_{P}[|f(X)|]\geq\delta\}. Then,

supfδ|2λg^n,K,fKK1ni=1n|f(Xi)|2λ𝔼[g^Sn+1,K,fKK]𝔼P[|f(X)|]|O(dlog(n)n).\sup_{f\in\mathcal{F}_{\delta}}\left|2\lambda\frac{\langle\hat{g}_{n,K},f_{K}\rangle_{K}}{\frac{1}{n}\sum_{i=1}^{n}|f(X_{i})|}-2\lambda\frac{\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]}{\mathbb{E}_{P}[|f(X)|]}\right|\leq O\left(\sqrt{\frac{d\log(n)}{n}}\right). (3.5)
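Under the same representer parameterization used in the sketch earlier in this section, the plug-in quantity on the left-hand side of (3.5) reduces to a few lines of code. Here we additionally assume, purely for illustration, that the kernel component of the tilt is represented on the calibration points as f_K(·)=Σ_j c_j K(·,X_j), so that ⟨ĝ_{n,K},f_K⟩_K = aᵀKc; the function name is ours.

```python
import numpy as np

def coverage_error_estimate(a, beta_f, c, K, Phi, lam):
    """Plug-in estimate 2*lam*<g_hat_{n,K}, f_K>_K / ((1/n) sum_i |f(X_i)|) from (3.5)."""
    inner = a @ K @ c                 # <g_hat_{n,K}, f_K>_K under the representer parameterization
    f_vals = K @ c + Phi @ beta_f     # f = f_K + Phi(.)^T beta_f evaluated at the calibration points
    return 2 * lam * inner / np.mean(np.abs(f_vals))
```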

3.2 Specialization to the class of Lipschitz functions

As a second specialization of Theorem 3, we now aim to provide valid coverage over all sufficiently smooth re-weightings of the data. We will do this by examining the set of all Lipschitz functions on 𝒳\mathcal{X}. Namely, suppose 𝒳p\mathcal{X}\subseteq\mathbb{R}^{p} and define the Lipschitz norm of functions f:𝒳f:\mathcal{X}\to\mathbb{R} as

Lip(f):=supx,y𝒳,xy|f(x)f(y)|xy2.\text{Lip}(f):=\sup_{x,y\in\mathcal{X},\ x\neq y}\frac{|f(x)-f(y)|}{\|x-y\|_{2}}.

Analogous to the previous section, let L:={f:Lip(f)<}\mathcal{F}_{L}:=\{f:\text{Lip}(f)<\infty\}, Φ:𝒳d\Phi:\mathcal{X}\to\mathbb{R}^{d} be any finite dimensional feature representation of 𝒳\mathcal{X}, and consider implementing our method with the function class ={fL()+Φ()β:fLL,βd}\mathcal{F}=\{f_{L}(\cdot)+\Phi(\cdot)^{\top}\beta:f_{L}\in\mathcal{F}_{L},\ \beta\in\mathbb{R}^{d}\} and penalty (fL()+Φ()β)=λLip(fL)\mathcal{R}(f_{L}(\cdot)+\Phi(\cdot)^{\top}\beta)=\lambda\text{Lip}(f_{L}).

The astute reader may notice that the Lipschitz norm is not differentiable and thus Theorem 3 is not directly applicable to this setting. Nevertheless, it is not difficult to show that Theorem 3 can be extended by replacing ddϵ(g^Sn+1+ϵf)\frac{d}{d\epsilon}\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f) with a subgradient. So, after observing that |ϵ(g^Sn+1+ϵf)|ϵ=0|λLip(f)|\partial_{\epsilon}\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)|_{\epsilon=0}|\leq\lambda\text{Lip}(f), we can apply this analogue of Theorem 3 to find that for any non-negative ff\in\mathcal{F} with 𝔼P[f(X)]>0\mathbb{E}_{P}[f(X)]>0,

1αλLip(f)𝔼P[f(X)]f(Yn+1C^(Xn+1))1α+λLip(f)𝔼P[f(X)]+ϵint𝔼P[f(X)].1-\alpha-\lambda\frac{\text{Lip}(f)}{\mathbb{E}_{P}[f(X)]}\leq\mathbb{P}_{f}(Y_{n+1}\in\hat{C}(X_{n+1}))\leq 1-\alpha+\lambda\frac{\text{Lip}(f)}{\mathbb{E}_{P}[f(X)]}+\frac{\epsilon_{\text{int}}}{\mathbb{E}_{P}[f(X)]}.

Control of the interpolation error is handled in the following proposition.

Proposition 3.

Assume that (X1,Y1),,(Xn+1,Yn+1)p×(X_{1},Y_{1}),\dots,(X_{n+1},Y_{n+1})\in\mathbb{R}^{p}\times\mathbb{R} are i.i.d. and that XX, Φ(X)\Phi(X), and SS have bounded domains and uniformly upper and lower bounded first and second moments (Assumption 4 in the Appendix). Furthermore, assume that the distribution of SXS\mid X is continuous with a uniformly bounded density and that Φ()\Phi(\cdot) contains an intercept term. Then for any ff\in\mathcal{F},

|ϵint|𝔼P[|f(X)|]O((plog(n)λnmin{1/2,1/p})12+(dpλ2n)14).\frac{|\epsilon_{\textup{int}}|}{\mathbb{E}_{P}[|f(X)|]}\leq O\left(\left(\frac{p\log(n)}{\lambda n^{\min\{1/2,1/p\}}}\right)^{\frac{1}{2}}+\left(\frac{dp}{\lambda^{2}n}\right)^{\frac{1}{4}}\right).

This result is considerably weaker than our RKHS bound. While our proof for RKHS function classes made careful use of the stability of RKHS fitting (Bousquet and Elisseeff, 2002), here we take a more brute-force approach and directly examine the uniform concentration properties of the number of interpolated points, 1n+1i=1n+1𝟙{Si=gL(Xi)+Φ(Xi)β}\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=g_{L}(X_{i})+\Phi(X_{i})^{\top}\beta\}. We defer a detailed description of this approach to the Appendix. For now, we simply remark that we do not believe this proof technique yields a tight bound and it is possible that significant improvements could be made to Proposition 3 with more careful arguments.

Regardless of the tightness of the bound, we still find that when XX is low-dimensional, the interpolation error will be small and the miscoverage of C^(Xn+1)\hat{C}(X_{n+1}) under ff will be primarily driven by its Lipschitz norm. In light of the impossibility of exact conditional coverage, this result gives a natural interpolation between marginal coverage (in which Lip(f)=Lip(1)=0\text{Lip}(f)=\text{Lip}(1)=0) and conditional coverage (in which Lip(f)\text{Lip}(f) can be arbitrarily large). On the other hand, in moderate to high dimensions the interpolation error term will not be negligible and the coverage can be highly conservative. For this reason, in our real data examples, we will prefer to use RKHS functions for which we have much faster convergence rates.

4 Computing the prediction set

In order to practically implement any of the methods discussed above, we need to be able to efficiently compute C^(Xn+1)={y:S(Xn+1,y)g^S(Xn+1,y)(Xn+1)}\hat{C}(X_{n+1})=\{y:S(X_{n+1},y)\leq\hat{g}_{S(X_{n+1},y)}(X_{n+1})\}. Naively, this recursive definition requires us to fit g^S\hat{g}_{S} for all possible values of SS\in\mathbb{R}. We will now show that by exploiting the monotonicity properties of quantile regression, this naive computation can be overcome and a valid prediction set can be computed efficiently using only a small number of fits.

The main subtlety that we will have to contend with is that g^S\hat{g}_{S} may not be uniquely defined. For example, consider computing the median of the dataset {S1,S2,S3,S4}={1,2,3,4}\{S_{1},S_{2},S_{3},S_{4}\}=\{1,2,3,4\}. It is easy to show that any value in the interval [2,3][2,3] is a valid solution to the median quantile regression minimizeθi=141/2(θ,i)\text{minimize}_{\theta}\sum_{i=1}^{4}\ell_{1/2}(\theta,i). Critically, this means that it is ambiguous whether or not 33 lies below or above the median. More generally, in our context, it can be ambiguous whether or not Sg^SS\leq\hat{g}_{S}. In the earlier sections of this article we have elided such non-uniqueness in the definition of g^S\hat{g}_{S}. We do this because the choice of g^S\hat{g}_{S} is not critical to the theory and, in particular, all sensible definitions will give the same coverage properties (recall that all of our theoretical results go through so long as g^S\hat{g}_{S} is computed using an algorithm that is invariant under re-orderings of the input data). However, while not theoretically relevant, this ambiguity can cause practical issues to arise in the computation.
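
To make this concrete, the following minimal Python sketch (an illustration only; the function and variable names are not part of our method) checks numerically that θ = 2, 2.5, and 3 all attain the same pinball loss on this toy dataset:

import numpy as np

# Pinball (quantile) loss at level tau, summed over the toy scores {1, 2, 3, 4}.
def pinball_loss(theta, scores, tau=0.5):
    r = scores - theta
    return np.sum(np.maximum(tau * r, (tau - 1) * r))

scores = np.array([1.0, 2.0, 3.0, 4.0])
print([pinball_loss(t, scores) for t in (2.0, 2.5, 3.0)])  # identical losses: the median is not unique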

The main insight of this section is that these technical issues can be resolved by re-defining C^(Xn+1)\hat{C}(X_{n+1}) in terms of the dual formulation of the quantile regression. This will give us a new prediction set, C^dual(Xn+1)\hat{C}_{\text{dual}}(X_{n+1}), that can be computed efficiently and satisfies the same coverage guarantees as C^(Xn+1)\hat{C}(X_{n+1}). At a high level, C^dual(Xn+1)\hat{C}_{\text{dual}}(X_{n+1}) is obtained from C^(Xn+1)\hat{C}(X_{n+1}) by removing a small portion of the points yy that lie on the interpolation boundary {y:S(Xn+1,y)=g^S(Xn+1,y)(Xn+1)}\{y:S(X_{n+1},y)=\hat{g}_{S(X_{n+1},y)}(X_{n+1})\}. Thus, one should simply think of C^dual(Xn+1)\hat{C}_{\text{dual}}(X_{n+1}) as a trimming of the original prediction set that removes some extraneous edge cases.

To define our dual optimization more formally, recall that throughout this article we have considered quantile regressions of the form,

minimizeg1n+1i=1nα(g(Xi),Si)+1n+1α(g(Xn+1),S)+(g).\underset{{g\in\mathcal{F}}}{\text{minimize}}\ \frac{1}{n+1}\sum_{i=1}^{n}\ell_{\alpha}(g(X_{i}),S_{i})+\frac{1}{n+1}\ell_{\alpha}(g(X_{n+1}),S)+\mathcal{R}(g).

Instead of directly computing the dual of this program, we first re-formulate this optimization into the identical procedure,

minimizep,qn+1,gi=1n+1(1α)pi+αqi+(n+1)(g),subject toSig(Xi)pi+qi=0,Sg(Xn+1)pn+1+qn+1=0,pi,qi0.\begin{split}\underset{p,q\in\mathbb{R}^{n+1},\ g\in\mathcal{F}}{\text{minimize}}\quad&\sum_{i=1}^{n+1}(1-\alpha)p_{i}+\alpha q_{i}+(n+1)\cdot\mathcal{R}(g),\\ \text{subject to}\quad&S_{i}-g(X_{i})-p_{i}+q_{i}=0,\\ &S-g(X_{n+1})-p_{n+1}+q_{n+1}=0,\\ &p_{i},q_{i}\geq 0.\end{split} (4.1)

Then, after some standard calculations, this yields the desired dual formulation,

maximizeηn+1i=1nηiSi+ηn+1S(η)subject toαηi1α,\begin{split}\underset{\eta\in\mathbb{R}^{n+1}}{\text{maximize}}\quad&\sum_{i=1}^{n}\eta_{i}S_{i}+\eta_{n+1}S-\mathcal{R}^{*}\left(\eta\right)\\ \text{subject to}\quad&-\alpha\leq\eta_{i}\leq 1-\alpha,\end{split} (4.2)

where ()\mathcal{R}^{*}(\cdot) denotes the function (η):=ming{(n+1)(g)i=1n+1ηig(Xi)}\mathcal{R}^{*}(\eta):=-\min_{g\in\mathcal{F}}\,\{(n+1)\mathcal{R}(g)-\sum_{i=1}^{n+1}\eta_{i}g(X_{i})\}; heuristically, we can think of ()\mathcal{R}^{*}(\cdot) as the convex conjugate for ()\mathcal{R}(\cdot).

Crucially, the KKT conditions for (4.1) allow for a more tractable definition of our prediction set. Letting ηS\eta^{S} denote any solution to (4.2) and applying the complementary slackness conditions of this primal-dual pair we find that

ηn+1S{αif S<g^S(Xn+1),[α,1α]if S=g^S(Xn+1),1αif S>g^S(Xn+1).\displaystyle\eta^{S}_{n+1}\in\begin{cases}-\alpha&\text{if $S<\hat{g}_{S}(X_{n+1}),$}\\ [-\alpha,1-\alpha]&\text{if $S=\hat{g}_{S}(X_{n+1}),$}\\ 1-\alpha&\text{if $S>\hat{g}_{S}(X_{n+1}).$}\end{cases}

As a consequence, checking whether ηn+1S<1α\eta_{n+1}^{S}<1-\alpha is nearly equivalent to checking that Sg^S(Xn+1)S\leq\hat{g}_{S}(X_{n+1}), albeit with a minor discrepancy on the interpolation boundary. This enables us to define the efficiently computable prediction set,

C^dual(Xn+1):={y:ηn+1S(Xn+1,y)<1α}.\hat{C}_{\text{dual}}(X_{n+1}):=\{y:\eta_{n+1}^{S(X_{n+1},y)}<1-\alpha\}. (4.3)

In practice, C^dual(Xn+1)\hat{C}_{\text{dual}}(X_{n+1}) can be mildly conservative. If we are willing to allow the prediction set to be randomized then exact coverage can be obtained using the prediction set,

C^dual, rand.:={y:ηn+1S(Xn+1,y)<U},\hat{C}_{\text{dual, rand.}}:=\{y:\eta_{n+1}^{S(X_{n+1},y)}<U\},

where UUnif([α,1α])U\sim\text{Unif}([-\alpha,1-\alpha]) is drawn independent of the data.
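
As an illustration (and not a reference implementation), the sketch below solves (4.2) with an off-the-shelf LP solver in the special case of an unregularized finite-dimensional class \mathcal{F}=\{\Phi(\cdot)^{\top}\beta\}, for which \mathcal{R}^{*}(\eta) reduces to the linear constraint \sum_{i}\eta_{i}\Phi(X_{i})=0, and then applies the membership check (4.3) to a single imputed score; the toy data and names are purely for exposition.

import numpy as np
from scipy.optimize import linprog

def dual_eta_test_point(Phi, cal_scores, s_test, alpha):
    # Dual LP (4.2) for the unregularized class F = {Phi(.)^T beta}:
    # maximize sum_i eta_i S_i + eta_{n+1} S  subject to  Phi^T eta = 0,  -alpha <= eta_i <= 1 - alpha.
    c = -np.append(cal_scores, s_test)                   # linprog minimizes, so negate the objective
    res = linprog(c, A_eq=Phi.T, b_eq=np.zeros(Phi.shape[1]),
                  bounds=[(-alpha, 1 - alpha)] * Phi.shape[0], method="highs")
    return res.x[-1]                                     # eta_{n+1}^S

# Toy usage: an intercept-only feature map (which targets marginal coverage).
rng = np.random.default_rng(0)
cal_scores, alpha = rng.normal(size=200), 0.1
Phi = np.ones((201, 1))                                  # rows Phi(X_i); the last row is the test point
eta = dual_eta_test_point(Phi, cal_scores, s_test=0.5, alpha=alpha)
in_dual_set = eta < 1 - alpha                            # membership check from (4.3)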

Our first result verifies that C^dual(Xn+1)\hat{C}_{\text{dual}}(X_{n+1}) obtains the same coverage guarantees as our non-randomized primal set C^(Xn+1)\hat{C}(X_{n+1}), while C^dual, rand.\hat{C}_{\text{dual, rand.}} realizes exact, non-conservative coverage.

Proposition 4.

Assume that the primal-dual pair (4.1)-(4.2) satisfies strong duality and the dual solutions {ηS}S\{\eta^{S}\}_{S\in\mathbb{R}} are computed using an algorithm that is symmetric in the input data. Then,

  1. 1.

    The statement of Theorem 3 is valid with C^(Xn+1)\hat{C}(X_{n+1}) replaced by C^dual(Xn+1)\hat{C}_{\textup{dual}}(X_{n+1}),

  2. 2.

    The statement of Theorem 3 is valid with C^(Xn+1)\hat{C}(X_{n+1}) replaced by C^dual, rand.(Xn+1)\hat{C}_{\textup{dual, rand.}}(X_{n+1}) and ϵint\epsilon_{\text{int}} set to 0.

The assumption that the primal-dual pair satisfies strong duality is very minor and we verify in Section A.1 that it holds for all the function classes and penalties considered in this article. Moreover, we also verify in Section A.1 that for all the function classes and penalties considered in this article, ()\mathcal{R}^{*}(\cdot) is tractable and thus solutions to (4.2) can be computed efficiently.

The main result of this section is Theorem 4, which states that Sηn+1SS\mapsto\eta^{S}_{n+1} is non-decreasing. Critically, this implies that membership in the dual prediction set (4.3) is monotone in the imputed score S(Xn+1,y)S(X_{n+1},y).

Theorem 4.

For all maximizers {ηn+1S}S\{\eta^{S}_{n+1}\}_{S\in\mathbb{R}} of (4.2), Sηn+1SS\mapsto\eta^{S}_{n+1} is non-decreasing in SS.

Leveraging Theorem 4, we may compute C^dual(Xn+1)\hat{C}_{\text{dual}}(X_{n+1}) (or C^dual, rand.(Xn+1)\hat{C}_{\text{dual, rand.}}(X_{n+1})) using the following two-step procedure. First, we identify the largest value of SS such that ηn+1S<1α\eta^{S}_{n+1}<1-\alpha (or ηn+1S<U\eta^{S}_{n+1}<U). Second, denoting this upper bound by Sn+1S^{*}_{n+1}, we output all yy such that S(Xn+1,y)Sn+1S(X_{n+1},y)\leq S^{*}_{n+1}. The second step is straightforward for all commonly used conformity scores. For example, if S(Xn+1,y)=|μ^(Xn+1)y|S(X_{n+1},y)=|\hat{\mu}(X_{n+1})-y|, the prediction set becomes μ^(Xn+1)±Sn+1\hat{\mu}(X_{n+1})\pm S^{*}_{n+1}.

The monotonicity of the dual variable in SS allows for more than one approach to the first step of this procedure. If we have no additional information about the structure of the optimization problem over η\eta, it is always possible to run a binary search over SS to find the largest value such that ηn+1S\eta^{S}_{n+1} is less than the targeted cutoff. On the other hand, for the typical use-case of this method, i.e., when we fit an unregularized quantile regression over a finite-dimensional function class, it is considerably more efficient to compute this cutoff for SS by applying standard tools from linear program sensitivity analysis. We defer a detailed description of our approach to Algorithm 2.

Figure 6: Evaluation of the computational efficiency of our prediction set construction. The left panel displays the time taken to fit a single test point, while the right panel shows the ratio between the time taken by our method versus the cost of fitting a single quantile regression per test point. Both plots display results for the unrandomized version of our conditional calibration method implemented in its one-sided form with conformity score S(x,y)=yS(x,y)=y and 1α=0.91-\alpha=0.9, i.e., we estimate the 0.90.9-quantile of YXY\mid X. Data for this simulation is generated from the Gaussian linear model, Yi=Xiw+ϵiY_{i}=X_{i}^{\top}w+\epsilon_{i} where Xi𝒩(0,Id)X_{i}\sim\mathcal{N}(0,I_{d}), ϵi𝒩(0,1)\epsilon_{i}\sim\mathcal{N}(0,1), and wUnif(𝒮d1)w\sim\text{Unif}(\mathcal{S}^{d-1}), and our method is implemented using the linear function class :={β0+Xβ1:β0,β1d}\mathcal{F}:=\{\beta_{0}+X^{\top}\beta_{1}:\beta_{0}\in\mathbb{R},\ \beta_{1}\in\mathbb{R}^{d}\}. Dots and error bars show means and confidence intervals from 25 trials.

Figure 6 displays the computational efficiency of this sensitivity analysis method on a 2020 MacBook Pro with an Intel Core i5 processor and 16GB of RAM. The left panel of the figure displays the estimated time to construct one prediction set over varying function classes and calibration set sizes. We find that our method is computationally efficient, and that its time complexity depends much more on the dimension than the sample size. In general, the bulk of the computational cost consists of fitting a single quantile regression on the calibration set. From there, predictions for new test points are obtained by an efficient procedure for updating the quantile fit (see Algorithm 2 in the Appendix for details). The computational benefit of this approach is displayed in the right panel of Figure 6, which compares the cost of running our procedure against the time it would take to run a single quantile regression for each test point. We see that our updating procedure is substantially faster across a wide range of calibration set sizes and function class dimensions.

5 Real data experiments

5.1 Communities and crime data

We now illustrate our methods on two real datasets. For our first experiment, we consider the Communities and Crime dataset (Dua and Graff (2017), Redmond and Baveja (2002)). In this task, the goal is to use the demographic features of a community to predict its per capita violent crime rate. To make our analysis as interpretable as possible, we use only a small subset of the covariates in this dataset: population size, unemployment rate, median income, racial make-up (given as four columns indicating the percentage of the population that is Black, White, Hispanic, and Asian), and age demographics (given as two columns indicating the percentage of people in the 12-21 and 65+ age brackets).

Our goal in this section is not to give the best possible prediction intervals. Instead, we perform a simple expository analysis that is designed to demonstrate some possible use cases of our method. For this purpose, we assume that the practitioner’s primary concern is that standard prediction sets will provide unequal coverage across communities with differing racial make-ups. Additionally, we suppose that the practitioner does not have any particular predilections for achieving coverage conditional on age, unemployment rate, or median income. We encode these preferences as follows: let Φ(Xi)\Phi(X_{i}) denote the length five vector consisting of an intercept term and the four racial features and define K\mathcal{F}_{K} to be the RKHS given by the Gaussian kernel K(Xi,Xj)=exp(4XiXj22)K(X_{i},X_{j})=\exp(-4\|X_{i}-X_{j}\|_{2}^{2}) (note that since all variables in this dataset have been previously normalized to lie in [0,1][0,1], we do not make any further modifications before computing the kernel). Then, proceeding exactly as in Section 3.1 we run our method with the function class :={fK()+Φ()β:fKK,β5}\mathcal{F}:=\{f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta:f_{K}\in\mathcal{F}_{K},\ \beta\in\mathbb{R}^{5}\} and penalty (fK()+Φ()β):=λfKK2\mathcal{R}(f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta):=\lambda\|f_{K}\|^{2}_{K}. To get λ\lambda, we run cross-validation on the calibration set. While this does not strictly follow the theory of Section 3, our results indicate that choosing λ\lambda in this manner does not negatively impact the coverage properties of our method in practice (see Figures 9 and 13 below). Finally, we set the conformity scores to be the absolute residuals of a linear regression of YY on XX and the target coverage level to be 1α=0.91-\alpha=0.9. The sizes of the training, calibration, and test sets are taken to be 650, 650, and 694, respectively and we use the unrandomized prediction set, C^dual\hat{C}_{\text{dual}} throughout.

Figure 7: Empirical miscoverage achieved by split conformal inference (blue) and the unrandomized version of our method (orange) both marginally over all data points and across linear re-weightings of the four racial categories. The red line shows the target level of α=0.1\alpha=0.1 and the error bars indicate 95% confidence intervals for the value of (Yn+1C^(Xn+1))\mathbb{P}(Y_{n+1}\notin\hat{C}(X_{n+1})) obtained by averaging over 200 train-calibration-test splits.

Figure 7 shows the empirical miscoverages obtained by our algorithm and split conformal prediction under the linearly re-weighted distributions f\mathbb{P}_{f}, for f{x1,xx%Black,xx%White,xx%Hispanic,xx%Asian}f\in\{x\mapsto 1,\ x\mapsto x_{\rm\%Black},\ x\mapsto x_{\rm\%White},\ x\mapsto x_{\rm\%Hispanic},\ x\mapsto x_{\rm\%Asian}\}. As expected, our method obtains the desired coverage level under all five settings, while split conformal is only able to deliver marginal validity.

Figure 8: Empirical miscoverage achieved by split conformal prediction (blue) and the unrandomized version of our method (orange) across high racial representation subgroups. Panels from left to right show the miscoverage when the percentage of the community that falls into a racial group is in the top pp-percentile, for p{50,70,90}p\in\{50,70,90\}. To conserve space, the short forms B, W, A, and H are used to indicate the racial categories Black, White, Asian, and Hispanic, respectively. Red lines show the target level of α=0.1\alpha=0.1 and the black error bars indicate 95% confidence intervals for the value of (Yn+1C^(Xn+1))\mathbb{P}(Y_{n+1}\notin\hat{C}(X_{n+1})) obtained by averaging over 200 train-calibration-test splits.

In practice, the user is unlikely to only care about the performance under linear re-weightings. For example, they may also want to have prediction sets that are accurate on the communities with the highest representation of a particular racial group, e.g. the communities whose percentage of the population that is Black is in the top 90th percentile. Strictly speaking, our method will only provide a guarantee on these high racial representation groups if the corresponding subgroup indicators are included in \mathcal{F}. However, intuitively, even without explicit indicators, linear re-weightings should already push the method to accurately cover communities with high racial representation. Thus, we may expect that even without adjusting \mathcal{F}, the method implemented above will already perform well on these groups. To investigate this, Figure 8 shows the miscoverage of our method across high racial representation subgroups. Formally, we say that a community has high representation of a particular racial group if the percentage of the community that falls in that group is in the top pp-percentile. The three panels of the figure then show results for p=50, 70, and 90, respectively. We find that even without any explicit indicators for the subgroups, our method is able to correct the errors of split conformal prediction and provide improved coverage in all settings.

Figure 9: Estimated coverages for re-weightings of the data according to a Gaussian kernel centered at each of the plotted data points. The bottom-left panel shows the estimated coverages output by our method, while the top panel gives level curves of the kernel at two specific points highlighted in red. Finally, the bottom-right panel compares boxplots of the empirical (solid) and estimated (striped) values of fx(Yn+1C^(Xn+1))\mathbb{P}_{f^{x}}(Y_{n+1}\notin\hat{C}(X_{n+1})) obtained over 200 train–calibration-test splits for the two highlighted values of xx. The red line denotes the target level of α=0.1\alpha=0.1.

In addition to providing exact coverage over racial re-weightings, our method also provides a simple procedure for evaluating the coverage across the other, more flexibly fit, covariates. More precisely, let

(β^n,g^n,K)=argminβd,gKK1ni=1nα(gK(Xi)+Φ(Xi)β,Si)+λgKK2,(\hat{\beta}_{n},\hat{g}_{n,K})=\mathop{\rm argmin}_{\beta\in\mathbb{R}^{d},g_{K}\in\mathcal{F}_{K}}\frac{1}{n}\sum_{i=1}^{n}\ell_{\alpha}(g_{K}(X_{i})+\Phi(X_{i})^{\top}\beta,S_{i})+\lambda\|g_{K}\|^{2}_{K},

denote the function fit on the calibration data and recall that by the results of Section 3 we expect that for any non-negative re-weighting f()=fK()+Φ()βf(\cdot)=f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta,

f(Yn+1C^(Xn+1))1α2λg^n,K,fK1ni=1nf(Xi).\mathbb{P}_{f}(Y_{n+1}\in\hat{C}(X_{n+1}))\cong 1-\alpha-2\lambda\frac{\langle\hat{g}_{n,K},f_{K}\rangle}{\frac{1}{n}\sum_{i=1}^{n}f(X_{i})}. (5.1)

For the Gaussian RKHS, a natural set of non-negative weight functions are the local re-weightings fx(y):=K(x,y)=exp(4xy2)f^{x}(y):=K(x,y)=\exp(-4\|x-y\|^{2}), which emphasize coverage in a neighbourhood around the fixed point xx.

The bottom-left panel of Figure 9 plots the values of 1α2λg^n,K,fx1ni=1nfx(Xi)1-\alpha-2\lambda\frac{\langle\hat{g}_{n,K},f^{x}\rangle}{\frac{1}{n}\sum_{i=1}^{n}f^{x}(X_{i})} for all points xx appearing in the training and calibration sets. We see immediately that our prediction sets will undercover older communities and overcover communities with high median incomes. To aid in the interpretation of this result, the top panel of the figure indicates the level curves of K(x,)K(x,\cdot) for two specific choices of xx. Finally, the bottom-right panel compares the estimates (5.1) to the realized empirical coverages

i=1694fx(X~i)1694j=1694fx(X~j)𝟙{Y~iC^(X~i)},\sum_{i=1}^{694}\frac{f^{x}(\tilde{X}_{i})}{\frac{1}{694}\sum_{j=1}^{694}f^{x}(\tilde{X}_{j})}\mathbbm{1}\{\tilde{Y}_{i}\in\hat{C}(\tilde{X}_{i})\},

for the same two values of xx. Here, {(X~i,Y~i)}i=1694\{(\tilde{X}_{i},\tilde{Y}_{i})\}_{i=1}^{694} denotes the test set. We see that as expected (5.1) is a highly accurate estimate of the true coverage at both values of xx. To further understand the degree of localization in these plots, it may be useful to note that this re-weighting yields an effective sample size of (i=1694fx(X~i))2i=1694fx(X~i)2275\frac{(\sum_{i=1}^{694}f^{x}(\tilde{X}_{i}))^{2}}{\sum_{i=1}^{694}f^{x}(\tilde{X}_{i})^{2}}\cong 275 at the two red points.
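
For readers who wish to reproduce this type of diagnostic, a minimal sketch of the two computations above is given below; it assumes access to the fitted kernel coefficients \hat{\gamma} (so that \hat{g}_{n,K}(\cdot)=\sum_{i=1}^{n}\hat{\gamma}_{i}K(X_{i},\cdot)) and uses the reproducing property \langle\hat{g}_{n,K},f^{x}\rangle_{K}=\hat{g}_{n,K}(x). All names are illustrative.

import numpy as np

def gaussian_kernel(X, x, scale=4.0):
    # K(X_i, x) = exp(-scale * ||X_i - x||_2^2), the kernel used in this section
    return np.exp(-scale * np.sum((X - x) ** 2, axis=-1))

def estimated_local_coverage(x, X_cal, gamma_hat, lam, alpha=0.1):
    # Estimate (5.1) for the local re-weighting f^x = K(x, .)
    k_x = gaussian_kernel(X_cal, x)
    return 1 - alpha - 2 * lam * (gamma_hat @ k_x) / np.mean(k_x)

def effective_sample_size(x, X_test):
    # (sum_i f^x(X_i))^2 / sum_i f^x(X_i)^2 under the same re-weighting
    w = gaussian_kernel(X_test, x)
    return np.sum(w) ** 2 / np.sum(w ** 2)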

Overall, we find that our procedure provides the user with a highly accurate picture of the coverage properties of their prediction sets. In many practical settings, plots like the bottom-left panel of Figure 9 may prompt practitioners to adjust the quantile regression to protect against observed directions of miscoverage. While such an adjustment will not strictly follow the theory of Sections 2 and 3, so long as the user is careful not to run the procedure so many times as to induce direct over-fitting to the observed miscoverage, small adjustments will likely be permissible. In practice, this type of exploratory analysis may allow practitioners to discover important patterns in their data and tune the prediction sets to reflect their coverage needs.

5.2 Comparison against existing methods

A variety of alternative methods for obtaining conditional coverage in conformal inference have been proposed in the literature. Many of these methods are designed to asymptotically achieve exact conditional coverage, i.e., (Yn+1C^(Xn+1)Xn+1)1α\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1})\stackrel{{\scriptstyle\mathbb{P}}}{{\longrightarrow}}1-\alpha. Here, we compare against two such approaches and demonstrate why the more precise finite-sample guarantees of our method may be preferable.

5.2.1 Comparison against conformalized quantile regression

Our first comparison is to the conformalized quantile regression (CQR) method of Romano et al. (2019). They consider a version of split conformal inference in which the training set is used to fit estimates of the conditional quantiles of YXY\mid X and the calibration set is used to adjust these estimates to guarantee marginal coverage. Here, we implement a two-sided version of their procedure in which the upper and lower quantiles are calibrated separately. In particular, let q^τ(X)\hat{q}_{\tau}(X) denote an estimate of the τ\tau-quantile of YXY\mid X. Let Sτ(X,Y):=Yq^τ(X)S_{\tau}(X,Y):=Y-\hat{q}_{\tau}(X) be a conformity score measuring the distance between YY and the quantile estimate, and let cτc_{\tau} denote the τ\tau-quantile of the calibration scores {Sτ(Xi,Yi)}i=1n\{S_{\tau}(X_{i},Y_{i})\}_{i=1}^{n}. Then, we define the two-sided CQR prediction set as

C^CQR(Xn+1):={y:Sα/2(Xn+1,y)>cα/2,S1α/2(Xn+1,y)c1α/2}.\hat{C}_{CQR}(X_{n+1}):=\{y:S_{\alpha/2}(X_{n+1},y)>c_{\alpha/2},\ S_{1-\alpha/2}(X_{n+1},y)\leq c_{1-\alpha/2}\}. (5.2)

As alluded to above, if q^τ(X)\hat{q}_{\tau}(X) is a consistent estimate of the true conditional quantile function of YXY\mid X, then this set will asymptotically achieve exact conditional coverage (Sesia and Candès, 2020).
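
For reference, a minimal sketch of the separately calibrated interval in (5.2) is given below. It uses a plain empirical quantile for cτc_{\tau} and omits the exact finite-sample index convention, so it should be read as an illustration rather than a fully calibrated implementation; all names are ours.

import numpy as np

def two_sided_cqr_interval(q_lo_test, q_hi_test, q_lo_cal, q_hi_cal, y_cal, alpha):
    # Signed scores S_tau(X, Y) = Y - q_hat_tau(X) on the calibration set
    c_lo = np.quantile(y_cal - q_lo_cal, alpha / 2)        # lower correction c_{alpha/2}
    c_hi = np.quantile(y_cal - q_hi_cal, 1 - alpha / 2)    # upper correction c_{1-alpha/2}
    # Interval {y : S_{alpha/2}(x, y) > c_{alpha/2}, S_{1-alpha/2}(x, y) <= c_{1-alpha/2}}
    return q_lo_test + c_lo, q_hi_test + c_hi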

It is important to note that our approach is not in direct competition with CQR and both methods can be used in combination. Namely, here we will consider an implementation of our procedure in which the quantiles cα/2c_{\alpha/2} and c1α/2c_{1-\alpha/2} are replaced by adaptive estimates from our method. Implementing two-sided fitting requires minor modifications of the dual formulation of our prediction set and we refer the reader to Section A.7 for details.

To compare these methods we once again employ the Communities and Crime dataset. We consider three baseline approaches: 1) split conformal prediction with linear regression and the residual conformity score, S(X,Y):=Yγ^0Xγ^1S(X,Y):=Y-\hat{\gamma}_{0}-X^{\top}\hat{\gamma}_{1} for (γ^0,γ^1)(\hat{\gamma}_{0},\hat{\gamma}_{1}) obtained by ordinary least squares, 2) split conformal prediction implemented using the CQR procedure of Romano et al. (2019) where the estimates q^τ(X)\hat{q}_{\tau}(X) are obtained using linear quantile regression, 3) the same procedure but with q^τ(X)\hat{q}_{\tau}(X) obtained using a quantile random forest. In all three cases we calibrate the upper and lower quantiles separately using either the formulation in (5.2) or, in the case of ordinary least squares, by estimating the upper and lower quantiles of the residuals separately. We compare each of these baseline approaches against implementations in which the split conformal calibration step is replaced by our method with function class :={β0+Xβ1:β0,β1d}\mathcal{F}:=\{\beta_{0}+X^{\top}\beta_{1}:\beta_{0}\in\mathbb{R},\ \beta_{1}\in\mathbb{R}^{d}\}, defined as the set of affine functions over the covariates from the previous section (i.e. population size, unemployment rate, median income, as well as a set of race- and age-based features).

Figure 10: Comparison of split conformal inference (blue) against the unrandomized (orange) and randomized (red) implementations of our conditional calibration method. All three procedures are implemented with three different conformity scores derived from ordinary least squares (OLS), linear quantile regression (QR), and quantile random forests (QRF). Panels from left to right show the average marginal miscoverage, largest and smallest miscoverages over linear re-weightings, and prediction set lengths. Bars in the figure display averages over 200 trials, where in each trial the points in the Communities and Crime dataset are divided uniformly at random into a training set of size 650, a calibration set of size 650, and a test set of size 694.

Figure 10 displays the results of this experiment. We find that all implementations of both split conformal and our randomized method achieve exact marginal coverage, while our unrandomized variant can slightly overcover (left panel). To evaluate the conditional coverage, the center two panels of the figure display estimates of the largest over- and undercoverages obtained across linear re-weightings of the covariate space. Namely, letting Z=(1,X)Z=(1,X) denote the augmented covariates, we estimate the smallest and largest values of

𝔼[Zn+1,j𝟙{Yn+1C^(Xn+1)}]𝔼[Zn+1,j],\frac{\mathbb{E}[Z_{n+1,j}\mathbbm{1}\{Y_{n+1}\notin\hat{C}(X_{n+1})\}]}{\mathbb{E}[Z_{n+1,j}]},

over all features jj (note that here all the features are non-negative). As expected from our theory, the unrandomized variant of our method always slightly overcovers, while the randomized version obtains exact coverage throughout. On the other hand, while CQR asymptotically achieves exact coverage in this setting, our simulations show deviations from the target level using both linear quantile regression and quantile forests. Importantly, we note that while the coverage deviations of linear quantile regression are relatively small here, its performance can worsen significantly in higher dimensions (see Figure 3). Finally, the rightmost panel of the figure shows the lengths of the prediction sets returned by each method. We find that all of the prediction sets are of a similar size with the exception of the OLS implementation of split conformal prediction, whose prediction sets are relatively wide.

5.2.2 Comparison against localized conformal prediction

Our second comparison is to the localized conformal prediction method of Guan (2022). In this procedure, a localized kernel is used to re-weight the calibration data and focus on the data whose covariates are most similar to the test point. The exact method for accomplishing this is somewhat technical and we refer the reader to Theorem 3.2 of Guan (2022) for details. Similar to conformalized quantile regression, localized conformal prediction guarantees exact marginal coverage in finite samples and asymptotic conditional coverage under appropriate choices of the kernel.

To compare localized conformal prediction against our approach, we once again employ the Communities and Crime dataset. We consider the Gaussian kernel K(x1,x2):=exp(4x1x222)K(x_{1},x_{2}):=\exp(-4\|x_{1}-x_{2}\|_{2}^{2}) and take the conformity score S(X,Y):=|Yγ^0Xγ^|S(X,Y):=|Y-\hat{\gamma}_{0}-X^{\top}\hat{\gamma}| to be the absolute values of the residuals from a linear regression. We then compare two methods: localized conformal prediction, and our method implemented with the function class given by the same kernel. More specifically, we implement our method with function class :={β0+fK(X):β0,fKK}\mathcal{F}:=\{\beta_{0}+f_{K}(X):\beta_{0}\in\mathbb{R},f_{K}\in\mathcal{F}_{K}\} and penalty (fK)=λfKK2\mathcal{R}(f_{K})=\lambda\|f_{K}\|_{K}^{2}, where λ\lambda is chosen by cross-validation.

Figure 11: Comparison of the coverage properties of localized conformal prediction (purple) and our randomized conditional calibration method (red). The solid boxplots from left to right show the distributions of the calibration-conditional marginal coverage and kernel re-weighted coverage according to the shifts considered in Figure 9, while hatched boxes display the corresponding coverage estimates obtained using the method of Proposition 2. The feature space employed here is the same as in the previous sections and boxplots in the figure once again show results from 200 trials each containing 650 training points, 650 calibration points, and 694 test points.

The results of this experiment are shown in Figure 11. We find that, as expected, both localized conformal prediction and the randomized version of our method achieve exact marginal coverage. To evaluate their conditional properties, we compare the coverages obtained under the two kernel re-weightings from the previous section (namely the shifts considered in Figure 9). We find that, as expected, the coverage of our method deviates from the target level, but critically this deviation is predictable and well approximated by the estimation procedure given in Proposition 2. On the other hand, although localized conformal prediction obtains similar empirical results to our method, it does not offer estimates of its coverage deviation. Thus, a priori based on the theory of Guan (2022), one may expect localized conformal prediction to give exact coverage under these shifts, while in reality, it shows notable deviations away from the target level.

5.3 RxRx1 data

Our next experiment examines the RxRx1 dataset (Taylor et al. (2019)) from the WILDS repository (Koh et al. (2021)). This repository contains a collection of commonly used datasets for benchmarking performance under distribution shift. In the RxRx1 task, we are given images of cells obtained using fluorescent microscopy and we must predict which one of the 1339 genetic treatments the cells received. These images come from 51 different experiments run across four different cell types. It is well known that even in theoretically identical experiments, minor variations in the execution and environmental conditions can induce observable variation in the final data. Thus, we expect to see heterogeneity in the quality of the predictions across both experiments and cell types.

Perhaps the most obvious method for correcting for this heterogeneity would be to treat the experiments and cell types as known categories and apply the group-conditional coverage method outlined in Section 2. However, if we did this, we would be unable to make predictions for new unlabeled images. Thus, here we take a more data-driven approach and attempt to learn a good feature representation directly from the training set.

To predict the genetic treatments of the cells we use a ResNet50 architecture trained on 37 of the 51 experiments by the original authors of the WILDS repository. We then uniformly divide the samples from the remaining 14 experiments into a training set and a test set. The training set is further split into two halves: one for learning a feature representation, and one to be used as the calibration set. To construct the feature representation, we take the feature map from the pre-trained neural network as input and run a 2\ell_{2}-regularized multinomial linear regression that predicts which experiment each image comes from. We then define our features to be the predicted probabilities of experiment membership output by this model and construct prediction sets using the linear quantile regression method of Section 2. To define the conformity scores for this experiment, let {w^i(x)}i=11339\{\hat{w}_{i}(x)\}_{i=1}^{1339} denote the weights output by the neural network at input xx. Typically, we would use these weights to compute the predicted probabilities of class membership, π^i(x):=exp(w^i(x))/(jexp(w^j(x)))\hat{\pi}_{i}(x):=\exp(\hat{w}_{i}(x))/(\sum_{j}\exp(\hat{w}_{j}(x))). Here, we add an extra step in which we use multinomial logistic regression and the calibration data to fit a parameter TT that re-scales the weights. This procedure is known as temperature scaling and it has been found to increase the accuracy of probabilities output by neural networks (Angelopoulos et al. (2021), Guo et al. (2017)). After running this regression, we set π^i(x):=exp(Tw^i(x))/(jexp(Tw^j(x)))\hat{\pi}_{i}(x):=\exp(T\hat{w}_{i}(x))/(\sum_{j}\exp(T\hat{w}_{j}(x))) and following Romano et al. (2020b) define the conformity score function to be

S(x,y):=i:π^i(x)>π^y(x)π^i(x).S(x,y):=\sum_{i:\hat{\pi}_{i}(x)>\hat{\pi}_{y}(x)}\hat{\pi}_{i}(x).
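
A minimal sketch of this score computation is given below, assuming the temperature TT has already been fit as described above; the function name and argument conventions are illustrative only.

import numpy as np

def conformity_score(weights, y, T):
    # Temperature-scaled softmax pi_hat_i(x) = exp(T w_hat_i(x)) / sum_j exp(T w_hat_j(x))
    z = T * np.asarray(weights, dtype=float)
    z -= np.max(z)                                  # numerical stability; leaves the ratios unchanged
    probs = np.exp(z) / np.sum(np.exp(z))
    # S(x, y): total probability mass strictly larger than that of the label y
    return np.sum(probs[probs > probs[y]])
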
Figure 12: Empirical conditional miscoverage of the unrandomized version of our method (orange) and split conformal (blue) across cell types and experiments. Red lines indicate the target level of α=0.1\alpha=0.1 and black error bars show 95% binomial confidence intervals for the calibration-conditional miscoverage (Yn+1C^(Xn+1)|Dtrain,Dcal)\mathbb{P}(Y_{n+1}\notin\hat{C}(X_{n+1})|D_{\text{train}},D_{\text{cal}}), where DtrainD_{\text{train}} and DcalD_{\text{cal}} denote the training dataset used to learn the feature representation and the calibration dataset used to implement our method, respectively.

The coverage properties of our unrandomized prediction sets are outlined in Figure 12. We see that while split conformal prediction has very heterogeneous coverage across experiments and cell types, our approach performs well on all groups. Thus, the learned feature representation successfully captures the batch effects in the data and thereby enables our method to provide the desired group-conditional coverage.

The method described above is hardly the only way to construct a feature representation for this dataset. In Section A.2, we consider an alternative approach in which we implement our method using the top principal components of the feature layer of the neural network as input. The idea here is that the batch effects (i.e. the cell types and experiment memberships) should induce large variations in the images that are visible on the top principal components. Thus, correcting the coverage along these directions will provide the desired conditional validity. In agreement with this hypothesis, we find that this procedure produces nearly identical results to those seen above, i.e., good coverage across all groups.

6 Choosing the function class and regularizer

In order to implement our method in practice, users must choose both the function class, \mathcal{F}, and regularizer, ()\mathcal{R}(\cdot). In the real data examples above we have considered some sample choices for these quantities that were designed to illustrate the guarantees of our method. A practitioner may reasonably disagree with our selections, and through their choice of \mathcal{F} and ()\mathcal{R}(\cdot), look to prioritize different conditional targets. As in many statistical estimation problems, we expect the best performance of our method to be obtained when the choices of \mathcal{F} and \mathcal{R} are guided by domain-specific knowledge of the most important features to the prediction task at hand. To help practitioners in making these choices, we highlight a few considerations that may arise.

First, we remark that although our theory requires fixed choices of \mathcal{F} and \mathcal{R}, we find empirically that running cross-validation on the training set does not significantly harm the coverage. To demonstrate this, we consider two different penalized implementations of our method. In the first, we run ridge-regularized linear quantile regression with :={β0+Xβ1:β0,β1d}\mathcal{F}:=\{\beta_{0}+X^{\top}\beta_{1}:\beta_{0}\in\mathbb{R},\ \beta_{1}\in\mathbb{R}^{d}\} and (β):=λβ2\mathcal{R}(\beta):=\lambda\|\beta\|^{2}, while in the second we run kernel quantile regression with Gaussian kernel, K(x,y):=exp(0.025xy22)K(x,y):=\exp(-0.025\,\|x-y\|_{2}^{2}) and corresponding kernel-norm regularizer, (g):=λgK2\mathcal{R}(g):=\lambda\|g\|_{K}^{2}. Figure 13 shows a comparison of the marginal coverage of our method obtained when λ\lambda is either fixed in advance, or estimated using cross-validation. We see that although our theory requires λ\lambda to be fixed, both approaches give near exact coverage empirically.

Figure 13: Marginal coverage obtained with (vertical and horizontal hatches) and without (diagonal hatches) cross-validation for the randomized version of our method. The red line shows the target level of α=0.1\alpha=0.1. Data for this experiment were generated from the Gaussian linear model in which (Xi,ϵi)N(0,Id)(X_{i},\epsilon_{i})\sim N(0,I_{d}), Yi=Xiw+ϵiY_{i}=X_{i}^{\top}w+\epsilon_{i}, and ww is a unit vector sampled uniformly at random. The conformity score is taken to be simply equal to YY so that all of the data is used in the calibration set. Barplots show averages from 20 trials where in each trial we sample n=200 calibration points and 100 test points in dimension d=50.
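
As a point of reference, one plausible reading of this cross-validation step for the ridge-penalized linear implementation is sketched below: λ is chosen from a grid by minimizing held-out pinball loss. This is our illustration of the procedure rather than a prescribed implementation, and the helper names are ours.

import numpy as np
import cvxpy as cp

def fit_ridge_pinball(X, s, tau, lam):
    # Linear quantile regression at level tau with penalty R(beta) = lam * ||beta||^2
    n, d = X.shape
    beta0, beta = cp.Variable(), cp.Variable(d)
    resid = s - (beta0 + X @ beta)
    pinball = cp.sum(cp.maximum(tau * resid, (tau - 1) * resid)) / n
    cp.Problem(cp.Minimize(pinball + lam * cp.sum_squares(beta))).solve()
    return beta0.value, beta.value

def select_lambda_by_cv(X, s, tau, lambdas, n_folds=5, seed=0):
    # Pick the lambda with the smallest average held-out pinball loss
    folds = np.random.default_rng(seed).integers(n_folds, size=len(s))
    cv_loss = []
    for lam in lambdas:
        losses = []
        for k in range(n_folds):
            tr, te = folds != k, folds == k
            b0, b = fit_ridge_pinball(X[tr], s[tr], tau, lam)
            r = s[te] - (b0 + X[te] @ b)
            losses.append(np.mean(np.maximum(tau * r, (tau - 1) * r)))
        cv_loss.append(np.mean(losses))
    return lambdas[int(np.argmin(cv_loss))]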

Perhaps even more critical than the regularization level is the choice of function class \mathcal{F}. Of particular interest is the trade-off that occurs as the dimension of \mathcal{F} increases. Indeed, while larger function spaces allow for a richer set of conditional guarantees, we find empirically that they also lead to larger prediction sets. Thus, adding spurious features to \mathcal{F} can harm the efficiency of our predictions.

Figure 14: Estimated distributions of the calibration-conditional marginal miscoverage, worst-case conditional coverage deviation, and prediction set length of quantile regression (green), as well as the unrandomized (orange) and randomized (red) variants of our conditional calibration method. Boxplots show results from 100 trials, where in each trial the coverages and lengths are evaluated empirically on 1000 test points.

To demonstrate this phenomenon, we consider a dataset with many irrelevant features that have no relationship with the response. In particular, we work with a Gaussian model in which XiN(0,Id)X_{i}\sim N(0,I_{d}), YiN(0,1)Y_{i}\sim N(0,1), and YiY_{i} is independent of XiX_{i}. We implement our method with conformity score S(X,Y)=YS(X,Y)=Y and use linear quantile regression with ={β0+Xβ1:β0,β1d}\mathcal{F}=\{\beta_{0}+X^{\top}\beta_{1}:\beta_{0}\in\mathbb{R},\ \beta_{1}\in\mathbb{R}^{d}\}, (g)=0\mathcal{R}(g)=0. Figure 14 plots estimated distributions of the calibration-conditional marginal miscoverage, worst-case conditional coverage deviation, and length of a two-sided implementation of our method in which the lower α/2\alpha/2 and upper 1α/21-\alpha/2 quantiles are estimated separately (see Section A.7), i.e. the quantities,

(Yn+1C^(Xn+1){(Xi,Yi)}i=1n),sup1jd+1|𝔼[(1,Xn+1)j(𝟙{Yn+1C^(Xn+1)}α){(Xi,Yi)}i=1n]𝔼[|(1,Xn+1)j|]|, and 𝔼[length(C^(Xn+1)){(Xi,Yi)}i=1n].\begin{gathered}\mathbb{P}(Y_{n+1}\notin\hat{C}(X_{n+1})\mid\{(X_{i},Y_{i})\}_{i=1}^{n}),\ \sup_{1\leq j\leq d+1}\left|\frac{\mathbb{E}[(1,X_{n+1})_{j}(\mathbbm{1}\{Y_{n+1}\not\in\hat{C}(X_{n+1})\}-\alpha)\mid\{(X_{i},Y_{i})\}_{i=1}^{n}]}{\mathbb{E}[|(1,X_{n+1})_{j}|]}\right|,\\ \text{ and }\mathbb{E}[\text{length}(\hat{C}(X_{n+1}))\mid\{(X_{i},Y_{i})\}_{i=1}^{n}].\end{gathered}
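
For completeness, the empirical analogues of these three quantities can be computed on a held-out test set as in the short sketch below; this is an illustration only, where `miscovered` is the 0/1 vector of test miscoverage events and `lengths` the realized interval lengths.

import numpy as np

def summarize_prediction_sets(miscovered, lengths, X_test, alpha):
    Z = np.column_stack([np.ones(len(X_test)), X_test])      # augmented covariates (1, X)
    marginal_miscoverage = np.mean(miscovered)
    # max_j |mean(Z_j * (1{miss} - alpha))| / mean(|Z_j|)
    worst_case_deviation = np.max(
        np.abs(Z.T @ (miscovered - alpha)) / len(Z) / np.mean(np.abs(Z), axis=0)
    )
    return marginal_miscoverage, worst_case_deviation, np.mean(lengths)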

As a baseline, the figure also displays results for vanilla quantile regression. We find that while the marginal coverage of our method is robust to large function classes, both the worst-case coverage deviation and the average length of the prediction sets increase as the dimension grows. Recall that all the covariates here are spurious; therefore, the increase in these quantities is purely a result of over-fitting in high dimensions and not a reflection of the underlying complexity of the relationship between XX and YY. Finally, we remark that the growth in the worst-case coverage error is substantially worse for quantile regression compared to our method. Thus, while we cannot guarantee exact calibration-conditional validity over all ff\in\mathcal{F} in high dimensions, our method is practically superior to sensible alternatives.

7 Acknowledgments

E.J.C. was supported by the Office of Naval Research grant N00014-20-1-2157, the National Science Foundation grant DMS-2032014, the Simons Foundation under award 814641, and the ARO grant 2003514594. I.G. was also supported by the Office of Naval Research grant N00014-20-1-2157 and the Simons Foundation award 814641, as well as additionally by the Overdeck Fellowship Fund. J.J.C. was supported by the John and Fannie Hertz Foundation. The authors are grateful to Kevin Guo and Tim Morrison for helpful discussion on this work.

References

  • Andrews and Shi (2013) Andrews, D. W. and Shi, X. (2013) Inference based on conditional moment inequalities. Econometrica, 81, 609–666.
  • Angelopoulos et al. (2021) Angelopoulos, A. N., Bates, S., Jordan, M. and Malik, J. (2021) Uncertainty sets for image classifiers using conformal prediction. In International Conference on Learning Representations.
  • Barber et al. (2023) Barber, R. F., Candes, E. J., Ramdas, A. and Tibshirani, R. J. (2023) Conformal prediction beyond exchangeability. arXiv preprint. ArXiv:2202.13415.
  • Barber et al. (2020) Barber, R. F., Candès, E. J., Ramdas, A. and Tibshirani, R. J. (2020) The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 10, 455–482.
  • Boucheron et al. (2005) Boucheron, S., Bousquet, O. and Lugosi, G. (2005) Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9, 323–375.
  • Bousquet and Elisseeff (2002) Bousquet, O. and Elisseeff, A. (2002) Stability and generalization. Journal of Machine Learning Research, 2, 499–526.
  • Chernozhukov et al. (2021) Chernozhukov, V., Wüthrich, K. and Zhu, Y. (2021) Distributional conformal prediction. Proceedings of the National Academy of Sciences, 118, e2107794118.
  • Dantzig and Thapa (2003) Dantzig, G. B. and Thapa, M. N. (2003) Linear programming: Theory and extensions, vol. 2. Springer.
  • Deng et al. (2023) Deng, Z., Dwork, C. and Zhang, L. (2023) Happymap: A generalized multi-calibration method. arXiv preprint arXiv:2303.04379.
  • Dua and Graff (2017) Dua, D. and Graff, C. (2017) UCI machine learning repository. URL: http://archive.ics.uci.edu/ml.
  • Guan (2022) Guan, L. (2022) Localized conformal prediction: a generalized inference framework for conformal prediction. Biometrika, 110, 33–50.
  • Guo et al. (2017) Guo, C., Pleiss, G., Sun, Y. and Weinberger, K. Q. (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, 1321–1330. JMLR.org.
  • Hébert-Johnson et al. (2018) Hébert-Johnson, U., Kim, M., Reingold, O. and Rothblum, G. (2018) Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, 1939–1948. PMLR.
  • Jung et al. (2023) Jung, C., Noarov, G., Ramalingam, R. and Roth, A. (2023) Batch multivalid conformal prediction. In International Conference on Learning Representations.
  • Kim et al. (2019) Kim, M. P., Ghorbani, A. and Zou, J. (2019) Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 247–254.
  • Koenker and Bassett Jr (1978) Koenker, R. and Bassett Jr, G. (1978) Regression quantiles. Econometrica: journal of the Econometric Society, 33–50.
  • Koh et al. (2021) Koh, P. W., Sagawa, S., Marklund, H., Xie, S. M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R. L., Gao, I., Lee, T., David, E., Stavness, I., Guo, W., Earnshaw, B., Haque, I., Beery, S. M., Leskovec, J., Kundaje, A., Pierson, E., Levine, S., Finn, C. and Liang, P. (2021) Wilds: A benchmark of in-the-wild distribution shifts. In Proceedings of the 38th International Conference on Machine Learning (eds. M. Meila and T. Zhang), vol. 139 of Proceedings of Machine Learning Research, 5637–5664. PMLR.
  • Lei and Wasserman (2014) Lei, J. and Wasserman, L. (2014) Distribution‐free prediction bands for non‐parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76.
  • McShane (1934) McShane, E. J. (1934) Extension of range of functions. Bulletin of the American Mathematical Society, 40, 837 – 842.
  • Mendelson (2014) Mendelson, S. (2014) Learning without concentration. In Proceedings of The 27th Conference on Learning Theory (eds. M. F. Balcan, V. Feldman and C. Szepesvári), vol. 35 of Proceedings of Machine Learning Research, 25–39. Barcelona, Spain: PMLR.
  • Paulsen and Raghupathi (2016) Paulsen, V. I. and Raghupathi, M. (2016) An Introduction to the Theory of Reproducing Kernel Hilbert Spaces. Cambridge Studies in Advanced Mathematics. Cambridge University Press.
  • Qiu et al. (2022) Qiu, H., Dobriban, E. and Tchetgen, E. T. (2022) Prediction sets adaptive to unknown covariate shift. arXiv preprint. ArXiv:2203.06126.
  • Redmond and Baveja (2002) Redmond, M. and Baveja, A. (2002) A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141, 660–678.
  • Romano et al. (2020a) Romano, Y., Barber, R. F., Sabatti, C. and Candès, E. (2020a) With Malice Toward None: Assessing Uncertainty via Equalized Coverage. Harvard Data Science Review, 2.
  • Romano et al. (2019) Romano, Y., Patterson, E. and Candes, E. (2019) Conformalized quantile regression. In Advances in Neural Information Processing Systems (eds. H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox and R. Garnett), vol. 32. Curran Associates, Inc.
  • Romano et al. (2020b) Romano, Y., Sesia, M. and Candes, E. (2020b) Classification with valid and adaptive coverage. In Advances in Neural Information Processing Systems (eds. H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan and H. Lin), vol. 33, 3581–3591. Curran Associates, Inc.
  • Sesia and Candès (2020) Sesia, M. and Candès, E. J. (2020) A comparison of some conformal quantile regression methods. Stat, 9, e261. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/sta4.261. E261 sta4.261.
  • Sesia and Romano (2021) Sesia, M. and Romano, Y. (2021) Conformal prediction using conditional histograms. In Advances in Neural Information Processing Systems (eds. M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang and J. W. Vaughan), vol. 34, 6304–6315. Curran Associates, Inc.
  • Taylor et al. (2019) Taylor, J., Earnshaw, B., Mabey, B., Victors, M. and Yosinski, J. (2019) Rxrx1: An image set for cellular morphological variation across many experimental batches. In International Conference on Learning Representations (ICLR).
  • Tibshirani et al. (2019) Tibshirani, R. J., Foygel Barber, R., Candes, E. and Ramdas, A. (2019) Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems (eds. H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox and R. Garnett), vol. 32. Curran Associates, Inc.
  • Vaart and Wellner (1996) Vaart, A. W. and Wellner, J. A. (1996) Weak Convergence and Empirical Processes. Springer New York, NY, 1 edn.
  • Vovk (2012) Vovk, V. (2012) Conditional validity of inductive conformal predictors. In Proceedings of the Asian Conference on Machine Learning (eds. S. C. H. Hoi and W. Buntine), vol. 25 of Proceedings of Machine Learning Research, 475–490. Singapore Management University, Singapore: PMLR.
  • Vovk et al. (2005) Vovk, V., Gammerman, A. and Shafer, G. (2005) Algorithmic Learning in a Random World. Berlin, Heidelberg: Springer-Verlag.
  • Vovk et al. (2003) Vovk, V., Lindsay, D., Nouretdinov, I. and Gammerman, A. (2003) Mondrian confidence machine. Tech. rep., Royal Holloway University of London.
  • Whitney (1934) Whitney, H. (1934) Analytic extensions of differentiable functions defined in closed sets. Transactions of the American Mathematical Society, 36, 63–89.
  • Yang et al. (2022) Yang, Y., Kuchibhotla, A. K. and Tchetgen, E. T. (2022) Doubly robust calibration of prediction sets under covariate shift. arXiv preprint arXiv:2203.01761.

Appendix A Appendix

A.1 Computational details for Section 4

A.1.1 Verification of the conditions of Section 4

In this section we verify that all of the quantile regression problems discussed in this paper satisfy the conditions of Section 4. To do this we will check that 1) each quantile regression admits an equivalent finite dimensional representation, and 2) all of these finite dimensional representations (and thus also their infinite dimensional counterparts) satisfy strong duality.

The fact that the linear quantile regression of Section 2 satisfies 1) and 2) is clear. For RKHS functions, let Kn+1×n+1K\in\mathbb{R}^{n+1\times n+1} denote the kernel matrix with entries Kij=K(Xi,Xj)K_{ij}=K(X_{i},X_{j}). Let KiK_{i} denote the ithi_{\text{th}} row (equivalently column) of KK. Then, the primal problem (3.1) can be fit by solving the equivalent convex optimization program

minimizeγn+1,βd1n+1i=1nα(Kiγ+Φ(Xi)β,Si)+1n+1α(Kn+1γ+Φ(Xn+1)β,S)+λγKγ,\displaystyle\underset{\gamma\in\mathbb{R}^{n+1},\ \beta\in\mathbb{R}^{d}}{\text{minimize}}\ \frac{1}{n+1}\sum_{i=1}^{n}\ell_{\alpha}(K_{i}^{\top}\gamma+\Phi(X_{i})^{\top}\beta,S_{i})+\frac{1}{n+1}\ell_{\alpha}(K_{n+1}^{\top}\gamma+\Phi(X_{n+1})^{\top}\beta,S)+\lambda\gamma^{\top}K\gamma,

and setting g^S()=i=1n+1γ^iK(Xi,)+Φ()β^\hat{g}_{S}(\cdot)=\sum_{i=1}^{n+1}\hat{\gamma}_{i}K(X_{i},\cdot)+\Phi(\cdot)^{\top}\hat{\beta}, for (γ^,β^)(\hat{\gamma},\hat{\beta}) any optimal solutions. With this finite-dimensional representation in hand, it now follows that the primal-dual pair must satisfy strong-duality by Slater’s condition. Moreover, it is also easy to see that in this context

(η)\displaystyle\mathcal{R}^{*}(\eta) =mingKK,βdλgKK2i=1n+1ηi(gK(Xi)+Φ(Xi)β)\displaystyle=-\min_{g_{K}\in\mathcal{F}_{K},\ \beta\in\mathbb{R}^{d}}\lambda\|g_{K}\|_{K}^{2}-\sum_{i=1}^{n+1}\eta_{i}(g_{K}(X_{i})+\Phi(X_{i})^{\top}\beta)
=minγn+1,βdλγKγi=1n+1ηi(Kiγ+Φ(Xi)β),\displaystyle=-\min_{\gamma\in\mathbb{R}^{n+1},\ \beta\in\mathbb{R}^{d}}\lambda\gamma^{\top}K\gamma-\sum_{i=1}^{n+1}\eta_{i}(K_{i}^{\top}\gamma+\Phi(X_{i})^{\top}\beta),

which is a tractable function that we can compute.
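
For concreteness, this inner minimization can be evaluated in closed form under the formulation displayed above: minimizing over β\beta is unbounded below unless i=1n+1ηiΦ(Xi)=0\sum_{i=1}^{n+1}\eta_{i}\Phi(X_{i})=0, and the stationarity condition in γ\gamma, 2λKγ=Kη2\lambda K\gamma=K\eta, is satisfied by γ=η/(2λ)\gamma=\eta/(2\lambda), giving

\mathcal{R}^{*}(\eta)=\begin{cases}\dfrac{\eta^{\top}K\eta}{4\lambda},&\text{if }\sum_{i=1}^{n+1}\eta_{i}\Phi(X_{i})=0,\\ +\infty,&\text{otherwise.}\end{cases}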

Finally, to fit Lipschitz functions we can solve the optimization program,

minimizeγn+1,βd1n+1i=1nα(γi+Φ(Xi)β,Si)+1n+1α(γn+1+Φ(Xn+1)β,S)+λmaxij|γiγj|XiXj2.\displaystyle\underset{\gamma\in\mathbb{R}^{n+1},\ \beta\in\mathbb{R}^{d}}{\text{minimize}}\ \frac{1}{n+1}\sum_{i=1}^{n}\ell_{\alpha}(\gamma_{i}+\Phi(X_{i})^{\top}\beta,S_{i})+\frac{1}{n+1}\ell_{\alpha}(\gamma_{n+1}+\Phi(X_{n+1})^{\top}\beta,S)+\lambda\max_{i\neq j}\frac{|\gamma_{i}-\gamma_{j}|}{\|X_{i}-X_{j}\|_{2}}.

The idea here is that γ1,,γn+1{\gamma}_{1},\dots,{\gamma}_{n+1} act as proxies for the values of fL(X1),,fL(Xn+1){f}_{L}(X_{1}),\dots,{f}_{L}(X_{n+1}). These proxies can always be extended to a complete function on all of 𝒳\mathcal{X} using the methods of McShane (1934), Whitney (1934). Once again, with this finite-dimensional representation in hand it is now easy to check that the primal-dual pair satisfies strong-duality by Slater’s condition. Finally, in this context we have

(η)\displaystyle\mathcal{R}^{*}(\eta) =mingLL,βdλLip(gL)i=1n+1ηi(gL(Xi)+Φ(Xi)β)\displaystyle=-\min_{g_{L}\in\mathcal{F}_{L},\ \beta\in\mathbb{R}^{d}}\lambda\text{Lip}(g_{L})-\sum_{i=1}^{n+1}\eta_{i}(g_{L}(X_{i})+\Phi(X_{i})^{\top}\beta)
=minγn+1,βdλmaxij|γiγj|XiXj2i=1n+1ηiγi,\displaystyle=-\min_{\gamma\in\mathbb{R}^{n+1},\ \beta\in\mathbb{R}^{d}}\lambda\max_{i\neq j}\frac{|\gamma_{i}-\gamma_{j}|}{\|X_{i}-X_{j}\|_{2}}-\sum_{i=1}^{n+1}\eta_{i}\gamma_{i},

which is a tractable function.
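
As a small illustration of the extension step mentioned above, one standard construction (McShane, 1934) extends the fitted proxies γ^1,,γ^n+1\hat{\gamma}_{1},\dots,\hat{\gamma}_{n+1} to an arbitrary point without increasing the Lipschitz constant; a minimal sketch, with illustrative names, is:

import numpy as np

def mcshane_extension(x_new, X, gamma_hat, lip):
    # f(x_new) = min_i { gamma_hat_i + lip * ||X_i - x_new||_2 } is lip-Lipschitz and
    # interpolates the fitted values whenever they are themselves lip-Lipschitz on {X_i}.
    return np.min(gamma_hat + lip * np.linalg.norm(X - x_new, axis=1))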

A.1.2 Additional algorithmic details

The results below elucidate various approaches to computing the prediction set. Algorithm 1 describes how we might use a binary search algorithm to discover the critical SS at which ηn+1S\eta^{S}_{n+1} equals the target cutoff.

Data: Observed data {(X1,S1),,(Xn,Sn)}{Xn+1}\{(X_{1},S_{1}),\dots,(X_{n},S_{n})\}\cup\{X_{n+1}\}, numerical error tolerance ϵ\epsilon, target cutoff C{1α,U}C\in\{1-\alpha,U\}, range [a,b][a,b] for Sn+1S_{n+1} (a,ba,b = None indicates that no bounds are known for Sn+1S_{n+1}).
if b=b= None then
       b=max{max1inSi,1}b=\max\{\max_{1\leq i\leq n}S_{i},1\};
       while ηn+1b<C\eta^{b}_{n+1}<C do
             b=2bb=2b;
      
if a=a= None then
       a=min{min1inSi,1}a=\min\{\min_{1\leq i\leq n}S_{i},-1\};
       while ηn+1aC\eta^{a}_{n+1}\geq C do
             a=2aa=2a;
      
while ba>ϵb-a>\epsilon do
       if ηn+1(a+b)/2<C\eta^{(a+b)/2}_{n+1}<C then
             a=(a+b)/2a=(a+b)/2;
            
      else
             b=(a+b)/2b=(a+b)/2;
            
      
return (a+b)/2(a+b)/2.
Algorithm 1 Binary Search Computation of Sn+1S^{*}_{n+1}
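
A direct Python rendering of Algorithm 1 is sketched below; `eta_np1` stands for any routine returning ηn+1S\eta^{S}_{n+1} for a candidate score SS (e.g., the dual LP solve from Section 4), and the toy stand-in at the end is included only so the search mechanics can be run end-to-end.

import numpy as np

def binary_search_cutoff(eta_np1, cal_scores, cutoff, eps=1e-6):
    # Locate the critical score at which the (non-decreasing, by Theorem 4) map
    # S -> eta_{n+1}^S reaches the target cutoff.
    b = max(np.max(cal_scores), 1.0)
    while eta_np1(b) < cutoff:          # expand upward until the cutoff is reached
        b *= 2
    a = min(np.min(cal_scores), -1.0)
    while eta_np1(a) >= cutoff:         # expand downward until we are strictly below it
        a *= 2
    while b - a > eps:                  # standard bisection on the monotone map
        mid = (a + b) / 2
        a, b = (mid, b) if eta_np1(mid) < cutoff else (a, mid)
    return (a + b) / 2

toy_eta = lambda s: np.clip(0.2 * s, -0.1, 0.9)    # toy monotone stand-in for eta_{n+1}^S
print(binary_search_cutoff(toy_eta, np.array([-1.0, 0.5, 2.0]), cutoff=0.9))   # ~4.5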

As discussed in the main text, Algorithm 1 may be inefficient if the optimization problem possesses additional structure. In particular, a common implementation of our method is to fit an unregularized quantile regression over a finite-dimensional function class. In this case, it is possible to exactly compute sup{S:ηn+1S<C}\sup\{S:\eta^{S}_{n+1}<C\}. Our approach to this problem, which leverages standard tools from LP sensitivity analysis, relies on the following observation.

Proposition 5.

Assume strong duality holds for (4.1). Let (g^n,η^n)×n(\hat{g}_{n},\hat{\eta}^{n})\in\mathcal{F}\times\mathbb{R}^{n} denote a valid primal-dual solution of the analogues of (4.1), (4.2) fit on the calibration data alone, and let S^n+1:=g^n(Xn+1)\hat{S}_{n+1}:=\hat{g}_{n}(X_{n+1}). Then, (η^n,0)(\hat{\eta}^{n},0) is a valid dual solution of the dual program with data {(Xi,Si)}i=1n{(Xn+1,S^n+1)}\{(X_{i},S_{i})\}_{i=1}^{n}\cup\{(X_{n+1},\hat{S}_{n+1})\}.

Proof.

This result follows immediately from the fact that (η^n,0)(\hat{\eta}^{n},0) is feasible and the duality gap of g^n\hat{g}_{n} and (η^n,0)(\hat{\eta}^{n},0) is zero. ∎

We now recall some basic linear programming (LP) facts. First, assuming without loss of generality that Φ\Phi is full-rank, it follows immediately from standard LP theory that there exists a solution in which all but pp indices in η\eta lie at the extremes {α,1α}\{-\alpha,1-\alpha\} (Dantzig and Thapa (2003)). Moreover, the remaining non-trivial coordinates of η\eta are piecewise linear in SS, and the slope of the function given by Sηn+1SS\mapsto\eta^{S}_{n+1} can be obtained by solving a linear equation (Dantzig and Thapa (2003)). Thus, obtaining the critical value of SS at which ηn+1S\eta^{S}_{n+1} exceeds our cutoff amounts to tracing a piecewise linear function from S^n+1\hat{S}_{n+1} until the cutoff is reached. Algorithm 2 provides an explicit description of this procedure, which closely resembles the simplex method. Note that the algorithm is initialized at S^n+1\hat{S}_{n+1} since the dual solution at that point is known to be zero (Proposition 5).

Data: Observed data {(X1,S1),,(Xn,Sn)}{Xn+1}\{(X_{1},S_{1}),\dots,(X_{n},S_{n})\}\cup\{X_{n+1}\}, target cutoff CC, quantile qq.
g^n()=LPSolver({(Xi,Si)}i=1n)\hat{g}_{n}(\cdot)=\text{LPSolver}(\{(X_{i},S_{i})\}_{i=1}^{n}) ;
Sn+1=g^n(Xn+1)S^{*}_{n+1}=\hat{g}_{n}(X_{n+1}), ηn+1=0\eta_{n+1}=0;
ic=n+1i_{c}=n+1 ;
  /* candidate to enter basis */
while ηn+1C\eta_{n+1}\neq C do
       B=ActiveSet({1,,n+1})B=\text{ActiveSet}(\{1,\dots,n+1\});
d=(ΦB)1Φicd=-(\Phi_{B}^{\top})^{-1}\Phi_{i_{c}}^{\top};
       if C>ηn+1C>\eta_{n+1} then
             Δ1:|B|=max(qηBd,q1ηBd)\Delta_{1:|B|}=\max\left(\frac{q-\eta_{B}}{d},\frac{q-1-\eta_{B}}{d}\right) ;
              /* element-wise division and maximum */
             i=argmini[|B|](Δi)i^{*}=\arg\min_{i\in[|B|]}(\Delta_{i}) ;
              /* candidate to exit basis */
             δ=Δi\delta=\Delta_{i^{*}};
            
      else
             Δ1:|B|=min(qηBd,q1ηBd)\Delta_{1:|B|}=\min\left(\frac{q-\eta_{B}}{d},\frac{q-1-\eta_{B}}{d}\right) ;
              /* element-wise division and minimum */
             i=argmaxi[|B|](Δi)i^{*}=\arg\max_{i\in[|B|]}(\Delta_{i}) ;
              /* candidate to exit basis */
             δ=Δi\delta=\Delta_{i^{*}} ;
            
      δc=max(min(δ,q1ηic),qηic)\delta_{c}=\max(\min(\delta,q-1-\eta_{i_{c}}),q-\eta_{i_{c}});
       ηB=ηB+δcd\eta_{B}=\eta_{B}+\delta_{c}d;
       ηic=ηic+δc\eta_{i_{c}}=\eta_{i_{c}}+\delta_{c};
       if δc=δ\delta_{c}=\delta then
             B=BiB=B\setminus i^{*} ;
              /* update basis */
             B=B{ic}B=B\cup\{i_{c}\} ;
              /* update basis */
            
      A=(ΦB)1ΦBcA=(\Phi_{B}^{\top})^{-1}\Phi_{B^{c}}^{\top};
       c=SBcSBAc=S_{B^{c}}^{\top}-S_{B}^{\top}A;
       ΔS=c/A1\Delta^{S}=c/A_{-1} ;
        /* element-wise division by last row of AA */
B^{*}=\{i\in B^{c}\mid c_{i}\neq 0\} ;
       if |B|=0|B^{*}|=0 then
             Sn+1=S_{n+1}=\infty ;
             ηn+1=C\eta_{n+1}=C;
            
      else
             if C>0C>0 then
                   ic=argminiBΔiSi_{c}=\arg\min_{i\in B^{*}}\Delta^{S}_{i} ;
                    /* candidate to enter basis */
                   Sn+1=Sn+1+ΔicSS^{*}_{n+1}=S^{*}_{n+1}+\Delta^{S}_{i_{c}} ;
                  
            else
                   ic=argmaxiBΔiSi_{c}=\arg\max_{i\in B^{*}}\Delta^{S}_{i} ;
                    /* candidate to enter basis */
                   Sn+1=Sn+1+ΔicSS^{*}_{n+1}=S^{*}_{n+1}+\Delta^{S}_{i_{c}} ;
                  
            
      
return Sn+1S^{*}_{n+1}.
Algorithm 2 LP Sensitivity Computation of Sn+1S^{*}_{n+1}

We conclude this section with one final method for computing a conservative alternative to our prediction set that requires only a single quantile regression fit. For this, suppose we know an upper bound M for S_{n+1}. Then, we may consider the conservative prediction set \hat{C}_{\text{con}}(X_{n+1},M):=\{y:S(X_{n+1},y)\leq\hat{g}_{M}(X_{n+1})\}. Our final proposition shows that whenever M is a valid upper bound on S_{n+1}, \hat{C}_{\text{con}}(X_{n+1},M) provides a conservative coverage guarantee. To avoid subtle issues related to the non-uniqueness of the optimal quantile function, we will assume that \hat{g}_{M}(X_{n+1}) is a min-norm solution of (4.1) (the choice of norm is not important).

Proposition 6.

Assume that \mathcal{F} admits a strictly-convex norm \|\cdot\|_{\mathcal{F}} and let {g^S}S\{\hat{g}_{S}\}_{S\in\mathbb{R}} denote the corresponding min-norm solutions to (4.1). Then, for all M>0M>0,

\{y:S(X_{n+1},y)\leq\hat{g}_{S(X_{n+1},y)}(X_{n+1}),\ S(X_{n+1},y)\leq M\}\subseteq\{y:S(X_{n+1},y)\leq\hat{g}_{M}(X_{n+1})\}.
Proof of Proposition 6.

Let

Ln(g):=i=1nα(g(Xi),Si) and LS(g)=Ln(g)+α(g(Xn+1),S).L_{n}(g):=\sum_{i=1}^{n}\ell_{\alpha}(g(X_{i}),S_{i})\ \ \text{ and }\ \ L_{S}(g)=L_{n}(g)+\ell_{\alpha}(g(X_{n+1}),S).

Assume for the sake of contradiction that S<MS<M, Sg^S(Xn+1)S\leq\hat{g}_{S}(X_{n+1}), and g^M(Xn+1)<S\hat{g}_{M}(X_{n+1})<S. To obtain a contradiction, we claim that it is sufficient to prove that

LS(g^M)LS(g^S)LM(g^M)LM(g^S).\displaystyle L_{S}(\hat{g}_{M})-L_{S}(\hat{g}_{S})\leq L_{M}(\hat{g}_{M})-L_{M}(\hat{g}_{S}). (A.1)

To see why, note that since g^M\hat{g}_{M} is a global optimum of gLM(g)+(n+1)(g)g\ \mapsto L_{M}(g)+(n+1)\mathcal{R}(g), we must have that

LM(g^M)+(n+1)(g^M)\displaystyle L_{M}(\hat{g}_{M})+(n+1)\mathcal{R}(\hat{g}_{M}) LM(g^S)+(n+1)(g^S).\displaystyle\leq L_{M}(\hat{g}_{S})+(n+1)\mathcal{R}(\hat{g}_{S}).

Rearranging this and applying (A.1) gives the inequality

(n+1)(g^S)(n+1)(g^M)\displaystyle(n+1)\mathcal{R}(\hat{g}_{S})-(n+1)\mathcal{R}(\hat{g}_{M}) LM(g^M)LM(g^S)\displaystyle\geq L_{M}(\hat{g}_{M})-L_{M}(\hat{g}_{S})
LS(g^M)LS(g^S),\displaystyle\geq L_{S}(\hat{g}_{M})-L_{S}(\hat{g}_{S}),

or equivalently,

LS(g^M)+(n+1)(g^M)\displaystyle L_{S}(\hat{g}_{M})+(n+1)\mathcal{R}(\hat{g}_{M}) LS(g^S)+(n+1)(g^S).\displaystyle\leq L_{S}(\hat{g}_{S})+(n+1)\mathcal{R}(\hat{g}_{S}).

Since \hat{g}_{S} is the unique min-norm minimizer of g\mapsto L_{S}(g)+(n+1)\mathcal{R}(g), this implies that \|\hat{g}_{S}\|_{\mathcal{F}}<\|\hat{g}_{M}\|_{\mathcal{F}}.

Now, by a completely symmetric argument reversing the roles of g^M\hat{g}_{M} and g^S\hat{g}_{S} we also have that

LM(g^S)+(n+1)(g^S)LM(g^M)+(n+1)(g^M),L_{M}(\hat{g}_{S})+(n+1)\mathcal{R}(\hat{g}_{S})\leq L_{M}(\hat{g}_{M})+(n+1)\mathcal{R}(\hat{g}_{M}),

which by identical reasoning implies that g^M<g^S\|\hat{g}_{M}\|_{\mathcal{F}}<\|\hat{g}_{S}\|_{\mathcal{F}}. Thus, we have arrived at our desired contradiction.

To prove (A.1) we break into two cases.

Case 1:

g^M(Xn+1)<Sg^S(Xn+1)M\hat{g}_{M}(X_{n+1})<S\leq\hat{g}_{S}(X_{n+1})\leq M.

LS(g^M)LS(g^S)\displaystyle L_{S}(\hat{g}_{M})-L_{S}(\hat{g}_{S}) =(1α)(Sg^M(Xn+1))α(g^S(Xn+1)S)+Ln(g^M)Ln(g^S)\displaystyle=(1-\alpha)(S-\hat{g}_{M}(X_{n+1}))-\alpha(\hat{g}_{S}(X_{n+1})-S)+L_{n}(\hat{g}_{M})-L_{n}(\hat{g}_{S})
=(1α)(g^S(Xn+1)g^M(Xn+1))+Sg^S(Xn+1)+Ln(g^M)Ln(g^S)\displaystyle=(1-\alpha)(\hat{g}_{S}(X_{n+1})-\hat{g}_{M}(X_{n+1}))+S-\hat{g}_{S}(X_{n+1})+L_{n}(\hat{g}_{M})-L_{n}(\hat{g}_{S})
(1α)(g^S(Xn+1)g^M(Xn+1))+Ln(g^M)Ln(g^S)\displaystyle\leq(1-\alpha)(\hat{g}_{S}(X_{n+1})-\hat{g}_{M}(X_{n+1}))+L_{n}(\hat{g}_{M})-L_{n}(\hat{g}_{S})
=LM(g^M)LM(g^S)\displaystyle=L_{M}(\hat{g}_{M})-L_{M}(\hat{g}_{S})
Case 2:

g^M(Xn+1)<S<Mg^S(Xn+1)\hat{g}_{M}(X_{n+1})<S<M\leq\hat{g}_{S}(X_{n+1}).

LM(g^M)LM(g^S)\displaystyle L_{M}(\hat{g}_{M})-L_{M}(\hat{g}_{S}) =(1α)(Mg^M(Xn+1))α(g^S(Xn+1)M)+Ln(g^M)Ln(g^S)\displaystyle=(1-\alpha)(M-\hat{g}_{M}(X_{n+1}))-\alpha(\hat{g}_{S}(X_{n+1})-M)+L_{n}(\hat{g}_{M})-L_{n}(\hat{g}_{S})
=α(g^M(Xn+1)g^S(Xn+1))+Mg^M(Xn+1)+Ln(g^M)Ln(g^S)\displaystyle=\alpha(\hat{g}_{M}(X_{n+1})-\hat{g}_{S}(X_{n+1}))+M-\hat{g}_{M}(X_{n+1})+L_{n}(\hat{g}_{M})-L_{n}(\hat{g}_{S})
>α(g^M(Xn+1)g^S(Xn+1))+Sg^M(Xn+1)+Ln(g^M)Ln(g^S)\displaystyle>\alpha(\hat{g}_{M}(X_{n+1})-\hat{g}_{S}(X_{n+1}))+S-\hat{g}_{M}(X_{n+1})+L_{n}(\hat{g}_{M})-L_{n}(\hat{g}_{S})
=(1α)(Sg^M(Xn+1))α(g^S(Xn+1)S)+Ln(g^M)Ln(g^S)\displaystyle=(1-\alpha)(S-\hat{g}_{M}(X_{n+1}))-\alpha(\hat{g}_{S}(X_{n+1})-S)+L_{n}(\hat{g}_{M})-L_{n}(\hat{g}_{S})
=LS(g^M)LS(g^S).\displaystyle=L_{S}(\hat{g}_{M})-L_{S}(\hat{g}_{S}).
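To make the construction of \hat{C}_{\text{con}}(X_{n+1},M) concrete, here is a minimal Python sketch for a linear-in-\Phi function class. It uses scikit-learn's QuantileRegressor as a generic pinball-loss solver (any quantile regression routine would do), and treats the feature map phi, the conformity score score, the candidate grid y_grid, and the upper bound M as user-supplied inputs. It is a sketch under those assumptions rather than our reference implementation, and it makes no attempt to select the min-norm solution discussed above.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

def conservative_set(phi, score, X_cal, S_cal, x_test, y_grid, M, alpha=0.1):
    """One-shot conservative prediction set C_con(x_test, M).

    Fits a single (unregularized) pinball regression of the calibration scores
    on phi(X), with the test covariate appended at the imputed score M, and
    keeps every candidate y whose score lies below the fitted quantile.
    """
    Phi_aug = np.vstack([phi(X_cal), phi(x_test[None, :])])   # augmented design
    S_aug = np.concatenate([S_cal, [M]])                      # scores with M imputed

    # quantile level 1 - alpha matches the pinball loss ell_alpha used above
    qr = QuantileRegressor(quantile=1 - alpha, alpha=0.0, fit_intercept=False)
    qr.fit(Phi_aug, S_aug)
    g_M = qr.predict(phi(x_test[None, :]))[0]                 # \hat{g}_M(x_test)

    return [y for y in y_grid if score(x_test, y) <= g_M]
```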

A.2 Additional experiments on the Rxrx1 data

As an alternative to estimating the probabilities of experimental membership, here we consider constructing a feature representation for the Rxrx1 data using principal component analysis. Namely, we implement the linear quantile regression method of Section 2 using the top principal components of the feature layer of the neural network as input. We choose the number of principal components for this analysis to be 70 based on a visual inspection of a scree plot (Figure 15). All other steps of this experiment are kept identical to the procedure described in Section 5.3.
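For reference, the feature construction just described can be reproduced with a few lines of Python. The sketch below assumes that features is an array holding the penultimate-layer activations of the pretrained network; the variable names are illustrative, and the output Z is what is fed into the linear quantile regression method of Section 2.

```python
from sklearn.decomposition import PCA

def pca_features(features, n_components=70):
    """Project the network's feature layer onto its top principal components."""
    pca = PCA(n_components=n_components)
    Z = pca.fit_transform(features)   # covariates for the linear quantile regression
    # pca.explained_variance_ratio_ reproduces the scree plot used to choose 70,
    # and pca.transform maps held-out feature vectors into the same coordinates.
    return pca, Z
```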

Similar to the results of Section 5.3, we find that the empirical coverage of this method is close to the target level across all cell types and experiments (Figure 16).

Figure 15: Scree plot for the singular value decomposition of the feature layer of the pretrained neural network. The horizontal line indicates the value of the 70th largest component.
Figure 16: Empirical conditional miscoverage of the unrandomized version of our method (orange) and split conformal (blue) across cell types and experiments when principal component analysis is used to construct the features. Red lines indicate the target level of α=0.1\alpha=0.1 and black error bars show 95% binomial confidence intervals for the calibration-conditional miscoverage (Yn+1C^(Xn+1)|Dtrain,Dcal)\mathbb{P}(Y_{n+1}\notin\hat{C}(X_{n+1})|D_{\text{train}},D_{\text{cal}}), where DtrainD_{\text{train}} and DcalD_{\text{cal}} denote the training dataset used to learn the feature representation and the calibration dataset used to implement our method, respectively.

A.3 Proofs of the main coverage guarantees

In this section we prove the top-level coverage guarantees of our method. We begin by proving our most general result, Theorem 3, which considers a generic function class \mathcal{F} and penalty \mathcal{R}. Then, by restricting the choices of \mathcal{F} and \mathcal{R}, we obtain Theorem 2 and Corollary 1 as special cases.

Proof of Theorem 3.

We begin by examining the first order conditions of the convex optimization problem (3.1). Namely, since g^Sn+1\hat{g}_{S_{n+1}} is a minimizer of

g1n+1i=1n+1α(g(Xi),Si)+(g)g\mapsto\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(g(X_{i}),S_{i})+\mathcal{R}(g)

we must have that for any fixed ff\in\mathcal{F},

0ϵ(1n+1i=1n+1α(g^Sn+1(Xi)+ϵf(Xi),Si)+(g^Sn+1+ϵf))|ϵ=0.0\in\partial_{\epsilon}\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(\hat{g}_{S_{n+1}}(X_{i})+\epsilon f(X_{i}),S_{i})+\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\right)\bigg{|}_{\epsilon=0}.

By a straightforward computation, the subgradients of the pinball loss are given by

\partial_{\epsilon}\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(\hat{g}_{S_{n+1}}(X_{i})+\epsilon f(X_{i}),S_{i})\right)\\ =\left\{\frac{1}{n+1}\left(\sum_{S_{i}\neq\hat{g}_{S_{n+1}}(X_{i})}f(X_{i})\left(\alpha-\mathbbm{1}\left\{S_{i}>\hat{g}_{S_{n+1}}(X_{i})\right\}\right)+\sum_{S_{i}=\hat{g}_{S_{n+1}}(X_{i})}s_{i}f(X_{i})\right)\,\biggr{\lvert}\,s_{i}\in[\alpha-1,\alpha]\right\}.

Let si[α1,α]s_{i}^{*}\in[\alpha-1,\alpha] be the values setting the subgradient to 0. Rearranging, we obtain

\frac{1}{n+1}\sum_{i=1}^{n+1}f(X_{i})\left(\alpha-\mathbbm{1}\left\{S_{i}>\hat{g}_{S_{n+1}}(X_{i})\right\}\right)=\frac{1}{n+1}\sum_{S_{i}=\hat{g}_{S_{n+1}}(X_{i})}(\alpha-s^{*}_{i})f(X_{i})-\frac{d}{d\epsilon}R\left(\hat{g}_{S_{n+1}}+\epsilon f\right)\bigg{|}_{\epsilon=0}.

Our desired result now follows from a few observations. First, the left-hand side above can be related to our desired coverage guarantee through the equation,

𝔼[f(Xn+1)(𝟙{YC^n+1}(1α))]\displaystyle\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y\in\hat{C}_{n+1}\}-(1-\alpha))] =𝔼[f(Xn+1)(α𝟙{YC^n+1})]\displaystyle=\mathbb{E}[f(X_{n+1})(\alpha-\mathbbm{1}\{Y\notin\hat{C}_{n+1}\})]
=𝔼[f(Xn+1)(α𝟙{Sn+1>g^Sn+1(Xn+1)})].\displaystyle=\mathbb{E}[f(X_{n+1})(\alpha-\mathbbm{1}\{S_{n+1}>\hat{g}_{S_{n+1}}(X_{n+1})\})].

Moreover, since g^Sn+1\hat{g}_{S_{n+1}} is fit symmetrically, i.e., is invariant to permutations of the input data, we have that {(f(Xi),g^Sn+1(Xi),Si)}i=1n+1\{(f(X_{i}),\hat{g}_{S_{n+1}}(X_{i}),S_{i})\}_{i=1}^{n+1} are exchangeable. Thus, we additionally have that

𝔼[f(Xn+1)(α𝟙{Sn+1>g^Sn+1(Xn+1)})]=𝔼[1n+1i=1n+1f(Xi)(α𝟙{Si>g^Sn+1(Xi)})]=𝔼[1n+1i=1n+1(αsi)f(Xi)𝟙{Si=g^Sn+1(Xi)}]𝔼[ddϵR(g^Sn+1+ϵf)|ϵ=0].\begin{split}\mathbb{E}[f(X_{n+1})(\alpha-&\mathbbm{1}\{S_{n+1}>\hat{g}_{S_{n+1}}(X_{n+1})\})]=\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}f(X_{i})\left(\alpha-\mathbbm{1}\left\{S_{i}>\hat{g}_{S_{n+1}}(X_{i})\right\}\right)\right]\\ &=\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}(\alpha-s_{i}^{*})f(X_{i})\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}\right]-\mathbb{E}\left[\frac{d}{d\epsilon}R\left(\hat{g}_{S_{n+1}}+\epsilon f\right)\bigg{|}_{\epsilon=0}\right].\end{split} (A.2)

Finally, since αsi[0,1]\alpha-s_{i}^{*}\in[0,1], we can bound the first term as

|𝔼[1n+1i=1n+1(αsi)f(Xi)𝟙{Si=g^Sn+1(Xi)}]|\displaystyle\left|\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}(\alpha-s_{i}^{*})f(X_{i})\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}\right]\right| 𝔼[1n+1i=1n+1|f(Xi)|𝟙{Si=g^Sn+1(Xi)}]\displaystyle\leq\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}\right]
=𝔼[|f(Xi)|𝟙{Si=g^Sn+1(Xi)}],\displaystyle=\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}],

where the last step follows by exchangeability. This proves the second claim of Theorem 3.

To get the first claim of Theorem 3, note that when ff is non-negative we can lower bound 1n+1i=1n+1(αsi)f(Xi)𝟙{Si=g^Sn+1(Xi)}\frac{1}{n+1}\sum_{i=1}^{n+1}(\alpha-s_{i}^{*})f(X_{i})\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\} by 0. Plugging this into (A.2) gives us the desired inequality,

𝔼[f(Xn+1)(𝟙{YC^n+1}(1α))]𝔼[ddϵR(g^Sn+1+ϵf)|ϵ=0].\displaystyle\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y\in\hat{C}_{n+1}\}-(1-\alpha))]\geq-\mathbb{E}\left[\frac{d}{d\epsilon}R\left(\hat{g}_{S_{n+1}}+\epsilon f\right)\bigg{|}_{\epsilon=0}\right].

With the proof of Theorem 3 in hand, we are now ready to prove the special cases stated in Theorem 2 and Corollary 1.

Proof of Theorem 2.

The first statement of Theorem 2 follows immediately from the first statement of Theorem 3 in the special case where ()=0\mathcal{R}(\cdot)=0.

To get the second statement of Theorem 2 note that the second statement of Theorem 3 tells us that for any ff\in\mathcal{F},

𝔼[f(Xn+1)(𝟙{Yn+1C^(Xn+1)}(1α))]𝔼[|f(Xi)|𝟙{Si=g^Sn+1(Xi)}].\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha))]\leq\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}].

So, to complete the proof we just need to show that when the distribution of SXS\mid X is continuous

\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}]\leq\frac{d}{n+1}\mathbb{E}\left[\max_{1\leq i\leq n+1}|f(X_{i})|\right].

To do this, first note that by the exchangeability of {(f(Xi),g^Sn+1(Xi),Si)}i=1n+1\{(f(X_{i}),\hat{g}_{S_{n+1}}(X_{i}),S_{i})\}_{i=1}^{n+1} we have

𝔼[|f(Xi)|𝟙{Si=g^Sn+1(Xi)}]\displaystyle\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}] =𝔼[1n+1i=1n+1|f(Xi)|𝟙{Si=g^Sn+1(Xi)}]\displaystyle=\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}\right]
𝔼[(max1jn+1|f(Xj)|)1n+1i=1n+1𝟙{Si=g^Sn+1(Xi)}].\displaystyle\leq\mathbb{E}\left[\left(\max_{1\leq j\leq n+1}|f(X_{j})|\right)\cdot\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}\right].

Moreover, recalling that g^Sn+1(Xi)=Φ(Xi)β^\hat{g}_{S_{n+1}}(X_{i})=\Phi(X_{i})^{\top}\hat{\beta} for some β^d\hat{\beta}\in\mathbb{R}^{d} we additionally have that

\displaystyle\mathbb{P}\left(\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}>d\mid X_{1},\dots,X_{n+1}\right)
=( 1j1<<jd+1n+1 such that  1id+1,Sji=g^Sn+1(Xji)X1,,Xn+1)\displaystyle=\mathbb{P}\left(\exists\,1\leq j_{1}<\dots<j_{d+1}\leq n+1\text{ such that }\forall\,1\leq i\leq d+1,\ S_{j_{i}}=\hat{g}_{S_{n+1}}(X_{j_{i}})\mid X_{1},\dots,X_{n+1}\right)
1j1<<jd+1n+1(βd,such that  1id+1,Sji=Φ(Xji)βX1,,Xn+1)\displaystyle\leq\sum_{1\leq j_{1}<\dots<j_{d+1}\leq n+1}\mathbb{P}\left(\exists\,\beta\in\mathbb{R}^{d},\ \text{such that }\forall\,1\leq i\leq d+1,\ S_{j_{i}}=\Phi(X_{j_{i}})^{\top}\beta\mid X_{1},\dots,X_{n+1}\right)
1j1<<jd+1n+1((Sj1,,Sjd+1)RowSpace([Φ(Xj1)||Φ(Xjd+1)])X1,,Xn+1)\displaystyle\leq\sum_{1\leq j_{1}<\dots<j_{d+1}\leq n+1}\mathbb{P}\left((S_{j_{1}},\dots,S_{j_{d+1}})\in\text{RowSpace}([\Phi(X_{j_{1}})|\dots|\Phi(X_{j_{d+1}})])\mid X_{1},\dots,X_{n+1}\right)
=0,\displaystyle=0,

where the last line follows from the fact that (S_{j_{1}},\dots,S_{j_{d+1}})\mid X_{1},\dots,X_{n+1} are independent and continuously distributed and \text{RowSpace}([\Phi(X_{j_{1}})|\dots|\Phi(X_{j_{d+1}})]) is an at most d-dimensional subspace of \mathbb{R}^{d+1}. From this, we conclude that with probability 1,

1n+1i=1n+1𝟙{Si=g^Sn+1(Xi)}dn+1,\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}\leq\frac{d}{n+1},

and plugging this into our previous calculation we arrive at the desired inequality

𝔼[|f(Xi)|𝟙{Si=g^Sn+1(Xi)}]\displaystyle\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}] 𝔼[(max1jn+1|f(Xj)|)1n+1i=1n+1𝟙{Si=g^Sn+1(Xi)}]\displaystyle\leq\mathbb{E}\left[\left(\max_{1\leq j\leq n+1}|f(X_{j})|\right)\cdot\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}\right]
dn+1𝔼[max1in+1|f(Xi)|].\displaystyle\leq\frac{d}{n+1}\mathbb{E}\left[\max_{1\leq i\leq n+1}|f(X_{i})|\right].
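The combinatorial fact driving this bound, namely that an unregularized quantile regression over a d-dimensional basis interpolates at most d of the points when the scores are continuously distributed, is easy to check numerically. The sketch below is illustrative only (simulated data, scikit-learn's QuantileRegressor as the pinball solver, and an arbitrary tolerance for detecting interpolation); it is not part of the proof.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(0)
n, d, alpha = 500, 5, 0.1
Phi = rng.normal(size=(n, d))                        # basis expansion Phi(X_i)
S = Phi @ rng.normal(size=d) + rng.normal(size=n)    # continuously distributed scores

qr = QuantileRegressor(quantile=1 - alpha, alpha=0.0, fit_intercept=False)
qr.fit(Phi, S)
n_interp = int(np.sum(np.isclose(S, qr.predict(Phi), atol=1e-6)))
print(n_interp, "interpolated points; the argument above says this is at most", d)
```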

Proof of Corollary 1.

This follows immediately by applying Theorem 2 in the special case where ={G𝒢βG𝟙{XG}:βG}\mathcal{F}=\{\sum_{G\in\mathcal{G}}\beta_{G}\mathbbm{1}\{X\in G\}:\beta_{G}\in\mathbb{R}\}. ∎
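To unpack what this gives in the group setting: for any G\in\mathcal{G} with \mathbb{P}(X_{n+1}\in G)>0, the function f(x)=\mathbbm{1}\{x\in G\} is a non-negative element of \mathcal{F}, so the inequality just established (the first statement of Theorem 3 with \mathcal{R}\equiv 0) reads

\mathbb{E}\left[\mathbbm{1}\{X_{n+1}\in G\}\left(\mathbbm{1}\{Y_{n+1}\in\hat{C}(X_{n+1})\}-(1-\alpha)\right)\right]\geq 0,\quad\text{i.e.,}\quad\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1})\mid X_{n+1}\in G)\geq 1-\alpha,

which is precisely coverage over the group G.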

A.4 Proofs for RKHS functions

In this section we prove Propositions 1 and 2. Throughout both proofs we will let \kappa^{2}:=\sup_{x}K(x,x) denote our upper bound on the kernel and (when applicable) C_{S\mid X}:=\sup_{x,s}p_{S_{i}|X_{i}=x}(s) denote our upper bound on the density of S_{i}|X_{i}. Moreover, we will assume that the data satisfy the following set of moment conditions.

Assumption 1 (Moment conditions for RKHS bounds).

There exist constants C_{3},C_{2},c_{2},C_{S},C_{f},\rho>0 such that

𝔼[Φ(Xi)22]C2d,supf𝔼[|f(Xi)|Φ(Xi)22]C3𝔼[|f(X)|]d,supβ:β2=1𝔼[|Φ(Xi)β|2]c2,\displaystyle\mathbb{E}[\|\Phi(X_{i})\|^{2}_{2}]\leq C_{2}d,\ \sup_{f\in\mathcal{F}}\mathbb{E}[|f(X_{i})|\cdot\|\Phi(X_{i})\|_{2}^{2}]\leq C_{3}\mathbb{E}[|f(X)|]d,\ \sup_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|^{2}]\leq c_{2},
supf𝔼[|f(Xi)|Si2]CS𝔼[|f(Xi)|],supf𝔼[|f(Xi)|2]Cf𝔼[|f(Xi)|], and infβ:β2=1𝔼[|Φ(Xi)β|]ρ.\displaystyle\sup_{f\in\mathcal{F}}\mathbb{E}[|f(X_{i})|S_{i}^{2}]\leq C_{S}\mathbb{E}[|f(X_{i})|],\ \sup_{f\in\mathcal{F}}\sqrt{\mathbb{E}[|f(X_{i})|^{2}]}\leq C_{f}\mathbb{E}[|f(X_{i})|],\text{ and }\inf_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|]\geq\rho.

Furthermore, we also have that 𝔼[|Si|2]<\mathbb{E}[|S_{i}|^{2}]<\infty.

A.4.1 Proof of Proposition 1

Our main idea is to exploit the stability of RKHS regression. We will do this using two main lemmas. The first lemma is a canonical stability result first proven in Bousquet and Elisseeff (2002) that bounds the sensitivity of the RKHS fit to changes of a single data point. While this result is quite powerful, it is not sufficient for our context because it does not account for the extra linear term Φ(Xi)β\Phi(X_{i})^{\top}\beta. Thus, we will also develop a second lemma that controls the stability of the fit to changes in β\beta.

When formalizing these ideas it will be useful to have some additional notation that explicitly separates the dependence of the fit on \beta from the dependence of the fit on the data. Let

g^β:=argmingKK1n+1i=1n+1α(gK(Xi)+Φ(Xi)β,Si)+λgKK2,\hat{g}_{\beta}:=\underset{g_{K}\in\mathcal{F}_{K}}{\text{argmin}}\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(g_{K}(X_{i})+\Phi(X_{i})^{\top}\beta,S_{i})+\lambda\|g_{K}\|_{K}^{2},

denote the result of fitting the RKHS part of the function class with βd\beta\in\mathbb{R}^{d} held fixed. Additionally, let {(X~i,S~i)}i=1n+1\{(\tilde{X}_{i},\tilde{S}_{i})\}_{i=1}^{n+1} denote an independent copy of {(Xi,Si)}i=1n+1\{({X}_{i},{S}_{i})\}_{i=1}^{n+1} and for any A{1,,n+1}A\subseteq\{1,\dots,n+1\} define

g^Aβ:=argmingKK1n+1iAα(gK(Xi)+Φ(Xi)β,Si)+1n+1iAα(gK(X~i)+Φ(X~i)β,S~i)+λgKK2,\hat{g}^{-A}_{\beta}:=\underset{g_{K}\in\mathcal{F}_{K}}{\text{argmin}}\frac{1}{n+1}\sum_{i\notin A}\ell_{\alpha}(g_{K}(X_{i})+\Phi(X_{i})^{\top}\beta,S_{i})+\frac{1}{n+1}\sum_{i\in A}\ell_{\alpha}(g_{K}(\tilde{X}_{i})+\Phi(\tilde{X}_{i})^{\top}\beta,\tilde{S}_{i})+\lambda\|g_{K}\|_{K}^{2},

to be the leave-A-out version of \hat{g}_{\beta} obtained by swapping out \{(X_{i},S_{i})\}_{i\in A} for \{(\tilde{X}_{i},\tilde{S}_{i})\}_{i\in A}. Our first lemma bounds the difference between \hat{g}_{\beta} and \hat{g}^{-A}_{\beta}.

Lemma 1.

Assume that \sup_{x}K(x,x)=\kappa^{2}<\infty. Then, for any two datasets \{(X_{i},S_{i})\}_{i=1}^{n+1} and \{(\tilde{X}_{i},\tilde{S}_{i})\}_{i=1}^{n+1},

g^βg^Aβκ2|A|2λ(n+1).\|\hat{g}_{\beta}-\hat{g}^{-A}_{\beta}\|_{\infty}\leq\frac{\kappa^{2}|A|}{2\lambda(n+1)}.
Proof.

By a straightforward calculation one can easily check that α(gK(Xi)+Φ(Xi)β,Si)\ell_{\alpha}(g_{K}(X_{i})+\Phi(X_{i})^{\top}\beta,S_{i}) is a 1-Lipschitz function of SigK(Xi)Φ(Xi)βS_{i}-g_{K}(X_{i})-\Phi(X_{i})^{\top}\beta (see Lemma 11 for details). Thus, we may apply Theorem 22 of Bousquet and Elisseeff (2002) to conclude that

g^βg^AβKκ|A|2λ(n+1).\|\hat{g}_{\beta}-\hat{g}^{-A}_{\beta}\|_{K}\leq\frac{\kappa|A|}{2\lambda(n+1)}.

Then by applying the reproducing property of the RKHS and our bound on the kernel we arrive at the desired inequality,

g^βg^Aβ\displaystyle\|\hat{g}_{\beta}-\hat{g}^{-A}_{\beta}\|_{\infty} =supx|g^βg^Aβ,K(x,)|supxg^βg^AβKK(x,)K\displaystyle=\sup_{x}|\langle\hat{g}_{\beta}-\hat{g}^{-A}_{\beta},K(x,\cdot)\rangle|\leq\sup_{x}\|\hat{g}_{\beta}-\hat{g}^{-A}_{\beta}\|_{K}\|K(x,\cdot)\|_{K}
=g^βg^AβKsupxK(x,x)1/2κ2|A|2λ(n+1).\displaystyle=\|\hat{g}_{\beta}-\hat{g}^{-A}_{\beta}\|_{K}\sup_{x}K(x,x)^{1/2}\leq\frac{\kappa^{2}|A|}{2\lambda(n+1)}.

Our second lemma bounds the stability of the fit in β\beta.

Lemma 2.

Assume that \sup_{x}K(x,x)=\kappa^{2}<\infty. Then for any dataset \{(X_{i},S_{i})\}_{i=1}^{n+1},

g^β1g^β24κ2λ1n+1i=1n+1|Φ(Xi)(β1β2)|,β1,β2d.\|\hat{g}_{\beta_{1}}-\hat{g}_{\beta_{2}}\|_{\infty}\leq\sqrt{\frac{4\kappa^{2}}{\lambda}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}(\beta_{1}-\beta_{2})|},\quad\forall\beta_{1},\beta_{2}\in\mathbb{R}^{d}.
Proof.

The proof of this lemma is quite similar to the proof of Theorem 22 in Bousquet and Elisseeff (2002). For ease of notation let

Lβ(gK):=1n+1i=1n+1α(gK(Xi)+Φ(Xi)β,Si).L_{\beta}(g_{K}):=\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(g_{K}(X_{i})+\Phi(X_{i})^{\top}\beta,S_{i}).

By the optimality of g^β1\hat{g}_{\beta_{1}} and g^β2\hat{g}_{\beta_{2}} we have

Lβ1(g^β1)+λg^β1K2Lβ1(12g^β1+12g^β2)+λ12g^β1+12g^β2K2,\displaystyle L_{\beta_{1}}(\hat{g}_{\beta_{1}})+\lambda\|\hat{g}_{\beta_{1}}\|_{K}^{2}\leq L_{\beta_{1}}\left(\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right)+\lambda\left\|\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right\|_{K}^{2},
and Lβ2(g^β2)+λg^β2K2Lβ2(12g^β1+12g^β2)+λ12g^β1+12g^β2K2.\displaystyle L_{\beta_{2}}(\hat{g}_{\beta_{2}})+\lambda\|\hat{g}_{\beta_{2}}\|_{K}^{2}\leq L_{\beta_{2}}\left(\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right)+\lambda\left\|\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right\|_{K}^{2}.

Moreover, by the convexity of Lβ1()L_{\beta_{1}}(\cdot) and Lβ2()L_{\beta_{2}}(\cdot) it also holds that

Lβ1(12g^β1+12g^β2)12Lβ1(g^β1)+12Lβ1(g^β2),\displaystyle L_{\beta_{1}}\left(\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right)\leq\frac{1}{2}L_{\beta_{1}}(\hat{g}_{\beta_{1}})+\frac{1}{2}L_{\beta_{1}}(\hat{g}_{\beta_{2}}),
and Lβ2(12g^β1+12g^β2)12Lβ2(g^β1)+12Lβ2(g^β2).\displaystyle L_{\beta_{2}}\left(\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right)\leq\frac{1}{2}L_{\beta_{2}}(\hat{g}_{\beta_{1}})+\frac{1}{2}L_{\beta_{2}}(\hat{g}_{\beta_{2}}).

Putting all four of these inequalities together we find that

λ2g^β1g^β22K\displaystyle\frac{\lambda}{2}\left\|\hat{g}_{\beta_{1}}-\hat{g}_{\beta_{2}}\right\|^{2}_{K} =λg^β1K2+λg^β2K22λ12g^β1+12g^β2K2\displaystyle=\lambda\|\hat{g}_{\beta_{1}}\|_{K}^{2}+\lambda\|\hat{g}_{\beta_{2}}\|_{K}^{2}-2\lambda\left\|\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right\|_{K}^{2}
Lβ1(12g^β1+12g^β2)+Lβ2(12g^β1+12g^β2)Lβ1(g^β1)Lβ2(g^β2)\displaystyle\leq L_{\beta_{1}}\left(\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right)+L_{\beta_{2}}\left(\frac{1}{2}\hat{g}_{\beta_{1}}+\frac{1}{2}\hat{g}_{\beta_{2}}\right)-L_{\beta_{1}}(\hat{g}_{\beta_{1}})-L_{\beta_{2}}(\hat{g}_{\beta_{2}})
12Lβ1(g^β1)+12Lβ1(g^β2)+12Lβ2(g^β1)+12Lβ2(g^β2)Lβ1(g^β1)Lβ2(g^β2)\displaystyle\leq\frac{1}{2}L_{\beta_{1}}(\hat{g}_{\beta_{1}})+\frac{1}{2}L_{\beta_{1}}(\hat{g}_{\beta_{2}})+\frac{1}{2}L_{\beta_{2}}(\hat{g}_{\beta_{1}})+\frac{1}{2}L_{\beta_{2}}(\hat{g}_{\beta_{2}})-L_{\beta_{1}}(\hat{g}_{\beta_{1}})-L_{\beta_{2}}(\hat{g}_{\beta_{2}})
=12(Lβ2(g^β1)Lβ1(g^β1)+Lβ1(g^β2)Lβ2(g^β2))\displaystyle=\frac{1}{2}\left(L_{\beta_{2}}(\hat{g}_{\beta_{1}})-L_{\beta_{1}}(\hat{g}_{\beta_{1}})+L_{\beta_{1}}(\hat{g}_{\beta_{2}})-L_{\beta_{2}}(\hat{g}_{\beta_{2}})\right)
1n+1i=1n+1|Φ(Xi)(β1β2)|,\displaystyle\leq\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}(\beta_{1}-\beta_{2})|,

where the last inequality follows from the Lipschitz property of α(,)\ell_{\alpha}(\cdot,\cdot) (see Lemma 11). To conclude the proof one simply notes that by the reproducing property of the RKHS we have that

g^β1g^β2κg^β1g^β2K4κ2λ1n+1i=1n+1|Φ(Xi)(β1β2)|,\|\hat{g}_{\beta_{1}}-\hat{g}_{\beta_{2}}\|_{\infty}\leq\kappa\|\hat{g}_{\beta_{1}}-\hat{g}_{\beta_{2}}\|_{K}\leq\sqrt{\frac{4\kappa^{2}}{\lambda}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}(\beta_{1}-\beta_{2})|},

as desired. ∎

In order to apply this lemma we will need to control the size of |\Phi(X_{i})^{\top}(\beta_{1}-\beta_{2})|. This is done in our next result. The statement of this lemma may look somewhat peculiar due to the presence of a re-weighting function f\in\mathcal{F}. To aid intuition it may be useful to keep in mind the special case f=1, which turns the expectation below into a simple tail probability. Our reason for stating the lemma in this form is that it will fit seamlessly into our later calculations without the need for additional exposition.

Lemma 3.

Assume that X_{1},\dots,X_{n+1}\stackrel{i.i.d.}{\sim}P_{X}. Let f:\mathcal{X}\to\mathbb{R} and assume that there exist constants C_{2},C_{3}\geq 1 such that \mathbb{E}[\|\Phi(X_{i})\|_{2}^{2}]\leq C_{2}d and \mathbb{E}[|f(X_{i})|\cdot\|\Phi(X_{i})\|_{2}^{2}]\leq C_{3}\mathbb{E}[|f(X)|]d. Then for any \epsilon>0 and 1\leq j\leq n+1,

𝔼[|f(Xj)|𝟙{supβ:βϵ1n+1i=1n+1|Φ(Xi)β|>2ϵC2d}]O(𝔼[|f(X)|]n).\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\sup_{\beta:\|\beta\|\leq\epsilon}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|>2\epsilon\sqrt{C_{2}d}\right\}\right]\leq O\left(\frac{\mathbb{E}[|f(X)|]}{n}\right).
Proof.

By the Cauchy–Schwarz inequality we have

supβ:βϵ1n+1i=1n+1|Φ(Xi)β|supβ:βϵ1n+1i=1n+1Φ(Xi)2β2=ϵn+1i=1n+1Φ(Xi)2.\sup_{\beta:\|\beta\|\leq\epsilon}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|\leq\sup_{\beta:\|\beta\|\leq\epsilon}\frac{1}{n+1}\sum_{i=1}^{n+1}\|\Phi(X_{i})\|_{2}\|\beta\|_{2}=\frac{\epsilon}{n+1}\sum_{i=1}^{n+1}\|\Phi(X_{i})\|_{2}.

Additionally, by Jensen’s inequality it also holds that \mathbb{E}[\|\Phi(X_{i})\|_{2}]\leq\sqrt{\mathbb{E}[\|\Phi(X_{i})\|^{2}_{2}]}\leq\sqrt{C_{2}d}. So, putting these two inequalities together we arrive at

𝔼[|f(Xj)|𝟙{supβ:βϵ1n+1i=1n+1|Φ(Xi)β|>2ϵC2d}]\displaystyle\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\sup_{\beta:\|\beta\|\leq\epsilon}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|>2\epsilon\sqrt{C_{2}d}\right\}\right]
𝔼[|f(Xj)|𝟙{1n+1i=1n+1Φ(Xi)2𝔼[Φ(Xi)2]>C2d}]\displaystyle\leq\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\frac{1}{n+1}\sum_{i=1}^{n+1}\|\Phi(X_{i})\|_{2}-\mathbb{E}[\|\Phi(X_{i})\|_{2}]>\sqrt{C_{2}d}\right\}\right]
1C2d𝔼[|f(Xj)|(1n+1i=1n+1Φ(Xi)2𝔼[Φ(Xi)2])2]=O(𝔼[|f(X)|]n).\displaystyle\leq\frac{1}{C_{2}d}\mathbb{E}\left[|f(X_{j})|\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\|\Phi(X_{i})\|_{2}-\mathbb{E}[\|\Phi(X_{i})\|_{2}]\right)^{2}\right]=O\left(\frac{\mathbb{E}[|f(X)|]}{n}\right).
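In the special case f\equiv 1 mentioned before the lemma, the conclusion reduces to the tail bound

\mathbb{P}\left(\sup_{\beta:\|\beta\|\leq\epsilon}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|>2\epsilon\sqrt{C_{2}d}\right)\leq O\left(\frac{1}{n}\right),

so with probability 1-O(1/n) every coefficient vector in the \epsilon-ball induces an average linear effect of size at most 2\epsilon\sqrt{C_{2}d}.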

The final preliminary lemmas that we will require control the maximum possible sizes of \hat{g}_{S_{n+1},K} and \hat{\beta}_{S_{n+1}}. Once again these lemmas involve re-weighting functions, whose purpose is to ease our calculations further on.

Lemma 4.

It holds deterministically that

g^Sn+1,KK1λ1n+1i=1n+1|Si|.\|\hat{g}_{S_{n+1},K}\|_{K}\leq\frac{1}{\sqrt{\lambda}}\sqrt{\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|}.

If in addition, (X1,S1),,(Xn+1,Sn+1)i.i.d.P(X_{1},S_{1}),\dots,(X_{n+1},S_{n+1})\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P, 𝔼[Si2]<\mathbb{E}[S_{i}^{2}]<\infty, and f:𝒳f:\mathcal{X}\to\mathbb{R} is a function satisfying 𝔼[|f(Xi)|Si2]Cf,S𝔼[|f(Xi)|]\mathbb{E}[|f(X_{i})|S_{i}^{2}]\leq C_{f,S}\mathbb{E}[|f(X_{i})|] for some Cf,S>0C_{f,S}>0, then we also have that for all 1jn+11\leq j\leq n+1,

\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\|\hat{g}_{S_{n+1},K}\|_{K}\geq\frac{\sqrt{2\mathbb{E}[|S_{i}|]}}{\sqrt{\lambda}}\right\}\right]\leq O\left(\frac{\mathbb{E}[|f(X_{i})|]}{n}\right)
Proof.

Taking \beta=0 and g_{K}=0 gives objective value

1n+1i=1n+1α(0,Si)+λ0K21n+1i=1n+1|Si|.\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(0,S_{i})+\lambda\|0\|_{K}^{2}\leq\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|.

So, since (g^Sn+1,K,β^Sn+1)(\hat{g}_{S_{n+1},K},\hat{\beta}_{S_{n+1}}) is a minimizer of the quantile regression objective we must have that

λg^Sn+1,K2K1n+1i=1n+1α(g^Sn+1,K(Xi)+Φ(Xi)β^Sn+1,Si)+λg^Sn+1,K2K1n+1i=1n+1|Si|.\lambda\|\hat{g}_{S_{n+1},K}\|^{2}_{K}\leq\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(\hat{g}_{S_{n+1},K}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}},S_{i})+\lambda\|\hat{g}_{S_{n+1},K}\|^{2}_{K}\leq\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|.

This proves the first part of the lemma. To get the second part we simply note that

𝔼[|f(Xj)|𝟙{g^Sn+1,KK2𝔼[|Si|]λ}]\displaystyle\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\|\hat{g}_{S_{n+1},K}\|_{K}\geq\frac{\sqrt{2\mathbb{E}[|S_{i}|]}}{\sqrt{\lambda}}\right\}\right] 𝔼[|f(Xj)|𝟙{1n+1i=1n+1|Si|𝔼[|Si|]𝔼[|Si|]}]\displaystyle\leq\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|-\mathbb{E}[|S_{i}|]\geq\mathbb{E}[|S_{i}|]\right\}\right]
\displaystyle\leq\frac{1}{\mathbb{E}[|S_{i}|]^{2}}\mathbb{E}\left[|f(X_{j})|\left(\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|-\mathbb{E}[|S_{i}|]\right)^{2}\right]
=O(𝔼[|f(X)|]n).\displaystyle=O\left(\frac{\mathbb{E}[|f(X)|]}{n}\right).

Lemma 5.

Let (X_{1},S_{1}),\dots,(X_{n+1},S_{n+1})\stackrel{i.i.d.}{\sim}P and f:\mathcal{X}\to\mathbb{R}. Assume that \mathbb{E}[S_{i}^{2}]<\infty, \sup_{x}K(x,x)=\kappa^{2}<\infty, and there exist constants C_{2},c_{2},C_{f},C_{f,S},\rho>0 such that \mathbb{E}[|f(X_{i})|S_{i}^{2}]\leq C_{f,S}\mathbb{E}[|f(X_{i})|], \sqrt{\mathbb{E}[|f(X_{i})|^{2}]}\leq C_{f}\mathbb{E}[|f(X_{i})|], \inf_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|]\geq\rho, \sup_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|^{2}]^{1/2}\leq c_{2}, and \mathbb{E}[\|\Phi(X_{i})\|_{2}^{2}]\leq C_{2}d. Then there exists a constant c_{\beta}>0 such that for all 1\leq j\leq n+1,

𝔼[|f(Xj)|𝟙{β^Sn+12>1λcβ}]O(d𝔼[|f(Xi)|]n).\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\|\hat{\beta}_{S_{n+1}}\|_{2}>\frac{1}{\sqrt{\lambda}}c_{\beta}\right\}\right]\leq O\left(\frac{d\mathbb{E}[|f(X_{i})|]}{n}\right).
Proof.

Observe that

1n+1i=1n+1α(g^Sn+1,K(Xi)+Φ(Xi)β^Sn+1,Si)+λg^Sn+1,K2K\displaystyle\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(\hat{g}_{S_{n+1},K}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}},S_{i})+\lambda\|\hat{g}_{S_{n+1},K}\|^{2}_{K}
1n+1i=1n+1min{α,1α}|Sig^Sn+1,K(Xi)Φ(Xi)β^Sn+1|\displaystyle\geq\frac{1}{n+1}\sum_{i=1}^{n+1}\min\{\alpha,1-\alpha\}|S_{i}-\hat{g}_{S_{n+1},K}(X_{i})-\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}|
min{α,1α}(1n+1i=1n+1|Φ(Xi)β^Sn+1|1n+1i=1n+1|Si|1n+1i=1n+1|g^Sn+1,K(Xi)|).\displaystyle\geq\min\{\alpha,1-\alpha\}\left(\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}|-\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|-\frac{1}{n+1}\sum_{i=1}^{n+1}|\hat{g}_{S_{n+1},K}(X_{i})|\right).

Moreover, by the reproducing property of the RKHS

1n+1i=1n+1|g^Sn+1,K(Xi)|g^Sn+1,Kκg^Sn+1,KK,\frac{1}{n+1}\sum_{i=1}^{n+1}|\hat{g}_{S_{n+1},K}(X_{i})|\leq\|\hat{g}_{S_{n+1},K}\|_{\infty}\leq\kappa\|\hat{g}_{S_{n+1},K}\|_{K},

and by Lemma 4

g^Sn+1,KK1λ1n+1i=1n+1|Si|.\|\hat{g}_{S_{n+1},K}\|_{K}\leq\frac{1}{\sqrt{\lambda}}\sqrt{\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|}.

So, combining these two facts we find that

1n+1i=1n+1|Φ(Xi)β^Sn+1|1min{α,1α}(κλ1n+1i=1n+1|Si|+1n+1i=1n+1|Si|).\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}|\leq\frac{1}{\min\{\alpha,1-\alpha\}}\left(\frac{\kappa}{\sqrt{\lambda}}\sqrt{\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|}+\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|\right).

To use this inequality to bound \|\hat{\beta}_{S_{n+1}}\|_{2} we will need to lower bound the empirical mean of |\Phi(X_{i})^{\top}\beta| uniformly over unit-norm \beta. This is done in Lemma 12 below, where we show that there exist constants c,c^{\prime}>0 such that

(infβ:β=11n+1i=1n+1|Φ(Xi)β|c)1cd2(n+1)2.\mathbb{P}\left(\inf_{\beta:\|\beta\|=1}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|\geq c\right)\geq 1-c^{\prime}\frac{d^{2}}{(n+1)^{2}}.

Thus,

𝔼[|f(Xj)|𝟙{β^Sn+12>1λcβ}]𝔼[|f(Xj)|𝟙{1n+1i=1n+1|Si|𝔼[|Si|]>Ω(1)}]\displaystyle\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\|\hat{\beta}_{S_{n+1}}\|_{2}>\frac{1}{\sqrt{\lambda}}c_{\beta}\right\}\right]\leq\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|-\mathbb{E}[|S_{i}|]>\Omega(1)\right\}\right]
\displaystyle\ \ \ \ +\mathbb{E}\left[|f(X_{j})|\mathbbm{1}\left\{\inf_{\beta:\|\beta\|=1}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|<c\right\}\right]
O(1)𝔼[|f(Xj)|(1n+1i=1n+1|Si|𝔼|Si|)2]+𝔼[|f(Xi)|2]1/2cdn+1O(d𝔼[|f(Xi)|]n+1).\displaystyle\leq O(1)\mathbb{E}\left[|f(X_{j})|\left(\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|-\mathbb{E}|S_{i}|\right)^{2}\right]+\mathbb{E}[|f(X_{i})|^{2}]^{1/2}\sqrt{c^{\prime}}\frac{d}{n+1}\leq O\left(\frac{d\mathbb{E}[|f(X_{i})|]}{n+1}\right).

With all of these results in hand we are ready to prove Proposition 1.

Proof of Proposition 1.

We will exploit the stability of the RKHS fit. Let ϵ=O(λn2d)\epsilon=O\left(\frac{\lambda}{n^{2}\sqrt{d}}\right) and define the event

E:={supβ:β2ϵ1n+1i=1n+1|Φ(Xi)β|2ϵC2d,β^Sn+12cβλ}.E:=\left\{\sup_{\beta:\|\beta\|_{2}\leq\epsilon}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|\leq 2\epsilon\sqrt{C_{2}d},\ \|\hat{\beta}_{S_{n+1}}\|_{2}\leq\frac{c_{\beta}}{\sqrt{\lambda}}\right\}.

By Lemmas 3 and 5 we know that

𝔼[|f(Xi)|𝟙{Si=g^Sn+1,K(Xi)+Φ(Xi)β^Sn+1}]\displaystyle\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},K}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}]
𝔼[|f(Xi)|𝟙{Si=g^Sn+1,K(Xi)+Φ(Xi)β^Sn+1}𝟙{E}]+O(d𝔼[|f(Xi)|]n).\displaystyle\leq\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},K}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\mathbbm{1}\{E\}]+O\left(\frac{d\mathbb{E}[|f(X_{i})|]}{n}\right).

Thus, we just need to focus on what happens on the event EE. By applying the exchangeability of the quadruples (g^Sn+1,K(Xi),β^Sn+1,Xi,Si)(\hat{g}_{S_{n+1},K}(X_{i}),\hat{\beta}_{S_{n+1}},X_{i},S_{i}) we have that

𝔼[|f(Xi)|𝟙{Si=g^Sn+1,K(Xi)+Φ(Xi)β^Sn+1}𝟙{E}]\displaystyle\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},K}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\mathbbm{1}\{E\}]
=𝔼[(1n+1i=1n+1|f(Xi)|𝟙{Si=g^Sn+1,K(Xi)+Φ(Xi)β^Sn+1})𝟙{E}]\displaystyle=\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},K}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\right)\mathbbm{1}\{E\}\right]
𝔼[max1in+1|f(Xi)|𝔼[(1n+1i=1n+1𝟙{Si=g^β^Sn+1(Xi)+Φ(Xi)β^Sn+1})𝟙{E}(Xi)i=1n+1]].\displaystyle\leq\mathbb{E}\left[\max_{1\leq i\leq n+1}|f(X_{i})|\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{\hat{\beta}_{S_{n+1}}}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\right)\mathbbm{1}\{E\}\mid(X_{i})_{i=1}^{n+1}\right]\right].

To bound this quantity we just need to control the inner expectation. We will begin by fixing a large integer m>1m>1 and applying the inequality

𝔼[(1n+1i=1n+1𝟙{Si=g^β^Sn+1(Xi)+Φ(Xi)β^Sn+1})𝟙{E}(Xi)i=1n+1]\displaystyle\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{\hat{\beta}_{S_{n+1}}}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\right)\mathbbm{1}\{E\}\mid(X_{i})_{i=1}^{n+1}\right]
𝔼[(1n+1i=1n+1𝟙{Si=g^β^Sn+1(Xi)+Φ(Xi)β^Sn+1})m𝟙{E}(Xi)i=1n+1]1/m.\displaystyle\leq\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{\hat{\beta}_{S_{n+1}}}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\right)^{m}\mathbbm{1}\{E\}\mid(X_{i})_{i=1}^{n+1}\right]^{1/m}.

Our motivation for applying this bound is that by choosing m sufficiently large we will be able to swap a sum and a maximum without losing too much. More precisely, let \mathcal{N}\subseteq\mathbb{R}^{d} be a minimal-size \epsilon-net of \{\beta\in\mathbb{R}^{d}:\|\beta\|_{2}\leq c_{\beta}/\sqrt{\lambda}\}. It is well known that there exists an absolute constant C_{N}>0 such that |\mathcal{N}|\leq\exp(C_{N}d\log(\frac{c_{\beta}}{\sqrt{\lambda}\epsilon})). Then, using this \epsilon-net we compute that

𝔼[(1n+1i=1n+1𝟙{Si=g^β^Sn+1(Xi)+Φ(Xi)β^Sn+1})m𝟙{E}(Xi)i=1n+1]\displaystyle\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{\hat{\beta}_{S_{n+1}}}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\right)^{m}\mathbbm{1}\{E\}\mid(X_{i})_{i=1}^{n+1}\right]
𝔼[supβ:β2cβ/λ(1n+1i=1n+1𝟙{Si=g^β(Xi)+Φ(Xi)β})m𝟙{E}|(Xi)i=1n+1]\displaystyle\leq\mathbb{E}\left[\sup_{\beta:\|\beta\|_{2}\leq c_{\beta}/\sqrt{\lambda}}\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{\beta}(X_{i})+\Phi(X_{i})^{\top}\beta\}\right)^{m}\mathbbm{1}\{E\}|(X_{i})_{i=1}^{n+1}\right]
𝔼[supβ𝒩(1n+1i=1n+1𝟙{|Sig^β(Xi)Φ(Xi)β|O(1/n)})m𝟙{E}(Xi)i=1n+1]\displaystyle\leq\mathbb{E}\left[\sup_{\beta\in\mathcal{N}}\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{|S_{i}-\hat{g}_{\beta}(X_{i})-\Phi(X_{i})^{\top}\beta|\leq O\left(1/n\right)\}\right)^{m}\mathbbm{1}\{E\}\mid(X_{i})_{i=1}^{n+1}\right]
β𝒩𝔼[(1n+1i=1n+1𝟙{|Sig^β(Xi)Φ(Xi)β|O(1/n)})m(Xi)i=1n+1],\displaystyle\leq\sum_{\beta\in\mathcal{N}}\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{|S_{i}-\hat{g}_{\beta}(X_{i})-\Phi(X_{i})^{\top}\beta|\leq O\left(1/n\right)\}\right)^{m}\mid(X_{i})_{i=1}^{n+1}\right],

where the first inequality follows from the definition of EE and the second inequality uses both Lemma 2 and the fact that on the event EE

supβ:βϵ1n+1i=1n+1|Φ(Xi)β|2ϵC2d\displaystyle\sup_{\beta:\|\beta\|\leq\epsilon}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|\leq 2\epsilon\sqrt{C_{2}d}
\displaystyle\implies supβ:βϵmax1in+1|Φ(Xi)β|supβ:βϵn+1n+1i=1n+1|Φ(Xi)β|(n+1)2ϵC2dO(1n).\displaystyle\sup_{\beta:\|\beta\|\leq\epsilon}\max_{1\leq i\leq n+1}|\Phi(X_{i})^{\top}\beta|\leq\sup_{\beta:\|\beta\|\leq\epsilon}\frac{n+1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|\leq(n+1)2\epsilon\sqrt{C_{2}d}\leq O\left(\frac{1}{n}\right).

Continuing this calculation directly we see that,

𝔼[(1n+1i=1n+1𝟙{|Sig^β(Xi)Φ(Xi)β|O(1/n)})m(Xi)i=1n+1]\displaystyle\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{|S_{i}-\hat{g}_{\beta}(X_{i})-\Phi(X_{i})^{\top}\beta|\leq O\left(1/n\right)\}\right)^{m}\mid(X_{i})_{i=1}^{n+1}\right]
=k=1m(n+1k)(mk)k!kmk1(n+1)m𝔼[i=1k𝟙{|Sig^β(Xi)Φ(Xi)β|O(1/n)}(Xi)i=1n+1]\displaystyle=\sum_{k=1}^{m}\binom{n+1}{k}\binom{m}{k}k!k^{m-k}\frac{1}{(n+1)^{m}}\mathbb{E}\left[\prod_{i=1}^{k}\mathbbm{1}\{|S_{i}-\hat{g}_{\beta}(X_{i})-\Phi(X_{i})^{\top}\beta|\leq O\left(1/n\right)\}\mid(X_{i})_{i=1}^{n+1}\right]
k=1m(n+1)k(mk)mmk(n+1)m𝔼[i=1k𝟙{|Sig^{1,,k}β(Xi)Φ(Xi)β|O(kλn)}(Xi)i=1n+1],\displaystyle\leq\sum_{k=1}^{m}(n+1)^{k}\binom{m}{k}\frac{m^{m-k}}{(n+1)^{m}}\mathbb{E}\left[\prod_{i=1}^{k}\mathbbm{1}\left\{|S_{i}-\hat{g}^{-\{1,\dots,k\}}_{\beta}(X_{i})-\Phi(X_{i})^{\top}\beta|\leq O\left(\frac{k}{\lambda n}\right)\right\}\mid(X_{i})_{i=1}^{n+1}\right],

where the last line applies Lemma 1. Finally, using the fact that Si|XiS_{i}|X_{i} has a bounded density we may upper bound the above display by

k=1m(n+1)k(mk)mmk(n+1)mO((kλn)k)O((mλn)m)k=1m(mk)O(2m(mλn)m).\displaystyle\sum_{k=1}^{m}(n+1)^{k}\binom{m}{k}\frac{m^{m-k}}{(n+1)^{m}}O\left(\left(\frac{k}{\lambda n}\right)^{k}\right)\leq O\left(\left(\frac{m}{\lambda n}\right)^{m}\right)\sum_{k=1}^{m}\binom{m}{k}\leq O\left(2^{m}\left(\frac{m}{\lambda n}\right)^{m}\right).

Putting this all together we conclude that

𝔼[(1n+1i=1n+1𝟙{Si=g^β^Sn+1(Xi)+Φ(Xi)β^Sn+1})m(Xi)i=1n+1]1mO(exp(CNdlog(1λϵ)m)mλn).\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{\hat{\beta}_{S_{n+1}}}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\right)^{m}\mid(X_{i})_{i=1}^{n+1}\right]^{\frac{1}{m}}\leq O\left(\exp\left(\frac{C_{N}d\log\left(\frac{1}{\sqrt{\lambda}\epsilon}\right)}{m}\right)\frac{m}{\lambda n}\right).

The desired result then follows by taking m=dlog(1λϵ)m=d\log(\frac{1}{\sqrt{\lambda}\epsilon}) and plugging in our definition for ϵ\epsilon.
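Spelling out this last step: with m=d\log(\frac{1}{\sqrt{\lambda}\epsilon}) the prefactor satisfies \exp(C_{N}d\log(\frac{1}{\sqrt{\lambda}\epsilon})/m)=e^{C_{N}}=O(1), and with \epsilon chosen of order \lambda/(n^{2}\sqrt{d}) as above we have \log(\frac{1}{\sqrt{\lambda}\epsilon})=O(\log(\frac{n^{2}\sqrt{d}}{\lambda^{3/2}})), so the bound in the previous display becomes

O\left(\frac{m}{\lambda n}\right)=O\left(\frac{d}{\lambda n}\log\left(\frac{n^{2}\sqrt{d}}{\lambda^{3/2}}\right)\right).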

A.4.2 Proof of Proposition 2

To simplify the notation let

Ln(β,gK):=1n+1i=1nα(Φ(Xi)β+gK(Xi),Si)\displaystyle L_{n}(\beta,g_{K}):=\frac{1}{n+1}\sum_{i=1}^{n}\ell_{\alpha}(\Phi(X_{i})^{\top}\beta+g_{K}(X_{i}),S_{i})
and L(β,gK):=𝔼[α(Φ(Xi)β+gK(Xi),Si)],\displaystyle L_{\infty}(\beta,g_{K}):=\mathbb{E}[\ell_{\alpha}(\Phi(X_{i})^{\top}\beta+g_{K}(X_{i}),S_{i})],

denote the empirical and population losses and let

Mn(β,gK):=Ln(β,gK)+λgKK2\displaystyle M_{n}(\beta,g_{K}):=L_{n}(\beta,g_{K})+\lambda\|g_{K}\|_{K}^{2}
and M(β,gK):=L(β,gK)+λgKK2,\displaystyle M_{\infty}(\beta,g_{K}):=L_{\infty}(\beta,g_{K})+\lambda\|g_{K}\|_{K}^{2},

denote the corresponding empirical and population objectives. Note that M_{n} and M_{\infty} are strictly convex in g_{K} and convex in \beta. Thus, we may let (\hat{B}_{n},\hat{g}_{n,K}),(B^{*},g_{K}^{*})\in 2^{\mathbb{R}^{d}}\times\mathcal{F}_{K} denote the minimizers of M_{n} and M_{\infty}, respectively. To further ease notation in what follows we will sometimes use \hat{\beta}_{n} and \beta^{*} to denote arbitrary elements of \hat{B}_{n} and B^{*}. Finally, we will let \Pi_{\hat{B}_{n}},\Pi_{B^{*}}:\mathbb{R}^{d}\to\mathbb{R}^{d} denote the projection operators onto \hat{B}_{n} and B^{*}, respectively.

With these preliminary definitions in hand we now formally state the assumptions of Proposition 2. Our first assumption is that MM_{\infty} is locally strongly convex around its minimum.

Assumption 2 (Population Strong Convexity).

Let d(\beta,g_{K}):=\inf_{\beta^{\prime}\in B^{*}}\|\beta-\beta^{\prime}\|_{2}+\|g_{K}-g^{*}_{K}\|_{K} denote the distance from (\beta,g_{K}) to the nearest population minimizer. Then, there exist constants C_{M},\delta_{M}>0 such that

d(\beta,g_{K})\leq\delta_{M}\implies M_{\infty}(\beta,g_{K})-M_{\infty}(\beta^{*},g^{*}_{K})\geq C_{M}d(\beta,g_{K})^{2}.

Overall, we believe that this assumption is mild and should hold for all distributions of interest. For instance, for continuous data it is easy to check that this condition holds whenever S\mid X has a positive density on \mathbb{R}. On the other hand, for discrete data we expect the even stronger inequality M_{\infty}(\beta,g_{K})-M_{\infty}(\beta^{*},g^{*}_{K})\geq C_{M}d(\beta,g_{K}) to hold. This is due to the fact that for discrete data L_{\infty}(\cdot,\cdot) has sharp kinks (jump discontinuities in its derivative) that give rise to large increases in the loss when (\hat{\beta},\hat{g}_{n,K}) moves away from (B^{*},g^{*}).
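To expand on the continuous case, here is a short sketch (under enough regularity to differentiate under the expectation, which is an assumption of this remark rather than of the proposition) of why a positive conditional density yields the required local curvature. Since the pinball loss satisfies

\frac{\partial^{2}}{\partial\theta^{2}}\,\mathbb{E}\left[\ell_{\alpha}(\theta,S_{i})\mid X_{i}=x\right]=p_{S_{i}\mid X_{i}=x}(\theta),

perturbing the population fit along (\dot{\beta},\dot{g}_{K}) with h(x):=\Phi(x)^{\top}\dot{\beta}+\dot{g}_{K}(x) and q^{*}(x):=\Phi(x)^{\top}\beta^{*}+g^{*}_{K}(x) gives

\frac{d^{2}}{dt^{2}}M_{\infty}(\beta^{*}+t\dot{\beta},g^{*}_{K}+t\dot{g}_{K})\bigg{|}_{t=0}=\mathbb{E}\left[h(X_{i})^{2}\,p_{S_{i}\mid X_{i}}(q^{*}(X_{i}))\right]+2\lambda\|\dot{g}_{K}\|_{K}^{2},

which is strictly positive in every non-zero direction whenever the conditional density is positive near q^{*} (using the moment conditions of Assumption 3 when \dot{g}_{K}=0). This is the calculation behind the claim that Assumption 2 holds for such distributions.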

The second assumption we will need is a set of moment conditions on SS and XX.

Assumption 3 (Moment Conditions).

There exist constants C_{2},\rho>0 such that

𝔼[Φ(Xi)22]C2d and infβ:β2=1𝔼[|Φ(Xi)β|]ρ.\displaystyle\mathbb{E}[\|\Phi(X_{i})\|^{2}_{2}]\leq C_{2}d\ \ \text{ and }\ \ \inf_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|]\geq\rho.

Furthermore, we also have that 𝔼[|Si|2]<\mathbb{E}[|S_{i}|^{2}]<\infty.

With these assumptions in hand we are ready to prove Proposition 2. We begin by giving a technical lemma that controls the concentration of LnL_{n} around LL_{\infty}.

Lemma 6.

Assume that (X_{1},S_{1}),\dots,(X_{n},S_{n})\stackrel{i.i.d.}{\sim}P and that there exist constants C_{2},\kappa>0 such that \mathbb{E}[\|\Phi(X_{i})\|^{2}_{2}]\leq C_{2}d and \sup_{x}K(x,x)=\kappa^{2}<\infty. Then for any \delta_{1},\delta_{2}>0,

𝔼[supβΠBβ2δ1,gKgKKδ2|Ln(β,gK)Ln(ΠBβ,gK)(L(β,gK)L(ΠBβ,gK))|]\displaystyle\mathbb{E}\left[\sup_{\|\beta-\Pi_{B^{*}}\beta\|_{2}\leq\delta_{1},\ \|g_{K}-g^{*}_{K}\|_{K}\leq\delta_{2}}\left|L_{n}(\beta,g_{K})-L_{n}(\Pi_{B^{*}}\beta,g^{*}_{K})-(L_{\infty}(\beta,g_{K})-L_{\infty}(\Pi_{B^{*}}\beta,g^{*}_{K}))\right|\right]
O(δ1dn+δ21n)\displaystyle\hskip 28.45274pt\leq O\left(\delta_{1}\sqrt{\frac{d}{n}}+\delta_{2}\sqrt{\frac{1}{n}}\right)
Proof.

Let E:={(β,gK)d×K:βΠBβ2δ1,gKgKKδ2}E:=\{(\beta,g_{K})\in\mathbb{R}^{d}\times\mathcal{F}_{K}:\|\beta-\Pi_{B^{*}}\beta\|_{2}\leq\delta_{1},\ \|g_{K}-g^{*}_{K}\|_{K}\leq\delta_{2}\} and σ1,,σn+1i.i.dUnif({±1})\sigma_{1},\dots,\sigma_{n+1}\stackrel{{\scriptstyle i.i.d}}{{\sim}}\text{Unif}(\{\pm 1\}) be Rademacher random variables. Since the pinball loss is 1-Lipschitz (see Lemma 11) we may apply the symmetrization and contraction properties of Rademacher complexity to conclude that

𝔼[sup(β,gK)E|Ln(β,gK)Ln(ΠBβ,gK)(L(β,gK)L(ΠBβ,gK))|]\displaystyle\mathbb{E}\left[\sup_{(\beta,g_{K})\in E}\left|L_{n}(\beta,g_{K})-L_{n}(\Pi_{B^{*}}\beta,g^{*}_{K})-(L_{\infty}(\beta,g_{K})-L_{\infty}(\Pi_{B^{*}}\beta,g^{*}_{K}))\right|\right]
2𝔼[sup(β,gK)E|1ni=1nσi((Φ(Xi)β+gK(Xi),Si)(Φ(Xi)ΠBβ+gK(Xi),Si))|]\displaystyle\leq 2\mathbb{E}\left[\sup_{(\beta,g_{K})\in E}\left|\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}(\ell(\Phi(X_{i})^{\top}\beta+g_{K}(X_{i}),S_{i})-\ell(\Phi(X_{i})^{\top}\Pi_{B^{*}}\beta+g^{*}_{K}(X_{i}),S_{i}))\right|\right]
\displaystyle\leq 2\mathbb{E}\left[\sup_{(\beta,g_{K})\in E}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}(\Phi(X_{i})^{\top}(\beta-\Pi_{B^{*}}\beta)+g_{K}(X_{i})-g^{*}_{K}(X_{i}))\right]
\displaystyle\leq 2\mathbb{E}\left[\sup_{\|\beta-\Pi_{B^{*}}\beta\|_{2}\leq\delta_{1}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}\Phi(X_{i})^{\top}(\beta-\Pi_{B^{*}}\beta)\right]
+2𝔼[supgKgKKδ21ni=1nσi(gK(Xi)gK(Xi))]\displaystyle\ \ \ \ +2\mathbb{E}\left[\sup_{\|g_{K}-g^{*}_{K}\|_{K}\leq\delta_{2}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}(g_{K}(X_{i})-g^{*}_{K}(X_{i}))\right]
O(δ1dn+δ21n),\displaystyle\leq O\left(\delta_{1}\sqrt{\frac{d}{n}}+\delta_{2}\sqrt{\frac{1}{n}}\right),

where the last inequality follows from standard bounds on the Rademacher complexities of linear and kernel function classes (see e.g. Section 4.1.2 of Boucheron et al. (2005)). ∎

We now prove the main proposition.

Proof of Proposition 2.

We will show that

  1. \sup_{f\in\mathcal{F}_{\delta}}\left|\frac{1}{n}\sum_{i=1}^{n}|f(X_{i})|-\mathbb{E}_{P}[|f(X)|]\right|=O_{\mathbb{P}}(\sqrt{d/n}),

  2. \sup_{f_{K}\in\mathcal{F}_{K}:\|f_{K}\|_{K}\leq 1}\lambda|\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]|\leq O(1),

  3. \sup_{f_{K}\in\mathcal{F}_{K}:\|f_{K}\|_{K}\leq 1}\lambda|\langle\hat{g}_{n,K},f_{K}\rangle_{K}-\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]|\leq O_{\mathbb{P}}(\sqrt{d\log(n)/n}).

Our desired result will then follow by writing

supf()=Φ()β+fK()δ|2λg^n,K,fK1ni=1nf(Xi)2λ𝔼[g^Sn+1,K,fK]𝔼P[|f(X)|]|\displaystyle\sup_{f(\cdot)=\Phi(\cdot)^{\top}\beta+f_{K}(\cdot)\in\mathcal{F}_{\delta}}\left|2\lambda\frac{\langle\hat{g}_{n,K},f_{K}\rangle}{\frac{1}{n}\sum_{i=1}^{n}f(X_{i})}-2\lambda\frac{\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle]}{\mathbb{E}_{P}[|f(X)|]}\right|
supfKK:fKK12λ|g^n,K,fK𝔼[g^Sn+1,K,fK]|𝔼P[|f(X)|]\displaystyle\leq\sup_{f_{K}\in\mathcal{F}_{K}:\|f_{K}\|_{K}\leq 1}2\lambda\frac{|\langle\hat{g}_{n,K},f_{K}\rangle-\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle]|}{\mathbb{E}_{P}[|f(X)|]}
+2λsupf()=Φ()β+fK()δ|g^n,K,fK1ni=1n|f(Xi)|g^n,K,fK𝔼P[|f(Xi)|]|\displaystyle\ \ \ \ \ +2\lambda\sup_{f(\cdot)=\Phi(\cdot)^{\top}\beta+f_{K}(\cdot)\in\mathcal{F}_{\delta}}\left|\frac{\langle\hat{g}_{n,K},f_{K}\rangle}{\frac{1}{n}\sum_{i=1}^{n}|f(X_{i})|}-\frac{\langle\hat{g}_{n,K},f_{K}\rangle}{\mathbb{E}_{P}[|f(X_{i})|]}\right|
\displaystyle\leq O_{\mathbb{P}}\left(\sqrt{\frac{d\log(n)}{n}}\right)+\frac{\sup_{f_{K}\in\mathcal{F}_{K}:\|f_{K}\|_{K}\leq 1}|\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle]|+O_{\mathbb{P}}(\sqrt{\frac{d\log(n)}{n}})}{\delta^{2}-O_{\mathbb{P}}(\sqrt{d/n})}\sup_{f\in\mathcal{F}_{\delta}}\left|\frac{1}{n}\sum_{i=1}^{n}|f(X_{i})|-\mathbb{E}_{P}[|f(X)|]\right|
=O(dlog(n)n).\displaystyle=O_{\mathbb{P}}\left(\sqrt{\frac{d\log(n)}{n}}\right).

We establish each of these three facts in order.

Step 1: By the results of Section 4.1.2 in Boucheron et al. (2005) we know that \{f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta:\|f_{K}\|_{K}+\|\beta\|_{2}\leq 1\} has Rademacher complexity at most O(\sqrt{d/n}). By the contraction property this also implies that \{|f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta|:\|f_{K}\|_{K}+\|\beta\|_{2}\leq 1\} has Rademacher complexity at most O(\sqrt{d/n}). So, by Markov's inequality and symmetrization we have that for any C>0,

(supfδ|1ni=1n|f(Xi)|𝔼P[|f(X)|]|>C)1C𝔼[supfδ|1ni=1n|f(Xi)|𝔼P[|f(X)|]|]\displaystyle\mathbb{P}\left(\sup_{f\in\mathcal{F}_{\delta}}\left|\frac{1}{n}\sum_{i=1}^{n}|f(X_{i})|-\mathbb{E}_{P}[|f(X)|]\right|>C\right)\leq\frac{1}{C}\mathbb{E}\left[\sup_{f\in\mathcal{F}_{\delta}}\left|\frac{1}{n}\sum_{i=1}^{n}|f(X_{i})|-\mathbb{E}_{P}[|f(X)|]\right|\right]
2CRadComplexn({|fK()+Φ()β|:fK+β21})1CO(dn).\displaystyle\hskip 85.35826pt\leq\frac{2}{C}\text{RadComplex}_{n}(\{|f_{K}(\cdot)+\Phi(\cdot)^{\top}\beta|:\|f\|_{K}+\|\beta\|_{2}\leq 1\})\leq\frac{1}{C}O\left(\sqrt{\frac{d}{n}}\right).

This proves that supfδ|1ni=1n|f(Xi)|𝔼P[|f(X)|]|=O(d/n)\sup_{f\in\mathcal{F}_{\delta}}\left|\frac{1}{n}\sum_{i=1}^{n}|f(X_{i})|-\mathbb{E}_{P}[|f(X)|]\right|=O_{\mathbb{P}}(\sqrt{d/n}), as desired.

Step 2: By Lemma 4 we know that

supfK:fKK1|𝔼[g^Sn+1,K,fKK]|𝔼[g^Sn+1,KK]𝔼[1λ1n+1i=1n+1|Si|]𝔼[|Si|]λ.\displaystyle\sup_{f\in\mathcal{F}_{K}:\|f_{K}\|_{K}\leq 1}|\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle_{K}]|\leq\mathbb{E}[\|\hat{g}_{S_{n+1},K}\|_{K}]\leq\mathbb{E}\left[\frac{1}{\sqrt{\lambda}}\sqrt{\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|}\right]\leq\sqrt{\frac{\mathbb{E}[|S_{i}|]}{\lambda}}.

Multiplying both sides by λ\lambda gives the desired result.

Step 3: This step is considerably more involved than the previous two. To begin write

supfK:fKK1g^n,K,fK𝔼[g^Sn+1,K,fK]\displaystyle\sup_{f_{K}:\|f_{K}\|_{K}\leq 1}\langle\hat{g}_{n,K},f_{K}\rangle-\mathbb{E}[\langle\hat{g}_{S_{n+1},K},f_{K}\rangle] =supfK:fKK1g^n,KgK,fK+𝔼[gKg^Sn+1,K,fK]\displaystyle=\sup_{f_{K}:\|f_{K}\|_{K}\leq 1}\langle\hat{g}_{n,K}-g^{*}_{K},f_{K}\rangle+\mathbb{E}[\langle g^{*}_{K}-\hat{g}_{S_{n+1},K},f_{K}\rangle]
g^n,KgKK+𝔼[gKg^Sn+1,KK].\displaystyle\leq\|\hat{g}_{n,K}-g^{*}_{K}\|_{K}+\mathbb{E}[\|g^{*}_{K}-\hat{g}_{S_{n+1},K}\|_{K}].

We will bound each of the two terms on the right hand side separately. To do this we will use a two-step peeling argument where each step gives a tighter bound on g^n,KgKK\|\hat{g}_{n,K}-g^{*}_{K}\|_{K} than the previous one.

Our first step will show that with high probability (β^n,g^n,K)(\hat{\beta}_{n},\hat{g}_{n,K}) must be within δM\delta_{M} of (B,gK)(B^{*},g^{*}_{K}). Let cβc_{\beta} be the constant appearing in Lemma 5. Then, by a direct computation we have that

(d(β^n,g^n,K)>δM)=(Mn(β^n,g^n,K)Mn(ΠBβ^n,gK)0,d(β^n,g^n,K)>δM)\displaystyle\mathbb{P}(d(\hat{\beta}_{n},\hat{g}_{n,K})>\delta_{M})=\mathbb{P}(M_{n}(\hat{\beta}_{n},\hat{g}_{n,K})-M_{n}(\Pi_{B^{*}}\hat{\beta}_{n},g^{*}_{K})\leq 0,\ d(\hat{\beta}_{n},\hat{g}_{n,K})>\delta_{M})
(Mn(β^n,g^n,K)Mn(ΠBβ^n,gK)(M(β^n,g^n,K)M(ΠBβ^n,gK))δM2)\displaystyle\leq\mathbb{P}(M_{n}(\hat{\beta}_{n},\hat{g}_{n,K})-M_{n}(\Pi_{B^{*}}\hat{\beta}_{n},g^{*}_{K})-(M_{\infty}(\hat{\beta}_{n},\hat{g}_{n,K})-M_{\infty}(\Pi_{B^{*}}\hat{\beta}_{n},g^{*}_{K}))\leq-\delta_{M}^{2})
(supβ2cβλ,gK2𝔼[|Si|]λ|Mn(β,gK)Mn(ΠBβ,gK)(M(β,gK)M(ΠBβ,gK))|δM2)\displaystyle\leq\mathbb{P}\left(\sup_{\|\beta\|_{2}\leq\frac{c_{\beta}}{\sqrt{\lambda}},\|g_{K}\|\leq\sqrt{\frac{2\mathbb{E}[|S_{i}|]}{\lambda}}}|M_{n}(\beta,g_{K})-M_{n}(\Pi_{B^{*}}\beta,g^{*}_{K})-(M_{\infty}(\beta,g_{K})-M_{\infty}(\Pi_{B^{*}}\beta,g^{*}_{K}))|\geq\delta_{M}^{2}\right)
+(β^n2cβλ)+(g^n,KK2𝔼[|Si|]λ)\displaystyle\quad+\mathbb{P}\left(\|\hat{\beta}_{n}\|_{2}\geq\frac{c_{\beta}}{\sqrt{\lambda}}\right)+\mathbb{P}\left(\|\hat{g}_{n,K}\|_{K}\geq\sqrt{\frac{2\mathbb{E}[|S_{i}|]}{\lambda}}\right)
\displaystyle\leq\frac{1}{\delta_{M}^{2}}\mathbb{E}\left[\sup_{\|\beta\|_{2}\leq\frac{c_{\beta}}{\sqrt{\lambda}},\|g_{K}\|_{K}\leq\sqrt{\frac{2\mathbb{E}[|S_{i}|]}{\lambda}}}|L_{n}(\beta,g_{K})-L_{n}(\Pi_{B^{*}}\beta,g^{*}_{K})-(L_{\infty}(\beta,g_{K})-L_{\infty}(\Pi_{B^{*}}\beta,g^{*}_{K}))|\right]+O\left(\frac{d}{n}\right),

where the last line follows by applying Lemmas 4 and 5 with f()=1f(\cdot)=1. Finally, by Lemma 6 we can additionally bound the first term above as

\mathbb{E}\left[\sup_{\|\beta\|_{2}\leq\frac{c_{\beta}}{\sqrt{\lambda}},\|g_{K}\|_{K}\leq\sqrt{\frac{2\mathbb{E}[|S_{i}|]}{\lambda}}}|L_{n}(\beta,g_{K})-L_{n}(\Pi_{B^{*}}\beta,g^{*}_{K})-(L_{\infty}(\beta,g_{K})-L_{\infty}(\Pi_{B^{*}}\beta,g^{*}_{K}))|\right]\leq O\left(\sqrt{\frac{d}{\lambda n}}\right).

So, in total we find that

(d(β^n,g^n,K)>δM)O(dλn).\mathbb{P}(d(\hat{\beta}_{n},\hat{g}_{n,K})>\delta_{M})\leq O\left(\sqrt{\frac{d}{\lambda n}}\right).

This concludes the proof of our first concentration inequality for (β^n,g^n,K)(\hat{\beta}_{n},\hat{g}_{n,K}).

In our second step we will use this preliminary bound to get an even tighter control on d(\hat{\beta}_{n},\hat{g}_{n,K}). Fix any C>0 with C\sqrt{d/(\lambda n)}<\delta_{M}. For any j\in\mathbb{Z} let A_{j}:=\{(\beta,g_{K}):2^{j-1}<\sqrt{\frac{\lambda n}{d}}d(\beta,g_{K})\leq 2^{j}\}. Then,

\displaystyle\mathbb{P}\left(d(\hat{\beta}_{n},\hat{g}_{n,K})>C\sqrt{d/(\lambda n)}\right)
\displaystyle\leq\sum_{j:\frac{1}{2}C\leq 2^{j}\leq 2\sqrt{\frac{n\lambda}{d}}\delta_{M}}\mathbb{P}\left(\inf_{(\beta,g_{K})\in A_{j}}M_{n}(\beta,g_{K})-M_{n}(\Pi_{B^{*}}\beta,g^{*}_{K})\leq 0\right)+\mathbb{P}(d(\hat{\beta}_{n},\hat{g}_{n,K})>\delta_{M})
\displaystyle\leq\sum_{j:\frac{1}{2}C\leq 2^{j}\leq 2\sqrt{\frac{n\lambda}{d}}\delta_{M}}\mathbb{P}\left(\sup_{(\beta,g_{K})\in A_{j}}|M_{n}(\beta,g_{K})-M_{n}(\Pi_{B^{*}}\beta,g^{*}_{K})-(M_{\infty}(\beta,g_{K})-M_{\infty}(\Pi_{B^{*}}\beta,g^{*}_{K}))|\geq\frac{2^{2j-2}d}{n\lambda}\right)
\ \ \ \ +O\left(\sqrt{\frac{d}{\lambda n}}\right)
\displaystyle\leq\sum_{j:\frac{1}{2}C\leq 2^{j}\leq 2\sqrt{\frac{n\lambda}{d}}\delta_{M}}\frac{n\lambda}{2^{2j-2}d}\mathbb{E}\left[\sup_{(\beta,g_{K})\in A_{j}}\left|L_{n}(\beta,g_{K})-L_{n}(\Pi_{B^{*}}\beta,g^{*}_{K})-(L_{\infty}(\beta,g_{K})-L_{\infty}(\Pi_{B^{*}}\beta,g^{*}_{K}))\right|\right]
\ \ \ \ +O\left(\sqrt{\frac{d}{\lambda n}}\right)
\displaystyle\leq\sum_{j:\frac{1}{2}C\leq 2^{j}\leq 2\sqrt{\frac{n\lambda}{d}}\delta_{M}}O(\sqrt{\lambda}2^{-j})+O\left(\sqrt{\frac{d}{\lambda n}}\right)\leq O\left(\frac{\sqrt{\lambda}}{C}\right)+O\left(\sqrt{\frac{d}{\lambda n}}\right).

This proves that g^n,KgKK=O(dλn)\|\hat{g}_{n,K}-g^{*}_{K}\|_{K}=O_{\mathbb{P}}(\sqrt{\frac{d}{\lambda n}}). To get a similar bound on the expectation write

\displaystyle\mathbb{E}[\|\hat{g}_{S_{n+1},K}-g^{*}_{K}\|_{K}]\leq\int_{0}^{\delta_{M}}\mathbb{P}(\|\hat{g}_{S_{n+1},K}-g^{*}_{K}\|_{K}>t)dt
+\mathbb{E}[(\|\hat{g}_{S_{n+1},K}\|_{K}+\|g^{*}_{K}\|_{K})\mathbbm{1}\{\|\hat{g}_{S_{n+1},K}-g^{*}_{K}\|_{K}>\delta_{M}\}]
0δMmin{1,O(1tdλn)+O(dλn)}dt\displaystyle\leq\int_{0}^{\delta_{M}}\min\left\{1,O\left(\frac{1}{t}\sqrt{\frac{d}{\lambda n}}\right)+O\left(\sqrt{\frac{d}{\lambda n}}\right)\right\}dt
+𝔼[(1λ(n+1)i=1n+1|Si|+𝔼[|Si|]λ)𝟙{g^Sn+1,KgKK>δM}]\displaystyle\hskip 142.26378pt+\mathbb{E}\left[\left(\sqrt{\frac{1}{\lambda(n+1)}\sum_{i=1}^{n+1}|S_{i}|}+\sqrt{\frac{\mathbb{E}[|S_{i}|]}{\lambda}}\right)\mathbbm{1}\{\|\hat{g}_{S_{n+1},K}-g^{*}_{K}\|_{K}>\delta_{M}\}\right]
O(dlog(n)λn)+𝔼[(1λ(n+1)i=1n+1(|Si|𝔼[|Si|])+2𝔼[|Si|]λ)𝟙{g^Sn+1,KgKK>δM}]\displaystyle\leq O\left(\sqrt{\frac{d\log(n)}{\lambda n}}\right)+\mathbb{E}\left[\left(\sqrt{\frac{1}{\lambda(n+1)}\sum_{i=1}^{n+1}(|S_{i}|-\mathbb{E}[|S_{i}|])}+2\sqrt{\frac{\mathbb{E}[|S_{i}|]}{\lambda}}\right)\mathbbm{1}\{\|\hat{g}_{S_{n+1},K}-g^{*}_{K}\|_{K}>\delta_{M}\}\right]
O(dlog(n)λn)+𝔼[|1λ(n+1)i=1n+1(Si𝔼[Si])|]1/2(g^Sn+1,KgKK>δM)1/2\displaystyle\leq O\left(\sqrt{\frac{d\log(n)}{\lambda n}}\right)+\mathbb{E}\left[\left|\frac{1}{\lambda(n+1)}\sum_{i=1}^{n+1}(S_{i}-\mathbb{E}[S_{i}])\right|\right]^{1/2}\mathbb{P}(\|\hat{g}_{S_{n+1},K}-g^{*}_{K}\|_{K}>\delta_{M})^{1/2}
+2𝔼[|Si|]λ(g^Sn+1,KgKK>δM)\displaystyle\hskip 142.26378pt+2\sqrt{\frac{\mathbb{E}[|S_{i}|]}{\lambda}}\mathbb{P}(\|\hat{g}_{S_{n+1},K}-g^{*}_{K}\|_{K}>\delta_{M})
=O(1λdlog(n)n),\displaystyle=O\left(\frac{1}{\lambda}\sqrt{\frac{d\log(n)}{n}}\right),

as desired.

A.5 Proofs for Lipschitz Functions

In this section we prove Proposition 3. Throughout, we make the following set of technical assumptions.

Assumption 4.

There exist constants $C_{X},C_{\Phi},C_{S},C_{f},\rho>0$ such that $\sup_{f\in\mathcal{F}}\sqrt{\mathbb{E}[|f(X_{i})|^{2}]}\leq C_{f}\mathbb{E}[|f(X_{i})|]$, $\inf_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|]\geq\rho$, and, with probability 1, $\|X_{i}\|^{2}_{2}\leq C_{X}p$, $\|\Phi(X_{i})\|^{2}_{2}\leq C_{\Phi}d$, and $|S_{i}|\leq C_{S}$ for all $i$.

The primary technical tool that we will need for the proof is a covering number bound for Lipschitz functions. This result is well known in the prior literature, and we re-state it here for clarity. In the work that follows, we use $B_{p}(0,C):=\{x\in\mathbb{R}^{p}:\|x\|_{2}\leq C\}$ to denote the ball of radius $C$ in $\mathbb{R}^{p}$.

Definition 1.

The covering number 𝒩(,ϵ,)\mathcal{N}(\mathcal{F},\epsilon,\|\cdot\|) of a set \mathcal{F} under norm \|\cdot\| is the minimum number of balls B(f,ϵ)={g:fgϵ}B(f,\epsilon)=\{g\in\mathcal{F}:\|f-g\|\leq\epsilon\} of radius ϵ\epsilon needed to cover \mathcal{F}.

Lemma 7 (Covering number of Lipschitz functions, Theorem 2.7.1 in Vaart and Wellner (1996)).

Let lipL,B1,B2,p:={f:Bp(0,B1)Lip(f)L,fB2}\mathcal{F}^{\textup{lip}}_{L,B_{1},B_{2},p}:=\{f:B_{p}(0,B_{1})\to\mathbb{R}\mid\textup{Lip}(f)\leq L,\ \|f\|_{\infty}\leq B_{2}\} denote the space of bounded Lipschitz functions on Bp(0,B1)B_{p}(0,B_{1}). Then there exists a constant C>0C>0 such that for any ϵ>0\epsilon>0,

log(𝒩(lipL,B1,B2,p,ϵ,))(CB1max{L,B2}ϵ)p.\log(\mathcal{N}(\mathcal{F}^{\textup{lip}}_{L,B_{1},B_{2},p},\epsilon,\|\cdot\|_{\infty}))\leq\left(\frac{CB_{1}\max\{L,B_{2}\}}{\epsilon}\right)^{p}.
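For intuition, consider the simplest instance of this bound (this example is illustrative only and is not used in the proofs below): taking $p=1$ and $B_{1}=B_{2}=L=1$, i.e., $1$-Lipschitz functions on $[0,1]$ that are bounded by one, Lemma 7 gives

\log(\mathcal{N}(\mathcal{F}^{\textup{lip}}_{1,1,1,1},\epsilon,\|\cdot\|_{\infty}))\leq\frac{CB_{1}\max\{L,B_{2}\}}{\epsilon}=\frac{C}{\epsilon},

recovering the classical $1/\epsilon$ growth of the metric entropy of bounded Lipschitz functions on the unit interval.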

In the present context we need to control the behaviour of the Lipschitz fit under a re-weighting $f$. To account for this, we will require a more general covering number bound for a weighted class of Lipschitz functions, which is given in the following lemma. In the work that follows, recall that for a probability measure $Q$, the $L_{2}(Q)$ norm of a function $f$ is defined as $\|f\|_{L_{2}(Q)}:=\mathbb{E}_{X\sim Q}[f(X)^{2}]^{1/2}$.

Lemma 8.

Let $f:B_{p}(0,B_{1})\to\mathbb{R}$ be a fixed function and let $\mathcal{F}^{\textup{wlip}}_{L,B_{1},B_{2},p,f}:=\{fg\mid g:B_{p}(0,B_{1})\to\mathbb{R},\ \textup{Lip}(g)\leq L,\ \|g\|_{\infty}\leq B_{2}\}$ denote the space of bounded Lipschitz functions on $B_{p}(0,B_{1})$ multiplied by $f$. Then, for any probability measure $Q$, there exists a constant $C>0$ such that for any $\epsilon>0$,

log(𝒩(wlipL,B1,B2,p,f,ϵ,L2(Q)))(CB1max{L,B2}fL2(Q)ϵ)p.\log(\mathcal{N}(\mathcal{F}^{\textup{wlip}}_{L,B_{1},B_{2},p,f},\epsilon,\|\cdot\|_{L_{2}(Q)}))\leq\left(\frac{CB_{1}\max\{L,B_{2}\}\|f\|_{L_{2}(Q)}}{\epsilon}\right)^{p}.
Proof.

Recall that we defined lipL,B1,B2,p:={g:Bp(0,B1)Lip(g)L,gB2}\mathcal{F}^{\textup{lip}}_{L,B_{1},B_{2},p}:=\{g:B_{p}(0,B_{1})\to\mathbb{R}\mid\textup{Lip}(g)\leq L,\ \|g\|_{\infty}\leq B_{2}\}. Fix any ϵ>0\epsilon>0 and let AlipL,B1,B2,pA\subseteq\mathcal{F}^{\textup{lip}}_{L,B_{1},B_{2},p} be a minimal \|\cdot\|_{\infty}-norm, ϵ/fL2(Q)\epsilon/\|f\|_{L_{2}(Q)}-covering of lipL,B1,B2,p\mathcal{F}^{\textup{lip}}_{L,B_{1},B_{2},p}. Fix any hlipL,B1,B2,ph\in\mathcal{F}^{\textup{lip}}_{L,B_{1},B_{2},p} and let hAh^{\prime}\in A be such that hhϵ/fL2(Q)\|h-h^{\prime}\|_{\infty}\leq\epsilon/\|f\|_{L_{2}(Q)}. Then,

\|fh-fh^{\prime}\|_{L_{2}(Q)}=\mathbb{E}_{X\sim Q}[|f(X)|^{2}\cdot|h(X)-h^{\prime}(X)|^{2}]^{1/2}\leq\mathbb{E}_{X\sim Q}\left[|f(X)|^{2}\left(\frac{\epsilon}{\|f\|_{L_{2}(Q)}}\right)^{2}\right]^{1/2}=\epsilon.

In particular, we find that {fh:hA}\{fh:h\in A\} is an L2(Q)\|\cdot\|_{L_{2}(Q)}-norm, ϵ\epsilon-covering of wlipL,B1,B2,p,f\mathcal{F}^{\textup{wlip}}_{L,B_{1},B_{2},p,f}. The desired result then immediately follows by applying Lemma 7 to get a bound on |A||A|. ∎

The previous two lemmas only apply to bounded Lipschitz functions with maximum Lipschitz norm $L$. To apply these results in our current setting we will need to bound $\text{Lip}(\hat{g}_{S_{n+1},L})$ and $\|\hat{g}_{S_{n+1},L}\|_{\infty}$. Our next lemma does exactly this.

Lemma 9.

Assume that there exist constants CX,CS>0C_{X},C_{S}>0 such that with probability one Xi22pCX\|X_{i}\|^{2}_{2}\leq pC_{X} and |Si|CS|S_{i}|\leq C_{S} for all ii. Then with probability one, Lip(g^Sn+1,L)CSλ\text{Lip}(\hat{g}_{S_{n+1},L})\leq\frac{C_{S}}{\lambda} and g^Sn+1,LCXpCSλ\|\hat{g}_{S_{n+1},L}\|_{\infty}\leq\frac{\sqrt{C_{X}p}C_{S}}{\lambda} .

Proof.

Since (g^Sn+1,L,β^Sn+1)(\hat{g}_{S_{n+1},L},\hat{\beta}_{S_{n+1}}) is a minimizer of the quantile regression objective we must have

λLip(g^Sn+1,L)\displaystyle\lambda\text{Lip}(\hat{g}_{S_{n+1},L}) 1n+1i=1n+1α(g^Sn+1,L(Xi)+Φ(Xi)β^Sn+1,Si)+λLip(g^Sn+1,L)\displaystyle\leq\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}\left(\hat{g}_{S_{n+1},L}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}},S_{i}\right)+\lambda\text{Lip}(\hat{g}_{S_{n+1},L})
1n+1i=1n+1α(0,Si)+λLip(0)1n+1i=1n+1|Si|CS.\displaystyle\leq\frac{1}{n+1}\sum_{i=1}^{n+1}\ell_{\alpha}(0,S_{i})+\lambda\text{Lip}(0)\leq\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|\leq C_{S}.

This proves the first part of the lemma. To get the second part, note that since $\Phi(\cdot)$ has an intercept term we may assume without loss of generality that $\hat{g}_{S_{n+1},L}(0)=0$. Thus,

g^Sn+1,LCXpsupxBp(0,CXp)g^Sn+1,L(x)g^Sn+1,L(0)x2CXpLip(g^Sn+1,L)CXpCSλ,\|\hat{g}_{S_{n+1},L}\|_{\infty}\leq\sqrt{C_{X}p}\sup_{x\in B_{p}(0,\sqrt{C_{X}p})}\frac{\|\hat{g}_{S_{n+1},L}(x)-\hat{g}_{S_{n+1},L}(0)\|}{\|x\|_{2}}\leq\sqrt{C_{X}p}\text{Lip}(\hat{g}_{S_{n+1},L})\leq\sqrt{C_{X}p}\frac{C_{S}}{\lambda},

as desired. ∎

Our final preliminary lemma gives control over the norm of the linear part of the fit. As we did for RKHS functions above, we state this result under an arbitrary re-weighting $f$.

Lemma 10.

Let $f:\mathcal{X}\to\mathbb{R}$ and $(X_{1},S_{1}),\dots,(X_{n+1},S_{n+1})\stackrel{i.i.d.}{\sim}P$. Assume that there exist constants $C_{X},C_{\Phi},C_{S},C_{f},\rho>0$ such that $\sqrt{\mathbb{E}[|f(X_{i})|^{2}]}\leq C_{f}\mathbb{E}[|f(X_{i})|]$, $\inf_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|]\geq\rho$, and, with probability 1, $\|X_{i}\|_{2}^{2}\leq C_{X}p$, $\|\Phi(X_{i})\|^{2}_{2}\leq C_{\Phi}d$, and $|S_{i}|\leq C_{S}$ for all $i$. Then there exists a constant $c_{\beta}>0$ such that

𝔼[|f(Xi)|𝟙{β^Sn+12>cβpλ}]O(d𝔼[|f(Xi)|]n).\mathbb{E}\left[|f(X_{i})|\mathbbm{1}\left\{\|\hat{\beta}_{S_{n+1}}\|_{2}>c_{\beta}\frac{\sqrt{p}}{\lambda}\right\}\right]\leq O\left(\frac{d\mathbb{E}[|f(X_{i})|]}{n}\right).
Proof.

By Lemma 9 we know that without loss of generality g^Sn+1,LCXpCSλ\|\hat{g}_{S_{n+1},L}\|_{\infty}\leq\frac{\sqrt{C_{X}p}C_{S}}{\lambda}. Moreover, by assumption we have that deterministically 1n+1i=1n+1|Si|CS\frac{1}{n+1}\sum_{i=1}^{n+1}|S_{i}|\leq C_{S}. With these preliminary facts in hand the desired result follows by repeating the proof of Lemma 5. ∎

With these preliminaries out of the way we are now ready to prove Proposition 3.

Proof of Proposition 3.

The main idea of this proof is to show that $\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=g_{L}(X_{i})+\Phi(X_{i})^{\top}\beta\}$ concentrates uniformly around its expectation. Since $\mathbb{E}[\mathbbm{1}\{S=g_{L}(X)+\Phi(X)^{\top}\beta\}]=0$ for any fixed $(g_{L},\beta)$, this will imply that

1n+1i=1n+1𝟙{Si=g^Sn+1,L(Xi)+Φ(Xi)β^Sn+1}𝔼[𝟙{S=g^Sn+1(X)+Φ(X)β^Sn+1}]=0.\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},L}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\cong\mathbb{E}[\mathbbm{1}\{S=\hat{g}_{S_{n+1}}(X)+\Phi(X)^{\top}\hat{\beta}_{S_{n+1}}\}]=0.

We now formalize this idea. Define the event

E:=\left\{\|\hat{g}_{S_{n+1},L}\|_{\infty}\leq\frac{\sqrt{C_{X}p}C_{S}}{\lambda},\ \text{Lip}(\hat{g}_{S_{n+1},L})\leq\frac{C_{S}}{\lambda},\ \|\hat{\beta}_{S_{n+1}}\|_{2}\leq c_{\beta}\frac{\sqrt{p}}{\lambda}\right\}.

By Lemmas 9 and 10 we know that

𝔼[|f(Xi)|𝟙{Si=g^Sn+1,L(Xi)+Φ(Xi)β^Sn+1}]\displaystyle\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},L}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}]
𝔼[|f(Xi)|𝟙{Si=g^Sn+1,L(Xi)+Φ(Xi)β^Sn+1}𝟙{E}]+O(d𝔼[|f(Xi)|]n).\displaystyle\leq\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},L}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\mathbbm{1}\{E\}]+O\left(\frac{d\mathbb{E}[|f(X_{i})|]}{n}\right).

Thus, we just need to focus on what happens on the event EE. By the exchangeability of the quadruples (g^Sn+1,L(Xi),β^Sn+1,Xi,Si)(\hat{g}_{S_{n+1},L}(X_{i}),\hat{\beta}_{S_{n+1}},X_{i},S_{i}) we have

𝔼[|f(Xi)|𝟙{Si=g^Sn+1,L(Xi)+Φ(Xi)β^Sn+1}𝟙{E}]\displaystyle\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},L}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\mathbbm{1}\{E\}]
=𝔼[(1n+1i=1n+1|f(Xi)|𝟙{Si=g^Sn+1,L(Xi)+Φ(Xi)β^Sn+1})𝟙{E}].\displaystyle=\mathbb{E}\left[\left(\frac{1}{n+1}\sum_{i=1}^{n+1}|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1},L}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\right)\mathbbm{1}\{E\}\right].

Let $\delta>0$ be a small constant that we will specify later, and let $h$ denote the tent function

h(x)={0,|x|>δ1δ1|x|,|x|δ.h(x)=\begin{cases}0,\ |x|>\delta\\ 1-\delta^{-1}|x|,\ |x|\leq\delta.\end{cases}

Let 𝒢:={g:g()=gL()+Φ()β,β2cβpλ,gLCXpCSλ,Lip(gL)CSλ}\mathcal{G}:=\{g:g(\cdot)=g_{L}(\cdot)+\Phi(\cdot)^{\top}\beta,\ \|\beta\|_{2}\leq\frac{c_{\beta}\sqrt{p}}{\lambda},\ \|g_{L}\|_{\infty}\leq\frac{\sqrt{C_{X}p}C_{S}}{\lambda},\ \text{Lip}(g_{L})\leq\frac{C_{S}}{\lambda}\} and σ1,,σni.i.d.Unif({±1})\sigma_{1},\dots,\sigma_{n}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}\text{Unif}(\{\pm 1\}). Then,

𝔼[1n+1i=1n+1|f(Xi)|𝟙{Si=g^Sn+1(Xi)+Φ(Xi)β^Sn+1}𝟙{E}]\displaystyle\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})+\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}}\}\mathbbm{1}\{E\}\right]
𝔼[1n+1i=1n+1|f(Xi)|h(Sig^Sn+1(Xi)Φ(Xi)β^Sn+1)𝟙{E}]\displaystyle\leq\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}|f(X_{i})|h(S_{i}-\hat{g}_{S_{n+1}}(X_{i})-\Phi(X_{i})^{\top}\hat{\beta}_{S_{n+1}})\mathbbm{1}\{E\}\right]
𝔼[supg𝒢1n+1i=1n+1|f(Xi)|h(Sig(Xi))𝔼[|f(X1)|h(S1g(X1))]]\displaystyle\leq\mathbb{E}\left[\sup_{g\in\mathcal{G}}\frac{1}{n+1}\sum_{i=1}^{n+1}|f(X_{i})|h(S_{i}-g(X_{i}))-\mathbb{E}[|f(X_{1})|h(S_{1}-g(X_{1}))]\right]
+supg𝒢𝔼[|f(X1)|h(S1g(X1))]\displaystyle\quad+\sup_{g\in\mathcal{G}}\mathbb{E}[|f(X_{1})|h(S_{1}-g(X_{1}))]
\displaystyle\leq 2\mathbb{E}\left[\sup_{g\in\mathcal{G}}\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|h(S_{i}-g(X_{i}))\right]+\sup_{g\in\mathcal{G}}\mathbb{E}[|f(X_{1})|\mathbb{P}(|S_{1}-g(X_{1})|\leq\delta\mid X_{1})]
2δ1𝔼[supg𝒢1n+1i=1n+1σi|f(Xi)|(Sig(Xi))]+O(δ𝔼[|f(Xi)|])\displaystyle\leq 2\delta^{-1}\mathbb{E}\left[\sup_{g\in\mathcal{G}}\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|(S_{i}-g(X_{i}))\right]+O(\delta\mathbb{E}[|f(X_{i})|])
2δ1𝔼[|1n+1i=1n+1σi|f(Xi)|Si|]+2δ1𝔼[supβ:β2cβpλ|1n+1i=1n+1σi|f(Xi)|Φ(Xi)β|]\displaystyle\leq 2\delta^{-1}\mathbb{E}\left[\left|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|S_{i}\right|\right]+2\delta^{-1}\mathbb{E}\left[\sup_{\beta:\|\beta\|_{2}\leq\frac{c_{\beta}\sqrt{p}}{\lambda}}\left|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|\Phi(X_{i})^{\top}\beta\right|\right]
+2δ1𝔼[supgL:gLCXpCSλ,Lip(gL)CSλ|1n+1i=1n+1σi|f(Xi)|gL(Xi)|]+O(δ𝔼[|f(Xi)|]),\displaystyle\quad+2\delta^{-1}\mathbb{E}\left[\sup_{g_{L}:\|g_{L}\|_{\infty}\leq\frac{\sqrt{C_{X}p}C_{S}}{\lambda},\ \text{Lip}(g_{L})\leq\frac{C_{S}}{\lambda}}\left|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|g_{L}(X_{i})\right|\right]+O(\delta\mathbb{E}[|f(X_{i})|]),

where the third inequality follows by symmetrization, and the fourth inequality uses the fact that Si|XiS_{i}|X_{i} has a bounded density, the contraction inequality, and the fact that h()h(\cdot) is δ1\delta^{-1}-Lipschitz.

To conclude the proof we bound each of the first three terms appearing on the last line above. We have that

𝔼[|1n+1i=1n+1σi|f(Xi)|Si|]\displaystyle\mathbb{E}\left[\left|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|S_{i}\right|\right] Var(1n+1i=1n+1σi|f(Xi)|Si)\displaystyle\leq\sqrt{\text{Var}\left(\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|S_{i}\right)}
CS2𝔼[f(Xi)2]1/2n+1=O(𝔼[|f(Xi)|]n), (by Assumption 4),\displaystyle\leq\sqrt{\frac{C_{S}^{2}\mathbb{E}[f(X_{i})^{2}]^{1/2}}{n+1}}=O\left(\sqrt{\frac{\mathbb{E}[|f(X_{i})|]}{n}}\right),\text{ (by Assumption \ref{ass:lip_tech_conditions})},

while

𝔼[supβ:β2cβpλ|1n+1i=1n+1σi|f(Xi)|Φ(Xi)β|]\displaystyle\mathbb{E}\left[\sup_{\beta:\|\beta\|_{2}\leq\frac{c_{\beta}\sqrt{p}}{\lambda}}\left|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|\Phi(X_{i})^{\top}\beta\right|\right] cβpλ𝔼[1n+1i=1n+1σi|f(Xi)|Φ(Xi)2]\displaystyle\leq\frac{c_{\beta}\sqrt{p}}{\lambda}\mathbb{E}\left[\left\|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|\Phi(X_{i})\right\|_{2}\right]
cβpλ𝔼[1n+1i=1n+1σi|f(Xi)|Φ(Xi)22]1/2\displaystyle\leq\frac{c_{\beta}\sqrt{p}}{\lambda}\mathbb{E}\left[\left\|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|\Phi(X_{i})\right\|_{2}^{2}\right]^{1/2}
=O(dp𝔼[|f(Xi)|2]λ2n)\displaystyle=O\left(\sqrt{\frac{dp\mathbb{E}[|f(X_{i})|^{2}]}{\lambda^{2}n}}\right)
=O(dpλ2n𝔼[|f(Xi)|]), (by Assumption 4).\displaystyle=O\left(\sqrt{\frac{dp}{\lambda^{2}n}}\mathbb{E}[|f(X_{i})|]\right),\text{ (by Assumption \ref{ass:lip_tech_conditions})}.

Finally, by Lemma 8 we have the covering number bound

log(𝒩({|f|gL:Bp(0,CXp)gLCXpCSλ,Lip(gL)CSλ},ϵ,L2(PX)))\displaystyle\log\left(\mathcal{N}\left(\left\{|f|g_{L}:B_{p}(0,\sqrt{C_{X}p})\to\mathbb{R}\mid\|g_{L}\|_{\infty}\leq\frac{\sqrt{C_{X}p}C_{S}}{\lambda},\ \text{Lip}(g_{L})\leq\frac{C_{S}}{\lambda}\right\},\epsilon,\|\cdot\|_{L_{2}(P_{X})}\right)\right)
O(p𝔼[f(Xi)2]1/2λϵ)p=O(p𝔼[|f(Xi)|]λϵ)p,\displaystyle\leq O\left(\frac{p\mathbb{E}[f(X_{i})^{2}]^{1/2}}{\lambda\epsilon}\right)^{p}=O\left(\frac{p\mathbb{E}[|f(X_{i})|]}{\lambda\epsilon}\right)^{p},

and so by Dudley’s entropy integral

𝔼[supgL:gLCXpCSλ,Lip(gL)CSλ|1n+1i=1n+1σi|f(Xi)|gL(Xi)|]\displaystyle\mathbb{E}\left[\sup_{g_{L}:\|g_{L}\|_{\infty}\leq\frac{\sqrt{C_{X}p}C_{S}}{\lambda},\ \text{Lip}(g_{L})\leq\frac{C_{S}}{\lambda}}\left|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|f(X_{i})|g_{L}(X_{i})\right|\right] O(p𝔼[|f(Xi)|]log(n)λnmin{1/2,1/p}).\displaystyle\leq O\left(\frac{p\mathbb{E}[|f(X_{i})|]\log(n)}{\lambda n^{\min\{1/2,1/p\}}}\right).

Putting all of these results together gives the final bound

\displaystyle\mathbb{E}\left[|f(X)|\mathbbm{1}\{S=\hat{g}_{S_{n+1}}(X)+\Phi(X)^{\top}\hat{\beta}_{S_{n+1}}\}\mathbbm{1}\{E\}\right]
δ1(O(p𝔼[|f(Xi)|]log(n)λnmin{1/2,1/p})+O(dpλ2n𝔼[|f(Xi)|]))+O(δ𝔼[|f(Xi)|]).\displaystyle\leq\delta^{-1}\left(O\left(\frac{p\mathbb{E}[|f(X_{i})|]\log(n)}{\lambda n^{\min\{1/2,1/p\}}}\right)+O\left(\sqrt{\frac{dp}{\lambda^{2}n}}\mathbb{E}[|f(X_{i})|]\right)\right)+O(\delta\mathbb{E}[|f(X_{i})|]).

The desired result then follows by optimizing over $\delta$: choosing $\delta^{2}$ equal to the ratio of the $\delta^{-1}$ and $\delta$ coefficients balances the two terms and yields the stated bound. ∎

A.6 Proofs for Section 4

In this section we prove the results appearing in Section 4 of the main text. We start with a proof of Theorem 4.

Proof of Theorem 4.

We begin by giving a more careful derivation of the dual program. Recall that our primal optimization problem is

minimizeg\displaystyle\text{minimize}_{g\in\mathcal{F}} (1α)𝟏p+α𝟏q+(n+1)(g)\displaystyle(1-\alpha)\cdot\mathbf{1}^{\top}p+\alpha\cdot\mathbf{1}^{\top}q+(n+1)\cdot\mathcal{R}(g)
s.t. Sig(Xi)pi+qi=0\displaystyle S_{i}-g(X_{i})-p_{i}+q_{i}=0
Sg(Xn+1)pn+1+qn+1=0\displaystyle S-g(X_{n+1})-p_{n+1}+q_{n+1}=0
pi,qi0, 1in+1.\displaystyle p_{i},q_{i}\geq 0,\ 1\leq i\leq n+1.

The Lagrangian for this program is

(1α)𝟏p+α𝟏q+(n+1)(g)+i=1nηi(Sig(Xi)pi+qi)+ηn+1(Sg(Xn+1)pn+1+qn+1)i=1n+1(γipi+ξiqi).\begin{split}&(1-\alpha)\cdot\mathbf{1}^{\top}p+\alpha\cdot\mathbf{1}^{\top}q+(n+1)\cdot\mathcal{R}(g)+\sum_{i=1}^{n}\eta_{i}\left(S_{i}-g(X_{i})-p_{i}+q_{i}\right)\\ &\hskip 56.9055pt+\eta_{n+1}\left(S-g(X_{n+1})-p_{n+1}+q_{n+1}\right)-\sum_{i=1}^{n+1}(\gamma_{i}p_{i}+\xi_{i}q_{i}).\end{split} (A.3)

For ease of notation, let (η):=ming(n+1)(g)i=1n+1ηig(Xi)\mathcal{R}^{*}(\eta):=-\min_{g\in\mathcal{F}}(n+1)\mathcal{R}(g)-\sum_{i=1}^{n+1}\eta_{i}g(X_{i}). Then, minimizing with respect to gg gives,

(1α)𝟏p+α𝟏q(η)+i=1nηiSi+ηn+1Sηp+ηqγpξq.\displaystyle(1-\alpha)\cdot\mathbf{1}^{\top}p+\alpha\cdot\mathbf{1}^{\top}q-\mathcal{R}^{*}\left(\eta\right)+\sum_{i=1}^{n}\eta_{i}S_{i}+\eta_{n+1}S-\eta^{\top}p+\eta^{\top}q-\gamma^{\top}p-\xi^{\top}q.

So, taking derivatives of this function with respect to $p$ and $q$, we arrive at the constraints,

γ\displaystyle\gamma =(1α)𝟏η\displaystyle=(1-\alpha)\cdot\mathbf{1}-\eta
ξ\displaystyle\xi =α𝟏+η.\displaystyle=\alpha\cdot\mathbf{1}+\eta.

Since the only restriction on ξ\xi and γ\gamma is that they are non-negative, this can be simplified to,

η\displaystyle\eta (1α)𝟏\displaystyle\leq(1-\alpha)\cdot\mathbf{1}
η\displaystyle\eta α𝟏.\displaystyle\geq-\alpha\cdot\mathbf{1}.

Thus, we arrive at the desired dual formulation,

maximizeηi=1nηiSi+ηn+1S(η)subject to αηi1α, 1in+1.\begin{split}&\text{maximize}_{\eta}\quad\sum_{i=1}^{n}\eta_{i}S_{i}+\eta_{n+1}S-\mathcal{R}^{*}\left(\eta\right)\\ &\text{subject to }\quad-\alpha\leq\eta_{i}\leq 1-\alpha,\ 1\leq i\leq n+1.\end{split} (A.4)
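As a concrete illustration of this dual (an example only, not used in the remainder of the proof), suppose that $\mathcal{F}=\{\Phi(\cdot)^{\top}\beta:\beta\in\mathbb{R}^{d}\}$ and that $\mathcal{R}(\Phi(\cdot)^{\top}\beta)=\frac{\lambda}{2}\|\beta\|_{2}^{2}$ is a ridge penalty. Minimizing $(n+1)\mathcal{R}(g)-\sum_{i=1}^{n+1}\eta_{i}g(X_{i})$ over $\beta$ gives $\beta=\frac{1}{\lambda(n+1)}\sum_{i=1}^{n+1}\eta_{i}\Phi(X_{i})$, and hence

\mathcal{R}^{*}(\eta)=\frac{1}{2\lambda(n+1)}\left\|\sum_{i=1}^{n+1}\eta_{i}\Phi(X_{i})\right\|_{2}^{2},

so that (A.4) becomes a box-constrained concave quadratic program in $\eta$.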

Now, recall that we used the notation ηS\eta^{S} to denote the dual-optimal η\eta for a particular choice of SS. Assume for the sake of contradiction that there exists S~>S\tilde{S}>S such that ηS~n+1<ηSn+1\eta^{\tilde{S}}_{n+1}<\eta^{S}_{n+1}. Observe that we can write the dual objective as

h(ηS)+SηSn+1,\displaystyle h(\eta^{S})+S\cdot\eta^{S}_{n+1},

where hh does not depend on SS. Our assumption implies that

(S~S)(ηS~n+1ηSn+1)<0,\displaystyle(\tilde{S}-S)\cdot\left(\eta^{\tilde{S}}_{n+1}-\eta^{S}_{n+1}\right)<0,

or equivalently,

S~(ηS~n+1ηSn+1)<S(ηS~n+1ηSn+1).\displaystyle\tilde{S}\cdot\left(\eta^{\tilde{S}}_{n+1}-\eta^{S}_{n+1}\right)<S\cdot\left(\eta^{\tilde{S}}_{n+1}-\eta^{S}_{n+1}\right).

On the other hand, by the optimality of $\eta^{\tilde{S}}$ for the dual program at the input $\tilde{S}$, we have that

h(ηS~)+S~ηS~n+1h(ηS)+S~ηSn+1S~(ηS~n+1ηSn+1)h(ηS)h(ηS~).\displaystyle h(\eta^{\tilde{S}})+\tilde{S}\cdot\eta^{\tilde{S}}_{n+1}\geq h(\eta^{S})+\tilde{S}\cdot\eta^{S}_{n+1}\quad\iff\quad\tilde{S}\cdot\left(\eta^{\tilde{S}}_{n+1}-\eta^{S}_{n+1}\right)\geq h(\eta^{S})-h(\eta^{\tilde{S}}).

Applying our assumption, we conclude that

S(ηS~n+1ηSn+1)>h(ηS)h(ηS~),\displaystyle S\cdot\left(\eta^{\tilde{S}}_{n+1}-\eta^{S}_{n+1}\right)>h(\eta^{S})-h(\eta^{\tilde{S}}),

which by rearranging yields the desired contradiction

\displaystyle h(\eta^{\tilde{S}})+S\cdot\eta^{\tilde{S}}_{n+1}>h(\eta^{S})+S\cdot\eta^{S}_{n+1}.

This contradicts the optimality of $\eta^{S}$ for the dual program at the input $S$, completing the proof. ∎
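To make the dual program (A.4) and the monotonicity just established concrete, the following is a minimal numerical sketch and not the paper's implementation: it assumes the hypothetical ridge penalty from the illustration above (so that $\mathcal{R}^{*}(\eta)$ is an explicit quadratic), simulated data, and the cvxpy modeling library; all variable names are ours.

```python
# Minimal sketch of the dual (A.4) under an assumed ridge penalty R(g) = (lam/2)*||beta||_2^2
# for linear g(x) = Phi(x)^T beta, so that R*(eta) = ||sum_i eta_i Phi(X_i)||_2^2 / (2*lam*(n+1)).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, alpha, lam = 50, 3, 0.1, 1.0
Phi = rng.normal(size=(n + 1, d))   # Phi(X_1), ..., Phi(X_{n+1}); the test point is last
S = rng.normal(size=n)              # calibration scores S_1, ..., S_n

def eta_test_coordinate(S_test):
    """Solve (A.4) with the test score imputed as S_test and return eta_{n+1}."""
    scores = np.append(S, S_test)
    eta = cp.Variable(n + 1)
    conjugate = cp.sum_squares(Phi.T @ eta) / (2 * lam * (n + 1))  # R*(eta) in the ridge case
    problem = cp.Problem(cp.Maximize(scores @ eta - conjugate),
                         [eta >= -alpha, eta <= 1 - alpha])
    problem.solve()
    return eta.value[-1]

# By the argument above, S -> eta_{n+1}^S is non-decreasing, so the dual prediction set
# {y : eta_{n+1}^{S(X_{n+1}, y)} < 1 - alpha} amounts to thresholding the imputed score.
grid = np.linspace(-3.0, 3.0, 13)
etas = [eta_test_coordinate(s) for s in grid]
assert all(b >= a - 1e-4 for a, b in zip(etas, etas[1:]))  # monotone up to solver tolerance
print(np.round(etas, 3))
```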

We now turn to the proof of Proposition 4, which states that the coverage properties of C^dual(Xn+1)\hat{C}_{\text{dual}}(X_{n+1}) are the same as C^(Xn+1)\hat{C}(X_{n+1}).

Proof of Proposition 4.

The proof of this proposition is nearly identical to the proof of Theorem 3, except that instead of examining the first-order conditions of the primal, we investigate the first-order conditions of the Lagrangian (A.3) at the optimal dual variables. We keep the notation of the proof of Theorem 4.

We begin by proving the first statement pertaining to the coverage properties of C^dual(Xn+1)\hat{C}_{\text{dual}}(X_{n+1}). Let (g^Sn+1,pSn+1,qSn+1,ηSn+1,γSn+1i,ξiSn+1)(\hat{g}_{S_{n+1}},p^{S_{n+1}},q^{S_{n+1}},\eta^{S_{n+1}},\gamma^{S_{n+1}}_{i},\xi_{i}^{S_{n+1}}) denote an optimal primal-dual solution at the input S=Sn+1S=S_{n+1}. Recall from the proof of Theorem 4 that the Lagrangian for the optimization is

(1α)𝟏pSn+1+α𝟏qSn+1+(n+1)(g^Sn+1)+i=1n+1ηiSn+1(Sig^Sn+1(Xi)pSn+1i+qSn+1i)\displaystyle(1-\alpha)\cdot\mathbf{1}^{\top}p^{S_{n+1}}+\alpha\cdot\mathbf{1}^{\top}q^{S_{n+1}}+(n+1)\mathcal{R}(\hat{g}_{S_{n+1}})+\sum_{i=1}^{n+1}\eta_{i}^{S_{n+1}}(S_{i}-\hat{g}_{S_{n+1}}(X_{i})-p^{S_{n+1}}_{i}+q^{S_{n+1}}_{i})
i=1n+1(γSn+1pSn+1i+ξiSn+1qiSn+1).\displaystyle\quad\quad-\sum_{i=1}^{n+1}(\gamma^{S_{n+1}}p^{S_{n+1}}_{i}+\xi_{i}^{S_{n+1}}q_{i}^{S_{n+1}}).

Fix any re-weighting function ff\in\mathcal{F}. By assumption we know that strong duality (and thus the KKT conditions) hold. So, by considering the derivative of the Lagrangian in the direction ff and applying the KKT stationarity condition we find that

0=ddϵ(n+1)(g^Sn+1+ϵf)|ϵ=0i=1n+1ηSn+1if(Xi).0=\frac{d}{d\epsilon}(n+1)\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}-\sum_{i=1}^{n+1}\eta^{S_{n+1}}_{i}f(X_{i}). (A.5)

To further unpack this equality, note that the complementary slackness conditions require $p^{S_{n+1}}_{i}\gamma^{S_{n+1}}_{i}=0$ and $q^{S_{n+1}}_{i}\xi^{S_{n+1}}_{i}=0$. Thus, when $S_{i}-\hat{g}_{S_{n+1}}(X_{i})>0$ we must have $\gamma^{S_{n+1}}_{i}=0$, or equivalently $\eta^{S_{n+1}}_{i}=1-\alpha$, and when $S_{i}-\hat{g}_{S_{n+1}}(X_{i})<0$ we must have $\eta^{S_{n+1}}_{i}=-\alpha$. Lastly, when the residual is exactly $0$, the corresponding $\eta^{S_{n+1}}_{i}$ can take any value in $[-\alpha,1-\alpha]$. Plugging these observations into (A.5), we obtain

0\displaystyle 0 =ddϵ(n+1)(g^Sn+1+ϵf)|ϵ=0+i:Si<g^Sn+1(Xi)αf(Xi)\displaystyle=\frac{d}{d\epsilon}(n+1)\cdot\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}+\sum_{i:S_{i}<\hat{g}_{S_{n+1}}(X_{i})}\alpha\cdot f(X_{i})
i:Si>g^Sn+1(Xi)(1α)f(Xi)i:Si=g^Sn+1(Xi)ηiSn+1f(Xi)\displaystyle\quad\quad-\sum_{i:S_{i}>\hat{g}_{S_{n+1}}(X_{i})}(1-\alpha)f(X_{i})-\sum_{i:S_{i}=\hat{g}_{S_{n+1}}(X_{i})}\eta_{i}^{S_{n+1}}f(X_{i})
=ddϵ(n+1)(g^Sn+1+ϵf)|ϵ=0+i:ηSn+1i<1ααf(Xi)\displaystyle=\frac{d}{d\epsilon}(n+1)\cdot\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}+\sum_{i:\eta^{S_{n+1}}_{i}<1-\alpha}\alpha f(X_{i})
i:ηSn+1i=1α(1α)f(Xi)i:Si=g^Sn+1(Xi),ηiSn+1<1α(ηiSn+1+α)f(Xi)\displaystyle\quad\quad-\sum_{i:\eta^{S_{n+1}}_{i}=1-\alpha}(1-\alpha)f(X_{i})-\sum_{i:S_{i}=\hat{g}_{S_{n+1}}(X_{i}),\ \eta_{i}^{S_{n+1}}<1-\alpha}\left(\eta_{i}^{S_{n+1}}+\alpha\right)f(X_{i})
=ddϵ(n+1)(g^Sn+1+ϵf)|ϵ=0+i=1n+1(α𝟙{ηSn+1i=1α})f(Xi)\displaystyle=\frac{d}{d\epsilon}(n+1)\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}+\sum_{i=1}^{n+1}\left(\alpha-\mathbbm{1}\left\{\eta^{S_{n+1}}_{i}=1-\alpha\right\}\right)f(X_{i})
i:Si=g^Sn+1(Xi),ηiSn+1<1α(ηiSn+1+α)f(Xi).\displaystyle\quad\quad-\sum_{i:S_{i}=\hat{g}_{S_{n+1}}(X_{i}),\ \eta_{i}^{S_{n+1}}<1-\alpha}\left(\eta_{i}^{S_{n+1}}+\alpha\right)f(X_{i}).

To relate this stationarity condition to the coverage, note that

𝔼[f(Xn+1)(𝟙{Yn+1C^dual(Xn+1)}(1α))]=𝔼[f(Xn+1)(α𝟙{Yn+1C^dual(Xn+1)})]\displaystyle\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}_{\text{dual}}(X_{n+1})\}-(1-\alpha))]=\mathbb{E}[f(X_{n+1})(\alpha-\mathbbm{1}\{Y_{n+1}\notin\hat{C}_{\text{dual}}(X_{n+1})\})]
=𝔼[f(Xn+1)(α𝟙{ηSn+1n+1=1α})]\displaystyle=\mathbb{E}\left[f(X_{n+1})\left(\alpha-\mathbbm{1}\left\{\eta^{S_{n+1}}_{n+1}=1-\alpha\right\}\right)\right]
=𝔼[1n+1i=1n+1f(Xi)(α𝟙{ηSn+1i=1α})]\displaystyle=\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}f(X_{i})\left(\alpha-\mathbbm{1}\left\{\eta^{S_{n+1}}_{i}=1-\alpha\right\}\right)\right]
=𝔼[ddϵ(g^Sn+1+ϵf)|ϵ=0]+𝔼[1n+1i:Si=g^Sn+1(Xi),ηiSn+1<1α(ηiSn+1+α)f(Xi)].\displaystyle=-\mathbb{E}\left[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}\right]+\mathbb{E}\left[\frac{1}{n+1}\sum_{i:S_{i}=\hat{g}_{S_{n+1}}(X_{i}),\ \eta_{i}^{S_{n+1}}<1-\alpha}\left(\eta_{i}^{S_{n+1}}+\alpha\right)f(X_{i})\right].

Finally, since ηiSn+1[α,1α]\eta_{i}^{S_{n+1}}\in[-\alpha,1-\alpha] the second term above can be bounded as

|𝔼[1n+1i:Si=g^Sn+1(Xi),ηiSn+1<1α(ηiSn+1+α)f(Xi)]|\displaystyle\left|\mathbb{E}\left[\frac{1}{n+1}\sum_{i:S_{i}=\hat{g}_{S_{n+1}}(X_{i}),\ \eta_{i}^{S_{n+1}}<1-\alpha}\left(\eta_{i}^{S_{n+1}}+\alpha\right)f(X_{i})\right]\right| 𝔼[1n+1i=1n+1𝟙{Si=g^Sn+1(Xi)}|f(Xi)|]\displaystyle\leq\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}|f(X_{i})|\right]
=𝔼[𝟙{Si=g^Sn+1(Xi)}|f(Xi)|],\displaystyle=\mathbb{E}[\mathbbm{1}\{S_{i}=\hat{g}_{S_{n+1}}(X_{i})\}|f(X_{i})|],

while when f0f\geq 0, we additionally have the lower bound

𝔼[1n+1i:Si=g^Sn+1(Xi),ηiSn+1<1α(ηiSn+1+α)f(Xi)]\displaystyle\mathbb{E}\left[\frac{1}{n+1}\sum_{i:S_{i}=\hat{g}_{S_{n+1}}(X_{i}),\ \eta_{i}^{S_{n+1}}<1-\alpha}\left(\eta_{i}^{S_{n+1}}+\alpha\right)f(X_{i})\right]
𝔼[1n+1i:Si=g^Sn+1(Xi),ηiSn+1<1α(α+α)f(Xi)]0.\displaystyle\geq\mathbb{E}\left[\frac{1}{n+1}\sum_{i:S_{i}=\hat{g}_{S_{n+1}}(X_{i}),\ \eta_{i}^{S_{n+1}}<1-\alpha}\left(-\alpha+\alpha\right)f(X_{i})\right]\geq 0.

This concludes the proof of the first part of the proposition. For the second part of the proposition one simply notes that for any ff\in\mathcal{F},

𝔼[f(Xn+1)(𝟙{Yn+1C^dual, rand.(Xn+1)}(1α))]\displaystyle\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}_{\text{dual, rand.}}(X_{n+1})\}-(1-\alpha))] =𝔼[f(Xn+1)(𝟙{ηSn+1n+1<U}(1α))]\displaystyle=\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{\eta^{S_{n+1}}_{n+1}<U\}-(1-\alpha))]
=𝔼[f(Xn+1)(𝔼[𝟙{ηSn+1n+1<U}Xn+1,ηSn+1n+1](1α))]\displaystyle=\mathbb{E}[f(X_{n+1})(\mathbb{E}[\mathbbm{1}\{\eta^{S_{n+1}}_{n+1}<U\}\mid X_{n+1},\eta^{S_{n+1}}_{n+1}]-(1-\alpha))]
=𝔼[f(Xn+1)ηSn+1n+1]\displaystyle=-\mathbb{E}[f(X_{n+1})\eta^{S_{n+1}}_{n+1}]
=𝔼[1n+1i=1n+1f(Xn+1)ηSn+1i]\displaystyle=-\mathbb{E}\left[\frac{1}{n+1}\sum_{i=1}^{n+1}f(X_{n+1})\eta^{S_{n+1}}_{i}\right]
=𝔼[ddϵ(g^Sn+1+ϵf)],\displaystyle=-\mathbb{E}\left[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}_{S_{n+1}}+\epsilon f)\right],

where the last equality is simply our first-order condition (A.5). ∎

A.7 Two-sided fitting

Recall that in the main text we defined the two-sided prediction set

C^two-sid.(Xn+1)={y:g^α/2S(Xn+1,y)(Xn+1)S(Xn+1,y)g^1α/2S(Xn+1,y)(Xn+1)}.\hat{C}_{\textup{two-sid.}}(X_{n+1})=\{y:\hat{g}^{\alpha/2}_{S(X_{n+1},y)}(X_{n+1})\leq S(X_{n+1},y)\leq\hat{g}^{1-\alpha/2}_{S(X_{n+1},y)}(X_{n+1})\}.

Analogously to our treatment of one-sided prediction sets in the main text, in this section we outline the coverage properties and a computationally efficient implementation of $\hat{C}_{\textup{two-sid.}}(X_{n+1})$.

To begin, let $\eta^{S,\tau}$ denote an optimal solution to the dual program (4.2) when $\alpha$ is replaced by $1-\tau$. That is, recalling the definition $\mathcal{R}^{*}(\eta):=-\min_{g\in\mathcal{F}}(n+1)\mathcal{R}(g)-\sum_{i=1}^{n+1}\eta_{i}g(X_{i})$, let $\eta^{S,\tau}$ be a solution to

maximizeηn+1i=1nηiSi+ηn+1S(η)subject to(1τ)ηiτ.\begin{split}\underset{\eta\in\mathbb{R}^{n+1}}{\text{maximize}}\quad&\sum_{i=1}^{n}\eta_{i}S_{i}+\eta_{n+1}S-\mathcal{R}^{*}\left(\eta\right)\\ \text{subject to}\quad&-(1-\tau)\leq\eta_{i}\leq\tau.\end{split}

Then, as with our one-sided sets, $\hat{C}_{\textup{two-sid.}}(X_{n+1})$ admits the analogous dual formulation

C^dual, two-sid.(Xn+1)={y:ηS(Xn+1,y),α/2>(1α/2) and ηS(Xn+1,y),1α/2<1α/2},\hat{C}_{\textup{dual, two-sid.}}(X_{n+1})=\{y:\eta^{S(X_{n+1},y),\alpha/2}>-(1-\alpha/2)\text{ and }\eta^{S(X_{n+1},y),1-\alpha/2}<1-\alpha/2\}, (A.6)

and the analogous randomized prediction set

C^dual, two-sid., rand.(Xn+1)={y:ηS(Xn+1,y),α/2>U1 and ηS(Xn+1,y),1α/2<U2},\hat{C}_{\textup{dual, two-sid., rand.}}(X_{n+1})=\{y:\eta^{S(X_{n+1},y),\alpha/2}>U_{1}\text{ and }\eta^{S(X_{n+1},y),1-\alpha/2}<U_{2}\}, (A.7)

where U1Unif([(1α/2),α/2])U_{1}\sim\text{Unif}([-(1-\alpha/2),\alpha/2]) and independently U2Unif([α/2,1α/2])U_{2}\sim\text{Unif}([-\alpha/2,1-\alpha/2]).
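For concreteness, the membership test in (A.6) can be evaluated by solving two small convex programs, one at level $\alpha/2$ and one at level $1-\alpha/2$. The sketch below is illustrative only: it again assumes the hypothetical ridge penalty used in the one-sided sketch above, simulated data, and our own variable names.

```python
# Minimal sketch of the two-sided membership test (A.6) under an assumed ridge penalty,
# so that R*(eta) = ||sum_i eta_i Phi(X_i)||_2^2 / (2*lam*(n+1)).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, alpha, lam = 50, 3, 0.1, 1.0
Phi = rng.normal(size=(n + 1, d))
S = rng.normal(size=n)

def eta_test_coordinate(S_test, tau):
    """Solve the level-tau dual (box constraints -(1 - tau) <= eta_i <= tau); return eta_{n+1}."""
    scores = np.append(S, S_test)
    eta = cp.Variable(n + 1)
    conjugate = cp.sum_squares(Phi.T @ eta) / (2 * lam * (n + 1))
    cp.Problem(cp.Maximize(scores @ eta - conjugate),
               [eta >= -(1 - tau), eta <= tau]).solve()
    return eta.value[-1]

def in_two_sided_set(S_test, tol=1e-4):
    # (A.6): keep y iff eta_{n+1} stays strictly inside the relevant bound at both levels
    # (checked up to a small solver tolerance).
    lower_ok = eta_test_coordinate(S_test, alpha / 2) > -(1 - alpha / 2) + tol
    upper_ok = eta_test_coordinate(S_test, 1 - alpha / 2) < (1 - alpha / 2) - tol
    return lower_ok and upper_ok

print({s: in_two_sided_set(s) for s in (-4.0, 0.0, 4.0)})  # extreme imputed scores are typically excluded
```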

As the next theorem states formally, these two-sided prediction sets have coverage properties identical to those of their one-sided analogs.

Theorem 5.

Let \mathcal{F} be any vector space, and assume that for all f,gf,g\in\mathcal{F}, the derivative of ϵ(g+ϵf)\epsilon\mapsto\mathcal{R}(g+\epsilon f) exists. If ff is non-negative with 𝔼P[f(X)]>0\mathbb{E}_{P}[f(X)]>0, then the unrandomized prediction set given by (A.6) satisfies the lower bound

f(Yn+1C^dual, two-sid.(Xn+1))1α\displaystyle\mathbb{P}_{f}(Y_{n+1}\in\hat{C}_{\textup{dual, two-sid.}}(X_{n+1}))\geq 1-\alpha 1𝔼P[f(X)]𝔼[ddϵ(g^1α/2Sn+1+ϵf)|ϵ=0]\displaystyle-\frac{1}{\mathbb{E}_{P}[f(X)]}\mathbb{E}\left[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}^{1-\alpha/2}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}\right]
1𝔼P[f(X)]𝔼[ddϵ(g^α/2Sn+1+ϵf)|ϵ=0].\displaystyle-\frac{1}{\mathbb{E}_{P}[f(X)]}\mathbb{E}\left[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}^{\alpha/2}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}\right].

On the other hand, suppose (X1,Y1),,(Xn+1,Yn+1)i.i.d.P(X_{1},Y_{1}),\dots,(X_{n+1},Y_{n+1})\stackrel{{\scriptstyle i.i.d.}}{{\sim}}P. Then, for all ff\in\mathcal{F}, we additionally have the two-sided bound,

𝔼[f(Xn+1)(𝟙{Yn+1C^dual, two-sid.(Xn+1)}(1α))]=1𝔼P[f(X)]𝔼[ddϵ(g^1α/2Sn+1+ϵf)|ϵ=0]1𝔼P[f(X)]𝔼[ddϵ(g^α/2Sn+1+ϵf)|ϵ=0]+ϵint,\begin{split}&\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}_{\textup{dual, two-sid.}}(X_{n+1})\}-(1-\alpha))]\\ &=-\frac{1}{\mathbb{E}_{P}[f(X)]}\mathbb{E}\left[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}^{1-\alpha/2}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}\right]-\frac{1}{\mathbb{E}_{P}[f(X)]}\mathbb{E}\left[\frac{d}{d\epsilon}\mathcal{R}(\hat{g}^{\alpha/2}_{S_{n+1}}+\epsilon f)\bigg{|}_{\epsilon=0}\right]+\epsilon_{\textup{int}},\end{split} (A.8)

where ϵint\epsilon_{\textup{int}} is an interpolation error term satisfying |ϵint|𝔼[|f(Xi)|𝟙{Si=g^1α/2Sn+1(Xi)}]+𝔼[|f(Xi)|𝟙{Si=g^α/2Sn+1(Xi)}]|\epsilon_{\textup{int}}|\leq\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}^{1-\alpha/2}_{S_{n+1}}(X_{i})\}]+\mathbb{E}[|f(X_{i})|\mathbbm{1}\{S_{i}=\hat{g}^{\alpha/2}_{S_{n+1}}(X_{i})\}]. Furthermore, the same results also hold for the randomized set (A.7) with ϵint\epsilon_{\text{int}} replaced by 0.

Proof.

Note that

𝔼[f(Xn+1)(𝟙{Yn+1C^dual, two-sid.(Xn+1)}(1α))]\displaystyle\mathbb{E}[f(X_{n+1})(\mathbbm{1}\{Y_{n+1}\in\hat{C}_{\textup{dual, two-sid.}}(X_{n+1})\}-(1-\alpha))]
=𝔼[f(Xn+1)(α𝟙{Yn+1C^dual, two-sid.(Xn+1)})]\displaystyle=\mathbb{E}[f(X_{n+1})(\alpha-\mathbbm{1}\{Y_{n+1}\notin\hat{C}_{\textup{dual, two-sid.}}(X_{n+1})\})]
=𝔼[f(Xn+1)(α/2𝟙{ηα/2,Sn+1n+1=(1α/2)})]+𝔼[f(Xn+1)(α/2𝟙{η1α/2,Sn+1n+1=(1α/2)})].\displaystyle=\mathbb{E}[f(X_{n+1})(\alpha/2-\mathbbm{1}\{\eta^{\alpha/2,S_{n+1}}_{n+1}=-(1-\alpha/2)\})]+\mathbb{E}[f(X_{n+1})(\alpha/2-\mathbbm{1}\{\eta^{1-\alpha/2,S_{n+1}}_{n+1}=(1-\alpha/2)\})].

The result then follows by repeating the steps of the proof of Proposition 4 twice to bound the two terms above separately. A similar argument establishes the coverage of $\hat{C}_{\textup{dual, two-sid., rand.}}(X_{n+1})$. ∎

A.8 Proofs of additional technical lemmas

Lemma 11 (Lipschitz property of the pinball loss).

The pinball loss is 1-Lipschitz in the sense that for any y1,y2,y3,y4y_{1},y_{2},y_{3},y_{4}\in\mathbb{R},

|α(y1,y2)α(y3,y4)||(y1y2)(y3y4)|.|\ell_{\alpha}(y_{1},y_{2})-\ell_{\alpha}(y_{3},y_{4})|\leq|(y_{1}-y_{2})-(y_{3}-y_{4})|.
Proof.

We will show that α(y1,y2)α(y3,y4)|(y1y2)(y3y4)|\ell_{\alpha}(y_{1},y_{2})-\ell_{\alpha}(y_{3},y_{4})\leq|(y_{1}-y_{2})-(y_{3}-y_{4})|. The reverse inequality will then follow by symmetry. There are four cases.

Case 1:

y1y2,y3y4y_{1}\geq y_{2},\ y_{3}\geq y_{4}.

α(y1,y2)α(y3,y4)=α(y1y2)α(y3y4)|(y1y2)(y3y4)|.\ell_{\alpha}(y_{1},y_{2})-\ell_{\alpha}(y_{3},y_{4})=\alpha(y_{1}-y_{2})-\alpha(y_{3}-y_{4})\leq|(y_{1}-y_{2})-(y_{3}-y_{4})|.
Case 2:

y1<y2,y3<y4y_{1}<y_{2},\ y_{3}<y_{4}.

α(y1,y2)α(y3,y4)=(1α)(y2y1)(1α)(y4y3)|(y1y2)(y3y4)|.\ell_{\alpha}(y_{1},y_{2})-\ell_{\alpha}(y_{3},y_{4})=(1-\alpha)(y_{2}-y_{1})-(1-\alpha)(y_{4}-y_{3})\leq|(y_{1}-y_{2})-(y_{3}-y_{4})|.
Case 3:

y1y2,y3<y4y_{1}\geq y_{2},\ y_{3}<y_{4}.

α(y1,y2)α(y3,y4)\displaystyle\ell_{\alpha}(y_{1},y_{2})-\ell_{\alpha}(y_{3},y_{4}) =α(y1y2)(1α)(y4y3)\displaystyle=\alpha(y_{1}-y_{2})-(1-\alpha)(y_{4}-y_{3})
=α(y1y2(y3y4))+(y3y4)|(y1y2)(y3y4)|.\displaystyle=\alpha(y_{1}-y_{2}-(y_{3}-y_{4}))+(y_{3}-y_{4})\leq|(y_{1}-y_{2})-(y_{3}-y_{4})|.
Case 4:

y1<y2,y3y4y_{1}<y_{2},\ y_{3}\geq y_{4}.

α(y1,y2)α(y3,y4)\displaystyle\ell_{\alpha}(y_{1},y_{2})-\ell_{\alpha}(y_{3},y_{4}) =(1α)(y2y1)α(y3y4)\displaystyle=(1-\alpha)(y_{2}-y_{1})-\alpha(y_{3}-y_{4})
\displaystyle=(1-\alpha)(y_{2}-y_{1}-(y_{4}-y_{3}))+(y_{4}-y_{3})\leq|(y_{1}-y_{2})-(y_{3}-y_{4})|. ∎
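As a quick numerical sanity check of Lemma 11 (separate from the proof above), the sketch below assumes the convention $\ell_{\alpha}(y_{1},y_{2})=\alpha(y_{1}-y_{2})_{+}+(1-\alpha)(y_{2}-y_{1})_{+}$ that is implicit in the case analysis.

```python
# Numerical sanity check of the 1-Lipschitz property of the pinball loss (Lemma 11),
# assuming the convention ell_alpha(y1, y2) = alpha*(y1 - y2)^+ + (1 - alpha)*(y2 - y1)^+.
import numpy as np

def pinball(y1, y2, alpha):
    return alpha * max(y1 - y2, 0.0) + (1 - alpha) * max(y2 - y1, 0.0)

rng = np.random.default_rng(1)
alpha = 0.3
for _ in range(10_000):
    y1, y2, y3, y4 = rng.normal(size=4)
    gap = abs(pinball(y1, y2, alpha) - pinball(y3, y4, alpha))
    assert gap <= abs((y1 - y2) - (y3 - y4)) + 1e-12  # the bound of Lemma 11
```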

Lemma 12.

Let $\Phi(X_{1}),\dots,\Phi(X_{n+1})\in\mathbb{R}^{d}$ be i.i.d. and assume that there exist constants $C_{2},c_{2},\rho>0$ such that $\sup_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|^{2}]^{1/2}\leq c_{2}$, $\mathbb{E}[\|\Phi(X_{i})\|^{2}]\leq C_{2}d$, and $\inf_{\beta:\|\beta\|_{2}=1}\mathbb{E}[|\Phi(X_{i})^{\top}\beta|]\geq\rho$. Then there exist constants $c,c^{\prime}>0$ (depending only on $C_{2}$, $c_{2}$, and $\rho$) such that

(infβ:β2=11n+1i=1n+1|Φ(Xi)β|c)1cd2(n+1)2\mathbb{P}\left(\inf_{\beta:\|\beta\|_{2}=1}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|\geq c\right)\geq 1-c^{\prime}\frac{d^{2}}{(n+1)^{2}}
Proof.

The main idea of this proof is to apply Theorem 5.4 of Mendelson (2014) to conclude that, for some constant $a>0$, $|\{i:|\Phi(X_{i})^{\top}\beta|>a\}|$ is large uniformly in $\beta$. To apply this theorem we need to check two technical conditions. Namely, we need to show that the class of functions $x\mapsto|\Phi(x)^{\top}\beta|$ has bounded Rademacher complexity and that $\mathbb{P}(|\Phi(X_{i})^{\top}\beta|>\Omega(a))$ is not too small. We now check these conditions.

Let σ1,,σn+1\sigma_{1},\dots,\sigma_{n+1} denote i.i.d. Rademacher random variables. Then, the Rademacher complexity of x|Φ(x)β|x\mapsto|\Phi(x)^{\top}\beta| can be bounded as

𝔼[supβ:β2=1|1n+1i=1n+1σi|Φ(Xi)β||]\displaystyle\mathbb{E}\left[\sup_{\beta:\|\beta\|_{2}=1}\left|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}|\Phi(X_{i})^{\top}\beta|\right|\right] =𝔼[1n+1i=1n+1σiΦ(Xi)2]\displaystyle=\mathbb{E}\left[\left\|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}\Phi(X_{i})\right\|_{2}\right]
𝔼[1n+1i=1n+1σiΦ(Xi)22]1/2C2dn+1.\displaystyle\leq\mathbb{E}\left[\left\|\frac{1}{n+1}\sum_{i=1}^{n+1}\sigma_{i}\Phi(X_{i})\right\|_{2}^{2}\right]^{1/2}\leq\sqrt{\frac{C_{2}d}{n+1}}.

This verifies the first technical condition. For the second condition, note that by the Paley–Zygmund inequality (applied to the non-negative random variable $|\Phi(X_{i})^{\top}\beta|$ with threshold equal to half its mean, together with $\mathbb{E}[|\Phi(X_{i})^{\top}\beta|]\geq\rho$),

(|Φ(Xi)β|>12ρ)𝔼[|Φ(Xi)β|]24𝔼[|Φ(Xi)β|2]ρ24c22.\displaystyle\mathbb{P}\left(|\Phi(X_{i})^{\top}\beta|>\frac{1}{2}\rho\right)\geq\frac{\mathbb{E}[|\Phi(X_{i})^{\top}\beta|]^{2}}{4\mathbb{E}[|\Phi(X_{i})^{\top}\beta|^{2}]}\geq\frac{\rho^{2}}{4c^{2}_{2}}.

Thus, by Theorem 5.4 in Mendelson (2014), there exist constants $a,b>0$ such that, with probability at least $1-c^{\prime}d^{2}/(n+1)^{2}$,

\inf_{\beta:\|\beta\|=1}|\{i:|\Phi(X_{i})^{\top}\beta|>a\}|\geq(n+1)b,

and thus in particular,

infβ:β=11n+1i=1n+1|Φ(Xi)β|ab.\inf_{\beta:\|\beta\|=1}\frac{1}{n+1}\sum_{i=1}^{n+1}|\Phi(X_{i})^{\top}\beta|\geq ab.

Taking c=abc=ab gives the desired result. ∎