Conformal Prediction with Conditional Guarantees
Abstract
We consider the problem of constructing distribution-free prediction sets with finite-sample conditional guarantees. Prior work has shown that it is impossible to provide exact conditional coverage universally in finite samples. Thus, most popular methods only guarantee marginal coverage over the covariates or are restricted to a limited set of conditional targets, e.g. coverage over a finite set of pre-specified subgroups. This paper bridges this gap by defining a spectrum of problems that interpolate between marginal and conditional validity. We motivate these problems by reformulating conditional coverage as coverage over a class of covariate shifts. When the target class of shifts is finite-dimensional, we show how to simultaneously obtain exact finite-sample coverage over all possible shifts. For example, given a collection of subgroups, our prediction sets guarantee coverage over each group. For more flexible, infinite-dimensional classes where exact coverage is impossible, we provide a procedure for quantifying the coverage errors of our algorithm. Moreover, by tuning interpretable hyperparameters, we allow the practitioner to control the size of these errors across shifts of interest. Our methods can be incorporated into existing split conformal inference pipelines, and thus can be used to quantify the uncertainty of modern black-box algorithms without distributional assumptions.
Keywords— conditional coverage, conformal inference, covariate shift, distribution-free prediction, prediction sets, black-box uncertainty quantification.
1 Introduction
Consider a training dataset $(X_1, Y_1), \dots, (X_n, Y_n)$ and a test point $(X_{n+1}, Y_{n+1})$, all drawn i.i.d. from an unknown, arbitrary distribution $P$. We study the problem of using the observed data to construct a prediction set $\hat{C}(X_{n+1})$ that includes $Y_{n+1}$ with probability $1-\alpha$ over the randomness in the training and test points.
Ideally, we would like this prediction set to meet three competing goals: it should (1) make no assumptions on the underlying data-generating mechanism, (2) be valid in finite samples, and (3) satisfy the conditional coverage guarantee, $\mathbb{P}(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1}) \geq 1-\alpha$. Unfortunately, prior work has shown that it is impossible to obtain all three of these conditions simultaneously (Vovk (2012), Barber et al. (2020)). Perhaps the closest method to achieving these goals is conformal prediction, which relaxes the third criterion to the marginal coverage guarantee, $\mathbb{P}(Y_{n+1} \in \hat{C}(X_{n+1})) \geq 1-\alpha$.
The gap between conditional and marginal coverage can be extremely consequential in high-stakes decision-making. Marginal validity does not preclude substantial variation in coverage among relevant subpopulations. For example, a conformal prediction set for predicting a drug candidate’s binding affinity achieves marginal coverage even if it underestimates the predictive uncertainty over the most promising subset of compounds. In human-centered applications, marginally valid prediction sets can be untrustworthy for certain legally protected groups (e.g., those defined by sensitive attributes such as race, gender, age, etc.) (Romano et al. (2020a)).
Achieving practical, finite-sample results requires weakening our desideratum from exact conditional coverage. In this article, we pursue a goal that is motivated by the following equivalence:
$$\mathbb{P}(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1}) = 1-\alpha \iff \mathbb{E}\big[f(X_{n+1})\big(\mathbf{1}\{Y_{n+1} \in \hat{C}(X_{n+1})\} - (1-\alpha)\big)\big] = 0, \ \text{for all measurable } f.$$
In particular, we define a relaxed coverage objective by replacing “all measurable $f$” with “all $f$ belonging to some (potentially infinite) class $\mathcal{F}$.” At the least complex end, taking $\mathcal{F}$ to be the class of constant functions recovers marginal validity, while richer intermediate choices of $\mathcal{F}$ interpolate between marginal and conditional coverage.
This generic approach to relaxing conditional validity was first popularized by the “conditional-to-marginal” moment testing literature (Andrews and Shi (2013)). Our relaxation is also referred to as a “multi-accuracy” objective in theoretical computer science (Hébert-Johnson et al. (2018), Kim et al. (2019)). We remark that Deng et al. (2023) have concurrently proposed the same objective for the conditional coverage problem. However, their results focus on the infinite-data regime, where the distribution of $(X, Y)$ is known exactly, and their algorithm requires access to an unspecified black-box optimization oracle.
By contrast, our proposed method preserves the attractive assumption-free and computationally efficient properties of split conformal prediction. Emulating split conformal, we design our procedure as a wrapper that takes any black-box machine learning model as input. We then compute conformity scores that measure the accuracy of this model’s predictions on new test points. Finally, by calibrating bounds for these scores, we obtain prediction sets with finite-sample conditional guarantees. Figure 1 displays our workflow.
Unlike split conformal, our approach adaptively compensates for poor initial modeling decisions. In particular, if the prediction rule and conformity scores were well-designed at the outset, our procedure may only make small adjustments. This could happen, for instance, if the scores were derived from a well-specified parametric model. More often, however, the user will begin with an inaccurate or incomplete model that fails to fully capture the distribution of $Y$ given $X$. In these cases, our procedure will improve on split conformal by recalibrating the conformity score to provide exact conditional guarantees. In the predictive inference literature, conformal inference is often described as a protective layer that lies on top of a black-box machine learning model and transforms its point predictions into valid prediction sets. With this in mind, one might view our method as an additional protective layer that lies on top of a conformal method and transforms its (potentially poor) marginally valid outputs into richer, conditionally valid prediction sets.
A number of prior works have also considered either modifying the split conformal calibration step (Lei and Wasserman (2014), Guan (2022), Barber et al. (2023)) or the initial prediction rule (Romano et al. (2019), Sesia and Romano (2021), Chernozhukov et al. (2021)) to better model the distribution of $Y$ given $X$. Crucially, despite heuristic improvements in the quality of the resulting prediction sets, all of the aforementioned approaches obtain at most weak asymptotic guarantees that rely on slow, non-parametric convergence rates.
Perhaps the only procedures to obtain a practical coverage guarantee are those of Barber et al. (2020), Vovk et al. (2003), and Jung et al. (2023). All of these methods guarantee a form of group-conditional coverage, i.e. $\mathbb{P}(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1} \in G) \geq 1-\alpha$ for all sets $G$ in some pre-specified class $\mathcal{G}$. However, the approach of Barber et al. (2020) can be computationally infeasible and severely conservative, yielding wide intervals with coverage probability far above the target level. On the other hand, the method of Vovk et al. (2003), Mondrian conformal prediction, provides exact coverage in finite samples, but does not allow the groups in $\mathcal{G}$ to overlap. Finally, Jung et al. (2023) propose running quantile regression over the linear function class spanned by the group indicators $\{\mathbf{1}\{x \in G\} : G \in \mathcal{G}\}$. This method is both practical and allows for overlapping groups. In this work, we propose a new method for achieving conditional coverage that improves upon the method of Jung et al. (2023) in three ways: (1) by providing tighter finite-sample coverage, (2) by requiring no assumptions on the data-generating distribution (in particular, unlike Jung et al. (2023) we allow for discrete outcomes), and (3) by providing conditional coverage guarantees far beyond the group setting.
1.1 Preview of contributions
To motivate and summarize the main contributions of our paper, we preview some applications. In particular, we show how our method can be used to satisfy two popular coverage desiderata: group-conditional coverage and coverage under covariate shift. Additionally, we demonstrate the improved finite sample performance of our method compared to previous approaches.
1.1.1 Group-conditional coverage

Group-conditional coverage requires that $\hat{C}(\cdot)$ satisfy $\mathbb{P}(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1} \in G) \geq 1-\alpha$ for all $G$ belonging to some collection $\mathcal{G}$ of pre-specified (potentially overlapping) groups (Barber et al. (2020)). This corresponds to a special case of our guarantee in which $\mathcal{F}$ is the linear span of the group indicators $\{\mathbf{1}\{x \in G\} : G \in \mathcal{G}\}$.
Figure 2 illustrates the coverage guarantee on a simulated dataset. Here, $X$ is univariate and we have taken the groups to be the collection of all sub-intervals with endpoints belonging to a pre-specified finite grid. Two of these sub-intervals are shaded in grey. We compare two procedures, split conformal prediction and our conditional calibration method. As is standard, we implement split conformal using the conformity score $S(x, y) = |y - \hat{\mu}(x)|$, where $\hat{\mu}$ is an estimate of the conditional mean $\mathbb{E}[Y \mid X = x]$, while for our method, we take a two-sided approach in which upper and lower bounds on $Y_{n+1}$ are computed separately. We see that while split conformal prediction only provides marginal validity, our method returns prediction sets that are adaptive to the shape of the data and thus obtain exact coverage over all subgroups.

In this paper, we improve upon existing group-conditional coverage results in two crucial aspects: (1) we obtain tighter finite-sample coverage and (2) we make no assumptions on the distribution of the data or the overlap of the groups in $\mathcal{G}$. Concretely, given an arbitrary finite collection of groups $\mathcal{G}$, our randomized conditional calibration method guarantees exact coverage, $\mathbb{P}(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1} \in G) = 1-\alpha$ for every $G \in \mathcal{G}$, while our unrandomized procedure obeys matching lower and upper bounds that sandwich the group-conditional coverage around the target level (Corollary 1).
Figure 3 shows the improved finite-sample coverage of these methods. For simplicity, this plot only displays the marginal miscoverage. Boxplots in the figure show the estimated distributions of the calibration-conditional miscoverage and length, i.e. the quantities $\mathbb{P}(Y_{n+1} \notin \hat{C}(X_{n+1}) \mid \{(X_i, Y_i)\}_{i=1}^n)$ and $\mathbb{E}[\text{length}(\hat{C}(X_{n+1})) \mid \{(X_i, Y_i)\}_{i=1}^n]$, as the sample size varies. In agreement with our theory, we find that the unrandomized version of our procedure guarantees conservative coverage, while our randomized variant offers exact coverage regardless of the sample size. On the other hand, the method of Jung et al. (2023), i.e. linear quantile regression over the set of subgroup-inclusion indicator functions, can severely undercover even at what might be considered large sample sizes.
1.1.2 Coverage under covariate shift
Given an appropriate choice of $\mathcal{F}$, our prediction set also achieves coverage under covariate shift. To define this objective, fix any non-negative function $f$ and let $\mathbb{P}_f$ denote the setting in which $(X_i, Y_i)_{1 \le i \le n}$ is sampled i.i.d. from $P$, while $(X_{n+1}, Y_{n+1})$ is sampled independently from the distribution in which the marginal of $X$ is “tilted” by $f$, i.e., $X_{n+1}$ is drawn from the distribution with density proportional to $f$ against $P_X$, and $Y_{n+1} \mid X_{n+1}$ follows the same conditional distribution as under $P$.
Then, our method guarantees coverage under $\mathbb{P}_f$ so long as $f \in \mathcal{F}$. For example, when $\mathcal{F}$ is a finite-dimensional linear function class, our prediction set satisfies $\mathbb{P}_f(Y_{n+1} \in \hat{C}(X_{n+1})) \geq 1-\alpha$ for all non-negative $f \in \mathcal{F}$.
There have been a number of previous works in this area that also establish coverage under covariate shift. However, these works assume that there is a single covariate shift of interest that is either known a priori (Tibshirani et al. (2019)) or estimated from unlabeled data (Qiu et al. (2022), Yang et al. (2022)). Our method captures this setting as a special case in which $\mathcal{F}$ is chosen to be the span of a single shift function. On the other hand, when $\mathcal{F}$ is larger, our guarantee is more general and ensures coverage over all shifts $f \in \mathcal{F}$, simultaneously.

Figure 4 illustrates a simple example of this guarantee on a synthetic dataset. Once again, the covariate is a scalar and the conformity score is taken to be the absolute residual; following the previous example, we implement the two-sided version of our method. We consider five covariate shifts in which the distribution of $X$ is tilted by a Gaussian density centered at five different locations. Two of these shifts are plotted in grey in the figure. As the left-most panels show, split conformal gives a constant-width prediction band over the entire x-axis, while our method adapts to the shape of the data around the covariate shifts. The right-most panel of the figure validates that this correction is sufficient to expand the marginal coverage guarantee of split conformal inference to exact coverage under all three displayed scenarios: no shift and the two plotted Gaussian shifts.

Extending even further beyond existing approaches, our method can also provide guarantees even when no prior information is known about the shift. By fitting over a so-called “universal” function class, e.g., taking $\mathcal{F}$ to be a suitable reproducing kernel Hilbert space, we provide coverage guarantees under any covariate shift. Due to the complexity of these classes, our coverage guarantee is no longer exactly $1-\alpha$ for all tilts $f$. Instead, in its place, we obtain an exact estimate of the (improved) finite-sample coverage of our method. For example, as seen in Figure 5, if we run our shift-agnostic method for the two plotted Gaussian tilts, our estimated coverage precisely matches the empirically observed values. Thus, even though we cannot guarantee exact coverage for all shifts in this example, we are still able to accurately report the true performance that users can expect from our method. For more information on how these estimates are computed we refer the interested reader to Section 3 and, in particular, to (3.4) and Proposition 2.
1.2 Outline
The remainder of this article is structured as follows. In Section 2, we introduce our method and give coverage results for the case in which $\mathcal{F}$ is finite-dimensional. These results are expanded on in Section 3, where we consider infinite-dimensional classes. Computational difficulties that arise in both the finite- and infinite-dimensional cases are addressed in Section 4, and an efficient implementation of our method is given. In Section 5, we apply our method to two datasets and show that our approach attains tighter finite-sample conditional coverage and more predictable failure modes than competing alternatives. Finally, we conclude in Section 6 with a discussion of considerations that arise when choosing the function class (and associated regularization) to use in our method.
A Python package, conditionalconformal, implementing our methods is available on PyPI, and notebooks reproducing the experimental results in this paper can be found at github.com/jjcherian/conditional-conformal.
Notation: In this paper, we consider two settings. In the first, we take $(X_i, Y_i)_{1 \le i \le n+1} \stackrel{\text{i.i.d.}}{\sim} P$. In the second, $(X_i, Y_i)_{1 \le i \le n} \stackrel{\text{i.i.d.}}{\sim} P$, while $X_{n+1}$ is sampled independently from the tilted distribution with density proportional to $f$ against the marginal of $X$, and $Y_{n+1} \mid X_{n+1}$ follows the same conditional distribution as under $P$. We write $\mathbb{P}$ and $\mathbb{E}$ with no subscript when referring to the first scenario, while we use the subscript $f$ to denote the second. Additionally, note that throughout this article we use $\mathcal{X} \times \mathcal{Y}$ to denote the domain of the pairs $(X_i, Y_i)$.
2 Protection against finite dimensional shifts
2.1 Warm-up: marginal coverage
As a starting point to motivate our approach, we show that split conformal prediction is a special case of our method. Before explaining this result in detail, it is useful to first review the details of the split conformal algorithm.
Recall that the conformity score function $S(\cdot, \cdot)$ measures how well the prediction of some model at $x$ “conforms” to the target $y$. For instance, given an estimate $\hat{\mu}$ of $\mathbb{E}[Y \mid X]$, we may take $S(x, y)$ to be the absolute residual $|y - \hat{\mu}(x)|$. In a typical implementation of split conformal, we would need to split the training data into two parts, using one part to train $\hat{\mu}$ and reserving the second part as the calibration set. Because our method provides the same coverage guarantees regardless of the initial choice of $S$, we will not discuss this first step in detail. Instead, we will assume that the conformity score function is fixed, and we are free to use the entire dataset to calibrate the scores. In practice, the initial step of fitting the conformity score can be critical for getting a good baseline predictor. For example, in our experiment in Section 5.3, we use a neural network to obtain an initial prediction of the genetic treatment given to cells based on fluorescent microscopy images.
Given a conformity score function and calibration set, the split conformal algorithm outputs the set of values $y$ for which $S(X_{n+1}, y)$ is sufficiently small, i.e., the set of values that conform with the prediction at $X_{n+1}$. The threshold for this prediction set, which we denote by $\hat{q}$, is set to be the $\lceil (n+1)(1-\alpha) \rceil$-th smallest of the conformity scores evaluated on the calibration set. In summary, the split conformal prediction set is formally defined as
$$\hat{C}(X_{n+1}) = \{y : S(X_{n+1}, y) \le \hat{q}\}. \qquad (2.1)$$
The standard method for proving the marginal coverage of $\hat{C}(X_{n+1})$ is to appeal to the exchangeability of the conformity scores. Namely, let $S_1, \dots, S_{n+1}$ denote the scores $S(X_1, Y_1), \dots, S(X_{n+1}, Y_{n+1})$. Since the $(n+1)$-th conformity score is drawn i.i.d. from the same distribution as the first $n$ scores, the location of $S_{n+1}$ among the order statistics of $S_1, \dots, S_{n+1}$ is drawn uniformly at random from each of the $n+1$ possible indices. So, recalling that $\hat{q}$ is chosen to be the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score, we arrive at the coverage guarantee $\mathbb{P}(S_{n+1} \le \hat{q}) \ge \lceil (n+1)(1-\alpha) \rceil / (n+1) \ge 1-\alpha$. The following theorem summarizes the formal consequences of these observations.
Theorem 1 (Romano et al. (2019), Theorem 1, see also Vovk et al. (2005)).
Assume that $(X_i, Y_i)_{1 \le i \le n+1}$ are independent and identically distributed. Then, the split conformal prediction set (2.1) satisfies $\mathbb{P}(Y_{n+1} \in \hat{C}(X_{n+1})) \geq 1-\alpha$.
If $S(X_{n+1}, Y_{n+1})$ has a continuous distribution, it also holds that $\mathbb{P}(Y_{n+1} \in \hat{C}(X_{n+1})) \leq 1-\alpha + \frac{1}{n+1}$.
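For concreteness, the following minimal sketch implements this split conformal recipe for the absolute-residual score. The predictor `predict` and all variable names are illustrative stand-ins rather than the paper's own implementation.

```python
import numpy as np

def split_conformal_interval(predict, X_calib, y_calib, X_test, alpha=0.1):
    """Split conformal prediction with the absolute-residual score S(x, y) = |y - predict(x)|.

    predict : any pre-fitted point predictor (a stand-in for the black-box model).
    Returns lower and upper interval endpoints for each test covariate.
    """
    scores = np.abs(y_calib - predict(X_calib))      # conformity scores on the calibration set
    n = len(scores)
    # Threshold: the ceil((n + 1)(1 - alpha))-th smallest calibration score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q_hat = np.sort(scores)[min(k, n) - 1]
    center = predict(X_test)
    return center - q_hat, center + q_hat            # C(x) = {y : |y - predict(x)| <= q_hat}
```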
Our first insight is that this prediction set and marginal coverage guarantee can also be obtained by re-interpreting split conformal as an intercept-only quantile regression. Recall the definition of the “pinball” loss at level $1-\alpha$, $\ell_\alpha(\theta, s) = (1-\alpha)(s - \theta)_+ + \alpha(\theta - s)_+$.
It is well known that minimizing this loss over $\theta$ will produce a $(1-\alpha)$-quantile of the training data, i.e., any $\theta \in \arg\min_{\theta} \sum_{i=1}^{n} \ell_\alpha(\theta, S_i)$ is a $(1-\alpha)$-quantile of $S_1, \dots, S_n$ (Koenker and Bassett Jr (1978)).
In our exchangeability proof, recall that we bounded $S_{n+1}$ not by a quantile of the calibration scores alone, but by a quantile of the augmented dataset that included all of the scores $S_1, \dots, S_{n+1}$. To similarly account for the (unobserved) conformity score $S_{n+1}$ in a quantile regression, we will now fit using a dataset that includes a guess $S$ for the unknown score. Namely, let $\hat{q}_S$ be a solution to the quantile regression problem in which we impute $S$ for the unknown conformity score, i.e., $\hat{q}_S \in \arg\min_{\theta} \frac{1}{n+1}\big(\sum_{i=1}^{n} \ell_\alpha(\theta, S_i) + \ell_\alpha(\theta, S)\big)$.
Then, one can verify that
$$\hat{C}(X_{n+1}) = \big\{y : S(X_{n+1}, y) \le \hat{q}_{S(X_{n+1}, y)}\big\}, \qquad (2.2)$$
or said more informally, $\hat{C}(X_{n+1})$ includes any $y$ such that $S(X_{n+1}, y)$ is no larger than the $(1-\alpha)$-quantile of the augmented calibration set $\{S_1, \dots, S_n, S(X_{n+1}, y)\}$. As an aside, we note there is some small subtlety here due to the non-uniqueness of $\hat{q}_S$. To get exact equality in (2.2), one should choose $\hat{q}_S$ to be the largest minimizer of the quantile regression. Readers familiar with conformal inference will also recognize this method of imputing a guess for the missing $(n+1)$-th datapoint as a type of full conformal prediction (Vovk et al. (2005)).
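As a quick numerical sanity check on this equivalence, the self-contained sketch below (synthetic scores and illustrative constants) verifies that an intercept-only quantile regression on the augmented dataset, with a value larger than every calibration score imputed for the unknown test score, reproduces exactly the split conformal threshold.

```python
import numpy as np

def pinball(theta, s, alpha):
    """Average pinball loss at level 1 - alpha for a scalar threshold theta."""
    r = s - theta
    return np.mean(np.maximum((1 - alpha) * r, -alpha * r))

rng = np.random.default_rng(0)
alpha, n = 0.1, 200
scores = rng.exponential(size=n)                    # calibration conformity scores S_1, ..., S_n

# Augment the scores with a value larger than all of them, standing in for the unknown test score.
augmented = np.append(scores, scores.max() + 10.0)

# Intercept-only quantile regression: the minimizer is attained at one of the data points.
candidates = np.sort(augmented)
losses = np.array([pinball(t, augmented, alpha) for t in candidates])
q_quantile_reg = candidates[losses.argmin()]

# Split conformal threshold: the ceil((n + 1)(1 - alpha))-th smallest calibration score.
k = int(np.ceil((n + 1) * (1 - alpha)))
q_split = np.sort(scores)[k - 1]

print(q_quantile_reg, q_split)                      # the two thresholds coincide
```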
Having established that split conformal prediction can be derived via quantile regression, our generalization of this procedure to richer function classes naturally follows. Namely, we will replace the single score threshold $\hat{q}$ with a function of $X_{n+1}$ that estimates the conditional quantiles of the conformity score. We then prove a generalization of Theorem 1 showing that the resulting prediction set attains a conditional coverage guarantee.
2.2 Finite dimensional classes
Recall our objective:
$$\mathbb{E}\big[f(X_{n+1})\big(\mathbf{1}\{Y_{n+1} \in \hat{C}(X_{n+1})\} - (1-\alpha)\big)\big] = 0, \quad \text{for all } f \in \mathcal{F}. \qquad (2.3)$$
In the previous section, we constructed a prediction set with marginal coverage, i.e., (2.3) for the class of constant functions, by fitting an augmented quantile regression over that same function class. Here, we generalize this observation to any finite-dimensional linear class.
To formally define our method, let $\mathcal{F} = \{\Phi(\cdot)^\top \beta : \beta \in \mathbb{R}^d\}$ denote the class of linear functions over a basis $\Phi : \mathcal{X} \to \mathbb{R}^d$. Our goal is to construct a $\hat{C}(X_{n+1})$ satisfying (2.3) for this choice of $\mathcal{F}$. Imitating our re-derivation of split conformal prediction, we define the augmented quantile regression estimate as
$$\hat{g}_S(\cdot) = \Phi(\cdot)^\top \hat{\beta}_S, \quad \text{where } \hat{\beta}_S \in \arg\min_{\beta \in \mathbb{R}^d} \frac{1}{n+1}\Big(\sum_{i=1}^{n} \ell_\alpha(\Phi(X_i)^\top \beta, S_i) + \ell_\alpha(\Phi(X_{n+1})^\top \beta, S)\Big). \qquad (2.4)$$
Then, we take our prediction set to be
$$\hat{C}(X_{n+1}) = \big\{y : S(X_{n+1}, y) \le \hat{g}_{S(X_{n+1}, y)}(X_{n+1})\big\}. \qquad (2.5)$$
Critically, we emphasize that $\hat{g}_S$ is fit using the same function class $\mathcal{F}$ that appears in our coverage target. This fact will be crucial to the theoretical results that follow. To keep our notation clear under this recycling of $\mathcal{F}$, we will always use $g$ to denote quantile estimates and $f$ to denote re-weightings.
Before discussing the coverage properties of this method there are two technical issues that we must address. First, astute readers may have noticed that (2.5) appears to be intractable. Indeed, a naive computation of $\hat{C}(X_{n+1})$ would require us to re-fit $\hat{g}_{S(X_{n+1}, y)}$ for every candidate value $y$. In Section 4, we will give an efficient algorithm for computing the prediction set that overcomes this naive approach. To ease exposition we defer the details of this method for now. The second issue that we must address is the non-uniqueness of the estimate $\hat{g}_S$. In all subsequent results of this article, we will assume that $\hat{g}_S$ is computed using an algorithm that is invariant under re-orderings of the input data. This assumption is relevant because quantile regression can admit multiple optima; in theory, the selected optimum might systematically depend on the indices of the scores. In practice, this assumption is inconsequential because any commonly used algorithm, e.g., an interior point solver, satisfies this invariance condition.
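To make the definition (2.5) concrete, the following sketch spells out the naive membership check for a single candidate $y$, using scikit-learn's QuantileRegressor as a stand-in solver for the augmented regression (2.4). The efficient dual computation of Section 4 replaces this brute-force recursion, and ties on the interpolation boundary are ignored here; all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

def in_prediction_set(phi, X_calib, scores_calib, x_test, s_test, alpha=0.1):
    """Naive membership check for the augmented quantile regression set (2.5).

    phi    : feature map defining the linear class F (illustrative).
    s_test : imputed conformity score S(x_test, y) for the candidate y.
    Returns True if the imputed score lies below the augmented quantile fit.
    """
    Phi = np.vstack([phi(x) for x in X_calib] + [phi(x_test)])
    S = np.append(scores_calib, s_test)
    # Unregularized quantile regression at level 1 - alpha over the augmented data.
    qr = QuantileRegressor(quantile=1 - alpha, alpha=0.0, fit_intercept=False, solver="highs")
    qr.fit(Phi, S)
    return s_test <= phi(x_test) @ qr.coef_
```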
With these issues out of the way, we are now ready to state the main result of this section, Theorem 2, which summarizes the coverage properties of (2.5). When interpreting this theorem it may be useful to recall that for non-negative $f$, $\mathbb{P}_f$ denotes the setting in which $(X_i, Y_i)_{1 \le i \le n}$ are sampled i.i.d. from $P$, while $X_{n+1}$ is sampled independently from the $f$-tilted marginal of $X$ and $Y_{n+1} \mid X_{n+1}$ follows the same conditional distribution as under $P$.
Theorem 2.
Let $\mathcal{F}$ denote the class of linear functions over the basis $\Phi : \mathcal{X} \to \mathbb{R}^d$. Then, for any non-negative $f \in \mathcal{F}$ with $\mathbb{E}[f(X_{n+1})] > 0$, the prediction set given by (2.5) satisfies
$$\mathbb{P}_f(Y_{n+1} \in \hat{C}(X_{n+1})) \geq 1-\alpha. \qquad (2.6)$$
On the other hand, if the distribution of $S(X_{n+1}, Y_{n+1})$ is continuous, then for all non-negative $f \in \mathcal{F}$ with $\mathbb{E}[f(X_{n+1})] > 0$, we additionally have a matching two-sided bound whose upper limit exceeds $1-\alpha$ by an explicit finite-sample term.
This type of two-part result is typical in conformal inference. Namely, while the assumption that the distribution of $S(X_{n+1}, Y_{n+1})$ is continuous may seem overly restrictive, it is standard in conformal inference that upper bounds require a mild continuity assumption, while lower bounds are fully distribution-free. For example, the canonical coverage guarantee for split conformal described in Theorem 1 also gives separate upper and lower bounds for continuous and discrete data. Notably, in the case of split conformal inference this two-part result can be avoided and replaced by an exact coverage guarantee by randomizing the prediction set. We will show in Section 4 that an analogous result also holds for our method: without any assumptions on the continuity of the scores, we show that randomizing yields $\mathbb{P}_f(Y_{n+1} \in \hat{C}(X_{n+1})) = 1-\alpha$ for all non-negative $f \in \mathcal{F}$. Because the randomization scheme we employ leverages the algorithms developed in Section 4, we defer a precise statement of this result for now.
Our next result, Corollary 1, relates the more abstract guarantee of Theorem 2 to the group-conditional coverage example previewed in the introduction.
Corollary 1.
Suppose $(X_i, Y_i)_{1 \le i \le n+1}$ are independent and identically distributed and the prediction set given by (2.5) is implemented with $\mathcal{F}$ equal to the linear span of the indicators $\{\mathbf{1}\{x \in G\} : G \in \mathcal{G}\}$ for some finite collection of groups $\mathcal{G}$. Then, for any $G \in \mathcal{G}$ with $\mathbb{P}(X_{n+1} \in G) > 0$, $\mathbb{P}(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1} \in G) \geq 1-\alpha$.
If the distribution of $S(X_{n+1}, Y_{n+1})$ is continuous, then we also have a matching upper bound.
The methods described above only estimate the upper $(1-\alpha)$-quantile of the conformity score. If desired, our procedure can also be generalized to give both lower and upper bounds on $S(X_{n+1}, Y_{n+1})$. In particular, letting $\hat{g}^{\mathrm{lo}}_S$ and $\hat{g}^{\mathrm{hi}}_S$ denote our estimates of a lower and an upper quantile of the score, we can define the two-sided prediction set
$$\hat{C}(X_{n+1}) = \big\{y : \hat{g}^{\mathrm{lo}}_{S(X_{n+1}, y)}(X_{n+1}) \le S(X_{n+1}, y) \le \hat{g}^{\mathrm{hi}}_{S(X_{n+1}, y)}(X_{n+1})\big\}. \qquad (2.7)$$
As an example of this, Figures 2, 4 and 5 show results from an implementation of our method in which we fit the lower and upper quantiles of the score separately. Because these two-sided prediction sets have coverage properties identical to their one-sided analogues, for simplicity we will focus on the one-sided version in the remainder of this article. Readers interested in the two-sided instantiation of our approach should see Section A.7 for additional information about the implementation and formal coverage guarantees of these methods.
We conclude this section by giving a brief proof sketch of Theorem 2, leaving formal details to the Appendix. The main idea is to examine the first order conditions of the quantile regression (2.4) and then exploit the fact that this regression treats the test point identically to the calibration data. This connection between the derivative of the pinball loss and coverage was first made by Jung et al. (2023).
Proof sketch of Theorem 2.
We examine the first order conditions of (2.4). By a direct computation, the subgradient of the pinball loss shows that, for any $f = \Phi(\cdot)^\top \beta \in \mathcal{F}$, the derivative of the objective in the direction $\beta$ involves only the indicators $\mathbf{1}\{S_i \le \hat{g}_S(X_i)\}$.
For simplicity, suppose that no score lies exactly on the fitted quantile surface, i.e., $S_i \neq \hat{g}_S(X_i)$ for all $1 \le i \le n+1$. This assumption does not hold in general and by adding it here we will obtain the stronger, exact result $\mathbb{P}_f(Y_{n+1} \in \hat{C}(X_{n+1})) = 1-\alpha$. In the full proof of Theorem 2, we remove this simplification and incur an additional error term.
For now, making this assumption gives the first order condition
$$\frac{1}{n+1} \sum_{i=1}^{n+1} f(X_i)\big(\mathbf{1}\{S_i \le \hat{g}_{S_{n+1}}(X_i)\} - (1-\alpha)\big) = 0,$$
where the regression is fit with the true score $S_{n+1}$ imputed. Taking expectations, we arrive at our coverage guarantee
$$\mathbb{E}\big[f(X_{n+1})\mathbf{1}\{Y_{n+1} \in \hat{C}(X_{n+1})\}\big] = \mathbb{E}\big[f(X_{n+1})\mathbf{1}\{S_{n+1} \le \hat{g}_{S_{n+1}}(X_{n+1})\}\big] = (1-\alpha)\,\mathbb{E}[f(X_{n+1})],$$
where the first equality uses the definition of $\hat{C}(X_{n+1})$ and the second equality applies the fact that the triples $(X_i, Y_i, S_i)$, $1 \le i \le n+1$, are exchangeable.
∎
2.3 Related work
The method proposed above can be briefly summarized as a modified quantile regression procedure in which the new test point is incorporated into the fit. Given the popularity of vanilla quantile regression, one might reasonably ask how this compares to the more standard approach in which one fits $\hat{\beta} \in \arg\min_{\beta} \frac{1}{n}\sum_{i=1}^{n} \ell_\alpha(\Phi(X_i)^\top \beta, S_i)$ on the training data and then forms the prediction set $\hat{C}(X_{n+1}) = \{y : S(X_{n+1}, y) \le \Phi(X_{n+1})^\top \hat{\beta}\}$.
Jung et al. (2023) analyze this approach in the case where $\mathcal{F}$ is the space of linear combinations of subgroup indicator functions. Under appropriate assumptions on the distribution of the data, they show that, for all groups in $\mathcal{G}$ and suitable constants, this prediction set satisfies a PAC (i.e., calibration-conditional) coverage guarantee.
On the other hand, in the same setting, Corollary 1 states that our method attains group-conditional coverage at level at least $1-\alpha$ in expectation over the calibration data.
The proofs of both of these results are based on the first order conditions of quantile regression which, as we showed above, can be exploited to guarantee conditional coverage. That said, the approaches differ substantially and the final results are not directly comparable. At a high level, we find that the former result targets a stronger notion of coverage, ensuring concentration conditional on the calibration data, while the second result has a much faster convergence rate.
While these methods are difficult to compare theoretically, a much clearer picture emerges on simulated data. In particular, the results in Figure 3 show that our method is far more robust to small sample sizes or large dimensions. Although we do not provide a PAC guarantee, we find that the coverage of our procedure concentrates tightly at the target level across a wide range of sample sizes and dimensions. On the other hand, the vanilla quantile regression approach taken in Jung et al. (2023) shows notable undercoverage already at moderate dimensions.
3 Extension to infinite dimensional classes
Turning back to our method, we now consider settings in which we do not have a small, finite-dimensional function class of interest. In particular, if we view the coverage target (2.3) as an interpolation between marginal and conditional coverage, then it is natural to ask what guarantees can be provided when is a rich, and potentially even infinite dimensional, function class. We know from previous work that exact coverage over an arbitrary infinite dimensional class is impossible (Vovk (2012), Barber et al. (2020)). Thus, just as we relaxed the definition of conditional coverage above, here we will construct prediction sets that satisfy a relaxed version of (2.3).
First, note that we cannot directly implement our method over an infinite dimensional class. Indeed, running quantile regression over a function class whose dimension exceeds the number of data points will simply interpolate the input data. In our context, this means that every value $y$ would satisfy $S(X_{n+1}, y) \le \hat{g}_{S(X_{n+1}, y)}(X_{n+1})$ and our method would always output the trivial prediction set. To circumvent this issue and obtain informative prediction sets, we must add regularization. This leads us to the definition
$$\hat{g}_S \in \arg\min_{g \in \mathcal{F}} \frac{1}{n+1}\Big(\sum_{i=1}^{n} \ell_\alpha(g(X_i), S_i) + \ell_\alpha(g(X_{n+1}), S)\Big) + \mathcal{R}(g), \qquad (3.1)$$
for some appropriately chosen penalty $\mathcal{R}(\cdot)$. Having made this adjustment, we may now proceed identically to the previous section. Namely, we set
$$\hat{C}(X_{n+1}) = \big\{y : S(X_{n+1}, y) \le \hat{g}_{S(X_{n+1}, y)}(X_{n+1})\big\}, \qquad (3.2)$$
and by examining the first order conditions of (3.1), we obtain the following generalization of Theorem 2.
Theorem 3.
Let be any vector space, and assume that for all , the derivative of exists. If is non-negative with , then the prediction set given by (3.2) satisfies the lower bound
On the other hand, suppose . Then, for all , we additionally have the two-sided bound,
(3.3)
where the final term is an interpolation error whose size we control in the results below.
Similar to our results in the previous section, the interpolation term can be removed if we allow the prediction set to be randomized (see Section 4 for a precise statement).
To more accurately interpret Theorem 3, we will need to develop additional understanding of the two quantities appearing on the right-hand side of (3.3). The following two sections are devoted to this task for two different choices of $\mathcal{F}$. At a high level, the results in these sections will show that the interpolation error is of negligible size and thus the coverage properties of our method are primarily governed by the derivative term in (3.3). Informally, we interpret this derivative as providing a quantitative estimate of the difficulty of achieving conditional coverage in the direction $f$. More practically, we will see that the derivative can be used to obtain accurate estimates of the coverage properties of our method. Critically, these estimates adapt to both the specific choice of tilt and the distribution of the data. Thus, they determine the difficulty of obtaining conditional coverage in a way that is specific to the dataset at hand.
3.1 Specialization to functions in a reproducing kernel Hilbert space
Our first specialization of Theorem 3 is to the case where $\mathcal{F}$ is constructed using functions from a reproducing kernel Hilbert space (RKHS). More precisely, let $K(\cdot, \cdot)$ be a positive definite kernel and $\mathcal{H}$ denote the associated RKHS with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ and norm $\|\cdot\|_{\mathcal{H}}$. Let $\Phi$ denote any finite dimensional feature representation of the covariates. Then, we consider implementing our method with function class $\mathcal{F} = \{\Phi(\cdot)^\top \beta + h(\cdot) : \beta \in \mathbb{R}^d,\, h \in \mathcal{H}\}$ and a penalty that regularizes only the RKHS norm of $h$, scaled by a hyperparameter $\gamma > 0$ that controls the flexibility of the fit. For now, we take this hyperparameter to be fixed, although later in our practical experiments in Section 5.1, we will choose it by cross-validation.
Some examples of RKHSes that may be of interest include the RKHS of the Gaussian (radial basis function) kernel $K(x, x') = \exp(-\|x - x'\|_2^2 / (2\sigma^2))$, which allows us to give coverage guarantees over localizations of the covariates, and that of the polynomial kernel $K(x, x') = (1 + x^\top x')^k$ for $k \in \mathbb{N}$, which allows us to investigate coverage over smooth polynomial re-weightings. Additional examples and background material on reproducing kernel Hilbert spaces can be found in Paulsen and Raghupathi (2016).
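As a concrete (and simplified) illustration of how such a fit can be computed, the sketch below solves a penalized kernel quantile regression with cvxpy, representing the RKHS component by its kernel expansion. It uses a squared RKHS-norm penalty and omits the test-point augmentation of (3.1) for brevity; the penalty scaling and all names are assumptions made for this sketch rather than the paper's exact formulation.

```python
import numpy as np
import cvxpy as cp

def gaussian_kernel(X, Z, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||X_i - Z_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_quantile_fit(X, scores, Phi, alpha=0.1, gamma=1.0, sigma=1.0):
    """Pinball regression over {Phi(x)^T beta + h(x) : h in the RKHS},
    with h represented as h(.) = sum_i a_i K(., X_i)."""
    n = len(scores)
    K = gaussian_kernel(X, X, sigma)
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))      # so that a^T K a = ||L^T a||^2 (up to jitter)
    a, beta = cp.Variable(n), cp.Variable(Phi.shape[1])
    resid = scores - K @ a - Phi @ beta
    pinball = cp.sum(cp.maximum((1 - alpha) * resid, -alpha * resid)) / n
    penalty = 0.5 * gamma * cp.sum_squares(L.T @ a)   # squared RKHS norm of the kernel component
    cp.Problem(cp.Minimize(pinball + penalty)).solve()
    return a.value, beta.value                        # predict via Phi(x) @ beta + K(x, X) @ a
```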
To obtain a coverage guarantee for this class, we must understand the two terms appearing on the right-hand side of (3.3). Let $f$ denote the re-weighting of interest and $\hat{g}_S = \Phi(\cdot)^\top \hat{\beta} + \hat{h}$ denote the fitted quantile estimate. Then, a short calculation shows that the derivative term in (3.3) reduces to an inner product involving the kernel component $\hat{h}$. So, applying Theorem 3, we find that for all non-negative $f$ with $\mathbb{E}[f(X_{n+1})] > 0$,
(3.4)
Controlling the interpolation error is more challenging and is done by the following proposition.
Proposition 1.
Assume that and that is uniformly bounded. Furthermore, suppose has uniformly upper and lower bounded first three moments (Assumption 1 in the Appendix) and that the distribution of is continuous with a uniformly bounded density. Then, for any ,
Critically, the interpolation error term decays to zero at a faster-than-parametric rate. As a result, for even moderately large $n$, we expect this term to be of small size. In support of this intuition, in the experiment of Section 5.1 we observe an interpolation error that is negligible at the sample size, linear dimension, and kernel hyperparameter used there. Thus, at appropriate sample sizes, the interpolation error has minimal effect on the coverage.
We remark that achieving this faster-than-parametric rate requires technical insights beyond existing tools. Two standard ways to establish interpolation bounds are either to exploit the finite-dimensional character of the model class or to leverage algorithmic stability. Unfortunately, quantile regression with both kernel and linear components satisfies neither property. To get around this problem, we give a three-part argument in which we first separate out the linear component of the fit by discretizing over its coefficients. Then, with the linear component fixed, we are able to exploit known stability results to control the kernel component of the fit (Bousquet and Elisseeff (2002)). Finally, we combine the two previous steps by giving a smoothing argument that shows that the discretization can be extended to the entire function class. This result may be useful in other applications. For instance, the interpolation error determines the derivative of the loss at the empirical minimizer and, thus, may play a key role in central limit theorems for quantile regressors of this type.
Moving away from these technical issues and returning to our coverage guarantee, (3.4), we find that once the interpolation error is removed, the conditional coverage is completely dictated by the inner product between $f$ and the kernel component of the fit. Critically, this implies that in the special case where the target re-weighting lies completely in the unpenalized part of the function class (i.e. when $f$ is of the form $\Phi(\cdot)^\top \beta$) this inner product vanishes and thus $\hat{C}(X_{n+1})$ obtains (nearly) exact coverage under $\mathbb{P}_f$. On the other hand, our next proposition shows that when this is not the case, we can use a plug-in estimate to accurately estimate the inner product, and hence the coverage. Thus, even when exact coverage is impossible, a simple examination of the quantile regression fit is sufficient to determine the degradation in coverage under any re-weighting of interest.
Proposition 2.
Assume that and that is uniformly bounded. Suppose further that the population loss is locally strongly convex near its minimizer (Assumption 2 in the Appendix) and has uniformly bounded upper and lower first and second order moments (Assumption 3 in the Appendix). Define the -sample quantile regression estimate
and for any , let . Then,
(3.5)
3.2 Specialization to the class of Lipschitz functions
As a second specialization of Theorem 3, we now aim to provide valid coverage over all sufficiently smooth re-weightings of the data. We will do this by examining the set of all Lipschitz functions on . Namely, suppose and define the Lipschitz norm of functions as
Analogous to the previous section, let , be any finite dimensional feature representation of , and consider implementing our method with the function class and penalty .
The astute reader may notice that the Lipschitz norm is not differentiable and thus Theorem 3 is not directly applicable to this setting. Nevertheless, it is not difficult to show that Theorem 3 can be extended by replacing with a subgradient. So, after observing that , we can apply this analogue of Theorem 3 to find that for any non-negative with ,
Control of the interpolation error is handled in the following proposition.
Proposition 3.
Assume that are i.i.d. and that , , and have bounded domains and uniformly upper and lower bounded first and second moments (Assumption 4 in the Appendix). Furthermore, assume that the distribution of is continuous with a uniformly bounded density and that contains an intercept term. Then for any ,
This result is considerably weaker than our RKHS bound. While our proof for RKHS function classes made careful use of the stability of RKHS fitting (Bousquet and Elisseeff, 2002), here we take a more brute-force approach and directly examine the uniform concentration properties of the number of interpolated points. We defer a detailed description of this approach to the Appendix. For now, we simply remark that we do not believe this proof technique yields a tight bound and it is possible that significant improvements could be made to Proposition 3 with more careful arguments.
Regardless of the tightness of the bound, we still find that when the covariate is low-dimensional, the interpolation error will be small and the miscoverage of $\hat{C}(X_{n+1})$ under a shift $f$ will be primarily driven by its Lipschitz norm. In light of the impossibility of exact conditional coverage, this result gives a natural interpolation between marginal coverage (in which the Lipschitz norm is zero) and conditional coverage (in which it can be arbitrarily large). On the other hand, in moderate to high dimensions the interpolation error term will not be negligible and the coverage can be highly conservative. For this reason, in our real data examples, we will prefer to use RKHS functions for which we have much faster convergence rates.
4 Computing the prediction set
In order to practically implement any of the methods discussed above, we need to be able to efficiently compute $\hat{C}(X_{n+1})$. Naively, this recursive definition requires us to fit $\hat{g}_{S(X_{n+1}, y)}$ for all possible values of $y$. We will now show that by exploiting the monotonicity properties of quantile regression, this naive computation can be overcome and a valid prediction set can be computed efficiently using only a small number of fits.
The main subtlety that we will have to contend with is that $\hat{g}_S$ may not be uniquely defined. For example, consider computing the median of a dataset with an even number of distinct points: any value between the two middle order statistics is a valid solution to the median quantile regression. Critically, this means that it can be ambiguous whether a given point lies below or above the median. More generally, in our context, it can be ambiguous whether or not $S(X_{n+1}, y) \le \hat{g}_{S(X_{n+1}, y)}(X_{n+1})$. In the earlier sections of this article we have elided such non-uniqueness in the definition of $\hat{g}_S$. We do this because the choice of minimizer is not critical to the theory and, in particular, all sensible definitions will give the same coverage properties (recall that all of our theoretical results go through so long as $\hat{g}_S$ is computed using an algorithm that is invariant under re-orderings of the input data). However, while not theoretically relevant, this ambiguity can cause practical issues to arise in the computation.
The main insight of this section is that these technical issues can be resolved by re-defining the prediction set in terms of the dual formulation of the quantile regression. This will give us a new prediction set that can be computed efficiently and satisfies the same coverage guarantees as the primal set $\hat{C}(X_{n+1})$. At a high level, the dual set is obtained from $\hat{C}(X_{n+1})$ by removing a small portion of the points that lie on the interpolation boundary. Thus, one should simply think of it as a trimming of the original prediction set that removes some extraneous edge cases.
To define our dual optimization more formally, recall that throughout this article we have considered quantile regressions of the form,
Instead of directly computing the dual of this program, we first re-formulate this optimization into the identical procedure,
(4.1)
Then, after some standard calculations, this yields the desired dual formulation,
(4.2)
where denotes the function ; heuristically, we can think of as the convex conjugate for .
Crucially, the KKT conditions for (4.1) allow for a more tractable definition of our prediction set. Letting denote any solution to (4.2) and applying the complementary slackness conditions of this primal-dual pair we find that
As a consequence, checking whether $y \in \hat{C}(X_{n+1})$ is nearly equivalent to checking a threshold condition on the dual variable associated with the imputed test score, albeit with a minor discrepancy on the interpolation boundary. This enables us to define the efficiently computable prediction set,
(4.3)
In practice, this dual prediction set can be mildly conservative. If we are willing to allow the prediction set to be randomized, then exact coverage can be obtained using a randomized variant in which boundary points are included with an appropriate probability, where the randomization variable is drawn independent of the data.
Our first result verifies that the dual prediction set obtains the same coverage guarantees as our non-randomized primal set, while its randomized variant realizes exact, non-conservative coverage.
Proposition 4.
The assumption that the primal-dual pair satisfies strong duality is very minor and we verify in Section A.1 that it holds for all the function classes and penalties considered in this article. Moreover, we also verify in Section A.1 that for all the function classes and penalties considered in this article, is tractable and thus solutions to (4.2) can be computed efficiently.
The main result of this section is Theorem 4, which states that the test point's dual variable is non-decreasing in the imputed score. Critically, this implies that membership in the dual prediction set (4.3) is monotone in the imputed score $S$.
Theorem 4.
For all maximizers of (4.2), the dual variable associated with the imputed test score is non-decreasing in the imputed value $S$.
Leveraging Theorem 4, we may compute the dual prediction set (or its randomized variant) using the following two-step procedure. First, we identify the largest value of the imputed score for which the dual membership criterion of (4.3) holds. Second, denoting this upper bound by $\hat{S}$, we output all $y$ such that $S(X_{n+1}, y) \le \hat{S}$. The second step is straightforward for all commonly used conformity scores. For example, if $S(x, y) = |y - \hat{\mu}(x)|$, the prediction set becomes $[\hat{\mu}(X_{n+1}) - \hat{S},\, \hat{\mu}(X_{n+1}) + \hat{S}]$.
The monotonicity of the dual variable in the imputed score allows for more than one approach to the first step of this procedure. If we have no additional information about the structure of the optimization problem over $\mathcal{F}$, it is always possible to run a binary search over the imputed score to find the largest value for which the dual variable is less than the targeted cutoff. On the other hand, for the typical use-case of this method, i.e., when we fit an unregularized quantile regression over a finite-dimensional function class, it is considerably more efficient to compute this cutoff by applying standard tools from linear program sensitivity analysis. We defer a detailed description of our approach to Algorithm 2.
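The binary-search option is straightforward to sketch. In the snippet below, dual_variable stands in for a routine that re-solves the dual program (4.2) at a given imputed score and returns the test point's dual variable, and cutoff stands in for the membership cutoff of (4.3); both are assumptions of this sketch, since their exact form depends on the dual parameterization.

```python
def dual_score_threshold(dual_variable, s_lo, s_hi, cutoff, tol=1e-6):
    """Binary search for the largest imputed score whose dual variable stays below
    the cutoff, exploiting the monotonicity guaranteed by Theorem 4.

    dual_variable : callable mapping an imputed score to the test point's dual variable
                    (a stand-in for re-solving the dual program (4.2)).
    """
    while s_hi - s_lo > tol:
        mid = 0.5 * (s_lo + s_hi)
        if dual_variable(mid) < cutoff:
            s_lo = mid            # the imputed score still belongs to the set; search higher
        else:
            s_hi = mid
    return s_lo                   # prediction set: {y : S(X_test, y) <= returned threshold}
```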

Figure 6 displays the computational efficiency of this sensitivity analysis method on a 2020 MacBook Pro with an Intel Core i5 processor and 16GB of RAM. The left panel of the figure displays the estimated time to construct one prediction set over varying function classes and calibration set sizes. We find that our method is computationally efficient, and that its time complexity depends much more on the dimension than the sample size. In general, the bulk of the computational cost consists of fitting a single quantile regression on the calibration set. From there, predictions for new test points are obtained by an efficient procedure for updating the quantile fit (see Algorithm 2 in the Appendix for details). The computational benefit of this approach is displayed in the right panel of Figure 6, which compares the cost of running our procedure against the time it would take to run a single quantile regression for each test point. We see that our updating procedure is substantially faster across a wide range of calibration set sizes and function class dimensions.
5 Real data experiments
5.1 Communities and crime data
We now illustrate our methods on two real datasets. For our first experiment, we consider the Communities and Crime dataset (Dua and Graff (2017), Redmond and Baveja (2002)). In this task, the goal is to use the demographic features of a community to predict its per capita violent crime rate. To make our analysis as interpretable as possible, we use only a small subset of the covariates in this dataset: population size, unemployment rate, median income, racial make-up (given as four columns indicating the percentage of the population that is Black, White, Hispanic, and Asian), and age demographics (given as two columns indicating the percentage of people in the 12-21 and 65+ age brackets).
Our goal in this section is not to give the best possible prediction intervals. Instead, we perform a simple expository analysis that is designed to demonstrate some possible use cases of our method. For this purpose, we assume that the practitioner’s primary concern is that standard prediction sets will provide unequal coverage across communities with differing racial make-ups. Additionally, we suppose that the practitioner does not have any particular predilections for achieving coverage conditional on age, unemployment rate, or median income. We encode these preferences as follows: let $\Phi(x)$ denote the length-five vector consisting of an intercept term and the four racial features, and define $\mathcal{H}$ to be the RKHS given by the Gaussian kernel (note that since all variables in this dataset have been previously normalized to lie in $[0, 1]$, we do not make any further modifications before computing the kernel). Then, proceeding exactly as in Section 3.1, we run our method with the function class $\{\Phi(\cdot)^\top \beta + h(\cdot) : \beta \in \mathbb{R}^5,\, h \in \mathcal{H}\}$ and the associated RKHS-norm penalty with regularization level $\gamma$. To choose $\gamma$, we run cross-validation on the calibration set. While this does not strictly follow the theory of Section 3, our results indicate that choosing $\gamma$ in this manner does not negatively impact the coverage properties of our method in practice (see Figures 9 and 13 below). Finally, we set the conformity scores to be the absolute residuals of a linear regression of the crime rate on the covariates and fix a target coverage level of $1-\alpha$. The sizes of the training, calibration, and test sets are taken to be 650, 650, and 694, respectively, and we use the unrandomized prediction set throughout.

Figure 7 shows the empirical miscoverages obtained by our algorithm and split conformal prediction under the five linearly re-weighted distributions $\mathbb{P}_f$ obtained by tilting with each coordinate of $\Phi$. As expected, our method obtains the desired coverage level under all five settings, while split conformal is only able to deliver marginal validity.

In practice, the user is unlikely to only care about the performance under linear re-weightings. For example, they may also want to have prediction sets that are accurate on the communities with the highest racial bias, e.g. the communities whose percentage of Black residents falls in a top percentile. Strictly speaking, our method will only provide a guarantee on these high racial representation groups if the corresponding subgroup indicators are included in $\mathcal{F}$. However, intuitively, even without explicit indicators, linear re-weightings should already push the method to accurately cover communities with high racial bias. Thus, we may expect that even without adjusting $\mathcal{F}$, the method implemented above will already perform well on these groups. To investigate this, Figure 8 shows the miscoverage of our method across high racial representation subgroups. Formally, we say that a community has high representation of a particular racial group if the percentage of the community that falls in that group exceeds a given percentile cutoff; the three panels of the figure show results for three such cutoffs. We find that even without any explicit indicators for the subgroups, our method is able to correct the errors of split conformal prediction and provide improved coverage in all settings.
In addition to providing exact coverage over racial re-weightings, our method also provides a simple procedure for evaluating the coverage across the other, more flexibly fit, covariates. More precisely, let $\hat{h}$ denote the kernel component of the quantile regression fit on the calibration data and recall that, by the results of Section 3, we expect that for any non-negative re-weighting $f$,
(5.1)
For the Gaussian RKHS, a natural set of non-negative weight functions are the local re-weightings $f(x) = K(x, x_0)$, which emphasize coverage in a neighbourhood around a fixed point $x_0$.
The bottom-left panel of Figure 9 plots the values of the estimated coverage (5.1) under these local re-weightings for all points appearing in the training and calibration sets. We see immediately that our prediction sets will undercover older communities and overcover communities with high median incomes. To aid in the interpretation of this result, the top panel of the figure indicates the level curves of the localization kernel for two specific choices of $x_0$. Finally, the bottom-right panel compares the estimates (5.1) to the realized empirical coverages
for the same two values of $x_0$. Here, the empirical coverages are computed over the test set. We see that, as expected, (5.1) is a highly accurate estimate of the true coverage at both values of $x_0$. To further understand the degree of localization in these plots, it may be useful to examine the effective sample size that this re-weighting yields at the two red points.
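The empirical coverages above are self-normalized weighted averages over the test set. The following sketch (illustrative names; the localization point and bandwidth are placeholders) shows the computation.

```python
import numpy as np

def tilted_coverage(f_vals, covered):
    """Empirical coverage under the tilt f: a self-normalized weighted average of the
    coverage indicators on the test set."""
    f_vals = np.asarray(f_vals, dtype=float)     # f(X_i) for each test point X_i
    covered = np.asarray(covered, dtype=float)   # 1{Y_i in C(X_i)} for each test point
    return np.sum(f_vals * covered) / np.sum(f_vals)

def gaussian_tilt(X_test, x0, sigma=1.0):
    """Local re-weighting f(x) = exp(-||x - x0||^2 / (2 sigma^2)) around a fixed point x0."""
    return np.exp(-((X_test - x0) ** 2).sum(axis=1) / (2 * sigma ** 2))
```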
Overall, we find that our procedure provides the user with a highly accurate picture of the coverage properties of their prediction sets. In many practical settings, plots like the bottom-left panel of Figure 9 may prompt practitioners to adjust the quantile regression to protect against observed directions of miscoverage. While such an adjustment will not strictly follow the theory of Sections 2 and 3, so long as the user is careful not to run the procedure so many times as to induce direct over-fitting to the observed miscoverage, small adjustments will likely be permissible. In practice, this type of exploratory analysis may allow practitioners to discover important patterns in their data and tune the prediction sets to reflect their coverage needs.
5.2 Comparison against existing methods
A variety of alternative methods for obtaining conditional coverage in conformal inference have been proposed in the literature. Many of these methods are designed to asymptotically achieve exact conditional coverage, i.e., $\mathbb{P}(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1}) \to 1-\alpha$. Here, we compare against two such approaches and demonstrate why the more precise finite-sample guarantees of our method may be preferable.
5.2.1 Comparison against conformalized quantile regression
Our first comparison is to the conformalized quantile regression (CQR) method of Romano et al. (2019). They consider a version of split conformal inference in which the training set is used to fit estimates of the conditional quantiles of $Y$ given $X$ and the calibration set is used to adjust these estimates to guarantee marginal coverage. Here, we implement a two-sided version of their procedure in which the upper and lower quantiles are calibrated separately. In particular, let $\hat{q}_{\mathrm{lo}}$ and $\hat{q}_{\mathrm{hi}}$ denote estimates of a lower and an upper conditional quantile of $Y$ given $X$. Using conformity scores that measure the signed distances between $Y$ and each quantile estimate, we compute a separate calibration adjustment for each tail from the calibration scores. Then, we define the two-sided CQR prediction set as
(5.2)
As alluded to above, if the quantile estimates are consistent for the true conditional quantile function of $Y$ given $X$, then this set will asymptotically achieve exact conditional coverage (Sesia and Candès, 2020).
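A minimal sketch of this two-sided CQR baseline is given below. The quantile estimators are generic pre-fitted callables, and splitting the miscoverage budget evenly between the two tails is an assumption of this sketch rather than necessarily the paper's exact choice.

```python
import numpy as np

def two_sided_cqr(q_lo, q_hi, X_calib, y_calib, X_test, alpha=0.1):
    """Two-sided CQR in the spirit of (5.2), calibrating each tail separately.

    q_lo, q_hi : pre-fitted callables estimating lower/upper conditional quantiles
                 of Y given X (illustrative stand-ins for the trained models).
    """
    n = len(y_calib)
    k = min(int(np.ceil((n + 1) * (1 - alpha / 2))), n)
    d_lo = np.sort(q_lo(X_calib) - y_calib)[k - 1]    # calibration adjustment, lower tail
    d_hi = np.sort(y_calib - q_hi(X_calib))[k - 1]    # calibration adjustment, upper tail
    return q_lo(X_test) - d_lo, q_hi(X_test) + d_hi
```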
It is important to note that our approach is not in direct competition with CQR and both methods can be used in combination. Namely, here we will consider an implementation of our procedure in which the calibrated quantile adjustments are replaced by adaptive estimates from our method. Implementing two-sided fitting requires minor modifications of the dual formulation of our prediction set and we refer the reader to Section A.7 for details.
To compare these methods we once again employ the Communities and Crime dataset. We consider three baseline approaches: (1) split conformal prediction with the residual conformity score $|y - \hat{\mu}(x)|$, where $\hat{\mu}$ is obtained by ordinary least squares; (2) split conformal prediction implemented using the CQR procedure of Romano et al. (2019), where the quantile estimates are obtained using linear quantile regression; and (3) the same procedure but with the quantile estimates obtained using a quantile random forest. In all three cases we calibrate the upper and lower quantiles separately using either the formulation in (5.2) or, in the case of ordinary least squares, by estimating the upper and lower quantiles of the residuals separately. We compare each of these baseline approaches against implementations of our methods where the split conformal calibration step is replaced by our method with function class $\mathcal{F}$ defined as the set of affine functions over the covariates from the previous section (i.e. population size, unemployment rate, median income, as well as a set of race- and age-based features).

Figure 10 displays the results of this experiment. We find that all implementations of both split conformal and our randomized method achieve exact marginal coverage, while our unrandomized variant can slightly overcover (left panel). To evaluate the conditional coverage, the center two panels of the figure display estimates of the largest over- and undercoverages obtained across linear re-weightings of the covariate space. Namely, letting $\tilde{X}$ denote the augmented covariates, we estimate the smallest and largest values of the empirical coverage under re-weighting by each coordinate of $\tilde{X}$ (note that here all the features are non-negative). As expected from our theory, the unrandomized variant of our method always slightly overcovers, while the randomized version obtains exact coverage throughout. On the other hand, while CQR asymptotically achieves exact coverage in this setting, our simulations show deviations from the target level using both linear quantile regression and quantile forests. Importantly, we note that while the coverage deviations of linear quantile regression are relatively small here, its performance can worsen significantly in higher dimensions (see Figure 3). Finally, the rightmost panel of the figure shows the lengths of the prediction sets returned by each method. We find that all of the prediction sets are of a similar size with the exception of the OLS implementation of split conformal prediction, whose prediction sets are relatively wide.
5.2.2 Comparison against localized conformal prediction
Our second comparison is to the localized conformal prediction method of Guan (2022). In this procedure, a localized kernel is used to re-weight the calibration data and focus on the data whose covariates are most similar to the test point. The exact method for accomplishing this is somewhat technical and we refer the reader to Theorem 3.2 of Guan (2022) for details. Similar to conformalized quantile regression, localized conformal prediction guarantees exact marginal coverage in finite samples and asymptotic conditional coverage under appropriate choices of the kernel.
To compare localized conformal prediction against our approach, we once again employ the Communities and Crime dataset. We consider the Gaussian kernel and take the conformity score to be the absolute values of the residuals from a linear regression. We then compare two methods: localized conformal prediction, and our method implemented with the function class given by the same kernel. More specifically, we implement our method with the RKHS function class and penalty of Section 3.1, where the regularization level $\gamma$ is chosen by cross-validation.

The results of this experiment are shown in Figure 11. We find that, as expected, both localized conformal prediction and the randomized version of our method achieve exact marginal coverage. To evaluate their conditional properties, we compare the coverages obtained under the two kernel re-weightings from the previous section (namely the shifts considered in Figure 9). We find that, as expected, the coverage of our method deviates from the target level, but critically this deviation is predictable and well approximated by the estimation procedure given in Proposition 2. On the other hand, although localized conformal prediction obtains similar empirical results to our method, it does not offer estimates of its coverage deviation. Thus, a priori, based on the theory of Guan (2022), one may expect localized conformal prediction to give exact coverage under these shifts, while in reality, it shows notable deviations away from the target level.
5.3 Rxrx1 data
Our next experiment examines the RxRx1 dataset (Taylor et al. (2019)) from the WILDS repository (Koh et al. (2021)). This repository contains a collection of commonly used datasets for benchmarking performance under distribution shift. In the RxRx1 task, we are given images of cells obtained using fluorescent microscopy and we must predict which one of the 1339 genetic treatments the cells received. These images come from 51 different experiments run across four different cell types. It is well known that even in theoretically identical experiments, minor variations in the execution and environmental conditions can induce observable variation in the final data. Thus, we expect to see heterogeneity in the quality of the predictions across both experiments and cell types.
Perhaps the most obvious method for correcting for this heterogeneity would be to treat the experiments and cell types as known categories and apply the group-conditional coverage method outlined in Section 2. However, if we did this, we would be unable to make predictions for new unlabeled images. Thus, here we take a more data driven approach and attempt to learn a good feature representation directly from the training set.
To predict the genetic treatments of the cells we use a ResNet50 architecture trained on 37 of the 51 experiments by the original authors of the WILDS repository. We then uniformly divide the samples from the remaining 14 experiments into a training set and a test set. The training set is further split into two halves; one for learning a feature representation, and one to be used as the calibration set. To construct the feature representation, we take the feature map from the pre-trained neural network as input and run a regularized multinomial linear regression that predicts which experiment each image comes from. We then define our features to be the predicted probabilities of experiment membership output by this model and construct prediction sets using the linear quantile regression method of Section 2. To define the conformity scores for this experiment, let $w(x)$ denote the vector of class weights output by the neural network at input $x$. Typically, we would use these weights to compute the predicted probabilities of class membership via a softmax transformation. Here, we add an extra step in which we use multinomial logistic regression and the calibration data to fit a temperature parameter $T$ that re-scales the weights. This procedure is known as temperature scaling and it has been found to increase the accuracy of probabilities output by neural networks (Angelopoulos et al. (2021), Guo et al. (2017)). After running this regression, we set $\hat{\pi}(x) = \mathrm{softmax}(w(x)/T)$ and, following Romano et al. (2020b), define the conformity score function accordingly.
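The pieces of this construction can be sketched as follows. The score below is the standard non-randomized variant of the Romano et al. (2020b) conformity score (the total probability mass of classes ranked at least as likely as the observed label); the paper's exact score may include a randomized tie-breaking term, and the temperature value is a placeholder.

```python
import numpy as np

def temperature_softmax(weights, T=1.0):
    """Temperature-scaled class probabilities from the network weights (one row per image)."""
    z = weights / T
    z = z - z.max(axis=1, keepdims=True)     # numerical stabilization
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def conformity_scores(probs, labels):
    """Non-randomized adaptive-prediction-set score: cumulative probability of all classes
    ranked at least as likely as the true label."""
    order = np.argsort(-probs, axis=1)                     # classes sorted by decreasing probability
    cumsum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    ranks = np.argsort(order, axis=1)                      # rank of each class within each row
    rows = np.arange(len(labels))
    return cumsum[rows, ranks[rows, labels]]
```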

The coverage properties of our unrandomized prediction sets are outlined in Figure 12. We see that while split conformal prediction has very heterogeneous coverage across experiments and cell types, our approach performs well on all groups. Thus, the learned feature representation successfully captures the batch effects in the data and thereby enables our method to provide the desired group-conditional coverage.
The method described above is hardly the only way to construct a feature representation for this dataset. In Section A.2, we consider an alternative approach in which we implement our method using the top principal components of the feature layer of the neural network as input. The idea here is that the batch effects (i.e. the cell types and experiment memberships) should induce large variations in the images that are visible on the top principal components. Thus, correcting the coverage along these directions will provide the desired conditional validity. In agreement with this hypothesis, we find that this procedure produces nearly identical results to those seen above, i.e., good coverage across all groups.
6 Choosing the function class and regularizer
In order to implement our method in practice, users must choose both the function class, , and the regularizer, . In the real-data examples above we considered some sample choices for these quantities that were designed to illustrate the guarantees of our method. A practitioner may reasonably disagree with our selections and, through their choice of and , look to prioritize different conditional targets. As in many statistical estimation problems, we expect the best performance of our method to be obtained when the choices of and are guided by domain-specific knowledge of the features that matter most to the prediction task at hand. To help practitioners make these choices, we highlight a few considerations that may arise.
First, we remark that although our theory requires fixed choices of and , we find empirically that running cross-validation on the training set does not significantly harm the coverage. To demonstrate this, we consider two different penalized implementations of our method. In the first, we run ridge-regularized linear quantile regression with and , while in the second we run kernel quantile regression with a Gaussian kernel and the corresponding kernel-norm regularizer, . Figure 13 compares the marginal coverage of our method when is either fixed in advance or estimated using cross-validation. We see that although our theory requires to be fixed, both approaches give near exact coverage empirically.

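For concreteness, here is a minimal sketch of this cross-validation step, assuming a ridge-penalized pinball-loss regression solved with cvxpy; the function names, fold scheme, and candidate grid are our own choices and this is not the exact implementation behind Figure 13.

```python
import numpy as np
import cvxpy as cp

def ridge_pinball_fit(X, y, tau, lam):
    """Ridge-penalized quantile regression at level tau (illustrative sketch)."""
    n, d = X.shape
    beta, b0 = cp.Variable(d), cp.Variable()
    resid = y - X @ beta - b0
    pinball = cp.sum(cp.maximum(tau * resid, (tau - 1) * resid)) / n
    cp.Problem(cp.Minimize(pinball + lam * cp.sum_squares(beta))).solve()
    return b0.value, beta.value

def cv_lambda(X, y, tau, lams, n_folds=5, seed=0):
    """Pick the ridge penalty by K-fold cross-validation on held-out pinball loss."""
    folds = np.random.default_rng(seed).permutation(len(y)) % n_folds
    scores = []
    for lam in lams:
        losses = []
        for k in range(n_folds):
            tr, te = folds != k, folds == k
            b0, beta = ridge_pinball_fit(X[tr], y[tr], tau, lam)
            r = y[te] - X[te] @ beta - b0
            losses.append(np.mean(np.maximum(tau * r, (tau - 1) * r)))
        scores.append(np.mean(losses))
    return lams[int(np.argmin(scores))]
```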
Perhaps even more critical than the regularization level is the choice of function class . Of particular interest is the trade-off that occurs as the dimension of increases. Indeed, while larger function spaces allow for a richer set of conditional guarantees, we find empirically that they also lead to larger prediction sets. Thus, adding spurious features to can harm the efficiency of our predictions.

To demonstrate this phenomenon, we consider a dataset with many irrelevant features that have no relationship with the response. In particular, we work with a Gaussian model in which , , and is independent of . We implement our method with conformity score and use linear quantile regression with , . Figure 14 plots estimated distributions of the calibration-conditional marginal miscoverage, worst-case conditional coverage deviation, and length of a two-sided implementation of our method in which the lower and upper quantiles are estimated separately (see Section A.7), i.e. the quantities,
As a baseline, the figure also displays results for vanilla quantile regression. We find that while the marginal coverage of our method is robust to large function classes, both the worst-case coverage deviation and the average length of the prediction sets increase as the dimension grows. Recall that all the covariates here are spurious; therefore, the increase in these quantities is purely a result of over-fitting in high dimensions and not a reflection of the underlying complexity of the relationship between and . Finally, we remark that the growth in the worst-case coverage error is substantially worse for quantile regression than for our method. Thus, while we cannot guarantee exact calibration-conditional validity over all in high dimensions, our method is practically superior to sensible alternatives.
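The following self-contained sketch reproduces the qualitative phenomenon for the vanilla quantile-regression baseline: the response is generated independently of the covariates, yet the fitted two-sided band widens as spurious dimensions are added. The sample sizes, constants, and score here are placeholder choices of ours, not the exact simulation settings behind Figure 14.

```python
import numpy as np
import cvxpy as cp

def two_sided_qr_length(n=200, n_test=2000, d=50, alpha=0.1, seed=0):
    """Coverage and average length of a two-sided quantile-regression band when all
    d covariates are spurious (Y is independent of X); placeholder constants only."""
    rng = np.random.default_rng(seed)
    X, y = rng.normal(size=(n, d)), rng.normal(size=n)
    Xt, yt = rng.normal(size=(n_test, d)), rng.normal(size=n_test)

    def fit(tau):
        beta, b0 = cp.Variable(d), cp.Variable()
        r = y - X @ beta - b0
        cp.Problem(cp.Minimize(cp.sum(cp.maximum(tau * r, (tau - 1) * r)))).solve()
        return lambda Z: Z @ beta.value + b0.value

    lo, hi = fit(alpha / 2), fit(1 - alpha / 2)
    coverage = np.mean((yt >= lo(Xt)) & (yt <= hi(Xt)))
    avg_length = np.mean(hi(Xt) - lo(Xt))
    return coverage, avg_length
```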
7 Acknowledgments
E.J.C. was supported by the Office of Naval Research grant N00014-20-1-2157, the National Science Foundation grant DMS-2032014, the Simons Foundation under award 814641, and the ARO grant 2003514594. I.G. was also supported by the Office of Naval Research grant N00014-20-1-2157 and the Simons Foundation award 814641, as well as by the Overdeck Fellowship Fund. J.J.C. was supported by the John and Fannie Hertz Foundation. The authors are grateful to Kevin Guo and Tim Morrison for helpful discussions on this work.
References
- Andrews and Shi (2013) Andrews, D. W. and Shi, X. (2013) Inference based on conditional moment inequalities. Econometrica, 81, 609–666.
- Angelopoulos et al. (2021) Angelopoulos, A. N., Bates, S., Jordan, M. and Malik, J. (2021) Uncertainty sets for image classifiers using conformal prediction. In International Conference on Learning Representations.
- Barber et al. (2023) Barber, R. F., Candes, E. J., Ramdas, A. and Tibshirani, R. J. (2023) Conformal prediction beyond exchangeability. arXiv preprint arXiv:2202.13415.
- Barber et al. (2020) Barber, R. F., Candès, E. J., Ramdas, A. and Tibshirani, R. J. (2020) The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 10, 455–482.
- Boucheron et al. (2005) Boucheron, S., Bousquet, O. and Lugosi, G. (2005) Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9, 323–375.
- Bousquet and Elisseeff (2002) Bousquet, O. and Elisseeff, A. (2002) Stability and generalization. Journal of Machine Learning Research, 2, 499–526.
- Chernozhukov et al. (2021) Chernozhukov, V., Wüthrich, K. and Zhu, Y. (2021) Distributional conformal prediction. Proceedings of the National Academy of Sciences, 118, e2107794118.
- Dantzig and Thapa (2003) Dantzig, G. B. and Thapa, M. N. (2003) Linear programming: Theory and extensions, vol. 2. Springer.
- Deng et al. (2023) Deng, Z., Dwork, C. and Zhang, L. (2023) Happymap: A generalized multi-calibration method. arXiv preprint arXiv:2303.04379.
- Dua and Graff (2017) Dua, D. and Graff, C. (2017) UCI machine learning repository. URL: http://archive.ics.uci.edu/ml.
- Guan (2022) Guan, L. (2022) Localized conformal prediction: a generalized inference framework for conformal prediction. Biometrika, 110, 33–50.
- Guo et al. (2017) Guo, C., Pleiss, G., Sun, Y. and Weinberger, K. Q. (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, 1321–1330. JMLR.org.
- Hébert-Johnson et al. (2018) Hébert-Johnson, U., Kim, M., Reingold, O. and Rothblum, G. (2018) Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, 1939–1948. PMLR.
- Jung et al. (2023) Jung, C., Noarov, G., Ramalingam, R. and Roth, A. (2023) Batch multivalid conformal prediction. In International Conference on Learning Representations.
- Kim et al. (2019) Kim, M. P., Ghorbani, A. and Zou, J. (2019) Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 247–254.
- Koenker and Bassett Jr (1978) Koenker, R. and Bassett Jr, G. (1978) Regression quantiles. Econometrica: journal of the Econometric Society, 33–50.
- Koh et al. (2021) Koh, P. W., Sagawa, S., Marklund, H., Xie, S. M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R. L., Gao, I., Lee, T., David, E., Stavness, I., Guo, W., Earnshaw, B., Haque, I., Beery, S. M., Leskovec, J., Kundaje, A., Pierson, E., Levine, S., Finn, C. and Liang, P. (2021) Wilds: A benchmark of in-the-wild distribution shifts. In Proceedings of the 38th International Conference on Machine Learning (eds. M. Meila and T. Zhang), vol. 139 of Proceedings of Machine Learning Research, 5637–5664. PMLR.
- Lei and Wasserman (2014) Lei, J. and Wasserman, L. (2014) Distribution‐free prediction bands for non‐parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76.
- McShane (1934) McShane, E. J. (1934) Extension of range of functions. Bulletin of the American Mathematical Society, 40, 837–842.
- Mendelson (2014) Mendelson, S. (2014) Learning without concentration. In Proceedings of The 27th Conference on Learning Theory (eds. M. F. Balcan, V. Feldman and C. Szepesvári), vol. 35 of Proceedings of Machine Learning Research, 25–39. Barcelona, Spain: PMLR.
- Paulsen and Raghupathi (2016) Paulsen, V. I. and Raghupathi, M. (2016) An Introduction to the Theory of Reproducing Kernel Hilbert Spaces. Cambridge Studies in Advanced Mathematics. Cambridge University Press.
- Qiu et al. (2022) Qiu, H., Dobriban, E. and Tchetgen, E. T. (2022) Prediction sets adaptive to unknown covariate shift. arXiv preprint arXiv:2203.06126.
- Redmond and Baveja (2002) Redmond, M. and Baveja, A. (2002) A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141, 660–678.
- Romano et al. (2020a) Romano, Y., Barber, R. F., Sabatti, C. and Candès, E. (2020a) With Malice Toward None: Assessing Uncertainty via Equalized Coverage. Harvard Data Science Review, 2.
- Romano et al. (2019) Romano, Y., Patterson, E. and Candes, E. (2019) Conformalized quantile regression. In Advances in Neural Information Processing Systems (eds. H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox and R. Garnett), vol. 32. Curran Associates, Inc.
- Romano et al. (2020b) Romano, Y., Sesia, M. and Candes, E. (2020b) Classification with valid and adaptive coverage. In Advances in Neural Information Processing Systems (eds. H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan and H. Lin), vol. 33, 3581–3591. Curran Associates, Inc.
- Sesia and Candès (2020) Sesia, M. and Candès, E. J. (2020) A comparison of some conformal quantile regression methods. Stat, 9, e261. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/sta4.261.
- Sesia and Romano (2021) Sesia, M. and Romano, Y. (2021) Conformal prediction using conditional histograms. In Advances in Neural Information Processing Systems (eds. M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang and J. W. Vaughan), vol. 34, 6304–6315. Curran Associates, Inc.
- Taylor et al. (2019) Taylor, J., Earnshaw, B., Mabey, B., Victors, M. and Yosinski, J. (2019) Rxrx1: An image set for cellular morphological variation across many experimental batches. In International Conference on Learning Representations (ICLR).
- Tibshirani et al. (2019) Tibshirani, R. J., Foygel Barber, R., Candes, E. and Ramdas, A. (2019) Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems (eds. H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox and R. Garnett), vol. 32. Curran Associates, Inc.
- Vaart and Wellner (1996) Vaart, A. W. and Wellner, J. A. (1996) Weak Convergence and Empirical Processes. Springer New York, NY, 1 edn.
- Vovk (2012) Vovk, V. (2012) Conditional validity of inductive conformal predictors. In Proceedings of the Asian Conference on Machine Learning (eds. S. C. H. Hoi and W. Buntine), vol. 25 of Proceedings of Machine Learning Research, 475–490. Singapore Management University, Singapore: PMLR.
- Vovk et al. (2005) Vovk, V., Gammerman, A. and Shafer, G. (2005) Algorithmic Learning in a Random World. Berlin, Heidelberg: Springer-Verlag.
- Vovk et al. (2003) Vovk, V., Lindsay, D., Nouretdinov, I. and Gammerman, A. (2003) Mondrian confidence machine. Tech. rep., Royal Holloway University of London.
- Whitney (1934) Whitney, H. (1934) Analytic extensions of differentiable functions defined in closed sets. Transactions of the American Mathematical Society, 36, 63–89.
- Yang et al. (2022) Yang, Y., Kuchibhotla, A. K. and Tchetgen, E. T. (2022) Doubly robust calibration of prediction sets under covariate shift. arXiv preprint arXiv:2203.01761.
Appendix A Appendix
A.1 Computational details for Section 4
A.1.1 Verification of the conditions of Section 4
In this section we verify that all of the quantile regression problems discussed in this paper satisfy the conditions of Section 4. To do this, we will check that (1) each quantile regression admits an equivalent finite-dimensional representation, and (2) all of these finite-dimensional representations (and thus also their infinite-dimensional counterparts) satisfy strong duality.
The fact that the linear quantile regression of Section 2 satisfies (1) and (2) is clear. For RKHS functions, let denote the kernel matrix with entries . Let denote the row (equivalently column) of . Then, the primal problem (3.1) can be fit by solving the equivalent convex optimization program
and setting , for any optimal solutions. With this finite-dimensional representation in hand, it now follows that the primal-dual pair must satisfy strong duality by Slater’s condition. Moreover, it is also easy to see that in this context
which is a tractable function that we can compute.
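As a rough illustration of this finite-dimensional representation, the sketch below fits a kernel quantile regression with cvxpy; the variable names, the reduction of the linear component to an intercept, and the small ridge added to the kernel matrix for numerical stability are our own choices, not part of the procedure described above.

```python
import numpy as np
import cvxpy as cp

def kernel_quantile_fit(K, y, tau, lam):
    """Kernel quantile regression in its finite-dimensional form:
    g(.) = sum_i alpha_i k(x_i, .), penalized by the RKHS norm alpha' K alpha."""
    n = K.shape[0]
    alpha, b0 = cp.Variable(n), cp.Variable()   # b0 plays the role of the linear part
    resid = y - K @ alpha - b0
    pinball = cp.sum(cp.maximum(tau * resid, (tau - 1) * resid)) / n
    K_reg = K + 1e-8 * np.eye(n)                # small jitter so quad_form sees a PSD matrix
    cp.Problem(cp.Minimize(pinball + lam * cp.quad_form(alpha, K_reg))).solve()
    return b0.value, alpha.value
```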
Finally, to fit Lipschitz functions we can solve the optimization program,
The idea here is that act as proxies for the values of . These proxies can always be extended to a complete function on all of using the methods of McShane (1934) and Whitney (1934). Once again, with this finite-dimensional representation in hand it is easy to check that the primal-dual pair satisfies strong duality by Slater’s condition. Finally, in this context we have
which is a tractable function.
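For illustration, a simplified version of this Lipschitz fit can be written as the convex program below, which optimizes proxy values at the data points subject to pairwise Lipschitz constraints; we omit the linear component and any penalty term, fix the Lipschitz constant, and note that the O(n^2) constraint list is for exposition only.

```python
import numpy as np
import cvxpy as cp

def lipschitz_quantile_fit(X, y, tau, L):
    """Quantile regression over L-Lipschitz functions via proxy values g_i at the data
    points; the proxies can later be extended to the whole domain (McShane 1934, Whitney 1934)."""
    n = X.shape[0]
    g = cp.Variable(n)
    resid = y - g
    pinball = cp.sum(cp.maximum(tau * resid, (tau - 1) * resid)) / n
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    constraints = [cp.abs(g[i] - g[j]) <= L * D[i, j]
                   for i in range(n) for j in range(i + 1, n)]   # O(n^2), exposition only
    cp.Problem(cp.Minimize(pinball), constraints).solve()
    return g.value
```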
A.1.2 Additional algorithmic details
The results below elucidate various approaches to computing the prediction set. Algorithm 1 describes how we might use a binary search algorithm to discover the critical at which equals the target cutoff.
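To make the binary-search idea concrete, the sketch below locates the crossing point; `fit_quantile` is a hypothetical callable that refits the augmented quantile regression with the test point's score imputed and returns the fitted quantile at the test covariate, and monotonicity of the crossing is assumed.

```python
def critical_threshold(fit_quantile, lo, hi, tol=1e-6):
    """Binary search for the critical imputed score at which the imputed score first
    exceeds the fitted quantile at the test covariate (a sketch of the idea behind
    Algorithm 1). Assumes s - fit_quantile(s) is monotone on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid <= fit_quantile(mid):   # mid would still be included in the prediction set
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```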
As discussed in the main text, Algorithm 1 may be unnecessarily inefficient when the optimization problem possesses additional structure. In particular, a common implementation of our method is to fit an unregularized quantile regression over a finite-dimensional function class. In this case, it is possible to exactly compute . Our approach to this problem, which leverages standard tools from LP sensitivity analysis, relies on the following observation.
Proposition 5.
Proof.
This result follows immediately from the fact that is feasible and the duality gap of and is zero. ∎
We now recall some basic linear programming (LP) facts. First, assuming without loss of generality that is full-rank, it follows immediately from standard LP theory that there exists a solution in which all but indices lie at the extremes (Dantzig and Thapa (2003)). Moreover, the remaining non-trivial coordinates of are piecewise linear in , and the slope of the function given by can be obtained by solving a linear equation (Dantzig and Thapa (2003)). Thus, obtaining the critical value of at which exceeds our cutoff amounts to tracing a piecewise linear function from until the cutoff is reached. Algorithm 2 provides an explicit description of this procedure, which closely resembles the simplex method. Note that the algorithm is initialized at since the dual solution at that point is known to be zero (Proposition 5).
We conclude this section with one final method for computing a conservative alternative to our prediction set that requires only a single quantile regression fit. For this, suppose we know an upper bound for . Then, we may consider the conservative prediction set . Our final proposition shows that whenever is a valid upper bound on , provides a conservative coverage guarantee. To avoid subtle issues related to the non-uniqueness of the optimal quantile function, we will assume that is a min-norm solution of (4.1) (the choice of norm is not important).
Proposition 6.
Assume that admits a strictly-convex norm and let denote the corresponding min-norm solutions to (4.1). Then, for all ,
Proof of Proposition 6.
Let
Assume for the sake of contradiction that , , and . To obtain a contradiction, we claim that it is sufficient to prove that
(A.1)
To see why, note that since is a global optimum of , we must have that
Rearranging this and applying (A.1) gives the inequality
or equivalently,
Since is the unique min-norm minimizer of this implies that .
Now, by a completely symmetric argument reversing the roles of and we also have that
which by identical reasoning implies that . Thus, we have arrived at our desired contradiction.
To prove (A.1) we break into two cases.
Case 1:
.
Case 2:
.
∎
A.2 Additional experiments on the RxRx1 data
As an alternative to estimating the probabilities of experiment membership, here we consider constructing a feature representation for the RxRx1 data using principal component analysis. Namely, we implement the linear quantile regression method of Section 2 using the top principal components of the feature layer of the neural network as input. We choose the number of principal components for this analysis to be 70 based on a visual inspection of a scree plot (Figure 15). All other steps of this experiment are kept identical to the procedure described in Section 5.3.
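A minimal sketch of this construction is given below; the function and argument names are ours, and the neural-network feature-layer activations are assumed to have been extracted already.

```python
from sklearn.decomposition import PCA

def pca_features(feature_layer_train, feature_layer_all, n_components=70):
    """Project feature-layer activations onto their top principal components and
    return them as inputs for the linear quantile regression."""
    pca = PCA(n_components=n_components)
    pca.fit(feature_layer_train)                  # directions learned on the training split only
    projected = pca.transform(feature_layer_all)
    return projected, pca.explained_variance_ratio_   # the latter yields the scree plot
```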
Similar to the results of Section 5.3, we find that the empirical coverage of this method is close to the target level across all cell types and experiments (Figure 16).


A.3 Proofs of the main coverage guarantees
In this section we prove the top-level coverage guarantees of our method. We begin by proving our most general result, Theorem 3, which considers a generic function class and penalty . Then, by restricting the choices of and , we obtain Theorem 2 and Corollary 1 as special cases.
Proof of Theorem 3.
We begin by examining the first order conditions of the convex optimization problem (3.1). Namely, since is a minimizer of
we must have that for any fixed ,
By a straightforward computation, the subgradients of the pinball loss are given by
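For orientation, written in generic notation with residual r = y - f(x), the pinball loss at level α and its subdifferential in r take the standard form:

```latex
% Pinball (quantile) loss at level \alpha as a function of the residual r = y - f(x),
% together with its subdifferential in r (generic notation, stated only for reference).
\ell_\alpha(r) = \max\{\alpha r,\ (\alpha - 1) r\},
\qquad
\partial_r \ell_\alpha(r) =
\begin{cases}
\{\alpha\}, & r > 0,\\
[\alpha - 1,\ \alpha], & r = 0,\\
\{\alpha - 1\}, & r < 0.
\end{cases}
```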
Let be the values setting the subgradient to . Rearranging, we obtain
Our desired result now follows from two observations. First, the LHS above can be related to our desired coverage guarantee through the equation,
Moreover, since is fit symmetrically, i.e., is invariant to permutations of the input data, we have that are exchangeable. Thus, we additionally have that
(A.2)
Finally, since , we can bound the first term as
where the last step follows by exchangeability. This proves the second claim of Theorem 3.
With the proof of Theorem 3 in hand, we are now ready to prove the special cases stated in Theorem 2 and Corollary 1.
Proof of Theorem 2.
The first statement of Theorem 2 follows immediately from the first statement of Theorem 3 in the special case where .
To get the second statement of Theorem 2 note that the second statement of Theorem 3 tells us that for any ,
So, to complete the proof we just need to show that when the distribution of is continuous
To do this, first note that by the exchangeability of we have
Moreover, recalling that for some we additionally have that
where the last line follows from the fact that are independent and continuously distributed and is a -dimensional subspace of . From this, we conclude that with probability 1,
and plugging this into our previous calculation we arrive at the desired inequality
∎
Proof of Corollary 1.
This follows immediately by applying Theorem 2 in the special case where . ∎
A.4 Proofs for RKHS functions
In this section we prove Propositions 1 and 2. Throughout both proofs we will let denote our upper bound on the kernel and (when applicable) denote our upper bound on the density of . Moreover, we will assume that the data satisfies the following set of moment conditions.
Assumption 1 (Moment conditions for RKHS bounds).
There exist constants such that
Furthermore, we also have that .
A.4.1 Proof of Proposition 1
Our main idea is to exploit the stability of RKHS regression. We will do this using two main lemmas. The first lemma is a canonical stability result first proven in Bousquet and Elisseeff (2002) that bounds the sensitivity of the RKHS fit to changes of a single data point. While this result is quite powerful, it is not sufficient for our context because it does not account for the extra linear term . Thus, we will also develop a second lemma that controls the stability of the fit to changes in .
When formalizing these ideas it will be useful to have some additional notation that explicitly separates the dependence of the fit on from the dependence of the fit on the data. Let
denote the result of fitting the RKHS part of the function class with held fixed. Additionally, let denote an independent copy of and for any define
to be the leave--out version of obtained by swapping out for . Our first lemma bounds the difference between and .
Lemma 1.
Assume that . Then, for any two datasets and ,
Proof.
By a straightforward calculation one can easily check that is a 1-Lipschitz function of (see Lemma 11 for details). Thus, we may apply Theorem 22 of Bousquet and Elisseeff (2002) to conclude that
Then by applying the reproducing property of the RKHS and our bound on the kernel we arrive at the desired inequality,
∎
Our second lemma bounds the stability of the fit in .
Lemma 2.
Assume that . Then for any dataset ,
Proof.
The proof of this lemma is quite similar to the proof of Theorem 22 in Bousquet and Elisseeff (2002). For ease of notation let
By the optimality of and we have
and
Moreover, by the convexity of and it also holds that
and
Putting all four of these inequalities together we find that
where the last inequality follows from the Lipschitz property of (see Lemma 11). To conclude the proof one simply notes that by the reproducing property of the RKHS we have that
as desired. ∎
In order to apply this lemma to bound we will need to control the size of . This is done in our next result. The statement of this lemma may look somewhat peculiar due to the presence of a re-weighting function . To aid intuition, it may be useful to keep in mind the special case , which turns the expectation below into a simple tail probability. While this form may seem unusual, our reason for stating the lemma this way is that it fits seamlessly into our later calculations without the need for additional exposition.
Lemma 3.
Assume that . Let and assume that there exist constants such that and . Then for any and ,
Proof.
By the Cauchy-Schwarz inequality we have
Additionally, by Jensen’s inequality it also holds that . So, putting these two inequalities together we arrive at
∎
The final preliminary lemmas that we will require are controls on the maximum possible sizes of and . Once again, these lemmas will involve re-weighting functions, the purpose of which is to ease our calculations later on.
Lemma 4.
It holds deterministically that
If in addition, , , and is a function satisfying for some , then we also have that for all ,
Proof.
Taking gives loss
So, since is a minimizer of the quantile regression objective we must have that
This proves the first part of the lemma. To get the second part we simply note that
∎
Lemma 5.
Let and . Assume that , , and there exist constants such that , , , , and . Then there exists a constant such that for all ,
Proof.
With all of these results in hand we are ready to prove Proposition 1.
Proof of Proposition 1..
We will exploit the stability of the RKHS fit. Let and define the event
By Lemmas 3 and 5 we know that
Thus, we just need to focus on what happens on the event . By applying the exchangeability of the quadruples we have that
To bound this quantity we just need to control the inner expectation. We will begin by fixing a large integer and applying the inequality
Our motivation for applying this bound is that by choosing sufficiently large we will be able to swap a sum and a maximum without incurring too much slack. More precisely, let be a minimal size -net of . It is well known that there exists an absolute constant such that . Then, using this -net we compute that
where the first inequality follows from the definition of and the second inequality uses both Lemma 2 and the fact that on the event
Continuing this calculation directly we see that,
where the last line applies Lemma 1. Finally, using the fact that has a bounded density we may upper bound the above display by
Putting this all together we conclude that
The desired result then follows by taking and plugging in our definition for .
∎
A.4.2 Proof of Proposition 2
To simplify the notation let
and
denote the empirical and population losses and let
and
denote the corresponding empirical and population objectives. Note that and are strictly convex in and convex in . Thus, we may let denote the minimizers of and , respectively. To further ease notation in the sections that follow we will sometimes use and to denote arbitrary elements of and . Finally, we will let denote the projection operators onto and , respectively.
With these preliminary definitions in hand we now formally state the assumptions of Proposition 2. Our first assumption is that is locally strongly convex around its minimum.
Assumption 2 (Population Strong Convexity).
Let denote the distance from to the nearest population minimizer. Then, there exist constants such that
Overall, we believe that this assumption is mild and should hold for all distributions of interest. For instance, for continuous data it is easy to check that this condition holds whenever has a positive density on . On the other hand, for discrete data we expect to have the even stronger inequality . This is due to the fact that for discrete data has sharp jump discontinuities that give rise to large increases in the loss when moves away from .
The second assumption we will need is a set of moment conditions on and .
Assumption 3 (Moment Conditions).
There exist constants such that
Furthermore, we also have that .
With these assumptions in hand we are ready to prove Proposition 2. We begin by giving a technical lemma that controls the concentration of around .
Lemma 6.
Assume that and that there exist constants such that and . Then for any ,
Proof.
Let and be Rademacher random variables. Since the pinball loss is 1-Lipschitz (see Lemma 11) we may apply the symmetrization and contraction properties of Rademacher complexity to conclude that
where the last inequality follows from standard bounds on the Rademacher complexities of linear and kernel function classes (see, e.g., Section 4.1.2 of Boucheron et al. (2005)). ∎
We now prove the main proposition.
Proof of Proposition 2.
We will show that
1.
2.
3.
Our desired result will then follow by writing
We establish each of these three facts in order.
Step 1: By the results of Section 4.1.2 in Boucheron et al. (2005) we know that has Rademacher complexity at most . By the contraction property this also implies that has Rademacher complexity at most . So, by the symmetrization inequality we have that for any ,
This proves that , as desired.
Step 3: This step is considerably more involved than the previous two. To begin write
We will bound each of the two terms on the right hand side separately. To do this we will use a two-step peeling argument where each step gives a tighter bound on than the previous one.
Our first step will show that with high probability must be within of . Let be the constant appearing in Lemma 5. Then, by a direct computation we have that
where the last line follows by applying Lemmas 4 and 5 with . Finally, by Lemma 6 we can additionally bound the first term above as
So, in total we find that
This concludes the proof of our first concentration inequality for .
In our second step we will use this preliminary bound to get an even tighter control on . Fix any with . For any let . Then,
This proves that . To get a similar bound on the expectation write
as desired.
∎
A.5 Proofs for Lipschitz Functions
In this section we prove Proposition 3. Throughout we make the following set of technical assumptions.
Assumption 4.
There exist constants such that , , and with probability 1, , , and for all .
The primary technical tool that we will need for the proof is a covering number bound for Lipschitz functions. This result is well known from the prior literature, and we re-state it here for clarity. In the work that follows we use to denote the ball of radius in .
Definition 1.
The covering number of a set under norm is the minimum number of balls of radius needed to cover .
Lemma 7 (Covering number of Lipschitz functions, Theorem 2.7.1 in Vaart and Wellner (1996)).
Let denote the space of bounded Lipschitz functions on . Then there exists a constant such that for any ,
In the present context we need to control the behaviour of Lipschitz fitting under a re-weighting . To account for this we will require a more general covering number bound on a weighted class of Lipschitz functions. This is given in the following lemma. In the work that follows, recall that for a probability measure , the norm of a function is defined as .
Lemma 8.
Let be a fixed function and denote the space of bounded Lipschitz functions on multiplied by . Then for any probability measure , there exists a constant such that for any ,
Proof.
Recall that we defined . Fix any and let be a minimal -norm, -covering of . Fix any and let be such that . Then,
In particular, we find that is an -norm, -covering of . The desired result then immediately follows by applying Lemma 7 to get a bound on . ∎
The previous two lemmas only apply to bounded Lipschitz functions with maximum Lipschitz norm . To apply these results to our current setting we will need to bound and . Our next lemma does exactly this.
Lemma 9.
Assume that there exist constants such that with probability one and for all . Then with probability one, and .
Proof.
Since is a minimizer of the quantile regression objective we must have
This proves the first part of the lemma. To get the second part, note that since has an intercept term we may assume without loss of generality that . Thus,
as desired. ∎
Our final preliminary lemma gives a control on the norm of the linear part of the fit. Similar to what we had for RKHS functions above, here we state this result under an arbitrary re-weighting .
Lemma 10.
Let and . Assume that there exist constants such that , , and with probability 1, , , and for all . Then there exists a constant such that
Proof.
With these preliminaries out of the way we are now ready to prove Proposition 3.
Proof of Proposition 3.
The main idea of this proof is to show that concentrates uniformly around its expectation. Since for any fixed , this will imply that
We now formalize this idea. Define the event
By Lemmas 9 and 10 we know that
Thus, we just need to focus on what happens on the event . By the exchangeability of the quadruples we have
Let be a small constant that we will specify later and denote the tent function
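One standard choice of such a tent function, stated here in generic notation only for orientation (the exact construction used in the proof may differ in its constants), is the 1/δ-Lipschitz approximation of the indicator:

```latex
% A 1/\delta-Lipschitz "tent" approximation of the indicator 1\{t \le 0\}:
% it equals 1 for t \le 0, decays linearly on [0, \delta], and vanishes for t \ge \delta,
% so that 1\{t \le 0\} \le \psi_\delta(t) \le 1\{t \le \delta\}.
\psi_\delta(t) =
\begin{cases}
1, & t \le 0,\\
1 - t/\delta, & 0 < t < \delta,\\
0, & t \ge \delta.
\end{cases}
```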
Let and . Then,
where the third inequality follows by symmetrization, and the fourth inequality uses the fact that has a bounded density, the contraction inequality, and the fact that is -Lipschitz.
To conclude the proof we bound each of the first three terms appearing on the last line above. We have that
while
Finally, by Lemma 8 we have the covering number bound
and so by Dudley’s entropy integral
Putting all of these results together gives the final bound
The desired result then follows by optimizing over .
∎
A.6 Proofs for Section 4
In this section we prove the results appearing in Section 4 of the main text. We start with a proof of Theorem 4.
Proof of Theorem 4.
We begin by giving a more careful derivation of the dual program. Recall that our primal optimization problem is
s.t.
The Lagrangian for this program is
(A.3)
For ease of notation, let . Then, minimizing with respect to gives,
So, taking derivatives of this function with respect to , and , we arrive at the constraints,
Since the only restriction on and is that they are non-negative, this can be simplified to,
Thus, we arrive at the desired dual formulation,
(A.4)
Now, recall that we used the notation to denote the dual-optimal for a particular choice of . Assume for the sake of contradiction that there exists such that . Observe that we can write the dual objective as
where does not depend on . Our assumption implies that
or equivalently,
On the other hand, by the optimality of , we have that
Applying our assumption, we conclude that
which by rearranging yields the desired contradiction
∎
We now turn to the proof of Proposition 4, which states that the coverage properties of are the same as .
Proof of Proposition 4.
The proof of this proposition is nearly identical to the proof of Theorem 3, with the exception that now, instead of looking at the first order conditions of the primal, we investigate the first order conditions of the Lagrangian (A.3) at the optimal dual variables. We keep all the notation the same as in the proof of Theorem 4.
We begin by proving the first statement pertaining to the coverage properties of . Let denote an optimal primal-dual solution at the input . Recall from the proof of Theorem 4 that the Lagrangian for the optimization is
Fix any re-weighting function . By assumption we know that strong duality (and thus the KKT conditions) hold. So, by considering the derivative of the Lagrangian in the direction and applying the KKT stationarity condition we find that
(A.5)
To further unpack this equality, note that complementary slackness in the KKT conditions necessitates that and . Thus, when , we must have , or equivalently, and when , we must have . Last, when the residual is exactly , the corresponding can take any value in . Plugging these observations into (A.5), we obtain
To relate this stationarity condition to the coverage, note that
Finally, since the second term above can be bounded as
while when , we additionally have the lower bound
This concludes the proof of the first part of the proposition. For the second part of the proposition one simply notes that for any ,
where the last equality is simply our first order condition (A.5).
∎
A.7 Two-sided fitting
Recall that in the main text we defined the two-sided prediction set
Analogous to our work on one-sided prediction sets in the main text, in this section we outline the coverage properties and computationally efficient implementation of .
To begin, let denote an optimal solution to the dual program (4.2) when is replaced by , i.e. recalling the definition of , let be a solution to
Then, similar to our one-sided sets, also admits the analogous dual formulation
(A.6)
and the analogous randomized prediction set
(A.7)
where and independently .
As the next theorem states formally, these two-sided prediction sets have identical coverage properties to their one-sided analogs.
Theorem 5.
Let be any vector space, and assume that for all , the derivative of exists. If is non-negative with , then the unrandomized prediction set given by (A.6) satisfies the lower bound
On the other hand, suppose . Then, for all , we additionally have the two-sided bound,
(A.8)
where is an interpolation error term satisfying . Furthermore, the same results also hold for the randomized set (A.7) with replaced by .
Proof.
Note that
The result then follows by repeating the steps of Proposition 4 twice to bound the two terms above separately. A similar argument demonstrates the coverage of the randomized set (A.7). ∎
A.8 Proofs of additional technical lemmas
Lemma 11 (Lipschitz property of the pinball loss).
The pinball loss is 1-Lipschitz in the sense that for any ,
Proof.
We will show that . The reverse inequality will then follow by symmetry. There are four cases.
Case 1:
.
Case 2:
.
Case 3:
.
Case 4:
.
∎
Lemma 12.
Let be i.i.d. and assume that there exist constants such that , , and . Then there exist constants (depending only on , , and ) such that,
Proof.
The main idea of this proof is to apply Theorem 5.4 of Mendelson (2014) and thus conclude that for some constant , is large uniformly in . To apply this theorem we need to check two technical conditions. Namely, we need to show that the class of functions has bounded Rademacher complexity and that is not too small. We now check these conditions.
Let denote i.i.d. Rademacher random variables. Then, the Rademacher complexity of can be bounded as
This verifies the first technical condition. For the second condition note that by the Paley-Zygmund inequality
Thus, by Theorem 5.4 in Mendelson (2014) we have that there exist constants such that with probability at least
and thus in particular,
Taking gives the desired result. ∎