
Batch Multivalid Conformal Prediction

Christopher Jung (Department of Computer Science, Stanford University); Georgy Noarov (Department of Computer and Information Sciences, University of Pennsylvania); Ramya Ramalingam (Department of Computer and Information Sciences, University of Pennsylvania); Aaron Roth (Department of Computer and Information Sciences, University of Pennsylvania)
Abstract

We develop fast distribution-free conformal prediction algorithms for obtaining multivalid coverage on exchangeable data in the batch setting. Multivalid coverage guarantees are stronger than marginal coverage guarantees in two ways: (1) They hold even conditional on group membership—that is, the target coverage level 1-\alpha holds conditionally on membership in each of the (potentially intersecting) groups in an arbitrary finite collection \mathcal{G} of regions in the feature space. (2) They hold even conditional on the value of the threshold used to produce the prediction set on a given example. In fact, multivalid coverage guarantees hold even when conditioning on group membership and threshold value simultaneously.

We give two algorithms: both take as input an arbitrary non-conformity score and an arbitrary collection of possibly intersecting groups \mathcal{G}, and then can equip arbitrary black-box predictors with prediction sets. Our first algorithm BatchGCP is a direct extension of quantile regression, needs to solve only a single convex minimization problem, and produces an estimator which has group-conditional guarantees for each group in \mathcal{G}. Our second algorithm BatchMVP is iterative, and gives the full guarantees of multivalid conformal prediction: prediction sets that are valid conditionally both on group membership and non-conformity threshold. We evaluate the performance of both of our algorithms in an extensive set of experiments. Code to replicate all of our experiments can be found at https://github.com/ProgBelarus/BatchMultivalidConformal.

1 Introduction

Consider an arbitrary distribution \mathcal{D} over a labeled data domain \mathcal{Z}=\mathcal{X}\times\mathcal{Y}. A model is any function h:\mathcal{X}\rightarrow\mathcal{Y} for making point predictions. (Often, as with a neural network whose softmax-layer scores are transformed into label predictions, a model will also produce auxiliary outputs that can be used in the construction of the non-conformity score s_{h}.) The traditional goal of conformal prediction in the “batch” setting is to take a small calibration dataset consisting of labeled examples sampled from \mathcal{D} and use it to endow an arbitrary model h:\mathcal{X}\rightarrow\mathcal{Y} with prediction sets \mathcal{T}_{h}(x)\subseteq\mathcal{Y} that cover the true label with probability 1-\alpha marginally, for some target miscoverage rate \alpha:

\Pr_{(x,y)\sim\mathcal{D}}[y\in\mathcal{T}_{h}(x)]=1-\alpha

This is a marginal coverage guarantee because the probability is taken over the randomness of both x and y, without conditioning on anything. In the batch setting (unlike in the sequential setting), labels are not available when the prediction sets are deployed. Our goal in this paper is to give simple, practical algorithms in the batch setting that can give stronger than marginal guarantees — the kinds of multivalid guarantees introduced by Gupta et al. (2022); Bastani et al. (2022) in the sequential prediction setting.

Following the literature on conformal prediction (Shafer and Vovk, 2008), our prediction sets are parameterized by an arbitrary non-conformity score s_{h}:\mathcal{Z}\rightarrow\mathbb{R} defined as a function of the model h. Informally, smaller values of s_{h}(x,y) should mean that the label y “conforms” more closely to the prediction h(x) made by the model. For example, in a regression setting in which \mathcal{Y}=\mathbb{R}, the simplest non-conformity score is s_{h}(x,y)=|h(x)-y|. By now there is a large literature giving more sophisticated non-conformity scores for both regression and classification problems—see Angelopoulos and Bates (2021) for an excellent recent survey. A non-conformity score function s_{h}(x,y) induces a distribution over non-conformity scores, and if \tau is the 1-\alpha quantile of this distribution (i.e. \Pr_{(x,y)\sim\mathcal{D}}[s_{h}(x,y)\leq\tau]=1-\alpha), then defining prediction sets as

\mathcal{T}_{h}^{\tau}(x)=\{y:s_{h}(x,y)\leq\tau\}

gives 1-\alpha marginal coverage. Split conformal prediction (Papadopoulos et al., 2002; Lei et al., 2018) simply finds a threshold \tau that is an empirical 1-\alpha quantile on the calibration set (with a small sample-bias correction), and then uses this to deploy the prediction sets \mathcal{T}_{h}^{\tau}(x) defined above. Our goal is to give stronger coverage guarantees, and to do so, rather than learning a single threshold \tau from the calibration set, we will learn a function f:\mathcal{X}\rightarrow\mathbb{R} mapping unlabeled examples to thresholds. Such a mapping f induces prediction sets defined as follows:

\mathcal{T}_{h}^{f}(x)=\{y:s_{h}(x,y)\leq f(x)\}
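As a point of reference for the methods that follow, split conformal prediction with the absolute-residual score can be implemented in a few lines. The sketch below is ours, not taken from the paper's repository; it assumes a fitted scikit-learn-style point predictor with a predict method, and uses the standard finite-sample quantile correction.

```python
import numpy as np

def split_conformal_threshold(model, X_cal, y_cal, alpha=0.1):
    """Return a single threshold tau: the corrected (1 - alpha) empirical
    quantile of the non-conformity scores s(x, y) = |model(x) - y|."""
    scores = np.abs(model.predict(X_cal) - y_cal)
    n = len(scores)
    level = min(1.0, np.ceil((1 - alpha) * (n + 1)) / n)  # small-sample correction
    return np.quantile(scores, level)

def prediction_interval(model, x, tau):
    """The prediction set {y : |model(x) - y| <= tau} is an interval."""
    yhat = model.predict(x.reshape(1, -1))[0]
    return yhat - tau, yhat + tau
```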

Our goal is to find functions f:\mathcal{X}\rightarrow\mathbb{R} that give valid conditional coverage guarantees of two sorts. Let \mathcal{G} be an arbitrary collection of groups: each group g\in\mathcal{G} is a subset of the feature domain (g\in 2^{\mathcal{X}}) about which we make no assumptions, and we write g(x)=1 to denote that x is a member of group g. An example x might be a member of multiple groups in \mathcal{G}. We want to learn a function f that induces group conditional coverage guarantees—i.e. such that for every g\in\mathcal{G}:

\Pr_{(x,y)\sim\mathcal{D}}[y\in\mathcal{T}_{h}^{f}(x)\,|\,g(x)=1]=1-\alpha

Here we can think of the groups as representing e.g. demographic groups (broken down by race, age, gender, etc.) in settings in which we are concerned about fairness, or as representing any other categories that we think might be relevant to the domain at hand. Since our functions f now map different examples x to different thresholds f(x), we also want our guarantees to hold conditional on the chosen threshold—which we call a threshold calibrated guarantee. This rules out algorithms that achieve their target coverage rates by overcovering for some thresholds and undercovering for others — for example, by randomizing between full and empty prediction sets. That is, we require:

\Pr_{(x,y)\sim\mathcal{D}}[y\in\mathcal{T}_{h}^{f}(x)\,|\,g(x)=1,f(x)=\tau]=1-\alpha

If f is such that its corresponding prediction sets \mathcal{T}_{h}^{f}(x) satisfy both group and threshold conditional guarantees simultaneously, then we say that it provides full multivalid coverage.

1.1 Our Results

We design, analyze, and empirically evaluate two algorithms: one for giving group conditional guarantees for an arbitrary collection of groups \mathcal{G}, and the other for giving full multivalid coverage guarantees for an arbitrary collection of groups \mathcal{G}. We give PAC-style guarantees (Park et al., 2019), which means that with high probability over the draw of the calibration set, our deployed prediction sets have their desired coverage properties on the underlying distribution. Thus our algorithms also offer “training-conditional coverage” in the sense of Bian and Barber (2022). We prove our generalization theorems under the assumption that the data is drawn i.i.d. from some distribution, but note that De Finetti’s theorem (Ressel, 1985) implies that our analysis carries over to data drawn from any exchangeable distribution (see Remark B.1).

Group Conditional Coverage: BatchGCP

We first give an exceedingly simple algorithm BatchGCP (Algorithm 1) to find a model f that produces prediction sets \mathcal{T}_{h}^{f} that have group conditional (but not threshold calibrated) coverage guarantees. We consider the class of functions \mathcal{F}=\{f_{\lambda}:\lambda\in\mathbb{R}^{|\mathcal{G}|}\}: each f_{\lambda}\in\mathcal{F} is parameterized by a vector \lambda\in\mathbb{R}^{|\mathcal{G}|}, and takes value:

f_{\lambda}(x)=f_{0}(x)+\sum_{g\in\mathcal{G}:g(x)=1}\lambda_{g}.

Here f_{0} is some arbitrary initial model. Our algorithm simply finds the parameters \lambda that minimize the pinball loss of f_{\lambda}(x). This is a |\mathcal{G}|-dimensional convex optimization problem and so can be solved efficiently using off-the-shelf convex optimization methods. We prove that the resulting function f_{\lambda}(x) guarantees group conditional coverage. This can be viewed as an extension of conformalized quantile regression (Romano et al., 2019), which is also based on minimizing pinball loss. It can also be viewed as an algorithm promising “quantile multiaccuracy”, by analogy to (mean) multiaccuracy introduced in Hébert-Johnson et al. (2018b); Kim et al. (2019), and is related to similar algorithms for guaranteeing multiaccuracy (Gopalan et al., 2022b). Here pinball loss takes the role that squared loss plays in (mean) multiaccuracy.

Multivalid Coverage: BatchMVP

We next give a simple iterative algorithm BatchMVP (Algorithm 2) to find a model f that produces prediction sets \mathcal{T}_{h}^{f} that satisfy both group and threshold conditional guarantees simultaneously — i.e. full multivalid guarantees. It iteratively finds groups g\in\mathcal{G} and thresholds \tau such that the current model fails to have the target coverage guarantees conditional on g(x)=1 and f(x)=\tau, and then “patches” the model so that it does. We show that each patch substantially improves the pinball loss of the model, which implies fast convergence. This can be viewed as an algorithm for promising “quantile multicalibration” and is an extension of related algorithms for guaranteeing mean multicalibration (Hébert-Johnson et al., 2018b), which offer similar guarantees for mean (rather than quantile) prediction. Once again, pinball loss takes the role that squared loss takes in the analysis of mean multicalibration.

Empirical Evaluation

We implement both algorithms and evaluate them on synthetic prediction tasks, and on 10 real datasets derived from US Census data from the 10 largest US states using the “Folktables” package of Ding et al. (2021). In our synthetic experiments, we measure group conditional coverage with respect to synthetically defined groups that are constructed to be correlated with label noise. On the Census datasets, we aim to ensure coverage on population groups defined by reported race and gender categories. We compare our algorithms to two other split conformal prediction methods: a naive baseline which simply ignores group membership and uses a single threshold, and the method of Foygel Barber et al. (2020), which calibrates a threshold \tau_{g} for each group g\in\mathcal{G} and then, on a new example x, predicts the most conservative threshold among all groups x belongs to. We find that both of our methods obtain significantly better group-wise coverage and threshold calibration than the baselines we compare to. Furthermore, both methods are very fast in practice, taking only a few seconds to calibrate on datasets containing tens of thousands of points.

1.2 Additional Related Work

Conformal prediction (see Shafer and Vovk (2008); Angelopoulos and Bates (2021) for excellent surveys) has seen a surge of activity in recent years. One large category of recent work has been the development of sophisticated non-conformity scores that yield compelling empirical coverage in various settings when paired with split conformal prediction (Papadopoulos et al., 2002; Lei et al., 2018). This includes Romano et al. (2019) who give a nonconformity score based on quantile regression, Angelopoulos et al. (2020) and Romano et al. (2020b) who give nonconformity scores designed for classification, and Hoff (2021) who gives a nonconformity score that leads to Bayes optimal coverage when data is drawn from the assumed prior distribution. This line of work is complementary to ours: we give algorithms that can be used as drop-in replacements for split conformal prediction, and can make use of any of these nonconformity scores.

Another line of work has focused on giving group conditional guarantees. Romano et al. (2020a) note the need for group-conditional guarantees with respect to demographic groups when fairness is a concern, and propose separately calibrating on each group in settings in which the groups are disjoint. Foygel Barber et al. (2020) consider the case of intersecting groups, and give an algorithm that separately calibrates on each group, and then uses the most conservative group-wise threshold when faced with examples that are members of multiple groups — the result is that this method systematically over-covers. The kind of “multivalid” prediction sets that we study here were first proposed by Jung et al. (2021) in the special case of prediction intervals: but the algorithm given by Jung et al. (2021), based on calibrating to moments of the label distribution and using Chebyshev’s inequality, also generally leads to over-coverage. Gupta et al. (2022) gave a theoretical derivation of an algorithm to obtain tight multivalid prediction intervals in the sequential adversarial setting, and Bastani et al. (2022) gave a practical algorithm to obtain tight multivalid coverage using prediction sets generated from any non-conformity score, also in the sequential adversarial setting. Although the sequential setting is more difficult in the sense that it makes no distributional assumptions, it also requires that labels be available after predictions are made at test time, in contrast to the batch setting that we study, in which labels are not available. Gupta et al. (2022) give an online-to-batch reduction that requires running the sequential algorithm on a large calibration set, saving the algorithm’s internal state at each iteration, and then deploying a randomized predictor that randomizes over the internal state of the algorithm across all rounds of training. This is generally impractical; in contrast we give a direct, simple, deterministic predictor in the batch setting.

Our algorithms can be viewed as learning quantile multiaccurate predictors and quantile multicalibrated predictors respectively — by analogy to multiaccuracy (Kim et al., 2019) and multicalibration (Hébert-Johnson et al., 2018b), which are defined with respect to means instead of quantiles. Their analysis is similar, but with pinball loss playing the role played by squared error in (mean) multicalibration. This requires analyzing the evolution of pinball loss under Lipschitzness assumptions on the underlying distribution, which is a complication that does not arise for means. More generally, our goals for obtaining group conditional guarantees for intersecting groups emerge from the literature on “multigroup” fairness — see e.g. (Kearns et al., 2018; Hébert-Johnson et al., 2018b; Kearns et al., 2019; Rothblum and Yona, 2021; Dwork et al., 2019; Globus-Harris et al., 2022).

2 Preliminaries

We study prediction tasks over a domain \mathcal{Z}=\mathcal{X}\times\mathcal{Y}, where \mathcal{X} denotes the feature domain and \mathcal{Y} the label domain. We write \mathcal{G}\subseteq 2^{\mathcal{X}} to denote a collection of subsets of \mathcal{X}, which we represent as indicator functions g:\mathcal{X}\rightarrow\{0,1\}. The label domain might e.g. be real valued (\mathcal{Y}=\mathbb{R})—the regression setting—or consist of some finite unordered set—the multiclass classification setting.

Suppose there is a fixed distribution \mathcal{D}\in\Delta\mathcal{Z}. Given such a distribution, we write \mathcal{D}_{\mathcal{X}}\in\Delta\mathcal{X} to denote the marginal distribution over features induced by \mathcal{D}. We write \mathcal{D}_{\mathcal{Y}}(x)\in\Delta\mathcal{Y} to denote the conditional distribution over labels induced by \mathcal{D} when we condition on a particular feature vector x. \mathcal{D}_{\mathcal{Y}}(x) captures all of the information about the label that is contained in the feature vector x, and is the object that we are trying to approximate with our uncertainty quantification. We sometimes overload notation and write \mathcal{D}(x)=\mathcal{D}_{\mathcal{Y}}(x).

Our uncertainty quantification is based on a bounded non-conformity score function s:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}. Non-conformity score functions are generally defined with respect to some model h—which is why in the introduction we wrote s_{h}—but our development is entirely agnostic to the specifics of the non-conformity score function, and so we will just write s for simplicity. Without loss of generality, we assume that the scoring function takes values in the unit interval: s(x,y)\in[0,1] for any x\in\mathcal{X} and y\in\mathcal{Y}. (When s(x,y)\in[L,U], the learner can instead use the nonconformity score s^{\prime}(x,y)=\frac{s(x,y)-L}{U-L}; a threshold q^{\prime} learned with respect to s^{\prime} translates back to a threshold q=q^{\prime}(U-L)+L with respect to the original score s.) Given a distribution \mathcal{D} over \mathcal{Z}=\mathcal{X}\times\mathcal{Y} and a non-conformity score function s, we write \mathcal{S} to denote the induced joint distribution over feature vectors x and corresponding non-conformity scores s(x,y). Analogously to our \mathcal{D}(x) notation, we write \mathcal{S}(x) to denote the non-conformity score distribution conditional on a particular feature vector x. Given a subset of the feature space B\subset\mathcal{X}, we write \mathcal{S}|B to denote the distribution on non-conformity scores conditional on x\in B. Finally, we assume that all conditional non-conformity score distributions \mathcal{S}(x) are continuous, which simplifies our treatment of quantiles — note that if this is not already the case, it can be enforced by perturbing non-conformity scores with arbitrarily small amounts of noise from a continuous distribution.

Definition 2.1.

For any q\in[0,1], we say that \tau is a q-quantile of a (continuous) nonconformity score distribution \mathcal{S} if \Pr_{(x,s)\sim\mathcal{S}}[s\leq\tau]=q. Similarly, \tau is a q-quantile of a conditional nonconformity score distribution \mathcal{S}(x) if

\Pr_{s\sim\mathcal{S}(x)}[s\leq\tau]=q.

Our convergence results will be parameterized by the Lipschitz parameter of the CDF of the underlying nonconformity score distribution. Informally speaking, a distribution with a Lipschitz CDF cannot concentrate too much of its mass on an interval of tiny width. Similar assumptions are commonly needed in related work, and can be guaranteed by perturbing non-conformity scores with noise — see e.g. the discussion in (Gupta et al., 2022; Bastani et al., 2022).

Definition 2.2.

A conditional nonconformity score distribution \mathcal{S}(x) is \rho-Lipschitz if we have

\Pr_{s\sim\mathcal{S}(x)}[s\leq\tau^{\prime}]-\Pr_{s\sim\mathcal{S}(x)}[s\leq\tau]\leq\rho(\tau^{\prime}-\tau)\quad\text{ for all }0\leq\tau\leq\tau^{\prime}\leq 1.

A nonconformity score distribution \mathcal{S} is \rho-Lipschitz if for each x\in\mathcal{X}, \mathcal{S}(x) is \rho-Lipschitz.
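If the raw scores are discrete or highly concentrated, continuity and \rho-Lipschitzness can both be enforced by the noise-perturbation trick mentioned above. A minimal sketch, under the assumption that adding independent Uniform(0, 1/\rho) noise is acceptable for the application (convolving with a uniform density of height \rho caps the conditional score density at \rho):

```python
import numpy as np

def smooth_scores(scores, rho, rng=None):
    """Add independent Uniform(0, 1/rho) noise to non-conformity scores so that
    the score distribution is continuous with a rho-Lipschitz CDF.  The noisy
    scores lie in [0, 1 + 1/rho]; if a unit range is needed, rescale them with
    the affine transformation described in the preliminaries above."""
    rng = np.random.default_rng() if rng is None else rng
    return scores + rng.uniform(0.0, 1.0 / rho, size=np.shape(scores))
```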

If we could find a model f:\mathcal{X}\rightarrow[0,1] that on each input x outputs a value f(x) that is a q-quantile of the conditional nonconformity score distribution \mathcal{S}(x), this would guarantee true conditional coverage at rate q: for every x, \Pr_{y}[y\in\mathcal{T}^{f}(x)|x]=q, where \mathcal{T}^{f}(x)=\{y:s(x,y)\leq f(x)\}. As this is generally impossible, our aim will be to train models f that allow us to make similar claims — not conditional on every x, but conditional on membership of x in some group g and on the value of f(x). To facilitate learning models f with guarantees conditional on their output values, we will learn models f whose range R(f)=\{f(x):x\in\mathcal{X}\} has finite cardinality m=|R(f)|<\infty.

3 Algorithms

3.1 Algorithmic Preliminaries

In this section we establish lemmas that will be key to the analysis of both of the algorithms we give. Both of our algorithms will rely on analyzing pinball loss as a potential function. All proofs can be found in the Appendix.

Definition 3.1.

The pinball loss at target quantile q for threshold \tau and nonconformity score s is

L_{q}(\tau,s)=(s-\tau)q\cdot 1[s>\tau]+(\tau-s)(1-q)\cdot 1[s\leq\tau].

We write

PB^{\mathcal{S}}_{q}(f)=\mathop{\mathbb{E}}_{(x,s)\sim\mathcal{S}}[L_{q}(f(x),s)]

for a model f:\mathcal{X}\rightarrow[0,1] and nonconformity score distribution \mathcal{S}. When the quantile q and/or the distribution \mathcal{S} is clear from context, we write PB, PB_{q}, and/or PB^{\mathcal{S}}.
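In code, the pinball loss and its empirical average over a sample of (threshold, score) pairs take a few lines; the following NumPy sketch uses our own illustrative names rather than anything from the paper's repository.

```python
import numpy as np

def pinball_loss(tau, s, q):
    """L_q(tau, s): weight q on undercoverage (s > tau) and 1 - q on
    overcoverage (s <= tau)."""
    return np.where(s > tau, (s - tau) * q, (tau - s) * (1 - q))

def empirical_pinball(thresholds, scores, q):
    """Empirical PB_q(f): average pinball loss of the thresholds f(x_i)
    against the observed non-conformity scores s(x_i, y_i)."""
    return np.mean(pinball_loss(thresholds, scores, q))
```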

It is well known that pinball loss is minimized by the function that predicts, for each x, the target quantile q of the conditional score distribution given x; but we will need a more refined set of statements. First we define a model’s marginal quantile consistency error:

Definition 3.2.

A q-quantile predictor f:\mathcal{X}\rightarrow\mathbb{R} has marginal quantile consistency error \alpha if

\left|\Pr_{(x,s)\sim\mathcal{S}}[s\leq f(x)]-q\right|=\alpha.

If \alpha=0, we say that f satisfies marginal q-quantile consistency.

Informally, we will need the pinball loss to have both a progress and an “anti-progress” property:

  1. If a model f is far from satisfying marginal quantile consistency, then linearly shifting it so that it satisfies marginal quantile consistency will reduce pinball loss substantially, and

  2. If a model does satisfy marginal quantile consistency, then perturbing it slightly will not increase pinball loss substantially.

The next lemma establishes these, under the assumption that the underlying distribution is Lipschitz.

Lemma 3.1.

Fix any nonconformity score distribution \mathcal{S} that is \rho-Lipschitz. Fix any model f:\mathcal{X}\rightarrow\mathbb{R} that has marginal quantile consistency error \alpha with respect to target quantile q, and let f^{\prime}(x)=f(x)+\Delta with \Delta chosen such that f^{\prime} is marginal quantile consistent at quantile q. Then

-\alpha|\Delta|+\frac{\alpha^{2}}{2\rho}\leq PB^{\mathcal{S}}_{q}(f^{\prime})-PB^{\mathcal{S}}_{q}(f)\leq-\frac{\alpha^{2}}{2\rho}.

We now define what will be a basic building block of our iterative algorithm for obtaining multivalid coverage, and of the analysis of our algorithm for obtaining group conditional coverage. It is a simple “patch” operation on a model, which shifts the model’s predictions by a constant term \Delta only on those examples x that lie in some subset B of the feature space:

Definition 3.3 (Patch Operation).

Given a model f, a subset B\subseteq\mathcal{X}, and a value \Delta\in\mathbb{R}, define the patched model f^{\prime}=\mathrm{Patch}(f,B,\Delta) to be such that

f^{\prime}(x)=\begin{cases}f(x)+\Delta&\text{if }x\in B\\ f(x)&\text{otherwise}.\end{cases}
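On a finite sample, where the model f is represented by the vector of thresholds it assigns to the calibration points, the patch operation is a one-line update; a sketch (the boolean mask in_B indicating membership in B is our own convention):

```python
import numpy as np

def patch(thresholds, in_B, delta):
    """Patch(f, B, delta): shift predicted thresholds by delta on B only."""
    patched = np.array(thresholds, dtype=float, copy=True)
    patched[in_B] += delta
    return patched
```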

We next show that if we have a model f and we can identify a large region B on which it is far from satisfying marginal quantile consistency, then “patching” the model so that it satisfies marginal quantile consistency on \mathcal{S}|B substantially improves its pinball loss.

Lemma 3.2.

Given some predictor f:\mathcal{X}\to\mathbb{R}, suppose we have a set of points B\subseteq\mathcal{X} with

\Pr_{(x,s)\sim\mathcal{S}}[x\in B]\cdot\left(\Pr_{(x,s)\sim\mathcal{S}}[s\leq f(x)|x\in B]-q\right)^{2}\geq\alpha\quad\text{and}\quad\Delta=\mathop{\mathrm{argmin}}_{\Delta^{\prime}\in\mathbb{R}}\left|\Pr_{(x,s)\sim\mathcal{S}}[s\leq f(x)+\Delta^{\prime}|x\in B]-q\right|.

Then, if \mathcal{S}|B is continuous and \rho-Lipschitz, f^{\prime}=\mathrm{Patch}(f,B,\Delta) has PB^{\mathcal{S}}_{q}(f^{\prime})\leq PB^{\mathcal{S}}_{q}(f)-\frac{\alpha}{2\rho}.

3.2 BatchGCP: Obtaining Group Conditional Guarantees

We now give an extremely simple algorithm BatchGCP (Batch Group-Conditional Predictor) that obtains group conditional (but not threshold calibrated) prediction sets. BatchGCP only needs to solve a single convex optimization problem. Specifically, Algorithm 1 takes as input an arbitrary threshold model f and a collection of groups \mathcal{G}, and then simply minimizes pinball loss over all linear combinations of f and the group indicator functions g\in\mathcal{G}. This is a quantile analogue of a similar algorithm that obtains group conditional mean consistency (Gopalan et al., 2022b).

  Let \lambda^{*} be a solution to the optimization problem:
     \mathrm{argmin}_{\lambda}\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[L_{q}\left(\hat{f}(x;\lambda),s(x,y)\right)\right]\quad\text{where}\quad\hat{f}(x;\lambda)\equiv f(x)+\sum_{g\in\mathcal{G}}\lambda_{g}\cdot g(x)
  Output \hat{f}(x;\lambda^{*})
Algorithm 1 BatchGCP(f,\mathcal{G},q,\mathcal{D})
Theorem 3.1.

For any input model f, groups \mathcal{G}, and q\in[0,1], Algorithm 1 returns \hat{f}(x;\lambda^{*}) with

\Pr_{(x,s)\sim\mathcal{S}}\left[s\leq\hat{f}(x;\lambda^{*})\,|\,g(x)=1\right]=q\quad\text{ for every }g\in\mathcal{G}.

In other words,

\Pr_{(x,y)\sim\mathcal{D}}\left[y\in\mathcal{T}^{\hat{f}}(x)\,|\,g(x)=1\right]=q\quad\text{ for every }g\in\mathcal{G},

where \mathcal{T}^{\hat{f}}(x)=\{y:s(x,y)\leq\hat{f}(x;\lambda^{*})\}. Furthermore, PB^{\mathcal{S}}_{q}(\hat{f}(\cdot;\lambda^{*}))\leq PB^{\mathcal{S}}_{q}(f).

The analysis is simple: if the optimal model \hat{f}(x;\lambda^{*}) did not satisfy marginal quantile consistency on some g\in\mathcal{G}, then by Lemma 3.2, a patch operation on the set B(g)=\{x:g(x)=1\} would further reduce its pinball loss. By definition, this patch would just shift the model parameter \lambda^{*}_{g} and yield a model \hat{f}(x;\lambda^{\prime}) for some \lambda^{\prime}\neq\lambda^{*}, falsifying the optimality of \hat{f}(x;\lambda^{*}) among such models.
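A minimal sketch of BatchGCP on a finite calibration sample, minimizing the empirical pinball loss over \lambda with a general-purpose scipy optimizer; this illustrates Algorithm 1 rather than reproducing the authors' implementation, and it assumes the group indicators are supplied as a binary matrix G of shape (n, |\mathcal{G}|).

```python
import numpy as np
from scipy.optimize import minimize

def batch_gcp(f0_vals, G, scores, q):
    """Find lambda minimizing the empirical pinball loss of
    f(x) = f0(x) + sum_g lambda_g * g(x).

    f0_vals: (n,) initial thresholds f0(x_i) on the calibration set
    G:       (n, k) binary group-membership matrix
    scores:  (n,) non-conformity scores s(x_i, y_i)
    q:       target coverage level, e.g. 0.9
    """
    def objective(lam):
        taus = f0_vals + G @ lam
        return np.mean(np.where(scores > taus,
                                (scores - taus) * q,
                                (taus - scores) * (1 - q)))
    # the objective is convex but piecewise linear, so a derivative-free
    # method is a safe default for this low-dimensional problem
    lam_star = minimize(objective, np.zeros(G.shape[1]), method="Nelder-Mead").x
    return lam_star  # deploy f(x) = f0(x) + sum_g lambda_g * g(x)
```

Any convex solver, or a subgradient method applied directly to the pinball loss, could be substituted for the derivative-free optimizer used in this sketch.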

3.3 BatchMVP: Obtaining Full Multivalid Coverage

In this section, we give a simple iterative algorithm BatchMVP (Batch Multivalid Predictor) that trains a threshold model f providing full multivalid coverage — i.e. one that produces prediction sets \mathcal{T}^{f}(x) that are both group conditionally valid and threshold calibrated. To do this, we start by defining quantile multicalibration, a quantile prediction analogue of (mean) multicalibration as defined in Hébert-Johnson et al. (2018b); Gopalan et al. (2022a), and a variant of the quantile multicalibration definition given in Gupta et al. (2022). (Informally, Hébert-Johnson et al. (2018b) define an \ell_{\infty} notion of multicalibration—taking the maximum rather than the expectation over level sets v—that Gupta et al. (2022) generalize to quantiles; Gopalan et al. (2022a) define an \ell_{1} notion of calibration for means. Here we give an \ell_{2} variant of calibration, generalized from means to quantiles.)

Definition 3.4.

The quantile calibration error of a q-quantile predictor f:\mathcal{X}\to[0,1] on group g is:

Q(f,g)=\sum_{v\in R(f)}\Pr_{(x,s)\sim\mathcal{S}}[f(x)=v\,|\,g(x)=1]\Big(q-\Pr_{(x,s)\sim\mathcal{S}}[s\leq f(x)\,|\,f(x)=v,g(x)=1]\Big)^{2}.

We say that f is \alpha-approximately q-quantile multicalibrated with respect to a group collection \mathcal{G} if

Q(f,g)\leq\frac{\alpha}{\Pr_{(x,s)\sim\mathcal{S}}[g(x)=1]}\quad\text{ for every }g\in\mathcal{G}.
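On a finite sample, the quantile calibration error of Definition 3.4 is a weighted sum over the level sets of f restricted to the group; a sketch, again representing f by its predicted thresholds on the sample (the names are ours):

```python
import numpy as np

def quantile_calibration_error(thresholds, scores, in_g, q):
    """Empirical Q(f, g): average over level sets v of f (within group g)
    of the squared coverage error at level v."""
    t_g, s_g = thresholds[in_g], scores[in_g]
    err = 0.0
    for v in np.unique(t_g):
        level = (t_g == v)
        coverage = np.mean(s_g[level] <= v)
        err += np.mean(level) * (q - coverage) ** 2
    return err
```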
Remark 3.1.

We define quantile calibration error on a group as an average over the level sets of the predictor f, in a manner similar to how Gopalan et al. (2022a) define (mean) calibration. The idealized predictor f^{*}(x) that predicts the true q-quantile of the conditional nonconformity score distribution \mathcal{S}(x) on every x would have zero quantile calibration error on every group. When we define approximate calibration error, we allow for larger quantile calibration error on groups that appear less frequently. This is necessary if we are to learn a quantile multicalibrated predictor from finite data; it is what allows us to provide out-of-sample PAC guarantees in Section 4, and it is consistent with similar definitions for mean multicalibration from the literature (Jung et al., 2021; Gopalan et al., 2022a).

Conditional on membership in each group g, the quantile multicalibration error bounds the expected coverage error when conditioning both on membership in g and on a predicted threshold f(x)=v, in expectation over v. In particular, it implies (and is stronger than) the following simple worst-case (over g and v) bound on coverage error conditional on both g(x)=1 and f(x)=v:

Claim 3.1.

If f is \alpha-approximately quantile multicalibrated with respect to \mathcal{G} and q, then

\left|\Pr_{(x,s)\sim\mathcal{S}}\left[s\leq f(x)\,|\,g(x)=1,f(x)=v\right]-q\right|\leq\frac{\sqrt{\alpha}}{\sqrt{\Pr_{(x,s)\sim\mathcal{S}}[g(x)=1,f(x)=v]}}.

In other words, for every g\in\mathcal{G} and v\in R(f),

\left|\Pr_{(x,y)\sim\mathcal{D}}\left[y\in\mathcal{T}^{f}(x)\,|\,g(x)=1,f(x)=v\right]-q\right|\leq\frac{\sqrt{\alpha}}{\sqrt{\Pr_{(x,s)\sim\mathcal{S}}[g(x)=1,f(x)=v]}}.

This claim follows by fixing any group g and threshold value v, and dropping all the (nonnegative) terms in Q(f,g) except the one corresponding to g and v (see the Appendix).

  Initialize t=0, and define f_{0} as f_{0}(x)=\mathrm{argmin}_{v\in[\frac{1}{m}]}|v-f(x)| for x\in\mathcal{X}.
  while f_{t} is not \alpha-approximately q-quantile multicalibrated with respect to \mathcal{G} do
     Let B_{t}=\{x:f_{t}(x)=v_{t},g_{t}(x)=1\}, where:
        (v_{t},g_{t})\in\mathop{\mathrm{argmax}}_{(v,g)\in[\frac{1}{m}]\times\mathcal{G}}\Pr_{(x,s)\sim\mathcal{S}}[f_{t}(x)=v,g(x)=1]\left(q-\Pr_{(x,s)\sim\mathcal{S}}[s\leq f_{t}(x)\,|\,f_{t}(x)=v,g(x)=1]\right)^{2}
     Let:
        \Delta_{t}=\mathop{\mathrm{argmin}}_{\Delta\in[\frac{1}{m}]}\left|\Pr_{(x,s)\sim\mathcal{S}}[s\leq f_{t}(x)+\Delta\,|\,x\in B_{t}]-q\right|
     Update f_{t+1}\leftarrow\mathrm{Patch}\left(f_{t},B_{t},\Delta_{t}\right) and t\leftarrow t+1.
  Output f_{t}.
Algorithm 2 BatchMVP(f,\alpha,q,\mathcal{G},\rho,\mathcal{S},m)

Algorithm 2 is simple: given an initial threshold model f, a collection of groups \mathcal{G}, and a target quantile q, it repeatedly checks whether its current model f_{t} satisfies \alpha-approximate quantile multicalibration. If not, it finds a group g_{t} and a threshold v_{t} such that f_{t} predicts a substantially incorrect quantile conditional on both membership of x in g_{t} and on f_{t}(x)=v_{t} — such a pair is guaranteed to exist if f_{t} is not approximately quantile multicalibrated. It then fixes this inconsistency with a patch operation, producing a new model f_{t+1} whose predictions are rounded to a grid of granularity given by a discretization parameter m, which ensures that the range of f_{t+1} does not grow too large. Under the assumption that the non-conformity score distribution is Lipschitz, each patch operation substantially reduces the pinball loss of the current predictor, which ensures fast convergence to quantile multicalibration.
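A minimal sketch of this loop on a finite calibration sample, working directly with the empirical analogues of the quantities above. The helper names are ours; in practice one would also record the sequence of patches so that the learned model can be applied to fresh test points, which this sketch omits.

```python
import numpy as np

def batch_mvp(f0_vals, G, scores, q, alpha, m, max_iter=1000):
    """Sketch of Algorithm 2 on a calibration sample.

    f0_vals: (n,) initial thresholds; G: (n, k) binary group-membership
    matrix; scores: (n,) non-conformity scores in [0, 1]; m: grid size.
    """
    grid = np.arange(1, m + 1) / m
    # round the initial model onto the grid [1/m]
    f = grid[np.argmin(np.abs(f0_vals[:, None] - grid[None, :]), axis=1)]
    for _ in range(max_iter):
        worst_err, worst_cell = 0.0, None
        group_err = np.zeros(G.shape[1])
        for j in range(G.shape[1]):
            g_mask = G[:, j] == 1
            for v in np.unique(f[g_mask]):
                cell = g_mask & (f == v)
                err = cell.mean() * (q - np.mean(scores[cell] <= v)) ** 2
                group_err[j] += err              # = Pr[g(x)=1] * Q(f, g)
                if err > worst_err:
                    worst_err, worst_cell = err, cell
        if group_err.max() <= alpha:             # alpha-approx. multicalibrated
            break
        # patch the worst (group, value) cell: move its threshold to the grid
        # value whose empirical coverage on the cell is closest to q
        cell_scores = scores[worst_cell]
        cov = np.array([np.mean(cell_scores <= v) for v in grid])
        f[worst_cell] = grid[np.argmin(np.abs(cov - q))]
    return f
```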

Theorem 3.2.

Suppose \mathcal{S} is \rho-Lipschitz and continuous, and m=\frac{8\rho^{2}}{\alpha}. After T\leq\frac{32\rho^{3}}{\alpha^{2}} many rounds, Algorithm 2 BatchMVP(f,\alpha,q,\mathcal{G},\rho,\mathcal{S},m) returns f_{T} such that

  1. PB^{\mathcal{S}}_{q}(f_{T})\leq PB^{\mathcal{S}}_{q}(f_{0})-T\frac{\alpha^{2}}{32\rho^{3}}.

  2. f_{T} is \alpha-approximately quantile multicalibrated with respect to \mathcal{G} and q. In particular, via Claim 3.1, we have for every g\in\mathcal{G} and v\in R(f),

     \left|\Pr_{(x,s)\sim\mathcal{S}}\left[s\leq f_{T}(x)\,|\,g(x)=1,f_{T}(x)=v\right]-q\right|\leq\frac{\sqrt{\alpha}}{\sqrt{\Pr_{(x,s)\sim\mathcal{S}}[g(x)=1,f_{T}(x)=v]}}.

     In other words, we have for every g\in\mathcal{G} and v\in R(f),

     \left|\Pr_{(x,y)\sim\mathcal{D}}\left[y\in\mathcal{T}^{f_{T}}(x)\,|\,g(x)=1,f_{T}(x)=v\right]-q\right|\leq\frac{\sqrt{\alpha}}{\sqrt{\Pr_{(x,s)\sim\mathcal{S}}[g(x)=1,f_{T}(x)=v]}}

     where \mathcal{T}^{f_{T}}(x)=\{y:s(x,y)\leq f_{T}(x)\}.

4 Out of Sample Generalization

We have presented BatchGCP (Algorithm 1) and BatchMVP (Algorithm 2) as if they have direct access to the true data distribution \mathcal{D}. In practice, rather than being able to directly access the distribution, we only have a finite calibration set D=(x_{i},y_{i})_{i=1}^{n} of n data points sampled i.i.d. from \mathcal{D}. In this section, we show that if we run our algorithms on the empirical distribution over the sample D=(x_{i},y_{i})_{i=1}^{n}, then their guarantees hold not only for the empirical distribution over D but also—with high probability—for the true distribution \mathcal{D}. We include the generalization guarantees for BatchGCP in Appendix B.3 and focus here on generalization guarantees for BatchMVP.

At a high level, our argument proceeds as follows (although there are a number of technical complications). For any fixed model f, by standard concentration arguments its in- and out-of-sample quantile calibration errors are close with high probability for a large enough sample size. Our goal is to apply concentration bounds uniformly to every model f_{T} that might be output by our algorithm, and then union bound over all of them to get a high probability bound for whichever model happens to be output. Towards this, we show how to bound the total number of distinct models that can be output as a function of T, the total number of iterations of the algorithm. This is possible because at each round t, the algorithm performs a patch operation parameterized by a group g_{t}\in\mathcal{G} (where |\mathcal{G}|<\infty), a value v_{t}\in[1/m], and an update value \Delta_{t}\in[1/m] — and thus only a finite number of models f_{t+1} can result from patching the current model f_{t}. The difficulty is that our convergence analysis in Theorem 3.2 gives a convergence guarantee in terms of the smoothness parameter \rho of the underlying distribution, which will not be preserved on the empirical distribution over the sample D drawn from \mathcal{D}. Hence, to upper bound the number of rounds our algorithm can run for, we need to interleave our generalization theorem with our convergence theorem, arguing that at each step—taken with respect to the empirical distribution over D—we make progress that can be bounded in terms of the smoothness parameter \rho of the underlying distribution \mathcal{D}.

We remark that previous analyses of out-of-sample performance for similar (mean) multicalibration algorithms (Hébert-Johnson et al., 2018a; Jung et al., 2021) have studied algorithms that use a fresh sample of data at each round, or else perturb each empirical measurement using noise so as to invoke the connection between differential privacy and generalization (Dwork et al., 2015). Analyzing algorithms of this sort is convenient, but does not correspond to how one would actually want to run the algorithm. In contrast, we directly analyze the out-of-sample performance of our algorithm when it is directly trained using a single, sufficiently large sample of data.

We first prove a high probability generalization bound for our algorithm as a function of the number of steps T it takes to converge. This bound holds uniformly for all T, and so can be applied as a function of the actual (empirical) convergence time T.

We let D=(x_{i},y_{i})_{i=1}^{n} denote our sample, S=(x_{i},s(x_{i},y_{i}))_{i=1}^{n} our nonconformity score sample, and \tilde{\mathcal{S}}_{S} the empirical distribution over S. When S is clear from context, we just write \tilde{\mathcal{S}}.

Theorem 4.1.

Suppose \mathcal{S} is \rho-Lipschitz and S\sim\mathcal{S}^{n}. Suppose BatchMVP(f,\alpha,q,\mathcal{G},\rho,\tilde{\mathcal{S}}_{S},m) (Algorithm 2) runs for T rounds and outputs model f_{T}. Then f_{T} is \alpha^{\prime}-approximately q-quantile multicalibrated with respect to \mathcal{G} on \mathcal{S} with probability 1-\delta, where

\alpha^{\prime}=\alpha+21\sqrt{\frac{3\rho^{2}\left(\ln(\frac{4\pi^{2}T^{2}}{3\delta})+T\ln(\frac{\rho^{4}|\mathcal{G}|}{\alpha^{2}})\right)}{2\alpha n}}+\frac{12\rho^{2}\left(\ln(\frac{4\pi^{2}T^{2}}{3\delta})+T\ln(\frac{\rho^{4}|\mathcal{G}|}{\alpha^{2}})\right)}{\alpha n}.

We then prove a worst-case upper bound on the convergence time T, which in combination with Theorem 4.1 establishes a worst-case PAC-style generalization bound. We remark that although the theorem contains a large constant, in our experiments our algorithm halts after a small number of iterations T (see Sections 5 and C), and we can directly apply Theorem 4.1 with this empirical value of T.

Theorem 4.2.

Suppose \mathcal{S} is \rho-Lipschitz and continuous, m=\frac{\rho^{2}}{2\alpha}, and our calibration set S\sim\mathcal{S}^{n} consists of n i.i.d. samples drawn from \mathcal{S}, where n is sufficiently large:

n\geq 92928\left(\ln\left(\frac{128\rho^{3}}{\alpha^{2}\delta^{\prime}}\right)+\frac{8\rho^{3}}{\alpha^{2}}\ln\left(\frac{\rho^{4}|\mathcal{G}|}{\alpha^{2}}\right)\right)\max\left(\frac{\rho^{4}}{4\alpha^{4}},\frac{\rho^{6}}{\alpha^{4}}\right)

Then BatchMVP(f,\alpha,q,\mathcal{G},\rho,\tilde{\mathcal{S}}_{S},m) (Algorithm 2) halts after T\leq\frac{8\rho^{3}}{\alpha^{2}} rounds with probability 1-\delta.

Formal generalization arguments are in Appendix B. Theorems 4.1, 4.2 are proved in Appendix B.4.

5 Experiments

5.1 A Synthetic Regression Task

We first consider a linear regression problem on synthetic data drawn from a feature domain of 10 binary features and 90 continuous features, with each binary feature drawn uniformly at random and each continuous feature drawn from a normal distribution \mathcal{N}(0,\sigma^{2}_{x}). The 10 binary features are used to define 20 intersecting groups, each depending on the value of a single binary feature. An input’s label is generated by a linear model with group-dependent noise:

y=\langle\theta,x\rangle+\mathcal{N}\left(0,\sigma^{2}+\sum_{i=1}^{10}\sigma^{2}_{i}x_{i}\right)

where each term \sigma^{2}_{i} is associated with one binary feature. We set \sigma^{2}_{i}=i for all i\in[10], so the more groups an input x is a member of, the more label noise it has, with larger-index groups contributing larger amounts of label noise.

We generate a dataset \{(x_{i},y_{i})\} of size 40000 using the above-described model, and split it evenly into training and test data. The training data is further split into training data of size 5000 (with which to train a least squares regression model f) and calibration data of size 15000 (with which to calibrate our various conformal prediction methods). Given the trained predictor f, we use the non-conformity score s(x,y)=|f(x)-y|. Next, we define the set of groups \mathcal{G}=\{g_{1},g_{2},\cdots,g_{20}\}, where for each j\in[20], g_{j}=\{x\in\mathcal{X}\mid x_{\lfloor(j+1)/2\rfloor}\equiv_{2}j+1\}. We run Algorithm 1 (BatchGCP) and Algorithm 2 (BatchMVP with m=100 buckets) to achieve group-conditional and full multivalid coverage guarantees respectively, with respect to \mathcal{G} with target coverage q=0.9. We compare the performance of these methods to naive split conformal prediction (which, without knowledge of \mathcal{G}, uses the calibration data to predict a single threshold) and the method of Foygel Barber et al. (2020), which predicts a threshold for each group in \mathcal{G} that an input x is part of and chooses the most conservative one.
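For concreteness, a sketch of this data-generating process; the coefficient vector \theta, the feature noise scale \sigma_{x}, and the base noise level \sigma are not specified above, so the values below are placeholders, and the group matrix is built directly from the binary features.

```python
import numpy as np

def generate_synthetic(n, sigma_x=1.0, sigma0=1.0, seed=0):
    """Draw n points with 10 binary and 90 continuous features and
    group-dependent label noise with variance sigma0^2 + sum_i i * x_i."""
    rng = np.random.default_rng(seed)
    x_bin = rng.integers(0, 2, size=(n, 10))            # binary features
    x_cont = rng.normal(0.0, sigma_x, size=(n, 90))     # continuous features
    X = np.hstack([x_bin, x_cont]).astype(float)
    theta = rng.normal(0.0, 1.0, size=100)              # placeholder coefficients
    noise_var = sigma0 ** 2 + x_bin @ np.arange(1, 11)  # sigma_i^2 = i
    y = X @ theta + rng.normal(0.0, np.sqrt(noise_var))
    # 20 intersecting groups: for each binary feature, the set where it is 0
    # and the set where it is 1
    G = np.hstack([x_bin == 0, x_bin == 1]).astype(int)
    return X, y, G
```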

Figure 1: Performance comparisons across different conformal prediction methods. Groupwise coverage is on the left, and mean prediction-set size by group is on the right (averaged over 50 runs). Error bars show standard deviation.

Figure 2: The left figure plots group-wise calibration error (averaged over 50 runs), weighted by group size. Error bars show standard deviation. The right figure is a scatterplot of the number of points associated with each threshold-group pair (g,\tau_{i}) against the average coverage conditional on that pair, for all g\in\mathcal{G} and all \tau_{i} in a grid, over all tested conformal prediction methods (consolidating results over all 50 runs). Target coverage is q=0.9.

Figures 1 and 2 present results over 50 runs of this experiment. Notice that both BatchGCP and BatchMVP achieve close to the desired group coverage across all groups—with BatchGCP achieving nearly perfect coverage and BatchMVP sometimes deviating from the target coverage rate by as much as 1%. In contrast, the method of Foygel Barber et al. (2020) significantly overcovers on nearly all groups, particularly low-noise groups, and naive split conformal prediction undercovers on high-noise groups and overcovers on low-noise groups as the expected label noise increases. Group-wise calibration error is high across all groups but the last for the method of Foygel Barber et al. (2020), and naive split conformal prediction has low calibration error on groups where inclusion/exclusion reflects less fluctuation in noise and higher calibration error on groups where there is much higher fluctuation in noise based on inclusion. Both BatchGCP and BatchMVP have lower group calibration errors — interestingly, BatchGCP appears to do nearly as well as BatchMVP in this regard despite having no theoretical guarantee on threshold calibration error. A quantile multicalibrated predictor must have low coverage error conditional on groups g and thresholds \tau_{i} that appear frequently, and may have high coverage error for pairs that appear infrequently—behavior that we see in BatchMVP in Figure 2. On the other hand, for both naive split conformal prediction and the method of Foygel Barber et al. (2020), we see high miscoverage error even for pairs (g,\tau_{i}) containing a large fraction of the probability mass.

We fix the upper limit of allowed iterations T for BatchMVP to 1000, but the algorithm typically converges and halts in many fewer iterations. Across the 50 runs of this experiment, the average number of iterations T taken to converge was 45.54\pm 14.77.

5.2 An Income Prediction Task on Census Data

We also compare our methods to naive split conformal prediction and the method of Foygel Barber et al. (2020) on real data from the 2018 Census American Community Survey Public Use Microdata, compiled by the Folktables package (Ding et al., 2021). This dataset records information about individuals including race, gender and income. In this experiment, we generate prediction sets for a person’s income while aiming for valid coverage on intersecting groups defined by race and gender.

The Folktables package provides datasets for all 50 US states. We run experiments on state-wide datasets: for each state, we split the data into 60% training data \mathcal{D}_{train} for the income predictor, 20% calibration data \mathcal{D}_{calib} to calibrate the conformal predictors, and 20% test data \mathcal{D}_{test}. After training the income predictor f on \mathcal{D}_{train}, we use the non-conformity score s(x,y)=|f(x)-y|. There are 9 provided codes for race (1. White alone; 2. Black or African American alone; 3. American Indian alone; 4. Alaska Native alone; 5. American Indian and Alaska Native tribes specified, or American Indian or Alaska Native not specified and no other races; 6. Asian alone; 7. Native Hawaiian and Other Pacific Islander alone; 8. Some Other Race alone; 9. Two or More Races) and 2 codes for sex (1. Male, 2. Female) in the Folktables data. We define groups for five of the nine race codes (disregarding the four with the least amount of data) and both sex codes. We run all four algorithms (naive split conformal prediction, the method of Foygel Barber et al. (2020), BatchGCP, and BatchMVP with m=300 buckets) with target coverage q=0.9.
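A sketch of how such a calibration set can be assembled with the folktables package; the exact feature set, race codes, and preprocessing used in the paper are not specified here, so the choices below (including which five race codes to keep) are illustrative assumptions. PINCP, RAC1P, and SEX are standard ACS PUMS column names.

```python
import numpy as np
from folktables import ACSDataSource

# 2018 ACS person-level data for California
data_source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
acs = data_source.get_data(states=["CA"], download=True)
acs = acs[acs["PINCP"].notna()]                    # keep rows with reported income

y = acs["PINCP"].to_numpy(dtype=float)             # income, the regression target
race = acs["RAC1P"].to_numpy()
sex = acs["SEX"].to_numpy()

# Intersecting groups: five race codes (an illustrative choice) plus both sexes.
race_codes = [1, 2, 6, 8, 9]
groups = [race == c for c in race_codes] + [sex == 1, sex == 2]
G = np.stack(groups, axis=1).astype(int)           # (n, 7) membership matrix
```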

We ran this experiment on the 10 largest states. Figure 3 and Figure 4 present comparisons of the performance of all four algorithms on data taken from California, averaged over 50 different runs. Results on all remaining states are presented in Appendix C.2.

Just as in the synthetic experiments, both BatchGCP and BatchMVP achieve excellent coverage across all groups—in fact, we now see nearly perfect coverage for both BatchGCP and BatchMVP, with BatchGCP still obtaining slightly better group conditional coverage. In contrast, naive split conformal prediction undercovers on certain groups and overcovers on others, and the method of Foygel Barber et al. (2020) significantly overcovers on some groups (here, groups 4 and 7). The conservative approach also generally yields prediction sets of much larger size. We also see that BatchMVP achieves very low rates of calibration error across all groups, outperforming BatchGCP in this regard. Calibration error is quite irregular across groups for both naive split conformal prediction and for the method of Foygel Barber et al. (2020), being essentially zero on certain groups and comparatively much larger on others. The average number of iterations T in which BatchMVP converged was 10.64\pm 1.12.

Figure 3: Performance across different conformal prediction methods on Folktables California data. Groupwise coverage is on the left, and mean prediction-set sizes for each group are on the right (averaged over 50 runs). Error bars show standard deviation.

Figure 4: The left figure plots group-wise calibration error, weighted by group size (averaged over 50 runs). Error bars show standard deviation. The right figure is a scatterplot of the number of points associated with each threshold-group pair (g,\tau_{i}) against the average coverage conditional on that pair, for all g\in\mathcal{G} and all \tau_{i} in a grid, over all tested conformal prediction methods (consolidating results over all 50 runs). Target coverage is q=0.9.
Acknowledgments

We thank Varun Gupta for useful conversations at an early stage of this work. This work was supported in part by the Simons Foundation collaboration on algorithmic fairness and NSF grants FAI-2147212 and CCF-2217062.

References

  • Angelopoulos and Bates [2021] Anastasios N Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.
  • Angelopoulos et al. [2020] Anastasios Nikolas Angelopoulos, Stephen Bates, Michael Jordan, and Jitendra Malik. Uncertainty sets for image classifiers using conformal prediction. In International Conference on Learning Representations, 2020.
  • Bastani et al. [2022] Osbert Bastani, Varun Gupta, Christopher Jung, Georgy Noarov, Ramya Ramalingam, and Aaron Roth. Practical adversarial multivalid conformal prediction. Advances in Neural Information Processing Systems, 2022.
  • Bian and Barber [2022] Michael Bian and Rina Foygel Barber. Training-conditional coverage for distribution-free predictive inference. arXiv preprint arXiv:2205.03647, 2022.
  • Ding et al. [2021] Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. Retiring adult: New datasets for fair machine learning. Advances in Neural Information Processing Systems, 34, 2021.
  • Dwork et al. [2015] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 117–126, 2015.
  • Dwork et al. [2019] Cynthia Dwork, Michael P Kim, Omer Reingold, Guy N Rothblum, and Gal Yona. Learning from outcomes: Evidence-based rankings. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 106–125. IEEE, 2019.
  • Foygel Barber et al. [2020] Rina Foygel Barber, Emmanuel J Candès, Aaditya Ramdas, and Ryan J Tibshirani. The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 2020.
  • Globus-Harris et al. [2022] Ira Globus-Harris, Varun Gupta, Christopher Jung, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Multicalibrated regression for downstream fairness. arXiv preprint arXiv:2209.07312, 2022.
  • Gopalan et al. [2022a] Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, and Udi Wieder. Omnipredictors. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2022a.
  • Gopalan et al. [2022b] Parikshit Gopalan, Michael P Kim, Mihir A Singhal, and Shengjia Zhao. Low-degree multicalibration. In Conference on Learning Theory, pages 3193–3234. PMLR, 2022b.
  • Gupta et al. [2022] Varun Gupta, Christopher Jung, Georgy Noarov, Mallesh M. Pai, and Aaron Roth. Online Multivalid Learning: Means, Moments, and Prediction Intervals. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022), pages 82:1–82:24, 2022.
  • Hébert-Johnson et al. [2018a] Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, pages 1939–1948. PMLR, 2018a.
  • Hébert-Johnson et al. [2018b] Úrsula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, pages 1939–1948, 2018b.
  • Hoff [2021] Peter Hoff. Bayes-optimal prediction with frequentist coverage control. arXiv preprint arXiv:2105.14045, 2021.
  • Jung et al. [2021] Christopher Jung, Changhwa Lee, Mallesh M Pai, Aaron Roth, and Rakesh Vohra. Moment multicalibration for uncertainty estimation. In Conference on Learning Theory. PMLR, 2021.
  • Kearns et al. [2018] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International Conference on Machine Learning, pages 2564–2572, 2018.
  • Kearns et al. [2019] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. An empirical study of rich subgroup fairness for machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 100–109, 2019.
  • Kim et al. [2019] Michael P Kim, Amirata Ghorbani, and James Zou. Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 247–254, 2019.
  • Lei et al. [2018] Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.
  • Papadopoulos et al. [2002] Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In European Conference on Machine Learning, pages 345–356. Springer, 2002.
  • Park et al. [2019] Sangdon Park, Osbert Bastani, Nikolai Matni, and Insup Lee. Pac confidence sets for deep neural networks via calibrated prediction. In International Conference on Learning Representations, 2019.
  • Ressel [1985] Paul Ressel. De finetti-type theorems: an analytical approach. The Annals of Probability, 13(3):898–922, 1985.
  • Romano et al. [2019] Yaniv Romano, Evan Patterson, and Emmanuel Candes. Conformalized quantile regression. Advances in neural information processing systems, 32, 2019.
  • Romano et al. [2020a] Yaniv Romano, Rina Foygel Barber, Chiara Sabatti, and Emmanuel Candès. With malice toward none: Assessing uncertainty via equalized coverage. Harvard Data Science Review, 2020a.
  • Romano et al. [2020b] Yaniv Romano, Matteo Sesia, and Emmanuel Candes. Classification with valid and adaptive coverage. Advances in Neural Information Processing Systems, 33:3581–3591, 2020b.
  • Rothblum and Yona [2021] Guy N Rothblum and Gal Yona. Multi-group agnostic pac learnability. In International Conference on Machine Learning, pages 9107–9115. PMLR, 2021.
  • Shafer and Vovk [2008] Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(Mar):371–421, 2008.

Appendix A Missing Details from Section 3

A.1 Missing Details from Section 3.1

Lemma A.1.

Fix any x\in\mathcal{X} and \tau\in[0,1].

\mathop{\mathbb{E}}_{s\sim\mathcal{S}(x)}\left[\frac{dL_{q}(f(x)+\tau,s)}{d\tau}\right]=\int_{[0,1]}\frac{dL_{q}(f(x)+\tau,s)}{d\tau}\,d\mathcal{S}(x)=\Pr_{s\sim\mathcal{S}(x)}[s\leq f(x)+\tau]-q
Proof.
\mathop{\mathbb{E}}_{s\sim\mathcal{S}(x)}\left[\frac{dL_{q}(f(x)+\tau,s)}{d\tau}\right] =\int_{0}^{1}\frac{d}{d\tau}L_{q}(f(x)+\tau,s)\,d\mathcal{S}(x)
=\int_{0}^{\tau}\frac{d}{d\tau}L_{q}(f(x)+\tau,s)\,d\mathcal{S}(x)+\int_{\tau}^{1}\frac{d}{d\tau}L_{q}(f(x)+\tau,s)\,d\mathcal{S}(x)
=\int_{0}^{\tau}(1-q)\,d\mathcal{S}(x)-\int_{\tau}^{1}q\,d\mathcal{S}(x)
=\Pr_{s\sim\mathcal{S}(x)}[s\leq\tau]-q

We now restate and prove Lemma 3.1.

Proof.
PB^{\mathcal{S}}_{q}(f^{\prime})-PB^{\mathcal{S}}_{q}(f)
=\int_{\mathcal{X}\times[0,1]}L_{q}(f(x)+\Delta,s)-L_{q}(f(x),s)\,\mathcal{S}(dx,ds)
=\int_{\mathcal{X}\times[0,1]}\int_{0}^{\Delta}\frac{dL_{q}(f(x)+\tau,s)}{d\tau}\,d\tau\,\mathcal{S}(dx,ds)
=\int_{0}^{\Delta}\int_{\mathcal{X}\times[0,1]}\frac{dL_{q}(f(x)+\tau,s)}{d\tau}\,\mathcal{S}(dx,ds)\,d\tau
=\int_{0}^{\Delta}\left(\mathop{\mathbb{E}}_{x\sim\mathcal{D}_{\mathcal{X}}}\left[\Pr_{s\sim\mathcal{S}(x)}[s\leq f(x)+\tau]-q\right]\right)d\tau
=\int_{0}^{\Delta}\left(\Pr_{(x,s)\sim\mathcal{S}}[s\leq f(x)+\tau]-q\right)d\tau
=\int_{0}^{\Delta}\Pr_{(x,s)\sim\mathcal{S}}[s\leq f(x)+\tau]\,d\tau-\Delta q

where the fourth equality follows from Lemma A.1.

For convenience, write H_{\mathcal{S},f}(\tau)=\Pr_{(x,s)\sim\mathcal{S}}[s\leq f(x)+\tau]. Note that

\int_{0}^{\Delta}H_{\mathcal{S},f}(\tau)\,d\tau
=\int_{0}^{\Delta}\Pr_{(x,s)\sim\mathcal{S}}[s\leq f(x)+\tau]\,d\tau
=\int_{0}^{\Delta}\Pr_{(x,s)\sim\mathcal{S}}[s\leq f^{\prime}(x)-\Delta+\tau]\,d\tau
=-\int_{0}^{-\Delta}\Pr_{(x,s)\sim\mathcal{S}}[s\leq f^{\prime}(x)+\tau]\,d\tau
=-\int_{0}^{-\Delta}H_{\mathcal{S},f^{\prime}}(\tau)\,d\tau,

as instead of sweeping the area under the curve from f(x)f(x) to f(x)+Δf(x)+\Delta, we can sweep the area under the curve from f(x)f^{\prime}(x) to f(x)Δf^{\prime}(x)-\Delta because f(x)=f(x)+Δf^{\prime}(x)=f(x)+\Delta.

Lemma A.2.

Fix any conformity score distribution 𝒮\mathcal{S} that is ρ\rho-Lipschitz, Δ>0\Delta>0, and f:𝒳f:\mathcal{X}\to\mathbb{R}. Then we have

H𝒮,f(0)Δ+(H𝒮,f(Δ)H𝒮,f(0))22ρ\displaystyle H_{\mathcal{S},f}(0)\Delta+\frac{(H_{\mathcal{S},f}(\Delta)-H_{\mathcal{S},f}(0))^{2}}{2\rho}
0ΔH𝒮,f(τ)𝑑τ\displaystyle\leq\int_{0}^{\Delta}H_{\mathcal{S},f}(\tau)d\tau
H𝒮,f(Δ)Δ((H𝒮,f(Δ)H𝒮,f(0))22ρ).\displaystyle\leq H_{\mathcal{S},f}(\Delta)\Delta-\left(\frac{(H_{\mathcal{S},f}(\Delta)-H_{\mathcal{S},f}(0))^{2}}{2\rho}\right).
Proof.

For simplicity write H(τ)=H𝒮,f(τ)H(\tau)=H_{\mathcal{S},f}(\tau). Note that H(τ)H(\tau) is a non-negative function that is increasing in τ\tau.

Figure 5: Upper and lower bounding the local area under HH. The max-area CDF curve is in dashed blue, the min-area CDF curve is in solid purple.

First, we find an upper bound on the area. The area is maximized when there is a linear rate of increase from y=H(0)y=H(0) to y=H(Δ)y=H(\Delta) between x=0x=0 and x=H(Δ)H(0)ρx=\frac{H(\Delta)-H(0)}{\rho} as depicted in Figure 5. The area measured via the integral can be calculated by subtracting the area of the triangle from the area of the rectangle from x=0x=0 to x=Δx=\Delta and from y=0y=0 to y=H(Δ)y=H(\Delta).

0ΔPr(x,s)𝒮[sf(x)+τ]𝑑τ\displaystyle\int_{0}^{\Delta}\Pr_{(x,s)\sim\mathcal{S}}[s\leq f(x)+\tau]d\tau
H(Δ)Δ(H(Δ)H(0))22ρ.\displaystyle\leq H(\Delta)\Delta-\frac{(H(\Delta)-H(0))^{2}}{2\rho}.

Now, we find a lower bound on the area. The area under the curve is minimized when there is a linear increase from H(0)H(0) to H(Δ)H(\Delta) between x=ΔH(Δ)H(0)ρx=\Delta-\frac{H(\Delta)-H(0)}{\rho} and x=Δx=\Delta. The area can be calculated as the sum of the area of the rectangle from x=0x=0 to x=Δx=\Delta and from y=0y=0 to y=H(0)y=H(0) and the area of the triangle.

0ΔPr(x,s)𝒮[sf(x)+τ]𝑑τ\displaystyle\int_{0}^{\Delta}\Pr_{(x,s)\sim\mathcal{S}}[s\leq f(x)+\tau]d\tau
H(0)Δ+H(Δ)H(0)2ρ(H(Δ)H(0))\displaystyle\geq H(0)\Delta+\frac{H(\Delta)-H(0)}{2\rho}\left(H(\Delta)-H(0)\right)

This establishes Lemma A.2. ∎

Returning to the proof of Lemma 3.1, we now apply Lemma A.2 in the two cases Δ0\Delta\geq 0 and Δ<0\Delta<0.

Case (i) Δ0\Delta\geq 0:

We have H𝒮,f(Δ)=qH_{\mathcal{S},f}(\Delta)=q and H𝒮,f(0)=qαH_{\mathcal{S},f}(0)=q-\alpha.

PBq(f)PBq(f)\displaystyle PB_{q}(f^{\prime})-PB_{q}(f) H𝒮,f(Δ)Δ((H𝒮,f(Δ)H𝒮,f(0))22ρ)Δq\displaystyle\leq H_{\mathcal{S},f}(\Delta)\Delta-\left(\frac{(H_{\mathcal{S},f}(\Delta)-H_{\mathcal{S},f}(0))^{2}}{2\rho}\right)-\Delta q
α22ρ.\displaystyle\leq-\frac{\alpha^{2}}{2\rho}.

On the other hand,

PBq(f)PBq(f)\displaystyle PB_{q}(f^{\prime})-PB_{q}(f) H𝒮,f(0)Δ+(H𝒮,f(Δ)H𝒮,f(0))22ρΔq\displaystyle\geq H_{\mathcal{S},f}(0)\Delta+\frac{(H_{\mathcal{S},f}(\Delta)-H_{\mathcal{S},f}(0))^{2}}{2\rho}-\Delta q
=(qα)Δ+α22ρΔq\displaystyle=(q-\alpha)\Delta+\frac{\alpha^{2}}{2\rho}-\Delta q
=αΔ+α22ρ.\displaystyle=-\alpha\Delta+\frac{\alpha^{2}}{2\rho}.
Case (ii) Δ<0\Delta<0:

In this case, writing \Delta^{\prime}=-\Delta, we have H_{\mathcal{S},f}(\Delta)=H_{\mathcal{S},f^{\prime}}(0)=q and H_{\mathcal{S},f}(0)=H_{\mathcal{S},f^{\prime}}(\Delta^{\prime})=q+\alpha. Therefore, we have

PBq(f)PBq(f)\displaystyle PB_{q}(f^{\prime})-PB_{q}(f) =0ΔH𝒮,f(τ)𝑑τqΔ\displaystyle=-\int_{0}^{-\Delta}H_{\mathcal{S},f^{\prime}}(\tau)d\tau-q\Delta
=0ΔH𝒮,f(τ)𝑑τqΔ\displaystyle=-\int_{0}^{\Delta^{\prime}}H_{\mathcal{S},f^{\prime}}(\tau)d\tau-q\Delta
\leq-\left(H_{\mathcal{S},f^{\prime}}(0)\Delta^{\prime}+\frac{(H_{\mathcal{S},f^{\prime}}(\Delta^{\prime})-H_{\mathcal{S},f^{\prime}}(0))^{2}}{2\rho}\right)-q\Delta
α22ρ.\displaystyle\leq-\frac{\alpha^{2}}{2\rho}.

On the other hand, we have

PBq(f)PBq(f)\displaystyle PB_{q}(f^{\prime})-PB_{q}(f) =0ΔH𝒮,f(τ)𝑑τqΔ\displaystyle=-\int_{0}^{-\Delta}H_{\mathcal{S},f^{\prime}}(\tau)d\tau-q\Delta
=0ΔH𝒮,f(τ)𝑑τqΔ\displaystyle=-\int_{0}^{\Delta^{\prime}}H_{\mathcal{S},f^{\prime}}(\tau)d\tau-q\Delta
(H𝒮,f(Δ)Δ((H𝒮,f(Δ)H𝒮,f(0))22ρ))qΔ\displaystyle\geq-\left(H_{\mathcal{S},f^{\prime}}(\Delta^{\prime})\Delta^{\prime}-\left(\frac{(H_{\mathcal{S},f^{\prime}}(\Delta^{\prime})-H_{\mathcal{S},f^{\prime}}(0))^{2}}{2\rho}\right)\right)-q\Delta
=(q+α)Δ+α22ρ\displaystyle=(q+\alpha)\Delta+\frac{\alpha^{2}}{2\rho}
=αΔ+α22ρ.\displaystyle=\alpha\Delta+\frac{\alpha^{2}}{2\rho}.

See 3.2

Proof.

Note that ff's marginal quantile consistency error with respect to the target quantile qq, as measured on 𝒮|B\mathcal{S}|B, satisfies

|Pr(x,s)𝒮|B[sf(x)]q|αPr(x,s)𝒮[xB].\left|\Pr_{(x,s)\sim\mathcal{S}|B}[s\leq f(x)]-q\right|\geq\sqrt{\frac{\alpha}{\Pr_{(x,s)\sim\mathcal{S}}[x\in B]}}.

Also, since 𝒮|B\mathcal{S}|B is continuous, we have

q=Pr(x,s)𝒮|B[sf(x)+Δ].q=\Pr_{(x,s)\sim\mathcal{S}|B}[s\leq f(x)+\Delta].

In other words, ff^{\prime} satisfies marginal quantile consistency for qq as measured on 𝒮|B\mathcal{S}|B.

Applying Lemma 3.1 to 𝒮|B\mathcal{S}|B, we have

PBq𝒮(f)PBq𝒮(f)\displaystyle PB^{\mathcal{S}}_{q}(f^{\prime})-PB^{\mathcal{S}}_{q}(f)
=Pr(x,s)𝒮[xB](PBq𝒮|B(f)PBq𝒮|B(f))\displaystyle=\Pr_{(x,s)\sim\mathcal{S}}[x\in B]\cdot(PB^{\mathcal{S}|B}_{q}(f^{\prime})-PB^{\mathcal{S}|B}_{q}(f))
Pr(x,s)𝒮[xB]α2ρPr(x,s)𝒮[xB]\displaystyle\leq-\Pr_{(x,s)\sim\mathcal{S}}[x\in B]\cdot\frac{\alpha}{2\rho\Pr_{(x,s)\sim\mathcal{S}}[x\in B]}
=α2ρ.\displaystyle=-\frac{\alpha}{2\rho}.

A.2 Missing Details from Section 3.2

See 3.1

Proof.

Suppose \hat{f}(x;\lambda^{*}) is not marginally quantile consistent on some g^{\prime}\in\mathcal{G}; that is, \Pr_{(x,s)\sim\mathcal{S}}[g^{\prime}(x)=1]\cdot(\Pr_{(x,s)\sim\mathcal{S}}[s\leq\hat{f}(x;\lambda^{*})|g^{\prime}(x)=1]-q)^{2}>0. In other words, there exists some \Delta\neq 0 such that

\Pr_{(x,s)\sim\mathcal{S}}[s\leq\hat{f}(x;\lambda^{*})+\Delta|g^{\prime}(x)=1]=q.

Suppose we patch f^(;λ)\hat{f}(\cdot;\lambda^{*})

f=Patch(f^(;λ),g,Δ)f^{\prime}=\mathrm{Patch}(\hat{f}(\cdot;\lambda^{*}),g^{\prime},\Delta)

which can be re-written as

f(x)\displaystyle f^{\prime}(x) =f^(x;λ)+Δ𝟙[g(x)=1]\displaystyle=\hat{f}(x;\lambda^{*})+\Delta\cdot\mathbbm{1}[g^{\prime}(x)=1]
=f(x)+\left(\sum_{g\in\mathcal{G}}\lambda^{*}_{g}\cdot g(x)\right)+\Delta\cdot\mathbbm{1}[g^{\prime}(x)=1]
=f(x)+g𝒢λgg(x)\displaystyle=f(x)+\sum_{g\in\mathcal{G}}\lambda^{\prime}_{g}\cdot g(x)

where λg=λg+Δ\lambda^{\prime}_{g^{\prime}}=\lambda^{*}_{g^{\prime}}+\Delta and λg=λg\lambda^{\prime}_{g}=\lambda^{*}_{g} for all other ggg\neq g^{\prime}.

Lemma 3.2 yields

PBq𝒮(f^(,λ))PBq𝒮(f^(;λ))<0\displaystyle PB^{\mathcal{S}}_{q}(\hat{f}(\cdot,\lambda^{\prime}))-PB^{\mathcal{S}}_{q}(\hat{f}(\cdot;\lambda^{*}))<0

which contradicts the fact that λ\lambda^{*} is an optimal solution to minλ𝔼(x,y)𝒟[Lq(f^(x;λ),y)]\min_{\lambda}\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[L_{q}\left(\hat{f}(x;\lambda),y\right)\right]. ∎
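The re-parameterization used in this argument, namely that patching f̂(·;λ) on a group g′ by Δ is the same as adding Δ to the coordinate λ_{g′}, can be sketched concretely as follows. This is an illustrative sketch only; the group functions, the base model f_0, and all values below are hypothetical and not the paper's implementation.

```python
import numpy as np

# Illustrative sketch of the re-parameterization above: patching f_hat(.; lam) on a
# group g' by Delta equals shifting the coordinate lam[g'] by Delta.
def f_hat(x, f0, groups, lam):
    # f_hat(x; lam) = f0(x) + sum_g lam[g] * g(x), with groups as 0/1 indicator functions
    return f0(x) + sum(lam[j] * g(x) for j, g in enumerate(groups))

def patch(model, group, delta):
    # Patch(model, g', Delta)(x) = model(x) + Delta * 1[g'(x) = 1]
    return lambda x: model(x) + delta * group(x)

# Toy example with hypothetical groups "even" and "multiple of 3" on integer features.
f0 = lambda x: 0.5 * np.ones_like(x, dtype=float)
groups = [lambda x: (x % 2 == 0).astype(float), lambda x: (x % 3 == 0).astype(float)]
lam = np.array([0.1, -0.05])
x = np.arange(12)

patched = patch(lambda z: f_hat(z, f0, groups, lam), groups[0], 0.07)
lam_prime = lam + np.array([0.07, 0.0])          # shift only the patched coordinate
print(np.allclose(patched(x), f_hat(x, f0, groups, lam_prime)))  # True
```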

A.3 Missing Details from Section 3.3

See 3.1

Proof.

Consider any g𝒢g\in\mathcal{G}.

Q(f,g)\displaystyle Q(f,g) =\displaystyle= vR(f)Pr(x,s)𝒮[f(x)=v|g(x)=1](qPr(x,s)𝒮[sf(x)|f(x)=v,g(x)=1])2\displaystyle\sum_{v\in R(f)}\Pr_{(x,s)\sim\mathcal{S}}[f(x)=v|g(x)=1]\left(q-\Pr_{(x,s)\sim\mathcal{S}}[s\leq f(x)|f(x)=v,g(x)=1]\right)^{2}
\displaystyle\leq αPr(x,s)𝒮[g(x)=1]\displaystyle\frac{\alpha}{\Pr_{(x,s)\sim\mathcal{S}}[g(x)=1]}

In particular, since each term in the sum is non-negative, we must have for every vR(f)v\in R(f):

Pr(x,s)𝒮[f(x)=v|g(x)=1](qPr(x,s)𝒮[sf(x)|f(x)=v,g(x)=1])2αPr(x,s)𝒮[g(x)=1]\Pr_{(x,s)\sim\mathcal{S}}[f(x)=v|g(x)=1]\left(q-\Pr_{(x,s)\sim\mathcal{S}}[s\leq f(x)|f(x)=v,g(x)=1]\right)^{2}\leq\frac{\alpha}{\Pr_{(x,s)\sim\mathcal{S}}[g(x)=1]}

Dividing both sides by Pr(x,s)𝒮[f(x)=v|g(x)=1]\Pr_{(x,s)\sim\mathcal{S}}[f(x)=v|g(x)=1] we see that this is equivalent to:

\left(q-\Pr_{(x,s)\sim\mathcal{S}}[s\leq f(x)|f(x)=v,g(x)=1]\right)^{2}\leq\frac{\alpha}{\Pr_{(x,s)\sim\mathcal{S}}[g(x)=1,f(x)=v]}

Taking the square root yields our claim. ∎

See 3.2

Proof.
(1) Marginal quantile consistency error in each round tt:

At any round tt, if ftf_{t} is not α\alpha-approximately quantile multicalibrated with respect to 𝒢\mathcal{G} and qq, we have

\Pr_{(x,s)\sim\mathcal{S}}[x\in B_{t}]\left(q-\Pr_{(x,s)\sim\mathcal{S}}[s\leq f_{t}(x)|x\in B_{t}]\right)^{2}\geq\frac{\alpha}{m+1}\geq\frac{\alpha}{2m},

since the quantile multicalibration error of the violated group is a sum of m+1 bucket terms that exceeds \alpha, so its maximal term, which defines B_{t}, is at least \frac{\alpha}{m+1}\geq\frac{\alpha}{2m} (using m+1\leq 2m).

(2) Decomposing the patch operation into two patch operations:

Write

Δ~t=argminΔ[0,1]|Pr(x,s)𝒮[sft(x)+ΔxBt]q|\tilde{\Delta}_{t}=\mathop{\mathrm{argmin}}_{\Delta\in[0,1]}\left|\Pr_{(x,s)\sim\mathcal{S}}[s\leq f_{t}(x)+\Delta|x\in B_{t}]-q\right|

to denote how much we would have patched ftf_{t} by if we actually optimized over the unit interval. Then, we can divide the patch operation into two where

f~t+1=Patch(ft,Bt,Δ~t)\displaystyle\tilde{f}_{t+1}=\mathrm{Patch}\left(f_{t},B_{t},\tilde{\Delta}_{t}\right)
f_{t+1}=\mathrm{Patch}\left(\tilde{f}_{t+1},B_{t},\Delta_{t}-\tilde{\Delta}_{t}\right).

Now, we try to bound the change in the pinball loss separately:

PBq𝒮(ft+1)PBq𝒮(ft)=(PBq𝒮(ft+1)PBq𝒮(f~t+1))()+(PBq𝒮(f~t+1)PBq𝒮(ft))().\displaystyle PB^{\mathcal{S}}_{q}(f_{t+1})-PB^{\mathcal{S}}_{q}(f_{t})=\underbrace{(PB^{\mathcal{S}}_{q}(f_{t+1})-PB^{\mathcal{S}}_{q}(\tilde{f}_{t+1}))}_{(*)}+\underbrace{(PB^{\mathcal{S}}_{q}(\tilde{f}_{t+1})-PB^{\mathcal{S}}_{q}(f_{t}))}_{(**)}.
(3) Bounding (**):

Lemma 3.2 yields

()=PB𝒮(f~t+1)PB𝒮(ft)α4ρm\displaystyle(**)=PB^{\mathcal{S}}(\tilde{f}_{t+1})-PB^{\mathcal{S}}(f_{t})\leq-\frac{\alpha}{4\rho m}
(4) Bounding (*):

Note that \frac{i}{m}\leq\tilde{\Delta}_{t}\leq\frac{i+1}{m} for some i\in\{0,\dots,m-1\}. Because the function \Delta\to\Pr_{(x,s)\sim\mathcal{S}|B_{t}}[s\leq f_{t}(x)+\Delta] is increasing in \Delta and \tilde{\Delta}_{t} minimizes the quantile consistency error over [0,1], we have

\left|\Pr_{(x,s)\sim\mathcal{S}|B_{t}}\left[s\leq f_{t}(x)+\frac{i}{m}\right]-q\right| \leq\left|\Pr_{(x,s)\sim\mathcal{S}|B_{t}}\left[s\leq f_{t}(x)+\Delta^{\prime}\right]-q\right|\quad\text{for any }\Delta^{\prime}<\frac{i}{m}
\left|\Pr_{(x,s)\sim\mathcal{S}|B_{t}}\left[s\leq f_{t}(x)+\frac{i+1}{m}\right]-q\right| \leq\left|\Pr_{(x,s)\sim\mathcal{S}|B_{t}}\left[s\leq f_{t}(x)+\Delta^{\prime}\right]-q\right|\quad\text{for any }\Delta^{\prime}>\frac{i+1}{m}.

In other words, \Delta_{t}\in\{\frac{i}{m},\frac{i+1}{m}\}, so we have |\Delta_{t}-\tilde{\Delta}_{t}|\leq\frac{1}{m}. Because \mathcal{S}|B_{t} is \rho-Lipschitz and \tilde{f}_{t+1}(x) satisfies the target quantile q for \mathcal{S}|B_{t}, we can bound the marginal quantile consistency error of f_{t+1} against \mathcal{S}|B_{t} as

\left|\Pr_{(x,s)\sim\mathcal{S}|B_{t}}[s\leq f_{t+1}(x)]-q\right|=\left|\Pr_{(x,s)\sim\mathcal{S}|B_{t}}[s\leq\tilde{f}_{t+1}(x)+\Delta_{t}-\tilde{\Delta}_{t}]-\Pr_{(x,s)\sim\mathcal{S}|B_{t}}[s\leq\tilde{f}_{t+1}(x)]\right|\leq\frac{\rho}{m}

as |ΔtΔ~t|1m|\Delta_{t}-\tilde{\Delta}_{t}|\leq\frac{1}{m}.

Note that ft+1(x)=f~t+1(x)f_{t+1}(x)=\tilde{f}_{t+1}(x) for xBtx\not\in B_{t} and ft+1(x)=f~t+1(x)+ΔtΔ~tf_{t+1}(x)=\tilde{f}_{t+1}(x)+\Delta_{t}-\tilde{\Delta}_{t} for xBtx\in B_{t} where |ΔtΔ~t|1m|\Delta_{t}-\tilde{\Delta}_{t}|\leq\frac{1}{m}. Applying Lemma 3.1 with Δ=ΔtΔ~t\Delta=\Delta_{t}-\tilde{\Delta}_{t}, αρm\alpha\leq\frac{\rho}{m}, and (f,f)=(ft+1,f~t+1)(f,f^{\prime})=(f_{t+1},\tilde{f}_{t+1}), we have that

PBq𝒮(ft+1)PBq𝒮(f~t+1)\displaystyle PB^{\mathcal{S}}_{q}(f_{t+1})-PB^{\mathcal{S}}_{q}(\tilde{f}_{t+1}) =Pr(x,s)𝒮[xBt](PB𝒮|Bt(ft+1)PB𝒮|Bt(f~t+1))\displaystyle=\Pr_{(x,s)\sim\mathcal{S}}[x\in B_{t}]\cdot\left(PB^{\mathcal{S}|B_{t}}(f_{t+1})-PB^{\mathcal{S}|B_{t}}(\tilde{f}_{t+1})\right)
Pr(x,s)𝒮[xBt](ρm1m(ρm)212ρ)\displaystyle\leq\Pr_{(x,s)\sim\mathcal{S}}[x\in B_{t}]\cdot\left(\frac{\rho}{m}\frac{1}{m}-\left(\frac{\rho}{m}\right)^{2}\frac{1}{2\rho}\right)
ρm2.\displaystyle\leq\frac{\rho}{m^{2}}.
(5) Combining (*) and (**):

We get

PBq𝒮(ft+1)PBq𝒮(ft)\displaystyle PB^{\mathcal{S}}_{q}(f_{t+1})-PB^{\mathcal{S}}_{q}(f_{t}) α4ρm+ρm2\displaystyle\leq-\frac{\alpha}{4\rho m}+\frac{\rho}{m^{2}}
=α232ρ3.\displaystyle=-\frac{\alpha^{2}}{32\rho^{3}}.
(6) Guarantees:

It directly follows that

PB^{\mathcal{S}}_{q}(f_{T})-PB^{\mathcal{S}}_{q}(f_{0})\leq-T\cdot\frac{\alpha^{2}}{32\rho^{3}}.

Since 0\leq PB^{\mathcal{S}}_{q}(f)\leq 1 for any f:\mathcal{X}\to[0,1], the loss can decrease by \frac{\alpha^{2}}{32\rho^{3}} at most \frac{32\rho^{3}}{\alpha^{2}} times, so f_{T} must be \alpha-approximately quantile multicalibrated with respect to \mathcal{G} and q after at most \frac{32\rho^{3}}{\alpha^{2}} rounds. ∎
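The iterative procedure analyzed above can also be sketched at a high level in code. The following is a schematic, empirical-sample version only: the grid width, halting threshold, toy score distribution, and group collection are illustrative assumptions and do not reproduce the paper's exact algorithm or constants.

```python
import numpy as np

# Schematic sketch of the iterative patching loop analyzed above, run on an
# empirical sample.  Grid width m, threshold alpha, the toy score distribution,
# and the groups are illustrative assumptions only.
rng = np.random.default_rng(1)
n, m, q, alpha = 5_000, 100, 0.9, 0.02
x = rng.integers(1, 100, size=n)
s = rng.beta(2 + (x % 3), 2.0)                           # toy non-conformity scores in [0,1]
groups = [lambda z: z % 2 == 0, lambda z: z % 3 == 0]    # hypothetical group collection
grid = np.concatenate([-np.arange(m + 1), np.arange(m + 1)]) / m

f = np.full(n, 0.5)                                      # f_0: constant initial threshold
for t in range(200):
    worst, B_t = 0.0, None
    for v in np.unique(f):                               # value buckets of the current model
        for g in groups:
            B = (f == v) & g(x)
            if not B.any():
                continue
            err = B.mean() * ((s[B] <= f[B]).mean() - q) ** 2
            if err > worst:
                worst, B_t = err, B
    if worst <= alpha / (m + 1):                         # approximately multicalibrated: halt
        break
    # Patch B_t by the grid shift whose empirical CDF is closest to the target q.
    delta_t = grid[np.argmin([abs((s[B_t] <= f[B_t] + d).mean() - q) for d in grid])]
    f = np.where(B_t, np.clip(f + delta_t, 0.0, 1.0), f)
print("halted after", t + 1, "rounds")
```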

Appendix B Missing Details from Section 4

B.1 On the i.i.d. versus exchangeability assumption

Remark B.1.

In this section, we prove PAC-like generalization theorems for our algorithms under the assumption that the data is drawn i.i.d. from some underlying distribution. Exchangeability is a weaker condition than independence, and requires only that the probability of observing any sequence of data is permutation invariant. De Finetti’s representation theorem [Ressel, 1985] states that any infinite exchangeable sequence of data can be represented as a mixture of constituent distributions, each of which is i.i.d. Thus, our generalization theorems carry over from the i.i.d. setting to the more general exchangeable data setting: we can simply apply our theorems to each (i.i.d.) mixture component in the De Finetti representation of the exchangeable distribution. If for every mixture component, the model we learn has low quantile multicalibration error with probability 1α1-\alpha when the data is drawn from that component, then our models will also have low quantile multicalibration error with probability 1δ1-\delta in expectation over the choice of the mixture component.

B.2 Helpful Concentration Bounds

Theorem B.1 (Additive Chernoff Bound).

Let {Xi}i=1n\{X_{i}\}_{i=1}^{n} be independent random variables bounded such that for each i[n]i\in[n], Xi[0,1]X_{i}\in[0,1]. Let Sn=i=1nXiS_{n}=\sum_{i=1}^{n}X_{i} denote their sum. Then for all ϵ>0\epsilon>0,

\Pr_{\{X_{i}\}_{i=1}^{n}}[|S_{n}-\mathop{\mathbb{E}}[S_{n}]|\geq\epsilon]\leq 2\exp\left(-\frac{2\epsilon^{2}}{n}\right).
Theorem B.2 (Multiplicative Chernoff Bound).

Let {Xi}i=1n\{X_{i}\}_{i=1}^{n} be independent random variables bounded such that for each i[n]i\in[n], Xi[0,1]X_{i}\in[0,1]. Let Sn=i=1nXiS_{n}=\sum_{i=1}^{n}X_{i} denote their sum. Then for all η>0\eta>0,

Pr{Xi}i=1n[|Sn𝔼[Sn]|η𝔼[Sn]]2exp(𝔼[Sn]η23).\Pr_{\{X_{i}\}_{i=1}^{n}}[|S_{n}-\mathop{\mathbb{E}}[S_{n}]|\geq\eta\mathop{\mathbb{E}}[S_{n}]]\leq 2\exp\left(-\frac{\mathop{\mathbb{E}}[S_{n}]\eta^{2}}{3}\right).
Lemma B.1.

Fix any B\subseteq\mathcal{X} and \delta\in(0,1). Given S=\{(x_{i},s_{i})\}_{i=1}^{n}\sim\mathcal{S}^{n}, with probability 1-\delta over the draw of S we have

|1ni=1n𝟙[xiB]Pr(x,s)[xB]|3ln(2δ)Pr𝒮[xB]n.\left|\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}[x_{i}\in B]-\Pr_{(x,s)}[x\in B]\right|\leq\sqrt{\frac{3\ln(\frac{2}{\delta})\Pr_{\mathcal{S}}[x\in B]}{n}}.
Proof.

This is just a direct application of the Chernoff bound (Theorem B.2) where we set η=3ln(2δ)nPr𝒮[xB]\eta=\sqrt{\frac{3\ln(\frac{2}{\delta})}{n\Pr_{\mathcal{S}}[x\in B]}}. ∎

Lemma B.2.

Fix any B\subseteq\mathcal{X}, f:\mathcal{X}\to[0,1], and \delta\in(0,1). Given S=\{(x_{i},s_{i})\}_{i=1}^{n}\sim\mathcal{S}^{n}, with probability 1-\delta over the draw of S we have

\left|\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}[s_{i}\leq f(x_{i}),x_{i}\in B]-\Pr_{(x,s)\sim\mathcal{S}}[s\leq f(x),x\in B]\right|\leq\sqrt{\frac{3\ln(\frac{2}{\delta})\Pr_{\mathcal{S}}[s\leq f(x),x\in B]}{n}}\leq\sqrt{\frac{3\ln(\frac{2}{\delta})\Pr_{\mathcal{S}}[x\in B]}{n}}.
Proof.

This is just a direct application of the Chernoff bound (Theorem B.2) where we set η=3ln(2δ)nPr𝒮[sf(x),xB]\eta=\sqrt{\frac{3\ln(\frac{2}{\delta})}{n\Pr_{\mathcal{S}}[s\leq f(x),x\in B]}}. ∎
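To make the deviation widths in Lemmas B.1 and B.2 concrete, here is a small numeric illustration (not part of the argument); the values of n, δ, and Pr[x∈B] are arbitrary choices.

```python
import numpy as np

# Numeric illustration of the deviation width sqrt(3 ln(2/delta) Pr[x in B] / n)
# from Lemmas B.1/B.2.  The values of n, delta, and p = Pr[x in B] are arbitrary.
rng = np.random.default_rng(2)
n, delta, p = 10_000, 0.05, 0.15
bound = np.sqrt(3 * np.log(2 / delta) * p / n)
# Empirical frequency of x in B over many simulated datasets of size n.
errors = np.abs(rng.binomial(n, p, size=2000) / n - p)
print(f"bound = {bound:.4f}, fraction within bound = {(errors <= bound).mean():.3f}")
# The fraction should be (well) above 1 - delta = 0.95.
```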

Lemma B.3.

Fix any B𝒳B\subseteq\mathcal{X} and f:𝒳[0,1]f:\mathcal{X}\to[0,1]. Suppose 3ln(4δ)nPr[xB]12\sqrt{\frac{3\ln(\frac{4}{\delta})}{n\Pr[x\in B]}}\leq\frac{1}{2}. Given S𝒟nS\sim\mathcal{D}^{n}, we have that with probability 1δ1-\delta

|Pr(x,s)𝒮~|B[sf(x)]Pr(x,s)𝒮|B[sf(x)]|53ln(4δ)nPr𝒮[xB]\left|\Pr_{(x,s)\sim\tilde{\mathcal{S}}|B}[s\leq f(x)]-\Pr_{(x,s)\sim\mathcal{S}|B}[s\leq f(x)]\right|\leq 5\sqrt{\frac{3\ln(\frac{4}{\delta})}{n\Pr_{\mathcal{S}}[x\in B]}}
Proof.

Note that

Pr(x,s)𝒮~|B[sf(x)]=1ni=1n𝟙[sif(xi),xiB]1ni=1n𝟙[xiB].\displaystyle\Pr_{(x,s)\sim\tilde{\mathcal{S}}|B}[s\leq f(x)]=\frac{\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}[s_{i}\leq f(x_{i}),x_{i}\in B]}{\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}[x_{i}\in B]}.

Lemma B.1 and  B.2 together give us that with probability 1δ1-\delta,

|1ni=1n𝟙[sif(xi),xiB]Pr(x,s)[sf(x),xB]|\displaystyle\left|\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}[s_{i}\leq f(x_{i}),x_{i}\in B]-\Pr_{(x,s)}[s\leq f(x),x\in B]\right| 3ln(4δ)Pr𝒮[sf(x),xB]n3ln(4δ)Pr𝒮[xB]n\displaystyle\leq\sqrt{\frac{3\ln(\frac{4}{\delta})\Pr_{\mathcal{S}}[s\leq f(x),x\in B]}{n}}\leq\sqrt{\frac{3\ln(\frac{4}{\delta})\Pr_{\mathcal{S}}[x\in B]}{n}}
|1ni=1n𝟙[xiB]Pr(x,s)[xB]|\displaystyle\left|\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}[x_{i}\in B]-\Pr_{(x,s)}[x\in B]\right| 3ln(4δ)Pr𝒮[xB]n.\displaystyle\leq\sqrt{\frac{3\ln(\frac{4}{\delta})\Pr_{\mathcal{S}}[x\in B]}{n}}.

With ϵ=3ln(4δ)Pr𝒮[xB]n\epsilon=\sqrt{\frac{3\ln(\frac{4}{\delta})\Pr_{\mathcal{S}}[x\in B]}{n}}, we can show that

Pr(x,s)𝒮~|B[sf(x)]\displaystyle\Pr_{(x,s)\sim\tilde{\mathcal{S}}|B}[s\leq f(x)]
=1ni=1n𝟙[sif(xi),xiB]1ni=1n𝟙[xiB]\displaystyle=\frac{\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}[s_{i}\leq f(x_{i}),x_{i}\in B]}{\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}[x_{i}\in B]}
Pr𝒮[sf(x),xB]+ϵPr𝒮[xB]ϵ\displaystyle\leq\frac{\Pr_{\mathcal{S}}[s\leq f(x),x\in B]+\epsilon}{\Pr_{\mathcal{S}}[x\in B]-\epsilon}
=Pr𝒮[sf(x),xB]+ϵPr𝒮[xB](1ϵPr𝒮[xB])\displaystyle=\frac{\Pr_{\mathcal{S}}[s\leq f(x),x\in B]+\epsilon}{\Pr_{\mathcal{S}}[x\in B]\left(1-\frac{\epsilon}{\Pr_{\mathcal{S}}[x\in B]}\right)}
()(1+2ϵPr𝒮[xB])(Pr𝒮[sf(x),xB]+ϵPr𝒮[xB])\displaystyle\underbrace{\leq}_{(*)}\left(1+\frac{2\epsilon}{\Pr_{\mathcal{S}}[x\in B]}\right)\left(\frac{\Pr_{\mathcal{S}}[s\leq f(x),x\in B]+\epsilon}{\Pr_{\mathcal{S}}[x\in B]}\right)
Pr𝒮|B[sf(x)]+ϵPr𝒮[xB]+2ϵPr𝒮[xB]+2ϵ2Pr𝒮[xB]2\displaystyle\leq\Pr_{\mathcal{S}|B}[s\leq f(x)]+\frac{\epsilon}{\Pr_{\mathcal{S}}[x\in B]}+\frac{2\epsilon}{\Pr_{\mathcal{S}}[x\in B]}+\frac{2\epsilon^{2}}{\Pr_{\mathcal{S}}[x\in B]^{2}}
Pr𝒮|B[sf(x)]+33ln(4δ)nPr𝒮[xB]+23ln(4δ)nPr𝒮[xB]\displaystyle\leq\Pr_{\mathcal{S}|B}[s\leq f(x)]+3\sqrt{\frac{3\ln(\frac{4}{\delta})}{n\Pr_{\mathcal{S}}[x\in B]}}+2\frac{3\ln(\frac{4}{\delta})}{n\Pr_{\mathcal{S}}[x\in B]}
()Pr𝒮|B[sf(x)]+53ln(4δ)nPr𝒮[xB]\displaystyle\underbrace{\leq}_{(**)}\Pr_{\mathcal{S}|B}[s\leq f(x)]+5\sqrt{\frac{3\ln(\frac{4}{\delta})}{n\Pr_{\mathcal{S}}[x\in B]}}

where for (*) we rely on the assumption that \sqrt{\frac{3\ln(\frac{4}{\delta})}{n\Pr_{\mathcal{S}}[x\in B]}}\leq\frac{1}{2} to apply the inequality 1/(1-x)\leq 1+2x for 0\leq x\leq 1/2, and for (**) we rely on \frac{3\ln(\frac{4}{\delta})}{n\Pr_{\mathcal{S}}[x\in B]}\leq 1 to get \sqrt{\frac{3\ln(\frac{4}{\delta})}{n\Pr_{\mathcal{S}}[x\in B]}}\geq\frac{3\ln(\frac{4}{\delta})}{n\Pr_{\mathcal{S}}[x\in B]}. The matching lower bound on \Pr_{(x,s)\sim\tilde{\mathcal{S}}|B}[s\leq f(x)] follows symmetrically, which proves the claim. ∎

Lemma B.4.

Fix any B𝒳B\subseteq\mathcal{X} and f:𝒳[0,1]f:\mathcal{X}\to[0,1]. Suppose 3ln(4δ)nPr[xB]12\sqrt{\frac{3\ln(\frac{4}{\delta})}{n\Pr[x\in B]}}\leq\frac{1}{2}. Given S𝒟nS\sim\mathcal{D}^{n}, we have that with probability 1δ1-\delta

|(qPr(x,s)𝒮|B[sf(x)])2(qPr(x,s)𝒮~|B[sf(x)])2|203ln(4δ)nPr𝒮[xB]\left|\left(q-\Pr_{(x,s)\sim\mathcal{S}|B}[s\leq f(x)]\right)^{2}-\left(q-\Pr_{(x,s)\sim\tilde{\mathcal{S}}|B}[s\leq f(x)]\right)^{2}\right|\leq 20\sqrt{\frac{3\ln(\frac{4}{\delta})}{n\Pr_{\mathcal{S}}[x\in B]}}

for any q[0,1]q\in[0,1].

Proof.
\left|\left(q-\Pr_{(x,s)\sim\mathcal{S}|B}[s\leq f(x)]\right)^{2}-\left(q-\Pr_{(x,s)\sim\tilde{\mathcal{S}}|B}[s\leq f(x)]\right)^{2}\right|
=|2q(Pr𝒮~|B[sf(x)]Pr𝒮|B[sf(x)])+(Pr𝒮|B[sf(x)]2Pr𝒮~|B[sf(x)]2)|\displaystyle=\left|2q\left(\Pr_{\tilde{\mathcal{S}}|B}[s\leq f(x)]-\Pr_{\mathcal{S}|B}[s\leq f(x)]\right)+\left(\Pr_{\mathcal{S}|B}[s\leq f(x)]^{2}-\Pr_{\tilde{\mathcal{S}}|B}[s\leq f(x)]^{2}\right)\right|
2|Pr𝒮|B[sf(x)]Pr𝒮~|B[sf(x)]|+|(Pr𝒮|B[sf(x)]Pr𝒮~|B[sf(x)])(Pr𝒮|B[sf(x)]+Pr𝒮~|B[sf(x)])|\displaystyle\leq 2\left|\Pr_{\mathcal{S}|B}[s\leq f(x)]-\Pr_{\tilde{\mathcal{S}}|B}[s\leq f(x)]\right|+\left|\left(\Pr_{\mathcal{S}|B}[s\leq f(x)]-\Pr_{\tilde{\mathcal{S}}|B}[s\leq f(x)]\right)\left(\Pr_{\mathcal{S}|B}[s\leq f(x)]+\Pr_{\tilde{\mathcal{S}}|B}[s\leq f(x)]\right)\right|
4|Pr𝒮|B[sf(x)]Pr𝒮~|B[sf(x)]|\displaystyle\leq 4\left|\Pr_{\mathcal{S}|B}[s\leq f(x)]-\Pr_{\tilde{\mathcal{S}}|B}[s\leq f(x)]\right|
203ln(4δ)nPr𝒮[xB].\displaystyle\leq 20\sqrt{\frac{3\ln(\frac{4}{\delta})}{n\Pr_{\mathcal{S}}[x\in B]}}.

where we use Lemma B.3 for the last inequality. ∎

Lemma B.5.

Fix B𝒳,v[1m]B\subseteq\mathcal{X},v\in[\frac{1}{m}], g𝒢g\in\mathcal{G}, and f:𝒳[0,1]f:\mathcal{X}\to[0,1]. Given S𝒟nS\sim\mathcal{D}^{n}, we have with probability 1δ1-\delta

|Pr(x,s)𝒮[xB](qPr(x,s)𝒮|B[sf(x)])2Pr(x,s)𝒮~[xB](qPr(x,s)𝒮~|B[sf(x)])2|\displaystyle\left|\Pr_{(x,s)\sim\mathcal{S}}[x\in B]\left(q-\Pr_{(x,s)\sim\mathcal{S}|B}[s\leq f(x)]\right)^{2}-\Pr_{(x,s)\sim\tilde{\mathcal{S}}}[x\in B]\left(q-\Pr_{(x,s)\sim\tilde{\mathcal{S}}|B}[s\leq f(x)]\right)^{2}\right|
213ln(8δ)Pr𝒮[xB]n+12ln(8δ)n.\displaystyle\leq 21\sqrt{\frac{3\ln(\frac{8}{\delta})\Pr_{\mathcal{S}}[x\in B]}{n}}+\frac{12\ln(\frac{8}{\delta})}{n}.
Proof.

First, suppose \sqrt{\frac{3\ln(\frac{8}{\delta})}{n\Pr_{\mathcal{S}}[x\in B]}}>\frac{1}{2}. Then, with probability 1-\delta,

Pr(x,s)𝒮~[B](qPr(x,s)𝒮~|B[sf(x)])2\displaystyle\Pr_{(x,s)\sim\tilde{\mathcal{S}}}[B]\left(q-\Pr_{(x,s)\sim\tilde{\mathcal{S}}|B}[s\leq f(x)]\right)^{2}
(Pr(x,s)𝒮[B]+3ln(2δ)Pr𝒮[xB]n)(qPr(x,s)𝒮~|B[sf(x)])2\displaystyle\leq\left(\Pr_{(x,s)\sim\mathcal{S}}[B]+\sqrt{\frac{3\ln(\frac{2}{\delta})\Pr_{\mathcal{S}}[x\in B]}{n}}\right)\left(q-\Pr_{(x,s)\sim\tilde{\mathcal{S}}|B}[s\leq f(x)]\right)^{2} Lemma B.1
Pr(x,s)𝒮[B]+3ln(8δ)Pr𝒮[xB]n\displaystyle\leq\Pr_{(x,s)\sim\mathcal{S}}[B]+\sqrt{\frac{3\ln(\frac{8}{\delta})\Pr_{\mathcal{S}}[x\in B]}{n}} (qPr(x,s)𝒮~|B[sf(x)])21\left(q-\Pr_{(x,s)\sim\tilde{\mathcal{S}}|B}[s\leq f(x)]\right)^{2}\leq 1
12ln(8δ)n+3ln(8δ)Pr𝒮[xB]n\displaystyle\leq\frac{12\ln(\frac{8}{\delta})}{n}+\sqrt{\frac{3\ln(\frac{8}{\delta})\Pr_{\mathcal{S}}[x\in B]}{n}} 3ln(8δ)nPr𝒮[xB]>12.\displaystyle\sqrt{\frac{3\ln(\frac{8}{\delta})}{n\Pr_{\mathcal{S}}[x\in B]}}>\frac{1}{2}.

On the other hand, we have

Pr(x,s)𝒮[B](qPr(x,s)𝒮|B[sf(x)])2Pr(x,s)𝒮[B]12ln(8δ)n.\displaystyle\Pr_{(x,s)\sim\mathcal{S}}[B]\left(q-\Pr_{(x,s)\sim\mathcal{S}|B}[s\leq f(x)]\right)^{2}\leq\Pr_{(x,s)\sim\mathcal{S}}[B]\leq\frac{12\ln(\frac{8}{\delta})}{n}.

Because |ab|max(a,b)|a-b|\leq\max(a,b) for a,b[0,1]a,b\in[0,1], we have

|Pr(x,s)𝒮[B](qPr(x,s)𝒮|B[sf(x)])2Pr(x,s)𝒮~[B](qPr(x,s)𝒮~|B[sf(x)])2|\displaystyle\left|\Pr_{(x,s)\sim\mathcal{S}}[B]\left(q-\Pr_{(x,s)\sim\mathcal{S}|B}[s\leq f(x)]\right)^{2}-\Pr_{(x,s)\sim\tilde{\mathcal{S}}}[B]\left(q-\Pr_{(x,s)\sim\tilde{\mathcal{S}}|B}[s\leq f(x)]\right)^{2}\right|
3ln(8δ)Pr𝒮[xB]n+12ln(8δ)n.\displaystyle\leq\sqrt{\frac{3\ln(\frac{8}{\delta})\Pr_{\mathcal{S}}[x\in B]}{n}}+\frac{12\ln(\frac{8}{\delta})}{n}.

Now, suppose otherwise: 3ln(8δ)nPr[xB]12\sqrt{\frac{3\ln(\frac{8}{\delta})}{n\Pr[x\in B]}}\leq\frac{1}{2}. Then, Lemma B.1 and B.4 promise us that with probability 1δ1-\delta,

Pr(x,s)𝒮~[B](qPr(x,s)𝒮~|B[sf(x)])2\displaystyle\Pr_{(x,s)\sim\tilde{\mathcal{S}}}[B]\left(q-\Pr_{(x,s)\sim\tilde{\mathcal{S}}|B}[s\leq f(x)]\right)^{2}
(Pr(x,s)𝒮[B]+ϵ1)((qPr(x,s)𝒮|B[sf(x)])2+ϵ2)\displaystyle\leq\left(\Pr_{(x,s)\sim\mathcal{S}}[B]+\epsilon_{1}\right)\left(\left(q-\Pr_{(x,s)\sim\mathcal{S}|B}[s\leq f(x)]\right)^{2}+\epsilon_{2}\right)
Pr(x,s)𝒮[B](qPr(x,s)𝒮|B[sf(x)])2+Pr(x,s)𝒮[B]ϵ2+ϵ1+ϵ1ϵ2\displaystyle\leq\Pr_{(x,s)\sim\mathcal{S}}[B]\left(q-\Pr_{(x,s)\sim\mathcal{S}|B}[s\leq f(x)]\right)^{2}+\Pr_{(x,s)\sim\mathcal{S}}[B]\epsilon_{2}+\epsilon_{1}+\epsilon_{1}\epsilon_{2}

where \epsilon_{1}=\sqrt{\frac{3\ln(\frac{4}{\delta})\Pr_{\mathcal{S}}[x\in B]}{n}} and \epsilon_{2}=20\sqrt{\frac{3\ln(\frac{8}{\delta})}{n\Pr_{\mathcal{S}}[x\in B]}}.

We can show that

Pr(x,s)𝒮[B]ϵ2+ϵ1+ϵ1ϵ2\displaystyle\Pr_{(x,s)\sim\mathcal{S}}[B]\epsilon_{2}+\epsilon_{1}+\epsilon_{1}\epsilon_{2}
213ln(8δ)Pr𝒮[xB]n+3ln(8δ)n.\displaystyle\leq 21\sqrt{\frac{3\ln(\frac{8}{\delta})\Pr_{\mathcal{S}}[x\in B]}{n}}+\frac{3\ln(\frac{8}{\delta})}{n}.

The opposite direction works the same way. ∎

B.3 Out of sample guarantees for BatchGCP

Theorem B.3.

Suppose 𝒮\mathcal{S} is ρ\rho-Lipschitz, and write f^(;λ)=BatchGCP(f,𝒢,q,𝒟)\hat{f}(\cdot;\lambda^{*})=\texttt{BatchGCP}(f,\mathcal{G},q,\mathcal{D}) and f^(;λD)=BatchGCP(f,𝒢,q,D)\hat{f}(\cdot;\lambda^{*}_{D})=\texttt{BatchGCP}(f,\mathcal{G},q,D) given some D={(xi,yi)}i=1nD=\{(x_{i},y_{i})\}_{i=1}^{n} drawn from 𝒟\mathcal{D}. Then we have with probability 1δ1-\delta over the randomness of drawing DD from 𝒟\mathcal{D}

\left|\Pr_{(x,y)\sim\mathcal{D}}[y\in\mathcal{T}^{\hat{f}(\cdot,\lambda^{*}_{D})}(x)\,|\,g(x)=1]-q\right|\leq\sqrt{\frac{\alpha^{\prime}}{\Pr[g(x)=1]}}\quad\text{for every }g\in\mathcal{G}

where 𝒯f^(,λD)(x)={y:s(x,y)f^(x;λD)}\mathcal{T}^{\hat{f}(\cdot,\lambda^{*}_{D})}(x)=\{y:s(x,y)\leq\hat{f}(x;\lambda^{*}_{D})\}, q=max(q,1q)q^{\prime}=\max(q,1-q), b=max(λ2,λD2)b=\lceil\max(||\lambda^{*}||_{2},||\lambda^{*}_{D}||_{2})\rceil, and α=24ρbqln(π2b23δ)+|𝒢|ln(1+2n)2n\alpha^{\prime}=24\rho bq^{\prime}\sqrt{\frac{\ln(\frac{\pi^{2}b^{2}}{3\delta})+|\mathcal{G}|\ln(1+2n)}{2n}}.

Proof.

Fix some BB\in\mathbb{N}. For any fixed λ\lambda where λ2B||\lambda||_{2}\leq B, the Chernoff Bound (Theorem B.1) with rescaling gives us that with probability 1δ1-\delta over the randomness of drawing D={(xi,yi)}i=1nD=\{(x_{i},y_{i})\}_{i=1}^{n} from 𝒟\mathcal{D},

|1ni=1nLq(f^(xi;λ),yi)𝔼(x,y)𝒟[Lq(f^(x;λ),y)]|4qBln(2δ)2n\displaystyle\left|\frac{1}{n}\sum_{i=1}^{n}L_{q}(\hat{f}(x_{i};\lambda),y_{i})-\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[L_{q}(\hat{f}(x;\lambda),y)]\right|\leq 4q^{\prime}B\sqrt{\frac{\ln(\frac{2}{\delta})}{2n}} (1)

as λLq(f^(x,λ),y)\lambda\to L_{q}(\hat{f}(x,\lambda),y) is qq^{\prime}-Lipschitz.

We now show how to union bound the above concentration over all λ\lambda where λ2B||\lambda||_{2}\leq B. To do so, we first create a finite ϵ\epsilon-net for a ball of radius BB and union-bound over each λ\lambda in the net. Then we can argue that due to the Lipschitzness of the pinball loss LqL_{q}, the empirical pinball loss with respect to each λ\lambda that is in the ball must be concentrated around its distributional loss.

We first provide the standard \epsilon-net argument:

Lemma B.6.

Fix \epsilon>0 and B\in\mathbb{N}. There exists a set R_{B}=\{\lambda_{1},\dots,\lambda_{k_{B}}\}, where k_{B}\leq\left(1+\frac{2B}{\epsilon}\right)^{|\mathcal{G}|}, such that the following holds: for every \lambda with ||\lambda||_{2}\leq B, there exists j\in[k_{B}] such that ||\lambda-\lambda_{j}||_{2}\leq\epsilon.

Proof.

For simplicity, write PB={λ|𝒢|:λ2B}P_{B}=\{\lambda\in\mathbb{R}^{|\mathcal{G}|}:||\lambda||_{2}\leq B\}.

We say that a set R|𝒢|R\subseteq\mathbb{R}^{|\mathcal{G}|} is an ϵ\epsilon-net of PBP_{B} if for every λPB\lambda\in P_{B}, there exists λR\lambda^{\prime}\in R such that λλ2ϵ||\lambda-\lambda^{\prime}||_{2}\leq\epsilon.

Without loss of generality, we may take B=1, since an \frac{\epsilon}{B}-cover of P_{1} can be scaled up to an \epsilon-cover of P_{B}: given R_{1} which is an \frac{\epsilon}{B}-cover of P_{1}, the set R_{B}=\{B\cdot\lambda:\lambda\in R_{1}\} is an \epsilon-cover of P_{B}.

Now, choose R to be a maximal \epsilon-separated subset of P_{1}: for every pair of distinct u,v\in R, ||u-v||_{2}\geq\epsilon, and no set R^{\prime} with R\subset R^{\prime} has this property.

By maximality, R must be an \epsilon-cover for P_{1}. Otherwise, there would exist some point u\in P_{1} such that ||u-v||_{2}>\epsilon for every v\in R; but then R\cup\{u\} would still be an \epsilon-separated subset of P_{1}, contradicting the maximality of R.

By the \epsilon-separation of R, the balls of radius \frac{\epsilon}{2} centered at the points of R are pairwise disjoint, so the sum of their volumes equals the volume of their union. On the other hand, all of these balls lie in the ball P_{1+\frac{\epsilon}{2}} of radius 1+\frac{\epsilon}{2}. Therefore, we have

vol(Pϵ2)|R|vol(P1+ϵ2).\displaystyle vol(P_{\frac{\epsilon}{2}})\cdot|R|\leq vol(P_{1+\frac{\epsilon}{2}}).

Since vol(Pc)=c|𝒢|vol(P1)vol(P_{c})=c^{|\mathcal{G}|}\cdot vol(P_{1}), we have that

|R|(1+ϵ2)|𝒢|(ϵ2)|𝒢|=(1+2ϵ)|𝒢|.\displaystyle|R|\leq\frac{(1+\frac{\epsilon}{2})^{|\mathcal{G}|}}{(\frac{\epsilon}{2})^{|\mathcal{G}|}}=\left(1+\frac{2}{\epsilon}\right)^{|\mathcal{G}|}.
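The maximal ε-separated set in this argument can be built greedily; the sketch below is illustrative only (the dimension, ε, and the candidate sample are arbitrary choices) and compares the resulting size to the volume bound with |𝒢| replaced by the dimension d.

```python
import numpy as np

# Greedy construction of a maximal eps-separated subset of the unit ball, as in
# Lemma B.6's argument.  Illustrative sketch only: d, eps, and the candidate
# pool are arbitrary choices.
rng = np.random.default_rng(3)
d, eps = 2, 0.25
pts = rng.uniform(-1.0, 1.0, size=(20_000, d))
pts = pts[np.linalg.norm(pts, axis=1) <= 1.0]            # candidates inside the unit ball

net = []
for u in pts:
    if all(np.linalg.norm(u - v) >= eps for v in net):   # keep u only if eps-far from the net
        net.append(u)

print(len(net), "centers;  volume bound:", (1 + 2 / eps) ** d)   # |R| <= (1 + 2/eps)^d = 81
```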

By union-bounding inequality (1) over RBR_{B}, we have that with probability 1δ1-\delta

maxj[kB]|1ni=1nLq(f^(xi;λj),yi)𝔼(x,y)𝒟[Lq(f^(x;λj),y)]|4qBln(2kBδ)2n.\displaystyle\max_{j\in[k_{B}]}\left|\frac{1}{n}\sum_{i=1}^{n}L_{q}(\hat{f}(x_{i};\lambda_{j}),y_{i})-\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[L_{q}(\hat{f}(x;\lambda_{j}),y)]\right|\leq 4q^{\prime}B\sqrt{\frac{\ln(\frac{2k_{B}}{\delta})}{2n}}.

Using qq^{\prime}-Lipschitzness of LqL_{q}, we can further show that for any λ\lambda where λ2B||\lambda||_{2}\leq B, we have

|1ni=1nLq(f^(xi;λ),yi)𝔼(x,y)𝒟[Lq(f^(x;λ),y)]|\displaystyle\left|\frac{1}{n}\sum_{i=1}^{n}L_{q}(\hat{f}(x_{i};\lambda),y_{i})-\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[L_{q}(\hat{f}(x;\lambda),y)]\right|
=\Bigg{|}\frac{1}{n}\sum_{i=1}^{n}L_{q}(\hat{f}(x_{i};\lambda),y_{i})-\frac{1}{n}\sum_{i=1}^{n}L_{q}(\hat{f}(x_{i};\lambda_{j}),y_{i})+\frac{1}{n}\sum_{i=1}^{n}L_{q}(\hat{f}(x_{i};\lambda_{j}),y_{i})
-\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[L_{q}(\hat{f}(x;\lambda_{j}),y)]+\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[L_{q}(\hat{f}(x;\lambda_{j}),y)]-\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[L_{q}(\hat{f}(x;\lambda),y)]\Bigg{|}
\leq\Bigg{|}\frac{1}{n}\sum_{i=1}^{n}L_{q}(\hat{f}(x_{i};\lambda),y_{i})-\frac{1}{n}\sum_{i=1}^{n}L_{q}(\hat{f}(x_{i};\lambda_{j}),y_{i})\Bigg{|}
+\Bigg{|}\frac{1}{n}\sum_{i=1}^{n}L_{q}(\hat{f}(x_{i};\lambda_{j}),y_{i})-\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[L_{q}(\hat{f}(x;\lambda_{j}),y)]\Bigg{|}
+\Bigg{|}\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[L_{q}(\hat{f}(x;\lambda_{j}),y)]-\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[L_{q}(\hat{f}(x;\lambda),y)]\Bigg{|}
2ϵq+4qBln(2δ)+ln(kB)2n\displaystyle\leq 2\epsilon q^{\prime}+4q^{\prime}B\sqrt{\frac{\ln(\frac{2}{\delta})+\ln(k_{B})}{2n}}
=2ϵq+4qBln(2δ)+|𝒢|ln(1+2Bϵ)2n\displaystyle=2\epsilon q^{\prime}+4q^{\prime}B\sqrt{\frac{\ln(\frac{2}{\delta})+|\mathcal{G}|\ln(1+\frac{2B}{\epsilon})}{2n}}

where jj is chosen such that λλj2ϵ||\lambda-\lambda_{j}||_{2}\leq\epsilon and we can find such jj because RBR_{B} is an ϵ\epsilon-net for the ball of radius BB.

Setting ϵ=Bn\epsilon=\frac{B}{n} yields

2Bqn+4qBln(2δ)+|𝒢|ln(1+2n)2n6Bqln(2δ)+|𝒢|ln(1+2n)2n\displaystyle\leq\frac{2Bq^{\prime}}{n}+4q^{\prime}B\sqrt{\frac{\ln(\frac{2}{\delta})+|\mathcal{G}|\ln(1+2n)}{2n}}\leq 6Bq^{\prime}\sqrt{\frac{\ln(\frac{2}{\delta})+|\mathcal{G}|\ln(1+2n)}{2n}}

for sufficiently large nn.

We set δb=6δπ2b2\delta_{b}=\frac{6\delta}{\pi^{2}b^{2}} so that b=1δb=δ\sum_{b=1}^{\infty}\delta_{b}=\delta. In other words, with probability 1δ1-\delta, we have simultaneously over all bb\in\mathbb{N} and λ\lambda where λ2b||\lambda||_{2}\leq b

|1ni=1nLq(f^(xi;λ),yi)𝔼(x,y)𝒟[Lq(f^(x;λ),y)]|6bqln(π2b23δ)+|𝒢|ln(1+2n)2n.\displaystyle\left|\frac{1}{n}\sum_{i=1}^{n}L_{q}(\hat{f}(x_{i};\lambda),y_{i})-\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[L_{q}(\hat{f}(x;\lambda),y)]\right|\leq 6bq^{\prime}\sqrt{\frac{\ln(\frac{\pi^{2}b^{2}}{3\delta})+|\mathcal{G}|\ln(1+2n)}{2n}}.

In other words, the final output λD\lambda^{*}_{D} from BatchGCP and λ\lambda^{*} which is the optimal solution with respect to the true distribution 𝒟\mathcal{D} must be such that

|1ni=1nLq(f^(xi;λD),yi)𝔼(x,y)𝒟[Lq(f^(x;λD),y)]|6bqln(π2b23δ)+|𝒢|ln(1+2n)2n\displaystyle\left|\frac{1}{n}\sum_{i=1}^{n}L_{q}(\hat{f}(x_{i};\lambda^{*}_{D}),y_{i})-\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[L_{q}(\hat{f}(x;\lambda^{*}_{D}),y)]\right|\leq 6bq^{\prime}\sqrt{\frac{\ln(\frac{\pi^{2}b^{2}}{3\delta})+|\mathcal{G}|\ln(1+2n)}{2n}}
|1ni=1nLq(f^(xi;λ),yi)𝔼(x,y)𝒟[Lq(f^(x;λ),y)]|6bqln(π2b23δ)+|𝒢|ln(1+2n)2n.\displaystyle\left|\frac{1}{n}\sum_{i=1}^{n}L_{q}(\hat{f}(x_{i};\lambda^{*}),y_{i})-\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[L_{q}(\hat{f}(x;\lambda^{*}),y)]\right|\leq 6bq^{\prime}\sqrt{\frac{\ln(\frac{\pi^{2}b^{2}}{3\delta})+|\mathcal{G}|\ln(1+2n)}{2n}}.

where b=max(λ2,λD2)b=\lceil\max(||\lambda^{*}||_{2},||\lambda^{*}_{D}||_{2})\rceil. In other words,

𝔼(x,y)𝒟[Lq(f^(x;λD),y)]𝔼(x,y)𝒟[Lq(f^(x;λ),y)]12bqln(π2b23δ)+|𝒢|ln(1+2n)2n.\displaystyle\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[L_{q}(\hat{f}(x;\lambda^{*}_{D}),y)]-\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[L_{q}(\hat{f}(x;\lambda^{*}),y)]\leq 12bq^{\prime}\sqrt{\frac{\ln(\frac{\pi^{2}b^{2}}{3\delta})+|\mathcal{G}|\ln(1+2n)}{2n}}.

Now, for the sake of contradiction, suppose that there exists some g𝒢g\in\mathcal{G} such that

Pr(x,s)𝒮[g(x)=1](qPr(x,s)𝒮[sf^(x;λD)|g(x)=1])2>α\Pr_{(x,s)\sim\mathcal{S}}[g(x)=1]\cdot\left(q-\Pr_{(x,s)\sim\mathcal{S}}[s\leq\hat{f}(x;\lambda^{*}_{D})|g(x)=1]\right)^{2}>\alpha^{\prime}

where α=24ρbqln(π2b23δ)+|𝒢|ln(1+2n)2n\alpha^{\prime}=24\rho bq^{\prime}\sqrt{\frac{\ln(\frac{\pi^{2}b^{2}}{3\delta})+|\mathcal{G}|\ln(1+2n)}{2n}}.

Then, Lemma 3.2 tells us that we can decrease the true pinball loss of \hat{f}(\cdot;\lambda^{*}_{D}) by more than \frac{\alpha^{\prime}}{2\rho}=12bq^{\prime}\sqrt{\frac{\ln(\frac{\pi^{2}b^{2}}{3\delta})+|\mathcal{G}|\ln(1+2n)}{2n}} by patching on B=\{x:g(x)=1\}. However, this cannot be the case, as it would mean that there exists \lambda^{\prime} such that

𝔼(x,y)𝒟[Lq(f^(x;λ),y)]<𝔼(x,y)𝒟[Lq(f^(x;λ),y)]\displaystyle\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[L_{q}(\hat{f}(x;\lambda^{\prime}),y)]<\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[L_{q}(\hat{f}(x;\lambda^{*}),y)]

where \lambda^{\prime}_{h}=\lambda^{*}_{D,h} for every h\in\mathcal{G} with h\neq g, and \lambda^{\prime}_{g} is chosen to satisfy

q=\Pr_{(x,y)\sim\mathcal{D}}[s(x,y)\leq\hat{f}(x;\lambda^{\prime})\,|\,g(x)=1].

This is a contradiction as we have already defined

λ=argminλ𝔼(x,y)𝒟[Lq(f^(x;λ),y)].\lambda^{*}=\arg\min_{\lambda}\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[L_{q}(\hat{f}(x;\lambda),y)].

B.4 Out of sample guarantees for BatchMVP

B.4.1 Out-of-sample quantile multicalibration bound for fixed TT

See 4.1

Proof.

For each tt\in\mathbb{N}, define δt=δ6π21t2\delta_{t}=\delta\cdot\frac{6}{\pi^{2}}\cdot\frac{1}{t^{2}}. Note that

t=1δt=δ6π2t=11t2=δ\displaystyle\sum_{t=1}^{\infty}\delta_{t}=\delta\frac{6}{\pi^{2}}\sum_{t=1}^{\infty}\frac{1}{t^{2}}=\delta

as t=11t2=π26\sum_{t=1}^{\infty}\frac{1}{t^{2}}=\frac{\pi^{2}}{6}.

Fix any t\in\mathbb{N} and \delta. Union-bounding Lemma B.5 over g\in\mathcal{G} and f\in\mathcal{C}_{t}, we have that for any g\in\mathcal{G}, with probability 1-\delta,

vR(ft)Pr(x,s)𝒮[ft(x)=v,g(x)=1](qPr(x,s)𝒮[sft(x)|ft(x)=v,g(x)=1])2\displaystyle\sum_{v\in R(f_{t})}\Pr_{(x,s)\sim\mathcal{S}}[f_{t}(x)=v,g(x)=1]\left(q-\Pr_{(x,s)\sim\mathcal{S}}[s\leq f_{t}(x)|f_{t}(x)=v,g(x)=1]\right)^{2}
vR(ft)Pr(x,s)𝒮~[ft(x)=v,g(x)=1](qPr(x,s)𝒮~[sft(x)|ft(x)=v,g(x)=1])2\displaystyle\leq\sum_{v\in R(f_{t})}\Pr_{(x,s)\sim\tilde{\mathcal{S}}}[f_{t}(x)=v,g(x)=1]\left(q-\Pr_{(x,s)\sim\tilde{\mathcal{S}}}[s\leq f_{t}(x)|f_{t}(x)=v,g(x)=1]\right)^{2}
+vR(ft)213(ln(8δ)+tln(4m2|𝒢|))Pr𝒮[ft(x)=v,g(x)=1]n+12(ln(8δ)+tln(4m2|𝒢|))n().\displaystyle+\underbrace{\sum_{v\in R(f_{t})}21\sqrt{\frac{3\left(\ln(\frac{8}{\delta})+t\ln(4m^{2}|\mathcal{G}|)\right)\Pr_{\mathcal{S}}[f_{t}(x)=v,g(x)=1]}{n}}+\frac{12(\ln(\frac{8}{\delta})+t\ln(4m^{2}|\mathcal{G}|))}{n}}_{(*)}.

We can further bound ()(*) as

()\displaystyle(*) vR(ft)213(ln(8δ)+tln(4m2|𝒢|))Pr𝒮[ft(x)=v,g(x)=1]n+2m12(ln(8δ)+tln(4m2|𝒢|))n\displaystyle\leq\sum_{v\in R(f_{t})}21\sqrt{\frac{3\left(\ln(\frac{8}{\delta})+t\ln(4m^{2}|\mathcal{G}|)\right)\Pr_{\mathcal{S}}[f_{t}(x)=v,g(x)=1]}{n}}+2m\frac{12(\ln(\frac{8}{\delta})+t\ln(4m^{2}|\mathcal{G}|))}{n}
\underbrace{\leq}_{(**)}21\sqrt{\frac{3\left(\ln(\frac{8}{\delta})+t\ln(4m^{2}|\mathcal{G}|)\right)\frac{\Pr_{\mathcal{S}}[g(x)=1]}{|R(f_{t})|}}{n}|R(f_{t})|^{2}}+\frac{12\rho^{2}(\ln(\frac{8}{\delta})+t\ln(4m^{2}|\mathcal{G}|))}{\alpha n}
\leq 21\sqrt{\frac{3\left(\ln(\frac{8}{\delta})+t\ln(4m^{2}|\mathcal{G}|)\right)|R(f_{t})|}{n}}+\frac{12\rho^{2}(\ln(\frac{8}{\delta})+t\ln(4m^{2}|\mathcal{G}|))}{\alpha n}
=213ρ2(ln(8δ)+tln(4m2|𝒢|))4αn+12ρ2(ln(8δ)+tln(4m2|𝒢|))αn\displaystyle=21\sqrt{\frac{3\rho^{2}\left(\ln(\frac{8}{\delta})+t\ln(4m^{2}|\mathcal{G}|)\right)}{4\alpha n}}+\frac{12\rho^{2}(\ln(\frac{8}{\delta})+t\ln(4m^{2}|\mathcal{G}|))}{\alpha n}
=213ρ2(ln(8δ)+tln(ρ4|𝒢|α2))2αn+12ρ2(ln(8δ)+tln(ρ4|𝒢|α2))αn\displaystyle=21\sqrt{\frac{3\rho^{2}\left(\ln(\frac{8}{\delta})+t\ln(\frac{\rho^{4}|\mathcal{G}|}{\alpha^{2}})\right)}{2\alpha n}}+\frac{12\rho^{2}(\ln(\frac{8}{\delta})+t\ln(\frac{\rho^{4}|\mathcal{G}|}{\alpha^{2}}))}{\alpha n}

where we used |R(ft)|=m+12m|R(f_{t})|=m+1\leq 2m and m=ρ22αm=\frac{\rho^{2}}{2\alpha}. Inequality ()(**) follows from the fact that h(z)=3t(ln(8δ)+tln(4m2|𝒢|))znh(z)=\sqrt{\frac{3t\left(\ln(\frac{8}{\delta})+t\ln(4m^{2}|\mathcal{G}|)\right)\cdot z}{n}} is concave and so the optimization problem i=1|R(ft)|h(zi)\sum_{i=1}^{|R(f_{t})|}h(z_{i}) where i=1|R(ft)|zi=Pr𝒮(g(x)=1)\sum_{i=1}^{|R(f_{t})|}z_{i}=\Pr_{\mathcal{S}}(g(x)=1) is maximized at zi=Pr𝒮(g(x)=1)|R(ft)|z_{i}=\frac{\Pr_{\mathcal{S}}(g(x)=1)}{|R(f_{t})|} for each i[|R(ft)|]i\in[|R(f_{t})|].

In other words, with probability 1δ=1t=1δt1-\delta=1-\sum_{t=1}^{\infty}\delta_{t}, we simultaneously have for every tt\in\mathbb{N}

vR(ft)Pr(x,s)𝒮[ft(x)=v,g(x)=1](qPr(x,s)𝒮[sft(x)|ft(x)=v,g(x)=1])2\displaystyle\sum_{v\in R(f_{t})}\Pr_{(x,s)\sim\mathcal{S}}[f_{t}(x)=v,g(x)=1]\left(q-\Pr_{(x,s)\sim\mathcal{S}}[s\leq f_{t}(x)|f_{t}(x)=v,g(x)=1]\right)^{2}
vR(ft)Pr(x,s)𝒮~[ft(x)=v,g(x)=1](qPr(x,s)𝒮~[sft(x)|ft(x)=v,g(x)=1])2\displaystyle\leq\sum_{v\in R(f_{t})}\Pr_{(x,s)\sim\tilde{\mathcal{S}}}[f_{t}(x)=v,g(x)=1]\left(q-\Pr_{(x,s)\sim\tilde{\mathcal{S}}}[s\leq f_{t}(x)|f_{t}(x)=v,g(x)=1]\right)^{2}
+213ρ2(ln(8δt)+tln(ρ4|𝒢|α2))2αn+12ρ2(ln(8δt)+tln(ρ4|𝒢|α2))αn\displaystyle+21\sqrt{\frac{3\rho^{2}\left(\ln(\frac{8}{\delta_{t}})+t\ln(\frac{\rho^{4}|\mathcal{G}|}{\alpha^{2}})\right)}{2\alpha n}}+\frac{12\rho^{2}(\ln(\frac{8}{\delta_{t}})+t\ln(\frac{\rho^{4}|\mathcal{G}|}{\alpha^{2}}))}{\alpha n}
vR(ft)Pr(x,s)𝒮~[ft(x)=v,g(x)=1](qPr(x,s)𝒮~[sft(x)|ft(x)=v,g(x)=1])2\displaystyle\leq\sum_{v\in R(f_{t})}\Pr_{(x,s)\sim\tilde{\mathcal{S}}}[f_{t}(x)=v,g(x)=1]\left(q-\Pr_{(x,s)\sim\tilde{\mathcal{S}}}[s\leq f_{t}(x)|f_{t}(x)=v,g(x)=1]\right)^{2}
+21\sqrt{\frac{3\rho^{2}\left(\ln(\frac{4\pi^{2}t^{2}}{3\delta})+t\ln(\frac{\rho^{4}|\mathcal{G}|}{\alpha^{2}})\right)}{2\alpha n}}+\frac{12\rho^{2}\left(\ln(\frac{4\pi^{2}t^{2}}{3\delta})+t\ln(\frac{\rho^{4}|\mathcal{G}|}{\alpha^{2}})\right)}{\alpha n}.

Finally, when our algorithm halts at round TT, fTf_{T} is α\alpha-approximately multicalibrated with respect to its empirical distribution 𝒮~\tilde{\mathcal{S}}. Therefore, we have

vR(fT)Pr(x,s)𝒮[fT(x)=v,g(x)=1](qPr(x,s)𝒮[sfT(x)|fT(x)=v,g(x)=1])2\displaystyle\sum_{v\in R(f_{T})}\Pr_{(x,s)\sim\mathcal{S}}[f_{T}(x)=v,g(x)=1]\left(q-\Pr_{(x,s)\sim\mathcal{S}}[s\leq f_{T}(x)|f_{T}(x)=v,g(x)=1]\right)^{2}
\leq\alpha+21\sqrt{\frac{3\rho^{2}\left(\ln(\frac{4\pi^{2}T^{2}}{3\delta})+T\ln(\frac{\rho^{4}|\mathcal{G}|}{\alpha^{2}})\right)}{2\alpha n}}+\frac{12\rho^{2}\left(\ln(\frac{4\pi^{2}T^{2}}{3\delta})+T\ln(\frac{\rho^{4}|\mathcal{G}|}{\alpha^{2}})\right)}{\alpha n}.\qquad ∎

B.4.2 Sample complexity of maintaining convergence speed of BatchMVP

We write 𝒞0={f0}\mathcal{C}_{0}=\{f_{0}\} and

\mathcal{C}_{t+1}=\left\{f_{t+1}:\ f_{t+1}(x)=\mathrm{Patch}(f_{t},B(v,g),\Delta)\ \text{where}\ B(v,g)=\{x:f_{t}(x)=v,\ g(x)=1\},\ \text{for all}\ f_{t}\in\mathcal{C}_{t},\ v\in\left[\tfrac{1}{m}\right],\ g\in\mathcal{G},\ \Delta\in\left[\tfrac{1}{m}\right]\right\}

to denote the set of all possible models ftf_{t} we could obtain at round tt of Algorithm 2 regardless of what dataset S={(xi,si)}i=1nS=\{(x_{i},s_{i})\}_{i=1}^{n} is used as input.

Lemma B.7.

Fixing the initial model f0f_{0}, the number of distinct models ftf_{t} that can arise at round tt of Algorithm 2 (quantified over all possible input datasets SS) is upper bounded by:

|𝒞t|=((m+1)2|𝒢|)t(4m2|𝒢|)t.|\mathcal{C}_{t}|=((m+1)^{2}|\mathcal{G}|)^{t}\leq(4m^{2}|\mathcal{G}|)^{t}.
Proof.

The proof is by induction on tt\in\mathbb{N}. By construction we have |𝒞0|=1|\mathcal{C}_{0}|=1. Now, note that |𝒞t+1|=(m+1)2|𝒢||𝒞t||\mathcal{C}_{t+1}|=(m+1)^{2}|\mathcal{G}||\mathcal{C}_{t}| because we consider all possible combinations of ft𝒞t,v[1m],g𝒢,f_{t}\in\mathcal{C}_{t},v\in\left[\frac{1}{m}\right],g\in\mathcal{G}, and Δ[1m]\Delta\in\left[\frac{1}{m}\right]. Therefore, if |𝒞t|=((m+1)2|𝒢|)t|\mathcal{C}_{t}|=((m+1)^{2}|\mathcal{G}|)^{t} for some tt, then |𝒞t+1|=((m+1)2|𝒢|)t+1|\mathcal{C}_{t+1}|=((m+1)^{2}|\mathcal{G}|)^{t+1}. ∎

See 4.2

Proof.

Fix any round tt. Because ftf_{t} is not α\alpha-approximately quantile multicalibrated with respect to 𝒢\mathcal{G} and qq on 𝒮~\tilde{\mathcal{S}}, we have

\Pr_{(x,s)\sim\tilde{\mathcal{S}}}[x\in B_{t}]\left(q-\Pr_{(x,s)\sim\tilde{\mathcal{S}}}[s\leq f_{t}(x)|x\in B_{t}]\right)^{2}\geq\frac{\alpha}{m+1}\geq\frac{\alpha}{2m}.

Now, the patch operation in this round can be decomposed into the following three steps:

vt\displaystyle v_{t} vt+Δwhere q=Pr(x,s)𝒮|Bt[sft(x)+Δ]\displaystyle\to v_{t}+\Delta^{*}\quad\text{where $q=\Pr_{(x,s)\sim\mathcal{S}|B_{t}}[s\leq f_{t}(x)+\Delta^{*}]$}
vt+Δ\displaystyle v_{t}+\Delta^{*} vt+Δ~where Δ~=argminv[1/m]|vΔ|\displaystyle\to v_{t}+\tilde{\Delta}^{*}\quad\text{where $\tilde{\Delta}^{*}=\arg\min_{v\in[1/m]}|v-\Delta^{*}|$}
vt+Δ~\displaystyle v_{t}+\tilde{\Delta}^{*} vt+Δt.\displaystyle\to v_{t}+\Delta_{t}.

For convenience, we write

ft(x)\displaystyle f^{*}_{t}(x) =ft(x)+Δ𝟙[xBt]\displaystyle=f_{t}(x)+\Delta^{*}\cdot\mathbbm{1}[x\in B_{t}]
f~t(x)\displaystyle\tilde{f}^{*}_{t}(x) =ft(x)+Δ~𝟙[xBt]\displaystyle=f_{t}(x)+\tilde{\Delta}^{*}\cdot\mathbbm{1}[x\in B_{t}]
ft+1(x)\displaystyle f_{t+1}(x) =ft(x)+Δt𝟙[xBt]\displaystyle=f_{t}(x)+\Delta_{t}\cdot\mathbbm{1}[x\in B_{t}]

Now, we show that we decrease the empirical pinball loss PB𝒮~PB^{\tilde{\mathcal{S}}} as we go from ftf_{t} to ft+1f_{t+1} in each round tt with high probability.

(1) ftftf_{t}\to f^{*}_{t}:

First, we can show progress in terms of the pinball loss on 𝒮\mathcal{S} and then by a Chernoff bound, we can show that we have made significant progress on 𝒮~\tilde{\mathcal{S}} with high probability. More specifically, we must have decreased the pinball loss with respect to 𝒮|Bt\mathcal{S}|B_{t} by going from vtv_{t} to vt+Δv_{t}+\Delta^{*}. Note that Δ\Delta^{*} is chosen to satisfy the target quantile qq with respect to 𝒮|Bt\mathcal{S}|B_{t} and 𝒮\mathcal{S} is ρ\rho-Lipschitz.

Because the empirical quantile error was significant for (vt,gt)(v_{t},g_{t}), we can show that the true quantile error must have been significant. By union bounding Lemma B.5 over all f𝒞tf\in\mathcal{C}_{t} and using Lemma B.7 to bound the cardinality of 𝒞t\mathcal{C}_{t}, we have with probability 1δ1-\delta

|Pr(x,s)𝒮[Bt](qPr(x,s)𝒮|Bt[sft(x)])2Pr(x,s)𝒮~[Bt](qPr(x,s)𝒮~|Bt[sft(x)])2|\displaystyle\left|\Pr_{(x,s)\sim\mathcal{S}}[B_{t}]\left(q-\Pr_{(x,s)\sim\mathcal{S}|B_{t}}[s\leq f_{t}(x)]\right)^{2}-\Pr_{(x,s)\sim\tilde{\mathcal{S}}}[B_{t}]\left(q-\Pr_{(x,s)\sim\tilde{\mathcal{S}}|B_{t}}[s\leq f_{t}(x)]\right)^{2}\right|
213ln(8|𝒞t|δ)Pr𝒮[xB]n+12ln(8|𝒞t|δ)n\displaystyle\leq 21\sqrt{\frac{3\ln(\frac{8|\mathcal{C}_{t}|}{\delta})\Pr_{\mathcal{S}}[x\in B]}{n}}+\frac{12\ln(\frac{8|\mathcal{C}_{t}|}{\delta})}{n}
213(ln(8δ)+tln(4m2|𝒢|))Pr𝒮[xB]n+12(ln(8δ)+tln(4m2|𝒢|))n\displaystyle\leq 21\sqrt{\frac{3\left(\ln(\frac{8}{\delta})+t\ln(4m^{2}|\mathcal{G}|)\right)\Pr_{\mathcal{S}}[x\in B]}{n}}+\frac{12\left(\ln(\frac{8}{\delta})+t\ln(4m^{2}|\mathcal{G}|)\right)}{n}
2212(ln(8δ)+tln(4m2|𝒢|))n\displaystyle\leq 22\sqrt{\frac{12\left(\ln(\frac{8}{\delta})+t\ln(4m^{2}|\mathcal{G}|)\right)}{n}}

as long n12(ln(8δ)+tln(4m2|𝒢|))n\geq 12\left(\ln(\frac{8}{\delta})+t\ln(4m^{2}|\mathcal{G}|)\right). In other words, we have

Pr(x,s)𝒮[Bt](qPr(x,s)𝒮|Bt[sft(x)])2α2m2212(ln(8δ)+tln(4m2|𝒢|))n.\Pr_{(x,s)\sim\mathcal{S}}[B_{t}]\left(q-\Pr_{(x,s)\sim\mathcal{S}|B_{t}}[s\leq f_{t}(x)]\right)^{2}\geq\frac{\alpha}{2m}-22\sqrt{\frac{12\left(\ln(\frac{8}{\delta})+t\ln(4m^{2}|\mathcal{G}|)\right)}{n}}.

If n92928m2(ln(8δ)+tln(4m2|𝒢|))α2n\geq\frac{92928m^{2}\left(\ln(\frac{8}{\delta})+t\ln(4m^{2}|\mathcal{G}|)\right)}{\alpha^{2}}, then we have

Pr(x,s)𝒮[Bt](qPr(x,s)𝒮|Bt[sft(x)])2α4m.\Pr_{(x,s)\sim\mathcal{S}}[B_{t}]\left(q-\Pr_{(x,s)\sim\mathcal{S}|B_{t}}[s\leq f_{t}(x)]\right)^{2}\geq\frac{\alpha}{4m}.

As ftf^{*}_{t} achieves the target quantile qq against 𝒮|Bt\mathcal{S}|B_{t} and the quantile error was at least α4m\frac{\alpha}{4m}, applying Lemma 3.2 yields

PB𝒮(ft)PB𝒮(ft)α8mρ.\displaystyle PB^{\mathcal{S}}(f^{*}_{t})-PB^{\mathcal{S}}(f_{t})\leq-\frac{\alpha}{8m\rho}.
(2) ftf~tf^{*}_{t}\to\tilde{f}^{*}_{t}:

Recall that Δ~\tilde{\Delta}^{*} results from rounding Δ\Delta^{*} to the nearest grid point in [1m][\frac{1}{m}]. Because 𝒮|Bt\mathcal{S}|B_{t} is ρ\rho-Lipschitz and ft()+Δf_{t}(\cdot)+\Delta^{*} satisfies the target quantile qq for 𝒮|Bt\mathcal{S}|B_{t}, we can bound the marginal quantile consistency error of ft()+Δ~f_{t}(\cdot)+\tilde{\Delta}^{*} against 𝒮|Bt\mathcal{S}|B_{t} as

\left|\Pr_{(x,s)\sim\mathcal{S}|B_{t}}[s\leq f_{t}(x)+\tilde{\Delta}^{*}]-q\right|=\left|\Pr_{(x,s)\sim\mathcal{S}|B_{t}}[s\leq f_{t}(x)+\tilde{\Delta}^{*}]-\Pr_{(x,s)\sim\mathcal{S}|B_{t}}[s\leq f_{t}(x)+\Delta^{*}]\right|\leq\frac{\rho}{2m}

as |ΔΔ~|12m|\Delta^{*}-\tilde{\Delta}^{*}|\leq\frac{1}{2m}.

Note that f^{*}_{t}(x)=\tilde{f}^{*}_{t}(x) for x\not\in B_{t} and f^{*}_{t}(x)=\tilde{f}^{*}_{t}(x)+(\Delta^{*}-\tilde{\Delta}^{*}) for x\in B_{t}, where |\tilde{\Delta}^{*}-\Delta^{*}|\leq\frac{1}{2m}. Applying Lemma 3.1 with \Delta=\tilde{\Delta}^{*}-\Delta^{*}, \alpha\leq\frac{\rho}{2m}, and (f,f^{\prime})=(\tilde{f}^{*}_{t},f^{*}_{t}), we have that

PBq𝒮(f~t)PBq𝒮(ft)\displaystyle PB^{\mathcal{S}}_{q}(\tilde{f}^{*}_{t})-PB^{\mathcal{S}}_{q}(f^{*}_{t}) =Pr(x,s)𝒮[xBt](PB𝒮|Bt(f~t)PB𝒮|Bt(ft))\displaystyle=\Pr_{(x,s)\sim\mathcal{S}}[x\in B_{t}]\cdot\left(PB^{\mathcal{S}|B_{t}}(\tilde{f}^{*}_{t})-PB^{\mathcal{S}|B_{t}}(f^{*}_{t})\right)
Pr(x,s)𝒮[xBt](ρ2m12m(ρ2m)212ρ)\displaystyle\leq\Pr_{(x,s)\sim\mathcal{S}}[x\in B_{t}]\cdot\left(\frac{\rho}{2m}\frac{1}{2m}-\left(\frac{\rho}{2m}\right)^{2}\frac{1}{2\rho}\right)
ρ8m2.\displaystyle\leq\frac{\rho}{8m^{2}}.

We have so far shown that with probability 1δ1-\delta,

PB𝒮(ft+1)PB𝒮(ft)\displaystyle PB^{\mathcal{S}}(f_{t+1})-PB^{\mathcal{S}}(f_{t})
=(PB𝒮(ft+1)PB𝒮(f~t))+(PB𝒮(f~t)PB𝒮(ft))+(PB𝒮(ft)PB𝒮(ft))\displaystyle=\left(PB^{\mathcal{S}}(f_{t+1})-PB^{\mathcal{S}}(\tilde{f}^{*}_{t})\right)+\left(PB^{\mathcal{S}}(\tilde{f}^{*}_{t})-PB^{\mathcal{S}}(f^{*}_{t})\right)+\left(PB^{\mathcal{S}}(f^{*}_{t})-PB^{\mathcal{S}}(f_{t})\right)
(PB𝒮(ft+1)PB𝒮(f~t))+ρ8m2α8mρ\displaystyle\leq\left(PB^{\mathcal{S}}(f_{t+1})-PB^{\mathcal{S}}(\tilde{f}^{*}_{t})\right)+\frac{\rho}{8m^{2}}-\frac{\alpha}{8m\rho}

By union bounding the Chernoff bound (Theorem B.1) over \mathcal{C}_{t+1}, we can show that PB^{\tilde{\mathcal{S}}}(f) concentrates around PB^{\mathcal{S}}(f) for every f\in\mathcal{C}_{t+1}: with probability 1-\delta, we simultaneously have

|PB𝒮~(ft)PB𝒮(ft)|\displaystyle\left|PB^{\tilde{\mathcal{S}}}(f_{t})-PB^{\mathcal{S}}(f_{t})\right| ln(2δ)+tln(4m2|𝒢|)2n\displaystyle\leq\sqrt{\frac{\ln(\frac{2}{\delta})+t\ln(4m^{2}|\mathcal{G}|)}{2n}}
|PB𝒮~(f~t)PB𝒮(f~t)|\displaystyle\left|PB^{\tilde{\mathcal{S}}}(\tilde{f}^{*}_{t})-PB^{\mathcal{S}}(\tilde{f}^{*}_{t})\right| ln(2δ)+tln(4m2|𝒢|)2n\displaystyle\leq\sqrt{\frac{\ln(\frac{2}{\delta})+t\ln(4m^{2}|\mathcal{G}|)}{2n}}
|PB𝒮~(ft+1)PB𝒮(ft+1)|\displaystyle\left|PB^{\tilde{\mathcal{S}}}(f_{t+1})-PB^{\mathcal{S}}(f_{t+1})\right| ln(2δ)+tln(4m2|𝒢|)2n\displaystyle\leq\sqrt{\frac{\ln(\frac{2}{\delta})+t\ln(4m^{2}|\mathcal{G}|)}{2n}}

as ft,f~t,ft+1𝒞t+1f_{t},\tilde{f}^{*}_{t},f_{t+1}\in\mathcal{C}_{t+1}.

In other words, we have with probability 12δ1-2\delta

PB𝒮~(ft+1)PB𝒮~(ft)\displaystyle PB^{\tilde{\mathcal{S}}}(f_{t+1})-PB^{\tilde{\mathcal{S}}}(f_{t})
\leq\left(PB^{\tilde{\mathcal{S}}}(f_{t+1})-PB^{\tilde{\mathcal{S}}}(\tilde{f}^{*}_{t})\right)+\frac{\rho}{8m^{2}}-\frac{\alpha}{8m\rho}+4\sqrt{\frac{\ln(\frac{2}{\delta})+t\ln(4m^{2}|\mathcal{G}|)}{2n}}.
(3) f~tft+1\tilde{f}_{t}^{*}\to f_{t+1}:

Because \Delta_{t} is chosen among the grid points \left[\frac{1}{m}\right] to minimize the empirical quantile consistency error on B_{t}, we can show that the pinball loss against the empirical distribution \tilde{\mathcal{S}} is no larger for f_{t+1} than for \tilde{f}^{*}_{t}.

We can calculate the derivative of the pinball loss with respect to the patch Δ\Delta as

ddΔPB𝒮~|Bt(ft()+Δ)\displaystyle\frac{d}{d\Delta}PB^{\tilde{\mathcal{S}}|B_{t}}(f_{t}(\cdot)+\Delta)
=d𝔼(x,s)𝒮~|Bt[Lq(ft(x)+Δ,s)]dΔ\displaystyle=\frac{d\mathop{\mathbb{E}}_{(x,s)\sim\tilde{\mathcal{S}}|B_{t}}[L_{q}(f_{t}(x)+\Delta,s)]}{d\Delta}
=1|Bt|ddΔi:xiBtLq(ft(xi)+Δ,si)\displaystyle=\frac{1}{|B_{t}|}\frac{d}{d\Delta}\sum_{i:x_{i}\in B_{t}}L_{q}(f_{t}(x_{i})+\Delta,s_{i})
=1|Bt|(i:xiBt,sift(xi)+ΔddΔLq(ft(xi)+Δ,si)+i:xiBt,si>ft(xi)+ΔddΔLq(ft(xi)+Δ,si))\displaystyle=\frac{1}{|B_{t}|}\left(\sum_{i:x_{i}\in B_{t},s_{i}\leq f_{t}(x_{i})+\Delta}\frac{d}{d\Delta}L_{q}(f_{t}(x_{i})+\Delta,s_{i})+\sum_{i:x_{i}\in B_{t},s_{i}>f_{t}(x_{i})+\Delta}\frac{d}{d\Delta}L_{q}(f_{t}(x_{i})+\Delta,s_{i})\right)
=1|Bt|(i:xiBt,sift(xi)+Δ(1q)i:xiBt,si>ft(xi)+Δq)\displaystyle=\frac{1}{|B_{t}|}\left(\sum_{i:x_{i}\in B_{t},s_{i}\leq f_{t}(x_{i})+\Delta}(1-q)-\sum_{i:x_{i}\in B_{t},s_{i}>f_{t}(x_{i})+\Delta}q\right)
=1|Bt||{i:xiBt,sift(xi)+Δ}|q\displaystyle=\frac{1}{|B_{t}|}\left|\{i:x_{i}\in B_{t},s_{i}\leq f_{t}(x_{i})+\Delta\}\right|-q
=Pr(x,s)𝒮~|Bt[sft(x)+Δ]q.\displaystyle=\Pr_{(x,s)\sim\tilde{\mathcal{S}}|B_{t}}[s\leq f_{t}(x)+\Delta]-q.

Because PB^{\tilde{\mathcal{S}}|B_{t}}(f_{t}(\cdot)+\Delta) is convex in \Delta with the derivative computed above, minimizing it over the grid amounts to choosing the grid point at which the absolute value of this derivative, \left|\Pr_{(x,s)\sim\tilde{\mathcal{S}}|B_{t}}[s\leq f_{t}(x)+\Delta]-q\right|, is smallest, which is exactly how \Delta_{t} is set. Therefore, PB^{\tilde{\mathcal{S}}|B_{t}}(f_{t}(\cdot)+\Delta) is minimized over the grid at \Delta_{t}. Hence, we have with probability 1-2\delta (\delta to argue that there is significant quantile consistency error on B_{t} with respect to \mathcal{S}, and \delta to argue that the empirical pinball loss concentrates around its expectation),

PB𝒮~(ft+1)PB𝒮~(ft)\displaystyle PB^{\tilde{\mathcal{S}}}(f_{t+1})-PB^{\tilde{\mathcal{S}}}(f_{t})
(PB𝒮~(ft+1)PB𝒮~(f~t))+ρ8m2α8mρ+4ln(2δ)+tln(4m2|𝒢|)2n\displaystyle\leq\left(PB^{\tilde{\mathcal{S}}}(f_{t+1})-PB^{\tilde{\mathcal{S}}}(\tilde{f}^{*}_{t})\right)+\frac{\rho}{8m^{2}}-\frac{\alpha}{8m\rho}+4\sqrt{\frac{\ln(\frac{2}{\delta})+t\ln(4m^{2}|\mathcal{G}|)}{2n}}
ρ8m2α8mρ+4ln(2δ)+tln(4m2|𝒢|)2n\displaystyle\leq\frac{\rho}{8m^{2}}-\frac{\alpha}{8m\rho}+4\sqrt{\frac{\ln(\frac{2}{\delta})+t\ln(4m^{2}|\mathcal{G}|)}{2n}}
α24ρ3+4ln(2δ)+tln(4m2|𝒢|)2n\displaystyle\leq\frac{-\alpha^{2}}{4\rho^{3}}+4\sqrt{\frac{\ln(\frac{2}{\delta})+t\ln(4m^{2}|\mathcal{G}|)}{2n}}

as we have chosen m=ρ22αm=\frac{\rho^{2}}{2\alpha}.

If n512ρ6(ln(2δ)+tln(4m2|𝒢|))α4n\geq\frac{512\rho^{6}\left(\ln(\frac{2}{\delta})+t\ln(4m^{2}|\mathcal{G}|)\right)}{\alpha^{4}}, we have

PB𝒮~(ft+1)PB𝒮~(ft)α28ρ3.PB^{\tilde{\mathcal{S}}}(f_{t+1})-PB^{\tilde{\mathcal{S}}}(f_{t})\leq-\frac{\alpha^{2}}{8\rho^{3}}.

Because we decrease the empirical pinball loss by at least \frac{\alpha^{2}}{8\rho^{3}} in each round with probability 1-2\delta, we have that with probability 1-2\delta\cdot\frac{8\rho^{3}}{\alpha^{2}}, the algorithm halts within T=\frac{8\rho^{3}}{\alpha^{2}} rounds. If

n92928(ln(8δ)+8ρ3α2ln(4m2|𝒢|))max(m2α2,ρ6α4),\displaystyle n\geq 92928\left(\ln\left(\frac{8}{\delta}\right)+\frac{8\rho^{3}}{\alpha^{2}}\ln(4m^{2}|\mathcal{G}|)\right)\max\left(\frac{m^{2}}{\alpha^{2}},\frac{\rho^{6}}{\alpha^{4}}\right),

we satisfy all the requirements that we stated previously for n in each round t\in[T].

If we set \delta^{\prime}=\frac{16\rho^{3}}{\alpha^{2}}\delta, then with probability 1-\delta^{\prime} the algorithm halts within T=\frac{8\rho^{3}}{\alpha^{2}} rounds, provided that

n\displaystyle n 92928(ln(128ρ3α2δ)+8ρ3α2ln(4m2|𝒢|))max(m2α2,ρ6α4)\displaystyle\geq 92928\left(\ln\left(\frac{128\rho^{3}}{\alpha^{2}\delta^{\prime}}\right)+\frac{8\rho^{3}}{\alpha^{2}}\ln(4m^{2}|\mathcal{G}|)\right)\max\left(\frac{m^{2}}{\alpha^{2}},\frac{\rho^{6}}{\alpha^{4}}\right)
=92928(ln(128ρ3α2δ)+8ρ3α2ln(ρ4|𝒢|α2))max(ρ44α4,ρ6α4)\displaystyle=92928\left(\ln\left(\frac{128\rho^{3}}{\alpha^{2}\delta^{\prime}}\right)+\frac{8\rho^{3}}{\alpha^{2}}\ln\left(\frac{\rho^{4}|\mathcal{G}|}{\alpha^{2}}\right)\right)\max\left(\frac{\rho^{4}}{4\alpha^{4}},\frac{\rho^{6}}{\alpha^{4}}\right)
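The way Δ_t is selected in each round (by minimizing |empirical CDF − q| over the grid, as justified by the derivative computation above, rather than minimizing the empirical pinball loss directly) can be sanity-checked numerically. The sketch below uses a toy score sample and an arbitrary f_t, so it is illustrative only.

```python
import numpy as np

# Numeric check (illustrative only) that the grid point minimizing the empirical
# quantile inconsistency |CDF - q| on B_t also minimizes the empirical pinball
# loss there, as argued via the derivative computation above.
rng = np.random.default_rng(4)
q, m = 0.9, 50
s = rng.beta(2, 5, size=5000)                 # toy scores of the points falling in B_t
f_t = 0.3                                     # current (constant) threshold on B_t
grid = np.arange(m + 1) / m

def emp_pinball(v, s, q):
    return np.where(s <= v, (1 - q) * (v - s), q * (s - v)).mean()

by_cdf = grid[np.argmin([abs((s <= f_t + d).mean() - q) for d in grid])]
by_loss = grid[np.argmin([emp_pinball(f_t + d, s, q) for d in grid])]
print(by_cdf, by_loss)   # the two choices coincide (up to ties at adjacent grid points)
```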

Appendix C Additional Experiments and Discussion

C.1 A Direct Test of Group Conditional Quantile Consistency and Quantile Multicalibration

In this section we abstract away the non-conformity score and directly evaluate the ability of BatchGCP and BatchMVP to offer group-conditional and multivalid quantile consistency guarantees. We produce a synthetic regression dataset defined to have a set of intersecting groups that are all relevant to label uncertainty. Specifically, the data \{(x_{i},y_{i})\}_{i=1}^{10000}\in(\mathbb{Z}_{+}\times\mathbb{R})^{10000} is generated as follows. First, we define our group collection as \mathcal{G}=\{g_{1},\ldots,g_{15}\}, where for each j=1,\ldots,15, the group g_{j}=\{j,2j,3j,\ldots\}\subset\mathbb{Z}_{+} contains all multiples of j. Note that group g_{1} encompasses the entire covariate space, ensuring that our methods will explicitly enforce marginal coverage in addition to group-wise coverage. Each x_{i} is a random integer sampled uniformly from the range [1,5000). We let the corresponding label be y_{i}=\frac{|y_{i}^{\prime}|}{|y_{i}^{\prime}|+1}\in[0,1], where y_{i}^{\prime} is distributed as the sum of n(x_{i}) i.i.d. \mathcal{N}(0,1) random variables and n(x_{i}) is the number of groups that x_{i} belongs to. After generating our data, we split it into 80\% training data \mathcal{D}_{train} and 20\% test data \mathcal{D}_{test}. We then run both methods on the group collection \mathcal{G}, for target coverage q=0.9, with m=100 buckets.
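The following sketch reproduces this data-generation process; variable names and the random seed are our own choices, and this is not the repository's exact script.

```python
import numpy as np

# Sketch of the synthetic data-generation process described above.
rng = np.random.default_rng(0)
n_samples, n_groups = 10_000, 15

x = rng.integers(1, 5000, size=n_samples)                      # covariates: integers in [1, 5000)
membership = np.stack([x % j == 0 for j in range(1, n_groups + 1)], axis=1)
n_x = membership.sum(axis=1)                                    # n(x_i): number of groups containing x_i

y_prime = np.array([rng.normal(size=k).sum() for k in n_x])     # sum of n(x_i) i.i.d. N(0,1) draws
y = np.abs(y_prime) / (np.abs(y_prime) + 1.0)                   # labels in [0, 1]

# 80% / 20% train-test split, as in the experiment.
perm = rng.permutation(n_samples)
train_idx, test_idx = perm[:8000], perm[8000:]
```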

The group coverage obtained by both methods on a sample run can be seen in Figure 6. Both methods generally achieve the target coverage level on every group, with BatchMVP exhibiting less variance in the attained coverage levels across different runs. The group-wise quantile calibration errors of both methods are presented in Figure 7. Both BatchMVP and BatchGCP are very well calibrated on all groups, with calibration errors on the order of $10^{-3}$, even though, unlike BatchMVP, BatchGCP does not enforce this constraint explicitly.
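As a concrete illustration, per-group empirical coverage of a set of predicted thresholds can be checked along the following lines. This is a minimal sketch; the absolute deviation from the target level is used here only as a simple stand-in for the quantile calibration error reported in the figures:

```python
import numpy as np

def per_group_coverage(x, scores, thresholds, group_ids=range(1, 16), q=0.9):
    """Empirical coverage per group, and its deviation from the target q."""
    x = np.asarray(x)
    scores = np.asarray(scores)
    thresholds = np.asarray(thresholds)
    report = {}
    for j in group_ids:
        mask = (x % j == 0)  # membership in group g_j (multiples of j)
        if not mask.any():
            continue
        coverage = float(np.mean(scores[mask] <= thresholds[mask]))
        report[j] = {"coverage": coverage, "deviation": abs(coverage - q)}
    return report
```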

Figure 6: Per-group coverage of BatchMVP (left) and BatchGCP (right) on a representative run.
Figure 7: Per-group calibration error of BatchMVP and BatchGCP on a representative run.

C.2 Further Comparisons on State Census Data

In Section 5.2 we performed an income-prediction task using census data (Ding et al. [2021]) from the state of California to compare the performance of BatchGCP and BatchMVP against each other, as well as against other commonly used conformal prediction methods. Here, we present results of the same experiment using data from other US states. We selected the ten most populous states to work with, of which California is one. Results for every state are averaged over 50 runs of the experiment, each time taking a different random split of the data into training, calibration, and test sets. The plots in Figure 8, Figure 9, Figure 10, and Figure 11 compare the performance of all four methods on metrics such as group-wise coverage, prediction-set size, and group-wise quantile calibration error.
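As an illustration of this evaluation protocol (not the paper's released code), the per-state averaging can be organized roughly as follows; `fit_and_evaluate` is a hypothetical wrapper around whichever conformal method is being tested, and the split proportions below are placeholders:

```python
import numpy as np

def repeated_split_evaluation(X, y, fit_and_evaluate, n_runs=50, seed=0):
    """Average a conformal method's metrics over repeated random splits.

    fit_and_evaluate(X_tr, y_tr, X_cal, y_cal, X_te, y_te) is assumed to
    return a dict of scalar metrics (e.g., coverage, average set size,
    calibration error) for one train/calibration/test split.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    runs = []
    for _ in range(n_runs):
        perm = rng.permutation(n)
        # Placeholder proportions: 60% train, 20% calibration, 20% test.
        tr, cal, te = np.split(perm, [int(0.6 * n), int(0.8 * n)])
        runs.append(fit_and_evaluate(X[tr], y[tr], X[cal], y[cal], X[te], y[te]))
    # Report mean and standard deviation across runs for each metric.
    return {k: (np.mean([r[k] for r in runs]), np.std([r[k] for r in runs]))
            for k in runs[0]}
```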

Figure 8: Results for group-wise coverage for nine different states (averaged over 50 runs) using different methods of conformal prediction. Error bars show standard deviation. Panels: (a) Texas, (b) New York, (c) Florida, (d) Pennsylvania, (e) Illinois, (f) Ohio, (g) North Carolina, (h) Georgia, (i) Michigan.
Figure 9: Results for group-wise average prediction set size for nine different states (averaged over 50 runs) using different methods of conformal prediction. Error bars show standard deviation. Panels: (a) Texas, (b) New York, (c) Florida, (d) Pennsylvania, (e) Illinois, (f) Ohio, (g) North Carolina, (h) Georgia, (i) Michigan.
Figure 10: Results for group-wise calibration error for nine different states (averaged over 50 runs). Error bars show standard deviation. Panels: (a) Texas, (b) New York, (c) Florida, (d) Pennsylvania, (e) Illinois, (f) Ohio, (g) North Carolina, (h) Georgia, (i) Michigan.
Figure 11: Scatterplots of the number of points associated with each threshold-group pair $(g,\tau_{i})$ against the average coverage conditional on that pair, for all $g\in\mathcal{G}$ and all $\tau_{i}$ in a grid, over all tested conformal prediction methods (consolidating results over all 50 runs), for all nine states. Target coverage is $q=0.9$. Panels: (a) Texas, (b) New York, (c) Florida, (d) Pennsylvania, (e) Illinois, (f) Ohio, (g) North Carolina, (h) Georgia, (i) Michigan.

C.2.1 Halting Time

Recall that our generalization theorem for BatchMVP is stated in terms of the number of iterations $T$ it runs for before halting. We prove a worst-case upper bound on $T$, but empirically BatchMVP halts much sooner (and so enjoys correspondingly better generalization guarantees). Here we report, for each state, the number of iterations that BatchMVP runs for before halting, averaged over the 50 runs, together with the empirical standard deviation.

Texas: $10.84\pm 1.2059$
New York: $11.26\pm 1.2134$
Florida: $12.06\pm 1.4752$
Pennsylvania: $11.68\pm 2.3189$
Illinois: $9.74\pm 1.4941$
Ohio: $10.94\pm 1.7482$
North Carolina: $13.22\pm 1.5138$
Georgia: $12.18\pm 1.6454$
Michigan: $11.24\pm 2.0353$