
Online conformal prediction with decaying step sizes

Anastasios N. Angelopoulos¹, Rina Foygel Barber², and Stephen Bates³

¹University of California, Berkeley   ²University of Chicago   ³Massachusetts Institute of Technology

[email protected], [email protected], [email protected]
Abstract

We introduce a method for online conformal prediction with decaying step sizes. Like previous methods, ours possesses a retrospective guarantee of coverage for arbitrary sequences. However, unlike previous methods, we can simultaneously estimate a population quantile when it exists. Our theory and experiments indicate substantially improved practical properties: in particular, when the distribution is stable, the coverage is close to the desired level for every time point, not just on average over the observed sequence.

1 Introduction

We study the problem of online uncertainty quantification, such as that encountered in time-series forecasting. Our goal is to produce a prediction set at each time, based on all previous information, that contains the true label with a specified coverage probability. Such prediction sets are useful to the point of being requirements in many sequential problems, including medicine (Robinson, 1978), robotics (Lindemann et al., 2023), finance (Mykland, 2003), and epidemiology (Cramer et al., 2022). Given this broad utility, it comes as no surprise that prediction sets have been studied for approximately one hundred years (and possibly more; see Section 1.1 of Tian et al. (2022)).

Formally, consider a sequence of data points $(X_t,Y_t)\in\mathcal{X}\times\mathcal{Y}$, for $t=1,2,\dots$. At each time $t$, we observe $X_t$ and seek to cover $Y_t$ with a set $\mathcal{C}_t(X_t)$, which depends on a base model trained on all past data (as well as the current feature $X_t$). After predicting, we observe $Y_t$, and the next time step ensues. Note that we have not yet made any assumptions about the data points and their dependencies.

This paper introduces a method for constructing the prediction sets $\mathcal{C}_t$ that has simultaneous best-case and worst-case guarantees—that is, a “best of both worlds” property. We will describe the method shortly, in Section 1.1. Broadly speaking, the method can gracefully handle both arbitrary adversarial sequences of data points and independent and identically distributed (I.I.D.) sequences. In the former case, our method will remain robust, ensuring that the historical fraction of miscovered labels converges to the desired error rate, $\alpha\in(0,1)$. In the latter case, our method will converge, eventually producing the optimal prediction sets. We summarize our results below:

  1. Worst-case guarantee (Theorem 1): When the data points are arbitrary, our algorithm achieves

     \[\frac{1}{T}\sum_{t=1}^{T}\mathbbm{1}_{Y_{t}\in\mathcal{C}_{t}(X_{t})}\in\left(1-\alpha\pm\frac{C}{T^{1/2-\epsilon}}\right), \tag{1}\]

     for a constant $C$ and any fixed $\epsilon>0$. We call this a long-run coverage guarantee.

  2. Best-case guarantee (Theorem 3): When the data points are I.I.D., our algorithm achieves

     \[\lim_{T\to\infty}\mathbb{P}\left(Y_{T}\in\mathcal{C}_{T}(X_{T})\right)=1-\alpha. \tag{2}\]

     We call this a convergence guarantee.

Our algorithm is the first to satisfy both guarantees simultaneously. Moreover, the decaying step size yields more stable behavior than prior methods, as we will see in experiments. See Section 1.2 for a discussion of the relationship with other methods, such as those of Gibbs and Candes (2021), Angelopoulos et al. (2023), and Xu and Xie (2021).

1.1 Method and Setup

We now describe our prediction set construction. Borrowing from conformal prediction, consider a bounded conformal score function $s_t:\mathcal{X}\times\mathcal{Y}\to[0,B]$, at each time $t$. This score $s_t=s_t(X_t,Y_t)$ is large when the predictions of the base model disagree greatly with the observed label; an example would be the residual score, $s_t(x,y)=|y-\hat{f}_t(x)|$, for a model $\hat{f}_t:\mathcal{X}\to\mathbb{R}$ trained online. This concept is standard in conformal prediction (Vovk et al., 2005), and we refer the reader to Angelopoulos and Bates (2023) for a recent overview. Given this score function, define

\[\mathcal{C}_{t}(x)=\left\{y\in\mathcal{Y}:s_{t}(x,y)\leq q_{t}\right\}, \tag{3}\]

where the threshold $q_t$ is updated with the rule

\[q_{t+1}=q_{t}+\eta_{t}\left(\mathbbm{1}_{Y_{t}\notin\mathcal{C}_{t}(X_{t})}-\alpha\right). \tag{4}\]

In particular, if we fail to cover $Y_t$ at time $t$, then the threshold increases to make the procedure slightly more conservative at the next time step (and vice versa).

Readers familiar with this literature will notice the similarity of the update step (4) to that of Gibbs and Candes (2021); Bhatnagar et al. (2023); Feldman et al. (2023); Angelopoulos et al. (2023), the main difference being that here, $\eta_t$ can change over time—later on we will see that $\eta_t\propto t^{-1/2-\epsilon}$, for some small $\epsilon\in(0,1/2)$, leads to guarantees (1) and (2) as described above. We remark also that the update step for $q_t$ can be interpreted as an online (sub)gradient descent algorithm on the quantile loss $\rho_{1-\alpha}(t)=(1-\alpha)\max\{t,0\}+\alpha\max\{-t,0\}$ (Koenker and Bassett Jr, 1978), i.e., we can equivalently write the update step (4) as

\[q_{t+1}=q_{t}-\eta_{t}\nabla\rho_{1-\alpha}(s_{t}-q_{t}).\]
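To make the update concrete, here is a minimal Python sketch of the update rule (4) with the decaying step size $\eta_t=t^{-1/2-\epsilon}$; the `scores` iterable and the initial threshold are placeholder assumptions for illustration, not part of the method's specification.

```python
import numpy as np

def online_conformal(scores, alpha=0.1, eps=0.1, q1=0.5):
    """Run the online conformal update (4) with step size eta_t = t^(-1/2-eps).

    `scores` is an iterable of realized conformal scores s_t = s_t(X_t, Y_t),
    assumed to lie in [0, B]. Returns the threshold sequence q_1, ..., q_T.
    """
    q = q1
    thresholds = []
    for t, s in enumerate(scores, start=1):
        thresholds.append(q)
        eta_t = t ** (-0.5 - eps)      # decaying step size
        err = float(s > q)             # 1 if Y_t is miscovered, i.e. s_t > q_t
        q = q + eta_t * (err - alpha)  # the update rule (4)
    return np.array(thresholds)

# Toy usage: i.i.d. Uniform[0,1] scores, so the optimal threshold is 1 - alpha.
rng = np.random.default_rng(0)
qs = online_conformal(rng.uniform(0, 1, size=10_000), alpha=0.1)
print(qs[-1])  # close to 0.9
```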

In this work, we will consider two different settings:

Setting 1 (Adversarial setting).

We say that we are in the adversarial setting if we allow $(X_1,Y_1),(X_2,Y_2),\dots$ to be an arbitrary sequence of elements in $\mathcal{X}\times\mathcal{Y}$, and $s_1,s_2,\dots$ to be an arbitrary sequence of functions from $\mathcal{X}\times\mathcal{Y}$ to $[0,B]$.

Setting 2 (I.I.D. setting).

We say that we are in the I.I.D. setting if we require that $(X_t,Y_t)\stackrel{\textnormal{iid}}{\sim}P$ for some distribution $P$, and require that the choice of the function $s_t:\mathcal{X}\times\mathcal{Y}\rightarrow[0,B]$ depends only on $\{(X_r,Y_r)\}_{r<t}$, for each $t$ (i.e., the model is trained online).

Of course, any result proved for Setting 1 will hold for Setting 2 as well. We remark that Setting 2 can be relaxed to allow for randomness in the choice of the score functions $s_t$—our results for the I.I.D. setting will hold as long as the function $s_t$ is chosen independently of $\{(X_r,Y_r)\}_{r\geq t}$.

Our method, like all conformal methods, has coverage guarantees that hold for any underlying model and data stream. Still, the quality of the output (e.g., the size of the prediction sets) does critically depend on the quality of the underlying model. This general interplay between conformal methods and models is discussed throughout the conformal literature (e.g., Vovk et al., 2005; Angelopoulos and Bates, 2023).

1.2 Related work

We begin by reviewing the most closely related literature. Set constructions of the form in (3), which “invert” the score function, are commonplace in conformal prediction (Vovk et al., 2005), with $q_t$ chosen as a sample quantile of the previous conformal scores. However, the exchangeability-based arguments of the standard conformal framework cannot give any guarantees in Setting 1. The idea to set $q_t$ via online gradient descent with a fixed step size appears first in Gibbs and Candes (2021), which introduced online conformal prediction in the adversarial setting. The version we present here builds also on the work of Bhatnagar et al. (2023), Feldman et al. (2023), and Angelopoulos et al. (2023); in particular, Angelopoulos et al. (2023) call the update in (4) the “quantile tracker”. These papers all have long-run coverage guarantees in Setting 1, but do not have convergence guarantees in Setting 2.

Subsequent work has explored time-varying step sizes that respond to distribution shifts, primarily for the purpose of giving other notions of validity, such as regret analyses (Gibbs and Candès, 2022; Zaffran et al., 2022; Bastani et al., 2022; Noarov et al., 2023; Bhatnagar et al., 2023). From an algorithmic perspective, these methods depart significantly from the update in (4), generally by incorporating techniques from online learning—such as strongly adaptive online learning (Daniely et al., 2015), adaptive regret (Gradu et al., 2023), and adaptive aggregation of experts (Cesa-Bianchi and Lugosi, 2006). To summarize, the long-run coverage and regret bounds in these papers apply to substantially different, usually more complicated algorithms than the simple expression we have in (4). We remark that “best of both worlds” guarantees appear in the online learning literature (e.g., Bubeck and Slivkins, 2012; Koolen et al., 2016; Zimmert and Seldin, 2021; Jin et al., 2021; Chen et al., 2023; Dann et al., 2023), where the aim is to find a single algorithm whose regret is optimal both in a stochastic setting (i.e., data sampled from a distribution) and in an adversarial setting. A crucial difference, however, is that our paper’s guarantees are concerned with inference and predictive coverage, rather than with estimation or regret.

Farther afield from our work, there have been several other explorations of conformal prediction in time series, but these are quite different. For example, the works of Barber et al. (2022) and Chernozhukov et al. (2018) provide conformal-type procedures with coverage guarantees under certain relaxations of exchangeability; both can provide marginal coverage in Setting 2, but cannot give any guarantees in Setting 1. Xu and Xie (2021, 2023) study the behavior of conformal methods under classical nonparametric assumptions, relying on model consistency and distributional smoothness for validity, and thus cannot give distribution-free guarantees in Setting 1 or 2. Lin et al. (2022) study the problem of cross-sectional coverage for multiple exchangeable time series. The online conformal prediction setup was also considered early on by Vovk (2002) for exchangeable sequences. These works are not directly comparable to ours, the primary point of difference being the adversarial guarantee we can provide in Setting 1.

Finally, traditional solutions to the prediction set problem have historically relied on Bayesian modeling (e.g., Foreman-Mackey et al., 2017) or distributional assumptions such as autoregression, smoothness, or ergodicity (e.g., Biau and Patra, 2011). A parallel literature on calibration exists in the adversarial sequence model (e.g., Foster and Vohra, 1998). Our work, like that of Gibbs and Candes (2021), is clearly related to the literatures on both calibration and online convex optimization (Zinkevich, 2003), and we hope these connections will continue to reveal themselves; our work takes online conformal prediction one step closer to online learning by allowing the use of decaying step sizes, which is typical for online gradient descent.

1.3 Our contribution

We provide the first analysis of the online conformal prediction update in (4) with an arbitrary step size. Our analysis gives strong long-run coverage bounds for appropriately decaying step sizes, even in the adversarial setting (Setting 1). We also give a simultaneous convergence guarantee in the I.I.D. setting (Setting 2), showing that the parameter $q_t$ converges to the optimal value $q^*$. Importantly, this type of convergence does not hold with a fixed step size (the case previously analyzed in the online conformal prediction literature). In fact, we show that with a fixed step size, online conformal prediction returns meaningless prediction sets (i.e., either $\emptyset$ or $\mathcal{Y}$) infinitely often. From the theoretical point of view, therefore, our method is the first to provide this type of “best-of-both-worlds” guarantee.

While these theoretical results show an improvement (relative to the fixed-step-size method) in an I.I.D. setting, from the practical perspective we will see that a decaying step size also enables substantially better results and more stable behavior on real time series data, which lies somewhere between the I.I.D. and the adversarial regime.

2 Main results in the adversarial setting

We now present our main results for the adversarial setting, Setting 1, which establish long-run coverage guarantees with no assumptions on the data or the score functions.

2.1 Decreasing step sizes

Our first main result shows that, for a nonincreasing step size sequence, the long-run coverage rate

\[\frac{1}{T}\sum_{t=1}^{T}\mathbbm{1}_{Y_{t}\in\mathcal{C}_{t}(X_{t})} \tag{5}\]

will converge to the nominal level $1-\alpha$.

Theorem 1.

Let $(X_1,Y_1),(X_2,Y_2),\dots$ be an arbitrary sequence of data points, and let $s_t:\mathcal{X}\times\mathcal{Y}\rightarrow[0,B]$ be arbitrary functions. Let $\eta_t$ be a positive and nonincreasing sequence of step sizes, and fix an initial threshold $q_1\in[0,B]$.

Then online conformal prediction satisfies

\[\left|\frac{1}{T}\sum_{t=1}^{T}\mathbbm{1}_{Y_{t}\in\mathcal{C}_{t}(X_{t})}-(1-\alpha)\right|\leq\frac{B+\eta_{1}}{\eta_{T}T}\]

for all $T\geq 1$.

As a special case, if we choose a constant step size $\eta_t\equiv\eta$ then this result is analogous to Gibbs and Candes (2021, Proposition 4.1). On the other hand, if we choose $\eta_t\propto t^{-a}$ for some $a\in(0,1)$, then the long-run coverage at time $T$ has error bounded as $\mathcal{O}(\frac{1}{T^{1-a}})$.
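As a quick empirical check of this rate, the following minimal simulation (assuming i.i.d. Uniform[0,1] scores with $B=1$; the constants are arbitrary choices) compares the realized long-run coverage error against the $T^{-(1-a)}$ envelope:

```python
import numpy as np

# With eta_t = t^(-a), Theorem 1 bounds the long-run coverage error at time T
# by (B + eta_1) / (eta_T * T), which is of order T^(a-1).
rng = np.random.default_rng(1)
alpha, a, T = 0.1, 0.7, 100_000
scores = rng.uniform(0, 1, size=T)

q, miscover = 0.5, np.empty(T)
for t in range(T):
    miscover[t] = float(scores[t] > q)
    q += (t + 1) ** (-a) * (miscover[t] - alpha)

# |long-run coverage - (1 - alpha)| at each horizon, vs. the T^(a-1) envelope.
err = np.abs((1 - np.cumsum(miscover) / np.arange(1, T + 1)) - (1 - alpha))
for horizon in (10**3, 10**4, 10**5):
    print(horizon, err[horizon - 1], horizon ** (a - 1.0))
```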

2.2 Arbitrary step sizes

As discussed above, if the data appears to be coming from the same distribution, then a decaying step size can be advantageous, to stabilize the behavior of the prediction sets over time. However, if we detect a sudden distribution shift and start to lose coverage, we might want to increase the step size $\eta_t$ to recover coverage more quickly. To accommodate this, the above theorem can be generalized to an arbitrary step size sequence, as follows.

Theorem 2.

Let $(X_1,Y_1),(X_2,Y_2),\dots$ be an arbitrary sequence of data points, and let $s_t:\mathcal{X}\times\mathcal{Y}\rightarrow[0,B]$ be arbitrary functions. Let $\eta_t$ be an arbitrary positive sequence, and fix an initial threshold $q_1\in[0,B]$.

Then online conformal prediction satisfies

\[\left|\frac{1}{T}\sum_{t=1}^{T}\mathbbm{1}_{Y_{t}\in\mathcal{C}_{t}(X_{t})}-(1-\alpha)\right|\leq\frac{B+\max_{1\leq t\leq T}\eta_{t}}{T}\cdot\|\Delta_{1:T}\|_{1}\]

for all $T\geq 1$, where the sequence $\Delta$ is defined with values

\[\Delta_{1}=\eta_{1}^{-1},\quad\textnormal{and}\quad\Delta_{t}=\eta_{t}^{-1}-\eta_{t-1}^{-1}\ \textnormal{ for all } t\geq 2.\]

We can see that Theorem 1 is indeed a special case of this more general result, because in the case of a nonincreasing sequence $\eta_t$, we have $\max_{1\leq t\leq T}\eta_t=\eta_1$, and

\[\|\Delta_{1:T}\|_{1}=|\eta_{1}^{-1}|+\sum_{t=2}^{T}|\eta_{t}^{-1}-\eta_{t-1}^{-1}|=\eta_{1}^{-1}+\sum_{t=2}^{T}(\eta_{t}^{-1}-\eta_{t-1}^{-1})=\eta_{T}^{-1}.\]

But Theorem 2 can be applied much more broadly. For example, we might allow the step size to decay during long stretches of time when the distribution seems stationary, but then reset to a larger step size whenever we believe the distribution may have shifted (see the sketch below). In this case, we can obtain an interpretable error bound from the result of Theorem 2 by observing that $\|\Delta_{1:T}\|_{1}\leq\frac{2N_{T}}{\min_{1\leq t\leq T}\eta_{t}}$, where $N_{T}=\sum_{t=2}^{T}\mathbbm{1}_{\eta_{t}>\eta_{t-1}}$ is the number of times we increase the step size. Thus, as long as the step size does not decay too quickly, and the number of “resets” $N_T$ is $\mathrm{o}(T)$, the upper bound of Theorem 2 will still be vanishing.
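As an illustration, here is a minimal sketch of such a schedule (the reset times are hypothetical, standing in for the output of a changepoint detector), along with the resulting bound from Theorem 2:

```python
import numpy as np

def reset_schedule(T, resets, eps=0.1):
    """Step sizes decaying as (time since last reset)^(-1/2-eps),
    restarting whenever a (hypothetical) distribution shift is flagged."""
    eta, last = np.empty(T), 0
    for t in range(T):
        if t in resets:
            last = t
        eta[t] = (t - last + 1) ** (-0.5 - eps)
    return eta

T, B = 10_000, 1.0
eta = reset_schedule(T, resets={3_000, 7_000})
# Delta_1 = 1/eta_1, and Delta_t = 1/eta_t - 1/eta_{t-1} for t >= 2:
delta = np.diff(1.0 / eta, prepend=0.0)
bound = (B + eta.max()) / T * np.abs(delta).sum()  # Theorem 2's bound
print(bound)  # vanishes as T grows, since there are only N_T = 2 resets
```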

3 Results for I.I.D. data

We now turn to studying the setting of I.I.D. data, Setting 2, where $(X_1,Y_1),(X_2,Y_2),\dots$ are sampled I.I.D. from some distribution $P$ on $\mathcal{X}\times\mathcal{Y}$. While Theorems 1 and 2 show that the coverage of the procedure converges in a weak sense, as in (5), for any realization of the data (or even with a nonrandom sequence of data points), we would also like to understand whether the procedure might satisfy stronger notions of convergence with “nice” data. Will the sequence of prediction intervals converge in a suitable sense? We will see that a decaying step size does indeed lead to convergence, whereas a constant step size leads to oscillating behavior.

In order to make our questions precise, we need to introduce one more piece of notation to capture the notion of coverage at a particular time $t$—the “instantaneous” coverage. Let

\[\mathsf{Coverage}_{t}(q)=\mathbb{P}_{P}\left(s_{t}(X,Y)\leq q\ \middle|\ s_{t}\right),\]

where the probability is calculated with respect to a data point $(X,Y)\sim P$ drawn independently of $s_t$. Then, at time $t$, the prediction set $\mathcal{C}_t(X_t)$ has coverage level $\mathsf{Coverage}_t(q_t)$, by construction. We will see in our results below that for an appropriately chosen decaying step size, $\mathsf{Coverage}_t(q_t)$ will concentrate around $1-\alpha$ over time, while if we choose a constant step size, then $\mathsf{Coverage}_t(q_t)$ will be highly variable.
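Since $\mathsf{Coverage}_t(q)$ is a population quantity, in practice it must be estimated; a minimal sketch, assuming a hypothetical array `holdout_scores` of scores $s_t(X,Y)$ evaluated on held-out draws from $P$ (this is how the experiments in Section 4 estimate it):

```python
import numpy as np

def estimate_coverage(holdout_scores, q):
    """Monte Carlo estimate of Coverage_t(q) = P(s_t(X, Y) <= q), using scores
    computed on a holdout set drawn from the same distribution P."""
    return float(np.mean(np.asarray(holdout_scores) <= q))
```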

3.1 Results with a pretrained score function

To begin, we assume that the score function is pretrained, i.e., that $s_1=s_2=\dots$ are all equal to some fixed function $s:\mathcal{X}\times\mathcal{Y}\rightarrow[0,B]$. The reader should interpret this as the case where the underlying model is not updated online (e.g., $s(x,y)=|y-\hat{f}(x)|$ for a pretrained model $\hat{f}$ that is no longer being updated). This simple case is intended only as an illustration of the trends we might see more generally; in Section 3.2 below we will study a more realistic setting, where model training is carried out online as the data is collected.

In this setting, since the score function does not vary with $t$, we have $\mathsf{Coverage}_t(\cdot)\equiv\mathsf{Coverage}(\cdot)$ where

\[\mathsf{Coverage}(q)=\mathbb{P}_{P}\left(s(X,Y)\leq q\right),\]

i.e., the instantaneous coverage at time $t$ is $\mathsf{Coverage}(q_t)$.

First, we will see that choosing a constant step size leads to undesirable behavior: while coverage will hold on average over time (as recorded in Theorem 1 and in the earlier work of Gibbs and Candes (2021)), there will be high variability in $\mathsf{Coverage}(q_t)$ over time—for instance, we may have $\mathcal{C}_t(X_t)=\emptyset$ infinitely often.

Proposition 1.

Let $(X_t,Y_t)\stackrel{\textnormal{iid}}{\sim}P$ for some distribution $P$. Suppose also that $s_t\equiv s$ for some fixed function $s:\mathcal{X}\times\mathcal{Y}\to[0,B]$, and that $\eta_t\equiv\eta$ for a positive constant step size $\eta>0$. Assume also that $\alpha$ is a rational number.

Then online conformal prediction satisfies

\[\mathsf{Coverage}(q_{t})=0\ \text{ for infinitely many } t,\quad\text{and}\quad\mathsf{Coverage}(q_{t})=1\ \text{ for infinitely many } t,\]

almost surely.

In other words, even in the simplest possible setting of I.I.D. data and a fixed model, we cannot expect convergence of the method if we use a constant step size.
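The contrast is easy to see in simulation; below is a minimal sketch, assuming i.i.d. Uniform[0,1] scores (so that $q^*=1-\alpha$), with step sizes matching the experiments in Section 4:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, T = 0.1, 200_000
scores = rng.uniform(0, 1, size=T)  # s(X_t, Y_t) i.i.d., so q* = 0.9

def run(step_size):
    q, traj = 0.5, np.empty(T)
    for t in range(T):
        traj[t] = q
        q += step_size(t + 1) * (float(scores[t] > q) - alpha)
    return traj

q_fixed = run(lambda t: 0.05)       # constant step size: keeps oscillating
q_decay = run(lambda t: t ** -0.6)  # eta_t = t^(-1/2-eps), eps = 0.1
print(q_fixed[-5:])  # still fluctuating around (and beyond) q* = 0.9
print(q_decay[-5:])  # settled near q* = 0.9, as Theorem 3 below predicts
```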

On the other hand, if we choose a sequence of step sizes $\eta_t$ that decays at an appropriate rate (such as $\eta_t\propto t^{-1/2-\epsilon}$, for some $\epsilon\in(0,1/2)$, as mentioned earlier), then over time, this highly variable behavior can be avoided. Instead, we will typically see coverage converging to $1-\alpha$ for each constructed prediction set $\mathcal{C}_t(X_t)$, i.e., $\mathsf{Coverage}(q_t)\rightarrow 1-\alpha$. We will need one more assumption: defining $q^*$ as the $(1-\alpha)$-quantile of $s(X,Y)$, we assume that $q^*$ is unique:

\[\mathsf{Coverage}(q)<1-\alpha\ \text{ for all } q<q^{*},\qquad\mathsf{Coverage}(q)>1-\alpha\ \text{ for all } q>q^{*}. \tag{6}\]
Theorem 3.

Let $(X_t,Y_t)\stackrel{\textnormal{iid}}{\sim}P$ for some distribution $P$. Suppose also that $s_t\equiv s$ for some fixed function $s:\mathcal{X}\times\mathcal{Y}\to[0,B]$. Assume that $\eta_t$ is a fixed nonnegative step size sequence satisfying

\[\sum_{t=1}^{\infty}\eta_{t}=\infty,\qquad\sum_{t=1}^{\infty}\eta_{t}^{2}<\infty. \tag{7}\]

Assume also that $q^*$ is unique as in (6).

Then online conformal prediction satisfies

\[q_{t}\rightarrow q^{*}\ \text{ almost surely}.\]

With an additional assumption, this immediately implies convergence of the coverage, $\mathsf{Coverage}(q_t)$:

Corollary 1.

Under the setting and assumptions of Theorem 3, assume also that $s(X,Y)$ has a continuous distribution (under $(X,Y)\sim P$). Then

\[\mathsf{Coverage}(q_{t})\rightarrow 1-\alpha\ \text{ almost surely}.\]

That is, instead of the high variance in coverage incurred by a constant step size (as in Proposition 1), here the coverage converges to the nominal level $1-\alpha$. Finally, with additional assumptions, we can also characterize the rate at which the threshold $q_t$ converges to $q^*$:

Proposition 2.

Under the setting and assumptions of Corollary 1, assume also that the distribution of $s(X,Y)$ (under $(X,Y)\sim P$) has density lower-bounded by $\gamma$ in the range $[q^*-\delta,q^*+\delta]$, for some $\gamma,\delta>0$. Take the step size sequence $\eta_t=ct^{-1/2-\epsilon}$, for some $c>0$ and $\epsilon\in(0,1/2)$. Then it holds for all $t\geq 1$ that

\[\mathbb{E}\left[(q_{t}-q^{*})^{2}\right]\leq bt^{-1/2-\epsilon},\]

where $b$ is a constant that depends only on $B,\gamma,\delta,c,\epsilon$.
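A minimal simulation checking this rate (again assuming i.i.d. Uniform[0,1] scores, for which $q^*=1-\alpha$ and the density bound holds with $\gamma=1$):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, eps, T, reps = 0.1, 0.1, 20_000, 50
sq_err = np.zeros(T)  # running average of (q_t - q*)^2 over replicates
for _ in range(reps):
    q, scores = 0.5, rng.uniform(0, 1, size=T)
    for t in range(T):
        sq_err[t] += (q - 0.9) ** 2 / reps
        q += (t + 1) ** (-0.5 - eps) * (float(scores[t] > q) - alpha)
for t in (100, 1_000, 10_000):
    print(t, sq_err[t - 1], t ** (-0.5 - eps))  # ratio should stay bounded
```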

3.2 Results with online training of the score function

The result of Theorem 3 above is restricted to a very simple setting, where the score functions are given by $s_t\equiv s$ for some fixed $s$, i.e., we are using a pretrained model. We now consider the more interesting setting where the model is trained online. Formally, we consider Setting 2, where we allow the score function $s_t$ to depend arbitrarily on the data observed before time $t$, i.e., on $\{(X_r,Y_r)\}_{r<t}$.

First, we will consider a constant step size $\eta_t\equiv\eta$.

Proposition 3.

Let $(X_t,Y_t)\stackrel{\textnormal{iid}}{\sim}P$ for some distribution $P$, and assume the score functions $s_t:\mathcal{X}\times\mathcal{Y}\rightarrow[0,B]$ are trained online. Let $\eta_t\equiv\eta$ for a positive constant step size $\eta>0$.

Then online conformal prediction satisfies

\[\liminf_{t\rightarrow\infty}\mathsf{Coverage}_{t}(q_{t})=0,\qquad\limsup_{t\rightarrow\infty}\mathsf{Coverage}_{t}(q_{t})=1\]

almost surely.

This result is analogous to Proposition 1 for the case of a pretrained score function (but with a slightly weaker conclusion, due to the more general setting). As before, the conclusion we draw is that a constant step size inevitably leads to high variability in $\mathsf{Coverage}_t(q_t)$.

On the other hand, if we take a decaying step size, Theorem 3 established a convergence result given a pretrained score function. We will now see that similar results hold for the online setting, as long as the model converges in some sense. In many settings, we might expect $s_t$ to converge to some score function $s$—for example, if our fitted regression functions $\hat{f}_t$ converge to some “true” model $f^*$, then $s_t(x,y)=|y-\hat{f}_t(x)|$ converges to $s(x,y)=|y-f^*(x)|$. As before, we let $\mathsf{Coverage}(q)=\mathbb{P}_{P}\left(s(X,Y)\leq q\right)$, and write $q^*$ to denote the $(1-\alpha)$-quantile of this distribution. We now extend the convergence results of Theorem 3 to this setting.

Theorem 4.

Let $(X_t,Y_t)\stackrel{\textnormal{iid}}{\sim}P$ for some distribution $P$, and assume the score functions $s_t$ are trained online. Assume that $\eta_t$ is a fixed nonnegative step size sequence satisfying (7). Let $s:\mathcal{X}\times\mathcal{Y}\rightarrow[0,B]$ be a fixed score function, and assume that $q^*$ is unique as in (6).

Then online conformal prediction satisfies the following statement almost surely:¹

\[\text{If } s_{t}\stackrel{\textnormal{d}}{\rightarrow}s,\ \text{then } q_{t}\rightarrow q^{*}.\]

¹We use $s_t\stackrel{\textnormal{d}}{\rightarrow}s$ in the sense of convergence in distribution under $(X,Y)\sim P$, while treating the $s_t$’s as fixed. Specifically, we are assuming $\mathsf{Coverage}_t(q)\rightarrow\mathsf{Coverage}(q)$, for all $q\in\mathbb{R}$ at which $\mathsf{Coverage}(q)$ is continuous.

As in the previous section, an additional assumption implies convergence of the coverage, $\mathsf{Coverage}_t(q_t)$:

Corollary 2.

Under the setting and assumptions of Theorem 4, assume also that $s(X,Y)$ has a continuous distribution (under $(X,Y)\sim P$). Then online conformal prediction satisfies the following statement almost surely:

\[\text{If } s_{t}\stackrel{\textnormal{d}}{\rightarrow}s,\ \text{then } \mathsf{Coverage}_{t}(q_{t})\rightarrow 1-\alpha.\]

To summarize, the results of this section show that the coverage of each prediction set $\mathcal{C}_t(X_t)$, given by $\mathsf{Coverage}_t(q_t)$, will converge even in a setting where the model is being updated in a streaming fashion, as long as the fitted model itself converges over time.
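To make the online-training setting concrete, here is a minimal sketch in which the base model is a running mean of past labels, so that $\hat{f}_t$ converges and hence $s_t\stackrel{d}{\to}s$ as Theorem 4 requires; the data-generating process is a placeholder assumption:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, eps, T = 0.1, 0.1, 50_000
Y = 1.0 + rng.normal(0, 0.2, size=T)  # i.i.d. labels (no features, for simplicity)

q, mean, n = 0.5, 0.0, 0
for t in range(T):
    # Online model: `mean` is the running mean of past labels {Y_r : r < t},
    # so the score s_t(y) = |y - mean| converges as mean -> E[Y].
    s = abs(Y[t] - mean)
    q += (t + 1) ** (-0.5 - eps) * (float(s > q) - alpha)
    n += 1
    mean += (Y[t] - mean) / n
print(q)  # near the 0.9-quantile of |Y - E[Y]|, about 0.2 * 1.645 = 0.33
```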

In particular, if we choose $\eta_t\propto t^{-1/2-\epsilon}$ for some $\epsilon\in(0,1/2)$, then in the adversarial setting the long-run coverage error is bounded as $\mathcal{O}(\frac{1}{T^{1/2-\epsilon}})$ by Theorem 1, while in the I.I.D. setting, Theorem 4 guarantees convergence. In other words, this choice of $\eta_t$ simultaneously achieves both types of guarantees.

While the results of this section have assumed I.I.D. data, the proof techniques used here can be extended to handle broader settings—for example, a stationary time series, where despite dependence we may still expect to see convergence over time. We leave these extensions to future work.

4 Experiments

We include two experiments: an experiment on the Elec2 dataset (Harries, 1999), where the data shows significant distribution shift over time, and an experiment on Imagenet (Deng et al., 2009), where the data points are exchangeable.²

²Code to reproduce these experiments is available at https://github.com/aangelopoulos/online-conformal-decaying.

The experiments are run with two different choices of step size for online conformal prediction: first, a fixed step size ($\eta_t\equiv 0.05$); and second, a decaying step size ($\eta_t=t^{-1/2-\epsilon}$ with $\epsilon=0.1$). We also compare to an oracle method, where online conformal prediction is run with $q^*$ in place of $q_t$ at each time $t$, and $q^*$ is chosen to be the value that gives $1-\alpha$ average coverage over the entire sequence $t=1,\dots,T$. All methods are run with $\alpha=0.1$.
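For reference, the oracle threshold can be computed in hindsight as an empirical quantile of the realized score sequence; a minimal sketch (the `scores` array is a placeholder for the realized values $s_t(X_t,Y_t)$):

```python
import numpy as np

def oracle_threshold(scores, alpha=0.1):
    """Smallest threshold q such that the fraction of realized scores with
    s_t <= q is at least 1 - alpha, i.e., the value giving 1 - alpha average
    coverage in hindsight (the `method` argument requires NumPy >= 1.22)."""
    return float(np.quantile(np.asarray(scores), 1 - alpha, method="higher"))
```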

4.1 Results

Figures 1 and 2 display the results of the experiment for the Elec2 data and the Imagenet data, respectively. We now discuss our findings.

The thresholds $q_t$.

The first panel of each figure plots the value of the threshold $q_t$ over time $t$. We can see that the procedure with a fixed step size has significantly larger fluctuations in the quantile value as compared to the decaying step size procedure.

The instantaneous coverage $\mathsf{Coverage}_t(q_t)$.

The second panel of each figure plots the value of the instantaneous coverage $\mathsf{Coverage}_t(q_t)$ over time $t$. For each dataset, since the true data distribution is unknown, we estimate $\mathsf{Coverage}_t(q_t)$ using a holdout set. We observe that $\mathsf{Coverage}_t(q_t)$ is substantially more stable for the decaying step size as compared to the fixed step size in both experiments. While $\mathsf{Coverage}_t(q_t)$ concentrates closely around the nominal level $1-\alpha$ for decaying $\eta_t$, for fixed $\eta_t$ the coverage level oscillates and does not converge.

Long-run coverage and rolling coverage.

The third panel of each figure plots the value of the long-run coverage, $\frac{1}{t}\sum_{r=1}^{t}\mathbbm{1}_{Y_r\in\mathcal{C}_r(X_r)}$, over time $t$. We see that the long-run coverage converges quickly to $1-\alpha$ for all methods, and we cannot differentiate between them in this plot.

Consequently, in the fourth panel of each figure, we also plot the “rolling” coverage, which computes the coverage rate averaged over a rolling window of 1000 time points. We can see that this measure is tighter around $1-\alpha$ for the fixed step size procedure; for the decaying step size procedure, the rolling coverage fluctuates more, but its fluctuations are no larger than those of the oracle method. At first glance, it might appear that having lower variance in the rolling coverage indicates that the fixed step size procedure is actually performing better than the decaying step size—but this is not the case. The low variance with fixed $\eta_t\equiv\eta$ is due to overcorrecting. For example, if we have several miscoverage events in a row (which can happen by random chance, even with the oracle intervals), then the fixed-step-size method will necessarily return an overly wide interval (e.g., $\mathcal{C}_t(X_t)=\mathbb{R}$) to give certain coverage at the next time step. Thus, the fixed-step-size method ensures low variance in rolling coverage at the cost of extremely high variance in the width and instantaneous coverage of the interval $\mathcal{C}_t(X_t)$ at each time $t$. This type of overcorrection is undesirable.

4.2 Implementation details for Elec2 Time Series

The Elec2 (Harries, 1999) dataset is a time series of 45312 hourly measurements of electricity demand in New South Wales, Australia. We use even-numbered time points as the time series, and odd-numbered time points as a holdout set for estimating $\mathsf{Coverage}_t(q_t)$. The demand measurements are normalized so that $Y_t\in[0,1]$. The covariate vector $X_t=(Y_1,\ldots,Y_{t-1})$ is the sequence of all previous demand values. The forecast $\hat{Y}_t$ is a one-day-delayed moving average of $Y_t$ (i.e., at time $t$, our predicted value $\hat{Y}_t$ is given by the average of observations taken between 24 and 48 hours earlier), and the score is $s_t(X_t,Y_t)=|Y_t-\hat{Y}_t|$.
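A minimal sketch of this forecast and score computation (assuming a hypothetical array `y` of normalized hourly demand values):

```python
import numpy as np

def residual_scores(y):
    """One-day-delayed moving-average forecast and residual scores.

    y_hat[t] averages the observations from 24 to 48 hours before time t,
    and the conformal score is the absolute residual |y[t] - y_hat[t]|.
    """
    y = np.asarray(y, dtype=float)
    T = len(y)
    y_hat = np.full(T, np.nan)  # undefined until two days of history exist
    for t in range(48, T):
        y_hat[t] = y[t - 48 : t - 24].mean()
    return np.abs(y - y_hat)
```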

Figure 1: Elec2 results. From left to right, the panels display the following (over all times $t$): first, the value of the threshold $q_t$; second, the instantaneous coverage $\mathsf{Coverage}_t(q_t)$; third, the long-run coverage $\frac{1}{t}\sum_{r=1}^{t}\mathbbm{1}_{Y_r\in\mathcal{C}_r(X_r)}$; and fourth, the rolling coverage, averaged over a rolling window of 1000 time points.

Figure 2: Imagenet results. Same details as for Figure 1.

4.3 Implementation details for Imagenet

Imagenet (Deng et al., 2009) is a standard computer vision dataset of natural images. We take the 50000 validation images of Imagenet 2012 and treat them as a time series for the purpose of evaluating our methods. Because the validation split of Imagenet is shuffled, this comprises an exchangeable time series. We use 45000 points for the time series, and the remaining 5000 points as a holdout set for estimating $\mathsf{Coverage}_t(q_t)$. As the score function, we use $s_t(X_t,Y_t)=1-\hat{f}(X_t)_{Y_t}$, where $\hat{f}(X_t)_{Y_t}$ is the softmax score assigned to the true class $Y_t$ by a pretrained ResNet-152 model.
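A minimal sketch of this score computation (taking a model's logits for one image as a placeholder input):

```python
import numpy as np

def softmax_score(logits, label):
    """Conformal score 1 - (softmax probability of the true class),
    computed from a classifier's logits for a single image."""
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    return 1.0 - p[label]
```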

4.4 Additional experiments

As discussed in Section 2.2, in applications where the distribution of the data may drift or may have changepoints, it might be beneficial to allow $\eta_t$ to increase at times, to allow for updates in the learning process. To study this empirically, in the Appendix we include additional experiments in a broader range of settings—we test over 3000 real datasets, and compare the fixed step size method, the decaying step size method, and a “decay+adapt” version of our method where the sequence $\eta_t$ adapts to trends in the data (decaying if the distribution of the data appears stationary, but increasing if distribution shift is detected).

5 Proofs

In this section, we prove Theorems 1, 2, 3, and 4, and Propositions 1 and 3. All other results are proved in the Appendix.

5.1 Proof of Theorems 1 and 2

First, we need a lemma to verify that the values $q_t$ are uniformly bounded over all $t$. This result is essentially the same as that in Lemma 4.1 of Gibbs and Candes (2021), except extended to the setting of decaying, rather than constant, step size. The proof is given in the Appendix.

Lemma 1.

Let $(X_1,Y_1),(X_2,Y_2),\dots$ be an arbitrary sequence of data points, and let $s_t:\mathcal{X}\times\mathcal{Y}\rightarrow[0,B]$ be arbitrary functions. Let $\eta_t$ be an arbitrary nonnegative sequence, and fix an initial threshold $q_1\in[0,B]$.

Then online conformal prediction satisfies

\[-\alpha M_{t-1}\leq q_{t}\leq B+(1-\alpha)M_{t-1}\ \text{ for all } t\geq 1, \tag{8}\]

where $M_0=0$, and $M_t=\max_{1\leq r\leq t}\eta_r$ for each $t\geq 1$.

We are now ready to prove the theorems. As discussed earlier, Theorem 1 is simply a special case, so we only prove the more general result Theorem 2.

By definition of $\Delta$, we have $\eta_t^{-1}=\sum_{r=1}^{t}\Delta_r$ for all $t\geq 1$. We then calculate

\begin{align*}
\left|\frac{1}{T}\sum_{t=1}^{T}\mathbbm{1}_{Y_{t}\in\mathcal{C}_{t}(X_{t})}-(1-\alpha)\right|&=\left|\frac{1}{T}\sum_{t=1}^{T}\mathbbm{1}_{Y_{t}\not\in\mathcal{C}_{t}(X_{t})}-\alpha\right|\\
&=\left|\frac{1}{T}\sum_{t=1}^{T}\left(\sum_{r=1}^{t}\Delta_{r}\right)\cdot\eta_{t}\left(\mathbbm{1}_{Y_{t}\not\in\mathcal{C}_{t}(X_{t})}-\alpha\right)\right|\\
&=\left|\frac{1}{T}\sum_{r=1}^{T}\Delta_{r}\left(\sum_{t=r}^{T}\eta_{t}\left(\mathbbm{1}_{Y_{t}\not\in\mathcal{C}_{t}(X_{t})}-\alpha\right)\right)\right|\\
&=\left|\frac{1}{T}\sum_{r=1}^{T}\Delta_{r}\left(q_{T+1}-q_{r}\right)\right|\ \text{ by (4)}\\
&\leq\frac{1}{T}\sum_{r=1}^{T}|\Delta_{r}|\cdot\max_{1\leq r\leq T}\left|q_{T+1}-q_{r}\right|\\
&\leq\frac{1}{T}\cdot\|\Delta_{1:T}\|_{1}\cdot\Big(B+\max_{1\leq t\leq T}\eta_{t}\Big),
\end{align*}

where the last step holds since $q_r,q_{T+1}$ are bounded by Lemma 1.

5.2 Proof of Proposition 3

First we prove that $\limsup_{t\rightarrow\infty}\mathsf{Coverage}_t(q_t)=1$ almost surely. Equivalently, for any fixed $\epsilon>0$, we need to prove that $\mathbb{P}\left(\limsup_{t\rightarrow\infty}\mathsf{Coverage}_t(q_t)<1-\epsilon\right)=0$.

We begin by constructing a useful coupling between the online conformal process and a sequence of I.I.D. uniform random variables. For each $t\geq 1$, define

\[U_{t}\sim\begin{cases}\mathsf{Uniform}[0,\mathsf{Coverage}_{t}(q_{t})],&\text{ if }Y_{t}\in\mathcal{C}_{t}(X_{t}),\\ \mathsf{Uniform}[\mathsf{Coverage}_{t}(q_{t}),1],&\text{ if }Y_{t}\not\in\mathcal{C}_{t}(X_{t}),\end{cases}\]

drawn independently for each $t$ after conditioning on all the data, $\{(X_t,Y_t)\}_{t\geq 1}$. Since $\mathbb{P}\left(Y_t\in\mathcal{C}_t(X_t)\ \middle|\ \{(X_r,Y_r)\}_{r<t}\right)=\mathsf{Coverage}_t(q_t)$ by construction, we can verify that $U_t\stackrel{\textnormal{iid}}{\sim}\mathsf{Uniform}[0,1]$.

Next fix any integer $N\geq\frac{B+\eta\alpha}{\eta(1-\alpha)}$. Let $A_i$ be the event that

\[U_{t}>1-\epsilon\ \text{ for all } (i-1)N<t\leq iN.\]

Since the $U_t$’s are I.I.D. uniform random variables, we have $\mathbb{P}\left(A_i\right)=\epsilon^N$ for each $i$, and the events $A_i$ are mutually independent. Therefore, by the second Borel–Cantelli lemma, $\mathbb{P}\left(\sum_{i\geq 1}\mathbbm{1}_{A_i}=\infty\right)=1$. Now we claim that

\[\text{If } A_{i} \text{ occurs, then }\max_{(i-1)N<t\leq iN+1}\mathsf{Coverage}_{t}(q_{t})>1-\epsilon. \tag{9}\]

Suppose that $A_i$ holds and that $\mathsf{Coverage}_t(q_t)\leq 1-\epsilon$ for all $t$ in the range $(i-1)N<t\leq iN$. Then by construction of the $U_t$’s, we have $Y_t\not\in\mathcal{C}_t(X_t)$ for all $(i-1)N<t\leq iN$. Therefore by (4),

\[q_{iN+1}=q_{(i-1)N+1}+\sum_{t=(i-1)N+1}^{iN}\eta\left(\mathbbm{1}_{Y_{t}\not\in\mathcal{C}_{t}(X_{t})}-\alpha\right)=q_{(i-1)N+1}+N\cdot\eta(1-\alpha)\geq B,\]

where the last step holds by our choice of $N$, together with the fact that $q_{(i-1)N+1}\geq-\alpha\eta$ by Lemma 1. But since the score function $s_{iN+1}$ takes values in $[0,B]$ by assumption, we therefore have $\mathsf{Coverage}_{iN+1}(q_{iN+1})\geq\mathsf{Coverage}_{iN+1}(B)=1$. Therefore, we have verified the claim (9).

Since $A_i$ occurs for infinitely many $i$, almost surely, by (9) we therefore have $\limsup_{t\rightarrow\infty}\mathsf{Coverage}_t(q_t)\geq 1-\epsilon$, almost surely, as desired. Since $\epsilon>0$ is arbitrary, this completes the proof that $\limsup_{t\rightarrow\infty}\mathsf{Coverage}_t(q_t)=1$ almost surely.

Finally, a similar argument verifies that $\liminf_{t\rightarrow\infty}\mathsf{Coverage}_t(q_t)=0$ almost surely.

5.3 Proof of Proposition 1

Since $s_t\equiv s$, we have $\mathsf{Coverage}_t(q_t)=\mathsf{Coverage}(q_t)$, for each $t$. By Proposition 3, $\liminf_{t\rightarrow\infty}\mathsf{Coverage}(q_t)=0$ and $\limsup_{t\rightarrow\infty}\mathsf{Coverage}(q_t)=1$, almost surely. Since we have assumed that $\alpha$ is a rational number, by the definition of the procedure (4), all values $q_t$ must lie on a discrete grid (i.e., if $\alpha=k/K$ for some integers $k,K$ then, for all $t$, $q_t-q_1$ must be an integer multiple of $\eta/K$). Moreover, by Lemma 1, $q_t$ is uniformly bounded above and below for all $t$, so $q_t$ can only take finitely many values. This implies that $\mathsf{Coverage}(q_t)$ can also take only finitely many values, and in particular, this means that if $\liminf_{t\rightarrow\infty}\mathsf{Coverage}(q_t)=0$ (respectively, if $\limsup_{t\rightarrow\infty}\mathsf{Coverage}(q_t)=1$) then $\mathsf{Coverage}(q_t)=0$ (respectively, $\mathsf{Coverage}(q_t)=1$) for infinitely many $t$.

5.4 Proofs of Theorems 3 and 4

We observe that Theorem 3 is simply a special case of Theorem 4 (obtained by taking $s_t\equiv s$ for all $t$), so we only need to prove Theorem 4.

First, consider the sequence

\[Z_{t}=\sum_{r=1}^{t}\eta_{r}\left(\mathbbm{1}_{Y_{r}\in\mathcal{C}_{r}(X_{r})}-\mathsf{Coverage}_{r}(q_{r})\right).\]

Define the events $\mathcal{E}_Z$, the event that $\lim_{t\rightarrow\infty}Z_t$ exists, and $\mathcal{E}_s$, the event that $s_t\stackrel{\textnormal{d}}{\rightarrow}s$. In the Appendix, we will verify that

\[\lim_{t\rightarrow\infty}Z_{t}\ \text{ exists, almost surely}, \tag{10}\]

i.e., $\mathbb{P}\left(\mathcal{E}_Z\right)=1$, using martingale theory.

To establish the theorem, then, it suffices for us to verify that on the event $\mathcal{E}_Z\cap\mathcal{E}_s$, it holds that $q_t\rightarrow q^*$. From this point on, we assume that $\mathcal{E}_Z$ and $\mathcal{E}_s$ both hold.

Fix any $\epsilon>0$. Since $q\mapsto\mathsf{Coverage}(q)$ is monotone, it can have at most countably many discontinuities. Without loss of generality, then, we can assume that this map is continuous at $q=q^*-\epsilon/3$ and at $q=q^*+\epsilon/3$ (by taking a smaller value of $\epsilon$ if needed).

First, since $Z_t$ converges, we can find some finite time $T_1$ such that

\[\sup_{t^{\prime}\geq t\geq T_{1}}\left|\sum_{r=t}^{t^{\prime}}\eta_{r}\left(\mathbbm{1}_{Y_{r}\in\mathcal{C}_{r}(X_{r})}-\mathsf{Coverage}_{r}(q_{r})\right)\right|=\sup_{t^{\prime}\geq t\geq T_{1}}\left|Z_{t^{\prime}}-Z_{t-1}\right|\leq\frac{\epsilon}{3}. \tag{11}\]

Moreover, since $\sum_t\eta_t^2<\infty$, we have $\eta_t\rightarrow 0$, and so we can find some finite time $T_2$ such that $\eta_t\leq\frac{\epsilon}{3}$ for all $t\geq T_2$. Furthermore, on $\mathcal{E}_s$, we have $\mathsf{Coverage}_t(q)\rightarrow\mathsf{Coverage}(q)$ at each $q=q^*\pm\epsilon/3$. Thus we can find some finite time $T_3$ and some $\delta>0$ such that

\[\mathsf{Coverage}_{t}(q^{*}-\epsilon/3)\leq 1-\alpha-\delta \tag{12}\]

for all $t\geq T_3$ (we are using the fact that $\mathsf{Coverage}(q^*-\epsilon/3)<1-\alpha$ by (6)). Similarly, we can find a finite $T_4$ and some $\delta^{\prime}>0$ such that $\mathsf{Coverage}_t(q^*+\epsilon/3)\geq 1-\alpha+\delta^{\prime}$ for all $t\geq T_4$. Let $T=\max\{T_1,T_2,T_3,T_4\}$.

We will now split into cases. If it does not hold that $q_t\in q^*\pm\epsilon$ for all sufficiently large $t$, then one of the following cases must hold:

  • Case 1a: $q_t<q^*-\epsilon/3$ for all $t\geq T$.

  • Case 1b: $q_t>q^*+\epsilon/3$ for all $t\geq T$.

  • Case 2a: for some $t^{\prime}\geq t\geq T$, it holds that $q_t\geq q^*-\epsilon/3$ and $q_{t^{\prime}}<q^*-\epsilon$.

  • Case 2b: for some $t^{\prime}\geq t\geq T$, it holds that $q_t\leq q^*+\epsilon/3$ and $q_{t^{\prime}}>q^*+\epsilon$.

We now verify that each case is impossible.

Case 1a is impossible.

We have

\begin{align*}
q^{*}-\frac{\epsilon}{3}-q_{T}&\geq\sup_{t>T}q_{t}-q_{T}\\
&=\sup_{t>T}\sum_{r=T}^{t-1}\eta_{r}\left(\mathbbm{1}_{Y_{r}\not\in\mathcal{C}_{r}(X_{r})}-\alpha\right)\ \text{ by (4)}\\
&\geq\sup_{t>T}\sum_{r=T}^{t-1}\eta_{r}\left((1-\alpha)-\mathsf{Coverage}_{r}(q_{r})\right)-\frac{\epsilon}{3}\ \text{ by (11)}\\
&\geq\sup_{t>T}\left\{\left[\sum_{r=T}^{t-1}\eta_{r}\cdot\delta\right]-\frac{\epsilon}{3}\right\},
\end{align*}

where the last step holds since $q_r<q^*-\epsilon/3$ for $r\geq T$, and $q\mapsto\mathsf{Coverage}_r(q)$ is nondecreasing, and so we have

\[\mathsf{Coverage}_{r}(q_{r})\leq\mathsf{Coverage}_{r}(q^{*}-\epsilon/3)\leq 1-\alpha-\delta, \tag{13}\]

by (12). Since $\sum_r\eta_r=\infty$, we therefore have that $q^*-\frac{\epsilon}{3}-q_T\geq\infty$, which is a contradiction.

Case 1b is impossible.

This proof is analogous to the proof for Case 1a.

Case 2a is impossible.

First, by assumption for this case, we can find a unique time $t^{\prime\prime}\geq T$ such that

\[\begin{cases}q_{t^{\prime\prime}}\geq q^{*}-\epsilon/3,\\ q_{r}<q^{*}-\epsilon/3\ \text{ for all } t^{\prime\prime}<r<t^{\prime},\\ q_{t^{\prime}}<q^{*}-\epsilon.\end{cases}\]

In other words, $t^{\prime\prime}$ is the last time before time $t^{\prime}$ when the threshold is $\geq q^*-\epsilon/3$. Then we have

\begin{align*}
-\frac{2\epsilon}{3}>q_{t^{\prime}}-q_{t^{\prime\prime}}&=\sum_{r=t^{\prime\prime}}^{t^{\prime}-1}\eta_{r}\left(\mathbbm{1}_{Y_{r}\not\in\mathcal{C}_{r}(X_{r})}-\alpha\right)\ \text{ by (4)}\\
&\geq\left[\sum_{r=t^{\prime\prime}}^{t^{\prime}-1}\eta_{r}\left((1-\alpha)-\mathsf{Coverage}_{r}(q_{r})\right)\right]-\frac{\epsilon}{3}\ \text{ by (11)}\\
&\geq-\eta_{t^{\prime\prime}}+\left[\sum_{r=t^{\prime\prime}+1}^{t^{\prime}-1}\eta_{r}\left((1-\alpha)-\mathsf{Coverage}_{r}(q_{r})\right)\right]-\frac{\epsilon}{3}\\
&\geq-\eta_{t^{\prime\prime}}-\frac{\epsilon}{3}\ \text{ by (13)}.
\end{align*}

But since $\eta_{t^{\prime\prime}}\leq\epsilon/3$ (because $t^{\prime\prime}\geq T$), we have therefore reached a contradiction.

Case 2b is impossible.

This proof is analogous to the proof for Case 2a.

We have verified that all four cases are impossible. Therefore, $q_t\in q^*\pm\epsilon$ for all sufficiently large $t$. Since $\epsilon>0$ is arbitrarily small, this completes the proof.

6 Discussion

Our paper analyzes online conformal prediction with a decaying step size, enabling simultaneous guarantees of convergence for I.I.D. sequences and long-run coverage for adversarial ones. Moreover, it helps further unify online conformal prediction with online learning and online convex optimization, since decaying step sizes are known to have desirable properties in the latter settings and are hence standard there. Of course, the usefulness of the method will rely on choosing score functions that are well suited to the (possibly time-varying) data distribution, and choosing step sizes that decay at an appropriate rate and perhaps adapt to the level of distribution shift—building a better understanding of how to make these choices in practice is crucial for achieving informative and stable prediction intervals. Many additional open questions about extending the methodology to broader settings and understanding connections to other tools remain. In particular, we expect fruitful avenues of future inquiry would be: (1) to extend this analysis to online risk control, as in Feldman et al. (2021); (2) to adapt our analysis of Theorem 3 to deal with stationary or slowly moving time series which may not be I.I.D. but are slowly varying enough to permit estimation; and (3) to further understand the connection between this family of techniques and the theory of online learning.

Acknowledgements

The authors thank Isaac Gibbs, Ryan Tibshirani, and Margaux Zaffran for helpful feedback on this work. R.F.B. was partially supported by the National Science Foundation via grant DMS-2023109, and by the Office of Naval Research via grant N00014-20-1-2337.

References

  • Angelopoulos and Bates [2023] Anastasios N Angelopoulos and Stephen Bates. Conformal prediction: A gentle introduction. Foundations and Trends® in Machine Learning, 16(4):494–591, 2023.
  • Angelopoulos et al. [2023] Anastasios N. Angelopoulos, Emmanuel Candes, and Ryan Tibshirani. Conformal PID control for time series prediction. In Neural Information Processing Systems, 2023.
  • Barber et al. [2022] Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Conformal prediction beyond exchangeability. arXiv:2202.13415, 2022.
  • Bastani et al. [2022] Osbert Bastani, Varun Gupta, Christopher Jung, Georgy Noarov, Ramya Ramalingam, and Aaron Roth. Practical adversarial multivalid conformal prediction. Advances in Neural Information Processing Systems, 35:29362–29373, 2022.
  • Bhatnagar et al. [2023] Aadyot Bhatnagar, Huan Wang, Caiming Xiong, and Yu Bai. Improved online conformal prediction via strongly adaptive online learning. arXiv preprint arXiv:2302.07869, 2023.
  • Biau and Patra [2011] Gérard Biau and Benoît Patra. Sequential quantile prediction of time series. IEEE Transactions on Information Theory, 57(3):1664–1674, 2011.
  • Bubeck and Slivkins [2012] Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, pages 42–1. JMLR Workshop and Conference Proceedings, 2012.
  • Cesa-Bianchi and Lugosi [2006] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
  • Chen et al. [2023] Sijia Chen, Wei-Wei Tu, Peng Zhao, and Lijun Zhang. Optimistic online mirror descent for bridging stochastic and adversarial online convex optimization. In International Conference on Machine Learning, pages 5002–5035. PMLR, 2023.
  • Chernozhukov et al. [2018] Victor Chernozhukov, Kaspar Wüthrich, and Zhu Yinchu. Exact and robust conformal inference methods for predictive machine learning with dependent data. In Conference On Learning Theory, pages 732–749. PMLR, 2018.
  • Cramer et al. [2022] Estee Y Cramer, Evan L Ray, Velma K Lopez, Johannes Bracher, Andrea Brennen, Alvaro J Castro Rivadeneira, Aaron Gerding, Tilmann Gneiting, Katie H House, Yuxin Huang, et al. Evaluation of individual and ensemble probabilistic forecasts of covid-19 mortality in the united states. Proceedings of the National Academy of Sciences, 119(15):e2113561119, 2022.
  • Daniely et al. [2015] Amit Daniely, Alon Gonen, and Shai Shalev-Shwartz. Strongly adaptive online learning. In International Conference on Machine Learning, pages 1405–1411. PMLR, 2015.
  • Dann et al. [2023] Christoph Dann, Chen-Yu Wei, and Julian Zimmert. Best of both worlds policy optimization. In International Conference on Machine Learning, pages 6968–7008. PMLR, 2023.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255, 2009.
  • Feldman et al. [2021] Shai Feldman, Stephen Bates, and Yaniv Romano. Improving conditional coverage via orthogonal quantile regression. In Advances in Neural Information Processing Systems, 2021.
  • Feldman et al. [2023] Shai Feldman, Liran Ringel, Stephen Bates, and Yaniv Romano. Achieving risk control in online learning settings. Transactions on Machine Learning Research, 2023.
  • Foreman-Mackey et al. [2017] Daniel Foreman-Mackey, Eric Agol, Sivaram Ambikasaran, and Ruth Angus. Fast and scalable Gaussian process modeling with applications to astronomical time series. The Astronomical Journal, 154(6):220, 2017.
  • Foster and Vohra [1998] Dean P Foster and Rakesh V Vohra. Asymptotic calibration. Biometrika, 85(2):379–390, 1998.
  • Gibbs and Candes [2021] Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 1660–1672. Curran Associates, Inc., 2021.
  • Gibbs and Candès [2022] Isaac Gibbs and Emmanuel Candès. Conformal inference for online prediction with arbitrary distribution shifts. arXiv preprint arXiv:2208.08401, 2022.
  • Gradu et al. [2023] Paula Gradu, Elad Hazan, and Edgar Minasyan. Adaptive regret for control of time-varying dynamics. In Learning for Dynamics and Control Conference, pages 560–572. PMLR, 2023.
  • Harries [1999] Michael Harries. Splice-2 comparative evaluation: Electricity pricing. Technical report, The University of New South Wales, 1999.
  • Jin et al. [2021] Tiancheng Jin, Longbo Huang, and Haipeng Luo. The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition. Advances in Neural Information Processing Systems, 34:20491–20502, 2021.
  • Koenker and Bassett Jr [1978] Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica: Journal of the Econometric Society, 46(1):33–50, 1978.
  • Koolen et al. [2016] Wouter M Koolen, Peter Grünwald, and Tim Van Erven. Combining adversarial guarantees and stochastic fast rates in online learning. Advances in Neural Information Processing Systems, 29, 2016.
  • Lin et al. [2022] Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Conformal prediction with temporal quantile adjustments. Advances in Neural Information Processing Systems, 35:31017–31030, 2022.
  • Lindemann et al. [2023] Lars Lindemann, Matthew Cleaveland, Gihyun Shim, and George J Pappas. Safe planning in dynamic environments using conformal prediction. IEEE Robotics and Automation Letters, 2023.
  • Mykland [2003] Per Aslak Mykland. Financial options and statistical prediction intervals. The Annals of Statistics, 31(5):1413–1438, 2003.
  • Noarov et al. [2023] Georgy Noarov, Ramya Ramalingam, Aaron Roth, and Stephan Xie. High-dimensional unbiased prediction for sequential decision making. In OPT 2023: Optimization for Machine Learning, 2023.
  • Robinson [1978] JA Robinson. Sequential choice of an optimal dose: A prediction intervals approach. Biometrika, 65(1):75–78, 1978.
  • Tian et al. [2022] Qinglong Tian, Daniel J Nordman, and William Q Meeker. Methods to compute prediction intervals: A review and new results. Statistical Science, 37(4):580–597, 2022.
  • Vovk [2002] Vladimir Vovk. On-line confidence machines are well-calibrated. In The 43rd Annual IEEE Symposium on Foundations of Computer Science, pages 187–196. IEEE, 2002.
  • Vovk et al. [2005] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005. doi: 10.1007/b106715.
  • Xu and Xie [2021] Chen Xu and Yao Xie. Conformal prediction interval for dynamic time-series. In International Conference on Machine Learning, pages 11559–11569. PMLR, 2021.
  • Xu and Xie [2023] Chen Xu and Yao Xie. Sequential predictive conformal inference for time series. In International Conference on Machine Learning, pages 38707–38727. PMLR, 2023.
  • Zaffran et al. [2022] Margaux Zaffran, Olivier Féron, Yannig Goude, Julie Josse, and Aymeric Dieuleveut. Adaptive conformal predictions for time series. In International Conference on Machine Learning, pages 25834–25866. PMLR, 2022.
  • Zimmert and Seldin [2021] Julian Zimmert and Yevgeny Seldin. Tsallis-inf: An optimal algorithm for stochastic and adversarial bandits. Journal of Machine Learning Research, 22(28):1–49, 2021.
  • Zinkevich [2003] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (icml-03), pages 928–936, 2003.

Appendix A Additional proofs

A.1 Proof of Lemma 1

The proof of this result is similar to the proof of Lemma 4.1 of Gibbs and Candes [2021]. We prove this by induction. First, q1[0,B]q_{1}\in[0,B] by assumption, so (8) is satisfied at time t=1t=1. Next fix any t1t\geq 1 and assume qtq_{t} lies in the range specified in (8), and consider qt+1q_{t+1}. We now split into cases:

  • If qt[0,B]q_{t}\in[0,B], then we have

    qt+1=qt+ηt(𝟙Yt𝒞t(Xt)α)[qtηtα,qt+ηt(1α)][αMt,B+(1α)Mt].q_{t+1}=q_{t}+\eta_{t}\left({\mathbbm{1}}_{{Y_{t}\not\in\mathcal{C}_{t}(X_{t})}}-\alpha\right)\\ \in[q_{t}-\eta_{t}\alpha,q_{t}+\eta_{t}(1-\alpha)]\\ \subseteq[-\alpha M_{t},B+(1-\alpha)M_{t}].
  • If qt(B,B+(1α)Mt1]q_{t}\in(B,B+(1-\alpha)M_{t-1}], then we must have 𝒞t(Xt)=𝒴\mathcal{C}_{t}(X_{t})=\mathcal{Y}. Then 𝟙Yt𝒞t(Xt)=0{\mathbbm{1}}_{{Y_{t}\not\in\mathcal{C}_{t}(X_{t})}}=0, and so

    q_{t+1}=q_{t}-\eta_{t}\alpha\in[B-\eta_{t}\alpha,\;B+(1-\alpha)M_{t-1}]\subseteq[-\alpha M_{t},\;B+(1-\alpha)M_{t}].
  • If $q_{t}\in[-\alpha M_{t-1},0)$, then we must have $\mathcal{C}_{t}(X_{t})=\emptyset$. Then $\mathbbm{1}_{Y_{t}\not\in\mathcal{C}_{t}(X_{t})}=1$, and so

    q_{t+1}=q_{t}+\eta_{t}(1-\alpha)\in[-\alpha M_{t-1},\;\eta_{t}(1-\alpha)]\subseteq[-\alpha M_{t},\;B+(1-\alpha)M_{t}].

In all cases, then, (8) holds with $t+1$ in place of $t$, which completes the proof.
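To make the recursion concrete, the following Python sketch simulates the update $q_{t+1}=q_{t}+\eta_{t}(\mathbbm{1}_{Y_{t}\not\in\mathcal{C}_{t}(X_{t})}-\alpha)$ with the decaying step size $\eta_{t}=ct^{-1/2-\epsilon}$. This is our own illustration (the function and variable names are not from the paper's codebase), and it numerically exhibits the boundedness guaranteed by Lemma 1.

```python
import numpy as np

def online_quantile_update(scores, alpha=0.1, B=1.0, c=1.0, eps=0.1):
    """Online conformal quantile tracking with decaying step sizes.

    A minimal sketch: scores[t-1] plays the role of s_t(X_t, Y_t), and the
    label is covered at time t exactly when scores[t-1] <= q_t.
    """
    q = np.empty(len(scores) + 1)
    q[0] = B / 2  # any q_1 in [0, B] satisfies the base case of Lemma 1
    for t, s in enumerate(scores, start=1):
        eta = c * t ** (-0.5 - eps)        # eta_t = c * t^{-1/2 - eps}
        miscover = float(s > q[t - 1])     # indicator of Y_t not in C_t(X_t)
        q[t] = q[t - 1] + eta * (miscover - alpha)
    return q

# Example: i.i.d. scores supported on [0, B]; the iterates q_t stay inside
# the bounded range established by Lemma 1.
rng = np.random.default_rng(0)
q = online_quantile_update(rng.uniform(0.0, 1.0, size=5000))
print(q.min(), q.max())
```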

A.2 Proof of (10)

We need to prove that $Z_{t}$ converges almost surely (note that the limit of $Z_{t}$ may be a random variable). For each $t\geq 1$, we have

\mathbb{P}\left(Y_{t}\in\mathcal{C}_{t}(X_{t})\,\middle|\,\{(X_{r},Y_{r})\}_{r<t}\right)=\mathbb{P}\left(s_{t}(X_{t},Y_{t})\leq q_{t}\,\middle|\,\{(X_{r},Y_{r})\}_{r<t}\right)=\mathsf{Coverage}_{t}(q_{t}),

since $q_{t}$ and $s_{t}$ are functions of $\{(X_{r},Y_{r})\}_{r<t}$ and are therefore independent of $(X_{t},Y_{t})\sim P$. This proves that $Z_{t}$ is a martingale with respect to the filtration generated by the sequence of data points. We also have $\sup_{t\geq 1}\textnormal{Var}(Z_{t})<\infty$, since we have assumed $\sum_{t=1}^{\infty}\eta_{t}^{2}<\infty$; thus $Z_{t}$ is bounded in $L^{2}$, hence a uniformly integrable martingale, and it therefore converges almost surely (to some random variable) by Doob's martingale convergence theorem.
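As a quick numerical illustration of the square-summability condition used here, one can check the partial sums of $\eta_{t}^{2}$ directly for the decaying step size $\eta_{t}=ct^{-1/2-\epsilon}$ (a sketch with illustrative constants $c=1$ and $\epsilon=0.1$, which are our own choices):

```python
import numpy as np

# eta_t = c * t^{-1/2 - eps} gives eta_t^2 = c^2 * t^{-1 - 2*eps}, which is
# summable for any eps > 0; this is what bounds sup_t Var(Z_t) above.
c, eps = 1.0, 0.1
t = np.arange(1, 10**6 + 1)
partial_sums = np.cumsum((c * t ** (-0.5 - eps)) ** 2)
# The increments shrink like t^{-1.2}, so the partial sums approach a finite limit:
print(partial_sums[[99, 9999, 999999]])
```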

A.3 Proofs of Corollaries 1 and 2

As with the theorems, it suffices to prove Corollary 2, since Corollary 1 is simply a special case.

Using the notation defined in the proof of Theorem 4, suppose that the events $\mathcal{E}_{Z}$ and $\mathcal{E}_{s}$ both hold. We need to show that $\mathsf{Coverage}_{t}(q_{t})\rightarrow 1-\alpha$ holds as well. Fix any $\epsilon>0$. Since $s(X,Y)$ has a continuous distribution, the map $q\mapsto\mathsf{Coverage}(q)$ is continuous, and so we can find some $\delta>0$ such that

\left|\mathsf{Coverage}(q)-\mathsf{Coverage}(q^{*})\right|\leq\epsilon/2\quad\textnormal{for all }q\in q^{*}\pm\delta.

Moreover, $\mathsf{Coverage}(q^{*})=1-\alpha$, since the distribution of $s(X,Y)$ is continuous and $q^{*}$ is its $(1-\alpha)$-quantile, so we have

\left|\mathsf{Coverage}(q)-(1-\alpha)\right|\leq\epsilon/2\quad\textnormal{for all }q\in q^{*}\pm\delta.

Next, by Theorem 4, for all sufficiently large $t$, we have

|q_{t}-q^{*}|\leq\delta.

By definition of the event $\mathcal{E}_{s}$, for all sufficiently large $t$ we have

\left|\mathsf{Coverage}_{t}(q^{*}-\delta)-\mathsf{Coverage}(q^{*}-\delta)\right|\leq\epsilon/2

and

\left|\mathsf{Coverage}_{t}(q^{*}+\delta)-\mathsf{Coverage}(q^{*}+\delta)\right|\leq\epsilon/2.

Combining these bounds, for all sufficiently large $t$ we have

\mathsf{Coverage}_{t}(q_{t})\geq\mathsf{Coverage}_{t}(q^{*}-\delta)\geq\mathsf{Coverage}(q^{*}-\delta)-\epsilon/2\geq(1-\alpha-\epsilon/2)-\epsilon/2=1-\alpha-\epsilon,

where the first step holds since $q_{t}\geq q^{*}-\delta$ and $q\mapsto\mathsf{Coverage}_{t}(q)$ is nondecreasing. Similarly, for all sufficiently large $t$ it holds that

\mathsf{Coverage}_{t}(q_{t})\leq 1-\alpha+\epsilon.

Since $\epsilon>0$ is arbitrary, this completes the proof.

A.4 Proof of Proposition 2

By Lemma 1, $q_{t}\in[-\alpha c,\,B+(1-\alpha)c]$ for all $t$ (since $\eta_{t}\leq c$ for all $t$). Since $q^{*}\in[0,B]$, we therefore have $|q_{t}-q^{*}|\leq B+c$ almost surely for all $t$. We also have $2\gamma\delta\leq 1$ and $\delta\leq B\leq B+c$, since the density of $s(X,Y)$ is supported on $[0,B]$ and must integrate to $1$: an interval of length $2\delta$ on which the density is at least $\gamma$ must fit within the support and carries probability mass at least $2\gamma\delta$.

Next, by the assumptions of the proposition, for any $q_{t}\geq q^{*}$: if $q_{t}\leq q^{*}+\delta$ then

\mathsf{Coverage}(q_{t})\geq 1-\alpha+\gamma(q_{t}-q^{*}),

while if $q_{t}>q^{*}+\delta$ then

\mathsf{Coverage}(q_{t})\geq\mathsf{Coverage}(q^{*}+\delta)\geq 1-\alpha+\gamma\delta.

Either way, then, since $\delta\leq B+c$ and $q_{t}-q^{*}\leq B+c$, if $q_{t}\geq q^{*}$ then

\mathsf{Coverage}(q_{t})\geq(1-\alpha)+(q_{t}-q^{*})\cdot\frac{\gamma\delta}{B+c}.

A similar calculation shows that if $q_{t}\leq q^{*}$, then

\mathsf{Coverage}(q_{t})\leq(1-\alpha)-(q^{*}-q_{t})\cdot\frac{\gamma\delta}{B+c}.

Defining $a=\frac{\gamma\delta}{B+c}>0$, we therefore have

\frac{\mathsf{Coverage}(q_{t})-(1-\alpha)}{q_{t}-q^{*}}\geq a \qquad (14)

whenever $q_{t}\neq q^{*}$. Note that we must have $a\leq 1/c$ by construction, since $\gamma\delta\leq 1/2$ and $B+c\geq c$.

Next, from the update step (4), we have

q_{t+1}-q^{*}=(q_{t}-q^{*})+\eta_{t}\big((1-\alpha)-\mathbbm{1}_{Y_{t}\in\mathcal{C}_{t}(X_{t})}\big).

Since $\mathbb{P}\left(Y_{t}\in\mathcal{C}_{t}(X_{t})\,\middle|\,\{(X_{r},Y_{r})\}_{r<t}\right)=\mathsf{Coverage}(q_{t})$, we then calculate

\mathbb{E}\left[q_{t+1}-q^{*}\,\middle|\,\{(X_{r},Y_{r})\}_{r<t}\right]=(q_{t}-q^{*})+\eta_{t}\big((1-\alpha)-\mathsf{Coverage}(q_{t})\big),

and

\textnormal{Var}\big(q_{t+1}-q^{*}\mid\{(X_{r},Y_{r})\}_{r<t}\big)=\eta_{t}^{2}\cdot\mathsf{Coverage}(q_{t})\cdot(1-\mathsf{Coverage}(q_{t}))\leq\eta_{t}^{2}/4.

Therefore,

\mathbb{E}\left[(q_{t+1}-q^{*})^{2}\,\middle|\,\{(X_{r},Y_{r})\}_{r<t}\right]\leq\left((q_{t}-q^{*})+\eta_{t}\big((1-\alpha)-\mathsf{Coverage}(q_{t})\big)\right)^{2}+\eta_{t}^{2}/4\leq(q_{t}-q^{*})^{2}\cdot(1-a\eta_{t})^{2}+\eta_{t}^{2}/4,

where the last step holds by (14) above. After marginalizing, then,

\mathbb{E}\left[(q_{t+1}-q^{*})^{2}\right]\leq\mathbb{E}\left[(q_{t}-q^{*})^{2}\right]\cdot(1-a\eta_{t})^{2}+\eta_{t}^{2}/4.

Next recall $\eta_{t}=ct^{-1/2-\epsilon}$ for each $t$. Fix some $T\geq 1$ that satisfies $T^{1/2-\epsilon}\geq\frac{1/2+\epsilon}{ac}$. First, since $|q_{t}-q^{*}|\leq B+c$ for all $t$ as above, by choosing $b\geq(B+c)^{2}T^{1/2+\epsilon}$ we must have $(q_{t}-q^{*})^{2}\leq bt^{-1/2-\epsilon}$ for all $t\leq T$, almost surely. Next, for each $t\geq T$, we proceed by induction. Assume $\mathbb{E}\left[(q_{t}-q^{*})^{2}\right]\leq bt^{-1/2-\epsilon}$. Then

\mathbb{E}\left[(q_{t+1}-q^{*})^{2}\right]\leq\mathbb{E}\left[(q_{t}-q^{*})^{2}\right]\cdot(1-a\eta_{t})^{2}+\eta_{t}^{2}/4
=\mathbb{E}\left[(q_{t}-q^{*})^{2}\right](1-2a\eta_{t})+\left(\mathbb{E}\left[(q_{t}-q^{*})^{2}\right]\cdot a^{2}+1/4\right)\eta_{t}^{2}
\leq bt^{-1/2-\epsilon}\left(1-2act^{-1/2-\epsilon}\right)+((B+c)^{2}a^{2}+1/4)c^{2}t^{-1-2\epsilon}
=bt^{-1/2-\epsilon}-2abct^{-1-2\epsilon}+((B+c)^{2}a^{2}+1/4)c^{2}t^{-1-2\epsilon}
\leq b\left(t^{-1/2-\epsilon}-act^{-1-2\epsilon}\right),

where the last step holds as long as we choose $b\geq\frac{((B+c)^{2}a^{2}+1/4)c}{a}$. And, since $t\geq T$, we have

act^{-1-2\epsilon}=act^{1/2-\epsilon}\cdot t^{-3/2-\epsilon}\geq acT^{1/2-\epsilon}\cdot t^{-3/2-\epsilon}\geq(1/2+\epsilon)t^{-3/2-\epsilon}\geq t^{-1/2-\epsilon}-(t+1)^{-1/2-\epsilon},

where the last step holds since $t\mapsto t^{-1/2-\epsilon}$ is convex, with derivative $-(1/2+\epsilon)t^{-3/2-\epsilon}$. Therefore, we have verified that $\mathbb{E}\left[(q_{t+1}-q^{*})^{2}\right]\leq b(t+1)^{-1/2-\epsilon}$, as desired.
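As a sanity check on this induction, one can iterate the recursion $\mathbb{E}[(q_{t+1}-q^{*})^{2}]\leq\mathbb{E}[(q_{t}-q^{*})^{2}](1-a\eta_{t})^{2}+\eta_{t}^{2}/4$ numerically from the worst-case starting value and compare it against the envelope $bt^{-1/2-\epsilon}$. The sketch below uses illustrative constants; the values of $B$, $c$, $\epsilon$, and $a$ are our own choices, made to satisfy the conditions in the proof.

```python
import numpy as np

B, c, eps = 1.0, 0.5, 0.1
a = 0.5                                   # any a in (0, 1/c] works here
# T with T^{1/2 - eps} >= (1/2 + eps) / (a c), and b as chosen in the proof:
T = int(np.ceil(((0.5 + eps) / (a * c)) ** (1.0 / (0.5 - eps))))
b = max((B + c) ** 2 * T ** (0.5 + eps),
        ((B + c) ** 2 * a ** 2 + 0.25) * c / a)

d = (B + c) ** 2                          # worst-case value of E[(q_1 - q*)^2]
for t in range(1, 10**5 + 1):
    eta = c * t ** (-0.5 - eps)
    d = d * (1 - a * eta) ** 2 + eta ** 2 / 4
    if t >= T:                            # the induction applies once t >= T
        assert d <= b * (t + 1) ** (-0.5 - eps)
print(d, b * (10**5 + 1) ** (-0.5 - eps))  # recursion sits below the envelope
```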

Appendix B Additional experiments

Figure 3: Density plots of results on M4 datasets. These plots show the same quantities as in Table 1, but now as histograms over the time series in M4.

We compare against two additional methods. The first, “decay+adapt”, is a variant of our procedure that decays the learning rate until it detects a change point, then resets it. Change points are identified when at least $N_{\rm miscoverage}$ consecutive miscoverage events or $N_{\rm coverage}$ consecutive coverage events are observed (we set these constants to 10 and 30 by default, respectively). When a change point is identified, the learning rate is reset to $\frac{\hat{B}}{(t-T_{\rm changepoint})^{1/2+\epsilon}}$, where $T_{\rm changepoint}$ is the time at which the change point is detected and $\epsilon\in(0,1/2)$. In these experiments, as in the main text, we set $\epsilon=0.1$.
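A minimal sketch of this ‘decay+adapt’ variant follows. This is our own illustrative implementation, not the repository code; in particular, the offset of one time step inside the reset denominator (to avoid division by zero at the reset itself) is an implementation choice that the description above leaves open.

```python
import numpy as np

def decay_adapt(scores, alpha=0.1, B_hat=1.0, eps=0.1,
                n_miscoverage=10, n_coverage=30):
    """'decay+adapt': decaying step size with change-point resets (sketch)."""
    q, t_change = 0.0, 0
    run_miss = run_cover = 0
    history = []
    for t, s in enumerate(scores, start=1):
        miscover = s > q
        # Track runs of consecutive miscoverage / coverage events:
        run_miss = run_miss + 1 if miscover else 0
        run_cover = 0 if miscover else run_cover + 1
        if run_miss >= n_miscoverage or run_cover >= n_coverage:
            t_change = t                     # change point detected
            run_miss = run_cover = 0
        eta = B_hat * (t - t_change + 1) ** (-0.5 - eps)  # reset decay clock
        q = q + eta * (float(miscover) - alpha)
        history.append(q)
    return np.array(history)
```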

We additionally compare against DTACI [Gibbs and Candès, 2022], an adaptive-learning-rate variant of ACI that uses multiplicative weights to perform its updates (see Gibbs and Candès [2022] for further details).

We compare these methods on over 3000 time series subsampled from the M4 dataset, a diverse collection of time series with varying numbers of samples and distribution shifts. Code is available in our GitHub repository to run on all 100,000 time series in M4; here, we show results on the first 3000.

Finally, to showcase the conceptual differences between the standard decaying learning-rate sequence and the ‘decay+adapt’ method, we display a simulated score sequence in Figure 4. Here, the scores are drawn from $\mathcal{N}(\mu_{t},1)$, where $\mu_{t}=0$ for the first thousand time steps, $\mu_{t}=2$ for the second thousand, $\mu_{t}=4$ for the third thousand, and $\mu_{t}=6$ for the final thousand. Especially towards the end of the sequence, ‘decay+adapt’ adjusts to the change points more quickly.
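This simulation can be reproduced along the following lines, reusing the sketches above; the random seed and initializations are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# mu_t = 0, 2, 4, 6 over four consecutive blocks of 1000 time steps:
mu = np.repeat([0.0, 2.0, 4.0, 6.0], 1000)
scores = rng.normal(loc=mu, scale=1.0)       # s_t ~ N(mu_t, 1)

# Run the plain decaying update and the decay+adapt sketch on the same scores:
q_decay = online_quantile_update(scores, alpha=0.1, B=scores.max(), c=1.0, eps=0.1)
q_adapt = decay_adapt(scores, alpha=0.1, B_hat=1.0, eps=0.1)
# After each mean shift, q_adapt re-tracks the new (1 - alpha)-quantile faster,
# mirroring the behavior shown in Figure 4.
```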

method coverage variance MSE infinite sets
DTACI 0.895491 4.107740 5.439935 0.036584
decay+adapt 0.885174 1.144337 1.967552 0.005011
decaying 0.900495 1.320579 2.366297 0.006883
fixed 0.901636 1.580243 2.922989 0.011179
Table 1: Table of results on M4 datasets. The table shows average results over all time series in the dataset; thus, all columns should be interpreted as averages over the time series in M4. The coverage column displays the long-run coverage. The variance column shows the variance of the quantile normalized by the variance of the score sequence. The MSE column shows the squared error of the quantile normalized by the variance of the score sequence. Finally, the infinite sets column shows the fraction of time steps in the sequence for which the output is an infinite-width prediction set.
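For concreteness, the columns of Table 1 can be computed from a score sequence and the corresponding quantile sequence roughly as follows. This is our own rendering of the caption's definitions, not the repository code; in particular, taking the squared error relative to the empirical $(1-\alpha)$-quantile of the scores, and flagging an infinite set whenever $q_{t}$ exceeds the maximum attainable score, are both assumptions.

```python
import numpy as np

def table_metrics(scores, q, alpha=0.1, B=None):
    """Sketch of Table 1's per-sequence metrics (our reading of the caption)."""
    scores = np.asarray(scores)
    q = np.asarray(q)[: len(scores)]           # quantile used at each time step
    B = scores.max() if B is None else B
    coverage = np.mean(scores <= q)             # long-run coverage
    variance = np.var(q) / np.var(scores)       # quantile variance, normalized
    target = np.quantile(scores, 1 - alpha)     # assumed reference quantile
    mse = np.mean((q - target) ** 2) / np.var(scores)
    infinite = np.mean(q > B)                   # C_t(X_t) is all of Y when q_t > B
    return dict(coverage=coverage, variance=variance, MSE=mse,
                infinite_sets=infinite)
```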
Figure 4: Simulation comparison of decaying step size and ‘decay+adapt’. The raw score sequence is shown in blue, the decaying step size sequence is in orange, and ‘decay+adapt’ is in green.