
An Elementary Predictor Obtaining $2\sqrt{T}+1$ Distance to Calibration

Eshwar Ram Arunachaleswaran, Department of Computer and Information Sciences, University of Pennsylvania
Natalie Collina, Department of Computer and Information Sciences, University of Pennsylvania
Aaron Roth, Department of Computer and Information Sciences, University of Pennsylvania
Mirah Shi, Department of Computer and Information Sciences, University of Pennsylvania
Abstract

Błasiok et al. (2023) proposed distance to calibration as a natural measure of calibration error that, unlike expected calibration error (ECE), is continuous. Recently, Qiao and Zheng (2024) (COLT 2024) gave a non-constructive argument establishing the existence of a randomized online predictor that can obtain $O(\sqrt{T})$ distance to calibration in expectation in the adversarial setting, which is known to be impossible for ECE. They leave as an open problem finding an explicit, efficient, deterministic algorithm. We resolve this problem and give an extremely simple, efficient, deterministic algorithm that obtains distance to calibration error at most $2\sqrt{T}+1$.

1 Introduction

Probabilistic predictions of binary outcomes are said to be calibrated if, informally, they are unbiased conditional on their own predictions. For predictors that are not perfectly calibrated, there are a variety of ways to measure calibration error. Perhaps the most popular measure is Expected Calibration Error (ECE), which measures the average bias of the predictions, weighted by the frequency of the predictions. ECE has a number of difficulties as a measure of calibration, not least of which is that it is discontinuous in the predictions. Motivated by this, Błasiok et al. (2023) propose a different measure: distance to calibration, which measures how far a predictor is in $\ell_1$ distance from the nearest perfectly calibrated predictor. In the online adversarial setting, it has been known since Foster and Vohra (1998) how to make predictions with ECE growing at a rate of $O(T^{2/3})$. Qiao and Valiant (2021) show that obtaining $O(\sqrt{T})$ rates for ECE is impossible. Recently, in a COLT 2024 paper, Qiao and Zheng (2024) showed that it is possible to make sequential predictions against an adversary guaranteeing expected distance to calibration growing at a rate of $O(\sqrt{T})$. Their algorithm is the solution to a minimax problem of size doubly-exponential in $T$. They leave as an open problem finding an explicit, efficient, deterministic algorithm for this problem. In this paper we resolve this problem by giving an extremely simple such algorithm with an elementary analysis.

Input: Sequence of outcomes $y^{1:T} \in \{0,1\}^T$
Output: Sequence of predictions $p^{1:T} \in \{0, \frac{1}{m}, \ldots, 1\}^T$ for some discretization parameter $m > 0$
for $t = 1$ to $T$ do
    Given look-ahead predictions $\tilde{p}^{1:t-1}$, define the look-ahead bias conditional on a prediction $p$ as:
        $\alpha_{\tilde{p}^{1:t-1}}(p) := \sum_{s=1}^{t-1} \mathbbm{1}[\tilde{p}^{s} = p](\tilde{p}^{s} - y^{s})$
    Choose two adjacent points $p_i = \frac{i}{m}$, $p_{i+1} = \frac{i+1}{m}$ satisfying:
        $\alpha_{\tilde{p}^{1:t-1}}(p_i) \leq 0$ and $\alpha_{\tilde{p}^{1:t-1}}(p_{i+1}) \geq 0$
    Arbitrarily predict $p^t = p_i$ or $p^t = p_{i+1}$;
    Upon observing the (adversarially chosen) outcome $y^t$, set look-ahead prediction
        $\tilde{p}^t = \operatorname{argmin}_{p \in \{p_i, p_{i+1}\}} |p - y^t|$
Algorithm 1: Almost-One-Step-Ahead
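To make the pseudocode concrete, here is a minimal Python sketch of Almost-One-Step-Ahead. The function name, the choice to return both the emitted predictions and the internal look-ahead predictions, and the default $m \approx \sqrt{T}$ are our own illustrative choices, not part of the paper.

```python
# A minimal, illustrative Python sketch of Algorithm 1 (Almost-One-Step-Ahead).
# Names and the default discretization m ~ sqrt(T) are our own choices.
import math


def almost_one_step_ahead(outcomes, m=None):
    """Return (emitted predictions p^1..p^T, internal look-ahead predictions)."""
    T = len(outcomes)
    if m is None:
        m = max(1, math.isqrt(T))  # discretization parameter

    # alpha[i] tracks the look-ahead bias of grid point i/m:
    # the sum of (p_tilde^s - y^s) over past rounds s with p_tilde^s = i/m.
    alpha = [0.0] * (m + 1)
    predictions, lookahead = [], []

    for y in outcomes:
        # Find adjacent grid points with alpha(p_i) <= 0 <= alpha(p_{i+1});
        # such a pair always exists since alpha(0) <= 0 and alpha(1) >= 0.
        i = 0
        while not (alpha[i] <= 0 and alpha[i + 1] >= 0):
            i += 1

        # Commit to one endpoint BEFORE seeing y; the choice is arbitrary
        # (here, the left endpoint) and does not affect the analysis.
        predictions.append(i / m)

        # After y is revealed, record the prediction One-Step-Ahead would
        # have made: whichever endpoint is closer to the outcome.
        j = i if abs(i / m - y) <= abs((i + 1) / m - y) else i + 1
        lookahead.append(j / m)
        alpha[j] += j / m - y

    return predictions, lookahead
```

The emitted prediction and the look-ahead prediction always differ by at most $1/m$, which is the fact the analysis in Section 3 exploits.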

2 Setting

We study a sequential binary prediction setting: at every round $t$, a forecaster makes a prediction $p^t \in [0,1]$, after which an adversary reveals an outcome $y^t \in \{0,1\}$. Given a sequence of predictions $p^{1:T}$ and outcomes $y^{1:T}$, we measure expected calibration error (ECE) as follows:

\[
\mathrm{ECE}(p^{1:T}, y^{1:T}) = \sum_{p \in [0,1]} \left| \sum_{t=1}^{T} \mathbbm{1}[p^{t} = p]\,(p^{t} - y^{t}) \right|
\]

Following Qiao and Zheng (2024), we define distance to calibration to be the minimum $\ell_1$ distance between a sequence of predictions produced by a forecaster and any perfectly calibrated sequence of predictions:

\[
\mathrm{CalDist}(p^{1:T}, y^{1:T}) = \min_{q^{1:T} \in \mathcal{C}(y^{1:T})} \|p^{1:T} - q^{1:T}\|_1
\]

where $\mathcal{C}(y^{1:T}) = \{q^{1:T} : \mathrm{ECE}(q^{1:T}, y^{1:T}) = 0\}$ is the set of predictions that are perfectly calibrated against outcomes $y^{1:T}$. First we observe that distance to calibration is upper bounded by ECE.
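Since the outer sum in the definition of ECE ranges only over the finitely many values actually predicted, ECE is easy to evaluate directly. The helper below is our own purely illustrative snippet that does so by grouping rounds by the exact value predicted:

```python
# Illustrative helper (not from the paper): evaluate the ECE of a finite
# prediction sequence against binary outcomes.
from collections import defaultdict


def ece(predictions, outcomes):
    """Sum over distinct predicted values p of |sum_t 1[p^t = p] (p^t - y^t)|."""
    bias = defaultdict(float)
    for p, y in zip(predictions, outcomes):
        bias[p] += p - y
    return sum(abs(b) for b in bias.values())


# Example: predicting 0.5 on outcomes 1, 0, 1 gives conditional bias
# (0.5 - 1) + (0.5 - 0) + (0.5 - 1) = -0.5, so ECE = 0.5.
assert abs(ece([0.5, 0.5, 0.5], [1, 0, 1]) - 0.5) < 1e-9
```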

Lemma 1 (Qiao and Zheng (2024)).

Fix a sequence of predictions $p^{1:T}$ and outcomes $y^{1:T}$. Then, $\mathrm{CalDist}(p^{1:T}, y^{1:T}) \leq \mathrm{ECE}(p^{1:T}, y^{1:T})$.

Proof.

For any prediction $p \in [0,1]$, define

\[
\overline{y}^{T}(p) = \frac{\sum_{t=1}^{T} \mathbbm{1}[p^{t} = p]\, y^{t}}{\sum_{t=1}^{T} \mathbbm{1}[p^{t} = p]}
\]

to be the average outcome conditioned on the prediction $p$. Consider the sequence $q^{1:T}$ where $q^{t} = \overline{y}^{T}(p^{t})$. Observe that $q^{1:T}$ is perfectly calibrated. Thus, we have that

\begin{align*}
\mathrm{CalDist}(p^{1:T}, y^{1:T}) &\leq \|p^{1:T} - q^{1:T}\|_1 \\
&= \sum_{t=1}^{T} |p^{t} - q^{t}| \\
&= \sum_{p \in [0,1]} \sum_{t=1}^{T} \mathbbm{1}[p^{t} = p]\, |p - \overline{y}^{T}(p)| \\
&= \sum_{p \in [0,1]} |p - \overline{y}^{T}(p)| \sum_{t=1}^{T} \mathbbm{1}[p^{t} = p] \\
&= \sum_{p \in [0,1]} \left| p \sum_{t=1}^{T} \mathbbm{1}[p^{t} = p] - \overline{y}^{T}(p) \sum_{t=1}^{T} \mathbbm{1}[p^{t} = p] \right| \\
&= \sum_{p \in [0,1]} \left| \sum_{t=1}^{T} \mathbbm{1}[p^{t} = p]\, (p - y^{t}) \right| \\
&= \mathrm{ECE}(p^{1:T}, y^{1:T})
\end{align*}
∎
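The construction in this proof is easy to check numerically. The sketch below (helper names are our own) builds $q^{t} = \overline{y}^{T}(p^{t})$ for a small example and confirms both that $q^{1:T}$ is perfectly calibrated and that $\|p^{1:T} - q^{1:T}\|_1$ matches the ECE of the original predictions:

```python
# Illustrative check of the Lemma 1 construction (names are our own):
# replace each prediction by the empirical outcome rate of its bin and
# verify the resulting sequence is calibrated with ||p - q||_1 = ECE(p).
from collections import defaultdict


def calibration_check(predictions, outcomes):
    count, total = defaultdict(int), defaultdict(float)
    for p, y in zip(predictions, outcomes):
        count[p] += 1
        total[p] += y
    avg = {p: total[p] / count[p] for p in count}  # average outcome per bin

    q = [avg[p] for p in predictions]
    ece_p = sum(abs(sum(p - y for p, y in zip(predictions, outcomes) if p == v))
                for v in count)
    ece_q = sum(abs(sum(qt - y for qt, y in zip(q, outcomes) if qt == v))
                for v in set(q))
    l1 = sum(abs(p - qt) for p, qt in zip(predictions, q))
    return ece_q, l1, ece_p


ece_q, l1, ece_p = calibration_check([0.5, 0.5, 0.2, 0.5], [1, 0, 0, 1])
assert abs(ece_q) < 1e-9 and abs(l1 - ece_p) < 1e-9  # q is calibrated; distances agree
```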

The upper bound is not tight, however. The best known sequential prediction algorithm obtains ECE bounded by $O(T^{2/3})$ (Foster and Vohra, 1998), and it is known that there is no algorithm guaranteeing ECE below $O(T^{0.54389})$ (Qiao and Valiant, 2021; Dagan et al., 2024). Qiao and Zheng (2024) give an algorithm that is the solution to a game of size doubly-exponential in $T$ and that obtains expected distance to calibration $O(\sqrt{T})$. Here we give an elementary analysis of a simple, efficient, deterministic algorithm (Algorithm 1) that obtains distance to calibration $2\sqrt{T}+1$.

Theorem 1.

Algorithm 1 (Almost-One-Step-Ahead) guarantees that against any sequence of outcomes, $\mathrm{CalDist}(p^{1:T}, y^{1:T}) \leq 2\sqrt{T}+1$.

3 Analysis of Algorithm 1

Before describing the algorithm, we introduce some notation. We will make predictions that belong to a grid. Let $B_m = \{0, 1/m, \ldots, 1\}$ denote a discretization of the prediction space with discretization parameter $m > 0$, and let $p_i = i/m$. For a sequence of predictions $\tilde{p}^{1}, \ldots, \tilde{p}^{t}$ and outcomes $y^{1}, \ldots, y^{t}$, we define the bias conditional on a prediction $p$ as:

\[
\alpha_{\tilde{p}^{1:t}}(p) = \sum_{s=1}^{t} \mathbbm{1}[\tilde{p}^{s} = p]\,(\tilde{p}^{s} - y^{s})
\]

To understand our algorithm, it will be helpful to first state and analyze a hypothetical “lookahead” algorithm that we call “One-Step-Ahead”, which is closely related to the algorithm and analysis given by Gupta and Ramdas (2022) in a different model. One-Step-Ahead produces predictions $\tilde{p}^{1}, \ldots, \tilde{p}^{T}$ as follows. At round $t$, before observing $y^{t}$, the algorithm fixes two predictions $p_i, p_{i+1}$ satisfying $\alpha_{\tilde{p}^{1:t-1}}(p_i) \leq 0$ and $\alpha_{\tilde{p}^{1:t-1}}(p_{i+1}) \geq 0$. Such a pair is guaranteed to exist, because by construction, it must be that for any history, $\alpha_{\tilde{p}^{1:t-1}}(0) \leq 0$ and $\alpha_{\tilde{p}^{1:t-1}}(1) \geq 0$. Note that a well-known randomized algorithm obtaining diminishing ECE (and smooth calibration error) uses the same observation to carefully randomize between two such adjacent predictions (Foster, 1999; Foster and Hart, 2018). Upon observing the outcome $y^{t}$, the algorithm outputs prediction $\tilde{p}^{t} = \operatorname{argmin}_{p \in \{p_i, p_{i+1}\}} |p - y^{t}|$. Naturally, we cannot implement this algorithm, as it chooses its prediction only after observing the outcome, but our analysis will rely on a key property this algorithm maintains: namely, that it always produces a sequence of predictions with ECE upper bounded by $m+1$, the number of elements in the discretized prediction space.
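Although One-Step-Ahead cannot be run in the online protocol, it is straightforward to simulate offline. The sketch below (function and variable names are our own illustrative choices) does so and checks the key property just described, that every per-prediction bias stays bounded by $1$, so the ECE of the look-ahead sequence is at most $m+1$:

```python
# Offline simulation of the hypothetical One-Step-Ahead forecaster (it looks
# at y^t before committing to its prediction, so it is not a valid online
# algorithm). Names are our own.
import random


def one_step_ahead(outcomes, m):
    alpha = [0.0] * (m + 1)  # alpha[i] is the bias of grid point i/m
    lookahead = []
    for y in outcomes:
        # Adjacent pair p_i, p_{i+1} with alpha(p_i) <= 0 <= alpha(p_{i+1});
        # one always exists because alpha(0) <= 0 and alpha(1) >= 0.
        i = 0
        while not (alpha[i] <= 0 and alpha[i + 1] >= 0):
            i += 1
        # Only now is y consulted: predict the endpoint closer to the outcome.
        j = i if abs(i / m - y) <= abs((i + 1) / m - y) else i + 1
        lookahead.append(j / m)
        alpha[j] += j / m - y
    # Key property: every per-point bias stays in [-1, 1], so the ECE of the
    # look-ahead sequence is at most m + 1.
    assert all(abs(a) <= 1 + 1e-9 for a in alpha)
    return lookahead


one_step_ahead([random.randint(0, 1) for _ in range(1000)], m=30)
```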

Theorem 2.

For any sequence of outcomes, One-Step-Ahead achieves $\mathrm{ECE}(\tilde{p}^{1:T}, y^{1:T}) \leq m+1$.

Proof.

We will show that for any $p_i \in B_m$, we have $|\alpha_{\tilde{p}^{1:T}}(p_i)| \leq 1$, after which the bound on ECE will follow: $\mathrm{ECE}(\tilde{p}^{1:T}, y^{1:T}) = \sum_{p_i \in B_m} |\alpha_{\tilde{p}^{1:T}}(p_i)| \leq m+1$. We proceed via an inductive argument. Fix a prediction $p_i \in B_m$. At the first round $t_1$ in which $p_i$ is output by the algorithm, we have that $|\alpha_{\tilde{p}^{1:t_1}}(p_i)| = |\tilde{p}^{t_1} - y^{t_1}| = |p_i - y^{t_1}| \leq 1$. Now suppose after round $t-1$, we satisfy $|\alpha_{\tilde{p}^{1:t-1}}(p_i)| \leq 1$. If $p_i$ is the prediction made at round $t$, it must be that either $\alpha_{\tilde{p}^{1:t-1}}(p_i) \leq 0$ and $p_i - y^{t} \geq 0$, or $\alpha_{\tilde{p}^{1:t-1}}(p_i) \geq 0$ and $p_i - y^{t} \leq 0$ (if $y^{t} = 0$ the algorithm picks the smaller endpoint of its pair, which has nonpositive bias and a nonnegative increment; if $y^{t} = 1$ it picks the larger endpoint, which has nonnegative bias and a nonpositive increment). Thus, since $\alpha_{\tilde{p}^{1:t-1}}(p_i)$ and $p_i - y^{t}$ either take value $0$ or differ in sign, we can conclude that

\[
|\alpha_{\tilde{p}^{1:t}}(p_i)| = |\alpha_{\tilde{p}^{1:t-1}}(p_i) + p_i - y^{t}| \leq \max\{|\alpha_{\tilde{p}^{1:t-1}}(p_i)|, |p_i - y^{t}|\} \leq 1
\]

which proves the theorem. ∎

Algorithm 1 (Almost-One-Step-Ahead) maintains the same state $\alpha_{\tilde{p}^{1:t}}(p)$ as One-Step-Ahead (which it can compute at round $t$ after observing the outcome $y^{t-1}$). In particular, it does not keep track of the bias of its own predictions, but rather keeps track of the bias of the predictions that One-Step-Ahead would have made. Thus it can determine the pair $p_i, p_{i+1}$ that One-Step-Ahead would commit to predict at round $t$. It cannot make the same prediction as One-Step-Ahead (as it must fix its prediction before the label is observed), so instead it deterministically predicts $p^{t} = p_i$ (or $p^{t} = p_{i+1}$; the choice can be arbitrary and does not affect the analysis). Since we have that $|p_i - p_{i+1}| \leq \frac{1}{m}$, it must be that for whichever choice One-Step-Ahead would have made, we have $|\tilde{p}^{t} - p^{t}| \leq \frac{1}{m}$. In other words, although Almost-One-Step-Ahead does not make the same predictions as One-Step-Ahead, it makes predictions that are within $\ell_1$ distance $T/m$ after $T$ rounds. The analysis then follows by the ECE bound of One-Step-Ahead, the triangle inequality, and choosing $m = \sqrt{T}$.

Proof of Theorem 1. Observe that internally, Algorithm 1 maintains the sequence $\tilde{p}^{1}, \ldots, \tilde{p}^{t}$, which corresponds exactly to the predictions made by One-Step-Ahead. Thus, by Lemma 1 and Theorem 2, we have that $\mathrm{CalDist}(\tilde{p}^{1:T}, y^{1:T}) \leq \mathrm{ECE}(\tilde{p}^{1:T}, y^{1:T}) \leq m+1$. Then, we can bound the distance to calibration of the sequence $p^{1}, \ldots, p^{T}$:

\begin{align*}
\mathrm{CalDist}(p^{1:T}, y^{1:T}) &= \min_{q^{1:T} \in \mathcal{C}(y^{1:T})} \|p^{1:T} - q^{1:T}\|_1 \\
&= \min_{q^{1:T} \in \mathcal{C}(y^{1:T})} \|p^{1:T} - \tilde{p}^{1:T} + \tilde{p}^{1:T} - q^{1:T}\|_1 \\
&\leq \|p^{1:T} - \tilde{p}^{1:T}\|_1 + \min_{q^{1:T} \in \mathcal{C}(y^{1:T})} \|\tilde{p}^{1:T} - q^{1:T}\|_1 \\
&\leq \frac{T}{m} + m + 1
\end{align*}

where in the last step we use the fact that $|p^{t} - \tilde{p}^{t}| \leq 1/m$ for all $t$ and thus $\|p^{1:T} - \tilde{p}^{1:T}\|_1 \leq T/m$. The result then follows by setting $m = \sqrt{T}$. ∎
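As a concrete instantiation of the bound, take $T = 10000$ and $m = \sqrt{T} = 100$:

\[
\mathrm{CalDist}(p^{1:T}, y^{1:T}) \leq \frac{T}{m} + m + 1 = 100 + 100 + 1 = 201 = 2\sqrt{T} + 1.
\]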

Acknowledgements

This work was supported in part by the Simons Collaboration on the Theory of Algorithmic Fairness, NSF grants FAI-2147212 and CCF-2217062, an AWS AI Gift for Research on Trustworthy AI, and the Hans Sigrist Prize.

References

  • Błasiok et al. (2023) Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu, and Preetum Nakkiran. A unifying theory of distance from calibration. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1727–1740, 2023.
  • Dagan et al. (2024) Yuval Dagan, Constantinos Daskalakis, Maxwell Fishelson, Noah Golowich, Robert Kleinberg, and Princewill Okoroafor. Improved bounds for calibration via stronger sign preservation games, 2024. URL https://arxiv.org/abs/2406.13668.
  • Foster (1999) Dean P. Foster. A proof of calibration via Blackwell’s approachability theorem. Games and Economic Behavior, 29(1-2):73–78, 1999.
  • Foster and Hart (2018) Dean P. Foster and Sergiu Hart. Smooth calibration, leaky forecasts, finite recall, and Nash dynamics. Games and Economic Behavior, 109:271–293, 2018.
  • Foster and Vohra (1998) Dean P. Foster and Rakesh V. Vohra. Asymptotic calibration. Biometrika, 85(2):379–390, 1998. ISSN 00063444. URL http://www.jstor.org/stable/2337364.
  • Gupta and Ramdas (2022) Chirag Gupta and Aaditya Ramdas. Faster online calibration without randomization: interval forecasts and the power of two choices. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 4283–4309. PMLR, 02–05 Jul 2022. URL https://proceedings.mlr.press/v178/gupta22b.html.
  • Qiao and Valiant (2021) Mingda Qiao and Gregory Valiant. Stronger calibration lower bounds via sidestepping. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2021, page 456–466, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450380539. doi: 10.1145/3406325.3451050. URL https://doi.org/10.1145/3406325.3451050.
  • Qiao and Zheng (2024) Mingda Qiao and Letian Zheng. On the distance from calibration in sequential prediction. In Shipra Agrawal and Aaron Roth, editors, Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pages 4307–4357. PMLR, 30 Jun–03 Jul 2024. URL https://proceedings.mlr.press/v247/qiao24a.html.