
Online Minimax Multiobjective Optimization:
Multicalibeating and Other Applications

Daniel Lee1, Georgy Noarov1, Mallesh Pai2, Aaron Roth1
1 University of Pennsylvania, 2 Rice University
Abstract

We introduce a simple but general online learning framework in which a learner plays against an adversary in a vector-valued game that changes every round. Even though the learner’s objective is not convex-concave (and so the minimax theorem does not apply), we give a simple algorithm that can compete with the setting in which the adversary must announce their action first, with optimally diminishing regret. We demonstrate the power of our framework by using it to (re)derive optimal bounds and efficient algorithms across a variety of domains, ranging from multicalibration to a large family of no-regret algorithms, to a variant of Blackwell’s approachability theorem for polytopes with fast convergence rates. As a new application, we show how to “(multi)calibeat” an arbitrary collection of forecasters — achieving an exponentially improved dependence on the number of models we are competing against, compared to prior work.

1 Introduction

We introduce and study a simple but powerful framework for online adversarial multiobjective minimax optimization. At each round $t$, an adaptive adversary chooses an environment for the learner to play in, defined by a convex compact action set $\mathcal{X}^{t}$ for the learner, a convex compact action set $\mathcal{Y}^{t}$ for the adversary, and a $d$-dimensional continuous loss function $\ell^{t}:\mathcal{X}^{t}\times\mathcal{Y}^{t}\rightarrow[-1,1]^{d}$ that, in each coordinate, is convex in the learner's action and concave in the adversary's action. The learner then chooses an action, or distribution over actions, $x^{t}$, and the adversary responds with an action $y^{t}$. This results in a loss vector $\ell^{t}(x^{t},y^{t})$, which accumulates over time. The learner's goal is to minimize the maximum accumulated loss over the $d$ dimensions: $\max_{j\in[d]}\left(\sum_{t=1}^{T}\ell_{j}^{t}(x^{t},y^{t})\right)$.

One may view the environment chosen at each round $t$ as defining a zero-sum game in which the learner wishes to minimize the maximum coordinate of the resulting loss vector. The objective of the learner in the stage game in isolation can be written as:[1]

[1] A brief aside about the "inf max max" structure of $w^{t}_{L}$: since each $\ell_{j}^{t}$ is continuous, so is $\max_{j}\ell_{j}^{t}$, and hence $\max_{y}(\max_{j}\ell_{j}^{t})$ is attained on the compact set $\mathcal{Y}^{t}$. However, $\max_{y}(\max_{j}\ell_{j}^{t})$ may not be a continuous function of $x$, and therefore the infimum over $\mathcal{X}^{t}$ need not be attained.

$$w^{t}_{L}=\inf_{x^{t}\in\mathcal{X}^{t}}\max_{y^{t}\in\mathcal{Y}^{t}}\left(\max_{j\in[d]}\ell_{j}^{t}(x^{t},y^{t})\right).$$

Unfortunately, although $\ell_{j}^{t}$ is convex-concave in each coordinate, the maximum over coordinates does not preserve concavity for the adversary. Thus the minimax theorem does not hold, and the value of the game in which the learner moves first (defined above) can be larger than the value of the game in which the adversary moves first — that is, we may have $w^{t}_{L}>w^{t}_{A}$, where $w^{t}_{A}$ is defined as:

$$w^{t}_{A}=\sup_{y^{t}\in\mathcal{Y}^{t}}\min_{x^{t}\in\mathcal{X}^{t}}\left(\max_{j\in[d]}\ell_{j}^{t}(x^{t},y^{t})\right).$$

Nevertheless, fixing a series of $T$ environments chosen by the adversary, this defines in hindsight an aspirational quantity $W^{T}_{A}=\sum_{t=1}^{T}w^{t}_{A}$, summing the adversary-moves-first values of the constituent zero-sum games. Despite the fact that these values are not individually obtainable in the stage games, we show that they are approachable on average over a sequence of rounds, i.e., there is an algorithm for the learner that guarantees that against any adversary,

$$\max_{j\in[d]}\left(\tfrac{1}{T}\sum_{t=1}^{T}\ell_{j}^{t}(x^{t},y^{t})\right)\leq\tfrac{1}{T}W_{A}^{T}+4\sqrt{\tfrac{2\ln d}{T}}.$$

Our derivation is elementary and based on a minimax argument, and is a development of a game-theoretic argument from the calibration literature due to Hart (2020) and Fudenberg and Levine (1999).[2] The generic algorithm plays actions at every round $t$ according to a minimax equilibrium strategy in a surrogate game that is derived both from the environment chosen by the adversary at round $t$, as well as from the history of play so far on previous rounds $t^{\prime}<t$. The loss in the surrogate game is convex-concave (and so we may apply minimax arguments), and can be used to upper bound the loss in the original games.

[2] This argument was extended in Gupta et al. (2022) to obtain fast rates and explicit algorithms for multicalibration and multivalidity.

We then show that this simple framework can be instantiated to derive a wide array of optimal bounds, and that the corresponding algorithms can be derived in closed form by solving for the minimax equilibrium of the corresponding surrogate game. Despite its simplicity, our framework has a number of applications to online learning; we sketch these below.

“Multi-Calibeating”:

Foster and Hart (2021) recently introduced the notion of "calibeating" an arbitrary online forecaster: making online calibrated predictions about an adversarially chosen sequence of inputs that are guaranteed to have lower squared error than an arbitrary predictor $f$, where the improvement in error approaches $f$'s calibration error in hindsight. Foster and Hart give two methods for calibeating an arbitrary collection of predictors $\mathcal{F}$ simultaneously, but these methods have, respectively, an exponential and a polynomial dependence on $|\mathcal{F}|$ in their convergence bounds.

Using our framework, we can derive optimal online bounds for online multicalibration (Hébert-Johnson et al., 2018; Gupta et al., 2022), and, as an application, obtain bounds for calibeating an arbitrary collection of models with only a logarithmic dependence on $|\mathcal{F}|$. Our algorithm naturally extends to the more general problem of online "multi-calibeating", i.e., combining the goals of online multicalibration and calibeating. Namely, we give an algorithm for making real-valued predictions given contexts from some space $\Theta$. The algorithm is parameterized by (i) a collection $\mathcal{G}\subseteq 2^{\Theta}$ of (arbitrary, potentially intersecting) subsets of $\Theta$ that we might envision to represent, e.g., different demographic groups in a setting in which we are making predictions about people; and (ii) an arbitrary collection of predictors $\mathcal{F}$. We promise that our predictions are calibrated not just overall, but simultaneously within each group $g\in\mathcal{G}$ — and moreover, that we calibeat each predictor $f\in\mathcal{F}$ not just overall, but simultaneously within each group $g\in\mathcal{G}$. We do this by proving an online analogue of what Hébert-Johnson et al. (2018) call a "do no harm" property in the batch setting, using a similar technique: multicalibrating with respect to the level sets of the predictors.

Fast Polytope Blackwell Approachability:

We give a variant of Blackwell's Approachability Theorem (Blackwell, 1956) for approaching a polytope. Standard methods approach a set in Euclidean distance, at a rate polynomial in the payoff dimension. In contrast, we give a dimension-independent approachability guarantee: we approximately satisfy all halfspace constraints defining the polytope after a number of rounds that is only logarithmic in the number of constraints, a significant improvement over a polynomial dependence on the dimension in many settings. This guarantee is equivalent to the results of Perchet (2015), which show that the negative orthant $\mathbb{R}^{d}_{\leq 0}$ is approachable in the $\ell_{\infty}$ metric with a $\log(d)$ dependence in the convergence rate. The result follows immediately from a specialization of our framework that does not require changing the environment at each round, highlighting the connection between our framework and approachability. We remark that approachability has been extended in a number of ways in recent years (Mannor et al., 2014a, b; Perchet and Mannor, 2013). However, most of our other applications take advantage of the flexibility of our framework to play a different game at each round (which can be defined by context) with potentially different action sets, and so do not directly follow from Blackwell approachability. Therefore, while many of our regret bounds could be derived from approachability to the negative orthant by exponentially enlarging the action space to simulate aspects of our framework, this approach would not easily lead to efficient algorithms.

Recovering Expert Learning Bounds:

Algorithms and optimal bounds for various expert learning problems fall naturally out of our framework as corollaries. This includes external regret (Vovk, 1990; Littlestone and Warmuth, 1994), internal and swap regret (Foster and Vohra, 1998; Hart and Mas-Colell, 2000; Blum and Mansour, 2007), adaptive regret (Littlestone and Warmuth, 1994; Hazan and Seshadhri, 2009; Adamskiy et al., 2012), sleeping experts (Freund et al., 1997; Blum, 1997; Blum and Mansour, 2007; Kleinberg et al., 2010), and the recently introduced multi-group regret (Blum and Lykouris, 2020; Rothblum and Yona, 2021). Multi-group regret refers to a contextual prediction problem in which the learner gets a context from $\Theta$ before each round. It is parameterized by a collection of groups $\mathcal{G}\subseteq 2^{\Theta}$: e.g., if the predictions concern people, $\mathcal{G}$ may represent an arbitrary, intersecting set of demographic groups. Here the "experts" are different models that make predictions on each instance; the goal is to attain no regret not just overall, but also on the subset of rounds corresponding to contexts from each $g\in\mathcal{G}$. Multi-group regret, like multicalibration, is one of the few solution concepts in the algorithmic fairness literature known not to involve tradeoffs with overall accuracy (Globus-Harris et al., 2022). Blum and Lykouris (2020) derived their algorithm for online multigroup regret via a reduction to sleeping experts, and Gupta et al. (2022) derived their algorithm for online multicalibration via a direct argument. Here we derive online algorithms for both multicalibration and multigroup regret as corollaries of the same fundamental framework.

1.1 Additional Related Work

Papers by Azar et al. (2014) and Kesselheim and Singla (2020) study a related problem: an online setting with vector-valued losses, where the goal is to minimize the $\ell_{\infty}$ norm of the accumulated loss vector (they also consider other $\ell_{p}$-norms). However, they study an incomparable benchmark that in our notation would be written as $\min_{x^{*}\in\mathcal{X}}\max_{j\in[d]}\frac{1}{T}\sum_{t=1}^{T}\ell_{j}(x^{*},y^{t})$ (which is well-defined in their setting, where the loss function $\ell^{t}=\ell$ and action sets $\mathcal{X}^{t}=\mathcal{X}$, $\mathcal{Y}^{t}=\mathcal{Y}$ are fixed throughout the interaction). On the one hand, this benchmark is stronger than ours in the sense that the maximum over coordinates is taken outside the sum over time, whereas our benchmark considers a "greedy" per-round maximum. On the other hand, in our setting the game can be different at every round, so our benchmark allows a comparison to a different action at each round rather than a single fixed action. In the setting of Kesselheim and Singla (2020), it is impossible to give any regret bound to their benchmark, so they derive an algorithm obtaining a $\log(d)$ competitive ratio to this benchmark. In contrast, our benchmark admits a regret bound. Hence, our results are quite different in kind despite the outward similarity of the settings: none of our applications follow from their theorems (since in all of our applications, we derive regret bounds).

A different line of work (Rakhlin et al., 2010, 2011) takes a very general minimax approach towards deriving bounds in online learning, including regret minimization, calibration, and approachability. Their approach is substantially more powerful than the framework we introduce here (e.g., it can be used to derive bounds for infinite-dimensional problems, and it characterizes online learnability in the sense that it can also be used to prove lower bounds). However, it is also correspondingly more complex, and requires analyzing the continuation value of a $T$-round dynamic program. Such analyses are generally technically challenging; as an example, a recent line of work by Drenska and Kohn (2020) and Kobzar et al. (2020) considers a Rakhlin et al.-style minimax formulation of the standard experts problem, and shows how to find nonlinear PDE-based minimax solutions for the Learner and the Adversary that can be optimal not just asymptotically in the number of experts (dimensions) $d$, but also nonasymptotically for small $d$ such as $2$ or $3$; their PDE approach is also conducive to bounding not just the maximum regret across dimensions, but also more general functions of the individual dimensions' losses.

Overall, results derived from the Rakhlin et al. framework (with some notable exceptions, including Rakhlin et al. (2012)) are generically nonconstructive, whereas our framework is simple and inherently constructive, in that the algorithm derives from repeatedly solving a one-round stage zero-sum game. Relative to this literature, we view our framework as a “user-friendly” power tool that can be used to derive a wide variety of algorithms and bounds without much additional work — at the cost of not being universally expressive.

2 General Framework

2.1 The Setting

A Learner (she) plays against an Adversary (he) over rounds $t\in[T]:=\{1,\ldots,T\}$. Over these rounds, she accumulates a $d$-dimensional loss vector ($d\geq 1$), where each round's loss vector lies in $[-C,C]^{d}$ for some $C>0$. At each round $t$, the Learner and the Adversary interact as follows:

  1. Before round $t$, the Adversary selects and reveals to the Learner an environment comprising:

     (a) The Learner's and Adversary's respective convex compact action sets $\mathcal{X}^{t}$, $\mathcal{Y}^{t}$, embedded into a finite-dimensional Euclidean space;

     (b) A continuous vector loss function $\ell^{t}(\cdot,\cdot):\mathcal{X}^{t}\times\mathcal{Y}^{t}\to[-C,C]^{d}$, with each coordinate $\ell^{t}_{j}(\cdot,\cdot):\mathcal{X}^{t}\times\mathcal{Y}^{t}\to[-C,C]$ (for $j\in[d]$) convex in the first and concave in the second argument.

  2. The Learner selects some $x^{t}\in\mathcal{X}^{t}$.

  3. The Adversary observes the Learner's selection $x^{t}$, and responds with some $y^{t}\in\mathcal{Y}^{t}$.

  4. The Learner suffers (and observes) the loss vector $\ell^{t}(x^{t},y^{t})$.

The Learner's objective is to minimize the value of the maximum dimension of the accumulated loss vector after $T$ rounds — in other words, to minimize $\max_{j\in[d]}\sum_{t\in[T]}\ell^{t}_{j}(x^{t},y^{t})$.

To benchmark the Learner's performance, we consider the following quantity at each round $t$:

Definition 2.1 (The Adversary-Moves-First (AMF) Value at Round $t$).

The Adversary-Moves-First value of the game defined by the environment $(\mathcal{X}^{t},\mathcal{Y}^{t},\ell^{t})$ at round $t$ is:

$$w^{t}_{A}:=\sup_{y^{t}\in\mathcal{Y}^{t}}\,\min_{x^{t}\in\mathcal{X}^{t}}\Big(\max_{j\in[d]}\ell^{t}_{j}(x^{t},y^{t})\Big).$$

If the Adversary had to reveal $y^{t}$ first and the Learner could best respond, $w^{t}_{A}$ would be the smallest value of the maximum coordinate of $\ell^{t}$ she could guarantee. However, the function $\max_{j\in[d]}\ell^{t}_{j}(x^{t},y^{t})$ is not convex-concave (as the $\max$ does not preserve concavity); hence the minimax theorem does not apply, making this value unobtainable for the Learner, who is in fact obligated to reveal $x^{t}$ first. However, we can define regret to a benchmark given by the cumulative AMF values of the games:

Definition 2.2 (Adversary-Moves-First (AMF) Regret).

On transcript $\pi^{t}=\{(\mathcal{X}^{s},\mathcal{Y}^{s},\ell^{s}),x^{s},y^{s}\}_{s=1}^{t}$, we define the Learner's Adversary-Moves-First (AMF) Regret for the $j^{\text{th}}$ dimension at time $t$ to be:

$$R_{j}^{t}(\pi^{t}):=\sum_{s=1}^{t}\ell^{s}_{j}(x^{s},y^{s})-\sum_{s=1}^{t}w^{s}_{A}.$$

The overall AMF Regret is then defined as follows: $R^{t}(\pi^{t})=\max_{j\in[d]}R_{j}^{t}$.[3]

[3] We will generally elide the dependence on the transcript and simply write $R^{t}_{j}$ and $R^{t}$.

Again, the game played at each round is not convex-concave, so we cannot get $R^{T}\leq 0$. Instead, we will aim to obtain sublinear AMF regret, worst-case over adaptive adversaries: $R^{T}=o(T)$.

2.2 General Algorithm

Our algorithmic framework will be based on a natural idea: instead of directly grappling with the maximum coordinate of the cumulative vector valued loss, we upper bound the AMF regret with a one-dimensional “soft-max” surrogate loss function, which the algorithm will then aim to minimize.

Definition 2.3 (Surrogate loss).

Fixing a parameter $\eta\in(0,1)$, we define our surrogate loss function (which implicitly depends on the transcript $\pi^{t}$ through the respective round $t$) as:

$$L^{t}:=\sum_{j\in[d]}\exp\left(\eta R^{t}_{j}\right)\ \text{ for }t\in[T],\qquad\text{and }\ L^{0}:=d.$$

This surrogate loss tightly bounds the AMF regret $R^{T}=\max_{j\in[d]}R^{T}_{j}$:

Lemma 2.1.

The Learner's AMF Regret is upper bounded using the surrogate loss as: $R^{T}\leq\frac{\ln L^{T}}{\eta}$.
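For intuition, one way to see this bound (the full proof is in Appendix A.1) is via the standard soft-max inequality applied to the definition of $L^{T}$:

$$\exp\left(\eta R^{T}\right)=\max_{j\in[d]}\exp\left(\eta R^{T}_{j}\right)\leq\sum_{j\in[d]}\exp\left(\eta R^{T}_{j}\right)=L^{T},$$

and taking logarithms and dividing by $\eta>0$ gives $R^{T}\leq\frac{\ln L^{T}}{\eta}$.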

Next we observe a simple but important bound on the per-round increase in the surrogate loss.

Lemma 2.2.

For any $t$, any transcript through round $t$, and any $\eta\leq\frac{1}{2C}$, it holds that:

$$L^{t}\leq\left(4\eta^{2}C^{2}+1\right)L^{t-1}+\eta\sum_{j\in[d]}\exp\left(\eta R^{t-1}_{j}\right)\cdot\left(\ell^{t}_{j}\left(x^{t},y^{t}\right)-w^{t}_{A}\right).$$

The proof is very simple (see Appendix A.1): we write out the quantity $L^{t}-L^{t-1}$, use the definition of the AMF regret $R^{t}$, and then bound $L^{t}-L^{t-1}$ via the inequality $e^{x}\leq 1+x+x^{2}$ for $|x|\leq 1$.
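Concretely, under the paper's assumptions that each $|\ell^{t}_{j}|\leq C$ (so that $|\ell^{t}_{j}(x^{t},y^{t})-w^{t}_{A}|\leq 2C$ and $\eta|\ell^{t}_{j}(x^{t},y^{t})-w^{t}_{A}|\leq 1$), one way to reconstruct the chain is:

$$L^{t}=\sum_{j\in[d]}e^{\eta R^{t-1}_{j}}\,e^{\eta\left(\ell^{t}_{j}(x^{t},y^{t})-w^{t}_{A}\right)}\leq\sum_{j\in[d]}e^{\eta R^{t-1}_{j}}\left(1+\eta\left(\ell^{t}_{j}(x^{t},y^{t})-w^{t}_{A}\right)+4\eta^{2}C^{2}\right),$$

which rearranges to the claimed bound.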

We now exploit Lemma 2.2 to bound the final surrogate loss $L^{T}$ and obtain a game-theoretic algorithm for the Learner that attains this bound. While the above steps should remind the reader of a standard derivation of the celebrated Exponential Weights algorithm via bounding a log-sum-exp potential function, the next lemma is the novel ingredient that makes our framework significantly more general by relying on Sion's powerful generalization of the Minimax Theorem to convex-concave games.

Lemma 2.3.

For any $\eta\leq\frac{1}{2C}$, the Learner can ensure that the final surrogate loss is bounded as:

$$L^{T}\leq d\left(4\eta^{2}C^{2}+1\right)^{T}.$$

Proof sketch; see Appendix A.1.

Define, for $t\in[T]$, continuous convex-concave functions $u^{t}:\mathcal{X}^{t}\times\mathcal{Y}^{t}\to\mathbb{R}$ by $u^{t}(x,y):=\sum_{j\in[d]}\exp\left(\eta R^{t-1}_{j}\right)\left(\ell^{t}_{j}(x,y)-w^{t}_{A}\right)$. If the Learner can ensure $u^{t}(x^{t},y^{t})\leq 0$ on all rounds $t\in[T]$, regardless of the Adversary's play, then Lemma 2.2 implies $L^{t}\leq\left(4\eta^{2}C^{2}+1\right)L^{t-1}$ for all $t\in[T]$, leading to the desired bound on $L^{T}$. Due to the continuous convex-concave nature of each $u^{t}$ (inherited from the loss coordinates $\ell^{t}_{j}$), we can apply Sion's Minimax Theorem to conclude that:

$$\min_{x^{t}\in\mathcal{X}^{t}}\max_{y^{t}\in\mathcal{Y}^{t}}u^{t}\left(x^{t},y^{t}\right)=\max_{y^{t}\in\mathcal{Y}^{t}}\min_{x^{t}\in\mathcal{X}^{t}}u^{t}\left(x^{t},y^{t}\right).$$

In words, the Learner has a minimax-optimal strategy $x^{t}$ that achieves (worst-case over all $y^{t}\in\mathcal{Y}^{t}$) a value $u^{t}(x^{t},y^{t})$ as low as if the Adversary moved first and the Learner could best-respond. But in the latter counterfactual scenario, using the definitions of $u^{t}$ and the Adversary-moves-first value $w^{t}_{A}$, we can easily see that by best-responding to the Adversary, the Learner would always guarantee herself value $\leq 0$: that is, $\max_{y^{t}\in\mathcal{Y}^{t}}\min_{x^{t}\in\mathcal{X}^{t}}u^{t}\left(x^{t},y^{t}\right)\leq 0$. Thus, $\min_{x^{t}\in\mathcal{X}^{t}}\max_{y^{t}\in\mathcal{Y}^{t}}u^{t}\left(x^{t},y^{t}\right)\leq 0$, and so by playing minimax-optimally at every round $t\in[T]$, the Learner will guarantee $u^{t}\left(x^{t},y^{t}\right)\leq 0$ for all $t$, leading to the desired regret bound. ∎

In fact, via a simple algebraic transformation (see Appendix A.1), taking advantage of the fact that the values $w^{t}_{A}$ are independent of the actions $x^{t},y^{t}$, we can explicitly express the Learner's minimax-optimal strategies at all rounds as:

$$\mathop{\mathrm{argmin}}_{x\in\mathcal{X}^{t}}\,\max_{y\in\mathcal{Y}^{t}}u^{t}(x,y)=\mathop{\mathrm{argmin}}_{x\in\mathcal{X}^{t}}\,\max_{y\in\mathcal{Y}^{t}}\sum_{j\in[d]}\frac{\exp\left(\eta\sum_{s=1}^{t-1}\ell_{j}^{s}(x^{s},y^{s})\right)}{\sum_{i\in[d]}\exp\left(\eta\sum_{s=1}^{t-1}\ell_{i}^{s}(x^{s},y^{s})\right)}\,\ell^{t}_{j}(x,y).$$

Together with the proof of Lemma 2.3, this immediately gives the following algorithm for the Learner that achieves the desired bound on $L^{T}$ (and thus, as we will show, on the AMF regret $R^{T}$).

  for rounds $t=1,\dots,T$ do
     Learn the adversarially chosen $\mathcal{X}^{t},\mathcal{Y}^{t}$, and loss function $\ell^{t}(\cdot,\cdot)$.
     Let
       $$\chi^{t}_{j}:=\frac{\exp\left(\eta\sum_{s=1}^{t-1}\ell_{j}^{s}(x^{s},y^{s})\right)}{\sum_{i\in[d]}\exp\left(\eta\sum_{s=1}^{t-1}\ell_{i}^{s}(x^{s},y^{s})\right)}\quad\text{for }j\in[d].$$
     Play
       $$x^{t}\in\mathop{\mathrm{argmin}}_{x\in\mathcal{X}^{t}}\,\max_{y\in\mathcal{Y}^{t}}\sum_{j\in[d]}\chi^{t}_{j}\cdot\ell^{t}_{j}(x,y).$$
     Observe the Adversary's selection of $y^{t}\in\mathcal{Y}^{t}$.

Algorithm 1: General Algorithm for the Learner that Achieves Sublinear AMF Regret
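As an illustration of how Algorithm 1 can be run in practice, the following is a minimal Python sketch (ours, not the paper's implementation), specialized to finite pure action sets for both players: each round's loss is given as a tensor, and the inner min-max is a weighted matrix game solved by linear programming. The function names and the simulated best-responding adversary are illustrative choices.

    import numpy as np
    from scipy.optimize import linprog

    def solve_matrix_game(M):
        # Return a mixed strategy x achieving min_x max_y x^T M e_y for the zero-sum
        # game with payoff matrix M (rows: Learner actions, cols: Adversary actions).
        n_a, n_y = M.shape
        c = np.zeros(n_a + 1); c[-1] = 1.0            # variables (x_1, ..., x_{n_a}, v); minimize v
        A_ub = np.hstack([M.T, -np.ones((n_y, 1))])   # M^T x - v <= 0 for every column y
        b_ub = np.zeros(n_y)
        A_eq = np.ones((1, n_a + 1)); A_eq[0, -1] = 0.0
        b_eq = np.array([1.0])                        # x is a probability vector
        bounds = [(0, None)] * n_a + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[:n_a]

    def amf_learner(loss_tensors, eta, seed=0):
        # loss_tensors: one array per round, of shape (d, n_actions, n_adv_actions),
        # with entries in [-C, C]. The adversary is simulated as best-responding to a^t.
        rng = np.random.default_rng(seed)
        cum_loss = np.zeros(loss_tensors[0].shape[0])   # sum_s ell_j^s(a^s, y^s) per dimension j
        for L in loss_tensors:
            w = np.exp(eta * (cum_loss - cum_loss.max()))   # chi^t_j up to normalization
            chi = w / w.sum()
            M = np.tensordot(chi, L, axes=(0, 0))           # weighted one-round game sum_j chi_j ell_j
            x = np.clip(solve_matrix_game(M), 0, None)
            a = rng.choice(len(x), p=x / x.sum())
            y = int(np.argmax(M[a]))                        # simulated adversarial response
            cum_loss += L[:, a, y]
        return cum_loss   # the max over its entries tracks the objective max_j sum_t ell_j^t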
Theorem 2.1 (AMF Regret guarantee of Algorithm 1).

For any $T\geq\ln d$, Algorithm 1 with learning rate $\eta=\sqrt{\frac{\ln d}{4TC^{2}}}$ obtains, against any Adversary, AMF regret bounded by $R^{T}\leq 4C\sqrt{T\ln d}$.

Indeed, using Lemma 2.1, then Lemma 2.3, then $1+x\leq e^{x}$, and finally setting $\eta=\sqrt{\tfrac{\ln d}{4TC^{2}}}$, we get:

$$R^{T}\leq\tfrac{\ln L^{T}}{\eta}\leq\tfrac{\ln\left(d\left(4\eta^{2}C^{2}+1\right)^{T}\right)}{\eta}\leq\tfrac{\ln\left(d\exp\left(4T\eta^{2}C^{2}\right)\right)}{\eta}=\tfrac{\ln d}{\eta}+4TC^{2}\eta=4C\sqrt{T\ln d}.$$
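(For completeness: at this choice of $\eta$ the two terms balance, with $\tfrac{\ln d}{\eta}=2C\sqrt{T\ln d}$ and $4TC^{2}\eta=2C\sqrt{T\ln d}$, which sum to $4C\sqrt{T\ln d}$.)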
Remark 2.1.

Our framework is easy to adapt to the setting where the Learner randomizes, at each round, amongst a finite set of actions $\mathcal{A}^{t}$ (i.e., $\mathcal{X}^{t}=\Delta\mathcal{A}^{t}$), and wishes to obtain in-expectation and high-probability AMF regret bounds. This is useful in all of our applications below. Additionally, our AMF regret bounds are robust to the Learner playing only an approximate (rather than exact) minimax strategy at each round: we use this to derive our simple multicalibration algorithm below. See Appendix A.2 for both of these extensions.

3 Deriving No-X-Regret Algorithms from Our Framework

The core of our framework — the Adversary-Moves-First regret — is strictly more general than a very large variety of known regret notions, including external, internal, swap, adaptive, sleeping-experts, multigroup, and wide-range ($\Phi$) regret. Specifically, in Appendix E, we use our framework to derive simple $O(\sqrt{T})$-regret algorithms for what we call subsequence regret, which encapsulates all of these regret forms. In each of these cases, our generic algorithm is efficient, and often specializes (by computing a minimax equilibrium strategy in closed form) to simple combinatorial algorithms that had been derived from first principles in prior work. We note that in any problem that involves context or changing action spaces (as the sleeping experts problem does), we are taking advantage of the flexibility of our framework to present a different environment at every round, which distinguishes our framework from more standard Blackwell approachability arguments. In fact, as we will see in Section 5 below, our framework recovers fast Blackwell approachability as a special case.

For our general subsequence regret algorithms, please see Appendix E. Now, as a warm-up application of our framework, we directly instantiate it for the simplest case: obtaining $O(\sqrt{T})$ external regret.

Simple Learning From Expert Advice: External Regret

In the classical experts learning setting (Littlestone and Warmuth, 1994), the Learner has a set of pure actions ("experts") $\mathcal{A}$. At the outset of each round $t\in[T]$, the Learner chooses a distribution over experts $x^{t}\in\Delta\mathcal{A}$. The Adversary then comes up with a vector of losses $r^{t}=(r^{t}_{a})_{a\in\mathcal{A}}\in[0,1]^{\mathcal{A}}$ corresponding to the experts. Next, the Learner samples $a^{t}\sim x^{t}$, and experiences the loss corresponding to the expert she chose: $r^{t}_{a^{t}}$. The Learner also gets to observe the entire vector of losses $r^{t}$ for that round. The goal of the Learner is to achieve sublinear external regret — that is, to ensure that the difference between her cumulative loss and the loss of the best fixed expert in hindsight grows sublinearly with $T$:

$$R^{T}_{\mathrm{ext}}(\pi^{T}):=\sum_{t\in[T]}r^{t}_{a^{t}}-\min_{j\in\mathcal{A}}\sum_{t\in[T]}r^{t}_{j}=o(T).$$

Theorem 3.1.

Fix a finite pure action set $\mathcal{A}$ for the Learner and a time horizon $T\geq\ln|\mathcal{A}|$. Then, an instantiation of our framework's Algorithm 2 lets the Learner achieve the following regret bounds:

$$\mathop{\mathbb{E}}\nolimits_{\pi^{T}}\left[R^{T}_{\mathrm{ext}}\left(\pi^{T}\right)\right]\leq 4\sqrt{T\ln|\mathcal{A}|},\qquad\text{and}\quad R^{T}_{\mathrm{ext}}\left(\pi^{T}\right)\leq 8\sqrt{T\ln\tfrac{|\mathcal{A}|}{\delta}}\ \text{ with probability }1-\delta.$$
Proof.

We instantiate (the probabilistic version of) our framework (see Section A.2.1).

At all rounds, the Learner's pure action set is $\mathcal{A}$, and the Adversary's strategy space is the convex and compact set $[0,1]^{|\mathcal{A}|}$, from which each round's collection $(r^{t}_{a})_{a\in\mathcal{A}}$ of all actions' losses is selected. Next, we define an $|\mathcal{A}|$-dimensional loss function $\ell^{t}=(\ell_{j}^{t})_{j\in\mathcal{A}}$, where each coordinate loss $\ell_{j}^{t}$ expresses the regret of the Learner's chosen action $a$ relative to action $j\in\mathcal{A}$:

$$\ell^{t}_{j}(a,r^{t})=r^{t}_{a}-r^{t}_{j},\qquad\text{for }a\in\mathcal{A},\;r^{t}\in[0,1]^{|\mathcal{A}|}.$$

By Theorem A.1, $\mathop{\mathbb{E}}\left[\max_{j\in\mathcal{A}}\sum_{t\in[T]}\ell^{t}_{j}(a^{t},r^{t})-\sum_{t\in[T]}w^{t}_{A}\right]\leq 4\sqrt{T\ln|\mathcal{A}|}$, where $w^{t}_{A}$ is the AMF value at round $t$. Using this AMF regret bound, we can bound the Learner's external regret as:

$$\mathop{\mathbb{E}}\left[R^{T}_{\mathrm{ext}}\right]=\mathop{\mathbb{E}}\left[\max_{j\in\mathcal{A}}\sum\nolimits_{t\in[T]}\left(r^{t}_{a^{t}}-r_{j}^{t}\right)\right]=\mathop{\mathbb{E}}\left[\max_{j\in\mathcal{A}}\sum\nolimits_{t\in[T]}\ell^{t}_{j}(a^{t},r^{t})\right]\leq 4\sqrt{T\ln|\mathcal{A}|}+\sum\nolimits_{t\in[T]}w^{t}_{A}.$$

It thus remains to show that the AMF value satisfies $w^{t}_{A}\leq 0$ for all $t$. This holds since, if the Learner knew the Adversary's choice of losses $(r^{t}_{a})_{a\in\mathcal{A}}$ before round $t$, then picking the action $a\in\mathcal{A}$ with the smallest loss $r^{t}_{a}$ would get her $0$ regret in that round.[4] This gives the in-expectation regret bound; the high-probability bound follows in the same way from Theorem A.2. ∎

[4] Formally, for any vector of actions' losses $r^{t}$, define $a^{*}_{r^{t}}:=\mathop{\mathrm{argmin}}_{a\in\mathcal{A}}r^{t}_{a}$, and notice that $\min_{a\in\mathcal{A}}\max_{j\in\mathcal{A}}\ell^{t}_{j}(a,r^{t})\leq\max_{j\in\mathcal{A}}\ell^{t}_{j}\left(a^{*}_{r^{t}},r^{t}\right)=\max_{j\in\mathcal{A}}\left(r^{t}_{a^{*}_{r^{t}}}-r^{t}_{j}\right)=\min_{a\in\mathcal{A}}r^{t}_{a}-\min_{j\in\mathcal{A}}r^{t}_{j}=0$. Hence, the AMF value is indeed nonpositive at each round: $w^{t}_{A}=\sup_{r^{t}\in[0,1]^{|\mathcal{A}|}}\,\min_{a\in\mathcal{A}}\max_{j\in\mathcal{A}}\ell^{t}_{j}(a,r^{t})\leq 0$.

A bound of $\sqrt{T\ln|\mathcal{A}|}$ is optimal for external regret in the experts learning setting, and so serves to witness the optimality of our framework's general AMF regret bound in Theorem 2.1.

In fact, the above instantiation of Algorithm 2 yields the classical Exponential Weights algorithm (Littlestone and Warmuth, 1994): at each round $t$, the action $a^{t}$ is sampled with $\Pr[a^{t}=j]\propto\exp\left(-\eta\sum_{s=1}^{t-1}r^{s}_{j}\right)$, for $j\in\mathcal{A}$. We denote this distribution by $\mathrm{EW}_{\eta}(\pi^{t-1})\in\Delta(\mathcal{A})$.

Indeed, given the above-defined loss $\ell^{t}$, the Learner solves the following problem at each round:

$$x^{t}\in\mathop{\mathrm{argmin}}_{x\in\Delta\mathcal{A}}\,\max_{r^{t}\in[0,1]^{|\mathcal{A}|}}\sum_{j\in\mathcal{A}}\chi^{t}_{j}\mathop{\mathbb{E}}_{a\sim x}\left[r^{t}_{a}-r^{t}_{j}\right],$$

where $\chi^{t}_{j}=\frac{\exp\left(\eta\sum_{s=1}^{t-1}(r^{s}_{a^{s}}-r^{s}_{j})\right)}{\sum_{i\in\mathcal{A}}\exp\left(\eta\sum_{s=1}^{t-1}(r^{s}_{a^{s}}-r^{s}_{i})\right)}=\frac{\exp\left(-\eta\sum_{s=1}^{t-1}r^{s}_{j}\right)}{\sum_{i\in\mathcal{A}}\exp\left(-\eta\sum_{s=1}^{t-1}r^{s}_{i}\right)}$. That is, the per-coordinate weights $(\chi^{t}_{j})_{j\in\mathcal{A}}$ themselves form the Exponential Weights distribution with rate $\eta$.

For any choice of $r^{t}$ by the Adversary, the quantity inside the expectation, $\ell^{t}_{j}(a,r^{t})=r^{t}_{a}-r^{t}_{j}$, is antisymmetric in $a$ and $j$: that is, $\ell^{t}_{j}(a,r^{t})=-\ell^{t}_{a}(j,r^{t})$. Due to this antisymmetry, no matter which $r^{t}$ gets selected by the Adversary, by playing $a\sim\mathrm{EW}_{\eta}(\pi^{t-1})$ the Learner obtains $\mathop{\mathbb{E}}_{a,j\sim\mathrm{EW}_{\eta}(\pi^{t-1})}\left[r^{t}_{a}-r^{t}_{j}\right]=0$, thus achieving the value of the game. It is also easy to see that $x^{t}=\mathrm{EW}_{\eta}(\pi^{t-1})$ is the unique choice of $x^{t}$ that guarantees nonpositive value; hence Algorithm 2, when specialized to the external regret setting, is equivalent to the Exponential Weights Algorithm 5.
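To make the equivalence concrete, here is a minimal simulation sketch in Python (ours, not the paper's Algorithm 5) of the resulting Exponential Weights play and its realized external regret:

    import numpy as np

    def exponential_weights(loss_matrix, eta, seed=0):
        # loss_matrix: array of shape (T, n_experts) with entries in [0, 1].
        # At round t, play a ~ p^t with p^t_j proportional to exp(-eta * sum_{s<t} r^s_j).
        rng = np.random.default_rng(seed)
        T, n = loss_matrix.shape
        cum = np.zeros(n)                       # cumulative expert losses
        total_loss = 0.0
        for t in range(T):
            logits = -eta * (cum - cum.min())   # constant shift for numerical stability
            p = np.exp(logits); p /= p.sum()
            a = rng.choice(n, p=p)
            total_loss += loss_matrix[t, a]
            cum += loss_matrix[t]
        return total_loss - cum.min()           # realized external regret

    # e.g., eta on the order of sqrt(ln(n) / T) recovers the O(sqrt(T ln|A|)) scale of Theorem 3.1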

4 Multicalibration and Multicalibeating

We now apply our framework to derive an online contextual prediction algorithm which simultaneously satisfies a (potentially very large) family of strong adversarial accuracy and calibration conditions. Namely, given an arbitrarily complex family $\mathcal{G}$ of subsets of the context space (we call them "groups", a term from the fairness literature), the predictor will be both calibrated and accurate on each group $g\in\mathcal{G}$ (that is, over those online rounds when the context belongs to $g$).

The accuracy benchmark that we aim to satisfy was recently proposed by Foster and Hart (2021), who called it calibeating: given any collection $\mathcal{F}$ of online forecasters, the goal is (intuitively) to "beat" the (squared) error of each $f\in\mathcal{F}$ by at least the calibration score of $f$.

In Section 4.1, we use our framework to rederive the online multigroup calibration (known as multicalibration) algorithm of Gupta et al. (2022). In Section 4.2, we show that by appropriately augmenting the original collection of groups $\mathcal{G}$, this algorithm will, in addition to multicalibration, calibeat each predictor $f$ in any family $\mathcal{F}$ on every group $g\in\mathcal{G}$, which we call multicalibeating.

4.1 Multicalibration

Setting

There is a feature (or context) space $\Theta$ encoding the set of possible feature vectors representing individuals $\theta\in\Theta$. There is also a label space $[0,1]$. At every round $t\in[T]$:

  1. The Adversary announces a particular individual $\theta^{t}\in\Theta$, whose label is to be predicted;

  2. The Learner predicts a label distribution $x^{t}$ over $[0,1]$;

  3. The Adversary observes $x^{t}$, and fixes the true label distribution $y^{t}$ over $[0,1]$;

  4. The (pure) guessed label $a^{t}\sim x^{t}$ and the (pure) true label $b^{t}\sim y^{t}$ are sampled.

Objective: Multicalibration

The Learner is initially given an arbitrary collection $\mathcal{G}\subseteq 2^{\Theta}$ of protected population groups. Her goal, multicalibration, is empirical calibration not just marginally over the whole population, but also conditionally on individual membership in each $g\in\mathcal{G}$. Formally, for any $n\geq 1$ we let the $n$-bucketing of the label interval $[0,1]$ be its partition into subintervals $[0,1/n),\ldots,[1-2/n,1-1/n),[1-1/n,1]$. The $i^{\text{th}}$ of these intervals (buckets) is denoted $B^{i}_{n}$.

Definition 4.1 ($(\alpha,n)$-Multicalibration with respect to $\mathcal{G}$).

Fix a real $\alpha>0$ and an integer $n\geq 1$. Given the transcript of the interaction $\{(a^{t},b^{t})\}_{t\in[T]}$, the Learner's sequence of guessed labels $\{a^{t}\}_{t\in[T]}$ is $(\alpha,n)$-multicalibrated with respect to the collection of groups $\mathcal{G}$ if:

$$\frac{1}{T}\Bigg|\sum_{t=1}^{T}1_{\theta^{t}\in g}\cdot 1_{a^{t}\in B^{i}_{n}}\cdot(b^{t}-a^{t})\Bigg|\leq\alpha\quad\text{for every group }g\in\mathcal{G}\text{ and every bucket }B^{i}_{n}\text{ (for }i\in[n]).$$

Using our framework, we now derive the guarantee on $\alpha$ that matches that of Gupta et al. (2022).

Theorem 4.1 (Multicalibration).

Fix a family of groups $\mathcal{G}$, a time horizon $T\geq\ln(2|\mathcal{G}|n)$, and any natural numbers $n,r\geq 1$. Then, our framework's Algorithm 2 can be instantiated as Algorithm 3 to produce $(\alpha,n)$-multicalibrated predictions w.r.t. $\mathcal{G}$, where $\alpha$ satisfies (over the randomness of the transcript):

$$\mathop{\mathbb{E}}[\alpha]\leq\tfrac{1}{rn}+4\sqrt{\tfrac{\ln(2|\mathcal{G}|n)}{T}}\quad\text{and}\quad\Pr\Big[\alpha\leq\tfrac{1}{rn}+8\sqrt{\tfrac{1}{T}\ln\left(\tfrac{2|\mathcal{G}|n}{\delta}\right)}\Big]\geq 1-\delta\ \text{ for all }\delta\in(0,1).$$
Sketch.

Setting up the game: The Adversary's strategy space is $\mathcal{Y}=[0,1]$. The Learner will randomize over $\mathcal{A}_{r}=\{0,1/(rn),2/(rn),\ldots,1\}$, for any choice of integer $r\geq 1$ (this will ensure continuity of the loss functions that we are about to define); i.e., her strategy space is $\mathcal{X}=\Delta\mathcal{A}_{r}$.

Loss functions: The definition of multicalibration consists of $2|\mathcal{G}|n$ constraints (one for each $\pm$ sign, group $g$, and bucket $i$) of the following form: $\pm\frac{1}{T}\sum_{t=1}^{T}1_{\theta^{t}\in g}\cdot 1_{a^{t}\in B^{i}_{n}}\cdot(b^{t}-a^{t})\leq\alpha$. Thus, we define (for each $t\in[T]$, $\sigma=\pm 1$, $g$, and $i$) a loss function over $(a^{t},b^{t})\in\mathcal{A}_{r}\times\mathcal{Y}$ as: $\ell^{t}_{i,g,\sigma}(a^{t},b^{t}):=\sigma\cdot 1_{\theta^{t}\in g}\cdot 1_{a^{t}\in B^{i}_{n}}\cdot(b^{t}-a^{t})$.

Now, defining a $2|\mathcal{G}|n$-dimensional loss vector $\ell^{t}:=\left(\ell^{t}_{i,g,\sigma}\right)_{i\in[n],g\in\mathcal{G},\sigma\in\{-1,1\}}$ for each $t\in[T]$ recasts multicalibration in our framework as requiring that $\max_{i\in[n],g\in\mathcal{G},\sigma\in\{-1,1\}}\sum_{t=1}^{T}\ell^{t}_{i,g,\sigma}(a^{t},b^{t})\leq\alpha T$.

Bounding the AMF regret: To bound the Adversary-Moves-First value with these loss functions, suppose the Adversary announces $b^{t}\in[0,1]$. Then we easily see that by (deterministically) responding with $a^{t}=\mathop{\mathrm{argmin}}_{a\in\mathcal{A}_{r}}|b^{t}-a|$, the Learner ensures $\ell^{t}_{i,g,\sigma}(a^{t},b^{t})\leq\frac{1}{2rn}$ for all $\sigma,g,i$. Hence,

$$w^{t}_{A}=\sup_{b^{t}\in[0,1]}\,\min_{x^{t}\in\Delta\mathcal{A}_{r}}\,\max_{i\in[n],g\in\mathcal{G},\sigma\in\{-1,1\}}\mathop{\mathbb{E}}_{a^{t}\sim x^{t}}\left[\ell^{t}_{i,g,\sigma}\left(a^{t},b^{t}\right)\right]\leq\tfrac{1}{2rn}\quad\text{for every }t\in[T].$$

Now, for $T\geq\ln(2|\mathcal{G}|n)$, our framework's guarantees imply that the AMF regret $R^{T}=\max_{i\in[n],g\in\mathcal{G},\sigma\in\{-1,1\}}\sum_{t=1}^{T}\ell^{t}_{i,g,\sigma}(a^{t},b^{t})-\sum_{t=1}^{T}w^{t}_{A}$ satisfies $\mathop{\mathbb{E}}[R^{T}]\leq 4\sqrt{T\ln(2|\mathcal{G}|n)}$ over the Learner's randomness. Since $\sum_{t=1}^{T}w^{t}_{A}\leq\frac{T}{2rn}$, we get $\mathop{\mathbb{E}}\left[\max_{i\in[n],g\in\mathcal{G},\sigma\in\{-1,1\}}\sum_{t=1}^{T}\ell^{t}_{i,g,\sigma}(a^{t},b^{t})\right]\leq\frac{T}{2rn}+4\sqrt{T\ln(2|\mathcal{G}|n)}$.

This gives $(\alpha,n)$-multicalibration with $\mathop{\mathbb{E}}[\alpha]\leq\frac{1}{T}\left(\frac{T}{2rn}+4\sqrt{T\ln(2|\mathcal{G}|n)}\right)=\frac{1}{2rn}+4\sqrt{\frac{\ln(2|\mathcal{G}|n)}{T}}$. The high-probability bound on $\alpha$ is obtained similarly.

Simplifying the Learner's algorithm: To attain the AMF value $w^{t}_{A}\leq\frac{1}{2rn}$ at each round, our framework has the Learner solve a linear program (encoding her minimax strategy). However, she can obtain the almost optimal value $\tfrac{1}{rn}$ without solving an LP: this observation gives Algorithm 3 (see Appendix B). The resulting guarantees on $\alpha$ only differ from the optimal ones by replacing $\frac{1}{2rn}\to\frac{1}{rn}$. ∎
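As a small companion to this guarantee, the following Python sketch (our own diagnostic, not part of Algorithm 3) computes the empirical multicalibration error $\alpha$ of Definition 4.1 from a realized transcript:

    import numpy as np

    def multicalibration_error(thetas, preds, labels, groups, n):
        # thetas: length-T list of contexts; preds, labels: length-T arrays in [0, 1];
        # groups: dict mapping a group name to a membership test theta -> bool;
        # n: number of calibration buckets B^1_n, ..., B^n_n.
        preds, labels = np.asarray(preds, float), np.asarray(labels, float)
        T = len(preds)
        buckets = np.minimum((preds * n).astype(int), n - 1)   # bucket index of a^t
        alpha = 0.0
        for g in groups.values():
            in_g = np.array([bool(g(th)) for th in thetas])
            for i in range(n):
                mask = in_g & (buckets == i)
                alpha = max(alpha, abs(np.sum((labels - preds)[mask])) / T)
        return alpha

    # illustrative usage: groups = {"all": lambda th: True, "group_g": lambda th: th in g}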

4.2 Multicalibeating

We now give an approach to "beating" arbitrary collections of online forecasters via online multicalibration. The goal, called calibeating by Foster and Hart (2021), who introduce the problem, is to make calibrated forecasts that are more accurate than each of an arbitrary set of forecasters, by exactly the calibration error in hindsight of that forecaster. They achieve optimal calibeating bounds for a single forecaster, but their extension to calibeating multiple forecasters incurs at least a polynomial dependence on the number of forecasters. We achieve a logarithmic dependence on the number of forecasters. Additionally, we are able to simultaneously calibeat the forecasters on all (big enough) subgroups in some set $\mathcal{G}$, with still only a logarithmic dependence on $|\mathcal{G}|$ and on the number of forecasters in the group-wise convergence bound. We call this multicalibeating. We now give an overview of our setting, results, and techniques. For full details, see Appendix C.

Setting

The Learner (predictor $a=\{a^{t}\}_{t\in[T]}$) and the Adversary (true labels $b=\{b^{t}\}_{t\in[T]}$) interact in the same way as in Section 4.1, but the Adversary additionally reveals to the Learner a finite set of forecasters $\mathcal{F}$, where each $f\in\mathcal{F}$ is a function $f:\Theta\rightarrow D_{f}$. Here $D_{f}\subset[0,1]$ is assumed to be a finite set of all possible forecasts that $f$ makes: it will characterize the level sets of $f$. We often suppress the dependence on the transcript, denoting by $f^{t}\in D_{f}$ the forecast at time $t$.

The Learner's goal is to "improve on" the forecasts of all $f\in\mathcal{F}$, for some suitable scoring of the predictions. We measure the Learner's and the forecasters' accuracy via the squared error, alternatively known as the Brier score.

Definition 4.2 (Brier Score).

The Brier score of a forecaster $f$ over all rounds $t\in[T]$ is defined as: $\mathcal{B}^{f}(\pi^{T}):=\frac{1}{T}\sum_{t\in[T]}(f^{t}-b^{t})^{2}$.

The Brier score can be decomposed into so-called calibration and refinement parts. The former quantifies the extent to which the predictor is calibrated, while the latter expresses the average amount of variance in predictions within every calibration bucket.

To define this decomposition, we need some extra notation. We denote by $S_{i}$ the subsequence of days on which the Learner's prediction is in bucket $i$.[5] Similarly, $S^{d}(f)$ (eliding $(f)$ when clear from context) denotes the days on which forecaster $f$ predicts $d$. We let $S_{i}^{d}(f)=S_{i}\cap S^{d}(f)$. Finally, we use bars to indicate average predictions over given subsequences. For instance, $\bar{a}(S)$ is the Learner's average prediction over a given subsequence $S$.

[5] Note that $S_{i}$ depends implicitly on the bucketing parameter $n$ and the transcript $\pi^{T}$.

Definition 4.3 (Calibration and Refinement).

The calibration score $\mathcal{K}$ and refinement score $\mathcal{R}$ of a forecaster $f$ over the full transcript $\pi^{T}$ are defined as:

$$\mathcal{K}^{f}(\pi^{T}):=\frac{1}{T}\sum_{d\in D_{f}}|S^{d}|\left(d-\bar{b}(S^{d})\right)^{2},\qquad\mathcal{R}^{f}(\pi^{T}):=\frac{1}{T}\sum_{d\in D_{f}}\sum_{t\in S^{d}}\left(b^{t}-\bar{b}(S^{d})\right)^{2}.$$
Fact 1 (Calibration-Refinement Decomposition of Brier Score (DeGroot and Fienberg, 1983)).

$$\mathcal{B}^{f}(\pi^{T})=\mathcal{K}^{f}(\pi^{T})+\mathcal{R}^{f}(\pi^{T}).$$
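Since Fact 1 is an exact identity, it can be checked numerically; the sketch below (illustrative Python, with names of our own choosing) computes all three scores for a forecaster taking finitely many values:

    import numpy as np

    def brier_decomposition(forecasts, labels):
        # forecasts, labels: length-T arrays in [0, 1]; forecasts take finitely many values.
        f, b = np.asarray(forecasts, float), np.asarray(labels, float)
        T = len(f)
        brier = np.mean((f - b) ** 2)
        calibration = refinement = 0.0
        for d in np.unique(f):
            Sd = (f == d)                 # days on which the forecaster predicts d
            b_bar = b[Sd].mean()          # average realized label on those days
            calibration += Sd.sum() * (d - b_bar) ** 2 / T
            refinement += np.sum((b[Sd] - b_bar) ** 2) / T
        return brier, calibration, refinement   # brier == calibration + refinement (Fact 1)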

The goal of calibeating is to beat the forecaster's Brier score by an amount equal to its calibration score, or equivalently, to attain a Brier score (almost) equal to the refinement score of the forecaster.

Definition 4.4 (Calibeating).

The Learner's predictor $a$ is said to $\tau$-calibeat a forecaster $f$ if $\mathcal{B}^{a}(\pi^{T})\leq\mathcal{R}^{f}(\pi^{T})+\tau$.

We will now extend the definition of calibeating simultaneously along two natural directions. First, we will want to calibeat multiple forecasters at once. The second extension is that we will want to calibeat the forecasters not just overall, but also on each of the subsequences corresponding to each "population group" $g\in\mathcal{G}$ in a given family of subpopulations $\mathcal{G}\subseteq 2^{\Theta}$.

Definition 4.5 (Multicalibeating).

Given a family of forecasters $\mathcal{F}$, groups $\mathcal{G}\subseteq 2^{\Theta}$, and a mapping $\beta:\mathcal{F}\times\mathcal{G}\rightarrow\mathbb{R}_{\geq 0}$, the Learner's predictor $a$ is an $(\mathcal{F},\mathcal{G},\beta)$-multicalibeater if for every $g\in\mathcal{G}$:

$$\mathcal{B}^{a}\left(\pi^{T}|_{\{t:\theta^{t}\in g\}}\right)\leq\min_{f\in\mathcal{F}}\left\{\mathcal{R}^{f}\left(\pi^{T}|_{\{t:\theta^{t}\in g\}}\right)+\beta(f,g)\right\}.$$

Note that $(\{f\},\{\Theta\},\beta(f,\Theta):=\tau)$-multicalibeating is equivalent to $\tau$-calibeating the forecaster $f$.

We first show how to calibeat a single forecaster (Definition 4.4). The modularity of multicalibration will then let us easily extend this result to multiple forecasters and population subgroups.

The idea is to show that if our predictor is multicalibrated with respect to the level sets of $f$, then we achieve calibeating. Hébert-Johnson et al. (2018) give a similar bound in the batch setting. We denote the collection of level sets of $f$ by: $\mathcal{S}(f):=\{\theta\in\Theta:f(\theta)=d\}_{d\in D_{f}}$.

Theorem 4.2 (Calibeating One Forecaster).

Suppose that the Learner's predictions $a$ are $(\alpha,n)$-multicalibrated on the collection of groups $\mathcal{S}(f)\cup\{\Theta\}$. Then the Learner is $(\alpha,n)$-calibrated on $\Theta$, and she $\left(\alpha n(|D_{f}|+2)+\frac{2}{n}\right)$-calibeats forecaster $f$.

Proof sketch.

We show that $a$ has a small calibration score, and a refinement score close to that of $f$.

Step 1: Replace $\mathcal{B}^{a}$ with a surrogate Brier score $\mathcal{B}^{a}_{n}$. Consider a (pseudo-)predictor $\tilde{a}$ given by $\tilde{a}^{t}=\bar{a}(S_{i_{a^{t}}})$ for $t\in[T]$ (where $i_{a^{t}}$ is the bucket of $a^{t}$). That is, whenever $a^{t}\in B^{i}_{n}$, $\tilde{a}^{t}$ predicts the average of $a$ over all rounds $s\in[T]$ such that $a^{s}\in B^{i}_{n}$. This is a pseudo-predictor, as the bucket averages of $a$ are unknown until after round $T$. Thus, $\tilde{a}$ has precisely $n$ level sets, unlike $a$. Now, we define $\mathcal{B}^{a}_{n},\mathcal{K}^{a}_{n},\mathcal{R}^{a}_{n}$ to be the Brier, calibration, and refinement scores of $\tilde{a}$. We can show $\mathcal{B}^{a}\leq\mathcal{B}^{a}_{n}+1/n$, allowing us to switch to bounding the more manageable Brier loss $\mathcal{B}^{a}_{n}=\mathcal{K}_{n}^{a}+\mathcal{R}^{a}_{n}$.

Step 2: Bound the surrogate calibration score $\mathcal{K}_{n}^{a}$. Since the Learner is $(\alpha,n)$-calibrated on the domain $\Theta$, the calibration error per level set is at most $\alpha$. There are $n$ level sets, so $\mathcal{K}_{n}^{a}\leq\alpha n$.

Step 3: Bound the surrogate refinement score $\mathcal{R}^{a}_{n}$. We connect $\mathcal{R}^{f}$ and $\mathcal{R}^{a}_{n}$ via a joint refinement score $\mathcal{R}^{f\times a}$, which measures the average variance of the partition generated by all intersections of the level sets of $a$ and $f$. The finer the partition, the smaller the refinement score, so $\mathcal{R}^{f}\geq\mathcal{R}^{f\times a}$. Next, informally, multicalibration ensures that $a$ has already "captured" most of the variance explained by $f$; therefore, refining $a$'s level sets by $f$ does little to reduce variance. More precisely, we show that $\mathcal{R}^{a}_{n}\leq\mathcal{R}^{f\times a}+\alpha n(|D_{f}|+1)+\frac{1}{n}$. Combining with our previous inequality, we have: $\mathcal{R}^{a}_{n}\leq\mathcal{R}^{f}+\alpha n(|D_{f}|+1)+\frac{1}{n}$.

Combining the above, we get: $\mathcal{B}^{a}\leq\mathcal{R}^{a}_{n}+\mathcal{K}_{n}^{a}+\frac{1}{n}\leq\left(\mathcal{R}^{f}+\alpha n(|D_{f}|+1)+\frac{1}{n}\right)+\alpha n+\frac{1}{n}$. ∎

Calibeating many forecasters

Generalizing the above construction, we can easily calibeat any collection of forecasters $\mathcal{F}$ on the entire context space $\Theta$: it suffices to ask for multicalibration with respect to the level sets of all forecasters, i.e., $\left(\bigcup_{f\in\mathcal{F}}\mathcal{S}(f)\right)\cup\{\Theta\}$. Theorem 4.2 then applies separately to each $f$; the only degradation in the guarantees will come in the form of a larger $\alpha$, since we are asking for multicalibration with respect to more groups than before. But this effect will be small, since $\alpha$ depends on the number of required groups $|\mathcal{G}^{\prime}|$ only as $O(\sqrt{\ln|\mathcal{G}^{\prime}|})$. See Corollary C.2.

However, to fully satisfy Definition 4.5 of multicalibeating, we need to calibeat all $f\in\mathcal{F}$ on all groups $g\in\mathcal{G}$ in a given collection $\mathcal{G}\subseteq 2^{\Theta}$. For that, we simply extend the above construction by requiring multicalibration with respect to all pairwise intersections of the forecasters' level sets with the groups $g\in\mathcal{G}$. By further augmenting this collection with the protected groups $\mathcal{G}$ themselves, we finally achieve our ultimate goal: simultaneous multicalibeating and multicalibration.

Theorem 4.3 (Multicalibeating + Multicalibration).

Let $\mathcal{G}\subseteq 2^{\Theta}$, and let $\mathcal{F}$ be some set of forecasters $f:\Theta\rightarrow D_{f}$. The multicalibration algorithm on $\mathcal{G}^{\prime}:=\left(\bigcup_{f\in\mathcal{F}}\{g\cap S:(g,S)\in\mathcal{G}\times\mathcal{S}(f)\}\right)\cup\mathcal{G}$ with parameters $r,n\geq 1$, after $T$ rounds, attains expected $(\mathcal{F},\mathcal{G},\beta)$-multicalibeating, where:[6]

$$\mathop{\mathbb{E}}[\beta(f,g)]\leq\frac{2}{n}+\frac{|D_{f}|+2}{r\cdot|S(g)|/T}+4n(|D_{f}|+2)\sqrt{\frac{1}{|S(g)|^{2}/T}\ln\left(2n|\mathcal{G}|\Big(1+{\textstyle\sum_{f\in\mathcal{F}}}|D_{f}|\Big)\right)}\quad\text{for all }f\in\mathcal{F},\,g\in\mathcal{G},$$

while maintaining $(\alpha,n)$-multicalibration on $\mathcal{G}$, with:

$$\mathop{\mathbb{E}}[\alpha]\leq\frac{1}{rn}+4\sqrt{\frac{1}{T}\ln\left(2n|\mathcal{G}|\Big(1+{\textstyle\sum_{f\in\mathcal{F}}}|D_{f}|\Big)\right)}.$$

[6] $S(g)$ denotes the subsequence of days on which a group $g$ occurs, suppressing the dependence on the transcript.

In particular, for any group $g$ occurring more than a $T^{-1/2}$ fraction of the time, we asymptotically converge to $\frac{1}{n}$-calibeating as $T\to\infty$, thus combining the goals of online multicalibration and multigroup regret.

5 Polytope Blackwell Approachability

Consider a setting where the Learner and the Adversary are playing a repeated game with vector-valued payoffs, in which the Learner always goes first and aims to force the average payoff over the entire interaction to approach a given convex set. Blackwell’s Theorem (1956) states that a convex set is approachable if and only if it is response-satisfiable (roughly, for any choice of the Adversary, the Learner has a response forcing the one-round payoff inside the convex set). The rate of approachability typically depends on the dimension of the payoff vectors.

This is a specialization of our framework to a case in which the environment is fixed at every round. Thus our framework can be used to obtain a dimension-independent rate bound in the fundamental case where the approachable set is a convex polytope. Our bound is only logarithmic in the polytope’s number of facets, and is achievable via an efficient convex-programming based algorithm.

Let us formalize our setting. In rounds $t=1,2,\ldots$, the Learner and the Adversary play a repeated game. Their respective pure strategy sets are $\mathcal{A}$ and $\mathcal{Y}$, where $\mathcal{A}$ is a finite set and $\mathcal{Y}\subseteq\mathbb{R}^{m}$ (for some integer $m\geq 1$) is convex and compact. The game's utility function is $\lambda$-dimensional (for some integer $\lambda\geq 1$), continuous, concave in the second argument, and is denoted by $u:\mathcal{A}\times\mathcal{Y}\to\mathbb{R}^{\lambda}$. At each round $t$, the Learner plays a mixed strategy $x^{t}\in\Delta\mathcal{A}$, the Adversary responds with some $y^{t}\in\mathcal{Y}$, and the Learner then samples a pure action $a^{t}\sim x^{t}$. This gives rise to the utility vector $u(a^{t},y^{t})$. The average play up to any round $t\geq 1$ is then defined as $\bar{u}^{t}=\frac{1}{t}\sum_{s=1}^{t}u(a^{s},y^{s})$.

The target convex set that the Learner wants to approach is a polytope $\mathcal{P}(\mathcal{H})\subseteq\mathbb{R}^{\lambda}$, defined as the intersection of a finite collection of halfspaces $\mathcal{H}=(h_{\alpha,\beta})$, where for any given $\alpha\in\mathbb{R}^{\lambda}$, $\beta\in\mathbb{R}$ we denote $h_{\alpha,\beta}=\{x\in\mathbb{R}^{\lambda}:\langle\alpha,x\rangle-\beta\leq 0\}$. Finally, by way of normalization, consider any two dual norms $\|\cdot\|_{p}$ and $\|\cdot\|_{q}$. We require, first, that $\|\alpha\|_{p}\leq 1$ and $|\beta|\leq 1$ for each halfspace $h_{\alpha,\beta}\in\mathcal{H}$; and second, that the payoffs lie in the $\|\cdot\|_{q}$-unit ball: $\|u(a,y)\|_{q}\leq 1$ for all $a\in\mathcal{A},y\in\mathcal{Y}$.
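To connect this setting to the framework of Section 2, one can introduce one loss coordinate per halfspace, $\ell_{(\alpha,\beta)}(a,y)=\langle\alpha,u(a,y)\rangle-\beta$, so that the maximum coordinate of the average loss is exactly the largest constraint violation of the average payoff. The Python sketch below (an illustration under these assumptions, with names of our own choosing, for finite adversary action sets) builds that loss tensor, which can then be fed to the generic learner at every round since the environment is fixed:

    import numpy as np

    def halfspace_losses(U, alphas, betas):
        # U: payoff tensor of shape (n_actions, n_adv_actions, lam), with ||u(a,y)||_q <= 1;
        # alphas: (num_halfspaces, lam) with ||alpha||_p <= 1; betas: (num_halfspaces,).
        # Returns losses of shape (num_halfspaces, n_actions, n_adv_actions) with
        # ell_j(a, y) = <alpha_j, u(a, y)> - beta_j, so the time-average of coordinate j
        # equals the violation <alpha_j, u_bar^t> - beta_j of the j-th constraint.
        proj = np.einsum('jk,ayk->jay', alphas, U)
        return proj - betas[:, None, None]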

Theorem 5.1 (Polytope Blackwell Approachability).

Suppose the target convex polytope $\mathcal{P}(\mathcal{H})$ is response-satisfiable, in the sense that for any Adversary action $y\in\mathcal{Y}$, the Learner has a mixed response $x\in\Delta\mathcal{A}$ that places the expected payoff inside $\mathcal{P}(\mathcal{H})$: that is, $\mathop{\mathbb{E}}_{a\sim x}[u(a,y)]\in\mathcal{P}(\mathcal{H})$.

Then, $\mathcal{P}(\mathcal{H})$ is approachable, both in expectation and with high probability with respect to the transcript of the interaction. Namely, the Learner has an efficient convex-programming-based algorithm which guarantees both of the following conditions simultaneously:

  1. For any margin $\epsilon>0$, the average play $\bar{u}^{t}$ up to any round $t\geq\frac{64\ln|\mathcal{H}|}{\epsilon^{2}}$ will satisfy $\mathop{\mathbb{E}}\left[\max_{h_{\alpha,\beta}\in\mathcal{H}}\left(\langle\alpha,\bar{u}^{t}\rangle-\beta\right)\right]\leq\epsilon$.

  2. For any $\delta\in(0,1)$, the average play $\bar{u}^{t}$ up to any round $t\geq\ln|\mathcal{H}|$ will satisfy $\max_{h_{\alpha,\beta}\in\mathcal{H}}\left(\langle\alpha,\bar{u}^{t}\rangle-\beta\right)\leq 16\sqrt{\tfrac{1}{t}\ln\left(\tfrac{|\mathcal{H}|}{\delta}\right)}$ with probability at least $1-\delta$.

Acknowledgments

We thank Ira Globus-Harris, Chris Jung, and Kunal Talwar for helpful conversations at an early stage of this work. Supported in part by NSF grants AF-1763307, FAI-2147212, CCF-2217062, and CCF-1934876 and the Simons collaboration on algorithmic fairness.

References

  • Adamskiy et al. [2012] Dmitry Adamskiy, Wouter M Koolen, Alexey Chernov, and Vladimir Vovk. A closer look at adaptive regret. In International Conference on Algorithmic Learning Theory, pages 290–304. Springer, 2012.
  • Azar et al. [2014] Yossi Azar, Uriel Feige, Michal Feldman, and Moshe Tennenholtz. Sequential decision making with vector outcomes. In Proceedings of the 5th conference on Innovations in Theoretical Computer Science, pages 195–206, 2014.
  • Blackwell [1956] David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956.
  • Blum [1997] Avrim Blum. Empirical support for winnow and weighted-majority algorithms: Results on a calendar scheduling domain. Machine Learning, 26(1):5–23, 1997.
  • Blum and Lykouris [2020] Avrim Blum and Thodoris Lykouris. Advancing subgroup fairness via sleeping experts. In Innovations in Theoretical Computer Science Conference (ITCS), volume 11, 2020.
  • Blum and Mansour [2007] Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8(6), 2007.
  • DeGroot and Fienberg [1983] Morris H. DeGroot and Stephen E. Fienberg. The comparison and evaluation of forecasters. The Statistician, 32:12–22, 1983.
  • Drenska and Kohn [2020] Nadejda Drenska and Robert V Kohn. Prediction with expert advice: A pde perspective. Journal of Nonlinear Science, 30(1):137–173, 2020.
  • Dubhashi and Panconesi [2009] Devdatt P Dubhashi and Alessandro Panconesi. Concentration of measure for the analysis of randomized algorithms. Cambridge University Press, 2009.
  • Foster and Hart [2021] Dean P Foster and Sergiu Hart. “calibeating”: Beating forecasters at their own game. https://www.ma.huji.ac.il/~hart/papers/calib-beat.pdf, 2021.
  • Foster and Vohra [1998] Dean P Foster and Rakesh V Vohra. Asymptotic calibration. Biometrika, 85(2):379–390, 1998.
  • Freund et al. [1997] Yoav Freund, Robert E Schapire, Yoram Singer, and Manfred K Warmuth. Using and combining predictors that specialize. In Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, pages 334–343, 1997.
  • Fudenberg and Levine [1999] Drew Fudenberg and David K Levine. An easier way to calibrate. Games and Economic Behavior, 29(1-2):131–137, 1999.
  • Globus-Harris et al. [2022] Ira Globus-Harris, Michael Kearns, and Aaron Roth. Beyond the frontier: Fairness without accuracy loss. arXiv preprint arXiv:2201.10408, 2022.
  • Greenwald and Jafari [2003] Amy Greenwald and Amir Jafari. A general class of no-regret learning algorithms and game-theoretic equilibria. In Learning theory and kernel machines, pages 2–12. Springer, 2003.
  • Gupta et al. [2022] Varun Gupta, Christopher Jung, Georgy Noarov, Mallesh M. Pai, and Aaron Roth. Online Multivalid Learning: Means, Moments, and Prediction Intervals. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022), pages 82:1–82:24, 2022.
  • Hart [2020] Sergiu Hart. Calibrated forecasts: The minimax proof, 2020. URL http://www.ma.huji.ac.il/~hart/papers/calib-minmax.pdf.
  • Hart and Mas-Colell [2000] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
  • Hazan and Seshadhri [2009] Elad Hazan and Comandur Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 393–400, 2009.
  • Hébert-Johnson et al. [2018] Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, pages 1939–1948. PMLR, 2018.
  • Kesselheim and Singla [2020] Thomas Kesselheim and Sahil Singla. Online learning with vector costs and bandits with knapsacks. In Conference on Learning Theory, pages 2286–2305. PMLR, 2020.
  • Kleinberg et al. [2010] Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2):245–272, 2010.
  • Kobzar et al. [2020] Vladimir A Kobzar, Robert V Kohn, and Zhilei Wang. New potential-based bounds for prediction with expert advice. In Conference on Learning Theory, pages 2370–2405. PMLR, 2020.
  • Lehrer [2003] Ehud Lehrer. A wide range no-regret theorem. Games and Economic Behavior, 42(1):101–115, 2003.
  • Littlestone and Warmuth [1994] Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. Information and computation, 108(2):212–261, 1994.
  • Mannor et al. [2014a] Shie Mannor, Vianney Perchet, and Gilles Stoltz. Approachability in unknown games: Online learning meets multi-objective optimization. In Conference on Learning Theory, pages 339–355. PMLR, 2014a.
  • Mannor et al. [2014b] Shie Mannor, Vianney Perchet, and Gilles Stoltz. Set-valued approachability and online learning with partial monitoring. The Journal of Machine Learning Research, 15(1):3247–3295, 2014b.
  • Perchet [2015] Vianney Perchet. Exponential weight approachability, applications to calibration and regret minimization. Dynamic Games and Applications, 5(1):136–153, 2015.
  • Perchet and Mannor [2013] Vianney Perchet and Shie Mannor. Approachability, fast and slow. In Conference on Learning Theory, pages 474–488. PMLR, 2013.
  • Raghavan [1994] T.E.S. Raghavan. Zero-sum two-person games. In R.J. Aumann and S. Hart, editors, Handbook of Game Theory with Economic Applications, volume 2 of Handbook of Game Theory with Economic Applications, chapter 20, pages 735–768. Elsevier, 1994. URL https://ideas.repec.org/h/eee/gamchp/2-20.html.
  • Rakhlin et al. [2010] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Random averages, combinatorial parameters, and learnability. Advances in Neural Information Processing Systems, 23:1984–1992, 2010.
  • Rakhlin et al. [2011] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Beyond regret. In Proceedings of the 24th Annual Conference on Learning Theory, pages 559–594. JMLR Workshop and Conference Proceedings, 2011.
  • Rakhlin et al. [2012] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Relax and randomize: from value to algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems-Volume 2, pages 2141–2149, 2012.
  • Rothblum and Yona [2021] Guy N Rothblum and Gal Yona. Multi-group agnostic pac learnability. arXiv preprint arXiv:2105.09989, 2021.
  • Vovk [1990] Volodimir G Vovk. Aggregating strategies. In Proceedings of Computational Learning Theory, 1990.

Appendix A The General Framework with Extensions to Probabilistic and Approximate Learners: Full Proofs and Algorithms

A.1 Omitted Proofs from Section 2

Proof of Lemma 2.1.

After taking the log and dividing by η\eta, this lemma follows from the following chain:

exp(ηRT)=exp(ηmaxj[d]RjT)=exp(maxj[d]ηRjT)=maxj[d]exp(ηRjT)j[d]exp(ηRjT)=LT.\displaystyle\exp\left(\eta R^{T}\right)=\exp\left(\eta\max_{j\in[d]}R^{T}_{j}\right)=\exp\left(\max_{j\in[d]}\eta R^{T}_{j}\right)=\max_{j\in[d]}\exp\left(\eta R^{T}_{j}\right)\leq\sum_{j\in[d]}\exp\left(\eta R^{T}_{j}\right)=L^{T}.

Proof of Lemma 2.2.

By definition of the surrogate loss, we have:

LtLt1\displaystyle L^{t}-L^{t-1} =j[d]exp(ηRjt)j[d]exp(ηRjt1),\displaystyle=\sum_{j\in[d]}\exp\left(\eta R^{t}_{j}\right)-\sum_{j\in[d]}\exp\left(\eta R^{t-1}_{j}\right),
=j[d]exp(ηRjt1+η(jt(xt,yt)wAt))j[d]exp(ηRjt1),\displaystyle=\sum_{j\in[d]}\exp\left(\eta R^{t-1}_{j}+\eta\left(\ell^{t}_{j}\left(x^{t},y^{t}\right)-w^{t}_{A}\right)\right)-\sum_{j\in[d]}\exp\left(\eta R^{t-1}_{j}\right),
=j[d]exp(ηRjt1)(exp(η(jt(xt,yt)wAt))1).\displaystyle=\sum_{j\in[d]}\exp\left(\eta R^{t-1}_{j}\right)\left(\exp\left(\eta\left(\ell^{t}_{j}\left(x^{t},y^{t}\right)-w^{t}_{A}\right)\right)-1\right).
Using the fact that exp(x)1x+x2\exp(x)-1\leq x+x^{2} for |x|1|x|\leq 1, we have, for η2C1\eta\cdot 2C\leq 1,
j[d]exp(ηRjt1)(η(jt(xt,yt)wAt)+η2(jt(xt,yt)wAt)2),\displaystyle\leq\sum_{j\in[d]}\exp\left(\eta R^{t-1}_{j}\right)\left(\eta\left(\ell^{t}_{j}(x^{t},y^{t})-w^{t}_{A}\right)+\eta^{2}\left(\ell^{t}_{j}(x^{t},y^{t})-w^{t}_{A}\right)^{2}\right),
ηj[d]exp(ηRjt1)(jt(xt,yt)wAt)+η2(2C)2Lt1.\displaystyle\leq\eta\sum_{j\in[d]}\exp\left(\eta R^{t-1}_{j}\right)\left(\ell^{t}_{j}\left(x^{t},y^{t}\right)-w^{t}_{A}\right)+\eta^{2}(2C)^{2}L^{t-1}.

Proof of Lemma 2.3.

We begin by recalling that L^{0}=d. Thus, the desired bound on L^{T} follows via Lemma 2.2 and a telescoping argument, provided we can show that for every t\in[T] the Learner has an action x^{t}\in\mathcal{X}^{t} which guarantees that, for any y^{t}\in\mathcal{Y}^{t},

ηj[d]exp(ηRjt1)(jt(xt,yt)wAt)0.\eta\sum_{j\in[d]}\exp\left(\eta R^{t-1}_{j}\right)\left(\ell^{t}_{j}(x^{t},y^{t})-w^{t}_{A}\right)\leq 0.

To this end, we define a zero-sum game between the Learner and the Adversary, with action space 𝒳t\mathcal{X}^{t} for the Learner and 𝒴t\mathcal{Y}^{t} for the Adversary, and with the objective function (which the Adversary wants to maximize and the Learner wants to minimize):

ut(x,y):=j[d]exp(ηRjt1)(jt(x,y)wAt), for all x𝒳t,y𝒴t.u^{t}(x,y):=\sum_{j\in[d]}\exp\left(\eta R^{t-1}_{j}\right)\left(\ell^{t}_{j}(x,y)-w^{t}_{A}\right),\text{ for all }x\in\mathcal{X}^{t},y\in\mathcal{Y}^{t}.

Recall from the definition of our framework that 𝒳t,𝒴t\mathcal{X}^{t},\mathcal{Y}^{t} are convex, compact and finite-dimensional, as well as that each jt\ell^{t}_{j} is continuous, convex in the first argument, and concave in the second argument. Since utu^{t} is defined as an affine function of the individual coordinate functions jt\ell^{t}_{j}, utu^{t} is also convex-concave and continuous. This means that we may invoke Sion’s Minimax Theorem:

Fact 2 (Sion’s Minimax Theorem).

Given finite-dimensional convex compact sets 𝒳,𝒴\mathcal{X},\mathcal{Y}, and a continuous function f:𝒳×𝒴f:\mathcal{X}\times\mathcal{Y}\to\mathbb{R} which is convex in the first argument and concave in the second argument, it holds that

minx𝒳maxy𝒴f(x,y)=maxy𝒴minx𝒳f(x,y).\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}}f(x,y)=\max_{y\in\mathcal{Y}}\min_{x\in\mathcal{X}}f(x,y).

Using Sion's Theorem to switch the order of play (so that the Adversary is compelled to move first), and then recalling the definition of w^{t}_{A} (the maximum coordinate value of \ell^{t} that the Learner can obtain when the Adversary is compelled to move first), we obtain the following chain. (Note that in the third step, \max_{y^{t}\in\mathcal{Y}^{t}} turns into \sup_{y^{t}\in\mathcal{Y}^{t}}: once each \left(\ell^{t}_{j^{\prime}}\left(x^{t},y^{t}\right)-w^{t}_{A}\right) is replaced with \max_{j}\left(\ell^{t}_{j}\left(x^{t},y^{t}\right)-w^{t}_{A}\right), the maximum over y generally becomes unachievable; recall Footnote 1.)

minxt𝒳tmaxyt𝒴tut(xt,yt)\displaystyle\min_{x^{t}\in\mathcal{X}^{t}}\max_{y^{t}\in\mathcal{Y}^{t}}u^{t}\left(x^{t},y^{t}\right) =maxyt𝒴tminxt𝒳tut(xt,yt)\displaystyle=\max_{y^{t}\in\mathcal{Y}^{t}}\min_{x^{t}\in\mathcal{X}^{t}}u^{t}\left(x^{t},y^{t}\right)
=maxyt𝒴tminxt𝒳tj[d]exp(ηRjt1)(jt(xt,yt)wAt),\displaystyle=\max_{y^{t}\in\mathcal{Y}^{t}}\min_{x^{t}\in\mathcal{X}^{t}}\sum_{j^{\prime}\in[d]}\exp\left(\eta R^{t-1}_{j^{\prime}}\right)\cdot\left(\ell^{t}_{j^{\prime}}\left(x^{t},y^{t}\right)-w^{t}_{A}\right),
supyt𝒴tminxt𝒳tj[d]exp(ηRjt1)maxj[d](jt(xt,yt)wAt),\displaystyle\leq\adjustlimits{\sup}_{y^{t}\in\mathcal{Y}^{t}}{\min}_{x^{t}\in\mathcal{X}^{t}}\sum_{j^{\prime}\in[d]}\exp\left(\eta R^{t-1}_{j^{\prime}}\right)\cdot\max_{j\in[d]}\left(\ell^{t}_{j}\left(x^{t},y^{t}\right)-w^{t}_{A}\right),
=j[d]exp(ηRjt1)supyt𝒴tminxt𝒳tmaxj[d](jt(xt,yt)wAt),\displaystyle=\sum_{j^{\prime}\in[d]}\exp\left(\eta R^{t-1}_{j^{\prime}}\right)\cdot\adjustlimits{\sup}_{y^{t}\in\mathcal{Y}^{t}}{\min}_{x^{t}\in\mathcal{X}^{t}}\max_{j\in[d]}\left(\ell^{t}_{j}\left(x^{t},y^{t}\right)-w^{t}_{A}\right),
=j[d]exp(ηRjt1)(wAtwAt),\displaystyle=\sum_{j^{\prime}\in[d]}\exp\left(\eta R^{t-1}_{j^{\prime}}\right)\cdot\left(w^{t}_{A}-w^{t}_{A}\right),
=0.\displaystyle=0.

Thus, the Learner can ensure that Lt(4η2C2+1)Lt1L^{t}\leq\left(4\eta^{2}C^{2}+1\right)L^{t-1} by playing at every round tt:

xtargminx𝒳tmaxy𝒴tut(x,y).x^{t}\in\adjustlimits{\mathop{\mathrm{argmin}}}_{x\in\mathcal{X}^{t}}{\max}_{y\in\mathcal{Y}^{t}}u^{t}(x,y).

This concludes the proof. ∎

An equivalent description of the Learner's space of minimax optimal strategies at each round t

We observe that the Learner’s optimal action at each round, derived in the proof, can be expressed without any reference to the quantities wAtw^{t}_{A}:

xt\displaystyle x^{t} \displaystyle\in argminx𝒳tmaxy𝒴tj[d]exp(ηRjt1)(jt(x,y)wAt),\displaystyle\adjustlimits{\mathop{\mathrm{argmin}}}_{x\in\mathcal{X}^{t}}{\max}_{y\in\mathcal{Y}^{t}}\sum_{j\in[d]}\exp(\eta R^{t-1}_{j})(\ell^{t}_{j}(x,y)-w^{t}_{A}),
=\displaystyle= argminx𝒳tmaxy𝒴tj[d]exp(ηRjt1)jt(x,y),\displaystyle\adjustlimits{\mathop{\mathrm{argmin}}}_{x\in\mathcal{X}^{t}}{\max}_{y\in\mathcal{Y}^{t}}\sum_{j\in[d]}\exp(\eta R^{t-1}_{j})\ell^{t}_{j}(x,y),
=\displaystyle= argminx𝒳tmaxy𝒴tj[d]exp(ηs=1t1js(xs,ys))jt(x,y)exp(ηs=1t1wAs),\displaystyle\adjustlimits{\mathop{\mathrm{argmin}}}_{x\in\mathcal{X}^{t}}{\max}_{y\in\mathcal{Y}^{t}}\sum_{j\in[d]}\frac{\exp\left(\eta\sum_{s=1}^{t-1}\ell_{j}^{s}(x^{s},y^{s})\right)\ell^{t}_{j}(x,y)}{\exp\left(\eta\sum_{s=1}^{t-1}w_{A}^{s}\right)},
=\displaystyle= argminx𝒳tmaxy𝒴tj[d]exp(ηs=1t1js(xs,ys))jt(x,y),\displaystyle\adjustlimits{\mathop{\mathrm{argmin}}}_{x\in\mathcal{X}^{t}}{\max}_{y\in\mathcal{Y}^{t}}\sum_{j\in[d]}\exp\left(\eta\sum_{s=1}^{t-1}\ell_{j}^{s}(x^{s},y^{s})\right)\ell^{t}_{j}(x,y),
=\displaystyle= argminx𝒳tmaxy𝒴tj[d]exp(ηs=1t1js(xs,ys))i[d]exp(ηs=1t1is(xs,ys))jt(x,y).\displaystyle\adjustlimits{\mathop{\mathrm{argmin}}}_{x\in\mathcal{X}^{t}}{\max}_{y\in\mathcal{Y}^{t}}\sum_{j\in[d]}\frac{\exp\left(\eta\sum_{s=1}^{t-1}\ell_{j}^{s}(x^{s},y^{s})\right)}{\sum_{i\in[d]}\exp\left(\eta\sum_{s=1}^{t-1}\ell_{i}^{s}(x^{s},y^{s})\right)}\ell^{t}_{j}(x,y).

The weights placed on the loss coordinates \ell_{j}^{t}(x,y) in the final expression form a probability distribution, which should remind the reader of the well-known Exponential Weights distribution.
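To make this connection concrete, here is a minimal Python sketch (our own illustration, not part of the paper's framework; the helper name coordinate_weights and the array cumulative_losses are hypothetical) of how this Exponential-Weights-style distribution over the d loss coordinates can be computed from the realized cumulative per-coordinate losses:

import numpy as np

def coordinate_weights(cumulative_losses: np.ndarray, eta: float) -> np.ndarray:
    """Exponential-Weights-style distribution over the d loss coordinates.

    cumulative_losses[j] holds the realized cumulative loss of coordinate j,
    i.e. the sum over past rounds s of l_j^s(x^s, y^s); eta is the learning rate.
    """
    # Subtracting the maximum before exponentiating is a standard numerical-stability
    # trick; the shift cancels after normalization and leaves the weights unchanged.
    shifted = eta * (cumulative_losses - cumulative_losses.max())
    weights = np.exp(shifted)
    return weights / weights.sum()

# Example: the coordinate with the largest accumulated loss receives the largest weight.
print(coordinate_weights(np.array([5.0, 2.0, -1.0]), eta=0.1))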

A.2 Extensions

Before presenting applications of our framework, we pause to discuss two natural extensions that are called for in some of our applications. Both extensions require only minimal changes to the notation in Section 2.1 and to the general algorithmic framework in Section 2.2.

We begin by discussing, in Section A.2.1, how to adapt our framework to the setting where the Learner is allowed to randomize at each round amongst a finite set of actions, and wishes to obtain probabilistic guarantees for her AMF regret with respect to her randomness. This will be useful in all three of our applications.

We then proceed to show, in Section A.2.2, that our AMF regret bounds are robust to the case in which at each round, the Learner, who is playing according to the general Algorithm 1 given above, computes and plays according to an approximate (rather than exact) minimax strategy. This is useful for settings where it may be desirable (for computational or other reasons) to implement our algorithmic framework approximately, rather than exactly. In particular, in one of our applications — mean multicalibration, which is discussed in Section 4.1 — we will illustrate this point by deriving a multicalibration algorithm that has the Learner play only extremely (computationally and structurally) simple strategies, at the cost of adding an arbitrarily small term to the multicalibration bounds, compared to the Learner that plays the exact minimax equilibrium.

A.2.1 Performance Bounds for a Probabilistic Learner

So far, we have described the interaction between the Learner and the Adversary as deterministic. In many applications, however, the convex action space for the Learner is the simplex over some finite set of base actions, representing probability distributions over actions. In this case, the Adversary chooses his action in response to the probability distribution over base actions chosen by the Learner, at which point the Learner samples a single base action from her chosen distribution.

We will use the following notation. The Learner’s pure action set at time tt is denoted by 𝒜t\mathcal{A}^{t}. Before each round tt, the Adversary reveals a vector valued loss function t:𝒜t×𝒴t[C,C]d\ell^{t}:\mathcal{A}^{t}\times\mathcal{Y}^{t}\to[-C,C]^{d}. At the beginning of round tt, the Learner chooses a probabilistic mixture over her action set 𝒜t\mathcal{A}^{t}, which we will usually denote as xtΔ𝒜tx^{t}\in\Delta\mathcal{A}^{t}; after the Adversary has made his move, the Learner samples her pure action ata^{t} for the round, which is recorded into the transcript of the interaction.

The redefined vector valued losses t\ell^{t} now take as their first argument a pure action a𝒜ta\in\mathcal{A}^{t}. We extend this to 𝒳t:=Δ𝒜t\mathcal{X}^{t}:=\Delta\mathcal{A}^{t} as t(xt,yt):=𝔼atxt[t(at,yt)]\ell^{t}(x^{t},y^{t}):=\mathop{\mathbb{E}}_{a^{t}\sim x^{t}}[\ell^{t}(a^{t},y^{t})] for any xtΔ𝒜tx^{t}\in\Delta\mathcal{A}^{t}. In this notation, holding the second argument fixed, the loss function is linear (hence convex and continuous) and has a convex, compact domain (the simplex Δ𝒜t\Delta\mathcal{A}^{t}). Using this extended notation, it is now easy to see how to define the probabilistic analog of the AMF value.

Definition A.1 (Probabilistic AMF Value).
wAt:=supyt𝒴tminxt𝒳tmaxj[d]jt(xt,yt)=supyt𝒴tminxtΔ𝒜tmaxj[d]𝔼atxt[jt(at,yt)].w^{t}_{A}:=\adjustlimits{\sup}_{y^{t}\in\mathcal{Y}^{t}}{\min}_{x^{t}\in\mathcal{X}^{t}}\max_{j\in[d]}\ell^{t}_{j}(x^{t},y^{t})=\adjustlimits{\sup}_{y^{t}\in\mathcal{Y}^{t}}{\min}_{x^{t}\in\Delta\mathcal{A}^{t}}\max_{j\in[d]}\mathop{\mathbb{E}}_{a^{t}\sim x^{t}}\left[\ell^{t}_{j}(a^{t},y^{t})\right].

For a more detailed discussion of the probabilistic setting, please refer to Appendix A.3.

Adapting the algorithm to the probabilistic Learner setting

Above, Algorithm 1 was given for the deterministic case of our framework. In the probabilistic setting, when computing the probability distribution for the current round, the Learner should take into account the realized losses from the past rounds. We present the modified algorithm below.

  for rounds t=1,,Tt=1,\dots,T do
     Learn adversarially chosen 𝒜t,𝒴t\mathcal{A}^{t},\mathcal{Y}^{t}, and vector loss function t(,):𝒜t×𝒴t[C,C]d\ell^{t}(\cdot,\cdot):\mathcal{A}^{t}\times\mathcal{Y}^{t}\to[-C,C]^{d}.
     Let
χjt:=exp(ηs=1t1js(as,ys))i[d]exp(ηs=1t1is(as,ys)) for j[d].\chi^{t}_{j}:=\frac{\exp\left(\eta\sum_{s=1}^{t-1}\ell_{j}^{s}(a^{s},y^{s})\right)}{\sum_{i\in[d]}\exp\left(\eta\sum_{s=1}^{t-1}\ell_{i}^{s}(a^{s},y^{s})\right)}\text{ for $j\in[d]$}.
     Select a mixed action xtΔ𝒜tx^{t}\in\Delta\mathcal{A}^{t}, where
xtargminxΔ𝒜tmaxy𝒴tj[d]χjtjt(x,y).x^{t}\in\adjustlimits{\mathop{\mathrm{argmin}}}_{x\in\Delta\mathcal{A}^{t}}{\max}_{y\in\mathcal{Y}^{t}}\sum_{j\in[d]}\chi^{t}_{j}\cdot\ell^{t}_{j}(x,y).
     Observe the Adversary’s selection of yt𝒴ty^{t}\in\mathcal{Y}^{t}.
     Sample pure action atxta^{t}\sim x^{t}.
Algorithm 2 General Algorithm for the Probabilistic Learner
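To illustrate the per-round computation in Algorithm 2, the following Python sketch solves the Learner's minimax problem under an additional simplifying assumption that is not required by the framework: the Adversary's action set is also a probability simplex over finitely many pure actions, and every loss coordinate is bilinear, given by a payoff matrix M_j. In that special case the round reduces to a zero-sum matrix game with payoff matrix \sum_{j}\chi^{t}_{j}M_{j}, which can be solved by linear programming. The function name minimax_mixed_action and its inputs are our own illustrative choices.

import numpy as np
from scipy.optimize import linprog

def minimax_mixed_action(payoff_matrices, chi):
    """Solve min_{x in simplex} max_{y in simplex} x^T (sum_j chi[j] * M_j) y.

    payoff_matrices: list of d arrays, each of shape (num_learner_actions, num_adversary_actions);
    chi: probability vector of weights over the d loss coordinates.
    Returns the Learner's minimax mixed action x.
    """
    M = sum(w * Mj for w, Mj in zip(chi, payoff_matrices))  # combined game matrix
    A, B = M.shape
    # LP variables: (x_1, ..., x_A, v). Minimize v subject to
    # (M^T x)_k <= v for every Adversary pure action k, sum(x) = 1, x >= 0.
    c = np.zeros(A + 1)
    c[-1] = 1.0
    A_ub = np.hstack([M.T, -np.ones((B, 1))])
    b_ub = np.zeros(B)
    A_eq = np.zeros((1, A + 1))
    A_eq[0, :A] = 1.0
    b_eq = np.array([1.0])
    bounds = [(0, None)] * A + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:A]

# Example: two loss coordinates, two Learner actions, three Adversary actions.
M1 = np.array([[1.0, -1.0, 0.0], [-1.0, 1.0, 0.5]])
M2 = np.array([[0.0, 0.5, -0.5], [0.5, -1.0, 1.0]])
print(minimax_mixed_action([M1, M2], chi=[0.7, 0.3]))

In the general setting of the framework, where the action sets and losses need not be of this bilinear form, the same step is a convex program rather than a linear program.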
Probabilistic performance guarantees

Algorithm 2 provides two crucial blackbox guarantees to the probabilistic Learner. First, the guarantees on Algorithm 1 from Theorem 2.1 almost immediately translate into a bound on the expected AMF regret of the Learner who uses Algorithm 2, over the randomness in her actions. Second, a high-probability AMF regret bound, also over the Learner’s randomness, can be derived in a straightforward way.

Theorem A.1 (In-Expectation Bound).

Given TlndT\geq\ln d, Algorithm 2 with learning rate η=lnd4TC2\eta=\sqrt{\frac{\ln d}{4TC^{2}}} guarantees that ex-ante, with respect to the randomness in the Learner’s realized outcomes, the expected AMF regret is bounded as:

𝔼[RT]4CTlnd.\mathop{\mathbb{E}}\left[R^{T}\right]\leq 4C\sqrt{T\ln d}.
Proof Sketch.

Using Jensen’s inequality to switch expectations and exponentials, it is easy to modify the proof of Lemma 2.1 to obtain the following in-expectation bound:

𝔼[RT]ln𝔼[LT]η.\mathop{\mathbb{E}}\left[R^{T}\right]\leq\frac{\ln\mathop{\mathbb{E}}\left[L^{T}\right]}{\eta}.

The rest of the proof is similar to the proofs of Lemma 2.2 and Lemma 2.3. ∎

Theorem A.2 (High-Probability Bound).

Fix any δ(0,1)\delta\in(0,1). Given TlndT\geq\ln d, Algorithm 2 with learning rate η=lnd4TC2\eta=\sqrt{\frac{\ln d}{4TC^{2}}} guarantees that the AMF regret will satisfy, with ex-ante probability 1δ1-\delta over the randomness in the Learner’s realized outcomes,

RT8CTln(dδ).R^{T}\leq 8C\sqrt{T\ln\left(\frac{d}{\delta}\right)}.
Proof Sketch.

The proof proceeds by constructing a martingale with bounded increments that tracks the increase in the surrogate loss LTL^{T}, and then using Azuma’s inequality to conclude that the final surrogate loss (and hence the AMF regret) is bounded above with high probability. For a detailed proof, see Appendix A.3. ∎

A.2.2 Performance Bounds for a Suboptimal Learner

Our general Algorithms 1 and 2 involve the Learner solving a convex program at each round in order to identify her minimax optimal strategy. However, in some applications of our framework it may be necessary or desirable for the Learner to restrict herself to playing approximately minimax optimal strategies instead of exactly optimal ones. This can happen for a variety of reasons:

  1.

    Computational efficiency. While the convex program that the Learner must solve at each round is polynomial-sized in the description of the environment, one may wish for a better running time dependence — e.g. in settings in which the action space for the Learner is exponential in some other relevant parameter of the problem. In such cases, we will want to trade off run-time for approximation error in the minimax equilibrium computation at each round.

  2.

    Structural simplicity of strategies. One may wish to restrict the Learner to only playing “simple” strategies (for example, distributions over actions with small support), or more generally, strategies belonging to a certain predefined strict subset of the Learner’s strategy space. This subset may only contain approximately optimal minimax strategies.

  3.

    Numerical precision. As the convex programs solved by the Learner at each round generally have irrational coefficients (due to the exponents), using finite-precision arithmetic to solve these programs will lead to a corresponding precision error in the solution, making the computed strategy only approximately minimax optimal for the Learner. This kind of approximation error can generally be driven to be arbitrarily small, but still necessitates being able to reason about approximate solutions.

Given a suboptimal instantiation of Algorithm 1 or 2, we thus want to know: how much worse will its achieved regret bound be, compared to the existential guarantee? We will now address this question for both the deterministic setting of Sections 2.1 and 2.2, and the probabilistic setting of Section A.2.1.

Recall that at each round t[T]t\in[T], both Algorithm 1 and Algorithm 2 (with the weights χjt\chi_{j}^{t} defined accordingly) have the Learner solve for the minimizer xx of the function ψt:𝒳t[C,C]\psi^{t}:\mathcal{X}^{t}\to[-C,C] defined as:

ψt(x):=maxy𝒴tj[d]χjtjt(x,y).\psi^{t}(x):=\max_{y\in\mathcal{Y}^{t}}\sum_{j\in[d]}\chi^{t}_{j}\cdot\ell^{t}_{j}(x,y).

The range of \psi^{t} is [-C,C] as indicated, since it is a convex combination of the loss coordinates \ell^{t}_{j}(x,y)\in[-C,C]: the weights (\chi^{t}_{1},\ldots,\chi^{t}_{d}) form a probability distribution over [d].

Now suppose the Learner ends up playing actions x1,,xTx^{1},\ldots,x^{T} which do not necessarily minimize the respective objectives ψt()\psi^{t}(\cdot). The following definition helps capture the degree of suboptimality in the Learner’s play at each round.

Definition A.2 (Achieved AMF Value Bound).

Consider any round t[T]t\in[T], and suppose the Learner plays action xt𝒳tx^{t}\in\mathcal{X}^{t} at round tt. Then, any number

wbdt[ψt(xt),C]w^{t}_{\mathrm{bd}}\in\left[\psi^{t}(x^{t}),C\right]

is called an achieved AMF value bound for round tt.

This definition has two aspects. Most importantly, wbdtw^{t}_{\mathrm{bd}} upper bounds the Learner’s achieved objective function value at round tt. Furthermore, we restrict wbdtw^{t}_{\mathrm{bd}} to be C\leq C — otherwise it would be a meaningless bound as the Learner gets objective value C\leq C no matter what xtx^{t} she plays.

We now formulate the desired bounds on the performance of a suboptimal Learner. The upshot is that for a suboptimal Learner, the bounds of Theorems 2.1, A.1, and A.2 hold with each w^{t}_{A} replaced with the corresponding achieved AMF value bound w^{t}_{\mathrm{bd}}.

Theorem A.3 (Bounds for a Suboptimal Learner).

Consider a Learner who does not necessarily play optimally at all rounds, and a sequence wbd1,,wbdTw^{1}_{\mathrm{bd}},\ldots,w^{T}_{\mathrm{bd}} of achieved AMF value bounds.

In the deterministic setting, the Learner achieves the following regret bound analogous to Theorem 2.1:

maxj[d]t=1Tjt(xt,yt)t=1Twbdt+4CTlnd.\max_{j\in[d]}\sum_{t=1}^{T}\ell^{t}_{j}(x^{t},y^{t})\leq\sum_{t=1}^{T}w^{t}_{\mathrm{bd}}+4C\sqrt{T\ln d}.

In the probabilistic setting, the Learner achieves the following in-expectation regret bound analogous to Theorem A.1:

𝔼[maxj[d]t=1Tjt(at,yt)]t=1Twbdt+4CTlnd,\mathop{\mathbb{E}}\left[\max_{j\in[d]}\sum_{t=1}^{T}\ell^{t}_{j}(a^{t},y^{t})\right]\leq\sum_{t=1}^{T}w^{t}_{\mathrm{bd}}+4C\sqrt{T\ln d},

and the following high-probability bound analogous to Theorem A.2:

maxj[d]t=1Tjt(at,yt)t=1Twbdt+8CTln(dδ) with probability 1δ, for any δ(0,1).\max_{j\in[d]}\sum_{t=1}^{T}\ell^{t}_{j}(a^{t},y^{t})\leq\sum_{t=1}^{T}w^{t}_{\mathrm{bd}}+8C\sqrt{T\ln\left(\frac{d}{\delta}\right)}\text{ with probability }\geq 1-\delta,\text{ for any $\delta\in(0,1)$.}
Proof Sketch.

We use the deterministic case for illustration. The main idea is to redefine the Learner’s regret to be relative to her achieved AMF value bounds (wbdt)t[T](w^{t}_{\mathrm{bd}})_{t\in[T]} rather than the AMF values (wAt)t[T](w^{t}_{A})_{t\in[T]}. Namely, we let Rbdt:=maxj[d](Rbdt)jR^{t}_{\mathrm{bd}}:=\max_{j\in[d]}\left(R^{t}_{\mathrm{bd}}\right)_{j}, where (Rbdt)j:=s=1tjs(xs,ys)s=1twbds.\left(R^{t}_{\mathrm{bd}}\right)_{j}:=\sum_{s=1}^{t}\ell^{s}_{j}(x^{s},y^{s})-\sum_{s=1}^{t}w^{s}_{\mathrm{bd}}. The surrogate loss is defined in the same way as before, namely Lbdt:=j[d]exp(η(Rbdt)j)L^{t}_{\mathrm{bd}}:=\sum_{j\in[d]}\exp\left(\eta\cdot\left(R^{t}_{\mathrm{bd}}\right)_{j}\right).

First, Lemma 2.1 still holds: RbdT(lnLbdT)/ηR^{T}_{\mathrm{bd}}\leq\left(\ln L^{T}_{\mathrm{bd}}\right)/\eta, with the same proof. Lemma 2.2 also holds after replacing each wAtw^{t}_{A} with wbdtw^{t}_{\mathrm{bd}}: namely, Lbdt(4η2C2+1)Lbdt1+ηj[d]exp(η(Rbdt1)j)(jt(xt,yt)wbdt).L^{t}_{\mathrm{bd}}\leq\left(4\eta^{2}C^{2}+1\right)L^{t-1}_{\mathrm{bd}}+\eta\sum_{j\in[d]}\exp\left(\eta\left(R^{t-1}_{\mathrm{bd}}\right)_{j}\right)\cdot\left(\ell^{t}_{j}\left(x^{t},y^{t}\right)-w^{t}_{\mathrm{bd}}\right). The proof is almost the same: we formerly used wAtCw^{t}_{A}\leq C, and now use that wbdtCw^{t}_{\mathrm{bd}}\leq C by Definition A.2.

Now, following the proofs of Lemma 2.3 and Theorem 2.1, to obtain the declared regret bound it suffices to show for t[T]t\in[T] that the Learner’s action xtx^{t} guarantees j[d]exp(η(Rbdt1)j)(jt(xt,yt)wbdt)0\sum_{j\in[d]}\exp\left(\eta\left(R^{t\!-\!1}_{\mathrm{bd}}\right)_{j}\right)\cdot\left(\ell^{t}_{j}\left(x^{t},y^{t}\right)\!-\!w^{t}_{\mathrm{bd}}\right)\leq 0, no matter what yty^{t} is played by the Adversary. For any yt𝒴ty^{t}\in\mathcal{Y}^{t}, we can rewrite this objective as:

\sum_{j\in[d]}\exp\left(\eta\left(R^{t-1}_{\mathrm{bd}}\right)_{j}\right)\cdot\left(\ell^{t}_{j}\left(x^{t},y^{t}\right)-w^{t}_{\mathrm{bd}}\right)=\frac{\sum_{i\in[d]}\exp\left(\eta\sum_{s=1}^{t-1}\ell_{i}^{s}(x^{s},y^{s})\right)}{\exp\left(\eta\sum_{s=1}^{t-1}w^{s}_{\mathrm{bd}}\right)}\sum_{j\in[d]}\chi^{t}_{j}\cdot\left(\ell^{t}_{j}(x^{t},y^{t})-w^{t}_{\mathrm{bd}}\right).

It now follows that action xtx^{t} achieves j[d]exp(η(Rbdt1)j)(jt(xt,yt)wbdt)0\sum\limits_{j\in[d]}\exp\left(\eta\left(R^{t\!-\!1}_{\mathrm{bd}}\right)_{j}\right)\cdot\left(\ell^{t}_{j}\left(x^{t},y^{t}\right)\!-\!w^{t}_{\mathrm{bd}}\right)\leq 0, from observing that:

j[d]χjt(jt(xt,yt)wbdt)=j[d]χjtjt(xt,yt)wbdtψt(xt)wbdt0,\sum_{j\in[d]}\chi^{t}_{j}\cdot\left(\ell^{t}_{j}(x^{t},y^{t})-w^{t}_{\mathrm{bd}}\right)=\sum_{j\in[d]}\chi^{t}_{j}\cdot\ell^{t}_{j}(x^{t},y^{t})-w^{t}_{\mathrm{bd}}\leq\psi^{t}(x^{t})-w^{t}_{\mathrm{bd}}\leq 0,

where the final inequality holds since the Learner achieves AMF value bound wbdtw^{t}_{\mathrm{bd}} at round tt. ∎

A.3 Omitted Proofs and Details from Section A.2.1: Bounds for the Probabilistic Learner

First, we define our probabilistic setting, emphasizing the differences to the deterministic protocol. At each round t[T]t\in[T], the interaction between the Learner and the Adversary proceeds as follows:

  1.

    At the beginning of each round tt, the Adversary selects an environment consisting of the following, and reveals it to the Learner:

    (a)

      The Learner’s simplex action set 𝒳t=Δ𝒜t\mathcal{X}^{t}=\Delta\mathcal{A}^{t}, where 𝒜t\mathcal{A}^{t} is a finite set of pure actions;

    (b)

      The Adversary’s convex compact action set 𝒴t\mathcal{Y}^{t}, embedded in a finite-dimensional Euclidean space;

    (c)

      A vector valued loss function t(,):𝒜t×𝒴t[C,C]d\ell^{t}(\cdot,\cdot):\mathcal{A}^{t}\times\mathcal{Y}^{t}\to[-C,C]^{d}. Every dimension jt(,):𝒜t×𝒴t[C,C]\ell^{t}_{j}(\cdot,\cdot):\mathcal{A}^{t}\times\mathcal{Y}^{t}\to[-C,C] (where j[d]j\in[d]) of the loss function is continuous and concave in the second argument.

  2.

    The Learner selects some xt𝒳tx^{t}\in\mathcal{X}^{t};

  3.

    The Adversary observes the Learner’s selection xtx^{t}, and chooses some action yt𝒴ty^{t}\in\mathcal{Y}^{t} in response;

  4.

    The Learner’s action xtΔ𝒜tx^{t}\in\Delta\mathcal{A}^{t} is interpreted as a mixture over the pure actions in 𝒜t\mathcal{A}^{t}, and an outcome at𝒜ta^{t}\in\mathcal{A}^{t} is sampled from it; that is, atxta^{t}\sim x^{t}.

  5.

    The Learner suffers (and observes) t(at,yt)\ell^{t}(a^{t},y^{t}), the loss vector with respect to the outcome ata^{t}.

Thus, the probabilistic setting is simply a specialization of our framework to the case of the Learner’s action set being a simplex at each round.

Unlike in the above deterministic setting, where the transcript through any round t was defined as \{(x^{s},y^{s})\}_{s=1}^{t}, in the present case we define the transcript through round t as

πt:={(a1,y1),,(at,yt)},\pi^{t}:=\{(a^{1},y^{1}),\ldots,(a^{t},y^{t})\},

that is, the transcript now records the Learner’s realized outcomes rather than her chosen mixtures at all rounds. Furthermore, we will denote by Πt\Pi^{t} the set of transcripts through round tt, for t[T]t\in[T].

Now, let us fix any Adversary Adv\mathrm{Adv} (that is, all of the Adversary’s decisions through round TT). With respect to this fixed Adversary, any algorithm for the Learner (defined as the collection of the Learner’s decision mappings {πt1Δ𝒜t}t[T]\{\pi^{t-1}\to\Delta\mathcal{A}^{t}\}_{t\in[T]} for all rounds) induces an ex-ante distribution 𝒫Adv\mathcal{P}_{\mathrm{Adv}} over the set of transcripts ΠT\Pi^{T}.

Now, we give two types of probabilistic guarantees on the performance of Algorithm 2, namely, an in-expectation bound and a high-probability bound. Both bounds hold for any choice of Adversary Adv\mathrm{Adv}, and are ex-ante with respect to the algorithm-induced distribution 𝒫Adv\mathcal{P}_{\mathrm{Adv}} over the final transcripts.

Theorem A.1 (restated).

As mentioned in Section A.2.1, the proof of Theorem A.1 is much the same as the proofs of Theorem 2.1 and the helper Lemmas 2.1, 2.2, 2.3, with the exception of using Jensen’s inequality to switch the order of taking expectations when necessary. We omit further details.

Theorem A.2 (restated).

Proof.

Throughout this proof, we put tildes over random variables to distinguish them from their realized values. For instance, π~t\tilde{\pi}^{t} is the random transcript through round tt, while πt\pi^{t} is a realization of π~t\tilde{\pi}^{t}. Also, we explicitly specify the dependence of the surrogate loss LtL^{t} on the (random or realized) transcript.

Consider the following random process {Z~t}\{\tilde{Z}^{t}\}, defined recursively for t=0,1,,Tt=0,1,\ldots,T and adapted to the sequence of random variables π~1,,π~T\tilde{\pi}^{1},\ldots,\tilde{\pi}^{T}. We let Z~0:=0\tilde{Z}^{0}:=0 deterministically, and for t[T]t\in[T] we let

Z~t:=Z~t1+lnLt(π~t)𝔼π~t[lnLt(π~t)|π~t1].\tilde{Z}^{t}:=\tilde{Z}^{t-1}+\ln L^{t}\left(\tilde{\pi}^{t}\right)-\mathop{\mathbb{E}}_{\tilde{\pi}^{t}}\left[\ln L^{t}\left(\tilde{\pi}^{t}\right)|\tilde{\pi}^{t-1}\right].

It is easy to see that for all t[T]t\in[T], we have 𝔼π~t[Z~t|π~t1]=Z~t1\mathop{\mathbb{E}}\limits_{\tilde{\pi}^{t}}\left[\tilde{Z}^{t}|\tilde{\pi}^{t-1}\right]=\tilde{Z}^{t-1}, and thus {Z~t}\{\tilde{Z}^{t}\} is a martingale.

We next show that this martingale has bounded increments. In brief, this follows from {Z~t}\{\tilde{Z}^{t}\} being defined in terms of the logarithm of the surrogate loss.

Lemma A.1.

The martingale {Z~t}\{\tilde{Z}^{t}\} has bounded increments: |Z~tZ~t1|4ηC|\tilde{Z}^{t}-\tilde{Z}^{t-1}|\leq 4\eta C for all t[T]t\in[T].

Proof.

It suffices to establish the bounded increments property for an arbitrary realization of the process. Towards this, fix the full transcript πT\pi^{T} of the interaction, and consider any round t[T]t\in[T].

Recall from the definition of the surrogate loss that

Lt(πt)=j[d]exp(ηRjt1(πt1))exp(η(jt(at,yt)wAt)).L^{t}(\pi^{t})=\sum_{j\in[d]}\exp\left(\eta R^{t-1}_{j}\left(\pi^{t-1}\right)\right)\cdot\exp\left(\eta\left(\ell^{t}_{j}(a^{t},y^{t})-w^{t}_{A}\right)\right).

Thus, noting that |jt(at,yt)wAt|2C\left|\ell^{t}_{j}(a^{t},y^{t})-w^{t}_{A}\right|\leq 2C for all j[d]j\in[d], we have

Lt(πt)Lt1(πt1)=Lt(πt)j[d]exp(ηRjt1(πt1))[exp(η2C),exp(η2C)].\displaystyle\frac{L^{t}(\pi^{t})}{L^{t-1}(\pi^{t-1})}=\frac{L^{t}(\pi^{t})}{\sum_{j\in[d]}\exp(\eta R^{t-1}_{j}(\pi^{t-1}))}\in\left[\exp\left(-\eta\cdot 2C\right),\exp\left(\eta\cdot 2C\right)\right].

Taking the logarithm yields

|lnLt(πt)lnLt1(πt1)|2ηC.\left|\ln L^{t}\left(\pi^{t}\right)-\ln L^{t-1}(\pi^{t-1})\right|\leq 2\eta C.

In fact, this argument shows that \left|\ln L^{t}({\pi'}^{t})-\ln L^{t-1}(\pi^{t-1})\right|\leq 2\eta C for any transcript {\pi'}^{t} that agrees with \pi^{t-1} on the first t-1 rounds. Hence, taking the expectation over \tilde{\pi}^{t} conditioned on \pi^{t-1}, we obtain:

|𝔼[lnLt(π~t)|πt1]lnLt1(πt1)|2ηC.\left|\mathop{\mathbb{E}}\left[\ln L^{t}\left(\tilde{\pi}^{t}\right)|\pi^{t-1}\right]-\ln L^{t-1}(\pi^{t-1})\right|\leq 2\eta C.

To conclude the proof, it now suffices to observe that:

|ZtZt1|\displaystyle|Z^{t}-Z^{t-1}| =|lnLt(πt)𝔼[lnLt(π~t)|πt1]|\displaystyle=\left|\ln L^{t}\left(\pi^{t}\right)-\mathop{\mathbb{E}}[\ln L^{t}\left(\tilde{\pi}^{t}\right)|\pi^{t-1}]\right|
|lnLt(πt)lnLt1(πt1)|+|lnLt1(πt1)𝔼[lnLt(π~t)|πt1]|\displaystyle\leq\left|\ln L^{t}(\pi^{t})-\ln L^{t-1}\left(\pi^{t-1}\right)\right|+\left|\ln L^{t-1}\left(\pi^{t-1}\right)-\mathop{\mathbb{E}}\left[\ln L^{t}\left(\tilde{\pi}^{t}\right)|\pi^{t-1}\right]\right|
2ηC+2ηC=4ηC.\displaystyle\leq 2\eta C+2\eta C=4\eta C.

Having established that {Z~t}\{\tilde{Z}^{t}\} is a martingale with bounded increments, we can now apply the following concentration bound (see e.g. Dubhashi and Panconesi [2009]).

Fact 3 (Azuma’s Inequality).

Fix ϵ>0\epsilon\!>\!0. For any martingale {Z~t}t=0T\{\tilde{Z}^{t}\}_{t=0}^{T} with |Z~tZ~t1|ξ|\tilde{Z}^{t}\!-\!\tilde{Z}^{t-1}|\!\leq\!\xi for t[T]t\!\in\![T],

Pr[Z~TZ~0ϵ]exp(ϵ22ξ2T).\Pr\left[\tilde{Z}^{T}-\tilde{Z}^{0}\geq\epsilon\right]\leq\exp\left(-\frac{\epsilon^{2}}{2\xi^{2}T}\right).

We instantiate this bound for our martingale with Z~0=0\tilde{Z}^{0}=0, ξ=4ηC\xi=4\eta C, and ϵ=ξ2Tln1δ=4ηC2Tln1δ\epsilon=\xi\sqrt{2T\ln\frac{1}{\delta}}=4\eta C\sqrt{2T\ln\frac{1}{\delta}}, and obtain that for any δ(0,1)\delta\in(0,1),

\tilde{Z}^{T}\leq 4\eta C\sqrt{2T\ln\frac{1}{\delta}}\quad\text{ with probability }1-\delta. (1)
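To see that this choice of \epsilon yields failure probability exactly \delta, substitute \xi=4\eta C and \epsilon=4\eta C\sqrt{2T\ln\frac{1}{\delta}} into Azuma's Inequality:

\exp\left(-\frac{\epsilon^{2}}{2\xi^{2}T}\right)=\exp\left(-\frac{(4\eta C)^{2}\cdot 2T\ln\frac{1}{\delta}}{2(4\eta C)^{2}T}\right)=\exp\left(-\ln\frac{1}{\delta}\right)=\delta.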

At this point, let us express Z~T\tilde{Z}^{T} as follows:

Z~T=t=1T(lnLt(π~t)𝔼π~t[lnLt(π~t)|π~t1])=lnLT(π~T)lnL0t=1T(𝔼π~t[lnLt(π~t)|π~t1]lnLt1(π~t1)).\tilde{Z}^{T}=\sum_{t=1}^{T}\!\left(\!\ln L^{t}\!\left(\tilde{\pi}^{t}\right)\!\!-\!\!\mathop{\mathbb{E}}_{\tilde{\pi}^{t}}\!\left[\ln L^{t}\!\left(\tilde{\pi}^{t}\right)\!|\tilde{\pi}^{t-1}\right]\!\right)=\ln L^{T}\!\!\left(\tilde{\pi}^{T}\right)\!-\!\ln L^{0}\!-\!\!\sum_{t=1}^{T}\!\left(\mathop{\mathbb{E}}_{\tilde{\pi}^{t}}\!\left[\ln L^{t}\!\!\left(\tilde{\pi}^{t}\right)\!|\tilde{\pi}^{t-1}\right]\!\!-\!\ln L^{t-1}\!\!\left(\tilde{\pi}^{t-1}\right)\!\!\right).

Now, with an eye toward bounding the latter sum, observe that for t[T]t\in[T],

𝔼π~t[lnLt(π~t)|π~t1]lnLt1(π~t1)\displaystyle\mathop{\mathbb{E}}_{\tilde{\pi}^{t}}\left[\ln L^{t}(\tilde{\pi}^{t})|\tilde{\pi}^{t-1}\right]-\ln L^{t-1}(\tilde{\pi}^{t-1}) ln𝔼π~t[Lt(π~t)|π~t1]lnLt1(π~t1)\displaystyle\leq\ln\mathop{\mathbb{E}}_{\tilde{\pi}^{t}}\left[L^{t}(\tilde{\pi}^{t})|\tilde{\pi}^{t-1}\right]-\ln L^{t-1}\left(\tilde{\pi}^{t-1}\right)
ln((4η2C2+1)Lt1(π~t1))lnLt1(π~t1)\displaystyle\leq\ln\left(\left(4\eta^{2}C^{2}+1\right)L^{t-1}\left(\tilde{\pi}^{t-1}\right)\right)-\ln L^{t-1}(\tilde{\pi}^{t-1})
=ln(4η2C2+1)\displaystyle=\ln(4\eta^{2}C^{2}+1)
4η2C2.\displaystyle\leq 4\eta^{2}C^{2}.

Here, the first step is via Jensen’s inequality and the last step is via ln(1+x)x\ln(1+x)\leq x for x>1x>-1. The second step holds since we can show (via reasoning similar to Lemma 2.3) that for any TlndT\geq\ln d, at each round t[T]t\in[T] Algorithm 2 with learning rate η=lnd4TC2\eta=\sqrt{\frac{\ln d}{4TC^{2}}} achieves:

𝔼π~t[Lt(π~t)|π~t1](4η2C2+1)Lt1(π~t1).\mathop{\mathbb{E}}_{\tilde{\pi}^{t}}\left[L^{t}(\tilde{\pi}^{t})|\tilde{\pi}^{t-1}\right]\leq(4\eta^{2}C^{2}+1)L^{t-1}(\tilde{\pi}^{t-1}).

Combining the above observations with Bound 1 and recalling L0=dL^{0}\!\!=\!d yields, with probability 1δ\!\geq\!1\!-\!\delta,

Z~T4ηC2Tln1δ\displaystyle\tilde{Z}_{T}\leq 4\eta C\sqrt{2T\ln\frac{1}{\delta}}\, lnLT(π~T)lndt=1T(𝔼π~t[lnLt(π~t)|π~t1]lnLt1(π~t1))4ηC2Tln1δ\displaystyle\iff\!\!\!\ln L^{T}(\tilde{\pi}^{T})-\ln d-\sum_{t=1}^{T}\left(\mathop{\mathbb{E}}_{\tilde{\pi}^{t}}[\ln L^{t}(\tilde{\pi}^{t})|\tilde{\pi}^{t-1}]-\ln L^{t-1}(\tilde{\pi}^{t-1})\right)\leq 4\eta C\sqrt{2T\ln\frac{1}{\delta}}
lnLT(π~T)lnd+t=1T(𝔼π~t[lnLt(π~t)|π~t1]lnLt1(π~t1))+4ηC2Tln1δ\displaystyle\iff\!\!\!\ln L^{T}(\tilde{\pi}^{T})\leq\ln d+\sum_{t=1}^{T}\left(\mathop{\mathbb{E}}_{\tilde{\pi}^{t}}[\ln L^{t}(\tilde{\pi}^{t})|\tilde{\pi}^{t-1}]-\ln L^{t-1}(\tilde{\pi}^{t-1})\right)+4\eta C\sqrt{2T\ln\frac{1}{\delta}}
lnLT(π~T)lnd+4η2C2T+4ηC2Tln1δ.\displaystyle\implies\!\!\ln L^{T}(\tilde{\pi}^{T})\leq\ln d+4\eta^{2}C^{2}T+4\eta C\sqrt{2T\ln\frac{1}{\delta}}.

Using the last inequality, with \eta=\sqrt{\frac{\ln d}{4TC^{2}}}, and the fact that R^{T}\left(\tilde{\pi}^{T}\right)\leq\frac{\ln L^{T}\left(\tilde{\pi}^{T}\right)}{\eta} (which is precisely the statement of Lemma 2.1), we thus obtain the desired high-probability AMF regret bound. Specifically, with probability 1-\delta we have:

\displaystyle R^{T}\left(\tilde{\pi}^{T}\right)\leq\frac{\ln L^{T}\left(\tilde{\pi}^{T}\right)}{\eta}\leq\frac{\ln d}{\eta}+4\eta C^{2}T+4C\sqrt{2T\ln\frac{1}{\delta}}=2\sqrt{4C^{2}T\ln d}+4C\sqrt{2T\ln\frac{1}{\delta}}
=4CT(lnd+2ln1δ)4CT2lnd+2ln1δ8CTlndδ.\displaystyle=4C\sqrt{T}\left(\sqrt{\ln d}+\sqrt{2\ln\frac{1}{\delta}}\right)\leq 4C\sqrt{T}\cdot\sqrt{2}\cdot\sqrt{\ln d+2\ln\frac{1}{\delta}}\leq 8C\sqrt{T\ln\frac{d}{\delta}}.
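For the equality in the first line of this display, note that substituting \eta=\sqrt{\frac{\ln d}{4TC^{2}}} gives

\frac{\ln d}{\eta}=\sqrt{4C^{2}T\ln d}=2C\sqrt{T\ln d}\quad\text{and}\quad 4\eta C^{2}T=\sqrt{4C^{2}T\ln d}=2C\sqrt{T\ln d},

so that \frac{\ln d}{\eta}+4\eta C^{2}T=2\sqrt{4C^{2}T\ln d}=4C\sqrt{T\ln d}.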

In the last line, we used that x+y2x+y\sqrt{x}+\sqrt{y}\leq\sqrt{2}\sqrt{x+y} for x,y0x,y\geq 0. ∎

Appendix B Multicalibration: The Algorithm and Full Proofs

A simple and efficient algorithm for the Learner

As mentioned in the proof sketch of Theorem 4.1, in the setting of multicalibration, our framework’s general Algorithm 2 has a particularly simple approximate version (originally derived in Gupta et al. [2022]) that lets the Learner (almost) match the above bounds on the multicalibration constant α\alpha. This approximate algorithm is very efficient and has “low” randomization: namely, at each round the Learner plays an explicitly given distribution which randomizes over at most two points in 𝒜r\mathcal{A}_{r}.

  for t=1,,Tt=1,\dots,T do
     Observe θt\theta^{t}.
     For each i[n]i\in[n], compute:
Ct1i:=g𝒢:θtgexp(ηs=1t1i,g,+1s(as,bs))exp(ηs=1t1i,g,+1s(as,bs)).C^{i}_{t-1}:=\sum_{\begin{subarray}{c}g\in\mathcal{G}:\,\theta^{t}\in g\end{subarray}}\exp\left(\eta\sum_{s=1}^{t-1}\ell^{s}_{i,g,+1}\left(a^{s},b^{s}\right)\right)-\exp\left(-\eta\sum_{s=1}^{t-1}\ell^{s}_{i,g,+1}\left(a^{s},b^{s}\right)\right).
     if Ct1i>0C^{i}_{t-1}>0 for all i[n]i\in[n] then
        Predict at=1a^{t}=1.
     else if Ct1i<0C^{i}_{t-1}<0 for all i[n]i\in[n] then
        Predict at=0a^{t}=0.
     else
        Find j[n1]j\in[n-1] such that Ct1jCt1j+10C^{j}_{t-1}\cdot C^{j+1}_{t-1}\leq 0.
        Define qt[0,1]q^{t}\in[0,1] as follows (using the convention that 0/0 = 1):
qt:=|Ct1j+1|/(|Ct1j+1|+|Ct1j|).q^{t}:=\left|C^{j+1}_{t-1}\right|/\left(\left|C^{j+1}_{t-1}\right|+\left|C^{j}_{t-1}\right|\right).
        Sample at=jn1rna^{t}=\frac{j}{n}-\frac{1}{rn} with probability qtq^{t} and at=jna^{t}=\frac{j}{n} with probability 1qt1-q^{t}.
Algorithm 3 Simple Multicalibrated Learner
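For concreteness, here is a minimal Python sketch of one round of Algorithm 3's decision rule. The bookkeeping choices, namely a dictionary cum_loss keyed by (bucket, group) that stores \sum_{s<t}\ell^{s}_{i,g,+1}(a^{s},b^{s}) and a list active_groups of the groups containing \theta^{t}, are our own illustrative assumptions about how the state is maintained, not notation from the paper.

import numpy as np

def algorithm3_action(cum_loss, active_groups, n, r, eta, rng):
    """One round of the Simple Multicalibrated Learner (a sketch).

    cum_loss[(i, g)] = sum over past rounds s of l^s_{i,g,+1}(a^s, b^s), for
    buckets i = 1..n and groups g; active_groups lists the groups containing
    the current context theta^t. Returns the prediction a^t in [0, 1].
    """
    # C^i_{t-1} from Algorithm 3: a signed sum over the groups containing theta^t.
    C = np.zeros(n + 1)  # index 0 unused; buckets are 1-indexed
    for i in range(1, n + 1):
        for g in active_groups:
            s = cum_loss.get((i, g), 0.0)
            C[i] += np.exp(eta * s) - np.exp(-eta * s)

    if np.all(C[1:] > 0):
        return 1.0
    if np.all(C[1:] < 0):
        return 0.0
    # Otherwise some adjacent pair of buckets has C's of opposite (or zero) sign.
    j = next(j for j in range(1, n) if C[j] * C[j + 1] <= 0)
    denom = abs(C[j + 1]) + abs(C[j])
    q = 1.0 if denom == 0.0 else abs(C[j + 1]) / denom  # convention 0/0 = 1
    # Randomize between the two adjacent grid points j/n - 1/(rn) and j/n.
    return j / n - 1 / (r * n) if rng.random() < q else j / n

# Example usage with an empty history (all cumulative losses are zero):
rng = np.random.default_rng(0)
print(algorithm3_action({}, active_groups=["g0"], n=10, r=4, eta=0.05, rng=rng))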
Theorem B.1.

Algorithm 3 achieves the multicalibration guarantees of Theorem 4.1.

Proof.

Let us instantiate the generic probabilistic Algorithm 2 with our current set of loss functions. In parallel with the notation of Algorithm 2, for any bucket ii, group gg and σ{1,+1}\sigma\in\{-1,+1\}, we define

χi,g,σt:=1Ztexp(ηs=1t1i,g,σs(as,bs)),\chi^{t}_{i,g,\sigma}:=\frac{1}{Z^{t}}\exp\left(\eta\sum_{s=1}^{t-1}\ell_{i,g,\sigma}^{s}(a^{s},b^{s})\right),

where

Zt:=i[n],g𝒢,σ=±1exp(ηs=1t1i,g,σs(as,bs)).Z^{t}:=\sum_{i^{\prime}\in[n],g^{\prime}\in\mathcal{G},\sigma^{\prime}=\pm 1}\exp\left(\eta\sum_{s=1}^{t-1}\ell_{i^{\prime},g^{\prime},\sigma^{\prime}}^{s}(a^{s},b^{s})\right).

In this notation, at each round t[T]t\in[T], the Learner has to solve the following zero-sum game:

xtargminxΔ𝒜rmaxb[0,1]𝔼ax[ξt(a,b)],x^{t}\in\mathop{\mathrm{argmin}}_{x\in\Delta\mathcal{A}_{r}}\max_{b\in[0,1]}\mathop{\mathbb{E}}_{a\sim x}\left[\xi^{t}\left(a,b\right)\right],

where we define

ξt(a,b):=i[n],g𝒢,σ{1,1}χi,g,σti,g,σt(a,b) for a𝒜r,b[0,1].\xi^{t}(a,b):=\sum_{i\in[n],g\in\mathcal{G},\sigma\in\{-1,1\}}\chi^{t}_{i,g,\sigma}\cdot\ell^{t}_{i,g,\sigma}(a,b)\quad\text{ for }a\in\mathcal{A}_{r},b\in[0,1].

For any aa, let iai_{a} denote the unique bucket index i[n]i\in[n] such that aBnia\in B^{i}_{n}. Substituting

i,g,σt(a,b)=σ1θtg1aBni(ba),\ell^{t}_{i,g,\sigma}(a,b)=\sigma\cdot 1_{\theta^{t}\in g}\cdot 1_{a\in B^{i}_{n}}\cdot(b-a),

we see that most terms in the summation disappear, and what remains is precisely

ξt(a,b)=g𝒢:θtgσ{1,1}χia,g,σtσ(ba)=(ba)Ct1iaZt,\xi^{t}(a,b)=\sum_{g\in\mathcal{G}:\,\theta^{t}\in g\,}\sum_{\sigma\in\{-1,1\}}\chi^{t}_{i_{a},g,\sigma}\cdot\sigma(b-a)=(b-a)\cdot\frac{C^{i_{a}}_{t-1}}{Z^{t}},

where C^{i_{a}}_{t-1}=Z^{t}\sum\limits_{g\in\mathcal{G}:\theta^{t}\in g}\left(\chi^{t}_{i_{a},g,+1}-\chi^{t}_{i_{a},g,-1}\right) is as defined in the pseudocode for Algorithm 3.

Crucially, for any distribution xx chosen by the Learner, her attained utility after the Adversary best-responds has a simple closed form. Namely, given any xx played by the Learner, we have

maxb[0,1]𝔼ax[ξt(a,b)]\displaystyle\max_{b\in[0,1]}\mathop{\mathbb{E}}_{a\sim x}\left[\xi^{t}\left(a,b\right)\right] =1Zt(maxb[0,1](b𝔼ax[Ct1ia])𝔼ax[aCt1ia]),\displaystyle=\frac{1}{Z^{t}}\left(\max_{b\in[0,1]}\left(b\cdot\mathop{\mathbb{E}}_{a\sim x}\left[C^{i_{a}}_{t-1}\right]\right)-\mathop{\mathbb{E}}_{a\sim x}\left[a\cdot C^{i_{a}}_{t-1}\right]\right),
=1Zt(max(𝔼ax[Ct1ia],0)𝔼ax[aCt1ia]).\displaystyle=\frac{1}{Z^{t}}\left(\max\left(\mathop{\mathbb{E}}_{a\sim x}\left[C^{i_{a}}_{t-1}\right],0\right)-\mathop{\mathbb{E}}_{a\sim x}\left[a\cdot C^{i_{a}}_{t-1}\right]\right).

With this in mind, the Learner can easily achieve value 0 in the following two cases. When Ct1i>0C^{i}_{t-1}>0 for all i[n]i\in[n], playing a=1a=1 deterministically gives: max(𝔼ax[Ct1ia],0)𝔼ax[aCt1ia]=𝔼ax[Ct1ia]𝔼ax[Ct1ia]=0\max\left(\mathop{\mathbb{E}}\limits_{a\sim x}\left[C^{i_{a}}_{t-1}\right],0\right)-\mathop{\mathbb{E}}\limits_{a\sim x}\left[a\cdot C^{i_{a}}_{t-1}\right]=\mathop{\mathbb{E}}\limits_{a\sim x}\left[C^{i_{a}}_{t-1}\right]-\mathop{\mathbb{E}}\limits_{a\sim x}\left[C^{i_{a}}_{t-1}\right]=0. When Ct1i<0C^{i}_{t-1}<0 for all i[n]i\in[n], she can play a=0a=0 deterministically, ensuring that max(𝔼ax[Ct1ia],0)𝔼ax[aCt1ia]=00=0\max\left(\mathop{\mathbb{E}}\limits_{a\sim x}\left[C^{i_{a}}_{t-1}\right],0\right)-\mathop{\mathbb{E}}\limits_{a\sim x}\left[a\cdot C^{i_{a}}_{t-1}\right]=0-0=0.

In the final case, when there are nonpositive and nonnegative quantities among {Ct1i}i[n]\{C^{i}_{t-1}\}_{i\in[n]}, note that there exists an intermediate index j[n1]j\in[n-1] such that Ct1jCt1j+10C^{j}_{t-1}\cdot C^{j+1}_{t-1}\leq 0. Then, it is easy to check that qtq^{t}, as defined in Algorithm 3, satisfies

qtCt1j+(1qt)Ct1j+1=0.q^{t}C^{j}_{t-1}+(1-q^{t})C^{j+1}_{t-1}=0.
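Indeed, writing out q^{t} and 1-q^{t} explicitly,

q^{t}C^{j}_{t-1}+(1-q^{t})C^{j+1}_{t-1}=\frac{\left|C^{j+1}_{t-1}\right|C^{j}_{t-1}+\left|C^{j}_{t-1}\right|C^{j+1}_{t-1}}{\left|C^{j+1}_{t-1}\right|+\left|C^{j}_{t-1}\right|}=0,

since C^{j}_{t-1}C^{j+1}_{t-1}\leq 0 means that the two terms in the numerator have equal magnitude and opposite signs (and the convention 0/0=1 covers the degenerate case C^{j}_{t-1}=C^{j+1}_{t-1}=0, in which the relation holds trivially).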

Using this relation, we obtain that when the Learner plays at=jn1rna^{t}=\frac{j}{n}-\frac{1}{rn} with probability qtq^{t} and at=jna^{t}=\frac{j}{n} with probability 1qt1-q^{t}, she accomplishes value

maxb[0,1]𝔼at[ξt(at,b)]=\displaystyle\max_{b\in[0,1]}\mathop{\mathbb{E}}_{a^{t}}\left[\xi^{t}\left(a^{t},b\right)\right]= 1Zt(max(𝔼[Ct1iat],0)𝔼[atCt1iat])\displaystyle\frac{1}{Z^{t}}\left(\max\left(\mathop{\mathbb{E}}\left[C^{i_{a^{t}}}_{t-1}\right],0\right)-\mathop{\mathbb{E}}\left[a^{t}\cdot C^{i_{a^{t}}}_{t-1}\right]\right)
\displaystyle=\frac{1}{Z^{t}}\left(\max\left(q^{t}C_{t-1}^{j}+(1-q^{t})C_{t-1}^{j+1},\,0\right)-\left(q^{t}\left(\tfrac{j}{n}-\tfrac{1}{rn}\right)C_{t-1}^{j}+(1-q^{t})\tfrac{j}{n}C_{t-1}^{j+1}\right)\right)
\displaystyle=\frac{1}{Z^{t}}\cdot\frac{q^{t}}{rn}\,C_{t-1}^{j},

and thus, recalling that C^{j}_{t-1}=Z^{t}\sum_{g\in\mathcal{G}:\theta^{t}\in g}\left(\chi^{t}_{j,g,+1}-\chi^{t}_{j,g,-1}\right) and that q^{t}\leq 1, we obtain

\max_{b\in[0,1]}\mathop{\mathbb{E}}_{a^{t}}\left[\xi^{t}\left(a^{t},b\right)\right]=\frac{q^{t}}{rn}\sum\limits_{g\in\mathcal{G}:\theta^{t}\in g}\left(\chi^{t}_{j,g,+1}-\chi^{t}_{j,g,-1}\right)\leq\frac{1}{rn}\sum_{i\in[n],g\in\mathcal{G},\sigma=\pm 1}\chi^{t}_{i,g,\sigma}=\frac{1}{rn},

where the final inequality holds because q^{t}\leq 1 and the nonnegative quantities \chi^{t}_{i,g,\sigma} form a probability distribution (and thus sum to one).

Therefore, in the language of Section A.2.2, the Learner who uses Algorithm 3 guarantees herself achieved AMF value bounds

wbdt=1rn for t[T].w^{t}_{\mathrm{bd}}=\frac{1}{rn}\text{ for }t\in[T].

Hence, by Theorem A.3, our (suboptimal) Learner achieves the claimed multicalibration bounds. ∎

Appendix C Multicalibeating: Full Statements and Proofs

C.1 Calibeating a Single Forecaster: Proof of Theorem 4.2

Proof of Theorem 4.2.

For the exposition of this full proof, we will employ some probabilistic notation that did not appear in Section 4.2 of the main body. We briefly define it here.

For any subsequence S[T]S\subseteq[T] of rounds, tSt\sim S denotes a uniformly random round in SS. We denote the empirical distributions of the values of ff, aa, (f,a)(f,a) on S[T]S\subseteq[T] by 𝒟f(S),𝒟a(S),𝒟f×a(S)\mathcal{D}^{f}(S),\mathcal{D}^{a}(S),\mathcal{D}^{f\times a}(S) (or simply 𝒟f,𝒟a,𝒟f×a\mathcal{D}^{f},\mathcal{D}^{a},\mathcal{D}^{f\times a} when S=[T]S=[T]). In this notation, we e.g. have f(πT)=𝔼d𝒟f[VartSd[bt]]\mathcal{R}^{f}(\pi^{T})=\mathop{\mathbb{E}}_{d\sim\mathcal{D}^{f}}[\mathop{\mathrm{Var}}_{t\sim S^{d}}[b^{t}]].

Our quantity of interest, the Brier score a\mathcal{B}^{a} of the Learner’s predictions aa, is inconvenient to handle: indeed, the calibration-refinement decomposition of a\mathcal{B}^{a} is of little utility since the Learner’s predictions can take arbitrary real values (in particular, they might all be distinct, in which case the refinement score would be 0, and all of the Brier score would be contained in the calibration error). Instead, we define a convenient surrogate notion of bucketed Brier/calibration/refinement score.

𝒦na(πT)\displaystyle\mathcal{K}_{n}^{a}(\pi^{T}) :=1Ti[n]|Si|(a¯(Si)b¯(Si))2.\displaystyle:=\frac{1}{T}\sum_{i\in[n]}|S_{i}|(\bar{a}(S_{i})-\bar{b}(S_{i}))^{2}.
\displaystyle\mathcal{R}_{n}^{a}(\pi^{T}):=\frac{1}{T}\sum_{i\in[n]}\sum_{t\in S_{i}}(b^{t}-\bar{b}(S_{i}))^{2}=\frac{1}{T}\sum_{i\in[n]}|S_{i}|\mathop{\mathrm{Var}}_{t\sim S_{i}}[b^{t}]=\mathop{\mathbb{E}}_{i\sim\mathcal{D}^{a}}[\mathop{\mathrm{Var}}_{t\sim S_{i}}[b^{t}]].
na(πT)\displaystyle\mathcal{B}_{n}^{a}(\pi^{T}) :=𝒦na(πT)+na(πT).\displaystyle:=\mathcal{K}_{n}^{a}(\pi^{T})+\mathcal{R}_{n}^{a}(\pi^{T}).
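As a quick illustration of these definitions, the following Python sketch computes the bucketed calibration, refinement, and Brier scores from arrays of predictions and outcomes. The inputs a and b and the bucketing convention i=\min(\lfloor na\rfloor+1,n) (buckets of width 1/n) are our own illustrative assumptions; the paper fixes its own bucketing convention elsewhere.

import numpy as np

def bucketed_scores(a, b, n):
    """Bucketed calibration (K), refinement (R), and Brier (B = K + R) scores.

    a: the Learner's predictions in [0, 1]; b: the realized outcomes in [0, 1];
    n: the number of buckets of width 1/n.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    T = len(a)
    buckets = np.minimum(np.floor(n * a).astype(int) + 1, n)  # assumed bucketing convention
    K = R = 0.0
    for i in range(1, n + 1):
        S_i = buckets == i
        size = S_i.sum()
        if size == 0:
            continue
        K += size * (a[S_i].mean() - b[S_i].mean()) ** 2 / T  # calibration term
        R += size * b[S_i].var() / T  # refinement term (population variance over S_i)
    return K, R, K + R

# Example with synthetic, roughly calibrated predictions:
rng = np.random.default_rng(0)
a = rng.random(1000)
b = (rng.random(1000) < a).astype(float)
print(bucketed_scores(a, b, n=10))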

The following lemma shows that as long as nn is large enough, the surrogate Brier score is a good estimate of the true Brier score of our predictions (i.e. our squared error).

Lemma C.1.

ana+1n\mathcal{B}^{a}\leq\mathcal{B}_{n}^{a}+\frac{1}{n}.

Proof.

We first compute that the original Brier score a\mathcal{B}^{a} equals

a:=1Tt=1T(atbt)2=1Ti=1ntSi(atbt)2=1Ti=1n|Si|tSi1|Si|(atbt)2.\mathcal{B}^{a}:=\frac{1}{T}\sum_{t=1}^{T}(a^{t}-b^{t})^{2}=\frac{1}{T}\sum_{i=1}^{n}\sum_{t\in S_{i}}(a^{t}-b^{t})^{2}=\frac{1}{T}\sum_{i=1}^{n}|S_{i}|\sum_{t\in S_{i}}\frac{1}{|S_{i}|}(a^{t}-b^{t})^{2}.

The inner sum is the expectation of (a^{t}-b^{t})^{2} over a uniformly random round t\sim S_{i} (i.e., conditioned on a^{t}\in B^{i}_{n}), so we can write:

a=1Ti=1n|Si|𝔼tSi[(atbt)2].\mathcal{B}^{a}=\frac{1}{T}\sum_{i=1}^{n}|S_{i}|\mathop{\mathbb{E}}_{t\sim S_{i}}[(a^{t}-b^{t})^{2}].

We can decompose the expected value as:

𝔼tSi[(atbt)2]=(𝔼tSi[atbt])2+VartSi[atbt].\mathop{\mathbb{E}}_{t\sim S_{i}}[(a^{t}-b^{t})^{2}]=(\mathop{\mathbb{E}}_{t\sim S_{i}}[a^{t}-b^{t}])^{2}+\mathop{\mathrm{Var}}_{t\sim S_{i}}[a^{t}-b^{t}].

By linearity of expectation, the expectation-squared term satisfies:

(𝔼tSi[atbt])2=(a¯(Si)b¯(Si))2.(\mathop{\mathbb{E}}_{t\sim S_{i}}[a^{t}-b^{t}])^{2}=(\bar{a}(S_{i})-\bar{b}(S_{i}))^{2}.

Meanwhile, the variance term can be upper bounded using the following fact:

Fact 4.

For any random variables X,YX,Y:

Var[X+Y]=Var[X]+Var[Y]+2Cov(X,Y)Var[X]+Var[Y]+2Var[X]Var[Y].\mathop{\mathrm{Var}}[X+Y]=\mathop{\mathrm{Var}}[X]+\mathop{\mathrm{Var}}[Y]+2\mathrm{Cov}(X,Y)\leq\mathop{\mathrm{Var}}[X]+\mathop{\mathrm{Var}}[Y]+2\sqrt{\mathop{\mathrm{Var}}[X]\mathop{\mathrm{Var}}[Y]}.

where the inequality follows from an application of the Cauchy-Schwarz inequality.

Instantiating X=a^{t} and Y=-b^{t} (for t\sim S_{i}), and upper bounding \sqrt{\mathop{\mathrm{Var}}[X]}\leq\frac{1}{2n} (since every a^{t} with t\in S_{i} lies in the bucket B^{i}_{n}, an interval of length 1/n) and \sqrt{\mathop{\mathrm{Var}}[Y]}\leq\frac{1}{2} (since b^{t}\in[0,1]), we get:

VartSi[atbt]\displaystyle\mathop{\mathrm{Var}}_{t\sim S_{i}}[a^{t}-b^{t}] VartSi[at]+VartSi[bt]+2VartSi[at]VartSi[bt],\displaystyle\leq\mathop{\mathrm{Var}}_{t\sim S_{i}}[a^{t}]+\mathop{\mathrm{Var}}_{t\sim S_{i}}[b^{t}]+2\sqrt{\mathop{\mathrm{Var}}_{t\sim S_{i}}[a^{t}]\mathop{\mathrm{Var}}_{t\sim S_{i}}[b^{t}]},
1(2n)2+VartSi[bt]+12n,\displaystyle\leq\frac{1}{(2n)^{2}}+\mathop{\mathrm{Var}}_{t\sim S_{i}}[b^{t}]+\frac{1}{2n},
VartSi[bt]+1n.\displaystyle\leq\mathop{\mathrm{Var}}_{t\sim S_{i}}[b^{t}]+\frac{1}{n}.

Putting the above back together gives the desired bound on the difference of a\mathcal{B}^{a} and na\mathcal{B}^{a}_{n}:

a\displaystyle\mathcal{B}^{a} =1Ti=1n|Si|𝔼tSi[(atbt)2],\displaystyle=\frac{1}{T}\sum_{i=1}^{n}|S_{i}|\mathop{\mathbb{E}}_{t\sim S_{i}}[(a^{t}-b^{t})^{2}],
1Ti=1n|Si|((a¯(Si)b¯(Si))2+VartSi[bt]+1n),\displaystyle\leq\frac{1}{T}\sum_{i=1}^{n}|S_{i}|\left((\bar{a}(S_{i})-\bar{b}(S_{i}))^{2}+\mathop{\mathrm{Var}}_{t\sim S_{i}}[b^{t}]+\frac{1}{n}\right),
=1Ti=1n|Si|(a¯(Si)b¯(Si))2+1Ti=1n|Si|VartSi[bt]+1n,\displaystyle=\frac{1}{T}\sum_{i=1}^{n}|S_{i}|(\bar{a}(S_{i})-\bar{b}(S_{i}))^{2}+\frac{1}{T}\sum_{i=1}^{n}|S_{i}|\mathop{\mathrm{Var}}_{t\sim S_{i}}[b^{t}]+\frac{1}{n},
\displaystyle=\mathcal{K}_{n}^{a}+\mathcal{R}_{n}^{a}+\frac{1}{n}=\mathcal{B}_{n}^{a}+\frac{1}{n}. ∎

Having shown that the surrogate Brier score na\mathcal{B}^{a}_{n} closely approximates the Learner’s original score a\mathcal{B}^{a}, we can now focus on bounding the calibration and refinement scores associated with na\mathcal{B}^{a}_{n}.

Calibration: Our multicalibration condition on Θ\Theta implies that |Si|T|b¯(Si)a¯(Si)|α\frac{|S_{i}|}{T}|\bar{b}(S_{i})-\bar{a}(S_{i})|\leq\alpha for i[n]i\in[n]. The calibration score bound then follows directly.

𝒦na=1Ti[n]|Si|(b¯(Si)a¯(Si))21Ti[n]|Si||b¯(Si)a¯(Si)|i[n]α=αn.\mathcal{K}_{n}^{a}=\frac{1}{T}\sum_{i\in[n]}|S_{i}|(\bar{b}(S_{i})-\bar{a}(S_{i}))^{2}\leq\frac{1}{T}\sum_{i\in[n]}|S_{i}||\bar{b}(S_{i})-\bar{a}(S_{i})|\leq\sum_{i\in[n]}\alpha=\alpha n.

Refinement: We claim that the Learner’s surrogate refinement score relates to the refinement score of the forecaster ff as follows:

naf+αn(|Df|+1)+1n.\mathcal{R}^{a}_{n}\leq\mathcal{R}^{f}+\alpha n(|D_{f}|+1)+\frac{1}{n}.

The proof proceeds in two steps, connecting \mathcal{R}^{f} and \mathcal{R}^{a}_{n} via a quantity we call \mathcal{R}^{f\times a}.

Definition C.1 (Joint Refinement Score).
f×a:=𝔼d,i𝒟f×a[VartSid[bt]]=1TdDf,i[n]|Sid|VartSid[bt].\mathcal{R}^{f\times a}:=\mathop{\mathbb{E}}_{d,i\sim\mathcal{D}^{f\times a}}[\mathop{\mathrm{Var}}_{t\sim S^{d}_{i}}[b^{t}]]=\frac{1}{T}\sum_{d\in D_{f},i\in[n]}|S_{i}^{d}|\mathop{\mathrm{Var}}_{t\sim S_{i}^{d}}[b^{t}].

Recall that the refinement score, although we defined it for a forecaster, is really a property of a partition of the days: it is equally well defined if, instead of partitioning the days by the forecaster's prediction, we partition them by, say, even and odd days, or by sunny vs. cloudy vs. rainy vs. snowy days, or, as in Definition C.1, by the partition \{S_{i}^{d}\}_{i\in[n],d\in D}.

First, note that the joint refinement score of aa and ff is no worse than the refinement score of ff.

Observation 1.

ff×a.\mathcal{R}^{f}\geq\mathcal{R}^{f\times a}.

Intuitively, this should hold because \{S_{i}^{d}\} is a refinement of f's level sets by a's level sets, and refining by a cannot decrease the amount of variance captured by the partition. If a is "useful", the inequality is strict, since combining with a explains away more of the variance.

Reversing our perspective, we can also think of \{S_{i}^{d}\} as a refinement of a's level sets by f's level sets. The key idea is to use multicalibration to show that refining by f is not "useful": multicalibration ensures that almost all of f's explanatory power is already captured by a.

Observation 2.

na=f×a+𝔼i𝒟a[Vard𝒟f(Si)[b¯(Sid)]].\mathcal{R}_{n}^{a}=\mathcal{R}^{f\times a}+\mathop{\mathbb{E}}_{i\sim\mathcal{D}^{a}}[\mathop{\mathrm{Var}}_{d\sim\mathcal{D}^{f}(S_{i})}[\bar{b}(S_{i}^{d})]].

Observation 3.

The extra error term is small: 𝔼i𝒟a[Vard𝒟f(Si)[b¯(Sid)]]αn(|D|+1)+1n.\mathop{\mathbb{E}}_{i\sim\mathcal{D}^{a}}[\mathop{\mathrm{Var}}_{d\sim\mathcal{D}^{f}(S_{i})}[\bar{b}(S_{i}^{d})]]\leq\alpha n(|D|+1)+\frac{1}{n}.

Combining these three observations will give us our desired refinement score bound:

na(b)f+αn(|D|+1)+1n.\mathcal{R}^{a}_{n}(b)\leq\mathcal{R}^{f}+\alpha n(|D|+1)+\frac{1}{n}.

We therefore now prove these observations one by one.

Proof of Observation 1.

We recall the following fact from probability:

Fact 5 (Law of Total Variance).

For any random variables W,Z:ΩW,Z:\Omega\rightarrow\mathbb{R} in a probability space,

Var[Z]=𝔼[Var[Z|W]]+Var[𝔼[Z|W]].\mathop{\mathrm{Var}}[Z]=\mathop{\mathbb{E}}[\mathop{\mathrm{Var}}[Z|W]]+\mathop{\mathrm{Var}}[\mathop{\mathbb{E}}[Z|W]].

In particular, since variance is always non-negative:

Var[Z]𝔼[Var[Z|W]].\mathop{\mathrm{Var}}[Z]\geq\mathop{\mathbb{E}}[\mathop{\mathrm{Var}}[Z|W]].

For each fixed dd, we instantiate this fact with Ω=Sd\Omega=S^{d} (equipped with the discrete σ\sigma-algebra and uniform distribution). Z(t):=btZ(t):=b^{t} and W(t):=iatW(t):=i_{a^{t}}, the unique ii s.t. atBnia^{t}\in B^{i}_{n}. This gives us:

VartSd[bt]\displaystyle\mathop{\mathrm{Var}}_{t\sim S^{d}}[b^{t}] 𝔼i𝒟a(Sd)[VartSd[bt|atBni]]=𝔼i𝒟a(Sd)[VartSid[bt]].\displaystyle\geq\mathop{\mathbb{E}}_{i\sim\mathcal{D}^{a}(S^{d})}[\mathop{\mathrm{Var}}_{t\sim S^{d}}[b^{t}|a^{t}\in B^{i}_{n}]]=\mathop{\mathbb{E}}_{i\sim\mathcal{D}^{a}(S^{d})}[\mathop{\mathrm{Var}}_{t\sim S_{i}^{d}}[b^{t}]].

Since this is true for all dd, the inequality continues to hold in expectation over the dd’s:

f=𝔼d𝒟f[VartSd[bt]]𝔼d,i𝒟f×a[VartSid[bt]]=f×a.\mathcal{R}^{f}=\mathop{\mathbb{E}}_{d\sim\mathcal{D}^{f}}[\mathop{\mathrm{Var}}_{t\sim S^{d}}[b^{t}]]\geq\mathop{\mathbb{E}}_{d,i\sim\mathcal{D}^{f\times a}}[\mathop{\mathrm{Var}}_{t\sim S_{i}^{d}}[b^{t}]]=\mathcal{R}^{f\times a}.

Proof of Observation 2.

Recall the definition of bucketed refinement:

na=𝔼i𝒟a[VartSi[bt]].\mathcal{R}_{n}^{a}=\mathop{\mathbb{E}}_{i\sim\mathcal{D}^{a}}[\mathop{\mathrm{Var}}_{t\sim S_{i}}[b^{t}]].

To relate this back to f×a\mathcal{R}^{f\times a}, we instantiate Fact 5 again, but flipping the roles of ff and aa: we take the underlying spaces to be the sequences SiS_{i} defined by calibrated buckets, and let WW, the variable we condition on, be the level sets of ff.

For any fixed ii representing a level set of aa, Fact 5 tells us:

VartSi[bt]=𝔼d𝒟f(Si)[VartSi[bt|ft=d]]+Vard𝒟f(Si)[𝔼tSi[bt|ft=d]]=𝔼d𝒟f(Si)[VartSid[bt]]+Vard𝒟f(Si)[b¯(Sid)].\mathop{\mathrm{Var}}_{t\sim S_{i}}[b^{t}]=\mathop{\mathbb{E}}_{d\sim\mathcal{D}^{f}(S_{i})}[\mathop{\mathrm{Var}}_{t\sim S_{i}}[b^{t}|f^{t}=d]]+\mathop{\mathrm{Var}}_{d\sim\mathcal{D}^{f}(S_{i})}[\mathop{\mathbb{E}}_{t\sim S_{i}}[b^{t}|f^{t}=d]]=\mathop{\mathbb{E}}_{d\sim\mathcal{D}^{f}(S_{i})}[\mathop{\mathrm{Var}}_{t\sim S_{i}^{d}}[b^{t}]]+\mathop{\mathrm{Var}}_{d\sim\mathcal{D}^{f}(S_{i})}[\bar{b}(S_{i}^{d})].

As before, taking the expectation over all i[n]i\in[n] gives us the desired result:

na=𝔼i𝒟a[VartSi[bt]]=𝔼d,i𝒟f×a[VartSid[bt]]+𝔼i𝒟a[Vard𝒟f(Si)[b¯(Sid)]]=f×a+𝔼i𝒟a[Vard𝒟f(Si)[b¯(Sid)]].\mathcal{R}_{n}^{a}=\mathop{\mathbb{E}}_{i\sim\mathcal{D}^{a}}[\mathop{\mathrm{Var}}_{t\sim S_{i}}[b^{t}]]=\mathop{\mathbb{E}}_{d,i\sim\mathcal{D}^{f\times a}}[\mathop{\mathrm{Var}}_{t\sim S_{i}^{d}}[b^{t}]]+\mathop{\mathbb{E}}_{i\sim\mathcal{D}^{a}}[\mathop{\mathrm{Var}}_{d\sim\mathcal{D}^{f}(S_{i})}[\bar{b}(S_{i}^{d})]]=\mathcal{R}^{f\times a}+\mathop{\mathbb{E}}_{i\sim\mathcal{D}^{a}}[\mathop{\mathrm{Var}}_{d\sim\mathcal{D}^{f}(S_{i})}[\bar{b}(S_{i}^{d})]].

Proof of Observation 3.

We have to bound the extra error term:

𝔼i𝒟a[Vard𝒟f(Si)[b¯(Sid)]].\mathop{\mathbb{E}}_{i\sim\mathcal{D}^{a}}[\mathop{\mathrm{Var}}_{d\sim\mathcal{D}^{f}(S_{i})}[\bar{b}(S_{i}^{d})]].

In words, this is the expected variance of the true averages on the sets SidS_{i}^{d}, conditioned on the buckets ii. Intuitively, if these true averages vary a lot, then the calibration error on the sets SidS_{i}^{d} must be large, since the prediction on each SidS_{i}^{d} is (close to) i/ni/n and is therefore almost constant across dd. Conversely, if the multicalibration error is low, then this variance must be low as well. Formally,

𝔼i𝒟a[Vard𝒟f(Si)[b¯(Sid)]]\displaystyle\mathop{\mathbb{E}}_{i\sim\mathcal{D}^{a}}[\mathop{\mathrm{Var}}_{d\sim\mathcal{D}^{f}(S_{i})}[\bar{b}(S_{i}^{d})]] =i[n]|Si|T(Vard[𝔼tSid[bt]]),\displaystyle=\sum_{i\in[n]}\frac{|S_{i}|}{T}(\mathop{\mathrm{Var}}_{d}[\mathop{\mathbb{E}}_{t\sim S_{i}^{d}}[b^{t}]]),
=i[n]|Si|T(dD|Sid||Si|(b¯(Sid)b¯(Si))2),\displaystyle=\sum_{i\in[n]}\frac{|S_{i}|}{T}(\sum_{d\in D}\frac{|S_{i}^{d}|}{|S_{i}|}(\bar{b}(S_{i}^{d})-\bar{b}(S_{i}))^{2}),
=i,d|Sid|T(b¯(Sid)b¯(Si))2,\displaystyle=\sum_{i,d}\frac{|S_{i}^{d}|}{T}(\bar{b}(S_{i}^{d})-\bar{b}(S_{i}))^{2},
i,d|Sid|T|b¯(Sid)b¯(Si)|,\displaystyle\leq\sum_{i,d}\frac{|S_{i}^{d}|}{T}|\bar{b}(S_{i}^{d})-\bar{b}(S_{i})|,
i,d|Sid|T(|b¯(Sid)a¯(Sid)|+|a¯(Sid)a¯(Si)|+|a¯(Si)b¯(Si)|),\displaystyle\leq\sum_{i,d}\frac{|S_{i}^{d}|}{T}(|\bar{b}(S_{i}^{d})-\bar{a}(S_{i}^{d})|+|\bar{a}(S_{i}^{d})-\bar{a}(S_{i})|+|\bar{a}(S_{i})-\bar{b}(S_{i})|),
i,d|Sid|T(Tα/|Sid|+1n+Tα/|Si|),\displaystyle\leq\sum_{i,d}\frac{|S_{i}^{d}|}{T}(T\alpha/|S_{i}^{d}|+\frac{1}{n}+T\alpha/|S_{i}|),
1n+iα+i,dα,\displaystyle\leq\frac{1}{n}+\sum_{i}\alpha+\sum_{i,d}\alpha,
=1n+αn(|D|+1).\displaystyle=\frac{1}{n}+\alpha n(|D|+1).

The first three lines simply expand out the definition of the variance. In the fourth line, we upper bound the square by the absolute value, since all values involved are at most 1. In the fifth line, we use the triangle inequality to break the error term into three parts: the difference between the true average and our average prediction on SidS_{i}^{d} (upper bounded by Tα/|Sid|T\alpha/|S_{i}^{d}| by the calibration guarantee w.r.t. 𝒮(f)\mathcal{S}(f)), the difference between our average prediction on SidS_{i}^{d} and our average prediction on SiS_{i} (upper bounded by 1/n1/n, the width of our bucketing), and the difference between our average prediction on SiS_{i} and the true average on SiS_{i} (upper bounded by Tα/|Si|T\alpha/|S_{i}|). In the last two lines, the factors |Sid|/T|S_{i}^{d}|/T cancel against Tα/|Sid|T\alpha/|S_{i}^{d}| and (after summing over dd) against Tα/|Si|T\alpha/|S_{i}|, and sum to 1 against the 1/n1/n term. ∎

We have shown that 𝒦naαn\mathcal{K}^{a}_{n}\leq\alpha n, and our three observations have given us that na(b)f+αn(|D|+1)+1n\mathcal{R}^{a}_{n}(b)\leq\mathcal{R}^{f}+\alpha n(|D|+1)+\frac{1}{n}. Combining these results and Lemma C.1, we obtain the desired bound: af+αn(|D|+2)+2n\mathcal{B}^{a}\leq\mathcal{R}^{f}+\alpha n(|D|+2)+\frac{2}{n}. This concludes the proof of Theorem 4.2. ∎

C.2 Applying Theorem 4.2: Explicit Rates and Multiple Forecasters

First, we show how to instantiate Theorem 4.2 with our efficiently achievable multicalibration guarantees on α\alpha of Theorem 4.1.

Corollary C.1.

When run with parameters r,n1r,n\geq 1 on the collection 𝒢:=𝒮(f){Θ}\mathcal{G}^{\prime}:=\mathcal{S}(f)\cup\{\Theta\}, the multicalibration algorithm (Algorithm 3) τ\tau-calibeats ff, where

𝔼[τ]2n+n(|Df|+2)(1rn+4ln(2(|Df|+1)n)T),\mathop{\mathbb{E}}[\tau]\leq\frac{2}{n}+n(|D_{f}|+2)\left(\frac{1}{rn}+4\sqrt{\frac{\ln(2(|D_{f}|+1)n)}{T}}\right),

and for any δ(0,1)\delta\in(0,1), with probability 1δ1-\delta,

τ2n+n(|Df|+2)(1rn+81Tln(2(|Df|+1)nδ)).\tau\leq\frac{2}{n}+n(|D_{f}|+2)\left(\frac{1}{rn}+8\sqrt{\frac{1}{T}\ln\left(\frac{2(|D_{f}|+1)n}{\delta}\right)}\right).

The overall calibration error of the algorithm is bounded, for any δ(0,1)\delta\in(0,1), as:

𝔼[𝒦na]1r+4nln(2(|Df|+1)n)Tand𝒦na1r+8n1Tln(2(|Df|+1)nδ) w. prob. 1δ.\mathop{\mathbb{E}}[\mathcal{K}^{a}_{n}]\leq\frac{1}{r}+4n\sqrt{\frac{\ln(2(|D_{f}|+1)n)}{T}}\quad\text{and}\quad\mathcal{K}^{a}_{n}\leq\frac{1}{r}+8n\sqrt{\frac{1}{T}\ln\left(\frac{2(|D_{f}|+1)n}{\delta}\right)}\;\text{ w. prob. }1-\delta.
Proof.

Using our online multicalibration guarantees, we get (by Theorem B.1):

𝔼[α]1rn+4ln(2(|Df|+1)n)T,\mathop{\mathbb{E}}[\alpha]\leq\frac{1}{rn}+4\sqrt{\frac{\ln(2(|D_{f}|+1)n)}{T}},

and, for any δ(0,1)\delta\in(0,1), with probability 1δ1-\delta:

α1rn+81Tln(2(|Df|+1)nδ),\alpha\leq\frac{1}{rn}+8\sqrt{\frac{1}{T}\ln\left(\frac{2(|D_{f}|+1)n}{\delta}\right)},

Plugging this into the result from Theorem 4.2:

afαn(|D|+2)+2n,\mathcal{B}^{a}-\mathcal{R}^{f}\leq\alpha n(|D|+2)+\frac{2}{n},

we obtain the desired in-expectation bound on τ\tau:

𝔼[τ]2n+n(|Df|+2)𝔼[α]2n+n(|Df|+2)(1rn+4ln(2(|Df|+1)n)T).\mathop{\mathbb{E}}[\tau]\leq\frac{2}{n}+n(|D_{f}|+2)\mathop{\mathbb{E}}[\alpha]\leq\frac{2}{n}+n(|D_{f}|+2)\left(\frac{1}{rn}+4\sqrt{\frac{\ln(2(|D_{f}|+1)n)}{T}}\right).

Proceeding similarly for the high-probability bound, we get that with probability 1δ1-\delta:

τ2n+n(|Df|+2)(1rn+81Tln(2(|Df|+1)nδ)).\tau\leq\frac{2}{n}+n(|D_{f}|+2)\left(\frac{1}{rn}+8\sqrt{\frac{1}{T}\ln\left(\frac{2(|D_{f}|+1)n}{\delta}\right)}\right).

Finally, the bounds on the overall calibration error follow directly by plugging in the bounds on α\alpha. ∎
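To get a feel for how the terms in Corollary C.1 trade off against one another, the following Python sketch simply evaluates the in-expectation bound on τ\tau for a few choices of nn, with rr, |Df||D_{f}|, and TT held fixed; the specific parameter values are illustrative and not prescribed by the paper.

import numpy as np

def tau_bound(n, r, D_f, T):
    """The in-expectation bound on tau from Corollary C.1."""
    return 2 / n + n * (D_f + 2) * (1 / (r * n) + 4 * np.sqrt(np.log(2 * (D_f + 1) * n) / T))

T, r, D_f = 10**6, 100, 10            # illustrative values only
for n in (5, 10, 20, 40):
    print(n, round(tau_bound(n, r, D_f, T), 4))
# Larger n shrinks the 2/n term but inflates the multicalibration error terms,
# so n should be chosen to balance the two.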

The main utility of our approach to calibeating is that it easily extends to multicalibeating. As a warm-up, we start by deriving calibeating with respect to an ensemble of forecasters. The main result then combines this with calibeating on groups to attain the multicalibeating guarantee of Definition 4.5.

Calibeating an ensemble of forecasters

Since our result above is based on bounds on multicalibration, we can easily extend it to calibeating an ensemble of forecasters \mathcal{F} by asking for multicalibration with respect to the level sets of all forecasters. More formally, define the groups as:

(f𝒮(f)){Θ}.\left(\bigcup_{f\in\mathcal{F}}\mathcal{S}(f)\right)\cup\{\Theta\}.

Theorem 4.2 then applies separately to each ff\in\mathcal{F}. The only degradation comes through α\alpha, since we are asking for multicalibration with respect to more groups; but this effect is small, since the dependence on the number of groups is only O(ln|𝒢′|)O(\sqrt{\ln|\mathcal{G}^{\prime}|}).

Corollary C.2 (Ensemble Calibeating).

On groups 𝒢:=(f𝒮(f)){Θ}\mathcal{G}^{\prime}:=\left(\bigcup_{f\in\mathcal{F}}\mathcal{S}(f)\right)\cup\{\Theta\}, the multicalibration algorithm with parameters r,n1r,n\geq 1, after TT rounds attains (,{Θ},β)(\mathcal{F},\{\Theta\},\beta)-multicalibeating with

𝔼[β(f,Θ)]2n+n(|Df|+2)(1rn+4ln(2(1+f|Df|)n)T).\mathop{\mathbb{E}}[\beta(f,\Theta)]\leq\frac{2}{n}+n(|D_{f}|+2)\left(\frac{1}{rn}+4\sqrt{\frac{\ln(2(1+\sum_{f^{\prime}\in\mathcal{F}}|D_{f^{\prime}}|)n)}{T}}\right).
Proof.

We instantiate Theorem B.1 with group collection size |𝒢|=1+f|Df||\mathcal{G}^{\prime}|=1+\sum_{f^{\prime}\in\mathcal{F}}|D_{f^{\prime}}| to conclude that the multicalibrated algorithm achieves (α,n)(\alpha,n)-multicalibration, with

𝔼[α]1rn+4ln(2(1+f|Df|)n)T.\mathop{\mathbb{E}}[\alpha]\leq\frac{1}{rn}+4\sqrt{\frac{\ln(2(1+\sum_{f^{\prime}\in\mathcal{F}}|D_{f^{\prime}}|)n)}{T}}.

Now, 𝒮(f){Θ}𝒢\mathcal{S}(f)\cup\{\Theta\}\subseteq\mathcal{G}^{\prime} for every ff\in\mathcal{F}, so we can instantiate Theorem 4.2 for each forecaster ff\in\mathcal{F}, which gives us:

afαn(|Df|+2)+2nf.\mathcal{B}^{a}-\mathcal{R}^{f}\leq\alpha n(|D_{f}|+2)+\frac{2}{n}\quad\forall\,f\in\mathcal{F}.

Plugging in the in-expectation bound on α\alpha, we conclude:

𝔼[β(f,Θ)]𝔼[α]n(|Df|+2)+2n2n+n(|Df|+2)(1rn+4ln(2(1+f|Df|)n)T).\mathop{\mathbb{E}}[\beta(f,\Theta)]\leq\mathop{\mathbb{E}}[\alpha]\cdot n(|D_{f}|+2)+\frac{2}{n}\leq\frac{2}{n}+n(|D_{f}|+2)\left(\frac{1}{rn}+4\sqrt{\frac{\ln(2(1+\sum_{f^{\prime}\in\mathcal{F}}|D_{f^{\prime}}|)n)}{T}}\right). ∎

C.3 Multicalibeating + Multicalibration Theorem 4.3: Full Statement and Proof

Recall that for every group g𝒢g\in\mathcal{G}, we let S(g)S(g) denote the subsequence of days on which gg occurs, where the transcript is left implicit.

Theorem C.1 (Multicalibeating + Multicalibration: Full version with high-probability bounds).

Let 𝒢2Θ\mathcal{G}\subseteq 2^{\Theta}, and let \mathcal{F} be some set of forecasters f:ΘDff:\Theta\rightarrow D_{f}. The multicalibration algorithm on 𝒢:=(f{gS:(g,S)𝒢×𝒮(f)})𝒢\mathcal{G}^{\prime}:=\left(\bigcup_{f\in\mathcal{F}}\{g\cap S:(g,S)\in\mathcal{G}\times\mathcal{S}(f)\}\right)\cup\mathcal{G} with parameters r,n1r,n\geq 1, after TT rounds, attains expected (,𝒢,β)(\mathcal{F},\mathcal{G},\beta)-multicalibeating, where:

𝔼[β(f,g)]2n+|Df|+2r|S(g)|/T+4n(|Df|+2)1|S(g)|2/Tln(2n|𝒢|(1+f|Df|))f,g𝒢,\mathop{\mathbb{E}}[\beta(f,g)]\leq\frac{2}{n}+\frac{|D_{f}|+2}{r\cdot|S(g)|/T}+4n(|D_{f}|+2)\sqrt{\frac{1}{|S(g)|^{2}/T}\ln\left(2n|\mathcal{G}|(1+{\textstyle\sum}_{f}|D_{f}|)\right)}\;\forall\;f\in\mathcal{F},g\in\mathcal{G},

while maintaining (α,n)(\alpha,n)-multicalibration on the original collection 𝒢\mathcal{G}, with:

𝔼[α]1rn+41Tln(2n|𝒢|(1+f|Df|)).\mathop{\mathbb{E}}[\alpha]\leq\frac{1}{rn}+4\sqrt{\frac{1}{T}\ln\left(2n|\mathcal{G}|(1+{\textstyle\sum}_{f}|D_{f}|)\right)}.

We also have the corresponding high probability bounds. For any δ(0,1)\delta\in(0,1), with probability 1δ1-\delta:

β(f,g)2n+|Df|+2r|S(g)|/T+8n(|Df|+2)1|S(g)|2/Tln(2n|𝒢|(1+f|Df|)δ)f,g𝒢,\beta(f,g)\leq\frac{2}{n}+\frac{|D_{f}|+2}{r\cdot|S(g)|/T}+8n(|D_{f}|+2)\sqrt{\frac{1}{|S(g)|^{2}/T}\ln\left(\frac{2n|\mathcal{G}|(1+{\textstyle\sum}_{f}|D_{f}|)}{\delta}\right)}\;\forall\;f\in\mathcal{F},g\in\mathcal{G},

and on the original collection 𝒢\mathcal{G}, the multicalibration constant α\alpha satisfies, with probability 1δ1-\delta,

α1rn+81Tln(2n|𝒢|(1+f|Df|)δ).\alpha\leq\frac{1}{rn}+8\sqrt{\frac{1}{T}\ln\left(\frac{2n|\mathcal{G}|(1+{\textstyle\sum}_{f}|D_{f}|)}{\delta}\right)}.
Proof.

We begin with a preliminary observation that translates our overall multicalibration assumptions into guarantees over the individual sequences S(g)S(g), for g𝒢g\in\mathcal{G}.

Observation 4.

Let aa be (α,n)(\alpha,n)-multicalibrated on groups 𝒢\mathcal{G}^{\prime} over the entire time sequence [T][T]. Then, for any gg, on the subsequence of days S(g)S(g) the predictor aa is (αT|S(g)|,n)\left(\alpha\frac{T}{|S(g)|},n\right)-multicalibrated with respect to groups (f𝒮(f)){Θ}\left(\bigcup_{f\in\mathcal{F}}\mathcal{S}(f)\right)\cup\{\Theta\}.

Proof.

Let g𝒢g\in\mathcal{G} be some particular group. Also, fix any ff\in\mathcal{F} and S𝒮(f){Θ}S\in\mathcal{S}(f)\cup\{\Theta\}. Using multicalibration guarantees (Definition 4.1), we have that for every i[n]i\in[n]:

|tS(g):θtS and atBnibtat|=|t[T]:θtgS and atBnibtat|αT=(αT|S(g)|)|S(g)|.\left|\sum_{t\in S(g):\,\theta^{t}\in S\text{ and }a^{t}\in B^{i}_{n}}b^{t}-a^{t}\right|=\left|\sum_{t\in[T]:\,\theta^{t}\in g\cap S\text{ and }a^{t}\in B^{i}_{n}}b^{t}-a^{t}\right|\leq\alpha T=\left(\alpha\frac{T}{|S(g)|}\right)|S(g)|.

The first equality is by definition of S(g)S(g); in particular, θtgStS(g)θtS\theta^{t}\in g\cap S\iff t\in S(g)\wedge\theta^{t}\in S. This concludes the proof of our observation. ∎

With this observation in hand, the proof is again a direct application of Theorem 4.2.

We can instantiate Theorem B.1 with groups 𝒢\mathcal{G}^{\prime} to conclude that the multicalibrated algorithm achieves (α,n)(\alpha,n)-multicalibration, with (choosing any δ(0,1)\delta\in(0,1)):

𝔼[α]1rn+4ln(2|𝒢|n)T,andα1rn+81Tln(2|𝒢|nδ) w. prob. 1δ.\mathop{\mathbb{E}}[\alpha]\leq\frac{1}{rn}+4\sqrt{\frac{\ln(2|\mathcal{G}^{\prime}|n)}{T}},\quad\text{and}\quad\alpha\leq\frac{1}{rn}+8\sqrt{\frac{1}{T}\ln\left(\frac{2|\mathcal{G}^{\prime}|n}{\delta}\right)}\text{ w. prob. }1-\delta.

where |𝒢|=|𝒢|+|𝒢|(f|Df|)=|𝒢|(1+f|Df|)|\mathcal{G}^{\prime}|=|\mathcal{G}|+|\mathcal{G}|(\sum_{f}|D_{f}|)=|\mathcal{G}|(1+\sum_{f}|D_{f}|).

Now, fix any g𝒢g\in\mathcal{G} and ff\in\mathcal{F}. By Observation 4, the predictor aa is (αT|S(g)|,n)\left(\alpha\frac{T}{|S(g)|},n\right)-multicalibrated w.r.t. 𝒮(f){Θ}\mathcal{S}(f)\cup\{\Theta\} on the subsequence of days on which gg occurs. Therefore, we can instantiate Theorem 4.2:

a(πT|{t:θtg})f(πT|{t:θtg})2n+n(|Df|+2)αT|S(g)|.\mathcal{B}^{a}(\pi^{T}|_{\{t:\theta^{t}\in g\}})-\mathcal{R}^{f}(\pi^{T}|_{\{t:\theta^{t}\in g\}})\leq\frac{2}{n}+n(|D_{f}|+2)\,\alpha\,\frac{T}{|S(g)|}.

Inserting the above bounds on α\alpha yields our in-expectation and high-probability bounds on β(,)\beta(\cdot,\cdot).

Additionally, the theorem asserts that the predictor is also (α,n)(\alpha,n)-multicalibrated on the base collection of groups 𝒢\mathcal{G}. Indeed, since we have included the family 𝒢\mathcal{G} in the collection 𝒢\mathcal{G}^{\prime}, the predictor is (α,n)(\alpha,n)-multicalibrated on 𝒢\mathcal{G}. ∎

Appendix D Blackwell Approachability: The Algorithm and Full Proofs

Proof of Theorem 5.1.

We instantiate our probabilistic framework of Section A.2.1. The Learner’s and Adversary’s action sets are inherited from the underlying Polytope Blackwell game.

Defining the loss functions.

For all t=1,2,t=1,2,\ldots, we consider the following losses:

hα,βt(x,y):=α,u(x,y)β,for hα,β,x𝒳,y𝒴,\ell^{t}_{h_{\alpha,\beta}}(x,y):=\langle\alpha,u(x,y)\rangle-\beta,\quad\text{for }h_{\alpha,\beta}\in\mathcal{H},x\in\mathcal{X},y\in\mathcal{Y},

where here and below the notational convention is that for x𝒳,y𝒴x\in\mathcal{X},y\in\mathcal{Y}, u(x,y):=𝔼ax[u(a,y)]u(x,y):=\mathop{\mathbb{E}}_{a\sim x}[u(a,y)]. The coordinates of the resulting vector loss t(x,y):=(hα,βt(x,y))hα,β\ell^{t}_{\mathcal{H}}(x,y):=\left(\ell^{t}_{h_{\alpha,\beta}}(x,y)\right)_{h_{\alpha,\beta}\in\mathcal{H}} correspond to the collection \mathcal{H} of the halfspaces that define the polytope. By Hölder’s inequality, each vector loss function satisfies t[2,2]d\ell^{t}_{\mathcal{H}}\in[-2,2]^{d} — this follows because we required that for some p,qp,q with 1p+1q=1\frac{1}{p}+\frac{1}{q}=1, the family \mathcal{H} is pp-normalized, and the range of uu is contained in BqdB_{q}^{d}. In addition, each hα,βt\ell^{t}_{h_{\alpha,\beta}} is continuous and convex-concave by virtue of being a linear function of the continuous and affine-concave function uu.

Bounding the Adversary-Moves-First value.

We observe that for t[T]t\in[T], the AMF value wAt0w^{t}_{A}\leq 0. Indeed, if the Adversary moves first and selects any yt𝒴y^{t}\in\mathcal{Y}, then by the assumption of response satisfiability, the Learner has some xt𝒳x^{t}\in\mathcal{X} guaranteeing that u(xt,yt)P()u(x^{t},y^{t})\in P(\mathcal{H}). The latter is equivalent to hα,βt(xt,yt)=α,u(xt,yt)β0\ell^{t}_{h_{\alpha,\beta}}(x^{t},y^{t})=\langle\alpha,u(x^{t},y^{t})\rangle-\beta\leq 0 for all hα,βh_{\alpha,\beta}\in\mathcal{H}, letting us conclude that for any round tt,

wAt=supyt𝒴minxt𝒳(maxhα,βhα,βt(xt,yt))0.w^{t}_{A}=\sup_{y^{t}\in\mathcal{Y}}\min_{x^{t}\in\mathcal{X}}\left(\max_{h_{\alpha,\beta}\in\mathcal{H}}\ell^{t}_{h_{\alpha,\beta}}(x^{t},y^{t})\right)\leq 0.
Applying AMF regret bounds.

Given this instantiation of our framework, Theorem A.1 implies that for any response satisfiable Polytope Blackwell game, the Learner can use Algorithm 2 (instantiated with the above loss functions) to ensure that after any round Tln||T\geq\ln|\mathcal{H}|,

𝔼[maxhα,βt[T](α,u(at,yt)β)]𝔼[maxhα,βt[T]hα,βt(at,yt)t=1TwAt]8Tln||,\mathop{\mathbb{E}}\left[\max_{h_{\alpha,\beta}\in\mathcal{H}}\sum_{t\in[T]}\left(\left\langle\alpha,u\left(a^{t},y^{t}\right)\right\rangle-\beta\right)\right]\leq\mathop{\mathbb{E}}\left[\max_{h_{\alpha,\beta}\in\mathcal{H}}\sum_{t\in[T]}\ell^{t}_{h_{\alpha,\beta}}(a^{t},y^{t})-\sum_{t=1}^{T}w^{t}_{A}\right]\leq 8\sqrt{T\ln|\mathcal{H}|},

where the expectation is with respect to the Learner’s randomness. Given this guarantee, we obtain, using the definition of u¯T\bar{u}^{T}, that

maxhα,β𝔼[α,u¯Tβ]\displaystyle\max_{h_{\alpha,\beta}\in\mathcal{H}}\mathop{\mathbb{E}}\left[\left\langle\alpha,\bar{u}^{T}\right\rangle-\beta\right] 8ln||T.\displaystyle\leq 8\sqrt{\frac{\ln|\mathcal{H}|}{T}}.

Using T=T(ϵ)ln||T=T(\epsilon)\geq\ln|\mathcal{H}|, we have that for every hα,βh_{\alpha,\beta}\in\mathcal{H},

𝔼[α,u¯T(ϵ)β]8ln||T(ϵ)=8ln||64ln||/ϵ2=ϵ.\mathop{\mathbb{E}}\left[\left\langle\alpha,\bar{u}^{T(\epsilon)}\right\rangle-\beta\right]\leq 8\sqrt{\frac{\ln|\mathcal{H}|}{T(\epsilon)}}=8\sqrt{\frac{\ln|\mathcal{H}|}{64\ln|\mathcal{H}|/\epsilon^{2}}}=\epsilon.

This concludes the proof of our in-expectation guarantee for Polytope Blackwell games.

The high-probability statement follows directly from Theorem A.2, using C=2C=2. ∎

An LP-based algorithm when the Adversary has a finite pure strategy space.

Algorithm 2, which achieves the guarantees of Theorem 5.1, generally involves solving a convex program at each round. It is worth pointing out that in the commonly studied special case of Blackwell approachability, in which both the Learner and the Adversary randomize over their respective finite action sets 𝒜\mathcal{A} and \mathcal{B}, only a linear program needs to be solved at each round.

Formally, in the setting above, suppose additionally that the Adversary’s action space is 𝒴=Δ\mathcal{Y}=\Delta\mathcal{B}, where \mathcal{B} is a finite set of pure actions for the Adversary. At each round tt, both the Learner and the Adversary randomize over their respective action sets. First, the Learner selects a mixture xtΔ𝒜x^{t}\in\Delta\mathcal{A}, and then the Adversary selects a mixture ytΔy^{t}\in\Delta\mathcal{B} in response. Next, pure actions atxta^{t}\sim x^{t} and btytb^{t}\sim y^{t} are sampled from the chosen mixtures, and the vector valued utility in that round is set to u(at,bt)u(a^{t},b^{t}).

In this fully probabilistic setting, at each round tt Algorithm 2 has the Learner solve a normal-form zero-sum game with pure action sets 𝒜,\mathcal{A},\mathcal{B}, where the utility to the Adversary (the max player) is

ξt(a,b):=hα,βexp(ηs=1t1(α,u(as,bs)β))(α,u(a,b)β) for a𝒜,b.\xi^{t}(a,b):=\sum_{h_{\alpha,\beta}\in\mathcal{H}}\exp\left(\eta\sum_{s=1}^{t-1}\left(\left\langle\alpha,u\left(a^{s},b^{s}\right)\right\rangle-\beta\right)\right)\cdot\left(\langle\alpha,u(a,b)\rangle-\beta\right)\text{ for }a\in\mathcal{A},b\in\mathcal{B}. (2)

A standard LP-based approach to solving this zero-sum game (see e.g. Raghavan [1994]) is for the Learner to select among distributions xtΔ𝒜x^{t}\in\Delta\mathcal{A} with the goal of minimizing the maximum payoff to the Adversary over all pure responses bb\in\mathcal{B}. Writing this down as a linear program, we obtain the following algorithm:

  for t=1,,Tt=1,\dots,T do
     Choose a mixture xt=(xat)a𝒜Δ𝒜x^{t}=(x^{t}_{a})_{a\in\mathcal{A}}\in\Delta\mathcal{A} that solves the following linear program (where ξt(,)\xi^{t}(\cdot,\cdot) is defined in (2), and zz is an unconstrained variable):
Minimize z\displaystyle\text{Minimize }z
s.t. b:\displaystyle\text{s.t. }\forall b\in\mathcal{B}:\quad za𝒜xatξt(a,b).\displaystyle z\geq\sum_{a\in\mathcal{A}}x^{t}_{a}\,\,\xi^{t}(a,b).
     Sample atxta^{t}\sim x^{t}.
Algorithm 4 Linear Programming Based Learner for Polytope Blackwell Approachability
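As a concrete illustration (not part of the paper), the following Python sketch solves one round of the linear program in Algorithm 4 using scipy.optimize.linprog, assuming the payoff matrix xi[a, b] = ξ^t(a,b) of Equation (2) has already been computed from the history; the helper name and the toy payoff matrix are ours. The decision variables are the mixture x^t over the pure actions together with the auxiliary variable z.

import numpy as np
from scipy.optimize import linprog

def algorithm4_round(xi):
    """One round of Algorithm 4.
    xi : an (|A| x |B|) array with xi[a, b] = xi^t(a, b) as in Equation (2).
    Returns a minimax-optimal mixture x^t over the Learner's pure actions."""
    A, B = xi.shape
    c = np.zeros(A + 1)          # variables (x_1, ..., x_A, z); objective: minimize z
    c[-1] = 1.0
    A_ub = np.hstack([xi.T, -np.ones((B, 1))])   # for each b: sum_a x_a * xi[a, b] - z <= 0
    b_ub = np.zeros(B)
    A_eq = np.concatenate([np.ones(A), [0.0]]).reshape(1, -1)   # sum_a x_a = 1
    b_eq = [1.0]
    bounds = [(0.0, 1.0)] * A + [(None, None)]   # x lies in the simplex, z is free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:A]

# Toy usage with a random (purely illustrative) payoff matrix.
x_t = algorithm4_round(np.random.default_rng(0).standard_normal((3, 4)))
print(x_t, x_t.sum())   # a probability distribution over the 3 pure actions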

Appendix E No-X-Regret: Definitions, Examples, Algorithms, and Proofs

As a warm-up, we begin this section by carefully demonstrating how to use our framework to derive bounds and algorithms for the fundamental external regret setting. Then, we derive the same types of existential guarantees in the much more general subsequence regret setting, and specialize these subsequence regret bounds into tight bounds for various existing regret notions (such as internal, adaptive, sleeping experts, and multigroup regret). We conclude by deriving a general no-subsequence-regret algorithm, which specializes to an efficient algorithm in each of our applications.

E.1 Simple Learning From Expert Advice: External Regret

In the classical experts learning setting Littlestone and Warmuth [1994], the Learner has a set of pure actions (“experts”) 𝒜\mathcal{A}. At the outset of each round t[T]t\in[T], the Learner chooses a distribution over experts xtΔ𝒜x^{t}\in\Delta\mathcal{A}. The Adversary then comes up with a vector of losses rt=(rat)a𝒜[0,1]𝒜r^{t}=(r^{t}_{a})_{a\in\mathcal{A}}\in[0,1]^{\mathcal{A}} corresponding to each expert. Next, the Learner samples atxta^{t}\sim x^{t}, and experiences loss corresponding to the expert she chose: rattr^{t}_{a^{t}}. The Learner also gets to observe the entire vector of losses rtr^{t} for that round. The goal of the Learner is to achieve sublinear external regret — that is, to ensure that the difference between her cumulative loss and the loss of the best fixed expert in hindsight grows sublinearly with TT:

RextT(πT):=t[T]rattminj𝒜t[T]rjt=o(T).R^{T}_{\mathrm{ext}}(\pi^{T}):=\sum_{t\in[T]}r^{t}_{a^{t}}-\min_{j\in\mathcal{A}}\sum_{t\in[T]}r^{t}_{j}=o(T).
Theorem E.1.

Fix a finite pure action set 𝒜\mathcal{A} for the Learner and a time horizon Tln|𝒜|T\geq\ln|\mathcal{A}|. Then, Algorithm 2 can be instantiated to guarantee that the Learner’s expected external regret is bounded as

𝔼πT[RextT(πT)]4Tln|𝒜|,\mathop{\mathbb{E}}_{\pi^{T}}\left[R^{T}_{\mathrm{ext}}\left(\pi^{T}\right)\right]\leq 4\sqrt{T\ln|\mathcal{A}|},

and furthermore that for any δ(0,1)\delta\in(0,1), with ex-ante probability 1δ1-\delta over the Learner’s randomness,

RextT(πT)8Tln|𝒜|δ.R^{T}_{\mathrm{ext}}\left(\pi^{T}\right)\leq 8\sqrt{T\ln\frac{|\mathcal{A}|}{\delta}}.
Proof.

We instantiate our probabilistic framework (see Section A.2.1).

Defining the strategy spaces.

We define the Learner’s pure action set at each round to be the set 𝒜\mathcal{A}, and the Adversary’s strategy space to be the convex and compact set [0,1]|𝒜|[0,1]^{|\mathcal{A}|}, from which the Adversary chooses each round’s collection (rat)a𝒜(r^{t}_{a})_{a\in\mathcal{A}} of all actions’ losses.

Defining the loss functions.

For d=|𝒜|d=|\mathcal{A}|, we define a dd-dimensional vector valued loss function t=(jt)j𝒜\ell^{t}=(\ell^{t}_{j})_{j\in\mathcal{A}}, where for every action j𝒜j\in\mathcal{A}, the corresponding coordinate jt:𝒜×[0,1]|𝒜|[1,1]\ell^{t}_{j}:\mathcal{A}\times[0,1]^{|\mathcal{A}|}\to[-1,1] is given by

jt(a,rt)=ratrjt, for a𝒜,rt[0,1]|𝒜|.\ell^{t}_{j}(a,r^{t})=r^{t}_{a}-r^{t}_{j},\quad\text{ for }a\in\mathcal{A},r^{t}\in[0,1]^{|\mathcal{A}|}.

It is easy to see that jt(a,)\ell^{t}_{j}(a,\cdot) is continuous and concave — in fact, linear — in the second argument for all j,a𝒜j,a\in\mathcal{A} and t[T]t\in[T]. Furthermore, its range is [C,C][-C,C], for C=1C=1. This verifies the technical conditions imposed by our framework on the loss functions.

Applying AMF regret bounds.

We may now invoke Theorem A.1, which implies the following in-expectation AMF regret bound after round TT for the instantiation of Algorithm 2 with the just defined vector losses (t)t[T](\ell^{t})_{t\in[T]}:

𝔼[maxj𝒜t[T]jt(at,rt)t[T]wAt]4CTlnd=4Tln|𝒜|,\mathop{\mathbb{E}}\left[\max_{j\in\mathcal{A}}\sum_{t\in[T]}\ell^{t}_{j}(a^{t},r^{t})-\sum_{t\in[T]}w^{t}_{A}\right]\leq 4C\sqrt{T\ln d}=4\sqrt{T\ln|\mathcal{A}|},

where recall that wAtw^{t}_{A} is the Adversary-Moves-First (AMF) value at round tt. Connecting the instantiated AMF regret to the Learner’s external regret, we get:

𝔼[RextT]=𝔼[maxj𝒜t[T]rattrjt]=𝔼[maxj𝒜t[T]jt(at,rt)]4Tln|𝒜|+t[T]wAt.\mathop{\mathbb{E}}\left[R^{T}_{\mathrm{ext}}\right]=\mathop{\mathbb{E}}\left[\max_{j\in\mathcal{A}}\sum_{t\in[T]}r^{t}_{a^{t}}-r_{j}^{t}\right]=\mathop{\mathbb{E}}\left[\max_{j\in\mathcal{A}}\sum_{t\in[T]}\ell^{t}_{j}(a^{t},r^{t})\right]\leq 4\sqrt{T\ln|\mathcal{A}|}+\sum_{t\in[T]}w^{t}_{A}.
Bounding the Adversary-Moves-First value.

To obtain the claimed in-expectation external regret bound, it suffices to show that the AMF value at each round t[T]t\in[T] satisfies wAt0w^{t}_{A}\leq 0. Intuitively, this holds because if at some round the Learner knew the Adversary’s choice of losses (rat)a𝒜(r^{t}_{a})_{a\in\mathcal{A}} in advance, then she could guarantee herself no added loss in that round by picking the action a𝒜a\in\mathcal{A} with the smallest loss ratr^{t}_{a}.

Formally, for any vector of actions’ losses rtr^{t}, define art:=argmina𝒜rat,a^{*}_{r^{t}}:=\mathop{\mathrm{argmin}}_{a\in\mathcal{A}}r^{t}_{a}, and notice that

mina𝒜maxj𝒜jt(a,rt)maxj𝒜jt(art,rt)=maxj𝒜(rarttrjt)=mina𝒜ratminj𝒜rjt=0.\displaystyle\min_{a\in\mathcal{A}}\max_{j\in\mathcal{A}}\ell^{t}_{j}(a,r^{t})\leq\max_{j\in\mathcal{A}}\ell^{t}_{j}\left(a^{*}_{r^{t}},r^{t}\right)=\max_{j\in\mathcal{A}}\left(r^{t}_{a^{*}_{r^{t}}}-r^{t}_{j}\right)=\min_{a\in\mathcal{A}}r^{t}_{a}-\min_{j\in\mathcal{A}}r^{t}_{j}=0.

The third step follows by definition of arta^{*}_{r^{t}}. Hence, the AMF value is indeed nonpositive at each round:

wAt=suprt[0,1]|𝒜|mina𝒜maxj𝒜jt(a,rt)0.w^{t}_{A}=\adjustlimits{\sup}_{r^{t}\in[0,1]^{|\mathcal{A}|}}{\min}_{a\in\mathcal{A}}\max_{j\in\mathcal{A}}\ell^{t}_{j}(a,r^{t})\leq 0.

This completes the proof of the in-expectation external regret bound. The high-probability external regret bound follows in the same way from Theorem A.2 of Section A.2.1. ∎

A bound of Tln|𝒜|\sqrt{T\ln|\mathcal{A}|} is optimal for external regret in the experts learning setting, and so serves to witness the optimality of Theorem 2.1.

In fact, it is easy to demonstrate that in the external regret setting, the generic probabilistic Algorithm 2 amounts to the well known Exponential Weights algorithm (Algorithm 5 below) Littlestone and Warmuth [1994]. To see this, note that Algorithm 2, when instantiated with the above defined loss functions, has the Learner solve the following problem at each round:

xt\displaystyle x^{t} argminxΔ𝒜maxrt[0,1]|𝒜|j𝒜exp(ηs=1t1(rassrjs))i𝒜exp(ηs=1t1(rassris))𝔼ax[ratrjt],\displaystyle\in\mathop{\mathrm{argmin}}_{x\in\Delta\mathcal{A}}\max_{r^{t}\in[0,1]^{|\mathcal{A}|}}\sum_{j\in\mathcal{A}}\frac{\exp\left(\eta\sum_{s=1}^{t-1}(r^{s}_{a^{s}}-r^{s}_{j})\right)}{\sum_{i\in\mathcal{A}}\exp\left(\eta\sum_{s=1}^{t-1}(r^{s}_{a^{s}}-r^{s}_{i})\right)}\mathop{\mathbb{E}}_{a\sim x}[r^{t}_{a}-r^{t}_{j}],
=argminxΔ𝒜maxrt[0,1]|𝒜|j𝒜exp(ηs=1t1rjs)i𝒜exp(ηs=1t1ris)𝔼ax[ratrjt],\displaystyle=\mathop{\mathrm{argmin}}_{x\in\Delta\mathcal{A}}\max_{r^{t}\in[0,1]^{|\mathcal{A}|}}\sum_{j\in\mathcal{A}}\frac{\exp\left(-\eta\sum_{s=1}^{t-1}r^{s}_{j}\right)}{\sum_{i\in\mathcal{A}}\exp\left(-\eta\sum_{s=1}^{t-1}r^{s}_{i}\right)}\mathop{\mathbb{E}}_{a\sim x}[r^{t}_{a}-r^{t}_{j}],
=argminxΔ𝒜maxrt[0,1]|𝒜|𝔼ax,jEWη(πt1)[ratrjt],\displaystyle=\mathop{\mathrm{argmin}}_{x\in\Delta\mathcal{A}}\max_{r^{t}\in[0,1]^{|\mathcal{A}|}}\mathop{\mathbb{E}}_{a\sim x,j\sim\mathrm{EW}_{\eta}(\pi^{t-1})}[r^{t}_{a}-r^{t}_{j}],

where we denoted the exponential weights distribution as

EWη(πt1):=(exp(ηs=1t1rjs)i𝒜exp(ηs=1t1ris))j𝒜Δ𝒜.\mathrm{EW}_{\eta}(\pi^{t-1}):=\left(\frac{\exp\left(-\eta\sum_{s=1}^{t-1}r^{s}_{j}\right)}{\sum_{i\in\mathcal{A}}\exp\left(-\eta\sum_{s=1}^{t-1}r^{s}_{i}\right)}\right)_{j\in\mathcal{A}}\in\Delta\mathcal{A}.

For any choice of rtr^{t} by the Adversary, the quantity inside the expectation, jt(a,rt)=ratrjt\ell^{t}_{j}(a,r^{t})=r^{t}_{a}-r^{t}_{j}, is antisymmetric in aa and jj: that is, jt(a,rt)=at(j,rt)\ell^{t}_{j}(a,r^{t})=-\ell^{t}_{a}(j,r^{t}). Due to this antisymmetry, no matter which rtr^{t} gets selected by the Adversary, by playing aEWη(πt1)a\sim\mathrm{EW}_{\eta}(\pi^{t-1}) the Learner obtains

𝔼a,jEWη(πt1)[ratrjt]=0,\mathop{\mathbb{E}}_{a,j\sim\mathrm{EW}_{\eta}(\pi^{t-1})}\left[r^{t}_{a}-r^{t}_{j}\right]=0,

thus achieving the value of the game. It is also easy to see that xt=EWη(πt1)x^{t}=\mathrm{EW}_{\eta}(\pi^{t-1}) is the unique choice of xtx^{t} that guarantees the Learner nonpositive value against every choice of rtr^{t}; hence Algorithm 2, when specialized to the external regret setting, is equivalent to the Exponential Weights Algorithm 5.

  for t=1,,Tt=1,\dots,T do
     Sample ata^{t} such that at=ja^{t}=j with probability proportional to exp(ηs=1t1rjs)\exp\left(-\eta\sum_{s=1}^{t-1}r^{s}_{j}\right), for j𝒜j\in\mathcal{A}.
Algorithm 5 The Exponential Weights Algorithm with Learning Rate η\eta
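For completeness, here is a minimal Python sketch of Algorithm 5, run against an illustrative adversary that draws i.i.d. uniform losses; the learning rate η = √(ln|𝒜|/T) used below is a standard choice of ours and is not taken from the paper.

import numpy as np

def exponential_weights(losses, eta):
    """Run Algorithm 5 on a T x |A| array of per-round losses r^t_a.
    Returns the sequence of sampled actions a^1, ..., a^T."""
    rng = np.random.default_rng(0)
    T, A = losses.shape
    cum = np.zeros(A)                         # cumulative losses sum_s r^s_j
    actions = []
    for t in range(T):
        w = np.exp(-eta * (cum - cum.min()))  # shift by the minimum for numerical stability
        x = w / w.sum()                       # EW_eta(pi^{t-1})
        actions.append(rng.choice(A, p=x))
        cum += losses[t]                      # full-information feedback
    return actions

# Illustrative usage against i.i.d. uniform losses.
T, A = 1000, 5
losses = np.random.default_rng(1).random((T, A))
eta = np.sqrt(np.log(A) / T)
acts = exponential_weights(losses, eta)
regret = losses[np.arange(T), acts].sum() - losses.sum(axis=0).min()
print(regret)                                 # typically on the order of sqrt(T log A)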

E.2 Generalization to Subsequence Regret

Here, we present a generalization of the experts learning framework from which we will be able to derive our other applications to no-regret learning problems. There is again a Learner and an Adversary playing over the course of rounds t[T]t\in[T]. Initially, the Learner is endowed with a finite set of pure actions 𝒜\mathcal{A}. At each round tt, the Adversary restricts the Learner’s set of available actions for that round to some subset 𝒜t𝒜\mathcal{A}^{t}\subseteq\mathcal{A}. The Learner plays a mixture xtΔ𝒜tx^{t}\in\Delta\mathcal{A}^{t} over the available actions. The Adversary responds by selecting a vector of losses (rat)a𝒜[0,1]|𝒜|(r^{t}_{a})_{a\in\mathcal{A}}\in[0,1]^{|\mathcal{A}|} associated with the Learner’s pure actions. Next, the Learner samples a pure action atxta^{t}\sim x^{t}.

Unlike in the standard setting, the Learner’s regret will now be measured not just on the entire sequence of rounds 1,2,,T1,2,\ldots,T, but more generally on an arbitrary collection \mathcal{F} of weighted subsequences f:[T]×𝒜[0,1]f:[T]\times\mathcal{A}\to[0,1]. The understanding is that for any f,t[T],a𝒜tf\in\mathcal{F},t\in[T],a\in\mathcal{A}^{t}, the quantity f(t,a)f(t,a) is the “weight” with which round tt will be included in the subsequence if the Learner’s sampled action is aa at that round. The Learner does not need to know the subsequences ahead of time; instead the Adversary may announce the values {f(t,a)}a𝒜t,f\{f(t,a)\}_{a\in\mathcal{A}^{t},f\in\mathcal{F}} to the Learner before the corresponding round t[T]t\in[T].

Definition E.1 (Subsequence Regret).

Given a family of functions \mathcal{F}, where each ff\in\mathcal{F} is a mapping f:[T]×𝒜[0,1]f:[T]\times\mathcal{A}\to[0,1], chosen adaptively by the Adversary, and a set of finitely many pure actions 𝒜\mathcal{A} for the Learner, consider a collection of action-subsequence pairs 𝒜×\mathcal{H}\subseteq\mathcal{A}\times\mathcal{F}.

The Learner’s subsequence regret after round TT with respect to the collection \mathcal{H} is defined by

RT(πT):=max(j,f)t[T]f(t,at)(rattrjt),R^{T}_{\mathcal{H}}(\pi^{T}):=\max_{(j,f)\in\mathcal{H}}\sum_{t\in[T]}f(t,a^{t})\left(r^{t}_{a^{t}}-r^{t}_{j}\right),

where πT={(at,rt)}t[T]\pi^{T}=\{(a^{t},r^{t})\}_{t\in[T]} is the transcript of the interaction.

For intuition, suppose ={1}\mathcal{F}=\{\textbf{1}\}, where 1:[T]×𝒜[0,1]\textbf{1}:[T]\times\mathcal{A}\to[0,1] satisfies 1(t,a)=1\textbf{1}(t,a)=1 for all t,at,a. That is, the only relevant subsequence is the entire sequence of rounds 1,2,,T1,2,\ldots,T. If we then set =𝒜×\mathcal{H}=\mathcal{A}\times\mathcal{F}, subsequence regret specializes to the classical notion of (external) regret which was discussed above.

Moreover, we shall require the following condition on \mathcal{H} and the action sets {𝒜t}t[T]\{\mathcal{A}^{t}\}_{t\in[T]}, which simply asks that at each round, the Learner be responsible for regret only to currently available actions.

Definition E.2 (No regret to unavailable actions).

A collection of action-subsequence pairs \mathcal{H}, paired with action sets {𝒜t}t[T]\{\mathcal{A}^{t}\}_{t\in[T]}, satisfies the no-regret-to-unavailable-actions property if at each round t[T]t\in[T], for every ff\in\mathcal{F} such that (j,f)(j,f)\in\mathcal{H} for some j𝒜tj\not\in\mathcal{A}^{t}, it holds that f(t,a)=0f(t,a)=0 for all a𝒜ta\in\mathcal{A}^{t}.

It is worth noting that this condition is trivially satisfied whenever the Learner’s action set is invariant across rounds (𝒜t=𝒜\mathcal{A}^{t}=\mathcal{A} for all tt).

Theorem E.2.

Consider a sequence of action sets {𝒜t}t[T]\{\mathcal{A}^{t}\}_{t\in[T]} for the Learner, a collection \mathcal{H} of action-subsequence pairs, and a time horizon Tln||T\geq\ln|\mathcal{H}|. If \mathcal{H} and {𝒜t}t[T]\{\mathcal{A}^{t}\}_{t\in[T]} satisfy no-regret-to-unavailable-actions, then an appropriate instantiation of Algorithm 2 guarantees that the Learner’s expected subsequence regret is bounded as

𝔼πT[RT(πT)]4Tln||,\mathop{\mathbb{E}}_{\pi^{T}}\left[R^{T}_{\mathcal{H}}\left(\pi^{T}\right)\right]\leq 4\sqrt{T\ln|\mathcal{H}|},

and furthermore, for any δ(0,1)\delta\in(0,1), that with ex-ante probability 1δ1-\delta over the Learner’s randomness,

RT(πT)8Tln||δ.R^{T}_{\mathcal{H}}\left(\pi^{T}\right)\leq 8\sqrt{T\ln\frac{|\mathcal{H}|}{\delta}}.
Proof.

We instantiate our probabilistic framework of Section A.2.1.

Defining the strategy spaces.

At each round tt, the Learner’s pure strategy set will be 𝒜t\mathcal{A}^{t}, and the Adversary’s strategy space will be the convex and compact set [0,1]|𝒜|[0,1]^{|\mathcal{A}|}.

Defining the loss functions.

For all action-subsequence pairs (j,f)(j,f)\in\mathcal{H}, we define the corresponding loss (j,f)t:𝒜t×[0,1]|𝒜|[1,1]\ell^{t}_{(j,f)}:\mathcal{A}^{t}\times[0,1]^{|\mathcal{A}|}\to[-1,1] as

(j,f)t(a,rt)=f(t,a)(ratrjt),for a𝒜t,rt[0,1]|𝒜|.\ell^{t}_{(j,f)}(a,r^{t})=f(t,a)(r^{t}_{a}-r^{t}_{j}),\quad\text{for }a\in\mathcal{A}^{t},r^{t}\in[0,1]^{|\mathcal{A}|}.

It is easy to see that for all (j,f)(j,f)\in\mathcal{H} and each a𝒜ta\in\mathcal{A}^{t}, the function (j,f)t(a,)\ell^{t}_{(j,f)}(a,\cdot) is continuous and concave — in fact, linear — in the second argument, as well as bounded within [C,C][-C,C] for C=1C=1. Therefore, the technical conditions imposed by our framework on the loss functions are met.

Bounding the Adversary-Moves-First value.

At each round tt, the AMF value wAt=0w^{t}_{A}=0. Trivially, wAt0w_{A}^{t}\geq 0, as the Adversary can always set rat=0r_{a}^{t}=0 for all aa. Conversely, wAt0w^{t}_{A}\leq 0 as an easy consequence of the no-regret-to-unavailable-actions property. To see this, for any vector of actions’ losses rtr^{t}, define

art:=argmina𝒜trat,a^{*}_{r^{t}}:=\mathop{\mathrm{argmin}}_{a\in\mathcal{A}^{t}}r^{t}_{a},

and notice that

wAt\displaystyle w^{t}_{A} =suprt[0,1]|𝒜|mina𝒜t(max(j,f)(j,f)t(a,rt)),\displaystyle=\sup_{r^{t}\in[0,1]^{|\mathcal{A}|}}\min_{a\in\mathcal{A}^{t}}\left(\max_{(j,f)\in\mathcal{H}}\ell^{t}_{(j,f)}(a,r^{t})\right),
=suprt[0,1]|𝒜|mina𝒜tmax(max(j,f):j𝒜t(j,f)t(a,rt),0),\displaystyle=\sup_{r^{t}\in[0,1]^{|\mathcal{A}|}}\min_{a\in\mathcal{A}^{t}}\max\left(\max_{(j,f)\in\mathcal{H}:j\in\mathcal{A}^{t}}\ell^{t}_{(j,f)}(a,r^{t}),0\right), (no regret to unavailable actions)
suprt[0,1]|𝒜|max(max(j,f):j𝒜t(j,f)t(art,rt),0),\displaystyle\leq\sup_{r^{t}\in[0,1]^{|\mathcal{A}|}}\max\left(\max_{(j,f)\in\mathcal{H}:j\in\mathcal{A}^{t}}\ell^{t}_{(j,f)}(a^{*}_{r^{t}},r^{t}),0\right),
=suprt[0,1]|𝒜|max(max(j,f):j𝒜tf(t,art)(rarttrjt),0),\displaystyle=\sup_{r^{t}\in[0,1]^{|\mathcal{A}|}}\max\left(\max_{(j,f)\in\mathcal{H}:j\in\mathcal{A}^{t}}f(t,a^{*}_{r^{t}})(r^{t}_{a^{*}_{r^{t}}}-r^{t}_{j}),0\right),
suprt[0,1]|𝒜|max(max(j,f):j𝒜tf(t,art)(rjtrjt),0),\displaystyle\leq\sup_{r^{t}\in[0,1]^{|\mathcal{A}|}}\max\left(\max_{(j,f)\in\mathcal{H}:j\in\mathcal{A}^{t}}f(t,a^{*}_{r^{t}})(r^{t}_{j}-r^{t}_{j}),0\right), (by definition of art)\displaystyle\text{(by definition of }a^{*}_{r^{t}}\text{)}
=suprt[0,1]|𝒜|max(0,0),\displaystyle=\sup_{r^{t}\in[0,1]^{|\mathcal{A}|}}\max\left(0,0\right),
=0.\displaystyle=0.

We thus conclude that Theorems A.1 and A.2 apply (with C=1C=1 and all wAt=0w^{t}_{A}=0) to the subsequence regret setting, yielding the claimed in-expectation and high-probability regret bounds. ∎

We now instantiate subsequence regret with various choices of subsequence families, in order to get bounds and efficient algorithms for several standard notions of regret from the literature. For brevity, for each notion of regret considered below we only exhibit the existential in-expectation guarantee for that type of regret, and omit the corresponding high-probability bounds (which are all easily derivable from Theorem A.2). We also point out that all in-expectation bounds cited below are efficiently achievable by instantiating, with appropriate loss functions, the no-subsequence regret Algorithms 6 and 7 derived in the following Section E.3.

In all no-regret settings discussed below, except for Sleeping Experts, the Learner has a pure and finite action set 𝒜\mathcal{A} at every round t[T]t\in[T]; furthermore — as usual — the Adversary’s role at each round consists in selecting the vector of per-action losses (rat)a𝒜[0,1]|𝒜|(r^{t}_{a})_{a\in\mathcal{A}}\in[0,1]^{|\mathcal{A}|}.

Internal and Swap Regret

To introduce the notion of internal regret [Foster and Vohra, 1998], consider the following collection int𝒜𝒜\mathcal{M}_{\mathrm{int}}\subset\mathcal{A}^{\mathcal{A}} of mappings from the action set 𝒜\mathcal{A} to itself. int\mathcal{M}_{\mathrm{int}} consists of the identity map μid\mu_{\mathrm{id}} (such that μid(a)=a\mu_{\mathrm{id}}(a)=a for all a𝒜a\in\mathcal{A}), together with all |𝒜|(|𝒜|1)|\mathcal{A}|(|\mathcal{A}|-1) maps μij\mu_{i\to j} that pair two particular actions: i.e., μij(i)=j\mu_{i\to j}(i)=j, and μij(a)=a\mu_{i\to j}(a)=a for aia\neq i. The Learner’s internal regret is then defined as

RintT:=maxμintt[T]rattrμ(at)t.R^{T}_{\mathrm{int}}:=\max_{\mu\in\mathcal{M}_{\mathrm{int}}}\sum_{t\in[T]}r^{t}_{a^{t}}-r^{t}_{\mu(a^{t})}.

In other words, the Learner’s total loss is being compared to all possible counterfactual worlds, for i,j𝒜i,j\in\mathcal{A}, in which whenever the Learner played some action ii, it got replaced with action jj (and other actions remain fixed).

We can reduce the problem of obtaining no-internal-regret to the problem of obtaining no subsequence regret for a simple choice of subsequences. Let us define the following set of subsequences: :={fi:i𝒜}\mathcal{F}:=\{f_{i}:i\in\mathcal{A}\}, where each fif_{i} is defined to be the indicator of the subsequence where the Learner played action ii — that is, for all t[T]t\in[T], we let fi(t,a)=1a=if_{i}(t,a)=1_{a=i}. Then, we let :=𝒜×\mathcal{H}:=\mathcal{A}\times\mathcal{F}. By the in-expectation no-subsequence-regret guarantee, we then have

𝔼[max(j,f)t[T]f(t,at)(rattrjt)]4Tln||=42Tln|𝒜|,\mathop{\mathbb{E}}\left[\max_{(j,f)\in\mathcal{H}}\sum_{t\in[T]}f(t,a^{t})\left(r^{t}_{a^{t}}-r^{t}_{j}\right)\right]\leq 4\sqrt{T\ln|\mathcal{H}|}=4\sqrt{2T\ln|\mathcal{A}|},

since ||=|𝒜|||=|𝒜|2|\mathcal{H}|=|\mathcal{A}|\cdot|\mathcal{F}|=|\mathcal{A}|^{2}.

But observe that the Learner’s internal regret precisely coincides with the just defined instance of subsequence regret:

RintT\displaystyle R^{T}_{\mathrm{int}} =maxμintt[T]rattrμ(at)t=maxi,j𝒜t[T]:at=iritrjt=maxj𝒜maxfi:i𝒜t[T]fi(t,at)(rattrjt)\displaystyle=\max_{\mu\in\mathcal{M}_{\mathrm{int}}}\sum_{t\in[T]}r^{t}_{a^{t}}-r^{t}_{\mu(a^{t})}=\max_{i,j\in\mathcal{A}}\sum_{t\in[T]:a^{t}=i}r^{t}_{i}-r^{t}_{j}=\max_{j\in\mathcal{A}}\max_{f_{i}:i\in\mathcal{A}}\sum_{t\in[T]}f_{i}(t,a^{t})(r^{t}_{a^{t}}-r^{t}_{j})
=max(j,f)t[T]f(t,at)(rattrjt).\displaystyle=\max_{(j,f)\in\mathcal{H}}\sum_{t\in[T]}f(t,a^{t})(r^{t}_{a^{t}}-r^{t}_{j}).

Therefore, we have established the following existential in-expectation internal regret bound:

𝔼[RintT]42Tln|𝒜|,\mathop{\mathbb{E}}\left[R^{T}_{\mathrm{int}}\right]\leq 4\sqrt{2T\ln|\mathcal{A}|},

which is optimal.
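The equivalence above is easy to check numerically. The following Python sketch (on a randomly generated transcript, purely for illustration) computes internal regret both directly from its definition and via the subsequence formulation, and confirms that the two quantities coincide.

import numpy as np

rng = np.random.default_rng(0)
T, A = 500, 4
r = rng.random((T, A))            # losses r^t_a chosen by the Adversary
a = rng.integers(A, size=T)       # the Learner's realized actions a^t

# Direct definition: max over the pair-swap rules mu_{i -> j} (i = j recovers the identity map).
direct = max(
    (r[a == i, i] - r[a == i, j]).sum()
    for i in range(A) for j in range(A)
)

# Subsequence formulation: f_i(t, a^t) = 1{a^t = i}, with H = A x {f_i}.
subseq = max(
    ((a == i) * (r[np.arange(T), a] - r[:, j])).sum()
    for i in range(A) for j in range(A)
)

print(np.isclose(direct, subseq))  # True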

The notion of swap regret, introduced in Blum and Mansour [2007], is strictly more demanding than internal regret in that it considers strategy modification rules μ\mu that can perform more than one action swap at a time. Consider the set swap\mathcal{M}_{\mathrm{swap}} of all |𝒜||𝒜||\mathcal{A}|^{|\mathcal{A}|} swapping rules μ:𝒜𝒜\mu:\mathcal{A}\to\mathcal{A}. The Learner’s swap regret is defined to be the maximum of her regret to all swapping rules:

RswapT:=maxμswapt[T]rattrμ(at)t.R^{T}_{\mathrm{swap}}:=\max_{\mu\in\mathcal{M}_{\mathrm{swap}}}\sum_{t\in[T]}r^{t}_{a^{t}}-r^{t}_{\mu(a^{t})}.

The interpretation is that the Learner’s total loss is being compared to the total loss of any remapping of her action sequence.

An easy reduction shows that the swap regret is upper-bounded by |𝒜||\mathcal{A}| times the internal regret. For completeness, we provide the details of this reduction in Appendix E.4. The reduction implies an in-expectation bound of 4|𝒜|2Tln|𝒜|4|\mathcal{A}|\sqrt{2T\ln|\mathcal{A}|} on swap regret, which, compared to the optimal bound of O(T|𝒜|ln|𝒜|)O(\sqrt{T|\mathcal{A}|\ln|\mathcal{A}|}) (see Blum and Mansour [2007]), has suboptimal dependence on |𝒜||\mathcal{A}|.

Adaptive Regret

In this setting, consider all contiguous time intervals within rounds 1,,T1,\ldots,T, namely, all intervals [t1,t2][t_{1},t_{2}], where t1,t2t_{1},t_{2} are integers such that 1t1t2T1\leq t_{1}\leq t_{2}\leq T. The Learner’s regret on each interval [t1,t2][t_{1},t_{2}] is defined as her total loss over the rounds t[t1,t2]t\in[t_{1},t_{2}], minus the loss of the best action for that interval in hindsight. The Learner’s adaptive regret is then defined to be her maximum regret over all contiguous time intervals:

RadaptiveT:=max[t1,t2]:1t1t2Tmaxj𝒜t=t1t2rattrjt.R^{T}_{\mathrm{adaptive}}:=\max_{[t_{1},t_{2}]:1\leq t_{1}\leq t_{2}\leq T}\max_{j\in\mathcal{A}}\sum_{t=t_{1}}^{t_{2}}r^{t}_{a^{t}}-r^{t}_{j}.

We observe that adaptive regret corresponds to subsequence regret with respect to :=𝒜×\mathcal{H}:=\mathcal{A}\times\mathcal{F}, where :={f[t1,t2]:1t1t2T}\mathcal{F}:=\{f_{[t_{1},t_{2}]}:1\leq t_{1}\leq t_{2}\leq T\} is the collection of subinterval indicator subsequences — that is, f[t1,t2](t,a):=1t1tt2f_{[t_{1},t_{2}]}(t,a):=1_{t_{1}\leq t\leq t_{2}} for all t[T]t\in[T] and a𝒜a\in\mathcal{A}. Observe that ||T2|\mathcal{F}|\leq T^{2}, and therefore, the expected regret upper bound for subsequence regret specializes to the following expected adaptive regret bound:

𝔼[RadaptiveT]4Tln(|𝒜|||)4T(ln|𝒜|+2lnT).\mathop{\mathbb{E}}\left[R^{T}_{\mathrm{adaptive}}\right]\leq 4\sqrt{T\ln(|\mathcal{A}||\mathcal{F}|)}\leq 4\sqrt{T(\ln|\mathcal{A}|+2\ln T)}.
Sleeping Experts

Following Blum and Mansour [2007], we define the sleeping experts setting as follows. Suppose that the Learner is initially given a set of pure actions 𝒜\mathcal{A}, and before each round tt, the Adversary chooses a subset of pure actions 𝒜t𝒜\mathcal{A}^{t}\subseteq\mathcal{A} available to the Learner at that round — these are known as the “awake experts”, and the rest of the experts are the “sleeping experts” at that round.

The Learner’s regret to each action j𝒜j\in\mathcal{A} is defined to be the excess total loss of the Learner during rounds where jj was “awake”, compared to the total loss of jj over those rounds. Formally, the Learner’s sleeping experts regret after round TT is defined to be

RsleepingT:=maxj𝒜t[T]:j𝒜trattrjt.R^{T}_{\mathrm{sleeping}}:=\max_{j\in\mathcal{A}}\sum_{t\in[T]:j\in\mathcal{A}^{t}}r^{t}_{a^{t}}-r^{t}_{j}.

This is clearly an instance of subsequence regret — indeed, we may consider the family of subsequences :={fj:j𝒜}\mathcal{F}:=\{f_{j}:j\in\mathcal{A}\}, where fj(t,a):=1j𝒜tf_{j}(t,a):=1_{j\in\mathcal{A}^{t}} for all j,a,tj,a,t, and let :={(j,fj)}j𝒜\mathcal{H}:=\{(j,f_{j})\}_{j\in\mathcal{A}}. It is easy to verify that the no-regret-to-unavailable-actions property holds, and thus the guarantees of the subsequence regret setting carry over to this sleeping experts setting. In particular, the following existential in-expectation sleeping experts regret bound holds:

𝔼[RsleepingT]4Tln|𝒜|,\mathop{\mathbb{E}}\left[R^{T}_{\mathrm{sleeping}}\right]\leq 4\sqrt{T\ln|\mathcal{A}|},

which is also optimal in this setting.

Multi-Group Regret

We imagine that before each round, the Adversary selects and reveals to the Learner some context θt\theta^{t} from an underlying feature space Θ\Theta. The interpretation is that the Learner’s decision at round tt will pertain to an individual with features θt\theta^{t}. Additionally, there is a fixed collection 𝒢2Θ\mathcal{G}\subset 2^{\Theta}, where each g𝒢g\in\mathcal{G} is interpreted as a (demographic) group of individuals within the population Θ\Theta. Here 𝒢\mathcal{G} may be large and may consist of overlapping groups. The Learner’s goal is to minimize regret to each action a𝒜a\in\mathcal{A} not just over the entire population, but also separately for each population group g𝒢g\in\mathcal{G}. Explicitly, the Learner’s multi-group regret after round TT is defined to be

RmultiT:=maxg𝒢maxj𝒜t[T]:θtgrattrjt.R^{T}_{\mathrm{multi}}:=\max_{g\in\mathcal{G}}\max_{j\in\mathcal{A}}\sum_{t\in[T]:\theta^{t}\in g}r^{t}_{a^{t}}-r^{t}_{j}.

It is easy to see that multi-group regret corresponds to subsequence regret with :=𝒜×\mathcal{H}:=\mathcal{A}\times\mathcal{F}, where :={fg:g𝒢}\mathcal{F}:=\{f_{g}:g\in\mathcal{G}\} is the collection of group indicator subsequences — that is, fg(t,a):=1θtgf_{g}(t,a):=1_{\theta^{t}\in g} for all t,at,a. Here we are taking advantage of the fact that the functions ff on which subsequences are defined need not be known to the algorithm ahead of time, and can be revealed sequentially by the Adversary, allowing us to model adversarially chosen contexts. Therefore, multi-group regret inherits subsequence regret guarantees, and in particular, we obtain the following existential in-expectation multi-group regret bound:

𝔼[RmultiT]4Tln(|𝒜||𝒢|).\mathop{\mathbb{E}}\left[R^{T}_{\mathrm{multi}}\right]\leq 4\sqrt{T\ln(|\mathcal{A}||\mathcal{G}|)}.

Observe that this bound scales only as ln|𝒢|\sqrt{\ln|\mathcal{G}|} with respect to the number of population groups, which we can therefore take to be exponentially large in the parameters of the problem.
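To illustrate how adversarially revealed contexts fit into the subsequence template, the following Python sketch builds the group-indicator subsequences f_g from a synthetic stream of contexts and evaluates the multi-group regret of an arbitrary (randomly chosen) action sequence; the feature space, groups, and transcript here are all illustrative.

import numpy as np

rng = np.random.default_rng(0)
T, A = 1000, 3
theta = rng.integers(10, size=T)                              # contexts theta^t from a 10-element feature space
groups = [set(range(5)), set(range(3, 8)), set(range(10))]    # overlapping groups g in G
r = rng.random((T, A))                                        # losses r^t_a
a = rng.integers(A, size=T)                                   # the Learner's realized actions a^t
incurred = r[np.arange(T), a]                                 # r^t_{a^t}

def regret_on(g, j):
    in_g = np.isin(theta, list(g))    # the subsequence indicator f_g(t) = 1{theta^t in g}
    return (incurred[in_g] - r[in_g, j]).sum()

multi_group_regret = max(regret_on(g, j) for g in groups for j in range(A))
print(multi_group_regret)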

E.3 Deriving No-Subsequence-Regret Algorithms

We now present a way to specialize Algorithm 2 to the setting of subsequence regret with no-regret-to-unavailable-actions. At each round, instead of solving a convex-concave problem, the specialized algorithm will only need to solve a polynomial-sized linear program.

  for t=1,,Tt=1,\dots,T do
     Learn the current set of feasible actions 𝒜t\mathcal{A}^{t} (potentially selected by an Adversary).
     Learn the values f(t,a)f(t,a) for every a𝒜ta\in\mathcal{A}^{t} and ff\in\mathcal{F} (potentially selected by an Adversary).
     Solve for xt=(xat)a𝒜tΔ𝒜tx^{t}=(x^{t}_{a})_{a\in\mathcal{A}^{t}}\in\Delta\mathcal{A}^{t} defined by the following linear inequalities for all a𝒜ta\in\mathcal{A}^{t}:
xat(j,f)exp(ηs=1t1(j,f)s(as,rs))f(t,a)j𝒜txjtf:(a,f)exp(ηs=1t1(a,f)s(as,rs))f(t,j)0\!\!\!\!\!\!x^{t}_{a}\!\!\!\sum_{(j,f)\in\mathcal{H}}\!\!\!\!\exp\left(\!\eta\sum_{s=1}^{t-1}\ell_{(j,f)}^{s}(a^{s},r^{s})\!\right)f(t,a)-\sum_{j\in\mathcal{A}^{t}}x^{t}_{j}\!\!\!\!\sum_{f:(a,f)\in\mathcal{H}}\!\!\!\!\!\!\exp\left(\!\eta\sum_{s=1}^{t-1}\ell_{(a,f)}^{s}(a^{s},r^{s})\!\right)f(t,j)\leq 0
     Sample atxta^{t}\sim x^{t}.
Algorithm 6 Efficient No Subsequence Regret Algorithm for the Learner
Theorem E.3.

Algorithm 6 implements Algorithm 2 in the subsequence regret setting, and achieves the same guarantees.

Proof.

In parallel to the notation of Algorithm 2, we define the following set of weights at round t[T]t\in[T]:

χ(j,f)t:=1Ztexp(ηs=1t1(j,f)s(as,rs)),\chi^{t}_{(j,f)}:=\frac{1}{Z^{t}}\exp\left(\eta\sum_{s=1}^{t-1}\ell_{(j,f)}^{s}(a^{s},r^{s})\right),

where

Zt:=(j,f)exp(ηs=1t1(j,f)s(as,rs)).Z^{t}:=\sum_{(j,f)\in\mathcal{H}}\exp\left(\eta\sum_{s=1}^{t-1}\ell_{(j,f)}^{s}(a^{s},r^{s})\right).

When instantiated with our current set of loss functions, Algorithm 2 solves the following zero-sum game at round t[T]t\in[T], where we denote (j,f)t(x,rt):=𝔼ax[(j,f)t(a,rt)]\ell^{t}_{(j,f)}(x,r^{t}):=\mathop{\mathbb{E}}_{a\sim x}[\ell^{t}_{(j,f)}(a,r^{t})]:

xtargminxΔ𝒜tmaxrt[0,1]|𝒜|(j,f)χ(j,f)t(j,f)t(x,rt).x^{t}\in\mathop{\mathrm{argmin}}_{x\in\Delta\mathcal{A}^{t}}\max_{r^{t}\in[0,1]^{|\mathcal{A}|}}\sum_{(j,f)\in\mathcal{H}}\chi^{t}_{(j,f)}\cdot\ell^{t}_{(j,f)}\left(x,r^{t}\right).

By definition of the loss functions in the subsequence regret setting, the objective function is linear in the Adversary’s choice of rtr^{t}. Thus, let us rewrite the objective as a linear combination of (rat)a𝒜t(r^{t}_{a})_{a\in\mathcal{A}^{t}}:

(j,f)χ(j,f)t(j,f)t(x,rt),\displaystyle\sum_{(j,f)\in\mathcal{H}}\chi^{t}_{(j,f)}\cdot\ell^{t}_{(j,f)}(x,r^{t}),
=\displaystyle= (j,f)χ(j,f)ta𝒜txaf(t,a)(ratrjt),\displaystyle\sum_{(j,f)\in\mathcal{H}}\chi^{t}_{(j,f)}\sum_{a\in\mathcal{A}^{t}}x_{a}\cdot f(t,a)\cdot(r^{t}_{a}-r^{t}_{j}),
=\displaystyle= (j,f)a𝒜tratxaf(t,a)χ(j,f)t(j,f)a𝒜trjtxaf(t,a)χ(j,f)t,\displaystyle\sum_{(j,f)\in\mathcal{H}}\sum_{a\in\mathcal{A}^{t}}r^{t}_{a}\cdot x_{a}\cdot f(t,a)\cdot\chi^{t}_{(j,f)}-\sum_{(j,f)\in\mathcal{H}}\sum_{a\in\mathcal{A}^{t}}r^{t}_{j}\cdot x_{a}\cdot f(t,a)\cdot\chi^{t}_{(j,f)},
which, by the no-regret-to-unavailable actions property,
=\displaystyle= a𝒜tratxa(j,f)f(t,a)χ(j,f)tj𝒜trjta𝒜txaf:(j,f)f(t,a)χ(j,f)t,\displaystyle\sum_{a\in\mathcal{A}^{t}}r^{t}_{a}\cdot x_{a}\sum_{(j,f)\in\mathcal{H}}f(t,a)\cdot\chi^{t}_{(j,f)}-\sum_{j\in\mathcal{A}^{t}}r^{t}_{j}\sum_{a\in\mathcal{A}^{t}}x_{a}\sum_{f:(j,f)\in\mathcal{H}}f(t,a)\cdot\chi^{t}_{(j,f)},
and now, swapping jj and aa in the second summation,
=\displaystyle= a𝒜tratxa(j,f)f(t,a)χ(j,f)ta𝒜tratj𝒜txjf:(a,f)f(t,j)χ(a,f)t,\displaystyle\sum_{a\in\mathcal{A}^{t}}r^{t}_{a}\cdot x_{a}\sum_{(j,f)\in\mathcal{H}}f(t,a)\cdot\chi^{t}_{(j,f)}-\sum_{a\in\mathcal{A}^{t}}r^{t}_{a}\sum_{j\in\mathcal{A}^{t}}x_{j}\sum_{f:(a,f)\in\mathcal{H}}f(t,j)\cdot\chi^{t}_{(a,f)},
=\displaystyle= a𝒜trat(xa(j,f)f(t,a)χ(j,f)tj𝒜txjf:(a,f)f(t,j)χ(a,f)t:=ca(x)).\displaystyle\sum_{a\in\mathcal{A}^{t}}r_{a}^{t}\left(\underbrace{x_{a}\sum_{(j,f)\in\mathcal{H}}f(t,a)\cdot\chi^{t}_{(j,f)}-\sum_{j\in\mathcal{A}^{t}}x_{j}\sum_{f:(a,f)\in\mathcal{H}}f(t,j)\cdot\chi^{t}_{(a,f)}}_{:=c_{a}(x)}\right).

Thus, the zero-sum game played at round tt has objective function a𝒜tca(xt)rat,\sum\limits_{a\in\mathcal{A}^{t}}c_{a}(x^{t})\cdot r^{t}_{a}, where the coefficients ca(xt)c_{a}(x^{t}) do not depend on the Adversary’s action rtr^{t}. Recall that this game has value at most wAt=0w^{t}_{A}=0. Hence, maxa𝒜tca(xt)0\max_{a\in\mathcal{A}^{t}}c_{a}(x^{t})\leq 0 for any minimax optimal strategy xtx^{t} for the Learner — since otherwise, if some ca(xt)>0c_{a^{\prime}}(x^{t})>0, the Adversary would get value ca(xt)>0c_{a^{\prime}}(x^{t})>0 by setting rat=1r_{a^{\prime}}^{t}=1 and rat=0r_{a}^{t}=0 for aaa\neq a^{\prime}. Conversely, by playing xtx^{t} such that maxa𝒜tca(xt)0\max\limits_{a\in\mathcal{A}^{t}}c_{a}(x^{t})\leq 0, the Learner gets value 0\leq 0, as rat0r_{a}^{t}\geq 0 for all aa.

Therefore, the Learner’s choice of xtx^{t} is minimax optimal if and only if for all a𝒜ta\in\mathcal{A}^{t},

ca(xt)0Ztca(xt)0\displaystyle c_{a}(x^{t})\leq 0\iff Z^{t}\cdot c_{a}(x^{t})\leq 0\iff
xat(j,f)f(t,a)exp(ηs=1t1(j,f)s(as,rs))j𝒜txjtf:(a,f)f(t,j)exp(ηs=1t1(a,f)s(as,rs))0.\displaystyle x^{t}_{a}\sum_{(j,f)\in\mathcal{H}}f(t,a)\exp\left(\eta\sum_{s=1}^{t-1}\ell_{(j,f)}^{s}(a^{s},r^{s})\right)-\sum_{j\in\mathcal{A}^{t}}x^{t}_{j}\sum_{f:(a,f)\in\mathcal{H}}f(t,j)\exp\left(\eta\sum_{s=1}^{t-1}\ell_{(a,f)}^{s}(a^{s},r^{s})\right)\leq 0.

This recovers Algorithm 6, concluding the proof. ∎
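The per-round problem in Algorithm 6 is a small linear feasibility problem. The sketch below is a hypothetical Python implementation using scipy.optimize.linprog; it assumes the accumulated exponential weights exp(η Σ_s ℓ^s_{(j,f)}) are available in a dictionary W keyed by the pairs (j, f), and that the current round's subsequence values f(t, a) are available as f_val[f][a]. This data layout and the helper name are ours, not the paper's.

import numpy as np
from scipy.optimize import linprog

def algorithm6_round(avail, H, W, f_val):
    """One round of Algorithm 6, written as a linear feasibility problem.
    avail : the currently available actions A^t
    H     : list of (action, subsequence) pairs (j, f)
    W     : dict (j, f) -> exp(eta * sum_{s<t} loss_{(j,f)}^s(a^s, r^s))
    f_val : dict f -> {a: f(t, a)} for the current round t
    Returns x^t as a dict mapping each available action to its probability."""
    m = len(avail)
    idx = {act: k for k, act in enumerate(avail)}
    A_ub = np.zeros((m, m))
    for a in avail:
        # Coefficient multiplying x_a:  sum over (j, f) in H of W[(j, f)] * f(t, a).
        A_ub[idx[a], idx[a]] += sum(W[(j, f)] * f_val[f][a] for (j, f) in H)
        # Subtract, for each j in A^t:  x_j * sum over f with (a, f) in H of W[(a, f)] * f(t, j).
        for j in avail:
            A_ub[idx[a], idx[j]] -= sum(W[(aa, f)] * f_val[f][j] for (aa, f) in H if aa == a)
    res = linprog(c=np.zeros(m), A_ub=A_ub, b_ub=np.zeros(m),
                  A_eq=np.ones((1, m)), b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * m, method="highs")
    return {a: res.x[idx[a]] for a in avail}

# Toy instance: two actions and a single action-independent subsequence "all" with f(t, .) = 1.
avail = [0, 1]
H = [(0, "all"), (1, "all")]
W = {(0, "all"): 2.0, (1, "all"): 1.0}       # hypothetical accumulated weights
f_val = {"all": {0: 1.0, 1: 1.0}}
print(algorithm6_round(avail, H, W, f_val))  # approximately {0: 2/3, 1: 1/3}

For this action-independent toy instance, the solution is proportional to the accumulated weights, matching the closed form derived next.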

Simplification for Action Independent Subsequences

The above Algorithm 6 requires solving a linear feasibility problem. This mirrors how existing algorithms for the special case of minimizing internal regret operate (Blum and Mansour [2007]); recall that internal regret corresponds to subsequence regret for a certain collection of |𝒜||\mathcal{A}| subsequences that depend on the Learner’s action in the current round tt.

By contrast, if all of our subsequence indicators ff\in\mathcal{F} are action independent, that is, satisfy f(t,a)=f(t,a)f(t,a)=f(t,a^{\prime}) for all a,a𝒜a,a^{\prime}\in\mathcal{A} and t[T]t\in[T], then it turns out that we can avoid solving a system of linear inequalities: our equilibrium has a closed form. In what follows, we abuse notation and simply write f(t)f(t) for the value of the subsequence ff at round tt.

Observe that if each ff\in\mathcal{F} is action independent, then we can rewrite our equilibrium characterization in Algorithm 6 as the requirement that the Learner’s chosen distribution xtΔ𝒜tx^{t}\in\Delta\mathcal{A}^{t} must satisfy, for each a𝒜ta\in\mathcal{A}^{t} (provided that f(t)0f(t)\neq 0 for at least some ff\in\mathcal{F}), the following inequality:

xat\displaystyle x^{t}_{a} \displaystyle\leq j𝒜txjtf:(a,f)f(t)exp(ηs=1t1(a,f)s(as,rs))(j,f)f(t)exp(ηs=1t1(j,f)s(as,rs)),\displaystyle\frac{\sum_{j\in\mathcal{A}^{t}}x^{t}_{j}\sum_{f:(a,f)\in\mathcal{H}}f(t)\exp\left(\eta\sum_{s=1}^{t-1}\ell_{(a,f)}^{s}(a^{s},r^{s})\right)}{\sum_{(j,f)\in\mathcal{H}}f(t)\exp\left(\eta\sum_{s=1}^{t-1}\ell_{(j,f)}^{s}(a^{s},r^{s})\right)},
=\displaystyle= f:(a,f)f(t)exp(ηs=1t1(a,f)s(as,rs))(j,f)f(t)exp(ηs=1t1(j,f)s(as,rs)).\displaystyle\frac{\sum_{f:(a,f)\in\mathcal{H}}f(t)\exp\left(\eta\sum_{s=1}^{t-1}\ell_{(a,f)}^{s}(a^{s},r^{s})\right)}{\sum_{(j,f)\in\mathcal{H}}f(t)\exp\left(\eta\sum_{s=1}^{t-1}\ell_{(j,f)}^{s}(a^{s},r^{s})\right)}.

Here the equality follows because xtΔ𝒜tx^{t}\in\Delta\mathcal{A}^{t} is a probability distribution.

We now observe that setting each xatx^{t}_{a} to be its upper bound, for a𝒜ta\in\mathcal{A}^{t}, yields a probability distribution over 𝒜t\mathcal{A}^{t}, which is consequently the unique feasible solution to the above system. Hence, for action independent subsequences, we have a closed-form implementation of Algorithm 6 that does not require solving a linear feasibility problem:

  for t=1,,Tt=1,\dots,T do
     Learn the current set of feasible actions 𝒜t\mathcal{A}^{t} and the values f(t)f(t) for every ff\in\mathcal{F} (potentially selected by an Adversary).
     Sample atxta^{t}\sim x^{t}, where for all a𝒜ta\in\mathcal{A}^{t},
xat=f:(a,f)f(t)exp(ηs=1t1(a,f)s(as,rs))(j,f)f(t)exp(ηs=1t1(j,f)s(as,rs)).x^{t}_{a}=\frac{\sum_{f:(a,f)\in\mathcal{H}}f(t)\exp\left(\eta\sum_{s=1}^{t-1}\ell_{(a,f)}^{s}(a^{s},r^{s})\right)}{\sum_{(j,f)\in\mathcal{H}}f(t)\exp\left(\eta\sum_{s=1}^{t-1}\ell_{(j,f)}^{s}(a^{s},r^{s})\right)}.
Algorithm 7 An Efficient Learner for Action Independent Subsequences
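In the same (hypothetical) data layout as the previous sketch, the closed form of Algorithm 7 needs no LP solver:

import numpy as np

def algorithm7_round(avail, H, W, f_t):
    """One round of Algorithm 7 for action-independent subsequences.
    avail : the currently available actions A^t
    H     : list of (action, subsequence) pairs (j, f)
    W     : dict (j, f) -> exp(eta * sum_{s<t} loss_{(j,f)}^s(a^s, r^s))
    f_t   : dict f -> f(t), the current round's (action-independent) subsequence values
    Returns the closed-form distribution x^t over avail."""
    numer = np.array([sum(f_t[f] * W[(aa, f)] for (aa, f) in H if aa == a) for a in avail])
    denom = sum(f_t[f] * W[(j, f)] for (j, f) in H)
    return numer / denom

# The toy instance from the previous sketch again yields x^t = (2/3, 1/3).
avail = [0, 1]
H = [(0, "all"), (1, "all")]
W = {(0, "all"): 2.0, (1, "all"): 1.0}
f_t = {"all": 1.0}
x_t = algorithm7_round(avail, H, W, f_t)
a_t = np.random.default_rng(0).choice(avail, p=x_t)   # sample a^t ~ x^t
print(x_t, a_t)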

E.4 Omitted Reductions between Different Notions of Regret

Reducing swap regret to internal regret

We can upper bound the swap regret by reusing the instance of subsequence regret that we defined to capture internal regret. Recall that it was defined as follows. We let :={fi:i𝒜}\mathcal{F}:=\{f_{i}:i\in\mathcal{A}\}, where each fif_{i} is the indicator of the subsequence of rounds where the Learner played action ii — that is, for all t[T]t\in[T], we let fi(t,a)=1a=if_{i}(t,a)=1_{a=i}. Then, we let :=𝒜×\mathcal{H}:=\mathcal{A}\times\mathcal{F}. We then obtained the in-expectation regret guarantee

𝔼[max(j,f)t[T]f(t,at)(rattrjt)]42Tln|𝒜|.\mathop{\mathbb{E}}\left[\max_{(j,f)\in\mathcal{H}}\sum_{t\in[T]}f(t,a^{t})\left(r^{t}_{a^{t}}-r^{t}_{j}\right)\right]\leq 4\sqrt{2T\ln|\mathcal{A}|}.

Returning to swap regret, note that for any fixed swapping rule μ:𝒜𝒜\mu:\mathcal{A}\to\mathcal{A}, we have

t[T]rattrμ(at)t\displaystyle\sum_{t\in[T]}r^{t}_{a^{t}}-r^{t}_{\mu(a^{t})} =i𝒜t[T]:at=irattrμ(i)t\displaystyle=\sum_{i\in\mathcal{A}}\sum_{t\in[T]:a^{t}=i}r^{t}_{a^{t}}-r^{t}_{\mu(i)}
i𝒜maxj𝒜t[T]:at=irattrjt\displaystyle\leq\sum_{i\in\mathcal{A}}\max_{j\in\mathcal{A}}\sum_{t\in[T]:a^{t}=i}r^{t}_{a^{t}}-r^{t}_{j}
|𝒜|maxi𝒜maxj𝒜t[T]:at=irattrjt\displaystyle\leq|\mathcal{A}|\max_{i\in\mathcal{A}}\max_{j\in\mathcal{A}}\sum_{t\in[T]:a^{t}=i}r^{t}_{a^{t}}-r^{t}_{j}
=|𝒜|max(j,f)t[T]f(t,at)(rattrjt),\displaystyle=|\mathcal{A}|\max_{(j,f)\in\mathcal{H}}\sum_{t\in[T]}f(t,a^{t})\left(r^{t}_{a^{t}}-r^{t}_{j}\right),

where in the last line we simply reparametrized the maximum over i𝒜i\in\mathcal{A} as the maximum over all ff\in\mathcal{F}. Since the above holds for any μswap\mu\in\mathcal{M}_{\mathrm{swap}}, we have

RswapT=maxμswapt[T]rattrμ(at)t|𝒜|max(j,f)t[T]f(t,at)(rattrjt),R^{T}_{\mathrm{swap}}=\max_{\mu\in\mathcal{M}_{\mathrm{swap}}}\sum_{t\in[T]}r^{t}_{a^{t}}-r^{t}_{\mu(a^{t})}\leq|\mathcal{A}|\max_{(j,f)\in\mathcal{H}}\sum_{t\in[T]}f(t,a^{t})\left(r^{t}_{a^{t}}-r^{t}_{j}\right),

and therefore, we conclude that there exists an efficient algorithm that achieves expected swap regret

𝔼[RswapT]4|𝒜|2Tln|𝒜|.\mathop{\mathbb{E}}\left[R^{T}_{\mathrm{swap}}\right]\leq 4|\mathcal{A}|\sqrt{2T\ln|\mathcal{A}|}.
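The reduction above is also easy to check numerically: the following Python sketch computes the swap regret exactly (by optimizing the swap rule one action at a time), computes the internal regret, and confirms that R_swap ≤ |𝒜| · R_int on a random transcript (which is illustrative only).

import numpy as np

rng = np.random.default_rng(0)
T, A = 500, 4
r = rng.random((T, A))                 # losses r^t_a
a = rng.integers(A, size=T)            # the Learner's realized actions a^t

# Per-pair regrets: regret[i, j] = sum over rounds with a^t = i of (r^t_i - r^t_j).
regret = np.array([[(r[a == i, i] - r[a == i, j]).sum() for j in range(A)] for i in range(A)])

R_int = regret.max()                   # internal regret (the identity map contributes 0)
R_swap = regret.max(axis=1).sum()      # the best swap rule picks the best j for each i
print(R_swap <= A * R_int)             # True, as in the reduction above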
Wide-range regret and its connection to subsequence regret

The wide-range regret setting was first introduced in Lehrer [2003] and then studied, in particular, in Blum and Mansour [2007] and Greenwald and Jafari [2003]. It is quite general, and is in fact equivalent to the subsequence regret setting, up to a reparametrization.

Just as in the subsequence regret setting, imagine there is a finite family of subsequences $\mathcal{F}$, where each $f\in\mathcal{F}$ has the form $f:[T]\times\mathcal{A}\to[0,1]$. Moreover, suppose there is a finite family $\mathcal{M}$ of modification rules. Each modification rule $\mu\in\mathcal{M}$ is defined as a mapping $\mu:[T]\times\mathcal{A}\to\mathcal{A}$, which has the interpretation that if at time $t$ the Learner plays action $a^{t}$, then the modification rule modifies this action into another action $\mu(t,a^{t})\in\mathcal{A}$. Now, consider a collection of modification rule-subsequence pairs $\mathcal{H}\subseteq\mathcal{M}\times\mathcal{F}$. The Learner's wide-range regret with respect to $\mathcal{H}$ is defined as

$$R^{T}_{\mathrm{wide}}:=\max_{(\mu,f)\in\mathcal{H}}\sum_{t\in[T]}f(t,a^{t})\left(r^{t}_{a^{t}}-r^{t}_{\mu(t,a^{t})}\right).$$

It is evident that wide-range regret contains subsequence regret (in the case where the Learner's action set satisfies $\mathcal{A}^{t}=\mathcal{A}$ for all $t\in[T]$) as a special case, namely when each modification rule $\mu\in\mathcal{M}$ always outputs the same action: that is, for all $t,a^{t}$, we have $\mu(t,a^{t})=j$ for some fixed $j\in\mathcal{A}$.

It is also not hard to establish the converse. Indeed, suppose we have an instance of no-wide-range-regret learning with $\mathcal{H}\subseteq\mathcal{M}\times\mathcal{F}$, where $\mathcal{M}$ is a family of modification rules and $\mathcal{F}$ is a family of subsequences. Fix any pair $(\mu,f)\in\mathcal{H}$. Then, let us define, for all $j\in\mathcal{A}$, the subsequence

$$\phi^{(\mu,f)}_{j}:[T]\times\mathcal{A}\to[0,1]\text{ such that }\phi^{(\mu,f)}_{j}(t,a):=f(t,a)\cdot 1_{\mu(t,a)=j}\text{ for all }t\in[T],\,a\in\mathcal{A}.$$

Now, let us instantiate our subsequence regret setting with

$$\mathcal{H}_{\mathrm{wide}}:=\bigcup_{(\mu,f)\in\mathcal{H}}\bigcup_{j\in\mathcal{A}}\left\{\left(j,\phi^{(\mu,f)}_{j}\right)\right\}.$$

Observe in particular that $|\mathcal{H}_{\mathrm{wide}}|=|\mathcal{A}|\cdot|\mathcal{H}|$.

Computing the subsequence regret of this family $\mathcal{H}_{\mathrm{wide}}$, we have

$$R^{T}_{\mathcal{H}_{\mathrm{wide}}}=\max_{(\mu,f)\in\mathcal{H}}\max_{j\in\mathcal{A}}\sum_{t\in[T]:\mu(t,a^{t})=j}f(t,a^{t})(r^{t}_{a^{t}}-r^{t}_{j}).$$

Now, we have the following upper bound on the wide-range regret:

\begin{align*}
R^{T}_{\mathrm{wide}}&=\max_{(\mu,f)\in\mathcal{H}}\sum_{t\in[T]}f(t,a^{t})\left(r^{t}_{a^{t}}-r^{t}_{\mu(t,a^{t})}\right)\\
&=\max_{(\mu,f)\in\mathcal{H}}\sum_{j\in\mathcal{A}}\sum_{t\in[T]:\mu(t,a^{t})=j}f(t,a^{t})\left(r^{t}_{a^{t}}-r^{t}_{j}\right)\\
&\leq\max_{(\mu,f)\in\mathcal{H}}|\mathcal{A}|\,\max_{j\in\mathcal{A}}\sum_{t\in[T]:\mu(t,a^{t})=j}f(t,a^{t})\left(r^{t}_{a^{t}}-r^{t}_{j}\right)\\
&=|\mathcal{A}|\,R^{T}_{\mathcal{H}_{\mathrm{wide}}}.
\end{align*}
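The chain of inequalities above can also be checked numerically. The sketch below evaluates $R^{T}_{\mathrm{wide}}$ and $R^{T}_{\mathcal{H}_{\mathrm{wide}}}$ directly from their definitions (building the subsequences $\phi^{(\mu,f)}_{j}$ implicitly) and confirms $R^{T}_{\mathrm{wide}}\leq|\mathcal{A}|\cdot R^{T}_{\mathcal{H}_{\mathrm{wide}}}$ on a random history; the modification rules, subsequences, and data are arbitrary illustrations, not taken from the paper.

```python
import numpy as np

def wide_range_regret(actions, rewards, pairs):
    """R^T_wide: worst pair (mu, f) in H, where mu maps (t, a) to an action
    and f maps (t, a) to a weight in [0, 1]."""
    return max(
        sum(f(t, a) * (rewards[t, a] - rewards[t, mu(t, a)])
            for t, a in enumerate(actions))
        for mu, f in pairs
    )

def subsequence_regret_wide(actions, rewards, pairs, num_actions):
    """Subsequence regret of H_wide: the pair (j, phi_j^{(mu, f)}) is active
    only on rounds where f fires and mu would have deviated to j."""
    return max(
        sum(f(t, a) * (rewards[t, a] - rewards[t, j])
            for t, a in enumerate(actions) if mu(t, a) == j)
        for mu, f in pairs
        for j in range(num_actions)
    )

# An arbitrary illustrative instance.
rng = np.random.default_rng(1)
T, A = 300, 5
rewards = rng.random((T, A))
actions = rng.integers(A, size=T)
pairs = [
    (lambda t, a: (a + 1) % A, lambda t, a: 1.0),           # cyclic deviation, always active
    (lambda t, a: 0,           lambda t, a: float(t % 2)),  # deviate to action 0 on odd rounds
]

lhs = wide_range_regret(actions, rewards, pairs)
rhs = A * subsequence_regret_wide(actions, rewards, pairs, A)
assert lhs <= rhs + 1e-9  # R_wide <= |A| * R_{H_wide}
```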

Since our subsequence regret results imply the existence of an algorithm such that $\mathop{\mathbb{E}}\left[R^{T}_{\mathcal{H}_{\mathrm{wide}}}\right]\leq 4\sqrt{T\ln|\mathcal{H}_{\mathrm{wide}}|}=4\sqrt{T(\ln|\mathcal{A}|+\ln|\mathcal{H}|)}$, we have the following expected wide-range regret bound:

$$\mathop{\mathbb{E}}\left[R^{T}_{\mathrm{wide}}\right]\leq 4|\mathcal{A}|\sqrt{T\left(\ln|\mathcal{A}|+\ln|\mathcal{H}|\right)}.$$