
Likelihood Ratio Confidence Sets for
Sequential Decision Making

Nicolas Emmenegger*
ETH Zürich
Mojmír Mutný*
ETH Zürich
Andreas Krause
ETH Zürich
*Equal contribution.
Abstract

Certifiable, adaptive uncertainty estimates for unknown quantities are an essential ingredient of sequential decision-making algorithms. Standard approaches rely on problem-dependent concentration results and are limited to a specific combination of parameterization, noise family, and estimator. In this paper, we revisit the likelihood-based inference principle and propose to use likelihood ratios to construct anytime-valid confidence sequences without requiring specialized treatment in each application scenario. Our method is especially suitable for problems with well-specified likelihoods, and the resulting sets always maintain the prescribed coverage in a model-agnostic manner. The size of the sets depends on a choice of estimator sequence in the likelihood ratio. We discuss how to provably choose the best sequence of estimators and shed light on connections to online convex optimization with algorithms such as Follow-the-Regularized-Leader. To counteract the initially large bias of the estimators, we propose a reweighting scheme that also opens up deployment in non-parametric settings such as RKHS function classes. We provide a non-asymptotic analysis of the size of the likelihood ratio confidence sets for generalized linear models, using insights from convex duality and online learning. We showcase the practical strength of our method on generalized linear bandit problems, survival analysis, and bandits with various additive noise distributions.

1 Introduction

One of the main issues addressed by machine learning and statistics is the estimation of an unknown model from noisy observations. For example, in supervised learning, this might concern learning the dependence between an input (covariate) $x$ and a random variable (observation) $y$. In many cases, we are not only interested in an estimate $\hat{\theta}$ of the true model parameter $\theta_\star$, but instead in a set of plausible values that $\theta_\star$ could take. Such confidence sets are of tremendous importance in sequential decision-making tasks, where uncertainty is used to drive exploration or risk-aversion needs to be implemented, and covariates are iteratively chosen based on previous observations. This setting includes problems such as bandit optimization, reinforcement learning, or active learning. In the former two, the confidence sets are often used to solve the exploration-exploitation dilemma and more generally influence the selection rule (Mukherjee et al., 2022), termination rule (Katz-Samuels and Jamieson, 2020), exploration (Auer, 2002) and/or risk-aversion (Makarova et al., 2021).

[Figure 1: three panels: (a) Gaussian $\mathcal{L}$; (b) Laplace $\mathcal{L}$; (c) Gaussian $\mathcal{L}$ in RKHS.]
Figure 1: (a) and (b) show examples of confidence sets defined via level sets of the log-likelihood function in 2D at two dataset sizes, for Gaussian (a) and Laplace (b) likelihoods respectively. The sets inherit the geometry of the likelihood, and are not always ellipsoidal. (c) shows confidence bands on an RKHS function in a bandit game searching for the optimum. We compare prior work on confidence sets (Abbasi-Yadkori et al., 2011), our LR sets, and a common heuristic (orange). Our sets are nearly as small as the commonly used heuristic, but have provable coverage and can vastly improve sequential decision-making tasks such as bandits by quickly eliminating hypotheses.

When we interact with the environment by gathering data sequentially based on previous confidence sets, we introduce correlations between past noisy observations and future covariates. Data collected in this manner is referred to as adaptively gathered (Wasserman et al., 2020). Constructing estimators, confidence sets, and hypothesis tests for such non-i.i.d. data comes with added difficulty. Accordingly, and also for its importance in light of the reproducibility crisis (Baker, 2016), the task has attracted significant attention in the statistics community in recent years (Ramdas et al., 2022).

Instead of deriving explicit concentration inequalities around an online estimator, we construct confidence sets implicitly defined by an inclusion criterion that is easy to evaluate in a computationally efficient manner and requires little statistical knowledge to implement. Roughly speaking, given a model $p_\theta(y \mid x)$ that describes the conditional dependence of the observation $y$ given the covariate $x$ under parameter $\theta$, we will build sets based on a weighted modification of the sequential likelihood ratio statistic (Robbins et al., 1972; Wasserman et al., 2020)

$R_t(\theta) := \frac{\mathcal{L}_t(\{\hat{\theta}_s\}_{s=1}^t)}{\mathcal{L}_t(\theta)} := \frac{\prod_{s=1}^t p^{w_s}_{\hat{\theta}_s}(y_s \mid x_s)}{\prod_{s=1}^t p^{w_s}_{\theta}(y_s \mid x_s)},$  (1)

where $\{\hat{\theta}_s\}_s$ is a running estimator sequence that we are free to choose, but which may only depend on previously collected data. Parameters $\theta$ for which this statistic is small, i.e., for which $R_t(\theta) \leq 1/\alpha$, will be included in the set (and considered plausible). Examples of sets in a parametric and non-parametric setting are shown in Figure 1. The weighting terms $w_s \in (0,1]$ are crucial for dealing with inherent irregularities of many conditional observation models but can be flexibly chosen. Classically, these are set to $w_s = 1$. The full exposition of our method with choice of estimators and weights is given in Section 2. Apart from being easy to use and implement, our approach also comes with performance guarantees. These sets maintain a provable $1-\alpha$ coverage – a fact we establish using Ville's inequality for supermartingales (Ville, 1939), which is known to be essentially tight for martingales (see Howard et al., 2018, for a discussion). Therefore, in stark contrast to alternative methods, our confidence sequence is fully data-dependent, making it empirically tighter than competing approaches. Despite the rich history of sequential testing and related confidence sets going back to Wald (1945) and Robbins et al. (1972), these sets have found little use in the interactive machine learning community, which is a gap we fill in the present paper.
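To make the construction concrete, here is a minimal sketch (in Python, with illustrative names; the Gaussian likelihood and all helper signatures are our own assumptions, not code from the paper) that evaluates the weighted log-likelihood ratio of Eq. (1) and the inclusion test $R_t(\theta) \leq 1/\alpha$ in log space:

```python
import numpy as np

def log_lr(theta, xs, ys, theta_hats, ws, sigma=1.0):
    """log R_t(theta) of Eq. (1) for a Gaussian model y ~ N(x^T theta, sigma^2).

    Normalization constants and the base measure h(y) are shared by the
    numerator and denominator, so they cancel and are omitted.
    """
    log_r = 0.0
    for x, y, th_hat, w in zip(xs, ys, theta_hats, ws):
        ll_hat = -0.5 * (y - x @ th_hat) ** 2 / sigma**2   # log p_{theta_hat_s}
        ll_theta = -0.5 * (y - x @ theta) ** 2 / sigma**2  # log p_theta
        log_r += w * (ll_hat - ll_theta)                   # powered likelihoods
    return log_r

def in_confidence_set(theta, xs, ys, theta_hats, ws, alpha=0.05, sigma=1.0):
    """Inclusion criterion R_t(theta) <= 1/alpha, tested in log space."""
    return log_lr(theta, xs, ys, theta_hats, ws, sigma) <= np.log(1.0 / alpha)
```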

Contributions

In this work, we revisit the idea of using likelihood ratios to generate anytime-valid confidence sets. The main insight is that whenever the likelihood of the noise process is known, the likelihood ratio confidence sets are fully specified. They inherit their geometry from the likelihood function, and their size depends on the quality of our estimator sequence. We critically evaluate the likelihood ratio confidence sets and, in particular, we shed light on the following aspects: Firstly, for generalized linear models, we theoretically analyze the geometry of the LR confidence sets under mild assumptions. We show their geometry is dictated by Bregman divergences of exponential families (Chowdhury et al., 2022). Secondly, we show that the size of the confidence set is dictated by an online prediction game. The size of these sets depends on a sequence of estimators $\{\hat{\theta}_s\}_{s=1}^t$ that one uses to estimate the unknown parameter $\theta_\star$. We discuss how to pick the estimator sequence in order to yield a provably small radius of the sets, by using the Follow-the-Regularized-Leader algorithm, which implements a regularized maximum-likelihood estimator. We prove that the radius of the confidence sets is nearly worst-case optimal, and accordingly, they yield nearly worst-case regret bounds when used in generalized linear bandit applications. However, due to their data-dependent nature, they can be much tighter than this theory suggests. Thirdly, we analyze the limitations of classical (unweighted) LR sets when the underlying conditional observation model is not identifiable. In this case, the resulting (inevitable) estimation bias unnecessarily increases the size of the confidence sets. To mitigate this, we propose an adaptive reweighting scheme that decreases the effect of uninformed early bias of the estimator sequence on the size of the sets downstream. The reweighting does not affect the coverage guarantees of our sets and utilizes an elegant connection to (robust) powered likelihoods (Wasserman et al., 2020). Finally, thanks to the adaptive reweighting scheme, our sets are very practical, as we showcase experimentally. We demonstrate that our method works well with exponential and non-exponential family likelihoods, and in parametric as well as kernelized settings. We attribute their practical benefits to the fact that they do not depend on (possibly loose) worst-case parameters.

2 The Likelihood Method

The sequential likelihood ratio process (LRP) in (1) is a statistic that compares the likelihood of a given model parameter with the performance of an adaptively chosen estimator sequence. As noted above, we generalize the traditional definition, which would have $w_s \equiv 1$, and define a corresponding confidence set as

$\mathcal{C}_t = \left\{\theta \;\middle|\; R_t(\theta) \leq 1/\alpha\right\}.$  (2)

The rationale is that the better a parameter $\theta$ is at explaining the data $\{(x_s, y_s)\}_{s=1}^t$ from the true model $\theta_\star$, the smaller this statistic will be, thereby increasing its chances of being included in $\mathcal{C}_t$. When we construct $R_t$, the sequence of $x_s$, $w_s$ and $\hat{\theta}_s$ cannot depend on the noisy observation $y_s$. Formally, consider the filtration $(\mathcal{F}_s)_{s=0}^\infty$ with sub-$\sigma$-algebras $\mathcal{F}_s = \sigma(x_1, y_1, \ldots, x_s, y_s, x_{s+1})$. We require that $\hat{\theta}_s$ and $w_s$ are $\mathcal{F}_{s-1}$-measurable. Under these very mild assumptions and with arbitrary weights $w_s \in (0,1]$, we can show coverage, i.e., our (weighted) confidence sets uniformly track the true parameter with probability $1-\alpha$.

Theorem 1.

The stochastic process $R_t(\theta_\star)$ in (1) is a non-negative supermartingale with respect to the filtration $(\mathcal{F}_t)$ and satisfies $R_0(\theta_\star) \equiv 1$. In addition, the sequence $\mathcal{C}_t$ from (2) satisfies $\mathbb{P}_{\theta_\star}\left(\exists t : \theta_\star \notin \mathcal{C}_t\right) \leq \alpha$.

The last statement follows by applying Ville's inequality for supermartingales to $R_t(\theta_\star)$. The proof closely follows Wasserman et al. (2020). While coverage is always guaranteed irrespective of the estimator sequence $\{\hat{\theta}_s\}$, we would like to make the sets as small as possible at fixed coverage, which we do by picking a well-predicting estimator sequence.

2.1 The Estimator Sequence Game

The specification of the LRP allows us to choose an arbitrary estimator sequence $\{\hat{\theta}_s\}_s$. To understand the importance of this sequence, let us introduce $\theta_\star$ into the definition of $R_t$ in (1) and divide by $\mathcal{L}_t(\{\hat{\theta}_s\}_{s=1}^t)$. This gives the equivalent formulation

$\mathcal{C}_t := \left\{\theta \;\middle|\; \frac{\mathcal{L}_t(\theta_\star)}{\mathcal{L}_t(\theta)} \leq \underbrace{\frac{1}{\alpha}\,\frac{\mathcal{L}_t(\theta_\star)}{\mathcal{L}_t(\{\hat{\theta}_s\}_{s=1}^t)}}_{\text{confidence parameter}}\right\}.$

We see that the predictor sequence does not influence the geometry of the confidence set, which is fully specified by the likelihood function. We also observe that the ratio on the right-hand side serves as a confidence parameter controlling the size (radius) of the confidence sets measured under the likelihood ratio distance to $\theta_\star$. If the confidence parameter goes to zero, only $\theta_\star$ is in the set. The better the estimator sequence is at predicting the data, the smaller the inclusion threshold, and hence the smaller the sets will ultimately be. Specifically, taking the $\log$, we would like to minimize

$\mathcal{R}_t := \log\frac{\mathcal{L}_t(\theta_\star)}{\mathcal{L}_t(\{\hat{\theta}_s\}_{s=1}^t)} = \sum_{s=1}^t -\log\left(p^{w_s}_{\hat{\theta}_s}(y_s \mid x_s)\right) - \sum_{s=1}^t -\log\left(p^{w_s}_{\theta_\star}(y_s \mid x_s)\right).$  (3)

The quantity $\mathcal{R}_t$ corresponds to a regret in an online prediction game, as will become apparent below.

Online Prediction Game

Online optimization is a mature field in interactive learning (Cesa-Bianchi and Lugosi, 2006; Orabona, 2019). The general goal is to minimize a sequence of loss functions as in Eq. (3) and compete against a baseline, which typically is the best-in-hindsight prediction, or – in our case – the performance of the fixed parameter $\theta_\star$. Specifically, at every timestep $s$, the agent iteratively chooses an action $\hat{\theta}_s$ based on $\mathcal{F}_{s-1}$, and a loss function $f_s(\theta)$ is revealed. In most of the online optimization literature, $f_s$ can be chosen adversarially. In our prediction game, we know the whole form of the loss function $f_s(\theta) = -\log(p^{w_s}_\theta(y_s \mid x_s))$, as can be seen in (3), and not just $f_s(\hat{\theta}_s)$. As opposed to traditional assumptions in online prediction, in our case the $f_s$ are non-adversarial, but have a stochastic component due to $y_s$. Also, contrary to most instances of online prediction, we do not compare against the best-in-hindsight predictor, but against $\theta_\star$ instead, as this is more meaningful in our setting.

Online Optimization Algorithms

Generally, we seek an algorithm that incurs low regret. Here, we focus on Follow-the-Regularized-Leader (FTRL), which corresponds exactly to using regularized maximum likelihood estimation, making it a natural and computationally practical choice. The update rule is defined in Alg. 1 (Line 3). While other algorithms could be considered, FTRL enjoys the optimal regret rate for generalized linear regression, as we show later, and is easily implemented. In order to run the algorithm, one requires a sequence of strongly convex regularizers. For now, let us think of it as $\psi_s(\theta) = \lambda\lVert\theta\rVert_2^2$, which we use in practice. However, one can derive a tighter analysis for a slightly modified, time-dependent regularization strategy for generalized linear models, as we show in Sec. 3.3.
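For concreteness, here is a minimal sketch (our own illustrative code, not from the paper) of the FTRL update of Alg. 1 (Line 3) in the Gaussian case, where the regularized maximum-likelihood estimator with $\psi_t(\theta) = \lambda\lVert\theta\rVert_2^2$ has a ridge-regression closed form:

```python
import numpy as np

def ftrl_gaussian(xs, ys, lam=1.0, sigma=1.0):
    """theta_hat_t = argmin_theta sum_s -log p_theta(y_s|x_s) + lam * ||theta||_2^2
    for p_theta(y|x) = N(x^T theta, sigma^2); this reduces to ridge regression."""
    X = np.atleast_2d(np.asarray(xs))  # (t-1, d) covariates seen so far
    y = np.asarray(ys)                 # (t-1,) observations
    d = X.shape[1]
    # First-order optimality of (1/(2 sigma^2)) ||y - X theta||^2 + lam ||theta||^2:
    # (X^T X / sigma^2 + 2 lam I) theta = X^T y / sigma^2
    A = X.T @ X / sigma**2 + 2 * lam * np.eye(d)
    b = X.T @ y / sigma**2
    return np.linalg.solve(A, b)
```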

2.2 Adaptive Reweighting: Choosing the Right Loss

There is yet more freedom in the construction of the LR, via the selection of the loss function. Not only do we select the predictor sequence, but also the weights of the losses via $w_t$. This idea allows controlling the influence of a particular data point $(x_t, y_t)$ on the cumulative loss based on the value of $x_t$. For example, if we know a priori that for a given $x_t$ our prediction will most likely be bad, we can opt out of using the pair $(x_t, y_t)$ by setting $w_t = 0$. Below we will propose a weighting scheme that depends on a notion of bias, which captures how much of the error in predicting $y_t$ is due to our uncertainty about $\hat{\theta}_t$ (compared to the uncertainty we would still have knowing $\theta_\star$). Sometimes this bias is referred to as epistemic uncertainty in the literature, while the residual part of the error is referred to as aleatoric. Putting large weight on a data point heavily affected by this bias might unnecessarily increase the regret of our learner (and hence blow up the size of the confidence set). Note that, conveniently, even if we put low (or zero) weight on a data point, nothing stops us from using this sample point to improve the estimator sequence in the next prediction round. As we will show below, our reweighting scheme is crucial in defining a practical algorithm for Reproducing Kernel Hilbert Space (RKHS) models and in high signal-to-noise ratio scenarios. Since we do not know $\theta_\star$, our strategy is to compute an estimate of the bias of the estimator $\hat{\theta}_t$ and its effect on the value of the likelihood function for a specific $x$ that we played. We use the value of the bias to rescale the loss via $w_t$ such that its effect is of the same magnitude as the statistical error (see Algorithm 1; we call this step bias-weighting).

Intuition

To give a bit more intuition, suppose we have a Gaussian likelihood. Then the negative log-likelihood of $(x_t, y_t)$ with weighting is proportional to $\frac{w_t}{\sigma^2}(y_t - x_t^\top\hat{\theta}_t)^2$. Now, if $x_t$ does not lie in the span of the data points $\{x_s\}_{s=1}^{t-1}$ used to compute $\hat{\theta}_t$, it is in general unavoidable to incur a large error, inversely proportional to $\sigma^2$. To see this, let us decompose the projection onto $x_t$ as

$x_t^\top(\hat{\theta}_t - \theta_\star) = \underbrace{x_t^\top(\hat{\theta}_t - \mathbb{E}[\hat{\theta}_t])}_{\text{statistical error}} + \underbrace{x_t^\top(\mathbb{E}[\hat{\theta}_t] - \theta_\star)}_{\operatorname{bias}_{x_t}(\hat{\theta}_t)},$  (4)

where the first term represents the statistical error up to time $t$, while the second, the bias, is deterministic and independent of the actual realization $y$, depending only on $\theta_\star$. Estimators with non-zero bias are called biased. Plugging this into the likelihood function, we see that in expectation $\frac{1}{\sigma^2}\mathbb{E}[(y_t - x_t^\top\hat{\theta}_t)^2 \mid \mathcal{F}_{t-1}] \lesssim \frac{1}{\sigma^2}\operatorname{bias}^2_{x_t}(\hat{\theta}_t) + \epsilon^2 + \frac{C}{t}$, where $\epsilon^2$ is the unavoidable predictive error in expectation (due to a noisy objective) and is independent of $\sigma^2$, and $\frac{C}{t}$ is the statistical error, with the constant $C$ also independent of $\sigma^2$. Note that the bias term scales inversely with the variance, and leads to unnecessarily big confidence parameters for small $\sigma^2$.

In fact, the problem is that we use the likelihood to measure the distance between two parameters, but this is only a "good" distance once the deterministic source of the error (the bias) vanishes. For this reason, without weighting, the incurred regret blows up severely in low-noise settings. To counter this, we balance the deterministic estimation bias and the noise variance via a proper selection of $w_t$. In this case, it turns out that $w_t = \frac{\sigma^2}{\sigma^2 + \operatorname{bias}^2_{x_t}(\hat{\theta}_t)}$ ensures that the overall scaling is independent of $\sigma^2$. While the choice of weights $\{w_s\}_{s=1}^t$ influences the geometry of the confidence sets, with a good data collection and estimation strategy the bias asymptotically decreases to zero, and hence the weights converge to $1$.
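The effect of this choice is easy to see numerically; the following toy sketch (our own illustration) shows how the Gaussian weight suppresses rounds whose squared bias dominates the noise variance, and approaches $1$ once the bias is negligible:

```python
def gaussian_weight(sigma_sq, bias_sq):
    """w_t = sigma^2 / (sigma^2 + bias_{x_t}^2(theta_hat_t)), Section 2.2."""
    return sigma_sq / (sigma_sq + bias_sq)

print(gaussian_weight(0.01, 1.0))   # ~0.0099: biased round, heavily down-weighted
print(gaussian_weight(0.01, 1e-4))  # ~0.99:  bias negligible, weight close to 1
```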

Bias estimation

In order to generalize this rule beyond Gaussian likelihoods, we need a proper generalization of the bias. Our generalization is motivated by our analysis of generalized linear models, but the method can be applied more broadly. The role of the squared statistical error (variance) is played by the inverse of the smoothness constant of the negative log-likelihood functions $f_s$, denoted by $L$. This is the usual smoothness constant commonly seen in the convex optimization literature. We consider penalized likelihood estimators with strongly convex regularizers (Alg. 1, line 3). For this estimator class, we define the bias via a hypothetical stochastic-error-free estimate $\hat{\theta}_t^\times$, obtained as if we had access to the expected values of the gradients of the loss functions (a.k.a. the score). We use the first-order optimality conditions and the indicator function $i_\Theta$ of the set $\Theta$ to define the error-free estimate $\hat{\theta}_t^\times$, and the bias of the estimator $\hat{\theta}_t$, as

$\operatorname{bias}^2_{x_t}(\hat{\theta}_t) = (x_t^\top(\theta_\star - \hat{\theta}_t^\times))^2 \quad\text{with}\quad \mathbb{E}\left[\sum_{s=1}^{t-1}\nabla\log p_{\hat{\theta}_t^\times}(y_s \mid x_s)\right] - \nabla\psi_t(\hat{\theta}_t^\times) + i_\Theta(\hat{\theta}_t^\times) = 0,$  (5)

where the expectation denotes a sequence of expectations conditioned on the prior filtration. This notion of bias coincides with the definition of bias in Eq. (4) for the Gaussian likelihood. This quantity cannot be evaluated in general; however, we prove a computable upper bound.

Theorem 2 (Bias estimate).

Let the negative log-likelihood have the form $-\log p_\theta(y_s \mid x_s) = g(x_s^\top\theta)$, where $g : \mathbb{R} \rightarrow \mathbb{R}$ is $\mu$-strongly convex, and let the regularizer be $\psi_t(\theta) = \lambda\lVert\theta\rVert_2^2$, making the overall objective strongly convex. Then, defining $\mathbf{V}_t^{\mu;\lambda} = \sum_{s=1}^t \mu x_s x_s^\top + \lambda\mathbf{I}$, we can bound

$\operatorname{bias}^2_x(\hat{\theta}_t) \leq 2\lambda\lVert\theta_\star\rVert_2^2\, x^\top(\mathbf{V}_t^{\mu;\lambda})^{-1}x.$  (6)

The proof is deferred to App. A.3, and requires elementary convex analysis.

This leads us to propose the weighting scheme $w_t = \frac{1/L}{\operatorname{bias}^2_{x_t}(\hat{\theta}_t) + 1/L}$. We justify that this is a sensible choice by analyzing the confidence set on the GLM class in Section 3, which satisfies the smoothness and strong-convexity conditions. We show that this rule properly balances the stochastic and bias components of the error in the regret in (3). However, the rule is applicable more broadly, beyond the canonical representation of GLMs or the GLM family altogether.

1: Input: convex set $\Theta \subset \mathbb{R}^d$, confidence level $\alpha > 0$, likelihood $p_\theta(y \mid x)$, regularizers $\{\psi_t\}_t$
2: for $t \in \mathbb{N}_0$ do
3:     $\hat{\theta}_t = \arg\min_{\theta\in\Theta}\sum_{s=1}^{t-1} -\log p_\theta(y_s \mid x_s) + \psi_t(\theta)$ ▷ FTRL
4:     $w_t = \begin{cases}\frac{1/L}{1/L + \operatorname{bias}^2_{x_t}(\hat{\theta}_t)} & \textsc{this work}\\ 1 & \textsc{classical}\end{cases}$ ▷ bias-weighting, $\operatorname{bias}_{x_t}(\hat{\theta}_t)$ from Eq. (5) or Eq. (6)
5:     $\mathcal{C}_t = \left\{\theta\in\Theta \;\middle|\; \prod_{s=1}^t \frac{p^{w_s}_{\hat{\theta}_s}(y_s \mid x_s)}{p^{w_s}_\theta(y_s \mid x_s)} \leq \frac{1}{\alpha}\right\}$ ▷ confidence set
6: end for
Algorithm 1: Constructing the LR Confidence Sequence
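Combined with the earlier `ftrl_gaussian` and `log_lr` sketches, the following completes a minimal version of the loop: it evaluates the computable bias bound of Theorem 2 / Eq. (6) and the weight of Line 4 (our own illustrative code; `B_bound` is an assumed upper bound on $\lVert\theta_\star\rVert_2$, and for the Gaussian family $\mu = L = 1/\sigma^2$):

```python
import numpy as np

def bias_bound_sq(x, xs, lam, mu, B_bound):
    """Upper bound 2*lam*B^2 * x^T (V_t^{mu;lam})^{-1} x of Eq. (6)."""
    d = x.shape[0]
    V = lam * np.eye(d)
    for x_s in xs:
        V += mu * np.outer(x_s, x_s)           # V_t^{mu;lam} = sum mu x x^T + lam I
    return 2.0 * lam * B_bound**2 * (x @ np.linalg.solve(V, x))

def bias_weight(x, xs, lam, mu, L, B_bound):
    """Line 4 of Alg. 1: w_t = (1/L) / (1/L + bias^2_{x_t}(theta_hat_t))."""
    b2 = bias_bound_sq(x, xs, lam, mu, B_bound)
    return (1.0 / L) / (1.0 / L + b2)
```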

3 Theory: Linear Models

While the coverage (i.e., “correctness”) of the likelihood ratio confidence sets is always guaranteed, their worst-case size (affecting the “performance”) cannot be easily bounded in general. We analyze the size and the geometry of the LR confidence sequence in the special but versatile case of generalized linear models.

3.1 Generalized Linear Models

We assume knowledge of the conditional probability model $p_\theta(y \mid x)$, where the covariates $x \in \mathcal{X} \subset \mathbb{R}^d$, and the true underlying model parameter lies in a set $\Theta \subset \mathbb{R}^d$. If $t$ indexes (discrete) time, then $x_t$ is acquired sequentially, and the – subsequently observed – $y_t$ is sampled from an exponential family distribution parametrized as

$p_\theta(y \mid x_t) = h(y)\exp\left(T(y)\, x_t^\top\theta - A(x_t^\top\theta)\right).$  (7)

Here, $A$ is referred to as the log-partition function of the conditional distribution, and $T(y)$ is the sufficient statistic. The function $h$ is the base measure, and has little effect on our further developments, as it cancels out in the LR. Examples of commonly used exponential families (Gaussian, Binomial, Poisson or Weibull) with their link functions can be found in Table 1 in App. A.1.

In order to facilitate theoretical analysis for online algorithms, we make the following assumptions about the covariates $x \in \mathcal{X}$ and the set of plausible parameters $\Theta$.

Assumption 1.

The covariates are bounded, i.e., $\sup_{x\in\mathcal{X}}\lVert x\rVert_2 \leq 1$, and the set $\Theta$ is contained in an $\ell_2$-ball of radius $B$. We will also assume that the log-partition function is strongly convex, that is, $\mu := \inf_{z\in[-B,B]} A''(z) > 0$, and that $A$ is $L$-smooth, i.e., $L := \sup_{z\in[-B,B]} A''(z) < \infty$.

These assumptions are common in other works addressing the confidence sets of GLMs (Filippi et al., 2010; Faury et al., 2020), who remark that the dependence on $\mu$ is undesirable. However, in contrast to these works, our confidence sets do not use these assumptions in their construction. We only require them for our theoretical analysis. As these are worst-case parameters, the practical performance of our sets can be much better.

3.2 Geometry and Concentration

Before stating our results, we need to define a distance notion that the convex negative log-likelihoods induce. For a continuously differentiable convex function $f$, we denote the Bregman divergence as $D_f(a,b) := f(a) - f(b) - \nabla f(b)^\top(a-b)$. The $\nu$-regularized sum of log-partition functions is defined as

$Z_t^\nu(\theta) := \sum_{s=1}^t w_s A(x_s^\top\theta) + \frac{\nu}{2}\lVert\theta\rVert_2^2.$  (8)

This function will capture the geometry of the LR confidence sets. The confidence set size depends mainly on two terms. One refers to a notion of complexity of the space, referred to as the Bregman information gain, $\Gamma_t^\nu(\tilde{\theta}_t) = \log\left(\frac{\int_{\mathbb{R}^d}\exp(-\frac{\nu}{2}\lVert\theta\rVert_2^2)\,\mathrm{d}\theta}{\int_{\mathbb{R}^d}\exp(-D_{Z_t^\nu}(\theta,\tilde{\theta}_t))\,\mathrm{d}\theta}\right)$, first defined by Chowdhury et al. (2022) as a generalization of the information gain of Srinivas et al. (2009), $\gamma_t^\nu = \log\det\left(\sum_{i=1}^t \frac{\mu}{\nu}x_i x_i^\top + \mathbf{I}\right)$ for Gaussian likelihoods. We will drop the superscript whenever the regularization is clear from context and simply write $\gamma_t$. This term appears because one can relate the decay of the likelihood as a function of the Bregman divergence from $\theta_\star$ to the performance of a (regularized) maximum likelihood estimator via convex (Fenchel) duality. In particular, if $\tilde{\theta}_t$ is a regularized MLE, $\Gamma_t^\nu := \Gamma_t^\nu(\tilde{\theta}_t)$ will asymptotically scale as $\mathcal{O}(d\log t)$ (cf. Chowdhury et al., 2022, for further discussion). For Gaussian likelihoods and $w_s \equiv 1$, it coincides with the classical information gain, independent of $\tilde{\theta}_t$. The second term that affects the size is the regret $\mathcal{R}_t$ of the online prediction game over $t$ rounds, introduced previously in (3). These two parts together yield the following result:

Theorem 3.

Let $\nu > 0$ and $\alpha, \delta \in (0,1)$. For the level $1-\alpha$ confidence set $\mathcal{C}_t$ defined in (2) under the GLM in (7), with probability $1-\delta$, for all $t \geq 1$, any $\theta \in \mathcal{C}_t$ satisfies

$D_{Z_t^\nu}(\theta, \theta_\star) \leq \frac{4L}{\mu}\xi_t + 2\log\left(\frac{1}{\delta}\right) + 2\mathcal{R}_t,$  (9)

where $\xi_t = \log\left(\frac{1}{\alpha}\right) + \nu B^2 + \Gamma_t^\nu$, the constants $L, \mu$ are defined as above, and $\mathcal{R}_t$ is the regret of the game in Eq. (3).

The set defined via the above divergence does not coincide with the LR confidence set. It is slightly larger due to a term involving $\nu$ (as in Eq. (8)). This is a technical consequence of our proof technique, where the gradient of $Z_t^\nu$ needs to be invertible, and regularization is added to this end. We note that this $\nu > 0$ can be chosen freely. Note that the theorem involves two confidence levels, $\alpha$ and $\delta$: $\alpha$ is a bound on the Type I error – the coverage of the confidence sets – while $\delta$ upper bounds the probability of a large radius – and is therefore related to the power and Type II error of a corresponding hypothesis test. The proof of the theorem is deferred to App. B.2.

To give more intuition on these quantities, let us instantiate them for the Gaussian likelihood case with $w_s \equiv 1$. In this scenario, $Z_t^\nu(\theta) = \sum_{s=1}^t \frac{1}{2\sigma^2}\lVert\theta\rVert^2_{x_s x_s^\top} + \frac{\nu}{2}\lVert\theta\rVert_2^2$, and the (in this case symmetric) Bregman divergence equals $D_{Z_t^\nu}(\theta_\star, \theta) = \frac{1}{2}\lVert\theta - \theta_\star\rVert^2_{\mathbf{V}_t^{\sigma^{-2};\nu}}$, where $\mathbf{V}_t^{\mu;\nu} = \sum_{s=1}^t \mu x_s x_s^\top + \nu\mathbf{I}$, which means that our confidence sets are upper bounded by a ball in the same norm as those in the seminal work on linear bandits (Abbasi-Yadkori et al., 2011).
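This Gaussian reduction can be verified numerically; the following self-contained sketch (our own illustration) checks that the Bregman divergence of the quadratic $Z_t^\nu$ equals the weighted-norm expression above, and also evaluates the Gaussian information gain $\gamma_t^\nu$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, sigma, nu = 3, 5, 0.5, 0.1
xs = rng.normal(size=(t, d))
theta, theta_star = rng.normal(size=d), rng.normal(size=d)

def Z(th):  # Z_t^nu for the Gaussian likelihood: A(z) = z^2 / (2 sigma^2)
    return sum((x @ th) ** 2 / (2 * sigma**2) for x in xs) + nu / 2 * th @ th

def grad_Z(th):
    return sum((x @ th) / sigma**2 * x for x in xs) + nu * th

bregman = Z(theta) - Z(theta_star) - grad_Z(theta_star) @ (theta - theta_star)
V = sum(np.outer(x, x) / sigma**2 for x in xs) + nu * np.eye(d)  # V_t^{1/sigma^2; nu}
assert np.isclose(bregman, 0.5 * (theta - theta_star) @ V @ (theta - theta_star))

# Gaussian information gain (mu = 1/sigma^2): log det((mu/nu) sum x x^T + I)
gamma_t = np.linalg.slogdet(np.eye(d) + xs.T @ xs / (sigma**2 * nu))[1]
```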

3.3 Online Optimization in GLMs: Follow the Regularized Leader

The size of the confidence sets in Theorem 3 depends on the regret of the online prediction game involving the estimator sequence. We now bound this regret when using the Follow-the-Regularized-Leader (FTRL) algorithm in this setting. This high-probability bound is novel to the best of our knowledge and may be of independent interest. We state it in a weight-agnostic manner first, and then with our particular choice of weights. The latter variant uses a specifically chosen regularizer. In this case, we can track the contribution of each time step towards the regret separately.

Theorem 4.

Let $\psi_t(\theta) = \lambda\lVert\theta\rVert_2^2$. Suppose Assumption 1 holds, assume additionally that $A$ is $L$-smooth everywhere in $\mathbb{R}^d$, and let $w_t \in [0,1]$ be arbitrary. Then, with probability $1-\delta$, the regret of FTRL (Alg. 1) satisfies, for all $t \geq 1$,

$\mathcal{R}_t \leq \lambda B^2 + \frac{L}{\mu}\left(\gamma_t^\lambda + 2\log(1/\delta)\right) + \frac{2L^2B^2}{\mu}\gamma_t^\lambda.$  (10)

The regret bound is optimal in the order of $\gamma_t^\lambda$, matching the lower bounds of Ouhamma et al. (2021), as for linear models $\gamma_t = \mathcal{O}(d\log t)$. Combining the results of Thm. 4 with Thm. 3, we get a confidence parameter that scales as $\mathcal{O}(\sqrt{\gamma_t})$ for confidence sets of the form $\lVert\theta - \theta_\star\rVert_{\mathbf{V}_t}$, which coincides with the best-known confidence sets in this setting in the worst case (Abbasi-Yadkori et al., 2012). The requirement of global $L$-smoothness can be relaxed to $L$-smoothness over $\Theta$. With a more elaborate (but less insightful) analysis, we can show that we achieve an $\tilde{\mathcal{O}}(\gamma_t)$ bound even in this case. The proofs of these results are deferred to App. C.4, App. C.5 and App. C.6, respectively.

Regret, Weighting and Estimation Bias

Interestingly, the term in Thm. 4 involving the (crude) proxy for the bias – the bound $B$ – is not scaled by the same $L/\mu$ factor as the other terms in the regret bound (10) and in Theorem 3. Namely, the prefactor is $L^2/\mu$ instead of $L/\mu$. This extra dependence manifests itself in the unnecessary penalization through the estimation bias that we introduced in Sec. 2.2, particularly in low-noise settings. We addressed this issue by picking the weights $\{w_t\}$. While the above theorem holds for any valid weighting, it does not exhibit the possible improvement from using specific weights.

We argued earlier that the error in prediction should not be measured by the likelihood function if there is a deterministic error, since initially we are fully uncertain about the value of $\theta_\star^\top(\cdot)$ outside the span of previous observations. Of course, if our goal were purely to pick weights that minimize $\mathcal{R}_t$, then $w_s = 0$ would lead to zero regret and hence be optimal. However, the likelihood ratio would then be constant, and uninformative. In other words, the associated log-partition Bregman divergence in Theorem 3 would be trivial and would not filter out any hypotheses. Clearly, some balance has to be struck. With this motivation in mind, we proposed a nonzero weighting that decreases the regret contribution of the bias, namely $w_t = \frac{1/L}{1/L + \operatorname{bias}^2_{x_t}(\hat{\theta}_t)}$. The advantage of this choice becomes more apparent when we use the regularizer $\psi_t(\theta) = \lambda\lVert\theta\rVert^2 + A(x_t^\top\theta)$ to obtain the following result.

Theorem 5.

Let $\psi_s(\theta) = \lambda\lVert\theta\rVert^2 + A(x_s^\top\theta)$. Suppose Assumption 1 holds, assume additionally that $A$ is $L$-smooth everywhere in $\mathbb{R}^d$, and choose $w_s = \frac{1/L}{1/L + \operatorname{bias}^2_{x_s}(\hat{\theta}_s)}$. Additionally, let the sequence of $x_s$ be such that $\sum_s (1-w_s)(f_s(\theta_\star) - f_s(\bar{\theta}_{s+1})) \leq (L/\mu)\gamma_t^\lambda$, where $\bar{\theta}_s$ is the FTRL optimizer with the regularizer $\lambda\lVert\theta\rVert_2^2$ from Theorem 4. (Note that $\bar{\theta}_{s+1}$ corresponds to a regularized MLE that did observe the data pair $(x_s, y_s)$.) Then, with probability $1-\delta$, the regret of FTRL (Alg. 1) satisfies, for all $t \geq 1$,

$\mathcal{R}_t \leq \lambda B^2 + \frac{2L}{\mu}\left(\gamma_t^\lambda + \log\left(\frac{1}{\delta}\right)\right) + \frac{L}{\mu}\sum_{s=1}^t \frac{B^2}{1/L + \operatorname{bias}^2_{x_s}(\hat{\theta}_s)}\Delta\gamma_s^\lambda,$

where $\Delta\gamma_s^\lambda = \gamma_{s+1}^\lambda - \gamma_s^\lambda$.

One can see that for points where the information gain $\Delta\gamma_s$ is large (corresponding to more unexplored regions of the space, where the deterministic source of error is large), the weighting scheme ensures that the multiplicative contribution of $B^2$ is mitigated, along with having the correct prefactor $L/\mu$. The reader may wonder how this result is useful when we replace $\operatorname{bias}^2_{x_s}(\hat{\theta}_s)$ with the upper bound from Thm. 2. While instructive, our bound still only makes the bias proxy $B^2$ appear in front of the information gain $\Delta\gamma_t$, instead of the more desirable bias itself. In the latter case, we could directly plug in the upper bound and obtain an explicit result that only uses an upper bound on the bias. We leave this for future work.

We point out that this choice of $\psi_s(\theta)$ in Theorem 5 corresponds to the Vovk-Azoury-Warmuth predictor (Vovk, 2001; Azoury and Warmuth, 1999) in the online learning literature. This choice is helpful in order to track the bias contribution more precisely in our proof.

4 Application: Linear and Kernelized Bandits

Our main motivation for constructing confidence sets is bandit optimization. A prototypical bandit algorithm – the Upper Confidence Bound (UCB) algorithm (Auer, 2002) – sequentially chooses covariates $x_s$ in order to maximize the reward $\sum_{s=1}^t r_{\theta_\star}(x_s)$, where $r_{\theta_\star}$ is the unknown pay-off function parametrized by $\theta_\star$. UCB chooses the action $x_s$ which maximizes the optimistic estimate of the reward in each round, namely

$x_s = \operatorname*{arg\,max}_{x\in\mathcal{X}}\max_{\theta\in\mathcal{C}_{s-1}} r_\theta(x),$  (11)

where $\mathcal{C}_{s-1}$ is some confidence set for $\theta_\star$, and can be constructed with Algorithm 1 from the first $s-1$ data points. An important special case is when $r_{\theta_\star}$ is linear (Abe and Long, 1999) or modelled by a generalized linear model (Filippi et al., 2010). In that case, the inner optimization problem is convex as long as $\mathcal{C}_{s-1}$ is convex. The outer optimization is tractable for finite $\mathcal{X}$. In the applications we consider, our confidence sets are convex, and we easily solve the UCB oracle using convex optimization toolboxes.
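As an illustration, the sketch below (our own code; it reuses the hypothetical `log_lr` helper from Section 2 and assumes a Gaussian likelihood, so that the LR constraint is convex in $\theta$) implements the UCB rule (11) for a linear reward over a finite action set with an off-the-shelf solver:

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

def ucb_action(actions, xs, ys, theta_hats, ws, alpha=0.05, sigma=1.0, B=10.0):
    """x_s = argmax_{x in actions} max_{theta in C_{s-1}} x^T theta, Eq. (11)."""
    best_x, best_val = None, -np.inf
    d = actions[0].shape[0]
    lr_constraint = NonlinearConstraint(          # theta in C_{s-1}
        lambda th: log_lr(th, xs, ys, theta_hats, ws, sigma),
        -np.inf, np.log(1.0 / alpha))
    for x in actions:
        # Inner problem: maximize x^T theta over the (convex) LR set.
        res = minimize(lambda th, x=x: -(x @ th), np.zeros(d),
                       constraints=[lr_constraint], bounds=[(-B, B)] * d)
        if -res.fun > best_val:
            best_x, best_val = x, -res.fun
    return best_x
```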

Extension to RKHS

We introduced the framework of LR confidence sets only for finite-dimensional Euclidean spaces. However, it can easily be extended to Reproducing Kernel Hilbert Spaces (RKHS) (Cucker and Smale, 2002). The definition of the LR process in (1) is still well-posed, but now the sets are subsets of the RKHS, containing functions $f \in \mathcal{H}_k$. An outstanding issue is how to use these sets in downstream applications and represent them tractably, as in Figure 1. Conveniently, even with infinite-dimensional RKHSs, the inner optimization in (11) admits a Lagrangian formulation, and the generalized representer theorem applies (Schölkopf et al., 2001; Mutný and Krause, 2021). In other words, we can still derive a pointwise upper confidence band as $\operatorname{ucb}(x) = \max_{f\in\mathcal{H}_k,\, \lVert f\rVert_k \leq B,\, f\in\mathcal{C}_s}\langle f, k(x,\cdot)\rangle$ in terms of $\{x_j\}_{j=1}^s \cup \{x\}$, leading to an $(s+1)$-dimensional, tractable optimization problem.
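A minimal sketch of this representer-based band (our own illustration under a Gaussian likelihood; `f_hat_preds`, the running estimators' predictions at the observed points, is an assumed precomputed input, and `k` is a user-supplied kernel) is:

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

def ucb_rkhs(x, xs, ys, f_hat_preds, ws, k, alpha=0.05, sigma=1.0, B=2.0):
    """ucb(x) = max <f, k(x,.)> s.t. ||f||_k <= B and f in C_s, with
    f = sum_j a_j k(z_j, .) over the representer points z_j in xs + [x]."""
    Z = np.vstack([xs, x])                                 # s+1 representer points
    K = np.array([[k(u, v) for v in Z] for u in Z])        # kernel matrix
    s = len(xs)

    def log_lr(a):  # Gaussian weighted log-LR; shared constants cancel
        r_hat = np.asarray(ys) - np.asarray(f_hat_preds)   # residuals of estimators
        r = np.asarray(ys) - K[:s, :] @ a                  # residuals of candidate f
        return np.sum(np.asarray(ws) * (r**2 - r_hat**2)) / (2 * sigma**2)

    cons = [NonlinearConstraint(log_lr, -np.inf, np.log(1.0 / alpha)),  # f in C_s
            NonlinearConstraint(lambda a: a @ K @ a, -np.inf, B**2)]    # ||f||_k <= B
    res = minimize(lambda a: -(K[s, :] @ a), np.zeros(s + 1), constraints=cons)
    return -res.fun
```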

We also point out that the weighting is even more important in the RKHS setting, as the bias never vanishes for many infinite-dimensional Hilbert spaces (Mutný and Krause, 2022). For this reason, our weighting is of paramount practical importance here, as we can see in Figure 2a), where the gray arrow represents the significant improvement from reweighting.

4.1 Instantiation of the Theory for Linear Bandits

Before turning to the experiments, we instantiate our theoretical results from Sec. 3 for the important and well-studied special case of linear payoffs. In that case, $r_\theta(x) = \langle x, \theta\rangle$ and the agent observes $y_s = \langle x_s, \theta_\star\rangle + \eta_s$ upon playing action $x_s$, where $\eta_s \sim \mathcal{N}(0, \sigma^2)$. We are interested in minimizing the so-called cumulative pseudo-regret, namely $\mathfrak{R}_t = \sum_{s=1}^t [\langle x_\star, \theta_\star\rangle - \langle x_s, \theta_\star\rangle]$, where $x_\star$ refers to the optimal action. Using the set from (2) along with Theorem 3 and the FTRL result of Theorem 4, we can get a regret bound for the choice $w_s \equiv 1$.

Theorem 6.

Let $w_s \equiv 1$. For any $\lambda \geq \frac{1}{\sigma^2}$, with probability at least $1-3\delta$, for all $t\in\mathbb{N}$ we have

$\mathfrak{R}_t \leq 6\sqrt{t\gamma_t^\lambda}\left(\sigma\sqrt{\log(1/\delta) + \gamma_t^\lambda} + \sigma\lambda^{1/2}B + B\sqrt{\gamma_t^\lambda}\right).$

Our results are optimal in both $d$ and $t$ up to constant and logarithmic factors. The proof is deferred to App. D, but is an instantiation of the aforementioned theorems, along with a standard analysis. There, we also compare to the seminal result of Abbasi-Yadkori et al. (2011), which does not suffer from the dependence on $B\sqrt{\gamma_t}$. We attribute this term to the bias incurred in the absence of the reweighting scheme.

For the weighted likelihood ratio, we can obtain a result similar to the above, but multiplied by an upper bound on $\sup_{s\geq 1} w_s^{-1}$. This is undesirable, as our experiments will show that the reweighting scheme vastly improves performance. While this could be somewhat mitigated by using Theorem 5 instead of Theorem 4 to bound the FTRL regret, a better result should be achievable using our weighting scheme, one that improves upon Theorem 6 and possibly even matches Abbasi-Yadkori et al. (2011) exactly in the worst case. We leave this for future work.

4.2 Experimental Evaluation

In this subsection, we demonstrate that the practical applicability of our method goes well beyond the Gaussian theoretical result from the previous subsection. In the examples below, we always use the UCB algorithm but employ different confidence sets. In particular, we compare our LR confidence sets for different likelihood families with alternatives from the literature, notably classical sub-family confidence sets (Abbasi-Yadkori et al., 2011; Mutný and Krause, 2021) and the robust confidence set of Neiswanger and Ramdas (2021). In practice, however, the radius of these confidence sets is often tuned heuristically. We include such sets as a baseline without provable coverage as well. The main take-home message from the experiments is that among all the estimators and confidence sets that enjoy provable coverage, our confidence sets perform the best, on par with successful heuristics. For all our numerical experiments in Figure 2, the true payoff function is assumed to be an infinite-dimensional RKHS element. For further details and experiments, please refer to App. E.

[Figure 2: four panels of cumulative-regret curves.]
Figure 2: Bandit experiments: on the $y$-axis we report cumulative regret, while the $x$-axis shows the number of iterations. In a) and b) we report the results for linear models with different parametric additive noise. In c) we report the results on a survival analysis with a log-Weibull distribution ($p=2$), and in d) we showcase Poisson bandits. See App. E for more details. Heuristic methods are dashed, while provable ones are solid. Our sets perform the best among all provable methods. Notice in a) that the difference between gray and black represents the improvement due to adaptive weighting over $w_s = 1$ for all $s\in[t]$. For each experiment we did 10 reruns; median values are plotted.
Additive Noise Models

Suppose that $r_{\theta_\star}$ is linear and we observe $y_s = x_s^\top\theta_\star + \eta_s$, where $\eta_s$ is additive noise and $\theta_\star$ is an element of a Hilbert space. We consider classical Gaussian noise as well as Laplace noise in Fig. 2[a), b)]. Notice that in both cases our confidence sets yield lower regret than any other provably valid method. In both cases they perform as well as heuristic confidence sets with confidence parameter $\beta_t \equiv 2\log(1/\delta)$. The sub-Gaussian confidence sets of Abbasi-Yadkori et al. (2011) (AY 2011) are invalid for the Laplace distribution, as it is not sub-Gaussian but only sub-exponential. For this reason, we also compare with sub-exponential confidence sets derived similarly to those of Faury et al. (2020). The confidence sets of Neiswanger and Ramdas (2021) (NR 2021) perform similarly for the Gaussian likelihood, but are only applicable to this setting, as their generalization to other likelihood families involves intractable posterior inference. We note also the difference between the unweighted LR and the weighted one. The examples in Fig. 2 use the true payoff function $r(x) = -(1.4 - 3x)\sin(18x)$, which we model as an element of an RKHS with a squared exponential kernel with lengthscale $\gamma = 6\times 10^{-2}$ on $[0, 1.2]$; this is baseline function no. 4 in the global optimization benchmark database infinity77 (Gavana, 2021). Additional experiments can be found in App. E.

Poisson Bandits

A prominent example of generalized linear bandits (GLB) are Poisson bandits, where the linear payoff is scaled by an exponential function. We instantiate our results on a common benchmark problem and report the results in Fig. 2d). We improve the regret of UCB for GLBs compared to two alternative confidence sets: one that uses a Laplace approximation with a heuristic confidence parameter, and one inspired by considerations in Mutný and Krause (2021) (MK 2021), also with a heuristic confidence parameter. Note that we cannot compare directly to their provable results in their original form, as they do not state them in the canonical form of the exponential family.

Survival Analysis

Survival analysis is a branch of statistics with a rich history that models the lifespan of a service or product (Breslow, 1975; Cox, 1997; Kleinbaum and Klein, 2010). The classical approach postulates a well-informed likelihood model. Here, we use a specific hazard model, where the survival time $T$ is distributed according to a Weibull distribution, parametrized by $\lambda$ and $p$. The rate $\lambda_\theta(x) = \exp(x^\top\theta)$ differs for each configuration $x$, and $p$ – which defines the shape of the survival distribution – is fixed and known. We assume that the unknown part is the parameter $\theta$, which is the quantity we build a confidence set around to use within the UCB algorithm. In particular, the probability density of the Weibull distribution is $P(T = t \mid x) = \lambda_\theta(x)\, p\, t^{p-1}\exp(-t^p\lambda_\theta(x))$. In fact, with $p = 2$, the confidence sets are convex and the UCB rule can be implemented efficiently.

Interestingly, this model admits an alternate linear regression formulation. Namely, upon the transformation $Y = \log T$, the transformed variables $Y \mid x$ follow a Gumbel-type distribution, with the following likelihood obtained by a change of variables: $P(Y = y \mid x) = \lambda_\theta(x)\, p\exp(y)^p\exp(-\exp(y)^p\lambda_\theta(x))$. Taking the expectation over $Y$ allows us to express this as a linear regression problem, since $\mathbb{E}[Y \mid x] = -(\theta^\top x + \gamma)/p$, where $\gamma$ is the Euler-Mascheroni constant. More importantly, $Y \mid x$ is sub-exponential. Hence, this allows us to use confidence sets for sub-exponential variables constructed with the pseudo-maximization technique inspired by Faury et al. (2020). More details on how these sets are derived can be found in App. E. However, these approaches necessarily employ crude worst-case bounds, and as can be seen in Figure 2c), the use of our LR-based confidence sequences substantially reduces the regret of the bandit learner.
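The per-round LR contribution for this model is simple to write down; the sketch below (our own illustration, with $p = 2$ as in the experiment) evaluates the Weibull negative log-likelihood that enters Eq. (1):

```python
import numpy as np

def weibull_nll(theta, x, t, p=2):
    """-log P(T=t|x) for P(T=t|x) = lambda_theta(x) p t^{p-1} exp(-t^p lambda_theta(x)),
    with rate lambda_theta(x) = exp(x^T theta)."""
    log_lam = x @ theta
    return -(log_lam + np.log(p) + (p - 1) * np.log(t) - t**p * np.exp(log_lam))

def lr_increment(theta, theta_hat, x, t, w=1.0, p=2):
    """One factor of Eq. (1) in log space: w * (log p_{theta_hat} - log p_theta)."""
    return w * (weibull_nll(theta, x, t, p) - weibull_nll(theta_hat, x, t, p))
```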

5 Related Work and Conclusion

Related Work

Adaptive confidence sequences stem from the seminal work of Robbins et al. (1972), who note that these sets have $\alpha$-bounded Type I error. The likelihood ratio framework has recently been popularized by Wasserman et al. (2020) for likelihood families without known test statistics under the name universal inference. This approach, although long established, is surprisingly uncommon in sequential decision-making tasks like bandits. This might be due to the absence of an analysis deriving the size of the confidence sets (Mutný and Krause, 2021), a necessary ingredient for obtaining regret bounds. We address this gap for generalized linear models. Another reason might be that practitioners are often interested in non-parametric sub-families – a scenario our method does not cover. That being said, many fields such as survival analysis (Cox, 1997) do have well-informed likelihoods. Most importantly, however, if used naively, this method tends to fail when the probabilistic model is not identifiable (i.e., $p_\theta(\cdot \mid x) = p_{\tilde{\theta}}(\cdot \mid x)$ for some $\theta \neq \tilde{\theta}$). We mitigate this problem by introducing the scaling parameters $w_t$ in Eq. (1).

Prevalent constructions of anytime-valid confidence intervals rely on carefully derived concentration results for a specific estimator, such as the least-squares estimator, and noise sub-families, such as sub-Gaussian, sub-Bernoulli and sub-Poisson (Abbasi-Yadkori et al., 2011; Faury et al., 2020; Mutný and Krause, 2021). Their constructions involve bounding the suprema of collections of self-normalized stochastic processes (Faury et al., 2020; Mutný and Krause, 2021; Chowdhury et al., 2022). To facilitate closed-form expressions, worst-case parameters are introduced that prohibitively affect the size of the sets – making them much larger than they need to be.

Chowdhury et al. (2022) use the exact form of the likelihood to build confidence sets for parameters of exponential families; their approach, however, is restricted to exponential family distributions. They use self-normalization and mixing techniques to explicitly determine the size of the confidence set and do not use an online learning subroutine as we do here. Neiswanger and Ramdas (2021) use likelihood ratios for bandit optimization with possibly misspecified Gaussian processes, but their method is not tractable beyond Gaussian likelihoods. The relation between online convex optimization and confidence sets has been noted in so-called online-to-confidence conversions (Abbasi-Yadkori et al., 2012; Jun et al., 2017; Zhao et al., 2022), where the existence of a low-regret learner implies a small confidence set. However, these sets still use potentially loose regret bounds to define confidence sets. Our definition is implicit. We do not necessarily need a regret bound to run our method, as the radius depends on the actual, instance-dependent performance of the learner.

Conclusion

In this work, we generalized and analyzed sequential likelihood ratio confidence sets for adaptive inference. We showed that with well-specified likelihoods, this procedure gives small, any-time valid confidence sets with model-agnostic and precise coverage. For generalized linear models, we quantitatively analyzed their size and shape. We invite practitioners to explore and use this very versatile and practical methodology for sequential decision-making tasks.

Acknowledgments and Disclosure of Funding

We thank Wouter Koolen and Aaditya Ramdas for helpful discussions as well as for organizing the SAVI workshop where these discussions took place. NE acknowledges support from the Swiss Study Foundation and the Zeno Karl Schindler Foundation. MM has received funding from the Swiss National Science Foundation through NFP75. This publication was created as part of NCCR Catalysis (grant number 180544), a National Centre of Competence in Research funded by the Swiss National Science Foundation.

References

  • Abbasi-Yadkori et al., (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvari, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320.
  • Abbasi-Yadkori et al., (2012) Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. (2012). Online-to-confidence-set conversions and application to sparse stochastic bandits. In Lawrence, N. D. and Girolami, M., editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 1–9, La Palma, Canary Islands. PMLR.
  • Abe and Long, (1999) Abe, N. and Long, P. M. (1999). Associative reinforcement learning using linear probabilistic concepts. In ICML, pages 3–11. Citeseer.
  • Auer, (2002) Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.
  • Azoury and Warmuth, (1999) Azoury, K. S. and Warmuth, M. K. (1999). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43:211–246.
  • Baker, (2016) Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533:452–454.
  • Breslow, (1975) Breslow, N. E. (1975). Analysis of survival data under the proportional hazards model. International Statistical Review / Revue Internationale de Statistique.
  • Carpentier et al., (2020) Carpentier, A., Vernade, C., and Abbasi-Yadkori, Y. (2020). The elliptical potential lemma revisited. ArXiv, abs/2010.10182.
  • Cesa-Bianchi and Lugosi, (2006) Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA.
  • Chowdhury et al., (2022) Chowdhury, S. R., Saux, P., Maillard, O.-A., and Gopalan, A. (2022). Bregman deviations of generic exponential families. arXiv preprint arXiv:2201.07306.
  • Cox, (1997) Cox, D. R. (1997). Some remarks on the analysis of survival data. In Proceedings of the First Seattle Symposium in Biostatistics, pages 1–9. Springer.
  • Cucker and Smale, (2002) Cucker, F. and Smale, S. (2002). On the mathematical foundations of learning. Bulletin of the American mathematical society, 39(1):1–49.
  • Faury et al., (2020) Faury, L., Abeille, M., Calauzènes, C., and Fercoq, O. (2020). Improved optimistic algorithms for logistic bandits. In ICML2020: Proceedings of the 37th International Conference on International Conference on Machine Learning.
  • Filippi et al., (2010) Filippi, S., Cappe, O., Garivier, A., and Szepesvári, C. (2010). Parametric bandits: The generalized linear case. In Advances in neural information processing systems, pages 586–594.
  • Gavana, (2021) Gavana, A. (2021). infinity global optimization benchmarks and ampgo. http://infinity77.net/global_optimization/index.html#.
  • Hazan, (2016) Hazan, E. (2016). Introduction to online convex optimization. Found. Trends Optim., 2:157–325.
  • Hazan et al., (2006) Hazan, E., Agarwal, A., and Kale, S. (2006). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169–192.
  • Howard et al., (2018) Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon, J. (2018). Time-uniform chernoff bounds via nonnegative supermartingales. Probability Surveys, 17:257–317.
  • Jun et al., (2017) Jun, K.-S., Bhargava, A., Nowak, R., and Willett, R. (2017). Scalable generalized linear bandits: Online computation and hashing. Advances in Neural Information Processing Systems, 30.
  • Katz-Samuels and Jamieson, (2020) Katz-Samuels, J. and Jamieson, K. (2020). The true sample complexity of identifying good arms. In Chiappa, S. and Calandra, R., editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1781–1791. PMLR.
  • Kleinbaum and Klein, (2010) Kleinbaum, D. G. and Klein, M. (2010). Survival analysis, volume 3. Springer.
  • Lattimore and Szepesvári, (2020) Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.
  • Makarova et al., (2021) Makarova, A., Usmanova, I., Bogunovic, I., and Krause, A. (2021). Risk-averse heteroscedastic bayesian optimization. In Proc. Neural Information Processing Systems (NeurIPS).
  • McCullagh, (2018) McCullagh, P. (2018). Generalized linear models. Chapman and Hall.
  • Mukherjee et al., (2022) Mukherjee, S., Tripathy, A. S., and Nowak, R. (2022). Chernoff sampling for active testing and extension to active regression. In International Conference on Artificial Intelligence and Statistics, pages 7384–7432. PMLR.
  • Mutný and Krause, (2021) Mutný, M. and Krause, A. (2021). No-regret algorithms for capturing events in poisson point processes. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 7894–7904. PMLR.
  • Mutný and Krause, (2022) Mutný, M. and Krause, A. (2022). Experimental design of linear functionals in reproducing kernel hilbert spaces. In Proc. Neural Information Processing Systems (NeurIPS).
  • Neiswanger and Ramdas, (2021) Neiswanger, W. and Ramdas, A. (2021). Uncertainty quantification using martingales for misspecified Gaussian processes. In Feldman, V., Ligett, K., and Sabato, S., editors, Proceedings of the 32nd International Conference on Algorithmic Learning Theory, volume 132 of Proceedings of Machine Learning Research, pages 963–982. PMLR.
  • Orabona, (2019) Orabona, F. (2019). A modern introduction to online learning.
  • Ouhamma et al., (2021) Ouhamma, R., Maillard, O.-A., and Perchet, V. (2021). Stochastic online linear regression: the forward algorithm to replace ridge. In Neural Information Processing Systems.
  • Ramdas et al., (2022) Ramdas, A., Grünwald, P., Vovk, V., and Shafer, G. (2022). Game-theoretic statistics and safe anytime-valid inference.
  • Robbins et al., (1972) Robbins, H., Siegmund, D., et al. (1972). A class of stopping rules for testing parametric hypotheses. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 4: Biology and Health. The Regents of the University of California.
  • Schölkopf et al., (2001) Schölkopf, B., Herbrich, R., and Smola, A. (2001). A generalized representer theorem. In Computational learning theory, pages 416–426. Springer.
  • Srinivas et al., (2009) Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. W. (2009). Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning.
  • Ville, (1939) Ville, J.-L. (1939). Étude critique de la notion de collectif.
  • Vovk, (2001) Vovk, V. (2001). Competitive on-line statistics. International Statistical Review, 69.
  • Wainwright, (2019) Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press.
  • Wald, (1945) Wald, A. (1945). Sequential tests of statistical hypotheses. Annals of Mathematical Statistics, 16:256–298.
  • Wasserman et al., (2020) Wasserman, L. A., Ramdas, A., and Balakrishnan, S. (2020). Universal inference. Proceedings of the National Academy of Sciences, 117:16880–16890.
  • Zhao et al., (2022) Zhao, H., Zhou, D., He, J., and Gu, Q. (2022). Bandit learning with general function classes: Heteroscedastic noise and variance-dependent regret bounds. ArXiv, abs/2202.13603.

Appendix A Proofs of Theorems 1 and 2

A.1 GLM Families

Table 1: Examples of exponential family distributions.

Name | $A(z)$ | $A'(z)$ | $T(y)$ | $\mu$ | $L$
Gaussian | $z^2/(2\sigma^2)$ | $z/\sigma^2$ | $y/\sigma$ | $1/\sigma^2$ | $1/\sigma^2$
Poisson | $\exp(z)$ | $\exp(z)$ | $y$ | $\exp(-B)$ | $\exp(B)$
Binomial | $\log(1+\exp(z))$ | $\frac{1}{1+\exp(-z)}$ | $y$ | $\mathcal{O}(\exp(-B))$ | $1/4$
Weibull | $k\log(z)-\log k$ | $k/z$ | $y^k$ | $1/B^2$ | $\infty$
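As a quick numeric sanity check of the Binomial row (an illustrative sketch of ours, not part of the paper): on $[-B,B]$, the logistic log-partition $A(z)=\log(1+e^z)$ has curvature $A''(z)=\sigma(z)(1-\sigma(z))$, maximized at $z=0$ (giving $L=1/4$) and minimized at $|z|=B$ (giving $\mu=\mathcal{O}(e^{-B})$):

    import numpy as np

    B = 5.0
    z = np.linspace(-B, B, 100001)
    sig = 1.0 / (1.0 + np.exp(-z))
    curv = sig * (1.0 - sig)            # A''(z) for the Binomial family
    print(curv.max())                   # ~0.25, i.e. L = 1/4
    print(curv.min(), np.exp(-B))       # both ~6.7e-3, i.e. mu = O(exp(-B))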

A.2 Proof of Theorem 1 (Coverage)

Proof.

Starting with $\mathbb{E}[R_t(\theta_\star) \mid \mathcal{F}_{t-1}]$, we compute

\mathbb{E}[R_t(\theta_\star) \mid \mathcal{F}_{t-1}] = \mathbb{E}\left[R_{t-1}(\theta_\star)\frac{p^{w_t}_{\hat\theta_t}(y_t \mid x_t)}{p^{w_t}_{\theta_\star}(y_t \mid x_t)} \,\middle|\, \mathcal{F}_{t-1}\right]
= R_{t-1}(\theta_\star)\int \frac{p^{w_t}_{\hat\theta_t}(y \mid x_t)}{p^{w_t}_{\theta_\star}(y \mid x_t)}\, p_{\theta_\star}(y \mid x_t)\,\mathrm{d}y
= R_{t-1}(\theta_\star)\, e^{(w_t-1)D^r_{w_t}(p_{\theta_\star}(x_t),\, p_{\hat\theta_t}(x_t))} \le R_{t-1}(\theta_\star).

The second equality is due to the fact that $R_{t-1}(\theta_\star)$ depends only on $x_1, y_1, \ldots, x_{t-1}, y_{t-1}$. Since $\hat\theta_t$ is $\mathcal{F}_{t-1}$-measurable by assumption, $p_{\hat\theta_t}$ is a density, and if $w_t < 1$, the integral equals an exponential of the Rényi divergence $D^r_{w_t}(\cdot,\cdot)$. The non-positivity of the exponent follows from $w_t < 1$ and the non-negativity of the divergence. Note that in the "degenerate" case $w_t = 1$, the ratio cancels and the integral is over a density, hence also bounded by $1$. The last part of the statement follows easily by using Ville's inequality for supermartingales. ∎

All the elements of the above proof appear in Wasserman et al. (2020), albeit separately, and not with time-varying powered robust likelihoods.
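To make the supermartingale argument concrete, here is a Monte Carlo sketch (a toy setup of our own, not the paper's code): with a unit-variance Gaussian likelihood, weights $w_t = 1$, and a predictable (ridge) estimator sequence, the fraction of runs in which $R_t(\theta_\star)$ ever exceeds $1/\alpha$ should stay below $\alpha$, as guaranteed by Ville's inequality.

    import numpy as np

    rng = np.random.default_rng(0)
    theta_star, alpha, T, runs = 0.5, 0.1, 200, 2000
    exceeded = 0
    for _ in range(runs):
        log_R, sx2, sxy, crossed = 0.0, 1.0, 0.0, False   # ridge stats, lambda = 1
        for _ in range(T):
            x = rng.uniform(-1.0, 1.0)
            theta_hat = sxy / sx2                          # F_{t-1}-measurable estimate
            y = theta_star * x + rng.normal()
            # log-ratio of N(theta_hat*x, 1) to N(theta_star*x, 1) at y
            log_R += 0.5 * (y - theta_star * x) ** 2 - 0.5 * (y - theta_hat * x) ** 2
            crossed = crossed or log_R >= np.log(1.0 / alpha)
            sx2, sxy = sx2 + x * x, sxy + x * y
        exceeded += crossed
    print(exceeded / runs)   # empirically stays below alpha = 0.1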

A.3 Proof of Theorem 2 (Bias)

We will need a gradient characterization of strong-convexity, which we prove in the following lemma.

Lemma 1 (Convexity: Gradient).

Defining

F_t(\theta) = -\sum_{s=1}^t \mathbb{E}_{\theta_\star}\big[\nabla_\theta\log p_\theta(y_s \mid x_s) \mid \mathcal{F}_{s-1}\big],

under the assumption that $\log p_\theta(y_s \mid x_s) = -g(x_s^\top\theta)$ (up to additive terms independent of $\theta$) with $g$ being $\mu$-strongly convex, we have for any $\theta\in\Theta$:

(F_t(\theta) - F_t(\theta_\star))^\top(\theta-\theta_\star) \ge \|\theta-\theta_\star\|^2_{\mathbf{V}_t^{\mu;0}}.
Proof.

We assume that $g$ is $\mu$-strongly convex. Therefore, for any $s\le t$, we get the two inequalities

g(x_s^\top\theta) - g(x_s^\top\theta_\star) \ge g'(x_s^\top\theta_\star)(x_s^\top\theta - x_s^\top\theta_\star) + \frac{\mu}{2}\big(x_s^\top\theta - x_s^\top\theta_\star\big)^2,
g(x_s^\top\theta_\star) - g(x_s^\top\theta) \ge g'(x_s^\top\theta)(x_s^\top\theta_\star - x_s^\top\theta) + \frac{\mu}{2}\big(x_s^\top\theta - x_s^\top\theta_\star\big)^2.

Adding these two together, we obtain

0 \ge \big(g'(x_s^\top\theta_\star) - g'(x_s^\top\theta)\big)\big(x_s^\top(\theta-\theta_\star)\big) + \mu\|\theta-\theta_\star\|^2_{x_sx_s^\top}.

Observing that $\nabla_\theta\log p_\theta(y_s \mid x_s) = -g'(x_s^\top\theta)x_s$, we can equivalently write

0 \ge \big(\nabla\log p_\theta(y_s \mid x_s) - \nabla\log p_{\theta_\star}(y_s \mid x_s)\big)^\top(\theta-\theta_\star) + \mu\|\theta-\theta_\star\|^2_{x_sx_s^\top}.

This holds for any realization of $y_s$, and hence taking expectations yields

\big(\mathbb{E}[\nabla\log p_{\theta_\star}(y_s \mid x_s)\mid\mathcal{F}_{s-1}] - \mathbb{E}[\nabla\log p_\theta(y_s \mid x_s)\mid\mathcal{F}_{s-1}]\big)^\top(\theta-\theta_\star) \ge \mu\|\theta-\theta_\star\|^2_{x_sx_s^\top}.

Summing up over $s\le t$ and using the definition of $F_t$, we get

(F_t(\theta) - F_t(\theta_\star))^\top(\theta-\theta_\star) \ge \|\theta-\theta_\star\|^2_{\mathbf{V}_t^{\mu;0}}. ∎

Notice that the estimator in Alg. 1 has to fulfill the KKT conditions. We denote the condition for belonging to the constraint set as $h(\theta)\le B^2$, where $h$ is a squared, twice-differentiable norm (there are many choices beyond $\|\cdot\|_2^2$). The KKT conditions are

\sum_{s=1}^t -\nabla_\theta\log p_\theta(y_s\mid x_s) + \nabla\psi_t(\theta) + l\,\nabla h(\theta) = 0,   (12)
l\,(h(\theta)-B^2) = 0,
l \ge 0,

where the second and third conditions represent complementary slackness. Notice that this system of equations has a unique solution, since by strong convexity the objective attains a unique minimum on a compact convex subset of $\mathbb{R}^d$. Adding the same quantity to both sides of (12) yields

\sum_{s=1}^t -\mathbb{E}_{\theta_\star}[\nabla_\theta\log p_\theta(y_s\mid x_s)\mid\mathcal{F}_{s-1}] + \nabla\psi_t(\theta) + l\,\nabla h(\theta) = \underbrace{\sum_{s=1}^t\big[\nabla_\theta\log p_\theta(y_s\mid x_s) - \mathbb{E}_{\theta_\star}[\nabla_\theta\log p_\theta(y_s\mid x_s)\mid\mathcal{F}_{s-1}]\big]}_{:=E_t}.   (13)

This line motivates the definition of the error-free estimator $\theta^\times_t$ in (6), where $E_t$ is set to zero. We will also make use of a fundamental property of the score (the gradient of the log-likelihood), namely

\mathbb{E}_{y_t\sim p_{\theta_\star}(\cdot\mid x_t)}[\nabla\log p_{\theta_\star}(y_t\mid x_t)\mid\mathcal{F}_{t-1}] = 0.   (14)

A classical textbook reference for this is McCullagh (2018), but most classical statistics texts contain it. Using these observations, we can now prove Theorem 2.

Proof of Theorem 2.

Using the optimality conditions of $\theta_t^\times$ with $h(\theta) = \|\theta\|_2^2$ and $\psi_t(\theta) = \frac{\lambda}{2}\|\theta\|_2^2$ (so that $\nabla\psi_t(\theta) = \lambda\theta$), we obtain the following statements:

\sum_{s=1}^t -\mathbb{E}_{\theta_\star}[\nabla\log p_{\theta^\times_t}(y_s\mid x_s)\mid\mathcal{F}_{s-1}] + \lambda\theta^\times_t + 2l\theta^\times_t = 0
\implies \sum_{s=1}^t -\mathbb{E}_{\theta_\star}[\nabla\log p_{\theta^\times_t}(y_s\mid x_s)\mid\mathcal{F}_{s-1}] + \mathbb{E}_{\theta_\star}[\nabla\log p_{\theta_\star}(y_s\mid x_s)\mid\mathcal{F}_{s-1}] + \lambda\theta^\times_t + 2l\theta^\times_t = 0,

where in the last line we used property (14). Since $\theta_\star$ generates the data, the best explanation in the absence of the constraint and the regularization (i.e. $\lambda = 0$) would be $\theta^\times_t = \theta_\star$: the cross-entropy is minimized at this point, and the display above is exactly the optimality condition for minimizing the cross-entropy between the two distributions. With the regularization present, since the true $\theta_\star$ lies inside the constraint set $h(\theta)\le B^2$ and both the regularizer and the constraint induce star-shaped sets, their effect is to make $\theta^\times_t$ smaller in norm than $\theta_\star$. This holds generally for any $h$ that is a norm. As a consequence, $\|\theta^\times_t\|_2 < B$, and complementary slackness dictates that $l = 0$.

We can therefore proceed with this simplification. Let us use the shorthand $F_t(\theta) = -\sum_{s=1}^t\mathbb{E}_{\theta_\star}[\nabla\log p_\theta(y_s\mid x_s)\mid\mathcal{F}_{s-1}]$ and compute

F_t(\theta_t^\times) - F_t(\theta_\star) + \lambda(\theta^\times_t - \theta_\star) = -\lambda\theta_\star
\implies (\theta_t^\times - \theta_\star)^\top\big(F_t(\theta_t^\times) - F_t(\theta_\star) + \lambda(\theta^\times_t - \theta_\star)\big) = -\lambda(\theta^\times_t - \theta_\star)^\top\theta_\star
\stackrel{\text{Lemma 1}}{\implies} \|\theta^\times_t - \theta_\star\|^2_{\mathbf{V}_t^{\mu,\lambda}} \le -\lambda(\theta_t^\times - \theta_\star)^\top\theta_\star.   (15)

It suffices to apply the Cauchy-Schwarz Inequality and invoke (15):

\operatorname{bias}_{x_s}(\hat\theta_s)^2 = \big(x_s^\top(\theta_t^\times - \theta_\star)\big)^2
\le \|x_s\|^2_{(\mathbf{V}^{\mu,\lambda}_t)^{-1}}\,\|\theta^\times_t - \theta_\star\|^2_{\mathbf{V}_t^{\mu,\lambda}}
\le \|x_s\|^2_{(\mathbf{V}^{\mu,\lambda}_t)^{-1}}\,\lambda(\theta_t^\times - \theta_\star)^\top(-\theta_\star)
\le \lambda\|x_s\|^2_{(\mathbf{V}^{\mu,\lambda}_t)^{-1}}\,\|\theta_t^\times - \theta_\star\|_2\,\|\theta_\star\|_2
\le 2\lambda\|x_s\|^2_{(\mathbf{V}^{\mu,\lambda}_t)^{-1}}\,\|\theta_\star\|_2^2,

where in the last inequality we used $\|\theta^\times_t - \theta_\star\|_2 \le \|\theta^\times_t\|_2 + \|\theta_\star\|_2 \le 2\|\theta_\star\|_2$, which follows from $\|\theta^\times_t\|_2 \le \|\theta_\star\|_2$ due to the regularizer, as explained above. ∎

GLM models

Let us define the processes

S_t = \sum_{s=1}^t x_sT(y_s) \quad\text{and}\quad W_t = \sum_{s=1}^t x_sA'(x_s^\top\theta_\star).

In this scenario, an equivalent of (A.3) then involves the gradient of the regularized (unweighted) log-partition function $Z_t^\lambda$ we defined in (8), and is equal to

\sum_{s=1}^t A'(x_s^\top\theta)x_s + \nabla\psi_t(\theta) + l\,\nabla h(\theta) = \tilde E_t,   (16)

where $\tilde E_t = S_t$ for $\hat\theta_t$ and $\tilde E_t = W_t$ for $\theta_t^\times$.

Appendix B Proof of Theorem 3 (Bregman Ball Confidence Set)

Proof sketch

We give a quick sketch of the proof. To bound the size of the sets, we draw inspiration from the i.i.d. parameter estimation analysis of Wasserman et al. (2020) and separate the likelihood ratio into a part that relates the true parameter to the estimator sequence (i.e., regret), and a part that is independent of the estimator and is characterized by the supremum of a stochastic process. We want to show that any point far from the true parameter will eventually be excluded from the confidence set. Defining $\mathcal{L}^{(t)}(\{\hat\theta_s\}_{s=1}^t) = \prod_{i=1}^tp^{w_i}_{\hat\theta_i}(y_i\mid x_i)$, we wish to show that for any $\theta$ far from $\theta_\star$, we have

\log\frac{1}{R_t(\theta)} = \log\frac{\mathcal{L}^{(t)}(\theta)}{\mathcal{L}^{(t)}(\theta_\star)} + \log\frac{\mathcal{L}^{(t)}(\theta_\star)}{\mathcal{L}^{(t)}(\{\hat\theta_s\}_{s=1}^t)} \le \log(\alpha),

which is equivalent to saying that $\theta\notin\mathcal{C}_t$. The second term corresponds exactly to our notion of regret ($\mathcal{R}_t$, as discussed above). The first term is what we focus on: we will bound the supremum of $\log\frac{\mathcal{L}^{(t)}(\theta)}{\mathcal{L}^{(t)}(\theta_\star)}$ over all $\theta$ sufficiently far from $\theta_\star$, where "far away" is measured in the Bregman divergence outlined above. Note that this quantity can be expected to be negative in general (especially for "implausible" parameters), since with enough data, $\theta_\star$ should appear much more likely. Writing this ratio out, we will observe that it equals

-D_{Z^0_t}(\theta,\theta_\star) + \underbrace{\langle\theta-\theta_\star,\; \textstyle\sum_{s=1}^tw_sx_s\big(T(y_s) - \mathbb{E}_{\theta_\star}[T(y_s)]\big)\rangle}_{\approx\tilde S_t}.

At this point, it will be sufficient to bound the cross term (the second term) over the whole of $\Theta$. We view this supremum as part of the Legendre–Fenchel transform of the function $\mathcal{B}_t(\lambda) = D_{Z^\nu_t}(\theta_\star + \lambda, \theta_\star)$:

\sup_{\lambda\in\mathbb{R}^d}\big(\lambda^\top\tilde S_t - \mathcal{B}_t(\lambda)\big) = (\mathcal{B}_t)^\star(\tilde S_t),

and harness duality properties of the Bregman divergence, along with known concentration arguments (Chowdhury et al., 2022, Theorem A.1).
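In the Gaussian case, the identity $(\mathcal{B}_t)^\star(\tilde S_t) = D_{Z^\nu_t}(\theta_\star,\tilde\theta_t)$ derived below can be verified directly, since $Z^\nu_t$ is then quadratic with Hessian $\mathbf{H} = \sum_sw_sx_sx_s^\top + \nu\mathbf{I}$, so both sides reduce to the same quadratic form. A small numeric sketch (our own illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    d, t, nu = 3, 10, 0.5
    X, w = rng.normal(size=(t, d)), rng.uniform(0.5, 1.0, size=t)
    H = (X.T * w) @ X + nu * np.eye(d)                  # Hessian of the quadratic Z_t^nu
    theta_star, theta_tilde = rng.normal(size=d), rng.normal(size=d)
    S_tilde = H @ (theta_tilde - theta_star)            # grad Z(theta_tilde) - grad Z(theta_star)
    dual = 0.5 * S_tilde @ np.linalg.solve(H, S_tilde)  # (B_t)^*(S_tilde) in closed form
    breg = 0.5 * (theta_star - theta_tilde) @ H @ (theta_star - theta_tilde)
    print(np.isclose(dual, breg))                       # True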

B.1 Technical Lemmas

We need to introduce the concept of Legendre functions:

Definition 1.

Let $f:\mathbb{R}^d\to\mathbb{R}$ be a convex function and $C = \mathrm{int}(\mathrm{dom}(f))$. Then $f$ is called Legendre if it satisfies:

  1. $C$ is non-empty.

  2. $f$ is differentiable and strictly convex on $C$.

  3. $\lim_{n\to\infty}\|\nabla f(x_n)\| = \infty$ for any sequence $(x_n)_n$ with $x_n\in C$ for all $n$ and $\lim_{n\to\infty}x_n = x$ for some $x\in\partial C$.

This means that the gradient has to blow up near the boundary of the domain. Note as well that the boundary condition is vacuous if there is no boundary. Legendre functions have some nice properties, most importantly regarding the bijectivity of their gradients (see e.g. Lattimore and Szepesvári (2020)):

Lemma 2.

For a Legendre function $f:\mathbb{R}^d\to\mathbb{R}$:

  1. $\nabla f$ is a bijection between $\mathrm{int}(\mathrm{dom}(f))$ and $\mathrm{int}(\mathrm{dom}(f^*))$, with inverse $(\nabla f)^{-1} = \nabla f^*$.

  2. $D_f(x,y) = D_{f^*}(\nabla f(y),\nabla f(x))$ for all $x,y\in\mathrm{int}(\mathrm{dom}(f))$.

  3. The Fenchel conjugate $f^*$ is also Legendre.
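A one-dimensional numeric sketch of these properties (our illustration): take the Legendre function $f(x) = e^x$, whose conjugate is $f^*(y) = y\log y - y$, so that $\nabla f = \exp$ and $\nabla f^* = \log$.

    import numpy as np

    f, grad_f, grad_f_star = np.exp, np.exp, np.log
    f_star = lambda y: y * np.log(y) - y
    D_f = lambda a, b: f(a) - f(b) - grad_f(b) * (a - b)
    D_f_star = lambda a, b: f_star(a) - f_star(b) - grad_f_star(b) * (a - b)

    x, y = 0.3, 1.7
    print(np.isclose(grad_f_star(grad_f(x)), x))                  # property 1: gradients invert
    print(np.isclose(D_f(x, y), D_f_star(grad_f(y), grad_f(x))))  # property 2: divergences swap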

With this, we can prove a slightly extended result, which appears as Lemma 2.1 in Chowdhury et al. (2022).

Lemma 3.

For a Legendre function $f$ we have the identity

D_f(x,y) = (D_{f,x})^*(\nabla f(y) - \nabla f(x)),

where we define $D_{f,x}(\lambda) = D_f(x+\lambda, x)$.

Notational Shorthands

Remember the model (7), with log-partition function $A$. We define $A_s(\theta) = w_sA(x_s^\top\theta)$ and $T_s(y) := w_sx_sT(y)$ to denote the log-partition function and the response function of the same exponential family distribution, but parametrized by $\theta$ instead of $x_s^\top\theta$. That this is a valid parametrization can easily be seen from the likelihood definition. Indeed, denote by $p^{EF}_\beta$ the exponential family reward distribution with parameter $\beta$. Then our model (7) can be seen to satisfy

p_\theta(y\mid x_s) = p_{x_s^\top\theta}(y) = h(y)\exp\big(T(y)x_s^\top\theta - A(x_s^\top\theta)\big) = h(y)\exp\big(T_s(y)^\top\theta - A_s(\theta)\big).

Exponentiating the likelihood with a weighting wsw_{s} gives rise to another exponential family distribution. We can see that

p^{w_s}_\theta(y\mid x_s) = h^{w_s}(y)\exp\big(w_sT(y)x_s^\top\theta - w_sA(x_s^\top\theta)\big) = h^{w_s}(y)\exp\big(T_s(y)^\top\theta - A_s(\theta)\big).

Note that this does not necessarily integrate to one, but it is easy to see that there is a normalization function h~\tilde{h} that makes it integrate to one. Therefore, the following is a valid parametrization of an exponential family distribution:

\tilde h(y)\exp\big(T_s(y)^\top\theta - A_s(\theta)\big).

Additionally, let $A_0(\theta) = \frac{\nu}{2}\|\theta\|_2^2$ be defined on $\mathbb{R}^d$ (i.e., a Legendre function). We will also define the estimator

\tilde\theta_t = \big(\nabla Z^\nu_t\big)^{-1}\Big(\sum_{s=1}^t T_s(y_s)\Big).

This is a well-defined quantity because the gradient will be invertible, by Lemma 2 above.
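For intuition, a worked special case (ours, not stated in the paper): in the Gaussian case with $\sigma = 1$, i.e. $A(z) = z^2/2$ and $T(y) = y$, the gradient $\nabla Z^\nu_t(\theta) = \big(\sum_{s=1}^tw_sx_sx_s^\top + \nu\mathbf{I}\big)\theta$ is linear, and the estimator reduces to a weighted ridge regression,

\tilde\theta_t = \Big(\sum_{s=1}^tw_sx_sx_s^\top + \nu\mathbf{I}\Big)^{-1}\sum_{s=1}^tw_sx_sy_s.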

Conveniently, Chowdhury et al. (2022) prove the following result (their Theorem 7) using an elegant application of the method of mixtures.

Proposition 1 (Theorem 7 in Chowdhury et al. (2022)).

With probability $1-\delta$, for all $t\in\mathbb{N}$,

D_{Z^\nu_t}(\theta_\star,\tilde\theta_t) \le \log(1/\delta) + A_0(\theta_\star) + \Gamma_t^\nu,

where

\Gamma_t^\nu = \log\left(\frac{\int_{\mathbb{R}^d}\exp(-\frac{1}{2}\|\theta\|_2^2)\,\mathrm{d}\theta}{\int_{\mathbb{R}^d}\exp(-D_{Z^\nu_t}(\theta,\tilde\theta_t))\,\mathrm{d}\theta}\right).

Lastly, we will need the (one-argument) function

\mathcal{B}_t(\lambda) = D_{Z_t^\nu}(\theta_\star + \lambda, \theta_\star),

i.e., a shorthand for the Bregman divergence of $Z_t^\nu$ anchored at $\theta_\star$, consistent with its definition above. We use this one-argument function as we will be interested in its dual. We will also need a lemma on the sub-homogeneity properties of this object; the lemma is stated for the divergence anchored at an arbitrary $\theta\in\Theta$, and we apply it with $\theta = \theta_\star$.

Lemma 4.

Under Assumption 1, for $\theta\in\Theta$ and $\lambda$ such that $\theta+\lambda\in\Theta$, we have for any $\gamma\le\frac{\mu}{2L}$

D_{Z^\nu_t}(\theta+\gamma\lambda,\theta) \le \frac{1}{2}\gamma\,D_{Z^\nu_t}(\theta+\lambda,\theta), \qquad\text{in particular}\qquad \mathcal{B}_t(\gamma\lambda) \le \frac{1}{2}\gamma\,\mathcal{B}_t(\lambda),

i.e., the function $g(\gamma) = \mathcal{B}_t(\gamma\lambda)$ is sub-homogeneous with contraction parameter $\frac{1}{2}$ on $[0,\frac{\mu}{2L}]$.

See Appendix B.3 for a proof.
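Lemma 4 is easy to probe numerically. The following sketch (our own, with the hypothetical choices $B = 1$ and $w_s = 1$) checks the sub-homogeneity inequality for the logistic log-partition $A(z) = \log(1+e^z)$, for which $\mu = \sigma(B)(1-\sigma(B))$ and $L = 1/4$ on $[-B,B]$:

    import numpy as np

    rng = np.random.default_rng(2)
    A = lambda z: np.log1p(np.exp(z))
    dA = lambda z: 1.0 / (1.0 + np.exp(-z))
    D_A = lambda u, v: A(u) - A(v) - dA(v) * (u - v)   # one-dimensional Bregman divergence

    B = 1.0
    sig = dA(B)
    mu, L = sig * (1.0 - sig), 0.25
    gamma = mu / (2.0 * L)
    for _ in range(1000):
        a = rng.uniform(-B, B)                          # a = x_s' theta
        delta = rng.uniform(-B - a, B - a)              # so that a + delta stays in [-B, B]
        assert D_A(a + gamma * delta, a) <= 0.5 * gamma * D_A(a + delta, a) + 1e-9
    print("sub-homogeneity holds on all sampled points")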

B.2 Proof of Theorem 3

As mentioned in the main paper, our proof shows that all $\theta$ sufficiently far from $\theta_\star$ will eventually be excluded from $\mathcal{C}_t$. Equation (B) in the main text specifies the exclusion criterion, i.e. $\theta\notin\mathcal{C}_t$ if and only if

\log\frac{1}{R_t(\theta)} = \log\frac{\mathcal{L}^{(t)}(\theta)}{\mathcal{L}^{(t)}(\theta_\star)} + \log\frac{\mathcal{L}^{(t)}(\theta_\star)}{\mathcal{L}^{(t)}(\{\hat\theta_s\}_{s=1}^t)} \le \log(\alpha).   (17)

The second term is bounded by the regret of the online learner. Therefore, a sufficient condition for $\theta\notin\mathcal{C}_t$ is

\log\left(\frac{\mathcal{L}^{(t)}(\theta)}{\mathcal{L}^{(t)}(\theta_\star)}\right) \le \log(\alpha) - \mathcal{R}_t.

Henceforth, we will be interested in an explicit set $\tilde{\mathcal{C}}_t$ such that we can upper bound

\sup_{\theta\notin\tilde{\mathcal{C}}_t}\log\left(\frac{\mathcal{L}^{(t)}(\theta)}{\mathcal{L}^{(t)}(\theta_\star)}\right).   (18)

This will imply that $\tilde{\mathcal{C}}_t^c\subset\mathcal{C}_t^c$, or in other words, $\mathcal{C}_t\subset\tilde{\mathcal{C}}_t$. Without further ado, let us derive a more convenient form of the ratio in question:

\log\left(\frac{\mathcal{L}^{(t)}(\theta)}{\mathcal{L}^{(t)}(\theta_\star)}\right) = \log\left(\frac{\prod_{s=1}^th(y_s)\exp\big(w_sx_s^\top\theta\,T(y_s) - w_sA(x_s^\top\theta)\big)}{\prod_{s=1}^th(y_s)\exp\big(w_sx_s^\top\theta_\star T(y_s) - w_sA(x_s^\top\theta_\star)\big)}\right)
= \sum_{s=1}^t w_sx_s^\top\theta\,T(y_s) - w_sA(x_s^\top\theta) - w_sx_s^\top\theta_\star T(y_s) + w_sA(x_s^\top\theta_\star)
= \sum_{s=1}^t\langle\theta-\theta_\star,\,w_sT(y_s)x_s\rangle + w_sA(x_s^\top\theta_\star) - w_sA(x_s^\top\theta)
= \sum_{s=1}^t\langle\theta-\theta_\star,\,w_sT(y_s)x_s\rangle - \big(w_sA(x_s^\top\theta) - w_sA(x_s^\top\theta_\star) - x_s^\top(\theta-\theta_\star)\,w_sA'(x_s^\top\theta_\star)\big) - x_s^\top(\theta-\theta_\star)\,w_sA'(x_s^\top\theta_\star)
= \sum_{s=1}^t\langle\theta-\theta_\star,\,w_sT(y_s)x_s\rangle - w_sD_A(x_s^\top\theta, x_s^\top\theta_\star) - x_s^\top(\theta-\theta_\star)\,w_sA'(x_s^\top\theta_\star)
= -\sum_{s=1}^tw_sD_A(x_s^\top\theta, x_s^\top\theta_\star) + \sum_{s=1}^t\langle\theta-\theta_\star,\,w_sT(y_s)x_s - w_sx_sA'(x_s^\top\theta_\star)\rangle.

We can switch parametrizations as described above:

\log\left(\frac{\mathcal{L}^{(t)}(\theta)}{\mathcal{L}^{(t)}(\theta_\star)}\right) = -D_{Z^0_t}(\theta,\theta_\star) + \sum_{s=1}^t\langle\theta-\theta_\star,\,T_s(y_s) - \nabla A_s(\theta_\star)\rangle
= -D_{Z^0_t}(\theta,\theta_\star) + \sum_{s=1}^t\langle\theta-\theta_\star,\,T_s(y_s) - \mathbb{E}_{\theta_\star}[T_s(y_s)]\rangle
= -D_{Z^0_t}(\theta,\theta_\star) + \langle\theta-\theta_\star,\,S_t\rangle,   (19)

where we define $S_t := \sum_{s=1}^t\big(T_s(y_s) - \mathbb{E}_{\theta_\star}[T_s(y_s)]\big)$.

$Z_t^\nu$ is strictly convex whenever $\nu\ne0$, and convex otherwise (it may also be strictly convex for $\nu = 0$, corresponding to cases where the $x_s$ span the full $d$-dimensional Euclidean space and $w_s > 0$, which will be satisfied uniformly). Since $\mathrm{dom}(Z_t^\nu) = \mathbb{R}^d$, $Z_t^\nu$ is therefore Legendre, and its gradient is invertible. We will relate our problem to the estimator $\tilde\theta_t$ via the duality properties developed above. First, note that by the well-known fact $\mathbb{E}_{\theta_\star}[T_s(y_s)] = \nabla A_s(\theta_\star)$ and by the definition of $\tilde\theta_t$, we have

S_t = \nabla Z^\nu_t\Big(\big(\nabla Z^\nu_t\big)^{-1}\Big(\sum_{s=1}^tT_s(y_s)\Big)\Big) - \nabla Z^0_t(\theta_\star)
= \underbrace{\nabla Z^\nu_t(\tilde\theta_t) - \nabla Z^\nu_t(\theta_\star)}_{=:\tilde S_t} + \underbrace{\nabla A_0(\theta_\star)}_{=\nu\theta_\star}.   (20)

Now, we leverage the duality properties. We can write

\sup_{\lambda\in\mathbb{R}^d}\big(\lambda^\top\tilde S_t - \mathcal{B}_t(\lambda)\big) \stackrel{(i)}{=} (\mathcal{B}_t)^\star(\tilde S_t)
\stackrel{(20)}{=} (\mathcal{B}_t)^\star\big(\nabla Z^\nu_t(\tilde\theta_t) - \nabla Z^\nu_t(\theta_\star)\big)
\stackrel{\text{Lemma 3}}{=} D_{Z^\nu_t}(\theta_\star,\tilde\theta_t),   (21)

where $(i)$ is simply the definition of the Legendre–Fenchel transform. Why did we do all this work? Well, we are interested in the supremum in Equation (18). It is sufficient to bound the supremum over all $\theta\in\Theta$ of terms of the form (see Equation (19))

\langle\theta-\theta_\star,\,S_t\rangle.

While we could do a covering-type argument (carefully relaxing the i.i.d. data assumptions typical in empirical process theory), it is much easier to relate this supremum to the estimator via duality.

With probability at least $1-\delta$, Proposition 1 gives us a time-uniform bound,

D_{Z^\nu_t}(\theta_\star,\tilde\theta_t) \le \log(1/\delta) + A_0(\theta_\star) + \Gamma_t^\nu,

and therefore, by plugging into Equation (21) and making the reparametrization $\lambda = \gamma(\theta-\theta_\star)$ for some positive $\gamma$, it gives us

\forall t\ge0\ \forall\theta\in\mathbb{R}^d\ \forall\gamma\in\mathbb{R}_+:\quad \gamma\tilde S_t^\top(\theta-\theta_\star) - \mathcal{B}_t\big(\gamma(\theta-\theta_\star)\big) \le \log(1/\delta) + A_0(\theta_\star) + \Gamma_t^\nu.

Therefore, for all $t\ge0$ and all $\theta\in\mathbb{R}^d$, the following holds:

S_t^\top(\theta-\theta_\star) = \tilde S_t^\top(\theta-\theta_\star) + \nabla A_0(\theta_\star)^\top(\theta-\theta_\star)
\le \frac{1}{\gamma}\log(1/\delta) + \frac{1}{\gamma}A_0(\theta_\star) + \frac{1}{\gamma}\Gamma_t^\nu + \frac{1}{\gamma}\mathcal{B}_t\big(\gamma(\theta-\theta_\star)\big) + \nabla A_0(\theta_\star)^\top(\theta-\theta_\star).

Since $A_0(\theta) = \frac{\nu}{2}\|\theta\|_2^2$, restricting our uniform bound to $\theta\in\Theta$ gives us, for all $t\ge0$ and all $\theta\in\Theta$:

S_t^\top(\theta-\theta_\star) \le \frac{1}{\gamma}\log(1/\delta) + \frac{\nu}{2\gamma}B^2 + \frac{1}{\gamma}\Gamma_t^\nu + \frac{1}{\gamma}\mathcal{B}_t\big(\gamma(\theta-\theta_\star)\big) + \nu B^2.

Now, we note that under Assumption 1, Lemma 4 kicks in, and we have for any $t\ge0$, $\theta\in\Theta$ and $\gamma = \frac{\mu}{2L}$

S_t^\top(\theta-\theta_\star) \le \frac{1}{\gamma}\log(1/\delta) + \frac{\nu}{2\gamma}B^2 + \frac{1}{\gamma}\Gamma_t^\nu + \frac{1}{2}\mathcal{B}_t(\theta-\theta_\star) + \nu B^2.   (22)

Finally, we can use this in (19) to obtain

\log\left(\frac{\mathcal{L}^{(t)}(\theta)}{\mathcal{L}^{(t)}(\theta_\star)}\right) \le -\mathcal{B}_t(\theta-\theta_\star) + \langle\theta-\theta_\star,\,S_t\rangle
\stackrel{(22)}{\le} -\frac{1}{2}\mathcal{B}_t(\theta-\theta_\star) + \frac{1}{\gamma}\log(1/\delta) + \frac{\nu}{2\gamma}B^2 + \frac{1}{\gamma}\Gamma_t^\nu + \nu B^2
\le -\frac{1}{2}\mathcal{B}_t(\theta-\theta_\star) + \frac{2L}{\mu}\left(\log(1/\delta) + \frac{\nu B^2}{2} + \Gamma_t^\nu\right) + \nu B^2.   (23)

It remains to investigate the full likelihood ratio in (17):

\log\frac{1}{R_t(\theta)} - \log(\alpha)
= \log\frac{\mathcal{L}^{(t)}(\theta)}{\mathcal{L}^{(t)}(\theta_\star)} + \log\frac{\mathcal{L}^{(t)}(\theta_\star)}{\mathcal{L}^{(t)}(\{\hat\theta_s\}_{s=1}^t)} + \log(1/\alpha)
\stackrel{(23)\,\&\,(3)}{\le} -\frac{1}{2}\mathcal{B}_t(\theta-\theta_\star) + \frac{2L}{\mu}\left(\log(1/\delta) + \frac{\nu B^2}{2} + \Gamma_t^\nu\right) + \nu B^2 + \log(1/\alpha) + \mathcal{R}_t.   (24)

Note that, crucially, for $\theta\in\Theta$ we have

\theta\notin\mathcal{C}_t \iff \log\frac{1}{R_t(\theta)} - \log(\alpha) \le 0.

This is implied by

D_{Z^\nu_t}(\theta,\theta_\star) \ge \frac{4L}{\mu}\left(\log(1/\delta) + \frac{\nu B^2}{2} + \Gamma_t^\nu\right) + 2\nu B^2 + 2\log(1/\alpha) + 2\mathcal{R}_t,

or, since $L\ge\mu$, more compactly by

D_{Z^\nu_t}(\theta,\theta_\star) \ge \frac{4L}{\mu}\left(\log(1/\delta) + \nu B^2 + \Gamma_t^\nu\right) + 2\log(1/\alpha) + 2\mathcal{R}_t. ∎

B.3 Proof of Technical Lemmas

First, we will prove Lemma 3. The proof exactly follows Chowdhury et al. (2022); we include it here for convenience, as it is very short.

Proof.

By definition,

(D_{f,x})^*(\nabla f(y) - \nabla f(x)) = \sup_{a\in\mathbb{R}^d}\big(\langle a,\,\nabla f(y)-\nabla f(x)\rangle - D_{f,x}(a)\big)
= \sup_{a\in\mathbb{R}^d}\big(\langle a,\,\nabla f(y)-\nabla f(x)\rangle - D_f(x+a,x)\big)
= \sup_{a\in\mathbb{R}^d}\big(\langle a,\,\nabla f(y)-\nabla f(x)\rangle - f(x+a) + f(x) + \langle\nabla f(x),\,a\rangle\big)
= \sup_{a\in\mathbb{R}^d}\big(\langle a,\,\nabla f(y)\rangle - f(x+a) + f(x)\big).

Since $f$ is strictly convex and differentiable, first-order optimality conditions imply that the optimal $a$ satisfies $\nabla f(y) - \nabla f(x+a) = 0$ ($a$ is unconstrained). Since the gradient is invertible, we must have $a = y - x$. If we plug this into the above, we have

(D_{f,x})^*(\nabla f(y) - \nabla f(x)) = \langle y-x,\,\nabla f(y)\rangle - f(y) + f(x)
= f(x) - f(y) - \langle\nabla f(y),\,x-y\rangle
= D_f(x,y). ∎

Now we prove Lemma 4. To this end, we will do a reduction to the one-dimensional case, and prove the one-dimensional result below.

Lemma 5.

Under Assumption 1, for any $a\in[-B,B]$, any $\gamma\in(0,\frac{\mu}{2L}]$ and any $\Delta$ with $a+\Delta\in[-B,B]$,

A(a+\gamma\Delta) - A(a) - A'(a)\gamma\Delta \le \frac{1}{2}\gamma\big[A(a+\Delta) - A(a) - A'(a)\Delta\big].

We first prove that this implies the desired sub-homogeneity of the full Bregman divergence.

Proof.

(of Lemma 4). Let $\theta$ and $\lambda$ be such that $\theta,\theta+\lambda\in\Theta$. We will first show that for any $s\in\{0,\ldots,t\}$, $D_{A_s}(\theta+\cdot\,,\theta)$ is sub-homogeneous, and then the result follows by summing the Bregman divergences. Define $a_s = x_s^\top\theta$ and $\Delta_s = x_s^\top\lambda$. Then we have $|\Delta_s| = |x_s^\top\lambda| \le \|x_s\|\,\|\lambda\| \le \|x_s\|(\|\theta\| + \|\theta+\lambda\|) \le B+B = 2B$. Similarly, we have $|a_s|\le B$. Hence we satisfy the premise of Lemma 5 and we deduce that

D_{A_s}(\theta+\gamma\lambda,\theta) = w_sA(x_s^\top\theta + \gamma x_s^\top\lambda) - w_sA(x_s^\top\theta) - \langle w_sA'(x_s^\top\theta)x_s,\,\gamma\lambda\rangle
= w_s\big(A(a_s+\gamma\Delta_s) - A(a_s) - A'(a_s)\gamma\Delta_s\big)
\le \frac{w_s}{2}\gamma\big[A(a_s+\Delta_s) - A(a_s) - A'(a_s)\Delta_s\big]
= \frac{1}{2}\gamma\big[w_sA(x_s^\top\theta + x_s^\top\lambda) - w_sA(x_s^\top\theta) - w_sA'(x_s^\top\theta)x_s^\top\lambda\big]
= \frac{1}{2}\gamma\,D_{A_s}(\theta+\lambda,\theta).

We also note that for $\gamma\le\frac{\mu}{2L}\le\frac{1}{2}$,

D_{A_0}(\theta+\gamma\lambda,\theta) = \frac{\nu}{2}\|\gamma\lambda\|^2 = \frac{\gamma^2\nu}{2}\|\lambda\|^2 \le \frac{\gamma\nu}{4}\|\lambda\|^2 = \frac{1}{2}\gamma\,D_{A_0}(\theta+\lambda,\theta).   (25)

Therefore, by summing up the terms, we obtain

\mathcal{B}_t(\gamma\lambda) \le \frac{1}{2}\gamma\,\mathcal{B}_t(\lambda). ∎

Then it remains to prove that Assumption 1 implies Lemma 5.

Proof.

(Lemma 5) The $L$-Lipschitzness of $A'$ implies $L$-smoothness of $A$, and Assumption 1 further provides $\mu$-strong convexity of $A$ on the interval of interest. With this, we can write, for any $a$ and $\Delta$ with $a, a+\Delta\in[-B,B]$,

A(a+\Delta) \ge A(a) + A'(a)\Delta + \frac{\mu}{2}\Delta^2
\implies A(a+\Delta) - A(a) - A'(a)\Delta \ge \frac{\mu}{2}\Delta^2.

Similarly, by smoothness,

A(a+\gamma\Delta) - A(a) - A'(a)\gamma\Delta \le \frac{L}{2}\gamma^2\Delta^2.

Putting this together, we have

A(a+\gamma\Delta) - A(a) - A'(a)\gamma\Delta \le \frac{L}{2}\gamma^2\Delta^2 = \frac{L\gamma^2}{\mu}\cdot\frac{\mu}{2}\Delta^2 \le \frac{L\gamma}{\mu}\,\gamma\big[A(a+\Delta) - A(a) - A'(a)\Delta\big].

The question is therefore: when is $\frac{L\gamma}{\mu}\le\frac{1}{2}$? Clearly, choosing $\gamma_0 = \frac{\mu}{2L}$ makes $\frac{L\gamma}{\mu}\le\frac{1}{2}$ for all $\gamma\le\gamma_0$. ∎

Appendix C FTRL Results: Proofs

C.1 Technical Lemmas I: Exponential Families

Lemma 6 (MGF for exponential family).

\mathbb{E}[\exp(T(y)u)\mid x] = \exp\big(A(\theta_\star^\top x + u) - A(\theta_\star^\top x)\big).

Proof.

\mathbb{E}[\exp(T(y)u)\mid x] = \int_y\exp(T(y)u)\,h(y)\exp\big(T(y)\theta_\star^\top x - A(\theta_\star^\top x)\big)\,dy
= \int_y\exp\big(T(y)(\theta_\star^\top x + u)\big)\,h(y)\exp\big(-A(\theta_\star^\top x)\big)\exp\big(-A(\theta_\star^\top x + u)\big)\exp\big(A(\theta_\star^\top x + u)\big)\,dy
= \exp\big(A(\theta_\star^\top x + u) - A(\theta_\star^\top x)\big),

where the last step follows because the density of a new exponential family distribution with parameter $\theta_\star^\top x + u$ also integrates to $1$. ∎
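As a quick Monte Carlo check of this identity (our sketch with hypothetical values): for the Poisson family, $A(z) = \exp(z)$ and $T(y) = y$, so the lemma asserts $\mathbb{E}[e^{yu}] = \exp(e^{\theta_\star^\top x+u} - e^{\theta_\star^\top x})$.

    import numpy as np

    rng = np.random.default_rng(3)
    z, u = 0.4, 0.3                        # z = theta_star' x, shift u
    y = rng.poisson(np.exp(z), size=2_000_000)
    mc = np.exp(u * y).mean()              # Monte Carlo estimate of E[exp(T(y) u)]
    exact = np.exp(np.exp(z + u) - np.exp(z))
    print(mc, exact)                       # agree to roughly three decimals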

C.2 Technical Lemmas II: Elliptical Potential Lemma

We will repeatedly use instantiations of the following key lemma, known as the elliptical potential lemma. We will use the version from Hazan et al. (2006); other variants are stated in Abbasi-Yadkori et al. (2011) or Carpentier et al. (2020).

Lemma 7 (Lemma 11 in Hazan et al. (2006)).

Let $u_s\in\mathbb{R}^d$ be a sequence of vectors such that $\|u_s\|\le r$. Define $\bar{\mathbf{V}}_t = \sum_{s=1}^tu_su_s^\top + \lambda\mathbf{I}$. Then

\sum_{s=1}^t\|u_s\|^2_{\bar{\mathbf{V}}_s^{-1}} \le \log\left(\frac{\det\bar{\mathbf{V}}_t}{\det\lambda\mathbf{I}}\right) \le d\log\left(\frac{r^2t}{\lambda}+1\right).

We will also need a result where the time index of the matrix is shifted. For this, note that if $\lambda\ge r^2$, then $u_su_s^\top \preceq r^2\mathbf{I} \preceq \lambda\mathbf{I}$, and so $\bar{\mathbf{V}}_s = \bar{\mathbf{V}}_{s-1} + u_su_s^\top \preceq \bar{\mathbf{V}}_{s-1} + \lambda\mathbf{I} \preceq 2\bar{\mathbf{V}}_{s-1}$. Under this condition, it follows that

\sum_{s=1}^t\|u_s\|^2_{\bar{\mathbf{V}}_{s-1}^{-1}} \le 2\sum_{s=1}^t\|u_s\|^2_{\bar{\mathbf{V}}_s^{-1}}.
Corollary 1.

We have the following bounds:

\gamma_t^\lambda = \log\left(\frac{\det\big(\sum_{s=1}^t\mu x_sx_s^\top + \lambda\mathbf{I}\big)}{\det(\lambda\mathbf{I})}\right) \le d\log\left(\frac{\mu t}{\lambda}+1\right),

and

\sum_{s=1}^t\|x_s\|^2_{(\mathbf{V}^{\mu;\lambda}_{s-1})^{-1}} \le \frac{2}{\mu}\gamma_t^\lambda.
Proof.

The first bound is trivial by instantiating $u_s = \sqrt{\mu}x_s$. The second bound is by noting

\sum_{s=1}^t\|x_s\|^2_{(\mathbf{V}^{\mu;\lambda}_{s-1})^{-1}} = \frac{1}{\mu}\sum_{s=1}^t\|u_s\|^2_{(\mathbf{V}^{\mu;\lambda}_{s-1})^{-1}} \le \frac{2}{\mu}\sum_{s=1}^t\|u_s\|^2_{(\mathbf{V}^{\mu;\lambda}_s)^{-1}} \le \frac{2}{\mu}\gamma_t^\lambda. ∎
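A numeric sketch of Lemma 7 and Corollary 1 (our own illustration, with random unit-norm covariates):

    import numpy as np

    rng = np.random.default_rng(4)
    d, t, lam, mu = 5, 500, 1.0, 0.5
    V = lam * np.eye(d)
    lhs = 0.0
    for _ in range(t):
        x = rng.normal(size=d)
        x /= np.linalg.norm(x)              # ensures ||x_s|| <= 1
        u = np.sqrt(mu) * x
        V += np.outer(u, u)
        lhs += u @ np.linalg.solve(V, u)    # ||u_s||^2 in the current inverse norm
    gamma = np.log(np.linalg.det(V) / lam ** d)
    print(lhs, "<=", gamma, "<=", d * np.log(mu * t / lam + 1.0))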

C.3 Technical Lemmas III: Supermartingales

Lemma 8 (Martingale increment).

Define the parametrized random processes

\mathcal{M}_j(r) = \exp\big(\nabla f_j(\theta_\star)^\top r - A'(x_j^\top\theta_\star)x_j^\top r - A(x_j^\top\theta_\star - x_j^\top r) + A(x_j^\top\theta_\star)\big)

and

\mathcal{N}_j(r) = \exp\Big(\nabla f_j(\theta_\star)^\top r - \frac{L}{2}r^\top x_jx_j^\top r\Big).

Then, under Assumption 1, we have for any $r\in\mathbb{R}^d$ that $\mathbb{E}[\mathcal{M}_j(r)\mid\mathcal{F}_{j-1}] = 1$ and $\mathbb{E}[\mathcal{N}_j(r)\mid\mathcal{F}_{j-1}] \le 1$.

Proof.

First, using the form of the exponential family and recalling that $\nabla_\theta f_j(\theta) = -\nabla_\theta\log p_\theta(y_j\mid x_j) = \nabla_\theta[A(x_j^\top\theta) - T(y_j)x_j^\top\theta]$, we obtain

\mathbb{E}[\exp(\nabla f_j(\theta_\star)^\top r)\mid\mathcal{F}_{j-1}] = \int_y\exp(\nabla f_j(\theta_\star)^\top r)\times h(y)\exp\big(T(y)x_j^\top\theta_\star - A(x_j^\top\theta_\star)\big)\,dy
= \int_y\exp\big(-T(y)x_j^\top r + A'(x_j^\top\theta_\star)x_j^\top r\big)\times h(y)\exp\big(T(y)x_j^\top\theta_\star - A(x_j^\top\theta_\star)\big)\,dy
= \exp\big(A'(x_j^\top\theta_\star)x_j^\top r\big)\underbrace{\int_yh(y)\exp\big(T(y)(x_j^\top\theta_\star - x_j^\top r)\big)\exp\big(-A(x_j^\top\theta_\star - x_j^\top r)\big)\,dy}_{=1}\times\exp\big(A(x_j^\top\theta_\star - x_j^\top r)\big)\exp\big(-A(x_j^\top\theta_\star)\big)
= \exp\big(A'(x_j^\top\theta_\star)x_j^\top r\big)\exp\big(A(x_j^\top\theta_\star - x_j^\top r)\big)\exp\big(-A(x_j^\top\theta_\star)\big),

which finishes the proof of the first statement. The second statement follows by using $L$-smoothness on the last equation and therefore noting that $\mathcal{N}_j(r)\le\mathcal{M}_j(r)$. ∎

Lemma 9 (Sequential mixing).

Define the product process

M_t(r_1,\ldots,r_t) = \prod_{s=1}^t\mathcal{N}_s(r_s),

and recursively define the mixture supermartingale

\bar M_s = \bar M_{s-1}\times\int_r\mathcal{N}_s(r)p_s(r)\,\mathrm{d}r,

where $p_s$ is the probability density of $\mathcal{N}(0,\mathbf{H}_s^{-1})$ with $\mathbf{H}_s = \sum_{j=1}^{s-1}Lx_jx_j^\top + \lambda\frac{L}{\mu}\mathbf{I}$, and $\bar M_0 = 1$. Then the following statements hold:

  • $\{\bar M_s\}_s$ is an adapted supermartingale with respect to the usual filtration.

  • $\bar M_t = \exp\big(\frac{\mu}{L}\sum_{s=1}^t\nabla f_s(\theta_\star)^\top(\mathbf{V}^{\mu;\lambda}_s)^{-1}\nabla f_s(\theta_\star)\big)\sqrt{\frac{\det(\lambda\mathbf{I})}{\det(\mathbf{V}_t^{\mu;\lambda})}}$,

where

\mathbf{V}_s^{\mu;\lambda} = \sum_{j=1}^s\mu x_jx_j^\top + \lambda\mathbf{I}.
Proof.

The first point follows from the fact that $p_s(r)$ is deterministic conditioned on the sub-$\sigma$-algebra $\mathcal{F}_{s-1}$ (since $p_s$ makes use of $x_s$ but not $x_{s+1}$). Therefore, under mild regularity conditions,

\mathbb{E}[\bar M_s\mid\mathcal{F}_{s-1}] = \mathbb{E}\Big[\bar M_{s-1}\int p_s(r)\mathcal{N}_s(r)\,dr\,\Big|\,\mathcal{F}_{s-1}\Big] = \bar M_{s-1}\int_rp_s(r)\,\mathbb{E}[\mathcal{N}_s(r)\mid\mathcal{F}_{s-1}]\,dr \le \bar M_{s-1}.

In other words, mixing does not affect the supermartingale property. For the second point, we derive an explicit form of the mixture. Note that we can write out

\int_r\mathcal{N}_s(r)p_s(r)\,\mathrm{d}r = \frac{1}{\sqrt{(2\pi)^d\det(\mathbf{H}_s^{-1})}}\int_r\exp\Big(\nabla f_s(\theta_\star)^\top r - \frac{L}{2}\|r\|^2_{x_sx_s^\top} - \frac{1}{2}r^\top\mathbf{H}_sr\Big)\,\mathrm{d}r.   (26)

We can complete the square to obtain

\nabla f_s(\theta_\star)^\top r - \frac{L}{2}\|r\|^2_{x_sx_s^\top} - \frac{1}{2}r^\top\mathbf{H}_sr = \frac{1}{2}\|\nabla f_s(\theta_\star)\|^2_{(\mathbf{H}_s+Lx_sx_s^\top)^{-1}} - \frac{1}{2}\big\|r - (\mathbf{H}_s+Lx_sx_s^\top)^{-1}\nabla f_s(\theta_\star)\big\|^2_{\mathbf{H}_s+Lx_sx_s^\top}.

The second term is the exponent of a Gaussian integral with covariance $\mathbf{H}_{s+1}^{-1}$, and therefore results in

\int_r\exp\Big(-\frac{1}{2}\big\|r - (\mathbf{H}_s+Lx_sx_s^\top)^{-1}\nabla f_s(\theta_\star)\big\|^2_{\mathbf{H}_s+Lx_sx_s^\top}\Big)\,dr = \sqrt{(2\pi)^d\det(\mathbf{H}_{s+1}^{-1})}.

Plugging this into (26), we get

\int_r\mathcal{N}_s(r)p_s(r)\,\mathrm{d}r = \sqrt{\frac{\det\mathbf{H}_s}{\det\mathbf{H}_{s+1}}}\exp\Big(\frac{1}{2}\|\nabla f_s(\theta_\star)\|^2_{\mathbf{H}_{s+1}^{-1}}\Big).

By multiplying the individual steps, we can see that the determinant terms cancel in a telescoping product. This leads to the formulation

\bar M_t = \exp\Big(\sum_{s=1}^t\nabla f_s(\theta_\star)^\top\mathbf{H}_{s+1}^{-1}\nabla f_s(\theta_\star)\Big)\sqrt{\frac{\det\frac{\lambda L}{\mu}\mathbf{I}}{\det\mathbf{H}_{t+1}}}.

To conclude the proof, note that $\mathbf{H}_{s+1} = \frac{L}{\mu}\mathbf{V}^{\mu;\lambda}_s$. ∎

Lemma 10.

Under the assumptions of Lemma 9, with probability at least $1-\delta$, for all $t\ge0$,

\sum_{s=1}^t\|\nabla f_s(\theta_\star)\|^2_{(\mathbf{V}^{\mu;\lambda}_s)^{-1}} \le \frac{L}{\mu}\log\left(\frac{\det(\mathbf{V}^{\mu;\lambda}_t)}{\det(\lambda\mathbf{I})}\right) + \frac{L}{\mu}\log\left(\frac{1}{\delta}\right).   (27)

Proof.

The statement follows by applying Ville's inequality for supermartingales to $\bar M_t$, taking logarithms, and rearranging. Namely,

\mathbb{P}\big(\exists t\ge0:\ \bar M_t\ge1/\delta\big) = \mathbb{P}\big(\exists t\ge0:\ \log(\bar M_t)\ge\log(1/\delta)\big) \le \delta. ∎

The following result allows us to upper bound the weighted regret by the unweighted regret:

Lemma 11 (Weighting reduction).

Let $\{\theta_s\}_{s=1}^t$ be a sequence of vectors adapted to the filtration $\{\mathcal{F}_{s-1}\}_s$. Define

\Delta_t(\{\theta_s\}) = \sum_{s=1}^tw_s(f_s(\theta_s) - f_s(\theta_\star)) - f_s(\theta_s) + f_s(\theta_\star) = \sum_{s=1}^t(1-w_s)(f_s(\theta_\star) - f_s(\theta_s)).

Then $P_t = \exp(\Delta_t(\{\theta_s\}_s))$ is a non-negative supermartingale for any choice of adapted $\{w_s\}$, and hence, with probability $1-\delta$, for all $t\ge0$,

\sum_{s=1}^tw_s(f_s(\theta_s) - f_s(\theta_\star)) \le \sum_{s=1}^t(f_s(\theta_s) - f_s(\theta_\star)) + \log\left(\frac{1}{\delta}\right).

Proof.
\mathbb{E}[P_t\mid\mathcal{F}_{t-1}] = \mathbb{E}_{\theta_\star}\Big[\exp\Big(\sum_{s=1}^t-(1-w_s)f_s(\theta_s) + (1-w_s)f_s(\theta_\star)\Big)\,\Big|\,\mathcal{F}_{t-1}\Big]
= P_{t-1}\,\mathbb{E}_{y_t\sim\mathbb{P}_\star}\exp\big(-(1-w_t)f_t(\theta_t) + (1-w_t)f_t(\theta_\star)\big)
= P_{t-1}\int_{y_t}\exp\big(-(1-w_t)f_t(\theta_t) + (1-w_t)f_t(\theta_\star)\big)\exp\big(-f_t(\theta_\star)\big)\,dy_t
= P_{t-1}\int_{y_t}\exp\big(-(1-w_t)f_t(\theta_t) - w_tf_t(\theta_\star)\big)\,dy_t
= P_{t-1}\int_{y_t}p_{\theta_t}(y_t\mid x_t)^{1-w_t}p_{\theta_\star}(y_t\mid x_t)^{w_t}\,dy_t
= P_{t-1}\exp\big(-(1-w_t)D_{w_t}(\theta_\star,\theta_t)\big) \le P_{t-1}.

We have used here the definition of the Rényi divergence and the fact that it is always non-negative, namely

D_w(\theta_1,\theta_2) = \frac{1}{w-1}\log\int_yp_{\theta_1}(y\mid x)^wp_{\theta_2}(y\mid x)^{1-w}\,dy \ge 0

for $0 < w \ne 1$ (the case $w_t = 1$ is trivial for us). The rest follows by the application of Ville's inequality. ∎

C.4 FTRL Proof: the Unweighted Case

Proof of Theorem 4 (first part).

We define the function that FTRL minimizes in each step (to pick $\hat\theta_t$) as $g_t(\theta) = \sum_{s=1}^{t-1}-\log p_\theta(y_s\mid x_s) + \lambda\|\theta\|_2^2$. We can rewrite this objective as

\hat\theta_t = \arg\min_{\theta\in\Theta}g_t(\theta) = \arg\min_{\theta\in\Theta}\sum_{s=1}^{t-1}f_s(\theta) + \lambda\|\theta\|_2^2 = \arg\min_{\theta\in\Theta}\sum_{s=1}^{t-1}m_s(\theta) + \phi_t(\theta),

where we recall that

f_s(\theta) = A(x_s^\top\theta) - T(y_s)x_s^\top\theta - \log h(y_s)

(the $\log h(y_s)$ term does not play any role in the regret nor in the FTRL objective),

and we have introduced the shorthands

m_s(\theta) = -T(y_s)x_s^\top\theta

and

\phi_t(\theta) = \sum_{s=1}^{t-1}A(\theta^\top x_s) + \lambda\|\theta\|_2^2.
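For concreteness, here is a minimal sketch of the estimator this proof analyzes (our own implementation under the assumption of a logistic GLM with $T(y) = y$; the paper's Alg. 1 may differ in details): at each step we minimize the regularized negative log-likelihood over the ball $\|\theta\|_2\le B$ by projected gradient descent.

    import numpy as np

    def ftrl_step(X, y, lam=1.0, B=1.0, iters=500, lr=0.5):
        # Minimize sum_s [A(x_s' theta) - y_s x_s' theta] + lam * ||theta||^2
        # over ||theta||_2 <= B, with the logistic A(z) = log(1 + e^z).
        theta = np.zeros(X.shape[1])
        n = max(len(y), 1)
        for _ in range(iters):
            grad = X.T @ (1.0 / (1.0 + np.exp(-(X @ theta))) - y) + 2.0 * lam * theta
            theta = theta - (lr / n) * grad
            norm = np.linalg.norm(theta)
            if norm > B:
                theta *= B / norm           # project back onto the constraint ball
        return theta

    # theta_hat_t uses only the first t-1 observations (here: three of them)
    X_past = np.array([[0.1, 0.9], [0.8, -0.3], [0.4, 0.5]])
    y_past = np.array([1.0, 0.0, 1.0])
    print(ftrl_step(X_past, y_past))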

In essence, we have shifted some of the objective into what is commonly viewed as the regularizer. By a standard telescoping sum argument, we obtain for any $u$:

\sum_{s=1}^t(m_s(\hat\theta_s) - m_s(u))
= \phi_{t+1}(u) - \min_\theta\phi_1(\theta) + \sum_{s=1}^t[g_s(\hat\theta_s) - g_{s+1}(\hat\theta_{s+1}) + m_s(\hat\theta_s)] + \underbrace{g_{t+1}(\hat\theta_{t+1}) - g_{t+1}(u)}_{\le0}
\le \phi_{t+1}(u) + \sum_{s=1}^t[g_s(\hat\theta_s) - g_{s+1}(\hat\theta_{s+1}) + m_s(\hat\theta_s)]
= \phi_{t+1}(u) + \sum_{s=1}^t[g_s(\hat\theta_s) - g_{s+1}(\hat\theta_{s+1}) + g_{s+1}(\hat\theta_s) - \phi_{s+1}(\hat\theta_s) - g_s(\hat\theta_s) + \phi_s(\hat\theta_s)]
= \phi_{t+1}(u) + \sum_{s=1}^t[g_{s+1}(\hat\theta_s) - g_{s+1}(\hat\theta_{s+1}) - \phi_{s+1}(\hat\theta_s) + \phi_s(\hat\theta_s)].

Now we use the strong convexity of $g_{s+1}$ with respect to the norm $\|\cdot\|_{\mathbf{V}_s^{\mu;\lambda}}$, where $\mathbf{V}_s^{\mu;\lambda} = \sum_{j=1}^s\mu x_jx_j^\top + \lambda\mathbf{I}$:

\sum_{s=1}^t(m_s(\hat\theta_s) - m_s(u))
\le \phi_{t+1}(u) + \sum_{s=1}^t\Big[(\hat\theta_s - \hat\theta_{s+1})^\top\nabla g_{s+1}(\hat\theta_s) - \frac{1}{2}(\hat\theta_s - \hat\theta_{s+1})^\top\mathbf{V}_s^{\mu;\lambda}(\hat\theta_s - \hat\theta_{s+1}) - \phi_{s+1}(\hat\theta_s) + \phi_s(\hat\theta_s)\Big]
\le \phi_{t+1}(u) + \sum_{s=1}^t\Big[(\hat\theta_s - \hat\theta_{s+1})^\top\nabla f_s(\hat\theta_s) - \frac{1}{2}(\hat\theta_s - \hat\theta_{s+1})^\top\mathbf{V}_s^{\mu;\lambda}(\hat\theta_s - \hat\theta_{s+1}) - \phi_{s+1}(\hat\theta_s) + \phi_s(\hat\theta_s)\Big]
\le \phi_{t+1}(u) + \sum_{s=1}^t\Big[\frac{1}{2}\|\nabla f_s(\hat\theta_s)\|^2_{(\mathbf{V}_s^{\mu;\lambda})^{-1}} - \phi_{s+1}(\hat\theta_s) + \phi_s(\hat\theta_s)\Big],

where in the second inequality we used that $\nabla g_s(\hat\theta_s)^\top(\theta - \hat\theta_s) \ge 0$ for any feasible $\theta$, due to the first-order optimality conditions for convex constrained minimization. Lastly, we optimized the resulting quadratic function over $\hat\theta_{s+1}$ (over $\mathbb{R}^d$) to get a worst-case bound involving the dual norm.

Note that for the shorthands we defined above:

\sum_{s=1}^t[-\phi_{s+1}(\hat\theta_s) + \phi_s(\hat\theta_s)] = -\sum_{s=1}^tA(\hat\theta_s^\top x_s).

Using our previous observations and the definition of $\phi_{t+1}(\theta_\star)$, we get for the overall regret:

\mathcal{R}_t = \sum_{s=1}^tf_s(\hat\theta_s) - f_s(\theta_\star)
= \sum_{s=1}^tm_s(\hat\theta_s) - m_s(\theta_\star) + \sum_{s=1}^tA(x_s^\top\hat\theta_s) - A(x_s^\top\theta_\star)
\le \sum_{s=1}^tA(x_s^\top\theta_\star) - A(x_s^\top\hat\theta_s) + \sum_{s=1}^tA(x_s^\top\hat\theta_s) - A(x_s^\top\theta_\star) + \frac{1}{2}\sum_{s=1}^t\|\nabla f_s(\hat\theta_s)\|^2_{(\mathbf{V}_s^{\mu;\lambda})^{-1}} + \lambda\|\theta_\star\|^2
= \frac{1}{2}\sum_{s=1}^t\|\nabla f_s(\hat\theta_s)\|^2_{(\mathbf{V}_s^{\mu;\lambda})^{-1}} + \lambda\|\theta_\star\|^2
= \frac{1}{2}\sum_{s=1}^t\|T(y_s)x_s - A'(x_s^\top\hat\theta_s)x_s\|^2_{(\mathbf{V}_s^{\mu;\lambda})^{-1}} + \lambda\|\theta_\star\|^2
\le \sum_{s=1}^t\Big[\|T(y_s)x_s - A'(x_s^\top\theta_\star)x_s\|^2_{(\mathbf{V}_s^{\mu;\lambda})^{-1}} + \|(A'(x_s^\top\hat\theta_s) - A'(x_s^\top\theta_\star))x_s\|^2_{(\mathbf{V}_s^{\mu;\lambda})^{-1}}\Big] + \lambda\|\theta_\star\|^2
\le \sum_{s=1}^t\Big[\|\nabla f_s(\theta_\star)\|^2_{(\mathbf{V}_s^{\mu;\lambda})^{-1}} + 2L^2B^2\|x_s\|^2_{(\mathbf{V}_s^{\mu;\lambda})^{-1}}\Big] + \lambda B^2
\le \lambda B^2 + \frac{L}{\mu}\left(\gamma_t^\lambda + \log\frac{1}{\delta}\right) + \sum_{s=1}^t2L^2B^2\|x_s\|^2_{(\mathbf{V}_s^{\mu;\lambda})^{-1}}
\le \lambda B^2 + \frac{L}{\mu}\left(\gamma_t^\lambda + \log\frac{1}{\delta}\right) + \frac{2L^2B^2}{\mu}\gamma_t^\lambda.

The last line follows because of Lemma 7, and the second-to-last one follows because of Lemma 10. Notice that if we wish to deal with arbitrary weights \{w_{t}\}, we can simply resort to Lemma 11 and bound the weighted case with the unweighted case. In that case, we incur an additional additive \log(1/\delta) term. ∎

C.5 FTRL Analysis: the Weighted Case (Vovk-Azoury-Warmuth Forecaster)

Proof.

We define the function that FTRL minimizes in each step (to pick θ^t{\hat{{\theta}}}_{t}) as g~t(θ)=s=1t1[A(xsθ)T(ys)xsθ]+ψt(θ)\tilde{g}_{t}(\theta)=\sum_{s=1}^{t-1}[A(x_{s}^{\top}\theta)-T(y_{s})x_{s}^{\top}\theta]+\psi_{t}(\theta) with ψt(θ)=A(xtθ)+λ||θ||22\psi_{t}(\theta)=A(x_{t}^{\top}\theta)+\lambda\lvert\lvert\theta\rvert\rvert_{2}^{2}. We can rewrite this objective as

θ^t=argminθΘg~t(θ)=argminθΘs=1t1ms(θ)+ϕt(θ),{\hat{{\theta}}}_{t}=\arg\min_{\theta\in\Theta}\tilde{g}_{t}(\theta)=\arg\min_{\theta\in\Theta}\sum_{s=1}^{t-1}m_{s}(\theta)+\phi_{t}(\theta),

by introducing the shorthands

ms(θ)=T(ys)xsθm_{s}(\theta)=-T(y_{s})x_{s}^{\top}\theta

and (notice the difference in time index of the second sum when compared to the proof in the previous subsection):

ϕt(θ)=s=1tA(θxs)+λθ22.\phi_{t}(\theta)=\sum_{s=1}^{t}A(\theta^{\top}x_{s})+\lambda||\theta||_{2}^{2}.
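Before continuing, here is a minimal numerical sketch (ours, not part of the analysis) of the estimator just defined, instantiated for a Poisson likelihood where A(z)=e^z and T(y)=y. The function name, the synthetic data, the optimizer choice, and the omission of the constraint set Θ are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def ftrl_vaw_estimate(X, y, x_next, lam=1.0):
    """Sketch of the weighted FTRL/VAW step: minimize
    sum_{s<t} [A(x_s^T th) - T(y_s) x_s^T th] + A(x_next^T th) + lam*||th||^2,
    here for a Poisson likelihood, i.e. A(z) = exp(z), T(y) = y.
    X: (t-1, d) past covariates, y: (t-1,) counts, x_next: (d,) next covariate."""
    d = X.shape[1]

    def objective(theta):
        z = X @ theta
        # past negative log-likelihood terms m_s plus log-partition terms A
        val = np.sum(np.exp(z) - y * z) + lam * theta @ theta
        val += np.exp(x_next @ theta)  # psi_t: the Vovk-Azoury-Warmuth term
        return val

    # unconstrained surrogate for the argmin over Theta (simplification)
    return minimize(objective, np.zeros(d), method="L-BFGS-B").x

# illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) / np.sqrt(3)
theta_star = np.array([0.5, -0.3, 0.2])
y = rng.poisson(np.exp(X @ theta_star))
theta_hat = ftrl_vaw_estimate(X, y, x_next=rng.normal(size=3) / np.sqrt(3))
```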

In addition, consider the objective g_{t} from the classical FTRL analysis in Section C.4. It is not used to run the online algorithm, but it is helpful in our analysis. With our new components, it is equal to

gt(θ)=s=1t1ms(θ)+s=1t1A(θxs)+λθ22=s=1t1ms(θ)+ϕt1(θ),g_{t}(\theta)=\sum_{s=1}^{t-1}m_{s}(\theta)+\sum_{s=1}^{t-1}A(\theta^{\top}x_{s})+\lambda||\theta||_{2}^{2}=\sum_{s=1}^{t-1}m_{s}(\theta)+\phi_{t-1}(\theta),

and its minimizer is θ¯t=argminθΘgt(θ)\bar{\theta}_{t}=\arg\min_{\theta\in\Theta}g_{t}(\theta). Also, consider a weighted version of the regularizer

ϕ¯t(θ)=s=1twsA(xsθ)+λθ22,\bar{\phi}_{t}(\theta)=\sum_{s=1}^{t}w_{s}A(x_{s}^{\top}\theta)+\lambda||\theta||_{2}^{2},

which will be useful. We use a variant of the telescoping-sum argument from the proof in Section C.4. We specifically use {\theta_{\star}} as the comparator to compete against. Notice that we insert a telescoping sum involving the objective g_{s}, which is not the objective that our estimator is minimizing:

s=1tws(ms(θ^s)ms(θ))\displaystyle\sum_{s=1}^{t}w_{s}(m_{s}({\hat{{\theta}}}_{s})-m_{s}({\theta_{\star}}))
=()\displaystyle\stackrel{{\scriptstyle(*)}}{{=}} ϕ¯t(θ)ϕ0(θ¯1)+s=1t[gs(θ¯s)gs+1(θ¯s+1)+wsms(θ^s)]+gt+1(θ¯t+1)gt+1(θ)0\displaystyle\bar{\phi}_{t}({\theta_{\star}})-\phi_{0}(\bar{\theta}_{1})+\sum_{s=1}^{t}[{g}_{s}(\bar{\theta}_{s})-{g}_{s+1}(\bar{\theta}_{s+1})+w_{s}m_{s}({\hat{{\theta}}}_{s})]+\underbrace{{g}_{t+1}(\bar{\theta}_{t+1})-{g}_{t+1}({\theta_{\star}})}_{\leq 0}
+s=1t(1ws)fs(θ)\displaystyle+\sum_{s=1}^{t}(1-w_{s})f_{s}({\theta_{\star}})
\displaystyle\leq ϕ¯t(θ)+s=1t[ws(gs(θ¯s)gs+1(θ¯s+1))+wsms(θ^s)]\displaystyle\bar{\phi}_{t}({\theta_{\star}})+\sum_{s=1}^{t}[w_{s}({g}_{s}(\bar{\theta}_{s})-{g}_{s+1}(\bar{\theta}_{s+1}))+w_{s}m_{s}({\hat{{\theta}}}_{s})]
+s=1t(1ws)(gs(θ¯s)gs+1(θ¯s+1)+fs(θ))\displaystyle+\sum_{s=1}^{t}(1-w_{s})(g_{s}(\bar{\theta}_{s})-g_{s+1}(\bar{\theta}_{s+1})+f_{s}({\theta_{\star}}))
=()\displaystyle\stackrel{{\scriptstyle(**)}}{{=}} ϕ¯t(θ)+s=1tws[gs(θ¯s)gs+1(θ¯s+1)+gs+1(θ^s)ϕs(θ^s)gs(θ^s)+ϕs1(θ^s)]\displaystyle\bar{\phi}_{t}({\theta_{\star}})+\sum_{s=1}^{t}w_{s}[{g}_{s}(\bar{\theta}_{s})-{g}_{s+1}(\bar{\theta}_{s+1})+{g}_{s+1}({\hat{{\theta}}}_{s})-{\phi}_{s}({\hat{{\theta}}}_{s})-{g}_{s}({\hat{{\theta}}}_{s})+{\phi}_{s-1}({\hat{{\theta}}}_{s})]
\displaystyle+\sum_{s=1}^{t}(1-w_{s})[g_{s}(\bar{\theta}_{s})-g_{s+1}(\bar{\theta}_{s+1})+f_{s}({\theta_{\star}})]
()\displaystyle\stackrel{{\scriptstyle(***)}}{{\leq}} ϕ¯t(θ)+s=1tws[gs+1(θ¯s+1)+gs+1(θ^s)ϕs(θ^s)+ϕs1(θ^s)]+Δ~t.\displaystyle\bar{\phi}_{t}({\theta_{\star}})+\sum_{s=1}^{t}w_{s}[-{g}_{s+1}(\bar{\theta}_{s+1})+{g}_{s+1}({\hat{{\theta}}}_{s})-{\phi}_{s}({\hat{{\theta}}}_{s})+{\phi}_{s-1}({\hat{{\theta}}}_{s})]+\tilde{\Delta}_{t}.

In ()(*), we used the shorthands and definitions introduced above. In ()(**), we used the identity gs+1(θ^s)ϕs(θ^s)gs(θ^s)+ϕs1(θ^s)=ms(θ^s){g}_{s+1}({\hat{{\theta}}}_{s})-{\phi}_{s}({\hat{{\theta}}}_{s})-{g}_{s}({\hat{{\theta}}}_{s})+{\phi}_{s-1}({\hat{{\theta}}}_{s})=m_{s}({\hat{{\theta}}}_{s}). Finally, for ()(***), recall that θ¯s\bar{\theta}_{s} is the minimizer of gsg_{s}, and hence, gs(θ¯s)gs(θ^s)0g_{s}(\bar{\theta}_{s})-g_{s}({\hat{{\theta}}}_{s})\leq 0. Next, define Δ~t=s=1t(1ws)[gs(θ¯s)gs+1(θ¯s+1)+fs(θ)]\tilde{\Delta}_{t}=\sum_{s=1}^{t}(1-w_{s})[g_{s}(\bar{\theta}_{s})-g_{s+1}(\bar{\theta}_{s+1})+f_{s}({\theta_{\star}})]. We will bound this term later.

Now, we use the strong convexity of g_{s+1}(\theta) under the norm \lvert\lvert\cdot\rvert\rvert_{\mathbf{{V}}_{s}^{\mu;\lambda}} where \mathbf{{V}}_{s}^{\mu;\lambda}=\sum_{j=1}^{s}\mu x_{j}x_{j}^{\top}+\lambda\mathbf{{I}}, namely

g_{s+1}(\bar{\theta}_{s+1})\geq g_{s+1}({\hat{{\theta}}}_{s})+\nabla g_{s+1}({\hat{{\theta}}}_{s})^{\top}(\bar{\theta}_{s+1}-{\hat{{\theta}}}_{s})+\frac{1}{2}\lvert\lvert\bar{\theta}_{s+1}-{\hat{{\theta}}}_{s}\rvert\rvert_{\mathbf{{V}}_{s}^{\mu;\lambda}}^{2}.

We can then proceed as follows:

\sum_{s=1}^{t}w_{s}(m_{s}({\hat{{\theta}}}_{s})-m_{s}({\theta_{\star}})) (28)
\displaystyle\leq \tilde{\Delta}_{t}+\bar{\phi}_{t}({\theta_{\star}})+\sum_{s=1}^{t}w_{s}[\nabla g_{s+1}({\hat{{\theta}}}_{s})^{\top}({\hat{{\theta}}}_{s}-\bar{\theta}_{s+1})
\displaystyle-\frac{1}{2}\lvert\lvert\bar{\theta}_{s+1}-{\hat{{\theta}}}_{s}\rvert\rvert_{\mathbf{{V}}_{s}^{\mu;\lambda}}^{2}-{\phi}_{s}({\hat{{\theta}}}_{s})+{\phi}_{s-1}({\hat{{\theta}}}_{s})]
\displaystyle\leq \tilde{\Delta}_{t}+\bar{\phi}_{t}({\theta_{\star}})+\sum_{s=1}^{t}w_{s}[(\nabla\tilde{g}_{s}({\hat{{\theta}}}_{s})+\nabla m_{s}({\hat{{\theta}}}_{s}))^{\top}({\hat{{\theta}}}_{s}-\bar{\theta}_{s+1})
\displaystyle-\frac{1}{2}\lvert\lvert\bar{\theta}_{s+1}-{\hat{{\theta}}}_{s}\rvert\rvert_{\mathbf{{V}}_{s}^{\mu;\lambda}}^{2}-{\phi}_{s}({\hat{{\theta}}}_{s})+{\phi}_{s-1}({\hat{{\theta}}}_{s})]
\displaystyle\leq \tilde{\Delta}_{t}+\bar{\phi}_{t}({\theta_{\star}})+\sum_{s=1}^{t}\left[\frac{w_{s}}{2}\lvert\lvert\nabla m_{s}({\hat{{\theta}}}_{s})\rvert\rvert_{(\mathbf{{V}}_{s}^{\mu;\lambda})^{-1}}^{2}+w_{s}(-{\phi}_{s}({\hat{{\theta}}}_{s})+{\phi}_{s-1}({\hat{{\theta}}}_{s}))\right],

where in the second-to-last line we used that \nabla\tilde{g}_{s}({\hat{{\theta}}}_{s})^{\top}(x-{\hat{{\theta}}}_{s})\geq 0 for any x\in\Theta, due to the optimality of {\hat{{\theta}}}_{s} for the FTRL objective. In the last line, we optimized over \bar{\theta}_{s+1} to get a worst-case bound on the quadratic function involving it. Also, note that for the shorthands we defined above:

s=1tws(ϕs(θ^s)+ϕs1(θ^s))=s=1tws(A(θ^sxs)).\sum_{s=1}^{t}w_{s}(-{\phi}_{s}({\hat{{\theta}}}_{s})+{\phi}_{s-1}({\hat{{\theta}}}_{s}))=\sum_{s=1}^{t}w_{s}(-A({\hat{{\theta}}}_{s}^{\top}x_{s})).

Our goal here is to upper bound the overall regret:

\mathcal{R}_{t}=\sum_{s=1}^{t}w_{s}(f_{s}({\hat{{\theta}}}_{s})-f_{s}({\theta_{\star}})) (29)
\displaystyle=\sum_{s=1}^{t}w_{s}(m_{s}({\hat{{\theta}}}_{s})-m_{s}({\theta_{\star}}))+\sum_{s=1}^{t}w_{s}(A(x_{s}^{\top}{\hat{{\theta}}}_{s})-A(x_{s}^{\top}{\theta_{\star}}))
\displaystyle\leq\bar{\phi}_{t}({\theta_{\star}})-\sum_{s=1}^{t}w_{s}A(x_{s}^{\top}{\hat{{\theta}}}_{s})+\frac{1}{2}\sum_{s=1}^{t}w_{s}\lvert\lvert\nabla m_{s}({\hat{{\theta}}}_{s})\rvert\rvert_{(\mathbf{{V}}_{s}^{\mu;\lambda})^{-1}}^{2}+\tilde{\Delta}_{t}
\displaystyle\quad+\sum_{s=1}^{t}w_{s}(A(x_{s}^{\top}{\hat{{\theta}}}_{s})-A(x_{s}^{\top}{\theta_{\star}}))
\displaystyle=\sum_{s=1}^{t}w_{s}A(x_{s}^{\top}{\theta_{\star}})-\sum_{s=1}^{t}w_{s}A(x_{s}^{\top}{\hat{{\theta}}}_{s})+\frac{1}{2}\sum_{s=1}^{t}w_{s}\lvert\lvert\nabla m_{s}({\hat{{\theta}}}_{s})\rvert\rvert_{(\mathbf{{V}}_{s}^{\mu;\lambda})^{-1}}^{2}+\lambda\lvert\lvert{\theta_{\star}}\rvert\rvert^{2}+\tilde{\Delta}_{t}
\displaystyle\quad+\sum_{s=1}^{t}w_{s}(A(x_{s}^{\top}{\hat{{\theta}}}_{s})-A(x_{s}^{\top}{\theta_{\star}}))
\displaystyle=\tilde{\Delta}_{t}+\frac{1}{2}\sum_{s=1}^{t}w_{s}\lvert\lvert\nabla m_{s}({\hat{{\theta}}}_{s})\rvert\rvert_{(\mathbf{{V}}_{s}^{\mu;\lambda})^{-1}}^{2}+\lambda\lvert\lvert{\theta_{\star}}\rvert\rvert^{2}.

The first inequality follows by plugging in (28).

We now return to the term \tilde{\Delta}_{t}:

Δ~t\displaystyle\tilde{\Delta}_{t} =s=1t(1ws)[gs(θ¯s)gs+1(θ¯s+1)+fs(θ)]\displaystyle=\sum_{s=1}^{t}(1-w_{s})[g_{s}(\bar{\theta}_{s})-g_{s+1}(\bar{\theta}_{s+1})+f_{s}({\theta_{\star}})]
=s=1t(1ws)[gs(θ¯s)gs(θ¯s+1)fs(θ¯s+1)+fs(θ)]\displaystyle=\sum_{s=1}^{t}(1-w_{s})[g_{s}(\bar{\theta}_{s})-g_{s}(\bar{\theta}_{s+1})-f_{s}(\bar{\theta}_{s+1})+f_{s}({\theta_{\star}})]
s=1t(1ws)(fs(θ)fs(θ¯s+1))\displaystyle\leq\sum_{s=1}^{t}(1-w_{s})(f_{s}({\theta_{\star}})-f_{s}(\bar{\theta}_{s+1}))
Lμ(γtλ+log(1/δ)),\displaystyle\leq\frac{L}{\mu}(\gamma_{t}^{\lambda}+\log(1/\delta)), (30)

where the second to last line is by the optimality of θ¯s\bar{\theta}_{s} and the last one by the assumption in the theorem.

Carrying on with the analysis, i.e. with (29), we insert the definition of msm_{s} and obtain

\quad\mathcal{R}_{t}
\displaystyle\leq\tilde{\Delta}_{t}+\frac{1}{2}\sum_{s=1}^{t}w_{s}\lvert\lvert-T(y_{s})x_{s}\rvert\rvert_{(\mathbf{{V}}_{s}^{\mu;\lambda})^{-1}}^{2}+\lambda\lvert\lvert{\theta_{\star}}\rvert\rvert^{2}
\displaystyle\leq\tilde{\Delta}_{t}+\sum_{s=1}^{t}\left[w_{s}\lvert\lvert T(y_{s})x_{s}-A^{\prime}(x_{s}^{\top}{\theta_{\star}})x_{s}\rvert\rvert_{(\mathbf{{V}}_{s}^{\mu;\lambda})^{-1}}^{2}+w_{s}\lvert\lvert A^{\prime}(x_{s}^{\top}{\theta_{\star}})x_{s}\rvert\rvert_{(\mathbf{{V}}_{s}^{\mu;\lambda})^{-1}}^{2}\right]+\lambda\lvert\lvert{\theta_{\star}}\rvert\rvert^{2}
\displaystyle\stackrel{{(*)}}{{\leq}}\tilde{\Delta}_{t}+\sum_{s=1}^{t}\lvert\lvert T(y_{s})x_{s}-A^{\prime}(x_{s}^{\top}{\theta_{\star}})x_{s}\rvert\rvert_{(\mathbf{{V}}_{s}^{\mu;\lambda})^{-1}}^{2}+L^{2}B^{2}\sum_{s=1}^{t}w_{s}\lvert\lvert x_{s}\rvert\rvert_{(\mathbf{{V}}_{s}^{\mu;\lambda})^{-1}}^{2}+\lambda\lvert\lvert{\theta_{\star}}\rvert\rvert^{2}
\displaystyle\stackrel{{(**)}}{{=}}\tilde{\Delta}_{t}+\sum_{s=1}^{t}\lvert\lvert\nabla f_{s}({\theta_{\star}})\rvert\rvert_{(\mathbf{{V}}_{s}^{\mu;\lambda})^{-1}}^{2}+L\sum_{s=1}^{t}\frac{B^{2}}{1/L+\operatorname{bias}^{2}_{x_{s}}({\hat{{\theta}}}_{s})}\lvert\lvert x_{s}\rvert\rvert_{(\mathbf{{V}}_{s}^{\mu;\lambda})^{-1}}^{2}+\lambda B^{2}
\displaystyle\stackrel{{(***)}}{{\leq}}\lambda B^{2}+\frac{2L}{\mu}\left(\gamma_{t}^{\lambda}+\log\left(\frac{1}{\delta}\right)\right)+L\sum_{s=1}^{t}\frac{B^{2}}{1/L+\operatorname{bias}^{2}_{x_{s}}({\hat{{\theta}}}_{s})}\lvert\lvert x_{s}\rvert\rvert_{(\mathbf{{V}}_{s}^{\mu;\lambda})^{-1}}^{2}.

In (*), we use w_{s}\leq 1 and the Lipschitzness of A^{\prime}. In (**), we use the definition of the weights. Finally, in (***), we used Lemma 10 and (30). Substituting \Delta\gamma_{s}=\mu\lvert\lvert x_{s}\rvert\rvert_{(\mathbf{{V}}_{s}^{\mu;\lambda})^{-1}}^{2} finishes the proof; the event in Lemma 10 holds with probability 1-\delta. ∎

C.6 FTRL Analysis: Beyond Global Smoothness

In this subsection, we give an alternative analysis which avoids the need to impose a global smoothness condition on the likelihood; instead, strong convexity within a bounded domain suffices, and we will only assume that

ϵs:=𝔼xsθ[T(ys)]T(ys)\epsilon_{s}:=\mathop{{}\mathbb{E}}_{x_{s}^{\top}{\theta_{\star}}}[T(y_{s})]-T(y_{s})

are sub-Exponential random variables, setting us apart from Zhao et al., (2022), who assume sub-Gaussianity. In particular, we can show the following theorem.

Theorem 7.

With probability 1δ1-\delta, uniformly over time tt\in\mathbb{N}, we have

\mathcal{R}_{t}\leq cd\log^{2}(t/\delta)\log(t),

where the universal constant c hides all factors independent of t, d, and \delta.

C.6.1 Lemmas

We state the following result on sub-Exponential random variables.

Proposition 2 (Theorem 2.13 in Wainwright, (2019)).

If X is a centered sub-Exponential variable with some finite variance proxy, then there exist constants c_{1},c_{2}>0 such that for any a>0

(|X|a)c1ec2a.\mathop{\mathbb{P}}(\lvert X\rvert\geq a)\leq c_{1}e^{-c_{2}a}.

By some careful union bounds (akin to a stitching argument), we can also provide anytime-valid upper bounds on the process S_{t}=\max_{s\leq t}\lvert\epsilon_{s}\rvert.

Lemma 12.

For any sequence (\epsilon_{s})_{s=1}^{\infty} of sub-Exponential variables, there exists a constant \tilde{c} independent of t such that

(t:maxst|ϵs|c~log(s/δ))δ.\mathop{\mathbb{P}}(\exists t:\;\max_{s\leq t}\lvert\epsilon_{s}\rvert\geq\tilde{c}\log(s/\delta))\leq\delta.
Proof.

(of Lemma 12) By Proposition 2, there exist c_{1},c_{2}>0 such that \mathop{\mathbb{P}}(\lvert\epsilon_{s}\rvert\geq a)\leq c_{1}e^{-c_{2}a} for any fixed s. Note that c_{1}e^{-c_{2}a}\leq\delta is satisfied for a\geq\frac{1}{c_{2}}\log(c_{1}/\delta)=:c_{3}\log(c_{1}/\delta). Let us denote by \mathcal{E}_{i} the event that all j\in[2^{i},2^{i+1})\cap\mathbb{N} satisfy the inequality

|ϵj|<c3log(c1(22i+1)/δ).\lvert\epsilon_{j}\rvert<c_{3}\log(c_{1}(2^{2i+1})/\delta).

For a single jj, this happens with probability at least 1δ/22i+11-\delta/2^{2i+1}. Therefore, by a union bound, as |[2i,2i+1)|=2i\lvert[2^{i},2^{i+1})\cap\mathbb{N}\rvert=2^{i}, we can bound the probability of the complement, namely (ic)2iδ22i+1=δ2i+1\mathop{\mathbb{P}}(\mathcal{E}_{i}^{c})\leq 2^{i}\frac{\delta}{2^{2i+1}}=\frac{\delta}{2^{i+1}}. Now, by another union bound, we can conclude that

(i=0ic)i=0δ2i+1=δ21112=δ.\mathop{\mathbb{P}}(\cup_{i=0}^{\infty}\mathcal{E}_{i}^{c})\leq\sum_{i=0}^{\infty}\frac{\delta}{2^{i+1}}=\frac{\delta}{2}\frac{1}{1-\frac{1}{2}}=\delta.

Now we also have for any jj in this range that 22i+12j2,2^{2i+1}\leq 2j^{2}, and therefore, if i\mathcal{E}_{i} holds, we have for any j[2i,2i+1)j\in[2^{i},2^{i+1})\cap\mathbb{N}:

ϵjc3log(c1(2j2)/δ)2c3log(2c1j/δ)c~log(j/δ).\epsilon_{j}\leq c_{3}\log(c_{1}(2j^{2})/\delta)\leq 2c_{3}\log(2c_{1}j/\delta)\leq\tilde{c}\log(j/\delta).

We can immediately see that this implies

(t:maxst|ϵs|c~log(s/δ))δ.\mathop{\mathbb{P}}(\exists t:\;\max_{s\leq t}\lvert\epsilon_{s}\rvert\geq\tilde{c}\log(s/\delta))\leq\delta.

as desired. ∎
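As an empirical illustration of Lemma 12 (our own sanity check, not part of the proof), one can draw centered Laplace noise, which is sub-Exponential, and verify that the log(s/δ) envelope is essentially never crossed; the constant c̃ = 5 below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
delta, c_tilde, T, n_runs = 0.05, 5.0, 10_000, 200
violations = 0
for _ in range(n_runs):
    eps = rng.laplace(scale=1.0, size=T)       # centered sub-Exponential noise
    s = np.arange(1, T + 1)
    # anytime event of Lemma 12: some |eps_s| exceeds c~ * log(s/delta)
    if np.any(np.abs(eps) >= c_tilde * np.log(s / delta)):
        violations += 1
print(f"violation rate: {violations / n_runs:.3f} (should be <= {delta})")
```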

C.6.2 Proof of Theorem 7

Our proof initially follows the FTRL regret bound proofs for the adversarial setting in Hazan, (2016); Orabona, (2019), and it also overlaps with the proof in Zhao et al., (2022). We define the function that FTRL minimizes in each step (to pick {\hat{{\theta}}}_{t}) as

gt(θ)=s=1t1logpθ(ys|xs)+ϕ(θ)g_{t}(\theta)=\sum_{s=1}^{t-1}-\log p_{\theta}(y_{s}\,|\,x_{s})+\phi(\theta)

for convenience, where \phi(\theta)=\lambda\lvert\lvert\theta\rvert\rvert_{2}^{2} is the regularizer. We first use the same steps as in Theorem 4 to see that for any u\in\Theta

s=1t(fs(θ^s)fs(u))\displaystyle\quad\;\sum_{s=1}^{t}(f_{s}({\hat{{\theta}}}_{s})-f_{s}(u))
ϕ(u)minθϕ(θ)+s=1t[gs(θ^s)gs+1(θ^s+1)+fs(θ^s)]+gt+1(θ^t+1)gt+1(u)\displaystyle\leq\phi(u)-\min_{\theta}\phi(\theta)+\sum_{s=1}^{t}[g_{s}({\hat{{\theta}}}_{s})-g_{s+1}({\hat{{\theta}}}_{s+1})+f_{s}({\hat{{\theta}}}_{s})]+g_{t+1}({\hat{{\theta}}}_{t+1})-g_{t+1}(u)
λB2+s=1t[gs(θ^s)gs+1(θ^s+1)+fs(θ^s)].\displaystyle\leq\lambda B^{2}+\sum_{s=1}^{t}[g_{s}({\hat{{\theta}}}_{s})-g_{s+1}({\hat{{\theta}}}_{s+1})+f_{s}({\hat{{\theta}}}_{s})].

Similarly to the proof of Theorem 4 in Appendix C.4, we bound these increments by the dual norm of the gradient of the objective:

gs(θ^s)gs+1(θ^s+1)+fs(θ^s)||fs(θ^s)||(𝐕sμ;λ)122.g_{s}({\hat{{\theta}}}_{s})-g_{s+1}({\hat{{\theta}}}_{s+1})+f_{s}({\hat{{\theta}}}_{s})\leq\frac{\lvert\lvert\nabla f_{s}({\hat{{\theta}}}_{s})\rvert\rvert_{(\mathbf{{V}}_{s}^{\mu;\lambda})^{-1}}^{2}}{2}. (31)

Now we note that

fs(θ)=A(xsθ)xsT(ys)xs.\nabla f_{s}(\theta)=A^{\prime}(x_{s}^{\top}\theta)x_{s}-T(y_{s})x_{s}.

Using properties of the exponential family, we deduce that

fs(θ)\displaystyle\nabla f_{s}(\theta) =(𝔼xsθ[T(ys)]T(ys))xs\displaystyle=\left(\mathop{{}\mathbb{E}}_{x_{s}^{\top}\theta}[T(y_{s})]-T(y_{s})\right)x_{s}
=(𝔼xsθ[T(ys)]𝔼xsθ[T(ys)]+𝔼xsθ[T(ys)]T(ys))xs.\displaystyle=\left(\mathop{{}\mathbb{E}}_{x_{s}^{\top}\theta}[T(y_{s})]-\mathop{{}\mathbb{E}}_{x_{s}^{\top}{\theta_{\star}}}[T(y_{s})]+\mathop{{}\mathbb{E}}_{x_{s}^{\top}{\theta_{\star}}}[T(y_{s})]-T(y_{s})\right)x_{s}.

From here on out, we proceed more crudely than in our previous analyses, since we are only concerned with asymptotic behavior when dd and tt are large. Let us define

U:=supθΘ|𝔼xsθ[T(ys)]𝔼xsθ[T(ys)]|,U:=\sup_{\theta\in\Theta}\lvert\mathop{{}\mathbb{E}}_{x_{s}^{\top}\theta}[T(y_{s})]-\mathop{{}\mathbb{E}}_{x_{s}^{\top}{\theta_{\star}}}[T(y_{s})]\rvert,

which is a model-dependent, deterministic quantity. Let us define the noise variables

ϵs:=𝔼xsθ[T(ys)]T(ys).\epsilon_{s}:=\mathop{{}\mathbb{E}}_{x_{s}^{\top}{\theta_{\star}}}[T(y_{s})]-T(y_{s}).

We bound

||fs(θ)||(𝐕sμ;λ)12\displaystyle\lvert\lvert\nabla f_{s}(\theta)\rvert\rvert_{(\mathbf{{V}}^{\mu;\lambda}_{s})^{-1}}^{2} 2(U2+ϵs2)||xs||(𝐕sμ;λ)12.\displaystyle\leq 2(U^{2}+\epsilon_{s}^{2})\lvert\lvert x_{s}\rvert\rvert_{(\mathbf{{V}}^{\mu;\lambda}_{s})^{-1}}^{2}.

Note that the ϵs\epsilon_{s} are centered, independent sub-Exponential variables, and as such are guaranteed to satisfy

\mathbb{P}(\exists t:\;\max_{s\leq t}\lvert\epsilon_{s}\rvert\geq\tilde{c}\log(s/\delta))\leq\delta,

by Lemma 12. Conditional on this event, we can therefore upper bound, for any t,

\mathcal{R}_{t}\leq\lambda B^{2}+\frac{1}{\mu}(U^{2}+\tilde{c}^{2}\log^{2}(t/\delta))\sum_{s=1}^{t}\lvert\lvert x_{s}\rvert\rvert_{(\mathbf{{V}}^{\mu;\lambda}_{s})^{-1}}^{2},

for some constant c~\tilde{c} independent of tt. By Lemma 7, there is thus a constant cc^{\prime} independent of tt and dd such that

\mathcal{R}_{t}\leq c^{\prime}d\log^{2}(t/\delta)\log(t)=\mathcal{O}(d\log^{3}(t)),

with probability 1δ1-\delta uniformly over tt\in\mathbb{N}.

Appendix D Regret Consequences for Stochastic Linear Bandits

As a corollary of our analysis, we provide the regret for stochastic linear bandits that use our confidence sets within the LinUCB algorithm.

Proof.

We proceed in two parts: first, we instantiate Theorem 3 and then we follow the classical regret analysis for stochastic linear bandits.

Specializing the Bregman divergence results

By Theorem 3, we know that for any ν>0\nu>0, we have that with probability 1δ1-\delta,

DZtν(θ,θ)4Lμξt+2log(1δ)+2t,D_{Z_{t}^{\nu}}(\theta,{\theta_{\star}})\leq\frac{4L}{\mu}\xi_{t}+2\log\left(\frac{1}{\delta}\right)+2\mathcal{R}_{t}, (32)

for all tt, where ξt=(log(1α)+νB2+Γtν)\xi_{t}=\left(\log\left(\frac{1}{\alpha}\right)+{\nu B^{2}}+\Gamma_{t}^{\nu}\right) and t\mathcal{R}_{t} is the online convex optimization regret. We also recall that

Ztν(θ)=s=1twsA(xsθ)+ν2||θ||22.Z_{t}^{\nu}(\theta)=\sum_{s=1}^{t}w_{s}A(x_{s}^{\top}\theta)+\frac{\nu}{2}\lvert\lvert\theta\rvert\rvert^{2}_{2}.

In the Gaussian case, we have A(z)=z^{2}/(2\sigma^{2}). This implies that

Ztν(θ)=s=1twsσ2xsxsθ+νθ=𝐖tσ2;νθ,\nabla Z_{t}^{\nu}(\theta)=\sum_{s=1}^{t}\frac{w_{s}}{\sigma^{2}}x_{s}x_{s}^{\top}\theta+\nu\theta=\mathbf{W}_{t}^{\sigma^{-2};\nu}\theta,

where we have defined a weighted version of 𝐕tσ2;ν\mathbf{{V}}_{t}^{\sigma^{-2};\nu} as 𝐖tσ2;ν=s=1twsxsxsσ2+ν𝐈\mathbf{W}_{t}^{\sigma^{-2};\nu}=\sum_{s=1}^{t}\frac{w_{s}x_{s}x_{s}^{\top}}{\sigma^{2}}+\nu\mathbf{{I}} and therefore the Bregman divergence is given by DZtν(θ,θ)=12||θθ||𝐖tσ2;ν2.D_{Z_{t}^{\nu}}(\theta,{\theta_{\star}})=\frac{1}{2}\lvert\lvert\theta-{\theta_{\star}}\rvert\rvert_{\mathbf{W}_{t}^{\sigma^{-2};\nu}}^{2}. We can also see that the Bregman information gain is given by

\Gamma_{t}^{\nu}=\log\left(\frac{\int_{\mathbb{R}^{d}}\exp(-\frac{\nu}{2}\lvert\lvert\theta\rvert\rvert_{2}^{2})\mathrm{d}\theta}{\int_{\mathbb{R}^{d}}\exp(-D_{Z_{t}^{\nu}}(\theta,{\tilde{{\theta}}}_{t}))\mathrm{d}\theta}\right)=\log\left(\frac{\int_{\mathbb{R}^{d}}\exp(-\frac{\nu}{2}\lvert\lvert\theta\rvert\rvert_{2}^{2})\mathrm{d}\theta}{\int_{\mathbb{R}^{d}}\exp(-\frac{1}{2}\lvert\lvert\theta-{\theta_{\star}}\rvert\rvert_{\mathbf{W}_{t}^{\sigma^{-2};\nu}}^{2})\mathrm{d}\theta}\right).

These Gaussian integrals are straightforward to evaluate. We know that

\int_{\mathbb{R}^{d}}\exp\left(-\frac{\nu}{2}\lvert\lvert\theta\rvert\rvert^{2}_{2}\right)\mathrm{d}\theta=(2\pi)^{d/2}\sqrt{\det((\nu\mathbf{I}_{d})^{-1})}.

Similarly,

dexp(12||θθ||𝐖tσ2;ν2)dθ=(2π)d/2det((𝐖tσ2;ν)1).\displaystyle\int_{\mathbb{R}^{d}}\exp\left(-\frac{1}{2}\lvert\lvert\theta-{\theta_{\star}}\rvert\rvert_{\mathbf{W}_{t}^{\sigma^{-2};\nu}}^{2}\right)\mathrm{d}\theta=(2\pi)^{d/2}\sqrt{\det((\mathbf{W}_{t}^{\sigma^{-2};\nu})^{-1})}.

Then, we can compute

Γtν=log(det(𝐖tσ2;ν)det(ν𝐈))=log(det(s=1twsxsxsσ2ν+𝐈)).\displaystyle\Gamma_{t}^{\nu}=\log\left(\frac{\det(\mathbf{W}_{t}^{\sigma^{-2};\nu})}{\det(\nu\mathbf{I})}\right)=\log\left(\det\left(\sum_{s=1}^{t}\frac{w_{s}x_{s}x_{s}^{\top}}{\sigma^{2}\nu}+\mathbf{I}\right)\right).
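For reference, the log-determinant above is cheap to evaluate numerically; the following short sketch (ours, with synthetic inputs) computes the Bregman information gain from this formula.

```python
import numpy as np

def bregman_information_gain(X, w, sigma=1.0, nu=1.0):
    """Gamma_t^nu = logdet( sum_s w_s x_s x_s^T / (sigma^2 nu) + I )."""
    d = X.shape[1]
    M = (X * w[:, None]).T @ X / (sigma**2 * nu) + np.eye(d)
    sign, logdet = np.linalg.slogdet(M)   # numerically stable log-determinant
    assert sign > 0                        # M is positive definite
    return logdet

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5)) / np.sqrt(5)
print(bregman_information_gain(X, np.ones(100)))  # unweighted: equals gamma_t^nu
```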

In the unweighted case, with which we proceed, we have \Gamma_{t}^{\nu}=\gamma_{t}^{\nu}; that is, we recover the classical upper bound on the information gain (Srinivas et al.,, 2009). To summarize, we have specialized the bound (32) to say that for any \theta\in\mathcal{C}_{t}, we have (since L=\mu=1/\sigma^{2})

\lvert\lvert\theta-{\theta_{\star}}\rvert\rvert_{\mathbf{V}_{t}^{\sigma^{-2};\nu}}^{2}\leq 8\left(\log(1/\alpha)+\nu B^{2}+\gamma_{t}^{\nu}\right)+4\log(1/\delta)+4\mathcal{R}_{t}.

Now, we instantiate the regret of the online learner using Theorem 4. With probability 1δ1-\delta, uniformly over tt, we have

tλB2+Lμ(γtλ+log(1δ))+2L2B2μγtλ.\mathcal{R}_{t}\leq\lambda B^{2}+\frac{L}{\mu}\left(\gamma_{t}^{\lambda}+\log\left(\frac{1}{\delta}\right)\right)+\frac{2L^{2}B^{2}}{\mu}\gamma_{t}^{\lambda}. (33)

We get, by choosing \alpha=\delta and setting \nu=\lambda, that

||θθ||𝐕tσ2;λ2\displaystyle\quad\;\lvert\lvert\theta-{\theta_{\star}}\rvert\rvert_{\mathbf{V}_{t}^{\sigma^{-2};\lambda}}^{2}
\displaystyle\leq 8\left(\log(1/\delta)+\nu B^{2}+\gamma_{t}^{\nu}\right)+4\log(1/\delta)+4\lambda B^{2}+4(\gamma_{t}^{\lambda}+\log({1}/{\delta}))+\frac{8B^{2}}{\sigma^{2}}\gamma_{t}^{\lambda}
\displaystyle\leq 16\log(1/\delta)+12\lambda B^{2}+\left(12+\frac{8B^{2}}{\sigma^{2}}\right)\gamma_{t}^{\lambda}=:\beta_{t}.
Linear bandit regret analysis

We are ready to proceed with the bandit analysis for the UCB algorithm. We follow Lattimore and Szepesvári, (2020) and bound the pseudo-regret, letting {x_{\star}} be the optimal action. We bound the instantaneous regret at step 1\leq s\leq t as

rs\displaystyle r_{s} =θ,xxs\displaystyle=\langle{\theta_{\star}},\,{x_{\star}}-x_{s}\rangle
=θ,xθ,xs\displaystyle=\langle{\theta_{\star}},\,{x_{\star}}\rangle-\langle{\theta_{\star}},\,x_{s}\rangle
maxθ𝒞s1θ,xθ,xs\displaystyle\leq\max_{\theta\in\mathcal{C}_{s-1}}\langle\theta,\,{x_{\star}}\rangle-\langle{\theta_{\star}},\,x_{s}\rangle
maxx𝒳maxθ𝒞s1θ,xθ,xs\displaystyle\leq\max_{x\in\mathcal{X}}\max_{\theta\in\mathcal{C}_{s-1}}\langle\theta,\,x\rangle-\langle{\theta_{\star}},\,x_{s}\rangle
=()maxθ𝒞s1θ,xsθ,xs\displaystyle\stackrel{{\scriptstyle(*)}}{{=}}\max_{\theta\in\mathcal{C}_{s-1}}\langle\theta,\,x_{s}\rangle-\langle{\theta_{\star}},\,x_{s}\rangle
()||θ~sθ||𝐕s1σ2;λ||xs||(𝐕s1σ2;λ)1\displaystyle\stackrel{{\scriptstyle(**)}}{{\leq}}\lvert\lvert{\tilde{{\theta}}}_{s}-{\theta_{\star}}\rvert\rvert_{\mathbf{V}_{s-1}^{\sigma^{-2};\lambda}}\lvert\lvert x_{s}\rvert\rvert_{(\mathbf{V}_{s-1}^{\sigma^{-2};\lambda})^{-1}}
βs1||xs||(𝐕s1σ2;λ)1\displaystyle\;\;{\leq}\sqrt{\beta_{s-1}}\lvert\lvert x_{s}\rvert\rvert_{(\mathbf{V}_{s-1}^{\sigma^{-2};\lambda})^{-1}}
βt||xs||(𝐕s1σ2;λ)1.\displaystyle\;\leq\sqrt{\beta_{t}}\lvert\lvert x_{s}\rvert\rvert_{(\mathbf{V}_{s-1}^{\sigma^{-2};\lambda})^{-1}}.

The first inequality replaces {\theta_{\star}} by the upper confidence bound for action {x_{\star}}, which is valid with probability 1-\alpha=1-\delta uniformly over time. Then, (*) uses the fact that x_{s} is chosen to maximize the upper confidence bound. Finally, (**) defines the UCB parameter {\tilde{{\theta}}}_{s}. By Corollary 1, we have

s=1t||xs||(𝐕s1σ2;λ)122σ2γtλ.\sum_{s=1}^{t}\lvert\lvert x_{s}\rvert\rvert^{2}_{(\mathbf{V}_{s-1}^{\sigma^{-2};\lambda})^{-1}}\leq 2\sigma^{2}\gamma_{t}^{\lambda}.
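As an aside, the elliptical potential sum on the left-hand side is easy to accumulate online via rank-one (Sherman-Morrison) updates. The sketch below (ours, on synthetic unit-norm actions with σ = λ = 1) also prints the right-hand side from Corollary 1 for comparison.

```python
import numpy as np

def elliptical_potential(X, sigma=1.0, lam=1.0):
    """Accumulate sum_s ||x_s||^2_{(V_{s-1})^{-1}} with
    V_s = sum_{j<=s} x_j x_j^T / sigma^2 + lam * I."""
    V_inv = np.eye(X.shape[1]) / lam          # (V_0)^{-1}
    total = 0.0
    for x in X:
        q = x @ V_inv @ x                     # ||x_s||^2 in the previous inverse
        total += q
        u = V_inv @ x / sigma                 # Sherman-Morrison rank-one update
        V_inv -= np.outer(u, u) / (1.0 + q / sigma**2)
    return total

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm actions
sigma, lam = 1.0, 1.0
gamma = np.linalg.slogdet(X.T @ X / (sigma**2 * lam) + np.eye(4))[1]
print(elliptical_potential(X, sigma, lam), "<=", 2 * sigma**2 * gamma)
```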

Plugging all this together and using an 1/2\ell_{1}/\ell_{2}-norm inequality, we get

t\displaystyle\mathfrak{R}_{t} =s=1trs\displaystyle=\sum_{s=1}^{t}r_{s}
βtts=1t||xs||(𝐕s1σ2;λ)12\displaystyle\leq\sqrt{\beta_{t}}\sqrt{t\sum_{s=1}^{t}\lvert\lvert x_{s}\rvert\rvert^{2}_{(\mathbf{V}_{s-1}^{\sigma^{-2};\lambda})^{-1}}}
2tβtσ2γtλ\displaystyle\leq\sqrt{2t\beta_{t}\sigma^{2}\gamma_{t}^{\lambda}}
\displaystyle\leq\sqrt{2\sigma^{2}t\left(16\log(1/\delta)+12\lambda B^{2}+12\gamma_{t}^{\lambda}+8B^{2}/\sigma^{2}\gamma_{t}^{\lambda}\right)\gamma_{t}^{\lambda}}
\displaystyle\leq 6\sqrt{t\gamma_{t}^{\lambda}}\left(\sigma\sqrt{\log(1/\delta)+\gamma_{t}^{\lambda}}+\sigma\lambda^{1/2}B+B\sqrt{\gamma_{t}^{\lambda}}\right).

To summarize and to justify why this bound holds with probability 13δ1-3\delta uniformly over time, note that we have bounded the probability of the FTRL bound (33) not holding for some tt by δ\delta. Then, the probability of (32) not holding for some tt is at most δ\delta. Finally, the anytime Type I error of our sets is also bounded by δ\delta. A union bound therefore concludes the proof. ∎

D.1 Comparison to Abbasi-Yadkori et al. (2011)

We compare our result to the one from Abbasi-Yadkori et al., (2011). Under the assumption that λ1\lambda\geq 1, they show that the regret satisfies

\mathfrak{R}_{t}\leq 4\sqrt{td\log(\lambda+t/d)}\left(\sqrt{\lambda}B+\sigma\sqrt{2\log(1/\delta)+d\log(1+t/(\lambda d))}\right).

Observe that there is a reparametrization for the regularizer to get even more similar bounds. If we take λ=λ~/σ2\lambda=\tilde{\lambda}/\sigma^{2} for some λ~1\tilde{\lambda}\geq 1, our bound reads as

6tγtλ~/σ2(σlog(1/δ)+γtλ~/σ2+λ~1/2B+Bγtλ~/σ2).6\sqrt{t\gamma_{t}^{\tilde{\lambda}/\sigma^{2}}}\left(\sigma\sqrt{\log(1/\delta)+\gamma_{t}^{\tilde{\lambda}/\sigma^{2}}}+\tilde{\lambda}^{1/2}B+B\sqrt{\gamma_{t}^{\tilde{\lambda}/\sigma^{2}}}\right).

Given that by Corollary 1, we have

γtλ~/σ2dlog(tλ~+1),\gamma_{t}^{\tilde{\lambda}/\sigma^{2}}\leq d\log\left(\frac{t}{\tilde{\lambda}}+1\right),

we get almost matching bounds, up to an additional Bγtλ~/σ2B\sqrt{\gamma_{t}^{\tilde{\lambda}/\sigma^{2}}} term blowing up the regret, which we attribute to the accumulation of bias without the reweighting scheme. The remaining differences are down to using slightly different versions of the elliptical potential lemma, trading off generality and tightness (Abbasi-Yadkori et al.,, 2011; Hazan et al.,, 2006; Lattimore and Szepesvári,, 2020).

Appendix E Experimental Details

E.1 Calibration Plots

In Figure 3 we report the calibration of the heuristics as well as of other theoretically motivated constructions. The other theoretically motivated sets are very pessimistic and not appropriately calibrated. One caveat of reporting calibration is that, in the sequential regime, it is heavily influenced by the data collection scheme; in our case we use a bandit algorithm to collect the data. Arguably, in this setting, regret is a better measure than the calibration of the confidence sets. Additionally, the calibration depends on the true value {\theta_{\star}}. We report results for the zero parameter and for a random parameter from the unit ball, as well as for i.i.d. data.

[Figure 3 panels: (a) ADAPTIVE bandit sequence, random θ⋆ with ‖θ⋆‖₂=1, σ=0.1; (b) ADAPTIVE bandit sequence, random θ⋆ with ‖θ⋆‖₂=1, σ=0.01; (c) ADAPTIVE bandit sequence, θ⋆=0, σ=0.1; (d) ADAPTIVE bandit sequence, θ⋆=0, σ=0.01; (e) IID sequence, random θ⋆ with ‖θ⋆‖₂=1, σ=0.1; (f) IID sequence, random θ⋆ with ‖θ⋆‖₂=1, σ=0.01; (g) IID sequence, θ⋆=0, σ=0.01; (h) IID sequence, θ⋆=0, σ=0.01.]
Figure 3: Calibration diagrams for data collected from a bandit game optimizing a ground-truth function with the same model as in Fig. 2a). Instead of the test function, we use an explicit member of the confidence set to avoid a potential mismatch between models. We check after T=15 rounds whether θ⋆∈Ct and average over 200 runs. The LR sets are more conservative than ideal calibration, but they are provably valid and substantially tighter than any theoretically motivated confidence sets. The heuristic, in contrast, is not calibrated and fails many times. For i.i.d. data, our sets are somewhat conservative, since the data is not adaptively collected and our approach is not necessary there. Note that the results depend on θ⋆.

E.2 Baselines and Details

In the following section, we describe the baselines we used in the comparison. The details of the parameters used in the experiments can be found in Table 2. As there is no explicit statement of such confidence sets for sub-exponential variables in the literature, we give a formal derivation in Appendix F.

E.2.1 Sub-Exponential Random Variables: Confidence Sets

For this baseline, we will assume a linear model with additive sub-exponential noise, namely that there is θΘ\theta_{*}\in\Theta such that yt=θ,xt+ηty_{t}=\langle\theta_{*},\,x_{t}\rangle+\eta_{t}, where ηt\eta_{t} is (ν,γ)(\nu,\gamma)-conditionally sub-exponential (Wainwright,, 2019). We let as usual

\mathbf{{V}}_{t}^{\nu^{-2};\lambda}=\sum_{s=1}^{t}\frac{x_{s}x_{s}^{\top}}{\nu^{2}}+\lambda\mathbf{{I}}.

With this, one can prove the following time-uniform concentration result:

Proposition 3.

For any k(0,1)k\in(0,1), the following holds:

(t:||θ^tθ||𝐕tν2;λβSE)δ,\mathbb{P}\left(\exists t\,:\,\lvert\lvert{\hat{{\theta}}}_{t}-\theta^{*}\rvert\rvert_{\mathbf{{V}}_{t}^{\nu^{-2};\lambda}}\geq\sqrt{\beta_{SE}}\right)\leq\delta,

where

βSE=λ||θ||2+λkB+dλkBlog(11k)+1λkBlog((det(𝐕tν2;λ))1/2δdet(λ𝐈)).\sqrt{\beta_{SE}}=\sqrt{\lambda}\lvert\lvert{\theta_{\star}}\rvert\rvert_{2}+\sqrt{\lambda}kB+\frac{d}{\sqrt{\lambda}kB}\log\left(\frac{1}{1-k}\right)+\frac{1}{\sqrt{\lambda}kB}\log\left(\frac{(\det(\mathbf{{V}}_{t}^{\nu^{-2};\lambda}))^{1/2}}{\delta\det(\sqrt{\lambda}\mathbf{{I}})}\right).

The proof is very similar to prior work Faury et al., (2020); Mutný and Krause, (2021), and can be found in Appendix F. This can readily be applied in the survival analysis (after a suitable transformation explained in the main paper) and Laplace noise experiments.
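For completeness, a short sketch (ours) of how the radius √βSE of Proposition 3 can be evaluated for given data, bounding ||θ⋆||₂ by B; the function name and default parameters are our own choices.

```python
import numpy as np

def beta_se_radius(X, nu=1.0, lam=1.0, k=0.5, B=1.0, delta=0.05):
    """sqrt(beta_SE) from Proposition 3, with ||theta_star||_2 bounded by B."""
    d = X.shape[1]
    V = X.T @ X / nu**2 + lam * np.eye(d)
    _, logdet_V = np.linalg.slogdet(V)
    # log( det(V)^{1/2} / det(sqrt(lam) * I) )
    log_ratio = 0.5 * logdet_V - 0.5 * d * np.log(lam)
    r = np.sqrt(lam) * k * B
    return (np.sqrt(lam) * B                 # sqrt(lam) * ||theta_star||_2 term
            + r                               # sqrt(lam) * k * B term
            + (d / r) * np.log(1.0 / (1.0 - k))
            + (np.log(1.0 / delta) + log_ratio) / r)
```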

E.2.2 Poisson Heuristics

We implement two heuristics. One is a Bayesian formulation based on the Laplace method, and one follows the considerations in Mutný and Krause, (2021) on how to construct a valid confidence set using the Fisher information. The Laplace method creates an ellipsoidal confidence set using the second-order information evaluated at the penalized maximum likelihood estimator. Namely, the second derivative of the likelihood evaluated at the penalized maximum likelihood estimate {\hat{{\theta}}}_{t} is

𝐕laplace=s=1texp(θ^txs)xsxs.\mathbf{{V}}_{\text{laplace}}=\sum_{s=1}^{t}\exp({\hat{{\theta}}}_{t}^{\top}x_{s})x_{s}x_{s}^{\top}.

We use this to define ellipsoidal confidence sets as θ^tθ𝐕laplace22log(1/δ)||{\hat{{\theta}}}_{t}-\theta||_{\mathbf{{V}}_{\text{laplace}}}^{2}\leq 2\log(1/\delta). The other heuristic suggests using the worst-case parameter instead, namely

𝐕mutny=s=1texp(B)xsxs.\mathbf{{V}}_{\text{mutny}}=\sum_{s=1}^{t}\exp(B)x_{s}x_{s}^{\top}.

This method would have provable coverage with a proper confidence parameter. Its derivation is beyond the scope of this paper.
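The two heuristic design matrices above are straightforward to form; the following sketch (ours) builds both and the ellipsoidal membership test, where theta_hat stands for a penalized MLE computed elsewhere.

```python
import numpy as np

def poisson_heuristic_matrices(X, theta_hat, B=2.0):
    """Laplace-method and worst-case design matrices for the Poisson model."""
    w = np.exp(X @ theta_hat)                 # A''(x^T theta_hat) = exp(.)
    V_laplace = (X * w[:, None]).T @ X
    V_mutny = np.exp(B) * X.T @ X             # worst-case curvature exp(B)
    return V_laplace, V_mutny

def in_laplace_set(theta, theta_hat, V_laplace, delta=0.05):
    """Ellipsoid test ||theta_hat - theta||^2_{V_laplace} <= 2 log(1/delta)."""
    diff = theta - theta_hat
    return diff @ V_laplace @ diff <= 2.0 * np.log(1.0 / delta)
```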

E.2.3 NR (2021)

This method follows Neiswanger and Ramdas, (2021). It was not developed to be versatile in terms of the likelihood; instead, it provides a confidence set on f, even if it originates from a misspecified prior. Nevertheless, it provides a likelihood-aware confidence sequence that is anytime-valid and does not employ worst-case parameters, and hence is a good benchmark for our analysis. The confidence sets are of the form

\{\theta\ |\ \log(p(\mathcal{D}))-\log\mathcal{L}(\theta)\leq\log(1/\delta)\},

where log(p(𝒟))\log(p(\mathcal{D})) is the current log-evidence of the data given the Gaussian prior. For more information, see Neiswanger and Ramdas, (2021).

Benchmark function | dim | |𝒳| | γ | B | λ | Gaussian/Laplace σ/b
1D | 1 | 2^6 | 0.06 | 4 | 1 | 0.15
Camelback | 2 | 10^2 | 0.2 | 2 | 1 | 0.10
Table 2: Summary of experimental parameters

E.3 Additive Models

We implemented two likelihoods, namely Gaussian and Laplace. We discretized the domain 𝒳, and in the implementation we used Nyström features defined on 𝒳, providing an exact representation of the RKHS on 𝒳. The points were chosen on a uniform grid. Notice that for the regularized estimator, we chose the rule of thumb λ=1/B, as motivated in Mutný and Krause, (2022).

The Laplace parameter was likewise picked as b=0.15. Note that the Laplace distribution is sub-exponential with parameters (b,2b²). We use 1/σ² or 1/b, respectively, for the value L. Strictly speaking, the Laplace likelihood is not smooth, but a smoothed objective would most likely inherit a value depending on b. As we maintain coverage with any choice of weighting, we do not risk invalidity by using a heuristic choice for L.

E.4 Survival Analysis

We implemented the method exactly as specified, using the Weibull likelihood with parameter p=2p=2. Upon log-transformation, the Gumbel distribution is sub-exponential. To determine the parameter, consider the moment-generating function of the Gumbel distribution (β=1/p\beta=1/p in the canonical parameterization):

\mathop{{}\mathbb{E}}[e^{Xt}]=\Gamma(1-t/2)\exp(t)\leq\exp(t^{2}/2)\quad\text{for}\quad t<1/2,

hence, the sub-exponentiality parameter is 1, and we can use the above sub-exponential confidence sets with value b=1. For the likelihood ratio code, we used L=\exp(B), as this is the leading term of the Hessian of the log-likelihood. The function is not smooth everywhere, but on a bounded domain, this turns out to be an appropriate scaling.

E.5 Poisson Bandits

In this case, we implemented a bandit game, where we used the parametrization rθ(x)Poisson(exp(θΦ(x)))r_{\theta}(x)\sim\text{Poisson}(\exp(-\theta^{\top}\Phi(x))), where Φ(x)\Phi(x) is the RKHS evaluation operator, and θ\theta is the unknown value.

We used L=exp(B)L=\exp(B), as this is the leading term of the Hessian of log-likelihood in this parametrization. The function is not smooth everywhere but on a bounded domain this is an appropriate scaling.

E.6 Additional Benchmark Functions

We focus on an additional baseline function: Camelback in 2D, a standard BO benchmark function. The results can be seen in Figure 4.

Figure 4: Camelback function; 10 repeats with median and standard quantiles plotted. Among the methods with provable coverage, ours performs best.

Appendix F Proof of Proposition 3

Define StS_{t} and the shorthand 𝐕t\mathbf{{V}}_{t}

St=s=1tηsxsν2and𝐕t:=𝐕tν2;λ=s=1txsxsν2+λ𝐈.S_{t}=\sum_{s=1}^{t}\eta_{s}\frac{x_{s}}{\nu^{2}}\quad\text{and}\quad\mathbf{{V}}_{t}:=\mathbf{{V}}^{\nu^{-2};\lambda}_{t}=\sum_{s=1}^{t}\frac{x_{s}x_{s}^{\top}}{\nu^{2}}+\lambda\mathbf{{I}}.

and the parametrized process

t(x)=exp(x,St12||x||𝐕t2).\mathcal{M}_{t}(x)=\exp(\langle x,\,S_{t}\rangle-\frac{1}{2}\lvert\lvert x\rvert\rvert_{\mathbf{{V}}_{t}}^{2}).
Lemma 13.

If ηt\eta_{t} is (ν,γ)(\nu,\gamma)-conditionally sub-Exponential, then t(x)\mathcal{M}_{t}(x) is a super-martingale on the ball {xd|||x||2ν2γ}\{x\in\mathbb{R}^{d}\,|\,\lvert\lvert x\rvert\rvert_{2}\leq\frac{\nu^{2}}{\gamma}\} with 0(x)1.\mathcal{M}_{0}(x)\leq 1.

Note that for \gamma\rightarrow 0 (the sub-Gaussian case), this recovers Lemma 20.2 in Lattimore and Szepesvári, (2020).

Proof.

It is easy to observe that for any xx, we have

exp(S0x12||x||𝐕02)=exp(12||x||λ𝐈2)1.\exp\left(S_{0}^{\top}x-\frac{1}{2}\lvert\lvert x\rvert\rvert_{\mathbf{{V}}_{0}}^{2}\right)=\exp\left(-\frac{1}{2}\lvert\lvert x\rvert\rvert_{\lambda\mathbf{{I}}}^{2}\right)\leq 1.

For the first part, we can write

𝔼[t(x)|t1]\displaystyle\mathop{{}\mathbb{E}}[\mathcal{M}_{t}(x)|\mathcal{F}_{t-1}] =𝔼[exp(x,St12||x||𝐕t2)|t1]\displaystyle=\mathop{{}\mathbb{E}}[\exp(\langle x,\,S_{t}\rangle-\frac{1}{2}\lvert\lvert x\rvert\rvert_{\mathbf{{V}}_{t}}^{2})\,|\,\mathcal{F}_{t-1}]
=𝔼[exp(x,St112||x||𝐕t12)exp(1ν2x,ηtxt12ν2||x||xtxt2)|t1]\displaystyle=\mathop{{}\mathbb{E}}[\exp(\langle x,\,S_{t-1}\rangle-\frac{1}{2}\lvert\lvert x\rvert\rvert_{\mathbf{{V}}_{t-1}}^{2})\exp(\frac{1}{\nu^{2}}\langle x,\,\eta_{t}x_{t}\rangle-\frac{1}{2\nu^{2}}\lvert\lvert x\rvert\rvert^{2}_{x_{t}x_{t}^{\top}})\,|\,\mathcal{F}_{t-1}]
=t1(x)𝔼[exp(1ν2ηtx,xt)|t1]exp(12ν2||x||xtxt2),\displaystyle=\mathcal{M}_{t-1}(x)\mathop{{}\mathbb{E}}[\exp(\frac{1}{\nu^{2}}\eta_{t}\langle x,\,x_{t}\rangle)\,|\,\mathcal{F}_{t-1}]\exp(-\frac{1}{2\nu^{2}}\lvert\lvert x\rvert\rvert_{x_{t}x_{t}^{\top}}^{2}),

where in the last step we use that x_{t} is \mathcal{F}_{t-1}-measurable. Now, as long as \frac{1}{\nu^{2}}\lvert\langle x,\,x_{t}\rangle\rvert\leq\frac{1}{\gamma}, we can apply our definition of conditional sub-Exponential noise to bound

𝔼[exp(ηt1ν2x,xt)]exp((x,xt)2ν22ν4)exp(||x||xtxt22ν2).\mathop{{}\mathbb{E}}[\exp(\eta_{t}\frac{1}{\nu^{2}}\langle x,\,x_{t}\rangle)]\leq\exp\left(\frac{(\langle x,\,x_{t}\rangle)^{2}\nu^{2}}{2\nu^{4}}\right)\leq\exp\left(\frac{\lvert\lvert x\rvert\rvert_{x_{t}x_{t}^{\top}}^{2}}{2\nu^{2}}\right).

From this we directly conclude

𝔼[t(x)|t1]t1(x).\mathop{{}\mathbb{E}}[\mathcal{M}_{t}(x)|\mathcal{F}_{t-1}]\leq\mathcal{M}_{t-1}(x).

By Cauchy-Schwarz, a sufficient condition is ||x||2ν2γ\lvert\lvert x\rvert\rvert_{2}\leq\frac{\nu^{2}}{\gamma} as this implies (with our assumptions on the actions)

\lvert\langle x,\,x_{t}\rangle\rvert\leq\lvert\lvert x\rvert\rvert_{2}\lvert\lvert x_{t}\rvert\rvert_{2}\leq\lvert\lvert x\rvert\rvert_{2}\leq\frac{\nu^{2}}{\gamma}. ∎
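To illustrate Lemma 13 numerically (our own sanity check, not part of the proof), one can simulate ℳT(x) for i.i.d. centered Laplace noise, for which (ν,γ)=(2,2) are valid sub-Exponential proxies on |s| ≤ 1/γ (an assumption we checked against the Laplace moment-generating function), and confirm that its empirical mean stays below one.

```python
import numpy as np

rng = np.random.default_rng(4)
d, T, n_paths = 2, 200, 5000
nu, gamma, lam = 2.0, 2.0, 1.0            # assumed proxies for Laplace(1) noise
x = np.array([0.3, 0.1])                  # ||x||_2 <= nu^2 / gamma

log_M = np.zeros(n_paths)                 # <x, S_t> part, per noise path
quad = lam * x @ x                        # ||x||^2_{V_t}, shared across paths
for _ in range(T):
    x_t = rng.normal(size=d)
    x_t /= np.linalg.norm(x_t)            # actions on the unit sphere
    eta = rng.laplace(scale=1.0, size=n_paths)
    log_M += eta * (x @ x_t) / nu**2      # increment of <x, S_t>
    quad += (x @ x_t) ** 2 / nu**2        # increment of ||x||^2_{V_t}
print("mean of M_T(x):", np.exp(log_M - 0.5 * quad).mean())  # should be <= 1
```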

In the following, we will use this result to prove any-time confidence estimates for the parameter \theta using the technique of pseudo-maximization, closely following Mutný and Krause, (2021).

Recall that t(x)\mathcal{M}_{t}(x) is defined on the ball of radius ν2γL\frac{\nu^{2}}{\gamma L}. This allows us some freedom in choosing the radius of the ball on which we integrate. In particular, let KK be this radius. While we need Kν2γLK\leq\frac{\nu^{2}}{\gamma L}, we can make KK larger by choosing larger ν2\nu^{2} (increasing ν2\nu^{2} only makes the set of noise distributions larger).

Ultimately, we wish to bound (following Lattimore and Szepesvári, (2020))

||θ^tθ||𝐕t\displaystyle\lvert\lvert{\hat{{\theta}}}_{t}-\theta^{*}\rvert\rvert_{\mathbf{{V}}_{t}} =||𝐕t1X1:t(X1:tθ+η1:t)θ||𝐕t\displaystyle=\lvert\lvert\mathbf{{V}}_{t}^{-1}X_{1:t}(X_{1:t}{\theta_{\star}}+\eta_{1:t})-{\theta_{\star}}\rvert\rvert_{\mathbf{{V}}_{t}}
=||𝐕t1X1:tX1:tθθ+𝐕t1St||𝐕t\displaystyle=\lvert\lvert\mathbf{{V}}_{t}^{-1}X_{1:t}X_{1:t}{\theta_{\star}}-{\theta_{\star}}+\mathbf{{V}}_{t}^{-1}S_{t}\rvert\rvert_{\mathbf{{V}}_{t}}
||St||𝐕t1+λ||θ||2.\displaystyle\leq\lvert\lvert S_{t}\rvert\rvert_{\mathbf{{V}}_{t}^{-1}}+\sqrt{\lambda}\lvert\lvert{\theta_{\star}}\rvert\rvert_{2}.

We cannot control the second term, so we focus on the first: the self-normalized residuals. Via Fenchel duality, one can motivate that the right object to study is the supremum of the martingale \mathcal{M}_{t}(x) over all x\in\mathbb{R}^{d}, but that supremum is ill-defined in our case. Define \tilde{M}_{t} to be the martingale \mathcal{M}_{t} from above but with \lambda=0, i.e. no regularisation term. Similarly, let \tilde{\mathbf{{V}}}_{t}=\mathbf{{V}}_{t}-\lambda\mathbf{{I}} be the design matrix without the regularisation term. Slightly counterintuitively, we will study

M¯t=||x||2KM~t(x)dh(x),\bar{M}_{t}=\int_{\lvert\lvert x\rvert\rvert_{2}\leq K}\tilde{M}_{t}(x)\mathrm{{d}}h(x),

where hh is the probability density function of a truncated normal distribution with inverse variance λ\lambda, that is with covariance matrix 1λ𝐈\frac{1}{\lambda}\mathbf{{I}}. By Lemma 20.3 in Lattimore and Szepesvári, (2020), M¯t\bar{M}_{t} is also a super-martingale with M¯01\bar{M}_{0}\leq 1. Then we have

M¯t\displaystyle\bar{M}_{t} =1N(h)||x||2Kexp(xSt12||x||𝐕~t2)exp(12xλ𝐈x)dx\displaystyle=\frac{1}{N(h)}\int_{\lvert\lvert x\rvert\rvert_{2}\leq K}\exp\left(x^{\top}S_{t}-\frac{1}{2}\lvert\lvert x\rvert\rvert_{\tilde{\mathbf{{V}}}_{t}}^{2}\right)\exp\left(-\frac{1}{2}x^{\top}\lambda\mathbf{{I}}x\right)\mathrm{{d}}x
=1N(h)||x||2Kexp(xSt12||x||𝐕t2)dx.\displaystyle=\frac{1}{N(h)}\int_{\lvert\lvert x\rvert\rvert_{2}\leq K}\exp\left(x^{\top}S_{t}-\frac{1}{2}\lvert\lvert x\rvert\rvert_{\mathbf{{V}}_{t}}^{2}\right)\mathrm{{d}}x.

We define the shorthand f_{t}(x)=x^{\top}S_{t}-\frac{1}{2}x^{\top}\mathbf{{V}}_{t}x=f_{t}(x^{*})+\nabla f_{t}(x^{*})^{\top}(x-x^{*})-\frac{1}{2}(x-x^{*})^{\top}\mathbf{{V}}_{t}(x-x^{*}) (by Taylor's theorem), where x^{*}=\arg\max_{\lvert\lvert x\rvert\rvert\leq kK}f_{t}(x) and k\in(0,1) will be chosen later. We can lower bound \bar{M}_{t} by

M¯t\displaystyle\bar{M}_{t} =1N(h)||x||2Kexp(xSt12||x||𝐕t2)dx\displaystyle=\frac{1}{N(h)}\int_{\lvert\lvert x\rvert\rvert_{2}\leq K}\exp\left(x^{\top}S_{t}-\frac{1}{2}\lvert\lvert x\rvert\rvert_{\mathbf{{V}}_{t}}^{2}\right)\mathrm{{d}}x
=exp(ft(x))N(h)||x||2Kexp(ft(x)(xx)12(xx)𝐕t(xx))dx\displaystyle=\frac{\exp(f_{t}(x^{*}))}{N(h)}\int_{\lvert\lvert x\rvert\rvert_{2}\leq K}\exp\left(\nabla f_{t}(x^{*})^{\top}(x-x^{*})-\frac{1}{2}(x-x^{*})^{\top}\mathbf{{V}}_{t}(x-x^{*})\right)\mathrm{{d}}x
=exp(ft(x))N(h)||y+x||2Kexp(ft(x)y12y𝐕ty)dy\displaystyle=\frac{\exp(f_{t}(x^{*}))}{N(h)}\int_{\lvert\lvert y+x^{*}\rvert\rvert_{2}\leq K}\exp\left(\nabla f_{t}(x^{*})^{\top}y-\frac{1}{2}y^{\top}\mathbf{{V}}_{t}y\right)\mathrm{{d}}y (34)
exp(ft(x))N(h)||y||2(1k)Kexp(ft(x)y12y𝐕ty)dy\displaystyle\geq\frac{\exp(f_{t}(x^{*}))}{N(h)}\int_{\lvert\lvert y\rvert\rvert_{2}\leq(1-k)K}\exp\left(\nabla f_{t}(x^{*})^{\top}y-\frac{1}{2}y^{\top}\mathbf{{V}}_{t}y\right)\mathrm{{d}}y (35)
=exp(ft(x))N(h)||y||2(1k)Kexp(ft(x)y)exp(12y𝐕ty)dy\displaystyle=\frac{\exp(f_{t}(x^{*}))}{N(h)}\int_{\lvert\lvert y\rvert\rvert_{2}\leq(1-k)K}\exp\left(\nabla f_{t}(x^{*})^{\top}y\right)\exp\left(-\frac{1}{2}y^{\top}\mathbf{{V}}_{t}y\right)\mathrm{{d}}y
=exp(ft(x))N(g)N(h)𝔼yg[exp(ft(x)y)]\displaystyle=\frac{\exp(f_{t}(x^{*}))N(g)}{N(h)}\mathop{{}\mathbb{E}}_{y\sim g}\left[\exp\left(\nabla f_{t}(x^{*})^{\top}y\right)\right]
exp(ft(x))N(g)N(h)exp(𝔼yg[ft(x)y])\displaystyle\geq\frac{\exp(f_{t}(x^{*}))N(g)}{N(h)}\exp\left(\mathop{{}\mathbb{E}}_{y\sim g}[\nabla f_{t}(x^{*})^{\top}y]\right) (36)
=exp(ft(x))N(g)N(h).\displaystyle=\frac{\exp(f_{t}(x^{*}))N(g)}{N(h)}.

where in step (34) we used the change of variables x=y+x^{*}. In (35) we use that if \lvert\lvert y\rvert\rvert_{2}\leq(1-k)K, then \lvert\lvert x^{*}+y\rvert\rvert_{2}\leq\lvert\lvert x^{*}\rvert\rvert_{2}+\lvert\lvert y\rvert\rvert_{2}\leq kK+(1-k)K=K. In (36), we used Jensen's inequality, and the last equality follows by symmetry of g, whose mean is zero. Note that we implicitly defined g to be a truncated normal distribution with covariance matrix \mathbf{{V}}_{t}^{-1} on the ball of radius (1-k)K.

This puts us in a position to put Ville’s inequality to good use:

δ\displaystyle\delta (t:log(Mt¯)log(1/δ))\displaystyle\geq\mathbb{P}\left(\exists t\,:\,\log(\bar{M_{t}})\geq\log(1/\delta)\right)
(t:ft(x)+log(N(g)N(h))log(1/δ))\displaystyle\geq\mathbb{P}\left(\exists t\,:\,f_{t}(x^{*})+\log\left(\frac{N(g)}{N(h)}\right)\geq\log(1/\delta)\right)
(t:ft(x)log(N(h)N(g)δ)).\displaystyle\geq\mathbb{P}\left(\exists t\,:\,f_{t}(x^{*})\geq\log\left(\frac{N(h)}{N(g)\delta}\right)\right).

We now wish to recover \lvert\lvert S_{t}\rvert\rvert_{\mathbf{{V}}_{t}^{-1}}. Recall the definition of f_{t}(x^{*}) as the maximum over all x in a ball of radius kK. Consequently, we can choose x=\frac{\mathbf{{V}}_{t}^{-1}S_{t}}{\lvert\lvert S_{t}\rvert\rvert_{\mathbf{{V}}_{t}^{-1}}}\sqrt{\lambda}kK, which has norm bounded by kK. We have

f_{t}(x^{*})\geq f_{t}\left(\frac{\mathbf{{V}}_{t}^{-1}S_{t}}{\lvert\lvert S_{t}\rvert\rvert_{\mathbf{{V}}_{t}^{-1}}}\sqrt{\lambda}kK\right)=\lvert\lvert S_{t}\rvert\rvert_{\mathbf{{V}}_{t}^{-1}}\sqrt{\lambda}kK-\frac{\lambda k^{2}K^{2}}{2},

which immediately yields

(||St||𝐕t1λkK+1λkKlog(N(h)N(g)δ))δ.\displaystyle\mathbb{P}\left(\lvert\lvert S_{t}\rvert\rvert_{\mathbf{{V}}_{t}^{-1}}\geq\sqrt{\lambda}kK+\frac{1}{\sqrt{\lambda}kK}\log\left(\frac{N(h)}{N(g)\delta}\right)\right)\leq\delta.

The only thing that remains is bounding log(N(h)N(g))\log\left(\frac{N(h)}{N(g)}\right).

We give the following lemma, a slightly generalized version of a result in Mutný and Krause, (2021), originally inspired by Faury et al., (2020).

Lemma 14.

The normalizing constants satisfy

log(N(h)N(g))dlog(11k)+log((det(𝐕t))1/2det(λ𝐈)).\log\left(\frac{N(h)}{N(g)}\right)\leq d\log\left(\frac{1}{1-k}\right)+\log\left(\frac{(\det(\mathbf{{V}}_{t}))^{1/2}}{\det(\sqrt{\lambda}\mathbf{{I}})}\right).

We can use the bound from Lemma 14 to conclude that

(||St||𝐕t1λkK+dλkKlog(11k)+1λkKlog((det(𝐕t))1/2δdet(λ𝐈)))δ.\displaystyle\mathbb{P}\left(\lvert\lvert S_{t}\rvert\rvert_{\mathbf{{V}}_{t}^{-1}}\geq\sqrt{\lambda}kK+\frac{d}{\sqrt{\lambda}kK}\log\left(\frac{1}{1-k}\right)+\frac{1}{\sqrt{\lambda}kK}\log\left(\frac{(\det(\mathbf{{V}}_{t}))^{1/2}}{\delta\det(\sqrt{\lambda}\mathbf{{I}})}\right)\right)\leq\delta.

We stated earlier that

||θ^tθ||𝐕t||St||𝐕t1+λ||θ||2.\lvert\lvert{\hat{{\theta}}}_{t}-\theta^{*}\rvert\rvert_{\mathbf{{V}}_{t}}\leq\lvert\lvert S_{t}\rvert\rvert_{\mathbf{{V}}_{t}^{-1}}+\sqrt{\lambda}\lvert\lvert{\theta_{\star}}\rvert\rvert_{2}.

Combining this with our analysis, we obtain Proposition 3.

We may now choose the parameters k, K, and \lambda. Note that to get sub-Gaussian rates as in Abbasi-Yadkori et al., (2011), one needs to pick a regularization parameter of the order \lambda=d\log(T).

Proof of Lemma 14

We give a proof of the Lemma for completeness, and because the additional generality makes for a slightly different proof, even though the bound stays the same.

Proof.

We have

N(h)\displaystyle N(h) =||x||2Kexp(λ||x||22)dx\displaystyle=\int_{\lvert\lvert x\rvert\rvert_{2}\leq K}\exp(-\lambda\lvert\lvert x\rvert\rvert_{2}^{2})\mathrm{d}x
=1|det(2λ𝐈)|||x||2Kexp(12||2λx||22)|det(2λ𝐈)|dx\displaystyle=\frac{1}{\lvert\det(\sqrt{2\lambda}\mathbf{{I}})\rvert}\int_{\lvert\lvert x\rvert\rvert_{2}\leq K}\exp\left(-\frac{1}{2}\lvert\lvert\sqrt{2\lambda}x\rvert\rvert_{2}^{2}\right)\lvert\det(\sqrt{2\lambda}\mathbf{{I}})\rvert\mathrm{d}x
=1|det(2λ𝐈)|||x||22λKexp(12||x||22)dx.\displaystyle=\frac{1}{\lvert\det(\sqrt{2\lambda}\mathbf{{I}})\rvert}\int_{\lvert\lvert x\rvert\rvert_{2}\leq\sqrt{2\lambda}K}\exp\left(-\frac{1}{2}\lvert\lvert x\rvert\rvert_{2}^{2}\right)\mathrm{d}x.

Further we have

N(g)\displaystyle N(g) =||x||2(1k)Kexp(12x𝐕tx)dx\displaystyle=\int_{\lvert\lvert x\rvert\rvert_{2}\leq(1-k)K}\exp(-\frac{1}{2}x^{\top}\mathbf{{V}}_{t}x)\mathrm{d}x
=1|det(𝐕t1/2)|||x||2(1k)Kexp(12||𝐕t1/2x||22)|det(𝐕t1/2)|dx\displaystyle=\frac{1}{\lvert\det(\mathbf{{V}}_{t}^{1/2})\rvert}\int_{\lvert\lvert x\rvert\rvert_{2}\leq(1-k)K}\exp(-\frac{1}{2}\lvert\lvert\mathbf{{V}}_{t}^{1/2}x\rvert\rvert_{2}^{2})\lvert\det(\mathbf{{V}}_{t}^{1/2})\rvert\mathrm{d}x
=1|det(𝐕t1/2)|Sexp(12||x||22)dx,\displaystyle=\frac{1}{\lvert\det(\mathbf{{V}}_{t}^{1/2})\rvert}\int_{S}\exp(-\frac{1}{2}\lvert\lvert x\rvert\rvert_{2}^{2})\mathrm{d}x,

where S=\{\mathbf{{V}}_{t}^{1/2}x\,|\,\lvert\lvert x\rvert\rvert\leq(1-k)K\}=\{x\,|\,\lvert\lvert\mathbf{{V}}_{t}^{-1/2}x\rvert\rvert\leq(1-k)K\}=\{x\,|\,x^{\top}\mathbf{{V}}_{t}^{-1}x\leq(1-k)^{2}K^{2}\}. Note that \mathbf{{V}}_{t}\succeq\lambda\mathbf{{I}} and so \mathbf{{V}}_{t}^{-1}\preceq\frac{1}{\lambda}\mathbf{{I}}. Therefore, if \lvert\lvert x\rvert\rvert_{2}\leq(1-k)K\sqrt{\lambda}, we have

x𝐕t1x1λ||x||2(1k)KxS.\sqrt{x^{\top}\mathbf{{V}}_{t}^{-1}x}\leq\frac{1}{\sqrt{\lambda}}\lvert\lvert x\rvert\rvert_{2}\leq(1-k)K\implies x\in S.

Thus {x|||x||2(1k)λK}S\{x\,|\,\lvert\lvert x\rvert\rvert_{2}\leq(1-k)\sqrt{\lambda}K\}\subseteq S and

N(g)1|det(𝐕t1/2)|||x||2(1k)λKexp(12||x||22)dx.N(g)\geq\frac{1}{\lvert\det(\mathbf{{V}}_{t}^{1/2})\rvert}\int_{\lvert\lvert x\rvert\rvert_{2}\leq(1-k)\sqrt{\lambda}K}\exp(-\frac{1}{2}\lvert\lvert x\rvert\rvert_{2}^{2})\mathrm{d}x.

We may therefore bound

\frac{N(h)}{N(g)}\leq\frac{(\det\mathbf{{V}}_{t})^{1/2}}{(\det\sqrt{2\lambda}\mathbf{{I}})}\frac{\int_{\lvert\lvert x\rvert\rvert_{2}\leq\sqrt{2\lambda}K}\exp\left(-\frac{1}{2}\lvert\lvert x\rvert\rvert_{2}^{2}\right)\mathrm{d}x}{\int_{\lvert\lvert x\rvert\rvert_{2}\leq(1-k)\sqrt{\lambda}K}\exp(-\frac{1}{2}\lvert\lvert x\rvert\rvert_{2}^{2})\mathrm{d}x}.

By a rather crude bound (as 1k21-k\leq\sqrt{2} in any case) we get

\displaystyle\frac{\int_{\lvert\lvert x\rvert\rvert_{2}\leq\sqrt{2\lambda}K}\exp\left(-\frac{1}{2}\lvert\lvert x\rvert\rvert_{2}^{2}\right)\mathrm{d}x}{\int_{\lvert\lvert x\rvert\rvert_{2}\leq(1-k)\sqrt{\lambda}K}\exp(-\frac{1}{2}\lvert\lvert x\rvert\rvert_{2}^{2})\mathrm{d}x}
\displaystyle\leq\,\, ||x||2(1k)λKexp(12||x||22)dx+(1k)λK||x||22λKexp(12||x||22)dx||x||2(1k)λKexp(12||x||22)dx\displaystyle\frac{\int_{\lvert\lvert x\rvert\rvert_{2}\leq(1-k)\sqrt{\lambda}K}\exp\left(-\frac{1}{2}\lvert\lvert x\rvert\rvert_{2}^{2}\right)\mathrm{d}x+\int_{(1-k)\sqrt{\lambda}K\leq\lvert\lvert x\rvert\rvert_{2}\leq\sqrt{2\lambda}K}\exp\left(-\frac{1}{2}\lvert\lvert x\rvert\rvert_{2}^{2}\right)\mathrm{d}x}{\int_{\lvert\lvert x\rvert\rvert_{2}\leq(1-k)\sqrt{\lambda}K}\exp(-\frac{1}{2}\lvert\lvert x\rvert\rvert_{2}^{2})\mathrm{d}x}
=\displaystyle=\,\, 1+(1k)λK||x||22λKexp(12||x||22)dx||x||2(1k)λKexp(12||x||22)dx\displaystyle 1+\frac{\int_{(1-k)\sqrt{\lambda}K\leq\lvert\lvert x\rvert\rvert_{2}\leq\sqrt{2\lambda}K}\exp\left(-\frac{1}{2}\lvert\lvert x\rvert\rvert_{2}^{2}\right)\mathrm{d}x}{\int_{\lvert\lvert x\rvert\rvert_{2}\leq(1-k)\sqrt{\lambda}K}\exp(-\frac{1}{2}\lvert\lvert x\rvert\rvert_{2}^{2})\mathrm{d}x}
\displaystyle\leq\,\, 1+exp(12(1k)2λK2)exp(12(1k)2λK2)(1k)λK||x||22λKdx||x||2(1k)λKdx\displaystyle 1+\frac{\exp\left(-\frac{1}{2}(1-k)^{2}\lambda K^{2}\right)}{\exp\left(-\frac{1}{2}(1-k)^{2}\lambda K^{2}\right)}\frac{\int_{(1-k)\sqrt{\lambda}K\leq\lvert\lvert x\rvert\rvert_{2}\leq\sqrt{2\lambda}K}\mathrm{d}x}{\int_{\lvert\lvert x\rvert\rvert_{2}\leq(1-k)\sqrt{\lambda}K}\mathrm{d}x}
=\displaystyle=\,\, 1+vold(2λK)vold((1k)λK)vold((1k)λK)\displaystyle 1+\frac{\mathrm{vol}_{d}(\sqrt{2\lambda}K)-\mathrm{vol}_{d}((1-k)\sqrt{\lambda}K)}{\mathrm{vol}_{d}((1-k)\sqrt{\lambda}K)}
=\displaystyle=\,\, vold(2λK)vold((1k)λK)\displaystyle\frac{\mathrm{vol}_{d}(\sqrt{2\lambda}K)}{\mathrm{vol}_{d}((1-k)\sqrt{\lambda}K)}
=\displaystyle=\,\, (1k)d2d.\displaystyle(1-k)^{-d}\sqrt{2}^{d}.

We can put this together to obtain

\frac{N(h)}{N(g)}\leq(1-k)^{-d}\sqrt{2}^{d}\frac{(\det(\mathbf{{V}}_{t}))^{1/2}}{\det(\sqrt{2\lambda}\mathbf{{I}})}=(1-k)^{-d}\frac{(\det(\mathbf{{V}}_{t}))^{1/2}}{\det(\sqrt{\lambda}\mathbf{{I}})}. ∎
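As a final numerical sanity check (ours, not part of the proof), one can estimate N(h) and N(g) by Monte Carlo integration over the respective balls, using the exp(−λ||x||²) convention of this proof and an arbitrary Vt ⪰ λI, and verify the bound of Lemma 14.

```python
import math
import numpy as np

rng = np.random.default_rng(5)
d, K, k, lam, n = 2, 5.0, 0.5, 1.0, 200_000
V = np.array([[4.0, 0.5], [0.5, 3.0]])        # some V_t with V_t >= lam * I

def ball_mc_integral(f, radius):
    """Monte Carlo integral of f over the d-dimensional ball of given radius."""
    u = rng.normal(size=(n, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    r = radius * rng.uniform(size=(n, 1)) ** (1.0 / d)   # uniform in the ball
    vol = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * radius**d
    return vol * f(u * r).mean()

N_h = ball_mc_integral(lambda x: np.exp(-lam * (x**2).sum(axis=1)), K)
N_g = ball_mc_integral(
    lambda x: np.exp(-0.5 * np.einsum("ni,ij,nj->n", x, V, x)), (1 - k) * K)
bound = (1 - k) ** (-d) * math.sqrt(np.linalg.det(V)) / lam ** (d / 2)
print(f"N(h)/N(g) = {N_h / N_g:.2f} <= {bound:.2f}")
```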