
An Optimistic Algorithm for Online Convex Optimization with Adversarial Constraints

Jordan Lekeufack
Department of Statistics, University of California, Berkeley

Michael I. Jordan
Department of Statistics / Department of Electrical Engineering and Computer Science, University of California, Berkeley
Abstract

We study Online Convex Optimization (OCO) with adversarial constraints, where an online algorithm must make repeated decisions to minimize both convex loss functions and cumulative constraint violations. We focus on a setting where the algorithm has access to predictions of the loss and constraint functions. Our results show that we can improve the current best bounds of O(\sqrt{T}) regret and \tilde{O}(\sqrt{T}) cumulative constraint violation to O(\sqrt{\mathcal{E}_{T}(f)}) and \tilde{O}(\sqrt{\mathcal{E}_{T}(g)}), respectively, where \mathcal{E}_{T}(f) and \mathcal{E}_{T}(g) represent the cumulative prediction errors of the loss and constraint functions. In the worst case, where \mathcal{E}_{T}(f)=O(T) and \mathcal{E}_{T}(g)=O(T) (assuming bounded loss and constraint functions), our rates match the prior O(\sqrt{T}) results. However, when the loss and constraint predictions are accurate, our approach yields significantly smaller regret and cumulative constraint violations. Notably, if the constraint function remains constant over time, we achieve \tilde{O}(1) cumulative constraint violation, aligning with prior results.

1 Introduction

We are interested in generalizations of Online Convex Optimization (OCO) to problems in which constraints are imposed—generalizations which we refer to as Constrained Online Convex Optimization (COCO). Recall the standard formulation of OCO (Orabona, 2019; Hazan, 2023): At each round t, a learner makes a decision x_{t}\in{\cal X}, receives a loss function f_{t} from the environment, and suffers the loss f_{t}(x_{t}). The goal of the learner is to minimize the cumulative loss \sum_{t=1}^{T}f_{t}(x_{t}). In COCO, the learner additionally must attempt to satisfy a (potentially adversarial) constraint of the form g_{t}(x_{t})\leq 0 at every time step. The learner observes g_{t} only after selecting x_{t}, and cannot always satisfy the constraint exactly, but can hope to have a small cumulative constraint violation. In the adversarial setting it is not viable to seek absolute minima of the cumulative loss, and the problem is generally formulated in terms of obtaining sublinear regret—the difference between the learner's cumulative loss and the cumulative loss of a fixed decision. Sublinear regret means that we perform as well as the best fixed action in hindsight. In the COCO problem we also aim to ensure small cumulative constraint violation.

COCO has been studied extensively over the past two decades, starting with Mahdavi et al. (2012), Jenatton et al. (2016), and Yu and Neely (2020), which assumed time-invariant constraints. There has also been work on time-varying constraints (Neely and Yu, 2017; Liu et al., 2022) under a Slater assumption: \exists\check{x}\in{\cal X},\xi>0,\forall t,\;g_{t}(\check{x})\leq-\xi. The guarantees presented by those authors are polynomial in \xi^{-1}, and \xi can unfortunately be made arbitrarily small by the adversary. More recent work has established bounds on regret that scale as O(\sqrt{T}), with constraint violations scaling as O(\sqrt{T}\log T) (Guo et al., 2022; Yi et al., 2023; Sinha and Vaze, 2024).

Recent work in OCO has considered settings in which the adversary is predictable—i.e., not entirely arbitrary—aiming to obtain improved regret bounds. For example, Chiang et al. (2012) and Rakhlin and Sridharan (2013b) studied the case in which the previous function f_{t-1} is used as a prediction at the current time step (i.e., \hat{f}_{t}=f_{t-1}). They showed that the regret improves from O(\sqrt{T}) to O(\sqrt{{\cal E}_{T}(f)}), where {\cal E}_{T}(f) is a measure of the cumulative prediction error. This optimistic framework has also been studied in the COCO setting by Qiu et al. (2023), who focused on time-invariant constraints. Those authors established a similar bound for regret with O(1) violations. This line of work was pursued further in Anderson et al. (2022), who, considering the case of adaptive constraints and perfect predictions, established an O(1) bound on regret with constraint violations of the order O(\sqrt{T}).

In the current paper we go beyond earlier work to consider the case of adversarial constraints. Specifically, our contributions are as follows:

  1. We present the first algorithm to solve COCO problems in which the constraints are adversarial but also predictable. Specifically, we present a meta-algorithm that, when built on an optimistic OCO algorithm, achieves O(\sqrt{{\cal E}_{T}(f)}) regret and O((\sqrt{{\cal E}_{T}(g)}+G)\log T) constraint violation.

  2. When applied to problems having fixed constraints, our meta-algorithm achieves O(\sqrt{{\cal E}_{T}(f)}) regret and O(\log T) constraint violation, matching previous work up to a \log T factor.

The rest of the paper is structured as follows: We discuss related work in Section 2, introduce the main assumptions and notation in Section 3, and review optimistic OCO in Section 4. In Section 5, we present our meta-algorithm for the COCO problem and establish theoretical guarantees.

Reference | Complexity per round | Constraints | Loss Function | Regret | Violation
Guo et al. (2022) | Conv-OPT | Fixed | Convex | O(\sqrt{T}) | O(1)
Guo et al. (2022) | Conv-OPT | Adversarial | Convex | O(\sqrt{T}) | O(T^{3/4})
Yi et al. (2023) | Conv-OPT | Adversarial | Convex | O(T^{\max(c,1-c)}) | O(T^{1-c/2})
Sinha and Vaze (2024) | Projection | Adversarial | Convex | O(\sqrt{T}) | O(\sqrt{T}\log T)
Qiu et al. (2023) | Projection | Fixed | Convex, Slater | O(\sqrt{V_{\star}(T)}) | O(1)
Ours | Projection | Fixed | Convex | O(\sqrt{{\cal E}_{T}(f)}) | O(\log T)
Ours | Projection | Adversarial | Convex | O(\sqrt{{\cal E}_{T}(f)}) | O(\sqrt{{\cal E}_{T}(g)}\log T)
Table 1: Comparison with existing work. {\cal E}_{T}(f) and {\cal E}_{T}(g) are measures of the prediction error, and V_{\star}=\sum_{t=2}^{T}\sup_{x}||\nabla f_{t}(x)-\nabla f_{t-1}(x)||_{\star}^{2}. Note that when the prediction is the previous loss, {\cal E}_{T}(f)=V_{\star}.

2 Related Work

Unconstrained OCO

The OCO problem was introduced by Zinkevich (2003), who established an O(\sqrt{T}) guarantee based on projected online gradient descent (OGD). Hazan (2023) and Orabona (2019) provide overviews of the burgeoning literature that has emerged since Zinkevich's seminal work, in particular focusing on online mirror descent (OMD) as a general way to solve OCO problems.

Optimistic OCO

Optimistic OCO is often formulated as a problem involving gradual variation—i.e., where \nabla f_{t} and \nabla f_{t-1} are close in some appropriate metric. Chiang et al. (2012) exploit this assumption in an optimistic version of OMD that incorporates a prediction based on the most recent gradient, and establish a regret bound of O(\sqrt{V_{\star}}), where V_{\star}=\sum_{t=2}^{T}\sup_{x\in{\cal X}}||\nabla f_{t}(x)-\nabla f_{t-1}(x)||_{\star}^{2}. Rakhlin and Sridharan (2013a, b) prove that when using a predictor M_{t} that is not necessarily the past gradient, one can obtain regret of the form O\left(\sqrt{\sum_{t=1}^{T}||\nabla f_{t}(x_{t})-M_{t}||_{\star}^{2}}\right).

Constrained OCO

Constrained OCO was first studied in the context of time-invariant constraints, i.e., where g_{t}:=g for all t. In this setup one can employ projection-free algorithms, avoiding the potentially costly projection onto the set {\cal X}=\{x\in{\cal X}_{0}:g(x)\leq 0\}, by allowing the learner to violate the constraints in a controlled way (Mahdavi et al., 2012; Jenatton et al., 2016; Yu and Neely, 2020). Qiu et al. (2023) studied the case with gradual variations, proving an O(\sqrt{V_{\star}}) regret guarantee and an O(1) constraint violation. The case of time-varying constraints is significantly harder as the constraints g_{t} are potentially adversarial. Most of the early work studying such constraints (Neely and Yu, 2017; Yi et al., 2023) accordingly incorporated an additional Slater condition: \exists\check{x}\in{\cal X},\nu>0,\forall t,\;g_{t}(\check{x})\leq-\nu. These papers establish regret guarantees that grow with \nu^{-1}, which unfortunately can be vacuous as \nu can be arbitrarily small. Guo et al. (2022) present an algorithm that does not require the Slater condition and yields improved bounds, but it requires solving a convex optimization problem at each time step. In more recent work, Sinha and Vaze (2024) presented a simple and efficient algorithm that solves the problem with just a projection and obtained state-of-the-art guarantees. See Table 1 for a comparison of our results with previous work.

3 Preliminaries

3.1 Problem setup and notation

Let \mathbb{R} denote the set of real numbers, and let \mathbb{R}^{d} denote the set of d-dimensional real vectors. Let {\cal X}_{0}\subseteq\mathbb{R}^{d} denote the set of possible actions of the learner, where x\in{\cal X}_{0} is a specific action, and let ||\cdot|| be a norm defined on {\cal X}_{0}. Let the dual norm be denoted by ||\theta||_{\star}:=\max_{x,||x||=1}\langle\theta,\ x\rangle.

Online learning is a problem formulation in which the learner plays the following game with Nature. At each step t:

  1. The learner plays action x_{t}\in{\cal X}_{0}.

  2. Nature reveals a loss function f_{t}:{\cal X}_{0}\to\mathbb{R} and a constraint function g_{t}:{\cal X}_{0}\to\mathbb{R}. (If we have multiple constraint functions {\mathbf{g}}_{t,k}, we set g_{t}:=\max_{k}{\mathbf{g}}_{t,k}.)

  3. The learner receives the loss f_{t}(x_{t}) and the constraint violation g_{t}(x_{t}).

In standard OCO, the loss function f_{t} is convex, and the goal of the learner is to minimize the regret with respect to an oracle action u, where:

\text{Regret}_{T}(u):=\sum_{t=1}^{T}f_{t}(x_{t})-f_{t}(u). (1)

In COCO, we generalize the OCO problem to additionally ask the learner to obtain a small cumulative constraint violation, which we denote as \text{CCV}_{T}:

\text{CCV}_{T}=\sum_{t=1}^{T}g_{t}^{+}(x_{t})\quad\text{where}\quad g_{t}^{+}(x):=\max\{0,g_{t}(x)\}. (2)

Overall, the goal of the learner is to achieve both sublinear regret and sublinear CCV. This is a challenging problem; indeed, Mannor et al. (2009) proved that no algorithm can achieve both sublinear regret and sublinear cumulative constraint violation with respect to an oracle in the set {\cal X}^{\prime}:=\{x\in{\cal X}_{0}:\sum_{t=1}^{T}g_{t}(x)\leq 0\}. However, it is possible to find algorithms that achieve sublinear regret for the smaller set {\cal X}:=\{x\in{\cal X}_{0}:g_{t}(x)\leq 0,\;\forall t\in[T]\}, and this latter problem is our focus.

Finally, we will assume that at the end of step t, the learner can make predictions \hat{f}_{t+1} and \hat{g}_{t+1}. More precisely, we will be interested in predictions of the gradients, and, for any function h, we will denote by \nabla\hat{h}_{t} the prediction of the gradient of h. We will abuse notation and denote by \hat{h} the function whose gradient is \nabla\hat{h}_{t}. Moreover, we define the following prediction errors:

\varepsilon_{t}(h):=||\nabla h_{t}(x_{t})-\nabla\hat{h}_{t}(x_{t})||_{\star}^{2},\qquad{\cal E}_{t}(h):=\sum_{\tau=1}^{t}\varepsilon_{\tau}(h), (3)

where (x_{t})_{t=1\dots T} is the sequence of actions taken by the learner.

Additionally, for a given \beta-strongly convex function R, we define the Bregman divergence between two points:

B^{R}(x;y):=R(x)-R(y)-\langle\nabla R(y),\ x-y\rangle. (4)

Two special cases are particularly important (a short code sketch follows the list):

  1. When R(x):=\frac{1}{2}||x||_{2}^{2}, the Bregman divergence is the squared Euclidean distance B^{R}(x;y)=\frac{1}{2}||x-y||_{2}^{2}, \beta=1 with respect to the L2 norm, and ||\cdot||=||\cdot||_{\star}=||\cdot||_{2}.

  2. When R(x):=\sum_{i=1}^{d}x_{i}\log x_{i} (the negative entropy), the Bregman divergence on the simplex is the KL divergence: B^{R}(x;y)=D_{\text{KL}}(x;y):=\sum_{i=1}^{d}x_{i}\log\frac{x_{i}}{y_{i}}, \beta=1 with respect to the L1 norm, and ||\cdot||=||\cdot||_{1}, ||\cdot||_{\star}=||\cdot||_{\infty}.
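The following is a minimal Python sketch of these two Bregman divergences, included only as an illustration of definition (4); the function names are ours, not the paper's.

```python
import numpy as np

def bregman_euclidean(x, y):
    """B^R(x;y) with R(x) = 0.5*||x||_2^2: the squared Euclidean distance, up to the factor 1/2."""
    return 0.5 * float(np.sum((x - y) ** 2))

def bregman_kl(x, y):
    """B^R(x;y) with R(x) = sum_i x_i log x_i (negative entropy), for x, y on the simplex:
    the KL divergence sum_i x_i log(x_i / y_i)."""
    return float(np.sum(x * np.log(x / y)))
```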

3.2 Assumptions

Throughout this paper we will use various combinations of the following assumptions.

Assumption 1 (Convex set, loss and constraints).

We make the following standard assumptions on the decision set, the losses, and the constraints:

  1. {\cal X}_{0} is closed, convex, and bounded with diameter D.

  2. \forall t, f_{t} is convex and differentiable.

  3. \forall t, g_{t} is convex and differentiable.

Assumption 2 (Bounded losses).

The loss functions f_{t} are bounded by F and the constraint functions g_{t} are bounded by G.

Assumption 3 (Feasibility).

The set {\cal X} is not empty.

Assumption 4 (Lipschitz gradients).

For any t, the gradients of the loss function f_{t} and of the constraint function g_{t} are L^{f}_{t}- and L^{g}_{t}-Lipschitz, respectively. That is, \forall x,y\in{\cal X}_{0}, we have:

||\nabla f_{t}(x)-\nabla f_{t}(y)||_{\star}\leq L^{f}_{t}||x-y||,
||\nabla g_{t}(x)-\nabla g_{t}(y)||_{\star}\leq L^{g}_{t}||x-y||.

We will abuse notation and denote L^{f}_{t}:=\max_{\tau\leq t}L^{f}_{\tau}, and similarly for L^{g}_{t}. Moreover, we will assume that there is a constant L^{f} such that L^{f}=\max_{t}L^{f}_{t}. We define L^{g} similarly.

Assumption 5 (Prediction Sequence Regularity).

The prediction sequences are smooth: for all t, \nabla\hat{f}_{t} is \hat{L}^{f}_{t}-Lipschitz and \nabla\hat{g}_{t} is \hat{L}^{g}_{t}-Lipschitz; that is, \forall x,y\in{\cal X}_{0},

||\nabla\hat{f}_{t}(x)-\nabla\hat{f}_{t}(y)||_{\star}\leq\hat{L}^{f}_{t}||x-y||,
||\nabla\hat{g}_{t}(x)-\nabla\hat{g}_{t}(y)||_{\star}\leq\hat{L}^{g}_{t}||x-y||.

We again abuse notation and let \hat{L}^{f}_{t}:=\max_{\tau\leq t}\hat{L}^{f}_{\tau}, and similarly for \hat{L}^{g}_{t}; we also write \hat{L}^{f}:=\max_{t}\hat{L}^{f}_{t} and \hat{L}^{g}:=\max_{t}\hat{L}^{g}_{t}.

Remark 6.

Note that these assumptions are all standard for different types of problems:

  • Assumptions 1, 4, and 5 are standard in OCO with predictive sequences (Chiang et al., 2012; D'Orazio and Huang, 2021; Qiu et al., 2023).

  • Assumptions 1, 2, and 3 are standard in COCO.

The only non-standard assumption is Assumption 5, but when past functions are used as predictions it is equivalent to Assumption 4.

4 Optimistic Algorithms for OCO

In this section, we introduce some of the foundational optimistic algorithms that have been used for OCO. Our meta-algorithm for solving COCO problems will build on these ingredients.

4.1 Optimistic OMD and Optimistic AdaGrad

Algorithm 1 Optimistic Online Mirror Descent (Rakhlin and Sridharan, 2013b)
1: Input: step-size sequence \eta_{t}>0, initial point x_{1}.
2: Initialize \eta_{1}.
3: for round t=1\dots T do
4:     Play action x_{t}, receive f_{t}. Compute \nabla f_{t}(x_{t}).
5:     Compute \eta_{t+1}.
6:     \tilde{x}_{t+1}:=\arg\min_{x\in{\cal X}_{0}}\langle\nabla f_{t}(x_{t}),\ x\rangle+\frac{1}{\eta_{t}}B^{R}(x;\tilde{x}_{t}). \triangleright Mirror step.
7:     Make prediction \nabla\hat{f}_{t+1}(\tilde{x}_{t+1}).
8:     x_{t+1}:=\arg\min_{x\in{\cal X}_{0}}\langle\nabla\hat{f}_{t+1}(\tilde{x}_{t+1}),\ x\rangle+\frac{1}{\eta_{t+1}}B^{R}(x;\tilde{x}_{t+1}). \triangleright Optimistic mirror step.
9: end for
Theorem 7 (Optimistic OMD).

Suppose Assumptions 1 and 5 hold with \hat{L}^{f}_{t}\leq\frac{\beta}{\eta_{t}}, and that \eta_{t+1}\leq\eta_{t} for all t. Then for any u\in{\cal X}_{0},

\text{Regret}_{T}(u)\leq\frac{2B}{\eta_{T+1}}+\sum_{t=1}^{T}\frac{\eta_{t+1}}{\beta}\varepsilon_{t}(f), (5)

where B\geq\max_{t\in[T],x\in{\cal X}_{0}}B^{R}(x;\tilde{x}_{t}).

If \eta_{t}:=\eta is constant, we have

\text{Regret}_{T}(u)\leq\frac{2B^{R}(u;x_{1})}{\eta}+\frac{\eta}{\beta}{\cal E}_{T}(f). (6)

For completeness we present a proof of this result in Appendix A, adapting the proof from Rakhlin and Sridharan (2013b) with the important change that the regret depends on \varepsilon_{t}(f) instead of ||\nabla\hat{f}_{t}(\tilde{x}_{t})-\nabla f_{t}(x_{t})||_{\star}^{2}. This is significant because \varepsilon_{t}(f) equals zero when the predictions are perfect, whereas ||\nabla\hat{f}_{t}(\tilde{x}_{t})-\nabla f_{t}(x_{t})||_{\star}^{2} might not be zero.

In order to obtain the optimal convergence rate with a fixed learning rate, we must set \eta=\sqrt{\frac{\beta B^{R}(u;x_{1})}{{\cal E}_{T}(f)}}, which requires knowing {\cal E}_{T}(f) in advance. We mitigate this issue via the use of adaptive learning rates, in the style of Rakhlin and Sridharan (2013b). Note also that D'Orazio and Huang (2021) present an algorithm for a more general class of methods called Lagrangian hedging, of which OMD is a special case. However, that algorithm requires knowledge of \max_{t}\varepsilon_{t}(f), which our approach does not.

Theorem 8 (Adapted from Rakhlin and Sridharan (2013b), Corollary 2).

Suppose Assumptions 1, 3, and 5 hold with \hat{L}^{f}_{t}\leq\sqrt{\beta}L^{f}_{t}. If \eta_{t} is set as

\eta_{t}=\min\left\{\frac{\sqrt{\beta B}}{\sqrt{{\cal E}_{t-1}(f)}+\sqrt{{\cal E}_{t-2}(f)}},\frac{\sqrt{\beta}}{L^{f}_{t}}\right\}, (7)

then Algorithm 1 has regret

\text{Regret}_{T}(u)\leq 5\sqrt{\frac{B}{\beta}}\left(\sqrt{{\cal E}_{T}(f)}+\sqrt{B}L^{f}_{T}\right)=O\left(\sqrt{{\cal E}_{T}(f)}\vee L^{f}_{T}\right), (8)

where B\geq\max_{t\in[T],x\in{\cal X}_{0}}B^{R}(x;\tilde{x}_{t}).

Rakhlin and Sridharan (2013b) use a slightly different learning rate—instead of \sqrt{\beta}/L^{f}_{t}, they use 1. We modify the learning rate to depend on L^{f}_{t} because this allows the scale of the function to vary over time, which will be useful in the generalization to COCO. If the functions f_{t} are linear, we can use any L^{f}_{t}>0.

The proof is in Appendix B.
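To make the adaptive rule (7) concrete, here is a minimal Python sketch of the step-size computation; the names pred_errors, beta, B, and L_f_t are our own placeholders rather than the paper's notation.

```python
import math

def eta_t(pred_errors, t, beta, B, L_f_t):
    """Step size from Eq. (7): eta_t = min{ sqrt(beta*B) / (sqrt(E_{t-1}(f)) + sqrt(E_{t-2}(f))),
    sqrt(beta) / L^f_t }, where E_s(f) is the sum of the first s squared prediction errors."""
    E = lambda s: sum(pred_errors[:max(s, 0)])   # cumulative error, with E_s(f) = 0 for s <= 0
    denom = math.sqrt(E(t - 1)) + math.sqrt(E(t - 2))
    first = math.inf if denom == 0.0 else math.sqrt(beta * B) / denom
    return min(first, math.sqrt(beta) / L_f_t)
```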

Remark 9.

Note that the algorithm is computationally efficient. Indeed, a mirror step x^{\star}=\arg\min_{x\in{\cal X}_{0}}\langle\ell,\ x\rangle+\frac{1}{\eta}B^{R}(x;z) can be computed in two steps:

  1. Compute y such that \nabla R(y)=\nabla R(z)-\eta\ell. In particular, if \nabla R is invertible, y=(\nabla R)^{-1}(\nabla R(z)-\eta\ell).

  2. Let x^{\star}=\Pi_{{\cal X}_{0},R}(y):=\arg\min_{x\in{\cal X}_{0}}B^{R}(x;y).

Moreover, we have two special cases of interest (illustrated in the sketch below):

  1. When ||\cdot||=||\cdot||_{\star}=||\cdot||_{2} and R(x)=\frac{1}{2}||x||_{2}^{2}, this is simply projected gradient descent: x^{\star}=\Pi_{{\cal X}_{0}}\left(z-\eta\ell\right).

  2. When {\cal X}_{0}=\Delta_{d}, the d-dimensional simplex, with R the negative entropy, x^{\star}_{i}=\frac{z_{i}}{Z}\exp(-\eta\ell_{i}), where Z is a normalizing constant ensuring ||x^{\star}||_{1}=1.
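As an illustration of these two cases, the following is a minimal Python sketch of the mirror step; the function names are ours, and the Euclidean case assumes access to a projection oracle onto {\cal X}_{0}.

```python
import numpy as np

def euclidean_mirror_step(z, grad, eta, project):
    """Mirror step with R(x) = 0.5*||x||_2^2: a projected gradient step from z."""
    return project(z - eta * grad)

def simplex_mirror_step(z, grad, eta):
    """Mirror step on the simplex with R = negative entropy: a multiplicative-weights update."""
    w = z * np.exp(-eta * grad)
    return w / w.sum()   # normalization Z keeps the iterate on the simplex
```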

4.2 Optimistic OMD for learning with experts

In this setting, the agent has access to d experts and must form a distribution for selecting among them. She observes the loss of each expert and suffers an overall loss that is the expected loss over the experts. Formally, we assume {\cal X}_{0}=\Delta_{d}, where d is the number of experts. At each step t, the learner selects x_{t}\in\Delta_{d}, a distribution over experts, and observes \ell_{t}\in\mathbb{R}^{d}, the vector of losses across the experts. The learner then suffers the loss f_{t}(x_{t})=\langle\ell_{t},\ x_{t}\rangle. We will let \hat{\ell}_{t} denote the prediction of \ell_{t}.

We could use Algorithm 1 with ||\cdot||=||\cdot||_{2}, but in the worst case B can be as large as O(d), resulting in a regret scaling as O(\sqrt{d}). We can instead achieve a scaling of O(\log(d)). Let ||\cdot||=||\cdot||_{1} and ||\cdot||_{\star}=||\cdot||_{\infty}. In that case, the Bregman divergence is the KL divergence and \beta=1. However, the KL divergence is not upper bounded, as any x_{t,i} can be close to zero. We circumvent this problem in Algorithm 2 by introducing the mixture y_{t}=(1-\delta)\tilde{x}_{t}+\frac{\delta}{d}\bm{1}; a sketch of one round follows the pseudocode. This algorithm can be found in Rakhlin and Sridharan (2013b) in the context of a two-player zero-sum game.

Algorithm 2 Optimistic Online Mirror Descent for Experts (Rakhlin and Sridharan, 2013b)
1: Input: x_{1}, \delta.
2: Initialize \eta_{1}.
3: for round t=1\dots T do
4:     Play action x_{t}, receive \ell_{t}.
5:     Compute \eta_{t+1}.
6:     \tilde{x}_{t+1}:=\arg\min_{x\in{\cal X}_{0}}\langle\ell_{t},\ x\rangle+\frac{1}{\eta_{t}}D_{\text{KL}}(x;y_{t}). \triangleright Mirror step.
7:     Construct mixture y_{t+1}=(1-\delta)\tilde{x}_{t+1}+\frac{\delta}{d}\bm{1}.
8:     Make prediction \hat{\ell}_{t+1}.
9:     x_{t+1}:=\arg\min_{x\in{\cal X}_{0}}\langle\hat{\ell}_{t+1},\ x\rangle+\frac{1}{\eta_{t+1}}D_{\text{KL}}(x;y_{t+1}). \triangleright Optimistic mirror step.
10: end for
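The following is a minimal Python sketch of one round of Algorithm 2 under our own naming conventions (ell_t for the observed losses, ell_hat_next for the prediction); it is an illustration rather than a verbatim transcription of the pseudocode.

```python
import numpy as np

def expert_round(x_tilde_t, ell_t, ell_hat_next, eta_t, eta_next, delta):
    """One round of optimistic OMD over the simplex with the delta-mixture of Algorithm 2."""
    d = x_tilde_t.shape[0]
    y_t = (1 - delta) * x_tilde_t + delta / d            # mixture from the previous round
    x_tilde_next = y_t * np.exp(-eta_t * ell_t)          # mirror step (entropic update from y_t)
    x_tilde_next /= x_tilde_next.sum()
    y_next = (1 - delta) * x_tilde_next + delta / d      # new mixture
    x_next = y_next * np.exp(-eta_next * ell_hat_next)   # optimistic mirror step with the prediction
    x_next /= x_next.sum()
    return x_tilde_next, x_next                          # x_next is the action played next round
```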
Theorem 10 (Optimistic OMD with experts).

Suppose Assumption 1 holds and that, for every t, \nabla\hat{f}_{t} is a constant function. If \eta_{t+1}\leq\eta_{t} for all t and \delta=1/T, then

\text{Regret}_{T}(u)\leq\frac{\log(d^{2}T)}{\eta_{T+1}}+\sum_{t=1}^{T}\eta_{t+1}\varepsilon_{t}(f)+\frac{1}{2}\left(\eta_{T+1}^{-1}-\eta_{1}^{-1}\right). (9)

Moreover, by using \eta_{t} defined as

\eta_{t}=\sqrt{\log(d^{2}Te)}\min\left\{\frac{1}{\sqrt{{\cal E}_{t-1}(f)}+\sqrt{{\cal E}_{t-2}(f)}},1\right\}, (10)

the regret guarantee becomes:

\text{Regret}_{T}(u)\leq 2\sqrt{\log(d^{2}Te)}\left(\sqrt{{\cal E}_{T}(f)}+1\right)=O\left(\sqrt{{\cal E}_{T}(f)\log T}\right). (11)

5 Meta-Algorithm for Optimistic COCO

Algorithm 3 Optimistic learning with hard constraints
1: Input: x_{1}\in{\cal X}_{0}, \lambda>0, Q_{0}=0, OCO algorithm {\cal A}.
2: for round t=1\dots T do
3:     Play action x_{t}, receive f_{t} and g_{t}.
4:     Compute {\cal L}_{t} defined in (12).
5:     Update Q_{t+1}=Q_{t}+\lambda g^{+}_{t}(x_{t}). \triangleright Dual update.
6:     Compute prediction \hat{\cal L}_{t+1} as in (13).
7:     Update x_{t+1}:={\cal A}_{t}(x_{t},{\cal L}_{1},\dots,{\cal L}_{t},\hat{\cal L}_{t+1}).
8: end for

Our meta-algorithm is inspired by Sinha and Vaze (2024). The main idea of that paper is to build a surrogate loss function {\cal L}_{t} that can be seen as a Lagrangian of the optimization problem

\min_{x\in{\cal X}_{0},g_{t}(x)\leq 0}f_{t}(x).

The learner then runs AdaGrad (Duchi et al., 2011) on the surrogate, with a theoretical guarantee of bounded cumulative constraint violation (CCV) and regret.

Our meta-algorithm is based on the use of optimistic methods, such as those presented in Section 4, which allows us to obtain stronger bounds that depend on the prediction quality. Presented in Algorithm 3, this algorithm assumes that at the end of every step t, the learner makes predictions \hat{f}_{t+1} and \hat{g}_{t+1} of the upcoming loss f_{t+1} and constraint g_{t+1}. (We are actually only interested in predictions of the gradients, but for simplicity we let \hat{h} denote any function whose gradient is the predicted gradient \nabla\hat{h}_{t}.) At each time step t, the learner forms a surrogate loss function, defined via a convex, non-decreasing potential (Lyapunov) function \Phi:\mathbb{R}_{+}\to\mathbb{R}_{+} with \Phi(0)=0. Specifically:

{\cal L}_{t}(x)=Vf_{t}(x)+\lambda\Phi^{\prime}(Q_{t})g^{+}_{t}(x). (12)

Using the predictions \hat{f} and \hat{g}, we form a prediction \hat{\cal L}_{t+1} of the Lagrangian, where \hat{\cal L}_{t} is defined in Equation 13:

\hat{\cal L}_{t}(x)=V\hat{f}_{t}(x)+\lambda\Phi^{\prime}(Q_{t})\hat{g}^{+}_{t}(x). (13)

In Sinha and Vaze (2024), the update is Q_{t}=Q_{t-1}+\lambda g^{+}_{t}(x_{t}), but using \hat{\cal L}_{t+1} at time t requires Q_{t+1} to be known at the end of step t. We instead define the following update:

Q_{t+1}=Q_{t}+\lambda g^{+}_{t}(x_{t}),\quad\text{with }Q_{0}=Q_{1}=0. (14)

The learner then executes step t of algorithm {\cal A}, denoted {\cal A}_{t} in Algorithm 3. We then have the following lemma, which relates the regret on f, the CCV, and the regret of {\cal A} on {\cal L}.
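Before stating the lemma, here is a minimal Python sketch of the loop (12)–(14), assuming \Phi(Q)=\exp(Q)-1 and a user-supplied optimistic OCO step; the interface names (oracle, prediction, oco_step) are ours and only illustrate the bookkeeping, not the paper's implementation.

```python
import numpy as np

def run_meta_algorithm(x1, T, lam, V, oracle, prediction, oco_step):
    """Sketch of Algorithm 3 with Phi(Q) = exp(Q) - 1, so Phi'(Q) = exp(Q).
    oracle(t, x) -> (grad_f_t(x), g_t(x), grad_g_t(x)) reveals round-t information;
    prediction(t, x) returns the same quantities for the predicted f_hat_t, g_hat_t."""
    x, Q = np.asarray(x1, dtype=float), 0.0
    for t in range(1, T + 1):
        gf, gval, gg = oracle(t, x)
        # gradient of the surrogate (12); the gradient of g_t^+ is grad g_t where g_t > 0, else 0
        grad_L = V * gf + lam * np.exp(Q) * (gg if gval > 0 else 0.0)
        Q += lam * max(0.0, gval)                       # dual update (14), giving Q_{t+1}
        def grad_L_hat(y, t=t, Q=Q):                    # predicted surrogate gradient (13) for round t+1
            hf, hval, hg = prediction(t + 1, y)
            return V * hf + lam * np.exp(Q) * (hg if hval > 0 else 0.0)
        x = oco_step(x, grad_L, grad_L_hat)             # step t of the base optimistic OCO algorithm
    return x
```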

Lemma 11 (Regret decomposition).

For any algorithm {\cal A}, if \Phi is a Lyapunov potential function, then for any t\geq 1 and any u\in{\cal X},

\Phi(Q_{t+1})+V\,\text{Regret}_{t}(u)\leq\text{Regret}_{t}^{\cal A}(u;\;{\cal L}_{1\dots t})+S_{t}, (15)

where S_{t}=\lambda\sum_{\tau=1}^{t}g^{+}_{\tau}(x_{\tau})(\Phi^{\prime}(Q_{\tau+1})-\Phi^{\prime}(Q_{\tau})), and \text{Regret}_{t}^{\cal A}(u;\;{\cal L}_{1\dots t}) is the regret of algorithm {\cal A} running on {\cal L}.

Proof  By convexity of \Phi:

\Phi(Q_{t+1})\leq\Phi(Q_{t})+\Phi^{\prime}(Q_{t+1})(Q_{t+1}-Q_{t})=\Phi(Q_{t})+\Phi^{\prime}(Q_{t+1})\lambda g^{+}_{t}(x_{t}).

Let u\in{\cal X}; then by definition g^{+}_{t}(u)=0 for all t\geq 1, thus

\Phi(Q_{\tau+1})-\Phi(Q_{\tau})+V(f_{\tau}(x_{\tau})-f_{\tau}(u))
\leq\Phi^{\prime}(Q_{\tau+1})\lambda g^{+}_{\tau}(x_{\tau})+V(f_{\tau}(x_{\tau})-f_{\tau}(u))
\leq Vf_{\tau}(x_{\tau})+\lambda\Phi^{\prime}(Q_{\tau})g^{+}_{\tau}(x_{\tau})-\big(Vf_{\tau}(u)+\lambda\Phi^{\prime}(Q_{\tau})g^{+}_{\tau}(u)\big)+\lambda g^{+}_{\tau}(x_{\tau})(\Phi^{\prime}(Q_{\tau+1})-\Phi^{\prime}(Q_{\tau}))
\leq{\cal L}_{\tau}(x_{\tau})-{\cal L}_{\tau}(u)+\lambda g^{+}_{\tau}(x_{\tau})(\Phi^{\prime}(Q_{\tau+1})-\Phi^{\prime}(Q_{\tau})).

Summing \tau from 1 to t:

\Phi(Q_{t+1})-\Phi(Q_{1})+V\,\text{Regret}_{t}(u)\leq\text{Regret}_{t}^{\cal A}(u;\;{\cal L}_{1\dots t})+S_{t},

where

S_{t}=\lambda\sum_{\tau=1}^{t}g^{+}_{\tau}(x_{\tau})(\Phi^{\prime}(Q_{\tau+1})-\Phi^{\prime}(Q_{\tau})).

 

Below we will assume that the underlying optimistic OCO algorithm has standard regret guarantees, which we express in terms of a functional \psi that takes as input a sequence of functions and returns a constant. For simplicity, we will denote \psi(f_{1\dots t}):=\psi_{t}(f). One example is \psi_{t}(f)=L^{f}_{t}; another is \psi_{t}(f)=F_{t}=\max_{\tau\leq t}||f_{\tau}||_{\star}.

Assumption 12 (Regret of optimistic OCO).

The optimistic OCO algorithm {\cal A} has the following regret guarantee: there is a constant C\in\mathbb{R} and a sublinear functional \psi such that for any sequence of functions (f_{t})_{t=1\dots T} and any u\in{\cal X}_{0},

\text{Regret}_{t}^{\cal A}(u;f_{1\dots t})\leq C\left(\sqrt{{\cal E}_{t}(f)}\vee\psi_{t}(f)\right). (16)
Theorem 13 (Optimistic OCO regret and CCV guarantees).

Consider the following assumptions:

  1. {\cal L}_{t} and \hat{\cal L}_{t} satisfy the assumptions of algorithm {\cal A} for all t.

  2. Assumptions 1, 2, and 3 hold.

  3. {\cal A} satisfies Assumption 12 for some constant C.

  4. \Phi(Q):=\exp(Q)-1.

  5. \lambda=\left(2C\left(\sqrt{2{\cal E}_{T}(g)}+\psi_{T}(g)\right)+2G\right)^{-1}.

Under these assumptions, Algorithm 3 has the following regret and CCV guarantees: \forall T\geq 1,\forall u\in{\cal X},\forall t\in[T],

\text{Regret}_{t}(u)\leq C\left(\sqrt{2{\cal E}_{t}(f)}+\psi_{t}(f)\right)+1=O\left(\sqrt{{\cal E}_{t}(f)}\vee\psi_{t}(f)\right), (17)

\text{CCV}_{T}\leq 2\left(C\left(\sqrt{2{\cal E}_{T}(g)}+\psi_{T}(g)\right)+G\right)\log\left(2\left(C\left(\sqrt{2{\cal E}_{T}(f)}+\psi_{T}(f)\right)+2FVT+2\right)\right)=O\left(\left(\sqrt{{\cal E}_{T}(g)}\vee\psi_{T}(g)\vee G\right)\log T\right). (18)

Proof  Note that {\cal L}_{t} is convex, and using (13), we obtain the following bound on the instantaneous prediction error:

\varepsilon_{t}({\cal L}):=||\nabla{\cal L}_{t}(x_{t})-\nabla\hat{\cal L}_{t}(x_{t})||_{\star}^{2}\leq 2V^{2}\varepsilon_{t}(f)+2\lambda^{2}\Phi^{\prime}(Q_{t})^{2}\varepsilon_{t}(g),\qquad\text{using }||a+b||_{\star}^{2}\leq 2||a||_{\star}^{2}+2||b||_{\star}^{2}.

Thus,

\sqrt{{\cal E}_{t}({\cal L})}\leq\sqrt{\sum_{\tau=1}^{t}2V^{2}\varepsilon_{\tau}(f)+\sum_{\tau=1}^{t}2\lambda^{2}\Phi^{\prime}(Q_{\tau})^{2}\varepsilon_{\tau}(g)}\leq V\sqrt{2{\cal E}_{t}(f)}+\sqrt{\sum_{\tau=1}^{t}2\lambda^{2}\Phi^{\prime}(Q_{\tau})^{2}\varepsilon_{\tau}(g)}\leq V\sqrt{2{\cal E}_{t}(f)}+\lambda\Phi^{\prime}(Q_{t+1})\sqrt{2{\cal E}_{t}(g)}, (19)

where the second inequality uses \sqrt{a+b}\leq\sqrt{a}+\sqrt{b}, and the last uses the fact that \Phi^{\prime}(Q_{\tau})\leq\Phi^{\prime}(Q_{t+1}) for \tau\leq t.

By sub-linearity of \psi_{t}:

\psi_{t}({\cal L})\leq V\psi_{t}(f)+\lambda\Phi^{\prime}(Q_{t})\psi_{t}(g)\leq V\psi_{t}(f)+\lambda\Phi^{\prime}(Q_{t+1})\psi_{t}(g).

Finally, using Assumption 12, we have

\text{Regret}_{t}^{\cal A}(u;\;{\cal L}_{1\dots t})\leq CV\left(\sqrt{2{\cal E}_{t}(f)}+\psi_{t}(f)\right)+C\lambda\Phi^{\prime}(Q_{t+1})\left(\sqrt{2{\cal E}_{t}(g)}+\psi_{t}(g)\right),

using the fact that Q_{t} is non-decreasing and \Phi^{\prime} is a non-decreasing function. With these two facts, and knowing that g^{+}_{t} is non-negative and upper bounded by G, we can also upper bound S_{t}:

S_{t}\leq G\lambda(\Phi^{\prime}(Q_{t+1})-\Phi^{\prime}(Q_{1})).

We can now upper bound the regret: for any u\in{\cal X},

\exp(Q_{t+1})-\exp(Q_{1})+V\,\text{Regret}_{t}(u)\leq C\lambda\exp(Q_{t+1})(\sqrt{2{\cal E}_{t}(g)}+\psi_{t}(g))+CV(\sqrt{2{\cal E}_{t}(f)}+\psi_{t}(f))+G\lambda(\exp(Q_{t+1})-\exp(Q_{1})).

Thus

\text{Regret}_{t}(u)\leq\left(\lambda C(\sqrt{2{\cal E}_{t}(g)}+\psi_{t}(g))+\lambda G-1\right)\frac{\exp(Q_{t+1})}{V}+\frac{(1-G\lambda)}{V}+C(\sqrt{2{\cal E}_{t}(f)}+\psi_{t}(f)). (20)

Therefore, if \lambda\leq\frac{1}{C(\sqrt{2{\cal E}_{t}(g)}+\psi_{t}(g))+G},

\text{Regret}_{t}(u)\leq C\sqrt{2{\cal E}_{t}(f)}+\frac{\exp(\lambda_{1}G)}{V}\leq C\left(\sqrt{2{\cal E}_{t}(f)}+\psi_{t}(f)\right)+\frac{1}{V}.

Note that the final inequality comes from using \lambda\leq\frac{1}{G}. Also note that \text{Regret}_{t}(u)\geq-2Ft, thus:

\exp(Q_{t+1})\left(1-\lambda\left(C(\sqrt{2{\cal E}_{t}(g)}+\psi_{t}(g))+G\right)\right)\leq CV(\sqrt{2{\cal E}_{t}(f)}+\psi_{t}(f))+2FVt+1. (21)

If \lambda<\frac{1}{C(\sqrt{2{\cal E}_{t}(g)}+\psi_{t}(g))+G}, then

Q_{t+1}\leq\log\frac{CV(\sqrt{2{\cal E}_{t}(f)}+\psi_{t}(f))+2FVt+1}{1-\lambda\left(C(\sqrt{2{\cal E}_{t}(g)}+\psi_{t}(g))+G\right)},

and

\text{CCV}_{t}\leq\frac{Q_{t+1}}{\lambda}\leq\frac{1}{\lambda}\log\frac{CV(\sqrt{2{\cal E}_{t}(f)}+\psi_{t}(f))+2FVt+1}{1-\lambda\left(C(\sqrt{2{\cal E}_{t}(g)}+\psi_{t}(g))+G\right)}.

With \lambda=\frac{1}{2C(\sqrt{2{\cal E}_{T}(g)}+\psi_{T}(g))+2G} and V=1, we have

\text{CCV}_{T}\leq\left(2C\left(\sqrt{2{\cal E}_{T}(g)}+\psi_{T}(g)\right)+2G\right)\log\left(2\left(C(\sqrt{2{\cal E}_{T}(f)}+\psi_{T}(f))+2FT+1\right)\right)=O\left((\sqrt{{\cal E}_{T}(g)}+\psi_{T}(g)+G)\log T\right).

 

Remark 14.

For unknown T, we can use the doubling trick, and use the previous prediction errors to estimate {\cal E}_{T}(g).
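As a reminder of how the doubling trick works, here is a generic Python sketch with our own interface names (make_algorithm, play_round); it restarts the learner on epochs of doubling length and is not specific to Algorithm 3.

```python
def run_with_doubling(make_algorithm, play_round):
    """Doubling trick for an unknown horizon: restart on epochs of length 1, 2, 4, ...
    make_algorithm(T_guess) returns a fresh learner tuned for horizon T_guess;
    play_round(alg, t) plays round t and returns False once the stream ends."""
    t, epoch_len = 1, 1
    while True:
        alg = make_algorithm(epoch_len)
        for _ in range(epoch_len):
            if not play_round(alg, t):
                return
            t += 1
        epoch_len *= 2
```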

Finally, we have the following corollaries, which are direct consequences of applying Algorithm 3, depending on the setting and the properties of the algorithm {\cal A}.

Corollary 15 (Optimistic Adagrad COCO).

Consider the following assumptions:

  1. {\cal A} is the optimistic AdaGrad algorithm described in Algorithm 1 with L^{{\cal L}}_{t}=VL^{f}_{t}+\Phi^{\prime}(Q_{t})L^{g}_{t}.

  2. Assumption 5 holds with \hat{L}^{f}_{t}\leq\sqrt{\beta}L^{f}_{t} and \hat{L}^{g}_{t}\leq\sqrt{\beta}L^{g}_{t}.

  3. \lambda and \Phi are set as in Theorem 13.

Under these assumptions, the meta-algorithm (Algorithm 3) has the following regret and constraint violation guarantees:

\text{Regret}_{T}(u)\leq O\left(\sqrt{{\cal E}_{T}(f)}\vee L^{f}\right),\qquad\text{CCV}_{T}\leq O\left(\left(\sqrt{{\cal E}_{T}(g)}\vee G\vee L^{g}\right)\log T\right). (22)

Proof  This is a direct consequence of Theorem 13. We only need to show that \hat{\cal L} satisfies the assumptions of {\cal A}:

  1. There is some \hat{L}^{{\cal L}}_{t} such that \nabla\hat{\cal L}_{t} is \hat{L}^{{\cal L}}_{t}-Lipschitz.

  2. \nabla{\cal L}_{t} is L^{{\cal L}}_{t}-Lipschitz.

  3. \hat{L}^{{\cal L}}_{t}\leq\sqrt{\beta}L^{{\cal L}}_{t}.

The first two statements follow directly from the definitions of \hat{\cal L}_{t} and {\cal L}_{t} and the sublinearity of the norm; in particular, \nabla{\cal L}_{t} is L^{{\cal L}}_{t}-Lipschitz for the same reason. Finally, the smoothness assumption on the predictions (\hat{L}^{f}_{t}\leq\sqrt{\beta}L^{f}_{t} and \hat{L}^{g}_{t}\leq\sqrt{\beta}L^{g}_{t}) directly implies \hat{L}^{{\cal L}}_{t}\leq\sqrt{\beta}L^{{\cal L}}_{t}.

Corollary 16 (Fixed constraint bounds).

If g_{t}:=g is known for all t, then by using \nabla\hat{g}_{t}=\nabla g in Algorithm 3, we have

\text{Regret}_{t}(u)\leq O\left(\sqrt{{\cal E}_{t}(f)}\right),\quad\forall t\in[T],\qquad\text{CCV}_{T}\leq O(\log T). (23)

Proof  This is a direct consequence of Theorem 13 with {\cal E}_{T}(g)=0.

Remark 17.

By applying the algorithm without predictions, i.e., using \nabla\hat{f}_{t}=0, we obtain \text{Regret}_{T}(u)\leq O(\sqrt{T}) and O(\log T) constraint violation, improving on the O(\sqrt{T\log T}) constraint violation obtained when applying Sinha and Vaze (2024)'s algorithm.

Corollary 18 (Application to the experts setting).

When using Algorithm 2 in the meta-algorithm, we have

\text{Regret}_{T}(u)\leq O\left(\sqrt{\log(d^{2}Te){\cal E}_{T}(f)}\right),\qquad\text{CCV}_{T}\leq O\left(\left(\sqrt{{\cal E}_{T}(g)}\vee G\vee L^{g}\right)\sqrt{\log(d^{2}Te)}\log T\right). (24)

References

  • Anderson et al. (2022) Daron Anderson, George Iosifidis, and Douglas J Leith. Lazy Lagrangians with predictions for online learning. arXiv preprint arXiv:2201.02890, 2022.
  • Chiang et al. (2012) Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In Conference on Learning Theory, pages 6–1. JMLR Workshop and Conference Proceedings, 2012.
  • D’Orazio and Huang (2021) Ryan D’Orazio and Ruitong Huang. Optimistic and adaptive Lagrangian hedging. arXiv preprint arXiv:2101.09603, 2021.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
  • Guo et al. (2022) Hengquan Guo, Xin Liu, Honghao Wei, and Lei Ying. Online convex optimization with hard constraints: towards the best of two worlds and beyond. Advances in Neural Information Processing Systems, 35:36426–36439, 2022.
  • Hazan (2023) Elad Hazan. Introduction to online convex optimization, 2023. URL https://arxiv.org/abs/1909.05207.
  • Jenatton et al. (2016) Rodolphe Jenatton, Jim Huang, and Cedric Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 402–411, New York, New York, USA, 20–22 Jun 2016. PMLR. URL https://proceedings.mlr.press/v48/jenatton16.html.
  • Liu et al. (2022) Qingsong Liu, Wenfei Wu, Longbo Huang, and Zhixuan Fang. Simultaneously achieving sublinear regret and constraint violations for online convex optimization with time-varying constraints. ACM SIGMETRICS Performance Evaluation Review, 49(3):4–5, 2022.
  • Mahdavi et al. (2012) Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. Trading regret for efficiency: online convex optimization with long term constraints. The Journal of Machine Learning Research, 13(1):2503–2528, 2012.
  • Mannor et al. (2009) Shie Mannor, John N Tsitsiklis, and Jia Yuan Yu. Online learning with sample path constraints. Journal of Machine Learning Research, 10(3), 2009.
  • Neely and Yu (2017) Michael J. Neely and Hao Yu. Online convex optimization with time-varying constraints, 2017. URL https://arxiv.org/abs/1702.04783.
  • Orabona (2019) Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.
  • Qiu et al. (2023) Shuang Qiu, Xiaohan Wei, and Mladen Kolar. Gradient-variation bound for online convex optimization with constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9534–9542, 2023.
  • Rakhlin and Sridharan (2013a) Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Conference on Learning Theory, pages 993–1019. PMLR, 2013a.
  • Rakhlin and Sridharan (2013b) Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences, 2013b. URL https://arxiv.org/abs/1311.1869.
  • Sinha and Vaze (2024) Abhishek Sinha and Rahul Vaze. Optimal algorithms for online convex optimization with adversarial constraints, 2024. URL https://arxiv.org/abs/2310.18955.
  • Wei et al. (2020) Xiaohan Wei, Hao Yu, and Michael J Neely. Online primal-dual mirror descent under stochastic constraints. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 4(2):1–36, 2020.
  • Yi et al. (2023) Xinlei Yi, Xiuxian Li, Tao Yang, Lihua Xie, Yiguang Hong, Tianyou Chai, and Karl H Johansson. Distributed online convex optimization with adversarial constraints: reduced cumulative constraint violation bounds under Slater’s condition. arXiv preprint arXiv:2306.00149, 2023.
  • Yu and Neely (2020) Hao Yu and Michael J Neely. A low complexity algorithm with O(\sqrt{T}) regret and O(1) constraint violations for online convex optimization with long term constraints. Journal of Machine Learning Research, 21(1):1–24, 2020.
  • Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (icml-03), pages 928–936, 2003.

Appendix A Proof of Theorem 7

The theorem is a direct consequence of the following lemma.

Lemma 19.

One step of optimistic online mirror descent satisfies:

\eta_{t}\langle\ell_{t},\ x_{t}-u\rangle\leq B^{R}(u;\tilde{x}_{t})-B^{R}(u;\tilde{x}_{t+1})+\eta_{t}||\ell_{t}-\hat{\ell}_{t}||_{\star}\cdot||x_{t}-\tilde{x}_{t+1}||-(B^{R}(\tilde{x}_{t+1};x_{t})+B^{R}(x_{t};\tilde{x}_{t})). (25)

Moreover, if \nabla\hat{f}_{t} is \hat{L}^{f}_{t}-Lipschitz with \hat{L}^{f}_{t}\leq\frac{\beta}{\eta_{t}},

\langle\ell_{t},\ x_{t}-u\rangle\leq\frac{B^{R}(u;\tilde{x}_{t})-B^{R}(u;\tilde{x}_{t+1})}{\eta_{t}}+B^{R}(x_{t};\tilde{x}_{t+1})(\eta_{t+1}^{-1}-\eta_{t}^{-1})+\frac{\eta_{t+1}}{\beta}\varepsilon_{t}(f). (26)

We will need the following proposition to prove the lemma.

Proposition 20 (Chiang et al. (2012), proposition 18).

For any x_{0}\in{\cal X},\ell\in\mathbb{R}^{d}, if x^{\star}:=\arg\min_{x\in{\cal X}}\langle\ell,\ x\rangle+\frac{1}{\eta}B^{R}(x;x_{0}), then \forall u\in{\cal X},

\eta\langle\ell,\ x^{\star}-u\rangle=B^{R}(u;x_{0})-B^{R}(u;x^{\star})-B^{R}(x^{\star};x_{0}). (27)

Proof [of Lemma 19] Let u\in{\cal X}. Then

\eta_{t}\langle\ell_{t},\ x_{t}-u\rangle=\langle\eta_{t}\ell_{t},\ \tilde{x}_{t+1}-u\rangle+\eta_{t}\langle\ell_{t}-\hat{\ell}_{t},\ x_{t}-\tilde{x}_{t+1}\rangle+\langle\eta_{t}\hat{\ell}_{t},\ x_{t}-\tilde{x}_{t+1}\rangle.

On one hand, using Proposition 20, the first and last terms can be bounded, respectively, as:

\langle\eta_{t}\ell_{t},\ \tilde{x}_{t+1}-u\rangle=B^{R}(u;\tilde{x}_{t})-B^{R}(u;\tilde{x}_{t+1})-B^{R}(\tilde{x}_{t+1};\tilde{x}_{t}),
\langle\eta_{t}\hat{\ell}_{t},\ x_{t}-\tilde{x}_{t+1}\rangle=B^{R}(\tilde{x}_{t+1};\tilde{x}_{t})-B^{R}(\tilde{x}_{t+1};x_{t})-B^{R}(x_{t};\tilde{x}_{t}).

Therefore

\langle\eta_{t}\ell_{t},\ \tilde{x}_{t+1}-u\rangle+\langle\eta_{t}\hat{\ell}_{t},\ x_{t}-\tilde{x}_{t+1}\rangle=B^{R}(u;\tilde{x}_{t})-B^{R}(u;\tilde{x}_{t+1})-(B^{R}(\tilde{x}_{t+1};x_{t})+B^{R}(x_{t};\tilde{x}_{t})).

On the other hand,

\langle\ell_{t}-\hat{\ell}_{t},\ x_{t}-\tilde{x}_{t+1}\rangle\leq||x_{t}-\tilde{x}_{t+1}||\cdot||\ell_{t}-\hat{\ell}_{t}||_{\star}.

By combining the last two inequalities, we obtain (25).
To prove (26), first note that, using the fact that \forall a,b,\rho>0,\ ab\leq\frac{1}{2\rho}a^{2}+\frac{\rho}{2}b^{2},

||x_{t}-\tilde{x}_{t+1}||\cdot||\ell_{t}-\hat{\ell}_{t}||_{\star}\leq\frac{\eta_{t+1}}{2\beta}||\ell_{t}-\hat{\ell}_{t}||_{\star}^{2}+\frac{\beta}{2\eta_{t+1}}||x_{t}-\tilde{x}_{t+1}||^{2}\leq\frac{\eta_{t+1}}{2\beta}||\ell_{t}-\hat{\ell}_{t}||_{\star}^{2}+\frac{1}{\eta_{t+1}}B^{R}(x_{t};\tilde{x}_{t+1}).

For the second part of the statement, if \nabla\hat{f}_{t} is \hat{L}^{f}_{t}-Lipschitz:

||\ell_{t}-\hat{\ell}_{t}||_{\star}^{2}=||\nabla f_{t}(x_{t})-\nabla\hat{f}_{t}(\tilde{x}_{t})||_{\star}^{2}\leq 2||\nabla f_{t}(x_{t})-\nabla\hat{f}_{t}(x_{t})||_{\star}^{2}+2||\nabla\hat{f}_{t}(x_{t})-\nabla\hat{f}_{t}(\tilde{x}_{t})||_{\star}^{2}\leq 2\varepsilon_{t}(f)+2\left(\hat{L}^{f}_{t}\right)^{2}||x_{t}-\tilde{x}_{t}||^{2}\leq 2\varepsilon_{t}(f)+\frac{2\left(\hat{L}^{f}_{t}\right)^{2}}{\beta}B^{R}(x_{t};\tilde{x}_{t}).

By inserting this in (25) and dividing both sides by \eta_{t}:

\langle\ell_{t},\ x_{t}-u\rangle\leq\frac{B^{R}(u;\tilde{x}_{t})-B^{R}(u;\tilde{x}_{t+1})}{\eta_{t}}+\frac{\eta_{t+1}}{\beta}\varepsilon_{t}(f)+\frac{\left(\hat{L}^{f}_{t}\right)^{2}\eta_{t+1}}{\beta^{2}}B^{R}(x_{t};\tilde{x}_{t})-\frac{1}{\eta_{t}}(B^{R}(\tilde{x}_{t+1};x_{t})+B^{R}(x_{t};\tilde{x}_{t}))
\leq\frac{B^{R}(u;\tilde{x}_{t})-B^{R}(u;\tilde{x}_{t+1})}{\eta_{t}}+\frac{\eta_{t+1}}{\beta}\varepsilon_{t}(f)+B^{R}(x_{t};\tilde{x}_{t})\left(\frac{\left(\hat{L}^{f}_{t}\right)^{2}\eta_{t+1}}{\beta^{2}}-\frac{1}{\eta_{t}}\right)+B^{R}(x_{t};\tilde{x}_{t+1})(\eta_{t+1}^{-1}-\eta_{t}^{-1}).

If \hat{L}^{f}_{t}\leq\beta/\eta_{t}, then \left(\hat{L}^{f}_{t}\right)^{2}\leq\frac{\beta^{2}}{\eta_{t}\eta_{t+1}} since \eta_{t} is non-increasing, so the third term can be upper bounded by zero.

Proof [of Theorem 7] From (26), we have for any t\geq 1

\langle\ell_{t},\ x_{t}-u\rangle\leq\frac{B^{R}(u;\tilde{x}_{t})-B^{R}(u;\tilde{x}_{t+1})}{\eta_{t}}+B^{R}(x_{t};\tilde{x}_{t+1})(\eta_{t+1}^{-1}-\eta_{t}^{-1})+\frac{\eta_{t+1}}{\beta}\varepsilon_{t}(f). (28)

Note that by convexity of f_{t}, f_{t}(x_{t})-f_{t}(u)\leq\langle\ell_{t},\ x_{t}-u\rangle. Therefore, by taking the sum from 1 to T, we have

\text{Regret}_{T}(u)\leq\sum_{t=1}^{T}\langle\ell_{t},\ x_{t}-u\rangle
\leq\sum_{t=1}^{T}\frac{B^{R}(u;\tilde{x}_{t})-B^{R}(u;\tilde{x}_{t+1})}{\eta_{t}}+\sum_{t=1}^{T}B^{R}(x_{t};\tilde{x}_{t+1})(\eta_{t+1}^{-1}-\eta_{t}^{-1})+\sum_{t=1}^{T}\frac{\eta_{t+1}}{\beta}\varepsilon_{t}(f)
\leq\frac{B^{R}(u;\tilde{x}_{1})}{\eta_{1}}+\sum_{t=1}^{T-1}\left(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right)B^{R}(u;\tilde{x}_{t+1})+\sum_{t=1}^{T}\left(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right)B^{R}(x_{t};\tilde{x}_{t+1})+\sum_{t=1}^{T}\frac{\eta_{t+1}}{\beta}\varepsilon_{t}(f)
\leq\frac{B}{\eta_{T}}+\frac{B}{\eta_{T+1}}+\sum_{t=1}^{T}\frac{\eta_{t+1}}{\beta}\varepsilon_{t}(f)
\leq\frac{2B}{\eta_{T+1}}+\sum_{t=1}^{T}\frac{\eta_{t+1}}{\beta}\varepsilon_{t}(f),

where B\geq\max_{t\in[T],x\in{\cal X}_{0}}B^{R}(x;\tilde{x}_{t}).

Appendix B Proof of Theorem 8

Proof  Note that \eta_{t}:=\sqrt{\beta B}\min\left\{\frac{1}{\sqrt{{\cal E}_{t-1}(f)}+\sqrt{{\cal E}_{t-2}(f)}},\frac{1}{L^{f}_{t}\sqrt{B}}\right\} is non-increasing. Moreover, we have \eta_{t}\leq\frac{\sqrt{\beta}}{L^{f}_{t}}. Therefore,

\hat{L}^{f}_{t}\leq\sqrt{\beta}L^{f}_{t}\Longrightarrow\hat{L}^{f}_{t}\leq\frac{\beta}{\eta_{t}}.

We can apply Theorem 7:

\text{Regret}_{T}(u)\leq\frac{2B}{\eta_{T+1}}+\sum_{t=1}^{T}\frac{\eta_{t+1}}{\beta}\varepsilon_{t}(f). (29)

Note that we can rewrite:

\eta_{t}=\sqrt{\beta B}\min\left\{\frac{\sqrt{{\cal E}_{t-1}(f)}-\sqrt{{\cal E}_{t-2}(f)}}{\varepsilon_{t-1}(f)},\frac{1}{L^{f}_{t}\sqrt{B}}\right\}. (30)

We can also show that

\eta_{t}^{-1}\leq(\sqrt{\beta B})^{-1}\max\left\{2\sqrt{{\cal E}_{t-1}(f)},L^{f}_{t}\sqrt{B}\right\}\leq 2(\sqrt{\beta B})^{-1}(\sqrt{{\cal E}_{t}(f)}+L^{f}_{t}\sqrt{B}). (31)

Using (30) and (31) in the regret upper bound (29):

\text{Regret}_{T}(u)\leq\frac{2B}{\eta_{T+1}}+\sum_{t=1}^{T}\frac{\eta_{t+1}}{\beta}\varepsilon_{t}(f)
\leq 4\sqrt{\frac{B}{\beta}}\left(\sqrt{{\cal E}_{t}(f)}+L^{f}_{t}\sqrt{B}\right)+\sum_{t=1}^{T}\left(\sqrt{{\cal E}_{t-1}(f)}-\sqrt{{\cal E}_{t-2}(f)}\right)
\leq 4\sqrt{\frac{B}{\beta}}\left(\sqrt{{\cal E}_{t}(f)}+L^{f}_{t}\sqrt{B}\right)+\sqrt{{\cal E}_{T-1}(f)}
\leq 5\sqrt{\frac{B}{\beta}}\left(\sqrt{{\cal E}_{t}(f)}+L^{f}_{t}\sqrt{B}\right).

 

Appendix C Proof of Theorem 10

The proof is similar to that of Theorem 7. We will need a lemma from Wei et al. (2020) that relates D_{\text{KL}}(u;\tilde{x}_{t}) and D_{\text{KL}}(u;y_{t}).

Lemma 21 (Wei et al. (2020), Lemma 31).

Let \tilde{x},u\in\Delta and y=(1-\delta)\tilde{x}+\frac{\delta}{d}\bm{1}; then

D_{\text{KL}}(u;y)\leq\delta\log d+D_{\text{KL}}(u;\tilde{x}),
D_{\text{KL}}(u;y)\leq\log(d/\delta).

We also need a lemma very similar to Lemma 19.

Lemma 22.

One step of optimistic online mirror descent satisfies:

\langle\ell_{t},\ x_{t}-u\rangle\leq\frac{D_{\text{KL}}(u;y_{t})-D_{\text{KL}}(u;y_{t+1})}{\eta_{t}}+\frac{1}{2}||x_{t}-\tilde{x}_{t+1}||_{1}^{2}(\eta_{t+1}^{-1}-\eta_{t}^{-1})+\frac{\eta_{t+1}}{2}\varepsilon_{t}(f)+\frac{\delta\log d}{\eta_{t}}. (32)

Proof  The proof is essentially the same as that of Lemma 19. First, we can derive an equation similar to (25), where we replace \tilde{x}_{t} with y_{t}:

\eta_{t}\langle\ell_{t},\ x_{t}-u\rangle\leq D_{\text{KL}}(u;y_{t})-D_{\text{KL}}(u;\tilde{x}_{t+1})+\eta_{t}||\ell_{t}-\hat{\ell}_{t}||_{\infty}\cdot||x_{t}-\tilde{x}_{t+1}||_{1}-(D_{\text{KL}}(\tilde{x}_{t+1};x_{t})+D_{\text{KL}}(x_{t};y_{t})). (33)

Then we can upper bound

||\ell_{t}-\hat{\ell}_{t}||_{\infty}\cdot||x_{t}-\tilde{x}_{t+1}||_{1}\leq\frac{\eta_{t+1}}{2}||\ell_{t}-\hat{\ell}_{t}||_{\infty}^{2}+\frac{1}{2\eta_{t+1}}||x_{t}-\tilde{x}_{t+1}||_{1}^{2}.

We can also use Lemma 21 above to upper bound the KL divergence:

-D_{\text{KL}}(u;\tilde{x}_{t+1})\leq\delta\log d-D_{\text{KL}}(u;y_{t+1}),

and from Pinsker's inequality:

D_{\text{KL}}(\tilde{x}_{t+1};x_{t})\geq\frac{1}{2}||\tilde{x}_{t+1}-x_{t}||_{1}^{2}\Longrightarrow-D_{\text{KL}}(\tilde{x}_{t+1};x_{t})\leq-\frac{1}{2}||\tilde{x}_{t+1}-x_{t}||_{1}^{2}.

By combining these results in (33) and dividing both sides by \eta_{t}, we obtain

\langle\ell_{t},\ x_{t}-u\rangle\leq\frac{D_{\text{KL}}(u;y_{t})-D_{\text{KL}}(u;y_{t+1})+\delta\log d}{\eta_{t}}+\frac{\eta_{t+1}}{2}\varepsilon_{t}(f)+\left(\eta_{t+1}^{-1}-\eta_{t}^{-1}\right)\frac{||x_{t}-\tilde{x}_{t+1}||_{1}^{2}}{2}.

 

Proof [of Theorem 10] Summing (32) over t:

\text{Regret}_{T}(u)\leq\sum_{t=1}^{T}\frac{D_{\text{KL}}(u;y_{t})-D_{\text{KL}}(u;y_{t+1})}{\eta_{t}}+\delta\log(d)\sum_{t=1}^{T}\frac{1}{\eta_{t}}+\sum_{t=1}^{T}\left(\frac{\eta_{t+1}}{2}\varepsilon_{t}(f)+\left(\eta_{t+1}^{-1}-\eta_{t}^{-1}\right)\frac{||x_{t}-\tilde{x}_{t+1}||_{1}^{2}}{2}\right).

We can upper bound the first term of the sum by noticing that:

\sum_{t=1}^{T}\frac{D_{\text{KL}}(u;y_{t})-D_{\text{KL}}(u;y_{t+1})}{\eta_{t}}\leq\frac{D_{\text{KL}}(u;y_{1})}{\eta_{1}}+\sum_{t=2}^{T}D_{\text{KL}}(u;y_{t})(\eta_{t}^{-1}-\eta_{t-1}^{-1})
\leq\frac{\log(d/\delta)}{\eta_{1}}+\log(d/\delta)\sum_{t=2}^{T}(\eta_{t}^{-1}-\eta_{t-1}^{-1})\qquad\text{using Lemma 21}
\leq\frac{\log(d/\delta)}{\eta_{T}}\leq\frac{\log(d/\delta)}{\eta_{T+1}}.

The second term can be upper bounded by:

\delta\log(d)\sum_{t=1}^{T}\frac{1}{\eta_{t}}\leq T\delta\frac{\log d}{\eta_{T+1}}.

The last term can be upper bounded by using ||x_{t}-\tilde{x}_{t+1}||_{1}\leq 1. By combining all of these terms, and setting \delta=\frac{1}{T}, we obtain (9):

\text{Regret}_{T}(u)\leq\frac{\log(d^{2}T)}{\eta_{T+1}}+\sum_{t=1}^{T}\eta_{t+1}\varepsilon_{t}(f)+\frac{1}{2}\left(\eta_{T+1}^{-1}-\eta_{1}^{-1}\right).

To prove (11), we use the same reasoning as in Theorem 8, where we let

\eta_{t}=\sqrt{\log(d^{2}Te)}\min\left\{\frac{1}{\sqrt{{\cal E}_{t-1}(f)}+\sqrt{{\cal E}_{t-2}(f)}},1\right\}. (34)