
Online Control of Unknown Time-Varying Dynamical Systems

Edgar Minasyan1,4, Paula Gradu2, Max Simchowitz3, Elad Hazan1,4

1 Computer Science, Princeton University
2 EECS, UC Berkeley
3 EECS, MIT
4 Google AI Princeton

[email protected], [email protected]
[email protected], [email protected]
Abstract

We study online control of time-varying linear systems with unknown dynamics in the nonstochastic control model. At a high level, we demonstrate that this setting is qualitatively harder than that of either unknown time-invariant or known time-varying dynamics, and complement our negative results with algorithmic upper bounds in regimes where sublinear regret is possible. More specifically, we study regret bounds with respect to common classes of policies: Disturbance Action (SLS), Disturbance Response (Youla), and linear feedback policies. While these three classes are essentially equivalent for LTI systems, we demonstrate that these equivalences break down for time-varying systems.

We prove a lower bound that no algorithm can obtain sublinear regret with respect to the first two classes unless a certain measure of system variability also scales sublinearly in the horizon. Furthermore, we show that offline planning over the state linear feedback policies is NP-hard, suggesting hardness of the online learning problem.

On the positive side, we give an efficient algorithm that attains a sublinear regret bound against the class of Disturbance Response policies up to the aforementioned system variability term. In fact, our algorithm enjoys sublinear adaptive regret bounds, which is a strictly stronger metric than standard regret and is more appropriate for time-varying systems. We sketch extensions to Disturbance Action policies and partial observation, and propose an inefficient algorithm for regret against linear state feedback policies.

1 Introduction

The control of linear time-invariant (LTI) dynamical systems is well-studied and understood. This includes classical methods from optimal control such as LQR and LQG, as well as robust H_{\infty} control. Recent advances study regret minimization and statistical complexity for online linear control, in both stochastic and adversarial perturbation models. Despite this progress, rigorous mathematical guarantees for nonlinear control remain elusive: nonlinear control is both statistically and computationally intractable in general.

In the face of these limitations, recent research has begun to study the rich continuum of settings which lie between LTI systems and generic nonlinear ones. The hope is to provide efficient and robust algorithms to solve the most general control problems that are tractable, and at the same time, to characterize precisely at which degree of nonlinearity no further progress can be made.

This paper studies the control of linear, time-varying (LTV) dynamical systems as one such point along this continuum. This is because the first-order Taylor approximation to the dynamics of any smooth nonlinear system about a given trajectory is an LTV system. These approximations are widely popular because they allow for efficient planning, as demonstrated by the success of iLQR and iLQG methods for nonlinear receding horizon control. We study online control of discrete-time LTV systems, with dynamics and time-varying costs

x_{t+1}=A_{t}x_{t}+B_{t}u_{t}+w_{t},\quad c_{t}(x_{t},u_{t}):\;(x,u)\to\mathbb{R}.\qquad(1.1)

Above, x_{t} is the state of the system, u_{t} the control input, w_{t} the disturbance, and A_{t},B_{t} the system matrices. Our results extend naturally to partial-state observation, where the controller observes linear projections of the state y_{t}=C_{t}x_{t}. We focus on the challenges introduced when the system matrices A_{t},B_{t} and perturbations w_{t} are not known to the learner in advance, and can only be determined by live interaction with the changing system.
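To make the interaction model concrete, here is a minimal simulation sketch of the dynamics in Eq. 1.1 (not part of our algorithms; all names, e.g. `rollout` and the example system, are illustrative assumptions):

```python
import numpy as np

def rollout(A_seq, B_seq, w_seq, policy, d_x, d_u):
    """Simulate the LTV dynamics x_{t+1} = A_t x_t + B_t u_t + w_t of Eq. (1.1),
    starting from x_1 = 0; policy(t, x) returns the control input u_t."""
    x = np.zeros(d_x)
    xs, us = [], []
    for t in range(len(A_seq)):
        u = policy(t, x)
        xs.append(x)
        us.append(u)
        x = A_seq[t] @ x + B_seq[t] @ u + w_seq[t]
    return np.array(xs), np.array(us)

# Example: a 2-dimensional system with slowly rotating dynamics, bounded disturbances,
# and the zero-control policy (whose states are exactly the Nature's x's of Section 2).
T, d_x, d_u = 100, 2, 1
rng = np.random.default_rng(0)
A_seq = [0.9 * np.array([[np.cos(0.01 * t), -np.sin(0.01 * t)],
                         [np.sin(0.01 * t),  np.cos(0.01 * t)]]) for t in range(T)]
B_seq = [np.array([[1.0], [0.0]]) for _ in range(T)]
w_seq = rng.uniform(-1, 1, size=(T, d_x))
xs, us = rollout(A_seq, B_seq, w_seq, lambda t, x: np.zeros(d_u), d_x, d_u)
```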

In this setting, we find that the overall change in system dynamics across time characterizes the difficulty of controlling the unknown LTV system. We define a measure, called system variability, which quantifies this change. We show both statistical and computational lower bounds as well as algorithmic upper bounds in terms of the system variability. Surprisingly, system variability does not make control harder when the dynamics are known [16].

1.1 Contributions

We consider the recently popularized nonstochastic model of online control, and study regret bounds with respect to common classes of policies: Disturbance Action (Dac/SLS [44]), Disturbance Response (Drc/Youla [46]), and linear feedback policies. Planning over the third class of feedback policies in LTI systems admits efficient convex relaxations via the first two parametrizations, Dac and Drc. This insight has been the cornerstone of both robust [49, 44] and online [3, 39] control.

Separation of parametrizations. For linear time-varying systems, however, we find that equivalences between linear feedback, Dac and Drc fail to hold: we show that there are cases where any one of the three parametrizations exhibits strictly better control performance than the other two.

Regret against convex parametrizations. Our first set of results pertains to the Dac and Drc parametrizations, which are convex and admit efficient optimization. We demonstrate that no algorithm can obtain sublinear regret with respect to these classes when faced with unknown LTV dynamics, unless a certain measure of system variability also scales sublinearly in the horizon. This is true even under full observation, controllable dynamics, and fixed control cost. This finding is in direct contrast to recent work which shows that sublinear regret is attainable over LTV system dynamics if they are known [16].

We give an efficient algorithm that attains sublinear regret against these policy classes up to an additive penalty for the aforementioned system variability term found in our lower bound. When the system variability is sufficiently small, our algorithm recovers state-of-the-art results for unknown LTI system dynamics up to logarithmic factors.

In fact, our algorithm enjoys sublinear adaptive regret [21], a strictly stronger metric than standard regret which is more appropriate for time-varying systems. We also show that the stronger notion of adaptivity called strongly adaptive regret [11] is out of reach in the partial information setting.

Regret against state feedback. Finally, we consider the class of state feedback policies, that is, linear feedback policies with memory length one. We show that full-information optimization over state feedback policies is computationally hard. This suggests that obtaining sublinear regret relative to these policies may be computationally prohibitive, though it does not entirely rule out the possibility of improper learning. However, improper learning cannot be done via the Drc or Dac relaxations in light of our policy class separation results. We conclude with an inefficient algorithm which attains sublinear (albeit nonparametric-rate) regret against state feedback control policies.

Paper Structure

Discussion of relevant literature and its relation to our work can be found in Section 1.2. In Section 2, we formally introduce the setting of LTV nonstochastic control, the policy classes we study, and our key result regarding their non-equivalence in the LTV setting (Theorem 2.1). Motivated by this non-equivalence, the remainder of the paper is split into the study of convex policies (Section 3) and of state feedback policies (Section 4). In Section 3, we show that regret against the Dac and Drc classes cannot be sublinear unless the system variability metric (Definition 3.1) is itself sublinear (Theorem 3.1), and also propose Algorithm 2 whose adaptive regret scales at the rate of our lower bound plus a T^{2/3} term (Theorem 3.4). On the other hand, in Section 4 we show that sublinear regret against state feedback policies is technically possible (Theorem 4.1) with a computationally inefficient algorithm, but also provide a computational lower bound (Theorem 4.2) for planning which reveals significant difficulties imposed by the LTV dynamics in this scenario as well. Finally, in Section 5 we pose several future directions, concerning both questions in LTV control and the extension to nonlinear control.

1.2 Related Work

Our study of LTV systems is motivated by the widespread practical popularity of iterative linearization for nonlinear receding horizon control; e.g., the iLQR [40], iLC [29], and iLQG [41] algorithms. Recent research has further demonstrated that near-optimal solutions to LTV approximations of dynamics confer stability guarantees onto the original nonlinear system of interest [45].

Low-Regret Control: We study algorithms which enjoy sublinear regret for online control of LTV systems; that is, whose performance tracks a given benchmark of policies up to a term which is vanishing relative to the problem horizon. [1] initiated the study of online control under the regret benchmark by introducing the online LQR problem, in which a learner is faced with an unknown LTI system, fixed costs and i.i.d. Gaussian disturbances, and must attain performance comparable to the LQR-optimal policy. Bounds for this setting were later improved and refined in [12, 26, 10, 38], and extended to partial-state observation in [25, 24]. Our work instead adopts the nonstochastic control setting [3], where the adversarially chosen (i.e. non-Gaussian) noise models the drift terms that arise in linearizations of nonlinear dynamics, and where costs may vary with time. [3] consider known system dynamics, later extended to unknown systems under both full-state [20] and partial-state observation [39, 37]. The study of nonstochastic control of known LTV dynamics was taken up in [16], with parallel work by [32] considering known LTV dynamics under stochastic noise.

Unknown LTV dynamics: Our work is the first to consider online (low-regret) control of unknown LTV systems in any model. There is, however, a rich body of classical work on adaptive control of LTV systems [28, 42]. These guarantees focus more heavily on error sensitivity and stability; they only permit recovery of the dynamics up to error that scales linearly in the system noise, and thus guarantee only (vacuous) linear-in-horizon regret. More recent work has studied identification (but not online control) of an important LTV class called switching systems [31, 35].

Online Convex Optimization: We make extensive use of techniques from the field of online convex optimization [9, 18]. Most relevant to our work is the literature on adapting to changing environments in online learning, which starts from the works of [22, 6]. The notion of adaptive regret was introduced in [21] and has been studied extensively since as a metric for adaptive learning in OCO [2, 47]. [11] proposed a strengthening of adaptive regret, and this stronger metric has been shown to imply dynamic regret bounds [48].

Recent nonlinear control literature: Recent research has also studied provable guarantees in various complementary (but incomparable) models: planning regret in nonlinear control [4], adaptive nonlinear control under linearly-parameterized uncertainty [5], online model-based control with access to non-convex planning oracles [23], and control with nonlinear observation models [27, 13].

2 Problem Setting

We study control of a linear time-varying (LTV) system Eq. 1.1 with state x_{t}\in\mathbb{R}^{d_{x}}, control input u_{t}\in\mathbb{R}^{d_{u}} chosen by the learner, and external disturbance w_{t}\in\mathbb{R}^{d_{x}} chosen by Nature. The system is characterized by time-varying matrices A_{t}\in\mathbb{R}^{d_{x}\times d_{x}}, B_{t}\in\mathbb{R}^{d_{x}\times d_{u}}. For simplicity, the initial state is x_{1}=0. At each time t, an oblivious adversary (i.e. one who chooses the matrices, costs and perturbations prior to the control trajectory) picks the system matrices (A_{t},B_{t}), disturbances w_{t} and cost functions c_{t}:\mathbb{R}^{d_{x}}\times\mathbb{R}^{d_{u}}\to\mathbb{R}. The dynamics (A_{t},B_{t}) are unknown to the learner: one observes only the next state x_{t+1} and current cost c_{t}(\cdot,\cdot) after playing control u_{t}.

Adaptive Regret. The goal of the learner is to minimize regret w.r.t. a policy class \Pi, i.e. the difference between the cumulative cost of the learner and that of the best policy \pi^{\star}\in\Pi in hindsight. Formally, the regret of an algorithm \mathcal{A} with control inputs u_{1:T} and corresponding states x_{1:T}, over an interval I=[r,s]\subseteq[T], is defined as

\mathrm{Regret}_{I}(\mathcal{A};\Pi)=\sum_{t\in I}c_{t}(x_{t},u_{t})-\inf_{\pi\in\Pi}\sum_{t\in I}c_{t}(x_{t}^{\pi},u_{t}^{\pi}).\qquad(2.1)

Here u_{t}^{\pi},x_{t}^{\pi} denote the control input and the corresponding state when following policy \pi. For a randomized algorithm \mathcal{A}, we consider the expected regret. In this work, we focus on designing control algorithms that minimize adaptive regret, i.e. guarantee low regret relative to the best-in-hindsight policy \pi^{\star}_{I}\in\Pi on any interval I\subseteq[T]. This performance metric of adaptive regret is more suitable for control over LTV dynamics given its agility to compete against different locally optimal policies \pi^{\star}_{I}\in\Pi at different times [16]. To illustrate this point, we describe the implications of standard vs. adaptive regret for k-switching LQR.

Example 2.1 (k-switching LQR).

Consider the problem of k-switching LQR in which the system evolves according to fixed (A_{j},B_{j}) over each time interval I_{j}=\left[\lceil(j-1)\cdot T/k\rceil,\lfloor j\cdot T/k\rfloor\right] for j\in[1,k]. An adaptive regret guarantee ensures good performance against \pi^{\star}_{j}=\operatorname*{arg\,min}_{\pi\in\Pi}\sum_{t\in I_{j}}c_{t}(x_{t}^{\pi},u_{t}^{\pi}) on every interval I_{j}, in contrast to standard regret, which only ensures good performance against a single, joint comparator \pi^{\star}=\operatorname*{arg\,min}_{\pi\in\Pi}\sum_{t=1}^{T}c_{t}(x_{t}^{\pi},u_{t}^{\pi}). Clearly, over every interval I_{j} the policy \pi^{\star}_{j} is a suitable comparator while \pi^{\star} is not.

Key objects. A central object in our study is the sequence of Nature's x's x^{\mathrm{nat}}_{1:T} that arises from playing zero control input u_{t}=0 at each t\in[T], i.e. x^{\mathrm{nat}}_{t+1}=A_{t}x^{\mathrm{nat}}_{t}+w_{t} [39]. This object allows us to split any state into a component independent of the algorithm's actions and a component that is the direct effect of the chosen actions. To capture this intuition in equation form, we define the following operators for all t,

\Phi_{t}^{[0]}=\mathbb{I},\quad\forall h\in[1,t),\;\Phi_{t}^{[h]}=\prod_{k=t}^{t-h+1}A_{k},\quad\forall i\in[0,t),\;G_{t}^{[i]}=\Phi_{t}^{[i]}B_{t-i},

where the matrix product \prod_{k=s}^{r} with s\geq r is taken in the indicated order k=s,\dots,r. The following identities give an alternative representation of the Nature's x's x^{\mathrm{nat}}_{t} and of the state x_{t} under control inputs u_{t}, in terms of the Markov operator at time t, G_{t}=[G_{t}^{[i]}]_{i\geq 0}:

x^{\mathrm{nat}}_{t+1}=\sum_{i=0}^{t-1}\Phi_{t}^{[i]}w_{t-i},\qquad x_{t+1}=x^{\mathrm{nat}}_{t+1}+\sum_{i=0}^{t-1}G_{t}^{[i]}u_{t-i}.

These operators and the alternative representation capture the dynamics by decoupling the effects of the disturbances and of the control actions. Observe that the operators [\Phi_{t}^{[i]}]_{i\in[0,t)} capture the contribution of the perturbations to the state x_{t+1}, and the Markov operators [G_{t}^{[i]}]_{i\in[0,t)} that of the controls.
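As a sanity check on the indexing conventions above, the following sketch computes \Phi_{t}^{[h]} and the decomposition x_{t+1}=x^{\mathrm{nat}}_{t+1}+\sum_{i}G_{t}^{[i]}u_{t-i} numerically and compares it to a direct rollout (a self-contained illustration; the helper names are ours):

```python
import numpy as np

def phi(A_seq, t, h):
    """Phi_t^[h] = A_t A_{t-1} ... A_{t-h+1} (identity for h = 0); t is 0-indexed here."""
    P = np.eye(A_seq[0].shape[0])
    for k in range(t, t - h, -1):
        P = P @ A_seq[k]
    return P

def markov_decomposition(A_seq, B_seq, w_seq, u_seq):
    """x_{t+1} = sum_i Phi_t^[i] w_{t-i} + sum_i Phi_t^[i] B_{t-i} u_{t-i}
    = x^nat_{t+1} + (effect of the controls)."""
    t = len(w_seq) - 1
    x_nat = sum(phi(A_seq, t, i) @ w_seq[t - i] for i in range(t + 1))
    x_ctrl = sum(phi(A_seq, t, i) @ B_seq[t - i] @ u_seq[t - i] for i in range(t + 1))
    return x_nat + x_ctrl

# Compare against a direct rollout of x_{t+1} = A_t x_t + B_t u_t + w_t with x_1 = 0.
rng = np.random.default_rng(1)
T, d_x, d_u = 6, 3, 2
A_seq = [0.5 * rng.standard_normal((d_x, d_x)) for _ in range(T)]
B_seq = [rng.standard_normal((d_x, d_u)) for _ in range(T)]
w_seq = [rng.standard_normal(d_x) for _ in range(T)]
u_seq = [rng.standard_normal(d_u) for _ in range(T)]
x = np.zeros(d_x)
for t in range(T):
    x = A_seq[t] @ x + B_seq[t] @ u_seq[t] + w_seq[t]
assert np.allclose(x, markov_decomposition(A_seq, B_seq, w_seq, u_seq))
```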

Assumptions. We make three basic assumptions: we require (i) that the disturbances do not blow up the system in the absence of control input, (ii) that the effect of past inputs on the system decays over time, and (iii) that the costs are well-behaved and admit efficient optimization. Formally, these assumptions are:

Assumption 1.

For all t\in[T], assume \|x^{\mathrm{nat}}_{t}\|\leq R_{\mathrm{nat}}.

Assumption 2.

Assume there exist R_{G}\geq 1 and \rho\in(0,1) s.t. for any h\geq 0 and for all t\in[T],

\sum_{i\geq h}\|G_{t}^{[i]}\|_{\mathrm{op}}\leq R_{G}\cdot\rho^{h}:=\psi(h).
Assumption 3.

Assume the costs c_{t}:\mathbb{R}^{d_{x}}\times\mathbb{R}^{d_{u}}\to\mathbb{R} are general convex functions that satisfy the conditions 0\leq c_{t}(x,u)\leq L\max\{1,\|x\|^{2}+\|u\|^{2}\} and \|\nabla c_{t}(x,u)\|\leq L\max\{1,\|x\|+\|u\|\} for some constant L>0, where \nabla denotes any subgradient [7].

The conditions in Assumption 3 allow for functions whose values and gradients grow as quickly as quadratics (e.g. the costs in LQR), and the \max\{1,\cdot\} term ensures the inclusion of standard bounded and Lipschitz functions as well. Assumptions 1 and 2 arise from the assumption that our LTV system is open-loop stable; Section A.2 extends to the case where a nominal stabilizing controller is known, as in prior work [3, 39]. While these two assumptions may seem unnatural at first, they can be derived from the basic conditions of a disturbance norm bound and sequential stability.

Lemma 2.1.

Suppose that there exist C_{1}\geq 1,\rho_{1}\in(0,1) such that \|\Phi_{t}^{[h]}\|_{\mathrm{op}}\leq C_{1}\rho_{1}^{h} for any h\geq 0 and all t\in[T], and suppose that \max_{t}\|w_{t}\|\leq R_{w}. Then, Assumption 1 holds with R_{\mathrm{nat}}=\frac{C_{1}}{1-\rho_{1}}R_{w}, and Assumption 2 holds with \rho=\rho_{1} and R_{G}=\max\{1,\max_{t}\|B_{t}\|_{\mathrm{op}}\cdot\frac{C_{1}}{1-\rho_{1}}\}.

Note that Assumption 2 implies that \|G_{t}\|_{\ell_{1},\mathrm{op}}=\sum_{i\geq 0}\|G_{t}^{[i]}\|_{\mathrm{op}}\leq R_{G}. It also suggests that for a sufficiently large h, the effect of iterations before t-h is negligible at round t. This prompts introducing a truncated Markov operator: denote by \bar{G}^{h}_{t}=[G_{t}^{[i]}]_{i<h} the h-truncation of the true Markov operator G_{t}. It follows that the difference \|\bar{G}_{t}^{h}-G_{t}\|_{\ell_{1},\mathrm{op}}=\sum_{i\geq h}\|G_{t}^{[i]}\|_{\mathrm{op}}\leq\psi(h) is negligible in operator norm for sufficiently large h. Define the bounded set of h-truncated Markov operators to be \mathcal{G}(h,R_{G})=\{G=[G^{[i]}]_{0\leq i<h}:\|G\|_{\ell_{1},\mathrm{op}}\leq R_{G}\}, with \bar{G}_{t}^{h}\in\mathcal{G}(h,R_{G}) for all t.
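For concreteness, a small sketch of the h-truncation, the \ell_{1}-operator norm, and a clipping-style projection onto \mathcal{G}(h,R_{G}) (the rescaling-based projection is an assumption matching the clipping described in Section 3.3, not a verbatim transcription of the appendix):

```python
import numpy as np

def l1_op_norm(G):
    """||G||_{l1,op} = sum_i ||G^[i]||_op for a truncated Markov operator G = [G^[0], ..., G^[h-1]]."""
    return sum(np.linalg.norm(Gi, ord=2) for Gi in G)

def truncate(G, h):
    """The h-truncation: keep only the first h blocks of G."""
    return G[:h]

def project_by_clipping(G, R_G):
    """Rescale G when its l1-operator norm exceeds R_G, so the result lies in G(h, R_G)."""
    norm = l1_op_norm(G)
    return G if norm <= R_G else [Gi * (R_G / norm) for Gi in G]
```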

2.1 Benchmarks and Policy Classes

The performance of an algorithm, measured by Eq. 2.1, directly depends on the policy class \Pi that is chosen as the benchmark to compete against. In this work, we consider the following three policy classes: Drc, Dac, and linear feedback. Drc parameterizes control inputs in terms of the Nature's x's x^{\mathrm{nat}}_{t}, Dac does so in terms of the disturbances w_{t}, and linear feedback in terms of the states x_{t}. We express all three in terms of a length-m parameter M=[M^{[i]}]_{i<m} in a bounded ball \mathcal{M}(m,R_{M}):

\mathcal{M}(m,R_{M})=\{(M^{[0]},\dots,M^{[m-1]}):\textstyle\sum_{i=0}^{m-1}\|M^{[i]}\|_{\mathrm{op}}\leq R_{M}\}.
Definition 2.1 (Drc policy class).

A Drc control policy \pi_{\mathrm{drc}}^{M} of length m is given by u_{t}^{M}=\sum_{i=0}^{m-1}M^{[i]}x^{\mathrm{nat}}_{t-i}, where M=[M^{[i]}]_{i<m} is the parameter of the policy. Define the bounded Drc policy class as \Pi_{\mathrm{drc}}(m,R_{M})=\{\pi_{\mathrm{drc}}^{M}:M\in\mathcal{M}(m,R_{M})\}.

Definition 2.2 (Dac policy class).

A Dac control policy \pi_{\mathrm{dac}}^{M} of length m is given by u_{t}^{M}=\sum_{i=0}^{m-1}M^{[i]}w_{t-i}, where M=[M^{[i]}]_{i<m} is the parameter of the policy. Define the bounded Dac policy class as \Pi_{\mathrm{dac}}(m,R_{M})=\{\pi_{\mathrm{dac}}^{M}:M\in\mathcal{M}(m,R_{M})\}.

Definition 2.3 (Feedback policy class).

A feedback control policy \pi_{\mathrm{feed}}^{M} of length m is given by u_{t}^{M}=\sum_{i=0}^{m-1}M^{[i]}x_{t-i}, where M=[M^{[i]}]_{i<m} is the parameter of the policy. Define the bounded feedback policy class as \Pi_{\mathrm{feed}}(m,R_{M})=\{\pi_{\mathrm{feed}}^{M}:M\in\mathcal{M}(m,R_{M})\}. In the special case of memory m=1, denote the state feedback policy class as \Pi_{\mathrm{state}}=\Pi_{\mathrm{feed}}(m=1).

Convexity. Both the Drc and Dac policy classes are convex parametrizations: a policy \pi\in\Pi_{\mathrm{drc}}\cup\Pi_{\mathrm{dac}} outputs controls u_{t} that are linear in the policy-independent sequences x^{\mathrm{nat}}_{1:T} and w_{1:T}, and thus the mapping from the parameter M to the resulting states and inputs (resp. costs) is affine (resp. convex). Hence, we refer to these as the convex classes. In contrast, feedback policies select inputs based on policy-dependent states, and are therefore non-convex [15].
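The three parametrizations differ only in which (policy-dependent or policy-independent) signal the parameter M acts on; a minimal sketch (function names are ours):

```python
import numpy as np

def drc_control(M, x_nat_hist):
    """DRC: u_t = sum_{i<m} M^[i] x^nat_{t-i}; histories are lists with the most recent entry last."""
    return sum(M[i] @ x_nat_hist[-(i + 1)] for i in range(min(len(M), len(x_nat_hist))))

def dac_control(M, w_hist):
    """DAC: u_t = sum_{i<m} M^[i] w_{t-i}."""
    return sum(M[i] @ w_hist[-(i + 1)] for i in range(min(len(M), len(w_hist))))

def feedback_control(M, x_hist):
    """Linear feedback: u_t = sum_{i<m} M^[i] x_{t-i}; m = 1 recovers state feedback u_t = K x_t."""
    return sum(M[i] @ x_hist[-(i + 1)] for i in range(min(len(M), len(x_hist))))

# Example: a length-2 DRC policy acting on the two most recent Nature's x's.
M = [np.zeros((1, 2)), 0.5 * np.ones((1, 2))]
u = drc_control(M, [np.array([1.0, -1.0]), np.array([0.0, 2.0])])
```

Since x^{\mathrm{nat}} and w do not depend on M, the first two maps are linear in M; the third feeds M back into its own input history, which is the source of non-convexity.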

We drop the arguments m,R_{M} when they are clear from context. The state feedback policies \Pi_{\mathrm{state}} encompass the \mathcal{H}_{2} and \mathcal{H}_{\infty} optimal control laws under full observation. For LTI systems, Drc and Dac are equivalent [46, 44] and approximate all linear feedback policies to arbitrarily high precision [3, 39]. However, we show that these relationships between the classes break down for LTV systems: there exist scenarios where any one of the three classes strictly outperforms the other two.

Theorem 2.1 (Informal).

For each class \Pi in \{\Pi_{\mathrm{drc}},\Pi_{\mathrm{dac}},\Pi_{\mathrm{feed}}\} there exists a sequence of well-behaved (A_{t},B_{t},w_{t},c_{t}) such that a policy \pi^{\star}\in\Pi suffers 0 cumulative cost, but each of the other two classes \Pi^{\prime}\in\{\Pi_{\mathrm{drc}},\Pi_{\mathrm{dac}},\Pi_{\mathrm{feed}}\}\setminus\Pi suffers \Omega(T) cost on all of its constituent policies \pi\in\Pi^{\prime}.

The formal theorem, which includes the definition of a well-behaved instance sequence and the dependence of the final statement on m,R_{M}, along with its proof, can be found in Section F.1.

Notation.

The norm \|\cdot\| refers to the Euclidean norm unless otherwise stated, [n] is used as shorthand for [1,n], and T is used as a subscript shorthand for [T]. The asymptotic notations \mathcal{O}(\cdot),\Omega(\cdot) suppress all terms independent of T, and \widetilde{\mathcal{O}}(\cdot) additionally suppresses terms logarithmic in T. We define \widetilde{\mathcal{O}}^{\star}(\cdot) to suppress absolute constants, polynomials in R_{\mathrm{nat}},R_{G},R_{M}, and logarithms in T.

3 Online Control over Convex Policies

This section considers online control of unknown LTV systems so as to compete with the convex Drc and Dac policy classes. The fundamental quantity which appears throughout our results is the system variability, which measures the variation of the time-varying Markov operators G_{t} over intervals I.

Definition 3.1.

Define the system variability of an LTV dynamical system with Markov operators \mathbf{G}=G_{1:T} over a contiguous interval I\subseteq[T] to be

\mathrm{Var}_{I}(\mathbf{G})=\min_{G}\frac{1}{|I|}\sum_{t\in I}\|G-G_{t}\|_{\ell_{2},F}^{2}=\frac{1}{|I|}\sum_{t\in I}\|G_{I}-G_{t}\|_{\ell_{2},F}^{2},

where \|\cdot\|_{\ell_{2},F} indicates the \ell_{2} norm of the fully vectorized operator and G_{I}=|I|^{-1}\sum_{t\in I}G_{t} is the empirical average of the operators that correspond to I. Recall that \mathrm{Var}_{T}(\mathbf{G}) corresponds to I=[T].
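Computing \mathrm{Var}_{I}(\mathbf{G}) is straightforward since the inner minimization is achieved by the empirical average G_{I}; a short sketch (assuming all operators share the same truncation length, and with our own function name):

```python
import numpy as np

def system_variability(G_list, interval):
    """Var_I(G) = (1/|I|) sum_{t in I} ||G_I - G_t||_{l2,F}^2, with G_I the empirical average
    of the fully vectorized Markov operators over the interval."""
    vecs = np.stack([np.concatenate([Gi.ravel() for Gi in G_list[t]]) for t in interval])
    G_bar = vecs.mean(axis=0)                     # minimizer of the inner least-squares problem
    return np.mean(np.sum((vecs - G_bar) ** 2, axis=1))

# Example: an LTI system (all G_t equal) has zero system variability on any interval.
G_lti = [[np.eye(2), 0.5 * np.eye(2)] for _ in range(10)]
assert system_variability(G_lti, range(10)) == 0.0
```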

Our results in this section for both upper and lower bounds focus on expected regret: high probability results are possible as well with more technical effort using standard techniques.

3.1 A Linear Regret Lower Bound

Our first contribution is a negative one: the regret against the class of either Dac or Drc policies cannot scale sublinearly in the time horizon. Informally, our result shows that the regret against these classes scales as T\sigma, where \sigma^{2} is the system variability.

More precisely, for any \sigma^{2}\in(0,1/8], we construct a distribution \mathcal{D}_{\sigma} over sequences (A_{t},B_{t},c_{t},w_{t}), formally specified in Section F.2. Here, we list the essential properties of \mathcal{D}_{\sigma}: (i) A_{t}\equiv 0, (ii) c_{t}\equiv c is a fixed cost satisfying Assumption 3 with L\leq 4, (iii) the matrices (B_{t}) are i.i.d., with \|B_{t}\|_{\mathrm{op}}\leq 2 almost surely and \mathbb{E}[\|B_{t}-\mathbb{E}[B_{t}]\|_{\mathrm{F}}^{2}]=\sigma^{2}, and (iv) \|w_{t}\|\leq 4 for all t. These conditions imply that Assumptions 1 and 2 hold with R_{G}=2,\rho=0,R_{\mathrm{nat}}=4. The condition A_{t}\equiv 0 implies that x^{\mathrm{nat}}_{t}=w_{t} for all t, so the classes Drc and Dac are equivalent and the lower bound holds for both. Moreover, by Jensen's inequality, this construction ensures that

\mathbb{E}[\mathrm{Var}_{I}(\mathbf{G})]=|I|^{-1}\,\mathbb{E}\Big[\min_{G}\textstyle\sum_{t\in I}\|G-G_{t}\|_{\ell_{2},F}^{2}\Big]=|I|^{-1}\,\mathbb{E}\Big[\min_{B}\textstyle\sum_{t\in I}\|B-B_{t}\|_{\mathrm{F}}^{2}\Big]\leq\mathbb{E}[\|B_{t}-\mathbb{E}[B_{t}]\|_{\mathrm{F}}^{2}]=\sigma^{2}.

In particular, \mathbb{E}[\mathrm{Var}_{T}(\mathbf{G})]\leq\sigma^{2}. For the described construction, we show the following lower bound:

Theorem 3.1.

Let C be a universal, positive constant. For any \sigma\in(0,1/8] and any online control algorithm \mathcal{A}, there exists a Drc policy \pi^{\star}\in\Pi_{\mathrm{drc}}(1,1) s.t. the expected regret incurred by \mathcal{A} under the distribution \mathcal{D}_{\sigma} and cost c(x,u) is at least

\mathbb{E}_{\mathcal{D}_{\sigma},\mathcal{A}}[\mathrm{Regret}_{T}(\mathcal{A};\{\pi^{\star}\})]\geq C\cdot T\sigma\geq C\cdot T\cdot\sqrt{\mathbb{E}[\mathrm{Var}_{T}(\mathbf{G})]}.

The full construction and proof of Theorem 3.1 are given in Section F.2. In particular, for \sigma=1/8, we find that no algorithm can attain less than \Omega(T) expected regret; a stark distinction from either unknown LTI [39, 20] or known LTV [16] systems.

3.2 Estimation of Time-Varying Vector Sequences

To devise an algorithmic upper bound that complements the result in Theorem 3.1, we first consider the setting of online prediction under a partial information model. This setting captures the system identification phase of LTV system control and is used to derive the final control guarantees. Formally, consider the following repeated game between a learner and an oblivious adversary: at each round t\in[T], the adversary picks a target vector \mathbf{z}^{\star}_{t}\in\mathcal{K} from a convex decision set \mathcal{K} contained in a 0-centered ball of radius R_{z}; simultaneously, the learner selects an estimate \hat{\mathbf{z}}_{t}\in\mathcal{K} and suffers the quadratic loss \ell_{t}(\hat{\mathbf{z}}_{t})=\|\hat{\mathbf{z}}_{t}-\mathbf{z}^{\star}_{t}\|^{2}. The only feedback the learner has access to is via the following noisy and costly oracle.

Oracle 1 (Noisy Costly Oracle).

At each time t\in[T], the learner selects a decision b_{t}\in\{0,1\} indicating whether a query is sent to the oracle. If b_{t}=1, the learner receives an unbiased estimate \tilde{\mathbf{z}}_{t} as a response such that \|\tilde{\mathbf{z}}_{t}\|\leq\tilde{R}_{z} and \mathbb{E}[\tilde{\mathbf{z}}_{t}\mid\mathcal{F}_{t},b_{t}=1]=\mathbf{z}^{\star}_{t}. The filtration \mathcal{F}_{t} is the sigma algebra generated by \tilde{\mathbf{z}}_{1:t-1},b_{1:t-1} and the choices of the oblivious adversary \mathbf{z}_{1:T}^{\star}. A completed query results in a cost of \lambda>0 for the learner.

The performance metric of an online prediction algorithm \mathcal{A}_{\mathrm{pred}} is the expected quadratic-loss regret along with the extra cumulative oracle query cost. It is defined over each interval I=[r,s]\subseteq[T] as

\mathrm{Regret}_{I}(\mathcal{A}_{\mathrm{pred}};\lambda)=\mathbb{E}_{\mathcal{F}_{1:T}}\left[\sum_{t\in I}\ell_{t}(\hat{\mathbf{z}}_{t})-\min_{\mathbf{z}\in\mathcal{K}}\sum_{t\in I}\ell_{t}(\mathbf{z})+\lambda\sum_{t\in I}b_{t}\right].\qquad(3.1)
Algorithm 1 Adaptive Estimation Algorithm (Ada-Pred)
1: Input: parameter p, decision set \mathcal{K}
2: Initialize: \hat{\mathbf{z}}_{1}^{(1)}\in\mathcal{K}, working dictionary \mathcal{S}_{1}=\{(1:\hat{\mathbf{z}}_{1}^{(1)})\}, q_{1}^{(1)}=1, parameter \alpha=\frac{p}{(R_{z}+\tilde{R}_{z})^{2}}
3: for t=1,\ldots,T do
4:     Play iterate \hat{\mathbf{z}}_{t}=\sum_{(i,\hat{\mathbf{z}}_{t}^{(i)})\in\mathcal{S}_{t}}q_{t}^{(i)}\hat{\mathbf{z}}_{t}^{(i)}
5:     Draw/Receive b_{t}\sim\mathrm{Bernoulli}(p)
6:     if b_{t}=1 then
7:         Request estimate \tilde{\mathbf{z}}_{t} from Oracle 1
8:         Let \tilde{\ell}_{t}(\mathbf{z})=\frac{1}{2p}\|\mathbf{z}-\tilde{\mathbf{z}}_{t}\|^{2} and \tilde{\nabla}_{t}=\frac{1}{p}(\hat{\mathbf{z}}_{t}-\tilde{\mathbf{z}}_{t})
9:     else
10:         Let \tilde{\mathbf{z}}_{t}\leftarrow\emptyset, \tilde{\ell}_{t}(\mathbf{z})=0 and \tilde{\nabla}_{t}=0
11:     Update predictions \hat{\mathbf{z}}^{(i)}_{t+1}\leftarrow\mathrm{Proj}_{\mathcal{K}}(\hat{\mathbf{z}}^{(i)}_{t}-\eta_{t}^{(i)}\tilde{\nabla}_{t}) for all (i,\hat{\mathbf{z}}^{(i)}_{t})\in\mathcal{S}_{t}
12:     Form new dictionary \tilde{\mathcal{S}}_{t+1}=(i,\hat{\mathbf{z}}^{(i)}_{t+1})_{i\in\mathrm{keys}(\mathcal{S}_{t})}
13:     Construct proxy new weights \bar{q}_{t+1}^{(i)}=\tfrac{t}{t+1}\cdot\tfrac{q_{t}^{(i)}e^{-\alpha\tilde{\ell}_{t}(\hat{\mathbf{z}}_{t}^{(i)})}}{\sum_{j\in\mathrm{keys}(\mathcal{S}_{t})}q_{t}^{(j)}e^{-\alpha\tilde{\ell}_{t}(\hat{\mathbf{z}}_{t}^{(j)})}} for all i\in\mathrm{keys}(\mathcal{S}_{t})
14:     Add new instance \tilde{\mathcal{S}}_{t+1}\leftarrow\tilde{\mathcal{S}}_{t+1}\cup(t+1,\hat{\mathbf{z}}_{t+1}^{(t+1)}) for arbitrary \hat{\mathbf{z}}_{t+1}^{(t+1)}\in\mathcal{K} with \bar{q}_{t+1}^{(t+1)}=\frac{1}{t+1}
15:     Prune \tilde{\mathcal{S}}_{t+1} to form \mathcal{S}_{t+1} (see Section C.1)
16:     Normalize q_{t+1}^{(i)}=\frac{\bar{q}_{t+1}^{(i)}}{\sum_{j\in\mathrm{keys}(\mathcal{S}_{t+1})}\bar{q}_{t+1}^{(j)}}

To attain adaptive regret, i.e. to bound Eq. 3.1 on each interval I, we propose Algorithm 1, constructed as follows. First, suppose we wanted non-adaptive (i.e. just I=[T]) guarantees. In this special case, we propose to sample b_{t}\sim\mathrm{Bernoulli}(p) for an appropriate parameter p\in(0,1), and perform a gradient descent update on the importance-weighted square loss \tilde{\ell}_{t}(\mathbf{z})=\frac{1}{2p}\mathbb{I}\{b_{t}=1\}\|\mathbf{z}-\tilde{\mathbf{z}}_{t}\|^{2}. To extend this method to enjoy adaptive regret guarantees, we adopt the approach of [21]: the core idea is to initiate an instance of the base method at each round t and use a weighted average of the instance predictions as the final prediction (Line 4). The instance weights are multiplicatively updated according to their performance (Line 13). To ensure computational efficiency, the algorithm only updates instances from a working dictionary \mathcal{S}_{t} (Line 11). These dictionaries are pruned each round (Line 15) such that |\mathcal{S}_{t}|=O(\log T) (see Section C.1 for details).
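For intuition, here is a sketch of the non-adaptive base method that Algorithm 1 wraps: Bernoulli(p) queries, an importance-weighted square loss, and projected gradient steps onto a ball (an illustrative simplification with a single instance; the names and the 1/(t+1) stepsize schedule are our assumptions in the spirit of the theorem below):

```python
import numpy as np

def base_predictor(oracle, T, p, R_z, d, rng=None):
    """Sample b_t ~ Bernoulli(p); on b_t = 1, query the noisy oracle and take a projected
    gradient step on the importance-weighted loss (1/2p)||z - z_tilde_t||^2."""
    rng = rng or np.random.default_rng()
    z_hat, preds = np.zeros(d), []
    for t in range(T):
        preds.append(z_hat.copy())
        if rng.random() < p:                       # b_t = 1: pay the query cost lambda
            z_tilde = oracle(t)                    # unbiased, bounded estimate of z*_t
            grad = (z_hat - z_tilde) / p           # gradient of the surrogate loss at z_hat
            z_hat = z_hat - grad / (t + 1)         # stepsize ~ 1/(t + 1)
            norm = np.linalg.norm(z_hat)
            if norm > R_z:                         # projection onto the radius-R_z ball
                z_hat *= R_z / norm
    return np.array(preds)
```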

Theorem 3.2.

Given access to queries from Oracle 1 and with stepsizes \eta_{t}^{(i)}=\frac{1}{t-i+1}, Algorithm 1 enjoys the following adaptive regret guarantee: for all I=[r,s]\subseteq[T],

\mathrm{Regret}_{I}(\textsc{Ada}\text{-}\textsc{Pred};\lambda)\leq\frac{2(R_{z}+\tilde{R}_{z})^{2}(1+\log{s}\cdot\log|I|)}{p}+\lambda p|I|.\qquad(3.2)

When I=[T], the optimal choice of parameter p=\log T/\sqrt{\lambda T} yields regret scaling roughly as \sqrt{\lambda T\log^{2}T}. Unfortunately, this gives regret scaling as \sqrt{T} for all interval sizes: to attain \sqrt{|I|} regret on an interval I, the optimal choice of p would yield \sim T/\sqrt{|I|} regret on [T], which is considerably worse for small |I|. One may ask if there exists a strongly adaptive algorithm which adapts p as well, so as to enjoy regret polynomial in |I| for all intervals I simultaneously [11]. The following result shows this is not possible:
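For completeness, a minimal balancing computation behind the stated choice of p for I=[T]: writing C:=2(R_{z}+\tilde{R}_{z})^{2}(1+\log^{2}T) for the numerator of the first term in Eq. 3.2, the bound reads C/p+\lambda pT, and

\min_{p\in(0,1)}\;\frac{C}{p}+\lambda pT\quad\Longrightarrow\quad p^{\star}=\sqrt{\frac{C}{\lambda T}}\asymp\frac{\log T}{\sqrt{\lambda T}},\qquad\frac{C}{p^{\star}}+\lambda p^{\star}T=2\sqrt{C\lambda T}\asymp\sqrt{\lambda T\log^{2}T}.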

Theorem 3.3 (Informal).

For all \gamma>0 and \lambda>0, there exists no online algorithm \mathcal{A} with feedback access to Oracle 1 that enjoys a strongly adaptive regret bound of \mathrm{Regret}_{I}(\mathcal{A};\lambda)=\tilde{O}(|I|^{1-\gamma}).

Hence, in a sense, Algorithm 1 is as adaptive as one could hope for: it ensures a regret bound for all intervals I, but not a strongly adaptive one. The lower bound construction, formal statement, and proof of Theorem 3.3 are given in Section C.2.

3.3 Adaptive Regret for Control of Unknown Time-Varying Dynamics

We now apply our adaptive estimation algorithm (Algorithm 1) to the online control problem. Our proposed algorithm, Algorithm 2, takes in two sub-routines: a prediction algorithm \mathcal{A}_{\mathrm{pred}} which enjoys low prediction regret in the sense of the previous section, and a control algorithm \mathcal{A}_{\mathrm{ctrl}} which has low regret for control of known systems. Our master algorithm trades off between the two methods in epochs \tau=1,2,\dots of length h: each epoch corresponds to one step of \mathcal{A}_{\mathrm{pred}}, indexed by [\tau].

Algorithm 2 DRC-OGD with Adaptive Exploration (Ada-Ctrl)
1: Input: parameters p,h, prediction algorithm \mathcal{A}_{\mathrm{pred}}\leftarrow\textsc{Ada}\text{-}\textsc{Pred}(p,\hat{G}_{0},\mathcal{G}(h,R_{G})), control algorithm \mathcal{A}_{\mathrm{ctrl}}\leftarrow\textsc{Drc}\text{-}\textsc{Ogd}(m,R_{\mathcal{M}})
2: for \tau=1,\ldots,T/h do \triangleright let t_{\tau}=(\tau-1)h+1
3:     Set \hat{G}_{t_{\tau}},\hat{G}_{t_{\tau}+1},\ldots,\hat{G}_{t_{\tau}+h-1} equal to the \tau-th iterate \hat{G}_{[\tau]} from \mathcal{A}_{\mathrm{pred}}
4:     Draw b_{[\tau]}\sim\text{Bernoulli}(p)
5:     for t=t_{\tau},\ldots,t_{\tau}+h-1 do
6:         if b_{[\tau]}=1 then
7:              Play control u_{t}\sim\{\pm 1\}^{d_{u}}
8:         else
9:              Play control u_{t} according to the t-th input chosen by \mathcal{A}_{\mathrm{ctrl}}
10:         Suffer cost c_{t}(x_{t},u_{t}), observe new state x_{t+1}
11:         Extract \hat{x}_{t+1}^{\mathrm{nat}}=\mathrm{Proj}_{\mathbb{B}_{R_{\mathrm{nat}}}}\left(x_{t+1}-\sum_{i=0}^{h-1}\hat{G}_{t}^{[i]}u_{t-i}\right)
12:         Feed cost, Markov operator and Nature's x estimates (c_{t},\hat{G}_{t},\hat{x}^{\mathrm{nat}}_{t+1}) to \mathcal{A}_{\mathrm{ctrl}}
13:     if b_{[\tau]}=1 then
14:         Feed (b_{[\tau]},\tilde{G}_{[\tau]}) to \mathcal{A}_{\mathrm{pred}}, where \tilde{G}_{[\tau]}^{[i]}=x_{t_{\tau}+h}u_{t_{\tau}+h-i}^{\top},\,i=0,1,\dots,h-1
15:     else
16:         Feed (b_{[\tau]},\tilde{G}_{[\tau]}) to \mathcal{A}_{\mathrm{pred}}, where \tilde{G}_{[\tau]}\leftarrow\emptyset

At each epoch, the algorithm receives Markov operator estimates from \mathcal{A}_{\mathrm{pred}} (Line 3) and makes a binary decision b_{[\tau]}\sim\mathrm{Bernoulli}(p). If b_{[\tau]}=1, then it explores using i.i.d. Rademacher inputs (Line 7), and sends the resulting estimator to \mathcal{A}_{\mathrm{pred}} (Line 14). This corresponds to one query from Oracle 1. Otherwise, it selects inputs in line with \mathcal{A}_{\mathrm{ctrl}} (Line 9), and does not send a query to \mathcal{A}_{\mathrm{pred}} (Line 16). Regardless of the exploration decision, the algorithm feeds the costs, the current Markov operator estimates, and the Nature's x estimates derived from them to \mathcal{A}_{\mathrm{ctrl}} (Lines 10-12), which \mathcal{A}_{\mathrm{ctrl}} uses to select inputs and update its internal parameter.
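The reason Rademacher exploration yields a valid Oracle 1 query is that i.i.d. sign inputs make the outer-product estimator of Line 14 unbiased for the corresponding Markov blocks. The following Monte Carlo sketch illustrates this on a stylized one-step model x=\sum_{j}G^{[j]}u_{t-j}+b with b independent of the inputs (the ground-truth blocks and sample sizes are hypothetical, and the exact time indexing of Algorithm 2 is handled in the analysis):

```python
import numpy as np

rng = np.random.default_rng(2)
d_x, d_u, h, n_samples = 3, 2, 4, 50_000
G = [rng.standard_normal((d_x, d_u)) for _ in range(h)]      # hypothetical ground-truth blocks
b = rng.standard_normal(d_x)                                 # input-independent component

# Check that E[x u_{t-i}^T] = G^[i] when inputs are i.i.d. Rademacher (E[u u^T] = I, E[u] = 0).
est = [np.zeros((d_x, d_u)) for _ in range(h)]
for _ in range(n_samples):
    u = rng.choice([-1.0, 1.0], size=(h, d_u))               # u[j] plays the role of u_{t-j}
    x = sum(G[j] @ u[j] for j in range(h)) + b
    for i in range(h):
        est[i] += np.outer(x, u[i]) / n_samples
print(max(np.linalg.norm(est[i] - G[i]) for i in range(h)))  # small Monte Carlo error
```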

The prediction algorithm \mathcal{A}_{\mathrm{pred}} is taken to be Ada-Pred with the decision set \mathcal{K}=\mathcal{G}(h,R_{G}): the projection onto this set and onto the ball \mathbb{B}_{R_{\mathrm{nat}}} is done by clipping whenever the norm of the argument exceeds the indicated bound. The control algorithm \mathcal{A}_{\mathrm{ctrl}} is taken to be Drc-Ogd [39] for known systems. The core technique behind Drc-Ogd is running online gradient descent on the Drc parameterization (Definition 2.1). In Appendix B we spell out the algorithm and extend previous analyses to both LTV systems and adaptive regret guarantees. The final result guarantees low adaptive regret as long as the system variability is sublinear.

Theorem 3.4.

For h=\dfrac{\log{T}}{\log{\rho^{-1}}}, p=T^{-1/3} and m\leq\sqrt{T}, on any contiguous interval I\subseteq[T], Algorithm 2 enjoys the following adaptive regret guarantee:

\mathbb{E}\left[\mathrm{Regret}_{I}(\textsc{Ada}\text{-}\textsc{Ctrl};\Pi_{\mathrm{drc}}(m,R_{M}))\right]\leq\widetilde{\mathcal{O}}^{\star}\left(Lm\left(|I|\sqrt{\mathbb{E}[\mathrm{Var}_{I}(\mathbf{G})]}+d_{u}T^{2/3}\right)\right)
Proof Sketch.

The analysis proceeds by reducing the regret incurred to that over a known system, accounting for: 1) the additional exploration penalty (O(p|I|)), 2) the error induced by system misspecification (\sim\sum_{t\in I}\|\hat{G}_{t}-\bar{G}_{t}^{h}\|_{\ell_{1},\mathrm{op}}), and 3) truncation errors (\sim\psi(h)|I|). Via straightforward computations, the system misspecification error can be expressed in terms of the result in Theorem 3.2, ultimately leading to an error contribution \sim|I|\sqrt{\mathbb{E}[\mathrm{Var}_{I}(\mathbf{G})]}+p^{-1/2}|I|^{1/2}. The analysis is finalized by noting that the chosen p ideally balances p|I| and p^{-1/2}|I|^{1/2}, and that the chosen h ensures that the truncation error is negligible. The full proof can be found in Appendix D. ∎

The adaptive regret bound in Theorem 3.4 has two notable terms. The first term, |I|\sqrt{\mathbb{E}[\mathrm{Var}_{I}(\mathbf{G})]}, matches the regret lower bound in Theorem 3.1 for I=[T]; furthermore, our algorithm is adaptive in this term over all intervals I. On the other hand, for unknown LTI systems, where \mathrm{Var}_{I}(\mathbf{G})=0, the algorithm recovers the state-of-the-art bound of T^{2/3} [20]. However, the T^{2/3} term is not adaptive to the intervals I, consistent with the lower bound against strongly adaptive algorithms in Theorem 3.3.

4 Online Control over State Feedback

Given the impossibility of sublinear regret against Drc/Dac without further restrictions on system variability, this section studies whether sublinear regret is possible against the class of linear feedback policies. For simplicity, we focus on the state feedback policies u_{t}=Kx_{t}, that is, linear feedback policies with memory m=1 (Definition 2.3). We note that state feedback policies were the class which motivated the relaxation to Dac policies in the first study of nonstochastic control [3].

We present two results, rather qualitative in nature. First, we show that obtaining sublinear regret is, in the most literal sense, possible. The following result considers regret relative to a class \mathcal{K} of static feedback controllers which satisfy the restrictive assumption that each K\in\mathcal{K} stabilizes the time-varying dynamics (A_{t},B_{t}); see Appendix E for the formal algorithm, assumptions, and guarantees. We measure the regret against this class \mathcal{K}:

\mathrm{Regret}_{T}(\mathcal{K}):=\sum_{t=1}^{T}c_{t}(x_{t},u_{t})-\inf_{K\in\mathcal{K}}\sum_{t=1}^{T}c_{t}(x_{t}^{K},u_{t}^{K}),

where (x_{t}^{K},u_{t}^{K}) are the iterates arising under the control law u_{t}=Kx_{t}.

Theorem 4.1 (Sublinear regret against state-feedback).

Under a suitable stabilization assumption, there exists a computationally inefficient control algorithm which attains sublinear expected regret:

\mathbb{E}[\mathrm{Regret}_{T}(\mathcal{K})]\leq e^{\Omega(d_{x}d_{u}/2)}\cdot T^{1-\frac{1}{2(d_{x}d_{u}+3)}}.

Above, \Omega(\cdot) suppresses a universal constant and exponent base, both of which are made explicit in the formal theorem statement in Appendix E. The bound follows by running the Exp3 bandit algorithm on a discretization of the set \mathcal{K} (high-probability regret can be obtained by instead using Exp3.P [8]). The guarantee in Theorem 4.1 is neither practical nor sharp; its sole purpose is to confirm the possibility of sublinear regret. Due to the bandit reduction and the exponential size of the cover of \mathcal{K}\subset\mathbb{R}^{d_{u}\times d_{x}}, the algorithm is computationally inefficient and suffers a nonparametric rate of regret [33]: attaining \epsilon-regret requires T=\epsilon^{-\Omega(\mathrm{dimension})}.
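To illustrate the reduction (and why it is inefficient), here is a sketch of Exp3 run over an entrywise grid of feedback gains; the cost callback, grid, and learning rate are placeholders, and the sketch ignores the state carried over between rounds, which the formal analysis in Appendix E must handle:

```python
import itertools
import numpy as np

def exp3_over_gains(cost_of_arm, T, d_x, d_u, grid=np.linspace(-1.0, 1.0, 5), lr=0.05, rng=None):
    """Naive bandit reduction: each grid point K in R^{d_u x d_x} is an Exp3 arm;
    cost_of_arm(t, K) in [0, 1] is the (abstract) per-round cost of playing u_t = K x_t."""
    rng = rng or np.random.default_rng()
    arms = [np.array(v).reshape(d_u, d_x)
            for v in itertools.product(grid, repeat=d_u * d_x)]   # cover of size |grid|^(d_u d_x)
    log_w, total_cost = np.zeros(len(arms)), 0.0
    for t in range(T):
        probs = np.exp(log_w - log_w.max())
        probs /= probs.sum()
        i = rng.choice(len(arms), p=probs)
        loss = cost_of_arm(t, arms[i])
        total_cost += loss
        log_w[i] -= lr * loss / probs[i]            # importance-weighted exponential update
    return total_cost

# Toy usage with a synthetic cost (25 arms for d_x = 2, d_u = 1); the cover size is what
# blows up exponentially in d_x d_u and drives the nonparametric rate.
total = exp3_over_gains(lambda t, K: min(1.0, float(np.sum(K ** 2)) / 10), T=1000, d_x=2, d_u=1)
```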

One may wonder if one can do much better than this naive bandit reduction. For example, is there structure that can be leveraged? For LTV systems, we show that there is strong evidence to suggest that, at least from a computational standpoint, attaining polynomial regret (e.g. T1αT^{1-\alpha} for α>0\alpha>0 independent of dimension) is computationally prohibitive.

Theorem 4.2.

There exists a reduction from Max-3Sat on m clauses and n literals to the problem of finding a state-feedback controller K which is within a small constant factor of optimal for the cost \sum_{t=1}^{T}c_{t}(x^{K}_{t},u^{K}_{t}) on a sequence of sequentially stable LTV systems and convex costs (A_{t},B_{t},c_{t}) with no disturbance (w_{t}\equiv 0), with state dimension n+1, input dimension 2, and horizon T=\Theta(mn). Therefore, unless \mathsf{P}=\mathsf{NP}, the latter cannot be solved in time polynomial in n [17].

A more precise statement, construction, and proof are given in Section F.4. Theorem 4.2 demonstrates that solving the offline optimization problem over state feedback controllers K to within constant precision is \mathsf{NP}-hard. In particular, this means that any sublinear regret algorithm which is proper and convergent, in the sense that u_{t}=K_{t}x_{t} for a sequence K_{t} that converges to a limit as T\to\infty, must be computationally inefficient. This is true even if the costs and dynamics are known in advance. Our result suggests that it is computationally hard to obtain sublinear regret, but it does not rigorously imply it. First, there may be more clever convex relaxations (other than Drc and Dac, which provably cannot work) that yield efficient algorithms with sublinear regret. Second, this lower bound does not rule out the possibility of a computationally inefficient algorithm which nevertheless attains polynomial regret.

5 Discussion and Future Work

This paper provided guarantees for and studied the limitations of sublinear additive regret in online control of an unknown, linear time-varying (LTV) dynamical system.

Our setting was motivated by the fact that the first-order Taylor approximation (Jacobian linearization) of smooth, nonlinear systems about any smooth trajectory is LTV. One would therefore hope that low-regret guarantees against LTV systems may imply convergence to first-order stationary points of general nonlinear control objectives [34], which in turn may enjoy stability properties [45]. Making this connection rigorous poses several challenges. Among them, one would need to extend our low-regret guarantees against oblivious adversaries to hold against adaptive adversaries, the latter modeling how nonlinear system dynamics evolve in response to the learner’s control inputs. This may require parting from our current analysis, which leverages the independence between exploratory inputs and changes in system dynamics.

Because we show that linear-in-T regret is unavoidable for changing systems with large system variability, at least for the main convex policy parametrizations, it would be interesting to study our online setting under other measures of performance. In particular, the competitive ratio, i.e. the ratio of total algorithm cost to optimal cost in hindsight (as opposed to the difference between the two measured by regret), may yield a complementary set of tradeoffs, or lead to new and exciting principles for adaptive controller design. Does system variability play the same deciding role in competitive analysis as it does in regret? And, in either competitive or regret analyses, what is the correct measure of system variability (e.g. variability in which norm/geometry, or of which system parameters) which best captures the sensitivity of online cost to system changes?

Acknowledgments

Elad Hazan and Edgar Minasyan have been supported in part by NSF grant #1704860. This work was done in part when Paula Gradu was at Google AI Princeton and Princeton University. Max Simchowitz is generously supported by an Open Philanthropy AI fellowship.

References

  • [1] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
  • [2] Dmitry Adamskiy, Wouter M Koolen, Alexey Chernov, and Vladimir Vovk. A closer look at adaptive regret. The Journal of Machine Learning Research, 17(1):706–726, 2016.
  • [3] Naman Agarwal, Brian Bullins, Elad Hazan, Sham Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning, pages 111–119, 2019.
  • [4] Naman Agarwal, Elad Hazan, Anirudha Majumdar, and Karan Singh. A regret minimization approach to iterative learning control. arXiv preprint arXiv:2102.13478, 2021.
  • [5] Nicholas M. Boffi, Stephen Tu, and Jean-Jacques E. Slotine. Regret bounds for adaptive nonlinear control, 2020.
  • [6] Olivier Bousquet and Manfred K Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3(Nov):363–396, 2002.
  • [7] Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • [8] Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.
  • [9] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
  • [10] Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently with only T\sqrt{T} regret. In International Conference on Machine Learning, pages 1300–1309, 2019.
  • [11] Amit Daniely, Alon Gonen, and Shai Shalev-Shwartz. Strongly adaptive online learning. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1405–1411, Lille, France, 07–09 Jul 2015. PMLR.
  • [12] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
  • [13] Sarah Dean and Benjamin Recht. Certainty equivalent perception-based control. arXiv preprint arXiv:2008.12332, 2020.
  • [14] Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Experts in a markov decision process. In Advances in neural information processing systems, pages 401–408, 2005.
  • [15] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467–1476, 2018.
  • [16] Paula Gradu, Elad Hazan, and Edgar Minasyan. Adaptive regret for control of time-varying dynamics. arXiv preprint arXiv:2007.04393, 2020.
  • [17] Johan Håstad. Some optimal inapproximability results. Journal of the ACM (JACM), 48(4):798–859, 2001.
  • [18] Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
  • [19] Elad Hazan. Introduction to online convex optimization, 2019.
  • [20] Elad Hazan, Sham Kakade, and Karan Singh. The nonstochastic control problem. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, pages 408–421. PMLR, 2020.
  • [21] Elad Hazan and Comandur Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of the 26th annual international conference on machine learning, pages 393–400. ACM, 2009.
  • [22] Mark Herbster and Manfred K Warmuth. Tracking the best expert. Machine learning, 32(2):151–178, 1998.
  • [23] Sham Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, and Wen Sun. Information theoretic regret bounds for online nonlinear control. arXiv preprint arXiv:2006.12466, 2020.
  • [24] Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Logarithmic regret bound in partially observable linear dynamical systems, 2020.
  • [25] Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Regret minimization in partially observable linear quadratic control, 2020.
  • [26] Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, pages 10154–10164, 2019.
  • [27] Zakaria Mhammedi, Dylan J Foster, Max Simchowitz, Dipendra Misra, Wen Sun, Akshay Krishnamurthy, Alexander Rakhlin, and John Langford. Learning the linear quadratic regulator from nonlinear observations. arXiv preprint arXiv:2010.03799, 2020.
  • [28] Richard H Middleton and Graham C Goodwin. Adaptive control of time-varying linear systems. IEEE Transactions on Automatic Control, 33(2):150–155, 1988.
  • [29] Kevin L Moore. Iterative learning control for deterministic systems. Springer Science & Business Media, 2012.
  • [30] Samet Oymak and Necmiye Ozay. Non-asymptotic identification of lti systems from a single trajectory. In 2019 American control conference (ACC), pages 5655–5661. IEEE, 2019.
  • [31] Necmiye Ozay, Constantino Lagoa, and Mario Sznaier. Set membership identification of switched linear systems with known number of subsystems. Automatica, 51:180–191, 2015.
  • [32] Guannan Qu, Yuanyuan Shi, Sahin Lale, Anima Anandkumar, and Adam Wierman. Stable online control of linear time-varying systems. arXiv preprint arXiv:2104.14134, 2021.
  • [33] Alexander Rakhlin and Karthik Sridharan. Online non-parametric regression. In Conference on Learning Theory, pages 1232–1264. PMLR, 2014.
  • [34] Vincent Roulet, Siddhartha Srinivasa, Dmitriy Drusvyatskiy, and Zaid Harchaoui. Iterative linearized control: stable algorithms and complexity guarantees. In International Conference on Machine Learning, pages 5518–5527. PMLR, 2019.
  • [35] Tuhin Sarkar, Alexander Rakhlin, and Munther Dahleh. Nonparametric system identification of stochastic switched linear systems. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 3623–3628. IEEE, 2019.
  • [36] Tuhin Sarkar, Alexander Rakhlin, and Munther A Dahleh. Finite-time system identification for partially observed lti systems of unknown order. arXiv preprint arXiv:1902.01848, 2019.
  • [37] Max Simchowitz. Making non-stochastic control (almost) as easy as stochastic, 2020.
  • [38] Max Simchowitz and Dylan Foster. Naive exploration is optimal for online lqr. In International Conference on Machine Learning, pages 8937–8948. PMLR, 2020.
  • [39] Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control, 2020.
  • [40] Y. Tassa, T. Erez, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4906–4913, 2012.
  • [41] Emanuel Todorov and Weiwei Li. A generalized iterative lqg method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the 2005, American Control Conference, 2005., pages 300–306. IEEE, 2005.
  • [42] Kostas S Tsakalis and Petros A Ioannou. Linear time-varying systems: control and adaptation. Prentice-Hall, Inc., 1993.
  • [43] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
  • [44] Yuh-Shyang Wang, Nikolai Matni, and John C Doyle. A system-level approach to controller synthesis. IEEE Transactions on Automatic Control, 64(10):4079–4093, 2019.
  • [45] Tyler Westenbroek, Max Simchowitz, Michael I Jordan, and S Shankar Sastry. On the stability of nonlinear receding horizon control: a geometric perspective. arXiv preprint arXiv:2103.15010, 2021.
  • [46] Dante Youla, Hamid Jabr, and Jr Bongiorno. Modern wiener-hopf design of optimal controllers–part ii: The multivariable case. IEEE Transactions on Automatic Control, 21(3):319–338, 1976.
  • [47] Lijun Zhang, Tie-Yan Liu, and Zhi-Hua Zhou. Adaptive regret of convex and smooth functions. arXiv preprint arXiv:1904.11681, 2019.
  • [48] Lijun Zhang, Tianbao Yang, Rong Jin, and Zhi-Hua Zhou. Dynamic regret of strongly adaptive methods. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5882–5891. PMLR, 10–15 Jul 2018.
  • [49] Kemin Zhou, John Comstock Doyle, Keith Glover, et al. Robust and optimal control, volume 40. Prentice hall New Jersey, 1996.

Appendix A Extensions

A.1 Affine Offsets

For many systems, performance improves dramatically for controllers with constant affine terms, that is

u^{M}_{t}=\bar{u}^{M}+\sum_{i=0}^{h-1}M^{[i]}x^{\mathrm{nat}}_{t-i},

where \bar{u}^{M} is a constant affine term encoded by M=(M^{[0]},\dots,M^{[h-1]},\bar{u}^{M}). All our arguments apply more generally to control policies of this form. Moreover, we can even allow linear combinations of time-varying terms:

u^{M}_{t}=\sum_{i=0}^{h-1}M^{[i]}x^{\mathrm{nat}}_{t-i}+\sum_{i=0}^{h^{\prime}-1}M^{[i]^{\prime}}\psi_{i}(t),\qquad M=(M^{[0]},\dots,M^{[h-1]},M^{[0]^{\prime}},\dots,M^{[h^{\prime}-1]^{\prime}}),

where now \psi_{i}(t) are fixed, possibly time-varying basis functions (which do not depend on M). The case of constant affine terms corresponds to h^{\prime}=1 and \psi_{i}(t)=1 for all t.
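A small sketch of this augmented parametrization (our own helper, assuming vector-valued basis functions \psi_{i}(t) supplied by the user):

```python
import numpy as np

def drc_with_basis(M_state, M_basis, x_nat_hist, t, basis_fns):
    """u_t = sum_i M_state[i] x^nat_{t-i} + sum_i M_basis[i] psi_i(t); a constant affine
    offset corresponds to a single basis function psi_0(t) = 1."""
    u = sum(M_state[i] @ x_nat_hist[-(i + 1)]
            for i in range(min(len(M_state), len(x_nat_hist))))
    return u + sum(M_basis[i] @ np.atleast_1d(basis_fns[i](t)) for i in range(len(M_basis)))

# Example: constant affine offset of 0.3 on a one-dimensional control.
u = drc_with_basis([np.zeros((1, 2))], [np.array([[0.3]])],
                   [np.array([1.0, -1.0])], t=5, basis_fns=[lambda t: 1.0])
```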

A.2 Changing Stabilizing Controllers

Our results extend naturally to the following setting: for each time t=1,2,,Tt=1,2,\dots,T, the algorithm has access to a static feedback control policy KtstbK^{\mathrm{stb}}_{t} such that the closed loop matrices (At+BtKtstb)(A_{t}+B_{t}K^{\mathrm{stb}}_{t}) are sequentially stable, that is

\displaystyle\Phi^{\mathrm{stb}}_{s+h,s}:=\prod_{i=s}^{s+h}(A_{i}+B_{i}K^{\mathrm{stb}}_{i})

has geometric decay. We let xtnatx^{\mathrm{nat}}_{t} denote the iterates produced by the updates

\displaystyle x^{\mathrm{nat}}_{t+1}=(A_{t}+B_{t}K^{\mathrm{stb}}_{t})x^{\mathrm{nat}}_{t},\quad u^{\mathrm{nat}}_{t}=K^{\mathrm{stb}}_{t}x^{\mathrm{nat}}_{t},\quad x^{\mathrm{nat}}_{1}=0.

We consider stabilized policies of the form

utM=Ktstbxt+i=0h1M[i]xtinat.\displaystyle u_{t}^{M}=K^{\mathrm{stb}}_{t}x_{t}+\sum_{i=0}^{h-1}M^{[i]}x^{\mathrm{nat}}_{t-i}.

To facilitate the extension, we define the stabilized Markov operator

\displaystyle G_{\star,t}^{[0]}=\begin{bmatrix}0\\ I_{d_{u}}\end{bmatrix},\quad G_{\star,t}^{[i]}:=\begin{bmatrix}I_{d_{x}}\\ K^{\mathrm{stb}}_{t}\end{bmatrix}\Phi^{\mathrm{stb}}_{t,t-i+1}B_{t-i}.

This Markov operator satisfies

\displaystyle\begin{bmatrix}x_{t}\\ u_{t}\end{bmatrix}=\begin{bmatrix}x^{\mathrm{nat}}_{t}\\ u^{\mathrm{nat}}_{t}\end{bmatrix}+\sum_{i=0}^{t-1}G_{\star,t}^{[i]}\tilde{u}_{t-i},\quad\text{where }u_{t}=K^{\mathrm{stb}}_{t}x_{t}+\tilde{u}_{t}.

With similar techniques, we obtain estimates \widehat{G}_{t} and back out estimates of Nature's sequence (x^{\mathrm{nat}},u^{\mathrm{nat}}) via

\displaystyle\begin{bmatrix}\hat{x}^{\mathrm{nat}}_{t}\\ \hat{u}^{\mathrm{nat}}_{t}\end{bmatrix}=\mathrm{clip}_{r}\left(\begin{bmatrix}x_{t}\\ u_{t}\end{bmatrix}-\sum_{i=0}^{h}\widehat{G}_{t}^{[i]}(u_{t-i}-K^{\mathrm{stb}}_{t-i}x_{t-i})\right),

for a truncation radius r>0 suitably chosen. Recall that we apply the \mathrm{clip}_{r}(\cdot) operator to ensure that the estimates of Nature's sequence do not grow unbounded and compound through feedback. We then select inputs

u^t(M)=Ktstbxt+i=0h1M[i]x^tinat,\displaystyle\hat{u}_{t}(M)=K^{\mathrm{stb}}_{t}x_{t}+\sum_{i=0}^{h-1}M^{[i]}\hat{x}^{\mathrm{nat}}_{t-i},

maintain an estimate of the stabilized Markov operator with low adaptive regret, and apply our Oco-with-memory algorithm to the losses

\displaystyle\widehat{f}_{t}(M)=c_{t}\left(\begin{bmatrix}\hat{x}^{\mathrm{nat}}_{t}\\ \hat{u}^{\mathrm{nat}}_{t}\end{bmatrix}+\sum_{i=0}^{h}\widehat{G}_{t}^{[i]}\hat{u}_{t-i}(M)\right).
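A minimal sketch of the extraction step above, under the assumption (ours) that \mathrm{clip}_{r} denotes projection onto the Euclidean ball of radius r; the helper names are illustrative.

```python
import numpy as np

def clip_r(v, r):
    """One common reading of clip_r: project v onto the Euclidean ball of radius r."""
    n = np.linalg.norm(v)
    return v if n <= r else (r / n) * v

def stabilized_nature_estimate(x_t, u_t, G_hat, offset_history, r):
    """Back out the stacked estimate [x^nat_t; u^nat_t] from the relation above.

    G_hat          : list of (d_x + d_u, d_u) matrices G_hat_t^[i], i = 0, ..., h
    offset_history : past offsets u_{t-i} - K^stb_{t-i} x_{t-i}, most recent first
    """
    stacked = np.concatenate([x_t, u_t])
    for i, G_i in enumerate(G_hat):
        if i < len(offset_history):
            stacked = stacked - G_i @ offset_history[i]
    return clip_r(stacked, r)
```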

A.3 Partial Observation

Our results further extend to partially observed systems. We explain this extension for sequentially stable systems; extensions to systems sequentially stabilized by time-varying linear dynamic controllers follow from the exposition in [39], Appendix C.

For partially observed systems, we have the same state transition dynamics x_{t+1}=A_{t}x_{t}+B_{t}u_{t}+w_{t}, but, for a time-varying observation matrix C_{t} and observation noise e_{t}, we observe outputs y_{t}\in\mathbb{R}^{d_{y}}

yt=Ctxt+et.\displaystyle y_{t}=C_{t}x_{t}+e_{t}.

Costs c_{t}(y_{t},u_{t}) are suffered on inputs and outputs. As for full observation, Nature's sequences x^{\mathrm{nat}} and y^{\mathrm{nat}} correspond to the states x_{t} and outputs y_{t} which arise under identically zero input u_{t}\equiv 0. The Drc parametrization selects linear combinations of Nature's y's:

utM=i=0m1M[i]ytinat.\displaystyle u_{t}^{M}=\sum_{i=0}^{m-1}M^{[i]}y^{\mathrm{nat}}_{t-i}.

Recalling

\displaystyle\Phi_{s+h,s}:=\prod_{i=s}^{s+h}A_{i},

the relevant Markov operators G_{\star,t} are the ones mapping inputs to outputs:

\displaystyle G_{\star,t}^{[0]}=0,\quad G_{\star,t}^{[i]}=C_{t}\Phi_{t,t-i+1}B_{t-i}.

With similar techniques, we obtain estimates \widehat{G}_{t} and back out estimates of Nature's sequence y^{\mathrm{nat}} via

\displaystyle\hat{y}^{\mathrm{nat}}_{t}=\mathrm{clip}_{r}\left(y_{t}-\sum_{i=0}^{h}\widehat{G}_{t}^{[i]}u_{t-i}\right)

for a truncation radius r>0 suitably chosen, we select inputs

\displaystyle\hat{u}_{t}(M)=\sum_{i=0}^{h-1}M^{[i]}\hat{y}^{\mathrm{nat}}_{t-i},

update parameters with the Oco-with-memory losses

\displaystyle\widehat{f}_{t}(M)=c_{t}\left(\hat{y}^{\mathrm{nat}}_{t}+\sum_{i=0}^{h}\widehat{G}_{t}^{[i]}\hat{u}_{t-i}(M),\;\hat{u}_{t}(M)\right).

A.4 The Dac parametrization

Here we sketch an algorithm to compete with Dac-parametrized control policies [3]. For simplicity, we focus on sequentially stable systems, though the discussion extends to systems sequentially stabilized by sequences of controllers (K^{\mathrm{stb}}_{t}). Note that Dac does not apply under partial observation.

Recall that, in the Dac parametrization, the inputs are selected as linear combinations of past disturbances:

utM=i=0h1M[i]wti1.\displaystyle u_{t}^{M}=\sum_{i=0}^{h-1}M^{[i]}w_{t-i-1}.

To implement Dac, we therefore need empirical estimates \hat{w}_{t} of w_{t}. As per [20], it suffices to construct estimates (\hat{A}_{t},\hat{B}_{t}) of (A_{t},B_{t}), and choose

w^t=clipr(xtA^t1xt1B^t1ut1),\displaystyle\hat{{w}}_{t}=\mathrm{clip}_{r}\left(x_{t}-\hat{A}_{t-1}x_{t-1}-\hat{B}_{t-1}u_{t-1}\right),

again clipped at a suitable radius r>0r>0 to block compounding feedback. Given these estimates, our algorithm extends to Dac control in the expected way.
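A small sketch of the disturbance-estimation and Dac steps just described, again reading \mathrm{clip}_{r} as a Euclidean-ball projection; all names and the clipping rule are illustrative assumptions.

```python
import numpy as np

def clip_r(v, r):
    """Project v onto the Euclidean ball of radius r (our reading of clip_r)."""
    n = np.linalg.norm(v)
    return v if n <= r else (r / n) * v

def estimate_disturbance(x_next, x_prev, u_prev, A_hat, B_hat, r):
    """w_hat = clip_r(x_next - A_hat x_prev - B_hat u_prev), clipped to block compounding feedback."""
    return clip_r(x_next - A_hat @ x_prev - B_hat @ u_prev, r)

def dac_control(M_list, w_history):
    """u_t^M = sum_i M^[i] w_hat_{t-i-1}, with w_history[i] = w_hat_{t-i-1} (most recent first)."""
    u = np.zeros(M_list[0].shape[0])
    for i, M_i in enumerate(M_list):
        if i < len(w_history):
            u = u + M_i @ w_history[i]
    return u
```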

How does one obtain the estimates (A^t,B^t)(\hat{A}_{t},\hat{B}_{t})? First, we observe that since

G,t[0]=0,G,t[i]=Φt,ti+1Bti,Φs+h,s=i=ss+hAi,\displaystyle G_{\star,t}^{[0]}=0,\quad G_{\star,t}^{[i]}=\Phi_{t,t-i+1}B_{t-i},\quad\Phi_{s+h,s}=\prod_{i=s}^{s+h}A_{i},

we have Bt=G,t[1]B_{t}=G_{\star,t}^{[1]}. Hence, we can select B^t\hat{B}_{t} as G^t[1]\widehat{G}_{t}^{[1]}.

The estimate of A_{t} is more involved. For linear, time-invariant systems, A can be recovered from G_{\star} via the Ho-Kalman procedure, as is done in [20] (see also [30, 36]). For time-varying systems, this becomes more challenging. Omitting details in the interest of brevity, one can use the robustness properties of Ho-Kalman to argue that if the system matrices are slow moving (an assumption required for low regret), G_{\star,t} is close to a stationarized analogue \bar{G}_{\star,t} given by

G¯,t[i]:=Ati1Bt.\displaystyle\bar{G}_{\star,t}^{[i]}:=A_{t}^{i-1}B_{t}.

Hence, we can view any estimate G^t\widehat{G}_{t} of G,tG_{\star,t} as an estimate of G¯,t\bar{G}_{\star,t}, and apply Ho-Kalman to the latter.

Appendix B Adaptive Regret for Time-Varying DRC-OGD

We first extend the DRC-OGD algorithm from [39] to the setting of known linear time-varying dynamics. We spell out Algorithm 3 and prove it attains 𝒪~(T)\tilde{\mathcal{O}}\left(\sqrt{T}\right) adaptive regret over general convex costs under fully adversarial noise (Theorem B.1) with respect to the DRC policy class. The main technique, using OGD over the DRC parametrization, remains unchanged from the original paper and we show it generalizes naturally to LTV systems.

1: Input: stepsize η\eta, memory mm, radius RR_{\mathcal{M}}
2: Initialize M1(m,R)M_{1}\in\mathcal{M}(m,R_{\mathcal{M}}) arbitrarily
3: Receive initial state x1x_{1}, set x1nat=x1x^{\mathrm{nat}}_{1}=x_{1} and x0nat=0x^{\mathrm{nat}}_{\leq 0}=0
4:for tt = 1T1\ldots T do
5:      Play control ut=i=0m1Mt[i]xtinatu_{t}=\sum_{i=0}^{m-1}M_{t}^{[i]}x^{\text{nat}}_{t-i}
6:      Suffer ct(xt,ut)c_{t}(x_{t},u_{t}) and observe cost function ct(,)c_{t}(\cdot,\cdot)
7:      Construct ft(M0,,Mh)ct(x^t(M0:h1),ut(Mh))f_{t}(M_{0},\ldots,M_{h})\doteq c_{t}(\hat{x}_{t}(M_{0:{h-1}}),u_{t}(M_{h})) and let f~t(M)ft(M,,M)\tilde{f}_{t}(M)\doteq f_{t}(M,\ldots,M)
8:      Update Mt+1Π(Mtηf~t(Mt))M_{t+1}\leftarrow\Pi_{\mathcal{M}}\left(M_{t}-\eta\nabla\tilde{f}_{t}(M_{t})\right)
9:      Receive new system GtG_{t}
10:      Receive new state xt+1x_{t+1} and extract xt+1nat=xt+1i=0t1Gt[i]utix^{\mathrm{nat}}_{t+1}=x_{t+1}-\sum_{i=0}^{t-1}G_{t}^{[i]}u_{t-i} or receive xt+1natx^{\mathrm{nat}}_{t+1}
Algorithm 3 Disturbance Response Control via Online Gradient Descent (DRC-OGD)
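For concreteness, the following NumPy sketch instantiates Algorithm 3 on a scalar LTV toy system with known Markov operators and a quadratic cost; the system, stepsize, and projection radius are illustrative choices, and the gradient of the proxy loss is computed analytically for this particular cost.

```python
import numpy as np

# Scalar LTV toy instance: x_{s+1} = a_s x_s + b_s u_s + w_s, cost c_t(x, u) = x^2 + u^2.
rng = np.random.default_rng(1)
T, h, m, R_M, eta = 200, 5, 3, 1.0, 0.05
a = 0.3 + 0.1 * np.sin(0.05 * np.arange(T + 2))   # slowly varying and stable
b = np.ones(T + 2)
w = 0.1 * rng.standard_normal(T + 2)

x_nat = np.zeros(T + 2)                            # Nature's x: evolution under zero input
for s in range(1, T + 2):
    x_nat[s] = a[s - 1] * x_nat[s - 1] + w[s - 1]

def G(t, i):
    """G_t^[i]: effect of u_{t-i} on x_t (i >= 1), i.e. b_{t-i} times the product of a's in between."""
    return b[t - i] * np.prod(a[t - i + 1:t]) if 1 <= i <= t else 0.0

def u_of(M, s):
    """u_s(M) = sum_j M^[j] x^nat_{s-j}."""
    return sum(M[j] * x_nat[s - j] for j in range(m) if s - j >= 0)

M, x, total_cost = np.zeros(m), 0.0, 0.0
for t in range(1, T + 1):
    u = u_of(M, t)                                         # play u_t with the current M_t
    total_cost += x ** 2 + u ** 2                          # suffer c_t(x_t, u_t)
    # Proxy loss f~_t(M) = c_t(x_hat_t(M), u_t(M)) and its analytic gradient.
    x_hat = x_nat[t] + sum(G(t, i) * u_of(M, t - i) for i in range(1, h + 1))
    grad = np.zeros(m)
    for j in range(m):
        d_xhat = sum(G(t, i) * x_nat[t - i - j] for i in range(1, h + 1) if t - i - j >= 0)
        d_u = x_nat[t - j] if t - j >= 0 else 0.0
        grad[j] = 2 * x_hat * d_xhat + 2 * u * d_u
    M = M - eta * grad
    M = M * min(1.0, R_M / max(np.linalg.norm(M), 1e-12))  # project onto ||M|| <= R_M
    x = a[t] * x + b[t] * u + w[t]                         # system transition to x_{t+1}
print("average cost:", total_cost / T)
```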
Theorem B.1.

Running Algorithm 3 with η=dminR22LRsys2(h+1)5/4T\eta=\frac{\sqrt{d_{\mathrm{min}}}R_{\mathcal{M}}^{2}}{2LR^{2}_{\mathrm{sys}}(h+1)^{5/4}\sqrt{T}} guarantees the following regret bound on every interval I=[r,s]I=[r,s]:

t=rsct(xt,ut)minπΠdrct=rsct(xtπ,utπ)6LRsys2(3dminm(h+1)5/4T+ψ(h)|I|)\sum_{t=r}^{s}c_{t}(x_{t},u_{t})-\min_{\pi\in\Pi_{\mathrm{drc}}}\sum_{t=r}^{s}c_{t}(x_{t}^{\pi},u_{t}^{\pi})\leq 6LR_{\mathrm{sys}}^{2}\left(3\sqrt{d_{\min}}m(h+1)^{5/4}\sqrt{T}+\psi(h)|I|\right)

where Rsys=RGRRnatR_{\mathrm{sys}}=R_{G}R_{\mathcal{M}}R_{\mathrm{nat}} and dmin=min{dx,du}d_{\mathrm{min}}=\min\{d_{x},d_{u}\}.

B.1 Adaptive Regret of OGD for functions with memory

We first prove that OGD with a fixed stepsize attains O(T)O(\sqrt{T}) adaptive regret for functions with memory.

1: Input: stepsize η\eta, memory mm, set 𝒦\mathcal{K}
2: Initialize x1𝒦x_{1}\in\mathcal{K} arbitrarily, set x0=x1x_{\leq 0}=x_{1} by convention
3:for tt = 1T1\ldots T do
4:      Play xtx_{t}
5:      Suffer ft(xth,,xt)f_{t}(x_{t-h},\ldots,x_{t}) and observe loss function ft(,,)f_{t}(\cdot,\ldots,\cdot)
6:      Construct proxy loss function f~t(x)ft(x,,x)\tilde{f}_{t}(x)\doteq f_{t}(x,\ldots,x)
7:      Update xt+1=Π𝒦(xtηf~t(xt))x_{t+1}=\Pi_{\mathcal{K}}\left(x_{t}-\eta\nabla\tilde{f}_{t}(x_{t})\right)
Algorithm 4 Online Gradient Descent for OCOwMem (Mem-OGD)
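Before analyzing it, here is a short sketch of Algorithm 4 on a toy family of memory losses; the loss family (an h-step tracking loss whose proxy is a simple quadratic) and all constants are illustrative assumptions.

```python
import numpy as np

def mem_ogd(proxy_grad, T, d, eta, radius):
    """Mem-OGD: play x_t, then take a projected gradient step on the proxy f~_t(x) = f_t(x, ..., x)."""
    x, iterates = np.zeros(d), []
    for t in range(T):
        iterates.append(x.copy())                 # play x_t
        x = x - eta * proxy_grad(t, x)            # gradient of the proxy loss at x_t
        n = np.linalg.norm(x)
        if n > radius:                            # project back onto the decision set K
            x = x * (radius / n)
    return iterates

# Toy memory loss f_t(x_{t-h}, ..., x_t) = (1/(h+1)) sum_i ||x_{t-i} - theta_t||^2,
# whose proxy is f~_t(x) = ||x - theta_t||^2 with gradient 2 (x - theta_t).
theta = lambda t: np.array([np.sin(0.01 * t), np.cos(0.01 * t)])
iterates = mem_ogd(lambda t, x: 2 * (x - theta(t)), T=500, d=2, eta=0.05, radius=10.0)
print("final iterate:", iterates[-1])
```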
Theorem B.2.

Let {ft:𝒦h+1[0,1]}t=1T\{f_{t}:\mathcal{K}^{h+1}\rightarrow[0,1]\}_{t=1}^{T} be a sequence of L coordinate-wise Lipschitz loss functions with memory such that f~t\tilde{f}_{t} (Line 6) is convex. Then, on any interval I=[r,s][T]I=[r,s]\subseteq[T], Algorithm 4 enjoys the following adaptive policy regret guarantee:

t=rsft(xth,,xt)minx𝒦t=rsf~t(x)D2η+2ηL2(h+1)5/2|I|\sum_{t=r}^{s}f_{t}(x_{t-h},\ldots,x_{t})-\min_{x\in\mathcal{K}}\sum_{t=r}^{s}\tilde{f}_{t}(x)\leq\frac{D^{2}}{\eta}+2\eta L^{2}(h+1)^{5/2}|I|

where D=diam(𝒦)D=\text{diam}(\mathcal{K}).

First we state and prove the following well-known fact about vanilla projected OGD over (memory-less) loss functions:

Fact B.1.

Let \{\tilde{f}_{t}:\mathcal{K}\rightarrow[0,1]\}_{t=1}^{T} be a sequence of convex loss functions with \|\nabla\tilde{f}_{t}(x)\|\leq G. Then, on any interval I=[r,s]\subseteq[T], projected OGD enjoys the following guarantee:

\sum_{t=r}^{s}\tilde{f}_{t}(x_{t})-\min_{x\in\mathcal{K}}\sum_{t=r}^{s}\tilde{f}_{t}(x)\leq\frac{D^{2}}{\eta}+\eta|I|G^{2}

where D=diam(𝒦)D=\text{diam}(\mathcal{K}).

Proof.

Consider an arbitrary interval I=[r,s]\subseteq[T]. Let \mathbf{x}^{\star}=\arg\min_{x\in\mathcal{K}}\sum_{t=r}^{s}\tilde{f}_{t}(x) and denote \nabla_{t}\doteq\nabla\tilde{f}_{t}(x_{t}) for simplicity. By convexity we have

f~t(xt)f~t(x)t(xtx)\tilde{f}_{t}(x_{t})-\tilde{f}_{t}(x^{\star})\leq\nabla_{t}^{\top}(x_{t}-x^{\star}) (B.1)

By the Pythagorean theorem

xt+1x2\displaystyle\|x_{t+1}-x^{\star}\|^{2} =Π𝒦(xtηt)x2\displaystyle=\|\Pi_{\mathcal{K}}(x_{t}-\eta\nabla_{t})-x^{\star}\|^{2}
xtηtx2\displaystyle\leq\|x_{t}-\eta\nabla_{t}-x^{\star}\|^{2}
xtx2+η2t22ηt(xtx)\displaystyle\leq\|x_{t}-x^{\star}\|^{2}+\eta^{2}\|\nabla_{t}\|^{2}-2\eta\nabla_{t}^{\top}(x_{t}-x^{\star}) (B.2)

Hence we can bound the interval regret as:

2\sum_{t=r}^{s}\left(\tilde{f}_{t}(x_{t})-\tilde{f}_{t}(x^{\star})\right) \leq 2\sum_{t=r}^{s}\nabla_{t}^{\top}(x_{t}-x^{\star})
\leq\sum_{t=r}^{s}\left(\frac{\|x_{t}-x^{\star}\|^{2}-\|x_{t+1}-x^{\star}\|^{2}}{\eta}+\eta G^{2}\right)
\leq\frac{\|x_{r}-x^{\star}\|^{2}}{\eta}+|I|\eta G^{2}
\leq\frac{D^{2}}{\eta}+\eta|I|G^{2}

which yields the desired adaptive regret bound. ∎

Using this simple fact and Lipschitzness we are able to easily prove the desired guarantee for Algorithm 4.

Proof of Theorem B.2.

First note that Algorithm 4 is just doing gradient descent on the proxy convex loss functions \tilde{f}_{t}:\mathcal{K}\rightarrow[0,1]. Hence, as long as we identify the gradient bound, we can apply Fact B.1 to get a bound on \tilde{f}-regret. Observe that

|f~t(x)f~t(y)|\displaystyle|\tilde{f}_{t}(x)-\tilde{f}_{t}(y)| =|ft(x,,x)ft(y,,y)|\displaystyle=|f_{t}(x,\ldots,x)-f_{t}(y,\ldots,y)|
|ft(x,,x)ft(y,,x)|+|ft(y,x,,x)ft(y,y,,x)|+\displaystyle\leq|f_{t}(x,\ldots,x)-f_{t}(y,\ldots,x)|+|f_{t}(y,x,\ldots,x)-f_{t}(y,y,\ldots,x)|+
+|ft(y,,y,x)ft(y,,y)|\displaystyle\ldots+|f_{t}(y,\ldots,y,x)-f_{t}(y,\ldots,y)|
L(h+1)xy\displaystyle\leq L(h+1)\|x-y\|

so \tilde{f}_{t} is L(h+1)-Lipschitz and hence has a gradient bound of L(h+1). So we can apply Fact B.1 to get

t=rsf~t(xt)minx𝒦t=rsf~(x)D2η+η|I|L2(h+1)2\sum_{t=r}^{s}\tilde{f}_{t}(x_{t})-\min_{x\in\mathcal{K}}\sum_{t=r}^{s}\tilde{f}(x)\leq\frac{D^{2}}{\eta}+\eta|I|L^{2}(h+1)^{2} (B.3)

We can use this to bound the adaptive policy regret. First note that

\|x_{t}-x_{t-1}\|\leq\eta\|\nabla\tilde{f}_{t-1}(x_{t-1})\|\leq\eta L(h+1)

and by the triangle inequality

xtxtij=1ixtj+1xtjηL(h+1)i\|x_{t}-x_{t-i}\|\leq\sum_{j=1}^{i}\|x_{t-j+1}-x_{t-j}\|\leq\eta L(h+1)\cdot i (B.4)

Using Eq. B.4 and Lipschitzness we have:

ft(xth,,xt)f~t(xt)\displaystyle f_{t}(x_{t-h},\ldots,x_{t})-\tilde{f}_{t}(x_{t}) L(xth,,xt)(xt,,xt)\displaystyle\leq L\|(x_{t-h},\ldots,x_{t})-(x_{t},\ldots,x_{t})\|
Li=1hxtxti2\displaystyle\leq L\sqrt{\sum_{i=1}^{h}\|x_{t}-x_{t-i}\|^{2}}
ηL2(h+1)i=1hi2\displaystyle\leq\eta L^{2}(h+1)\cdot\sqrt{\sum_{i=1}^{h}i^{2}}
ηL2(h+1)5/2\displaystyle\leq\eta L^{2}(h+1)^{5/2} (B.5)

Combining everything we get

\sum_{t=r}^{s}f_{t}(x_{t-h},\ldots,x_{t})-\min_{x\in\mathcal{K}}\sum_{t=r}^{s}\tilde{f}_{t}(x) =\underbrace{\sum_{t=r}^{s}f_{t}(x_{t-h},\ldots,x_{t})-\tilde{f}_{t}(x_{t})}_{\text{(dist. to proxy loss [Eq. B.5])}}+\underbrace{\sum_{t=r}^{s}\tilde{f}_{t}(x_{t})-\min_{x\in\mathcal{K}}\sum_{t=r}^{s}\tilde{f}_{t}(x)}_{\text{($\tilde{f}$-regret [Eq. B.3])}}
\leq 2\eta L^{2}(h+1)^{5/2}|I|+\frac{D^{2}}{\eta} ∎

B.2 Proof of Theorem B.1

We first prove that the constructed loss function satisfies key properties for efficient optimization.

Lemma B.2 (Convexity).

The loss functions \tilde{f}_{t} constructed in Line 7 of Algorithm 3 are convex in M.

Proof.

By definition we have that:

x^t(M)\displaystyle\hat{x}_{t}(M) =xtnat+i=1hGt[i]uti\displaystyle=x^{\mathrm{nat}}_{t}+\sum_{i=1}^{h}G_{t}^{[i]}u_{t-i}
=xtnat+i=1hGt[i](j=0m1M[j]xtijnat)\displaystyle=x^{\mathrm{nat}}_{t}+\sum_{i=1}^{h}G_{t}^{[i]}\left(\sum_{j=0}^{m-1}M^{[j]}x^{\mathrm{nat}}_{t-i-j}\right) (B.6)

which is affine in MM. Even more simply, we have ut(M)=i=0m1M[i]xtinatu_{t}(M)=\sum_{i=0}^{m-1}M^{[i]}x^{\mathrm{nat}}_{t-i}.

Since \hat{x}_{t}(M) is an affine and u_{t}(M) a linear function of M, and composition with the convex cost c_{t} preserves convexity, we get the desired property. ∎

Lemma B.3 (Lipschitzness).

The loss functions f_{t} constructed in Line 7 of Algorithm 3 are L_{f} coordinate-wise Lipschitz for L_{f}=3LR_{\text{nat}}^{2}R_{G}^{2}R_{\mathcal{M}}\sqrt{m}.

Proof.

Observe that by Eq. B.6 we have \|\hat{x}_{t}(M_{0:h})\|\leq R_{\text{nat}}(1+R_{G}R_{\mathcal{M}}). Straightforwardly, \|u_{t}(M_{h})\|\leq R_{\text{nat}}R_{\mathcal{M}} as well.

For an arbitrary i0,h¯i\in\overline{0,h}, denoting M0:h=(M0,,Mi,Mh)M_{0:h}=(M_{0},\ldots,M_{i},\ldots M_{h}), M~0:h=(M0,,M~i,,Mh)\tilde{M}_{0:h}=(M_{0},\ldots,\tilde{M}_{i},\ldots,M_{h}) and using the sub-quadratic lipschitzness of the costs we have:

|ft(M0:h)ft(M~0:h)|\displaystyle|f_{t}(M_{0:h})-f_{t}(\tilde{M}_{0:h})| =|ct(x^t(M0:h),ut(M0:h))ct(x^t(M~0:h),ut(M~0:h))|\displaystyle=|c_{t}(\hat{x}_{t}(M_{0:h}),u_{t}(M_{0:h}))-c_{t}(\hat{x}_{t}(\tilde{M}_{0:h}),u_{t}(\tilde{M}_{0:h}))|
3LRnatRGRGt[hi](j=0m1(Mi[j]M~i[j])xtijnat)\displaystyle\leq 3LR_{\text{nat}}R_{G}R_{\mathcal{M}}\bigg{|}\bigg{|}G_{t}^{[h-i]}\left(\sum_{j=0}^{m-1}(M_{i}^{[j]}-\tilde{M}_{i}^{[j]})x^{\mathrm{nat}}_{t-i-j}\right)\bigg{|}\bigg{|} (i<h)\displaystyle(i<h)
\displaystyle\text{or }\leq 3LR_{\text{nat}}R_{G}R_{\mathcal{M}}\left\|\sum_{j=0}^{m-1}(M_{i}^{[j]}-\tilde{M}_{i}^{[j]})x^{\mathrm{nat}}_{t-i-j}\right\| \displaystyle(i=h)
3LRnat2RG2RmMiM~iF\displaystyle\leq 3LR_{\text{nat}}^{2}R_{G}^{2}R_{\mathcal{M}}\sqrt{m}\|M_{i}-\tilde{M}_{i}\|_{F}

so the function is coordinate-wise Lipschitz with constant Lf=3LRnat2RG2RmL_{f}=3LR_{\text{nat}}^{2}R_{G}^{2}R_{\mathcal{M}}\sqrt{m}. ∎

Lemma B.4 (Euclidean Diameter).

The euclidean diameter of (m,R)\mathcal{M}(m,R_{\mathcal{M}}) is at most D=2mmin{dx,du}RD=2\sqrt{m\cdot\min\{d_{x},d_{u}\}}R_{\mathcal{M}}.

Proof.

Note that for an arbitrary M\in\mathcal{M}(m,R_{\mathcal{M}}), we have

MF\displaystyle\|M\|_{F} =i=0m1M[i]F2\displaystyle=\sqrt{\sum_{i=0}^{m-1}\|M^{[i]}\|_{F}^{2}}
mmaxi[m1]M[i]F2\displaystyle\leq\sqrt{m\cdot\max_{i\in[m-1]}\|M^{[i]}\|_{F}^{2}}
\displaystyle\leq\sqrt{m}\max_{i\in[m-1]}\sqrt{\min\{d_{x},d_{u}\}}\|M^{[i]}\|_{op}
mmin{dx,du}M1,op\displaystyle\leq\sqrt{m\cdot\min\{d_{x},d_{u}\}}\|M\|_{\ell_{1},op}
mmin{dx,du}R\displaystyle\leq\sqrt{m\cdot\min\{d_{x},d_{u}\}}R_{\mathcal{M}}

and the euclidean diameter is at most twice the maximal euclidean norm, concluding our statement. ∎

The three lemmas above allow us to use the results in Section B.1 to obtain adaptive regret guarantees in terms of f_{t}, which truncates the effect on the state of actions taken more than h steps in the past. To convert guarantees in terms of f_{t} into ones in terms of c_{t}, we prove that the effect of the distant past is minimal:

Lemma B.5 (Truncation Error).

For a changing DRC policy that acts according to M1,,MtM_{1},\ldots,M_{t} up to time tt we have that:

ct(xt,ut)ct(x^t(Mth:t1),ut(Mt))3LRnat2R2RGψ(h)c_{t}(x_{t},u_{t})-c_{t}\left(\hat{x}_{t}(M_{t-h:t-1}),u_{t}(M_{t})\right)\leq 3LR_{\text{nat}}^{2}R_{\mathcal{M}}^{2}R_{G}\psi(h)
Proof.

By the sub-quadratic Lipschitzness (and noting \|u_{t}\|_{2}\leq R_{\mathcal{M}}R_{\text{nat}}, \|x_{t}\|_{2}\leq R_{\text{nat}}(1+R_{G}R_{\mathcal{M}}), and u_{t}=u_{t}(M_{t})) we have:

\displaystyle c_{t}(x_{t},u_{t})-c_{t}\left(\hat{x}_{t}(M_{t-h:t-1}),u_{t}(M_{t})\right) \displaystyle\leq 3LR_{\text{nat}}R_{G}R_{\mathcal{M}}\|x_{t}(M_{1:t-1})-\hat{x}_{t}(M_{t-h:t-1})\|
=3LRnatRGRi=ht1Gt1[i]ut1i\displaystyle=3LR_{\text{nat}}R_{G}R_{\mathcal{M}}\|\sum_{i=h}^{t-1}G_{t-1}^{[i]}u_{t-1-i}\|
3LRnat2R2RGψ(h)\displaystyle\leq 3LR_{\text{nat}}^{2}R_{\mathcal{M}}^{2}R_{G}\psi(h)

Having proven all these preliminary results, the proof of the main theorem is immediate:

Proof of Theorem B.1.

By the definition of the proxy loss in Line 7 of Algorithm 3, we can expand the regret of Algorithm 3 over interval I=[r,s]I=[r,s] as:

Regret =t=rsct(xt,ut)minMt=rsct(xtM,utM)\displaystyle=\sum_{t=r}^{s}c_{t}(x_{t},u_{t})-\min_{M\in\mathcal{M}}\sum_{t=r}^{s}c_{t}(x_{t}^{M},u_{t}^{M})
=\underbrace{\sum_{t=r}^{s}c_{t}(x_{t},u_{t})-\sum_{t=r}^{s}c_{t}(\hat{x}_{t}(M_{t-h:t-1}),u_{t}(M_{t}))}_{\text{(truncation error I)}}+\underbrace{\sum_{t=r}^{s}f_{t}(M_{t-h:t})-\min_{M\in\mathcal{M}}\sum_{t=r}^{s}\tilde{f}_{t}(M)}_{\text{($f$-regret)}}
+\underbrace{\min_{M\in\mathcal{M}}\sum_{t=r}^{s}c_{t}(\hat{x}_{t}(M),u_{t}(M))-\min_{M\in\mathcal{M}}\sum_{t=r}^{s}c_{t}(x_{t}^{M},u_{t}^{M})}_{\text{(truncation error II)}}

The first truncation error is bounded directly by Lemma B.5. For the second truncation error, let M=argminMt=rsct(xtM,utM)M^{\star}=\arg\min_{M\in\mathcal{M}}\sum_{t=r}^{s}c_{t}(x_{t}^{M},u_{t}^{M}). Clearly we have

minMt=rsct(x^t(M),ut(M))t=rsct(x^t(M),ut(M))\min_{M\in\mathcal{M}}\sum_{t=r}^{s}c_{t}(\hat{x}_{t}(M),u_{t}(M))\leq\sum_{t=r}^{s}c_{t}(\hat{x}_{t}(M^{\star}),u_{t}(M^{\star}))

and hence we can apply Lemma B.5 to bound:

truncation error II =minMt=rsct(x^t(M),ut(M))minMt=rsct(xtM,utM)\displaystyle=\min_{M\in\mathcal{M}}\sum_{t=r}^{s}c_{t}(\hat{x}_{t}(M),u_{t}(M))-\min_{M\in\mathcal{M}}\sum_{t=r}^{s}c_{t}(x_{t}^{M},u_{t}^{M})
t=rsct(x^t(M),ut(M))t=rsct(xtM,utM)\displaystyle\leq\sum_{t=r}^{s}c_{t}(\hat{x}_{t}(M^{\star}),u_{t}(M^{\star}))-\sum_{t=r}^{s}c_{t}(x_{t}^{M^{\star}},u_{t}^{M^{\star}})
3LRnat2R2RGψ(h)|I|\displaystyle\leq 3LR_{\text{nat}}^{2}R_{\mathcal{M}}^{2}R_{G}\psi(h)|I|

Finally, due to Lemma B.2, Lemma B.3 and Lemma B.4 we can apply Theorem B.2 to get

f-regretD2η+2ηLf2(h+1)5/2|I|\text{f-regret}\leq\frac{D^{2}}{\eta}+2\eta L_{f}^{2}(h+1)^{5/2}|I|

Summing everything up and plugging in the Lipschitz and diameter constants, we have:

Regret 6LRnat2R2RGψ(h)|I|+4mmin{dx,du}R2η+18η(L2Rnat4RG4R2)m(h+1)5/2|I|\displaystyle\leq 6LR_{\text{nat}}^{2}R_{\mathcal{M}}^{2}R_{G}\psi(h)|I|+\frac{4m\min\{d_{x},d_{u}\}R_{\mathcal{M}}^{2}}{\eta}+18\eta(L^{2}R_{\text{nat}}^{4}R_{G}^{4}R_{\mathcal{M}}^{2})m(h+1)^{5/2}|I|

Setting ηmin{dx,du}2LRG2Rnat2(h+1)5/4T\eta\doteq\frac{\sqrt{\min\{d_{x},d_{u}\}}}{2LR_{G}^{2}R_{\text{nat}}^{2}(h+1)^{5/4}\sqrt{T}}, we get

Regret 6LRnat2R2RGψ(h)|I|+17min{dx,du}(LRG2Rnat2R2)(h+1)5/4mT\displaystyle\leq 6LR_{\text{nat}}^{2}R_{\mathcal{M}}^{2}R_{G}\psi(h)|I|+17\sqrt{\min\{d_{x},d_{u}\}}(LR_{G}^{2}R_{\text{nat}}^{2}R_{\mathcal{M}}^{2})\cdot(h+1)^{5/4}m\cdot\sqrt{T}
6LRsys2(3dminm(h+1)5/4T+ψ(h)|I|)\displaystyle\leq 6LR_{\text{sys}}^{2}\left(3\sqrt{d_{\min}}m(h+1)^{5/4}\sqrt{T}+\psi(h)|I|\right)

where we denote RsysRGRRnatR_{\text{sys}}\doteq R_{G}R_{\mathcal{M}}R_{\text{nat}} and dminmin{dx,du}d_{\text{min}}\doteq\min\{d_{x},d_{u}\}. ∎

Appendix C Estimation of Time-Varying Vector Sequences

In this section we segue into the setting of online prediction under a partial information model. The goal is to estimate a sequence of vectors under limited noisy feedback, where the feedback access is softly restricted via an additional cost. As shown in the following section, this setting captures the system identification phase of controlling an unknown time-varying dynamical system. We first extensively study the simplified setting as below, and afterwards transfer our findings into meaningful results in control.

Formally, consider the following repeated game between a learner and an oblivious adversary: at each round t[T]t\in[T], the adversary picks a target vector 𝐳t𝒦\mathbf{z}^{\star}_{t}\in\mathcal{K} from a convex decision set 𝒦\mathcal{K} contained in a 0-centered ball of radius RzR_{z}; simultaneously, the learner selects an estimate 𝐳^t𝒦\hat{\mathbf{z}}_{t}\in\mathcal{K} and suffers quadratic loss ft(𝐳^t)=𝐳^t𝐳t2f_{t}(\hat{\mathbf{z}}_{t})=\|\hat{\mathbf{z}}_{t}-\mathbf{z}^{\star}_{t}\|^{2}. The only feedback the learner has access to is via the following noisy and costly oracle.

Oracle 2 (Noisy Costly Oracle).

At each time t[T]t\in[T], the learner selects a decision bt{0,1}b_{t}\in\{0,1\} indicating whether a query is sent to the oracle. If bt=1b_{t}=1, the learner receives an unbiased estimate 𝐳~t\tilde{\mathbf{z}}_{t} as response such that 𝐳~tR~z\|\tilde{\mathbf{z}}_{t}\|\leq\tilde{R}_{z} and 𝔼[𝐳~tt,bt=1]=𝐳t\mathbb{E}[\tilde{\mathbf{z}}_{t}\mid\mathcal{F}_{t},b_{t}=1]=\mathbf{z}^{\star}_{t} where t\mathcal{F}_{t} is the filtration sigma algebra generated by the entire sequence 𝐳1:T\mathbf{z}^{\star}_{1:T} and the past 𝐳~1:t1,b1:t1\tilde{\mathbf{z}}_{1:t-1},b_{1:t-1}. A completed query results in a unit cost for the learner denoted btb_{t} as well by abuse of notation.

The idea behind this setting is to model a general estimation framework for a time-varying system which focuses only on exploration. Committing to exploration, however, cannot realistically be free; hence the additional cost for the number of calls to Oracle 2. Our goal is to design an algorithm \mathcal{A} that minimizes the quadratic loss regret along with the extra oracle cost, defined over each interval I=[r,s]\subseteq[T] as

RegretI(𝒜;λ)=𝔼[tIft(𝐳^t)]min𝐳𝒦tIft(𝐳)+λ𝔼[tIbt],\mathrm{Regret}_{I}(\mathcal{A};\lambda)=\mathbb{E}\left[\sum_{t\in I}f_{t}(\hat{\mathbf{z}}_{t})\right]-\min_{\mathbf{z}\in\mathcal{K}}\sum_{t\in I}f_{t}(\mathbf{z})+\lambda\mathbb{E}\left[\sum_{t\in I}b_{t}\right], (C.1)

where \lambda\geq 0 is a scaling constant independent of the horizon T. For the entire interval I=[T], we use T as a subscript instead of [T]. The expectation above is taken over both the (potential) randomness of the algorithm and the stochasticity of the oracle responses; it is taken in round order t=1,\dots,T, conditioning at each round on the past iterations.

In terms of estimation itself, the metric to consider over interval II is given by RegretI(𝒜;0)\mathrm{Regret}_{I}(\mathcal{A};0) that ignores the oracle call costs. Furthermore, we observe that the best-in-hindsight term in (C.1) is in fact a fundamental quantity of the vector sequence as defined below. This formulation will be used, and is more appropriate, when transferring our findings to the setting of control.

Definition C.1.

Define the variability of a time-varying vector sequence 𝐳1:T\mathbf{z}_{1:T} over an interval I[T]I\subseteq[T] to be

VarI(𝐳1:T)=1|I|min𝐳𝒦tI𝐳𝐳t2=1|I|tI𝐳¯I𝐳t2,\mathrm{Var}_{I}(\mathbf{z}_{1:T})=\frac{1}{|I|}\min_{\mathbf{z}\in\mathcal{K}}\sum_{t\in I}\|\mathbf{z}-\mathbf{z}_{t}\|^{2}=\frac{1}{|I|}\sum_{t\in I}||\bar{\mathbf{z}}_{I}-\mathbf{z}_{t}||^{2},

where 𝐳¯I=|I|1tI𝐳t𝒦\bar{\mathbf{z}}_{I}=|I|^{-1}\sum_{t\in I}\mathbf{z}_{t}\in\mathcal{K} is the empirical average of the members of the sequence that correspond to II.
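The variability above is simply the empirical variance of the sequence around its interval mean, as the following small NumPy check illustrates (the example sequence is, of course, hypothetical):

```python
import numpy as np

def variability(z, r, s):
    """Var_[r,s](z): average squared distance of z_r, ..., z_s to their mean; z has shape (T, d)."""
    block = z[r:s + 1]
    return np.mean(np.sum((block - block.mean(axis=0)) ** 2, axis=1))

z = np.vstack([np.zeros((50, 2)), np.ones((50, 2))])    # a sequence that switches once
print(variability(z, 0, 99), variability(z, 0, 49))      # positive over [0, 99], zero over [0, 49]
```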

This definition concludes the setup of our abstraction to general estimation of vector sequences. Regarding algorithmic results, we first present a base method that achieves logarithmic regret over the entire trajectory [1,T][1,T]. The idea is for the learner to uniformly query Oracle 2 with probability pp: once an estimate 𝐳~t\tilde{\mathbf{z}}_{t} is received, construct a stochastic gradient with expectation equal to the true gradient, and perform a gradient update. The algorithm is described in detail in Algorithm 5, and its guarantee given in the theorem below.

Algorithm 5 Base Estimation Algorithm
1:Input: pp, 𝐳^1𝒦\hat{\mathbf{z}}_{1}\in\mathcal{K}
2:for t=1,,Tt=1,\ldots,T do
3:     Play iterate 𝐳^t\hat{\mathbf{z}}_{t}
4:     Draw/Receive btBernoulli(p)b_{t}\sim\mathrm{Bernoulli}(p)
5:     if bt=1b_{t}=1 then
6:         Receive estimate 𝐳~t\tilde{\mathbf{z}}_{t} from Oracle 2
7:         Construct importance weighted gradient ~t:=1p(𝐳^t𝐳~t)\tilde{\nabla}_{t}:=\frac{1}{p}\left(\hat{\mathbf{z}}_{t}-\tilde{\mathbf{z}}_{t}\right)
8:     else
9:         Set ~t=0\tilde{\nabla}_{t}=0.      
10:     Update 𝐳^t+1=Proj𝒦(𝐳^tηt~t)\hat{\mathbf{z}}_{t+1}=\mathrm{Proj}_{\mathcal{K}}(\hat{\mathbf{z}}_{t}-\eta_{t}\tilde{\nabla}_{t}), ηt=1t\eta_{t}=\frac{1}{t}.
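A direct sketch of Algorithm 5, with a synthetic bounded-noise oracle standing in for Oracle 2; the oracle model, target sequence, and radii are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
T, p, R_z = 2000, 0.1, 5.0
z_star = np.where(np.arange(T)[:, None] < T // 2, 1.0, -1.0) * np.ones((T, 2))  # drifting target

def oracle(t):
    """Unbiased, bounded noisy response z~_t (a synthetic stand-in for Oracle 2)."""
    return z_star[t] + rng.uniform(-0.5, 0.5, size=2)

z_hat, loss, queries = np.zeros(2), 0.0, 0
for t in range(T):
    loss += np.sum((z_hat - z_star[t]) ** 2)       # suffer f_t(z_hat_t)
    if rng.random() < p:                           # b_t ~ Bernoulli(p)
        queries += 1
        grad = (z_hat - oracle(t)) / p             # importance-weighted gradient
    else:
        grad = np.zeros(2)
    z_hat = z_hat - grad / (t + 1)                 # stepsize eta_t = 1/t (t counted from 1)
    n = np.linalg.norm(z_hat)
    if n > R_z:                                    # projection onto the ball of radius R_z
        z_hat = z_hat * (R_z / n)
print("total loss:", loss, "oracle calls:", queries)
```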
Theorem C.1.

Given access to queries from Oracle 2, with stepsizes ηt=1t\eta_{t}=\frac{1}{t}, Algorithm 5 enjoys the following regret guarantee:

\mathrm{Regret}_{T}(\text{Algorithm 5};\lambda)\leq\frac{(R_{z}+\tilde{R}_{z})^{2}(1+\log T)}{p}+\lambda pT. (C.2)
Proof of Theorem C.1.

To prove the bound in the theorem, we construct the following proxy loss functions: if bt=1b_{t}=1 denote f~t(𝐳)=12p𝐳𝐳~t2\tilde{f}_{t}(\mathbf{z})=\frac{1}{2p}\|\mathbf{z}-\tilde{\mathbf{z}}_{t}\|^{2}, otherwise for bt=0b_{t}=0 denote f~t(𝐳)=0\tilde{f}_{t}(\mathbf{z})=0. The stochastic gradients of these functions at the current iterate can be written as 𝐳f~t(𝐳^t)=~t=𝕀{bt=1}p(𝐳^t𝐳~t)\nabla_{\mathbf{z}}\tilde{f}_{t}(\hat{\mathbf{z}}_{t})=\tilde{\nabla}_{t}=\frac{\mathbb{I}\{b_{t}=1\}}{p}(\hat{\mathbf{z}}_{t}-\tilde{\mathbf{z}}_{t}) and are used by the algorithm in the update rule. The idealized gradients are t=(𝐳^t𝐳t)\nabla_{t}=(\hat{\mathbf{z}}_{t}-\mathbf{z}^{\star}_{t}) which we would use given access to the true targets 𝐳t\mathbf{z}^{\star}_{t}. Recall that t\mathcal{F}_{t} denotes the sigma-algebra generated by the true target sequence 𝐳1:T\mathbf{z}^{\star}_{1:T}, as well as randomness of the past rounds b1:t1b_{1:t-1} and 𝐳~1:t1\tilde{\mathbf{z}}_{1:t-1}. Note then that 𝐳^t\hat{\mathbf{z}}_{t} is t\mathcal{F}_{t} measurable. We characterize two essential properties of the stochastic gradients:

Lemma C.1.

Let 𝐳¯𝒦\bar{\mathbf{z}}^{\star}\in\mathcal{K} be the minimizer of t=1T𝐳𝐳t2\sum_{t=1}^{T}\|\mathbf{z}-\mathbf{z}^{\star}_{t}\|^{2}, i.e. empirical average of 𝐳1:T\mathbf{z}^{\star}_{1:T}. Then,

𝔼[~t,𝐳^t𝐳¯]=𝔼[t,𝐳^t𝐳¯].\mathbb{E}[\langle\tilde{\nabla}_{t},\hat{\mathbf{z}}_{t}-\bar{\mathbf{z}}^{\star}\rangle]=\mathbb{E}[\langle\nabla_{t},\hat{\mathbf{z}}_{t}-\bar{\mathbf{z}}^{\star}\rangle]~{}.

Moreover, 𝔼[~t2](Rz+R~z)2/p\mathbb{E}[\|\tilde{\nabla}_{t}\|^{2}]\leq(R_{z}+\tilde{R}_{z})^{2}/p.

Proof.

Using the Oracle 2 assumption on 𝐳~t\tilde{\mathbf{z}}_{t}, we get

𝔼[~tt]\displaystyle\mathbb{E}[\tilde{\nabla}_{t}\mid\mathcal{F}_{t}] =1p𝔼[𝕀{bt=1}(𝐳^t𝐳~t)t]\displaystyle=\frac{1}{p}\mathbb{E}[\mathbb{I}\{b_{t}=1\}(\hat{\mathbf{z}}_{t}-\tilde{\mathbf{z}}_{t})\mid\mathcal{F}_{t}]
=1p𝔼[𝕀{bt=1}𝐳^tt]1p𝔼[𝕀{bt=1}𝐳~tt]\displaystyle=\frac{1}{p}\mathbb{E}[\mathbb{I}\{b_{t}=1\}\cdot\hat{\mathbf{z}}_{t}\mid\mathcal{F}_{t}]-\frac{1}{p}\mathbb{E}[\mathbb{I}\{b_{t}=1\}\cdot\tilde{\mathbf{z}}_{t}\mid\mathcal{F}_{t}]
=𝐳^t𝔼[𝐳~tt,bt=1]=(i)𝐳^t𝐳t=t,\displaystyle=\hat{\mathbf{z}}_{t}-\mathbb{E}[\tilde{\mathbf{z}}_{t}\mid\mathcal{F}_{t},b_{t}=1]\overset{(i)}{=}\hat{\mathbf{z}}_{t}-\mathbf{z}^{\star}_{t}=\nabla_{t},

where (i)(i) uses the unbiasedness property of Oracle 2. Next, since 𝐳¯\bar{\mathbf{z}}^{\star} is determined by 𝐳1:T\mathbf{z}^{\star}_{1:T} it is therefore t\mathcal{F}_{t} measurable for all tt. Thus, 𝐳^t𝐳¯\hat{\mathbf{z}}_{t}-\bar{\mathbf{z}}^{\star} is t\mathcal{F}_{t} measurable, so

𝔼[~t,𝐳^t𝐳¯]\displaystyle\mathbb{E}[\langle\tilde{\nabla}_{t},\hat{\mathbf{z}}_{t}-\bar{\mathbf{z}}^{\star}\rangle] =𝔼[𝔼[~tt],𝐳^t𝐳¯]=𝔼[t,𝐳^t𝐳¯].\displaystyle=\mathbb{E}\left[\langle\mathbb{E}[\tilde{\nabla}_{t}\mid\mathcal{F}_{t}],\hat{\mathbf{z}}_{t}-\bar{\mathbf{z}}^{\star}\rangle\right]=\mathbb{E}[\langle\nabla_{t},\hat{\mathbf{z}}_{t}-\bar{\mathbf{z}}^{\star}\rangle]~{}.

Finally, using the norm bound 𝐳^tRz\|\hat{\mathbf{z}}_{t}\|\leq R_{z} since 𝐳^t𝒦\hat{\mathbf{z}}_{t}\in\mathcal{K} and the assumption that 𝐳~tR~z\|\tilde{\mathbf{z}}_{t}\|\leq\tilde{R}_{z} from Oracle 2, we conclude

\mathbb{E}[\|\tilde{\nabla}_{t}\|^{2}]=\frac{1}{p^{2}}\mathbb{E}[\mathbb{I}\{b_{t}=1\}\|\tilde{\mathbf{z}}_{t}-\hat{\mathbf{z}}_{t}\|^{2}]\leq\frac{1}{p^{2}}\mathbb{E}[\mathbb{I}\{b_{t}=1\}(R_{z}+\tilde{R}_{z})^{2}]=\frac{(R_{z}+\tilde{R}_{z})^{2}}{p}~{}.

The rest of the theorem proof mirrors that of Theorem 3.3 in [19] but accounting for the stochastic gradient. We can view Algorithm 5 as running online stochastic gradient descent over strongly convex functions on losses 12ft(𝐳)=12𝐳𝐳t2\frac{1}{2}f_{t}(\mathbf{z})=\frac{1}{2}\|\mathbf{z}-\mathbf{z}^{\star}_{t}\|^{2} with true gradient t\nabla_{t} and stochastic gradient ~t\tilde{\nabla}_{t} at the iterate 𝐳^t\hat{\mathbf{z}}_{t}. Since the losses 12ft\frac{1}{2}f_{t} are 11-strongly convex, 𝔼[bt]=p\mathbb{E}[b_{t}]=p and using the claim from Lemma C.1 we get,

12RegretT=12𝔼[t=1T(ft(𝐳^t)ft(𝐳¯)+λbt)]\displaystyle\frac{1}{2}\mathrm{Regret}_{T}=\frac{1}{2}\mathbb{E}\left[\sum_{t=1}^{T}(f_{t}(\hat{\mathbf{z}}_{t})-f_{t}(\bar{\mathbf{z}}^{\star})+\lambda\cdot b_{t})\right] 𝔼[t=1T(t,𝐳^t𝐳¯12𝐳^t𝐳¯2)]+12λpT\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{T}(\langle\nabla_{t},\hat{\mathbf{z}}_{t}-\bar{\mathbf{z}}^{\star}\rangle-\frac{1}{2}\|\hat{\mathbf{z}}_{t}-\bar{\mathbf{z}}^{\star}\|^{2})\right]+\frac{1}{2}\lambda pT
=12𝔼[t=1T(2~t,𝐳^t𝐳¯𝐳^t𝐳¯2)]+12λpT.\displaystyle=\frac{1}{2}\mathbb{E}\left[\sum_{t=1}^{T}(2\langle\tilde{\nabla}_{t},\hat{\mathbf{z}}_{t}-\bar{\mathbf{z}}^{\star}\rangle-\|\hat{\mathbf{z}}_{t}-\bar{\mathbf{z}}^{\star}\|^{2})\right]+\frac{1}{2}\lambda pT.

The update rule is given as 𝐳^t+1=Proj𝒦(𝐳^tηt~t)\hat{\mathbf{z}}_{t+1}=\mathrm{Proj}_{\mathcal{K}}(\hat{\mathbf{z}}_{t}-\eta_{t}\tilde{\nabla}_{t}), so from the Pythagorean theorem for the projection

𝐳^t+1𝐳¯2\displaystyle\|\hat{\mathbf{z}}_{t+1}-\bar{\mathbf{z}}^{\star}\|^{2} 𝐳^tηt~t𝐳¯2=𝐳^t𝐳¯2+ηt2~t22ηt~t,𝐳^t𝐳¯.\displaystyle\leq\|\hat{\mathbf{z}}_{t}-\eta_{t}\tilde{\nabla}_{t}-\bar{\mathbf{z}}^{\star}\|^{2}=\|\hat{\mathbf{z}}_{t}-\bar{\mathbf{z}}^{\star}\|^{2}+\eta_{t}^{2}\|\tilde{\nabla}_{t}\|^{2}-2\eta_{t}\langle\tilde{\nabla}_{t},\hat{\mathbf{z}}_{t}-\bar{\mathbf{z}}^{\star}\rangle.
2~t,𝐳^t𝐳¯\displaystyle 2\langle\tilde{\nabla}_{t},\hat{\mathbf{z}}_{t}-\bar{\mathbf{z}}^{\star}\rangle 𝐳^t𝐳¯2𝐳^t+1𝐳¯2ηt+ηt~t2.\displaystyle\leq\frac{\|\hat{\mathbf{z}}_{t}-\bar{\mathbf{z}}^{\star}\|^{2}-\|\hat{\mathbf{z}}_{t+1}-\bar{\mathbf{z}}^{\star}\|^{2}}{\eta_{t}}+\eta_{t}\|\tilde{\nabla}_{t}\|^{2}.

Combining the above bounds results in

12RegretT12𝔼[t=1T(𝐳^t𝐳¯2𝐳^t+1𝐳¯2ηt𝐳^t𝐳¯2)+t=1Tηt~t2]+12λpT.\frac{1}{2}\mathrm{Regret}_{T}\leq\frac{1}{2}\mathbb{E}\left[\sum_{t=1}^{T}\left(\frac{\|\hat{\mathbf{z}}_{t}-\bar{\mathbf{z}}^{\star}\|^{2}-\|\hat{\mathbf{z}}_{t+1}-\bar{\mathbf{z}}^{\star}\|^{2}}{\eta_{t}}-\|\hat{\mathbf{z}}_{t}-\bar{\mathbf{z}}^{\star}\|^{2}\right)+\sum_{t=1}^{T}\eta_{t}\|\tilde{\nabla}_{t}\|^{2}\right]+\frac{1}{2}\lambda pT.

The telescoping sum inside the parentheses is equal to 0, the gradient term is bounded 𝔼[~t2](Rz+R~z)2p\mathbb{E}[\|\tilde{\nabla}_{t}\|^{2}]\leq\frac{(R_{z}+\tilde{R}_{z})^{2}}{p} according to Lemma C.1 and the stepsize sum is bounded by t=1Tηt1+logT\sum_{t=1}^{T}\eta_{t}\leq 1+\log T, yielding the final result

\mathrm{Regret}_{T}(\text{Algorithm 5};\lambda)\leq\frac{(R_{z}+\tilde{R}_{z})^{2}(1+\log T)}{p}+\lambda pT. ∎

C.1 Adaptive Regret Bound

The guarantee in Theorem C.1 ensures that the predicted sequence 𝐳^1:T\hat{\mathbf{z}}_{1:T} performs comparably to the empirical mean 𝐳¯\bar{\mathbf{z}}^{\star} of the entire target sequence 𝐳1:T\mathbf{z}^{\star}_{1:T}. However, that doesn’t imply much about the performance of Algorithm 5 on a given local interval I[T]I\subseteq[T] since 𝐳¯I\bar{\mathbf{z}}^{\star}_{I} can be very different from 𝐳¯\bar{\mathbf{z}}^{\star}. Hence, we would like to extend our results to hold for any interval II, i.e. derive adaptive regret results as introduced in [21]. To do so we will use the approach of [21] using Algorithm 5 as a subroutine. The resulting algorithm, presented in Algorithm 6, suffers only a logarithmic computational overhead over Algorithm 5 with its performance guarantee stated in the theorem below.

Algorithm 6 Adaptive Estimation Algorithm
1:Input: parameter pp, decision set 𝒦\mathcal{K}, base estimation algorithm 𝒜\mathcal{A}, 𝐳^1𝒦\hat{\mathbf{z}}_{1}\in\mathcal{K}
2:Initialize: 𝒜1𝒜(p,𝐳^1)\mathcal{A}_{1}\leftarrow\mathcal{A}(p,\hat{\mathbf{z}}_{1}), working set 𝒮1={1}\mathcal{S}_{1}=\{1\}, q1(1)=1q_{1}^{(1)}=1, parameter α=p(Rz+R~z)2\alpha=\frac{p}{(R_{z}+\tilde{R}_{z})^{2}}
3:for t=1,,Tt=1,\ldots,T do
4:     Compute predictions 𝐳^t(i)𝒜i\hat{\mathbf{z}}_{t}^{(i)}\leftarrow\mathcal{A}_{i} for i𝒮ti\in\mathcal{S}_{t}
5:     Play iterate 𝐳^t=i𝒮tqt(i)𝐳^t(i)\hat{\mathbf{z}}_{t}=\sum_{i\in\mathcal{S}_{t}}q_{t}^{(i)}\hat{\mathbf{z}}_{t}^{(i)}
6:     Draw/Receive btBernoulli(p)b_{t}\sim\mathrm{Bernoulli}(p)
7:     if bt=1b_{t}=1 then
8:         Request estimate 𝐳~t\tilde{\mathbf{z}}_{t} from Oracle 2
9:         Let ~t(𝐳)=12p𝐳𝐳~t2\tilde{\ell}_{t}(\mathbf{z})=\frac{1}{2p}||\mathbf{z}-\tilde{\mathbf{z}}_{t}||^{2}
10:     else
11:         Let 𝐳~t\tilde{\mathbf{z}}_{t}\leftarrow\emptyset and ~t(𝐳)=0\tilde{\ell}_{t}(\mathbf{z})=0      
12:     Update expert algorithms 𝒜i(bt,𝐳~t)\mathcal{A}_{i}(b_{t},\tilde{\mathbf{z}}_{t}) for all i𝒮ti\in\mathcal{S}_{t}
13:     Form new set 𝒮~t+1=(i)i𝒮t\tilde{\mathcal{S}}_{t+1}=(i)_{i\in\mathcal{S}_{t}}
14:     Construct proxy new weights q¯t+1(i)=tt+1qt(i)eα~t(𝐳^t(i))j𝒮tqt(j)eα~t(𝐳^t(j))\bar{q}_{t+1}^{(i)}=\tfrac{t}{t+1}\cdot\tfrac{q_{t}^{(i)}e^{-\alpha\tilde{\ell}_{t}(\hat{\mathbf{z}}_{t}^{(i)})}}{\sum_{j\in\mathcal{S}_{t}}q_{t}^{(j)}e^{-\alpha\tilde{\ell}_{t}(\hat{\mathbf{z}}_{t}^{(j)})}} for all i𝒮ti\in\mathcal{S}_{t}
15:     Add new instance 𝒮~t+1𝒮~t+1t+1\tilde{\mathcal{S}}_{t+1}\leftarrow\tilde{\mathcal{S}}_{t+1}\cup{t+1} for arbitrary 𝒜t+1𝒜(p,𝐳^1(t+1)=𝐳^1)\mathcal{A}_{t+1}\leftarrow\mathcal{A}(p,\hat{\mathbf{z}}_{1}^{(t+1)}=\hat{\mathbf{z}}_{1}) with q¯t+1(t+1)=1t+1\bar{q}_{t+1}^{(t+1)}=\frac{1}{t+1}
16:     Prune 𝒮~t+1\tilde{\mathcal{S}}_{t+1} to form 𝒮t+1\mathcal{S}_{t+1}
17:     Normalize qt+1(i)=q¯t+1(i)j𝒮t+1q¯t+1(j)q_{t+1}^{(i)}=\frac{\bar{q}_{t+1}^{(i)}}{\sum_{j\in\mathcal{S}_{t+1}}\bar{q}_{t+1}^{(j)}}
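A compact sketch of the expert-wrapping scheme of Algorithm 6: each working-set member is its own copy of the base algorithm, weights are updated multiplicatively on the proxy loss, and copies are pruned by the geometric lifetime rule of Section C.1.1. The bookkeeping is slightly simplified (for instance, the t/(t+1) discount is folded into the final normalization), and the synthetic oracle and constants are illustrative.

```python
import numpy as np

def lifetime(i):
    """Copy born at i = r * 2^k (r odd) is kept in the working set while t <= i + 2^(k+2) + 1."""
    k = 0
    while i % 2 == 0:
        i //= 2
        k += 1
    return 2 ** (k + 2) + 1

def proj(v, radius):
    n = np.linalg.norm(v)
    return v if n <= radius else v * (radius / n)

rng = np.random.default_rng(3)
T, p, d, R_z, Rt_z = 1000, 0.2, 2, 5.0, 6.0
alpha = p / (R_z + Rt_z) ** 2
z_star = np.where(np.arange(T)[:, None] < T // 2, 2.0, -2.0) * np.ones((T, d))
oracle = lambda t: z_star[t] + rng.uniform(-0.5, 0.5, size=d)    # stand-in for Oracle 2

experts = {1: np.zeros(d)}            # birth time -> iterate of that Algorithm-5 copy
q = {1: 1.0}
cum_loss = 0.0
for t in range(1, T + 1):
    z_hat = sum(q[i] * experts[i] for i in experts)              # play the weighted prediction
    cum_loss += np.sum((z_hat - z_star[t - 1]) ** 2)             # suffer f_t(z_hat_t)
    if rng.random() < p:                                         # b_t ~ Bernoulli(p)
        z_tilde = oracle(t - 1)
        losses = {i: np.sum((experts[i] - z_tilde) ** 2) / (2 * p) for i in experts}
        grads = {i: (experts[i] - z_tilde) / p for i in experts}
    else:
        losses = {i: 0.0 for i in experts}
        grads = {i: np.zeros(d) for i in experts}
    # Multiplicative reweighting on the proxy loss; each copy then takes its own OGD step.
    q = {i: q[i] * np.exp(-alpha * losses[i]) for i in experts}
    experts = {i: proj(experts[i] - grads[i] / (t - i + 1), R_z) for i in experts}
    # Birth a new copy, prune copies whose geometric lifetime has ended, renormalize weights.
    experts[t + 1], q[t + 1] = np.zeros(d), 1.0 / (t + 1)
    experts = {i: v for i, v in experts.items() if t + 1 <= i + lifetime(i)}
    q = {i: q[i] for i in experts}
    total = sum(q.values())
    q = {i: w / total for i, w in q.items()}
print("total loss:", cum_loss, "surviving copies at T:", sorted(experts))
```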
Theorem C.2.

Taking the base estimation algorithm 𝒜\mathcal{A} to be Algorithm 5 and given access to queries from Oracle 2, Algorithm 6 enjoys the following guarantee:

\forall I=[r,s]\subseteq[T],\quad\mathrm{Regret}_{I}(\text{Algorithm 6};\lambda)\leq\frac{2(R_{z}+\tilde{R}_{z})^{2}(1+\log{s}\cdot\log|I|)}{p}+\lambda p|I|~{}. (C.3)
Corollary C.1.

The estimation error over each interval I=[r,s][T]I=[r,s]\subseteq[T] is bounded as follows,

\mathbb{E}\left[\sum_{t\in I}\|\hat{\mathbf{z}}_{t}-\mathbf{z}^{\star}_{t}\|^{2}\right]\leq|I|\cdot\mathrm{Var}_{I}(\mathbf{z}^{\star}_{1:T})+\frac{2(R_{z}+\tilde{R}_{z})^{2}(1+\log{s}\cdot\log|I|)}{p}~{}.
Proof of Theorem C.2.

First observe that \tilde{\ell}_{t} is \alpha-exp concave with \alpha=\frac{p}{(R_{z}+\tilde{R}_{z})^{2}}. This is evident given its construction: \tilde{\ell}_{t}(\mathbf{z})=\frac{1}{2p}\|\mathbf{z}-\tilde{\mathbf{z}}_{t}\|^{2} with \|\mathbf{z}\|\leq R_{z} since \mathbf{z}\in\mathcal{K} and \|\tilde{\mathbf{z}}_{t}\|\leq\tilde{R}_{z} according to Oracle 2. The rest of the algorithm uses the approach of [21], in particular Algorithm 1, over exp concave functions to derive the guarantee in the theorem statement.

We note that Claim 3.1 in [21] holds identically in our case, i.e. for any I=[r,s] the regret of Algorithm 6 with respect to \mathcal{A}_{r} is bounded by \frac{2}{\alpha}(\ln r+\ln|I|) if \mathcal{A}_{r} stays in the working set. We combine this fact with the bound given in Theorem C.1 to get that Algorithm 6 enjoys regret \frac{3}{\alpha}(\log r+\log|I|) over I=[r,s] if \mathcal{A}_{r} stays in the working set \mathcal{S}_{t} throughout I. Finally, an induction argument along with the working set properties detailed in Section C.1.1, identical to that of Lemma 3.2 in [21], yields the desired result for \tilde{\ell}_{t}. This is our desired result in expectation, due to the following observation (where \ell_{t}(\mathbf{z})\doteq\frac{1}{2}\|\mathbf{z}-\mathbf{z}^{\star}_{t}\|^{2} denotes the idealized loss).

Observation C.2.

We have the following identity for any tt and rtr\leq t:

𝔼[~t(𝐳^t)~t(𝐳^t(r))]=𝔼[t(𝐳^t)t(𝐳^t(r))]\mathbb{E}\left[\tilde{\ell}_{t}(\hat{\mathbf{z}}_{t})-\tilde{\ell}_{t}(\hat{\mathbf{z}}_{t}^{(r)})\right]=\mathbb{E}\left[{\ell}_{t}(\hat{\mathbf{z}}_{t})-{\ell}_{t}(\hat{\mathbf{z}}_{t}^{(r)})\right]
Proof of C.2.

We can expand:

\displaystyle 2\mathbb{E}\left[\tilde{\ell}_{t}(\hat{\mathbf{z}}_{t})-\tilde{\ell}_{t}(\hat{\mathbf{z}}_{t}^{(r)})\right] =2p\cdot\mathbb{E}\left[\tilde{\ell}_{t}(\hat{\mathbf{z}}_{t})-\tilde{\ell}_{t}(\hat{\mathbf{z}}_{t}^{(r)})|b_{t}=1\right]+2(1-p)\cdot\mathbb{E}\left[\tilde{\ell}_{t}(\hat{\mathbf{z}}_{t})-\tilde{\ell}_{t}(\hat{\mathbf{z}}_{t}^{(r)})|b_{t}=0\right]
\displaystyle=p\cdot\mathbb{E}\left[\dfrac{1}{p}\cdot\left(\|\hat{\mathbf{z}}_{t}\|^{2}-2\langle\hat{\mathbf{z}}_{t},\tilde{\mathbf{z}}_{t}\rangle+\|\tilde{\mathbf{z}}_{t}\|^{2}-\|\hat{\mathbf{z}}_{t}^{(r)}\|^{2}+2\langle\hat{\mathbf{z}}_{t}^{(r)},\tilde{\mathbf{z}}_{t}\rangle-\|\tilde{\mathbf{z}}_{t}\|^{2}\right)\right]+0
\displaystyle=\mathbb{E}\left[\|\hat{\mathbf{z}}_{t}\|^{2}-\|\hat{\mathbf{z}}_{t}^{(r)}\|^{2}\right]-2\mathbb{E}\left[\langle\hat{\mathbf{z}}_{t}-\hat{\mathbf{z}}_{t}^{(r)},\tilde{\mathbf{z}}_{t}\rangle\right]

By the linearity of expectation, the fact that 𝐳^t,𝐳^t(r)\hat{\mathbf{z}}_{t},\hat{\mathbf{z}}_{t}^{(r)} are completely determined given t1\mathcal{F}_{t-1}, and the law of total expectation we have that

𝔼[𝐳^t𝐳^t(r),𝐳~t]\displaystyle\mathbb{E}\left[\langle\hat{\mathbf{z}}_{t}-\hat{\mathbf{z}}_{t}^{(r)},\tilde{\mathbf{z}}_{t}\rangle\right] =𝔼[𝔼[𝐳^t𝐳^t(r),𝐳~t|t1]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\langle\hat{\mathbf{z}}_{t}-\hat{\mathbf{z}}_{t}^{(r)},\tilde{\mathbf{z}}_{t}\rangle|\mathcal{F}_{t-1}\right]\right]
=𝔼[𝐳^t𝐳^t(r),𝐳t]\displaystyle=\mathbb{E}\left[\langle\hat{\mathbf{z}}_{t}-\hat{\mathbf{z}}_{t}^{(r)},\mathbf{z}^{\star}_{t}\rangle\right]

Plugging this in above, adding and subtracting \|\mathbf{z}^{\star}_{t}\|^{2}, and rearranging, we have:

\displaystyle 2\mathbb{E}\left[\tilde{\ell}_{t}(\hat{\mathbf{z}}_{t})-\tilde{\ell}_{t}(\hat{\mathbf{z}}_{t}^{(r)})\right] =\mathbb{E}\left[\|\hat{\mathbf{z}}_{t}\|^{2}-\|\hat{\mathbf{z}}_{t}^{(r)}\|^{2}-2\langle\hat{\mathbf{z}}_{t}-\hat{\mathbf{z}}_{t}^{(r)},\mathbf{z}^{\star}_{t}\rangle+\|\mathbf{z}^{\star}_{t}\|^{2}-\|\mathbf{z}^{\star}_{t}\|^{2}\right]
\displaystyle=\mathbb{E}\left[\|\hat{\mathbf{z}}_{t}\|^{2}-2\langle\hat{\mathbf{z}}_{t},\mathbf{z}^{\star}_{t}\rangle+\|\mathbf{z}^{\star}_{t}\|^{2}-\|\hat{\mathbf{z}}_{t}^{(r)}\|^{2}+2\langle\hat{\mathbf{z}}_{t}^{(r)},\mathbf{z}^{\star}_{t}\rangle-\|\mathbf{z}^{\star}_{t}\|^{2}\right]
\displaystyle=2\mathbb{E}\left[{\ell}_{t}(\hat{\mathbf{z}}_{t})-{\ell}_{t}(\hat{\mathbf{z}}_{t}^{(r)})\right]

as desired. ∎

Combining Observation C.2 with the fact that \tilde{\ell}_{t} are \alpha-exp concave for \alpha=\frac{p}{(R_{z}+\tilde{R}_{z})^{2}}, we conclude the final statement of Theorem C.2. ∎

C.1.1 Working Set Construction

Our Algorithm 6 makes use of the working sets {𝒮t}t[T]\{\mathcal{S}_{t}\}_{t\in[T]} along with its properties in Claim C.3. In this section, we show the explicit construction of these working sets as in [21] and prove the claim.

Claim C.3.

The following properties hold for the working sets StS_{t} for all t[T]t\in[T]: (i) |𝒮t|=O(logT)|\mathcal{S}_{t}|=O(\log T); (ii) [s,(s+t)/2]𝒮t[s,(s+t)/2]\cap\mathcal{S}_{t}\neq\emptyset for any s[t]s\in[t]; (iii) 𝒮t+1\𝒮t={t+1}\mathcal{S}_{t+1}\backslash\mathcal{S}_{t}=\{t+1\}; (iv) |𝒮t\𝒮t+1|1|\mathcal{S}_{t}\backslash\mathcal{S}_{t+1}|\leq 1.

For any i[T]i\in[T], let it be given as i=r2ki=r2^{k} with rr odd and kk nonnegative. Denote m=2k+2+1m=2^{k+2}+1, then iSti\in S_{t} if and only if t[i,i+m]t\in[i,i+m]. This fully describes the construction of the working sets {St}t[T]\{S_{t}\}_{t\in[T]}, and we proceed to prove its properties.
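The construction is easy to instantiate and test; the sketch below builds S_t by brute force and checks properties (i)-(iv) of Claim C.3 on a small horizon (the constant in the size bound follows the argument in the proof, and the horizon 200 is an arbitrary illustrative choice).

```python
import math

def working_set(t):
    """S_t = { i <= t : with i = r * 2^k, r odd, we have t <= i + 2^(k+2) + 1 }."""
    S = []
    for i in range(1, t + 1):
        k, r = 0, i
        while r % 2 == 0:
            r //= 2
            k += 1
        if t <= i + 2 ** (k + 2) + 1:
            S.append(i)
    return S

# Check properties (i)-(iv) of Claim C.3 on a small horizon.
prev = working_set(1)
for t in range(2, 200):
    S = working_set(t)
    assert len(S) <= 3 * (math.log2(t) + 1)                                     # (i) logarithmic size
    assert all(any(s <= i <= (s + t) / 2 for i in S) for s in range(1, t + 1))  # (ii) covering
    assert set(S) - set(prev) == {t}                                            # (iii) only t is new
    assert len(set(prev) - set(S)) <= 1                                         # (iv) at most one removal
    prev = S
print("all properties hold up to t = 199")
```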

Proof of Claim C.3.

For all t[T]t\in[T] we show the following properties of the working sets StS_{t}.

(i) |S_{t}|=O(\log T): if i\in S_{t} then 1\leq i=r2^{k}\leq t, which implies that 0\leq k\leq\log_{2}{t}. For each fixed k in this range, if r2^{k}=i\in S_{t} then i\in[t-2^{k+2}-1,t] by construction. Since [t-2^{k+2}-1,t] is an interval of length 2^{k+2}+2=4\cdot 2^{k}+2, it can include at most 3 numbers of the form r2^{k} with r odd. Thus, there are at most 3 numbers i=r2^{k}\in S_{t} for each 0\leq k\leq\log_{2}{t}, which means that |S_{t}|=O(\log t)=O(\log T).

(ii) [s,(s+t)/2]\cap S_{t}\neq\emptyset for all s\in[t]: this trivially holds for s=t-1,t. Let 2^{l}\leq(t-s)/2 be the largest such power of 2. Since the length of the interval [s,(s+t)/2] is \lfloor(t-s)/2\rfloor, there exists u\in[s,(s+t)/2] that is divisible by 2^{l}. This means that the corresponding m\geq 2^{l+2}+1>t-s for u\geq s is large enough so that t\in[u,u+m], and consequently, u\in S_{t}.

(iii) St+1\St={t+1}S_{t+1}\backslash S_{t}=\{t+1\}: let iSt+1i\in S_{t+1} and iSti\not\in S_{t}, which is equivalent to t+1[i,i+m]t+1\in[i,i+m] and t[i,i+m]t\not\in[i,i+m]. Clearly, i=t+1i=t+1 satisfies these conditions and is the only such number.

(iv) |St\St+1|1|S_{t}\backslash S_{t+1}|\leq 1: suppose there exist two i1,i2St\St+1i_{1},i_{2}\in S_{t}\backslash S_{t+1}. This implies that i1+m1=t=i2+m2i_{1}+m_{1}=t=i_{2}+m_{2} which in turn means 2k1(r1+4)=2k2(r2+4)2^{k_{1}}(r_{1}+4)=2^{k_{2}}(r_{2}+4). Since both r1+4,r2+4r_{1}+4,r_{2}+4 are odd, then k1=k2k_{1}=k_{2}, and consequently, r1=r2r_{1}=r_{2} resulting in i1=i2i_{1}=i_{2}. Thus, there can not exist two different members of St\St+1S_{t}\backslash S_{t+1} which concludes that |St\St+1|1|S_{t}\backslash S_{t+1}|\leq 1. ∎

C.2 No Strong Adaptivity

Notice that even though the guarantee of Theorem C.2 applies to all intervals I, it does not entail meaningful guarantees for all of them. The reason is the choice of parameter p: if one wishes to optimize \mathrm{Regret}_{T} then p=\mathcal{O}(T^{-1/2}) implies \mathcal{O}(\sqrt{T}) regret, but this choice is meaningless for intervals of length |I|\ll\sqrt{T}; on the other hand, optimizing the bound for small intervals leads to large bounds for the entire horizon. One might then ask whether there exist methods with strongly adaptive guarantees, and we answer this question in the negative.

Theorem C.3.

For any γ>0\gamma>0 and oracle cost λ>0\lambda>0, there exists no online algorithm 𝒜\mathcal{A} with feedback access to Oracle 2 that enjoys the following strongly adaptive regret guarantee: RegretI(𝒜;λ)=𝒪~(|I|1γ)\mathrm{Regret}_{I}(\mathcal{A};\lambda)=\tilde{\mathcal{O}}(|I|^{1-\gamma}).

Proof.

The proof of this impossibility result follows a simple construction: the idea behind it is that strongly adaptive guarantees imply both large and small amounts of exploration. Let us suppose there exists such an algorithm \mathcal{A} and arrive at a contradiction: \forall I=[r,s]\subseteq[T] algorithm \mathcal{A} has a regret bound \mathrm{Regret}_{I}(\mathcal{A};\lambda)\leq C\cdot|I|^{1-\gamma}=\tilde{\mathcal{O}}(|I|^{1-\gamma}) over any oblivious sequence \mathbf{z}^{\star}_{1:T}, where C depends on problem parameters, \lambda and \log T.

Construct the following oblivious sequence: let k=T^{1-\gamma/2} and I_{1},\dots,I_{k} be consecutive disjoint intervals such that \cup_{j\in[k]}I_{j}=[T], I_{j}\cap I_{l}=\emptyset for all j\neq l, and |I_{j}|=T/k=T^{\gamma/2} for all j\in[k] (w.l.o.g. we assume k divides T). Now for each interval I_{j}, j\in[k], sample a fresh q_{j}\in\{\pm 1\}\sim\mathrm{Rad}(1/2) and let \mathbf{z}^{\star}_{t}=q_{j} for all t\in I_{j}.

According to the assumed guarantee, the overall regret is bounded as \mathrm{Regret}_{T}(\mathcal{A};\lambda)\leq C\cdot T^{1-\gamma}, which by definition implies that \sum_{t=1}^{T}b_{t}\leq\frac{C}{\lambda}T^{1-\gamma}<k, where the last inequality holds for a sufficiently large horizon T. Since there are k consecutive disjoint intervals I_{1},\dots,I_{k} and fewer than k overall calls to Oracle 2, there exists an interval I\in\{I_{1},\dots,I_{k}\} such that \sum_{t\in I}b_{t}=0.

On the other hand, the assumed guarantee for \mathcal{A} implies that the interval I of size |I|=T^{\gamma/2} enjoys sublinear regret, i.e. \mathrm{Regret}_{I}(\mathcal{A};\lambda)=o(|I|). We show that this is a contradiction given that \sum_{t\in I}b_{t}=0. As there were no oracle calls during the interval I, the predictions \hat{\mathbf{z}}_{t} of \mathcal{A} over t\in I are independent of the Rademacher sample q_{I} of the interval: this is true since the samples of the intervals I_{1},\dots,I_{k} are independent. Therefore, \hat{\mathbf{z}}_{t}\perp q_{I} for all t\in I, and since the best loss in hindsight over I equals 0 (as \mathbf{z}^{\star}_{t}=q_{I} for all t\in I),

RegretI(𝒜;λ)𝔼qI[tIt(𝐳^t)]=tI𝔼qI[𝐳^tqI2]=Ω(|I|).\mathrm{Regret}_{I}(\mathcal{A};\lambda)\geq\mathbb{E}_{q_{I}}\left[\sum_{t\in I}\ell_{t}(\hat{\mathbf{z}}_{t})\right]=\sum_{t\in I}\mathbb{E}_{q_{I}}[\|\hat{\mathbf{z}}_{t}-q_{I}\|^{2}]=\Omega(|I|)~{}.

Hence, for the interval I, the regret of \mathcal{A} cannot be sublinear, which contradicts the assumption that \mathcal{A} exhibits strongly adaptive guarantees. This concludes that no strongly adaptive online algorithm exists in the described partial information model. ∎

Appendix D Adaptive Regret for Control of Changing Unknown Dynamics

In this section we give our full control algorithm which attains sublinear regret with respect to Πdrc\Pi_{\mathrm{drc}} up to an additive system variability term. A key component is the system estimation for which we will use Algorithm 6 and its guarantees from Appendix C. More specifically, our algorithm is based on the canonical explore-exploit approach: it explores with some probability pp by inputting random controls into the system, and otherwise outputs a control according to DRC-OGD (Algorithm 3). Note that due to the long-term consequences which appear in control, we need to explore for hh consecutive steps in order to get an estimate for the hh-truncation of the Markov operator. Hence, our algorithm will determine whether it explores or exploits in blocks of length hh. Furthermore, we will define the set of Markov operator of length hh and 1,op\ell_{1},op-norm bounded by RGR_{G} as 𝒢(h,RG)\mathcal{G}(h,R_{G}):

\mathcal{G}(h,R_{G})\doteq\{G=(G^{[0]},\ldots,G^{[h-1]})\in\mathbb{R}^{h\times d_{x}\times d_{u}}\text{ s.t. }\sum_{i=0}^{h-1}||G^{[i]}||_{op}\leq R_{G}\}
Remark D.1.

Note that the radius of 𝒢(h,RG)\mathcal{G}(h,R_{G}) is bounded by R¯G=hdminRG\bar{R}_{G}=\sqrt{h\cdot d_{\mathrm{min}}}R_{G} where dmin=min{dx,du}d_{\mathrm{min}}=\min\{d_{x},d_{u}\}.

Proof of Remark D.1.

G𝒢(h,RG)\forall G\in\mathcal{G}(h,R_{G}), we have:

GF\displaystyle||G||_{F} =i=0h1G[i]F2\displaystyle=\sqrt{\sum_{i=0}^{h-1}||G^{[i]}||^{2}_{F}}
hmaxiG[i]F2\displaystyle\leq\sqrt{h}\sqrt{\max_{i}||G^{[i]}||_{F}^{2}}
hmin{dx,du}maxiG[i]op\displaystyle\leq\sqrt{h\min\{d_{x},d_{u}\}}\max_{i}||G^{[i]}||_{op}
hmin{dx,du}RG\displaystyle\leq\sqrt{h\min\{d_{x},d_{u}\}}R_{G}

and denoting dmin=min{dx,du}d_{\mathrm{min}}=\min\{d_{x},d_{u}\} yields the result. ∎

Remark D.2.

By abuse of notation, we will consider G^{[i]}\doteq 0 for all i\geq h whenever G\in\mathcal{G}(h,R_{G}).

Remark D.3.

For simplicity, we assume T is divisible by h (this is w.l.o.g. up to an extra O(h) cost, which for us is negligible).

We spell out the full procedure in Algorithm 7 and give its guarantee below in Theorem D.1.

Algorithm 7 DRC-OGD with Exploration
1:Input: p,h,\hat{G}_{0},\mathcal{A}\leftarrow\text{Algorithm 6}(p,\hat{G}_{0},\mathcal{G}(h,R_{G})), \mathcal{C}\leftarrow\text{Algorithm 3}(\eta,m,R_{\mathcal{M}})
2:for τ1=0,,T/h1\tau_{1}=0,\ldots,T/h-1 do
3:     Request G^τ1h+1𝒜\hat{G}_{\tau_{1}\cdot h+1}\leftarrow\mathcal{A} and set G^τ1h+2,,G^(τ1+1)hG^τ1h+1\hat{G}_{\tau_{1}\cdot h+2},\ldots,\hat{G}_{(\tau_{1}+1)\cdot h}\leftarrow\hat{G}_{\tau_{1}\cdot h+1}
4:     Draw bτ1+1Bernoulli(p)b_{\tau_{1}+1}\sim\text{Bernoulli}(p)
5:     for τ2=1,,h\tau_{2}=1,\ldots,h do \triangleright let tτ1h+τ2t\doteq\tau_{1}\cdot h+\tau_{2}
6:         if b_{\tau_{1}+1}=1 then
7:              Play control ut{±1}duu_{t}\sim\{\pm 1\}^{d_{u}}
8:         else
9:              Play control ut𝒞u_{t}\leftarrow\mathcal{C}          
10:         Suffer cost ct(xt,ut)c_{t}(x_{t},u_{t}) , observe new state xt+1x_{t+1}
11:         Extract x^t+1nat=Proj𝔹Rnat[xt+1i=0h1G^t[i]uti]\hat{x}_{t+1}^{\mathrm{nat}}=\mathrm{Proj}_{\mathbb{B}_{R_{\mathrm{nat}}}}\left[x_{t+1}-\sum_{i=0}^{h-1}\hat{G}_{t}^{[i]}u_{t-i}\right]
12:         Update 𝒞(ct,G^t,x^t+1nat)\mathcal{C}\leftarrow(c_{t},\hat{G}_{t},\hat{x}^{\mathrm{nat}}_{t+1})      
13:     if b_{\tau_{1}+1}=1 then
14:         Let G~(τ1+1)h[i]=x(τ1+1)h+1u(τ1+1)hi\tilde{G}_{(\tau_{1}+1)\cdot h}^{[i]}=x_{(\tau_{1}+1)\cdot h+1}u_{(\tau_{1}+1)\cdot h-i}^{\top} for i=0,h1¯i=\overline{0,h-1}
15:     else
16:         Let G~(τ1+1)h\tilde{G}_{(\tau_{1}+1)\cdot h}\leftarrow\emptyset      
17:     Update 𝒜(bτ1+1,G~(τ1+1)h)\mathcal{A}\leftarrow(b_{\tau_{1}+1},\tilde{G}_{(\tau_{1}+1)\cdot h})
Theorem D.1.

For h=\dfrac{\log{T}}{\log{\rho^{-1}}}, p=T^{-1/3} and m\leq\sqrt{T}, on any contiguous interval I\subseteq[T], Algorithm 7 enjoys the following adaptive regret guarantee (for precise constants see Eq. D.2):

𝔼[RegretI(Ada-Ctrl);Πdrc(m,RM)]𝒪~(Lm(|I|𝔼[VarI(𝐆)]+duT2/3))\displaystyle\mathbb{E}\left[\mathrm{Regret}_{I}(\textsc{Ada}\text{-}\textsc{Ctrl});\Pi_{\mathrm{drc}}(m,R_{M})\right]\leq\widetilde{\mathcal{O}}^{\star}\left(Lm\left(|I|\sqrt{\mathbb{E}[\mathrm{Var}_{I}(\mathbf{G})]}+d_{u}T^{2/3}\right)\right)

The proof of this theorem will proceed in terms of a quantity which we call total system variability which captures the total (rather than average) deviation from the mean operator for each interval. More precisely,

Definition D.1.

Define the total system variability of an LTV dynamical system with Markov operators 𝐆=G1:T\mathbf{G}=G_{1:T} over a contiguous interval I[T]I\subseteq[T] to be

VarItot(𝐆)=minGtIGGt2,F2=tIGIGt2,F2,\mathrm{Var}^{\mathrm{tot}}_{I}(\mathbf{G})=\min_{G}\sum_{t\in I}\|G-G_{t}\|_{\ell_{2},F}^{2}=\sum_{t\in I}\|G_{I}-G_{t}\|_{\ell_{2},F}^{2},

where 2,F\|\cdot\|_{\ell_{2},F} indicates the 2\ell_{2} norm of the fully vectorized operator and GI=|I|1tIGtG_{I}=|I|^{-1}\sum_{t\in I}G_{t} is the empirical average of the operators that correspond to II.

D.1 Estimation of the Markov Operator

Note that the estimation component of Algorithm 7 directly operates in the setting of Appendix C and effectively solves the problem of adaptively estimating the sequence G¯1h,,G¯(T/h)h\bar{G}_{1\cdot h},\ldots,\bar{G}_{(T/h)\cdot h} of the hh-truncations of the true Markov operators G1h,,G(T/h)hG_{1\cdot h},\ldots,G_{(T/h)\cdot h}. To formally be able to apply Theorem C.2, we first show that the estimates sent to Algorithm 6 satisfy the properties of Oracle 2.

Claim D.1.

The estimators G~τ1h,τ1=1,T/h¯\tilde{G}_{\tau_{1}\cdot h},\;\tau_{1}=\overline{1,T/h} satisfy the properties of Oracle 2 with R~G=hdu(Rnat+RGmax{du,RnatR})\tilde{R}_{G}=\sqrt{h\cdot d_{u}}(R_{\mathrm{nat}}+R_{G}\max\{\sqrt{d_{u}},R_{\mathrm{nat}}R_{\mathcal{M}}\}).

Proof.

There are only two things to prove:

  1. 1.

    Boundedness. Because we clip Nature's x estimates \hat{x}^{\mathrm{nat}}_{t} to R_{\mathrm{nat}} and the Drc policy lies in \mathcal{M}(m,R_{\mathcal{M}}), we have that if b_{t}=0 then ||u_{t}||\leq R_{\mathrm{nat}}R_{\mathcal{M}}. If u_{t} is an exploratory action then ||u_{t}||=\sqrt{d_{u}}\leq\max\{\sqrt{d_{u}},R_{\mathrm{nat}}R_{\mathcal{M}}\} by design. By the state-evolution equation, we have that

    xtRnat+RGmax{du,RnatR}\|x_{t}\|\leq R_{\mathrm{nat}}+R_{G}\max\{\sqrt{d_{u}},R_{\mathrm{nat}}R_{\mathcal{M}}\} (D.1)

    and by Cauchy–Schwarz we get

    ||\tilde{G}^{[i]}_{\tau_{1}\cdot h}||_{F}=||x_{\tau_{1}\cdot h+1}u^{\top}_{\tau_{1}\cdot h-i}||_{F}\leq\sqrt{d_{u}}(R_{\mathrm{nat}}+R_{G}\max\{\sqrt{d_{u}},R_{\mathrm{nat}}R_{\mathcal{M}}\})

    and hence

    G~τ1hFhdu(Rnat+RGmax{du,RnatR})||\tilde{G}_{\tau_{1}\cdot h}||_{F}\leq\sqrt{h\cdot d_{u}}(R_{\mathrm{nat}}+R_{G}\max\{\sqrt{d_{u}},R_{\mathrm{nat}}R_{\mathcal{M}}\})
  2. 2.

    Unbiasedness. Plugging in x_{\tau_{1}\cdot h+1}=x^{\mathrm{nat}}_{\tau_{1}\cdot h+1}+\sum_{i=0}^{\tau_{1}\cdot h-1}G_{\tau_{1}\cdot h}^{[i]}u_{\tau_{1}\cdot h-i}, we get exactly that \mathbb{E}[\tilde{G}_{\tau_{1}\cdot h}^{[i]}]=G_{\tau_{1}\cdot h}^{[i]} for each i=0,\ldots,h-1. Since this holds up to the selected truncation h, we have \mathbb{E}[\tilde{G}_{\tau_{1}\cdot h}]=\bar{G}_{\tau_{1}\cdot h} for \bar{G} as defined.

Hence we can simply apply the guarantees of Appendix C to obtain the following corollary:

Corollary D.1.

On any interval J=[k,l][T/h]J=[k,l]\subseteq[T/h], we have that:

𝔼[τ=klG~τhG¯τhF2]VarJhtot(G¯1:T)+RegretJ(𝒜(p,R¯G,R~G;0))\mathbb{E}\left[\sum_{\tau=k}^{l}||\tilde{G}_{\tau\cdot h}-\bar{G}_{\tau\cdot h}||_{F}^{2}\right]\leq\mathrm{Var}^{\mathrm{tot}}_{J\cdot h}({\bar{G}_{1:T}})+\mathrm{Regret}_{J}(\mathcal{A}(p,\bar{R}_{G},\tilde{R}_{G};0))

where we use JhJ\cdot h to denote the set [kh,,lh][k\cdot h,\ldots,l\cdot h].

However, to properly analyze the additional regret introduced in the control framework by our estimation error, we need to convert Corollary D.1 into a guarantee in terms of the \ell_{1},op norm which holds for any contiguous interval I=[r,s]\subseteq[T]. This step is rather straightforward and only relies on a few basic properties/observations which we collect in Observation D.2 below.

Observation D.2.

We will make the following observations:

  1. 1.

    AopAF||A||_{op}\leq||A||_{F} for any matrix AA,

  2. 2.

    VarJtot(𝐳1:T)VarItot(𝐳1:T)\mathrm{Var}^{\mathrm{tot}}_{J}(\mathbf{z}_{1:T})\leq\mathrm{Var}^{\mathrm{tot}}_{I}(\mathbf{z}_{1:T}) for any set of indices JIJ\subseteq I, and any sequence 𝐳1:T\mathbf{z}_{1:T},

  3. 3.

    G^tG¯tF2=i=1hG^t[i]G¯t[i]F2||\hat{G}_{t}-\bar{G}_{t}||_{F}^{2}=\sum\limits_{i=1}^{h}||\hat{G}_{t}^{[i]}-\bar{G}_{t}^{[i]}||_{F}^{2},

  4. 4.

    GtG¯t1,opψ(h)||G_{t}-\bar{G}_{t}||_{\ell_{1},op}\leq\psi(h),

  5. 5.

    For any I=[r,s]I=[r,s], t=r+1sG¯tG¯t1F24VarItot(G¯1:T)4VarItot(𝐆)\sum_{t=r+1}^{s}||\bar{G}_{t}-\bar{G}_{t-1}||^{2}_{F}\leq 4\mathrm{Var}^{\mathrm{tot}}_{I}(\bar{G}_{1:T})\leq 4\mathrm{Var}^{\mathrm{tot}}_{I}(\mathbf{G}).

Proof.

Properties 1–4 follow from the definitions of the relevant quantities, or are general well-known facts. For 5, by the triangle inequality, we have that, for any \bar{G},

G¯tG¯t1FG¯tG¯F+G¯t1G¯F||\bar{G}_{t}-\bar{G}_{t-1}||_{F}\leq||\bar{G}_{t}-\bar{G}||_{F}+||\bar{G}_{t-1}-\bar{G}||_{F}
G¯tG¯t1F22(G¯tG¯F2+G¯t1G¯F2)\Rightarrow||\bar{G}_{t}-\bar{G}_{t-1}||_{F}^{2}\leq 2(||\bar{G}_{t}-\bar{G}||_{F}^{2}+||\bar{G}_{t-1}-\bar{G}||_{F}^{2})

summing the above from r+1r+1 to ss and taking G¯\bar{G} to be the sample mean over II yields the desired result. The second inequality simply follows by the fact that truncation can only decrease variance. ∎

As a first step, we bound the Frobenius norm error of the truncated operators over an arbitrary contiguous interval I=[r,s]\subseteq[T].

Lemma D.3.

On any interval I=[r,s]\subseteq[T] with r-1,s divisible by h (this is w.l.o.g. and only assumed for simplicity of presentation), we can bound the Frobenius estimation error of the truncated operators as:

𝔼[t=rsG^tG¯tF2]10h2VarItot(𝐆)+2hRegretI(𝒜(p,R¯G,R~G;0))\mathbb{E}\left[\sum_{t=r}^{s}||\hat{G}_{t}-\bar{G}_{t}||^{2}_{F}\right]\leq 10h^{2}\mathrm{Var}^{\mathrm{tot}}_{I}(\mathbf{G})+2h\mathrm{Regret}_{I}(\mathcal{A}(p,\bar{R}_{G},\tilde{R}_{G};0))
Proof.

By Observation D.2 (3), we can write

t=rsG^tG¯tF2\displaystyle\sum_{t=r}^{s}||\hat{G}_{t}-\bar{G}_{t}||_{F}^{2} =τ1=(r1)/hs/h1τ2=1hG^τ1h+τ2G¯τ1h+τ2F2\displaystyle=\sum_{\tau_{1}=(r-1)/h}^{s/h-1}\sum_{\tau_{2}=1}^{h}||\hat{G}_{\tau_{1}\cdot h+\tau_{2}}-\bar{G}_{\tau_{1}\cdot h+\tau_{2}}||_{F}^{2}
=τ1=(r1)/hs/h1τ2=1hG^(τ1+1)hG¯τ1h+τ2F2\displaystyle=\sum_{\tau_{1}=(r-1)/h}^{s/h-1}\sum_{\tau_{2}=1}^{h}||\hat{G}_{(\tau_{1}+1)\cdot h}-\bar{G}_{\tau_{1}\cdot h+\tau_{2}}||_{F}^{2}
2τ1=(r1)/hs/h1τ2=1h(G^(τ1+1)hG¯(τ1+1)hF2+G¯(τ1+1)hG¯τ1h+τ2F2)\displaystyle\leq 2\sum_{\tau_{1}=(r-1)/h}^{s/h-1}\sum_{\tau_{2}=1}^{h}\left(||\hat{G}_{(\tau_{1}+1)\cdot h}-\bar{G}_{(\tau_{1}+1)\cdot h}||_{F}^{2}+||\bar{G}_{(\tau_{1}+1)\cdot h}-\bar{G}_{\tau_{1}\cdot h+\tau_{2}}||_{F}^{2}\right)
=2hτ=r/hs/hG^τhG¯τhF2 (Algorithm 6 estimation error) +2τ1=(r1)/hs/h1τ2=1hG¯(τ1+1)hG¯τ1h+τ2F2 (G¯ movement)\displaystyle=2h\underbrace{\sum_{\tau=r/h}^{s/h}||\hat{G}_{\tau\cdot h}-\bar{G}_{\tau\cdot h}||_{F}^{2}}_{\text{{\color[rgb]{0,0,1} (\lx@cref{creftypecap~refnum}{alg:adaptive_est} estimation error) }}}+2\underbrace{\sum_{\tau_{1}=(r-1)/h}^{s/h-1}\sum_{\tau_{2}=1}^{h}||\bar{G}_{(\tau_{1}+1)\cdot h}-\bar{G}_{\tau_{1}\cdot h+\tau_{2}}||_{F}^{2}}_{\text{{\color[rgb]{0,0,1} ($\bar{G}$ movement) }}}

For the first term, we will simply apply the above corollary after taking expectation. We will therefore focus on bounding the G¯\bar{G} movement term.

τ2:G¯(τ1+1)hG¯τ1h+τ2F2\displaystyle\forall\tau_{2}:\;\;\;||\bar{G}_{(\tau_{1}+1)\cdot h}-\bar{G}_{\tau_{1}\cdot h+\tau_{2}}||_{F}^{2} (i=1h1G¯τ1h+i+1G¯τ1h+iF)2\displaystyle\leq\left(\sum_{i=1}^{h-1}||\bar{G}_{\tau_{1}\cdot h+i+1}-\bar{G}_{\tau_{1}\cdot h+i}||_{F}\right)^{2} (-ineq.)\displaystyle(\triangle\text{-ineq.})
hi=1h1G¯τ1h+i+1G¯τ1h+iF2\displaystyle\leq h\sum_{i=1}^{h-1}||\bar{G}_{\tau_{1}\cdot h+i+1}-\bar{G}_{\tau_{1}\cdot h+i}||_{F}^{2} (C.S.)

This implies that

τ2=1hG¯(τ1+1)hG¯τ1h+τ2F2h2i=1h1G¯τ1h+i+1G¯τ1h+iF2\displaystyle\sum_{\tau_{2}=1}^{h}||\bar{G}_{(\tau_{1}+1)\cdot h}-\bar{G}_{\tau_{1}\cdot h+\tau_{2}}||_{F}^{2}\leq h^{2}\sum_{i=1}^{h-1}||\bar{G}_{\tau_{1}\cdot h+i+1}-\bar{G}_{\tau_{1}\cdot h+i}||_{F}^{2}

So finally we have that

\displaystyle\bar{G}\text{-movement} \leq h^{2}\sum_{t=r+1}^{s}||\bar{G}_{t}-\bar{G}_{t-1}||_{F}^{2}
4h2VarItot(𝐆)\displaystyle\leq 4h^{2}\mathrm{Var}^{\mathrm{tot}}_{I}(\mathbf{G}) D.2(5)\displaystyle\lx@cref{creftypecap~refnum}{obs:reduction_props}\;(5)

Finally, taking expectation, plugging in Corollary D.1 and noting that for J=[r/h,s/h]J=[r/h,s/h], VarJhtot(G¯1:T)VarItot(𝐆)\mathrm{Var}^{\mathrm{tot}}_{J\cdot h}({\bar{G}_{1:T}})\leq\mathrm{Var}^{\mathrm{tot}}_{I}(\mathbf{G}) (D.2 (2)) and RegretJ(𝒜(p,R¯G,R~G;0))RegretI(𝒜(p,R¯G,R~G;0))\mathrm{Regret}_{J}(\mathcal{A}(p,\bar{R}_{G},\tilde{R}_{G};0))\leq\mathrm{Regret}_{I}(\mathcal{A}(p,\bar{R}_{G},\tilde{R}_{G};0)), we get:

𝔼[t=rsG^tG¯tF2]\displaystyle\mathbb{E}\left[\sum_{t=r}^{s}||\hat{G}_{t}-\bar{G}_{t}||_{F}^{2}\right] 10h2𝔼[VarItot(𝐆)]+2hRegretI(𝒜(p,R¯G,R~G;0))\displaystyle\leq 10h^{2}\mathbb{E}[\mathrm{Var}^{\mathrm{tot}}_{I}(\mathbf{G})]+2h\mathrm{Regret}_{I}(\mathcal{A}(p,\bar{R}_{G},\tilde{R}_{G};0))

Lemma D.4.

On any interval I=[r,s][T]I=[r,s]\subseteq[T] with r1,sr-1,s divisible by hh, we can bound the squared 1,op\ell_{1},op estimation error of the truncated operators as:

𝔼[t=rsG^tG¯t1,op2]20h3𝔼[VarItot(𝐆)]+4h2RegretI(𝒜(p,R¯G,R~G;0))\mathbb{E}\left[\sum_{t=r}^{s}||\hat{G}_{t}-\bar{G}_{t}||_{\ell_{1},op}^{2}\right]\leq 20h^{3}\mathbb{E}[\mathrm{Var}^{\mathrm{tot}}_{I}(\mathbf{G})]+4h^{2}\mathrm{Regret}_{I}(\mathcal{A}(p,\bar{R}_{G},\tilde{R}_{G};0))
Proof.

We have that

t=rsG^tG¯t1,op2\displaystyle\sum_{t=r}^{s}||\hat{G}_{t}-\bar{G}_{t}||_{\ell_{1},op}^{2} t=rs(i=0h1G^t[i]G¯t[i]op)2\displaystyle\leq\sum_{t=r}^{s}\left(\sum_{i=0}^{h-1}||\hat{G}_{t}^{[i]}-\bar{G}_{t}^{[i]}||_{op}\right)^{2} -ineq.\displaystyle\triangle\text{-ineq.}
2t=rs(i=0h1G^t[i]G¯t[i]op)2\displaystyle\leq 2\sum_{t=r}^{s}\left(\sum_{i=0}^{h-1}||\hat{G}_{t}^{[i]}-\bar{G}_{t}^{[i]}||_{op}\right)^{2}
2ht=rsi=0h1G^t[i]G¯t[i]F2\displaystyle\leq 2h\sum_{t=r}^{s}\sum_{i=0}^{h-1}||\hat{G}_{t}^{[i]}-\bar{G}_{t}^{[i]}||_{F}^{2} D.2(1) & C.S.\displaystyle\lx@cref{creftypecap~refnum}{obs:reduction_props}\;(1)\text{ \& C.S.}
=2ht=rsG^tG¯tF2\displaystyle=2h\sum_{t=r}^{s}||\hat{G}_{t}-\bar{G}_{t}||_{F}^{2} D.2(3)\displaystyle\lx@cref{creftypecap~refnum}{obs:reduction_props}\;(3)

Taking expectation and plugging in the bound in Lemma D.3 yields the promised result. ∎

Finally we can use Lemma D.4 and Cauchy-Schwarz to get a result in terms of the linear (rather than squared) 1,op\ell_{1},op error accumulated over an interval:

Proposition D.5.

On any interval I=[r,s]\subseteq[T] with r-1,s divisible by h, we can bound the \ell_{1},op estimation error of the truncated operators as:

\mathbb{E}\left[\sum_{t=r}^{s}||\hat{G}_{t}-\bar{G}_{t}||_{\ell_{1},op}\right]\leq 5h^{3/2}|I|^{1/2}\sqrt{\mathbb{E}[\mathrm{Var}^{\mathrm{tot}}_{I}(\mathbf{G})]}+2h|I|^{1/2}\mathrm{Regret}_{I}^{1/2}(\mathcal{A}(p,\bar{R}_{G},\tilde{R}_{G};0))
Proof.

By Cauchy-Schwarz and Jensen (since \sqrt{\cdot} is concave) we have:

𝔼[t=rsG^tG¯t1,op]\displaystyle\mathbb{E}\left[\sum_{t=r}^{s}||\hat{G}_{t}-\bar{G}_{t}||_{\ell_{1},op}\right] |I|1/2𝔼[(t=rsG^tG¯t1,op2)1/2]\displaystyle\leq|I|^{1/2}\mathbb{E}\left[\left(\sum_{t=r}^{s}||\hat{G}_{t}-\bar{G}_{t}||_{\ell_{1},op}^{2}\right)^{1/2}\right]
|I|1/2(𝔼[t=rsG^tG¯t1,op2])1/2\displaystyle\leq|I|^{1/2}\left(\mathbb{E}\left[\sum_{t=r}^{s}||\hat{G}_{t}-\bar{G}_{t}||_{\ell_{1},op}^{2}\right]\right)^{1/2}
|I|1/220h3𝔼[VarItot(𝐆)]+4h2RegretI(𝒜(p,R¯G,R~G;0))\displaystyle\leq|I|^{1/2}\sqrt{20h^{3}\mathbb{E}[\mathrm{Var}^{\mathrm{tot}}_{I}(\mathbf{G})]+4h^{2}\mathrm{Regret}_{I}(\mathcal{A}(p,\bar{R}_{G},\tilde{R}_{G};0))}
\displaystyle\leq 5h^{3/2}|I|^{1/2}\sqrt{\mathbb{E}[\mathrm{Var}^{\mathrm{tot}}_{I}(\mathbf{G})]}+2h|I|^{1/2}\mathrm{Regret}_{I}^{1/2}(\mathcal{A}(p,\bar{R}_{G},\tilde{R}_{G};0))

D.2 Error Sensitivity

We now analyze concretely how the GtG_{t} estimation errors induce additional regret over the case of known systems. We can decompose the expected regret over an interval I=[r,s]I=[r,s] as:

RegretI\displaystyle\mathrm{Regret}_{I} =t=rsct(xt,ut)minπΠdrct=rsct(xtπ,utπ)\displaystyle=\sum_{t=r}^{s}c_{t}(x_{t},u_{t})-\min_{\pi\in\Pi_{\mathrm{drc}}}\sum_{t=r}^{s}c_{t}(x_{t}^{\pi},u_{t}^{\pi})
=t=rsct(xt,ut)ct(x^t(Mth:t1),u^t(Mt)) (realized iterate error)\displaystyle=\underbrace{\sum_{t=r}^{s}c_{t}(x_{t},u_{t})-c_{t}(\hat{x}_{t}(M_{t-h:t-1}),\hat{u}_{t}(M_{t}))}_{\text{{\color[rgb]{0,0,1} (realized iterate error) }}}
\displaystyle\quad+\underbrace{\sum_{t=r}^{s}c_{t}(\hat{x}_{t}(M_{t-h:t-1}),\hat{u}_{t}(M_{t}))-\inf_{M\in\mathcal{M}}\sum_{t=r}^{s}c_{t}(\hat{x}_{t}(M),\hat{u}_{t}(M))}_{\text{{\color[rgb]{0,0,1} (regret $:=\widehat{\mathrm{Regret}}_{I}$) }}}
+infMt=rsct(x^t(M),u^t(M))infMt=rsct(xt(M),ut(M)) (comparator error)\displaystyle\quad+\underbrace{\inf_{M\in\mathcal{M}}\sum_{t=r}^{s}c_{t}(\hat{x}_{t}(M),\hat{u}_{t}(M))-\inf_{M\in\mathcal{M}}\sum_{t=r}^{s}c_{t}(x_{t}(M),u_{t}(M))}_{\text{{\color[rgb]{0,0,1} (comparator error) }}}

First let us bound the realized iterate error, which captures the difference between what actually happened and what would have happened in the fictitious (\hat{G}_{t},\hat{x}_{t}^{\mathrm{nat}}) system (without exploration).

Lemma D.6.

We can bound the difference between the true xtnatx^{\mathrm{nat}}_{t} and the extracted x^tnat\hat{x}_{t}^{\mathrm{nat}} as:

xtnatx^tnatRnatR(G¯t1G^t11,op+ψ(h))\|x^{\mathrm{nat}}_{t}-\hat{x}_{t}^{\mathrm{nat}}\|\leq R_{\mathrm{nat}}R_{\mathcal{M}}\left(\|\bar{G}_{t-1}-\hat{G}_{t-1}\|_{\ell_{1},op}+\psi(h)\right)
Proof.

Since xnat𝔹Rnatx^{\mathrm{nat}}\in\mathbb{B}_{R_{\mathrm{nat}}} and because (as argued earlier) utRnatR\|u_{t}\|\leq R_{\mathrm{nat}}R_{\mathcal{M}}, we have:

xtnatx^tnat\displaystyle\|x^{\mathrm{nat}}_{t}-\hat{x}_{t}^{\mathrm{nat}}\| =xtnatProj𝔹Rnat[xti=0h1G^t1[i]ut1i]\displaystyle=\bigg{|}\bigg{|}x^{\mathrm{nat}}_{t}-\mathrm{Proj}_{\mathbb{B}_{R_{\mathrm{nat}}}}\left[x_{t}-\sum_{i=0}^{h-1}\hat{G}_{t-1}^{[i]}u_{t-1-i}\right]\bigg{|}\bigg{|}
xtnatxt+i=0h1G^t1[i]ut1i\displaystyle\leq\bigg{|}\bigg{|}x^{\mathrm{nat}}_{t}-x_{t}+\sum_{i=0}^{h-1}\hat{G}_{t-1}^{[i]}u_{t-1-i}\bigg{|}\bigg{|} (Pythagoras)
=i=0h1(Gt1[i]G^t1[i])ut1i+i=ht2Gt1[i]ut1i\displaystyle=\bigg{|}\bigg{|}\sum_{i=0}^{h-1}(G_{t-1}^{[i]}-\hat{G}_{t-1}^{[i]})u_{t-1-i}+\sum_{i=h}^{t-2}G_{t-1}^{[i]}u_{t-1-i}\bigg{|}\bigg{|}
RnatR(G¯t1G^t11,op+ψ(h))\displaystyle\leq R_{\mathrm{nat}}R_{\mathcal{M}}\left(\|\bar{G}_{t-1}-\hat{G}_{t-1}\|_{\ell_{1},op}+\psi(h)\right)
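As a side remark, the extraction of \hat{x}^{\mathrm{nat}}_{t} used in the preceding proof amounts to subtracting the estimated effect of the last h inputs from the observed state and projecting onto the ball of radius R_{\mathrm{nat}}. A minimal Python sketch of this step (with illustrative variable names) is:

```python
import numpy as np

def extract_nat_state(x_t, G_hat_prev, u_hist, R_nat):
    """hat{x}^nat_t = Proj_{B_{R_nat}}[ x_t - sum_{i=0}^{h-1} hat{G}_{t-1}^{[i]} u_{t-1-i} ].

    x_t        : observed state, shape (d_x,)
    G_hat_prev : list of the h estimated components hat{G}_{t-1}^{[i]}, each of shape (d_x, d_u)
    u_hist     : the inputs u_{t-1}, u_{t-2}, ..., u_{t-h}, shape (h, d_u)
    """
    x_nat = x_t - sum(G_i @ u_i for G_i, u_i in zip(G_hat_prev, u_hist))
    norm = np.linalg.norm(x_nat)
    if norm > R_nat:                 # Euclidean projection onto the ball of radius R_nat
        x_nat *= R_nat / norm
    return x_nat
```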

Lemma D.7 (Realized Iterate Error).

For I=[r,s]I=[r,s] with r1,sr-1,s divisible by hh, we can bound the realized iterate error as:

t=rsct(xt,ut)ct(x^t(Mth:t1),u^t(Mt))2hτ1=(r1)/hs/hbτ11+4LduRsys2RG(t=rsG¯t1G^t11,op+ψ(h)|I|)\sum_{t=r}^{s}c_{t}(x_{t},u_{t})-c_{t}(\hat{x}_{t}(M_{t-h:t-1}),\hat{u}_{t}(M_{t}))\leq 2h\sum_{\tau_{1}=(r-1)/h}^{s/h}b_{\tau_{1}-1}+4L\sqrt{d_{u}}\frac{R_{\mathrm{sys}}^{2}}{R_{G}}\left(\sum_{t=r}^{s}\|\bar{G}_{t-1}-\hat{G}_{t-1}\|_{\ell_{1},op}+\psi(h)|I|\right)

where we denote RsysRGRnatRR_{\mathrm{sys}}\doteq R_{G}R_{\mathrm{nat}}R_{\mathcal{M}}.

Proof.

Consider the cost difference on an h-block indexed by t=\overline{\tau_{1}\cdot h+1,\tau_{1}\cdot h+h}, and distinguish the following three cases:

  1. 1.

    bτ1=1b_{\tau_{1}}=1: in this case we are exploring and cannot give a better guarantee than

    c_{t}(x_{t},u_{t})-c_{t}(\hat{x}_{t}(M_{t-h:t-1}),\hat{u}_{t}(M_{t}))\leq 1
  2. 2.

    bτ1=0,bτ11=1b_{\tau_{1}}=0,b_{\tau_{1}-1}=1: while we are not exploring during the current round, and hence ut=u^t(Mt)u_{t}=\hat{u}_{t}(M_{t}) for t=τ1h+1,τ1h+h¯t=\overline{\tau_{1}\cdot h+1,\tau_{1}\cdot h+h}, we have explored in the previous round and therefore utu_{t} and u^t(Mt)\hat{u}_{t}(M_{t}) may be arbitrarily far for tτ1ht\leq\tau_{1}\cdot h. This can induce xtx_{t} and x^t(Mth:t1)\hat{x}_{t}(M_{t-h:t-1}) to be quite far, especially the closer we get to τ1h+1\tau_{1}\cdot h+1. As such, in this event, we will also simply bound

    c_{t}(x_{t},u_{t})-c_{t}(\hat{x}_{t}(M_{t-h:t-1}),\hat{u}_{t}(M_{t}))\leq 1
  3. 3.

    bτ1=0,bτ11=0b_{\tau_{1}}=0,b_{\tau_{1}-1}=0: finally, in this case we have that ut=u^t(Mt)u_{t}=\hat{u}_{t}(M_{t}) for t=τ1hh+1,τ1h+h¯t=\overline{\tau_{1}\cdot h-h+1,\tau_{1}\cdot h+h}. We can expand:

    xt=xtnat+i=0h1Gt1[i]ut1i+i=ht2Gt1[i]ut1ix_{t}=x^{\mathrm{nat}}_{t}+\sum_{i=0}^{h-1}G_{t-1}^{[i]}u_{t-1-i}+\sum_{i=h}^{t-2}G_{t-1}^{[i]}u_{t-1-i}

    and

    x^t(Mth:t1)=x^tnat+i=0h1G^t1[i]u^t1i(Mt1i)\hat{x}_{t}(M_{t-h:t-1})=\hat{x}_{t}^{\mathrm{nat}}+\sum_{i=0}^{h-1}\hat{G}_{t-1}^{[i]}\hat{u}_{t-1-i}(M_{t-1-i})

    By the observation above, for t=τ1h+1,τ1h+h¯t=\overline{\tau_{1}\cdot h+1,\tau_{1}\cdot h+h}, we have

    xtx^t(Mth:t1)\displaystyle\|x_{t}-\hat{x}_{t}(M_{t-h:t-1})\| xtnatx^tnat+RnatR(G¯t1G^t11,op+ψ(h))\displaystyle\leq\|x^{\mathrm{nat}}_{t}-\hat{x}_{t}^{\mathrm{nat}}\|+R_{\mathrm{nat}}R_{\mathcal{M}}\left(\|\bar{G}_{t-1}-\hat{G}_{t-1}\|_{\ell_{1},op}+\psi(h)\right)
    2RnatR(G¯t1G^t11,op+ψ(h))\displaystyle\leq 2R_{\mathrm{nat}}R_{\mathcal{M}}\left(\|\bar{G}_{t-1}-\hat{G}_{t-1}\|_{\ell_{1},op}+\psi(h)\right)

    Hence, by the sub-quadratic Lipschitzness of the cost and Eq. D.1 we have

    \displaystyle c_{t}(x_{t},u_{t})-c_{t}(\hat{x}_{t}(M_{t-h:t-1}),\hat{u}_{t}(M_{t})) \leq 4L\sqrt{d_{u}}R_{G}R_{\mathrm{nat}}^{2}R_{\mathcal{M}}^{2}\left(\|\bar{G}_{t-1}-\hat{G}_{t-1}\|_{\ell_{1},op}+\psi(h)\right)

So for any τ1=0,T/h¯\tau_{1}=\overline{0,T/h}, we have

t=τ1h+1τ1h+hct(xt,ut)ct(x^t(Mth:t1),u^t(Mt))\displaystyle\sum_{t=\tau_{1}\cdot h+1}^{\tau_{1}\cdot h+h}c_{t}(x_{t},u_{t})-c_{t}(\hat{x}_{t}(M_{t-h:t-1}),\hat{u}_{t}(M_{t})) 𝟏bτ1=1h+𝟏bτ1=0,bτ11=1h\displaystyle\leq\mathbf{1}_{b_{\tau_{1}}=1}\cdot h+\mathbf{1}_{b_{\tau_{1}}=0,b_{\tau_{1}-1}=1}\cdot h
+𝟏bτ1=0,bτ11=04LduRGRnat2R2t=τ1h+1τ1h+hG¯t1G^t11,op\displaystyle+\mathbf{1}_{b_{\tau_{1}}=0,b_{\tau_{1}-1}=0}4L\sqrt{d_{u}}R_{G}R_{\mathrm{nat}}^{2}R_{\mathcal{M}}^{2}\sum_{t=\tau_{1}\cdot h+1}^{\tau_{1}\cdot h+h}\|\bar{G}_{t-1}-\hat{G}_{t-1}\|_{\ell_{1},op}
+𝟏bτ1=0,bτ11=04LduRGRnat2R2hψ(h)\displaystyle+\mathbf{1}_{b_{\tau_{1}}=0,b_{\tau_{1}-1}=0}4L\sqrt{d_{u}}R_{G}R_{\mathrm{nat}}^{2}R_{\mathcal{M}}^{2}h\cdot\psi(h)
(bτ1+bτ11)h\displaystyle\leq(b_{\tau_{1}}+b_{\tau_{1}-1})\cdot h
+4LduRGRnat2R2(t=τ1h(τ1+1)h1G¯tG^t1,op+hψ(h))\displaystyle+4L\sqrt{d_{u}}R_{G}R_{\mathrm{nat}}^{2}R_{\mathcal{M}}^{2}\left(\sum_{t=\tau_{1}\cdot h}^{(\tau_{1}+1)\cdot h-1}\|\bar{G}_{t}-\hat{G}_{t}\|_{\ell_{1},op}+h\psi(h)\right)

summing over τ1\tau_{1} yields the desired result.

Lemma D.8 (Comparator Error).

We can bound the comparator error as:

(comparator error)8LduRnat2RG2R3h(mt=rhmsG¯tG^t1,op+ψ(h))\text{(comparator error)}\leq 8L\sqrt{d_{u}}R_{\mathrm{nat}}^{2}R_{G}^{2}R_{\mathcal{M}}^{3}h\left(m\sum_{t=r-h-m}^{s}\|\bar{G}_{t}-\hat{G}_{t}\|_{\ell_{1},op}+\psi(h)\right)
Proof.

Let M^{\star}=\arg\min_{M\in\mathcal{M}}\sum_{t=r}^{s}c_{t}(x_{t}(M),u_{t}(M)). We have that:

(comparator error) t=rsct(x^t(M),u^t(M))ct(xt(M),ut(M))\displaystyle\leq\sum_{t=r}^{s}c_{t}(\hat{x}_{t}(M^{\star}),\hat{u}_{t}(M^{\star}))-c_{t}(x_{t}(M^{\star}),u_{t}(M^{\star}))
\displaystyle\leq 2L\sqrt{d_{u}}R_{\mathrm{nat}}R_{G}R_{\mathcal{M}}\sum_{t=r}^{s}\left(\|\hat{x}_{t}(M^{\star})-x_{t}(M^{\star})\|+\|\hat{u}_{t}(M^{\star})-u_{t}(M^{\star})\|\right)

We have that

u^t(M)ut(M)\displaystyle\|\hat{u}_{t}(M^{\star})-u_{t}(M^{\star})\| =i=0m1M,[i](x^tinatxtinat)\displaystyle=\|\sum_{i=0}^{m-1}M^{\star,[i]}(\hat{x}^{\mathrm{nat}}_{t-i}-x^{\mathrm{nat}}_{t-i})\|
RnatR2(τ=tmt1G¯τG^τ1,op+ψ(h))\displaystyle\leq R_{\mathrm{nat}}R_{\mathcal{M}}^{2}\left(\sum_{\tau=t-m}^{t-1}\|\bar{G}_{\tau}-\hat{G}_{\tau}\|_{\ell_{1},op}+\psi(h)\right)

With a bit more computation, we can also bound the difference in the states. First, let us expand the expression of the states:

xt(M)=xtnat+i=0h1Gt1[i]ut1i(M)+i=ht2Gt1[i]ut1i(M)x_{t}(M^{\star})=x^{\mathrm{nat}}_{t}+\sum_{i=0}^{h-1}G_{t-1}^{[i]}u_{t-1-i}(M^{\star})+\sum_{i=h}^{t-2}G_{t-1}^{[i]}u_{t-1-i}(M^{\star})
x^t(M)=x^tnat+i=0h1G^t1[i]u^t1i(M)\hat{x}_{t}(M^{\star})=\hat{x}_{t}^{\mathrm{nat}}+\sum_{i=0}^{h-1}\hat{G}_{t-1}^{[i]}\hat{u}_{t-1-i}(M^{\star})

The only new thing we need to bound is:

i=0h1Gt1[i]ut1i(M)i=0h1G^t1[i]u^t1i(M)\displaystyle\sum_{i=0}^{h-1}G_{t-1}^{[i]}u_{t-1-i}(M^{\star})-\sum_{i=0}^{h-1}\hat{G}_{t-1}^{[i]}\hat{u}_{t-1-i}(M^{\star}) =i=0h1Gt1[i]ut1i(M)i=0h1G^t1[i]ut1i(M)\displaystyle=\sum_{i=0}^{h-1}G_{t-1}^{[i]}u_{t-1-i}(M^{\star})-\sum_{i=0}^{h-1}\hat{G}_{t-1}^{[i]}u_{t-1-i}(M^{\star})
+i=0h1G^t1[i]ut1i(M)i=0h1G^t1[i]u^t1i(M)\displaystyle\quad+\sum_{i=0}^{h-1}\hat{G}_{t-1}^{[i]}u_{t-1-i}(M^{\star})-\sum_{i=0}^{h-1}\hat{G}_{t-1}^{[i]}\hat{u}_{t-1-i}(M^{\star})
RnatRG¯t1G^t11,op+RGi=1huti(M)u^ti(M)\displaystyle\leq R_{\mathrm{nat}}R_{\mathcal{M}}\|\bar{G}_{t-1}-\hat{G}_{t-1}\|_{\ell_{1},op}+R_{G}\sum_{i=1}^{h}\|u_{t-i}(M^{\star})-\hat{u}_{t-i}(M^{\star})\|
2RGRnatR2h(mτ=thmt1G¯τG^τ1,op+ψ(h))\displaystyle\leq 2R_{G}R_{\mathrm{nat}}R_{\mathcal{M}}^{2}h\left(m\sum_{\tau=t-h-m}^{t-1}\|\bar{G}_{\tau}-\hat{G}_{\tau}\|_{\ell_{1},op}+\psi(h)\right)

Plugging this into our expressions for xt(M)x_{t}(M^{\star}), and using previous bounds we have

x^t(M)xt(M)\displaystyle\|\hat{x}_{t}(M^{\star})-x_{t}(M^{\star})\| 3RGRnatR2h(mτ=thmt1G¯τG^τ1,op+ψ(h))\displaystyle\leq 3R_{G}R_{\mathrm{nat}}R_{\mathcal{M}}^{2}h\left(m\sum_{\tau=t-h-m}^{t-1}\|\bar{G}_{\tau}-\hat{G}_{\tau}\|_{\ell_{1},op}+\psi(h)\right)

Hence we conclude that:

(comparator error)8LduRnat2RG2R3h(mt=rhmsG¯tG^t1,op+ψ(h))\text{(comparator error)}\leq 8L\sqrt{d_{u}}R_{\mathrm{nat}}^{2}R_{G}^{2}R_{\mathcal{M}}^{3}h\left(m\sum_{t=r-h-m}^{s}\|\bar{G}_{t}-\hat{G}_{t}\|_{\ell_{1},op}+\psi(h)\right)

D.3 Proof of Theorem D.1

Proof of Theorem D.1.

We have that

RegretI\displaystyle\mathrm{Regret}_{I} 18LdminRG2R2Rnat2m(h+1)5/4T (Known System Regret) +2hτ=(r1)/hs/hbτ1 (Exploration Penalty)\displaystyle\leq\underbrace{18L\sqrt{d_{\mathrm{min}}}R_{G}^{2}R_{\mathcal{M}}^{2}R_{\mathrm{nat}}^{2}m(h+1)^{5/4}\sqrt{T}}_{\text{{\color[rgb]{0,0,1} (Known System Regret) }}}+\underbrace{2h\sum_{\tau=(r-1)/h}^{s/h}b_{\tau-1}}_{\text{{\color[rgb]{0,0,1} (Exploration Penalty) }}}
+12LRnat2RG2R3hm(t=rsG¯tG^t1,op+2RG(h+m)) (System Misspecification Induced Error)\displaystyle\quad+\underbrace{12LR_{\mathrm{nat}}^{2}R_{G}^{2}R_{\mathcal{M}}^{3}hm\left(\sum_{t=r}^{s}\|\bar{G}_{t}-\hat{G}_{t}\|_{\ell_{1},op}+2R_{G}(h+m)\right)}_{\text{{\color[rgb]{0,0,1} (System Misspecification Induced Error) }}}
+18LduRG2R3Rnat2ψ(h)h(|I|+h+m) (Truncation Error)\displaystyle\quad+\underbrace{18L\sqrt{d_{u}}R_{G}^{2}R_{\mathcal{M}}^{3}R_{\mathrm{nat}}^{2}\psi(h)h(|I|+h+m)}_{\text{{\color[rgb]{0,0,1} (Truncation Error) }}}

Taking expectation and plugging in Proposition D.5 we get:

𝔼[RegretI]\displaystyle\mathbb{E}[\mathrm{Regret}_{I}] 𝒪~(LduRsys2mT+p|I|\displaystyle\leq\widetilde{\mathcal{O}}\bigg{(}L\sqrt{d_{u}}R_{\mathrm{sys}}^{2}m\sqrt{T}+p|I|
+LRsys2Rm(|I|1/2𝔼[VarItot(𝐆)]+duRsys|I|1/2p1/2+RGm)\displaystyle\quad+LR_{\mathrm{sys}}^{2}R_{\mathcal{M}}m\left(|I|^{1/2}\sqrt{\mathbb{E}[\mathrm{Var}^{\mathrm{tot}}_{I}(\mathbf{G})]}+d_{u}R_{\mathrm{sys}}|I|^{1/2}p^{-1/2}+R_{G}m\right)
+LduRsys2RRGm)\displaystyle\quad+L\sqrt{d_{u}}R_{\mathrm{sys}}^{2}R_{\mathcal{M}}R_{G}m\bigg{)} (D.2)
=𝒪~(Lm(|I|𝔼[VarI(𝐆)]+duT2/3))\displaystyle=\widetilde{\mathcal{O}}^{\star}\left(Lm\left(|I|\sqrt{\mathbb{E}[\mathrm{Var}_{I}(\mathbf{G})]}+d_{u}T^{2/3}\right)\right)

where R_{\mathrm{sys}}=R_{G}R_{\mathcal{M}}R_{\mathrm{nat}}, using \mathrm{Var}^{\mathrm{tot}}_{I}(\mathbf{G})=|I|\cdot\mathrm{Var}_{I}(\mathbf{G}), and that for the chosen h we have \psi(h)\leq R_{G}T^{-1}. ∎

Appendix E Sublinear Regret for State Feedback

We demonstrate that it is (information-theoretically) possible to achieve sublinear (though large) regret against a benchmark of stabilizing static feedback control policies.

We suppose there is a subset 𝒦du×dx\mathcal{K}\subset\mathbb{R}^{d_{u}\times d_{x}} of feedback policies KK, and our goal is to obtain regret compared to the best K𝒦K\in\mathcal{K}:

RegT(𝒦):=t=1Tct(xt,ut)infK𝒦t=1Tct(xtK,utK),\displaystyle\mathrm{Reg}_{T}(\mathcal{K}):=\sum_{t=1}^{T}c_{t}(x_{t},u_{t})-\inf_{K\in\mathcal{K}}\sum_{t=1}^{T}c_{t}(x_{t}^{K},u_{t}^{K}),

where (xtK,utK)(x_{t}^{K},u_{t}^{K}) are the iterates arising under the control law ut=Kxtu_{t}=Kx_{t}.

For this setting, we propose an algorithm based on the classic Exp3 exponential-weights algorithm (see, e.g., Chapter 3 of [8]), run over an \varepsilon-cover \mathcal{K}_{\varepsilon} of \mathcal{K} in the operator norm. We maintain a constant controller K on intervals of length H, and feed the losses on those intervals to the Exp3 algorithm. Pseudocode is given in Algorithm 8.

Algorithm 8 Exponentially Weighted Control
1:Input: window length HH, step size η>0\eta>0, finite ε\varepsilon-cover 𝒦ε𝒦\mathcal{K}_{\varepsilon}\subset\mathcal{K}, initial estimate K1𝒦εK_{1}\in\mathcal{K}_{\varepsilon}
2:Initialize 1(K)=0\mathcal{L}_{1}(K)=0 and p1(K)=1/|𝒦ε|p_{1}(K)=1/|\mathcal{K}_{\varepsilon}|, K𝒦ε~{}\forall K\in\mathcal{K}_{\varepsilon},
3:for t=1,,Tt=1,\ldots,T do
4:     Play u_{t}=K_{t}x_{t}.
5:     Receive \hat{c}_{t}=c_{t}(x_{t},u_{t})
6:     if tmodH=0t\mod H=0 then
7:         Set n=t/Hn=t/H, n=i=tH+1tc^i\ell_{n}=\sum_{i=t-H+1}^{t}\hat{c}_{i}
8:         Set n+1(Kt)=1pn(Kt)n+n(Kt)\mathcal{L}_{n+1}(K_{t})=\frac{1}{p_{n}(K_{t})}\ell_{n}+\mathcal{L}_{n}(K_{t})
9:         Set n+1(K)=n(K)\mathcal{L}_{n+1}(K)=\mathcal{L}_{n}(K) for all K𝒦ε/{Kt}K\in\mathcal{K}_{\varepsilon}/\{K_{t}\}
10:         Set p_{n+1}(K)=\frac{\exp(-\eta\mathcal{L}_{n+1}(K))}{\sum_{K^{\prime}\in\mathcal{K}_{\varepsilon}}\exp(-\eta\mathcal{L}_{n+1}(K^{\prime}))}
11:         Sample Kt+1pn+1()K_{t+1}\sim p_{n+1}(\cdot).
12:     else
13:         Set Kt+1=KtK_{t+1}=K_{t}      
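A minimal Python sketch of Algorithm 8 is given below. The \varepsilon-cover is treated as a finite list of candidate gains, a sampled gain is held constant for H steps, and the accumulated cost on each window is fed to an importance-weighted exponential-weights (Exp3-style) update over the cover. The environment interface (env.x, env.step) and the numerical-stability shift are our own conventions, not part of the formal algorithm.

```python
import numpy as np

def exp3_control(env, K_cover, T, H, eta, rng):
    """Exponentially weighted control over a finite cover of static feedback gains.

    env     : hypothetical environment with attribute env.x (current state) and
              env.step(u) -> (x_next, cost)
    K_cover : list of candidate gains K, each of shape (d_u, d_x); the epsilon-cover
    """
    n_arms = len(K_cover)
    L = np.zeros(n_arms)                        # importance-weighted cumulative losses
    probs = np.full(n_arms, 1.0 / n_arms)
    arm = int(rng.integers(n_arms))
    window_cost = 0.0
    for t in range(1, T + 1):
        u = K_cover[arm] @ env.x                # hold the sampled gain for the whole window
        _, cost = env.step(u)
        window_cost += cost
        if t % H == 0:
            L[arm] += window_cost / probs[arm]  # unbiased (importance-weighted) window loss
            window_cost = 0.0
            w = np.exp(-eta * (L - L.min()))    # losses enter with a negative sign
            probs = w / w.sum()
            arm = int(rng.choice(n_arms, p=probs))
    return K_cover[arm]
```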

We state our regret bound under the (quite restrictive) assumption that all policies K\in\mathcal{K} are sequentially stabilizing. Formally, given a sequence of controllers K_{s:t}=(K_{s},\dots,K_{t}) with each K_{i}\in\mathcal{K}, we define

\displaystyle\Phi_{s:t}(K_{s:t}):=\prod_{i=s}^{t}(A_{i}+B_{i}K_{i})=(A_{t}+B_{t}K_{t})\cdot(A_{t-1}+B_{t-1}K_{t-1})\cdot\dots\cdot(A_{s}+B_{s}K_{s}),

and write \Phi_{s:t}(K) when K_{i}\equiv K. We assume that \Phi_{s:t} exhibits geometric decay uniformly over all times and over all controller sequences drawn from \mathcal{K}:

Assumption 4.

There exist c_{\star}\geq 1 and \rho_{\star}\in(1/2,1) such that for any indices s\leq t and any K_{s:t}\in\mathcal{K}^{t-s+1}, \|\Phi_{s:t}(K_{s:t})\|_{\mathrm{op}}\leq c_{\star}\rho_{\star}^{t-s} (below we also write \mathcal{C}_{\star} for c_{\star}). We define the constant

R𝒦:=1+max{K:K𝒦}\displaystyle R_{\mathcal{K}}:=1+\max\{\|K\|:K\in\mathcal{K}\}
Theorem E.1.

Suppose Assumptions 4 and 3 hold, and for some R_{w}\geq 1 and R_{B}\geq 0, \max_{t}\|w_{t}\|\leq R_{w} and \max_{t}\|B_{t}\|\leq R_{B}. In addition, suppose T is large enough that \mathcal{C}_{\star}\rho_{\star}^{T/4}\leq 1/2. Then, Algorithm 8 with horizon H=\lceil T^{1/4}\rceil, an appropriate step size \eta, and a minimal \varepsilon-covering \mathcal{K}_{\varepsilon} of \mathcal{K} enjoys the following regret bound:

𝔼[RegT(𝒦)]L𝒞1(5R𝒦)dxdu/2T112(dxdu+3),\displaystyle\mathbb{E}[\mathrm{Reg}_{T}(\mathcal{K})]\leq L\mathcal{C}_{1}(5R_{\mathcal{K}})^{d_{x}d_{u}/2}\cdot T^{1-\frac{1}{2(d_{x}d_{u}+3)}},

where 𝒞1=𝒪(𝒞5(1ρ)3R𝒦3Rw2(1+RB)dudx)\mathcal{C}_{1}=\operatorname{\mathcal{O}}\left({\frac{\mathcal{C}_{\star}^{5}}{(1-\rho_{\star})^{3}}R_{\mathcal{K}}^{3}R_{w}^{2}(1+R_{B})\sqrt{d_{u}d_{x}}}\right).

The theorem is established by a reduction to online multi-arm bandits in Section E.1 below.

Remark E.1 (Extensions of Theorem E.1).

The following analysis extends to policies of the form ut=(Ktstb+K)xt+vu_{t}=(K^{\mathrm{stb}}_{t}+K)x_{t}+v, where (Ktstb)(K^{\mathrm{stb}}_{t}) is a fixed sequence of control policies determined a priori, K𝒦du×dxK\in\mathcal{K}\subset\mathbb{R}^{d_{u}\times d_{x}} is a feedback parameter, and v𝒱duv\in\mathcal{V}\subset\mathbb{R}^{d_{u}} is a bounded affine term. Letting (xtK,c,utK,c)(x_{t}^{K,c},u^{K,c}_{t}) denote the iterates produced by such a policy, our notion of regret is

RegT(𝒦×𝒱):=t=1Tct(xt,ut)inf(K,c)𝒦×𝒱t=1Tct(xtK,c,utK,c),\displaystyle\mathrm{Reg}_{T}(\mathcal{K}\times\mathcal{V}):=\sum_{t=1}^{T}c_{t}(x_{t},u_{t})-\inf_{(K,c)\in\mathcal{K}\times\mathcal{V}}\sum_{t=1}^{T}c_{t}(x_{t}^{K,c},u_{t}^{K,c}),

The only assumptions we require in general is that 𝒱\mathcal{V} is bounded, and that 𝒦\mathcal{K}, combined with (Kistb)(K^{\mathrm{stb}}_{i}), are sequentially stabilizing in the sense that, for any sts\leq t, the fixed (Kistb)(K^{\mathrm{stb}}_{i}) sequence, and any Ks:tstb𝒦ts+1K^{\mathrm{stb}}_{s:t}\in\mathcal{K}^{t-s+1}, it holds that the products

Φs:tstb(Ks:t):=i=st(Ai+(Kistb+Ki)Bi)\displaystyle\Phi^{\mathrm{stb}}_{s:t}(K_{s:t}):=\prod_{i=s}^{t}(A_{i}+(K^{\mathrm{stb}}_{i}+K_{i})B_{i})

exhibit geometric decay. ∎

E.1 Proof of Theorem E.1

In what follows, assume that H=T^{1/4} evenly divides T. For every index n\in\mathbb{N}, define t_{n}=1+(n-1)H. To avoid confusion, we let x^{\mathsf{alg}}_{t},u^{\mathsf{alg}}_{t} denote the iterates produced by the algorithm. We define the sequence which begins at state x^{\mathsf{alg}}_{t_{n}} at time t_{n}, and rolls forward under controller K for future times:

x¯n;tn(K)\displaystyle\bar{x}_{n;t_{n}}(K) =xtn𝖺𝗅𝗀,x¯n;t+1(K)=(At+BtK)x¯n;t(K)+wt,ttn\displaystyle=x^{\mathsf{alg}}_{t_{n}},\quad\bar{x}_{n;t+1}(K)=(A_{t}+B_{t}K)\bar{x}_{n;t}(K)+w_{t},\quad t\geq t_{n}
\displaystyle\bar{u}_{n;t}(K) =K\bar{x}_{n;t}(K),\quad t\geq t_{n}.

Observe that, since we select a new controller KtnK_{t_{n}} just before each time tnt_{n}, we have

\displaystyle(\bar{x}_{n;t}(K_{t_{n}}),\bar{u}_{n;t}(K_{t_{n}}))=(x^{\mathsf{alg}}_{t},u^{\mathsf{alg}}_{t}),\quad\forall t\in[t_{n},t_{n+1}-1].

Therefore, defining the losses,

\displaystyle\ell_{n}(K)=\sum_{t=t_{n}}^{t_{n+1}-1}c_{t}(\bar{x}_{n;t}(K),\bar{u}_{n;t}(K)),

we have

n(Ktn)=t=tntn+11ct(xt𝖺𝗅𝗀,ut𝖺𝗅𝗀).\displaystyle\ell_{n}(K_{t_{n}})=\sum_{t=t_{n}}^{t_{n+1}-1}c_{t}(x^{\mathsf{alg}}_{t},u^{\mathsf{alg}}_{t}).

Therefore, we may decompose the regret as

𝔼[RegT(𝒦)]\displaystyle\mathbb{E}[\mathrm{Reg}_{T}(\mathcal{K})] =𝔼[n=1T/Hn(Ktn)]infK𝒦ε𝔼[n=1T/Hn(K)]R1\displaystyle=\underbrace{\mathbb{E}\left[\sum_{n=1}^{T/H}\ell_{n}(K_{t_{n}})\right]-\inf_{K\in\mathcal{K}_{\varepsilon}}\mathbb{E}\left[\sum_{n=1}^{T/H}\ell_{n}(K)\right]}_{R_{1}}
+infK𝒦ε𝔼[n=1T/Hn(K)]infK𝒦εt=1Tct(xtK,utK)R2\displaystyle+\underbrace{\inf_{K\in\mathcal{K}_{\varepsilon}}\mathbb{E}\left[\sum_{n=1}^{T/H}\ell_{n}(K)\right]-\inf_{K\in\mathcal{K}_{\varepsilon}}\sum_{t=1}^{T}c_{t}(x_{t}^{K},u_{t}^{K})}_{R_{2}}
\displaystyle\quad+\underbrace{\inf_{K\in\mathcal{K}_{\varepsilon}}\sum_{t=1}^{T}c_{t}(x_{t}^{K},u_{t}^{K})-\inf_{K\in\mathcal{K}}\sum_{t=1}^{T}c_{t}(x_{t}^{K},u_{t}^{K})}_{R_{3}}.

Here, R_{1} is the simple regret on the \ell_{n} sequence, R_{2} is the extent to which the \ell_{n} sequence approximates regret against controllers K\in\mathcal{K}_{\varepsilon} in the covering, and finally R_{3} bounds the regret of the covering against the full set \mathcal{K}. Here, expectations are over the randomness in the algorithm, and due to the obliviousness of the adversary, we may assume that (c_{t},x_{t}^{K},u_{t}^{K}) are deterministic and chosen in advance. We bound each of the three terms in sequence. Before proceeding, we use the following estimates:

Lemma E.1 (Key Term Bounds).

Suppose that H=T1/4H=T^{1/4} is sufficiently large that 𝒞ρH1/2\mathcal{C}_{\star}\rho_{\star}^{H}\leq 1/2. Moreover, let R=𝒞1ρR_{\star}=\frac{\mathcal{C}_{\star}}{1-\rho_{\star}}. Then,

  1. (a)

    For all K𝒦K\in\mathcal{K} and t[T]t\in[T], xtKRRw\|x^{K}_{t}\|\leq R_{\star}R_{w}

  2. (b)

    For any ttnt\geq t_{n} and K𝒦K\in\mathcal{K}, x¯n;t(K)2𝒞RRw\|\bar{x}_{n;t}(K)\|\leq 2\mathcal{C}_{\star}R_{\star}R_{w}

  3. (c)

    (xtK,utK)R𝒦RRw\|(x^{K}_{t},u^{K}_{t})\|\leq R_{\mathcal{K}}R_{\star}R_{w} and (x¯n;t(K),u¯n;t(K))2R𝒦𝒞RRw\|(\bar{x}_{n;t}(K),\bar{u}_{n;t}(K))\|\leq 2R_{\mathcal{K}}\mathcal{C}_{\star}R_{\star}R_{w}.

Proof.

Part a: Unfolding the dynamics, and bounding wtRw\|w_{t}\|\leq R_{w} and Φ\Phi via Assumption 4,

\displaystyle\|x^{K}_{t}\| =\left\|\sum_{s=1}^{t-1}\Phi_{s+1:t}(K)w_{s}\right\|\leq R_{w}\mathcal{C}_{\star}\sum_{s\geq 0}\rho_{\star}^{s}=\frac{\mathcal{C}_{\star}R_{w}}{1-\rho_{\star}}:=R_{\star}R_{w}.

Next, we bound xtn𝖺𝗅𝗀\|x^{\mathsf{alg}}_{t_{n}}\| for some nn,

xtn𝖺𝗅𝗀\displaystyle\|x^{\mathsf{alg}}_{t_{n}}\| =i=1HΦtn1+i:tn1(Ktn1)wtni+Φtn1:tn(Ktn1)xtn1𝖺𝗅𝗀\displaystyle=\left\|\sum_{i=1}^{H}\Phi_{t_{n-1}+i:t_{n}-1}(K_{t_{n-1}})w_{t_{n}-i}+\Phi_{t_{n-1}:t_{n}}(K_{t_{n-1}})x^{\mathsf{alg}}_{t_{n-1}}\right\|
\displaystyle\leq R_{\star}R_{w}+\mathcal{C}_{\star}\rho_{\star}^{H}\|x^{\mathsf{alg}}_{t_{n-1}}\|.

If HH is sufficiently large that 𝒞ρH1/2\mathcal{C}_{\star}\rho_{\star}^{H}\leq 1/2, then the above is just

\displaystyle\|x^{\mathsf{alg}}_{t_{n}}\|\leq R_{\star}R_{w}+\frac{1}{2}\|x^{\mathsf{alg}}_{t_{n-1}}\|,

yielding the bound xtn𝖺𝗅𝗀2RRw\|x^{\mathsf{alg}}_{t_{n}}\|\leq 2R_{\star}R_{w} for all nn. Part b: Next, let us bound x¯n;t(K)\|\bar{x}_{n;t}(K)\| for some ttnt\geq t_{n}. We have

\displaystyle\|\bar{x}_{n;t}(K)\| =\left\|\sum_{i=t_{n}+1}^{t}\Phi_{i:t}(K)w_{i-1}+\Phi_{t_{n}:t}(K)x^{\mathsf{alg}}_{t_{n}}\right\|
\displaystyle\leq R_{w}\left(\sum_{i=0}^{t-t_{n}-1}\mathcal{C}_{\star}\rho_{\star}^{i}\right)+\mathcal{C}_{\star}\rho_{\star}^{t-t_{n}}\|x^{\mathsf{alg}}_{t_{n}}\|
\displaystyle\leq R_{w}\sum_{i=0}^{t-t_{n}-1}\mathcal{C}_{\star}\rho_{\star}^{i}+2\rho_{\star}^{t-t_{n}}\mathcal{C}_{\star}R_{\star}R_{w}.

Using R_{\star}=\frac{\mathcal{C}_{\star}}{1-\rho_{\star}}, \mathcal{C}_{\star}\geq 1, and \sum_{i=0}^{t-t_{n}-1}\rho_{\star}^{i}+\frac{\rho_{\star}^{t-t_{n}}}{1-\rho_{\star}}=\frac{1}{1-\rho_{\star}}, the above is at most 2\frac{\mathcal{C}_{\star}^{2}R_{w}}{1-\rho_{\star}}=2\mathcal{C}_{\star}R_{w}R_{\star}.

Part c: This follows from the fact that, for any xdxx\in\mathbb{R}^{d_{x}} and K𝒦K\in\mathcal{K}, (x,Kx)(1+K)xR𝒦x\|(x,Kx)\|\leq(1+\|K\|)\|x\|\leq R_{\mathcal{K}}\|x\|. ∎

Bounding R1R_{1}

The term R1R_{1} corresponds to the simple regret on the sequence of losses n(K)\ell_{n}(K) over the discrete enumeration of controllers K𝒦εK\in\mathcal{K}_{\varepsilon}. Examining Algorithm 8, we simply run the Exp3 algorithm on these losses. By appealing to a standard regret bound for this algorithm with appropriate step size η\eta, we ensure that

R12BTH|𝒦ε|log|𝒦ε|,\displaystyle R_{1}\leq 2B\sqrt{\frac{T}{H}|\mathcal{K}_{\varepsilon}|\log|\mathcal{K}_{\varepsilon}|},

provided that, for all nn and K𝒦εK\in\mathcal{K}_{\varepsilon}, n(K)[0,B]\ell_{n}(K)\in[0,B]. To find the appropriate bound BB, we note that from the growth condition on the costs, Assumption 3, we have

0\displaystyle 0 n(K)=t=tntn+11ct(x¯n;tn(K),u¯n;tn(K))Hmaxttnct(x¯n;tn(K),u¯n;tn(K))\displaystyle\leq\ell_{n}(K)=\sum_{t=t_{n}}^{t_{n+1}-1}c_{t}(\bar{x}_{n;t_{n}}(K),\bar{u}_{n;t_{n}}(K))\leq H\max_{t\geq t_{n}}c_{t}(\bar{x}_{n;t_{n}}(K),\bar{u}_{n;t_{n}}(K))
LHmax{1,(x¯n;tn(K),u¯n;tn(K))2}8LH(R𝒦𝒞RRw)2,\displaystyle\leq LH\max\{1,\|(\bar{x}_{n;t_{n}}(K),\bar{u}_{n;t_{n}}(K))\|^{2}\}\leq 8LH(R_{\mathcal{K}}\mathcal{C}_{\star}R_{\star}R_{w})^{2},

where the last inequality uses Lemma E.1. Hence,

R116L(R𝒦𝒞RRw)2TH|𝒦ε|log|𝒦ε|.\displaystyle R_{1}\leq 16L(R_{\mathcal{K}}\mathcal{C}_{\star}R_{\star}R_{w})^{2}\sqrt{TH|\mathcal{K}_{\varepsilon}|\log|\mathcal{K}_{\varepsilon}|}.
Bounding R2R_{2}:

To bound R2R_{2}, it suffices to find a probability-one upper bound on

supK𝒦ε|n=1T/Hn(K)t=1Tct(xtK,utK)|\displaystyle\sup_{K\in\mathcal{K}_{\varepsilon}}\left|\sum_{n=1}^{T/H}\ell_{n}(K)-\sum_{t=1}^{T}c_{t}(x_{t}^{K},u_{t}^{K})\right| =supK𝒦ε|n=1T/Ht=tntn+11ct(x¯n;t(K),u¯n;t(K))ct(xtK,utK)|\displaystyle=\sup_{K\in\mathcal{K}_{\varepsilon}}\left|\sum_{n=1}^{T/H}\sum_{t=t_{n}}^{t_{n+1}-1}c_{t}(\bar{x}_{n;t}(K),\bar{u}_{n;t}(K))-c_{t}(x_{t}^{K},u_{t}^{K})\right|
THsupK𝒦εmaxnt=tntn+11|ct(x¯n;t(K),u¯n;t(K))ct(xtK,utK)|.\displaystyle\leq\frac{T}{H}\sup_{K\in\mathcal{K}_{\varepsilon}}\max_{n}\sum_{t=t_{n}}^{t_{n+1}-1}\left|c_{t}(\bar{x}_{n;t}(K),\bar{u}_{n;t}(K))-c_{t}(x_{t}^{K},u_{t}^{K})\right|.

Using the Lipschitz conditions on ctc_{t}, the bounds from Lemma E.1, and the bound 1+KR𝒦1+\|K\|\leq R_{\mathcal{K}},

t=tntn+11|ct(x¯n;t(K),u¯n;t(K))ct(xtK,utK)|\displaystyle\sum_{t=t_{n}}^{t_{n+1}-1}\left|c_{t}(\bar{x}_{n;t}(K),\bar{u}_{n;t}(K))-c_{t}(x_{t}^{K},u_{t}^{K})\right| L(R𝒦𝒞RRw)t=tntn+11(x¯n;t(K),u¯n;t(K))(xtK,utK)\displaystyle\leq L(R_{\mathcal{K}}\mathcal{C}_{\star}R_{\star}R_{w})\sum_{t=t_{n}}^{t_{n+1}-1}\|(\bar{x}_{n;t}(K),\bar{u}_{n;t}(K))-(x_{t}^{K},u_{t}^{K})\|
=L(R𝒦𝒞RRw)t=tntn+11(x¯n;t(K),Kx¯n;t(K))(xtK,KxtK)\displaystyle=L(R_{\mathcal{K}}\mathcal{C}_{\star}R_{\star}R_{w})\sum_{t=t_{n}}^{t_{n+1}-1}\|(\bar{x}_{n;t}(K),K\bar{x}_{n;t}(K))-(x_{t}^{K},Kx_{t}^{K})\|
=L(R𝒦𝒞RRw)R𝒦t=tntn+11x¯n;t(K)xtK.\displaystyle=L(R_{\mathcal{K}}\mathcal{C}_{\star}R_{\star}R_{w})R_{\mathcal{K}}\sum_{t=t_{n}}^{t_{n+1}-1}\|\bar{x}_{n;t}(K)-x_{t}^{K}\|.

Finally, we can compute that the difference \bar{x}_{n;t}(K)-x_{t}^{K}=\Phi_{t_{n}:t}(K)(x^{\mathsf{alg}}_{t_{n}}-x_{t_{n}}^{K}) depends only on the response to the state difference at time t_{n}. Hence, using Assumption 4 and Lemma E.1, the above is at most

\displaystyle L(R_{\mathcal{K}}\mathcal{C}_{\star}R_{\star}R_{w})R_{\mathcal{K}}\cdot\sum_{i\geq 0}\mathcal{C}_{\star}\rho_{\star}^{i}\|(x^{\mathsf{alg}}_{t_{n}}-x_{t_{n}}^{K})\| \leq L(R_{\mathcal{K}}\mathcal{C}_{\star}R_{\star}R_{w})R_{\mathcal{K}}\cdot\underbrace{\frac{\mathcal{C}_{\star}}{1-\rho_{\star}}}_{=R_{\star}}\cdot\mathcal{C}_{\star}R_{\star}R_{w}
L(R𝒦𝒞RRw)2R.\displaystyle\leq L(R_{\mathcal{K}}\mathcal{C}_{\star}R_{\star}R_{w})^{2}R_{\star}.

Concluding, we find

R2L(R𝒦𝒞RRw)2TRH.\displaystyle R_{2}\leq L(R_{\mathcal{K}}\mathcal{C}_{\star}R_{\star}R_{w})^{2}\cdot\frac{TR_{\star}}{H}.
Bounding R3R_{3}.

We now turn to bounding R_{3}, which captures the error incurred by approximating \mathcal{K} with \mathcal{K}_{\varepsilon}. We require the following technical lemma:

Lemma E.2.

Let K,K𝒦K,K^{\prime}\in\mathcal{K} satisfy KKopε\|K-K^{\prime}\|_{\mathrm{op}}\leq\varepsilon. Then, for all t1t\geq 1,

xtKxtK4εRwR2RB.\displaystyle\|x_{t}^{K}-x_{t}^{K^{\prime}}\|\leq 4\varepsilon R_{w}R_{\star}^{2}R_{B}.

Hence,

ct(xtK,utK)ct(xtK,utK)4εLRw2R𝒦2R3(1+RB).\displaystyle\|c_{t}(x_{t}^{K},u_{t}^{K})-c_{t}(x_{t}^{K^{\prime}},u_{t}^{K^{\prime}})\|\leq 4\varepsilon LR_{w}^{2}R_{\mathcal{K}}^{2}R_{\star}^{3}(1+R_{B}).

Since \mathcal{K}_{\varepsilon} is an \varepsilon-covering of \mathcal{K} in the operator norm, for any K\in\mathcal{K} we can find a K^{\prime}\in\mathcal{K}_{\varepsilon} for which \|K-K^{\prime}\|_{\mathrm{op}}\leq\varepsilon. Hence, from the above lemma,

|t=1Tct(xtK,utK)ct(xtK,utK)|4TεLRw2R𝒦2R3(1+RB).\displaystyle|\sum_{t=1}^{T}c_{t}(x_{t}^{K},u_{t}^{K})-c_{t}(x_{t}^{K^{\prime}},u_{t}^{K^{\prime}})|\leq 4T\varepsilon LR_{w}^{2}R_{\mathcal{K}}^{2}R_{\star}^{3}(1+R_{B}).

Since R_{3}\leq\sup_{K\in\mathcal{K}}\inf_{K^{\prime}\in\mathcal{K}_{\varepsilon}}|\sum_{t=1}^{T}c_{t}(x_{t}^{K},u_{t}^{K})-c_{t}(x_{t}^{K^{\prime}},u_{t}^{K^{\prime}})|, we conclude

R34TεLRw2R𝒦2R3(1+RB).\displaystyle R_{3}\leq 4T\varepsilon LR_{w}^{2}R_{\mathcal{K}}^{2}R_{\star}^{3}(1+R_{B}).
Concluding the proof

In sum, we found

𝔼[RegT(𝒦)]\displaystyle\mathbb{E}[\mathrm{Reg}_{T}(\mathcal{K})] =R1+R2+R3\displaystyle=R_{1}+R_{2}+R_{3}
𝒪(L(R𝒦𝒞RRw)2)(TH|𝒦ε|log|𝒦ε|+TRH+(1+RB)Tε),\displaystyle\leq\operatorname{\mathcal{O}}\left({L(R_{\mathcal{K}}\mathcal{C}_{\star}R_{\star}R_{w})^{2}}\right)\left(\sqrt{TH|\mathcal{K}_{\varepsilon}|\log|\mathcal{K}_{\varepsilon}|}+\frac{TR_{\star}}{H}+(1+R_{B})T\varepsilon\right),
𝒪(L(R𝒦𝒞RRw)2(1+RB)R)(TH|𝒦ε|log|𝒦ε|+TH+Tε),\displaystyle\leq\operatorname{\mathcal{O}}\left({L(R_{\mathcal{K}}\mathcal{C}_{\star}R_{\star}R_{w})^{2}(1+R_{B})R_{\star}}\right)\left(\sqrt{TH|\mathcal{K}_{\varepsilon}|\log|\mathcal{K}_{\varepsilon}|}+\frac{T}{H}+T\varepsilon\right),

where in the last line, we use R1R_{\star}\geq 1 and 1+RB11+R_{B}\geq 1. Setting H=T1/4H=T^{1/4},

𝔼[RegT(𝒦)]\displaystyle\mathbb{E}[\mathrm{Reg}_{T}(\mathcal{K})] 𝒪(L(R𝒦𝒞RRw)2(1+RB)R)T3/4(|𝒦ε|log|𝒦ε|+T1/4ε),\displaystyle\leq\operatorname{\mathcal{O}}\left({L(R_{\mathcal{K}}\mathcal{C}_{\star}R_{\star}R_{w})^{2}(1+R_{B})R_{\star}}\right)T^{3/4}\cdot\left(\sqrt{|\mathcal{K}_{\varepsilon}|\log|\mathcal{K}_{\varepsilon}|}+T^{1/4}\varepsilon\right),

We bound the cardinality of 𝒦ε\mathcal{K}_{\varepsilon}. It suffices to ensure 𝒦ε\mathcal{K}_{\varepsilon} is an ε\varepsilon-covering in the larger Frobenius norm, which is just the Euclidean norm on RdxduR^{d_{x}d_{u}}. Since 𝒦\mathcal{K} is a bounded subset of this space, with radius at most R𝒦R_{\mathcal{K}}, we can find a covering such that |𝒦ε|(5R𝒦ε)dxdu|\mathcal{K}_{\varepsilon}|\leq(\frac{5R_{\mathcal{K}}}{\varepsilon})^{d_{x}d_{u}} (see, e.g. Chapter 4.2 in [43]). This yields

\displaystyle|\mathcal{K}_{\varepsilon}|\log|\mathcal{K}_{\varepsilon}|\leq d_{x}d_{u}\left(\frac{5R_{\mathcal{K}}}{\varepsilon}\right)^{d_{x}d_{u}}\log\left(\frac{5R_{\mathcal{K}}}{\varepsilon}\right)\leq d_{x}d_{u}\left(\frac{5R_{\mathcal{K}}}{\varepsilon}\right)^{d_{x}d_{u}+1}=d_{x}d_{u}(5R_{\mathcal{K}})^{d_{x}d_{u}+1}\varepsilon^{-(d_{x}d_{u}+1)}

where we use logxx\log x\leq x. Hence, we can bound

𝔼[RegT(𝒦)]𝒪(LR𝒦R(R𝒦𝒞RRw)2(1+RB)dxdu)(5R𝒦)dxdu/2T3/4((1ε)(1+dxdu)/2+εT1/4)\displaystyle\mathbb{E}[\mathrm{Reg}_{T}(\mathcal{K})]\leq\operatorname{\mathcal{O}}\left({LR_{\mathcal{K}}R_{\star}(R_{\mathcal{K}}\mathcal{C}_{\star}R_{\star}R_{w})^{2}(1+R_{B})\sqrt{d_{x}d_{u}}}\right)(5R_{\mathcal{K}})^{d_{x}d_{u}/2}\cdot T^{3/4}\left(\left(\frac{1}{\varepsilon}\right)^{(1+d_{x}d_{u})/2}+\varepsilon T^{1/4}\right)

Setting ε=T12(dxdu+3)\varepsilon=T^{-\frac{1}{2(d_{x}d_{u}+3)}} gives

𝔼[RegT(𝒦)]\displaystyle\mathbb{E}[\mathrm{Reg}_{T}(\mathcal{K})] 𝒪(LR(R𝒦𝒞RRw)2(1+RB)dxdu)(5R𝒦)dxdu+1T112(dxdu+3)\displaystyle\leq\operatorname{\mathcal{O}}\left({LR_{\star}(R_{\mathcal{K}}\mathcal{C}_{\star}R_{\star}R_{w})^{2}(1+R_{B})\sqrt{d_{x}d_{u}}}\right)(5R_{\mathcal{K}})^{d_{x}d_{u}+1}\cdot T^{1-\frac{1}{2(d_{x}d_{u}+3)}}
=𝒪(L𝒞5(1ρ)3R𝒦3Rw2(1+RB)dxdu)(5R𝒦)dxdu/2T112(dxdu+3),\displaystyle=\operatorname{\mathcal{O}}\left({L\frac{\mathcal{C}_{\star}^{5}}{(1-\rho_{\star})^{3}}R_{\mathcal{K}}^{3}R_{w}^{2}(1+R_{B})\sqrt{d_{x}d_{u}}}\right)(5R_{\mathcal{K}})^{d_{x}d_{u}/2}\cdot T^{1-\frac{1}{2(d_{x}d_{u}+3)}},

E.2 Omitted Proofs

Proof of Lemma E.1.

Part a: All such iterates can be realized by dynamics of the form x1=0x_{1}=0, xt+1=(At+BtKt)xt+wtx_{t+1}=(A_{t}+B_{t}K_{t})x_{t}+w_{t} and ut=Ktxtu_{t}=K_{t}x_{t} for any appropriate sequence (K1,K2,)(K_{1},K_{2},\dots) of elements of 𝒦\mathcal{K}. For such dynamics, we find

\displaystyle x_{t}=\sum_{s=1}^{t-1}\left(\prod_{i=s+1}^{t}(A_{i}+B_{i}K_{i})\right)w_{s}=\sum_{s=1}^{t-1}\Phi_{s+1:t}(K_{s+1:t})w_{s}.

Using \|w_{s}\|\leq R_{w} and the bound \|\Phi_{s+1:t}(K_{s+1:t})\|_{\mathrm{op}}\leq c_{\star}\rho_{\star}^{t-s-1} from Assumption 4, we find

xt\displaystyle\|x_{t}\| Rwcs=1t1ρts1cRw1ρ=RRw\displaystyle\leq R_{w}c_{\star}\sum_{s=1}^{t-1}\rho_{\star}^{t-s-1}\leq\frac{c_{\star}R_{w}}{1-\rho_{\star}}=R_{\star}R_{w}
ut+xt\displaystyle\|u_{t}\|+\|x_{t}\| =xt+Ktxtxt(1+Kt)RR𝒦Rw\displaystyle=\|x_{t}\|+\|K_{t}x_{t}\|\leq\|x_{t}\|(1+\|K_{t}\|)\leq R_{\star}R_{\mathcal{K}}R_{w}

Part b: Since the closed-loop dynamics for \bar{\mathbf{x}}_{k;t}^{K} and x_{t}^{K} coincide for t\geq t_{k} and are given by x_{t+1}=(A_{t}+B_{t}K)x_{t}+w_{t}, we can compute

\displaystyle\bar{\mathbf{x}}_{k;t}^{K}-x_{t}^{K}=\left(\prod_{i=t_{k}}^{t-1}(A_{i}+B_{i}K)\right)(\bar{\mathbf{x}}_{k;t_{k}}^{K}-x_{t_{k}}^{K}).

Bounding \|\bar{\mathbf{x}}_{k;t_{k}}^{K}-x_{t_{k}}^{K}\|\leq 2R_{w}R_{\star} from part (a) and \|\prod_{i=t_{k}}^{t-1}(A_{i}+B_{i}K)\|_{\mathrm{op}}\leq c_{\star}\rho_{\star}^{t-t_{k}} from Assumption 4 yields \|\bar{\mathbf{x}}_{k;t}^{K}-x_{t}^{K}\|\leq 2R_{w}R_{\star}c_{\star}\rho_{\star}^{t-t_{k}}. Summing over t\geq t_{k} yields \sum_{t\geq t_{k}}\|\bar{\mathbf{x}}_{k;t}^{K}-x_{t}^{K}\|\leq 2R_{w}R_{\star}^{2}. Finally, using \bar{u}_{k;t}^{K}=K\bar{\mathbf{x}}_{k;t}^{K} and u_{t}^{K}=Kx_{t}^{K} gives

\displaystyle\sum_{t\geq t_{k}}\|\bar{\mathbf{x}}_{k;t}^{K}-x_{t}^{K}\|+\|\bar{u}_{k;t}^{K}-u_{t}^{K}\|\leq(1+\|K\|)\sum_{t\geq t_{k}}\|\bar{\mathbf{x}}_{k;t}^{K}-x_{t}^{K}\|\leq 2R_{w}R_{\mathcal{K}}R_{\star}^{2}.

Proof of Lemma E.2.

Introducing the shorthand X_{i}=A_{i}+B_{i}K and Y_{i}=A_{i}+B_{i}K^{\prime} and expanding the dynamics, we find

xtKxtK\displaystyle\|x_{t}^{K}-x_{t}^{K^{\prime}}\| =s=1t1(i=s+1t(Ai+BiK=Xi)i=s+1t(Ai+BiK=Yi))ws\displaystyle=\left\|\sum_{s=1}^{t-1}\left(\prod_{i=s+1}^{t}(\underbrace{A_{i}+B_{i}K}_{=X_{i}})-\prod_{i=s+1}^{t}(\underbrace{A_{i}+B_{i}K^{\prime}}_{=Y_{i}})\right)w_{s}\right\|
Rws=1t1i=s+1tXii=s+1tYiop\displaystyle\leq R_{w}\sum_{s=1}^{t-1}\left\|\prod_{i=s+1}^{t}X_{i}-\prod_{i=s+1}^{t}Y_{i}\right\|_{\mathrm{op}}

Using an elementary matrix telescoping identity,

\displaystyle\prod_{i=s+1}^{t}X_{i}-\prod_{i=s+1}^{t}Y_{i}=\sum_{j=s+1}^{t}\left(\prod_{i=j+1}^{t}X_{i}\right)(X_{j}-Y_{j})\prod_{i=s+1}^{j-1}Y_{i}.

Thus, invoking the stability assumption (Assumption 4) and using R_{B}\geq\max_{t}\|B_{t}\|_{\mathrm{op}},

\displaystyle\left\|\prod_{i=s+1}^{t}X_{i}-\prod_{i=s+1}^{t}Y_{i}\right\|_{\mathrm{op}} \leq\sum_{j=s+1}^{t}\left\|\prod_{i=j+1}^{t}X_{i}\right\|_{\mathrm{op}}\left\|\prod_{i=s+1}^{j-1}Y_{i}\right\|_{\mathrm{op}}\|X_{j}-Y_{j}\|_{\mathrm{op}}
=j=s+1tcρtj+1cρj1(s+1)Bj(KK)op\displaystyle=\sum_{j=s+1}^{t}c_{\star}\rho_{\star}^{t-j+1}c_{\star}\rho_{\star}^{j-1-(s+1)}\|B_{j}(K-K^{\prime})\|_{\mathrm{op}}
=c2ρ2j=s+1tρt(s+1)Bj(KK)op\displaystyle=\frac{c_{\star}^{2}}{\rho_{\star}^{2}}\sum_{j=s+1}^{t}\rho_{\star}^{t-(s+1)}\|B_{j}(K-K^{\prime})\|_{\mathrm{op}}
εRBc2ρ2(ts+1)ρt(s+1)\displaystyle\leq\varepsilon R_{B}\frac{c_{\star}^{2}}{\rho_{\star}^{2}}(t-s+1)\rho_{\star}^{t-(s+1)}

Thus, we find

xtKxtK\displaystyle\|x_{t}^{K}-x_{t}^{K^{\prime}}\| εRwRBc2ρ2s=1t1(ts+1)ρt(s+1)\displaystyle\leq\varepsilon R_{w}R_{B}\frac{c_{\star}^{2}}{\rho_{\star}^{2}}\sum_{s=1}^{t-1}(t-s+1)\rho_{\star}^{t-(s+1)}
εRwRBc2ρ2(1ρ)24εRwRBc2(1ρ)2=4εRwR2RB,\displaystyle\leq\varepsilon R_{w}R_{B}\frac{c_{\star}^{2}}{\rho_{\star}^{2}(1-\rho_{\star})^{2}}\leq 4\varepsilon R_{w}R_{B}\frac{c_{\star}^{2}}{(1-\rho_{\star})^{2}}=4\varepsilon R_{w}R_{\star}^{2}R_{B},

where in the last step, we use ρ1/2\rho_{\star}\geq 1/2. Thus, applying Assumption 3 and Lemma E.1,

ct(xtK,utK)ct(xtK,utK)\displaystyle\|c_{t}(x_{t}^{K},u_{t}^{K})-c_{t}(x_{t}^{K^{\prime}},u_{t}^{K^{\prime}})\| Lmax{1,(xtK,utK),(xtK,utK)}(xtK,utK)(xtK,utK)\displaystyle\leq L\max\{1,\|(x_{t}^{K},u_{t}^{K})\|,\|(x_{t}^{K^{\prime}},u_{t}^{K^{\prime}})\|\}\cdot\|(x_{t}^{K},u_{t}^{K})-(x_{t}^{K^{\prime}},u_{t}^{K^{\prime}})\|
LRwR𝒦R(xtK,utK)(xtK,utK).\displaystyle\leq LR_{w}R_{\mathcal{K}}R_{\star}\cdot\|(x_{t}^{K},u_{t}^{K})-(x_{t}^{K^{\prime}},u_{t}^{K^{\prime}})\|.

Continuing, we bound

(xtK,utK)(xtK,utK)\displaystyle\|(x_{t}^{K},u_{t}^{K})-(x_{t}^{K^{\prime}},u_{t}^{K^{\prime}})\| =(xtK,KxtK)(xtK,KxtK)\displaystyle=\|(x_{t}^{K},Kx_{t}^{K})-(x_{t}^{K^{\prime}},K^{\prime}x_{t}^{K^{\prime}})\|
(xtK,KxtK)(xtK,KxtK)+(KK)xtK\displaystyle\leq\|(x_{t}^{K},Kx_{t}^{K})-(x_{t}^{K^{\prime}},Kx_{t}^{K^{\prime}})\|+\|(K-K^{\prime})x_{t}^{K^{\prime}}\|
\displaystyle\leq R_{\mathcal{K}}\|x_{t}^{K}-x_{t}^{K^{\prime}}\|+\|(K-K^{\prime})x_{t}^{K^{\prime}}\|.

Finally, we use the bound \|x_{t}^{K}-x_{t}^{K^{\prime}}\|\leq 4\varepsilon R_{w}R_{\star}^{2}R_{B} derived above, and bound \|(K-K^{\prime})x_{t}^{K^{\prime}}\|\leq\varepsilon\|x_{t}^{K^{\prime}}\|\leq\varepsilon R_{w}R_{\star} in view of Lemma E.1. Hence,

ct(xtK,utK)ct(xtK,utK)\displaystyle\|c_{t}(x_{t}^{K},u_{t}^{K})-c_{t}(x_{t}^{K^{\prime}},u_{t}^{K^{\prime}})\| LRwR𝒦R(4εR𝒦RwR2RB+εRwR)\displaystyle\leq LR_{w}R_{\mathcal{K}}R_{\star}(4\varepsilon R_{\mathcal{K}}R_{w}R_{\star}^{2}R_{B}+\varepsilon R_{w}R_{\star})
4εLRw2R𝒦2R3(RB+1),\displaystyle\leq 4\varepsilon LR_{w}^{2}R_{\mathcal{K}}^{2}R_{\star}^{3}(R_{B}+1),

where above we use that R𝒦,R1R_{\mathcal{K}},R_{\star}\geq 1 by assumption. ∎

Appendix F Lower Bounds and Separations

F.1 Separation between policy classes

Let \mathcal{Z}=(c_{t},w_{t},A_{t},B_{t})_{t\geq 1} denote a sequence of costs, disturbances, and dynamics. We let J_{T}(\pi;\mathcal{Z}) denote the cost of policy \pi on the sequence \mathcal{Z}. Our lower bounds hold even against sequences which enjoy the following regularity condition.

Definition F.1.

We say that 𝒵\mathcal{Z} is regular if, for all tt, ct(,)c_{t}(\cdot,\cdot) satisfies Assumption 3 with L1L\leq 1, and that for all tt, wt1\|w_{t}\|\leq 1, Btop1\|B_{t}\|_{\mathrm{op}}\leq 1 and Atop1/2\|A_{t}\|_{\mathrm{op}}\leq 1/2.

We define the policy classes

Πdrc+(h)\displaystyle\Pi_{\mathrm{drc}}^{+}(h) :={π:utπ=u0+i=0h1M[i]xtinat,t}\displaystyle:=\left\{\pi:u^{\pi}_{t}=u_{0}+\sum_{i=0}^{h-1}M^{[i]}x^{\mathrm{nat}}_{t-i},~{}\forall t\right\}
Πdac+(h)\displaystyle\Pi_{\mathrm{dac}}^{+}(h) :={π:utπ=u0+i=0h1M[i]wti1,t}\displaystyle:=\left\{\pi:u^{\pi}_{t}=u_{0}+\sum_{i=0}^{h-1}M^{[i]}w_{t-i-1},~{}\forall t\right\}
Πfeed+(h)\displaystyle\Pi_{\mathrm{feed}}^{+}(h) :={π:utπ=u0+i=0h1M[i]xtiπ,t}\displaystyle:=\left\{\pi:u^{\pi}_{t}=u_{0}+\sum_{i=0}^{h-1}M^{[i]}x^{\pi}_{t-i},~{}\forall t\right\}
Πfeed(h,R)\displaystyle\Pi_{\mathrm{feed}}(h,R) :={π:utπ=u0+i=0h1M[i]xtiπ,t,i=0h1M[i]opR.}.\displaystyle:=\left\{\pi:u^{\pi}_{t}=u_{0}+\sum_{i=0}^{h-1}M^{[i]}x^{\pi}_{t-i},~{}\forall t,\quad\sum_{i=0}^{h-1}\|M^{[i]}\|_{\mathrm{op}}\leq R.\right\}.

That is, \Pi_{\mathrm{drc}}^{+}(h) consists of all length-h Drc policies of unbounded norm with affine offsets, \Pi_{\mathrm{dac}}^{+}(h) of all length-h Dac policies of unbounded norm with affine offsets, \Pi_{\mathrm{feed}}^{+}(h) of all static feedback policies of unbounded norm with affine offsets, and \Pi_{\mathrm{feed}}(h,R) of feedback policies of bounded norm and horizon; a concrete sketch of the three parametrizations is given below.
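The following minimal Python sketch (illustrative only) computes the input u_{t} prescribed by each parametrization from the same parameters (u_{0},M^{[0]},\dots,M^{[h-1]}): Drc acts on the nature's-x sequence, Dac on past disturbances, and feedback policies on their own realized states.

```python
import numpy as np

def u_drc(u0, M, x_nat_hist):
    """Drc: u_t = u0 + sum_i M[i] x^nat_{t-i}; x_nat_hist = [x^nat_t, x^nat_{t-1}, ...]."""
    return u0 + sum(M[i] @ x_nat_hist[i] for i in range(len(M)))

def u_dac(u0, M, w_hist):
    """Dac: u_t = u0 + sum_i M[i] w_{t-i-1}; w_hist = [w_{t-1}, w_{t-2}, ...]."""
    return u0 + sum(M[i] @ w_hist[i] for i in range(len(M)))

def u_feedback(u0, M, x_pi_hist):
    """Feedback: u_t = u0 + sum_i M[i] x^pi_{t-i}; x_pi_hist = [x^pi_t, x^pi_{t-1}, ...]."""
    return u0 + sum(M[i] @ x_pi_hist[i] for i in range(len(M)))
```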

The following theorem demonstrates that the Dac, Drc, and feedback parametrizations are fundamentally incommensurate.

Theorem F.1.

Let \mathcal{C}_{0}>0 denote a universal constant. There exist three regular sequences \mathcal{Z}_{1},\mathcal{Z}_{2},\mathcal{Z}_{3} in d_{u}=d_{x}=1 which separate Dac, Drc, and feedback controllers, in the following sense:

  • (a)

    Under 𝒵1\mathcal{Z}_{1}, the static feedback policy π\pi selecting utπ=14xtπu^{\pi}_{t}=\frac{1}{4}x^{\pi}_{t} satisfies JT(π;𝒵1)=0J_{T}(\pi;\mathcal{Z}_{1})=0, but

    infπΠdrc+(h)Πdac+(h)JT(π;𝒵1)𝒞0(Th2)\displaystyle\inf_{\pi\in\Pi_{\mathrm{drc}}^{+}(h)\cup\Pi_{\mathrm{dac}}^{+}(h)}J_{T}(\pi;\mathcal{Z}_{1})\geq\mathcal{C}_{0}(T-h-2)
  • (b)

    Under 𝒵2\mathcal{Z}_{2}, the Dac policy π\pi selecting utπ=wtu^{\pi}_{t}=w_{t} satisfies JT(π;𝒵2)=0J_{T}(\pi;\mathcal{Z}_{2})=0, but

    infπΠdrc+(h)Πfeed+(h)JT(π;𝒵2)𝒞0(Th2)\displaystyle\inf_{\pi\in\Pi_{\mathrm{drc}}^{+}(h)\cup\Pi_{\mathrm{feed}}^{+}(h)}J_{T}(\pi;\mathcal{Z}_{2})\geq\mathcal{C}_{0}(T-h-2)
  • (c)

    Under \mathcal{Z}_{3}, the Drc policy \pi selecting u^{\pi}_{t}=x^{\mathrm{nat}}_{t} satisfies J_{T}(\pi;\mathcal{Z}_{3})=0, but

    infπΠdac+(h)JT(π;𝒵3)𝒞0(Th3)\displaystyle\inf_{\pi\in\Pi_{\mathrm{dac}}^{+}(h)}J_{T}(\pi;\mathcal{Z}_{3})\geq\mathcal{C}_{0}(T-h-3)

    Moreover, for any hNh\in N, R>0R>0 and T10hT\geq 10h, we have

    infπΠfeed(h,R)JT(π;𝒵3)𝒞0Thmax{R,1}.\displaystyle\inf_{\pi\in\Pi_{\mathrm{feed}}(h,R)}J_{T}(\pi;\mathcal{Z}_{3})\geq\mathcal{C}_{0}\frac{T}{h\max\{R,1\}}.
Proof.

We establish the separations for each part with different constant factors. One can choose 𝒞0\mathcal{C}_{0} to be the minimum of all constants which arise.

Proof of part a.

We set 𝒵1\mathcal{Z}_{1} to be the sequence with At=0A_{t}=0 for all tt, wt=1w_{t}=1 for all tt, ct(x,u)=18(u14x)2c_{t}(x,u)=\frac{1}{8}(u-\frac{1}{4}x)^{2}, and

Bt={1t odd 1t even.\displaystyle B_{t}=\begin{cases}1&t\text{ odd }\\ -1&t\text{ even}.\end{cases}

This sequence is clearly regular, and it is clear that the policy u_{t}^{\pi}=\frac{1}{4}x_{t}^{\pi} has J_{T}(\pi;\mathcal{Z}_{1})=0. On the other hand, let \pi\in\Pi_{\mathrm{drc}}^{+}(h)\cup\Pi_{\mathrm{dac}}^{+}(h) and consider J_{T}(\pi;\mathcal{Z}_{1}). Since A_{t}\equiv 0, x^{\mathrm{nat}}_{t}=w_{t}, so \Pi_{\mathrm{drc}}^{+}(h)=\Pi_{\mathrm{dac}}^{+}(h). Moreover, since w_{t}=1 for all t\geq 1, any \pi\in\Pi_{\mathrm{dac}}^{+}(h) has u^{\pi}_{t}=\bar{u} for some fixed \bar{u} for all t>h. Then, for all t>h+1, x_{t}=1+B_{t-1}\bar{u}. Thus, c_{t}(x^{\pi}_{t},u^{\pi}_{t})=\frac{1}{8}(\bar{u}-\frac{1+B_{t-1}\bar{u}}{4})^{2}. Using the definition of B_{t},

\displaystyle c_{t}(x^{\pi}_{t},u^{\pi}_{t})+c_{t+1}(x^{\pi}_{t+1},u^{\pi}_{t+1})=\frac{1}{8}\left(\bar{u}-\frac{1-\bar{u}}{4}\right)^{2}+\frac{1}{8}\left(\bar{u}-\frac{1+\bar{u}}{4}\right)^{2}=\frac{1}{128}\left((5\bar{u}-1)^{2}+(3\bar{u}-1)^{2}\right)=\Omega(1).

The bound follows.
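The separation in part (a) can also be checked numerically: under \mathcal{Z}_{1} the feedback law u_{t}=x_{t}/4 incurs zero cost at every step, whereas any fixed input \bar{u} pays a constant cost on at least one of every two consecutive steps. The short simulation below is purely illustrative and is not part of the formal argument.

```python
def simulate_Z1(policy, T):
    """Simulate Z_1: A_t = 0, w_t = 1, B_t = +1 for odd t and -1 for even t,
    with cost c_t(x, u) = (u - x/4)^2 / 8.  `policy(t, x)` returns the scalar input u_t."""
    x, total = 0.0, 0.0
    for t in range(1, T + 1):
        B_t = 1.0 if t % 2 == 1 else -1.0
        u = policy(t, x)
        total += (u - x / 4.0) ** 2 / 8.0
        x = B_t * u + 1.0        # x_{t+1} = A_t x_t + B_t u_t + w_t with A_t = 0, w_t = 1
    return total

print(simulate_Z1(lambda t, x: x / 4.0, 1000))   # feedback u_t = x_t/4: total cost 0
print(simulate_Z1(lambda t, x: 0.2, 1000))       # any constant input: cost grows linearly in T
```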

Proof of part b.

Set c_{t}(x,u)=(u-w_{t-1})^{2}. Then the Dac policy u^{\pi}_{t}=w_{t-1} has zero cost. Further, set B_{t}\equiv 0; thus x_{t}\equiv x^{\mathrm{nat}}_{t}, so \Pi_{\mathrm{feed}}^{+}(h) and \Pi_{\mathrm{drc}}^{+}(h) are equivalent on this system. Finally, let n=2m+1, set w_{1}=1, and for t\geq 2, set

(At,wt)={(12,12)t is even (14,34)t is odd .\displaystyle(A_{t},w_{t})=\begin{cases}(\frac{1}{2},\frac{1}{2})&t\text{ is even }\\ (\frac{1}{4},\frac{3}{4})&t\text{ is odd }\end{cases}.

Then, one can verify via induction that x_{t}=x^{\mathrm{nat}}_{t}=1 for all t\geq 2. Hence, for all t\geq h+1, any \pi\in\Pi_{\mathrm{feed}}^{+}(h)\cup\Pi_{\mathrm{drc}}^{+}(h) plays a constant input u_{t}^{\pi}=\bar{u}. However, c_{t}(x^{\pi}_{t},u^{\pi}_{t})+c_{t+1}(x^{\pi}_{t+1},u^{\pi}_{t+1})=(\bar{u}-\frac{1}{2})^{2}+(\bar{u}-\frac{3}{4})^{2}, which is greater than a universal constant. Hence, the regret must scale as \Omega(T-(h+2)).

Proof of part c.

Fix the policy \pi_{\star} which selects u^{\pi_{\star}}_{t}=x^{\mathrm{nat}}_{t}. Denote the sequences that arise from this policy as (x^{\star}_{t},u^{\star}_{t}). We set

\displaystyle c_{t}(x,u)=\frac{1}{4}\left((u-u^{\star}_{t})^{2}+|x-x^{\star}_{t}|\right)

By construction π\pi_{\star} has zero cost on ctc_{t}. Now, set wt=1w_{t}=1 for all tt, and

\displaystyle A_{t}=\begin{cases}\frac{1}{4}&t\bmod 3=1\\ \frac{1}{4}&t\bmod 3=2\\ 0&t\bmod 3=0,\end{cases}\quad B_{t}=\begin{cases}-\frac{1}{4}&t\bmod 3=1\\ -\frac{1}{20}&t\bmod 3=2\\ 0&t\bmod 3=0.\end{cases}
u3k+1\displaystyle u^{\star}_{3k+1} =x3k+1nat=1\displaystyle=x^{\mathrm{nat}}_{3k+1}=1
u3k+2\displaystyle u^{\star}_{3k+2} =x3k+2nat=54\displaystyle=x^{\mathrm{nat}}_{3k+2}=\frac{5}{4}
u3k+3\displaystyle u^{\star}_{3k+3} =x3k+3nat=2116.\displaystyle=x^{\mathrm{nat}}_{3k+3}=\frac{21}{16}.

Hence, a similar argument as in part (b), using the fact that w_{t} is constant but u^{\star}_{t} is periodic, shows that any \pi\in\Pi_{\mathrm{dac}}^{+}(h) suffers cost \Omega(T-h-3).

Let us now analyze the performance of policies \pi\in\Pi_{\mathrm{feed}}(h,R). First, observe that x^{\star}_{t}=1 for all t\geq 2.

x3k+1\displaystyle x^{\star}_{3k+1} =1\displaystyle=1
x3k+2\displaystyle x^{\star}_{3k+2} =5414u3k+1=1.\displaystyle=\frac{5}{4}-\frac{1}{4}u^{\star}_{3k+1}=1.
\displaystyle x^{\star}_{3k+3} =\frac{21}{16}-\frac{1}{4}u^{\star}_{3k+1}-\frac{1}{20}u^{\star}_{3k+2}
=21161412054=1.\displaystyle=\frac{21}{16}-\frac{1}{4}-\frac{1}{20}\cdot\frac{5}{4}=1.

Next, observe that for πΠfeed(h,R)\pi\in\Pi_{\mathrm{feed}}(h,R), and th+1t\geq h+1,

\displaystyle u^{\pi}_{t}=c+\sum_{i=0}^{h-1}M^{[i]}x^{\pi}_{t-i}=c+\sum_{i=0}^{h-1}M^{[i]}x^{\star}_{t-i}+\sum_{i=0}^{h-1}M^{[i]}(x^{\pi}_{t-i}-x^{\star}_{t-i}) =\underbrace{\left(c+\sum_{i=0}^{h-1}M^{[i]}\right)}_{:=\bar{u}}+\sum_{i=0}^{h-1}M^{[i]}(x^{\pi}_{t-i}-x^{\star}_{t-i})

Defining ϵt=utπu¯\epsilon_{t}=u^{\pi}_{t}-\bar{u}, we have

\displaystyle|\epsilon_{t}|\leq\left|\sum_{i=0}^{h-1}M^{[i]}(x^{\pi}_{t-i}-x^{\star}_{t-i})\right|\leq R\max_{i=0}^{h-1}|x^{\pi}_{t-i}-x^{\star}_{t-i}|.

Hence, for integers kk,

maxi[3]|ϵ3k+i|Rmaxt=3kh+23k+3|xtπxt|.\displaystyle\max_{i\in[3]}|\epsilon_{3k+i}|\leq R\max_{t=3k-h+2}^{3k+3}|x^{\pi}_{t}-x^{\star}_{t}|. (F.1)

We now argue a dichotomy on the size of \max_{i\in[3]}|\epsilon_{3k+i}|. First, we show that if the epsilons are large, the costs incurred on a past window of h must be as well. This is because Eq. F.1 would then necessitate that x^{\pi}_{t} differs from x^{\star}_{t} over the previous window.

Claim F.1.

Suppose maxi[3]|ϵ3k+i|132\max_{i\in[3]}|\epsilon_{3k+i}|\geq\frac{1}{32}. Then, t=3kh+23k+3ct(xtπ,utπ)127R\sum_{t=3k-h+2}^{3k+3}c_{t}(x^{\pi}_{t},u^{\pi}_{t})\geq\frac{1}{2^{7}R}.

Proof.

By Eq. F.1, we have that if \max_{i\in[3]}|\epsilon_{3k+i}|\geq\frac{1}{32}, then \max_{t=3k-h+2}^{3k+3}|x^{\pi}_{t}-x^{\star}_{t}|\geq\frac{1}{32R}. Since c_{t}(x^{\pi}_{t},u^{\pi}_{t})\geq\frac{1}{4}|x^{\pi}_{t}-x^{\star}_{t}|, the bound follows by lower bounding the sum by the maximum. ∎

On the other hand, we show that if the \epsilon-terms are small, then the costs on t\in\{3k+1,3k+2,3k+3\} are at least a small constant. This is because the inputs selected by \pi, namely \bar{u}+\epsilon_{t}, are close to constant, and therefore cannot track the periodic values of u^{\pi_{\star}}_{t}.

Claim F.2.

Suppose \max_{i\in[3]}|\epsilon_{3k+i}|\leq\frac{1}{32}. Then, \sum_{i=1}^{3}c_{3k+i}(x^{\pi}_{3k+i},u^{\pi}_{3k+i})\geq 2^{-12}.

Proof.

We expand

\displaystyle\sum_{i=1}^{3}c_{3k+i}(x^{\pi}_{3k+i},u^{\pi}_{3k+i}) \geq\sum_{i=1}^{3}\frac{1}{4}\left(u^{\star}_{3k+i}-u^{\pi}_{3k+i}\right)^{2}
=\sum_{i=1}^{3}\frac{1}{4}\left(u^{\star}_{3k+i}-\bar{u}-\epsilon_{3k+i}\right)^{2}
=\frac{1}{4}\left(\left(1-\bar{u}-\epsilon_{3k+1}\right)^{2}+\left(1+\frac{1}{4}-\bar{u}-\epsilon_{3k+2}\right)^{2}+\left(1+\frac{1}{4}+\frac{1}{16}-\bar{u}-\epsilon_{3k+3}\right)^{2}\right).

In particular, suppose \max_{i\in[3]}|\epsilon_{3k+i}|\leq\frac{1}{32}. Then, unless |\bar{u}-1|\leq\frac{1}{16}, the above is at least \frac{1}{4}\left(\frac{1}{32}\right)^{2}=2^{-12}. On the other hand, if |\bar{u}-1|\leq\frac{1}{16}, then \frac{1}{4}\left(1+\frac{1}{4}-\bar{u}-\epsilon_{3k+2}\right)^{2}\geq\frac{1}{4}(\frac{1}{4}-\frac{1}{16}-\frac{1}{32})^{2}\geq\frac{1}{4}(\frac{1}{8})^{2}\geq 2^{-12}. ∎

Combining both cases, we find that, for all kk such that 3kh+213k-h+2\geq 1,

t=3kh+23k+3ct(xtπ,utπ)217max{R,1}.\displaystyle\sum_{t=3k-h+2}^{3k+3}c_{t}(x^{\pi}_{t},u^{\pi}_{t})\geq\frac{2^{-17}}{\max\{R,1\}}.

In particular, we find that J_{T}(\pi;\mathcal{Z}_{3})\geq\Omega(\frac{T}{h\max\{1,R\}}) provided (say) T\geq 10h. ∎

F.2 Linear Regret Against Drc and Dac

In this section, we demonstrate linear regret against Drc and Dac. We consider a distribution over instances of the following form

At=0,Bt=[1000βt0001],wt=[ωt1ωt1],t0.\displaystyle A_{t}=0,\quad B_{t}=\begin{bmatrix}1&0&0\\ 0&\beta_{t}&0\\ 0&0&1\end{bmatrix},\quad w_{t}=-\begin{bmatrix}\omega_{t-1}\\ \omega_{t}\\ 1\end{bmatrix},t\geq 0.\quad (F.2)

Define the constant \alpha=1/24. For a given \sigma\in(0,1/8), let \mathcal{D}_{\sigma} denote the distribution over (A_{t},B_{t},w_{t}) induced by drawing

\displaystyle\beta_{t}\overset{\mathrm{i.i.d}}{\sim}\mathrm{Unif}[1-\sigma,1+\sigma],\quad\omega_{t}\overset{\mathrm{i.i.d}}{\sim}\mathrm{Unif}\{1-\alpha\sigma,1+\alpha\sigma\}.

Note that these instances are (a) controllable, (b) stable, (c) have system variability scaling like T\sigma^{2}, and (d) have coinciding Drc and Dac parametrizations. Letting v[i] denote the i-th coordinate of a vector v, we consider costs of the form

cf(x,u)=x[2]2+u[2]2+f(x[1]).\displaystyle c_{f}(x,u)=x[2]^{2}+u[2]^{2}+f(x[1]). (F.3)

for either f(z)=|z|f(z)=|z| or f(z)=z2f(z)=z^{2}. Note that both choices of ff ensure that cfc_{f} satisfies Assumption 3, and the latter choice ensures that cf(x,u)c_{f}(x,u) is second order smooth.
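To make the construction concrete, here is a minimal Python sketch (not part of the proof; the horizon, the placeholder policy, and the random seed are illustrative assumptions) that samples an instance from \mathcal{D}_{\sigma} and rolls out the dynamics (F.2) under a given input sequence, accumulating the cost (F.3). It also makes the mechanism behind the lower bound visible: the first coordinate asks the learner to predict \omega_{t-1}, which is observable only through the second coordinate, where it is masked by the random gain \beta_{t-1}.

import numpy as np

def rollout(T, sigma, alpha, policy, f=abs, seed=0):
    # Rolls out the instance (F.2) drawn from D_sigma under `policy`, where
    # policy(t, history) returns u_t in R^3 and history is the list of past states.
    # Returns the total cost sum_t c_f(x_t, u_t) with c_f as in (F.3).
    rng = np.random.default_rng(seed)
    beta = rng.uniform(1 - sigma, 1 + sigma, size=T + 1)                    # beta_t
    omega = rng.choice([1 - alpha * sigma, 1 + alpha * sigma], size=T + 1)  # omega_t
    x, total, history = np.zeros(3), 0.0, []
    for t in range(1, T + 1):
        history.append(x)
        u = policy(t, history)
        total += x[1] ** 2 + u[1] ** 2 + f(x[0])      # c_f(x_t, u_t)
        B = np.diag([1.0, beta[t], 1.0])
        w = -np.array([omega[t - 1], omega[t], 1.0])  # w_t = -[omega_{t-1}, omega_t, 1]
        x = B @ u + w                                 # A_t = 0
    return total

# Example: the zero policy pays roughly 2 per step (|x_{t;1}| and x_{t;2}^2 are both ~1).
zero = lambda t, hist: np.zeros(3)
print(rollout(T=1000, sigma=0.1, alpha=1 / 24, policy=zero))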

Theorem F.2.

Let \mathsf{alg} be any online learning algorithm, and let c_{f}(x,u) be as in (F.3). Then, for any \sigma>0, there exists a Drc policy \pi_{\star}\in\Pi_{\mathrm{drc}}(1,1) (equivalently, a policy in \Pi_{\mathrm{dac}}(1,1), since A_{t}\equiv 0) such that the expected regret incurred by \mathsf{alg} under the distribution \mathcal{D}_{\sigma} and cost c_{f}(x,u) is at least

\displaystyle\mathbb{E}_{\mathcal{D}_{\sigma},\mathsf{alg}}[J_{T}(\mathsf{alg})-J_{T}(\pi_{\star})]\geq\begin{cases}C_{1}T\sigma^{2}&\text{for }f(z)=z^{2},~{}T\geq\tfrac{C_{0}}{\sigma^{2}},\\ C_{1}T\sigma&\text{for }f(z)=z,~{}T\geq\tfrac{C_{0}}{\sigma},\end{cases}

where above, C0,C1C_{0},C_{1} are universal, positive constants.

Proof.

In what follows, \mathbb{E}[\cdot] denotes expectation under the instance drawn from \mathcal{D}_{\sigma} and under any algorithmic randomness.

We expand x_{t} and u_{t} into their coordinates x_{t}=(x_{t;1},x_{t;2},x_{t;3}) and u_{t}=(u_{t;1},u_{t;2},u_{t;3}). For any policy \pi, we can decompose its cost as

\displaystyle J_{T}(\pi) =\sum_{t=1}^{T}c_{f}(x_{t}^{\pi},u_{t}^{\pi})=(u_{T;2}^{\pi})^{2}+f(x_{T;1}^{\pi})+\sum_{t=1}^{T-1}\left[c_{\mathrm{q}}(x_{t+1;2}^{\pi},u_{t;2}^{\pi})+f(x_{t;1}^{\pi})\right],
 where cq(xt+1;2π,ut;2π)=(xt+1;2π)2+(ut;2π)2.\displaystyle\quad\text{ where }c_{\mathrm{q}}(x_{t+1;2}^{\pi},u_{t;2}^{\pi})=(x_{t+1;2}^{\pi})^{2}+(u_{t;2}^{\pi})^{2}.

Note that x_{1}=0 in the above. The following lemma characterizes the conditional expectation of the c_{\mathrm{q}} terms.

Lemma F.3.

There exists a constant c_{\mathrm{q};\star} such that

𝔼[cq(xt+1;2,ut;2)]cq;=(2+σ2/3)𝔼[(ut;2u¯)2],\displaystyle\mathbb{E}[c_{\mathrm{q}}(x_{t+1;2},u_{t;2})]-c_{\mathrm{q};\star}=(2+\sigma^{2}/3)\mathbb{E}[(u_{t;2}-\bar{u})^{2}],

where u¯=12(1+σ2/6)\bar{u}=\frac{1}{2(1+\sigma^{2}/6)}.

The proof of Lemma F.3 is deferred to the end of the section.
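As a numerical sanity check on Lemma F.3 (not part of the paper; the sample size and test points are illustrative), the excess conditional cost can be compared against the closed form (2+\sigma^{2}/3)(u-\bar{u})^{2} by simple Monte Carlo over \beta and \omega:

import numpy as np

def mean_q_cost(u, sigma, alpha, n=2_000_000, seed=0):
    # Monte Carlo estimate of E[(beta*u - omega)^2 + u^2] for a fixed input u,
    # with beta ~ Unif[1-sigma, 1+sigma] and omega ~ Unif{1-alpha*sigma, 1+alpha*sigma}.
    rng = np.random.default_rng(seed)
    beta = rng.uniform(1 - sigma, 1 + sigma, n)
    omega = rng.choice([1 - alpha * sigma, 1 + alpha * sigma], n)
    return np.mean((beta * u - omega) ** 2 + u ** 2)

sigma, alpha = 0.1, 1 / 24
u_bar = 1 / (2 * (1 + sigma ** 2 / 6))          # minimizer from Lemma F.3
c_q_star = mean_q_cost(u_bar, sigma, alpha)     # approximately the minimal cost
for u in (0.0, 0.25, u_bar, 0.75):
    excess = mean_q_cost(u, sigma, alpha) - c_q_star
    closed_form = (2 + sigma ** 2 / 3) * (u - u_bar) ** 2
    print(f"u={u:.3f}: Monte Carlo excess {excess:.4f} vs closed form {closed_form:.4f}")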

Bounding the cost of π\pi^{\star}

We select \pi_{\star} to be the Dac (equivalently, Drc, since A_{t}\equiv 0) policy given by

\displaystyle M=\begin{bmatrix}0&-1&0\\ 0&0&-\bar{u}\\ 0&0&0\end{bmatrix},\quad u_{t}^{M}=Mw_{t-1}.

We find that

𝔼[cq(xt+1;2π,ut;2π)]cq;=(2+σ2/3)𝔼[(ut;2Mu¯)2]=0,2tT1\displaystyle\mathbb{E}[c_{\mathrm{q}}(x_{t+1;2}^{\pi_{\star}},u_{t;2}^{\pi_{\star}})]-c_{\mathrm{q};\star}=(2+\sigma^{2}/3)\mathbb{E}[(u^{M}_{t;2}-\bar{u})^{2}]=0,\quad 2\leq t\leq T-1
𝔼[f(xt;1M)]=𝔼[f(ut1;1Mωt2)]=0,3tT.\displaystyle\mathbb{E}[f(x_{t;1}^{M})]=\mathbb{E}[f(u_{t-1;1}^{M}-\omega_{t-2})]=0,\quad 3\leq t\leq T.

Since the noise is uniformly bounded and independent across time, for all \sigma,\alpha\leq 1 we conclude

\displaystyle\mathbb{E}[J_{T}(\pi^{M})]-(T-1)c_{\mathrm{q};\star} =\underbrace{\mathbb{E}[f(x_{2;1}^{\pi_{\star}})]}_{=\,\mathbb{E}[f(\omega_{t})]}+\mathbb{E}[(u_{T;2}^{\pi_{\star}})^{2}]+\underbrace{\mathbb{E}[f(x_{T;1}^{\pi_{\star}})]}_{=0}+\underbrace{\mathbb{E}[c_{f}(x_{1},u_{1})]}_{=0}
=\frac{f(1-\alpha\sigma)+f(1+\alpha\sigma)}{2}+\underbrace{\mathbb{E}[(u_{T;2}^{\pi_{\star}})^{2}]}_{=\bar{u}^{2}}\leq 2, (F.4)

where we use \alpha\leq 1/24, \sigma\leq 1/8, \bar{u}=\frac{1}{2(1+\sigma^{2}/6)}\leq\frac{1}{2}, and f(z)\leq\max\{|z|,|z|^{2}\} to obtain the bound Eq. F.4.

Bounding the cost of adaptive policies

Fix any online learning algorithm 𝖺𝗅𝗀\mathsf{alg}; we lower bound its performance. Because BtB_{t} and wtw_{t} are drawn from a fixed probability distribution, and are therefore oblivious to the learner’s actions, we may assume without loss of generality that 𝖺𝗅𝗀\mathsf{alg} is deterministic.

The first step is to argue that any algorithm with small cost must select inputs that are bounded away from zero. Specifically, define the event \mathcal{E}_{t}:=\{u_{t;2}^{\mathsf{alg}}\geq 1/6\}. Using Lemma F.3 together with \bar{u}\geq 1/3,

\displaystyle\mathbb{E}[J_{T}(\mathsf{alg})]-(T-1)c_{\mathrm{q};\star} \geq\sum_{t=1}^{T}\mathbb{E}\left[f(x_{t;1}^{\mathsf{alg}})\right]+2\sum_{t=1}^{T-1}\mathbb{E}\left[(u^{\mathsf{alg}}_{t;2}-\bar{u})^{2}\right]
\geq\sum_{t=1}^{T}\mathbb{E}\left[f(x_{t;1}^{\mathsf{alg}})\right]+2\left(\frac{1}{3}-\frac{1}{6}\right)^{2}\sum_{t=1}^{T-1}\operatorname{\mathbb{P}}\left[\mathcal{E}_{t}^{c}\right]
\geq\sum_{t=3}^{T}\mathbb{E}\left[f(x_{t;1}^{\mathsf{alg}})\mid\mathcal{E}_{t-2}\right]\operatorname{\mathbb{P}}\left[\mathcal{E}_{t-2}\right]+\frac{1}{18}\sum_{t=1}^{T-2}\operatorname{\mathbb{P}}\left[\mathcal{E}_{t}^{c}\right]. (F.5)

We now lower bound 𝔼[f(xt;1𝖺𝗅𝗀)t2]\mathbb{E}\left[f(x_{t;1}^{\mathsf{alg}})\mid\mathcal{E}_{t-2}\right], again deferring the proof to the end of the section.

Lemma F.4.

For any executable policy \pi and any t\geq 3, \mathbb{E}[f(x_{t;1}^{\pi})\mid\mathcal{E}_{t-2}]\geq\frac{1}{2}f(\alpha\sigma), provided \alpha\leq 1/24.

Combining Lemma F.4 with the above bound, we have

\displaystyle\mathbb{E}[J_{T}(\mathsf{alg})]-(T-1)c_{\mathrm{q};\star} \geq\sum_{t=3}^{T}\frac{f(\alpha\sigma)}{2}\operatorname{\mathbb{P}}\left[\mathcal{E}_{t-2}\right]+\frac{1}{18}\sum_{t=1}^{T-2}\operatorname{\mathbb{P}}\left[\mathcal{E}_{t}^{c}\right]
\geq\min\left\{\frac{f(\alpha\sigma)}{2},\frac{1}{18}\right\}\sum_{t=1}^{T-2}\left(\operatorname{\mathbb{P}}\left[\mathcal{E}_{t}\right]+\operatorname{\mathbb{P}}\left[\mathcal{E}_{t}^{c}\right]\right)
=(T-2)\min\left\{\frac{f(\alpha\sigma)}{2},\frac{1}{18}\right\}.

With σ1\sigma\leq 1 and α=124\alpha=\frac{1}{24}, f(ασ)2118\frac{f(\alpha\sigma)}{2}\leq\frac{1}{18}. Combining with Eq. F.4, we have

𝔼[JT(𝖺𝗅𝗀)]𝔼[JT(π)](T2)f(ασ)22.\displaystyle\mathbb{E}[J_{T}(\mathsf{alg})]-\mathbb{E}[J_{T}(\pi^{\star})]\geq(T-2)\frac{f(\alpha\sigma)}{2}-2.

The bound follows. ∎

F.2.1 Omitted proofs

Proof of Lemma F.3.

Let (\overline{\mathcal{F}}_{t})_{t\geq 1} denote the filtration induced by setting \overline{\mathcal{F}}_{t} to be the sigma-algebra generated by \{\beta_{1},\dots,\beta_{t-1}\} and \{\omega_{s}\}_{s\leq t-1}, so that u_{t;2} is \overline{\mathcal{F}}_{t}-measurable while \beta_{t} and \omega_{t} are independent of \overline{\mathcal{F}}_{t}. We have

𝔼[βt2¯t]\displaystyle\mathbb{E}[\beta_{t}^{2}\mid\overline{\mathcal{F}}_{t}] =12σu=1σ1+σu2du=(1+σ)3(1σ)36σ\displaystyle=\frac{1}{2\sigma}\int_{u=1-\sigma}^{1+\sigma}u^{2}\mathrm{d}u=\frac{(1+\sigma)^{3}-(1-\sigma)^{3}}{6\sigma}
\displaystyle=\frac{1+3\sigma+3\sigma^{2}+\sigma^{3}-(1-3\sigma+3\sigma^{2}-\sigma^{3})}{6\sigma}
=1+\sigma^{2}/3.

Set cq(x,u)=x2+u2c_{\mathrm{q}}(x,u)=x^{2}+u^{2}.

𝔼[cq(xt+1;2,ut;2)¯t]\displaystyle\mathbb{E}[c_{\mathrm{q}}(x_{t+1;2},u_{t;2})\mid\overline{\mathcal{F}}_{t}] =ut;22+𝔼[(βtut;2ωt)2¯t]\displaystyle=u_{t;2}^{2}+\mathbb{E}[(\beta_{t}u_{t;2}-\omega_{t})^{2}\mid\overline{\mathcal{F}}_{t}]
\displaystyle=u_{t;2}^{2}(1+\mathbb{E}[\beta_{t}^{2}\mid\overline{\mathcal{F}}_{t}])-2u_{t;2}\mathbb{E}[\beta_{t}\omega_{t}\mid\overline{\mathcal{F}}_{t}]+\mathbb{E}[\omega_{t}^{2}\mid\overline{\mathcal{F}}_{t}]
=u_{t;2}^{2}(2+\sigma^{2}/3)-2u_{t;2}+(1+\alpha^{2}\sigma^{2}),

which is minimized at u_{\star;2}=\frac{1}{2(1+\sigma^{2}/6)}=\bar{u}. Defining c_{\mathrm{q};\star}:=\min_{u_{t;2}}\mathbb{E}[c_{\mathrm{q}}(x_{t+1;2},u_{t;2})\mid\overline{\mathcal{F}}_{t}], we then have

\displaystyle\mathbb{E}[c_{\mathrm{q}}(x_{t+1;2},u_{t;2})\mid\overline{\mathcal{F}}_{t}]-c_{\mathrm{q};\star}=(2+\sigma^{2}/3)(u_{t;2}-u_{\star;2})^{2}. (F.6)
Taking total expectations and recalling that u_{\star;2}=\bar{u} yields the lemma. ∎

Proof of Lemma F.4.

Let us introduce a second event, ~\tilde{\mathcal{E}}, defined as

\displaystyle\tilde{\mathcal{E}}_{t-1}:=\left\{(1+\alpha\sigma)+u^{\mathsf{alg}}_{t-2;2}(1-\sigma)\leq x_{t-1;2}^{\mathsf{alg}}\leq(1-\alpha\sigma)+u^{\mathsf{alg}}_{t-2;2}(1+\sigma)\right\}.

Then, since ff is non-negative,

𝔼[f(xt;1π)t2]\displaystyle\mathbb{E}[f(x_{t;1}^{\pi})\mid\mathcal{E}_{t-2}] 𝔼[𝕀{~t1}f(xt;1π)t2]\displaystyle\geq\mathbb{E}[\mathbb{I}\{\tilde{\mathcal{E}}_{t-1}\}f(x_{t;1}^{\pi})\mid\mathcal{E}_{t-2}]
\displaystyle=\operatorname{\mathbb{P}}[\tilde{\mathcal{E}}_{t-1}\mid\mathcal{E}_{t-2}]\cdot\mathbb{E}[f(x_{t;1}^{\pi})\mid\tilde{\mathcal{E}}_{t-1}\cap\mathcal{E}_{t-2}].

Let us first lower bound \operatorname{\mathbb{P}}[\tilde{\mathcal{E}}_{t-1}\mid\mathcal{E}_{t-2}]. Writing x_{t-1;2}^{\mathsf{alg}}=\beta_{t-2}u^{\mathsf{alg}}_{t-2;2}+\omega_{t-2}, \tilde{\mathcal{E}}_{t-1} occurs as soon as

(1+ασ)+ut2;2𝖺𝗅𝗀(1σ)\displaystyle(1+\alpha\sigma)+u^{\mathsf{alg}}_{t-2;2}(1-\sigma) βt2ut2;2𝖺𝗅𝗀+ωt2\displaystyle\leq\beta_{t-2}u^{\mathsf{alg}}_{t-2;2}+\omega_{t-2}
(1ασ)+ut2;2𝖺𝗅𝗀(1+σ)\displaystyle(1-\alpha\sigma)+u^{\mathsf{alg}}_{t-2;2}(1+\sigma) βt2ut2;2𝖺𝗅𝗀+ωt2.\displaystyle\geq\beta_{t-2}u^{\mathsf{alg}}_{t-2;2}+\omega_{t-2}.

Using that 1ασωt21+ασ1-\alpha\sigma\leq\omega_{t-2}\leq 1+\alpha\sigma and rearranging the above, it is enough that

\displaystyle 2\alpha\sigma \leq(\beta_{t-2}-(1-\sigma))u^{\mathsf{alg}}_{t-2;2}
2\alpha\sigma \leq((1+\sigma)-\beta_{t-2})u^{\mathsf{alg}}_{t-2;2}.

Now, if \mathcal{E}_{t-2} holds, then u^{\mathsf{alg}}_{t-2;2}\geq 1/6. Furthermore, by construction 1-\sigma\leq\beta_{t-2}\leq 1+\sigma. Therefore, if \mathcal{E}_{t-2} holds, then \tilde{\mathcal{E}}_{t-1} holds as long as

\displaystyle\beta_{t-2}-(1-\sigma)\geq 12\alpha\sigma\quad\text{and}\quad(1+\sigma)-\beta_{t-2}\geq 12\alpha\sigma.

In particular, for \alpha=\frac{1}{24} we have 12\alpha\sigma=\sigma/2, so

\displaystyle\operatorname{\mathbb{P}}[\tilde{\mathcal{E}}_{t-1}\mid\mathcal{E}_{t-2}]\geq\operatorname{\mathbb{P}}[\beta_{t-2}\in[1-\tfrac{\sigma}{2},1+\tfrac{\sigma}{2}]]=\frac{1}{2}. (F.7)

Next, we lower bound \mathbb{E}[f(x_{t;1}^{\pi})\mid\tilde{\mathcal{E}}_{t-1}\cap\mathcal{E}_{t-2}]. To do so, we observe that x_{t;1}=\omega_{t-2}-u^{\mathsf{alg}}_{t-1;1}. Moreover, since \mathsf{alg} is deterministic (see the discussion above), u^{\mathsf{alg}}_{t-1;1} is a deterministic function of x^{\mathsf{alg}}_{1:t-1}, and \tilde{\mathcal{E}}_{t-1},\mathcal{E}_{t-2} are determined by x^{\mathsf{alg}}_{1:t-1}. Hence,

𝔼[f(xt;1π)~t1,t2]\displaystyle\mathbb{E}[f(x_{t;1}^{\pi})\mid\tilde{\mathcal{E}}_{t-1},\mathcal{E}_{t-2}] =𝔼[𝔼[f(xt;1π)x1:t1𝖺𝗅𝗀]~t1t2]\displaystyle=\mathbb{E}[\mathbb{E}[f(x_{t;1}^{\pi})\mid x^{\mathsf{alg}}_{1:t-1}]\mid\tilde{\mathcal{E}}_{t-1}\cap\mathcal{E}_{t-2}]
=𝔼[𝔼[f(ωt2ut1;1𝖺𝗅𝗀)x1:t1𝖺𝗅𝗀]~t1t2]\displaystyle=\mathbb{E}[\mathbb{E}[f(\omega_{t-2}-u^{\mathsf{alg}}_{t-1;1})\mid x^{\mathsf{alg}}_{1:t-1}]\mid\tilde{\mathcal{E}}_{t-1}\cap\mathcal{E}_{t-2}]
𝔼[minu𝔼[f(ωt2u)x1:t1𝖺𝗅𝗀]~t1t2].\displaystyle\geq\mathbb{E}[\min_{u\in\mathbb{R}}\mathbb{E}[f(\omega_{t-2}-u)\mid x^{\mathsf{alg}}_{1:t-1}]\mid\tilde{\mathcal{E}}_{t-1}\cap\mathcal{E}_{t-2}].

Thus, it suffices to characterize the distribution of \omega_{t-2}\mid x^{\mathsf{alg}}_{1:t-1} whenever the events \tilde{\mathcal{E}}_{t-1}\cap\mathcal{E}_{t-2} hold. Indeed, we claim that \omega_{t-2}\mid x^{\mathsf{alg}}_{1:t-1} is uniform on \{1-\alpha\sigma,1+\alpha\sigma\} whenever \tilde{\mathcal{E}}_{t-1} holds.

Claim F.5.

Let \omega^{-}=1-\alpha\sigma and \omega^{+}=1+\alpha\sigma. Then (with probability one) \operatorname{\mathbb{P}}[\omega_{t-2}=\omega^{-}\mid x^{\mathsf{alg}}_{1:t-1}]=\operatorname{\mathbb{P}}[\omega_{t-2}=\omega^{+}\mid x^{\mathsf{alg}}_{1:t-1}]=\frac{1}{2} whenever \tilde{\mathcal{E}}_{t-1} holds.

Proof.

If \tilde{\mathcal{E}}_{t-1} holds, then there exist exactly two values \beta and \beta^{\prime} in [1-\sigma,1+\sigma] such that

ω++βut2;2𝖺𝗅𝗀=ω+βut2;2𝖺𝗅𝗀=xt1;2𝖺𝗅𝗀.\displaystyle\omega^{+}+\beta u^{\mathsf{alg}}_{t-2;2}=\omega^{-}+\beta^{\prime}u^{\mathsf{alg}}_{t-2;2}=x^{\mathsf{alg}}_{t-1;2}.

Even conditioned on x^{\mathsf{alg}}_{1:t-2}, \omega_{t-2} is uniformly distributed on \{\omega^{-},\omega^{+}\}, and since \beta and \beta^{\prime} have the same probability mass under the uniform distribution of \beta_{t-2}, it follows that \operatorname{\mathbb{P}}[\omega_{t-2}=\omega^{+}\mid x^{\mathsf{alg}}_{1:t-1}]=\operatorname{\mathbb{P}}[\omega_{t-2}=\omega^{-}\mid x^{\mathsf{alg}}_{1:t-1}] when \tilde{\mathcal{E}}_{t-1} holds. ∎

Hence,

𝔼[f(xt;1π)~t1,t2]minu12(f(1+ασu)+f(1ασu)).\displaystyle\mathbb{E}[f(x_{t;1}^{\pi})\mid\tilde{\mathcal{E}}_{t-1},\mathcal{E}_{t-2}]\geq\min_{u\in\mathbb{R}}\frac{1}{2}(f(1+\alpha\sigma-u)+f(1-\alpha\sigma-u)).

For f(z)=z^{2} or f(z)=|z|, the minimum is attained at u=1, with respective values \alpha^{2}\sigma^{2} and \alpha\sigma, both equal to f(\alpha\sigma). Thus, combining with Eq. F.7, we conclude

\displaystyle\mathbb{E}[f(x_{t;1}^{\pi})\mid\mathcal{E}_{t-2}]\geq\frac{1}{2}f(\alpha\sigma).\qquad∎

F.3 Lower Bound without Stability

Theorem F.3.

Consider a scalar LTV system with A=\rho\in[0,1], and B_{t} drawn independently and uniformly at random from \{-1,+1\}. Suppose that w_{1}=1, and w_{t}=0 for all t>1. Finally, let c_{t}(x,u)=x^{2} be the fixed cost. Then,

  1. (a)

    The Drc policy u_{t}^{\pi}=-\rho B_{2}x^{\mathrm{nat}}_{t}, the Dac policy u_{t}^{\pi}=-\rho B_{2}w_{t-1}, and the static-feedback policy u_{t}^{\pi}=-\rho B_{2}x_{t} (all chosen with foreknowledge of the (B_{t})_{t\geq 1} sequence) all enjoy:

    JT(π)=1,with probability 1.\displaystyle J_{T}(\pi)=1,\quad\text{with probability 1}.
  2. (b)

    Any online learning algorithm without foreknowledge of (Bt)t1(B_{t})_{t\geq 1} must suffer expected cost

    𝔼[JT(π)]=Ω(min{T,11ρ}).\displaystyle\mathbb{E}[J_{T}(\pi)]=\Omega\left(\min\left\{T,\frac{1}{1-\rho}\right\}\right).
Proof.

In part (a), all policies choose u2=B2ρw1u_{2}=-B_{2}\rho w_{1}, so that x3=ρx2ρw1=0x_{3}=\rho x_{2}-\rho w_{1}=0. Since 0 is an equilibrium point and wt=0w_{t}=0 for all t2t\geq 2, the system remains at zero. Hence, the only cost incurred is at times 11 and 22, which are costs of zero and 11 respectively. In part (b), we use the unbiasedness of BtB_{t} to recurse

𝔼[x2]\displaystyle\mathbb{E}[x_{2}] =Ax1+𝔼[B1]𝔼[u1]+w1=1\displaystyle=Ax_{1}+\mathbb{E}[B_{1}]\mathbb{E}[u_{1}]+w_{1}=1
𝔼[xt+1]\displaystyle\mathbb{E}[x_{t+1}] =A𝔼[xt]+𝔼[Bt]𝔼[ut]+wt=0=ρ𝔼[xt],t2,\displaystyle=A\mathbb{E}[x_{t}]+\mathbb{E}[B_{t}]\mathbb{E}[u_{t}]+\underbrace{w_{t}}_{=0}=\rho\mathbb{E}[x_{t}],\quad\forall t\geq 2,

yielding \mathbb{E}[x_{t}]=\rho^{t-2} for all t\geq 2. Hence, by Jensen's inequality, the expected cost is at least

\displaystyle\mathbb{E}[J_{T}(\pi)]\geq\sum_{t=2}^{T}\mathbb{E}[x_{t}]^{2}=\sum_{t=2}^{T}\rho^{2(t-2)}.

Considering the cases \rho\geq 1-\frac{1}{T-2} and \rho<1-\frac{1}{T-2} separately, we find the above is \Omega\left(\min\left\{T,\frac{1}{1-\rho}\right\}\right). ∎
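The dichotomy in Theorem F.3 is easy to observe numerically. The following sketch (not part of the proof; the horizon, the value of \rho, the zero comparison policy, and the number of trials are illustrative choices) compares a static-feedback policy that knows B_{2}, and hence zeroes the state after one step, against the zero policy, whose cost matches the \min\{T,1/(1-\rho)\} scaling of the lower bound.

import numpy as np

def total_cost(T, rho, B, gain):
    # x_{t+1} = rho*x_t + B_t*u_t + w_t with w_1 = 1, w_t = 0 for t > 1,
    # static feedback u_t = gain * x_t, and cost sum_{t <= T} x_t^2.
    x, cost = 0.0, 0.0
    for t in range(1, T + 1):
        cost += x ** 2
        w = 1.0 if t == 1 else 0.0
        x = rho * x + B[t] * gain * x + w
    return cost

T, rho, trials = 500, 0.99, 100
rng = np.random.default_rng(0)
known, blind = [], []
for _ in range(trials):
    B = rng.choice([-1.0, 1.0], size=T + 1)                # B_t uniform on {-1, +1}
    known.append(total_cost(T, rho, B, gain=-rho * B[2]))  # uses foreknowledge of B_2
    blind.append(total_cost(T, rho, B, gain=0.0))          # no foreknowledge
print(f"with foreknowledge of B_2: {np.mean(known):.2f}")
print(f"zero policy: {np.mean(blind):.2f}  vs  min(T, 1/(1-rho^2)) = {min(T, 1/(1 - rho**2)):.2f}")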

F.4 Hardness of Computing Best State Feedback Controller

Consider a time-varying linear dynamical system with no noise:

xt+1=Atxt+Btut+wt,t,wt0x_{t+1}=A_{t}x_{t}+B_{t}u_{t}+w_{t},\quad\forall t,\,w_{t}\equiv 0

subject to changing convex costs c_{t}(x,u). We show that even in the no-noise setting, properly learning the optimal state feedback policy is computationally hard. This statement holds even when the control agent has full prior knowledge of the dynamics (A_{t},B_{t},c_{t}). It relies on a reduction from the MAX-3SAT problem, which is \mathsf{NP}-hard. Our lower bound is inspired by the analogous one for discrete MDPs by [14].

Theorem F.4.

There exists a reduction from Max-3Sat on m clauses and n literals to the problem of finding the state feedback policy K optimal for the cost \sum_{t=1}^{T}c_{t}(x_{t}^{K},u_{t}^{K}) over sequentially stable dynamics given by (A_{t},B_{t},c_{t}): a Max-3Sat solution of value k implies an optimal control cost of at most -k, and a solution K to the control problem with cost at most -k-\epsilon implies an optimal Max-3Sat value of at least k, for any known \epsilon>0.

Let us first describe the construction of the dynamics that encode a Max-3Sat instance as an optimal control problem. Consider a 3-CNF formula \phi with m clauses C_{1},\dots,C_{m} and n literals y_{1},\dots,y_{n}. The state space is of dimensionality d_{x}=n+1 and the action space is of dimensionality d_{u}=2. The control problem is given as a sequence of m episodes corresponding to the clauses of the formula \phi.

For a single clause C_{j} with j\in[m], let the dynamics (A_{t},B_{t},c_{t})_{t=1}^{n+2} be an episode of length n+2 constructed as follows. The initial state is x_{1}=[1,\bm{0}_{n}]^{\top}. The state transitions are independent of the clause itself and are given by the following A_{t}\in\mathbb{R}^{(n+1)\times(n+1)}, where A_{t}(i) (and likewise B_{t}(i) below) denotes the i-th row of the matrix:

  • for 1t<n1\leq t<n, At(t+1)=[𝟏n,0]A_{t}(t+1)=[\bm{1}_{n},0]^{\top}, At(n+1)=[𝟎n,1]A_{t}(n+1)=[\bm{0}_{n},1]^{\top}, At(i)=[𝟎n+1]A_{t}(i)=[\bm{0}_{n+1}]^{\top} for all other it+1,n+1i\neq t+1,n+1.

  • for t=nt=n, it becomes At(n+1)=[𝟏n+1]A_{t}(n+1)=[\bm{1}_{n+1}]^{\top} and At(i)=[𝟎n+1]A_{t}(i)=[\bm{0}_{n+1}]^{\top} for all other in+1i\neq n+1.

  • for t=n+1t=n+1 take At(1)=[𝟏n+1]A_{t}(1)=[\bm{1}_{n+1}]^{\top} and At(i)=[𝟎n+1]A_{t}(i)=[\bm{0}_{n+1}]^{\top} for all other i1i\neq 1; for t=n+2t=n+2 take At=𝟎n+1×n+1A_{t}=\bm{0}_{n+1\times n+1} to ensure sequential stability.

The action matrices BtB_{t} along with the costs ctc_{t}, on the other hand, depend on the content of the clause CjC_{j} itself. In particular, let IjI_{j} be the set of indices of the literals that are in clause CjC_{j}. We define the regularity cost to be c(x,u)=Sx(x)+(1x(n+1))2Su(u)c(x,u)=S_{x}(x)+(1-x(n+1))^{2}\cdot S_{u}(u), where Sx()=dist(,Δn+1)S_{x}(\cdot)=\mathrm{dist}(\cdot,\Delta_{n+1}) and Su()=dist(,Δ2)S_{u}(\cdot)=\mathrm{dist}(\cdot,\Delta_{2}) are the distance functions to the simplex sets of corresponding dimensionality. The action matrices and costs are given as follows:

  • for 1tn1\leq t\leq n and tIjt\not\in I_{j}, Bt=𝟎n+1×2B_{t}=\bm{0}_{n+1\times 2} and ct(x,u)=c(x,u)c_{t}(x,u)=c(x,u).

  • for 1tn1\leq t\leq n and tIjt\in I_{j}, if ytCjy_{t}\in C_{j}: then Bt(t+1)=[1,0]B_{t}(t+1)=[-1,0]^{\top}, Bt(n+1)=[1,0]B_{t}(n+1)=[1,0]^{\top} and Bt(i)=[0,0]B_{t}(i)=[0,0]^{\top} for all the other it+1,n+1i\neq t+1,n+1; the cost is ct(x,u)=c(x,u)u(1)(1x(n+1))c_{t}(x,u)=c(x,u)-u(1)\cdot(1-x(n+1)) rewarding the action [1,0][1,0] which corresponds to assigning the literal a value yt=1y_{t}=1.

  • for 1tn1\leq t\leq n and tIjt\in I_{j}, if ¬ytCj\neg y_{t}\in C_{j}: then Bt(t+1)=[0,1]B_{t}(t+1)=[0,-1]^{\top}, Bt(n+1)=[0,1]B_{t}(n+1)=[0,1]^{\top} and Bt(i)=[0,0]B_{t}(i)=[0,0]^{\top} for all the other it+1,n+1i\neq t+1,n+1; the cost is ct(x,u)=c(x,u)u(2)(1x(n+1))c_{t}(x,u)=c(x,u)-u(2)\cdot(1-x(n+1)) rewarding the action [0,1][0,1] which corresponds to assigning the literal a value yt=0y_{t}=0.

  • for t=n+1t=n+1, Bt=𝟎n+1×2B_{t}=\bm{0}_{n+1\times 2} and for t=n+2t=n+2, Bt(1)=[1,1]B_{t}(1)=[1,1]^{\top} and Bt(i)=[0,0]B_{t}(i)=[0,0]^{\top} for all other i1i\neq 1; for both t=n+1,n+2t=n+1,n+2, the costs are ct(x,u)=c(x,u)c_{t}(x,u)=c(x,u).

Note that the last two rounds t=n+1,n+2t=n+1,n+2 for a clause ensure sequential stability and identical starting state x1=[1,𝟎n]x_{1}=[1,\bm{0}_{n}]^{\top}.

Lemma F.6.

The described system (At,Bt)(A_{t},B_{t}) is sequentially stable and the costs ctc_{t} are convex in x,ux,u.

Proof.

By the construction of the state matrices AtA_{t}, we know that for any tn+2t\geq n+2 the operator Φt[n+2]=0\Phi_{t}^{[n+2]}=0 implying sequential stability of the system. To show convexity, note that the distance function g()=dist(,𝒮)g(\cdot)=\mathrm{dist}(\cdot,\mathcal{S}) for any convex and compact set 𝒮\mathcal{S} is a convex function. More specifically, for zdzz\in\mathbb{R}^{d_{z}} it is given by g(z)=minw𝒮zwg(z)=\min_{w\in\mathcal{S}}\|z-w\|. This is straightforward to show: take any z1,z2dzz_{1},z_{2}\in\mathbb{R}^{d_{z}}, and let w1,w2𝒮w_{1},w_{2}\in\mathcal{S} be the closest points to z1,z2z_{1},z_{2} respectively, i.e. w1z1=g(z1)\|w_{1}-z_{1}\|=g(z_{1}) and w2z2=g(z2)\|w_{2}-z_{2}\|=g(z_{2}). For any λ[0,1]\lambda\in[0,1], given the convexity of the set 𝒮\mathcal{S}, we know that λw1+(1λ)w2𝒮\lambda w_{1}+(1-\lambda)w_{2}\in\mathcal{S}, which concludes the convexity proof for gg:

λg(z1)+(1λ)g(z2)\displaystyle\lambda g(z_{1})+(1-\lambda)g(z_{2}) =λz1w1+(1λ)z2w2\displaystyle=\lambda\|z_{1}-w_{1}\|+(1-\lambda)\|z_{2}-w_{2}\|
λ(z1w1)+(1λ)(z2w2)\displaystyle\geq\|\lambda(z_{1}-w_{1})+(1-\lambda)(z_{2}-w_{2})\|
=λz1+(1λ)z2(λw1+(1λ)w2)\displaystyle=\|\lambda z_{1}+(1-\lambda)z_{2}-(\lambda w_{1}+(1-\lambda)w_{2})\|
g(λz1+(1λ)z2).\displaystyle\geq g(\lambda z_{1}+(1-\lambda)z_{2})~{}.

This means that both Sx()S_{x}(\cdot) and Su()S_{u}(\cdot) are convex functions since the simplex is convex in any dimension. The construction of the costs ctc_{t} is based on these two functions as well as linear components in x,ux,u, hence all ctc_{t} costs are convex in x,ux,u. ∎
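For concreteness, the regularity cost can be evaluated numerically: \mathrm{dist}(z,\Delta_{d})=\|z-\Pi_{\Delta_{d}}(z)\|, where the Euclidean projection onto the simplex admits the standard sorting-based computation. The following is a minimal sketch (an implementation choice for illustration, not part of the paper):

import numpy as np

def project_simplex(z):
    # Euclidean projection of z onto the probability simplex {w >= 0, sum(w) = 1},
    # via the standard sorting-based algorithm.
    u = np.sort(z)[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, len(z) + 1)
    rho = np.max(np.where(u - (css - 1) / ks > 0)[0]) + 1   # number of active coordinates
    theta = (css[rho - 1] - 1) / rho
    return np.maximum(z - theta, 0.0)

def regularity_cost(x, u):
    # c(x, u) = S_x(x) + (1 - x(n+1))^2 * S_u(u), with S_* the distances to the simplex
    S_x = np.linalg.norm(x - project_simplex(x))
    S_u = np.linalg.norm(u - project_simplex(u))
    return S_x + (1.0 - x[-1]) ** 2 * S_u

# Basis-vector states and actions incur zero regularity cost:
print(regularity_cost(np.eye(4)[0], np.array([1.0, 0.0])))   # 0.0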

Lemma F.7.

If there exists an assignment of literals yvy\leftarrow v s.t. the formula ϕ\phi has k[1,m]k\in[1,m] satisfied clauses, then there is a corresponding linear policy K2×n+1K\in\mathbb{R}^{2\times n+1} that suffers the exact cost of k-k.

Proof.

This should be evident from the construction itself. Let the assignment v_{i}=1 correspond to \bar{v}_{i}=[1,0]^{\top} and v_{i}=0 to \bar{v}_{i}=[0,1]^{\top}. Then consider the linear policy K=[\bar{v}_{1},\dots,\bar{v}_{n},\bm{0}_{2}^{\top}]. Denote by e_{1},\dots,e_{n+1}\in\mathbb{R}^{n+1} the basis vectors of the state space. Note that, according to the defined K, if x_{t} is a basis vector of \mathbb{R}^{n+1}, then u_{t}=Kx_{t} is either a basis vector of \mathbb{R}^{2} or the zero vector (the latter only at the sink e_{n+1}). It is straightforward to check from our construction that this in turn implies x_{t+1} is a basis vector of \mathbb{R}^{n+1}. Since x_{1}=e_{1}, the state-action pairs arising from the policy K are of this form throughout, and hence satisfy the regularity conditions of c(x,u). This means that the policy K plays \bar{v}_{t} if x_{t}(t)=1 and plays \bm{0}_{2} if x_{t}(n+1)=1. Hence, if the clause C_{j} is satisfied by the assignment y\leftarrow v, then the cost of K over the episode is exactly -1, i.e. once the clause is satisfied, -1 is accrued and the state moves to the sink e_{n+1}. If the clause is not satisfied, then the cost is 0 since c(x,u) is 0 throughout. This means that the constructed linear policy K over the whole control sequence suffers cost -k. ∎
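To make Lemma F.7 concrete, the sketch below (not part of the paper; it assumes the row-wise reading of A_{t}(i) and B_{t}(i) described above, and the clause/assignment encodings are hypothetical illustration choices) builds one clause episode, rolls out the linear policy K induced by an assignment, and checks that a satisfying assignment accrues cost exactly -1 while a falsifying one accrues 0. The regularity terms vanish along these trajectories, so only the reward terms are tracked.

import numpy as np

def A_mat(t, n):
    # Row-wise construction of A_t for an episode of length n + 2 (t is 1-based).
    A = np.zeros((n + 1, n + 1))
    if 1 <= t < n:
        A[t, :n] = 1.0       # row t+1 equals [1_n, 0]
        A[n, n] = 1.0        # row n+1 equals [0_n, 1] (keeps the sink)
    elif t == n:
        A[n, :] = 1.0        # row n+1 all ones: everything moves to the sink
    elif t == n + 1:
        A[0, :] = 1.0        # row 1 all ones: reset towards e_1
    return A                 # t == n+2: A_t = 0

def B_mat(t, n, clause):
    # clause: list of (literal index, sign), sign +1 for y_i and -1 for (not y_i).
    B = np.zeros((n + 1, 2))
    lits = dict(clause)
    if 1 <= t <= n and t in lits:
        col = 0 if lits[t] > 0 else 1
        B[t, col] += -1.0    # row t+1
        B[n, col] += 1.0     # row n+1 (sink); at t = n the two contributions cancel
    elif t == n + 2:
        B[0, :] = 1.0        # row 1 equals [1, 1]
    return B

def reward(t, n, clause, x, u):
    # The -u(col)*(1 - x(n+1)) term of c_t at literal steps; 0 otherwise.
    lits = dict(clause)
    if 1 <= t <= n and t in lits:
        col = 0 if lits[t] > 0 else 1
        return -u[col] * (1.0 - x[n])
    return 0.0

def episode_cost(n, clause, assignment):
    # assignment: dict index -> bool; K plays [1,0] if True, [0,1] if False, 0 at the sink.
    K = np.zeros((2, n + 1))
    for i in range(1, n + 1):
        K[0 if assignment[i] else 1, i - 1] = 1.0
    x = np.zeros(n + 1)
    x[0] = 1.0               # x_1 = e_1
    total = 0.0
    for t in range(1, n + 3):
        u = K @ x
        total += reward(t, n, clause, x, u)
        x = A_mat(t, n) @ x + B_mat(t, n, clause) @ u
    return total

n = 3
clause = [(1, +1), (2, -1), (3, +1)]                            # y1 or (not y2) or y3
print(episode_cost(n, clause, {1: True, 2: True, 3: False}))    # satisfied: -1.0
print(episode_cost(n, clause, {1: False, 2: True, 3: False}))   # falsified: 0.0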

Lemma F.8.

If there exists a state feedback policy K2×n+1K\in\mathbb{R}^{2\times n+1} s.t. following the actions ut=Kxtu_{t}=Kx_{t} results in cost at most kϵ-k-\epsilon for any k[1,m]k\in[1,m] and any ϵ(0,1)\epsilon\in(0,1), then there is a literal assignment yvy\leftarrow v s.t. the formula ϕ\phi has at least kk satisfied clauses.

Proof.

Let the linear policy matrix be given as K=[v¯1,,v¯n,v¯n+1]K=[\bar{v}_{1},\dots,\bar{v}_{n},\bar{v}_{n+1}]. The proof consists of two main components: (i) we argue that the policy KK^{*} with v¯i=argminvΔ2vv¯i\bar{v}_{i}^{*}=\operatorname*{arg\,min}_{v\in\Delta_{2}}\|v-\bar{v}_{i}\| for 1in1\leq i\leq n and v¯n+1=𝟎2\bar{v}_{n+1}^{*}=\bm{0}_{2} is at least as good as KK in terms of cost up to an approximation factor ϵ\epsilon; (ii) we show that for KK that satisfies the constraints, the randomized policy K^\hat{K} that has K^(i)=[1,0]\hat{K}(i)=[1,0]^{\top} w.p. v¯i(1)\bar{v}_{i}(1) and K^(i)=[0,1]\hat{K}(i)=[0,1]^{\top} w.p. v¯i(2)\bar{v}_{i}(2) (as well as K^(n+1)=𝟎2\hat{K}(n+1)=\bm{0}_{2}) suffers expected cost at most that of KK itself.

Suppose these two claims are true, then the described randomized linear policy K^\hat{K} has expected cost at most k-k, which means that there exists a deterministic linear policy with first nn columns as basis vectors, i.e. [1,0][1,0]^{\top} or [0,1][0,1]^{\top}, that suffers cost at most k-k. It follows that the corresponding assignment of literals given by the first nn columns of the linear policy yvy\leftarrow v satisfies at least kk out of the mm clauses, so ϕ\phi has at least kk satisfied clauses.

To prove (i), first notice that the policy K^{*} suffers non-positive cost over the entire horizon, since it satisfies the constraints enforced by the regularity cost c(x,u). Note also that c(x,u) can be scaled by any constant M_{\epsilon}>0. Now suppose that, under the condition \max_{i}\|\bar{v}_{i}-\bar{v}_{i}^{*}\|\leq\ell_{\epsilon}, the cost difference between K^{*} and K is bounded by \epsilon: the choice of \ell_{\epsilon} can depend on any problem parameters, and since the construction is over T=\Theta(mn) overall rounds (finite), such a choice is always possible. Hence, for all such K we automatically infer that K^{*} suffers cost at most -k and satisfies the necessary constraints. On the other hand, if the condition does not hold, then the distance of some \bar{v}_{i} from \Delta_{2} is bounded from below by \ell_{\epsilon}, meaning that for a sufficiently large choice of M_{\epsilon}>0 (given knowledge of \epsilon and all other parameters), the overall cost suffered by K will be positive due to c(x,u), i.e. it will have a higher cost than K^{*}. Therefore, any state feedback policy K can be approximately replaced by the policy K^{*}, which satisfies the constraints ensured by c(x,u).

To show (ii), for a policy K that does satisfy these constraints, i.e. \bar{v}_{i}\in\Delta_{2} for 1\leq i\leq n and \bar{v}_{n+1}=\bm{0}_{2}, we show that its randomized version \hat{K} is at least as good in terms of expected cost. Proving this claim is a matter of unrolling the dynamics for a single clause C_{j}. The order, the indices, and the negations of the literals in C_{j} do not affect the cost, so w.l.o.g. assume we have C_{1}=y_{1}\lor y_{2}\lor y_{3}. The cost of a general policy K over the first 3 iterations is given by

v¯1(1)(1v¯1(1))2v¯2(1)(1v¯1(1))2(1v¯2(1))2v¯3(1)-\bar{v}_{1}(1)-(1-\bar{v}_{1}(1))^{2}\cdot\bar{v}_{2}(1)-(1-\bar{v}_{1}(1))^{2}\cdot(1-\bar{v}_{2}(1))^{2}\cdot\bar{v}_{3}(1)

The alternative randomized linear policy instead suffers an expected cost of

v¯1(1)(1v¯1(1))v¯2(1)(1v¯1(1))(1v¯2(1))v¯3(1)-\bar{v}_{1}(1)-(1-\bar{v}_{1}(1))\cdot\bar{v}_{2}(1)-(1-\bar{v}_{1}(1))\cdot(1-\bar{v}_{2}(1))\cdot\bar{v}_{3}(1)

which is no larger than the original cost, since (1-z)^{2}\leq 1-z for z\in[0,1], so each (negative) term in the second expression is at least as negative as the corresponding term in the first. ∎

Proof of Theorem F.4.

Lemma F.6 above shows that the given LTV system construction along with the costs satisfies the theorem conditions. Lemmas F.7 and F.8 show that the Max-3Sat problem can be reduced to LTV control, in particular to proper learning of state feedback policies in this setting. Since Max-3Sat is \mathsf{NP}-hard even in its decision form, this implies the computational hardness of offline optimization over state feedback policies for LTV systems. ∎