
Best of Both Worlds in Online Control:
Competitive Ratio and Policy Regret

Gautam Goel (Simons Institute, UC Berkeley), Naman Agarwal (Google AI Princeton), Karan Singh (Tepper School of Business, Carnegie Mellon University), Elad Hazan (Google AI Princeton and Department of Computer Science, Princeton University)
Abstract

We consider the fundamental problem of online control of a linear dynamical system from two different viewpoints: regret minimization and competitive analysis. We prove that the optimal competitive policy is well-approximated by a convex parameterized policy class, known as disturbance-action control (DAC) policies. Using this structural result, we show that several recently proposed online control algorithms achieve the best of both worlds: sublinear regret vs. the best DAC policy selected in hindsight, and optimal competitive ratio, up to an additive correction which grows sublinearly in the time horizon. We further conclude that sublinear regret vs. the optimal competitive policy is attainable when the linear dynamical system is unknown, and even when a stabilizing controller for the dynamics is not available a priori.

1 Introduction

The study of online optimization consists of two main research directions. The first is online learning, which studies regret minimization in games. A notable framework within this line of work is online convex optimization, where an online decision maker iteratively chooses a point in a convex set and receives loss according to an adversarially chosen loss function. The metric of performance studied in this research thrust is regret, or the difference between overall loss and that of the best decision in hindsight.

The second direction is that of competitive analysis in metrical task systems. In this framework, the problem setting is similar, but the performance metric is very different. Instead of regret, the objective is to minimize the competitive ratio, i.e. the ratio of the cost of the online decision maker to that of the optimal sequence of decisions made in hindsight. For this ratio to remain bounded, an additional penalty is imposed on movement costs, i.e. changes in the decision.

While the goals of the two research directions are similar, the performance metrics are very different and lead to different algorithms and methodologies. These two separate methodologies have recently been applied to the challenging setting of online control, contributing novel and exciting methods to the field of control of dynamical systems.

In this paper we unify these two disparate lines of work by establishing a connection between the two objectives. Namely, we show that the Gradient Perturbation Controller (GPC) minimizes regret against the policy that attains the optimal competitive ratio for any linear time-invariant (LTI) dynamical system. The GPC algorithm hence gives the best of both worlds: sublinear regret and optimal competitive ratio, in a single efficient algorithm.

Our main technical contribution is proving that the optimal competitive policy derived in [16] is well-approximated by a certain convex policy class, for which efficient online learning was recently established in the work of [2]. This implies that known regret minimization algorithms for online control can compete with this optimal competitive policy with vanishing regret.

This structural result has other important implications to online control, yielding new results: we show that sublinear regret can be attained vs. the optimal competitive policy even when the underlying dynamical system is unknown, and even when a stabilizing controller is not available.

1.1 Related work

Control of dynamical systems. Our study focuses on two types of algorithms for online control. The first class of algorithms enjoys sublinear regret for online control of dynamical systems; that is, their performance tracks a given benchmark of policies up to a term which is vanishing relative to the problem horizon. [1] initiated the study of online control under the regret benchmark for linear time-invariant (LTI) dynamical systems. Bounds for this setting have since been improved and refined in [12, 25, 10, 29]. We are interested in adversarial noise and perturbations; regret minimization in this context was initiated in the study of the nonstochastic control setting [2], which allows for adversarially chosen (e.g. non-Gaussian) noise and general convex costs that may vary with time. This model has been studied in many extended settings; see [24] for a comprehensive survey.

Competitive control. [18] initiated the study of online control with competitive ratio guarantees and showed that the Online Balanced Descent algorithm introduced in [7] has bounded competitive ratio in a narrow class of linear systems. This approach to competitive control was extended in a series of papers [17, 27]. In recent work, [16] obtained an algorithm with optimal competitive ratio in general linear systems using $H_{\infty}$ techniques; in this paper we show that the competitive control algorithm obtained in [16] is closely approximated by the class of DAC policies, and use this connection to obtain our “best-of-both-worlds” result.

Online learning and online convex optimization. The regret minimization techniques that are the subject of this paper are based in the framework of online convex optimization, see [20]. Recent techniques in online nonstochastic control are based on extensions of OCO to the setting of loss functions with memory [3] and adaptive or dynamic regret [23, 33].

Competitive analysis of online algorithms and simultaneous bounds on competitive ratio and regret. Competitive analysis was introduced in [31] and was first studied in the context of Metrical Task Systems (MTS) in [6]; we refer to [5] for an overview of competitive analysis and online algorithms. A series of recent papers consider the problem of obtaining online algorithms with bounded competitive ratio and sublinear regret. In [4], it was shown that no algorithm can simultaneously achieve both objectives in OCO with switching costs. On the other hand, [11] described an online algorithm for MTS with optimal competitive ratio and sublinear regret on every time interval.

2 Preliminaries

We consider the task of online control in linear time-invariant (LTI) dynamical systems. In this setting, the interaction between the learner and the environment proceeds as described next. At each time step, the learner incrementally observes the current state $x_{t}\in\mathbb{R}^{m}$ of the system, subsequently chooses a control input $u_{t}\in\mathbb{R}^{n}$, and consequently is subject to an instantaneous cost $c(x_{t},u_{t})$ defined via the quadratic cost function (we assume the existence of $\beta,\mu$ such that $\beta I\succeq Q,R\succeq\mu I$)

c(x,u)=x^{\top}Qx+u^{\top}Ru.

As a consequence of executing the control input $u_{t}$, the dynamical system evolves to a subsequent state $x_{t+1}$, as dictated by the following linear system parameterized by the matrices $A\in\mathbb{R}^{m\times m}$ and $B\in\mathbb{R}^{m\times n}$, bounded as $\|A\|,\|B\|\leq\kappa$, and the perturbation sequence $(w_{t})_{t\in[T]}$.

x_{t+1}=Ax_{t}+Bu_{t}+w_{t}.

We assume without loss of generality that $x_{1}=0$. The learner does not directly observe the perturbations, or know of them in advance. We do not make any (e.g., distributional) assumptions on the perturbations, other than that they satisfy a point-wise bound $\|w_{t}\|\leq W$ for all time steps $t$. By means of such interaction across $T$ time steps, we ascribe an aggregate cost to the learner $\mathcal{A}$ as

J_{T}(\mathcal{A}|w_{1:T})=\sum_{t=1}^{T}c(x_{t},u_{t}).
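For concreteness, here is a minimal sketch of this interaction protocol in Python with numpy; the function and argument names (`rollout`, `policy`) are illustrative assumptions, not part of the formal setup.

```python
import numpy as np

def rollout(A, B, Q, R, w, policy):
    """Simulate x_{t+1} = A x_t + B u_t + w_t and return the aggregate cost J_T."""
    m, _ = B.shape
    x = np.zeros(m)                        # x_1 = 0, as assumed above
    total_cost = 0.0
    for w_t in w:                          # w is a length-T sequence of perturbations
        u = policy(x)                      # the learner acts after observing x_t
        total_cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + w_t            # the perturbation w_t is revealed afterwards
    return total_cost
```

For instance, a linear state-feedback policy $K$ corresponds to `policy = lambda x: K @ x`.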

2.1 Policy classes

Since the learner selects the control inputs adaptively upon observing the state, the behavior of a learner may be described by a (strictly causal) policy $\pi$, a mapping from the observed state sequence to the immediate action. We consider the following policy classes in the paper:

1. $\Pi_{\textsc{all}}$ is the exhaustive set of $T$-length sequences of control inputs.

2. $\Pi_{\textsc{SC}}$ is the class of all strictly causal policies, mapping the heretofore observed state sequence to the next action.

3. $\mathcal{K}\subset\mathbb{R}^{n\times m}$ is a class of linear state-feedback policies. Each member of this class is parameterized by some matrix $K\in\mathcal{K}$, and recommends the immediate action $u_{t}\stackrel{\text{def}}{=}Kx_{t}$. Both the stochastic-optimal policy ($\mathcal{H}_{2}$-control) – Bayes-optimal for i.i.d. perturbations – and the robust policy ($\mathcal{H}_{\infty}$-control) – minimax-optimal for arbitrary perturbations – are linear state-feedback policies.

4. $\mathcal{M}$ is a class of disturbance-action controllers (DAC), defined below, that recommend actions as a linear transformation of the past few perturbations, rather than the present state.

A linear policy $K$ is called stable if $|\lambda_{\max}(A+BK)|<1$. Such policies ensure that the state sequence remains bounded under their execution. The notion of strong stability, introduced by [9], is a non-asymptotic characterization of the notion of stability, defined as follows.

Definition 1.

A linear policy $K\in\mathbb{R}^{n\times m}$ is said to be $(\kappa,\gamma)$-strongly stable with respect to an LTI system $(A,B)$ if there exist matrices $S,L$ satisfying $A+BK=SLS^{-1}$ such that

\max\{1,\|K\|,\|S\|\|S^{-1}\|\}\leq\kappa\text{ and }\max\{1/2,\|L\|\}\leq 1-\gamma.

A sufficient condition for the existence of a strongly stable policy is the strong controllability of the linear system $(A,B)$, a notion introduced in [9]. In words, strong controllability measures the minimum length and magnitude of control input needed to drive the system to any unit-sized state.

Let $\mathbb{K}$ be a fixed $(\kappa,\gamma)$-strongly stable linear policy for the discussion that follows. We will specify a particular choice for $\mathbb{K}$ in Section 3. We formally define a disturbance-action controller below. The purpose of superimposing a stable linear policy $\mathbb{K}$ on top of the linear-in-perturbation terms is to ensure that the state sequence produced under the execution of a (possibly non-stationary) disturbance-action controller remains bounded.

Definition 2.

A disturbance-action controller (DAC), specified by a horizon $H$ and parameters $M=\left(M^{[0]},\dots,M^{[H-1]}\right)$ with each $M^{[i]}\in\mathbb{R}^{n\times m}$, chooses the action at time $t$ as

u_{t}(M)\stackrel{\text{def}}{=}\mathbb{K}x_{t}+\sum_{i=1}^{H}M^{[i-1]}w_{t-i},

where $x_{t}$ is the state at time $t$, and $w_{1},\dots,w_{t-1}$ are the past perturbations.

Definition 3.

For any $H\in\mathbb{N}$, $\gamma<1$, $\theta\geq 1$, an $(H,\theta,\gamma)$-DAC policy class is the set of all $H$-horizon DAC policies whose parameters $M=\left(M^{[0]},\dots,M^{[H-1]}\right)$ satisfy $\|M^{[i]}\|\leq\theta(1-\gamma)^{i}$ for all $i$.
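As an illustration of Definitions 2 and 3, the following sketch computes a DAC action; `K_stab` plays the role of $\mathbb{K}$, `M` is the list $(M^{[0]},\dots,M^{[H-1]})$, and `w_hist` is assumed to hold the observed perturbations with `w_hist[-i]` equal to $w_{t-i}$ (perturbations before time 1 are treated as zero).

```python
import numpy as np

def dac_action(K_stab, M, x_t, w_hist):
    """Return u_t = K_stab x_t + sum_{i=1}^{H} M^{[i-1]} w_{t-i}."""
    u = K_stab @ x_t
    for i in range(1, len(M) + 1):
        if i <= len(w_hist):               # w_{t-i} with t - i < 1 contributes nothing
            u = u + M[i - 1] @ w_hist[-i]
    return u
```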

2.2 Performance measures

This paper considers multiple criteria that may be used to assess the learner’s performance. We introduce these below.

Let $w_{1:T}$ be the perturbation sequence the dynamics are subject to. Given foreknowledge of this sequence, we define the following notions of optimal cost; note that these notions are infeasible in the sense that no online learner can match them on all instances.

1. $OPT_{*}(w_{1:T})\stackrel{\text{def}}{=}\min_{u_{1:T}\in\Pi_{\textsc{All}}}J_{T}(u_{1:T}|w_{1:T})$ is the cost associated with the best sequence of control inputs given the perturbation sequence. No policy, causal or otherwise, can attain a cost smaller than $OPT_{*}(w_{1:T})$ on the perturbation sequence $w_{1:T}$.

2. For any policy class $\Pi$, $OPT_{\Pi}(w_{1:T})\stackrel{\text{def}}{=}\min_{\pi\in\Pi}J_{T}(\pi|w_{1:T})$ is the cost of the best policy in $\Pi$, subject to the perturbation sequence. Note that $OPT_{*}(w_{1:T})=OPT_{\Pi_{\textsc{All}}}(w_{1:T})$.

With respect to these baselines, we define the following performance measures.

Competitive Ratio (CR): The competitive ratio of an (online) learner $\mathcal{A}$ is the worst-case ratio of its cost to the optimal offline cost $OPT_{*}(w_{1:T})$ over all possible perturbation sequences,

\alpha_{T}(\mathcal{A})\stackrel{\text{def}}{=}\max_{w_{1:T}}\frac{J_{T}(\mathcal{A}|w_{1:T})}{OPT_{*}(w_{1:T})}.

The optimal competitive ratio is the competitive ratio of the best strictly causal controller.

\alpha^{*}_{T}\stackrel{\text{def}}{=}\min_{\mathcal{A}\in\Pi_{\textsc{SC}}}\alpha_{T}(\mathcal{A}).

The infinite-horizon optimal competitive ratio $\alpha^{*}=\lim_{T\to\infty}\alpha^{*}_{T}$ is defined as the limiting optimal competitive ratio as the horizon extends to infinity, whenever this limit exists.

Regret: On any perturbation sequence $w_{1:T}$, given a policy class $\Pi$, the regret of an online learner $\mathcal{A}$ is defined as the excess aggregate cost incurred in comparison to that of the best policy in $\Pi$,

R_{T,\Pi}(\mathcal{A}|w_{1:T})=J_{T}(\mathcal{A}|w_{1:T})-OPT_{\Pi}(w_{1:T}).

The worst-case regret is defined as the maximum regret attainable over all perturbation sequences,

R_{T,\Pi}(\mathcal{A})=\max_{w_{1:T}}R_{T,\Pi}(\mathcal{A}|w_{1:T}).
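On a finite test bed these two measures can be estimated as follows; this is only an illustrative sketch, with `cost_alg`, `cost_opt`, and `cost_best_in_class` assumed to hold $J_{T}(\mathcal{A}|w_{1:T})$, $OPT_{*}(w_{1:T})$, and $OPT_{\Pi}(w_{1:T})$ for each disturbance sequence in the test set (the formal definitions take a maximum over all bounded sequences).

```python
def empirical_competitive_ratio(cost_alg, cost_opt):
    """Worst ratio of the learner's cost to the offline optimal cost over the test set."""
    return max(a / o for a, o in zip(cost_alg, cost_opt))

def empirical_worst_case_regret(cost_alg, cost_best_in_class):
    """Worst excess cost of the learner over the best policy in the class, over the test set."""
    return max(a - b for a, b in zip(cost_alg, cost_best_in_class))
```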

The two types of performance guarantees introduced above are qualitatively different in terms of the bound they espouse and the baseline they compare to. In particular:

Tighter bound for low regret: A sub-linear regret guarantee implies that the average costs of the learner and the baseline asymptotically match, while even an optimal competitive-ratio bound promises an average cost at most a constant factor times that of the baseline.

Stronger baseline for CR: Competitive-ratio bounds hold against the optimal unrestricted policy, while regret bounds hold against the best fixed policy from a (typically parametric) policy class.

2.3 Characterization of the optimal CR algorithm

The following explicit characterization of a strictly causal policy that achieves the optimal competitive ratio in the infinite-horizon setting was recently obtained in [16]; this theorem shows that the competitive policy in the original system with state $x\in\mathbb{R}^{m}$ can be viewed as a state-feedback controller in a synthetic system with state $\xi\in\mathbb{R}^{2m}$.

Theorem 4 (Optimal Competitive Policy).

The strictly causal controller with optimal infinite-horizon competitive ratio $\alpha^{*}$ is given by the policy $u_{t}=\widehat{K}\xi_{t}$, where $\widehat{K}\in\mathbb{R}^{n\times 2m}$ and the synthetic state $\xi\in\mathbb{R}^{2m}$ evolves according to the dynamics

\xi_{t+1}=\widehat{A}\xi_{t}+\widehat{B}_{u}u_{t}+\widehat{B}_{w}\widehat{w}_{t+1},
\text{where }\widehat{A}=\begin{bmatrix}A&K\Sigma^{1/2}\\ 0&0\end{bmatrix},\quad\widehat{B}_{u}=\begin{bmatrix}B\\ 0\end{bmatrix},\quad\widehat{B}_{w}=\begin{bmatrix}0\\ I\end{bmatrix},\quad\widehat{w}_{t}=\Sigma^{-1/2}Q^{1/2}\nu_{t}.

The sequence $\nu_{t}$ is recursively defined as $\nu_{t+1}=(A-KQ^{1/2})\nu_{t}+w_{t}$, starting with $\nu_{1}=0$. Here the matrices $\widehat{K},K,\Sigma$ (and the auxiliary matrices $P,\widetilde{B},\widetilde{H}$ and $\widehat{P}$) satisfy

K=APQ^{1/2}\Sigma^{-1},\quad\Sigma=I+Q^{1/2}PQ^{1/2},\quad P=BB^{\top}+APA^{\top}-K\Sigma K^{\top},
\widehat{K}=-(I_{n}+\widehat{B}_{u}^{\top}\widetilde{P}\widehat{B}_{u})^{-1}\widehat{B}_{u}^{\top}\widetilde{P}\widehat{A},\quad\widetilde{B}=\begin{bmatrix}\widehat{B}_{u}&\widehat{B}_{w}\end{bmatrix},\quad\widetilde{H}=\begin{bmatrix}I&0\\ 0&-\alpha^{*}I\end{bmatrix}+\widetilde{B}^{\top}\widehat{P}\widetilde{B},
\widetilde{P}=\widehat{P}-\widehat{P}\widehat{B}_{w}(-\alpha^{*}I_{p}+\widehat{B}_{w}^{\top}\widehat{P}\widehat{B}_{w})^{-1}\widehat{B}_{w}^{\top}\widehat{P},
\text{and }\quad\widehat{P}=\begin{bmatrix}Q&Q^{1/2}\Sigma^{1/2}\\ Q^{1/2}\Sigma^{1/2}&\Sigma\end{bmatrix}+\widehat{A}^{\top}\widehat{P}\widehat{A}-\widehat{A}^{\top}\widehat{P}\widetilde{B}\widetilde{H}^{-1}\widetilde{B}^{\top}\widehat{P}\widehat{A}.

Furthermore, let $\{x_{t}\}_{t=1}^{T}$ be the state sequence produced under the execution of such a policy. Then the state sequence satisfies, for all times $t$, $\xi_{t}=\begin{bmatrix}x_{t}-\nu_{t}\\ \widehat{w}_{t}\end{bmatrix}$.
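For concreteness, a minimal sketch of executing this controller online is given below, assuming the matrices $\widehat{K}$, $K$, $\Sigma$ have already been computed offline from the fixed-point equations above (the offline computation itself is not shown); scipy's `sqrtm` is used for the matrix square roots.

```python
import numpy as np
from scipy.linalg import sqrtm

def run_competitive_controller(A, B, K_hat, K, Sigma, Q, w):
    """Roll out u_t = K_hat xi_t with xi_t = [x_t - nu_t; Sigma^{-1/2} Q^{1/2} nu_t]."""
    m = A.shape[0]
    Q_half = np.real(sqrtm(Q))
    Sigma_inv_half = np.linalg.inv(np.real(sqrtm(Sigma)))
    x, nu = np.zeros(m), np.zeros(m)                 # x_1 = 0, nu_1 = 0
    xs, us = [], []
    for w_t in w:
        xi = np.concatenate([x - nu, Sigma_inv_half @ Q_half @ nu])
        u = K_hat @ xi                               # u_t = K_hat xi_t
        xs.append(x); us.append(u)
        nu = (A - K @ Q_half) @ nu + w_t             # nu_{t+1} = (A - K Q^{1/2}) nu_t + w_t
        x = A @ x + B @ u + w_t                      # x_{t+1} = A x_t + B u_t + w_t
    return np.array(xs), np.array(us)
```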

Let $\widehat{K}_{0}\in\mathbb{R}^{n\times m}$ be the sub-matrix formed by the first $m$ columns of $\widehat{K}$. In general, the infinite-horizon optimal competitive ratio may not be finite. However, the stability of the associated filtering operation (i.e. $|\lambda_{\max}(A-KQ^{1/2})|<1$) and of the closed-loop control system (i.e. $|\lambda_{\max}(A+B\widehat{K}_{0})|<1$) is sufficient to ensure the existence of this limit. We utilize the following bounds that quantify this.

Assumption 1.

$\widehat{K}_{0}$ is $(\kappa,\gamma)$-strongly stable with respect to the linear system $(A,B)$, and $-K^{\top}$ is $(\kappa,\gamma)$-strongly stable with respect to the linear system $(A^{\top},Q^{1/2})$. Also, $\|\widehat{K}\|\leq\kappa$.

We note that the above bounds are quantifications, and not strengthenings, of the stability criterion. In particular, any stable controller is strongly stable for some $\kappa\geq 1,\gamma<1$. Here, we use the same parameters to state strong stability for both controllers, $K$ and $\widehat{K}$, for convenience. Such a simplification is valid, since given $(\kappa_{1},\gamma_{1})$- and $(\kappa_{2},\gamma_{2})$-strongly stable controllers, both controllers are also $(\max\{\kappa_{1},\kappa_{2}\},\min\{\gamma_{1},\gamma_{2}\})$-strongly stable.

2.4 Low-regret algorithms

Deviating from the methodologies of optimal and robust control, [2] propose an online control formulation in which the noise is adversarial, and thus the optimal controller is only defined in hindsight. This motivates different, online-learning based methods for the control task. [2] proposed an algorithm called GPC (Gradient Perturbation Controller) and showed the following theorem (which we restate in notation consistent with this paper), which for LTI systems shows that the regret compared against any strongly stable policy scales at most as $O(\sqrt{T})$.

Theorem 5.

Given any $\kappa,\gamma$, let $\mathcal{K}(\kappa,\gamma)$ be the set of $(\kappa,\gamma)$-strongly stable linear policies. There exists an algorithm $\mathcal{A}$ such that the following holds:

J_{T}(\mathcal{A}|w_{1:T})-\min_{K\in\mathcal{K}(\kappa,\gamma)}J_{T}(K|w_{1:T})\leq O(\sqrt{T}\log(T)).

Here $O(\cdot)$ contains polynomial factors depending on the system constants.

As can be observed from the analysis presented in [2], the above regret guarantee holds not just against the set of strongly stable linear policies but also against the set of $(H,\theta,\gamma)$-DAC policies. The regret bound has been extended to different interaction models such as unknown systems [21], partial observation [30], and adaptive regret [19]. Furthermore, the regret bound in this setting has been improved to logarithmic in $T$ [13, 28].
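For intuition only, the following is a highly simplified online-gradient update over DAC parameters in the spirit of the GPC algorithm; the actual algorithm of [2] differentiates a counterfactual loss with memory over the last $H$ steps, whereas this sketch differentiates only the direct dependence of $u_{t}$ on $M$ and should not be read as the method of [2]. The names (`lr`, `theta`) and the projection step are assumptions.

```python
import numpy as np

def simplified_gpc_step(M, K_stab, R, x_t, w_hist, lr, theta, gamma):
    """One illustrative update: play the DAC action, then descend on its control cost."""
    k = min(len(M), len(w_hist))
    u = K_stab @ x_t + sum(M[i] @ w_hist[-(i + 1)] for i in range(k))
    for i in range(k):
        grad_i = 2.0 * np.outer(R @ u, w_hist[-(i + 1)])   # d(u^T R u)/dM^{[i]}, direct term only
        M[i] = M[i] - lr * grad_i
        bound = theta * (1.0 - gamma) ** i                 # keep M inside the (H, theta, gamma)-DAC class
        norm = np.linalg.norm(M[i], 2)
        if norm > bound:
            M[i] *= bound / norm
    return M, u
```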

3 Statement of results

3.1 Main Result

The central observation we make is that regret-minimizing algorithms subject to certain qualifications automatically achieve an optimal competitive ratio bound up to a vanishing average cost term.

Typical regret-minimizing online control algorithms [2, 22] compete against the class of stable linear state-feedback policies. In general, neither the offline optimal policy (with cost $OPT_{*}$) nor the optimal competitive-ratio policy can be approximated by a linear policy [15]. However, the algorithm proposed in [2] and follow-up works that build on it also compete with a more expressive class, that of disturbance-action policies (DACs). In [2], this choice was made purely for computational reasons, to circumvent the non-convexity of the cost associated with linear policies; in this work, however, we use the flexibility of DAC policies to approximate the optimal competitive policy.

More formally, we prove that, by taking the history length $H$ of the DAC to be sufficiently large, we can find a DAC whose sequence of states and control actions closely tracks that generated by the optimal competitive policy. This structural characterization of the competitive policy suffices to derive our best-of-both-worlds result: a regret-minimizing learner that competes with an appropriately defined DAC class also competes with the policy achieving the optimal competitive ratio, and hence achieves the optimal competitive ratio up to a residual regret term.

Theorem 6 (Optimal Competitive Policy is Approximately DAC).

Fix a horizon $T$ and a disturbance bound $W$. For any $\varepsilon>0$, set

H=\frac{\log\left(\frac{1088W^{2}\kappa^{11}\max(1,\beta^{2})T}{\gamma^{4}\varepsilon}\right)}{\log\left((1-\gamma/2)^{-1}\right)},\qquad\theta=2\kappa^{2}\max(1,\beta^{1/2}),

and define $\mathcal{M}$ to be the set of $(H,\theta,\gamma)$-DAC policies with stabilizing component $\mathbb{K}=\widehat{K}_{0}$. Let $\mathcal{A}$ be the algorithm with the optimal competitive ratio $\alpha^{*}$. Then there exists a policy $\pi\in\mathcal{M}$ such that for any perturbations $w_{1:T}$ satisfying $\|w_{t}\|\leq W$, the cost incurred by $\pi$ satisfies

J_{T}(\pi|w_{1:T})<J_{T}(\mathcal{A}|w_{1:T})+\varepsilon.

We prove this theorem in Section 4. We now show that this result implies best-of-both-worlds.
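For reference, a small helper computing the parameters $H$ and $\theta$ prescribed by Theorem 6 (constants copied from the statement; this is a sketch, and the rounding is an assumption):

```python
import math

def dac_parameters(W, kappa, beta, gamma, eps, T):
    """Return (H, theta) as prescribed by Theorem 6."""
    theta = 2.0 * kappa ** 2 * max(1.0, math.sqrt(beta))
    growth = 1088.0 * W ** 2 * kappa ** 11 * max(1.0, beta ** 2) * T / (gamma ** 4 * eps)
    H = math.ceil(math.log(growth) / math.log(1.0 / (1.0 - gamma / 2.0)))
    return H, theta
```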

3.2 Best-of-Both-Worlds in Known Systems

We begin by considering the case when the learner knows the linear system $(A,B)$. In this setting, both the regret and the competitive ratio measure the additional cost imposed by not knowing the perturbations in advance. The result below utilizes the regret bounds against DAC policies and the associated algorithms from [2, 28].

Theorem 7 (Best-of-both-worlds in online control (known system)).

Assuming $(A,B)$ is known to the learner, there exists a constant

R_{T}=\tilde{O}\left(\operatorname{poly}(m,n,\beta,\kappa,\gamma^{-1})W^{2}\times\min\left\{\sqrt{T},\operatorname{poly}(\mu^{-1})\operatorname{polylog}T\right\}\right)

and a computationally efficient online control algorithm $\mathcal{A}$ which simultaneously achieves the following performance guarantees:

1. (Optimal competitive ratio) The cost of $\mathcal{A}$ satisfies, for any perturbation sequence $w_{1:T}$,

   J_{T}(\mathcal{A}|w_{1:T})<\alpha^{*}\cdot OPT_{*}(w_{1:T})+R_{T},

   where $\alpha^{*}$ is the optimal competitive ratio.

2. (Low regret) The regret of $\mathcal{A}$ relative to the best linear state-feedback or DAC policy selected in hindsight grows sub-linearly in the horizon $T$, i.e. for all $w_{1:T}$ it holds that

   J_{T}(\mathcal{A}|w_{1:T})<\min_{\pi\in\mathcal{K}}J_{T}(\pi|w_{1:T})+R_{T}\quad\text{ and }\quad J_{T}(\mathcal{A}|w_{1:T})<\min_{\pi\in\mathcal{M}}J_{T}(\pi|w_{1:T})+R_{T}.

3.3 Best-of-Both-Worlds in Unknown Systems

We now present our main results for the online control of an unknown linear dynamical system. The first theorem deals with the case when the learner has coarse-grained information about the linear system $(A,B)$ in the form of access to a stabilizing controller $\mathbb{K}$. In general, it is sufficient to know $(A,B)$ up to some constant accuracy in order to compute such a stabilizing controller, as noted in [10]. This theorem utilizes the low-regret algorithms from [22, 28].

Theorem 8 (Best-of-both-worlds in online control (unknown system, given a stabilizing controller)).

For a $(k,\kappa)$-strongly controllable linear dynamical system $(A,B)$, there exists a constant

R_{T}=\tilde{O}\left(\operatorname{poly}(m,n,\beta,k,\kappa,\gamma^{-1})W^{2}\times\min\left\{T^{2/3},\operatorname{poly}(\mu^{-1})\sqrt{T}\right\}\right)

and a computationally efficient online control algorithm $\mathcal{A}$ such that, when given access to a $(\kappa,\gamma)$-strongly stable initial controller $\mathbb{K}$, it guarantees

J_{T}(\mathcal{A}|w_{1:T})<\min\left\{\alpha^{*}\cdot OPT_{*}(w_{1:T}),\ \min_{\pi\in\mathcal{K}}J_{T}(\pi|w_{1:T}),\ \min_{\pi\in\mathcal{M}}J_{T}(\pi|w_{1:T})\right\}+R_{T}.

When an initial stabilizing controller is unavailable, we make use of the “blackbox control” algorithm in [8] to establish the next theorem.

Theorem 9 (Best-of-both-worlds in online control (unknown system, blackbox control)).

For a $(k,\kappa)$-strongly controllable linear dynamical system $(A,B)$, there exists a constant

R_{T}=2^{\operatorname{poly}(m,n,\beta,k,\kappa,\gamma^{-1})}+\tilde{O}\left(\operatorname{poly}(m,n,\beta,k,\kappa,\gamma^{-1})W^{2}\times\min\left\{T^{2/3},\operatorname{poly}(\mu^{-1})\sqrt{T}\right\}\right)

and a computationally efficient online control algorithm $\mathcal{A}$ that guarantees

J_{T}(\mathcal{A}|w_{1:T})<\min\left\{\alpha^{*}\cdot OPT_{*}(w_{1:T}),\ \min_{\pi\in\mathcal{K}}J_{T}(\pi|w_{1:T}),\ \min_{\pi\in\mathcal{M}}J_{T}(\pi|w_{1:T})\right\}+R_{T}.

4 Proof of Theorem 6

Proof of Theorem 6.

Unrolling the recursive definition of $\nu_{t}$ in Theorem 4, we see that

\nu_{t}=\sum_{i=1}^{t-1}(A-KQ^{1/2})^{i-1}w_{t-i}.

Let $\{x_{t}\}_{t=1}^{T}$ be the state sequence generated by the competitive control policy in the original $m$-dimensional system, and let $\{\xi_{t}\}_{t=1}^{T}$ be the sequence of states generated by the competitive policy in the $2m$-dimensional synthetic system. Additionally, let $\widehat{K}=\begin{bmatrix}\widehat{K}_{0}&\widehat{K}_{1}\end{bmatrix}$ be the partition of $\widehat{K}$ into the first $m$ columns and the remaining $m$. Recall that $A+B\widehat{K}_{0}=SLS^{-1}$, where $\|S\|\|S^{-1}\|\leq\kappa$ and $\|L\|\leq 1-\gamma$. Using Theorem 4, we decompose $x_{t}$ as a linear combination of the disturbance terms:

x_{t+1} =Ax_{t}+B\widehat{K}\xi_{t}+w_{t}=Ax_{t}+B\begin{bmatrix}\widehat{K}_{0}&\widehat{K}_{1}\end{bmatrix}\begin{bmatrix}x_{t}-\nu_{t}\\ \Sigma^{-1/2}Q^{1/2}\nu_{t}\end{bmatrix}+w_{t}
=(A+B\widehat{K}_{0})x_{t}+B(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})\nu_{t}+w_{t}
=\sum_{i=0}^{t-1}\left((A+B\widehat{K}_{0})^{i}+\sum_{j=1}^{i}(A+B\widehat{K}_{0})^{i-j}B(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{j-1}\right)w_{t-i}.

We now describe a DAC policy which approximates the competitive control policy; recall that every DAC policy is parameterized by a stabilizing controller and a set of $H$ weights. We take the stabilizing controller to be $\widehat{K}_{0}$ and the set of weights to be $M=(M^{[0]},\ldots,M^{[H-1]})$, where

M^{[i-1]}\stackrel{\text{def}}{=}(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{i-1}.

The action generated by our DAC approximation of the competitive policy is therefore

u_{t}(M)\stackrel{\text{def}}{=}\widehat{K}_{0}x_{t}(M)+\sum_{i=1}^{H}M^{[i-1]}w_{t-i},\qquad(1)

where $x_{t}(M)$ is the corresponding state sequence generated by the DAC.
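As a concrete illustration of this construction (not part of the proof), the weights can be formed as follows, assuming the matrices of Theorem 4 are available; the resulting list can be fed to a generic DAC execution routine such as the `dac_action` sketch given after Definition 3.

```python
import numpy as np
from scipy.linalg import sqrtm

def competitive_dac_weights(A, K, Q, Sigma, K_hat_0, K_hat_1, H):
    """Return M^{[i-1]} = (K_hat_1 Sigma^{-1/2} Q^{1/2} - K_hat_0)(A - K Q^{1/2})^{i-1}, i = 1..H."""
    Q_half = np.real(sqrtm(Q))
    Sigma_inv_half = np.linalg.inv(np.real(sqrtm(Sigma)))
    base = K_hat_1 @ Sigma_inv_half @ Q_half - K_hat_0
    F = A - K @ Q_half
    M, F_pow = [], np.eye(A.shape[0])      # F_pow holds (A - K Q^{1/2})^{i-1}
    for _ in range(H):
        M.append(base @ F_pow)
        F_pow = F_pow @ F
    return M
```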

We now show that the weights $M^{[i-1]}$ decay geometrically. Recall that $A+B\widehat{K}_{0}=S_{1}L_{1}S_{1}^{-1}$, where $\|S_{1}\|\|S_{1}^{-1}\|\leq\kappa$ and $\|L_{1}\|\leq 1-\gamma$. Similarly, $A-KQ^{1/2}=S_{2}L_{2}S_{2}^{-1}$, where $\|S_{2}\|\|S_{2}^{-1}\|\leq\kappa$ and $\|L_{2}\|\leq 1-\gamma$. Recall that $\Sigma=I+Q^{1/2}PQ^{1/2}\succeq I$, and therefore $\|\Sigma^{-1/2}\|\leq 1$. We see that

\|M^{[i-1]}\|\leq\|\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0}\|\|(A-KQ^{1/2})^{i-1}\|\leq 2\kappa^{2}\max(1,\|Q\|^{1/2})(1-\gamma)^{i-1},\qquad(2)

where we used the fact that $\kappa\geq\|\widehat{K}\|$ and hence $\kappa\geq\max(\|\widehat{K}_{0}\|,\|\widehat{K}_{1}\|)$. We now bound the distance between $x_{t}$ and $x_{t}(M)$. Plugging our choice of $u_{t}$ given in (1) into the dynamics of $x$, we see that

x_{t+1}(M) =Ax_{t}(M)+B\left(\widehat{K}_{0}x_{t}(M)+\sum_{i=1}^{H}M^{[i-1]}w_{t-i}\right)+w_{t}
=\sum_{i=0}^{t-1}\left((A+B\widehat{K}_{0})^{i}+\sum_{j=1}^{\min\{H,i\}}(A+B\widehat{K}_{0})^{i-j}BM^{[j-1]}\right)w_{t-i}.

Comparing the expansions of $x_{t}$ and $x_{t}(M)$ on a per-term basis, we observe that the terms in their difference contain either $(A+B\widehat{K}_{0})^{H/2}$ or $(A-KQ^{1/2})^{H/2}$, and thus their difference is exponentially small in $H/2$ under the strong stability of $\widehat{K}_{0}$ and $K$. Formally, we have

Claim 10.

\|x_{t+1}-x_{t+1}(M)\|\leq\frac{16W\kappa^{4}\max(1,\|Q\|^{1/2})}{\gamma^{2}}(1-\gamma/2)^{H}.

We now bound the distance between $u_{t}$ and $u_{t}(M)$. Unrolling the dynamics, we get the following:

u_{t} =\widehat{K}\xi_{t}=\widehat{K}_{0}x_{t}+\sum_{i=1}^{t-1}(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{i-1}w_{t-i},
u_{t}(M)=\widehat{K}_{0}x_{t}(M)+\sum_{i=1}^{H}(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{i-1}w_{t-i}.

Similar to the argument for Claim 10, we have the following:

Claim 11.

\|u_{t}-u_{t}(M)\|\leq\frac{20W\kappa^{5}\max(1,\|Q\|^{1/2})}{\gamma^{2}}(1-\gamma/2)^{H},

where we used the strong stability of $K^{\top}$, the bound on $\|x_{t}-x_{t}(M)\|$ given in Claim 10, and the fact that $\kappa\geq\max(1,\|\widehat{K}_{0}\|,\|\widehat{K}_{1}\|)$. We now show that by taking $H$ to be sufficiently large we can ensure that $J_{T}(\pi|w_{1:T})<J_{T}(\mathcal{A}|w_{1:T})+\varepsilon$. We utilize the following bound on $x_{t}(M),u_{t}(M)$:

Lemma 12.

The execution of any policy from an $(H,\theta,\gamma^{\prime})$-DAC policy class produces state-action sequences that obey, for all times $t$, $\|x_{t}\|,\|u_{t}\|\leq 3\kappa^{3}\theta W/(\gamma\gamma^{\prime})$.

In light of (2), we see that we can take $\theta=2\kappa^{2}\max(1,\|Q\|^{1/2})$, leading to the bound

\max(\|x_{t}\|,\|u_{t}\|)\leq\frac{6W\kappa^{5}\max(1,\|Q\|^{1/2})}{\gamma^{2}}.

Using the easily-verified inequality $|\|x\|^{2}-\|y\|^{2}|\leq\|x-y\|\left(2\|y\|+\|x-y\|\right)$, we put the pieces together to bound the difference in costs incurred by the competitive policy and our DAC approximation in each timestep:

|x_{t}^{\top}Qx_{t}-x_{t}(M)^{\top}Qx_{t}(M)| \leq \|Q\|\|x_{t}-x_{t}(M)\|\left(2\|x_{t}(M)\|+\|x_{t}-x_{t}(M)\|\right)
\leq \frac{448W^{2}\kappa^{10}\max(1,\|Q\|)\|Q\|}{\gamma^{4}}(1-\gamma/2)^{H}.

Similarly,

|u_{t}^{\top}Ru_{t}-u_{t}(M)^{\top}Ru_{t}(M)| \leq \|R\|\|u_{t}-u_{t}(M)\|\left(2\|u_{t}(M)\|+\|u_{t}-u_{t}(M)\|\right)
\leq \frac{640W^{2}\kappa^{11}\max(1,\|Q\|)\|R\|}{\gamma^{4}}(1-\gamma/2)^{H}.

The difference in aggregate cost is therefore

|J_{T}(\pi|w_{1:T})-J_{T}(\mathcal{A}|w_{1:T})|\leq\frac{1088W^{2}\kappa^{11}\max(1,\beta^{2})}{\gamma^{4}}(1-\gamma/2)^{H}T.

Taking $H\geq\log\left({1088W^{2}\kappa^{11}\max(1,\beta^{2})T}/({\gamma^{4}\varepsilon})\right)/\log\left((1-\gamma/2)^{-1}\right)$ concludes the proof. ∎

5 Experiments

In Figure 1 we compare the performance of various controllers, namely, the $H_{2}$ controller, the $H_{\infty}$ controller, the infinite-horizon competitive controller from [16], the GPC controller from [2], and the “clairvoyant offline” controller which selects the optimal-in-hindsight sequence of controls. We carry out this comparison on a two-dimensional double integrator system with different noise sequences. As the results of our paper suggest, we confirm that the GPC controller, which attains the best-of-both-worlds guarantee, is indeed the best-performing controller; in particular, it matches and sometimes improves over the performance of competitive control. Further experimental details, along with more simulations on different systems, can be found in the appendix.

Figure 1: Relative performance of the linear-quadratic controllers in the double integrator system, under (a) sinusoidal perturbations, (b) constant perturbations, (c) Gaussian perturbations, and (d) Gaussian random walk perturbations.

6 Conclusions, Open Problems and Limitations

We have proved that the optimal competitive policy in an LTI dynamical system is well-approximated by the class of disturbance-action control (DAC) policies. This implies that the Gradient Perturbation Controller (GPC) algorithm and related approaches attain sublinear regret vs. this policy, even when the dynamical system is unknown ahead of time. This is the first time that a control method has been shown to attain both sublinear regret vs. a large policy class and, simultaneously, a competitive ratio vs. the optimal dynamic policy in hindsight (up to a vanishing additive term). It remains open to extend our results to time-varying and nonlinear systems; the recent methods of [26, 19] are a potentially good starting point.

Acknowledgements

Elad Hazan gratefully acknowledges funding from NSF grant #1704860.

References

  • [1] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
  • [2] Naman Agarwal, Brian Bullins, Elad Hazan, Sham Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning, pages 111–119, 2019.
  • [3] Oren Anava, Elad Hazan, and Shie Mannor. Online learning for adversaries with memory: price of past mistakes. Advances in Neural Information Processing Systems, 28, 2015.
  • [4] Lachlan Andrew, Siddharth Barman, Katrina Ligett, Minghong Lin, Adam Meyerson, Alan Roytman, and Adam Wierman. A tale of two metrics: Simultaneous bounds on competitiveness and regret. In Conference on Learning Theory, pages 741–763. PMLR, 2013.
  • [5] Allan Borodin and Ran El-Yaniv. Online computation and competitive analysis. Cambridge University Press, 2005.
  • [6] Allan Borodin, Nathan Linial, and Michael E Saks. An optimal on-line algorithm for metrical task system. Journal of the ACM (JACM), 39(4):745–763, 1992.
  • [7] Niangjun Chen, Gautam Goel, and Adam Wierman. Smoothed online convex optimization in high dimensions via online balanced descent. In Conference On Learning Theory, pages 1574–1594. PMLR, 2018.
  • [8] Xinyi Chen and Elad Hazan. Black-box control for linear dynamical systems. In Conference on Learning Theory, pages 1114–1143. PMLR, 2021.
  • [9] Alon Cohen, Avinatan Hasidim, Tomer Koren, Nevena Lazic, Yishay Mansour, and Kunal Talwar. Online linear quadratic control. In International Conference on Machine Learning, pages 1029–1038. PMLR, 2018.
  • [10] Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently with only $\sqrt{T}$ regret. In International Conference on Machine Learning, pages 1300–1309, 2019.
  • [11] Amit Daniely and Yishay Mansour. Competitive ratio vs regret minimization: achieving the best of both worlds. In Algorithmic Learning Theory, pages 333–368. PMLR, 2019.
  • [12] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
  • [13] Dylan Foster and Max Simchowitz. Logarithmic regret for adversarial online control. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3211–3221. PMLR, 13–18 Jul 2020.
  • [14] Dylan Foster and Max Simchowitz. Logarithmic regret for adversarial online control. In International Conference on Machine Learning, pages 3211–3221. PMLR, 2020.
  • [15] Gautam Goel and Babak Hassibi. The power of linear controllers in lqr control. arXiv preprint arXiv:2002.02574, 2020.
  • [16] Gautam Goel and Babak Hassibi. Competitive control. arXiv preprint arXiv:2107.13657, 2021.
  • [17] Gautam Goel, Yiheng Lin, Haoyuan Sun, and Adam Wierman. Beyond online balanced descent: An optimal algorithm for smoothed online optimization. Advances in Neural Information Processing Systems, 32, 2019.
  • [18] Gautam Goel and Adam Wierman. An online algorithm for smoothed regression and lqr control. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2504–2513. PMLR, 2019.
  • [19] Paula Gradu, Elad Hazan, and Edgar Minasyan. Adaptive regret for control of time-varying dynamics. arXiv preprint arXiv:2007.04393, 2020.
  • [20] Elad Hazan. Introduction to online convex optimization. arXiv preprint arXiv:1909.05207, 2019.
  • [21] Elad Hazan, Sham Kakade, and Karan Singh. The nonstochastic control problem. In Aryeh Kontorovich and Gergely Neu, editors, Proceedings of the 31st International Conference on Algorithmic Learning Theory, volume 117 of Proceedings of Machine Learning Research, pages 408–421. PMLR, 08 Feb–11 Feb 2020.
  • [22] Elad Hazan, Sham Kakade, and Karan Singh. The nonstochastic control problem. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, pages 408–421. PMLR, 2020.
  • [23] Elad Hazan and Comandur Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of the 26th annual international conference on machine learning, pages 393–400, 2009.
  • [24] Elad Hazan and Karan Singh. Tutorial: online and non-stochastic control, July 2021.
  • [25] Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, pages 10154–10164, 2019.
  • [26] Edgar Minasyan, Paula Gradu, Max Simchowitz, and Elad Hazan. Online control of unknown time-varying dynamical systems. Advances in Neural Information Processing Systems, 34, 2021.
  • [27] Guanya Shi, Yiheng Lin, Soon-Jo Chung, Yisong Yue, and Adam Wierman. Online optimization with memory and competitive control. Advances in Neural Information Processing Systems, 33:20636–20647, 2020.
  • [28] Max Simchowitz. Making non-stochastic control (almost) as easy as stochastic. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 18318–18329. Curran Associates, Inc., 2020.
  • [29] Max Simchowitz and Dylan Foster. Naive exploration is optimal for online lqr. In International Conference on Machine Learning, pages 8937–8948. PMLR, 2020.
  • [30] Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control, 2020.
  • [31] Daniel D Sleator and Robert E Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28(2):202–208, 1985.
  • [32] Andras Varga. Detection and isolation of actuator/surface faults for a large transport aircraft. In Fault Tolerant Flight Control, pages 423–448. Springer, 2010.
  • [33] Lijun Zhang, Tianbao Yang, Zhi-Hua Zhou, et al. Dynamic regret of strongly adaptive methods. In International conference on machine learning, pages 5882–5891. PMLR, 2018.

Appendix A Appendix Layout

In Section B, we provide the proofs of the claims necessary to conclude Theorem 6. Following this, Sections C and D prove the results for known and unknown systems, respectively. Thereafter, we prove a generalization of Theorem 6, necessary for the case of unknown systems, in Section E. Finally, Section F discusses the translation of the optimality results from the infinite-horizon to the finite-horizon setting. Section G provides further details on the experimental setup, along with more experimental evaluations.

Appendix B Proofs of claims substantiating Theorem 6

Proof of Claim 10.

Using the strong stability of $\widehat{K}_{0}$ and $K^{\top}$, we have

\|x_{t+1}-x_{t+1}(M)\| \leq \sum_{i\geq H+1}\sum_{j=H+1}^{i}\left\|(A+B\widehat{K}_{0})^{i-j}B(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{j-1}\right\|W
\leq 2W\kappa^{4}\max(1,\|Q\|^{1/2})\sum_{i\geq H+1}i(1-\gamma)^{i-1}
\leq \frac{8W\kappa^{4}\max(1,\|Q\|^{1/2})}{\gamma}\sum_{i\geq H+1}(1-\gamma/2)^{i}
\leq \frac{16W\kappa^{4}\max(1,\|Q\|^{1/2})}{\gamma^{2}}(1-\gamma/2)^{H},

where in the penultimate line we used the following elementary result:

Lemma 13.

For all $\gamma\in[0,1/2]$ and all $i\geq 0$, the following inequality holds:

i(1-\gamma)^{i}\leq 2(1-\gamma/2)^{i}/\gamma.

Proof of Claim 11.

Recall that by unrolling the dynamics, we get the following:

u_{t} =\widehat{K}\xi_{t}=\widehat{K}_{0}(x_{t}-\nu_{t})+\widehat{K}_{1}\widehat{w}_{t}=\widehat{K}_{0}x_{t}+\sum_{i=1}^{t-1}(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{i-1}w_{t-i},
u_{t}(M)=\widehat{K}_{0}x_{t}(M)+\sum_{i=1}^{H}(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{i-1}w_{t-i}.

Using the strong stability of $K^{\top}$, we bound the point-wise difference between these sequences as

\|u_{t}-u_{t}(M)\| \leq \|\widehat{K}_{0}\|\|x_{t}-x_{t}(M)\|+\sum_{i\geq H+1}\|(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{i-1}\|W
\leq \frac{20W\kappa^{5}\max(1,\|Q\|^{1/2})}{\gamma^{2}}(1-\gamma/2)^{H}.

Proof of Lemma 12.

Since $\mathbb{K}$ is $(\kappa,\gamma)$-strongly stable, there exist matrices $S,L$ such that $SLS^{-1}=A+B\mathbb{K}$ with $\|S\|\|S^{-1}\|\leq\kappa$ and $\|L\|\leq 1-\gamma$. Also recall that every member of an $(H,\theta,\gamma^{\prime})$-DAC class satisfies $\|M^{[i]}\|\leq\theta(1-\gamma^{\prime})^{i}$. Observe for any $t\geq 0$ that

\|x_{t+1}\| =\left\|\sum_{i\geq 0}(A+B\mathbb{K})^{i}\left(w_{t-i}+B\sum_{j=1}^{H}M^{[j-1]}w_{t-i-j}\right)\right\|
\leq \left(\sum_{i\geq 0}\|(A+B\mathbb{K})^{i}\|\right)\max_{i}\left\|w_{t-i}+B\sum_{j=1}^{H}M^{[j-1]}w_{t-i-j}\right\|
\leq \left(\sum_{i\geq 0}\|(A+B\mathbb{K})^{i}\|\right)\max_{i}\left(\|w_{t-i}\|+\|B\|\max_{j}\|w_{t-i-j}\|\sum_{j=1}^{H}\|M^{[j-1]}\|\right)
\leq \frac{\kappa}{\gamma}W\left(1+\frac{\kappa\theta}{\gamma^{\prime}}\right)\leq\frac{2\kappa^{2}\theta W}{\gamma\gamma^{\prime}},
\|u_{t}\| =\left\|\mathbb{K}x_{t}+\sum_{i=1}^{H}M^{[i-1]}w_{t-i}\right\|\leq\frac{2\kappa^{3}\theta W}{\gamma\gamma^{\prime}}+\frac{\theta W}{\gamma^{\prime}}\leq\frac{3\kappa^{3}\theta W}{\gamma\gamma^{\prime}},

where the last step uses $\kappa\geq 1$ and $\gamma\leq 1$.

Proof of Lemma 13.

Recall that $xe^{-x}\leq 1$ for all $x$, and therefore $ie^{-\gamma i/2}\leq\frac{2}{\gamma}$. Also, $\sqrt{1-\gamma}\leq 1-\gamma/2$ for all $\gamma\leq 1$. We have

i(1-\gamma)^{i} = i(1-\gamma)^{i/2}(1-\gamma)^{i/2}
\leq ie^{-\gamma i/2}(1-\gamma/2)^{i}
\leq \frac{2}{\gamma}(1-\gamma/2)^{i}.

Appendix C Proofs for known systems

Proof of Theorem 7.

The perturbation sequence $w_{1:T}$ has $T$ terms. Overloading this notation, we extend it to an infinite sequence with $w_{t}=0$ for all $t>T$. Let $\mathcal{K}$ be the class of $(\kappa,\gamma)$-strongly stable linear policies, and let $\mathcal{M}$ be an $(H,\theta,\gamma^{\prime})$-DAC class for

\theta=2\kappa^{2}\max\{1,\beta^{1/2}\},\quad\gamma^{\prime}=\gamma/2,\quad H=2\log\left(1088W^{2}\kappa^{11}\max\{1,\beta^{2}\}T/\gamma^{4}\right)/\gamma.

By the results of [2], using the GPC algorithm (Algorithm 1 in [2]) as the online learner $\mathcal{L}$, we have the following regret guarantee:

J_{T}(\mathcal{L}|w_{1:T})\leq\min_{\pi\in\mathcal{K}\cup\mathcal{M}}J_{T}(\pi|w_{1:T})+\tilde{O}(\operatorname{poly}(m,n,\kappa,\gamma^{-1},\beta)W^{2}\sqrt{T}).

Now, recall that Theorem 6 asserts

\min_{\pi\in\mathcal{M}}J_{T}(\pi|w_{1:T})\leq J_{T}(\mathcal{A}|w_{1:T})+O(1),

where $\mathcal{A}$ is the online algorithm (from Theorem 4) that achieves the optimal competitive ratio $\alpha^{*}$. By the non-negativity of the cost, and thereafter using the optimal competitive ratio bound, we have

J_{T}(\mathcal{L}|w_{1:T}) \leq \min_{\pi\in\mathcal{M}}J_{T}(\pi|w_{1:T})+\tilde{O}(\operatorname{poly}(m,n,\kappa,\gamma^{-1},\beta)W^{2}\sqrt{T})
\leq J_{T}(\mathcal{A}|w_{1:T})+\tilde{O}(\operatorname{poly}(m,n,\kappa,\gamma^{-1},\beta)W^{2}\sqrt{T})
\leq J_{\infty}(\mathcal{A}|w_{1:\infty})+\tilde{O}(\operatorname{poly}(m,n,\kappa,\gamma^{-1},\beta)W^{2}\sqrt{T})
\leq \alpha^{*}\cdot OPT_{*}(w_{1:\infty})+\tilde{O}(\operatorname{poly}(m,n,\kappa,\gamma^{-1},\beta)W^{2}\sqrt{T}).

An appeal to Theorem 15 concludes the $\sqrt{T}$ part of the claim. For the polylogarithmic regret result, we use the algorithm and the associated $O(\operatorname{poly}(\mu^{-1})\operatorname{polylog}T)$ regret guarantee from [28]. ∎

Appendix D Proofs for unknown systems

Proof of Theorem 8.

The proof of this theorem is similar in structure to that of Theorem 7. In particular, we utilize regret-minimizing algorithms from [22] and [28] that apply even when $A,B$ are unknown, given a $(\kappa,\gamma)$-strongly stable controller $\mathbb{K}$ for a $(k,\kappa)$-strongly controllable system. However, we highlight a crucial difference below.

Theorem 6 establishes that the infinite-horizon competitive control policy is contained in the class of DACs that utilize $\widehat{K}_{0}$ as the stabilizing baseline policy. Since this policy is derived from the competitive controller, which in turn depends on knowledge of $(A,B)$, such a $\widehat{K}_{0}$ might not be available when the underlying system is unknown, or even only approximately known. To alleviate this concern, we give an analogous result that holds with respect to any baseline stabilizing policy $\mathbb{K}$, instead of the specific choice $\widehat{K}_{0}$.

Theorem 14 (Generalized policy inclusion).

Fix a horizon $T$. For any $\varepsilon>0$, choose

H=2\log\left(10^{5}\beta^{2}W^{2}\kappa^{16}T^{5}/(\gamma^{4}\varepsilon)\right)/\gamma,\quad\theta=20\kappa^{5}\beta^{1/2}/\gamma,\quad\gamma^{\prime}=\gamma/2,

and define $\mathcal{M}$ to be the set of $(H,\theta,\gamma^{\prime})$-DAC policies with an arbitrary baseline stabilizing policy $\mathbb{K}$. Let $\mathcal{A}$ be an online algorithm with the optimal competitive ratio $\alpha^{*}$. Then there exists a $\pi\in\mathcal{M}$ such that for any perturbation sequence $w_{1:T}$,

J_{T}(\pi|w_{1:T})<J_{T}(\mathcal{A}|w_{1:T})+\varepsilon.

Now, with the aid of the above result, we may proceed as in the proof of Theorem 7. The algorithm and result in [22] guarantee $\tilde{O}(T^{2/3})$ regret against $\mathcal{K}\cup\mathcal{M}$, and similarly those in [28] guarantee $\tilde{O}(\operatorname{poly}(\mu^{-1})\sqrt{T})$ regret; call this bound $R_{T}$ in either case, and let $\mathcal{L}$ denote the corresponding learner. Again, by the non-negativity of the cost, and thereafter using the optimal competitive ratio bound, we have

J_{T}(\mathcal{L}|w_{1:T}) \leq \min_{\pi\in\mathcal{M}}J_{T}(\pi|w_{1:T})+R_{T}
\leq J_{T}(\mathcal{A}|w_{1:T})+R_{T}+O(1)
\leq J_{\infty}(\mathcal{A}|w_{1:\infty})+R_{T}+O(1)
\leq \alpha^{*}\cdot OPT_{*}(w_{1:\infty})+R_{T}+O(1),

where we extend $w_{1:T}$ to an infinite sequence $w_{1:\infty}$ with $w_{t}=0$ for all $t>T$, and $\mathcal{A}$ is the algorithm of Theorem 14 attaining the optimal competitive ratio. An appeal to Theorem 15 concludes the claim. ∎

Proof of Theorem 9.

Here, we use the algorithm and the regret bound from [8], which guarantees $R_{T}=2^{O_{T}(1)}+\tilde{O}_{T}(\min\{\sqrt{T},\operatorname{poly}(\mu^{-1}\log T)\})$ regret against the combined policy class $\mathcal{K}\cup\mathcal{M}$. The rest of the proof is analogous to that of Theorem 8 – in particular, we invoke Theorem 14. ∎

Appendix E A generalized proof of policy inclusion

Proof of Theorem 14.

We begin by taking note of the explicit characterization of an optimal competitive-ratio algorithm presented in Theorem 4. In particular, unrolling the recursive definition of $\nu_{t}$, we have $\nu_{t}=\sum_{i\geq 1}(A-KQ^{1/2})^{i-1}w_{t-i}$. Let the true state sequence visited by such a competitive control policy be $\{x_{t}\}_{t=1}^{T}$. Additionally, let $\widehat{K}=\begin{bmatrix}\widehat{K}_{0}&\widehat{K}_{1}\end{bmatrix}$ be the partition of $\widehat{K}$ into the first $m$ columns and the remaining $m$. Since $\widehat{w}_{t}=\Sigma^{-1/2}Q^{1/2}\nu_{t}$, by the second part of the said theorem, we have

x_{t+1} =Ax_{t}+B\widehat{K}\xi_{t}+w_{t}=Ax_{t}+B\widehat{K}\begin{bmatrix}x_{t}-\nu_{t}\\ \Sigma^{-1/2}Q^{1/2}\nu_{t}\end{bmatrix}+w_{t}
=(A+B\widehat{K}_{0})x_{t}+B(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})\nu_{t}+w_{t}
=\sum_{i\geq 0}(A+B\widehat{K}_{0})^{i}\left(w_{t-i}+B(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})\nu_{t-i}\right)
=\sum_{i\geq 0}(A+B\widehat{K}_{0})^{i}\left(w_{t-i}+B(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})\sum_{j\geq 1}(A-KQ^{1/2})^{j-1}w_{t-i-j}\right)
=\sum_{i\geq 0}\left((A+B\widehat{K}_{0})^{i}+\sum_{j=1}^{i}(A+B\widehat{K}_{0})^{i-j}B(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{j-1}\right)w_{t-i}.

We now choose a DAC policy intended to emulate the competitive control policy as

M^{[i-1]}\stackrel{\text{def}}{=} (\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{i-1}
+(\widehat{K}_{0}-\mathbb{K})\left((A+B\widehat{K}_{0})^{i-1}+\sum_{j=1}^{i-1}(A+B\widehat{K}_{0})^{i-j-1}B(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{j-1}\right),

and define the associated action as $u_{t}(M)\stackrel{\text{def}}{=}\mathbb{K}x_{t}(M)+\sum_{i=1}^{H}M^{[i-1]}w_{t-i}$, where $x_{t}(M)$ is the state sequence achieved by the execution of such a DAC policy. Then

x_{t+1}(M) =Ax_{t}(M)+B\left(\mathbb{K}x_{t}(M)+\sum_{i=1}^{H}M^{[i-1]}w_{t-i}\right)+w_{t}
=\sum_{i\geq 0}(A+B\mathbb{K})^{i}\left(w_{t-i}+B\sum_{j=1}^{H}M^{[j-1]}w_{t-i-j}\right)
=\sum_{i\geq 0}\left((A+B\mathbb{K})^{i}+\sum_{j=1}^{\min\{H,i\}}(A+B\mathbb{K})^{i-j}BM^{[j-1]}\right)w_{t-i}.

Before proceeding to establish that $x_{t}$ and $x_{t}(M)$ are close, let us note that the $M^{[i]}$ decay. To this end, recall that due to strong stability there exist matrices $\widehat{S},\widehat{L},S,L$ with $\|\widehat{S}\|\|\widehat{S}^{-1}\|,\|S\|\|S^{-1}\|\leq\kappa$ and $\|L\|\leq 1-\gamma,\|\widehat{L}\|\leq 1-\gamma$ such that $\widehat{A}+\widehat{B}_{u}\widehat{K}=\widehat{S}\widehat{L}\widehat{S}^{-1}$ and $A-KQ^{1/2}=SLS^{-1}$. Noting that $\Sigma\succeq I$ by definition and that $\beta I\succeq Q,I$, we thus obtain

\|M^{[i]}\| \leq 2\kappa^{2}\beta^{1/2}(1-\gamma)^{i}+2\kappa\left(\kappa(1-\gamma)^{i}+2\beta^{1/2}\kappa^{4}i(1-\gamma)^{i-1}\right)
\leq 4\kappa^{2}\beta^{1/2}(1-\gamma)^{i}+16\kappa^{5}\beta^{1/2}(1-\gamma/2)^{i}/\gamma\leq 20\kappa^{5}\beta^{1/2}(1-\gamma/2)^{i}/\gamma,

where the last line follows from Lemma 13.

Note the identity $X^{n}=Y^{n}+\sum_{i=1}^{n}Y^{i-1}(X-Y)X^{n-i}$. Using this twice, we have

(A+B\widehat{K}_{0})^{i}= (A+B\mathbb{K})^{i}+\sum_{j=1}^{i}(A+B\mathbb{K})^{i-j}B(\widehat{K}_{0}-\mathbb{K})(A+B\widehat{K}_{0})^{j-1},
(A+B\widehat{K}_{0})^{i-j}= (A+B\mathbb{K})^{i-j}+\sum_{k=1}^{i-j}(A+B\mathbb{K})^{k-1}B(\widehat{K}_{0}-\mathbb{K})(A+B\widehat{K}_{0})^{i-j-k}.

Thus, proceeding with the above identities, this time invoking the strong stability of $\mathbb{K}$, we have

\|x_{t}-x_{t}(M)\| \leq \sum_{i\geq H+1}\sum_{j=H+1}^{i}\left\|(A+B\mathbb{K})^{i-j}BM^{[j-1]}\right\|W
\leq \sum_{i\geq H+1}\frac{20\kappa^{7}\beta^{1/2}W}{\gamma}i(1-\gamma/2)^{i-1}\leq\frac{20\kappa^{7}\beta^{1/2}WT^{2}}{\gamma}(1-\gamma/2)^{H}.

Similarly, the corresponding control inputs are nearly equal. By the previous display concerning the states, we have

u_{t} =\widehat{K}_{0}x_{t}+(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})\nu_{t}
=\widehat{K}_{0}x_{t}+(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})\sum_{i\geq 1}(A-KQ^{1/2})^{i-1}w_{t-i},
u_{t}(M) =\mathbb{K}x_{t}(M)+\sum_{i=1}^{H}(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{i-1}w_{t-i}+(\widehat{K}_{0}-\mathbb{K})x_{t},
\|u_{t}-u_{t}(M)\| \leq \|\mathbb{K}\|\|x_{t}-x_{t}(M)\|+\sum_{i\geq H+1}\|(A-KQ^{1/2})\|^{i-1}\,2\kappa\beta^{1/2}W
\leq \frac{20\kappa^{8}\beta^{1/2}WT^{2}}{\gamma}(1-\gamma/2)^{H}+2\kappa^{2}\beta^{1/2}W\sum_{i\geq H+1}(1-\gamma)^{i-1}
\leq \frac{22\kappa^{8}\beta^{1/2}WT^{2}}{\gamma}(1-\gamma/2)^{H}.

Being the iterates produced by a DAC policy, $x_{t}(M),u_{t}(M)$ are bounded as indicated by Lemma 12. Now, in terms of cost, since $\|x\|^{2}-\|y\|^{2}=(x-y)^{\top}(x+y)$ and given Equation 2, we note that

\begin{align*}
&\left|x_{t}^{\top}Qx_{t}+u_{t}^{\top}Ru_{t}-x_{t}(M)^{\top}Qx_{t}(M)-u_{t}(M)^{\top}Ru_{t}(M)\right|\\
\leq~&\beta\|u_{t}-u_{t}(M)\|(2\|u_{t}(M)\|+\|u_{t}-u_{t}(M)\|)+\beta\|x_{t}-x_{t}(M)\|(2\|x_{t}(M)\|+\|x_{t}-x_{t}(M)\|)\\
\leq~&\frac{10^{5}\beta^{2}W^{2}\kappa^{16}T^{4}}{\gamma^{4}}(1-\gamma/2)^{H}.
\end{align*}

The difference in aggregate cost is at most $T$ times the quantity above. Therefore, setting $H=2\log\left(10^{5}\beta^{2}W^{2}\kappa^{16}T^{5}/(\gamma^{4}\varepsilon)\right)/\gamma$ is sufficient to ensure the validity of the claim. We have already established that the DAC policy satisfies $\theta=20\kappa^{5}\beta^{1/2}/\gamma$ and $\gamma^{\prime}=\gamma/2$. ∎
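To give a sense of scale for the truncation length prescribed in the proof above, the following one-line computation (ours) evaluates $H$ for some illustrative, arbitrarily chosen constants; as expected, $H$ grows only logarithmically in $T$ and $1/\varepsilon$, so the approximating DAC policy remains short-memory.

import math

def dac_history_length(beta, W, kappa, gamma, T, eps):
    # H = 2 log(1e5 * beta^2 * W^2 * kappa^16 * T^5 / (gamma^4 * eps)) / gamma, rounded up.
    return math.ceil(2 * math.log(1e5 * beta**2 * W**2 * kappa**16 * T**5 / (gamma**4 * eps)) / gamma)

print(dac_history_length(beta=1.0, W=1.0, kappa=2.0, gamma=0.5, T=10**4, eps=1.0))  # prints 286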

Appendix F To finity and beyond

In this section, we relate the infinite-horizon cost $OPT_{*}(w_{1:\infty})$, as used in earlier proofs, to its finite-horizon variant $OPT_{*}(w_{1:T})$. By the non-negativity of the instantaneous cost, we have $OPT_{*}(w_{1:\infty})\geq OPT_{*}(w_{1:T})$. We prove the reverse inequality, up to an additive constant.

Theorem 15.

Extend the perturbation sequence $w_{1:T}$ to an infinite sequence $w_{1:\infty}$ by defining $w_{t}=0$ for $t>T$. Then, we have

$$OPT_{*}(w_{1:\infty})\leq OPT_{*}(w_{1:T})+\operatorname{poly}(m,n,\kappa,\gamma^{-1},\mu^{-1},\beta)W^{2}.$$
Proof.

Let $u_{1:T}^{*}(w_{1:T})=\operatorname*{arg\,min}_{u_{1:T}}J_{T}(u_{1:T}|w_{1:T})$; these are the actions of the finite-horizon optimal offline controller attaining cost $OPT_{*}(w_{1:T})$. Let $\mathcal{I}$ be an algorithm that executes $u_{t}^{*}(w_{1:T})$ for the first $T$ time steps and $u_{t}=\widehat{K}_{0}x_{t}$ (or any other $(\kappa,\gamma)$-strongly stable policy) thereafter. Let $x_{t}(\mathcal{I}),u_{t}(\mathcal{I})$ be the state-action sequence associated with $\mathcal{I}$. By the optimality underlying the definition of $OPT_{*}(w_{1:\infty})$, we have

\begin{align*}
OPT_{*}(w_{1:\infty})&\leq J_{\infty}(\mathcal{I}|w_{1:\infty})\\
&=\sum_{t=1}^{T}c(x_{t}(\mathcal{I}),u_{t}(\mathcal{I}))+\sum_{t=T+1}^{\infty}c(x_{t}(\mathcal{I}),u_{t}(\mathcal{I}))\\
&=OPT_{*}(w_{1:T})+\sum_{t=T+1}^{\infty}c(x_{t}(\mathcal{I}),u_{t}(\mathcal{I})),
\end{align*}

where the last line uses the observation that for the first $T$ steps $\mathcal{I}$ executes $u_{t}^{*}(w_{1:T})$. It remains to bound the tail term. To this end, note that $w_{t}=0$ for all $t>T$, which implies for $t>0$ that

\begin{align*}
x_{T+t}(\mathcal{I})&=(A+B\widehat{K}_{0})^{t-1}x_{T+1},\\
u_{T+t}(\mathcal{I})&=\widehat{K}_{0}(A+B\widehat{K}_{0})^{t-1}x_{T+1},\\
\sum_{t=T+1}^{\infty}c(x_{t}(\mathcal{I}),u_{t}(\mathcal{I}))&\leq\frac{2\beta\kappa^{4}}{\gamma^{2}}\|x_{T+1}\|^{2}.
\end{align*}
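One way to see the last display (our worked version, under the standing assumptions that $\widehat{K}_{0}$ is $(\kappa,\gamma)$-strongly stable, so that $\|(A+B\widehat{K}_{0})^{t-1}\|\leq\kappa(1-\gamma)^{t-1}$ and $\|\widehat{K}_{0}\|\leq\kappa$ with $\kappa\geq 1$, and that $Q,R\preceq\beta I$) is the following geometric sum:
\begin{align*}
\sum_{t=T+1}^{\infty}c(x_{t}(\mathcal{I}),u_{t}(\mathcal{I}))&\leq\beta\sum_{t\geq 1}\left(\|x_{T+t}(\mathcal{I})\|^{2}+\|u_{T+t}(\mathcal{I})\|^{2}\right)\leq\beta(1+\kappa^{2})\kappa^{2}\|x_{T+1}\|^{2}\sum_{t\geq 1}(1-\gamma)^{2(t-1)}\\
&\leq 2\beta\kappa^{4}\|x_{T+1}\|^{2}\sum_{t\geq 1}(1-\gamma)^{t-1}=\frac{2\beta\kappa^{4}}{\gamma}\|x_{T+1}\|^{2}\leq\frac{2\beta\kappa^{4}}{\gamma^{2}}\|x_{T+1}\|^{2},
\end{align*}
where the final inequality uses $\gamma\leq 1$.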

To conclude the proof, we invoke Lemmas C.9 and C.6 of [14], in which the authors establish a bound on the state sequence produced by the optimal offline policy via an explicit characterization of the same, to note that

$$\|x_{T+1}\|\leq\operatorname{poly}(m,n,\kappa,\gamma^{-1},\mu^{-1},\beta)W.$$ ∎

Appendix G Experiment Setup & Further Evaluations

We compare five controllers in our experiments: the $H_{2}$ controller, the $H_{\infty}$ controller, the infinite-horizon competitive controller from [16], the GPC controller from [2], and the "clairvoyant offline" controller, which selects the optimal-in-hindsight sequence of controls. We consider two systems. The first is a standard two-dimensional double integrator, given by

$$A=\begin{bmatrix}1&1\\ 0&1\end{bmatrix},\quad B_{u}=\begin{bmatrix}0\\ 1\end{bmatrix},\quad B_{w}=I,\quad Q=I,\quad R=I,$$

and the second is a system derived from Boeing [32]. The exact system details can be found in the code supplied with the supplementary material. At a high level, the system has a 5-dimensional state and a 9-dimensional control.
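For concreteness, the following is a minimal sketch (ours; the supplementary code remains the authoritative reference) of the double-integrator setup together with an $H_{2}$/LQR baseline computed from the discrete-time algebraic Riccati equation via SciPy; the remaining controllers are then run on the same $(A,B_{u},B_{w},Q,R)$ tuple.

import numpy as np
from scipy.linalg import solve_discrete_are

# Double integrator: state = (position, velocity), scalar control, unit costs.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

# H2 / LQR feedback u_t = -K x_t from the discrete algebraic Riccati equation.
P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)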

The perturbations we add fall into six categories; a sketch that generates these sequences is given after the list.

  • Sin Perturbations: every entry of $w_{t}$ is set to $\sin\left(\frac{8\pi t}{T}\right)$, where $T$ is the time horizon.

  • Sin Perturbations with changing Amplitude: every entry of $w_{t}$ is set to $\sin\left(\frac{8\pi t}{T}\right)\sin\left(\frac{6\pi t}{T}\right)$, where $T$ is the time horizon.

  • Constant Perturbations: every entry of $w_{t}$ is set to 1.

  • Uniform Perturbations: every entry of $w_{t}$ is drawn uniformly and independently from the range $[0,1]$.

  • Gaussian Perturbations: $w_{t}$ is drawn independently from $\mathcal{N}(0,I)$.

  • Gaussian Random Walk Perturbations: $w_{t}$ is drawn from $\mathcal{N}(w_{t-1},I)$, i.e., every entry performs an independent Gaussian random walk.
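The following NumPy sketch (ours; function and category names are our own) generates the six disturbance sequences described above, reading the sinusoids as functions of the time index $t$:

import numpy as np

def make_perturbations(kind, T, n, rng=None):
    # Returns the disturbance sequence w_1, ..., w_T as an array of shape (T, n).
    rng = rng if rng is not None else np.random.default_rng(0)
    t = np.arange(1, T + 1)[:, None]  # time index, broadcast across the n entries
    if kind == "sin":
        return np.tile(np.sin(8 * np.pi * t / T), (1, n))
    if kind == "sin_amplitude":
        return np.tile(np.sin(8 * np.pi * t / T) * np.sin(6 * np.pi * t / T), (1, n))
    if kind == "constant":
        return np.ones((T, n))
    if kind == "uniform":
        return rng.uniform(0.0, 1.0, size=(T, n))
    if kind == "gaussian":
        return rng.standard_normal((T, n))
    if kind == "gaussian_walk":
        return np.cumsum(rng.standard_normal((T, n)), axis=0)  # w_t = w_{t-1} + N(0, I)
    raise ValueError(f"unknown perturbation kind: {kind}")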

The performance of the controllers on the double integrator system is shown in Figure 2, and the performance of the controllers on the Boeing system is shown in Figure 3. The experiments confirm our theorems: over a long time horizon, GPC performs on par with, and sometimes better than, the competitive controller. We note that the optimal competitive ratio for the competitive controller turns out to be 14.67 for the double integrator system and 5.686 for the Boeing system.

Figure 2: Performance of the various controllers on the Double Integrator system under different noise profiles: (a) sinusoidal perturbations, (b) constant perturbations, (c) sin perturbations with changing amplitude, (d) uniformly distributed perturbations, (e) Gaussian perturbations, (f) Gaussian random walk perturbations. Error ranges are shown over 5 trials for randomly generated noise sequences; in such cases, the solid lines indicate the mean performance over the 5 trials.
Figure 3: Performance of the various controllers on the Boeing system under different noise profiles: (a) sin perturbations, (b) sin perturbations with changing amplitude, (c) constant perturbations, (d) uniformly distributed perturbations, (e) Gaussian perturbations, (f) Gaussian random walk perturbations. Error ranges are shown over 5 trials for randomly generated noise sequences; in such cases, the solid lines indicate the mean performance over the 5 trials.

Hyperparameter setup for GPC

GPC is the only controller with hyperparameters. We describe the hyperparameters used below.

  • Double Integrator System: For all perturbations, $H=M=3$. For all perturbations except the Gaussian random walk perturbations, GPC was run with learning rate $0.002$. For the Gaussian random walk perturbations, GPC was run with learning rate $10^{-4}$.

  • Boeing System: For all perturbations, $H=M=10$. For all perturbations except the Gaussian random walk perturbations, GPC was run with learning rate $0.001$. For the Gaussian random walk perturbations, GPC was run with learning rate $10^{-7}$.

We note that the learning rate needs to be reduced significantly for the Gaussian random walk perturbations, since the noise magnitude grows over time; in this sense, our theorems do not immediately apply to this setting.
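For reference, the settings above can be collected into a single configuration dictionary (a sketch; the key names are ours):

GPC_HYPERPARAMS = {
    "double_integrator": {"H": 3, "M": 3, "lr": 2e-3, "lr_gaussian_random_walk": 1e-4},
    "boeing": {"H": 10, "M": 10, "lr": 1e-3, "lr_gaussian_random_walk": 1e-7},
}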