
Best of Both Worlds in Online Control:
Competitive Ratio and Policy Regret

Gautam Goel (Simons Institute, UC Berkeley), Naman Agarwal (Google AI Princeton), Karan Singh (Tepper School of Business, Carnegie Mellon University), Elad Hazan (Google AI Princeton and Department of Computer Science, Princeton University)
Abstract

We consider the fundamental problem of online control of a linear dynamical system from two different viewpoints: regret minimization and competitive analysis. We prove that the optimal competitive policy is well-approximated by a convex parameterized policy class, known as disturbance-action control (DAC) policies. Using this structural result, we show that several recently proposed online control algorithms achieve the best of both worlds: sublinear regret vs. the best DAC policy selected in hindsight, and optimal competitive ratio, up to an additive correction which grows sublinearly in the time horizon. We further conclude that sublinear regret vs. the optimal competitive policy is attainable when the linear dynamical system is unknown, and even when a stabilizing controller for the dynamics is not available a priori.

1 Introduction

The study of online optimization consists of two main research directions. The first is online learning, which studies regret minimization in games. A notable framework within this line of work is online convex optimization, where an online decision maker iteratively chooses a point in a convex set and receives loss according to an adversarially chosen loss function. The metric of performance studied in this research thrust is regret, or the difference between overall loss and that of the best decision in hindsight.

The second direction is that of competitive analysis in metrical task systems. In this framework, the problem setting is similar, but the performance metric is very different. Instead of regret, the objective is to minimize the competitive ratio, i.e. the ratio of the cost of the online decision maker to that of the optimal sequence of decisions made in hindsight. For this ratio to remain bounded, an additional penalty is imposed on movement costs, i.e. changes in the decision.

While the goals of the two research directions are similar, the performance metrics are very different and lead to different algorithms and methodologies. These two separate methodologies have recently been applied to the challenging setting of online control, contributing novel and exciting methods to the field of control of dynamical systems.

In this paper we unify these two disparate lines of work by establishing a connection between the two objectives. Namely, we show that the Gradient Perturbation Controller (GPC) minimizes regret against the policy that attains the optimal competitive ratio for any linear time-invariant (LTI) dynamical system. The GPC algorithm hence gives the best of both worlds: sublinear regret and optimal competitive ratio, in a single efficient algorithm.

Our main technical contribution is proving that the optimal competitive policy derived in [16] is well-approximated by a certain convex policy class, for which efficient online learning was recently established in the work of [2]. This implies that known regret minimization algorithms for online control can compete with this optimal competitive policy with vanishing regret.

This structural result has other important implications to online control, yielding new results: we show that sublinear regret can be attained vs. the optimal competitive policy even when the underlying dynamical system is unknown, and even when a stabilizing controller is not available.

1.1 Related work

Control of dynamical systems. Our study focuses on two types of algorithms for online control. The first class of algorithms enjoys sublinear regret for online control of dynamical systems; that is, their performance tracks a given benchmark of policies up to a term which is vanishing relative to the problem horizon. [1] initiated the study of online control under the regret benchmark for linear time-invariant (LTI) dynamical systems. Bounds for this setting have since been improved and refined in [12, 25, 10, 29]. We are interested in adversarial noise and perturbations; regret minimization in this context was initiated in the study of the nonstochastic control setting [2], which allows for adversarially chosen (e.g. non-Gaussian) noise and general convex costs that may vary with time. This model has been studied in many extended settings; see [24] for a comprehensive survey.

Competitive control. [18] initiated the study of online control with competitive ratio guarantees and showed that the Online Balanced Descent algorithm introduced in [7] has bounded competitive ratio in a narrow class of linear systems. This approach to competitive control was extended in a series of papers [17, 27]. In recent work, [16] obtained an algorithm with optimal competitive ratio in general linear systems using $H_{\infty}$ techniques; in this paper we show that the competitive control algorithm obtained in [16] is closely approximated by the class of DAC policies, and use this connection to obtain our “best-of-both-worlds” result.

Online learning and online convex optimization. The regret minimization techniques that are the subject of this paper are based in the framework of online convex optimization, see [20]. Recent techniques in online nonstochastic control are based on extensions of OCO to the setting of loss functions with memory [3] and adaptive or dynamic regret [23, 33].

Competitive analysis of online algorithms and simultaneous bounds on competitive ratio and regret. Competitive analysis was introduced in [31] and was first studied in the context of Metrical Task Systems (MTS) in [6]; we refer to [5] for an overview of competitive analysis and online algorithms. A series of recent papers consider the problem of obtaining online algorithms with bounded competitive ratio and sublinear regret. In [4], it was shown that no algorithm can simultaneously achieve both objectives in OCO with switching costs. On the other hand, [11] described an online algorithm for MTS with optimal competitive ratio and sublinear regret on every time interval.

2 Preliminaries

We consider the task of online control in linear time-invariant (LTI) dynamical systems. In this setting, the interaction between the learner and the environment proceeds as described next. At each time step, the learner incrementally observes the current state $x_{t}\in\mathbb{R}^{m}$ of the system, subsequently chooses a control input $u_{t}\in\mathbb{R}^{n}$, and consequently is subject to an instantaneous cost $c(x_{t},u_{t})$ defined via the quadratic cost function (we assume the existence of $\beta,\mu$ such that $\beta I\succeq Q,R\succeq\mu I$)

c(x,u)=x^{\top}Qx+u^{\top}Ru.

As a consequence of executing the control input $u_{t}$, the dynamical system evolves to a subsequent state $x_{t+1}$, as dictated by the following linear system parameterized by the matrices $A\in\mathbb{R}^{m\times m}$ and $B\in\mathbb{R}^{m\times n}$, bounded as $\|A\|,\|B\|\leq\kappa$, and the perturbation sequence $(w_{t})_{t\in[T]}$.

x_{t+1}=Ax_{t}+Bu_{t}+w_{t}.

We assume without loss of generality that $x_{1}=0$. The learner does not directly observe the perturbations, or know of them in advance. We do not make any (e.g., distributional) assumptions on the perturbations, other than that they satisfy a point-wise bound $\|w_{t}\|\leq W$ for all time steps $t$. By means of such interaction across $T$ time steps, we ascribe an aggregate cost to the learner $\mathcal{A}$ as

J_{T}(\mathcal{A}|w_{1:T})=\sum_{t=1}^{T}c(x_{t},u_{t}).
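For concreteness, here is a minimal sketch of this interaction protocol in Python with numpy; the function and argument names (`rollout`, `policy`) are illustrative assumptions, not part of the formal setup.

```python
import numpy as np

def rollout(A, B, Q, R, w, policy):
    """Simulate x_{t+1} = A x_t + B u_t + w_t and return the aggregate cost J_T."""
    m, _ = B.shape
    x = np.zeros(m)                        # x_1 = 0, as assumed above
    total_cost = 0.0
    for w_t in w:                          # w is a length-T sequence of perturbations
        u = policy(x)                      # the learner acts after observing x_t
        total_cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + w_t            # the perturbation w_t is revealed afterwards
    return total_cost
```

For instance, a linear state-feedback policy $K$ corresponds to `policy = lambda x: K @ x`.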

2.1 Policy classes

Since the learner selects the control inputs adaptively upon observing the state, the behavior of a learner may be described by a (strictly causal) policy $\pi$, a mapping from the observed state sequence to the immediate action. We consider the following policy classes in the paper:

1. $\Pi_{\textsc{all}}$ is the exhaustive set of $T$-length sequences of control inputs.

2. $\Pi_{\textsc{SC}}$ is the class of all strictly causal policies, mapping the heretofore observed state sequence to the next action.

3. $\mathcal{K}\subset\mathbb{R}^{n\times m}$ is a class of linear state-feedback policies. Each member of this class is parameterized by some matrix $K\in\mathcal{K}$, and recommends the immediate action $u_{t}\stackrel{\text{def}}{=}Kx_{t}$. Both the stochastic-optimal policy ($\mathcal{H}_{2}$-control) – Bayes-optimal for i.i.d. perturbations – and the robust policy ($\mathcal{H}_{\infty}$-control) – minimax-optimal for arbitrary perturbations – are linear state-feedback policies.

4. $\mathcal{M}$ is a class of disturbance-action controllers (DAC), defined below, that recommend actions as a linear transformation of the past few perturbations, rather than the present state.

A linear policy $K$ is called stable if $|\lambda_{\max}(A+BK)|<1$. Such policies ensure that the state sequence remains bounded under their execution. The notion of strong stability, introduced by [9], is a non-asymptotic characterization of the notion of stability, defined as follows.

Definition 1.

A linear policy $K\in\mathbb{R}^{n\times m}$ is said to be $(\kappa,\gamma)$-strongly stable with respect to an LTI system $(A,B)$ if there exist matrices $S,L$ satisfying $A+BK=SLS^{-1}$ such that

\max\{1,\|K\|,\|S\|\|S^{-1}\|\}\leq\kappa\text{ and }\max\{1/2,\|L\|\}\leq 1-\gamma.

A sufficient condition for the existence of a strongly stable policy is the strong controllability of the linear system $(A,B)$, a notion introduced in [9]. In words, strong controllability measures the minimum length and magnitude of control input needed to drive the system to any unit-sized state.

Let $\mathbb{K}$ be a fixed $(\kappa,\gamma)$-strongly stable linear policy for the discussion that follows. We will specify a particular choice for $\mathbb{K}$ in Section 3. We formally define a disturbance-action controller below. The purpose of superimposing a stable linear policy $\mathbb{K}$ on top of the linear-in-perturbation terms is to ensure that the state sequence produced under the execution of a (possibly non-stationary) disturbance-action controller remains bounded.

Definition 2.

A disturbance-action controller (DAC), specified by a horizon $H$ and parameters $M=\left(M^{[0]},\dots,M^{[H-1]}\right)$ with each $M^{[i]}\in\mathbb{R}^{n\times m}$, chooses the action at time $t$ as

u_{t}(M)\stackrel{\text{def}}{=}\mathbb{K}x_{t}+\sum_{i=1}^{H}M^{[i-1]}w_{t-i},

where $x_{t}$ is the state at time $t$, and $w_{1},\dots,w_{t-1}$ are the past perturbations.

Definition 3.

For any $H\in\mathbb{N}$, $\gamma<1$, $\theta\geq 1$, an $(H,\theta,\gamma)$-DAC policy class is the set of all $H$-horizon DAC policies whose parameters $M=\left(M^{[0]},\dots,M^{[H-1]}\right)$ satisfy $\|M^{[i]}\|\leq\theta(1-\gamma)^{i}$ for all $i$.
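As an illustration of Definitions 2 and 3, the following sketch computes a DAC action; `K_stab` plays the role of $\mathbb{K}$, `M` is the list $(M^{[0]},\dots,M^{[H-1]})$, and `w_hist` is assumed to hold the observed perturbations with `w_hist[-i]` equal to $w_{t-i}$ (perturbations before time 1 are treated as zero).

```python
import numpy as np

def dac_action(K_stab, M, x_t, w_hist):
    """Return u_t = K_stab x_t + sum_{i=1}^{H} M^{[i-1]} w_{t-i}."""
    u = K_stab @ x_t
    for i in range(1, len(M) + 1):
        if i <= len(w_hist):               # w_{t-i} with t - i < 1 contributes nothing
            u = u + M[i - 1] @ w_hist[-i]
    return u
```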

2.2 Performance measures

This paper considers multiple criteria that may be used to assess the learner’s performance. We introduce these below.

Let $w_{1:T}$ be the perturbation sequence the dynamics are subject to. Given foreknowledge of this sequence, we define the following notions of optimal cost; note that these notions are infeasible in the sense that no online learner can match them on all instances.

1. $OPT_{*}(w_{1:T})\stackrel{\text{def}}{=}\min_{u_{1:T}\in\Pi_{\textsc{All}}}J_{T}(u_{1:T}|w_{1:T})$ is the cost associated with the best sequence of control inputs given the perturbation sequence. No policy, causal or otherwise, can attain a cost smaller than $OPT_{*}(w_{1:T})$ on the perturbation sequence $w_{1:T}$.

2. For any policy class $\Pi$, $OPT_{\Pi}(w_{1:T})\stackrel{\text{def}}{=}\min_{\pi\in\Pi}J_{T}(\pi|w_{1:T})$ is the cost of the best policy in $\Pi$, subject to the perturbation sequence. Note that $OPT_{*}(w_{1:T})=OPT_{\Pi_{\textsc{All}}}(w_{1:T})$.

With respect to these baselines, we define the following performance measures.

Competitive Ratio (CR): The competitive ratio of an (online) learner $\mathcal{A}$ is the worst-case ratio of its cost to the optimal offline cost $OPT_{*}(w_{1:T})$ over all possible perturbation sequences,

\alpha_{T}(\mathcal{A})\stackrel{\text{def}}{=}\max_{w_{1:T}}\frac{J_{T}(\mathcal{A}|w_{1:T})}{OPT_{*}(w_{1:T})}.

The optimal competitive ratio is the competitive ratio of the best strictly causal controller.

\alpha^{*}_{T}\stackrel{\text{def}}{=}\min_{\mathcal{A}\in\Pi_{\textsc{SC}}}\alpha_{T}(\mathcal{A}).

The infinite-horizon optimal competitive ratio $\alpha^{*}=\lim_{T\to\infty}\alpha^{*}_{T}$ is defined as the limiting optimal competitive ratio as the horizon extends to infinity, whenever this limit exists.

Regret: On any perturbation sequence $w_{1:T}$, given a policy class $\Pi$, the regret of an online learner $\mathcal{A}$ is defined as the excess aggregate cost incurred in comparison to that of the best policy in $\Pi$,

R_{T,\Pi}(\mathcal{A}|w_{1:T})=J_{T}(\mathcal{A}|w_{1:T})-OPT_{\Pi}(w_{1:T}).

The worst-case regret is defined as the maximum regret attainable over all perturbation sequences,

R_{T,\Pi}(\mathcal{A})=\max_{w_{1:T}}R_{T,\Pi}(\mathcal{A}|w_{1:T}).
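On a finite test bed these two measures can be estimated as follows; this is only an illustrative sketch, with `cost_alg`, `cost_opt`, and `cost_best_in_class` assumed to hold $J_{T}(\mathcal{A}|w_{1:T})$, $OPT_{*}(w_{1:T})$, and $OPT_{\Pi}(w_{1:T})$ for each disturbance sequence in the test set (the formal definitions take a maximum over all bounded sequences).

```python
def empirical_competitive_ratio(cost_alg, cost_opt):
    """Worst ratio of the learner's cost to the offline optimal cost over the test set."""
    return max(a / o for a, o in zip(cost_alg, cost_opt))

def empirical_worst_case_regret(cost_alg, cost_best_in_class):
    """Worst excess cost of the learner over the best policy in the class, over the test set."""
    return max(a - b for a, b in zip(cost_alg, cost_best_in_class))
```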

The two types of performance guarantees introduced above are qualitatively different in terms of the bound they espouse and the baseline they compare to. In particular:

Tighter bound for low regret: A sub-linear regret guarantee implies that the average costs of the learner and the baseline asymptotically match, while even an optimal competitive-ratio bound promises an average cost at most a constant factor times that of the baseline.

Stronger baseline for CR: Competitive-ratio bounds hold against the optimal unrestricted policy, while regret bounds hold against the best fixed policy from a (typically parametric) policy class.

2.3 Characterization of the optimal CR algorithm

The following explicit characterization of a strictly causal policy that achieves the optimal competitive ratio in the infinite-horizon setting was recently obtained in [16]; this theorem shows that the competitive policy in the original system with state $x\in\mathbb{R}^{m}$ can be viewed as a state-feedback controller in a synthetic system with state $\xi\in\mathbb{R}^{2m}$.

Theorem 4 (Optimal Competitive Policy).

The strictly causal controller with optimal infinite-horizon competitive ratio $\alpha^{*}$ is given by the policy $u_{t}=\widehat{K}\xi_{t}$, where $\widehat{K}\in\mathbb{R}^{n\times 2m}$ and the synthetic state $\xi\in\mathbb{R}^{2m}$ evolves according to the dynamics

\xi_{t+1}=\widehat{A}\xi_{t}+\widehat{B}_{u}u_{t}+\widehat{B}_{w}\widehat{w}_{t+1},
\text{where }\widehat{A}=\begin{bmatrix}A&K\Sigma^{1/2}\\ 0&0\end{bmatrix},\quad\widehat{B}_{u}=\begin{bmatrix}B\\ 0\end{bmatrix},\quad\widehat{B}_{w}=\begin{bmatrix}0\\ I\end{bmatrix},\quad\widehat{w}_{t}=\Sigma^{-1/2}Q^{1/2}\nu_{t}.

The sequence $\nu_{t}$ is recursively defined as $\nu_{t+1}=(A-KQ^{1/2})\nu_{t}+w_{t}$, starting with $\nu_{1}=0$. Here the matrices $\widehat{K},K,\Sigma$ (and the auxiliary matrices $P,\widetilde{B},\widetilde{H}$ and $\widehat{P}$) satisfy

K=APQ^{1/2}\Sigma^{-1},\quad\Sigma=I+Q^{1/2}PQ^{1/2},\quad P=BB^{\top}+APA^{\top}-K\Sigma K^{\top},
\widehat{K}=-(I_{n}+\widehat{B}_{u}^{\top}\widetilde{P}\widehat{B}_{u})^{-1}\widehat{B}_{u}^{\top}\widetilde{P}\widehat{A},\quad\widetilde{B}=\begin{bmatrix}\widehat{B}_{u}&\widehat{B}_{w}\end{bmatrix},\quad\widetilde{H}=\begin{bmatrix}I&0\\ 0&-\alpha^{*}I\end{bmatrix}+\widetilde{B}^{\top}\widehat{P}\widetilde{B},
\widetilde{P}=\widehat{P}-\widehat{P}\widehat{B}_{w}(-\alpha^{*}I_{p}+\widehat{B}_{w}^{\top}\widehat{P}\widehat{B}_{w})^{-1}\widehat{B}_{w}^{\top}\widehat{P},
\text{and }\quad\widehat{P}=\begin{bmatrix}Q&Q^{1/2}\Sigma^{1/2}\\ Q^{1/2}\Sigma^{1/2}&\Sigma\end{bmatrix}+\widehat{A}^{\top}\widehat{P}\widehat{A}-\widehat{A}^{\top}\widehat{P}\widetilde{B}\widetilde{H}^{-1}\widetilde{B}^{\top}\widehat{P}\widehat{A}.

Furthermore, let $\{x_{t}\}_{t=1}^{T}$ be the state sequence produced under the execution of such a policy. Then the state sequence satisfies, for all times $t$, $\xi_{t}=\begin{bmatrix}x_{t}-\nu_{t}\\ \widehat{w}_{t}\end{bmatrix}$.
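For concreteness, a minimal sketch of executing this controller online is given below, assuming the matrices $\widehat{K}$, $K$, $\Sigma$ have already been computed offline from the fixed-point equations above (the offline computation itself is not shown); scipy's `sqrtm` is used for the matrix square roots.

```python
import numpy as np
from scipy.linalg import sqrtm

def run_competitive_controller(A, B, K_hat, K, Sigma, Q, w):
    """Roll out u_t = K_hat xi_t with xi_t = [x_t - nu_t; Sigma^{-1/2} Q^{1/2} nu_t]."""
    m = A.shape[0]
    Q_half = np.real(sqrtm(Q))
    Sigma_inv_half = np.linalg.inv(np.real(sqrtm(Sigma)))
    x, nu = np.zeros(m), np.zeros(m)                 # x_1 = 0, nu_1 = 0
    xs, us = [], []
    for w_t in w:
        xi = np.concatenate([x - nu, Sigma_inv_half @ Q_half @ nu])
        u = K_hat @ xi                               # u_t = K_hat xi_t
        xs.append(x); us.append(u)
        nu = (A - K @ Q_half) @ nu + w_t             # nu_{t+1} = (A - K Q^{1/2}) nu_t + w_t
        x = A @ x + B @ u + w_t                      # x_{t+1} = A x_t + B u_t + w_t
    return np.array(xs), np.array(us)
```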

Let $\widehat{K}_{0}\in\mathbb{R}^{n\times m}$ be the sub-matrix formed by the first $m$ columns of $\widehat{K}$. In general, the infinite-horizon optimal competitive ratio may not be finite. However, the stability of the associated filtering operation (i.e. $|\lambda_{\max}(A-KQ^{1/2})|<1$) and of the closed-loop control system (i.e. $|\lambda_{\max}(A+B\widehat{K}_{0})|<1$) is sufficient to ensure the existence of this limit. We utilize the following bounds that quantify this.

Assumption 1.

$\widehat{K}_{0}$ is $(\kappa,\gamma)$-strongly stable with respect to the linear system $(A,B)$, and $-K^{\top}$ is $(\kappa,\gamma)$-strongly stable with respect to the linear system $(A^{\top},Q^{1/2})$. Also, $\|\widehat{K}\|\leq\kappa$.

We note that the above bounds are quantifications, and not strengthenings, of the stability criterion. In particular, any stable controller is strongly stable for some $\kappa\geq 1,\gamma<1$. Here, we use the same parameters to state strong stability for both controllers, $K$ and $\widehat{K}$, for convenience. Such a simplification is valid, since given $(\kappa_{1},\gamma_{1})$- and $(\kappa_{2},\gamma_{2})$-strongly stable controllers, both controllers are also $(\max\{\kappa_{1},\kappa_{2}\},\min\{\gamma_{1},\gamma_{2}\})$-strongly stable.

2.4 Low-regret algorithms

Deviating from the methodologies of optimal and robust control, [2] propose an online control formulation in which the noise is adversarial, and thus the optimal controller is only defined in hindsight. This motivates different, online-learning based methods for the control task. [2] proposed an algorithm called GPC (Gradient Perturbation Controller) and showed the following theorem (which we restate in notation consistent with this paper), which for LTI systems shows that the regret compared against any strongly stable policy scales at most as $O(\sqrt{T})$.

Theorem 5.

Given any $\kappa,\gamma$, let $\mathcal{K}(\kappa,\gamma)$ be the set of $(\kappa,\gamma)$-strongly stable linear policies. There exists an algorithm $\mathcal{A}$ such that the following holds:

J_{T}(\mathcal{A}|w_{1:T})-\min_{K\in\mathcal{K}(\kappa,\gamma)}J_{T}(K|w_{1:T})\leq O(\sqrt{T}\log(T)).

Here $O(\cdot)$ contains polynomial factors depending on the system constants.

As can be observed from the analysis presented in [2], the above regret guarantee holds not just against the set of strongly stable linear policies but also against the set of $(H,\theta,\gamma)$-DAC policies. The regret bound has been extended to different interaction models such as unknown systems [21], partial observation [30], and adaptive regret [19]. Furthermore, the regret bound in this setting has been improved to logarithmic in $T$ [13, 28].
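For intuition only, the following is a highly simplified online-gradient update over DAC parameters in the spirit of the GPC algorithm; the actual algorithm of [2] differentiates a counterfactual loss with memory over the last $H$ steps, whereas this sketch differentiates only the direct dependence of $u_{t}$ on $M$ and should not be read as the method of [2]. The names (`lr`, `theta`) and the projection step are assumptions.

```python
import numpy as np

def simplified_gpc_step(M, K_stab, R, x_t, w_hist, lr, theta, gamma):
    """One illustrative update: play the DAC action, then descend on its control cost."""
    k = min(len(M), len(w_hist))
    u = K_stab @ x_t + sum(M[i] @ w_hist[-(i + 1)] for i in range(k))
    for i in range(k):
        grad_i = 2.0 * np.outer(R @ u, w_hist[-(i + 1)])   # d(u^T R u)/dM^{[i]}, direct term only
        M[i] = M[i] - lr * grad_i
        bound = theta * (1.0 - gamma) ** i                 # keep M inside the (H, theta, gamma)-DAC class
        norm = np.linalg.norm(M[i], 2)
        if norm > bound:
            M[i] *= bound / norm
    return M, u
```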

3 Statement of results

3.1 Main Result

The central observation we make is that regret-minimizing algorithms subject to certain qualifications automatically achieve an optimal competitive ratio bound up to a vanishing average cost term.

Typical regret-minimizing online control algorithms [2, 22] compete against the class of stable linear state-feedback policies. In general, neither the offline optimal policy (with cost $OPT_{*}$) nor the optimal competitive-ratio policy can be approximated by a linear policy [15]. However, the algorithm proposed in [2] and follow-up works that build on it also compete with a more expressive class, that of disturbance-action policies (DACs). In [2], this choice was made purely for computational reasons, to circumvent the non-convexity of the cost associated with linear policies; in this work, however, we use the flexibility of DAC policies to approximate the optimal competitive policy.

More formally, we prove that, by taking the history length $H$ of the DAC to be sufficiently large, we can find a DAC whose sequence of states and control actions closely tracks that generated by the optimal competitive policy. This structural characterization of the competitive policy suffices to derive our best-of-both-worlds result: a regret-minimizing learner that competes with an appropriately defined DAC class also competes with the policy achieving the optimal competitive ratio, and hence achieves the optimal competitive ratio up to a residual regret term.

Theorem 6 (Optimal Competitive Policy is Approximately DAC).

Fix a horizon $T$ and a disturbance bound $W$. For any $\varepsilon>0$, set

H=\frac{\log\left(\frac{1088W^{2}\kappa^{11}\max(1,\beta^{2})T}{\gamma^{4}\varepsilon}\right)}{\log\left((1-\gamma/2)^{-1}\right)},\qquad\theta=2\kappa^{2}\max(1,\beta^{1/2}),

and define $\mathcal{M}$ to be the set of $(H,\theta,\gamma)$-DAC policies with stabilizing component $\mathbb{K}=\widehat{K}_{0}$. Let $\mathcal{A}$ be the algorithm with the optimal competitive ratio $\alpha^{*}$. Then there exists a policy $\pi\in\mathcal{M}$ such that for any perturbations $w_{1:T}$ satisfying $\|w_{t}\|\leq W$, the cost incurred by $\pi$ satisfies

J_{T}(\pi|w_{1:T})<J_{T}(\mathcal{A}|w_{1:T})+\varepsilon.

We prove this theorem in Section 4. We now show that this result implies best-of-both-worlds.
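For reference, a small helper computing the parameters $H$ and $\theta$ prescribed by Theorem 6 (constants copied from the statement; this is a sketch, and the rounding is an assumption):

```python
import math

def dac_parameters(W, kappa, beta, gamma, eps, T):
    """Return (H, theta) as prescribed by Theorem 6."""
    theta = 2.0 * kappa ** 2 * max(1.0, math.sqrt(beta))
    growth = 1088.0 * W ** 2 * kappa ** 11 * max(1.0, beta ** 2) * T / (gamma ** 4 * eps)
    H = math.ceil(math.log(growth) / math.log(1.0 / (1.0 - gamma / 2.0)))
    return H, theta
```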

3.2 Best-of-Both-Worlds in Known Systems

We begin by considering the case when the learner knows the linear system $(A,B)$. In this setting, both the regret and the competitive ratio measure the additional cost imposed by not knowing the perturbations in advance. The result below utilizes the regret bounds against DAC policies and the associated algorithms from [2, 28].

Theorem 7 (Best-of-both-worlds in online control (known system)).

Assuming $(A,B)$ is known to the learner, there exists a constant

R_{T}=\tilde{O}\left(\operatorname{poly}(m,n,\beta,\kappa,\gamma^{-1})W^{2}\times\min\left\{\sqrt{T},\operatorname{poly}(\mu^{-1})\operatorname{polylog}T\right\}\right)

and a computationally efficient online control algorithm $\mathcal{A}$ which simultaneously achieves the following performance guarantees:

1. (Optimal competitive ratio) The cost of $\mathcal{A}$ satisfies, for any perturbation sequence $w_{1:T}$,

   J_{T}(\mathcal{A}|w_{1:T})<\alpha^{*}\cdot OPT_{*}(w_{1:T})+R_{T},

   where $\alpha^{*}$ is the optimal competitive ratio.

2. (Low regret) The regret of $\mathcal{A}$ relative to the best linear state-feedback or DAC policy selected in hindsight grows sub-linearly in the horizon $T$, i.e. for all $w_{1:T}$ it holds that

   J_{T}(\mathcal{A}|w_{1:T})<\min_{\pi\in\mathcal{K}}J_{T}(\pi|w_{1:T})+R_{T}\quad\text{ and }\quad J_{T}(\mathcal{A}|w_{1:T})<\min_{\pi\in\mathcal{M}}J_{T}(\pi|w_{1:T})+R_{T}.

3.3 Best-of-Both-Worlds in Unknown Systems

We now present our main results for the online control of an unknown linear dynamical system. The first theorem deals with the case when the learner has coarse-grained information about the linear system $(A,B)$ in the form of access to a stabilizing controller $\mathbb{K}$. In general, it is sufficient to know $(A,B)$ up to some constant accuracy in order to compute such a stabilizing controller, as noted in [10]. This theorem utilizes the low-regret algorithms from [22, 28].

Theorem 8 (Best-of-both-worlds in online control (unknown system, given a stabilizing controller)).

For a $(k,\kappa)$-strongly controllable linear dynamical system $(A,B)$, there exists a constant

R_{T}=\tilde{O}\left(\operatorname{poly}(m,n,\beta,k,\kappa,\gamma^{-1})W^{2}\times\min\left\{T^{2/3},\operatorname{poly}(\mu^{-1})\sqrt{T}\right\}\right)

and a computationally efficient online control algorithm $\mathcal{A}$ such that, when given access to a $(\kappa,\gamma)$-strongly stable initial controller $\mathbb{K}$, it guarantees

J_{T}(\mathcal{A}|w_{1:T})<\min\left\{\alpha^{*}\cdot OPT_{*}(w_{1:T}),\ \min_{\pi\in\mathcal{K}}J_{T}(\pi|w_{1:T}),\ \min_{\pi\in\mathcal{M}}J_{T}(\pi|w_{1:T})\right\}+R_{T}.

When an initial stabilizing controller is unavailable, we make use of the “blackbox control” algorithm in [8] to establish the next theorem.

Theorem 9 (Best-of-both-worlds in online control (unknown system, blackbox control)).

For a $(k,\kappa)$-strongly controllable linear dynamical system $(A,B)$, there exists a constant

R_{T}=2^{\operatorname{poly}(m,n,\beta,k,\kappa,\gamma^{-1})}+\tilde{O}\left(\operatorname{poly}(m,n,\beta,k,\kappa,\gamma^{-1})W^{2}\times\min\left\{T^{2/3},\operatorname{poly}(\mu^{-1})\sqrt{T}\right\}\right)

and a computationally efficient online control algorithm $\mathcal{A}$ that guarantees

J_{T}(\mathcal{A}|w_{1:T})<\min\left\{\alpha^{*}\cdot OPT_{*}(w_{1:T}),\ \min_{\pi\in\mathcal{K}}J_{T}(\pi|w_{1:T}),\ \min_{\pi\in\mathcal{M}}J_{T}(\pi|w_{1:T})\right\}+R_{T}.

4 Proof of Theorem 6

Proof of Theorem 6.

Unrolling the recursive definition of $\nu_{t}$ in Theorem 4, we see that

\nu_{t}=\sum_{i=1}^{t-1}(A-KQ^{1/2})^{i-1}w_{t-i}.

Let $\{x_{t}\}_{t=1}^{T}$ be the state sequence generated by the competitive control policy in the original $m$-dimensional system, and let $\{\xi_{t}\}_{t=1}^{T}$ be the sequence of states generated by the competitive policy in the $2m$-dimensional synthetic system. Additionally, let $\widehat{K}=\begin{bmatrix}\widehat{K}_{0}&\widehat{K}_{1}\end{bmatrix}$ be the partition of $\widehat{K}$ into the first $m$ columns and the remaining $m$. Recall that $A+B\widehat{K}_{0}=SLS^{-1}$, where $\|S\|\|S^{-1}\|\leq\kappa$ and $\|L\|\leq 1-\gamma$. Using Theorem 4, we decompose $x_{t}$ as a linear combination of the disturbance terms:

x_{t+1} =Ax_{t}+B\widehat{K}\xi_{t}+w_{t}=Ax_{t}+B\begin{bmatrix}\widehat{K}_{0}&\widehat{K}_{1}\end{bmatrix}\begin{bmatrix}x_{t}-\nu_{t}\\ \Sigma^{-1/2}Q^{1/2}\nu_{t}\end{bmatrix}+w_{t}
=(A+B\widehat{K}_{0})x_{t}+B(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})\nu_{t}+w_{t}
=\sum_{i=0}^{t-1}\left((A+B\widehat{K}_{0})^{i}+\sum_{j=1}^{i}(A+B\widehat{K}_{0})^{i-j}B(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{j-1}\right)w_{t-i}.

We now describe a DAC policy which approximates the competitive control policy; recall that every DAC policy is parameterized by a stabilizing controller and a set of $H$ weights. We take the stabilizing controller to be $\widehat{K}_{0}$ and the set of weights to be $M=(M^{[0]},\ldots,M^{[H-1]})$, where

M^{[i-1]}\stackrel{\text{def}}{=}(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{i-1}.

The action generated by our DAC approximation of the competitive policy is therefore

u_{t}(M)\stackrel{\text{def}}{=}\widehat{K}_{0}x_{t}(M)+\sum_{i=1}^{H}M^{[i-1]}w_{t-i},\qquad(1)

where $x_{t}(M)$ is the corresponding state sequence generated by the DAC.
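As a concrete illustration of this construction (not part of the proof), the weights can be formed as follows, assuming the matrices of Theorem 4 are available; the resulting list can be fed to a generic DAC execution routine such as the `dac_action` sketch given after Definition 3.

```python
import numpy as np
from scipy.linalg import sqrtm

def competitive_dac_weights(A, K, Q, Sigma, K_hat_0, K_hat_1, H):
    """Return M^{[i-1]} = (K_hat_1 Sigma^{-1/2} Q^{1/2} - K_hat_0)(A - K Q^{1/2})^{i-1}, i = 1..H."""
    Q_half = np.real(sqrtm(Q))
    Sigma_inv_half = np.linalg.inv(np.real(sqrtm(Sigma)))
    base = K_hat_1 @ Sigma_inv_half @ Q_half - K_hat_0
    F = A - K @ Q_half
    M, F_pow = [], np.eye(A.shape[0])      # F_pow holds (A - K Q^{1/2})^{i-1}
    for _ in range(H):
        M.append(base @ F_pow)
        F_pow = F_pow @ F
    return M
```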

We now show that the weights $M^{[i-1]}$ decay geometrically. Recall that $A+B\widehat{K}_{0}=S_{1}L_{1}S_{1}^{-1}$, where $\|S_{1}\|\|S_{1}^{-1}\|\leq\kappa$ and $\|L_{1}\|\leq 1-\gamma$. Similarly, $A-KQ^{1/2}=S_{2}L_{2}S_{2}^{-1}$, where $\|S_{2}\|\|S_{2}^{-1}\|\leq\kappa$ and $\|L_{2}\|\leq 1-\gamma$. Recall that $\Sigma=I+Q^{1/2}PQ^{1/2}\succeq I$, and therefore $\|\Sigma^{-1/2}\|\leq 1$. We see that

\|M^{[i-1]}\|\leq\|\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0}\|\|(A-KQ^{1/2})^{i-1}\|\leq 2\kappa^{2}\max(1,\|Q\|^{1/2})(1-\gamma)^{i-1},\qquad(2)

where we used the fact that $\kappa\geq\|\widehat{K}\|$ and hence $\kappa\geq\max(\|\widehat{K}_{0}\|,\|\widehat{K}_{1}\|)$. We now bound the distance between $x_{t}$ and $x_{t}(M)$. Plugging our choice of $u_{t}$ given in (1) into the dynamics of $x$, we see that

x_{t+1}(M) =Ax_{t}(M)+B\left(\widehat{K}_{0}x_{t}(M)+\sum_{i=1}^{H}M^{[i-1]}w_{t-i}\right)+w_{t}
=\sum_{i=0}^{t-1}\left((A+B\widehat{K}_{0})^{i}+\sum_{j=1}^{\min\{H,i\}}(A+B\widehat{K}_{0})^{i-j}BM^{[j-1]}\right)w_{t-i}.

Comparing the expansions of $x_{t}$ and $x_{t}(M)$ on a per-term basis, we observe that the terms in their difference contain either $(A+B\widehat{K}_{0})^{H/2}$ or $(A-KQ^{1/2})^{H/2}$, and thus their difference is exponentially small in $H/2$ under the strong stability of $\widehat{K}_{0}$ and $K$. Formally, we have

Claim 10.

\|x_{t+1}-x_{t+1}(M)\|\leq\frac{16W\kappa^{4}\max(1,\|Q\|^{1/2})}{\gamma^{2}}(1-\gamma/2)^{H}.

We now bound the distance between $u_{t}$ and $u_{t}(M)$. Unrolling the dynamics, we get the following:

u_{t} =\widehat{K}\xi_{t}=\widehat{K}_{0}x_{t}+\sum_{i=1}^{t-1}(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{i-1}w_{t-i},
u_{t}(M)=\widehat{K}_{0}x_{t}(M)+\sum_{i=1}^{H}(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{i-1}w_{t-i}.

Similar to the argument for Claim 10, we have the following:

Claim 11.

\|u_{t}-u_{t}(M)\|\leq\frac{20W\kappa^{5}\max(1,\|Q\|^{1/2})}{\gamma^{2}}(1-\gamma/2)^{H},

where we used the strong stability of $K^{\top}$, the bound on $\|x_{t}-x_{t}(M)\|$ given in Claim 10, and the fact that $\kappa\geq\max(1,\|\widehat{K}_{0}\|,\|\widehat{K}_{1}\|)$. We now show that by taking $H$ to be sufficiently large we can ensure that $J_{T}(\pi|w_{1:T})<J_{T}(\mathcal{A}|w_{1:T})+\varepsilon$. We utilize the following bound on $x_{t}(M),u_{t}(M)$:

Lemma 12.

The execution of any policy from an $(H,\theta,\gamma^{\prime})$-DAC policy class produces state-action sequences that obey, for all times $t$, $\|x_{t}\|,\|u_{t}\|\leq 3\kappa^{3}\theta W/(\gamma\gamma^{\prime})$.

In light of (2), we see that we can take $\theta=2\kappa^{2}\max(1,\|Q\|^{1/2})$, leading to the bound

\max(\|x_{t}\|,\|u_{t}\|)\leq\frac{6W\kappa^{5}\max(1,\|Q\|^{1/2})}{\gamma^{2}}.

Using the easily-verified inequality $|\|x\|^{2}-\|y\|^{2}|\leq\|x-y\|\left(2\|y\|+\|x-y\|\right)$, we put the pieces together to bound the difference in costs incurred by the competitive policy and our DAC approximation in each timestep:

|x_{t}^{\top}Qx_{t}-x_{t}(M)^{\top}Qx_{t}(M)| \leq \|Q\|\|x_{t}-x_{t}(M)\|\left(2\|x_{t}(M)\|+\|x_{t}-x_{t}(M)\|\right)
\leq \frac{448W^{2}\kappa^{10}\max(1,\|Q\|)\|Q\|}{\gamma^{4}}(1-\gamma/2)^{H}.

Similarly,

|u_{t}^{\top}Ru_{t}-u_{t}(M)^{\top}Ru_{t}(M)| \leq \|R\|\|u_{t}-u_{t}(M)\|\left(2\|u_{t}(M)\|+\|u_{t}-u_{t}(M)\|\right)
\leq \frac{640W^{2}\kappa^{11}\max(1,\|Q\|)\|R\|}{\gamma^{4}}(1-\gamma/2)^{H}.

The difference in aggregate cost is therefore

|J_{T}(\pi|w_{1:T})-J_{T}(\mathcal{A}|w_{1:T})|\leq\frac{1088W^{2}\kappa^{11}\max(1,\beta^{2})}{\gamma^{4}}(1-\gamma/2)^{H}T.

Taking $H\geq\log\left({1088W^{2}\kappa^{11}\max(1,\beta^{2})T}/({\gamma^{4}\varepsilon})\right)/\log\left((1-\gamma/2)^{-1}\right)$ concludes the proof. ∎

5 Experiments

In Figure 1 we compare the performance of various controllers, namely, the $H_{2}$ controller, the $H_{\infty}$ controller, the infinite-horizon competitive controller from [16], the GPC controller from [2], and the “clairvoyant offline” controller which selects the optimal-in-hindsight sequence of controls. We carry out this comparison on a two-dimensional double integrator system with different noise sequences. As the results of our paper suggest, we confirm that the GPC controller, which attains the best-of-both-worlds guarantee, is indeed the best-performing controller; in particular, it matches and sometimes improves over the performance of competitive control. Further experimental details, along with more simulations on different systems, can be found in the appendix.

Figure 1: Relative performance of the linear-quadratic controllers in the double integrator system, under (a) sinusoidal perturbations, (b) constant perturbations, (c) Gaussian perturbations, and (d) Gaussian random walk perturbations.

6 Conclusions, Open Problems and Limitations

We have proved that the optimal competitive policy in an LTI dynamical system is well-approximated by the class of disturbance-action control (DAC) policies. This implies that the Gradient Perturbation Controller (GPC) algorithm and related approaches attain sublinear regret vs. this policy, even when the dynamical system is unknown ahead of time. This is the first time that a control method has been shown to attain both sublinear regret vs. a large policy class and, simultaneously, a competitive ratio vs. the optimal dynamic policy in hindsight (up to a vanishing additive term). It remains open to extend our results to time-varying and nonlinear systems; the recent methods of [26, 19] are a potentially good starting point.

Acknowledgements

Elad Hazan gratefully acknowledges funding from NSF grant #1704860.

References

  • [1] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
  • [2] Naman Agarwal, Brian Bullins, Elad Hazan, Sham Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning, pages 111–119, 2019.
  • [3] Oren Anava, Elad Hazan, and Shie Mannor. Online learning for adversaries with memory: price of past mistakes. Advances in Neural Information Processing Systems, 28, 2015.
  • [4] Lachlan Andrew, Siddharth Barman, Katrina Ligett, Minghong Lin, Adam Meyerson, Alan Roytman, and Adam Wierman. A tale of two metrics: Simultaneous bounds on competitiveness and regret. In Conference on Learning Theory, pages 741–763. PMLR, 2013.
  • [5] Allan Borodin and Ran El-Yaniv. Online computation and competitive analysis. Cambridge University Press, 2005.
  • [6] Allan Borodin, Nathan Linial, and Michael E Saks. An optimal on-line algorithm for metrical task system. Journal of the ACM (JACM), 39(4):745–763, 1992.
  • [7] Niangjun Chen, Gautam Goel, and Adam Wierman. Smoothed online convex optimization in high dimensions via online balanced descent. In Conference On Learning Theory, pages 1574–1594. PMLR, 2018.
  • [8] Xinyi Chen and Elad Hazan. Black-box control for linear dynamical systems. In Conference on Learning Theory, pages 1114–1143. PMLR, 2021.
  • [9] Alon Cohen, Avinatan Hasidim, Tomer Koren, Nevena Lazic, Yishay Mansour, and Kunal Talwar. Online linear quadratic control. In International Conference on Machine Learning, pages 1029–1038. PMLR, 2018.
  • [10] Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently with only $\sqrt{T}$ regret. In International Conference on Machine Learning, pages 1300–1309, 2019.
  • [11] Amit Daniely and Yishay Mansour. Competitive ratio vs regret minimization: achieving the best of both worlds. In Algorithmic Learning Theory, pages 333–368. PMLR, 2019.
  • [12] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
  • [13] Dylan Foster and Max Simchowitz. Logarithmic regret for adversarial online control. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3211–3221. PMLR, 13–18 Jul 2020.
  • [14] Dylan Foster and Max Simchowitz. Logarithmic regret for adversarial online control. In International Conference on Machine Learning, pages 3211–3221. PMLR, 2020.
  • [15] Gautam Goel and Babak Hassibi. The power of linear controllers in lqr control. arXiv preprint arXiv:2002.02574, 2020.
  • [16] Gautam Goel and Babak Hassibi. Competitive control. arXiv preprint arXiv:2107.13657, 2021.
  • [17] Gautam Goel, Yiheng Lin, Haoyuan Sun, and Adam Wierman. Beyond online balanced descent: An optimal algorithm for smoothed online optimization. Advances in Neural Information Processing Systems, 32, 2019.
  • [18] Gautam Goel and Adam Wierman. An online algorithm for smoothed regression and lqr control. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2504–2513. PMLR, 2019.
  • [19] Paula Gradu, Elad Hazan, and Edgar Minasyan. Adaptive regret for control of time-varying dynamics. arXiv preprint arXiv:2007.04393, 2020.
  • [20] Elad Hazan. Introduction to online convex optimization. arXiv preprint arXiv:1909.05207, 2019.
  • [21] Elad Hazan, Sham Kakade, and Karan Singh. The nonstochastic control problem. In Aryeh Kontorovich and Gergely Neu, editors, Proceedings of the 31st International Conference on Algorithmic Learning Theory, volume 117 of Proceedings of Machine Learning Research, pages 408–421. PMLR, 08 Feb–11 Feb 2020.
  • [22] Elad Hazan, Sham Kakade, and Karan Singh. The nonstochastic control problem. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, pages 408–421. PMLR, 2020.
  • [23] Elad Hazan and Comandur Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of the 26th annual international conference on machine learning, pages 393–400, 2009.
  • [24] Elad Hazan and Karan Singh. Tutorial: online and non-stochastic control, July 2021.
  • [25] Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, pages 10154–10164, 2019.
  • [26] Edgar Minasyan, Paula Gradu, Max Simchowitz, and Elad Hazan. Online control of unknown time-varying dynamical systems. Advances in Neural Information Processing Systems, 34, 2021.
  • [27] Guanya Shi, Yiheng Lin, Soon-Jo Chung, Yisong Yue, and Adam Wierman. Online optimization with memory and competitive control. Advances in Neural Information Processing Systems, 33:20636–20647, 2020.
  • [28] Max Simchowitz. Making non-stochastic control (almost) as easy as stochastic. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 18318–18329. Curran Associates, Inc., 2020.
  • [29] Max Simchowitz and Dylan Foster. Naive exploration is optimal for online lqr. In International Conference on Machine Learning, pages 8937–8948. PMLR, 2020.
  • [30] Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control, 2020.
  • [31] Daniel D Sleator and Robert E Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28(2):202–208, 1985.
  • [32] Andras Varga. Detection and isolation of actuator/surface faults for a large transport aircraft. In Fault Tolerant Flight Control, pages 423–448. Springer, 2010.
  • [33] Lijun Zhang, Tianbao Yang, Zhi-Hua Zhou, et al. Dynamic regret of strongly adaptive methods. In International conference on machine learning, pages 5882–5891. PMLR, 2018.

Appendix A Appendix Layout

In Section B, we provide the proofs of the claims necessary to conclude Theorem 6. Following this, Sections C and D prove the results for known and unknown systems, respectively. Thereafter, we prove a generalization of Theorem 6, necessary for the case of unknown systems, in Section E. Finally, Section F discusses the translation of the optimality results from the infinite-horizon to the finite-horizon setting. Section G provides further details on the experimental setup, along with more experimental evaluations.

Appendix B Proofs of claims substantiating Theorem 6

Proof of Claim 10.

Using the strong stability of $\widehat{K}_{0}$ and $K^{\top}$, we have

\|x_{t+1}-x_{t+1}(M)\| \leq \sum_{i\geq H+1}\sum_{j=H+1}^{i}\left\|(A+B\widehat{K}_{0})^{i-j}B(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{j-1}\right\|W
\leq 2W\kappa^{4}\max(1,\|Q\|^{1/2})\sum_{i\geq H+1}i(1-\gamma)^{i-1}
\leq \frac{8W\kappa^{4}\max(1,\|Q\|^{1/2})}{\gamma}\sum_{i\geq H+1}(1-\gamma/2)^{i}
\leq \frac{16W\kappa^{4}\max(1,\|Q\|^{1/2})}{\gamma^{2}}(1-\gamma/2)^{H},

where in the penultimate line we used the following elementary result:

Lemma 13.

For all $\gamma\in[0,1/2]$ and all $i\geq 0$, the following inequality holds:

i(1-\gamma)^{i}\leq 2(1-\gamma/2)^{i}/\gamma.

Proof of Claim 11.

Recall that by unrolling the dynamics, we get the following:

u_{t} =\widehat{K}\xi_{t}=\widehat{K}_{0}(x_{t}-\nu_{t})+\widehat{K}_{1}\widehat{w}_{t}=\widehat{K}_{0}x_{t}+\sum_{i=1}^{t-1}(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{i-1}w_{t-i},
u_{t}(M)=\widehat{K}_{0}x_{t}(M)+\sum_{i=1}^{H}(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{i-1}w_{t-i}.

Using the strong stability of $K^{\top}$, we bound the point-wise difference between these sequences as

\|u_{t}-u_{t}(M)\| \leq \|\widehat{K}_{0}\|\|x_{t}-x_{t}(M)\|+\sum_{i\geq H+1}\|(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{i-1}\|W
\leq \frac{20W\kappa^{5}\max(1,\|Q\|^{1/2})}{\gamma^{2}}(1-\gamma/2)^{H}.

Proof of Lemma 12.

Since $\mathbb{K}$ is $(\kappa,\gamma)$-strongly stable, there exist matrices $S,L$ such that $SLS^{-1}=A+B\mathbb{K}$ with $\|S\|\|S^{-1}\|\leq\kappa$ and $\|L\|\leq 1-\gamma$. Also recall that every member of an $(H,\theta,\gamma^{\prime})$-DAC class satisfies $\|M^{[i]}\|\leq\theta(1-\gamma^{\prime})^{i}$. Observe for any $t\geq 0$ that

\|x_{t+1}\| =\left\|\sum_{i\geq 0}(A+B\mathbb{K})^{i}\left(w_{t-i}+B\sum_{j=1}^{H}M^{[j-1]}w_{t-i-j}\right)\right\|
\leq \left(\sum_{i\geq 0}\|(A+B\mathbb{K})^{i}\|\right)\max_{i}\left\|w_{t-i}+B\sum_{j=1}^{H}M^{[j-1]}w_{t-i-j}\right\|
\leq \left(\sum_{i\geq 0}\|(A+B\mathbb{K})^{i}\|\right)\max_{i}\left(\|w_{t-i}\|+\|B\|\max_{j}\|w_{t-i-j}\|\sum_{j=1}^{H}\|M^{[j-1]}\|\right)
\leq \frac{\kappa}{\gamma}W\left(1+\frac{\kappa\theta}{\gamma^{\prime}}\right)\leq\frac{2\kappa^{2}\theta W}{\gamma\gamma^{\prime}},
\|u_{t}\| =\left\|\mathbb{K}x_{t}+\sum_{i=1}^{H}M^{[i-1]}w_{t-i}\right\|\leq\frac{2\kappa^{3}\theta W}{\gamma\gamma^{\prime}}+\frac{\theta W}{\gamma^{\prime}}\leq\frac{3\kappa^{3}\theta W}{\gamma\gamma^{\prime}},

where the last step uses $\kappa\geq 1$ and $\gamma\leq 1$.

Proof of Lemma 13.

Recall that $xe^{-x}\leq 1$ for all $x$, and therefore $ie^{-\gamma i/2}\leq\frac{2}{\gamma}$. Also, $\sqrt{1-\gamma}\leq 1-\gamma/2$ for all $\gamma\leq 1$. We have

i(1-\gamma)^{i} = i(1-\gamma)^{i/2}(1-\gamma)^{i/2}
\leq ie^{-\gamma i/2}(1-\gamma/2)^{i}
\leq \frac{2}{\gamma}(1-\gamma/2)^{i}.

Appendix C Proofs for known systems

Proof of Theorem 7.

The perturbation sequence $w_{1:T}$ has $T$ terms. Overloading this notation, we extend it to an infinite sequence with $w_{t}=0$ for all $t>T$. Let $\mathcal{K}$ be the class of $(\kappa,\gamma)$-strongly stable linear policies, and let $\mathcal{M}$ be an $(H,\theta,\gamma^{\prime})$-DAC class for

\theta=2\kappa^{2}\max\{1,\beta^{1/2}\},\quad\gamma^{\prime}=\gamma/2,\quad H=2\log\left(1088W^{2}\kappa^{11}\max\{1,\beta^{2}\}T/\gamma^{4}\right)/\gamma.

By the results of [2], using the GPC algorithm (Algorithm 1 in [2]) as the online learner $\mathcal{L}$, we have the following regret guarantee:

J_{T}(\mathcal{L}|w_{1:T})\leq\min_{\pi\in\mathcal{K}\cup\mathcal{M}}J_{T}(\pi|w_{1:T})+\tilde{O}(\operatorname{poly}(m,n,\kappa,\gamma^{-1},\beta)W^{2}\sqrt{T}).

Now, recall that Theorem 6 asserts

\min_{\pi\in\mathcal{M}}J_{T}(\pi|w_{1:T})\leq J_{T}(\mathcal{A}|w_{1:T})+O(1),

where $\mathcal{A}$ is the online algorithm (from Theorem 4) that achieves the optimal competitive ratio $\alpha^{*}$. By the non-negativity of the cost, and thereafter using the optimal competitive ratio bound, we have

J_{T}(\mathcal{L}|w_{1:T}) \leq \min_{\pi\in\mathcal{M}}J_{T}(\pi|w_{1:T})+\tilde{O}(\operatorname{poly}(m,n,\kappa,\gamma^{-1},\beta)W^{2}\sqrt{T})
\leq J_{T}(\mathcal{A}|w_{1:T})+\tilde{O}(\operatorname{poly}(m,n,\kappa,\gamma^{-1},\beta)W^{2}\sqrt{T})
\leq J_{\infty}(\mathcal{A}|w_{1:\infty})+\tilde{O}(\operatorname{poly}(m,n,\kappa,\gamma^{-1},\beta)W^{2}\sqrt{T})
\leq \alpha^{*}\cdot OPT_{*}(w_{1:\infty})+\tilde{O}(\operatorname{poly}(m,n,\kappa,\gamma^{-1},\beta)W^{2}\sqrt{T}).

An appeal to Theorem 15 concludes the $\sqrt{T}$ part of the claim. For the polylogarithmic regret result, we use the algorithm and the associated $O(\operatorname{poly}(\mu^{-1})\operatorname{polylog}T)$ regret guarantee from [28]. ∎

Appendix D Proofs for unknown systems

Proof of Theorem 8.

The proof of this theorem is similar in structure to that of Theorem 7. In particular, we utilize regret-minimizing algorithms from [22] and [28] that apply even when $A,B$ are unknown, given a $(\kappa,\gamma)$-strongly stable controller $\mathbb{K}$ for a $(k,\kappa)$-strongly controllable system. However, we highlight a crucial difference below.

Theorem 6 establishes that the infinite-horizon competitive control policy is contained in the class of DACs that utilize $\widehat{K}_{0}$ as the stabilizing baseline policy. Since this policy is derived from the competitive controller, which in turn depends on knowledge of $(A,B)$, such a $\widehat{K}_{0}$ might not be available when the underlying system is unknown, or even only approximately known. To alleviate this concern, we give an analogous result that holds with respect to any baseline stabilizing policy $\mathbb{K}$, instead of the specific choice $\widehat{K}_{0}$.

Theorem 14 (Generalized policy inclusion).

Fix a horizon $T$. For any $\varepsilon>0$, choose

H=2\log\left(10^{5}\beta^{2}W^{2}\kappa^{16}T^{5}/(\gamma^{4}\varepsilon)\right)/\gamma,\quad\theta=20\kappa^{5}\beta^{1/2}/\gamma,\quad\gamma^{\prime}=\gamma/2,

and define $\mathcal{M}$ to be the set of $(H,\theta,\gamma^{\prime})$-DAC policies with an arbitrary baseline stabilizing policy $\mathbb{K}$. Let $\mathcal{A}$ be an online algorithm with the optimal competitive ratio $\alpha^{*}$. Then there exists a $\pi\in\mathcal{M}$ such that for any perturbation sequence $w_{1:T}$,

J_{T}(\pi|w_{1:T})<J_{T}(\mathcal{A}|w_{1:T})+\varepsilon.

Now, with the aid of the above result, we may proceed as in the proof of Theorem 7. The algorithm and result in [22] guarantee $\tilde{O}(T^{2/3})$ regret against $\mathcal{K}\cup\mathcal{M}$, and similarly those in [28] guarantee $\tilde{O}(\operatorname{poly}(\mu^{-1})\sqrt{T})$ regret; call this bound $R_{T}$ in either case, and let $\mathcal{L}$ denote the corresponding learner. Again, by the non-negativity of the cost, and thereafter using the optimal competitive ratio bound, we have

J_{T}(\mathcal{L}|w_{1:T}) \leq \min_{\pi\in\mathcal{M}}J_{T}(\pi|w_{1:T})+R_{T}
\leq J_{T}(\mathcal{A}|w_{1:T})+R_{T}+O(1)
\leq J_{\infty}(\mathcal{A}|w_{1:\infty})+R_{T}+O(1)
\leq \alpha^{*}\cdot OPT_{*}(w_{1:\infty})+R_{T}+O(1),

where we extend $w_{1:T}$ to an infinite sequence $w_{1:\infty}$ with $w_{t}=0$ for all $t>T$, and $\mathcal{A}$ is the algorithm of Theorem 14 attaining the optimal competitive ratio. An appeal to Theorem 15 concludes the claim. ∎

Proof of Theorem 9.

Here, we use the algorithm and the regret bound from [8], which guarantees $R_{T}=2^{O_{T}(1)}+\tilde{O}_{T}(\min\{\sqrt{T},\operatorname{poly}(\mu^{-1}\log T)\})$ regret against the combined policy class $\mathcal{K}\cup\mathcal{M}$. The rest of the proof is analogous to that of Theorem 8 – in particular, we invoke Theorem 14. ∎

Appendix E A generalized proof of policy inclusion

Proof of Theorem 14.

We begin by taking note of the explicit characterization of an optimal competitive-ratio algorithm presented in Theorem 4. In particular, unrolling the recursive definition of $\nu_{t}$, we have $\nu_{t}=\sum_{i\geq 1}(A-KQ^{1/2})^{i-1}w_{t-i}$. Let the true state sequence visited by such a competitive control policy be $\{x_{t}\}_{t=1}^{T}$. Additionally, let $\widehat{K}=\begin{bmatrix}\widehat{K}_{0}&\widehat{K}_{1}\end{bmatrix}$ be the partition of $\widehat{K}$ into the first $m$ columns and the remaining $m$. Since $\widehat{w}_{t}=\Sigma^{-1/2}Q^{1/2}\nu_{t}$, by the second part of the said theorem, we have

x_{t+1} =Ax_{t}+B\widehat{K}\xi_{t}+w_{t}=Ax_{t}+B\widehat{K}\begin{bmatrix}x_{t}-\nu_{t}\\ \Sigma^{-1/2}Q^{1/2}\nu_{t}\end{bmatrix}+w_{t}
=(A+B\widehat{K}_{0})x_{t}+B(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})\nu_{t}+w_{t}
=\sum_{i\geq 0}(A+B\widehat{K}_{0})^{i}\left(w_{t-i}+B(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})\nu_{t-i}\right)
=\sum_{i\geq 0}(A+B\widehat{K}_{0})^{i}\left(w_{t-i}+B(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})\sum_{j\geq 1}(A-KQ^{1/2})^{j-1}w_{t-i-j}\right)
=\sum_{i\geq 0}\left((A+B\widehat{K}_{0})^{i}+\sum_{j=1}^{i}(A+B\widehat{K}_{0})^{i-j}B(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{j-1}\right)w_{t-i}.

We now choose a DAC policy intended to emulate the competitive control policy as

M^{[i-1]}\stackrel{\text{def}}{=} (\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{i-1}
+(\widehat{K}_{0}-\mathbb{K})\left((A+B\widehat{K}_{0})^{i-1}+\sum_{j=1}^{i-1}(A+B\widehat{K}_{0})^{i-j-1}B(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{j-1}\right),

and define the associated action as $u_{t}(M)\stackrel{\text{def}}{=}\mathbb{K}x_{t}(M)+\sum_{i=1}^{H}M^{[i-1]}w_{t-i}$, where $x_{t}(M)$ is the state sequence achieved by the execution of such a DAC policy. Then

x_{t+1}(M) =Ax_{t}(M)+B\left(\mathbb{K}x_{t}(M)+\sum_{i=1}^{H}M^{[i-1]}w_{t-i}\right)+w_{t}
=\sum_{i\geq 0}(A+B\mathbb{K})^{i}\left(w_{t-i}+B\sum_{j=1}^{H}M^{[j-1]}w_{t-i-j}\right)
=\sum_{i\geq 0}\left((A+B\mathbb{K})^{i}+\sum_{j=1}^{\min\{H,i\}}(A+B\mathbb{K})^{i-j}BM^{[j-1]}\right)w_{t-i}.

Before proceeding to establish that $x_{t}$ and $x_{t}(M)$ are close, let us note that the $M^{[i]}$ decay. To this end, recall that due to strong stability there exist matrices $\widehat{S},\widehat{L},S,L$ with $\|\widehat{S}\|\|\widehat{S}^{-1}\|,\|S\|\|S^{-1}\|\leq\kappa$ and $\|L\|\leq 1-\gamma,\|\widehat{L}\|\leq 1-\gamma$ such that $\widehat{A}+\widehat{B}_{u}\widehat{K}=\widehat{S}\widehat{L}\widehat{S}^{-1}$ and $A-KQ^{1/2}=SLS^{-1}$. Noting that $\Sigma\succeq I$ by definition and that $\beta I\succeq Q,I$, we thus obtain

\|M^{[i]}\| \leq 2\kappa^{2}\beta^{1/2}(1-\gamma)^{i}+2\kappa\left(\kappa(1-\gamma)^{i}+2\beta^{1/2}\kappa^{4}i(1-\gamma)^{i-1}\right)
\leq 4\kappa^{2}\beta^{1/2}(1-\gamma)^{i}+16\kappa^{5}\beta^{1/2}(1-\gamma/2)^{i}/\gamma\leq 20\kappa^{5}\beta^{1/2}(1-\gamma/2)^{i}/\gamma,

where the last line follows from Lemma 13.

Note the identity $X^{n}=Y^{n}+\sum_{i=1}^{n}Y^{i-1}(X-Y)X^{n-i}$. Using this twice, we have

(A+B\widehat{K}_{0})^{i}= (A+B\mathbb{K})^{i}+\sum_{j=1}^{i}(A+B\mathbb{K})^{i-j}B(\widehat{K}_{0}-\mathbb{K})(A+B\widehat{K}_{0})^{j-1},
(A+B\widehat{K}_{0})^{i-j}= (A+B\mathbb{K})^{i-j}+\sum_{k=1}^{i-j}(A+B\mathbb{K})^{k-1}B(\widehat{K}_{0}-\mathbb{K})(A+B\widehat{K}_{0})^{i-j-k}.

Thus, proceeding with the above identities, this time invoking the strong stability of $\mathbb{K}$, we have

\|x_{t}-x_{t}(M)\| \leq \sum_{i\geq H+1}\sum_{j=H+1}^{i}\left\|(A+B\mathbb{K})^{i-j}BM^{[j-1]}\right\|W
\leq \sum_{i\geq H+1}\frac{20\kappa^{7}\beta^{1/2}W}{\gamma}i(1-\gamma/2)^{i-1}\leq\frac{20\kappa^{7}\beta^{1/2}WT^{2}}{\gamma}(1-\gamma/2)^{H}.

Similarly, the corresponding control inputs are nearly equal. By the previous display concerning the states, we have

u_{t} =\widehat{K}_{0}x_{t}+(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})\nu_{t}
=\widehat{K}_{0}x_{t}+(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})\sum_{i\geq 1}(A-KQ^{1/2})^{i-1}w_{t-i},
u_{t}(M) =\mathbb{K}x_{t}(M)+\sum_{i=1}^{H}(\widehat{K}_{1}\Sigma^{-1/2}Q^{1/2}-\widehat{K}_{0})(A-KQ^{1/2})^{i-1}w_{t-i}+(\widehat{K}_{0}-\mathbb{K})x_{t},
\|u_{t}-u_{t}(M)\| \leq \|\mathbb{K}\|\|x_{t}-x_{t}(M)\|+\sum_{i\geq H+1}\|(A-KQ^{1/2})\|^{i-1}\,2\kappa\beta^{1/2}W
\leq \frac{20\kappa^{8}\beta^{1/2}WT^{2}}{\gamma}(1-\gamma/2)^{H}+2\kappa^{2}\beta^{1/2}W\sum_{i\geq H+1}(1-\gamma)^{i-1}
\leq \frac{22\kappa^{8}\beta^{1/2}WT^{2}}{\gamma}(1-\gamma/2)^{H}.

Being the iterates produced by a DAC policy, $x_{t}(M),u_{t}(M)$ are bounded as indicated by Lemma 12. Now, in terms of cost, since $\|x\|^{2}-\|y\|^{2}=(x-y)^{\top}(x+y)$ and given Equation 2, we note that

\begin{align*}
&\left|x_{t}^{\top}Qx_{t}+u_{t}^{\top}Ru_{t}-x_{t}(M)^{\top}Qx_{t}(M)-u_{t}(M)^{\top}Ru_{t}(M)\right|\\
\leq~&\beta\|u_{t}-u_{t}(M)\|(2\|u_{t}(M)\|+\|u_{t}-u_{t}(M)\|)+\beta\|x_{t}-x_{t}(M)\|(2\|x_{t}(M)\|+\|x_{t}-x_{t}(M)\|)\\
\leq~&\frac{10^{5}\beta^{2}W^{2}\kappa^{16}T^{4}}{\gamma^{4}}(1-\gamma/2)^{H}.
\end{align*}

The difference in aggregate cost is at most $T$ times the quantity above. Therefore, setting $H=2\log\left(10^{5}\beta^{2}W^{2}\kappa^{16}T^{5}/(\gamma^{4}\varepsilon)\right)/\gamma$ is sufficient to ensure the validity of the claim. We have already established that the DAC policy satisfies $\theta=20\kappa^{5}\beta^{1/2}/\gamma$ and $\gamma^{\prime}=\gamma/2$. ∎
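To give a sense of scale for the truncation length prescribed in the proof above, the following one-line computation (ours) evaluates $H$ for some illustrative, arbitrarily chosen constants; as expected, $H$ grows only logarithmically in $T$ and $1/\varepsilon$, so the approximating DAC policy remains short-memory.

import math

def dac_history_length(beta, W, kappa, gamma, T, eps):
    # H = 2 log(1e5 * beta^2 * W^2 * kappa^16 * T^5 / (gamma^4 * eps)) / gamma, rounded up.
    return math.ceil(2 * math.log(1e5 * beta**2 * W**2 * kappa**16 * T**5 / (gamma**4 * eps)) / gamma)

print(dac_history_length(beta=1.0, W=1.0, kappa=2.0, gamma=0.5, T=10**4, eps=1.0))  # prints 286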

Appendix F To finity and beyond

In this section, we relate the infinite-horizon cost $OPT_{*}(w_{1:\infty})$, as used in earlier proofs, to its finite-horizon variant $OPT_{*}(w_{1:T})$. By the non-negativity of the instantaneous cost, we have $OPT_{*}(w_{1:\infty})\geq OPT_{*}(w_{1:T})$. We prove the reverse inequality, up to an additive constant.

Theorem 15.

Extend the perturbation sequence $w_{1:T}$ to an infinite sequence $w_{1:\infty}$ by defining $w_{t}=0$ for $t>T$. Then, we have

$$OPT_{*}(w_{1:\infty})\leq OPT_{*}(w_{1:T})+\operatorname{poly}(m,n,\kappa,\gamma^{-1},\mu^{-1},\beta)W^{2}.$$
Proof.

Let $u_{1:T}^{*}(w_{1:T})=\operatorname*{arg\,min}_{u_{1:T}}J_{T}(u_{1:T}|w_{1:T})$; these are the actions of the finite-horizon optimal offline controller attaining cost $OPT_{*}(w_{1:T})$. Let $\mathcal{I}$ be an algorithm that executes $u_{t}^{*}(w_{1:T})$ for the first $T$ time steps and $u_{t}=\widehat{K}_{0}x_{t}$ (or any other $(\kappa,\gamma)$-strongly stable policy) thereafter. Let $x_{t}(\mathcal{I}),u_{t}(\mathcal{I})$ be the state-action sequence associated with $\mathcal{I}$. By the optimality underlying the definition of $OPT_{*}(w_{1:\infty})$, we have

\begin{align*}
OPT_{*}(w_{1:\infty})&\leq J_{\infty}(\mathcal{I}|w_{1:\infty})\\
&=\sum_{t=1}^{T}c(x_{t}(\mathcal{I}),u_{t}(\mathcal{I}))+\sum_{t=T+1}^{\infty}c(x_{t}(\mathcal{I}),u_{t}(\mathcal{I}))\\
&=OPT_{*}(w_{1:T})+\sum_{t=T+1}^{\infty}c(x_{t}(\mathcal{I}),u_{t}(\mathcal{I})),
\end{align*}

where the last line uses the observation that for the first $T$ steps $\mathcal{I}$ executes $u_{t}^{*}(w_{1:T})$. It remains to bound the tail term. To this end, note that $w_{t}=0$ for all $t>T$, which implies for $t>0$ that

\begin{align*}
x_{T+t}(\mathcal{I})&=(A+B\widehat{K}_{0})^{t-1}x_{T+1},\\
u_{T+t}(\mathcal{I})&=\widehat{K}_{0}(A+B\widehat{K}_{0})^{t-1}x_{T+1},\\
\sum_{t=T+1}^{\infty}c(x_{t}(\mathcal{I}),u_{t}(\mathcal{I}))&\leq\frac{2\beta\kappa^{4}}{\gamma^{2}}\|x_{T+1}\|^{2}.
\end{align*}
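One way to see the last display (our worked version, under the standing assumptions that $\widehat{K}_{0}$ is $(\kappa,\gamma)$-strongly stable, so that $\|(A+B\widehat{K}_{0})^{t-1}\|\leq\kappa(1-\gamma)^{t-1}$ and $\|\widehat{K}_{0}\|\leq\kappa$ with $\kappa\geq 1$, and that $Q,R\preceq\beta I$) is the following geometric sum:
\begin{align*}
\sum_{t=T+1}^{\infty}c(x_{t}(\mathcal{I}),u_{t}(\mathcal{I}))&\leq\beta\sum_{t\geq 1}\left(\|x_{T+t}(\mathcal{I})\|^{2}+\|u_{T+t}(\mathcal{I})\|^{2}\right)\leq\beta(1+\kappa^{2})\kappa^{2}\|x_{T+1}\|^{2}\sum_{t\geq 1}(1-\gamma)^{2(t-1)}\\
&\leq 2\beta\kappa^{4}\|x_{T+1}\|^{2}\sum_{t\geq 1}(1-\gamma)^{t-1}=\frac{2\beta\kappa^{4}}{\gamma}\|x_{T+1}\|^{2}\leq\frac{2\beta\kappa^{4}}{\gamma^{2}}\|x_{T+1}\|^{2},
\end{align*}
where the final inequality uses $\gamma\leq 1$.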

To conclude the proof, we invoke Lemmas C.9 and C.6 of [14], in which the authors establish a bound on the state sequence produced by the optimal offline policy via an explicit characterization of the same, to note that

$$\|x_{T+1}\|\leq\operatorname{poly}(m,n,\kappa,\gamma^{-1},\mu^{-1},\beta)W.$$ ∎

Appendix G Experiment Setup & Further Evaluations

We compare five controllers in our experiments: the $H_{2}$ controller, the $H_{\infty}$ controller, the infinite-horizon competitive controller from [16], the GPC controller from [2], and the "clairvoyant offline" controller, which selects the optimal-in-hindsight sequence of controls. We consider two systems. The first is a standard two-dimensional double integrator, given by

$$A=\begin{bmatrix}1&1\\ 0&1\end{bmatrix},\quad B_{u}=\begin{bmatrix}0\\ 1\end{bmatrix},\quad B_{w}=I,\quad Q=I,\quad R=I,$$

and the second is a system derived from Boeing [32]. The exact system details can be found in the code supplied with the supplementary material. At a high level, the system has a 5-dimensional state and a 9-dimensional control.
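For concreteness, the following is a minimal sketch (ours; the supplementary code remains the authoritative reference) of the double-integrator setup together with an $H_{2}$/LQR baseline computed from the discrete-time algebraic Riccati equation via SciPy; the remaining controllers are then run on the same $(A,B_{u},B_{w},Q,R)$ tuple.

import numpy as np
from scipy.linalg import solve_discrete_are

# Double integrator: state = (position, velocity), scalar control, unit costs.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

# H2 / LQR feedback u_t = -K x_t from the discrete algebraic Riccati equation.
P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)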

The perturbations we add fall into six categories; a sketch that generates these sequences is given after the list.

  • Sin Perturbations: every entry of $w_{t}$ is set to $\sin\left(\frac{8\pi t}{T}\right)$, where $T$ is the time horizon.

  • Sin Perturbations with changing Amplitude: every entry of $w_{t}$ is set to $\sin\left(\frac{8\pi t}{T}\right)\sin\left(\frac{6\pi t}{T}\right)$, where $T$ is the time horizon.

  • Constant Perturbations: every entry of $w_{t}$ is set to 1.

  • Uniform Perturbations: every entry of $w_{t}$ is drawn uniformly and independently from the range $[0,1]$.

  • Gaussian Perturbations: $w_{t}$ is drawn independently from $\mathcal{N}(0,I)$.

  • Gaussian Random Walk Perturbations: $w_{t}$ is drawn from $\mathcal{N}(w_{t-1},I)$, i.e., every entry performs an independent Gaussian random walk.
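The following NumPy sketch (ours; function and category names are our own) generates the six disturbance sequences described above, reading the sinusoids as functions of the time index $t$:

import numpy as np

def make_perturbations(kind, T, n, rng=None):
    # Returns the disturbance sequence w_1, ..., w_T as an array of shape (T, n).
    rng = rng if rng is not None else np.random.default_rng(0)
    t = np.arange(1, T + 1)[:, None]  # time index, broadcast across the n entries
    if kind == "sin":
        return np.tile(np.sin(8 * np.pi * t / T), (1, n))
    if kind == "sin_amplitude":
        return np.tile(np.sin(8 * np.pi * t / T) * np.sin(6 * np.pi * t / T), (1, n))
    if kind == "constant":
        return np.ones((T, n))
    if kind == "uniform":
        return rng.uniform(0.0, 1.0, size=(T, n))
    if kind == "gaussian":
        return rng.standard_normal((T, n))
    if kind == "gaussian_walk":
        return np.cumsum(rng.standard_normal((T, n)), axis=0)  # w_t = w_{t-1} + N(0, I)
    raise ValueError(f"unknown perturbation kind: {kind}")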

The performance of the controllers on the double integrator system is shown in Figure 2, and the performance of the controllers on the Boeing system is shown in Figure 3. The experiments confirm our theorems: over a long time horizon, GPC performs on par with, and sometimes better than, the competitive controller. We note that the optimal competitive ratio for the competitive controller turns out to be 14.67 for the double integrator system and 5.686 for the Boeing system.

Figure 2: Performance of the various controllers on the Double Integrator system under different noise profiles: (a) sinusoidal perturbations, (b) constant perturbations, (c) sin perturbations with changing amplitude, (d) uniformly distributed perturbations, (e) Gaussian perturbations, (f) Gaussian random walk perturbations. Error ranges are shown over 5 trials for randomly generated noise sequences; in such cases, the solid lines indicate the mean performance over the 5 trials.
Figure 3: Performance of the various controllers on the Boeing system under different noise profiles: (a) sin perturbations, (b) sin perturbations with changing amplitude, (c) constant perturbations, (d) uniformly distributed perturbations, (e) Gaussian perturbations, (f) Gaussian random walk perturbations. Error ranges are shown over 5 trials for randomly generated noise sequences; in such cases, the solid lines indicate the mean performance over the 5 trials.

Hyperparameter setup for GPC

GPC is the only controller with hyperparameters. We describe the hyperparameters used below.

  • Double Integrator System: For all perturbations, $H=M=3$. For all perturbations except the Gaussian random walk perturbations, GPC was run with learning rate $0.002$. For the Gaussian random walk perturbations, GPC was run with learning rate $10^{-4}$.

  • Boeing System: For all perturbations, $H=M=10$. For all perturbations except the Gaussian random walk perturbations, GPC was run with learning rate $0.001$. For the Gaussian random walk perturbations, GPC was run with learning rate $10^{-7}$.

We note that the learning rate needs to be reduced significantly for the Gaussian random walk perturbations, since the noise magnitude grows over time; in this sense, our theorems do not immediately apply to this setting.
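For reference, the settings above can be collected into a single configuration dictionary (a sketch; the key names are ours):

GPC_HYPERPARAMS = {
    "double_integrator": {"H": 3, "M": 3, "lr": 2e-3, "lr_gaussian_random_walk": 1e-4},
    "boeing": {"H": 10, "M": 10, "lr": 1e-3, "lr_gaussian_random_walk": 1e-7},
}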