Online Policy Gradient for Model Free Learning of Linear Quadratic Regulators with $\sqrt{T}$ Regret
Abstract
We consider the task of learning to control a linear dynamical system under fixed quadratic costs, known as the Linear Quadratic Regulator (LQR) problem. While model-free approaches are often favorable in practice, thus far only model-based methods, which rely on costly system identification, have been shown to achieve regret that scales with the optimal $\sqrt{T}$ dependence on the time horizon $T$. We present the first model-free algorithm that achieves similar regret guarantees. Our method relies on an efficient policy gradient scheme, and a novel and tighter analysis of the cost of exploration in policy space in this setting.
1 Introduction
Model-free, policy gradient algorithms have become a staple of Reinforcement Learning (RL), with both practical successes [19, 12] and strong theoretical guarantees in several settings [26, 24]. In this work we study the design and analysis of such algorithms for the adaptive control of Linear Quadratic Regulator (LQR) systems, as seen through the lens of regret minimization [1, 9, 21]. In this continuous state and action reinforcement learning setting, an agent chooses control actions and the system state evolves according to the noisy linear dynamics
$$x_{t+1} = A_\star x_t + B_\star u_t + w_t,$$
where $A_\star$ and $B_\star$ are transition matrices and the $w_t$ are i.i.d. zero-mean noise terms. The cost is a quadratic function of the current state and action, and the regret is measured with respect to the class of linear policies, which are known to be optimal for this setting.
Model-based methods, which perform planning based on a system identification procedure that estimates the transition matrices, have been studied extensively in recent years. This line of work started with Abbasi-Yadkori and Szepesvári [1], which established an $\tilde{O}(\sqrt{T})$ regret guarantee, albeit with a computationally intractable method. More recently, Cohen et al. [9] and Mania et al. [21] complemented this result with computationally efficient methods, and Cassel et al. [6] and Simchowitz and Foster [25] provided lower bounds showing that this rate is generally unavoidable, regardless of whether the algorithm is model free or not. In comparison, the best existing model-free algorithms are the policy iteration procedures of Krauth et al. [18] and Abbasi-Yadkori et al. [2], whose regret guarantees scale as $T^{2/3}$ or worse in the time horizon.
Our main result is an efficient (in fact, linear time per step) policy gradient algorithm that achieves $\tilde{O}(\sqrt{T})$ regret, thus closing the (theoretical) gap between model-based and model-free methods for the LQR model. An interesting feature of our approach is that while the policies output by the algorithm are clearly state dependent, tuning their parameters requires no explicit access to the state. Instead, we rely only on observations of the incurred cost, similarly to bandit models (e.g., [5]).
One of the main challenges of regret minimization in LQRs (and more generally, in reinforcement learning) is that it is generally infeasible to change policies as often as one likes. Roughly, this is due to a burn-in period following a policy change, during which the system converges to a new steady-state distribution, typically incurring an additional cost proportional to the change in steady states, which is in turn proportional to the distance between policies. There are several ways to overcome this impediment. The simplest is to restrict the number of policy updates and explore directly in the action space via artificial noise (see, e.g., [25]). Another approach, by Cohen et al. [9], considers a notion of slowly changing policies; however, this can be very prohibitive for exploration in policy space. Other works (e.g., [3]) consider a policy parameterization that converts the problem into online optimization with memory, which also relies on slowly changing policies. This last method is also inherently model-based and thus not adequate for our purpose.
A key technical contribution that we make is to overcome this challenge by exploring directly in policy space. While the idea itself is not new, we provide a novel and tighter analysis that allows us to use larger perturbations, thus reducing the variance of the resulting gradient estimates. We achieve this by showing that the additional cost depends only quadratically on the exploration radius, which is the crucial ingredient for overcoming the limitation above. The final ingredient of the analysis is a sensitivity analysis of the gradient descent procedure that uses the estimated gradients. Here again, while similar analyses of gradient methods exist, we provide a general result giving conditions under which the optimization error depends only quadratically on the error in the gradients.
Related work.
Policy gradient methods in the context of LQR have seen significant interest in recent years. Notably, Fazel et al. [10] establish their global convergence in the perfect information setting, and give complexity bounds for sample-based methods. Subsequently, Malik et al. [20] improve the sample efficiency, but their result holds only with a fixed probability and thus does not seem applicable for our purposes. Hambly et al. [13] also improve the sample efficiency, but in a finite horizon setting. Mohammadi et al. [22] give sample complexity bounds for the continuous-time variant of LQR. Finally, Tu and Recht [27] show that a model-based method can potentially outperform the sample complexity of policy gradient by factors of the input and output dimensions. While we observe similar performance gaps in our regret bounds, these were not our main focus and may potentially be improved by a more refined analysis. Moving away from policy gradients, Yang et al. [30], Jin et al. [17], and Yaghmaie and Gustafsson [29] analyze the convergence and sample complexity of other model-free methods such as policy iteration and temporal difference (TD) learning, but they do not include any regret guarantees.
2 Preliminaries
2.1 Setup: Learning in LQR
We consider the problem of regret minimization in the LQR model. At each time step $t$, a state $x_t$ is observed and an action $u_t$ is chosen. The system evolves according to
$$x_{t+1} = A_\star x_t + B_\star u_t + w_t,$$
where the state-state matrix $A_\star$ and state-action matrix $B_\star$ form the transition model, and the $w_t$ are bounded, zero-mean, i.i.d. noise terms with a positive definite covariance matrix $\Sigma = \mathbb{E}[w_t w_t^\top]$. Formally, there exist $W, \sigma > 0$ such that $\|w_t\| \le W$ with probability 1 and $\Sigma \succeq \sigma^2 I$.
The bounded noise assumption is made for simplicity of the analysis, and in Appendix A we show how to accommodate Gaussian noise via a simple reduction to this setting. At time $t$, the instantaneous cost is
$$c_t = x_t^\top Q\, x_t + u_t^\top R\, u_t,$$
where $Q$ and $R$ are positive definite. We note that the upper bound on their norms is without loss of generality, since multiplying $Q$ and $R$ by a constant factor only re-scales the regret.
A policy of the learner is a potentially time-dependent mapping from past history to an action to be taken at the current time step. Classic results in linear control establish that, given the system parameters $A_\star$ and $B_\star$, a linear transformation of the current state is an optimal policy for the infinite horizon setting. We thus consider policies of the form $u_t = K x_t$ and define their infinite horizon expected cost,
$$J(K) = \lim_{T \to \infty} \frac{1}{T}\,\mathbb{E}\Bigl[ \sum_{t=1}^{T} c_t \Bigr],$$
where the expectation is taken with respect to the random noise variables $w_t$. Let $K_\star$ be a (unique) optimal policy and $J_\star = J(K_\star)$ denote the optimal infinite horizon expected cost, which are both well defined under mild assumptions (these are valid under standard, very mild stabilizability assumptions, see [4], that hold in our setting). We are interested in minimizing the regret over $T$ decision rounds, defined as
$$R_T = \sum_{t=1}^{T} \bigl( c_t - J_\star \bigr).$$
We focus on the setting where the learner does not have a full a priori description of the transition parameters $A_\star$ and $B_\star$, and has to learn them while controlling the system and minimizing the regret.
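To make the setup concrete, the following simulation sketch estimates the infinite horizon cost $J(K)$ of a fixed linear policy by averaging the quadratic cost along a long rollout of the dynamics. The system matrices, noise scale, and controller below are arbitrary illustrative choices and are not part of the formal development.

```python
# Illustrative only: estimate J(K) for a fixed linear policy u_t = K x_t by rolling out
# x_{t+1} = A x_t + B u_t + w_t and averaging the quadratic cost x^T Q x + u^T R u.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[-1.0, -2.0]])                 # some stabilizing (not optimal) controller

def average_cost(K, T=100_000):
    """Empirical average of the instantaneous cost along one long trajectory."""
    x = np.zeros(2)
    total = 0.0
    for _ in range(T):
        u = K @ x
        total += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + 0.1 * rng.standard_normal(2)   # i.i.d. zero-mean noise
    return total / T

print(f"estimated J(K) ~ {average_cost(K):.4f}")
```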
Throughout, we assume that the learner has knowledge of constants and such that
We also assume that there is a known stable (not necessarily optimal) policy and such that . We note that all of the aforementioned parameters could be easily estimated, at the cost of an additive constant regret term, by means of a warm-up period. However, recovering such an initial stable controller incurs an additive constant that depends exponentially on the problem parameters, as shown by Chen and Hazan [7], Mania et al. [21], and Cohen et al. [9].
Finally, denote the set of all “admissible” controllers. By definition, the initial stable policy and the optimal $K_\star$ both belong to this set. As discussed below, over this set the LQR cost function has certain regularity properties that we will use throughout.
2.2 Smooth Optimization
Fazel et al. [10] show that while the objective $J$ is non-convex, it has properties that make it amenable to standard gradient-based optimization schemes. We summarize these here as they are used in our analysis.
Definition 1 (PL-condition).
A function with global minimum is said to be -PL if it satisfies the Polyak-Lojasiewicz (PL) inequality with constant , given by
Definition 2 (Smoothness).
A function is locally -smooth over if for any and with
Definition 3 (Lipschitz).
A function is locally -Lipschitz over if for any and with
It is well-known that for functions satisfying the above conditions and for a sufficiently small step size $\eta$, the gradient descent update rule
$$x_{t+1} = x_t - \eta\,\nabla f(x_t)$$
converges exponentially fast, i.e., there exists $\rho < 1$ such that $f(x_t) - f(x^\star) \le \rho^t \bigl( f(x_0) - f(x^\star) \bigr)$ (e.g., [23]). This setting has also been investigated in the absence of a perfect gradient oracle. Here we provide a clean result showing that the error in the optimization objective is limited only by the squared error of any gradient estimate.
Finally, we require the notion of a one-point gradient estimate [11]. Let $f : \mathbb{R}^d \to \mathbb{R}$ and define its smoothed version with parameter $r > 0$ as
$$f_r(x) = \mathbb{E}\bigl[ f(x + r B) \bigr], \tag{1}$$
where $B$ is a uniform random vector over the Euclidean unit ball. The following lemma is standard (we include a proof in Appendix B for completeness).
Lemma 1.
If $f$ is $\beta$-locally smooth (with locality radius at least $r$), then:
1. $\nabla f_r(x) = \frac{d}{r}\,\mathbb{E}\bigl[ f(x + r U)\,U \bigr]$, where $d$ is the dimension and $U$ is a uniform random vector over the unit sphere;
2. $\bigl| f_r(x) - f(x) \bigr| \le \frac{\beta r^2}{2}$.
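To illustrate the one-point estimator behind Lemma 1, the following sketch averages the estimate $\frac{d}{r} f(x + rU)\,U$ over independent draws of $U$ and compares it to the true gradient of a simple quadratic test function (for which the smoothed gradient coincides with the exact one). The test function and parameters are arbitrary illustrative choices.

```python
# Illustrative only: the one-point (zeroth-order) estimate (d/r) * f(x + r*U) * U, with U
# uniform on the unit sphere, is an unbiased estimate of the gradient of the smoothed
# function f_r(x) = E[f(x + r*B)], B uniform on the unit ball. It has high variance, so
# we average many samples.
import numpy as np

rng = np.random.default_rng(1)

def one_point_gradient(f, x, r, num_samples=100_000):
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(num_samples):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)               # uniform direction on the unit sphere
        g += (d / r) * f(x + r * u) * u
    return g / num_samples

f = lambda x: x @ x + 3.0 * x[0]             # quadratic test function: grad = 2x + (3, 0, 0)
x0 = np.array([1.0, -2.0, 0.5])
print("true gradient      :", 2 * x0 + np.array([3.0, 0.0, 0.0]))
print("one-point estimate :", one_point_gradient(f, x0, r=0.5))   # close for a quadratic f
```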
2.3 Background on LQR
It is well-known for the LQR problem that the infinite horizon cost of a stabilizing controller $K$ can be written as
$$J(K) = \mathrm{tr}\bigl( P_K \Sigma \bigr) = \mathrm{tr}\bigl( (Q + K^\top R K)\, \Sigma_K \bigr),$$
where $P_K$ and $\Sigma_K$ are the positive definite solutions to
$$P_K = Q + K^\top R K + (A_\star + B_\star K)^\top P_K (A_\star + B_\star K), \tag{2}$$
$$\Sigma_K = \Sigma + (A_\star + B_\star K)\, \Sigma_K\, (A_\star + B_\star K)^\top. \tag{3}$$
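For concreteness, the following sketch evaluates these quantities numerically under one standard convention (the exact normalization of Eqs. (2) and (3) may differ; the system matrices are illustrative): $P_K$ and $\Sigma_K$ are computed from discrete Lyapunov equations, the two expressions for $J(K)$ are checked to agree, and the optimal controller is recovered from the discrete algebraic Riccati equation.

```python
# Illustrative only: closed-form LQR quantities via SciPy, under the standard conventions
# x_{t+1} = A x_t + B u_t + w_t, cost x^T Q x + u^T R u, noise covariance Sigma.
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
Sigma = 0.01 * np.eye(2)

def cost_of(K):
    """J(K) via both forms: tr(P_K Sigma) and tr((Q + K^T R K) Sigma_K)."""
    Acl = A + B @ K
    P_K = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)   # P_K = Q + K^T R K + Acl^T P_K Acl
    S_K = solve_discrete_lyapunov(Acl, Sigma)               # Sigma_K = Sigma + Acl Sigma_K Acl^T
    return np.trace(P_K @ Sigma), np.trace((Q + K.T @ R @ K) @ S_K)

# Optimal controller from the discrete algebraic Riccati equation.
P_star = solve_discrete_are(A, B, Q, R)
K_star = -np.linalg.solve(R + B.T @ P_star @ B, B.T @ P_star @ A)

print("J(K_star) =", cost_of(K_star))                       # the two forms agree
print("J(K)      =", cost_of(np.array([[-1.0, -2.0]])))     # a suboptimal stabilizing K
```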
Another important notion is that of strong stability [8]. This is essentially a quantitative version of classic stability notions in linear control.
Definition 4 (strong stability).
A matrix $M$ is $(\kappa, \gamma)$-strongly stable (for $\kappa \ge 1$ and $0 < \gamma \le 1$) if there exist matrices $H$ and $L$ such that $M = H L H^{-1}$ with $\|L\| \le 1 - \gamma$ and $\|H\|\,\|H^{-1}\| \le \kappa$. A controller $K$ for $(A_\star, B_\star)$ is $(\kappa, \gamma)$-strongly stable if $\|K\| \le \kappa$ and the matrix $A_\star + B_\star K$ is $(\kappa, \gamma)$-strongly stable.
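The following numerical sketch extracts a strong-stability certificate $(\kappa, \gamma)$ for a stable closed-loop matrix via the Lyapunov construction standard in this line of work; the particular constants it produces are one convenient choice and not necessarily those of Lemma 2 below, and the system matrices are illustrative.

```python
# Illustrative only: a strong-stability certificate (kappa, gamma) for A_cl = A + B K.
# Solve P = A_cl^T P A_cl + I (P >= I exists iff rho(A_cl) < 1), set H = P^{-1/2},
# L = P^{1/2} A_cl P^{-1/2}; then A_cl = H L H^{-1} with ||L|| < 1.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, sqrtm

def strong_stability_certificate(A_cl):
    P = solve_discrete_lyapunov(A_cl.T, np.eye(A_cl.shape[0]))
    H_inv = np.real(sqrtm(P))                 # H^{-1} = P^{1/2}
    H = np.linalg.inv(H_inv)                  # H = P^{-1/2}
    L = H_inv @ A_cl @ H                      # similarity transform: A_cl = H L H^{-1}
    kappa = np.linalg.norm(H, 2) * np.linalg.norm(H_inv, 2)
    gamma = 1.0 - np.linalg.norm(L, 2)        # positive whenever A_cl is stable
    return kappa, gamma

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[-1.0, -2.0]])
kappa, gamma = strong_stability_certificate(A + B @ K)
print(f"(kappa, gamma) = ({kappa:.2f}, {gamma:.3f})")
```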
The following lemma, due to Cohen et al. [9], relates the infinite horizon cost of a controller to its strong stability parameters.
Lemma 2 (9, Lemma 18).
Suppose that then is strongly stable with and
The following two lemmas, due to Cohen et al. [8], Cassel et al. [6], show that the state covariance converges exponentially fast, and that the state is bounded as long as controllers are allowed to mix.
Lemma 3 (8, Lemma 3.2).
Suppose we play some fixed controller starting from some initial state; then
Lemma 4 (6, Lemma 39).
Let . If we play each controller for at least rounds before switching to then for all we have that and
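As a small numerical illustration of Lemma 3 (with illustrative matrices, not part of the formal development), the state covariance recursion under a fixed stabilizing controller converges geometrically to its steady-state fixed point:

```python
# Illustrative only: Sigma_{t+1} = (A + B K) Sigma_t (A + B K)^T + Sigma converges
# exponentially fast to the steady-state covariance Sigma_K.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[-1.0, -2.0]])
Sigma = 0.01 * np.eye(2)

Acl = A + B @ K
Sigma_K = solve_discrete_lyapunov(Acl, Sigma)     # fixed point of the recursion

S = np.zeros((2, 2))                              # covariance when starting from x_0 = 0
for t in range(1, 101):
    S = Acl @ S @ Acl.T + Sigma
    if t % 25 == 0:
        print(f"t = {t:3d}   ||Sigma_t - Sigma_K|| = {np.linalg.norm(S - Sigma_K, 2):.2e}")
# The gap shrinks geometrically, at a rate governed by the stability of A + B K.
```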
The following is a summary of results from Fazel et al. [10] that describe the main properties of . See Appendix B for the complete details.
Lemma 5 (10, Lemmas 11, 13, 16, 27 and 28).
Let and with
then we have that
1.
2.
3.
4. $J$ satisfies the local Lipschitz condition (Definition 3) over the set of admissible controllers;
5. $J$ satisfies the local smoothness condition (Definition 2) over the set of admissible controllers;
6. $J$ satisfies the PL condition (Definition 1).
3 Algorithm and Overview of Analysis
We are now ready to present our main algorithm for model-free regret minimization in LQR. The algorithm, given in Algorithm 1, optimizes an underlying controller over epochs of exponentially increasing duration. Each epoch consists of sub-epochs, during each of which a perturbed controller centered at the current underlying controller is drawn and played for a fixed number of rounds. At the end of each epoch, the algorithm uses the costs incurred during the final round of playing each perturbed controller to construct a gradient estimate, which in turn is used to compute the next underlying controller. Interestingly, we make no explicit use of the state observation, which is used only implicitly to calculate the control signal. Furthermore, the algorithm makes only a number of computations per time step that is linear in the problem dimensions.
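The following simplified sketch mirrors the structure just described: epochs of growing length, sub-epochs that each play a perturbed controller and record only its final-round cost, a batched one-point gradient estimate, and a policy gradient step. It is an illustration only; the parameter schedules (sub-epoch length, exploration radii, step size) and the omitted safeguards (e.g., projections keeping the controller admissible) are placeholders rather than the exact choices of Algorithm 1.

```python
# Illustrative only: the epoch / sub-epoch structure of a model-free policy gradient
# scheme for LQR. Only the scalar cost of the final round of each sub-epoch is used.
import numpy as np

rng = np.random.default_rng(2)

def run(A, B, Q, R, K0, epochs=6, m0=50, tau=50, eta=0.05, r0=0.3, noise=0.05):
    d_x, d_u = B.shape
    d = d_u * d_x                               # dimension of the controller K
    x = np.zeros(d_x)
    K = K0.copy()
    for j in range(epochs):
        m, r = m0 * 2 ** j, r0 / 2 ** (j / 4)   # longer epochs, shrinking exploration radius
        grad_est = np.zeros((d_u, d_x))
        for _ in range(m):                      # sub-epochs
            U = rng.standard_normal((d_u, d_x))
            U /= np.linalg.norm(U)              # uniform direction on the unit sphere
            K_pert = K + r * U                  # perturbed controller for this sub-epoch
            for _ in range(tau):                # play it long enough to (roughly) mix
                u = K_pert @ x
                c = x @ Q @ x + u @ R @ u       # only this scalar cost is ever observed
                x = A @ x + B @ u + noise * rng.standard_normal(d_x)
            grad_est += (d / r) * c * U         # one-point estimate from the final-round cost
        K = K - eta * grad_est / m              # policy gradient step on the epoch average
    return K

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
print("final controller:", run(A, B, Q, R, K0=np.array([[-1.0, -2.0]])))
```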
Our main result regarding Algorithm 1 is stated in the following theorem: a high-probability regret guarantee with a polynomial dependence on the problem parameters.
Theorem 1.
With appropriate choices of the algorithm's parameters, for any $\delta \in (0,1)$, with probability at least $1-\delta$ the regret of Algorithm 1 over $T$ rounds satisfies $R_T = \tilde{O}\bigl(\sqrt{T}\bigr)$, where the $\tilde{O}$-notation hides a polynomial dependence on the problem parameters and a polylogarithmic dependence on $T$ and $1/\delta$.
Here we give an overview of the main steps in proving Theorem 1, deferring the details of each step to later sections. Our first step is analyzing the utility of the policies computed at the end of each epoch. We show that the regret of each underlying controller over its epoch, in terms of its long-term (steady-state) cost compared to that of the optimal controller, is controlled by the inverse square root of the epoch length.
Lemma 6 (exploitation).
Under the parameter choices of Theorem 1, for any we have that with probability at least ,
and further that .
The proof of the lemma is based on a careful analysis of gradient descent with inexact gradients and crucially exploits the PL and local-smoothness properties of the loss . More details can be found in Section 4.
The more interesting (and challenging) part of our analysis pertains to controlling the costs associated with exploration, namely, the penalties introduced by the perturbations of the controllers. The direct cost of exploration is clear: instead of playing the controller intended for exploitation, the algorithm actually follows the perturbed controllers and thus incurs the differences in their long-term costs. Our following lemma bounds the accumulation of these penalties over an epoch; importantly, it shows that while the bound scales linearly with the length of the epoch, it has only a quadratic dependence on the exploration radius.
Lemma 7 (direct exploration cost).
Under the parameter choices of Theorem 1, for any we have that with probability at least ,
There are, however, additional indirect costs associated with exploration: within each epoch the algorithm switches frequently between different policies, thereby suffering the indirect costs that stem from their “burn-in” periods. This is precisely what gives rise to the differences between the realized cost and the long-term cost of the played policy, the cumulative effect of which is bounded in the next lemma. Here again, note the quadratic dependence on the exploration radius, which is essential for obtaining our $\sqrt{T}$-regret result.
Lemma 8 (indirect exploration cost).
Under the parameter choices of Theorem 1, for any we have that with probability at least ,
The technical details for Lemmas 7 and 8 are discussed in Section 5. We now have all the main pieces required for proving our main result.
Proof (of Theorem 1).
Taking a union bound, we conclude that Lemmas 8, 7 and 6 hold for all with probability at least . Now, notice that our choice of parameters is such that
Plugging this back into Lemmas 8 and 7 we get that for all ,
We conclude that the regret during epoch is bounded as
where the second step also used the fact that . Finally, a simple calculation (see Lemma 12) shows that
and thus summing over the regret accumulated in each epoch concludes the proof.
4 Optimization Analysis
At its core, Algorithm 1 is a policy gradient method, with each underlying controller being the iterate after the corresponding number of gradient steps. In this section we analyze the sub-optimality gap of the underlying controllers, culminating in the proof of Lemma 6. To achieve this, we first consider a general optimization problem with a corrupted gradient oracle, and show that the optimization rate is limited only by the square of the corruption magnitude. We follow this with an analysis of the LQR gradient estimation, from which the overall optimization cost follows readily.
4.1 Inexact First-Order Optimization
Let $f$ be a function with global minimum $f^\star$. Suppose that $f$ is PL, locally smooth, and locally Lipschitz (in the sense of Definitions 1, 2 and 3) over a suitable sub-level set of $f$. We consider the update rule
$$x_{t+1} = x_t - \eta\, \hat{g}_t, \tag{4}$$
where $\eta > 0$ is a step size, and $\hat{g}_t$ is a corrupted gradient oracle that satisfies
$$\bigl\| \hat{g}_t - \nabla f(x_t) \bigr\| \le \epsilon_t, \tag{5}$$
where $\epsilon_t$ is the magnitude of the corruption at step $t$. Define the effective corruption up to round $t$ as
and notice that if then .
The following result shows that this update rule achieves a linear convergence rate up to an accuracy that depends quadratically on the corruptions.
Theorem 2 (corrupted gradient descent).
Suppose that Then for all ,
and consequently .
Proof.
For ease of notation, denote . Now, suppose that , i.e., . Then we have that
where the second step used our choice of and the third step used the Lipschitz assumption on and the bound on . We conclude that satisfy the conditions for local smoothness and so we have that
where the last transition holds by choice of . Next, using the PL condition and the bound on (see Eq. 5) we get that
and adding to both sides of the equation we get
Now, if then since we have that
(6) |
On the other hand, if then we have that
(7) |
where the last transition holds, again, by our choice of . Combining Eqs. 6 and 7 we conclude that
(8) |
In particular, this implies that and thus . Since we assume that , this completes an induction showing that for all . We can thus unroll Eq. 8 recursively to obtain the final result.
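The following small experiment illustrates the message of Theorem 2 numerically: gradient descent with an $\epsilon$-corrupted gradient oracle on a smooth PL function reaches a suboptimality that scales like $\epsilon^2$ rather than $\epsilon$. The quadratic objective and all parameters are arbitrary illustrative choices.

```python
# Illustrative only: corrupted gradient descent on f(x) = 0.5 x^T H x (a smooth PL
# function with minimum value 0). The corruption has norm at most eps at every step.
import numpy as np

rng = np.random.default_rng(3)
H = np.diag([1.0, 4.0])
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

def corrupted_gd(eps, steps=300, eta=0.2):
    x = np.array([5.0, 5.0])
    for _ in range(steps):
        corruption = rng.standard_normal(2)
        corruption *= eps / max(np.linalg.norm(corruption), 1e-12)   # norm exactly eps
        x = x - eta * (grad(x) + corruption)
    return f(x)

for eps in [0.1, 0.01, 0.001]:
    print(f"eps = {eps:6.3f}   final suboptimality ~ {corrupted_gd(eps):.2e}")
# Shrinking eps by 10x shrinks the final suboptimality by roughly 100x (quadratic scaling).
```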
4.2 Gradient Estimation
The gradient estimate is a batched version of the typical one-point gradient estimator. We bound its error in the next lemma using the following inductive idea: as long as the underlying controllers have remained admissible so far, standard concentration arguments imply that the estimation error is small with high probability, and then Theorem 2 implies that the next underlying controller is admissible as well.
Lemma 9 (Gradient estimation error).
Under the parameter choices of Theorem 1, for any we have that with probability at least ,
Proof (of Lemma 9).
Assume that conditioned on the event for all , the claim holds with probability at least . We show by induction that we can peel-off the conditioning by summing the failure probability of each epoch. Concretely, we show by induction that the claim holds for all with probability at least . Since the number of epochs is less than (in fact logarithmic in ), this will conclude the proof.
The induction base follows immediately by our conditional assumption and the fact that . Now, assume the hypothesis holds up to . We show that the conditions of Theorem 2 are satisfied with up to round , and thus for all . We can then invoke our conditional assumption and a union bound to conclude the induction step.
We verify the conditions of Theorem 2. First, the Lipschitz, smoothness, and PL conditions hold by Lemma 5. Next, notice that by definition , and so by the induction hypothesis for all . Finally, noticing that it is easy to verify the condition on .
It remains to show the conditional claim holds. The event for all essentially implies that the policy gradient scheme did not diverge up to the start of epoch . Importantly, this event is independent of any randomization during epoch and thus will not break any i.i.d. assumptions within the epoch. Moreover, by Lemma 5 and since , this implies that i.e., for all and . For the remainder of the proof, we implicitly assume that this holds, allowing us to invoke Lemmas 3, 4 and 5. For ease of notation, we will not specify this explicitly.
Now, let be the smoothed version of as in Eq. 1. Since we can use Lemma 1 to get that
Next, we decompose the remaining term using the triangle inequality to get that
By Lemma 1, we notice that, conditioned on , the first term is a sum of zero-mean i.i.d random vectors with norm bounded by . We thus invoke Lemma 13 (Vector Azuma) to get that with probability at least
Next, denote and notice that the remaining term is exactly Let be the state during the final round of playing controller , and be the filtration adapted to We use Lemma 3 to get that
where the last step plugged in the value of and the one before that used Lemmas 4 and 5 to bound and . Further using Lemma 4 to bound , we also get that
Since is measurable we can invoke Lemma 13 (Vector Azuma) to get that with probability at least ,
Using a union bound and putting everything together, we conclude that with probability at least ,
where the last steps plugged in the values of , and .
4.3 Proof of Lemma 6
Lemma 6 is a straightforward consequence of the previous results.
5 Exploration Cost Analysis
In this section we demonstrate that exploring near a given initial policy does not incur cost that is linear in the exploration radius (as more straightforward arguments would give), but only quadratic, and we use this crucial observation to prove Lemmas 7 and 8.
We begin with Lemma 8. The main difficulty in the proof is captured by the following basic result, which roughly shows that the expected cost for transitioning between two i.i.d. copies of a given random policy scales with the variance of the latter. This would in turn give the quadratic dependence on the exploration radius we need.
Lemma 10.
Let be fixed. Suppose are i.i.d. random variables such that , and If is the result of playing for rounds starting at , then
Proof.
Notice that the expectation is with respect to both controllers and the noise terms, all of which are jointly independent. We begin by using Lemmas 3 and 5 to get that
where the last step also used the fact that . Now, since do not depend on the noise, we can use the law of total expectation to get that
To bound the remaining term, notice that since are i.i.d, we may change their roles without changing the expectation, i.e.,
we conclude that
where the last step also used Lemma 5.
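The following numerical sketch illustrates the content of Lemma 10 with illustrative matrices and an illustrative perturbation distribution: for two i.i.d. perturbed controllers around a base controller, the expected steady-state transition term, evaluated exactly via Lyapunov equations and symmetrized exactly as in the role-exchange step above, shrinks quadratically in the exploration radius.

```python
# Illustrative only: the symmetrized transition term 0.5*tr((P_{K1}-P_{K2})(S_{K2}-S_{K1}))
# for i.i.d. perturbed controllers K_i = K0 + r*U_i has magnitude O(r^2), not O(r).
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(5)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
Sigma = 0.01 * np.eye(2)
K0 = np.array([[-1.0, -2.0]])

def P_and_cov(K):
    Acl = A + B @ K
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)   # value matrix of K
    S = solve_discrete_lyapunov(Acl, Sigma)               # steady-state covariance under K
    return P, S

def transition_term(r, samples=2000):
    total = 0.0
    for _ in range(samples):
        U1, U2 = rng.standard_normal((2, 1, 2)) / np.sqrt(2)
        P1, S1 = P_and_cov(K0 + r * U1)
        P2, S2 = P_and_cov(K0 + r * U2)
        total += 0.5 * np.trace((P1 - P2) @ (S2 - S1))
    return total / samples

for r in [0.4, 0.2, 0.1]:
    print(f"r = {r:4.2f}   expected transition term ~ {transition_term(r):+.3e}")
# Halving r shrinks the magnitude by roughly a factor of four.
```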
5.1 Proof of Lemma 8
Before proving Lemma 8 we introduce a few simplifying notations. Since the lemma pertains to a single epoch, we omit its notation wherever it is clear from context. For example, will be shortened to and to . In any case, we reserve the index for epochs and for sub-epochs. In this context, we also denote the gap between realized and idealized costs during sub-epoch by
and the filtration adapted to . We note that and are measurable. The following lemma uses Eq. 2 to decompose the cost gap at the various time resolutions. See proof at the end of this section.
Lemma 11.
If the epoch initial controller satisfies then (recall that is the positive definite solution to Eq. 2):
1.
2.
3.
We are now ready to prove the main lemma of this section.
Proof (of Lemma 8).
First, by Lemma 6, the event for all holds with probability at least . As in the proof of Lemma 9, we will implicitly assume that this event holds, which will not break any i.i.d assumptions during epoch and implies that for all . We also use this to invoke Lemmas 4 and 5 to get that for any and we have
Now, recall that is -measurable and thus is a martingale difference sequence. Using the first part of Lemma 11 we also conclude that each term is bounded by . Applying Azuma’s inequality we get that with probability at least
Now, recall from Lemma 11 that
The first two terms in the sum form a martingale difference sequence with each term being bounded by . We thus have that with probability at least ,
Notice that the summands in the remaining term fit the setting of Lemma 10 and thus
where the second transition plugged in and used Lemma 4 to bound , and the third transition used the fact that . Plugging in the value of and using a union bound, we conclude that with probability at least ,
as desired.
Proof (of Lemma 11).
By our assumption that we have that and thus is well defined. Now, recall that and where satisfies Eq. 2 with . Then we have that
Now, multiplying Eq. 2 by from both sides we get that
where the second transition plugged in the previous equality. Changing sides concludes the first part of the proof. For the second part, notice that taking expectation with respect to is equivalent to conditional expectation with respect to all past epochs and of the current epoch. Since for all this contains , we use the law of total expectation to get that
Summing over , noticing that the sum is telescopic, and that time is in fact the start of the next sub-epoch, i.e., , concludes the second part of the proof. Finally, we sum over to get that
concluding the third part of the proof.
5.2 Proof of Lemma 7
Proof (of Lemma 7).
By Lemma 6, the event occurs with probability at least . Similarly to Lemmas 8 and 9, we implicitly assume that this event holds, which does not break i.i.d assumptions inside the epoch and implies that for all . Now, notice that Since and , we can invoke the local smoothness of (see Lemma 5) to get that
We thus have that
The remaining term is a sum of zero-mean i.i.d. random variables that are bounded by . We use Hoeffding’s inequality and a union bound to get that with probability at least
and plugging in the value of from Lemma 5 concludes the proof.
Acknowledgements
We thank Nadav Merlis for numerous helpful discussions. This work was partially supported by the Israeli Science Foundation (ISF) grant 2549/19, by the Len Blavatnik and the Blavatnik Family foundation, and by the Yandex Initiative in Machine Learning.
References
- Abbasi-Yadkori and Szepesvári [2011] Y. Abbasi-Yadkori and C. Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
- Abbasi-Yadkori et al. [2019] Y. Abbasi-Yadkori, N. Lazic, and C. Szepesvári. Model-free linear quadratic control via reduction to expert prediction. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3108–3117. PMLR, 2019.
- Agarwal et al. [2019] N. Agarwal, E. Hazan, and K. Singh. Logarithmic regret for online control. In Advances in Neural Information Processing Systems, pages 10175–10184, 2019.
- Bertsekas [1995] D. P. Bertsekas. Dynamic programming and optimal control, volume 1. Athena scientific Belmont, MA, 1995.
- Cassel and Koren [2020] A. Cassel and T. Koren. Bandit linear control. Advances in Neural Information Processing Systems, 33, 2020.
- Cassel et al. [2020] A. Cassel, A. Cohen, and T. Koren. Logarithmic regret for learning linear quadratic regulators efficiently. In International Conference on Machine Learning, pages 1328–1337. PMLR, 2020.
- Chen and Hazan [2020] X. Chen and E. Hazan. Black-box control for linear dynamical systems. arXiv preprint arXiv:2007.06650, 2020.
- Cohen et al. [2018] A. Cohen, A. Hassidim, T. Koren, N. Lazic, Y. Mansour, and K. Talwar. Online linear quadratic control. In International Conference on Machine Learning, pages 1029–1038, 2018.
- Cohen et al. [2019] A. Cohen, T. Koren, and Y. Mansour. Learning linear-quadratic regulators efficiently with only $\sqrt{T}$ regret. In International Conference on Machine Learning, pages 1300–1309, 2019.
- Fazel et al. [2018] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In Proceedings of the 35th International Conference on Machine Learning, volume 80, 2018.
- Flaxman et al. [2005] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394, 2005.
- Haarnoja et al. [2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
- Hambly et al. [2020] B. M. Hambly, R. Xu, and H. Yang. Policy gradient methods for the noisy linear quadratic regulator over a finite horizon. Available at SSRN, 2020.
- Hanson and Wright [1971] D. L. Hanson and F. T. Wright. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079–1083, 1971.
- Hayes [2005] T. P. Hayes. A large-deviation inequality for vector-valued martingales. Combinatorics, Probability and Computing, 2005.
- Hsu et al. [2012] D. Hsu, S. Kakade, T. Zhang, et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.
- Jin et al. [2020] Z. Jin, J. M. Schmitt, and Z. Wen. On the analysis of model-free methods for the linear quadratic regulator. arXiv preprint arXiv:2007.03861, 2020.
- Krauth et al. [2019] K. Krauth, S. Tu, and B. Recht. Finite-time analysis of approximate policy iteration for the linear quadratic regulator. In Advances in Neural Information Processing Systems, volume 32, 2019.
- Lillicrap et al. [2015] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Malik et al. [2019] D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. Bartlett, and M. Wainwright. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2916–2925. PMLR, 2019.
- Mania et al. [2019] H. Mania, S. Tu, and B. Recht. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, volume 32, pages 10154–10164, 2019.
- Mohammadi et al. [2020] H. Mohammadi, M. R. Jovanovic, and M. Soltanolkotabi. Learning the model-free linear quadratic regulator via random search. In Learning for Dynamics and Control, pages 531–539. PMLR, 2020.
- Nesterov [2003] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
- Silver et al. [2014] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In International conference on machine learning, pages 387–395. PMLR, 2014.
- Simchowitz and Foster [2020] M. Simchowitz and D. Foster. Naive exploration is optimal for online lqr. In International Conference on Machine Learning, pages 8937–8948. PMLR, 2020.
- Sutton et al. [1999] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. In NIPs, volume 99, pages 1057–1063. Citeseer, 1999.
- Tu and Recht [2019] S. Tu and B. Recht. The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. In Conference on Learning Theory, pages 3036–3083. PMLR, 2019.
- Wright [1973] F. T. Wright. A bound on tail probabilities for quadratic forms in independent random variables whose distributions are not necessarily symmetric. The Annals of Probability, pages 1068–1070, 1973.
- Yaghmaie and Gustafsson [2019] F. A. Yaghmaie and F. Gustafsson. Using reinforcement learning for model-free linear quadratic control with process and measurement noises. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 6510–6517. IEEE, 2019.
- Yang et al. [2019] Z. Yang, Y. Chen, M. Hong, and Z. Wang. Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. In Advances in Neural Information Processing Systems, volume 32, 2019.
Appendix A Reducing Gaussian Noise to Bounded Noise
In this section we relax the bounded noise assumption, , and replace it with the following tail assumption. For , suppose there exists such that:
1. for all ;
2.
The first assumption is a standard implication of any tail assumption. The second assumption implies that we can crop the noise while keeping it zero-mean. While not entirely trivial, this can be guaranteed for any continuous noise distribution. We note that is a theoretical construct, and is not a direct input to Algorithm 1. Indirectly, we use to calculate the parameters
which will serve as replacements for in our bounded noise formulation. In practice, our results hold if for the chosen parameters , there exists a set satisfying the above. Our main findings for unbounded noise are summarized in the following meta-result.
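As a sanity check on the cropping argument (with illustrative parameters; not the exact construction used in the analysis), the following sketch crops a zero-mean Gaussian to a centered ball and verifies empirically that the cropped noise remains essentially zero mean while its covariance stays positive definite.

```python
# Illustrative only: cropping w to w * 1{||w|| <= W} keeps a centrally symmetric noise
# distribution zero mean, while making it bounded.
import numpy as np

rng = np.random.default_rng(6)
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
W = 3.0 * np.sqrt(np.trace(Sigma))             # crop radius: a few "standard deviations"

w = rng.multivariate_normal(np.zeros(2), Sigma, size=1_000_000)
inside = np.linalg.norm(w, axis=1) <= W
w_cropped = w * inside[:, None]                # the cropped noise

print("fraction cropped       :", 1 - inside.mean())
print("mean of cropped noise  :", w_cropped.mean(axis=0))                        # ~ 0
print("min eig of cropped cov :", np.linalg.eigvalsh(np.cov(w_cropped.T)).min()) # still > 0
```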
Theorem 3.
Suppose is such that . If we run Algorithm 1 with the parameters as in Theorem 1 and that satisfy and , then the regret bound of Theorem 1 holds with probability at least
Proof.
Consider the LQR problem where the noise terms are replaced with and let be the corresponding instantaneous and infinite horizon costs. Notice that by our assumptions, are indeed zero-mean, i.i.d., and satisfy and
where the second transition used the Cauchy–Schwarz inequality. We thus have that is bounded as in Theorem 1 with probability at least . Next, since , we have that is optimistic with respect to , i.e., for all , which implies that . Finally, using a union bound on the tail assumption, we have that for all with probability at least . On this event, Algorithm 1 is not aware that the noise is cropped and we thus have that for all . We conclude that with probability at least
and using another union bound concludes the proof.
Application to Gaussian noise.
We specialize Theorem 3 to the case where , are zero-mean Gaussian random vectors with positive definite covariance . The following result demonstrates how to run Algorithm 1 given upper and lower bounds on the covariance eigenvalues.
Proposition 1.
Let . Suppose we run Algorithm 1 with parameters as in Theorem 1 and that satisfy
where are the minimal and maximal eigenvalues of . Then the regret bound of Theorem 1 holds with probability at least .
Proof.
We show that satisfies the desired assumptions. First, by Lemma 14 we indeed have that Next, denote and notice that . We thus have that
where the last transition follows from a symmetry argument. We conclude that satisfies our assumptions. We show that and , which then concludes the proof by invoking Theorem 3. First, we have that for any
and so . Finally, notice that for any is a zero-mean Gaussian random variable. Standard moment identities for Gaussian variables then give that and so we have that
where the last inequality holds by our choice of .
Appendix B Technical Lemmas
B.1 Summing the Square Roots of Epoch Lengths
Lemma 12.
Let and define . Suppose is such that then we have that
Proof.
For ease of notation, denote and notice that for our parameter choice it satisfies . Now, notice that for we have and so we have that
Noticing that is a geometric sequence we get that
where the last transition also used the fact .
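A quick numerical check of this calculation (with an arbitrary base epoch length): for geometrically growing epoch lengths, the sum of their square roots stays within a constant factor of the square root of the horizon.

```python
# Illustrative only: sum_j sqrt(T_j) = O(sqrt(T)) when the epoch lengths T_j grow geometrically.
import numpy as np

T0 = 100                                          # length of the first epoch
lengths = [T0 * 2 ** j for j in range(15)]        # geometrically growing epochs
T = sum(lengths)
ratio = sum(np.sqrt(L) for L in lengths) / np.sqrt(T)
print(f"T = {T},  sum_j sqrt(T_j) / sqrt(T) = {ratio:.3f}")   # stays bounded (~2.4 here)
```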
B.2 Randomized Smoothing
Proof (of Lemma 1).
The first part follows from Stokes’ theorem; see Lemma 1 in [11] for details. For the second part, notice that, by definition, $f_r(x) - f(x) = \mathbb{E}\bigl[ f(x + rB) - f(x) \bigr]$. We can thus use Jensen’s inequality to get that
where the third transition also used the smoothness (gradient Lipschitz) property of , and the last transition used the fact that is in the unit ball.
B.3 Details of Lemma 5
We review how Lemma 5 is derived from Fazel et al. [10]. For the rest of this section all Lemmas will refer to ones in [10].
The first part of the statement is immediate from their Lemma 13. Next, notice that
We thus have that satisfy the condition of Lemma 16 and so we get that
thus concluding the second part. Next, define
which are linear operators on symmetric matrices. By Lemma 17 we have that
and by Lemma 19 we have that
Now, continuing from the middle of the proof of Lemma 27 we get that
where the last step used the fact that . Next, notice that
and thus the fourth property (Lipschitz) follows as
Next, the fifth statement (Smoothness) follows the ideas of Lemma 28. Concretely, recall that where . Notice that
and thus we have that
Now, notice that and so
and combining with the above, yields the desired smoothness condition
Finally, the last statement (PL) is immediate from their Lemma 11 as
Appendix C Concentration inequalities
Lemma 13 (Theorem 1.8 of [15]).
Let be a very-weak martingale taking values in a real-valued euclidean space such that and for every , . Then, for every ,
Alternatively, for any we have that with probability at least
The following theorem is a variant of the Hanson-Wright inequality [14, 28] which can be found in Hsu et al. [16].
Theorem 4.
Let be a Gaussian random vector, let and define . Then we have that
The following lemma is a direct corollary of Theorem 4.
Lemma 14.
Let be a Gaussian random vector in . For any , with probability at least we have that
Proof.
Consider Theorem 4 with and thus . Then for we have that
Now, for we have that (equals in distribution). We thus have that for
and taking (since ) concludes the proof.