
Online Policy Gradient for Model Free Learning
of Linear Quadratic Regulators with $\sqrt{T}$ Regret

Asaf Cassel School of Computer Science, Tel Aviv University; [email protected].    Tomer Koren School of Computer Science, Tel Aviv University, and Google Research, Tel Aviv; [email protected].
Abstract

We consider the task of learning to control a linear dynamical system under fixed quadratic costs, known as the Linear Quadratic Regulator (LQR) problem. While model-free approaches are often favorable in practice, thus far only model-based methods, which rely on costly system identification, have been shown to achieve regret that scales with the optimal dependence on the time horizon $T$. We present the first model-free algorithm that achieves similar regret guarantees. Our method relies on an efficient policy gradient scheme, and a novel and tighter analysis of the cost of exploration in policy space in this setting.

1 Introduction

Model-free, policy gradient algorithms have become a staple of Reinforcement Learning (RL) with both practical successes [19, 12], and strong theoretical guarantees in several settings [26, 24]. In this work we study the design and analysis of such algorithms for the adaptive control of Linear Quadratic Regulator (LQR) systems, as seen through the lens of regret minimization [1, 9, 21]. In this continuous state and action reinforcement learning setting, an agent chooses control actions $u_{t}$ and the system state $x_{t}$ evolves according to the noisy linear dynamics

x_{t+1}=A_{\star}x_{t}+B_{\star}u_{t}+w_{t},

where $A_{\star}$ and $B_{\star}$ are transition matrices and $w_{t}$ are i.i.d. zero-mean noise terms. The cost is a quadratic function of the current state and action, and the regret is measured with respect to the class of linear policies, which are known to be optimal for this setting.

Model-based methods, which perform planning based on a system identification procedure that estimates the transition matrices, have been studied extensively in recent years. This started with Abbasi-Yadkori and Szepesvári [1], which established an $O(\sqrt{T})$ regret guarantee, albeit with a computationally intractable method. More recently, Cohen et al. [9], Mania et al. [21] complemented this result with computationally efficient methods, and Cassel et al. [6], Simchowitz and Foster [25] provided lower bounds showing that this rate is generally unavoidable, regardless of whether the algorithm is model-free or not. In comparison, the best existing model-free algorithms are policy iteration procedures by Krauth et al. [18] and Abbasi-Yadkori et al. [2] that respectively achieve $\widetilde{O}(T^{2/3})$ and $\widetilde{O}(T^{2/3+\epsilon})$ regret for $\epsilon=\Theta(1/\log T)$.

Our main result is an efficient (in fact, linear time per step) policy gradient algorithm that achieves $\widetilde{O}(\sqrt{T})$ regret, thus closing the (theoretical) gap between model-based and model-free methods for the LQR model. An interesting feature of our approach is that while the policies output by the algorithm are clearly state dependent, tuning their parameters requires no access to the state observations. Instead, we rely only on observations of the incurred cost, similar to bandit models (e.g., 5).

One of the main challenges of regret minimization in LQRs (and more generally, in reinforcement learning) is that it is generally infeasible to change policies as often as one likes. Roughly, this is due to a burn-in period following a policy change, during which the system converges to a new steady-state distribution and typically incurs an additional cost proportional to the change in steady states, which is in turn proportional to the distance between policies. There are several ways to overcome this impediment. The simplest is to restrict the number of policy updates and explore directly in the action space via artificial noise (see e.g., 25). Another approach, by Cohen et al. [9], considers a notion of slowly changing policies; however, this can be very restrictive for exploration in policy space. Other works (e.g., 3) consider a policy parameterization that converts the problem into online optimization with memory, which also relies on slowly changing policies. This last method is also inherently model-based and thus not adequate for our purpose.

A key technical contribution that we make is to overcome this challenge by exploring directly in policy space. While the idea itself is not new, we provide a novel and tighter analysis that allows us to use larger perturbations, thus reducing the variance of the resulting gradient estimates. We achieve this by showing that the additional cost depends only quadratically on the exploration radius, which is a crucial ingredient for overcoming the $O(T^{2/3})$ limitation. The final ingredient of the analysis involves a sensitivity analysis of the gradient descent procedure that uses the estimated gradients. Here again, while similar analyses of gradient methods exist, we provide a general result that gives appropriate conditions for which the optimization error depends only quadratically on the error in the gradients.

Related work.

Policy gradient methods in the context of LQR have seen significant interest in recent years. Notably, Fazel et al. [10] establish their global convergence in the perfect information setting, and give complexity bounds for sample-based methods. Subsequently, Malik et al. [20] improve the sample efficiency, but their result holds only with a fixed probability and thus does not seem applicable for our purposes. Hambly et al. [13] also improve the sample efficiency, but in a finite horizon setting. Mohammadi et al. [22] give sample complexity bounds for the continuous-time variant of LQR. Finally, Tu and Recht [27] show that a model-based method can potentially outperform the sample complexity of policy gradient by factors of the input and output dimensions. While we observe similar performance gaps in our regret bounds, these were not our main focus and may potentially be improved by a more refined analysis. Moving away from policy gradients, Yang et al. [30], Jin et al. [17], Yaghmaie and Gustafsson [29] analyze the convergence and sample complexity of other model-free methods such as policy iteration and temporal difference (TD) learning, but they do not include any regret guarantees.

2 Preliminaries

2.1 Setup: Learning in LQR

We consider the problem of regret minimization in the LQR model. At each time step $t$, a state $x_{t}\in\mathbb{R}^{d_{x}}$ is observed and action $u_{t}\in\mathbb{R}^{d_{u}}$ is chosen. The system evolves according to

x_{t+1}=A_{\star}x_{t}+B_{\star}u_{t}+w_{t},\quad(x_{0}=0~\text{w.l.o.g.}),

where the state-state $A_{\star}\in\mathbb{R}^{d_{x}\times d_{x}}$ and state-action $B_{\star}\in\mathbb{R}^{d_{x}\times d_{u}}$ matrices form the transition model and the $w_{t}$ are bounded, zero-mean, i.i.d. noise terms with a positive definite covariance matrix $\Sigma_{w}\succ 0$. Formally, there exist $\sigma,W>0$ such that

\mathbb{E}w_{t}=0,\quad\norm{w_{t}}\leq W,\quad\Sigma_{w}=\mathbb{E}w_{t}w_{t}^{\mkern-1.5mu\mathsf{T}}\succ\sigma^{2}I.

The bounded noise assumption is made for simplicity of the analysis, and in Appendix A we show how to accommodate Gaussian noise via a simple reduction to this setting. At time $t$, the instantaneous cost is

c_{t}=x_{t}^{\mkern-1.5mu\mathsf{T}}Qx_{t}+u_{t}^{\mkern-1.5mu\mathsf{T}}Ru_{t},

where $0\prec Q,R\preceq I$ are positive definite. We note that the upper bound is without loss of generality since multiplying $Q$ and $R$ by a constant factor only re-scales the regret.

A policy of the learner is a potentially time dependent mapping from past history to an action $u\in\mathbb{R}^{d_{u}}$ to be taken at the current time step. Classic results in linear control establish that, given the system parameters $A_{\star},B_{\star},Q$ and $R$, a linear transformation of the current state is an optimal policy for the infinite horizon setting. We thus consider policies of the form $u_{t}=Kx_{t}$ and define their infinite horizon expected cost,

J\brk*{K}=\lim_{T\to\infty}\frac{1}{T}\mathbb{E}\brk[s]*{\sum_{t=1}^{T}x_{t}^{\mkern-1.5mu\mathsf{T}}\brk*{Q+K^{\mkern-1.5mu\mathsf{T}}RK}x_{t}}

where the expectation is taken with respect to the random noise variables $w_{t}$. Let $K_{\star}=\operatorname*{arg\,min}_{K}J\brk*{K}$ be a (unique) optimal policy and $J_{\star}=J\brk*{K_{\star}}$ denote the optimal infinite horizon expected cost, which are both well defined under standard, very mild stabilizability assumptions (see 4) that hold in our setting. We are interested in minimizing the regret over $T$ decision rounds, defined as

R_{T}=\sum_{t=1}^{T}\brk!{x_{t}^{\mkern-1.5mu\mathsf{T}}Qx_{t}+u_{t}^{\mkern-1.5mu\mathsf{T}}Ru_{t}-J_{\star}}.
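
To make the setup concrete, the following minimal simulation sketch plays a fixed linear policy in the dynamics above and accumulates the quadratic cost; the system matrices, noise distribution, and the policy $K$ are illustrative placeholders and not quantities from the paper.

import numpy as np

rng = np.random.default_rng(0)
d_x, d_u, T = 3, 2, 10_000

A_star = 0.9 * np.eye(d_x)                       # state-state matrix (assumed)
B_star = 0.1 * rng.standard_normal((d_x, d_u))   # state-action matrix (assumed)
Q, R = np.eye(d_x), np.eye(d_u)                  # quadratic cost matrices
K = 0.05 * rng.standard_normal((d_u, d_x))       # some linear policy u_t = K x_t (assumed stabilizing)

x = np.zeros(d_x)                                # x_0 = 0 w.l.o.g.
total_cost = 0.0
for t in range(T):
    u = K @ x                                    # linear state feedback
    total_cost += x @ Q @ x + u @ R @ u          # instantaneous cost c_t
    w = rng.uniform(-0.1, 0.1, size=d_x)         # bounded, zero-mean i.i.d. noise
    x = A_star @ x + B_star @ u + w              # noisy linear dynamics
print("average cost over T rounds:", total_cost / T)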

We focus on the setting where the learner does not have a full a-priori description of the transition parameters $A_{\star}$ and $B_{\star}$, and has to learn them while controlling the system and minimizing the regret.

Throughout, we assume that the learner has knowledge of constants $\alpha_{0}>0$ and $\psi\geq 1$ such that

\norm{Q^{-1}},\norm{R^{-1}}\leq\ifrac{1}{\alpha_{0}},\text{ and }\norm{B_{\star}}\leq\psi.

We also assume that there is a known stable (not necessarily optimal) policy $K_{0}$ and $\nu>0$ such that $J\brk*{K_{0}}\leq\frac{1}{4}\nu$. We note that all of the aforementioned parameters could be easily estimated at the cost of an additive constant regret term by means of a warm-up period. However, recovering the initial control $K_{0}$ gives a constant that depends exponentially on the problem parameters, as shown by Chen and Hazan [7], Mania et al. [21], Cohen et al. [9].

Finally, denote the set of all "admissible" controllers

\mathcal{K}=\brk[c]{K\mid J\brk*{K}\leq\nu}.

By definition, $K_{0}\in\mathcal{K}$. As discussed below, over the set $\mathcal{K}$ the LQR cost function $J$ has certain regularity properties that we will use throughout.

2.2 Smooth Optimization

Fazel et al. [10] show that while the objective $J\brk*{\cdot}$ is non-convex, it has properties that make it amenable to standard gradient-based optimization schemes. We summarize these here as they are used in our analysis.

Definition 1 (PL-condition).

A function $f:\mathcal{X}\to\mathbb{R}$ with global minimum $f^{*}$ is said to be $\mu$-PL if it satisfies the Polyak-Lojasiewicz (PL) inequality with constant $\mu>0$, given by

\mu(f(x)-f^{*})\leq\norm{\nabla f(x)}^{2},\quad\forall x\in\mathcal{X}.
Definition 2 (Smoothness).

A function $f:\mathbb{R}^{d}\to\mathbb{R}$ is locally $\beta,D_{0}$-smooth over $\mathcal{X}\subseteq\mathbb{R}^{d}$ if for any $x\in\mathcal{X}$ and $y\in\mathbb{R}^{d}$ with $\norm{y-x}\leq D_{0}$

\norm{\nabla f(x)-\nabla f(y)}\leq\beta\norm{x-y}.
Definition 3 (Lipschitz).

A function $f:\mathbb{R}^{d}\to\mathbb{R}$ is locally $G,D_{0}$-Lipschitz over $\mathcal{X}\subseteq\mathbb{R}^{d}$ if for any $x\in\mathcal{X}$ and $y\in\mathbb{R}^{d}$ with $\norm{y-x}\leq D_{0}$

\abs{f(x)-f(y)}\leq G\norm{x-y}.

It is well-known that for functions satisfying the above conditions and for a sufficiently small step size $\eta$, the gradient descent update rule

x_{t+1}=x_{t}-\eta\nabla f(x_{t})

converges exponentially fast, i.e., there exists $0\leq\rho<1$ such that $f(x_{t})-f^{*}\leq\rho^{t}(f(x_{0})-f^{*})$ (e.g., 23). This setting has also been investigated in the absence of a perfect gradient oracle. Here we provide a clean result showing that the error in the optimization objective is limited only by the squared error of any gradient estimate.

Finally, we require the notion of a one-point gradient estimate [11]. Let $f:\mathcal{X}\to\mathbb{R}$ and define its smoothed version with parameter $r>0$ as

f^{r}\brk*{x}=\mathbb{E}_{B}{f\brk*{x+rB}}, (1)

where $B\in\mathcal{B}^{d}$ is a uniform random vector over the Euclidean unit ball. The following lemma is standard (we include a proof in Appendix B for completeness).

Lemma 1.

If $f$ is $(D_{0},\beta)$-locally smooth and $r\leq D_{0}$, then:

  1. $\nabla f^{r}\brk*{x}=\frac{d}{r}\mathbb{E}_{U}\brk[s]{f\brk*{x+rU}U}$, where $U\in\mathcal{S}^{d}$ is a uniform random vector over the unit sphere;

  2. $\norm{\nabla f^{r}\brk*{x}-\nabla f\brk*{x}}\leq\beta r,\;\forall x\in\mathcal{X}$.
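
For intuition, the following small sketch checks the one-point estimator of Lemma 1 numerically on a toy smooth function; the quadratic $f$, the dimension, and the sample size are illustrative choices, not part of the paper.

import numpy as np

rng = np.random.default_rng(0)
d, r, n_samples = 5, 0.05, 200_000

H = rng.standard_normal((d, d))
H = H @ H.T + np.eye(d)                     # f is a smooth (quadratic) function
f = lambda x: 0.5 * x @ H @ x
grad_f = lambda x: H @ x

x = rng.standard_normal(d)
est = np.zeros(d)
for _ in range(n_samples):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                  # U uniform on the unit sphere
    est += (d / r) * f(x + r * u) * u       # one-point term (d/r) f(x + rU) U
est /= n_samples                            # Monte-Carlo estimate of grad f^r(x)

print("error vs. true gradient:", np.linalg.norm(est - grad_f(x)))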

2.3 Background on LQR

It is well-known for the LQR problem that

J\brk*{K}=\mathrm{Tr}\brk*{P_{K}\Sigma_{w}}=\mathrm{Tr}\brk*{(Q+K^{\mkern-1.5mu\mathsf{T}}RK)\Sigma_{K}},

where $P_{K},\Sigma_{K}$ are the positive definite solutions to

P_{K}=Q+K^{\mkern-1.5mu\mathsf{T}}RK+(A_{\star}+B_{\star}K)^{\mkern-1.5mu\mathsf{T}}P_{K}(A_{\star}+B_{\star}K), (2)
\Sigma_{K}=\Sigma_{w}+(A_{\star}+B_{\star}K)\Sigma_{K}(A_{\star}+B_{\star}K)^{\mkern-1.5mu\mathsf{T}}. (3)
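
As an illustration, Eqs. 2 and 3 can be iterated directly to their fixed points to evaluate $J\brk*{K}$ for a given stabilizing controller; the sketch below does this for assumed toy matrices (all names and values are placeholders, not taken from the paper).

import numpy as np

def lqr_quantities(A, B, Q, R, K, Sigma_w, n_iters=1000):
    """Iterate Eqs. (2) and (3) to convergence and return (P_K, Sigma_K, J(K))."""
    M = A + B @ K                                   # closed-loop transition matrix
    P = np.zeros_like(A)
    Sigma = np.zeros_like(A)
    for _ in range(n_iters):
        P = Q + K.T @ R @ K + M.T @ P @ M           # Eq. (2)
        Sigma = Sigma_w + M @ Sigma @ M.T           # Eq. (3)
    return P, Sigma, np.trace(P @ Sigma_w)          # J(K) = Tr(P_K Sigma_w)

# toy, assumed system
rng = np.random.default_rng(0)
d_x, d_u = 3, 2
A = 0.9 * np.eye(d_x)
B = 0.1 * rng.standard_normal((d_x, d_u))
Q, R = np.eye(d_x), np.eye(d_u)
K = np.zeros((d_u, d_x))                            # zero controller; stable since rho(A) < 1
P_K, Sigma_K, J_K = lqr_quantities(A, B, Q, R, K, Sigma_w=0.01 * np.eye(d_x))
print("J(K) =", J_K)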

Another important notion is that of strong stability [8]. This is essentially a quantitative version of classic stability notions in linear control.

Definition 4 (strong stability).

A matrix $M$ is $(\kappa,\gamma)$-strongly stable (for $\kappa\geq 1$ and $0<\gamma\leq 1$) if there exist matrices $H\succ 0$ and $L$ such that $M=HLH^{-1}$ with $\norm{L}\leq 1-\gamma$ and $\norm{H}\norm{H^{-1}}\leq\kappa$. A controller $K$ is $(\kappa,\gamma)$-strongly stable if $\norm{K}\leq\kappa$ and the matrix $A_{\star}+B_{\star}K$ is $(\kappa,\gamma)$-strongly stable.

The following lemma, due to Cohen et al. [9], relates the infinite horizon cost of a controller to its strong stability parameters.

Lemma 2 (9, Lemma 18).

Suppose that $K\in\mathcal{K}$; then $K$ is $(\kappa,\gamma)$-strongly stable with $\kappa=\sqrt{\ifrac{\nu}{\alpha_{0}\sigma^{2}}}$ and $\gamma=\ifrac{1}{2\kappa^{2}}$.

The following two lemmas, due to Cohen et al. [8], Cassel et al. [6], show that the state covariance converges exponentially fast, and that the state is bounded as long as controllers are allowed to mix.

Lemma 3 (8, Lemma 3.2).

Suppose we play some fixed $K\in\mathcal{K}$ starting from some $x_{0}\in\mathbb{R}^{d_{x}}$, then

\norm{\mathbb{E}\brk[s]{x_{t}x_{t}^{\mkern-1.5mu\mathsf{T}}}-\Sigma_{K}}\leq\kappa^{2}e^{-2\gamma t}\norm{x_{0}x_{0}^{\mkern-1.5mu\mathsf{T}}-\Sigma_{K}},
\abs{\mathbb{E}\brk[s]{c_{t}}-J\brk*{K}}\leq\frac{\nu\kappa^{2}}{\sigma^{2}}e^{-2\gamma t}\norm{x_{0}x_{0}^{\mkern-1.5mu\mathsf{T}}-\Sigma_{K}}.
Lemma 4 (6, Lemma 39).

Let $K_{1},K_{2},\ldots\in\mathcal{K}$. If we play each controller $K_{i}$ for at least $\tau\geq 2\kappa^{2}\log 2\kappa$ rounds before switching to $K_{i+1}$, then for all $t\geq 1$ we have that $\norm{x_{t}}\leq 6\kappa^{4}W$ and $c_{t}\leq 36\nu\kappa^{8}W^{2}/\sigma^{2}$.

The following is a summary of results from Fazel et al. [10] that describe the main properties of $\Sigma_{K},P_{K},J\brk*{K}$. See Appendix B for the complete details.

Lemma 5 (10, Lemmas 11, 13, 16, 27 and 28).

Let $K\in\mathcal{K}$ and $K^{\prime}\in\mathbb{R}^{d_{u}\times d_{x}}$ with

\norm{K-K^{\prime}}\leq\frac{1}{8\psi\kappa^{3}}=D_{0},

then we have that

  1. $\mathrm{Tr}\brk*{P_{K}}\leq J\brk*{K}/\sigma^{2}$; $\mathrm{Tr}\brk*{\Sigma_{K}}\leq J\brk*{K}/\alpha_{0}$;

  2. $\norm{\Sigma_{K}-\Sigma_{K^{\prime}}}\leq(\ifrac{8\psi\nu\kappa^{3}}{\alpha_{0}})\norm{K-K^{\prime}}$;

  3. $\norm{P_{K}-P_{K^{\prime}}}\leq 16\psi\kappa^{7}\norm{K-K^{\prime}}$;

  4. $J$ satisfies the local Lipschitz condition (Definition 3) over $\mathcal{K}$ with $D_{0}$ and $G=\ifrac{4\psi\nu\kappa^{7}}{\alpha_{0}}$;

  5. $J$ satisfies the local smoothness condition (Definition 2) over $\mathcal{K}$ with $D_{0}$ and $\beta=\ifrac{112\sqrt{d_{x}}\nu\psi^{2}\kappa^{8}}{\alpha_{0}}$;

  6. $J$ satisfies the PL condition (Definition 1) with $\mu=\ifrac{4\nu}{\kappa^{4}}$.

3 Algorithm and Overview of Analysis

We are now ready to present our main algorithm for model free regret minimization in LQR. The algorithm, given in Algorithm 1, optimizes an underlying controller $K_{j}$ over epochs of exponentially increasing duration. Each epoch consists of sub-epochs, during which a perturbed controller $K_{j,i}$ centered at $K_{j}$ is drawn and played for $\tau$ rounds. At the end of each epoch, the algorithm uses $c_{j,i,\tau}$, which is the cost incurred during the final round of playing the controller $K_{j,i}$, to construct a gradient estimate which in turn is used to calculate the next underlying controller $K_{j+1}$. Interestingly, we do not make any explicit use of the state observation $x_{t}$, which is only used implicitly to calculate the control signal via $u_{t}=K_{t}x_{t}$. Furthermore, the algorithm makes only $O\brk*{d_{u}d_{x}}$ computations per time step.

Algorithm 1 LQR Online Policy Gradient
1: input: initial controller $K_{0}\in\mathcal{K}$, step size $\eta$, mixing length $\tau$, parameters $\mu,r_{0},m_{0}$
2: for epoch $j=0,1,2,\ldots$ do
3:     set $r_{j}=r_{0}(1-\mu\eta/3)^{j/2}$, $m_{j}=m_{0}(1-\mu\eta/3)^{-2j}$
4:     for $i=1,\ldots,m_{j}$ do
5:         draw $\widetilde{U}_{j,i}\in\mathbb{R}^{d_{u}\times d_{x}}$ with i.i.d. $\mathcal{N}(0,1)$ entries
6:         set $U_{j,i}=\widetilde{U}_{j,i}/\norm{\widetilde{U}_{j,i}}_{F}$
7:         play $K_{j,i}=K_{j}+r_{j}U_{j,i}$ for $\tau$ rounds
8:         observe the cost of the final round $c_{j,i,\tau}$
9:     calculate $\hat{g}_{j}=\frac{d_{x}d_{u}}{m_{j}r_{j}}\sum_{i=1}^{m_{j}}c_{j,i,\tau}U_{j,i}$
10:     update $K_{j+1}=K_{j}-\eta\hat{g}_{j}$
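
For concreteness, the following Python sketch mirrors the structure of Algorithm 1. It assumes a hypothetical simulator rollout(K, tau, x0) that plays a controller for tau rounds from a given state and returns the final state together with the final-round cost; the parameter values and names are illustrative, not the tuned constants of Theorem 1.

import numpy as np

def online_policy_gradient(rollout, K0, eta, tau, mu, r0, m0, n_epochs, d_x, d_u, seed=0):
    """Sketch of Algorithm 1; `rollout` is an assumed simulator interface."""
    rng = np.random.default_rng(seed)
    K, x = K0.copy(), np.zeros(d_x)
    for j in range(n_epochs):
        r_j = r0 * (1 - mu * eta / 3) ** (j / 2)                  # line 3: exploration radius
        m_j = int(np.ceil(m0 * (1 - mu * eta / 3) ** (-2 * j)))   # line 3: epoch length
        g_hat = np.zeros((d_u, d_x))
        for _ in range(m_j):
            U = rng.standard_normal((d_u, d_x))                   # line 5: Gaussian direction
            U /= np.linalg.norm(U)                                # line 6: Frobenius normalization
            x, c_final = rollout(K + r_j * U, tau, x)             # lines 7-8: play perturbed controller
            g_hat += c_final * U
        g_hat *= d_x * d_u / (m_j * r_j)                          # line 9: batched one-point estimate
        K = K - eta * g_hat                                       # line 10: gradient step
    return K

A rollout implementation would simply iterate the dynamics $x_{t+1}=A_{\star}x_{t}+B_{\star}Kx_{t}+w_{t}$ for $\tau$ steps and report the last state and the final-round cost; note that, as in the algorithm, only the scalar cost of the final round is used for learning.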

Our main result regarding Algorithm 1 is stated in the following theorem: a high-probability $O(\sqrt{T})$ regret guarantee with a polynomial dependence on the problem parameters.

Theorem 1.

Let $\kappa=\sqrt{{\nu}/{\alpha_{0}\sigma^{2}}}$ and suppose we run Algorithm 1 with parameters

\eta=\frac{\alpha_{0}}{128\nu\psi^{2}\kappa^{10}},\;\tau=2\kappa^{2}\log(7\kappa T),
\mu=\frac{4\nu}{\kappa^{4}},\;r_{0}=\frac{\alpha_{0}}{448\sqrt{d_{x}}\psi^{2}\kappa^{10}},
\sqrt{m_{0}}=\frac{2^{17}d_{u}d_{x}^{3/2}\psi^{2}\kappa^{20}W^{2}}{\alpha_{0}\sigma^{2}}\sqrt{\log\frac{240T^{4}}{\delta}},

then with probability at least $1-\delta$,

R_{T}=O\brk*{\frac{d_{u}d_{x}^{3/2}\psi^{4}\kappa^{36}W^{2}}{\alpha_{0}}\sqrt{T\tau\log\frac{T}{\delta}}}.

Here we give an overview of the main steps in proving Theorem 1, deferring the details of each step to later sections. Our first step is analyzing the utility of the policies $K_{j}$ computed at the end of each epoch. We show that the gap between the long-term (steady-state) cost of each $K_{j}$ (over epoch $j$) and that of the optimal $K_{\star}$ is controlled by the inverse square-root of the epoch length $m_{j}$.

Lemma 6 (exploitation).

Under the parameter choices of Theorem 1, for any $j\geq 0$ we have that with probability at least $1-\delta/8T^{2}$,

J\brk*{K_{j}}-J_{\star}=O\brk*{\nu\sqrt{\frac{m_{0}}{m_{j}}}}=O\brk*{{d_{u}d_{x}^{3/2}\psi^{2}\kappa^{22}W^{2}}\sqrt{\frac{1}{m_{j}}\log\frac{T}{\delta}}},

and further that $J\brk*{K_{j}}\leq\nu/2$.

The proof of the lemma is based on a careful analysis of gradient descent with inexact gradients and crucially exploits the PL and local-smoothness properties of the loss $J\brk*{\cdot}$. More details can be found in Section 4.

The more interesting (and challenging) part of our analysis pertains to controlling the costs associated with exploration, namely, the penalties introduced by the perturbations of the controllers $K_{j}$. The direct cost of exploration is clear: instead of playing the $K_{j}$ intended for exploitation, the algorithm actually follows the perturbed controllers $K_{j,i}$ and thus incurs the differences in long-term costs $J\brk*{K_{j,i}}-J\brk*{K_{j}}$. Our following lemma bounds the accumulation of these penalties over an epoch $j$; importantly, it shows that while the bound scales linearly with the length of the epoch $m_{j}$, it has a quadratic dependence on the exploration radius $r_{j}$.

Lemma 7 (direct exploration cost).

Under the parameter choices of Theorem 1, for any $j\geq 0$ we have that with probability at least $1-\delta/4T$,

\sum_{i=1}^{m_{j}}J\brk*{K_{j,i}}-J\brk*{K_{j}}=O\brk*{\frac{\sqrt{d_{x}}\nu\psi^{2}\kappa^{8}}{\alpha_{0}}r_{j}^{2}m_{j}+\nu\sqrt{m_{j}\log\frac{T}{\delta}}}.

There are additional, indirect costs associated with exploration, however: within each epoch the algorithm switches frequently between different policies, thereby suffering the indirect costs that stem from their "burn-in" period. This is precisely what gives rise to the differences between the realized cost $c_{j,i,s}$ and the long-term cost $J\brk*{K_{j,i}}$ of the policy $K_{j,i}$, the cumulative effect of which is bounded in the next lemma. Here again, note the quadratic dependence on the exploration radius $r_{j}$, which is essential for obtaining our $\sqrt{T}$-regret result.

Lemma 8 (indirect exploration cost).

Under the parameter choices of Theorem 1, for any $j\geq 0$ we have that with probability at least $1-\delta/4T$,

\sum_{i=1}^{m_{j}}\sum_{s=1}^{\tau}\brk!{c_{j,i,s}-J\brk*{K_{j,i}}}=O\brk*{\frac{\nu\kappa^{8}W^{2}}{\sigma^{2}}\tau\sqrt{m_{j}\log\frac{T}{\delta}}+\frac{d_{x}\nu\psi^{2}\kappa^{10}}{\alpha_{0}}m_{j}r_{j}^{2}}.

The technical details for Lemmas 7 and 8 are discussed in Section 5. We now have all the main pieces required for proving our main result.

Proof (of Theorem 1).

Taking a union bound, we conclude that Lemmas 8, 7 and 6 hold for all $j\geq 0$ with probability at least $1-\delta$. Now, notice that our choice of parameters is such that

r_{j}^{2}m_{j}=r_{0}^{2}\sqrt{m_{0}m_{j}}=O\brk*{\frac{\sqrt{d_{x}}d_{u}\alpha_{0}W^{2}}{\psi^{2}\sigma^{2}}\sqrt{m_{j}\log\frac{T}{\delta}}}.

Plugging this back into Lemmas 8 and 7 we get that for all $j$,

\sum_{i=1}^{m_{j}}\sum_{s=1}^{\tau}\brk!{c_{j,i,s}-J\brk*{K_{j,i}}}=O\brk*{\frac{d_{u}d_{x}^{3/2}\nu\kappa^{10}W^{2}}{\sigma^{2}}\tau\sqrt{m_{j}\log\frac{T}{\delta}}},
\tau\sum_{i=1}^{m_{j}}J\brk*{K_{j,i}}-J\brk*{K_{j}}=O\brk*{\frac{d_{u}d_{x}\nu\kappa^{8}W^{2}}{\sigma^{2}}\tau\sqrt{m_{j}\log\frac{T}{\delta}}}.

We conclude that the regret during epoch $j$ is bounded as

\sum_{i=1}^{m_{j}}\sum_{s=1}^{\tau}\brk!{c_{j,i,s}-J_{\star}}=\brk[s]*{\sum_{i=1}^{m_{j}}\sum_{s=1}^{\tau}\brk!{c_{j,i,s}-J\brk*{K_{j,i}}}}+\brk[s]*{\tau\sum_{i=1}^{m_{j}}J\brk*{K_{j,i}}-J\brk*{K_{j}}}+\brk[s]{\tau m_{j}(J\brk*{K_{j}}-J_{\star})}
=O\brk*{{d_{u}d_{x}^{3/2}\psi^{2}\kappa^{22}W^{2}}\tau\sqrt{m_{j}\log\frac{T}{\delta}}},

where the second step also used the fact that $\nu/\sigma^{2}\leq\kappa^{2}$. Finally, a simple calculation (see Lemma 12) shows that

\sum_{j=0}^{n-1}\sqrt{m_{j}}=O\brk*{\frac{1}{\mu\eta}\sqrt{T/\tau}}=O\brk*{\frac{\psi^{2}\kappa^{14}}{\alpha_{0}}\sqrt{T/\tau}},

and thus summing over the regret accumulated in each epoch concludes the proof.

4 Optimization Analysis

At its core, Algorithm 1 is a policy gradient method with $K_{j}$ being the prediction after $j$ gradient steps. In this section we analyze the sub-optimality gap of the underlying controllers $K_{j}$, culminating in the proof of Lemma 6. To achieve this, we first consider a general optimization problem with a corrupted gradient oracle, and show that the optimization rate is limited only by the square of the corruption magnitude. We follow this with an analysis of the LQR gradient estimation, from which the overall optimization cost follows readily.

4.1 Inexact First-Order Optimization

Let $f:\mathbb{R}^{d}\to\mathbb{R}$ be a function with global minimum $f_{*}>-\infty$. Suppose there exists $\bar{f}\in\mathbb{R}$ such that $f$ is $\mu$-PL, $(D_{0},\beta)$-locally smooth, and $(D_{0},G)$-locally Lipschitz over the sub-level set $\mathcal{X}=\brk[c]{x\mid f\brk*{x}\leq\bar{f}}$. We consider the update rule

x_{t+1}=x_{t}-\eta\hat{g}_{t}, (4)

where $f\brk*{x_{0}}\leq\bar{f}$, and $\hat{g}_{t}\in\mathbb{R}^{d}$ is a corrupted gradient oracle that satisfies

\norm{\hat{g}_{t}-\nabla f\brk*{x_{t}}}\leq\varepsilon_{t}, (5)

where $\varepsilon_{t}\leq\min\brk[c]{G,\sqrt{(\bar{f}-f_{*})\mu}/2}$ is the magnitude of the corruption at step $t$. Define the effective corruption up to round $t$ as

\bar{\varepsilon}_{t}^{2}=\max_{s\leq t}\brk[c]*{\varepsilon_{s}^{2}\brk[s]{1-(\mu\eta/3)}^{t-s}},

and notice that if $\varepsilon_{s}\brk[s]{1-(\mu\eta/3)}\leq\varepsilon_{s+1}$ then $\bar{\varepsilon}_{t}=\varepsilon_{t}$.

The following result shows that this update rule achieves a linear convergence rate up to an accuracy that depends quadratically on the corruptions.

Theorem 2 (corrupted gradient descent).

Suppose that $\eta\leq\min\brk[c]{\ifrac{1}{\beta},\ifrac{4}{\mu},\ifrac{D_{0}}{2G}}$. Then for all $t\geq 0$,

f\brk*{x_{t}}-f_{*}\leq\max\brk[c]*{\frac{4\bar{\varepsilon}_{t-1}^{2}}{\mu},\brk[s]*{1-\frac{\mu\eta}{3}}^{t}\brk*{f\brk*{x_{0}}-f_{*}}},

and consequently $x_{t}\in\mathcal{X}$.

Proof.

For ease of notation, denote $w_{t}=\hat{g}_{t}-\nabla f\brk*{x_{t}}$. Now, suppose that $x_{t}\in\mathcal{X}$, i.e., $f\brk*{x_{t}}\leq\bar{f}$. Then we have that

\norm{x_{t+1}-x_{t}}=\norm{\hat{g}_{t}}\eta\leq(\norm{\nabla f\brk*{x_{t}}}+\varepsilon_{t})D_{0}/2G\leq D_{0},

where the second step used our choice of $\eta$ and the third step used the Lipschitz assumption on $x_{t}$ and the bound on $\varepsilon_{t}$. We conclude that $x_{t},x_{t+1}$ satisfy the conditions for local smoothness and so we have that

f\brk*{x_{t+1}}-f\brk*{x_{t}}\leq\nabla f\brk*{x_{t}}^{\mkern-1.5mu\mathsf{T}}\brk*{x_{t+1}-x_{t}}+\frac{\beta}{2}\norm{x_{t+1}-x_{t}}^{2}
=\nabla f\brk*{x_{t}}^{\mkern-1.5mu\mathsf{T}}\brk*{-\eta\hat{g}_{t}}+\frac{\beta}{2}\norm{\eta\hat{g}_{t}}^{2}
=\eta\brk[s]*{-\norm{\nabla f\brk*{x_{t}}}^{2}-\nabla f\brk*{x_{t}}^{\mkern-1.5mu\mathsf{T}}w_{t}+\frac{\eta\beta}{2}\norm{\nabla f\brk*{x_{t}}+w_{t}}^{2}}
=\eta\brk[s]*{-\brk*{1-\frac{\eta\beta}{2}}\norm{\nabla f\brk*{x_{t}}}^{2}-\brk*{1-\eta\beta}\nabla f\brk*{x_{t}}^{\mkern-1.5mu\mathsf{T}}w_{t}+\frac{\eta\beta}{2}\norm{w_{t}}^{2}}
=\eta\brk[s]*{-\brk*{1-\eta\beta}\norm{\frac{1}{2}\nabla f\brk*{x_{t}}+w_{t}}^{2}-\brk*{\frac{3}{4}-\frac{\eta\beta}{4}}\norm{\nabla f\brk*{x_{t}}}^{2}+\brk*{1-\frac{\eta\beta}{2}}\norm{w_{t}}^{2}}
\leq\eta\brk[s]*{-\brk*{\frac{3}{4}-\frac{\eta\beta}{4}}\norm{\nabla f\brk*{x_{t}}}^{2}+\brk*{1-\frac{\eta\beta}{2}}\norm{w_{t}}^{2}},

where the last transition holds by choice of $\eta\beta\leq 1$. Next, using the PL condition and the bound on $w_{t}$ (see Eq. 5) we get that

f\brk*{x_{t+1}}-f\brk*{x_{t}}\leq\eta\brk[s]*{-\mu\brk*{\frac{3}{4}-\frac{\eta\beta}{4}}\brk*{f\brk*{x_{t}}-f_{*}}+\brk*{1-\frac{\eta\beta}{2}}\varepsilon_{t}^{2}},

and adding $f\brk*{x_{t}}-f_{*}$ to both sides of the equation we get

f\brk*{x_{t+1}}-f_{*}\leq\brk[s]*{1-\frac{\mu\eta}{4}\brk*{3-\eta\beta}}\brk*{f\brk*{x_{t}}-f_{*}}+\brk*{1-\frac{\eta\beta}{2}}\eta\varepsilon_{t}^{2}.

Now, if $\frac{4\varepsilon_{t}^{2}}{\mu}\leq f\brk*{x_{t}}-f_{*}$ then since $\eta\beta\leq 1$ we have that

f\brk*{x_{t+1}}-f_{*}\leq\brk[s]*{1-\frac{\mu\eta}{4}\brk*{3-\eta\beta}+\frac{\mu\eta}{4}\brk*{1-\frac{\eta\beta}{2}}}\brk*{f\brk*{x_{t}}-f_{*}}=\brk[s]*{1-\frac{\mu\eta}{2}\brk*{1-\frac{\eta\beta}{4}}}\brk*{f\brk*{x_{t}}-f_{*}}\leq\brk[s]*{1-\frac{\mu\eta}{3}}\brk*{f\brk*{x_{t}}-f_{*}}. (6)

On the other hand, if $\frac{4\varepsilon_{t}^{2}}{\mu}\geq f\brk*{x_{t}}-f_{*}\geq 0$ then we have that

f\brk*{x_{t+1}}-f_{*}\leq\brk[s]*{\max\brk[c]*{0,\frac{4}{\mu}-\eta\brk*{3-\eta\beta}}+\eta\brk*{1-\frac{\eta\beta}{2}}}\varepsilon_{t}^{2}\leq\max\brk[c]*{\eta,\frac{4}{\mu}-\eta\brk*{2-\eta\beta}}\varepsilon_{t}^{2}\leq\frac{4\varepsilon_{t}^{2}}{\mu}, (7)

where the last transition holds, again, by our choice of $\eta$. Combining Eqs. 6 and 7 we conclude that

f\brk*{x_{t+1}}-f_{*}\leq\max\brk[c]*{\frac{4\varepsilon_{t}^{2}}{\mu},\brk[s]*{1-\frac{\mu\eta}{3}}\brk*{f\brk*{x_{t}}-f_{*}}}. (8)

In particular, this implies that $f\brk*{x_{t+1}}\leq\max\brk[c]*{f_{*}+\frac{4\varepsilon_{t}^{2}}{\mu},f\brk*{x_{t}}}\leq\bar{f}$, and thus $x_{t+1}\in\mathcal{X}$. Since we assume that $x_{0}\in\mathcal{X}$, this completes an induction showing that $x_{t}\in\mathcal{X}$ for all $t\geq 0$. We can thus unroll Eq. 8 recursively to obtain the final result.
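
To illustrate the behavior guaranteed by Theorem 2, the following toy sketch runs the corrupted update rule (4) on a strongly convex (hence PL) quadratic and shows the error floor of order $\varepsilon^{2}/\mu$; the function, corruption level, and step size are assumed for illustration and are not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
d = 10
H = np.diag(np.linspace(1.0, 4.0, d))   # f(x) = 0.5 x^T H x, so f is 2-PL and 4-smooth
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x
mu, beta, eps = 2.0, 4.0, 0.05
eta = 1.0 / beta                        # step size satisfying eta <= 1/beta

x = rng.standard_normal(d)
for t in range(300):
    corruption = rng.standard_normal(d)
    corruption *= eps / np.linalg.norm(corruption)   # oracle error of norm exactly eps
    x = x - eta * (grad(x) + corruption)             # corrupted gradient step (Eq. 4)

print("final suboptimality f(x_t) - f_*:", f(x))     # stalls near the 4*eps^2/mu floor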

4.2 Gradient Estimation

The gradient estimate $\hat{g}_{j}$ is a batched version of the typical one-point gradient estimator. We bound it in the next lemma using the following inductive idea: if $J\brk*{K_{j}}\leq\nu/2$, then $K_{j,i}\in\mathcal{K}$ and standard concentration arguments imply that the estimation error is small with high probability, and thus Theorem 2 implies that $J\brk*{K_{j+1}}\leq\nu/2$.

Lemma 9 (Gradient estimation error).

Under the parameter choices of Theorem 1, for any $j\geq 0$ we have that with probability at least $1-(\delta/8T^{3})$,

\norm{\hat{g}_{j}-\nabla J\brk*{K_{j}}}_{F}\leq\frac{\sqrt{\mu\nu}}{4}\brk*{1-\frac{\mu\eta}{3}}^{j/2}.
Proof (of Lemma 9).

Assume that conditioned on the event $J\brk*{K_{j^{\prime}}}\leq\nu/2$ for all $j^{\prime}\leq j$, the claim holds with probability at least $1-\delta/8T^{4}$. We show that we can peel off the conditioning by summing the failure probability of each epoch. Concretely, we show by induction that the claim holds for all $j^{\prime}\leq j$ with probability at least $1-j\delta/8T^{4}$. Since the number of epochs is less than $T$ (in fact logarithmic in $T$), this will conclude the proof.

The induction base follows immediately by our conditional assumption and the fact that $J\brk*{K_{0}}\leq\nu/4$. Now, assume the hypothesis holds up to $j-1$. We show that the conditions of Theorem 2 are satisfied with $\bar{f}=\nu/2$ up to round $j$, and thus $J\brk*{K_{j^{\prime}}}\leq\nu/2$ for all $j^{\prime}\leq j$. We can then invoke our conditional assumption and a union bound to conclude the induction step.

We verify the conditions of Theorem 2. First, the Lipschitz, smoothness, and PL conditions hold by Lemma 5. Next, notice that by definition $J_{\star}\leq J\brk*{K_{0}}\leq\nu/4$, and so by the induction hypothesis $\norm{\hat{g}_{j^{\prime}}-\nabla J\brk*{K_{j^{\prime}}}}_{F}\leq\sqrt{\nu\mu}/4\leq\sqrt{(\bar{f}-f_{*})\mu}/2\leq G$, for all $j^{\prime}<j$. Finally, noticing that $\kappa^{2}>d_{x}$ it is easy to verify the condition on $\eta$.

It remains to show the conditional claim holds. The event $J\brk*{K_{j^{\prime}}}\leq\nu/2$ for all $j^{\prime}\leq j$ essentially implies that the policy gradient scheme did not diverge up to the start of epoch $j$. Importantly, this event is independent of any randomization during epoch $j$ and thus will not break any i.i.d. assumptions within the epoch. Moreover, by Lemma 5 and since $r_{0}\leq\nu/2G$, this implies that $J\brk*{K_{j^{\prime},i}}\leq J\brk*{K_{j}}+Gr_{j}\leq\nu$, i.e., $K_{j^{\prime},i}\in\mathcal{K}$ for all $i$ and $j^{\prime}\leq j$. For the remainder of the proof, we implicitly assume that this holds, allowing us to invoke Lemmas 3, 4 and 5. For ease of notation, we will not specify this explicitly.

Now, let $J^{r}$ be the smoothed version of $J$ as in Eq. 1. Since $r_{j}\leq D_{0}$ we can use Lemma 1 to get that

\normg^jJ\brkKjF\displaystyle\norm{\hat{g}_{j}-\nabla J\brk*{K_{j}}}_{F} \normg^jJrj\brkKjF+\normJrj\brkKjJ\brkKjF\displaystyle\leq\norm{\hat{g}_{j}-\nabla J^{r_{j}}\brk*{K_{j}}}_{F}+\norm{\nabla J^{r_{j}}\brk*{K_{j}}-\nabla J\brk*{K_{j}}}_{F}
βrj+\normg^jJrj\brkKjF,\displaystyle\leq\beta r_{j}+\norm{\hat{g}_{j}-\nabla J^{r_{j}}\brk*{K_{j}}}_{F},

Next, we decompose the remaining term using the triangle inequality to get that

\normg^jJrj\brkKjF\displaystyle\norm{\hat{g}_{j}-\nabla J^{r_{j}}\brk*{K_{j}}}_{F} =\norm1mji=1mj\brkdxdurjcj,i,τUj,iJrj\brkKjF\displaystyle=\norm*{\frac{1}{m_{j}}\sum_{i=1}^{m_{j}}\brk{\frac{d_{x}d_{u}}{r_{j}}c_{j,i,\tau}U_{j,i}-\nabla J^{r_{j}}\brk*{K_{j}}}}_{F}
\norm1mji=1mj\brkdxdurjJ\brkKj,iUj,iJrj\brkKjF+\norm1mji=1mj\brkdxdurj(cj,i,τJ\brkKj,i)Uj,iF.\displaystyle\leq\norm*{\frac{1}{m_{j}}\sum_{i=1}^{m_{j}}\brk{\frac{d_{x}d_{u}}{r_{j}}J\brk*{K_{j,i}}U_{j,i}-\nabla J^{r_{j}}\brk*{K_{j}}}}_{F}+\norm*{\frac{1}{m_{j}}\sum_{i=1}^{m_{j}}\brk{\frac{d_{x}d_{u}}{r_{j}}(c_{j,i,\tau}-J\brk*{K_{j,i}})U_{j,i}}}_{F}.

By Lemma 1, we notice that, conditioned on $K_{j}$, the first term is a sum of zero-mean i.i.d. random vectors with norm bounded by $2d_{u}d_{x}\nu/r_{j}$. We thus invoke Lemma 13 (Vector Azuma) to get that with probability at least $1-\ifrac{\delta}{16T^{4}}$

\norm*{\frac{1}{m_{j}}\sum_{i=1}^{m_{j}}\frac{d_{x}d_{u}}{r_{j}}J\brk*{K_{j,i}}U_{j,i}-\nabla J^{r_{j}}\brk*{K_{j}}}_{F}\leq\frac{d_{u}d_{x}\nu}{r_{j}}\sqrt{\frac{8}{m_{j}}\log\frac{240T^{4}}{\delta}}.

Next, denote $Z_{i}=\frac{d_{x}d_{u}}{r_{j}}(c_{j,i,\tau}-J\brk*{K_{j,i}})U_{j,i}$, and notice that the remaining term is exactly $\norm{\frac{1}{m_{j}}\sum_{i=1}^{m_{j}}Z_{i}}_{F}$. Let $x_{j,i,\tau}$ be the state during the final round of playing controller $K_{j,i}$, and $\mathcal{F}_{i}$ be the filtration adapted to $x_{j,1,\tau},\ldots,x_{j,i,\tau},U_{j,1},\ldots,U_{j,i}$. We use Lemma 3 to get that

\norm𝔼\brk[s]Zii1F\displaystyle\norm{\mathbb{E}\brk[s]*{Z_{i}\mid\mathcal{F}_{i-1}}}_{F} 𝔼\brk[s]\norm𝔼\brk[s]Zii1,Kj,iFi1\displaystyle\leq\mathbb{E}\brk[s]*{\norm{\mathbb{E}\brk[s]*{Z_{i}\mid\mathcal{F}_{i-1},K_{j,i}}}_{F}\mid\mathcal{F}_{i-1}}
dxdurj𝔼\brk[s]\abs𝔼\brk[s]cj,i,τi1,Kj,iJ\brkKj,i|i1\displaystyle\leq\frac{d_{x}d_{u}}{r_{j}}\mathbb{E}\brk[s]*{\abs*{\mathbb{E}\brk[s]*{c_{j,i,\tau}\mid\mathcal{F}_{i-1},K_{j,i}}-J\brk*{K_{j,i}}}\;\Big{|}\;\mathcal{F}_{i-1}}
dxduνκ2rjσ2e2γτ𝔼\brk[s]\normxj,i1,τxj,i1,τ𝖳ΣKj,i|i1\displaystyle\leq\frac{d_{x}d_{u}\nu\kappa^{2}}{r_{j}\sigma^{2}}e^{-2\gamma\tau}\mathbb{E}\brk[s]*{\norm{x_{j,i-1,\tau}x_{j,i-1,\tau}^{\mkern-1.5mu\mathsf{T}}-\Sigma_{K_{j,i}}}\;\Big{|}\;\mathcal{F}_{i-1}}
37dxduνκ10W2rjσ2e2γτ\displaystyle\leq\frac{37d_{x}d_{u}\nu\kappa^{10}W^{2}}{r_{j}\sigma^{2}}e^{-2\gamma\tau}
dxduνκ8W2rjσ2T2,\displaystyle\leq\frac{d_{x}d_{u}\nu\kappa^{8}W^{2}}{r_{j}\sigma^{2}T^{2}},

where the last step plugged in the value of $\tau$ and the one before that used Lemmas 4 and 5 to bound $\norm{\Sigma_{K_{j,i}}}\leq\nu/\alpha_{0}=\kappa^{2}\sigma^{2}$ and $\norm{x_{j,i-1,\tau}}\leq 6\kappa^{4}W$. Further using Lemma 4 to bound $c_{j,i,\tau}$, we also get that

\norm{Z_{i}-\mathbb{E}\brk[s]*{Z_{i}\mid\mathcal{F}_{i-1}}}_{F}\leq\norm{Z_{i}}_{F}+\norm{\mathbb{E}\brk[s]*{Z_{i}\mid\mathcal{F}_{i-1}}}_{F}\leq\frac{d_{x}d_{u}c_{j,i,\tau}}{r_{j}}+\norm{\mathbb{E}\brk[s]*{Z_{i}\mid\mathcal{F}_{i-1}}}_{F}\leq\frac{37d_{x}d_{u}\nu\kappa^{8}W^{2}}{r_{j}\sigma^{2}}.

Since $Z_{i}$ is $\mathcal{F}_{i}$-measurable we can invoke Lemma 13 (Vector Azuma) to get that with probability at least $1-\frac{\delta}{16T^{4}}$,

\norm1mji=1mjZiF\displaystyle\norm*{\frac{1}{m_{j}}\sum_{i=1}^{m_{j}}Z_{i}}_{F} 1mj\normi=1mjZi𝔼\brk[s]Zii1F+1mji=1mj\norm𝔼\brk[s]Zii1F\displaystyle\leq\frac{1}{m_{j}}\norm*{\sum_{i=1}^{m_{j}}Z_{i}-\mathbb{E}\brk[s]*{Z_{i}\mid\mathcal{F}_{i-1}}}_{F}+\frac{1}{m_{j}}\sum_{i=1}^{m_{j}}\norm{\mathbb{E}\brk[s]*{Z_{i}\mid\mathcal{F}_{i-1}}}_{F}
dxduνκ8W2rjσ2\brk[s]372mjlog240T4δ+1T2\displaystyle\leq\frac{d_{x}d_{u}\nu\kappa^{8}W^{2}}{r_{j}\sigma^{2}}\brk[s]*{37\sqrt{\frac{2}{m_{j}}\log\frac{240T^{4}}{\delta}}+\frac{1}{T^{2}}}
54dxduνκ8W2rjσ21mjlog240T4δ.\displaystyle\leq\frac{54d_{x}d_{u}\nu\kappa^{8}W^{2}}{r_{j}\sigma^{2}}\sqrt{\frac{1}{m_{j}}\log\frac{240T^{4}}{\delta}}.

Using a union bound and putting everything together, we conclude that with probability at least $1-(\delta/8T^{4})$,

\normg^jJ\brkKjF\displaystyle\norm{\hat{g}_{j}-\nabla J\brk*{K_{j}}}_{F} βrj+54dxduνκ8W2rjσ21mjlog240T4δ\displaystyle\leq\beta r_{j}+\frac{54d_{x}d_{u}\nu\kappa^{8}W^{2}}{r_{j}\sigma^{2}}\sqrt{\frac{1}{m_{j}}\log\frac{240T^{4}}{\delta}}
=\brk[s]βr0+54dxduνκ8W2σ2r0m01/2log240T4δ\brk1μη3j/2\displaystyle=\brk[s]*{\beta r_{0}+\frac{54d_{x}d_{u}\nu\kappa^{8}W^{2}}{\sigma^{2}r_{0}m_{0}^{1/2}}\sqrt{\log\frac{240T^{4}}{\delta}}}\brk*{1-\frac{\mu\eta}{3}}^{j/2}
2βr0\brk1μη3j/2\displaystyle\leq 2\beta r_{0}\brk*{1-\frac{\mu\eta}{3}}^{j/2}
μν4\brk1μη3j/2,\displaystyle\leq\frac{\sqrt{\mu\nu}}{4}\brk*{1-\frac{\mu\eta}{3}}^{j/2},

where the last steps plugged in the values of $\mu,\beta,r_{0}$, and $m_{0}$.

4.3 Proof of Lemma 6

Lemma 6 is a straightforward consequence of the previous results.

Proof.

For $j=0$ the claim holds trivially by our assumption that $J\brk*{K_{0}}\leq\nu/4$. Now, for $j\geq 1$, we use a union bound on Lemma 9 to get that with probability at least $1-\delta/8T^{2}$

\norm{\hat{g}_{j}-\nabla J\brk*{K_{j}}}\leq\frac{\sqrt{\mu\nu}}{4}\brk*{1-\frac{\mu\eta}{3}}^{j/2},\qquad\forall j\geq 0.

Then by Theorem 2 we have that

J\brk*{K_{j}}\leq J_{\star}+\frac{\nu}{4}\brk*{1-\frac{\mu\eta}{3}}^{j-1}\leq\min\brk[c]*{\frac{\nu}{2},J_{\star}+\frac{\nu}{2}\brk*{1-\frac{\mu\eta}{3}}^{j}},

where the last step used the facts that $J_{\star}\leq J\brk*{K_{0}}\leq\nu/4$ and $1-\mu\eta/3\geq 1/2$.

5 Exploration Cost Analysis

In this section we demonstrate that exploring near a given initial policy does not incur linear regret in the exploration radius (as more straightforward arguments would give), and use this crucial observation for proving Lemmas 7 and 8.

We begin with Lemma 8. The main difficulty in the proof is captured by the following basic result, which roughly shows that the expected cost for transitioning between two i.i.d. copies of a given random policy scales with the variance of the latter. This would in turn give the quadratic dependence on the exploration radius we need.

Lemma 10.

Let $K\in\mathcal{K}$ be fixed. Suppose $K_{1},K_{2}$ are i.i.d. random variables such that $\mathbb{E}K_{i}=K$, and $\norm{K_{i}-K}_{F}\leq r\leq D_{0}$. If $x_{\tau}(K_{1})$ is the result of playing $K_{1}$ for $\tau\geq 1$ rounds starting at $x_{0}\in\mathbb{R}^{d_{x}}$, then

\mathbb{E}\brk[s]{x_{\tau}(K_{1})^{\mkern-1.5mu\mathsf{T}}(P_{K_{2}}-P_{K_{1}})x_{\tau}(K_{1})}\leq\frac{256d_{x}\nu\psi^{2}\kappa^{10}}{\alpha_{0}}r^{2}+32d_{x}\psi\kappa^{9}(\norm{x_{0}}^{2}+\kappa^{2}\sigma^{2})re^{-2\gamma\tau}.
Proof.

Notice that the expectation is with respect to both controllers and the $\tau$ noise terms, all of which are jointly independent. We begin by using Lemmas 3 and 5 to get that

Tr\brk(PK2PK1)(𝔼\brk[s]xτ(K1)xτ(K1)𝖳K1ΣK1)\displaystyle{\mathrm{Tr}\brk*{(P_{K_{2}}-P_{K_{1}})(\mathbb{E}\brk[s]{x_{\tau}(K_{1})x_{\tau}(K_{1})^{\mkern-1.5mu\mathsf{T}}\mid K_{1}}-\Sigma_{K_{1}})}} 32dxψκ7r\norm𝔼\brk[s]xτ(K1)xτ(K1)𝖳K1ΣK1)\displaystyle\leq 32d_{x}\psi\kappa^{7}r\norm{\mathbb{E}\brk[s]{x_{\tau}(K_{1})x_{\tau}(K_{1})^{\mkern-1.5mu\mathsf{T}}\mid K_{1}}-\Sigma_{K_{1}})}
32dxψκ9re2γτ\normx0x0𝖳ΣK1\displaystyle\leq 32d_{x}\psi\kappa^{9}re^{-2\gamma\tau}\norm{x_{0}x_{0}^{\mkern-1.5mu\mathsf{T}}-\Sigma_{K_{1}}}
32dxψκ9(\normx02+κ2σ2)re2γτ,\displaystyle\leq 32d_{x}\psi\kappa^{9}(\norm{x_{0}}^{2}+\kappa^{2}\sigma^{2})re^{-2\gamma\tau},

where the last step also used the fact that $\kappa^{2}\sigma^{2}=\nu/\alpha_{0}$. Now, since $P_{K_{1}},P_{K_{2}}$ do not depend on the noise, we can use the law of total expectation to get that

𝔼\brk[s]xτ(K1)𝖳(PK2PK1)xτ(K1)\displaystyle\mathbb{E}\brk[s]{x_{\tau}(K_{1})^{\mkern-1.5mu\mathsf{T}}(P_{K_{2}}-P_{K_{1}})x_{\tau}(K_{1})} =𝔼\brk[s]Tr\brk(PK2PK1)𝔼\brk[s]xτ(K1)xτ(K1)𝖳K1\displaystyle=\mathbb{E}\brk[s]{\mathrm{Tr}\brk*{(P_{K_{2}}-P_{K_{1}})\mathbb{E}\brk[s]{x_{\tau}(K_{1})x_{\tau}(K_{1})^{\mkern-1.5mu\mathsf{T}}\mid K_{1}}}}
𝔼\brk[s]Tr\brk(PK2PK1)ΣK1+4dxα0κ2(\normx02+κ2σ2)e2γτ.\displaystyle\leq\mathbb{E}\brk[s]{\mathrm{Tr}\brk*{(P_{K_{2}}-P_{K_{1}})\Sigma_{K_{1}}}}+4d_{x}\alpha_{0}\kappa^{2}(\norm{x_{0}}^{2}+\kappa^{2}\sigma^{2})e^{-2\gamma\tau}.

To bound the remaining term, notice that since $K_{1},K_{2}$ are i.i.d., we may change their roles without changing the expectation, i.e.,

\mathbb{E}\brk[s]{\mathrm{Tr}\brk*{(P_{K_{2}}-P_{K_{1}})\Sigma_{K_{1}}}}=\mathbb{E}\brk[s]{\mathrm{Tr}\brk*{(P_{K_{1}}-P_{K_{2}})\Sigma_{K_{2}}}},

we conclude that

𝔼\brk[s]Tr\brk(PK2PK1)ΣK1\displaystyle\mathbb{E}\brk[s]{\mathrm{Tr}\brk*{(P_{K_{2}}-P_{K_{1}})\Sigma_{K_{1}}}} =12𝔼\brk[s]Tr\brk(PK2PK1)(ΣK1ΣK2)\displaystyle=\frac{1}{2}\mathbb{E}\brk[s]{\mathrm{Tr}\brk*{(P_{K_{2}}-P_{K_{1}})(\Sigma_{K_{1}}-\Sigma_{K_{2}})}}
dx2\normPK2PK1\normΣK2ΣK1\displaystyle\leq\frac{d_{x}}{2}\norm{P_{K_{2}}-P_{K_{1}}}\norm{\Sigma_{K_{2}}-\Sigma_{K_{1}}}
256dxνψ2κ10α0r2,\displaystyle\leq\frac{256d_{x}\nu\psi^{2}\kappa^{10}}{\alpha_{0}}r^{2},

where the last step also used Lemma 5.

5.1 Proof of Lemma 8

Before proving Lemma 8 we introduce a few simplifying notations. Since the lemma pertains to a single epoch, we omit its index $j$ wherever it is clear from context. For example, $K_{j,i}$ will be shortened to $K_{i}$ and $x_{j,i,s}$ to $x_{i,s}$. In any case, we reserve the index $j$ for epochs and $i$ for sub-epochs. In this context, we also denote the gap between realized and idealized costs during sub-epoch $i$ by

\Delta C_{i}=\sum_{s=1}^{\tau}(c_{i,s}-J\brk*{K_{i}}),

and the filtration $\mathcal{H}_{i}$ adapted to $w_{1,1},\ldots,w_{i,\tau-1},K_{1},\ldots,K_{i}$. We note that $K_{i}$ and $\Delta C_{i}$ are $\mathcal{H}_{i}$-measurable. The following lemma uses Eq. 2 to decompose the cost gap at the various time resolutions. See proof at the end of this section.

Lemma 11.

If the epoch initial controller satisfies $J\brk*{K_{j}}\leq\nu/2$ then (recall that $P_{K}$ is the positive definite solution to Eq. 2):

  1. $c_{i,s}-J\brk*{K_{i}}=x_{i,s}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i,s}-\mathbb{E}_{w_{i,s}}\brk[s]{x_{i,s+1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i,s+1}}$;

  2. $\mathbb{E}\brk[s]{\Delta C_{i}\mid\mathcal{H}_{i-1}}=\mathbb{E}\brk[s]*{x_{i,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i,1}-x_{i+1,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i+1,1}\mid\mathcal{H}_{i-1}}$;

  3. $\sum_{i=1}^{m_{j}}\mathbb{E}\brk[s]{\Delta C_{i}\mid\mathcal{H}_{i-1}}\leq\mathbb{E}\brk[s]{x_{1,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{1}}x_{1,1}}+\sum_{i=2}^{m_{j}}\big{(}\mathbb{E}\brk[s]{x_{i,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i,1}\mid\mathcal{H}_{i-1}}-\mathbb{E}\brk[s]{x_{i,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i-1}}x_{i,1}\mid\mathcal{H}_{i-2}}\big{)}$.

We are now ready to prove the main lemma of this section.

Proof (of Lemma 8).

First, by Lemma 6, the event $J\brk*{K_{j^{\prime}}}\leq\nu/2$ for all $j^{\prime}\leq j$ holds with probability at least $1-\delta/8T$. As in the proof of Lemma 9, we will implicitly assume that this event holds, which will not break any i.i.d. assumptions during epoch $j$ and implies that $K_{i}\in\mathcal{K}$ for all $1\leq i\leq m_{j}$. We also use this to invoke Lemmas 4 and 5 to get that for any $1\leq i,i^{\prime}\leq m_{j}$ and $1\leq s\leq\tau$ we have $x_{i,s}^{\mkern-1.5mu\mathsf{T}}P_{K_{i^{\prime}}}x_{i,s}\leq 36\nu\kappa^{8}W^{2}/\sigma^{2}=\nu_{0}$.

Now, recall that $\Delta C_{i}$ is $\mathcal{H}_{i}$-measurable and thus $\Delta C_{i}-\mathbb{E}\brk[s]{\Delta C_{i}\mid\mathcal{H}_{i-1}}$ is a martingale difference sequence. Using the first part of Lemma 11 we also conclude that each term is bounded by $\tau\nu_{0}$. Applying Azuma's inequality we get that with probability at least $1-(\delta/16T)$

i=1mjΔCi\displaystyle\sum_{i=1}^{m_{j}}\Delta C_{i} =i=1mjΔCi𝔼\brk[s]ΔCii1+𝔼\brk[s]ΔCii1\displaystyle=\sum_{i=1}^{m_{j}}\Delta C_{i}-\mathbb{E}\brk[s]{\Delta C_{i}\mid\mathcal{H}_{i-1}}+\mathbb{E}\brk[s]{\Delta C_{i}\mid\mathcal{H}_{i-1}}
2mjτ2ν02log16Tδ+i=1mj𝔼\brk[s]ΔCii1.\displaystyle\leq\sqrt{2m_{j}\tau^{2}\nu_{0}^{2}\log\frac{16T}{\delta}}+\sum_{i=1}^{m_{j}}\mathbb{E}\brk[s]{\Delta C_{i}\mid\mathcal{H}_{i-1}}.

Now, recall from Lemma 11 that

i=1mj\displaystyle\sum_{i=1}^{m_{j}} 𝔼\brk[s]ΔCii1\displaystyle\mathbb{E}\brk[s]{\Delta C_{i}\mid\mathcal{H}_{i-1}}
𝔼\brk[s]x1,1𝖳PK1x1,1+i=2mj𝔼\brk[s]xi,1𝖳PKixi,1i1𝔼\brk[s]xi,1𝖳PKi1xi,1i2\displaystyle\leq\mathbb{E}\brk[s]{x_{1,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{1}}x_{1,1}}+\sum_{i=2}^{m_{j}}\mathbb{E}\brk[s]{x_{i,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i,1}\mid\mathcal{H}_{i-1}}-\mathbb{E}\brk[s]{x_{i,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i-1}}x_{i,1}\mid\mathcal{H}_{i-2}}
=𝔼\brk[s]x1,1𝖳PK1x1,1+i=2mj𝔼\brk[s]xi,1𝖳PKixi,1i1𝔼\brk[s]xi,1𝖳PKixi,1i2+𝔼\brk[s]xi,1𝖳(PKiPKi1)xi,1i2.\displaystyle=\mathbb{E}\brk[s]{x_{1,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{1}}x_{1,1}}+\sum_{i=2}^{m_{j}}\mathbb{E}\brk[s]{x_{i,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i,1}\mid\mathcal{H}_{i-1}}-\mathbb{E}\brk[s]{x_{i,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i,1}\mid\mathcal{H}_{i-2}}+\mathbb{E}\brk[s]{x_{i,1}^{\mkern-1.5mu\mathsf{T}}(P_{K_{i}}-P_{K_{i-1}})x_{i,1}\mid\mathcal{H}_{i-2}}.

The first two terms in the sum form a martingale difference sequence with each term bounded by $\nu_{0}$. We thus have that with probability at least $1-\delta/16T$,

\sum_{i=1}^{m_{j}}\mathbb{E}\brk[s]{\Delta C_{i}\mid\mathcal{H}_{i-1}}\leq\nu_{0}+\sqrt{2m_{j}\nu_{0}^{2}\log\frac{16T}{\delta}}+\sum_{i=2}^{m_{j}}\mathbb{E}\brk[s]{x_{i,1}^{\mkern-1.5mu\mathsf{T}}(P_{K_{i}}-P_{K_{i-1}})x_{i,1}\mid\mathcal{H}_{i-2}}.

Notice that the summands in the remaining term fit the setting of Lemma 10 and thus

i=2mj𝔼\brk[s]xi,1𝖳(PKiPKi1)xi,1i2\displaystyle\sum_{i=2}^{m_{j}}\mathbb{E}\brk[s]{x_{i,1}^{\mkern-1.5mu\mathsf{T}}(P_{K_{i}}-P_{K_{i-1}})x_{i,1}\mid\mathcal{H}_{i-2}} 256dxνψ2κ10α0rj2mj+i=1mj32dxψκ9(\normxi,12+κ2σ2)rje2γτ\displaystyle\leq\frac{256d_{x}\nu\psi^{2}\kappa^{10}}{\alpha_{0}}r_{j}^{2}m_{j}+\sum_{i=1}^{m_{j}}32d_{x}\psi\kappa^{9}(\norm{x_{i,1}}^{2}+\kappa^{2}\sigma^{2})r_{j}e^{-2\gamma\tau}
256dxνψ2κ10α0rj2mj+25dxψκ15W2rjmjT2\displaystyle\leq\frac{256d_{x}\nu\psi^{2}\kappa^{10}}{\alpha_{0}}r_{j}^{2}m_{j}+\frac{25d_{x}\psi\kappa^{15}W^{2}r_{j}m_{j}}{T^{2}}
257dxνψ2κ10α0rj2mj,\displaystyle\leq\frac{257d_{x}\nu\psi^{2}\kappa^{10}}{\alpha_{0}}r_{j}^{2}m_{j},

where the second transition plugged in $\tau$ and used Lemma 4 to bound $\norm{x_{i,1}}$, and the third transition used the fact that $T^{-2}\leq m_{j}^{-2}\leq r_{j}/m_{0}$. Plugging in the value of $\nu_{0}$ and using a union bound, we conclude that with probability at least $1-\delta/4T$,

\sum_{i=1}^{m_{j}}\Delta C_{i}\leq\frac{144\nu\kappa^{8}W^{2}}{\sigma^{2}}\tau\sqrt{m_{j}\log\frac{16T}{\delta}}+\frac{257d_{x}\nu\psi^{2}\kappa^{10}}{\alpha_{0}}r_{j}^{2}m_{j},

as desired.

Proof (of Lemma 11).

By our assumption that $J\brk*{K_{j}}\leq\nu/2$ we have that $J\brk*{K_{i}}\leq\nu$ and thus $P_{K_{i}}$ is well defined. Now, recall that $x_{i,s+1}=\brk{A_{\star}+B_{\star}K_{i}}x_{i,s}+w_{i,s}$ and $J\brk*{K_{i}}=\mathbb{E}_{w_{i,s}}\brk[s]{w_{i,s}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}w_{i,s}}$ where $P_{K_{i}}$ satisfies Eq. 2 with $K=K_{i}$. Then we have that

𝔼wi,s\brk[s]xi,s+1𝖳PKixi,s+1\displaystyle\mathbb{E}_{w_{i,s}}\brk[s]{x_{i,s+1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i,s+1}} =𝔼wi,s\brk[s]((A+BKi)xi,s+wi,s)𝖳PKi((A+BKi)xi,s+wi,s)\displaystyle=\mathbb{E}_{w_{i,s}}\brk[s]{((A_{\star}+B_{\star}K_{i})x_{i,s}+w_{i,s})^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}((A_{\star}+B_{\star}K_{i})x_{i,s}+w_{i,s})}
=((A+BKi)xi,s)𝖳PKt((A+BKi)xi,s)+𝔼wi,s\brkwi,s𝖳PKiwi,s\displaystyle=((A_{\star}+B_{\star}K_{i})x_{i,s})^{\mkern-1.5mu\mathsf{T}}P_{K_{t}}((A_{\star}+B_{\star}K_{i})x_{i,s})+\mathbb{E}_{w_{i,s}}\brk{w_{i,s}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}w_{i,s}}
=((A+BKi)xi,s)𝖳PKt((A+BKi)xi,s)+J\brkKi.\displaystyle=((A_{\star}+B_{\star}K_{i})x_{i,s})^{\mkern-1.5mu\mathsf{T}}P_{K_{t}}((A_{\star}+B_{\star}K_{i})x_{i,s})+J\brk*{K_{i}}.

Now, multiplying Eq. 2 by $x_{i,s}$ from both sides we get that

xi,s𝖳PKixi,s\displaystyle x_{i,s}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i,s} =xi,s𝖳\brkQ+Ki𝖳RKtxi,s+((A+BKi)xi,s)𝖳PKi((A+BKi)xi,s)\displaystyle=x_{i,s}^{\mkern-1.5mu\mathsf{T}}\brk{Q+K_{i}^{\mkern-1.5mu\mathsf{T}}RK_{t}}x_{i,s}+((A_{\star}+B_{\star}K_{i})x_{i,s})^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}((A_{\star}+B_{\star}K_{i})x_{i,s})
=ci,s+𝔼wi,s\brk[s]xi,s+1𝖳PKixi,s+1J\brkKi,\displaystyle=c_{i,s}+\mathbb{E}_{w_{i,s}}\brk[s]{x_{i,s+1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i,s+1}}-J\brk*{K_{i}},

where the second transition plugged in the previous equality. Changing sides concludes the first part of the proof. For the second part, notice that taking expectation with respect to $w_{i,s}$ is equivalent to conditional expectation with respect to all past epochs and $w_{1,1},\ldots,w_{i,s-1},K_{1},\ldots,K_{i}$ of the current epoch. Since for all $1\leq s\leq\tau$ this contains $\mathcal{H}_{i-1}$, we use the law of total expectation to get that

𝔼\brk[s]ci,sJ\brkKii1\displaystyle\mathbb{E}\brk[s]{c_{i,s}-J\brk*{K_{i}}\mid\mathcal{H}_{i-1}} =𝔼\brk[s]xi,s𝖳PKixi,si1𝔼\brk[s]𝔼wi,s\brk[s]xi,s+1𝖳PKixi,s+1i1\displaystyle=\mathbb{E}\brk[s]{x_{i,s}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i,s}\mid\mathcal{H}_{i-1}}-\mathbb{E}\brk[s]{\mathbb{E}_{w_{i,s}}\brk[s]{x_{i,s+1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i,s+1}}\mid\mathcal{H}_{i-1}}
=𝔼\brk[s]xi,s𝖳PKixi,si1𝔼\brk[s]xi,s+1𝖳PKixi,s+1i1.\displaystyle=\mathbb{E}\brk[s]{x_{i,s}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i,s}\mid\mathcal{H}_{i-1}}-\mathbb{E}\brk[s]{x_{i,s+1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i,s+1}\mid\mathcal{H}_{i-1}}.

Summing over $s$, noticing that the sum is telescopic, and that time $(i,\tau+1)$ is in fact the start of the next sub-epoch, i.e., $(i+1,1)$, concludes the second part of the proof. Finally, we sum over $i$ to get that

i=1mj𝔼\brk[s]ΔCii1=i=1mj𝔼\brk[s]xi,1𝖳PKixi,1xi+1,1𝖳PKixi+1,1i1\displaystyle\sum_{i=1}^{m_{j}}\mathbb{E}\brk[s]{\Delta C_{i}\mid\mathcal{H}_{i-1}}=\sum_{i=1}^{m_{j}}\mathbb{E}\brk[s]*{x_{i,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i,1}-x_{i+1,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i+1,1}\mid\mathcal{H}_{i-1}}
=𝔼\brk[s]x1,1𝖳PK1x1,1𝔼\brk[s]xmj+1,1𝖳PKmjxmj+1,1mj1+i=2mj𝔼\brk[s]xi,1𝖳PKixi,1i1𝔼\brk[s]xi,1𝖳PKi1xi,1i2\displaystyle=\mathbb{E}\brk[s]{x_{1,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{1}}x_{1,1}}-\mathbb{E}\brk[s]{x_{m_{j}+1,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{m_{j}}}x_{m_{j}+1,1}\mid\mathcal{H}_{m_{j}-1}}+\sum_{i=2}^{m_{j}}\mathbb{E}\brk[s]{x_{i,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i,1}\mid\mathcal{H}_{i-1}}-\mathbb{E}\brk[s]{x_{i,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i-1}}x_{i,1}\mid\mathcal{H}_{i-2}}
𝔼\brk[s]x1,1𝖳PK1x1,1+i=2mj𝔼\brk[s]xi,1𝖳PKixi,1i1𝔼\brk[s]xi,1𝖳PKi1xi,1i2,\displaystyle\leq\mathbb{E}\brk[s]{x_{1,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{1}}x_{1,1}}+\sum_{i=2}^{m_{j}}\mathbb{E}\brk[s]{x_{i,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i}}x_{i,1}\mid\mathcal{H}_{i-1}}-\mathbb{E}\brk[s]{x_{i,1}^{\mkern-1.5mu\mathsf{T}}P_{K_{i-1}}x_{i,1}\mid\mathcal{H}_{i-2}},

concluding the third part of the proof.

5.2 Proof of Lemma 7

Proof (of Lemma 7).

By Lemma 6, the event $J\brk*{K_{j}}\leq\nu/2$ occurs with probability at least $1-\delta/8T^{2}$. Similarly to Lemmas 8 and 9, we implicitly assume that this event holds, which does not break i.i.d. assumptions inside the epoch and implies that $K_{j,i}\in\mathcal{K}$ for all $1\leq i\leq m_{j}$. Now, notice that $\mathbb{E}\brk[s]{K_{j,i}\mid K_{j}}=K_{j}$. Since $K_{j}\in\mathcal{K}$ and $r_{j}\leq D_{0}$, we can invoke the local smoothness of $J\brk*{\cdot}$ (see Lemma 5) to get that

𝔼\brk[s]J\brkKj,iKj\displaystyle\mathbb{E}\brk[s]{J\brk*{K_{j,i}}\mid K_{j}} J\brkKj+J\brkKj𝖳𝔼\brk[s]Kj,iKjKj+12β𝔼\brk[s]\normKj,iKj2Kj\displaystyle\leq J\brk*{K_{j}}+\nabla J\brk*{K_{j}}^{\mkern-1.5mu\mathsf{T}}\mathbb{E}\brk[s]{K_{j,i}-K_{j}\mid K_{j}}+\frac{1}{2}\beta\mathbb{E}\brk[s]{\norm{K_{j,i}-K_{j}}^{2}\mid K_{j}}
=J\brkKj+12βrj2.\displaystyle=J\brk*{K_{j}}+\frac{1}{2}\beta r_{j}^{2}.

We thus have that

\sum_{i=1}^{m_{j}}J\brk*{K_{j,i}}-J\brk*{K_{j}}\leq\frac{1}{2}\beta r_{j}^{2}m_{j}+\sum_{i=1}^{m_{j}}J\brk*{K_{j,i}}-\mathbb{E}\brk[s]{J\brk*{K_{j,i}}\mid K_{j}}.

The remaining term is a sum of zero-mean i.i.d. random variables that are bounded by $\nu$. We use Hoeffding's inequality and a union bound to get that with probability at least $1-\delta/4T$

i=1mjJ\brkKj,iJ\brkKj12βrj2mj+ν12mjlog8Tδ,\displaystyle\sum_{i=1}^{m_{j}}J\brk*{K_{j,i}}-J\brk*{K_{j}}\leq\frac{1}{2}\beta r_{j}^{2}m_{j}+\nu\sqrt{\frac{1}{2}m_{j}\log\frac{8T}{\delta}},

and plugging in the value of β\beta from Lemma 5 concludes the proof.
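
The key step above is generic: if a function is β-smooth and the perturbation around the current point is zero-mean with norm r, then the expected value at the perturbed point exceeds the value at the center by at most βr²/2. A small sketch of this fact on a hypothetical quadratic (not the paper's objective J) follows.

    import numpy as np

    rng = np.random.default_rng(1)
    d = 5
    H = np.diag(np.linspace(0.5, 2.0, d))      # hypothetical quadratic f(x) = x' H x / 2
    beta = np.max(np.diag(H))                  # its smoothness (gradient Lipschitz) constant
    f = lambda x: 0.5 * x @ H @ x

    x, r = rng.standard_normal(d), 0.1
    B = rng.standard_normal((50000, d))
    B /= np.linalg.norm(B, axis=1, keepdims=True)   # zero-mean directions on the unit sphere

    gap = np.mean([f(x + r * b) for b in B]) - f(x)
    print(gap, 0.5 * beta * r ** 2)            # empirical E[f(x + r B)] - f(x) vs. beta r^2 / 2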

Acknowledgements

We thank Nadav Merlis for numerous helpful discussions. This work was partially supported by the Israeli Science Foundation (ISF) grant 2549/19, by the Len Blavatnik and the Blavatnik Family foundation, and by the Yandex Initiative in Machine Learning.

References

  • Abbasi-Yadkori and Szepesvári [2011] Y. Abbasi-Yadkori and C. Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
  • Abbasi-Yadkori et al. [2019] Y. Abbasi-Yadkori, N. Lazic, and C. Szepesvári. Model-free linear quadratic control via reduction to expert prediction. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3108–3117. PMLR, 2019.
  • Agarwal et al. [2019] N. Agarwal, E. Hazan, and K. Singh. Logarithmic regret for online control. In Advances in Neural Information Processing Systems, pages 10175–10184, 2019.
  • Bertsekas [1995] D. P. Bertsekas. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA, 1995.
  • Cassel and Koren [2020] A. Cassel and T. Koren. Bandit linear control. Advances in Neural Information Processing Systems, 33, 2020.
  • Cassel et al. [2020] A. Cassel, A. Cohen, and T. Koren. Logarithmic regret for learning linear quadratic regulators efficiently. In International Conference on Machine Learning, pages 1328–1337. PMLR, 2020.
  • Chen and Hazan [2020] X. Chen and E. Hazan. Black-box control for linear dynamical systems. arXiv preprint arXiv:2007.06650, 2020.
  • Cohen et al. [2018] A. Cohen, A. Hassidim, T. Koren, N. Lazic, Y. Mansour, and K. Talwar. Online linear quadratic control. In International Conference on Machine Learning, pages 1029–1038, 2018.
  • Cohen et al. [2019] A. Cohen, T. Koren, and Y. Mansour. Learning linear-quadratic regulators efficiently with only T\sqrt{T} regret. In International Conference on Machine Learning, pages 1300–1309, 2019.
  • Fazel et al. [2018] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In Proceedings of the 35th International Conference on Machine Learning, volume 80, 2018.
  • Flaxman et al. [2005] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394, 2005.
  • Haarnoja et al. [2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
  • Hambly et al. [2020] B. M. Hambly, R. Xu, and H. Yang. Policy gradient methods for the noisy linear quadratic regulator over a finite horizon. Available at SSRN, 2020.
  • Hanson and Wright [1971] D. L. Hanson and F. T. Wright. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079–1083, 1971.
  • Hayes [2005] T. P. Hayes. A large-deviation inequality for vector-valued martingales. Combinatorics, Probability and Computing, 2005.
  • Hsu et al. [2012] D. Hsu, S. Kakade, T. Zhang, et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.
  • Jin et al. [2020] Z. Jin, J. M. Schmitt, and Z. Wen. On the analysis of model-free methods for the linear quadratic regulator. arXiv preprint arXiv:2007.03861, 2020.
  • Krauth et al. [2019] K. Krauth, S. Tu, and B. Recht. Finite-time analysis of approximate policy iteration for the linear quadratic regulator. In Advances in Neural Information Processing Systems, volume 32, 2019.
  • Lillicrap et al. [2015] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Malik et al. [2019] D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. Bartlett, and M. Wainwright. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2916–2925. PMLR, 2019.
  • Mania et al. [2019] H. Mania, S. Tu, and B. Recht. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, volume 32, pages 10154–10164, 2019.
  • Mohammadi et al. [2020] H. Mohammadi, M. R. Jovanovic, and M. Soltanolkotabi. Learning the model-free linear quadratic regulator via random search. In Learning for Dynamics and Control, pages 531–539. PMLR, 2020.
  • Nesterov [2003] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
  • Silver et al. [2014] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In International conference on machine learning, pages 387–395. PMLR, 2014.
  • Simchowitz and Foster [2020] M. Simchowitz and D. Foster. Naive exploration is optimal for online LQR. In International Conference on Machine Learning, pages 8937–8948. PMLR, 2020.
  • Sutton et al. [1999] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pages 1057–1063, 1999.
  • Tu and Recht [2019] S. Tu and B. Recht. The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. In Conference on Learning Theory, pages 3036–3083. PMLR, 2019.
  • Wright [1973] F. T. Wright. A bound on tail probabilities for quadratic forms in independent random variables whose distributions are not necessarily symmetric. The Annals of Probability, pages 1068–1070, 1973.
  • Yaghmaie and Gustafsson [2019] F. A. Yaghmaie and F. Gustafsson. Using reinforcement learning for model-free linear quadratic control with process and measurement noises. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 6510–6517. IEEE, 2019.
  • Yang et al. [2019] Z. Yang, Y. Chen, M. Hong, and Z. Wang. Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. In Advances in Neural Information Processing Systems, volume 32, 2019.

Appendix A Reducing Gaussian Noise to Bounded Noise

In this section we relax the bounded noise assumption, \normwtW\norm{w_{t}}\leq W, and replace it with the following tail assumption. For δ>0,T1\delta>0,T\geq 1, suppose there exists SdxS\subseteq\mathbb{R}^{d_{x}} such that:

  1. 1.

    \brkwtS1δ/T\mathbb{P}\brk*{w_{t}\in S}\geq 1-\delta/T for all 1tT1\leq t\leq T;

  2. 2.

    𝔼\brk[s]wt𝟙\brk[c]wtS=0.\mathbb{E}\brk[s]{w_{t}\mathds{1}{\brk[c]*{w_{t}\in S}}}=0.

The first assumption is a standard implication of any tail assumption. The second assumption implies that we can crop the noise while keeping it zero-mean; while not entirely trivial, this can be guaranteed for any continuous noise distribution. We note that SS is a theoretical construct and is not a direct input to Algorithm 1. Indirectly, we use SS to calculate the parameters

W~=maxwS\normw,σ~2=min\normx=1𝔼\brk[s](wt𝖳x)2δ𝔼\brk[s](wt𝖳x)4/T,\displaystyle\tilde{W}=\max_{w\in S}\norm{w},\quad\tilde{\sigma}^{2}=\min_{\norm{x}=1}\mathbb{E}\brk[s]{(w_{t}^{\mkern-1.5mu\mathsf{T}}x)^{2}}-\sqrt{\delta\mathbb{E}\brk[s]{(w_{t}^{\mkern-1.5mu\mathsf{T}}x)^{4}}/T},

which will serve as replacements for W,σW,\sigma in our bounded noise formulation. In practice, our results hold if for the chosen parameters δ,W~,σ~\delta,\tilde{W},\tilde{\sigma}, there exists a set SS satisfying the above. Our main findings for unbounded noise are summarized in the following meta-result.

Theorem 3.

Suppose δ(0,1)\delta\in(0,1) is such that σ~>0\tilde{\sigma}>0. If we run Algorithm 1 with the parameters as in Theorem 1 and W,σW,\sigma that satisfy WW~W\geq\tilde{W} and 0<σσ~0<\sigma\leq\tilde{\sigma}, then the regret bound of Theorem 1 holds with probability at least 12δ1-2\delta.

Proof.

Consider the LQR problem where the noise terms wtw_{t} are replaced with w~t=wt𝟙\brk[c]wtS,\tilde{w}_{t}=w_{t}\mathds{1}{\brk[c]*{w_{t}\in S}}, and let c~t,J~\brk\tilde{c}_{t},\tilde{J}\brk*{\cdot} be the corresponding instantaneous and infinite horizon costs. Notice that by our assumptions, w~t\tilde{w}_{t} are indeed zero-mean, i.i.d, and satisfy \normw~tW~W\norm{\tilde{w}_{t}}\leq\tilde{W}\leq W and

min\normx=1𝔼\brk[s](w~t𝖳x)2\displaystyle\min_{\norm{x}=1}\mathbb{E}\brk[s]{(\tilde{w}_{t}^{\mkern-1.5mu\mathsf{T}}x)^{2}} =min\normx=1𝔼\brk[s](wt𝖳x)2𝔼\brk[s]𝟙\brk[c]wtS(wt𝖳x)2\displaystyle=\min_{\norm{x}=1}\mathbb{E}\brk[s]{(w_{t}^{\mkern-1.5mu\mathsf{T}}x)^{2}}-\mathbb{E}\brk[s]{\mathds{1}{\brk[c]*{w_{t}\notin S}}(w_{t}^{\mkern-1.5mu\mathsf{T}}x)^{2}}
min\normx=1𝔼\brk[s](wt𝖳x)2\brkwtS𝔼\brk[s](wt𝖳x)4\displaystyle\geq\min_{\norm{x}=1}\mathbb{E}\brk[s]{(w_{t}^{\mkern-1.5mu\mathsf{T}}x)^{2}}-\sqrt{\mathbb{P}\brk*{w_{t}\notin S}\mathbb{E}\brk[s]{(w_{t}^{\mkern-1.5mu\mathsf{T}}x)^{4}}}
min\normx=1𝔼\brk[s](wt𝖳x)2δ𝔼\brk[s](wt𝖳x)4/T=σ~2σ2>0,\displaystyle\geq\min_{\norm{x}=1}\mathbb{E}\brk[s]{(w_{t}^{\mkern-1.5mu\mathsf{T}}x)^{2}}-\sqrt{\delta\mathbb{E}\brk[s]{(w_{t}^{\mkern-1.5mu\mathsf{T}}x)^{4}}/T}=\tilde{\sigma}^{2}\geq\sigma^{2}>0,

where the second transition used the Cauchy–Schwarz inequality. We thus have that t=1T(c~tJ~)\sum_{t=1}^{T}(\tilde{c}_{t}-\tilde{J}_{\star}) is bounded as in Theorem 1 with probability at least 1δ1-\delta. Next, since 𝔼wtwt𝖳𝔼w~tw~t𝖳\mathbb{E}w_{t}w_{t}^{\mkern-1.5mu\mathsf{T}}\succeq\mathbb{E}\tilde{w}_{t}\tilde{w}_{t}^{\mkern-1.5mu\mathsf{T}}, we have that J~\brk\tilde{J}\brk*{\cdot} is optimistic with respect to J\brkJ\brk*{\cdot}, i.e., J~\brkKJ\brkK\tilde{J}\brk*{K}\leq J\brk*{K} for all KK, which implies that J~J\tilde{J}_{\star}\leq J_{\star}. Finally, using a union bound on the tail assumption, we have that wt=w~tw_{t}=\tilde{w}_{t} for all 1tT1\leq t\leq T with probability at least 1δ1-\delta. On this event, Algorithm 1 is not aware that the noise is cropped and we thus have that ct=c~tc_{t}=\tilde{c}_{t} for all 1tT1\leq t\leq T. We conclude that with probability at least 1δ1-\delta

RT=t=1T(ctJ)t=1T(c~tJ~),\displaystyle R_{T}=\sum_{t=1}^{T}(c_{t}-J_{\star})\leq\sum_{t=1}^{T}(\tilde{c}_{t}-\tilde{J}_{\star}),

and using another union bound concludes the proof.
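
As an illustration of how the replacement parameters W~ and σ~² might be computed in practice, the following sketch estimates them by Monte Carlo for a hypothetical noise distribution, using a norm-ball cropping set. The sampler sample_w, the Laplace distribution, and all numerical values are illustrative assumptions and are not part of Algorithm 1.

    import numpy as np

    rng = np.random.default_rng(2)
    d_x, T, delta = 3, 100, 0.05
    sample_w = lambda n: rng.laplace(scale=0.5, size=(n, d_x))   # stand-in symmetric noise

    w = sample_w(50000)
    # Candidate set S = {w : ||w|| <= W_tilde}: take W_tilde as an empirical
    # (1 - delta/T)-quantile of ||w_t||, so that P(w_t in S) >= 1 - delta/T roughly holds;
    # zero mean after cropping holds here by symmetry of the distribution.
    W_tilde = np.quantile(np.linalg.norm(w, axis=1), 1 - delta / T)

    # sigma_tilde^2 = min over unit x of E[(w'x)^2] - sqrt(delta * E[(w'x)^4] / T),
    # approximated by minimizing over a finite set of random unit directions.
    dirs = rng.standard_normal((100, d_x))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj = w @ dirs.T
    sigma2_tilde = np.min(np.mean(proj ** 2, axis=0)
                          - np.sqrt(delta * np.mean(proj ** 4, axis=0) / T))
    print(W_tilde, sigma2_tilde)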

Application to Gaussian noise.

We specialize Theorem 3 to the case where wt𝒩\brk00,Σww_{t}\sim\mathcal{N}\brk 0{0,\Sigma_{w}} are zero-mean Gaussian random vectors with positive definite covariance Σwdx×dx\Sigma_{w}\in\mathbb{R}^{d_{x}\times d_{x}}. The following result demonstrates how to run Algorithm 1 given upper and lower bounds on the covariance eigenvalues.

Proposition 1.

Let δ(0,1/3)\delta\in(0,1/3). Suppose we run Algorithm 1 with parameters as in Theorem 1 and W,σW,\sigma that satisfy

W5dxλmax\brkΣwlogTδ,σ2λmin\brkΣw(13δ/T),\displaystyle W\geq\sqrt{5d_{x}\lambda_{\mathrm{max}}\brk*{\Sigma_{w}}\log\frac{T}{\delta}},\quad\sigma^{2}\leq\lambda_{\mathrm{min}}\brk*{\Sigma_{w}}(1-\sqrt{3\delta/T}),

where λmin\brkΣw,λmax\brkΣw\lambda_{\mathrm{min}}\brk*{\Sigma_{w}},\lambda_{\mathrm{max}}\brk*{\Sigma_{w}} are the minimal and maximal eigenvalues of Σw\Sigma_{w}. Then the regret bound of Theorem 1 holds with probability at least 12δ1-2\delta.

Proof.

We show that S=\brk[c]w\normΣw1/2w5dxlog(T/δ)S=\brk[c]{w\mid\norm{\Sigma_{w}^{-1/2}w}\leq\sqrt{5d_{x}\log(T/\delta)}} satisfies the desired assumptions. First, by Lemma 14 we indeed have that \brkwtS1δ/T.\mathbb{P}\brk*{w_{t}\in S}\geq 1-\delta/T. Next, denote xt=Σw1/2wtx_{t}=\Sigma_{w}^{-1/2}w_{t} and notice that xt𝒩\brk00,Ix_{t}\sim\mathcal{N}\brk 0{0,I}. We thus have that

𝔼\brk[s]wt𝟙\brk[c]wtS=Σw1/2𝔼\brk[s]Σw1/2wt𝟙\brk[c]wtS=Σw1/2𝔼\brk[s]xt𝟙\brk[c]\normxt25dxlogTδ=0,\displaystyle\mathbb{E}\brk[s]{w_{t}\mathds{1}{\brk[c]*{w_{t}\in S}}}=\Sigma_{w}^{1/2}\mathbb{E}\brk[s]{\Sigma_{w}^{-1/2}w_{t}\mathds{1}{\brk[c]*{w_{t}\in S}}}=\Sigma_{w}^{1/2}\mathbb{E}\brk[s]*{x_{t}\mathds{1}{\brk[c]*{\norm{x_{t}}^{2}\leq 5d_{x}\log\frac{T}{\delta}}}}=0,

where the last transition follows from a symmetry argument. We conclude that SS satisfies our assumptions. We show that WW~W\geq\tilde{W} and 0<σσ~0<\sigma\leq\tilde{\sigma}, which then concludes the proof by invoking Theorem 3. First, we have that for any wSw\in S

\normw\normΣw1/2\normΣw1/2w5dx\normΣwlogTδW,\displaystyle\norm{w}\leq\norm{\Sigma_{w}^{1/2}}\norm{\Sigma_{w}^{-1/2}w}\leq\sqrt{5d_{x}\norm{\Sigma_{w}}\log\frac{T}{\delta}}\leq W,

and so WW~W\geq\tilde{W}. Finally, notice that for any xdxx\in\mathbb{R}^{d_{x}} wt𝖳xw_{t}^{\mkern-1.5mu\mathsf{T}}x is a zero-mean Gaussian random variable. Standard moment identities for Gaussian variables then give that 𝔼\brk[s](wt𝖳x)4=3𝔼\brk[s](wt𝖳x)22,\mathbb{E}\brk[s]{(w_{t}^{\mkern-1.5mu\mathsf{T}}x)^{4}}=3\mathbb{E}\brk[s]{(w_{t}^{\mkern-1.5mu\mathsf{T}}x)^{2}}^{2}, and so we have that

σ~2=min\normx=1𝔼\brk[s](wt𝖳x)2(13δ/T)=λmin\brkΣw(13δ/T)σ2>0,\displaystyle\tilde{\sigma}^{2}=\min_{\norm{x}=1}\mathbb{E}\brk[s]{(w_{t}^{\mkern-1.5mu\mathsf{T}}x)^{2}}(1-\sqrt{3\delta/T})=\lambda_{\mathrm{min}}\brk*{\Sigma_{w}}(1-\sqrt{3\delta/T})\geq\sigma^{2}>0,

where the last inequality holds by our choice of δ<1/3\delta<1/3.
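
For Gaussian noise the parameters are explicit: given the horizon, the confidence level, and (bounds on) the extreme eigenvalues of Σ_w, the inputs W and σ of Proposition 1 can be computed directly. A short sketch with a hypothetical covariance follows; Algorithm 1 itself is not reproduced here.

    import numpy as np

    T, delta = 10000, 0.01                      # horizon and confidence level (delta < 1/3)
    Sigma_w = np.array([[0.2, 0.05], [0.05, 0.1]])   # hypothetical noise covariance
    d_x = Sigma_w.shape[0]
    eigs = np.linalg.eigvalsh(Sigma_w)

    W = np.sqrt(5 * d_x * eigs.max() * np.log(T / delta))   # satisfies W >= W_tilde
    sigma2 = eigs.min() * (1 - np.sqrt(3 * delta / T))      # satisfies sigma^2 <= sigma_tilde^2
    print(W, sigma2)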

Appendix B Technical Lemmas

B.1 Summing the Square Roots of Epoch Lengths

Lemma 12.

Let ρ=1(μη/3)[2/3,1)\rho=1-(\mu\eta/3)\in[2/3,1) and define mj=m0ρ2jm_{j}=m_{0}\rho^{-2j}. Suppose nn is such that j=0n2mjT,\sum_{j=0}^{n-2}m_{j}\leq T, then we have that

j=0n1mj1/222μηT.\displaystyle\sum_{j=0}^{n-1}m_{j}^{1/2}\leq\frac{22}{\mu\eta}\sqrt{T}.
Proof.

Recall that ρ=1(μη/3)\rho=1-(\mu\eta/3), which for our parameter choice indeed satisfies ρ[2/3,1)\rho\in[2/3,1). Now, notice that for x1x\geq 1 we have x1x21x-1\leq\sqrt{x^{2}-1} and so we have that

ρn1ρ11=ρ1+1ρ21ρn1ρ21ρ1+1ρ11ρ2n1ρ2121ρρ2n1ρ21=6μηρ2n1ρ21.\displaystyle\frac{\rho^{-n}-1}{\rho^{-1}-1}=\frac{\rho^{-1}+1}{\sqrt{\rho^{-2}-1}}\frac{\rho^{-n}-1}{\sqrt{\rho^{-2}-1}}\leq\frac{\rho^{-1}+1}{\rho^{-1}-1}\sqrt{\frac{\rho^{-2n}-1}{\rho^{-2}-1}}\leq\frac{2}{1-\rho}\sqrt{\frac{\rho^{-2n}-1}{\rho^{-2}-1}}=\frac{6}{\mu\eta}\sqrt{\frac{\rho^{-2n}-1}{\rho^{-2}-1}}.

Noticing that mjm_{j} is a geometric sequence we get that

j=0n1mj1/2=m01/2ρn1ρ116μηm0ρ2n1ρ21=6μηj=0n1mj6μη(1+ρ2)j=0n2mj22μηT,\displaystyle\sum_{j=0}^{n-1}m_{j}^{1/2}=m_{0}^{1/2}\frac{\rho^{-n}-1}{\rho^{-1}-1}\leq\frac{6}{\mu\eta}\sqrt{m_{0}\frac{\rho^{-2n}-1}{\rho^{-2}-1}}=\frac{6}{\mu\eta}\sqrt{\sum_{j=0}^{n-1}m_{j}}\leq\frac{6}{\mu\eta}\sqrt{(1+\rho^{-2})\sum_{j=0}^{n-2}m_{j}}\leq\frac{22}{\mu\eta}\sqrt{T},

where the last transition also used the fact that ρ29/4\rho^{-2}\leq 9/4, which holds since ρ2/3\rho\geq 2/3.
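
The bound of Lemma 12 is easy to check numerically. The following sketch, with arbitrary illustrative values of μη and m_0, verifies that the partial sums of √(m_j) stay below (22/μη)√T whenever the premise ∑_{j=0}^{n-2} m_j ≤ T holds.

    import numpy as np

    mu_eta, m0 = 0.3, 10.0          # hypothetical values of mu * eta and m_0
    rho = 1 - mu_eta / 3            # lies in [2/3, 1) for mu * eta in (0, 1]
    m = m0 * rho ** (-2 * np.arange(60))
    for n in range(2, 60):
        T = m[: n - 1].sum()        # smallest horizon consistent with the lemma's premise
        lhs = np.sqrt(m[:n]).sum()
        assert lhs <= 22 / mu_eta * np.sqrt(T), (n, lhs)
    print("bound holds for all tested n")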

B.2 Randomized Smoothing

Proof (of Lemma 1).

The first part follows from Stokes’ theorem. See Lemma 1 in [11] for details. For the second part, notice that fr\brkx=𝔼Bf\brkx+rB=𝔼Bf\brkx+rB.\nabla f^{r}\brk*{x}=\nabla\mathbb{E}_{B}f\brk*{x+rB}=\mathbb{E}_{B}\nabla f\brk*{x+rB}. We can thus use Jensen’s inequality to get that

\normfr\brkxf\brkx=\norm𝔼B\brk[s]f\brkx+rBf\brkx𝔼B\normf\brkx+rBf\brkxβr𝔼B\normBβr,\displaystyle\norm{\nabla f^{r}\brk*{x}-\nabla f\brk*{x}}=\norm{\mathbb{E}_{B}\brk[s]{\nabla f\brk*{x+rB}-\nabla f\brk*{x}}}\leq\mathbb{E}_{B}\norm{{\nabla f\brk*{x+rB}-\nabla f\brk*{x}}}\leq\beta r\mathbb{E}_{B}\norm{{B}}\leq\beta r,

where the third transition also used the smoothness (gradient Lipschitz) property of ff, and the last transition used the fact that BB is in the unit ball.
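
As a quick numerical illustration of this bound on a hypothetical β-smooth quadratic (again, not the paper's objective), one can estimate ∇f^r(x) by averaging ∇f over uniform samples in the ball of radius r around x and confirm that it stays within βr of ∇f(x).

    import numpy as np

    rng = np.random.default_rng(3)
    d = 4
    H = np.diag([0.5, 1.0, 1.5, 2.0])            # hypothetical f(x) = x' H x / 2, beta = 2
    grad = lambda x: H @ x
    beta, r = 2.0, 0.2

    x = rng.standard_normal(d)
    # Sample B uniformly from the unit ball: uniform direction times radius U^(1/d).
    dirs = rng.standard_normal((20000, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    B = dirs * (rng.random(20000) ** (1 / d))[:, None]

    grad_smoothed = np.mean([grad(x + r * b) for b in B], axis=0)   # estimate of grad f^r(x)
    print(np.linalg.norm(grad_smoothed - grad(x)), beta * r)        # lhs stays below beta * r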

B.3 Details of Lemma 5

We review how Lemma 5 is derived from Fazel et al. [10]. For the rest of this section, all lemma references are to lemmas in [10].

The first part of the statement is immediate from their Lemma 13. Next, notice that

λmin\brkQλmin\brkΣw4J\brkK\normB(\normA+BK+1)α0σ24νψ2κ=18ψκ3=D0.\displaystyle\frac{\lambda_{\mathrm{min}}\brk*{Q}\lambda_{\mathrm{min}}\brk*{\Sigma_{w}}}{4J\brk*{K}\norm{B_{\star}}(\norm{A_{\star}+B_{\star}K}+1)}\geq\frac{\alpha_{0}\sigma^{2}}{4\nu\psi 2\kappa}=\frac{1}{8\psi\kappa^{3}}=D_{0}.

We thus have that K,KK,K^{\prime} satisfy the condition of Lemma 16 and so we get that

\normΣKΣK4\brkJ\brkKλmin\brkQ2\normB(\normA+BK+1)λmin\brkΣw\normKK4ν2ψ22κα02σ2\normKK=8νψκ3α0\normKK,\displaystyle\norm{\Sigma_{K}-\Sigma_{K^{\prime}}}\leq 4\brk*{\frac{J\brk*{K}}{\lambda_{\mathrm{min}}\brk*{Q}}}^{2}\frac{\norm{B_{\star}}(\norm{A_{\star}+B_{\star}K}+1)}{\lambda_{\mathrm{min}}\brk*{\Sigma_{w}}}\norm{K-K^{\prime}}\leq\frac{4\nu^{2}\psi^{2}2\kappa}{\alpha_{0}^{2}\sigma^{2}}\norm{K-K^{\prime}}=\frac{8\nu\psi\kappa^{3}}{\alpha_{0}}\norm{K-K^{\prime}},

thus concluding the second part. Next, define

𝒯K(X)=t=0(A+BK)tX\brk[s](A+BK)𝖳t,K(X)=(A+BK)X(A+BK)𝖳,\displaystyle\mathcal{T}_{K}(X)=\sum_{t=0}^{\infty}(A_{\star}+B_{\star}K)^{t}X\brk[s]{(A_{\star}+B_{\star}K)^{\mkern-1.5mu\mathsf{T}}}^{t},\qquad\mathcal{F}_{K}(X)=(A_{\star}+B_{\star}K)X(A_{\star}+B_{\star}K)^{\mkern-1.5mu\mathsf{T}},

which are linear operators on symmetric matrices. By Lemma 17 we have that

\norm𝒯KJ\brkKλmin\brkΣwλmin\brkQνσ2α0=κ2,\displaystyle\norm{\mathcal{T}_{K}}\leq\frac{J\brk*{K}}{\lambda_{\mathrm{min}}\brk*{\Sigma_{w}}\lambda_{\mathrm{min}}\brk*{Q}}\leq\frac{\nu}{\sigma^{2}\alpha_{0}}=\kappa^{2},

and by Lemma 19 we have that

\normKK\displaystyle\norm{\mathcal{F}_{K}-\mathcal{F}_{K^{\prime}}} 2\normA+BK\normB\normKK+\normB2\normKK2\displaystyle\leq 2\norm{A_{\star}+B_{\star}K}\norm{B_{\star}}\norm{K-K^{\prime}}+\norm{B_{\star}}^{2}\norm{K-K^{\prime}}^{2}
\brk[s]2κψ+ψ28ψκ3\normKK\displaystyle\leq\brk[s]*{2\kappa\psi+\frac{\psi^{2}}{8\psi\kappa^{3}}}\norm{K-K^{\prime}}
3ψκ\normKK.\displaystyle\leq 3\psi\kappa\norm{K-K^{\prime}}.

Now, continuing from the middle of the proof of Lemma 27 we get that

\normPKPK\displaystyle\norm{P_{K}-P_{K^{\prime}}} 2\norm𝒯K2\normKK\normQ+K𝖳RK+\norm𝒯K\normK𝖳RKK𝖳RK\displaystyle\leq 2\norm{\mathcal{T}_{K}}^{2}\norm{\mathcal{F}_{K}-\mathcal{F}_{K^{\prime}}}\norm{Q+{K^{\prime}}^{\mkern-1.5mu\mathsf{T}}RK^{\prime}}+\norm{\mathcal{T}_{K}}\norm{K^{\mkern-1.5mu\mathsf{T}}RK-{K^{\prime}}^{\mkern-1.5mu\mathsf{T}}RK^{\prime}}
\brk[s]6ψκ5(1+\normK2)+κ2(\normK+\normK)\normKK\displaystyle\leq\brk[s]*{6\psi\kappa^{5}(1+\norm{K^{\prime}}^{2})+\kappa^{2}(\norm{K}+\norm{K^{\prime}})}\norm{K-K^{\prime}}
ψκ5\brk[s]6+6κ2+12D0κ+6D02+2+8D02\normKK\displaystyle\leq\psi\kappa^{5}\brk[s]*{6+6\kappa^{2}+12D_{0}\kappa+6D_{0}^{2}+2+8D_{0}^{2}}\norm{K-K^{\prime}}
16ψκ7\normKK,\displaystyle\leq 16\psi\kappa^{7}\norm{K-K^{\prime}},

where the last step used the fact that D01/8D_{0}\leq 1/8. Next, notice that

Tr\brkΣwTr\brkPK0Σw/α0J\brkK0/α0ν/4α0,\displaystyle\mathrm{Tr}\brk*{\Sigma_{w}}\leq\mathrm{Tr}\brk*{P_{K_{0}}\Sigma_{w}}/\alpha_{0}\leq J\brk*{K_{0}}/\alpha_{0}\leq\nu/4\alpha_{0},

and thus the fourth property (Lipschitz) follows as

\absJ\brkKJ\brkK=\absTr\brk(PKPK)Σw\normPKPKTr\brkΣw4ψνκ7α0\normKKF.\displaystyle\abs{J\brk*{K}-J\brk*{K^{\prime}}}=\abs{\mathrm{Tr}\brk*{(P_{K}-P_{K^{\prime}})\Sigma_{w}}}\leq\norm{P_{K}-P_{K^{\prime}}}\mathrm{Tr}\brk*{\Sigma_{w}}\leq\frac{4\psi\nu\kappa^{7}}{\alpha_{0}}\norm{K-K^{\prime}}_{F}.

Next, the fifth statement (Smoothness) follows the ideas of Lemma 28. Concretely, recall that J\brkK=2EKΣK\nabla J\brk*{K}=2E_{K}\Sigma_{K} where EK=RK+B𝖳PK(A+BK)E_{K}=RK+B_{\star}^{\mkern-1.5mu\mathsf{T}}P_{K}(A_{\star}+B_{\star}K). Notice that

\normΣK\normΣKΣK+\normΣK2ν/α0,\displaystyle\norm{\Sigma_{K^{\prime}}}\leq\norm{\Sigma_{K^{\prime}}-\Sigma_{K}}+\norm{\Sigma_{K}}\leq 2\nu/\alpha_{0},
\normEK\normR\normK+\normB\normPK\normA+BKκ+ψκν/σ22ψκ3,\displaystyle\norm{E_{K}}\leq\norm{R}\norm{K}+\norm{B_{\star}}\norm{P_{K}}\norm{A_{\star}+B_{\star}K}\leq\kappa+\psi\kappa\nu/\sigma^{2}\leq 2\psi\kappa^{3},

and thus we have that

\normJ\brkKJ\brkKFdx\normJ\brkKJ\brkK\displaystyle\norm{\nabla J\brk*{K}-\nabla J\brk*{K^{\prime}}}_{F}\leq\sqrt{d_{x}}\norm{\nabla J\brk*{K}-\nabla J\brk*{K^{\prime}}} =2dx\norm(EKEK)ΣK+EK(ΣKΣK)\displaystyle=2\sqrt{d_{x}}\norm{(E_{K}-E_{K^{\prime}})\Sigma_{K^{\prime}}+E_{K}(\Sigma_{K}-\Sigma_{K^{\prime}})}
2dx\brk[s]\normEKEK\normΣK+\normEK\normΣKΣK\displaystyle\leq 2\sqrt{d_{x}}\brk[s]*{\norm{E_{K}-E_{K^{\prime}}}\norm{\Sigma_{K^{\prime}}}+\norm{E_{K}}\norm{\Sigma_{K}-\Sigma_{K^{\prime}}}}
2dx\brk[s]2να0\normEKEK+16νψ2κ6α0\normKK.\displaystyle\leq 2\sqrt{d_{x}}\brk[s]*{\frac{2\nu}{\alpha_{0}}\norm{E_{K}-E_{K^{\prime}}}+\frac{16\nu\psi^{2}\kappa^{6}}{\alpha_{0}}\norm{K-K^{\prime}}}.

Now, notice that \normPK\normPKPK+\normPK3κ4\norm{P_{K^{\prime}}}\leq\norm{P_{K^{\prime}}-P_{K}}+\norm{P_{K}}\leq 3\kappa^{4} and so

\normEKEK\displaystyle\norm{E_{K}-E_{K^{\prime}}} \normR(KK)+B𝖳PK(A+BK)B𝖳PK(A+BK)\displaystyle\leq\norm{R(K-K^{\prime})+B_{\star}^{\mkern-1.5mu\mathsf{T}}P_{K}(A_{\star}+B_{\star}K)-B_{\star}^{\mkern-1.5mu\mathsf{T}}P_{K^{\prime}}(A_{\star}+B_{\star}K^{\prime})}
\normR\normKK+\normB𝖳(PKPK)(A+BK)+\normB𝖳PKB(KK)\displaystyle\leq\norm{R}\norm{K-K^{\prime}}+\norm{B_{\star}^{\mkern-1.5mu\mathsf{T}}(P_{K}-P_{K^{\prime}})(A_{\star}+B_{\star}K)}+\norm{B_{\star}^{\mkern-1.5mu\mathsf{T}}P_{K^{\prime}}B_{\star}(K-K^{\prime})}
\brk[s]1+16ψ2κ8+3ψ2κ4\normKK\displaystyle\leq\brk[s]{1+16\psi^{2}\kappa^{8}+3\psi^{2}\kappa^{4}}\norm{K-K^{\prime}}
20ψ2κ8\normKK,\displaystyle\leq 20\psi^{2}\kappa^{8}\norm{K-K^{\prime}},

and combining with the above, yields the desired smoothness condition

\normJ\brkKJ\brkKF112dxνψ2κ8α0\normKKF.\displaystyle\norm{\nabla J\brk*{K}-\nabla J\brk*{K^{\prime}}}_{F}\leq\frac{112\sqrt{d_{x}}\nu\psi^{2}\kappa^{8}}{\alpha_{0}}\norm{K-K^{\prime}}_{F}.

Finally, the last statement (PL) is immediate from their Lemma 11, since its gradient domination coefficient satisfies

\normΣKλmin\brkΣw2λmin\brkRν4σ4α02=κ44ν.\displaystyle\frac{\norm{\Sigma_{K_{\star}}}}{\lambda_{\mathrm{min}}\brk*{\Sigma_{w}}^{2}\lambda_{\mathrm{min}}\brk*{R}}\leq\frac{\nu}{4\sigma^{4}\alpha_{0}^{2}}=\frac{\kappa^{4}}{4\nu}.

Appendix C Concentration inequalities

Lemma 13 (Theorem 1.8 of [15]).

Let XX be a very-weak martingale taking values in a real Euclidean space EE such that X0=0X_{0}=0 and for every ii, \normXiXi11\norm{X_{i}-X_{i-1}}\leq 1. Then, for every a>0a>0,

\brk\normXn>a2e2ea22n.\displaystyle\mathbb{P}\brk*{\norm{X_{n}}>a}\leq 2e^{2}e^{-\frac{a^{2}}{2n}}.

Alternatively, for any δ(0,12e2)\delta\in(0,\frac{1}{2}e^{-2}) we have that with probability at least 1δ1-\delta

\normXn2nlog15δ.\displaystyle\norm{X_{n}}\leq\sqrt{2n\log\frac{15}{\delta}}.

The following theorem is a variant of the Hanson-Wright inequality [14, 28] which can be found in Hsu et al. [16].

Theorem 4.

Let x𝒩\brk00,Ix\sim\mathcal{N}\brk 0{0,I} be a Gaussian random vector, let Am×nA\in\mathbb{R}^{m\times n} and define Σ=ATA\Sigma=A^{T}A. Then we have that

\brk\normAx2>Tr\brkΣ+2Tr\brkΣ2z+2\normΣzexp\brkz, for all z0.\mathbb{P}\brk*{\norm{Ax}^{2}>\mathrm{Tr}\brk*{\Sigma}+2\sqrt{\mathrm{Tr}\brk*{\Sigma^{2}}z}+2\norm{\Sigma}z}\leq\exp\brk{-z},\qquad\text{ for all }z\geq 0.

The following lemma is a direct corollary of Theorem 4.

Lemma 14.

Let w𝒩\brk00,Σww\sim\mathcal{N}\brk 0{0,\Sigma_{w}} be a Gaussian random vector in d\mathbb{R}^{d}. For any δ(0,1/e)\delta\in(0,1/e), with probability at least 1δ1-\delta we have that

\normw5Tr\brkΣwlog1δ.\displaystyle\norm{w}\leq\sqrt{5\mathrm{Tr}\brk*{\Sigma_{w}}\log\frac{1}{\delta}}.
Proof.

Consider Theorem 4 with A=Σw1/2A=\Sigma_{w}^{1/2} and thus Σ=Σw\Sigma=\Sigma_{w}. Then for z1z\geq 1 we have that

Tr\brkΣw+2Tr\brkΣw2z+2\normΣwzTr\brkΣwz+2Tr\brkΣwz+2Tr\brkΣwz=5Tr\brkΣwz.\mathrm{Tr}\brk*{\Sigma_{w}}+2\sqrt{\mathrm{Tr}\brk*{\Sigma_{w}^{2}}z}+2\norm{\Sigma_{w}}z\leq\mathrm{Tr}\brk*{\Sigma_{w}}z+2\mathrm{Tr}\brk*{\Sigma_{w}}z+2\mathrm{Tr}\brk*{\Sigma_{w}}z=5\mathrm{Tr}\brk*{\Sigma_{w}}z.

Now, for x𝒩\brk00,Ix\sim\mathcal{N}\brk 0{0,I} we have that w=𝑑Axw\overset{d}{=}Ax (equals in distribution). We thus have that for z1z\geq 1

\brk\normw>5Tr\brkΣwz\brk\normAx2>Tr\brkΣw+2Tr\brkΣw2z+2\normΣwzexp\brkz,\displaystyle\mathbb{P}\brk*{\norm{w}>\sqrt{5\mathrm{Tr}\brk*{\Sigma_{w}}z}}\leq\mathbb{P}\brk*{\norm{Ax}^{2}>{\mathrm{Tr}\brk*{\Sigma_{w}}+2\sqrt{\mathrm{Tr}\brk*{\Sigma_{w}^{2}}z}+2\norm{\Sigma_{w}}z}}\leq\exp\brk{-z},

and taking z=log1δ1z=\log\frac{1}{\delta}\geq 1 (since δ(0,1/e)\delta\in(0,1/e)) concludes the proof.
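
For completeness, here is a brief empirical check of Lemma 14 with an arbitrary covariance: the fraction of Gaussian samples whose norm exceeds √(5 Tr(Σ_w) log(1/δ)) is far below δ.

    import numpy as np

    rng = np.random.default_rng(4)
    delta = 0.05                                     # any delta in (0, 1/e)
    Sigma_w = np.array([[1.0, 0.3], [0.3, 0.5]])     # arbitrary illustrative covariance
    w = rng.multivariate_normal(np.zeros(2), Sigma_w, size=200000)
    threshold = np.sqrt(5 * np.trace(Sigma_w) * np.log(1 / delta))
    print(np.mean(np.linalg.norm(w, axis=1) > threshold), delta)   # empirical tail prob vs. delta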