Online Policy Gradient for Model Free Learning of Linear Quadratic Regulators with $\sqrt{T}$ Regret
Abstract
We consider the task of learning to control a linear dynamical system under fixed quadratic costs, known as the Linear Quadratic Regulator (LQR) problem. While model-free approaches are often favorable in practice, thus far only model-based methods, which rely on costly system identification, have been shown to achieve regret that scales with the optimal $\sqrt{T}$ dependence on the time horizon $T$. We present the first model-free algorithm that achieves similar regret guarantees. Our method relies on an efficient policy gradient scheme, and a novel and tighter analysis of the cost of exploration in policy space in this setting.
1 Introduction
Model-free, policy gradient algorithms have become a staple of Reinforcement Learning (RL), with both practical successes [19, 12] and strong theoretical guarantees in several settings [26, 24]. In this work we study the design and analysis of such algorithms for the adaptive control of Linear Quadratic Regulator (LQR) systems, as seen through the lens of regret minimization [1, 9, 21]. In this continuous state and action reinforcement learning setting, an agent chooses control actions and the system state evolves according to the noisy linear dynamics
$$x_{t+1} = A_\star x_t + B_\star u_t + w_t,$$
where $A_\star$ and $B_\star$ are transition matrices and the $w_t$ are i.i.d. zero-mean noise terms. The cost is a quadratic function of the current state and action, and the regret is measured with respect to the class of linear policies, which are known to be optimal for this setting.
Model-based methods, which perform planning based on a system identification procedure that estimates the transition matrices, have been studied extensively in recent years. This line of work started with Abbasi-Yadkori and Szepesvári [1], which established an $\tilde{O}(\sqrt{T})$ regret guarantee, albeit with a computationally intractable method. More recently, Cohen et al. [9] and Mania et al. [21] complemented this result with computationally efficient methods, and Cassel et al. [6] and Simchowitz and Foster [25] provided lower bounds showing that this rate is generally unavoidable, regardless of whether the algorithm is model free or not. In comparison, the best existing model-free algorithms are the policy iteration procedures of Krauth et al. [18] and Abbasi-Yadkori et al. [2], whose regret guarantees scale as $T^{2/3}$ or worse in the time horizon.
Our main result is an efficient (in fact, linear time per step) policy gradient algorithm that achieves $\tilde{O}(\sqrt{T})$ regret, thus closing the (theoretical) gap between model-based and model-free methods for the LQR model. An interesting feature of our approach is that while the policies output by the algorithm are clearly state dependent, tuning their parameters requires no explicit access to the state. Instead, we rely only on observations of the incurred cost, similarly to bandit models (e.g., [5]).
One of the main challenges of regret minimization in LQRs (and more generally, in reinforcement learning) is that it is generally infeasible to change policies as often as one likes. Roughly, this is due to a burn-in period following a policy change, during which the system converges to a new steady-state distribution, typically incurring an additional cost proportional to the change in steady states, which is in turn proportional to the distance between policies. There are several ways to overcome this impediment. The simplest is to restrict the number of policy updates and explore directly in the action space via artificial noise (see, e.g., [25]). Another approach, by Cohen et al. [9], considers a notion of slowly changing policies; however, this can be very prohibitive for exploration in policy space. Other works (e.g., [3]) consider a policy parameterization that converts the problem into online optimization with memory, which also relies on slowly changing policies. This last method is also inherently model-based and thus not adequate for our purpose.
A key technical contribution that we make is to overcome this challenge by exploring directly in policy space. While the idea itself is not new, we provide a novel and tighter analysis that allows us to use larger perturbations, thus reducing the variance of the resulting gradient estimates. We achieve this by showing that the additional cost depends only quadratically on the exploration radius, which is the crucial ingredient for overcoming the limitation above. The final ingredient of the analysis is a sensitivity analysis of the gradient descent procedure that uses the estimated gradients. Here again, while similar analyses of gradient methods exist, we provide a general result giving conditions under which the optimization error depends only quadratically on the error in the gradients.
Related work.
Policy gradient methods in the context of LQR have seen significant interest in recent years. Notably, Fazel et al. [10] establish their global convergence in the perfect information setting, and give complexity bounds for sample-based methods. Subsequently, Malik et al. [20] improve the sample efficiency, but their result holds only with a fixed probability and thus does not seem applicable for our purposes. Hambly et al. [13] also improve the sample efficiency, but in a finite horizon setting. Mohammadi et al. [22] give sample complexity bounds for the continuous-time variant of LQR. Finally, Tu and Recht [27] show that a model-based method can potentially outperform the sample complexity of policy gradient by factors of the input and output dimensions. While we observe similar performance gaps in our regret bounds, these were not our main focus and may potentially be improved by a more refined analysis. Moving away from policy gradients, Yang et al. [30], Jin et al. [17], and Yaghmaie and Gustafsson [29] analyze the convergence and sample complexity of other model-free methods such as policy iteration and temporal difference (TD) learning, but they do not include any regret guarantees.
2 Preliminaries
2.1 Setup: Learning in LQR
We consider the problem of regret minimization in the LQR model. At each time step $t$, a state $x_t$ is observed and an action $u_t$ is chosen. The system evolves according to
$$x_{t+1} = A_\star x_t + B_\star u_t + w_t,$$
where the state-state matrix $A_\star$ and state-action matrix $B_\star$ form the transition model, and the $w_t$ are bounded, zero-mean, i.i.d. noise terms with a positive definite covariance matrix $\Sigma = \mathbb{E}[w_t w_t^\top]$. Formally, there exist $W, \sigma > 0$ such that $\|w_t\| \le W$ with probability 1 and $\Sigma \succeq \sigma^2 I$.
The bounded noise assumption is made for simplicity of the analysis, and in Appendix A we show how to accommodate Gaussian noise via a simple reduction to this setting. At time $t$, the instantaneous cost is
$$c_t = x_t^\top Q\, x_t + u_t^\top R\, u_t,$$
where $Q$ and $R$ are positive definite. We note that the upper bound on their norms is without loss of generality, since multiplying $Q$ and $R$ by a constant factor only re-scales the regret.
A policy of the learner is a potentially time-dependent mapping from past history to an action to be taken at the current time step. Classic results in linear control establish that, given the system parameters $A_\star$ and $B_\star$, a linear transformation of the current state is an optimal policy for the infinite horizon setting. We thus consider policies of the form $u_t = K x_t$ and define their infinite horizon expected cost,
$$J(K) = \lim_{T \to \infty} \frac{1}{T}\,\mathbb{E}\Bigl[ \sum_{t=1}^{T} c_t \Bigr],$$
where the expectation is taken with respect to the random noise variables $w_t$. Let $K_\star$ be a (unique) optimal policy and $J_\star = J(K_\star)$ denote the optimal infinite horizon expected cost, which are both well defined under mild assumptions (these are valid under standard, very mild stabilizability assumptions, see [4], that hold in our setting). We are interested in minimizing the regret over $T$ decision rounds, defined as
$$R_T = \sum_{t=1}^{T} \bigl( c_t - J_\star \bigr).$$
We focus on the setting where the learner does not have a full a priori description of the transition parameters $A_\star$ and $B_\star$, and has to learn them while controlling the system and minimizing the regret.
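To make the setup concrete, the following simulation sketch estimates the infinite horizon cost $J(K)$ of a fixed linear policy by averaging the quadratic cost along a long rollout of the dynamics. The system matrices, noise scale, and controller below are arbitrary illustrative choices and are not part of the formal development.

```python
# Illustrative only: estimate J(K) for a fixed linear policy u_t = K x_t by rolling out
# x_{t+1} = A x_t + B u_t + w_t and averaging the quadratic cost x^T Q x + u^T R u.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[-1.0, -2.0]])                 # some stabilizing (not optimal) controller

def average_cost(K, T=100_000):
    """Empirical average of the instantaneous cost along one long trajectory."""
    x = np.zeros(2)
    total = 0.0
    for _ in range(T):
        u = K @ x
        total += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + 0.1 * rng.standard_normal(2)   # i.i.d. zero-mean noise
    return total / T

print(f"estimated J(K) ~ {average_cost(K):.4f}")
```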
Throughout, we assume that the learner has knowledge of constants and such that
We also assume that there is a known stable (not necessarily optimal) policy and such that . We note that all of the aforementioned parameters could be easily estimated, at the cost of an additive constant regret term, by means of a warm-up period. However, recovering such an initial stable controller incurs an additive constant that depends exponentially on the problem parameters, as shown by Chen and Hazan [7], Mania et al. [21], and Cohen et al. [9].
Finally, denote the set of all “admissible” controllers. By definition, the initial stable policy and the optimal $K_\star$ both belong to this set. As discussed below, over this set the LQR cost function has certain regularity properties that we will use throughout.
2.2 Smooth Optimization
Fazel et al. [10] show that while the objective $J$ is non-convex, it has properties that make it amenable to standard gradient-based optimization schemes. We summarize these here as they are used in our analysis.
Definition 1 (PL-condition).
A function with global minimum is said to be -PL if it satisfies the Polyak-Lojasiewicz (PL) inequality with constant , given by
Definition 2 (Smoothness).
A function is locally -smooth over if for any and with
Definition 3 (Lipschitz).
A function is locally -Lipschitz over if for any and with
It is well-known that for functions satisfying the above conditions and for a sufficiently small step size $\eta$, the gradient descent update rule
$$x_{t+1} = x_t - \eta\,\nabla f(x_t)$$
converges exponentially fast, i.e., there exists $\rho < 1$ such that $f(x_t) - f(x^\star) \le \rho^t \bigl( f(x_0) - f(x^\star) \bigr)$ (e.g., [23]). This setting has also been investigated in the absence of a perfect gradient oracle. Here we provide a clean result showing that the error in the optimization objective is limited only by the squared error of any gradient estimate.
Finally, we require the notion of a one-point gradient estimate [11]. Let $f : \mathbb{R}^d \to \mathbb{R}$ and define its smoothed version with parameter $r > 0$ as
$$f_r(x) = \mathbb{E}\bigl[ f(x + r B) \bigr], \tag{1}$$
where $B$ is a uniform random vector over the Euclidean unit ball. The following lemma is standard (we include a proof in Appendix B for completeness).
Lemma 1.
If $f$ is $\beta$-locally smooth (with locality radius at least $r$), then:
1. $\nabla f_r(x) = \frac{d}{r}\,\mathbb{E}\bigl[ f(x + r U)\,U \bigr]$, where $d$ is the dimension and $U$ is a uniform random vector over the unit sphere;
2. $\bigl| f_r(x) - f(x) \bigr| \le \frac{\beta r^2}{2}$.
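To illustrate the one-point estimator behind Lemma 1, the following sketch averages the estimate $\frac{d}{r} f(x + rU)\,U$ over independent draws of $U$ and compares it to the true gradient of a simple quadratic test function (for which the smoothed gradient coincides with the exact one). The test function and parameters are arbitrary illustrative choices.

```python
# Illustrative only: the one-point (zeroth-order) estimate (d/r) * f(x + r*U) * U, with U
# uniform on the unit sphere, is an unbiased estimate of the gradient of the smoothed
# function f_r(x) = E[f(x + r*B)], B uniform on the unit ball. It has high variance, so
# we average many samples.
import numpy as np

rng = np.random.default_rng(1)

def one_point_gradient(f, x, r, num_samples=100_000):
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(num_samples):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)               # uniform direction on the unit sphere
        g += (d / r) * f(x + r * u) * u
    return g / num_samples

f = lambda x: x @ x + 3.0 * x[0]             # quadratic test function: grad = 2x + (3, 0, 0)
x0 = np.array([1.0, -2.0, 0.5])
print("true gradient      :", 2 * x0 + np.array([3.0, 0.0, 0.0]))
print("one-point estimate :", one_point_gradient(f, x0, r=0.5))   # close for a quadratic f
```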
2.3 Background on LQR
It is well-known for the LQR problem that the infinite horizon cost of a stabilizing controller $K$ can be written as
$$J(K) = \mathrm{tr}\bigl( P_K \Sigma \bigr) = \mathrm{tr}\bigl( (Q + K^\top R K)\, \Sigma_K \bigr),$$
where $P_K$ and $\Sigma_K$ are the positive definite solutions to
$$P_K = Q + K^\top R K + (A_\star + B_\star K)^\top P_K (A_\star + B_\star K), \tag{2}$$
$$\Sigma_K = \Sigma + (A_\star + B_\star K)\, \Sigma_K\, (A_\star + B_\star K)^\top. \tag{3}$$
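For concreteness, the following sketch evaluates these quantities numerically under one standard convention (the exact normalization of Eqs. (2) and (3) may differ; the system matrices are illustrative): $P_K$ and $\Sigma_K$ are computed from discrete Lyapunov equations, the two expressions for $J(K)$ are checked to agree, and the optimal controller is recovered from the discrete algebraic Riccati equation.

```python
# Illustrative only: closed-form LQR quantities via SciPy, under the standard conventions
# x_{t+1} = A x_t + B u_t + w_t, cost x^T Q x + u^T R u, noise covariance Sigma.
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
Sigma = 0.01 * np.eye(2)

def cost_of(K):
    """J(K) via both forms: tr(P_K Sigma) and tr((Q + K^T R K) Sigma_K)."""
    Acl = A + B @ K
    P_K = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)   # P_K = Q + K^T R K + Acl^T P_K Acl
    S_K = solve_discrete_lyapunov(Acl, Sigma)               # Sigma_K = Sigma + Acl Sigma_K Acl^T
    return np.trace(P_K @ Sigma), np.trace((Q + K.T @ R @ K) @ S_K)

# Optimal controller from the discrete algebraic Riccati equation.
P_star = solve_discrete_are(A, B, Q, R)
K_star = -np.linalg.solve(R + B.T @ P_star @ B, B.T @ P_star @ A)

print("J(K_star) =", cost_of(K_star))                       # the two forms agree
print("J(K)      =", cost_of(np.array([[-1.0, -2.0]])))     # a suboptimal stabilizing K
```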
Another important notion is that of strong stability [8]. This is essentially a quantitative version of classic stability notions in linear control.
Definition 4 (strong stability).
A matrix $M$ is $(\kappa, \gamma)$-strongly stable (for $\kappa \ge 1$ and $0 < \gamma \le 1$) if there exist matrices $H$ and $L$ such that $M = H L H^{-1}$ with $\|L\| \le 1 - \gamma$ and $\|H\|\,\|H^{-1}\| \le \kappa$. A controller $K$ for $(A_\star, B_\star)$ is $(\kappa, \gamma)$-strongly stable if $\|K\| \le \kappa$ and the matrix $A_\star + B_\star K$ is $(\kappa, \gamma)$-strongly stable.
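The following numerical sketch extracts a strong-stability certificate $(\kappa, \gamma)$ for a stable closed-loop matrix via the Lyapunov construction standard in this line of work; the particular constants it produces are one convenient choice and not necessarily those of Lemma 2 below, and the system matrices are illustrative.

```python
# Illustrative only: a strong-stability certificate (kappa, gamma) for A_cl = A + B K.
# Solve P = A_cl^T P A_cl + I (P >= I exists iff rho(A_cl) < 1), set H = P^{-1/2},
# L = P^{1/2} A_cl P^{-1/2}; then A_cl = H L H^{-1} with ||L|| < 1.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, sqrtm

def strong_stability_certificate(A_cl):
    P = solve_discrete_lyapunov(A_cl.T, np.eye(A_cl.shape[0]))
    H_inv = np.real(sqrtm(P))                 # H^{-1} = P^{1/2}
    H = np.linalg.inv(H_inv)                  # H = P^{-1/2}
    L = H_inv @ A_cl @ H                      # similarity transform: A_cl = H L H^{-1}
    kappa = np.linalg.norm(H, 2) * np.linalg.norm(H_inv, 2)
    gamma = 1.0 - np.linalg.norm(L, 2)        # positive whenever A_cl is stable
    return kappa, gamma

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[-1.0, -2.0]])
kappa, gamma = strong_stability_certificate(A + B @ K)
print(f"(kappa, gamma) = ({kappa:.2f}, {gamma:.3f})")
```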
The following lemma, due to Cohen et al. [9], relates the infinite horizon cost of a controller to its strong stability parameters.
Lemma 2 (9, Lemma 18).
Suppose that then is strongly stable with and
The following two lemmas, due to Cohen et al. [8], Cassel et al. [6], show that the state covariance converges exponentially fast, and that the state is bounded as long as controllers are allowed to mix.
Lemma 3 (8, Lemma 3.2).
Suppose we play some fixed controller starting from some initial state; then
Lemma 4 (6, Lemma 39).
Let . If we play each controller for at least rounds before switching to then for all we have that and
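As a small numerical illustration of Lemma 3 (with illustrative matrices, not part of the formal development), the state covariance recursion under a fixed stabilizing controller converges geometrically to its steady-state fixed point:

```python
# Illustrative only: Sigma_{t+1} = (A + B K) Sigma_t (A + B K)^T + Sigma converges
# exponentially fast to the steady-state covariance Sigma_K.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[-1.0, -2.0]])
Sigma = 0.01 * np.eye(2)

Acl = A + B @ K
Sigma_K = solve_discrete_lyapunov(Acl, Sigma)     # fixed point of the recursion

S = np.zeros((2, 2))                              # covariance when starting from x_0 = 0
for t in range(1, 101):
    S = Acl @ S @ Acl.T + Sigma
    if t % 25 == 0:
        print(f"t = {t:3d}   ||Sigma_t - Sigma_K|| = {np.linalg.norm(S - Sigma_K, 2):.2e}")
# The gap shrinks geometrically, at a rate governed by the stability of A + B K.
```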
The following is a summary of results from Fazel et al. [10] that describe the main properties of . See Appendix B for the complete details.
Lemma 5 (10, Lemmas 11, 13, 16, 27 and 28).
Let and with
then we have that
1.
2.
3.
4. $J$ satisfies the local Lipschitz condition (Definition 3) over the set of admissible controllers;
5. $J$ satisfies the local smoothness condition (Definition 2) over the set of admissible controllers;
6. $J$ satisfies the PL condition (Definition 1).
3 Algorithm and Overview of Analysis
We are now ready to present our main algorithm for model-free regret minimization in LQR. The algorithm, given in Algorithm 1, optimizes an underlying controller over epochs of exponentially increasing duration. Each epoch consists of sub-epochs, during each of which a perturbed controller centered at the current underlying controller is drawn and played for a fixed number of rounds. At the end of each epoch, the algorithm uses the costs incurred during the final round of playing each perturbed controller to construct a gradient estimate, which in turn is used to compute the next underlying controller. Interestingly, we make no explicit use of the state observation, which is used only implicitly to calculate the control signal. Furthermore, the algorithm makes only a number of computations per time step that is linear in the problem dimensions.
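The following simplified sketch mirrors the structure just described: epochs of growing length, sub-epochs that each play a perturbed controller and record only its final-round cost, a batched one-point gradient estimate, and a policy gradient step. It is an illustration only; the parameter schedules (sub-epoch length, exploration radii, step size) and the omitted safeguards (e.g., projections keeping the controller admissible) are placeholders rather than the exact choices of Algorithm 1.

```python
# Illustrative only: the epoch / sub-epoch structure of a model-free policy gradient
# scheme for LQR. Only the scalar cost of the final round of each sub-epoch is used.
import numpy as np

rng = np.random.default_rng(2)

def run(A, B, Q, R, K0, epochs=6, m0=50, tau=50, eta=0.05, r0=0.3, noise=0.05):
    d_x, d_u = B.shape
    d = d_u * d_x                               # dimension of the controller K
    x = np.zeros(d_x)
    K = K0.copy()
    for j in range(epochs):
        m, r = m0 * 2 ** j, r0 / 2 ** (j / 4)   # longer epochs, shrinking exploration radius
        grad_est = np.zeros((d_u, d_x))
        for _ in range(m):                      # sub-epochs
            U = rng.standard_normal((d_u, d_x))
            U /= np.linalg.norm(U)              # uniform direction on the unit sphere
            K_pert = K + r * U                  # perturbed controller for this sub-epoch
            for _ in range(tau):                # play it long enough to (roughly) mix
                u = K_pert @ x
                c = x @ Q @ x + u @ R @ u       # only this scalar cost is ever observed
                x = A @ x + B @ u + noise * rng.standard_normal(d_x)
            grad_est += (d / r) * c * U         # one-point estimate from the final-round cost
        K = K - eta * grad_est / m              # policy gradient step on the epoch average
    return K

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
print("final controller:", run(A, B, Q, R, K0=np.array([[-1.0, -2.0]])))
```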
Our main result regarding Algorithm 1 is stated in the following theorem: a high-probability regret guarantee with a polynomial dependence on the problem parameters.
Theorem 1.
With appropriate choices of the algorithm's parameters, for any $\delta \in (0,1)$, with probability at least $1-\delta$ the regret of Algorithm 1 over $T$ rounds satisfies $R_T = \tilde{O}\bigl(\sqrt{T}\bigr)$, where the $\tilde{O}$-notation hides a polynomial dependence on the problem parameters and a polylogarithmic dependence on $T$ and $1/\delta$.
Here we give an overview of the main steps in proving Theorem 1, deferring the details of each step to later sections. Our first step is analyzing the utility of the policies computed at the end of each epoch. We show that the regret of each underlying controller over its epoch, in terms of its long-term (steady-state) cost compared to that of the optimal controller, is controlled by the inverse square root of the epoch length.
Lemma 6 (exploitation).
Under the parameter choices of Theorem 1, for any we have that with probability at least ,
and further that .
The proof of the lemma is based on a careful analysis of gradient descent with inexact gradients and crucially exploits the PL and local-smoothness properties of the loss . More details can be found in Section 4.
The more interesting (and challenging) part of our analysis pertains to controlling the costs associated with exploration, namely, the penalties introduced by the perturbations of the controllers. The direct cost of exploration is clear: instead of playing the controller intended for exploitation, the algorithm actually follows the perturbed controllers and thus incurs the differences in their long-term costs. Our following lemma bounds the accumulation of these penalties over an epoch; importantly, it shows that while the bound scales linearly with the length of the epoch, it has only a quadratic dependence on the exploration radius.
Lemma 7 (direct exploration cost).
Under the parameter choices of Theorem 1, for any we have that with probability at least ,
There are, however, additional indirect costs associated with exploration: within each epoch the algorithm switches frequently between different policies, thereby suffering the indirect costs that stem from their “burn-in” periods. This is precisely what gives rise to the differences between the realized cost and the long-term cost of the played policy, the cumulative effect of which is bounded in the next lemma. Here again, note the quadratic dependence on the exploration radius, which is essential for obtaining our $\sqrt{T}$-regret result.
Lemma 8 (indirect exploration cost).
Under the parameter choices of Theorem 1, for any we have that with probability at least ,
The technical details for Lemmas 7 and 8 are discussed in Section 5. We now have all the main pieces required for proving our main result.
Proof (of Theorem 1).
Taking a union bound, we conclude that Lemmas 8, 7 and 6 hold for all with probability at least . Now, notice that our choice of parameters is such that
Plugging this back into Lemmas 8 and 7 we get that for all ,
We conclude that the regret during epoch is bounded as
where the second step also used the fact that . Finally, a simple calculation (see Lemma 12) shows that
and thus summing over the regret accumulated in each epoch concludes the proof.
4 Optimization Analysis
At its core, Algorithm 1 is a policy gradient method, with each underlying controller being the iterate after the corresponding number of gradient steps. In this section we analyze the sub-optimality gap of the underlying controllers, culminating in the proof of Lemma 6. To achieve this, we first consider a general optimization problem with a corrupted gradient oracle, and show that the optimization rate is limited only by the square of the corruption magnitude. We follow this with an analysis of the LQR gradient estimation, from which the overall optimization cost follows readily.
4.1 Inexact First-Order Optimization
Let $f$ be a function with global minimum $f^\star$. Suppose that $f$ is PL, locally smooth, and locally Lipschitz (in the sense of Definitions 1, 2 and 3) over a suitable sub-level set of $f$. We consider the update rule
$$x_{t+1} = x_t - \eta\, \hat{g}_t, \tag{4}$$
where $\eta > 0$ is a step size, and $\hat{g}_t$ is a corrupted gradient oracle that satisfies
$$\bigl\| \hat{g}_t - \nabla f(x_t) \bigr\| \le \epsilon_t, \tag{5}$$
where $\epsilon_t$ is the magnitude of the corruption at step $t$. Define the effective corruption up to round $t$ as
and notice that if then .
The following result shows that this update rule achieves a linear convergence rate up to an accuracy that depends quadratically on the corruptions.
Theorem 2 (corrupted gradient descent).
Suppose that Then for all ,
and consequently .
Proof.
For ease of notation, denote . Now, suppose that , i.e., . Then we have that
where the second step used our choice of and the third step used the Lipschitz assumption on and the bound on . We conclude that satisfy the conditions for local smoothness and so we have that
where the last transition holds by choice of . Next, using the PL condition and the bound on (see Eq. 5) we get that
and adding to both sides of the equation we get
Now, if then since we have that
(6) |
On the other hand, if then we have that
(7) |
where the last transition holds, again, by our choice of . Combining Eqs. 6 and 7 we conclude that
(8) |
In particular, this implies that and thus . Since we assume that , this completes an induction showing that for all . We can thus unroll Eq. 8 recursively to obtain the final result.
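The following small experiment illustrates the message of Theorem 2 numerically: gradient descent with an $\epsilon$-corrupted gradient oracle on a smooth PL function reaches a suboptimality that scales like $\epsilon^2$ rather than $\epsilon$. The quadratic objective and all parameters are arbitrary illustrative choices.

```python
# Illustrative only: corrupted gradient descent on f(x) = 0.5 x^T H x (a smooth PL
# function with minimum value 0). The corruption has norm at most eps at every step.
import numpy as np

rng = np.random.default_rng(3)
H = np.diag([1.0, 4.0])
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

def corrupted_gd(eps, steps=300, eta=0.2):
    x = np.array([5.0, 5.0])
    for _ in range(steps):
        corruption = rng.standard_normal(2)
        corruption *= eps / max(np.linalg.norm(corruption), 1e-12)   # norm exactly eps
        x = x - eta * (grad(x) + corruption)
    return f(x)

for eps in [0.1, 0.01, 0.001]:
    print(f"eps = {eps:6.3f}   final suboptimality ~ {corrupted_gd(eps):.2e}")
# Shrinking eps by 10x shrinks the final suboptimality by roughly 100x (quadratic scaling).
```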
4.2 Gradient Estimation
The gradient estimate is a batched version of the typical one-point gradient estimator. We bound its error in the next lemma using the following inductive idea: as long as the underlying controllers have remained admissible so far, standard concentration arguments imply that the estimation error is small with high probability, and then Theorem 2 implies that the next underlying controller is admissible as well.
Lemma 9 (Gradient estimation error).
Under the parameter choices of Theorem 1, for any we have that with probability at least ,
Proof (of Lemma 9).
Assume that conditioned on the event for all , the claim holds with probability at least . We show by induction that we can peel-off the conditioning by summing the failure probability of each epoch. Concretely, we show by induction that the claim holds for all with probability at least . Since the number of epochs is less than (in fact logarithmic in ), this will conclude the proof.
The induction base follows immediately by our conditional assumption and the fact that . Now, assume the hypothesis holds up to . We show that the conditions of Theorem 2 are satisfied with up to round , and thus for all . We can then invoke our conditional assumption and a union bound to conclude the induction step.
We verify the conditions of Theorem 2. First, the Lipschitz, smoothness, and PL conditions hold by Lemma 5. Next, notice that by definition , and so by the induction hypothesis for all . Finally, noticing that it is easy to verify the condition on .
It remains to show the conditional claim holds. The event for all essentially implies that the policy gradient scheme did not diverge up to the start of epoch . Importantly, this event is independent of any randomization during epoch and thus will not break any i.i.d. assumptions within the epoch. Moreover, by Lemma 5 and since , this implies that i.e., for all and . For the remainder of the proof, we implicitly assume that this holds, allowing us to invoke Lemmas 3, 4 and 5. For ease of notation, we will not specify this explicitly.
Now, let be the smoothed version of as in Eq. 1. Since we can use Lemma 1 to get that
Next, we decompose the remaining term using the triangle inequality to get that
By Lemma 1, we notice that, conditioned on , the first term is a sum of zero-mean i.i.d random vectors with norm bounded by . We thus invoke Lemma 13 (Vector Azuma) to get that with probability at least
Next, denote and notice that the remaining term is exactly Let be the state during the final round of playing controller , and be the filtration adapted to We use Lemma 3 to get that
where the last step plugged in the value of and the one before that used Lemmas 4 and 5 to bound and . Further using Lemma 4 to bound , we also get that
Since is measurable we can invoke Lemma 13 (Vector Azuma) to get that with probability at least ,
Using a union bound and putting everything together, we conclude that with probability at least ,
where the last steps plugged in the values of , and .
4.3 Proof of Lemma 6
Lemma 6 is a straightforward consequence of the previous results.
5 Exploration Cost Analysis
In this section we demonstrate that exploring near a given initial policy does not incur cost that is linear in the exploration radius (as more straightforward arguments would give), but only quadratic, and we use this crucial observation to prove Lemmas 7 and 8.
We begin with Lemma 8. The main difficulty in the proof is captured by the following basic result, which roughly shows that the expected cost for transitioning between two i.i.d. copies of a given random policy scales with the variance of the latter. This would in turn give the quadratic dependence on the exploration radius we need.
Lemma 10.
Let be fixed. Suppose are i.i.d. random variables such that , and If is the result of playing for rounds starting at , then
Proof.
Notice that the expectation is with respect to both controllers and the noise terms, all of which are jointly independent. We begin by using Lemmas 3 and 5 to get that
where the last step also used the fact that . Now, since do not depend on the noise, we can use the law of total expectation to get that
To bound the remaining term, notice that since are i.i.d, we may change their roles without changing the expectation, i.e.,
we conclude that
where the last step also used Lemma 5.
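The following numerical sketch illustrates the content of Lemma 10 with illustrative matrices and an illustrative perturbation distribution: for two i.i.d. perturbed controllers around a base controller, the expected steady-state transition term, evaluated exactly via Lyapunov equations and symmetrized exactly as in the role-exchange step above, shrinks quadratically in the exploration radius.

```python
# Illustrative only: the symmetrized transition term 0.5*tr((P_{K1}-P_{K2})(S_{K2}-S_{K1}))
# for i.i.d. perturbed controllers K_i = K0 + r*U_i has magnitude O(r^2), not O(r).
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(5)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
Sigma = 0.01 * np.eye(2)
K0 = np.array([[-1.0, -2.0]])

def P_and_cov(K):
    Acl = A + B @ K
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)   # value matrix of K
    S = solve_discrete_lyapunov(Acl, Sigma)               # steady-state covariance under K
    return P, S

def transition_term(r, samples=2000):
    total = 0.0
    for _ in range(samples):
        U1, U2 = rng.standard_normal((2, 1, 2)) / np.sqrt(2)
        P1, S1 = P_and_cov(K0 + r * U1)
        P2, S2 = P_and_cov(K0 + r * U2)
        total += 0.5 * np.trace((P1 - P2) @ (S2 - S1))
    return total / samples

for r in [0.4, 0.2, 0.1]:
    print(f"r = {r:4.2f}   expected transition term ~ {transition_term(r):+.3e}")
# Halving r shrinks the magnitude by roughly a factor of four.
```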
5.1 Proof of Lemma 8
Before proving Lemma 8 we introduce a few simplifying notations. Since the lemma pertains to a single epoch, we omit its notation wherever it is clear from context. For example, will be shortened to and to . In any case, we reserve the index for epochs and for sub-epochs. In this context, we also denote the gap between realized and idealized costs during sub-epoch by
and the filtration adapted to . We note that and are measurable. The following lemma uses Eq. 2 to decompose the cost gap at the various time resolutions. See proof at the end of this section.
Lemma 11.
If the epoch initial controller satisfies then (recall that is the positive definite solution to Eq. 2):
1.
2.
3.
We are now ready to prove the main lemma of this section.
Proof (of Lemma 8).
First, by Lemma 6, the event for all holds with probability at least . As in the proof of Lemma 9, we will implicitly assume that this event holds, which will not break any i.i.d assumptions during epoch and implies that for all . We also use this to invoke Lemmas 4 and 5 to get that for any and we have
Now, recall that is -measurable and thus is a martingale difference sequence. Using the first part of Lemma 11 we also conclude that each term is bounded by . Applying Azuma’s inequality we get that with probability at least
Now, recall from Lemma 11 that
The first two terms in the sum form a martingale difference sequence with each term being bounded by . We thus have that with probability at least ,
Notice that the summands in the remaining term fit the setting of Lemma 10 and thus
where the second transition plugged in and used Lemma 4 to bound , and the third transition used the fact that . Plugging in the value of and using a union bound, we conclude that with probability at least ,
as desired.
Proof (of Lemma 11).
By our assumption that we have that and thus is well defined. Now, recall that and where satisfies Eq. 2 with . Then we have that
Now, multiplying Eq. 2 by from both sides we get that
where the second transition plugged in the previous equality. Changing sides concludes the first part of the proof. For the second part, notice that taking expectation with respect to is equivalent to conditional expectation with respect to all past epochs and of the current epoch. Since for all this contains , we use the law of total expectation to get that
Summing over , noticing that the sum is telescopic, and that time is in fact the start of the next sub-epoch, i.e., , concludes the second part of the proof. Finally, we sum over to get that
concluding the third part of the proof.
5.2 Proof of Lemma 7
Proof (of Lemma 7).
By Lemma 6, the event occurs with probability at least . Similarly to Lemmas 8 and 9, we implicitly assume that this event holds, which does not break i.i.d assumptions inside the epoch and implies that for all . Now, notice that Since and , we can invoke the local smoothness of (see Lemma 5) to get that
We thus have that
The remaining term is a sum of zero-mean i.i.d. random variables that are bounded by . We use Hoeffding’s inequality and a union bound to get that with probability at least
and plugging in the value of from Lemma 5 concludes the proof.
Acknowledgements
We thank Nadav Merlis for numerous helpful discussions. This work was partially supported by the Israeli Science Foundation (ISF) grant 2549/19, by the Len Blavatnik and the Blavatnik Family foundation, and by the Yandex Initiative in Machine Learning.
References
- Abbasi-Yadkori and Szepesvári [2011] Y. Abbasi-Yadkori and C. Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
- Abbasi-Yadkori et al. [2019] Y. Abbasi-Yadkori, N. Lazic, and C. Szepesvári. Model-free linear quadratic control via reduction to expert prediction. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3108–3117. PMLR, 2019.
- Agarwal et al. [2019] N. Agarwal, E. Hazan, and K. Singh. Logarithmic regret for online control. In Advances in Neural Information Processing Systems, pages 10175–10184, 2019.
- Bertsekas [1995] D. P. Bertsekas. Dynamic programming and optimal control, volume 1. Athena scientific Belmont, MA, 1995.
- Cassel and Koren [2020] A. Cassel and T. Koren. Bandit linear control. Advances in Neural Information Processing Systems, 33, 2020.
- Cassel et al. [2020] A. Cassel, A. Cohen, and T. Koren. Logarithmic regret for learning linear quadratic regulators efficiently. In International Conference on Machine Learning, pages 1328–1337. PMLR, 2020.
- Chen and Hazan [2020] X. Chen and E. Hazan. Black-box control for linear dynamical systems. arXiv preprint arXiv:2007.06650, 2020.
- Cohen et al. [2018] A. Cohen, A. Hassidim, T. Koren, N. Lazic, Y. Mansour, and K. Talwar. Online linear quadratic control. In International Conference on Machine Learning, pages 1029–1038, 2018.
- Cohen et al. [2019] A. Cohen, T. Koren, and Y. Mansour. Learning linear-quadratic regulators efficiently with only $\sqrt{T}$ regret. In International Conference on Machine Learning, pages 1300–1309, 2019.
- Fazel et al. [2018] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In Proceedings of the 35th International Conference on Machine Learning, volume 80, 2018.
- Flaxman et al. [2005] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394, 2005.
- Haarnoja et al. [2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
- Hambly et al. [2020] B. M. Hambly, R. Xu, and H. Yang. Policy gradient methods for the noisy linear quadratic regulator over a finite horizon. Available at SSRN, 2020.
- Hanson and Wright [1971] D. L. Hanson and F. T. Wright. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079–1083, 1971.
- Hayes [2005] T. P. Hayes. A large-deviation inequality for vector-valued martingales. Combinatorics, Probability and Computing, 2005.
- Hsu et al. [2012] D. Hsu, S. Kakade, T. Zhang, et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.
- Jin et al. [2020] Z. Jin, J. M. Schmitt, and Z. Wen. On the analysis of model-free methods for the linear quadratic regulator. arXiv preprint arXiv:2007.03861, 2020.
- Krauth et al. [2019] K. Krauth, S. Tu, and B. Recht. Finite-time analysis of approximate policy iteration for the linear quadratic regulator. In Advances in Neural Information Processing Systems, volume 32, 2019.
- Lillicrap et al. [2015] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Malik et al. [2019] D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. Bartlett, and M. Wainwright. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2916–2925. PMLR, 2019.
- Mania et al. [2019] H. Mania, S. Tu, and B. Recht. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, volume 32, pages 10154–10164, 2019.
- Mohammadi et al. [2020] H. Mohammadi, M. R. Jovanovic, and M. Soltanolkotabi. Learning the model-free linear quadratic regulator via random search. In Learning for Dynamics and Control, pages 531–539. PMLR, 2020.
- Nesterov [2003] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
- Silver et al. [2014] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In International conference on machine learning, pages 387–395. PMLR, 2014.
- Simchowitz and Foster [2020] M. Simchowitz and D. Foster. Naive exploration is optimal for online lqr. In International Conference on Machine Learning, pages 8937–8948. PMLR, 2020.
- Sutton et al. [1999] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. In NIPs, volume 99, pages 1057–1063. Citeseer, 1999.
- Tu and Recht [2019] S. Tu and B. Recht. The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. In Conference on Learning Theory, pages 3036–3083. PMLR, 2019.
- Wright [1973] F. T. Wright. A bound on tail probabilities for quadratic forms in independent random variables whose distributions are not necessarily symmetric. The Annals of Probability, pages 1068–1070, 1973.
- Yaghmaie and Gustafsson [2019] F. A. Yaghmaie and F. Gustafsson. Using reinforcement learning for model-free linear quadratic control with process and measurement noises. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 6510–6517. IEEE, 2019.
- Yang et al. [2019] Z. Yang, Y. Chen, M. Hong, and Z. Wang. Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. In Advances in Neural Information Processing Systems, volume 32, 2019.
Appendix A Reducing Gaussian Noise to Bounded Noise
In this section we relax the bounded noise assumption, , and replace it with the following tail assumption. For , suppose there exists such that:
1. for all ;
2.
The first assumption is a standard implication of any tail assumption. The second assumption implies that we can crop the noise while keeping it zero-mean. While not entirely trivial, this can be guaranteed for any continuous noise distribution. We note that is a theoretical construct, and is not a direct input to Algorithm 1. Indirectly, we use to calculate the parameters
which will serve as replacements for in our bounded noise formulation. In practice, our results hold if for the chosen parameters , there exists a set satisfying the above. Our main findings for unbounded noise are summarized in the following meta-result.
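As a sanity check on the cropping argument (with illustrative parameters; not the exact construction used in the analysis), the following sketch crops a zero-mean Gaussian to a centered ball and verifies empirically that the cropped noise remains essentially zero mean while its covariance stays positive definite.

```python
# Illustrative only: cropping w to w * 1{||w|| <= W} keeps a centrally symmetric noise
# distribution zero mean, while making it bounded.
import numpy as np

rng = np.random.default_rng(6)
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
W = 3.0 * np.sqrt(np.trace(Sigma))             # crop radius: a few "standard deviations"

w = rng.multivariate_normal(np.zeros(2), Sigma, size=1_000_000)
inside = np.linalg.norm(w, axis=1) <= W
w_cropped = w * inside[:, None]                # the cropped noise

print("fraction cropped       :", 1 - inside.mean())
print("mean of cropped noise  :", w_cropped.mean(axis=0))                        # ~ 0
print("min eig of cropped cov :", np.linalg.eigvalsh(np.cov(w_cropped.T)).min()) # still > 0
```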
Theorem 3.
Suppose is such that . If we run Algorithm 1 with the parameters as in Theorem 1 and that satisfy and , then the regret bound of Theorem 1 holds with probability at least
Proof.
Consider the LQR problem where the noise terms are replaced with and let be the corresponding instantaneous and infinite horizon costs. Notice that by our assumptions, are indeed zero-mean, i.i.d., and satisfy and
where the second transition used the Cauchy–Schwarz inequality. We thus have that is bounded as in Theorem 1 with probability at least . Next, since , we have that is optimistic with respect to , i.e., for all , which implies that . Finally, using a union bound on the tail assumption, we have that for all with probability at least . On this event, Algorithm 1 is not aware that the noise is cropped and we thus have that for all . We conclude that with probability at least
and using another union bound concludes the proof.
Application to Gaussian noise.
We specialize Theorem 3 to the case where , are zero-mean Gaussian random vectors with positive definite covariance . The following result demonstrates how to run Algorithm 1 given upper and lower bounds on the covariance eigenvalues.
Proposition 1.
Let . Suppose we run Algorithm 1 with parameters as in Theorem 1 and that satisfy
where are the minimal and maximal eigenvalues of . Then the regret bound of Theorem 1 holds with probability at least .
Proof.
We show that satisfies the desired assumptions. First, by Lemma 14 we indeed have that Next, denote and notice that . We thus have that
where the last transition follows from a symmetry argument. We conclude that satisfies our assumptions. We show that and , which then concludes the proof by invoking Theorem 3. First, we have that for any
and so . Finally, notice that for any is a zero-mean Gaussian random variable. Standard moment identities for Gaussian variables then give that and so we have that
where the last inequality holds by our choice of .
Appendix B Technical Lemmas
B.1 Summing the Square Roots of Epoch Lengths
Lemma 12.
Let and define . Suppose is such that then we have that
Proof.
For ease of notation, denote and notice that for our parameter choice it satisfies . Now, notice that for we have and so we have that
Noticing that is a geometric sequence we get that
where the last transition also used the fact .
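A quick numerical check of this calculation (with an arbitrary base epoch length): for geometrically growing epoch lengths, the sum of their square roots stays within a constant factor of the square root of the horizon.

```python
# Illustrative only: sum_j sqrt(T_j) = O(sqrt(T)) when the epoch lengths T_j grow geometrically.
import numpy as np

T0 = 100                                          # length of the first epoch
lengths = [T0 * 2 ** j for j in range(15)]        # geometrically growing epochs
T = sum(lengths)
ratio = sum(np.sqrt(L) for L in lengths) / np.sqrt(T)
print(f"T = {T},  sum_j sqrt(T_j) / sqrt(T) = {ratio:.3f}")   # stays bounded (~2.4 here)
```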
B.2 Randomized Smoothing
Proof (of Lemma 1).
The first part follows from Stokes’ theorem; see Lemma 1 in [11] for details. For the second part, notice that, by definition, $f_r(x) - f(x) = \mathbb{E}\bigl[ f(x + rB) - f(x) \bigr]$. We can thus use Jensen’s inequality to get that
where the third transition also used the smoothness (gradient Lipschitz) property of , and the last transition used the fact that is in the unit ball.
B.3 Details of Lemma 5
We review how Lemma 5 is derived from Fazel et al. [10]. For the rest of this section all Lemmas will refer to ones in [10].
The first part of the statement is immediate from their Lemma 13. Next, notice that
We thus have that satisfy the condition of Lemma 16 and so we get that
thus concluding the second part. Next, define
which are linear operators on symmetric matrices. By Lemma 17 we have that
and by Lemma 19 we have that
Now, continuing from the middle of the proof of Lemma 27 we get that
where the last step used the fact that . Next, notice that
and thus the fourth property (Lipschitz) follows as
Next, the fifth statement (Smoothness) follows the ideas of Lemma 28. Concretely, recall that where . Notice that
and thus we have that
Now, notice that and so
and combining with the above, yields the desired smoothness condition
Finally, the last statement (PL) is immediate from their Lemma 11 as
Appendix C Concentration inequalities
Lemma 13 (Theorem 1.8 of [15]).
Let be a very-weak martingale taking values in a real-valued euclidean space such that and for every , . Then, for every ,
Alternatively, for any we have that with probability at least
The following theorem is a variant of the Hanson-Wright inequality [14, 28] which can be found in Hsu et al. [16].
Theorem 4.
Let be a Gaussian random vector, let and define . Then we have that
The following lemma is a direct corollary of Theorem 4.
Lemma 14.
Let be a Gaussian random vector in . For any , with probability at least we have that
Proof.
Consider Theorem 4 with and thus . Then for we have that
Now, for we have that (equals in distribution). We thus have that for
and taking (since ) concludes the proof.