Logarithmic Regret for Episodic Continuous-Time Linear-Quadratic Reinforcement Learning over a Finite-Time Horizon
Abstract
We study finite-time horizon continuous-time linear-quadratic reinforcement learning problems in an episodic setting, where both the state and control coefficients are unknown to the controller. We first propose a least-squares algorithm based on continuous-time observations and controls, and establish a regret bound that grows logarithmically in the number of learning episodes. The analysis consists of two components: a perturbation analysis, which exploits the regularity and robustness of the associated Riccati differential equation; and an analysis of the parameter estimation error, which relies on sub-exponential properties of continuous-time least-squares estimators. We further propose a practically implementable least-squares algorithm based on discrete-time observations and piecewise constant controls, which achieves a similar logarithmic regret with an additional term depending explicitly on the time stepsizes used in the algorithm.
1 Introduction
Reinforcement learning (RL) for linear quadratic (LQ) control problems has been one of the most active areas for both the control and the reinforcement learning communities. Over the last few decades, significant progress has been made in the discrete-time setting.
1.1 Discrete-time RL
In the area of adaptive control with unknown dynamics parameters, the goal is to find an optimal stationary policy that stabilizes the unknown dynamics and minimizes the long-term average cost (Ioannou and Fidan (2006); Landau et al. (2011)). For an infinite-time horizon LQ system, it has been shown that persistent excitation conditions (Green and Moore (1986)) are critical for parameter identification. Meanwhile, algorithms with asymptotic convergence in both the parameter estimation and the optimal control have been developed in Goodwin et al. (1981), Kumar (1983) and Campi and Kumar (1998): the first assumes that costs only depend on state variables, while the other two consider both state and control costs and use a cost-biased least-squares estimation method. See Faradonbeh et al. (2018a, 2019) and references therein for recent developments of (randomised) adaptive control algorithms for LQ systems.
Following the seminal works of Auer and Ortner (2007); Auer et al. (2009) and Osband et al. (2013), non-asymptotic regret bound analysis for RL algorithms has been one of the main topics, and has been developed for tabular Markov decision problems.
The non-asymptotic analysis of the adaptive LQ problem by Abbasi-Yadkori and Szepesvári (2011) utilizes the Optimism in the Face of Uncertainty principle to construct a sequence of improving confidence regions for the unknown model parameters, and solves a non-convex constrained optimization problem for each confidence region; their algorithm achieves an $\widetilde{O}(\sqrt{T})$ regret bound, with $T$ being the number of time steps. To reduce the computational complexity and to avoid the non-convexity issue, Abeille and Lazaric (2018) and Ouyang et al. (2017) propose Thompson-sampling-based algorithms and derive $\widetilde{O}(\sqrt{T})$ regret bounds in the Bayesian setting; Dean et al. (2018) proposes a robust adaptive control algorithm that solves a convex sub-problem in each step and achieves an $\widetilde{O}(T^{2/3})$ regret bound. The gap between these regret bounds is removed by Mania et al. (2019) and Cohen et al. (2019) via two different approaches for the same $\widetilde{O}(\sqrt{T})$ frequentist regret bound. Later, Simchowitz and Foster (2020) establishes a lower bound on the regret of order $\sqrt{d_u^2 d_x T}$, where $d_u$ and $d_x$ are the dimensions of the actions and the states, and shows that a simple variant of certainty equivalent control matches the lower bound in both $T$ and the dimensions. Similar regret bounds have also been established under different settings and assumptions, such as Chen and Hazan (2021) in the adversarial setting and Lale et al. (2020a) without a stabilizing controller at the early stages of agent-environment interaction.
All of the analyses above are in discrete time with an infinite time horizon. In these problems, adaptive control algorithms are shown to achieve logarithmic regret bounds when additional information regarding the parameters of the system (often referred to as identifiability conditions) is available. Indeed, Faradonbeh et al. (2018b, 2020) prove that the certainty equivalent adaptive regulator achieves logarithmic regret bounds if the system parameter satisfies certain sparsity or low-rankness conditions. Cassel et al. (2020) establishes logarithmic regret bounds when either the state transition matrix is unknown, or when the state-action transition matrix is unknown and the optimal policy is non-degenerate. In partially observable linear dynamical systems, which include the linear-quadratic Gaussian problem as a special case, Lale et al. (2020b) proposes an algorithm with a logarithmic regret bound, under the assumption that one has access to a set in which all controllers persistently excite the system to approximate the optimal control. Logarithmic regret bounds in the adversarial setting with known dynamics parameters have been established in Agarwal et al. (2019); Foster and Simchowitz (2020).
1.2 Continuous-time RL
Most real-world control systems, such as those in aerospace, automotive industry and robotics, are naturally continuous-time dynamical systems. So are their related physical tasks, such as inverted pendulum problems, cart-pole balancing problems, and legged robot problems. Continuous-time finite-time horizon LQ control problems can be found in portfolio optimization Wang and Zhou (2020), algorithmic trading Cartea et al. (2018), production management of exhaustible resources Graber (2016), and biological movement systems Bailo et al. (2018).
Analysis for continuous-time LQ-RL and general RL problems, however, is fairly limited. The primary approach is to develop learning algorithms after discretizing both the time and state spaces, and to establish convergence as the discretization parameters tend to zero. For instance, Munos (2006) proposes a policy gradient algorithm and shows the convergence of the policy gradient estimate to the true gradient. Munos and Bourgine (1998); Munos (2000) design learning algorithms by discretizing the Bellman equations of the underlying control problems and prove the asymptotic convergence of their algorithms. For the LQ system, attention has mostly been on algorithm design, including the integral reinforcement learning algorithm in Modares and Lewis (2014), and the policy iteration algorithm in Rizvi and Lin (2018). Yet, very little is known regarding the convergence rate or the regret bound of these algorithms. Indeed, despite the natural analytical connection between LQ control and RL, the best known theoretical work for continuous-time LQ-RL is still due to Duncan et al. (1999), where an asymptotically sublinear regret for an ergodic model was derived via a weighted least-squares-based estimation approach. Nevertheless, the exact order of the regret bound has not been studied.
Issues and challenges from non-asymptotic analysis.
It is insufficient, and indeed improper, to rely solely on the analysis and algorithms for discrete-time RL to solve continuous-time problems. There is a mismatch between the algorithm's timescale for the former and the underlying system's timescale for the latter. When designing algorithms that make observations and take actions at discrete time points, it is important to take this model mismatch into consideration. For instance, the empirical studies in Tallec et al. (2019) suggest that vanilla Q-learning methods exhibit degraded performance as the time stepsize decreases, while a proper scaling of learning rates with the stepsize leads to more robust performance.
The questions are therefore: A) How to quantify the precise impacts of the observation stepsize and action stepsize on algorithm performance? B) How to derive a non-asymptotic regret analysis for learning algorithms in continuous-time LQ-RL (or general RL) systems, analogous to the discrete-time LQ-RL counterpart?
There are technical reasons behind the limited theoretical progress in the continuous-time domain for RL, including LQ-RL. In addition to the known difficulty for analyzing stochastic control problems, the learning component compounds the problem complexity and poses new challenges.
For instance, the continuous-time counterpart of the algebraic Riccati equations used in Mania et al. (2019) for the discrete-time setting is the Riccati differential equation, whose regularity and stability, together with the regularity of the feedback controls, must be established. While the Riccati equation, its robustness, and the existence and uniqueness of optimal controls have been well studied in the control literature, the regularity of feedback controls with respect to the underlying model is new to control theory and crucial for algorithm design and its robustness analysis. Moreover, deriving the exact order of the regret bound requires developing new and different techniques from those used for the asymptotic regret analysis in Duncan et al. (1999).
Our work and contributions.
This paper studies finite-time horizon continuous-time LQ-RL problems in an episodic setting.
• It first proposes a greedy least-squares algorithm based on continuous-time observations and controls. At each iteration, the algorithm estimates the unknown parameters by a regularized least-squares estimator based on observed trajectories, then designs linear feedback controls via the Riccati differential equation for the estimated model. It identifies conditions under which the unknown state transition matrix and state-action transition matrix are uniquely identifiable under the optimal policies (Remark 2.1 and Proposition 2.1). By exploiting the identifiability of the coefficients, this continuous-time least-squares algorithm is shown to have a regret that grows logarithmically in the number of learning episodes (Theorem 2.2). To the best of our knowledge, this is the first non-asymptotic logarithmic regret bound for continuous-time LQ-RL problems with unknown state and control coefficients.
• It then proposes a practically implementable least-squares algorithm based on discrete-time observations and controls. At each iteration, the algorithm estimates the unknown parameters by observing continuous-time trajectories at discrete time points, then designs a piecewise constant linear feedback control via Riccati difference equations for an associated discrete-time LQ-RL problem. It shows that the regret of the discrete-time least-squares algorithm is of a similar logarithmic magnitude, with an additional term depending on the time stepsize used in each update of the model parameters (Theorem 2.3). Our analysis shows that scaling the regularization parameter of the discrete-time least-squares estimator with respect to the time stepsize is critical for a robust performance of the algorithm across different timescales (Remark 2.3). To the best of our knowledge, this is the first discrete-time algorithm with a rigorous regret bound for continuous-time LQ-RL problems.
Different from the least-squares algorithms for ergodic LQ problems (see e.g., Duncan et al. (1999); Mania et al. (2019)), our continuous-time least-squares algorithm constructs feedback controls via Riccati differential equations instead of the algebraic equations in Mania et al. (2019). Here, the regularity and stability of the continuous-time Riccati equation are analyzed in order to establish the robustness of the feedback controls.
Moreover, our analysis for the estimation error exploits extensively the sub-exponential tail behavior of the least-squares estimators. This probabilistic approach differs from the asymptotic sublinear regret analysis in Duncan et al. (1999); it establishes the exact order of the logarithmic regret bound by the concentration inequality for the error bound.
In addition, our analysis also exploits an important self-exploration property of finite-time horizon continuous-time LQ-RL problems, for which the time-dependent optimal feedback matrices ensure that the optimal state and control processes span the entire parameter space. This property allows us to design exploration-free learning algorithms with logarithmic regret bounds. Furthermore, we provide explicit conditions on the model that guarantee the successful identification of the unknown parameters under optimal feedback policies. This is in contrast to the identification conditions for logarithmic regret bounds in discrete-time infinite-time-horizon LQ problems. Our conditions apply to arbitrary finite time-horizon problems, without imposing sparsity or low-rankness conditions on system parameters as in Faradonbeh et al. (2018b, 2020) or requiring these parameters to be partially known to the controller as in Cassel et al. (2020); Foster and Simchowitz (2020).
Finally, our analysis provides the precise parameter estimation error in terms of the sample size and the time stepsize, and quantifies the performance gap between applying a piecewise-constant policy from an incorrect model and applying the optimal policy. The misspecification error scales linearly with the stepsize, and the performance gap depends quadratically on the time stepsize and on the magnitude of the parameter perturbations. Our analysis is based on the first-order convergence of Riccati difference equations and a uniform sub-exponential tail bound for discrete-time least-squares estimators.
Notation.
For each , we denote by the identity matrix, and by (resp. ) the space of symmetric positive semidefinite (resp. definite) matrices. We denote by the Euclidean norm of a given Euclidean space, by the matrix norm induced by Euclidean norms, and by and the transpose and trace of a matrix , respectively. For each , filtered probability space satisfying the usual condition and Euclidean space , we introduce the following spaces:
• is the space of continuous functions satisfying ;
• is the space of continuously differentiable functions satisfying ;
• is the space of -valued -progressively measurable càdlàg processes satisfying ;
• is the space of -valued -progressively measurable processes satisfying .
For notational simplicity, we denote by a generic constant, which depends only on the constants appearing in the assumptions and may take a different value at each occurrence.
2 Problem formulation and main results
2.1 Linear-quadratic reinforcement learning problem
In this section, we consider the linear-quadratic reinforcement learning (LQ-RL) problem, where the drift coefficient of the state dynamics is unknown to the controller.
More precisely, let be a given terminal time, be an -dimensional standard Brownian motion defined on a complete probability space , and be the filtration generated by augmented by the -null sets. Let be a given initial state and be fixed but unknown matrices, consider the following problem:
(2.1) |
where for each , the process satisfies the following controlled dynamics associated with the parameter :
(2.2) |
with given matrices and . Note that we assume the loss functional (2.1) only involves a time homogeneous running cost to allow a direct comparison with infinite-time horizon RL problems (see e.g., Duncan et al. (1992)), but similar analysis can be performed if the cost functions are time inhomogeneous, a terminal cost is included, or the Brownian motion in (2.2) is scaled by a known nonsingular diffusion matrix.
If the parameters are known to the controller, then (2.1)-(2.2) reduces to a classical LQ control problem. In this case, it is well known (see e.g., Yong and Zhou (1999) and the references therein) that the optimal control of (2.1)-(2.2) is given in feedback form by
(2.3) |
where for all , solves the Riccati equation
(2.4) |
and is the state process governed by the following dynamics:
(2.5) |
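To fix ideas, the following is a sketch of the standard finite-time horizon LQ objects that the preceding discussion refers to, written in a commonly used notation; the symbols $\theta^\star=(A^\star,B^\star)$ for the unknown parameter, $Q$ and $R$ for the cost weights, $P$ for the Riccati solution and $W$ for the Brownian motion are our own labels for this sketch and may differ from the paper's.

```latex
% A sketch in assumed notation: theta* = (A*, B*) unknown; Q >= 0, R > 0 given.
\begin{align*}
  &\text{minimize}\quad
  J(u) = \mathbb{E}\Bigl[\int_0^T \bigl(X_t^\top Q X_t + u_t^\top R u_t\bigr)\,\mathrm{d}t\Bigr],
  \qquad
  \mathrm{d}X_t = (A^\star X_t + B^\star u_t)\,\mathrm{d}t + \mathrm{d}W_t,\ \ X_0 = x_0, \\
  &\text{Riccati equation:}\quad
  \tfrac{\mathrm{d}}{\mathrm{d}t}P_t + A^{\star\top} P_t + P_t A^\star
  - P_t B^\star R^{-1} B^{\star\top} P_t + Q = 0,\qquad P_T = 0, \\
  &\text{optimal feedback:}\quad
  u^\star_t = -R^{-1} B^{\star\top} P_t\, X^\star_t .
\end{align*}
```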
To solve the LQ-RL problem (2.1)-(2.2) with unknown , the controller searches for the optimal control while simultaneously learning the system, i.e., the matrices . In an episodic (also known as reset or restart) learning framework, the controller improves her knowledge of the underlying dynamics through successive learning episodes, in order to find a control that is close to the optimal one.
Mathematically, it goes as follows. Let be the total number of learning episodes. In the -th learning episode, , a feedback control is exercised, and the state process evolves according to the dynamics (2.2) controlled by the policy :
(2.6) |
Here are independent -dimensional Brownian motions defined on the same probability space . The (expected) cost of learning in the -th episode is then given by
(2.7) |
and the (expected) regret of learning up to episodes (with the sequence of controls ) is defined as follows:
(2.8) |
where is the optimal cost of (2.1)-(2.2) when are known. Intuitively, the regret characterizes the cumulative loss from taking sub-optimal policies in all episodes.
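Written out in symbols (a sketch in assumed notation, with $J^\star$ denoting the optimal cost, $u^m$ the control applied in episode $m$, and $N$ the number of episodes), the episodic regret described above takes the form

```latex
\[
  R(N) \;=\; \sum_{m=1}^{N}\Bigl(\mathbb{E}\bigl[J(u^m)\bigr] - J^\star\Bigr),
  \qquad
  J^\star \;=\; \inf_{u} J(u;\theta^\star).
\]
```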
2.2 Continuous-time least-squares algorithm and its regret bound
In this section, we consider a continuous-time least-squares algorithm, which chooses the optimal feedback control based on the current estimation of the parameter, and updates the parameter estimation based on the whole trajectories of the state dynamics.
More precisely, let be the current estimate of the unknown parameter , then the controller would exercise the optimal feedback control for (2.1)-(2.2) with replaced by , i.e.,
(2.9) |
where satisfies the Riccati equation (2.4) with replaced by :
(2.10) |
This leads to the state process satisfying (cf. (2.6)):
(2.11) |
We proceed to derive an -regularized least-squares estimator for based on sampled trajectories of . Observe from (2.11) that
Hence the martingale property of the Itô integral implies that
(2.12) |
provided that is invertible. This suggests a practical rule to improve one's estimate of the true parameter : replace the expectations in (2.12) with empirical averages over independent realizations. More precisely, let and , , be trajectories of independent realizations of the state and control processes; we then update the estimate by the following rule, inspired by (2.12):
(2.13) |
where for all and , and is the identity matrix.
The regularization term in (2.13) guarantees the existence of the required matrix inverse and vanishes as . The estimator (2.13) can be equivalently expressed as an -regularized least-squares estimator, as pointed out in Duncan et al. (1992) for the ergodic LQ-RL problem.
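As an illustration of how the regularized least-squares update (2.13) might be implemented in practice, the following sketch replaces the continuous-time integrals by Riemann–Itô sums over finely sampled trajectories; the function name, array layout, and the fixed regularizer value are our own choices and not part of the paper.

```python
import numpy as np

def ls_estimate(xs, us, dt, reg=1e-3):
    """Sketch of a regularized least-squares estimate of theta = (A, B).

    xs  : (m, K+1, d) array, m sampled state trajectories on a fine grid
    us  : (m, K, k)   array, the corresponding control trajectories
    dt  : fine sampling stepsize
    reg : ridge regularizer (in the paper it vanishes as the number of
          trajectories grows; here it is a fixed illustrative value)
    """
    m, d, k = xs.shape[0], xs.shape[2], us.shape[2]
    G = np.zeros((d + k, d + k))   # empirical Gram matrix of z_t = (x_t, u_t)
    H = np.zeros((d, d + k))       # empirical cross term  int dX_t z_t^T
    for i in range(m):
        z = np.concatenate([xs[i, :-1], us[i]], axis=1)   # (K, d+k)
        dx = np.diff(xs[i], axis=0)                       # (K, d)
        G += (z.T @ z) * dt
        H += dx.T @ z
    theta_hat = (H / m) @ np.linalg.inv(G / m + reg * np.eye(d + k))
    return theta_hat[:, :d], theta_hat[:, d:]   # estimated (A, B)
```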
We summarize the continuous-time least-squares algorithm as follows.
Note that Algorithm 1 operates in cycles, with the number of episodes in the -th cycle. Hence, the regret of learning up to episodes (cf. (2.8)) can be upper bounded by the accumulated regret at the end of the -th cycle, where is the smallest integer such that .
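To make the cycle structure concrete, here is a minimal sketch of how such an episodic least-squares scheme could be organized with episode doubling between parameter updates; this is not the paper's Algorithm 1 verbatim, and `solve_riccati` and `rollout_episode` are hypothetical placeholders for a Riccati solver and an episode simulator, while `ls_estimate` is the estimator sketched above.

```python
import numpy as np

def episodic_least_squares(theta0, n_cycles, dt, rollout_episode, solve_riccati):
    """Sketch of an episodic least-squares scheme with doubling cycles.

    theta0          : initial guess (A0, B0) for the unknown parameters
    rollout_episode : given a feedback gain t -> K(t), simulates one episode
                      (with fresh Brownian noise) and returns (states, controls)
    solve_riccati   : given (A, B), solves the Riccati equation and returns the gain
    """
    theta = theta0
    for cycle in range(n_cycles):
        K = solve_riccati(*theta)            # greedy policy for the current estimate
        xs, us = [], []
        for _ in range(2 ** cycle):          # number of episodes doubles each cycle
            x_traj, u_traj = rollout_episode(K)
            xs.append(x_traj)
            us.append(u_traj)
        theta = ls_estimate(np.stack(xs), np.stack(us), dt)   # re-estimate (A, B)
    return theta
```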
In this section, we analyze the regret of Algorithm 1 based on the following assumptions of the learning problem (2.1)-(2.2).
H. 1
(1) , , , , , and .
(2) , with defined in (2.3).
Remark 2.1 (Self-exploration of finite-time horizon RL problems)
(H.11) is the standard assumption for finite-time horizon LQ-RL problems (see e.g., Hambly et al. (2020)), except that H.11 allows to be positive semidefinite, which is important for costs depending on partial states. (H.12) corresponds to the identifiability of the true parameter by executing the optimal policy . In fact, as shown in Proposition 3.10, under (H.11), (H.12) is equivalent to the following statement:
- (2’)
Item (2’) indicates an important self-exploration property of finite-time horizon continuous-time RL problems. In particular, the time-dependent optimal feedback matrix and the non-degenerate noises guarantee the non-degeneracy of the space spanned by and , enabling learning the parameters sufficiently well. This self-exploration property is critical for our design of exploration-free learning algorithms for (2.1)-(2.2) with a logarithmic regret (see Theorems 2.2 and 2.3).
One can easily show that (H.12) holds if the optimal policy is nondegenerate, i.e., . A similar nondegeneracy condition has been imposed in Cassel et al. (2020) for discrete-time ergodic LQ-RL problems. In particular, by assuming that the optimal stationary policy satisfies (along with other controllability conditions), they propose learning algorithms with a logarithmic regret, under the assumption that only the control coefficient is unknown. In contrast, we allow both the state coefficient and the control coefficient to be unknown.
Moreover, the following proposition provides sufficient conditions for (H.12), which are special cases of Proposition 3.11.
Proposition 2.1
Proposition 2.1 provides two sets of conditions for (H.12) under two different scenarios: Item 1 applies to an arbitrary finite , and Item 2 only applies to sufficiently large . Item 2 assumes the asymptotic behavior of solutions to Riccati differential equations, which can be ensured by the stabilizability of the pair and detectability of the pair (see (Bitmead and Gevers, 1991, Theorems 10.9 and 10.10)). Note that our subsequent analysis is based on (H.1), and does not require stabilizability assumptions.
Remark 2.2 (Stabilizability of and dependence on )
Since the LQ-RL problem (2.1)-(2.2) is over the time horizon with a fixed , in general one does not need additional conditions on for the well-definedness of (2.1)-(2.2). If , then some stabilizability/controllability conditions of may be required for (2.1)-(2.2) to ensure a well-defined solution (see e.g., Dean et al. (2019)). Under these conditions, different algorithms have been shown to achieve sub-linear regret with respect to the number of decision steps (see e.g., Mania et al. (2019); Cohen et al. (2019)), and even logarithmic regrets provided that further identifiability assumptions are satisfied (see e.g., Faradonbeh et al. (2018b, 2020); Cassel et al. (2020); Lale et al. (2020b)); see Section 1.1 for more details. For , the regrets of learning algorithms for (2.1)-(2.2) in general depend exponentially on the time horizon (e.g., the constants in Theorem 2.2), as the moments of the optimal state process and control process may grow exponentially with respect to . It would be interesting to quantify the precise dependence of the regret bounds on . This would entail deriving precise a priori bounds of solutions to (2.10) and estimating the norm in terms of , and is left for future research.
We are now ready to state the main result of this section, which shows that the regret of Algorithm 1 grows logarithmically with respect to the number of episodes.
Theorem 2.2
To simplify the presentation, we analyze the performance of Algorithm 1 by assuming the number of learning episodes is doubled between two successive updates of the estimation of . Similar regret results can be established for Algorithm 1 with different choices of . Under this specific choice of , for any , Algorithm 1 splits episodes into cycles, where the -th cycle, , contains episodes, and the remaining episodes are in the last cycle.
Sketched proof of Theorem 2.2.
We outline the key steps of the proof of Theorem 2.2, and defer the detailed arguments to Section 3.3. By exploiting the regularity and robustness of solutions to (2.10), we prove that the performance gap is of the magnitude , for all a priori bounded (Proposition 3.8). We then establish a uniform sub-exponential property for the (deterministic and stochastic) integrals in (2.12), which along with (H.12) and Bernstein's inequality leads to the following estimate of the parameter estimation error: for all , all sufficiently large , and all sufficiently close to ,
(2.14) |
where is generated by (2.13) with (Proposition 3.9). Then for each , applying (2.14) with for all shows that with probability ,
(2.15) |
where means the inequality holds with a multiplicative constant independent of and . By the quadratic performance gap and the choice of , the regret of Algorithm 1 up to the -th episode can be bounded by the regret at the end of -th cycle with :
(2.16) |
Observe that the choices of and ensure that . Hence, the right-hand side of (2.16) is of the magnitude , which along with the choices of and leads to the desired regret bound; see the end of Section 3.3 for more details.
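Heuristically, the way the doubling schedule converts a per-cycle estimation error into a logarithmic regret can be summarized as follows; this is only a sketch, with $C$ a generic constant, $\varepsilon_l$ the estimation error after the $l$-th cycle assumed to decay like the reciprocal square root of the number of episodes in that cycle, and logarithmic factors suppressed.

```latex
\[
  \mathrm{Regret}
  \;\lesssim\; \sum_{l=1}^{L} 2^{l}\, C\, \varepsilon_{l-1}^{2}
  \;\lesssim\; \sum_{l=1}^{L} 2^{l}\cdot \frac{C\,\log(\,\cdot\,)}{2^{\,l-1}}
  \;=\; O\bigl(L \log(\,\cdot\,)\bigr),
  \qquad L = O(\log_2 N).
\]
```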
2.3 Discrete-time least-squares algorithm and its regret bound
Note that Algorithm 1 in Section 2.2 requires executing feedback controls and observing the corresponding state trajectories continuously in time. A common practice for solving continuous-time RL problems is to assume that at each learning episode the dynamics only evolves in discrete time, and then to estimate parameters using discrete-time RL algorithms (see e.g., Munos and Bourgine (1998); Munos (2000, 2006); Tallec et al. (2019)). As the true dynamics evolves continuously, it is necessary to quantify the impact of the observation and action stepsizes on the algorithm performance.
In this section, we analyze the performance of the above procedure for solving (2.1)-(2.2). We adapt regularized least-squares algorithms for discrete-time LQ problems to the present setting, and establish their regret bounds in terms of the discretization stepsize. Our analysis shows that a proper scaling of the regularization term in the least-squares estimation in terms of stepsize is critical for a robust performance with respect to different timescales.
More precisely, for a given cycle (i.e., the index in Algorithm 1), let be the current estimate of in (2.2), and let , , be a uniform partition of with stepsize . We then restrict the controls in (2.1)-(2.2) to be piecewise constant between any two grid points, choose actions and make observations only at the grid points, and update the estimated parameter based on these observations. To this end, we consider the following discrete-time LQ control problem with parameter :
(2.17) |
where , and are defined by
(2.18) |
Note that for simplicity, our strategy is constructed by assuming a discrete-time dynamics arising from an Euler discretization of (2.2) (with the estimated parameter ); a similar analysis can be performed with a higher-order approximation of (2.1)-(2.2).
It is well known (see e.g., Bitmead and Gevers (1991)) that the optimal control of (2.17)-(2.18) is given in the following feedback form:
(2.19) |
where is the piecewise constant function (with stepsize ) defined by
(2.20) |
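For reference, a minimal sketch of the backward Riccati difference recursion for the Euler-discretized problem (2.17)-(2.18) could look as follows; the discretization `A_tau = I + tau*A`, `B_tau = tau*B`, the stage cost `tau*(x'Qx + u'Ru)`, and the zero terminal cost are assumptions made only for this sketch.

```python
import numpy as np

def discrete_riccati_gains(A, B, Q, R, T, tau):
    """Backward Riccati difference recursion for an Euler-discretized LQ problem.

    Returns gains K_0, ..., K_{N-1}, so that the piecewise constant policy is
    u_t = -K_j x_{t_j} on [t_j, t_{j+1}).  Assumed discrete model:
        x_{j+1} = (I + tau*A) x_j + tau*B u_j + noise,
    stage cost tau*(x'Qx + u'Ru), zero terminal cost.
    """
    N = int(round(T / tau))
    d = A.shape[0]
    A_tau, B_tau = np.eye(d) + tau * A, tau * B
    P = np.zeros((d, d))                        # terminal condition
    gains = [None] * N
    for j in reversed(range(N)):
        S = tau * R + B_tau.T @ P @ B_tau       # control curvature
        K = np.linalg.solve(S, B_tau.T @ P @ A_tau)
        P = tau * Q + A_tau.T @ P @ (A_tau - B_tau @ K)
        gains[j] = K
    return gains
```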
We then implement the piecewise constant strategy defined in (2.19) on the original system (2.2) for episodes, and update the estimated parameter by observing (2.2) with stepsize . More precisely, let be the state process associated with :
(2.21) |
and , , , be independent trajectories of ; we then update the parameter according to the following discrete-time least-squares estimator:
(2.22) |
with for all . The update (2.22) is consistent with the agent’s assumption that the state evolves according to (2.18) between two grid points. Setting the derivative (with respect to ) of the right-hand side of (2.22) to zero leads to
Dividing both sides by and rearranging the terms give the following equivalent expression of the discrete-time least squares estimator (2.22):
(2.23) |
Remark 2.3 (Scaling hyper-parameters with timescales)
In principle, when applying discrete-time RL algorithms in a continuous environment, it is critical to adopt a proper scaling of the hyper-parameters for a robust performance with respect to different timescales. Indeed, scaling the regularization term in (2.22) with respect to the stepsize is essential for the robustness of (2.23) for all small stepsizes . If one updates by minimizing the following -regularized loss function with a given hyper-parameter such that
(2.24) |
then the corresponding discrete-time estimator is given by
Observe that for any given , the estimator degenerates to zero as the stepsize tends to zero. Hence, to ensure the viability of across different timescales, the number of episodes has to increase appropriately when tends to zero. In contrast, by choosing in (2.24), (2.23) admits a continuous-time limit (2.13) as , and leads to a learning algorithm in which the episode numbers and the time stepsize can be chosen independently (see Theorem 2.3).
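The effect described in this remark can be seen numerically in a few lines: with a fixed ridge parameter the discrete-time estimator is damped towards zero as the stepsize shrinks, whereas scaling the regularizer by the stepsize keeps it stable across stepsizes. The scalar example below, with made-up parameter values, is only illustrative and not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, T, m = -1.0, 1.0, 50          # made-up scalar dynamics dX = a X dt + dW

def estimate(tau, lam, scale_with_tau):
    """Scalar least-squares estimate of a from m discretely observed paths."""
    n = int(T / tau)
    num = den = 0.0
    for _ in range(m):
        x = 1.0
        for _ in range(n):
            dx = a_true * x * tau + np.sqrt(tau) * rng.standard_normal()
            num += tau * x * dx       # normal equations of the discrete LS loss
            den += (tau * x) ** 2
            x += dx
    reg = lam * tau if scale_with_tau else lam   # cf. Remark 2.3
    return (num / m) / (den / m + reg)

for tau in (0.1, 0.01, 0.001):
    print(tau, estimate(tau, 0.1, True), estimate(tau, 0.1, False))
```

Under these made-up values, the stepsize-scaled variant stays of roughly the same order as the stepsize shrinks, while the unscaled variant shrinks towards zero roughly in proportion to the stepsize.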
We now summarize the discrete-time least-squares algorithm as follows.
Again, as the -th cycle of Algorithm 2 contains episodes, for each , the regret of learning up to episodes (cf. (2.8)) can be upper bounded by the accumulated regret at the end of the -th cycle, where is the smallest integer such that . The following theorem is an analogue of Theorem 2.2 for Algorithm 2.
Theorem 2.3
Remark 2.4
Theorem 2.3 provides a general regret bound for Algorithm 2 with any time discretization steps , where is the number of intervention points in the -th cycle. Compared with Algorithm 1, the regret of Algorithm 2 has an additional term : for each learning episode, one incurs a sub-optimal loss by adjusting her policy only at discrete time points, and also suffers from a model misspecification error in the parameter estimation from discrete-time observations. Specifically,
- •
- •
Sketched proof of Theorem 2.3.
We point out the main differences between the proofs of Theorems 2.2-2.3, and give the detailed proof of Theorem 2.3 in Section 3.4. Compared with Theorem 2.2, the essential challenges in proving Theorem 2.3 are to quantify the precise dependence of the performance gap and the parameter estimation error on the stepsize. To this end, we first prove a first-order convergence of (2.20) to (2.9) as the stepsize tends to zero. Then by exploiting the affine structure of (2.21), we establish the following quadratic performance gap for a piecewise constant policy (Proposition 3.12):
(2.26) |
The analysis of the parameter estimation error is somewhat involved, as the state trajectories are merely -Hölder continuous in time with . By leveraging the analytic expression of , we first show the first-order convergence of to with
(2.27) |
We then prove that (2.23) enjoys a uniform sub-exponential tail bound for all close to and small . Comparing (2.23) with (2.27) and applying the above results allow for bounding the estimation error of (2.23) by (2.14) with an additional term (Proposition 3.14).
3 Proofs of Theorems 2.2 and 2.3
To simplify the notation, for any given and control that is affine in the spatial variable, we introduce the following random variables associated with continuous-time observations:
(3.1) |
and the random variables associated with discrete-time observations with stepsize :
(3.2) |
where is the state process associated with the parameter and the control (cf. (2.6)), for all , and are independent copies of .
3.1 Convergence and stability of Riccati equations and feedback controls
Lemma 3.1
Proof It has been shown in (Yong and Zhou, 1999, Corollary 2.10 on p. 297) that under (H.11), for all , (3.3) admits a unique solution such that for all . It remains to prove the continuous differentiability of .
To this end, consider the Banach spaces and , and the operator defined by
where for all . Observe that for all , . Moreover, one can easily show that for any , is continuously Fréchet differentiable at , and the partial derivative is a bounded linear operator such that for all ,
Classical well-posedness results of linear differential equations and the boundedness of imply that has a bounded inverse (and hence is a bijection). Thus, applying the implicit function theorem (see (Ciarlet, 2013, Theorem 7.13-1)) to proves that is continuously differentiable.
The following lemma establishes the stability of the Riccati difference operator, which is crucial for the subsequent convergence analysis.
Lemma 3.2
Proof Item 1 follows directly from the definition of and the identity . We now prove Item 2. Let and ; then by (Bitmead and Gevers, 1991, Lemma 10.1),
with . Thus for all , , which along with implies
Hence, interchanging the roles of and in the above inequality and taking the supremum over lead to the desired estimate.
The following proposition establishes the first-order convergence of the Riccati difference equation and the associated feedback controls, as the stepsize tends to zero.
Proposition 3.3
Proof Throughout this proof, we shall fix , let for all , and denote by a generic constant independent of and . By the continuity of the map (Lemma 3.1) and the boundedness of , there exists a constant such that for all , which implies for all . Consequently, it suffices to prove for all .
3.2 Concentration inequalities for least-squares estimators
In this section, we analyze the concentration behavior of the least-squares estimators (2.13) and (2.23). We first recall the definition of sub-exponential random variables (see e.g., Wainwright (2019)).
Definition 3.1
A random variable with mean is -sub-exponential for if for all .
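In the convention of Wainwright (2019), which the citation suggests is being followed, this definition can be written as below; the symbols $\mu$, $\nu$, $\alpha$ are the labels used in this sketch and may differ from the paper's.

```latex
\[
  \mathbb{E}\bigl[e^{\lambda (X-\mu)}\bigr] \;\le\; e^{\nu^{2}\lambda^{2}/2}
  \qquad\text{for all } |\lambda| < \tfrac{1}{\alpha},
\]
in which case $X$ is called $(\nu,\alpha)$-sub-exponential.
```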
Note that a -sub-exponential random variable is usually called a sub-Gaussian random variable. It is well-known that products of sub-Gaussian random variables are sub-exponential, and the class of sub-exponential random variables forms a vector space. Moreover, sub-exponential random variables enjoy the following concentration inequality (also known as Bernstein’s inequality; see e.g., (Wainwright, 2019, Equation 2.18 p. 29)).
Lemma 3.4
Let , and be independent -sub-exponential random variables with for all . Then for all ,
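For reference, a standard form of this Bernstein-type bound, written here for independent random variables sharing the same sub-exponential parameters $(\nu,\alpha)$ (an assumption made only for this sketch), reads:

```latex
\[
  \mathbb{P}\Bigl(\Bigl|\tfrac{1}{n}\textstyle\sum_{i=1}^{n}(X_i-\mu_i)\Bigr|\ge t\Bigr)
  \;\le\; 2\exp\Bigl(-\tfrac{n}{2}\min\bigl(\tfrac{t^{2}}{\nu^{2}},\,\tfrac{t}{\alpha}\bigr)\Bigr),
  \qquad t\ge 0 .
\]
```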
The following lemma shows that double iterated Itô integrals are sub-exponential random variables.
Lemma 3.5
Let and be measurable functions such that and for all . Then there exist , depending polynomially on , such that
(1) ,
(2) are -sub-exponential,
Proof We first prove Item 1 by assuming without loss of generality that for all , and by defining for any bounded measurable function . By similar arguments as (Cheridito et al., 2005, Lemma 3.2), we have for all and ,
As for all , we see for all . Consequently, for all ,
Replacing by shows the above estimate holds for , which implies the desired sub-exponential property of .
For Item 2, observe that for each , the Itô formula allows one to express the product as a linear combination of double iterated Itô integrals and deterministic integrals. Then the desired sub-exponential property follows from the stochastic Fubini theorem (see e.g., Veraar (2012)) and Item 1.
The following theorem establishes the concentration properties of the random variables involved in the least-squares estimators.
Theorem 3.6
Proof We first show there exist such that all entries of , , are -sub-exponential for all and . By (3.1), we have
Moreover, applying the variation-of-constants formula (see e.g., (Mao, 2007, Theorem 3.1 p. 96)) to (2.11) shows that for all , where is the fundamental solution of . The continuity of (cf. Proposition 3.1) and the boundedness of imply that are uniformly bounded for all . Consequently, from Lemma 3.5, there exist such that all entries of and are -sub-exponential.
Similarly, by (2.21) and (3.2),
where for all , and is the fundamental solution of . By Proposition 3.3, are uniformly bounded for all and , which along with Lemma 3.5 leads to the desired sub-exponential properties of and .
Finally, since for all and random variables , we can apply Lemma 3.4 to each component of , , and , and conclude the desired concentration inequality with a constant depending polynomially on .
3.3 Regret analysis of continuous-time least-squares algorithm
This section is devoted to the proof of Theorem 2.2, which consists of three steps: (1) We first quantify the performance gap between applying feedback controls for an incorrect model and that for the true model; our proof exploits the stability of Riccati equations established in Lemma 3.1; (2) We then estimate the parameter estimation error in terms of the number of learning episodes based on the sub-exponential tail behavior of the least-squares estimator (2.13); (3) Finally, we estimate the regret for the feedback controls in Algorithm 1, thus establishing Theorem 2.2.
Step 1: Analysis of the performance gap.
We start by establishing a quadratic expansion of the cost function at any open-loop control.
Proposition 3.7
Proof For notational simplicity, for all and , we write , denote by the associated state process defined by (2.2), and by . The affineness of (2.2) implies that for all . Hence, for all ,
which is based on the fact that and . As is the optimal control of , for all . Hence for all ,
(3.6) |
We now prove that the above quantity is in fact zero for all . To this end, let be a given (open-loop) control, and consider . Then by the affineness of (2.2), satisfies the following controlled dynamics:
(3.7) |
Moreover, one can verify by the affineness of (2.2) that also satisfies the dynamics (3.7), which along with the uniqueness of solutions to (3.7) shows that . Therefore, applying (3.6) with implies that
Hence for all ,
which leads to the desired result (3.5) due to the following identity:
Armed with Proposition 3.7, the following proposition quantifies the quadratic performance gap of a greedy policy .
Proposition 3.8
Suppose (H.11) holds and let be a bounded subset of . For each , let be defined in (2.9), let be the state process associated with (cf. (2.11)), let be defined in (2.3), and let be the state process associated with (cf. (2.5)). Then there exists a constant such that
where and for all , and is defined in (2.1).
Proof For all , applying Proposition 3.7 with gives
(3.8) |
where the last inequality used the fact that (see (2.11)), and the definitions of and . It remains to prove
for a constant independent of .
Observe that by (2.9), for all , with . Now by Lemma 3.1 and the boundedness of , there exists a constant such that and for all , which along with implies that and . Moreover, observe from (2.5) and (2.11) that and for all ,
which combined with the boundedness of and Gronwall’s inequality leads to
where the last inequality follows from , as is uniformly bounded. The above inequality further implies
which along with (3.8) finishes the desired estimate.
Step 2: Error bound for parameter estimation.
Proposition 3.9
Proof Let us fix and . By (2.12) and (2.13), we obtain
As for all nonsingular matrices and , we have
(3.10) |
where the last inequality follows from the assumption .
We now estimate each term on the right-hand side of (3.10), and denote by a generic constant independent of . By Theorem 3.6, with probability at least , and , with the constant given by
(3.11) |
Let be a sufficiently large constant satisfying , where is the constant such that for all . Then with probability at least , , which in turn yields
or equivalently . Moreover, the continuity of implies . Hence, by (3.10),
Substituting (3.11) into the above estimate yields the desired estimate (3.22). As , it is clear that is satisfied for all , with a sufficiently large .
Step 3: Proof of Theorem 2.2.
The following proposition shows that for any given , the full row rank of is equivalent to the well-definedness of (2.12) for all sufficiently close to .
Proposition 3.10
Proof For : By (3.1), if and only if there exists no nonzero such that
(3.12) |
where we applied Fubini’s theorem for the first identity. By (2.5), for all , where is the fundamental solution of , for all , and satisfies (2.10). Hence,
Then by (3.12) and the continuity of and , if and only if there exists no nonzero such that
where is the identity matrix. One can easily deduce by the invertibility of for all that for all , which subsequently shows that if and only if there exists no nonzero such that for all . Now let us denote without loss of generality that for some and . Then the above derivation shows that is equivalent to the following statement:
if and satisfy for all , then and . | (3.13) |
By (2.9), for all and , implying that . Then (3.13) can be rewritten as:
if satisfies for all , then . |
For : Item 3 clearly implies Item 2. On the other hand, for any given , we can easily deduce from the continuity of (see Lemma 3.1) that is continuous, which implies the continuity of . Hence, by the continuity of the minimum eigenvalue function, we can conclude Item 2 from Item 3.
The following proposition provides sufficient conditions for the nondegeneracy of .
Proposition 3.11
Let , , and .
(1) For all , if , then .
(2) Assume that the algebraic Riccati equation admits a unique maximal solution . Let , and for each , let be defined in (2.10). Assume that and . Then there exists , such that for all , .
Proof To prove Item 1, suppose that and such that for all , with defined in (2.10). Setting , right multiplying (2.10) by , and left multiplying (2.10) by shows
As for all , for all , and hence . The assumption of then gives , which along with the invertibility of shows that .
To prove Item 2, observe that . As , there exists such that for all , . Fix and consider such that for all . Then the definitions of and imply the invertibility of , which yields .
Now we are ready for the proof of Theorem 2.2.
Proof [Proof of Theorem 2.2] As (H.12) holds with and , we can obtain from Proposition 3.10 that there exist such that for all , we have . Then by Proposition 3.9, there exist constants , such that for all and , if , then with probability at least ,
(3.14) |
where denotes the right-hand side of (2.13) with the control . In the following, we fix and , with the constant satisfying
let , and for each , let , , and let be generated by (2.13) with and . Note that the choices of ensure that , and
(3.15) |
We now prove that, with probability at least ,
(3.16) |
Let us consider the induction statement for each : with probability at least , (3.16) holds for all . The fact that and (3.14) yields the induction statement for . Now suppose that the induction statement holds for some . Then the induction hypothesis and (3.15) ensure that for all (and hence ) with probability at least . Conditioning on this event, we can apply (3.14) with , and , and deduce with probability at least that (3.16) holds for the index . Combining this with the induction hypothesis yields (3.16) holds for the indices , with probability at least .
3.4 Regret analysis of discrete-time least-squares algorithm
This section is devoted to the proof of Theorem 2.3. The main steps are similar to those in the proof of Theorem 2.2 in Section 3.3. However, one needs to quantify the precise impact of the piecewise constant policies and the discrete-time observations on the performance gap and the parameter estimation error.
Step 1: Analysis of the performance gap.
The following proposition shows that the performance gap between applying a piecewise constant feedback control for an incorrect model and applying a continuous-time feedback control for the true model scales quadratically with respect to the stepsize and the parameter errors.
Proposition 3.12
Suppose (H.11) holds and let be a bounded subset of . For each and , let be defined in (2.19) with stepsize , let be the state process associated with (cf. (2.21)), let be defined in (2.3), and let be the state process associated with (cf. (2.5)). Then there exists such that
(3.18) |
where and for all , and is defined in (2.1).
Proof Let us fix and . By applying Proposition 3.7 with ,
(3.19) |
where the last inequality used the fact that (see (2.11)), and the definitions of and .
We then prove that there exists a constant , independent of , such that
By setting , we obtain from (2.5) and (2.21) that
(3.20) |
Since and for all , . Moreover, by for all (see Proposition 3.3) and (2.20), we have for all , which along with a moment estimate of (2.21) yields . Thus, by applying Gronwall’s inequality to (3.20), Lemma 3.1 and Proposition 3.3, for all and ,
(3.21) |
The above inequality further implies
which along with (3.19) finishes the desired estimate.
Step 2: Error bound for parameter estimation.
The following lemma shows that the difference between the expectations of and of scales linearly with respect to the stepsize.
Lemma 3.13
Proof By (2.21), we have for all , , which implies
Hence it suffices to prove that and for all and .
Let us fix and . In the following, we shall omit the superscripts of and if no confusion occurs. As , by (2.21), we have with . Thus,
By taking expectations of both sides of the above identity, the martingale property of the Itô integral, and the Itô isometry,
where the last inequality follows from . Since and , one can easily show that . Furthermore, by and the identity
we can show that by the uniform boundedness of and .
Proposition 3.14
Proof We first prove that there exists such that for all and , for a constant independent of and . By (2.11) and (2.21), we have for all and , and
Proposition 3.3 shows for all , which along with Gronwall’s inequality yields for all and . One can further prove that with and for all . Thus, we have , which along with implies a uniform bound of for all sufficiently large .
Let us fix and for the subsequent analysis. The invertibility of implies that (cf. (2.12)). Then by (2.23), we can derive the following analogues of (3.10):
where and are defined in (3.2). Note that, for both inequalities, the first term on the right-hand side can be estimated by Theorem 3.6 (uniformly in ), and the second term is of the magnitude due to Lemma 3.13. Hence, proceeding along the lines of the proof of Proposition 3.9 leads to the desired result.
Step 3: Proof of Theorem 2.3.
The proof follows from arguments similar to those for Theorem 2.2, and we only present the main steps here. As (H.12) holds with and , we can obtain from Propositions 3.10 and 3.14 that there exist a bounded set and constants , such that for all , and , if , then with probability at least ,
(3.23) |
where denotes the right-hand side of (2.23) with the control and stepsize . Then by proceeding along the lines of the proof of Theorem 2.2, there exists and , such that for any given , if with and for all , then with probability at least ,
(3.24) |
where and for all . Consequently, we can conclude the desired regret bound from Proposition 3.12 (cf. (3.17)), with an additional term due to the time discretization errors in (3.18) and (3.24).
References
- Abbasi-Yadkori and Szepesvári (2011) Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
- Abeille and Lazaric (2018) Marc Abeille and Alessandro Lazaric. Improved regret bounds for thompson sampling in linear quadratic control problems. In International Conference on Machine Learning, pages 1–9. PMLR, 2018.
- Agarwal et al. (2019) Naman Agarwal, Elad Hazan, and Karan Singh. Logarithmic regret for online control. Advances in Neural Information Processing Systems, 32, 2019.
- Auer and Ortner (2007) Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pages 49–56, 2007.
- Auer et al. (2009) Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, pages 89–96, 2009.
- Bailo et al. (2018) Rafael Bailo, Mattia Bongini, José A Carrillo, and Dante Kalise. Optimal consensus control of the Cucker-Smale model. IFAC-PapersOnLine, 51(13):1–6, 2018.
- Bitmead and Gevers (1991) Robert R Bitmead and Michel Gevers. Riccati difference and differential equations: Convergence, monotonicity and stability. In The Riccati Equation, pages 263–291. Springer, 1991.
- Campi and Kumar (1998) Marco C Campi and PR Kumar. Adaptive linear quadratic Gaussian control: the cost-biased approach revisited. SIAM Journal on Control and Optimization, 36(6):1890–1907, 1998.
- Cartea et al. (2018) Alvaro Cartea, Sebastian Jaimungal, and Jason Ricci. Algorithmic trading, stochastic control, and mutually exciting processes. SIAM Review, 60(3):673–703, 2018.
- Cassel et al. (2020) Asaf Cassel, Alon Cohen, and Tomer Koren. Logarithmic regret for learning linear quadratic regulators efficiently. In International Conference on Machine Learning, pages 1328–1337. PMLR, 2020.
- Chen and Hazan (2021) Xinyi Chen and Elad Hazan. Black-box control for linear dynamical systems. In Conference on Learning Theory, pages 1114–1143. PMLR, 2021.
- Cheridito et al. (2005) Patrick Cheridito, H. Mete Soner, and Nizar Touzi. Small time path behavior of double stochastic integrals and applications to stochastic control. The Annals of Applied Probability, 15(4):2472–2495, 2005.
- Ciarlet (2013) Philippe G Ciarlet. Linear and nonlinear functional analysis with applications, volume 130. SIAM, 2013.
- Cohen et al. (2019) Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear quadratic regulators efficiently with only $\sqrt{T}$ regret. arXiv preprint arXiv:1902.06223, 2019.
- Dean et al. (2018) Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
- Dean et al. (2019) Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, pages 1–47, 2019.
- Duncan et al. (1992) Tyrone E Duncan, Petr Mandl, and Bożenna Pasik-Duncan. On least squares estimation in continuous time linear stochastic systems. Kybernetika, 28(3):169–180, 1992.
- Duncan et al. (1999) Tyrone E Duncan, Lei Guo, and Bozenna Pasik-Duncan. Adaptive continuous-time linear quadratic Gaussian control. IEEE Transactions on Automatic Control, 44(9):1653–1662, 1999.
- Faradonbeh et al. (2018a) Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Finite-time adaptive stabilization of linear systems. IEEE Transactions on Automatic Control, 64(8):3498–3505, 2018a.
- Faradonbeh et al. (2018b) Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. On adaptive linear-quadratic regulators. arXiv e-prints, pages arXiv–1806, 2018b.
- Faradonbeh et al. (2019) Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Randomized algorithms for data-driven stabilization of stochastic linear systems. In 2019 IEEE Data Science Workshop (DSW), pages 170–174. IEEE, 2019.
- Faradonbeh et al. (2020) Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Input perturbations for adaptive control and learning. Automatica, 117:108950, 2020.
- Foster and Simchowitz (2020) Dylan Foster and Max Simchowitz. Logarithmic regret for adversarial online control. In International Conference on Machine Learning, pages 3211–3221. PMLR, 2020.
- Goodwin et al. (1981) Graham C Goodwin, Peter J Ramadge, and Peter E Caines. Discrete time stochastic adaptive control. SIAM Journal on Control and Optimization, 19(6):829–853, 1981.
- Graber (2016) P Jameson Graber. Linear quadratic mean field type control and mean field games with common noise, with application to production of an exhaustible resource. Applied Mathematics & Optimization, 74(3):459–486, 2016.
- Green and Moore (1986) Michael Green and John B Moore. Persistence of excitation in linear systems. Systems & Control Letters, 7(5):351–360, 1986.
- Hambly et al. (2020) Ben M Hambly, Renyuan Xu, and Huining Yang. Policy gradient methods for the noisy linear quadratic regulator over a finite horizon. Available at SSRN, 2020.
- Ioannou and Fidan (2006) Petros Ioannou and Bariş Fidan. Adaptive control tutorial. SIAM, 2006.
- Kumar (1983) PR Kumar. Optimal adaptive control of linear-quadratic-Gaussian systems. SIAM Journal on Control and Optimization, 21(2):163–178, 1983.
- Lale et al. (2020a) Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Explore more and improve regret in linear quadratic regulators. arXiv preprint arXiv:2007.12291, 2020a.
- Lale et al. (2020b) Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Logarithmic regret bound in partially observable linear dynamical systems. Advances in Neural Information Processing Systems, 33:20876–20888, 2020b.
- Landau et al. (2011) Ioan Doré Landau, Rogelio Lozano, Mohammed M’Saad, and Alireza Karimi. Adaptive control: algorithms, analysis and applications. Springer Science & Business Media, 2011.
- Mania et al. (2019) Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalent control of LQR is efficient. arXiv preprint arXiv:1902.07826, 2019.
- Mao (2007) Xuerong Mao. Stochastic differential equations and applications. Elsevier, 2007.
- Modares and Lewis (2014) Hamidreza Modares and Frank L. Lewis. Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning. IEEE Transactions on Automatic Control, 59(11):3051–3056, 2014.
- Munos (2000) Rémi Munos. A study of reinforcement learning in the continuous case by the means of viscosity solutions. Machine Learning, 40(3):265–299, 2000.
- Munos (2006) Rémi Munos. Policy gradient in continuous time. Journal of Machine Learning Research, 7(May):771–791, 2006.
- Munos and Bourgine (1998) Rémi Munos and Paul Bourgine. Reinforcement learning for continuous stochastic control problems. In Advances in Neural Information Processing Systems, pages 1029–1035, 1998.
- Osband et al. (2013) Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
- Ouyang et al. (2017) Yi Ouyang, Mukul Gagrani, and Rahul Jain. Learning-based control of unknown linear systems with thompson sampling. arXiv preprint arXiv:1709.04047, 2017.
- Rizvi and Lin (2018) Syed Ali Asad Rizvi and Zongli Lin. Output feedback reinforcement learning control for the continuous-time linear quadratic regulator problem. In 2018 Annual American Control Conference (ACC), pages 3417–3422. IEEE, 2018.
- Simchowitz and Foster (2020) Max Simchowitz and Dylan Foster. Naive exploration is optimal for online LQR. In International Conference on Machine Learning, pages 8937–8948. PMLR, 2020.
- Tallec et al. (2019) Corentin Tallec, Léonard Blier, and Yann Ollivier. Making Deep Q-learning methods robust to time discretization. arXiv preprint arXiv:1901.09732, 2019.
- Veraar (2012) Mark Veraar. The stochastic Fubini theorem revisited. Stochastics An International Journal of Probability and Stochastic Processes, 84(4):543–551, 2012.
- Wainwright (2019) Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
- Wang and Zhou (2020) Haoran Wang and Xun Yu Zhou. Continuous-time mean–variance portfolio selection: A reinforcement learning framework. Mathematical Finance, 30(4):1273–1308, 2020.
- Yong and Zhou (1999) Jiongmin Yong and Xun Yu Zhou. Stochastic Controls: Hamiltonian Systems and HJB Equations, volume 43. Springer Science & Business Media, 1999.