
Logarithmic Regret for Episodic Continuous-Time Linear-Quadratic Reinforcement Learning over a Finite-Time Horizon

Matteo Basei ([email protected])
EDF R&D and FIME, Paris, France

Xin Guo ([email protected])
Department of Industrial Engineering and Operations Research, University of California, Berkeley, USA

Anran Hu ([email protected])
Department of Industrial Engineering and Operations Research, University of California, Berkeley, USA

Yufei Zhang ([email protected])
Mathematical Institute, University of Oxford, UK
Abstract

We study finite-time horizon continuous-time linear-quadratic reinforcement learning problems in an episodic setting, where both the state and control coefficients are unknown to the controller. We first propose a least-squares algorithm based on continuous-time observations and controls, and establish a logarithmic regret bound of magnitude $\mathcal{O}((\ln M)(\ln\ln M))$, with $M$ being the number of learning episodes. The analysis consists of two components: perturbation analysis, which exploits the regularity and robustness of the associated Riccati differential equation; and parameter estimation error, which relies on sub-exponential properties of continuous-time least-squares estimators. We further propose a practically implementable least-squares algorithm based on discrete-time observations and piecewise constant controls, which achieves similar logarithmic regret with an additional term depending explicitly on the time stepsizes used in the algorithm.

1 Introduction

Reinforcement learning (RL) for linear quadratic (LQ) control problems has been one of the most active areas for both the control and the reinforcement learning communities. Over the last few decades, significant progress has been made in the discrete-time setting.

1.1 Discrete-time RL

In the area of adaptive control with unknown dynamics parameters, the goal is to find an optimal stationary policy that stabilizes the unknown dynamics and minimizes the long-term average cost (Ioannou and Fidan (2006); Landau et al. (2011)). For an infinite-time horizon LQ system, it has been shown that persistent excitation conditions Green and Moore (1986) are critical to parameter identification. Meanwhile, algorithms with asymptotic convergence in both the parameter estimation and the optimal control have been developed in Goodwin et al. (1981), Kumar (1983) and Campi and Kumar (1998): the first one assumes that costs only depend on state variables, while the other two consider both state and control costs and use a cost-biased least-squares estimation method. See Faradonbeh et al. (2018a, 2019) and references therein for recent developments of (randomised) adaptive control algorithms for LQ systems.

Following the seminal works of Auer and Ortner (2007); Auer et al. (2009) and Osband et al. (2013), non-asymptotic regret bound analysis for RL algorithms has been one of the main topics, and has been developed for tabular Markov decision problems.

The non-asymptotic analysis of the adaptive LQ problem by Abbasi-Yadkori and Szepesvári (2011) utilizes the Optimism in the Face of Uncertainty principle to construct a sequence of improving confidence regions for the unknown model parameters, and solves a non-convex constrained optimization problem for each confidence region; their algorithm achieves an $\mathcal{O}(\sqrt{T})$ regret bound, with $T$ being the number of time steps. To reduce the computational complexity and to avoid the non-convexity issue, Abeille and Lazaric (2018) and Ouyang et al. (2017) propose Thompson-sampling-based algorithms and derive $\mathcal{O}(\sqrt{T})$ regret bounds in the Bayesian setting; Dean et al. (2018) proposes a robust adaptive control algorithm that solves a convex sub-problem in each step and achieves an $\mathcal{O}(T^{2/3})$ regret bound. The gap between these regret bounds is removed by Mania et al. (2019) and Cohen et al. (2019) via two different approaches for the same $\mathcal{O}(\sqrt{T})$ frequentist regret bound. Later, Simchowitz and Foster (2020) establishes a lower bound on the regret of order $\mathcal{O}(\sqrt{d_{u}^{2}d_{x}T})$, where $d_{u}$ and $d_{x}$ are the dimensions of the actions and the states, and shows that a simple variant of certainty equivalent control matches the lower bound in both $T$ and the dimensions. Similar regret bounds have also been established under different settings and assumptions, such as Chen and Hazan (2021) in the adversarial setting and Lale et al. (2020a) without a stabilizing controller at the early stages of agent-environment interaction.

All the analyses above are in discrete time with an infinite time horizon. In all these problems, adaptive control algorithms are shown to achieve logarithmic regret bounds when additional information regarding the parameters of the system (often referred to as identifiability conditions) is available. Indeed, Faradonbeh et al. (2018b, 2020) prove that the certainty equivalent adaptive regulator achieves logarithmic regret bounds if the system parameter satisfies certain sparsity or low-rankness conditions. Cassel et al. (2020) establishes logarithmic regret bounds when either the state transition matrix is unknown, or when the state-action transition matrix is unknown and the optimal policy is non-degenerate. For partially observable linear dynamical systems, which include the linear-quadratic Gaussian problem as a special case, Lale et al. (2020b) proposes an algorithm with a logarithmic regret bound, under the assumption that one has access to a set in which all controllers persistently excite the system to approximate the optimal control. Logarithmic regret bounds in the adversarial setting with known dynamics parameters have been established in Agarwal et al. (2019); Foster and Simchowitz (2020).

1.2 Continuous-time RL

Most real-world control systems, such as those in aerospace, automotive industry and robotics, are naturally continuous-time dynamical systems. So are their related physical tasks, such as inverted pendulum problems, cart-pole balancing problems, and legged robot problems. Continuous-time finite-time horizon LQ control problems can be found in portfolio optimization Wang and Zhou (2020), algorithmic trading Cartea et al. (2018), production management of exhaustible resources Graber (2016), and biological movement systems Bailo et al. (2018).

Analysis for continuous-time LQ-RL and general RL problems, however, is fairly limited. The primary approach is to develop learning algorithms after discretizing both time and space, and to establish convergence as the discretization parameters tend to zero. For instance, Munos (2006) proposes a policy gradient algorithm and shows the convergence of the policy gradient estimate to the true gradient. Munos and Bourgine (1998); Munos (2000) design learning algorithms by discretizing the Bellman equations of the underlying control problems and prove the asymptotic convergence of their algorithms. For the LQ system, attention has mostly been on algorithm design, including the integral reinforcement learning algorithm in Modares and Lewis (2014), and the policy iteration algorithm in Rizvi and Lin (2018). Yet, very little is known regarding the convergence rate or the regret bound of all these algorithms. Indeed, despite the natural analytical connection between LQ control and RL, the best known theoretical work for continuous-time LQ-RL is still due to Duncan et al. (1999), where an asymptotically sublinear regret for an ergodic model has been derived via a weighted least-squares-based estimation approach. Nevertheless, the exact order of the regret bound has not been studied.

Issues and challenges from non-asymptotic analysis.

It is insufficient and improper to rely solely on the analysis and algorithms for discrete-time RL to solve continuous-time problems. There is a mismatch between the algorithms' timescale for the former and the underlying system's timescale for the latter. When designing algorithms that make observations and take actions at discrete time points, it is important to take this model mismatch into consideration. For instance, the empirical studies in Tallec et al. (2019) suggest that vanilla $Q$-learning methods exhibit degraded performance as the time stepsize decreases, while a proper scaling of learning rates with the stepsize leads to more robust performance.

The questions are therefore: A) How to quantify the precise impacts of the observation stepsize and action stepsize on algorithm performance? B) How to derive non-asymptotic regret analysis for learning algorithms in continuous-time LQ-RL (or general RL) systems, analogous to the discrete-time LQ-RL counterpart?

There are technical reasons behind the limited theoretical progress in the continuous-time domain for RL, including LQ-RL. In addition to the known difficulty for analyzing stochastic control problems, the learning component compounds the problem complexity and poses new challenges.

For instance, the counterpart in the continuous-time problem to the algebraic equations in Mania et al. (2019) for the discrete-time version is the regularity and stability of the continuous-time Riccati equation and the regularity of feedback controls. While the Riccati equation, its robustness, and the existence and uniqueness of optimal controls have been well studied in the control literature, the regularity of feedback controls with respect to the underlying model is completely new to control theory and crucial for algorithm design and its robustness analysis. Moreover, deriving the exact order of the regret bound requires developing new and different techniques than those used for the asymptotic regret analysis in Duncan et al. (1999).

Our work and contributions.

This paper studies finite-time horizon continuous-time LQ-RL problems in an episodic setting.

  • It first proposes a greedy least-squares algorithm based on continuous-time observations and controls. At each iteration, the algorithm estimates the unknown parameters by a regularized least-squares estimator based on observed trajectories, then designs linear feedback controls via the Riccati differential equation for the estimated model. It identifies conditions under which the unknown state transition matrix and state-action transition matrix are uniquely identifiable under the optimal policies (Remark 2.1 and Proposition 2.1). By exploiting the identifiability of coefficients, this continuous-time least-squares algorithm is shown to have a logarithmic regret of the magnitude $\mathcal{O}((\ln M)(\ln\ln M))$, with $M$ being the number of learning episodes (Theorem 2.2). To the best of our knowledge, this is the first non-asymptotic logarithmic regret bound for continuous-time LQ-RL problems with unknown state and control coefficients.

  • It then proposes a practically implementable least-squares algorithm based on discrete-time observations and controls. At each iteration, the algorithm estimates the unknown parameters by observing continuous-time trajectories at discrete time points, then designs a piecewise constant linear feedback control via Riccati difference equations for an associated discrete-time LQ-RL problem. It shows that the regret of the discrete-time least-squares algorithm is of the magnitude $\mathcal{O}\big((\ln M)(\ln\ln M)+\sum_{\ell=0}^{\ln M}2^{\ell}\tau_{\ell}^{2}\big)$, where $\tau_{\ell}$ is the time stepsize used in the $(\ell+1)$-th update of model parameters (Theorem 2.3). Our analysis shows that scaling the regularization parameter of the discrete-time least-squares estimator with respect to the time stepsize is critical for a robust performance of the algorithm in different timescales (Remark 2.3). To the best of our knowledge, this is the first discrete-time algorithm with a rigorous regret bound for continuous-time LQ-RL problems.

Different from the least-squares algorithms for ergodic LQ problems (see e.g., Duncan et al. (1999); Mania et al. (2019)), our continuous-time least-squares algorithm constructs feedback controls via Riccati differential equations instead of the algebraic equations in Mania et al. (2019). Here, the regularity and stability of the continuous-time Riccati equation are analyzed in order to establish the robustness of feedback controls.

Moreover, our analysis for the estimation error exploits extensively the sub-exponential tail behavior of the least-squares estimators. This probabilistic approach differs from the asymptotic sublinear regret analysis in Duncan et al. (1999); it establishes the exact order of the logarithmic regret bound by the concentration inequality for the error bound.

In addition, our analysis also exploits an important self-exploration property of finite-time horizon continuous-time LQ-RL problems, for which the time-dependent optimal feedback matrices ensure that the optimal state and control processes span the entire parameter space. This property allows us to design exploration-free learning algorithms with logarithmic regret bounds. Furthermore, we provide explicit conditions on models that guarantee the successful identification of the unknown parameters with optimal feedback policies. This is in contrast to the identification conditions for logarithmic regret bounds in discrete-time infinite-time-horizon LQ problems. Our conditions apply to arbitrary finite time-horizon problems, without imposing sparsity or low-rankness conditions on system parameters as in Faradonbeh et al. (2018b, 2020) or requiring these parameters to be partially known to the controller as in Cassel et al. (2020); Foster and Simchowitz (2020).

Finally, our analysis provides the precise parameter estimation error in terms of the sample size and time stepsize, and quantifies the performance gap between applying a piecewise-constant policy from an incorrect model and applying the optimal policy. The misspecification error scales linearly with respect to the stepsize, and the performance gap depends quadratically on the time stepsize and the magnitude of parameter perturbations. Our analysis is based on the first-order convergence of Riccati difference equations and a uniform sub-exponential tail bound of discrete-time least-squares estimators.

Notation.

For each $n\in{\mathbb{N}}$, we denote by $I=I_{n}$ the $n\times n$ identity matrix, and by ${\mathbb{S}}^{n}_{0}$ (resp. ${\mathbb{S}}^{n}_{+}$) the space of symmetric positive semidefinite (resp. definite) matrices. We denote by $|\cdot|$ the Euclidean norm of a given Euclidean space, by $\|\cdot\|_{2}$ the matrix norm induced by Euclidean norms, and by $A^{\top}$ and $\textnormal{tr}(A)$ the transpose and trace of a matrix $A$, respectively. For each $T>0$, filtered probability space $(\Omega,\mathcal{F},{\mathbb{F}}=\{\mathcal{F}_{t}\}_{t\in[0,T]},\mathbb{P})$ satisfying the usual condition and Euclidean space $(E,|\cdot|)$, we introduce the following spaces:

  • $C([0,T];E)$ is the space of continuous functions $\phi:[0,T]\rightarrow E$ satisfying $\|\phi\|_{C([0,T];E)}=\sup_{t\in[0,T]}|\phi_{t}|<\infty$;

  • $C^{1}([0,T];E)$ is the space of continuously differentiable functions $\phi:[0,T]\rightarrow E$ satisfying $\|\phi\|_{C^{1}([0,T];E)}=\sup_{t\in[0,T]}(|\phi_{t}|+|\phi^{\prime}_{t}|)<\infty$;

  • $\mathcal{S}^{2}(E)$ is the space of $E$-valued ${\mathbb{F}}$-progressively measurable càdlàg processes $X:\Omega\times[0,T]\rightarrow E$ satisfying $\|X\|_{\mathcal{S}^{2}(E)}={\mathbb{E}}[\sup_{t\in[0,T]}|X_{t}|^{2}]^{1/2}<\infty$;

  • $\mathcal{H}^{2}(E)$ is the space of $E$-valued ${\mathbb{F}}$-progressively measurable processes $X:\Omega\times[0,T]\rightarrow E$ satisfying $\|X\|_{\mathcal{H}^{2}(E)}={\mathbb{E}}[\int_{0}^{T}|X_{t}|^{2}\,{\mathrm{d}}t]^{1/2}<\infty$.

For notational simplicity, we denote by $C\in[0,\infty)$ a generic constant, which depends only on the constants appearing in the assumptions and may take a different value at each occurrence.

2 Problem formulation and main results

2.1 Linear-quadratic reinforcement learning problem

In this section, we consider the linear-quadratic reinforcement learning (LQ-RL) problem, where the drift coefficient of the state dynamics is unknown to the controller.

More precisely, let $T\in(0,\infty)$ be a given terminal time, $W$ be an $n$-dimensional standard Brownian motion defined on a complete probability space $(\Omega,\mathcal{F},\mathbb{P})$, and ${\mathbb{F}}=(\mathcal{F}_{t})_{t\in[0,T]}$ be the filtration generated by $W$ augmented by the $\mathbb{P}$-null sets. Let $x_{0}\in{\mathbb{R}}^{n}$ be a given initial state and $(A^{\star},B^{\star})\in{\mathbb{R}}^{n\times n}\times{\mathbb{R}}^{n\times d}$ be fixed but unknown matrices, and consider the following problem:

\inf_{U\in\mathcal{H}^{2}({\mathbb{R}}^{d})}J^{\theta^{\star}}(U),\quad\textnormal{with}\quad J^{\theta^{\star}}(U)={\mathbb{E}}\left[\int_{0}^{T}\big((X^{\theta^{\star},U}_{t})^{\top}QX^{\theta^{\star},U}_{t}+(U_{t})^{\top}RU_{t}\big)\,{\mathrm{d}}t\right], (2.1)

where for each $U\in\mathcal{H}^{2}({\mathbb{R}}^{d})$, the process $X^{\theta^{\star},U}\in\mathcal{S}^{2}({\mathbb{R}}^{n})$ satisfies the following controlled dynamics associated with the parameter $\theta^{\star}=(A^{\star},B^{\star})^{\top}$:

{\mathrm{d}}X_{t}=(A^{\star}X_{t}+B^{\star}U_{t})\,{\mathrm{d}}t+{\mathrm{d}}W_{t},\quad t\in[0,T];\quad X_{0}=x_{0}, (2.2)

with given matrices $Q\in{\mathbb{S}}^{n}_{0}$ and $R\in{\mathbb{S}}^{d}_{+}$. Note that we assume the loss functional (2.1) only involves a time-homogeneous running cost to allow a direct comparison with infinite-time horizon RL problems (see e.g., Duncan et al. (1992)), but a similar analysis can be performed if the cost functions are time inhomogeneous, a terminal cost is included, or the Brownian motion $W$ in (2.2) is scaled by a known nonsingular diffusion matrix.

If the parameter $\theta^{\star}=(A^{\star},B^{\star})^{\top}$ is known to the controller, then (2.1)-(2.2) reduces to the classical LQ control problem. In this case, it is well known (see e.g., Yong and Zhou (1999) and the references therein) that the optimal control $U^{\theta^{\star}}$ of (2.1)-(2.2) is given in feedback form by

U^{\theta^{\star}}_{t}=\psi^{\theta^{\star}}(t,X^{\theta^{\star}}_{t}),\quad\textnormal{with $\psi^{\theta^{\star}}(t,x)=K^{\theta^{\star}}_{t}x$, $\forall(t,x)\in[0,T]\times{\mathbb{R}}^{n}$,} (2.3)

where $K^{\theta^{\star}}_{t}=-R^{-1}(B^{\star})^{\top}P^{\theta^{\star}}_{t}$ for all $t\in[0,T]$, $(P_{t}^{\theta^{\star}})_{t\in[0,T]}$ solves the Riccati equation

\tfrac{{\mathrm{d}}}{{\mathrm{d}}t}P_{t}+(A^{\star})^{\top}P_{t}+P_{t}A^{\star}-P_{t}(B^{\star}R^{-1}(B^{\star})^{\top})P_{t}+Q=0,\quad t\in[0,T];\quad P_{T}=0, (2.4)

and $X^{\theta^{\star}}$ is the state process governed by the following dynamics:

{\mathrm{d}}X_{t}=(A^{\star}X_{t}+B^{\star}K^{\theta^{\star}}_{t}X_{t})\,{\mathrm{d}}t+{\mathrm{d}}W_{t},\quad t\in[0,T];\quad X_{0}=x_{0}. (2.5)
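For concreteness, the following is a minimal numerical sketch (not part of the original analysis) of how the feedback gains in (2.3)-(2.4) can be computed: the Riccati equation is integrated backward in time on a uniform grid with a simple explicit scheme. The function name riccati_gains, the grid size and the use of a backward Euler step are illustrative assumptions, not prescriptions from the paper.

import numpy as np

def riccati_gains(A, B, Q, R, T, N=1000):
    # Solve the Riccati ODE (2.4) backward on a uniform grid and return the
    # feedback gains K_t = -R^{-1} B^T P_t from (2.3).  Backward Euler is used
    # purely for illustration; any ODE solver would do.
    n, d = A.shape[0], B.shape[1]
    tau = T / N
    Rinv = np.linalg.inv(R)
    P = np.zeros((N + 1, n, n))          # terminal condition P_T = 0
    K = np.zeros((N, d, n))
    for i in range(N - 1, -1, -1):
        Pi = P[i + 1]
        # dP/dt = -(A^T P + P A - P B R^{-1} B^T P + Q), stepped backward in time
        dP = A.T @ Pi + Pi @ A - Pi @ B @ Rinv @ B.T @ Pi + Q
        P[i] = Pi + tau * dP
        K[i] = -Rinv @ B.T @ P[i]
    return np.linspace(0.0, T, N + 1), P, K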

To solve the LQ-RL problem (2.1)-(2.2) with unknown $\theta^{\star}$, the controller searches for the optimal control while simultaneously learning the system, i.e., the matrices $A^{\star},B^{\star}$. In an episodic (also known as reset or restart) learning framework, the controller improves her knowledge of the underlying dynamics $X_{t}$ through successive learning episodes, in order to find a control that is close to the optimal one.

Mathematically, it goes as follows. Let $M\in\mathbb{N}$ be the total number of learning episodes. In the $i$-th learning episode, $i=1,\ldots,M$, a feedback control $\psi^{i}$ is exercised, and the state process $X^{\psi^{i}}$ evolves according to the dynamics (2.2) controlled by the policy $\psi^{i}$:

{\mathrm{d}}X_{t}=(A^{\star}X_{t}+B^{\star}\psi^{i}(t,X_{t}))\,{\mathrm{d}}t+{\mathrm{d}}W_{t}^{i},\quad t\in[0,T];\quad X_{0}=x_{0}. (2.6)

Here $W^{i}$, $i=1,2,\ldots,M$, are independent $n$-dimensional Brownian motions defined on the same probability space $(\Omega,\mathcal{F},\mathbb{P})$. The (expected) cost of learning in the $i$-th episode is then given by

J^{\theta^{\star}}(U^{\psi^{i}})=\mathbb{E}\left[\int_{0}^{T}\big((X^{\psi^{i}}_{t})^{\top}QX^{\psi^{i}}_{t}+(U^{\psi^{i}}_{t})^{\top}RU^{\psi^{i}}_{t}\big)\,{\mathrm{d}}t\right],\;\textnormal{with $U^{\psi^{i}}_{t}\coloneqq\psi^{i}(t,X^{\psi^{i}}_{t})$, $t\in[0,T]$,} (2.7)

and the (expected) regret of learning up to $M\in{\mathbb{N}}$ episodes (with the sequence of controls $(U^{\psi^{i}})_{i=1}^{M}$) is defined as follows:

R(M)=\sum_{i=1}^{M}\Big(J^{\theta^{\star}}(U^{\psi^{i}})-J^{\theta^{\star}}(U^{\theta^{\star}})\Big), (2.8)

where $J^{\theta^{\star}}(U^{\theta^{\star}})$ is the optimal cost of (2.1)-(2.2) when $A^{\star},B^{\star}$ are known. Intuitively, the regret characterizes the cumulative loss from taking sub-optimal policies in all episodes.

In the following, we shall propose several least-squares-based learning algorithms to solve (2.1)-(2.2), and prove that they achieve logarithmic regrets if $\theta^{\star}$ is identifiable (see Remark 2.1 for details).

2.2 Continuous-time least-squares algorithm and its regret bound

In this section, we consider a continuous-time least-squares algorithm, which chooses the optimal feedback control based on the current estimation of the parameter, and updates the parameter estimation based on the whole trajectories of the state dynamics.

More precisely, let $\theta=(A,B)^{\top}\in{\mathbb{R}}^{(n+d)\times n}$ be the current estimate of the unknown parameter $\theta^{\star}$; then the controller would exercise the optimal feedback control $\psi^{\theta}$ for (2.1)-(2.2) with $\theta^{\star}$ replaced by $\theta$, i.e.,

\psi^{\theta}(t,x)=K^{\theta}_{t}x,\quad K^{\theta}_{t}\coloneqq-R^{-1}B^{\top}P^{\theta}_{t},\quad\forall(t,x)\in[0,T]\times{\mathbb{R}}^{n}, (2.9)

where $P^{\theta}$ satisfies the Riccati equation (2.4) with $\theta^{\star}$ replaced by $\theta$:

\tfrac{{\mathrm{d}}}{{\mathrm{d}}t}P_{t}+A^{\top}P_{t}+P_{t}A-P_{t}(BR^{-1}B^{\top})P_{t}+Q=0,\quad t\in[0,T];\quad P_{T}=0. (2.10)

This leads to the state process $X^{\psi^{\theta}}$ satisfying (cf. (2.6)):

{\mathrm{d}}X_{t}=(A^{\star}X_{t}+B^{\star}\psi^{\theta}(t,X_{t}))\,{\mathrm{d}}t+{\mathrm{d}}W_{t},\quad t\in[0,T];\quad X_{0}=x_{0}. (2.11)

We proceed to derive an $\ell_{2}$-regularized least-squares estimator of $\theta^{\star}$ based on sampled trajectories of $X^{\psi^{\theta}}$. Observe from (2.11) that

Z^{\psi^{\theta}}_{t}({\mathrm{d}}X^{\psi^{\theta}}_{t})^{\top}=Z^{\psi^{\theta}}_{t}(Z^{\psi^{\theta}}_{t})^{\top}\theta^{\star}{\mathrm{d}}t+Z^{\psi^{\theta}}_{t}({\mathrm{d}}W_{t})^{\top},\quad\textnormal{with $Z^{\psi^{\theta}}_{t}=\begin{pmatrix}X^{\psi^{\theta}}_{t}\\ \psi^{\theta}(t,X^{\psi^{\theta}}_{t})\end{pmatrix}$ for all $t\in[0,T]$.}

Hence the martingale property of the Itô integral implies that

\theta^{\star}=\left(\mathbb{E}\left[\int_{0}^{T}Z^{\psi^{\theta}}_{t}(Z^{\psi^{\theta}}_{t})^{\top}\,{\mathrm{d}}t\right]\right)^{-1}\mathbb{E}\left[\int_{0}^{T}Z^{\psi^{\theta}}_{t}({\mathrm{d}}X^{\psi^{\theta}}_{t})^{\top}\right], (2.12)

provided that $\mathbb{E}\big[\int_{0}^{T}Z^{\psi^{\theta}}_{t}(Z^{\psi^{\theta}}_{t})^{\top}\,{\mathrm{d}}t\big]$ is invertible. This suggests a practical rule to improve one's estimate $\theta$ of the true parameter $\theta^{\star}$: replace the expectations in (2.12) with empirical averages over independent realizations. More precisely, let $m\in{\mathbb{N}}$ and let $(X^{\psi^{\theta},i}_{t},\psi^{\theta}(t,X^{\psi^{\theta},i}_{t}))_{t\in[0,T]}$, $i=1,\ldots,m$, be the trajectories of $m$ independent realizations of the state and control processes; we then update the estimate $\theta$ by the following rule, inspired by (2.12):

\theta\longleftarrow\bigg(\frac{1}{m}\sum_{i=1}^{m}\int_{0}^{T}Z^{\psi^{\theta},i}_{t}(Z^{\psi^{\theta},i}_{t})^{\mathsf{T}}\,{\mathrm{d}}t+\frac{1}{m}I\bigg)^{-1}\bigg(\frac{1}{m}\sum_{i=1}^{m}\int_{0}^{T}Z^{\psi^{\theta},i}_{t}({\mathrm{d}}X^{\psi^{\theta},i}_{t})^{\mathsf{T}}\bigg), (2.13)

where $Z^{\psi^{\theta},i}_{t}\coloneqq\begin{pmatrix}X^{\psi^{\theta},i}_{t}\\ \psi^{\theta}(t,X^{\psi^{\theta},i}_{t})\end{pmatrix}$ for all $t\in[0,T]$ and $i=1,\ldots,m$, and $I$ is the $(n+d)\times(n+d)$ identity matrix.

The regularization term $\frac{1}{m}I$ in (2.13) guarantees the existence of the required matrix inverse and vanishes as $m\rightarrow\infty$. The estimator (2.13) can be equivalently expressed as an $\ell_{2}$-regularized least-squares estimator, as pointed out in Duncan et al. (1992) for the ergodic LQ-RL problem.
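As an illustration of the update rule (2.13), the sketch below approximates the time integrals by Riemann sums along an Euler--Maruyama simulation of (2.11); the helper name ls_update, the simulation grid and the way the initial state $x_{0}$ is passed in are assumptions made purely for this example, not a prescribed implementation.

import numpy as np

def ls_update(A_star, B_star, K, x0, T, m, rng=None):
    # One update of the regularised least-squares rule (2.13).  The continuous-time
    # integrals are approximated on the grid on which the gains K were computed;
    # this fine discretisation is purely illustrative.
    rng = rng or np.random.default_rng()
    n, d = A_star.shape[0], B_star.shape[1]
    N = K.shape[0]
    tau = T / N
    V = np.zeros((n + d, n + d))      # approximates (1/m) sum_i int_0^T Z Z^T dt
    Y = np.zeros((n + d, n))          # approximates (1/m) sum_i int_0^T Z (dX)^T
    for _ in range(m):
        x = x0.copy()
        for i in range(N):
            u = K[i] @ x
            z = np.concatenate([x, u])
            dx = (A_star @ x + B_star @ u) * tau + np.sqrt(tau) * rng.standard_normal(n)
            V += np.outer(z, z) * tau / m
            Y += np.outer(z, dx) / m
            x = x + dx
    # regularised inverse as in (2.13): (V^m + I/m)^{-1} Y^m
    return np.linalg.solve(V + np.eye(n + d) / m, Y)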

We summarize the continuous-time least-squares algorithm as follows.

Algorithm 1 Continuous-time least-squares algorithm
1:  Input: Choose an initial estimate $\theta_{0}$ of $\theta^{\star}$ and numbers of learning episodes $\{m_{\ell}\}_{\ell\in{\mathbb{N}}\cup\{0\}}$.
2:  for $\ell=0,1,\cdots$ do
3:     Obtain the feedback control $\psi^{\theta_{\ell}}$ as in (2.9) with $\theta=\theta_{\ell}$.
4:     Execute the feedback control $\psi^{\theta_{\ell}}$ for $m_{\ell}$ independent episodes, and collect the trajectory data $(X^{\psi^{\theta_{\ell}},i}_{t},\psi^{\theta_{\ell}}(t,X^{\psi^{\theta_{\ell}},i}_{t}))_{t\in[0,T]}$, $i=1,\ldots,m_{\ell}$.
5:     Obtain an updated estimate ${\theta}_{\ell+1}$ by using (2.13) and the $m_{\ell}$ trajectories collected above.
6:  end for
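A minimal driver tying the two sketches above together is given below; it mirrors the cycle structure of Algorithm 1 with the doubling schedule $m_{\ell}=2^{\ell}m_{0}$ analysed in Theorem 2.2, and reuses the illustrative helpers riccati_gains and ls_update introduced earlier (both are assumptions for the example, not an official implementation).

def algorithm1(theta0, A_star, B_star, Q, R, x0, T, m0, num_cycles):
    # Sketch of Algorithm 1.  theta stacks (A, B)^T and has shape (n+d, n),
    # matching the convention theta = (A, B)^T used in the text.
    n = A_star.shape[0]
    theta = theta0
    for ell in range(num_cycles):
        A_hat, B_hat = theta[:n, :].T, theta[n:, :].T    # unpack the current estimate
        _, _, K = riccati_gains(A_hat, B_hat, Q, R, T)   # step 3: gains for the estimated model
        theta = ls_update(A_star, B_star, K, x0, T, m=(2 ** ell) * m0)  # steps 4-5
    return theta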

Note that Algorithm 1 operates in cycles, with $m_{\ell}$ the number of episodes in the $\ell$-th cycle. Hence, the regret of learning up to $M$ episodes (cf. (2.8)) can be upper bounded by the accumulated regret at the end of the $L$-th cycle, where $L$ is the smallest integer such that $\sum_{\ell=0}^{L}m_{\ell}\geq M$.

In this section, we analyze the regret of Algorithm 1 based on the following assumptions of the learning problem (2.1)-(2.2).

H.​​ 1
  1. (1)

    $T\in(0,\infty)$, $n,d\in{\mathbb{N}}$, $x_{0}\in{\mathbb{R}}^{n}$, $A^{\star}\in{\mathbb{R}}^{n\times n}$, $B^{\star}\in{\mathbb{R}}^{n\times d}$, $Q\in{\mathbb{S}}^{n}_{0}$ and $R\in{\mathbb{S}}^{d}_{+}$.

  2. (2)

    $\{v\in{\mathbb{R}}^{d}\mid(K^{\theta^{\star}}_{t})^{\top}v=0,\;\forall t\in[0,T]\}=\{0\}$, with $K^{\theta^{\star}}$ defined in (2.3).

Before discussing the regret of Algorithm 1, we make the following remark on (H.1).

Remark 2.1 (Self-exploration of finite-time horizon RL problems)

(H.1(1)) is the standard assumption for finite-time horizon LQ-RL problems (see e.g., Hambly et al. (2020)), except that (H.1(1)) allows $Q$ to be positive semidefinite, which is important for costs depending on partial states. (H.1(2)) corresponds to the identifiability of the true parameter $\theta^{\star}$ by executing the optimal policy $K^{\theta^{\star}}$. In fact, as shown in Proposition 3.10, under (H.1(1)), (H.1(2)) is equivalent to the following statement:

  1. (2’)

    if $u\in{\mathbb{R}}^{n}$ and $v\in{\mathbb{R}}^{d}$ satisfy $u^{\top}X^{\theta^{\star}}+v^{\top}U^{\theta^{\star}}=0$ ${\mathrm{d}}\mathbb{P}\otimes{\mathrm{d}}t$-almost everywhere in $\Omega\times[0,T]$, then $u=0$ and $v=0$, where $X^{\theta^{\star}}$ and $U^{\theta^{\star}}$ are the optimal state and control processes of (2.1)-(2.2) defined by (2.5) and (2.3), respectively.

Item (2’) indicates an important self-exploration property of finite-time horizon continuous-time RL problems. In particular, the time-dependent optimal feedback matrix $K^{\theta^{\star}}$ and the non-degenerate noises guarantee the non-degeneracy of the space spanned by $X^{\theta^{\star}}$ and $U^{\theta^{\star}}$, enabling the parameters to be learned sufficiently well. This self-exploration property is critical for our design of exploration-free learning algorithms for (2.1)-(2.2) with a logarithmic regret (see Theorems 2.2 and 2.3).

One can easily show that (H.1(2)) holds if the optimal policy $(K^{\theta^{\star}}_{t})_{t\in[0,T]}$ is nondegenerate, i.e., $\sup_{t\in[0,T]}\lambda_{\min}\left((K^{\theta^{\star}}_{t})(K^{\theta^{\star}}_{t})^{\top}\right)>0$. A similar nondegeneracy condition has been imposed in Cassel et al. (2020) for discrete-time ergodic LQ-RL problems. In particular, by assuming that the optimal stationary policy satisfies $\lambda_{\min}(K^{\star}(K^{\star})^{\top})>0$ (along with other controllability conditions), they propose learning algorithms with a logarithmic regret, under the assumption that only the control coefficient $B^{\star}$ is unknown. In contrast, we allow both the state coefficient $A^{\star}$ and the control coefficient $B^{\star}$ to be unknown.

Moreover, the following proposition provides sufficient conditions for (H.1(2)), which are special cases of Proposition 3.11.

Proposition 2.1

Let $n,d\in{\mathbb{N}}$, $Q\in{\mathbb{S}}_{0}^{n}$ and $R\in{\mathbb{S}}_{+}^{d}$.

  1. (1)

    If $(B^{\star})^{\top}QB^{\star}\in{\mathbb{S}}^{d}_{+}$, then (H.1(2)) holds for all $T>0$.

  2. (2)

    Assume that the algebraic Riccati equation $(A^{\star})^{\top}P+PA^{\star}-P(B^{\star}R^{-1}(B^{\star})^{\top})P+Q=0$ admits a unique maximal solution $P^{\star}_{\infty}\in{\mathbb{S}}^{n}_{+}$. Let $K^{\star}_{\infty}=-R^{-1}(B^{\star})^{\top}P^{\star}_{\infty}$, and for each $T>0$, let $P^{\star,(T)}\in C([0,T];{\mathbb{S}}_{0}^{n})$ be defined in (2.4). Assume that $\lim_{T\rightarrow\infty}P^{\star,(T)}_{0}=P^{\star}_{\infty}$ and $K^{\star}_{\infty}(K^{\star}_{\infty})^{\top}\in{\mathbb{S}}_{+}^{d}$. Then there exists $T_{0}>0$ such that (H.1(2)) holds for all $T\geq T_{0}$.

Proposition 2.1 provides two sets of conditions for (H.1(2)) under two different scenarios: Item 1 applies to an arbitrary finite $T>0$, while Item 2 applies only to sufficiently large $T$. Item 2 assumes a certain asymptotic behavior of the solutions to Riccati differential equations, which can be ensured by the stabilizability of the pair $(A^{\star},B^{\star})$ and the detectability of the pair $(A^{\star},Q^{1/2})$ (see (Bitmead and Gevers, 1991, Theorems 10.9 and 10.10)). Note that our subsequent analysis is based on (H.1) and does not require stabilizability assumptions.

Remark 2.2 (Stabilizability of $(A^{\star},B^{\star})$ and dependence on $T$)

Since the LQ-RL problem (2.1)-(2.2) is over the time horizon $[0,T]$ with a fixed $T<\infty$, in general one does not need additional conditions on $(A^{\star},B^{\star})$ for the well-definedness of (2.1)-(2.2). If $T=\infty$, then some stabilizability/controllability conditions on $(A^{\star},B^{\star})$ may be required for (2.1)-(2.2) to ensure a well-defined solution (see e.g., Dean et al. (2019)). Under these conditions, different algorithms have been shown to achieve sub-linear regret with respect to the number of decision steps (see e.g., Mania et al. (2019); Cohen et al. (2019)), and even logarithmic regret provided that further identifiability assumptions are satisfied (see e.g., Faradonbeh et al. (2018b, 2020); Cassel et al. (2020); Lale et al. (2020b)); see Section 1.1 for more details. For $T<\infty$, the regret of learning algorithms for (2.1)-(2.2) in general depends exponentially on the time horizon $T$ (e.g., through the constants $C_{0},C^{\prime}$ in Theorem 2.2), as the moments of the optimal state process $X^{\theta^{\star}}$ and control process $U^{\theta^{\star}}$ may grow exponentially with respect to $T$. It would be interesting to quantify the precise dependence of the regret bounds on $T$. This would entail deriving precise a priori bounds on solutions to (2.10) and estimating the norm $\|(\mathbb{E}[\int_{0}^{T}Z^{\theta^{\star}}_{t}(Z^{\theta^{\star}}_{t})^{\top}\,{\mathrm{d}}t])^{-1}\|_{2}$ in terms of $(A^{\star},B^{\star},Q,T)$, and is left for future research.

We are now ready to state the main result of this section, which shows that the regret of Algorithm 1 grows logarithmically with respect to the number of episodes.

Theorem 2.2

Suppose (H.1) holds and let $\theta_{0}=(A_{0},B_{0})^{\top}\in{\mathbb{R}}^{(n+d)\times n}$ be such that (H.1(2)) holds with $\theta_{0}$. Then there exists a constant $C_{0}>0$ such that for all $C\geq C_{0}$ and $\delta\in(0,\frac{3}{\pi^{2}})$, if one sets $m_{0}=C(-\ln\delta)$ and $m_{\ell}=2^{\ell}m_{0}$ for all $\ell\in{\mathbb{N}}$, then with probability at least $1-\frac{\pi^{2}\delta}{3}$, the regret of Algorithm 1 given by (2.8) satisfies

R(M)\leq C^{\prime}\big((\ln M)(\ln\ln M)+(-\ln\delta)(\ln M)\big),\quad\forall M\in{\mathbb{N}},

where $C^{\prime}$ is a constant independent of $M$ and $\delta$.

To simplify the presentation, we analyze the performance of Algorithm 1 by assuming that the number of learning episodes $\{m_{\ell}\}_{\ell}$ is doubled between two successive updates of the estimation of $\theta^{\star}$. Similar regret results can be established for Algorithm 1 with different choices of $\{m_{\ell}\}_{\ell}$. Under this specific choice of $\{m_{\ell}\}_{\ell}$, for any $M\in{\mathbb{N}}$, Algorithm 1 splits $M$ episodes into $L=\lceil\log_{2}(\frac{M}{m_{0}}+1)\rceil-1$ cycles, where the $\ell$-th cycle, $\ell=0,1,\ldots,L-1$, contains $m_{\ell}$ episodes, and the remaining $M-\sum_{\ell=0}^{L-1}m_{\ell}$ episodes are in the last cycle.
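For concreteness, the cycle count can be read off from a geometric sum; the short computation below merely restates why $L=\lceil\log_{2}(\frac{M}{m_{0}}+1)\rceil-1$ is the number of complete cycles preceding the final (possibly incomplete) one:
\[
\sum_{\ell=0}^{L-1}m_{\ell}=m_{0}\sum_{\ell=0}^{L-1}2^{\ell}=(2^{L}-1)\,m_{0}<M
\quad\Longleftrightarrow\quad
2^{L}<\frac{M}{m_{0}}+1
\quad\Longleftrightarrow\quad
L<\log_{2}\Big(\frac{M}{m_{0}}+1\Big),
\]
and the largest such integer is $L=\lceil\log_{2}(\frac{M}{m_{0}}+1)\rceil-1$.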

Sketched proof of Theorem 2.2.

We outline the key steps of the proof of Theorem 2.2 and defer the detailed arguments to Section 3.3. By exploiting the regularity and robustness of solutions to (2.10), we prove that the performance gap $J^{\theta^{\star}}(U^{\psi^{\theta}})-J^{\theta^{\star}}(U^{\theta^{\star}})$ is of the magnitude $\mathcal{O}(|\theta-\theta^{\star}|^{2})$, for all a priori bounded $\theta$ (Proposition 3.8). We then establish a uniform sub-exponential property for the (deterministic and stochastic) integrals in (2.12), which along with (H.1(2)) and Bernstein's inequality leads to the following estimate of the parameter estimation error: for all $\delta\in(0,1/2)$, all sufficiently large $m\in{\mathbb{N}}$, and all $\theta$ sufficiently close to $\theta^{\star}$,

|\hat{\theta}-\theta^{\star}|\leq\mathcal{O}\Big(\sqrt{\frac{-\ln\delta}{m}}+\frac{-\ln\delta}{m}+\frac{(-\ln\delta)^{2}}{m^{2}}\Big),\quad\textnormal{with probability $1-2\delta$,} (2.14)

where $\hat{\theta}$ is generated by (2.13) with $\psi^{\theta}$ (Proposition 3.9). Then for each $\delta>0$, applying (2.14) with $\delta_{\ell}=\delta/(\ell+1)^{2}$ for all $\ell\in{\mathbb{N}}\cup\{0\}$ shows that with probability $1-2\sum_{\ell=0}^{\infty}\delta_{\ell}=1-\tfrac{\pi^{2}\delta}{3}$,

|\hat{\theta}_{\ell+1}-\theta^{\star}|^{2}\lesssim\frac{-\ln\delta_{\ell}}{m_{\ell}}+\frac{(-\ln\delta_{\ell})^{2}}{m^{2}_{\ell}}+\frac{(-\ln\delta_{\ell})^{4}}{m_{\ell}^{4}},\quad\forall\ell\in{\mathbb{N}}, (2.15)

where $\lesssim$ means the inequality holds with a multiplicative constant independent of $\delta$ and $\ell$. By the quadratic performance gap and the choice of $\{m_{\ell}\}_{\ell}$, the regret of Algorithm 1 up to the $M$-th episode can be bounded by the regret at the end of the $L$-th cycle with $L=\lceil\log_{2}(\frac{M}{m_{0}}+1)\rceil-1$:

R(M)\lesssim\sum_{\ell=0}^{L}m_{\ell}|\theta_{\ell}-\theta^{\star}|^{2}\lesssim\sum_{\ell=0}^{L}(-\ln\delta_{\ell})\left(1+\frac{-\ln\delta_{\ell}}{m_{\ell}}+\frac{(-\ln\delta_{\ell})^{3}}{m_{\ell}^{3}}\right). (2.16)

Observe that the choices of $\{\delta_{\ell}\}_{\ell}$ and $\{m_{\ell}\}_{\ell}$ ensure that $\sup_{\delta\in(0,\frac{3}{\pi^{2}}),\ell\in{\mathbb{N}}}\frac{-\ln\delta_{\ell}}{m_{\ell}}<\infty$. Hence, the right-hand side of (2.16) is of the magnitude $\mathcal{O}\left(\sum_{\ell=0}^{L}(-\ln\delta_{\ell})\right)$, which along with the choices of $\delta_{\ell}$ and $L$ leads to the desired regret bound; see the end of Section 3.3 for more details.
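To make the final step explicit, recall that $\delta_{\ell}=\delta/(\ell+1)^{2}$, so that
\[
\sum_{\ell=0}^{L}(-\ln\delta_{\ell})
=\sum_{\ell=0}^{L}\big(-\ln\delta+2\ln(\ell+1)\big)
\leq (L+1)(-\ln\delta)+2(L+1)\ln(L+1),
\]
and since $L=\lceil\log_{2}(\frac{M}{m_{0}}+1)\rceil-1=\mathcal{O}(\ln M)$, the right-hand side is of order $(-\ln\delta)(\ln M)+(\ln M)(\ln\ln M)$, matching the bound stated in Theorem 2.2.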

2.3 Discrete-time least-squares algorithm and its regret bound

Note that Algorithm 1 in Section 2.2 requires executing feedback controls and observing the corresponding state trajectories continuously. A common practice for solving continuous-time RL problems is to assume that in each learning episode the dynamics only evolves in discrete time, and then to estimate parameters according to discrete-time RL algorithms (see e.g., Munos and Bourgine (1998); Munos (2000, 2006); Tallec et al. (2019)). As the true dynamics evolves continuously, it is necessary to quantify the impact of the reaction stepsize on the algorithm performance.

In this section, we analyze the performance of the above procedure for solving (2.1)-(2.2). We adapt regularized least-squares algorithms for discrete-time LQ problems to the present setting, and establish their regret bounds in terms of the discretization stepsize. Our analysis shows that a proper scaling of the regularization term in the least-squares estimation in terms of stepsize is critical for a robust performance with respect to different timescales.

More precisely, for a given cycle (i.e., the index $\ell$ in Algorithm 1), let $\theta=(A,B)^{\top}\in{\mathbb{R}}^{(n+d)\times n}$ be the current estimate of $\theta^{\star}$ in (2.2), and let $\{t_{i}\}_{i=0}^{N}$, $N\in{\mathbb{N}}$, be a uniform partition of $[0,T]$ with stepsize $\tau=T/N$. We then assume that (2.1)-(2.2) is piecewise constant between any two grid points $\{t_{i}\}_{i=0}^{N}$, choose actions and make observations every $\tau$, and update the estimated parameter based on these observations. To this end, we consider the following discrete-time LQ control problem with parameter $\theta$:

\inf_{U\in\mathcal{H}^{2}_{N}({\mathbb{R}}^{d})}J_{N}(U),\quad\textnormal{with}\quad J_{N}(U)=\mathbb{E}\left[\sum_{i=0}^{N-1}\left((X^{U,\tau}_{t_{i}})^{\top}QX^{U,\tau}_{t_{i}}+U_{t_{i}}^{\top}RU_{t_{i}}\right)\tau\right], (2.17)

where $\mathcal{H}^{2}_{N}({\mathbb{R}}^{d})=\{U\in\mathcal{H}^{2}({\mathbb{R}}^{d})\mid U_{t}=U_{t_{i}},\ t\in[t_{i},t_{i+1}),\ i=0,\ldots,N-1\}$, and $(X^{U,\tau}_{t_{i}})_{i=0}^{N-1}$ are defined by

X^{U,\tau}_{t_{i+1}}-X^{U,\tau}_{t_{i}}=(AX^{U,\tau}_{t_{i}}+BU_{t_{i}})\tau+W_{t_{i+1}}-W_{t_{i}},\quad i=0,\ldots,N-1;\quad X^{U,\tau}_{0}=x_{0}. (2.18)

Note that for simplicity, our strategy is constructed by assuming a discrete-time dynamics arising from an Euler discretization of (2.2) (with the estimated parameter $\theta$); a similar analysis can be performed with a higher-order approximation of (2.1)-(2.2).

It is well known (see e.g., Bitmead and Gevers (1991)) that the optimal control of (2.17)-(2.18) is given in the following feedback form:

U_{t}=\psi^{\theta,\tau}(t,X^{U,\tau}_{t}),\quad\textnormal{with $\psi^{\theta,\tau}(t,x)=K^{\theta,\tau}_{t}x$, $\forall(t,x)\in[0,T)\times{\mathbb{R}}^{n}$,} (2.19)

where $K^{\theta,\tau}:[0,T)\rightarrow{\mathbb{R}}^{d\times n}$ is the piecewise constant function (with stepsize $\tau=T/N$) defined by

P^{\theta,\tau}_{t_{i-1}} = \tau Q+(I+\tau A)^{\top}P^{\theta,\tau}_{t_{i}}(I+\tau A)-(I+\tau A)^{\top}P^{\theta,\tau}_{t_{i}}\tau B(R+\tau B^{\top}P^{\theta,\tau}_{t_{i}}B)^{-1}B^{\top}P^{\theta,\tau}_{t_{i}}(I+\tau A),\quad \forall i=1,\dots,N;\quad P^{\theta,\tau}_{T}=0,
K^{\theta,\tau}_{t} = -(R+\tau B^{\top}P^{\theta,\tau}_{t_{i+1}}B)^{-1}B^{\top}P^{\theta,\tau}_{t_{i+1}}(I+\tau A),\quad t\in[t_{i},t_{i+1}),\ i=0,\ldots,N-1. (2.20)
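The backward recursion (2.20) is straightforward to implement; the sketch below (with the illustrative name discrete_riccati_gains, an assumption for this example) computes $P^{\theta,\tau}$ and the piecewise constant gains $K^{\theta,\tau}$ for a given estimate $(A,B)$.

import numpy as np

def discrete_riccati_gains(A, B, Q, R, T, N):
    # Riccati difference equation (2.20) for the Euler-discretised model with
    # stepsize tau = T/N; (A, B) is the current estimate of (A*, B*).
    n, d = A.shape[0], B.shape[1]
    tau = T / N
    I = np.eye(n)
    P = np.zeros((N + 1, n, n))          # terminal condition P_T = 0
    K = np.zeros((N, d, n))
    for i in range(N - 1, -1, -1):
        Pi = P[i + 1]
        M = I + tau * A
        G = np.linalg.inv(R + tau * B.T @ Pi @ B)
        P[i] = tau * Q + M.T @ Pi @ M - M.T @ Pi @ (tau * B) @ G @ B.T @ Pi @ M
        K[i] = -G @ B.T @ Pi @ M        # gain applied on [t_i, t_{i+1})
    return P, K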

We then implement the piecewise constant strategy $\psi^{\theta,\tau}$ defined in (2.19) on the original system (2.2) for $m$ episodes, and update the estimated parameter $\theta$ by observing (2.2) with stepsize $\tau=T/N$. More precisely, let $X^{\psi^{\theta,\tau}}\in\mathcal{S}^{2}({\mathbb{R}}^{n})$ be the state process associated with $\psi^{\theta,\tau}$:

{\mathrm{d}}X_{t}=(A^{\star}X_{t}+B^{\star}K^{\theta,\tau}_{t_{i}}X_{t})\,{\mathrm{d}}t+{\mathrm{d}}W_{t},\quad t\in[t_{i},t_{i+1}],\ i=0,\ldots,N-1;\quad X_{0}=x_{0}, (2.21)

and let $(X^{\psi^{\theta,\tau},j}_{t})_{t\in[0,T]}$, $j=1,\ldots,m$, $m\in{\mathbb{N}}$, be $m$ independent trajectories of $X^{\psi^{\theta,\tau}}\in\mathcal{S}^{2}({\mathbb{R}}^{n})$; we update the parameter $\theta$ according to the following discrete-time least-squares estimator:

\theta\leftarrow\operatorname*{arg\,min}_{\theta\in{\mathbb{R}}^{(n+d)\times n}}\sum_{j=1}^{m}\sum_{i=0}^{N-1}\|X^{\psi^{\theta,\tau},j}_{t_{i+1}}-X^{\psi^{\theta,\tau},j}_{t_{i}}-\tau\theta^{\top}Z^{\psi^{\theta,\tau},j}_{t_{i}}\|_{2}^{2}+\tau\,\textnormal{tr}(\theta^{\top}\theta), (2.22)

with $Z^{\psi^{\theta,\tau},j}_{t_{i}}\coloneqq\begin{pmatrix}X^{\psi^{\theta,\tau},j}_{t_{i}}\\ K^{\theta,\tau}_{t_{i}}X^{\psi^{\theta,\tau},j}_{t_{i}}\end{pmatrix}$ for all $i,j$. The update (2.22) is consistent with the agent's assumption that the state evolves according to (2.18) between two grid points. Setting the derivative (with respect to $\theta$) of the right-hand side of (2.22) to zero leads to

-\sum_{j=1}^{m}\sum_{i=0}^{N-1}\tau Z^{\psi^{\theta,\tau},j}_{t_{i}}\left(\left(X^{\psi^{\theta,\tau},j}_{t_{i+1}}-X^{\psi^{\theta,\tau},j}_{t_{i}}\right)^{\top}-\tau(Z^{\psi^{\theta,\tau},j}_{t_{i}})^{\top}\theta\right)+\tau\theta=0.

Dividing both sides by $m\tau$ and rearranging the terms gives the following equivalent expression of the discrete-time least-squares estimator (2.22):

{\theta}\longleftarrow\bigg(\frac{1}{m}\sum_{j=1}^{m}\sum_{i=0}^{N-1}Z^{\psi^{\theta,\tau},j}_{t_{i}}(Z^{\psi^{\theta,\tau},j}_{t_{i}})^{\top}\tau+\frac{1}{m}I\bigg)^{-1}\bigg(\frac{1}{m}\sum_{j=1}^{m}\sum_{i=0}^{N-1}Z^{\psi^{\theta,\tau},j}_{t_{i}}\left(X^{\psi^{\theta,\tau},j}_{t_{i+1}}-X^{\psi^{\theta,\tau},j}_{t_{i}}\right)^{\top}\bigg). (2.23)
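The estimator (2.23) only needs the states observed at the grid points and the gains used to generate them. The sketch below implements it; the function name and the array layout are assumptions for this example, and the $\frac{1}{m}I$ regularizer corresponds to the $\tau$-scaled penalty in (2.22).

import numpy as np

def discrete_ls_estimate(X, K, tau):
    # Discrete-time regularised least-squares estimator (2.23).
    # X has shape (m, N+1, n): m episodes observed at the N+1 grid points;
    # K has shape (N, d, n): the piecewise-constant gains used to generate the data.
    # Returns theta_hat of shape (n+d, n), stacking (A_hat, B_hat)^T.
    m, _, n = X.shape
    N, d = K.shape[0], K.shape[1]
    V = np.zeros((n + d, n + d))
    Y = np.zeros((n + d, n))
    for j in range(m):
        for i in range(N):
            z = np.concatenate([X[j, i], K[i] @ X[j, i]])
            V += np.outer(z, z) * tau
            Y += np.outer(z, X[j, i + 1] - X[j, i])
    return np.linalg.solve(V / m + np.eye(n + d) / m, Y / m)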
Remark 2.3 (Scaling hyper-parameters with timescales)

In principle, when applying discrete-time RL algorithms in a continuous environment, it is critical to adopt a proper scaling of the hyper-parameters for robust performance across different timescales. Indeed, scaling the regularization term $\textnormal{tr}(\theta^{\top}\theta)$ in (2.22) with respect to the stepsize $\tau$ is essential for the robustness of (2.23) for all small stepsizes $\tau$. If one instead updates $\theta$ by minimizing the following $\ell_{2}$-regularized loss function with a given hyper-parameter $\alpha<1$,

\operatorname*{arg\,min}_{\theta\in{\mathbb{R}}^{(n+d)\times n}}\sum_{j=1}^{m}\sum_{i=0}^{N-1}\|X^{\psi^{\theta,\tau},j}_{t_{i+1}}-X^{\psi^{\theta,\tau},j}_{t_{i}}-\tau\theta^{\top}Z^{\psi^{\theta,\tau},j}_{t_{i}}\|_{2}^{2}+\tau^{\alpha}\,\textnormal{tr}(\theta^{\top}\theta), (2.24)

then the corresponding discrete-time estimator is given by

{\theta}^{\tau}\coloneqq\bigg(\frac{1}{m}\sum_{j=1}^{m}\sum_{i=0}^{N-1}Z^{\psi^{\theta,\tau},j}_{t_{i}}(Z^{\psi^{\theta,\tau},j}_{t_{i}})^{\top}\tau+\frac{1}{\tau^{1-\alpha}m}I\bigg)^{-1}\bigg(\frac{1}{m}\sum_{j=1}^{m}\sum_{i=0}^{N-1}Z^{\psi^{\theta,\tau},j}_{t_{i}}\left(X^{\psi^{\theta,\tau},j}_{t_{i+1}}-X^{\psi^{\theta,\tau},j}_{t_{i}}\right)^{\top}\bigg).

Observe that for any given $m\in{\mathbb{N}}$, the estimator $\theta^{\tau}$ degenerates to zero as the stepsize $\tau$ tends to zero. Hence, to ensure the viability of $\theta^{\tau}$ across different timescales, the number of episodes $m$ has to increase appropriately when $\tau$ tends to zero. In contrast, by choosing $\alpha=1$ in (2.24), (2.23) admits the continuous-time limit (2.13) as $\tau\rightarrow0$, and leads to a learning algorithm in which the episode numbers and the time stepsizes can be chosen independently (see Theorem 2.3).

We now summarize the discrete-time least-squares algorithm as follows.

Algorithm 2 Discrete-time least-squares algorithm
1:  Input: Choose an initial estimate $\theta_{0}$ of $\theta^{\star}$, numbers of learning episodes $\{m_{\ell}\}_{\ell\in{\mathbb{N}}\cup\{0\}}$ and numbers of intervention points $\{N_{\ell}\}_{\ell\in{\mathbb{N}}\cup\{0\}}$.
2:  for $\ell=0,1,\cdots$ do
3:     Obtain the piecewise constant control $\psi^{\theta_{\ell},\tau_{\ell}}$ as in (2.19) with $\tau_{\ell}=T/N_{\ell}$ and $\theta=\theta_{\ell}$.
4:     Execute the control $\psi^{\theta_{\ell},\tau_{\ell}}$ for $m_{\ell}$ independent episodes, and collect the data $X^{\psi^{\theta_{\ell},\tau_{\ell}},j}_{t_{i}}$, $i=0,\ldots,N_{\ell}$, $j=1,\ldots,m_{\ell}$.
5:     Obtain an updated estimate ${\theta}_{\ell+1}$ by using (2.23) and the data $(X^{\psi^{\theta_{\ell},\tau_{\ell}},j}_{t_{i}})_{i=0,\ldots,N_{\ell},\,j=1,\ldots,m_{\ell}}$.
6:  end for
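The sketch below assembles Algorithm 2 from the pieces above: trajectories of (2.21) are simulated with a finer auxiliary sub-grid (purely to mimic the continuous dynamics), only the grid-point observations are kept, and the estimate is refreshed with (2.23). The sub-stepping, the helper names, and the schedule $N_{\ell}=\lceil\sqrt{2}^{\ell}N_{0}\rceil$ (one of the choices discussed in Remark 2.4 below) are illustrative assumptions.

import numpy as np

def run_episode(A_star, B_star, K, x0, T, substeps=20, rng=None):
    # Simulate (2.21) under the piecewise-constant gains K using `substeps`
    # Euler--Maruyama sub-steps per observation interval, and return the state
    # at the N+1 grid points only (the data the agent actually records).
    rng = rng or np.random.default_rng()
    N, n = K.shape[0], x0.shape[0]
    tau = T / N
    h = tau / substeps
    X = np.zeros((N + 1, n))
    X[0] = x0
    x = x0.copy()
    for i in range(N):
        Acl = A_star + B_star @ K[i]     # closed-loop drift on [t_i, t_{i+1}], cf. (2.21)
        for _ in range(substeps):
            x = x + Acl @ x * h + np.sqrt(h) * rng.standard_normal(n)
        X[i + 1] = x
    return X

def algorithm2(theta0, A_star, B_star, Q, R, x0, T, m0, N0, num_cycles):
    # Sketch of Algorithm 2 with m_ell = 2^ell m0 and N_ell = ceil(sqrt(2)^ell N0),
    # reusing the illustrative helpers discrete_riccati_gains and discrete_ls_estimate.
    n = A_star.shape[0]
    theta = theta0
    for ell in range(num_cycles):
        m_ell = (2 ** ell) * m0
        N_ell = int(np.ceil((np.sqrt(2) ** ell) * N0))
        A_hat, B_hat = theta[:n, :].T, theta[n:, :].T
        _, K = discrete_riccati_gains(A_hat, B_hat, Q, R, T, N_ell)   # step 3
        X = np.stack([run_episode(A_star, B_star, K, x0, T) for _ in range(m_ell)])  # step 4
        theta = discrete_ls_estimate(X, K, T / N_ell)                 # step 5
    return theta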

Again, as the $\ell$-th cycle of Algorithm 2 contains $m_{\ell}$ episodes, for each $M\in{\mathbb{N}}$, the regret of learning up to $M$ episodes (cf. (2.8)) can be upper bounded by the accumulated regret at the end of the $L$-th cycle, where $L$ is the smallest integer such that $\sum_{\ell=0}^{L}m_{\ell}\geq M$. The following theorem is an analogue of Theorem 2.2 for Algorithm 2.

Theorem 2.3

Suppose (H.1) holds and let $\theta_{0}=(A_{0},B_{0})^{\top}\in{\mathbb{R}}^{(n+d)\times n}$ be such that (H.1(2)) holds with $\theta_{0}$. Then there exist $C_{0}>0$ and $n_{0}\in{\mathbb{N}}$ such that for all $C\geq C_{0}$ and $\delta\in(0,\frac{3}{\pi^{2}})$, if one sets $m_{0}=C(-\ln\delta)$, $m_{\ell}=2^{\ell}m_{0}$ and $N_{\ell}\geq n_{0}$ for all $\ell\in{\mathbb{N}}\cup\{0\}$, then with probability at least $1-\frac{\pi^{2}\delta}{3}$, the regret of Algorithm 2 given by (2.8) satisfies

R(M)\leq C^{\prime}\left((\ln M)(\ln\ln M)+(-\ln\delta)(\ln M)+(-\ln\delta)\sum_{\ell=0}^{\ln M}2^{\ell}N_{\ell}^{-2}\right),\quad\forall M\in{\mathbb{N}}, (2.25)

where $C^{\prime}$ is a constant independent of $M$, $\delta$ and $(N_{\ell})_{\ell\in{\mathbb{N}}\cup\{0\}}$.

Remark 2.4

Theorem 2.3 provides a general regret bound for Algorithm 2 with arbitrary time discretization steps $\{N_{\ell}\}_{\ell\geq 0}$, where $N_{\ell}$ is the number of intervention points in the $\ell$-th cycle. Compared with Algorithm 1, the regret of Algorithm 2 has an additional term $(-\ln\delta)\sum_{\ell=0}^{\ln M}2^{\ell}N_{\ell}^{-2}$: in each learning episode, one incurs a sub-optimal loss by adjusting her policy only at discrete times, and also suffers from a model misspecification error when estimating parameters from discrete-time observations. Specifically,

  • if the time discretization step is fixed for all cycles, i.e., $N_{\ell}=T/\tau$ for all $\ell$, then the last term of (2.25) is of the magnitude

    \mathcal{O}\left((-\ln\delta)\sum_{\ell=0}^{\ln M}2^{\ell}N_{\ell}^{-2}\right)=\mathcal{O}\left((-\ln\delta)\tau^{2}\sum_{\ell=0}^{\ln M}2^{\ell}\right)=\mathcal{O}((-\ln\delta)\tau^{2}M),

    and consequently Algorithm 2 achieves a sub-optimal linear regret;

  • if the time discretization step of the $\ell$-th cycle increases exponentially in $\ell$, e.g., $N_{\ell}=\sqrt{2}^{\ell}N_{0}$ for $\ell=1,\ldots,\ln M$, then the last term of (2.25) is of the magnitude

    \mathcal{O}\left((-\ln\delta)\sum_{\ell=0}^{\ln M}2^{\ell}N_{\ell}^{-2}\right)=\mathcal{O}\left((-\ln\delta)\sum_{\ell=0}^{\ln M}N_{0}^{-2}\right)=\mathcal{O}((-\ln\delta)\ln M),

    which guarantees that the regret of Algorithm 2 is still logarithmic in MM.

Sketched proof of Theorem 2.3.

We point out the main differences between the proofs of Theorems 2.2 and 2.3, and give the detailed proof of Theorem 2.3 in Section 3.4. Compared with Theorem 2.2, the essential challenge in proving Theorem 2.3 is to quantify the precise dependence of the performance gap and the parameter estimation error on the stepsize. To this end, we first prove a first-order convergence of (2.20) to (2.9) as the stepsize tends to zero. Then, by exploiting the affine structure of (2.21), we establish the following quadratic performance gap for a piecewise constant policy $\psi^{\theta,\tau}$ (Proposition 3.12):

J^{\theta^{\star}}(U^{\psi^{\theta,\tau}})-J^{\theta^{\star}}(U^{\theta^{\star}})\leq C(|\theta-\theta^{\star}|^{2}+\tau^{2}). (2.26)

The analysis of the parameter estimation error is somewhat involved, as the state trajectories are merely $\alpha$-Hölder continuous in time with $\alpha<1/2$. By leveraging the analytic expression of $X^{\psi^{\theta,\tau}}$, we first show the first-order convergence of $\hat{\theta}^{\tau}$ to $\theta^{\star}$, with

\hat{\theta}^{\tau}\coloneqq\bigg({\mathbb{E}}\bigg[\sum_{i=0}^{N-1}Z^{\psi^{\theta,\tau}}_{t_{i}}(Z^{\psi^{\theta,\tau}}_{t_{i}})^{\top}\tau\bigg]\bigg)^{-1}\bigg({\mathbb{E}}\bigg[\sum_{i=0}^{N-1}Z^{\psi^{\theta,\tau}}_{t_{i}}\left(X^{\psi^{\theta,\tau}}_{t_{i+1}}-X^{\psi^{\theta,\tau}}_{t_{i}}\right)^{\top}\bigg]\bigg). (2.27)

We then prove that (2.23) enjoys a uniform sub-exponential tail bound for all $\theta$ close to $\theta^{\star}$ and all small $\tau$. Comparing (2.23) with (2.27) and applying the above results allows us to bound the estimation error of (2.23) by (2.14) with an additional $\mathcal{O}(\tau)$ term (Proposition 3.14).

3 Proofs of Theorems 2.2 and 2.3

To simplify the notation, for any given $N,m\in{\mathbb{N}}$ and control $\psi:[0,T]\times{\mathbb{R}}^{n}\rightarrow{\mathbb{R}}^{d}$ that is affine in the spatial variable, we introduce the following random variables associated with continuous-time observations:

V^{\psi} =\int_{0}^{T}Z^{\psi}_{t}(Z^{\psi}_{t})^{\top}\,{\mathrm{d}}t,\qquad Y^{\psi} =\int_{0}^{T}Z^{\psi}_{t}({\mathrm{d}}X^{\psi}_{t})^{\top}, (3.1)
V^{\psi,m} =\frac{1}{m}\sum_{j=1}^{m}\int_{0}^{T}Z^{\psi,j}_{t}(Z^{\psi,j}_{t})^{\top}\,{\mathrm{d}}t,\qquad Y^{\psi,m} =\frac{1}{m}\sum_{j=1}^{m}\int_{0}^{T}Z^{\psi,j}_{t}({\mathrm{d}}X^{\psi,j}_{t})^{\top},

and the following random variables associated with discrete-time observations with stepsize $\tau=T/N$:

V^{\psi,\tau} =\sum_{i=0}^{N-1}Z^{\psi}_{t_{i}}(Z^{\psi}_{t_{i}})^{\top}\tau,\qquad Y^{\psi,\tau} =\sum_{i=0}^{N-1}Z^{\psi}_{t_{i}}(X^{\psi}_{t_{i+1}}-X^{\psi}_{t_{i}})^{\top}, (3.2)
V^{\psi,\tau,m} =\frac{1}{m}\sum_{j=1}^{m}\sum_{i=0}^{N-1}Z^{\psi,j}_{t_{i}}(Z^{\psi,j}_{t_{i}})^{\top}\tau,\qquad Y^{\psi,\tau,m} =\frac{1}{m}\sum_{j=1}^{m}\sum_{i=0}^{N-1}Z^{\psi,j}_{t_{i}}(X^{\psi,j}_{t_{i+1}}-X^{\psi,j}_{t_{i}})^{\top},

where $X^{\psi}$ is the state process associated with the parameter $\theta^{\star}$ and the control $\psi$ (cf. (2.6)), $Z^{\psi}_{t}=\begin{pmatrix}X^{\psi}_{t}\\ \psi(t,X^{\psi}_{t})\end{pmatrix}$ for all $t\in[0,T]$, and $(X^{\psi,j},Z^{\psi,j})_{j=1}^{m}$ are independent copies of $(X^{\psi},Z^{\psi})$.

3.1 Convergence and stability of Riccati equations and feedback controls

Lemma 3.1

Suppose (H.1(1)) holds. Then for all $\theta=(A,B)^{\top}\in{\mathbb{R}}^{(n+d)\times n}$, the Riccati equation

\tfrac{{\mathrm{d}}}{{\mathrm{d}}t}P_{t}+A^{\top}P_{t}+P_{t}A-P_{t}BR^{-1}B^{\top}P_{t}+Q=0,\quad t\in[0,T];\quad P_{T}=0, (3.3)

admits a unique solution $P^{\theta}\in C([0,T];{\mathbb{R}}^{n\times n})$. Moreover, the map ${\mathbb{R}}^{(n+d)\times n}\ni\theta\mapsto P^{\theta}\in C^{1}([0,T];{\mathbb{R}}^{n\times n})$ is continuously differentiable.

Proof  It has been shown in (Yong and Zhou, 1999, Corollary 2.10 on p. 297) that under (H.1(1)), for all $\theta=(A,B)^{\top}\in{\mathbb{R}}^{(n+d)\times n}$, (3.3) admits a unique solution $P^{\theta}\in C([0,T];{\mathbb{R}}^{n\times n})$ such that $P^{\theta}_{t}\in{\mathbb{S}}^{n}_{0}$ for all $t\in[0,T]$. It remains to prove the continuous differentiability of $\theta\mapsto P^{\theta}$.

To this end, consider the Banach spaces ${\mathbb{X}}={\mathbb{R}}^{(n+d)\times n}\times C^{1}([0,T];{\mathbb{R}}^{n\times n})$ and ${\mathbb{Y}}=C([0,T];{\mathbb{R}}^{n\times n})\times{\mathbb{R}}^{n\times n}$, and the operator $\Phi:{\mathbb{X}}\rightarrow{\mathbb{Y}}$ defined by

{\mathbb{X}}\ni(\theta,P)\mapsto\Phi(\theta,P)\coloneqq(F(\theta,P),P_{T})\in{\mathbb{Y}},

where $F(\theta,P)_{t}=\tfrac{{\mathrm{d}}}{{\mathrm{d}}t}P_{t}+A^{\top}P_{t}+P_{t}A-P_{t}BR^{-1}B^{\top}P_{t}+Q$ for all $t\in[0,T]$. Observe that for all $\theta\in{\mathbb{R}}^{(n+d)\times n}$, $\Phi(\theta,P^{\theta})=0$. Moreover, one can easily show that for any $(\theta,P)\in{\mathbb{X}}$, $\Phi$ is continuously Fréchet differentiable at $(\theta,P)$, and the partial derivative $\frac{\partial}{\partial P}\Phi(\theta,P):C^{1}([0,T];{\mathbb{R}}^{n\times n})\rightarrow{\mathbb{Y}}$ is a bounded linear operator such that for all $\tilde{P}\in C^{1}([0,T];{\mathbb{R}}^{n\times n})$,

\frac{\partial}{\partial P}\Phi(\theta,P)(\tilde{P})=\begin{pmatrix}\left(\tfrac{{\mathrm{d}}}{{\mathrm{d}}t}\tilde{P}_{t}+A^{\top}\tilde{P}_{t}+\tilde{P}_{t}A-\tilde{P}_{t}BR^{-1}B^{\top}P_{t}-P_{t}BR^{-1}B^{\top}\tilde{P}_{t}\right)_{t\in[0,T]}\\ \tilde{P}_{T}\end{pmatrix}\in{\mathbb{Y}}.

Classical well-posedness results for linear differential equations and the boundedness of $P$ imply that $\frac{\partial}{\partial P}\Phi(\theta,P):C^{1}([0,T];{\mathbb{R}}^{n\times n})\rightarrow{\mathbb{Y}}$ has a bounded inverse (and hence is a bijection). Thus, applying the implicit function theorem (see (Ciarlet, 2013, Theorem 7.13-1)) to $\Phi$ proves that ${\mathbb{R}}^{(n+d)\times n}\ni\theta\mapsto P^{\theta}\in C^{1}([0,T];{\mathbb{R}}^{n\times n})$ is continuously differentiable.

The following lemma establishes the stability of the Riccati difference operator, which is crucial for the subsequent convergence analysis.

Lemma 3.2

Suppose (H.11) holds. For each θ=(A,B)(n+d)×n\theta=(A,B)^{\top}\in{\mathbb{R}}^{(n+d)\times n} and NN\in{\mathbb{N}}, let τ=T/N\tau=T/N and define the function Γτθ:𝕊0n𝕊0n\Gamma^{\theta}_{\tau}:{\mathbb{S}}^{n}_{0}\rightarrow{\mathbb{S}}^{n}_{0} such that for all P𝕊0nP\in{\mathbb{S}}^{n}_{0},

Γτθ(P)τQ+(I+τA)P(I+τA)(I+τA)PτB(R+τBPB)1BP(I+τA).\Gamma^{\theta}_{\tau}(P)\coloneqq\tau Q+(I+\tau A)^{\top}P(I+\tau A)-(I+\tau A)^{\top}P\tau B(R+\tau B^{\top}PB)^{-1}B^{\top}P(I+\tau A). (3.4)

Then for all P,P𝕊0nP,P^{\prime}\in{\mathbb{S}}^{n}_{0},

  1. (1)

    Γτθ(P)2τQ2+(1+τA2)2P2\|\Gamma^{\theta}_{\tau}(P)\|_{2}\leq\tau\|Q\|_{2}+(1+\tau\|A\|_{2})^{2}\|P\|_{2},

  2. (2)

    Γτθ(P)Γτθ(P)2(1+τR12B22max{P2,P2})2(1+τA2)2PP2.\|\Gamma^{\theta}_{\tau}(P)-\Gamma^{\theta}_{\tau}(P^{\prime})\|_{2}\leq\big{(}1+\tau\|R^{-1}\|_{2}\|B\|_{2}^{2}\max\{\|P\|_{2},\|P^{\prime}\|_{2}\}\big{)}^{2}(1+\tau\|A\|_{2})^{2}\|P-P^{\prime}\|_{2}.

Proof  Item 1 follows directly from the definition of Γτθ\Gamma^{\theta}_{\tau} and the identity that Γτθ(P)2=sup{xΓτθ(P)xxn,|x|=1}\|\Gamma^{\theta}_{\tau}(P)\|_{2}=\sup\{x^{\top}\Gamma^{\theta}_{\tau}(P)x\mid x\in{\mathbb{R}}^{n},|x|=1\}. We now prove Item 2. Let δP=PP\delta P=P-P^{\prime} and δΓ(P)=Γτθ(P)Γτθ(P)\delta\Gamma(P)=\Gamma^{\theta}_{\tau}(P)-\Gamma^{\theta}_{\tau}(P^{\prime}). Then, by (Bitmead and Gevers, 1991, Lemma 10.1),

δΓ(P)=FδPFFδPτB(τBPτB+τR)1τBδPF,\displaystyle\delta\Gamma(P)=F^{\top}\delta PF-F^{\top}\delta P\tau B(\tau B^{\top}P\tau B+\tau R)^{-1}\tau B^{\top}\delta PF,

with F=(IτB(τBPτB+τR)1τBP)(I+τA)F=(I-\tau B(\tau B^{\top}P^{\prime}\tau B+\tau R)^{-1}\tau B^{\top}P^{\prime})(I+\tau A). Thus for all xnx\in{\mathbb{R}}^{n}, xδΓ(P)xδP2F22|x|2x^{\top}\delta\Gamma(P)x\leq\|\delta P\|_{2}\|F\|^{2}_{2}|x|^{2}, which along with (τBPB+R)12R12\|(\tau B^{\top}P^{\prime}B+R)^{-1}\|_{2}\leq\|R^{-1}\|_{2} implies

xδΓ(P)xδP2(1+τR12B22P2)2(1+τA2)2|x|2,xn.x^{\top}\delta\Gamma(P)x\leq\|\delta P\|_{2}(1+\tau\|R^{-1}\|_{2}\|B\|_{2}^{2}\|P^{\prime}\|_{2})^{2}(1+\tau\|A\|_{2})^{2}|x|^{2},\quad x\in{\mathbb{R}}^{n}.

Hence, interchanging the roles of PP and PP^{\prime} in the above inequality and taking the supremum over xnx\in{\mathbb{R}}^{n} with |x|=1|x|=1 lead to the desired estimate.  

The following proposition establishes the first-order convergence of the Riccati difference equation and the associated feedback controls, as the stepsize tends to zero.

Proposition 3.3

Suppose (H.11) holds and let Θ\Theta be a bounded subset of (n+d)×n{\mathbb{R}}^{(n+d)\times n}. For each θ=(A,B)Θ\theta=(A,B)^{\top}\in\Theta and NN\in{\mathbb{N}}, let (Piθ,τ)i=0N(P^{\theta,\tau}_{i})_{i=0}^{N} be such that PNθ,τ=0P^{\theta,\tau}_{N}=0 and Piθ,τ=Γτθ(Pi+1θ,τ)P^{\theta,\tau}_{i}=\Gamma^{\theta}_{\tau}(P^{\theta,\tau}_{i+1}) for all i=0,,N1i=0,\ldots,N-1, with Γτθ\Gamma^{\theta}_{\tau} defined in (3.4) with τ=T/N\tau=T/N. Then there exists a constant C0C\geq 0 such that for all θΘ,N\theta\in\Theta,N\in{\mathbb{N}},

supi=0,,N1supt[iτ,(i+1)τ)(PtθPiθ,τ2+KtθKiθ,τ2)Cτ,\sup_{i=0,\ldots,N-1}\sup_{t\in[i\tau,(i+1)\tau)}\big{(}\|P^{\theta}_{t}-P^{\theta,\tau}_{i}\|_{2}+\|K^{\theta}_{t}-K^{\theta,\tau}_{i}\|_{2}\big{)}\leq C\tau,

where PθC1([0,T];n×n)P^{\theta}\in C^{1}([0,T];{\mathbb{R}}^{n\times n}) satisfies (3.3), Ktθ=R1BPtθK^{\theta}_{t}=-R^{-1}B^{\top}P^{\theta}_{t} for all t[0,T]t\in[0,T] and Kiθ,τ=(R+τBPi+1θ,τB)1BPi+1θ,τ(I+τA)K^{\theta,\tau}_{i}=-(R+\tau B^{\top}P^{\theta,\tau}_{{i+1}}B)^{-1}B^{\top}P^{\theta,\tau}_{{i+1}}(I+\tau A) for all i=0,,N1i=0,\ldots,N-1.

Proof  Throughout this proof, we shall fix θΘ,N\theta\in\Theta,N\in{\mathbb{N}}, let ti=iτt_{i}=i\tau for all i=0,,Ni=0,\ldots,N, and denote by CC a generic constant independent of NN and θ\theta. By the continuity of the map θPθ\theta\mapsto P^{\theta} (Lemma 3.1) and the boundedness of Θ\Theta, there exists a constant CC such that PθC1([0,T];n×n)C\|P^{\theta}\|_{C^{1}([0,T];{\mathbb{R}}^{n\times n})}\leq C for all θΘ\theta\in\Theta, which implies PtθPsθ2C|ts|\|P^{\theta}_{t}-P^{\theta}_{s}\|_{2}\leq C|t-s| for all t,s[0,T]t,s\in[0,T]. Consequently, it suffices to prove PtiθPiθ,τ2+Kti+1θKiθ,τ2Cτ\|P^{\theta}_{t_{i}}-P^{\theta,\tau}_{i}\|_{2}+\|K^{\theta}_{t_{i+1}}-K^{\theta,\tau}_{i}\|_{2}\leq C\tau for all i=0,,N1i=0,\ldots,N-1.

We start by making two important observations. By Lemma 3.2 Item 1, Piθ,τ2τC+(1+Cτ)Pi+1θ,τ2\|P^{\theta,\tau}_{i}\|_{2}\leq\tau C+(1+C\tau)\|P^{\theta,\tau}_{i+1}\|_{2} for all i=0,,N1i=0,\ldots,N-1, which along with Gronwall’s inequality gives Piθ,τ2C\|P^{\theta,\tau}_{i}\|_{2}\leq C for all i=0,,Ni=0,\ldots,N. Moreover, by (3.4), for all P𝕊0nP\in{\mathbb{S}}^{n}_{0},

Γτθ(P)\displaystyle\Gamma^{\theta}_{\tau}(P) =τQ+P+τ(AP+PA)+τ2APA\displaystyle=\tau Q+P+\tau(A^{\top}P+PA)+\tau^{2}A^{\top}PA
τ(PB(R+τBPB)1BP+τ(AH+HA)+τ2AHA),\displaystyle\quad-\tau\Big{(}PB(R+\tau B^{\top}PB)^{-1}B^{\top}P+\tau(A^{\top}H+HA)+\tau^{2}A^{\top}HA\Big{)},

with HPB(R+τBPB)1BPH\coloneqq PB(R+\tau B^{\top}PB)^{-1}B^{\top}P. Hence for any given i=0,,N1i=0,\ldots,N-1, we see from (3.3) that

PtiθΓτθ(Pti+1θ)\displaystyle P^{\theta}_{t_{i}}-\Gamma^{\theta}_{\tau}(P^{\theta}_{t_{i+1}})
=titi+1(A(PtθPti+1θ)+(PtθPti+1θ)A)dttiti+1(PtθBR1BPtθPti+1θBR1BPti+1θ)dt\displaystyle=\int_{t_{i}}^{t_{i+1}}\big{(}A^{\top}(P^{\theta}_{t}-P^{\theta}_{t_{i+1}})+(P^{\theta}_{t}-P^{\theta}_{t_{i+1}})A\big{)}\,{\mathrm{d}}t-\int_{t_{i}}^{t_{i+1}}(P^{\theta}_{t}BR^{-1}B^{\top}P^{\theta}_{t}-P^{\theta}_{t_{i+1}}BR^{-1}B^{\top}P^{\theta}_{t_{i+1}})\,{\mathrm{d}}t
titi+1(Pti+1θBR1BPti+1θPti+1θB(R+τBPti+1θB)1BPti+1θ)dt\displaystyle\quad-\int_{t_{i}}^{t_{i+1}}(P^{\theta}_{t_{i+1}}BR^{-1}B^{\top}P^{\theta}_{t_{i+1}}-P^{\theta}_{t_{i+1}}B(R+\tau B^{\top}P^{\theta}_{t_{i+1}}B)^{-1}B^{\top}P^{\theta}_{t_{i+1}})\,{\mathrm{d}}t
+τ2(APti+1θA+AHi+1θ+Hi+1θA+τAHi+1θA),\displaystyle\quad+\tau^{2}(-A^{\top}P^{\theta}_{t_{i+1}}A+A^{\top}{H}^{\theta}_{i+1}+{H}^{\theta}_{i+1}A+\tau A^{\top}{H}^{\theta}_{i+1}A),

with Hi+1θ=Pti+1θB(R+τBPti+1θB)1BPti+1θ{H}^{\theta}_{i+1}=P^{\theta}_{t_{i+1}}B(R+\tau B^{\top}P^{\theta}_{t_{i+1}}B)^{-1}B^{\top}P^{\theta}_{t_{i+1}}. Since PθC1([0,T];n×n)C\|P^{\theta}\|_{C^{1}([0,T];{\mathbb{R}}^{n\times n})}\leq C and R𝕊+dR\in{\mathbb{S}}^{d}_{+}, we have PtiθΓτθ(Pti+1θ)2Cτ2\|P^{\theta}_{t_{i}}-\Gamma^{\theta}_{\tau}(P^{\theta}_{t_{i+1}})\|_{2}\leq C\tau^{2} for all i=0,,N1i=0,\ldots,N-1.

We are ready to show maxi=0,,N1(PtiθPiθ,τ2+Kti+1θKiθ,τ2)Cτ\max_{i=0,\ldots,N-1}(\|P^{\theta}_{t_{i}}-P^{\theta,\tau}_{i}\|_{2}+\|K^{\theta}_{t_{i+1}}-K^{\theta,\tau}_{i}\|_{2})\leq C\tau. For any given i=0,,N1i=0,\ldots,N-1, by Lemma 3.2 Item 2 and the uniform boundedness of (Ptiθ)i=0N(P^{\theta}_{t_{i}})_{i=0}^{N} and (Piθ,τ)i=0N(P^{\theta,\tau}_{i})_{i=0}^{N},

PtiθPiθ,τ2\displaystyle\|P^{\theta}_{t_{i}}-P^{\theta,\tau}_{i}\|_{2} PtiθΓτθ(Pti+1θ)2+Γτθ(Pti+1θ)Γτθ(Pi+1θ,τ)2\displaystyle\leq\|P^{\theta}_{t_{i}}-\Gamma^{\theta}_{\tau}(P^{\theta}_{t_{i+1}})\|_{2}+\|\Gamma^{\theta}_{\tau}(P^{\theta}_{t_{i+1}})-\Gamma^{\theta}_{\tau}(P^{\theta,\tau}_{i+1})\|_{2}
Cτ2+(1+τCmax{Pti+1θ2,Pi+1θ,τ2})2(1+τC)2Pti+1θPi+1θ,τ2\displaystyle\leq C\tau^{2}+\big{(}1+\tau C\max\{\|P^{\theta}_{t_{i+1}}\|_{2},\|P^{\theta,\tau}_{{i+1}}\|_{2}\}\big{)}^{2}(1+\tau C)^{2}\|P^{\theta}_{t_{i+1}}-P^{\theta,\tau}_{{i+1}}\|_{2}
Cτ2+(1+τC)Pti+1θPi+1θ,τ2,\displaystyle\leq C\tau^{2}+\big{(}1+\tau C\big{)}\|P^{\theta}_{t_{i+1}}-P^{\theta,\tau}_{{i+1}}\|_{2},

which along with Gronwall’s inequality and PTθ=PNθ,τ=0P^{\theta}_{T}=P^{\theta,\tau}_{N}=0 shows the desired convergence rate of (Piθ,τ)i=1N(P^{\theta,\tau}_{i})_{i=1}^{N}. Furthermore, for all i=0,,N1i=0,\ldots,N-1,

Kti+1θKiθ,τ2\displaystyle\|K^{\theta}_{t_{i+1}}-K^{\theta,\tau}_{i}\|_{2} (R1(R+τBPi+1θ,τB)1)BPti+1θ2\displaystyle\leq\|(R^{-1}-(R+\tau B^{\top}P^{\theta,\tau}_{{i+1}}B)^{-1})B^{\top}P^{\theta}_{t_{i+1}}\|_{2}
+(R+τBPi+1θ,τB)1B(Pti+1θPi+1θ,τ(I+τA))2Cτ,\displaystyle\quad+\|(R+\tau B^{\top}P^{\theta,\tau}_{{i+1}}B)^{-1}B^{\top}(P^{\theta}_{t_{i+1}}-P^{\theta,\tau}_{{i+1}}(I+\tau A))\|_{2}\leq C\tau,

from the facts that Ptiθ2C\|P^{\theta}_{t_{i}}\|_{2}\leq C, Piθ,τ2C\|P^{\theta,\tau}_{i}\|_{2}\leq C and PtiθPiθ,τ2Cτ\|P^{\theta}_{t_{i}}-P^{\theta,\tau}_{i}\|_{2}\leq C\tau for all ii.  
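
As a complement to Proposition 3.3, the sketch below (illustrative only; toy matrices) implements the recursion P_i = Gamma_tau(P_{i+1}) from (3.4) together with the discrete gains, and compares P_0 across several grid sizes against a fine-grid reference; the observed errors should roughly halve each time N is doubled, consistent with the O(tau) rate.

```python
# Illustrative sketch only: the Riccati difference recursion (3.4) with its discrete
# gains, and a crude grid-refinement check of the O(tau) rate in Proposition 3.3.
# Model matrices are arbitrary placeholders; the finest grid serves as a reference.
import numpy as np

def discrete_riccati(A, B, Q, R, T, N):
    """Backward recursion P_i = Gamma_tau(P_{i+1}) from (3.4), with the associated gains."""
    n, d = A.shape[0], B.shape[1]
    tau = T / N
    I = np.eye(n)
    P = np.zeros((N + 1, n, n))            # terminal condition P_N = 0
    K = np.zeros((N, d, n))
    for i in range(N - 1, -1, -1):
        Pn = P[i + 1]
        M = np.linalg.inv(R + tau * B.T @ Pn @ B)
        P[i] = (tau * Q + (I + tau * A).T @ Pn @ (I + tau * A)
                - (I + tau * A).T @ Pn @ (tau * B) @ M @ B.T @ Pn @ (I + tau * A))
        K[i] = -M @ B.T @ Pn @ (I + tau * A)   # K_i = -(R + tau B^T P_{i+1} B)^{-1} B^T P_{i+1}(I + tau A)
    return P, K

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, d, T = 2, 1, 1.0
    A, B = rng.standard_normal((n, n)), rng.standard_normal((n, d))
    Q, R = np.eye(n), np.eye(d)
    P_ref, _ = discrete_riccati(A, B, Q, R, T, N=2**14)   # fine-grid reference for P_0
    for N in (50, 100, 200, 400):
        P_N, _ = discrete_riccati(A, B, Q, R, T, N)
        print(N, np.linalg.norm(P_N[0] - P_ref[0], 2))    # error roughly halves as N doubles
```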

3.2 Concentration inequalities for least-squares estimators

In this section, we analyze the concentration behavior of the least-squares estimators (2.13) and (2.23). We first recall the definition of sub-exponential random variables (see e.g., Wainwright (2019)).

Definition 3.1

A random variable XX with mean μ=𝔼[X]\mu=\mathbb{E}[X] is (ν,b)(\nu,b)-sub-exponential for ν,b[0,)\nu,b\in[0,\infty) if 𝔼[eλ(Xμ)]eν2λ2/2\mathbb{E}[e^{\lambda(X-\mu)}]\leq e^{\nu^{2}\lambda^{2}/2} for all |λ|<1/b|\lambda|<1/b.

Note that a (ν,0)(\nu,0)-sub-exponential random variable is usually called a sub-Gaussian random variable. It is well-known that products of sub-Gaussian random variables are sub-exponential, and the class of sub-exponential random variables forms a vector space. Moreover, sub-exponential random variables enjoy the following concentration inequality (also known as Bernstein’s inequality; see e.g., (Wainwright, 2019, Equation 2.18 p. 29)).

Lemma 3.4

Let mm\in{\mathbb{N}}, ν,b[0,)\nu,b\in[0,\infty) and (Xi)i=1m(X_{i})_{i=1}^{m} be independent (ν,b)(\nu,b)-sub-exponential random variables with μ=𝔼[Xi]\mu={\mathbb{E}}[X_{i}] for all i=1,,mi=1,\ldots,m. Then for all ϵ0\epsilon\geq 0,

(|1mi=1mXiμ|ϵ)2exp(min{mϵ22ν2,mϵ2b}).\mathbb{P}\left(\left|\frac{1}{m}\sum_{i=1}^{m}X_{i}-\mu\right|\geq\epsilon\right)\leq 2\exp\left(-\min\left\{\frac{m\epsilon^{2}}{2\nu^{2}},\frac{m\epsilon}{2b}\right\}\right).
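
For a concrete reading of this bound, the following snippet (illustrative only; the numerical values are arbitrary) inverts it into a sample-size rule: given (ν, b), a target accuracy ε and a confidence level δ, it returns the smallest m for which the right-hand side of the inequality is at most δ.

```python
# Illustrative sketch only: inverting the bound of Lemma 3.4 into a sample-size rule.
# Given (nu, b), an accuracy eps and a confidence level delta, return the smallest m
# for which 2 * exp(-min(m*eps^2/(2*nu^2), m*eps/(2*b))) <= delta.
import math

def bernstein_sample_size(nu: float, b: float, eps: float, delta: float) -> int:
    log_term = math.log(2.0 / delta)
    m_gaussian_regime = 2.0 * nu**2 * log_term / eps**2   # small-deviation (sub-Gaussian) regime
    m_exponential_regime = 2.0 * b * log_term / eps       # large-deviation (exponential) regime
    return math.ceil(max(m_gaussian_regime, m_exponential_regime))

print(bernstein_sample_size(nu=1.0, b=0.5, eps=0.1, delta=0.01))   # ~1060 samples in this toy case
```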

The following lemma shows that double iterated Itô integrals are sub-exponential random variables.

Lemma 3.5

Let L0L\geq 0 and g,h:[0,T]×[0,T]n×ng,h:[0,T]\times[0,T]\rightarrow{\mathbb{R}}^{n\times n} be measurable functions such that |g(t,s)|L|g(t,s)|\leq L and |h(t,s)|L|h(t,s)|\leq L for all t,s[0,T]t,s\in[0,T]. Then there exist ν,b[0,)\nu,b\in[0,\infty), depending polynomially on L,n,TL,n,T, such that

  1. (1)

    0T(0tg(t,s)dWs)dWt\int_{0}^{T}\big{(}\int_{0}^{t}g(t,s)\,{\mathrm{d}}W_{s}\big{)}^{\top}\,{\mathrm{d}}W_{t},

  2. (2)

    0T(0tg(t,s)dWs)(0th(t,s)dWs)dt\int_{0}^{T}\big{(}\int_{0}^{t}g(t,s)\,{\mathrm{d}}W_{s}\big{)}^{\top}\big{(}\int_{0}^{t}h(t,s)\,{\mathrm{d}}W_{s}\big{)}\,{\mathrm{d}}t

are (ν,b)(\nu,b)-sub-exponential.

Proof  We first prove Item 1 by assuming without loss of generality that g(t,s)2L\|g(t,s)\|_{2}\leq L for all t,s[0,T]t,s\in[0,T], and by defining Vq0T(0tq(t,s)dWs)dWtV^{q}\coloneqq\int_{0}^{T}\left(\int_{0}^{t}q(t,s)\,{\mathrm{d}}W_{s}\right)^{\top}{\mathrm{d}}W_{t} for any bounded measurable function q:[0,T]×[0,T]n×nq:[0,T]\times[0,T]\rightarrow{\mathbb{R}}^{n\times n}. By arguments similar to those in (Cheridito et al., 2005, Lemma 3.2), we have for all 0λ<12T0\leq\lambda<\frac{1}{2T},

𝔼[exp(2λVgL)]𝔼[exp(2λVIn)]=(112λTexp(λT))n.\mathbb{E}[\exp(2\lambda V^{\frac{g}{L}})]\leq\mathbb{E}[\exp(2\lambda V^{I_{n}})]=\left(\frac{1}{\sqrt{1-2\lambda T}}\exp(-\lambda T)\right)^{n}.

As eλ12λe2λ2\frac{e^{-\lambda}}{\sqrt{1-2\lambda}}\leq e^{2\lambda^{2}} for all |λ|1/4|\lambda|\leq 1/4, we see 𝔼[exp(2λVg/L)]exp(2nλ2T2)\mathbb{E}[\exp(2\lambda V^{g/L})]\leq\exp(2n\lambda^{2}T^{2}) for all 0λ<14T0\leq\lambda<\frac{1}{4T}. Consequently, for all 0λ<12LT0\leq\lambda<\frac{1}{2LT},

𝔼[exp(λVg)]=𝔼[exp(2λL2VgL)]exp(nL2T2λ22).\mathbb{E}[\exp(\lambda V^{g})]={\mathbb{E}}\left[\exp\left(2\frac{\lambda L}{2}V^{\frac{g}{L}}\right)\right]\leq\exp\left(\frac{nL^{2}T^{2}\lambda^{2}}{2}\right).

Replacing gg by g-g shows the above estimate holds for |λ|<12LT|\lambda|<\frac{1}{2LT}, which implies the desired sub-exponential property of VgV^{g}.

For Item 2, observe that for each t[0,T]t\in[0,T], the Itô formula allows one to express the product (0tg(t,s)dWs)(0th(t,s)dWs)\big{(}\int_{0}^{t}g(t,s)\,{\mathrm{d}}W_{s}\big{)}^{\top}\big{(}\int_{0}^{t}h(t,s)\,{\mathrm{d}}W_{s}\big{)} as a linear combination of double iterated Itô integrals and deterministic integrals. Then the desired sub-exponential property follows from the stochastic Fubini theorem (see e.g., Veraar (2012)) and Item 1.  
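
A quick Monte Carlo sanity check of Lemma 3.5 (illustrative only; the discretization and sample sizes are arbitrary choices) is given below for the simplest integrand g = I_n, in which case the double Itô integral reduces to ∫_0^T W_t^T dW_t; the empirical tails decay rapidly, as the sub-exponential property predicts.

```python
# Illustrative sketch only: Monte Carlo evidence for Lemma 3.5 in the special case
# g = I_n, where the double Ito integral reduces to V = int_0^T W_t^T dW_t.
# The empirical tail P(|V| >= x) should decay (at least) exponentially in x.
import numpy as np

rng = np.random.default_rng(2)
n, T, num_steps, num_samples = 2, 1.0, 400, 5000
dt = T / num_steps

dW = rng.standard_normal((num_samples, num_steps, n)) * np.sqrt(dt)
W_before = np.cumsum(dW, axis=1) - dW            # W_{t_k} built from increments strictly before t_k
V = np.einsum("mkn,mkn->m", W_before, dW)        # sum_k W_{t_k}^T (W_{t_{k+1}} - W_{t_k})

for x in (0.5, 1.0, 2.0, 4.0):
    print(x, float(np.mean(np.abs(V) >= x)))     # empirical tail probabilities
```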

The following theorem establishes the concentration properties of the random variables involved in the least-squares estimators.

Theorem 3.6

Suppose (H.11) holds and let Θ\Theta be a bounded subset of (n+d)×n{\mathbb{R}}^{(n+d)\times n}. For each θΘ\theta\in\Theta and NN\in{\mathbb{N}}, let ψθ\psi^{\theta} be defined in (2.9), and ψθ,τ\psi^{\theta,\tau} be defined in (2.19) with stepsize τ=T/N\tau=T/N. Then there exist constants C,ν,b>0C,\nu,b>0 such that for all θΘ\theta\in\Theta, N,mN,m\in{\mathbb{N}} and ϵ>0\epsilon>0,

max{(|Vψθ,m𝔼[Vψθ]|ϵ),(|Yψθ,m𝔼[Yψθ]|ϵ),\displaystyle\max\Big{\{}\mathbb{P}(|V^{\psi^{\theta},m}-{\mathbb{E}}[V^{\psi^{\theta}}]|\geq\epsilon),\mathbb{P}(|Y^{\psi^{\theta},m}-{\mathbb{E}}[Y^{\psi^{\theta}}]|\geq\epsilon),
(|Vψθ,τ,τ,m𝔼[Vψθ,τ,τ]|ϵ),(|Yψθ,τ,τ,m𝔼[Yψθ,τ,τ]|ϵ)}Cexp(1Cmin{mϵ2ν2,mϵb}),\displaystyle\quad\mathbb{P}(|V^{\psi^{\theta,\tau},\tau,m}-{\mathbb{E}}[V^{\psi^{\theta,\tau},\tau}]|\geq\epsilon),\mathbb{P}(|Y^{\psi^{\theta,\tau},\tau,m}-{\mathbb{E}}[Y^{\psi^{\theta,\tau},\tau}]|\geq\epsilon)\Big{\}}\leq C\exp\left(-\frac{1}{C}\min\left\{\frac{m\epsilon^{2}}{\nu^{2}},\frac{m\epsilon}{b}\right\}\right),

where VψθV^{\psi^{\theta}}, YψθY^{\psi^{\theta}}, Vψθ,mV^{\psi^{\theta},m}, Yψθ,mY^{\psi^{\theta},m} are defined in (3.1), and Vψθ,τ,τV^{\psi^{\theta,\tau},\tau}, Yψθ,τ,τY^{\psi^{\theta,\tau},\tau}, Vψθ,τ,τ,mV^{\psi^{\theta,\tau},\tau,m}, Yψθ,τ,τ,mY^{\psi^{\theta,\tau},\tau,m} are defined in (3.2).

Proof  We first show that there exist ν,b>0\nu,b>0 such that all entries of VψθV^{\psi^{\theta}}, YψθY^{\psi^{\theta}}, Yψθ,τ,τY^{\psi^{\theta,\tau},\tau} and Vψθ,τ,τV^{\psi^{\theta,\tau},\tau} are (ν,b)(\nu,b)-sub-exponential for all θΘ\theta\in\Theta and NN\in{\mathbb{N}}. By (3.1), we have

Vψθ\displaystyle V^{\psi^{\theta}} =0T(XtψθKtθXtψθ)((Xtψθ)(KtθXtψθ))dt,Yψθ=Vψθθ+0T(XtψθKtθXtψθ)(dWt).\displaystyle=\int_{0}^{T}\begin{pmatrix}X^{\psi^{\theta}}_{t}\\ K^{\theta}_{t}X^{\psi^{\theta}}_{t}\end{pmatrix}\begin{pmatrix}(X^{\psi^{\theta}}_{t})^{\top}&(K^{\theta}_{t}X^{\psi^{\theta}}_{t})^{\top}\end{pmatrix}\,{\mathrm{d}}t,\quad Y^{\psi^{\theta}}=V^{\psi^{\theta}}\theta^{\star}+\int_{0}^{T}\begin{pmatrix}X^{\psi^{\theta}}_{t}\\ K^{\theta}_{t}X^{\psi^{\theta}}_{t}\end{pmatrix}({\mathrm{d}}W_{t})^{\top}.

Moreover, applying the variation-of-constants formula (see e.g., (Mao, 2007, Theorem 3.1 p. 96)) to (2.11) shows that Xtψθ=Φtθ(x0+0t(Φsθ)1dWs)X^{\psi^{\theta}}_{t}=\Phi^{\theta}_{t}\big{(}x_{0}+\int_{0}^{t}(\Phi^{\theta}_{s})^{-1}\,{\mathrm{d}}W_{s}\big{)} for all t[0,T]t\in[0,T], where ΦθC([0,T];n×n)\Phi^{\theta}\in C([0,T];{\mathbb{R}}^{n\times n}) is the fundamental solution of dΦtθ=(A+BKtθ)Φtθdt{\mathrm{d}}\Phi^{\theta}_{t}=(A^{\star}+B^{\star}K^{\theta}_{t})\Phi^{\theta}_{t}\,{\mathrm{d}}t. The continuity of (n+d)×nθKθC([0,T];d×n){\mathbb{R}}^{(n+d)\times n}\ni\theta\mapsto K^{\theta}\in C([0,T];{\mathbb{R}}^{d\times n}) (cf. Lemma 3.1) and the boundedness of Θ\Theta imply that Kθ,Φθ,(Φθ)1K^{\theta},\Phi^{\theta},(\Phi^{\theta})^{-1} are uniformly bounded for all θΘ\theta\in\Theta. Consequently, from Lemma 3.5, there exist ν,b>0\nu,b>0 such that all entries of VψθV^{\psi^{\theta}} and YψθY^{\psi^{\theta}} are (ν,b)(\nu,b)-sub-exponential.

Similarly, by (2.21) and (3.2),

Vψθ,τ,τ\displaystyle V^{\psi^{\theta,\tau},\tau} =0Ti=0N1𝟏[ti,ti+1)(t)(Xtiψθ,τKtiθ,τXtiψθ,τ)((Xtiψθ,τ)(Ktiθ,τXtiψθ,τ))dt,\displaystyle=\int_{0}^{T}\sum_{i=0}^{N-1}\mathbf{1}_{[t_{i},t_{i+1})}(t)\begin{pmatrix}X^{\psi^{\theta,\tau}}_{t_{i}}\\ K^{\theta,\tau}_{t_{i}}X^{\psi^{\theta,\tau}}_{t_{i}}\end{pmatrix}\begin{pmatrix}(X^{\psi^{\theta,\tau}}_{t_{i}})^{\top}&(K^{\theta,\tau}_{t_{i}}X^{\psi^{\theta,\tau}}_{t_{i}})^{\top}\end{pmatrix}\,{\mathrm{d}}t,
Yψθ,τ,τ\displaystyle Y^{\psi^{\theta,\tau},\tau} =0Ti=0N1𝟏[ti,ti+1)(t)(Xtiψθ,τKtiθ,τXtiψθ,τ)((Xtψθ,τ)(Ktθ,τXtψθ,τ))θdt\displaystyle=\int_{0}^{T}\sum_{i=0}^{N-1}\mathbf{1}_{[t_{i},t_{i+1})}(t)\begin{pmatrix}X^{\psi^{\theta,\tau}}_{t_{i}}\\ K^{\theta,\tau}_{t_{i}}X^{\psi^{\theta,\tau}}_{t_{i}}\end{pmatrix}\begin{pmatrix}(X^{\psi^{\theta,\tau}}_{t})^{\top}&(K^{\theta,\tau}_{t}X^{\psi^{\theta,\tau}}_{t})^{\top}\end{pmatrix}\theta^{\star}\,{\mathrm{d}}t
+0Ti=0N1𝟏[ti,ti+1)(t)(Xtiψθ,τKtiθ,τXtiψθ,τ)(dWt),\displaystyle\quad+\int_{0}^{T}\sum_{i=0}^{N-1}\mathbf{1}_{[t_{i},t_{i+1})}(t)\begin{pmatrix}X^{\psi^{\theta,\tau}}_{t_{i}}\\ K^{\theta,\tau}_{t_{i}}X^{\psi^{\theta,\tau}}_{t_{i}}\end{pmatrix}({\mathrm{d}}W_{t})^{\top},

where Xtψθ,τ=Φtθ,τ(x0+0t(Φsθ,τ)1dWs)X^{\psi^{\theta,\tau}}_{t}=\Phi^{\theta,\tau}_{t}\big{(}x_{0}+\int_{0}^{t}(\Phi^{\theta,\tau}_{s})^{-1}\,{\mathrm{d}}W_{s}\big{)} for all t[0,T]t\in[0,T], and Φθ,τC([0,T];n×n)\Phi^{\theta,\tau}\in C([0,T];{\mathbb{R}}^{n\times n}) is the fundamental solution of dΦtθ,τ=(A+BKtθ,τ)Φtθ,τdt{\mathrm{d}}\Phi^{\theta,\tau}_{t}=(A^{\star}+B^{\star}K^{\theta,\tau}_{t})\Phi^{\theta,\tau}_{t}\,{\mathrm{d}}t. By Proposition 3.3, Kθ,τ,Φθ,τ,(Φθ,τ)1K^{\theta,\tau},\Phi^{\theta,\tau},(\Phi^{\theta,\tau})^{-1} are uniformly bounded for all θΘ\theta\in\Theta and NN\in{\mathbb{N}}, which along with Lemma 3.5 leads to the desired sub-exponential properties of Yψθ,τ,τY^{\psi^{\theta,\tau},\tau} and Vψθ,τ,τV^{\psi^{\theta,\tau},\tau}.

Finally, since (|i=1Xi|ϵ)i=1(|Xi|ϵ/)\mathbb{P}(|\sum_{i=1}^{\ell}X_{i}|\geq\epsilon)\leq\sum_{i=1}^{\ell}\mathbb{P}(|X_{i}|\geq\epsilon/\ell) for all \ell\in{\mathbb{N}} and random variables (Xi)i=1(X_{i})_{i=1}^{\ell}, we can apply Lemma 3.4 to each component of VψθV^{\psi^{\theta}}, YψθY^{\psi^{\theta}}, Yψθ,τ,τY^{\psi^{\theta,\tau},\tau} and Vψθ,τ,τV^{\psi^{\theta,\tau},\tau}, and conclude the desired concentration inequality with a constant CC depending polynomially on n,dn,d.  

3.3 Regret analysis of continuous-time least-squares algorithm

This section is devoted to the proof of Theorem 2.2, which consists of three steps: (1) We first quantify the performance gap between applying feedback controls for an incorrect model and that for the true model; our proof exploits the stability of Riccati equations established in Lemma 3.1; (2) We then estimate the parameter estimation error in terms of the number of learning episodes based on the sub-exponential tail behavior of the least-squares estimator (2.13); (3) Finally, we estimate the regret for the feedback controls (ψθ)(\psi^{\theta_{\ell}})_{\ell\in{\mathbb{N}}} in Algorithm 1, thus establishing Theorem 2.2.

Step 1: Analysis of the performance gap.

We start by establishing a quadratic expansion of the cost function at any open-loop control.

Proposition 3.7

Suppose (H.11) holds. Let ψθ\psi^{\theta^{\star}} be defined in (2.3), XθX^{{\theta^{\star}}} be the state process associated with ψθ\psi^{\theta^{\star}} (cf. (2.5)), and Uθ2(d)U^{\theta^{\star}}\in\mathcal{H}^{2}({\mathbb{R}}^{d}) be such that for all t[0,T]t\in[0,T], Utθ=ψθ(t,Xtθ)U^{\theta^{\star}}_{t}=\psi^{\theta^{\star}}(t,X^{{\theta^{\star}}}_{t}). Then for all U2(d)U\in\mathcal{H}^{2}({\mathbb{R}}^{d}),

Jθ(U)Jθ(Uθ)Q2Xθ,UXθ2(n)2+R2UUθ2(d)2,J^{\theta^{\star}}(U)-J^{\theta^{\star}}(U^{\theta^{\star}})\leq\|Q\|_{2}\|X^{\theta^{\star},U}-X^{{\theta^{\star}}}\|^{2}_{\mathcal{H}^{2}({\mathbb{R}}^{n})}+\|R\|_{2}\|U-U^{\theta^{\star}}\|^{2}_{\mathcal{H}^{2}({\mathbb{R}}^{d})}, (3.5)

where Xθ,UX^{\theta^{\star},U} is the state process controlled by UU (cf. (2.2)), and Jθ:2(d)J^{\theta^{\star}}:\mathcal{H}^{2}({\mathbb{R}}^{d})\rightarrow{\mathbb{R}} is defined in (2.1).

Proof  For notational simplicity, for all U2(d)U\in\mathcal{H}^{2}({\mathbb{R}}^{d}) and ϵ>0\epsilon>0, we write Uϵ=Uθ+ϵ(UUθ)U^{\epsilon}=U^{{\theta^{\star}}}+\epsilon(U-U^{{\theta^{\star}}}), denote by Xϵ=Xθ,UϵX^{\epsilon}=X^{\theta^{\star},U^{\epsilon}} the associated state process defined by (2.2), and by XU=X1=Xθ,UX^{U}=X^{1}=X^{\theta^{\star},U}. The affineness of (2.2) implies that Xϵ=(1ϵ)Xθ+ϵXUX^{\epsilon}=(1-\epsilon)X^{\theta^{\star}}+\epsilon X^{U} for all ϵ>0\epsilon>0. Hence, for all U2(d)U\in\mathcal{H}^{2}({\mathbb{R}}^{d}),

limϵ01ϵ(Jθ(Uϵ)Jθ(Uθ))\displaystyle\lim_{\epsilon\searrow 0}\frac{1}{\epsilon}\left(J^{\theta^{\star}}(U^{\epsilon})-J^{\theta^{\star}}(U^{\theta^{\star}})\right)
=limϵ01ϵ𝔼[0T(((1ϵ)Xtθ+ϵXtU)Q((1ϵ)Xtθ+ϵXtU)(Xtθ)QXtθ\displaystyle=\lim_{\epsilon\searrow 0}\frac{1}{\epsilon}\mathbb{E}\bigg{[}\int_{0}^{T}\bigg{(}\Big{(}(1-\epsilon)X^{\theta^{\star}}_{t}+\epsilon X^{U}_{t}\Big{)}^{\top}Q\Big{(}(1-\epsilon)X^{\theta^{\star}}_{t}+\epsilon X^{U}_{t}\Big{)}-(X^{\theta^{\star}}_{t})^{\top}QX^{\theta^{\star}}_{t}
+((1ϵ)Utθ+ϵUt)R((1ϵ)Utθ+ϵUt)(Utθ)RUtθ)dt]\displaystyle\quad+\left((1-\epsilon)U^{{\theta^{\star}}}_{t}+\epsilon U_{t}\right)^{\top}R\left((1-\epsilon)U^{{\theta^{\star}}}_{t}+\epsilon U_{t}\right)-(U^{{\theta^{\star}}}_{t})^{\top}RU^{{\theta^{\star}}}_{t}\bigg{)}\,{\mathrm{d}}t\bigg{]}
=limϵ0ϵ𝔼[0T((XtUXtθ)Q(XtUXtθ)+(UtUtθ)R(UtUtθ))dt]\displaystyle=\lim_{\epsilon\searrow 0}\epsilon\,\mathbb{E}\left[\int_{0}^{T}\Big{(}(X^{U}_{t}-X^{\theta^{\star}}_{t})^{\top}Q(X^{U}_{t}-X^{\theta^{\star}}_{t})+(U_{t}-U_{t}^{{\theta^{\star}}})^{\top}R(U_{t}-U_{t}^{{\theta^{\star}}})\Big{)}\,{\mathrm{d}}t\right]
+2𝔼[0T((XtUXtθ)QXtθ+(UtUtθ)RUtθ)dt]\displaystyle\qquad+2\mathbb{E}\left[\int_{0}^{T}\left((X^{U}_{t}-X^{\theta^{\star}}_{t})^{\top}QX_{t}^{\theta^{\star}}+(U_{t}-U_{t}^{{\theta^{\star}}})^{\top}RU_{t}^{{\theta^{\star}}}\right)\,{\mathrm{d}}t\right]
=2𝔼[0T((XtUXtθ)QXtθ+(UtUtθ)RUtθ)dt],\displaystyle=2\mathbb{E}\left[\int_{0}^{T}\left((X^{U}_{t}-X^{\theta^{\star}}_{t})^{\top}QX_{t}^{\theta^{\star}}+(U_{t}-U_{t}^{{\theta^{\star}}})^{\top}RU_{t}^{{\theta^{\star}}}\right)\,{\mathrm{d}}t\right],

which is based on the fact that XUXθ2(n)X^{U}-X^{\theta^{\star}}\in{\mathcal{H}^{2}({\mathbb{R}}^{n})} and UUθ2(d)U-U^{\theta^{\star}}\in{\mathcal{H}^{2}({\mathbb{R}}^{d})}. As UθU^{\theta^{\star}} is the optimal control of JθJ^{\theta^{\star}}, Jθ(U)Jθ(Uθ)J^{\theta^{\star}}({U})\geq J^{\theta^{\star}}(U^{\theta^{\star}}) for all U2(d){U}\in\mathcal{H}^{2}({\mathbb{R}}^{d}). Hence for all U2(d)U\in\mathcal{H}^{2}({\mathbb{R}}^{d}),

𝔼[0T((XtUXtθ)QXtθ+(UtUtθ)RUtθ)dt]=limϵ012ϵ(Jθ(Uϵ)Jθ(Uθ))0.\mathbb{E}\left[\int_{0}^{T}\left((X^{U}_{t}-X^{\theta^{\star}}_{t})^{\top}QX_{t}^{\theta^{\star}}+(U_{t}-U_{t}^{{\theta^{\star}}})^{\top}RU_{t}^{{\theta^{\star}}}\right)\,{\mathrm{d}}t\right]=\lim_{\epsilon\searrow 0}\frac{1}{2\epsilon}(J^{\theta^{\star}}(U^{\epsilon})-J^{\theta^{\star}}(U^{{\theta^{\star}}}))\geq 0. (3.6)

We now prove that the above quantity is in fact zero for all U2(d)U\in\mathcal{H}^{2}({\mathbb{R}}^{d}). To this end, let U2(d)U\in\mathcal{H}^{2}({\mathbb{R}}^{d}) be a given (open-loop) control, and consider U~=Uθ(UUθ)\tilde{U}=U^{{\theta^{\star}}}-(U-U^{{\theta^{\star}}}). Then by the affineness of (2.2), XU~XθX^{\tilde{U}}-X^{{\theta^{\star}}} satisfies the following controlled dynamics:

dXt=(AXtB(UUθ)t)dt,t[0,T];X0=0.{\mathrm{d}}X_{t}=(A^{\star}X_{t}-B^{\star}(U-U^{{\theta^{\star}}})_{t})\,{\mathrm{d}}t,\quad t\in[0,T];\quad X_{0}=0. (3.7)

Moreover, one can verify by the affineness of (2.2) that (XUXθ)-(X^{{U}}-X^{{\theta^{\star}}}) also satisfies the dynamics (3.7), which along with the uniqueness of solutions to (3.7) shows that XU~Xθ=(XUXθ)X^{\tilde{U}}-X^{{\theta^{\star}}}=-(X^{{U}}-X^{{\theta^{\star}}}). Therefore, applying (3.6) with U=U~U=\tilde{U} implies that

0𝔼[0T((XtU~Xtθ)QXtθ+(U~tUtθ)RUtθ)dt]\displaystyle 0\leq\mathbb{E}\left[\int_{0}^{T}\left((X^{\tilde{U}}_{t}-X^{\theta^{\star}}_{t})^{\top}QX_{t}^{\theta^{\star}}+(\tilde{U}_{t}-U_{t}^{{\theta^{\star}}})^{\top}RU_{t}^{{\theta^{\star}}}\right)\,{\mathrm{d}}t\right]
=𝔼[0T((XtUXtθ)QXtθ+(UtUtθ)RUtθ)dt]0.\displaystyle=-\mathbb{E}\left[\int_{0}^{T}\left((X^{{U}}_{t}-X^{\theta^{\star}}_{t})^{\top}QX_{t}^{\theta^{\star}}+({U}_{t}-U_{t}^{{\theta^{\star}}})^{\top}RU_{t}^{{\theta^{\star}}}\right)\,{\mathrm{d}}t\right]\leq 0.

Hence for all U2(d)U\in\mathcal{H}^{2}({\mathbb{R}}^{d}),

𝔼[0T((XtUXtθ)QXtθ+(UtUtθ)RUtθ)dt]=0,\mathbb{E}\left[\int_{0}^{T}\left((X^{U}_{t}-X^{\theta^{\star}}_{t})^{\top}QX_{t}^{\theta^{\star}}+(U_{t}-U_{t}^{{\theta^{\star}}})^{\top}RU_{t}^{{\theta^{\star}}}\right)\,{\mathrm{d}}t\right]=0,

which leads to the desired result (3.5) due to the following identity:

Jθ(U)Jθ(Uθ)=𝔼[0T((XtU)QXtU(Xtθ)QXtθ+UtRUt(Utθ)RUtθ)dt]=𝔼[0T((XtU)QXtU(Xtθ)QXtθ+UtRUt(Utθ)RUtθ)dt]2𝔼[0T((XtUXtθ)QXtθ+(UtUtθ)RUtθ)dt]=𝔼[0T((XtUXtθ)Q(XtUXtθ)+(UtUtθ)R(UtUtθ)dt)].\begin{split}J^{\theta^{\star}}(U)-J^{\theta^{\star}}(U^{\theta^{\star}})&=\mathbb{E}\left[\int_{0}^{T}((X^{U}_{t})^{\top}QX^{U}_{t}-(X^{\theta^{\star}}_{t})^{\top}QX^{\theta^{\star}}_{t}+U_{t}^{\top}RU_{t}-(U_{t}^{\theta^{\star}})^{\top}RU_{t}^{\theta^{\star}})\,{\mathrm{d}}t\right]\\ &=\mathbb{E}\left[\int_{0}^{T}\left((X^{U}_{t})^{\top}QX^{U}_{t}-(X^{\theta^{\star}}_{t})^{\top}QX^{\theta^{\star}}_{t}+U_{t}^{\top}RU_{t}-(U_{t}^{\theta^{\star}})^{\top}RU_{t}^{\theta^{\star}}\right)\,{\mathrm{d}}t\right]\\ &\quad-2\mathbb{E}\left[\int_{0}^{T}\left((X^{U}_{t}-X^{\theta^{\star}}_{t})^{\top}QX_{t}^{\theta^{\star}}+(U_{t}-U_{t}^{{\theta^{\star}}})^{\top}RU_{t}^{{\theta^{\star}}}\right)\,{\mathrm{d}}t\right]\\ &=\mathbb{E}\left[\int_{0}^{T}\left((X^{U}_{t}-X_{t}^{\theta^{\star}})^{\top}Q(X^{U}_{t}-X_{t}^{\theta^{\star}})+(U_{t}-U_{t}^{\theta^{\star}})^{\top}R(U_{t}-U_{t}^{\theta^{\star}})\,{\mathrm{d}}t\right)\right].\end{split}

 

Armed with Proposition 3.7, the following proposition quantifies the quadratic performance gap of a greedy policy ψθ\psi^{\theta}.

Proposition 3.8

Suppose (H.11) holds and let Θ\Theta be a bounded subset of (n+d)×n{\mathbb{R}}^{(n+d)\times n}. For each θΘ\theta\in\Theta, let ψθ\psi^{\theta} be defined in (2.9), let XψθX^{\psi^{\theta}} be the state process associated with ψθ\psi^{\theta} (cf. (2.11)), let ψθ\psi^{\theta^{\star}} be defined in (2.3), and let XθX^{{\theta^{\star}}} be the state process associated with ψθ\psi^{\theta^{\star}} (cf. (2.5)). Then there exists a constant CC such that

|Jθ(Uψθ)Jθ(Uθ)|C|θθ|2,θΘ,|J^{\theta^{\star}}(U^{\psi^{\theta}})-J^{\theta^{\star}}(U^{\theta^{\star}})|\leq C|\theta-\theta^{\star}|^{2},\quad\forall\theta\in\Theta,

where Utψθ=ψθ(t,Xtψθ)U^{\psi^{\theta}}_{t}=\psi^{\theta}(t,X^{\psi^{\theta}}_{t}) and Utθ=ψθ(t,Xtθ)U^{\theta^{\star}}_{t}=\psi^{\theta^{\star}}(t,X^{{\theta^{\star}}}_{t}) for all t[0,T]t\in[0,T], and JθJ^{\theta^{\star}} is defined in (2.1).

Proof  For all θΘ\theta\in\Theta, applying Proposition 3.7 with U=UψθU=U^{\psi^{\theta}} gives

Jθ(Uψθ)Jθ(Uθ)Q2Xθ,UψθXθ2(n)2+R2UψθUθ2(d)2,Q2XψθXψθ2(n)2+R2ψθ(,Xψθ)ψθ(,Xψθ)2(d)2,\displaystyle\begin{split}&J^{\theta^{\star}}(U^{\psi^{\theta}})-J^{\theta^{\star}}(U^{\theta^{\star}})\\ &\leq\|Q\|_{2}\|X^{\theta^{\star},U^{\psi^{\theta}}}-X^{{\theta^{\star}}}\|^{2}_{\mathcal{H}^{2}({\mathbb{R}}^{n})}+\|R\|_{2}\|U^{\psi^{\theta}}-U^{\theta^{\star}}\|^{2}_{\mathcal{H}^{2}({\mathbb{R}}^{d})},\\ &\leq\|Q\|_{2}\|X^{{\psi^{\theta}}}-X^{\psi^{\theta^{\star}}}\|^{2}_{\mathcal{H}^{2}({\mathbb{R}}^{n})}+\|R\|_{2}\|\psi^{\theta}(\cdot,X^{\psi^{\theta}}_{\cdot})-\psi^{\theta^{\star}}(\cdot,X^{\psi^{\theta^{\star}}}_{\cdot})\|^{2}_{\mathcal{H}^{2}({\mathbb{R}}^{d})},\end{split} (3.8)

where the last inequality used the fact that Xθ,Uψθ=XψθX^{\theta^{\star},U^{\psi^{\theta}}}=X^{{\psi^{\theta}}} (see (2.11)), and the definitions of UψθU^{\psi^{\theta}} and UθU^{\theta^{\star}}. It remains to prove

XψθXψθ2(n)+ψθ(,Xψθ)ψθ(,Xψθ)2(d)C|θθ|,\|X^{{\psi^{\theta}}}-X^{\psi^{\theta^{\star}}}\|_{\mathcal{H}^{2}({\mathbb{R}}^{n})}+\|\psi^{\theta}(\cdot,X^{\psi^{\theta}}_{\cdot})-\psi^{\theta^{\star}}(\cdot,X^{\psi^{\theta^{\star}}}_{\cdot})\|_{\mathcal{H}^{2}({\mathbb{R}}^{d})}\leq C|\theta-\theta^{\star}|,

for a constant CC independent of θ\theta.

Observe that by (2.9), for all (t,x)[0,T]×n(t,x)\in[0,T]\times{\mathbb{R}}^{n}, ψθ(t,x)=Ktθx\psi^{\theta}(t,x)=K^{\theta}_{t}x with Ktθ=R1BPtθK^{\theta}_{t}=-R^{-1}B^{\top}P^{\theta}_{t}. Now by Lemma 3.1 and the boundedness of Θ\Theta, there exists a constant C0C\geq 0 such that PθC([0,T];n×n)C\|P^{\theta}\|_{C([0,T];{\mathbb{R}}^{n\times n})}\leq C and PθPθC([0,T];n×n)C|θθ|\|P^{\theta}-P^{\theta^{\star}}\|_{C([0,T];{\mathbb{R}}^{n\times n})}\leq C|\theta-\theta^{\star}| for all θΘ{θ}\theta\in\Theta\cup\{\theta^{\star}\}, which along with Ktθ=R1BPtθK^{\theta}_{t}=-R^{-1}B^{\top}P^{\theta}_{t} implies that KθC([0,T];d×n)C\|K^{\theta}\|_{C([0,T];{\mathbb{R}}^{d\times n})}\leq C and KθKθC([0,T];d×n)C|θθ|\|K^{\theta}-K^{\theta^{\star}}\|_{C([0,T];{\mathbb{R}}^{d\times n})}\leq C|\theta-\theta^{\star}|. Moreover, observe from (2.5) and (2.11) that X0θ=X0ψθX^{{\theta^{\star}}}_{0}=X^{\psi^{\theta}}_{0} and for all t[0,T]t\in[0,T],

d(XψθXψθ)t=((A+BKtθ)(XψθXψθ)t+B(KtθKtθ)Xtψθ)dt,{\mathrm{d}}(X^{\psi^{\theta^{\star}}}-X^{\psi^{\theta}})_{t}=\Big{(}(A^{\star}+B^{\star}K^{\theta^{\star}}_{t})(X^{\psi^{\theta^{\star}}}-X^{\psi^{\theta}})_{t}+B^{\star}(K^{\theta^{\star}}_{t}-K^{\theta}_{t})X^{\psi^{\theta}}_{t}\Big{)}\,{\mathrm{d}}t,

which combined with the boundedness of KθK^{\theta^{\star}} and Gronwall’s inequality leads to

XψθXψθ2(n)\displaystyle\|X^{\psi^{\theta^{\star}}}-X^{\psi^{\theta}}\|_{\mathcal{H}^{2}({\mathbb{R}}^{n})} CXψθXψθ𝒮2(n)\displaystyle\leq C\|X^{\psi^{\theta^{\star}}}-X^{\psi^{\theta}}\|_{\mathcal{S}^{2}({\mathbb{R}}^{n})}
C(KθKθ)Xψθ2(d)CKθKθC([0,T];d×n)Xψθ2(n)\displaystyle\leq C\|(K^{\theta^{\star}}-K^{\theta})X^{\psi^{\theta}}\|_{\mathcal{H}^{2}({\mathbb{R}}^{d})}\leq C\|K^{\theta^{\star}}-K^{\theta}\|_{C([0,T];{\mathbb{R}}^{d\times n})}\|X^{\psi^{\theta}}\|_{\mathcal{H}^{2}({\mathbb{R}}^{n})}
C|θθ|,θΘ,\displaystyle\leq C|\theta-\theta^{\star}|,\quad\forall\theta\in\Theta,

where the last inequality follows from Xψθ2(n)C\|X^{\psi^{\theta}}\|_{\mathcal{H}^{2}({\mathbb{R}}^{n})}\leq C, as KθK^{\theta} is uniformly bounded. The above inequality further implies

ψθ(,Xψθ)ψθ(,Xψθ)2(d)=KθXψθKθXψθ2(d)\displaystyle\|\psi^{\theta}(\cdot,X^{\psi^{\theta}}_{\cdot})-\psi^{\theta^{\star}}(\cdot,X^{\psi^{\theta^{\star}}}_{\cdot})\|_{\mathcal{H}^{2}({\mathbb{R}}^{d})}=\|K^{\theta}_{\cdot}X^{\psi^{\theta}}_{\cdot}-K^{\theta^{\star}}_{\cdot}X^{\psi^{\theta^{\star}}}_{\cdot}\|_{\mathcal{H}^{2}({\mathbb{R}}^{d})}
(KθKθ)Xψθ2(d)+Kθ(XψθXψθ)2(d)\displaystyle\leq\|(K^{\theta}_{\cdot}-K^{\theta^{\star}}_{\cdot})X^{\psi^{\theta}}_{\cdot}\|_{\mathcal{H}^{2}({\mathbb{R}}^{d})}+\|K^{\theta^{\star}}_{\cdot}(X^{\psi^{\theta}}_{\cdot}-X^{\psi^{\theta^{\star}}}_{\cdot})\|_{\mathcal{H}^{2}({\mathbb{R}}^{d})}
KθKθC([0,T];d×n)Xψθ2(n)+KθC([0,T];d×n)XψθXψθ2(n)\displaystyle\leq\|K^{\theta^{\star}}-K^{\theta}\|_{C([0,T];{\mathbb{R}}^{d\times n})}\|X^{\psi^{\theta}}\|_{\mathcal{H}^{2}({\mathbb{R}}^{n})}+\|K^{\theta^{\star}}\|_{C([0,T];{\mathbb{R}}^{d\times n})}\|X^{\psi^{\theta}}-X^{\psi^{\theta^{\star}}}\|_{\mathcal{H}^{2}({\mathbb{R}}^{n})}
C|θθ|,θΘ,\displaystyle\leq C|\theta-\theta^{\star}|,\quad\forall\theta\in\Theta,

which along with (3.8) finishes the desired estimate.  
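
The quadratic dependence on |θ − θ*| in Proposition 3.8 can also be checked numerically without simulating noise paths: for a linear feedback applied to the true model, the expected cost is determined by the second-moment ODE dΣ_t/dt = (A* + B*K_t)Σ_t + Σ_t(A* + B*K_t)^T + I with Σ_0 = x_0 x_0^T, via J = ∫_0^T tr((Q + K_t^T R K_t)Σ_t) dt. The sketch below (illustrative only; toy model data and crude Euler discretizations) evaluates this cost for the greedy gains of perturbed parameters θ = θ* + hΔ and prints the gap, which should shrink roughly like h^2 up to discretization error.

```python
# Illustrative sketch only: deterministic check of the quadratic performance gap in
# Proposition 3.8.  Greedy gains are computed from an Euler sweep of the Riccati ODE
# for a perturbed parameter, and their cost under the *true* model is evaluated via
# the second-moment ODE.  All model data are toy placeholders.
import numpy as np

n, d, T, N = 2, 1, 1.0, 2000
dt = T / N
A_star = np.array([[0.0, 1.0], [-1.0, -0.5]])
B_star = np.array([[0.0], [1.0]])
Q, R = np.eye(n), np.eye(d)
x0 = np.array([1.0, 0.0])

def gains(A, B):
    """Greedy feedback gains K_t = -R^{-1} B^T P_t from a backward Euler sweep of (3.3)."""
    P = np.zeros((n, n))                       # terminal condition P_T = 0
    K = np.zeros((N, d, n))
    for i in range(N - 1, -1, -1):
        K[i] = -np.linalg.solve(R, B.T @ P)    # gain on [t_i, t_{i+1}) (uses P at t_{i+1})
        P = P + dt * (A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T) @ P + Q)
    return K

def cost(K):
    """Expected cost of the time-varying feedback K under the true model (A*, B*)."""
    Sigma = np.outer(x0, x0)
    J = 0.0
    for i in range(N):
        Acl = A_star + B_star @ K[i]
        J += dt * np.trace((Q + K[i].T @ R @ K[i]) @ Sigma)
        Sigma = Sigma + dt * (Acl @ Sigma + Sigma @ Acl.T + np.eye(n))
    return J

J_opt = cost(gains(A_star, B_star))
Delta_A = np.array([[0.1, 0.0], [0.0, 0.1]])
Delta_B = np.array([[0.0], [0.1]])
for h in (0.4, 0.2, 0.1, 0.05):
    J_h = cost(gains(A_star + h * Delta_A, B_star + h * Delta_B))
    print(h, J_h - J_opt)   # gap decreases roughly by a factor 4 per halving of h
```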

Step 2: Error bound for parameter estimation.

Proposition 3.9

Suppose (H.11) holds and let Θ(n+d)×n\Theta\subset{\mathbb{R}}^{(n+d)\times n} be such that there exists C1>0C_{1}>0 satisfying (𝔼[Vψθ])12C1\|({\mathbb{E}}[V^{\psi^{{\theta}}}])^{-1}\|_{2}\leq C_{1} and |θ|C1|\theta|\leq C_{1} for all θΘ\theta\in\Theta, with VψθV^{\psi^{\theta}} defined in (3.1). Then there exist constants C¯1,C¯20\bar{C}_{1},\bar{C}_{2}\geq 0 such that for all θΘ\theta\in\Theta and δ(0,1/2)\delta\in(0,1/2), if mC¯1(lnδ)m\geq\bar{C}_{1}(-\ln\delta), then with probability at least 12δ1-2\delta,

|θ^θ|C¯2(lnδm+lnδm+(lnδ)2m2),|\hat{\theta}-\theta^{\star}|\leq\bar{C}_{2}\bigg{(}\sqrt{\frac{-\ln\delta}{m}}+\frac{-\ln\delta}{m}+\frac{(-\ln\delta)^{2}}{m^{2}}\bigg{)}, (3.9)

where θ^\hat{\theta} denotes the right-hand side of (2.13) with the control ψθ\psi^{\theta}.

Proof  Let us fix δ(0,1/2)\delta\in(0,1/2) and θΘ\theta\in\Theta. By (2.12) and (2.13), we obtain

θ^θ2=(Vψθ,m+1mI)1Yψθ,m(𝔼[Vψθ])1𝔼[Yψθ]2(Vψθ,m+1mI)1(𝔼[Vψθ])12Yψθ,m2+(𝔼[Vψθ])12Yψθ,m𝔼[Yψθ]2.\begin{split}\|\hat{\theta}-\theta^{\star}\|_{2}&=\|(V^{\psi^{{\theta}},m}+\tfrac{1}{m}I)^{-1}Y^{\psi^{{\theta}},m}-({\mathbb{E}}[V^{\psi^{{\theta}}}])^{-1}{\mathbb{E}}[Y^{\psi^{{\theta}}}]\|_{2}\\ &\leq\|(V^{\psi^{{\theta}},m}+\tfrac{1}{m}I)^{-1}-({\mathbb{E}}[V^{\psi^{{\theta}}}])^{-1}\|_{2}\|Y^{\psi^{{\theta}},m}\|_{2}+\|({\mathbb{E}}[V^{\psi^{{\theta}}}])^{-1}\|_{2}\|Y^{\psi^{{\theta}},m}-{\mathbb{E}}[Y^{\psi^{{\theta}}}]\|_{2}.\end{split}

As E1F1=F1(FE)E1E^{-1}-F^{-1}=F^{-1}(F-E)E^{-1} for all nonsingular matrices EE and FF, we have

θ^θ2(Vψθ,m+1mI)12(𝔼[Vψθ])12Yψθ,m2Vψθ,m𝔼[Vψθ]+1mI2+(𝔼[Vψθ])12Yψθ,m𝔼[Yψθ]2C1((Vψθ,m+1mI)12Yψθ,m2Vψθ,m𝔼[Vψθ]+1mI2+Yψθ,m𝔼[Yψθ]2),\begin{split}&\|\hat{\theta}-\theta^{\star}\|_{2}\\ &\leq\|(V^{\psi^{{\theta}},m}+\tfrac{1}{m}I)^{-1}\|_{2}\|({\mathbb{E}}[V^{\psi^{{\theta}}}])^{-1}\|_{2}\|Y^{\psi^{{\theta}},m}\|_{2}\|V^{\psi^{{\theta}},m}-{\mathbb{E}}[V^{\psi^{{\theta}}}]+\tfrac{1}{m}I\|_{2}\\ &\quad+\|({\mathbb{E}}[V^{\psi^{{\theta}}}])^{-1}\|_{2}\|Y^{\psi^{{\theta}},m}-{\mathbb{E}}[Y^{\psi^{{\theta}}}]\|_{2}\\ &\leq C_{1}\big{(}\|(V^{\psi^{{\theta}},m}+\tfrac{1}{m}I)^{-1}\|_{2}\|Y^{\psi^{{\theta}},m}\|_{2}\|V^{\psi^{{\theta}},m}-{\mathbb{E}}[V^{\psi^{{\theta}}}]+\tfrac{1}{m}I\|_{2}+\|Y^{\psi^{{\theta}},m}-{\mathbb{E}}[Y^{\psi^{{\theta}}}]\|_{2}\big{)},\end{split} (3.10)

where the last inequality follows from the assumption (𝔼[Vψθ])12C1\|({\mathbb{E}}[V^{\psi^{{\theta}}}])^{-1}\|_{2}\leq C_{1}.

We now estimate each term in the right-hand side of (3.10), and denote by CC a generic constant independent of θΘ,δ(0,1/2),m\theta\in\Theta,\delta\in(0,1/2),m\in{\mathbb{N}}. By Theorem 3.6, with probability at least 12δ1-2\delta, Vψθ,m𝔼[Vψθ]2δm\|V^{\psi^{{\theta}},m}-{\mathbb{E}}[V^{\psi^{{\theta}}}]\|_{2}\leq\delta_{m} and Yψθ,m𝔼[Yψθ]2δm\|Y^{\psi^{{\theta}},m}-{\mathbb{E}}[Y^{\psi^{{\theta}}}]\|_{2}\leq\delta_{m}, with the constant δm\delta_{m} given by

δmCmax{(lnδm)12,lnδm}.\delta_{m}\coloneqq C\max\left\{\bigg{(}\frac{-\ln\delta}{m}\bigg{)}^{\frac{1}{2}},\frac{-\ln\delta}{m}\right\}. (3.11)

Suppose that mm\in{\mathbb{N}} is sufficiently large so that δm+1/m1/(2C1)\delta_{m}+1/m\leq 1/(2C_{1}), where C1C_{1} is the constant such that (𝔼[Vψθ])12C1\|({\mathbb{E}}[V^{\psi^{{\theta}}}])^{-1}\|_{2}\leq C_{1} for all θΘ\theta\in\Theta. Then with probability at least 12δ1-2\delta, Vψθ,m𝔼[Vψθ]+1mI2δm+1m12C1\|V^{\psi^{{\theta}},m}-{\mathbb{E}}[V^{\psi^{{\theta}}}]+\tfrac{1}{m}I\|_{2}\leq\delta_{m}+\frac{1}{m}\leq\tfrac{1}{2C_{1}}, which in turn yields

λmin(Vψθ,m+1mI)λmin(𝔼[Vψθ])Vψθ,m𝔼[Vψθ]+1mI212C1,\displaystyle\lambda_{\min}(V^{\psi^{{\theta}},m}+\tfrac{1}{m}I)\geq\lambda_{\min}({\mathbb{E}}[V^{\psi^{{\theta}}}])-\|V^{\psi^{{\theta}},m}-{\mathbb{E}}[V^{\psi^{{\theta}}}]+\tfrac{1}{m}I\|_{2}\geq\tfrac{1}{2C_{1}},

or equivalently (Vψθ,m+1mI)122C1\|(V^{\psi^{{\theta}},m}+\tfrac{1}{m}I)^{-1}\|_{2}\leq 2C_{1}. Moreover, the continuity of (n+d)×nθ𝔼[Yψθ]{\mathbb{R}}^{(n+d)\times n}\ni\theta\mapsto{\mathbb{E}}[Y^{\psi^{{\theta}}}]\in{\mathbb{R}}^{(n+d)\times n} implies Yψθ,m2𝔼[Yψθ]2+Yψθ,m𝔼[Yψθ]2C+Yψθ,m𝔼[Yψθ]2\|Y^{\psi^{{\theta}},m}\|_{2}\leq\|{\mathbb{E}}[Y^{\psi^{{\theta}}}]\|_{2}+\|Y^{\psi^{{\theta}},m}-{\mathbb{E}}[Y^{\psi^{{\theta}}}]\|_{2}\leq C+\|Y^{\psi^{{\theta}},m}-{\mathbb{E}}[Y^{\psi^{{\theta}}}]\|_{2}. Hence, by (3.10),

|θ^θ|C((1+Yψθ,m𝔼[Yψθ]2)Vψθ,m𝔼[Vψθ]+1mI2+Yψθ,m𝔼[Yψθ]2)C((δm+1m)(1+δm)+δm)C(δm+δm2+1m).\begin{split}&|\hat{\theta}-\theta^{\star}|\\ &\leq C\big{(}(1+\|Y^{\psi^{{\theta}},m}-{\mathbb{E}}[Y^{\psi^{{\theta}}}]\|_{2})\|V^{\psi^{{\theta}},m}-{\mathbb{E}}[V^{\psi^{{\theta}}}]+\tfrac{1}{m}I\|_{2}+\|Y^{\psi^{{\theta}},m}-{\mathbb{E}}[Y^{\psi^{{\theta}}}]\|_{2}\big{)}\\ &\leq C\left((\delta_{m}+\tfrac{1}{m})(1+\delta_{m})+\delta_{m}\right)\leq C\left(\delta_{m}+\delta_{m}^{2}+\tfrac{1}{m}\right).\end{split}

Substituting (3.11) into the above estimate yields the desired estimate (3.9). As δ(0,1/2)\delta\in(0,1/2), it is clear that δm+1/m1/(2C1)\delta_{m}+1/m\leq 1/(2C_{1}) is satisfied for all mC¯1(lnδ)m\geq\bar{C}_{1}(-\ln\delta), with a sufficiently large C¯1\bar{C}_{1}.  
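
The m^{-1/2} behaviour in (3.9) can be observed in a simple simulation. The sketch below is illustrative only: the true model, the exploratory feedback gain and all discretizations are toy placeholders, and the Euler time-discretization introduces a small bias that is absent from the continuous-time estimator (2.13). It forms the averaged matrices V^{ψ,m} and Y^{ψ,m} from simulated episodes and reports the estimation error for increasing m.

```python
# Illustrative sketch only: Monte Carlo check of the least-squares error rate analysed
# in Proposition 3.9, with episodes generated under a fixed (time-invariant) feedback.
import numpy as np

rng = np.random.default_rng(3)
n, d, T, N = 2, 1, 1.0, 200
dt = T / N
A_star = np.array([[0.0, 1.0], [-1.0, -0.5]])
B_star = np.array([[0.0], [1.0]])
K = np.array([[-0.5, -0.3]])                   # fixed exploratory feedback (placeholder)
x0 = np.array([1.0, 0.0])
theta_star = np.vstack([A_star.T, B_star.T])   # theta = (A, B)^T in R^{(n+d) x n}

def run_episode():
    """One episode: accumulate V = int Z Z^T dt and Y = int Z (dX)^T via Euler sums."""
    X = x0.copy()
    V = np.zeros((n + d, n + d))
    Y = np.zeros((n + d, n))
    for _ in range(N):
        U = K @ X
        Z = np.concatenate([X, U])
        dWt = rng.standard_normal(n) * np.sqrt(dt)
        X_next = X + (A_star @ X + B_star @ U) * dt + dWt
        V += np.outer(Z, Z) * dt
        Y += np.outer(Z, X_next - X)
        X = X_next
    return V, Y

for m in (100, 400, 1600):
    Vs, Ys = zip(*(run_episode() for _ in range(m)))
    V_m, Y_m = np.mean(Vs, axis=0), np.mean(Ys, axis=0)
    theta_hat = np.linalg.solve(V_m + np.eye(n + d) / m, Y_m)   # regularised LS as in (2.13)
    print(m, np.linalg.norm(theta_hat - theta_star))            # error roughly halves as m quadruples
```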

Step 3: Proof of Theorem 2.2.

The following proposition shows that for any given θ=(A,B)(n+d)×n\theta=(A,B)^{\top}\in{\mathbb{R}}^{(n+d)\times n}, the full row rank of KθK^{\theta} is equivalent to the well-definedness of (2.12) for all θ\theta^{\prime} sufficiently close to θ\theta.

Proposition 3.10

Suppose (H.11) holds. For each θ(n+d)×n\theta\in{\mathbb{R}}^{(n+d)\times n}, let VψθV^{\psi^{\theta}} be defined in (3.1). Then for any θ=(A,B)(n+d)×n\theta=(A,B)^{\top}\in{\mathbb{R}}^{(n+d)\times n}, the following properties are equivalent:

  1. (1)

    {vd(Ktθ)v=0,t[0,T]}={0}\{v\in{\mathbb{R}}^{d}\mid(K^{\theta}_{t})^{\top}v=0,\;\forall t\in[0,T]\}=\{0\}, with KθK^{\theta} defined in (2.9);

  2. (2)

    𝔼[Vψθ]𝕊+n+d{\mathbb{E}}[V^{\psi^{{\theta}}}]\in{\mathbb{S}}_{+}^{n+d};

  3. (3)

    there exist λ0,ε>0\lambda_{0},\varepsilon>0 such that λmin(𝔼[Vψθ])λ0\lambda_{\min}({\mathbb{E}}[V^{\psi^{{\theta}^{\prime}}}])\geq\lambda_{0} for all θΦε{θ(n+d)×n|θθ|ε}{\theta}^{\prime}\in\Phi_{\varepsilon}\coloneqq\{\theta^{\prime}\in{\mathbb{R}}^{(n+d)\times n}\mid|{\theta}^{\prime}-\theta|\leq\varepsilon\}, where λmin(Z)\lambda_{\min}(Z) is the minimum eigenvalue of Z𝕊0n+dZ\in{\mathbb{S}}^{n+d}_{0}.

Proof  For Item 1 ⟺ Item 2: By (3.1), 𝔼[Vψθ]𝕊+n+d{\mathbb{E}}[V^{\psi^{\theta}}]\in{\mathbb{S}}^{n+d}_{+} if and only if there exists no nonzero vn+dv\in{\mathbb{R}}^{n+d} such that

𝔼[0TvZtψθ(Ztψθ)vdt]=0Tv(IKtθ)𝔼[Xtψθ(Xtψθ)](I(Ktθ))vdt=0,{\mathbb{E}}\left[\int_{0}^{T}v^{\top}Z^{\psi^{\theta}}_{t}(Z^{\psi^{\theta}}_{t})^{\top}v\,{\mathrm{d}}t\right]=\int_{0}^{T}v^{\top}\begin{pmatrix}I\\ K^{\theta}_{t}\end{pmatrix}{\mathbb{E}}\left[X^{\psi^{\theta}}_{t}(X^{\psi^{\theta}}_{t})^{\top}\right]\begin{pmatrix}I&(K^{\theta}_{t})^{\top}\end{pmatrix}v\,{\mathrm{d}}t=0, (3.12)

where we applied Fubini’s theorem for the first identity. By (2.11), Xtψθ=Φtθ(x0+0t(Φsθ)1dWs)X^{\psi^{\theta}}_{t}=\Phi^{\theta}_{t}\big{(}x_{0}+\int_{0}^{t}(\Phi^{\theta}_{s})^{-1}\,{\mathrm{d}}W_{s}\big{)} for all t[0,T]t\in[0,T], where ΦθC([0,T];n×n)\Phi^{\theta}\in C([0,T];{\mathbb{R}}^{n\times n}) is the fundamental solution of dΦtθ=(A+BKtθ)Φtθdt{\mathrm{d}}\Phi^{\theta}_{t}=(A^{\star}+B^{\star}K^{\theta}_{t})\Phi^{\theta}_{t}\,{\mathrm{d}}t, Ktθ=R1BPtθK^{\theta}_{t}=-R^{-1}B^{\top}P^{\theta}_{t} for all t[0,T]t\in[0,T], and PθP^{\theta} satisfies (2.10).

𝔼[Xtψθ(Xtψθ)]=Φtθ(x0x0+0t(Φsθ)1((Φsθ)1)ds)(Φtθ)𝕊0n,t[0,T].{\mathbb{E}}\left[X^{\psi^{\theta}}_{t}(X^{\psi^{\theta}}_{t})^{\top}\right]=\Phi^{\theta}_{t}\left(x_{0}x_{0}^{\top}+\int_{0}^{t}(\Phi^{\theta}_{s})^{-1}((\Phi^{\theta}_{s})^{-1})^{\top}\,{\mathrm{d}}s\right)(\Phi^{\theta}_{t})^{\top}\in{\mathbb{S}}^{n}_{0},\quad\forall t\in[0,T].

Then by (3.12) and the continuity of t𝔼[Xtψθ(Xtψθ)]t\mapsto{\mathbb{E}}\left[X^{\psi^{\theta}}_{t}(X^{\psi^{\theta}}_{t})^{\top}\right] and tKtθt\mapsto K^{\theta}_{t}, 𝔼[Vψθ]𝕊+n+d{\mathbb{E}}[V^{\psi^{\theta}}]\in{\mathbb{S}}^{n+d}_{+} if and only if there exists no nonzero vn+dv\in{\mathbb{R}}^{n+d} such that

v(IKtθ)Φtθ(x0x0+0t(Φsθ)1((Φsθ)1)ds)(Φtθ)(I(Ktθ))v=0,t[0,T],v^{\top}\begin{pmatrix}I\\ K^{\theta}_{t}\end{pmatrix}\Phi^{\theta}_{t}\left(x_{0}x_{0}^{\top}+\int_{0}^{t}(\Phi^{\theta}_{s})^{-1}((\Phi^{\theta}_{s})^{-1})^{\top}\,{\mathrm{d}}s\right)(\Phi^{\theta}_{t})^{\top}\begin{pmatrix}I&(K^{\theta}_{t})^{\top}\end{pmatrix}v=0,\quad\forall t\in[0,T],

where II is the n×nn\times n identity matrix. One can easily deduce by the invertibility of (Φtθ)1(\Phi^{\theta}_{t})^{-1} for all t[0,T]t\in[0,T] that 0t(Φsθ)1((Φsθ)1)ds𝕊+n\int_{0}^{t}(\Phi^{\theta}_{s})^{-1}((\Phi^{\theta}_{s})^{-1})^{\top}\,{\mathrm{d}}s\in{\mathbb{S}}^{n}_{+} for all t>0t>0, which subsequently shows that 𝔼[Vψθ]𝕊+n+d{\mathbb{E}}[V^{\psi^{\theta}}]\in{\mathbb{S}}^{n+d}_{+} if and only if there exists no nonzero v~n+d\tilde{v}\in{\mathbb{R}}^{n+d} such that (I(Ktθ))v~=0\begin{pmatrix}I&(K^{\theta}_{t})^{\top}\end{pmatrix}\tilde{v}=0 for all t[0,T]t\in[0,T]. Now let us denote without loss of generality that v~=(uv)\tilde{v}=\begin{pmatrix}u\\ v\end{pmatrix} for some unu\in{\mathbb{R}}^{n} and vdv\in{\mathbb{R}}^{d}. Then the above derivation shows that 𝔼[Vψθ]𝕊+n+d{\mathbb{E}}[V^{\psi^{\theta}}]\in{\mathbb{S}}^{n+d}_{+} is equivalent to the following statement:

if unu\in{\mathbb{R}}^{n} and vdv\in{\mathbb{R}}^{d} satisfy u+(Ktθ)v=0u+(K^{\theta}_{t})^{\top}v=0 for all t[0,T]t\in[0,T], then u=0u=0 and v=0v=0. (3.13)

By (2.9), (Ktθ)=PtθBR1(K^{\theta}_{t})^{\top}=-P^{\theta}_{t}BR^{-1} for all t[0,T]t\in[0,T] and PTθ=0P^{\theta}_{T}=0, implying that KTθ=0K^{\theta}_{T}=0. Then (3.13) can be rewritten as:

if vdv\in{\mathbb{R}}^{d} satisfies (Ktθ)v=0(K^{\theta}_{t})^{\top}v=0 for all t[0,T]t\in[0,T], then v=0v=0.

For Item 2 ⟺ Item 3: Item 3 clearly implies Item 2. On the other hand, for any given θ,θ(n+d)×n\theta,\theta^{\prime}\in{\mathbb{R}}^{(n+d)\times n},

d(XψθXψθ)t=((A+BKtθ)(XψθXψθ)t+B(KtθKtθ)Xtψθ)dt.{\mathrm{d}}(X^{\psi^{\theta}}-X^{\psi^{\theta^{\prime}}})_{t}=\Big{(}(A^{\star}+B^{\star}K^{\theta}_{t})(X^{\psi^{\theta}}-X^{\psi^{\theta^{\prime}}})_{t}+B^{\star}(K^{\theta}_{t}-K^{\theta^{\prime}}_{t})X^{\psi^{\theta^{\prime}}}_{t}\Big{)}\,{\mathrm{d}}t.

Then, we can easily deduce from the continuity of θKθ\theta\mapsto K^{\theta} (see Lemma 3.1) that (n+d)×nθZψθ2(n+d){\mathbb{R}}^{(n+d)\times n}\ni\theta\mapsto Z^{\psi^{\theta}}\in\mathcal{H}^{2}({\mathbb{R}}^{n+d}) is continuous, which implies the continuity of (n+d)×nθ𝔼[Vψθ]=𝔼[0TZtψθ(Ztψθ)dt]𝕊0n+d{\mathbb{R}}^{(n+d)\times n}\ni\theta\mapsto{\mathbb{E}}[V^{\psi^{\theta}}]={\mathbb{E}}\big{[}\int_{0}^{T}Z^{\psi^{\theta}}_{t}(Z^{\psi^{\theta}}_{t})^{\top}\,{\mathrm{d}}t\big{]}\in{\mathbb{S}}^{n+d}_{0}. Hence, by the continuity of the minimum eigenvalue function, we can conclude Item 3 from Item 2.  

The following proposition provides sufficient conditions for the nondegeneracy of KθK^{\theta}.

Proposition 3.11

Let n,dn,d\in{\mathbb{N}}, θ=(A,B)(n+d)×n\theta=(A,B)^{\top}\in{\mathbb{R}}^{(n+d)\times n}, Q𝕊0nQ\in{\mathbb{S}}_{0}^{n} and R𝕊+dR\in{\mathbb{S}}_{+}^{d}.

  1. (1)

    For all T>0T>0, if BQB𝕊+dB^{\top}QB\in{\mathbb{S}}^{d}_{+}, then {vd(Ktθ)v=0,t[0,T]}={0}\{v\in{\mathbb{R}}^{d}\mid(K^{\theta}_{t})^{\top}v=0,\;\forall t\in[0,T]\}=\{0\}.

  2. (2)

    Assume that the algebraic Riccati equation AP+PAP(BR1B)P+Q=0A^{\top}P+PA-P(BR^{-1}B^{\top})P+Q=0 admits a unique maximal solution P𝕊+nP_{\infty}\in{\mathbb{S}}^{n}_{+}. Let K=R1BPK_{\infty}=-R^{-1}B^{\top}P_{\infty}, and for each T>0T>0, let P(T)C([0,T];𝕊0n)P^{(T)}\in C([0,T];{\mathbb{S}}_{0}^{n}) be defined in (2.10). Assume that limTP0(T)=P\lim_{T\rightarrow\infty}P^{(T)}_{0}=P_{\infty} and K(K)𝕊+dK_{\infty}(K_{\infty})^{\top}\in{\mathbb{S}}_{+}^{d}. Then there exists T0>0T_{0}>0, such that for all TT0T\geq T_{0}, {vd(Ktθ)v=0,t[0,T]}={0}\{v\in{\mathbb{R}}^{d}\mid(K^{\theta}_{t})^{\top}v=0,\;\forall t\in[0,T]\}=\{0\}.

Proof  To prove Item 1, suppose that BQB𝕊+dB^{\top}QB\in{\mathbb{S}}_{+}^{d} and let vdv\in{\mathbb{R}}^{d} be such that (Ktθ)v=PtθBR1v=0(K^{\theta}_{t})^{\top}v=-P^{\theta}_{t}BR^{-1}v=0 for all t[0,T]t\in[0,T], with PθP^{\theta} defined in (2.10). Setting u=R1vu=R^{-1}v, right multiplying (2.10) by BuBu, and left multiplying (2.10) by uBu^{\top}B^{\top} shows

uB(ddtPtθ)Bu+uBAPtθBu+uBPtθABuuBPtθBR1BPtθBu+uBQBu=0.u^{\top}B^{\top}(\tfrac{{\mathrm{d}}}{{\mathrm{d}}t}P^{\theta}_{t})Bu+u^{\top}B^{\top}A^{\top}P^{\theta}_{t}Bu+u^{\top}B^{\top}P^{\theta}_{t}ABu-u^{\top}B^{\top}P^{\theta}_{t}BR^{-1}B^{\top}P^{\theta}_{t}Bu+u^{\top}B^{\top}QBu=0.

As PtθBu=0P^{\theta}_{t}Bu=0 for all t[0,T]t\in[0,T], we have uB(ddtPtθ)Bu=0u^{\top}B^{\top}(\frac{{\mathrm{d}}}{{\mathrm{d}}t}P^{\theta}_{t})Bu=0 and uBPtθ=0u^{\top}B^{\top}P^{\theta}_{t}=0 for all t(0,T)t\in(0,T), and hence uBQBu=0u^{\top}B^{\top}QBu=0. The assumption BQB𝕊+dB^{\top}QB\in{\mathbb{S}}^{d}_{+} then gives u=R1v=0u=R^{-1}v=0, which along with the invertibility of R1R^{-1} shows that v=0v=0.

To prove Item 2, observe that limT(R1BP0(T))=K\lim_{T\rightarrow\infty}(-R^{-1}B^{\top}P_{0}^{(T)})=K_{\infty}. As λmin(K(K))>0\lambda_{\min}(K_{\infty}(K_{\infty})^{\top})>0, there exists T0>0T_{0}>0 such that for all TT0T\geq T_{0}, λmin((R1BP0(T))(R1BP0(T)))>0\lambda_{\min}\left((-R^{-1}B^{\top}P_{0}^{(T)})(-R^{-1}B^{\top}P_{0}^{(T)})^{\top}\right)>0. Fix TT0T\geq T_{0} and consider vdv\in{\mathbb{R}}^{d} such that (Ktθ)v=0(K^{\theta}_{t})^{\top}v=0 for all t[0,T]t\in[0,T]. Then the definitions of KθK^{\theta} and P(T)P^{(T)} imply the invertibility of K0θ(K0θ)K^{\theta}_{0}(K^{\theta}_{0})^{\top}, which yields v=(K0θ(K0θ))1K0θ(K0θ)v=0v=(K^{\theta}_{0}(K^{\theta}_{0})^{\top})^{-1}K^{\theta}_{0}(K^{\theta}_{0})^{\top}v=0.  

Now we are ready for the proof of Theorem 2.2.

Proof [Proof of Theorem 2.2] As (H.12) holds with θ\theta^{\star} and θ0\theta_{0}, we can obtain from Proposition 3.10 that there exist C1,ε>0C_{1},\varepsilon>0 such that for all θΘε{θ(n+d)×n|θθ|ε}{θ0}\theta\in\Theta_{\varepsilon}\coloneqq\{\theta\in{\mathbb{R}}^{(n+d)\times n}\mid|\theta-\theta^{\star}|\leq\varepsilon\}\cup\{\theta_{0}\}, we have (𝔼[Vψθ])12C1\|({\mathbb{E}}[V^{\psi^{{\theta}}}])^{-1}\|_{2}\leq C_{1}. Then by Proposition 3.9, there exist constants C¯1,C¯21\bar{C}_{1},\bar{C}_{2}\geq 1, such that for all θΘε\theta\in\Theta_{\varepsilon} and δ(0,1/2)\delta^{\prime}\in(0,1/2), if mC¯1(lnδ)m\geq\bar{C}_{1}(-\ln\delta^{\prime}), then with probability at least 12δ1-2\delta^{\prime},

|θ^θ|C¯2(lnδm+lnδm+(lnδ)2m2),|\hat{\theta}-\theta^{\star}|\leq\bar{C}_{2}\left(\sqrt{\frac{-\ln\delta^{\prime}}{m}}+\frac{-\ln\delta^{\prime}}{m}+\frac{(-\ln\delta^{\prime})^{2}}{m^{2}}\right), (3.14)

where θ^\hat{\theta} denotes the right-hand side of (2.13) with the control ψθ\psi^{\theta}. In the following, we fix δ(0,3/π2)\delta\in(0,3/\pi^{2}) and CC0C\geq C_{0}, with the constant C0(0,){C}_{0}\in(0,\infty) satisfying

C0C¯1(sup{0},δ(0,3/π2)ln(δ/(+1)2)2(lnδ))/min{(ε3C¯2)2,1},{C}_{0}\geq\bar{C}_{1}\bigg{(}\sup_{\ell\in{\mathbb{N}}\cup\{0\},\delta\in(0,{3}/{\pi^{2}})}\frac{-\ln(\delta/(\ell+1)^{2})}{2^{\ell}(-\ln\delta)}\bigg{)}\bigg{/}\min\bigg{\{}\left(\frac{\varepsilon}{3\bar{C}_{2}}\right)^{2},1\bigg{\}},

let m0=C(lnδ)m_{0}=C(-\ln\delta), and for each {0}\ell\in{\mathbb{N}}\cup\{0\}, let δ=δ/(+1)2\delta_{\ell}=\delta/(\ell+1)^{2}, m=2m0m_{\ell}=2^{\ell}m_{0}, and let θ+1\theta_{\ell+1} be generated by (2.13) with m=mm=m_{\ell} and θ=θ\theta=\theta_{\ell}. Note that the choices of C0,m,δC_{0},m_{\ell},\delta_{\ell} ensure that mC¯1(lnδ)m_{\ell}\geq\bar{C}_{1}(-\ln\delta_{\ell}), and

C¯2(lnδm+(lnδ)m+(lnδ)2m2)3C¯2lnδmε,{0}.\bar{C}_{2}\left(\sqrt{\frac{-\ln\delta_{\ell}}{m_{\ell}}}+\frac{(-\ln\delta_{\ell})}{m_{\ell}}+\frac{(-\ln\delta_{\ell})^{2}}{m^{2}_{\ell}}\right)\leq 3\bar{C}_{2}\sqrt{\frac{-\ln\delta_{\ell}}{m_{\ell}}}\leq\varepsilon,\quad\forall\ell\in{\mathbb{N}}\cup\{0\}. (3.15)

We now prove that, with probability at least 12=0δ=1π23δ1-2\sum_{\ell=0}^{\infty}\delta_{\ell}=1-\frac{\pi^{2}}{3}\delta,

|θ+1θ|C¯2(lnδm+(lnδ)m+(lnδ)2m2),{0}.|{\theta}_{\ell+1}-\theta^{\star}|\leq\bar{C}_{2}\bigg{(}\sqrt{\frac{-\ln\delta_{\ell}}{m_{\ell}}}+\frac{(-\ln\delta_{\ell})}{m_{\ell}}+\frac{(-\ln\delta_{\ell})^{2}}{m_{\ell}^{2}}\bigg{)},\quad\forall\ell\in{\mathbb{N}}\cup\{0\}. (3.16)

Let us consider the induction statement for each k{0}k\in{\mathbb{N}}\cup\{0\}: with probability at least 12=0kδ1-2\sum_{\ell=0}^{k}\delta_{\ell}, (3.16) holds for all =0,,k\ell=0,\ldots,k. The fact that θ0Θε\theta_{0}\in\Theta_{\varepsilon} and (3.14) yields the induction statement for k=0k=0. Now suppose that the induction statement holds for some k{0}k\in{\mathbb{N}}\cup\{0\}. Then the induction hypothesis and (3.15) ensure that |θθ|ε|\theta_{\ell}-\theta^{\star}|\leq\varepsilon for all =1,,k+1\ell=1,\ldots,k+1 (and hence θk+1Θε\theta_{k+1}\in\Theta_{\varepsilon}) with probability at least 12=0kδ1-2\sum_{\ell=0}^{k}\delta_{\ell}. Conditioning on this event, we can apply (3.14) with θ=θk+1\theta=\theta_{k+1}, δ=δk+1<1/2\delta^{\prime}=\delta_{k+1}<1/2 and m=mk+1C¯1(lnδk+1)m=m_{k+1}\geq\bar{C}_{1}(-\ln\delta_{k+1}), and deduce with probability at least 12δk+11-2\delta_{k+1} that (3.16) holds for the index =k+1\ell=k+1. Combining this with the induction hypothesis yields (3.16) holds for the indices =0,,k+1\ell=0,\ldots,k+1, with probability at least 12=0k+1δ1-2\sum_{\ell=0}^{k+1}\delta_{\ell}.

Observe that for all ii\in{\mathbb{N}}, Algorithm 1 generates the ii-th trajectory with control ψθ\psi^{\theta_{\ell}} if i(j=01mj,j=0mj]=(m0(21),m0(2+11)]i\in(\sum_{j=0}^{\ell-1}m_{j},\sum_{j=0}^{\ell}m_{j}]=(m_{0}(2^{\ell}-1),m_{0}(2^{\ell+1}-1)] with some {0}\ell\in{\mathbb{N}}\cup\{0\}. Then conditioning on the event (3.16), we can obtain from Proposition 3.8 that, for all MM\in{\mathbb{N}},

R(M)=0log2(Mm0+1)1m(Jθ(Uψθ)Jθ(Uθ))C=0log2(Mm0+1)1m|θθ|2Cm0+C=0log2(Mm0+1)1(lnδ)(1+lnδm+(lnδ)3m3)C(lnδ)+C=1log2M(2lnlnδ)C((lnM)(lnlnM)+(lnM)(lnδ)),\displaystyle\begin{split}R(M)&\leq\sum_{\ell=0}^{\lceil\log_{2}(\frac{M}{m_{0}}+1)\rceil-1}m_{\ell}\Big{(}J^{\theta^{\star}}(U^{\psi^{\theta_{\ell}}})-J^{\theta^{\star}}(U^{\theta^{\star}})\Big{)}\leq C^{\prime}\sum_{\ell=0}^{\lceil\log_{2}(\frac{M}{m_{0}}+1)\rceil-1}m_{\ell}|\theta_{\ell}-\theta^{\star}|^{2}\\ &\leq C^{\prime}m_{0}+C^{\prime}\sum_{\ell=0}^{\lceil\log_{2}(\frac{M}{m_{0}}+1)\rceil-1}(-\ln\delta_{\ell})\Big{(}1+\frac{-\ln\delta_{\ell}}{m_{\ell}}+\frac{(-\ln\delta_{\ell})^{3}}{m_{\ell}^{3}}\Big{)}\\ &\leq C^{\prime}(-\ln\delta)+C^{\prime}\sum_{\ell=1}^{\lceil\log_{2}{M}\rceil}\bigg{(}2\ln\ell-\ln\delta\bigg{)}\leq C^{\prime}\left((\ln M)(\ln\ln M)+(\ln M)(-\ln\delta)\right),\end{split} (3.17)

with a constant CC^{\prime} independent of MM and δ\delta, where we have used =1nln=ln(n!)Cnlnn\sum_{\ell=1}^{n}\ln\ell=\ln(n!)\leq C^{\prime}n\ln n due to Stirling’s formula.  
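
The bookkeeping behind (3.17) is easy to reproduce in code. The sketch below (illustrative only; all constants are placeholders) runs through the doubling schedule m_ℓ = 2^ℓ m_0 with confidence levels δ_ℓ = δ/(ℓ+1)^2, accumulates the per-phase contributions proportional to −ln δ_ℓ, and compares the total against (ln M)(ln ln M).

```python
# Illustrative sketch only: the doubling episode schedule used in the proof of Theorem 2.2.
# Phase ell runs m_ell = 2^ell * m_0 episodes at confidence level delta_ell = delta/(ell+1)^2,
# and contributes roughly m_ell * |theta_ell - theta*|^2 ~ ln(1/delta_ell) to the regret bound.
# Constants are placeholders; only the logarithmic growth in M is being illustrated.
import math

def regret_bound_schedule(M: int, delta: float = 0.01, C: float = 1.0):
    m0 = max(1, math.ceil(C * math.log(1.0 / delta)))
    total_episodes, bound, ell = 0, 0.0, 0
    while total_episodes < M:
        m_ell = 2**ell * m0
        delta_ell = delta / (ell + 1) ** 2
        bound += math.log(1.0 / delta_ell)      # per-phase contribution to the regret bound
        total_episodes += m_ell
        ell += 1
    return ell, bound

for M in (10**2, 10**3, 10**4, 10**5):
    phases, bound = regret_bound_schedule(M)
    print(M, phases, round(bound, 2), round(math.log(M) * math.log(math.log(M)), 2))
```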

3.4 Regret analysis of discrete-time least-squares algorithm

This section is devoted to the proof of Theorem 2.3. The main steps are similar to those in the proof of Theorem 2.2 in Section 3.3. However, one needs to quantify the precise impact of the piecewise constant policies and the discrete-time observations on the performance gap and the parameter estimation error.

Step 1: Analysis of the performance gap.

The following proposition shows that the performance gap between applying a piecewise constant feedback control for an incorrect model and a continuous-time feedback control for the true model scales quadratically with respect to the stepsize and the parameter error.

Proposition 3.12

Suppose (H.11) holds and let Θ\Theta be a bounded subset of (n+d)×n{\mathbb{R}}^{(n+d)\times n}. For each θΘ\theta\in\Theta and NN\in{\mathbb{N}}, let ψθ,τ\psi^{\theta,\tau} be defined in (2.19) with stepsize τ=T/N\tau=T/N, let Xψθ,τX^{\psi^{\theta,\tau}} be the state process associated with ψθ,τ\psi^{\theta,\tau} (cf. (2.21)), let ψθ\psi^{\theta^{\star}} be defined in (2.3), and let XθX^{{\theta^{\star}}} be the state process associated with ψθ\psi^{\theta^{\star}} (cf. (2.5)). Then there exists C>0C>0 such that

|Jθ(Uψθ,τ)Jθ(Uθ)|C(N2+|θθ|2),θΘ,N,|J^{\theta^{\star}}(U^{\psi^{\theta,\tau}})-J^{\theta^{\star}}(U^{\theta^{\star}})|\leq C(N^{-2}+|\theta-\theta^{\star}|^{2}),\quad\forall\theta\in\Theta,N\in{\mathbb{N}}, (3.18)

where Utψθ,τ=ψθ,τ(t,Xtψθ,τ)U^{\psi^{\theta,\tau}}_{t}=\psi^{\theta,\tau}(t,X^{\psi^{\theta,\tau}}_{t}) and Utθ=ψθ(t,Xtθ)U^{\theta^{\star}}_{t}=\psi^{\theta^{\star}}(t,X^{{\theta^{\star}}}_{t}) for all t[0,T]t\in[0,T], and JθJ^{\theta^{\star}} is defined in (2.1).

Proof  Let us fix θΘ\theta\in\Theta and NN\in{\mathbb{N}}. By applying Proposition 3.7 with U=Uψθ,τU=U^{\psi^{\theta,\tau}},

Jθ(Uψθ,τ)Jθ(Uθ)Q2Xθ,Uψθ,τXθ2(n)2+R2Uψθ,τUθ2(d)2,Q2Xψθ,τXψθ2(n)2+R2ψθ,τ(,Xψθ,τ)ψθ(,Xψθ)2(d)2,\displaystyle\begin{split}&J^{\theta^{\star}}(U^{\psi^{\theta,\tau}})-J^{\theta^{\star}}(U^{\theta^{\star}})\\ &\leq\|Q\|_{2}\|X^{\theta^{\star},U^{\psi^{\theta,\tau}}}-X^{{\theta^{\star}}}\|^{2}_{\mathcal{H}^{2}({\mathbb{R}}^{n})}+\|R\|_{2}\|U^{\psi^{\theta,\tau}}-U^{\theta^{\star}}\|^{2}_{\mathcal{H}^{2}({\mathbb{R}}^{d})},\\ &\leq\|Q\|_{2}\|X^{{\psi^{\theta,\tau}}}-X^{\psi^{\theta^{\star}}}\|^{2}_{\mathcal{H}^{2}({\mathbb{R}}^{n})}+\|R\|_{2}\|\psi^{\theta,\tau}(\cdot,X^{\psi^{\theta,\tau}}_{\cdot})-\psi^{\theta^{\star}}(\cdot,X^{\psi^{\theta^{\star}}}_{\cdot})\|^{2}_{\mathcal{H}^{2}({\mathbb{R}}^{d})},\end{split} (3.19)

where the last inequality used the fact that Xθ,Uψθ,τ=Xψθ,τX^{\theta^{\star},U^{\psi^{\theta,\tau}}}=X^{{\psi^{\theta,\tau}}} (see (2.11)), and the definitions of Uψθ,τU^{\psi^{\theta,\tau}} and UθU^{\theta^{\star}}.

We then prove that there exists a constant CC, independent of θ,N\theta,N, such that

Xψθ,τXψθ2(n)+ψθ,τ(,Xψθ,τ)ψθ(,Xψθ)2(d)C(N1+|θθ|).\|X^{{\psi^{\theta,\tau}}}-X^{\psi^{\theta^{\star}}}\|_{\mathcal{H}^{2}({\mathbb{R}}^{n})}+\|\psi^{\theta,\tau}(\cdot,X^{\psi^{\theta,\tau}}_{\cdot})-\psi^{\theta^{\star}}(\cdot,X^{\psi^{\theta^{\star}}}_{\cdot})\|_{\mathcal{H}^{2}({\mathbb{R}}^{d})}\leq C(N^{-1}+|\theta-\theta^{\star}|).

By setting δX=XθXψθ,τ\delta X=X^{\theta^{\star}}-X^{\psi^{\theta,\tau}}, we obtain from (2.5) and (2.21) that

dδXt=(AδXt+BKtθδXt+B(KtθKtθ,τ)Xtψθ,τ)dt,t[0,T];δX0=0.\displaystyle{\mathrm{d}}\delta X_{t}=(A^{\star}\delta X_{t}+B^{\star}K^{\theta^{\star}}_{t}\delta X_{t}+B^{\star}(K^{\theta^{\star}}_{t}-K^{\theta,\tau}_{t})X^{\psi^{\theta,\tau}}_{t})\,{\mathrm{d}}t,\quad t\in[0,T];\quad\delta X_{0}=0. (3.20)

Since PθC([0,T];n×n)C\|P^{\theta^{\star}}\|_{C([0,T];{\mathbb{R}}^{n\times n})}\leq C and Ktθ=R1BPtθK^{\theta^{\star}}_{t}=-R^{-1}B^{\top}P^{{\theta^{\star}}}_{t} for all t[0,T]t\in[0,T], KθC([0,T];d×n)C\|K^{\theta^{\star}}\|_{C([0,T];{\mathbb{R}}^{d\times n})}\leq C. Moreover, by Ptiθ,τ2C\|P^{\theta,\tau}_{t_{i}}\|_{2}\leq C for all i=0,,Ni=0,\ldots,N (see Proposition 3.3) and (2.20), we have Ktθ,τ2C\|K^{\theta,\tau}_{t}\|_{2}\leq C for all t[0,T]t\in[0,T], which along with a moment estimate of (2.21) yields Xψθ,τ𝒮2(n)C\|X^{\psi^{\theta,\tau}}\|_{\mathcal{S}^{2}({\mathbb{R}}^{n})}\leq C. Thus, by applying Gronwall’s inequality to (3.20) and using Lemma 3.1 and Proposition 3.3, for all θΘ\theta\in\Theta and NN\in{\mathbb{N}},

XθXψθ,τ2(n)CXθXψθ,τ𝒮2(n)C(KtθKtθ,τ)Xψθ,τ2(d)Cmaxt[0,T]KtθKtθ,τ2Cmaxt[0,T](KtθKtθ,τ2+KtθKtθ2)C(N1+|θθ|).\displaystyle\begin{split}&\|X^{\theta^{\star}}-X^{\psi^{\theta,\tau}}\|_{\mathcal{H}^{2}({\mathbb{R}}^{n})}\leq C\|X^{\theta^{\star}}-X^{\psi^{\theta,\tau}}\|_{\mathcal{S}^{2}({\mathbb{R}}^{n})}\\ &\leq C\|(K^{\theta^{\star}}_{t}-K^{\theta,\tau}_{t})X^{\psi^{\theta,\tau}}\|_{\mathcal{H}^{2}({\mathbb{R}}^{d})}\leq C\max_{t\in[0,T]}\|K^{\theta^{\star}}_{t}-K^{\theta,\tau}_{t}\|_{2}\\ &\leq C\max_{t\in[0,T]}(\|K^{\theta}_{t}-K^{\theta,\tau}_{t}\|_{2}+\|K^{\theta^{\star}}_{t}-K^{\theta}_{t}\|_{2})\leq C(N^{-1}+|\theta-\theta^{\star}|).\end{split} (3.21)

The above inequality further implies

ψθ,τ(,Xψθ,τ)ψθ(,Xψθ)2(d)=Kθ,τXψθ,τKθXψθ2(d)\displaystyle\|\psi^{\theta,\tau}(\cdot,X^{\psi^{\theta,\tau}}_{\cdot})-\psi^{\theta^{\star}}(\cdot,X^{\psi^{\theta^{\star}}}_{\cdot})\|_{\mathcal{H}^{2}({\mathbb{R}}^{d})}=\|K^{\theta,\tau}_{\cdot}X^{\psi^{\theta,\tau}}_{\cdot}-K^{\theta^{\star}}_{\cdot}X^{\psi^{\theta^{\star}}}_{\cdot}\|_{\mathcal{H}^{2}({\mathbb{R}}^{d})}
(Kθ,τKθ)Xψθ,τ2(d)+Kθ(Xψθ,τXψθ)2(d)\displaystyle\leq\|(K^{\theta,\tau}_{\cdot}-K^{\theta^{\star}}_{\cdot})X^{\psi^{\theta,\tau}}_{\cdot}\|_{\mathcal{H}^{2}({\mathbb{R}}^{d})}+\|K^{\theta^{\star}}_{\cdot}(X^{\psi^{\theta,\tau}}_{\cdot}-X^{\psi^{\theta^{\star}}}_{\cdot})\|_{\mathcal{H}^{2}({\mathbb{R}}^{d})}
KθKθ,τC([0,T];d×n)Xψθ,τ2(n)+KθC([0,T];d×n)Xψθ,τXψθ2(n)\displaystyle\leq\|K^{\theta^{\star}}-K^{\theta,\tau}\|_{C([0,T];{\mathbb{R}}^{d\times n})}\|X^{\psi^{\theta,\tau}}\|_{\mathcal{H}^{2}({\mathbb{R}}^{n})}+\|K^{\theta^{\star}}\|_{C([0,T];{\mathbb{R}}^{d\times n})}\|X^{\psi^{\theta,\tau}}-X^{\psi^{\theta^{\star}}}\|_{\mathcal{H}^{2}({\mathbb{R}}^{n})}
C(N1+|θθ|),θΘ,N,\displaystyle\leq C(N^{-1}+|\theta-\theta^{\star}|),\quad\forall\theta\in\Theta,N\in{\mathbb{N}},

which along with (3.19) finishes the desired estimate.  

Step 2: Error bound for parameter estimation.

The following lemma shows that the difference between the expectations of (Vψθ,τ,τ,Yψθ,τ,τ)(V^{\psi^{\theta,\tau},\tau},Y^{\psi^{\theta,\tau},\tau}) and of (Vψθ,τ,Yψθ,τ)(V^{\psi^{\theta,\tau}},Y^{\psi^{\theta,\tau}}) scales linearly with respect to the stepsize.

Lemma 3.13

Suppose (H.11) holds and let Θ\Theta be a bounded subset of (n+d)×n{\mathbb{R}}^{(n+d)\times n}. For each θΘ\theta\in\Theta and NN\in{\mathbb{N}}, let τ=T/N\tau=T/N, let ψθ,τ\psi^{\theta,\tau} be defined in (2.19), let Vψθ,τ,Yψθ,τV^{\psi^{\theta,\tau}},Y^{\psi^{\theta,\tau}} be defined in (3.1), and let Vψθ,τ,τ,Yψθ,τ,τV^{\psi^{\theta,\tau},\tau},Y^{\psi^{\theta,\tau},\tau} be defined in (3.2). Then there exists a constant CC such that

|𝔼[Vψθ,τ,τVψθ,τ]|+|𝔼[Yψθ,τ,τYψθ,τ]|CN1,θΘ,N.|{\mathbb{E}}[V^{\psi^{\theta,\tau},\tau}-V^{\psi^{\theta,\tau}}]|+|{\mathbb{E}}[Y^{\psi^{\theta,\tau},\tau}-Y^{\psi^{\theta,\tau}}]|\leq CN^{-1},\quad\forall\theta\in\Theta,N\in{\mathbb{N}}.

Proof  By (2.21), we have, for all i=0,\ldots,N-1, X^{\psi^{\theta,\tau}}_{t_{i+1}}-X^{\psi^{\theta,\tau}}_{t_{i}}=\int_{t_{i}}^{t_{i+1}}(\theta^{\star})^{\top}Z^{\psi^{\theta,\tau}}_{t}\,{\mathrm{d}}t+W_{t_{i+1}}-W_{t_{i}}, which implies

\displaystyle{\mathbb{E}}[V^{\psi^{\theta,\tau}}-V^{\psi^{\theta,\tau},\tau}]=\sum_{i=0}^{N-1}\int_{t_{i}}^{t_{i+1}}{\mathbb{E}}[Z^{\psi^{\theta,\tau}}_{t}(Z^{\psi^{\theta,\tau}}_{t})^{\top}-Z^{\psi^{\theta,\tau}}_{t_{i}}(Z^{\psi^{\theta,\tau}}_{t_{i}})^{\top}]\,{\mathrm{d}}t,
\displaystyle{\mathbb{E}}[Y^{\psi^{\theta,\tau}}-Y^{\psi^{\theta,\tau},\tau}]=\sum_{i=0}^{N-1}\int_{t_{i}}^{t_{i+1}}{\mathbb{E}}[(Z^{\psi^{\theta,\tau}}_{t}-Z^{\psi^{\theta,\tau}}_{t_{i}})(Z^{\psi^{\theta,\tau}}_{t})^{\top}\theta^{\star}]\,{\mathrm{d}}t.

Hence it suffices to prove that |{\mathbb{E}}[Z^{\psi^{\theta,\tau}}_{t}(Z^{\psi^{\theta,\tau}}_{t})^{\top}-Z^{\psi^{\theta,\tau}}_{t_{i}}(Z^{\psi^{\theta,\tau}}_{t_{i}})^{\top}]|\leq CN^{-1} and |{\mathbb{E}}[(Z^{\psi^{\theta,\tau}}_{t}-Z^{\psi^{\theta,\tau}}_{t_{i}})(Z^{\psi^{\theta,\tau}}_{t})^{\top}]|\leq CN^{-1} for all t\in[t_{i},t_{i+1}] and i=0,\ldots,N-1.

Let us fix i=0,\ldots,N-1 and t\in[t_{i},t_{i+1}]. In the following, we shall omit the superscripts of X^{\psi^{\theta,\tau}} and Z^{\psi^{\theta,\tau}} if no confusion occurs. As t\in[t_{i},t_{i+1}], by (2.21), we have X_{t}=e^{L(t-t_{i})}X_{t_{i}}+\int_{t_{i}}^{t}e^{L(t-s)}\,{\mathrm{d}}W_{s} with L\coloneqq A^{\star}+B^{\star}K^{\theta,\tau}_{t_{i}}. Thus,

\displaystyle X_{t}X^{\top}_{t}-X_{t_{i}}X^{\top}_{t_{i}}
\displaystyle=(X_{t}-X_{t_{i}}+X_{t_{i}})(X_{t}-X_{t_{i}}+X_{t_{i}})^{\top}-X_{t_{i}}X^{\top}_{t_{i}}
\displaystyle=(X_{t}-X_{t_{i}})(X_{t}-X_{t_{i}})^{\top}+X_{t_{i}}(X_{t}-X_{t_{i}})^{\top}+(X_{t}-X_{t_{i}})X_{t_{i}}^{\top}
\displaystyle=\bigg{(}(e^{L(t-t_{i})}-I)X_{t_{i}}+\int_{t_{i}}^{t}e^{L(t-s)}\,{\mathrm{d}}W_{s}\bigg{)}\bigg{(}(e^{L(t-t_{i})}-I)X_{t_{i}}+\int_{t_{i}}^{t}e^{L(t-s)}\,{\mathrm{d}}W_{s}\bigg{)}^{\top}
\displaystyle\quad+X_{t_{i}}\bigg{(}(e^{L(t-t_{i})}-I)X_{t_{i}}+\int_{t_{i}}^{t}e^{L(t-s)}\,{\mathrm{d}}W_{s}\bigg{)}^{\top}+\bigg{(}(e^{L(t-t_{i})}-I)X_{t_{i}}+\int_{t_{i}}^{t}e^{L(t-s)}\,{\mathrm{d}}W_{s}\bigg{)}X_{t_{i}}^{\top}.

Taking expectations on both sides of the above identity and using the martingale property of the Itô integral together with the Itô isometry, we obtain

\displaystyle{\mathbb{E}}[X_{t}X^{\top}_{t}-X_{t_{i}}X^{\top}_{t_{i}}]=(e^{L(t-t_{i})}-I){\mathbb{E}}[X_{t_{i}}X_{t_{i}}^{\top}](e^{L^{\top}(t-t_{i})}-I)+\int_{t_{i}}^{t}e^{L(t-s)}e^{L^{\top}(t-s)}\,{\mathrm{d}}s
\displaystyle\quad+{\mathbb{E}}[X_{t_{i}}X_{t_{i}}^{\top}](e^{L^{\top}(t-t_{i})}-I)+(e^{L(t-t_{i})}-I){\mathbb{E}}[X_{t_{i}}X_{t_{i}}^{\top}]\leq C(t-t_{i}),

where the last inequality follows from \|X^{\psi^{\theta,\tau}}\|_{\mathcal{S}^{2}({\mathbb{R}}^{n})}\leq C. Since \psi^{\theta,\tau}(t,X^{\psi^{\theta,\tau}}_{t})=K^{\theta,\tau}_{t_{i}}X^{\psi^{\theta,\tau}}_{t} and \|K^{\theta,\tau}_{t_{i}}\|_{2}\leq C, one can easily show that |{\mathbb{E}}[Z^{\psi^{\theta,\tau}}_{t}(Z^{\psi^{\theta,\tau}}_{t})^{\top}-Z^{\psi^{\theta,\tau}}_{t_{i}}(Z^{\psi^{\theta,\tau}}_{t_{i}})^{\top}]|\leq CN^{-1}. Furthermore, by X^{\psi^{\theta,\tau}}_{t}=e^{L(t-t_{i})}X^{\psi^{\theta,\tau}}_{t_{i}}+\int_{t_{i}}^{t}e^{L(t-s)}\,{\mathrm{d}}W_{s} and the identity

Z^{\psi^{\theta,\tau}}_{t}(Z^{\psi^{\theta,\tau}}_{t})^{\top}-Z^{\psi^{\theta,\tau}}_{t_{i}}(Z^{\psi^{\theta,\tau}}_{t_{i}})^{\top}=(Z^{\psi^{\theta,\tau}}_{t}-Z^{\psi^{\theta,\tau}}_{t_{i}})(Z^{\psi^{\theta,\tau}}_{t})^{\top}+Z^{\psi^{\theta,\tau}}_{t_{i}}(Z^{\psi^{\theta,\tau}}_{t}-Z^{\psi^{\theta,\tau}}_{t_{i}})^{\top},

we can show that

\displaystyle|{\mathbb{E}}[(Z^{\psi^{\theta,\tau}}_{t}-Z^{\psi^{\theta,\tau}}_{t_{i}})(Z^{\psi^{\theta,\tau}}_{t})^{\top}]|
\displaystyle\leq|{\mathbb{E}}[Z^{\psi^{\theta,\tau}}_{t}(Z^{\psi^{\theta,\tau}}_{t})^{\top}-Z^{\psi^{\theta,\tau}}_{t_{i}}(Z^{\psi^{\theta,\tau}}_{t_{i}})^{\top}]|+|{\mathbb{E}}[Z^{\psi^{\theta,\tau}}_{t_{i}}(Z^{\psi^{\theta,\tau}}_{t}-Z^{\psi^{\theta,\tau}}_{t_{i}})^{\top}]|
\displaystyle\leq C\left(N^{-1}+\left|{\mathbb{E}}\left[Z^{\psi^{\theta,\tau}}_{t_{i}}(X^{\psi^{\theta,\tau}}_{t_{i}})^{\top}(e^{L^{\top}(t-t_{i})}-I)\begin{pmatrix}I&(K^{\theta,\tau}_{t_{i}})^{\top}\end{pmatrix}\right]\right|\right)\leq CN^{-1},

by the uniform boundedness of \|X^{\psi^{\theta,\tau}}\|_{\mathcal{S}^{2}({\mathbb{R}}^{n})} and K^{\theta,\tau}.  
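
To illustrate the \mathcal{O}(N^{-1}) scaling in Lemma 3.13, the following Python sketch compares, for a scalar toy system under a constant feedback gain, a fine-grid approximation of \int_{0}^{T}X_{t}^{2}\,{\mathrm{d}}t with its left-endpoint Riemann sum on N subintervals (the V-part of the lemma); both quantities are computed on the same simulated paths, so the reported gap reflects the discretization bias. All numerical values below are illustrative choices and are not taken from the paper.

    import numpy as np

    # Toy illustration of Lemma 3.13 (scalar state, constant feedback gain): the gap
    # between a fine-grid approximation of E[int_0^T X_t^2 dt] and its left-endpoint
    # (piecewise-constant) Riemann sum on N subintervals shrinks roughly like 1/N.
    # All values (drift a, gain k, horizon T, grid sizes, number of paths) are
    # illustrative assumptions.
    rng = np.random.default_rng(0)
    a, b, k, T = -0.3, 1.0, -0.5, 1.0
    L = a + b * k                                   # closed-loop drift of dX = L X dt + dW
    n_fine, n_paths = 2 ** 12, 2000
    dt = T / n_fine

    X = np.zeros((n_paths, n_fine + 1))             # Euler-Maruyama paths on the fine grid
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_fine))
    for i in range(n_fine):
        X[:, i + 1] = X[:, i] + L * X[:, i] * dt + dW[:, i]

    V_fine = (X[:, :-1] ** 2).sum(axis=1) * dt      # fine-grid proxy for int_0^T X_t^2 dt
    for N in (8, 16, 32, 64):
        step = n_fine // N
        V_coarse = (X[:, :-1:step] ** 2).sum(axis=1) * (T / N)  # left-endpoint sum, N pieces
        print(N, abs((V_fine - V_coarse).mean()))   # sample mean of the gap, decays like O(1/N)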

Proposition 3.14

Suppose (H.11) holds, and let \Theta\subset{\mathbb{R}}^{(n+d)\times n} be such that there exists C_{1}>0 satisfying \|({\mathbb{E}}[V^{\psi^{{\theta}}}])^{-1}\|_{2}\leq C_{1} and |\theta|\leq C_{1} for all \theta\in\Theta, with V^{\psi^{\theta}} defined in (3.1). Then there exist constants \bar{C}_{1},\bar{C}_{2}\geq 0 and n_{0}\in{\mathbb{N}}, such that for all \theta\in\Theta, N\in{\mathbb{N}}\cap[n_{0},\infty) and \delta\in(0,1/2), if m\geq\bar{C}_{1}(-\ln\delta), then with probability at least 1-2\delta,

|\hat{\theta}-\theta^{\star}|\leq\bar{C}_{2}\bigg{(}\sqrt{\frac{-\ln\delta}{m}}+\frac{-\ln\delta}{m}+\frac{(-\ln\delta)^{2}}{m^{2}}+\frac{1}{N}\bigg{)}, (3.22)

where \hat{\theta} denotes the right-hand side of (2.23) with the control \psi^{\theta,\tau} and stepsize \tau=T/N.

Proof  We first prove that there exists n_{0}\in{\mathbb{N}} such that for all N\in{\mathbb{N}}\cap[n_{0},\infty) and \theta\in\Theta, \|({\mathbb{E}}[V^{\psi^{\theta,\tau}}])^{-1}\|_{2}\leq C for a constant C>0 independent of \theta and N. By (2.11) and (2.21), we have, for all \theta\in\Theta and N\in{\mathbb{N}}, X^{\psi^{\theta}}_{0}=X^{\psi^{\theta},\tau}_{0} and

{\mathrm{d}}(X^{\psi^{\theta}}-X^{\psi^{\theta},\tau})_{t}=\Big{(}(A^{\star}+B^{\star}K^{\theta}_{t})(X^{\psi^{\theta}}-X^{\psi^{\theta},\tau})_{t}+B^{\star}(K^{\theta}_{t}-K^{\theta,\tau}_{t})X^{\psi^{\theta},\tau}_{t}\Big{)}\,{\mathrm{d}}t,\quad t\in[0,T].

Proposition 3.3 shows \|K^{\theta}_{t}-K^{\theta,\tau}_{t}\|_{2}\leq CN^{-1} for all t\in[0,T], which along with Gronwall’s inequality yields \|X^{\psi^{\theta}}-X^{\psi^{\theta},\tau}\|_{\mathcal{S}^{2}({\mathbb{R}}^{n})}\leq CN^{-1} for all \theta\in\Theta and N\in{\mathbb{N}}. One can further prove that \|U^{\psi^{\theta}}-U^{\psi^{\theta},\tau}\|_{\mathcal{S}^{2}({\mathbb{R}}^{d})}\leq CN^{-1} with U^{\psi^{\theta}}_{t}=\psi^{\theta}(t,X^{\psi^{\theta}}_{t}) and U^{\psi^{\theta,\tau}}_{t}=\psi^{\theta,\tau}(t,X^{\psi^{\theta,\tau}}_{t}) for all t\in[0,T]. Thus, we have |{\mathbb{E}}[V^{\psi^{{\theta}}}]-{\mathbb{E}}[V^{\psi^{{\theta},\tau}}]|\leq CN^{-1}, which along with \|({\mathbb{E}}[V^{\psi^{{\theta}}}])^{-1}\|_{2}\leq C_{1} implies a uniform bound of \|({\mathbb{E}}[V^{\psi^{{\theta},\tau}}])^{-1}\|_{2} for all sufficiently large N: indeed, once C_{1}CN^{-1}\leq 1/2, a Neumann series argument gives \|({\mathbb{E}}[V^{\psi^{{\theta},\tau}}])^{-1}\|_{2}\leq 2C_{1}.

Let us fix N\in{\mathbb{N}}\cap[n_{0},\infty) and \theta\in\Theta for the subsequent analysis. The invertibility of {\mathbb{E}}[V^{\psi^{{\theta},\tau}}] implies that \theta^{\star}=({\mathbb{E}}[V^{\psi^{{\theta},\tau}}])^{-1}{\mathbb{E}}[Y^{\psi^{{\theta},\tau}}] (cf. (2.12)). Then by (2.23), we can derive the following analogue of (3.10):

\begin{split}&\|\hat{\theta}-\theta^{\star}\|_{2}\\ &=\|(V^{\psi^{\theta,\tau},\tau,m}+\tfrac{1}{m}I)^{-1}Y^{\psi^{\theta,\tau},\tau,m}-({\mathbb{E}}[V^{\psi^{\theta,\tau}}])^{-1}{\mathbb{E}}[Y^{\psi^{\theta,\tau}}]\|_{2}\\ &\leq\|(V^{\psi^{\theta,\tau},\tau,m}+\tfrac{1}{m}I)^{-1}-({\mathbb{E}}[V^{\psi^{\theta,\tau}}])^{-1}\|_{2}\|Y^{\psi^{\theta,\tau},\tau,m}\|_{2}+\|({\mathbb{E}}[V^{\psi^{\theta,\tau}}])^{-1}\|_{2}\|Y^{\psi^{\theta,\tau},\tau,m}-{\mathbb{E}}[Y^{\psi^{\theta,\tau}}]\|_{2}\\ &\leq\|(V^{\psi^{\theta,\tau},\tau,m}+\tfrac{1}{m}I)^{-1}\|_{2}\|({\mathbb{E}}[V^{\psi^{\theta,\tau}}])^{-1}\|_{2}\|Y^{\psi^{\theta,\tau},\tau,m}\|_{2}\|V^{\psi^{\theta,\tau},\tau,m}-{\mathbb{E}}[V^{\psi^{\theta,\tau}}]+\tfrac{1}{m}I\|_{2}\\ &\quad+\|({\mathbb{E}}[V^{\psi^{\theta,\tau}}])^{-1}\|_{2}\|Y^{\psi^{\theta,\tau},\tau,m}-{\mathbb{E}}[Y^{\psi^{\theta,\tau}}]\|_{2},\end{split}

where V^{\psi^{\theta,\tau},\tau,m} and Y^{\psi^{\theta,\tau},\tau,m} are defined in (3.2). Note that

\displaystyle\|V^{\psi^{\theta,\tau},\tau,m}-{\mathbb{E}}[V^{\psi^{\theta,\tau}}]+\tfrac{1}{m}I\|_{2}\leq\|V^{\psi^{\theta,\tau},\tau,m}-{\mathbb{E}}[V^{\psi^{\theta,\tau},\tau}]\|_{2}+\|{\mathbb{E}}[V^{\psi^{\theta,\tau},\tau}]-{\mathbb{E}}[V^{\psi^{\theta,\tau}}]\|_{2}+\tfrac{1}{m},
\displaystyle\|Y^{\psi^{\theta,\tau},\tau,m}-{\mathbb{E}}[Y^{\psi^{\theta,\tau}}]\|_{2}\leq\|Y^{\psi^{\theta,\tau},\tau,m}-{\mathbb{E}}[Y^{\psi^{\theta,\tau},\tau}]\|_{2}+\|{\mathbb{E}}[Y^{\psi^{\theta,\tau},\tau}]-{\mathbb{E}}[Y^{\psi^{\theta,\tau}}]\|_{2},

where, for both inequalities, the first term on the right-hand side can be estimated by Theorem 3.6 (uniformly in N), and the second term is of magnitude \mathcal{O}(N^{-1}) by Lemma 3.13. Hence, proceeding along the lines of the proof of Proposition 3.9 leads to the desired result.  
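
As a schematic illustration of the estimator analysed in Proposition 3.14, the following Python sketch forms V and Y as empirical averages over m simulated episodes and computes \hat{\theta}=(V+\tfrac{1}{m}I)^{-1}Y, mirroring the structure of (2.23) and (3.2) used above. The toy system, the time-varying feedback gain and all numerical values are made-up assumptions rather than the paper’s exact definitions, and the identification of \theta^{\star} with (A^{\star},B^{\star})^{\top} is our reading of the drift convention (\theta^{\star})^{\top}Z_{t} in (2.21).

    import numpy as np

    # Schematic regularised least-squares step: theta_hat = (V + I/m)^{-1} Y, where
    # V and Y are empirical averages over m episodes of the Riemann sums of
    # Z_{t_i} Z_{t_i}^T tau and of Z_{t_i} (X_{t_{i+1}} - X_{t_i})^T, with Z = (X, U).
    # The system, the gain K(t) and all numerical values are illustrative assumptions.
    n, d, T, N = 2, 1, 1.0, 100
    tau = T / N
    A_true = np.array([[-0.5, 0.2], [0.0, -0.3]])
    B_true = np.array([[1.0], [0.5]])
    theta_true = np.hstack([A_true, B_true]).T      # (n+d) x n, so the drift is theta^T Z
    K0 = np.array([[-0.4, -0.1]])                   # toy time-varying gain K(t) = K0 + K1 t
    K1 = np.array([[1.5, -1.0]])

    def estimate(m, rng):
        X = np.zeros((m, n))                        # m episodes simulated in parallel, X_0 = 0
        V = np.zeros((n + d, n + d))
        Y = np.zeros((n + d, n))
        for i in range(N):
            K_t = K0 + K1 * (i * tau)
            U = X @ K_t.T                           # piecewise-constant control on [t_i, t_{i+1})
            Z = np.hstack([X, U])
            dW = rng.normal(0.0, np.sqrt(tau), size=(m, n))
            X_next = X + (Z @ theta_true) * tau + dW
            V += Z.T @ Z * tau / m
            Y += Z.T @ (X_next - X) / m
            X = X_next
        return np.linalg.solve(V + np.eye(n + d) / m, Y)

    # In this toy the Euler-discretised data satisfies the discrete regression model
    # exactly, so the error below is purely statistical and shrinks as m grows; in the
    # continuous-time setting of the paper an additional O(1/N) bias appears, cf. (3.22).
    for m in (500, 2000, 8000):
        theta_hat = estimate(m, np.random.default_rng(1))
        print(m, np.linalg.norm(theta_hat - theta_true))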

Step 3: Proof of Theorem 2.3.

The proof follows from similar arguments to those in the proof of Theorem 2.2, and we only present the main steps here. As (H.12) holds with \theta_{0} and \theta^{\star}, we can obtain from Propositions 3.10 and 3.14 that there exist a bounded set \Phi_{\varepsilon}\subset{\mathbb{R}}^{(n+d)\times n}, constants \bar{C}_{1},\bar{C}_{2}\geq 1 and n_{0}\in{\mathbb{N}} such that for all \theta\in\Phi_{\varepsilon}, N\in{\mathbb{N}}\cap[n_{0},\infty) and \delta^{\prime}\in(0,1/2), if m\geq\bar{C}_{1}(-\ln\delta^{\prime}), then with probability at least 1-2\delta^{\prime},

|\hat{\theta}-\theta^{\star}|\leq\bar{C}_{2}\bigg{(}\sqrt{\frac{-\ln\delta^{\prime}}{m}}+\frac{-\ln\delta^{\prime}}{m}+\frac{(-\ln\delta^{\prime})^{2}}{m^{2}}+\frac{1}{N}\bigg{)}, (3.23)

where \hat{\theta} denotes the right-hand side of (2.23) with the control \psi^{\theta,\tau} and stepsize \tau=T/N. Then, by proceeding along the lines of the proof of Theorem 2.2, there exist C_{0}>0 and n_{0}\in{\mathbb{N}} such that for any given \delta\in(0,\frac{3}{\pi^{2}}), if m_{0}=C(-\ln\delta) with C\geq C_{0} and N_{\ell}\geq n_{0} for all \ell\in{\mathbb{N}}\cup\{0\}, then with probability at least 1-\frac{\pi^{2}}{3}\delta,

|{\theta}_{\ell+1}-\theta^{\star}|\leq\bar{C}_{2}\bigg{(}\sqrt{\frac{-\ln\delta_{\ell}}{m_{\ell}}}+\frac{-\ln\delta_{\ell}}{m_{\ell}}+\frac{(-\ln\delta_{\ell})^{2}}{m_{\ell}^{2}}+\frac{1}{N_{\ell}}\bigg{)},\quad\forall\ell\in{\mathbb{N}}\cup\{0\}, (3.24)

where \delta_{\ell}=\delta/(\ell+1)^{2} and m_{\ell}=2^{\ell}m_{0} for all \ell. Consequently, we can conclude the desired regret bound from Proposition 3.12 (cf. (3.17)), with an additional term \sum_{\ell=0}^{\ln M}m_{\ell}N_{\ell}^{-2} due to the time discretization errors in (3.18) and (3.24).
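
To make the episode schedule concrete, the following Python sketch tabulates the phases m_{\ell}=2^{\ell}m_{0}, \delta_{\ell}=\delta/(\ell+1)^{2} and, under the illustrative choice N_{\ell}\approx\sqrt{m_{\ell}} (an assumption, not a prescription of Theorem 2.3), the additional discretization term \sum_{\ell}m_{\ell}N_{\ell}^{-2}, which then stays of order \ln M; the numerical values of \delta, m_{0} and M are arbitrary.

    import math

    # Episode schedule of Step 3: m_ell = 2^ell * m_0 and delta_ell = delta / (ell + 1)^2,
    # so the per-phase failure probabilities 2 * delta_ell sum to at most (pi^2/3) * delta,
    # matching the overall guarantee above.  The stepsize schedule N_ell ~ sqrt(m_ell) is an
    # illustrative assumption keeping sum_ell m_ell / N_ell^2 of order ln M.
    delta, m0, M = 0.01, 100, 10 ** 5               # toy values
    n_phases = int(math.log2(M / m0)) + 1           # roughly enough doubling phases to cover M episodes

    extra, total_failure = 0.0, 0.0
    for ell in range(n_phases):
        m_ell = 2 ** ell * m0
        delta_ell = delta / (ell + 1) ** 2
        N_ell = max(1, int(math.sqrt(m_ell)))      # assumed stepsize schedule
        extra += m_ell / N_ell ** 2                # per-phase discretisation contribution
        total_failure += 2 * delta_ell             # union bound over phases
    print(n_phases, extra, total_failure, math.pi ** 2 / 3 * delta)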
