Logarithmic Regret for Episodic Continuous-Time Linear-Quadratic Reinforcement Learning over a Finite-Time Horizon
Abstract
We study finite-time horizon continuous-time linear-quadratic reinforcement learning problems in an episodic setting, where both the state and control coefficients are unknown to the controller. We first propose a least-squares algorithm based on continuous-time observations and controls, and establish a regret bound that grows logarithmically in the number of learning episodes. The analysis consists of two components: a perturbation analysis, which exploits the regularity and robustness of the associated Riccati differential equation; and an analysis of the parameter estimation error, which relies on sub-exponential properties of continuous-time least-squares estimators. We further propose a practically implementable least-squares algorithm based on discrete-time observations and piecewise constant controls, which achieves a similar logarithmic regret with an additional term depending explicitly on the time stepsizes used in the algorithm.
1 Introduction
Reinforcement learning (RL) for linear quadratic (LQ) control problems has been one of the most active areas for both the control and the reinforcement learning communities. Over the last few decades, significant progress has been made in the discrete-time setting.
1.1 Discrete-time RL
In the area of adaptive control with unknown dynamics parameters, the goal is to find an optimal stationary policy that stabilizes the unknown dynamics and minimizes the long-term average cost (Ioannou and Fidan (2006); Landau et al. (2011)). For an infinite-time horizon LQ system, it has been shown that persistent excitation conditions (Green and Moore (1986)) are critical for parameter identification. Meanwhile, algorithms with asymptotic convergence in both the parameter estimation and the optimal control have been developed in Goodwin et al. (1981), Kumar (1983) and Campi and Kumar (1998): the first assumes that costs only depend on state variables, while the other two consider both state and control costs and use a cost-biased least-squares estimation method. See Faradonbeh et al. (2018a, 2019) and references therein for recent developments of (randomised) adaptive control algorithms for LQ systems.
Following the seminal works of Auer and Ortner (2007); Auer et al. (2009) and Osband et al. (2013), non-asymptotic regret bound analysis for RL algorithms has been one of the main topics, and has been developed for tabular Markov decision problems.
The non-asymptotic analysis of the adaptive LQ problem by Abbasi-Yadkori and Szepesvári (2011) utilizes the Optimism in the Face of Uncertainty principle to construct a sequence of improving confidence regions for the unknown model parameters, and solves a non-convex constrained optimization problem for each confidence region; their algorithm achieves an $\widetilde{O}(\sqrt{T})$ regret bound, with $T$ being the number of time steps. To reduce the computational complexity and to avoid the non-convexity issue, Abeille and Lazaric (2018) and Ouyang et al. (2017) propose Thompson-sampling-based algorithms and derive $\widetilde{O}(\sqrt{T})$ regret bounds in the Bayesian setting; Dean et al. (2018) proposes a robust adaptive control algorithm that solves a convex sub-problem in each step and achieves an $\widetilde{O}(T^{2/3})$ regret bound. The gap between these regret bounds is removed by Mania et al. (2019) and Cohen et al. (2019) via two different approaches for the same $\widetilde{O}(\sqrt{T})$ frequentist regret bound. Later, Simchowitz and Foster (2020) establishes a lower bound on the regret of order $\sqrt{d_u^2 d_x T}$, where $d_u$ and $d_x$ are the dimensions of the actions and the states, and shows that a simple variant of certainty equivalent control matches the lower bound in both $T$ and the dimensions. Similar regret bounds have also been established under different settings and assumptions, such as Chen and Hazan (2021) in the adversarial setting and Lale et al. (2020a) without a stabilizing controller at the early stages of agent-environment interaction.
All of the analyses above are in discrete time with an infinite time horizon. In these problems, adaptive control algorithms are shown to achieve logarithmic regret bounds when additional information regarding the parameters of the system (often referred to as identifiability conditions) is available. Indeed, Faradonbeh et al. (2018b, 2020) prove that the certainty equivalent adaptive regulator achieves logarithmic regret bounds if the system parameter satisfies certain sparsity or low-rankness conditions. Cassel et al. (2020) establishes logarithmic regret bounds when either the state transition matrix is unknown, or when the state-action transition matrix is unknown and the optimal policy is non-degenerate. In partially observable linear dynamical systems, which include the linear-quadratic Gaussian problem as a special case, Lale et al. (2020b) proposes an algorithm with a logarithmic regret bound, under the assumption that one has access to a set in which all controllers persistently excite the system to approximate the optimal control. Logarithmic regret bounds in the adversarial setting with known dynamics parameters have been established in Agarwal et al. (2019); Foster and Simchowitz (2020).
1.2 Continuous-time RL
Most real-world control systems, such as those in aerospace, automotive industry and robotics, are naturally continuous-time dynamical systems. So are their related physical tasks, such as inverted pendulum problems, cart-pole balancing problems, and legged robot problems. Continuous-time finite-time horizon LQ control problems can be found in portfolio optimization Wang and Zhou (2020), algorithmic trading Cartea et al. (2018), production management of exhaustible resources Graber (2016), and biological movement systems Bailo et al. (2018).
Analysis for continuous-time LQ-RL and general RL problems, however, is fairly limited. The primary approach is to develop learning algorithms after discretizing both the time and state spaces, and to establish convergence as the discretization parameters tend to zero. For instance, Munos (2006) proposes a policy gradient algorithm and shows the convergence of the policy gradient estimate to the true gradient. Munos and Bourgine (1998); Munos (2000) design learning algorithms by discretizing the Bellman equations of the underlying control problems and prove the asymptotic convergence of their algorithms. For the LQ system, attention has mostly been on algorithm design, including the integral reinforcement learning algorithm in Modares and Lewis (2014), and the policy iteration algorithm in Rizvi and Lin (2018). Yet, very little is known regarding the convergence rate or the regret bound of these algorithms. Indeed, despite the natural analytical connection between LQ control and RL, the best known theoretical work for continuous-time LQ-RL is still due to Duncan et al. (1999), where an asymptotically sublinear regret for an ergodic model was derived via a weighted least-squares-based estimation approach. Nevertheless, the exact order of the regret bound has not been studied.
Issues and challenges from non-asymptotic analysis.
It is insufficient, and indeed improper, to rely solely on the analysis and algorithms for discrete-time RL to solve continuous-time problems. There is a mismatch between the algorithm's timescale for the former and the underlying system's timescale for the latter. When designing algorithms that make observations and take actions at discrete time points, it is important to take this model mismatch into consideration. For instance, the empirical studies in Tallec et al. (2019) suggest that vanilla Q-learning methods exhibit degraded performance as the time stepsize decreases, while a proper scaling of learning rates with the stepsize leads to more robust performance.
The questions are therefore: A) How to quantify the precise impacts of the observation stepsize and action stepsize on algorithm performance? B) How to derive a non-asymptotic regret analysis for learning algorithms in continuous-time LQ-RL (or general RL) systems, analogous to the discrete-time LQ-RL counterpart?
There are technical reasons behind the limited theoretical progress in the continuous-time domain for RL, including LQ-RL. In addition to the known difficulty for analyzing stochastic control problems, the learning component compounds the problem complexity and poses new challenges.
For instance, the continuous-time counterpart of the algebraic Riccati equations used in Mania et al. (2019) for the discrete-time setting is the Riccati differential equation, whose regularity and stability, together with the regularity of the feedback controls, must be established. While the Riccati equation, its robustness, and the existence and uniqueness of optimal controls have been well studied in the control literature, the regularity of feedback controls with respect to the underlying model is new to control theory and crucial for algorithm design and its robustness analysis. Moreover, deriving the exact order of the regret bound requires developing new and different techniques from those used for the asymptotic regret analysis in Duncan et al. (1999).
Our work and contributions.
This paper studies finite-time horizon continuous-time LQ-RL problems in an episodic setting.
• It first proposes a greedy least-squares algorithm based on continuous-time observations and controls. At each iteration, the algorithm estimates the unknown parameters by a regularized least-squares estimator based on observed trajectories, then designs linear feedback controls via the Riccati differential equation for the estimated model. It identifies conditions under which the unknown state transition matrix and state-action transition matrix are uniquely identifiable under the optimal policies (Remark 2.1 and Proposition 2.1). By exploiting the identifiability of the coefficients, this continuous-time least-squares algorithm is shown to have a regret that grows logarithmically in the number of learning episodes (Theorem 2.2). To the best of our knowledge, this is the first non-asymptotic logarithmic regret bound for continuous-time LQ-RL problems with unknown state and control coefficients.
• It then proposes a practically implementable least-squares algorithm based on discrete-time observations and controls. At each iteration, the algorithm estimates the unknown parameters by observing continuous-time trajectories at discrete time points, then designs a piecewise constant linear feedback control via Riccati difference equations for an associated discrete-time LQ-RL problem. It shows that the regret of the discrete-time least-squares algorithm is of a similar logarithmic magnitude, with an additional term depending on the time stepsize used in each update of the model parameters (Theorem 2.3). Our analysis shows that scaling the regularization parameter of the discrete-time least-squares estimator with respect to the time stepsize is critical for a robust performance of the algorithm across different timescales (Remark 2.3). To the best of our knowledge, this is the first discrete-time algorithm with a rigorous regret bound for continuous-time LQ-RL problems.
Different from the least-squares algorithms for ergodic LQ problems (see e.g., Duncan et al. (1999); Mania et al. (2019)), our continuous-time least-squares algorithm constructs feedback controls via Riccati differential equations instead of the algebraic equations in Mania et al. (2019). Here, the regularity and stability of the continuous-time Riccati equation are analyzed in order to establish the robustness of the feedback controls.
Moreover, our analysis for the estimation error exploits extensively the sub-exponential tail behavior of the least-squares estimators. This probabilistic approach differs from the asymptotic sublinear regret analysis in Duncan et al. (1999); it establishes the exact order of the logarithmic regret bound by the concentration inequality for the error bound.
In addition, our analysis also exploits an important self-exploration property of finite-time horizon continuous-time LQ-RL problems, for which the time-dependent optimal feedback matrices ensure that the optimal state and control processes span the entire parameter space. This property allows us to design exploration-free learning algorithms with logarithmic regret bounds. Furthermore, we provide explicit conditions on the model that guarantee the successful identification of the unknown parameters under optimal feedback policies. This is in contrast to the identification conditions for logarithmic regret bounds in discrete-time infinite-time-horizon LQ problems. Our conditions apply to arbitrary finite time-horizon problems, without imposing sparsity or low-rankness conditions on system parameters as in Faradonbeh et al. (2018b, 2020) or requiring these parameters to be partially known to the controller as in Cassel et al. (2020); Foster and Simchowitz (2020).
Finally, our analysis provides the precise parameter estimation error in terms of the sample size and the time stepsize, and quantifies the performance gap between applying a piecewise-constant policy from an incorrect model and applying the optimal policy. The misspecification error scales linearly with the stepsize, and the performance gap depends quadratically on the time stepsize and on the magnitude of the parameter perturbations. Our analysis is based on the first-order convergence of Riccati difference equations and a uniform sub-exponential tail bound for discrete-time least-squares estimators.
Notation.
For each , we denote by the identity matrix, and by (resp. ) the space of symmetric positive semidefinite (resp. definite) matrices. We denote by the Euclidean norm of a given Euclidean space, by the matrix norm induced by Euclidean norms, and by and the transpose and trace of a matrix , respectively. For each , filtered probability space satisfying the usual condition and Euclidean space , we introduce the following spaces:
• is the space of continuous functions satisfying ;
• is the space of continuously differentiable functions satisfying ;
• is the space of -valued -progressively measurable càdlàg processes satisfying ;
• is the space of -valued -progressively measurable processes satisfying .
For notational simplicity, we denote by a generic constant, which depends only on the constants appearing in the assumptions and may take a different value at each occurrence.
2 Problem formulation and main results
2.1 Linear-quadratic reinforcement learning problem
In this section, we consider the linear-quadratic reinforcement learning (LQ-RL) problem, where the drift coefficient of the state dynamics is unknown to the controller.
More precisely, let be a given terminal time, be an -dimensional standard Brownian motion defined on a complete probability space , and be the filtration generated by augmented by the -null sets. Let be a given initial state and be fixed but unknown matrices, consider the following problem:
(2.1) |
where for each , the process satisfies the following controlled dynamics associated with the parameter :
(2.2) |
with given matrices and . Note that we assume the loss functional (2.1) only involves a time homogeneous running cost to allow a direct comparison with infinite-time horizon RL problems (see e.g., Duncan et al. (1992)), but similar analysis can be performed if the cost functions are time inhomogeneous, a terminal cost is included, or the Brownian motion in (2.2) is scaled by a known nonsingular diffusion matrix.
If the parameters are known to the controller, then (2.1)-(2.2) reduces to a classical LQ control problem. In this case, it is well known (see e.g., Yong and Zhou (1999) and the references therein) that the optimal control of (2.1)-(2.2) is given in feedback form by
(2.3) |
where for all , solves the Riccati equation
(2.4) |
and is the state process governed by the following dynamics:
(2.5) |
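To fix ideas, the following is a sketch of the standard finite-time horizon LQ objects that the preceding discussion refers to, written in a commonly used notation; the symbols $\theta^\star=(A^\star,B^\star)$ for the unknown parameter, $Q$ and $R$ for the cost weights, $P$ for the Riccati solution and $W$ for the Brownian motion are our own labels for this sketch and may differ from the paper's.

```latex
% A sketch in assumed notation: theta* = (A*, B*) unknown; Q >= 0, R > 0 given.
\begin{align*}
  &\text{minimize}\quad
  J(u) = \mathbb{E}\Bigl[\int_0^T \bigl(X_t^\top Q X_t + u_t^\top R u_t\bigr)\,\mathrm{d}t\Bigr],
  \qquad
  \mathrm{d}X_t = (A^\star X_t + B^\star u_t)\,\mathrm{d}t + \mathrm{d}W_t,\ \ X_0 = x_0, \\
  &\text{Riccati equation:}\quad
  \tfrac{\mathrm{d}}{\mathrm{d}t}P_t + A^{\star\top} P_t + P_t A^\star
  - P_t B^\star R^{-1} B^{\star\top} P_t + Q = 0,\qquad P_T = 0, \\
  &\text{optimal feedback:}\quad
  u^\star_t = -R^{-1} B^{\star\top} P_t\, X^\star_t .
\end{align*}
```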
To solve the LQ-RL problem (2.1)-(2.2) with unknown , the controller searches for the optimal control while simultaneously learning the system, i.e., the matrices . In an episodic (also known as reset or restart) learning framework, the controller improves her knowledge of the underlying dynamics through successive learning episodes, in order to find a control that is close to the optimal one.
Mathematically, it goes as follows. Let be the total number of learning episodes. In the -th learning episode, , a feedback control is exercised, and the state process evolves according to the dynamics (2.2) controlled by the policy :
(2.6) |
Here are independent -dimensional Brownian motions defined on the same probability space . The (expected) cost of learning in the -th episode is then given by
(2.7) |
and the (expected) regret of learning up to episodes (with the sequence of controls ) is defined as follows:
(2.8) |
where is the optimal cost of (2.1)-(2.2) when are known. Intuitively, the regret characterizes the cumulative loss from taking sub-optimal policies in all episodes.
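Written out in symbols (a sketch in assumed notation, with $J^\star$ denoting the optimal cost, $u^m$ the control applied in episode $m$, and $N$ the number of episodes), the episodic regret described above takes the form

```latex
\[
  R(N) \;=\; \sum_{m=1}^{N}\Bigl(\mathbb{E}\bigl[J(u^m)\bigr] - J^\star\Bigr),
  \qquad
  J^\star \;=\; \inf_{u} J(u;\theta^\star).
\]
```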
2.2 Continuous-time least-squares algorithm and its regret bound
In this section, we consider a continuous-time least-squares algorithm, which chooses the optimal feedback control based on the current estimation of the parameter, and updates the parameter estimation based on the whole trajectories of the state dynamics.
More precisely, let be the current estimate of the unknown parameter , then the controller would exercise the optimal feedback control for (2.1)-(2.2) with replaced by , i.e.,
(2.9) |
where satisfies the Riccati equation (2.4) with replaced by :
(2.10) |
This leads to the state process satisfying (cf. (2.6)):
(2.11) |
We proceed to derive an -regularized least-squares estimator for based on sampled trajectories of . Observe from (2.11) that
Hence the martingale property of the Itô integral implies that
(2.12) |
provided that is invertible. This suggests a practical rule to improve one's estimate of the true parameter : replace the expectations in (2.12) with empirical averages over independent realizations. More precisely, let and , , be trajectories of independent realizations of the state and control processes; we then update the estimate by the following rule, inspired by (2.12):
(2.13) |
where for all and , and is the identity matrix.
The regularization term in (2.13) guarantees the existence of the required matrix inverse and vanishes as . The estimator (2.13) can be equivalently expressed as an -regularized least-squares estimator, as pointed out in Duncan et al. (1992) for the ergodic LQ-RL problem.
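As an illustration of how the regularized least-squares update (2.13) might be implemented in practice, the following sketch replaces the continuous-time integrals by Riemann–Itô sums over finely sampled trajectories; the function name, array layout, and the fixed regularizer value are our own choices and not part of the paper.

```python
import numpy as np

def ls_estimate(xs, us, dt, reg=1e-3):
    """Sketch of a regularized least-squares estimate of theta = (A, B).

    xs  : (m, K+1, d) array, m sampled state trajectories on a fine grid
    us  : (m, K, k)   array, the corresponding control trajectories
    dt  : fine sampling stepsize
    reg : ridge regularizer (in the paper it vanishes as the number of
          trajectories grows; here it is a fixed illustrative value)
    """
    m, d, k = xs.shape[0], xs.shape[2], us.shape[2]
    G = np.zeros((d + k, d + k))   # empirical Gram matrix of z_t = (x_t, u_t)
    H = np.zeros((d, d + k))       # empirical cross term  int dX_t z_t^T
    for i in range(m):
        z = np.concatenate([xs[i, :-1], us[i]], axis=1)   # (K, d+k)
        dx = np.diff(xs[i], axis=0)                       # (K, d)
        G += (z.T @ z) * dt
        H += dx.T @ z
    theta_hat = (H / m) @ np.linalg.inv(G / m + reg * np.eye(d + k))
    return theta_hat[:, :d], theta_hat[:, d:]   # estimated (A, B)
```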
We summarize the continuous-time least-squares algorithm as follows.
Note that Algorithm 1 operates in cycles, with the number of episodes in the -th cycle. Hence, the regret of learning up to episodes (cf. (2.8)) can be upper bounded by the accumulated regret at the end of the -th cycle, where is the smallest integer such that .
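To make the cycle structure concrete, here is a minimal sketch of how such an episodic least-squares scheme could be organized with episode doubling between parameter updates; this is not the paper's Algorithm 1 verbatim, and `solve_riccati` and `rollout_episode` are hypothetical placeholders for a Riccati solver and an episode simulator, while `ls_estimate` is the estimator sketched above.

```python
import numpy as np

def episodic_least_squares(theta0, n_cycles, dt, rollout_episode, solve_riccati):
    """Sketch of an episodic least-squares scheme with doubling cycles.

    theta0          : initial guess (A0, B0) for the unknown parameters
    rollout_episode : given a feedback gain t -> K(t), simulates one episode
                      (with fresh Brownian noise) and returns (states, controls)
    solve_riccati   : given (A, B), solves the Riccati equation and returns the gain
    """
    theta = theta0
    for cycle in range(n_cycles):
        K = solve_riccati(*theta)            # greedy policy for the current estimate
        xs, us = [], []
        for _ in range(2 ** cycle):          # number of episodes doubles each cycle
            x_traj, u_traj = rollout_episode(K)
            xs.append(x_traj)
            us.append(u_traj)
        theta = ls_estimate(np.stack(xs), np.stack(us), dt)   # re-estimate (A, B)
    return theta
```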
In this section, we analyze the regret of Algorithm 1 based on the following assumptions of the learning problem (2.1)-(2.2).
H. 1
(1) , , , , , and .
(2) , with defined in (2.3).
Remark 2.1 (Self-exploration of finite-time horizon RL problems)
(H.11) is the standard assumption for finite-time horizon LQ-RL problems (see e.g., Hambly et al. (2020)), except that H.11 allows to be positive semidefinite, which is important for costs depending on partial states. (H.12) corresponds to the identifiability of the true parameter by executing the optimal policy . In fact, as shown in Proposition 3.10, under (H.11), (H.12) is equivalent to the following statement:
- (2’)
Item (2’) indicates an important self-exploration property of finite-time horizon continuous-time RL problems. In particular, the time-dependent optimal feedback matrix and the non-degenerate noises guarantee the non-degeneracy of the space spanned by and , enabling learning the parameters sufficiently well. This self-exploration property is critical for our design of exploration-free learning algorithms for (2.1)-(2.2) with a logarithmic regret (see Theorems 2.2 and 2.3).
One can easily show that (H.12) holds if the optimal policy is nondegenerate, i.e., . A similar nondegeneracy condition has been imposed in Cassel et al. (2020) for discrete-time ergodic LQ-RL problems. In particular, by assuming that the optimal stationary policy satisfies (along with other controllability conditions), they propose learning algorithms with a logarithmic regret, under the assumption that only the control coefficient is unknown. In contrast, we allow both the state coefficient and the control coefficient to be unknown.
Moreover, the following proposition provides sufficient conditions for (H.12), which are special cases of Proposition 3.11.
Proposition 2.1
Proposition 2.1 provides two sets of conditions for (H.12) under two different scenarios: Item 1 applies to an arbitrary finite , and Item 2 only applies to sufficiently large . Item 2 assumes the asymptotic behavior of solutions to Riccati differential equations, which can be ensured by the stabilizability of the pair and detectability of the pair (see (Bitmead and Gevers, 1991, Theorems 10.9 and 10.10)). Note that our subsequent analysis is based on (H.1), and does not require stabilizability assumptions.
Remark 2.2 (Stabilizability of and dependence on )
Since the LQ-RL problem (2.1)-(2.2) is over the time horizon with a fixed , in general one does not need additional conditions on for the well-definedness of (2.1)-(2.2). If , then some stabilizability/controllability conditions of may be required for (2.1)-(2.2) to ensure a well-defined solution (see e.g., Dean et al. (2019)). Under these conditions, different algorithms have been shown to achieve sub-linear regret with respect to the number of decision steps (see e.g., Mania et al. (2019); Cohen et al. (2019)), and even logarithmic regrets provided that further identifiability assumptions are satisfied (see e.g., Faradonbeh et al. (2018b, 2020); Cassel et al. (2020); Lale et al. (2020b)); see Section 1.1 for more details. For , the regrets of learning algorithms for (2.1)-(2.2) in general depend exponentially on the time horizon (e.g., the constants in Theorem 2.2), as the moments of the optimal state process and control process may grow exponentially with respect to . It would be interesting to quantify the precise dependence of the regret bounds on . This would entail deriving precise a priori bounds of solutions to (2.10) and estimating the norm in terms of , and is left for future research.
We are now ready to state the main result of this section, which shows that the regret of Algorithm 1 grows logarithmically with respect to the number of episodes.
Theorem 2.2
To simplify the presentation, we analyze the performance of Algorithm 1 by assuming the number of learning episodes is doubled between two successive updates of the estimation of . Similar regret results can be established for Algorithm 1 with different choices of . Under this specific choice of , for any , Algorithm 1 splits episodes into cycles, where the -th cycle, , contains episodes, and the remaining episodes are in the last cycle.
Sketched proof of Theorem 2.2.
We outline the key steps of the proof of Theorem 2.2, and defer the detailed arguments to Section 3.3. By exploiting the regularity and robustness of solutions to (2.10), we prove that the performance gap is of the magnitude , for all a priori bounded (Proposition 3.8). We then establish a uniform sub-exponential property for the (deterministic and stochastic) integrals in (2.12), which along with (H.12) and Bernstein's inequality leads to the following estimate of the parameter estimation error: for all , all sufficiently large , and all sufficiently close to ,
(2.14) |
where is generated by (2.13) with (Proposition 3.9). Then for each , applying (2.14) with for all shows that with probability ,
(2.15) |
where means the inequality holds with a multiplicative constant independent of and . By the quadratic performance gap and the choice of , the regret of Algorithm 1 up to the -th episode can be bounded by the regret at the end of -th cycle with :
(2.16) |
Observe that the choices of and ensure that . Hence, the right-hand side of (2.16) is of the magnitude , which along with the choices of and leads to the desired regret bound; see the end of Section 3.3 for more details.
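Heuristically, the way the doubling schedule converts a per-cycle estimation error into a logarithmic regret can be summarized as follows; this is only a sketch, with $C$ a generic constant, $\varepsilon_l$ the estimation error after the $l$-th cycle assumed to decay like the reciprocal square root of the number of episodes in that cycle, and logarithmic factors suppressed.

```latex
\[
  \mathrm{Regret}
  \;\lesssim\; \sum_{l=1}^{L} 2^{l}\, C\, \varepsilon_{l-1}^{2}
  \;\lesssim\; \sum_{l=1}^{L} 2^{l}\cdot \frac{C\,\log(\,\cdot\,)}{2^{\,l-1}}
  \;=\; O\bigl(L \log(\,\cdot\,)\bigr),
  \qquad L = O(\log_2 N).
\]
```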
2.3 Discrete-time least-squares algorithm and its regret bound
Note that Algorithm 1 in Section 2.2 requires executing feedback controls and observing the corresponding state trajectories continuously in time. A common practice for solving continuous-time RL problems is to assume that at each learning episode the dynamics only evolves in discrete time, and then to estimate parameters using discrete-time RL algorithms (see e.g., Munos and Bourgine (1998); Munos (2000, 2006); Tallec et al. (2019)). As the true dynamics evolves continuously, it is necessary to quantify the impact of the observation and action stepsizes on the algorithm performance.
In this section, we analyze the performance of the above procedure for solving (2.1)-(2.2). We adapt regularized least-squares algorithms for discrete-time LQ problems to the present setting, and establish their regret bounds in terms of the discretization stepsize. Our analysis shows that a proper scaling of the regularization term in the least-squares estimation in terms of stepsize is critical for a robust performance with respect to different timescales.
More precisely, for a given cycle (i.e., the index in Algorithm 1), let be the current estimate of in (2.2), and let , , be a uniform partition of with stepsize . We then restrict the controls in (2.1)-(2.2) to be piecewise constant between any two grid points, choose actions and make observations only at the grid points, and update the estimated parameter based on these observations. To this end, we consider the following discrete-time LQ control problem with parameter :
(2.17) |
where , and are defined by
(2.18) |
Note that for simplicity, our strategy is constructed by assuming a discrete-time dynamics arising from an Euler discretization of (2.2) (with the estimated parameter ); a similar analysis can be performed with a higher-order approximation of (2.1)-(2.2).
It is well known (see e.g., Bitmead and Gevers (1991)) that the optimal control of (2.17)-(2.18) is given in the following feedback form:
(2.19) |
where is the piecewise constant function (with stepsize ) defined by
(2.20) |
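For reference, a minimal sketch of the backward Riccati difference recursion for the Euler-discretized problem (2.17)-(2.18) could look as follows; the discretization `A_tau = I + tau*A`, `B_tau = tau*B`, the stage cost `tau*(x'Qx + u'Ru)`, and the zero terminal cost are assumptions made only for this sketch.

```python
import numpy as np

def discrete_riccati_gains(A, B, Q, R, T, tau):
    """Backward Riccati difference recursion for an Euler-discretized LQ problem.

    Returns gains K_0, ..., K_{N-1}, so that the piecewise constant policy is
    u_t = -K_j x_{t_j} on [t_j, t_{j+1}).  Assumed discrete model:
        x_{j+1} = (I + tau*A) x_j + tau*B u_j + noise,
    stage cost tau*(x'Qx + u'Ru), zero terminal cost.
    """
    N = int(round(T / tau))
    d = A.shape[0]
    A_tau, B_tau = np.eye(d) + tau * A, tau * B
    P = np.zeros((d, d))                        # terminal condition
    gains = [None] * N
    for j in reversed(range(N)):
        S = tau * R + B_tau.T @ P @ B_tau       # control curvature
        K = np.linalg.solve(S, B_tau.T @ P @ A_tau)
        P = tau * Q + A_tau.T @ P @ (A_tau - B_tau @ K)
        gains[j] = K
    return gains
```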
We then implement the piecewise constant strategy defined in (2.19) on the original system (2.2) for episodes, and update the estimated parameter by observing (2.2) with stepsize . More precisely, let be the state process associated with :
(2.21) |
and , , , be independent trajectories of ; we then update the parameter according to the following discrete-time least-squares estimator:
(2.22) |
with for all . The update (2.22) is consistent with the agent’s assumption that the state evolves according to (2.18) between two grid points. Setting the derivative (with respect to ) of the right-hand side of (2.22) to zero leads to
Dividing both sides by and rearranging the terms give the following equivalent expression of the discrete-time least squares estimator (2.22):
(2.23) |
Remark 2.3 (Scaling hyper-parameters with timescales)
In principle, when applying discrete-time RL algorithms in a continuous environment, it is critical to adopt a proper scaling of the hyper-parameters for a robust performance with respect to different timescales. Indeed, scaling the regularization term in (2.22) with respect to the stepsize is essential for the robustness of (2.23) for all small stepsizes . If one updates by minimizing the following -regularized loss function with a given hyper-parameter such that
(2.24) |
then the corresponding discrete-time estimator is given by
Observe that for any given , the estimator degenerates to zero as the stepsize tends to zero. Hence, to ensure the viability of across different timescales, the number of episodes has to increase appropriately when tends to zero. In contrast, by choosing in (2.24), (2.23) admits a continuous-time limit (2.13) as , and leads to a learning algorithm in which the episode numbers and the time stepsize can be chosen independently (see Theorem 2.3).
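The effect described in this remark can be seen numerically in a few lines: with a fixed ridge parameter the discrete-time estimator is damped towards zero as the stepsize shrinks, whereas scaling the regularizer by the stepsize keeps it stable across stepsizes. The scalar example below, with made-up parameter values, is only illustrative and not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, T, m = -1.0, 1.0, 50          # made-up scalar dynamics dX = a X dt + dW

def estimate(tau, lam, scale_with_tau):
    """Scalar least-squares estimate of a from m discretely observed paths."""
    n = int(T / tau)
    num = den = 0.0
    for _ in range(m):
        x = 1.0
        for _ in range(n):
            dx = a_true * x * tau + np.sqrt(tau) * rng.standard_normal()
            num += tau * x * dx       # normal equations of the discrete LS loss
            den += (tau * x) ** 2
            x += dx
    reg = lam * tau if scale_with_tau else lam   # cf. Remark 2.3
    return (num / m) / (den / m + reg)

for tau in (0.1, 0.01, 0.001):
    print(tau, estimate(tau, 0.1, True), estimate(tau, 0.1, False))
```

Under these made-up values, the stepsize-scaled variant stays of roughly the same order as the stepsize shrinks, while the unscaled variant shrinks towards zero roughly in proportion to the stepsize.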
We now summarize the discrete-time least-squares algorithm as follows.
Again, as the -th cycle of Algorithm 2 contains episodes, for each , the regret of learning up to episodes (cf. (2.8)) can be upper bounded by the accumulated regret at the end of the -th cycle, where is the smallest integer such that . The following theorem is an analogue of Theorem 2.2 for Algorithm 2.
Theorem 2.3
Remark 2.4
Theorem 2.3 provides a general regret bound for Algorithm 2 with any time discretization steps , where is the number of intervention points in the -th cycle. Compared with Algorithm 1, the regret of Algorithm 2 has an additional term : for each learning episode, one incurs a sub-optimal loss by adjusting her policy only at discrete time points, and also suffers from a model misspecification error in the parameter estimation from discrete-time observations. Specifically,
- •
- •
Sketched proof of Theorem 2.3.
We point out the main differences between the proofs of Theorems 2.2-2.3, and give the detailed proof of Theorem 2.3 in Section 3.4. Compared with Theorem 2.2, the essential challenges in proving Theorem 2.3 are to quantify the precise dependence of the performance gap and the parameter estimation error on the stepsize. To this end, we first prove a first-order convergence of (2.20) to (2.9) as the stepsize tends to zero. Then by exploiting the affine structure of (2.21), we establish the following quadratic performance gap for a piecewise constant policy (Proposition 3.12):
(2.26) |
The analysis of the parameter estimation error is somewhat involved, as the state trajectories are merely -Hölder continuous in time with . By leveraging the analytic expression of , we first show the first-order convergence of to with
(2.27) |
We then prove that (2.23) enjoys a uniform sub-exponential tail bound for all close to and small . Comparing (2.23) with (2.27) and applying the above results allow for bounding the estimation error of (2.23) by (2.14) with an additional term (Proposition 3.14).
3 Proofs of Theorems 2.2 and 2.3
To simplify the notation, for any given and control that is affine in the spatial variable, we introduce the following random variables associated with continuous-time observations:
(3.1) |
and the random variables associated with discrete-time observations with stepsize :
(3.2) |
where is the state process associated with the parameter and the control (cf. (2.6)), for all , and are independent copies of .
3.1 Convergence and stability of Riccati equations and feedback controls
Lemma 3.1
Proof It has been shown in (Yong and Zhou, 1999, Corollary 2.10 on p. 297) that under (H.11), for all , (3.3) admits a unique solution such that for all . It remains to prove the continuous differentiability of .
To this end, consider the Banach spaces and , and the operator defined by
where for all . Observe that for all , . Moreover, one can easily show that for any , is continuously Fréchet differentiable at , and the partial derivative is a bounded linear operator such that for all ,
Classical well-posedness results of linear differential equations and the boundedness of imply that has a bounded inverse (and hence is a bijection). Thus, applying the implicit function theorem (see (Ciarlet, 2013, Theorem 7.13-1)) to proves that is continuously differentiable.
The following lemma establishes the stability of the Riccati difference operator, which is crucial for the subsequent convergence analysis.
Lemma 3.2
Proof Item 1 follows directly from the definition of and the identity . We now prove Item 2. Let and ; then by (Bitmead and Gevers, 1991, Lemma 10.1),
with . Thus for all , , which along with implies
Hence, interchanging the roles of and in the above inequality and taking the supremum over lead to the desired estimate.
The following proposition establishes the first-order convergence of the Riccati difference equation and the associated feedback controls, as the stepsize tends to zero.
Proposition 3.3
Proof Throughout this proof, we shall fix , let for all , and denote by a generic constant independent of and . By the continuity of the map (Lemma 3.1) and the boundedness of , there exists a constant such that for all , which implies for all . Consequently, it suffices to prove for all .
3.2 Concentration inequalities for least-squares estimators
In this section, we analyze the concentration behavior of the least-squares estimators (2.13) and (2.23). We first recall the definition of sub-exponential random variables (see e.g., Wainwright (2019)).
Definition 3.1
A random variable with mean is -sub-exponential for if for all .
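In the convention of Wainwright (2019), which the citation suggests is being followed, this definition can be written as below; the symbols $\mu$, $\nu$, $\alpha$ are the labels used in this sketch and may differ from the paper's.

```latex
\[
  \mathbb{E}\bigl[e^{\lambda (X-\mu)}\bigr] \;\le\; e^{\nu^{2}\lambda^{2}/2}
  \qquad\text{for all } |\lambda| < \tfrac{1}{\alpha},
\]
in which case $X$ is called $(\nu,\alpha)$-sub-exponential.
```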
Note that a -sub-exponential random variable is usually called a sub-Gaussian random variable. It is well-known that products of sub-Gaussian random variables are sub-exponential, and the class of sub-exponential random variables forms a vector space. Moreover, sub-exponential random variables enjoy the following concentration inequality (also known as Bernstein’s inequality; see e.g., (Wainwright, 2019, Equation 2.18 p. 29)).
Lemma 3.4
Let , and be independent -sub-exponential random variables with for all . Then for all ,
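For reference, a standard form of this Bernstein-type bound, written here for independent random variables sharing the same sub-exponential parameters $(\nu,\alpha)$ (an assumption made only for this sketch), reads:

```latex
\[
  \mathbb{P}\Bigl(\Bigl|\tfrac{1}{n}\textstyle\sum_{i=1}^{n}(X_i-\mu_i)\Bigr|\ge t\Bigr)
  \;\le\; 2\exp\Bigl(-\tfrac{n}{2}\min\bigl(\tfrac{t^{2}}{\nu^{2}},\,\tfrac{t}{\alpha}\bigr)\Bigr),
  \qquad t\ge 0 .
\]
```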
The following lemma shows that double iterated Itô integrals are sub-exponential random variables.
Lemma 3.5
Let and be measurable functions such that and for all . Then there exist , depending polynomially on , such that
(1) ,
(2) are -sub-exponential,
Proof We first prove Item 1 by assuming without loss of generality that for all , and by defining for any bounded measurable function . By similar arguments as (Cheridito et al., 2005, Lemma 3.2), we have for all and ,
As for all , we see for all . Consequently, for all ,
Replacing by shows the above estimate holds for , which implies the desired sub-exponential property of .
For Item 2, observe that for each , the Itô formula allows one to express the product as a linear combination of double iterated Itô integrals and deterministic integrals. Then the desired sub-exponential property follows from the stochastic Fubini theorem (see e.g., Veraar (2012)) and Item 1.
The following theorem establishes the concentration properties of the random variables involved in the least-squares estimators.
Theorem 3.6
Proof We first show there exist such that all entries of , , are -sub-exponential for all and . By (3.1), we have
Moreover, applying the variation-of-constants formula (see e.g., (Mao, 2007, Theorem 3.1 p. 96)) to (2.11) shows that for all , where is the fundamental solution of . The continuity of (cf. Proposition 3.1) and the boundedness of imply that are uniformly bounded for all . Consequently, from Lemma 3.5, there exist such that all entries of and are -sub-exponential.
Similarly, by (2.21) and (3.2),
where for all , and is the fundamental solution of . By Proposition 3.3, are uniformly bounded for all and , which along with Lemma 3.5 leads to the desired sub-exponential properties of and .
Finally, since for all and random variables , we can apply Lemma 3.4 to each component of , , and , and conclude the desired concentration inequality with a constant depending polynomially on .
3.3 Regret analysis of continuous-time least-squares algorithm
This section is devoted to the proof of Theorem 2.2, which consists of three steps: (1) We first quantify the performance gap between applying feedback controls for an incorrect model and that for the true model; our proof exploits the stability of Riccati equations established in Lemma 3.1; (2) We then estimate the parameter estimation error in terms of the number of learning episodes based on the sub-exponential tail behavior of the least-squares estimator (2.13); (3) Finally, we estimate the regret for the feedback controls in Algorithm 1, thus establishing Theorem 2.2.
Step 1: Analysis of the performance gap.
We start by establishing a quadratic expansion of the cost function at any open-loop control.
Proposition 3.7
Proof For notational simplicity, for all and , we write , denote by the associated state process defined by (2.2), and by . The affineness of (2.2) implies that for all . Hence, for all ,
which is based on the fact that and . As is the optimal control of , for all . Hence for all ,
(3.6) |
We now prove that the above quantity is in fact zero for all . To this end, let be a given (open-loop) control, and consider . Then by the affineness of (2.2), satisfies the following controlled dynamics:
(3.7) |
Moreover, one can verify by the affineness of (2.2) that also satisfies the dynamics (3.7), which along with the uniqueness of solutions to (3.7) shows that . Therefore, applying (3.6) with implies that
Hence for all ,
which leads to the desired result (3.5) due to the following identity:
Armed with Proposition 3.7, the following proposition quantifies the quadratic performance gap of a greedy policy .
Proposition 3.8
Suppose (H.11) holds and let be a bounded subset of . For each , let be defined in (2.9), let be the state process associated with (cf. (2.11)), let be defined in (2.3), and let be the state process associated with (cf. (2.5)). Then there exists a constant such that
where and for all , and is defined in (2.1).
Proof For all , applying Proposition 3.7 with gives
(3.8) |
where the last inequality used the fact that (see (2.11)), and the definitions of and . It remains to prove
for a constant independent of .
Observe that by (2.9), for all , with . Now by Lemma 3.1 and the boundedness of , there exists a constant such that and for all , which along with implies that and . Moreover, observe from (2.5) and (2.11) that and for all ,
which combined with the boundedness of and Gronwall’s inequality leads to
where the last inequality follows from , as is uniformly bounded. The above inequality further implies
which along with (3.8) finishes the desired estimate.
Step 2: Error bound for parameter estimation.
Proposition 3.9
Proof Let us fix and . By (2.12) and (2.13), we obtain
As for all nonsingular matrices and , we have
(3.10) |
where the last inequality follows from the assumption .
We now estimate each term on the right-hand side of (3.10), and denote by a generic constant independent of . By Theorem 3.6, with probability at least , and , with the constant given by
(3.11) |
Let be a sufficiently large constant satisfying , where is the constant such that for all . Then with probability at least , , which in turn yields
or equivalently . Moreover, the continuity of implies . Hence, by (3.10),
Substituting (3.11) into the above estimate yields the desired estimate (3.22). As , it is clear that is satisfied for all , with a sufficiently large .
Step 3: Proof of Theorem 2.2.
The following proposition shows that for any given , the full row rank of is equivalent to the well-definedness of (2.12) for all sufficiently close to .
Proposition 3.10
Proof For : By (3.1), if and only if there exists no nonzero such that
(3.12) |
where we applied Fubini’s theorem for the first identity. By (2.5), for all , where is the fundamental solution of , for all , and satisfies (2.10). Hence,
Then by (3.12) and the continuity of and , if and only if there exists no nonzero such that
where is the identity matrix. One can easily deduce by the invertibility of for all that for all , which subsequently shows that if and only if there exists no nonzero such that for all . Now let us denote without loss of generality that for some and . Then the above derivation shows that is equivalent to the following statement:
if and satisfy for all , then and . | (3.13) |
By (2.9), for all and , implying that . Then (3.13) can be rewritten as:
if satisfies for all , then . |
For : Item 3 clearly implies Item 2. On the other hand, for any given , we can easily deduce from the continuity of (see Lemma 3.1) that is continuous, which implies the continuity of . Hence, by the continuity of the minimum eigenvalue function, we can conclude Item 2 from Item 3.
The following proposition provides sufficient conditions for the nondegeneracy of .
Proposition 3.11
Let , , and .
(1) For all , if , then .
(2) Assume that the algebraic Riccati equation admits a unique maximal solution . Let , and for each , let be defined in (2.10). Assume that and . Then there exists , such that for all , .
Proof To prove Item 1, suppose that and such that for all , with defined in (2.10). Setting , right multiplying (2.10) by , and left multiplying (2.10) by shows
As for all , for all , and hence . The assumption of then gives , which along with the invertibility of shows that .
To prove Item 2, observe that . As , there exists such that for all , . Fix and consider such that for all . Then the definitions of and imply the invertibility of , which yields .
Now we are ready for the proof of Theorem 2.2.
Proof [Proof of Theorem 2.2] As (H.12) holds with and , we can obtain from Proposition 3.10 that there exist such that for all , we have . Then by Proposition 3.9, there exist constants , such that for all and , if , then with probability at least ,
(3.14) |
where denotes the right-hand side of (2.13) with the control . In the following, we fix and , with the constant satisfying
let , and for each , let , , and let be generated by (2.13) with and . Note that the choices of ensure that , and
(3.15) |
We now prove that, with probability at least ,
(3.16) |
Let us consider the induction statement for each : with probability at least , (3.16) holds for all . The fact that and (3.14) yields the induction statement for . Now suppose that the induction statement holds for some . Then the induction hypothesis and (3.15) ensure that for all (and hence ) with probability at least . Conditioning on this event, we can apply (3.14) with , and , and deduce with probability at least that (3.16) holds for the index . Combining this with the induction hypothesis yields (3.16) holds for the indices , with probability at least .
3.4 Regret analysis of discrete-time least-squares algorithm
This section is devoted to the proof of Theorem 2.3. The main steps are similar to those in the proof of Theorem 2.2 in Section 3.3. However, one needs to quantify the precise impact of the piecewise constant policies and the discrete-time observations on the performance gap and the parameter estimation error.
Step 1: Analysis of the performance gap.
The following proposition shows that the performance gap between applying a piecewise constant feedback control for an incorrect model and applying a continuous-time feedback control for the true model scales quadratically with respect to the stepsize and the parameter errors.
Proposition 3.12
Suppose (H.11) holds and let be a bounded subset of . For each and , let be defined in (2.19) with stepsize , let be the state process associated with (cf. (2.21)), let be defined in (2.3), and let be the state process associated with (cf. (2.5)). Then there exists such that
(3.18) |
where and for all , and is defined in (2.1).
Proof Let us fix and . By applying Proposition 3.7 with ,
(3.19) |
where the last inequality used the fact that (see (2.11)), and the definitions of and .
We then prove that there exists a constant , independent of , such that
By setting , we obtain from (2.5) and (2.21) that
(3.20) |
Since and for all , . Moreover, by for all (see Proposition 3.3) and (2.20), we have for all , which along with a moment estimate of (2.21) yields . Thus, by applying Gronwall’s inequality to (3.20), Lemma 3.1 and Proposition 3.3, for all and ,
(3.21) |
The above inequality further implies
which along with (3.19) finishes the desired estimate.
Step 2: Error bound for parameter estimation.
The following lemma shows that the difference between the expectations of and of scales linearly with respect to the stepsize.
Lemma 3.13
Proof By (2.21), we have for all , , which implies
Hence it suffices to prove that and for all and .
Let us fix and . In the following, we shall omit the superscripts of and if no confusion occurs. As , by (2.21), we have with . Thus,
By taking expectations of both sides of the above identity, the martingale property of the Itô integral, and the Itô isometry,
where the last inequality follows from . Since and , one can easily show that . Furthermore, by and the identity
we can show that by the uniform boundedness of and .
Proposition 3.14
Proof We first prove that there exists such that for all and , for a constant independent of and . By (2.11) and (2.21), we have for all and , and
Proposition 3.3 shows for all , which along with Gronwall’s inequality yields for all and . One can further prove that with and for all . Thus, we have , which along with implies a uniform bound of for all sufficiently large .
Let us fix and for the subsequent analysis. The invertibility of implies that (cf. (2.12)). Then by (2.23), we can derive the following analogues of (3.10):
where and are defined in (3.2). Note that, for both inequalities, the first term on the right-hand side can be estimated by Theorem 3.6 (uniformly in ), and the second term is of the magnitude due to Lemma 3.13. Hence, proceeding along the lines of the proof of Proposition 3.9 leads to the desired result.
Step 3: Proof of Theorem 2.3.
The proof follows from arguments similar to those for Theorem 2.2, and we only present the main steps here. As (H.12) holds with and , we can obtain from Propositions 3.10 and 3.14 that there exist a bounded set and constants , such that for all , and , if , then with probability at least ,
(3.23) |
where denotes the right-hand side of (2.23) with the control and stepsize . Then by proceeding along the lines of the proof of Theorem 2.2, there exists and , such that for any given , if with and for all , then with probability at least ,
(3.24) |
where and for all . Consequently, we can conclude the desired regret bound from Proposition 3.12 (cf. (3.17)), with an additional term due to the time discretization errors in (3.18) and (3.24).
References
- Abbasi-Yadkori and Szepesvári (2011) Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
- Abeille and Lazaric (2018) Marc Abeille and Alessandro Lazaric. Improved regret bounds for thompson sampling in linear quadratic control problems. In International Conference on Machine Learning, pages 1–9. PMLR, 2018.
- Agarwal et al. (2019) Naman Agarwal, Elad Hazan, and Karan Singh. Logarithmic regret for online control. Advances in Neural Information Processing Systems, 32, 2019.
- Auer and Ortner (2007) Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pages 49–56, 2007.
- Auer et al. (2009) Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, pages 89–96, 2009.
- Bailo et al. (2018) Rafael Bailo, Mattia Bongini, José A Carrillo, and Dante Kalise. Optimal consensus control of the Cucker-Smale model. IFAC-PapersOnLine, 51(13):1–6, 2018.
- Bitmead and Gevers (1991) Robert R Bitmead and Michel Gevers. Riccati difference and differential equations: Convergence, monotonicity and stability. In The Riccati Equation, pages 263–291. Springer, 1991.
- Campi and Kumar (1998) Marco C Campi and PR Kumar. Adaptive linear quadratic Gaussian control: the cost-biased approach revisited. SIAM Journal on Control and Optimization, 36(6):1890–1907, 1998.
- Cartea et al. (2018) Alvaro Cartea, Sebastian Jaimungal, and Jason Ricci. Algorithmic trading, stochastic control, and mutually exciting processes. SIAM Review, 60(3):673–703, 2018.
- Cassel et al. (2020) Asaf Cassel, Alon Cohen, and Tomer Koren. Logarithmic regret for learning linear quadratic regulators efficiently. In International Conference on Machine Learning, pages 1328–1337. PMLR, 2020.
- Chen and Hazan (2021) Xinyi Chen and Elad Hazan. Black-box control for linear dynamical systems. In Conference on Learning Theory, pages 1114–1143. PMLR, 2021.
- Cheridito et al. (2005) Patrick Cheridito, H. Mete Soner, and Nizar Touzi. Small time path behavior of double stochastic integrals and applications to stochastic control. The Annals of Applied Probability, 15(4):2472–2495, 2005.
- Ciarlet (2013) Philippe G Ciarlet. Linear and nonlinear functional analysis with applications, volume 130. SIAM, 2013.
- Cohen et al. (2019) Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear quadratic regulators efficiently with only $\sqrt{T}$ regret. arXiv preprint arXiv:1902.06223, 2019.
- Dean et al. (2018) Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
- Dean et al. (2019) Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, pages 1–47, 2019.
- Duncan et al. (1992) Tyrone E Duncan, Petr Mandl, and Bożenna Pasik-Duncan. On least squares estimation in continuous time linear stochastic systems. Kybernetika, 28(3):169–180, 1992.
- Duncan et al. (1999) Tyrone E Duncan, Lei Guo, and Bozenna Pasik-Duncan. Adaptive continuous-time linear quadratic Gaussian control. IEEE Transactions on Automatic Control, 44(9):1653–1662, 1999.
- Faradonbeh et al. (2018a) Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Finite-time adaptive stabilization of linear systems. IEEE Transactions on Automatic Control, 64(8):3498–3505, 2018a.
- Faradonbeh et al. (2018b) Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. On adaptive linear-quadratic regulators. arXiv e-prints, pages arXiv–1806, 2018b.
- Faradonbeh et al. (2019) Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Randomized algorithms for data-driven stabilization of stochastic linear systems. In 2019 IEEE Data Science Workshop (DSW), pages 170–174. IEEE, 2019.
- Faradonbeh et al. (2020) Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Input perturbations for adaptive control and learning. Automatica, 117:108950, 2020.
- Foster and Simchowitz (2020) Dylan Foster and Max Simchowitz. Logarithmic regret for adversarial online control. In International Conference on Machine Learning, pages 3211–3221. PMLR, 2020.
- Goodwin et al. (1981) Graham C Goodwin, Peter J Ramadge, and Peter E Caines. Discrete time stochastic adaptive control. SIAM Journal on Control and Optimization, 19(6):829–853, 1981.
- Graber (2016) P Jameson Graber. Linear quadratic mean field type control and mean field games with common noise, with application to production of an exhaustible resource. Applied Mathematics & Optimization, 74(3):459–486, 2016.
- Green and Moore (1986) Michael Green and John B Moore. Persistence of excitation in linear systems. Systems & Control Letters, 7(5):351–360, 1986.
- Hambly et al. (2020) Ben M Hambly, Renyuan Xu, and Huining Yang. Policy gradient methods for the noisy linear quadratic regulator over a finite horizon. Available at SSRN, 2020.
- Ioannou and Fidan (2006) Petros Ioannou and Bariş Fidan. Adaptive control tutorial. SIAM, 2006.
- Kumar (1983) PR Kumar. Optimal adaptive control of linear-quadratic-Gaussian systems. SIAM Journal on Control and Optimization, 21(2):163–178, 1983.
- Lale et al. (2020a) Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Explore more and improve regret in linear quadratic regulators. arXiv preprint arXiv:2007.12291, 2020a.
- Lale et al. (2020b) Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Logarithmic regret bound in partially observable linear dynamical systems. Advances in Neural Information Processing Systems, 33:20876–20888, 2020b.
- Landau et al. (2011) Ioan Doré Landau, Rogelio Lozano, Mohammed M’Saad, and Alireza Karimi. Adaptive control: algorithms, analysis and applications. Springer Science & Business Media, 2011.
- Mania et al. (2019) Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalent control of LQR is efficient. arXiv preprint arXiv:1902.07826, 2019.
- Mao (2007) Xuerong Mao. Stochastic differential equations and applications. Elsevier, 2007.
- Modares and Lewis (2014) Hamidreza Modares and Frank L. Lewis. Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning. IEEE Transactions on Automatic Control, 59(11):3051–3056, 2014.
- Munos (2000) Rémi Munos. A study of reinforcement learning in the continuous case by the means of viscosity solutions. Machine Learning, 40(3):265–299, 2000.
- Munos (2006) Rémi Munos. Policy gradient in continuous time. Journal of Machine Learning Research, 7(May):771–791, 2006.
- Munos and Bourgine (1998) Rémi Munos and Paul Bourgine. Reinforcement learning for continuous stochastic control problems. In Advances in Neural Information Processing Systems, pages 1029–1035, 1998.
- Osband et al. (2013) Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
- Ouyang et al. (2017) Yi Ouyang, Mukul Gagrani, and Rahul Jain. Learning-based control of unknown linear systems with thompson sampling. arXiv preprint arXiv:1709.04047, 2017.
- Rizvi and Lin (2018) Syed Ali Asad Rizvi and Zongli Lin. Output feedback reinforcement learning control for the continuous-time linear quadratic regulator problem. In 2018 Annual American Control Conference (ACC), pages 3417–3422. IEEE, 2018.
- Simchowitz and Foster (2020) Max Simchowitz and Dylan Foster. Naive exploration is optimal for online LQR. In International Conference on Machine Learning, pages 8937–8948. PMLR, 2020.
- Tallec et al. (2019) Corentin Tallec, Léonard Blier, and Yann Ollivier. Making Deep Q-learning methods robust to time discretization. arXiv preprint arXiv:1901.09732, 2019.
- Veraar (2012) Mark Veraar. The stochastic Fubini theorem revisited. Stochastics An International Journal of Probability and Stochastic Processes, 84(4):543–551, 2012.
- Wainwright (2019) Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
- Wang and Zhou (2020) Haoran Wang and Xun Yu Zhou. Continuous-time mean–variance portfolio selection: A reinforcement learning framework. Mathematical Finance, 30(4):1273–1308, 2020.
- Yong and Zhou (1999) Jiongmin Yong and Xun Yu Zhou. Stochastic Controls: Hamiltonian Systems and HJB Equations, volume 43. Springer Science & Business Media, 1999.