This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

A modified Thompson sampling-based learning algorithm for unknown linear systems

Mukul Gagrani, Sagar Sudhakara, Aditya Mahajan, Ashutosh Nayyar, and Yi Ouyang Mukul Gagrani is with Qualcomm AI research, San Diego. (email: [email protected])Sagar Sudhakara and Ashutosh Nayyar are with the Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA, USA. (email: [email protected], [email protected])Aditya Mahajan is with the department of Electrical and Computer Engineering, McGill University, Montreal, QC, Canada. (email: [email protected])Yi Ouyang is with Preferred Networks America, Burlingame, CA, USA (email: [email protected])The work at USC was supported by NSF grants ECCS 1750041 and ECCS 2025732 and the work of Aditya Mahajan was supported in part by the Innovation for Defence Excellence and Security (IDEaS) Program through grant CFPMN2-30.
Abstract

We revisit the Thompson sampling-based learning algorithm for controlling an unknown linear system with quadratic cost proposed in [1]. This algorithm operates in episodes of dynamic length and it is shown to have a regret bound of 𝒪~(T)\tilde{\mathcal{O}}(\sqrt{T}), where TT is the time-horizon. The regret bound of this algorithm is obtained under a technical assumption on the induced norm of the closed loop system. We propose a variation of this algorithm that enforces a lower bound TminT_{\min} on the episode length. We show that a careful choice of TminT_{\min} (that depends on the uncertainty about the system model) allows us to recover the 𝒪~(T)\tilde{\mathcal{O}}(\sqrt{T}) regret bound under a milder technical condition about the closed loop system.

I Introduction

The problem of learning an optimal policy for a system with linear dynamics and quadratic cost with unknown parameters has received considerable attention in the literature. Historically, the focus has been on developing algorithms which asymptotically learn optimal policies using techniques from adaptive control and reinforcement learning [2, 3, 4, 5]. In recent years, the emphasis has shifted towards developing algorithms with finite-time regret guarantees.

Broadly speaking, three classes of learning algorithms have been considered in the literature: optimism in the face of uncertainty [6, 7, 8, 9], certainty equivalence [10, 11, 12, 13, 14], and Thompson sampling [15, 16, 14, 17]. These algorithms provide two kinds of regret guarantees: frequentist and Bayesian. In the frequentist setting, it is established that the regret for the unknown system is bounded with high probability (with respect to the distribution of the process noise and the randomness introduced by the algorithm). In the Bayesian setting, it is assumed that there is a prior on the unknown system parameters and it is established that the expected regret is bounded (where the expectation is with respect to the prior, the distribution of the process noise, and the randomness introduced by the algorithm). These two notions of regret are different and, since the per-step cost is not bounded, one form of the regret does not imply the other.

In this paper, we revisit a recently proposed algorithm for establishing Bayesian regret called Thompson sampling with dynamic episodes (TSDE) [1]. The main result of [1] is to show that the Bayesian regret of TSDE accumulated up to time TT is bounded by 𝒪~(T)\tilde{\mathcal{O}}(\sqrt{T}), where the 𝒪~()\tilde{\mathcal{O}}(\cdot) notation hides constants and poly-logarithmic factors. This result was derived under a technical assumption on the induced norm of the closed loop system. In this paper, we present a variation of the TSDE algorithm and obtain a 𝒪~(T)\tilde{\mathcal{O}}(\sqrt{T}) bound on the Bayesian regret by imposing a much milder technical assumption.

II Model and problem formulation

We consider the same model as [1]. For the sake of completeness, we present the model below.

Consider a linear system with state xtnx_{t}\in\mathds{R}^{n}, control input utmu_{t}\in\mathds{R}^{m}, and disturbance wtnw_{t}\in\mathds{R}^{n}. For the ease of exposition, we assume that the system starts from an initial state x1=0x_{1}=0. The state evolves over time according to

xt+1=Axt+But+wt,t1,x_{t+1}=Ax_{t}+Bu_{t}+w_{t},\quad t\geq 1, (1)

where An×nA\in\mathds{R}^{n\times n} and Bn×mB\in\mathds{R}^{n\times m} are the system dynamics matrices. The noise {wt}t1\{w_{t}\}_{t\geq 1} is an independent and identically distributed Gaussian process with wt𝒩(0,σw2I)w_{t}\sim\mathcal{N}(0,\sigma_{w}^{2}I).

{remark}

In [1], it was assumed that σw2=1\sigma_{w}^{2}=1. Using a general σw2>0\sigma_{w}^{2}>0 does not fundamentally change any of the results or the proof arguments.

At each time tt, the system incurs a per-step cost given by

c(xt,ut)=xtQxt+utRut,c(x_{t},u_{t})=x_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}Qx_{t}+u_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}Ru_{t}, (2)

where QQ and RR are positive definite matrices.

Let θ=[A,B]\theta^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}=[A,B] denote the parameters of the system. θd×n\theta\in\mathds{R}^{d\times n}, where d=n+md=n+m. The performance of any policy π=(π1,π2,)\pi=(\pi_{1},\pi_{2},\dots) is measured by the long-term average cost given by

J(π;θ)=lim supT1T𝔼π[t=1Tc(xt,ut)].J(\pi;\theta)=\limsup_{T\to\infty}\frac{1}{T}\mathds{E}^{\pi}\Bigl{[}\sum_{t=1}^{T}c(x_{t},u_{t})\Bigr{]}. (3)

Let J(θ)J(\theta) denote the minimum of J(π;θ)J(\pi;\theta) over all policies. It is well known [18] that if the pair (A,B)(A,B) is stabilizable, then J(θ)J(\theta) is given by

J(θ)=σw2Tr(S(θ)),J(\theta)=\sigma_{w}^{2}\operatorname{Tr}(S(\theta)),

where S(θ)S(\theta) is the unique positive semi-definite solution of the following Riccati equation:

S(θ)=Q+AS(θ)AAS(θ)B(R+BS(θ)B)1BS(θ)A.S(\theta)=Q+A^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}S(\theta)A\\ -A^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}S(\theta)B(R+B^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}S(\theta)B)^{-1}B^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}S(\theta)A. (4)

Furthermore, the optimal control policy is given by

ut=G(θ)xt,u_{t}=G(\theta)x_{t}, (5)

where the gain matrix G(θ)G(\theta) is given by

G(θ)=(R+BS(θ)B)1BS(θ)A.G(\theta)=-(R+B^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}S(\theta)B)^{-1}B^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}S(\theta)A. (6)

As in [1], we are interested in the setting where the system parameters are unknown. We denote the unknown parameters by a random variable θ1\theta_{1} and assume that there is a prior distribution on θ1\theta_{1}. The Bayesian regret of a policy π\pi operating for horizon TT is defined by

R(T;π)=𝔼π[t=1Tc(xt,ut)TJ(θ1)],R(T;\pi)=\mathds{E}^{\pi}\Bigl{[}\sum_{t=1}^{T}c(x_{t},u_{t})-TJ(\theta_{1})\Bigr{]}, (7)

where the expectation is with respect to the prior on θ1\theta_{1}, the noise processes, the initial conditions, and the potential randomizations done by the policy π\pi.

III Thomson sampling based learning algorithm

As in [1], we assume that the unknown model parameters θ\theta lie in a compact subset Ω1\Omega_{1} of d×n\mathds{R}^{d\times n}. We use p|Ωp|\Omega to denote the restriction of probability distribution pp on the set Ω\Omega. We assume that there is a prior μ1\mu_{1} on Ω1\Omega_{1} which satisfies the following assumption.

Assumption 1.

There exist θ^1(i)d\hat{\theta}_{1}(i)\in\mathds{R}^{d} for i{1,,n}i\in\{1,\dots,n\} and a positive definite matrix Σ1d×d\Sigma_{1}\in\mathds{R}^{d\times d} such that for any θd×n\theta\in\mathds{R}^{d\times n}, μ1=μ¯1|Ω1\mu_{1}=\bar{\mu}_{1}\bigr{|}_{\Omega_{1}}, where

μ¯1(θ)=i=1nμ¯1(θ(i))andμ¯1(θ(i))=𝒩(θ^1(i),Σ1).\bar{\mu}_{1}(\theta)=\prod_{i=1}^{n}\bar{\mu}_{1}(\theta(i))\quad\text{and}\quad\bar{\mu}_{1}(\theta(i))=\mathcal{N}(\hat{\theta}_{1}(i),\Sigma_{1}).

We maintain a posterior distribution μt\mu_{t} on Ω1\Omega_{1} based on the history (x1:t1,u1:t1)(x_{1:t-1},u_{1:t-1}) of the observations until time tt. From standard results in linear Gaussian regression [19], we know that the posterior is a truncated Gaussian distribution

μt(θ)=[i=1nμ¯t(θ(i))]|Ω1\mu_{t}(\theta)=\biggl{[}\prod_{i=1}^{n}\bar{\mu}_{t}(\theta(i))\biggr{]}\biggr{|}_{\Omega_{1}}

where μ¯t(θ(i))=𝒩(θ^t(i),Σt)\bar{\mu}_{t}(\theta(i))=\mathcal{N}(\hat{\theta}_{t}(i),\Sigma_{t}) and {θ^t(i)}i=1n\{\hat{\theta}_{t}(i)\}_{i=1}^{n} and Σt\Sigma_{t} can be updated recursively as follows:

θ^t+1(i)\displaystyle\hat{\theta}_{t+1}(i) =θ^t(i)+Σtzt(xt+1(i)θ^t(i)zt)σw2+ztΣtzt,\displaystyle=\hat{\theta}_{t}(i)+\frac{\Sigma_{t}z_{t}(x_{t+1}(i)-\hat{\theta}_{t}(i)^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}z_{t})}{\sigma_{w}^{2}+z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma_{t}z_{t}}, (8)
Σt+11\displaystyle\Sigma_{t+1}^{-1} =Σt1+1σw2ztzt,\displaystyle=\Sigma_{t}^{-1}+\frac{1}{\sigma_{w}^{2}}z_{t}z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}, (9)

where zt=[xt,ut]z_{t}=[x_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}},u_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}]^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}.

III-A Thompson sampling with dynamic episodes algorithm

We now present a variation of the Thompson sampling with dynamic episodes (TSDE) algorithm of [1]. As the name suggests, the algorithm operates in episodes of dynamic length. The key difference from [1] is that we enforce that each episode is of a minimum length TminT_{\min}. The choice of TminT_{\min} will be explained later.

Let tkt_{k} and TkT_{k} denote the start time and the length of episode kk, respectively. Episode kk has a minimum length of TminT_{\min} and ends when the length of the episode is strictly larger than the length of the previous episode (i.e., ttk>Tk1t-t_{k}>T_{k-1}) or at the first time after tk+Tmint_{k}+T_{\min} when the determinant of the covariance Σt\Sigma_{t} falls below half of its value at time tkt_{k}, i.e., detΣt<12detΣtk\det\Sigma_{t}<\tfrac{1}{2}\det\Sigma_{t_{k}}. Thus,

tk+1=min{t>tk+Tmin|ttk>Tk1 or detΣt<12detΣtk}.t_{k+1}=\min\left\{t>t_{k}+T_{\min}\,\middle|\,\begin{lgathered}t-t_{k}>T_{k-1}\text{ or }\\ \det\Sigma_{t}<\tfrac{1}{2}\det\Sigma_{t_{k}}\end{lgathered}t-t_{k}>T_{k-1}\text{ or }\\ \det\Sigma_{t}<\tfrac{1}{2}\det\Sigma_{t_{k}}\right\}. (10)

Note that the stopping condition (10) implies that

Tmin+1TkTk1+1,kT_{\min}+1\leq T_{k}\leq T_{k-1}+1,\quad\forall k (11)

If we select Tmin=0T_{\min}=0 in the above algorithm, we recover the stopping condition of [1].

The TSDE algorithm works as follows. At the beginning of episode kk, a parameter θ¯k\bar{\theta}_{k} is sampled from the posterior distribution μtk\mu_{t_{k}}. During the episode, the control inputs are generated using the sampled parameters θ¯k\bar{\theta}_{k}, i.e.,

ut=G(θ¯k)xt,tkt<tk+1.u_{t}=G(\bar{\theta}_{k})x_{t},\quad t_{k}\leq t<t_{k+1}. (12)

The complete algorithm is presented in Algorithm 1.

Algorithm 1 TSDE
1:input: Ω1\Omega_{1}, θ^1\hat{\theta}_{1}, Σ1\Sigma_{1}
2:initialization: t1t\leftarrow 1, t0Tmint_{0}\leftarrow-T_{\min}, T1TminT_{-1}\leftarrow T_{\min}, k0k\leftarrow 0.
3:for t=1,2,t=1,2,\dots do
4:   observe xtx_{t}
5:   update μ¯t\bar{\mu}_{t} according to (8)–(9)
6:   if (ttk>Tmin)(t-t_{k}>T_{\min}) and
7:   ((ttk>Tk1) or (detΣt<12detΣtk))\bigl{(}(t-t_{k}>T_{k-1})\hbox{ or }(\det\Sigma_{t}<\tfrac{1}{2}\det\Sigma_{t_{k}})\bigr{)}
8:   then
9:   TkttkT_{k}\leftarrow t-t_{k}, kk+1k\leftarrow k+1, tktt_{k}\leftarrow t
10:    sample θ¯kμt\bar{\theta}_{k}\sim\mu_{t}
11:   end if
12:   Apply control ut=G(θ¯k)xtu_{t}=G(\bar{\theta}_{k})x_{t}
13:end for

III-B A technical assumption and the choice of minimum episode length

For each θ=[A,B]\theta^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}=[A,B], we define a 4-dimensional row-vector η(θ)\eta(\theta) as follows:

η(θ):=(θ,Tr(S(θ)),S(θ),[I,G(θ)]),\eta(\theta):=(\|\theta\|,\operatorname{Tr}(S(\theta)),\|S(\theta)\|,\|[I,G(\theta)^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}]^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\|), (13)

where S(θ)S(\theta) is the solution of the Riccati equation in (4) and G(θ)G(\theta) is the optimal gain matrix defined in (6). {definition} Let Mθ,MJ,MS,MG,α,δM_{\theta},M_{J},M_{S},M_{G},\alpha,\delta be positive constants such that α1\alpha\geq 1 and 0<δ<10<\delta<1. We say that the uncertainty set Ω1\Omega_{1} is of Type (Mθ,MJ,MS,MG,α,δ)(M_{\theta},M_{J},M_{S},M_{G},\alpha,\delta) if the following conditions hold:

  1. 1.

    For all θΩ1\theta\in\Omega_{1},

    η(θ)(Mθ,MJ,MS,MG)\displaystyle\eta(\theta)\leq(M_{\theta},M_{J},M_{S},M_{G}) (14)

    where the inequality is component-wise.

  2. 2.

    For any θ,ϕΩ1\theta,\phi\in\Omega_{1} with θ=[Aθ,Bθ]\theta^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}=[A_{\theta},B_{\theta}] and for any integer t1t\geq 1,

    (Aθ+BθG(ϕ))tαδt.\|(A_{\theta}+B_{\theta}G(\phi))^{t}\|\leq\alpha\delta^{t}.
Assumption 2.

We assume that the uncertainty set Ω1\Omega_{1} is of Type (Mθ,MJ,MS,MG,α,δ)(M_{\theta},M_{J},M_{S},M_{G},\alpha,\delta), where MθM_{\theta}, MJM_{J}, MSM_{S}, MGM_{G}, α\alpha, δ\delta are positive constants such that α1\alpha\geq 1 and 0<δ<10<\delta<1.

The following simple observation plays a critical role in analyzing the regret of TSDE. {lemma} Suppose Assumption 2 is true. Define

Tmin=logαlogδ.T^{*}_{\min}=\biggl{\lceil}\frac{\log\alpha}{-\log\delta}\biggr{\rceil}. (15)

Then, for θ,ϕΩ1\theta,\phi\in\Omega_{1} with θ=[Aθ,Bθ]\theta^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}=[A_{\theta},B_{\theta}], we have

(Aθ+BθG(ϕ))Tmin+1<1\|(A_{\theta}+B_{\theta}G(\phi))^{T^{*}_{\min}+1}\|<1 (16)
Proof.

The proof follows immediately from Assumption 2 and the definition of TminT^{*}_{\min}.

Before presenting our regret analysis under Assumption 2, we present two special cases of this assumption. The first case is identical to the assumption made in [1] about the uncertainty set.

Assumption 3.

Let MθM_{\theta}, MJM_{J}, MSM_{S}, MGM_{G}, α\alpha, δ\delta be positive constants such that α1\alpha\geq 1 and 0<δ<10<\delta<1. Assume that the uncertainty set Ω1\Omega_{1} satisfies the following conditions:

  1. 1.

    Equation (14) holds for all θΩ1\theta\in\Omega_{1}.

  2. 2.

    For any θ,ϕΩ1\theta,\phi\in\Omega_{1} with θ=[Aθ,Bθ]\theta^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}=[A_{\theta},B_{\theta}],

    (Aθ+BθG(ϕ))δ.\|(A_{\theta}+B_{\theta}G(\phi))\|\leq\delta.

In [1], part 2) of Assumption 3 was stated explicitly. In addition, the uncertainty set was assumed to be compact, which ensures part 1) of Assumption 3.

An uncertainty set that satisfies Assumption 3 also satisfies Assumption 2 with α=1\alpha=1. This is because if (Aθ+BθG(ϕ))δ\|(A_{\theta}+B_{\theta}G(\phi))\|\leq\delta, then

(Aθ+BθG(ϕ))t((Aθ+BθG(ϕ)))tδt.\displaystyle\|(A_{\theta}+B_{\theta}G(\phi))^{t}\|\leq\left(\|(A_{\theta}+B_{\theta}G(\phi))\|\right)^{t}\leq\delta^{t}. (17)

Thus, Assumption 2 is weaker than Assumption 3 used in [1].

For a square matrix AA, let ρ(A)\rho(A) denote the spectral radius of matrix AA.The next assumption can also be viewed as a special case of Assumption 2.

Assumption 4.

Let MθM_{\theta}, MJM_{J}, MSM_{S}, MGM_{G}, α\alpha, δ~\tilde{\delta} be positive constants such that α1\alpha\geq 1 and 0<δ~<10<\tilde{\delta}<1. Assume that the uncertainty set Ω1\Omega_{1} satisfies the following conditions:

  1. 1.

    Equation (14) holds for all θΩ1\theta\in\Omega_{1}.

  2. 2.

    For any θ,ϕΩ1\theta,\phi\in\Omega_{1} with θ=[Aθ,Bθ]\theta^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}=[A_{\theta},B_{\theta}],

    ρ(Aθ+BθG(ϕ))δ~.\rho(A_{\theta}+B_{\theta}G(\phi))\leq\tilde{\delta}.

We note that Assumption 4 is weaker than Assumption 3 (since ρ(A)A\rho(A)\leq||A|| for any matrix AA). Consider, for example, a family of matrices Aq=[δ~q0δ~]A_{q}=\begin{bmatrix}\tilde{\delta}&q\\ 0&\tilde{\delta}\end{bmatrix}, where qq\in\mathds{N} and 0<δ~<10<\tilde{\delta}<1. For each qq, the spectral radius of AqA_{q} is δ~\tilde{\delta} while its norm is at least qq. Thus, each AqA_{q} satisfies Assumption 4 but not Assumption 3.

The following lemma shows that an uncertainty set that satisfies Assumption 4 also satisfies Assumption 2 for some constants α1\alpha\geq 1 and δ~<δ<1\tilde{\delta}<\delta<1.

{lemma}

Suppose Assumption 4 is true. Then, there exist α1\alpha\geq 1 and δ~<δ<1\tilde{\delta}<\delta<1 such that for any θ,ϕΩ1\theta,\phi\in\Omega_{1} with θ=[Aθ,Bθ]\theta^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}=[A_{\theta},B_{\theta}] and for any integer t1t\geq 1,

(Aθ+BθG(ϕ))tαδt.\|(A_{\theta}+B_{\theta}G(\phi))^{t}\|\leq\alpha\delta^{t}.
Proof.

Define ε=δδ~\varepsilon=\delta-\tilde{\delta}. Let ={Aθ+BθG(ϕ):θ,ϕΩ1}\mathcal{L}=\{A_{\theta}+B_{\theta}G(\phi):\theta,\phi\in\Omega_{1}\}. Since Ω1\Omega_{1} is compact, so is \mathcal{L}. Now for any LL\in\mathcal{L}, there exists a norm (call it normL\operatorname{norm}_{L}) such that normL(L)<ρ(L)+εδ~+ε=δ\operatorname{norm}_{L}(L)<\rho(L)+\varepsilon\leq\tilde{\delta}+\varepsilon=\delta.

Since norms are continuous, there is an open ball centered at LL (let’s call this ballL\operatorname{ball}_{L}) such that for any HballLH\in\operatorname{ball}_{L}, we have normL(H)<δ\operatorname{norm}_{L}(H)<\delta. Consider the collection of open balls {ballL:L}\{\operatorname{ball}_{L}:L\in\mathcal{L}\}. This is an open cover of compact set \mathcal{L}. So, there is a finite sub-cover. Let’s denote this sub-cover by ballL1,,ballL\operatorname{ball}_{L_{1}},\dots,\operatorname{ball}_{L_{\ell}}. By equivalence of norms, there is a finite constant αk\alpha_{k} such that AαLknormLk(A)\|A\|\leq\alpha_{L_{k}}\operatorname{norm}_{L_{k}}(A) for any matrix AA, for all k{1,,}k\in\{1,\dots,\ell\}. Let α=max(1,maxkαLk)\alpha=\max(1,\max_{k}\alpha_{L_{k}}).

Now consider an arbitrary HH\in\mathcal{L}. It belongs to ballLk\operatorname{ball}_{L_{k}} for some k{1,,}k\in\{1,\dots,\ell\}. Therefore, normLk(H)<δ\operatorname{norm}_{L_{k}}(H)<\delta. Hence, for any integer tt, the above inequalities and the submulitplicity of norms give that HtαLknormLk(Ht)α(normLk(H))t<α(δ)t\|H^{t}\|\leq\alpha_{L_{k}}\operatorname{norm}_{L_{k}}(H^{t})\leq\alpha(\operatorname{norm}_{L_{k}}(H))^{t}<\alpha(\delta)^{t}.

{remark}

Condition 2 in Definition III-B states that Aθ+BθG(ϕ)A_{\theta}+B_{\theta}G(\phi) is uniformly exponentially stable for all all θ,ϕΩ1\theta,\phi\in\Omega_{1}. Condition 2 of Assumption 4 states that Aθ+BθG(ϕ)A_{\theta}+B_{\theta}G(\phi) is uniformly asymptotically stable for all θ,ϕΩ1\theta,\phi\in\Omega_{1}. For linear systems, asymptotic stability implies exponential stability. In Lemma III-B, we are effectively showing that when the uncertainty set is compact, uniform asymptotic stability implies uniform exponential stability.

III-C Regret bounds

The following result provides an upper bound on the regret of the proposed algorithm.

{theorem}

Under Assumptions 1 and 2 and with TminTminT_{\min}\geq T^{*}_{\min}, the regret of TSDE is upper bounded by

R(T;𝚃𝚂𝙳𝙴)𝒪~(σw2(n+m)nT).R(T;{\tt TSDE})\leq\tilde{\mathcal{O}}(\sigma_{w}^{2}(n+m)\sqrt{nT}). (18)

The proof is presented in the next section. {remark} The constants hidden in the 𝒪~()\tilde{\mathcal{O}}(\cdot) notation in Theorem III-C depend only on the type (Mθ,MJ,MS,MG,α,δ)(M_{\theta},M_{J},M_{S},M_{G},\alpha,\delta) of the uncertainty set Ω1\Omega_{1}. In particular, these constants do not depend on nn, mm, and TT.

IV Regret analysis

For the ease of notation, we use R(T)R(T) instead of R(T;TSDE)R(T;\texttt{TSDE}) in this section. Let KTK_{T} denote the number of episodes until horizon TT. Following the exact same steps as [1], we can show that

R(T)=R0(T)+R1(T)+R2(T)R(T)=R_{0}(T)+R_{1}(T)+R_{2}(T) (19)

where

R0(T)\displaystyle R_{0}(T) =𝔼[k=1KTTkJ(θ¯k)]T𝔼[J(θ1)],\displaystyle=\mathds{E}\bigg{[}\sum_{k=1}^{K_{T}}T_{k}J(\bar{\theta}_{k})\biggr{]}-T\mathds{E}[J(\theta_{1})], (20)
R1(T)\displaystyle R_{1}(T) =𝔼[k=1KTt=tktk+11[xtS(θ¯k)xtxt+1S(θ¯k)xt+1]]\displaystyle=\mathds{E}\Bigg{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\bigl{[}x_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}S(\bar{\theta}_{k})x_{t}-x_{t+1}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}S(\bar{\theta}_{k})x_{t+1}\bigr{]}\Biggr{]} (21)
R2(T)\displaystyle R_{2}(T) =𝔼[k=1KTt=tktk+11[(θ1zt)S(θ¯k)θ1zt\displaystyle=\mathds{E}\Bigg{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\bigl{[}(\theta_{1}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}z_{t})^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}S(\bar{\theta}_{k})\theta_{1}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}z_{t}
(θkzt)S(θ¯k)θkzt]]\displaystyle\hskip 100.00015pt-(\theta_{k}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}z_{t})^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}S(\bar{\theta}_{k})\theta_{k}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}z_{t}\bigr{]}\Biggr{]} (22)

We establish the bound on R(T)R(T) by individually bounding R0(T)R_{0}(T), R1(T)R_{1}(T), and R2(T)R_{2}(T).

{lemma}

The terms in (19) are bounded as follows:

  1. 1.

    R0(T)𝒪~(σw2(n+m)T)R_{0}(T)\leq\tilde{\mathcal{O}}(\sigma_{w}^{2}\sqrt{(n+m)T}).

  2. 2.

    R1(T)𝒪~(σw2(n+m)T)R_{1}(T)\leq\tilde{\mathcal{O}}(\sigma_{w}^{2}\sqrt{(n+m)T}).

  3. 3.

    R2(T)𝒪~(σw2(n+m)nT)R_{2}(T)\leq\tilde{\mathcal{O}}(\sigma_{w}^{2}(n+m)\sqrt{nT}).

Combining Lemma IV with equation (19) establishes Theorem III-C. Before presenting the proof of Lemma IV, we establish some preliminary results.

IV-A Preliminary results

Let XT=σw+max1tTxtX_{T}=\sigma_{w}+\max_{1\leq t\leq T}\|x_{t}\| denote the maximum of the norm of the state plus the noise standard deviation.

{lemma}

For any q1q\geq 1 and any T1T\geq 1,

𝔼[XTqσwq]𝒪((logT)q/2).\mathds{E}\Big{[}\frac{X_{T}^{q}}{\sigma_{w}^{q}}\Big{]}\leq\mathcal{O}({(\log T)}^{q/2}).

See Appendix B for proof.

{lemma}

For any q1q\geq 1, we have

𝔼[XTqσwqlog(XT2σw2)]𝒪~(1).\mathds{E}\Big{[}\frac{X_{T}^{q}}{\sigma_{w}^{q}}\log\Big{(}\frac{X_{T}^{2}}{\sigma_{w}^{2}}\Big{)}\Big{]}\leq\tilde{\mathcal{O}}(1).

See Appendix C for proof.

{lemma}

The number of episodes is bounded by

KT𝒪((n+m)Tlog(TXT2σw2)).K_{T}\leq\mathcal{O}\Bigl{(}\sqrt{(n+m)T\log\left(T\frac{X_{T}^{2}}{\sigma_{w}^{2}}\right)}\Bigr{)}.

See Appendix D for proof.

{remark}

The statement of Lemmas IV-A and IV-A are the same as that of the corresponding lemmas in [1]. The proof of Lemma IV-A in [1] relied on Assumption 3. Since we impose a weaker assumption, our proof is more involved. The proof of Lemma IV-A is similar to the proof of [1, Lemma 3]. However, since our TSDE algorithm is different from that in [1], some of the details of the proof are different.

IV-B Proof of Lemma IV

We now prove each part of Lemma IV separately.

IV-B1 Proof of bound on R0(T)R_{0}(T)

Following exactly the same argument as the proof of [1, Lemma 5], we can show that

R0(T)𝒪(σw2𝔼[KT]).R_{0}(T)\leq\mathcal{O}(\sigma_{w}^{2}\mathds{E}[K_{T}]). (23)

Substituting the result of Lemma IV-A, we get

R0(T)\displaystyle R_{0}(T) 𝒪(σw2𝔼[(n+m)Tlog(TXT2/σw2)])\displaystyle\leq\mathcal{O}\Bigl{(}\sigma_{w}^{2}\mathds{E}\Bigl{[}\sqrt{(n+m)T\log(TX_{T}^{2}/\sigma_{w}^{2})}\Bigr{]}\Bigr{)}
(a)𝒪(σw2(n+m)Tlog(T𝔼[XT2/σw2]))\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\mathcal{O}\Bigl{(}\sigma_{w}^{2}\sqrt{(n+m)T\log(T\mathds{E}[X_{T}^{2}/\sigma_{w}^{2}]})\Bigr{)}
(b)𝒪~(σw2(n+m)T)\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}\tilde{\mathcal{O}}\bigl{(}\sigma_{w}^{2}\sqrt{(n+m)T}\bigr{)}

where (a)(a) follows from Jensen’s inequality and (b)(b) follows from Lemma IV-A.

IV-B2 Proof of bound on R1(T)R_{1}(T)

Following exactly the same argument as in the proof of [1, Lemma 6], we can show that

R1(T)𝒪(𝔼[KTXT2])R_{1}(T)\leq\mathcal{O}(\mathds{E}[K_{T}X_{T}^{2}]) (24)

Substituting the result of Lemma IV-A, we get

R1(T)𝒪((n+m)T𝔼[XT2log(TXT2/σw2)])R_{1}(T)\leq\mathcal{O}\Bigl{(}\sqrt{(n+m)T}\,\mathds{E}\bigl{[}X_{T}^{2}\sqrt{\log(TX_{T}^{2}/\sigma_{w}^{2})}\bigr{]}\Bigr{)} (25)

Now, consider the term

𝔼[XT2log(TXT2/σw2)]\displaystyle\mathds{E}\bigl{[}X_{T}^{2}\sqrt{\log(TX_{T}^{2}/\sigma_{w}^{2})}\bigr{]} (a)𝔼[XT4]𝔼[log(TXT2/σw2)]\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\sqrt{\mathds{E}\bigl{[}X_{T}^{4}\bigr{]}\mathds{E}\bigl{[}\log(TX_{T}^{2}/\sigma_{w}^{2})\bigr{]}}
(b)𝔼[XT4]log(T𝔼[XT2/σw2])\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}\sqrt{\mathds{E}\bigl{[}X_{T}^{4}\bigr{]}\log(T\mathds{E}[X_{T}^{2}/\sigma_{w}^{2}])}
(c)𝒪~(σw2)\displaystyle\stackrel{{\scriptstyle(c)}}{{\leq}}\tilde{\mathcal{O}}(\sigma_{w}^{2}) (26)

where (a)(a) follows from Cauchy-Schwartz inequality, (b)(b) follows from Jensen’s inequality, and (c)(c) follows from Lemma IV-A.

Substituting (26) in (25), we get the bound on R1(T)R_{1}(T).

IV-B3 Proof of bound on R2(T)R_{2}(T)

As in [1], we can bound the inner summand in R2(T)R_{2}(T) as

S(θ¯k)0.5θ1zt2S(θ¯k)0.5θkzt2𝒪(XT(θ1θ¯k)zt).\|S(\bar{\theta}_{k})^{0.5}\theta_{1}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}z_{t}\|^{2}-\|S(\bar{\theta}_{k})^{0.5}\theta_{k}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}z_{t}\|^{2}\leq\mathcal{O}(X_{T}\|(\theta_{1}-\bar{\theta}_{k})^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}z_{t}\|).

Therefore,

R2(T)𝒪(𝔼[XTk=1KTt=tktk+11(θ1θ¯k)zt]),R_{2}(T)\leq\mathcal{O}\biggl{(}\mathds{E}\Bigg{[}X_{T}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\|(\theta_{1}-\bar{\theta}_{k})^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}z_{t}\|\biggr{]}\biggr{)},

which is same as [1, Eq. (45)]. Now, by simplifying the term inside 𝒪()\mathcal{O}(\cdot) using Cauchy-Schwartz inequality, we get

𝔼[XTk=1KTt=tktk+11(θ1θ¯k)zt]\displaystyle\hskip-20.00003pt\mathds{E}\Bigg{[}X_{T}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\|(\theta_{1}-\bar{\theta}_{k})^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}z_{t}\|\biggr{]}
𝔼[k=1KTt=tktk+11Σtk0.5(θ1θ¯k)2]\displaystyle\leq\sqrt{\mathds{E}\Bigg{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\|\Sigma_{t_{k}}^{-0.5}(\theta_{1}-\bar{\theta}_{k})\|^{2}\Biggr{]}}
×𝔼[k=1KTt=tktk+11XT2Σtk0.5zt2]\displaystyle\quad\times\sqrt{\mathds{E}\Bigg{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}X_{T}^{2}\|\Sigma_{t_{k}}^{0.5}z_{t}\|^{2}\Biggr{]}} (27)

Note that (27) is slightly different than the simplification of [1, Eq. (45)] using Cauchy-Schwartz inequality presented in [1, Eq. (46)], which used Σt\Sigma_{t} in each term in the right hand side instead of Σtk\Sigma_{t_{k}}.

We bound each term of (27) separately as follows. {lemma} We have the following inequality

𝔼[k=1KTt=tktk+11Σtk0.5(θ1θ¯k)2]\displaystyle\hskip-20.00003pt\mathds{E}\Bigg{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\|\Sigma_{t_{k}}^{-0.5}(\theta_{1}-\bar{\theta}_{k})\|^{2}\Biggr{]}
𝒪(n(n+m)(T+𝔼[KT]))𝒪(n(n+m)T).\displaystyle\leq\mathcal{O}(n(n+m)(T+\mathds{E}[K_{T}]))\leq\mathcal{O}(n(n+m)T).

See Appendix E for a proof. {lemma} We have the following inequality

𝔼[k=1KTt=tktk+11XT2Σtk0.5zt2]𝒪~((n+m)σw4)\mathds{E}\Bigg{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}X_{T}^{2}\|\Sigma_{t_{k}}^{0.5}z_{t}\|^{2}\Biggr{]}\leq\tilde{\mathcal{O}}\bigl{(}(n+m)\sigma_{w}^{4}\bigr{)}

See Appendix F for a proof.

We get the bound on R2(T)R_{2}(T) by substituting the result of Lemmas IV-B3 and IV-B3 in (27).

V Discussion and Conclusion

In this paper, we present a variation of the TSDE algorithm of [1] and show that its Bayesian regret up to time TT is bounded by 𝒪~(T)\tilde{\mathcal{O}}(\sqrt{T}) under a milder technical assumption than [1]. The result in [1] was derived under the assumption that there exists a δ<1\delta<1 such that for any θ,ϕΩ1\theta,\phi\in\Omega_{1}, Aθ+BθG(ϕ)δ\|A_{\theta}+B_{\theta}G(\phi)\|\leq\delta. For our analysis, we impose a different assumption for the closed loop gain when the system dynamics are θ\theta and the controller is chosen according to ϕ\phi. We show that the assumption of [1] implies our assumption. Our assumption is also implied by ρ(Aθ+BθG(ϕ))δ\rho(A_{\theta}+B_{\theta}G(\phi))\leq\delta.

The key technical result in [1] as well as our paper is Lemma IV-A, which shows that for any q1q\geq 1, 𝔼[XTq/σwq]𝒪~(logT)\mathds{E}[X_{T}^{q}/\sigma_{w}^{q}]\leq\tilde{\mathcal{O}}(\log T). The proof argument in both [1] as well as our paper is to show that there is some constant α0\alpha_{0} such that XTσw+α0WTX_{T}\leq\sigma_{w}+\alpha_{0}W_{T}. Under the stronger assumption in [1], one can show that for all tt, xt+1δxt+wt\|x_{t+1}\|\leq\delta\|x_{t}\|+\|w_{t}\|, which directly implies that XTσw+WT/(1δ)X_{T}\leq\sigma_{w}+W_{T}/(1-\delta). Under the weaker assumption in this paper, the argument is more subtle. The basic intuition is that in each episode, the system is asymptotically stable and, being a linear system, also exponentially stable (in the sense of Lemma III-B). So, if the episode length is sufficiently long, then we can ensure that xtk+1βxtk+α¯WT\|x_{t_{k+1}}\|\leq\beta\|x_{t_{k}}\|+\bar{\alpha}W_{T}, where β<1\beta<1 and α¯\bar{\alpha} is a constant. This is sufficient to ensure that XTσw+α0WTX_{T}\leq\sigma_{w}+\alpha_{0}W_{T} for an appropriately defined α0\alpha_{0}.

The fact that each episode must be of length TminT_{\min} implies that the second triggering condition is not triggered for the first TminT_{\min} steps in an episode. Therefore, in this interval, the determinant of the covariance can be smaller than half of its value at the beginning of the episode. Consequently, we cannot use the same proof argument as [1] to bound R2(T)R_{2}(T) because that proof relied on the fact that for any t{tk,,tk+11}t\in\{t_{k},\dots,t_{k+1}-1\}, detΣt1/detΣtk12\det\Sigma_{t}^{-1}/\det\Sigma_{t_{k}}^{-1}\leq 2. So, we provide a variation of that proof argument, where we use a coarser bound on detΣt1/detΣtk1\det\Sigma_{t}^{-1}/\det\Sigma_{t_{k}}^{-1} given by Lemma F.

We conclude by observing that the milder technical assumption imposed in this paper may not be necessary. Numerical experiments indicate that the regret of the TSDE algorithm shows 𝒪~(T)\tilde{\mathcal{O}}(\sqrt{T}) behavior even when the uncertainty set Ω1\Omega_{1} does not satisfy Assumption 4 (as was also reported in [1]). This suggests that it might be possible to further relax Assumption 4 and still establish an 𝒪~(T)\tilde{\mathcal{O}}(\sqrt{T}) regret bound.

Acknowledgement

We would like to thank Borna Sayedana for pointing out a mistake in Lemma 9 in a previous draft of this paper.

Appendix A An auxiliary result

{lemma}

For any q>0q>0, we have

𝔼[max1tTwtqσwq]𝒪((logT)q/2)\mathds{E}\Biggl{[}\max_{1\leq t\leq T}\frac{\|w_{t}\|^{q}}{\sigma_{w}^{q}}\Biggr{]}\leq\mathcal{O}({(\log T)^{q/2}}) (28)
Proof.

For ease of notation, define random variables w¯t(q)=wtq/σwq\bar{w}^{(q)}_{t}=\|w_{t}\|^{q}/\sigma_{w}^{q} and w¯t=(w¯t(q))2/q\bar{w}^{*}_{t}=(\bar{w}^{(q)}_{t})^{2/q}. Note that w¯t\bar{w}^{*}_{t} has a χ2\chi^{2}-distribution with nn-degrees of freedom. Therefore, w¯t\bar{w}^{*}_{t} has a moment generating function 𝔼[esw¯t]=(12s)n/2\mathds{E}[e^{s\bar{w}^{*}_{t}}]=(1-2s)^{-n/2} for s<1/2s<1/2. Pick a λ(0,1/2)\lambda\in(0,1/2). By Chernoff bound, we have

(w¯t>z)𝔼[eλw¯t]eλz=Cλeλz,\mathds{P}(\bar{w}^{*}_{t}>z)\leq\frac{\mathds{E}[e^{\lambda\bar{w}^{*}_{t}}]}{e^{\lambda z}}=C_{\lambda}e^{-\lambda z},

where Cλ=(12λ)n/2C_{\lambda}=(1-2\lambda)^{-n/2}. Therefore, the complementary CDF of w¯t(q)\bar{w}^{(q)}_{t} is bounded by

1Fw¯t(q)(z)=(w¯t(q)>z)=(w¯t>zq/2)Cλeλzq/2.1-F_{\bar{w}^{(q)}_{t}}(z)=\mathds{P}(\bar{w}^{(q)}_{t}>z)=\mathds{P}(\bar{w}^{*}_{t}>z^{q/2})\leq C_{\lambda}e^{-\lambda z^{q/2}}.

Now, pick λ(0,λ)\lambda^{\prime}\in(0,\lambda) and consider an i.i.d. process {ξt}t1\{\xi_{t}\}_{t\geq 1}, where ξt\xi_{t} has a CDF Fξt(z)=1eλzq/2F_{\xi_{t}}(z)=1-e^{-\lambda^{\prime}z^{q/2}}. We let ξ(T)\xi_{(T)} denote max1tTξt\max_{1\leq t\leq T}\xi_{t} and use a similar notation for w¯(T)(q)\bar{w}^{(q)}_{(T)}. Note that ξt\xi_{t} has a Weibull distribution with shape q/2q/2. Therefore, (see e.g., [20, Eq. (3)])

𝔼[ξ(T)]=𝒪((logT)q/2).\mathds{E}[\xi_{(T)}]=\mathcal{O}((\log T)^{q/2}). (29)

Now we present a bound on 𝔼[w¯(T)(q)]\mathds{E}[\bar{w}^{(q)}_{(T)}] in terms of 𝔼[ξ(T)]\mathds{E}[\xi_{(T)}]. Since 0<λ<λ0<\lambda^{\prime}<\lambda, there exists a z>0z^{\circ}>0 such that for all z>zz>z^{\circ}, Fw¯t(q)(z)>Fξt(z)F_{\bar{w}^{(q)}_{t}}(z)>F_{\xi_{t}}(z). Thus, w¯t(q)\bar{w}^{(q)}_{t} is stochastically dominated by ξt\xi_{t} in the weak stochastic order111A random variable xx is said to be dominated by a random variable yy in the weak stochastic order if for all increasing functions ff supported sufficiently away from 0, 𝔼[f(x)]𝔼[f(y)]\mathds{E}[f(x)]\leq\mathds{E}[f(y)]., as defined in [20]. Therefore, by [20, Theorem 2.1], there exists a constant c>0c>0 such that w¯t(q)\bar{w}^{(q)}_{t} is stochastically dominated by cξtc\xi_{t} in the convex order.222A random variable xx is said to be dominated by a random variable yy in the convex order if for all increasing and convex functions ff, 𝔼[f(x)]𝔼[f(y)]\mathds{E}[f(x)]\leq\mathds{E}[f(y)]. Consequently, by [20, Theorem 3.1(1)] (or [21, Theorem 2.2(1)]), we have that 𝔼[w¯(T)(q)]c𝔼[ξ(T)]\mathds{E}[\bar{w}^{(q)}_{(T)}]\leq c\mathds{E}[\xi_{(T)}]. Substituting (29) establishes the result of the Lemma.

Appendix B Proof of Lemma IV-A

For the ease of notation, let α¯=α/(1δ)\bar{\alpha}=\alpha/(1-\delta), and β=αδTmin+1\beta=\alpha\delta^{T_{\min}+1}. In addition, define WT=max1tTwtW_{T}=\max_{1\leq t\leq T}\|w_{t}\|, X¯k=maxtk<ttk+1xt\bar{X}_{k}=\max_{t_{k}<t\leq t_{k+1}}\|x_{t}\|, Yk=xtkY_{k}=\|x_{t_{k}}\|, and Hk=A+BG(θ¯k)H_{k}=A+BG(\bar{\theta}_{k}) where AA and BB are the true parameters.

From the system dynamics under the TSDE algorithm, we know that for any time t{tk+1,,tk+1}t\in\{t_{k}+1,\dots,t_{k+1}\}, we have

xt=Hkttkxtk+j=tkt1Hkt1jwj.x_{t}=H_{k}^{t-t_{k}}x_{t_{k}}+\sum_{j=t_{k}}^{t-1}H_{k}^{t-1-j}w_{j}.

Thus, from triangle inequality and Assumption 2, we get

xt\displaystyle\|x_{t}\| αδttkYk+[j=tkt1αδt1j]WT\displaystyle\leq\alpha\delta^{t-t_{k}}Y_{k}+\biggl{[}\sum_{j=t_{k}}^{t-1}\alpha\delta^{t-1-j}\biggr{]}W_{T}
αδttkYk+[α1δ]α¯WT.\displaystyle\leq\alpha\delta^{t-t_{k}}Y_{k}+\underbrace{\biggl{[}\frac{\alpha}{1-\delta}\biggr{]}}_{\eqqcolon\bar{\alpha}}W_{T}. (30)

Now at time t=tk+1t=t_{k+1}, we have

Yk+1\displaystyle Y_{k+1} =xtk+1αδTkYk+α¯WT.\displaystyle=\|x_{t_{k+1}}\|\leq\alpha\delta^{T_{k}}Y_{k}+\bar{\alpha}W_{T}.
βYk+α¯WT\displaystyle\leq\beta Y_{k}+\bar{\alpha}W_{T} (31)

where the second inequality follows from (11), which implies αδTkαδTmin+1β\alpha\delta^{T_{k}}\leq\alpha\delta^{T_{\min}+1}\eqqcolon\beta. From Lemma III-B, β<1\beta<1. Recursively expanding (31), we get

Yk\displaystyle Y_{k} α¯WT+βα¯WT++βk2α¯WT\displaystyle\leq\bar{\alpha}W_{T}+\beta\bar{\alpha}W_{T}+\dots+\beta^{k-2}\bar{\alpha}W_{T}
α¯1βWTβ¯WT.\displaystyle\leq\frac{\bar{\alpha}}{1-\beta}W_{T}\eqqcolon\bar{\beta}W_{T}. (32)

Substituting (32) is (30), we get that for any t{tk+1,,tk+1}t\in\{{t_{k}+1},\allowbreak\dots,t_{k+1}\}, we have

xt||αδttkβ¯WT+α¯WT[αβ¯+α¯]α0WT\|x_{t}||\leq\alpha\delta^{t-t_{k}}\bar{\beta}W_{T}+\bar{\alpha}W_{T}\leq\underbrace{[\alpha\bar{\beta}+\bar{\alpha}]}_{\eqqcolon\alpha_{0}}W_{T}

where in the last inequality, we have used the fact that δ(0,1)\delta\in(0,1). Thus, for any episode kk, we have

X¯k=maxtk<ttk+1xtα0WT.\bar{X}_{k}=\max_{t_{k}<t\leq t_{k+1}}\|x_{t}\|\leq\alpha_{0}W_{T}.

Hence,

XTσw+max{X¯1,,X¯KT}σw+α0WT.X_{T}\leq\sigma_{w}+\max\{\bar{X}_{1},\dots,\bar{X}_{K_{T}}\}\leq\sigma_{w}+\alpha_{0}W_{T}.

Therefore, for any q1q\geq 1, we have

𝔼[XTq]p=0q(qp)σwqpα0p𝔼[WTp]\mathds{E}[X_{T}^{q}]\leq\sum_{p=0}^{q}\binom{q}{p}\sigma_{w}^{q-p}\alpha_{0}^{p}\mathds{E}[W_{T}^{p}] (33)

From Lemma A, we have that

σwqp𝔼[WTp]=σwqp𝔼[max1tTwtp]σwq𝒪((logT)p/2).\sigma_{w}^{q-p}\mathds{E}[W_{T}^{p}]=\sigma_{w}^{q-p}\mathds{E}\Bigl{[}\max_{1\leq t\leq T}\|w_{t}\|^{p}\Bigr{]}\leq\sigma_{w}^{q}\mathcal{O}({(\log T)}^{p/2}).

Substituting this is (33), we obtain the result of the lemma.

Appendix C Proof of Lemma IV-A

Since log is an increasing function, logxlogmax(e,x)\log x\leq\log\max(e,x) for any x>0x>0. Therefore,

𝔼[\displaystyle\mathds{E}\Big{[} XTqσwqlog(XT2σw2)]𝔼[XTqσwqlogmax(e,XT2/σw2)]\displaystyle\frac{X_{T}^{q}}{\sigma_{w}^{q}}\log\Big{(}\frac{X_{T}^{2}}{\sigma_{w}^{2}}\Big{)}\Big{]}\leq\mathds{E}\Big{[}\frac{X_{T}^{q}}{\sigma_{w}^{q}}\log\max(e,X_{T}^{2}/\sigma_{w}^{2})\Big{]}
𝔼[XT2qσw2q]𝔼[(logmax(e,XT2/σw2))2]\displaystyle\leq\sqrt{\mathds{E}\Big{[}\frac{X_{T}^{2q}}{\sigma_{w}^{2q}}\Big{]}\;\mathds{E}\Big{[}\Big{(}\log\max(e,X_{T}^{2}/\sigma_{w}^{2})\Big{)}^{2}\Big{]}} (34)

where the last inequality follows from Cauchy-Schwartz inequality. Since (logx)2(\log x)^{2} is concave for xex\geq e, we can use Jensen’s inequality to write

𝔼[(logmax(e,XT2/σw2))2]\displaystyle\mathds{E}\Big{[}\Big{(}\log\max(e,X_{T}^{2}/\sigma_{w}^{2})\Big{)}^{2}\Big{]} (log(𝔼[max(e,XT2/σw2)]))2\displaystyle\leq\bigl{(}\log(\mathds{E}[\max(e,X_{T}^{2}/\sigma_{w}^{2})])\bigr{)}^{2}
(log(e+𝔼[XT2/σw2]))2\displaystyle\leq\bigl{(}\log(e+\mathds{E}[X_{T}^{2}/\sigma_{w}^{2}])\bigr{)}^{2}
(a)(log(e+𝒪(logT)))2\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\bigl{(}\log(e+\mathcal{O}(\log T))\bigr{)}^{2}
𝒪~(1)\displaystyle\leq\tilde{\mathcal{O}}(1) (35)

where (a)(a) uses Lemma IV-A. Substituting (35) in (34) and using Lemma IV-A for bounding 𝔼[XT2q/σw2q]\mathds{E}[X_{T}^{2q}/\sigma_{w}^{2q}], we get

𝔼[XTqσwqlog(XT2σw2)]\displaystyle\mathds{E}\Big{[}\frac{X_{T}^{q}}{\sigma_{w}^{q}}\log\Big{(}\frac{X_{T}^{2}}{\sigma_{w}^{2}}\Big{)}\Big{]} 𝔼[XT2qσw2q]𝒪~(1)𝒪~(1).\displaystyle\leq\sqrt{\mathds{E}\Big{[}\frac{X_{T}^{2q}}{\sigma_{w}^{2q}}\Big{]}\;\tilde{\mathcal{O}}(1)}\leq\tilde{\mathcal{O}}(1).

Appendix D Proof of Lemma IV-A

The high-level idea of the proof is same as that of [1, Lemma 3]. Define macro episodes with start times tnit_{n_{i}}, i>0i\in\mathds{N}_{>0}, where n1=1{n_{1}}=1 and for i1i\geq 1,

ni+1=min{k>ni|detΣtk<12detΣtk1}.{n_{i+1}}=\min\left\{k>{n_{i}}\,\middle|\,\det\Sigma_{t_{k}}<\tfrac{1}{2}\det\Sigma_{t_{k-1}}\right\}.

Thus, a new macro-episode starts whenever an episode ends due to the second stopping criterion. Let MM denote the number of macro-episodes until time TT and define nM+1=KT+1n_{M+1}=K_{T}+1. Let T¯i\bar{T}_{i} denote the length of the ii-th macro-episode. Within a macro-episode, all but the last episode must be triggered by the first stopping criterion. Thus, for k{ni,ni+1,,ni+12}k\in\{n_{i},n_{i}+1,\dots,n_{i+1}-2\},

Tk=max{Tk1+1,Tmin+1}=Tk1+1T_{k}=\max\{T_{k-1}+1,T_{\min}+1\}=T_{k-1}+1

where the last equality follows from (11). Hence, by following exactly the same argument as [1], we have

ni+1ni2T¯in_{i+1}-n_{i}\leq\sqrt{2\bar{T}_{i}}

and therefore following [1, Eq. (40)], we have

KT2MTK_{T}\leq\sqrt{2MT} (36)

which is same as [1, Eq. (41)].

Now, observe that

detΣT1\displaystyle\det\Sigma_{T}^{-1} (a)detΣtnM1(b)2detΣtnM11\displaystyle\stackrel{{\scriptstyle(a)}}{{\geq}}\det\Sigma_{t_{n_{M}}}^{-1}\stackrel{{\scriptstyle(b)}}{{\geq}}2\det\Sigma_{t_{n_{M-1}}}^{-1}
2M1detΣ11,\displaystyle\geq\cdots\geq 2^{M-1}\det\Sigma_{1}^{-1}, (37)

where (a)(a) follows because {detΣt1}t1\{\det\Sigma_{t}^{-1}\}_{t\geq 1} is a non-decreasing sequence (because Σ11Σ21\Sigma_{1}^{-1}\leq\Sigma_{2}^{-1}\ldots) and (b)(b) and subsequent inequalities follow from the definition of the macro episode and the second triggering condition.

Then following the same idea as the rest of the proof in [1], we get

M𝒪((n+m)log(TXT2/σw2)).M\leq\mathcal{O}((n+m)\log(TX_{T}^{2}/\sigma_{w}^{2})). (38)

Substituting (38) in (36), we obtain the result of the lemma.

Appendix E Proof of Lemma IV-B3

Observe that the summand is constant for each episode. Therefore,

𝔼[k=1KTt=tktk+11[Σtk0.5(θ1θ¯k)2]]\displaystyle\hskip-10.00002pt\mathds{E}\biggl{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\bigl{[}\|\Sigma_{t_{k}}^{-0.5}(\theta_{1}-\bar{\theta}_{k})^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\|^{2}\bigr{]}\biggr{]}
=𝔼[k=1KT[TkΣtk0.5(θ1θ¯k)2]]\displaystyle=\mathds{E}\biggl{[}\sum_{k=1}^{K_{T}}\bigl{[}T_{k}\|\Sigma_{t_{k}}^{-0.5}(\theta_{1}-\bar{\theta}_{k})^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\|^{2}\bigr{]}\biggr{]}
(a)𝔼[k=1KT[(Tk1+1)Σtk0.5(θ1θ¯k)2]]\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\mathds{E}\biggl{[}\sum_{k=1}^{K_{T}}\bigl{[}(T_{k-1}+1)\|\Sigma_{t_{k}}^{-0.5}(\theta_{1}-\bar{\theta}_{k})^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\|^{2}\bigr{]}\biggr{]}
=k=1𝔼[𝟙{tkT}(Tk1+1)Σtk0.5(θ1θ¯k)2]\displaystyle=\sum_{k=1}^{\infty}\mathds{E}\bigl{[}\mathds{1}_{\{t_{k}\leq T\}}(T_{k-1}+1)\|\Sigma_{t_{k}}^{-0.5}(\theta_{1}-\bar{\theta}_{k})^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\|^{2}\bigr{]}
=k=1𝔼[𝔼[𝟙{tkT}(Tk1+1)Σtk0.5(θ1θ¯k)2|htk]]\displaystyle=\sum_{k=1}^{\infty}\mathds{E}\Bigl{[}\mathds{E}\bigl{[}\mathds{1}_{\{t_{k}\leq T\}}(T_{k-1}+1)\|\Sigma_{t_{k}}^{-0.5}(\theta_{1}-\bar{\theta}_{k})^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\|^{2}\bigm{|}h_{t_{k}}\bigr{]}\Bigr{]}
=(b)k=1𝔼[𝟙{tkT}(Tk1+1)𝔼[Σtk0.5(θ1θ¯k)2|htk]]\displaystyle\stackrel{{\scriptstyle(b)}}{{=}}\sum_{k=1}^{\infty}\mathds{E}\Bigl{[}\mathds{1}_{\{t_{k}\leq T\}}(T_{k-1}+1)\mathds{E}\bigl{[}\|\Sigma_{t_{k}}^{-0.5}(\theta_{1}-\bar{\theta}_{k})^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\|^{2}\bigm{|}h_{t_{k}}\bigr{]}\Bigr{]}
(c)k=1𝔼[𝟙{tkT}(Tk1+1)2(n+m)n]\displaystyle\stackrel{{\scriptstyle(c)}}{{\leq}}\sum_{k=1}^{\infty}\mathds{E}\bigl{[}\mathds{1}_{\{t_{k}\leq T\}}(T_{k-1}+1)2(n+m)n\bigr{]}
2(n+m)n(T+𝔼[KT]),\displaystyle\leq 2(n+m)n(T+\mathds{E}[K_{T}]), (39)

where (a)(a) follows from (11), (b)(b) follows from the fact that 𝟙{tk<T}(Tk1+1)\mathds{1}_{\{t_{k}<T\}}(T_{k-1}+1) is σ(htk)\sigma(h_{t_{k}}) measurable, and (c)(c) hold because conditioned on htkh_{t_{k}} each column of Σtk0.5(θ1θ¯k)2\|\Sigma_{t_{k}}^{-0.5}(\theta_{1}-\bar{\theta}_{k})^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\|^{2} is the difference of two i.i.d. vectors 𝒩(0,I)\sim\mathcal{N}(0,I).

Eq. (39) proves the first part of the Lemma. The second part follows from the fact that KTTK_{T}\leq T.

Appendix F Proof of Lemma IV-B3

For any s<ts<t. Eq. (9) implies that Σs1Σt1\Sigma_{s}^{-1}\preceq\Sigma_{t}^{-1} and consequently Σt1\Sigma_{t}^{-1} is positive definite. Therefore, we have the following: {lemma} Let λmin\lambda_{\min} be the smallest eigenvalue of Σ11\Sigma_{1}^{-1}. Then, each eigenvalue of Σt1\Sigma_{t}^{-1} is no less than λmin\lambda_{\min}. Therefore, each eigenvalue of Σt\Sigma_{t} is no more than 1/λmin1/\lambda_{\min}. An immediate implication of Lemma F is the following: For any tt and ss,

ztΣszt1λminzt21λminMG2XT2,z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma_{s}z_{t}\leq\frac{1}{\lambda_{\min}}\|z_{t}\|^{2}\leq\frac{1}{\lambda_{\min}}M^{2}_{G}X_{T}^{2}, (40)

where MG=supθΩ1[I,G(θ)]M_{G}=\sup_{\theta\in\Omega_{1}}\|[I,G(\theta)^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}]^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\|.

For any s<ts<t, Σs1Σt1\Sigma_{s}^{-1}\preceq\Sigma_{t}^{-1} implies that ΣsΣt\Sigma_{s}\succeq\Sigma_{t}. Therefore, from [7, Lemma 11], we get that for any V0V\neq 0 (of appropriate dimensions),

VΣsVVΣtVdetΣsdetΣt=detΣt1detΣs1.\frac{\|V^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma_{s}V\|}{\|V^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma_{t}V\|}\leq\frac{\det\Sigma_{s}}{\det\Sigma_{t}}=\frac{\det\Sigma_{t}^{-1}}{\det\Sigma_{s}^{-1}}. (41)

Eq. (41) implies that for any t{tk,,tk+11}t\in\{t_{k},\dots,t_{k+1}-1\}, we have

Σtk0.5zt2=ztΣtkztdetΣt1detΣtk1ztΣtzt\|\Sigma_{t_{k}}^{0.5}z_{t}\|^{2}=z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma_{t_{k}}z_{t}\leq\frac{\det\Sigma_{t}^{-1}}{\det\Sigma_{t_{k}}^{-1}}z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma_{t}z_{t} (42)

For the ease of notation, let τk=tk+Tmin\tau_{k}=t_{k}+T_{\min}. Then we have the following bound on detΣt1/detΣtk1\det\Sigma_{t}^{-1}/\det\Sigma_{t_{k}}^{-1}. {lemma}

The following inequalities hold:

  1. 1.

    For t{tk,,τk}t\in\{t_{k},\dots,\tau_{k}\}, we have

    detΣt1detΣtk1(1+1λminσw2MG2XT2)Tmin.\frac{\det\Sigma_{t}^{-1}}{\det\Sigma_{t_{k}}^{-1}}\leq\biggl{(}1+\frac{1}{\lambda_{\min}\sigma_{w}^{2}}M^{2}_{G}X_{T}^{2}\biggr{)}^{T_{\min}}.
  2. 2.

    For t{τk+1,,tk+11}t\in\{\tau_{k}+1,\dots,t_{k+1}-1\}, we have

    detΣt1detΣtk12.\frac{\det\Sigma_{t}^{-1}}{\det\Sigma_{t_{k}}^{-1}}\leq 2.

Consequently, for all t{tk,,tk+11}t\in\{t_{k},\dots,t_{k+1}-1\}, we have

detΣt1detΣtk1(2+MG2XT2λminσw2)Tmin1.\frac{\det\Sigma_{t}^{-1}}{\det\Sigma_{t_{k}}^{-1}}\leq\biggl{(}2+\frac{M^{2}_{G}X_{T}^{2}}{\lambda_{\min}\sigma_{w}^{2}}\biggr{)}^{T_{\min}\vee 1}. (43)
Proof.

The second relationship follows from the second stopping criterion. We now prove the first relationship. Eq. (9) implies that

Σt+11=Σt1(I+1σw2Σtztzt).\Sigma_{t+1}^{-1}=\Sigma_{t}^{-1}\biggl{(}I+\frac{1}{\sigma_{w}^{2}}\Sigma_{t}z_{t}z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\biggr{)}.

Therefore,

detΣt+11detΣt1\displaystyle\frac{\det\Sigma_{t+1}^{-1}}{\det\Sigma_{t}^{-1}} =det(I+1σw2Σtztzt)=1+1σw2ztΣtzt\displaystyle=\det\biggl{(}I+\frac{1}{\sigma_{w}^{2}}\Sigma_{t}z_{t}z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\biggr{)}=1+\frac{1}{\sigma_{w}^{2}}z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma_{t}z_{t}
1+1λminσw2MG2XT2,\displaystyle\leq 1+\frac{1}{\lambda_{\min}\sigma_{w}^{2}}M^{2}_{G}X_{T}^{2}, (44)

where the last inequality follows from (40). Thus, for any t{tk,,τk}t\in\{t_{k},\dots,\tau_{k}\}, we have

detΣt1detΣtk1\displaystyle\frac{\det\Sigma_{t}^{-1}}{\det\Sigma_{t_{k}}^{-1}} (1+1λminσw2MG2XT2)ttk\displaystyle\leq\biggl{(}1+\frac{1}{\lambda_{\min}\sigma_{w}^{2}}M^{2}_{G}X_{T}^{2}\biggr{)}^{t-t_{k}}
(1+1λminσw2MG2XT2)Tmin\displaystyle\leq\biggl{(}1+\frac{1}{\lambda_{\min}\sigma_{w}^{2}}M^{2}_{G}X_{T}^{2}\biggr{)}^{T_{\min}} (45)

where the first inequality follows by repeatedly applying (44) as a telescopic product.

Let M¯=MG2XT2/λminσw2\bar{M}=M_{G}^{2}X_{T}^{2}/\lambda_{\min}\sigma_{w}^{2}. Then, (43) follows by observing that (1+M¯)Tmin(2+M¯)Tmin1(1+\bar{M})^{T_{\min}}\leq(2+\bar{M})^{T_{\min}\vee 1} and 2<(2+M¯)Tmin12<(2+\bar{M})^{T_{\min}\vee 1}.

Using Lemma F and (42), we get

t=tktk+11Σtk0.5zt2\displaystyle\sum_{t={t_{k}}}^{t_{k+1}-1}\|\Sigma_{t_{k}}^{0.5}z_{t}\|^{2} t=tktk+11detΣt1detΣtk1ztΣtzt\displaystyle\leq\sum_{t={t_{k}}}^{t_{k+1}-1}\frac{\det\Sigma_{t}^{-1}}{\det\Sigma_{t_{k}}^{-1}}z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma_{t}z_{t}
(2+MG2XT2λminσw2)Tmin1t=tktk+11ztΣtzt\displaystyle\leq\biggl{(}2+\frac{M^{2}_{G}X_{T}^{2}}{\lambda_{\min}\sigma_{w}^{2}}\biggr{)}^{T_{\min}\vee 1}\sum_{t={t_{k}}}^{t_{k+1}-1}z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma_{t}z_{t} (46)

where the first inequality follows from (42) and the second inequality follows from Lemma F. Therefore,

k=1KTt=tktk+11XT2Σtk0.5zt2\displaystyle\hskip-20.00003pt\sum_{k=1}^{K_{T}}\sum_{t={t_{k}}}^{t_{k+1}-1}X_{T}^{2}\|\Sigma_{t_{k}}^{0.5}z_{t}\|^{2}
(2+MG2XT2λminσw2)Tmin1XT2t=1TztΣtzt\displaystyle\leq\biggl{(}2+\frac{M^{2}_{G}X_{T}^{2}}{\lambda_{\min}\sigma_{w}^{2}}\biggr{)}^{T_{\min}\vee 1}X_{T}^{2}\sum_{t=1}^{T}z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma_{t}z_{t} (47)

From (40) for s=ts=t, we get that

ztΣtztmax(σw2,MG2XT2λmin)min(1,ztΣtztσw2).z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma_{t}z_{t}\leq\max\biggl{(}\sigma_{w}^{2},\frac{M_{G}^{2}X_{T}^{2}}{\lambda_{\min}}\biggr{)}\min\biggl{(}1,\frac{z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma_{t}z_{t}}{\sigma_{w}^{2}}\biggr{)}. (48)

Hence

t=1TztΣtzt(σw2+MG2XT2λmin)t=1Tmin(1,ztΣtztσw2)\sum_{t=1}^{T}z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma_{t}z_{t}\leq\biggl{(}\sigma_{w}^{2}+\frac{M_{G}^{2}X_{T}^{2}}{\lambda_{\min}}\biggr{)}\sum_{t=1}^{T}\min\biggl{(}1,\frac{z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma_{t}z_{t}}{\sigma_{w}^{2}}\biggr{)} (49)

Using (9) and the intermediate step of the proof of [22, Lemma 6], we have

t=1Tmin(1,ztΣtztσw2)=t=1Tmin(1,Σt0.5ztztΣt0.5σw2)\displaystyle\hskip-10.00002pt\sum_{t=1}^{T}\min\biggl{(}1,\frac{z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma_{t}z_{t}}{\sigma_{w}^{2}}\biggr{)}=\sum_{t=1}^{T}\min\biggl{(}1,\biggl{\lVert}\frac{\Sigma^{0.5}_{t}z_{t}z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma^{0.5}_{t}}{\sigma_{w}^{2}}\biggr{\rVert}\biggr{)}
2(n+m)log(Tr(ΣT+11)(n+m))logdetΣ11.\displaystyle\leq 2(n+m)\log\Bigg{(}\frac{\operatorname{Tr}(\Sigma_{T+1}^{-1})}{(n+m)}\Biggr{)}-\log\det\Sigma^{-1}_{1}. (50)

Now, from (9), we get that

Tr(ΣT+11)\displaystyle\operatorname{Tr}(\Sigma_{T+1}^{-1}) =Tr(Σ11)+t=1T1σw2Tr(ztzt)\displaystyle=\operatorname{Tr}(\Sigma_{1}^{-1})+\sum_{t=1}^{T}\frac{1}{\sigma_{w}^{2}}\operatorname{Tr}(z_{t}z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}})
Tr(Σ11)+Tσw2MG2XT2,\displaystyle\leq\operatorname{Tr}(\Sigma_{1}^{-1})+\frac{T}{\sigma_{w}^{2}}M_{G}^{2}X_{T}^{2}, (51)

where the last inequality uses the fact that Tr(ztzt)=Tr(ztzt)=zt2MG2XT2\operatorname{Tr}(z_{t}z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}})=\operatorname{Tr}(z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}z_{t})=\|z_{t}\|^{2}\leq M_{G}^{2}X_{T}^{2}. Combining (49) with (50) and (51), we get

t=1TztΣtzt𝒪((n+m)(σw2+XT2)log(TXT2/σw2)).\sum_{t=1}^{T}z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma_{t}z_{t}\leq\mathcal{O}\bigl{(}(n+m)(\sigma_{w}^{2}+X_{T}^{2})\log(TX_{T}^{2}/\sigma_{w}^{2})\bigr{)}. (52)

Therefore, we can bound the expectation of the right hand side of (47) as

𝔼[(2+MG2XT2λminσw2)Tmin1XT2t=1TztΣtzt]\displaystyle\hskip-20.00003pt\mathds{E}\biggl{[}\biggl{(}2+\frac{M^{2}_{G}X_{T}^{2}}{\lambda_{\min}\sigma_{w}^{2}}\biggr{)}^{T_{\min}\vee 1}X_{T}^{2}\sum_{t=1}^{T}z_{t}^{\mathchoice{\raisebox{0.0pt}{$\displaystyle\intercal$}}{\raisebox{0.0pt}{$\textstyle\intercal$}}{\raisebox{0.0pt}{$\scriptstyle\intercal$}}{\raisebox{0.0pt}{$\scriptscriptstyle\intercal$}}}\Sigma_{t}z_{t}\biggr{]}
𝒪(σw4(n+m)𝔼[F(XT)])\displaystyle\leq\mathcal{O}\bigl{(}\sigma_{w}^{4}(n+m)\mathds{E}[F(X_{T})]\bigr{)}
𝒪~(σw4(n+m)),\displaystyle\leq\tilde{\mathcal{O}}(\sigma_{w}^{4}(n+m)), (53)

where the first inequality follows from (52) with F(XT)=(2+MG2XT2λminσw2)Tmin1(XT2σw2+XT4σw4)log(TXT2/σw2)F(X_{T})=\big{(}2+\frac{M^{2}_{G}X_{T}^{2}}{\lambda_{\min}\sigma_{w}^{2}}\big{)}^{T_{\min}\vee 1}(\frac{X_{T}^{2}}{\sigma_{w}^{2}}+\frac{X_{T}^{4}}{\sigma_{w}^{4}})\log(TX_{T}^{2}/\sigma_{w}^{2}), and the last inequality follows from Lemma IV-A by noting that F(XT)F(X_{T}) is a polynomial of XT/σwX_{T}/\sigma_{w} multiplied by a poly-log term.

The result follows from (47) and (53).

References

  • [1] Y. Ouyang, M. Gagrani, and R. Jain, “Posterior sampling-based reinforcement learning for control of unknown linear systems,” IEEE Trans. Autom. Control, vol. 65, no. 8, pp. 3600–3607, 2019.
  • [2] K. J. Astrom and B. Wittenmark, Adaptive Control. Addison-Wesley Longman Publishing Co., Inc., 1994.
  • [3] P. E. Caines, Linear Stochastic Systems. John Wiley, 1988. Republished by Society of Industrial Applied Mathematics, 2018.
  • [4] S. J. Bradtke, “Reinforcement learning applied to linear quadratic regulation,” in Neural Information Processing Systems, pp. 295–302, 1993.
  • [5] S. J. Bradtke, B. E. Ydstie, and A. G. Barto, “Adaptive linear quadratic control using policy iteration,” in Proceedings of American Control Conference, vol. 3, pp. 3475–3479, IEEE, 1994.
  • [6] M. C. Campi and P. Kumar, “Adaptive linear quadratic Gaussian control: the cost-biased approach revisited,” SIAM Journal on Control and Optimization, vol. 36, no. 6, pp. 1890–1907, 1998.
  • [7] Y. Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” in Annual Conference on Learning Theory, pp. 1–26, 2011.
  • [8] A. Cohen, T. Koren, and Y. Mansour, “Learning linear-quadratic regulators efficiently with only T\sqrt{T} regret,” in International Conference on Machine Learning, pp. 1300–1309, PMLR, 2019.
  • [9] M. Abeille and A. Lazaric, “Efficient optimistic exploration in linear-quadratic regulators via lagrangian relaxation,” in International Conference on Machine Learning, pp. 23–31, PMLR, 2020.
  • [10] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “Regret bounds for robust adaptive control of the linear quadratic regulator,” in Neural Information Processing Systems, pp. 4192–4201, 2018.
  • [11] H. Mania, S. Tu, and B. Recht, “Certainty equivalent control of LQR is efficient.” arXiv:1902.07826, 2019.
  • [12] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “Input perturbations for adaptive control and learning,” Automatica, vol. 117, p. 108950, 2020.
  • [13] M. Simchowitz and D. Foster, “Naive exploration is optimal for online lqr,” in International Conference on Machine Learning, pp. 8937–8948, PMLR, 2020.
  • [14] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “On adaptive linear–quadratic regulators,” Automatica, vol. 117, p. 108982, July 2020.
  • [15] Y. Ouyang, M. Gagrani, and R. Jain, “Control of unknown linear systems with thompson sampling,” in Allerton Conference on Communication, Control, and Computing, pp. 1198–1205, 2017.
  • [16] M. Abeille and A. Lazaric, “Improved regret bounds for thompson sampling in linear quadratic control problems,” in International Conference on Machine Learning, pp. 1–9, 2018.
  • [17] M. Gagrani, S. Sudhakara, A. Mahajan, A. Nayyar, and Y. Ouyang, “Thompson sampling for linear quadratic mean-field teams,” in 2021 60th IEEE Conference on Decision and Control (CDC), pp. 720–727, IEEE, 2021.
  • [18] K. J. Astrom, Introduction to stochastic control theory. Academic Press New York, 1970.
  • [19] J. Sternby, “On consistency for the method of least squares using martingale theory,” IEEE Trans. Autom. Control, vol. 22, no. 3, pp. 346–352, 1977.
  • [20] P. J. Downey and R. S. Maier, “Stochastic orderings and the growth of expected extremes,” tech. rep., Tech. Report 90-9, Department of Computer Science, University of Arizona, 1990.
  • [21] P. J. Downey and R. S. Maier, “Orderings arising from expected extremes, with an application,” in Institute of Mathematical Statistics Lecture Notes - Monograph Series, pp. 66–75, Hayward, CA: Institute of Mathematical Statistics, 1992.
  • [22] Y. Abbasi-Yadkori and C. Szepesvari, “Bayesian optimal control of smoothly parameterized systems: The lazy posterior sampling algorithm.” arXiv preprint arXiv:1406.3926, 2014.