Joint Stabilization and Regret Minimization through Switching in Over-Actuated Systems (extended version)
Abstract
Adaptively controlling and minimizing regret in unknown dynamical systems while controlling the growth of the system state is crucial in real-world applications. In this work, we study the problem of stabilization and regret minimization of linear over-actuated dynamical systems. We propose an optimism-based algorithm that leverages the possibility of switching between actuating modes in order to alleviate state explosion during the initial time steps. We theoretically study the rate at which our algorithm learns a stabilizing controller and prove that it achieves a regret upper bound of .
keywords:
Adaptive Control, Regret Bound, Model-Based Reinforcement Learning, Over-Actuated Systems, Online Learning
1 Introduction
The past few years have witnessed a growing interest in online learning-based Linear Quadratic (LQ) control, in which an unknown LTI system is controlled while guaranteeing a suitable scaling of the regret (defined as the average difference between the closed-loop quadratic cost and the best achievable one, with the benefit of knowing the plant's parameters) over a desired horizon .
Table 1 summarizes the regret scaling achieved by several recent works in the literature (Abbasi and Szepesvári (2011); Mania et al. (2019); Cohen et al. (2019); Lale et al. (2020a)). As can be seen, the bounds scale like , but also include an exponential term in (the dimension of the plant's state plus inputs) when an initial stabilizing controller is unavailable. Recently, Chen and Hazan (2021) have shown that this dependency is unavoidable in this setting, at least for an initial exploration time period, by providing a matching lower bound. On closer inspection, this undesirable dependency stems from an exponential growth of the system's state during the initial period of learning, which eventually contributes a term like that appearing in the last row of Table 1. Apart from negatively impacting regret, this transient state growth can also be damaging if, e.g., the linear plant to be controlled is in fact the linearization of a nonlinear process around some particular equilibrium, since the state can be driven outside the neighborhood of this equilibrium where the linearization is adequate.
A natural idea to try and partially alleviate this effect in over-actuated systems is to reduce the ambient dimension during the initial period by employing only a subset of all available actuators, before potentially switching to a different mode to simultaneously learn and control the plant. The goal of this paper is to show that this approach yields the desired result (smaller bounds on the plant's state and regret than in the presence of all actuators) and to provide a rigorous proof of the achieved bound (where is the number of actuators used during initial exploration). We note that, while the idea is conceptually simple, obtaining these rigorous guarantees is not completely straightforward and requires revisiting the tools of Lale et al. (2020a) to overcome a potential challenge: ensuring that, at the end of the period when a strict subset of actuators is used, all entries of the B-matrix (including those corresponding to unused actuators) are sufficiently well learned for the closed loop to become stable and the regret to remain appropriately bounded. This can only be achieved by adding additional exploratory noise relative to the approach of Lale et al. (2020a), and results in an additional linear term in the regret bound.
Algorithm | Regret | Initial stabilizing controller required
---|---|---
Abbasi and Szepesvári (2011) |  | No
Mania et al. (2019) |  | Yes
Cohen et al. (2019) |  | Yes
Lale et al. (2020a) |  | No
While this idea of leveraging over-actuation can in principle be applied to any model-based online learning algorithm exhibiting the exponential scaling mentioned above, we focus on the class of methods based on "Optimism in the Face of Uncertainty" (OFU).
OFU-type algorithms, which couple the estimation and control design procedures, have shown their ability to outperform the naive certainty-equivalence algorithm. Campi and Kumar (1998) propose an OFU-based approach to the optimal control problem in the LQ setting with guaranteed asymptotic optimality. However, this algorithm only guarantees convergence of the average cost to that of the optimal controller in the limit and does not provide any bound on the performance loss in finite time. Abbasi and Szepesvári (2011) propose a learning-based algorithm for adaptive LQ control in finite time with a worst-case regret bound of , with exponential dependence on the dimension. Using -regularized least-squares estimation, they construct a high-probability confidence set around the unknown parameters of the system and design an algorithm that plays optimistically with respect to this set. Along this line, many works attempt to remove the exponential dependence under further assumptions, e.g., highly sparse dynamics (see Ibrahimi et al. (2012)) or access to a stabilizing controller (see Cohen et al. (2019)). Furthermore, Faradonbeh et al. (2020) propose an OFU-based learning algorithm with mild assumptions and regret. This class of algorithms was extended by Lale et al. (2020b, 2021) to the LQG setting, where only partial and noisy observations of the system state are available. In addition, Lale et al. (2020a) propose an algorithm with more exploration for both controllable and stabilizable systems.
The remainder of the paper is organized as follows: Section 2 reviews the preliminaries and assumptions and presents the problem statement and formulation. Section 3 gives an overview of the proposed initial exploration (IExp) and stabilizing OFU (SOFUA) algorithms and discusses in detail how to choose the best actuating mode for initial exploration. Section 3 also presents the performance analysis (state-norm upper-bound and regret bound), leaving the details of the analysis to Section 6 and the technical theorems and proofs to Section 7. Numerical experiments are given in Section 4. Finally, Section 5 summarizes the paper's key contributions.
2 Assumptions and Problem Formulation
Consider the following linear time invariant dynamics and the associated cost functional given by:
(1a) | ||||
(1b) |
where the plant and input matrices and are initially unknown and have to be learned, and is controllable. and are known positive definite matrices. denotes the process noise, which satisfies the following assumption.
Assumption 1
There exists a filtration such that
for some ;
are component-wise sub-Gaussian, i.e., there exists such that for any and
The problem is to design a sequence of control inputs such that the regret, defined by
(2) |
achieves a desired specification which scales sublinearly in . The term in (2) denotes the optimal average expected cost. In the LQR setting with a controllable pair we have , where is the unique solution of the discrete algebraic Riccati equation (DARE), and the average-expected-cost-minimizing policy has feedback gain
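For concreteness, the following is a minimal numerical sketch (with hypothetical matrices, assuming SciPy is available) of how the DARE solution, the corresponding optimal feedback gain, and the optimal average cost can be computed once the parameters are known; all numerical values are illustrative and not taken from the paper.

```python
# Minimal sketch: optimal LQR quantities for known parameters (hypothetical values).
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # assumed drift matrix (illustrative only)
B = np.array([[0.0], [0.1]])             # assumed input matrix (illustrative only)
Q, R = np.eye(2), np.eye(1)              # known positive definite cost matrices
W = 0.1 * np.eye(2)                      # assumed process-noise covariance

P = solve_discrete_are(A, B, Q, R)                    # unique stabilizing DARE solution
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)     # optimal gain, u_t = -K x_t
J_star = np.trace(P @ W)                              # optimal average expected cost
```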
While the regret's exponential dependency on the system dimension appears in the long run in Abbasi and Szepesvári (2011), the recent results of Mania et al. (2019) on the existence of a stabilizing neighborhood make it possible to design an algorithm that only exhibits this dependency during an initial exploration phase (see Lale et al. (2020a)).
After this period, the controller designed for any estimated value of the parameters is guaranteed to be stabilizing, and the exponentially dependent term thus only appears as a constant in the overall regret bound. As explained in the introduction, this suggests using only a subset of actuators during initial exploration to further reduce the guaranteed upper-bound on the state.
In the remainder of the paper, we pick the best actuating mode (i.e. subset of actuators) so as to minimize the state norm upper-bound achieved during initial exploration and characterize the needed duration of this phase for all system parameter estimates to reside in the stabilizing neighborhood. This is necessary to guarantee both closed loop stability and acceptable regret, and makes it possible to switch to the full actuation mode.
Let be the set of all columns () of . An element of its power set is a subset of columns, corresponding to a submatrix of and a mode . For simplicity, we assume that , i.e., that the first mode contains all actuators. Given this definition, the dynamics of the different actuating modes with extra exploratory noise are written as follows
(3) |
where is controllable.
The cost associated with this mode is
(4) |
where is a block of which penalizes the control inputs of the actuators of mode .
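As an illustration of how an actuating mode is formed, the sketch below (our notation, not the paper's) extracts the sub-matrix of B and the corresponding block of R for a given column subset, and checks controllability of the resulting pair via the Kalman rank condition.

```python
# Illustrative construction of an actuating mode from a subset of B's columns (our notation).
import numpy as np

def make_mode(A, B, R, cols):
    """Return (B^sigma, R^sigma, controllable?) for the actuator subset `cols`."""
    B_sig = B[:, cols]                          # sub-matrix of B for mode sigma
    R_sig = R[np.ix_(cols, cols)]               # matching principal block of R
    n = A.shape[0]
    ctrb = np.hstack([np.linalg.matrix_power(A, k) @ B_sig for k in range(n)])
    return B_sig, R_sig, np.linalg.matrix_rank(ctrb) == n   # Kalman rank test

# Example: mode consisting of actuators {0, 2} of a 3-input system.
# B_sig, R_sig, ok = make_mode(A, B, R, [0, 2])
```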
We make the following assumption on the modes, which assists us in designing the proposed strategy.
Assumption 2
(Side Information)
1. There exist and such that, for all modes where is controllable, .
2. There are known positive constants , , such that ,
(5) and
(6) for every mode .
By slightly abusing notation, we drop the superscript label for the actuating mode 1 (e.g. , , and ). It is obvious that .
Note that item (1) in Assumption 2 is typical in the literature on OFU-based algorithms (see, e.g., Abbasi and Szepesvári (2011); Lale et al. (2020a)), while (2) in fact always holds in the sense that and are always bounded (see, e.g., Abbasi and Szepesvári (2011); Lale et al. (2020a)). The point of (2), then, is that upper-bounds on their suprema are available, which can in turn be used to bound the regret explicitly. The knowledge of these bounds does not affect Algorithms 1 and 2, but their values enter Algorithm 3 for the determination of the best actuating mode and the corresponding exploration duration. In that sense, "best actuating mode" should be understood as "best given the available information".
Boundedness of the 's implies boundedness of with a finite constant (see Anderson and Moore (1971)), i.e., . We define . Furthermore, there exists such that .
Recalling that the set of actuators of mode is , we denote its complement by (i.e. ). Furthermore, we denote the complement of control matrix by .
If some modes fail to satisfy Assumption 2, they can simply be removed from the set without affecting the algorithm or the derived guarantees.
3 Overview of Proposed Strategy
In this section, we propose an algorithm in the spirit of that first proposed by Lale et al. (2020a), which leverages actuator redundancy in the "more exploration" step to avoid blow-up of the state norm while minimizing the regret bound. We break the strategy down into two phases: initial exploration, presented in the IExp algorithm, and optimism (Opt), given by the SOFUA algorithm.
The IExp algorithm, which leverages exploratory noise, is deployed in the actuating mode for duration to reach a stabilizing neighborhood of the full-actuation mode and alleviate state explosion while minimizing regret.
Afterwards, Algorithm 2, which leverages all the actuators, comes into play. This algorithm takes as input the central confidence set produced by Algorithm 1. The best actuating mode, which guarantees the minimum possible state-norm upper-bound, and the initial exploration duration are determined by running Algorithm 3 of Subsection 3.3.
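The overall control loop can be summarized by the following structural sketch, in which the mode selection (Algorithm 3), the exploration step (Algorithm 1), and the optimistic step (Algorithm 2) are abstracted as callables supplied by the user; it mirrors the structure just described and is not the authors' pseudocode.

```python
# Structural sketch of the two-phase strategy: IExp in the selected mode, then SOFUA.
def run_strategy(T, x0, select_mode, iexp_step, sofua_step):
    """select_mode() -> (sigma_star, T0); iexp_step(t, x, sigma) and sofua_step(t, x)
    each return (next state, stage cost)."""
    sigma_star, T0 = select_mode()          # Algorithm 3: best mode and exploration duration
    x, costs = x0, []
    for t in range(T):
        if t < T0:
            x, c = iexp_step(t, x, sigma_star)   # Algorithm 1: explore in mode sigma*
        else:
            x, c = sofua_step(t, x)              # Algorithm 2: optimistic full actuation
        costs.append(c)
    return costs
```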
3.1 Main steps of Algorithm 1
3.1.1 Confidence Set Construction
In the IExp phase, we add extra exploratory Gaussian noise to the input of all actuators, even those not in the actuator set of mode . Assuming that the system actuates in an arbitrary mode , the dynamics of the system used for confidence set construction (i.e., system identification) is written as
(7) |
in which and , and where, if and , the vector is constructed by only keeping the entries of corresponding to the index set of elements in . Note that (7) is equivalent to (3) but separates used and unused actuators.
Applying a self-normalized process bound, the least-squares estimation error can be obtained as:
(8) |
with regularization parameter . This yields the -regularized least-squares estimate:
(9) |
where and are matrices whose rows are and , respectively. Defining the covariance matrix as follows:
it can be shown that, with probability at least , where , the true parameters of the system belong to the confidence set defined by (see Theorem 17):
(10) |
After finding high-probability confidence sets for the unknown parameters, the core step is implementing the Optimism in the Face of Uncertainty (OFU) principle. At any time , we choose a parameter such that:
(11) |
Then, using the chosen parameters as if they were the true parameters, the linear feedback gain is designed. We synthesize the control based on (7), where . The extra exploratory noise with is the random "more exploration" term.
As can be seen in the regret bound analysis, recurrent switches in policy may worsen performance, so a criterion is needed to prevent frequent policy switches. As such, at each time step the algorithm checks the condition to determine whether an update to the control policy is needed, where is the last time of policy update.
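A minimal sketch of the two ingredients just described, assuming the regressor stacks the state and the input and the parameter block collects the transition matrices: the regularized least-squares estimate with its covariance matrix, and a policy-switch test. The determinant-doubling rule shown below is the standard one from Abbasi and Szepesvári (2011); the condition used by the algorithm is of the same kind.

```python
# Sketch of the regularized least-squares step and a policy-switch test (assumed notation).
import numpy as np

def ridge_estimate(Z, X_next, lam):
    """Theta_hat minimizing ||X_next - Z @ Theta||_F^2 + lam * ||Theta||_F^2."""
    V = lam * np.eye(Z.shape[1]) + Z.T @ Z       # regularized covariance matrix
    Theta_hat = np.linalg.solve(V, Z.T @ X_next)
    return Theta_hat, V

def should_switch(V_now, V_last_switch):
    # Update the optimistic policy only when the information matrix has grown enough;
    # infrequent switches are what the regret analysis requires.
    return np.linalg.det(V_now) > 2.0 * np.linalg.det(V_last_switch)
```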
3.1.2 Central Ellipsoid Construction
Note that (10) holds regardless of the control signal . The formulation above also holds for any actuation mode, being mindful that the dimension of the covariance matrix changes. Even while actuating in the IExp phase, by applying an augmentation technique, we can build a confidence set (which we call the central ellipsoid) around the parameters of the full-actuation mode, thanks to the extra exploratory noise. For , this can simply be carried out by rewriting (7) as follows:
(12) |
where and is constructed by augmentation as follows
(13) |
By this augmentation, we can construct the central ellipsoid
(14) |
which is an input to Algorithm 2 and is used to compute the IExp duration.
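The sketch below illustrates the augmentation step, under the assumption that every actuator (used or not) receives the exploratory noise while only the actuators of mode sigma additionally receive the designed input; the index handling and names are ours.

```python
# Illustrative assembly of the full-actuation regressor from mode-sigma data (assumed ordering).
import numpy as np

def augmented_regressor(x_t, u_sigma, nu_t, used_idx):
    """Build z_t = [x_t; u_t] for the central (full-actuation) ellipsoid."""
    u_full = nu_t.copy()                 # every actuator receives the exploratory noise nu_t
    u_full[used_idx] += u_sigma          # mode-sigma actuators also receive the designed input
    return np.concatenate([x_t, u_full])
```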
3.2 Main steps of Algorithm 2
The main steps of Algorithm 2 are quite similar to those of Algorithm 1, with a minor difference in the confidence set construction. Algorithm 2 receives , , and from Algorithm 1, using which, for , we have
and the confidence set is easily constructed.
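A sketch of how Algorithm 2 can continue the estimation from the quantities handed over by Algorithm 1: the covariance matrix and the stacked cross-products are simply extended with each new full-actuation sample (a rank-one update), after which the estimate and the confidence set are rebuilt. The names are ours, not the paper's.

```python
# Rank-one update of the covariance and cross-product matrices after each new sample.
import numpy as np

def extend_estimate(V, S, z_new, x_next):
    """V <- V + z z^T, S <- S + z x^T; returns the updated ridge estimate as well."""
    V = V + np.outer(z_new, z_new)
    S = S + np.outer(z_new, x_next)
    Theta_hat = np.linalg.solve(V, S)
    return V, S, Theta_hat
```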
The following theorem summarizes the boundedness of the state norm when Algorithms 1 and 2 are deployed.
Theorem 3
1. The IExp algorithm keeps the state of the underlying system, actuating in any mode , bounded with probability at least during initial exploration, i.e.,
(15) for all modes .
2. For , with probability at least , we have , where
(16)
From parts (1) and (2) of Theorem 3 we define the following good events:
(17) |
and
(18) |
in which
(19) |
Both events are used in the regret bound analysis, and the former is specifically used to obtain the best actuating mode for initial exploration.
3.3 Determining the Optimal Mode for IExp
We still need to specify the best actuating mode for initial exploration along with its corresponding upper-bound . Theorem 5 specifies . First, we need the following lemma.
Lemma 4
At the end of initial exploration, for any mode the following inequality holds
(20) |
where is given as follows
(21) |
with,
in which stands for the initial exploration duration when actuating in mode . Furthermore, if we define
(22) |
then for , holds with probability at least .
The proof is provided in Appendix 7.
Theorem 5
Suppose Assumptions 1 and 2 hold. Then, for a system actuating in mode during the initial exploration phase, the following results hold:
1. where is the indicator function of the set and
(23) with
where
in which holds with probability at least .
2. The best actuating mode for initial exploration is
(24)
3. The upper-bound on the state norm of the system actuating in mode during the initial exploration phase can be written as follows:
(25) for some finite system parameter-dependent constant .
Remark 6
While the optimization problem (24) cannot be solved analytically, because itself depends on , it can be solved numerically using Algorithm 3.
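A sketch of the enumeration performed by Algorithm 3, with the state-norm upper-bound of Theorem 5 and the feasibility test abstracted as user-supplied callables (they depend on the side-information constants, which are elided here); modes violating Assumption 2 are simply discarded, as noted in Section 2. This is an illustration of the search structure only, not the authors' pseudocode.

```python
# Brute-force mode selection: keep the candidate with the smallest state-norm upper-bound.
from itertools import combinations

def best_mode(m, satisfies_assumption2, state_bound):
    """satisfies_assumption2(mode) -> bool; state_bound(mode) -> float (bound of Theorem 5)."""
    best, best_val = None, float("inf")
    for k in range(1, m + 1):
        for mode in combinations(range(m), k):
            if not satisfies_assumption2(mode):
                continue                      # modes violating Assumption 2 are discarded
            val = state_bound(mode)
            if val < best_val:
                best, best_val = mode, val
    return best, best_val
```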
3.4 Regret Bound
Recalling (2), the regret for the proposed strategy (IExp + SOFUA) can be defined as follows:
(26) |
where for and for .
An upper-bound for is given by Theorem 7 (see Section 6.4), which is the next core result of our analysis.
4 Numerical Experiment
In this section, we demonstrate on a practical example how the use of our algorithms successfully alleviates state explosion during the initial exploration phase. We consider a control system with drift and control matrices set as follows:
We choose the cost matrices as follows:
Algorithm 3 outputs the exploration duration and the best actuating mode for initial exploration, with corresponding control matrix and
It has been shown graphically in Abbasi-Yadkori (2013) that the optimization problem (11) is generally non-convex for . Because of this, we solve optimization problem (11) using a projected gradient descent method in Algorithms 1 and 2, with basic step
(27) |
where for and for . is the gradient of with respect to , is the confidence set, is the Euclidean projection onto , and is the step size. The computation of the gradient and the formulation of the projection are made explicit in Abbasi-Yadkori (2013); following the same reference, we choose the learning rate as follows:
We apply the gradient method for 100 iterations to solve each OFU optimization problem and apply the projection technique until the projected point lies inside the confidence ellipsoid. The inputs to the OFU algorithm are , , , , and we repeat the simulation times.
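A sketch of the projected-gradient iteration (27), assuming the decision variable is the parameter matrix Theta and the confidence set is an ellipsoid of the form {Theta : ||V^{1/2}(Theta - Theta_hat)||_F <= beta}. For simplicity, the projection below is the closed-form one in the V-weighted norm; the Euclidean projection used in Abbasi-Yadkori (2013) is analogous but requires a one-dimensional root-finding step. All names and shapes are assumptions made for illustration.

```python
# Projected gradient descent over a confidence ellipsoid (assumed shapes and names).
import numpy as np

def project(Theta, Theta_hat, V_sqrt, beta):
    """Projection onto {||V^{1/2}(Theta - Theta_hat)||_F <= beta} in the V-weighted norm."""
    D = V_sqrt @ (Theta - Theta_hat)
    r = np.linalg.norm(D, "fro")
    if r <= beta:
        return Theta
    return Theta_hat + np.linalg.solve(V_sqrt, (beta / r) * D)

def ofu_pgd(Theta0, grad, Theta_hat, V_sqrt, beta, step, iters=100):
    """Run `iters` projected-gradient steps of (27) with a user-supplied gradient oracle."""
    Theta = Theta0.copy()
    for _ in range(iters):
        Theta = project(Theta - step * grad(Theta), Theta_hat, V_sqrt, beta)
    return Theta
```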
As can be seen in Fig. 1, the maximum value of the state norm (attained during the initial exploration phase) is smaller when using mode than when all actuators are in action.
The regret bound for both cases is linear during the initial exploration phase; however, SOFUA guarantees regret for .
5 Conclusion
In this work, we proposed an OFU-based controller for over-actuated systems, which combines a "more exploration" step (to produce a stabilizing neighborhood of the true parameters while guaranteeing a bounded state during exploration) with an "optimism" step, which efficiently controls the system. Thanks to the redundancy, it is possible to further optimize the speed of convergence of the exploration phase to the stabilizing neighborhood by choosing among actuation modes, and then to switch to full actuation to guarantee an regret in closed loop, with polynomial dependency on the system dimension.
A natural extension of this work is to classes of systems in which some modes are only stabilizable. Speaking more broadly, the theme of this paper also opens the door to more applications of switching as a way to facilitate learning-based control of unknown systems, some of which are the subject of current work.
References
- Abbasi and Szepesvári (2011) Abbasi, Y. and Szepesvári, C. (2011). Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, 1–26.
- Abbasi-Yadkori (2013) Abbasi-Yadkori, Y. (2013). Online learning for linearly parametrized control problems. UAlberta.
- Anderson and Moore (1971) Anderson, B.D. and Moore, J.B. (1971). Linear Optimal Control. Prentice-Hall.
- Bertsekas (2011) Bertsekas, D.P. (2011). Dynamic Programming and Optimal Control, 3rd edition, volume II. Belmont, MA: Athena Scientific.
- Campi and Kumar (1998) Campi, M.C. and Kumar, P. (1998). Adaptive linear quadratic gaussian control: the cost-biased approach revisited. SIAM Journal on Control and Optimization, 36(6), 1890–1907.
- Chen and Hazan (2021) Chen, X. and Hazan, E. (2021). Black-box control for linear dynamical systems. In Conference on Learning Theory, 1114–1143. PMLR.
- Cohen et al. (2019) Cohen, A., Koren, T., and Mansour, Y. (2019). Learning linear-quadratic regulators efficiently with only √T regret. In International Conference on Machine Learning, 1300–1309.
- Faradonbeh et al. (2020) Faradonbeh, M.K.S., Tewari, A., and Michailidis, G. (2020). Optimism-based adaptive regulation of linear-quadratic systems. IEEE Transactions on Automatic Control, 66(4), 1802–1808.
- Ibrahimi et al. (2012) Ibrahimi, M., Javanmard, A., and Roy, B.V. (2012). Efficient reinforcement learning for high dimensional linear quadratic systems. In Advances in Neural Information Processing Systems, 2636–2644.
- Lale et al. (2020a) Lale, S., Azizzadenesheli, K., Hassibi, B., and Anandkumar, A. (2020a). Explore more and improve regret in linear quadratic regulators. arXiv preprint arXiv:2007.12291.
- Lale et al. (2020b) Lale, S., Azizzadenesheli, K., Hassibi, B., and Anandkumar, A. (2020b). Regret bound of adaptive control in linear quadratic Gaussian (LQG) systems. arXiv preprint.
- Lale et al. (2021) Lale, S., Azizzadenesheli, K., Hassibi, B., and Anandkumar, A. (2021). Adaptive control and regret minimization in linear quadratic Gaussian (LQG) setting. In 2021 American Control Conference (ACC), 2517–2522. IEEE.
- Mania et al. (2019) Mania, H., Tu, S., and Recht, B. (2019). Certainty equivalence is efficient for linear quadratic control. Advances in Neural Information Processing Systems, 32.
6 Analysis
In this section, we provide a rigorous analysis of the algorithms, the properties of the closed-loop system, and the regret bounds. The most technical results, proofs, and lemmas can be found in the Appendix.
6.1 Stabilization via IExp (proof of Theorem 3. (1))
This section upper-bounds the state norm during the initial exploration phase. We carry out this part regardless of which mode has been chosen for initial exploration.
During the initial exploration, the state recursion of the system actuating in mode is written as follows:
(28) |
where . The state update equation can be written as follows:
(29) |
where
(30) |
and
(31) |
By propagating the state back to time step zero, the state update equation can be written as:
(32) |
Recalling Assumption 2, we have
(33) |
Now, assuming , it follows that
(34) |
On the other hand, we have
(35) |
which results in
(36) |
where, in the second term on the right-hand side, we applied the fact that (see Assumption 2).
Now, applying Lemma 18 and a union bound argument, one can write
(37) |
where stands for the number of actuators of an actuating mode and, similarly, any subscript or superscript denotes the actuating mode .
The policy made explicit in Algorithm 1 keeps the state of the underlying system bounded with probability at least during initial exploration, which is defined as the "good event"
(38) |
A second "good event" is associated with the confidence set for an arbitrary mode , defined as:
(39) |
6.2 Determining the exploration time and best mode for IExp
6.2.1 Exploration duration
Given the constructed central confidence set , we aim to specify the time duration that guarantees that the parameter estimate resides within the stabilizing neighborhood. For this, we need to lower-bound the smallest eigenvalue of the covariance matrix . The following lemma, adapted from Lale et al. (2020a) and named persistence of excitation during the extra exploration, provides this lower bound.
Lemma 8
For the initial exploration period of we have
(40) |
with probability at least where , , and
for any and such that .
The following lemma gives an upper-bound for the parameter estimation error at the end of time , which will be used to compute the minimum extra exploration time .
Lemma 9
Suppose Assumptions 1 and 2 hold. For , having additional exploration leads to
(41) |
The proof is straightforward. First, a confidence set around the true but unknown parameters of the system is obtained, given by (10). Then, applying (40) from Lemma 8 completes the proof.
One more step is needed to obtain the extra exploration duration, namely upper-bounding the right-hand side of (41). Performing this step allows us to state the following central result.
6.2.2 Best Actuating mode for initial exploration
Given the side information and the 's for all actuating modes , and using the bound (6.1), we aim to find an actuating mode that provides the lowest possible upper-bound on the state in the first phase. This guarantees that the state norm does not blow up while minimizing the regret. Lemma 4 and Theorem 5 together specify the best actuating mode for IExp and its corresponding duration.
6.3 Stabilization via SOFUA (proof of Theorem 3. (2))
After running the IExp algorithm for (or ), the confidence set is tight enough and we are in the stabilizing region, so Algorithm 2, which leverages all the actuators, comes into play. This algorithm takes the central confidence set produced by Algorithm 1 as an input. The confidence ellipsoid for this phase is given as follows:
(42) |
where
(43) |
and
(44) |
Now, we can define the good event for time
(45) |
Now, we are ready to upper-bound the state norm.
SOFUA keeps the state of the underlying system bounded with probability at least . In this section, we define the "good event" for .
Noting that for the algorithm stops applying the exploratory noise , the state dynamics is written as follows:
(46) |
where
Under the controllability assumption for , if the event holds, then for all . Starting from state , one can write
(47) |
By applying a union bound argument to the second term on the right-hand side of (47) and using the bound (25), it is straightforward to show that
where
(48) |
For , we have . Now the "good event" is defined by
(49) |
in which
(50) |
.
6.4 Regret Bound Analysis
6.4.1 Regret decomposition
From the definition of regret, one can write
(51) |
Applying the Bellman optimality equation (see Bertsekas (2011)) for LQ systems actuating in any mode , one can write
where for and for , and, with slight abuse of notation, . In the third equality we applied the dynamics and used the martingale property of the process noise .
Now, taking the summation up to time and redefining gives
(52) | |||
(53) | |||
(54) |
where
(55) |
(56) |
and
(57) |
with for , and when , for which we drop the corresponding super/subscripts with a slight abuse of notation.
Recalling the optimal average cost formula and taking into account that the extra exploratory noise is independent of the process noise , for the duration in which the system actuates in mode , the term is decomposed as follows
(58) |
From the given side information (6) (Assumption 2), one can write
(59) |
which results in
(60) |
Combining (59) with (54) and (51), for the controllable setting under the events for and for (and for the stabilizable setting with its corresponding events), the regret can be upper-bounded as follows:
(61) |
where
(62) |
The term given by (62), which is the direct effect of the extra exploratory noise on the regret bound, has the same upper-bound for both controllable and stabilizable settings, given by the following lemma.
Lemma 10
The subsequent lemmas give the rest of the upper-bounds.
Lemma 11
(Bounding ) On the event for and for , with probability at least for the term is upper-bounded as follows:
(64) |
for some problem-dependent coefficients .
The proof follows the same steps as in Lale et al. (2020a), with the only difference being that the exploration phase is performed by actuating in mode , with corresponding number of actuators .
The following lemma upper-bounds the term .
Lemma 12
(Bounding ) On the event for and for , it holds true that the term defined by (56) is upper-bounded as
(65) |
in which
The proof can be found in Appendix 7.
Lemma 13
(Bounding ) On the event for and for , the term defined by (57) has the following upper bound:
(66) |
Putting everything together gives the overall regret bound, which holds with probability at least . This bound is summarized in Theorem 7.
7 Appendix
7.1 Technical Theorems and Lemmas
Lemma 14
(Norm of a sub-Gaussian vector) For an entry-wise sub-Gaussian vector , the following upper-bound holds with probability at least
Lemma 15
(Self-normalized bound for vector-valued martingales, Abbasi and Szepesvári (2011)) Let be a filtration, a stochastic process adapted to , and (where is the -th element of the noise vector ) a real-valued martingale difference sequence, again adapted to the filtration , which satisfies the conditional sub-Gaussianity assumption (Assumption 1) with known constant . Consider the martingale and covariance matrices:
then, with probability at least , we have
(67) |
Given the fact that, for controllable systems, solving the DARE gives a unique stabilizing controller, in this section we go through an important result from the literature that shows there is a strongly stabilizing neighborhood around the parameters of a system. This means that solving the DARE for any parameter value in this neighborhood gives a controller which stabilizes the system with the true parameters.
Lemma 16
(Mania et al. (2019)) There exist explicit constants and
in which and such that for any , we have
(68) |
where is the infinite-horizon performance of the policy applied to .
Lemma 16 implicitly says that, for any estimate residing within the stabilizing neighborhood, the designed controller, applied to the true system, is stabilizing.
7.2 Confidence Set Construction
The following theorem gives the confidence set for the initial exploration phase and actuating mode . The central confidence set and the confidence set of actuating mode 1 can be constructed similarly.
Theorem 17
From (9) we have:
(70) |
where and are matrices whose rows are and , respectively. On the other hand, considering the definition of , and whose rows are ,…, the dynamics of the system can be written as
which leads to
Noting the definition , it yields
(71) |
For an arbitrary random covariate we have,
(72) |
By taking norms on both sides, one can write
(73) |
where in the last inequality we applied (from Assumption 2) and the fact that .
Using Lemma 15, is bounded from above as
(74) |
By arbitrarily choosing and plugging it into (73) it yields
and since , the statement of Theorem 17 holds true.
Lemma 18
(Abbasi and Szepesvári (2011)) For any and we have that
(75) |
where is a set containing a finite number of times, with maximum cardinality , occurring within any time interval, such that is not well controlled at those instances (see Lemmas 17 and 18 of Abbasi and Szepesvári (2011) for more details).
During initial exploration, while the system actuates in an arbitrary mode , it constructs the central ellipsoid by using the augmentation technique. The following lemma gives an upper-bound for the determinant of the covariance matrix of mode , and that of the central ellipsoid (for the central ellipsoid ).
Lemma 19
Let the system actuate in an arbitrary mode and apply the extra exploratory noise . Further assume that the central ellipsoid is constructed by applying the augmentation technique. Then, with probability at least , the determinant of the covariance matrix ( denotes the central ellipsoid) is given by
(76) |
where .
we can write
In the second inequality, we applied the AM–GM inequality, and in the third inequality we applied the property . Furthermore, . Given , one can write:
(77) |
where holds with probability at least . This completes the proof of (20). The proof of the second statement of the lemma is given in Lale et al. (2020a).
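For completeness, a bound of the kind invoked above reads as follows (a sketch in our notation, where $d$ is the dimension of the regressor $z_s$ and $\lambda$ the regularization parameter): by the AM–GM inequality applied to the eigenvalues of the covariance matrix,
\[
\det(V_t)=\prod_{i=1}^{d}\lambda_i(V_t)\;\le\;\Bigl(\tfrac{1}{d}\operatorname{tr}(V_t)\Bigr)^{d}\;=\;\Bigl(\lambda+\tfrac{1}{d}\textstyle\sum_{s\le t}\|z_s\|^{2}\Bigr)^{d}.
\]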
7.3 Proofs of Lemma 4 and Theorem 5
(Proof of Lemma 4) The proof directly follows by plugging the upper-bound of into Lemma 9.
In the second inequality we applied the AM–GM inequality, and in the third inequality we applied the property . Furthermore, . Given , which holds with probability at least , one can write:
(78) |
This completes the proof of (20). The proof of the second statement of the lemma is given in Lale et al. (2020a).
(Proof of Theorem 5) Given (6.1), we first upper-bound the term . We can write
(79) |
which results in
(80) |
where and . On the other hand, given the definition of one can write:
(81) |
Combining the results gives:
(82) | ||||
(83) |
We also have
where
One can simply rewrite as follows:
where
and
with defined as follows
Given the fact that , for a nonzero initial state and defining , we have the following upper-bound for
Then we can upper-bound as follows
Hence, the state norm is bounded as follows:
By elementary but tedious calculations one can show
An upper bound for the third term on the right-hand side is given by applying Lemma 4. Letting and results in
By applying elementary calculations, it yields
where
Using the property which holds and letting , one can write
By elementary calculations, the first statement of the theorem is shown. The proofs of statements 2 and 3 are immediate and we skip them for the sake of brevity.
7.4 Proofs of Regret Bound Analysis Section
(Proof of Lemma 12) Note that, except at the time instances at which there is a switch in policy, most terms on the RHS of (56) vanish. Denote the covariance matrices of the central ellipsoid and of actuating mode by and , respectively, and suppose that at time steps the algorithm changes the policy. Therefore, it yields for and for . This results in
(84) | ||||
(85) |
where and are the numbers of policy switches while actuating in mode and in the fully-actuated mode, respectively. On the one hand, , where
(86) | ||||
(87) |
in which we applied the bound
(88) |
Using (84) is upper-bounded by
(89) |
On the other hand we have where
(90) | ||||
(91) |
Furthermore,
(92) |
with probability at least , which, together with the upper-bounds on the state norm in both the initial exploration and optimism phases, yields
(93) |
Considering (40) we have
(94) |
Now, applying results in
(95) |
Considering the switch from IExp to SOFUA, which can cause a switch in the policy, the total number of policy switches is . Now, applying the bounds on the state norm for and completes the proof. The following lemma adapts the proof of Abbasi and Szepesvári (2011) to our setting and will be useful in bounding .
Lemma 20
Let , for both and , then one can write
(97) |
For one can write
(98) | ||||
(99) | ||||
(100) | ||||
(101) |
where is the last time that a policy change happened. We applied the Cauchy-Schwarz inequality in (98). The inequality (99) follows from Lemma 11 in Abbasi and Szepesvári (2011). Furthermore, applying the update rule gives (100), and finally (101) is obtained using the property for .
Now, for we have
which in turn is written
(102) |
where in the second inequality we applied Lemma 10 of Abbasi and Szepesvári (2011).
For , the algorithm applies extra exploratory noise; we still have the same decomposition (97) with . However, to upper-bound we need some algebraic manipulation, namely substituting in terms of . With this in mind, following similar steps as in (98)-(101) yields
where . Applying this result gives
(103) |
where in the first inequality we applied (79) and (92). Similarly to (102), in the second inequality we applied Lemma 10 of Abbasi and Szepesvári (2011). Combining (102) and (103) completes the proof.
(Proof of Lemma 13) In upper-bounding the term , we skip a few straightforward steps, which can be found in Abbasi and Szepesvári (2011) and Lale et al. (2020a), and write
(104) | |||
(105) |
where in the inequalities (104) and (105) we applied (103) and (102) (from Lemma 20), respectively. The remaining step is the following upper-bounds
where
(106) |
and in the second inequality we applied (94). Considering the definitions of , , and , we can see that the statement of the lemma holds true.