
Learn and Control while Switching: Guaranteed Stability and Sublinear Regret

Jafar Abbaszadeh Chekan and Cédric Langbort

J. A. Chekan and C. Langbort (emails: jafar2 & [email protected]) are with the Coordinated Science Laboratory and the Department of Aerospace Engineering at the University of Illinois at Urbana-Champaign (UIUC).
Abstract

Over-actuated systems often make it possible to achieve specific performances by switching between different subsets of actuators. However, when the system parameters are unknown, transferring authority between subsets of actuators is challenging due to stability and performance concerns. This paper presents an efficient algorithm to tackle the so-called "learn and control while switching between different actuating modes" problem in the Linear Quadratic (LQ) setting. Our proposed strategy builds on an Optimism in the Face of Uncertainty (OFU) based algorithm equipped with a projection toolbox that keeps the algorithm efficient, regret-wise. Along the way, we derive an optimal duration for the warm-up phase, thanks to the existence of a stabilizing neighborhood. Stability of the switched system is also guaranteed by designing a minimum average dwell time. The proposed strategy is proved to have a regret bound of $\mathcal{\bar{O}}(\sqrt{T})+\mathcal{O}(ns\sqrt{T})$ in horizon $T$ with $ns$ switches, provably outperforming naive repeated application of the basic OFU algorithm.

Index Terms:
Over-actuated System, Reinforcement Learning, Regret, Switched System

I Introduction

Control allocation is a vast and richly studied field that addresses the problem of distributing control effort over redundant actuators to achieve a specific performance. Along this line, supervisory switching control, which enables operation with different subsets of actuators, is a practical approach to control allocation [1, 2, 3, 4]. In this class of problems, the switching time, the switch mode (which represents a subset of actuators), or both can be either specified as part of the general decision-making process or determined by an external unit. Three selected problem classes that illustrate the significance of switching in over-actuated systems are Fault Tolerant Control (FTC), Time-Triggered Protocol (TTP), and moving target defense (MTD). These examples differ in how these crucial variables are defined and designed. Designing an algorithm to efficiently switch between different actuating modes when the system parameters are unknown presents a significant performance challenge. We refer to this problem as "learn and control while switching." In this context, it is inefficient to disregard previously acquired information about these parameters and start the learning and control process from scratch each time a mode switch occurs. Our goal is to create an algorithm that capitalizes on previously acquired knowledge when commencing a new actuation mode. More specifically, our focus lies on an online Optimism in the Face of Uncertainty (OFU) based algorithm within the framework of Linear Quadratic (LQ) settings. This class of algorithms builds a confidence ellipsoid containing the unknown system parameters, allowing us to design a feedback gain by adopting an optimistic perspective toward this set. A naive approach to our proposed problem would be the repeated application of the OFU-based algorithm whenever a mode switch takes place. However, this method is likely to be inefficient, given that it disregards previously acquired knowledge. In this work, we introduce a novel strategy that leverages learned information. This strategy involves projecting recently constructed ellipsoids into the space of the mode that the system has transitioned into. Following this projection, we establish a confidence ellipsoid for control design, incorporating both the knowledge stored within the projected ellipsoid and the online measurements gathered after initiating the actuation. The core technical challenge in constructing this ellipsoid is confidence set construction while incorporating the already learned information stored within the projected confidence ellipsoid. In our setting, unlike [5], who use the trace norm of the system parameters for regularization in least squares estimation and confidence set construction, we utilize a weighted Frobenius norm. This approach gives us an opportunity to incorporate the projected ellipsoid into the confidence ellipsoid construction. To fully incorporate this idea, we also forgo the confidence ellipsoid normalization applied in [5]. As a result, our confidence ellipsoid representation differs slightly from that of [5]. Regarding the stability guarantee in the switching setting, we need to determine a minimum average dwell time that slows down the switching process sufficiently to guarantee boundedness of the state norm during the switching operation.

Actuator faults can cause decayed performance (e.g., increased actuator energy consumption) or even more devastating incidents [6]. These kinds of safety concerns are often considered within the literature on so-called Fault Tolerant Control (FTC) [7, 8], where the goal of switching is to bypass a failing actuator [9, 10, 11].

In TTP systems [12, 13], where control must occur through communication channels whose availability varies according to a predetermined (known a priori) schedule, the goal of switching is to conform to the specifications of that protocol.

As another specific application of switching systems, we can point to the MTD strategy in cyber-physical security of control systems. MTD, which first appeared in the computer science literature [14], is a proactive strategy with which a system can hide its dynamics from a potential attacker by randomly switching between different subsets of actuators. This strategy is practical considering that adversarial agents have limited resources and cannot compromise all the actuators simultaneously. In [15, 16], the authors apply MTD to identify malicious attacks on sensors and actuators. Furthermore, [4] and [17] address this challenge for a system with known dynamics.

One significant difference between these applications is that some are instances of "direct switching" whereas others belong to the "indirect switching" category [18], which impacts the determination of when and what mode to switch to. For example, in TTP, both the switch time and actuation mode are pre-specified and given. On the other hand, for FTC, the switch time is simply when an anomaly is detected. For this class of problems, we propose a mechanism that picks the best actuating mode to switch to by examining the richness of learned information up to that point. As for the MTD application, given that high-frequency switches are desirable, we let the system switch as often as desired unless stability is violated. To this end, we constrain the system to stay in a mode for some minimum average dwell time (MADT). The next actuating mode is specified with a similar strategy as in the FTC case. In all cases, the algorithm guarantees closed-loop stability and that parameters are learned well enough to ensure low regret (defined as the average difference between the closed-loop quadratic cost and the best achievable one, with the benefit of knowing the plant's parameters). The former is addressed by designing a MADT, while the latter requires detailed insight into available RL algorithms in the LQ control literature.

Several studies have recently attempted to address switched-system control under different assumptions when the system model is unknown. The authors in [19] designed data-driven control laws for switched discrete-time linear systems with unknown switch signals. The stability of an unknown switched linear system is investigated probabilistically by observing only finitely many trajectories in [20]. Furthermore, in [21, 22] the design of stabilizing controllers for a switched system is addressed by relying solely on system structure and experimental data, without explicit plant identification.

Learning-based methods in the LQ framework are divided into model-free ([23, 24, 25, 26, 27]) and model-based RL approaches. The former ignores parameter estimates and outputs a sub-optimal policy by relying solely on the history of data and directly minimizing the cost function [26]. The latter category usually comes with guaranteed regret, thanks to system identification as a core step in control design. We use similar properties to obtain a guaranteed regret.

A naive approach to model-based adaptive control is the simple philosophy of estimating a system's unknown parameters and treating them as the actual parameters in optimal control design. This algorithm is known as certainty equivalence in the literature, and since it decouples the estimation problem from the control one, it may lead to strictly sub-optimal performance [28]. Model-based algorithms, in general, can be divided into three main categories, namely "offline estimation and control synthesis," "Thompson sampling-based," and "Optimism in the Face of Uncertainty (OFU)" algorithms. OFU, which couples the estimation and control design procedures, has proven effective at outperforming the naive certainty equivalence algorithm. Campi and Kumar in [28] proposed a cost-based parameter estimator, an OFU-based approach to the optimal control problem for linear quadratic Gaussian systems with guaranteed asymptotic optimality. However, this algorithm only guarantees convergence of the average cost to that of the optimal control in the limit and does not provide any bound on the performance loss over a finite horizon. Abbasi-Yadkori and Szepesvari [29] proposed a learning-based algorithm for adaptive control of the LQ system over a finite horizon, with worst-case regret bound $O(\sqrt{T})$. Using $l_2$-regularized least squares estimation, they construct a high-probability confidence set around the unknown parameters of the system and design an algorithm that plays optimistically with respect to this set [30]. Along this line, Ibrahimi et al. [31] later proposed an algorithm that achieves an $O(p\sqrt{T})$ regret bound with state-space dimension $p$. The authors in [32] proposed an OFU-based learning algorithm with mild assumptions and $O(\sqrt{T})$ regret. Furthermore, [33] proposes an OFU-based algorithm for joint stabilization and regret minimization that leverages actuator redundancy to alleviate state explosion during the initial time steps, when data are scarce.

The objective function to be minimized in the algorithm of [30] is non-convex, which creates a computational hurdle. Recently, however, Cohen et al. [5], by formulating the LQ control problem as a relaxed semi-definite program (SDP) that accounts for the uncertainty in the estimates, proposed a computationally efficient OFU-based algorithm. Moreover, unlike the loose state-norm bound given by [30], [5] guarantees strongly sequentially stabilizing policies, enabling us to come up with a tight upper bound, which ensures a small MADT (appropriate for the MTD application). These advantages motivate us to use the relaxed SDP framework in our analysis rather than that of [30].

The remainder of the paper is organized as follows: Section II provides the problem statement, followed by Section III, which gives a brief background review and an overview of the projection and augmentation techniques. Moving on to Section IV, we first provide an overview of the projection-based strategy. Then, we detail the main algorithmic steps and summarize the stability analysis, which includes the MADT design and the algorithm's performance, particularly in terms of the regret bound. Section VI briefly discusses the extension of the proposed algorithm to the FTC and MTD types of applications. Afterwards, a numerical experiment is given in Section VII. Finally, we summarize the key achievements in Section VIII. The most rigorous analysis of the algorithms (e.g., stability and regret bounds) and technical proofs are provided in Appendix A, while complementary proofs are left to Appendix B.

II Problem statement

Throughout this paper, we use the following notation: $\|A\|$ denotes the operator norm, i.e., $\|A\|=\max_{x,\|x\|\leq 1}\|Ax\|$. We denote the trace norm by $\|\bullet\|_{*}=\sqrt{\operatorname{Tr}(\bullet^{\top}\bullet)}$. The entry-wise dot product between matrices is denoted by $A\bullet B=\operatorname{Tr}(A^{\top}B)$. The weighted norm of a matrix $A$ with respect to $V$ is interchangeably denoted by $\operatorname{Tr}(A^{\top}VA)$ or $\|A\|_{V}^{2}$. We denote the $m$-dimensional identity matrix by $I_{m}$. We denote by $\lceil x\rceil$ the ceiling function, which maps $x$ to the least integer greater than or equal to $x$.

Consider the following time-invariant, discrete-time LQR system

x_{t+1} = A_* x_t + B_* u_t + \omega_{t+1}  (1)
c_t = x_t^\top Q x_t + u_t^\top R u_t  (2)

where $A_{*}\in\mathbb{R}^{n\times n}$ and $B_{*}\in\mathbb{R}^{n\times m}$ are unknown, while $Q\in\mathbb{R}^{n\times n}$ and $R\in\mathbb{R}^{m\times m}$ are known positive definite matrices.

Let $\mathbb{B}$ be the set of all columns $b^{i}_{*}$ ($i\in\{1,...,m\}$) of $B_{*}$. An element of its power set $2^{\mathbb{B}}$ is a subset $\mathcal{B}_{j}$, $j\in\{1,...,2^{m}\}$, of columns corresponding to a submatrix $B_{*}^{j}$ of $B_{*}$ and mode $j$. For simplicity, we assume that $B^{1}_{*}=B_{*}$, i.e., that the first mode contains all actuators. With this definition in place we can express the system dynamics as follows:

x_{t+1} = A_* x_t + B_*^{\sigma(t)} u_t^{\sigma(t)} + \omega_{t+1}  (3)

or equivalently

x_{t+1} = {\Theta^{\sigma(t)}_{*}}^{\top} z_t + \omega_{t+1}, \quad z_t = \begin{pmatrix} x_t \\ u^{\sigma(t)}_t \end{pmatrix}  (4)

where $\Theta^{\sigma(t)}_{*}=(A_{*},B^{\sigma(t)}_{*})^{\top}$ is controllable and $\sigma:\{0\}\cup\mathbb{N}\rightarrow\mathcal{M}$ is a given right-continuous piecewise constant switching signal, where $\mathcal{M}$ denotes the set of all controllable modes $(A_{*},B^{i}_{*})$.

The cost associated with this mode is

c^{\sigma(t)}_t = x_t^\top Q x_t + {u^{\sigma(t)}_t}^\top R^{\sigma(t)} u^{\sigma(t)}_t  (5)

where $R^{\sigma(t)}\in\mathbb{R}^{d_{\sigma(t)}\times d_{\sigma(t)}}$ is a block of $R$ which penalizes the control inputs of the actuators of mode $\sigma(t)\in\mathcal{M}$.

Since we are considering arbitrary, externally-determined, switching patterns (which could include switching to a particular mode at some time instant, then never switching out of it), controllability of the system in each mode is necessary to ensure controllability of the switched system. This mode-by-mode controllability assumption is also useful to enable OFU-style learning at each epoch.

The noise process satisfies the following assumption, which is standard in the controls community (e.g., [30], [5], and [34]).

Assumption 1

There exists a filtration $\mathcal{F}_{t}$ such that

$(1.1)$ $\omega_{t+1}$ is a martingale difference, i.e., $\mathbb{E}[\omega_{t+1}\,|\,\mathcal{F}_{t}]=0$;

$(1.2)$ $\mathbb{E}[\omega_{t+1}\omega_{t+1}^{\top}\,|\,\mathcal{F}_{t}]=\bar{\sigma}_{\omega}^{2}I_{n}$ for some $\bar{\sigma}_{\omega}^{2}>0$;

$(1.3)$ the $\omega_{t}$ are component-wise sub-Gaussian, i.e., there exists $\sigma_{\omega}>0$ such that for any $\gamma\in\mathbb{R}$ and $j=1,2,...,n$

\mathbb{E}[e^{\gamma\omega_{j}(t+1)}\,|\,\mathcal{F}_{t}] \leq e^{\gamma^{2}\sigma_{\omega}^{2}/2}.

Our goal is to design a strategy which can efficiently transfer the control authority between different subsets of actuators when the system parameters are unknown.

Relying on past measurements, the problem is to design a sequence $\{u^{\sigma(t)}_{t}\}$ of control inputs such that the average expected cost

J(u_1, u_2, ...) = \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[c^{\sigma(t)}_t]  (6)

is minimized, where $\sigma(t)\in\mathcal{I}$ and $\mathcal{I}=[i_{\tau_{1}}\;i_{\tau_{2}}\;...\;i_{\tau_{ns}}]$ represents the switch sequence. Each element of $\mathcal{I}$ specifies both the mode index $i_{\tau_{j}}\in\mathcal{M}$ and the time $\tau_{j}$ at which the switch to that mode occurs. In the context of TTP applications, this sequence is typically predetermined, and thus the system has access to both the switch times and the corresponding modes. In contrast, for FTC, while the switch times $\tau_{j}$ are determined by an anomaly detector, the decision regarding which mode to switch to is made by the algorithm itself. Therefore, the algorithm plays a pivotal role in selecting the next mode, denoted $i_{\tau_{j}}\in\mathcal{M}$. When the system parameters are known, it may be relatively straightforward to specify the next mode that results in the lowest average expected cost. However, this task becomes significantly more challenging in the absence of such information. The MTD application introduces an even greater level of complexity, as the algorithm must autonomously decide both when and to which mode to switch. In this work, our primary focus is on the TTP setting. In Section VI we provide a brief overview of general solutions for the other two settings.

As a measure of the optimality loss due to the lack of knowledge of the model, the regret of a policy is defined as follows:

R_T = \sum_{t=1}^{T} \big( \mathbb{E}[c_t^{\sigma(t)}] - J_*(\Theta_*^{\sigma(t)}, Q, R^{\sigma(t)}) \big)  (7)

where $J_{*}(\Theta_{*}^{\sigma(t)},Q,R^{\sigma(t)})$ is the optimal average expected cost of actuating in mode $\sigma(t)\in\{1,...,2^{m}\}$.

Proposition 1

The regret incurred by employing a fixed feedback controller $K_{0}$ on an LQ system over $T_{0}$ rounds is of order $\mathcal{O}(T_{0})$.

Assumption 2

1. There are known constants $\alpha_{0},\alpha_{1},\sigma,\vartheta,\nu>0$ such that

\alpha_0 I \leq Q \leq \alpha_1 I, \quad \alpha_0 I \leq R^i \leq \alpha^i_1 I
W = \mathbb{E}[\omega_t \omega_t^\top] = \sigma^2 I, \quad \|\Theta_*^i\| \leq \vartheta_i, \quad J_*(\Theta_*^i, Q, R) \leq \nu_i

for all $i\in\mathcal{M}$.

2. There is an initial stabilizing policy $K_{0}$ to start the algorithm with.

Note that having access to $\vartheta_{i}$ and $\nu_{i}$ for each $i$ separately is quite demanding; hence, in practice, we only require knowledge of $\bar{\vartheta}:=\max_{i}\vartheta_{i}$ and $\bar{\nu}=\max_{i}\nu_{i}$. The assumption of having access to an initial stabilizing policy such as $K_{0}$ is common practice in the literature (as in [5] and [35]). Nevertheless, it is possible to develop a stabilizing policy in real time by relying on some a priori bounds. For further details on this approach, we refer readers to [34] and [33].

Developing a "learn and control while switching" strategy requires overcoming two core challenges: performance and stability. For the former, we propose an efficient OFU-based algorithm equipped with a projection tool that can efficiently transfer the learned information from one actuating mode to the others. This strategy outperforms the naive approach of repeatedly applying the standard OFU algorithm. Fast recurrent switching between actuating modes can cause the state norm to explode over a finite horizon. Hence, in the TTP setting, we operate under the assumption that switches occur at a sufficiently slow rate on average, with the time interval between two consecutive switches, $\tau_{j+1}-\tau_{j}$, being longer than some minimum average dwell time $\tau_{MAD}$.

Before presenting the strategy in more detail (and, in a later section, the specifics of the algorithms), we start by reviewing the main ingredients of our approach to build the foundation for our analysis.

III Preliminaries

III-A Primal SDP formulation of LQR problem

Consider a linear system

x_{t+1} = A x_t + B u_t + \omega_{t+1}  (8)
c_t = x_t^\top Q x_t + u_t^\top R u_t.  (9)

Observing that a steady-state joint state and input distribution $(x_{\infty},u_{\infty})$ exists for any stabilizing policy $\pi$, we denote the covariance matrix of this joint distribution by $\mathcal{E}(\pi)$, defined as follows:

\mathcal{E}(\pi) = \mathbb{E} \begin{pmatrix} x_\infty x_\infty^\top & x_\infty u_\infty^\top \\ u_\infty x_\infty^\top & u_\infty u_\infty^\top \end{pmatrix}

where $u_{\infty}=\pi(x_{\infty})$, and the average expected cost of the policy is given as follows:

J(\pi) = \begin{pmatrix} Q & 0 \\ 0 & R \end{pmatrix} \bullet \mathcal{E}(\pi).

Now, for a linear stabilizing policy $K$ with $u_{\infty}=Kx_{\infty}$, the covariance matrix of the joint distribution takes the following form:

\mathcal{E}(K) = \begin{pmatrix} X & XK^\top \\ KX & KXK^\top \end{pmatrix}  (10)

where $X=\mathbb{E}[x_{\infty}x_{\infty}^{\top}]$. Then the average expected cost is computed as follows:

J(K) = \begin{pmatrix} Q & 0 \\ 0 & R \end{pmatrix} \bullet \mathcal{E}(K) = (Q + K^\top R K) \bullet X.

Now, given $\Theta=(A\;B)^{\top}$, the LQR control problem can be formulated as an SDP, as follows:

\text{minimize}_\Sigma \quad \begin{pmatrix} Q & 0 \\ 0 & R \end{pmatrix} \bullet \Sigma
\text{s.t.} \quad \Sigma_{xx} = \Theta^\top \Sigma \Theta + W
\Sigma > 0  (11)

where any feasible solution $\Sigma$ can be represented as follows:

\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xu} \\ \Sigma_{ux} & \Sigma_{uu} \end{pmatrix}  (12)

in which $\Sigma_{xx}\in\mathbb{R}^{n\times n}$, $\Sigma_{uu}\in\mathbb{R}^{m\times m}$, and $\Sigma_{ux}=\Sigma_{xu}^{\top}\in\mathbb{R}^{m\times n}$. The optimal value of (11) coincides with the average expected cost $J_{*}(\Theta,Q,R)$, and for $W>0$ (ensuring $\Sigma_{xx}>0$), the optimal policy of the system, denoted $K_{*}(\Theta,Q,R)$ and associated with the optimal solution $\Sigma^{*}(\Theta,Q,R)$, can be derived from $K_{*}(\Theta,Q,R)=\Sigma^{*}_{ux}(\Theta,Q,R){\Sigma^{*}_{xx}}^{-1}(\Theta,Q,R)$ in view of (10).

In [5], a relaxed form of this SDP is used to design the stabilizing control policy when the matrix $\Theta$ is replaced by its estimates (i.e., a confidence set for $\Theta$). We will slightly modify the relaxed SDP formulation for our purpose.
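As an illustration, the following minimal Python sketch solves the primal SDP (11) and recovers $K_{*}=\Sigma^{*}_{ux}{\Sigma^{*}_{xx}}^{-1}$. The use of cvxpy and the specific system matrices are our own illustrative assumptions, not part of the paper's implementation; the strict constraint $\Sigma>0$ is relaxed to semidefiniteness for the solver.

import numpy as np
import cvxpy as cp

n, m = 2, 2
A = np.array([[1.0, 0.1], [0.0, 1.0]])        # illustrative plant
B = np.array([[0.0, 0.05], [0.1, 0.0]])
Q, R = np.eye(n), np.eye(m)
W = 0.01 * np.eye(n)                          # process-noise covariance
Theta = np.vstack([A.T, B.T])                 # Theta = (A B)^T in R^{(n+m) x n}

Sigma = cp.Variable((n + m, n + m), PSD=True)  # PSD=True relaxes Sigma > 0
cost = np.block([[Q, np.zeros((n, m))], [np.zeros((m, n)), R]])
constraints = [Sigma[:n, :n] == Theta.T @ Sigma @ Theta + W]
prob = cp.Problem(cp.Minimize(cp.trace(cost @ Sigma)), constraints)
prob.solve()                                   # needs an SDP-capable solver, e.g., SCS

S = Sigma.value
K = S[n:, :n] @ np.linalg.inv(S[:n, :n])       # K_* = Sigma_ux Sigma_xx^{-1}, cf. (10)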

III-B System Identification

The proposed algorithm includes a system identification step that applies a self-normalized process to obtain least squares estimates and uses the martingale difference property of the process noise to construct a high-probability confidence set around the system's parameters. To sketch a brief outline, assume the system actuates in the first mode and evolves over an arbitrary time interval $[0,\;t]$. One can then simply rewrite (1) as follows:

X_t = Z_t \Theta_* + W_t  (13)

where $X_{t}$ and $Z_{t}$ are matrices whose rows are $x^{\top}_{1},...,x^{\top}_{t}$ and $z^{\top}_{0},...,z^{\top}_{t-1}$, respectively. Similarly, $W_{t}$ is constructed by concatenating $\omega_{1}^{\top},...,\omega_{t}^{\top}$ vertically.

Using the self-normalized process, the least squares estimation error $e(\Theta)$ can be defined as

e(\Theta) = \lambda \operatorname{Tr}\big((\Theta-\Theta_0)^\top(\Theta-\Theta_0)\big) + \sum_{s=0}^{t-1} \operatorname{Tr}\big((x_{s+1}-\Theta^\top z_s)(x_{s+1}-\Theta^\top z_s)^\top\big)  (14)

where $\Theta_{0}$ is some given initial estimate and $\lambda>0$ is a regularization parameter. This yields the $l_{2}$-regularized least squares estimate:

\hat{\Theta}_t = \operatorname*{argmin}_\Theta e(\Theta) = V_{[0,\;t]}^{-1} \big( Z_t^\top X_t + \lambda \Theta_0 \big)  (15)

where

V_{[0,\;t]} = \lambda I + \sum_{s=0}^{t-1} z_s z_s^\top = \lambda I + Z_t^\top Z_t,

is the covariance matrix. Then, assuming $\|\Theta_{*}-\Theta_{0}\|_{*}\leq\epsilon$ for some $\epsilon>0$, a high-probability confidence set containing the unknown system parameters $\Theta_{*}$ is constructed as

\mathcal{C}_t(\delta) := \big\{ \Theta \in \mathbb{R}^{(n+m)\times n} \;|\; \operatorname{Tr}\big((\Theta-\hat{\Theta}_t)^\top V_{[0,\;t]} (\Theta-\hat{\Theta}_t)\big) \leq r_t \big\}  (16)

where

r_t = \bigg( \sigma_\omega \sqrt{2n \log \frac{n \det(V_{[0,\;t]})}{\delta \det(\lambda I)}} + \sqrt{\lambda}\,\epsilon \bigg)^2.  (17)

It can be shown that the true matrix $\Theta_{*}$ belongs to this set with probability at least $1-\delta$, for $0<\delta<1$. The right-hand side of (16) depends on the user-defined parameter $\lambda$ and the initial estimation error $\epsilon$. The proof follows a combination of the steps of [5] and [30], with the difference that we forgo the normalization of the confidence set. For brevity, we sometimes write $\Delta_{t}=\Theta-\hat{\Theta}_{t}$ in our analysis.
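For concreteness, the estimate (15) and the radius (17) can be computed as in the following minimal numpy sketch; the data matrices and constants are assumed given, and the function names are ours for illustration.

import numpy as np

def ls_estimate(Z, X, lam, Theta0):
    # Regularized least squares (15): Z stacks z_0..z_{t-1}, X stacks x_1..x_t.
    V = lam * np.eye(Z.shape[1]) + Z.T @ Z          # V_{[0,t]} = lam*I + Z^T Z
    return np.linalg.solve(V, Z.T @ X + lam * Theta0), V

def radius(V, lam, eps, sigma_w, n, delta):
    # Confidence radius r_t of (17), computed via log-determinants for stability.
    logdet_ratio = np.linalg.slogdet(V)[1] - V.shape[0] * np.log(lam)
    return (sigma_w * np.sqrt(2 * n * (np.log(n / delta) + logdet_ratio))
            + np.sqrt(lam) * eps) ** 2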

III-C $(\kappa,\gamma)$-strong sequential stability

We review the notion of strong sequential stability introduced in [5], which will be used in the context of switched systems as well.

Definition 1

Consider a linear plant parameterized by $A$ and $B$. The closed-loop system matrix $A+BK_{t}$ is $(\kappa,\gamma)$-strongly stable for $\kappa>0$ and $0<\gamma<1$ if there exist $H_{t}\succ 0$ and $L_{t}$ such that $A+BK_{t}=H_{t}L_{t}H^{-1}_{t}$ and

1. $\|L_{t}\|\leq 1-\gamma$ and $\|K_{t}\|\leq\kappa$;

2. $\|H_{t}\|\leq B_{0}$ and $\|H_{t}^{-1}\|\leq 1/b_{0}$ with $\kappa=B_{0}/b_{0}$.

A sequence $A+BK_{1},A+BK_{2},...$ of closed-loop system matrices is $(\kappa,\gamma)$-strongly sequentially stable if each $A+BK_{t}$ is $(\kappa,\gamma)$-strongly stable and $\|H^{-1}_{t+1}H_{t}\|\leq 1+\gamma/2$. Furthermore, we say a sequence $K_{1},K_{2},...$ of control gains is $(\kappa,\gamma)$-strongly sequentially stabilizing for a plant $(A,B)$ if the sequence $A+BK_{1},A+BK_{2},...$ is $(\kappa,\gamma)$-strongly sequentially stable.

As shown in [5], applying such a strongly stabilizing sequence of controllers starting from state $x_{t_{0}}$ guarantees boundedness of the state trajectory as follows:

\|x_t\| \leq \kappa_i e^{-\gamma_i (t-1)/2} \|x_{t_0}\| + \frac{2\kappa_i}{\gamma_i} \max_{t_0 \leq s \leq t} \|w_s\|.  (18)
Definition 2

The set $\mathcal{N}_{(\kappa_{0},\gamma_{0})}(\Theta_{*})=\{\Theta\in\mathbb{R}^{(n+m)\times n}:\;\|\Theta-\Theta_{*}\|\leq\epsilon\}$, for some $\epsilon>0$, is the $(\kappa_{0},\gamma_{0})$-stabilizing neighborhood of the system (1)-(2). In this neighborhood, for any $\Theta^{\prime}\in\mathcal{N}_{(\kappa_{0},\gamma_{0})}(\Theta_{*})$, the designed $(\kappa^{\prime},\gamma^{\prime})$-strongly stabilizing controller $K_{*}(\Theta^{\prime},Q,R)$, applied to the system (1), achieves an average expected cost lower than that of $K_{0}$.

III-D Projection of the Confidence Ellipsoid

Suppose at time $t$ the central confidence ellipsoid

\mathcal{C}_t(\delta) := \{ \Theta \in \mathbb{R}^{(n+m)\times n} \;|\; \operatorname{Tr}\big((\Theta-\hat{\Theta}_t)^\top V_{[0,\;t]} (\Theta-\hat{\Theta}_t)\big) \leq r_t \}  (19)

is given. By projecting the central ellipsoid (19), rich with information up to time $t$, onto the space of mode $i$, the corresponding confidence set of parameters for mode $i$ is again an ellipsoid, denoted $\mathcal{C}^{pi}_{t}(\delta)$ and represented as follows:

\mathcal{C}^{pi}_t(\delta) := \{ \Theta^i \in \mathbb{R}^{(n+m_i)\times n} \;|\; \operatorname{Tr}\big((\Theta^i-\hat{\Theta}_t^{pi})^\top V^{pi}_{[0\;t]} (\Theta^i-\hat{\Theta}_t^{pi})\big) \leq r \}.  (20)

The characteristics $\hat{\Theta}_{t}^{pi}$ and $V^{pi}_{[0\;t]}$ of this projected ellipsoid can be computed by the steps specified below.

Consider the ellipsoid

\operatorname{Tr}(\Delta^\top V \Delta) \leq r  (21)

where $\Delta=\Theta-\hat{\Theta}$ and $\Theta$ has $n+m$ rows. We aim to project this set onto the lower-dimensional space of mode $i$, identified by $\Theta^{i}\in\mathbb{R}^{(n+m_{i})\times n}$ for some $m_{i}<m$. To this end, we first reorder the rows so that the rows corresponding to the actuators of mode $i$, $\tilde{\Delta}_{1}\in\mathbb{R}^{(n+m_{i})\times n}$, come first, while the remaining rows form $\tilde{\Delta}_{2}\in\mathbb{R}^{(m-m_{i})\times n}$. The reformulated ellipsoid is then written as follows:

\operatorname{Tr}(\tilde{\Delta}^\top \tilde{V} \tilde{\Delta}) \leq r, \quad \tilde{\Delta} = \left[\begin{array}{c} \tilde{\Delta}_1 \\ \hline \tilde{\Delta}_2 \end{array}\right].  (24)

where $\tilde{V}\in\mathbb{R}^{(n+m)\times(n+m)}$ is the correspondingly reordered version of $V$. Let us now rewrite the parametrization of the ellipsoid (24) as follows:

\operatorname{vect}(\tilde{\Delta})^\top \tilde{G} \operatorname{vect}(\tilde{\Delta}) \leq r

where $\operatorname{vect}(\tilde{\Delta})\in\mathbb{R}^{(n+m)n}$ is a vectorization of $\tilde{\Delta}$ and $\tilde{G}=I_{(n\times n)}\otimes\tilde{V}$.

Suppose $\partial E_{pi}$ is the boundary of the projection of the ellipsoid $E:=\{\operatorname{vect}(\tilde{\Delta})^\top \tilde{G} \operatorname{vect}(\tilde{\Delta}) \leq r\}$ onto our target lower-dimensional space. Any point $Y\in\partial E_{pi}$ is the projection of a point $X\in\partial E$, and both points lie on the tangent space of $\partial E$ parallel to the direction of $\tilde{\Delta}_{2}$. Hence, on these points $\nabla_{\tilde{\Delta}_{2}}E=0$, which gives $m-m_{i}$ algebraic equations. By solving these equations for the components of $\tilde{\Delta}_{2}$ in terms of the components of $\tilde{\Delta}_{1}$, and subsequently substituting them into $E$, we obtain the projected ellipsoid $E_{pi}$, which can be written in the following compact form:

\operatorname{vect}(\tilde{\Delta}_1)^\top (U - M^\top T^{-1} M) \operatorname{vect}(\tilde{\Delta}_1) \leq r.  (25)

where $U$, $M$, and $T$ are the blocks of the partitioned $\tilde{G}$, i.e.,

\tilde{G} = \left[\begin{array}{c|c} U & M^\top \\ \hline M & T \end{array}\right].  (28)

We can easily construct this partition noting that $U\in\mathbb{R}^{n(n+m_{i})\times n(n+m_{i})}$. By sorting the entries of the vector $\operatorname{vect}(\tilde{\Delta}_{1})$ back, in order, into a matrix $\Delta^{pi}\in\mathbb{R}^{(n+m_{i})\times n}$, the projected confidence ellipsoid (25) can be rewritten in the standard format

\operatorname{Tr}({\Delta^{pi}}^\top V^{pi} \Delta^{pi}) \leq r.

where the covariance matrix $V^{pi}$ satisfies $I_{(n\times n)}\otimes V^{pi}=U-M^{\top}T^{-1}M$.
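Computationally, since $\tilde{G}=I\otimes\tilde{V}$, the Schur complement in (25) reduces to the same operation on $\tilde{V}$ itself. The following minimal numpy sketch (with illustrative names and data of our own choosing) performs this projection; the projected center is simply the corresponding rows of $\hat{\Theta}$.

import numpy as np

def project_ellipsoid(V, Theta_hat, keep, drop):
    # Project the ellipsoid Tr(Delta^T V Delta) <= r onto the rows in `keep`
    # (the n state rows plus the m_i rows of the active actuators).
    idx = np.array(keep + drop)
    Vt = V[np.ix_(idx, idx)]                 # reorder so kept rows come first
    k = len(keep)
    U, Mt, M, T = Vt[:k, :k], Vt[:k, k:], Vt[k:, :k], Vt[k:, k:]
    V_pi = U - Mt @ np.linalg.solve(T, M)    # Schur complement, cf. (25)
    return V_pi, Theta_hat[keep, :]          # the radius r is unchanged

# Example: n = 2 states, m = 3 actuators, mode i keeps actuators {0, 2}.
n, m = 2, 3
V = np.eye(n + m) + 0.1 * np.ones((n + m, n + m))
Theta_hat = np.zeros((n + m, n))
keep, drop = list(range(n)) + [n + 0, n + 2], [n + 1]
V_pi, Theta_pi = project_ellipsoid(V, Theta_hat, keep, drop)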

III-E Augmentation

Let us call the mode consisting of all actuators (i.e., mode $i=1$) the "central mode". Its associated confidence ellipsoid $\mathcal{C}_{t}(\delta)$ contains all information up to time $t$ and can be computed at all times, even when the system operates in a different mode. Indeed, when the system is in mode $i$ with the following dynamics:

x_{t+1} = {\Theta_*^i}^\top z^i_t + \omega_{t+1}, \quad z^i_t = \begin{pmatrix} x_t \\ u^i_t \end{pmatrix}  (29)

we can equivalently rewrite it as follows:

x_{t+1} = \Theta_*^\top z_t + \omega_{t+1}, \quad z_t = \begin{pmatrix} x_t \\ u_t \end{pmatrix}  (30)

where $u_{t}\in\mathbb{R}^{m}$ and

u_t(\mathcal{B}_i) = u_t^i, \quad u_t(\mathcal{B}_i^c) = 0  (31)

with $\mathcal{B}_{i}^{c}$ being the complement of $\mathcal{B}_{i}$.

In turn, the estimation process of Section III-B can be performed on (30) to obtain updated estimates of $\Theta_{*}$, encoded by the central ellipsoid $\mathcal{C}_{t}(\delta)$.

We refer to this simple mathematical operation as the "augmentation technique"; using it, we can update the central ellipsoid regardless of the actuation mode.
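A minimal sketch of this operation follows, under the assumption that the central covariance is stored as a matrix and $\mathcal{B}_i$ as a list of column indices (both our own representation choices):

import numpy as np

def augment_and_update(V_central, x_t, u_mode, mode_cols, m):
    # Zero-pad the mode-i input to full dimension m, per (31), then update
    # the central covariance with the augmented regressor z_t = (x_t; u_t).
    u_full = np.zeros(m)
    u_full[mode_cols] = u_mode               # u_t(B_i) = u_t^i, u_t(B_i^c) = 0
    z_t = np.concatenate([x_t, u_full])
    return V_central + np.outer(z_t, z_t)    # V <- V + z_t z_t^T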

IV Projection-based strategy

IV-A Overview

A possible way to tackle the "learn and control while switching" problem is to apply the OSLO algorithm described in [5] repeatedly, each time a switch occurs. However, such a naive approach is likely inefficient because (1) it ignores previously learned information and (2) it runs a warm-up phase after each switch. As a (provably) better alternative, we propose a projection-based strategy which relies on the central ellipsoid to incorporate previously learned information in the control design procedure. A controller that plays optimistically with respect to this set is then computed by solving a relaxed SDP.

To clarify the main idea, consider a simple example of a control system with three actuators, which offers seven actuating modes, assuming they all satisfy the controllability condition. Let $M_{1}$ denote mode 1, which takes advantage of all the actuators. The modes $\{M_{2},M_{3},M_{4}\}$ and $\{M_{5},M_{6},M_{7}\}$ are those with two actuators and a single actuator, respectively. We can consider $M_{1}$ as the central mode, which is at the same time the parent of all other actuating modes. Figure 1 illustrates the relationships between these actuating modes.

Figure 1: Parent-Child Dependency

The dashed lines indicate how the modes are interconnected. For instance, the column sets of modes $M_{5}$ and $M_{6}$ are subsets of the column set of mode $M_{2}$. We break the strategy into two phases, the warm-up phase and the Learn and Control while Switching (LCS) phase, whose pseudocode is provided in Algorithms 1 and 2. The strategy starts with Algorithm 1, which applies a given $(\kappa_{0},\gamma_{0})$-strongly stabilizing feedback policy $K_{0}$ for $T_{0}$ time steps to provide Algorithm 2 with sufficiently close parameter estimates $\Theta_{0}$. To compute the warm-up duration $T_{0}$, we use the recent results of [36] on the existence of a stabilizing neighborhood around the unknown parameters of the system. We compute $T_{0}$ such that the controller designed by Algorithm 2 at time $T_{0}$, which relies on the estimate $\Theta_{0}$, performs better than the initial stabilizing controller $K_{0}$ and ensures a favorable overall regret. The computation of this duration is detailed in Theorem 1.

After the warm-up phase, Algorithm 2 is applied whenever a switch occurs. This algorithm has access to the updated central ellipsoid, which it maintains through the augmentation technique. LCSA projects the central ellipsoid onto the space of the actuating mode $M_{i}$, the mode the system operates in for the epoch (the time interval between two subsequent switches of actuating mode). During operation in an epoch, the algorithm applies OFU for the actuating mode $M_{i}$ to obtain the control input. In parallel, it updates (i.e., enriches) the central ellipsoid using the augmentation technique.

In the next section, we propose the LCSA algorithm for the TTP application, where the switch sequence (both switch times and modes) is given upfront. We define this sequence by $\mathcal{I}=[i_{\tau_{1}}\;i_{\tau_{2}}\;...\;i_{\tau_{ns}}]$, where $i_{\tau_{j}}\in\mathcal{M}$, the $\tau_{j}$ are switch times, and $ns$ is the total number of switches. Having $\Theta_{0}$ provided by Algorithm 1, we reset the time and let the switch times be $\tau_{j}\in\{\tau_{1},\tau_{2},...,\tau_{ns}\}$ with $\tau_{1}=0$. We denote the time interval between two subsequent switches, i.e., $[\tau_{k},\tau_{k+1}]$, as epoch $k$. This algorithm, however, can be slightly modified to address the FTC and MTD classes of applications. We will discuss these modifications in Section VI.

Algorithm 1 Warm-Up Phase
1: Inputs: $T_{0}$, $\vartheta>0$, $\delta>0$, $\sigma_{\omega}$, $\eta_{t}\sim\mathcal{N}(0,2\sigma_{\omega}^{2}\kappa_{0}^{2}I)$
2: Set $\lambda=\sigma_{\omega}^{2}\vartheta^{-2}$, $V_{0}=\lambda I$
3: for $t=0:T_{0}$ do
4:   Observe $x_{t}$
5:   Play $u_{t}=K_{0}x_{t}+\eta_{t}$
6:   Observe $x_{t+1}$
7:   Using $u_{t}$ and $x_{t}$, form $z_{t}$, save $(z_{t},x_{t+1})$, and update the ellipsoid according to the procedure given in IV-B1, i.e., (32)-(34)
8: end for
9: Output: $\Theta_{0}$, the center of the constructed ellipsoid
Algorithm 2 Learn and Control while Switching Algorithm (LCSA)
1: Inputs: $\mathcal{I}=[i_{\tau_{1}}\;i_{\tau_{2}}\;...\;i_{\tau_{ns}}]$, $\Theta_{0}$, $\lambda=8\kappa_{*}^{8}\sqrt{n+m}\,\frac{4\bar{\mu}\bar{\nu}}{\alpha_{0}\sigma^{2}}$
2: Initialize the central ellipsoid as in Section IV-B1
3: Set $\tau_{1}=0$
4: for $k=1:ns$ do
5:   Project the central ellipsoid $\mathcal{C}_{\tau_{k}}(\delta)$ onto the space of mode $i_{\tau_{k}}$ and obtain
\operatorname{Tr}((\Theta^{i_{\tau_k}}-\hat{\Theta}_{\tau_k}^{i_{\tau_k}})^{\top}V^{pi_{\tau_k}}_{[0\;\tau_k]}(\Theta^{i_{\tau_k}}-\hat{\Theta}_{\tau_k}^{i_{\tau_k}}))\leq r_{\tau_k}
6:   Set $V^{i_{\tau_{k}}}_{[0,\tau_{k}]}=V^{pi_{\tau_{k}}}_{[0\;\tau_{k}]}$ and $r^{i}=r_{\tau_{k}}$
7:   for $\tau_{k}\leq t<\tau_{k+1}$ do
8:     Receive state $x_{t}$
9:     Set $\mu_{i_{\tau_{k}}}=r^{i_{\tau_{k}}}+(1+r^{i_{\tau_{k}}})\vartheta\|V^{i_{\tau_{k}}}_{[0\;t]}\|^{0.5}$
10:    Set $\tau=\tau_{k}$
11:    if $\det(V^{i}_{[0,\;t]})>2\det(V^{i}_{[0,\;\tau]})$ or $t=\tau_{k}$ then
12:      Start a new episode and set $\tau=t$
13:      Estimate system parameters: compute the least squares estimate $\hat{\Theta}^{i_{\tau_{k}}}_{t}$ by solving (40)
14:      Compute policy: solve the relaxed SDP program (45) for $\Sigma(\Theta^{i_{\tau_{k}}}_{t},Q,R^{i_{\tau_{k}}})$
15:      Set $\bar{K}_{i_{\tau_{k}}}=\Sigma_{ux}(\Theta^{i_{\tau_{k}}}_{t},Q,R^{i_{\tau_{k}}})\Sigma^{-1}_{xx}(\Theta^{i_{\tau_{k}}}_{t},Q,R^{i_{\tau_{k}}})$
16:    else
17:      Set $\bar{K}_{i_{\tau_{k}}}=K(\Theta^{i_{\tau_{k}}}_{\tau},Q,R^{i_{\tau_{k}}})$
18:    end if
19:    Apply $u^{i_{\tau_{k}}}_{t}=\bar{K}_{i_{\tau_{k}}}x_{t}$ and observe the new state $x_{t+1}$
20:    Save $(z^{i_{\tau_{k}}}_{t},x_{t+1})$ into the database
21:    Update $V^{i_{\tau_{k}}}_{[0,\;t+1]}=V^{i_{\tau_{k}}}_{[0,\;t]}+z^{i_{\tau_{k}}}_{t}{z^{i_{\tau_{k}}}_{t}}^{\top}$
22:    Compute $r^{i_{\tau_{k}}}=\big(\sigma_{\omega}\sqrt{2n\log\frac{n\det(V^{i_{\tau_{k}}}_{[0\;t+1]})}{\delta\det(V^{i_{\tau_{k}}}_{[0,\tau_{k}]})}}+\sqrt{r_{\tau_{k}}}\big)^{2}$
23:    Apply the augmentation technique (Section III-E) to update the central ellipsoid
24:  end for
25: end for

IV-B Main steps of Algorithms

For the sake of readers more interested in implementation than in the algorithms' theoretical underpinnings, we start by surveying the main steps of Algorithms 1 and 2. We are mostly concerned with explaining the computations and algebraic manipulations performed in each step, and provide details about the rationale for and properties of these steps under the umbrella of performance analysis in the next section. Proofs of most of these results are relegated to the appendix.

IV-B1 System Identification

The system identification step aims at building a confidence set around the system's true but unknown parameters. This step plays a major role in both Algorithm 1 and Algorithm 2.

Confidence Set construction of warm-up phase

In Algorithm 1, using data obtained by executing the control input $u_{t}=K_{0}x_{t}+\eta_{t}$ (which consists of a linear strongly stabilizing term and extra exploratory noise $\eta_{t}\sim\mathcal{N}(0,2\sigma_{\omega}^{2}\kappa_{0}^{2}I)$), a confidence set is constructed around the true parameters of the system by following the steps of Section III-B. For the warm-up phase we let $\Theta_{0}=0$ in (14), which results in

\hat{\Theta}_t = \operatorname*{argmin}_\Theta e(\Theta) = V_{[0,\;t]}^{-1} Z_t^\top X_t.  (32)

where the covariance matrix $V_{[0,\;t]}$ is constructed as in Section III-B. Recalling that $\|\Theta_{*}\|_{*}\leq\vartheta$ by assumption, a high-probability confidence set around the true but unknown parameters of the system is constructed as

\mathcal{C}_t(\delta) := \big\{ \Theta \in \mathbb{R}^{(n+m)\times n} \;|\; \operatorname{Tr}\big((\Theta-\hat{\Theta}_t)^\top V_{[0,\;t]} (\Theta-\hat{\Theta}_t)\big) \leq r_t \big\}  (33)

where

r_t = \bigg( \sigma_\omega \sqrt{2n \log \frac{n \det(V_{[0,\;t]})}{\delta \det(\lambda I)}} + \sqrt{\lambda}\,\vartheta \bigg)^2.  (34)

For the warm-up phase we set $\lambda=\sigma_{\omega}^{2}\vartheta^{-1}$.

Algorithm 1 runs until the parameter estimates reside in the $(\kappa_{0},\gamma_{0})$-stabilizing neighborhood $\mathcal{N}_{(\kappa_{0},\gamma_{0})}(\Theta_{*})$. Per Definition 2, the first stabilizing controller $K(\Theta_{0},Q,R)$ designed by Algorithm 2, where $\Theta_{0}=\hat{\Theta}_{T_{0}}$, is guaranteed to yield an average expected cost lower than that of $K_{0}$ when applied in mode 1. This, in essence, ensures that Algorithm 2 commences with a feedback gain at least as effective as the initial stabilizing feedback gain $K_{0}$. Additionally, to secure joint stabilization and effectively manage regret, it is imperative to ensure the proximity of the estimate $\Theta_{0}$. The required duration is determined by the theorem below.

Theorem 1

For a given $(\kappa_{0},\gamma_{0})$-strongly stabilizing gain $K_{0}$, there exist explicit constants $C_{0}$ and $\epsilon_{0}=\mathrm{poly}(\alpha_{0}^{-1},\alpha_{1},\vartheta,\nu,n,m)$ such that the following holds. Let $T_{0}$ be the smallest time that satisfies:

\bigg( \sigma_\omega \sqrt{2n\Big(\log\frac{n}{\delta} + \log\Big(1+\frac{300\sigma_\omega^2\kappa_0^4}{\gamma_0^2}(n+\vartheta^2\kappa^2_0)\log\frac{T_0}{\delta}\Big)\Big)} + \sqrt{\lambda}\vartheta \bigg)^2 \frac{80}{T_0\sigma^2_\omega} \leq \min\bigg\{ \kappa^2_0\frac{\alpha_0\sigma^2_\omega}{C_0},\; \epsilon^2_0,\; \frac{1}{\lambda} \bigg\} := \epsilon^2.  (35)

Then $K(\hat{\Theta}_{T_{0}},Q,R)$, obtained by Algorithm 2 applied to the system $\Theta_{*}$, has an average expected cost $J(\hat{\Theta}_{T_{0}},Q,R)$ lower than that of the policy $K_{0}$. Furthermore, requiring $\epsilon^{2}\leq\frac{1}{\lambda}$ is a sufficient condition for $(\kappa,\gamma)$-strong sequential stability of the closed-loop system under the policy produced by Algorithm 2.

The regret associated with the warm-up algorithm is of order $\mathcal{O}(\sqrt{T})$. To substantiate this assertion, let $T$ denote the specified horizon for implementing Algorithm 2. By definition, we have $\epsilon^{2}\leq 1/\lambda$, where $\lambda=8\kappa_{*}^{8}\sqrt{n+m}\,\frac{4\bar{\mu}\bar{\nu}}{\alpha_{0}\sigma^{2}}$, as the condition for strong sequential stability (see the proofs of Lemmas 9 and 10). This, along with the definition of $\bar{\mu}$, indicates that $\lambda=\mathcal{O}(\sqrt{T})$. Now, based on equation (35), we deduce that $T_{0}=\mathcal{O}(\sqrt{T})$. Given Proposition 1, which establishes that the regret associated with a fixed strongly stabilizing policy $K_{0}$ over $T_{0}$ rounds scales as $\mathcal{O}(T_{0})$, we conclude that within $T$ rounds the warm-up phase incurs a regret of $\mathcal{O}(\sqrt{T})$.
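For illustration, a minimal Python simulation of the warm-up phase (Algorithm 1) is sketched below; the system matrices, the gain $K_0$, and all constants are placeholder assumptions for the example, not values from the paper.

import numpy as np

rng = np.random.default_rng(0)
n, m, T0, sigma_w, kappa0, vartheta = 2, 2, 500, 0.1, 5.0, 5.0
A = np.array([[0.9, 0.1], [0.0, 0.8]])       # illustrative true plant
B = np.eye(m)
K0 = -0.5 * np.eye(m)                        # assumed strongly stabilizing gain
Z, X = [], []
x = np.zeros(n)
for t in range(T0):
    eta = rng.normal(0.0, np.sqrt(2.0) * sigma_w * kappa0, size=m)  # exploratory noise
    u = K0 @ x + eta                         # u_t = K0 x_t + eta_t
    x_next = A @ x + B @ u + rng.normal(0.0, sigma_w, size=n)
    Z.append(np.concatenate([x, u])); X.append(x_next)
    x = x_next
Z, X = np.array(Z), np.array(X)
lam = sigma_w**2 * vartheta**(-2)            # lambda = sigma_w^2 vartheta^{-2} (Algorithm 1)
V = lam * np.eye(n + m) + Z.T @ Z
Theta0 = np.linalg.solve(V, Z.T @ X)         # center of the ellipsoid, cf. (32)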

Confidence Set construction of LCSA algorithm

Having been run in mode 1, the warm-up phase outputs $\Theta_{0}$ as an initial estimate, from which the central ellipsoid, an input to Algorithm 2, is constructed as follows:

\mathcal{C}_{\tau_1}(\delta) = \{ \Theta \in \mathbb{R}^{(n+m)\times n} \;|\; \operatorname{Tr}((\Theta-\Theta_0)^\top V_0 (\Theta-\Theta_0)) \leq r_0 \}

where $V_{0}=\lambda I$ with $\lambda=8\kappa_{*}^{8}\sqrt{n+m}\,\frac{4\bar{\mu}\bar{\nu}}{\alpha_{0}\sigma^{2}}$ and $r_{0}=\lambda\epsilon^{2}$. This initial confidence ellipsoid is, in fact, equivalent to $\|\Theta-\Theta_{0}\|^{2}\leq\epsilon^{2}$. We apply the procedure of Section III-B together with the augmentation technique to update the central ellipsoid.

Suppose that at time $\tau_{q}$ the system switches to mode $i$. The associated inherited confidence ellipsoid for mode $i$ can be derived through the projection of the central ellipsoid, as discussed in Section III-D, and can be expressed as follows:

\mathcal{C}^{pi}_{\tau_q}(\delta) := \{ \Theta^i \in \mathbb{R}^{(n+m_i)\times n} \;|\; \operatorname{Tr}\big((\Theta^i-\hat{\Theta}^{pi}_{\tau_q})^\top V^{pi}_{[0\;\tau_q]} (\Theta^i-\hat{\Theta}^{pi}_{\tau_q})\big) \leq r_{\tau_q} \}.  (37)

The following theorem provides a confidence bound on the error of parameter estimation when there is an inherited confidence set.

Theorem 2

Let the linear model (4) satisfy Assumption 1 with known $\sigma_{\omega}$. Assume the system switches to mode $i$ at an arbitrary time $\tau_{q}$ and runs for the time period $t\in[\tau_{q},\;\tau_{q+1}]$ (i.e., the $(q+1)$-th epoch). Let the inherited information, which is a projection of the central ellipsoid, be given by (37). Then with probability at least $1-\delta$ we have

\operatorname{Tr}((\Theta^i-\hat{\Theta}^i_t)^\top V^i_{[0\;t]} (\Theta^i-\hat{\Theta}^i_t)) \leq r^i  (38)

where

r^i = \bigg( \sigma_\omega \sqrt{2n \log \frac{n \det(V^i_{[0\;t]})}{\delta \det(V_{[0\;\tau_q]}^{pi})}} + \sqrt{r_{\tau_q}} \bigg)^2,  (39)
\hat{\Theta}^i_t = {V^i}^{-1}_{[0,\;t]} \big( Z^\top X + V^{pi}_{[0,\;\tau_q]} \hat{\Theta}^{pi}_{\tau_q} \big),  (40)

and

V_{[0\;t]}^i = V_{[0\;\tau_q]}^{pi} + V^i_{[\tau_q\;t]}, \quad V^i_{[\tau_q\;t]} = \sum_{s=\tau_q}^{t} z^i_s {z^i_s}^\top,
Z^\top X = \sum_{s=\tau_q}^{t} z^i_s {x^i_s}^\top  (41)

The second term on the right-hand side of (39) relates to the information inherited from the central ellipsoid, while the first term is the contribution of learning in the running epoch. Similarly, in the definition of $V^{i}_{[0\;t]}$ provided by (41), the term $V_{[0\;\tau_{q}]}^{pi}$ is derived from the inherited information, while $V^{i}_{[\tau_{q}\;t]}$ represents the contribution of learning during the current epoch.
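A minimal numpy sketch of the fused estimator (40)-(41) and radius (39) follows, assuming the projected quantities from Section III-D are available; the names are illustrative.

import numpy as np

def epoch_estimate(V_pi, Theta_pi, Z_ep, X_ep):
    # Fuse the inherited (V_pi, Theta_pi) with epoch data, cf. (40)-(41).
    V_i = V_pi + Z_ep.T @ Z_ep                                        # (41)
    Theta_i = np.linalg.solve(V_i, Z_ep.T @ X_ep + V_pi @ Theta_pi)   # (40)
    return Theta_i, V_i

def epoch_radius(V_i, V_pi, r_tau, sigma_w, n, delta):
    # Confidence radius (39) with inherited radius r_tau.
    logdet_ratio = np.linalg.slogdet(V_i)[1] - np.linalg.slogdet(V_pi)[1]
    return (sigma_w * np.sqrt(2 * n * (np.log(n / delta) + logdet_ratio))
            + np.sqrt(r_tau)) ** 2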

IV-C Control design

The relaxed SDP problem, which accounts for the uncertainty in the parameter estimates, is used for control design. Given the parameter estimate $\hat{\Theta}^{i}_{t}$ and the covariance matrix $V^{i}_{[0,\;t]}$ obtained by (40) and (41), the relaxed SDP is formulated as follows:

\min_\Sigma \quad \begin{pmatrix} Q & 0 \\ 0 & R^i \end{pmatrix} \bullet \Sigma
\text{s.t.} \quad \Sigma_{xx} \succeq {\hat{\Theta}^i_t}^\top \Sigma \hat{\Theta}^i_t + W - \mu_i (\Sigma \bullet {V^i}^{-1}_{[0,\;t]}) I,
\Sigma \succ 0.  (45)

where $\mu_{i}$ is any scalar satisfying $\mu_{i}\geq r^{i}+(1+r^{i})\vartheta\|V_{[0,\;t]}^{i}\|^{1/2}$, and we define $\bar{\mu}\geq r^{i}+(1+r^{i})\vartheta\|V_{[0,\;T]}^{i}\|^{1/2}$. Denoting by $\Sigma(\hat{\Theta}_{t},Q,R^{i})$ the optimal solution of (45), the controller extracted from solving the relaxed SDP is deterministic and linear in the state ($u^{i}=K(\hat{\Theta}^{i}_{t},Q,R^{i})x$), where

K(\hat{\Theta}^i_t, Q, R^i) = \Sigma_{ux}(\hat{\Theta}^i_t, Q, R^i) \Sigma^{-1}_{xx}(\hat{\Theta}^i_t, Q, R^i).

The designed controller is strongly stabilizing in the sense of Definition 1, and the sequence of feedback gains is strongly sequentially stabilizing, as affirmed by Theorem 3. The detailed analysis of the relaxation of the primal SDP formulation, incorporating the constructed confidence ellipsoid to account for parameter uncertainty, is provided in Appendix A-C; it is adapted from [5] with modifications resulting from not normalizing the confidence ellipsoid.

Recurrent updates of the confidence sets may worsen performance, so a criterion is needed to prevent frequent policy switches within an actuation epoch. As such, Algorithm 2 updates the estimate and policy only when the condition $\det(V^{j}_{t})>2\det(V^{j}_{\tau})$ is fulfilled, where $\tau$ is the last time the policy was updated ($\tau$ is used solely within Algorithm 2 and should not be confused with the switch times $\tau_{i}$). This is a standard step in OFU-based algorithms ([30], [33], and [5]).
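The determinant-doubling test is cheap to implement with log-determinants, as in this small sketch of ours (for illustration only):

import numpy as np

def should_update(V_t, logdet_V_tau):
    # True iff det(V_t) > 2 det(V_tau), i.e., the log-det grew by more than log 2.
    return np.linalg.slogdet(V_t)[1] > logdet_V_tau + np.log(2.0)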

IV-D Projection Step

In Algorithm 2 we apply the projection technique described in Section III-D to transfer the information contained in the central ellipsoid to the mode that the system just switched to. The central ellipsoid is consistently updated in Algorithm 2 using the augmentation technique regardless of the active actuating mode.

V Performance Analysis

V-A Stability Analysis

In this section, we first restate the results of [5] showing that the sequence of feedback gains obtained by solving the relaxed SDP (45) is strongly sequentially stabilizing. We then define a Minimum Average Dwell Time (MADT), denoted $\tau_{MAD}$, which ensures boundedness of the state norm in a switching scenario by mandating that the system remain in each mode for at least this duration on average. In the TTP case, this property must hold for the given sequence, i.e., $\tau_{j+1}-\tau_{j}\geq\tau_{MAD}$ for $j\in\{1,...,ns-1\}$. In the MTD and FTC cases, on the other hand, the MADT constrains the switch times.

Solving the relaxed SDP (45) is the crucial step of the algorithm; however, when it comes to the regret bound and stability analysis, the dual program

\max_P \quad P \bullet W
\text{s.t.} \quad \begin{pmatrix} Q-P & 0 \\ 0 & R^i \end{pmatrix} + \hat{\Theta}^i_t P {\hat{\Theta}^i_t}^\top \succeq \mu_i \|P\|_* {V^i}^{-1}_{[0,\;t]},
P \succeq 0  (49)

and its optimal solution $P(\hat{\Theta}^{i}_{t},Q,R^{i})$ play a pivotal role.
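For $P\succeq 0$ the trace norm satisfies $\|P\|_{*}=\operatorname{Tr}(P)$, so (49) is a linear SDP. A minimal cvxpy sketch follows, with assumed, purely illustrative data (cvxpy itself is our tooling choice, not the paper's):

import numpy as np
import cvxpy as cp

n, m_i = 2, 2
Q, R_i, W = np.eye(n), np.eye(m_i), 0.01 * np.eye(n)
Theta_hat = 0.1 * np.ones((n + m_i, n))          # assumed estimate for mode i
V_inv = np.linalg.inv(10.0 * np.eye(n + m_i))    # assumed V^{i-1}_{[0,t]}
mu_i = 1.0

P = cp.Variable((n, n), PSD=True)
blk = cp.bmat([[Q - P, np.zeros((n, m_i))],
               [np.zeros((m_i, n)), R_i]])
lmi = blk + Theta_hat @ P @ Theta_hat.T >> mu_i * cp.trace(P) * V_inv
prob = cp.Problem(cp.Maximize(cp.trace(W @ P)), [lmi])
prob.solve()                                     # optimal P(Theta_hat, Q, R^i)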

Recalling the sequential strong stability definition, the following theorem, adapted from [5], shows that the sequence of policies obtained by solving the relaxed SDP (45) for the system actuating in mode $i$ is strongly sequentially stabilizing for the parameters $\kappa_{i}$ and $\gamma_{i}$ defined below.

Theorem 3

Let $P(\hat{\Theta}^{i}_{t},Q,R^{i})$ for $t\in\{\tau_{q},\tau_{q}+1,...\}$ be solutions of the dual program (49) associated with an epoch occurring in the time interval $[\tau_{q}\;\tau_{q+1}]$ and mode $i$, with corresponding $\Theta^{i}_{t}$ and $V^{i}_{t}$. Let $\kappa_{i}=\sqrt{2\nu_{i}/\alpha_{0}\sigma^{2}}$ and $\gamma_{i}=1/2\kappa_{i}^{2}$ with $\mu_{i}\geq r^{i}+(1+r^{i})\vartheta\|V^{i}\|^{1/2}$. Then the sequence of policies $K(\hat{\Theta}^{i}_{t},Q,R^{i})$ (associated with $P(\hat{\Theta}^{i}_{t},Q,R^{i})$) is $(\kappa_{i},\gamma_{i})$-sequentially strongly stabilizing with probability at least $1-\delta$.

Proof:

This theorem is a direct result of Lemmas 9 and 10, provided and proved in the Appendix. ∎

Recalling Definition 1, a sequentially stabilizing policy keeps the state of the system bounded within an epoch. The following lemma gives this upper bound in terms of the stabilizing policy parameters $\kappa_{i}$ and $\gamma_{i}$.

Lemma 1

For the system evolving in an arbitrary actuating mode $i$ over the time interval $[\tau_{q},\tau_{q+1}]$ under sequentially stabilizing controllers, for any $t\in[\tau_{q},\tau_{q+1}]$ the state is upper bounded by

\|x_t\| \leq \kappa_i e^{-\gamma_i (t-(\tau_q+1))/2} \|x_{\tau_q}\| + \frac{20\kappa_i}{\gamma_i} \sigma_\omega \sqrt{n \log \frac{t-\tau_q}{\delta}}  (50)
Proof:

The proof is straightforward, but for the sake of completeness we provide it here. Applying the sequentially stabilizing policy implies (18); then using the Hanson-Wright concentration inequality, with probability at least $1-\delta/2(t-\tau_{q})$,

\max_{\tau_q \leq s \leq t} \|w_s\| \leq 10\sigma_\omega \sqrt{n \log \frac{t-\tau_q}{\delta}},  (51)

which completes the proof. ∎

Having stable subsystems is necessary but not sufficient for the stability of a switched system: the switches must also occur at a sufficiently slow rate on average. The following theorem establishes a lower bound on the average time elapsed between two consecutive switches, referred to as the "dwell time," which serves as a sufficient condition for boundedness.

Theorem 4

Consider the switch sequence $\mathcal{I}=[i_{\tau_{1}},i_{\tau_{2}},\ldots,i_{\tau_{ns}}]$, and let the system (3) be controlled by $(\kappa_{i_{\tau_{j}}},\gamma_{i_{\tau_{j}}})$-strongly sequentially stabilizing controllers for every mode $i_{\tau_{j}}\in\mathcal{I}$. Then, for a user-defined $0<\chi<1$, by requiring $\tau_{j+1}-\tau_{j}\geq\tau_{MAD}$, where

\[
\tau_{\text{MAD}}:=\left\lceil\frac{\ln\kappa^{*}-\ln(1-\chi)}{-\ln(1-\gamma^{*})}\right\rceil
\tag{52}
\]

in which

\[
\kappa^{*}=\max_{i\in\{1,\ldots,|\mathcal{B}^{*}|\}}\kappa_{i},\qquad\gamma^{*}=\frac{1}{2\kappa^{*}},
\tag{53}
\]

the state norm will be bounded as follows:

\[
\|x_{t}\|\leq\kappa^{*}(1-\chi)^{t}\|x_{0}\|+\frac{\bar{L}}{\chi}
\tag{54}
\]

where

\[
\bar{L}:=\frac{20\kappa^{*}}{\gamma^{*}}\sigma_{\omega}\sqrt{n\log\frac{t}{\delta}}.
\]
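As a quick sanity check of (52), the following Python snippet computes the minimum average dwell time for purely illustrative values of $\kappa^{*}$ and $\chi$ (both are assumptions for the example, not values from the paper):

```python
import math

kappa_star = 5.0                     # hypothetical kappa* = max_i kappa_i
gamma_star = 1 / (2 * kappa_star)    # gamma* as in (53)
chi = 0.5                            # user-defined 0 < chi < 1

# tau_MAD = ceil((ln kappa* - ln(1 - chi)) / (-ln(1 - gamma*))), eq. (52)
tau_mad = math.ceil((math.log(kappa_star) - math.log(1 - chi))
                    / (-math.log(1 - gamma_star)))
print(tau_mad)   # minimum average dwell time, in time steps (here: 22)
```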

V-B Regret Bound Analysis

Taking into account that any confidence ellipsoid $\mathcal{C}_{t}^{i}(\delta)$ can be obtained by projecting the central ellipsoid $\mathcal{C}_{t}(\delta)$, we can define the following unified "good event," using the terminology of [30]:

\[
\mathcal{E}_{t}=\Big\{\operatorname{Tr}\big(\Delta^{\top}_{s}V_{[0,s]}\Delta_{s}\big)\leq r_{s},\;\;\|z_{s}\|^{2}\leq 4{\kappa^{*}}^{2}(1-\chi)^{2s}\|x_{0}\|^{2}+\frac{\bar{\Upsilon}}{2{\gamma^{*}}^{2}},\;\;\forall s=1,\ldots,t\Big\}
\tag{55}
\]

where

\[
\bar{\Upsilon}=\frac{3200{\kappa^{*}}^{6}\sigma_{\omega}^{2}n}{\chi^{2}}\log\frac{T}{\delta}.
\]

The upper bound on $\|z_{t}\|^{2}$ is obtained from (103), as derived in the proof of Lemma 11 in Appendix A-D. The event $\mathcal{E}_{t}$ holds for any sequence of switches and is used to carry out the regret bound analysis.

By an appropriate decomposition of the regret on the event $\mathcal{E}_{t}$, for an arbitrary switch sequence $\mathcal{I}=[i_{\tau_{1}},i_{\tau_{2}},\ldots,i_{\tau_{ns}}]$, the following upper bound on the regret is obtained (adapted from [5]):

\[
R_{T}\leq R_{1}+R_{2}+R_{3}+R_{4}
\tag{56}
\]

where:

\[
R_{1}=\sum_{t=0}^{T-1}\big(x_{t}^{\top}P(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})x_{t}-x_{t+1}^{\top}P(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})x_{t+1}\big)1_{\mathcal{E}_{t}},
\tag{57}
\]
\[
R_{2}=\sum_{t=0}^{T}w^{\top}_{t}P(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})\big(A_{*}+B^{\sigma(t)}_{*}K(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})\big)x_{t}1_{\mathcal{E}_{t}},
\tag{58}
\]
\[
R_{3}=\sum_{t=0}^{T}\big(w^{\top}_{t}P(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})w_{t}-\sigma^{2}\|P(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})\|_{*}\big)1_{\mathcal{E}_{t}},
\tag{59}
\]
\[
R_{4}=\frac{4\nu}{\sigma_{\omega}^{2}}\sum_{t=0}^{T}\mu_{\sigma(t)}\big(z^{\top}_{t}{V^{\sigma(t)}}^{-1}_{[0,t]}z_{t}\big)1_{\mathcal{E}_{t}},
\tag{60}
\]

in which $\sigma(t)=i_{\tau_{j}}$ for $\tau_{j}\leq t\leq\tau_{j+1}$ and $1_{\mathcal{E}_{t}}$ is the indicator function. By bounding the above terms separately, the overall regret, not counting the warm-up regret, can be obtained. The following theorem summarizes the result.

Theorem 5

The regret of the projection-equipped algorithm over horizon $T$ with $ns$ switches is $\mathcal{\bar{O}}\big(\sqrt{T}\big)+\mathcal{O}\big(ns\sqrt{T}\big)$ with probability at least $1-\delta$. Moreover, this strategy is more efficient than repeatedly and naively applying the basic OFU algorithm.

VI Modifications to Address the MTD and FTC Scenarios

Algorithm 2, with minor adjustments, can also address the FTC and MTD problems, where switch sequence information (actuating modes and switch times) is unavailable. This section proposes possible directions for adjusting the strategy to these applications.

For the FTC application, leaving aside the switch time, which is supposedly determined by an external unit (e.g., a detector), the remaining challenge is determining the next actuating mode to switch to. From the set of available modes $\mathcal{M}$, the mode that guarantees the lower regret is a rational choice. The proofs of Lemmas 13 and 16 in the regret bound analysis justify that switching to the mode whose covariance matrix has the highest determinant guarantees achieving the lower regret bound. In other words, at time $t$ the best mode, regret-wise, is $j^{*}:=\arg\max_{i\in\mathcal{M}}\det(V^{pi}_{[0,t]})$, where $V^{pi}_{[0,t]}$ is the projection of the central ellipsoid onto the space of mode $i\in\mathcal{M}$; a sketch of this selection rule follows.
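The snippet below sketches this rule in Python, assuming (as an assumption on our part, consistent with how the central ellipsoid is projected) that projecting onto the coordinates of mode $i$ amounts to taking the Schur complement of the discarded block of $V_{[0,t]}$; the names `project_covariance`, `best_mode`, and `mode_coords` are hypothetical:

```python
import numpy as np

def project_covariance(V, idx):
    # Shape matrix of the projection of {x : x^T V x <= r} onto the
    # coordinates in idx: the Schur complement of the discarded block.
    idx = np.asarray(idx)
    rest = np.setdiff1d(np.arange(V.shape[0]), idx)
    V_ss = V[np.ix_(idx, idx)]
    V_sr = V[np.ix_(idx, rest)]
    V_rr = V[np.ix_(rest, rest)]
    return V_ss - V_sr @ np.linalg.solve(V_rr, V_sr.T)

def best_mode(V_central, mode_coords):
    # j* = argmax_i det(V^{pi}): pick the best-explored mode.
    dets = {i: np.linalg.det(project_covariance(V_central, coords))
            for i, coords in mode_coords.items()}
    return max(dets, key=dets.get)
```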

In the MTD application, where frequent switching is the central part of the strategy, in addition to determining the next actuating mode we must also specify the switch instances so that stability is not violated. To address the latter, we design a MADT (see Theorem 4) and require the algorithm to switch "slowly enough" by dwelling in a mode for at least the MADT units of time. Regarding the choice of the next mode, merely picking the best-learned one (by evaluating the covariance matrices) does not seem practical: this strategy pushes the algorithm toward modes with more actuators (which have larger covariance matrix determinants) and effectively ignores most modes, so the set of candidate modes becomes smaller. In turn, this means less unpredictability and helps the attacker perform an effective intrusion. To resolve this issue, one may switch by optimizing a weighted sum of the determinants of the modes' covariance matrices and the value of some unpredictability measure, which gives a trade-off between lower regret and unpredictability. The authors in [4] use an information entropy measure produced by a probability simplex when the system is known. A randomized sketch of such a trade-off is given below.
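One simple, hedged instantiation of this trade-off, reusing `project_covariance` from the previous sketch, is to sample the next mode from a softmax over the log-determinants, where the weight `alpha` is a hypothetical tuning knob rather than a quantity from the paper:

```python
import numpy as np

def mtd_mode_choice(V_central, mode_coords, alpha=0.5, rng=None):
    # alpha large favors well-learned modes (lower regret);
    # alpha -> 0 flattens the distribution (more unpredictability).
    rng = rng or np.random.default_rng()
    modes = list(mode_coords)
    logdets = np.array([np.linalg.slogdet(
        project_covariance(V_central, mode_coords[i]))[1] for i in modes])
    scores = alpha * logdets
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return modes[rng.choice(len(modes), p=probs)]
```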

VII Numerical Experiment

In this section, to demonstrate the effectiveness of the proposed "learn and control while switching" algorithm compared to repeatedly applying the basic OFU algorithm, we use a third-order linear system with three actuators, with the following plant and input matrices:

\[
A_{*}=\begin{pmatrix}10.4&0&-2.7\\ 5.2&-8.1&8.3\\ 0&0.4&-9.0\end{pmatrix},\qquad B_{*}=\begin{pmatrix}-4.7&6.1&-2.9\\ -5.0&5.8&2.5\\ 2.9&0&-7.2\end{pmatrix}
\]

which are unknown to the learner. Weighting cost matrices are chosen as follows:

\[
Q=\begin{pmatrix}6.5&-0.8&-1.4\\ -0.8&5.7&2.6\\ -1.4&2.6&25\end{pmatrix},\qquad R=\begin{pmatrix}40&10&16\\ 10&28&8\\ 16&8&48\end{pmatrix}.
\]

Figure 2 shows the regret of the algorithm for the switch sequence $\{1,2,1\}$, where mode 2 is specified by

\[
B_{*}^{2}=\begin{pmatrix}-4.7&6.1\\ -5.0&5.8\\ 2.9&0\end{pmatrix},\qquad R^{2}=\begin{pmatrix}40&10\\ 10&28\end{pmatrix}.
\]

We let the switches happen at times $\tau_{1}=0$, $\tau_{2}=5000$, and $\tau_{3}=10000$ seconds. We intentionally picked this switch sequence to show the regret both when the confidence set is projected to a lower-dimensional space and when the system switches back to a previously explored mode. We further assume an initial estimate $\Theta_{0}$ with $\|\Theta_{0}-\Theta_{*}\|\leq\epsilon$ is given and let the naive algorithm use this estimate at time $\tau_{2}$, when the system switches to the second mode. We set $T=15000$, $\delta=0.5/T$, $\sigma_{\omega}=0.003$, $\nu=0.8$, and $\epsilon=0.0001$; a sketch of this setup follows.
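For reproducibility, the experiment environment can be set up as in the following sketch; the learning algorithm itself (Algorithm 2) is omitted, and the `step` helper is our own illustrative addition:

```python
import numpy as np

A = np.array([[10.4, 0.0, -2.7],
              [5.2, -8.1, 8.3],
              [0.0, 0.4, -9.0]])
B = np.array([[-4.7, 6.1, -2.9],
              [-5.0, 5.8, 2.5],
              [2.9, 0.0, -7.2]])
Q = np.array([[6.5, -0.8, -1.4],
              [-0.8, 5.7, 2.6],
              [-1.4, 2.6, 25.0]])
R = np.array([[40.0, 10.0, 16.0],
              [10.0, 28.0, 8.0],
              [16.0, 8.0, 48.0]])

# Mode 2 keeps only the first two actuators.
B2, R2 = B[:, :2], R[:2, :2]

T, sigma_w = 15_000, 0.003
switch_times = {0: 1, 5_000: 2, 10_000: 1}   # switch time -> active mode

def step(x, u, B_mode, rng):
    # One step of x_{t+1} = A_* x_t + B_*^i u_t + w_t with Gaussian noise.
    return A @ x + B_mode @ u + sigma_w * rng.standard_normal(3)
```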

At time $\tau_{2}=5000$ seconds, when the switch happens, the projection-equipped algorithm outperforms the naive one thanks to the information learned during the first epoch, which is recovered through the projection tool. At time $\tau_{3}=10000$ seconds, the system switches back to mode 1. During this last epoch, we only see a slight improvement because the switch in policy, triggered by fulfillment of the condition $\det(V_{[0,t]})>\det(V_{[0,\tau]})$ (see line 8 of Algorithm 2), rarely happens as the value of $\det(V_{[0,t]})$ grows. In our case, only one policy improvement occurs during the epoch $[\tau_{3},T]$.

Figure 2: Regret bound of the projection-based algorithm vs. the naive algorithm.

VIII Summary and Conclusion

In this work, we equipped the OFU algorithm with a projection tool to improve the "learn and control" task when switching between different subsets of actuators of a system is mandated. This improvement is proved mathematically in a regret-bound sense and shown through simulation against the case where the basic OFU algorithm is repeatedly applied. We presented this idea for a time-triggered network control system problem in which the sequences of switch times and modes are known upfront. We also discussed possibilities for extending the proposed algorithm to applications such as fault-tolerant control and moving target defense, for which there is freedom to pick the best actuating mode when it is time to switch. To address this, we proposed a mechanism that offers the best actuating mode, considering how rich the learned data is. A minimum average dwell time was also designed to address stability concerns in applications such as moving target defense, where frequent switches might cause instability. Moreover, by applying recent results on the existence of a stabilizing neighborhood, we determined an initial exploration duration such that the efficiency of the warm-up phase is guaranteed.

References

  • [1] H. Ishii and B. A. Francis, “Stabilizing a linear system by switching control with dwell time,” in Proceedings of the 2001 American Control Conference.(Cat. No. 01CH37148), vol. 3.   IEEE, 2001, pp. 1876–1881.
  • [2] J.-X. Zhang and G.-H. Yang, “Supervisory switching-based prescribed performance control of unknown nonlinear systems against actuator failures,” International Journal of Robust and Nonlinear Control, vol. 30, no. 6, pp. 2367–2385, 2020.
  • [3] M. Staroswiecki and D. Berdjag, “A general fault tolerant linear quadratic control strategy under actuator outages,” International Journal of Systems Science, vol. 41, no. 8, pp. 971–985, 2010.
  • [4] A. Kanellopoulos and K. G. Vamvoudakis, “A moving target defense control framework for cyber-physical systems,” IEEE Transactions on Automatic Control, 2019.
  • [5] A. Cohen, T. Koren, and Y. Mansour, “Learning linear-quadratic regulators efficiently with only sqrt(t) regret,” arXiv preprint arXiv:1902.06223, 2019.
  • [6] S. S. Tohidi, Y. Yildiz, and I. Kolmanovsky, “Fault tolerant control for over-actuated systems: An adaptive correction approach,” in 2016 American control conference (ACC).   IEEE, 2016, pp. 2530–2535.
  • [7] M. Blanke, M. Kinnaert, J. Lunze, M. Staroswiecki, and J. Schröder, Diagnosis and fault-tolerant control.   Springer, 2006, vol. 2.
  • [8] R. J. Patton, “Fault-tolerant control: the 1997 situation,” IFAC Proceedings Volumes, vol. 30, no. 18, pp. 1029–1051, 1997.
  • [9] M. Takahashi and T. Takagi, “Adaptive fault-tolerant control based on hybrid redundancy,” Asia-Pacific Journal of Chemical Engineering, vol. 7, no. 5, pp. 642–650, 2012.
  • [10] Q. Yang, S. S. Ge, and Y. Sun, “Adaptive actuator fault tolerant control for uncertain nonlinear systems with multiple actuators,” Automatica, vol. 60, pp. 92–99, 2015.
  • [11] H. Ouyang and Y. Lin, “Adaptive fault-tolerant control for actuator failures: a switching strategy,” Automatica, vol. 81, pp. 87–95, 2017.
  • [12] S. Shaheen, D. Heffernan, and G. Leen, “A gateway for time-triggered control networks,” Microprocessors and Microsystems, vol. 31, no. 1, pp. 38–50, 2007.
  • [13] R. Blind and F. Allgöwer, “On time-triggered and event-based control of integrator systems over a shared communication system,” Mathematics of Control, Signals, and Systems, vol. 25, no. 4, pp. 517–557, 2013.
  • [14] J.-H. Cho, D. P. Sharma, H. Alavizadeh, S. Yoon, N. Ben-Asher, T. J. Moore, D. S. Kim, H. Lim, and F. F. Nelson, “Toward proactive, adaptive defense: A survey on moving target defense,” IEEE Communications Surveys & Tutorials, 2020.
  • [15] S. Weerakkody and B. Sinopoli, “Detecting integrity attacks on control systems using a moving target approach,” in 2015 54th IEEE Conference on Decision and Control (CDC).   IEEE, 2015, pp. 5820–5826.
  • [16] ——, “A moving target approach for identifying malicious sensors in control systems,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton).   IEEE, 2016, pp. 1149–1156.
  • [17] P. Griffioen, S. Weerakkody, and B. Sinopoli, “A moving target defense for securing cyber-physical systems,” IEEE Transactions on Automatic Control, vol. 66, no. 5, pp. 2016–2031, 2020.
  • [18] K. S. Narendra and J. Balakrishnan, “Adaptive control using multiple models,” IEEE transactions on automatic control, vol. 42, no. 2, pp. 171–187, 1997.
  • [19] M. Rotulo, C. De Persis, and P. Tesi, “Online learning of data-driven controllers for unknown switched linear systems,” arXiv preprint arXiv:2105.11523, 2021.
  • [20] J. Kenanian, A. Balkan, R. M. Jungers, and P. Tabuada, “Data driven stability analysis of black-box switched linear systems,” Automatica, vol. 109, p. 108533, 2019.
  • [21] T. Dai and M. Sznaier, “A moments based approach to designing mimo data driven controllers for switched systems,” in 2018 IEEE Conference on Decision and Control (CDC).   IEEE, 2018, pp. 5652–5657.
  • [22] ——, “A convex optimization approach to synthesizing state feedback data-driven controllers for switched linear systems,” Automatica, vol. 139, p. 110190, 2022.
  • [23] S. J. Bradtke, B. E. Ydstie, and A. G. Barto, “Adaptive linear quadratic control using policy iteration,” in Proceedings of 1994 American Control Conference-ACC’94, vol. 3.   IEEE, 1994, pp. 3475–3479.
  • [24] S. Tu and B. Recht, “Least-squares temporal difference learning for the linear quadratic regulator,” arXiv preprint arXiv:1712.08642, 2017.
  • [25] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1467–1476.
  • [26] Y. Abbasi-Yadkori, N. Lazic, and C. Szepesvári, “Model-free linear quadratic control via reduction to expert prediction,” arXiv preprint arXiv:1804.06021, 2018.
  • [27] S. Arora, E. Hazan, H. Lee, K. Singh, C. Zhang, and Y. Zhang, “Towards provable control for unknown linear dynamical systems,” arXiv preprint arXiv:1712.08642, 2018.
  • [28] M. C. Campi and P. Kumar, “Adaptive linear quadratic gaussian control: the cost-biased approach revisited,” SIAM Journal on Control and Optimization, vol. 36, no. 6, pp. 1890–1907, 1998.
  • [29] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Online least squares estimation with self-normalized processes: An application to bandit problems,” arXiv preprint arXiv:1102.2670, 2011.
  • [30] Y. Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” in Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 1–26.
  • [31] M. Ibrahimi, A. Javanmard, and B. V. Roy, “Efficient reinforcement learning for high dimensional linear quadratic systems,” in Advances in Neural Information Processing Systems, 2012, pp. 2636–2644.
  • [32] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “Optimism-based adaptive regulation of linear-quadratic systems,” arXiv preprint arXiv:1711.07230, 2017.
  • [33] J. A. Chekan, K. Azizzadenesheli, and C. Langbort, “Joint stabilization and regret minimization through switching in systems with actuator redundancy,” arXiv preprint arXiv:2105.14709, 2021.
  • [34] S. Lale, K. Azizzadenesheli, B. Hassibi, and A. Anandkumar, “Explore more and improve regret in linear quadratic regulators,” arXiv preprint arXiv:2007.12291, 2020.
  • [35] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “Regret bounds for robust adaptive control of the linear quadratic regulator,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [36] H. Mania, S. Tu, and B. Recht, “Certainty equivalence is efficient for linear quadratic control,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [37] A. Cassel, A. Cohen, and T. Koren, “Logarithmic regret for learning linear quadratic regulators efficiently,” in International Conference on Machine Learning.   PMLR, 2020, pp. 1328–1337.

Appendix A Further Analysis

In this section, we dig deeper and provide proofs and rigorous analysis of the algorithms, the properties of the closed-loop system, and the regret bounds.

A-A Technical Theorems and Lemmas

The following lemma, adapted from [29], gives a self-normalized bound for scalar-valued martingales.

Lemma 2

Let $F_{k}$ be a filtration, $z_{k}$ a stochastic process adapted to $F_{k}$, and $\omega^{i}_{k}$ (the $i$-th element of the noise vector $\omega_{k}$) a real-valued martingale difference, also adapted to $F_{k}$, satisfying the conditional sub-Gaussianity assumption (Assumption 1) with known constant $\sigma_{\omega}$. Consider the martingale and covariance matrices

\[
S^{i}_{t}:=\sum_{k=1}^{t}z_{k-1}\omega^{i}_{k},\qquad V_{t}=\lambda I+\frac{1}{\beta}\sum_{s=1}^{t-1}z_{s}z_{s}^{\top};
\]

then with probability at least $1-\delta$, $0<\delta<1$, we have

\[
\left\lVert S^{i}_{t}\right\rVert^{2}_{V_{t}^{-1}}\leq 2\sigma_{\omega}^{2}\log\bigg(\frac{\det(V_{t})}{\delta\det(\lambda I)}\bigg).
\tag{61}
\]
Proof:

The proof can be found in Appendix B. ∎
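As an illustration only (not part of the proof), the bound (61) can be spot-checked by Monte Carlo for i.i.d. Gaussian noise and a scalar $z_{k}$, with $\beta=1$ and illustrative values of $\lambda$, $\sigma_{\omega}$, and $\delta$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_w, lam, delta, t = 1.0, 1.0, 0.05, 1_000
z = rng.standard_normal(t)
w = sigma_w * rng.standard_normal(t)

S = np.sum(z[:-1] * w[1:])       # S_t = sum_k z_{k-1} w_k
V = lam + np.sum(z[:-1] ** 2)    # V_t = lambda + sum_s z_s^2 (beta = 1)
lhs = S ** 2 / V                 # ||S_t||^2 in the V_t^{-1} norm (scalar case)
rhs = 2 * sigma_w ** 2 * np.log(V / (delta * lam))
print(lhs <= rhs)                # should hold w.p. at least 1 - delta
```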

The following corollary generalizes Lemma 2 to a vector-valued martingale, which will later be used in Theorem 2 to derive the confidence ellipsoid of the estimates.

Corollary 1

Under the same assumptions as in Lemma 2, with $\beta>1$ and defining

\[
S_{t}=Z^{\top}_{t}W_{t}=\sum_{k=1}^{t}z_{k-1}\omega^{\top}_{k},
\]

with probability at least $1-\delta$,

\[
\operatorname{Tr}(S^{\top}_{t}V^{-1}_{t}S_{t})\leq 2\sigma_{\omega}^{2}n\log\bigg(\frac{n\det(V_{t})}{\delta\det(\lambda I)}\bigg).
\tag{62}
\]
Proof:

For proof see Appendix B. ∎

Lemma 3

(Mania et al. [36]) There exist explicit constants $C_{0}$ and $\epsilon_{0}=\mathrm{poly}(\alpha_{0}^{-1},\alpha_{1},\vartheta,\nu,n,m)$ such that $\|\Theta-\Theta_{*}\|\leq\epsilon$ for any $0\leq\epsilon\leq\epsilon_{0}$ guarantees

\[
J(K(\Theta,Q,R),\Theta_{*},Q,R)-J_{*}\leq C_{0}\epsilon^{2}
\tag{63}
\]

where $J(K(\Theta,Q,R),\Theta_{*},Q,R)$ is the average expected cost of executing the controller $K(\Theta,Q,R)$ on the system $\Theta_{*}$ and $J_{*}$ is the optimal average expected cost.

Lemma 4

(Cassel et al. [37]) Suppose $J(K(\Theta,Q,R),\Theta_{*},Q,R)\leq\mathcal{J}$; then $K(\Theta,Q,R)$ is $(\kappa,\gamma)$-strongly stabilizing with $\kappa=\sqrt{\mathcal{J}/(\alpha_{0}\sigma^{2}_{\omega})}$ and $\gamma=\frac{1}{2\kappa^{2}}$.

A-B Confidence Construction (System Identification)

Proof of Theorem 1

Proof:

Note that during the warm-up phase we execute the control input $u_{t}=K_{0}x_{t}+\eta_{t}$, which includes an extra exploration term $\eta_{t}\sim\mathcal{N}(0,2\sigma_{\omega}^{2}\kappa_{0}^{2}I)$. The state norm of the system is upper-bounded during the warm-up phase by the strong stability property of $K_{0}$ (see [5]). Now, let $T_{0}$ denote the warm-up duration. The confidence ellipsoid is constructed during this phase by applying the procedure illustrated in Section III-B:

\[
\mathcal{C}_{t}(\delta):=\Big\{\Theta\in\mathbb{R}^{(n+m)\times n}\;\Big|\;\operatorname{Tr}\big((\Theta-\hat{\Theta}_{T_{0}})^{\top}V_{[0,T_{0}]}(\Theta-\hat{\Theta}_{T_{0}})\big)\leq r_{T_{0}}\Big\}
\]

where

\[
r_{T_{0}}=\bigg(\sigma_{\omega}\sqrt{2n\Big(\log\frac{n}{\delta}+\log\Big(1+\frac{300\sigma_{\omega}^{2}\kappa_{0}^{4}}{\gamma_{0}^{2}}(n+\vartheta^{2}\kappa^{2}_{0})\log\frac{T_{0}}{\delta}\Big)\Big)}+\sqrt{\lambda}\vartheta\bigg)^{2}.
\]

Furthermore, from Theorem 20 of [5], we have the following lower bound on $V_{[0,T_{0}]}$:

\[
V_{[0,T_{0}]}\succeq\frac{T_{0}\sigma_{\omega}^{2}}{80}I,
\]

which results in

\[
\big\|\hat{\Theta}-\Theta_{*}\big\|^{2}\leq\frac{80}{T_{0}\sigma^{2}_{\omega}}\bigg(\sigma_{\omega}\sqrt{2n\Big(\log\frac{n}{\delta}+\log\Big(1+\frac{300\sigma_{\omega}^{2}\kappa_{0}^{4}}{\gamma_{0}^{2}}(n+\vartheta^{2}\kappa^{2}_{0})\log\frac{T_{0}}{\delta}\Big)\Big)}+\sqrt{\lambda}\vartheta\bigg)^{2}.
\tag{64}
\]

Now, applying the results of Lemmas 3 and 4 and setting $\kappa=\kappa_{0}$ yields the requirement

\[
\frac{80}{T_{0}\sigma^{2}_{\omega}}\bigg(\sigma_{\omega}\sqrt{2n\Big(\log\frac{n}{\delta}+\log\Big(1+\frac{300\sigma_{\omega}^{2}\kappa_{0}^{4}}{\gamma_{0}^{2}}(n+\vartheta^{2}\kappa^{2}_{0})\log\frac{T_{0}}{\delta}\Big)\Big)}+\sqrt{\lambda}\vartheta\bigg)^{2}\leq\min\bigg\{\kappa^{2}_{0}\frac{\alpha_{0}\sigma^{2}_{\omega}}{C_{0}},\,\epsilon^{2}_{0}\bigg\}.
\tag{65}
\]

When Algorithm 2 begins actuating in mode 1, given that $\Theta_{0}$ resides within a $(\kappa_{0},\gamma_{0})$-stabilizing neighborhood, solving the primal problem with the confidence bound constructed around $\Theta_{0}$ results in an average expected cost improvement over that of $K_{0}$, or at least achieves a similar performance. ∎
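The smallest warm-up duration $T_{0}$ satisfying (65) has no closed form, but it can be found numerically; the sketch below does so by doubling, with all constants ($n$, $\sigma_{\omega}$, $\kappa_{0}$, $\gamma_{0}$, $\vartheta$, $\lambda$, $\delta$, and the right-hand side `target` of (65)) treated as given inputs:

```python
import math

def warmup_duration(n, sigma_w, kappa0, gamma0, vartheta, lam, delta, target):
    # Left-hand side of (65) as a function of T0.
    def lhs(T0):
        inner = math.log(n / delta) + math.log(
            1 + (300 * sigma_w**2 * kappa0**4 / gamma0**2)
            * (n + vartheta**2 * kappa0**2) * math.log(T0 / delta))
        return (80 / (T0 * sigma_w**2)) * (
            sigma_w * math.sqrt(2 * n * inner) + math.sqrt(lam) * vartheta)**2
    # lhs decays roughly like log(T0)/T0, so doubling terminates.
    T0 = 2
    while lhs(T0) > target:
        T0 *= 2
    return T0
```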

Proof of Theorem 2

Proof:

The $\ell_{2}$ least-squares error regularized by (37) is written as

\[
e(\Theta^{i})=\operatorname{Tr}\big((\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i})^{\top}V_{[0,\tau_{q}]}^{pi}(\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i})\big)+\sum_{s=\tau_{q}+1}^{t-1}\operatorname{Tr}\big((x^{i}_{s+1}-{\Theta^{i}}^{\top}z^{i}_{s})(x^{i}_{s+1}-{\Theta^{i}}^{\top}z^{i}_{s})^{\top}\big).
\tag{66}
\]

Solving

\[
\min_{\Theta^{i}}\;e(\Theta^{i})
\tag{68}
\]

yields

\[
\hat{\Theta}^{i}_{t}={V^{i}}^{-1}_{[0,t]}\big(Z^{\top}X+V^{pi}_{[0,\tau_{q}]}\hat{\Theta}^{pi}_{\tau_{q}}\big),
\tag{69}
\]

where

\[
V_{[0,t]}^{i}=V_{[0,\tau_{q}]}^{pi}+V^{i}_{[\tau_{q},t]},\qquad V^{i}_{[\tau_{q},t]}=\sum_{s=\tau_{q}}^{t}z^{i}_{s}{z^{i}_{s}}^{\top},\qquad Z^{\top}X=\sum_{s=\tau_{q}}^{t}z^{i}_{s}{x^{i}_{s}}^{\top}.
\tag{70}
\]

Substituting $X=Z^{i}\Theta^{i}_{*}+W$ (the counterpart of (13) for mode $i$), where $X$, $Z$, and $W$ are the matrices whose rows are $x^{\top}_{\tau_{q}+1},\ldots,x^{\top}_{t}$, $z^{\top}_{\tau_{q}},\ldots,z^{\top}_{t-1}$, and $\omega_{\tau_{q}+1}^{\top},\ldots,\omega_{t}^{\top}$ respectively, into (69) yields

\[
\begin{aligned}
\hat{\Theta}^{i}_{t}&={V^{i}}^{-1}_{[0,t]}\big(Z^{\top}(Z\Theta^{i}_{*}+W)+V_{[0,\tau_{q}]}^{pi}\hat{\Theta}^{pi}_{\tau_{q}}\big)\\
&={V^{i}}^{-1}_{[0,t]}\big(V_{[0,t]}^{i}\Theta^{i}_{*}+V^{pi}_{[0,\tau_{q}]}(\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i}_{*})+Z^{\top}W\big)\\
&=\Theta^{i}_{*}+{V^{i}}^{-1}_{[0,t]}\big(V^{pi}_{[0,\tau_{q}]}(\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i}_{*})+Z^{\top}W\big)
\end{aligned}
\]

where $Z^{\top}W=\sum_{s=\tau_{q}}^{t}z^{i}_{s}\omega^{\top}_{s}$. For an arbitrary random extended state $z^{i}$, one can write

\[
{z^{i}}^{\top}\hat{\Theta}^{i}_{t}-{z^{i}}^{\top}\Theta^{i}_{*}=\langle z^{i},Z^{\top}W\rangle_{{V^{i}}^{-1}_{[0,t]}}+\langle z^{i},V^{pi}_{[0,\tau_{q}]}(\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i}_{*})\rangle_{{V^{i}}^{-1}_{[0,t]}}
\]

which leads to

\[
\|{z^{i}}^{\top}\hat{\Theta}^{i}_{t}-{z^{i}}^{\top}\Theta^{i}_{*}\|\leq\|z^{i}\|_{{V^{i}}^{-1}_{[0,t]}}\Big(\|Z^{\top}W\|_{{V^{i}}^{-1}_{[0,t]}}+\|V^{pi}_{[0,\tau_{q}]}(\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i}_{*})\|_{{V^{i}}^{-1}_{[0,t]}}\Big)
\]

which can equivalently be written as

\[
\|\hat{\Theta}^{i}-\Theta^{i}_{*}\|^{2}_{V^{i}_{[0,t]}}\leq\Big(\|Z^{\top}W\|_{{V^{i}}^{-1}_{[0,t]}}+\|V^{pi}_{[0,\tau_{q}]}(\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i}_{*})\|_{{V^{i}}^{-1}_{[0,t]}}\Big)^{2}.
\tag{71}
\]

Given that $V^{i}_{[0,t]}\succeq V^{pi}_{[0,\tau_{q}]}$,

\[
\|\hat{\Theta}^{i}-\Theta^{i}_{*}\|^{2}_{V^{i}_{[0,t]}}\leq\Big(\|Z^{\top}W\|_{{V^{i}}^{-1}_{[0,t]}}+\sqrt{\|\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i}_{*}\|^{2}_{V^{pi}_{[0,\tau_{q}]}}}\Big)^{2}
\]

holds true, where the first term, $\|Z^{\top}W\|_{{V^{i}}^{-1}_{[0,t]}}$, is bounded via Corollary 1 with probability at least $1-\delta$ as

\[
\|Z^{\top}W\|_{{V^{i}}^{-1}_{[0,t]}}^{2}\leq 2\sigma_{\omega}^{2}\log\bigg(\frac{n\det(V^{i}_{[0,t]})}{\delta\det(V_{[0,\tau_{q}]}^{pi})}\bigg)
\tag{72}
\]

and the second term is bounded by (37) as

\[
\|\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i}_{*}\|_{V^{pi}_{[0,\tau_{q}]}}^{2}\leq r_{\tau_{q}}
\tag{73}
\]

with probability at least $1-\delta$. Combining the bounds (72) and (73) completes the proof. ∎
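Computationally, the estimator (69)–(70) is an ordinary regularized least-squares update in which the projected ellipsoid plays the role of the prior. A minimal sketch, assuming `Z` stacks the extended states $z^{i}_{s}$ of the current epoch row-wise and `X` the corresponding successor states:

```python
import numpy as np

def projected_ls_update(V_prior, Theta_prior, Z, X):
    # V_prior, Theta_prior: projected covariance V^{pi} and center
    # Theta_hat^{pi} carried over from previous epochs; Z, X: data of
    # the current epoch. Implements (69)-(70):
    #   V = V^{pi} + Z^T Z,  Theta_hat = V^{-1} (Z^T X + V^{pi} Theta^{pi}).
    V = V_prior + Z.T @ Z
    Theta_hat = np.linalg.solve(V, Z.T @ X + V_prior @ Theta_prior)
    return Theta_hat, V
```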

The following lemma obtains an upper bound on the right-hand side of (39), which will later be used in the regret bound analysis.

Lemma 5

The radius $r^{j}$ of the ellipsoid (38) can be upper-bounded as

\[
r^{j}\leq 8\sigma_{\omega}^{2}n\Big(2\log\frac{n}{\delta}+(1+2\bar{\Upsilon})\log t+(m-1)\log(1+2\bar{\Upsilon})t\Big)+2\epsilon^{2}\lambda=:\bar{r},
\tag{74}
\]

where

\[
\bar{\Upsilon}=\frac{3200{\kappa^{*}}^{6}\sigma_{\omega}^{2}n}{\chi^{2}}\log\frac{T}{\delta}.
\]

A-C Properties of Relaxed SDP

This subsection goes through some essential features of the primal and dual relaxed-SDP problems (45) and (49). While the former is solved by Algorithm 2 to compute a regularizing control in the face of parameter uncertainty, the latter plays a pivotal role in the stability and regret bound analysis.

Both the primal and dual problems follow directly from applying the next two matrix-perturbation lemmas, Lemmas 6 and 7 (adapted from [5]), to the exact primal SDP and its dual. As the following lemmas hold for any actuating mode $i$, we deliberately drop the superscript $i$.

Lemma 6

Let $X$ and $\Delta$ be matrices of matching sizes and let $\Delta\Delta^{\top}\leq rV^{-1}$ for a positive definite matrix $V$. Then for any $P\geq 0$ and $\mu\geq r+(1+r)\|X\|\|V\|^{1/2}$, we have

\[
-\mu\|P\|_{*}V^{-1}\leq(X+\Delta)^{\top}P(X+\Delta)\leq\mu\|P\|_{*}V^{-1}.
\]
Proof:

The proof follows the same steps as Lemma 24 in [5]. Furthermore, the lower bound on $\mu$ is obtained by following the steps of [5] exactly. However, our bound is slightly different from that of [5] because we did not apply normalization in the confidence set construction. ∎

To discuss the solution properties of the primal problem, we need the following lemma, adapted from [5].

Lemma 7

Let $X$ and $\Delta$ be matrices of matching sizes and let $\Delta\Delta^{\top}\leq rV^{-1}$ for a positive definite matrix $V$. Then for any $\Sigma\geq 0$ and $\mu\geq r+(1+r)\|X\|\|V\|^{1/2}$,

\[
\|(X+\Delta)^{\top}\Sigma(X+\Delta)-X^{\top}\Sigma X\|\leq\mu\,\Sigma\bullet V^{-1}.
\]
Proof:

The proof can be found in [5]. ∎

By substituting $X=\hat{\Theta}^{i}_{t}$, $\Delta=\hat{\Theta}_{t}^{i}-\Theta_{*}^{i}$, $\Sigma=\Sigma(\hat{\Theta}^{i}_{t},Q,R^{i})$, $P=P(\hat{\Theta}^{i}_{t},Q,R^{i})$, and $V=V^{i}_{[0,t]}$ in Lemmas 6 and 7, and applying them to the exact primal SDP ((11) with $\Theta=\Theta^{i}_{*}$) and its dual, the primal and dual relaxed-SDP problems (45) and (49) are obtained. Moreover, it is straightforward to show $\mu_{i}\geq r^{i}+(1+r^{i})\vartheta\|V^{i}_{[0,t]}\|^{1/2}$.

Furthermore, it can also be shown that $\Sigma(\Theta_{*}^{i},Q,R^{i})$, the solution of the SDP (11) (with $\Theta=\Theta_{*}^{i}$ and $R=R^{i}$), is a feasible solution of the relaxed SDP (45).

We can show that the primal and dual solutions of (45) and (49) respectively satisfy $\|\Sigma(\hat{\Theta}^{i}_{t},Q,R^{i})\|_{*}\leq J^{i}_{*}/\alpha_{0}$ and $\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|_{*}\leq J_{*}^{i}/\sigma^{2}$, where $J_{*}^{i}$ is the corresponding optimal average expected cost of actuating in mode $i$ (see Lemma 15 in [5] for more details).

The following lemma shows how to extract the deterministic linear policy from the relaxed SDP. We will later apply this lemma in Lemma 9 to show that, for all actuating modes, the designed controller is sequentially strongly stabilizing within an epoch (the time interval between two subsequent switches).

Lemma 8

Assume $\mu_{i}\geq r^{i}+(1+r^{i})\vartheta\|V^{i}_{[0,t]}\|^{1/2}$ and $V^{i}_{[0,t]}\succeq(\nu_{i}\mu_{i}/\alpha_{0}\sigma^{2})I$ (this holds under the definition $\lambda\geq 4\bar{\nu}\bar{\mu}/\alpha_{0}\sigma^{2}$ given in Lemma 9, with $\bar{\nu}=\max_{i}\nu_{i}$ and $\bar{\mu}>\mu_{i}$ for all $i$). Let $(\hat{A}_{t},\hat{B}_{t}^{i})^{\top}=\hat{\Theta}_{t}^{i}$, and let $\Sigma(\hat{\Theta}^{i}_{t},Q,R^{i})$ and $P(\hat{\Theta}^{i}_{t},Q,R^{i})$ denote the primal and dual solutions of the relaxed SDPs (45) and (49). Then $\Sigma_{xx}(\hat{\Theta}^{i}_{t},Q,R^{i})$ is invertible and

\[
\begin{aligned}
P(\hat{\Theta}^{i}_{t},Q,R^{i})={}&Q+K^{\top}(\hat{\Theta}^{i}_{t},Q,R^{i})R^{i}K(\hat{\Theta}^{i}_{t},Q,R^{i})\\
&+(\hat{A}_{t}+\hat{B}_{t}^{i}K(\hat{\Theta}^{i}_{t},Q,R^{i}))^{\top}P(\hat{\Theta}^{i}_{t},Q,R^{i})(\hat{A}_{t}+\hat{B}_{t}^{i}K(\hat{\Theta}^{i}_{t},Q,R^{i}))\\
&-\mu_{i}\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|_{*}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{t},Q,R^{i})\end{pmatrix}^{\top}{V^{i}}^{-1}_{[0,t]}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{t},Q,R^{i})\end{pmatrix}
\end{aligned}
\tag{75}
\]

where $K(\hat{\Theta}^{i}_{t},Q,R^{i})=\Sigma_{ux}(\hat{\Theta}^{i}_{t},Q,R^{i})\Sigma^{-1}_{xx}(\hat{\Theta}^{i}_{t},Q,R^{i})$.

Proof:

The proof is in Lemma 16 of [5]. ∎

We now carry out the stability analysis of the system under the control designed by Algorithms 3 and 4.

A-D Stability Analysis

The proof of Theorem 3 follows directly from Lemmas 9 and 10, proved below.

Lemma 9 shows that the designed controller $K(\hat{\Theta}^{i}_{t},Q,R^{i})$ is $(\kappa_{i},\gamma_{i})$-strongly stabilizing for the parameter values specified in Theorem 3. Lemma 10 shows that $P(\hat{\Theta}^{i}_{t},Q,R^{i})$ is close to $P(\hat{\Theta}^{i}_{t+1},Q,R^{i})$, which proves the sequential nature of the stability and completes the proof of Theorem 3.

Lemma 9

For the system actuating in an arbitrary mode $i$, $K(\hat{\Theta}^{i}_{t},Q,R^{i})$ is $(\kappa_{i},\gamma_{i})$-strongly stabilizing, with $A_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{t},Q,R^{i})=H^{i}_{t}L^{i}_{t}{H^{i}_{t}}^{-1}$, where $H^{i}_{t}=P^{1/2}(\hat{\Theta}^{i}_{t},Q,R^{i})$ and $\|L^{i}_{t}\|\leq 1-\gamma_{i}$. Moreover, $(\alpha_{0}/2)I\preceq P(\hat{\Theta}^{i}_{t},Q,R^{i})\preceq(\nu_{i}/\sigma_{\omega}^{2})I$.

Proof:

Despite the similarity of the proof to that of Lemma 18 in [5], we provide it for the sake of completeness and to justify the value of $\lambda$ and the upper bound on $\mu_{i}$ mentioned in Theorem 3.

First, we need to appropriately upper-bound $\mu_{i}$, which is an input to Algorithms 3 and 4. Given $\|V^{i}_{t}\|\leq(1+2\bar{\Upsilon})T$ for all $i$ (proved in Lemma 11), we have

\[
\mu_{i}=r^{i}+(1+r^{i})\vartheta\|V^{i}_{[0,t]}\|^{1/2}\leq r^{i}+(1+r^{i})\vartheta\sqrt{(1+2\bar{\Upsilon})T}\leq\bar{r}+(1+\bar{r})\vartheta\sqrt{(1+2\bar{\Upsilon})T}=:\bar{\mu}
\tag{76}
\]

where $\bar{r}$ is an upper bound on $r^{i}$ for all $i\in\{1,\ldots,2^{m}\}$, already provided by Lemma 5:

\[
\bar{r}:=8\sigma^{2}_{\omega}n\Big(2\log\frac{n}{\delta}+(1+2\bar{\Upsilon})\log T+(m-1)\log(1+2\bar{\Upsilon})T\Big)+2\epsilon^{2}\lambda.
\]

Recalling the dual formulation (49) and the fact that $\nu_{i}\geq J_{*}^{i}$, one can write

\[
\nu_{i}\geq J_{*}^{i}\geq P(\hat{\Theta}^{i}_{t},Q,R^{i})\bullet W\geq\begin{pmatrix}Q&0\\ 0&R^{i}\end{pmatrix}\bullet W\geq\alpha_{0}\sigma^{2},
\]

which clearly shows that $\nu_{i}/\alpha_{0}\sigma^{2}\geq 1$.

Now, assuming $\lambda\geq 4\nu_{i}\mu_{i}/\alpha_{0}\sigma_{\omega}^{2}$ for all $i\in\{1,\ldots,2^{m}\}$, applying the perturbation lemma, Lemma 7, to (75) yields

\[
\begin{aligned}
P(\hat{\Theta}^{i}_{t},Q,R^{i})={}&Q+K^{\top}(\hat{\Theta}^{i}_{t},Q,R^{i})R^{i}K(\hat{\Theta}^{i}_{t},Q,R^{i})\\
&+(A_{*}+B_{*}^{i}K(\hat{\Theta}^{i}_{t},Q,R^{i}))^{\top}P(\hat{\Theta}^{i}_{t},Q,R^{i})(A_{*}+B_{*}^{i}K(\hat{\Theta}^{i}_{t},Q,R^{i}))\\
&-2\mu_{i}\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|_{*}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{t},Q,R^{i})\end{pmatrix}^{\top}{V^{i}}^{-1}_{[0,t]}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{t},Q,R^{i})\end{pmatrix}.
\end{aligned}
\tag{77}
\]

We have $\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|_{*}\leq\nu_{i}/\sigma_{\omega}^{2}$, and we set $\lambda\geq 4\bar{\nu}\bar{\mu}/\alpha_{0}\sigma_{\omega}^{2}$, where $\bar{\nu}=\max_{i\in\{1,\ldots,2^{m}\}}\nu_{i}$ and $\bar{\mu}=\max_{i\in\{1,\ldots,2^{m}\}}\mu_{i}$ is defined by (76). Given that $V^{i}_{[0,t]}\succeq\lambda I$, it follows that

\[
\mu_{i}\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|_{*}{V^{i}}^{-1}_{[0,t]}\preceq\mu_{i}\cdot\frac{\nu_{i}}{\sigma_{\omega}^{2}}\cdot\frac{\alpha_{0}\sigma_{\omega}^{2}}{4\bar{\nu}\bar{\mu}}I\preceq\mu_{i}\cdot\frac{\nu_{i}}{\sigma_{\omega}^{2}}\cdot\frac{\alpha_{0}\sigma_{\omega}^{2}}{4\nu_{i}\mu_{i}}I=\frac{\alpha_{0}}{4}I.
\tag{78}
\]

Plugging (78), together with the assumptions $Q,R^{i}\succeq\alpha_{0}I$, into (77) yields

\[
P(\hat{\Theta}^{i}_{t},Q,R^{i})\succeq\frac{\alpha_{0}}{2}I+\frac{\alpha_{0}}{2}K^{\top}(\hat{\Theta}^{i}_{t},Q,R^{i})K(\hat{\Theta}^{i}_{t},Q,R^{i})+(A_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{t},Q,R^{i}))^{\top}P(\hat{\Theta}^{i}_{t},Q,R^{i})(A_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{t},Q,R^{i})).
\tag{79}
\]

Defining $\kappa_{i}=\sqrt{\frac{2\nu_{i}}{\alpha_{0}\sigma_{\omega}^{2}}}$ and using the fact that $\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|\leq\frac{\nu_{i}}{\sigma_{\omega}^{2}}$ (which implies $\kappa_{i}^{-2}P(\hat{\Theta}^{i}_{t},Q,R^{i})\preceq\frac{\nu_{i}\kappa_{i}^{-2}}{\sigma_{\omega}^{2}}I$), it holds that

\[
P(\hat{\Theta}^{i}_{t},Q,R^{i})-\frac{1}{2}\alpha_{0}I\preceq(1-\kappa_{i}^{-2})P(\hat{\Theta}^{i}_{t},Q,R^{i}).
\tag{80}
\]

From (79) and (80) one has

\[
P^{-1/2}(\hat{\Theta}^{i}_{t},Q,R^{i})(A_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{t},Q,R^{i}))^{\top}P(\hat{\Theta}^{i}_{t},Q,R^{i})(A_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{t},Q,R^{i}))P^{-1/2}(\hat{\Theta}^{i}_{t},Q,R^{i})\preceq(1-\kappa_{i}^{-2})I.
\tag{81}
\]

Denoting $H_{t}^{i}=P^{1/2}(\hat{\Theta}^{i}_{t},Q,R^{i})$ and $L_{t}^{i}=P^{-1/2}(\hat{\Theta}^{i}_{t},Q,R^{i})(A_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{t},Q,R^{i}))P^{1/2}(\hat{\Theta}^{i}_{t},Q,R^{i})$, (81) leads to $\|L_{t}^{i}\|\leq\sqrt{1-\kappa_{i}^{-2}}\leq 1-\tfrac{1}{2}\kappa_{i}^{-2}$.

Furthermore, (79) gives $P(\hat{\Theta}^{i}_{t},Q,R^{i})\succeq\frac{\alpha_{0}}{2}K^{\top}(\hat{\Theta}^{i}_{t},Q,R^{i})K(\hat{\Theta}^{i}_{t},Q,R^{i})$, which together with $\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|\leq\frac{\nu_{i}}{\sigma_{\omega}^{2}}$ implies $\|K(\hat{\Theta}^{i}_{t},Q,R^{i})\|\leq\kappa_{i}$.

This automatically yields $\|H^{i}_{t}\|\leq\sqrt{\nu_{i}/\sigma_{\omega}^{2}}$ and $\|{H^{i}_{t}}^{-1}\|\leq\sqrt{2/\alpha_{0}}$. The proof is thus complete, and the $K(\hat{\Theta}^{i}_{t},Q,R^{i})$ are $(\kappa_{i},\gamma_{i})$-strongly stabilizing. ∎

Having established strong stability in Lemma 9, Lemma 10 shows that $P(\hat{\Theta}^{i}_{t},Q,R^{i})$ is close to $P(\hat{\Theta}^{i}_{t+1},Q,R^{i})$, which establishes the sequential nature of the stability and completes the proof of Theorem 3 (see [5] for more detail).

Lemma 10

For any mode $i\in\mathcal{M}$, choosing the regularization parameter $\lambda\geq 8\kappa_{*}^{8}\sqrt{n+m}\frac{4\bar{\nu}\bar{\mu}}{\alpha_{0}^{i}\sigma_{\omega}^{2}}$ gives

\[
P(\hat{\Theta}^{i}_{t},Q,R^{i})\preceq P(\Theta^{i}_{*},Q,R^{i})\preceq P(\hat{\Theta}^{i}_{t+1},Q,R^{i})+\frac{\alpha_{0}^{i}\gamma_{i}}{2}I.
\]
Proof:

Proof can be found in Appendix B. ∎

Proof of Theorem 4

Proof:

From Lemma 1, the state norm of the system actuating in mode $i$ during an arbitrary actuation epoch over the interval $[\tau_{q},\tau_{q+1}]$ can be upper-bounded as

\[
\|x_{t}\|\leq\kappa_{i}(1-\gamma_{i}/2)^{t}\|x_{\tau_{q}}\|+\frac{2\kappa_{i}}{\gamma_{i}}\max_{\tau_{q}\leq t\leq\tau_{q+1}}\|w_{t}\|
\tag{82}
\]

where $x_{\tau_{q}}$ is the initial state of the actuation epoch. With $\kappa^{*}=\max_{i\in\{1,\ldots,|\mathcal{B}^{*}|\}}\kappa_{i}$, we can deduce that for all modes $i\in\{1,\ldots,2^{m}\}$

\[
\|x_{t}\|\leq\kappa^{*}(1-\gamma^{*}/2)^{t}\|x_{\tau_{q}}\|+\frac{2\kappa^{*}}{\gamma^{*}}\max_{\tau_{q}\leq t\leq\tau_{q+1}}\|w_{t}\|
\tag{83}
\]

where $\kappa^{*}=\sqrt{2\bar{\nu}/\alpha_{0}\sigma_{\omega}^{2}}$ and $\bar{\nu}:=\max_{i\in\{1,\ldots,2^{m}\}}\nu_{i}$.

Considering

\[
\|w_{t}\|\leq 10\sigma_{\omega}\sqrt{n\log\frac{T}{\delta}},
\]

we can upper-bound the second term on the right-hand side of (83) by $\bar{L}$, defined as

\[
\bar{L}:=\frac{20\kappa^{*}}{\gamma^{*}}\sigma_{\omega}\sqrt{n\log\frac{T}{\delta}}.
\]

Let $I=\{k_{0},k_{1},\ldots,k_{l-1},k_{l}\}$ denote the sequence of times at which switches in the actuating mode happen; then one can upper-bound the state norm as

\[
\begin{aligned}
\|x_{k_{f}}\|&\leq(1-\gamma^{*}/2)^{k_{f}-k_{l}}\kappa^{*}\|x_{k_{l}}\|+\bar{L}\\
&\leq(1-\gamma^{*}/2)^{k_{l}-k_{l-1}}(1-\gamma^{*}/2)^{k_{f}-k_{l}}{\kappa^{*}}^{2}\|x_{k_{l-1}}\|+\bar{L}(1-\gamma^{*}/2)^{k_{l}-k_{l-1}}\kappa^{*}+\bar{L}\\
&\leq\cdots\leq(1-\gamma^{*}/2)^{k_{f}-0}{\kappa^{*}}^{(k_{f}-0)/\tau}\|x_{0}\|+\Omega
\end{aligned}
\]

in which $\gamma^{*}=\frac{1}{2\kappa^{*}}$. The term $\Omega$ collects all the $\bar{L}$-dependent terms.

For stability purposes, the inequality

\[
\kappa^{*}(1-\gamma^{*}/2)^{\tau_{AD}}=1-\chi<1
\tag{84}
\]

needs to be fulfilled, which yields

\[
\tau_{AD}>\tau_{MADT}:=\left\lceil\frac{\ln\kappa^{*}-\ln(1-\chi)}{-\ln(1-\gamma^{*})}\right\rceil
\tag{85}
\]

in which $\tau_{MADT}$ is the MADT and $0<\chi<1$ is a user-defined parameter. Subsequently, we can write

\[
\Omega\leq\bar{L}\sum_{i=0}^{\infty}\big((1-\gamma^{*}/2){\kappa^{*}}^{1/\tau_{AD}}\big)^{i}\leq\frac{\bar{L}}{\chi}=:U_{\Omega}.
\tag{86}
\]

Then, the state is upper-bounded by

\[
\|x(t)\|\leq(1-\chi)^{t}\|x_{0}\|+U_{\Omega}\leq e^{-\chi t}\|x_{0}\|+U_{\Omega}.
\tag{87}
\]

∎

The boundedness of $V^{i}_{t}$ at each epoch is crucial for estimation purposes. To bound it, we need to ensure that the extended state $z^{i}_{t}$ stays bounded as the sequentially stabilizing policy is executed. The following lemma gives this upper bound and defines the parameter $\bar{\Upsilon}$ used in the regret bound analysis.

Lemma 11

Applying Algorithms 1 and 2 yields

\[
\sum_{s=1}^{t}\|z(s)\|^{2}\leq 2\bar{\Upsilon}T\quad\text{and}\quad\|V_{t}\|\leq(1+2\bar{\Upsilon})T,\qquad\text{where}\quad\bar{\Upsilon}=\frac{3200{\kappa^{*}}^{6}\sigma_{\omega}^{2}n}{\chi^{2}}\log\frac{T}{\delta}.
\]
Proof:

Proof can be found in Appendix B. ∎

A-E Regret Bound Analysis

This section presents the regret bound analysis of the proposed "learn and control while switching" strategy. Proceeding as in [5], the regret can be decomposed into the form (57)–(60). However, each term must be bounded differently than in [5] due to the switching nature of the closed-loop system.

The following lemma gives an upper bound on the ratio between the covariance matrix determinant of the central ellipsoid and that of its projection onto an arbitrary space associated with mode $i$. The result will be useful in upper-bounding the $R_{1}$ and $R_{4}$ terms.

Lemma 12

Given $V_{[0,t]}$ and $V^{pi}_{[0,t]}$, the covariance matrices associated with the central ellipsoid and its projection onto the space of mode $i$ respectively, for any $t\leq T$

\[
\log\frac{\det(V_{[0,t]})}{\det(V^{pi}_{[0,t]})}\leq(m-1)\log((1+2\bar{\Upsilon})t)
\tag{88}
\]

holds true almost surely.

Proof:

Proof has been provided in Appendix B. ∎
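As an aside, the eigenvalue argument behind Lemma 12 is easy to spot-check numerically; the sketch below reuses the hypothetical `project_covariance` helper from Section VI, with a random positive definite $V\succeq I$ so that $\lambda_{\max}\geq 1$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, m_i = 3, 3, 1
G = rng.standard_normal((n + m, n + m))
V = G @ G.T + np.eye(n + m)        # random PD "central" covariance, V >= I
idx = np.arange(n + m_i)           # keep the A-rows plus one actuator
Vp = project_covariance(V, idx)    # projection onto the mode-i space
gap = np.linalg.slogdet(V)[1] - np.linalg.slogdet(Vp)[1]
bound = (m - 1) * np.log(np.linalg.eigvalsh(V).max())
print(gap <= bound)                # the determinant-ratio bound, one instance
```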

The following lemma gives an upper bound for the term $R_{1}$.

Lemma 13

Let $ns$ denote the number of epochs. The upper bound on $|R_{1}|$ is given as

\[
|R_{1}|\leq\frac{\bar{\nu}(1+N_{ts})}{\sigma_{\omega}^{2}}X_{[0,T]}
\tag{89}
\]

where $X_{[0,T]}=\max_{t\leq T}\|x_{t}\|$ satisfies

\[
X_{[0,T]}^{2}=2(1-\chi)^{2t}\|x_{0}\|^{2}+\frac{800{\kappa^{*}}^{2}\sigma_{\omega}^{2}n}{\chi^{2}{\gamma^{*}}^{2}}\log\frac{T}{\delta}
\tag{90}
\]

and

\[
N_{ts}:=(1+2\bar{\Upsilon})\log T+(ns-1)(m-1)\log(1+2\bar{\Upsilon})T,\qquad\bar{\Upsilon}=\frac{3200{\kappa^{*}}^{6}\sigma_{\omega}^{2}n}{\chi^{2}}\log\frac{T}{\delta}.
\]
Proof:

Proof can be found in Appendix B. ∎

Lemma 14

The upper bound on $|R_{2}|$ is given as

\[
|R_{2}|\leq\frac{\bar{\nu}\vartheta}{\sigma_{\omega}}\sqrt{3\bar{\Upsilon}T\log\frac{4}{\delta}},
\]

which holds with probability at least $1-\delta/4$.

Proof:

The proof directly follows [5], taking into account that $\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|\leq\bar{\nu}/\sigma^{2}$ for all $i\in\mathcal{I}$, where $\bar{\nu}=\max_{i}\nu_{i}$. ∎

To bound $R_{3}$, we have the following lemma:

Lemma 15

The term $R_{3}$ has the following upper bound when the number of epochs is $ns$:

\[
\begin{aligned}
R_{3}={}&\sum_{t=\tau_{1}=0}^{\tau_{2}}\big(w^{\top}_{t}P(\hat{\Theta}^{\sigma(t)}_{t},Q,R^{\sigma(t)})w_{t}-\sigma_{\omega}^{2}\|P(\hat{\Theta}^{\sigma(t)}_{t},Q,R^{\sigma(t)})\|_{*}\big)1_{\mathcal{E}_{t}}+\cdots\\
&+\sum_{t=\tau_{ns}}^{T}\big(w^{\top}_{t}P(\hat{\Theta}^{\sigma(t)}_{t},Q,R^{\sigma(t)})w_{t}-\sigma_{\omega}^{2}\|P(\hat{\Theta}^{\sigma(t)}_{t},Q,R^{\sigma(t)})\|_{*}\big)1_{\mathcal{E}_{t}}\\
\leq{}&8\bar{\nu}\sqrt{T\log^{3}\frac{4T}{\delta}}.
\end{aligned}
\tag{91}
\]
Proof:

The proof directly follows [5], taking into account that $\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|\leq\bar{\nu}/\sigma^{2}$ for all $i\in\mathcal{I}$, where $\bar{\nu}=\max_{i}\nu_{i}$. ∎

Lemma 16

The term $R_{4}$ has the following upper bound:

\[
R_{4}\leq\frac{8\bar{\nu}}{\sigma_{\omega}^{2}}\big(\bar{r}+(1+\bar{r})\vartheta\sqrt{(1+2\bar{\Upsilon})T}\big)\big((1+2\bar{\Upsilon})\log T+(ns-1)\log(1+2\bar{\Upsilon})T\big),
\tag{92}
\]
\[
\bar{r}:=8\sigma_{\omega}^{2}n\Big(2\log\frac{n}{\delta}+(1+2\bar{\Upsilon})\log T+(m-1)\log(1+2\bar{\Upsilon})T\Big)+2\epsilon^{2}\lambda.
\]
Proof:

Proof can be found in Appendix B. ∎

Proof of Theorem 5

Proof:

Proof of the first statement:

Putting everything together gives the following upper bound on the regret $R_{T}$:

\[
\begin{aligned}
R_{T}\leq{}&\frac{\bar{\nu}(1+N_{ts})}{\sigma_{\omega}^{2}}X_{[0,T]}+\frac{\bar{\nu}\vartheta}{\sigma_{\omega}}\sqrt{3\bar{\Upsilon}T\log\frac{4}{\delta}}+8\bar{\nu}\sqrt{T\log^{3}\frac{4T}{\delta}}\\
&+\frac{8\bar{\nu}}{\sigma_{\omega}^{2}}\big(\bar{r}+(1+\bar{r})\vartheta\sqrt{(1+2\bar{\Upsilon})T}\big)\big((1+2\bar{\Upsilon})\log T+(ns-1)\log(1+2\bar{\Upsilon})T\big)
\end{aligned}
\]

where

\[
X_{[0,T]}^{2}=2(1-\chi)^{2t}\|x_{0}\|^{2}+\frac{800{\kappa^{*}}^{2}\sigma_{\omega}^{2}n}{\chi^{2}{\gamma^{*}}^{2}}\log\frac{T}{\delta},
\tag{93}
\]
\[
N_{ts}:=(1+2\bar{\Upsilon})\log T+(ns-1)(m-1)\log(1+2\bar{\Upsilon})T.
\]

Proof of the second statement:

The regret bound of an OFU-based algorithm is directly connected to the tightness of the confidence ellipsoid; this effect is explicit in the term $R_{4}$. With this in mind, we now compare the confidence ellipsoid built by the projection-based algorithm with the one obtained by naively and repeatedly applying the basic OFU-based algorithm.

Assume that at an arbitrary time $\tau$ the system starts actuating in an arbitrary mode $i$, for which the confidence ellipsoid is given by

\[
\mathcal{C}^{i}_{t}(\delta):=\big\{\Theta^{i}\in\mathbb{R}^{(n+m_{i})\times n}\;|\;\operatorname{Tr}((\Theta^{i}-\hat{\Theta}^{i}_{t})^{\top}V^{i}_{[0,t]}(\Theta^{i}-\hat{\Theta}^{i}_{t}))\leq r^{i}\big\}
\tag{94}
\]

where

\[
r^{i}=\bigg(\sigma_{\omega}\sqrt{2n\log\frac{n\det(V^{i}_{[0,t]})}{\delta\det(V_{[0,\tau]}^{pi})}}+\sigma_{\omega}\sqrt{2n\log\frac{n\det(V_{[0,\tau]})}{\delta\det(\lambda I)}}+\sqrt{\lambda}\epsilon\bigg)^{2},
\tag{95}
\]

and $V^{i}_{[0,t]}$ and $\hat{\Theta}^{i}_{t}$ are given by (40) and (41).

On the other hand, when the projection is not applied, the confidence set is given by

\[
\mathcal{\bar{C}}^{i}_{t}(\delta):=\big\{\Theta^{i}\in\mathbb{R}^{(n+m_{i})\times n}\;|\;\operatorname{Tr}((\Theta^{i}-\hat{\bar{\Theta}}^{i}_{t})^{\top}\bar{V}^{i}_{t}(\Theta^{i}-\hat{\bar{\Theta}}^{i}_{t}))\leq r_{n}^{i}\big\}
\tag{96}
\]

where

\[
\hat{\bar{\Theta}}_{t}={\bar{V}^{i\,-1}_{t}}\big({Z_{t}^{i}}^{\top}X_{t}+\lambda\Theta^{i}_{0}\big),
\tag{97}
\]

with

\[
\bar{V}^{i}_{t}=\lambda I+\sum_{s=1}^{ns^{i}_{t}-1}z^{i}_{s}{z^{i}_{s}}^{\top},\qquad{Z_{t}^{i}}^{\top}X_{t}=\sum_{s=1}^{ns^{i}_{t}}z^{i}_{s}{x_{s}^{i}}^{\top},
\]

where $ns^{i}_{t}$ is the number of time steps during which the system has actuated in mode $i$ up to time $t$, and

\[
r_{n}^{i}=\bigg(\sigma_{\omega}\sqrt{2n\log\frac{n\det(\bar{V}^{i}_{t})}{\delta\det(\lambda I_{(n+m_{i})\times(n+m_{i})})}}+\sqrt{\lambda}\epsilon\bigg)^{2}.
\tag{98}
\]

Note that $\Theta_{*}^{i}$ belongs to both confidence sets $\mathcal{C}(\delta)$ and $\mathcal{\bar{C}}(\delta)$ with the same probability $1-\delta$. Hence, the tighter confidence set gives the lower regret. Since $V^{i}_{[0,t]}\succeq\bar{V}^{i}_{t}$, for large enough $T$ there exists a $t_{0}<T$ such that for $t>t_{0}$

\[
\det\Big(\frac{V^{i}_{[0,t]}}{r^{i}}\Big)>\det\Big(\frac{\bar{V}_{t}^{i}}{r_{n}^{i}}\Big).
\]

This guarantees $\mathcal{C}(\delta)\subset\mathcal{\bar{C}}(\delta)$, meaning that the confidence ellipsoid $\mathcal{C}(\delta)$ is tighter. ∎

Appendix B Supplementary Proofs

Proof of Lemma 2

Proof:

The proof follows from Theorem 3 and Corollary 1 in [30], which give

\[
\left\lVert S^{i}_{t}\right\rVert^{2}_{V_{t}^{-1}}\leq 2\sigma_{\omega}^{2}\log\bigg(\frac{\det(V_{t})^{1/2}\det(\lambda I)^{-1/2}}{\delta}\bigg).
\tag{99}
\]

Since $V_{t}\succeq\lambda I$ implies $\det(V_{t})/\det(\lambda I)\geq 1$, we have

\[
\log\bigg(\frac{\det(V_{t})}{\det(\lambda I)}\bigg)\geq\log\bigg(\frac{\det(V_{t})^{1/2}}{\det(\lambda I)^{1/2}}\bigg),
\]

which completes the proof. ∎

Proof of Corollary 1

Proof:

Applying Lemma 2 for $i=1,\ldots,n$ yields

\[
\left\lVert S^{i}_{t}\right\rVert^{2}_{V_{t}^{-1}}\leq 2\sigma_{\omega}^{2}\log\bigg(\frac{n\det(V_{t})}{\delta\det(\lambda I)}\bigg)
\tag{100}
\]

with probability at least $1-\delta/n$ for each $i$. Furthermore, we have

\[
\operatorname{Tr}(S^{\top}_{t}V^{-1}_{t}S_{t})=\sum_{i=1}^{n}{S^{i}_{t}}^{\top}V^{-1}_{t}S^{i}_{t},
\]

so applying the union bound shows that (62) holds with probability at least $1-\delta$. ∎

Proof of Lemma 5

Proof:

Given the definition of $r^{i}$ in (38), one can write

\[
\begin{aligned}
r^{i}&=\bigg(\sigma_{\omega}\Big[\sqrt{2n\log\frac{n\det(V^{i}_{[0,t]})}{\delta\det(V^{pi}_{[0,\tau_{q}]})}}+\sqrt{2n\log\frac{n\det(V_{[0,\tau_{q}]})}{\delta\det(\lambda I_{(n+m)})}}\Big]+\sqrt{\lambda}\vartheta\bigg)^{2}\\
&\leq 8\sigma_{\omega}^{2}n\bigg(\log\frac{n\det(V^{i}_{[0,t]})}{\delta\det(V^{pi}_{[0,\tau_{q}]})}+\log\frac{n\det(V_{[0,\tau_{q}]})}{\delta\det(\lambda I_{(n+m)})}\bigg)+2\lambda\vartheta^{2}\\
&\leq 8\sigma_{\omega}^{2}n\bigg(\log\frac{n\det(V^{i}_{[0,t]})}{\delta\det(\lambda I_{(n+m_{i})})}+\log\frac{n\det(V_{[0,\tau_{q}]})}{\delta\det(V^{pi}_{[0,\tau_{q}]})}\bigg)+2\lambda\vartheta^{2}\\
&\leq 8\sigma_{\omega}^{2}n\bigg(2\log\frac{n}{\delta}+\log\frac{\det(V^{i}_{[0,t]})}{\det(\lambda I_{(n+m_{i})})}+\log\frac{\det(V_{[0,\tau_{q}]})}{\det(V^{pi}_{[0,\tau_{q}]})}\bigg)+2\lambda\vartheta^{2}
\end{aligned}
\]

where in the second inequality we used $\det(\lambda I_{(n+m_{i})})<\det(\lambda I_{(n+m)})$. The proof is completed by applying the following properties:

\[
\log\frac{\det(V^{i}_{[0,t]})}{\det(\lambda I_{(n+m_{i})})}\leq(1+2\bar{\Upsilon})\log t,
\tag{101}
\]
\[
\log\frac{\det(V_{[0,\tau_{q}]})}{\det(V^{pi}_{[0,\tau_{q}]})}\leq(m-1)\log((1+2\bar{\Upsilon})\tau_{q}),
\tag{102}
\]

where (101) can be shown by following the same steps as Lemma 28 of [5] and applying $\sum_{s=1}^{t}\|z(s)\|^{2}\leq 2\bar{\Upsilon}T$, and (102) follows from (88). ∎

Proof of Lemma 10

Proof:

From Lemma 19 of [5] we have

\[
\begin{aligned}
&P(\Theta^{i}_{*},Q,R^{i})-P(\hat{\Theta}^{i}_{t},Q,R^{i})\\
&\quad\preceq\frac{2\kappa_{i}^{2}\mu_{i}}{\gamma_{i}}\|P(\Theta^{i}_{t},Q,R^{i})\|_{*}\Big\lVert\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{t},Q,R^{i})\end{pmatrix}^{\top}{V^{i}}^{-1}_{[0,t]}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{t},Q,R^{i})\end{pmatrix}\Big\rVert I\\
&\quad\preceq\frac{2\kappa_{i}^{2}\mu_{i}}{\gamma_{i}}\|P(\Theta^{i}_{t},Q,R^{i})\|_{*}\big(2\kappa_{i}^{2}\|{V^{i}}^{-1}_{[0,t]}\|\big)I\\
&\quad\preceq\frac{2\kappa_{i}^{2}\mu_{i}}{\gamma_{i}}\|P(\Theta^{i}_{t},Q,R^{i})\|_{*}\big(2\kappa_{i}^{2}\sqrt{n+m}\big)\frac{1}{\lambda}I\preceq\frac{\alpha_{0}^{i}\gamma_{i}}{2}I,
\end{aligned}
\]

where the last inequality holds by $\lambda\geq 8\kappa_{*}^{8}\sqrt{n+m}\frac{4\bar{\nu}\bar{\mu}}{\alpha_{0}^{i}\sigma_{\omega}^{2}}$. ∎

Proof of Lemma 11

Proof:

Using (87) and the facts that $z_{t}=\begin{pmatrix}I\\ K\end{pmatrix}x_{t}$ and $\big\|\begin{pmatrix}I\\ K\end{pmatrix}\big\|^{2}\leq 2{\kappa^{*}}^{2}$, we have

\[
\|z(t)\|^{2}\leq 4{\kappa^{*}}^{2}(1-\chi)^{2t}\|x_{0}\|^{2}+\frac{1600\sigma_{\omega}^{2}{\kappa^{*}}^{4}n}{\chi^{2}{\gamma^{*}}^{2}}\log\frac{T}{\delta},
\tag{103}
\]

which yields

\[
\sum_{s=1}^{t}\|z(s)\|^{2}\leq\frac{4{\kappa^{*}}^{2}}{2\chi-\chi^{2}}\|x_{0}\|^{2}+\frac{3200{\kappa^{*}}^{6}\sigma_{\omega}^{2}n}{\chi^{2}}\log\frac{T}{\delta}\,t\leq\frac{4{\kappa^{*}}^{2}}{\chi}t+\bar{\Upsilon}t\leq 2\bar{\Upsilon}t,
\]

where

\[
\bar{\Upsilon}=\frac{3200{\kappa^{*}}^{6}\sigma_{\omega}^{2}n}{\chi^{2}}\log\frac{T}{\delta}
\]

and in the second inequality we applied $t\geq\|x_{0}\|^{2}$.

Having $\sum_{s=1}^{T}\|z_{s}\|^{2}\leq 2\bar{\Upsilon}T$ along with $T\geq\lambda$, which always holds by the definition of $\lambda$, it turns out that, regardless of the sequence of actuating modes,

\[
\|V_{t}\|\leq\lambda+\beta^{-1}\sum_{s=1}^{t}\|z_{s}\|^{2}\leq(1+2\bar{\Upsilon})T
\]

by picking $\beta=1$. ∎

Proof of Lemma 12

Proof:

To upper-bound the ratio in (88), where the numerator is the determinant of the covariance of the central ellipsoid and the denominator is that of the ellipsoid constructed by projecting the central ellipsoid onto the space of mode $i$, one can write

\[
\frac{\det(V_{[0,t]})}{\det(V^{pi}_{[0,t]})}=\frac{\prod_{l=1}^{m_{i}}\frac{1}{\lambda^{i}_{l}}}{\prod_{k=1}^{m}\frac{1}{\lambda_{k}}}
\tag{104}
\]

where $\lambda_{k}$ for $k\in\{1,\ldots,n+m\}$ and $\lambda^{i}_{l}$ for $l\in\{1,\ldots,n+m_{i}\}$ are the eigenvalues of $V_{[0,t]}$ and $V^{pi}_{[0,t]}$ respectively. For any ellipsoid constructed from the covariance matrix $V_{[0,t]}$, the $1/\lambda_{k}$'s are the diameters of the ellipsoid, where the $\lambda_{k}$'s are the corresponding eigenvalues of $V_{[0,t]}$. Since $V^{pi}_{[0,t]}$ is the covariance matrix of an ellipsoid that is the projection of the central ellipsoid with covariance matrix $V_{[0,t]}$, there always exists an $s\in\{1,\ldots,n+m\}$ such that $1/\lambda_{s}\geq 1/\lambda^{i}_{l}$ for all $l\in\{1,\ldots,n+m_{i}\}$. In other words, there is always a diameter of the central ellipsoid associated with $V_{[0,t]}$ that is greater than or equal to any diameter of its projection defined by $V^{pi}_{[0,t]}$. Using this fact, we can upper-bound (104) as

$$\frac{\det(V_{[0,\;t]})}{\det(V^{pi}_{[0,\;t]})}\leq\lambda_{s}^{m-m_{i}}\leq\lambda_{max}(V)^{m-m_{i}}\leq\lambda_{max}(V)^{m-1}\leq(1+2\bar{\Upsilon})^{m-1}t^{m-1}$$

The last inequality directly follows from the fact that $(m-m_{i})$ takes its maximum value of $m-1$ when the projection is performed onto a space of dimension $n+1$ (i.e., projection from the central ellipsoid to a mode with only one actuator). Also, we have $\lambda_{max}(V_{[0,\;t]})\leq\|V_{[0,\;t]}\|\leq(1+2\bar{\Upsilon})t$, which completes the proof. ∎
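The geometric fact driving this proof, namely that no semi-axis of a projected ellipsoid can exceed the longest semi-axis of the central ellipsoid, can be checked numerically. The sketch below is illustrative only: dimensions are arbitrary, and it assumes the standard identity that the projection of $\{x:x^{\top}Vx\leq 1\}$ onto a coordinate subset $S$ has shape matrix $((V^{-1})_{SS})^{-1}$.

```python
# Illustrative check: projecting the ellipsoid {x : x^T V x <= 1} onto a
# subset S of coordinates yields shape matrix ((V^{-1})_SS)^{-1}, and by
# Cauchy interlacing its longest semi-axis is dominated by that of the
# central ellipsoid -- the fact behind the det-ratio bound (104).
import numpy as np

rng = np.random.default_rng(1)
dim, S = 6, np.arange(4)                  # "central" space -> 4-dim mode space

A = rng.normal(size=(dim, dim))
V = A @ A.T + np.eye(dim)                 # random positive definite covariance

axes_central = 1.0 / np.sqrt(np.linalg.eigvalsh(V))   # semi-axes of ellipsoid
P = np.linalg.inv(np.linalg.inv(V)[np.ix_(S, S)])     # shape of the projection
axes_proj = 1.0 / np.sqrt(np.linalg.eigvalsh(P))

assert axes_proj.max() <= axes_central.max() + 1e-10
print("longest central axis:", axes_central.max(),
      "longest projected axis:", axes_proj.max())
```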

Proof of Lemma 13

Proof:

Recalling the switch time sequence $\{\tau_{1},...,\tau_{ns}\}$ with $\tau_{1}=0$ and $T>\tau_{ns}$, we have $ns$ epochs.

Considering (57), one can write

$$\begin{aligned}
R_{1}&=\sum_{t=0}^{T}\big(x^{\top}_{t}P(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})x_{t}-x^{\top}_{t+1}P(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})x_{t+1}\big)1_{\mathcal{E}_{t}}\\
&=\sum_{t=1}^{T}x^{\top}_{t}\big(P(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})-P(\hat{\Theta}^{\sigma(t-1)},Q,R^{\sigma(t-1)})\big)x_{t}1_{\mathcal{E}_{t}}\\
&\quad+x_{0}^{\top}P(\hat{\Theta}^{\sigma(0)},Q,R^{\sigma(0)})x_{0}-x_{T+1}^{\top}P(\hat{\Theta}^{\sigma(T)},Q,R^{\sigma(T)})x_{T+1}\qquad(105)
\end{aligned}$$

where $\sigma(t)\in\mathcal{I}$ denotes the actuating mode at time $t$. Note that the first term on the right-hand side is zero except at times when there is a switch in either the policy or the actuation mode. Therefore, we first need to find an upper bound on the total number of switches.

We know that for the system actuating in any arbitrary mode $i$ during an arbitrary epoch $k$ over the time interval $[\tau_{k}\;\;\tau_{k+1}]$, we have

$$\det(V_{[0\;\tau_{k}^{f}]}^{i})\geq 2^{d_{k}}\det(V_{[0\;\tau_{k}^{0}]}^{pi})\qquad(106)$$

where $d_{k}$ is the total number of switches in the epoch. Suppose that in the last epoch the system actuates in mode $i$. Furthermore, without loss of generality, let the system actuate in mode $j$ within epoch $ns-1$. Then, applying (106), one can write

$$\det(V_{[0,\;T]}^{i})\frac{\det(V_{[0\;\tau_{ns}]})}{\det(V_{[0,\;\tau_{ns}]}^{pi})}\geq 2^{d_{ns}}\det(V_{[0\;\tau_{ns}]})\geq 2^{d_{ns}}\det(V^{j}_{[0,\;\tau_{ns}]})\qquad(107)$$

where the last inequality is due to the facts that $\lambda>1$ and that $V^{i}$ is strictly positive definite for all $i$, which guarantees $\det(V_{[0,\;\tau_{ns}]})\geq\det(V^{j}_{[0,\;\tau_{ns}]})$. Following the same argument backward through the switch times yields

$$\det(V^{i}_{[0,\;T]})\frac{\det(V_{[0,\;\tau_{ns}]})}{\det(V_{[0,\;\tau_{ns}]}^{pi})}\cdots\frac{\det(V_{[0,\;\tau_{2}]})}{\det(V_{[0,\;\tau_{2}]}^{pl})}\geq 2^{d_{ns}+d_{ns-1}+...+d_{2}}\det(V_{[0,\;\tau_{2}]})\qquad(108)$$

Applying $\det(V_{[0,\;\tau_{2}]})\geq 2^{d_{1}}\det(\lambda I)$ results in

$$\begin{aligned}
\det(V^{i}_{[0,\;T]})\frac{\det(V_{[0,\;\tau_{ns}]})}{\det(V_{[0,\;\tau_{ns}]}^{pi})}\cdots\frac{\det(V_{[0,\;\tau_{2}]})}{\det(V_{[0,\;\tau_{2}]}^{pl})}&\geq 2^{d_{ns}+d_{ns-1}+...+d_{1}}\det(\lambda I_{(n+m)})\\
&\geq 2^{d_{ns}+d_{ns-1}+...+d_{1}}\det(\lambda I_{(n+m_{i})})\qquad(109)
\end{aligned}$$

By taking $\log_{2}$ on both sides, one can write

$$d_{ns}+d_{ns-1}+...+d_{1}\leq\log\frac{\det(V^{i}_{[0\;T]})}{\det(\lambda I_{(n+m_{i})})}+(ns-1)\mathcal{F}\qquad(110)$$

where

$$\mathcal{F}=\max_{i,j}\mathcal{F}_{i,j}\qquad(111)$$

where $i$ denotes the actuating mode, $j$ denotes the epoch, and $\mathcal{F}_{i,j}$ is defined as follows:

$$\mathcal{F}_{i,j}=\log\frac{\det(V_{[0\;\tau_{j}]})}{\det(V_{[0\;\tau_{j}]}^{pi})}\leq(m-1)\log\big((1+2\Upsilon_{i})\tau_{j}\big)\qquad(112)$$

where the upper bound is a result of Lemma 12. It then yields

$$\mathcal{F}\leq(m-1)\log\big((1+2\bar{\Upsilon})T\big)\qquad(113)$$

Furthermore, we have

$$\log\bigg(\frac{\det(V^{i}_{[0\;T]})}{\det(\lambda I_{(n+m_{i})})}\bigg)\leq(1+2\bar{\Upsilon})\log T\qquad(114)$$

in which $\bar{\Upsilon}=\max_{i}\Upsilon_{i}$. As a result, the maximum number of switches is upper-bounded by

$$N_{ts}:=(1+2\bar{\Upsilon})\log T+(ns-1)(m-1)\log\big((1+2\bar{\Upsilon})T\big),$$

where $N_{ts}$ is the maximum number of total switches, including the switches between actuating modes. The upper bound for $R_{1}$ can then be written as follows:

$$|R_{1}|\leq\frac{\bar{\nu}(1+N_{ts})}{\sigma_{\omega}^{2}}X_{[0\;\;T]}\qquad(115)$$

which is obtained by noting $\|P^{i}_{t}\|\leq\bar{\nu}/\sigma_{\omega}^{2}$, where $\bar{\nu}=\max_{i}\nu_{i}$, together with the upper bound (90) on the state. ∎
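For a rough sense of scale, the following sketch (with entirely made-up values for $\bar{\Upsilon}$, $ns$, and $m$; none come from the paper) evaluates the switch budget $N_{ts}$ over a few horizons, showing growth that is logarithmic in $T$ and linear in $ns$ and $m$.

```python
# Rough numerical illustration (all constants made up) of how the switch
# budget N_ts from the proof of Lemma 13 scales with the horizon T.
import numpy as np

def n_ts(T, ns, m, upsilon_bar):
    # N_ts = (1+2*Ups)*log(T) + (ns-1)*(m-1)*log((1+2*Ups)*T), cf. the display above
    return (1 + 2 * upsilon_bar) * np.log(T) \
        + (ns - 1) * (m - 1) * np.log((1 + 2 * upsilon_bar) * T)

for T in [1e3, 1e4, 1e5]:
    print(f"T={T:>8.0f}  N_ts={n_ts(T, ns=5, m=3, upsilon_bar=2.0):8.1f}")
```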

Proof of Lemma 16

Proof:

Suppose the system evolves through a sequence of $ns$ epochs starting with actuating mode 1, e.g., $1,j,...,i$. Then, one can write

$$\begin{aligned}
&\sum_{t=0}^{T}\big({z^{\sigma(t)}_{t}}^{\top}{V^{\sigma(t)}_{t}}^{-1}z^{\sigma(t)}_{t}\big)1_{\mathcal{E}_{t}}
=\sum_{t=0}^{\tau_{2}}\big(z^{\top}_{t}V^{-1}_{[0\;t]}z_{t}\big)1_{\mathcal{E}_{t}}
+\sum_{t=\tau_{2}}^{\tau_{3}}\big({z^{j}_{t}}^{\top}{V^{j}}^{-1}_{[0\;t]}z^{j}_{t}\big)1_{\mathcal{E}_{t}}+...\\
&\quad+\sum_{t=\tau_{ns-1}}^{\tau_{ns}}\big({z^{l}_{t}}^{\top}{V^{l}}^{-1}_{[0\;t]}z^{l}_{t}\big)1_{\mathcal{E}_{t}}
+\sum_{t=\tau_{ns}}^{T}\big({z^{i}_{t}}^{\top}{V^{i}}^{-1}_{[0\;t]}z^{i}_{t}\big)1_{\mathcal{E}_{t}}\\
&\leq 2\log\frac{\det(V_{[0,\;\tau_{2}]})}{\det(\lambda I_{(n+m)})}
+2\log\frac{\det(V^{j}_{[0,\;\tau_{3}]})}{\det(V^{pj}_{[0,\;\tau_{2}]})}+...
+2\log\frac{\det(V^{l}_{[0\;\tau_{ns}]})}{\det(V^{pl}_{[0\;\tau_{ns-1}]})}
+2\log\frac{\det({V^{i}}_{[0,\;T]})}{\det(V^{pi}_{[0,\;\tau_{ns}]})}\qquad(116)\\
&=2\log\frac{1}{\det(\lambda I_{(n+m)})}
+2\log\frac{\det(V_{[0,\;\tau_{2}]})}{\det(V^{pj}_{[0,\;\tau_{2}]})}+...
+2\log\frac{\det(V^{l}_{[0\;\tau_{ns}]})}{\det(V^{pi}_{[0,\;\tau_{ns}]})}
+2\log\det({V^{i}}_{[0,\;T]})\qquad(117)\\
&\leq 2\log\frac{\det({V^{i}}_{[0,\;T]})}{\det(\lambda I_{(n+m_{i})})}
+2\log\frac{\det(V_{[0,\;\tau_{2}]})}{\det(V^{pj}_{[0,\;\tau_{2}]})}+...
+2\log\frac{\det(V_{[0,\;\tau_{ns}]})}{\det(V^{pi}_{[0,\;\tau_{ns}]})}\qquad(118)\\
&\leq 2(1+2\bar{\Upsilon})\log T+2(ns-1)(m-1)\log\big((1+2\bar{\Upsilon})T\big)\qquad(119)
\end{aligned}$$

where in inequality (116) we applied the result of Lemma 26 in [5], and in (117) we telescoped the summation of logarithmic terms. Inequality (118) is due to the facts that $\det(\lambda I_{(n+m)})\geq\det(\lambda I_{(n+m_{i})})$ for $\lambda\geq 1$ and that $V^{i}$ is strictly positive definite for all $i\in\mathcal{I}$, which guarantees $\det(V_{[0,\;\tau_{k}]})\geq\det(V^{pj}_{[0,\;\tau_{k}]})$. Finally, the last inequality is a direct result of applying (111)-(114) and (101).
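For intuition, the elliptical-potential inequality invoked at (116) can be verified empirically in the single-mode case. The sketch below is illustrative only (random data, arbitrary dimensions): it forms $V_{t}=\lambda I+\sum_{s\leq t}z_{s}z_{s}^{\top}$ and checks $\sum_{t}z_{t}^{\top}V_{t}^{-1}z_{t}\leq 2\log\big(\det(V_{T})/\det(\lambda I)\big)$.

```python
# Sanity check (illustrative, not from the paper) of the elliptical-potential
# bound used at (116): with V_t = lam*I + sum_{s<=t} z_s z_s^T, one has
# sum_t z_t^T V_t^{-1} z_t <= 2*log(det(V_T)/det(lam*I)).
import numpy as np

rng = np.random.default_rng(2)
d, T, lam = 4, 500, 1.0

V = lam * np.eye(d)
potential = 0.0
for _ in range(T):
    z = rng.normal(size=d)
    V += np.outer(z, z)                      # update Gram matrix first
    potential += z @ np.linalg.solve(V, z)   # z_t^T V_t^{-1} z_t <= 1 always

bound = 2 * (np.linalg.slogdet(V)[1] - d * np.log(lam))
assert potential <= bound
print("potential sum:", potential, "<= 2*log-det bound:", bound)
```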

Recalling the definition $\mu_{i}:=r^{i}+(1+r^{i})\vartheta\sqrt{(1+2\Upsilon^{i})T}$ and $\bar{\mu}=\max_{i\in\mathcal{B}^{*}}\mu_{i}$, we have the following upper bound for the terms $\frac{4\nu_{i}\mu_{i}}{\sigma_{\omega}^{2}}$:

$$\frac{4\nu_{i}\mu_{i}}{\sigma_{\omega}^{2}}\leq\frac{4\bar{\nu}\bar{\mu}}{\sigma_{\omega}^{2}}\leq\frac{4\bar{\nu}}{\sigma_{\omega}^{2}}\big(r^{i}+(1+r^{i})\vartheta\sqrt{(1+2\bar{\Upsilon})T}\big)$$

where the $r^{i}$'s are, for all $i$, upper-bounded by (74):

$$r^{i}\leq\bar{r}:=8\sigma_{\omega}^{2}n\bigg(2\log\frac{n}{\delta}+(1+2\bar{\Upsilon})\log T+(m-1)\log\big((1+2\bar{\Upsilon})T\big)\bigg)+2\epsilon^{2}\lambda,$$

which completes the proof. ∎