
Learn and Control while Switching: Guaranteed Stability and Sublinear Regret

Jafar Abbaszadeh Chekan and Cédric Langbort

J. A. Chekan and C. Langbort (emails: jafar2 & [email protected]) are with the Coordinated Science Laboratory and the Department of Aerospace Engineering at the University of Illinois at Urbana-Champaign (UIUC).
Abstract

Over-actuated systems often make it possible to achieve specific performances by switching between different subsets of actuators. However, when the system parameters are unknown, transferring authority between subsets of actuators is challenging due to stability and performance concerns. This paper presents an efficient algorithm to tackle the so-called "learn and control while switching between different actuating modes" problem in the Linear Quadratic (LQ) setting. Our proposed strategy builds on an Optimism in the Face of Uncertainty (OFU) based algorithm equipped with a projection toolbox that keeps the algorithm efficient, regret-wise. Along the way, we derive an optimal duration for the warm-up phase, thanks to the existence of a stabilizing neighborhood. Stability of the switched system is also guaranteed by designing a minimum average dwell time. The proposed strategy is proved to have a regret bound of $\mathcal{\bar{O}}(\sqrt{T})+\mathcal{O}(ns\sqrt{T})$ in horizon $T$ with $ns$ switches, provably outperforming naive repeated application of the basic OFU algorithm.

Index Terms:
Over-actuated System, Reinforcement Learning, Regret, Switched System

I Introduction

Control allocation is a vast and richly studied field that addresses the problem of distributing control effort over redundant actuators to achieve a specific performance. Along this line, supervisory switching control, which enables operation with different subsets of actuators, is a practical approach to control allocation [1, 2, 3, 4]. In this class of problems, the switching time, the switch mode (which represents a subset of actuators), or both can be either specified as part of the general decision-making process or determined by an external unit. Three selected problem classes that illustrate the significance of switching in over-actuated systems are Fault Tolerant Control (FTC), Time-Triggered Protocol (TTP), and moving target defense (MTD). These examples differ in how these crucial variables are defined and designed. Designing an algorithm to efficiently switch between different actuating modes when the system parameters are unknown presents a significant performance challenge. We refer to this problem as "learn and control while switching." In this context, it is inefficient to disregard previously acquired information about these parameters and start the learning and control process from scratch each time a mode switch occurs. Our goal is to create an algorithm that capitalizes on previously acquired knowledge when commencing a new actuation mode. More specifically, our focus lies on an online Optimism in the Face of Uncertainty (OFU) based algorithm within the framework of Linear Quadratic (LQ) settings. This class of algorithms builds a confidence ellipsoid containing the unknown system parameters, allowing us to design a feedback gain by adopting an optimistic perspective toward this set. A naive approach to our proposed problem would be the repeated application of the OFU-based algorithm whenever a mode switch takes place. However, this method is likely to be inefficient, given that it disregards previously acquired knowledge. In this work, we introduce a novel strategy that leverages learned information. This strategy involves projecting recently constructed ellipsoids into the space of the mode that the system has transitioned into. Following this projection, we establish a confidence ellipsoid for control design, incorporating both the knowledge stored within the projected ellipsoid and the online measurements gathered after initiating the actuation. The core technical challenge in constructing this ellipsoid is confidence set construction while incorporating the already learned information stored within the projected confidence ellipsoid. In our setting, unlike [5], who use the trace norm of the system parameters for regularization in least squares estimation and confidence set construction, we utilize a weighted Frobenius norm. This approach gives us an opportunity to incorporate the projected ellipsoid into the confidence ellipsoid construction. To fully incorporate this idea, we also forgo the confidence ellipsoid normalization applied in [5]. As a result, our confidence ellipsoid representation differs slightly from that of [5]. Regarding the stability guarantee in the switching setting, we need to determine a minimum average dwell time that slows down the switching process sufficiently to guarantee boundedness of the state norm during the switching operation.

Actuator faults can cause decayed performance (e.g., increased actuator energy consumption) or even more devastating incidents [6]. These kinds of safety concerns are often considered within the literature on so-called Fault Tolerant Control (FTC) [7, 8], where the goal of switching is to bypass a failing actuator [9, 10, 11].

In TTP systems [12, 13], where control must occur through communication channels whose availability varies according to a predetermined (known a priori) schedule, the goal of switching is to conform to the specifications of that protocol.

As another specific application of switching systems, we can point to the MTD strategy in cyber-physical security of control systems. MTD, which first appeared in the computer science literature [14], is a proactive strategy with which a system can hide its dynamics from a potential attacker by randomly switching between different subsets of actuators. This strategy is practical considering that adversarial agents have limited resources and cannot compromise all the actuators simultaneously. In [15, 16], the authors apply MTD to identify malicious attacks on sensors and actuators. Furthermore, [4] and [17] address this challenge for a system with known dynamics.

One significant difference between these applications is that some are instances of "direct switching" whereas others belong to the "indirect switching" category [18], which impacts the determination of when and what mode to switch to. For example, in TTP, both the switch time and actuation mode are pre-specified and given. On the other hand, for FTC, the switch time is simply when an anomaly is detected. For this class of problems, we propose a mechanism that picks the best actuating mode to switch to by examining the richness of learned information up to that point. As for the MTD application, given that high-frequency switches are desirable, we let the system switch as often as desired unless stability is violated. To this end, we constrain the system to stay in a mode for some minimum average dwell time (MADT). The next actuating mode is specified with a similar strategy as in the FTC case. In all cases, the algorithm guarantees closed-loop stability and that parameters are learned well enough to ensure low regret (defined as the average difference between the closed-loop quadratic cost and the best achievable one, with the benefit of knowing the plant's parameters). The former is addressed by designing a MADT, while the latter requires detailed insight into available RL algorithms in the LQ control literature.

Several studies have recently attempted to address switched-system control under different assumptions when the system model is unknown. The authors in [19] designed data-driven control laws for switched discrete-time linear systems with unknown switch signals. The stability of an unknown switched linear system is investigated probabilistically by observing only finitely many trajectories in [20]. Furthermore, in [21, 22] the design of stabilizing controllers for a switched system is addressed by relying solely on system structure and experimental data, without explicit plant identification.

Learning-based methods in the LQ framework are divided into model-free ([23, 24, 25, 26, 27]) and model-based RL approaches. The former ignores parameter estimates and outputs a sub-optimal policy by relying solely on the history of data and directly minimizing the cost function [26]. The latter category usually comes with guaranteed regret, thanks to system identification as a core step in control design. We use similar properties to obtain a guaranteed regret.

A naive approach to model-based adaptive control is the simple philosophy of estimating a system's unknown parameters and treating them as the actual parameters in optimal control design. This algorithm is known as certainty equivalence in the literature, and since it decouples the estimation problem from the control one, it may lead to strictly sub-optimal performance [28]. Model-based algorithms, in general, can be divided into three main categories, namely "offline estimation and control synthesis," "Thompson sampling-based," and "Optimism in the Face of Uncertainty (OFU)" algorithms. OFU, which couples the estimation and control design procedures, has proven effective at outperforming the naive certainty equivalence algorithm. Campi and Kumar in [28] proposed a cost-based parameter estimator, an OFU-based approach to the optimal control problem for linear quadratic Gaussian systems with guaranteed asymptotic optimality. However, this algorithm only guarantees convergence of the average cost to that of the optimal control in the limit and does not provide any bound on the performance loss over a finite horizon. Abbasi-Yadkori and Szepesvari [29] proposed a learning-based algorithm for adaptive control of the LQ system over a finite horizon, with worst-case regret bound $O(\sqrt{T})$. Using $l_2$-regularized least squares estimation, they construct a high-probability confidence set around the unknown parameters of the system and design an algorithm that plays optimistically with respect to this set [30]. Along this line, Ibrahimi et al. [31] later proposed an algorithm that achieves an $O(p\sqrt{T})$ regret bound with state-space dimension $p$. The authors in [32] proposed an OFU-based learning algorithm with mild assumptions and $O(\sqrt{T})$ regret. Furthermore, [33] proposes an OFU-based algorithm for joint stabilization and regret minimization that leverages actuator redundancy to alleviate state explosion during the initial time steps, when data are scarce.

The objective function to be minimized in the algorithm of [30] is non-convex, which creates a computational hurdle. Recently, however, Cohen et al. [5], by formulating the LQ control problem as a relaxed semi-definite program (SDP) that accounts for the uncertainty in the estimates, proposed a computationally efficient OFU-based algorithm. Moreover, unlike the loose state-norm bound given by [30], [5] guarantees strongly sequentially stabilizing policies, enabling us to come up with a tight upper bound, which ensures a small MADT (appropriate for the MTD application). These advantages motivate us to use the relaxed SDP framework in our analysis rather than that of [30].

The remainder of the paper is organized as follows: Section II provides the problem statement, followed by Section III, which gives a brief background review and an overview of the projection and augmentation techniques. Moving on to Section IV, we first provide an overview of the projection-based strategy. Then, we detail the main algorithmic steps and summarize the stability analysis, which includes the MADT design and the algorithm's performance, particularly in terms of the regret bound. Section VI briefly discusses the extension of the proposed algorithm to the FTC and MTD types of applications. Afterwards, a numerical experiment is given in Section VII. Finally, we summarize the key achievements in Section VIII. The most rigorous analysis of the algorithms (e.g., stability and regret bounds) and technical proofs are provided in Appendix A, while complementary proofs are left to Appendix B.

II Problem statement

Throughout this paper, we use the following notation: $\|A\|$ denotes the operator norm, i.e., $\|A\|=\max_{x,\|x\|\leq 1}\|Ax\|$. We denote the trace norm by $\|\bullet\|_{*}=\sqrt{\operatorname{Tr}(\bullet^{\top}\bullet)}$. The entry-wise dot product between matrices is denoted by $A\bullet B=\operatorname{Tr}(A^{\top}B)$. The weighted norm of a matrix $A$ with respect to $V$ is interchangeably denoted by $\operatorname{Tr}(A^{\top}VA)$ or $\|A\|_{V}^{2}$. We denote the $m$-dimensional identity matrix by $I_{m}$. We denote by $\lceil x\rceil$ the ceiling function, which maps $x$ to the least integer greater than or equal to $x$.

Consider the following time-invariant, discrete-time LQR system

x_{t+1} = A_* x_t + B_* u_t + \omega_{t+1}  (1)
c_t = x_t^\top Q x_t + u_t^\top R u_t  (2)

where $A_{*}\in\mathbb{R}^{n\times n}$ and $B_{*}\in\mathbb{R}^{n\times m}$ are unknown, while $Q\in\mathbb{R}^{n\times n}$ and $R\in\mathbb{R}^{m\times m}$ are known positive definite matrices.

Let $\mathbb{B}$ be the set of all columns $b^{i}_{*}$ ($i\in\{1,...,m\}$) of $B_{*}$. An element of its power set $2^{\mathbb{B}}$ is a subset $\mathcal{B}_{j}$, $j\in\{1,...,2^{m}\}$, of columns corresponding to a submatrix $B_{*}^{j}$ of $B_{*}$ and mode $j$. For simplicity, we assume that $B^{1}_{*}=B_{*}$, i.e., that the first mode contains all actuators. With this definition in place we can express the system dynamics as follows:

x_{t+1} = A_* x_t + B_*^{\sigma(t)} u_t^{\sigma(t)} + \omega_{t+1}  (3)

or equivalently

x_{t+1} = {\Theta^{\sigma(t)}_{*}}^{\top} z_t + \omega_{t+1}, \quad z_t = \begin{pmatrix} x_t \\ u^{\sigma(t)}_t \end{pmatrix}  (4)

where $\Theta^{\sigma(t)}_{*}=(A_{*},B^{\sigma(t)}_{*})^{\top}$ is controllable and $\sigma:\{0\}\cup\mathbb{N}\rightarrow\mathcal{M}$ is a given right-continuous piecewise constant switching signal, where $\mathcal{M}$ denotes the set of all controllable modes $(A_{*},B^{i}_{*})$.

The cost associated with this mode is

c^{\sigma(t)}_t = x_t^\top Q x_t + {u^{\sigma(t)}_t}^\top R^{\sigma(t)} u^{\sigma(t)}_t  (5)

where $R^{\sigma(t)}\in\mathbb{R}^{d_{\sigma(t)}\times d_{\sigma(t)}}$ is a block of $R$ which penalizes the control inputs of the actuators of mode $\sigma(t)\in\mathcal{M}$.

Since we are considering arbitrary, externally-determined, switching patterns (which could include switching to a particular mode at some time instant, then never switching out of it), controllability of the system in each mode is necessary to ensure controllability of the switched system. This mode-by-mode controllability assumption is also useful to enable OFU-style learning at each epoch.

The noise process satisfies the following assumption, which is standard in the controls community (e.g., [30], [5], and [34]).

Assumption 1

There exists a filtration $\mathcal{F}_{t}$ such that

$(1.1)$ $\omega_{t+1}$ is a martingale difference, i.e., $\mathbb{E}[\omega_{t+1}\,|\,\mathcal{F}_{t}]=0$;

$(1.2)$ $\mathbb{E}[\omega_{t+1}\omega_{t+1}^{\top}\,|\,\mathcal{F}_{t}]=\bar{\sigma}_{\omega}^{2}I_{n}$ for some $\bar{\sigma}_{\omega}^{2}>0$;

$(1.3)$ the $\omega_{t}$ are component-wise sub-Gaussian, i.e., there exists $\sigma_{\omega}>0$ such that for any $\gamma\in\mathbb{R}$ and $j=1,2,...,n$

\mathbb{E}[e^{\gamma\omega_{j}(t+1)}\,|\,\mathcal{F}_{t}] \leq e^{\gamma^{2}\sigma_{\omega}^{2}/2}.

Our goal is to design a strategy which can efficiently transfer the control authority between different subsets of actuators when the system parameters are unknown.

Relying on past measurements, the problem is to design a sequence $\{u^{\sigma(t)}_{t}\}$ of control inputs such that the average expected cost

J(u_1, u_2, ...) = \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[c^{\sigma(t)}_t]  (6)

is minimized, where $\sigma(t)\in\mathcal{I}$ and $\mathcal{I}=[i_{\tau_{1}}\;i_{\tau_{2}}\;...\;i_{\tau_{ns}}]$ represents the switch sequence. Each element of $\mathcal{I}$ specifies both the mode index $i_{\tau_{j}}\in\mathcal{M}$ and the time $\tau_{j}$ at which the switch to that mode occurs. In the context of TTP applications, this sequence is typically predetermined, and thus the system has access to both the switch times and the corresponding modes. In contrast, for FTC, while the switch times $\tau_{j}$ are determined by an anomaly detector, the decision regarding which mode to switch to is made by the algorithm itself. Therefore, the algorithm plays a pivotal role in selecting the next mode, denoted $i_{\tau_{j}}\in\mathcal{M}$. When the system parameters are known, it may be relatively straightforward to specify the next mode that results in the lowest average expected cost. However, this task becomes significantly more challenging in the absence of such information. The MTD application introduces an even greater level of complexity, as the algorithm must autonomously decide both when and to which mode to switch. In this work, our primary focus is on the TTP setting. In Section VI we provide a brief overview of general solutions for the other two settings.

As a measure of the optimality loss due to the lack of knowledge of the model, the regret of a policy is defined as follows:

R_T = \sum_{t=1}^{T} \big( \mathbb{E}[c_t^{\sigma(t)}] - J_*(\Theta_*^{\sigma(t)}, Q, R^{\sigma(t)}) \big)  (7)

where $J_{*}(\Theta_{*}^{\sigma(t)},Q,R^{\sigma(t)})$ is the optimal average expected cost of actuating in mode $\sigma(t)\in\{1,...,2^{m}\}$.

Proposition 1

The regret incurred by employing a fixed feedback controller $K_{0}$ on an LQ system over $T_{0}$ rounds is of order $\mathcal{O}(T_{0})$.

Assumption 2

1. There are known constants $\alpha_{0},\alpha_{1},\sigma,\vartheta,\nu>0$ such that

\alpha_0 I \leq Q \leq \alpha_1 I, \quad \alpha_0 I \leq R^i \leq \alpha^i_1 I
W = \mathbb{E}[\omega_t \omega_t^\top] = \sigma^2 I, \quad \|\Theta_*^i\| \leq \vartheta_i, \quad J_*(\Theta_*^i, Q, R) \leq \nu_i

for all $i\in\mathcal{M}$.

2. There is an initial stabilizing policy $K_{0}$ to start the algorithm with.

Note that having access to $\vartheta_{i}$ and $\nu_{i}$ for each $i$ separately is quite demanding; hence, in practice, we only require knowledge of $\bar{\vartheta}:=\max_{i}\vartheta_{i}$ and $\bar{\nu}=\max_{i}\nu_{i}$. The assumption of having access to an initial stabilizing policy such as $K_{0}$ is common practice in the literature (as in [5] and [35]). Nevertheless, it is possible to develop a stabilizing policy in real time by relying on some a priori bounds. For further details on this approach, we refer readers to [34] and [33].

Developing a "learn and control while switching" strategy requires overcoming two core challenges: performance and stability. For the former, we propose an efficient OFU-based algorithm equipped with a projection tool that can efficiently transfer the learned information from one actuating mode to the others. This strategy outperforms the naive approach of repeatedly applying the standard OFU algorithm. Fast recurrent switching between actuating modes can cause the state norm to explode over a finite horizon. Hence, in the TTP setting, we operate under the assumption that switches occur at a sufficiently slow rate on average, with the time interval between two consecutive switches, $\tau_{j+1}-\tau_{j}$, being longer than some minimum average dwell time $\tau_{MAD}$.

Before presenting the strategy in more detail (and, in a later section, the specifics of the algorithms), we start by reviewing the main ingredients of our approach to build the foundation for our analysis.

III Preliminaries

III-A Primal SDP formulation of LQR problem

Consider a linear system

x_{t+1} = A x_t + B u_t + \omega_{t+1}  (8)
c_t = x_t^\top Q x_t + u_t^\top R u_t.  (9)

Observing that a steady-state joint state and input distribution $(x_{\infty},u_{\infty})$ exists for any stabilizing policy $\pi$, we denote the covariance matrix of this joint distribution by $\mathcal{E}(\pi)$, defined as follows:

\mathcal{E}(\pi) = \mathbb{E} \begin{pmatrix} x_\infty x_\infty^\top & x_\infty u_\infty^\top \\ u_\infty x_\infty^\top & u_\infty u_\infty^\top \end{pmatrix}

where $u_{\infty}=\pi(x_{\infty})$, and the average expected cost of the policy is given as follows:

J(\pi) = \begin{pmatrix} Q & 0 \\ 0 & R \end{pmatrix} \bullet \mathcal{E}(\pi).

Now, for a linear stabilizing policy $K$ with $u_{\infty}=Kx_{\infty}$, the covariance matrix of the joint distribution takes the following form:

\mathcal{E}(K) = \begin{pmatrix} X & XK^\top \\ KX & KXK^\top \end{pmatrix}  (10)

where $X=\mathbb{E}[x_{\infty}x_{\infty}^{\top}]$. Then the average expected cost is computed as follows:

J(K) = \begin{pmatrix} Q & 0 \\ 0 & R \end{pmatrix} \bullet \mathcal{E}(K) = (Q + K^\top R K) \bullet X.

Now, given $\Theta=(A\;B)^{\top}$, the LQR control problem can be formulated as an SDP, as follows:

\text{minimize}_\Sigma \quad \begin{pmatrix} Q & 0 \\ 0 & R \end{pmatrix} \bullet \Sigma
\text{s.t.} \quad \Sigma_{xx} = \Theta^\top \Sigma \Theta + W
\Sigma > 0  (11)

where any feasible solution $\Sigma$ can be represented as follows:

\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xu} \\ \Sigma_{ux} & \Sigma_{uu} \end{pmatrix}  (12)

in which $\Sigma_{xx}\in\mathbb{R}^{n\times n}$, $\Sigma_{uu}\in\mathbb{R}^{m\times m}$, and $\Sigma_{ux}=\Sigma_{xu}^{\top}\in\mathbb{R}^{m\times n}$. The optimal value of (11) coincides with the average expected cost $J_{*}(\Theta,Q,R)$, and for $W>0$ (ensuring $\Sigma_{xx}>0$), the optimal policy of the system, denoted $K_{*}(\Theta,Q,R)$ and associated with the optimal solution $\Sigma^{*}(\Theta,Q,R)$, can be derived from $K_{*}(\Theta,Q,R)=\Sigma^{*}_{ux}(\Theta,Q,R){\Sigma^{*}_{xx}}^{-1}(\Theta,Q,R)$ in view of (10).

In [5], a relaxed form of this SDP is used to design the stabilizing control policy when the matrix $\Theta$ is replaced by its estimates (i.e., a confidence set for $\Theta$). We will slightly modify the relaxed SDP formulation for our purpose.
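As an illustration, the following minimal Python sketch solves the primal SDP (11) and recovers $K_{*}=\Sigma^{*}_{ux}{\Sigma^{*}_{xx}}^{-1}$. The use of cvxpy and the specific system matrices are our own illustrative assumptions, not part of the paper's implementation; the strict constraint $\Sigma>0$ is relaxed to semidefiniteness for the solver.

import numpy as np
import cvxpy as cp

n, m = 2, 2
A = np.array([[1.0, 0.1], [0.0, 1.0]])        # illustrative plant
B = np.array([[0.0, 0.05], [0.1, 0.0]])
Q, R = np.eye(n), np.eye(m)
W = 0.01 * np.eye(n)                          # process-noise covariance
Theta = np.vstack([A.T, B.T])                 # Theta = (A B)^T in R^{(n+m) x n}

Sigma = cp.Variable((n + m, n + m), PSD=True)  # PSD=True relaxes Sigma > 0
cost = np.block([[Q, np.zeros((n, m))], [np.zeros((m, n)), R]])
constraints = [Sigma[:n, :n] == Theta.T @ Sigma @ Theta + W]
prob = cp.Problem(cp.Minimize(cp.trace(cost @ Sigma)), constraints)
prob.solve()                                   # needs an SDP-capable solver, e.g., SCS

S = Sigma.value
K = S[n:, :n] @ np.linalg.inv(S[:n, :n])       # K_* = Sigma_ux Sigma_xx^{-1}, cf. (10)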

III-B System Identification

The proposed algorithm includes a system identification step that applies a self-normalized process to obtain least squares estimates and uses the martingale difference property of the process noise to construct a high-probability confidence set around the system's parameters. To sketch a brief outline, assume the system actuates in the first mode and evolves over an arbitrary time interval $[0,\;t]$. One can then simply rewrite (1) as follows:

X_t = Z_t \Theta_* + W_t  (13)

where $X_{t}$ and $Z_{t}$ are matrices whose rows are $x^{\top}_{1},...,x^{\top}_{t}$ and $z^{\top}_{0},...,z^{\top}_{t-1}$, respectively. Similarly, $W_{t}$ is constructed by concatenating $\omega_{1}^{\top},...,\omega_{t}^{\top}$ vertically.

Using the self-normalized process, the least squares estimation error $e(\Theta)$ can be defined as

e(\Theta) = \lambda \operatorname{Tr}\big((\Theta-\Theta_0)^\top(\Theta-\Theta_0)\big) + \sum_{s=0}^{t-1} \operatorname{Tr}\big((x_{s+1}-\Theta^\top z_s)(x_{s+1}-\Theta^\top z_s)^\top\big)  (14)

where $\Theta_{0}$ is some given initial estimate and $\lambda>0$ is a regularization parameter. This yields the $l_{2}$-regularized least squares estimate:

\hat{\Theta}_t = \operatorname*{argmin}_\Theta e(\Theta) = V_{[0,\;t]}^{-1} \big( Z_t^\top X_t + \lambda \Theta_0 \big)  (15)

where

V_{[0,\;t]} = \lambda I + \sum_{s=0}^{t-1} z_s z_s^\top = \lambda I + Z_t^\top Z_t,

is the covariance matrix. Then, assuming $\|\Theta_{*}-\Theta_{0}\|_{*}\leq\epsilon$ for some $\epsilon>0$, a high-probability confidence set containing the unknown system parameters $\Theta_{*}$ is constructed as

\mathcal{C}_t(\delta) := \big\{ \Theta \in \mathbb{R}^{(n+m)\times n} \;|\; \operatorname{Tr}\big((\Theta-\hat{\Theta}_t)^\top V_{[0,\;t]} (\Theta-\hat{\Theta}_t)\big) \leq r_t \big\}  (16)

where

r_t = \bigg( \sigma_\omega \sqrt{2n \log \frac{n \det(V_{[0,\;t]})}{\delta \det(\lambda I)}} + \sqrt{\lambda}\,\epsilon \bigg)^2.  (17)

It can be shown that the true matrix $\Theta_{*}$ belongs to this set with probability at least $1-\delta$, for $0<\delta<1$. The right-hand side of (16) depends on the user-defined parameter $\lambda$ and the initial estimation error $\epsilon$. The proof follows a combination of the steps of [5] and [30], with the difference that we forgo the normalization of the confidence set. For brevity, we sometimes write $\Delta_{t}=\Theta-\hat{\Theta}_{t}$ in our analysis.
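For concreteness, the estimate (15) and the radius (17) can be computed as in the following minimal numpy sketch; the data matrices and constants are assumed given, and the function names are ours for illustration.

import numpy as np

def ls_estimate(Z, X, lam, Theta0):
    # Regularized least squares (15): Z stacks z_0..z_{t-1}, X stacks x_1..x_t.
    V = lam * np.eye(Z.shape[1]) + Z.T @ Z          # V_{[0,t]} = lam*I + Z^T Z
    return np.linalg.solve(V, Z.T @ X + lam * Theta0), V

def radius(V, lam, eps, sigma_w, n, delta):
    # Confidence radius r_t of (17), computed via log-determinants for stability.
    logdet_ratio = np.linalg.slogdet(V)[1] - V.shape[0] * np.log(lam)
    return (sigma_w * np.sqrt(2 * n * (np.log(n / delta) + logdet_ratio))
            + np.sqrt(lam) * eps) ** 2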

III-C $(\kappa,\gamma)$-strong sequential stability

We review the notion of strong sequential stability introduced in [5], which will be used in the context of switched systems as well.

Definition 1

Consider a linear plant parameterized by $A$ and $B$. The closed-loop system matrix $A+BK_{t}$ is $(\kappa,\gamma)$-strongly stable for $\kappa>0$ and $0<\gamma<1$ if there exist $H_{t}\succ 0$ and $L_{t}$ such that $A+BK_{t}=H_{t}L_{t}H^{-1}_{t}$ and

1. $\|L_{t}\|\leq 1-\gamma$ and $\|K_{t}\|\leq\kappa$;

2. $\|H_{t}\|\leq B_{0}$ and $\|H_{t}^{-1}\|\leq 1/b_{0}$ with $\kappa=B_{0}/b_{0}$.

A sequence $A+BK_{1},A+BK_{2},...$ of closed-loop system matrices is $(\kappa,\gamma)$-strongly sequentially stable if each $A+BK_{t}$ is $(\kappa,\gamma)$-strongly stable and $\|H^{-1}_{t+1}H_{t}\|\leq 1+\gamma/2$. Furthermore, we say a sequence $K_{1},K_{2},...$ of control gains is $(\kappa,\gamma)$-strongly sequentially stabilizing for a plant $(A,B)$ if the sequence $A+BK_{1},A+BK_{2},...$ is $(\kappa,\gamma)$-strongly sequentially stable.

As shown in [5], applying such a strongly stabilizing sequence of controllers starting from state $x_{t_{0}}$ guarantees boundedness of the state trajectory as follows:

\|x_t\| \leq \kappa_i e^{-\gamma_i (t-1)/2} \|x_{t_0}\| + \frac{2\kappa_i}{\gamma_i} \max_{t_0 \leq s \leq t} \|w_s\|.  (18)
Definition 2

The set $\mathcal{N}_{(\kappa_{0},\gamma_{0})}(\Theta_{*})=\{\Theta\in\mathbb{R}^{(n+m)\times n}:\;\|\Theta-\Theta_{*}\|\leq\epsilon\}$, for some $\epsilon>0$, is the $(\kappa_{0},\gamma_{0})$-stabilizing neighborhood of the system (1)-(2). In this neighborhood, for any $\Theta^{\prime}\in\mathcal{N}_{(\kappa_{0},\gamma_{0})}(\Theta_{*})$, the designed $(\kappa^{\prime},\gamma^{\prime})$-strongly stabilizing controller $K_{*}(\Theta^{\prime},Q,R)$, applied to the system (1), achieves an average expected cost lower than that of $K_{0}$.

III-D Projection of the Confidence Ellipsoid

Suppose at time $t$ the central confidence ellipsoid

\mathcal{C}_t(\delta) := \{ \Theta \in \mathbb{R}^{(n+m)\times n} \;|\; \operatorname{Tr}\big((\Theta-\hat{\Theta}_t)^\top V_{[0,\;t]} (\Theta-\hat{\Theta}_t)\big) \leq r_t \}  (19)

is given. By projecting the central ellipsoid (19), rich with information up to time $t$, onto the space of mode $i$, the corresponding confidence set of parameters for mode $i$ is again an ellipsoid, denoted $\mathcal{C}^{pi}_{t}(\delta)$ and represented as follows:

\mathcal{C}^{pi}_t(\delta) := \{ \Theta^i \in \mathbb{R}^{(n+m_i)\times n} \;|\; \operatorname{Tr}\big((\Theta^i-\hat{\Theta}_t^{pi})^\top V^{pi}_{[0\;t]} (\Theta^i-\hat{\Theta}_t^{pi})\big) \leq r \}.  (20)

The characteristics $\hat{\Theta}_{t}^{pi}$ and $V^{pi}_{[0\;t]}$ of this projected ellipsoid can be computed by the steps specified below.

Consider the ellipsoid

\operatorname{Tr}(\Delta^\top V \Delta) \leq r  (21)

where $\Delta=\Theta-\hat{\Theta}$ and $\Theta$ has $n+m$ rows. We aim to project this set onto the lower-dimensional space of mode $i$, identified by $\Theta^{i}\in\mathbb{R}^{(n+m_{i})\times n}$ for some $m_{i}<m$. To this end, we first reorder the rows so that the rows corresponding to the actuators of mode $i$, $\tilde{\Delta}_{1}\in\mathbb{R}^{(n+m_{i})\times n}$, come first, while the remaining rows form $\tilde{\Delta}_{2}\in\mathbb{R}^{(m-m_{i})\times n}$. The reformulated ellipsoid is then written as follows:

\operatorname{Tr}(\tilde{\Delta}^\top \tilde{V} \tilde{\Delta}) \leq r, \quad \tilde{\Delta} = \left[\begin{array}{c} \tilde{\Delta}_1 \\ \hline \tilde{\Delta}_2 \end{array}\right].  (24)

where $\tilde{V}\in\mathbb{R}^{(n+m)\times(n+m)}$ is the correspondingly reordered version of $V$. Let us now rewrite the parametrization of the ellipsoid (24) as follows:

\operatorname{vect}(\tilde{\Delta})^\top \tilde{G} \operatorname{vect}(\tilde{\Delta}) \leq r

where $\operatorname{vect}(\tilde{\Delta})\in\mathbb{R}^{(n+m)n}$ is a vectorization of $\tilde{\Delta}$ and $\tilde{G}=I_{(n\times n)}\otimes\tilde{V}$.

Suppose $\partial E_{pi}$ is the boundary of the projection of the ellipsoid $E:=\{\operatorname{vect}(\tilde{\Delta})^\top \tilde{G} \operatorname{vect}(\tilde{\Delta}) \leq r\}$ onto our target lower-dimensional space. Any point $Y\in\partial E_{pi}$ is the projection of a point $X\in\partial E$, and both points lie on the tangent space of $\partial E$ parallel to the direction of $\tilde{\Delta}_{2}$. Hence, on these points $\nabla_{\tilde{\Delta}_{2}}E=0$, which gives $m-m_{i}$ algebraic equations. By solving these equations for the components of $\tilde{\Delta}_{2}$ in terms of the components of $\tilde{\Delta}_{1}$, and subsequently substituting them into $E$, we obtain the projected ellipsoid $E_{pi}$, which can be written in the following compact form:

\operatorname{vect}(\tilde{\Delta}_1)^\top (U - M^\top T^{-1} M) \operatorname{vect}(\tilde{\Delta}_1) \leq r.  (25)

where $U$, $M$, and $T$ are the blocks of the partitioned $\tilde{G}$, i.e.,

\tilde{G} = \left[\begin{array}{c|c} U & M^\top \\ \hline M & T \end{array}\right].  (28)

We can easily construct this partition noting that $U\in\mathbb{R}^{n(n+m_{i})\times n(n+m_{i})}$. By sorting the entries of the vector $\operatorname{vect}(\tilde{\Delta}_{1})$ back, in order, into a matrix $\Delta^{pi}\in\mathbb{R}^{(n+m_{i})\times n}$, the projected confidence ellipsoid (25) can be rewritten in the standard format

\operatorname{Tr}({\Delta^{pi}}^\top V^{pi} \Delta^{pi}) \leq r.

where the covariance matrix $V^{pi}$ satisfies $I_{(n\times n)}\otimes V^{pi}=U-M^{\top}T^{-1}M$.
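Computationally, since $\tilde{G}=I\otimes\tilde{V}$, the Schur complement in (25) reduces to the same operation on $\tilde{V}$ itself. The following minimal numpy sketch (with illustrative names and data of our own choosing) performs this projection; the projected center is simply the corresponding rows of $\hat{\Theta}$.

import numpy as np

def project_ellipsoid(V, Theta_hat, keep, drop):
    # Project the ellipsoid Tr(Delta^T V Delta) <= r onto the rows in `keep`
    # (the n state rows plus the m_i rows of the active actuators).
    idx = np.array(keep + drop)
    Vt = V[np.ix_(idx, idx)]                 # reorder so kept rows come first
    k = len(keep)
    U, Mt, M, T = Vt[:k, :k], Vt[:k, k:], Vt[k:, :k], Vt[k:, k:]
    V_pi = U - Mt @ np.linalg.solve(T, M)    # Schur complement, cf. (25)
    return V_pi, Theta_hat[keep, :]          # the radius r is unchanged

# Example: n = 2 states, m = 3 actuators, mode i keeps actuators {0, 2}.
n, m = 2, 3
V = np.eye(n + m) + 0.1 * np.ones((n + m, n + m))
Theta_hat = np.zeros((n + m, n))
keep, drop = list(range(n)) + [n + 0, n + 2], [n + 1]
V_pi, Theta_pi = project_ellipsoid(V, Theta_hat, keep, drop)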

III-E Augmentation

Let us call the mode consisting of all actuators (i.e., mode $i=1$) the "central mode". Its associated confidence ellipsoid $\mathcal{C}_{t}(\delta)$ contains all information up to time $t$ and can be computed at all times, even when the system operates in a different mode. Indeed, when the system is in mode $i$ with the following dynamics:

x_{t+1} = {\Theta_*^i}^\top z^i_t + \omega_{t+1}, \quad z^i_t = \begin{pmatrix} x_t \\ u^i_t \end{pmatrix}  (29)

we can equivalently rewrite it as follows:

x_{t+1} = \Theta_*^\top z_t + \omega_{t+1}, \quad z_t = \begin{pmatrix} x_t \\ u_t \end{pmatrix}  (30)

where $u_{t}\in\mathbb{R}^{m}$ and

u_t(\mathcal{B}_i) = u_t^i, \quad u_t(\mathcal{B}_i^c) = 0  (31)

with $\mathcal{B}_{i}^{c}$ being the complement of $\mathcal{B}_{i}$.

In turn, the estimation process of Section III-B can be performed on (30) to obtain updated estimates of $\Theta_{*}$, encoded by the central ellipsoid $\mathcal{C}_{t}(\delta)$.

We refer to this simple mathematical operation as the "augmentation technique"; using it, we can update the central ellipsoid regardless of the actuation mode.
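A minimal sketch of this operation follows, under the assumption that the central covariance is stored as a matrix and $\mathcal{B}_i$ as a list of column indices (both our own representation choices):

import numpy as np

def augment_and_update(V_central, x_t, u_mode, mode_cols, m):
    # Zero-pad the mode-i input to full dimension m, per (31), then update
    # the central covariance with the augmented regressor z_t = (x_t; u_t).
    u_full = np.zeros(m)
    u_full[mode_cols] = u_mode               # u_t(B_i) = u_t^i, u_t(B_i^c) = 0
    z_t = np.concatenate([x_t, u_full])
    return V_central + np.outer(z_t, z_t)    # V <- V + z_t z_t^T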

IV Projection-based strategy

IV-A Overview

A possible way to tackle the "learn and control while switching" problem is to apply the OSLO algorithm described in [5] repeatedly, each time a switch occurs. However, such a naive approach is likely inefficient because (1) it ignores previously learned information and (2) it runs a warm-up phase after each switch. As a (provably) better alternative, we propose a projection-based strategy which relies on the central ellipsoid to incorporate previously learned information in the control design procedure. A controller that plays optimistically with respect to this set is then computed by solving a relaxed SDP.

To clarify the main idea, consider a simple example of a control system with three actuators, which offers seven actuating modes, assuming they all satisfy the controllability condition. Let $M_{1}$ denote mode 1, which takes advantage of all the actuators. The modes $\{M_{2},M_{3},M_{4}\}$ and $\{M_{5},M_{6},M_{7}\}$ are those with two actuators and a single actuator, respectively. We can consider $M_{1}$ as the central mode, which is at the same time the parent of all other actuating modes. Figure 1 illustrates the relationships between these actuating modes.

Figure 1: Parent-Child Dependency

The dashed lines indicate how the modes are interconnected. For instance, the column sets of modes $M_{5}$ and $M_{6}$ are subsets of the column set of mode $M_{2}$. We break the strategy into two phases, the warm-up phase and the Learn and Control while Switching (LCS) phase, whose pseudocode is provided in Algorithms 1 and 2. The strategy starts with Algorithm 1, which applies a given $(\kappa_{0},\gamma_{0})$-strongly stabilizing feedback policy $K_{0}$ for $T_{0}$ time steps to provide Algorithm 2 with sufficiently close parameter estimates $\Theta_{0}$. To compute the warm-up duration $T_{0}$, we use the recent results of [36] on the existence of a stabilizing neighborhood around the unknown parameters of the system. We compute $T_{0}$ such that the controller designed by Algorithm 2 at time $T_{0}$, which relies on the estimate $\Theta_{0}$, performs better than the initial stabilizing controller $K_{0}$ and ensures a favorable overall regret. The computation of this duration is detailed in Theorem 1.

After the warm-up phase, Algorithm 2 is applied whenever a switch occurs. This algorithm has access to the updated central ellipsoid, which it maintains through the augmentation technique. LCSA projects the central ellipsoid onto the space of the actuating mode $M_{i}$, the mode the system operates in for the epoch (the time interval between two subsequent switches of actuating mode). During operation in an epoch, the algorithm applies OFU for the actuating mode $M_{i}$ to obtain the control input. In parallel, it updates (i.e., enriches) the central ellipsoid using the augmentation technique.

In the next section, we propose the LCSA algorithm for the TTP application, where the switch sequence (both switch times and modes) is given upfront. We define this sequence by $\mathcal{I}=[i_{\tau_{1}}\;i_{\tau_{2}}\;...\;i_{\tau_{ns}}]$, where $i_{\tau_{j}}\in\mathcal{M}$, the $\tau_{j}$ are switch times, and $ns$ is the total number of switches. Having $\Theta_{0}$ provided by Algorithm 1, we reset the time and let the switch times be $\tau_{j}\in\{\tau_{1},\tau_{2},...,\tau_{ns}\}$ with $\tau_{1}=0$. We denote the time interval between two subsequent switches, i.e., $[\tau_{k},\tau_{k+1}]$, as epoch $k$. This algorithm, however, can be slightly modified to address the FTC and MTD classes of applications. We will discuss these modifications in Section VI.

Algorithm 1 Warm-Up Phase
1: Inputs: $T_{0}$, $\vartheta>0$, $\delta>0$, $\sigma_{\omega}$, $\eta_{t}\sim\mathcal{N}(0,2\sigma_{\omega}^{2}\kappa_{0}^{2}I)$
2: Set $\lambda=\sigma_{\omega}^{2}\vartheta^{-2}$, $V_{0}=\lambda I$
3: for $t=0:T_{0}$ do
4:   Observe $x_{t}$
5:   Play $u_{t}=K_{0}x_{t}+\eta_{t}$
6:   Observe $x_{t+1}$
7:   Using $u_{t}$ and $x_{t}$, form $z_{t}$, save $(z_{t},x_{t+1})$, and update the ellipsoid according to the procedure given in IV-B1, i.e., (32)-(34)
8: end for
9: Output: $\Theta_{0}$, the center of the constructed ellipsoid
Algorithm 2 Learn and Control while Switching Algorithm (LCSA)
1: Inputs: $\mathcal{I}=[i_{\tau_{1}}\;i_{\tau_{2}}\;...\;i_{\tau_{ns}}]$, $\Theta_{0}$, $\lambda=8\kappa_{*}^{8}\sqrt{n+m}\,\frac{4\bar{\mu}\bar{\nu}}{\alpha_{0}\sigma^{2}}$
2: Initialize the central ellipsoid as in Section IV-B1
3: Set $\tau_{1}=0$
4: for $k=1:ns$ do
5:   Project the central ellipsoid $\mathcal{C}_{\tau_{k}}(\delta)$ onto the space of mode $i_{\tau_{k}}$ and obtain
\operatorname{Tr}((\Theta^{i_{\tau_k}}-\hat{\Theta}_{\tau_k}^{i_{\tau_k}})^{\top}V^{pi_{\tau_k}}_{[0\;\tau_k]}(\Theta^{i_{\tau_k}}-\hat{\Theta}_{\tau_k}^{i_{\tau_k}}))\leq r_{\tau_k}
6:   Set $V^{i_{\tau_{k}}}_{[0,\tau_{k}]}=V^{pi_{\tau_{k}}}_{[0\;\tau_{k}]}$ and $r^{i}=r_{\tau_{k}}$
7:   for $\tau_{k}\leq t<\tau_{k+1}$ do
8:     Receive state $x_{t}$
9:     Set $\mu_{i_{\tau_{k}}}=r^{i_{\tau_{k}}}+(1+r^{i_{\tau_{k}}})\vartheta\|V^{i_{\tau_{k}}}_{[0\;t]}\|^{0.5}$
10:    Set $\tau=\tau_{k}$
11:    if $\det(V^{i}_{[0,\;t]})>2\det(V^{i}_{[0,\;\tau]})$ or $t=\tau_{k}$ then
12:      Start a new episode and set $\tau=t$
13:      Estimate system parameters: compute the least squares estimate $\hat{\Theta}^{i_{\tau_{k}}}_{t}$ by solving (40)
14:      Compute policy: solve the relaxed SDP program (45) for $\Sigma(\Theta^{i_{\tau_{k}}}_{t},Q,R^{i_{\tau_{k}}})$
15:      Set $\bar{K}_{i_{\tau_{k}}}=\Sigma_{ux}(\Theta^{i_{\tau_{k}}}_{t},Q,R^{i_{\tau_{k}}})\Sigma^{-1}_{xx}(\Theta^{i_{\tau_{k}}}_{t},Q,R^{i_{\tau_{k}}})$
16:    else
17:      Set $\bar{K}_{i_{\tau_{k}}}=K(\Theta^{i_{\tau_{k}}}_{\tau},Q,R^{i_{\tau_{k}}})$
18:    end if
19:    Apply $u^{i_{\tau_{k}}}_{t}=\bar{K}_{i_{\tau_{k}}}x_{t}$ and observe the new state $x_{t+1}$
20:    Save $(z^{i_{\tau_{k}}}_{t},x_{t+1})$ into the database
21:    Update $V^{i_{\tau_{k}}}_{[0,\;t+1]}=V^{i_{\tau_{k}}}_{[0,\;t]}+z^{i_{\tau_{k}}}_{t}{z^{i_{\tau_{k}}}_{t}}^{\top}$
22:    Compute $r^{i_{\tau_{k}}}=\big(\sigma_{\omega}\sqrt{2n\log\frac{n\det(V^{i_{\tau_{k}}}_{[0\;t+1]})}{\delta\det(V^{i_{\tau_{k}}}_{[0,\tau_{k}]})}}+\sqrt{r_{\tau_{k}}}\big)^{2}$
23:    Apply the augmentation technique (Section III-E) to update the central ellipsoid
24:  end for
25: end for

IV-B Main steps of Algorithms

For the sake of readers more interested in implementation than in the algorithms' theoretical underpinnings, we start by surveying the main steps of Algorithms 1 and 2. We are mostly concerned with explaining the computations and algebraic manipulations performed in each step, and provide details about the rationale for and properties of these steps under the umbrella of performance analysis in the next section. Proofs of most of these results are relegated to the appendix.

IV-B1 System Identification

The system identification step aims at building a confidence set around the system's true but unknown parameters. This step plays a major role in both Algorithm 1 and Algorithm 2.

Confidence Set construction of warm-up phase

In Algorithm 1, using data obtained by executing the control input $u_{t}=K_{0}x_{t}+\eta_{t}$ (which consists of a linear strongly stabilizing term and extra exploratory noise $\eta_{t}\sim\mathcal{N}(0,2\sigma_{\omega}^{2}\kappa_{0}^{2}I)$), a confidence set is constructed around the true parameters of the system by following the steps of Section III-B. For the warm-up phase we let $\Theta_{0}=0$ in (14), which results in

\hat{\Theta}_t = \operatorname*{argmin}_\Theta e(\Theta) = V_{[0,\;t]}^{-1} Z_t^\top X_t.  (32)

where the covariance matrix $V_{[0,\;t]}$ is constructed as in Section III-B. Recalling that $\|\Theta_{*}\|_{*}\leq\vartheta$ by assumption, a high-probability confidence set around the true but unknown parameters of the system is constructed as

\mathcal{C}_t(\delta) := \big\{ \Theta \in \mathbb{R}^{(n+m)\times n} \;|\; \operatorname{Tr}\big((\Theta-\hat{\Theta}_t)^\top V_{[0,\;t]} (\Theta-\hat{\Theta}_t)\big) \leq r_t \big\}  (33)

where

r_t = \bigg( \sigma_\omega \sqrt{2n \log \frac{n \det(V_{[0,\;t]})}{\delta \det(\lambda I)}} + \sqrt{\lambda}\,\vartheta \bigg)^2.  (34)

For the warm-up phase we set $\lambda=\sigma_{\omega}^{2}\vartheta^{-1}$.

Algorithm 1 runs until the parameter estimates reside in the $(\kappa_{0},\gamma_{0})$-stabilizing neighborhood $\mathcal{N}_{(\kappa_{0},\gamma_{0})}(\Theta_{*})$. Per Definition 2, the first stabilizing controller $K(\Theta_{0},Q,R)$ designed by Algorithm 2, where $\Theta_{0}=\hat{\Theta}_{T_{0}}$, is guaranteed to yield an average expected cost lower than that of $K_{0}$ when applied in mode 1. This, in essence, ensures that Algorithm 2 commences with a feedback gain at least as effective as the initial stabilizing feedback gain $K_{0}$. Additionally, to secure joint stabilization and effectively manage regret, it is imperative to ensure the proximity of the estimate $\Theta_{0}$. The required duration is determined by the theorem below.

Theorem 1

For a given $(\kappa_{0},\gamma_{0})$-strongly stabilizing gain $K_{0}$, there exist explicit constants $C_{0}$ and $\epsilon_{0}=\mathrm{poly}(\alpha_{0}^{-1},\alpha_{1},\vartheta,\nu,n,m)$ such that the following holds. Let $T_{0}$ be the smallest time that satisfies:

\bigg( \sigma_\omega \sqrt{2n\Big(\log\frac{n}{\delta} + \log\Big(1+\frac{300\sigma_\omega^2\kappa_0^4}{\gamma_0^2}(n+\vartheta^2\kappa^2_0)\log\frac{T_0}{\delta}\Big)\Big)} + \sqrt{\lambda}\vartheta \bigg)^2 \frac{80}{T_0\sigma^2_\omega} \leq \min\bigg\{ \kappa^2_0\frac{\alpha_0\sigma^2_\omega}{C_0},\; \epsilon^2_0,\; \frac{1}{\lambda} \bigg\} := \epsilon^2.  (35)

Then $K(\hat{\Theta}_{T_{0}},Q,R)$, obtained by Algorithm 2 applied to the system $\Theta_{*}$, has an average expected cost $J(\hat{\Theta}_{T_{0}},Q,R)$ lower than that of the policy $K_{0}$. Furthermore, requiring $\epsilon^{2}\leq\frac{1}{\lambda}$ is a sufficient condition for $(\kappa,\gamma)$-strong sequential stability of the closed-loop system under the policy produced by Algorithm 2.

The regret associated with the warm-up algorithm is of order $\mathcal{O}(\sqrt{T})$. To substantiate this assertion, let $T$ denote the specified horizon for implementing Algorithm 2. By definition, we have $\epsilon^{2}\leq 1/\lambda$, where $\lambda=8\kappa_{*}^{8}\sqrt{n+m}\,\frac{4\bar{\mu}\bar{\nu}}{\alpha_{0}\sigma^{2}}$, as the condition for strong sequential stability (see the proofs of Lemmas 9 and 10). This, along with the definition of $\bar{\mu}$, indicates that $\lambda=\mathcal{O}(\sqrt{T})$. Now, based on equation (35), we deduce that $T_{0}=\mathcal{O}(\sqrt{T})$. Given Proposition 1, which establishes that the regret associated with a fixed strongly stabilizing policy $K_{0}$ over $T_{0}$ rounds scales as $\mathcal{O}(T_{0})$, we conclude that within $T$ rounds the warm-up phase incurs a regret of $\mathcal{O}(\sqrt{T})$.
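For illustration, a minimal Python simulation of the warm-up phase (Algorithm 1) is sketched below; the system matrices, the gain $K_0$, and all constants are placeholder assumptions for the example, not values from the paper.

import numpy as np

rng = np.random.default_rng(0)
n, m, T0, sigma_w, kappa0, vartheta = 2, 2, 500, 0.1, 5.0, 5.0
A = np.array([[0.9, 0.1], [0.0, 0.8]])       # illustrative true plant
B = np.eye(m)
K0 = -0.5 * np.eye(m)                        # assumed strongly stabilizing gain
Z, X = [], []
x = np.zeros(n)
for t in range(T0):
    eta = rng.normal(0.0, np.sqrt(2.0) * sigma_w * kappa0, size=m)  # exploratory noise
    u = K0 @ x + eta                         # u_t = K0 x_t + eta_t
    x_next = A @ x + B @ u + rng.normal(0.0, sigma_w, size=n)
    Z.append(np.concatenate([x, u])); X.append(x_next)
    x = x_next
Z, X = np.array(Z), np.array(X)
lam = sigma_w**2 * vartheta**(-2)            # lambda = sigma_w^2 vartheta^{-2} (Algorithm 1)
V = lam * np.eye(n + m) + Z.T @ Z
Theta0 = np.linalg.solve(V, Z.T @ X)         # center of the ellipsoid, cf. (32)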

Confidence Set construction of LCSA algorithm

Having been run in mode 1, the warm-up phase outputs $\Theta_{0}$ as an initial estimate, from which the central ellipsoid, an input to Algorithm 2, is constructed as follows:

\mathcal{C}_{\tau_1}(\delta) = \{ \Theta \in \mathbb{R}^{(n+m)\times n} \;|\; \operatorname{Tr}((\Theta-\Theta_0)^\top V_0 (\Theta-\Theta_0)) \leq r_0 \}

where $V_{0}=\lambda I$ with $\lambda=8\kappa_{*}^{8}\sqrt{n+m}\,\frac{4\bar{\mu}\bar{\nu}}{\alpha_{0}\sigma^{2}}$ and $r_{0}=\lambda\epsilon^{2}$. This initial confidence ellipsoid is, in fact, equivalent to $\|\Theta-\Theta_{0}\|^{2}\leq\epsilon^{2}$. We apply the procedure of Section III-B together with the augmentation technique to update the central ellipsoid.

Suppose that at time $\tau_{q}$ the system switches to mode $i$. The associated inherited confidence ellipsoid for mode $i$ can be derived through the projection of the central ellipsoid, as discussed in Section III-D, and can be expressed as follows:

\mathcal{C}^{pi}_{\tau_q}(\delta) := \{ \Theta^i \in \mathbb{R}^{(n+m_i)\times n} \;|\; \operatorname{Tr}\big((\Theta^i-\hat{\Theta}^{pi}_{\tau_q})^\top V^{pi}_{[0\;\tau_q]} (\Theta^i-\hat{\Theta}^{pi}_{\tau_q})\big) \leq r_{\tau_q} \}.  (37)

The following theorem provides a confidence bound on the error of parameter estimation when there is an inherited confidence set.

Theorem 2

Let the linear model (4) satisfy Assumption 1 with known $\sigma_{\omega}$. Assume the system switches to mode $i$ at an arbitrary time $\tau_{q}$ and runs for the time period $t\in[\tau_{q},\;\tau_{q+1}]$ (i.e., the $(q+1)$-th epoch). Let the inherited information, which is a projection of the central ellipsoid, be given by (37). Then with probability at least $1-\delta$ we have

\operatorname{Tr}((\Theta^i-\hat{\Theta}^i_t)^\top V^i_{[0\;t]} (\Theta^i-\hat{\Theta}^i_t)) \leq r^i  (38)

where

r^i = \bigg( \sigma_\omega \sqrt{2n \log \frac{n \det(V^i_{[0\;t]})}{\delta \det(V_{[0\;\tau_q]}^{pi})}} + \sqrt{r_{\tau_q}} \bigg)^2,  (39)
\hat{\Theta}^i_t = {V^i}^{-1}_{[0,\;t]} \big( Z^\top X + V^{pi}_{[0,\;\tau_q]} \hat{\Theta}^{pi}_{\tau_q} \big),  (40)

and

V_{[0\;t]}^i = V_{[0\;\tau_q]}^{pi} + V^i_{[\tau_q\;t]}, \quad V^i_{[\tau_q\;t]} = \sum_{s=\tau_q}^{t} z^i_s {z^i_s}^\top,
Z^\top X = \sum_{s=\tau_q}^{t} z^i_s {x^i_s}^\top  (41)

The second term on the right-hand side of (39) relates to the information inherited from the central ellipsoid, while the first term is the contribution of learning in the running epoch. Similarly, in the definition of $V^{i}_{[0\;t]}$ provided by (41), the term $V_{[0\;\tau_{q}]}^{pi}$ is derived from the inherited information, while $V^{i}_{[\tau_{q}\;t]}$ represents the contribution of learning during the current epoch.
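A minimal numpy sketch of the fused estimator (40)-(41) and radius (39) follows, assuming the projected quantities from Section III-D are available; the names are illustrative.

import numpy as np

def epoch_estimate(V_pi, Theta_pi, Z_ep, X_ep):
    # Fuse the inherited (V_pi, Theta_pi) with epoch data, cf. (40)-(41).
    V_i = V_pi + Z_ep.T @ Z_ep                                        # (41)
    Theta_i = np.linalg.solve(V_i, Z_ep.T @ X_ep + V_pi @ Theta_pi)   # (40)
    return Theta_i, V_i

def epoch_radius(V_i, V_pi, r_tau, sigma_w, n, delta):
    # Confidence radius (39) with inherited radius r_tau.
    logdet_ratio = np.linalg.slogdet(V_i)[1] - np.linalg.slogdet(V_pi)[1]
    return (sigma_w * np.sqrt(2 * n * (np.log(n / delta) + logdet_ratio))
            + np.sqrt(r_tau)) ** 2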

IV-C Control design

The relaxed SDP problem, which accounts for the uncertainty in the parameter estimates, is used for control design. Given the parameter estimate $\hat{\Theta}^{i}_{t}$ and the covariance matrix $V^{i}_{[0,\;t]}$ obtained by (40) and (41), the relaxed SDP is formulated as follows:

\min_\Sigma \quad \begin{pmatrix} Q & 0 \\ 0 & R^i \end{pmatrix} \bullet \Sigma
\text{s.t.} \quad \Sigma_{xx} \succeq {\hat{\Theta}^i_t}^\top \Sigma \hat{\Theta}^i_t + W - \mu_i (\Sigma \bullet {V^i}^{-1}_{[0,\;t]}) I,
\Sigma \succ 0.  (45)

where $\mu_{i}$ is any scalar satisfying $\mu_{i}\geq r^{i}+(1+r^{i})\vartheta\|V_{[0,\;t]}^{i}\|^{1/2}$, and we define $\bar{\mu}\geq r^{i}+(1+r^{i})\vartheta\|V_{[0,\;T]}^{i}\|^{1/2}$. Denoting by $\Sigma(\hat{\Theta}_{t},Q,R^{i})$ the optimal solution of (45), the controller extracted from solving the relaxed SDP is deterministic and linear in the state ($u^{i}=K(\hat{\Theta}^{i}_{t},Q,R^{i})x$), where

K(\hat{\Theta}^i_t, Q, R^i) = \Sigma_{ux}(\hat{\Theta}^i_t, Q, R^i) \Sigma^{-1}_{xx}(\hat{\Theta}^i_t, Q, R^i).

The designed controller is strongly stabilizing in the sense of Definition 1, and the sequence of feedback gains is strongly sequentially stabilizing, as affirmed by Theorem 3. The detailed analysis of the relaxation of the primal SDP formulation, incorporating the constructed confidence ellipsoid to account for parameter uncertainty, is provided in Appendix A-C; it is adapted from [5] with modifications resulting from not normalizing the confidence ellipsoid.

Recurrent updates of the confidence sets may worsen performance, so a criterion is needed to prevent frequent policy switches within an actuation epoch. As such, Algorithm 2 updates the estimate and policy only when the condition $\det(V^{j}_{t})>2\det(V^{j}_{\tau})$ is fulfilled, where $\tau$ is the last time the policy was updated ($\tau$ is used solely within Algorithm 2 and should not be confused with the switch times $\tau_{i}$). This is a standard step in OFU-based algorithms ([30], [33], and [5]).
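The determinant-doubling test is cheap to implement with log-determinants, as in this small sketch of ours (for illustration only):

import numpy as np

def should_update(V_t, logdet_V_tau):
    # True iff det(V_t) > 2 det(V_tau), i.e., the log-det grew by more than log 2.
    return np.linalg.slogdet(V_t)[1] > logdet_V_tau + np.log(2.0)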

IV-D Projection Step

In Algorithm 2 we apply the projection technique described in Section III-D to transfer the information contained in the central ellipsoid to the mode that the system just switched to. The central ellipsoid is consistently updated in Algorithm 2 using the augmentation technique regardless of the active actuating mode.

V Performance Analysis

V-A Stability Analysis

In this section, we first restate the results of [5] showing that the sequence of feedback gains obtained by solving the relaxed SDP (45) is strongly sequentially stabilizing. We then define a Minimum Average Dwell Time (MADT), denoted $\tau_{MAD}$, which ensures boundedness of the state norm in a switching scenario by mandating that the system remain in each mode for at least this duration on average. In the TTP case, this property must hold for the given sequence, i.e., $\tau_{j+1}-\tau_{j}\geq\tau_{MAD}$ for $j\in\{1,...,ns-1\}$. In the MTD and FTC cases, on the other hand, the MADT constrains the switch times.

Solving the relaxed SDP (45) is the crucial step of the algorithm; however, when it comes to the regret bound and stability analysis, the dual program

\max_P \quad P \bullet W
\text{s.t.} \quad \begin{pmatrix} Q-P & 0 \\ 0 & R^i \end{pmatrix} + \hat{\Theta}^i_t P {\hat{\Theta}^i_t}^\top \succeq \mu_i \|P\|_* {V^i}^{-1}_{[0,\;t]},
P \succeq 0  (49)

and its optimal solution $P(\hat{\Theta}^{i}_{t},Q,R^{i})$ play a pivotal role.
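For $P\succeq 0$ the trace norm satisfies $\|P\|_{*}=\operatorname{Tr}(P)$, so (49) is a linear SDP. A minimal cvxpy sketch follows, with assumed, purely illustrative data (cvxpy itself is our tooling choice, not the paper's):

import numpy as np
import cvxpy as cp

n, m_i = 2, 2
Q, R_i, W = np.eye(n), np.eye(m_i), 0.01 * np.eye(n)
Theta_hat = 0.1 * np.ones((n + m_i, n))          # assumed estimate for mode i
V_inv = np.linalg.inv(10.0 * np.eye(n + m_i))    # assumed V^{i-1}_{[0,t]}
mu_i = 1.0

P = cp.Variable((n, n), PSD=True)
blk = cp.bmat([[Q - P, np.zeros((n, m_i))],
               [np.zeros((m_i, n)), R_i]])
lmi = blk + Theta_hat @ P @ Theta_hat.T >> mu_i * cp.trace(P) * V_inv
prob = cp.Problem(cp.Maximize(cp.trace(W @ P)), [lmi])
prob.solve()                                     # optimal P(Theta_hat, Q, R^i)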

Recalling the sequential strong stability definition, the following theorem, adapted from [5], shows that the sequence of policies obtained by solving the relaxed SDP (45) for the system actuating in mode $i$ is strongly sequentially stabilizing for the parameters $\kappa_{i}$ and $\gamma_{i}$ defined below.

Theorem 3

Let $P(\hat{\Theta}^{i}_{t},Q,R^{i})$ for $t\in\{\tau_{q},\tau_{q}+1,...\}$ be solutions of the dual program (49) associated with an epoch occurring in the time interval $[\tau_{q}\;\tau_{q+1}]$ and mode $i$, with corresponding $\Theta^{i}_{t}$ and $V^{i}_{t}$. Let $\kappa_{i}=\sqrt{2\nu_{i}/\alpha_{0}\sigma^{2}}$ and $\gamma_{i}=1/2\kappa_{i}^{2}$ with $\mu_{i}\geq r^{i}+(1+r^{i})\vartheta\|V^{i}\|^{1/2}$. Then the sequence of policies $K(\hat{\Theta}^{i}_{t},Q,R^{i})$ (associated with $P(\hat{\Theta}^{i}_{t},Q,R^{i})$) is $(\kappa_{i},\gamma_{i})$-sequentially strongly stabilizing with probability at least $1-\delta$.

Proof:

This theorem is a direct result of Lemmas 9 and 10, provided and proved in the Appendix. ∎

Recalling Definition 1, a sequentially stabilizing policy keeps the state of the system bounded within an epoch. The following lemma gives this upper bound in terms of the stabilizing policy parameters $\kappa_{i}$ and $\gamma_{i}$.

Lemma 1

For the system evolving in an arbitrary actuating mode $i$ over the time interval $[\tau_{q},\tau_{q+1}]$ under sequentially stabilizing controllers, for any $t\in[\tau_{q},\tau_{q+1}]$ the state is upper bounded by

\|x_t\| \leq \kappa_i e^{-\gamma_i (t-(\tau_q+1))/2} \|x_{\tau_q}\| + \frac{20\kappa_i}{\gamma_i} \sigma_\omega \sqrt{n \log \frac{t-\tau_q}{\delta}}  (50)
Proof:

The proof is straightforward, but for the sake of completeness we provide it here. Applying the sequentially stabilizing policy implies (18); then using the Hanson-Wright concentration inequality, with probability at least $1-\delta/2(t-\tau_{q})$,

\max_{\tau_q \leq s \leq t} \|w_s\| \leq 10\sigma_\omega \sqrt{n \log \frac{t-\tau_q}{\delta}},  (51)

which completes the proof. ∎

Having stable subsystems is necessary but not sufficient for the stability of a switched system: the switches must also occur at a sufficiently slow rate on average. The following theorem establishes a lower bound on the average time elapsed between two consecutive switches, referred to as the "dwell time," which serves as a sufficient condition for boundedness.

Theorem 4

Consider the switch sequence $\mathcal{I}=[i_{\tau_{1}},i_{\tau_{2}},\ldots,i_{\tau_{ns}}]$, and let the system (3) be controlled by $(\kappa_{i_{\tau_{j}}},\gamma_{i_{\tau_{j}}})$-strongly sequentially stabilizing controllers for every mode $i_{\tau_{j}}\in\mathcal{I}$. Then, for a user-defined $0<\chi<1$, by requiring $\tau_{j+1}-\tau_{j}\geq\tau_{MAD}$, where

\[
\tau_{\text{MAD}}:=\left\lceil\frac{\ln\kappa^{*}-\ln(1-\chi)}{-\ln(1-\gamma^{*})}\right\rceil
\tag{52}
\]

in which

\[
\kappa^{*}=\max_{i\in\{1,\ldots,|\mathcal{B}^{*}|\}}\kappa_{i},\qquad\gamma^{*}=\frac{1}{2\kappa^{*}},
\tag{53}
\]

the state norm will be bounded as follows:

\[
\|x_{t}\|\leq\kappa^{*}(1-\chi)^{t}\|x_{0}\|+\frac{\bar{L}}{\chi}
\tag{54}
\]

where

\[
\bar{L}:=\frac{20\kappa^{*}}{\gamma^{*}}\sigma_{\omega}\sqrt{n\log\frac{t}{\delta}}.
\]
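As a quick sanity check of (52), the following Python snippet computes the minimum average dwell time for purely illustrative values of $\kappa^{*}$ and $\chi$ (both are assumptions for the example, not values from the paper):

```python
import math

kappa_star = 5.0                     # hypothetical kappa* = max_i kappa_i
gamma_star = 1 / (2 * kappa_star)    # gamma* as in (53)
chi = 0.5                            # user-defined 0 < chi < 1

# tau_MAD = ceil((ln kappa* - ln(1 - chi)) / (-ln(1 - gamma*))), eq. (52)
tau_mad = math.ceil((math.log(kappa_star) - math.log(1 - chi))
                    / (-math.log(1 - gamma_star)))
print(tau_mad)   # minimum average dwell time, in time steps (here: 22)
```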

V-B Regret Bound Analysis

Taking into account that any confidence ellipsoid $\mathcal{C}_{t}^{i}(\delta)$ can be obtained by projecting the central ellipsoid $\mathcal{C}_{t}(\delta)$, we can define the following unified "good event," using the terminology of [30]:

\[
\mathcal{E}_{t}=\Big\{\operatorname{Tr}\big(\Delta^{\top}_{s}V_{[0,s]}\Delta_{s}\big)\leq r_{s},\;\;\|z_{s}\|^{2}\leq 4{\kappa^{*}}^{2}(1-\chi)^{2s}\|x_{0}\|^{2}+\frac{\bar{\Upsilon}}{2{\gamma^{*}}^{2}},\;\;\forall s=1,\ldots,t\Big\}
\tag{55}
\]

where

\[
\bar{\Upsilon}=\frac{3200{\kappa^{*}}^{6}\sigma_{\omega}^{2}n}{\chi^{2}}\log\frac{T}{\delta}.
\]

The upper bound on $\|z_{t}\|^{2}$ is obtained from (103), as derived in the proof of Lemma 11 in Appendix A-D. The event $\mathcal{E}_{t}$ holds for any sequence of switches and is used to carry out the regret bound analysis.

By an appropriate decomposition of the regret on the event $\mathcal{E}_{t}$, for an arbitrary switch sequence $\mathcal{I}=[i_{\tau_{1}},i_{\tau_{2}},\ldots,i_{\tau_{ns}}]$, the following upper bound on the regret is obtained (adapted from [5]):

\[
R_{T}\leq R_{1}+R_{2}+R_{3}+R_{4}
\tag{56}
\]

where:

\[
R_{1}=\sum_{t=0}^{T-1}\big(x_{t}^{\top}P(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})x_{t}-x_{t+1}^{\top}P(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})x_{t+1}\big)1_{\mathcal{E}_{t}},
\tag{57}
\]
\[
R_{2}=\sum_{t=0}^{T}w^{\top}_{t}P(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})\big(A_{*}+B^{\sigma(t)}_{*}K(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})\big)x_{t}1_{\mathcal{E}_{t}},
\tag{58}
\]
\[
R_{3}=\sum_{t=0}^{T}\big(w^{\top}_{t}P(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})w_{t}-\sigma^{2}\|P(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})\|_{*}\big)1_{\mathcal{E}_{t}},
\tag{59}
\]
\[
R_{4}=\frac{4\nu}{\sigma_{\omega}^{2}}\sum_{t=0}^{T}\mu_{\sigma(t)}\big(z^{\top}_{t}{V^{\sigma(t)}}^{-1}_{[0,t]}z_{t}\big)1_{\mathcal{E}_{t}},
\tag{60}
\]

in which $\sigma(t)=i_{\tau_{j}}$ for $\tau_{j}\leq t\leq\tau_{j+1}$ and $1_{\mathcal{E}_{t}}$ is the indicator function. By bounding the above terms separately, the overall regret, not counting the warm-up regret, can be obtained. The following theorem summarizes the result.

Theorem 5

The regret of the projection-equipped algorithm over horizon $T$ with $ns$ switches is $\mathcal{\bar{O}}\big(\sqrt{T}\big)+\mathcal{O}\big(ns\sqrt{T}\big)$ with probability at least $1-\delta$. Moreover, this strategy is more efficient than repeatedly and naively applying the basic OFU algorithm.

VI Modifications to Address the MTD and FTC Scenarios

Algorithm 2, with minor adjustments, can also address the FTC and MTD problems, where switch sequence information (actuating modes and switch times) is unavailable. This section proposes possible directions for adjusting the strategy to these applications.

For the FTC application, leaving aside the switch time, which is supposedly determined by an external unit (e.g., a detector), the remaining challenge is determining the next actuating mode to switch to. From the set of available modes $\mathcal{M}$, the mode that guarantees the lower regret is a rational choice. The proofs of Lemmas 13 and 16 in the regret bound analysis justify that switching to the mode whose covariance matrix has the highest determinant guarantees achieving the lower regret bound. In other words, at time $t$ the best mode, regret-wise, is $j^{*}:=\arg\max_{i\in\mathcal{M}}\det(V^{pi}_{[0,t]})$, where $V^{pi}_{[0,t]}$ is the projection of the central ellipsoid onto the space of mode $i\in\mathcal{M}$; a sketch of this selection rule follows.
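The snippet below sketches this rule in Python, assuming (as an assumption on our part, consistent with how the central ellipsoid is projected) that projecting onto the coordinates of mode $i$ amounts to taking the Schur complement of the discarded block of $V_{[0,t]}$; the names `project_covariance`, `best_mode`, and `mode_coords` are hypothetical:

```python
import numpy as np

def project_covariance(V, idx):
    # Shape matrix of the projection of {x : x^T V x <= r} onto the
    # coordinates in idx: the Schur complement of the discarded block.
    idx = np.asarray(idx)
    rest = np.setdiff1d(np.arange(V.shape[0]), idx)
    V_ss = V[np.ix_(idx, idx)]
    V_sr = V[np.ix_(idx, rest)]
    V_rr = V[np.ix_(rest, rest)]
    return V_ss - V_sr @ np.linalg.solve(V_rr, V_sr.T)

def best_mode(V_central, mode_coords):
    # j* = argmax_i det(V^{pi}): pick the best-explored mode.
    dets = {i: np.linalg.det(project_covariance(V_central, coords))
            for i, coords in mode_coords.items()}
    return max(dets, key=dets.get)
```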

In the MTD application, where frequent switching is the central part of the strategy, in addition to determining the next actuating mode we must also specify the switch instances so that stability is not violated. To address the latter, we design a MADT (see Theorem 4) and require the algorithm to switch "slowly enough" by dwelling in a mode for at least the MADT units of time. Regarding the choice of the next mode, merely picking the best-learned one (by evaluating the covariance matrices) does not seem practical: this strategy pushes the algorithm toward modes with more actuators (which have larger covariance matrix determinants) and effectively ignores most modes, so the set of candidate modes becomes smaller. In turn, this means less unpredictability and helps the attacker perform an effective intrusion. To resolve this issue, one may switch by optimizing a weighted sum of the determinants of the modes' covariance matrices and the value of some unpredictability measure, which gives a trade-off between lower regret and unpredictability. The authors in [4] use an information entropy measure produced by a probability simplex when the system is known. A randomized sketch of such a trade-off is given below.
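One simple, hedged instantiation of this trade-off, reusing `project_covariance` from the previous sketch, is to sample the next mode from a softmax over the log-determinants, where the weight `alpha` is a hypothetical tuning knob rather than a quantity from the paper:

```python
import numpy as np

def mtd_mode_choice(V_central, mode_coords, alpha=0.5, rng=None):
    # alpha large favors well-learned modes (lower regret);
    # alpha -> 0 flattens the distribution (more unpredictability).
    rng = rng or np.random.default_rng()
    modes = list(mode_coords)
    logdets = np.array([np.linalg.slogdet(
        project_covariance(V_central, mode_coords[i]))[1] for i in modes])
    scores = alpha * logdets
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return modes[rng.choice(len(modes), p=probs)]
```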

VII Numerical Experiment

In this section, to demonstrate the effectiveness of the proposed "learn and control while switching" algorithm compared to repeatedly applying the basic OFU algorithm, we use a third-order linear system with three actuators, with the following plant and input matrices:

\[
A_{*}=\begin{pmatrix}10.4&0&-2.7\\ 5.2&-8.1&8.3\\ 0&0.4&-9.0\end{pmatrix},\qquad B_{*}=\begin{pmatrix}-4.7&6.1&-2.9\\ -5.0&5.8&2.5\\ 2.9&0&-7.2\end{pmatrix}
\]

which are unknown to the learner. Weighting cost matrices are chosen as follows:

\[
Q=\begin{pmatrix}6.5&-0.8&-1.4\\ -0.8&5.7&2.6\\ -1.4&2.6&25\end{pmatrix},\qquad R=\begin{pmatrix}40&10&16\\ 10&28&8\\ 16&8&48\end{pmatrix}.
\]

Figure 2 shows the regret of the algorithm for the switch sequence $\{1,2,1\}$, where mode 2 is specified by

\[
B_{*}^{2}=\begin{pmatrix}-4.7&6.1\\ -5.0&5.8\\ 2.9&0\end{pmatrix},\qquad R^{2}=\begin{pmatrix}40&10\\ 10&28\end{pmatrix}.
\]

We let the switches happen at times $\tau_{1}=0$, $\tau_{2}=5000$, and $\tau_{3}=10000$ seconds. We intentionally picked this switch sequence to show the regret both when the confidence set is projected to a lower-dimensional space and when the system switches back to a previously explored mode. We further assume an initial estimate $\Theta_{0}$ with $\|\Theta_{0}-\Theta_{*}\|\leq\epsilon$ is given and let the naive algorithm use this estimate at time $\tau_{2}$, when the system switches to the second mode. We set $T=15000$, $\delta=0.5/T$, $\sigma_{\omega}=0.003$, $\nu=0.8$, and $\epsilon=0.0001$; a sketch of this setup follows.
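For reproducibility, the experiment environment can be set up as in the following sketch; the learning algorithm itself (Algorithm 2) is omitted, and the `step` helper is our own illustrative addition:

```python
import numpy as np

A = np.array([[10.4, 0.0, -2.7],
              [5.2, -8.1, 8.3],
              [0.0, 0.4, -9.0]])
B = np.array([[-4.7, 6.1, -2.9],
              [-5.0, 5.8, 2.5],
              [2.9, 0.0, -7.2]])
Q = np.array([[6.5, -0.8, -1.4],
              [-0.8, 5.7, 2.6],
              [-1.4, 2.6, 25.0]])
R = np.array([[40.0, 10.0, 16.0],
              [10.0, 28.0, 8.0],
              [16.0, 8.0, 48.0]])

# Mode 2 keeps only the first two actuators.
B2, R2 = B[:, :2], R[:2, :2]

T, sigma_w = 15_000, 0.003
switch_times = {0: 1, 5_000: 2, 10_000: 1}   # switch time -> active mode

def step(x, u, B_mode, rng):
    # One step of x_{t+1} = A_* x_t + B_*^i u_t + w_t with Gaussian noise.
    return A @ x + B_mode @ u + sigma_w * rng.standard_normal(3)
```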

At time $\tau_{2}=5000$ seconds, when the switch happens, the projection-equipped algorithm outperforms the naive one thanks to the information learned during the first epoch, which is recovered through the projection tool. At time $\tau_{3}=10000$ seconds, the system switches back to mode 1. During this last epoch, we only see a slight improvement because the switch in policy, triggered by fulfillment of the condition $\det(V_{[0,t]})>\det(V_{[0,\tau]})$ (see line 8 of Algorithm 2), rarely happens as the value of $\det(V_{[0,t]})$ grows. In our case, only one policy improvement occurs during the epoch $[\tau_{3},T]$.

Figure 2: Regret bound of the projection-based algorithm vs. the naive algorithm.

VIII Summary and Conclusion

In this work, we equipped the OFU algorithm with a projection tool to improve the "learn and control" task when switching between different subsets of actuators of a system is mandated. This improvement is proved mathematically in a regret-bound sense and shown through simulation against the case where the basic OFU algorithm is repeatedly applied. We presented this idea for a time-triggered network control system problem in which the sequences of switch times and modes are known upfront. We also discussed possibilities for extending the proposed algorithm to applications such as fault-tolerant control and moving target defense, for which there is freedom to pick the best actuating mode when it is time to switch. To address this, we proposed a mechanism that offers the best actuating mode, considering how rich the learned data is. A minimum average dwell time was also designed to address stability concerns in applications such as moving target defense, where frequent switches might cause instability. Moreover, by applying recent results on the existence of a stabilizing neighborhood, we determined an initial exploration duration such that the efficiency of the warm-up phase is guaranteed.

References

  • [1] H. Ishii and B. A. Francis, “Stabilizing a linear system by switching control with dwell time,” in Proceedings of the 2001 American Control Conference.(Cat. No. 01CH37148), vol. 3.   IEEE, 2001, pp. 1876–1881.
  • [2] J.-X. Zhang and G.-H. Yang, “Supervisory switching-based prescribed performance control of unknown nonlinear systems against actuator failures,” International Journal of Robust and Nonlinear Control, vol. 30, no. 6, pp. 2367–2385, 2020.
  • [3] M. Staroswiecki and D. Berdjag, “A general fault tolerant linear quadratic control strategy under actuator outages,” International Journal of Systems Science, vol. 41, no. 8, pp. 971–985, 2010.
  • [4] A. Kanellopoulos and K. G. Vamvoudakis, “A moving target defense control framework for cyber-physical systems,” IEEE Transactions on Automatic Control, 2019.
  • [5] A. Cohen, T. Koren, and Y. Mansour, “Learning linear-quadratic regulators efficiently with only sqrt(t) regret,” arXiv preprint arXiv:1902.06223, 2019.
  • [6] S. S. Tohidi, Y. Yildiz, and I. Kolmanovsky, “Fault tolerant control for over-actuated systems: An adaptive correction approach,” in 2016 American control conference (ACC).   IEEE, 2016, pp. 2530–2535.
  • [7] M. Blanke, M. Kinnaert, J. Lunze, M. Staroswiecki, and J. Schröder, Diagnosis and fault-tolerant control.   Springer, 2006, vol. 2.
  • [8] R. J. Patton, “Fault-tolerant control: the 1997 situation,” IFAC Proceedings Volumes, vol. 30, no. 18, pp. 1029–1051, 1997.
  • [9] M. Takahashi and T. Takagi, “Adaptive fault-tolerant control based on hybrid redundancy,” Asia-Pacific Journal of Chemical Engineering, vol. 7, no. 5, pp. 642–650, 2012.
  • [10] Q. Yang, S. S. Ge, and Y. Sun, “Adaptive actuator fault tolerant control for uncertain nonlinear systems with multiple actuators,” Automatica, vol. 60, pp. 92–99, 2015.
  • [11] H. Ouyang and Y. Lin, “Adaptive fault-tolerant control for actuator failures: a switching strategy,” Automatica, vol. 81, pp. 87–95, 2017.
  • [12] S. Shaheen, D. Heffernan, and G. Leen, “A gateway for time-triggered control networks,” Microprocessors and Microsystems, vol. 31, no. 1, pp. 38–50, 2007.
  • [13] R. Blind and F. Allgöwer, “On time-triggered and event-based control of integrator systems over a shared communication system,” Mathematics of Control, Signals, and Systems, vol. 25, no. 4, pp. 517–557, 2013.
  • [14] J.-H. Cho, D. P. Sharma, H. Alavizadeh, S. Yoon, N. Ben-Asher, T. J. Moore, D. S. Kim, H. Lim, and F. F. Nelson, “Toward proactive, adaptive defense: A survey on moving target defense,” IEEE Communications Surveys & Tutorials, 2020.
  • [15] S. Weerakkody and B. Sinopoli, “Detecting integrity attacks on control systems using a moving target approach,” in 2015 54th IEEE Conference on Decision and Control (CDC).   IEEE, 2015, pp. 5820–5826.
  • [16] ——, “A moving target approach for identifying malicious sensors in control systems,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton).   IEEE, 2016, pp. 1149–1156.
  • [17] P. Griffioen, S. Weerakkody, and B. Sinopoli, “A moving target defense for securing cyber-physical systems,” IEEE Transactions on Automatic Control, vol. 66, no. 5, pp. 2016–2031, 2020.
  • [18] K. S. Narendra and J. Balakrishnan, “Adaptive control using multiple models,” IEEE transactions on automatic control, vol. 42, no. 2, pp. 171–187, 1997.
  • [19] M. Rotulo, C. De Persis, and P. Tesi, “Online learning of data-driven controllers for unknown switched linear systems,” arXiv preprint arXiv:2105.11523, 2021.
  • [20] J. Kenanian, A. Balkan, R. M. Jungers, and P. Tabuada, “Data driven stability analysis of black-box switched linear systems,” Automatica, vol. 109, p. 108533, 2019.
  • [21] T. Dai and M. Sznaier, “A moments based approach to designing mimo data driven controllers for switched systems,” in 2018 IEEE Conference on Decision and Control (CDC).   IEEE, 2018, pp. 5652–5657.
  • [22] ——, “A convex optimization approach to synthesizing state feedback data-driven controllers for switched linear systems,” Automatica, vol. 139, p. 110190, 2022.
  • [23] S. J. Bradtke, B. E. Ydstie, and A. G. Barto, “Adaptive linear quadratic control using policy iteration,” in Proceedings of 1994 American Control Conference-ACC’94, vol. 3.   IEEE, 1994, pp. 3475–3479.
  • [24] S. Tu and B. Recht, “Least-squares temporal difference learning for the linear quadratic regulator,” arXiv preprint arXiv:1712.08642, 2017.
  • [25] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1467–1476.
  • [26] Y. Abbasi-Yadkori, N. Lazic, and C. Szepesvári, “Model-free linear quadratic control via reduction to expert prediction,” arXiv preprint arXiv:1804.06021, 2018.
  • [27] S. Arora, E. Hazan, H. Lee, K. Singh, C. Zhang, and Y. Zhang, “Towards provable control for unknown linear dynamical systems,” arXiv preprint arXiv:1712.08642, 2018.
  • [28] M. C. Campi and P. Kumar, “Adaptive linear quadratic gaussian control: the cost-biased approach revisited,” SIAM Journal on Control and Optimization, vol. 36, no. 6, pp. 1890–1907, 1998.
  • [29] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Online least squares estimation with self-normalized processes: An application to bandit problems,” arXiv preprint arXiv:1102.2670, 2011.
  • [30] Y. Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” in Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 1–26.
  • [31] M. Ibrahimi, A. Javanmard, and B. V. Roy, “Efficient reinforcement learning for high dimensional linear quadratic systems,” in Advances in Neural Information Processing Systems, 2012, pp. 2636–2644.
  • [32] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “Optimism-based adaptive regulation of linear-quadratic systems,” arXiv preprint arXiv:1711.07230, 2017.
  • [33] J. A. Chekan, K. Azizzadenesheli, and C. Langbort, “Joint stabilization and regret minimization through switching in systems with actuator redundancy,” arXiv preprint arXiv:2105.14709, 2021.
  • [34] S. Lale, K. Azizzadenesheli, B. Hassibi, and A. Anandkumar, “Explore more and improve regret in linear quadratic regulators,” arXiv preprint arXiv:2007.12291, 2020.
  • [35] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “Regret bounds for robust adaptive control of the linear quadratic regulator,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [36] H. Mania, S. Tu, and B. Recht, “Certainty equivalence is efficient for linear quadratic control,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [37] A. Cassel, A. Cohen, and T. Koren, “Logarithmic regret for learning linear quadratic regulators efficiently,” in International Conference on Machine Learning.   PMLR, 2020, pp. 1328–1337.

Appendix A Further Analysis

In this section, we dig deeper and provide proofs and rigorous analysis of the algorithms, the properties of the closed-loop system, and the regret bounds.

A-A Technical Theorems and Lemmas

The following lemma, adapted from [29], gives a self-normalized bound for scalar-valued martingales.

Lemma 2

Let $F_{k}$ be a filtration, $z_{k}$ a stochastic process adapted to $F_{k}$, and $\omega^{i}_{k}$ (the $i$-th element of the noise vector $\omega_{k}$) a real-valued martingale difference, also adapted to $F_{k}$, satisfying the conditional sub-Gaussianity assumption (Assumption 1) with known constant $\sigma_{\omega}$. Consider the martingale and covariance matrices

\[
S^{i}_{t}:=\sum_{k=1}^{t}z_{k-1}\omega^{i}_{k},\qquad V_{t}=\lambda I+\frac{1}{\beta}\sum_{s=1}^{t-1}z_{s}z_{s}^{\top};
\]

then with probability at least $1-\delta$, $0<\delta<1$, we have

\[
\left\lVert S^{i}_{t}\right\rVert^{2}_{V_{t}^{-1}}\leq 2\sigma_{\omega}^{2}\log\bigg(\frac{\det(V_{t})}{\delta\det(\lambda I)}\bigg).
\tag{61}
\]
Proof:

The proof can be found in Appendix B. ∎
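As an illustration only (not part of the proof), the bound (61) can be spot-checked by Monte Carlo for i.i.d. Gaussian noise and a scalar $z_{k}$, with $\beta=1$ and illustrative values of $\lambda$, $\sigma_{\omega}$, and $\delta$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_w, lam, delta, t = 1.0, 1.0, 0.05, 1_000
z = rng.standard_normal(t)
w = sigma_w * rng.standard_normal(t)

S = np.sum(z[:-1] * w[1:])       # S_t = sum_k z_{k-1} w_k
V = lam + np.sum(z[:-1] ** 2)    # V_t = lambda + sum_s z_s^2 (beta = 1)
lhs = S ** 2 / V                 # ||S_t||^2 in the V_t^{-1} norm (scalar case)
rhs = 2 * sigma_w ** 2 * np.log(V / (delta * lam))
print(lhs <= rhs)                # should hold w.p. at least 1 - delta
```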

The following corollary generalizes Lemma 2 to a vector-valued martingale, which will later be used in Theorem 2 to derive the confidence ellipsoid of the estimates.

Corollary 1

Under the same assumptions as in Lemma 2, with $\beta>1$ and defining

\[
S_{t}=Z^{\top}_{t}W_{t}=\sum_{k=1}^{t}z_{k-1}\omega^{\top}_{k},
\]

with probability at least $1-\delta$,

\[
\operatorname{Tr}(S^{\top}_{t}V^{-1}_{t}S_{t})\leq 2\sigma_{\omega}^{2}n\log\bigg(\frac{n\det(V_{t})}{\delta\det(\lambda I)}\bigg).
\tag{62}
\]
Proof:

For proof see Appendix B. ∎

Lemma 3

(Mania et al. [36]) There exist explicit constants $C_{0}$ and $\epsilon_{0}=\mathrm{poly}(\alpha_{0}^{-1},\alpha_{1},\vartheta,\nu,n,m)$ such that $\|\Theta-\Theta_{*}\|\leq\epsilon$ for any $0\leq\epsilon\leq\epsilon_{0}$ guarantees

\[
J(K(\Theta,Q,R),\Theta_{*},Q,R)-J_{*}\leq C_{0}\epsilon^{2}
\tag{63}
\]

where $J(K(\Theta,Q,R),\Theta_{*},Q,R)$ is the average expected cost of executing the controller $K(\Theta,Q,R)$ on the system $\Theta_{*}$ and $J_{*}$ is the optimal average expected cost.

Lemma 4

(Cassel et al. [37]) Suppose $J(K(\Theta,Q,R),\Theta_{*},Q,R)\leq\mathcal{J}$; then $K(\Theta,Q,R)$ is $(\kappa,\gamma)$-strongly stabilizing with $\kappa=\sqrt{\mathcal{J}/(\alpha_{0}\sigma^{2}_{\omega})}$ and $\gamma=\frac{1}{2\kappa^{2}}$.

A-B Confidence Construction (System Identification)

Proof of Theorem 1

Proof:

Note that during the warm-up phase we execute the control input $u_{t}=K_{0}x_{t}+\eta_{t}$, which includes an extra exploration term $\eta_{t}\sim\mathcal{N}(0,2\sigma_{\omega}^{2}\kappa_{0}^{2}I)$. The state norm of the system is upper-bounded during the warm-up phase by the strong stability property of $K_{0}$ (see [5]). Now, let $T_{0}$ denote the warm-up duration. The confidence ellipsoid is constructed during this phase by applying the procedure illustrated in Section III-B:

\[
\mathcal{C}_{t}(\delta):=\Big\{\Theta\in\mathbb{R}^{(n+m)\times n}\;\Big|\;\operatorname{Tr}\big((\Theta-\hat{\Theta}_{T_{0}})^{\top}V_{[0,T_{0}]}(\Theta-\hat{\Theta}_{T_{0}})\big)\leq r_{T_{0}}\Big\}
\]

where

\[
r_{T_{0}}=\bigg(\sigma_{\omega}\sqrt{2n\Big(\log\frac{n}{\delta}+\log\Big(1+\frac{300\sigma_{\omega}^{2}\kappa_{0}^{4}}{\gamma_{0}^{2}}(n+\vartheta^{2}\kappa^{2}_{0})\log\frac{T_{0}}{\delta}\Big)\Big)}+\sqrt{\lambda}\vartheta\bigg)^{2}.
\]

Furthermore, from Theorem 20 of [5], we have the following lower bound on $V_{[0,T_{0}]}$:

\[
V_{[0,T_{0}]}\succeq\frac{T_{0}\sigma_{\omega}^{2}}{80}I,
\]

which results in

\[
\big\|\hat{\Theta}-\Theta_{*}\big\|^{2}\leq\frac{80}{T_{0}\sigma^{2}_{\omega}}\bigg(\sigma_{\omega}\sqrt{2n\Big(\log\frac{n}{\delta}+\log\Big(1+\frac{300\sigma_{\omega}^{2}\kappa_{0}^{4}}{\gamma_{0}^{2}}(n+\vartheta^{2}\kappa^{2}_{0})\log\frac{T_{0}}{\delta}\Big)\Big)}+\sqrt{\lambda}\vartheta\bigg)^{2}.
\tag{64}
\]

Now, applying the results of Lemmas 3 and 4 and setting $\kappa=\kappa_{0}$ yields the requirement

\[
\frac{80}{T_{0}\sigma^{2}_{\omega}}\bigg(\sigma_{\omega}\sqrt{2n\Big(\log\frac{n}{\delta}+\log\Big(1+\frac{300\sigma_{\omega}^{2}\kappa_{0}^{4}}{\gamma_{0}^{2}}(n+\vartheta^{2}\kappa^{2}_{0})\log\frac{T_{0}}{\delta}\Big)\Big)}+\sqrt{\lambda}\vartheta\bigg)^{2}\leq\min\bigg\{\kappa^{2}_{0}\frac{\alpha_{0}\sigma^{2}_{\omega}}{C_{0}},\,\epsilon^{2}_{0}\bigg\}.
\tag{65}
\]

When Algorithm 2 begins actuating in mode 1, given that $\Theta_{0}$ resides within a $(\kappa_{0},\gamma_{0})$-stabilizing neighborhood, solving the primal problem with the confidence bound constructed around $\Theta_{0}$ results in an average expected cost improvement over that of $K_{0}$, or at least achieves a similar performance. ∎
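The smallest warm-up duration $T_{0}$ satisfying (65) has no closed form, but it can be found numerically; the sketch below does so by doubling, with all constants ($n$, $\sigma_{\omega}$, $\kappa_{0}$, $\gamma_{0}$, $\vartheta$, $\lambda$, $\delta$, and the right-hand side `target` of (65)) treated as given inputs:

```python
import math

def warmup_duration(n, sigma_w, kappa0, gamma0, vartheta, lam, delta, target):
    # Left-hand side of (65) as a function of T0.
    def lhs(T0):
        inner = math.log(n / delta) + math.log(
            1 + (300 * sigma_w**2 * kappa0**4 / gamma0**2)
            * (n + vartheta**2 * kappa0**2) * math.log(T0 / delta))
        return (80 / (T0 * sigma_w**2)) * (
            sigma_w * math.sqrt(2 * n * inner) + math.sqrt(lam) * vartheta)**2
    # lhs decays roughly like log(T0)/T0, so doubling terminates.
    T0 = 2
    while lhs(T0) > target:
        T0 *= 2
    return T0
```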

Proof of Theorem 2

Proof:

The $\ell_{2}$ least-squares error regularized by (37) is written as

\[
e(\Theta^{i})=\operatorname{Tr}\big((\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i})^{\top}V_{[0,\tau_{q}]}^{pi}(\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i})\big)+\sum_{s=\tau_{q}+1}^{t-1}\operatorname{Tr}\big((x^{i}_{s+1}-{\Theta^{i}}^{\top}z^{i}_{s})(x^{i}_{s+1}-{\Theta^{i}}^{\top}z^{i}_{s})^{\top}\big).
\tag{66}
\]

Solving

\[
\min_{\Theta^{i}}\;e(\Theta^{i})
\tag{68}
\]

yields

\[
\hat{\Theta}^{i}_{t}={V^{i}}^{-1}_{[0,t]}\big(Z^{\top}X+V^{pi}_{[0,\tau_{q}]}\hat{\Theta}^{pi}_{\tau_{q}}\big),
\tag{69}
\]

where

\[
V_{[0,t]}^{i}=V_{[0,\tau_{q}]}^{pi}+V^{i}_{[\tau_{q},t]},\qquad V^{i}_{[\tau_{q},t]}=\sum_{s=\tau_{q}}^{t}z^{i}_{s}{z^{i}_{s}}^{\top},\qquad Z^{\top}X=\sum_{s=\tau_{q}}^{t}z^{i}_{s}{x^{i}_{s}}^{\top}.
\tag{70}
\]

Substituting $X=Z^{i}\Theta^{i}_{*}+W$ (the counterpart of (13) for mode $i$), where $X$, $Z$, and $W$ are the matrices whose rows are $x^{\top}_{\tau_{q}+1},\ldots,x^{\top}_{t}$, $z^{\top}_{\tau_{q}},\ldots,z^{\top}_{t-1}$, and $\omega_{\tau_{q}+1}^{\top},\ldots,\omega_{t}^{\top}$ respectively, into (69) yields

\[
\begin{aligned}
\hat{\Theta}^{i}_{t}&={V^{i}}^{-1}_{[0,t]}\big(Z^{\top}(Z\Theta^{i}_{*}+W)+V_{[0,\tau_{q}]}^{pi}\hat{\Theta}^{pi}_{\tau_{q}}\big)\\
&={V^{i}}^{-1}_{[0,t]}\big(V_{[0,t]}^{i}\Theta^{i}_{*}+V^{pi}_{[0,\tau_{q}]}(\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i}_{*})+Z^{\top}W\big)\\
&=\Theta^{i}_{*}+{V^{i}}^{-1}_{[0,t]}\big(V^{pi}_{[0,\tau_{q}]}(\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i}_{*})+Z^{\top}W\big)
\end{aligned}
\]

where $Z^{\top}W=\sum_{s=\tau_{q}}^{t}z^{i}_{s}\omega^{\top}_{s}$. For an arbitrary random extended state $z^{i}$, one can write

\[
{z^{i}}^{\top}\hat{\Theta}^{i}_{t}-{z^{i}}^{\top}\Theta^{i}_{*}=\langle z^{i},Z^{\top}W\rangle_{{V^{i}}^{-1}_{[0,t]}}+\langle z^{i},V^{pi}_{[0,\tau_{q}]}(\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i}_{*})\rangle_{{V^{i}}^{-1}_{[0,t]}}
\]

which leads to

\[
\|{z^{i}}^{\top}\hat{\Theta}^{i}_{t}-{z^{i}}^{\top}\Theta^{i}_{*}\|\leq\|z^{i}\|_{{V^{i}}^{-1}_{[0,t]}}\Big(\|Z^{\top}W\|_{{V^{i}}^{-1}_{[0,t]}}+\|V^{pi}_{[0,\tau_{q}]}(\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i}_{*})\|_{{V^{i}}^{-1}_{[0,t]}}\Big)
\]

which can equivalently be written as

\[
\|\hat{\Theta}^{i}-\Theta^{i}_{*}\|^{2}_{V^{i}_{[0,t]}}\leq\Big(\|Z^{\top}W\|_{{V^{i}}^{-1}_{[0,t]}}+\|V^{pi}_{[0,\tau_{q}]}(\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i}_{*})\|_{{V^{i}}^{-1}_{[0,t]}}\Big)^{2}.
\tag{71}
\]

Given that $V^{i}_{[0,t]}\succeq V^{pi}_{[0,\tau_{q}]}$,

\[
\|\hat{\Theta}^{i}-\Theta^{i}_{*}\|^{2}_{V^{i}_{[0,t]}}\leq\Big(\|Z^{\top}W\|_{{V^{i}}^{-1}_{[0,t]}}+\sqrt{\|\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i}_{*}\|^{2}_{V^{pi}_{[0,\tau_{q}]}}}\Big)^{2}
\]

holds true, where the first term, $\|Z^{\top}W\|_{{V^{i}}^{-1}_{[0,t]}}$, is bounded via Corollary 1 with probability at least $1-\delta$ as

\[
\|Z^{\top}W\|_{{V^{i}}^{-1}_{[0,t]}}^{2}\leq 2\sigma_{\omega}^{2}\log\bigg(\frac{n\det(V^{i}_{[0,t]})}{\delta\det(V_{[0,\tau_{q}]}^{pi})}\bigg)
\tag{72}
\]

and the second term is bounded by (37) as

\[
\|\hat{\Theta}^{pi}_{\tau_{q}}-\Theta^{i}_{*}\|_{V^{pi}_{[0,\tau_{q}]}}^{2}\leq r_{\tau_{q}}
\tag{73}
\]

with probability at least $1-\delta$. Combining the bounds (72) and (73) completes the proof. ∎
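Computationally, the estimator (69)–(70) is an ordinary regularized least-squares update in which the projected ellipsoid plays the role of the prior. A minimal sketch, assuming `Z` stacks the extended states $z^{i}_{s}$ of the current epoch row-wise and `X` the corresponding successor states:

```python
import numpy as np

def projected_ls_update(V_prior, Theta_prior, Z, X):
    # V_prior, Theta_prior: projected covariance V^{pi} and center
    # Theta_hat^{pi} carried over from previous epochs; Z, X: data of
    # the current epoch. Implements (69)-(70):
    #   V = V^{pi} + Z^T Z,  Theta_hat = V^{-1} (Z^T X + V^{pi} Theta^{pi}).
    V = V_prior + Z.T @ Z
    Theta_hat = np.linalg.solve(V, Z.T @ X + V_prior @ Theta_prior)
    return Theta_hat, V
```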

The following lemma obtains an upper bound on the right-hand side of (39), which will later be used in the regret bound analysis.

Lemma 5

The radius $r^{j}$ of the ellipsoid (38) can be upper-bounded as

\[
r^{j}\leq 8\sigma_{\omega}^{2}n\Big(2\log\frac{n}{\delta}+(1+2\bar{\Upsilon})\log t+(m-1)\log(1+2\bar{\Upsilon})t\Big)+2\epsilon^{2}\lambda=:\bar{r},
\tag{74}
\]

where

\[
\bar{\Upsilon}=\frac{3200{\kappa^{*}}^{6}\sigma_{\omega}^{2}n}{\chi^{2}}\log\frac{T}{\delta}.
\]

A-C Properties of Relaxed SDP

This subsection goes through some essential features of the primal and dual relaxed-SDP problems (45) and (49). While the former is solved by Algorithm 2 to compute a regularizing control in the face of parameter uncertainty, the latter plays a pivotal role in the stability and regret bound analysis.

Both the primal and dual problems follow directly from applying the next two matrix-perturbation lemmas, Lemmas 6 and 7 (adapted from [5]), to the exact primal SDP and its dual. As the following lemmas hold for any actuating mode $i$, we deliberately drop the superscript $i$.

Lemma 6

Let $X$ and $\Delta$ be matrices of matching sizes and let $\Delta\Delta^{\top}\leq rV^{-1}$ for a positive definite matrix $V$. Then for any $P\geq 0$ and $\mu\geq r+(1+r)\|X\|\|V\|^{1/2}$, we have

\[
-\mu\|P\|_{*}V^{-1}\leq(X+\Delta)^{\top}P(X+\Delta)\leq\mu\|P\|_{*}V^{-1}.
\]
Proof:

The proof follows the same steps as Lemma 24 in [5]. Furthermore, the lower bound on $\mu$ is obtained by following the steps of [5] exactly. However, our bound is slightly different from that of [5] because we did not apply normalization in the confidence set construction. ∎

To discuss the solution properties of the primal problem, we need the following lemma, adapted from [5].

Lemma 7

Let $X$ and $\Delta$ be matrices of matching sizes and let $\Delta\Delta^{\top}\leq rV^{-1}$ for a positive definite matrix $V$. Then for any $\Sigma\geq 0$ and $\mu\geq r+(1+r)\|X\|\|V\|^{1/2}$,

\[
\|(X+\Delta)^{\top}\Sigma(X+\Delta)-X^{\top}\Sigma X\|\leq\mu\,\Sigma\bullet V^{-1}.
\]
Proof:

The proof can be found in [5]. ∎

By substituting $X=\hat{\Theta}^{i}_{t}$, $\Delta=\hat{\Theta}_{t}^{i}-\Theta_{*}^{i}$, $\Sigma=\Sigma(\hat{\Theta}^{i}_{t},Q,R^{i})$, $P=P(\hat{\Theta}^{i}_{t},Q,R^{i})$, and $V=V^{i}_{[0,t]}$ in Lemmas 6 and 7, and applying them to the exact primal SDP ((11) with $\Theta=\Theta^{i}_{*}$) and its dual, the primal and dual relaxed-SDP problems (45) and (49) are obtained. Moreover, it is straightforward to show $\mu_{i}\geq r^{i}+(1+r^{i})\vartheta\|V^{i}_{[0,t]}\|^{1/2}$.

Furthermore, it can also be shown that $\Sigma(\Theta_{*}^{i},Q,R^{i})$, the solution of the SDP (11) (with $\Theta=\Theta_{*}^{i}$ and $R=R^{i}$), is a feasible solution of the relaxed SDP (45).

We can show that the primal and dual solutions of (45) and (49) respectively satisfy $\|\Sigma(\hat{\Theta}^{i}_{t},Q,R^{i})\|_{*}\leq J^{i}_{*}/\alpha_{0}$ and $\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|_{*}\leq J_{*}^{i}/\sigma^{2}$, where $J_{*}^{i}$ is the corresponding optimal average expected cost of actuating in mode $i$ (see Lemma 15 in [5] for more details).

The following lemma shows how to extract the deterministic linear policy from the relaxed SDP. We will later apply this lemma in Lemma 9 to show that, for all actuating modes, the designed controller is sequentially strongly stabilizing within an epoch (the time interval between two subsequent switches).

Lemma 8

Assume $\mu_{i}\geq r^{i}+(1+r^{i})\vartheta\|V^{i}_{[0,t]}\|^{1/2}$ and $V^{i}_{[0,t]}\succeq(\nu_{i}\mu_{i}/\alpha_{0}\sigma^{2})I$ (this holds under the definition $\lambda\geq 4\bar{\nu}\bar{\mu}/\alpha_{0}\sigma^{2}$ given in Lemma 9, with $\bar{\nu}=\max_{i}\nu_{i}$ and $\bar{\mu}>\mu_{i}$ for all $i$). Let $(\hat{A}_{t},\hat{B}_{t}^{i})^{\top}=\hat{\Theta}_{t}^{i}$, and let $\Sigma(\hat{\Theta}^{i}_{t},Q,R^{i})$ and $P(\hat{\Theta}^{i}_{t},Q,R^{i})$ denote the primal and dual solutions of the relaxed SDPs (45) and (49). Then $\Sigma_{xx}(\hat{\Theta}^{i}_{t},Q,R^{i})$ is invertible and

\[
\begin{aligned}
P(\hat{\Theta}^{i}_{t},Q,R^{i})={}&Q+K^{\top}(\hat{\Theta}^{i}_{t},Q,R^{i})R^{i}K(\hat{\Theta}^{i}_{t},Q,R^{i})\\
&+(\hat{A}_{t}+\hat{B}_{t}^{i}K(\hat{\Theta}^{i}_{t},Q,R^{i}))^{\top}P(\hat{\Theta}^{i}_{t},Q,R^{i})(\hat{A}_{t}+\hat{B}_{t}^{i}K(\hat{\Theta}^{i}_{t},Q,R^{i}))\\
&-\mu_{i}\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|_{*}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{t},Q,R^{i})\end{pmatrix}^{\top}{V^{i}}^{-1}_{[0,t]}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{t},Q,R^{i})\end{pmatrix}
\end{aligned}
\tag{75}
\]

where $K(\hat{\Theta}^{i}_{t},Q,R^{i})=\Sigma_{ux}(\hat{\Theta}^{i}_{t},Q,R^{i})\Sigma^{-1}_{xx}(\hat{\Theta}^{i}_{t},Q,R^{i})$.

Proof:

The proof is in Lemma 16 of [5]. ∎

We now carry out the stability analysis of the system under the control designed by Algorithms 3 and 4.

A-D Stability Analysis

The proof of Theorem 3 follows directly from Lemmas 9 and 10, proved below.

Lemma 9 shows that the designed controller $K(\hat{\Theta}^{i}_{t},Q,R^{i})$ is $(\kappa_{i},\gamma_{i})$-strongly stabilizing for the parameter values specified in Theorem 3. Lemma 10 shows that $P(\hat{\Theta}^{i}_{t},Q,R^{i})$ is close to $P(\hat{\Theta}^{i}_{t+1},Q,R^{i})$, which proves the sequential nature of the stability and completes the proof of Theorem 3.

Lemma 9

For the system actuating in an arbitrary mode $i$, $K(\hat{\Theta}^{i}_{t},Q,R^{i})$ is $(\kappa_{i},\gamma_{i})$-strongly stabilizing, with $A_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{t},Q,R^{i})=H^{i}_{t}L^{i}_{t}{H^{i}_{t}}^{-1}$, where $H^{i}_{t}=P^{1/2}(\hat{\Theta}^{i}_{t},Q,R^{i})$ and $\|L^{i}_{t}\|\leq 1-\gamma_{i}$. Moreover, $(\alpha_{0}/2)I\preceq P(\hat{\Theta}^{i}_{t},Q,R^{i})\preceq(\nu_{i}/\sigma_{\omega}^{2})I$.

Proof:

Despite the similarity of the proof to that of Lemma 18 in [5], we provide it for the sake of completeness and to justify the value of $\lambda$ and the upper bound on $\mu_{i}$ mentioned in Theorem 3.

First, we need to appropriately upper-bound $\mu_{i}$, which is an input to Algorithms 3 and 4. Given $\|V^{i}_{t}\|\leq(1+2\bar{\Upsilon})T$ for all $i$ (proved in Lemma 11), we have

\[
\mu_{i}=r^{i}+(1+r^{i})\vartheta\|V^{i}_{[0,t]}\|^{1/2}\leq r^{i}+(1+r^{i})\vartheta\sqrt{(1+2\bar{\Upsilon})T}\leq\bar{r}+(1+\bar{r})\vartheta\sqrt{(1+2\bar{\Upsilon})T}=:\bar{\mu}
\tag{76}
\]

where $\bar{r}$ is an upper bound on $r^{i}$ for all $i\in\{1,\ldots,2^{m}\}$, already provided by Lemma 5:

\[
\bar{r}:=8\sigma^{2}_{\omega}n\Big(2\log\frac{n}{\delta}+(1+2\bar{\Upsilon})\log T+(m-1)\log(1+2\bar{\Upsilon})T\Big)+2\epsilon^{2}\lambda.
\]

Recalling the dual formulation (49) and the fact that $\nu_{i}\geq J_{*}^{i}$, one can write

\[
\nu_{i}\geq J_{*}^{i}\geq P(\hat{\Theta}^{i}_{t},Q,R^{i})\bullet W\geq\begin{pmatrix}Q&0\\ 0&R^{i}\end{pmatrix}\bullet W\geq\alpha_{0}\sigma^{2},
\]

which clearly shows that $\nu_{i}/\alpha_{0}\sigma^{2}\geq 1$.

Now, assuming $\lambda\geq 4\nu_{i}\mu_{i}/\alpha_{0}\sigma_{\omega}^{2}$ for all $i\in\{1,\ldots,2^{m}\}$, applying the perturbation lemma, Lemma 7, to (75) yields

\[
\begin{aligned}
P(\hat{\Theta}^{i}_{t},Q,R^{i})={}&Q+K^{\top}(\hat{\Theta}^{i}_{t},Q,R^{i})R^{i}K(\hat{\Theta}^{i}_{t},Q,R^{i})\\
&+(A_{*}+B_{*}^{i}K(\hat{\Theta}^{i}_{t},Q,R^{i}))^{\top}P(\hat{\Theta}^{i}_{t},Q,R^{i})(A_{*}+B_{*}^{i}K(\hat{\Theta}^{i}_{t},Q,R^{i}))\\
&-2\mu_{i}\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|_{*}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{t},Q,R^{i})\end{pmatrix}^{\top}{V^{i}}^{-1}_{[0,t]}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{t},Q,R^{i})\end{pmatrix}.
\end{aligned}
\tag{77}
\]

We have $\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|_{*}\leq\nu_{i}/\sigma_{\omega}^{2}$, and we set $\lambda\geq 4\bar{\nu}\bar{\mu}/\alpha_{0}\sigma_{\omega}^{2}$, where $\bar{\nu}=\max_{i\in\{1,\ldots,2^{m}\}}\nu_{i}$ and $\bar{\mu}=\max_{i\in\{1,\ldots,2^{m}\}}\mu_{i}$ is defined by (76). Given that $V^{i}_{[0,t]}\succeq\lambda I$, it follows that

\[
\mu_{i}\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|_{*}{V^{i}}^{-1}_{[0,t]}\preceq\mu_{i}\cdot\frac{\nu_{i}}{\sigma_{\omega}^{2}}\cdot\frac{\alpha_{0}\sigma_{\omega}^{2}}{4\bar{\nu}\bar{\mu}}I\preceq\mu_{i}\cdot\frac{\nu_{i}}{\sigma_{\omega}^{2}}\cdot\frac{\alpha_{0}\sigma_{\omega}^{2}}{4\nu_{i}\mu_{i}}I=\frac{\alpha_{0}}{4}I.
\tag{78}
\]

Plugging (78), together with the assumptions $Q,R^{i}\succeq\alpha_{0}I$, into (77) yields

\[
P(\hat{\Theta}^{i}_{t},Q,R^{i})\succeq\frac{\alpha_{0}}{2}I+\frac{\alpha_{0}}{2}K^{\top}(\hat{\Theta}^{i}_{t},Q,R^{i})K(\hat{\Theta}^{i}_{t},Q,R^{i})+(A_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{t},Q,R^{i}))^{\top}P(\hat{\Theta}^{i}_{t},Q,R^{i})(A_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{t},Q,R^{i})).
\tag{79}
\]

Defining $\kappa_{i}=\sqrt{\frac{2\nu_{i}}{\alpha_{0}\sigma_{\omega}^{2}}}$ and using the fact that $\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|\leq\frac{\nu_{i}}{\sigma_{\omega}^{2}}$ (which implies $\kappa_{i}^{-2}P(\hat{\Theta}^{i}_{t},Q,R^{i})\preceq\frac{\nu_{i}\kappa_{i}^{-2}}{\sigma_{\omega}^{2}}I$), it holds that

\[
P(\hat{\Theta}^{i}_{t},Q,R^{i})-\frac{1}{2}\alpha_{0}I\preceq(1-\kappa_{i}^{-2})P(\hat{\Theta}^{i}_{t},Q,R^{i}).
\tag{80}
\]

From (79) and (80) one has

\[
P^{-1/2}(\hat{\Theta}^{i}_{t},Q,R^{i})(A_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{t},Q,R^{i}))^{\top}P(\hat{\Theta}^{i}_{t},Q,R^{i})(A_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{t},Q,R^{i}))P^{-1/2}(\hat{\Theta}^{i}_{t},Q,R^{i})\preceq(1-\kappa_{i}^{-2})I.
\tag{81}
\]

Denoting $H_{t}^{i}=P^{1/2}(\hat{\Theta}^{i}_{t},Q,R^{i})$ and $L_{t}^{i}=P^{-1/2}(\hat{\Theta}^{i}_{t},Q,R^{i})(A_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{t},Q,R^{i}))P^{1/2}(\hat{\Theta}^{i}_{t},Q,R^{i})$, (81) leads to $\|L_{t}^{i}\|\leq\sqrt{1-\kappa_{i}^{-2}}\leq 1-\tfrac{1}{2}\kappa_{i}^{-2}$.

Furthermore, (79) gives $P(\hat{\Theta}^{i}_{t},Q,R^{i})\succeq\frac{\alpha_{0}}{2}K^{\top}(\hat{\Theta}^{i}_{t},Q,R^{i})K(\hat{\Theta}^{i}_{t},Q,R^{i})$, which together with $\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|\leq\frac{\nu_{i}}{\sigma_{\omega}^{2}}$ implies $\|K(\hat{\Theta}^{i}_{t},Q,R^{i})\|\leq\kappa_{i}$.

This automatically yields $\|H^{i}_{t}\|\leq\sqrt{\nu_{i}/\sigma_{\omega}^{2}}$ and $\|{H^{i}_{t}}^{-1}\|\leq\sqrt{2/\alpha_{0}}$. The proof is thus complete, and the $K(\hat{\Theta}^{i}_{t},Q,R^{i})$ are $(\kappa_{i},\gamma_{i})$-strongly stabilizing. ∎

Having established strong stability in Lemma 9, Lemma 10 shows that $P(\hat{\Theta}^{i}_{t},Q,R^{i})$ is close to $P(\hat{\Theta}^{i}_{t+1},Q,R^{i})$, which establishes the sequential nature of the stability and completes the proof of Theorem 3 (see [5] for more detail).

Lemma 10

For any mode $i\in\mathcal{M}$, choosing the regularization parameter $\lambda\geq 8\kappa_{*}^{8}\sqrt{n+m}\frac{4\bar{\nu}\bar{\mu}}{\alpha_{0}^{i}\sigma_{\omega}^{2}}$ gives

\[
P(\hat{\Theta}^{i}_{t},Q,R^{i})\preceq P(\Theta^{i}_{*},Q,R^{i})\preceq P(\hat{\Theta}^{i}_{t+1},Q,R^{i})+\frac{\alpha_{0}^{i}\gamma_{i}}{2}I.
\]
Proof:

Proof can be found in Appendix B. ∎

Proof of Theorem 4

Proof:

From Lemma 1, the state norm of the system actuating in mode $i$ during an arbitrary actuation epoch over the interval $[\tau_{q},\tau_{q+1}]$ can be upper-bounded as

\[
\|x_{t}\|\leq\kappa_{i}(1-\gamma_{i}/2)^{t}\|x_{\tau_{q}}\|+\frac{2\kappa_{i}}{\gamma_{i}}\max_{\tau_{q}\leq t\leq\tau_{q+1}}\|w_{t}\|
\tag{82}
\]

where $x_{\tau_{q}}$ is the initial state of the actuation epoch. With $\kappa^{*}=\max_{i\in\{1,\ldots,|\mathcal{B}^{*}|\}}\kappa_{i}$, we can deduce that for all modes $i\in\{1,\ldots,2^{m}\}$

\[
\|x_{t}\|\leq\kappa^{*}(1-\gamma^{*}/2)^{t}\|x_{\tau_{q}}\|+\frac{2\kappa^{*}}{\gamma^{*}}\max_{\tau_{q}\leq t\leq\tau_{q+1}}\|w_{t}\|
\tag{83}
\]

where $\kappa^{*}=\sqrt{2\bar{\nu}/\alpha_{0}\sigma_{\omega}^{2}}$ and $\bar{\nu}:=\max_{i\in\{1,\ldots,2^{m}\}}\nu_{i}$.

Considering

\[
\|w_{t}\|\leq 10\sigma_{\omega}\sqrt{n\log\frac{T}{\delta}},
\]

we can upper-bound the second term on the right-hand side of (83) by $\bar{L}$, defined as

\[
\bar{L}:=\frac{20\kappa^{*}}{\gamma^{*}}\sigma_{\omega}\sqrt{n\log\frac{T}{\delta}}.
\]

Let $I=\{k_{0},k_{1},\ldots,k_{l-1},k_{l}\}$ denote the sequence of times at which switches in the actuating mode happen; then one can upper-bound the state norm as

\[
\begin{aligned}
\|x_{k_{f}}\|&\leq(1-\gamma^{*}/2)^{k_{f}-k_{l}}\kappa^{*}\|x_{k_{l}}\|+\bar{L}\\
&\leq(1-\gamma^{*}/2)^{k_{l}-k_{l-1}}(1-\gamma^{*}/2)^{k_{f}-k_{l}}{\kappa^{*}}^{2}\|x_{k_{l-1}}\|+\bar{L}(1-\gamma^{*}/2)^{k_{l}-k_{l-1}}\kappa^{*}+\bar{L}\\
&\leq\cdots\leq(1-\gamma^{*}/2)^{k_{f}-0}{\kappa^{*}}^{(k_{f}-0)/\tau}\|x_{0}\|+\Omega
\end{aligned}
\]

in which $\gamma^{*}=\frac{1}{2\kappa^{*}}$. The term $\Omega$ collects all the $\bar{L}$-dependent terms.

For stability purposes, the inequality

\[
\kappa^{*}(1-\gamma^{*}/2)^{\tau_{AD}}=1-\chi<1
\tag{84}
\]

needs to be fulfilled, which yields

\[
\tau_{AD}>\tau_{MADT}:=\left\lceil\frac{\ln\kappa^{*}-\ln(1-\chi)}{-\ln(1-\gamma^{*})}\right\rceil
\tag{85}
\]

in which $\tau_{MADT}$ is the MADT and $0<\chi<1$ is a user-defined parameter. Subsequently, we can write

\[
\Omega\leq\bar{L}\sum_{i=0}^{\infty}\big((1-\gamma^{*}/2){\kappa^{*}}^{1/\tau_{AD}}\big)^{i}\leq\frac{\bar{L}}{\chi}=:U_{\Omega}.
\tag{86}
\]

Then, the state is upper-bounded by

\[
\|x(t)\|\leq(1-\chi)^{t}\|x_{0}\|+U_{\Omega}\leq e^{-\chi t}\|x_{0}\|+U_{\Omega}.
\tag{87}
\]

∎

The boundedness of $V^{i}_{t}$ at each epoch is crucial for estimation purposes. To bound it, we need to ensure that the extended state $z^{i}_{t}$ stays bounded as the sequentially stabilizing policy is executed. The following lemma gives this upper bound and defines the parameter $\bar{\Upsilon}$ used in the regret bound analysis.

Lemma 11

Applying Algorithms 1 and 2 yields

\[
\sum_{s=1}^{t}\|z(s)\|^{2}\leq 2\bar{\Upsilon}T\quad\text{and}\quad\|V_{t}\|\leq(1+2\bar{\Upsilon})T,\qquad\text{where}\quad\bar{\Upsilon}=\frac{3200{\kappa^{*}}^{6}\sigma_{\omega}^{2}n}{\chi^{2}}\log\frac{T}{\delta}.
\]
Proof:

Proof can be found in Appendix B. ∎

A-E Regret Bound Analysis

This section presents the regret bound analysis of the proposed "learn and control while switching" strategy. Proceeding as in [5], the regret can be decomposed into the form (57)–(60). However, each term must be bounded differently than in [5] due to the switching nature of the closed-loop system.

The following lemma gives an upper bound on the ratio between the covariance matrix determinant of the central ellipsoid and that of its projection onto an arbitrary space associated with mode $i$. The result will be useful in upper-bounding the $R_{1}$ and $R_{4}$ terms.

Lemma 12

Given $V_{[0,t]}$ and $V^{pi}_{[0,t]}$, the covariance matrices associated with the central ellipsoid and its projection onto the space of mode $i$ respectively, for any $t\leq T$

\[
\log\frac{\det(V_{[0,t]})}{\det(V^{pi}_{[0,t]})}\leq(m-1)\log((1+2\bar{\Upsilon})t)
\tag{88}
\]

holds true almost surely.

Proof:

Proof has been provided in Appendix B. ∎
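As an aside, the eigenvalue argument behind Lemma 12 is easy to spot-check numerically; the sketch below reuses the hypothetical `project_covariance` helper from Section VI, with a random positive definite $V\succeq I$ so that $\lambda_{\max}\geq 1$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, m_i = 3, 3, 1
G = rng.standard_normal((n + m, n + m))
V = G @ G.T + np.eye(n + m)        # random PD "central" covariance, V >= I
idx = np.arange(n + m_i)           # keep the A-rows plus one actuator
Vp = project_covariance(V, idx)    # projection onto the mode-i space
gap = np.linalg.slogdet(V)[1] - np.linalg.slogdet(Vp)[1]
bound = (m - 1) * np.log(np.linalg.eigvalsh(V).max())
print(gap <= bound)                # the determinant-ratio bound, one instance
```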

The following lemma gives an upper bound for the term $R_{1}$.

Lemma 13

Let $ns$ denote the number of epochs. The upper bound on $|R_{1}|$ is given as

\[
|R_{1}|\leq\frac{\bar{\nu}(1+N_{ts})}{\sigma_{\omega}^{2}}X_{[0,T]}
\tag{89}
\]

where $X_{[0,T]}=\max_{t\leq T}\|x_{t}\|$ satisfies

\[
X_{[0,T]}^{2}=2(1-\chi)^{2t}\|x_{0}\|^{2}+\frac{800{\kappa^{*}}^{2}\sigma_{\omega}^{2}n}{\chi^{2}{\gamma^{*}}^{2}}\log\frac{T}{\delta}
\tag{90}
\]

and

\[
N_{ts}:=(1+2\bar{\Upsilon})\log T+(ns-1)(m-1)\log(1+2\bar{\Upsilon})T,\qquad\bar{\Upsilon}=\frac{3200{\kappa^{*}}^{6}\sigma_{\omega}^{2}n}{\chi^{2}}\log\frac{T}{\delta}.
\]
Proof:

Proof can be found in Appendix B. ∎

Lemma 14

The upper bound on $|R_{2}|$ is given as

\[
|R_{2}|\leq\frac{\bar{\nu}\vartheta}{\sigma_{\omega}}\sqrt{3\bar{\Upsilon}T\log\frac{4}{\delta}},
\]

which holds with probability at least $1-\delta/4$.

Proof:

The proof directly follows [5], taking into account that $\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|\leq\bar{\nu}/\sigma^{2}$ for all $i\in\mathcal{I}$, where $\bar{\nu}=\max_{i}\nu_{i}$. ∎

To bound $R_{3}$, we have the following lemma:

Lemma 15

The term $R_{3}$ has the following upper bound when the number of epochs is $ns$:

\[
\begin{aligned}
R_{3}={}&\sum_{t=\tau_{1}=0}^{\tau_{2}}\big(w^{\top}_{t}P(\hat{\Theta}^{\sigma(t)}_{t},Q,R^{\sigma(t)})w_{t}-\sigma_{\omega}^{2}\|P(\hat{\Theta}^{\sigma(t)}_{t},Q,R^{\sigma(t)})\|_{*}\big)1_{\mathcal{E}_{t}}+\cdots\\
&+\sum_{t=\tau_{ns}}^{T}\big(w^{\top}_{t}P(\hat{\Theta}^{\sigma(t)}_{t},Q,R^{\sigma(t)})w_{t}-\sigma_{\omega}^{2}\|P(\hat{\Theta}^{\sigma(t)}_{t},Q,R^{\sigma(t)})\|_{*}\big)1_{\mathcal{E}_{t}}\\
\leq{}&8\bar{\nu}\sqrt{T\log^{3}\frac{4T}{\delta}}.
\end{aligned}
\tag{91}
\]
Proof:

The proof directly follows [5], taking into account that $\|P(\hat{\Theta}^{i}_{t},Q,R^{i})\|\leq\bar{\nu}/\sigma^{2}$ for all $i\in\mathcal{I}$, where $\bar{\nu}=\max_{i}\nu_{i}$. ∎

Lemma 16

The term $R_{4}$ has the following upper bound:

\[
R_{4}\leq\frac{8\bar{\nu}}{\sigma_{\omega}^{2}}\big(\bar{r}+(1+\bar{r})\vartheta\sqrt{(1+2\bar{\Upsilon})T}\big)\big((1+2\bar{\Upsilon})\log T+(ns-1)\log(1+2\bar{\Upsilon})T\big),
\tag{92}
\]
\[
\bar{r}:=8\sigma_{\omega}^{2}n\Big(2\log\frac{n}{\delta}+(1+2\bar{\Upsilon})\log T+(m-1)\log(1+2\bar{\Upsilon})T\Big)+2\epsilon^{2}\lambda.
\]
Proof:

Proof can be found in Appendix B. ∎

Proof of Theorem 5

Proof:

Proof of the first statement:

Putting everything together gives the following upper bound on the regret $R_{T}$:

\[
\begin{aligned}
R_{T}\leq{}&\frac{\bar{\nu}(1+N_{ts})}{\sigma_{\omega}^{2}}X_{[0,T]}+\frac{\bar{\nu}\vartheta}{\sigma_{\omega}}\sqrt{3\bar{\Upsilon}T\log\frac{4}{\delta}}+8\bar{\nu}\sqrt{T\log^{3}\frac{4T}{\delta}}\\
&+\frac{8\bar{\nu}}{\sigma_{\omega}^{2}}\big(\bar{r}+(1+\bar{r})\vartheta\sqrt{(1+2\bar{\Upsilon})T}\big)\big((1+2\bar{\Upsilon})\log T+(ns-1)\log(1+2\bar{\Upsilon})T\big)
\end{aligned}
\]

where

\[
X_{[0,T]}^{2}=2(1-\chi)^{2t}\|x_{0}\|^{2}+\frac{800{\kappa^{*}}^{2}\sigma_{\omega}^{2}n}{\chi^{2}{\gamma^{*}}^{2}}\log\frac{T}{\delta},
\tag{93}
\]
\[
N_{ts}:=(1+2\bar{\Upsilon})\log T+(ns-1)(m-1)\log(1+2\bar{\Upsilon})T.
\]

Proof of the second statement:

The regret bound of an OFU-based algorithm is directly connected to the tightness of the confidence ellipsoid; this effect is explicit in the term $R_{4}$. With this in mind, we now compare the confidence ellipsoid built by the projection-based algorithm with the one obtained by naively and repeatedly applying the basic OFU-based algorithm.

Assume that at an arbitrary time $\tau$ the system starts actuating in an arbitrary mode $i$, for which the confidence ellipsoid is given by

\[
\mathcal{C}^{i}_{t}(\delta):=\big\{\Theta^{i}\in\mathbb{R}^{(n+m_{i})\times n}\;|\;\operatorname{Tr}((\Theta^{i}-\hat{\Theta}^{i}_{t})^{\top}V^{i}_{[0,t]}(\Theta^{i}-\hat{\Theta}^{i}_{t}))\leq r^{i}\big\}
\tag{94}
\]

where

\[
r^{i}=\bigg(\sigma_{\omega}\sqrt{2n\log\frac{n\det(V^{i}_{[0,t]})}{\delta\det(V_{[0,\tau]}^{pi})}}+\sigma_{\omega}\sqrt{2n\log\frac{n\det(V_{[0,\tau]})}{\delta\det(\lambda I)}}+\sqrt{\lambda}\epsilon\bigg)^{2},
\tag{95}
\]

and $V^{i}_{[0,t]}$ and $\hat{\Theta}^{i}_{t}$ are given by (40) and (41).

On the other hand, when the projection is not applied, the confidence set is given by

\[
\mathcal{\bar{C}}^{i}_{t}(\delta):=\big\{\Theta^{i}\in\mathbb{R}^{(n+m_{i})\times n}\;|\;\operatorname{Tr}((\Theta^{i}-\hat{\bar{\Theta}}^{i}_{t})^{\top}\bar{V}^{i}_{t}(\Theta^{i}-\hat{\bar{\Theta}}^{i}_{t}))\leq r_{n}^{i}\big\}
\tag{96}
\]

where

\[
\hat{\bar{\Theta}}_{t}={\bar{V}^{i\,-1}_{t}}\big({Z_{t}^{i}}^{\top}X_{t}+\lambda\Theta^{i}_{0}\big),
\tag{97}
\]

with

\[
\bar{V}^{i}_{t}=\lambda I+\sum_{s=1}^{ns^{i}_{t}-1}z^{i}_{s}{z^{i}_{s}}^{\top},\qquad{Z_{t}^{i}}^{\top}X_{t}=\sum_{s=1}^{ns^{i}_{t}}z^{i}_{s}{x_{s}^{i}}^{\top},
\]

where $ns^{i}_{t}$ is the number of time steps during which the system has actuated in mode $i$ up to time $t$, and

\[
r_{n}^{i}=\bigg(\sigma_{\omega}\sqrt{2n\log\frac{n\det(\bar{V}^{i}_{t})}{\delta\det(\lambda I_{(n+m_{i})\times(n+m_{i})})}}+\sqrt{\lambda}\epsilon\bigg)^{2}.
\tag{98}
\]

Note that $\Theta_{*}^{i}$ belongs to both confidence sets $\mathcal{C}(\delta)$ and $\mathcal{\bar{C}}(\delta)$ with the same probability $1-\delta$. Hence, the tighter confidence set gives the lower regret. Since $V^{i}_{[0,t]}\succeq\bar{V}^{i}_{t}$, for large enough $T$ there exists a $t_{0}<T$ such that for $t>t_{0}$

\[
\det\Big(\frac{V^{i}_{[0,t]}}{r^{i}}\Big)>\det\Big(\frac{\bar{V}_{t}^{i}}{r_{n}^{i}}\Big).
\]

This guarantees $\mathcal{C}(\delta)\subset\mathcal{\bar{C}}(\delta)$, meaning that the confidence ellipsoid $\mathcal{C}(\delta)$ is tighter. ∎

Appendix B Supplementary Proofs

Proof of Lemma 2

Proof:

The proof follows from Theorem 3 and Corollary 1 in [30], which give

\[
\left\lVert S^{i}_{t}\right\rVert^{2}_{V_{t}^{-1}}\leq 2\sigma_{\omega}^{2}\log\bigg(\frac{\det(V_{t})^{1/2}\det(\lambda I)^{-1/2}}{\delta}\bigg).
\tag{99}
\]

Since $V_{t}\succeq\lambda I$ implies $\det(V_{t})/\det(\lambda I)\geq 1$, we have

\[
\log\bigg(\frac{\det(V_{t})}{\det(\lambda I)}\bigg)\geq\log\bigg(\frac{\det(V_{t})^{1/2}}{\det(\lambda I)^{1/2}}\bigg),
\]

which completes the proof. ∎

Proof of Corollary 1

Proof:

Applying Lemma 2 for $i=1,\ldots,n$ yields

\[
\left\lVert S^{i}_{t}\right\rVert^{2}_{V_{t}^{-1}}\leq 2\sigma_{\omega}^{2}\log\bigg(\frac{n\det(V_{t})}{\delta\det(\lambda I)}\bigg)
\tag{100}
\]

with probability at least $1-\delta/n$ for each $i$. Furthermore, we have

\[
\operatorname{Tr}(S^{\top}_{t}V^{-1}_{t}S_{t})=\sum_{i=1}^{n}{S^{i}_{t}}^{\top}V^{-1}_{t}S^{i}_{t},
\]

so applying the union bound shows that (62) holds with probability at least $1-\delta$. ∎

Proof of Lemma 5

Proof:

Given the definition of $r^{i}$ in (38), one can write

\[
\begin{aligned}
r^{i}&=\bigg(\sigma_{\omega}\Big[\sqrt{2n\log\frac{n\det(V^{i}_{[0,t]})}{\delta\det(V^{pi}_{[0,\tau_{q}]})}}+\sqrt{2n\log\frac{n\det(V_{[0,\tau_{q}]})}{\delta\det(\lambda I_{(n+m)})}}\Big]+\sqrt{\lambda}\vartheta\bigg)^{2}\\
&\leq 8\sigma_{\omega}^{2}n\bigg(\log\frac{n\det(V^{i}_{[0,t]})}{\delta\det(V^{pi}_{[0,\tau_{q}]})}+\log\frac{n\det(V_{[0,\tau_{q}]})}{\delta\det(\lambda I_{(n+m)})}\bigg)+2\lambda\vartheta^{2}\\
&\leq 8\sigma_{\omega}^{2}n\bigg(\log\frac{n\det(V^{i}_{[0,t]})}{\delta\det(\lambda I_{(n+m_{i})})}+\log\frac{n\det(V_{[0,\tau_{q}]})}{\delta\det(V^{pi}_{[0,\tau_{q}]})}\bigg)+2\lambda\vartheta^{2}\\
&\leq 8\sigma_{\omega}^{2}n\bigg(2\log\frac{n}{\delta}+\log\frac{\det(V^{i}_{[0,t]})}{\det(\lambda I_{(n+m_{i})})}+\log\frac{\det(V_{[0,\tau_{q}]})}{\det(V^{pi}_{[0,\tau_{q}]})}\bigg)+2\lambda\vartheta^{2}
\end{aligned}
\]

where in the second inequality we used $\det(\lambda I_{(n+m_{i})})<\det(\lambda I_{(n+m)})$. The proof is completed by applying the following properties:

\[
\log\frac{\det(V^{i}_{[0,t]})}{\det(\lambda I_{(n+m_{i})})}\leq(1+2\bar{\Upsilon})\log t,
\tag{101}
\]
\[
\log\frac{\det(V_{[0,\tau_{q}]})}{\det(V^{pi}_{[0,\tau_{q}]})}\leq(m-1)\log((1+2\bar{\Upsilon})\tau_{q}),
\tag{102}
\]

where (101) can be shown by following the same steps as Lemma 28 of [5] and applying $\sum_{s=1}^{t}\|z(s)\|^{2}\leq 2\bar{\Upsilon}T$, and (102) follows from (88). ∎

Proof of Lemma 10

Proof:

From Lemma 19 of [5] we have

\[
\begin{aligned}
&P(\Theta^{i}_{*},Q,R^{i})-P(\hat{\Theta}^{i}_{t},Q,R^{i})\\
&\quad\preceq\frac{2\kappa_{i}^{2}\mu_{i}}{\gamma_{i}}\|P(\Theta^{i}_{t},Q,R^{i})\|_{*}\Big\lVert\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{t},Q,R^{i})\end{pmatrix}^{\top}{V^{i}}^{-1}_{[0,t]}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{t},Q,R^{i})\end{pmatrix}\Big\rVert I\\
&\quad\preceq\frac{2\kappa_{i}^{2}\mu_{i}}{\gamma_{i}}\|P(\Theta^{i}_{t},Q,R^{i})\|_{*}\big(2\kappa_{i}^{2}\|{V^{i}}^{-1}_{[0,t]}\|\big)I\\
&\quad\preceq\frac{2\kappa_{i}^{2}\mu_{i}}{\gamma_{i}}\|P(\Theta^{i}_{t},Q,R^{i})\|_{*}\big(2\kappa_{i}^{2}\sqrt{n+m}\big)\frac{1}{\lambda}I\preceq\frac{\alpha_{0}^{i}\gamma_{i}}{2}I,
\end{aligned}
\]

where the last inequality holds by $\lambda\geq 8\kappa_{*}^{8}\sqrt{n+m}\frac{4\bar{\nu}\bar{\mu}}{\alpha_{0}^{i}\sigma_{\omega}^{2}}$. ∎

Proof of Lemma 11

Proof:

Using (87) and the facts that $z_{t}=\begin{pmatrix}I\\ K\end{pmatrix}x_{t}$ and $\big\|\begin{pmatrix}I\\ K\end{pmatrix}\big\|^{2}\leq 2{\kappa^{*}}^{2}$, we have

\[
\|z(t)\|^{2}\leq 4{\kappa^{*}}^{2}(1-\chi)^{2t}\|x_{0}\|^{2}+\frac{1600\sigma_{\omega}^{2}{\kappa^{*}}^{4}n}{\chi^{2}{\gamma^{*}}^{2}}\log\frac{T}{\delta},
\tag{103}
\]

which yields

\[
\sum_{s=1}^{t}\|z(s)\|^{2}\leq\frac{4{\kappa^{*}}^{2}}{2\chi-\chi^{2}}\|x_{0}\|^{2}+\frac{3200{\kappa^{*}}^{6}\sigma_{\omega}^{2}n}{\chi^{2}}\log\frac{T}{\delta}\,t\leq\frac{4{\kappa^{*}}^{2}}{\chi}t+\bar{\Upsilon}t\leq 2\bar{\Upsilon}t,
\]

where

\[
\bar{\Upsilon}=\frac{3200{\kappa^{*}}^{6}\sigma_{\omega}^{2}n}{\chi^{2}}\log\frac{T}{\delta}
\]

and in the second inequality we applied $t\geq\|x_{0}\|^{2}$.

Having $\sum_{s=1}^{T}\|z_{s}\|^{2}\leq 2\bar{\Upsilon}T$ along with $T\geq\lambda$, which always holds by the definition of $\lambda$, it turns out that, regardless of the sequence of actuating modes,

\[
\|V_{t}\|\leq\lambda+\beta^{-1}\sum_{s=1}^{t}\|z_{s}\|^{2}\leq(1+2\bar{\Upsilon})T
\]

by picking $\beta=1$. ∎

Proof of Lemma 12

Proof:

To upper-bound the ratio in (88), where the numerator is the determinant of the covariance of the central ellipsoid and the denominator is that of the ellipsoid constructed by projecting the central ellipsoid onto the space of mode $i$, one can write

\[
\frac{\det(V_{[0,t]})}{\det(V^{pi}_{[0,t]})}=\frac{\prod_{l=1}^{m_{i}}\frac{1}{\lambda^{i}_{l}}}{\prod_{k=1}^{m}\frac{1}{\lambda_{k}}}
\tag{104}
\]

where $\lambda_{k}$ for $k\in\{1,\ldots,n+m\}$ and $\lambda^{i}_{l}$ for $l\in\{1,\ldots,n+m_{i}\}$ are the eigenvalues of $V_{[0,t]}$ and $V^{pi}_{[0,t]}$ respectively. For any ellipsoid constructed from the covariance matrix $V_{[0,t]}$, the $1/\lambda_{k}$'s are the diameters of the ellipsoid, where the $\lambda_{k}$'s are the corresponding eigenvalues of $V_{[0,t]}$. Since $V^{pi}_{[0,t]}$ is the covariance matrix of an ellipsoid that is the projection of the central ellipsoid with covariance matrix $V_{[0,t]}$, there always exists an $s\in\{1,\ldots,n+m\}$ such that $1/\lambda_{s}\geq 1/\lambda^{i}_{l}$ for all $l\in\{1,\ldots,n+m_{i}\}$. In other words, there is always a diameter of the central ellipsoid associated with $V_{[0,t]}$ that is greater than or equal to any diameter of its projection defined by $V^{pi}_{[0,t]}$. Using this fact, we can upper-bound (104) as

$$\frac{\det(V_{[0,\;t]})}{\det(V^{pi}_{[0,\;t]})}\leq\lambda_{s}^{m-m_{i}}\leq\lambda_{max}(V)^{m-m_{i}}\leq\lambda_{max}(V)^{m-1}\leq(1+2\bar{\Upsilon})^{m-1}t^{m-1}$$

The last inequality directly follows from the fact that $(m-m_{i})$ takes its maximum value of $m-1$ when the projection is performed onto a space of dimension $n+1$ (i.e., projection from the central ellipsoid to a mode with only one actuator). Also, we have $\lambda_{max}(V_{[0,\;t]})\leq\|V_{[0,\;t]}\|\leq(1+2\bar{\Upsilon})t$, which completes the proof. ∎
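The geometric fact driving this proof, namely that no semi-axis of a projected ellipsoid can exceed the longest semi-axis of the central ellipsoid, can be checked numerically. The sketch below is illustrative only: dimensions are arbitrary, and it assumes the standard identity that the projection of $\{x:x^{\top}Vx\leq 1\}$ onto a coordinate subset $S$ has shape matrix $((V^{-1})_{SS})^{-1}$.

```python
# Illustrative check: projecting the ellipsoid {x : x^T V x <= 1} onto a
# subset S of coordinates yields shape matrix ((V^{-1})_SS)^{-1}, and by
# Cauchy interlacing its longest semi-axis is dominated by that of the
# central ellipsoid -- the fact behind the det-ratio bound (104).
import numpy as np

rng = np.random.default_rng(1)
dim, S = 6, np.arange(4)                  # "central" space -> 4-dim mode space

A = rng.normal(size=(dim, dim))
V = A @ A.T + np.eye(dim)                 # random positive definite covariance

axes_central = 1.0 / np.sqrt(np.linalg.eigvalsh(V))   # semi-axes of ellipsoid
P = np.linalg.inv(np.linalg.inv(V)[np.ix_(S, S)])     # shape of the projection
axes_proj = 1.0 / np.sqrt(np.linalg.eigvalsh(P))

assert axes_proj.max() <= axes_central.max() + 1e-10
print("longest central axis:", axes_central.max(),
      "longest projected axis:", axes_proj.max())
```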

Proof of Lemma 13

Proof:

Recalling the switch time sequence $\{\tau_{1},...,\tau_{ns}\}$ with $\tau_{1}=0$ and $T>\tau_{ns}$, we have $ns$ epochs.

Considering (57), one can write

$$\begin{aligned}
R_{1}&=\sum_{t=0}^{T}\big(x^{\top}_{t}P(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})x_{t}-x^{\top}_{t+1}P(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})x_{t+1}\big)1_{\mathcal{E}_{t}}\\
&=\sum_{t=1}^{T}x^{\top}_{t}\big(P(\hat{\Theta}^{\sigma(t)},Q,R^{\sigma(t)})-P(\hat{\Theta}^{\sigma(t-1)},Q,R^{\sigma(t-1)})\big)x_{t}1_{\mathcal{E}_{t}}\\
&\quad+x_{0}^{\top}P(\hat{\Theta}^{\sigma(0)},Q,R^{\sigma(0)})x_{0}-x_{T+1}^{\top}P(\hat{\Theta}^{\sigma(T)},Q,R^{\sigma(T)})x_{T+1}\qquad(105)
\end{aligned}$$

where $\sigma(t)\in\mathcal{I}$ denotes the actuating mode at time $t$. Note that the first term on the right-hand side is zero except at times when there is a switch in either the policy or the actuation mode. Therefore, we first need to find an upper bound on the total number of switches.

We know that for the system actuating in any arbitrary mode $i$ during an arbitrary epoch $k$ over the time interval $[\tau_{k}\;\;\tau_{k+1}]$, we have

$$\det(V_{[0\;\tau_{k}^{f}]}^{i})\geq 2^{d_{k}}\det(V_{[0\;\tau_{k}^{0}]}^{pi})\qquad(106)$$

where $d_{k}$ is the total number of switches in the epoch. Suppose that in the last epoch the system actuates in mode $i$. Furthermore, without loss of generality, let the system actuate in mode $j$ within epoch $ns-1$. Then, applying (106), one can write

$$\det(V_{[0,\;T]}^{i})\frac{\det(V_{[0\;\tau_{ns}]})}{\det(V_{[0,\;\tau_{ns}]}^{pi})}\geq 2^{d_{ns}}\det(V_{[0\;\tau_{ns}]})\geq 2^{d_{ns}}\det(V^{j}_{[0,\;\tau_{ns}]})\qquad(107)$$

where the last inequality is due to the facts that $\lambda>1$ and that $V^{i}$ is strictly positive definite for all $i$, which guarantees $\det(V_{[0,\;\tau_{ns}]})\geq\det(V^{j}_{[0,\;\tau_{ns}]})$. Following the same argument backward through the switch times yields

$$\det(V^{i}_{[0,\;T]})\frac{\det(V_{[0,\;\tau_{ns}]})}{\det(V_{[0,\;\tau_{ns}]}^{pi})}\cdots\frac{\det(V_{[0,\;\tau_{2}]})}{\det(V_{[0,\;\tau_{2}]}^{pl})}\geq 2^{d_{ns}+d_{ns-1}+...+d_{2}}\det(V_{[0,\;\tau_{2}]})\qquad(108)$$

Applying $\det(V_{[0,\;\tau_{2}]})\geq 2^{d_{1}}\det(\lambda I)$ results in

$$\begin{aligned}
\det(V^{i}_{[0,\;T]})\frac{\det(V_{[0,\;\tau_{ns}]})}{\det(V_{[0,\;\tau_{ns}]}^{pi})}\cdots\frac{\det(V_{[0,\;\tau_{2}]})}{\det(V_{[0,\;\tau_{2}]}^{pl})}&\geq 2^{d_{ns}+d_{ns-1}+...+d_{1}}\det(\lambda I_{(n+m)})\\
&\geq 2^{d_{ns}+d_{ns-1}+...+d_{1}}\det(\lambda I_{(n+m_{i})})\qquad(109)
\end{aligned}$$

By taking $\log_{2}$ on both sides, one can write

$$d_{ns}+d_{ns-1}+...+d_{1}\leq\log\frac{\det(V^{i}_{[0\;T]})}{\det(\lambda I_{(n+m_{i})})}+(ns-1)\mathcal{F}\qquad(110)$$

where

$$\mathcal{F}=\max_{i,j}\mathcal{F}_{i,j}\qquad(111)$$

where $i$ denotes the actuating mode, $j$ denotes the epoch, and $\mathcal{F}_{i,j}$ is defined as follows:

$$\mathcal{F}_{i,j}=\log\frac{\det(V_{[0\;\tau_{j}]})}{\det(V_{[0\;\tau_{j}]}^{pi})}\leq(m-1)\log\big((1+2\Upsilon_{i})\tau_{j}\big)\qquad(112)$$

where the upper bound is a result of Lemma 12. It then yields

$$\mathcal{F}\leq(m-1)\log\big((1+2\bar{\Upsilon})T\big)\qquad(113)$$

Furthermore, we have

$$\log\bigg(\frac{\det(V^{i}_{[0\;T]})}{\det(\lambda I_{(n+m_{i})})}\bigg)\leq(1+2\bar{\Upsilon})\log T\qquad(114)$$

in which $\bar{\Upsilon}=\max_{i}\Upsilon_{i}$. As a result, the maximum number of switches is upper-bounded by

$$N_{ts}:=(1+2\bar{\Upsilon})\log T+(ns-1)(m-1)\log\big((1+2\bar{\Upsilon})T\big),$$

where $N_{ts}$ is the maximum number of total switches, including the switches between actuating modes. The upper bound for $R_{1}$ can then be written as follows:

$$|R_{1}|\leq\frac{\bar{\nu}(1+N_{ts})}{\sigma_{\omega}^{2}}X_{[0\;\;T]}\qquad(115)$$

which is obtained by noting $\|P^{i}_{t}\|\leq\bar{\nu}/\sigma_{\omega}^{2}$, where $\bar{\nu}=\max_{i}\nu_{i}$, together with the upper bound (90) on the state. ∎
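For a rough sense of scale, the following sketch (with entirely made-up values for $\bar{\Upsilon}$, $ns$, and $m$; none come from the paper) evaluates the switch budget $N_{ts}$ over a few horizons, showing growth that is logarithmic in $T$ and linear in $ns$ and $m$.

```python
# Rough numerical illustration (all constants made up) of how the switch
# budget N_ts from the proof of Lemma 13 scales with the horizon T.
import numpy as np

def n_ts(T, ns, m, upsilon_bar):
    # N_ts = (1+2*Ups)*log(T) + (ns-1)*(m-1)*log((1+2*Ups)*T), cf. the display above
    return (1 + 2 * upsilon_bar) * np.log(T) \
        + (ns - 1) * (m - 1) * np.log((1 + 2 * upsilon_bar) * T)

for T in [1e3, 1e4, 1e5]:
    print(f"T={T:>8.0f}  N_ts={n_ts(T, ns=5, m=3, upsilon_bar=2.0):8.1f}")
```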

Proof of Lemma 16

Proof:

Suppose the system evolves through a sequence of $ns$ epochs starting with actuating mode 1, e.g., $1,j,...,i$. Then, one can write

$$\begin{aligned}
&\sum_{t=0}^{T}\big({z^{\sigma(t)}_{t}}^{\top}{V^{\sigma(t)}_{t}}^{-1}z^{\sigma(t)}_{t}\big)1_{\mathcal{E}_{t}}
=\sum_{t=0}^{\tau_{2}}\big(z^{\top}_{t}V^{-1}_{[0\;t]}z_{t}\big)1_{\mathcal{E}_{t}}
+\sum_{t=\tau_{2}}^{\tau_{3}}\big({z^{j}_{t}}^{\top}{V^{j}}^{-1}_{[0\;t]}z^{j}_{t}\big)1_{\mathcal{E}_{t}}+...\\
&\quad+\sum_{t=\tau_{ns-1}}^{\tau_{ns}}\big({z^{l}_{t}}^{\top}{V^{l}}^{-1}_{[0\;t]}z^{l}_{t}\big)1_{\mathcal{E}_{t}}
+\sum_{t=\tau_{ns}}^{T}\big({z^{i}_{t}}^{\top}{V^{i}}^{-1}_{[0\;t]}z^{i}_{t}\big)1_{\mathcal{E}_{t}}\\
&\leq 2\log\frac{\det(V_{[0,\;\tau_{2}]})}{\det(\lambda I_{(n+m)})}
+2\log\frac{\det(V^{j}_{[0,\;\tau_{3}]})}{\det(V^{pj}_{[0,\;\tau_{2}]})}+...
+2\log\frac{\det(V^{l}_{[0\;\tau_{ns}]})}{\det(V^{pl}_{[0\;\tau_{ns-1}]})}
+2\log\frac{\det({V^{i}}_{[0,\;T]})}{\det(V^{pi}_{[0,\;\tau_{ns}]})}\qquad(116)\\
&=2\log\frac{1}{\det(\lambda I_{(n+m)})}
+2\log\frac{\det(V_{[0,\;\tau_{2}]})}{\det(V^{pj}_{[0,\;\tau_{2}]})}+...
+2\log\frac{\det(V^{l}_{[0\;\tau_{ns}]})}{\det(V^{pi}_{[0,\;\tau_{ns}]})}
+2\log\det({V^{i}}_{[0,\;T]})\qquad(117)\\
&\leq 2\log\frac{\det({V^{i}}_{[0,\;T]})}{\det(\lambda I_{(n+m_{i})})}
+2\log\frac{\det(V_{[0,\;\tau_{2}]})}{\det(V^{pj}_{[0,\;\tau_{2}]})}+...
+2\log\frac{\det(V_{[0,\;\tau_{ns}]})}{\det(V^{pi}_{[0,\;\tau_{ns}]})}\qquad(118)\\
&\leq 2(1+2\bar{\Upsilon})\log T+2(ns-1)(m-1)\log\big((1+2\bar{\Upsilon})T\big)\qquad(119)
\end{aligned}$$

where in inequality (116) we applied the result of Lemma 26 in [5], and in (117) we telescoped the summation of logarithmic terms. Inequality (118) is due to the facts that $\det(\lambda I_{(n+m)})\geq\det(\lambda I_{(n+m_{i})})$ for $\lambda\geq 1$ and that $V^{i}$ is strictly positive definite for all $i\in\mathcal{I}$, which guarantees $\det(V_{[0,\;\tau_{k}]})\geq\det(V^{pj}_{[0,\;\tau_{k}]})$. Finally, the last inequality is a direct result of applying (111)-(114) and (101).
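For intuition, the elliptical-potential inequality invoked at (116) can be verified empirically in the single-mode case. The sketch below is illustrative only (random data, arbitrary dimensions): it forms $V_{t}=\lambda I+\sum_{s\leq t}z_{s}z_{s}^{\top}$ and checks $\sum_{t}z_{t}^{\top}V_{t}^{-1}z_{t}\leq 2\log\big(\det(V_{T})/\det(\lambda I)\big)$.

```python
# Sanity check (illustrative, not from the paper) of the elliptical-potential
# bound used at (116): with V_t = lam*I + sum_{s<=t} z_s z_s^T, one has
# sum_t z_t^T V_t^{-1} z_t <= 2*log(det(V_T)/det(lam*I)).
import numpy as np

rng = np.random.default_rng(2)
d, T, lam = 4, 500, 1.0

V = lam * np.eye(d)
potential = 0.0
for _ in range(T):
    z = rng.normal(size=d)
    V += np.outer(z, z)                      # update Gram matrix first
    potential += z @ np.linalg.solve(V, z)   # z_t^T V_t^{-1} z_t <= 1 always

bound = 2 * (np.linalg.slogdet(V)[1] - d * np.log(lam))
assert potential <= bound
print("potential sum:", potential, "<= 2*log-det bound:", bound)
```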

Recalling the definition $\mu_{i}:=r^{i}+(1+r^{i})\vartheta\sqrt{(1+2\Upsilon^{i})T}$ and $\bar{\mu}=\max_{i\in\mathcal{B}^{*}}\mu_{i}$, we have the following upper bound for the terms $\frac{4\nu_{i}\mu_{i}}{\sigma_{\omega}^{2}}$:

$$\frac{4\nu_{i}\mu_{i}}{\sigma_{\omega}^{2}}\leq\frac{4\bar{\nu}\bar{\mu}}{\sigma_{\omega}^{2}}\leq\frac{4\bar{\nu}}{\sigma_{\omega}^{2}}\big(r^{i}+(1+r^{i})\vartheta\sqrt{(1+2\bar{\Upsilon})T}\big)$$

where the $r^{i}$'s are, for all $i$, upper-bounded by (74):

$$r^{i}\leq\bar{r}:=8\sigma_{\omega}^{2}n\bigg(2\log\frac{n}{\delta}+(1+2\bar{\Upsilon})\log T+(m-1)\log\big((1+2\bar{\Upsilon})T\big)\bigg)+2\epsilon^{2}\lambda,$$

which completes the proof. ∎