
Learning in Networked Control Systems

Rahul Singh and P. R. Kumar

Rahul Singh is with the Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210, USA ([email protected]). P. R. Kumar is with the Department of Electrical & Computer Engineering, Texas A&M University, College Station, TX 77843, USA ([email protected]).
Abstract

We design an adaptive controller (learning rule) for a networked control system (NCS) in which data packets containing control information are transmitted across a lossy wireless channel. We propose Upper Confidence Bounds for Networked Control Systems (UCB-NCS), a learning rule that maintains confidence intervals for the estimates of the plant parameters $(A_{(\star)},B_{(\star)})$ and the channel reliability $p_{(\star)}$, and utilizes the principle of optimism in the face of uncertainty while making control decisions.

We provide non-asymptotic performance guarantees for UCB-NCS by analyzing its “regret”, i.e., its performance gap from the scenario in which $(A_{(\star)},B_{(\star)},p_{(\star)})$ are known to the controller. We show that with a high probability the regret can be upper-bounded as $\tilde{O}(C\sqrt{T})$ (here $\tilde{O}$ hides logarithmic factors), where $T$ is the operating time horizon of the system, and $C$ is a problem-dependent constant.

I Introduction

Though adaptive control [1] of unknown Linear Quadratic Gaussian (LQG) systems [2] is by now a well-studied topic [3, 4, 5, 6], existing algorithms cannot be utilized for controlling an unknown NCS in which both plant and network parameters are unknown. In departure from the traditional adaptive controllers for LQG systems, an algorithm now also needs to continually estimate the unknown network behaviour besides simultaneously learning and controlling the plant in an online manner. An important concern is that in general it is not optimal to design and operate the network estimator independently of the process controller. Thus, the optimal controls $u(t)$ should utilize the information gained about network quality in addition to the information gained about plant parameters. Similarly, decisions made by the network scheduler should also “aid” the controller in “learning” the unknown plant parameters.

This work addresses the problem of adaptive control of a simple NCS in which data packets from the controller to the plant are communicated over an unreliable channel. We model the plant as an LQG system. We propose a learning rule that maintains estimates and confidence sets for both a) the (unknown) plant parameters $(A_{(\star)},B_{(\star)})$, and b) the (unknown) channel reliability $p_{(\star)}$. Controls are then generated using the principle of optimism in the face of uncertainty [7], and depend upon both a) and b). We call our algorithm Upper Confidence Bounds for Networked Control Systems (UCB-NCS).

We show that UCB-NCS yields the same asymptotic performance as the optimal controller that has knowledge of the system and network parameters. We also quantify its finite-time performance by providing upper bounds on its “regret” [8]. The regret scales as $\tilde{O}(C\sqrt{T})$, where $T$ is the operating time horizon and $C$ is a problem-dependent constant. It also depends on the channel reliability through a certain quantity, which we call the “margin of stability” $\eta$ (see (14)). A larger value of $\eta$ means that the learning algorithm has a lower regret.

UCB-NCS has many appealing properties. For instance, the network estimator needs to communicate the value of its optimistic estimate of network reliability to the controller only occasionally; the controller then uses it to generate controls.

II System Model

We assume that the system of interest is linear, and evolves as follows

$$x(t+1)=\begin{cases}A_{(\star)}x(t)+B_{(\star)}u(t)+w(t) & \text{if }\ell(t)=1,\\ A_{(\star)}x(t)+w(t) & \text{if }\ell(t)=0,\end{cases} \qquad (1)$$

where $A_{(\star)}\in\mathbb{R}^{n\times n}$, $B_{(\star)}\in\mathbb{R}^{n\times m}$ are the system matrices, $\ell(t)\in\{0,1\}$ is the instantaneous state of the wireless channel, and $x(t)\in\mathbb{R}^{n}$, $u(t)\in\mathbb{R}^{m}$ are the system state and the control input at time $t$, respectively. $\{\ell(t)\}_{t=1}^{T}$ are i.i.d. Bernoulli with mean value $p_{(\star)}$. $\{w(t)\}_{t=1}^{T}$ is the process noise, and is assumed to be i.i.d. with

$$\mathbb{E}\left(w(t)w^{T}(t)\right)=\sigma^{2}_{w},\quad\forall t\in[1,T].$$

The objective is to minimize the operating cost

$$\mathbb{E}\left[\sum_{t=1}^{T-1}\left(x^{T}(t)Qx(t)+u^{T}(t)Ru(t)\right)+x^{T}(T)Qx(T)\right]. \qquad (2)$$

We let $\theta_{(\star)}:=\left(A_{(\star)},B_{(\star)},p_{(\star)}\right)$ denote the system parameters. $\theta_{(\star)}$ is not known to the controller. Throughout the analysis we assume that the system is scalar, i.e., $m=n=1$.
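To make the model concrete, the scalar dynamics (1) and the cost (2) can be simulated in a few lines. The sketch below is illustrative only: the parameter values, the noise level, and the fixed feedback gain are hypothetical choices, not values from the paper.

```python
import random

def simulate_ncs(a_star, b_star, p_star, controller, T, sigma_w=1.0, seed=0):
    """Simulate the scalar NCS (1) and return the trajectory and the cost (2) with Q=R=1."""
    rng = random.Random(seed)
    x, cost, xs = 0.0, 0.0, []
    for t in range(T):
        u = controller(x)
        cost += x * x + u * u                     # stage cost x^T Q x + u^T R u, Q = R = 1
        ell = 1 if rng.random() < p_star else 0   # Bernoulli(p_star) channel state
        w = rng.gauss(0.0, sigma_w)
        # dynamics (1): the control enters only when the packet is delivered (ell = 1)
        x = a_star * x + (b_star * u if ell else 0.0) + w
        xs.append(x)
    cost += x * x                                 # terminal cost x^T Q x
    return xs, cost

# Hypothetical parameters, with a certainty-equivalent gain K = -a_star / b_star.
xs, cost = simulate_ncs(0.9, 1.0, 0.8, controller=lambda x: -0.9 * x, T=200)
```

With $p_{(\star)}$ close to 1 the closed-loop mode dominates and the state stays bounded; lowering $p_{(\star)}$ leaves the open-loop mode active more often, which is exactly the effect the margin of stability captures later.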

III Preliminaries on Jump Markov Linear Systems

Note that (1) is a Jump Markov Linear System (JMLS), and if the system parameter $\theta_{(\star)}$ is known, the optimal controls can be obtained by using Dynamic Programming [9].

There are matrices $\left\{K_{\theta_{(\star)}}(\ell)\right\}_{\ell\in\{0,1\}}$ such that the optimal control at time $t$ is given by $K_{\theta_{(\star)}}(\ell(t))x(t)$. We let $\left\{K_{\theta}(\ell)\right\}_{\ell\in\{0,1\}}$ denote the optimal matrices when the system parameter equals $\theta$.

We let $V_{\theta}(x,\ell)$ denote the “cost-to-go” when the system state equals $x$, the channel state is $\ell$, and the system dynamics are described by $\theta$. In fact, for each channel state the value function is quadratic in the state, and we let $\{P_{\theta}(\ell)\}_{\ell\in\{0,1\}}$ denote the corresponding matrices. We also let $J_{\theta}$ denote the optimal operating cost.

Notation: For a random variable (r.v.) $X$, let $X_{\mathcal{F}}$ denote its projection onto the space of $\mathcal{F}$-measurable functions, i.e., its conditional expectation w.r.t. the sigma-algebra $\mathcal{F}$. For $x,y\in\mathbb{Z}$ (the set of integers), we let $[x,y]:=\{x,x+1,\ldots,y\}$. For a set of r.v.s $\mathcal{X}$, we let $\sigma(\mathcal{X})$ denote the smallest sigma-algebra with respect to which each r.v. in $\mathcal{X}$ is measurable. For functions $f(x),g(x)$, we say $f(x)=O(g(x))$ if $\limsup_{x\to\infty}|f(x)/g(x)|<\infty$. For a set $\mathcal{X}$, we let $\mathcal{X}^{c}$ denote its complement.

IV Upper Confidence Bounds for NCS (UCB-NCS)

Let $\mathcal{F}_{t}:=\sigma\left(\left\{(x(s),u(s))\right\}_{s=1}^{t-1}\cup\{x(t)\}\right)$. A learning policy, or adaptive controller, is a collection of maps $\left\{\mathcal{F}_{t}\mapsto u(t)\right\}_{t=1}^{T}$. Let $\hat{\theta}(t):=\left(\hat{A}(t),\hat{B}(t),\hat{p}(t)\right)$ denote the estimates of $\theta_{(\star)}=(A_{(\star)},B_{(\star)},p_{(\star)})$ at time $t$, defined as follows. Let $z(s):=x(s+1)$, and $\lambda>0$.

$$\hat{p}(t)=\frac{1}{t}\sum_{s=1}^{t}\ell(s),$$
$$\hat{A}(t)\in\arg\min_{A}\ \frac{1}{2}\left[\lambda A^{2}+\sum_{s=1}^{t-1}\left(z(s)-Ax(s)\right)^{2}(1-\ell(s))\right],$$
$$\hat{B}(t)\in\arg\min_{B}\ \frac{1}{2}\left[\lambda B^{2}+\sum_{s=1}^{t-1}\left(z(s)-\hat{A}(t)x(s)-Bu(s)\right)^{2}\ell(s)\right]. \qquad (3)$$

Define

$$V_{1}(t):=\lambda+\sum_{s=1}^{t-1}x^{2}(s)(1-\ell(s)),\qquad V_{2}(t):=\lambda+\sum_{s=1}^{t-1}u^{2}(s)\ell(s),$$
$$\gamma_{i}(\delta,t):=\sqrt{\log\left(\lambda V_{i}(t)/\delta\right)},\quad i=1,2. \qquad (4)$$

Let $\mathcal{C}(t)=\left(\mathcal{C}_{1}(t),\mathcal{C}_{2}(t),\mathcal{C}_{3}(t)\right)$ be the confidence intervals associated with the estimates $\left(\hat{A}(t),\hat{B}(t),\hat{p}(t)\right)$ at time $t$, defined as follows:

$$\mathcal{C}_{1}(t):=\left\{A:|A-\hat{A}(t)|\leq\beta_{1}(t)\right\}, \qquad (5)$$
$$\mathcal{C}_{2}(t):=\left\{B:|B-\hat{B}(t)|\leq\beta_{2}(t)\right\},$$
$$\mathcal{C}_{3}(t):=\left\{p:|p-\hat{p}(t)|\leq\beta_{3}(t)\right\}, \qquad (6)$$

where

$$\beta_{1}(t):=\frac{\gamma_{1}(\delta,t)+\lambda^{1/2}}{\sqrt{V_{1}(t)}},\qquad \beta_{3}(t):=\sqrt{\frac{\log(1/\delta)}{t}},$$
$$\beta_{2}(t):=\frac{\gamma_{2}(\delta,t)+\lambda^{1/2}}{\sqrt{V_{2}(t)}}+K_{\max}\,\frac{\gamma_{1}(\delta,t)+\lambda^{1/2}}{\sqrt{V_{1}(t)}}.$$

The learning rule decomposes the cumulative time into episodes and implements a single stationary controller within each episode, which chooses $u(t)$ as a function of $x(t)$. Let $\tau_{k}$ denote the starting time of the $k$-th episode. The controller implemented within episode $k$ is obtained at time $\tau_{k}$ by solving the following optimization problem,

$$\min_{\theta\in\mathcal{C}(\tau_{k})\cap\Theta}J_{\theta}, \qquad (7)$$

where $\Theta$ is the set of “allowable” parameters. Let $\theta(\tau_{k})$ denote a solution to the above problem. The learning rule implements the optimal controller corresponding to the case when the true system parameter equals $\theta(\tau_{k})$, i.e., $u(t)=K_{\theta(\tau_{k})}(\ell(t))x(t)$ for $t\in\left[\tau_{k},\tau_{k+1}-1\right]$.

A new episode begins when either $V_{1}(t)$ or $V_{2}(t)$ doubles, or when the time spent in the current episode equals the length of the previous episode. The learning rule also ensures that each episode lasts at least $L$ time slots, i.e., $\tau_{k+1}-\tau_{k}\geq L$. We set

$$\theta(t):=\theta(\tau_{k}),\quad\forall t\in\left[\tau_{k},\tau_{k+1}-1\right],$$

i.e., $\theta(t)$ is the current value of the UCB estimate of $\theta_{(\star)}$. UCB-NCS is summarized in Algorithm 1.

Algorithm 1 UCB-NCS
0:  Require: $T$, $\lambda>0$, $\delta>0$, $L\in\mathbb{N}$, $\alpha>2$. Initialize $V^{1,\star}=V^{2,\star}=\lambda$, $\hat{A}(1)=0.5$, $\hat{B}(1)=0.5$, $\hat{p}(1)=0.5$, $\tau=1$, $V_{1}(1)=V_{2}(1)=\lambda$.
1:  for $t=1,2,\ldots$ do
2:     if ($V_{1}(t)\geq 2V^{1,\star}$ or $V_{2}(t)\geq 2V^{2,\star}$ or $t\geq 2\tau$) and $t-\tau\geq L$ then
3:        Calculate $\hat{\theta}(t)$ as in (3), and $\theta(t)$ by solving (7). Update $V^{1,\star}=V_{1}(t)$, $V^{2,\star}=V_{2}(t)$, $\tau=t$.
4:     else
5:        $\hat{\theta}(t)=\hat{\theta}(t-1)$
6:     end if
7:     Apply the control $u(t)=K_{\theta(t)}(\ell(t))x(t)$ based on the current UCB estimate $\theta(t)$, the system state $x(t)$, and the channel state $\ell(t)$.
8:     Update $V_{1}(t+1)=V_{1}(t)+x^{2}(t)(1-\ell(t))$, $V_{2}(t+1)=V_{2}(t)+u^{2}(t)\ell(t)$.
9:  end for
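The episode-switching rule of Algorithm 1 can be isolated as a small predicate: a new episode is due when $V_{1}$, $V_{2}$, or the elapsed time has doubled, but only after the current episode has lasted at least $L$ slots. A minimal sketch (the function name is ours):

```python
def new_episode_due(t, tau, V1, V2, V1_star, V2_star, L):
    """Return True iff Algorithm 1 recomputes theta at time t.

    tau is the start of the current episode; V1_star, V2_star are the values of
    V1, V2 recorded when the current episode began.
    """
    doubled = V1 >= 2 * V1_star or V2 >= 2 * V2_star or t >= 2 * tau
    return doubled and (t - tau) >= L
```

The minimum-length condition $t-\tau\geq L$ is what makes the deviation bound of Section VI applicable within every episode.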

V Large Deviation Bounds on Estimation Errors

We now analyze the estimation errors $e_{1}(t):=\hat{A}(t)-A_{(\star)}$ and $e_{2}(t):=\hat{B}(t)-B_{(\star)}$.

Lemma 1

Define

$$\mathcal{E}:=\left\{\omega:\theta_{(\star)}=\left(A_{(\star)},B_{(\star)},p_{(\star)}\right)\in\mathcal{C}(t),\ \forall t\in[1,T]\right\}.$$

We then have that

$$\mathbb{P}\left(\mathcal{E}^{c}\right)\leq 3\delta.$$
Proof:

It can be shown that

$$e_{1}(t)=-\frac{\lambda A_{(\star)}}{V_{1}(t)}+\frac{1}{V_{1}(t)}\sum_{s=1}^{t-1}w(s)x(s)(1-\ell(s)). \qquad (8)$$

Note that $\{w(s)\}_{s=1}^{T-1}$ is a martingale difference sequence w.r.t. $\{\mathcal{F}_{t}\}$, while $x(t)$ is adapted to $\mathcal{F}_{t}$. Thus, the bound on $e_{1}(t)$ follows by using self-normalized bounds on martingales from Corollary 1 of [10].

To analyze $e_{2}(t)$, we observe

$$e_{2}(t)=\frac{1}{V_{2}(t)}\left(\sum_{s=1}^{t-1}w(s)u(s)\ell(s)-\lambda B_{(\star)}\right)+\left[A_{(\star)}-\hat{A}(t)\right]\frac{1}{V_{2}(t)}\sum_{s=1}^{t-1}x(s)u(s)\ell(s). \qquad (9)$$

The first term within parentheses is bounded using Corollary 2 of [10]. To bound the second term, we observe that it is upper-bounded by $K_{\max}|e_{1}(t)|$, and we then apply the bound on $e_{1}(t)$. The bound on the estimation error of $p_{(\star)}$ is obtained using the Azuma-Hoeffding inequality. ∎
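The decomposition (8) is an algebraic identity and can be verified numerically: on synthetic data with $z(s)=A_{(\star)}x(s)+w(s)$ on the slots used by the $A$-estimator, the ridge estimate's error matches the right-hand side of (8) to machine precision. A sketch with hypothetical synthetic data:

```python
import random

def e1_direct_and_decomposed(a_star, lam, n, seed=1):
    """Compare A_hat - a_star with the right-hand side of (8) on synthetic data."""
    rng = random.Random(seed)
    xs = [rng.uniform(-2, 2) for _ in range(n)]
    ws = [rng.gauss(0, 1) for _ in range(n)]
    ells = [rng.random() < 0.5 for _ in range(n)]
    # z(s) = a_star x(s) + w(s); only the slots with ell(s)=0 enter the A-estimator
    zs = [a_star * x + w for x, w in zip(xs, ws)]
    V1 = lam + sum(x * x for x, l in zip(xs, ells) if not l)
    a_hat = sum(z * x for z, x, l in zip(zs, xs, ells) if not l) / V1
    e1 = a_hat - a_star                       # direct error
    rhs = -lam * a_star / V1 + sum(w * x for w, x, l in zip(ws, xs, ells) if not l) / V1
    return e1, rhs
```

Expanding $\hat{A}(t)=\sum z(s)x(s)(1-\ell(s))/V_{1}(t)$ with $z(s)=A_{(\star)}x(s)+w(s)$ gives exactly the two terms of (8), which the check confirms.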

VI Large Deviation Bounds on the System State |x(t)||x(t)|

We now bound $|x(t)|$ under UCB-NCS. The system evolution under UCB-NCS is given by

$$x(t+1)=A_{sw}(t)x(t)+w(t),\quad t\in[1,T-1],$$

where

$$A_{sw}(t):=\left(A_{(\star)}+B_{(\star)}K_{\theta(t)}(\ell(t))\right)\ell(t)+A_{(\star)}(1-\ell(t)).$$

Thus,

$$x(t)=x(0)G(0,t)+\sum_{s=1}^{t-1}w(s)G(s,t-1), \qquad (10)$$

where

$$G(s_{1},s_{2}):=\begin{cases}\prod_{\ell=s_{1}}^{s_{2}}A_{sw}(\ell) & \text{if }s_{2}>s_{1},\\ 1 & \text{if }s_{1}=s_{2}.\end{cases}$$

Consider the deviations

$$\Delta(t_{1},t_{2}):=\sum_{s=t_{1}}^{t_{2}}\ell(s)-p_{(\star)}(t_{2}-t_{1}),$$

and the events,

$$\mathcal{J}_{t_{1},t_{2}}:=\left\{\omega:|\Delta(t_{1},t_{2})|\leq\sqrt{2\alpha\sigma^{2}_{p_{(\star)}}(t_{2}-t_{1})\log(t_{2}-t_{1})}\right\}, \qquad (11)$$

where $\sigma^{2}_{p_{(\star)}}:=p_{(\star)}(1-p_{(\star)})$ and $\alpha>2$. It follows from the Azuma-Hoeffding inequality that

$$\mathbb{P}\left(\mathcal{J}^{c}_{t_{1},t_{2}}\right)\leq\frac{1}{(t_{2}-t_{1})^{\alpha}},\quad\forall t_{1},t_{2}\in[1,T]. \qquad (12)$$

Fix a sufficiently large $L>0$ (it suffices to let $L>\left(2\alpha\sigma^{2}_{p_{(\star)}}/\epsilon^{2}\right)^{2}$, with $\epsilon$ as in Assumption 1), and define

$$\mathcal{J}:=\bigcap_{t_{1},t_{2}:\,t_{2}\geq t_{1}+L}\mathcal{J}_{t_{1},t_{2}}. \qquad (13)$$

The following result follows by combining the union bound with the bound (12).

Lemma 2
$$\mathbb{P}\left(\mathcal{J}^{c}\right)\leq T^{2}/L^{\alpha}.$$
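The bound (12) can be probed by Monte Carlo: for i.i.d. Bernoulli channel states, the complement of the event $\mathcal{J}_{t_1,t_2}$ in (11) should occur only very rarely. A sketch (the sample sizes and parameter values are hypothetical):

```python
import math
import random

def deviation_failure_rate(p_star, gap, alpha=2.5, trials=2000, seed=3):
    """Estimate P(J^c) for one pair (t1, t2) with t2 - t1 = gap, as in (11)-(12)."""
    rng = random.Random(seed)
    var = p_star * (1 - p_star)                      # sigma_p^2
    thresh = math.sqrt(2 * alpha * var * gap * math.log(gap))
    fails = 0
    for _ in range(trials):
        s = sum(1 for _ in range(gap) if rng.random() < p_star)
        if abs(s - p_star * gap) > thresh:           # |Delta(t1, t2)| exceeds the bound
            fails += 1
    return fails / trials
```

For a gap of a few hundred slots the threshold sits several standard deviations out, so the empirical failure rate is essentially zero, consistent with the $1/(t_2-t_1)^{\alpha}$ bound.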

We now focus on upper-bounding $|G(s,t)|$ on $\mathcal{J}$.

Throughout, we assume that the true system parameter $\theta_{(\star)}$ and the set $\Theta$ used by UCB-NCS satisfy the following.

Assumption 1

Define

$$\Lambda(\theta):=\mathbb{E}\left(\log|A_{sw}(t)|\ \middle|\ \theta(t)=\theta\right).$$

Let $\epsilon>0$, $\eta>0$. Then,

$$\Lambda(\theta)<-\eta-\epsilon<0,\quad\forall\theta\in\Theta. \qquad (14)$$

We call $\eta$ the “margin of stability” of the NCS. Note that $\eta$ depends upon a) $\Theta$, and b) $(A_{(\star)},B_{(\star)},p_{(\star)})$.

Consider an element of $\mathcal{J}$, and assume there are $K$ episodes during the time period $[s,t]$. Let $N_{i,k}$, $i=0,1$, denote the number of times the channel state assumes the value $i$ during episode $k$, let $\theta_{k}$ denote the UCB estimate of $\theta_{(\star)}$ during the $k$-th episode, and let $D_{k}$ denote the duration of the $k$-th episode. We have the following:

$$|G(s,t)|=\prod_{m=s}^{t}|A_{sw}(m)|\leq\prod_{k=1}^{K}\exp\left(D_{k}\Lambda(\theta_{k})\right)\exp\left(\sqrt{2\alpha\sigma^{2}_{p_{(\star)}}D_{k}\log D_{k}}\right)\leq\exp\left(-\eta(t-s)\right), \qquad (15)$$

where the first inequality follows from the definition (13) of $\mathcal{J}$, while the second follows from Assumption 1.
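The effect of Assumption 1 can be illustrated numerically: when the average of $\log|A_{sw}(t)|$ is negative, the product defining $G(s,t)$ contracts geometrically, as in (15). The sketch below estimates the empirical log-growth rate of the product for a two-mode scalar switch; the mode values are hypothetical:

```python
import math
import random

def product_log_rate(a_open, a_closed, p_star, steps, seed=5):
    """Average of log|G(0, steps)| / steps, where A_sw equals a_closed w.p. p_star, else a_open."""
    rng = random.Random(seed)
    log_g = 0.0
    for _ in range(steps):
        a = a_closed if rng.random() < p_star else a_open
        log_g += math.log(abs(a))            # log of the running product of |A_sw|
    return log_g / steps
```

With an unstable open-loop mode ($|a_{open}|>1$) and a contracting closed-loop mode, a sufficiently reliable channel makes the average rate negative, i.e., $|G(s,t)|$ decays like $\exp(-\eta(t-s))$; an unreliable channel can make it positive.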

Let

$$\mathcal{H}:=\left\{\omega:\max_{t\in[1,T]}|w(t)|\leq\log^{1/2}\left(T/\delta\right)\right\}.$$

The following is easily proved.

Lemma 3

We have

$$\mathbb{P}\left(\mathcal{H}^{c}\right)\leq\delta.$$
Lemma 4

Define

$$g(\delta,T):=|x(0)|+\frac{\log^{1/2}\left(T/\delta\right)}{1-\exp(-\eta)}. \qquad (16)$$

Under Assumption 1, we have the following on $\mathcal{H}\cap\mathcal{J}$:

$$|x(t)|<g(\delta,T),\quad\forall t\in[1,T].$$

Note that we have suppressed the dependence of the function $g$ upon $\eta$ and $x(0)$.

Proof:

The proof follows by substituting in (10) the bound (15) on $|G(s,t)|$ and the bound $\log^{1/2}\left(T/\delta\right)$ on $|w(s)|$ that holds on the set $\mathcal{H}$. ∎

VII Regret Analysis of UCB-NCS

Define $R(T)$, the regret incurred by UCB-NCS until time $T$, as follows:

$$R(T):=\sum_{t=1}^{T}c(t)-TJ_{\theta_{(\star)}},\quad\text{where }c(t):=Qx^{2}(t)+Ru^{2}(t). \qquad (17)$$
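Definition (17) translates directly into code once a trajectory and the optimal cost $J_{\theta_{(\star)}}$ are available; the sketch below is a literal transcription for the scalar case, with $Q$, $R$, and $J_{\theta_{(\star)}}$ supplied by the caller:

```python
def regret(xs, us, J_star, Q=1.0, R=1.0):
    """Empirical regret (17): accumulated stage cost minus T * J_star."""
    T = len(xs)
    total = sum(Q * x * x + R * u * u for x, u in zip(xs, us))
    return total - T * J_star
```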

For $\theta=(A,B,p)$, define

$$x_{\theta}(t+1;u)=Ax(t)+Bu+w(t).$$

Similarly, let $\{\ell_{\theta}(t)\}_{t=1}^{T}$ be drawn i.i.d. according to $\theta$, i.e., Bernoulli with mean $p$.

Lemma 5

On the set $\mathcal{E}$, $R(T)$ can be upper-bounded as follows,

$$R(T)\leq R_{1}+R_{2},$$

where,

$$R_{1}:=\sum_{t=1}^{T-1}V_{\theta(t)}(x_{\theta_{(\star)}}(t+1;u(t)),\ell_{(\star)}(t+1))_{\mathcal{F}_{t}}-V_{\theta(t)}(x(t),\ell(t)),$$
$$R_{2}:=\sum_{t=1}^{T-1}V_{\theta(t)}(x_{\theta(t)}(t+1;u(t)),\ell_{\theta(t)}(t+1))_{\mathcal{F}_{t}}-V_{\theta(t)}(x_{\theta_{(\star)}}(t+1;u(t)),\ell_{(\star)}(t+1))_{\mathcal{F}_{t}}.$$
Proof:

Consider the Bellman optimality equation at time $t$ when the true system parameter is assumed equal to $\theta(t)$:

$$J_{\theta(t)}+V_{\theta(t)}(x(t),\ell(t))=Qx^{2}(t)+\min_{u\in\mathbb{R}}\left[Ru^{2}+V_{\theta(t)}(x_{\theta(t)}(t+1;u),\ell_{\theta(t)}(t+1))_{\mathcal{F}_{t}}\right]$$
$$=Qx^{2}(t)+Ru^{2}(t)+V_{\theta(t)}(x_{\theta_{(\star)}}(t+1;u(t)),\ell_{(\star)}(t+1))_{\mathcal{F}_{t}}+V_{\theta(t)}(x_{\theta(t)}(t+1;u(t)),\ell_{\theta(t)}(t+1))_{\mathcal{F}_{t}}-V_{\theta(t)}(x_{\theta_{(\star)}}(t+1;u(t)),\ell_{(\star)}(t+1))_{\mathcal{F}_{t}}, \qquad (18)$$

where the second equality follows since the learning rule applies controls by assuming that $\theta(t)$ is the true system parameter. Note that on $\mathcal{E}$, $J_{\theta(t)}$ serves as a lower bound on the optimal cost $J_{\theta_{(\star)}}$, so that $\sum_{t=1}^{T}\left(Qx^{2}(t)+Ru^{2}(t)\right)-\sum_{t=1}^{T}J_{\theta(t)}$ serves as an upper bound on $R(T)$. The proof is completed by rearranging the terms in (18) and summing them from $t=1$ to $t=T-1$. ∎

We now bound the terms $R_{1}$ and $R_{2}$ on $\mathcal{E}$.

VII-A Bounding R1R_{1}

We decompose $R_{1}$ as $R_{1}=\mathcal{T}_{1}+\mathcal{T}_{2}$, where

$$\mathcal{T}_{1}:=\sum_{t=1}^{T-1}V_{\theta(t-1)}(x_{\theta_{(\star)}}(t;u(t-1)),\ell_{(\star)}(t))_{\mathcal{F}_{t-1}}-V_{\theta(t)}(x(t),\ell(t)),$$
$$\mathcal{T}_{2}:=V_{\theta(T-1)}(x_{\theta_{(\star)}}(T;u(T-1)),\ell_{(\star)}(T))_{\mathcal{F}_{T-1}}-V_{\theta(1)}(x(1),\ell(1)).$$

We further decompose $\mathcal{T}_{1}$ as $\mathcal{T}_{1}=\mathcal{T}_{3}+\mathcal{T}_{4}$, where

$$\mathcal{T}_{3}:=\sum_{t=1}^{T-1}V_{\theta(t-1)}(x_{\theta_{(\star)}}(t;u(t-1)),\ell_{(\star)}(t))_{\mathcal{F}_{t-1}}-V_{\theta(t-1)}(x(t),\ell(t)),$$
$$\mathcal{T}_{4}:=\sum_{t=1}^{T-1}V_{\theta(t)}(x(t),\ell(t))-V_{\theta(t-1)}(x(t),\ell(t)).$$
Lemma 6
$$\mathbb{P}\left(\mathcal{T}_{3}>\sqrt{Tg(\delta,T)\log\left(T/\delta\right)}\right)\leq\delta+\mathbb{P}\left(\left[\mathcal{H}\cap\mathcal{J}\right]^{c}\right),$$

where $g(\delta,T)$ is as in (16).

Proof:

$\mathcal{T}_{3}$ is a martingale, though its increments are not bounded. However, its increments are upper-bounded as $O(|x(t)|)$, and it follows from Lemma 4 that on $\mathcal{H}\cap\mathcal{J}$ they are upper-bounded as $O(g(\delta,T))$. The proof then follows from Proposition 34 of [11]. ∎

Henceforth denote

$$\mathcal{G}:=\left\{\omega:\mathcal{T}_{3}<\sqrt{Tg(\delta,T)\log\left(T/\delta\right)}\right\}.$$

We obtain the following bound on $R_{1}$ by combining the results of Lemma 6 and Lemma 14.

Lemma 7 (Bounding R1R_{1})

Let

$$\mathcal{U}_{1}:=\sqrt{Tg(\delta,T)\log\left(T/\delta\right)}+2P_{\max}g^{2}(\delta,T)+P_{\max}f(\delta,T)g(\delta,T), \qquad (19)$$

where $g(\delta,T)$ and $f(\delta,T)$ are as in (16) and (29). On $\mathcal{G}\cap\mathcal{H}\cap\mathcal{J}$ we have $R_{1}\leq\mathcal{U}_{1}$.

VII-B Bounding R2R_{2}

We decompose $R_{2}$ as follows:

$$R_{2}=\mathcal{T}_{5}+\mathcal{T}_{6}, \qquad (20)$$

where

$$\mathcal{T}_{5}:=\sum_{t=1}^{T-1}V_{\theta(t)}(x_{\theta(t)}(t+1;u(t)),\ell_{\theta(t)}(t+1))_{\mathcal{F}_{t}}-V_{\theta(t)}(x_{\theta_{(\star)}}(t+1;u(t)),\ell_{\theta(t)}(t+1))_{\mathcal{F}_{t}},$$
$$\mathcal{T}_{6}:=\sum_{t=1}^{T-1}V_{\theta(t)}(x_{\theta_{(\star)}}(t+1;u(t)),\ell_{\theta(t)}(t+1))_{\mathcal{F}_{t}}-V_{\theta(t)}(x_{\theta_{(\star)}}(t+1;u(t)),\ell_{(\star)}(t+1))_{\mathcal{F}_{t}}.$$

Note that under UCB-NCS we have $u(t)=K_{\theta(t)}(\ell(t))x(t)$. Let

$$K_{\max}:=\sup_{\theta\in\Theta,\ell\in\{0,1\}}K_{\theta}(\ell),\qquad P_{\max}:=\sup_{\theta\in\Theta,\ell\in\{0,1\}}P_{\theta}(\ell). \qquad (21)$$

After performing simple algebraic manipulations, we can show that

$$\mathcal{T}_{5}\leq P_{\max}\sum_{t=1}^{T-1}\left|\left(A_{\theta(t)}x(t)+B_{\theta(t)}u(t)\right)^{2}-\left(A_{(\star)}x(t)+B_{(\star)}u(t)\right)^{2}\right|\leq P_{\max}\,\mathcal{T}_{7}^{1/2}\,\mathcal{T}_{8}^{1/2}, \qquad (22)$$

where

$$\mathcal{T}_{7}:=\sum_{t=1}^{T-1}\left|A_{\theta(t)}x(t)-A_{(\star)}x(t)+B_{\theta(t)}u(t)-B_{(\star)}u(t)\right|^{2},$$
$$\mathcal{T}_{8}:=\sum_{t=1}^{T-1}\left|A_{\theta(t)}x(t)+B_{\theta(t)}u(t)+A_{(\star)}x(t)+B_{(\star)}u(t)\right|^{2},$$

and the last inequality in (22) follows from the Cauchy-Schwarz inequality. The terms $\mathcal{T}_{7}$ and $\mathcal{T}_{8}$ are bounded in Lemma 10 and Lemma 11 in the Appendix. We substitute these bounds into (22) and obtain the following result.

Lemma 8

On $\mathcal{E}\cap\mathcal{H}\cap\mathcal{J}$, we have

$$\mathcal{T}_{5}\leq C_{1}\sqrt{T}\log\left(V_{1}(T)/\lambda\right)\left(\gamma_{1}(\delta,T)+\gamma_{2}(\delta,T)+2\lambda^{1/2}\right)\sqrt{h(\delta,T)}\,g^{3/2}(\delta,T), \qquad (23)$$

where

$$C_{1}:=2\sqrt{2}P_{\max}\left(1+K_{\max}\right)G_{cl,\max}/\lambda. \qquad (24)$$

It remains to bound $\mathcal{T}_{6}$ in order to bound $R_{2}$. This is done in Lemma 12 of the Appendix.

Lemma 9

Let

$$\mathcal{U}_{2}:=C_{1}\sqrt{T}\log\left(V_{1}(T)/\lambda\right)\left(\gamma_{1}(\delta,T)+\gamma_{2}(\delta,T)+2\lambda^{1/2}\right)\sqrt{h(\delta,T)}\,g^{3/2}(\delta,T)+P_{\max}\left(G^{2}_{cl,\max}\,g(\delta,T)+\sigma^{2}\right)\sqrt{\alpha T\log T}.$$

On $\mathcal{E}\cap\mathcal{H}\cap\mathcal{J}$, we have $R_{2}\leq\mathcal{U}_{2}$.

Proof:

This follows by substituting the bounds on $\mathcal{T}_{5}$ and $\mathcal{T}_{6}$ from Lemma 8 and Lemma 12 into (20). ∎

VIII Main Result

Theorem 1 (Bound on Regret)

Consider the NCS operating under UCB-NCS described in Algorithm 1. Under Assumption 1, $R(T)\leq\mathcal{U}_{1}+\mathcal{U}_{2}$ with probability at least $1-\left(7\delta+T^{2}/L^{\alpha}\right)$. The terms $\mathcal{U}_{1},\mathcal{U}_{2}$ are defined in Lemma 7 and Lemma 9, respectively. Upon ignoring terms and factors that are $O(\log T)$, this bound simplifies to

$$\sqrt{T}\left(\log^{1/4}(1/\delta)+\sqrt{\alpha}P_{\max}G^{2}_{cl,\max}\log^{1/2}(1/\delta)+C_{1}\right).$$
Proof:

It follows from Lemma 5 that $R(T)\leq R_{1}+R_{2}$ on $\mathcal{E}$. The proof then follows by substituting the upper bounds from Lemma 7 and Lemma 9, and using the union bound to lower-bound the probability of $\mathcal{G}\cap\mathcal{E}\cap\mathcal{H}\cap\mathcal{J}$. ∎

IX Conclusion and Future Work

We propose UCB-NCS, an adaptive control law, or learning rule, for NCS, and provide finite-time performance guarantees for it. We show that with a high probability its regret scales as $\tilde{O}(\sqrt{T})$ up to constant factors. We also identify a quantity which we call the margin of stability of the NCS; the regret increases as this margin shrinks, which indicates low network quality.

The results in this work can be extended in various directions. So far we have considered only scalar systems; a natural extension is to vector systems. Another direction is to derive lower bounds on the expected regret achievable under any admissible control policy.

Lemma 10 (Bounding 𝒯7\mathcal{T}_{7})

On $\mathcal{E}\cap\mathcal{H}\cap\mathcal{J}$, we have

$$\mathcal{T}_{7}\leq\left(1+K_{\max}\right)^{2}\left(\gamma_{1}(\delta,T)+\gamma_{2}(\delta,T)+2\lambda^{1/2}\right)^{2}\times 2\left\{2\sqrt{h(\delta,T)}\right\}^{2}g^{2}(\delta,T)\,\frac{1}{\lambda}\log\left(\frac{V_{1}(T)}{\lambda}\right).$$
Proof:

Let $\tau$ be the time step at which the latest episode begins. Since under UCB-NCS we have $u(t)=K_{\theta(t)}(\ell(t))x(t)$, it can be shown that

$$\left|\left(A_{\theta(t)}x(t)-A_{(\star)}x(t)\right)+\left(B_{\theta(t)}u(t)-B_{(\star)}u(t)\right)\right|\leq\left|A_{\theta(t)}-A_{(\star)}\right||x(t)|+\left|B_{\theta(t)}-B_{(\star)}\right|K_{\max}|x(t)|. \qquad (25)$$

Now consider the following inequality:

$$\left|A_{\theta(t)}-A_{(\star)}\right|\leq\left|A_{\theta(t)}-\hat{A}(t)\right|+\left|\hat{A}(t)-A_{(\star)}\right|. \qquad (26)$$

For $\theta=(A,B,p)\in\mathcal{C}(\tau)$, we have

$$\left|A-\hat{A}(\tau)\right||x(t)|\leq\sqrt{V_{1}(\tau)}\left|A-\hat{A}(\tau)\right|\frac{|x(t)|}{\sqrt{V_{1}(t)}}\sqrt{h(\delta,T)}\leq\left(\gamma_{1}(\delta,\tau)+\lambda^{1/2}\right)\frac{|x(t)|}{\sqrt{V_{1}(t)}}\sqrt{h(\delta,T)}, \qquad (27)$$

where the first inequality follows from Lemma 15, and the second from the size of the confidence intervals (5). On $\mathcal{E}$ we have $\theta_{(\star)}\in\mathcal{C}(\tau)$, and also $\theta(t)\in\mathcal{C}(\tau)$; so we use inequality (27) with $\theta$ set equal to $\theta_{(\star)}$ and $\theta(t)$, and combine the resulting inequalities with (26) in order to obtain the following:

$$\left|A_{\theta(t)}-A_{(\star)}\right||x(t)|\leq 2\sqrt{h(\delta,T)}\left(\gamma_{1}(\delta,t)+\lambda^{1/2}\right)\frac{|x(t)|}{\sqrt{V_{1}(t)}}. \qquad (28)$$

A similar bound can be obtained for $\left|B_{\theta(t)}-B_{(\star)}\right||x(t)|$. The remaining proof consists of substituting these bounds into (25) and performing algebraic manipulations. We also utilize Lemma 10 of [6] in order to bound $\sum_{t=1}^{T}\left[|x(t)|^{2}/V_{1}(t)\wedge 1\right]$ and $\sum_{t=1}^{T}\left[|u(t)|^{2}/V_{2}(t)\wedge 1\right]$. ∎

Lemma 11 (Bounding 𝒯8\mathcal{T}_{8})

On $\mathcal{H}\cap\mathcal{J}$, we have

$$\mathcal{T}_{8}\leq G^{2}_{cl,\max}\,Tg(\delta,T),$$

where

$$G_{cl,\max}:=\sup_{\theta\in\Theta,\ell\in\{0,1\}}\max\left\{|A_{\theta}+B_{\theta}K_{\theta}(\ell)|,\ |A_{(\star)}+B_{(\star)}K_{\theta}(\ell)|\right\}.$$
Proof:

Follows from Lemma 4. ∎

Lemma 12

On $\mathcal{E}\cap\mathcal{H}\cap\mathcal{J}$ we have

$$\mathcal{T}_{6}\leq P_{\max}\left(G^{2}_{cl,\max}\,g(\delta,T)+\sigma^{2}\right)\sqrt{\alpha T\log T}.$$
Proof:

We have

$$\mathcal{T}_{6}\leq\sum_{t=1}^{T-1}|p_{\theta(t)}-p_{(\star)}|\,P_{\max}\left(G^{2}_{cl,\max}x^{2}(t)+\sigma^{2}\right)\leq P_{\max}\left(G^{2}_{cl,\max}\max_{t\in[1,T]}x^{2}(t)+\sigma^{2}\right)\sum_{t=1}^{T-1}|p_{\theta(t)}-p_{(\star)}|.$$

The proof is completed by noting that on $\mathcal{E}$ we have $|p_{\theta(t)}-p_{(\star)}|\leq 2\beta_{3}(t)$, while on $\mathcal{H}\cap\mathcal{J}$ we have $\max_{t\in[1,T]}|x(t)|\leq g(\delta,T)$. ∎

Lemma 13 (Bounding N(T)N(T))

Define

$$f(\delta,T):=\log\left(1+Tg^{2}(\delta,T)/\lambda\right)+\log\left(1+TK^{2}_{\max}g^{2}(\delta,T)/\lambda\right)+\log(T). \qquad (29)$$

We have that

$$N(T)\leq f(\delta,T)\ \text{ on }\ \mathcal{H}\cap\mathcal{J}.$$
Proof:

Recall that a new episode starts only when either a) $V_{1}(t)$ or $V_{2}(t)$ doubles, or b) the time spent in the current episode equals the length of the previous one (i.e., $t\geq 2\tau$). Let $N_{1}(T),N_{2}(T)$ denote the numbers of episodes that began due to the doubling of $V_{1}(t)$ and $V_{2}(t)$, respectively, and let $N_{3}(T)$ be the number of episodes that began due to b). Clearly, $V_{1}(T)\geq 2^{N_{1}(T)}\lambda$, while on $\mathcal{H}\cap\mathcal{J}$ we have $|x(t)|\leq g(\delta,T)$ (Lemma 4), so that $V_{1}(T)\leq\lambda+Tg^{2}(\delta,T)$. Combining these, we obtain $N_{1}(T)\leq\log\left(1+Tg^{2}(\delta,T)/\lambda\right)$. Similarly, $N_{2}(T)\leq\log\left(1+TK^{2}_{\max}g^{2}(\delta,T)/\lambda\right)$, and $N_{3}(T)\leq\log(T)$. The proof then follows by noting that $N(T)=N_{1}(T)+N_{2}(T)+N_{3}(T)$. ∎
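The doubling argument in the proof can be checked directly: if a new episode starts whenever the running value of $V_{1}$ doubles, the number of episodes grows only logarithmically in the final value of $V_{1}$, matching the $\log\left(1+Tg^{2}(\delta,T)/\lambda\right)$ term. A minimal sketch:

```python
import math

def episodes_by_doubling(values, lam=1.0):
    """Count doubling episodes: a new episode starts whenever the running sum
    lam + sum(values[:t]) reaches twice its value at the previous episode start."""
    v, v_star, episodes = lam, lam, 0
    for inc in values:
        v += inc
        if v >= 2 * v_star:
            episodes += 1
            v_star = v
    return episodes
```

For $T=1000$ unit increments starting from $\lambda=1$, the final value is $1001$ and the episode count stays below $\log_{2}(1001)\approx 10$.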

Lemma 14 (Bounding fluctuations within an episode)

We have

𝒯4\displaystyle\mathcal{T}_{4} Pmaxf(δ,T)g2(δ,T),ω𝒥, and,\displaystyle\leq P_{\max}f(\delta,T)\cdot g^{2}(\delta,T),~{}\forall\omega\in\mathcal{H}\cap\mathcal{J},\mbox{ and},
𝒯2\displaystyle\mathcal{T}_{2} 2Pmaxg2(δ,T),ω𝒥,\displaystyle\leq 2P_{\max}g^{2}(\delta,T),~{}\forall\omega\in\mathcal{H}\cap\mathcal{J},

where g(δ,T),f(δ,T)g(\delta,T),f(\delta,T) are as in (16) and (29), respectively.

Proof:

Recall 𝒯4=t=1T1Vθ(t)(x(t),(t))Vθ(t1)(x(t),(t))\mathcal{T}_{4}=\sum\limits_{t=1}^{T-1}V_{\theta(t)}(x(t),\ell(t))-V_{\theta(t-1)}(x(t),\ell(t)). Hence, the term in the summation corresponding to time tt is non-zero only if the UCB estimate θ(t)\theta(t) changes, i.e., a new episode begins at time tt. Thus,

𝒯4N(T)Pmaxmaxt[1,T]|x(t)|2.\displaystyle\mathcal{T}_{4}\leq N(T)P_{\max}\max_{t\in[1,T]}|x(t)|^{2}.

The claim then follows by substituting bounds on N(T)N(T) and maxt[1,T]|x(t)|\max_{t\in[1,T]}|x(t)| from Lemma 4 and Lemma 13. ∎
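The telescoping structure used above can be illustrated directly: the summand Vθ(t)(·) − Vθ(t−1)(·) vanishes except at the time steps where the UCB estimate θ changes, so the sum has at most N(T) non-zero terms. A toy sketch (the value function, trajectory, and switch times are illustrative, not from the paper):

```python
# Toy illustration: sum_t [ V_{theta(t)}(x(t)) - V_{theta(t-1)}(x(t)) ]
# has at most N(T) non-zero terms, N(T) = number of episode switches.
T = 100
switch_times = {10, 35, 70}                 # illustrative episode starts
theta = []
cur = 0
for t in range(T):
    if t in switch_times:
        cur += 1                            # UCB estimate changes here
    theta.append(cur)

def V(th, x):
    """Illustrative quadratic value function indexed by the estimate th."""
    return (1.0 + 0.1 * th) * x * x

x = [1.0] * T                               # illustrative bounded trajectory
terms = [V(theta[t], x[t]) - V(theta[t - 1], x[t]) for t in range(1, T)]
nonzero = sum(1 for d in terms if d != 0.0)
assert nonzero == len(switch_times)         # non-zero only at switches
```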

Lemma 15

Define,

h(δ,T):=max{1+2L(1+1λ[log1/2(T/δ)1exp(η)]2),2}.\displaystyle h(\delta,T):=\max\left\{1+2L\left(1+\frac{1}{\lambda}\left[\frac{\log^{1/2}\left(T/\delta\right)}{1-\exp(-\eta)}\right]^{2}\right),2\right\}. (30)

We then have that

V1(t)V1(τk)h(δ,T),t[τk,τk+11].\displaystyle\frac{V_{1}(t)}{V_{1}(\tau_{k})}\leq h(\delta,T),~{}\forall t\in\left[\tau_{k},\tau_{k+1}-1\right].

Note that we have suppressed its dependence upon η,L\eta,L in order to simplify the notation.

Proof:

Consider the following cases.

Case a): t[τk,τk+L]t\in[\tau_{k},\tau_{k}+L]. We have

V1(t)V1(τk)1=s=τk+1tx2(s)/V1(τk)\displaystyle\frac{V_{1}(t)}{V_{1}(\tau_{k})}-1=\sum\limits_{s=\tau_{k}+1}^{t}x^{2}(s)/V_{1}(\tau_{k})
L(maxs[τk,τk+L]|x(s)|)2/V1(τk)\displaystyle\leq L\left(\max_{s\in[\tau_{k},\tau_{k}+L]}|x(s)|\right)^{2}/V_{1}(\tau_{k})
L(|x(τk)|+log1/2(T/δ)1exp(η))2/V1(τk)\displaystyle\leq L\left(|x(\tau_{k})|+\frac{\log^{1/2}\left(T/\delta\right)}{1-\exp(-\eta)}\right)^{2}/V_{1}(\tau_{k})
2L(|x(τk)|2V1(τk)+1V1(τk)[log1/2(T/δ)1exp(η)]2)\displaystyle\leq 2L\left(\frac{|x(\tau_{k})|^{2}}{V_{1}(\tau_{k})}+\frac{1}{V_{1}(\tau_{k})}\left[\frac{\log^{1/2}\left(T/\delta\right)}{1-\exp(-\eta)}\right]^{2}\right)
2L(1+1λ[log1/2(T/δ)1exp(η)]2),\displaystyle\leq 2L\left(1+\frac{1}{\lambda}\left[\frac{\log^{1/2}\left(T/\delta\right)}{1-\exp(-\eta)}\right]^{2}\right),

where the last inequality uses V1(τk)|x(τk)|2V_{1}(\tau_{k})\geq|x(\tau_{k})|^{2} and V1(τk)λV_{1}(\tau_{k})\geq\lambda.

Case b): t[τk+L+1,τk+11]t\in[\tau_{k}+L+1,\tau_{k+1}-1]. In this case we have V1(t)V1(τk)<2\frac{V_{1}(t)}{V_{1}(\tau_{k})}<2, since a new episode begins once the ratio becomes greater than or equal to 22. ∎
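Case b) is essentially definitional and can be sanity-checked in simulation: at every step strictly inside an episode, i.e., before the doubling rule fires, the ratio V₁(t)/V₁(τk) stays below 2 by construction of the episode-switching rule. A toy sketch (the increment sequence is illustrative):

```python
# Toy check of case b): within an episode (before the doubling rule fires),
# V1(t) / V1(tau_k) < 2 by construction of the episode-switching rule.
lam, T = 1.0, 500
v1 = lam
episode_start = v1
within_episode_ratios = []
for t in range(T):
    xsq = 0.3 + 0.01 * (t % 7)       # illustrative x^2 increments
    v1 += xsq
    if v1 >= 2 * episode_start:
        episode_start = v1           # new episode begins at this step
    else:
        within_episode_ratios.append(v1 / episode_start)
assert max(within_episode_ratios) < 2.0
```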

References

  • [1] R. E. Bellman, Adaptive control processes: a guided tour.   Princeton University Press, 2015.
  • [2] P. R. Kumar and P. Varaiya, Stochastic systems: Estimation, identification and adaptive control.   Prentice Hall Inc., Englewood Cliffs, 1986.
  • [3] A. Becker, P. R. Kumar, and C.-Z. Wei, “Adaptive control with the stochastic approximation algorithm: Geometry and convergence,” IEEE Transactions on Automatic Control, vol. 30, no. 4, pp. 330–338, 1985.
  • [4] H.-F. Chen and L. Guo, “Optimal adaptive control and consistent parameter estimates for ARMAX model with quadratic cost,” SIAM Journal on Control and Optimization, vol. 25, no. 4, pp. 845–867, 1987.
  • [5] S. Bittanti, M. C. Campi et al., “Adaptive control of linear time invariant systems: the ‘bet on the best’ principle,” Communications in Information & Systems, vol. 6, no. 4, pp. 299–320, 2006.
  • [6] Y. Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” in Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 1–26.
  • [7] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985.
  • [8] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine Learning, vol. 47, no. 2-3, pp. 235–256, 2002.
  • [9] O. L. V. Costa, M. D. Fragoso, and R. P. Marques, Discrete-time Markov jump linear systems.   Springer Science & Business Media, 2006.
  • [10] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Online least squares estimation with self-normalized processes: An application to bandit problems,” arXiv preprint arXiv:1102.2670, 2011.
  • [11] T. Tao, V. Vu et al., “Random matrices: universality of local eigenvalue statistics,” Acta Mathematica, vol. 206, no. 1, pp. 127–204, 2011.