Learning in Networked Control Systems
Abstract
We design adaptive controller (learning rule) for a networked control system (NCS) in which data packets containing control information are transmitted across a lossy wireless channel. We propose Upper Confidence Bounds for Networked Control Systems (UCB-NCS), a learning rule that maintains confidence intervals for the estimates of plant parameters , and channel reliability , and utilizes the principle of optimism in the face of uncertainty while making control decisions.
We provide non-asymptotic performance guarantees for UCB-NCS by analyzing its “regret”, i.e., performance gap from the scenario when are known to the controller. We show that with a high probability the regret can be upper-bounded as 111Here hides logarithmic factors., where is the operating time horizon of the system, and is a problem dependent constant.
I Introduction
Though adaptive control [1] of unknown Linear Quadratic Gaussian (LQG) systems [2] is a well-studied topic by now [3, 4, 5, 6], existing algorithms cannot be utilized for controlling an unknown NCS in which plant and network parameters are unknown. In departure from the traditional adaptive controllers for LQG systems, an algorithm now also needs to continually estimate the unknown network behaviour besides simultaneously learning and controlling the plant in an online manner. An important concern is that in general it is not optimal to design and operate network estimator independently of the process controller. Thus, the optimal controls should utilize the information gained about network quality in addition to using the information gained about plant parameters. Similarly, decisions made by the network scheduler should also “aid” the controller in “learning” the unknown plant parameters.
This work addresses the problem of adaptive control of a simple NCS in which data packets from the controller to the plant, are communicated over an unreliable channel. We model the plant as a LQG system. We propose a learning rule that maintains estimates and confidence sets for both a) (unknown) plant parameters , and also b) (unknown) channel reliability . Controls are then generated using the principle of optimism in face of uncertainty [7], and depend upon both a) and b). We denote our algorithm as Upper Confidence Bounds for Networked Control Systems (UCB-NCS).
We show that UCB-NCS yields the same asymptotic performance as the optimal controller that has knowledge of the system and network parameters. We also quantify its finite-time performance by providing upper-bounds on its “regret” [8]. Regret scales as , where is the operating time horizon and is a problem dependent constant. It also depends on the channel reliability through a certain quantity which we call the “margin of stability” (14). A larger value of means that the learning algorithm has a lower regret.
UCB-NCS has many appealing properties. For instance, network estimator needs to communicate only occasionally the value of its optimistic estimate of network reliability to the controller which then uses it to generate controls.
II System Model
We assume that the system of interest is linear, and evolves as follows
(1) |
where are the system matrices, is the instantaneous state of the wireless channel, and are the system state and control input at time respectively. are Bernoulli i.i.d. with mean value . is the process noise, and is assumed to be i.i.d. with
The objective is to minimize the operating cost
(2) |
We let denote the system parameters. is not known to controller. We assume that the system is scalar, i.e., .
III Preliminaries on Jump Markov Linear Systems
Note that (1) is a Jump Markov Linear System (JMLS), and if the system parameter is known, the optimal controls can be obtained by using Dynamic Programming [9].
There are matrices such that the optimal control at is given by . We let denote the optimal matrices when system parameter is equal to .
We let denote the “cost-to-go” when system state is equal to , channel state is and system dynamics are described by . In fact value function is piecewise linear, and we let denote the corresponding matrices. We also let be the optimal operating cost.
Notation: For a random variable (r.v.) , let denote its projection onto the space of measurable funcions, i.e., its conditional expectation w.r.t. sigma-algebra . For 222 denotes the set of integers., we let . For a set of r.v. s , we let denote the smallest sigma-algebra with respect to which each r.v. in is measurable. For functions , we say if . For a set , we let denote its complement.
IV Upper Confidence Bounds for NCS (UCB-NCS)
Let . A learning policy, or an adaptive controller is a collection of maps . Let denote the estimates of at time defined as follows. Let , and .
(3) |
Define
(4) |
Let be the confidence intervals associated with the estimates at time defined as follows,
(5) | ||||
(6) |
where
The learning rule decomposes the cumulative time into episodes, and implements a single stationary controller within each single episode that chooses as a function of . Let denote the starting time of -th episode. The controller implemented within episode is obtained at time by solving the following optimization problem.
(7) |
where is the set of “allowable” parameters. Let denote a solution to above problem. It implements the optimal controller corresponding to the case when true system parameters are equal to . . Thus, for .
A new episode begins when either or doubles or the operating time spent in current episode becomes equal to length of previous episode. The learning rule also ensures that the durations of episodes are at least time-slots, i.e., . We set
i.e., it is the current value of the UCB estimate of . UCB-NCS is summarized in Algorithm 1.
V Large Deviation Bounds on Estimation Errors
We now analyze the estimation errors .
Lemma 1
Define
We then have that
Proof:
It can be shown that
(8) |
Note that is a martingale difference sequence w.r.t. , while is adapted to . Thus, bound on follows by using self-normalized bounds on martingales from Corollary 1 of [10].
To analyze , we observe,
(9) |
The first term within braces is bounded using Corollary 2 of [10]. To bound the second term, we observe that it is upper-bounded by . We then use bounds on to bound it. Bound on estimation error of is obtained using Azuma-Hoeffding inequality. ∎
VI Large Deviation Bounds on the System State
We now bound under UCB-NCS. System evolution under UCB-NCS is given by
where
Thus,
(10) |
where
Consider the deviations
and the events,
(11) |
where , and . It follows from Azuma-Hoeffding inequality that
(12) |
Fix a sufficiently large 333It suffices to let , and define
(13) |
The following result by combining union bound with the bound (12).
Lemma 2
We now focus on upper-bounding on .
Throughout, we assume that the true system parameter , and the set used by UCB-NCS, satisfy the following.
Assumption 1
Define
Let . Then,
(14) |
We call as the “margin of stability” of the NCS. Note that depends upon a) , b) .
Consider an element of , and assume there are episodes during the time period . Let denote the number of times channel state assumes value , and let denote the UCB estimate of during the -th episode. Let denote the duration of -th episode. We have the following,
(15) |
where the first inequality follows from definition of (13), while the second follows from Assumption 1.
Let
Following is easily proved.
Lemma 3
We have
Lemma 4
Define
(16) |
Under Assumption 1, we have the following on
Note that we have suppressed dependence of function upon .
VII Regret Analysis of UCB-NCS
Define , the regret incurred by UCB-NCS until time as follows
(17) |
For , define
Similarly, let be drawn i.i.d. according to .
Lemma 5
On the set , can be upper-bounded as follows,
where,
Proof:
Consider the Bellman optimality equation at time when the true system parameter is assumed equal to ,
(18) |
where the second equality follows since the learning rule applies controls by assuming that is the true system parameter. Note that on , serves as a lower bound on the optimal cost , so that serves as an upper-bound on . Proof is completed by re-arranging the terms in (VII), and summing them from to . ∎
We now bound the terms on .
VII-A Bounding
We decompose as follows, , where,
We further decompose as follows,
where,
Lemma 6
where is as in (16).
Proof:
VII-B Bounding
We decompose as follows,
(20) |
where
Note that under UCB-NCS, we have that . Let
(21) |
After performing simple algebraic manipulations, we can show that
(22) |
where
and the last inequality in (VII-B) follows from Cauchy-Schwartz inequality. The terms are bounded in Lemma 10 and Lemma 11 in Appendix. We substitute these bounds in (VII-B) and obtain the following result.
Lemma 8
On , we have
(23) | ||||
(24) |
It remains to bound in order to bound . This is done in Lemma 12 of Appendix.
Lemma 9
Let
On , we have .
VIII Main Result
Theorem 1 (Bound on Regret)
IX Conclusion and Future Work
We propose UCB-NCS, an adaptive control law, or learning rule for NCS, and provide its finite-time performance guarantees. We show that with a high probability, its regret scales as upto constant factors. We identify a certain quantity which we call margin of stability of NCS. Regret increases with a smaller margin, which indicates a low network quality.
Results in this work can be extended in various directions. So far we considered only scalar systems. A natural extension is to the case of vector systems. Another direction is to derive lower-bounds on expected value of regret that can be achieved under any admissible control policy.
Lemma 10 (Bounding )
On , we have
Proof:
Let be the time step at which the latest episode begins. Since under UCB-NCS we have , it can be shown that
(25) |
Now consider the following inequality,
(26) |
For , we have,
(27) |
where the first inequality follows from Lemma 14, and second inequality follows from the size of the confidence intervals (5). On , we have , and also ; so we use inequality (IX) with set equal to , and combine the resulting inequalities with (26) in order to obtain the following,
(28) |
A similar bound can be obtained for also. Remaining proof comprises of substituting these bounds in (IX) and performing algebraic manipulations. We also utilize Lemma 10 of [6] in order to bound . ∎
Lemma 11 (Bounding )
On , we have
where
Proof:
Follows from Lemma 4. ∎
Lemma 12
On we have
Proof:
We have
The proof is completed by noting that on , we have , while on we have . ∎
Lemma 13 (Bounding )
Define
(29) |
We have that
Proof:
Recall that a new episode starts only when either a) or doubles, or b) samples used for estimating channel reliability double. Let denote the number of episodes that began due to doubling of respectively. Let be number of episodes that began due to b). Clearly, , while on we have (Lemma 4) so that . Combining these, we obtain . Similarly, . Also, . The proof then follows by noting that . ∎
Proof:
Lemma 15
Define,
(30) |
We then have that
Note that we have suppressed its dependence upon in order to simplify the notation.
Proof:
Consider the following cases.
Case a): . We have
Case b): . In this case we have , since a new episode begins once the ratio becomes greater than or equal to . ∎
References
- [1] R. E. Bellman, Adaptive control processes: a guided tour. Princeton university press, 2015.
- [2] P. R. Kumar and P. Varaiya, Stochastic systems: Estimation, identification and adaptive control. Prentice Hall Inc., Englewood Cliffs, 1986.
- [3] A. Becker, P. R. Kumar, and C.-Z. Wei, “Adaptive control with the stochastic approximation algorithm: Geometry and convergence,” IEEE Transactions on Automatic Control, vol. 30, no. 4, pp. 330–338, 1985.
- [4] H.-F. Chen and L. Guo, “Optimal adaptive control and consistent parameter estimates for armax model with quadratic cost,” SIAM Journal on Control and Optimization, vol. 25, no. 4, pp. 845–867, 1987.
- [5] S. Bittanti, M. C. Campi et al., “Adaptive control of linear time invariant systems: the ?bet on the best? principle,” Communications in Information & Systems, vol. 6, no. 4, pp. 299–320, 2006.
- [6] Y. Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” in Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 1–26.
- [7] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in applied mathematics, vol. 6, no. 1, pp. 4–22, 1985.
- [8] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine learning, vol. 47, no. 2-3, pp. 235–256, 2002.
- [9] O. L. V. Costa, M. D. Fragoso, and R. P. Marques, Discrete-time Markov jump linear systems. Springer Science & Business Media, 2006.
- [10] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Online least squares estimation with self-normalized processes: An application to bandit problems,” arXiv preprint arXiv:1102.2670, 2011.
- [11] T. Tao, V. Vu et al., “Random matrices: universality of local eigenvalue statistics,” Acta mathematica, vol. 206, no. 1, pp. 127–204, 2011.