
Stability of multiplexed NCS based on an epsilon-greedy algorithm for communication selection

Harsh Oza1, Irinel-Constantin Morărescu2, Vineeth S. Varma2, and Ravi Banavar1
1 The authors are with Systems & Control Engineering, IIT Bombay, Powai, Mumbai 400076, India. Email: {harsh.oza, banavar}@iitb.ac.in.
2 The authors are with Université de Lorraine, CNRS, CRAN, F-54000 Nancy, France, and are also associated with the Automation Department, Technical University of Cluj-Napoca, 400114 Cluj-Napoca, Romania. Email: {constantin.morarescu, vineeth.satheeskumar-varma}@univ-lorraine.fr.
The authors thank the Indo-French Centre for the Promotion of Advanced Research (IFCPRA) for its support. The work of I.-C. Morărescu and V. S. Varma has also been supported by project DECIDE, funded under the PNRR I8 scheme by the Romanian Ministry of Research. A preliminary version of this article has been submitted to IEEE Control Systems Letters.
Abstract

In this letter, we study a Networked Control System (NCS) with multiplexed communication and Bernoulli packet drops. Multiplexed communication refers to the constraint that the control signal and the observation signal cannot be transmitted simultaneously due to limited bandwidth. First, we propose an $\varepsilon$-greedy algorithm for the selection of the communication sequence that also ensures Mean Square Stability (MSS). We formulate the system as a Markovian Jump Linear System (MJLS) and provide the necessary conditions for MSS in terms of Linear Matrix Inequalities (LMIs) that must be satisfied for three corner cases. We prove that the system is MSS for any convex combination of these three corner cases. Furthermore, we propose to use the $\varepsilon$-greedy algorithm, with an $\varepsilon$ that satisfies the MSS conditions, for training a Deep $\mathsf{Q}$ Network (DQN). The DQN is used to obtain an optimal communication sequence that minimizes a quadratic cost. We validate our approach with a numerical example that shows the efficacy of our method in comparison to round-robin and random schemes.

Keywords: Networked control system, Markovian Jump Linear System, Deep Reinforcement Learning

1 Introduction

A Networked Control System (NCS) is a system consisting of a plant, a controller, and a communication network. NCSs play an important role in various industries such as chemical plants, power systems, and warehouse management [1, 2]. In an NCS, the communication between the plant and the controller, as well as between the controller and the actuator, often occurs over a wireless network. Various challenges exist in such systems, namely limited resources (bandwidth), delays, packet drops, and adversarial attacks [3, 4, 5, 6]. These uncertainties can degrade the plant's performance, making it important to study such scenarios. Our focus is on an NCS with limited bandwidth, formulated as a multiplexing constraint on the transmission of control and observation signals, together with random packet drops. We aim to find a switching policy for selecting the communication direction that ensures Mean Square Stability (MSS), and then to find the optimal communication sequence that minimizes a quadratic performance measure.

In an early piece of work, Athans et al. proposed the uncertainty threshold principle [7]. This principle states that optimal long-range control of a dynamical system with uncertain parameters is feasible if and only if the uncertainty does not exceed a certain threshold. It forms the basis of subsequent work that gives stability conditions in different scenarios. In [8], linear systems are controlled over a network with packet drops modeled as Bernoulli random variables, with independent drops in the two communication channels; the necessary conditions for MSS and optimal control solutions using dynamic programming are then provided. Schenato et al. generalize this work by considering noisy measurements and giving stronger conditions for the existence of the solution [9]. They also show that the separation principle holds for the TCP protocol. Other studies focus on the design of a Kalman filter for wireless networks [10, 11] and on stability conditions for systems with packet drops [12]. An alternative approach to random packet drops involves sending multiple copies [13], and event-triggered policies for nonlinear systems with packet drops are developed in [14].

While stability and optimality problems are addressed in the literature, the stability and optimal scheduling problems with multiplexing between control and observation have not been completely investigated. For instance, [9] discusses multiplexing and packet drops but not optimal network selection, while [15] considers a joint strategy for optimal selection and control of an NCS, but only for sensor signals. Major research directions incorporate bandwidth constraints as i) multiplexing among multiple sensor signals, e.g., [11], or ii) multiplexing between sensor and control signals, e.g., [9]. In addition, other communication uncertainties such as packet drops and delays are addressed in [16]. The main objective of these works is to find an optimal control strategy, an optimal policy for the selection of communication channels, or both; see, e.g., [17]. Leong et al. address the boundedness of the error covariance in a multiplexed sensor information system with packet drops [18]. A stability condition is established based on the packet drop probability, and an optimal communication sequence is then found by training a DQN with an $\varepsilon$-greedy algorithm.

In this letter, our contributions are as follows:

  1. i)

    We propose a modified $\varepsilon$-greedy algorithm for the selection of the direction of communication (transmit or receive).

  2. ii)

    We establish the necessary conditions for the MSS of an NCS with multiplexed communication and packet drops.

  3. iii)

    We provide an optimal switching policy using Deep $\mathsf{Q}$-Learning that utilizes the proposed $\varepsilon$-greedy algorithm and ensures MSS not only after training the neural network, but also during the training phase.

The rest of the paper is organized as follows. Section 2 introduces the problem setup and the communication constraints. In Section 3, we propose a switching strategy and formulate the problem as a Markovian Jump Linear System (MJLS). We provide necessary conditions for MSS in Section 4. Next, we formulate the optimal communication selection problem and provide a solution based on deep reinforcement learning in Section 5. Finally, in Section 6, we validate our results on a numerical example.

2 Problem Setup

2.1 Plant and Controller Model

Consider a closed-loop discrete-time linear system

$x_{k+1} = A x_k + B u_k, \qquad y_k = C x_k$  (1)

for all $k \in \mathbb{Z}_{\geq 0}$, where $x_k \in \mathbb{R}^{n_x}$ is the state, $u_k \in \mathbb{R}^{n_u}$ is the control input, and $y_k \in \mathbb{R}^{n_y}$ is the output at the $k^{\text{th}}$ instant. We make the following assumptions regarding the original closed-loop system.

Assumption 1.

The pair $(A,B)$ is controllable and the pair $(A,C)$ is observable.

Assumption 2.

There exists a state feedback controller of the form

$u_k = K x_k$  (2)

that stabilizes the system (1).

2.2 Networked System Model

In this letter, we are interested in an application where the plant and the controller are remotely located. The communication between the plant and the controller occurs over a wireless communication network. The networked system is illustrated in Fig. 1. The networked system dynamics can be written as

$x_{k+1} = A x_k + B \hat{u}_k, \qquad y_k = C x_k$  (3)

where $\hat{u}_k$ denotes the networked version of the control signal. We proceed by emulation of the controller (2) and use the controller as

$u_k = K \hat{x}_k$  (4)

where $\hat{x}_k$ denotes the estimate of the state at the controller end. A more detailed explanation of these quantities is presented later in this section.

[Figure 1: block diagram of the NCS, showing the Controller $u_k = K\hat{x}_k$, the Predictor (8), the Network, and the Plant $x_{k+1} = A x_k + B\hat{u}_k$, $y_k = C x_k$, connected by the signals $u_k$, $\hat{u}_k$, $x_k$, and $\hat{x}_k$.]
Figure 1: Schematic of a Networked Control System with information multiplexing and packet drops in the network

Motivated by real-world communication constraints in the form of bandwidth limitations, we consider a communication constraint as described below. At any time instant, the network scheduler has three choices:

  1. i)

    transmit the control input from the controller to the plant or,

  2. ii)

    transmit the measured signal from the plant to the predictor or,

  3. iii)

    not communicate at all.

These three choices are encapsulated in a switch variable $a_k$ that takes values in the discrete set $\mathcal{A} \coloneqq \{-1,0,1\}$, where

$a_k \coloneqq \begin{cases} 1 & \text{if the control is transmitted;}\\ -1 & \text{if the observation is transmitted;}\\ 0 & \text{if there is no communication.}\end{cases}$  (5)

We also consider lossy communication in the form of packet drops. Let $\gamma_k$ denote the packet drop event, modeled as an independent Bernoulli random variable with probabilities

$\mathbb{P}\left(\gamma_k = 1\right) = \delta \quad \text{and} \quad \mathbb{P}\left(\gamma_k = 0\right) = 1-\delta$  (6)

where $\gamma_k = 1$ indicates a successful packet transmission and $\gamma_k = 0$ indicates a failure.

Based on the switching and packet drop assumptions, the control information transmitted through the network is written as

$\hat{u}_k \coloneqq \begin{cases} u_k & \text{if } a_k = 1 \text{ and } \gamma_k = 1;\\ \hat{u}_{k-1} & \text{otherwise.}\end{cases}$  (7)

The coarse estimate of the state at the controller end is

$\hat{x}_k = \begin{cases} x_k & \text{if } a_k = -1 \text{ and } \gamma_k = 1;\\ A\hat{x}_{k-1} + B\hat{u}_{k-1} & \text{otherwise.}\end{cases}$  (8)

Define the concatenated state as

$\chi_k \coloneqq \begin{pmatrix} x_k \\ \hat{x}_{k-1} \\ \hat{u}_{k-1} \end{pmatrix}$  (9)

with initial conditions $\hat{x}_{-1} = x_0$ and $\hat{u}_{-1} = \mathbf{0}_{n_u}$. The overall model of the system is written as

$\chi_{k+1} = \begin{cases} \mathsf{A}_1 \chi_k & \text{if } a_k = 1 \text{ and } \gamma_k = 1;\\ \mathsf{A}_{-1}\chi_k & \text{if } a_k = -1 \text{ and } \gamma_k = 1;\\ \mathsf{A}_0 \chi_k & \text{otherwise}\end{cases}$  (10)

with

$\mathsf{A}_1 \coloneqq \begin{pmatrix} A & BKA & BKB\\ 0 & \mathrm{I}_{n_x} & 0\\ 0 & 0 & \mathrm{I}_{n_u}\end{pmatrix},$  (11)

$\mathsf{A}_{-1} \coloneqq \begin{pmatrix} A & 0 & B\\ 0 & \mathrm{I}_{n_x} & 0\\ 0 & 0 & \mathrm{I}_{n_u}\end{pmatrix},$  (12)

and

$\mathsf{A}_0 \coloneqq \begin{pmatrix} A & 0 & B\\ 0 & \mathrm{I}_{n_x} & 0\\ 0 & 0 & \mathrm{I}_{n_u}\end{pmatrix}.$  (13)
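For later reference in the numerical experiments of Section 6, the augmented matrices (11)–(13) can be assembled directly from $(A, B, K)$. The following sketch (the helper name `augmented_modes` is ours) simply transcribes the block structure above using NumPy.

```python
import numpy as np

def augmented_modes(A, B, K):
    """Assemble the augmented matrices A_1, A_{-1}, A_0 of (11)-(13).

    The concatenated state is chi_k = (x_k, xhat_{k-1}, uhat_{k-1}) as in (9);
    A is nx-by-nx, B is nx-by-nu, and K is nu-by-nx.
    """
    nx, nu = B.shape
    # Lower block rows are shared by all three matrices in (11)-(13)
    lower = [[np.zeros((nx, nx)), np.eye(nx),         np.zeros((nx, nu))],
             [np.zeros((nu, nx)), np.zeros((nu, nx)), np.eye(nu)]]

    A1  = np.block([[A, B @ K @ A,          B @ K @ B], *lower])  # eq. (11): control transmitted
    Am1 = np.block([[A, np.zeros((nx, nx)), B],         *lower])  # eq. (12): observation transmitted
    A0  = np.block([[A, np.zeros((nx, nx)), B],         *lower])  # eq. (13): no successful transmission
    return A1, Am1, A0
```

With the data of Section 6 ($n_x = 2$, $n_u = 1$), each of the three matrices is $5\times 5$.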

3 Switching Strategy

In this section, we first present the $\varepsilon$-greedy strategy for switching that ensures MSS. Next, we present the generalized switching probabilities under the $\varepsilon$-greedy algorithm. Lastly, we formulate the system as a Markov Jump Linear System (MJLS), discuss its modes of operation, and state our objective.

3.1 $\varepsilon$-Greedy Algorithm for Switching

We employ the $\varepsilon$-greedy switching strategy for the decision-making process [19]. The algorithm consists of two parts: exploration and exploitation. Exploration refers to searching for new possible solutions, whereas exploitation refers to utilizing the best solution known so far. The variable $\varepsilon$ acts as a parameter that weighs the two. Let the per-stage cost be defined by

$J_k \coloneqq x_k^\intercal Q x_k + \hat{u}_k^\intercal R \hat{u}_k + \lambda a_k^\intercal a_k.$  (14)

The switching strategy is defined mathematically as follows:

$a_k = \begin{cases} \sim\text{unif}(\{-1,1\}) & \text{if } r_k < \varepsilon\\ \operatorname*{arg\,min}_{a_t\in\mathcal{A}} \limsup_{T\to\infty}\frac{1}{T}\mathbb{E}\left[\sum_{t=k}^{T-1}\beta^{t}J_t\right] & \text{otherwise}\end{cases}$  (15)

where $r_k \sim \text{unif}[0,1]$ and $\beta \in\, ]0,1[$ is a discount factor that discounts the cost at future time stages.

When $r_k$ is less than $\varepsilon$, the switching variable $a_k$ is chosen uniformly at random from $\{-1,1\}$. This random selection represents exploration, allowing the system to consider strategies that might not be optimal at that instant but could provide a better solution over a longer horizon. When $r_k \geq \varepsilon$, the switch variable $a_k$ is determined by the exploitation part, i.e., by minimizing the expected cost function, whose terms are:

  1. i)

    $x_k^\intercal Q x_k$ represents a penalty on the state, where $x_k$ is the state at time instant $k$ and $Q \succeq 0$;

  2. ii)

    $\hat{u}_k^\intercal R \hat{u}_k$ represents a penalty on the control effort, where $\hat{u}_k$ is the (networked) control at time instant $k$ and $R \succ 0$;

  3. iii)

    $\lambda a_k^\intercal a_k$ introduces a penalty on the transmission of a packet, with $\lambda$ a weighting parameter that controls the trade-off between the state and control costs and the transmission cost.

Thus, the $\varepsilon$-greedy strategy creates a balance between exploration and exploitation, ensuring that new strategies are explored while the accumulated information is utilized.
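A minimal sketch of the selection rule (15) is given below; the exploitation step is abstracted as a user-supplied estimate of the discounted cost (in Section 5, this role is played by the learned $\mathsf{Q}$-values), and all names are illustrative.

```python
import numpy as np

ACTIONS = (-1, 0, 1)  # the set A in (5)

def epsilon_greedy_action(eps, expected_cost, rng=None):
    """Select a_k according to the epsilon-greedy rule (15).

    eps           : exploration parameter in ]0, 1[
    expected_cost : callable a -> estimate of the discounted cost of taking action a
                    (a placeholder for the argmin term in (15))
    """
    rng = rng or np.random.default_rng()
    if rng.uniform() < eps:                 # r_k < eps: exploration
        return int(rng.choice([-1, 1]))     # uniform over {-1, +1}; 0 is never explored
    return min(ACTIONS, key=expected_cost)  # exploitation: minimize the cost estimate
```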

3.2 Switching Probabilities with the $\varepsilon$-Greedy Algorithm

In this subsection, we discuss in detail the generalized switching probability under the $\varepsilon$-greedy algorithm and analyze the effect of the $\varepsilon$-greedy switching strategy on the MSS of the system. The switching probability distribution $\mathcal{P}_g \in [0,1]^3$ is given by

$\mathcal{P}_g \coloneqq \left(\mathbb{P}(a_k=1),\,\mathbb{P}(a_k=0),\,\mathbb{P}(a_k=-1)\right)$  (16)

for some switching algorithm $g$. Given the switching strategy defined in (15), the switching probabilities are influenced by the choice of $\varepsilon$, which in turn decides the balance between exploration and exploitation. With the switching strategy (15), the switching probability distribution (denoted $\mathcal{P}_\varepsilon$) can be written as

$\mathcal{P}_\varepsilon = \varepsilon\left(\tfrac{1}{2},0,\tfrac{1}{2}\right) + (1-\varepsilon)\left(p_k,\,1-p_k-q_k,\,q_k\right) = \left(\tfrac{\varepsilon}{2}+(1-\varepsilon)p_k,\ (1-\varepsilon)(1-p_k-q_k),\ \tfrac{\varepsilon}{2}+(1-\varepsilon)q_k\right).$  (17)

Here,

  1. i)

    the first term, $\varepsilon\left(\frac{1}{2},0,\frac{1}{2}\right)$, represents the probability distribution in the exploration phase, resulting from the uniform switching between $1$ and $-1$. With probability $\varepsilon$, the switch position is chosen randomly, with equal probability ($\frac{1}{2}$) of switching to $1$ or $-1$ and zero probability of remaining in position $0$; and

  2. ii)

    the second term, $(1-\varepsilon)\left(p,\,1-p-q,\,q\right)$, represents the probability distribution in the exploitation phase. With probability $1-\varepsilon$, the switching follows a policy based on the probabilities $p$ and $q$, where:

    • $p$ is the probability of switching to position $1$,

    • $q$ is the probability of switching to position $-1$,

    • $(1-p-q)$ is the probability of remaining in position $0$.

Here, $p, q \in [0,1]$ and $p+q \leq 1$, so that all probabilities sum to $1$.
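As a quick sanity check, the distribution (17) can be computed for given $(\varepsilon, p, q)$ and verified to sum to one; a minimal sketch:

```python
def switching_distribution(eps, p, q):
    """Return (P(a_k = 1), P(a_k = 0), P(a_k = -1)) as in (17)."""
    assert 0.0 <= p <= 1.0 and 0.0 <= q <= 1.0 and p + q <= 1.0
    dist = (eps / 2 + (1 - eps) * p,        # switch to +1
            (1 - eps) * (1 - p - q),        # remain at 0
            eps / 2 + (1 - eps) * q)        # switch to -1
    assert abs(sum(dist) - 1.0) < 1e-12     # probabilities sum to one
    return dist
```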

Remark 1.

The choice of $\varepsilon$ impacts the MSS of the system, which is one of the main interests of this article.

3.3 Formulation as a Markovian Jump Linear System

In this section, we elaborate on the framework of the Markovian Jump Linear System (MJLS) that corresponds to the system described in (10). An MJLS is a linear system that undergoes random transitions between a finite number of modes, each governed by linear dynamics. The random transitions are governed by Markovian probabilities associated with switching from one mode to another. In our problem, the randomness enters at two levels: the first is the switching under the $\varepsilon$-greedy policy and the second is the random packet drop. With this backdrop, we introduce the modes of operation under the various circumstances.

System Modes: At any instant, the system operates in one of several modes, determined by the switch position and the status of the packet transmission. These modes represent the different scenarios of switching and packet drop. We define the modes of the Markovian switching system as follows:

$\mathcal{M}1: \left(1,\mathcal{S},\mathsf{A}_1\right)$, $\mathcal{M}2: \left(1,\mathcal{F},\mathsf{A}_0\right)$, $\mathcal{M}3: \left(-1,\mathcal{S},\mathsf{A}_{-1}\right)$, $\mathcal{M}4: \left(-1,\mathcal{F},\mathsf{A}_0\right)$, and $\mathcal{M}5: \left(0,-,\mathsf{A}_0\right)$  (18)

where,

  • the first entry in each tuple indicates the position of the switch, which can be $1$, $-1$, or $0$;

  • the second entry denotes whether the packet transmission was successful ($\mathcal{S}$) or failed ($\mathcal{F}$), and

  • the third entry corresponds to the system matrix $\mathsf{A}$ that governs the dynamics in that particular mode.

For example, $\mathcal{M}1$ represents the case where the switch is in position $1$, the packet transmission is successful, and the system dynamics are described by the matrix $\mathsf{A}_1$. On the other hand, $\mathcal{M}2$ represents the case where the switch is in position $1$ but the packet transmission has failed, so the system dynamics revert to $\mathsf{A}_0$.

The mode transition probability, $p_{ij}$, represents the likelihood of the system switching from mode $i$ to mode $j$ in the next time step.

Definition 1 (Mode Transition Probability).

Define the mode transition probability of switching from mode $i$ to mode $j$ as

$\mathbb{P}(a_{k+1} = j \,|\, a_k = i) \coloneqq p_{ij}$

with $i,j \in \{1,2,\ldots,M\}$, where $p_{ij} \in [0,1]$, $\sum_{j=1}^{M} p_{ij} = 1$, and $M$ denotes the total number of modes.

Definition 2 (Mean Square Stability).

The system (10) is mean square stable if and only if, for some $\zeta \geq 1$ and $0 < \xi < 1$ and for every $\chi_0 \in \mathbb{R}^{2n_x + n_u}$,

$\mathbb{E}\left[\chi_k^\intercal \chi_k\right] \leq \zeta \xi^{k} \chi_0^\intercal \chi_0 \quad \text{for all } k \in \mathbb{Z}_{\geq 0}.$  (19)

Objective: Our goal is to determine a value of $\varepsilon$, given a $\delta \in\, ]0,1]$, that ensures that the origin of the system (10) remains Mean Square Stable under the switching algorithm (15).

4 Mean Square Stability of MJLS

In this section, we propose the methodology used to address the problem, which involves identifying corner cases of the switching probabilities. The goal is to determine a value $\bar{\varepsilon}$, for a given fixed $\delta$, which ensures that certain Linear Matrix Inequalities (LMIs) are satisfied for each identified corner case [20]. Note that $\bar{\varepsilon}$ is not a bound but a value that satisfies the LMIs. Furthermore, we demonstrate that any convex combination of these corner cases also satisfies the MSS conditions derived from the LMIs. The convex combination corresponds to the different switching scenarios in the exploitation phase, implying that the system is MSS irrespective of the switching policy implemented during the exploitation phase.

4.1 Corner Cases

We study this through specific corner cases:

  1. C1.

    $p = 0$ and $q = 1$: This case represents the switch being in position $-1$ throughout the entire exploitation phase, indicating that only observations are transmitted.

  2. C2.

    $p = 1$ and $q = 0$: This case represents the switch being in position $1$ throughout the entire exploitation phase, indicating that only control signals are transmitted.

  3. C3.

    $p = 0$ and $q = 0$: This case depicts the switch remaining in position $0$ for the duration of the exploitation phase, signifying that no transmissions occur.

The probability of switching in each of the corner cases is tabulated in TABLE 1.

Case: Switch probabilities $\left(\mathbb{P}(a_k=1),\,\mathbb{P}(a_k=0),\,\mathbb{P}(a_k=-1)\right)$
General: $\left(\frac{\varepsilon}{2}+(1-\varepsilon)p,\ (1-\varepsilon)(1-p-q),\ \frac{\varepsilon}{2}+(1-\varepsilon)q\right)$
C1: $\left(\frac{\varepsilon}{2},\ 0,\ 1-\frac{\varepsilon}{2}\right)$
C2: $\left(1-\frac{\varepsilon}{2},\ 0,\ \frac{\varepsilon}{2}\right)$
C3: $\left(\frac{\varepsilon}{2},\ 1-\varepsilon,\ \frac{\varepsilon}{2}\right)$
Table 1: Switching probability distribution in different corner cases

4.2 Mode Transition Probabilities for Corner Cases

Let $\mathsf{P}^{(c)} \in [0,1]^{M\times M}$ denote the mode transition probability matrix associated with corner case $c$, where $c \in \{1,2,3\}$. Each $\mathsf{P}^{(c)}_{j} \in [0,1]$ denotes the probability of transitioning from any mode to mode $\mathcal{M}j$ in corner case $c$, i.e., $\mathsf{P}^{(c)}_{ij} = \mathsf{P}^{(c)}_{j}$ for all $i,j \in \{1,2,\ldots,5\}$. The mode transition probabilities for each case are described in Table 2 and depicted in Fig. 2. Given the packet transmission success probability $\delta \in\, ]0,1]$, we provide the necessary conditions for the MSS of the system using the LMIs given in [20].

[Figure 2: transition graph of the Markov Jump Linear System over the modes $\mathcal{M}1,\ldots,\mathcal{M}5$, where each mode transitions to mode $\mathcal{M}j$ with probability $\mathsf{P}^{(g)}_j$.]
Figure 2: Markov Jump Linear System with associated mode transition probabilities for a general case.
Case: $\left(\mathsf{P}^{(c)}_1,\ \mathsf{P}^{(c)}_2,\ \mathsf{P}^{(c)}_3,\ \mathsf{P}^{(c)}_4,\ \mathsf{P}^{(c)}_5\right)$
C1: $\left(\delta\frac{\varepsilon}{2},\ (1-\delta)\frac{\varepsilon}{2},\ \delta(1-\frac{\varepsilon}{2}),\ (1-\delta)(1-\frac{\varepsilon}{2}),\ 0\right)$
C2: $\left(\delta(1-\frac{\varepsilon}{2}),\ (1-\delta)(1-\frac{\varepsilon}{2}),\ \delta\frac{\varepsilon}{2},\ (1-\delta)\frac{\varepsilon}{2},\ 0\right)$
C3: $\left(\delta\frac{\varepsilon}{2},\ (1-\delta)\frac{\varepsilon}{2},\ \delta\frac{\varepsilon}{2},\ (1-\delta)\frac{\varepsilon}{2},\ 1-\varepsilon\right)$
General: $\left(\delta\left(\frac{\varepsilon}{2}+(1-\varepsilon)p\right),\ (1-\delta)\left(\frac{\varepsilon}{2}+(1-\varepsilon)p\right),\ \delta\left(\frac{\varepsilon}{2}+(1-\varepsilon)q\right),\ (1-\delta)\left(\frac{\varepsilon}{2}+(1-\varepsilon)q\right),\ (1-\varepsilon)(1-p-q)\right)$
Table 2: Mode transition probabilities for the different cases
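The rows of Table 2 follow by combining the switching distribution (17) with the success probability $\delta$ from (6); a short sketch that generates them (function and variable names are ours):

```python
def mode_probabilities(delta, eps, p, q):
    """Probabilities of landing in modes M1..M5 (a row of Table 2)."""
    P_plus = eps / 2 + (1 - eps) * p     # P(a_k = 1), cf. (17)
    P_minus = eps / 2 + (1 - eps) * q    # P(a_k = -1)
    P_zero = (1 - eps) * (1 - p - q)     # P(a_k = 0)
    return [delta * P_plus,  (1 - delta) * P_plus,    # M1: success, M2: drop (a_k = 1)
            delta * P_minus, (1 - delta) * P_minus,   # M3: success, M4: drop (a_k = -1)
            P_zero]                                   # M5: no communication attempted

# Corner cases C1-C3 of Section 4.1, as (p, q) pairs
corner_cases = {1: (0.0, 1.0), 2: (1.0, 0.0), 3: (0.0, 0.0)}
```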
Theorem 1.

Given a $\delta \in\, ]0,1]$, with mode transition probability matrices $\mathsf{P}^{(c)}$, the origin of the system (10) is MSS if there exist an $\bar{\varepsilon} \in\, ]0,1[$ and a symmetric positive definite matrix $V$ such that

$\sum_{j=1}^{5}\mathsf{P}^{(c)}_{j}\,\mathsf{A}_{j}^\intercal V \mathsf{A}_{j} < V$  (20)

holds for all $c \in \{1,2,3\}$, where $\mathsf{A}_j$ denotes the dynamics matrix of mode $\mathcal{M}j$ in (18). Furthermore, with $\varepsilon = \bar{\varepsilon}$, the origin of the system (10) is MSS under the $\varepsilon$-greedy algorithm (15).

Proof.

First, the MSS conditions given in [20, Corollary 3.26] are tailored to the specific problem to obtain the LMI (20). Suppose there exist an $\bar{\varepsilon}$ and a $V$ such that the LMI (20) holds; then the origin of the system is stable for these corner cases. To prove that the origin of the system (10) is MSS under (15) with $\varepsilon = \bar{\varepsilon}$, we show that the general case can be written as a convex combination of the three corner cases and then prove that the LMI holds for the general case. Let $\alpha_1, \alpha_2 \in [0,1]$ with $\alpha_1 + \alpha_2 \leq 1$. Taking the convex combination of the mode transition probabilities of the corner cases (see Table 2),

$\alpha_1\left(\delta\tfrac{\varepsilon}{2},\ (1-\delta)\tfrac{\varepsilon}{2},\ \delta(1-\tfrac{\varepsilon}{2}),\ (1-\delta)(1-\tfrac{\varepsilon}{2}),\ 0\right) + \alpha_2\left(\delta(1-\tfrac{\varepsilon}{2}),\ (1-\delta)(1-\tfrac{\varepsilon}{2}),\ \delta\tfrac{\varepsilon}{2},\ (1-\delta)\tfrac{\varepsilon}{2},\ 0\right) + (1-\alpha_1-\alpha_2)\left(\delta\tfrac{\varepsilon}{2},\ (1-\delta)\tfrac{\varepsilon}{2},\ \delta\tfrac{\varepsilon}{2},\ (1-\delta)\tfrac{\varepsilon}{2},\ 1-\varepsilon\right) = \left(\delta\left(\tfrac{\varepsilon}{2}+(1-\varepsilon)\alpha_2\right),\ (1-\delta)\left(\tfrac{\varepsilon}{2}+(1-\varepsilon)\alpha_2\right),\ \delta\left(\tfrac{\varepsilon}{2}+(1-\varepsilon)\alpha_1\right),\ (1-\delta)\left(\tfrac{\varepsilon}{2}+(1-\varepsilon)\alpha_1\right),\ (1-\varepsilon)(1-\alpha_1-\alpha_2)\right).$  (21)

Comparing (21) with the mode transition probabilities of the general case in Table 2, we have $\alpha_1 = q$ and $\alpha_2 = p$. To prove that (10) is MSS for any $p, q \in [0,1]$ with $p + q \leq 1$, let $\mathsf{P}^{(g)}$ be the general mode transition probability matrix. Then

$\mathsf{P}^{(g)}_{j} = q\,\mathsf{P}^{(1)}_{j} + p\,\mathsf{P}^{(2)}_{j} + (1-q-p)\,\mathsf{P}^{(3)}_{j}$

for all $j \in \{1,2,\ldots,5\}$, and hence

$\sum_{j=1}^{5}\mathsf{P}^{(g)}_{j}\mathsf{A}_{j}^\intercal V\mathsf{A}_{j} = \sum_{j=1}^{5}\left(q\mathsf{P}^{(1)}_{j} + p\mathsf{P}^{(2)}_{j} + (1-q-p)\mathsf{P}^{(3)}_{j}\right)\mathsf{A}_{j}^\intercal V\mathsf{A}_{j} = q\sum_{j=1}^{5}\mathsf{P}^{(1)}_{j}\mathsf{A}_{j}^\intercal V\mathsf{A}_{j} + p\sum_{j=1}^{5}\mathsf{P}^{(2)}_{j}\mathsf{A}_{j}^\intercal V\mathsf{A}_{j} + (1-q-p)\sum_{j=1}^{5}\mathsf{P}^{(3)}_{j}\mathsf{A}_{j}^\intercal V\mathsf{A}_{j} \stackrel{\text{(a)}}{<} qV + pV + (1-q-p)V = V.$

The inequality (a) follows from (20) and the fact that the weights $q$, $p$, and $1-q-p$ are non-negative and sum to one. Hence, if $\bar{\varepsilon}$ satisfies the LMI (20) for the three corner cases, then (20) is satisfied for any general switching strategy (15). ∎

To determine a value of $\varepsilon$ that satisfies the LMIs required for MSS, we use Semi-Definite Programming (SDP) solvers together with the bisection method. Based on Theorem 1, we set up the necessary LMIs involving the symmetric positive definite matrix $V$; these LMIs establish the conditions for MSS that the system must satisfy. We employ an SDP solver to solve the formulated LMIs numerically. Since the feasibility of the LMIs (and the matrix $V$) depends on $\varepsilon$, we apply the bisection method to determine a value $\bar{\varepsilon}$ that satisfies all the LMIs. Starting from an initial range $\varepsilon \in [0,1]$, the bisection method iteratively narrows down to $\bar{\varepsilon}$ by checking the feasibility of the LMIs at midpoints within the range.
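A sketch of this procedure, reusing the helpers `augmented_modes`, `mode_probabilities`, and `corner_cases` from the earlier snippets, is given below. We use CVXPY as the LMI front-end; the solver choice (SCS), the strict-inequality margin, and the assumption that feasibility is monotone in $\varepsilon$ (implicit in the bisection) are implementation choices made for illustration, not part of Theorem 1.

```python
import numpy as np
import cvxpy as cp

def mss_lmis_feasible(modes, delta, eps, margin=1e-6):
    """Check whether a common V > 0 satisfies the LMI (20) for all three corner cases."""
    A1, Am1, A0 = modes
    mode_mats = [A1, A0, Am1, A0, A0]          # dynamics of modes M1..M5, cf. (18)
    n = A1.shape[0]
    V = cp.Variable((n, n), symmetric=True)
    constraints = [V >> margin * np.eye(n)]
    for p, q in corner_cases.values():
        P = mode_probabilities(delta, eps, p, q)
        lhs = sum(Pj * Aj.T @ V @ Aj for Pj, Aj in zip(P, mode_mats))
        constraints.append(V - lhs >> margin * np.eye(n))   # strict version of (20)
    problem = cp.Problem(cp.Minimize(0), constraints)
    problem.solve(solver=cp.SCS)
    return problem.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE)

def find_eps_bar(modes, delta, tol=1e-3):
    """Bisection on eps in [0, 1] for a value for which the corner-case LMIs hold."""
    lo, hi = 0.0, 1.0
    if not mss_lmis_feasible(modes, delta, hi):
        return None                             # even eps = 1 does not certify MSS
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if mss_lmis_feasible(modes, delta, mid) else (mid, hi)
    return hi
```

For instance, `find_eps_bar(augmented_modes(A, B, K), delta=0.8)` produces the kind of $(\delta, \bar{\varepsilon})$ pairs plotted in Fig. 3.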

5 Optimal Switching Strategy

We intend to find an optimal switching policy that satisfies the necessary conditions for MSS established earlier and minimizes the average cost involving penalties on state, control, and communication. That is, to find

$\min_{\{a_k\}}\ \limsup_{T\to\infty}\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}J_k\right],$  (22)

which can be written as a discounted cost problem

$\min_{\{a_k\}}\ \limsup_{T\to\infty}\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\beta^{k}J_k\right]$  (23)

using [21, Lemma 5.3.1]. This transformation allows the application of reinforcement learning techniques for solutions.

5.1 $\mathsf{Q}$-Learning

The sequential decision-making problem is treated as a Markov Decision Process (MDP). The MDP has three components: i) state, ii) action, and iii) reward. The state space consists of $x_k \in \mathbb{R}^{n_x}$ and the action space is $a_k \in \{-1,0,1\}$. Define the per-stage reward as

$r_k \coloneqq -\left(x_k^\intercal Q x_k + \hat{u}_k^\intercal R \hat{u}_k + \lambda a_k^\intercal a_k\right).$  (24)

The system transitions from state $x_k$ to $x_{k+1}$ based on the action $a_k$ and receives the reward $r_{k+1}$. To evaluate the performance of a given policy, we use the action-value function, denoted $\mathsf{Q}_\pi(x,a)$. For a policy $\pi$, which maps states to probabilities of selecting actions, the action-value function is defined as

$\mathsf{Q}_\pi(x,a) = \mathbb{E}\left[\sum_{k=0}^{T-1}\beta^{k}r_{k+1}\,\Big|\,x,a\right].$  (25)

The optimal $\mathsf{Q}$-value, denoted $\mathsf{Q}_*(x_k,a_k)$, satisfies Bellman's principle of optimality:

$\mathsf{Q}_*(x_k,a_k) = \mathbb{E}\left[r_k + \beta\,\mathsf{Q}_*(x_{k+1},a_{k+1})\right].$  (26)

In the case of discrete state and action sets, the optimal policy can be found iteratively by the value iteration method [19]. In our problem, the state space is continuous, so this method cannot be applied directly. Hence, we use a deep reinforcement learning technique based on a neural network.

5.2 Deep $\mathsf{Q}$-Learning

We approach the solution of problem (23) using deep reinforcement learning [22]. A Deep Q Network (DQN) estimates the $\mathsf{Q}$-values for each state and action, thereby approximating the function $\mathsf{Q}(x,a;\theta)$, parameterized by $\theta$. The state is the input to the DQN, and the output corresponding to each action is the approximated $\mathsf{Q}$-value. The loss is derived from Bellman's principle as

$\text{loss} \coloneqq \left[\mathsf{Q}(x_k,a_k;\theta) - \left(r_k + \beta\,\mathsf{Q}(x_{k+1},a_{k+1};\theta)\right)\right]^{2}.$  (27)

Here, two forward passes are made through the network to obtain the $\mathsf{Q}$-values of the current and next states. To avoid the resulting instability, Mnih et al. introduced the replay memory framework and a separate target network that replicates the policy network but is updated only after a number of iterations [22]. Algorithm 1 presents the deep reinforcement learning technique used.

Algorithm 1 Deep reinforcement learning
1:  Initialize the replay memory $D$ to capacity $K$
2:  Initialize the policy network $Q$ with random weights $\theta_0$
3:  Copy the policy network as the target network $\hat{Q}$ with weights $\theta^- = \theta_0$
4:  Initialize $x_0$
5:  for $k = 0, 1, \ldots, N$ do
6:     Select an action $a_k$ via the switching algorithm (15), with $\varepsilon$ chosen according to Theorem 1
7:     Store the experience $e_k \coloneqq (a_k, x_k, r_k, x_{k+1})$ in the replay memory
8:     Sample a random mini-batch from the replay memory
9:     Calculate the loss between the output $\mathsf{Q}$-values and the target $\mathsf{Q}$-values as in (27)
10:     Update the policy network using backpropagation
11:     Update the target network after a predefined number of steps
12:  end for
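One possible realization of the network and of the update in lines 9–10 of Algorithm 1 is sketched below in PyTorch. The hidden-layer widths follow the specification in Section 6; the class and function names are ours, and the target uses the usual DQN maximization over next actions as a standard surrogate for the second term of (27).

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Maps a state x in R^{n_x} to one Q-value per action in A = {-1, 0, 1}."""
    def __init__(self, n_x, n_actions=3, hidden=(1024, 256)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_x, hidden[0]), nn.ReLU(),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
            nn.Linear(hidden[1], n_actions))

    def forward(self, x):
        return self.net(x)

def dqn_update(policy_net, target_net, optimizer, batch, beta):
    """One gradient step on the loss (27) over a sampled mini-batch.

    batch is a list of experiences e_k = (a_idx, x_k, r_k, x_{k+1}), with the
    action encoded as an index a_idx in {0, 1, 2} and x_k given as a 1-D array;
    beta is the discount factor from (23).
    """
    a, x, r, x_next = map(torch.as_tensor, zip(*batch))
    x, x_next, r = x.float(), x_next.float(), r.float()
    q = policy_net(x).gather(1, a.long().view(-1, 1)).squeeze(1)    # Q(x_k, a_k; theta)
    with torch.no_grad():                                           # frozen weights theta^-
        target = r + beta * target_net(x_next).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the loop of Algorithm 1, experiences are pushed into a bounded buffer (e.g. `collections.deque(maxlen=1000)`), mini-batches are sampled uniformly from it, and `target_net.load_state_dict(policy_net.state_dict())` is called every fixed number of steps.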

6 Numerical Experiments

In this section, we validate our approach on system (10) with the following data:

$A = \begin{pmatrix}1 & 0.1\\ 0 & 1\end{pmatrix}, \quad B = \begin{pmatrix}0\\ 1\end{pmatrix}, \quad\text{and}\quad K = \begin{pmatrix}-0.012\\ -0.07\end{pmatrix}^{\intercal}$

where the controller gain $K$ is stabilizing as per Assumption 2.

First, we fix the packet transmission success probability $\delta \in\, ]0,1]$, which is typically determined by the communication system. For a given $\delta$, we determine the corresponding value of $\varepsilon$ (denoted $\bar{\varepsilon}$) that ensures MSS for all corner cases. The relationship between the packet transmission success probability $\delta$ and the $\bar{\varepsilon}$ required for ensuring MSS is depicted in Fig. 3.

Figure 3: $\bar{\varepsilon}$ that satisfies the MSS conditions for all three corner cases, for different values of $\delta$

Next, for a fixed packet transmission success probability $\delta$, we determine the corresponding value of $\varepsilon$ (namely $\bar{\varepsilon}$) that ensures the MSS of the system. We then illustrate the associated corner cases and the convex region formed by these corner cases. We implement the Deep Q-Network according to Algorithm 1 with the following specifications. The initial states are sampled uniformly from $[-10,10]^2$. A replay memory of size 1000 is used to store experiences. The policy network architecture consists of two hidden layers, with 1024 and 256 neurons, respectively. Since there are three possible actions, the output layer consists of three neurons. Mini-batches of size 32 are used for training. We fix $\delta = 0.8$, and the exploration parameter is set to $\bar{\varepsilon} = 0.2$. The learning rate is fixed at 0.001. The network is trained for 800 episodes. The moving average of the reward in the training phase shows that the DQN learns over several episodes, after which the learning settles (see Fig. 4).
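For completeness, the training environment implied by (3), (7), (8), and the reward (24) can be simulated directly. The sketch below uses the parameters above ($\delta = 0.8$, $\lambda = 0.5$, initial states uniform on $[-10,10]^2$); the class name is ours, and the weights $Q$ and $R$, which are not specified in this section, must be supplied by the user.

```python
import numpy as np

class MultiplexedNCS:
    """Simulates the networked loop (3), (7), (8) with Bernoulli packet successes."""

    def __init__(self, A, B, K, Q, R, lam=0.5, delta=0.8, seed=0):
        self.A, self.B, self.K, self.Q, self.R = A, B, K, Q, R
        self.lam, self.delta = lam, delta
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.x = self.rng.uniform(-10, 10, size=self.A.shape[0])  # x_0 ~ unif([-10,10]^2)
        self.xhat = self.x.copy()                                  # xhat_{-1} = x_0
        self.uhat = np.zeros(self.B.shape[1])                      # uhat_{-1} = 0
        return self.x.copy()

    def step(self, a):
        gamma = self.rng.random() < self.delta                     # packet success, eq. (6)
        # Predictor update (8): use x_k if the observation got through, else predict
        if a == -1 and gamma:
            self.xhat = self.x.copy()
        else:
            self.xhat = self.A @ self.xhat + self.B @ self.uhat
        # Control update (7): a newly delivered input replaces the held one
        if a == 1 and gamma:
            self.uhat = self.K @ self.xhat
        reward = -(self.x @ self.Q @ self.x
                   + self.uhat @ self.R @ self.uhat
                   + self.lam * a * a)                             # reward (24)
        self.x = self.A @ self.x + self.B @ self.uhat              # plant update (3)
        return self.x.copy(), reward
```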

Figure 4: Comparison of the average reward ($\lambda = 0.5$) of the DQN method with the round-robin and random switching schemes

To assess the efficacy of our approach, we compare our results with two other methods: round-robin and random selection. Each method is tested under identical initial conditions and parameters. The DQN method performs better than the other two: the round-robin method yields an average reward of $-18.51$, while the random method results in an average reward of $-25.6$. These comparisons underscore the effectiveness of the DQN after about 250 episodes of training. Finally, we analyze the MSS property under the proposed $\varepsilon$-greedy algorithm for different packet transmission success probabilities and the corresponding $\varepsilon$ found using Theorem 1. Figure 5 represents the case of a low transmission success probability, which requires a high value of $\varepsilon$ for MSS. Conversely, with a low transmission success probability and a low value of $\varepsilon$, we observe in Figure 6 that the system is not MSS. As depicted in Figure 7, with a high transmission success probability, a low value of $\varepsilon$ is sufficient for achieving MSS.

Figure 5: Low success probability, with high $\varepsilon$
Figure 6: Low success probability, with low $\varepsilon$
Figure 7: High success probability, with low $\varepsilon$

7 Conclusion

In this letter, we have proposed a modified $\varepsilon$-greedy algorithm for selecting the communication direction in a multiplexed NCS. We have established the necessary conditions for the Mean Square Stability (MSS) of an NCS with multiplexed communication and packet drops. We have also provided an optimal switching policy using Deep Q-Learning that employs the proposed $\varepsilon$-greedy algorithm and ensures MSS during the training of the neural network.

References

  • [1] A. Bemporad, M. Heemels, M. Johansson, et al., Networked Control Systems, vol. 406. Springer, 2010.
  • [2] X. Ge, F. Yang, and Q.-L. Han, “Distributed networked control systems: A brief overview,” Information Sciences, vol. 380, pp. 117–131, 2017.
  • [3] N. Elia, “When Bode meets Shannon: Control-oriented feedback communication schemes,” IEEE Transactions on Automatic Control, vol. 49, no. 9, pp. 1477–1488, 2004.
  • [4] W. M. H. Heemels, A. R. Teel, N. Van de Wouw, and D. Nešić, “Networked control systems with communication constraints: Tradeoffs between transmission intervals, delays and performance,” IEEE Transactions on Automatic Control, vol. 55, no. 8, pp. 1781–1796, 2010.
  • [5] V. Dolk and M. Heemels, “Event-triggered control systems under packet losses,” Automatica, vol. 80, pp. 143–155, 2017.
  • [6] H. Sandberg, V. Gupta, and K. H. Johansson, “Secure networked control systems,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 5, pp. 445–464, 2022.
  • [7] M. Athans, R. Ku, and S. Gershwin, “The uncertainty threshold principle: Some fundamental limitations of optimal decision making under dynamic uncertainty,” IEEE Transactions on Automatic Control, vol. 22, no. 3, pp. 491–495, 1977.
  • [8] O. C. Imer, S. Yuksel, and T. Başar, “Optimal control of dynamical systems over unreliable communication links,” IFAC Proceedings Volumes, vol. 37, no. 13, pp. 991–996, 2004.
  • [9] L. Schenato, B. Sinopoli, M. Franceschetti, K. Poolla, and S. S. Sastry, “Foundations of control and estimation over lossy networks,” Proceedings of the IEEE, vol. 95, no. 1, pp. 163–187, 2007.
  • [10] X. Liu and A. Goldsmith, “Kalman filtering with partial observation losses,” in Proceedings of the 43rd IEEE Conference on Decision and Control (CDC), vol. 4, pp. 4180–4186, IEEE, 2004.
  • [11] B. Sinopoli, L. Schenato, M. Franceschetti, K. Poolla, M. I. Jordan, and S. S. Sastry, “Kalman filtering with intermittent observations,” IEEE Transactions on Automatic Control, vol. 49, no. 9, pp. 1453–1464, 2004.
  • [12] K. You, M. Fu, and L. Xie, “Mean square stability for Kalman filtering with Markovian packet losses,” Automatica, vol. 47, no. 12, pp. 2647–2657, 2011.
  • [13] A. R. Mesquita, J. P. Hespanha, and G. N. Nair, “Redundant data transmission in control/estimation over lossy networks,” Automatica, vol. 48, no. 8, pp. 1612–1620, 2012.
  • [14] V. S. Varma, R. Postoyan, D. E. Quevedo, and I.-C. Morărescu, “Event-triggered transmission policies for nonlinear control systems over erasure channels,” IEEE Control Systems Letters, vol. 7, pp. 2113–2118, 2023.
  • [15] D. Maity, M. H. Mamduhi, S. Hirche, and K. H. Johansson, “Optimal LQG control of networked systems under traffic-correlated delay and dropout,” IEEE Control Systems Letters, vol. 6, pp. 1280–1285, 2021.
  • [16] L. Zhang, H. Gao, and O. Kaynak, “Network-induced constraints in networked control systems—a survey,” IEEE Transactions on Industrial Informatics, vol. 9, no. 1, pp. 403–416, 2012.
  • [17] A. Molin and S. Hirche, “On LQG joint optimal scheduling and control under communication constraints,” in Proceedings of the 48th IEEE Conference on Decision and Control (CDC) held jointly with the 2009 28th Chinese Control Conference, pp. 5832–5838, IEEE, 2009.
  • [18] A. S. Leong, A. Ramaswamy, D. E. Quevedo, H. Karl, and L. Shi, “Deep reinforcement learning for wireless sensor scheduling in cyber–physical systems,” Automatica, vol. 113, p. 108759, 2020.
  • [19] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
  • [20] O. L. V. Costa, M. D. Fragoso, and R. P. Marques, Discrete-Time Markov Jump Linear Systems. Springer Science & Business Media, 2005.
  • [21] O. Hernández-Lerma and J. B. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria, vol. 30. Springer Science & Business Media, 2012.
  • [22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.