
Structure-Enhanced Deep Reinforcement Learning for Optimal Transmission Scheduling
*Wanchun Liu is the corresponding author.

Jiazheng Chen1, Wanchun Liu*1, Daniel E. Quevedo2, Yonghui Li1, and Branka Vucetic1 1School of Electrical and Information Engineering, The University of Sydney, Australia
2School of Electrical Engineering and Robotics, Queensland University of Technology, Brisbane, Australia.
1Emails: {jiazheng.chen, wanchun.liu, yonghui.li, branka.vucetic}@sydney.edu.au. 2Email: [email protected].
Abstract

Remote state estimation of large-scale distributed dynamic processes plays an important role in Industry 4.0 applications. In this paper, by leveraging theoretical results on the structural properties of optimal scheduling policies, we develop a structure-enhanced deep reinforcement learning (DRL) framework for optimal scheduling of a multi-sensor remote estimation system to achieve the minimum overall estimation mean-square error (MSE). In particular, we propose a structure-enhanced action selection method, which tends to select actions that obey the policy structure. This explores the action space more effectively and enhances the learning efficiency of DRL agents. Furthermore, we introduce a structure-enhanced loss function that adds a penalty to actions that do not follow the policy structure. The new loss function guides the DRL algorithm to converge quickly to the optimal policy structure. Our numerical results show that the proposed structure-enhanced DRL algorithms can save the training time by 50% and reduce the remote estimation MSE by 10% to 25%, when compared to benchmark DRL algorithms.

Index Terms:
Remote state estimation, deep reinforcement learning, sensor scheduling, threshold structure.

I Introduction

Wireless networked control systems (WNCSs) consisting of distributed sensors, actuators, controllers, and plants are a key component for Industry 4.0 and have been widely applied in many areas such as industrial automation, vehicle monitoring systems, building automation and smart grids [1]. In particular, providing high-quality real-time remote estimation of dynamic system states plays an important role in ensuring control performance and stability of WNCSs [2]. For large-scale WNCSs, transmission scheduling of wireless sensors over limited bandwidth needs to be properly designed to guarantee remote estimation performance.

There are many existing works on the transmission scheduling of WNCSs. In [3], the optimal scheduling problem of a multi-loop WNCS with limited communication resources was investigated for minimizing the transmission power consumption under a WNCS stability constraint. In [4, 5], optimal sensor scheduling problems of remote estimation systems were investigated for achieving the best overall estimation mean-square error (MSE). In particular, dynamic decision-making problems were formulated as Markov decision processes (MDPs), which can be solved by classical methods, such as policy and value iterations. However, the conventional model-based solutions are not feasible in large-scale scheduling problems because of the curse of dimensionality caused by high-dimensional state and action spaces. (In [3, 4, 5], only the optimal scheduling of two-sensor systems has been solved effectively by the proposed methods.)

In recent years, deep reinforcement learning (DRL) has been developed to deal with large MDPs by using deep neural networks as function approximators [6, 7]. Some works [8, 9] have used the deep Q-network (DQN), one of the simplest DRL methods, to solve multi-sensor-multi-channel scheduling problems in different remote estimation scenarios. In particular, sensor scheduling problems for systems with 6 sensors have been solved effectively, providing significant performance gains over heuristic methods in terms of estimation quality. More recent work [10] has introduced DRL algorithms with an actor-critic structure to solve the scheduling problem at a much larger scale (that cannot be handled by the DQN). However, existing works merely apply general DRL frameworks to specific scheduling problems, without questioning what features make sensor scheduling problems different from other MDPs. Also, we note that a drawback of general DRL is that it often cannot perform policy exploration effectively for specific tasks [11], which can lead to local minima or even total failure. Thus, the existing DRL-based solutions could be far from optimal.

Some recent works have shown that the optimal scheduling policies of remote estimation systems commonly have a “threshold structure” [12, 5], which means that there exist switching boundaries dividing the state space into multiple regions for different scheduling actions. In other words, the optimal policy has a structure where the action only changes at the switching boundaries of the state space. In [13], the threshold structure of an optimal scheduling policy has also been identified in a sensor scheduling system for minimizing the average age of information (AoI). Although these theoretical works have derived the structures of optimal policies, there is no existing work in the open literature utilizing the structural properties to guide DRL algorithms. This knowledge gap motivates us to investigate how knowledge of the optimal policy structure can be used to redesign DRL algorithms that find the optimal policy more effectively.

Contributions. In this paper, we develop novel structure-enhanced DRL algorithms for solving the optimal scheduling problem of a remote estimation system. Given the most commonly adopted DRL frameworks for scheduling, i.e., DQN and deep deterministic policy gradient (DDPG), we propose structure-enhanced DQN and DDPG algorithms. In particular, we design a structure-enhanced action selection method, which tends to select actions that obey the threshold structure. Such an action selection method can explore the action space more effectively and enhance the learning efficiency of DRL agents. Furthermore, we introduce a structure-enhanced loss function that adds a penalty to actions that do not follow the threshold structure. The new loss function guides the DRL algorithm to converge quickly to the optimal policy structure. Our numerical results show that the proposed structure-enhanced DRL algorithms can save the training time by 50%, while reducing the remote estimation MSE by 10% to 25%, when compared to benchmark DRL algorithms.

II System Model

We consider a remote estimation system with $N$ dynamic processes, each measured by a sensor that pre-processes the raw measurements and sends its state estimate to a remote estimator through one of $M$ wireless channels, as illustrated in Fig. 1.

Figure 1: Remote state estimation system with $N$ processes and $M$ channels.

II-A Process Model and Local State Estimation

Each dynamic process $n$ is modeled as a discrete linear time-invariant (LTI) system [8, 14, 15]:

$\mathbf{x}_{n,k+1} = \mathbf{A}_{n}\mathbf{x}_{n,k} + \mathbf{w}_{n,k}, \qquad \mathbf{y}_{n,k} = \mathbf{C}_{n}\mathbf{x}_{n,k} + \mathbf{v}_{n,k},$ (1)

where $\mathbf{x}_{n,k}\in\mathbb{R}^{l_{n}}$ is process $n$'s state at time $k$, $\mathbf{y}_{n,k}\in\mathbb{R}^{e_{n}}$ is sensor $n$'s measurement of that state, $\mathbf{A}_{n}\in\mathbb{R}^{l_{n}\times l_{n}}$ and $\mathbf{C}_{n}\in\mathbb{R}^{e_{n}\times l_{n}}$ are the system matrix and the measurement matrix, respectively, and $\mathbf{w}_{n,k}\in\mathbb{R}^{l_{n}}$ and $\mathbf{v}_{n,k}\in\mathbb{R}^{e_{n}}$ are the process disturbance and the measurement noise, modeled as independent and identically distributed (i.i.d.) zero-mean Gaussian random vectors $\mathcal{N}(\mathbf{0},\mathbf{W}_{n})$ and $\mathcal{N}(\mathbf{0},\mathbf{V}_{n})$, respectively. We assume that the spectral radius of each $\mathbf{A}_{n}$ is greater than one, which means that the dynamic processes are unstable, making the remote estimation problem more interesting.

Due to the imperfect state measurement in (1), sensor $n$ runs a classic Kalman filter to pre-process the raw measurements and generate the local state estimate $\mathbf{x}^{s}_{n,k}$ at each time $k$ [14]. Once scheduled, sensor $n$ sends $\mathbf{x}^{s}_{n,k}$ (not $\mathbf{y}_{n,k}$) to the remote estimator as a packet. The local state estimation error covariance matrix is defined as

$\mathbf{P}_{n,k}^{s}\triangleq\mathbb{E}\left[(\mathbf{x}_{n,k}^{s}-\mathbf{x}_{n,k})(\mathbf{x}_{n,k}^{s}-\mathbf{x}_{n,k})^{\top}\right].$ (2)

We note that local Kalman filters are commonly assumed to operate in the steady-state mode in the literature (see [14] and references therein; the $n$th local Kalman filter is stable if and only if $(\mathbf{A}_{n},\mathbf{C}_{n})$ is observable and $(\mathbf{A}_{n},\sqrt{\mathbf{W}_{n}})$ is controllable), which means that the error covariance matrix converges to a constant, i.e., $\mathbf{P}_{n,k}^{s}=\bar{\mathbf{P}}_{n},\forall k,n$.
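
As a concrete illustration, the sketch below (our own code, not from the paper; it assumes the steady-state posterior covariance is what is meant by $\bar{\mathbf{P}}_{n}$) computes the constant covariance by solving the discrete algebraic Riccati equation with SciPy.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def steady_state_covariance(A, C, W, V):
    """Steady-state Kalman filter error covariances for x_{k+1} = A x_k + w, y_k = C x_k + v.

    Returns the prior (prediction) covariance S and the posterior covariance P_bar,
    i.e., the constant value that P^s_{n,k} converges to in the steady state."""
    # Prior covariance: solution of the filtering DARE (dual of the control DARE).
    S = solve_discrete_are(A.T, C.T, W, V)
    # Posterior (measurement-updated) covariance via the steady-state Kalman gain.
    K = S @ C.T @ np.linalg.inv(C @ S @ C.T + V)
    P_bar = (np.eye(A.shape[0]) - K @ C) @ S
    return S, P_bar

# Example with the paper's dimensions l_n = 2, e_n = 1 (matrix values are illustrative only).
A = np.array([[1.1, 0.2], [0.0, 1.05]])
C = np.array([[1.0, 0.0]])
W, V = np.eye(2), np.eye(1)
_, P_bar = steady_state_covariance(A, C, W, V)
print(np.trace(P_bar))
```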

II-B Wireless Communications and Remote State Estimation

There are $M$ wireless channels (e.g., subcarriers) for the $N$ sensors' transmissions, where $M\leq N$. We consider i.i.d. block fading channels, where the channel quality is fixed during each packet transmission and varies from packet to packet independently. Let the $N\times M$ matrix $\mathbf{H}_{k}$ denote the channel state of the system at time $k$, where the element in the $n$th row and $m$th column, $h_{n,m,k}\in\mathcal{H}\triangleq\{1,2,\dots,\bar{h}\}$, represents the channel state between sensor $n$ and the remote estimator on channel $m$. In particular, there are $\bar{h}$ quantized channel states in total. The packet drop probability at channel state $i\in\mathcal{H}$ is $p_{i}$. Without loss of generality, we assume that $p_{1}\geq p_{2}\geq\dots\geq p_{\bar{h}}$. The instantaneous channel state $\mathbf{H}_{k}$ is available at the remote estimator based on standard channel estimation methods. The distribution of $h_{n,m,k}$ is given as

$\operatorname{Pr}(h_{n,m,k}=i)=q^{(n,m)}_{i},\ \forall k,$ (3)

where $\sum_{i=1}^{\bar{h}}q^{(n,m)}_{i}=1,\ \forall n,m$.

Due to the limited number of communication channels, only $M$ out of the $N$ sensors can be scheduled at each time step. Let $a_{n,k}\in\{0,1,2,\dots,M\}$ represent the channel allocation for sensor $n$ at time $k$, where

$a_{n,k}=\begin{cases}0, & \text{if sensor $n$ is not scheduled}\\ m, & \text{if sensor $n$ is scheduled on channel $m$.}\end{cases}$ (4)

In particular, we assume that each sensor can be scheduled to at most one channel and that each channel is assigned to exactly one sensor [10]. Then, the constraints on $a_{n,k}$ are given as

$\sum_{m=1}^{M}\mathbbm{1}\left(a_{n,k}=m\right)\leq 1,\quad \sum_{n=1}^{N}\mathbbm{1}\left(a_{n,k}=m\right)=1,$ (5)

where $\mathbbm{1}(\cdot)$ is the indicator function.
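
For clarity, a small sketch (illustrative only; the helper name is ours) of how constraint (5) can be checked for a candidate allocation vector:

```python
def satisfies_constraint_5(a, M):
    """Check the scheduling constraint (5): each sensor uses at most one channel
    (automatic, since a[n] is a single index) and each channel m in {1,...,M}
    is assigned to exactly one sensor."""
    return all(sum(1 for a_n in a if a_n == m) == 1 for m in range(1, M + 1))

# Example: N = 4 sensors, M = 2 channels; sensors 0 and 2 use channels 1 and 2.
print(satisfies_constraint_5([1, 0, 2, 0], M=2))  # True
print(satisfies_constraint_5([1, 1, 0, 0], M=2))  # False: channel 1 assigned twice
```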

Considering the scheduling actions and packet dropouts, sensor $n$'s estimate may not be received by the remote estimator in every time slot. We define the packet reception indicator as

$\eta_{n,k}=\begin{cases}1, & \text{if sensor $n$'s packet is received at time $k$}\\ 0, & \text{otherwise.}\end{cases}$

Considering the randomness of $\eta_{n,k}$ and assuming that the remote estimator performs state estimation at the beginning of each time slot, the optimal estimator in terms of estimation MSE is given as

$\hat{\mathbf{x}}_{n,k+1}=\begin{cases}\mathbf{A}_{n}\mathbf{x}_{n,k}^{s}, & \text{if $\eta_{n,k}=1$}\\ \mathbf{A}_{n}\hat{\mathbf{x}}_{n,k}, & \text{otherwise,}\end{cases}$ (8)

where $\mathbf{A}_{n}$ is the system matrix of process $n$ defined in (1). If sensor $n$'s packet is not received, the remote estimator propagates its previous estimate to estimate the current state. From (1) and (8), we derive the estimation error covariance as

$\mathbf{P}_{n,k+1}\triangleq\mathbb{E}\left[(\hat{\mathbf{x}}_{n,k+1}-\mathbf{x}_{n,k+1})(\hat{\mathbf{x}}_{n,k+1}-\mathbf{x}_{n,k+1})^{\top}\right]=\begin{cases}\mathbf{A}_{n}\bar{\mathbf{P}}_{n}\mathbf{A}^{\top}_{n}+\mathbf{W}_{n}, & \text{if $\eta_{n,k}=1$}\\ \mathbf{A}_{n}\mathbf{P}_{n,k}\mathbf{A}^{\top}_{n}+\mathbf{W}_{n}, & \text{otherwise,}\end{cases}$ (12)

where $\bar{\mathbf{P}}_{n}$ is the local estimation error covariance of sensor $n$ defined in (2).

Let $\tau_{n,k}\in\{1,2,\dots\}$ denote the age of information (AoI) of sensor $n$ at time $k$, which measures the time elapsed since the latest packet from sensor $n$ was received. Then, we have

$\tau_{n,k+1}=\begin{cases}1, & \text{if $\eta_{n,k}=1$}\\ \tau_{n,k}+1, & \text{otherwise.}\end{cases}$ (13)

If a sensor is frequently scheduled at good channels, the corresponding average AoI is small. However, due to the scheduling constraint (5), this is often not possible.

From (12) and (13), the error covariance can be simplified as

$\mathbf{P}_{n,k}=f_{n}^{\tau_{n,k}}(\bar{\mathbf{P}}_{n}),$ (14)

where $f_{n}(\mathbf{X})=\mathbf{A}_{n}\mathbf{X}\mathbf{A}_{n}^{\top}+\mathbf{W}_{n}$ and $f^{\tau+1}_{n}(\cdot)=f_{n}(f^{\tau}_{n}(\cdot))$. It has been proved that the estimation MSE, i.e., $\operatorname{Tr}(\mathbf{P}_{n,k})$, increases monotonically with the AoI state $\tau_{n,k}$ [14].
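
The sketch below (names are ours) evaluates $\operatorname{Tr}(f_{n}^{\tau}(\bar{\mathbf{P}}_{n}))$ for increasing AoI, which is how the per-sensor cost can be computed in practice; with an unstable $\mathbf{A}_{n}$ the trace grows monotonically, as noted above.

```python
import numpy as np

def mse_for_aoi(A, W, P_bar, tau):
    """Return Tr(P_{n,k}) = Tr(f_n^tau(P_bar)) with f_n(X) = A X A^T + W, tau >= 1."""
    P = P_bar
    for _ in range(tau):
        P = A @ P @ A.T + W
    return np.trace(P)

# Illustrative values only (P_bar would come from the steady-state Kalman filter).
A = np.array([[1.1, 0.2], [0.0, 1.05]])
W = np.eye(2)
P_bar = np.eye(2)
print([round(mse_for_aoi(A, W, P_bar, t), 2) for t in range(1, 6)])  # monotonically increasing
```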

III Problem Formulation and Threshold Structure

In this paper, we aim to find a dynamic scheduling policy $\pi(\cdot)$ that uses the AoI states of all sensors, as well as the channel states, to minimize the expected total discounted estimation MSE of all $N$ processes over the infinite time horizon.

Problem 1.
$\max_{\pi}\ \lim_{T\to\infty}\mathbb{E}^{\pi}\left[\sum_{k=1}^{T}\sum_{n=1}^{N}-\gamma^{k}\operatorname{Tr}(\mathbf{P}_{n,k})\right],$ (15)

where $\mathbb{E}^{\pi}$ denotes the expectation when adopting the policy $\pi$ and $\gamma<1$ is the discount factor.

Problem 1 is a sequential decision-making problem with the Markovian property. This is because the expected estimation MSE depends only on the AoI state $\tau_{n,k}$ through (14), which evolves according to the Markovian recursion (13), and the channel states are i.i.d. Therefore, we formulate Problem 1 as a Markov decision process (MDP).

III-A MDP Formulation

We define the four elements of the MDP as below.

1) The state of the MDP is defined as $\mathbf{s}_{k}\triangleq\left(\bm{\tau}_{k},\mathbf{H}_{k}\right)\in\mathbb{N}^{N}\times\mathcal{H}^{N\times M}$, where $\bm{\tau}_{k}=(\tau_{1,k},\tau_{2,k},\dots,\tau_{N,k})\in\mathbb{N}^{N}$ is the AoI state vector. Thus, $\mathbf{s}_{k}$ takes into account both the AoI and the channel states.

2) The overall scheduling action of the $N$ sensors is defined as $\mathbf{a}_{k}=(a_{1,k},a_{2,k},\dots,a_{N,k})\in\left\{0,1,2,\dots,M\right\}^{N}$ under the constraint (5). There are $N!/(N-M)!$ actions in total. The stationary policy $\pi(\cdot)$ is a mapping from states to actions, i.e., $\mathbf{a}_{k}=\pi(\mathbf{s}_{k})$.

3) The transition probability $\operatorname{Pr}(\mathbf{s}_{k+1}|\mathbf{s}_{k},\mathbf{a}_{k})$ is the probability of the next state $\mathbf{s}_{k+1}$ given the current state $\mathbf{s}_{k}$ and the action $\mathbf{a}_{k}$. Since we adopt stationary scheduling policies, the state transition is independent of the time index. Thus, we drop the subscript $k$ and use $\mathbf{s}$ and $\mathbf{s}^{+}$ to denote the current and next states, respectively. Due to the i.i.d. fading channel states, we have $\operatorname{Pr}(\mathbf{s}^{+}|\mathbf{s},\mathbf{a})=\operatorname{Pr}(\bm{\tau}^{+}|\bm{\tau},\mathbf{H},\mathbf{a})\operatorname{Pr}(\mathbf{H}^{+})$, where $\operatorname{Pr}(\mathbf{H}^{+})$ can be obtained from (3) and $\operatorname{Pr}(\bm{\tau}^{+}|\bm{\tau},\mathbf{H},\mathbf{a})=\prod_{n=1}^{N}\operatorname{Pr}(\tau^{+}_{n}|\tau_{n},\mathbf{H},a_{n})$, which follows from (3) and (13):

$\operatorname{Pr}(\tau^{+}_{n}|\tau_{n},\mathbf{H},a_{n})=\begin{cases}1-p_{h_{n,m}}, & \text{if $\tau^{+}_{n}=1,\ a_{n}=m$}\\ p_{h_{n,m}}, & \text{if $\tau^{+}_{n}=\tau_{n}+1,\ a_{n}=m$}\\ 1, & \text{if $\tau^{+}_{n}=\tau_{n}+1,\ a_{n}=0$}\\ 0, & \text{otherwise.}\end{cases}$ (20)

4) The immediate reward of Problem 1 at time $k$ is defined as $\sum_{n=1}^{N}-\operatorname{Tr}(\mathbf{P}_{n,k})$. Since $\mathbf{P}_{n,k}$ defined in (14) is a function of $\tau_{n,k}$, the reward can be written as $r(\mathbf{s}_{k})$, which decreases monotonically with the AoI states.
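
To make the MDP concrete, the following sketch (our own illustrative environment step, not code from the paper) samples the AoI transition per (20) and computes the reward; the array interfaces and the clamping of the drop-probability lookup are our assumptions.

```python
import numpy as np

def step_aoi(tau, H, a, p_drop, rng):
    """One AoI transition per (20).

    tau: length-N numpy array of AoI values; H: N x M numpy array of channel states
    in {1,...,h_bar}; a: length-N action with a[n] in {0,...,M}; p_drop[i-1] is the
    drop probability p_i of channel state i."""
    tau_next = tau + 1                          # default: no packet received
    for n, m in enumerate(a):
        if m > 0:                               # sensor n scheduled on channel m
            if rng.random() > p_drop[H[n, m - 1] - 1]:
                tau_next[n] = 1                 # packet received
    return tau_next

def reward(tau, A_list, W_list, P_bar_list):
    """Immediate reward r(s_k) = -sum_n Tr(P_{n,k}), with P_{n,k} = f_n^{tau_n}(P_bar_n)."""
    total = 0.0
    for A, W, P_bar, t in zip(A_list, W_list, P_bar_list, tau):
        P = P_bar
        for _ in range(int(t)):
            P = A @ P @ A.T + W                 # f_n(X) = A X A^T + W, applied tau_n times
        total += np.trace(P)
    return -total
```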

III-B Threshold Policies

Many works have proved that optimal scheduling policies have the threshold structure in Definition 1 for different problems in which the reward of an individual user is a monotonic function of its state [5, 12, 13]. Our problem has the same property: the estimation MSE of sensor $n$ decreases as its AoI state decreases and as its channel state improves.

To verify whether the optimal policy of Problem 1 has the threshold structure, we consider a system with $N=2$ and $M=1$ and adopt the conventional value iteration algorithm to find the optimal policy. As illustrated in Fig. 2, we observe action switching curves in both the AoI and channel state spaces, and the properties in Definition 1 hold. Due to space limitations, we do not present a formal proof here that the optimal policy has a threshold structure, but will investigate it in our future work. In Section V, our numerical results also show that the optimized threshold policy is near optimal, supporting the conjecture.

Definition 1.

For a threshold policy $\pi$ of Problem 1, if channel $m$ is assigned to sensor $n$ at the state $\mathbf{s}=(\bm{\tau},\mathbf{H})$, then the following two properties hold:

(i) for the state $\mathbf{s}^{\prime}=(\bm{\tau},\mathbf{H}^{\prime}_{n,m})$, where $\mathbf{H}^{\prime}_{n,m}$ is identical to $\mathbf{H}$ except for the sensor-$n$-channel-$m$ state with $h^{\prime}_{n,m}>h_{n,m}$, channel $m$ is still assigned to sensor $n$;

(ii) for the state $\mathbf{s}^{\prime}=(\bm{\tau}^{\prime}_{(n)},\mathbf{H})$, where $\bm{\tau}^{\prime}_{(n)}$ is identical to $\bm{\tau}$ except for sensor $n$'s AoI with $\tau^{\prime}_{n}>\tau_{n}$, either channel $m$ or a better channel is assigned to sensor $n$.

For property (i), given that sensor $n$ is scheduled on channel $m$ at a certain state, if the channel quality $h_{n,m}$ improves while the AoI and the other channel states remain the same, then the threshold policy still assigns channel $m$ to sensor $n$. This property is reasonable, since the quality of sensor $n$'s link to channel $m$ is the only factor that has changed, and it has improved.

For property (ii), if the AoI state of sensor $n$ increases while the other states remain the same, the policy must schedule sensor $n$ on a channel that is no worse than the previous one. Such a scheduling policy makes sense, since a larger AoI of sensor $n$ leads to a lower reward, requiring a better channel for transmission to improve the reward effectively.
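
To illustrate how such a structural property can be tested on a learned policy, the sketch below (our own helper; the tabular state encoding is an assumption) flags violations of property (ii) for a policy stored as a dictionary.

```python
def property_ii_violations(policy):
    """Check property (ii) of Definition 1 for a tabular policy.

    `policy` maps states (tau, H) -> action, where tau is a tuple of AoI values,
    H is a tuple of per-sensor tuples of channel states, and action[n] is in {0,...,M}.
    Returns the list of violating (state, state') pairs found in the table."""
    violations = []
    for (tau, H), a in policy.items():
        for n, m in enumerate(a):
            if m == 0:
                continue
            # State s' identical to s except sensor n's AoI increased by one.
            tau_up = tau[:n] + (tau[n] + 1,) + tau[n + 1:]
            a_up = policy.get((tau_up, H))
            if a_up is None:
                continue  # state not tabulated
            m_up = a_up[n]
            # Sensor n must get channel m or a better one (higher channel state).
            if m_up == 0 or H[n][m_up - 1] < H[n][m - 1]:
                violations.append(((tau, H), (tau_up, H)))
    return violations
```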

In the following, we will develop novel schemes that leverage the threshold structure to guide DRL algorithms for solving Problem 1 with improved training speed and performance.

Figure 2: Structure of the optimal scheduling policy with $N=2$ and $M=1$: (a) schedule actions over the AoI states; (b) schedule actions over the channel states. The markers $\bullet$ and $\times$ represent the scheduling of sensor 1 and sensor 2, respectively.

IV Structure-Enhanced DRL

In the literature, DQN and DDPG are the most commonly adopted off-policy DRL algorithms for solving optimal scheduling problems (see [8, 10] and references therein), and they provide significant performance improvements over heuristic policies. Next, we develop threshold structure-enhanced (SE) DQN and DDPG for solving Problem 1.

IV-A Structure-Enhanced DQN

We define the Q-value of a policy $\pi$ as the expected discounted sum of rewards

$Q^{\pi}(\mathbf{s}_{k},\mathbf{a}_{k})=\mathbb{E}\left[\sum_{t=k}^{\infty}\gamma^{t-k}r(\mathbf{s}_{t})\,\Big|\,\mathbf{a}_{k},\ \mathbf{a}_{t}=\pi(\mathbf{s}_{t}),\forall t>k\right],$ (21)

which measures the long-term performance of policy $\pi$ given the current state-action pair $(\mathbf{s}_{k},\mathbf{a}_{k})$. Let $\pi^{*}$ denote the optimal policy of Problem 1. Then, the corresponding Q-value can, in principle, be obtained by solving the Bellman equation [16]

$Q^{*}(\mathbf{s}_{k},\mathbf{a}_{k})=r(\mathbf{s}_{k})+\mathbb{E}_{\mathbf{s}_{k+1}}\left[\gamma\max_{\mathbf{a}_{k+1}}Q^{*}\left(\mathbf{s}_{k+1},\mathbf{a}_{k+1}\right)\right].$ (22)

From the definition (21), the optimal policy generates the best action achieving the highest Q-value, and can be written as

$\mathbf{a}^{*}_{k}=\pi^{*}(\mathbf{s}_{k})=\mathop{\arg\max}_{\mathbf{a}_{k}}Q^{*}(\mathbf{s}_{k},\mathbf{a}_{k}).$ (23)

A well-trained DQN uses a neural network (NN) with parameter set $\bm{\theta}$ to approximate the Q-values of the optimal policy in (23) as $Q(\mathbf{s}_{k},\mathbf{a}_{k};\bm{\theta})$ and uses them to find the optimal action. Considering the action space defined in Section III-A, the DQN has $N!/(N-M)!$ Q-value outputs, one for each state-action pair. To train the DQN, one samples data (consisting of states, actions, rewards, and next states), defines a loss function of $\bm{\theta}$ based on the collected data, and minimizes the loss function to find the optimized $\bm{\theta}$. However, the conventional DQN training method does not utilize the structure of the optimal policy.
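
A minimal sketch of such a Q-network follows (the layer sizes and the action enumeration are our own assumptions, not taken from the paper): the input is the flattened state $(\bm{\tau},\mathbf{H})$ of dimension $N+NM$ and the output has one Q-value per feasible scheduling action.

```python
import itertools
import torch
import torch.nn as nn

N, M = 6, 3                      # numbers of sensors and channels (example values)
# Enumerate the N!/(N-M)! feasible actions: each action is an ordered choice of M
# distinct sensors; the sensor in position m-1 gets channel m, all others get 0.
ACTIONS = list(itertools.permutations(range(N), M))

class QNet(nn.Module):
    def __init__(self, n=N, m=M, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n + n * m, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, len(ACTIONS)),   # one Q-value per feasible action
        )

    def forward(self, state):                  # state: (batch, N + N*M)
        return self.net(state)

# Greedy action as in (27): index of the largest Q-value, mapped to a channel allocation.
q_net = QNet()
s = torch.randn(1, N + N * M)
idx = q_net(s).argmax(dim=1).item()
a = [0] * N
for channel, sensor in enumerate(ACTIONS[idx], start=1):
    a[sensor] = channel
print(a)
```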

To utilize the knowledge of the threshold policy structure to enhance the DQN training performance, we propose 1) an SE action selection method based on Definition 1, which selects reasonable actions and hence enhances the data sampling efficiency; and 2) an SE loss function that adds a penalty to sampled actions that do not follow the threshold structure.

Our SE-DQN training algorithm has three stages: 1) the loose SE action selection stage, which utilizes only part of the structural property; 2) the tight SE action selection stage, which utilizes the full structural property; and 3) the conventional DQN stage. The first two stages use the SE action selection schemes and the SE loss function to train the DQN quickly, resulting in a reasonable threshold policy, and the last stage allows further policy exploration. In what follows, we present the loose and tight SE action selection schemes and the SE loss function.

IV-A1 Loose SE action selection

We randomly select an action $\mathbf{a}^{\epsilon}$ from the entire action space with probability $\epsilon$ for action exploration; with probability $(1-\epsilon)$, we generate the SE action $\hat{\mathbf{a}}$ as described below. For simplicity, we drop the time index when describing action selection.

The threshold structure suggests that the action at $\mathbf{s}$ is correlated with the actions at states with a smaller AoI or channel state. Thus, one can infer the action from the action at the state with a smaller channel state, or a smaller AoI state, based on property (i), or (ii), of Definition 1, respectively. We only consider property (ii) for loose SE action selection, as it is difficult to find actions that satisfy both structural properties (i) and (ii) at the beginning of training. We utilize both (i) and (ii) in the tight SE action selection stage.

Given the state $\mathbf{s}=(\bm{\tau},\mathbf{H})$, we define the state with a smaller AoI for sensor $n$ as $\dot{\mathbf{s}}^{(n)}=(\dot{\bm{\tau}}^{(n)},\mathbf{H})$, where

$\dot{\bm{\tau}}^{(n)}=(\tau_{1},\dots,\tau_{n}-1,\dots,\tau_{N}).$ (24)

For each $n$, we calculate the corresponding action based on the Q-values as

$\dot{\mathbf{a}}^{(n)}\triangleq(\dot{a}^{(n)}_{1},\dots,\dot{a}^{(n)}_{n},\dots,\dot{a}^{(n)}_{N})=\mathop{\arg\max}_{\dot{\mathbf{a}}}Q(\dot{\mathbf{s}}^{(n)},\dot{\mathbf{a}};\bm{\theta}).$ (25)

Recall that $\dot{a}^{(n)}_{n}\in\{0,1,\dots,M\}$ is the channel index assigned to sensor $n$ at the state $\dot{\mathbf{s}}^{(n)}$.

If $\dot{a}^{(n)}_{n}>0$, then property (ii) implies that channel $\dot{a}^{(n)}_{n}$ or a better channel is assigned to sensor $n$ at the state $\mathbf{s}$. We define the set of channels with better quality as

$\mathcal{M}^{(n)}=\{m^{\prime}\,|\,h_{n,m^{\prime}}>h_{n,\dot{a}^{(n)}_{n}},\ m^{\prime}=1,\dots,M\}.$ (26)

Then, the SE action for sensor $n$, denoted $\hat{a}_{n}$, is randomly chosen from the set $\mathcal{M}^{(n)}$ with probability $\xi$, and is equal to $\dot{a}^{(n)}_{n}$ with probability $1-\xi$.

If $\dot{a}^{(n)}_{n}=0$, then property (ii) cannot help determine the action. In this case, we define the action selected by the greedy policy (i.e., the conventional DQN method) at the state $\mathbf{s}$ as

$\tilde{\mathbf{a}}\triangleq(\tilde{a}_{1},\tilde{a}_{2},\dots,\tilde{a}_{N})=\mathop{\arg\max}_{\mathbf{a}}Q(\mathbf{s},\mathbf{a};\bm{\theta}).$ (27)

Thus, we set the SE action of sensor $n$ identical to the one generated by the conventional DQN method, i.e., $\hat{a}_{n}=\tilde{a}_{n}$.

Now we can define the SE action for the $N$ sensors as $\hat{\mathbf{a}}=(\hat{a}_{1},\hat{a}_{2},\dots,\hat{a}_{N})$. If this action meets the constraint

$\sum_{m=1}^{M}\mathbbm{1}\left(\hat{a}_{n}=m\right)\leq 1,\quad \sum_{n=1}^{N}\mathbbm{1}\left(\hat{a}_{n}=m\right)\leq 1,$ (28)

which is less restrictive than (5), we select the action $\mathbf{a}$ as $\hat{\mathbf{a}}$ and assign the unused channels randomly to unscheduled sensors; otherwise, $\mathbf{a}$ is set to the conventional greedy action $\tilde{\mathbf{a}}$.
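
A compact sketch of the loose SE action selection described above (our own implementation under assumed interfaces: `greedy_action(tau, H)` returns the arg-max action of the current Q-network as in (27), and the clamping of the reduced AoI at 1 is our assumption, since (24) is only used when it yields a valid state).

```python
import random

def loose_se_action(tau, H, greedy_action, epsilon, xi, M, rng=random):
    """Loose SE action selection (Sec. IV-A1), sketched under our own interfaces.

    tau: length-N list of AoI values; H: N x M list of channel states;
    greedy_action(tau, H) -> length-N list with entries in {0,...,M}."""
    N = len(tau)
    if rng.random() < epsilon:                       # random exploration with prob. epsilon
        scheduled = rng.sample(range(N), M)
        a = [0] * N
        for ch, n in enumerate(scheduled, start=1):
            a[n] = ch
        return a

    a_hat = [0] * N
    for n in range(N):
        tau_dot = list(tau)
        tau_dot[n] = max(1, tau[n] - 1)              # state with a smaller AoI for sensor n, eq. (24)
        m = greedy_action(tau_dot, H)[n]             # channel given to sensor n there, eq. (25)
        if m > 0:
            better = [mp for mp in range(1, M + 1) if H[n][mp - 1] > H[n][m - 1]]  # eq. (26)
            a_hat[n] = rng.choice(better) if (better and rng.random() < xi) else m
        else:
            a_hat[n] = greedy_action(tau, H)[n]      # fall back to the greedy choice
    # Relaxed constraint (28): no channel used more than once.
    used = [m for m in a_hat if m > 0]
    if len(used) == len(set(used)):
        unused = [m for m in range(1, M + 1) if m not in used]
        idle = [n for n in range(N) if a_hat[n] == 0]
        for m, n in zip(unused, rng.sample(idle, len(unused))):
            a_hat[n] = m                             # assign unused channels randomly
        return a_hat
    return greedy_action(tau, H)                     # otherwise take the greedy action
```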

IV-A2 Tight SE action selection

Using the loose SE action selection, we first infer the scheduling action of sensor $n$ at the state $\mathbf{s}$ from the action at the state with a smaller AoI, $\dot{\mathbf{s}}^{(n)}$. Then, we check whether the loose SE action satisfies the threshold property (i) as follows.

For notational simplicity, we use $m>0$ to denote the SE channel selection for sensor $n$. Given the state $\mathbf{s}$, we define the state with a smaller channel-$m$ state for sensor $n$ as $\ddot{\mathbf{s}}^{(n,m)}=(\bm{\tau},\ddot{\mathbf{H}}^{(n,m)})$, where $\ddot{\mathbf{H}}^{(n,m)}$ and $\mathbf{H}$ are identical except for the element $\ddot{h}^{(n,m)}_{n,m}=h_{n,m}-1$. Then, we calculate the corresponding action

$\ddot{\mathbf{a}}^{(n,m)}\triangleq(\ddot{a}^{(n,m)}_{1},\dots,\ddot{a}^{(n,m)}_{n},\dots,\ddot{a}^{(n,m)}_{N})=\mathop{\arg\max}_{\ddot{\mathbf{a}}}Q(\ddot{\mathbf{s}}^{(n,m)},\ddot{\mathbf{a}};\bm{\theta}).$ (29)

From the structural properties (i) and (ii), the scheduling action $\ddot{a}^{(n,m)}_{n}$ should be identical to $m$. Thus, if $\ddot{a}^{(n,m)}_{n}=m$, then the SE action for sensor $n$ is $\hat{a}_{n}=m$; otherwise, $\hat{a}_{n}$ is set to the conventional DQN action. The SE action satisfying the structural properties (i) and (ii) is $\hat{\mathbf{a}}=(\hat{a}_{1},\hat{a}_{2},\dots,\hat{a}_{N})$. If this action meets the constraint (5), then it is executed, i.e., $\mathbf{a}=\hat{\mathbf{a}}$; otherwise, we select the greedy action $\mathbf{a}=\tilde{\mathbf{a}}$.

IV-A3 SE loss function

During training, each transition $(\mathbf{s}_{t},\mathbf{a}_{t},\hat{\mathbf{a}}_{t},\tilde{\mathbf{a}}_{t},r_{t},\mathbf{s}_{t+1})$ is stored in a replay memory, where $r_{t}\triangleq r(\mathbf{s}_{t})$ denotes the immediate reward. Different from the conventional DQN, we store both the SE action $\hat{\mathbf{a}}_{t}$ and the greedy action $\tilde{\mathbf{a}}_{t}$, in addition to the executed action $\mathbf{a}_{t}$.

Let $\mathcal{T}_{i}\triangleq(\mathbf{s}_{i},\mathbf{a}_{i},\hat{\mathbf{a}}_{i},\tilde{\mathbf{a}}_{i},r_{i},\mathbf{s}_{i+1})$ denote the $i$th transition of a batch sampled from the replay memory. As in the conventional DQN, we define the temporal-difference (TD) error of $\mathcal{T}_{i}$ as

$\mathsf{TD}_{i}\triangleq y_{i}-Q(\mathbf{s}_{i},\mathbf{a}_{i};\bm{\theta}),$ (30)

where $y_{i}=r_{i}+\gamma\max_{\mathbf{a}_{i+1}}Q(\mathbf{s}_{i+1},\mathbf{a}_{i+1};\bm{\theta})$ is the estimated Q-value at the next step. The TD error measures the gap between the left and right sides of the Bellman equation (22); a larger gap indicates that the approximated Q-values are far from optimal. Different from the DQN, we introduce the action-difference (AD) error below to measure the difference in Q-value between the actions selected by the SE strategy and the greedy strategy:

$\mathsf{AD}_{i}\triangleq Q(\mathbf{s}_{i},\hat{\mathbf{a}}_{i};\bm{\theta})-Q(\mathbf{s}_{i},\tilde{\mathbf{a}}_{i};\bm{\theta}).$ (31)

Since the optimal policy has the threshold structure, the inferred action $\hat{\mathbf{a}}_{i}$ should be identical to the (approximately optimal) greedy action $\tilde{\mathbf{a}}_{i}$. Thus, a well-trained $\bm{\theta}$ should lead to a small difference in (31).

Based on (30) and (31), we define the loss function of $\mathcal{T}_{i}$ as

$L_{i}(\bm{\theta})=\begin{cases}\alpha_{1}\mathsf{TD}^{2}_{i}+(1-\alpha_{1})\mathsf{AD}^{2}_{i}, & \text{if $\mathbf{a}_{i}=\hat{\mathbf{a}}_{i}$}\\ \mathsf{TD}^{2}_{i}, & \text{otherwise,}\end{cases}$ (32)

where $\alpha_{1}$ is a hyperparameter balancing the importance of $\mathsf{TD}_{i}$ and $\mathsf{AD}_{i}$. In other words, if the SE action is executed, both the TD and AD errors are taken into account; otherwise, the conventional TD-error-based loss function is adopted.

Given the batch size $B$, the overall loss function is

$L(\bm{\theta})=\frac{1}{B}\sum_{i=1}^{B}L_{i}(\bm{\theta}).$ (33)
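
For concreteness, a sketch of the SE loss (30)-(33) in PyTorch is given below (our own illustrative tensor interface; `q_net` maps a batch of states to Q-values over an enumerated action list, and the action tensors hold integer indices into that list; the target term is treated as a constant, consistent with the gradient in (35)).

```python
import torch

def se_dqn_loss(q_net, batch, gamma, alpha1):
    """batch: dict of tensors with states s (B, d), next states s_next (B, d),
    rewards r (B,), and action indices a, a_hat, a_tilde (B,) into the action list."""
    q_s = q_net(batch["s"])                                        # (B, num_actions)
    q_taken = q_s.gather(1, batch["a"].unsqueeze(1)).squeeze(1)    # Q(s_i, a_i; theta)
    with torch.no_grad():
        y = batch["r"] + gamma * q_net(batch["s_next"]).max(dim=1).values  # target in (30)
    td = y - q_taken                                               # TD error (30)
    q_hat = q_s.gather(1, batch["a_hat"].unsqueeze(1)).squeeze(1)
    q_tilde = q_s.gather(1, batch["a_tilde"].unsqueeze(1)).squeeze(1)
    ad = q_hat - q_tilde                                           # AD error (31)
    se_executed = (batch["a"] == batch["a_hat"]).float()           # 1 if the SE action was executed
    loss_i = se_executed * (alpha1 * td**2 + (1 - alpha1) * ad**2) + (1 - se_executed) * td**2
    return loss_i.mean()                                           # batch average, eq. (33)
```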

To optimize $\bm{\theta}$, we adopt the well-known gradient descent method and calculate the gradient as

$\nabla_{\bm{\theta}}L(\bm{\theta})=\frac{1}{B}\sum_{i=1}^{B}\nabla_{\bm{\theta}}L_{i}(\bm{\theta}),$ (34)

where $\nabla_{\bm{\theta}}L_{i}(\bm{\theta})$ is given as

$\nabla_{\bm{\theta}}L_{i}(\bm{\theta})=\begin{cases}-2\big((1-\alpha_{1})\mathsf{AD}_{i}\nabla_{\bm{\theta}}\left(Q(\mathbf{s}_{i},\hat{\mathbf{a}}_{i};\bm{\theta})-Q(\mathbf{s}_{i},\tilde{\mathbf{a}}_{i};\bm{\theta})\right)+\alpha_{1}\mathsf{TD}_{i}\nabla_{\bm{\theta}}Q(\mathbf{s}_{i},\mathbf{a}_{i};\bm{\theta})\big), & \text{if $\mathbf{a}_{i}=\hat{\mathbf{a}}_{i}$}\\ -2\,\mathsf{TD}_{i}\nabla_{\bm{\theta}}Q(\mathbf{s}_{i},\mathbf{a}_{i};\bm{\theta}), & \text{otherwise.}\end{cases}$ (35)

The details of the SE-DQN algorithm are given in Algorithm 1.

Algorithm 1 SE-DQN for sensor scheduling in the remote estimation system
1: Initialize replay memory $\mathcal{D}$ to capacity $K$
2: Initialize the policy network with random weights $\bm{\theta}$
3: Initialize the target network with weights $\hat{\bm{\theta}}=\bm{\theta}$
4: for episode $=1,2,\dots,E_{1}$ do
5:    Initialize state $\mathbf{s}_{0}$
6:    for $t=1,2,\dots,T$ do
7:         Generate $\hat{\mathbf{a}}_{t}$ and select action $\mathbf{a}_{t}$ by the loose SE action selection method of the SE-DQN
8:         Execute action $\mathbf{a}_{t}$ and observe $r_{t}$ and $\mathbf{s}_{t+1}$
9:         Compute action $\tilde{\mathbf{a}}_{t}=\mathop{\arg\max}_{\mathbf{a}}Q(\mathbf{s}_{t},\mathbf{a};\bm{\theta})$
10:        Store transition $(\mathbf{s}_{t},\mathbf{a}_{t},\hat{\mathbf{a}}_{t},\tilde{\mathbf{a}}_{t},r_{t},\mathbf{s}_{t+1})$ in $\mathcal{D}$
11:        Sample a random batch of transitions $(\mathbf{s}_{i},\mathbf{a}_{i},\hat{\mathbf{a}}_{i},\tilde{\mathbf{a}}_{i},r_{i},\mathbf{s}_{i+1})$
12:        Calculate $\mathsf{TD}_{i}$ and $\mathsf{AD}_{i}$ based on (30) and (31)
13:        Update $\bm{\theta}$ according to (34)
14:        Every $t^{\prime}$ steps, set $\hat{\bm{\theta}}=\bm{\theta}$
15:    end for
16: end for
17: for episode $=E_{1},\dots,E_{2}$ do
18:    Repeat lines 5 to 15 with the tight SE action selection method
19: end for
20: for episode $=E_{2},\dots,E$ do
21:    Execute the conventional DQN algorithm as in [8]
22: end for

IV-B Structure-Enhanced DDPG

Different from the DQN, which uses one NN to estimate the Q-value, a DDPG agent has two NNs [17]: a critic NN with parameter set $\bm{\theta}$ and an actor NN with parameter set $\bm{\mu}$. In particular, the actor NN approximates the optimal policy $\pi^{*}(\mathbf{s})$ by $\mu(\mathbf{s};\bm{\mu})$, while the critic NN approximates the Q-value of the optimal policy, $Q^{*}(\mathbf{s},\mathbf{a})$, by $Q(\mathbf{s},\mathbf{a};\bm{\theta})$. In general, the critic NN judges whether the action generated by the actor NN is good or not, and the actor is improved based on the critic. The critic NN is updated by minimizing the TD error, similar to the DQN. The actor-critic framework enables the DDPG to solve MDPs with continuous and large action spaces, which cannot be handled by the DQN.

To solve our scheduling problem with a discrete action space, we adopt an action mapping scheme similar to the one in [10]. The direct output of the actor NN consists of $N$ continuous values, ranging from $-1$ to $1$, corresponding to sensors $1$ to $N$, respectively (recall that the DQN has $N!/(N-M)!$ outputs). The $N$ values are sorted in descending order, and the $M$ highest-ranked sensors are assigned to channels $1$ to $M$, respectively. The corresponding ranking values are then linearly normalized to $[-1,1]$ to form the final output of the actor NN, referred to as the virtual action $\mathbf{v}$. The virtual action $\mathbf{v}$ and the real scheduling action $\mathbf{a}$ can thus be mapped to each other directly. Therefore, we use the virtual action $\mathbf{v}$, instead of the real action $\mathbf{a}$, when presenting the SE-DDPG algorithm.
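
A sketch of this action mapping follows (our own illustrative implementation of the scheme described above; the normalization that produces the virtual action $\mathbf{v}$ is omitted for brevity).

```python
import numpy as np

def actor_output_to_schedule(raw, M):
    """Map the actor NN's raw output (length-N vector in [-1, 1]) to a scheduling action.

    The M sensors with the largest raw values are assigned to channels 1,...,M in
    that order; the remaining sensors are not scheduled (action 0)."""
    order = np.argsort(-np.asarray(raw))          # sensors ranked by descending output
    a = np.zeros(len(raw), dtype=int)
    for channel, sensor in enumerate(order[:M], start=1):
        a[sensor] = channel
    return a

# Example: N = 4 sensors, M = 2 channels.
print(actor_output_to_schedule([0.3, -0.8, 0.9, 0.1], M=2))  # sensor 2 -> channel 1, sensor 0 -> channel 2
```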

Similar to the SE-DQN, the SE-DDPG has a loose SE stage, a tight SE stage, and a conventional DDPG stage. The $i$th sampled transition is denoted as

$\mathcal{T}_{i}\triangleq(\mathbf{s}_{i},\mathbf{v}_{i},\hat{\mathbf{v}}_{i},\tilde{\mathbf{v}}_{i},r_{i},\mathbf{s}_{i+1}).$ (36)

We present the SE action selection method and the SE loss function in the sequel.

IV-B1 SE action selection

The general action selection approach for the SE-DDPG is identical to that of the SE-DQN, by simply converting $\mathbf{v}_{i}$, $\mathbf{v}^{\epsilon}_{i}$, $\tilde{\mathbf{v}}_{i}$, and $\hat{\mathbf{v}}_{i}$ to $\mathbf{a}_{i}$, $\mathbf{a}^{\epsilon}_{i}$, $\tilde{\mathbf{a}}_{i}$, and $\hat{\mathbf{a}}_{i}$, respectively. Different from the DQN with $\epsilon$-greedy actions, the action generated by the DDPG based on the current state is

$\tilde{\mathbf{v}}_{i}=\mu(\mathbf{s}_{i};\bm{\mu}),$ (37)

and the random action $\mathbf{v}^{\epsilon}_{i}$ is generated by adding noise to the original continuous output of the actor NN.

Algorithm 2 SE-DDPG for sensor scheduling in the remote estimation system
1: Initialize replay memory $\mathcal{D}$ to capacity $K$
2: Initialize the actor network and the critic network with random weights $\bm{\mu}$ and $\bm{\theta}$
3: Initialize the target networks with weights $\hat{\bm{\mu}}=\bm{\mu}$ and $\hat{\bm{\theta}}=\bm{\theta}$
4: for episode $=1,2,\dots,E_{1}$ do
5:    Initialize a random noise process $\mathcal{N}$ for action exploration
6:    Initialize state $\mathbf{s}_{0}$
7:    for $t=1,2,\dots,T$ do
8:         Generate $\hat{\mathbf{v}}_{t}$ and select action $\mathbf{v}_{t}$ by the loose SE action selection method of the SE-DDPG
9:         Map the virtual action $\mathbf{v}_{t}$ to the real action $\mathbf{a}_{t}$
10:        Execute action $\mathbf{a}_{t}$ and observe $r_{t}$ and $\mathbf{s}_{t+1}$
11:        Compute action $\tilde{\mathbf{v}}_{t}=\mu(\mathbf{s}_{t};\bm{\mu})$
12:        Store transition $(\mathbf{s}_{t},\mathbf{v}_{t},\hat{\mathbf{v}}_{t},\tilde{\mathbf{v}}_{t},r_{t},\mathbf{s}_{t+1})$ in $\mathcal{D}$
13:        Sample a random batch of transitions $(\mathbf{s}_{i},\mathbf{v}_{i},\hat{\mathbf{v}}_{i},\tilde{\mathbf{v}}_{i},r_{i},\mathbf{s}_{i+1})$
14:        Calculate $\mathsf{TD}_{i}$ and $\mathsf{AD}_{i}$ based on (30) and (31), but with the virtual actions $\mathbf{v}_{i},\hat{\mathbf{v}}_{i},\tilde{\mathbf{v}}_{i}$
15:        Update $\bm{\theta}$ according to (34)
16:        Update $\bm{\mu}$ according to (42)
17:        Update the target networks:
$\hat{\bm{\mu}} \leftarrow \delta\bm{\mu}+(1-\delta)\hat{\bm{\mu}}$ (38)
$\hat{\bm{\theta}} \leftarrow \delta\bm{\theta}+(1-\delta)\hat{\bm{\theta}}$ (39)
18:    end for
19: end for
20: for episode $=E_{1},\dots,E_{2}$ do
21:    Repeat lines 5 to 18 with the tight SE action selection method
22: end for
23: for episode $=E_{2},\dots,E$ do
24:    Execute the conventional DDPG algorithm as in [17]
25: end for

IV-B2 SE loss function

Different from the SE-DQN, the SE-DDPG needs different loss functions for updating the critic NN and the actor NN. For the critic NN, we use the same loss function as in (33), and thus the gradient for the critic NN update is identical to (34). Note that for the DDPG, the next virtual action $\tilde{\mathbf{v}}_{i+1}$ is the direct output of the actor NN given the next state $\mathbf{s}_{i+1}$, i.e., $\tilde{\mathbf{v}}_{i+1}=\mu(\mathbf{s}_{i+1};\bm{\mu})$. Thus, when calculating the TD error (30), we have $y_{i}=r_{i}+\gamma Q(\mathbf{s}_{i+1},\mu(\mathbf{s}_{i+1};\bm{\mu});\bm{\theta})$.

For the actor NN, when the SE action is executed, i.e., $\mathbf{v}_{i}=\hat{\mathbf{v}}_{i}$, we include the difference between the action selected by the SE strategy, $\hat{\mathbf{v}}_{i}$, and that of the actor NN, $\tilde{\mathbf{v}}_{i}$. If the SE action is not selected, the loss function is simply the Q-value of the given state-action pair, as in the conventional DDPG. Given the hyperparameter $\alpha_{2}$, the loss function for the transition $\mathcal{T}_{i}$ is defined as

$L_{i}(\bm{\mu})=\begin{cases}\alpha_{2}Q(\mathbf{s}_{i},\tilde{\mathbf{v}}_{i};\bm{\theta})+(1-\alpha_{2})\left(\mathbf{v}_{i}-\tilde{\mathbf{v}}_{i}\right)^{2}, & \text{if $\mathbf{v}_{i}=\hat{\mathbf{v}}_{i}$}\\ Q(\mathbf{s}_{i},\tilde{\mathbf{v}}_{i};\bm{\theta}), & \text{otherwise,}\end{cases}$ (40)

and hence the overall loss function given the sampled batch is

$L(\bm{\mu})=\frac{1}{B}\sum_{i=1}^{B}L_{i}(\bm{\mu}).$ (41)

By substituting (37) and (40) into (41) and applying the chain rule, we can derive the gradient of the overall loss function with respect to $\bm{\mu}$ as

$\nabla_{\bm{\mu}}L(\bm{\mu})=\frac{1}{B}\sum_{i=1}^{B}\nabla_{\bm{\mu}}L_{i}(\bm{\mu}),$ (42)

where $\nabla_{\bm{\mu}}L_{i}(\bm{\mu})$ is given by

$\nabla_{\bm{\mu}}L_{i}(\bm{\mu})=\begin{cases}\alpha_{2}\nabla_{\tilde{\mathbf{v}}_{i}}Q(\mathbf{s}_{i},\tilde{\mathbf{v}}_{i};\bm{\theta})\nabla_{\bm{\mu}}\mu(\mathbf{s}_{i};\bm{\mu})-2(1-\alpha_{2})\left(\mathbf{v}_{i}-\mu\left(\mathbf{s}_{i};\bm{\mu}\right)\right)\nabla_{\bm{\mu}}\mu(\mathbf{s}_{i};\bm{\mu}), & \text{if $\mathbf{v}_{i}=\hat{\mathbf{v}}_{i}$}\\ \nabla_{\tilde{\mathbf{v}}_{i}}Q(\mathbf{s}_{i},\tilde{\mathbf{v}}_{i};\bm{\theta})\nabla_{\bm{\mu}}\mu(\mathbf{s}_{i};\bm{\mu}), & \text{otherwise.}\end{cases}$ (43)
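
As with the critic loss, a PyTorch sketch of the actor loss (40)-(41) is shown below (our own illustrative interface: `critic(s, v)` is assumed to take the state and the virtual action as two arguments; we follow (40) as written and let autograd produce the gradient rather than applying (43) manually).

```python
import torch

def se_ddpg_actor_loss(actor, critic, batch, alpha2):
    """Actor loss per (40)-(41).

    batch: dict with states s (B, d) and the executed / SE virtual actions v, v_hat (B, N)."""
    v_tilde = actor(batch["s"])                                   # mu(s_i; mu), eq. (37)
    q_term = critic(batch["s"], v_tilde).squeeze(-1)              # Q(s_i, v_tilde; theta)
    mismatch = ((batch["v"] - v_tilde) ** 2).sum(dim=1)           # (v_i - v_tilde)^2 term
    # SE action considered executed if the stored v matches v_hat exactly.
    se_executed = (batch["v"] == batch["v_hat"]).all(dim=1).float()
    loss_i = se_executed * (alpha2 * q_term + (1 - alpha2) * mismatch) \
             + (1 - se_executed) * q_term
    return loss_i.mean()                                          # batch average, eq. (41)
```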

The details of the proposed SE-DDPG algorithm are given in Algorithm 2.

V Numerical Experiments

TABLE I: Summary of Hyperparameters
Hyperparameters of SE-DQN and SE-DDPG | Value
Initial values of $\epsilon$ and $\xi$ | 1
Decay rates of $\epsilon$ and $\xi$ | 0.999
Minimum $\epsilon$ and $\xi$ | 0.01
Mini-batch size, $B$ | 128
Experience replay memory size, $K$ | 20000
Discount factor, $\gamma$ | 0.95
Weight of the SE-DQN and critic network loss functions, $\alpha_{1}$ | 0.5
Hyperparameters of SE-DQN
Learning rate | 0.0001
Decay rate of learning rate | 0.001
Target network update frequency | 100
Input dimension of the network $Q(\mathbf{s},\mathbf{a};\bm{\theta})$ | $N+N\times M$
Output dimension of the network $Q(\mathbf{s},\mathbf{a};\bm{\theta})$ | $N!/(N-M)!$
Hyperparameters of SE-DDPG
Learning rate of the actor network | 0.0001
Learning rate of the critic network | 0.001
Decay rate of learning rate | 0.001
Soft parameter for target update, $\delta$ | 0.005
Weight of the actor network loss function, $\alpha_{2}$ | 0.9
Input dimension of the actor network $\mu(\mathbf{s}_{t};\bm{\mu})$ | $N+N\times M$
Output dimension of the actor network $\mu(\mathbf{s}_{t};\bm{\mu})$ | $N$
Input dimension of the critic network $Q(\mathbf{s},\mathbf{a};\bm{\theta})$ | $2N+N\times M$
Output dimension of the critic network $Q(\mathbf{s},\mathbf{a};\bm{\theta})$ | $1$

In this section, we evaluate and compare the performance of the proposed SE-DQN and SE-DDPG with the benchmark DQN and DDPG.

V-A Experiment Setups

Our numerical experiments run on a computing platform with an Intel Core i5 9400F CPU at 2.9 GHz, 16 GB of RAM, and an NVIDIA RTX 2070 GPU. For the remote estimation system, we set the dimensions of the process state and the measurement to $l_{n}=2$ and $e_{n}=1$, respectively. The system matrices $\{\mathbf{A}_{n}\}$ are randomly generated with spectral radii within the range $(1,1.4)$. The entries of $\mathbf{C}_{n}$ are drawn uniformly from the range $(0,1)$, and $\mathbf{W}_{n}$ and $\mathbf{V}_{n}$ are identity matrices. The fading channel state is quantized into $\bar{h}=5$ levels, and the corresponding packet drop probabilities are $0.2$, $0.15$, $0.1$, $0.05$, and $0.01$. The distributions of the channel states of each sensor-channel pair $(n,m)$, i.e., $q^{(n,m)}_{1},\dots,q^{(n,m)}_{\bar{h}}$, are generated randomly.

During training, we use the ADAM optimizer to compute the gradient updates and reset the environment at the start of each episode of $T=500$ steps. The numbers of episodes for the loose SE action, tight SE action, and conventional DRL stages are 50, 100, and 150, respectively. The hyperparameter settings for Algorithm 1 and Algorithm 2 are summarized in Table I.
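
As an illustration of this setup, the sketch below (our own code, with an arbitrary construction) generates a random $2\times 2$ system matrix whose spectral radius lies in $(1, 1.4)$ and a random channel-state distribution for one sensor-channel pair.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unstable_A(dim=2, rho_range=(1.0, 1.4)):
    """Random system matrix rescaled so that its spectral radius falls in rho_range."""
    A = rng.standard_normal((dim, dim))
    rho = max(abs(np.linalg.eigvals(A)))
    target = rng.uniform(*rho_range)
    return A * (target / rho)

def random_channel_distribution(h_bar=5):
    """Random probability vector q_1,...,q_{h_bar} for one sensor-channel pair."""
    q = rng.random(h_bar)
    return q / q.sum()

A_n = random_unstable_A()
print(max(abs(np.linalg.eigvals(A_n))))      # spectral radius in (1, 1.4)
print(random_channel_distribution())         # entries sum to 1
```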

V-B Performance Comparison

Fig. 3 and Fig. 4 illustrate the average sum MSE of all processes during training achieved by the SE-DRL algorithms and the benchmarks under two system settings. Fig. 3 shows that the SE-DQN saves about $50\%$ of the training episodes needed for convergence and also reduces the average sum MSE by $10\%$ compared with the conventional DQN. Fig. 4 shows that the SE-DDPG saves about $30\%$ of the training episodes and reduces the average sum MSE by $10\%$ compared with the conventional DDPG. Also, we see that the conventional DQN and DDPG stages (i.e., the last 150 episodes) in Figs. 3 and 4 do not improve the training performance much. This implies that the SE stages have already found near-optimal policies.

In Table II, we test the performance of the well-trained SE-DRL algorithms for different numbers of sensors and channels and different settings of the system parameters, i.e., $\mathbf{A}_{n}$, $\mathbf{C}_{n}$, and $q^{(n,m)}_{1},\dots,q^{(n,m)}_{\bar{h}}$, based on 10000-step simulations. We see that for 6-sensor-3-channel systems, the SE-DQN reduces the average MSE by $10\%$ to $25\%$ compared with the DQN, while both the DDPG and the SE-DDPG achieve performance similar to the SE-DQN. This suggests that the SE-DQN is almost optimal in this setting and that the DDPG and the SE-DDPG cannot improve much further. In 10-sensor-5-channel systems, neither the DQN nor the SE-DQN converges, and the SE-DDPG achieves a $10\%$ MSE reduction over the DDPG. In particular, the SE-DDPG is the only algorithm that converges in Experiment 12. In 20-sensor-10-channel systems, the SE-DDPG reduces the average MSE by $15\%$ to $25\%$. Therefore, the performance improvement of the SE-DDPG is most significant for large systems.

Figure 3: Average sum MSE of all processes during training with $N=6$, $M=3$.
Figure 4: Average sum MSE of all processes during training with $N=10$, $M=5$.
TABLE II: Performance Comparison of the SE-DRL Algorithms and the Benchmarks in terms of the Average Estimation MSE
System Scale $(N,M)$ | Experiment | DQN | SE-DQN | DDPG | SE-DDPG
$(6,3)$ | 1 | 52.4121 | 47.6766 | 48.4075 | 47.0594
$(6,3)$ | 2 | 67.4247 | 49.8476 | 53.3675 | 44.7423
$(6,3)$ | 3 | 84.1721 | 59.9127 | 56.9504 | 55.2409
$(6,3)$ | 4 | 79.5902 | 65.1640 | 64.4313 | 58.5534
$(10,5)$ | 9 | - | - | 68.0247 | 62.4727
$(10,5)$ | 10 | - | - | 89.6290 | 78.4138
$(10,5)$ | 11 | - | - | 90.2812 | 81.1148
$(10,5)$ | 12 | - | - | - | 147.2844
$(20,10)$ | 13 | - | - | 181.4135 | 159.4321
$(20,10)$ | 14 | - | - | 163.1257 | 142.1850
$(20,10)$ | 15 | - | - | 173.0940 | 144.8166
$(20,10)$ | 16 | - | - | 215.2231 | 160.2304

VI Conclusion

We have developed SE-DRL algorithms for optimal sensor scheduling of remote estimation systems. In particular, we have proposed novel SE action selection methods and SE loss functions to enhance training efficiency. Our numerical results have shown that the proposed structure-enhanced DRL algorithms can save the training time by 50% and reduce the estimation MSE by 10% to 25%, when benchmarked against conventional DRL algorithms. For future work, we will rigorously prove the threshold structure of the optimal scheduling policy for our system. We will also investigate other structural properties of the optimal policy to further enhance the DRL.

References

  • [1] P. Park, S. Coleri Ergen, C. Fischione, C. Lu, and K. H. Johansson, “Wireless network design for control systems: A survey,” IEEE Commun. Surv. Tutor., vol. 20, no. 2, pp. 978–1013, May 2018.
  • [2] K. Huang, W. Liu, Y. Li, B. Vucetic, and A. Savkin, “Optimal downlink-uplink scheduling of wireless networked control for industrial IoT,” IEEE Internet Things J., vol. 7, no. 3, pp. 1756–1772, Mar. 2020.
  • [3] K. Gatsis, M. Pajic, A. Ribeiro, and G. J. Pappas, “Opportunistic control over shared wireless channels,” IEEE Trans. Autom. Control, vol. 60, no. 12, pp. 3140–3155, Mar. 2015.
  • [4] D. Han, J. Wu, H. Zhang, and L. Shi, “Optimal sensor scheduling for multiple linear dynamical systems,” Automatica, vol. 75, pp. 260–270, Jan. 2017.
  • [5] S. Wu, X. Ren, S. Dey, and L. Shi, “Optimal scheduling of multiple sensors over shared channels with packet transmission constraint,” Automatica, vol. 96, pp. 22–31, Oct. 2018.
  • [6] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [7] Z. Zhao, W. Liu, D. E. Quevedo, Y. Li, and B. Vucetic, “Deep learning for wireless networked systems: a joint estimation-control-scheduling approach,” arXiv preprint, Oct. 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2210.00673
  • [8] A. S. Leong, A. Ramaswamy, D. E. Quevedo, H. Karl, and L. Shi, “Deep reinforcement learning for wireless sensor scheduling in cyber–physical systems,” Automatica, vol. 113, pp. 1–8, Mar. 2020. Art. no. 108759.
  • [9] W. Liu, K. Huang, D. E. Quevedo, B. Vucetic, and Y. Li, “Deep reinforcement learning for wireless scheduling in distributed networked control,” submitted to Automatica, 2021. [Online]. Available: https://doi.org/10.48550/arXiv.2109.12562
  • [10] G. Pang, W. Liu, Y. Li, and B. Vucetic, “DRL-based resource allocation in remote state estimation,” arXiv preprint, May 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2205.12267
  • [11] Z. D. Guo and E. Brunskill, “Directed exploration for reinforcement learning,” arXiv preprint, Jun. 2019. [Online]. Available: https://doi.org/10.48550/arXiv.1906.07805
  • [12] S. Wu, K. Ding, P. Cheng, and L. Shi, “Optimal scheduling of multiple sensors over lossy and bandwidth limited channels,” IEEE Trans. Netw. Syst., vol. 7, no. 3, pp. 1188–1200, Jan. 2020.
  • [13] Y.-P. Hsu, E. Modiano, and L. Duan, “Age of information: Design and analysis of optimal scheduling algorithms,” in Proc. IEEE Int. Symp. Inf. Theory, Jun. 2017, pp. 561–565.
  • [14] W. Liu, D. E. Quevedo, Y. Li, K. H. Johansson, and B. Vucetic, “Remote state estimation with smart sensors over Markov fading channels,” IEEE Trans. Autom. Control, vol. 67, no. 6, pp. 2743–2757, Jun. 2022.
  • [15] W. Liu, D. E. Quevedo, K. H. Johansson, B. Vucetic, and Y. Li, “Stability conditions for remote state estimation of multiple systems over multiple Markov fading channels,” IEEE Trans. Autom. Control, early access, Aug. 2022.
  • [16] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” in NIPS Deep Learning Workshop, 2013.
  • [17] T. P. Lillicrap, et al., “Continuous control with deep reinforcement learning,” arXiv preprint, Sep. 2015. [Online]. Available: https://doi.org/10.48550/arXiv.1509.02971