
Structure-Enhanced Deep Reinforcement Learning for Optimal Transmission Scheduling
*Wanchun Liu is the corresponding author.

Jiazheng Chen1, Wanchun Liu*1, Daniel E. Quevedo2, Yonghui Li1, and Branka Vucetic1 1School of Electrical and Information Engineering, The University of Sydney, Australia
2School of Electrical Engineering and Robotics, Queensland University of Technology, Brisbane, Australia.
1Emails: {jiazheng.chen, wanchun.liu, yonghui.li, branka.vucetic}@sydney.edu.au. 2Email: [email protected].
Abstract

Remote state estimation of large-scale distributed dynamic processes plays an important role in Industry 4.0 applications. In this paper, by leveraging theoretical results on the structural properties of optimal scheduling policies, we develop a structure-enhanced deep reinforcement learning (DRL) framework for optimal scheduling of a multi-sensor remote estimation system to achieve the minimum overall estimation mean-square error (MSE). In particular, we propose a structure-enhanced action selection method, which tends to select actions that obey the policy structure. This explores the action space more effectively and enhances the learning efficiency of DRL agents. Furthermore, we introduce a structure-enhanced loss function that adds a penalty to actions that do not follow the policy structure. The new loss function guides the DRL algorithm to converge quickly to the optimal policy structure. Our numerical results show that the proposed structure-enhanced DRL algorithms can save the training time by 50% and reduce the remote estimation MSE by 10% to 25%, when compared to benchmark DRL algorithms.

Index Terms:
Remote state estimation, deep reinforcement learning, sensor scheduling, threshold structure.

I Introduction

Wireless networked control systems (WNCSs) consisting of distributed sensors, actuators, controllers, and plants are a key component for Industry 4.0 and have been widely applied in many areas such as industrial automation, vehicle monitoring systems, building automation and smart grids [1]. In particular, providing high-quality real-time remote estimation of dynamic system states plays an important role in ensuring control performance and stability of WNCSs [2]. For large-scale WNCSs, transmission scheduling of wireless sensors over limited bandwidth needs to be properly designed to guarantee remote estimation performance.

There are many existing works on the transmission scheduling of WNCSs. In [3], the optimal scheduling problem of a multi-loop WNCS with limited communication resources was investigated for minimizing the transmission power consumption under a WNCS stability constraint. In [4, 5], optimal sensor scheduling problems of remote estimation systems were investigated for achieving the best overall estimation mean-square error (MSE). In particular, dynamic decision-making problems were formulated as Markov decision processes (MDPs), which can be solved by classical methods, such as policy and value iterations. However, the conventional model-based solutions are not feasible in large-scale scheduling problems because of the curse of dimensionality caused by high-dimensional state and action spaces. (In [3, 4, 5], only the optimal scheduling of two-sensor systems has been solved effectively by the proposed methods.)

In recent years, deep reinforcement learning (DRL) has been developed to deal with large MDPs by using deep neural networks as function approximators [6, 7]. Some works [8, 9] have used the deep Q-network (DQN), one of the simplest DRL methods, to solve multi-sensor-multi-channel scheduling problems in different remote estimation scenarios. In particular, sensor scheduling problems for systems with 6 sensors have been solved effectively, providing significant performance gains over heuristic methods in terms of estimation quality. More recent work [10] has introduced DRL algorithms with an actor-critic structure to solve the scheduling problem at a much larger scale (that cannot be handled by the DQN). However, existing works merely apply general DRL frameworks to specific scheduling problems, without questioning what features make sensor scheduling problems different from other MDPs. Also, we note that a drawback of general DRL is that it often cannot perform policy exploration effectively for specific tasks [11], which can lead to local minima or even total failure. Thus, the existing DRL-based solutions could be far from optimal.

Some recent works have shown that the optimal scheduling policies of remote estimation systems commonly have a “threshold structure” [12, 5], which means that there exist switching boundaries dividing the state space into multiple regions for different scheduling actions. In other words, the optimal policy has a structure where the action only changes at the switching boundaries of the state space. In [13], the threshold structure of an optimal scheduling policy has also been identified in a sensor scheduling system for minimizing the average age of information (AoI). Although these theoretical works have derived the structures of optimal policies, there is no existing work in the open literature utilizing the structural properties to guide DRL algorithms. This knowledge gap motivates us to investigate how knowledge of the optimal policy structure can be used to redesign DRL algorithms that find the optimal policy more effectively.

Contributions. In this paper, we develop novel structure-enhanced DRL algorithms for solving the optimal scheduling problem of a remote estimation system. Given the most commonly adopted DRL frameworks for scheduling, i.e., DQN and deep deterministic policy gradient (DDPG), we propose structure-enhanced DQN and DDPG algorithms. In particular, we design a structure-enhanced action selection method, which tends to select actions that obey the threshold structure. Such an action selection method can explore the action space more effectively and enhance the learning efficiency of DRL agents. Furthermore, we introduce a structure-enhanced loss function that adds a penalty to actions that do not follow the threshold structure. The new loss function guides the DRL algorithm to converge quickly to the optimal policy structure. Our numerical results show that the proposed structure-enhanced DRL algorithms can save the training time by 50%, while reducing the remote estimation MSE by 10% to 25%, when compared to benchmark DRL algorithms.

II System Model

We consider a remote estimation system with $N$ dynamic processes, each measured by a sensor that pre-processes the raw measurements and sends its state estimate to a remote estimator through one of $M$ wireless channels, as illustrated in Fig. 1.

Figure 1: Remote state estimation system with $N$ processes and $M$ channels.

II-A Process Model and Local State Estimation

Each dynamic process $n$ is modeled as a discrete linear time-invariant (LTI) system [8, 14, 15]:

$\mathbf{x}_{n,k+1} = \mathbf{A}_{n}\mathbf{x}_{n,k} + \mathbf{w}_{n,k}, \qquad \mathbf{y}_{n,k} = \mathbf{C}_{n}\mathbf{x}_{n,k} + \mathbf{v}_{n,k},$ (1)

where $\mathbf{x}_{n,k}\in\mathbb{R}^{l_{n}}$ is process $n$'s state at time $k$, $\mathbf{y}_{n,k}\in\mathbb{R}^{e_{n}}$ is sensor $n$'s measurement of that state, $\mathbf{A}_{n}\in\mathbb{R}^{l_{n}\times l_{n}}$ and $\mathbf{C}_{n}\in\mathbb{R}^{e_{n}\times l_{n}}$ are the system matrix and the measurement matrix, respectively, and $\mathbf{w}_{n,k}\in\mathbb{R}^{l_{n}}$ and $\mathbf{v}_{n,k}\in\mathbb{R}^{e_{n}}$ are the process disturbance and the measurement noise, modeled as independent and identically distributed (i.i.d.) zero-mean Gaussian random vectors $\mathcal{N}(\mathbf{0},\mathbf{W}_{n})$ and $\mathcal{N}(\mathbf{0},\mathbf{V}_{n})$, respectively. We assume that the spectral radius of each $\mathbf{A}_{n}$ is greater than one, which means that the dynamic processes are unstable, making the remote estimation problem more interesting.

Due to the imperfect state measurement in (1), sensor $n$ runs a classic Kalman filter to pre-process the raw measurements and generate the local state estimate $\mathbf{x}^{s}_{n,k}$ at each time $k$ [14]. Once scheduled, sensor $n$ sends $\mathbf{x}^{s}_{n,k}$ (not $\mathbf{y}_{n,k}$) to the remote estimator as a packet. The local state estimation error covariance matrix is defined as

$\mathbf{P}_{n,k}^{s}\triangleq\mathbb{E}\left[(\mathbf{x}_{n,k}^{s}-\mathbf{x}_{n,k})(\mathbf{x}_{n,k}^{s}-\mathbf{x}_{n,k})^{\top}\right].$ (2)

We note that local Kalman filters are commonly assumed to operate in the steady-state mode in the literature (see [14] and references therein; the $n$th local Kalman filter is stable if and only if $(\mathbf{A}_{n},\mathbf{C}_{n})$ is observable and $(\mathbf{A}_{n},\sqrt{\mathbf{W}_{n}})$ is controllable), which means that the error covariance matrix converges to a constant, i.e., $\mathbf{P}_{n,k}^{s}=\bar{\mathbf{P}}_{n},\forall k,n$.
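
As a concrete illustration, the sketch below (our own code, not from the paper; it assumes the steady-state posterior covariance is what is meant by $\bar{\mathbf{P}}_{n}$) computes the constant covariance by solving the discrete algebraic Riccati equation with SciPy.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def steady_state_covariance(A, C, W, V):
    """Steady-state Kalman filter error covariances for x_{k+1} = A x_k + w, y_k = C x_k + v.

    Returns the prior (prediction) covariance S and the posterior covariance P_bar,
    i.e., the constant value that P^s_{n,k} converges to in the steady state."""
    # Prior covariance: solution of the filtering DARE (dual of the control DARE).
    S = solve_discrete_are(A.T, C.T, W, V)
    # Posterior (measurement-updated) covariance via the steady-state Kalman gain.
    K = S @ C.T @ np.linalg.inv(C @ S @ C.T + V)
    P_bar = (np.eye(A.shape[0]) - K @ C) @ S
    return S, P_bar

# Example with the paper's dimensions l_n = 2, e_n = 1 (matrix values are illustrative only).
A = np.array([[1.1, 0.2], [0.0, 1.05]])
C = np.array([[1.0, 0.0]])
W, V = np.eye(2), np.eye(1)
_, P_bar = steady_state_covariance(A, C, W, V)
print(np.trace(P_bar))
```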

II-B Wireless Communications and Remote State Estimation

There are $M$ wireless channels (e.g., subcarriers) for the $N$ sensors' transmissions, where $M\leq N$. We consider i.i.d. block fading channels, where the channel quality is fixed during each packet transmission and varies from packet to packet independently. Let the $N\times M$ matrix $\mathbf{H}_{k}$ denote the channel state of the system at time $k$, where the element in the $n$th row and $m$th column, $h_{n,m,k}\in\mathcal{H}\triangleq\{1,2,\dots,\bar{h}\}$, represents the channel state between sensor $n$ and the remote estimator on channel $m$. In particular, there are $\bar{h}$ quantized channel states in total. The packet drop probability at channel state $i\in\mathcal{H}$ is $p_{i}$. Without loss of generality, we assume that $p_{1}\geq p_{2}\geq\dots\geq p_{\bar{h}}$. The instantaneous channel state $\mathbf{H}_{k}$ is available at the remote estimator based on standard channel estimation methods. The distribution of $h_{n,m,k}$ is given as

$\operatorname{Pr}(h_{n,m,k}=i)=q^{(n,m)}_{i},\ \forall k,$ (3)

where $\sum_{i=1}^{\bar{h}}q^{(n,m)}_{i}=1,\ \forall n,m$.

Due to the limited number of communication channels, only $M$ out of the $N$ sensors can be scheduled at each time step. Let $a_{n,k}\in\{0,1,2,\dots,M\}$ represent the channel allocation for sensor $n$ at time $k$, where

$a_{n,k}=\begin{cases}0, & \text{if sensor $n$ is not scheduled}\\ m, & \text{if sensor $n$ is scheduled on channel $m$.}\end{cases}$ (4)

In particular, we assume that each sensor can be scheduled to at most one channel and that each channel is assigned to exactly one sensor [10]. Then, the constraints on $a_{n,k}$ are given as

$\sum_{m=1}^{M}\mathbbm{1}\left(a_{n,k}=m\right)\leq 1,\quad \sum_{n=1}^{N}\mathbbm{1}\left(a_{n,k}=m\right)=1,$ (5)

where $\mathbbm{1}(\cdot)$ is the indicator function.
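
For clarity, a small sketch (illustrative only; the helper name is ours) of how constraint (5) can be checked for a candidate allocation vector:

```python
def satisfies_constraint_5(a, M):
    """Check the scheduling constraint (5): each sensor uses at most one channel
    (automatic, since a[n] is a single index) and each channel m in {1,...,M}
    is assigned to exactly one sensor."""
    return all(sum(1 for a_n in a if a_n == m) == 1 for m in range(1, M + 1))

# Example: N = 4 sensors, M = 2 channels; sensors 0 and 2 use channels 1 and 2.
print(satisfies_constraint_5([1, 0, 2, 0], M=2))  # True
print(satisfies_constraint_5([1, 1, 0, 0], M=2))  # False: channel 1 assigned twice
```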

Considering the scheduling actions and packet dropouts, sensor $n$'s estimate may not be received by the remote estimator in every time slot. We define the packet reception indicator as

$\eta_{n,k}=\begin{cases}1, & \text{if sensor $n$'s packet is received at time $k$}\\ 0, & \text{otherwise.}\end{cases}$

Considering the randomness of $\eta_{n,k}$ and assuming that the remote estimator performs state estimation at the beginning of each time slot, the optimal estimator in terms of estimation MSE is given as

$\hat{\mathbf{x}}_{n,k+1}=\begin{cases}\mathbf{A}_{n}\mathbf{x}_{n,k}^{s}, & \text{if $\eta_{n,k}=1$}\\ \mathbf{A}_{n}\hat{\mathbf{x}}_{n,k}, & \text{otherwise,}\end{cases}$ (8)

where $\mathbf{A}_{n}$ is the system matrix of process $n$ defined in (1). If sensor $n$'s packet is not received, the remote estimator propagates its previous estimate to estimate the current state. From (1) and (8), we derive the estimation error covariance as

$\mathbf{P}_{n,k+1}\triangleq\mathbb{E}\left[(\hat{\mathbf{x}}_{n,k+1}-\mathbf{x}_{n,k+1})(\hat{\mathbf{x}}_{n,k+1}-\mathbf{x}_{n,k+1})^{\top}\right]=\begin{cases}\mathbf{A}_{n}\bar{\mathbf{P}}_{n}\mathbf{A}^{\top}_{n}+\mathbf{W}_{n}, & \text{if $\eta_{n,k}=1$}\\ \mathbf{A}_{n}\mathbf{P}_{n,k}\mathbf{A}^{\top}_{n}+\mathbf{W}_{n}, & \text{otherwise,}\end{cases}$ (12)

where $\bar{\mathbf{P}}_{n}$ is the local estimation error covariance of sensor $n$ defined in (2).

Let $\tau_{n,k}\in\{1,2,\dots\}$ denote the age of information (AoI) of sensor $n$ at time $k$, which measures the time elapsed since the latest packet from sensor $n$ was received. Then, we have

$\tau_{n,k+1}=\begin{cases}1, & \text{if $\eta_{n,k}=1$}\\ \tau_{n,k}+1, & \text{otherwise.}\end{cases}$ (13)

If a sensor is frequently scheduled at good channels, the corresponding average AoI is small. However, due to the scheduling constraint (5), this is often not possible.

From (12) and (13), the error covariance can be simplified as

$\mathbf{P}_{n,k}=f_{n}^{\tau_{n,k}}(\bar{\mathbf{P}}_{n}),$ (14)

where $f_{n}(\mathbf{X})=\mathbf{A}_{n}\mathbf{X}\mathbf{A}_{n}^{\top}+\mathbf{W}_{n}$ and $f^{\tau+1}_{n}(\cdot)=f_{n}(f^{\tau}_{n}(\cdot))$. It has been proved that the estimation MSE, i.e., $\operatorname{Tr}(\mathbf{P}_{n,k})$, increases monotonically with the AoI state $\tau_{n,k}$ [14].
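
The sketch below (names are ours) evaluates $\operatorname{Tr}(f_{n}^{\tau}(\bar{\mathbf{P}}_{n}))$ for increasing AoI, which is how the per-sensor cost can be computed in practice; with an unstable $\mathbf{A}_{n}$ the trace grows monotonically, as noted above.

```python
import numpy as np

def mse_for_aoi(A, W, P_bar, tau):
    """Return Tr(P_{n,k}) = Tr(f_n^tau(P_bar)) with f_n(X) = A X A^T + W, tau >= 1."""
    P = P_bar
    for _ in range(tau):
        P = A @ P @ A.T + W
    return np.trace(P)

# Illustrative values only (P_bar would come from the steady-state Kalman filter).
A = np.array([[1.1, 0.2], [0.0, 1.05]])
W = np.eye(2)
P_bar = np.eye(2)
print([round(mse_for_aoi(A, W, P_bar, t), 2) for t in range(1, 6)])  # monotonically increasing
```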

III Problem Formulation and Threshold Structure

In this paper, we aim to find a dynamic scheduling policy $\pi(\cdot)$ that uses the AoI states of all sensors, as well as the channel states, to minimize the expected total discounted estimation MSE of all $N$ processes over the infinite time horizon.

Problem 1.
$\max_{\pi}\ \lim_{T\to\infty}\mathbb{E}^{\pi}\left[\sum_{k=1}^{T}\sum_{n=1}^{N}-\gamma^{k}\operatorname{Tr}(\mathbf{P}_{n,k})\right],$ (15)

where $\mathbb{E}^{\pi}$ denotes the expectation when adopting the policy $\pi$ and $\gamma<1$ is the discount factor.

Problem 1 is a sequential decision-making problem with the Markovian property. This is because the expected estimation MSE depends only on the AoI state $\tau_{n,k}$ through (14), which evolves according to the Markovian recursion (13), and the channel states are i.i.d. Therefore, we formulate Problem 1 as a Markov decision process (MDP).

III-A MDP Formulation

We define the four elements of the MDP as below.

1) The state of the MDP is defined as $\mathbf{s}_{k}\triangleq\left(\bm{\tau}_{k},\mathbf{H}_{k}\right)\in\mathbb{N}^{N}\times\mathcal{H}^{N\times M}$, where $\bm{\tau}_{k}=(\tau_{1,k},\tau_{2,k},\dots,\tau_{N,k})\in\mathbb{N}^{N}$ is the AoI state vector. Thus, $\mathbf{s}_{k}$ takes into account both the AoI and the channel states.

2) The overall scheduling action of the $N$ sensors is defined as $\mathbf{a}_{k}=(a_{1,k},a_{2,k},\dots,a_{N,k})\in\left\{0,1,2,\dots,M\right\}^{N}$ under the constraint (5). There are $N!/(N-M)!$ actions in total. The stationary policy $\pi(\cdot)$ is a mapping from states to actions, i.e., $\mathbf{a}_{k}=\pi(\mathbf{s}_{k})$.

3) The transition probability $\operatorname{Pr}(\mathbf{s}_{k+1}|\mathbf{s}_{k},\mathbf{a}_{k})$ is the probability of the next state $\mathbf{s}_{k+1}$ given the current state $\mathbf{s}_{k}$ and the action $\mathbf{a}_{k}$. Since we adopt stationary scheduling policies, the state transition is independent of the time index. Thus, we drop the subscript $k$ and use $\mathbf{s}$ and $\mathbf{s}^{+}$ to denote the current and next states, respectively. Due to the i.i.d. fading channel states, we have $\operatorname{Pr}(\mathbf{s}^{+}|\mathbf{s},\mathbf{a})=\operatorname{Pr}(\bm{\tau}^{+}|\bm{\tau},\mathbf{H},\mathbf{a})\operatorname{Pr}(\mathbf{H}^{+})$, where $\operatorname{Pr}(\mathbf{H}^{+})$ can be obtained from (3) and $\operatorname{Pr}(\bm{\tau}^{+}|\bm{\tau},\mathbf{H},\mathbf{a})=\prod_{n=1}^{N}\operatorname{Pr}(\tau^{+}_{n}|\tau_{n},\mathbf{H},a_{n})$, which follows from (3) and (13):

$\operatorname{Pr}(\tau^{+}_{n}|\tau_{n},\mathbf{H},a_{n})=\begin{cases}1-p_{h_{n,m}}, & \text{if $\tau^{+}_{n}=1,\ a_{n}=m$}\\ p_{h_{n,m}}, & \text{if $\tau^{+}_{n}=\tau_{n}+1,\ a_{n}=m$}\\ 1, & \text{if $\tau^{+}_{n}=\tau_{n}+1,\ a_{n}=0$}\\ 0, & \text{otherwise.}\end{cases}$ (20)

4) The immediate reward of Problem 1 at time $k$ is defined as $\sum_{n=1}^{N}-\operatorname{Tr}(\mathbf{P}_{n,k})$. Since $\mathbf{P}_{n,k}$ defined in (14) is a function of $\tau_{n,k}$, the reward can be written as $r(\mathbf{s}_{k})$, which decreases monotonically with the AoI states.
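
To make the MDP concrete, the following sketch (our own illustrative environment step, not code from the paper) samples the AoI transition per (20) and computes the reward; the array interfaces and the clamping of the drop-probability lookup are our assumptions.

```python
import numpy as np

def step_aoi(tau, H, a, p_drop, rng):
    """One AoI transition per (20).

    tau: length-N numpy array of AoI values; H: N x M numpy array of channel states
    in {1,...,h_bar}; a: length-N action with a[n] in {0,...,M}; p_drop[i-1] is the
    drop probability p_i of channel state i."""
    tau_next = tau + 1                          # default: no packet received
    for n, m in enumerate(a):
        if m > 0:                               # sensor n scheduled on channel m
            if rng.random() > p_drop[H[n, m - 1] - 1]:
                tau_next[n] = 1                 # packet received
    return tau_next

def reward(tau, A_list, W_list, P_bar_list):
    """Immediate reward r(s_k) = -sum_n Tr(P_{n,k}), with P_{n,k} = f_n^{tau_n}(P_bar_n)."""
    total = 0.0
    for A, W, P_bar, t in zip(A_list, W_list, P_bar_list, tau):
        P = P_bar
        for _ in range(int(t)):
            P = A @ P @ A.T + W                 # f_n(X) = A X A^T + W, applied tau_n times
        total += np.trace(P)
    return -total
```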

III-B Threshold Policies

Many works have proved that optimal scheduling policies have the threshold structure in Definition 1 for different problems in which the reward of an individual user is a monotonic function of its state [5, 12, 13]. Our problem has the same property: the estimation MSE of sensor $n$ decreases as its AoI state decreases and as its channel state improves.

To verify whether the optimal policy of Problem 1 has the threshold structure, we consider a system with $N=2$ and $M=1$ and adopt the conventional value iteration algorithm to find the optimal policy. As illustrated in Fig. 2, we observe action switching curves in both the AoI and channel state spaces, and the properties in Definition 1 hold. Due to space limitations, we do not present a formal proof here that the optimal policy has a threshold structure, but will investigate it in our future work. In Section V, our numerical results also show that the optimized threshold policy is near optimal, supporting the conjecture.

Definition 1.

For a threshold policy $\pi$ of Problem 1, if channel $m$ is assigned to sensor $n$ at the state $\mathbf{s}=(\bm{\tau},\mathbf{H})$, then the following two properties hold:

(i) for the state $\mathbf{s}^{\prime}=(\bm{\tau},\mathbf{H}^{\prime}_{n,m})$, where $\mathbf{H}^{\prime}_{n,m}$ is identical to $\mathbf{H}$ except for the sensor-$n$-channel-$m$ state with $h^{\prime}_{n,m}>h_{n,m}$, channel $m$ is still assigned to sensor $n$;

(ii) for the state $\mathbf{s}^{\prime}=(\bm{\tau}^{\prime}_{(n)},\mathbf{H})$, where $\bm{\tau}^{\prime}_{(n)}$ is identical to $\bm{\tau}$ except for sensor $n$'s AoI with $\tau^{\prime}_{n}>\tau_{n}$, either channel $m$ or a better channel is assigned to sensor $n$.

For property (i), given that sensor $n$ is scheduled on channel $m$ at a certain state, if the channel quality $h_{n,m}$ improves while the AoI and the other channel states remain the same, then the threshold policy still assigns channel $m$ to sensor $n$. This property is reasonable, since the quality of sensor $n$'s link to channel $m$ is the only factor that has changed, and it has improved.

For property (ii), if the AoI state of sensor $n$ increases while the other states remain the same, the policy must schedule sensor $n$ on a channel that is no worse than the previous one. Such a scheduling policy makes sense, since a larger AoI of sensor $n$ leads to a lower reward, requiring a better channel for transmission to improve the reward effectively.
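
To illustrate how such a structural property can be tested on a learned policy, the sketch below (our own helper; the tabular state encoding is an assumption) flags violations of property (ii) for a policy stored as a dictionary.

```python
def property_ii_violations(policy):
    """Check property (ii) of Definition 1 for a tabular policy.

    `policy` maps states (tau, H) -> action, where tau is a tuple of AoI values,
    H is a tuple of per-sensor tuples of channel states, and action[n] is in {0,...,M}.
    Returns the list of violating (state, state') pairs found in the table."""
    violations = []
    for (tau, H), a in policy.items():
        for n, m in enumerate(a):
            if m == 0:
                continue
            # State s' identical to s except sensor n's AoI increased by one.
            tau_up = tau[:n] + (tau[n] + 1,) + tau[n + 1:]
            a_up = policy.get((tau_up, H))
            if a_up is None:
                continue  # state not tabulated
            m_up = a_up[n]
            # Sensor n must get channel m or a better one (higher channel state).
            if m_up == 0 or H[n][m_up - 1] < H[n][m - 1]:
                violations.append(((tau, H), (tau_up, H)))
    return violations
```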

In the following, we will develop novel schemes that leverage the threshold structure to guide DRL algorithms for solving Problem 1 with improved training speed and performance.

Figure 2: Structure of the optimal scheduling policy with $N=2$ and $M=1$: (a) schedule actions over the AoI states; (b) schedule actions over the channel states. The markers $\bullet$ and $\times$ represent the scheduling of sensor 1 and sensor 2, respectively.

IV Structure-Enhanced DRL

In the literature, DQN and DDPG are the most commonly adopted off-policy DRL algorithms for solving optimal scheduling problems (see [8, 10] and references therein), and they provide significant performance improvements over heuristic policies. Next, we develop threshold structure-enhanced (SE) DQN and DDPG for solving Problem 1.

IV-A Structure-Enhanced DQN

We define the Q-value of a policy $\pi$ as the expected discounted sum of rewards

$Q^{\pi}(\mathbf{s}_{k},\mathbf{a}_{k})=\mathbb{E}\left[\sum_{t=k}^{\infty}\gamma^{t-k}r(\mathbf{s}_{t})\,\Big|\,\mathbf{a}_{k},\ \mathbf{a}_{t}=\pi(\mathbf{s}_{t}),\forall t>k\right],$ (21)

which measures the long-term performance of policy $\pi$ given the current state-action pair $(\mathbf{s}_{k},\mathbf{a}_{k})$. Let $\pi^{*}$ denote the optimal policy of Problem 1. Then, the corresponding Q-value can, in principle, be obtained by solving the Bellman equation [16]

$Q^{*}(\mathbf{s}_{k},\mathbf{a}_{k})=r(\mathbf{s}_{k})+\mathbb{E}_{\mathbf{s}_{k+1}}\left[\gamma\max_{\mathbf{a}_{k+1}}Q^{*}\left(\mathbf{s}_{k+1},\mathbf{a}_{k+1}\right)\right].$ (22)

From the definition (21), the optimal policy generates the best action achieving the highest Q-value, and can be written as

$\mathbf{a}^{*}_{k}=\pi^{*}(\mathbf{s}_{k})=\mathop{\arg\max}_{\mathbf{a}_{k}}Q^{*}(\mathbf{s}_{k},\mathbf{a}_{k}).$ (23)

A well-trained DQN uses a neural network (NN) with parameter set $\bm{\theta}$ to approximate the Q-values of the optimal policy in (23) as $Q(\mathbf{s}_{k},\mathbf{a}_{k};\bm{\theta})$ and uses them to find the optimal action. Considering the action space defined in Section III-A, the DQN has $N!/(N-M)!$ Q-value outputs, one for each state-action pair. To train the DQN, one samples data (consisting of states, actions, rewards, and next states), defines a loss function of $\bm{\theta}$ based on the collected data, and minimizes the loss function to find the optimized $\bm{\theta}$. However, the conventional DQN training method does not utilize the structure of the optimal policy.
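
A minimal sketch of such a Q-network follows (the layer sizes and the action enumeration are our own assumptions, not taken from the paper): the input is the flattened state $(\bm{\tau},\mathbf{H})$ of dimension $N+NM$ and the output has one Q-value per feasible scheduling action.

```python
import itertools
import torch
import torch.nn as nn

N, M = 6, 3                      # numbers of sensors and channels (example values)
# Enumerate the N!/(N-M)! feasible actions: each action is an ordered choice of M
# distinct sensors; the sensor in position m-1 gets channel m, all others get 0.
ACTIONS = list(itertools.permutations(range(N), M))

class QNet(nn.Module):
    def __init__(self, n=N, m=M, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n + n * m, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, len(ACTIONS)),   # one Q-value per feasible action
        )

    def forward(self, state):                  # state: (batch, N + N*M)
        return self.net(state)

# Greedy action as in (27): index of the largest Q-value, mapped to a channel allocation.
q_net = QNet()
s = torch.randn(1, N + N * M)
idx = q_net(s).argmax(dim=1).item()
a = [0] * N
for channel, sensor in enumerate(ACTIONS[idx], start=1):
    a[sensor] = channel
print(a)
```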

To utilize the knowledge of the threshold policy structure to enhance the DQN training performance, we propose 1) an SE action selection method based on Definition 1, which selects reasonable actions and hence enhances the data sampling efficiency; and 2) an SE loss function that adds a penalty to sampled actions that do not follow the threshold structure.

Our SE-DQN training algorithm has three stages: 1) the loose SE action selection stage, which utilizes only part of the structural property; 2) the tight SE action selection stage, which utilizes the full structural property; and 3) the conventional DQN stage. The first two stages use the SE action selection schemes and the SE loss function to train the DQN quickly, resulting in a reasonable threshold policy, and the last stage allows further policy exploration. In what follows, we present the loose and tight SE action selection schemes and the SE loss function.

IV-A1 Loose SE action selection

We randomly select an action $\mathbf{a}^{\epsilon}$ from the entire action space with probability $\epsilon$ for action exploration; with probability $(1-\epsilon)$, we generate the SE action $\hat{\mathbf{a}}$ as described below. For simplicity, we drop the time index when describing action selection.

The threshold structure suggests that the action at $\mathbf{s}$ is correlated with the actions at states with a smaller AoI or channel state. Thus, one can infer the action from the action at the state with a smaller channel state, or a smaller AoI state, based on property (i), or (ii), of Definition 1, respectively. We only consider property (ii) for loose SE action selection, as it is difficult to find actions that satisfy both structural properties (i) and (ii) at the beginning of training. We utilize both (i) and (ii) in the tight SE action selection stage.

Given the state $\mathbf{s}=(\bm{\tau},\mathbf{H})$, we define the state with a smaller AoI for sensor $n$ as $\dot{\mathbf{s}}^{(n)}=(\dot{\bm{\tau}}^{(n)},\mathbf{H})$, where

$\dot{\bm{\tau}}^{(n)}=(\tau_{1},\dots,\tau_{n}-1,\dots,\tau_{N}).$ (24)

For each $n$, we calculate the corresponding action based on the Q-values as

$\dot{\mathbf{a}}^{(n)}\triangleq(\dot{a}^{(n)}_{1},\dots,\dot{a}^{(n)}_{n},\dots,\dot{a}^{(n)}_{N})=\mathop{\arg\max}_{\dot{\mathbf{a}}}Q(\dot{\mathbf{s}}^{(n)},\dot{\mathbf{a}};\bm{\theta}).$ (25)

Recall that $\dot{a}^{(n)}_{n}\in\{0,1,\dots,M\}$ is the channel index assigned to sensor $n$ at the state $\dot{\mathbf{s}}^{(n)}$.

If $\dot{a}^{(n)}_{n}>0$, then property (ii) implies that channel $\dot{a}^{(n)}_{n}$ or a better channel is assigned to sensor $n$ at the state $\mathbf{s}$. We define the set of channels with better quality as

$\mathcal{M}^{(n)}=\{m^{\prime}\,|\,h_{n,m^{\prime}}>h_{n,\dot{a}^{(n)}_{n}},\ m^{\prime}=1,\dots,M\}.$ (26)

Then, the SE action for sensor $n$, denoted $\hat{a}_{n}$, is randomly chosen from the set $\mathcal{M}^{(n)}$ with probability $\xi$, and is equal to $\dot{a}^{(n)}_{n}$ with probability $1-\xi$.

If $\dot{a}^{(n)}_{n}=0$, then property (ii) cannot help determine the action. In this case, we define the action selected by the greedy policy (i.e., the conventional DQN method) at the state $\mathbf{s}$ as

$\tilde{\mathbf{a}}\triangleq(\tilde{a}_{1},\tilde{a}_{2},\dots,\tilde{a}_{N})=\mathop{\arg\max}_{\mathbf{a}}Q(\mathbf{s},\mathbf{a};\bm{\theta}).$ (27)

Thus, we set the SE action of sensor $n$ identical to the one generated by the conventional DQN method, i.e., $\hat{a}_{n}=\tilde{a}_{n}$.

Now we can define the SE action for the $N$ sensors as $\hat{\mathbf{a}}=(\hat{a}_{1},\hat{a}_{2},\dots,\hat{a}_{N})$. If this action meets the constraint

$\sum_{m=1}^{M}\mathbbm{1}\left(\hat{a}_{n}=m\right)\leq 1,\quad \sum_{n=1}^{N}\mathbbm{1}\left(\hat{a}_{n}=m\right)\leq 1,$ (28)

which is less restrictive than (5), we select the action $\mathbf{a}$ as $\hat{\mathbf{a}}$ and assign the unused channels randomly to unscheduled sensors; otherwise, $\mathbf{a}$ is set to the conventional greedy action $\tilde{\mathbf{a}}$.
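
A compact sketch of the loose SE action selection described above (our own implementation under assumed interfaces: `greedy_action(tau, H)` returns the arg-max action of the current Q-network as in (27), and the clamping of the reduced AoI at 1 is our assumption, since (24) is only used when it yields a valid state).

```python
import random

def loose_se_action(tau, H, greedy_action, epsilon, xi, M, rng=random):
    """Loose SE action selection (Sec. IV-A1), sketched under our own interfaces.

    tau: length-N list of AoI values; H: N x M list of channel states;
    greedy_action(tau, H) -> length-N list with entries in {0,...,M}."""
    N = len(tau)
    if rng.random() < epsilon:                       # random exploration with prob. epsilon
        scheduled = rng.sample(range(N), M)
        a = [0] * N
        for ch, n in enumerate(scheduled, start=1):
            a[n] = ch
        return a

    a_hat = [0] * N
    for n in range(N):
        tau_dot = list(tau)
        tau_dot[n] = max(1, tau[n] - 1)              # state with a smaller AoI for sensor n, eq. (24)
        m = greedy_action(tau_dot, H)[n]             # channel given to sensor n there, eq. (25)
        if m > 0:
            better = [mp for mp in range(1, M + 1) if H[n][mp - 1] > H[n][m - 1]]  # eq. (26)
            a_hat[n] = rng.choice(better) if (better and rng.random() < xi) else m
        else:
            a_hat[n] = greedy_action(tau, H)[n]      # fall back to the greedy choice
    # Relaxed constraint (28): no channel used more than once.
    used = [m for m in a_hat if m > 0]
    if len(used) == len(set(used)):
        unused = [m for m in range(1, M + 1) if m not in used]
        idle = [n for n in range(N) if a_hat[n] == 0]
        for m, n in zip(unused, rng.sample(idle, len(unused))):
            a_hat[n] = m                             # assign unused channels randomly
        return a_hat
    return greedy_action(tau, H)                     # otherwise take the greedy action
```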

IV-A2 Tight SE action selection

Using the loose SE action selection, we first infer the scheduling action of sensor $n$ at the state $\mathbf{s}$ from the action at the state with a smaller AoI, $\dot{\mathbf{s}}^{(n)}$. Then, we check whether the loose SE action satisfies the threshold property (i) as follows.

For notational simplicity, we use $m>0$ to denote the SE channel selection for sensor $n$. Given the state $\mathbf{s}$, we define the state with a smaller channel-$m$ state for sensor $n$ as $\ddot{\mathbf{s}}^{(n,m)}=(\bm{\tau},\ddot{\mathbf{H}}^{(n,m)})$, where $\ddot{\mathbf{H}}^{(n,m)}$ and $\mathbf{H}$ are identical except for the element $\ddot{h}^{(n,m)}_{n,m}=h_{n,m}-1$. Then, we calculate the corresponding action

$\ddot{\mathbf{a}}^{(n,m)}\triangleq(\ddot{a}^{(n,m)}_{1},\dots,\ddot{a}^{(n,m)}_{n},\dots,\ddot{a}^{(n,m)}_{N})=\mathop{\arg\max}_{\ddot{\mathbf{a}}}Q(\ddot{\mathbf{s}}^{(n,m)},\ddot{\mathbf{a}};\bm{\theta}).$ (29)

From the structural properties (i) and (ii), the scheduling action $\ddot{a}^{(n,m)}_{n}$ should be identical to $m$. Thus, if $\ddot{a}^{(n,m)}_{n}=m$, then the SE action for sensor $n$ is $\hat{a}_{n}=m$; otherwise, $\hat{a}_{n}$ is set to the conventional DQN action. The SE action satisfying the structural properties (i) and (ii) is $\hat{\mathbf{a}}=(\hat{a}_{1},\hat{a}_{2},\dots,\hat{a}_{N})$. If this action meets the constraint (5), then it is executed, i.e., $\mathbf{a}=\hat{\mathbf{a}}$; otherwise, we select the greedy action $\mathbf{a}=\tilde{\mathbf{a}}$.

IV-A3 SE loss function

During training, each transition $(\mathbf{s}_{t},\mathbf{a}_{t},\hat{\mathbf{a}}_{t},\tilde{\mathbf{a}}_{t},r_{t},\mathbf{s}_{t+1})$ is stored in a replay memory, where $r_{t}\triangleq r(\mathbf{s}_{t})$ denotes the immediate reward. Different from the conventional DQN, we store both the SE action $\hat{\mathbf{a}}_{t}$ and the greedy action $\tilde{\mathbf{a}}_{t}$, in addition to the executed action $\mathbf{a}_{t}$.

Let $\mathcal{T}_{i}\triangleq(\mathbf{s}_{i},\mathbf{a}_{i},\hat{\mathbf{a}}_{i},\tilde{\mathbf{a}}_{i},r_{i},\mathbf{s}_{i+1})$ denote the $i$th transition of a batch sampled from the replay memory. As in the conventional DQN, we define the temporal-difference (TD) error of $\mathcal{T}_{i}$ as

$\mathsf{TD}_{i}\triangleq y_{i}-Q(\mathbf{s}_{i},\mathbf{a}_{i};\bm{\theta}),$ (30)

where $y_{i}=r_{i}+\gamma\max_{\mathbf{a}_{i+1}}Q(\mathbf{s}_{i+1},\mathbf{a}_{i+1};\bm{\theta})$ is the estimated Q-value at the next step. The TD error measures the gap between the left and right sides of the Bellman equation (22); a larger gap indicates that the approximated Q-values are far from optimal. Different from the DQN, we introduce the action-difference (AD) error below to measure the difference in Q-value between the actions selected by the SE strategy and the greedy strategy:

$\mathsf{AD}_{i}\triangleq Q(\mathbf{s}_{i},\hat{\mathbf{a}}_{i};\bm{\theta})-Q(\mathbf{s}_{i},\tilde{\mathbf{a}}_{i};\bm{\theta}).$ (31)

Since the optimal policy has the threshold structure, the inferred action $\hat{\mathbf{a}}_{i}$ should be identical to the (approximately optimal) greedy action $\tilde{\mathbf{a}}_{i}$. Thus, a well-trained $\bm{\theta}$ should lead to a small difference in (31).

Based on (30) and (31), we define the loss function of $\mathcal{T}_{i}$ as

$L_{i}(\bm{\theta})=\begin{cases}\alpha_{1}\mathsf{TD}^{2}_{i}+(1-\alpha_{1})\mathsf{AD}^{2}_{i}, & \text{if $\mathbf{a}_{i}=\hat{\mathbf{a}}_{i}$}\\ \mathsf{TD}^{2}_{i}, & \text{otherwise,}\end{cases}$ (32)

where $\alpha_{1}$ is a hyperparameter balancing the importance of $\mathsf{TD}_{i}$ and $\mathsf{AD}_{i}$. In other words, if the SE action is executed, both the TD and AD errors are taken into account; otherwise, the conventional TD-error-based loss function is adopted.

Given the batch size $B$, the overall loss function is

$L(\bm{\theta})=\frac{1}{B}\sum_{i=1}^{B}L_{i}(\bm{\theta}).$ (33)
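
For concreteness, a sketch of the SE loss (30)-(33) in PyTorch is given below (our own illustrative tensor interface; `q_net` maps a batch of states to Q-values over an enumerated action list, and the action tensors hold integer indices into that list; the target term is treated as a constant, consistent with the gradient in (35)).

```python
import torch

def se_dqn_loss(q_net, batch, gamma, alpha1):
    """batch: dict of tensors with states s (B, d), next states s_next (B, d),
    rewards r (B,), and action indices a, a_hat, a_tilde (B,) into the action list."""
    q_s = q_net(batch["s"])                                        # (B, num_actions)
    q_taken = q_s.gather(1, batch["a"].unsqueeze(1)).squeeze(1)    # Q(s_i, a_i; theta)
    with torch.no_grad():
        y = batch["r"] + gamma * q_net(batch["s_next"]).max(dim=1).values  # target in (30)
    td = y - q_taken                                               # TD error (30)
    q_hat = q_s.gather(1, batch["a_hat"].unsqueeze(1)).squeeze(1)
    q_tilde = q_s.gather(1, batch["a_tilde"].unsqueeze(1)).squeeze(1)
    ad = q_hat - q_tilde                                           # AD error (31)
    se_executed = (batch["a"] == batch["a_hat"]).float()           # 1 if the SE action was executed
    loss_i = se_executed * (alpha1 * td**2 + (1 - alpha1) * ad**2) + (1 - se_executed) * td**2
    return loss_i.mean()                                           # batch average, eq. (33)
```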

To optimize $\bm{\theta}$, we adopt the well-known gradient descent method and calculate the gradient as

$\nabla_{\bm{\theta}}L(\bm{\theta})=\frac{1}{B}\sum_{i=1}^{B}\nabla_{\bm{\theta}}L_{i}(\bm{\theta}),$ (34)

where $\nabla_{\bm{\theta}}L_{i}(\bm{\theta})$ is given as

$\nabla_{\bm{\theta}}L_{i}(\bm{\theta})=\begin{cases}-2\big((1-\alpha_{1})\mathsf{AD}_{i}\nabla_{\bm{\theta}}\left(Q(\mathbf{s}_{i},\hat{\mathbf{a}}_{i};\bm{\theta})-Q(\mathbf{s}_{i},\tilde{\mathbf{a}}_{i};\bm{\theta})\right)+\alpha_{1}\mathsf{TD}_{i}\nabla_{\bm{\theta}}Q(\mathbf{s}_{i},\mathbf{a}_{i};\bm{\theta})\big), & \text{if $\mathbf{a}_{i}=\hat{\mathbf{a}}_{i}$}\\ -2\,\mathsf{TD}_{i}\nabla_{\bm{\theta}}Q(\mathbf{s}_{i},\mathbf{a}_{i};\bm{\theta}), & \text{otherwise.}\end{cases}$ (35)

The details of the SE-DQN algorithm are given in Algorithm 1.

Algorithm 1 SE-DQN for sensor scheduling in the remote estimation system
1: Initialize replay memory $\mathcal{D}$ to capacity $K$
2: Initialize the policy network with random weights $\bm{\theta}$
3: Initialize the target network with weights $\hat{\bm{\theta}}=\bm{\theta}$
4: for episode $=1,2,\dots,E_{1}$ do
5:    Initialize state $\mathbf{s}_{0}$
6:    for $t=1,2,\dots,T$ do
7:         Generate $\hat{\mathbf{a}}_{t}$ and select action $\mathbf{a}_{t}$ by the loose SE action selection method of the SE-DQN
8:         Execute action $\mathbf{a}_{t}$ and observe $r_{t}$ and $\mathbf{s}_{t+1}$
9:         Compute action $\tilde{\mathbf{a}}_{t}=\mathop{\arg\max}_{\mathbf{a}}Q(\mathbf{s}_{t},\mathbf{a};\bm{\theta})$
10:        Store transition $(\mathbf{s}_{t},\mathbf{a}_{t},\hat{\mathbf{a}}_{t},\tilde{\mathbf{a}}_{t},r_{t},\mathbf{s}_{t+1})$ in $\mathcal{D}$
11:        Sample a random batch of transitions $(\mathbf{s}_{i},\mathbf{a}_{i},\hat{\mathbf{a}}_{i},\tilde{\mathbf{a}}_{i},r_{i},\mathbf{s}_{i+1})$
12:        Calculate $\mathsf{TD}_{i}$ and $\mathsf{AD}_{i}$ based on (30) and (31)
13:        Update $\bm{\theta}$ according to (34)
14:        Every $t^{\prime}$ steps, set $\hat{\bm{\theta}}=\bm{\theta}$
15:    end for
16: end for
17: for episode $=E_{1},\dots,E_{2}$ do
18:    Repeat lines 5 to 15 with the tight SE action selection method
19: end for
20: for episode $=E_{2},\dots,E$ do
21:    Execute the conventional DQN algorithm as in [8]
22: end for

IV-B Structure-Enhanced DDPG

Different from the DQN, which uses one NN to estimate the Q-value, a DDPG agent has two NNs [17]: a critic NN with parameter set $\bm{\theta}$ and an actor NN with parameter set $\bm{\mu}$. In particular, the actor NN approximates the optimal policy $\pi^{*}(\mathbf{s})$ by $\mu(\mathbf{s};\bm{\mu})$, while the critic NN approximates the Q-value of the optimal policy, $Q^{*}(\mathbf{s},\mathbf{a})$, by $Q(\mathbf{s},\mathbf{a};\bm{\theta})$. In general, the critic NN judges whether the action generated by the actor NN is good or not, and the actor is improved based on the critic. The critic NN is updated by minimizing the TD error, similar to the DQN. The actor-critic framework enables the DDPG to solve MDPs with continuous and large action spaces, which cannot be handled by the DQN.

To solve our scheduling problem with a discrete action space, we adopt an action mapping scheme similar to the one in [10]. The direct output of the actor NN consists of $N$ continuous values, ranging from $-1$ to $1$, corresponding to sensors $1$ to $N$, respectively (recall that the DQN has $N!/(N-M)!$ outputs). The $N$ values are sorted in descending order, and the $M$ highest-ranked sensors are assigned to channels $1$ to $M$, respectively. The corresponding ranking values are then linearly normalized to $[-1,1]$ to form the final output of the actor NN, referred to as the virtual action $\mathbf{v}$. The virtual action $\mathbf{v}$ and the real scheduling action $\mathbf{a}$ can thus be mapped to each other directly. Therefore, we use the virtual action $\mathbf{v}$, instead of the real action $\mathbf{a}$, when presenting the SE-DDPG algorithm.
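
A sketch of this action mapping follows (our own illustrative implementation of the scheme described above; the normalization that produces the virtual action $\mathbf{v}$ is omitted for brevity).

```python
import numpy as np

def actor_output_to_schedule(raw, M):
    """Map the actor NN's raw output (length-N vector in [-1, 1]) to a scheduling action.

    The M sensors with the largest raw values are assigned to channels 1,...,M in
    that order; the remaining sensors are not scheduled (action 0)."""
    order = np.argsort(-np.asarray(raw))          # sensors ranked by descending output
    a = np.zeros(len(raw), dtype=int)
    for channel, sensor in enumerate(order[:M], start=1):
        a[sensor] = channel
    return a

# Example: N = 4 sensors, M = 2 channels.
print(actor_output_to_schedule([0.3, -0.8, 0.9, 0.1], M=2))  # sensor 2 -> channel 1, sensor 0 -> channel 2
```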

Similar to the SE-DQN, the SE-DDPG has a loose SE stage, a tight SE stage, and a conventional DDPG stage. The $i$th sampled transition is denoted as

$\mathcal{T}_{i}\triangleq(\mathbf{s}_{i},\mathbf{v}_{i},\hat{\mathbf{v}}_{i},\tilde{\mathbf{v}}_{i},r_{i},\mathbf{s}_{i+1}).$ (36)

We present the SE action selection method and the SE loss function in the sequel.

IV-B1 SE action selection

The general action selection approach for the SE-DDPG is identical to that of the SE-DQN, by simply converting $\mathbf{v}_{i}$, $\mathbf{v}^{\epsilon}_{i}$, $\tilde{\mathbf{v}}_{i}$, and $\hat{\mathbf{v}}_{i}$ to $\mathbf{a}_{i}$, $\mathbf{a}^{\epsilon}_{i}$, $\tilde{\mathbf{a}}_{i}$, and $\hat{\mathbf{a}}_{i}$, respectively. Different from the DQN with $\epsilon$-greedy actions, the action generated by the DDPG based on the current state is

$\tilde{\mathbf{v}}_{i}=\mu(\mathbf{s}_{i};\bm{\mu}),$ (37)

and the random action $\mathbf{v}^{\epsilon}_{i}$ is generated by adding noise to the original continuous output of the actor NN.

Algorithm 2 SE-DDPG for sensor scheduling in the remote estimation system
1: Initialize replay memory $\mathcal{D}$ to capacity $K$
2: Initialize the actor network and the critic network with random weights $\bm{\mu}$ and $\bm{\theta}$
3: Initialize the target networks with weights $\hat{\bm{\mu}}=\bm{\mu}$ and $\hat{\bm{\theta}}=\bm{\theta}$
4: for episode $=1,2,\dots,E_{1}$ do
5:    Initialize a random noise process $\mathcal{N}$ for action exploration
6:    Initialize state $\mathbf{s}_{0}$
7:    for $t=1,2,\dots,T$ do
8:         Generate $\hat{\mathbf{v}}_{t}$ and select action $\mathbf{v}_{t}$ by the loose SE action selection method of the SE-DDPG
9:         Map the virtual action $\mathbf{v}_{t}$ to the real action $\mathbf{a}_{t}$
10:        Execute action $\mathbf{a}_{t}$ and observe $r_{t}$ and $\mathbf{s}_{t+1}$
11:        Compute action $\tilde{\mathbf{v}}_{t}=\mu(\mathbf{s}_{t};\bm{\mu})$
12:        Store transition $(\mathbf{s}_{t},\mathbf{v}_{t},\hat{\mathbf{v}}_{t},\tilde{\mathbf{v}}_{t},r_{t},\mathbf{s}_{t+1})$ in $\mathcal{D}$
13:        Sample a random batch of transitions $(\mathbf{s}_{i},\mathbf{v}_{i},\hat{\mathbf{v}}_{i},\tilde{\mathbf{v}}_{i},r_{i},\mathbf{s}_{i+1})$
14:        Calculate $\mathsf{TD}_{i}$ and $\mathsf{AD}_{i}$ based on (30) and (31), but with the virtual actions $\mathbf{v}_{i},\hat{\mathbf{v}}_{i},\tilde{\mathbf{v}}_{i}$
15:        Update $\bm{\theta}$ according to (34)
16:        Update $\bm{\mu}$ according to (42)
17:        Update the target networks:
$\hat{\bm{\mu}} \leftarrow \delta\bm{\mu}+(1-\delta)\hat{\bm{\mu}}$ (38)
$\hat{\bm{\theta}} \leftarrow \delta\bm{\theta}+(1-\delta)\hat{\bm{\theta}}$ (39)
18:    end for
19: end for
20: for episode $=E_{1},\dots,E_{2}$ do
21:    Repeat lines 5 to 18 with the tight SE action selection method
22: end for
23: for episode $=E_{2},\dots,E$ do
24:    Execute the conventional DDPG algorithm as in [17]
25: end for

IV-B2 SE loss function

Different from the SE-DQN, the SE-DDPG needs different loss functions for updating the critic NN and the actor NN. For the critic NN, we use the same loss function as in (33), and thus the gradient for the critic NN update is identical to (34). Note that for the DDPG, the next virtual action $\tilde{\mathbf{v}}_{i+1}$ is the direct output of the actor NN given the next state $\mathbf{s}_{i+1}$, i.e., $\tilde{\mathbf{v}}_{i+1}=\mu(\mathbf{s}_{i+1};\bm{\mu})$. Thus, when calculating the TD error (30), we have $y_{i}=r_{i}+\gamma Q(\mathbf{s}_{i+1},\mu(\mathbf{s}_{i+1};\bm{\mu});\bm{\theta})$.

For the actor NN, when the SE action is executed, i.e., $\mathbf{v}_{i}=\hat{\mathbf{v}}_{i}$, we include the difference between the action selected by the SE strategy, $\hat{\mathbf{v}}_{i}$, and that of the actor NN, $\tilde{\mathbf{v}}_{i}$. If the SE action is not selected, the loss function is simply the Q-value of the given state-action pair, as in the conventional DDPG. Given the hyperparameter $\alpha_{2}$, the loss function for the transition $\mathcal{T}_{i}$ is defined as

$L_{i}(\bm{\mu})=\begin{cases}\alpha_{2}Q(\mathbf{s}_{i},\tilde{\mathbf{v}}_{i};\bm{\theta})+(1-\alpha_{2})\left(\mathbf{v}_{i}-\tilde{\mathbf{v}}_{i}\right)^{2}, & \text{if $\mathbf{v}_{i}=\hat{\mathbf{v}}_{i}$}\\ Q(\mathbf{s}_{i},\tilde{\mathbf{v}}_{i};\bm{\theta}), & \text{otherwise,}\end{cases}$ (40)

and hence the overall loss function given the sampled batch is

$L(\bm{\mu})=\frac{1}{B}\sum_{i=1}^{B}L_{i}(\bm{\mu}).$ (41)

By substituting (37) and (40) into (41) and applying the chain rule, we can derive the gradient of the overall loss function with respect to $\bm{\mu}$ as

$\nabla_{\bm{\mu}}L(\bm{\mu})=\frac{1}{B}\sum_{i=1}^{B}\nabla_{\bm{\mu}}L_{i}(\bm{\mu}),$ (42)

where $\nabla_{\bm{\mu}}L_{i}(\bm{\mu})$ is given by

$\nabla_{\bm{\mu}}L_{i}(\bm{\mu})=\begin{cases}\alpha_{2}\nabla_{\tilde{\mathbf{v}}_{i}}Q(\mathbf{s}_{i},\tilde{\mathbf{v}}_{i};\bm{\theta})\nabla_{\bm{\mu}}\mu(\mathbf{s}_{i};\bm{\mu})-2(1-\alpha_{2})\left(\mathbf{v}_{i}-\mu\left(\mathbf{s}_{i};\bm{\mu}\right)\right)\nabla_{\bm{\mu}}\mu(\mathbf{s}_{i};\bm{\mu}), & \text{if $\mathbf{v}_{i}=\hat{\mathbf{v}}_{i}$}\\ \nabla_{\tilde{\mathbf{v}}_{i}}Q(\mathbf{s}_{i},\tilde{\mathbf{v}}_{i};\bm{\theta})\nabla_{\bm{\mu}}\mu(\mathbf{s}_{i};\bm{\mu}), & \text{otherwise.}\end{cases}$ (43)
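
As with the critic loss, a PyTorch sketch of the actor loss (40)-(41) is shown below (our own illustrative interface: `critic(s, v)` is assumed to take the state and the virtual action as two arguments; we follow (40) as written and let autograd produce the gradient rather than applying (43) manually).

```python
import torch

def se_ddpg_actor_loss(actor, critic, batch, alpha2):
    """Actor loss per (40)-(41).

    batch: dict with states s (B, d) and the executed / SE virtual actions v, v_hat (B, N)."""
    v_tilde = actor(batch["s"])                                   # mu(s_i; mu), eq. (37)
    q_term = critic(batch["s"], v_tilde).squeeze(-1)              # Q(s_i, v_tilde; theta)
    mismatch = ((batch["v"] - v_tilde) ** 2).sum(dim=1)           # (v_i - v_tilde)^2 term
    # SE action considered executed if the stored v matches v_hat exactly.
    se_executed = (batch["v"] == batch["v_hat"]).all(dim=1).float()
    loss_i = se_executed * (alpha2 * q_term + (1 - alpha2) * mismatch) \
             + (1 - se_executed) * q_term
    return loss_i.mean()                                          # batch average, eq. (41)
```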

The details of the proposed SE-DDPG algorithm are given in Algorithm 2.

V Numerical Experiments

TABLE I: Summary of Hyperparameters
Hyperparameters of SE-DQN and SE-DDPG | Value
Initial values of $\epsilon$ and $\xi$ | 1
Decay rates of $\epsilon$ and $\xi$ | 0.999
Minimum $\epsilon$ and $\xi$ | 0.01
Mini-batch size, $B$ | 128
Experience replay memory size, $K$ | 20000
Discount factor, $\gamma$ | 0.95
Weight of the SE-DQN and critic network loss functions, $\alpha_{1}$ | 0.5
Hyperparameters of SE-DQN
Learning rate | 0.0001
Decay rate of learning rate | 0.001
Target network update frequency | 100
Input dimension of the network $Q(\mathbf{s},\mathbf{a};\bm{\theta})$ | $N+N\times M$
Output dimension of the network $Q(\mathbf{s},\mathbf{a};\bm{\theta})$ | $N!/(N-M)!$
Hyperparameters of SE-DDPG
Learning rate of the actor network | 0.0001
Learning rate of the critic network | 0.001
Decay rate of learning rate | 0.001
Soft parameter for target update, $\delta$ | 0.005
Weight of the actor network loss function, $\alpha_{2}$ | 0.9
Input dimension of the actor network $\mu(\mathbf{s}_{t};\bm{\mu})$ | $N+N\times M$
Output dimension of the actor network $\mu(\mathbf{s}_{t};\bm{\mu})$ | $N$
Input dimension of the critic network $Q(\mathbf{s},\mathbf{a};\bm{\theta})$ | $2N+N\times M$
Output dimension of the critic network $Q(\mathbf{s},\mathbf{a};\bm{\theta})$ | $1$

In this section, we evaluate and compare the performance of the proposed SE-DQN and SE-DDPG with the benchmark DQN and DDPG.

V-A Experiment Setups

Our numerical experiments run on a computing platform with an Intel Core i5 9400F CPU at 2.9 GHz, 16 GB of RAM, and an NVIDIA RTX 2070 GPU. For the remote estimation system, we set the dimensions of the process state and the measurement to $l_{n}=2$ and $e_{n}=1$, respectively. The system matrices $\{\mathbf{A}_{n}\}$ are randomly generated with spectral radii within the range $(1,1.4)$. The entries of $\mathbf{C}_{n}$ are drawn uniformly from the range $(0,1)$, and $\mathbf{W}_{n}$ and $\mathbf{V}_{n}$ are identity matrices. The fading channel state is quantized into $\bar{h}=5$ levels, and the corresponding packet drop probabilities are $0.2$, $0.15$, $0.1$, $0.05$, and $0.01$. The distributions of the channel states of each sensor-channel pair $(n,m)$, i.e., $q^{(n,m)}_{1},\dots,q^{(n,m)}_{\bar{h}}$, are generated randomly.

During training, we use the ADAM optimizer to compute the gradient updates and reset the environment at the start of each episode of $T=500$ steps. The numbers of episodes for the loose SE action, tight SE action, and conventional DRL stages are 50, 100, and 150, respectively. The hyperparameter settings for Algorithm 1 and Algorithm 2 are summarized in Table I.
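
As an illustration of this setup, the sketch below (our own code, with an arbitrary construction) generates a random $2\times 2$ system matrix whose spectral radius lies in $(1, 1.4)$ and a random channel-state distribution for one sensor-channel pair.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unstable_A(dim=2, rho_range=(1.0, 1.4)):
    """Random system matrix rescaled so that its spectral radius falls in rho_range."""
    A = rng.standard_normal((dim, dim))
    rho = max(abs(np.linalg.eigvals(A)))
    target = rng.uniform(*rho_range)
    return A * (target / rho)

def random_channel_distribution(h_bar=5):
    """Random probability vector q_1,...,q_{h_bar} for one sensor-channel pair."""
    q = rng.random(h_bar)
    return q / q.sum()

A_n = random_unstable_A()
print(max(abs(np.linalg.eigvals(A_n))))      # spectral radius in (1, 1.4)
print(random_channel_distribution())         # entries sum to 1
```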

V-B Performance Comparison

Fig. 3 and Fig. 4 illustrate the average sum MSE of all processes during training achieved by the SE-DRL algorithms and the benchmarks under two system settings. Fig. 3 shows that the SE-DQN saves about $50\%$ of the training episodes needed for convergence and also reduces the average sum MSE by $10\%$ compared with the conventional DQN. Fig. 4 shows that the SE-DDPG saves about $30\%$ of the training episodes and reduces the average sum MSE by $10\%$ compared with the conventional DDPG. Also, we see that the conventional DQN and DDPG stages (i.e., the last 150 episodes) in Figs. 3 and 4 do not improve the training performance much. This implies that the SE stages have already found near-optimal policies.

In Table II, we test the performance of the well-trained SE-DRL algorithms for different numbers of sensors and channels and different settings of the system parameters, i.e., $\mathbf{A}_{n}$, $\mathbf{C}_{n}$, and $q^{(n,m)}_{1},\dots,q^{(n,m)}_{\bar{h}}$, based on 10000-step simulations. We see that for 6-sensor-3-channel systems, the SE-DQN reduces the average MSE by $10\%$ to $25\%$ compared with the DQN, while both the DDPG and the SE-DDPG achieve performance similar to the SE-DQN. This suggests that the SE-DQN is almost optimal in this setting and that the DDPG and the SE-DDPG cannot improve much further. In 10-sensor-5-channel systems, neither the DQN nor the SE-DQN converges, and the SE-DDPG achieves a $10\%$ MSE reduction over the DDPG. In particular, the SE-DDPG is the only algorithm that converges in Experiment 12. In 20-sensor-10-channel systems, the SE-DDPG reduces the average MSE by $15\%$ to $25\%$. Therefore, the performance improvement of the SE-DDPG is most significant for large systems.

Figure 3: Average sum MSE of all processes during training with $N=6$, $M=3$.
Figure 4: Average sum MSE of all processes during training with $N=10$, $M=5$.
TABLE II: Performance Comparison of the SE-DRL Algorithms and the Benchmarks in terms of the Average Estimation MSE
System Scale $(N,M)$ | Experiment | DQN | SE-DQN | DDPG | SE-DDPG
$(6,3)$ | 1 | 52.4121 | 47.6766 | 48.4075 | 47.0594
$(6,3)$ | 2 | 67.4247 | 49.8476 | 53.3675 | 44.7423
$(6,3)$ | 3 | 84.1721 | 59.9127 | 56.9504 | 55.2409
$(6,3)$ | 4 | 79.5902 | 65.1640 | 64.4313 | 58.5534
$(10,5)$ | 9 | - | - | 68.0247 | 62.4727
$(10,5)$ | 10 | - | - | 89.6290 | 78.4138
$(10,5)$ | 11 | - | - | 90.2812 | 81.1148
$(10,5)$ | 12 | - | - | - | 147.2844
$(20,10)$ | 13 | - | - | 181.4135 | 159.4321
$(20,10)$ | 14 | - | - | 163.1257 | 142.1850
$(20,10)$ | 15 | - | - | 173.0940 | 144.8166
$(20,10)$ | 16 | - | - | 215.2231 | 160.2304

VI Conclusion

We have developed SE-DRL algorithms for optimal sensor scheduling of remote estimation systems. In particular, we have proposed novel SE action selection methods and SE loss functions to enhance training efficiency. Our numerical results have shown that the proposed structure-enhanced DRL algorithms can save the training time by 50% and reduce the estimation MSE by 10% to 25%, when benchmarked against conventional DRL algorithms. For future work, we will rigorously prove the threshold structure of the optimal scheduling policy for our system. We will also investigate other structural properties of the optimal policy to further enhance the DRL.

References

  • [1] P. Park, S. Coleri Ergen, C. Fischione, C. Lu, and K. H. Johansson, “Wireless network design for control systems: A survey,” IEEE Commun. Surv. Tutor., vol. 20, no. 2, pp. 978–1013, May 2018.
  • [2] K. Huang, W. Liu, Y. Li, B. Vucetic, and A. Savkin, “Optimal downlink-uplink scheduling of wireless networked control for industrial IoT,” IEEE Internet Things J., vol. 7, no. 3, pp. 1756–1772, Mar. 2020.
  • [3] K. Gatsis, M. Pajic, A. Ribeiro, and G. J. Pappas, “Opportunistic control over shared wireless channels,” IEEE Trans. Autom. Control, vol. 60, no. 12, pp. 3140–3155, Mar. 2015.
  • [4] D. Han, J. Wu, H. Zhang, and L. Shi, “Optimal sensor scheduling for multiple linear dynamical systems,” Automatica, vol. 75, pp. 260–270, Jan. 2017.
  • [5] S. Wu, X. Ren, S. Dey, and L. Shi, “Optimal scheduling of multiple sensors over shared channels with packet transmission constraint,” Automatica, vol. 96, pp. 22–31, Oct. 2018.
  • [6] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [7] Z. Zhao, W. Liu, D. E. Quevedo, Y. Li, and B. Vucetic, “Deep learning for wireless networked systems: a joint estimation-control-scheduling approach,” arXiv preprint, Oct. 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2210.00673
  • [8] A. S. Leong, A. Ramaswamy, D. E. Quevedo, H. Karl, and L. Shi, “Deep reinforcement learning for wireless sensor scheduling in cyber–physical systems,” Automatica, vol. 113, pp. 1–8, Mar. 2020. Art. no. 108759.
  • [9] W. Liu, K. Huang, D. E. Quevedo, B. Vucetic, and Y. Li, “Deep reinforcement learning for wireless scheduling in distributed networked control,” submitted to Automatica, 2021. [Online]. Available: https://doi.org/10.48550/arXiv.2109.12562
  • [10] G. Pang, W. Liu, Y. Li, and B. Vucetic, “DRL-based resource allocation in remote state estimation,” arXiv preprint, May 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2205.12267
  • [11] Z. D. Guo and E. Brunskill, “Directed exploration for reinforcement learning,” arXiv preprint, Jun. 2019. [Online]. Available: https://doi.org/10.48550/arXiv.1906.07805
  • [12] S. Wu, K. Ding, P. Cheng, and L. Shi, “Optimal scheduling of multiple sensors over lossy and bandwidth limited channels,” IEEE Trans. Netw. Syst., vol. 7, no. 3, pp. 1188–1200, Jan. 2020.
  • [13] Y.-P. Hsu, E. Modiano, and L. Duan, “Age of information: Design and analysis of optimal scheduling algorithms,” in Proc. IEEE Int. Symp. Inf. Theory, Jun. 2017, pp. 561–565.
  • [14] W. Liu, D. E. Quevedo, Y. Li, K. H. Johansson, and B. Vucetic, “Remote state estimation with smart sensors over Markov fading channels,” IEEE Trans. Autom. Control, vol. 67, no. 6, pp. 2743–2757, Jun. 2022.
  • [15] W. Liu, D. E. Quevedo, K. H. Johansson, B. Vucetic, and Y. Li, “Stability conditions for remote state estimation of multiple systems over multiple Markov fading channels,” IEEE Trans. Autom. Control, early access, Aug. 2022.
  • [16] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” in NIPS Deep Learning Workshop, 2013.
  • [17] T. P. Lillicrap, et al., “Continuous control with deep reinforcement learning,” arXiv preprint, Sep. 2015. [Online]. Available: https://doi.org/10.48550/arXiv.1509.02971