Deep Echo State Q-Network (DEQN) and Its Application in Dynamic Spectrum Sharing for 5G and Beyond
Abstract
Deep reinforcement learning (DRL) has been shown to be successful in many application domains. Combining recurrent neural networks (RNNs) with DRL further enables DRL to operate in non-Markovian environments by capturing temporal information. However, training both DRL agents and RNNs is known to be challenging, requiring a large amount of training data to achieve convergence. In many targeted applications, such as those in fifth generation (5G) cellular communication, the environment is highly dynamic while the available training data is very limited. Therefore, it is extremely important to develop DRL strategies that can capture the temporal correlation of the dynamic environment with limited training overhead. In this paper, we introduce the deep echo state Q-network (DEQN), which can adapt to a highly dynamic environment in a short period of time with limited training data. We evaluate the performance of the introduced DEQN method in the dynamic spectrum sharing (DSS) scenario, a promising technology in 5G and future 6G networks to increase spectrum utilization. Compared to the conventional spectrum management policy that grants a fixed spectrum band to a single system for exclusive access, DSS allows the secondary system to share the spectrum with the primary system. Our work sheds light on the application of an efficient DRL framework in highly dynamic environments with limited available training data.
Index Terms:
Echo state networks, deep reinforcement learning, convergence rate, dynamic spectrum sharing, 5G, and 6G
I Introduction
In the last few years, deep reinforcement learning (DRL) has been widely adopted in different fields, ranging from playing video games [1] and the game of Go [2] to robotics [3]. DRL provides a flexible solution for many types of problems because it does not need to model complex systems or to label data for training. By utilizing recurrent neural networks (RNNs) in DRL, the deep recurrent Q-network (DRQN) was introduced to process the temporal correlation of input sequences in a non-Markovian environment [4]. Even though DRQN is a powerful machine learning tool, it faces serious training issues for two reasons: 1) DRL requires a relatively large amount of training data and computational resources to make the learning agent converge to an appropriate policy, which is a major bottleneck for applying DRL to many real-world applications [5]. 2) The kernel of DRQN, the RNN, suffers from vanishing and exploding gradients that make the underlying training difficult [6]. Therefore, the difficulties of training DRL agents and RNNs make the training of DRQNs an extremely challenging problem and prevent them from being widely adopted for analyzing time-dynamic applications.
In light of the training challenges, in this work we exploit a special type of RNN, the echo state network (ESN), to reduce the training time and the required amount of training data [7]. ESNs simplify the underlying RNN training by only training the output weights while leaving the input weights and recurrent weights untrained. Existing research shows that ESNs can achieve performance comparable to fully trained RNNs, especially in tasks requiring fast learning [8]. Accordingly, in this work we adopt ESNs as the Q-networks in the DRL framework, which is referred to as the deep echo state Q-network (DEQN). We will show that DEQN has the benefit of learning a good policy with a short training time and limited training data.
Fueled by the popularity of smartphones as well as the upcoming deployment of the fifth generation (5G) mobile broadband networks, mobile data traffic will grow at a compound annual growth rate (CAGR) of 46 percent between 2017 and 2022, reaching 77.5 exabytes (EB) per month by 2022 [9]. A significant portion of this data traffic will be real-time or delay-sensitive. For example, live video will grow 9-fold from 2017 to 2022, while virtual reality and augmented reality traffic will increase 12-fold at a CAGR of 63 percent. This suggests that future wireless networks will face the pressing demand of conducting real-time processing for large volumes of data in an efficient way. In 5G networks, massive connectivity is regarded as a primary use case with dynamic spectrum sharing (DSS) as an enabling technology. In fact, DSS has been announced as a key technology for 5G by many companies and operators around the world including Qualcomm, Ericsson, AT&T, and Verizon [10, 11]. Unlike the current static spectrum management policy that gives a single system the exclusive right to access the spectrum, DSS adopts a more flexible, hierarchical access structure with primary users (PUs) and secondary users (SUs) [12]. SUs are allowed to access the licensed spectrum as long as the resulting interference to PUs remains tolerable.
Obtaining control information from the environment is costly in 5G mobile wireless networks. First, a SU cannot detect the activities of all PUs simultaneously because performing spectrum sensing is energy-consuming. Second, exchanging control information between wireless devices imposes a control overhead on wireless network operations. Therefore, the major challenge of DSS is how to optimize the system performance under limited information exchange between the secondary system and the primary system. DRL is a suitable framework for developing DSS strategies because of its ability to adapt to an unknown environment without modeling the complex 5G network. However, DRL usually requires a large amount of training data and a long training time, while wireless networks are dynamic due to factors such as path loss, shadow fading, and multi-path fading [13], which greatly reduces the number of effective training samples that reflect the latest environment. Furthermore, the performance of spectrum sharing depends on the access strategies of multiple users. If one user changes its access strategy, then other users have to change their access strategies accordingly. Under these circumstances, the number of effective training samples that reflect the latest wireless environment will be extremely limited. As a result, designing an efficient DRL framework that requires only a small amount of training data is critical for 5G and future 6G DSS networks. In this work, we introduce DEQN to learn a spectrum access strategy for each SU in a distributed fashion with limited training data and a short training time in highly dynamic 5G networks.
The main contributions of our work are as follows:
• We design an efficient DRL framework, DEQN, to adapt to highly dynamic environments with limited training data, and we provide training strategies for the introduced DEQN.
• We apply the DEQN method to the critical problem of DSS for 5G networks, where the system is highly dynamic and interactive. Compared to existing DRL-based strategies, our method can quickly adapt to the real mobile wireless environment to achieve improved network performance with limited training data.
• To the best of our knowledge, this is the first work to formulate a DRL strategy that jointly considers spectrum sensing and spectrum sharing in the underlying DSS network for 5G.
II Problem Definition for DSS
In this section, we introduce the DSS problem and discuss its challenges. We consider a DSS system in which the primary network consists of $K$ PUs and the secondary network consists of $N$ SUs. It is assumed that one wireless channel is allocated to each PU individually, so there are $K$ channels in total, and cross-channel interference is negligible. We consider a discrete time model, where the dynamics of the DSS system, such as the behaviors of users and the changes of the wireless environment, are constrained to happen at discrete time slots $t$ ($t$ is a natural number). Our goal is to develop a distributed DSS strategy for each SU to increase the spectrum utilization without harming the primary network's performance.

The data of a user are transmitted over the wireless link between its transmitter and receiver. The signal-to-interference-plus-noise ratio (SINR) is a quality measure of a wireless connection that compares the power of the desired signal to the sum of the interference power and the background noise power. The higher the SINR, the better the quality of the wireless connection. The SINR of user $i$'s wireless connection on channel $k$ at time slot $t$ is written as

$\gamma_{i,k}(t)=\dfrac{P_i\, g_{i,i,k}(t)}{\sum_{j\in\mathcal{I}_{i,k}(t)} P_j\, g_{j,i,k}(t)+\sigma_k^2}, \qquad (1)$

where $P_i$ and $P_j$ are the transmit powers of user $i$ and user $j$, respectively, $\mathcal{I}_{i,k}(t)$ is the set containing all the users transmitting on channel $k$ except for user $i$, $g_{i,i,k}(t)$ is the channel gain of the desired link of user $i$, $g_{j,i,k}(t)$ is the channel gain of the interference link between user $j$'s transmitter and user $i$'s receiver, and $\sigma_k^2$ is the background noise power on channel $k$. Note that all channel gains change over time, so the SINR is also time-variant. The desired link is the link between the transmitter and the receiver of the same user. The interference link is the link between the transmitter of one user and the receiver of another user when the two users transmit on the same channel simultaneously. Figure 1 shows the complicated association of desired links and interference links when PU1, SU1, and SU2 operate on the same channel. Since cross-channel interference is negligible, interference links between users operating on different channels are not considered.
The radio signal attenuates as it propagates through space between the transmitter and the receiver, which is referred to as path loss. In addition to path loss, the channel gain is affected by many factors such as shadow fading and multi-path fading. Shadow fading is caused by a large obstacle, like a hill or a building, obscuring the main signal path between the transmitter and the receiver. Multi-path fading occurs in any environment where multiple propagation paths exist between the transmitter and the receiver, which may be caused by reflection, diffraction, or scattering. In the telecommunications community, channel models are carefully designed to be consistent with wireless field measurements. We generate channel gains based on the WINNER II channel model [14], which is widely used in industry to make fair comparisons of telecommunication algorithms.
To enable the protection of the primary network, we assume that a PU will broadcast a warning signal if its data transmission experiences a low SINR. There are two possible causes of low SINR. First, the desired link of the PU is in deep fade, which means the channel gain of the desired link is low. This leads to a small value of the numerator in Equation (1), so the SINR is low. Second, the signals from one or more SUs cause strong interference to the PU when they transmit over the same wireless channel at the same time. This leads to a large value of the denominator in Equation (1), so the SINR again takes a low value. We say that the SUs "collide" with the PU in this case. The warning signal contains information about which PU may be interfered with, so that the SUs transmitting on the same channel are aware of the issue. This kind of warning signal is similar to the control signals (e.g., synchronization, downlink/uplink control) used in current 4G and 5G networks. It is common to assume that control signals are received perfectly at receivers; otherwise the underlying network would not even work. In reality, the control signal can be transmitted through a dedicated control channel. Under this mechanism, a PU will broadcast a warning signal once its received SINR is low, and this is the only control information from the primary system to the secondary system that enables the protection of PUs under DSS. Note that a PU may send a warning signal even when no collision happens, because of deep fade.
The activity of a PU consists of two states: (1) Active and (2) Inactive. If a PU is transmitting data, it is in the Active state; otherwise it is in the Inactive state. A spectrum opportunity on a channel occurs when the licensed PU of that channel is in the Inactive state or when a SU can transmit on that channel with little interference to the Active licensed PU. Unfortunately, it is difficult for a SU to obtain the activity states of PUs or the interference that it will cause in highly dynamic 5G networks. A SU has to perform spectrum sensing to detect the activity of a PU, but the detection accuracy depends on the wireless link between the transmitters of the PU and the SU, the background noise, and the transmit power of the PU. On the other hand, the interference level caused by a SU is determined by the interference link from the SU to the PU, the desired link of the PU, the transmit powers of the PU and the SU, and the background noise. Furthermore, all these factors that determine spectrum opportunities are time-variant, so control information becomes outdated quickly. Since obtaining control information is costly in 5G mobile wireless networks, it is impractical to design a DSS strategy by assuming that all the control information is known.
SUs should protect PUs from harmful interference since the primary system is the spectrum licensee. A commonly used method is for the transmitter of a SU to perform spectrum sensing to detect the activity of a PU before accessing a channel. Due to power and complexity constraints, a SU is unable to perform spectrum sensing across all channels simultaneously. Therefore, we assume that a SU can only sense one channel at a particular time. We adopt the energy detector as the underlying spectrum sensing method, which is the most common one due to its low complexity and cost. The energy detector of SU $n$ first computes the energy of the received signals on channel $k$ as follows:

$E_{n,k}=\frac{1}{T_s}\sum_{t=t_0}^{t_0+T_s-1}\left|y_{n,k}(t)\right|^{2}, \qquad (2)$

where $t_0$ is the starting time slot of the spectrum sensing, $y_{n,k}(t)$ is the received signal at time slot $t$, and $T_s$ is the number of time slots of the spectrum sensing. We consider a half-duplex SU system where a SU cannot transmit data and perform spectrum sensing at the same time. We assume a periodic time structure of spectrum sensing and data transmission, as shown in Figure 2. To be specific, the $i$-th sensing and transmission period contains $T$ time slots from $iT$ to $(i+1)T-1$; the spectrum sensing occupies the first $T_s$ time slots of the period, from $iT$ to $iT+T_s-1$; and the data transmission occupies the subsequent $T-T_s$ time slots, from $iT+T_s$ to $(i+1)T-1$.
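As a minimal illustration of Equation (2), the following Python sketch computes the energy statistic over one sensing window; the function name and the synthetic sample generation are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def sensing_energy(y):
    """Energy statistic of Eq. (2): average energy of the samples
    received during the T_s sensing slots of one period."""
    return np.mean(np.abs(np.asarray(y)) ** 2)

# Toy usage with synthetic noise-only samples (values are illustrative only).
rng = np.random.default_rng(0)
T_s = 2
noise = (rng.standard_normal(T_s) + 1j * rng.standard_normal(T_s)) / np.sqrt(2)
e = sensing_energy(noise)
# The soft value e can be compared against a threshold for a hard Active/Inactive
# decision, or fed directly to the DRL agent as part of its input state.
print(f"sensed energy: {e:.3f}")
```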

The received signal depends on the activity state of PU $k$, the transmit power of PU $k$, the background noise, and the sensing link between the transmitters of PU $k$ and SU $n$. When PU $k$ is in the Inactive state, the received signal is represented as

$y_{n,k}(t)=w_k(t). \qquad (3)$

When PU $k$ is in the Active state, the received signal is represented as

$y_{n,k}(t)=\sqrt{P_k\, h_{k,n}(t)}\, x_k(t)+w_k(t), \qquad (4)$

where $w_k(t)$ is circularly-symmetric Gaussian noise with zero mean and variance $\sigma_k^2$, $P_k$ is the transmit power of PU $k$, $h_{k,n}(t)$ is the channel gain of the sensing link between the transmitters of PU $k$ and SU $n$, and $x_k(t)$ denotes the unit-power transmitted symbol of PU $k$.
If the energy computed in Equation (2) is higher than a threshold, the PU is considered to be in the Active state; otherwise the PU is considered to be in the Inactive state. The challenge of designing an energy detector is how to set the threshold properly. The value of the threshold embodies a trade-off between the detection probability and the false alarm probability. However, setting the threshold to achieve a good trade-off depends on many factors, including the channel gain of the sensing link, the transmit power of the PU, the noise variance, the number of received signals, etc. This information is difficult to obtain before deployment in the real environment and is time-variant. Furthermore, setting a threshold is difficult in some cases because of the relative positions of transmitters and receivers. As shown in Figure 1, the sensing link is between the transmitters of the PU and the SU, but the interference link is between the transmitter of the SU and the receiver of the PU. The discrepancy between the sensing link and the interference link may cause the hidden node problem, where the sensing link is weak but the interference link is strong. For example, the transmitters of a SU and a PU may be far away from each other while the SU transmitter is close to the receiver of the PU. In this case, the transmitters of the SU and the PU are hidden nodes with respect to each other. The warning signals from PUs are designed to provide additional protection to the primary system for the case where the SU cannot detect the activity of the PU, thereby mitigating the issues caused by hidden nodes. Meanwhile, instead of making the spectrum access decision solely based on the outcomes of the energy detector, we develop a DRL framework to construct a novel spectrum access policy: the DRL agent uses the sensed energy as the input to learn a spectrum access strategy that maximizes the cumulative reward. The reward is designed to maximize the spectral-efficiencies of SUs while enabling the protection of PUs with the help of warning signals from PUs.
III DRL Framework for DSS and DEQN
III-A Background on DRL
RL is a type of machine learning that provides a flexible architecture for solving many practical problems because it does not need to model complex systems or to label data for training. In RL, an agent learns how to select actions to maximize the cumulative reward in a stochastic environment. The dynamics of the environment is usually modeled as a Markov decision process (MDP), which is characterized by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}$ is the state transition kernel providing $\Pr(s_{t+1}\mid s_t,a_t)$, $\mathcal{R}$ is the reward function providing $r_t=R(s_t,a_t)$, and $\gamma\in[0,1)$ is a discount factor for calculating the cumulative reward. Specifically, at time $t$, the state is $s_t$; the RL agent selects an action $a_t$ by following a policy $\pi$ and receives the reward $r_t$; and then the system shifts to the next state $s_{t+1}$ according to the state transition probability. Note that the action $a_t$ affects both the immediate reward $r_t$ and the next state $s_{t+1}$. Consequently, all subsequent rewards are affected by the current action. The goal of the RL agent is to find a policy $\pi$ that maximizes the cumulative reward $\sum_{t=0}^{\infty}\gamma^{t} r_t$.
In RL, a model-free algorithm does not require the state transition probabilities for learning, which is useful when the underlying system is complicated and difficult to model. Q-learning [15] is the most widely used model-free RL algorithm; it aims to find the Q-function of each state-action pair for a given policy, which is defined as

$Q^{\pi}(s,a)=\mathbb{E}\!\left[\sum_{i=0}^{\infty}\gamma^{i} r_{t+i}\,\middle|\, s_t=s,\, a_t=a,\, \pi\right]. \qquad (5)$

The Q-function represents the expected cumulative reward when taking action $a$ in state $s$ and then following policy $\pi$. Q-learning constructs a Q-table to estimate the Q-function of each state-action pair by iteratively updating each element of the Q-table through dynamic programming. The update rule of the Q-table is given as follows:

$Q(s_t,a_t)\leftarrow Q(s_t,a_t)+\alpha\left[r_t+\gamma\max_{a} Q(s_{t+1},a)-Q(s_t,a_t)\right], \qquad (6)$

where $\alpha$ is the learning rate. The action $a_t$ is selected according to the $\epsilon$-greedy policy:

$a_t=\begin{cases}\arg\max_{a} Q(s_t,a), & \text{with probability } 1-\epsilon,\\ \text{a random action in } \mathcal{A}, & \text{with probability } \epsilon,\end{cases} \qquad (7)$

where $\epsilon$ is the exploration probability.
where is the exploration probability. However, Q-learning performs poorly when the dimension of the state is high because updating a large Q-table makes training difficult or even impossible.
The Deep Q-Network (DQN) [1] was introduced to solve high-dimensional state problems by leveraging a neural network as the function approximator of the Q-table, which is referred to as the Q-network. Specifically, the Q-network takes the state as input and outputs the estimated Q-function of all possible actions. One key approach of DQN to improve the training stability is to create two Q-networks: the evaluation network $Q(s,a;\theta)$ and the target network $Q(s,a;\theta^{-})$. The target network is used to generate the targets for training the evaluation network, while the evaluation network is used to determine the actions. The loss function for training the evaluation network is written as

$L(\theta)=\mathbb{E}\!\left[\left(y_t-Q(s_t,a_t;\theta)\right)^{2}\right], \qquad (8)$

where $y_t=r_t+\gamma\max_{a} Q(s_{t+1},a;\theta^{-})$ is the target Q-value. The weights $\theta^{-}$ of the target network are periodically synchronized with the weights $\theta$ of the evaluation network. The purpose is to temporarily fix the targets during training to improve the training stability of the evaluation network.
An improvement of DQN that prevents overestimation of Q-values is double Q-learning [16], where the evaluation network is used to select the action when computing the target Q-value, but the target Q-value is still generated by the target network. Specifically, the target Q-value for the evaluation network is calculated by

$y_t=r_t+\gamma\, Q(s_{t+1},a^{*};\theta^{-}), \qquad (9)$

where $a^{*}=\arg\max_{a} Q(s_{t+1},a;\theta)$. Double Q-learning improves the accuracy of the Q-function estimates, thereby improving the learned policy.
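As a short illustration of Equation (9), the target value can be computed as follows; generic NumPy vectors of Q-values stand in for the actual networks, so this is only a sketch of the target computation.

```python
import numpy as np

def double_q_target(r, q_eval_next, q_target_next, gamma=0.99):
    """Double Q-learning target of Eq. (9).

    q_eval_next   : Q-values of the next state from the evaluation network
    q_target_next : Q-values of the next state from the target network
    """
    a_star = int(np.argmax(q_eval_next))      # action chosen by the evaluation network
    return r + gamma * q_target_next[a_star]  # value supplied by the target network
```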
III-B Existing DRL-based Strategies for DSS
DRL-based methods have recently been applied in dynamic spectrum access (DSA) networks [17, 18, 19], where the focus is exclusively on the "access" part of the problem with an over-simplified network setup. To be specific, [17] considers a single SU that selects one channel to access in a multichannel environment, and the goal is to maximize the number of times a good channel is selected for access. [18] assumes that the available spectrum channels are known a priori and develops a centralized spectrum access algorithm for multi-user access. Both [17] and [18] assume that one channel can only be used by one user at any particular time. Although [19] allows multiple SUs to access a channel at the same time, a SU cannot access a channel that a PU is using. [19] also assumes that each SU can sense all channels simultaneously and that collisions between a PU and a SU can be perfectly detected. In this work, in order to provide a comprehensive study of the impact of DEQN on relevant DSS networks for 5G, we consider practical aspects of DSS where 1) mobile users cannot conduct spectrum sensing perfectly; 2) mobile users cannot sense multiple channels at a particular time; 3) there are multiple PUs and multiple SUs in a DSS network; and 4) a channel can be shared by multiple users if the interference between them is weak. Furthermore, unlike previous work that utilizes binary ACK/NACK feedback as the reward function, we calculate a practical reward based on the spectral-efficiency of each mobile link. To stay closely in line with the real wireless environment, the spectral-efficiency of a mobile link is calculated using the transmission procedure defined in the telecommunication standard. In this way, we can train and evaluate the underlying DEQN-based DRL strategies in realistic 5G application scenarios. It is important to note that in our work we treat the unprocessed soft spectrum sensing information as the input states of the DRL agent. Soft spectrum sensing information can be directly obtained from spectrum sensing sensors. Through the soft spectrum sensing input, the DRL agent will learn an appropriate detection criterion for each SU that adapts to different mobile wireless environments, geometries of mobile users, and activities of mobile users. To the best of our knowledge, this is the first work to study DSS that combines soft spectrum sensing information and spectrum access strategies through the DRL framework.
III-C DRL Problem Formulation for DSS
We now formulate the DSS problem using the DRL framework, where all SUs in the secondary system learn their spectrum access strategies in a distributed fashion through interactions with the mobile wireless environment. To be specific, we assume that each SU has a DRL agent that takes its observed state as the input and learns how to perform spectrum sensing and access actions in order to maximize its cumulative reward. The reward for each SU is designed to maximize its spectral efficiency while preventing harmful interference to PUs.
The state of SU $n$ in the $i$-th sensing and transmission period is denoted by

$s_n(i)=\left[e_n(i),\, \mathbf{c}_n(i)\right], \qquad (10)$

where $i$ is a non-negative integer, $e_n(i)$ is the energy of the received signals, and $\mathbf{c}_n(i)$ is a one-hot $K$-dimensional vector indicating the channel sensed from time slots $iT$ to $iT+T_s-1$. If the index of the sensed channel is $k$, then the $k$-th element of $\mathbf{c}_n(i)$ is equal to one while the other elements of $\mathbf{c}_n(i)$ are zeros. Correspondingly, $e_n(i)$ is equal to $E_{n,k}$ as calculated by Equation (2).
The action of SU $n$ in the $i$-th sensing and transmission period is denoted by

$a_n(i)=\left[a_n^{\mathrm{acc}}(i),\, a_n^{\mathrm{ch}}(i+1)\right], \qquad (11)$

where $a_n^{\mathrm{acc}}(i)\in\{0,1\}$ indicates whether SU $n$ will access the currently sensed channel ($a_n^{\mathrm{acc}}(i)=1$) or stay idle ($a_n^{\mathrm{acc}}(i)=0$) during the data transmission part of the $i$-th period (from time slots $iT+T_s$ to $(i+1)T-1$), and $a_n^{\mathrm{ch}}(i+1)\in\{1,\dots,K\}$ indicates which channel SU $n$ will sense during the sensing part of the $(i+1)$-th period (from time slots $(i+1)T$ to $(i+1)T+T_s-1$). In other words, SU $n$ makes two decisions: $a_n^{\mathrm{acc}}(i)$ decides whether to conduct data transmission on the currently sensed channel in the $i$-th period, and $a_n^{\mathrm{ch}}(i+1)$ decides which channel to sense in the $(i+1)$-th period. Therefore, the dimension of each SU's action space is $2K$. Note that the channel sensed in the $(i+1)$-th period may be different from that sensed in the $i$-th period. A small encoding sketch is given after Table I below.
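To make the state and action definitions of Equations (10) and (11) concrete, the sketch below builds a state vector and decodes one of the $2K$ discrete actions; the value of K, the energy value, and the particular action indexing are illustrative assumptions rather than choices specified in the paper.

```python
import numpy as np

K = 4  # number of channels/PUs (illustrative)

def build_state(sensed_channel, energy):
    """State of Eq. (10): sensed energy concatenated with a one-hot channel indicator."""
    one_hot = np.zeros(K)
    one_hot[sensed_channel] = 1.0
    return np.concatenate(([energy], one_hot))   # shape (K + 1,)

def decode_action(action_index):
    """Action of Eq. (11): 2K discrete actions = (access or idle) x (next channel to sense)."""
    access = action_index % 2        # 1: transmit on the current channel, 0: stay idle
    next_channel = action_index // 2 # channel to sense in the next period
    return access, next_channel

s = build_state(sensed_channel=2, energy=0.37)
print(s, decode_action(5))
```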
Table I: SINR, CQI, and MCS mapping.

CQI index | SINR (dB) | modulation | code rate (×1024) | efficiency (bits per symbol)
---|---|---|---|---
0 | out of range | – | – | –
1 | -6.9360 | QPSK | 78 | 0.1523
2 | -5.1470 | QPSK | 120 | 0.2344
3 | -3.1800 | QPSK | 193 | 0.3770
4 | -1.2530 | QPSK | 308 | 0.6016
5 | 0.7610 | QPSK | 449 | 0.8770
6 | 2.6990 | QPSK | 602 | 1.1758
7 | 4.6940 | 16QAM | 378 | 1.4766
8 | 6.5250 | 16QAM | 490 | 1.9141
9 | 8.5730 | 16QAM | 616 | 2.4063
10 | 10.3660 | 64QAM | 466 | 2.7305
11 | 12.2890 | 64QAM | 567 | 3.3223
12 | 14.1730 | 64QAM | 666 | 3.9023
13 | 15.8880 | 64QAM | 772 | 4.5234
14 | 17.8140 | 64QAM | 873 | 5.1152
15 | 19.8290 | 64QAM | 948 | 5.5547
In our work, we use a discrete reward function, similar in spirit to existing DRL-based DSS methods. Compared to the simple binary rewards used in [17] and [18], we consider a more relevant and comprehensive reward design that is based on the achieved modulation and coding scheme (MCS) adopted in the 3GPP LTE/LTE-Advanced standard [20]. To be specific, a receiver measures the SINR to evaluate the quality of the wireless connection and feeds back the corresponding Channel Quality Indicator (CQI) to the transmitter [21]. In this work, we follow the method presented in [22] to map the received SINR to the CQI. After receiving the CQI, the transmitter determines the MCS for data transmission based on the CQI table specified in the 3GPP standard [20]. The SINR and CQI mapping to MCS is given in Table I for reference. Accordingly, the achieved spectral-efficiency, representing the average number of information bits per symbol, is calculated as (bits/symbol) = (number of bits per modulation symbol, i.e., $\log_2$ of the modulation order) × (code rate). This critical metric is utilized as the reward function of our design.
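As an illustration of the SINR-to-reward pipeline described above, the sketch below maps a measured SINR to the spectral efficiency of the highest supportable CQI using the thresholds of Table I; the lookup rule is a simple approximation in the spirit of [22], not a reproduction of it.

```python
import bisect

# (SINR threshold in dB, spectral efficiency in bits/symbol) for CQI 1..15, from Table I.
CQI_TABLE = [(-6.936, 0.1523), (-5.147, 0.2344), (-3.180, 0.3770), (-1.253, 0.6016),
             (0.761, 0.8770), (2.699, 1.1758), (4.694, 1.4766), (6.525, 1.9141),
             (8.573, 2.4063), (10.366, 2.7305), (12.289, 3.3223), (14.173, 3.9023),
             (15.888, 4.5234), (17.814, 5.1152), (19.829, 5.5547)]

def spectral_efficiency(sinr_db):
    """Map SINR (dB) to the spectral efficiency of the highest supportable CQI.

    Returns 0.0 when the SINR is below the CQI-1 threshold (CQI 0, 'out of range').
    """
    thresholds = [t for t, _ in CQI_TABLE]
    idx = bisect.bisect_right(thresholds, sinr_db) - 1   # largest threshold <= sinr_db
    return 0.0 if idx < 0 else CQI_TABLE[idx][1]

print(spectral_efficiency(9.0))   # -> 2.4063 (CQI 9, 16QAM with code rate 616/1024)
```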
To jointly consider the performance of the primary and secondary systems, the reward of SU $n$ accessing channel $k$ depends on the spectral-efficiencies of both SU $n$ and PU $k$. During time slots $iT+T_s$ to $(i+1)T-1$, the average spectral-efficiency of SU $n$, $\bar{C}^{\mathrm{SU}}_{n}(i)$, and the average spectral-efficiency of PU $k$, $\bar{C}^{\mathrm{PU}}_{k}(i)$, are calculated by

$\bar{C}^{\mathrm{SU}}_{n}(i)=\frac{1}{T-T_s}\sum_{t=iT+T_s}^{(i+1)T-1} C^{\mathrm{SU}}_{n,k}(t), \qquad \bar{C}^{\mathrm{PU}}_{k}(i)=\frac{1}{T-T_s}\sum_{t=iT+T_s}^{(i+1)T-1} C^{\mathrm{PU}}_{k}(t), \qquad (12)$

where $C^{\mathrm{SU}}_{n,k}(t)$ and $C^{\mathrm{PU}}_{k}(t)$ represent the spectral-efficiencies of SU $n$ and PU $k$ on channel $k$ at time slot $t$, respectively.
The reward of SU $n$ in the $i$-th transmission period is defined as

$r_n(i)=\begin{cases} r_{\mathrm{p}}, & \text{if SU } n \text{ accesses channel } k \text{ and } \bar{C}^{\mathrm{PU}}_{k}(i)<C_{\mathrm{th}},\\ r_{\mathrm{idle}}, & \text{if SU } n \text{ stays idle},\\ \bar{C}^{\mathrm{SU}}_{n}(i), & \text{otherwise}, \end{cases} \qquad (13)$

where $r_{\mathrm{p}}$ is the penalty for harming PU $k$, $r_{\mathrm{idle}}$ is the reward for staying idle, and $C_{\mathrm{th}}$ is the warning threshold on the PU's average spectral-efficiency. To enable the protection of the primary system, PU $k$ will broadcast a warning signal if its average spectral-efficiency falls below $C_{\mathrm{th}}$, in which case the reward received by any SU $n$ accessing channel $k$ is set to the penalty $r_{\mathrm{p}}$. To motivate SUs to explore spectrum opportunities, the reward is set to $r_{\mathrm{idle}}$ if SU $n$ decides to stay idle in the transmission period. When PU $k$ does not suffer from strong interference (its average spectral-efficiency is larger than $C_{\mathrm{th}}$), the reward increases with the average spectral-efficiency of SU $n$ (see Equation (13)). Note that a low spectral-efficiency of a PU or a SU does not necessarily indicate a collision, because the underlying wireless channels change dynamically over time: if the channel gain of the desired link is small, the spectral-efficiency of the user will be low even if there is no collision. Therefore, the reward function and the warning signal are introduced because it is impossible to detect collisions perfectly in practical wireless environments.
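A minimal sketch of the reward rule in Equation (13) follows; the numerical values of the penalty and idle reward are placeholders, since they are design parameters not restated here.

```python
def su_reward(accessed, warning_received, su_avg_eff,
              r_penalty=-1.0, r_idle=0.0):
    """Reward of SU n for one transmission period, mirroring the structure of Eq. (13).

    accessed         : True if the SU transmitted, False if it stayed idle
    warning_received : True if the PU on the accessed channel broadcast a warning
    su_avg_eff       : average spectral efficiency of the SU over the period
    r_penalty, r_idle: placeholder values for the penalty and idle reward
    """
    if not accessed:
        return r_idle        # SU stayed idle this period
    if warning_received:
        return r_penalty     # PU on the accessed channel was harmed
    return su_avg_eff        # reward grows with the SU's spectral efficiency
```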
III-D Efficient Training for DEQN
Since the activity patterns of PUs are usually time-dependent, applying DRQNs to capture them is a natural choice. Although DQNs are able to learn the temporal correlation by stacking a history of states in the input, the sufficient number of stacked states is unknown because it depends on the PUs' behavior patterns. RNNs are a family of neural networks for processing sequential data without specifying the length of the temporal correlation.
However, the training of RNNs is known to be difficult, suffering from the vanishing and exploding gradient problems. Furthermore, the amount of training data required to achieve convergence is large in the DRL setting, since there are no explicit labels to guide the training and the agents have to learn by interacting with their environment. In the wireless environment, the channel gain of a wireless link changes rapidly, as shown in Figure 3. Note that the environment observed by a SU is affected by other SUs' access strategies because of possible collisions between SUs, and all SUs dynamically adjust their DSS strategies during their training processes. As a result, in the DSS problem, the duration for which the learning environment remains stable is short and the available training data are very limited.

The standard training technique for RNNs is to unfold the network in time into a computational graph with a repetitive structure, which is called backpropagation through time (BPTT). BPTT suffers from a slow convergence rate and needs many training examples. DRQN also requires a large amount of training data because a learning agent finds a good policy by exploring the environment with different potential policies. Unfortunately, in the DSS problem, only limited training data are available for a given stable environment due to dynamic channel gains, partial sensing, and the existence of multiple SUs. To address this issue, we use ESNs as the Q-networks in the DRQN framework to rapidly adapt to the environment. ESNs simplify the training of RNNs significantly by keeping the input weights and recurrent weights fixed and only training the output weights.
We denote the sequence of states observed by SU $n$ by $\{s_n(i)\}$. Accordingly, the sequence of hidden states, $\{\mathbf{h}_n(i)\}$, is updated by

$\mathbf{h}_n(i)=(1-\alpha)\,\mathbf{h}_n(i-1)+\alpha\, f\!\left(\mathbf{W}^{\mathrm{in}} s_n(i)+\mathbf{W}^{\mathrm{rec}}\,\mathbf{h}_n(i-1)\right), \qquad (14)$

where $\mathbf{W}^{\mathrm{in}}$ is the input weight matrix, $\mathbf{W}^{\mathrm{rec}}$ is the recurrent weight matrix, $\alpha$ is the leaky parameter, $f(\cdot)$ is the reservoir activation function, and we let $\mathbf{h}_n(0)=\mathbf{0}$. The output sequence, $\{q_n(i)\}$, is computed by

$q_n(i)=\mathbf{W}^{\mathrm{out}}\left[\mathbf{h}_n(i);\, s_n(i)\right], \qquad (15)$

where $[\mathbf{h}_n(i);\, s_n(i)]$ is the concatenation of $\mathbf{h}_n(i)$ and $s_n(i)$, and $\mathbf{W}^{\mathrm{out}}$ is the output weight matrix. Note that the output vector $q_n(i)$ is a $2K$-dimensional vector, where each element of $q_n(i)$ corresponds to the estimated Q-value of selecting one of the possible actions given the state $s_n(i)$.
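Equations (14) and (15) can be realized compactly as follows. This NumPy sketch assumes a tanh reservoir activation, uniform random weight initialization, and spectral-radius rescaling, which are common ESN practice rather than details specified in the paper; the class and parameter names are illustrative.

```python
import numpy as np

class EchoStateQNetwork:
    def __init__(self, n_in, n_hidden, n_actions, alpha=0.7, rho=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-0.5, 0.5, (n_hidden, n_in))     # fixed input weights
        W = rng.uniform(-0.5, 0.5, (n_hidden, n_hidden))
        # Fixed recurrent weights, rescaled toward the echo state property.
        self.W_rec = W * (rho / max(abs(np.linalg.eigvals(W))))
        self.W_out = np.zeros((n_actions, n_hidden + n_in))      # only these weights are trained
        self.alpha = alpha                                       # leaky parameter

    def step(self, h_prev, s):
        """Eq. (14): leaky update of the hidden (reservoir) state."""
        pre = self.W_in @ s + self.W_rec @ h_prev
        return (1 - self.alpha) * h_prev + self.alpha * np.tanh(pre)

    def q_values(self, h, s):
        """Eq. (15): linear readout of Q-values from [hidden state; input state]."""
        return self.W_out @ np.concatenate([h, s])

# Toy usage with assumed dimensions (K = 4 channels -> 5 inputs, 2K = 8 actions).
net = EchoStateQNetwork(n_in=5, n_hidden=32, n_actions=8)
h = np.zeros(32)
h = net.step(h, np.ones(5))
print(net.q_values(h, np.ones(5)).shape)   # (8,)
```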
The double Q-learning algorithm [16] is adopted to train the underlying DEQN agent of each SU. As discussed in Section III-A, each DEQN agent has two Q-networks: the evaluation network and the target network. Let the output sequences from the evaluation network and the target network be $\{q_n(i)\}$ and $\{\tilde{q}_n(i)\}$, respectively. The loss function for training the evaluation network of SU $n$ is written as

$L\left(\mathbf{W}^{\mathrm{out}}\right)=\sum_{i}\left(y_n(i)-q_n(i)\!\left[a_n(i)\right]\right)^{2}, \qquad (16)$

where $q_n(i)[a_n(i)]$ and $\tilde{q}_n(i+1)[a^{*}]$ are the $a_n(i)$-th element of $q_n(i)$ and the $a^{*}$-th element of $\tilde{q}_n(i+1)$, respectively, $a^{*}$ is the index of the maximum element of $q_n(i+1)$, and $y_n(i)=r_n(i)+\gamma\,\tilde{q}_n(i+1)[a^{*}]$ is the target Q-value. To stabilize the training targets, the target network is only periodically synchronized with the evaluation network.
The input weights and the recurrent weights of ESNs are randomly initialized subject to the constraints specified by the echo state property [23], and then remain untrained. Only the output weights of ESNs are trained, so the training is extremely fast. The main idea of ESNs is to generate a large reservoir that contains the necessary summary of past input sequences for predicting the targets. From Equation (14), we can observe that the hidden state at any given time slot is unchanged during the training process because the input weights and recurrent weights are fixed. In contrast to conventional RNNs, which usually initialize the hidden states to zeros and waste some training examples to drive them to appropriate values in each training iteration, the hidden states of ESNs do not need to be recomputed in every training iteration. Therefore, the training process becomes extremely efficient, which is especially suitable for learning in a highly dynamic environment. Compared to storing only the experience tuples $(s_n(i), a_n(i), r_n(i), s_n(i+1))$ as in the conventional DRQN framework, we also store the hidden states $\mathbf{h}_n(i)$, since they are unchanged during training. In this way, we do not have to waste training time and data recalculating hidden states in every training iteration. This largely boosts the training efficiency in the highly dynamic environment, since we can avoid using BPTT and only update the output weights of the networks. Furthermore, we can randomly sample from the replay memory to create a training batch, while conventional DRQN methods have to sample continuous sequences to create a training batch. Thus, the training data can be used more efficiently in our DEQN method. The training data stored in the buffer are refreshed periodically in order to adapt to the latest environment. Therefore, our training method is an online training algorithm that keeps updating the learning agent. The training algorithm for DEQNs in the DSS problem is detailed in Algorithm 1.
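Because only the output weights are trained and the reservoir states are stored with each transition, one DEQN training iteration reduces to updating the readout weights against double Q-learning targets, as in Equation (16). The sketch below shows one gradient step under an assumed buffer layout and hyperparameters; the complete procedure is given in Algorithm 1 of the paper.

```python
import numpy as np

def deqn_update(W_out, W_out_target, batch, gamma=0.99, lr=0.01):
    """One gradient step on the evaluation network's output weights (cf. Eq. (16)).

    W_out, W_out_target : readout weights of the evaluation / target networks
    batch : stored transitions (s, h, a, r, s_next, h_next), where h and h_next are
            the reservoir states pre-computed by Eq. (14); they never change during
            training, so they can be stored once and reused.
    The buffer layout and batching are illustrative assumptions.
    """
    grad = np.zeros_like(W_out)
    for s, h, a, r, s_next, h_next in batch:
        x = np.concatenate([h, s])                       # readout input, Eq. (15)
        x_next = np.concatenate([h_next, s_next])
        a_star = int(np.argmax(W_out @ x_next))          # evaluation net selects the action
        y = r + gamma * (W_out_target @ x_next)[a_star]  # target net supplies its value
        td_err = (W_out @ x)[a] - y
        grad[a] += 2.0 * td_err * x                      # gradient of the squared TD error
    W_out -= lr * grad / len(batch)                      # update only the readout weights
    return W_out
```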
IV Performance Evaluation
IV-A Experimental Setup

We set the number of PUs and SUs to 4 and 6, respectively, and the locations of PUs and SUs are randomly placed in a 2000 m × 2000 m area. The distance between the transmitter and the receiver of each desired link is randomly chosen from 400 m to 450 m. Figure 4 shows the geometry of the DSS network, where PUT/SUT represent the transmitters of PUs/SUs and PUR/SUR represent the receivers of PUs/SUs. The channel gains of desired links, interference links, and sensing links are generated by the WINNER II channel model widely used in 3GPP LTE-Advanced and 5G networks [14]. In this case, there are 4 desired links for PUs, 6 desired links for SUs, 30 interference links between different SUs, 24 interference links between SUTs and PURs, 24 interference links between PUTs and SURs, and 24 sensing links between PUTs and SUTs. In total, 112 wireless links are generated in our simulation, which constitutes a more complicated scenario than those of existing DRL-based DSS strategies [17, 18, 19]. Specifically, [17] considers that each channel only has two possible states (good or bad) without modeling the true wireless environment; [18] assumes that collisions between users can be perfectly detected without considering the dynamics of interference links; and [19] forbids SUs from accessing a channel while a PU is using it, without considering the actual interference links between PUs and SUs.
For each channel, the bandwidth is set to 5 MHz and the variance of the Gaussian noise is set to -157.3 dBm. The transmit powers of PUs and SUs are both set to 500 mW. We set the sensing and transmission period to 10 time slots and the sensing duration to 2 time slots, where one time slot represents an interval of 1 ms. We list all the parameters used to generate the wireless environment in Table II.
Table II: Parameters of the simulated wireless environment.

Parameter | Value
---|---
number of PUs | 4
number of SUs | 6
simulation area | 2000 m × 2000 m
distance between user pair | 400 m – 450 m
transmit power of PU | 500 mW
transmit power of SU | 500 mW
variance of Gaussian noise | -157.3 dBm
bandwidth of a channel | 5 MHz
interval of one time slot | 1 ms
sensing and transmission period | 10 time slots
sensing duration | 2 time slots
For the activity pattern of PUs, we let two PUs (PU1 and PU3) alternate between the Active and Inactive states with one switching period and the other two PUs (PU2 and PU4) with a different switching period. Each SU trains its DEQN agent and updates its policy after collecting 300 samples in the buffer. The buffer is refreshed after training, so we only use training data from the latest 3 seconds. The total number of training samples is 60000, which requires 600 seconds to collect. The initial exploration probability $\epsilon$ is set to 0.3, and it gradually decreases to 0. We first train the Q-network with learning rate 0.01, and then the learning rate decreases to 0.001 when $\epsilon$ is less than 0.2.

IV-B Network Architecture
As shown in Figure 5, our DEQN network consists of reservoirs for extracting the temporal correlation necessary to predict the targets. The number of neurons in each reservoir is set to 32 and the leaky parameter in Equation (14) is set to 0.7. During the training process, the input weights and the recurrent weights remain untrained. To find a good policy, only the output weights are trained to read the essential temporal information from the input states and the hidden states stored in the experience replay buffer. Existing research shows that stacking RNNs automatically creates different time scales at different levels, and this stacked architecture has a better ability to model long-term dependencies than a single-layer RNN [24, 25, 26]. We also find that stacking ESNs can indeed improve the performance in our experiments.
IV-C Results and Discussion
We evaluate our introduced DEQN method with three performance metrics: 1) the system throughput of PUs; 2) the system throughput of SUs; and 3) the required training time. The throughput represents the number of transmitted bits per second, which is calculated by (spectral-efficiency) × (bandwidth), and the system throughput represents the sum of the users' throughputs in the primary or secondary system. A good DSS strategy should increase the throughput of SUs as much as possible while the transmissions of SUs do not harm the throughput of PUs. Therefore, each SU has to access an available channel by predicting the activities of other mobile users. We compare with the conventional DRQN method that uses Long Short-Term Memory (LSTM) [27] as the Q-network. For a fair comparison, we also set the number of neurons in each LSTM layer to 32. The training algorithm of DRQNs is BPTT with double Q-learning, using the same learning rate as DEQNs. Since each SU updates its policy every 300 samples, all curves in the figures are plotted as the moving average over 300 consecutive samples for clarity.


DEQN1 and DEQN2 denote our DEQN method with one and two layers, respectively, and DRQN1 and DRQN2 denote the conventional DRQN method with one and two layers, respectively. The system throughput of PUs is shown in Figure 6 and the system throughput of SUs is shown in Figure 7. We observe that DEQNs exhibit more stable performance than DRQNs, which empirically shows that the DEQN method can learn efficiently with limited training data. Note that one experience replay buffer only contains the 300 latest training samples. After the learning agent of each SU is updated using the 300 samples in the buffer, the DSS strategy of each SU changes, so the environment observed by each SU also changes. Therefore, we have to erase the outdated samples from the buffer and let SUs collect new training data from the environment. Figure 8 shows the average reward of SUs versus time. We observe extremely unstable reward curves for both DRQN1 and DRQN2, which indicates that DRQNs cannot adapt well to this dynamic 5G scenario with limited training data.

We observe that DEQN2 outperforms DEQN1 in both the system throughput of PUs and that of SUs, which shows that the deep structure (stacking ESNs) indeed improves the capability of the DRL agent to learn long-term temporal correlation. As for DRQNs, we observe that their performance does not improve as we increase the number of layers in the underlying RNN. The main reason is that more training data are needed to train a larger network, while even the single-layer DRQN cannot be trained well.

The top priority in designing a DSS network is to prevent harmful interference to the primary system. To analyze the performance degradation of the primary system after allowing the secondary system to access the spectrum, we also show in Figure 6 the system throughput of PUs when no SUs exist. We observe that DEQN2 achieves almost the same PU system throughput. A PU broadcasts a warning signal if its spectral-efficiency is below the threshold. For each PU, we record the ratio of (the number of times the PU sends a warning signal that is received by some SUs) to (the number of the PU's channel accesses), which we call the warning frequency. Figure 9 shows the average warning frequency of each PU versus time. We observe that every PU decreases its warning frequency over time, meaning that each SU learns not to access a channel on which it would cause harmful interference to PUs.
Table III: Training time of different approaches.

Network | Training time (sec)
---|---
DEQN1 | 161
DEQN2 | 178
DRQN1 | 3776
DRQN2 | 7618
We compare the training time of different approaches in Table III when implemented and executed on the same machine with 2.71 GHz Intel i5 CPU and 12 GB RAM. The required training time for DRQN1 is 23.4 times the training time for DEQN1, and the required training time for DRQN2 is 42.8 times the training time for DEQN2. This huge difference shows the training speed advantage of our introduced DEQN method against the conventional DRQN method. DRQN suffers from high training time because BPTT unfolds the network in time to compute the gradients, but DEQN can be trained very efficiently because the hidden states can be pre-stored for many training iterations.
V Conclusion
In this paper, we introduced DEQN, a new RNN-based DRL strategy that efficiently captures the temporal correlation of the underlying time-dynamic environment while requiring a very limited amount of training data. DEQN-based DRL strategies largely increase the rate of convergence compared to conventional DRQN-based strategies. DEQN-based spectrum access strategies are examined in DSS, a key technology in 5G and future 6G networks, showing significant performance improvements over state-of-the-art DRQN-based strategies. This provides strong evidence for adopting DEQN in real-time and time-dynamic applications. Our future work will focus on developing methodologies for the design of neural network architectures tailored to different applications.
References
- [1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv, 2013.
- [2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
- [3] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
- [4] M. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially observable mdps,” in AAAI Fall Symposium Series, 2015.
- [5] F. S. He, Y. Liu, A. G. Schwing, and J. Peng, “Learning to play in a day: Faster deep reinforcement learning by optimality tightening,” in ICLR, 2017.
- [6] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in ICML, 2013, pp. 1310–1318.
- [7] H. Jaeger, “The “echo state” approach to analysing and training recurrent neural networks-with an erratum note,” German National Research Center For Information Technology, Tech. Rep. 34, Jan. 2001.
- [8] G. Tanaka, T. Yamane, J. B. Héroux, R. Nakane, N. Kanazawa, S. Takeda, H. Numata, D. Nakano, and A. Hirose, “Recent advances in physical reservoir computing: a review,” Neural Networks, 2019.
- [9] Cisco, “Cisco visual networking index: Forecast and trends, 2017–2022,” VNI Global Fixed and Mobile Internet Traffic Forecasts, Feb. 2019.
- [10] S. Marek, “Marek’s Take: Dynamic spectrum sharing may change the 5G deployment game,” Fierce Wireless, April 2019.
- [11] S. Kinney, “Dynamic spectrum sharing is key to Verizon’s 5G strategy,” RCR Wireless News, August 2019.
- [12] W. S. H. M. W. Ahmad, N. A. M. Radzi, F. Samidi, A. Ismail, F. Abdullah, M. Z. Jamaludin, and M. Zakaria, “5G technology: Towards dynamic spectrum sharing using cognitive radio networks,” IEEE Access, vol. 8, pp. 14 460–14 488, 2020.
- [13] D. Tse and P. Viswanath, Fundamentals of wireless communication. Cambridge university press, 2005.
- [14] CEPT, “WINNER II Channel Models,” European Conference of Postal and Telecommunications Administrations (CEPT), Technical Report D1.1.2, Feb. 2008, version 1.2.
- [15] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
- [16] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in AAAI, 2016.
- [17] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep reinforcement learning for dynamic multichannel access in wireless networks,” IEEE Trans. on Cogn. Commun. Netw., vol. 4, no. 2, pp. 257–265, 2018.
- [18] O. Naparstek and K. Cohen, “Deep multi-user reinforcement learning for distributed dynamic spectrum access,” IEEE Trans. Wireless Commun., vol. 18, no. 1, pp. 310–323, 2018.
- [19] H. Chang, H. Song, Y. Yi, J. Zhang, H. He, and L. Liu, “Distributive dynamic spectrum access through deep reinforcement learning: A reservoir computing-based approach,” IEEE Internet Things J., vol. 6, no. 2, pp. 1938–1948, Apr. 2019.
- [20] 3GPP, “Evolved Universal Terrestrial Radio Access (E-UTRA); Physical Layer Procedures,” 3rd Generation Partnership Project (3GPP), Technical Specification (TS) 36.213, Jun. 2019, version 15.6.0.
- [21] L. Liu, R. Chen, S. Geirhofer, K. Sayana, Z. Shi, and Y. Zhou, “Downlink MIMO in LTE-advanced: SU-MIMO vs. MU-MIMO,” IEEE Commun. Mag., vol. 50, no. 2, pp. 140–147, 2012.
- [22] A. Chiumento, M. Bennis, C. Desset, L. Van der Perre, and S. Pollin, “Adaptive CSI and feedback estimation in LTE and beyond: a Gaussian process regression approach,” EURASIP JWCN, vol. 2015, no. 1, p. 168, Jun. 2015.
- [23] M. Lukoševičius, “A practical guide to applying echo state networks,” in Neural networks: Tricks of the trade. Springer, 2012, pp. 659–686.
- [24] M. Hermans and B. Schrauwen, “Training and analysing deep recurrent neural networks,” in NIPS, 2013, pp. 190–198.
- [25] C. Gallicchio, A. Micheli, and L. Pedrelli, “Deep reservoir computing: a critical experimental analysis,” Neurocomputing, vol. 268, pp. 87–99, 2017.
- [26] Z. Zhou, L. Liu, J. Zhang, and Y. Yi, “Deep reservoir computing meets 5G MIMO-OFDM systems in symbol detection,” in AAAI, 2020.
- [27] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
Hao-Hsuan Chang received the B.Sc. degree in electrical engineering and the M.S. degree in communication engineering from National Taiwan University, Taipei, Taiwan. He is currently pursuing the Ph.D. degree in electrical and computer engineering at Virginia Polytechnic Institute and State University, Blacksburg, VA, USA. His research interests include dynamic spectrum access, echo state networks, and deep reinforcement learning.
Lingjia Liu is an Associate Professor with the Bradley Department of Electrical and Computer Engineering, Virginia Tech. He is also the Associate Director of Wireless@VT. Prior to joining VT, he was an Associate Professor with the EECS Department, University of Kansas (KU). He spent more than four years working in the Mitsubishi Electric Research Laboratory (MERL) and the Standards and Mobility Innovation Laboratory, Samsung Research America (SRA), where he received the Global Samsung Best Paper Award in 2008 and 2010. He led Samsung’s efforts on multiuser MIMO, CoMP, and HetNets in LTE/LTE-Advanced standards. His general research interests lie in emerging technologies for beyond-5G cellular networks, including machine learning for wireless networks, massive MIMO, massive MTC communications, and mmWave communications. He received the Air Force Summer Faculty Fellowship from 2013 to 2017, was named Miller Scholar at KU in 2014, received the Miller Professional Development Award for Distinguished Research at KU in 2015, the 2016 IEEE GLOBECOM Best Paper Award, the 2018 IEEE ISQED Best Paper Award, the 2018 IEEE TAOS Best Paper Award, the 2018 IEEE TCGCC Best Conference Paper Award, and the 2020 WOCC Charles Kao Best Paper Award.
Yang Yi (SM’17) is an Associate Professor in the Bradley Department of ECE at Virginia Tech (VT). She received the B.S. and M.S. degrees in electronic engineering from Shanghai Jiao Tong University, and the Ph.D. degree in electrical and computer engineering from Texas A&M University. Her research interests include very large scale integrated (VLSI) circuits and systems, computer aided design (CAD), and neuromorphic computing. Dr. Yi is currently serving as an associate editor for the Cyber Journal of Selected Areas in Microelectronics and has been serving on the editorial board of the International Journal of Computational & Neural Engineering. Dr. Yi is the recipient of the 2018 NSF CAREER award, the 2016 Miller Professional Development Award for Distinguished Research, a 2016 United States Air Force (USAF) Summer Faculty Fellowship, a 2015 NSF EPSCoR First Award, and the 2015 Miller Scholar award.