Fast Reinforcement Learning
for Anti-jamming Communications
Abstract
This letter presents a fast reinforcement learning algorithm for anti-jamming communications, which repeats the previous action with probability $\tau$ and applies $\epsilon$-greedy with probability $1-\tau$. A dynamic threshold based on the average value of the previous several actions is designed, and the probability $\tau$ is formulated as a Gaussian-like function to guide the wireless devices. As a concrete example, the proposed algorithm is implemented in a wireless communication system against multiple jammers. Experimental results demonstrate that the proposed algorithm outperforms Q-learning, deep Q-networks (DQN), double DQN (DDQN), and prioritized experience replay based DDQN (PDDQN) in terms of signal-to-interference-plus-noise ratio and convergence rate.
Index Terms:
Reinforcement learning, $\epsilon$-greedy, experience replay, wireless communications, jamming attacks
I INTRODUCTION
Wireless communications are highly vulnerable to jamming attacks due to the open and shared nature of the wireless medium [1]. In general, attackers jam ongoing transmissions by injecting malicious signals into the wireless channel in use. To deal with jamming attacks, frequency hopping was proposed, which selects "good" frequencies in an ad-hoc way and avoids the jammed frequencies [2, 3]. However, random frequency hopping offers no guidance on how to select a channel that is not blocked. Wang et al. [4] subsequently proposed an uncoordinated frequency hopping method based on online learning theory to mitigate this problem. However, this conventional online learning method achieves only asymptotic optimality.
Recently, the success of reinforcement learning (RL) in decision-making problems has attracted researchers to apply Q-learning (QL) to anti-jamming wireless communications. For instance, Xiao et al. [5, 6] employ QL to choose an appropriate transmission power. In [7, 8], QL is applied to select the optimal frequency hopping channel. However, as the dimension of the action space increases, the Q-table of QL becomes too large to quickly derive the optimal policy. Subsequently, the deep Q-network (DQN) [9] was proposed to solve such high-dimensional optimization problems. Different from QL, the Q value of DQN is not calculated by the state-value function but learned by neural networks such as convolutional neural networks (CNN) and recurrent neural networks (RNN). For example, in order to accelerate the learning speed and enhance anti-jamming communication performance, Han et al. [10] apply DQN to choose the optimal communication channel, and Liu et al. [11] combine DQN with a spectrum waterfall to improve the ability to explore an unknown environment. With the rapid development of RL, double DQN (DDQN) [12] has been proposed, which combines two deep networks to avoid the dependency between the Q value calculation and the network update. During the backpropagation of DDQN, only the network used for action selection is updated, and its parameters are copied to the other network at a constant frequency. Hence, to some extent, DDQN mitigates the overestimation problem. Furthermore, prioritized experience replay based DQN (PDQN) [13] was developed to optimize the experience replay of DQN. Researchers have also combined DDQN and PDQN (a.k.a. PDDQN) to improve the convergence rate and avoid the overestimation problem, for example in game design [14]. Besides, the action repetition theory [15] was proposed to further enhance the convergence rate of DQN, but optimizing the action repetition rate remains difficult. It is therefore important and challenging to apply PDDQN to the optimization problems arising in anti-jamming wireless communications.
In this letter, we present a fast reinforcement learning algorithm called $\tau$-greedy. The key idea of $\tau$-greedy is to preserve the previous action with probability $\tau$ and apply $\epsilon$-greedy with probability $1-\tau$. Specifically, the probability $\tau$ is formulated as a Gaussian-like function such that the greater the value of the previous action, the larger $\tau$ becomes. By doing so, the optimal action can be selected with a higher probability and the learning process is significantly accelerated. As a concrete example, the proposed algorithm is implemented in a wireless communication system against multiple jammers. Experimental results show the superiority of the proposed algorithm over existing value-based RL methods including QL, DQN, DDQN, and PDDQN. In particular, the $\tau$-greedy based PDDQN achieves the best performance in terms of signal-to-interference-plus-noise ratio (SINR) and convergence rate.
The rest of this letter is organized as follows. In Section II, we introduce an anti-jamming wireless communication model. The proposed algorithm is described in detail in Section III. Section IV provides experimental results, and conclusions are drawn in Section V.
II SYSTEM MODEL
As depicted in Fig. 1, we consider a wireless communication system where a sender transmits data to the receiver with power $P^t$, while $J$ jammers can simultaneously launch jamming attacks with different power levels on the available channels. The jamming power levels are denoted as $\{y_1, y_2, \ldots, y_{N_Y}\}$. In order to resist this complicated attack, the sender dynamically chooses an appropriate $P^t$ from $N_P$ different power levels $\{P_1, P_2, \ldots, P_{N_P}\}$, which in general is shown to be superior to the use of constant power under the constraint of the same average power [16]. Note that both the sender and the jammers share the same $N$ frequency channels. At time slot $t$, the channels chosen by the sender and the jammers are respectively denoted by $f^t$ and $f_j^t$. $h_s$ and $h_j$ denote the channel power gains from the sender and the $j$th jammer ($1 \le j \le J$) to the receiver, respectively.

In the above communication system, after the receiver gets the signal at time slot $t$, the SINR is calculated by Eq. (1) and returned to the sender over the feedback channel.
$\mathrm{SINR}^t = \dfrac{h_s P^t}{\sigma + \sum_{j=1}^{J} h_j y_j^t\, \mathbb{1}(f^t = f_j^t)}$   (1)
where $\sigma$ is the receiver noise power, $y_j^t$ denotes the jamming power chosen by the $j$th jammer, and $\mathbb{1}(\cdot)$ is an indicator function that equals 1 if its argument is true and 0 otherwise. If the receiver does not get the signal at time slot $t$, the transmission channel is considered to be blocked by the jammers. In this case, the receiver informs the sender about the channel state information through the feedback channel, and the sender then retransmits the signal. The additional cost caused by the retransmission is denoted as $C_m$. For simplicity, we assume that the feedback channel cannot be attacked and that the transmission channel is considered to be blocked if the jamming power takes its maximum value. Besides, the cost of unit transmit power is denoted as $C_p$. In [17], the authors formulated the utility function by considering that the transmission cost is proportional to the transmit power. Based on the SINR, the retransmission cost $C_m$, and the transmission cost, the utility at time slot $t$ in this letter is defined by
$u^t = \mathrm{SINR}^t - C_m\, \mathbb{1}(\text{transmission blocked}) - C_p P^t$   (2)
It is known that anti-jamming wireless communications aim to maximize the communication performance while saving as much energy as possible. By referring to Eq. (2), our designed utility function makes a good tradeoff between the cost and the communication performance.
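For illustration, a minimal Python sketch of Eqs. (1) and (2) is given below. It is our own illustration rather than part of the system implementation, and the numerical values in the usage example (powers, gains, costs) are arbitrary placeholders, not our experimental settings.

```python
import numpy as np

def sinr(p_tx, f_tx, f_jam, p_jam, h_s, h_j, sigma):
    """Eq. (1): SINR at the receiver for one time slot.

    p_tx  : transmit power chosen by the sender
    f_tx  : channel index chosen by the sender
    f_jam : array of channel indices chosen by the J jammers
    p_jam : array of jamming powers chosen by the J jammers
    h_s   : channel power gain from the sender to the receiver
    h_j   : array of channel power gains from each jammer to the receiver
    sigma : receiver noise power
    """
    interference = np.sum(h_j * p_jam * (f_jam == f_tx))
    return h_s * p_tx / (sigma + interference)

def utility(p_tx, f_tx, f_jam, p_jam, h_s, h_j, sigma, C_m, C_p, blocked):
    """Eq. (2): SINR minus retransmission cost and transmit-power cost."""
    return (sinr(p_tx, f_tx, f_jam, p_jam, h_s, h_j, sigma)
            - C_m * blocked - C_p * p_tx)

# Illustrative usage with assumed (not experimental) values.
f_jam = np.array([2, 5])        # channels chosen by two jammers
p_jam = np.array([4.0, 6.0])    # jamming powers (W)
h_j = np.array([0.3, 0.2])      # jammer-to-receiver channel gains
u = utility(p_tx=4.0, f_tx=2, f_jam=f_jam, p_jam=p_jam,
            h_s=0.5, h_j=h_j, sigma=0.1, C_m=1.0, C_p=0.1, blocked=False)
```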
III PROPOSED ALGORITHM
The proposed algorithm consists of three modules: the $\tau$-greedy action policy, the double deep network structure, and prioritized experience replay, as shown in Fig. 2. The system state at time slot $t$ is denoted as $s^t = \mathrm{SINR}^{t-1}$, where $\mathrm{SINR}^{t-1}$ is the SINR at time slot $t-1$. Based on the current state $s^t$, the sender selects an action $a^t = (f^t, P^t)$ containing the selected transmission channel and power level. After the action is adopted, the sender receives a reward, namely the utility $u^t$ of Eq. (2). In the following, we describe each module in detail.
III-A $\tau$-greedy Action Policy
In RL, the Q value is updated by the Q-function, which is written as
$Q(s^t, a^t) = u^t + \gamma \max_{a' \in \mathcal{A}(s^{t+1})} Q(s^{t+1}, a')$   (3)
where $s^{t+1}$ denotes the next state provided that the sender takes action $a^t$ at state $s^t$, $\mathcal{A}(s^{t+1})$ denotes the set of actions that can be chosen at state $s^{t+1}$, and $\gamma$ denotes the discount factor, which represents the uncertainty of the sender about future rewards. To the best of our knowledge, most existing value-based RL methods apply $\epsilon$-greedy as the action selection policy. In conventional $\epsilon$-greedy, agents select the action with the maximum Q value with a high probability $1-\epsilon$ and randomly select an action with a very low probability $\epsilon/|\mathcal{A}|$, where $|\mathcal{A}|$ denotes the cardinality of the action set $\mathcal{A}$. However, we find that agents still need to calculate the Q value at the next time slot even if they have already chosen an action that greatly improves the utility at the current time slot. Moreover, we consider that during the learning process the value of $\epsilon$ should be gradually decreased, so that agents have a higher probability of exploring possible optimal actions at the beginning and the Q function can converge quickly in the end, as sketched below.
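For reference, the following minimal sketch shows conventional $\epsilon$-greedy selection with a gradually decreasing $\epsilon$; the multiplicative decay schedule, its constants, and the action-set size are assumptions for illustration only.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

# Assumed decay schedule: shrink epsilon each time slot down to a floor.
epsilon, eps_min, eps_decay = 1.0, 0.01, 0.995
q_values = np.zeros(8)                            # e.g., 8 channel/power actions
for t in range(1000):
    a = epsilon_greedy(q_values, epsilon)
    epsilon = max(eps_min, epsilon * eps_decay)
```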
Based on $\epsilon$-greedy, we propose to add a parameter $\tau$ to represent the probability that the current action is directly selected at the next time slot without calculating the Q value; we call the result the $\tau$-greedy action policy in this letter. Thus, there are three possible ways for the sender to select the action at the current state: the sender may directly adopt the previous action with probability $\tau$, randomly select any action in $\mathcal{A}$ with probability $(1-\tau)\epsilon$, or take the action that has the maximum Q value with probability $(1-\tau)(1-\epsilon)$. The proposed $\tau$-greedy is formulated as
$\pi(s^{t+1}) = \begin{cases} a^t, & \text{with probability } \tau, \\ a_r, & \text{with probability } (1-\tau)\epsilon, \\ \arg\max_{a' \in \mathcal{A}} Q(s^{t+1}, a'), & \text{with probability } (1-\tau)(1-\epsilon), \end{cases}$   (4)
where $\pi$ and $a_r$ denote the action policy and a random action, respectively. Let $\bar{u}^t$ denote the average utility of the previous $W$ time slots, computed by $\bar{u}^t = \frac{1}{W}\sum_{i=1}^{W} u^{t-i}$. According to $\tau$-greedy, the sender will directly take the action $a^t$ with probability $\tau$ at the next time slot $t+1$. In order to improve the learning speed, we hope to preserve a previous action that greatly contributes to the system. Therefore, we employ the difference between $u^t$ and $\bar{u}^t$ to measure how much the action $a^t$ contributes. It is reasonable that the probability of adopting $a^t$ at the next time slot should increase as this difference increases, and vice versa. Based on this consideration, we design a Gaussian-like function to compute the value of $\tau$ as follows.
$\tau = \begin{cases} 1 - \frac{1}{2}\exp\!\left(-\frac{(u^t - \bar{u}^t)^2}{2\mu_2^2}\right), & u^t \ge \bar{u}^t, \\ \frac{1}{2}\exp\!\left(-\frac{(u^t - \bar{u}^t)^2}{2\mu_1^2}\right), & u^t < \bar{u}^t, \end{cases}$   (5)
where $\mu_1$ and $\mu_2$ are parameters used to control the step of adjusting $\tau$: the larger the value these two parameters take, the more gently $\tau$ varies. By Eq. (5), the value of $\tau$ is adaptively adjusted according to the dynamic threshold $\bar{u}^t$.
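A minimal sketch of the $\tau$-greedy policy is given below. The Gaussian-like mapping follows the piecewise form stated above, with illustrative parameter names mu1 and mu2; it is a sketch of the policy logic, not a definitive implementation.

```python
import numpy as np

def tau_from_utility(u_prev, u_bar, mu1, mu2):
    """Gaussian-like mapping of Eq. (5): tau grows with u_prev - u_bar.

    mu1 and mu2 control how gently tau changes on either side of the
    dynamic threshold u_bar (average utility of the previous W slots).
    """
    d = u_prev - u_bar
    if d >= 0.0:
        return 1.0 - 0.5 * np.exp(-d**2 / (2.0 * mu2**2))
    return 0.5 * np.exp(-d**2 / (2.0 * mu1**2))

def tau_greedy(q_values, a_prev, tau, epsilon, rng=np.random.default_rng()):
    """Eq. (4): repeat the previous action w.p. tau, otherwise apply epsilon-greedy."""
    r = rng.random()
    if r < tau:
        return a_prev                              # repeat, skip Q evaluation
    if r < tau + (1.0 - tau) * epsilon:
        return int(rng.integers(len(q_values)))    # random exploration
    return int(np.argmax(q_values))                # greedy action
```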

III-B Double Deep Network Structure
To update the Q value by Eq. (3) at each time slot, neural networks can be used in our setup. As done in [10, 16, 17], we use a CNN with two convolutional (Conv) layers and two fully connected (FC) layers. The first and second Conv layers have 20 filters and 40 filters, respectively, both with stride 1. The rectified linear unit (ReLU) is used as the activation function in the two Conv layers. The first FC layer has 180 ReLUs, and the second FC layer has one output per action, i.e., one Q value for each channel-power pair. Based on the outputs of the CNN, the sender obtains the optimal transmit power and channel. A minimal sketch of this structure is given below.
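The following PyTorch sketch illustrates the network just described. Since the exact filter sizes and input dimensions are not restated here, the 3x3 kernels, the 1x10x10 input encoding of the state-action history, and the number of actions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Two Conv layers + two FC layers, following the description above.

    Assumptions for illustration: 3x3 kernels, a 1x10x10 input, and
    n_actions = number of channel/power pairs.
    """
    def __init__(self, n_actions, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 20, kernel_size=3, stride=1),  # Conv1: 20 filters
            nn.ReLU(),
            nn.Conv2d(20, 40, kernel_size=3, stride=1),           # Conv2: 40 filters
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(180),          # FC1: 180 ReLUs (input size inferred lazily)
            nn.ReLU(),
            nn.Linear(180, n_actions),   # FC2: one Q value per action
        )

    def forward(self, x):
        return self.head(self.features(x))

q_net = QNetwork(n_actions=64)
q_values = q_net(torch.zeros(1, 1, 10, 10))   # assumed input shape
```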
Let $\varphi^t$ denote the input of the CNN at time slot $t$, which consists of the current system state and the previous $L$ system state-action pairs, i.e., $\varphi^t = \{s^{t-L}, a^{t-L}, \ldots, s^{t-1}, a^{t-1}, s^t\}$. In this structure, we create two identical CNNs, denoted as $Q$ and $\hat{Q}$, respectively. $Q$ is used to select the action that has the maximum Q value computed by Eq. (6), and $\hat{Q}$ is used to calculate the target Q value by Eq. (7). The parameters of $Q$ and $\hat{Q}$ at time slot $t$ are denoted as $\theta^t$ and $\hat{\theta}^t$, respectively.
$a_{\max}^t = \arg\max_{a'} Q(\varphi^{t+1}, a'; \theta^t)$   (6)
$Q_{\mathrm{target}}^t = u^t + \gamma \hat{Q}(\varphi^{t+1}, a_{\max}^t; \hat{\theta}^t)$   (7)
where $\varphi^{t+1}$ denotes the next state. Note that in this letter the utility $u^t$ is considered as the reward, and $a_{\max}^t$ denotes the action that has the maximum Q value estimated by network $Q$ at state $\varphi^{t+1}$. The parameters of $\hat{Q}$ are not updated by training, but are overwritten by the parameters of $Q$ every $F$ time slots. The target computation is sketched below.
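The sketch below shows how Eqs. (6) and (7) can be evaluated for a minibatch; here q_net and target_net stand for $Q$ and $\hat{Q}$ (e.g., instances of the QNetwork sketch above), and the batch tensors are assumed placeholders.

```python
import torch

@torch.no_grad()
def ddqn_target(q_net, target_net, next_states, rewards, gamma):
    """Eqs. (6)-(7): select the action with q_net, evaluate it with target_net."""
    a_max = q_net(next_states).argmax(dim=1, keepdim=True)        # Eq. (6)
    q_next = target_net(next_states).gather(1, a_max).squeeze(1)  # Q-hat of a_max
    return rewards + gamma * q_next                               # Eq. (7)

def sync_target(q_net, target_net):
    """Overwrite the target-network parameters every F time slots."""
    target_net.load_state_dict(q_net.state_dict())
```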
III-C Prioritized Experience Replay
Let $e^t = (\varphi^t, a^t, u^t, \varphi^{t+1})$ denote the experience sample of time slot $t$. Each $e^t$ is stored in a sum-tree [13], a data structure in which every node stores the sum of its children and each leaf is associated with its own priority, denoted as $p_t$. The whole sum-tree at time slot $t$ can thus be denoted as $D^t = \{e^1, e^2, \ldots, e^t\}$. For each experience replay, we extract the first $K$ experience samples in order of descending probability from the sum-tree. For brevity, we use the index $k$ to represent the $k$th of these $K$ experience samples, so that $e_k = (\varphi_k, a_k, u_k, \varphi'_k)$. The probability of the $k$th sample, denoted as $P(k)$, is calculated by
$P(k) = \dfrac{p_k}{\sum_{i} p_i}$   (8)
To evaluate the priority of an experience sample, the temporal-difference (TD) error $\delta_k$ is used, which is computed by
$\delta_k = u_k + \gamma \hat{Q}(\varphi'_k, a_{\max,k}; \hat{\theta}) - Q(\varphi_k, a_k; \theta)$   (9)
According to the stochastic gradient descent (SGD) algorithm, the parameters of network $Q$ are updated by means of minibatch updates, and the loss function, chosen following [13], is as follows.
$L(\theta) = \dfrac{1}{K}\sum_{k=1}^{K} w_k \delta_k^2$   (10)
where $w_k$ denotes the importance sampling weight, which can be calculated by
$w_k = \dfrac{\big(|D^t| \cdot P(k)\big)^{-\beta}}{\max_i w_i}$   (11)
where $\beta$ is a factor used to control the amount of importance sampling. In particular, $\beta = 0$ and $\beta = 1$ indicate no and full importance sampling, respectively.
After the parameters of $Q$ are updated, the TD error $\delta_k$ of each replayed experience sample is recalculated via Eq. (9), and the priority of each experience is updated by $p_k = |\delta_k|$. The parameters of network $\hat{Q}$ do not need to be updated immediately, but are replaced by the parameters of network $Q$ every $F$ time slots. The overall algorithm is shown in Algorithm 1, and a compact sketch of the replay computations follows.
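The sketch below illustrates the replay arithmetic of Eqs. (8)-(11) with a flat priority array in place of the sum-tree (the tree only accelerates the proportional sampling); the batch size, priorities, and $\beta$ value are assumed placeholders.

```python
import numpy as np

def sample_probabilities(priorities):
    """Eq. (8): sampling probability proportional to the leaf priority."""
    p = np.asarray(priorities, dtype=float)
    return p / p.sum()

def importance_weights(probs_of_batch, buffer_size, beta):
    """Eq. (11): importance-sampling weights, normalized by their maximum.
    beta = 0 disables and beta = 1 fully applies importance sampling."""
    w = (buffer_size * probs_of_batch) ** (-beta)
    return w / w.max()

def weighted_loss(td_errors, weights):
    """Eq. (10): importance-weighted mean of squared TD errors (Eq. (9))."""
    return float(np.mean(weights * np.asarray(td_errors) ** 2))

# Illustrative replay step with assumed numbers.
priorities = np.array([0.9, 0.5, 0.1, 0.7])     # |TD errors| of stored samples
probs = sample_probabilities(priorities)
batch_idx = np.argsort(probs)[::-1][:2]         # K = 2 samples, descending probability
w = importance_weights(probs[batch_idx], buffer_size=len(priorities), beta=0.4)
loss = weighted_loss(td_errors=np.array([0.9, 0.7]), weights=w)
priorities[batch_idx] = np.abs([0.3, 0.2])      # refresh priorities with new |TD errors|
```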
IV EXPERIMENTAL RESULTS
In this section, a number of experiments are conducted to evaluate the proposed method. In our experimental setup, the number of channels and jammers, the candidate transmit and jamming power levels (in W), the cost coefficients, the discount factor, and the remaining learning hyperparameters are fixed for all compared methods, and the number of training epochs is set to 400.
First, we implement the existing value-based RL methods, including QL, DQN, DDQN, and PDDQN, in the wireless communication system. These four methods are all based on $\epsilon$-greedy. The SINR performance and convergence rate are analyzed, and the results are shown in Fig. 3(a). As expected, PDDQN performs the best among the four methods; hence, PDDQN is selected. To our knowledge, this is the first time that PDDQN is applied to anti-jamming wireless communications. Next, we compare the SINR performance of the variable and constant transmission power models under the same average power condition. For a fair comparison, both models use PDDQN, and the comparison results are given after convergence. According to our experiment, PDDQN becomes completely convergent after 200 time slots, and the average power is 6.19 W in the variable power model. We can see from Fig. 3(b) that the variable power model obtains a much higher SINR than the constant power model under the same average power of 6.19 W. Therefore, the variable transmission power model is adopted. Finally, we apply the proposed $\tau$-greedy to the four value-based RL methods, which are named $\tau$-QL, $\tau$-DQN, $\tau$-DDQN, and $\tau$-PDDQN. The curves of SINR performance and convergence rate are drawn in Fig. 4. It is clear that the convergence rate with the proposed $\tau$-greedy is greatly improved for all four RL methods. Moreover, the communication performance (measured by SINR) with $\tau$-greedy is also much better than that without $\tau$-greedy. This is mainly because the proposed algorithm can not only select a more valuable action in most cases but also seek the most valuable action at a faster speed.






V CONCLUSIONS
In this letter, we have presented a $\tau$-greedy RL algorithm. The proposed $\tau$-greedy can replace the existing $\epsilon$-greedy that is used in almost all value-based RL methods. We have implemented the proposed algorithm in a wireless communication system against multiple jammers. The experimental results show that the proposed algorithm accelerates the learning speed of the wireless device and significantly improves the performance of existing RL methods in terms of SINR and convergence rate. In the future, this algorithm will be extended to anti-jamming wireless communication scenarios with continuous action sets.
References
- [1] A. Mukherjee, S. A. A. Fakoorian, J. Huang, and A. L. Swindlehurst, “Principles of physical layer security in multiuser wireless networks: A survey,” IEEE Communications Surveys Tutorials, vol. 16, no. 3, pp. 1550–1573, 2014.
- [2] E. Lance and G. K. Kaleh, “A diversity scheme for a phase-coherent frequency-hopping spread-spectrum system,” IEEE Transactions on Communications, vol. 45, no. 9, pp. 1123–1129, 1997.
- [3] H. Wang, L. Zhang, T. Li, and J. Tugnait, “Spectrally efficient jamming mitigation based on code-controlled frequency hopping,” IEEE Transactions on Wireless Communications, vol. 10, no. 3, pp. 728–732, 2011.
- [4] Q. Wang, P. Xu, K. Ren, and X. Li, “Towards optimal adaptive UFH-based anti-jamming wireless communication,” IEEE Journal on Selected Areas in Communications, vol. 30, no. 1, pp. 16–30, 2012.
- [5] L. Xiao, Y. Li, J. Liu, and Y. Zhao, “Power control with reinforcement learning in cooperative cognitive radio networks against jamming,” The Journal of Supercomputing, vol. 71, no. 9, pp. 3237–3257, 2015.
- [6] L. Xiao, Y. Li, C. Dai, H. Dai, and H. V. Poor, “Reinforcement learning-based NOMA power allocation in the presence of smart jamming,” IEEE Transactions on Vehicular Technology, vol. 67, no. 4, pp. 3377–3389, 2018.
- [7] Y. Gwon, S. Dastangoo, C. Fossa, and H. T. Kung, “Competing mobile network game: Embracing antijamming and jamming strategies with reinforcement learning,” in Proc. IEEE Conference on Communications and Network Security (CNS), 2013, pp. 28–36.
- [8] F. Slimeni, B. Scheers, Z. Chtourou, and V. Le Nir, “Jamming mitigation in cognitive radio networks using a modified Q-learning algorithm,” in Proc. International Conference on Military Communications and Information Systems (ICMCIS), 2015, pp. 1–7.
- [9] V. Mnih, K. Kavukcuoglu, and D. Silver, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, 2015.
- [10] G. Han, L. Xiao, and H. V. Poor, “Two-dimensional anti-jamming communication based on deep reinforcement learning,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2087–2091.
- [11] X. Liu, Y. Xu, L. Jia, Q. Wu, and A. Anpalagan, “Anti-jamming communications using spectrum waterfall: A deep reinforcement learning approach,” IEEE Communications Letters, vol. 22, no. 5, pp. 998–1001, 2018.
- [12] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. AAAI Conference on Artificial Intelligence, 2016, pp. 2094–2100.
- [13] T. Schaul, J. Quan, and I. Antonoglou, et al., “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
- [14] T. Hester, M. Vecerik, and O. Pietquin, et al., “Deep Q-learning from demonstrations,” in Proc. AAAI Conference on Artificial Intelligence, 2018, pp. 3223–3230.
- [15] A. S. Lakshminarayanan, S. Sharma, and B. Ravindran, “Dynamic action repetition for deep reinforcement learning,” in Proc. AAAI Conference on Artificial Intelligence, 2017, pp. 2133–2139.
- [16] L. Xiao, D. Jiang, X. Wan, W. Su, and Y. Tang, “Anti-jamming underwater transmission with mobility and learning,” IEEE Communications Letters, vol. 22, no. 3, pp. 542–545, 2018.
- [17] L. Xiao, D. Jiang, D. Xu, H. Zhu, Y. Zhang, and H. V. Poor, “Two-dimensional antijamming mobile communication based on reinforcement learning,” IEEE Transactions on Vehicular Technology, vol. 67, no. 10, pp. 9499–9512, 2018.