
Fast Reinforcement Learning
for Anti-jamming Communications

Pei-Gen Ye, Yuan-Gen Wang, Jin Li, and Liang Xiao
P.-G. Ye, Y.-G. Wang, and J. Li are with the School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou 510006, China (E-mail: [email protected], {wangyg, lijin}@gzhu.edu.cn). L. Xiao is with the Department of Communication Engineering, Xiamen University, Xiamen 361005, China (E-mail: [email protected]).
Abstract

This letter presents a fast reinforcement learning algorithm for anti-jamming communications which repeats the previous action with probability $\tau$ and applies $\epsilon$-greedy with probability $(1-\tau)$. A dynamic threshold based on the average value of the previous several actions is designed, and the probability $\tau$ is formulated as a Gaussian-like function to guide the wireless devices. As a concrete example, the proposed algorithm is implemented in a wireless communication system against multiple jammers. Experimental results demonstrate that the proposed algorithm outperforms Q-learning, deep Q-networks (DQN), double DQN (DDQN), and prioritized experience replay based DDQN (PDDQN) in terms of signal-to-interference-plus-noise ratio and convergence rate.

Index Terms:
Reinforcement learning, $\epsilon$-greedy, experience replay, wireless communications, jamming attacks

I INTRODUCTION

Wireless communications are highly vulnerable to jamming attacks due to the open and shared nature of the wireless medium [1]. In general, attackers jam ongoing transmissions by injecting malicious signals into the wireless channel in use. To deal with jamming attacks, frequency hopping was proposed, which selects "good" frequencies in an ad hoc way and avoids the jammed frequencies [2, 3]. However, random frequency hopping provides no guidance for selecting a channel that is not blocked. Wang et al. [4] later proposed an uncoordinated frequency hopping method based on online learning theory to mitigate this problem. However, this conventional online learning method achieves only asymptotic optimality.

Recently, the success of reinforcement learning (RL) in decision-making problems has attracted researchers to apply Q-learning (QL) to anti-jamming wireless communications. For instance, Xiao et al. [5, 6] employ QL to choose an appropriate transmission power. In [7, 8], QL is applied to select the optimal frequency hopping channel. However, as the dimension of the action space increases, the Q-table of QL becomes too large to quickly derive the optimal policy. Subsequently, the deep Q-network (DQN) [9] was proposed to solve such high-dimensional optimization problems. Different from QL, the Q value of DQN is not calculated by the state-value function, but learned by neural networks such as the convolutional neural network (CNN) and the recurrent neural network (RNN). For example, in order to accelerate the learning speed and enhance the anti-jamming communication performance, Han et al. [10] apply DQN to choose the optimal communication channel, and Liu et al. [11] combine DQN with spectrum waterfall to improve the ability to explore the unknown environment. With the rapid development of RL, double DQN (DDQN) [12] has been proposed, which combines two deep networks to avoid the dependency between the Q value calculation and the network update. During the backpropagation of DDQN, only the network used for action selection is updated, and its parameters are copied to the other network at a constant frequency. Hence, to some extent, DDQN mitigates the overestimation problem. Furthermore, prioritized experience replay based DQN (PDQN) [13] was developed to optimize the experience replay of DQN. Researchers have also combined DDQN and PDQN (a.k.a. PDDQN) to improve the convergence rate and avoid the overestimation problem, for example in game design [14]. Besides, action repetition [15] was proposed to further enhance the convergence rate of DQN, although optimizing the action repetition rate is difficult. It is both important and challenging to apply PDDQN to the optimization problems arising in anti-jamming wireless communications.

In this letter, we present a fast reinforcement learning algorithm called $(\tau,\epsilon)$-greedy. The key idea of $(\tau,\epsilon)$-greedy is to preserve the previous action with probability $\tau$ and apply $\epsilon$-greedy with probability $1-\tau$. Specifically, the probability $\tau$ is formulated as a Gaussian-like function such that the greater the value of the previous action, the larger $\tau$ becomes. By doing so, the optimal action can be selected with a higher probability and thus the learning process is significantly accelerated. As a concrete example, the proposed algorithm is implemented in a wireless communication system against multiple jammers. Experimental results show the superiority of the proposed algorithm over existing value-based RL methods including QL, DQN, DDQN, and PDDQN. In particular, the $(\tau,\epsilon)$-greedy based PDDQN achieves the best performance in terms of signal-to-interference-plus-noise ratio (SINR) and convergence rate.

The rest of this letter is organized as follows. In Section II, we introduce an anti-jamming wireless communication model. The proposed algorithm is described in detail in Section III. Section IV provides experimental results and conclusions are drawn in Section V.

II SYSTEM MODEL

As depicted in Fig. 1, we consider a wireless communication system where a sender transmits data to the receiver with power $P_{S}(s)$, while there are $J$ jammers who can simultaneously launch jamming attacks with $L$ different power levels on the channels. The $L$ jamming power levels are denoted as $P_{J}(l)\in\{P_{J}(1),P_{J}(2),\ldots,P_{J}(L)\}$. In order to resist this complicated attack, the sender dynamically chooses an appropriate $P_{S}(s)$ from $S$ different power levels $\{P_{S}(1),P_{S}(2),\ldots,P_{S}(S)\}$, which in general is shown to be superior to the use of constant power under the constraint of the same average power [16]. Note that both the sender and the jammers share the same $N$ frequency channels. At time slot $k$, the channels chosen by the sender and the $J$ jammers are denoted by $x^{(k)}$ and $\{y_{1}^{(k)},y_{2}^{(k)},\ldots,y_{J}^{(k)}\}$, respectively. $h_{s}$ and $h_{j}$ denote the channel power gains from the sender and the $j$th jammer ($j\in\{1,\ldots,J\}$) to the receiver, respectively.

Figure 1: Anti-jamming wireless communication system.

In the above communication system, after the receiver gets the signal at time slot $k$, the SINR is calculated by Eq. (1) and returned to the sender through the feedback channel.

\textrm{SINR}=\frac{P_{S}(s)h_{s}}{\beta+\sum_{j=1}^{J}P_{J}^{j}(l)h_{j}f(x^{(k)}=y_{j}^{(k)})}, \qquad (1)

where $\beta$ is the receiver noise power, $P_{J}^{j}(l)$ denotes the jamming power chosen by the $j$th jammer, and $f(\xi)$ is an indicator function that equals 1 if $\xi$ is true and 0 otherwise. If the receiver does not get the signal at time slot $k$, the transmission channel is considered to be blocked by the jammers. In this case, the receiver informs the sender about the channel state information through the feedback channel, and the sender then retransmits the signal. The additional cost caused by the retransmission is denoted as $C_{m}$. For simplicity, we assume that the feedback channel cannot be attacked and that the transmission channel is considered blocked if $P_{J}(l)$ takes the maximum value $P_{J}(L)$. Besides, the cost of unit transmit power is denoted as $C_{s}$. In [17], the authors formulated the utility function by considering a transmission cost proportional to the transmit power. Based on the SINR, the retransmission cost $C_{m}$, and the transmission cost, in this letter the utility $u_{s}^{(k)}$ at time slot $k$ is defined by

u_{s}^{(k)}=\textrm{SINR}-C_{m}f(P_{J}^{j}(l)=P_{J}(L))f(x^{(k)}=y_{j}^{(k)})-C_{s}P_{S}(s). \qquad (2)

Anti-jamming wireless communications aim to maximize the communication performance while saving as much energy as possible. As shown in Eq. (2), the designed utility function makes a good tradeoff between the cost and the communication performance.
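To make the reward model concrete, the following is a minimal Python sketch of Eqs. (1)-(2); the function names and the way the blocking indicator is evaluated per jammer are illustrative assumptions rather than part of the letter.

```python
# Minimal sketch of the reward model in Eqs. (1)-(2). The function names
# and the per-jammer blocking test are illustrative assumptions.

def sinr(p_s, h_s, beta, p_j, h_j, x, y):
    """Eq. (1): SINR for sender power p_s on channel x, given jammer
    powers p_j[j], gains h_j[j], and jammer channels y[j]."""
    interference = sum(pj * hj for pj, hj, yj in zip(p_j, h_j, y) if yj == x)
    return p_s * h_s / (beta + interference)

def utility(p_s, h_s, beta, p_j, h_j, x, y, p_j_max, c_m, c_s):
    """Eq. (2): SINR minus a retransmission cost C_m (charged when a jammer
    on the same channel uses the maximum power P_J(L)) minus the power cost."""
    blocked = any(pj == p_j_max and yj == x for pj, yj in zip(p_j, y))
    return sinr(p_s, h_s, beta, p_j, h_j, x, y) - c_m * blocked - c_s * p_s

# Example with the Section IV parameters (J = 2 jammers, h_s = h_j = 0.5).
print(utility(p_s=5.0, h_s=0.5, beta=1.0, p_j=[8.0, 10.0], h_j=[0.5, 0.5],
              x=3, y=[3, 7], p_j_max=10.0, c_m=1.0, c_s=0.2))
```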

III PROPOSED ALGORITHM

The proposed algorithm consists of three modules: the $(\tau,\epsilon)$-greedy action policy, the double deep network structure, and prioritized experience replay, as shown in Fig. 2. The system state at time slot $k$ is denoted as $s^{(k)}=\textrm{SINR}^{(k-1)}$, where $\textrm{SINR}^{(k-1)}$ is the SINR at time slot $k-1$. Based on the current state $s^{(k)}$, the sender selects an action containing the selected transmission channel $x^{(k)}$ and power level $P^{(k)}_{S}(s)$, denoted as $\textbf{a}^{(k)}=[x^{(k)},P^{(k)}_{S}(s)]$. After the action is taken, the sender receives a reward, denoted as $u^{(k)}_{s}$. In the following, we describe each module in detail.

III-A $(\tau,\epsilon)$-greedy Action Policy

In RL, the Q value is updated by the Q-function, which is written as

Q(s^{(k)},\textbf{a}^{(k)})=E\big[u_{s}^{(k)}+\gamma\max\limits_{\textbf{a}^{\prime}\in\mathcal{A}}Q(s^{(k+1)},\textbf{a}^{\prime})\mid s^{(k)},\textbf{a}^{(k)}\big], \qquad (3)

where $s^{(k+1)}$ denotes the next state provided the sender takes action $\textbf{a}^{(k)}$ at state $s^{(k)}$, $\mathcal{A}$ denotes the set of actions that can be chosen at state $s^{(k+1)}$, and $\gamma$ denotes the discount factor which represents the uncertainty of the sender about future rewards. To the best of our knowledge, most existing value-based RL methods apply $\epsilon$-greedy as the action selection policy. In conventional $\epsilon$-greedy, agents select the action with the maximum Q value with a high probability and randomly select an action with a very low probability $\frac{\epsilon}{|\mathcal{A}|}$, where $|\mathcal{A}|$ denotes the cardinality of the set $\mathcal{A}$. However, we find that agents still need to calculate the Q value at the next time slot even if they have already chosen an action that greatly improves the utility at the current time slot. Moreover, we consider that during the learning process the value of $\epsilon$ should be gradually decreased, so that agents have a higher probability of exploring possibly optimal actions at the beginning and the Q-function converges quickly in the end.
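For reference, a minimal tabular sketch of conventional $\epsilon$-greedy selection and the bootstrapped update implied by Eq. (3) is given below; the learning rate and the dictionary-based Q-table are illustrative assumptions, since the letter itself uses deep networks.

```python
import random
from collections import defaultdict

# Tabular sketch of conventional epsilon-greedy and the bootstrapped
# update implied by Eq. (3). The learning rate alpha and the dict-based
# Q-table are illustrative assumptions.
Q = defaultdict(float)                 # Q[(state, action)]
gamma, alpha, eps = 0.6, 0.1, 0.3

def epsilon_greedy(state, actions):
    if random.random() < eps:          # explore: each action with prob. eps/|A|
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])   # exploit: max-Q action

def q_update(state, action, reward, next_state, actions):
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next                 # target of Eq. (3)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```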

Based on $\epsilon$-greedy, we propose to add a parameter $\tau$ that represents the probability with which the current action $\textbf{a}^{(k)}$ is directly selected at the next time slot without calculating the Q value, called the $(\tau,\epsilon)$-greedy action policy in this letter. Thus, there are three possible ways for the sender to select the action at the current state. That is, the sender may directly adopt the previous action with probability $\tau$, randomly select any action in $\mathcal{A}$ with probability $\frac{\epsilon}{|\mathcal{A}|}$, or take the action with the maximum Q value with probability $(1-\tau-\epsilon)$. The proposed $(\tau,\epsilon)$-greedy is formulated as

\pi(\textbf{a}^{(k)}|s^{(k)})=\begin{cases}\textbf{a}^{(k-1)}&{\textrm{with }\tau}\\ \textbf{a}_{r}&{\textrm{with }\frac{\epsilon}{|\mathcal{A}|}}\\ \arg\max\limits_{\textbf{a}^{\prime}\in\mathcal{A}}Q(s^{(k)},\textbf{a}^{\prime})&{\textrm{with }(1-\tau-\epsilon)}\end{cases}, \qquad (4)

where $\pi$ and $\textbf{a}_{r}$ denote the action policy and a random action, respectively. Let $\overline{u}_{s}^{(k-1)}$ denote the average utility over the previous $T$ time slots, computed as $\overline{u}_{s}^{(k-1)}=\frac{1}{T}\sum_{i=1}^{T}u_{s}^{(k-i)}$. According to $(\tau,\epsilon)$-greedy, the sender directly takes the action $\textbf{a}^{(k)}$ with probability $\tau$ at the next time slot $k+1$. In order to improve the learning speed, we hope to preserve a previous action that greatly contributes to the system. Therefore, we use the difference between $u_{s}^{(k)}$ and $\overline{u}_{s}^{(k-1)}$ to measure how much action $\textbf{a}^{(k)}$ contributes. It is reasonable that the probability $\tau$ of adopting $\textbf{a}^{(k)}$ at the next time slot should increase as the difference $(u_{s}^{(k)}-\overline{u}_{s}^{(k-1)})$ increases, and vice versa. Based on this consideration, we design a Gaussian-like function to compute the value of $\tau$ as follows.

\tau=\begin{cases}1-\frac{1}{\sqrt{2\pi}\sigma_{1}}\exp\big(-\frac{(u_{s}^{(k)}-\overline{u}_{s}^{(k-1)})^{2}}{2\sigma_{1}^{2}}\big)&{u_{s}^{(k)}>\overline{u}_{s}^{(k-1)}}\\ \frac{1}{\sqrt{2\pi}\sigma_{2}}\exp\big(-\frac{(u_{s}^{(k)}-\overline{u}_{s}^{(k-1)})^{2}}{2\sigma_{2}^{2}}\big)&{\textrm{otherwise}}\end{cases}, \qquad (5)

where $\sigma_{1}$ and $\sigma_{2}$ are parameters that control the step size for adjusting $\tau$: the larger these two parameters are, the more gently $\tau$ varies. By Eq. (5), the value of $\tau$ is adaptively adjusted according to the dynamic threshold $\overline{u}_{s}^{(k-1)}$.
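The action policy of Eqs. (4)-(5) can be sketched as follows in Python; the `q_values` callback, the class layout, and the utility history buffer are illustrative assumptions, not the authors' implementation.

```python
import math
import random
from collections import deque

# Sketch of the (tau, epsilon)-greedy policy, Eqs. (4)-(5). The q_values
# callback (returning a dict action -> Q value) and the history buffer
# are assumptions.
class TauEpsilonGreedy:
    def __init__(self, actions, epsilon=0.3, sigma1=0.8, sigma2=85.0, T=5):
        self.actions, self.epsilon = actions, epsilon
        self.sigma1, self.sigma2 = sigma1, sigma2
        self.history = deque(maxlen=T)      # last T utilities for the dynamic threshold
        self.tau, self.prev_action = 0.0, None

    def select(self, state, q_values):
        r = random.random()
        if self.prev_action is not None and r < self.tau:
            return self.prev_action                               # repeat previous action (prob. tau)
        if r < self.tau + self.epsilon:
            self.prev_action = random.choice(self.actions)        # random action (total prob. epsilon)
        else:
            q = q_values(state)
            self.prev_action = max(self.actions, key=lambda a: q[a])  # greedy action
        return self.prev_action

    def update_tau(self, utility):
        if self.history:                                          # Eq. (5): Gaussian-like update
            diff = utility - sum(self.history) / len(self.history)
            sigma = self.sigma1 if diff > 0 else self.sigma2
            g = math.exp(-diff ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
            self.tau = 1 - g if diff > 0 else g
        self.history.append(utility)
```

In use, `update_tau` is called with the utility $u_{s}^{(k)}$ of Eq. (2) after each transmission, so a large improvement over the recent average pushes $\tau$ toward one and the previous action is simply repeated.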

Figure 2: Diagram of the proposed algorithm.

III-B Double Deep Network Structure

To update the Q value via Eq. (3) at each time slot, neural networks can be used in our setup. As done in [10, 16, 17], we also use a CNN with two convolutional (Conv) layers and two fully connected (FC) layers. The first and second Conv layers have 20 filters of size $3\times 3$ with stride 1 and 40 filters of size $2\times 2$ with stride 1, respectively. The rectified linear unit (ReLU) is used as the activation function in the two Conv layers. The first FC layer has 180 ReLUs, and the second FC layer has $S\times N$ outputs. Based on the outputs of the CNN, the sender can obtain the optimal transmit power and channel.
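A minimal PyTorch sketch of this network follows; the single-channel $6\times 6$ input layout used here is an assumption for illustration, since the letter specifies only the filter counts, kernel sizes, strides, and layer widths.

```python
import torch
import torch.nn as nn

# Sketch of the Q-network of Sec. III-B. Only filter counts, kernel sizes,
# strides, and layer widths come from the letter; the 1 x 6 x 6 input
# layout for the state-action history is an assumption.
S, N = 3, 32                       # power levels and channels (Sec. IV)

q_net = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=3, stride=1), nn.ReLU(),   # Conv1: 20 filters, 3x3, stride 1
    nn.Conv2d(20, 40, kernel_size=2, stride=1), nn.ReLU(),  # Conv2: 40 filters, 2x2, stride 1
    nn.Flatten(),
    nn.Linear(40 * 3 * 3, 180), nn.ReLU(),                   # FC1: 180 ReLUs (for a 6x6 input)
    nn.Linear(180, S * N),                                   # FC2: one Q value per (power, channel)
)

phi = torch.randn(1, 1, 6, 6)      # dummy state-action history
print(q_net(phi).shape)            # torch.Size([1, 96])
```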

Let $\bm{\varphi}^{(k)}$ denote the input of the CNN at time slot $k$, which consists of the current system state and the previous $W$ system state-action pairs, i.e., $\bm{\varphi}^{(k)}=[(s^{(k-W)},\textbf{a}^{(k-W)}),\ldots,(s^{(k-1)},\textbf{a}^{(k-1)}),s^{(k)}]$. In this structure, we create two identical CNNs, denoted as $Q_{1}$ and $Q_{2}$, respectively. $Q_{1}$ is used to select the action $\textbf{a}_{max}^{(k)}$ that has the maximum Q value computed by Eq. (6), and $Q_{2}$ uses $\textbf{a}_{max}^{(k)}$ to calculate the target Q value $Q_{target}$ by Eq. (7). The parameters of $Q_{1}$ and $Q_{2}$ at time slot $k$ are denoted as $\bm{\theta}^{(k)}_{1}$ and $\bm{\theta}^{(k)}_{2}$, respectively.

\textbf{a}_{max}^{(k)}=\arg\max\limits_{\textbf{a}^{\prime}\in\mathcal{A}}Q_{1}(\bm{\varphi}^{(k+1)},\textbf{a}^{\prime};\bm{\theta}^{(k)}_{1}), \qquad (6)
Q_{target}^{(k)}=u_{s}^{(k)}+\gamma Q_{2}(\bm{\varphi}^{(k+1)},\textbf{a}^{(k)}_{max};\bm{\theta}^{(k)}_{2}), \qquad (7)

where $\bm{\varphi}^{(k+1)}$ denotes the next state. Note that in this letter the utility $u_{s}^{(k)}$ is taken as the reward, and $\textbf{a}_{max}^{(k)}$ denotes the action with the maximum Q value output by the $Q_{1}$ network at state $\bm{\varphi}^{(k+1)}$. The parameters of $Q_{2}$ are not updated by training, but overwritten by the parameters of $Q_{1}$ every $f$ time slots.
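A short sketch of this double-network target, reusing the `q_net` architecture above for both copies; the helper names `q_target` and `sync_target` are assumptions made for illustration.

```python
import copy
import torch

# Sketch of Eqs. (6)-(7): the online network q1 selects the action, the
# target network q2 evaluates it. gamma and the sync period f follow Sec. IV.
q1 = q_net                          # online network (theta_1)
q2 = copy.deepcopy(q_net)           # target network (theta_2), frozen between syncs
gamma, f = 0.6, 10

@torch.no_grad()
def q_target(reward, phi_next):
    a_max = q1(phi_next).argmax(dim=1)                      # Eq. (6): selection by Q1
    q_next = q2(phi_next).gather(1, a_max.unsqueeze(1))     # Eq. (7): evaluation by Q2
    return reward + gamma * q_next.squeeze(1)

def sync_target(k):
    if k % f == 1:                                          # copy theta_1 into theta_2 every f slots
        q2.load_state_dict(q1.state_dict())
```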

III-C Prioritized Experience Replay

Let $\textbf{e}^{(k)}=[\bm{\varphi}^{(k)},\bm{\varphi}^{(k+1)},\textbf{a}^{(k)},u_{s}^{(k)}]$ denote the experience sample of time slot $k$. Each $\textbf{e}^{(k)}$ is stored in the sum-tree [13], a data structure in which every node is the sum of its children and each leaf is associated with its own priority, denoted as $q_{k}$. Thus the whole sum-tree at time slot $k$ can be denoted as $\mathcal{ST}=\{(\textbf{e}^{(1)},q_{1}),\ldots,(\textbf{e}^{(k)},q_{k})\}$. For each experience replay, we extract $M$ experience samples from the sum-tree in descending order of probability. For brevity, we use the index $i$ to represent the $i$th of these $M$ experience samples, so that $\{(\textbf{e}^{(i)},q_{i})\}=\{([\bm{\varphi}^{(i)},\bm{\varphi}^{(i+1)},\textbf{a}^{(i)},u_{s}^{(i)}],q_{i})\}$, $1\leq i\leq M$. The probability of the $i$th sample, denoted as $P_{i}$, is calculated by

P_{i}=\frac{q_{i}}{\sum_{j=1}^{k}q_{j}}. \qquad (8)

To evaluate the priority of an experience sample, the temporal-difference (TD) error $\psi_{i}$ is used, which is computed by

\psi_{i}=Q_{target}^{(i)}-Q_{1}(\bm{\varphi}^{(i)},\textbf{a}^{(i)}). \qquad (9)

Following the stochastic gradient descent (SGD) algorithm, the parameters $\bm{\theta}_{1}^{(k)}$ of the $Q_{1}$ network are updated via minibatch updates, and the loss function, chosen as in [13], is

L(\bm{\theta}_{1}^{(k)})=\frac{1}{M}\sum_{i=1}^{M}\omega_{i}\psi_{i}^{2}, \qquad (10)

where $\omega_{i}$ denotes the importance-sampling weight, calculated by

\omega_{i}=\frac{(M\cdot P_{i})^{-\lambda}}{\max\limits_{1\leq j\leq k}\omega_{j}}, \qquad (11)

where $\lambda$ is a factor that controls the amount of importance sampling. In particular, $\lambda=0$ and $\lambda=1$ correspond to no and full importance sampling, respectively.

After the parameters of $Q_{1}$ are updated, the TD error $\psi_{j}$ ($j\in\{1,\ldots,k\}$) of each experience sample is recalculated via Eq. (9), and the priority of each experience is updated as $q_{j}=|\psi_{j}|$. The parameters $\bm{\theta}_{2}^{(k)}$ of the $Q_{2}$ network do not need to be updated immediately; instead, they are replaced by the parameters $\bm{\theta}_{1}^{(k)}$ of network $Q_{1}$ with frequency $f$. The overall procedure is summarized in Algorithm 1.
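A minimal sketch of one prioritized replay step, Eqs. (8)-(11), is given below; for clarity it uses a flat priority list instead of a sum-tree, reuses the `q1` and `q_target` sketches above, and, unlike Algorithm 1, refreshes priorities only for the sampled transitions as in [13]. The buffer layout and variable names are assumptions.

```python
import torch

# Sketch of one prioritized replay step, Eqs. (8)-(11). A flat priority
# list stands in for the sum-tree; q1 and q_target come from the sketches
# above. Buffer entries are (phi, phi_next, action_index, utility).
lam, M = 0.4, 10
buffer, priorities = [], []

def replay_step(optimizer):
    probs = torch.tensor(priorities) / sum(priorities)       # Eq. (8)
    idx = torch.multinomial(probs, M, replacement=True)      # draw M experiences
    weights = (M * probs[idx]) ** (-lam)
    weights = weights / weights.max()                        # Eq. (11), normalized IS weights

    losses, new_priorities = [], []
    for i, w in zip(idx.tolist(), weights):
        phi, phi_next, action, utility = buffer[i]
        target = q_target(torch.tensor([utility]), phi_next) # Eqs. (6)-(7)
        td = target.squeeze() - q1(phi)[0, action]           # Eq. (9)
        losses.append(w * td ** 2)
        new_priorities.append((i, td.detach().abs().item()))

    loss = torch.stack(losses).mean()                        # Eq. (10)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

    for i, p in new_priorities:                              # q_i = |psi_i|
        priorities[i] = p
```

Here `optimizer` would be any optimizer over `q1.parameters()` (e.g., plain SGD, matching the minibatch SGD update described in the text).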

Algorithm 1: $(\tau,\epsilon)$-PDDQN anti-jamming scheme
Output: Transmission power and channel
Initialization: $N$, $h_{s}$, $h_{j}$, $\beta$, $C_{m}$, $C_{s}$, $\gamma$, $T$, $\tau$, $\sigma_{1}$, $\sigma_{2}$, $\bm{\theta}_{1}$, $\bm{\theta}_{2}$, $W$, $M$, $q$, $f$, $\lambda=0.4$, $\epsilon=0.3$, $s^{(0)}=\textrm{SINR}^{(0)}$, $\mathcal{ST}=\varnothing$
for $k=1,2,\ldots$ do
    if $k\leq W$ then
        Choose channel $x^{(k)}$ and power $p_{s}^{(k)}$ randomly;
    else
        $z=$ random$[0,1]$;
        if $z\leq\tau$ then
            Let $x^{(k)}=x^{(k-1)}$, $p_{s}^{(k)}=p_{s}^{(k-1)}$;
        else
            Input $\bm{\varphi}^{(k)},\bm{\theta}^{(k)}$;
            Obtain the CNN output $Q(\bm{\varphi}^{(k)};\bm{\theta}^{(k)})$;
            Choose $\textbf{a}^{(k)}=[x^{(k)},p_{s}^{(k)}]$ by $\epsilon$-greedy;
        Receive $s^{(k+1)}=\textrm{SINR}^{(k)}$;
        Evaluate $u^{(k)}_{s}$ via (2) and update $\tau$ via (5);
        Let $\bm{\varphi}^{(k+1)}=[(s^{(k-W+1)},\bm{a}^{(k-W+1)}),\ldots,s^{(k+1)}]$;
        Store $(\bm{\varphi}^{(k)},\bm{\varphi}^{(k+1)},\bm{a}^{(k)},u_{s}^{(k)})$ in $\mathcal{ST}$ as $\textbf{e}^{(k)}$;
        for $i=1,2,\ldots,M$ do
            Choose $\textbf{e}^{(i)}$ with probability $P_{i}$ via (8);
            Obtain $\textbf{a}_{max}$ via (6) and $Q_{target}$ via (7);
            Calculate $\psi_{i}$ and $\omega_{i}$ via (9) and (11), respectively;
        Update $\bm{\theta}^{(k)}_{1}$ via (10);
        for $i=1,2,\ldots,k$ do
            Calculate $\psi_{i}$ for all experiences in $\mathcal{ST}$ via (9) and the updated network $Q_{1}$;
            Update $q_{i}$ in $\mathcal{ST}$ by $q_{i}=|\psi_{i}|$;
    if $k\,\%\,f==1$ then
        Update the parameters of $Q_{2}$: $\bm{\theta}^{(k)}_{2}=\bm{\theta}^{(k)}_{1}$;

IV EXPERIMENTAL RESULTS

In this section, a number of experiments are conducted to evaluate the proposed method. In our experimental setup, the parameters are set as $J=2$, $N=32$, $P_{S}=[1,5,10]$ W, $P_{J}=[0,4,8,10]$ W, $h_{s}=h_{j}=0.5$, $\beta=1$, $C_{m}=1$, $C_{s}=0.2$, $\gamma=0.6$, $T=5$, $\sigma_{1}=0.8$, $\sigma_{2}=85$, $W=32$, $M=10$, $f=10$, and epoch $=400$.

First, we implement the existing value-based RL methods including QL, DQN, DDQN, and PDDQN in a wireless communication system. These four methods are all based on $\epsilon$-greedy. The SINR performance and convergence rate are analyzed, and the results are shown in Fig. 3(a). As expected, PDDQN performs the best among the four methods and is therefore selected. To our knowledge, this is the first time that PDDQN has been applied to anti-jamming wireless communications. Next, we compare the SINR performance of the variable and constant transmission power models under the same average power $\overline{P}$. For a fair comparison, both models use PDDQN and the comparison results are taken after convergence. In our experiment, PDDQN becomes fully convergent after 200 time slots, and the average power $\overline{P}$ of the variable power model is 6.19 W. We can see from Fig. 3(b) that the variable power model obtains a much higher SINR than the constant power model under the same average power $\overline{P}=6.19$ W. Therefore, the variable transmission power model is adopted. Finally, we apply the proposed $(\tau,\epsilon)$-greedy to the four value-based RL methods, which are named $(\tau,\epsilon)$-QL, $(\tau,\epsilon)$-DQN, $(\tau,\epsilon)$-DDQN, and $(\tau,\epsilon)$-PDDQN. The curves of SINR performance and convergence rate are drawn in Fig. 4. It is clear that the convergence rate with the proposed $(\tau,\epsilon)$-greedy is greatly improved for all four RL methods. Moreover, the communication performance (measured by SINR) with $(\tau,\epsilon)$-greedy is also much better than without it. This is mainly because the proposed algorithm not only selects a more valuable action in most cases but also finds the most valuable action at a faster speed.

Figure 3: Illustration of model selection. (a) SINR and convergence rate comparisons of various RL methods. (b) SINR comparison of the variable and constant power models.
Figure 4: SINR and convergence rate comparisons of various RL methods with and without $(\tau,\epsilon)$-greedy.

V CONCLUSIONS

In this letter, we have presented a $(\tau,\epsilon)$-greedy RL algorithm. The proposed $(\tau,\epsilon)$-greedy can replace the existing $\epsilon$-greedy used in almost all value-based RL methods. We have implemented the proposed algorithm in a wireless communication system against multiple jammers. The experimental results show that the proposed algorithm accelerates the learning speed of the wireless device and significantly improves the performance of existing RL methods in terms of SINR and convergence rate. In future work, this algorithm will be extended to anti-jamming wireless communication scenarios with continuous action sets.

References

  • [1] A. Mukherjee, S. A. A. Fakoorian, J. Huang, and A. L. Swindlehurst, “Principles of physical layer security in multiuser wireless networks: A survey,” IEEE Communications Surveys Tutorials, vol. 16, no. 3, pp. 1550–1573, 2014.
  • [2] E. Lance and G. K. Kaleh, “A diversity scheme for a phase-coherent frequency-hopping spread-spectrum system,” IEEE Transactions on Communications, vol. 45, no. 9, pp. 1123–1129, 1997.
  • [3] H. Wang, L. Zhang, T. Li, and J. Tugnait, “Spectrally efficient jamming mitigation based on code-controlled frequency hopping,” IEEE Transactions on Wireless Communications, vol. 10, no. 3, pp. 728–732, 2011.
  • [4] Q. Wang, P. Xu, K. Ren, and X. Li, “Towards optimal adaptive UFH-based anti-jamming wireless communication,” IEEE Journal on Selected Areas in Communications, vol. 30, no. 1, pp. 16–30, 2012.
  • [5] L. Xiao, Y. Li, J. Liu, and Y. Zhao, “Power control with reinforcement learning in cooperative cognitive radio networks against jamming,” The Journal of Supercomputing, vol. 71, no. 9, pp. 3237–3257, 2015.
  • [6] L. Xiao, Y. Li, C. Dai, H. Dai, and H. V. Poor, “Reinforcement learning-based NOMA power allocation in the presence of smart jamming,” IEEE Transactions on Vehicular Technology, vol. 67, no. 4, pp. 3377–3389, 2018.
  • [7] Y. Gwon, S. Dastangoo, C. Fossa, and H. T. Kung, “Competing mobile network game: Embracing antijamming and jamming strategies with reinforcement learning,” in Proc. IEEE Conference on Communications and Network Security (CNS), 2013, pp. 28–36.
  • [8] F. Slimeni, B. Scheers, Z. Chtourou, and V. Le Nir, “Jamming mitigation in cognitive radio networks using a modified Q-learning algorithm,” in Proc. International Conference on Military Communications and Information Systems (ICMCIS), 2015, pp. 1–7.
  • [9] V. Mnih, K. Kavukcuoglu, and D. Silver, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, 2015.
  • [10] G. Han, L. Xiao, and H. V. Poor, “Two-dimensional anti-jamming communication based on deep reinforcement learning,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2087–2091.
  • [11] X. Liu, Y. Xu, L. Jia, Q. Wu, and A. Anpalagan, “Anti-jamming communications using spectrum waterfall: A deep reinforcement learning approach,” IEEE Communications Letters, vol. 22, no. 5, pp. 998–1001, 2018.
  • [12] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. AAAI Conference on Artificial Intelligence, 2016, pp. 2094–2100.
  • [13] T. Schaul, J. Quan, and I. Antonoglou, et al., “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
  • [14] T. Hester, M. Vecerik, and O. Pietquin, et al., “Deep Q-learning from demonstrations,” in Proc. AAAI Conference on Artificial Intelligence, 2018, pp. 3223–3230.
  • [15] A. S. Lakshminarayanan, S. Sharma, and B. Ravindran, “Dynamic action repetition for deep reinforcement learning,” in Proc. AAAI Conference on Artificial Intelligence, 2017, pp. 2133–2139.
  • [16] L. Xiao, D. Jiang, X. Wan, W. Su, and Y. Tang, “Anti-jamming underwater transmission with mobility and learning,” IEEE Communications Letters, vol. 22, no. 3, pp. 542–545, 2018.
  • [17] L. Xiao, D. Jiang, D. Xu, H. Zhu, Y. Zhang, and H. V. Poor, “Two-dimensional antijamming mobile communication based on reinforcement learning,” IEEE Transactions on Vehicular Technology, vol. 67, no. 10, pp. 9499–9512, 2018.