
Robust Network Slicing: Multi-Agent Policies, Adversarial Attacks, and Defensive Strategies

Feng Wang, M. Cenk Gursoy, and Senem Velipasalar. The authors are with the Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, NY, 13244 (e-mail: [email protected], [email protected], [email protected]). The material in this paper was presented in part at the 2022 IEEE International Conference on Communications (ICC).
Abstract

In this paper, we present a multi-agent deep reinforcement learning (deep RL) framework for network slicing in a dynamic environment with multiple base stations and multiple users. In particular, we propose a novel deep RL framework with multiple actors and a centralized critic (MACC), in which the actors are implemented as pointer networks to fit the varying dimension of the input. We evaluate the performance of the proposed deep RL algorithm via simulations to demonstrate its effectiveness. Subsequently, we develop a deep RL based jammer with limited prior information and a limited power budget. The goal of the jammer is to minimize the transmission rates achieved with network slicing and thus degrade the network slicing agents’ performance. We design a jammer with both listening and jamming phases and address jamming location optimization as well as jamming channel optimization via deep RL. We evaluate the jammer at the optimized location, generating interference attacks in the optimized set of channels by switching between the jamming and listening phases. We show that the proposed jammer can significantly reduce the victims’ performance without direct feedback or prior knowledge of the network slicing policies. Finally, we devise a Nash-equilibrium-supervised policy ensemble mixed strategy profile for network slicing (as a defensive measure) and jamming. We evaluate the performance of the proposed policy ensemble algorithm by applying it to the network slicing agents and the jammer agent in simulations to show its effectiveness.

Index Terms:
Network slicing, dynamic channel access, deep reinforcement learning, multi-agent actor-critic, adversarial learning, policy ensemble, Nash equilibrium.

I Introduction

Network slicing in 5G radio access networks (RANs) allows enhancements in service flexibility for novel applications with heterogeneous requirements [1, 2, 3, 4]. In network slicing, the physical cellular network resources are divided into multiple virtual network slices to serve the end user, and thus network slicing is a vital technology to meet the strict requirements of each user by allocating the desired subset of network slices [5]. Instead of model-based optimization of resource allocation that assumes the knowledge of traffic statistics [6, 7, 8], deep reinforcement learning (deep RL) [9, 10] as a model-free decision making strategy can be deployed to optimize slice selection and better cope with the challenges such as dynamic environment, resource interplay, and user mobility [11, 12, 13, 14]. Most existing studies in the literature assume the slice state to be identical for all different slices over time, and hence the selection problem reduces to assigning the number of slices to each request.

In this paper, we consider a more practical scenario of a cellular coverage area with multiple base stations and a dynamic interference environment, analyze resource allocation in the presence of multiple users with random mobility patterns, and develop a multi-agent actor-critic deep RL agent to learn the slice conditions and achieve cooperation among base stations. We propose a learning and decision-making framework with multiple actors and a centralized critic (MACC) that aims at maximizing the overall performance instead of local performance. In the machine learning literature, MACC framework has been proposed for general RL tasks [15, 16, 17]. In our setting, this multi-agent system requires the actors at each base station to communicate with a centralized critic at the data center/server to share the experience and update parameters during the training process. It subsequently switches to decentralized decision making following the training.

A 5G RAN typically achieves higher transmission rates by operating in higher frequency bands, but this comes at the cost of smaller coverage, necessitating a relatively dense cellular network architecture. In such a setting, the dynamic environment with user mobility might have a significant impact on the network slicing performance. Due to this, we introduce the pointer network [18] to implement the actor policy and handle the varying observations of the deep RL agents.

It is important to note that due to being highly data driven, deep neural networks are vulnerable to minor but carefully crafted perturbations, and it is known that adversarial samples with such perturbations can cause significant loss in accuracy, e.g. in inference and classification problems in computer vision [19, 20, 21]. Given the broadcast nature of wireless communications, deep learning based adversarial attacks have also recently been attracting increasing attention in the context of wireless security [22, 23]. In particular, deep RL agents are vulnerable to attacks that lead to adversarial perturbations in their observations (akin to adversarial samples in classification and inference problems). In wireless settings, jamming is an adversarial attack that alters the state and/or observations of decision-making RL agents. Motivated by these considerations, we also design a deep RL based jammer agent with jamming and listening phases. Different from most existing works, we analyze how the jamming location and channel selection are optimized without direct feedback on the victims’ performance. We further analyze the performance degradation in network slicing agents in the presence of jamming attacks and identify their sensitivity in an adversarial environment.

Finally, we note that there is also a growing interest in developing defense strategies against adversarial attacks [24, 25, 26], and specifically jamming attacks [27, 28]. One of the most intriguing defense strategies is to ensemble several different policies, explore alternative strategies, and provide stable performance [17, 29]. In our context, we consider the mobile virtual network operator (MVNO) and the jammer as two players in a zero-sum incomplete information game. Several existing studies within this framework focus on applications, e.g., involving video games or poker games, in which random exploration does not lead to physical loss or penalty. In contrast, wireless users may experience loss of connectivity in such cases. Therefore, an efficient and prudent exploration plan with fast convergence is desired. Most existing works also restrict the scope to the performance of the considered player or its performance against a certain type of adversary, failing to consider the nature of the zero-sum game against an unknown opponent that is potentially adaptive. In this paper, we propose an approach we refer to as Nash-equilibrium-supervised policy ensemble (NesPE), which utilizes the optimized mixed strategy profile to supervise the training of the policies in the ensemble so as to fully explore the environment, leave no room for improvement to the opponent, and pursue the global equilibrium over all possible policy ensembles. We evaluate the performance of NesPE by applying it to both the aforementioned network slicing victim agent and the jammer agent in a competitive context, and compare it with two other policy ensemble algorithms, demonstrating significantly improved performance.

The remainder of the paper is organized as follows. In Section II, we introduce the wireless network virtualization framework and dynamic channel model, and formulate the network slicing problem. In Section III, we propose the MACC deep RL algorithm with its pointer network architecture for actor implementation, and describe the network slicing agents. In this section, we also evaluate the performance of the proposed algorithm. Subsequently, we devise the deep RL based jammer agent in Section IV, introduce its two operation phases, jamming location optimization, jamming channel optimization, and the actor-critic implementation, and evaluate its performance. In Section V, we design the NesPE algorithm, describe its steps, analyze its performance in both non-competitive and competitive environments, and compare it with other policy ensemble algorithms by applying them on both the victim agent and the jammer agent. Finally, we conclude the paper in Section VI.

II System Model and Problem Formulation

In this section, we introduce the wireless network virtualization (WNV) framework and the interference channel model, and formulate network slicing as an optimization problem.

II-A Wireless Network Virtualization

WNV is well-known for enhancing the spectrum allocation efficiency. As shown in Fig. 1, infrastructure providers (InPs) own the physical infrastructure (e.g., base stations, backhaul, and core network), and operate the physical wireless network. WNV virtualizes the physical resources of InPs and separates them into slices to efficiently allocate these virtualized resources to the mobile virtual network operators (MVNOs). Therefore, MVNOs deliver differentiated services via slices to user equipments (UEs) with varying transmission requirements.

Figure 1: WNV system model for allocating virtualized physical resources via MVNO. NcbN_{cb} denotes the number of slices available at InP bb.

In this paper, we consider a service area that contains NBN_{B} base stations of InPs with inter-base-station interference, and MVNOs that require services from the InPs. These MVNOs provide services to NuN_{u} UEs that are within the coverage area of at least one base station, and may move from the coverage of one base station to that of another. Users in the coverage area of multiple base stations that serve the MVNO can be assigned to any of these base stations. Without loss of generality, we discuss the station-user pair for a single MVNO in the remainder of this paper. Next, we introduce the dynamic channel environment.

II-B Interference Channel Model

We consider a dynamic environment where NN channels are available for each MVNO. The fading coefficient of the link between base station bb and a UE uu in a certain channel cc at time tt is denoted by hcb,u(t)𝒞𝒩(0,1)h_{c}^{b,u}(t)\sim\mathcal{CN}(0,1), an independent and identically distributed (i.i.d.) circularly symmetric complex Gaussian (CSCG) random variable with zero mean and unit variance. Each fading coefficient varies every TT time slots as a first-order complex Gauss-Markov process, according to the Jakes fading model [30]. Therefore, at time (n+1)T(n+1)T, the fading coefficient can be expressed as

hcb,u((n+1)T)=ρhcb,u(nT)+1ρ2ec((n+1)T),h_{c}^{b,u}((n+1)T)=\rho h_{c}^{b,u}(nT)+\sqrt{1-\rho^{2}}e_{c}((n+1)T), (1)

where ece_{c} denotes the channel innovation process, which is also an i.i.d. CSCG random variable. Furthermore, we have ρ=J0(2πfdT)\rho=J_{0}(2\pi f_{d}T), where J0J_{0} is the zeroth order Bessel function of the first kind and fdf_{d} is the maximum Doppler frequency.
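As an illustration, the following minimal Python sketch advances the fading coefficients of (1) by one period; the variable names and the default values fd = 1 Hz and T = 0.02 s are illustrative assumptions, not the simulator used in the paper.

import numpy as np
from scipy.special import j0  # zeroth-order Bessel function of the first kind

def update_fading(h, f_d=1.0, T=0.02, rng=np.random.default_rng()):
    # rho = J_0(2*pi*f_d*T) as defined below (1)
    rho = j0(2 * np.pi * f_d * T)
    # i.i.d. CSCG innovation process e_c with zero mean and unit variance
    e = (rng.standard_normal(h.shape) + 1j * rng.standard_normal(h.shape)) / np.sqrt(2)
    return rho * h + np.sqrt(1 - rho**2) * e

# fading coefficients of N = 16 channels for one base-station/user link
h = (np.random.standard_normal(16) + 1j * np.random.standard_normal(16)) / np.sqrt(2)
h = update_fading(h)   # coefficients after one period T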

II-C Network Slicing Problem Formulation

In the aforementioned environment, the resource allocation is performed in a first-come first-served fashion. Each base station may serve at most NrN_{r} requests from different users simultaneously, and the request queue can stack at most NqN_{q} requests. For each request, multiple channels can be allocated, and the transmission power is PBP_{B} in each channel. Different requests served by the same base station do not share the same channels, while requests to different base stations may share the same channel at the cost of inflicting interference. For all requests to each base station, transmission is allowed in no more than NcN_{c} channels, and consequently the total power is limited to NcPBN_{c}P_{B}. In this setting, if request kk is allocated the subset CkC_{k} of channels at base station bb, the sum rate 𝗋k\mathsf{r}_{k} for this request is

𝗋k=cCk𝗋cb,u,\mathsf{r}_{k}=\sum_{c\in C_{k}}{\mathsf{r}_{c}^{b,u}}, (2)

with

𝗋cb,u=log2(1+PBLb,u|hcb,u|2bb𝒩cb,b+σ2),\mathsf{r}_{c}^{b,u}=\log_{2}\left(1+{\frac{P_{B}L^{b,u}|h_{c}^{b,u}|^{2}}{\sum_{b^{\prime}\neq b}{\mathcal{N}^{b,b^{\prime}}_{c}}+\sigma^{2}}}\right), (3)
𝒩cb,b=𝟏cb,bPBLb,u|hcb,u|2,\mathcal{N}^{b,b^{\prime}}_{c}=\mathbf{1}_{c}^{b,b^{\prime}}P_{B}L^{b^{\prime},u}|h_{c}^{b^{\prime},u}|^{2}, (4)

where cc denotes the index of a selected channel, 𝗋cb,u\mathsf{r}_{c}^{b,u} is the transmission rate from base station bb to UE uu in channel cc, 𝒩cb,b\mathcal{N}^{b,b^{\prime}}_{c} is the interference from base station bb^{\prime} in channel cc, 𝟏cb,b\mathbf{1}_{c}^{b,b^{\prime}} is the indicator function for transmission at base stations bb and bb^{\prime} sharing channel cc, σ2\sigma^{2} is the variance of the additive white Gaussian noise, and Lb,uL^{b,u} is the path loss:

Lb,u\displaystyle L^{b,u} =(hB2+(xbxu)2+(ybyu)2)α/2,\displaystyle=\left(h_{B}^{2}+(x_{b}-x_{u})^{2}+(y_{b}-y_{u})^{2}\right)^{\alpha/2}, (5)

where hBh_{B} is the height of each base station, α\alpha is the path loss exponent, and {xb,yb}\{x_{b},y_{b}\} and {xu,yu}\{x_{u},y_{u}\} are the 2-D coordinates of base station bb and user uu, respectively.
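A minimal Python sketch of the rate computation in (3)-(5) is given below; the default parameter values and the representation of interfering links as tuples are illustrative assumptions rather than the paper's implementation.

import numpy as np

def path_loss(x_t, y_t, h_t, x_u, y_u, alpha=-2.0):
    # path loss of the form (5), for a transmitter at (x_t, y_t) with height h_t
    return (h_t**2 + (x_t - x_u)**2 + (y_t - y_u)**2) ** (alpha / 2)

def channel_rate(P_B, L_bu, h_bu, interferers, sigma2=1.0):
    # interferers: list of (indicator, L_{b',u}, h_c^{b',u}) for the other base stations, as in (4)
    interference = sum(ind * P_B * L * abs(h)**2 for ind, L, h in interferers)
    # transmission rate r_c^{b,u} as in (3)
    return np.log2(1 + P_B * L_bu * abs(h_bu)**2 / (interference + sigma2))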

Therefore, for each base station, the resource assignment, including the selected subsets CkC_{k}, the number of selected channels ncn_{c}, the number of served requests nrn_{r}, and the number of requests in queue nqn_{q}, must satisfy the following constraints:

Ck1Ck2=Ø,1k1<k2nr,{C_{k_{1}}}\cap{C_{k_{2}}}=\text{\O},\forall 1\leq k_{1}<k_{2}\leq n_{r}, (6)
nc=|k=1nrCk|,n_{c}=\left|\cup_{k=1}^{n_{r}}{C_{k}}\right|, (7)
ncNc,n_{c}\leq N_{c}, (8)
nrNr,n_{r}\leq N_{r}, (9)
nqNq.n_{q}\leq N_{q}. (10)

Above, (6) indicates that no channel is shared among different requests at the same base station, and (7) defines the number of channels being used. The constraints in (8)–(10) are the upper bounds on the number of channels, the number of requests being served simultaneously, and the number of requests in the queue, respectively. If nr=Nrn_{r}=N_{r} and nq=Nqn_{q}=N_{q} at a base station, any incoming request to that base station will be denied service.

The features of each request kk consist of minimum transmission rate mkm_{k}, lifetime lkl_{k}, and initial payload pkp_{k} before the transmission starts. At each time slot tt during transmission, the constraints are given as follows:

𝗋k(t)mk,\mathsf{r}_{k}(t)\geq m_{k}, (11)
lk(t)>0,l_{k}(t)>0, (12)

where lk(t)l_{k}(t) denotes the remaining lifetime at time tt. Note that with this definition, we have lk(0)=lkl_{k}(0)=l_{k}. If any request being processed fails to meet the constraints (11) or (12), it will be terminated and marked as failed. Any request in the queue that fails to meet constraint (12) will also be removed and marked as failed. Otherwise, the lifetime will be updated for each request being processed and each request in the queue as follows:

lk(t+1)=lk(t)1.l_{k}(t+1)=l_{k}(t)-1. (13)

Each request being processed will transmit 𝗋k(t)\mathsf{r}_{k}(t) bits of the remaining payload:

pk(t+1)=max(pk(t)𝗋k(t),0).p_{k}(t+1)=\operatorname*{max}(p_{k}(t)-\mathsf{r}_{k}(t),0). (14)

Note that the initial payload is pk(0)=pkp_{k}(0)=p_{k}. If the payload is completed within the lifetime (i.e., the remaining payload is pk(t+1)=0p_{k}(t+1)=0 for t+1lkt+1\leq l_{k}), the request kk is completed and marked as success.
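A minimal sketch of the per-slot request bookkeeping implied by (11)-(14) is shown below; the class and field names are illustrative assumptions used only to make the update order explicit.

class Request:
    def __init__(self, payload, min_rate, lifetime):
        self.p, self.m, self.l = payload, min_rate, lifetime   # p_k, m_k, l_k
        self.status = "active"    # becomes "success" or "failed"

    def step(self, rate):
        # constraints (11) and (12): minimum rate and remaining lifetime
        if rate < self.m or self.l <= 0:
            self.status = "failed"
            return
        self.p = max(self.p - rate, 0.0)   # payload update (14)
        self.l -= 1                        # lifetime update (13)
        if self.p == 0.0:
            self.status = "success"        # completed within the lifetime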

For all aforementioned cases, if the request is completed successfully in time, the network slicing agent at the base station bb receives a positive reward RkR_{k} equal to the initial payload pkp_{k}. Otherwise, it receives a negative reward of Rk=pkR_{k}=-p_{k}. Each user can only send one request at a time.

Afterwards, base station bb records the latest transmission rate history of 𝗋cb,u\mathsf{r}_{c}^{b,u} into a 2 dimensional matrix Hb0Nu×NH^{b}\in\mathbb{R}_{\geq 0}^{N_{u}\times N}. In each time slot tt, for request kk from user uu that is allocated subset CkC_{k} of channels at base station bb, the base station updates the entries corresponding to the channels that were selected for transmission:

Hb[u,c]=𝗋cb,u(t),cCk.H^{b}[u,c]=\mathsf{r}_{c}^{b,u}(t),\forall c\in C_{k}. (15)

Note that the location {xu,yu}\{x_{u},y_{u}\} of user uu, the fading coefficient hcb,uh_{c}^{b,u}, and the interference corresponding to 𝟏cb,b\mathbf{1}_{c}^{b,b^{\prime}} vary over time, and therefore HbH^{b} is only a first-order approximation of the potential transmission rate 𝗋cb,u(t+1)\mathsf{r}_{c}^{b,u}(t+1).

The goal of the network slicing agent at each base station is to find a policy π\pi that selects ncn_{c} channels and assigns them to nrn_{r} requests, so that the sum reward of all requests at all base stations is maximized over time:

argmaxπt=t(γ(tt)b=1NBkKbRk),\operatorname*{argmax}_{\pi}{\sum_{t^{\prime}=t}^{\infty}{\left(\gamma^{(t^{\prime}-t)}\sum_{b=1}^{N_{B}}\sum_{k\in K^{\prime}_{b}}{R_{k}}\right)}}, (16)

where KbK^{\prime}_{b} is the set of completed or terminated requests for base station bb at time tt, and γ(0,1)\gamma\in(0,1) is the discount factor.

III Multi-Agent Deep RL with Multiple Actors and Centralized Critic (MACC)

To solve the problem in (16), we propose a multi-agent deep RL algorithm with multiple actors and centralized critic (MACC) where the actors (one at each base station) utilize the pointer network to achieve the goal of attaining the maximal reward over all base stations by choosing the optimal subset of channels in processing each request. In the remainder of this paper, we denote the full observation at base station bb as 𝒪b\mathcal{O}_{b}, the observation over all base stations as 𝒪=b=1NB𝒪b\mathcal{O}=\cup_{b=1}^{N_{B}}\mathcal{O}_{b}, and the channel selection at base station bb as a matrix of actions 𝒜b{0,1}nr×N\mathcal{A}_{b}\in\{0,1\}^{n_{r}\times N}. Each element in 𝒜b\mathcal{A}_{b} is an indicator:

𝒜b[k,c]={1if channel c is assigned to request k,0otherwise.\displaystyle\mathcal{A}_{b}[k,c]=\begin{cases}1\hskip 8.5359pt\text{if channel $c$ is assigned to request $k$,}\\ 0\hskip 8.5359pt\text{otherwise.}\end{cases} (17)

In each time slot when nc(t)n_{c}(t) channels are selected, ck𝒜b[k,c]=nc(t)\sum_{c}\sum_{k}\mathcal{A}_{b}[k,c]=n_{c}(t). In this section, we first introduce the MACC framework, then describe the pointer network as the actor structure, and finally analyze the implementation on the considered network slicing problem.

III-A Deep RL with Multiple Actors and Centralized Critic

In this section, we briefly discuss the actor-critic algorithm [31]. This deep RL algorithm utilizes two neural networks, the actor and critic. The two networks have separate neurons and utilize separate backpropagation, and they may have separate hyperparameters.

The multi-agent extension of the actor-critic algorithm in which each agent has a separate actor and critic aiming at maximizing its individual reward is referred to as independent actor-critic (IAC) [15, 16]. When the action of each agent interferes with the others and the goal is to maximize the sum reward over all agents, deep RL with MACC may be preferred [16]. In this framework, the decentralized actors with parameter θ\theta and policy fθ(𝒪b)f^{\theta}(\mathcal{O}_{b}) are the same as in IAC, while a single centralized critic with parameter ϕ\phi and policy gϕ(𝒪)g^{\phi}(\mathcal{O}) targets the sum reward and updates each decentralized actor. Therefore, MACC agents are more likely to learn coordinated strategies through interactions between the agents and the shared environment, even though information sharing is needed only during the training phase. In the MACC framework, the critic maps 𝒪\mathcal{O} to a single temporal difference (TD) error

δ(t)=b=1NBkKbRk(t)+γgϕ(𝒪(t))gϕ(𝒪(t1)),\small\delta(t)=\sum_{b=1}^{N_{B}}\sum_{k\in K_{b}}{R_{k}}(t)+\gamma g^{\phi}\left(\mathcal{O}(t)\right)-g^{\phi}\left(\mathcal{O}(t-1)\right), (18)

where KbK_{b} is the set of requests being processed and requests in the queue at time tt, and γ(0,1)\gamma\in(0,1) is the discount factor. Then, the critic parameters ϕ\phi and actor parameters θ\theta are optimized by considering the least square temporal difference and policy gradient, respectively, as follows:

ϕ=argminϕ(δgϕ)2,\phi^{*}=\operatorname*{argmin}_{\phi}{(\delta^{g_{\phi}})^{2}}, (19)
θ=argmaxθθlogfθ(𝒪b)δgϕ.\theta^{*}=\operatorname*{argmax}_{\theta}{\nabla_{\theta}\log f^{\theta}(\mathcal{O}_{b})\delta^{g_{\phi}}}. (20)

For a wireless resource allocation problem, fast convergence in the training of neural networks is highly important. Therefore in our implementation, we not only recommend offline pre-training via simulations, but also speed up learning by sharing the neural network parameters among the agents, i.e., we use all the information to update one actor and one critic (as in IAC) during training and share the parameters among all agents.
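For concreteness, the following PyTorch sketch shows one MACC training step following (18)-(20); the FNN critic and actor, the flattened observation vectors o_prev, o_curr, o_b, and all dimensions are illustrative assumptions (in our design, the actual actor is the pointer network described in the next subsection).

import torch
import torch.nn as nn

obs_dim, local_obs_dim, act_dim, gamma = 64, 32, 16, 0.9   # illustrative sizes
critic = nn.Sequential(nn.Linear(obs_dim, 70), nn.ReLU(), nn.Linear(70, 1))
actor = nn.Sequential(nn.Linear(local_obs_dim, 70), nn.ReLU(), nn.Linear(70, act_dim))
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-6)
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-6)

def macc_step(o_prev, o_curr, o_b, action_idx, sum_reward):
    # TD error (18): bootstrap target computed from the centralized critic
    with torch.no_grad():
        target = sum_reward + gamma * critic(o_curr)
    delta = target - critic(o_prev)

    # critic update (19): least-squares TD
    opt_c.zero_grad(); (delta ** 2).mean().backward(); opt_c.step()

    # actor update (20): policy gradient weighted by the TD error
    log_prob = torch.log_softmax(actor(o_b), dim=-1)[action_idx]
    opt_a.zero_grad(); (-log_prob * delta.detach()).mean().backward(); opt_a.step()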

III-B Pointer Network

In this section, we introduce the pointer network to implement the actor policy fθf^{\theta} for IAC and MACC.

Traditional sequence-processing recurrent neural networks (RNNs), such as sequence-to-sequence learning [32] and neural Turing machines [33], have a limit on the output dimensionality, i.e., the output dimensionality is fixed by the dimensionality of the problem and hence is the same during training and inference. However, in the network slicing problem, the actor’s input dimensionality {N,nc,nr}\{N,n_{c},n_{r}\} may vary over time, and thus the expected output dimension of 𝒜b\mathcal{A}_{b} also varies according to the input. As opposed to the aforementioned sequence-processing algorithms, a pointer network learns the conditional probability of an output sequence whose elements are discrete tokens corresponding to positions in the input sequence [18], and therefore the dimension of the action 𝒜b\mathcal{A}_{b} in each time slot depends on the length of the input.

Another benefit of pointer networks is that they reduce the cardinality of 𝒜b\mathcal{A}_{b}. General deep RL algorithms typically list each combination as an element of the action 𝒜b\mathcal{A}_{b}^{\prime}, where each element indicates picking ncn_{c} channels out of NN channels and assigning each to one of nrn_{r} requests. Therefore, 𝒜b\mathcal{A}_{b}^{\prime} has dimension (Nnc)nrnc\binom{N}{n_{c}}n_{r}^{n_{c}}. When NN, ncn_{c}, and nrn_{r} are large, this requires a prohibitively large network output and exponentially longer training time. Comparatively, 𝒜b\mathcal{A}_{b} generated by our proposed pointer network actor has only nrNn_{r}N output elements.

Therefore, we here introduce the pointer network as the novel architecture of the actor of the proposed agents as shown in Fig. 2. Similar to sequence-to-sequence learning, pointer network utilizes two RNNs called encoder and decoder whose output dimensions are heh_{e} and hdh_{d}. The former encodes the sequence {𝒪be1,𝒪be2,,𝒪benr}\{\mathcal{O}_{b}^{e1},\mathcal{O}_{b}^{e2},\ldots,\mathcal{O}_{b}^{e{n_{r}}}\} with 𝒪bek𝒪b\mathcal{O}_{b}^{ek}\subset\mathcal{O}_{b} (where the index k{1,2,,nr}k\in\{1,2,\ldots,n_{r}\} corresponds to requests), and produces the sequence of vectors {e1,e2,,enr}\{e_{1},e_{2},\ldots,e_{n_{r}}\} with ekhee_{k}\in\mathbb{R}^{h_{e}} at the output gate in each step. The decoder inherits the encoder’s hidden state and is fed by another sequence {𝒪bd1,𝒪bd2,,𝒪bdN},𝒪bdc𝒪b\{\mathcal{O}_{b}^{d1},\mathcal{O}_{b}^{d2},\ldots,\mathcal{O}_{b}^{d{N}}\},\mathcal{O}_{b}^{dc}\subset\mathcal{O}_{b} (where the index c{1,2,,N}c\in\{1,2,\ldots,N\} corresponds to channels), and produces the sequence of vectors {d1,d2,,dN},dchd\{d_{1},d_{2},\ldots,d_{N}\},d_{c}\in\mathbb{R}^{h_{d}}.

Refer to caption
Figure 2: Pointer network structure. At each step, the encoder RNN (red) converts each element of input 𝒪bek\mathcal{O}_{b}^{ek} into output eke_{k}, while the hidden state and cell state are fed to the decoder RNN (blue). The decoder converts each element of input 𝒪bdc\mathcal{O}_{b}^{dc} into output dcd_{c} at each step. The final output of the pointer network is generated by the attention mechanism as the conditional probability 𝒫c\mathcal{P}^{c}, and each element is generated via eke_{k} and dcd_{c}.

Furthermore, pointer network computes the conditional probability array 𝒫c[0,1]nr\mathcal{P}^{c}\in[0,1]^{n_{r}} where

𝒫c[k]=P(𝒜b[k,c]=1|𝒪bd1,,𝒪bdc,𝒪be1,,𝒪benr),\mathcal{P}^{c}[k]=P(\mathcal{A}_{b}[k,c]=1|\mathcal{O}_{b}^{d1},\ldots,\mathcal{O}_{b}^{d{c}},\mathcal{O}_{b}^{e1},\ldots,\mathcal{O}_{b}^{e{n_{r}}}), (21)

using an attention mechanism as follows:

ukc\displaystyle u^{c}_{k} =vTtanh(W1ek+W2dc),\displaystyle=v^{T}\operatorname*{tanh}(W_{1}e_{k}+W_{2}d_{c}), (22)
𝒫c\displaystyle\mathcal{P}^{c} =softmax([u1c,u2c,,unrc]T),\displaystyle=\operatorname*{softmax}([u^{c}_{1},u^{c}_{2},\ldots,u^{c}_{n_{r}}]^{T}),

where softmax normalizes the vector of ukcu^{c}_{k} to be an output distribution over the dictionary of inputs, and vector vv and matrices W1W_{1} and W2W_{2} are trainable parameters of the attention model.

We note that the nr×Nn_{r}\times N matrix of the pointer network output 𝒫=[𝒫1,𝒫2,,𝒫N]\mathcal{P}=[\mathcal{P}^{1},\mathcal{P}^{2},\ldots,\mathcal{P}^{N}] typically consists of non-integer values, and we obtain the decision 𝒜b\mathcal{A}_{b} by selecting the ncn_{c} channels (columns) with the largest column maxima and pointing each selected channel to the request attaining that maximum:

M1\displaystyle M_{1} =[max(𝒫1),max(𝒫2),,max(𝒫N)],\displaystyle=[\operatorname*{max}(\mathcal{P}^{1}),\operatorname*{max}(\mathcal{P}^{2}),\ldots,\operatorname*{max}(\mathcal{P}^{N})], (23)
M2\displaystyle M_{2} =argmaxM1M1,|M1|=ncmM1m,\displaystyle=\operatorname*{argmax}_{M_{1}^{\prime}\subset M_{1},|M_{1}^{\prime}|=n_{c}}{\sum_{m\in M_{1}^{\prime}}{m}},
𝒜b[k,c]\displaystyle\mathcal{A}_{b}[k,c] ={1if cM2 and k=argmax𝒫c,0otherwise.\displaystyle=\begin{cases}1\hskip 8.5359pt\text{if $c\in M_{2}$ and $k=\operatorname*{argmax}\mathcal{P}^{c}$,}\\ 0\hskip 8.5359pt\text{otherwise.}\end{cases}

In our implementation, we use two Long Short-Term Memory (LSTM) [34] RNNs of the same size (i.e., he=hdh_{e}=h_{d}) to model the encoder and decoder. During the backpropagation phase of neural network training, the partial derivatives of the error function with respect to weights in layers (the encoder and decoder in our case) that are far from the output layer may be vanishingly small, preventing the network from further training; this is called the vanishing gradient problem [35]. Since the encoder and decoder are essential to the operation of the pointer network and we demand fast convergence, we set W1W_{1} and W2W_{2} in (22) as trainable vectors and vv as a single constant to reduce the depth of the pointer network and speed up the training of both LSTMs.
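A minimal PyTorch sketch of the attention computation in (21)-(22) is provided below; it uses the standard form with trainable W1, W2, and v, and the hidden size he = hd = 70 from Table I (the encoder and decoder LSTMs that produce e and d are omitted for brevity).

import torch
import torch.nn as nn

class PointerAttention(nn.Module):
    def __init__(self, hidden=70):
        super().__init__()
        self.W1 = nn.Linear(hidden, hidden, bias=False)
        self.W2 = nn.Linear(hidden, hidden, bias=False)
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, e, d):
        # e: (n_r, hidden) encoder outputs, d: (N, hidden) decoder outputs
        u = self.v(torch.tanh(self.W1(e).unsqueeze(0) + self.W2(d).unsqueeze(1))).squeeze(-1)
        # u has shape (N, n_r); the softmax over requests yields P^c for each channel c, as in (22)
        return torch.softmax(u, dim=-1)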

III-C Network Slicing Agent

In this subsection, we introduce the action and state spaces for MACC deep RL with pointer networks employed for the aforementioned network slicing problem.

III-C1 Centralized Critic

The centralized critic agent gϕ(𝒪)g^{\phi}(\mathcal{O}) is implemented as a feed-forward neural network (FNN) that uses ReLU activation function for each hidden layer. This FNN takes the observation 𝒪={𝒪1,𝒪2,,𝒪NB}\mathcal{O}=\{\mathcal{O}_{1},\mathcal{O}_{2},\ldots,\mathcal{O}_{N_{B}}\} as the input where 𝒪b(t)={𝒜b(t1),Hb[𝒰b(t),𝒞](t),b(t)}\mathcal{O}_{b}(t)=\{\mathcal{A}_{b}(t-1),H^{b}[\mathcal{U}_{b}(t),\mathcal{C}](t),\mathcal{I}_{b}(t)\}. 𝒜b(t1)\mathcal{A}_{b}(t-1) is the action matrix as described in (23) in the last time slot, Hb(t)H^{b}(t) is the transmission rate history described in (15), 𝒰b(t)\mathcal{U}_{b}(t) is the set of users corresponding to the nrn_{r} requests served by base station bb, and 𝒞\mathcal{C} is the set of all channels {1,2,,N}\{1,2,\ldots,N\}. b\mathcal{I}_{b} is the information of requests, including the remaining payload pkp_{k}, minimum rate mkm_{k}, remaining lifetime lkl_{k} and the absolute value of reward |R|=pk|R|=p_{k} of the requests being processed and the requests in the queue: b=bKb\mathcal{I}_{b}=\mathcal{I}_{b}^{K_{b}} where bk={pk,mk,lk,pk}\mathcal{I}_{b}^{k}=\{p_{k},m_{k},l_{k},p_{k}\}.

The output of gϕ(𝒪)g^{\phi}(\mathcal{O}) is a single value that aims at minimizing the error δ(t)\delta(t) in (18). Note that in our implementation, the sum reward is not the instantaneous reward at time tt, but the rewards of KbK_{b}, the set of all requests being processed and requests in the queue at time tt. Therefore, we do not obtain the reward value or use the sample for training until all requests being processed and those in the queue are completed. Compared to the instantaneous reward, this reward assignment strengthens the correlation between the action 𝒜b\mathcal{A}_{b} and the outcome, and thus speeds up convergence.

III-C2 Decentralized Actor with Pointer Network

The decentralized actor agent fθ(𝒪b)f^{\theta}(\mathcal{O}_{b}) at each base station bb is implemented as a pointer network. The initial hidden state and cell state of the encoder LSTM are generated by a separate FNN without hidden layers, whose input is the information of the requests in the queue {bnr+1,bnr+2,,bnr+nq}\{\mathcal{I}_{b}^{n_{r}+1},\mathcal{I}_{b}^{n_{r}+2},\ldots,\mathcal{I}_{b}^{n_{r}+n_{q}}\}, and the output is a vector of two states with length 2he2h_{e}.

The input that the encoder LSTM receives at each step is 𝒪bek(t)={𝒜b[k,𝒞](t1),Hb[𝒰b[k](t),𝒞](t),bk(t)}\mathcal{O}_{b}^{ek}(t)=\{\mathcal{A}_{b}[k,\mathcal{C}](t-1),H^{b}[\mathcal{U}_{b}[k](t),\mathcal{C}](t),\mathcal{I}_{b}^{k}(t)\} whose components are the last action corresponding to request kk, transmission rate history corresponding to the user 𝒰b[k]\mathcal{U}_{b}[k] of request kk, and the information of request kk.

The input that the decoder LSTM receives at each step is 𝒪bdc(t)={Hb[𝒰b(t),c](t),k=1nrbk(t)}\mathcal{O}_{b}^{dc}(t)=\{H^{b}[\mathcal{U}_{b}(t),c](t),\cup_{k=1}^{n_{r}}\mathcal{I}_{b}^{k}(t)\}. The two components are the transmission rate history corresponding to all users 𝒰b\mathcal{U}_{b} in channel cc, and the information of all requests being processed.

Therefore, as in (22), the outputs eke_{k} and dcd_{c} of the encoder and decoder are fed into the attention mechanism to produce the conditional probability 𝒫\mathcal{P}, and thus we obtain the output action 𝒜b\mathcal{A}_{b} via (23).

Apart from the pointer network actor fθ(𝒪b)f^{\theta}(\mathcal{O}_{b}), we consider two other decision modes to explore different actions and train the MACC policies: the random mode and the max-rate mode. The random mode randomly assigns ncn_{c} out of NN channels to the nrn_{r} requests without considering the observation, and the max-rate mode generates 𝒜b\mathcal{A}_{b} via the transmission rate history matrix Hb[𝒰b,𝒞]H^{b}[\mathcal{U}_{b},\mathcal{C}] instead of 𝒫\mathcal{P}:

M1\displaystyle M_{1} =[max(Hb[𝒰b,1]),,max(Hb[𝒰b,N])],\displaystyle=[\operatorname*{max}(H^{b}[\mathcal{U}_{b},1]),\ldots,\operatorname*{max}(H^{b}[\mathcal{U}_{b},N])], (24)
M2\displaystyle M_{2} =argmaxM1M1,|M1|=ncmM1m,\displaystyle=\operatorname*{argmax}_{M_{1}^{\prime}\subset M_{1},|M_{1}^{\prime}|=n_{c}}{\sum_{m\in M_{1}^{\prime}}{m}},
𝒜b[k,c]\displaystyle\mathcal{A}_{b}[k,c] ={1if cM2 and k=argmaxHb[𝒰b,c],0otherwise.\displaystyle=\begin{cases}1\hskip 8.5359pt\text{if $c\in M_{2}$ and $k=\operatorname*{argmax}H^{b}[\mathcal{U}_{b},c]$,}\\ 0\hskip 8.5359pt\text{otherwise.}\end{cases}
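A minimal numpy sketch of the decision rule is given below; with the pointer-network output 𝒫 as input it implements (23), and with the rate-history matrix Hb[𝒰b,𝒞] in place of 𝒫 it implements the max-rate mode (24). The function name is illustrative.

import numpy as np

def select_action(P, n_c):
    # P: (n_r, N) score matrix; returns the binary action matrix A_b of (17)
    col_max = P.max(axis=0)               # M_1: best score in each channel (column)
    chosen = np.argsort(col_max)[-n_c:]   # M_2: the n_c channels with the largest maxima
    A = np.zeros_like(P, dtype=int)
    A[P[:, chosen].argmax(axis=0), chosen] = 1   # point each chosen channel to its best request
    return A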

The agent follows an ϵ\epsilon-ϵm\epsilon_{m}-greedy policy, as shown in Algorithm 1 below, to update the neural network parameters. Specifically, it starts with an exploration probability ϵ=ϵ0=1\epsilon=\epsilon_{0}=1; when exploring, it chooses the max-rate mode with probability ϵm\epsilon_{m} and the random mode otherwise. The exploration probability decreases linearly to ϵ=ϵ11\epsilon=\epsilon_{1}\ll 1 so that the agent mostly follows the actor policy fθ(𝒪b)f^{\theta}(\mathcal{O}_{b}). When the training is over, ϵ\epsilon is fixed to ϵ1\epsilon_{1} for the testing duration TtestT_{test}.

Algorithm 1 ϵ\epsilon-ϵm\epsilon_{m}-greedy exploration method
ϵ=ϵ0\epsilon=\epsilon_{0}
for t in range(TtrainT_{train}do
    for b in range(NBN_{B}do
         if random(0,1)ϵ(0,1)\geq\epsilon then
             Agent bb executes fθ(𝒪b)f^{\theta}(\mathcal{O}_{b}) in (23)
         else
             if random(0,1)ϵm(0,1)\leq\epsilon_{m} then
                 Agent bb executes max-rate mode in (24)
             else
                 Agent bb executes random mode
             end if
         end if
    end for
    ϵ=ϵ(ϵ0ϵ1)/Ttrain\epsilon=\epsilon-(\epsilon_{0}-\epsilon_{1})/T_{train}
end for
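A minimal Python rendering of Algorithm 1 is given below; actor_mode, max_rate_mode, and random_mode are placeholder callables standing for the execution of (23), (24), and the random assignment, respectively.

import random

def epsilon_greedy_training(T_train, N_B, actor_mode, max_rate_mode, random_mode,
                            eps0=1.0, eps1=0.005, eps_m=0.1):
    eps = eps0
    for t in range(T_train):
        for b in range(N_B):
            if random.random() >= eps:
                actor_mode(b)          # follow the actor policy f^theta(O_b)
            elif random.random() <= eps_m:
                max_rate_mode(b)       # max-rate mode as in (24)
            else:
                random_mode(b)
        eps -= (eps0 - eps1) / T_train   # linear decay toward eps1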

III-D Numerical Results

As shown in Fig. 3, in the experiments we consider a service area with NB=5N_{B}=5 base stations and Nu=30N_{u}=30 users, and each cell radius is 2.5 km. There are N=16N=16 channels available, and each base station picks at most Nc=8N_{c}=8 channels for transmission. The ratio between the transmission power and the noise power in each channel is PB/σk2=6.3P_{B}/\sigma_{k}^{2}=6.3. Each base station allows the processing of at most Nr=4N_{r}=4 requests, and keeps at most Nq=2N_{q}=2 requests in the queue. When a request from one user is terminated or completed, the user will not be able to send another request within Tr=2T_{r}=2 time slots.

Figure 3: Coverage map of service area with 5 base stations and 30 users.

The channel varies with maximum Doppler frequency fd=1f_{d}=1 Hz and dynamic slot duration T=0.02T=0.02 s, and the locations of the users follow a random walk pattern. Each user may move between the coverage areas of different base stations, while staying within the coverage area of at least one of the base stations. All other parameters are listed in Table I.

TABLE I: User Agent Training Parameters
Number of Base Stations NBN_{B} 5
Height of Base Station hBh_{B} 50 m
Range of Base Station Coverage 2.5 km
Number of MVNOs NMN_{M} 1
Number of Users NuN_{u} 30
Number of Channels NN 16
Number of Channels for Transmission NcN_{c} 8
Maximum Doppler Frequency fdf_{d} 1 Hz
Dynamic Slot Duration TT 0.02 ms
Transmission Power to Noise Ratio PB/σk2P_{B}/\sigma_{k}^{2} 6.3 dBm
Path Loss Exponent α\alpha -2
Maximum Number of Requests Processing NrN_{r} 4
Maximum Number of Requests in Queue NqN_{q} 2
Minimum Inter-arrival Time of Requests TrT_{r} 2
Random Range of Initial Payload pp (1, 2)
Random Range of Minimum Rate mm (0.8, 1)
Random Range of Lifetime ll p/m+p/m+ (2, 4)
Discount Factor γ\gamma 0.9
Adam Optimizer Learning Rate 10610^{-6}
Critic FNN Hidden Layer and Nodes 1, 70
Encoder LSTM Hidden State Size heh_{e} 70
Decoder LSTM Hidden State Size hdh_{d} 70
Training Time TtrainT_{train} 50000
Testing Time TtestT_{test} 150000
Training Period TtT_{t} 10
Mini-batch 10
Initial Exploration Probability ϵ0\epsilon_{0} 1
Final Exploration Probability ϵ1\epsilon_{1} 0.005
Max-Rate Probability ϵm\epsilon_{m} 0.1

In Fig. 4, we compare the moving average of the sum reward achieved by the network slicing agents utilizing the proposed MACC with pointer networks against other algorithms. Similarly as in [36], we introduce three statistical algorithms: max-rate, which always executes the max-rate mode; FIFO, which selects channels with high channel rate history but always assigns sufficient channel resources to the requests that arrived earlier; and hard slicing, which assigns channels with high channel rate history, with the number of channels assigned to each request proportional to the corresponding minimum transmission rate mkm_{k}. We also compare four deep RL agents, each of which deploys the IAC or MACC framework using either FNNs or pointer networks as actors. While the proposed MACC with pointer network agent starts with a lower performance at the beginning of the training phase due to random exploration, it outperforms all other algorithms during the test phase. In the initial training phase of t<50000t<50000, the proposed agent starts with random parameters and employs the ϵ\epsilon-ϵm\epsilon_{m}-greedy policy, which starts with a large probability ϵ(1ϵm)\epsilon(1-\epsilon_{m}) to explore with random actions, and consequently the performance starts low but gradually improves. We observe that in the test phase from t=50000t=50000 to t=200000t=200000, the sum reward eventually converges to around 20 bits/symbol, while slightly oscillating due to the challenging dynamic environment with random channel states, varying user locations, and randomly arriving requests. We further note that the proposed network slicing agents complete over 95%95\% of the requests, and hence attain a very high completion ratio as well.

Figure 4: Performance comparison in terms of sum reward for the proposed MACC agent with pointer network based actors against different algorithms.

IV Deep RL Based Jammer

We have introduced the deep RL based network slicing agents in the previous section. Since deep RL is vulnerable to adversarial attacks that perturb the observations, the proposed network slicing agents can be open to attack by intelligent jammers, and it is critical to determine their sensitivity and robustness under such jamming attacks. In this section, we introduce an actor-critic deep RL agent that performs jamming attacks on the aforementioned victim network slicing agents/users (introduced in Section III) and aims at minimizing the victims’ transmission rate. We assume the jammer has the geometric map of all BSs, but it does not have any information on the channel states, users’ locations, requests, victim rewards, or the victim policies. This deep RL jammer may jam multiple channels to reduce the transmission rate and potentially cause request failures, or observe the environment and record the interference power in each channel to infer the victims’ actions. We demonstrate a jammer that can significantly degrade the aforementioned victim users’ performance even though it lacks critical information on the victims.

IV-A System Model

In the considered channel model, the fading coefficient of the link from the jammer to the user equipment (UE) uu in a certain channel cc is denoted by hcJ,uh_{c}^{J,u}, and the fading coefficient of the link from the base station bb to the jammer in a certain channel cc is denoted by hcb,Jh_{c}^{b,J}. We consider hcb,uh_{c}^{b,u}, hcJ,uh_{c}^{J,u} and hcb,Jh_{c}^{b,J} to be independent and identically distributed (i.i.d.) and vary over time according to the Jakes fading model [30]. Once the jammer is initialized at horizontal location {xJ,yJ}\{x_{J},y_{J}\} with height hJh_{J}, it can choose in any given time slot one of the two operational phases: jamming phase and listening phase.

IV-A1 Jamming phase

In this phase, the jammer jams nJNJn_{J}\leq N_{J} channels simultaneously with jamming power PJP_{J} in each channel without receiving any feedback. With the additional interference from the jammer, we can express the transmission rate 𝗋c,Jb,u\mathsf{r}_{c,J}^{b,u} from base station bb to UE uu in channel cc as

𝗋c,Jb,u=log2(1+PBLb,u|hcb,u|2bb𝒩cb,b+𝒩cb,J+σ2),\mathsf{r}_{c,J}^{b,u}=\log_{2}\left(1+{\frac{P_{B}L^{b,u}|h_{c}^{b,u}|^{2}}{\sum_{b^{\prime}\neq b}{\mathcal{N}^{b,b^{\prime}}_{c}}+\mathcal{N}^{b,J}_{c}+\sigma^{2}}}\right), (25)
𝒩cb,J=𝟏cb,JPJLJ,u|hcJ,u|2,\mathcal{N}^{b,J}_{c}=\mathbf{1}_{c}^{b,J}P_{J}L^{J,u}|h_{c}^{J,u}|^{2}, (26)

where PJP_{J} is the jamming power, 𝒩cb,J\mathcal{N}^{b,J}_{c} is the jamming interference in channel cc, 𝟏cb,J\mathbf{1}_{c}^{b,J} is the indicator function for both base station bb and jammer choosing channel cc, and LJ,uL^{J,u} is the path loss:

LJ,u=(hJ2+(xJxu)2+(yJyu)2)α/2.L^{J,u}=\left(h_{J}^{2}+(x_{J}-x_{u})^{2}+(y_{J}-y_{u})^{2}\right)^{\alpha/2}. (27)

By degrading the transmission rate 𝗋c,Jb,u\mathsf{r}_{c,J}^{b,u} with the jamming interference in channel cc, the jamming attack may lead to a number of request failures. The jammer may further amplify its impact by intelligently choosing a preferable subset of channels to jam. The listening phase is introduced to learn such information from the environment.

IV-A2 Listening phase

In this phase, the jammer does not jam any channel, but only listens to the (interference) power in each channel cc among the NN channels:

𝒩clisten=b𝟏cbPBLb,J|hcb,J|2+σ2,\mathcal{N}^{\text{listen}}_{c}={\sum_{b}{\mathbf{1}_{c}^{b}P_{B}L^{b,J}|h_{c}^{b,J}|^{2}}+\sigma^{2}}, (28)

where 𝟏cb\mathbf{1}_{c}^{b} is the indicator function which has a value of 11 if there is a transmission at base station bb to any UE in channel cc, and Lb,JL^{b,J} is the path loss:

Lb,J=((hBhJ)2+(xbxJ)2+(ybyJ)2)α/2.L^{b,J}=\left((h_{B}-h_{J})^{2}+(x_{b}-x_{J})^{2}+(y_{b}-y_{J})^{2}\right)^{\alpha/2}. (29)

Due to the jammer’s lack of prior information, we consider an approximation and assume that the listened power 𝒩clisten\mathcal{N}^{\text{listen}}_{c} from all base stations transmitting in channel cc is a rough estimate of the sum of jamming interferences 𝒩c,estb,J\mathcal{N}^{b,J}_{c,\text{est}} if jammer were in the jamming phase and chose channel cc to inject interference, i.e.,

𝒩clistenb𝒩c,estb,J+σ2.\mathcal{N}^{\text{listen}}_{c}\approx{\sum_{b}{\mathcal{N}^{b,J}_{c,\text{est}}}+\sigma^{2}}. (30)

Therefore, with this assumption, the jammer anticipates that the higher 𝒩clisten\mathcal{N}^{\text{listen}}_{c} being observed/listened, the more likely that jamming in channel cc degrades the victim users’ performance. Given this, we introduce how we optimize the subset of channels to attack during the jamming phase in Section IV-C.

Another benefit of the listening phase is that no jamming power is consumed in this phase, and consequently average power consumption is reduced. In the remainder of this paper, we assume that the jammer only switches from the listening phase to the jamming phase by the end of each period with TJ+T_{J}\in\mathbb{R}^{+} time slots, and thus it has an average power consumption of nJPJ/TJn_{J}P_{J}/T_{J}.

IV-B Jamming Location Optimization

The jammer aims at minimizing the performance of victim users, but it does not have any information on channel fading, UE locations or rewards provided to different requests. Therefore, the jamming location is optimized by minimizing the expected sum transmission rate for given UE uu integrated over the service area when the channels for transmission coincide with the channels being jammed. More specifically, we have the following optimization:

{xJ,yJ}=argmin{xJ,yJ}𝔼h(bcDhb𝗋c,Jb,u𝑑xu𝑑yu),\{x_{J}^{*},y_{J}^{*}\}=\operatorname*{argmin}_{\{x_{J},y_{J}\}}{\mathbb{E}_{h}\left(\sum_{b}\sum_{c}\iint\limits_{D^{b}_{h}}{\mathsf{r}_{c,J}^{b,u}dx_{u}dy_{u}}\right)}, (31)

where the expectation with respect to the set of fading coefficients {h}\{h\} considers b,u,c:hcb,u,hcJ,u,hcb,Ji.i.d.𝒞𝒩(0,1)\forall b,u,c:h_{c}^{b,u},h_{c}^{J,u},h_{c}^{b,J}\overset{\text{i.i.d.}}{\sim}\mathcal{CN}(0,1), DhbD^{b}_{h} is the subset of coverage area with maximal transmission rate 𝗋c,Jb,u\mathsf{r}_{c,J}^{b,u} from base station bb given {hcb,u,hcJ,u,hcb,J}\{h_{c}^{b,u},h_{c}^{J,u},h_{c}^{b,J}\}, and b,b:𝟏cb,b=0,𝟏cb,J=𝟏cb=1\forall b,b^{\prime}:\mathbf{1}_{c}^{b,b^{\prime}}=0,\mathbf{1}_{c}^{b,J}=\mathbf{1}_{c}^{b}=1. Additionally, we note that PBP_{B}, hBh_{B}, |α||\alpha|, and σ\sigma with arbitrary positive values will not affect the optimized jamming location.
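Since (31) generally has no closed form, one way to approximate it is a grid search over candidate jammer positions with Monte Carlo averaging over user locations and fading, as in the sketch below. The base-station layout, heights, sample sizes, and the simplification of serving each user from its strongest base station (in place of the exact regions DhbD^{b}_{h}) are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
bs_xy = np.array([[0., 0.], [4., 0.], [-4., 0.], [0., 4.], [0., -4.]])  # km, illustrative
hB, hJ, alpha, P, sigma2 = 0.05, 0.01, -2.0, 1.0, 1.0                    # heights in km

def gain(txy, th, uxy):
    # path loss of the form (5)/(27) for a transmitter at txy with height th
    return (th**2 + np.sum((txy - uxy)**2, axis=-1)) ** (alpha / 2)

def avg_jammed_rate(jxy, n_samples=20000):
    users = rng.uniform(-6, 6, size=(n_samples, 2))           # random user locations
    h_bu = rng.exponential(size=(n_samples, len(bs_xy)))      # |h|^2 of CSCG fading
    h_ju = rng.exponential(size=n_samples)
    signal = P * gain(bs_xy[None, :, :], hB, users[:, None, :]) * h_bu
    best = signal.max(axis=1)                                 # serve each user from its best BS
    jam = P * gain(jxy, hJ, users) * h_ju                     # jamming interference as in (26)
    return np.log2(1 + best / (jam + sigma2)).mean()          # jammed rate as in (25)

candidates = [np.array([x, y]) for x in np.linspace(-4, 4, 9) for y in np.linspace(-4, 4, 9)]
x_J, y_J = min(candidates, key=avg_jammed_rate)               # approximate solution of (31)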

IV-C Jamming Channel Optimization

After the jammer is initialized in the true environment and has observed/listened the interference power 𝒩clisten(t1)\mathcal{N}^{\text{listen}}_{c}(t-1) within the listening phase at time t1t-1, it will decide the subset of channels 𝒞J(t)𝒞\mathcal{C}_{J}(t)\subset\mathcal{C} to jam during jamming phase at time tt, where |𝒞J(t)|=nJ(t)\left|\mathcal{C}_{J}(t)\right|=n_{J}(t) and 𝒞\mathcal{C} is the set of all channels {1,2,,N}\{1,2,\ldots,N\}. According to (30), channels with higher 𝒩clisten\mathcal{N}^{\text{listen}}_{c} are more likely to be better choices, but this information is not available in the jamming phase at time tt (since jamming is performed rather than listening). Therefore, 𝒞J(t)\mathcal{C}_{J}(t) can only be evaluated via 𝒩clisten(t1)\mathcal{N}^{\text{listen}}_{c}(t-1) and 𝒩clisten(t+1)\mathcal{N}^{\text{listen}}_{c}(t+1). Note that 𝒩clisten(t+1)\mathcal{N}^{\text{listen}}_{c}(t+1) is not available at time tt and this challenge will be addressed via deep RL in the next subsection (i.e., by essentially introducing a reward to train the neural network at time t+1t+1 and having that reward depend on 𝒩clisten(t1)\mathcal{N}^{\text{listen}}_{c}(t-1) and 𝒩clisten(t+1)\mathcal{N}^{\text{listen}}_{c}(t+1).) Hence, with the deep RL approach, the action will depend only on observation before time tt.

In the absence of information on the requests, we assume a model in which each request arrives and is completed independently. Thus, the state of current time slot tt can be estimated as a linear interpolation (or a weighted average) of 𝒩clisten(t1)\mathcal{N}^{\text{listen}}_{c}(t-1) and 𝒩clisten(t+1)\mathcal{N}^{\text{listen}}_{c}(t+1). Therefore, the optimized subset of channels to jam can be determined from

𝒞J(t)=argmax𝒞Jc𝒞J𝒩^cβ(t),\mathcal{C}_{J}^{*}(t)=\operatorname*{argmax}_{\mathcal{C}_{J}}{\sum_{c\in\mathcal{C}_{J}}{\hat{\mathcal{N}}^{\beta}_{c}(t)}}, (32)

where

𝒩^cβ(t)=1β(t)+1(β(t)𝒩clisten(t1)+𝒩clisten(t+1)),\hat{\mathcal{N}}^{\beta}_{c}(t)=\frac{1}{\beta(t)+1}\left(\beta(t)\mathcal{N}^{\text{listen}}_{c}(t-1)+\mathcal{N}^{\text{listen}}_{c}(t+1)\right), (33)

and β(t)\beta(t) describes the impact of the jammer on the victims. Typically, a request takes multiple time slots to get completed. When it is jammed, there are two possibilities. On the one hand, it may fail to meet the minimum transmission rate limit and get terminated immediately. In such a case, the next request in the queue is processed, and the network slicing agent rearranges and distributes channels into a new set of slices to be allocated to different requests from different users, and thus the listened interference 𝒩clisten(t+1)\mathcal{N}^{\text{listen}}_{c}(t+1) may change dramatically. In this case, we are likely to have β(t)>1\beta(t)>1. On the other hand, if the request under attack has a lower transmission rate but the minimum transmission rate limit and the lifetime constraint are satisfied, the transmission will last longer, and the interference 𝒩clisten(t+1)\mathcal{N}^{\text{listen}}_{c}(t+1) is less likely to vary from that at time tt. In this case, we are likely to have β(t)<1\beta(t)<1. Therefore, the value of β(t)\beta(t) should be determined via experience:

β(t)=max(2|𝒯|ct′′𝒯′′dclisten(t′′)|𝒯′′|ct𝒯dclisten(t)1,0),\beta(t)=\operatorname*{max}\left(\frac{2|\mathcal{T}^{\prime}|\sum_{c}{\sum_{t^{\prime\prime}\in\mathcal{T}^{\prime\prime}}{d^{\text{listen}}_{c}(t^{\prime\prime})}}}{|\mathcal{T}^{\prime\prime}|\sum_{c}{\sum_{t^{\prime}\in\mathcal{T}^{\prime}}{d^{\text{listen}}_{c}(t^{\prime})}}}-1,0\right), (34)

where

dclisten(t)=|𝒩clisten(t+1)𝒩clisten(t1)|,d^{\text{listen}}_{c}(t)=\left|\mathcal{N}^{\text{listen}}_{c}(t+1)-\mathcal{N}^{\text{listen}}_{c}(t-1)\right|, (35)

and 𝒯′′\mathcal{T}^{\prime\prime} is a set of time points in the jamming phase where each t′′𝒯′′t^{\prime\prime}\in\mathcal{T}^{\prime\prime} is close to time tt, and 𝒯\mathcal{T}^{\prime} is a set of time points where each t𝒯t^{\prime}\in\mathcal{T}^{\prime} is in successive listening phases without jamming attack. If TJ>3T_{J}>3, dclisten(𝒯)d^{\text{listen}}_{c}(\mathcal{T}^{\prime}) can be collected from the successive listening phases in every period during training. Otherwise, if TJ3T_{J}\leq 3, dclisten(𝒯)d^{\text{listen}}_{c}(\mathcal{T}^{\prime}) has to be collected before jamming starts.
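A minimal numpy sketch of the interpolation in (33) and the resulting channel selection (32) is shown below; β is assumed to have already been estimated from experience via (34), and the function name is illustrative.

import numpy as np

def jamming_channels(N_prev, N_next, beta, n_J):
    # N_prev, N_next: listened interference at t-1 and t+1 (length-N arrays)
    N_hat = (beta * N_prev + N_next) / (beta + 1.0)   # weighted average as in (33)
    return set(np.argsort(N_hat)[-n_J:])              # n_J channels with the largest N_hat, as in (32)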

Again, it is important to note that when the jammer agent makes decisions at time tt, 𝒩clisten(t+1)\mathcal{N}^{\text{listen}}_{c}(t+1) is not available. To address this, we propose a deep RL agent that uses the actor-critic algorithm to learn the policy.

IV-D Actor-Critic Jammer Agent

Our proposed jammer agent utilizes an actor-critic deep RL algorithm to learn the policy that optimizes the output 𝒞J(t)\mathcal{C}_{J}(t) to minimize the victims’ expected sum rate. The jammer works with a period TJT_{J}, and only switches from the listening phase to the jamming phase at the end of each period and uses the policy to make the decision 𝒞J(t)\mathcal{C}_{J}(t). We next introduce the observation, action, reward, and the actor-critic update of this agent.

IV-D1 Observation

At each time slot, the jammer records its instantaneous observation as a vector OJNO_{J}\in\mathbb{R}^{N}. In a listening phase, OJL={𝒩1listen,𝒩2listen,,𝒩Nlisten}O_{J}^{L}=\{\mathcal{N}^{\text{listen}}_{1},\mathcal{N}^{\text{listen}}_{2},\ldots,\mathcal{N}^{\text{listen}}_{N}\}. Otherwise, in a jamming phase, OJJ={𝟏(1𝒞J),𝟏(2𝒞J),𝟏(N𝒞J)}O_{J}^{J}=\{\mathbf{1}(1\in\mathcal{C}_{J}),\mathbf{1}(2\in\mathcal{C}_{J})\ldots,\mathbf{1}(N\in\mathcal{C}_{J})\} where 𝟏\mathbf{1} is the indicator function. In the beginning at time slot tt in the jamming phase, the full observation 𝒪J(t)={OJJ(tTJ),OJL(tTJ+1),,OJL(t1)}\mathcal{O}_{J}(t)=\{O_{J}^{J}(t-T_{J}),O_{J}^{L}(t-T_{J}+1),\ldots,O_{J}^{L}(t-1)\} is fed as the input state to the actor-critic agent.

IV-D2 Action

At the beginning of time slot tt in a jamming phase, given the input state 𝒪J(t)\mathcal{O}_{J}(t), the actor neural network outputs a vector of probabilities 𝒫J(t)[0,1]N\mathcal{P}_{J}(t)\in[0,1]^{N}. From the probability vector, the decision 𝒞J(t)\mathcal{C}_{J}(t) is derived which is the subset of channels to jam, and it is described as the action 𝒜J(t){0,1}N\mathcal{A}_{J}(t)\in\{0,1\}^{N}:

𝒞J(t)=argmax𝒞Jc𝒞J𝒫J(t)[c],\mathcal{C}_{J}(t)=\operatorname*{argmax}_{\mathcal{C}_{J}}{\sum_{c\in\mathcal{C}_{J}}{\mathcal{P}_{J}(t)[c]}}, (36)
𝒜J(t)={𝟏(1𝒞J(t)),𝟏(2𝒞J(t)),𝟏(N𝒞J(t))}.\mathcal{A}_{J}(t)=\{\mathbf{1}(1\in\mathcal{C}_{J}(t)),\mathbf{1}(2\in\mathcal{C}_{J}(t))\ldots,\mathbf{1}(N\in\mathcal{C}_{J}(t))\}. (37)

IV-D3 Reward

Following the jamming phase at time tt, the reward is received to train the critic after the next listening phase at time t+1t+1. This reward aims at encouraging the policy to produce an action that imitates 𝒩^cβ(t)\hat{\mathcal{N}}^{\beta}_{c}(t), the linear interpolation of listened interference as in (33). Therefore, we set the reward as the negative of the mean squared error:

RJ(t)=c(𝒜J(t)[c]𝒩^cβ(t)max(𝒩^cβ(t)))2.R_{J}(t)=-\sum_{c}{\left(\mathcal{A}_{J}(t)[c]-\frac{\hat{\mathcal{N}}^{\beta}_{c}(t)}{\operatorname*{max}\left(\hat{\mathcal{N}}^{\beta}_{c}(t)\right)}\right)^{2}}. (38)
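A minimal numpy sketch of the action selection in (36)-(37) and the reward computation in (38) is given below; P_J denotes the actor's output probability vector and N_hat the interpolated interference from (33), and the function names are illustrative.

import numpy as np

def jammer_action(P_J, n_J):
    chosen = np.argsort(P_J)[-n_J:]      # subset C_J of the n_J most probable channels, as in (36)
    A_J = np.zeros_like(P_J, dtype=int)
    A_J[chosen] = 1                      # indicator vector A_J as in (37)
    return A_J

def jammer_reward(A_J, N_hat):
    target = N_hat / N_hat.max()         # normalized interpolated interference
    return -np.sum((A_J - target) ** 2)  # negative squared error as in (38)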

IV-D4 Actor-Critic Update

At the beginning of a jamming phase at time tt, the actor with parameter θJ\theta_{J} and policy fθJ(𝒪J)f^{\theta_{J}}(\mathcal{O}_{J}) maps the input observation 𝒪J\mathcal{O}_{J} to the output probability 𝒫J\mathcal{P}_{J}, which is similar to a Q-value generator. The critic with parameter ϕJ\phi_{J} and policy gϕJ(𝒪J)g^{\phi_{J}}(\mathcal{O}_{J}) maps 𝒪J\mathcal{O}_{J} to a single temporal difference (TD) error:

δJ(t)=RJ(t)+γJgϕJ(𝒪J(t))gϕJ(𝒪J(tTJ)),\delta_{J}(t)={R_{J}}(t)+\gamma_{J}g^{\phi_{J}}(\mathcal{O}_{J}(t))-g^{\phi_{J}}(\mathcal{O}_{J}(t-T_{J})), (39)

where γJ(0,1)\gamma_{J}\in(0,1) is the discount factor. For each training sample, the critic is updated towards achieving the optimized parameter ϕJ\phi_{J}^{*} to minimize the least square TD:

ϕJ=argminϕJ(δJgϕJ)2.\phi_{J}^{*}=\operatorname*{argmin}_{\phi_{J}}{(\delta_{J}^{g_{\phi_{J}}})^{2}}. (40)

The actor is updated towards the optimized parameter θJ\theta_{J}^{*} via the policy gradient:

θJ=argmaxθJθJc𝒞JlogfθJ(𝒪J)δJgϕJ.\theta_{J}^{*}=\operatorname*{argmax}_{\theta_{J}}{\nabla_{\theta_{J}}\sum_{c\in\mathcal{C}_{J}}\log f^{\theta_{J}}(\mathcal{O}_{J})\delta_{J}^{g_{\phi_{J}}}}. (41)

Both networks are updated alternately to attain the optimal actor-critic policy. Note that during each parameter update with the bootstrap method, the mini-batch of training samples (i.e., action-reward pairs) should be randomly drawn from a longer history record for faster convergence.

IV-E Numerical Results

As shown in Fig. 5, in the experiments we consider a service area with NB=5N_{B}=5 base stations, Nu=30N_{u}=30 users, and a jammer. The theoretically optimized location of the jammer {xJ,yJ}\{x_{J}^{*},y_{J}^{*}\} is determined according to (31) and lies at the center {0,0}\{0,0\}, while the actual location is slightly moved away from the base station tower. There are N=16N=16 channels available, and the jammer picks at most NJ=8N_{J}=8 channels for jamming. The jamming power in each channel equals the transmission power in each channel: PJ=PBP_{J}=P_{B}. The jammer has a phase switching period of TJ=2T_{J}=2 time slots.

Figure 5: Coverage map of service area with 5 base stations, 30 users, and the jammer.

In the experiments, the network slicing agents are initialized as the well-trained MACC agents with pointer network actors as detailed in Section III, and time is initiated at t=0t=0. The jammer’s actor and critic policies are implemented as two feed-forward neural networks (FNNs), and both have one hidden layer with 16 hidden nodes. The jammer is initialized at t=100t=100 and begins jamming and performs online updating. During the training phase 100t<10000100\leq t<10000, the jammer follows an ϵ\epsilon-greedy policy to update the neural network parameters θJ\theta_{J} and ϕJ\phi_{J} with learning rate 10510^{-5}. It starts by fully exploring random actions, and the probability to choose random actions linearly decreases to 0.010.01, thus eventually leading the agent to mostly follow the actor policy fθJ(𝒪J)f^{\theta_{J}}(\mathcal{O}_{J}). This probability is fixed to 0.010.01 in the testing phase from t=10000t=10000 to t=20000t=20000.

In Fig. 6, we compare the performance of the proposed actor-critic jammer that approximates 𝒩^cβ(t)\hat{\mathcal{N}}^{\beta}_{c}(t) with four other scenarios in terms of victim sum reward in the testing phase. The figure shows the testing phase after the victim agent adapts to the jamming attack, and as a result its action policy becomes stable. The slight fluctuations in the curves are due to randomly arriving requests and varying channel states. We also note that the vertical axis is the victim’s reward, and hence the jammer with the better performance will lead to lower victim reward. The first case is the setting with no jammer and hence the performance is that of the original network slicing agent in terms of the sum reward in the absence of jamming attacks. The second scenario is with a last-interference jammer agent that is positioned at the same location (i.e., the origin {0,0}\{0,0\}) with the same power budget. However, this jammer agent does not utilize any machine learning algorithms, and chooses the subset of channels with the highest observed/listened interference power levels in the last listening phase:

𝒞JMaxIntf(t)=argmax𝒞Jc𝒞J𝒩clisten(t1).\mathcal{C}_{J}^{\text{MaxIntf}}(t)=\operatorname*{argmax}_{\mathcal{C}_{J}}{\sum_{c\in\mathcal{C}_{J}}{\mathcal{N}^{\text{listen}}_{c}(t-1)}}. (42)

Consequently, the last-interference jammer, which focuses on the last time slot, is equivalent to the proposed jammer when β\beta\rightarrow\infty. The third scenario is with the next-interference jammer agent, which is also an actor-critic jammer agent but whose reward has β=0\beta=0, so it concentrates on the next time slot. Furthermore, we provide performance comparisons with a max-rate jammer that is assumed to know the channel environment between the base stations and users perfectly (which implies a rather strong jammer); thus, it obtains every potential channel rate regardless of the interference from other users, and picks the maximum via

rcmax=maxb,ulog2(1+PBLb,u|hcb,u|2σ2).r^{max}_{c}=\operatorname*{max}_{b,u}{\log_{2}\left(1+{\frac{P_{B}L^{b,u}|h_{c}^{b,u}|^{2}}{\sigma^{2}}}\right)}. (43)

Given NN maximum potential rates in NN different channels, the jammer randomly picks channel cc to jam with probability rcmax/cNrcmaxr^{max}_{c}/\sum_{c^{\prime}}^{N}{r^{max}_{c^{\prime}}}. To better evaluate the performance among different jammers, we assume that all jammers have the same power budget. We observe that all other jammers are less efficient in suppressing the victim sum reward compared to our proposed actor-critic jammer, which aims at estimating 𝒩^cβ(t)\hat{\mathcal{N}}^{\beta}_{c}(t) and therefore has better performance. Specifically, we observe in Fig. 6 that the proposed jammer results in the smallest sum reward values for the network slicing agents and has the most significant adversarial impact. In order to give a comprehensive comparison, we in Fig. 7 also illustrate the performance of the same set of jammers where the jamming power is 60dB higher than the transmission power, i.e. PJ=106PBP_{J}={10}^{6}P_{B}. We notice that since the victim user receives negative reward when the request fails, some curves have negative values at times in this harsh jamming interference environment.
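For reference, a minimal sketch of this max-rate baseline is shown below; the (idealized) knowledge of path loss and fading is passed in directly, the array shapes are illustrative, and sampling several distinct channels proportionally to the rates is one possible reading of the per-channel selection probability described above.

import numpy as np

def max_rate_jammer(P_B, L, h, sigma2, n_J, rng=np.random.default_rng()):
    # L, h: (num_BS, num_users, N) path-loss and fading arrays known to the idealized jammer
    r_max = np.log2(1 + P_B * L * np.abs(h)**2 / sigma2).max(axis=(0, 1))   # r_c^max as in (43)
    probs = r_max / r_max.sum()
    # sample n_J distinct channels with probability proportional to r_c^max
    return rng.choice(len(r_max), size=n_J, replace=False, p=probs)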

Figure 6: Comparison of victims’ sum reward in the testing phase achieved in the absence of jamming attack, and also achieved under attacks by the last-interference jammer, next-interference jammer, and the proposed actor-critic jammer. The jamming power equals the transmission power.
Figure 7: Comparison of victims’ sum reward in the testing phase achieved in the absence of jamming attack, and also achieved under attacks by the last-interference jammer, next-interference jammer, and the proposed actor-critic jammer. The jamming power is 60 dB higher than the transmission power.

While it is difficult to distinguish some curves with similar performance, we provide the numerical results for the average reward and request completion ratio in both experiments in Table II. We observe that the victim under the actor-critic jamming attack completes far fewer requests and attains the smallest average reward in both scenarios; thus, the actor-critic jammer outperforms all other jammers, including the max-rate jammer that knows the channel status. Additionally, in the first scenario where $P_{J}=P_{B}$, we notice that the base station at coordinates $(0,0)$ in Fig. 5, which is the closest one to the jammer's location, completes only 67.25% of the requests under the proposed jamming attack. In comparison, this base station completes about 80% of the requests under the other types of attack. This base station also completes fewer requests under the proposed attack when $P_{J}={10}^{6}P_{B}$. These observations indicate that the proposed jammer agent is better able to learn from the received interference.

TABLE II: Average Reward and Request Completion Ratio of Different Jamming Algorithms during the Testing Phase (each cell: average reward, completion ratio)
Jammer               $P_{J}=P_{B}$       $P_{J}={10}^{6}P_{B}$
None                 19.47, 93.33%       19.47, 93.33%
Last-Interference    14.82, 82.36%       7.23, 75.30%
Next-Interference    11.67, 81.34%       3.83, 68.73%
Max-Rate             11.03, 76.07%       3.61, 63.75%
Actor-Critic         9.16, 73.89%        2.61, 60.62%

V Nash-Equilibrium-Supervised Policy Ensemble for Jamming and Defense

As demonstrated in the previous section, an actor-critic based jammer with limited information and a limited power budget can significantly degrade the performance of network slicing. In this section, instead of jamming detection or jamming pattern-based defensive strategies, we propose the Nash-equilibrium-supervised policy ensemble (NesPE), a strategy that automatically explores the underlying available actions to achieve the optimal performance in both competitive and non-competitive contexts. We show that NesPE can be utilized both as the victims' defensive strategy and in the design of jamming attacks.

V-A Nash-Equilibrium-Supervised Policy Ensemble

We first describe the details of NesPE as a scheme that enhances the performance of a player in a multi-player zero-sum stochastic game [37] in which the player does not have direct observation of the opponents' action choices. The actions of the optimal mixed strategy depend on the prior observation, and this relationship is to be determined via deep RL or other machine-learning-based algorithms. To train the policy ensemble with the player's deep RL algorithm and to determine the optimal mixed strategy profile, we consider the tuple $\mathcal{G}=\left(\mathcal{O},\mathcal{E},(f^{\theta_{e}})_{e\in\mathcal{E}},\hat{e},\mathcal{L},(\mathcal{O}_{l})_{l\in\mathcal{L}},u\right)$, where

  • $\mathcal{O}$ is the prior observation before decision-making.

  • $\mathcal{E}$ is a finite index set of policies, i.e., $\mathcal{E}=\{1,\ldots,E\}$.

  • $f^{\theta_{e}}$ is the actor function of policy $e$ with parameter $\theta_{e}$, and is regarded as the player's strategy. It maps the prior observation to the player's action $\mathcal{A}_{e}=f^{\theta_{e}}(\mathcal{O})\in\{0,1\}^{N}$, where the chosen elements have value 1 and the others are set to 0.

  • $\hat{e}$ is the index of the chosen policy to execute the action.

  • $\mathcal{L}$ is the finite index set of subsequent observations obtained after decision-making, i.e., $\mathcal{L}=\{1,\ldots,L\}$.

  • $\mathcal{O}_{l}$ is the subsequent observation, and $l$ indirectly represents the opponents' action choice; thus it is regarded as the opponents' strategy.

  • $u:f^{\theta_{e}}(\mathcal{O})\times\mathcal{O}_{l}\rightarrow\mathbb{R}$ is the utility (or payoff, reward) function of the player when it chooses policy $e$ and the interaction with the opponents results in the observation $\mathcal{O}_{l}$. The mapping function is unknown to both players, and its outcome is available only to the considered player after decision-making.

According to [38], every finite strategic-form game has a mixed strategy equilibrium. In this stochastic game without prior knowledge of the utility matrix, the considered player can instead construct the utility matrix by recording the experienced utilities and taking their averages. We consider a matrix $\mathcal{H}$ of finite-length queues (specifically, each element of this matrix is a queue holding a few recorded utilities), with $E$ rows and $L$ columns. In each time slot, the experienced utility $u_{\hat{e},l}(t)$ is appended to the queue $\mathcal{H}(\hat{e},l)$. At the beginning of the next time slot, the player averages each queue to form an $E\times L$ utility matrix, from which it computes the optimal profile at the Nash equilibrium $\sigma=\{\sigma_{1},\ldots,\sigma_{E}\}$, i.e., the set of probabilities with which each strategy is chosen and executed at a time.
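As a concrete illustration of this bookkeeping, the following is a minimal Python sketch of the queue matrix $\mathcal{H}$ and of one way to compute the mixed-strategy profile $\sigma$, here by solving the standard maximin linear program of a two-player zero-sum game with SciPy; the queue length, the solver choice, and all names are assumptions made for illustration rather than the paper's implementation.

from collections import deque
import numpy as np
from scipy.optimize import linprog

E, L, Q = 5, 5, 20   # policies, observation classes, queue length (assumed)

# H(e, l) is a finite-length queue of experienced utilities.
H = [[deque(maxlen=Q) for _ in range(L)] for _ in range(E)]

def record_utility(e_hat, l, u):
    H[e_hat][l].append(u)

def mixed_strategy_profile():
    """Average each queue to build the E x L utility matrix, then solve the
    maximin LP for the player's equilibrium mixed strategy sigma."""
    U = np.array([[np.mean(H[e][l]) if H[e][l] else 0.0 for l in range(L)]
                  for e in range(E)])
    # Variables: [sigma_1, ..., sigma_E, v]; maximize v  <=>  minimize -v.
    c = np.concatenate([np.zeros(E), [-1.0]])
    # For every opponent class l:  v - sum_e sigma_e * U[e, l] <= 0.
    A_ub = np.hstack([-U.T, np.ones((L, 1))])
    b_ub = np.zeros(L)
    A_eq = np.concatenate([np.ones(E), [0.0]])[None, :]   # probabilities sum to 1
    bounds = [(0.0, 1.0)] * E + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[:E]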

However, although the probability of choosing each given policy is optimized, the parameters of the policies $({\theta_{e}})_{e\in\mathcal{E}}$ may not be, and they need further training with the received utility $u_{\hat{e},l}(t)$ after each time slot $t$. Most existing works on policy ensembles use similar rewards to train the different policies, so there is no guarantee that the ensemble explores different strategies. In NesPE, we instead consider a dual problem that takes both the utility reward and the correlation within the policy ensemble [23] into account. The expected correlation between policy $e_{1}$ and policy $e_{2}$ is defined as

\bar{\rho}(e_{1},e_{2})=\mathbb{E}_{\mathcal{O}}\left(\sum_{n=1}^{N}\mathbf{1}\left(f^{\theta_{e_{1}}}(\mathcal{O})(n)=f^{\theta_{e_{2}}}(\mathcal{O})(n)=1\right)\right). (44)

Therefore, the correlation between two policies indicates how similar their decision-making patterns are. To fully explore all possible strategies and increase the variety of policies available to the mixed strategy, lower correlations between the policies are desired.

Thus, in order to maximize the expected utility reward $u_{\hat{e},l}$ while limiting the correlation below a threshold $D$, at each time slot $t$ we consider the following non-linear programming problem for policy $e$:

\max_{\theta_{e}}\;\mathbb{E}\left(u_{e,l}(t)\right)\quad\text{subject to}\quad D-\sum_{e^{\prime}\neq e}\sigma_{e^{\prime}}(t)\rho(e,e^{\prime},t)\geq 0, (45)

where

\rho(e,e^{\prime},t)=\sum_{n=1}^{N}\mathbf{1}\left(f^{\theta_{e}}(\mathcal{O}(t))(n)=f^{\theta_{e^{\prime}}}(\mathcal{O}(t))(n)=1\right). (46)

To solve the problem, we can convert it to maximizing a Lagrangian dual function, and set the reward of the deep RL algorithm as

R(e,t)=u_{\hat{e},l}(t)-\zeta\sum_{e^{\prime}\neq e}\sigma_{e^{\prime}}(t)\rho(e,e^{\prime},t), (47)

where $\zeta$ is the dual variable, and $R(e,t)$ is appended to the replay buffer so that the training process can update the parameters of policy $e$. We note that the instantaneous correlation $\rho(e,e^{\prime},t)$ is weighted by the strategy profile $\sigma_{e^{\prime}}$ of the other policy $e^{\prime}$. Therefore, the more desired policies with higher $\sigma_{e}$ are less affected by the correlation with other policies, while the less desired policies are encouraged to diverge from them.
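A minimal sketch of the shaped reward in (46)-(47) is given below, assuming each policy's action is available as a binary NumPy vector of length N; the function names are illustrative.

import numpy as np

def correlation(a_e, a_ep):
    """Eq. (46): number of channels selected (set to 1) by both policies."""
    return int(np.sum((a_e == 1) & (a_ep == 1)))

def nespe_reward(e, utility, actions, sigma, zeta):
    """Eq. (47): utility minus the dual-weighted correlation with the other
    policies, each weighted by its mixed-strategy probability sigma[e']."""
    penalty = sum(sigma[ep] * correlation(actions[e], actions[ep])
                  for ep in range(len(actions)) if ep != e)
    return utility - zeta * penalty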

The proposed NesPE strategy is obtained by iteratively repeating the aforementioned process over time as shown in Algorithm 2.

Algorithm 2 Nash-Equilibrium-Supervised Policy Ensemble (NesPE)
Randomly initialize the parameter $\theta_{e}$ of each policy
Randomly initialize the utility history $\mathcal{H}$
for each time slot $t$ do
     Acquire prior observation $\mathcal{O}(t)$
     Obtain mixed strategy profile $\sigma$ from the averages of $\mathcal{H}$
     Randomly choose one policy $\hat{e}$ according to $\sigma$
     Execute action $\mathcal{A}_{\hat{e}}=f^{\theta_{\hat{e}}}(\mathcal{O}(t))$
     Classify the subsequent observation $\mathcal{O}_{l}(t)$ into category $l$
     Append the payoff $u_{\hat{e},l}(t)$ to the queue $\mathcal{H}(\hat{e},l)$
     Append the information at time $t$ to the replay buffer $\mathcal{B}$
     Randomly sample a mini-batch $\mathcal{B}^{\prime}$ from $\mathcal{B}$
     for $\mathcal{G}^{\prime}$, $\mathcal{A}_{\hat{e}^{\prime}}$, $u_{\hat{e}^{\prime},l}(t^{\prime})$ in $\mathcal{B}^{\prime}$ do
          for $e^{\prime}$ in $\mathcal{E}$ do
               update $\theta_{e^{\prime}}$ with reward $R(e^{\prime},t^{\prime})$ in (47)
          end for
     end for
end for

We analyze the performance of NesPE in two contexts. On the one hand, we consider a set of opponents and an environment in which a dominating action exists, where an action $\mathcal{A}$ is called dominating if

u(\mathcal{A},l)\geq u(\mathcal{A}^{\prime},l),\quad\forall\mathcal{A}^{\prime}\neq\mathcal{A},\;\forall l\in\mathcal{L}. (48)

Due to the random initialization, there will be one policy $e$ in NesPE that has the highest $\sigma_{e}$, and thus we have $|R(e,t)-u_{e,l}(t)|<|R(e_{1},t)-u_{e_{1},l}(t)|,\ \forall e_{1}\neq e$. Therefore, the reward of policy $e$ focuses on maximizing $\mathbb{E}\left(u_{e,l}\right)$, while the other policies are encouraged to choose actions different from those chosen by policy $e$. Such iterative training leads to $\lim_{t\to\infty}\sigma_{e}(t)=1$. Hence, NesPE in this context converges to a pure strategy with a single policy $e$ that fully optimizes $\mathbb{E}\left(u_{e,l}\right)$.

On the other hand, if there is no dominating action and a mixed strategy is preferred, $\sigma_{e}$ will be upper bounded. Each policy $e_{1}$ is encouraged to find a balance between optimizing $u_{e_{1},l}(t)$ and diverging from the others, while the better policies are less affected by the correlation minimization. Thus, the player may obtain the optimized mixed strategy via the converged Nash equilibrium profile.

V-B Numerical Results

In this section, we apply NesPE to both the network slicing victim agent and the jammer agent, and compare the performances in terms of victims’ average reward and completion ratio during the testing phase. All parameters other than the ensemble are the same as in Section IV-E.

For the policy ensemble of the victim user agent, we consider $E=5$ policies in the ensemble and $L=5$ different classes of subsequent observations. In this case, at each base station $b$, each element $H^{b}[u,c]$ of the transmission rate history is a queue with maximum length $Q$ instead of a single number, so we have $H^{b}\in\mathbb{R}_{\geq 0}^{N_{u}\times N\times Q}$. The subsequent observation is the set of experienced transmission rates $\mathsf{r}_{c}^{b,u}(t)$ at time $t$, and it is classified into type $l$ via

l=\operatorname*{argmin}_{l}\Big(d_{o}(H^{b}[u,c][Q-1]),\,d_{o}(H^{b}[u,c][Q-2]),\,d_{o}(H^{b}[u,c][Q-3]),\,d_{o}(H^{b}[u,c][Q-4]),\,d_{o}(\operatorname*{min}(H^{b}[u,c]))\Big), (49)
d_{o}(\mathsf{r})=\sum_{u\in\mathcal{U}_{b}}\sum_{c\in C_{u}}\left(\mathsf{r}-\mathsf{r}_{c}^{b,u}(t)\right)^{2}, (50)

where $\mathcal{U}_{b}$ is the set of the served users at base station $b$, and $C_{u}$ is the set of channels in the slice assigned to user $u$.
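The classification in (49)-(50) can be sketched as follows, under the assumption that the rate history at base station b is stored as an array of shape (users, channels, Q) and that the class index is returned zero-based; the names and shapes are illustrative, not the paper's code.

import numpy as np

def classify_observation(H_b, r_t, served):
    """Return the class index l in {0,...,4}: the candidate history snapshot
    (the last four entries and the per-queue minimum) that is closest in
    squared distance, over the served (user, channel) pairs, to the rates
    r_c^{b,u}(t) experienced in the current slot."""
    Q = H_b.shape[-1]
    candidates = [H_b[..., Q - 1], H_b[..., Q - 2],
                  H_b[..., Q - 3], H_b[..., Q - 4],
                  H_b.min(axis=-1)]
    mask = np.zeros(r_t.shape, dtype=bool)
    for u, channels in served.items():      # {user: channels in its slice}
        mask[u, channels] = True
    distances = [np.sum((cand[mask] - r_t[mask]) ** 2) for cand in candidates]
    return int(np.argmin(distances))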

For the policy ensemble of the jammer agent, we consider $E_{J}=2$ policies in the ensemble and $L_{J}=2$ different classes of subsequent observations. In this case, a finite-length queue of the listened difference $d^{\text{listen}}_{c}$ is kept, and we use its average value $\overline{d^{\text{listen}}_{c}}$ to classify the subsequent observation $d^{\text{listen}}_{c}(t)$ into

l_{J}=\begin{cases}1 & \text{if } d^{\text{listen}}_{c}(t)<\overline{d^{\text{listen}}_{c}},\\ 2 & \text{otherwise.}\end{cases} (51)
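A corresponding minimal sketch of the jammer-side classification in (51) is given below; the queue length is an assumption for this example.

from collections import deque

d_history = deque(maxlen=20)   # finite-length history of listened differences (assumed length)

def classify_jammer_observation(d_listen):
    """Eq. (51): class 1 if the new listened difference is below the running
    average of the history, class 2 otherwise."""
    l_J = 1 if d_history and d_listen < sum(d_history) / len(d_history) else 2
    d_history.append(d_listen)
    return l_J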

For comparison, we also show the results obtained with two other types of policy ensemble methods that aim at robustness: agents with policy ensembles (APE) [17] and collaborative evolutionary reinforcement learning (CERL) [29]. The key idea of APE is to randomly initialize different policies, select one policy at a time, maintain a separate replay buffer, and maximize the expected reward. CERL, on the other hand, sets different policies with different hyper-parameters, uses a shared replay buffer, and applies a neuro-evolutionary algorithm.

In Table III, we compare the victim agents' average rewards and completion ratios in the presence of a jammer agent when the proposed NesPE and the other algorithms are employed at both the network slicing agents and the jammer. In the rows from top to bottom, we have the performance with no jammer, the original jammer with a single policy, and the NesPE, APE, and CERL based jammers, respectively. For each jammer, we present the performance with both the lower power budget of 0 dB, where $P_{J}=P_{B}$, and the higher power budget of 60 dB, where $P_{J}={10}^{6}P_{B}$. Across the columns from left to right, we have the performance when the victim utilizes a single policy, and NesPE, APE, and CERL based policies, respectively. In the first row, we notice that each type of victim agent achieves similar performance (in terms of both average reward and completion ratio), which indicates that the different victim agents are able to identify and utilize the dominating strategy. In the following rows, we see that the NesPE based victim network slicing agent achieves better performance than the other victim agents against all types of jammer agents. In the third row, which provides the performance of the victim network slicing agents in the presence of a NesPE based jammer, we observe that all victims under this NesPE based jamming attack have lower rewards and completion ratios than under the other types of jamming attacks. With the higher jamming power budget of 60 dB, we observe similar trends when comparing the different algorithms. Hence, the NesPE based jamming agent performs better in suppressing the performance of the network slicing agents. This lets us conclude that NesPE based strategies outperform the other algorithms in a competitive environment where an adversary exists, and both the victim and the jammer perform more favorably when they employ NesPE.

TABLE III: Average Reward and Completion Ratio Comparison of Different Policy Ensemble Algorithms and Different Jamming Power Budgets during the Testing Phase (each cell: average reward, completion ratio; rows: jammer, columns: victim)
Jammer \ Victim      Single            NesPE             APE               CERL
None                 19.90, 94.78%     20.08, 95.23%     20.07, 95.16%     20.08, 95.22%
Single, 0 dB         10.09, 73.68%     14.96, 84.90%     11.53, 77.04%     11.71, 77.30%
NesPE, 0 dB          9.68, 72.57%      12.22, 78.76%     11.36, 76.92%     11.23, 76.31%
APE, 0 dB            12.93, 80.26%     14.44, 83.55%     13.23, 80.75%     11.42, 76.66%
CERL, 0 dB           12.38, 78.81%     14.53, 84.10%     12.53, 79.20%     11.28, 76.68%
Single, 60 dB        3.58, 58.76%      11.15, 76.52%     8.58, 70.33%      5.23, 62.16%
NesPE, 60 dB         2.25, 55.35%      11.33, 76.81%     6.69, 66.19%      4.49, 61.12%
APE, 60 dB           3.35, 58.09%      11.80, 77.72%     9.44, 72.39%      5.18, 62.49%
CERL, 60 dB          2.61, 56.14%      11.98, 78.15%     8.91, 71.10%      4.70, 61.65%

VI Conclusion

In this paper, we designed network slicing agents using MACC multi-agent deep RL with pointer network based actors. We considered an area covered by multiple base stations and addressed a dynamic environment with time-varying channel fading, mobile users, and randomly arriving service requests from the users. We described the system model, formulated the network slicing problem, and designed the MACC deep RL algorithm and the pointer network based actor structure. We demonstrated the proposed agents' performance via simulations and showed that they can achieve high average rewards and complete around $95\%$ of the requests.

Subsequently, we developed a deep RL based jammer that aims at minimizing the network slicing agents’ (i.e., victims’) transmission rate and thus degrades their performance without prior knowledge of the victim policies and without receiving any direct feedback. We introduced the jamming and listening phases of the proposed jammer and addressed the jamming location optimization. We also studied jamming channel optimization by designing an actor-critic agent that decides on which channels to jam. We have demonstrated the effectiveness of the proposed jammer via simulation results, and quantified the degradation in the performance of the network slicing agents compared to the performance achieved in the absence of any jamming attacks. We also provided comparisons among actor-critic based jammers with different assumptions on how to decide on which channels to jam (e.g., based on the last or estimated next interference or linear interpolation of the two).

Finally, we designed the NesPE algorithm for the competitive zero-sum game between the victim agents and the jammer agent. By applying NesPE to both the victim agents and the jammer agent and comparing with two other policy ensemble algorithms, we have shown that NesPE not only adapts to a highly dynamic environment with time-varying fading and mobile users, but also to a competitive environment against an adversary's policy ensemble agent with an optimal mixed strategy. Thus, both the victim network slicing agents and the jammer should apply NesPE to attain improved performance, and the interaction between them converges to the Nash equilibrium over all possible policy ensembles.

References

  • [1] A. Ericsson, “5G systems enabling the transformation of industry and society,” Tech. Rep., 2017.
  • [2] X. Foukas, G. Patounas, A. Elmokashfi, and M. K. Marina, “Network slicing in 5G: Survey and challenges,” IEEE Communications Magazine, vol. 55, no. 5, pp. 94–100, 2017.
  • [3] NGMN Alliance, “Description of network slicing concept,” NGMN 5G P, vol. 1, no. 1, 2016.
  • [4] H. Zhang, N. Liu, X. Chu, K. Long, A.-H. Aghvami, and V. C. Leung, “Network slicing based 5G and future mobile networks: mobility, resource management, and challenges,” IEEE Communications Magazine, vol. 55, no. 8, pp. 138–145, 2017.
  • [5] S. A. Kazmi, L. U. Khan, N. H. Tran, and C. S. Hong, Network slicing for 5G and beyond networks.   Springer, 2019, vol. 1.
  • [6] Q. Ye, W. Zhuang, S. Zhang, A.-L. Jin, X. Shen, and X. Li, “Dynamic radio resource slicing for a two-tier heterogeneous wireless network,” IEEE Transactions on Vehicular Technology, vol. 67, no. 10, pp. 9896–9910, 2018.
  • [7] H. Peng, Q. Ye, and X. Shen, “Spectrum management for multi-access edge computing in autonomous vehicular networks,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 7, pp. 3001–3012, 2019.
  • [8] J. Tang, B. Shim, and T. Q. Quek, “Service multiplexing and revenue maximization in sliced C-RAN incorporated with URLLC and multicast eMBB,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 4, pp. 881–895, 2019.
  • [9] H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep reinforcement learning based resource allocation for V2V communications,” IEEE Transactions on Vehicular Technology, vol. 68, no. 4, pp. 3163–3173, 2019.
  • [10] Z. Xu, Y. Wang, J. Tang, J. Wang, and M. C. Gursoy, “A deep reinforcement learning based framework for power-efficient resource allocation in cloud RANs,” in 2017 IEEE International Conference on Communications (ICC).   IEEE, 2017, pp. 1–6.
  • [11] R. Li, Z. Zhao, Q. Sun, I. Chih-Lin, C. Yang, X. Chen, M. Zhao, and H. Zhang, “Deep reinforcement learning for resource management in network slicing,” IEEE Access, vol. 6, pp. 74 429–74 441, 2018.
  • [12] L. Zhang, J. Tan, Y.-C. Liang, G. Feng, and D. Niyato, “Deep reinforcement learning-based modulation and coding scheme selection in cognitive heterogeneous networks,” IEEE Transactions on Wireless Communications, vol. 18, no. 6, pp. 3281–3294, 2019.
  • [13] Y. Shi, Y. E. Sagduyu, and T. Erpek, “Reinforcement learning for dynamic resource optimization in 5G radio access network slicing,” in 2020 IEEE 25th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD).   IEEE, 2020, pp. 1–6.
  • [14] Y. Shao, R. Li, Z. Zhao, and H. Zhang, “Graph attention network-based DRL for network slicing management in dense cellular networks,” in 2021 IEEE Wireless Communications and Networking Conference (WCNC).   IEEE, 2021, pp. 1–6.
  • [15] X. Lyu, Y. Xiao, B. Daley, and C. Amato, “Contrasting centralized and decentralized critics in multi-agent reinforcement learning,” arXiv preprint arXiv:2102.04402, 2021.
  • [16] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
  • [17] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” arXiv preprint arXiv:1706.02275, 2017.
  • [18] O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer networks,” arXiv preprint arXiv:1506.03134, 2015.
  • [19] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
  • [20] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, “Boosting adversarial attacks with momentum,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 9185–9193.
  • [21] Y. Lu, Y. Jia, J. Wang, B. Li, W. Chai, L. Carin, and S. Velipasalar, “Enhancing cross-task black-box transferability of adversarial examples with dispersion reduction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 940–949.
  • [22] Y. Shi, Y. E. Sagduyu, T. Erpek, K. Davaslioglu, Z. Lu, and J. H. Li, “Adversarial deep learning for cognitive radio security: Jamming attack and defense strategies,” in 2018 IEEE International Conference on Communications Workshops (ICC Workshops).   IEEE, 2018, pp. 1–6.
  • [23] F. Wang, C. Zhong, M. C. Gursoy, and S. Velipasalar, “Resilient dynamic channel access via robust deep reinforcement learning,” IEEE Access, 2021.
  • [24] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.
  • [25] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, and M. Jordan, “Theoretically principled trade-off between robustness and accuracy,” in International Conference on Machine Learning.   PMLR, 2019, pp. 7472–7482.
  • [26] Y. Dong, T. Pang, H. Su, and J. Zhu, “Evading defenses to transferable adversarial examples by translation-invariant attacks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4312–4321.
  • [27] Y. Shi, Y. E. Sagduyu, T. Erpek, and M. C. Gursoy, “How to attack and defend 5G radio access network slicing with reinforcement learning,” arXiv preprint arXiv:2101.05768, 2021.
  • [28] F. Wang, M. C. Gursoy, and S. Velipasalar, “Adversarial reinforcement learning in dynamic channel access and power control,” in 2021 IEEE Wireless Communications and Networking Conference (WCNC).   IEEE, 2021, pp. 1–6.
  • [29] S. Khadka, S. Majumdar, T. Nassar, Z. Dwiel, E. Tumer, S. Miret, Y. Liu, and K. Tumer, “Collaborative evolutionary reinforcement learning,” in International Conference on Machine Learning.   PMLR, 2019, pp. 3341–3350.
  • [30] L. Liang, J. Kim, S. C. Jha, K. Sivanesan, and G. Y. Li, “Spectrum and power allocation for vehicular communications with delayed CSI feedback,” IEEE Wireless Communications Letters, vol. 6, no. 4, pp. 458–461, 2017.
  • [31] J. Peters and S. Schaal, “Natural actor-critic,” Neurocomputing, vol. 71, no. 7-9, pp. 1180–1190, 2008.
  • [32] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
  • [33] A. Graves, G. Wayne, and I. Danihelka, “Neural turing machines,” arXiv preprint arXiv:1410.5401, 2014.
  • [34] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [35] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in International conference on machine learning.   PMLR, 2013, pp. 1310–1318.
  • [36] F. Wang, M. C. Gursoy, and S. Velipasalar, “Multi-agent reinforcement learning with pointer networks for network slicing in cellular systems,” in ICC 2022-IEEE International Conference on Communications.   IEEE, 2022, pp. 1841–1846.
  • [37] L. S. Shapley, “Stochastic games,” Proceedings of the National Academy of Sciences, vol. 39, no. 10, pp. 1095–1100, 1953.
  • [38] J. F. Nash et al., “Equilibrium points in n-person games,” Proceedings of the National Academy of Sciences, vol. 36, no. 1, pp. 48–49, 1950.
Feng Wang is currently a Ph.D. student in the Department of Electrical Engineering and Computer Science at Syracuse University. He received his B.S. degree in Automation Science and Electrical Engineering from Beihang University (China) in 2016 and his M.S. degree from New York University in 2019. His primary research interests include wireless networks, adversarial learning, reinforcement learning, and federated learning.
M. Cenk Gursoy received the B.S. degree with high distinction in electrical and electronics engineering from Bogazici University, Istanbul, Turkey, in 1999 and the Ph.D. degree in electrical engineering from Princeton University, Princeton, NJ, in 2004. He was a recipient of the Gordon Wu Graduate Fellowship from Princeton University between 1999 and 2003. He is currently a Professor in the Department of Electrical Engineering and Computer Science at Syracuse University. His research interests are in the general areas of wireless communications, information theory, communication networks, signal processing, optimization and machine learning. He is an Editor for IEEE Transactions on Communications, and an Area Editor for IEEE Transactions on Vehicular Technology. He is on the Executive Editorial Committee of IEEE Transactions on Wireless Communications. He also served as an Editor for IEEE Transactions on Green Communications and Networking between 2016–2021, IEEE Transactions on Wireless Communications between 2010–2015 and 2017–2022, IEEE Communications Letters between 2012–2014, IEEE Journal on Selected Areas in Communications - Series on Green Communications and Networking (JSAC-SGCN) between 2015-2016, Physical Communication (Elsevier) between 2010–2017, and IEEE Transactions on Communications between 2013–2018. He has been the co-chair of the 2017 International Conference on Computing, Networking and Communications (ICNC) - Communication QoS and System Modeling Symposium, the co-chair of 2019 IEEE Global Communications Conference (Globecom) - Wireless Communications Symposium, the co-chair of 2019 IEEE Vehicular Technology Conference Fall - Green Communications and Networks Track, and the co-chair of 2021 IEEE Global Communications Conference (Globecom), Signal Processing for Communications Symposium. He received an NSF CAREER Award in 2006. More recently, he received the EURASIP Journal of Wireless Communications and Networking Best Paper Award, 2020 IEEE Region 1 Technological Innovation (Academic) Award, 2019 The 38th AIAA/IEEE Digital Avionics Systems Conference Best of Session (UTM-4) Award, 2017 IEEE PIMRC Best Paper Award, 2017 IEEE Green Communications & Computing Technical Committee Best Journal Paper Award, UNL College Distinguished Teaching Award, and the Maude Hammond Fling Faculty Research Fellowship. He is a Senior Member of IEEE, and is the Aerospace/Communications/Signal Processing Chapter Co-Chair of IEEE Syracuse Section.
Senem Velipasalar (Senior Member, IEEE) received the B.S. degree in electrical and electronic engineering from Bogazici University, Istanbul, Turkey, in 1999, the M.S. degree in electrical sciences and computer engineering from Brown University, Providence, RI, USA, in 2001, and the M.A. and Ph.D. degrees in electrical engineering from Princeton University, Princeton, NJ, USA, in 2004 and 2007, respectively. From 2007 to 2011, she was an Assistant Professor with the Department of Electrical Engineering, University of Nebraska-Lincoln. She is currently a Professor with the Department of Electrical Engineering and Computer Science, Syracuse University. The focus of her research has been on machine learning and its applications to 3D point cloud analysis, occupancy detection, and video activity recognition, UAV-mounted and wearable camera applications, multicamera tracking, and surveillance systems. She is a member of the Editorial Board of the IEEE Transactions on Image Processing and Journal of Signal Processing Systems (Springer).