
Federated Deep Reinforcement Learning for the Distributed Control of NextG Wireless Networks

Peyman Tehrani, Francesco Restuccia and Marco Levorato
Donald Bren School of Information and Computer Sciences, University of California at Irvine, United States
Department of Electrical and Computer Engineering, Northeastern University, United States
e-mail: {peymant, levorato}@uci.edu, [email protected]
This work was partially supported by the NSF grants MLWiNS-2003237 and CNS-2134567.
Abstract

Next Generation (NextG) networks are expected to support demanding tactile internet applications such as augmented reality and connected autonomous vehicles. Whereas recent innovations bring the promise of larger link capacity, their sensitivity to the environment and erratic performance defy traditional model-based control rationales. Zero-touch data-driven approaches can improve the ability of the network to adapt to the current operating conditions. Tools such as reinforcement learning (RL) algorithms can build an optimal control policy solely based on a history of observations. Specifically, deep RL (DRL), which uses a deep neural network (DNN) as a predictor, has been shown to achieve good performance even in complex environments and with high-dimensional inputs. However, the training of DRL models requires a large amount of data, which may limit its adaptability to the ever-evolving statistics of the underlying environment. Moreover, wireless networks are inherently distributed systems, where centralized DRL approaches would require excessive data exchange, while fully distributed approaches may result in slower convergence rates and performance degradation. In this paper, to address these challenges, we propose a federated learning (FL) approach to DRL, which we refer to as federated DRL (F-DRL), where base stations (BSs) collaboratively train the embedded DNN by sharing only the model weights rather than the training data. We evaluate two distinct versions of F-DRL, value-based and policy-based, and show the superior performance they achieve compared to distributed and centralized DRL.

Index Terms:
Deep reinforcement learning, Federated Learning, Power control, Multi agent reinforcement learning, Wireless networks, Resource allocation.

I Introduction

On the one hand, next generation (NextG) networks are expected to support a wide range of essential and demanding real-time services such as augmented reality, connected autonomous vehicles and in-network computing that require coherent performance. On the other hand, recent advancements at the physical layer, such as millimeter wave (mmW) communications, while empowering the network with increased capacity, make its temporal behavior more erratic and convoluted. The NextG network environment, then, presents inherent control challenges that defy traditional model-based control approaches. To address these daunting challenges, the zero-touch network paradigm utilizes machine learning to eliminate the need for human-based design and enables fast data-driven adaptation to different operating conditions.

Power and bandwidth allocation is one of the fundamental problems in wireless networks. While many optimization frameworks have been proposed in this domain [1], data-driven methodologies, by directly utilizing the data generated by the system, have the potential to improve the adaptability of the network to environmental and traffic conditions and, ultimately, improve performance. Moreover, they are inherently more robust than model-based approaches, as the latter may suffer from model mismatch in real-world settings.

Among data-driven algorithms, deep reinforcement learning (DRL) has achieved state-of-the-art performance in high-dimensional complex control problems [2]. In DRL frameworks, the agent iteratively interacts with the environment to learn the optimal control policy. Additionally, DRL is often much faster in selecting the optimal action compared to conventional optimization methods, which usually require iterative computation and matrix inversion.

Most of the previous works in this domain proposed either fully centralized [3, 4] or distributed [5, 6] DRL algorithms to solve the resource allocation problem. In the former case, training and decision making are centralized, and all BSs send their state information to the server at every time step. In the latter case, each BS independently trains and executes the model without sharing information with other BSs.

In this paper, we propose a federated implementation of DRL algorithms which takes advantage of both distributed and centralized solutions. Federated Learning (FL) [7] is a family of machine learning problems where many clients (e.g., mobile devices or whole organizations) collaboratively train a DNN model under the orchestration of a central server (e.g., a service provider), while keeping the training procedure decentralized. FL embodies the principles of focused collection and data minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized training of machine learning models. This area has received considerable recent interest from both academia and industry. However, so far FL has mainly been applied to supervised learning problems in fields such as NLP and machine vision, and there have been few contributions that use federated learning to train distributed control and DRL models. In this domain, federated approaches have been proposed to solve coordination problems among a multitude of agents [8], edge caching problems [9, 10], mobile edge computing and offloading strategies [11, 12], and other Internet of Things (IoT) applications [13].

Here we consider the maximization of the downlink sum rate of a multi-cell network with mobile users. The base stations (BSs) have access to local information and collaborate with each other by sharing their model weights in a FL fashion to quickly train the DNN model at the core of the DRL controller. We consider different value-based and policy-based DRL algorithms and compare the performance and efficiency of distributed and federated versions of these algorithms. We demonstrate that our F-DRL approach improves the performance in terms of overall network sum rate by more than 40% compared to fully distributed implementations. Our F-DRL approach also reduces the communication overhead between the BSs and the central server compared to fully centralized solutions and, since only the model weights are shared, it protects the privacy of the local users' information in each cell.

II Literature Review

Recent work proposed the use of DRL algorithms to solve a wide range of optimization problems in wireless communication networks. Power allocation is one of the main areas, with several contributions using DRL to find the optimal power [14, 15, 4, 16, 5, 17]. Some of these contributions use deep Q-learning on discrete power levels [15, 4, 18], while others apply state-of-the-art DRL algorithms, such as deep deterministic policy gradient (DDPG) [5, 4] and trust region policy optimization (TRPO) [3], to continuous power control in multi-cell network scenarios. In addition to power allocation, recent work explored the use of DRL for a wide range of resource management problems, including spectrum allocation [19], joint user association and resource allocation [17] and channel selection [20].

In this paper, we expand on this exciting area by proposing a framework where multiple federated DRL agents collaboratively learn a shared predictive model, while all training data remain local to the corresponding device (that is, the BS). The key motivations to integrate FL with DRL are (i) the boost in training speed compared to fully distributed implementations, and (ii) the reduced amount of data to be shared. Regarding the latter point, we note that centralized DRL approaches to this problem need to exchange data for real-time control, while in our F-DRL framework the agents exchange only the model weights to update the individual models. Thus, the data exchange has a more relaxed delay constraint and the load on the backbone network decreases.

III System Model and Problem Formulation

We consider a cellular network where $N$ different Base Stations (BSs) serve $K$ mobile users. We assume that both the BSs and the users are equipped with one transmit antenna and that each BS is deployed at the cell center. We index the BSs with $n\in\mathcal{N}=\{1,2,\ldots,N\}$ and the users with $k\in\mathcal{K}=\{1,2,\ldots,K\}$. We denote the channel gain between the $n$th BS and the $k$th user in cell $j$ at time slot $t$ with:

$g^{t}_{n,j,k}=|h^{t}_{n,j,k}|^{2}\alpha_{n,j,k},$   (1)

where $h^{t}_{n,j,k}$ is the small-scale fading factor with Rayleigh-distributed envelope and $\alpha_{n,j,k}$ is the large-scale fading component, which includes path loss and log-normal shadowing. We model the small-scale Rayleigh fading component according to the Jakes fading model, that is, $h^{t}_{n,j,k}$ is assumed to be a first-order complex Gauss-Markov process:

$h^{t}_{n,j,k}=\rho h^{t-1}_{n,j,k}+\sqrt{1-\rho^{2}}\,e^{t}_{n,j,k}.$   (2)

Here, the innovation process variables $e^{t}_{n,j,k}$ are identically distributed circularly symmetric complex Gaussian random variables with unit variance, independent of $h^{t-1}_{n,j,k}$. The temporal correlation between two consecutive fading components is

$\rho=J_{0}(2\pi f_{d}T_{s}),$   (3)

where $J_{0}(\cdot)$ is the zeroth-order Bessel function of the first kind, $f_{d}$ is the maximum Doppler frequency, and $T_{s}$ is the duration of one time slot. A higher user mobility leads to a higher Doppler frequency and thus a lower temporal correlation of the channel.
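To make the channel dynamics concrete, the following minimal Python sketch simulates the gain of a single BS-user link under Eqs. (1)-(3). The function name and default values are illustrative, and the large-scale component $\alpha_{n,j,k}$ is treated as a given input.

```python
import numpy as np
from scipy.special import j0  # zeroth-order Bessel function of the first kind


def simulate_channel_gain(alpha, f_d=10.0, T_s=0.02, num_slots=100, rng=None):
    """Simulate g^t = |h^t|^2 * alpha for one BS-user link under the
    first-order Gauss-Markov (Jakes) model of Eqs. (1)-(3)."""
    rng = np.random.default_rng() if rng is None else rng
    rho = j0(2 * np.pi * f_d * T_s)  # temporal correlation, Eq. (3)
    # initial unit-variance circularly symmetric complex Gaussian coefficient
    h = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
    gains = np.empty(num_slots)
    for t in range(num_slots):
        e = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)  # innovation
        h = rho * h + np.sqrt(1 - rho ** 2) * e  # Eq. (2)
        gains[t] = np.abs(h) ** 2 * alpha        # Eq. (1)
    return gains
```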

Denoting the transmission power from BS $n$ to its user $k$ at slot $t$ with $p^{t}_{n,k}$, the downlink signal-to-interference-plus-noise ratio (SINR) of user $k$ in cell $n$ at time slot $t$ is:

$\gamma_{n,k}=\frac{p^{t}_{n,k}g^{t}_{n,n,k}}{I_{i}+I_{o}+N_{k}},$   (4)

where $N_{k}$ is the noise power at user $k$, and $I_{i}$ and $I_{o}$ are the intra-cell and inter-cell interference, respectively:

$I_{i}=\sum_{k^{\prime}\neq k}g^{t}_{n,n,k}\,p^{t}_{n,k^{\prime}},$   (5)
$I_{o}=\sum_{n^{\prime}\neq n}g^{t}_{n^{\prime},n,k}\sum_{j}p^{t}_{n^{\prime},j}.$   (6)

The data rate of user $k$ in cell $n$, then, is:

$C_{n,k}=B\log(1+\gamma_{n,k}),$   (7)

where $B$ is the bandwidth available to the network.
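As a reference, here is a hedged NumPy sketch of Eqs. (4)-(7) that maps a power matrix and a gain tensor to per-user rates. The array layout (gains indexed as G[m, n, k]) and the use of log2, which expresses rates in bit/s/Hz, are our assumptions rather than notation from the paper.

```python
import numpy as np


def per_user_rates(P, G, noise, bandwidth=1.0):
    """Per-user rates of Eq. (7) from the SINRs of Eqs. (4)-(6).

    P: (N, K) array of downlink powers p[n, k].
    G: (N, N, K) array of gains, G[m, n, k] = gain from BS m to user k served by BS n.
    """
    N, K = P.shape
    rates = np.zeros((N, K))
    for n in range(N):
        for k in range(K):
            signal = P[n, k] * G[n, n, k]
            intra = G[n, n, k] * (P[n].sum() - P[n, k])                        # Eq. (5)
            inter = sum(G[m, n, k] * P[m].sum() for m in range(N) if m != n)   # Eq. (6)
            rates[n, k] = bandwidth * np.log2(1 + signal / (intra + inter + noise))  # Eq. (7)
    return rates


# Network sum rate of Eq. (8) for a given power allocation:
# sum_rate = per_user_rates(P, G, noise).sum()
```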

Our goal is to find the set of downlink transmission powers $p^{t}_{n,k}$ that maximizes the sum rate of the whole network under a constraint on the maximum power. Formally, the optimization problem is:

$\max_{p^{t}_{n,k}}\;\sum_{n}\sum_{k}C_{n,k}$   (8)
$\textrm{s.t.}\quad 0\leqslant p^{t}_{n,k}\leqslant P_{max}\quad\forall k,n,$

where $P_{max}$ is the maximum transmission power of each BS.

Clearly, due to the interference terms in the denominator of the SINRs, the optimization problem is non-convex and its solution non-trivial. Importantly, non-convexity is not the only challenge to overcome. In fact, while iterative algorithms that achieve good performance can be developed, they require compute-intensive operations such as matrix inversion, bisection, or singular value decomposition at each iteration, which makes their real-time implementation difficult. Additionally, these algorithms need full access to the channel state information (CSI) of all the users to derive the optimal solution.

Therefore, solving the optimization problem requires a self-adaptive solution that is feasible for execution at run time while achieving good performance with access only to partial observations of the environment. We, then, reformulate the problem as a multi-agent RL (MARL) problem, where each BS is an agent that in each time slot determines the transmit power allocated to its associated users. Then, based on the feedback received from the network (which could be a function of the rates and powers of other users and neighboring BSs), the BSs adapt their transmit power. Fig. 1 illustrates the framework we propose.

Figure 1: Overall view of the network model and F-DRL procedure.

IV Federated Deep Reinforcement Learning

First, we redefine problem (8) in a RL setting, where each BS is an agent whose objective is to maximize the sum rate of its own users, while mitigating the interference to neighboring cells. Thus, each BS has a separate control policy that outputs the optimal power levels given the current observed state. In the context of DRL, the base stations need to train DNN models whose output is either the Q-values or the control action (defined later). One of the critical issues, then, is training the DNN models as fast as possible to adapt to the current network conditions. In order to speed up training, we propose the federated deep reinforcement learning (F-DRL) framework, where the federated agents, the BSs, collaboratively learn a predictive model by sharing their DRL model weights while keeping their users' data private.

In the following sections, we first define the RL problem in terms of the state space, action space and reward function corresponding to problem (8). Then, we propose two versions of F-DRL, a federated deep Q-network (FDQN) and a federated deep policy gradient (FDPG) algorithm, to solve the distributed power control problem.

IV-A RL Formulation

RL algorithms are usually set in the context of Markov decision processes (MDPs), defined by the 5-tuple $\langle\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma\rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $Pr(s_{t+1}|s_{t},a_{t})\in\mathcal{P}$ are the transition probabilities, $r_{t}(s_{t},a_{t})\in\mathcal{R}$ is the reward function, and $\gamma\in[0,1)$ is the discount factor. At each time $t$, based on the current state $s_{t}$, the agent takes an action $a_{t}\in\mathcal{A}$ and transitions from state $s_{t}$ to a new state $s_{t+1}$ with probability $Pr(s_{t+1}|s_{t},a_{t})$, receiving a reward $r_{t}$. We define the policy $\pi(s,a)$ as the probability of taking action $a_{t}=a$ in state $s_{t}=s$, that is, $\pi(s,a)=Pr(a_{t}{=}a|s_{t}{=}s)$.

The goal of the RL agent is to learn a policy that maximizes the expected sum of discounted rewards it receives over the long run, also called the return: $R_{t}=\sum_{i=0}^{\infty}\gamma^{i+t}r_{t+i}$. We define the optimal policy $\pi^{*}$ as the policy that maximizes the expected return from a state $s$:

$\pi^{*}=\operatorname*{\arg\!\max}_{\pi}E_{\pi}\{R_{0}|s_{0}=s\}.$   (9)

Here we define the states, actions and reward function for our power control problem.

IV-A1 States

The main features that describe the network system are the channel gains between the BSs and the users, and the previous transmission powers and rates. We assume that each BS only has access to information on the neighboring cells in the set $U_{n}$. The feature set we include in the state of BS $n$, then, is:

$\mathcal{S}=\big\{g^{t}_{n,j,k},\,p^{t-1}_{n,k},\,C^{t-1}_{n,k}\big\}\quad\forall j,k\in U_{n}$   (10)

IV-A2 Actions

We use discrete power levels which take values between 0 and $P_{max}$, defined as

$\mathcal{A}=\left\{0,\frac{P_{max}}{M-1},\frac{2P_{max}}{M-1},\ldots,P_{max}\right\},$   (11)

where $M$ is the number of power levels. All agents have the same action space.
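For illustration, the discrete action set of Eq. (11) is simply a uniform grid of $M$ power levels; a minimal NumPy sketch with illustrative names:

```python
import numpy as np


def power_levels(p_max, M):
    """Discrete action set of Eq. (11): M evenly spaced levels from 0 to P_max."""
    return np.linspace(0.0, p_max, M)
```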

IV-A3 Reward Function

In order to design an appropriate reward function, we need to estimate the progress of the $n$th BS toward the goal of the optimization problem (8). To this aim, the reward function considers both the sum rate of the agent's cell and the interference generated to the neighboring cells $U_{n}$. We, then, define the reward function as:

$r_{t}=\sum_{k}C^{t}_{n,k}+\beta\sum_{n^{\prime}\in U_{n}}\sum_{k}C^{t}_{n^{\prime},k},$   (12)

where $\beta$ is a parameter controlling the tradeoff between the cell's own sum rate and the interference caused to others. If $\beta$ is set to 0, the agent ignores the performance degradation caused to other cells (which in turn do the same). As $\beta$ increases, the BSs take a more conservative stance, that is, they privilege interference reduction with respect to their own achieved sum rate.
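A minimal sketch of the per-BS reward of Eq. (12), assuming the per-user rates have already been computed (for instance with the per_user_rates() sketch above) and that the neighbor set $U_{n}$ is given as a list of cell indices:

```python
def bs_reward(rates, n, neighbors, beta=1.0):
    """Reward of BS n per Eq. (12): own-cell sum rate plus beta times the
    sum rate of the neighboring cells in U_n.

    rates: (N, K) array of per-user rates C[n, k]; neighbors: iterable of cell indices.
    """
    return rates[n].sum() + beta * sum(rates[m].sum() for m in neighbors)
```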

IV-B Federated Learning Formulation

As described later in this section, DRL algorithms use DNNs to produce either Q-values or the probability of taking an action in order to optimize the return. The canonical federated learning problem involves learning a single, global statistical model (in our case a DNN) from data produced by a multitude of devices. In our framework, we learn this model under the constraint that the state information of each BS is stored and processed locally, with only intermediate model updates being communicated periodically to a central server. The goal of the training is to minimize the following objective function:

$\min_{\theta}F(\theta)=\sum_{i=1}^{N}w_{i}F_{i}(\theta_{i}),$   (13)

where $F(\theta)$ and $\theta$ are the global objective function and the global model weights to be optimized, $F_{i}$ and $\theta_{i}$ are the local loss function and local model weights at BS $i$, and $w_{i}$ is the contribution of cell $i$ to the whole network in terms of data size, i.e., $w_{i}=\frac{k_{i}}{\sum_{j=1}^{N}k_{j}}$, where $k_{i}$ is the number of users in cell $i$.
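On the server side, the aggregation implied by Eq. (13) reduces to a weighted average of the local model weights in FedAvg style. A hedged PyTorch sketch, assuming each BS uploads its state_dict and the server knows the per-cell user counts $k_{i}$:

```python
import torch


def federated_average(local_state_dicts, users_per_cell):
    """Weighted average of local DNN weights with w_i = k_i / sum_j k_j, per Eq. (13)."""
    total = float(sum(users_per_cell))
    weights = [k / total for k in users_per_cell]
    aggregated = {}
    for name in local_state_dicts[0]:
        # elementwise weighted sum of the corresponding parameter tensors
        aggregated[name] = sum(w * sd[name].float()
                               for w, sd in zip(weights, local_state_dicts))
    return aggregated
```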

IV-C Federated Deep Q Network

Formally, value-based RL algorithms use an action-value function $Q(s,a)$ to estimate the expected return starting from state $s$ when taking action $a$:

$Q_{\pi}(s_{t},a)=E_{\pi}\{\sum_{k=1}^{\infty}\gamma^{k-1}r_{t+k-1}|s_{t},a\}$   (14)
$=E_{s_{t+1},a}\{r_{t}+\gamma Q_{\pi}(s_{t+1},a)|s_{t},a_{t}\}.$   (15)

The optimal action-value function $Q^{*}(s_{t},a)$ is the quantity sought by the RL agent, defined as the maximum expected cumulative discounted return from state $s_{t}$:

$Q^{*}(s_{t},a)=E_{s_{t+1}}\{r_{t}+\gamma\max_{a^{\prime}}Q^{*}(s_{t+1},a^{\prime})|s_{t},a\}.$   (16)

In DRL, a function approximation technique (a DNN in the considered case) is used to learn a parametrized value function $Q(s,a;\theta_{q})$ that approximates the optimal Q-values. The one-step look-ahead $r_{t}+\gamma\max_{a}Q(s_{t+1},a;\theta_{q})$ is used as the target for $Q(s_{t},a;\theta_{q})$. The function $Q(s_{t},a;\theta_{q})$ is thus determined by the parameters $\theta_{q}$. The selection of a good action relies on accurate action-value estimation, and thus DQN attempts to find the optimal parameters $\theta^{*}_{q}$ that minimize the loss function:

$L(\theta_{q})=(r_{t}+\gamma\max_{a}Q(s_{t+1},a;\theta_{q})-Q(s_{t},a;\theta_{q}))^{2}.$   (17)
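A hedged PyTorch sketch of the loss in Eq. (17), evaluated on a minibatch drawn from the local replay memory. The single-network target mirrors the equation (no separate target network), and the tensor shapes and function name are our assumptions.

```python
import torch
import torch.nn.functional as F


def dqn_loss(q_net, batch, gamma=0.99):
    """One-step TD loss of Eq. (17) on a minibatch (states, actions, rewards, next_states)."""
    states, actions, rewards, next_states = batch
    # Q(s_t, a_t; theta_q) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # one-step look-ahead target, detached as is common practice
        target = rewards + gamma * q_net(next_states).max(dim=1).values
    return F.mse_loss(q_sa, target)
```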

Similar to classical Q-learning, the agent collects experiences by interacting with the environment. The network trainer constructs a data set $\mathcal{D}$ by collecting the experiences up to time $t$ in the form of tuples $(s_{t-1},a_{t-1},r_{t},s_{t})$. We optimize the loss function $L(\theta_{q})$ using the collected data set $\mathcal{D}$.

In the early stages of training, the agent's estimates are not accurate, and a dynamic $\epsilon$-greedy policy is adopted to control the actions, where the agent explores different actions with a certain probability regardless of their reward. This strategy promotes accurate estimation over time and reduces the risk of overfitting the model to actions with high rewards in the first phase of training.

By substituting the DQN cost function into equation (13), we obtain the FDQN cost as:

$\min_{\theta_{q}}L(\theta_{q})=\sum_{i=1}^{N}w_{i}L_{i}(\theta_{q_{i}}).$   (18)

The overall learning procedure for FDQN is summarized in Algorithm 1.

Algorithm 1 FDQN
1: Input: aggregation period $Ag$, learning rate $lr$, number of training episodes $N_{e}$, episode horizon $T$, exploration parameter $\epsilon$, initial $\theta_{q}$
2: Initialization: get initial $\theta_{q}$ from the server
3: for $e:=1$ to $N_{e}$ do
4:   get initial state $s$
5:   for $t:=1$ to $T$ do
6:      draw a random number $r\in[0,1]$ and set
        $a_{t}=\begin{cases}\operatorname*{\arg\!\max}_{a}Q(s_{t},a,\theta_{q}^{e})&\text{if }r>\epsilon\\ \text{an action }a\text{ picked uniformly at random}&\text{otherwise}\end{cases}$
7:      take action $a_{t}$, move to state $s_{t+1}$ and receive reward $r_{t+1}$
8:      store $\{a_{t},s_{t},r_{t+1},s_{t+1}\}$
9:   end for
10:  update $\theta_{q}^{e+1}=\theta_{q}^{e}-lr\,\nabla_{\theta_{q}}L(\theta_{q}^{e})$
11:  if $e \bmod Ag=0$ then
12:     send $\theta_{q}^{e}$ to the server for aggregation
13:     get the aggregated $\theta_{q}^{e}$ from the server
14:  end if
15: end for
16: Output: $\theta_{q}^{*}$
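For concreteness, the BS-side loop of Algorithm 1 could look as follows in PyTorch. Here env, select_action(), sample_batch() and the server object are hypothetical placeholders, and dqn_loss() refers to the earlier sketch; this is a minimal sketch, not the authors' implementation.

```python
import torch


def train_fdqn_agent(q_net, env, server, num_episodes, horizon,
                     agg_period, lr=1e-3, epsilon=0.1):
    """One BS running Algorithm 1 with periodic federated aggregation."""
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    replay = []
    for e in range(1, num_episodes + 1):
        s = env.reset()                                    # get initial local state
        for t in range(horizon):
            a = select_action(q_net, s, epsilon)           # epsilon-greedy rule (line 6)
            s_next, r = env.step(a)                        # interact with the local cell
            replay.append((s, a, r, s_next))               # store experience (line 8)
            s = s_next
        optimizer.zero_grad()
        dqn_loss(q_net, sample_batch(replay)).backward()   # local gradient step (line 10)
        optimizer.step()
        if e % agg_period == 0:                            # periodic aggregation (lines 11-14)
            q_net.load_state_dict(server.aggregate(q_net.state_dict()))
    return q_net
```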

IV-D Federated Deep Policy Gradient

In contrast to value-based methods such as Q-learning, policy gradient algorithms directly optimize the policy without estimating the Q-values. This approach is more robust to the overestimation problem that affects value-based methods. Using a DNN as a function approximator, we define the parametrized policy $\pi(a|s;\theta_{p})$, where $\theta_{p}$ are the DNN weights. Herein, we focus specifically on Reinforce, a policy-gradient learning algorithm that updates the policy based on a Monte-Carlo estimate of the agent's average return. The objective of Reinforce is defined as:

$J(\theta_{p})=E_{\pi_{\theta_{p}}}\{R\}.$   (19)

Given the parameters $\theta_{p}$ and state $s$, the policy network generates a stochastic policy, that is, a probability vector over the actions. Taking the gradient with respect to $\theta_{p}$, we obtain:

$\nabla_{\theta_{p}}J(\theta_{p})=E_{\pi_{\theta_{p}}}\{\nabla_{\theta_{p}}\log(\pi(a|s,\theta_{p}))R\},$   (20)
TABLE I: Baselines and F-DRL performance comparison.

Algorithm   Mean (bit/s/Hz)   STD (bit/s/Hz)   Avg. execution time (s)   Communication overhead
FDQN        1.601             0.233            2.5 × 10^-4               0.01
DQN-Dist    1.234             0.233            2.6 × 10^-4               0
DQN-Cent    1.359             0.211            7.6 × 10^-4               1
FDPG        1.521             0.231            2.7 × 10^-4               0.01
DPG-Dist    0.873             0.227            2.7 × 10^-4               0
DPG-Cent    1.590             0.241            8.5 × 10^-4               1
WMMSE       1.376             0.205            1.7 × 10^-2               1
Max Power   0.525             0.182            9.2 × 10^-6               0

Parameters are updated to increase the probability of actions associated with higher-reward trajectories. Since Reinforce outputs a stochastic policy with non-zero probability over all actions, it does not require an exploration procedure such as the $\epsilon$-greedy strategy described earlier. At test time, the optimal policy can be obtained by deterministically selecting the action with the largest probability.
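A hedged sketch of the resulting update: minimizing the negative return-weighted log-likelihood below yields a gradient matching Eq. (20), with the Monte-Carlo returns computed from the stored episode rewards. Names and tensor shapes are illustrative.

```python
import torch


def reinforce_loss(policy_net, states, actions, returns):
    """Surrogate loss whose gradient is the negative of Eq. (20), so a descent
    step performs gradient ascent on the Reinforce objective."""
    logits = policy_net(states)                     # unnormalized action preferences
    log_probs = torch.log_softmax(logits, dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi(a_t|s_t)
    return -(chosen * returns).mean()
```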

By substituting the Reinforce cost function into equation (13) we obtain the FDPG cost:

$\max_{\theta_{p}}J(\theta_{p})=\sum_{i=1}^{N}w_{i}J_{i}(\theta_{p_{i}}).$   (21)

The overall training procedure for FDPG is summarized in Algorithm 2.

Algorithm 2 FDPG
1: Input: aggregation period $Ag$, learning rate $lr$, number of training episodes $N_{e}$, episode horizon $T$, initial $\theta_{p}$
2: Initialization: get initial $\theta_{p}$ from the server
3: for $e:=1$ to $N_{e}$ do
4:   get initial state $s$
5:   for $t:=1$ to $T$ do
6:      sample action $a_{t}$ from the distribution $\pi(a_{t}|s_{t},\theta_{p}^{e})$
7:      take action $a_{t}$, move to state $s_{t+1}$ and receive reward $r_{t+1}$
8:   end for
9:   update $\theta_{p}^{e+1}=\theta_{p}^{e}+lr\,\nabla_{\theta_{p}}J(\theta_{p}^{e})$
10:  if $e \bmod Ag=0$ then
11:     send $\theta_{p}^{e}$ to the server for aggregation
12:     get the aggregated $\theta_{p}^{e}$ from the server
13:  end if
14: end for
15: Output: $\theta_{p}^{*}$

V Results

In this section, we provide a thorough performance evaluation of the proposed federated algorithms. We implement all the models in PyTorch and train them with the Adam optimizer, setting the learning rate to $lr=0.001$. We use the same network architecture for DQN and deep policy gradient (Reinforce): a neural network whose number of inputs equals the state dimension, with two hidden layers of 128 and 64 neurons, respectively, each followed by a ReLU activation. The output dimension equals the number of discretized power levels ($M=10$ in the results) for both policy gradient and DQN. The models will be released in our GitHub repository (https://github.com/PeymanTehrani/FDRL-PC-Dyspan). In the training procedure, we consider a multi-cell network with $N=25$ cells, in which each cell serves $K=4$ users. The Doppler frequency and time slot duration are set to $f_{d}=10$ Hz and $T_{s}=20$ ms, respectively. We set the maximum transmission power to $P_{max}=38$ dBm and the reward control parameter to $\beta=1$.
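The architecture described above corresponds to a small feed-forward network; a PyTorch sketch consistent with that description (the class name is illustrative):

```python
import torch.nn as nn


class PowerControlNet(nn.Module):
    """DNN used for both DQN and Reinforce: two hidden layers (128 and 64 units)
    with ReLU activations and M outputs (Q-values or action logits)."""

    def __init__(self, state_dim, num_power_levels=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_power_levels),
        )

    def forward(self, x):
        return self.net(x)
```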

Figure 2: Comparison of the convergence of distributed deep policy gradient with its federated implementation for different aggregation frequencies.
Figure 3: Comparison of the convergence of distributed deep Q-learning with its federated implementation for different aggregation frequencies.

Figs. 2 and 3 depict the average per-user rate in the network for the centralized, distributed and federated implementations of the DPG and DQN algorithms with different aggregation periods. We train each model over 7000 episodes, where each episode contains a horizon of $T=10$ time slots. To plot smoother curves, each group of 100 iterations is reduced to its average. In the distributed implementation, each BS trains its own model based only on its observed states and does not share the model weights with the other BSs (agents) or the central server, while in the centralized case all BSs send their user states to the central server at every episode in order to learn a global model.

Comparing the results of the distributed and federated algorithms, we can see that the achieved rate per user improves significantly in the latter case. Additionally, comparing the centralized case with the federated version, we observe that they achieve similar performance, but in the federated approach the communication between the BSs and the server is up to 1000 times lower (in the $AggPer=1000$ case). Furthermore, we note that a higher aggregation frequency leads to faster convergence at the cost of a larger communication overhead. It can also be observed that FDPG converges more smoothly than FDQN for all aggregation frequencies, thus providing more consistent performance to users. This can be explained by the $\epsilon$-greedy exploration used by FDQN, which increases the randomness of the behavior until convergence is reached, whereas in policy gradient-based algorithms the gradient update favors state-action trajectories associated with higher average rewards, which benefits from model aggregation.

In Table I, we compare the different implementations of DRL (federated, distributed and centralized) alongside other baselines, namely weighted minimum mean squared error (WMMSE) [21] and maximum-power transmission (Max Power). We test all the models on 1000 different random episodes and report, for each algorithm, the mean and standard deviation of the network sum rate, the average execution time, and the server-BS communication overhead. For F-DRL, we set the aggregation period to 100. Based on the results, it is apparent that the proposed F-DRL approach provides a good balance between high throughput, fast execution time and low overhead. This shows that incorporating the F-DRL strategy in NextG wireless networks leads to fast-responding, high-performance and efficient automated networks.

In Fig. 4, the performance of the distributed and federated deep policy gradient algorithms is compared in terms of average per-user rate. We can see that as the size of the network increases, the overall interference increases and, consequently, the average data rate per user decreases. Notably, when the number of BSs is very large (e.g., 36), the gain of the federated framework over fully distributed optimization is about 40%, while for smaller networks the gain is around 15%.

Figure 4: Performance comparison between the federated and distributed deep policy gradient for different numbers of available cells.

VI Conclusions

In this paper, we proposed federated deep reinforcement learning as a tool to solve a distributed power control problem in a wireless multi-cell network. We investigated the performance of DRL with value-based and policy-based methods and compared their federated and distributed implementations. We demonstrated by simulation that aggregating the models of the BSs can improve the performance in terms of overall network sum rate compared to a fully distributed implementation, while also being more bandwidth efficient than fully centralized scenarios. We also showed that the F-DRL approach outperforms conventional optimization-based baselines both in terms of performance and execution time.

References

  • [1] P. Tehrani, F. Lahouti, and M. Zorzi, “Resource allocation in ofdma networks with half-duplex and imperfect full-duplex users,” in 2016 IEEE international conference on communications (ICC).   IEEE, 2016.
  • [2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [3] A. A. Khan and R. Adve, “Centralized & distributed deep reinforcement learning methods for downlink sum-rate optimization,” IEEE Transactions on Wireless Communications, 2020.
  • [4] F. Meng, P. Chen, L. Wu, and J. Cheng, “Power allocation in multi-user cellular networks: Deep reinforcement learning approaches,” IEEE Transactions on Wireless Communications, 2020.
  • [5] Y. Sinan Nasir and D. Guo, “Deep actor-critic learning for distributed power control in wireless mobile networks,” arXiv e-prints, pp. arXiv–2009, 2020.
  • [6] X. Zhang, M. R. Nakhai, G. Zheng, S. Lambotharan, and B. Ottersten, “Calibrated learning for online distributed power allocation in small-cell networks,” IEEE Transactions on Communications, 2019.
  • [7] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics.   PMLR, 2017.
  • [8] S. Kumar, P. Shah, D. Hakkani-Tur, and L. Heck, “Federated control with hierarchical multi-agent deep reinforcement learning,” arXiv preprint arXiv:1712.08266, 2017.
  • [9] X. Wang, R. Li, C. Wang, X. Li, T. Taleb, and V. C. Leung, “Attention-weighted federated deep reinforcement learning for device-to-device assisted heterogeneous collaborative edge caching,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 1, pp. 154–169, 2020.
  • [10] X. Wang, C. Wang, X. Li, V. C. Leung, and T. Taleb, “Federated deep reinforcement learning for internet of things with decentralized cooperative edge caching,” IEEE Internet of Things Journal, 2020.
  • [11] Z. Zhu, S. Wan, P. Fan, and K. B. Letaief, “Federated multi-agent actor-critic learning for age sensitive mobile edge computing,” arXiv preprint arXiv:2012.14137, 2020.
  • [12] J. Ren, H. Wang, T. Hou, S. Zheng, and C. Tang, “Federated learning-based computation offloading optimization in edge computing-supported internet of things,” IEEE Access, vol. 7, pp. 69 194–69 201, 2019.
  • [13] D. C. Nguyen, M. Ding, P. N. Pathirana, A. Seneviratne, J. Li, D. Niyato, and H. V. Poor, “Federated learning for industrial internet of things in future industries,” arXiv preprint arXiv:2105.14659, 2021.
  • [14] Y. Zhang, C. Kang, T. Ma, Y. Teng, and D. Guo, “Power allocation in multi-cell networks using deep reinforcement learning,” in 2018 IEEE 88th Vehicular Technology Conference (VTC-Fall).   IEEE, 2018.
  • [15] F. Meng, P. Chen, and L. Wu, “Power allocation in multi-user cellular networks with deep q learning approach,” in IEEE International Conference on Communications (ICC).   IEEE, 2019, pp. 1–6.
  • [16] Y. S. Nasir and D. Guo, “Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, 2019.
  • [17] H. Ding, F. Zhao, J. Tian, D. Li, and H. Zhang, “A deep reinforcement learning for user association and power control in heterogeneous networks,” Ad Hoc Networks, vol. 102, p. 102069, 2020.
  • [18] S. Saeidian, S. Tayamon, and E. Ghadimi, “Downlink power control in dense 5g radio access networks through deep reinforcement learning,” in IEEE International Conference on Communications (ICC), 2020.
  • [19] W. Lei, Y. Ye, and M. Xiao, “Deep reinforcement learning based spectrum allocation in integrated access and backhaul networks,” IEEE Transactions on Cognitive Communications and Networking, 2020.
  • [20] J. Tan, Y.-C. Liang, L. Zhang, and G. Feng, “Deep reinforcement learning for joint channel selection and power control in d2d networks,” IEEE Transactions on Wireless Communications, 2020.
  • [21] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, “An iteratively weighted mmse approach to distributed sum-utility maximization for a mimo interfering broadcast channel,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4331–4340, 2011.