
Deep Reinforcement Learning for Multi-Agent Power Control in Heterogeneous Networks

Lin Zhang and Ying-Chang Liang, L. Zhang is with the Key Laboratory on Communications, University of Electronic Science and Technology of China (UESTC), Chengdu, China (emails: [email protected]).Y.-C. Liang is with the Center for Intelligent Networking and Communications (CINC), University of Electronic Science and Technology of China (UESTC), Chengdu, China (email: [email protected]).
Abstract

We consider a typical heterogeneous network (HetNet), in which multiple access points (APs) are deployed to serve users by reusing the same spectrum band. Since different APs and users may cause severe interference to each other, advanced power control techniques are needed to manage the interference and enhance the sum-rate of the whole network. Conventional power control techniques first collect instantaneous global channel state information (CSI) and then calculate sub-optimal solutions. Nevertheless, it is challenging to collect instantaneous global CSI in the HetNet, in which global CSI typically changes fast. In this paper, we exploit deep reinforcement learning (DRL) to design a multi-agent power control algorithm in the HetNet. To be specific, by treating each AP as an agent with a local deep neural network (DNN), we propose a multiple-actor-shared-critic (MASC) method to train the local DNNs separately in an online trial-and-error manner. With the proposed algorithm, each AP can independently use the local DNN to control the transmit power with only local observations. Simulation results show that the proposed algorithm outperforms the conventional power control algorithms in terms of both the converged average sum-rate and the computational complexity.

Index Terms:
DRL, multi-agent, power control, MASC, HetNet.

I Introduction

Driven by ubiquitous wireless devices such as smart phones and tablets, wireless data traffic has increased dramatically in recent years [1]-[3]. This traffic places a heavy burden on conventional cellular networks, in which a macro base station (BS) is deployed to provide wireless access services for all the users within the macro-cell. As an alternative, the heterogeneous network (HetNet) is proposed by deploying small cells within the macro-cell. Typical small cells include the Pico-cell and the Femto-cell, which are able to provide flexible wireless access services for users with heterogeneous demands. It has been demonstrated that small cells can effectively offload the wireless traffic of the macro BS, provided that the small cells are well coordinated [4].

Due to the spectrum scarcity issue, it is inefficient to assign orthogonal spectrum resources to all the cells (including the macro-cell and small cells). Instead, different cells may reuse the same spectrum resource, which leads to severe inter-cell interference. To suppress the inter-cell interference and enhance the sum-rate of the cells reusing the same spectrum resource, power control algorithms are usually adopted [5]-[11]. Two conventional power control algorithms for sum-rate maximization are the weighted minimum mean square error (WMMSE) algorithm [7] and the fractional programming (FP) algorithm [8]. By assuming that the instantaneous global (including both intra-cell and inter-cell) channel state information (CSI) is available, both the WMMSE and FP algorithms can be used to calculate power allocation policies for the access points (APs) in different cells simultaneously.

It is known that the solutions of conventional power control algorithms are generally sub-optimal. In fact, the solutions can be completely invalid when the coherence time of the wireless channels is shorter than the processing time, which is the total time of estimating channels, running power control algorithms, and updating the transmit power. In a HetNet, the radio environment is highly dynamic and the coherence time of the wireless channels is typically shorter than the processing time. As a result, the output solutions of conventional power control algorithms are typically invalid.

Motivated by the appealing performance of machine learning in computer science fields such as computer vision and natural language processing, machine learning has recently been advocated for wireless network designs. One typical application of machine learning is dynamic power control for sum-rate maximization in wireless networks [13] [14]. Specifically, [13] designs a deep learning (DL) based algorithm to accelerate power allocation in a general interference-channel scenario. By collecting a large number of global CSI sets, [13] uses the WMMSE algorithm to generate power allocation labels. Then, [13] trains a deep neural network (DNN) with these global CSI sets and the corresponding power allocation labels. With the trained DNN, power allocation policies can be directly calculated by feeding in the instantaneous global CSI. To avoid requiring instantaneous global CSI while eliminating the computational cost of generating power allocation labels with the WMMSE algorithm, [14] considers a homogeneous network and assumes that neighboring transceivers (i.e., APs and users) can exchange their local information through certain cooperations. Then, [14] develops a deep reinforcement learning (DRL) based algorithm, which optimizes power allocation policies in a trial-and-error manner and can converge to the performance of the WMMSE algorithm after sufficient trials.

I-A Contributions

In this paper, we study the power control problem for the sum-rate maximization in a typical HetNet, in which multi-tier cells coexist by reusing the same spectrum band. Different from the single-tier scenario, both the maximum transmit power and coverage of each AP in the multi-tier scenario are typically heterogeneous. Main contributions of the paper are summarized as follows:

  1.

    We exploit DRL to design a multi-agent power control algorithm in the HetNet. First, we establish a local DNN at each AP and treat each AP as an agent. The input and the output of each local DNN are the local state information and the adopted local action, i.e., the transmit power of the corresponding AP, respectively. Then, we propose a novel multiple-actor-shared-critic (MASC) method to train each local DNN separately in an online trial-and-error manner.

  2.

    The MASC training method is composed of multiple actor DNNs and a shared critic DNN. In particular, we first establish an actor DNN in the core network for each local DNN, and the structure of each actor DNN is the same as that of the corresponding local DNN. Then, we establish a shared critic DNN in the core network for these actor DNNs. By feeding historical global information into the critic DNN, the output of the critic DNN can accurately evaluate, from a global view, whether the output (i.e., transmit power) of each actor DNN is good or not. By training each actor DNN with the evaluation of the critic DNN, the weight vector of each actor DNN can be updated towards the global optimum. The weight vector of each local DNN is then periodically replaced by that of the associated actor DNN until convergence.

  3.

    The proposed algorithm has two main advantages over existing power control algorithms. First, compared with [7], [8], [13], and [14], each AP in the proposed algorithm can independently control its transmit power and enhance the sum-rate based on only local state information, in the absence of instantaneous global CSI and any cooperation with other cells. Second, compared with [14], [15], and [16], the reward function of each agent in the proposed algorithm is simply the transmission rate between the corresponding AP and its served user, avoiding tailored reward-function designs for each AP. This may ease the transfer of the proposed algorithm framework to other resource management problems in wireless communications.

  4.

    By considering both two-layer and three-layer HetNets in simulations, we demonstrate that the proposed algorithm can rapidly converge to an average sum-rate higher than those of WMMSE and FP algorithms. Simulation results also reveal that the proposed algorithm outperforms WMMSE and FP algorithms in terms of the computational complexity.

I-B Related literature

DRL originates from RL, which has been widely used in the design of wireless communications [17]-[22], e.g., user/AP handoff, radio access technology selection, energy-efficient transmissions, user scheduling and resource allocation, and spectrum sharing. In particular, RL estimates the long-term reward of each state-action pair and stores these estimates in a two-dimensional table. For a given state, RL can choose the action with the maximum long-term reward to enhance the performance. It has been demonstrated that RL performs well in decision-making scenarios in which the state and action spaces of the wireless system are relatively small. However, the effectiveness of RL diminishes when the state and action spaces become large.

To choose proper actions in scenarios with large state and action spaces, DRL is proposed by properly integrating DL and RL [23]. By adopting DNNs, which have a strong representation capability, DL has been applied in different areas of wireless communications [24], for example, power control [25], channel access [26], and link scheduling [27]. The basic idea of DRL is as follows: instead of storing the long-term reward of each state-action pair in a tabular manner, DRL uses a DNN to represent the long-term reward as a function of the state and action. Thanks to the strong representation capability of the DNN, the long-term reward of any state-action pair can be properly approximated. The applications of DRL in HetNets include interference control among small cells [15], power control in a HetNet [16], resource allocation in V2V communications [28], caching policy optimization in content distribution networks [29], multiple access optimization in a HetNet [30], modulation and coding scheme selection in a cognitive HetNet [31], joint beamforming/power control/interference coordination in 5G networks [32], UAV navigation [33], and spectrum sharing [34]. More applications of DRL in wireless communications can be found in [35].

I-C Organizations of the paper

The remainder of this paper is organized as follows. We provide the system model in Sec. II. In Sec. III, we provide the problem formulation and analysis. In Sec. IV, we give preliminaries by overviewing related DRL algorithms. In Sec. V, we elaborate the proposed power control algorithm. Simulation results are shown in Sec. VI. Finally, we conclude the paper in Sec. VII.

II System Model

Figure 1: A general HetNet, in which multiple APs share the same spectrum band to serve the users within their coverages and may cause interference to each other.

As shown in Fig. 1, we consider a typical HetNet, in which multiple APs share the same spectrum band to serve users and may cause interference to each other. In particular, we denote an AP as AP $n$, $n\in\mathbb{N}=\{1,2,\cdots,N\}$, where $N$ is the number of APs. Accordingly, we denote the user served by AP $n$ as user equipment (UE) $n$. Next, we provide the channel model and the signal transmission model, respectively.

II-A Channel model

The channel between an AP and a UE has two components, i.e., large-scale attenuation (including path-loss and shadowing) and small-scale block Rayleigh fading. If we denote $\phi_{n,k}$ as the large-scale attenuation and $h_{n,k}$ as the small-scale block Rayleigh fading from AP $n$ to UE $k$, the corresponding channel gain is $g_{n,k}=\phi_{n,k}|h_{n,k}|^{2}$. In particular, the large-scale attenuation is highly related to the locations of the AP and the UE, and typically remains constant for a long time. The small-scale block Rayleigh fading remains constant within a single time slot and changes across time slots.

According to [36], we adopt the Jakes model to represent the relationship between the small-scale Rayleigh fadings in two successive time slots, i.e.,

h(t)=\rho h(t-1)+\omega, \qquad (1)

where $\rho$ ($0\leq\rho\leq 1$) denotes the correlation coefficient of two successive small-scale Rayleigh fading realizations, $\omega$ is a random variable with distribution $\omega\sim\mathcal{CN}(0,1-\rho^{2})$, and $h(0)$ is a random variable drawn from the distribution $h(0)\sim\mathcal{CN}(0,1)$. It should be noted that the Jakes model reduces to the independent and identically distributed (IID) channel model if $\rho$ is zero.
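
For concreteness, the following minimal NumPy sketch generates a fading trajectory according to (1); the function name and arguments are illustrative rather than part of the proposed algorithm.

```python
import numpy as np

def jakes_channel(num_slots, rho, rng=None):
    """Generate small-scale Rayleigh fading per Eq. (1): h(t) = rho*h(t-1) + w,
    with w ~ CN(0, 1 - rho^2) and h(0) ~ CN(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng

    def cn(var):  # circularly symmetric complex Gaussian with the given variance
        return np.sqrt(var / 2) * (rng.standard_normal() + 1j * rng.standard_normal())

    h = np.empty(num_slots, dtype=complex)
    h[0] = cn(1.0)
    for t in range(1, num_slots):
        h[t] = rho * h[t - 1] + cn(1.0 - rho ** 2)
    return h

# rho = 0 reduces to IID Rayleigh fading; rho close to 1 gives slowly varying channels.
h = jakes_channel(num_slots=1000, rho=0.8)
```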

II-B Signal transmission model

If we denote $x_{n}(t)$ as the downlink signal from AP $n$ to UE $n$ with unit power in time slot $t$, the received signal at UE $n$ is

y_{n}(t)=\sqrt{p_{n}(t)\phi_{n,n}}h_{n,n}(t)x_{n}(t)+\sum_{k\in\mathbb{N},k\neq n}\sqrt{p_{k}(t)\phi_{k,n}}h_{k,n}(t)x_{k}(t)+\delta_{n}(t), \qquad (2)

where $p_{n}(t)$ is the transmit power of AP $n$, $\delta_{n}(t)$ is the noise at UE $n$ with power $\sigma^{2}$, the first term on the right-hand side is the desired signal from AP $n$, and the second term on the right-hand side is the interference received from the other APs. By considering that all the downlink transmissions from APs to UEs are synchronized, the signal to interference plus noise ratio (SINR) at UE $n$ can be written as

\gamma_{n}(t)=\frac{p_{n}(t)g_{n,n}(t)}{\sum_{k\in\mathbb{N},k\neq n}p_{k}(t)g_{k,n}(t)+\sigma^{2}}. \qquad (3)

Accordingly, the downlink transmission rate (in bps) from AP $n$ to UE $n$ is

r_{n}(t)=B\log\left(1+\gamma_{n}(t)\right), \qquad (4)

where $B$ is the bandwidth of the downlink transmission.
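
As a quick illustration of (3) and (4), the sketch below computes the per-link SINRs and rates for a given power vector and channel-gain matrix; it assumes a base-2 logarithm and a NumPy data layout that are not specified in the paper.

```python
import numpy as np

def sum_rate(p, G, noise_power, bandwidth):
    """Compute per-link SINRs (3) and rates (4) for transmit powers p (length N)
    and channel-gain matrix G, where G[k, n] is the gain from AP k to UE n."""
    received = G * p[:, None]                 # received[k, n] = p_k * g_{k,n}
    signal = np.diag(received)                # desired term p_n * g_{n,n}
    interference = received.sum(axis=0) - signal
    sinr = signal / (interference + noise_power)
    rates = bandwidth * np.log2(1.0 + sinr)   # assuming log base 2, rates in bit/s
    return rates, rates.sum()
```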

III Problem description and analysis

Our goal is to optimize the transmit power of all APs to maximize the sum-rate of all the downlink transmissions. Then, we can formulate the optimization problem as

\underset{p_{n},\forall n\in\mathbb{N}}{\max}\ \ R(t)=\sum_{n=1}^{N}r_{n}(t)
{\rm{s.t.}}\ \ 0\leq p_{n}(t)\leq p_{n,\text{max}},\ \forall n\in\mathbb{N}, \qquad (5)

where $p_{n,\text{max}}$ is the maximum transmit power constraint of AP $n$. Note that the APs of different cells (e.g., the macro-cell base station, pico-cell base station, and femto-cell base station) typically have distinct maximum transmit power constraints.

Since wireless channels are typically dynamic, the optimal transmit power maximizing the sum-rate can differ substantially from one time slot to another. In other words, the optimal transmit power should be determined at the beginning of each time slot to guarantee optimality. Nevertheless, there are two main challenges in determining the optimal transmit power of the APs. First, according to [37], this problem is generally NP-hard and it is difficult to find the optimal solution. Second, the optimal transmit power should be determined at the beginning of each time slot, and it is demanding to find the optimal solution (if possible at all) within such a short time period.

As mentioned above, conventional power control algorithms (e.g., the WMMSE algorithm and the FP algorithm) can output sub-optimal solutions for this problem by implicitly assuming a quasi-static radio environment, in which wireless channels change slowly. For a dynamic radio environment, DL [13] and DRL [14] have recently been adopted to obtain sub-optimal solutions, by assuming the availability of instantaneous global CSI or cooperation among neighboring APs. In fact, these algorithms are inapplicable in the scenario considered in this paper due to the following two main constraints:

  • Instantaneous global CSI is unavailable.

  • Neighboring APs are unwilling or unable to cooperate with each other.

In the rest of the paper, we will first provide preliminaries by overviewing related DRL algorithms and then develop a DRL based multi-agent power control algorithm to solve the above problem.

IV Preliminaries

In this section, we provide an overview of two DRL algorithms, i.e., deep Q-network (DQN) and deep deterministic policy gradient (DDPG), both of which are important for developing the power control algorithm in this paper. In general, DRL algorithms mimic the trial-and-error way in which human beings solve sequential decision-making problems. By observing the current environment state $s$ ($s\in\textbf{\emph{S}}$) and adopting an action $a$ ($a\in\textbf{\emph{A}}$) based on a policy $\pi$, the DRL agent obtains an immediate reward $r$ and observes a new environment state $s^{\prime}$. By repeating this process and treating each tuple $\{s,a,r,s^{\prime}\}$ as an experience, the DRL agent can continuously learn the optimal policy from the experiences to maximize the long-term reward.

IV-A Deep Q-network (DQN) [23]

DQN establishes a DNN $Q(s,a;\theta)$ with weight vector $\theta$ to represent the expected cumulative discounted (long-term) reward obtained by executing action $a$ in environment state $s$. Then, $Q(s,a;\theta)$ can be rewritten in a recursive form (Bellman equation) as

Q(s,a;\theta)=r(s,a)+\eta\sum\limits_{s^{\prime}\in\textbf{\emph{S}}}\sum\limits_{a^{\prime}\in\textbf{\emph{A}}}{{p_{s,s^{\prime}}}(a)}Q(s^{\prime},a^{\prime};\theta), \qquad (6)

where $r(s,a)$ is the immediate reward obtained by executing action $a$ in environment state $s$, $\eta\in[0,1]$ is the discount factor representing the discounted impact of future rewards, and ${p_{s,s^{\prime}}}(a)$ is the transition probability of the environment state from $s$ to $s^{\prime}$ when executing action $a$. The DRL agent aims to find the optimal weight vector $\theta^{*}$ that maximizes the long-term reward for each state-action pair. With the optimal weight vector $\theta^{*}$, (6) can be rewritten as

Q(s,a;\theta^{*})=r(s,a)+\eta\sum\limits_{s^{\prime}\in\textbf{\emph{S}}}{{p_{s,s^{\prime}}}(a)\mathop{\max}\limits_{a^{\prime}\in\textbf{\emph{A}}}}\ Q(s^{\prime},a^{\prime};\theta^{*}). \qquad (7)

Accordingly, the optimal action policy is

{\pi^{*}}(s)=\mathop{\arg\max}\limits_{a\in\textbf{\emph{A}}}\left[{Q(s,a;\theta^{*})}\right],\ \forall\ s\in\textbf{\emph{S}}. \qquad (8)

However, it is challenging to directly obtain the optimal $\theta^{*}$ from (7) since the transition probability ${p_{s,s^{\prime}}}(a)$ is typically unknown to the DRL agent. Instead, the DRL agent updates $\theta$ in an iterative manner. On the one hand, to balance exploitation and exploration, the DRL agent adopts an $\epsilon$-greedy algorithm to choose an action for each environment state: for a given state $s$, the DRL agent executes the action $a=\arg\max_{a\in\textbf{\emph{A}}}Q(s,a;\theta)$ with probability $1-\epsilon$, and randomly executes an action with probability $\epsilon$. On the other hand, by executing actions with the $\epsilon$-greedy algorithm, the DRL agent continuously accumulates experiences $e=\{s,a,r,s^{\prime}\}$ and stores them in an experience replay buffer in a first-in-first-out (FIFO) fashion. After sampling a mini-batch of experiences $\mathcal{E}$ with length $D$ ($\mathcal{E}=\{e_{1},e_{2},\cdots,e_{D}\}$) from the experience replay buffer, the DRL agent can update $\theta$ by adopting a stochastic gradient descent (SGD) method to minimize the expected prediction error (loss function) over the sampled experiences, i.e.,

\mathbb{L}(\theta)=\frac{1}{D}\sum_{\mathcal{E}}{{{\left[r+\eta\mathop{\max}\limits_{a^{\prime}\in\textbf{\emph{A}}}\ Q^{-}(s^{\prime},a^{\prime};\theta^{-})-Q(s,a;\theta)\right]}^{2}}}, \qquad (9)

where $Q^{-}(s,a;\theta^{-})$ is the target DNN, which has the same structure as $Q(s,a;\theta)$. In particular, the weight vector $\theta^{-}$ is periodically updated with $\theta$ to stabilize the training of $Q(s,a;\theta)$.
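
The following sketch illustrates how the loss in (9) can be evaluated on a sampled mini-batch; it assumes that q_net and target_net are Keras models returning per-action Q-values, which is one possible realization rather than the implementation used in the paper.

```python
import tensorflow as tf

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, eta=0.5):
    """Mini-batch TD loss of Eq. (9); q_net and target_net map a batch of states
    to a vector of per-action Q-values (shape (D, |A|))."""
    q_all = q_net(states)
    q_sa = tf.gather(q_all, actions, axis=1, batch_dims=1)   # Q(s, a) of the taken actions
    # Bootstrapped target r + eta * max_a' Q^-(s', a'); no gradient flows through it.
    target = rewards + eta * tf.reduce_max(target_net(next_states), axis=1)
    return tf.reduce_mean(tf.square(tf.stop_gradient(target) - q_sa))
```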

IV-B Deep deterministic policy gradient (DDPG) [38]

It should be noted that DQN needs to estimate the long-term reward of each state-action pair to obtain the optimal policy. When the action space is continuous, DQN becomes inapplicable and DDPG is an alternative. DDPG has an actor-critic architecture, which includes an actor DNN $\mu(s;\theta_{\mu})$ and a critic DNN $Q(s,a;\theta_{Q})$. In particular, the actor DNN $\mu(s;\theta_{\mu})$ is the policy network and is responsible for outputting a deterministic action $a$ from a continuous action space in an environment state $s$, while the critic DNN is the value-function network (similar to DQN) and is responsible for estimating the long-term reward of executing action $a$ in environment state $s$. Similar to DQN, to stabilize the training of the actor DNN and the critic DNN, DDPG establishes a target actor DNN $\mu^{-}(s;\theta_{\mu}^{-})$ and a target critic DNN $Q^{-}(s,a;\theta_{Q}^{-})$, respectively.

Similar to DQN, the DDPG agent samples experiences from the experience replay buffer to train the actor DNN and the critic DNN. Nevertheless, DDPG accumulates experiences in a different way. In particular, the executed action in each time slot is a noisy version of the output of the actor DNN, i.e., $a=\mu(s;\theta_{\mu})+\zeta$, where $\zeta$ is a random variable that guarantees continuous exploration of the action space around the output of the actor DNN. Note that the $\epsilon$-greedy method is usually used for action selection in discrete action spaces. When the action space is continuous, the noisy version of the output of the actor DNN is preferable for action selection [38] [39].

The training procedure of the critic DNN is similar to that of DQN: by sampling a mini-batch of experiences $\mathcal{E}$ with length $D$ from the experience replay buffer, the DDPG agent can update $\theta_{Q}$ by adopting an SGD method to minimize the expected prediction error over the sampled experiences, i.e.,

\mathbb{L}(\theta_{Q})=\frac{1}{D}\sum_{\mathcal{E}}{{{\left[r+\eta\mathop{\max}\limits_{a^{\prime}\in\textbf{\emph{A}}}\ Q^{-}(s^{\prime},a^{\prime};\theta_{Q}^{-})-Q(s,a;\theta_{Q})\right]}^{2}}}. \qquad (10)

Then, the DDPG agent adopts a soft update method to update the weight vector $\theta_{Q}^{-}$ of the target critic DNN, i.e., $\theta_{Q}^{-}\leftarrow\tau\theta_{Q}+(1-\tau)\theta_{Q}^{-}$, where $\tau\in[0,1]$ is the learning rate of the target DNN.

The goal of training the actor DNN is to maximize the expected value function $Q\left(s,\mu(s;\theta_{\mu});\theta_{Q}\right)$ over the environment state $s$, i.e., $J(\theta_{\mu})=\mathbb{E}_{s}\left[Q\left(s,\mu(s;\theta_{\mu});\theta_{Q}\right)\right]$. Then, the weight vector $\theta_{\mu}$ of the actor DNN can be updated in the sampled gradient direction of $J(\theta_{\mu})$, i.e.,

\nabla_{\theta_{\mu}}J(\theta_{\mu})\approx\frac{1}{D}\sum_{\mathcal{E}}\nabla_{a}Q\left(s,a;\theta_{Q}\right)|_{a=\mu(s;\theta_{\mu})}\nabla_{\theta_{\mu}}\mu(s;\theta_{\mu}). \qquad (11)

Similar to the target critic DNN, the DDPG agent updates the weight vector of the target actor DNN by $\theta_{\mu}^{-}\leftarrow\tau\theta_{\mu}+(1-\tau)\theta_{\mu}^{-}$.

V DRL for multi-agent power control algorithm

In this section, we exploit DRL to design a multi-agent power control algorithm for the APs in the HetNet. In the following, we will first introduce the algorithm framework and then elaborate the algorithm design.

V-A Algorithm framework

It is known that the optimal transmit power of the APs is highly related to the instantaneous global CSI. However, the instantaneous global CSI is unavailable to the APs. Thus, it is impossible for each AP to optimize its transmit power through conventional power control algorithms, e.g., the WMMSE algorithm or the FP algorithm. From [13] and [14], the historical wireless data (e.g., global CSI, transmit power of the APs, mutual interference, and achieved sum-rates) of the whole network contains useful information that can be utilized to optimize the transmit power of the APs. Thus, we aim to leverage DRL to develop an intelligent power control algorithm that can fully utilize the historical wireless data of the whole network. From the perspective of practical implementation, the intelligent power control algorithm should have two basic functionalities:

  • Functionality I: Each AP can use the algorithm to complete the optimization of its transmit power at the beginning of each time slot, in order to guarantee the timeliness of the optimization.

  • Functionality II: Each AP can independently optimize its transmit power to enhance the sum-rate with only local observations, in the absence of global CSI and any cooperation among APs.

Figure 2: Proposed algorithm framework.

To realize Functionality I, we adopt a centralized-training-distributed-execution architecture as the basic algorithm framework, as shown in Fig. 2. To be specific, a local DNN is established at each AP, and the input and the output of each local DNN are the local state information and the adopted local action, i.e., the transmit power of the corresponding AP, respectively. In this way, each AP can feed local observations into the local DNN to calculate the transmit power in a real-time fashion. The weight vector of each local DNN is trained in the core network, which has abundant historical wireless data of the whole network. Since there are $N$ APs, we denote the corresponding $N$ local DNNs as $\mu_{n}^{\text{(L)}}\left(s_{n};\theta_{n}^{\text{(L)}}\right)$ ($n\in\mathbb{N}$), in which $\theta_{n}^{\text{(L)}}$ is the weight vector of local DNN $n$ and $s_{n}$ is the local state of AP $n$.

To realize Functionality II, we develop a MASC training method based on DDPG to update the weight vectors of the local DNNs. To be specific, as shown in Fig. 2, $N$ actor DNNs together with $N$ target actor DNNs are established in the core network and associated with the $N$ local DNNs, respectively. Each actor DNN has the same structure as the associated local DNN, such that the trained weight vector of the actor DNN can be used to update the associated local DNN. Meanwhile, a shared critic DNN together with the corresponding target critic DNN is established in the core network to guide the training of the $N$ actor DNNs. The inputs of the critic DNN are the global state information and the adopted global action, i.e., the transmit power of each AP, and the output of the critic DNN is the long-term sum-rate of the global state-action pair. It should be noted that there are two main benefits of including the global state and global action in the input of the critic DNN. First, by doing this, the non-stationarity the critic DNN faces is caused only by the time-variant nature of the wireless channels. This is the major difference from the case in which a single-agent algorithm is directly applied to a multi-agent problem and the non-stationarity is caused by both the time-variant nature of the wireless channels and the unknown actions of the other agents. The adverse impact of the non-stationary radio environment on the system performance can thus be reduced. Second, by doing this, we can train the critic DNN with historical global state-action pairs together with the achieved sum-rates in the core network, such that the critic DNN has a global view of the relationship between the global state-action pairs and the long-term sum-rate. Then, the critic DNN can evaluate whether the output of an actor DNN is good or not in terms of the long-term sum-rate. By training each actor DNN with the evaluation of the critic DNN, the weight vector of each actor DNN can be updated towards the global optimum. To this end, the weight vector of each local DNN is periodically replaced by that of the associated actor DNN until convergence.

We denote the $N$ actor DNNs as $\mu_{n}^{(\text{a})}\left(s_{n};\theta_{n}^{(\text{a})}\right)$ ($n\in\mathbb{N}$), where $\theta_{n}^{(\text{a})}$ is the weight vector of actor DNN $n$. Accordingly, we denote the $N$ target actor DNNs as $\mu_{n}^{(\text{a-})}\left(s_{n};\theta_{n}^{(\text{a-})}\right)$ ($n\in\mathbb{N}$), where $\theta_{n}^{(\text{a-})}$ is the corresponding weight vector. Then, we denote the critic DNN and the target critic DNN as $Q\left(s_{1},\cdots,s_{N},s_{\text{o}},a_{1},\cdots,a_{N};\theta^{(\text{c})}\right)$ and $Q^{-}\left(s_{1},\cdots,s_{N},s_{\text{o}},a_{1},\cdots,a_{N};\theta^{(\text{c-})}\right)$, respectively, where $\{s_{1},\cdots,s_{N},s_{\text{o}}\}$ is referred to as the global state of the whole network, including all the local states $s_{n}$ ($n\in\mathbb{N}$) and the other global state $s_{\text{o}}$ of the whole network, $\{a_{1},\cdots,a_{N}\}$ is referred to as the global action of the whole network, including all the actions $a_{n}$ ($\forall\ n\in\mathbb{N}$), and $\theta^{(\text{c})}$ and $\theta^{(\text{c-})}$ are the corresponding weight vectors.

Next, we detail the experience accumulation procedure followed by the MASC training method.

V-A1 Experience accumulation

At the beginning of time slot $t$, AP $n$ ($\forall\ n\in\mathbb{N}$) observes a local state $s_{n}$ and chooses the action (i.e., transmit power) as $a_{n}=\mu_{n}^{\text{(L)}}\left(s_{n};\theta_{n}^{\text{(L)}}\right)+\zeta$, where the action noise $\zeta$ is a random variable that guarantees continuous exploration of the action space around the output of the local DNN. By executing the action (i.e., transmit power) within this time slot, each AP obtains a reward (i.e., transmission rate) $r_{n}$ at the end of time slot $t$. At the beginning of the next time slot, AP $n$ ($\forall\ n\in\mathbb{N}$) observes a new local state $s^{\prime}_{n}$. To this end, AP $n$ ($\forall\ n\in\mathbb{N}$) obtains a local experience $e_{n}=\{s_{n},a_{n},r_{n},s_{n}^{\prime}\}$ and uploads it to the core network via a bi-directional backhaul link with $T_{d}$ time slots of delay. Upon receiving $e_{n}$ ($\forall\ n\in\mathbb{N}$), a global experience is constructed as $E=\{s_{1},\cdots,s_{N},s_{\text{o}},a_{1},\cdots,a_{N},R,s_{1}^{\prime},\cdots,s_{N}^{\prime},s_{\text{o}}^{\prime}\}$, where $R=\sum_{n=1}^{N}r_{n}$ is the global reward of the whole network, and each global experience is stored in the experience replay buffer, which has capacity $M$ and works in a FIFO fashion. By repeating this procedure, the experience replay buffer continuously accumulates new global experiences.
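
A minimal sketch of how the core network might assemble and store global experiences is given below; the buffer class, field names, and dictionary keys are illustrative assumptions.

```python
from collections import deque

class GlobalReplayBuffer:
    """FIFO buffer of global experiences E = {s_1..s_N, s_o, a_1..a_N, R, s'_1..s'_N, s'_o},
    assembled from the N uploaded local experiences (data layout is illustrative)."""
    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)   # oldest experience is dropped first

    def add(self, local_experiences, s_o, s_o_next):
        states = [e["s"] for e in local_experiences]
        actions = [e["a"] for e in local_experiences]
        R = sum(e["r"] for e in local_experiences)          # global reward: sum of local rates
        next_states = [e["s_next"] for e in local_experiences]
        self.buffer.append((states, s_o, actions, R, next_states, s_o_next))
```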

V-A2 MASC training method

To train the critic DNN, a mini-batch of experiences $\mathcal{E}$ is sampled from the experience replay buffer in each time slot. Then, the SGD method is adopted to minimize the expected prediction error (loss function) over the sampled experiences, i.e.,

\mathbb{L}(\theta^{(\text{c})})=\frac{1}{D}\sum_{\mathcal{E}}{{{[y^{\text{Tar}}-Q(s_{1},\cdots,s_{N},s_{\text{o}},a_{1},\cdots,a_{N};\theta^{(\text{c})})]}^{2}}}, \qquad (12)

where $y^{\text{Tar}}$ can be calculated by

{y^{\text{Tar}}}=\sum_{n=1}^{N}r_{n}+\eta\mathop{\max}\limits_{a_{n}^{\prime}\in\textbf{\emph{A}}}Q^{-}(s_{1}^{\prime},\cdots,s_{N}^{\prime},s_{\text{o}}^{\prime},a_{1}^{\prime},\cdots,a_{N}^{\prime};\theta^{(\text{c-})}). \qquad (13)

Then, a soft update method is adopted to update the weight vector $\theta^{(\text{c-})}$ of the target critic DNN, i.e.,

\theta^{(\text{c-})}\leftarrow\tau^{(\text{c})}\theta^{(\text{c})}+(1-\tau^{(\text{c})})\theta^{(\text{c-})}, \qquad (14)

where $\tau^{(\text{c})}\in[0,1]$ is the learning rate of the target critic DNN.
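
The sketch below shows one possible realization of the critic update in (12)-(14); following the DDPG convention, it approximates the maximizing next actions in (13) by the outputs of the target actor DNNs, and all model and argument names are assumptions rather than details given in the paper.

```python
import tensorflow as tf

def train_critic_step(critic, target_critic, target_actors,
                      global_state, actions, sum_reward,
                      next_global_state, next_local_states,
                      optimizer, eta=0.5, tau_c=0.001):
    """One update of the shared critic: minimize the loss (12) with target (13),
    then soft-update the target critic per (14)."""
    # Approximate the maximizing next actions by the target actors' outputs.
    next_actions = tf.concat(
        [mu_t(s_n) for mu_t, s_n in zip(target_actors, next_local_states)], axis=-1)
    y = sum_reward + eta * target_critic([next_global_state, next_actions])
    with tf.GradientTape() as tape:
        q = critic([global_state, actions])
        loss = tf.reduce_mean(tf.square(y - q))
    grads = tape.gradient(loss, critic.trainable_variables)
    optimizer.apply_gradients(zip(grads, critic.trainable_variables))
    # Soft update of the target critic, Eq. (14).
    for w, w_t in zip(critic.weights, target_critic.weights):
        w_t.assign(tau_c * w + (1.0 - tau_c) * w_t)
    return loss
```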

Since each AP aims to optimize the sum-rate of the whole network, the training of actor DNN $n$ ($\forall\ n\in\mathbb{N}$) can be designed to maximize the expected long-term global reward $J(\theta_{1}^{\text{(a)}},\cdots,\theta_{N}^{\text{(a)}})$, which is defined as the expectation of $Q\left(s_{1},\cdots,s_{N},s_{\text{o}},\mu_{1}^{\text{(a)}}(s_{1};\theta_{1}^{\text{(a)}}),\cdots,\mu_{N}^{\text{(a)}}(s_{N};\theta_{N}^{\text{(a)}});\theta^{\text{(c)}}\right)$ over the global state $\{s_{1},\cdots,s_{N},s_{\text{o}}\}$, i.e.,

J(\theta_{1}^{\text{(a)}},\cdots,\theta_{N}^{\text{(a)}})=\mathbb{E}_{s_{1},\cdots,s_{N},s_{\text{o}}}\left[Q\left(s_{1},\cdots,s_{N},s_{\text{o}},\mu_{1}^{\text{(a)}}(s_{1};\theta_{1}^{\text{(a)}}),\cdots,\mu_{N}^{\text{(a)}}(s_{N};\theta_{N}^{\text{(a)}});\theta^{\text{(c)}}\right)\right]. \qquad (15)

By taking the partial derivative of $J(\theta_{1}^{\text{(a)}},\cdots,\theta_{N}^{\text{(a)}})$ with respect to $\theta_{n}^{\text{(a)}}$, we have

\nabla_{\theta_{n}^{\text{(a)}}}J(\theta_{1}^{\text{(a)}},\cdots,\theta_{N}^{\text{(a)}})\approx\frac{1}{D}\sum_{\mathcal{E}}\nabla_{a_{n}}Q\left(s_{1},\cdots,s_{N},s_{\text{o}},a_{1},\cdots,a_{N};\theta^{\text{(c)}}\right)|_{a_{n}=\mu_{n}^{\text{(a)}}(s_{n};\theta_{n}^{\text{(a)}})}\nabla_{\theta_{n}^{\text{(a)}}}\mu_{n}^{\text{(a)}}(s_{n};\theta_{n}^{\text{(a)}}). \qquad (16)

Then, $\theta_{n}^{\text{(a)}}$ can be updated along $\nabla_{\theta_{n}^{\text{(a)}}}J(\theta_{1}^{\text{(a)}},\cdots,\theta_{N}^{\text{(a)}})$, which is the direction of steepest ascent of $J(\theta_{1}^{\text{(a)}},\cdots,\theta_{N}^{\text{(a)}})$.

Similar to the target critic DNN, a soft update method is adopted to adjust the weight vector $\theta_{n}^{(\text{a-})}$ of the target actor DNN, i.e.,

\theta_{n}^{(\text{a-})}\leftarrow\tau_{n}^{(\text{a})}\theta_{n}^{(\text{a})}+(1-\tau_{n}^{(\text{a})})\theta_{n}^{(\text{a-})}, \qquad (17)

where $\tau_{n}^{\text{(a)}}\in[0,1]$ is the learning rate of the corresponding target actor DNN.
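
The following sketch illustrates one way to realize the actor update in (16) and the soft update in (17) with automatic differentiation; replacing the $n$-th entry of the sampled global action by the actor output is an implementation choice assumed here, and all names are illustrative.

```python
import tensorflow as tf

def train_actor_n_step(n, actor_n, target_actor_n, critic, global_state,
                       local_state_n, batch_actions, optimizer, tau_a=0.001):
    """Ascent step for actor n along the sampled gradient in (16), followed by the
    soft update of the target actor in (17)."""
    with tf.GradientTape() as tape:
        a_n = actor_n(local_state_n)                               # mu_n(s_n), shape (D, 1)
        # Global action with the n-th column replaced by the actor output; the other
        # APs' actions are taken from the sampled mini-batch.
        actions = tf.concat(
            [batch_actions[:, :n], a_n, batch_actions[:, n + 1:]], axis=1)
        loss = -tf.reduce_mean(critic([global_state, actions]))   # ascend Q == descend -Q
    grads = tape.gradient(loss, actor_n.trainable_variables)
    optimizer.apply_gradients(zip(grads, actor_n.trainable_variables))
    # Soft update of the target actor, Eq. (17).
    for w, w_t in zip(actor_n.weights, target_actor_n.weights):
        w_t.assign(tau_a * w + (1.0 - tau_a) * w_t)
```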

V-B Algorithm designs

In this part, we first design the experience and the structure for each actor DNN and local DNN. Then, we design the global experience and the structure of the critic DNN. Finally, we elaborate the algorithm followed by some related discussions.

Figure 3: The structures of each actor DNN and the critic DNN, in which the arrows indicate the direction of the data flow. Each dotted box contains a certain number of hidden layers.

V-B1 Experience of actor DNNs

The local information at AP $n$ can be divided into historical local information from previous time slots and instantaneous local information at the beginning of the current time slot. The historical local information includes the channel gain between AP $n$ and UE $n$, the transmit power of AP $n$, the sum-interference from the APs $k$ ($k\in\mathbb{N},k\neq n$), the received SINR, and the corresponding transmission rate. The instantaneous local information includes the channel gain between AP $n$ and UE $n$, and the sum-interference from the APs $k$ ($k\in\mathbb{N},k\neq n$). It should be noted that the sum-interference from the APs $k$ ($k\in\mathbb{N},k\neq n$) at the beginning of the current time slot is generated as follows: at the beginning of the current time slot, the new transmit power of each AP has not yet been determined and each AP still uses the transmit power of the previous time slot, although the CSI of the whole network has changed. Thus, the state $s_{n}$ in time slot $t$ is designed as

s_{n}(t)=\left\{g_{n,n}(t-1),\ p_{n}(t-1),\ \sum_{k\in\mathbb{N},k\neq n}p_{k}(t-1)g_{k,n}(t-1),\ \gamma_{n}(t-1),\ r_{n}(t-1),\ g_{n,n}(t),\ \sum_{k\in\mathbb{N},k\neq n}p_{k}(t-1)g_{k,n}(t)\right\}. \qquad (18)

Besides, the action $a_{n}(t)$ and the reward $r_{n}(t)$ are designed as the transmit power $a_{n}(t)=p_{n}(t)$ and the corresponding achievable rate calculated by (4), respectively. Consequently, the experience $e_{n}(t)$ can be constructed as

e_{n}(t)=\{s_{n}(t-1),a_{n}(t-1),r_{n}(t-1),s_{n}(t)\}. \qquad (19)
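
As an illustration, the sketch below assembles the seven entries of (18); the indexing convention for the gain, power, SINR, and rate arrays is an assumption made for the example.

```python
def local_state(n, t, g, p, gamma, r):
    """Local state s_n(t) of Eq. (18). g[k][m][t] is the gain from AP k to UE m in
    slot t, and p[k][t] is the transmit power of AP k in slot t (indexing is illustrative)."""
    N = len(p)
    interf_prev = sum(p[k][t - 1] * g[k][n][t - 1] for k in range(N) if k != n)
    interf_now = sum(p[k][t - 1] * g[k][n][t] for k in range(N) if k != n)
    return [g[n][n][t - 1], p[n][t - 1], interf_prev,
            gamma[n][t - 1], r[n][t - 1], g[n][n][t], interf_now]
```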

V-B2 Structure of local/actor DNNs

The designed structure of the local/actor DNN includes five fully-connected layers, as illustrated in Fig. 3-(A). In particular, the first layer is the input layer for $s_{n}$ and has $L_{1}^{\text{(a)}}=7$ neurons corresponding to the seven elements of $s_{n}$. The second layer and the third layer have $L_{2}^{\text{(a)}}$ and $L_{3}^{\text{(a)}}$ neurons, respectively. The fourth layer has $L_{4}^{\text{(a)}}=1$ neuron with the sigmoid activation function, which outputs a value between zero and one. The fifth layer has $L_{5}^{\text{(a)}}=1$ neuron, which linearly scales the value from the fourth layer to a value between zero and $p_{n,\text{max}}$. With this structure, each local/actor DNN takes the local state as the input and outputs a transmit power satisfying the maximum transmit power constraint. In summary, there are $L_{2}^{\text{(a)}}+L_{3}^{\text{(a)}}+9$ neurons in each local/actor DNN.
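
Since the paper implements the DNNs in Keras, a possible Keras sketch of this five-layer local/actor structure is shown below; the layer sizes follow Tables I and II, while the builder function itself is illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_actor(p_max, hidden=100):
    """Local/actor DNN of Fig. 3-(A): 7 inputs, two hidden ReLU layers, one sigmoid
    neuron, and a linear scaling of the output to [0, p_max]."""
    s_in = layers.Input(shape=(7,))
    x = layers.Dense(hidden, activation="relu")(s_in)
    x = layers.Dense(hidden, activation="relu")(x)
    x = layers.Dense(1, activation="sigmoid")(x)
    # Fifth layer: linear scaling of the sigmoid output to the feasible power range.
    p_out = layers.Lambda(lambda z: p_max * z)(x)
    return tf.keras.Model(s_in, p_out)
```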

V-B3 Global experience

By considering the $T_{d}$ time slots of delay for the core network to obtain the local information of the APs, we design the global experience in time slot $t$ as

E(t)=\left\{s_{1}(t-1-T_{d}),\cdots,s_{N}(t-1-T_{d}),s_{\text{o}}(t-1-T_{d}),a_{1}(t-1-T_{d}),\cdots,a_{N}(t-1-T_{d}),R(t-1-T_{d}),s_{1}(t-T_{d}),\cdots,s_{N}(t-T_{d}),s_{\text{o}}(t-T_{d})\right\}. \qquad (20)

In particular, $s_{n}(t-1-T_{d})$, $s_{n}(t-T_{d})$, and $a_{n}(t-1-T_{d})$ ($\forall\ n\in\mathbb{N}$) can be directly obtained from $e_{n}(t-T_{d})$ ($\forall\ n\in\mathbb{N}$), and $R(t-1-T_{d})=\sum_{n\in\mathbb{N}}r_{n}(t-1-T_{d})$ can be directly calculated from the local rewards $r_{n}(t-1-T_{d})$ in $e_{n}(t-T_{d})$ ($\forall\ n\in\mathbb{N}$). Here, we construct $s_{\text{o}}(t-1-T_{d})$ and $s_{\text{o}}(t-T_{d})$ as $s_{\text{o}}(t-1-T_{d})=G(t-1-T_{d})$ and $s_{\text{o}}(t-T_{d})=G(t-T_{d})$, respectively, where $G(t-1-T_{d})$ and $G(t-T_{d})$ are the channel gain matrices of the whole network in time slot $t-1-T_{d}$ and time slot $t-T_{d}$, respectively. Since the channel gains $g_{n,n}(t-1-T_{d})$ and $g_{n,n}(t-T_{d})$ ($\forall\ n\in\mathbb{N}$) are available in $s_{n}(t-1-T_{d})$ and $s_{n}(t-T_{d})$, we focus on the derivation of the interference channel gains $g_{n,k}(t-1-T_{d})$ and $g_{n,k}(t-T_{d})$ ($\forall\ n\in\mathbb{N},\ k\in\mathbb{N},\ n\neq k$). Here, we take the derivation of $g_{n,k}(t-T_{d})$ ($\forall\ n\in\mathbb{N},\ k\in\mathbb{N},\ n\neq k$) as an example. At the beginning of time slot $t-T_{d}$, the APs transmit orthogonal pilots to the corresponding UEs for channel and interference estimation, and UE $n$ can locally measure $p_{k}(t-1-T_{d})g_{k,n}(t-T_{d})$ from AP $k$ ($\forall\ k\in\mathbb{N},\ k\neq n$). Then, UE $n$ can deliver the auxiliary information $o_{n}(t-T_{d})=\{p_{k}(t-1-T_{d})g_{k,n}(t-T_{d}),\forall\ k\in\mathbb{N},\ k\neq n\}$ together with the local state $s_{n}(t-T_{d})$ to the local DNN at the beginning of time slot $t-T_{d}$. Then, by collecting both the local experience $e_{n}(t-T_{d})$ and the auxiliary information $o_{n}(t-T_{d})$ from local DNN $n$ ($\forall\ n\in\mathbb{N}$), the core network can calculate each interference channel gain $g_{n,k}(t-T_{d})$ ($\forall\ n\in\mathbb{N},\ k\in\mathbb{N},\ n\neq k$) from $p_{n}(t-1-T_{d})g_{n,k}(t-T_{d})$ in $o_{k}(t-T_{d})$ and $p_{n}(t-1-T_{d})$ in $s_{n}(t-T_{d})$. In this way, the interference channel gains $g_{n,k}(t-T_{d})$ ($\forall\ n\in\mathbb{N},\ k\in\mathbb{N},\ n\neq k$) can be obtained to construct $G(t-T_{d})$. Following a similar procedure, the core network can construct the channel gain matrix of the whole network in each time slot, including $G(t-1-T_{d})$.
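
This reconstruction can be summarized by the small sketch below, where o[k] maps an AP index $n\neq k$ to the value $p_{n}(t-1-T_{d})g_{n,k}(t-T_{d})$ reported by UE $k$; the data layout is an assumption of the example.

```python
import numpy as np

def reconstruct_gain_matrix(o, s_power, g_direct):
    """Rebuild the N x N channel-gain matrix G(t - T_d): the off-diagonal entry
    G[n, k] is the power-weighted gain in o[k] divided by p_n(t-1-T_d) from s_n,
    and the diagonal entries g_{n,n} come from the local states."""
    N = len(s_power)
    G = np.diag(np.asarray(g_direct, dtype=float))
    for n in range(N):
        for k in range(N):
            if k != n:
                G[n, k] = o[k][n] / s_power[n]   # o[k][n] = p_n(t-1-T_d) * g_{n,k}(t-T_d)
    return G
```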

V-B4 Structure of the critic DNN

The designed structure of the critic DNN is illustrated in Fig. 3-(B) and includes three modules, i.e., a state module, an action module, and a mixed state-action module. Each module contains several fully-connected layers. The state module has three fully-connected layers. The first layer is the input layer for the global state $\{s_{1},\cdots,s_{N},s_{\text{o}}\}$. Since each $s_{n}$ ($n\in\mathbb{N}$) has seven elements and $s_{\text{o}}$ has $N^{2}$ elements, the first layer has $L_{1}^{\text{(S)}}=7N+N^{2}$ neurons. The second layer and the third layer of the state module have $L_{2}^{\text{(S)}}$ and $L_{3}^{\text{(S)}}$ neurons, respectively. The action module has two fully-connected layers. The first layer of the action module is the input layer for the global action $\{a_{1},\cdots,a_{N}\}$. Since each $a_{n}$ ($n\in\mathbb{N}$) is a one-dimensional scalar, the first layer of the action module has $L_{1}^{\text{(A)}}=N$ neurons. The second layer of the action module has $L_{2}^{\text{(A)}}$ neurons. The mixed state-action module has three fully-connected layers. The first layer of the mixed state-action module is formed by concatenating the last layers of the state module and the action module, and thus has $L_{1}^{\text{(M)}}=L_{3}^{\text{(S)}}+L_{2}^{\text{(A)}}$ neurons. The second layer of the mixed state-action module has $L_{2}^{\text{(M)}}$ neurons. The third layer of the mixed state-action module has one neuron, which outputs the long-term reward $Q^{\text{(c)}}(s_{1},\cdots,s_{N},s_{\text{o}},a_{1},\cdots,a_{N};\theta^{(\text{c})})$. In summary, there are $N^{2}+8N+1+L_{2}^{\text{(S)}}+L_{3}^{\text{(S)}}+L_{2}^{\text{(A)}}+L_{2}^{\text{(M)}}$ neurons in the critic DNN.
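
A possible Keras sketch of this three-module critic is given below, with layer sizes and activations taken from Table III; the builder function and argument names are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_critic(N, l_s=200, l_a=200, l_m=200):
    """Shared critic of Fig. 3-(B): a state module (input size 7N + N^2), an action
    module (input size N), and a mixed module that outputs the long-term reward."""
    s_in = layers.Input(shape=(7 * N + N * N,))
    a_in = layers.Input(shape=(N,))
    s = layers.Dense(l_s, activation="relu")(s_in)      # state module hidden layers
    s = layers.Dense(l_s, activation="linear")(s)
    a = layers.Dense(l_a, activation="linear")(a_in)    # action module hidden layer
    x = layers.Concatenate()([s, a])                    # mixed state-action module
    x = layers.Dense(l_m, activation="relu")(x)
    q_out = layers.Dense(1, activation="linear")(x)
    return tf.keras.Model([s_in, a_in], q_out)
```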

V-B5 Proposed algorithm

The proposed algorithm is illustrated in Algorithm 1, which includes three stages, i.e., Initializations, Random experience accumulation, and Repeat.

In the stage of Initializations, the $N$ local DNNs, $N$ actor DNNs, $N$ target actor DNNs, the critic DNN, and the target critic DNN need to be properly constructed and initialized. In particular, local DNN $\mu_{n}^{\text{(L)}}(s_{n};\theta_{n}^{\text{(L)}})$ ($\forall\ n\in\mathbb{N}$) is established at AP $n$ by adopting the structure in Fig. 3-(A); actor DNN $\mu_{n}^{\text{(a)}}(s_{n};\theta_{n}^{\text{(a)}})$ and the corresponding target actor DNN $\mu_{n}^{\text{(a-)}}(s_{n};\theta_{n}^{\text{(a-)}})$ are established in the core network and associated with local DNN $n$ by adopting the structure in Fig. 3-(A); meanwhile, critic DNN $Q^{\text{(c)}}(s_{1},\cdots,s_{N},s_{\text{o}},a_{1},\cdots,a_{N};\theta^{(\text{c})})$ and the corresponding target critic DNN $Q^{\text{(c-)}}(s_{1},\cdots,s_{N},s_{\text{o}},a_{1},\cdots,a_{N};\theta^{(\text{c-})})$ are established in the core network by adopting the structure in Fig. 3-(B). Then, $\theta_{n}^{\text{(L)}}$, $\theta_{n}^{\text{(a)}}$, and $\theta^{(\text{c})}$ are randomly initialized, and $\theta_{n}^{\text{(a-)}}$ and $\theta^{(\text{c-})}$ are initialized with $\theta_{n}^{\text{(a)}}$ and $\theta^{(\text{c})}$, respectively.

Figure 4: Diagram of the proposed algorithm.

In the stage of Random experience accumulation ($t\leq 2T_{d}+D+T_{u}$), UE $n$ ($\forall\ n\in\mathbb{N}$) observes the local state $s_{n}(t)$ and the auxiliary information $o_{n}(t)$, and transmits them to AP $n$ at the beginning of time slot $t$. AP $n$ chooses a random action (i.e., transmit power) and meanwhile uploads the local experience $e_{n}(t)$ and $o_{n}(t)$ to the core network through the bi-directional backhaul link with $T_{d}$ time slots of delay. After collecting all the local experiences and the corresponding auxiliary information from the $N$ APs, the core network constructs a global experience $E(t)$ and stores it into the memory replay buffer with capacity $M$. As illustrated in Fig. 4, at the beginning of time slot $T_{d}+D$, the core network has $D$ global experiences in the memory replay buffer. From time slot $T_{d}+D$, the core network begins to sample a mini-batch of experiences $\mathcal{E}$ with length $D$ from the memory replay buffer to train the critic DNN, the target critic DNN, the actor DNNs, and the target actor DNNs, i.e., update $\theta^{(\text{c})}$ to minimize (12), update $\theta^{(\text{c-})}$ with (14), update $\theta_{n}^{\text{(a)}}$ with (16), and update $\theta_{n}^{\text{(a-)}}$ with (17). From time slot $T_{d}+D$, in every $T_{u}$ time slots, the core network transmits the latest weight vector $\theta_{n}^{\text{(a)}}$ to AP $n$ through the bi-directional backhaul link with $T_{d}$ time slots of delay. AP $n$ receives the latest $\theta_{n}^{\text{(a)}}$ in time slot $2T_{d}+D+T_{u}$ and uses it to replace the weight vector $\theta_{n}^{\text{(L)}}$ of local DNN $n$.

In the stage of Repeat ($t>2T_{d}+D+T_{u}$), UE $n$ ($\forall\ n\in\mathbb{N}$) observes the local state $s_{n}(t)$ and the auxiliary information $o_{n}(t)$, and transmits them to AP $n$ at the beginning of time slot $t$. AP $n$ sets the transmit power to $p_{n}(t)=\mu_{n}^{\text{(L)}}(s_{n}(t);\theta_{n}^{\text{(L)}})+\zeta$ and meanwhile uploads the local experience $e_{n}(t)$ and $o_{n}(t)$ to the core network through the bi-directional backhaul link with $T_{d}$ time slots of delay. After collecting all the local experiences and the corresponding auxiliary information from the $N$ APs, the core network constructs a global experience $E(t)$ and stores it into the experience replay buffer. Then, a mini-batch of experiences is sampled from the experience replay buffer to train the critic DNN, the target critic DNN, the actor DNNs, and the target actor DNNs. In every $T_{u}$ time slots, AP $n$ receives the latest $\theta_{n}^{\text{(a)}}$ and uses it to replace the weight vector $\theta_{n}^{\text{(L)}}$.

V-B6 Discussions on the computational complexity

The computational complexity at each AP is dominated by the forward calculation of a DNN with $L_{2}^{\text{(a)}}+L_{3}^{\text{(a)}}+9$ neurons, and thus the computational complexity at each AP is around $\mathcal{O}(L_{2}^{\text{(a)}}+L_{3}^{\text{(a)}}+9)$. The computational complexity at the core network is dominated by the training of the critic DNN with $N^{2}+8N+1+L_{2}^{\text{(S)}}+L_{3}^{\text{(S)}}+L_{2}^{\text{(A)}}+L_{2}^{\text{(M)}}$ neurons and the $N$ actor DNNs. In particular, the critic DNN is trained first and then the $N$ actor DNNs can be trained simultaneously. Thus, the computational complexity of each training step is around $\mathcal{O}(L_{2}^{\text{(a)}}+L_{3}^{\text{(a)}}+10+N^{2}+8N+L_{2}^{\text{(S)}}+L_{3}^{\text{(S)}}+L_{2}^{\text{(A)}}+L_{2}^{\text{(M)}})$. In the simulations, with the designed DNNs, we show that the average time needed to calculate the transmit power at each AP is around $0.34$ ms, which is much less than those of the WMMSE algorithm and the FP algorithm; the average time needed to train the critic DNN is around $9.7$ ms, and the average time needed to train an actor DNN is around $5.9$ ms. It should be pointed out that the computational capability of the nodes in practical networks is much stronger than that of the computer we use in the simulations. Thus, the average time needed to train a DNN and to calculate the transmit power with a DNN can be further reduced in practical networks.

V-B7 Discussions on the overheads

Note that the proposed architecture involves two kinds of overhead: the local information of each AP to the core network, and the DNN weight vectors from the core network to the APs. The algorithm in [14] needs three kinds of overhead: the local information of each AP to the core network, the DNN weight vectors from the core network to the APs, and the local information exchanged among neighboring APs. This difference means that the proposed algorithm does not need cooperation among APs for the power control and is thus easier than the algorithm in [14] to implement in practical situations. Besides, as the number of APs increases, the number of actor DNNs increases and the resulting overhead may affect the scalability of the proposed architecture. Nevertheless, increasing the number of APs may also enhance the spectrum utilization efficiency. Therefore, the number of APs needs to be properly designed to balance the overhead and the spectrum utilization efficiency in practical situations.

V-B8 Discussions on the Implementations

It is worth noting that the proposed algorithm considers three engineering aspects to facilitate the implementation. First, the core network initially does not have any data for learning in practical situations. Hence, the proposed algorithm allows each AP to select its transmit power randomly to interact with the environment and accumulate useful data. Second, it typically takes some time to exchange information between the APs and the core network in the implementation. Hence, the proposed algorithm takes the corresponding transmission latency into consideration, and the simulation results demonstrate the robustness of the proposed algorithm to this latency. Third, it is demanding for the core network to transmit the weight vectors of the actor DNNs to the APs for updating the corresponding local DNNs in every time slot. Hence, the proposed algorithm allows the core network to transmit the weight vectors periodically, and the simulation results show that this design does not affect the long-term performance of the proposed algorithm.

Algorithm 1 Proposed DRL based multi-agent power control algorithm.
1:  Initialization:
2:  Adopt the structures in Fig. 3-(A) and Fig. 3-(B) to establish the DNNs, including $\mu_{n}^{\text{(L)}}(s_{n};\theta_{n}^{\text{(L)}})$, $\mu_{n}^{\text{(a)}}(s_{n};\theta_{n}^{\text{(a)}})$, $\mu_{n}^{\text{(a-)}}(s_{n};\theta_{n}^{\text{(a-)}})$, $Q^{\text{(c)}}(s_{1},\cdots,s_{N},s_{\text{o}},a_{1},\cdots,a_{N};\theta^{(\text{c})})$, and $Q^{\text{(c-)}}(s_{1},\cdots,s_{N},s_{\text{o}},a_{1},\cdots,a_{N};\theta^{(\text{c-})})$.
3:  Initialize $\theta_{n}^{\text{(L)}}$, $\theta_{n}^{\text{(a)}}$, and $\theta^{(\text{c})}$ randomly; initialize $\theta_{n}^{\text{(a-)}}$ and $\theta^{(\text{c-})}$ with $\theta_{n}^{\text{(a)}}$ and $\theta^{(\text{c})}$, respectively.
4:  Random experience accumulation:
5:  At the beginning of time slot $t$ ($t\leq T_{d}+D$), UE $n$ transmits $s_{n}(t)$ and $o_{n}(t)$ to AP $n$, and AP $n$ randomly chooses a transmit power.
6:  In time slot $t$ ($t\leq T_{d}+D$), AP $n$ uploads the local experience $e_{n}(t)$ and the auxiliary information $o_{n}(t)$ to the core network. Upon receiving the $N$ local experiences together with the $N$ pieces of auxiliary information, the core network constructs a global experience and stores it into the memory replay buffer.
7:  In time slot $t=T_{d}+D$, the core network has $D$ global experiences in the memory replay buffer. Then, a mini-batch of experiences is sampled to update $\theta^{(\text{c})}$ to minimize (12), update $\theta^{(\text{c-})}$ with (14), update $\theta_{n}^{\text{(a)}}$ with (16), and update $\theta_{n}^{\text{(a-)}}$ with (17).
8:  From time slot $t=T_{d}+D$, in every $T_{u}$ time slots, the core network transmits the latest $\theta_{n}^{\text{(a)}}$ to AP $n$. Upon receiving the latest $\theta_{n}^{\text{(a)}}$, AP $n$ uses it to replace the weight vector $\theta_{n}^{\text{(L)}}$.
9:  Repeat:
10:  At the beginning of time slot $t$ ($t>2T_{d}+D+T_{u}$), UE $n$ transmits $s_{n}(t)$ and $o_{n}(t)$ to AP $n$, and AP $n$ sets the transmit power to $p_{n}(t)=\mu_{n}^{\text{(L)}}(s_{n};\theta_{n}^{\text{(L)}})+\zeta$, where $\zeta$ is the action noise.
11:  In time slot $t$ ($t>2T_{d}+D+T_{u}$), AP $n$ uploads the local experience $e_{n}(t)$ and the auxiliary information $o_{n}(t)$ to the core network; a mini-batch of experiences is sampled to update $\theta^{(\text{c})}$ to minimize (12), update $\theta^{(\text{c-})}$ with (14), update $\theta_{n}^{\text{(a)}}$ ($\forall\ n\in\mathbb{N}$) with (16), and update $\theta_{n}^{\text{(a-)}}$ ($\forall\ n\in\mathbb{N}$) with (17).
12:  In every $T_{u}$ time slots, AP $n$ receives the latest $\theta_{n}^{\text{(a)}}$ and uses it to replace the weight vector $\theta_{n}^{\text{(L)}}$.

VI Simulation results

In this section, we provide simulation results to evaluate the performance of the proposed algorithm. For comparison, we have four benchmark algorithms, namely, the WMMSE algorithm, the FP algorithm, the Full power algorithm, and the Random power algorithm. In particular, the maximum transmit power is used to initialize the WMMSE algorithm and the FP algorithm, which stop iterating when the difference of the sum-rates per link between two successive iterations is smaller than $0.001$ or the number of iterations exceeds $500$. In the following, we first provide the settings of the simulation and then demonstrate the performance of the proposed algorithm as well as the four benchmark algorithms. We implement the proposed algorithm with the open-source software Keras, which is based on TensorFlow, on a computer with an Intel Core i5-8250U CPU and 16 GB of RAM.

VI-A Simulation settings

Table I: Hyperparameters of each local DNN.
Layers $L_{1}^{\text{(a)}}$ $L_{2}^{\text{(a)}}$ $L_{3}^{\text{(a)}}$ $L_{4}^{\text{(a)}}$ $L_{5}^{\text{(a)}}$
Neuron number 7 100 100 1 1
Activation function Linear Relu Relu Sigmoid Linear
Action noise $\zeta$ Normal distribution with zero mean and variance 2
Table II: Hyperparameters of each actor DNN.
Layers $L_{1}^{\text{(a)}}$ $L_{2}^{\text{(a)}}$ $L_{3}^{\text{(a)}}$ $L_{4}^{\text{(a)}}$ $L_{5}^{\text{(a)}}$
Neuron number 7 100 100 1 1
Activation function Linear Relu Relu Sigmoid Linear
Optimizer Adam optimizer with learning rate 0.0001
Mini-batch size $D$ 128
Learning rate $\tau_{n}^{\text{(a)}}$ $\tau_{n}^{\text{(a)}}=0.001$
Table III: Hyperparameters of the critic DNN.
Layers $L_{1}^{\text{(S)}}$ $L_{2}^{\text{(S)}}$ $L_{3}^{\text{(S)}}$ $L_{1}^{\text{(A)}}$ $L_{2}^{\text{(A)}}$ $L_{2}^{\text{(M)}}$ $L_{3}^{\text{(M)}}$
Neuron number $7N+N^{2}$ 200 200 $N$ 200 200 1
Activation function Linear Relu Linear Linear Linear Relu Linear
Optimizer Adam optimizer with learning rate 0.001
Mini-batch size $D$ 128
Learning rate $\tau^{\text{(c)}}$ $\tau^{\text{(c)}}=0.001$
Discount factor $\eta$ 0.5

To begin with, we provide the hyperparameters of the DNNs in Table I, Table II, and Table III, which are determined by cross-validation [13] [14]. Note that the adopted hyperparameters may not be optimal. Since the proposed algorithm with these hyperparameters performs well, we use them to demonstrate the achievable performance rather than the optimal performance of the proposed algorithm. Besides, we pre-process the local/global state information of the proposed algorithm to reduce its variance as follows: we first use the noise power to normalize each channel gain, and then use the mapping function $f(x)=10\log_{10}(1+x)$ to process the data related to the transmit power, channel gain, interference, and SINR.
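
This pre-processing step can be sketched as follows; the function names are illustrative, and the sketch assumes the inputs are already expressed in linear scale.

```python
import numpy as np

def f(x):
    """Compression mapping applied to power-, gain-, interference-, and SINR-related data."""
    return 10.0 * np.log10(1.0 + np.asarray(x))

def preprocess_gain(g, noise_power):
    """Normalize a channel gain by the noise power before applying f()."""
    return f(g / noise_power)
```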

In the simulations, we consider both a two-layer HetNet scenario and a three-layer HetNet scenario:

  • Two-layer HetNet scenario: In this scenario, there are five APs located at (0,0), (500,0), (0,500), (-500,0), and (0,-500) in meters. Each AP has a disc-shaped service coverage defined by a minimum distance ν_min and a maximum distance ν_max from the AP to the served UE. AP 1 is in the first layer and AP n (n ∈ {2,3,4,5}) is in the second layer. ν_min is set to 10 meters for all APs, ν_max of AP 1 in the first layer is 1000 meters, and ν_max of each AP in the second layer is 200 meters. The maximum transmit power of AP 1 in the first layer is 30 dBm, and the maximum transmit power of each AP in the second layer is 23 dBm. The UE served by each AP is randomly located within the service coverage of that AP.

  • Three-layer HetNet scenario: In this scenario, there are nine APs located at (0,0), (500,0), (0,500), (-500,0), (0,-500), (700,0), (0,700), (-700,0), and (0,-700) in meters. AP 1 is in the first layer, AP n (n ∈ {2,3,4,5}) is in the second layer, and AP n (n ∈ {6,7,8,9}) is in the third layer. ν_min is set to 10 meters for all APs, ν_max of AP 1 in the first layer is 1000 meters, ν_max of each AP in the second layer is 200 meters, and ν_max of each AP in the third layer is 100 meters. The maximum transmit power of AP 1 in the first layer is 30 dBm, the maximum transmit power of each AP in the second layer is 23 dBm, and the maximum transmit power of each AP in the third layer is 20 dBm. The UE served by each AP is randomly located within the service coverage of that AP, as in the placement sketch after this list.
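The sketch below illustrates how the UEs can be dropped in the two scenarios. The uniform-over-the-annulus sampling is an assumption, since the placement is only described as random within the coverage.

```python
import numpy as np

def drop_ue(ap_xy, nu_min, nu_max, rng=None):
    # Place one UE at random in the annulus [nu_min, nu_max] around its AP.
    # Sampling r^2 uniformly yields a spatially uniform drop over the annulus (assumed).
    rng = np.random.default_rng() if rng is None else rng
    r = np.sqrt(rng.uniform(nu_min ** 2, nu_max ** 2))
    phi = rng.uniform(0.0, 2.0 * np.pi)
    return ap_xy[0] + r * np.cos(phi), ap_xy[1] + r * np.sin(phi)

# Two-layer scenario: AP 1 at the origin (nu_max = 1000 m), APs 2-5 on the axes (nu_max = 200 m).
ap_positions = [(0, 0), (500, 0), (0, 500), (-500, 0), (0, -500)]
nu_max_list = [1000, 200, 200, 200, 200]
ue_positions = [drop_ue(ap, 10, r) for ap, r in zip(ap_positions, nu_max_list)]
```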

Furthermore, the transmission bandwidth is B = 10 MHz, the adopted path-loss model is 120.9 + 37.6 log_10(d) in dB, where d is the transmitter-receiver distance in kilometers [40], the log-normal shadowing standard deviation is 8 dB, the noise power σ^2 at each UE is -114 dBm, the delay of the data transmission between the core network and each AP is T_d = 50 time slots, the period to update the weight vector of each local DNN is T_u = 100 time slots, and the capacity of the memory replay buffer is M = 1000.
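Under these settings, the large-scale channel gain between a transmitter and a receiver can be generated as sketched below; the dBm-to-watt conversion and the combination of path loss and shadowing follow the standard definitions.

```python
import numpy as np

BANDWIDTH_HZ = 10e6                                # B = 10 MHz
NOISE_POWER_W = 10.0 ** ((-114.0 - 30.0) / 10.0)   # noise power of -114 dBm expressed in watts

def large_scale_gain(d_km, shadowing_std_db=8.0, rng=None):
    # Path loss 120.9 + 37.6*log10(d) dB (d in km) plus log-normal shadowing
    # with an 8 dB standard deviation, returned as a linear-scale channel gain.
    rng = np.random.default_rng() if rng is None else rng
    path_loss_db = 120.9 + 37.6 * np.log10(d_km)
    shadowing_db = rng.normal(0.0, shadowing_std_db)
    return 10.0 ** (-(path_loss_db + shadowing_db) / 10.0)
```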

Figure 5: Simulation model: (I) two-layer HetNet scenario; (II) three-layer HetNet scenario.

VI-B Performance comparison and analysis

In this part, we provide the performance comparison and analysis of the proposed algorithm and the four benchmark algorithms in the two simulation scenarios. In particular, the simulation of the proposed algorithm has two stages, i.e., a training stage and a testing stage. In the training stage, the DNNs are trained with the proposed Algorithm 1 in the first 5000 time slots. In the testing stage, the well-trained DNNs are used to optimize the transmit power at each AP in the following 2000 time slots. Each curve is the average of ten trials, in which the location of the UE served by each AP is randomly generated within the service coverage of that AP.
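The sketch below summarizes this evaluation protocol and the moving-average smoothing used for the curves; `run_one_trial` is a hypothetical placeholder for one training-plus-testing run of Algorithm 1 that returns the per-slot sum-rates.

```python
import numpy as np

TRAIN_SLOTS, TEST_SLOTS, NUM_TRIALS, WINDOW = 5000, 2000, 10, 200

def moving_average(x, window=WINDOW):
    # Each plotted value is the average of the previous `window` time slots.
    return np.convolve(np.asarray(x, dtype=float), np.ones(window) / window, mode='valid')

def evaluate(run_one_trial):
    # Average the per-slot sum-rates over NUM_TRIALS independent UE drops,
    # then smooth them for plotting.
    rates = [run_one_trial(TRAIN_SLOTS, TEST_SLOTS) for _ in range(NUM_TRIALS)]
    return moving_average(np.mean(rates, axis=0))
```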

Figure 6: Average sum-rate performance of the two-layer HetNet scenario in the training stage. The channel correlation factor is set to zero, i.e., IID channel, and each value is the moving average of the previous 200 time slots.

Fig. 6 provides the average sum-rate performance of the proposed algorithm and the four benchmark algorithms in the two-layer HetNet scenario in the training stage. The channel correlation factor is set to zero, i.e., IID channel. In the figure, the average sum-rate of the proposed algorithm equals that of the random-power algorithm at the beginning of the data transmissions, since the proposed algorithm has to choose the transmit power of each AP randomly to accumulate experiences. Then, the average sum-rate of the proposed algorithm increases rapidly and exceeds the sum-rates calculated by the WMMSE algorithm and the FP algorithm after around 500 time slots. Finally, the proposed algorithm converges after 1500 time slots. This can be explained as follows. On the one hand, both the WMMSE algorithm and the FP algorithm can only output sub-optimal solutions to the power allocation problem in each single time slot, meaning that the average sum-rate performance of both algorithms is also sub-optimal. On the other hand, the proposed algorithm continuously explores different power control strategies and accumulates global experiences. By learning from these global experiences, the critic DNN obtains a global view of the impacts of different power control strategies on the sum-rate. Then, the critic DNN can guide each actor DNN (or each local DNN) to update its weight vector towards the global optimum. Thus, it is reasonable that the proposed algorithm outperforms both the WMMSE algorithm and the FP algorithm in terms of the average sum-rate.

Figure 7: Sum-rate performance of the two-layer HetNet scenario in the testing stage. The channel correlation factor is set to zero, i.e., IID channel, and each value is the moving average of the previous 200 time slots.

Fig. 7 provides the corresponding sum-rate performance of the proposed algorithm in the two-layer HetNet scenario in the testing stage. From the figure, the effectiveness of the proposed algorithm in the two-layer HetNet is demonstrated. It should be noted that the sum-rate performance of the proposed algorithm in the testing stage is also higher than that in the training stage. This is because the proposed algorithm in the training stage needs to continuously explore the transmit power allocation policy and train the DNNs until convergence, and this exploration may degrade the sum-rate performance even after the algorithm has converged. On the contrary, by completely exploiting the well-trained DNNs to optimize the transmit power, the proposed algorithm can further enhance the sum-rate performance in the testing stage.

Figure 8: Sum-rate performance of the three-layer HetNet scenario in the training stage. The channel correlation factor is set to zero, i.e., IID channel, and each value is the moving average of the previous 200 time slots.
Figure 9: Sum-rate performance of the three-layer HetNet scenario in the testing stage. The channel correlation factor is set to zero, i.e., IID channel, and each value is the moving average of the previous 200 time slots.

Fig. 8 provides the average sum-rate performance of the proposed algorithm and the four benchmark algorithms in the three-layer HetNet scenario in the training stage. The channel correlation factor is set to zero, i.e., IID channel. In the figure, the average sum-rate of the proposed algorithm also increases rapidly, exceeds the average sum-rates calculated by the WMMSE algorithm and the FP algorithm after around 2500 time slots, and converges after 3000 time slots. This phenomenon can be explained in a similar way to that in Fig. 6. In addition, Fig. 9 provides the corresponding sum-rate performance of the proposed algorithm in the three-layer HetNet in the testing stage. From the figure, we observe a phenomenon similar to that in Fig. 7.

Figure 10: Sum-rate performance of the two-layer HetNet scenario with a random channel correlation factor ρ.

Fig. 10 provides the sum-rate performance of the proposed algorithm and the four benchmark algorithms in the two-layer HetNet scenario with a random channel correlation factor ρ. In Fig. 10-(a), the average sum-rate of the proposed algorithm in the training stage increases rapidly, exceeds the average sum-rates calculated by the WMMSE algorithm and the FP algorithm after around 500 time slots, and converges after 1000 time slots. Meanwhile, in Fig. 10-(b), the sum-rate of the proposed algorithm in the testing stage is generally higher than those of the benchmark algorithms. This demonstrates that the proposed algorithm outperforms the benchmark algorithms in terms of the sum-rate in the two-layer HetNet scenario even with a random ρ.

Figure 11: Sum-rate performance in the three-layer HetNet scenario with a random channel correlation factor ρ.

Fig. 11 provides the sum-rate performance of the proposed algorithm and the four benchmark algorithms in the three-layer HetNet scenario with a random channel correlation factor ρ. In Fig. 11-(a), the average sum-rate of the proposed algorithm in the training stage increases rapidly and converges to the average sum-rates calculated by the WMMSE algorithm and the FP algorithm after around 3000 time slots. Meanwhile, in Fig. 11-(b), the sum-rate of the proposed algorithm in the testing stage is almost the same as those of the WMMSE algorithm and the FP algorithm, even though the proposed algorithm relies only on local observations. Note that the performance advantage of the proposed algorithm in the three-layer HetNet scenario diminishes compared with that in the two-layer HetNet scenario. This is because the complexity of the network topology grows rapidly as the network size scales up, and it is generally more difficult to learn to maximize the sum-rate in the three-layer HetNet scenario than in the two-layer one.
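For reference, a first-order Gauss-Markov fading model is a common way to realize such a channel correlation factor (ρ = 0 reducing to the IID case). The form below is an assumption for illustration and may differ in detail from the small-scale fading model defined earlier in the paper.

```python
import numpy as np

def evolve_fading(h_prev, rho, rng=None):
    # h(t) = rho * h(t-1) + sqrt(1 - rho^2) * e(t), with e(t) ~ CN(0, 1).
    # rho = 0 gives IID Rayleigh fading across time slots; rho close to 1
    # gives slowly varying channels.
    rng = np.random.default_rng() if rng is None else rng
    e = (rng.normal(size=np.shape(h_prev)) + 1j * rng.normal(size=np.shape(h_prev))) / np.sqrt(2.0)
    return rho * np.asarray(h_prev) + np.sqrt(1.0 - rho ** 2) * e
```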

Table IV: Average time complexity.
Training the critic DNN: 9.7 ms
Training an actor DNN: 5.9 ms
Calculation with a local DNN: 0.34 ms
WMMSE: 120 ms
FP: 79 ms

Table IV shows the average time complexity of the different algorithms in the simulation. From the table, we observe that the average time needed to calculate a transmit power with a local DNN is much less than that of the WMMSE algorithm and the FP algorithm. Besides, the average time needed to train the critic DNN and an actor DNN is below ten milliseconds. In fact, the computational capability of the nodes in practical networks is much stronger than that of the computer used in the simulations. Thus, the average time needed to train a DNN and to calculate the transmit power with a DNN will be further reduced in practical networks.
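As an illustration, the per-slot power calculation time with a local DNN can be estimated with a simple wall-clock measurement such as the one sketched below; the reported numbers naturally depend on the hardware and software versions used.

```python
import time
import numpy as np

def average_inference_ms(model, obs_dim=7, num_runs=1000):
    # Rough estimate of the time to compute one transmit power with a local DNN.
    obs = np.random.rand(1, obs_dim).astype(np.float32)
    start = time.perf_counter()
    for _ in range(num_runs):
        model(obs, training=False)   # forward pass only, as in the testing stage
    return 1e3 * (time.perf_counter() - start) / num_runs
```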

VII Conclusions

In this paper, we exploited DRL to design a multi-agent power control algorithm in the HetNet. In particular, a deep neural network (DNN) was established at each AP, and a MASC method was developed to effectively train the DNNs. With the proposed algorithm, each AP can independently learn to optimize the transmit power and enhance the sum-rate with only local information. Simulation results demonstrated the superiority of the proposed algorithm over conventional power control algorithms, e.g., the WMMSE algorithm and the FP algorithm, in terms of both the average sum-rate and the computational complexity. In fact, the proposed algorithmic framework can also be applied to other resource management problems in which instantaneous global CSI is unavailable and cooperation among users is infeasible or costly.

References

  • [1] L. Zhang, Y.-C. Liang, and D. Niyato, “6G visions: Mobile ultra-broadband, super Internet-of-Things, and artificial intelligence,” China Communications, vol. 16, no. 8, pp. 1-14, Aug. 2019.
  • [2] J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. K. Soong, and J. C. Zhang, “What will 5G be?” IEEE J. Sel. Areas Commun., vol. 32, no. 6, pp. 1065-1082, Jun. 2014.
  • [3] L. Zhang, M. Xiao, G. Wu, M. Alam, Y.-C. Liang, and S. Li, “A survey of advanced techniques for spectrum sharing in 5G networks,” IEEE Wireless Commun., vol. 24, no. 5, pp. 44-51, Oct. 2017.
  • [4] M. Agiwal, A. Roy, and N. Saxena, “Next generation 5G wireless networks: A comprehensive survey,” IEEE Commun. Surv. Tutor., vol. 18, no. 3, pp. 1617-1655, Third quarter 2016.
  • [5] C. Yang, J. Li, M. Guizani, A. Anpalagan, and M. Elkashlan, “Advanced spectrum sharing in 5G cognitive heterogeneous networks,” IEEE Wireless Commun., vol. 23, no. 2, pp. 94-101, Apr. 2016.
  • [6] S. Singh and J. G. Andrews, “Joint resource partitioning and offloading in heterogeneous cellular networks,” IEEE Trans. Wireless Commun., vol. 13, no. 2, pp. 888-901, Feb. 2014.
  • [7] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, “An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel,” IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4331-4340, Sep. 2011.
  • [8] K. Shen and W. Yu, “Fractional programming for communication systems-Part I: Power control and beamforming,” IEEE Trans. Signal Process., vol. 66, no. 10, pp. 2616-2630, May 2018.
  • [9] R. Q. Hu and Y. Qian, “An energy efficient and spectrum efficient wireless heterogeneous network framework for 5G systems,” IEEE Commun. Mag., vol. 52, no. 5, pp. 94-101, May 2014.
  • [10] J. Huang, R. A. Berry, and M. L. Honig, “Distributed interference compensation for wireless networks,” IEEE J. Sel. Areas Commun., vol. 24, no. 5, pp. 1074-1084, May 2006.
  • [11] H. Zhang, L. Venturino, N. Prasad, P. Li, S. Rangarajan, and X. Wang, “Weighted sum-rate maximization in multi-cell networks via coordinated scheduling and discrete power control,” IEEE J. Sel. Areas Commun., vol. 29, no. 6, pp. 1214-1224, Jun. 2011.
  • [12] L. B. Le and E. Hossain, “Resource allocation for spectrum underlay in cognitive radio networks,” IEEE Trans. Wireless Commun., vol. 7, no. 12, pp. 5306-5315, Dec. 2008.
  • [13] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for interference management,” IEEE Trans. Signal Process., vol. 66, no. 20, pp. 5438-5453, Oct. 2018.
  • [14] Y. S. Nasir and D. Guo, “Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks,” IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2239-2250, Oct. 2019.
  • [15] L. Xiao, H. Zhang, Y. Xiao, X. Wan, S. Liu, L.-C. Wang, and H. V. Poor, “Reinforcement learning-based downlink interference control for ultra-dense small cells,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 423-434, Jan. 2020.
  • [16] R. Amiri, M. A. Almasi, J. G. Andrews, and H. Mehrpouyan, “Reinforcement learning for self organization and power control of two-tier heterogeneous networks,” IEEE Trans. Wireless Commun., vol. 18, no. 8, pp. 3933-3947, Aug. 2019.
  • [17] Y. Sun, G. Feng, S. Qin, Y.-C. Liang, and T.-S. P. Yum, “The SMART handoff policy for millimeter wave heterogeneous cellular networks,” IEEE Trans. Mobile Commun., vol. 17, no. 6, pp. 1456-1468, Jun. 2018.
  • [18] D. D. Nguyen, H. X. Nguyen, and L. B. White, “Reinforcement learning with network-assisted feedback for heterogeneous RAT selection,” IEEE Trans. Wireless Commun., vol. 16, no. 9, pp. 6062-6076, Sep. 2017.
  • [19] Y. Wei, F. R. Yu, M. Song, and Z. Han, “User scheduling and resource allocation in HetNets with hybrid energy supply: an actor-critic reinforcement learning approach,” IEEE Trans. Wireless Commun., vol. 17, no. 1, pp. 680-692, Jan. 2018.
  • [20] N. Morozs, T. Clarke, and D. Grace, “Heuristically accelerated reinforcement learning for dynamic secondary spectrum sharing,” IEEE Access, vol. 3, pp. 2771-2783, 2015.
  • [21] V. Raj, I. Dias, T. Tholeti, and S. Kalyani, “Spectrum access in cognitive radio using a two-stage reinforcement learning approach,” IEEE J. Sel. Topics Signal Process., vol. 12, no. 1, pp. 20-34, Feb. 2018.
  • [22] O. Iacoboaiea, B. Sayrac, S. B. Jemaa, and P. Bianchi, “SON coordination in heterogeneous networks: a reinforcement learning framework,” IEEE Trans. Wireless Commun., vol. 15, no. 9, pp. 5835-5847, Sep. 2016.
  • [23] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529-533, 2015.
  • [24] C. Zhang, P. Patras, and H. Haddadi, “Deep learning in mobile and wireless networking: A survey,” IEEE Commun. Surveys Tuts., vol. 21, no. 3, pp. 2224-2287, 3rd Quart., 2019.
  • [25] T. V. Chien, T. N. Canh, E. Bjornson, and E. G. Larsson, “Power control in cellular massive MIMO with varying user activity: A deep learning solution,” IEEE Trans. Wireless Commun., DOI: 10.1109/TWC.2020.2996368.
  • [26] R. Mennes, F. A. P. D. Figueiredo, and S. Latre, “Multi-agent deep learning for multi-channel access in slotted wireless networks,” IEEE Access, vol. 8, pp. 95032-95045, 2020.
  • [27] W. Cui, K. Shen, and W. Yu, “Spatial deep learning for wireless scheduling,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1248-1261, Jun. 2019.
  • [28] H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep reinforcement learning based resource allocation for V2V communications,” IEEE Trans. Veh. Technol., vol. 68, no. 4, pp. 3163-3173, Apr. 2019.
  • [29] Y. Yu, T. Wang, and S. C. Liew, “Deep-reinforcement learning multiple access for heterogeneous wireless networks,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1277-1290, Jun. 2019.
  • [30] Y. He, Z. Zhang, F. R. Yu, N. Zhao, H. Yin, V. C. M. Leung, and Y. Zhang, “Deep-reinforcement-learning-based optimization for cache-enabled opportunistic interference alignment wireless networks,” IEEE Trans. Veh. Technol., vol. 66, no. 11, pp. 10433-10445, Sep. 2017.
  • [31] L. Zhang, J. Tan, Y.-C. Liang, G. Feng, and D. Niyato, “Deep reinforcement learning-Based modulation and coding scheme selection in cognitive heterogeneous networks,” IEEE Trans. Wireless Commun., vol. 18, no. 6, pp. 3281-3294, Jun. 2019.
  • [32] F. B. Mismar, B. L. Evans, and A. Alkhateeb, “Deep reinforcement learning for 5G networks: Joint beamforming, power control, and interference coordination,” IEEE Trans. Commun., vol. 68, no. 3, pp. 1581-1592, Mar. 2020.
  • [33] H. Huang, Y. Yang, H. Wang, Z. Ding, H. Sari, and F. Adachi, “Deep reinforcement learning for UAV navigation through massive MIMO technique,” IEEE Trans. Veh. Technol., vol. 69, no. 1, pp. 1117-1121, Jan. 2020.
  • [34] H. Zhang, N. Yang, W. Huangfu, K. Long, and V. C. M. Leung, “Power control based on deep reinforcement learning for spectrum sharing,” IEEE Trans. Wireless Commun., vol. 19, no. 6, pp. 4209-4219, Jun. 2020.
  • [35] N. C. Luong et al., “Applications of deep reinforcement learning in communications and networking: A survey,” IEEE Commun. Surveys, vol. 21, no. 4, pp. 3133-3174, 2019.
  • [36] T. Kim, D. J. Love, B. Clerckx, “Does frequent low resolution feedback outperform infrequent high resolution feedback for multiple antenna beamforming systems?”, IEEE Trans. Signal Process., vol. 59, no. 4, pp. 1654-1669, Apr. 2011.
  • [37] Z.-Q. Luo and S. Zhang, “Dynamic spectrum management: Complexity and duality,” IEEE J. Sel. Topics Signal Process., vol. 2, no. 1, pp. 57-73, Feb. 2008.
  • [38] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proc. ICML, 2014.
  • [39] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in Proc. ICLR, 2016.
  • [40] Radio Frequency (RF) System Scenarios, document 3GPP TR 25.942, v.14.0.0, 2017.