
DRL-based Distributed Resource Allocation for Edge Computing in Cell-Free Massive MIMO Network

Fitsum Debebe Tilahun1, Ameha Tsegaye Abebe2, and Chung G. Kang1 1School of Electrical Engineering, Korea University, Seoul, Republic of Korea
2Samsung Research, Seoul, Republic of Korea
Email: 1{fitsum_debebe, ccgkang}@korea.ac.kr, [email protected]
Abstract

In this paper, with the aim of addressing the stringent computing and quality-of-service (QoS) requirements of recently introduced advanced multimedia services, we consider a cell-free massive MIMO-enabled mobile edge network. In particular, benefiting from the reliable cell-free links to offload intensive computation to the edge server, resource-constrained end-users can augment on-board (local) processing with edge computing. To this end, we formulate a joint communication and computing resource allocation (JCCRA) problem to minimize the total energy consumption of the users, while meeting the respective user-specific deadlines. To tackle the problem, we propose a fully distributed solution approach based on a cooperative multi-agent reinforcement learning framework, wherein each user is implemented as a learning agent that makes joint resource allocation decisions relying on local information only. The simulation results demonstrate that the proposed distributed approach outperforms the heuristic baselines and converges to a centralized target benchmark, without resorting to large overhead. Moreover, we show that the proposed algorithm performs significantly better in the cell-free system than in cellular MEC systems, e.g., a small-cell-based MEC system.

Index Terms:
cell-free massive MIMO, joint communication and computing resource allocation, deep reinforcement learning (DRL), multi-agent reinforcement learning, edge computing

I Introduction

Over the past few years, there has been a surge in computation-hungry and yet highly delay-intolerant multimedia applications. To support these applications on resource-limited end-user devices, it is essential to augment on-board (local) processing by offloading part of the intensive computation to more powerful computing platforms at the network edge. To this end, the ubiquitous radio and computing resources at the mobile edge network should be jointly optimized to improve system objectives such as execution delay and energy consumption. In this regard, a plethora of joint communication and computing resource allocation (JCCRA) schemes based on conventional optimization methods has been investigated. However, most of these approaches either assume accurate knowledge of network-wide information, which is difficult to obtain in practice, or are limited to quasi-static settings, e.g., deterministic task sizes and time-invariant channel conditions. Thus, their applicability to time-sensitive applications in a dynamic mobile edge network might be critically limited.

Driven by recent advances in machine learning and deep reinforcement learning (DRL), several learning-based schemes have been proposed to provide flexible joint resource allocation. For instance, a deep Q-network (DQN) is employed in [1, 2] for designing joint offloading and resource allocation decisions, while [3] presents a deep deterministic policy gradient (DDPG)-based scheme for heterogeneous networks with multiple users. However, these schemes rely on central entities to make joint resource allocation decisions, suffering from considerable signaling overhead and scalability issues. [4] investigates a distributed JCCRA problem in a single-cell mobile edge computing (MEC) system; however, the adopted computation model is not applicable to tasks with extreme reliability and ultra-low hard deadline requirements. We note that most of the effort on optimizing joint resource allocation in MEC networks focuses on cellular systems. However, as we move towards the next decade, current cellular MEC systems may not keep up with the diverse and more stringent requirements of envisaged advanced applications, such as holographic displays, multi-sensory extended reality, and others.

On the other hand, only a limited number of works consider the recently proposed cell-free massive MIMO architecture, envisioned for beyond-5G and 6G networks. Notably, the cell-free architecture can provide sufficiently fast and reliable access links without cell edges by serving a relatively small number of users from several geographically distributed access points (APs) [5]. It therefore opens up a new opportunity to support energy-efficient and consistently low-latency computational task offloading, in contrast to cellular MEC systems. [6] presents several performance analyses of a cell-free MEC system considering the coverage radius of the APs, while the issues of active user detection and channel estimation are discussed in [7]. In this paper, we consider a JCCRA problem that minimizes the total energy consumption of the users while meeting the user-specific application deadlines by jointly optimizing the allocation of local processor clock speed and uplink transmission power for each user. We then present a fully distributed solution approach based on cooperative multi-agent reinforcement learning, alleviating the signaling and communication overheads associated with centralized implementations. In sharp contrast to our work, neither [6] nor [7] deals with the JCCRA problem. To the best of our knowledge, this is the first attempt to solve the JCCRA problem in a fully distributed fashion for a cell-free massive MIMO-enabled mobile edge network. The fully distributed and intelligent JCCRA in our framework, combined with the reliable access links of the cell-free massive MIMO architecture, can be a promising means to handle the stringent requirements of newly introduced multimedia services.

The rest of the paper is organized as follows: In Section II, we discuss the system model and problem formulation for the JCCRA framework. Section III presents the proposed distributed JCCRA scheme based on cooperative multi-agent reinforcement learning (MARL) framework. Simulation results are presented in Section IV, followed by some concluding remarks in Section V.

II System Model and Problem Formulation

II-A User-centric Cell-free Massive MIMO: Overview

We consider a cell-free massive MIMO-enabled mobile edge network, depicted in Fig. 1, where a central processing unit (CPU) with an edge server of finite capacity $f^{CPU}$ (in cycles per second) is connected to $M$ single-antenna access points (APs) via ideal fronthaul links. Furthermore, the $M$ APs serve $K$ single-antenna users, such that $M\gg K$, in line with the widely known system model for cell-free massive MIMO [5]. Let ${\cal M}=\{1,2,\ldots,M\}$ and ${\cal K}=\{1,2,\ldots,K\}$ denote the sets of AP and user indices, respectively. Besides, each user $k\in{\cal K}$ is equipped with a limited local processor with maximum capability $f_k^{\max}$ (in cycles per second).

Let the channel between the $m$-th AP and the $k$-th user be given as $g_{mk}=\beta_{mk}^{1/2}h_{mk}$, where $\beta_{mk}$ is a large-scale channel gain coefficient that models the effects of path loss and shadowing, while $h_{mk}\sim{\cal CN}(0,1)$ represents small-scale channel fading. Let $\tau_c$ denote the channel coherence period in which $h_{mk}$ remains the same. The coherence period $\tau_c$ is divided into a pilot transmission interval of $\tau_p$ samples and an uplink data transmission interval of $(\tau_c-\tau_p)$ samples for offloading computation to the edge server at the CPU.

Figure 1: User-centric cell-free massive MIMO-enabled mobile edge network: Illustrative system model

At the beginning of each coherence interval, each user $k\in{\cal K}$ simultaneously transmits a pilot sequence $\boldsymbol{\psi}_k\in\mathbb{C}^{\tau_p\times 1}$, such that $\|\boldsymbol{\psi}_k\|^2=1$, which is used to estimate the respective channel $\hat{g}_{mk}$ at each AP $m\in{\cal M}$. Assuming pairwise orthogonal sequences $\{\boldsymbol{\psi}_1,\boldsymbol{\psi}_2,\ldots,\boldsymbol{\psi}_K\}$, i.e., $\tau_p=K$, we ignore the pilot contamination problem in the current model. Then, the received pilot vector at the $m$-th AP, $\mathbf{y}_m^p\in\mathbb{C}^{\tau_p\times 1}$, can be represented as

$$\mathbf{y}_m^p=\sqrt{\tau_p}\sum_{k=1}^{K}\sqrt{q_k^p}\,g_{mk}\boldsymbol{\psi}_k+\boldsymbol{\omega}_m^p, \quad (1)$$

where $q_k^p$ is the pilot transmit power, and $\boldsymbol{\omega}_m^p$ is a $\tau_p$-dimensional additive noise vector with independent and identically distributed entries, $\omega_m^p\sim{\cal CN}(0,\sigma_m^2)$. Based on the received vector $\mathbf{y}_m^p$, the least-squares (LS) channel estimate $\hat{g}_{mk}$ can be expressed as:

$$\hat{g}_{mk}=\frac{1}{\sqrt{\tau_p q_k^p}}\boldsymbol{\psi}_k^H\mathbf{y}_m^p. \quad (2)$$
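For concreteness, the following NumPy sketch generates channels, forms the received pilot matrix of (1), and recovers the LS estimates of (2). The pilot power, noise variance, and large-scale gains used here are placeholder values chosen for illustration, not parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 100, 10                     # number of APs and users (Section IV values)
tau_p = K                          # orthogonal pilots, tau_p = K
q_p = 0.1                          # pilot transmit power in W (assumed)
noise_var = 1e-13                  # receiver noise variance sigma_m^2 (assumed)

beta = rng.uniform(1e-10, 1e-8, size=(M, K))                 # large-scale gains (placeholder)
h = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
g = np.sqrt(beta) * h                                        # g_mk = beta_mk^{1/2} h_mk

psi = np.eye(tau_p)                                          # pairwise orthogonal unit-norm pilots

# Received pilot matrix at all APs, eq. (1); row m holds y_m^p
noise = np.sqrt(noise_var / 2) * (rng.standard_normal((M, tau_p))
                                  + 1j * rng.standard_normal((M, tau_p)))
Y_p = np.sqrt(tau_p * q_p) * g @ psi.T + noise

# LS channel estimates, eq. (2); entry (m, k) is g_hat_mk
g_hat = (Y_p @ psi.conj()) / np.sqrt(tau_p * q_p)
```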

The channel estimates are used to decode the uplink data transmitted by the users. After pilot transmission, the users transmit the offloaded data to the APs. Let $x_k$ denote the uplink data of user $k$, with transmit power $p_k$, which is determined as $p_k=\eta_k p_k^{\max}$, where $\eta_k$ and $p_k^{\max}$ represent the power control coefficient and the maximum uplink power of the $k$-th user, respectively. Then, the received signal $y_m^u$ at the $m$-th AP is represented as

$$y_m^u=\sum_{k=1}^{K}\sqrt{p_k}\,g_{mk}x_k+\omega_m. \quad (3)$$

It is important to ensure the scalability of the cell-free massive MIMO system in terms of the complexity of pilot detection and data processing. To this end, we limit the number of APs that serve the $k$-th user to $N_k<M$ and form a user-centric cluster of APs ${\cal C}_k\subset\{AP_1,AP_2,\cdots,AP_M\}$ by including the APs with the largest $\beta_{mk}$ in ${\cal C}_k$ until the maximum allowable cluster size is reached, i.e., $N_k=C^{\max}$. Thus, the data transmitted by user $k$ is only decoded by the APs in ${\cal C}_k$. Then, after performing maximum ratio combining (MRC) at each AP $m\in{\cal C}_k$, the quantity $\hat{g}_{mk}^{*}y_m^u$ is transmitted to the CPU via a fronthaul link. The received soft estimates are combined at the CPU to decode the data transmitted by user $k$ as follows:

$$\hat{x}_k=\sum_{m=1}^{N_k}\hat{g}_{mk}^{*}y_m^u. \quad (4)$$

Then, the uplink signal-to-interference-plus-noise ratio (SINR) $\gamma_k$ for user $k$ can be expressed as:

$$\gamma_k=\frac{p_k\left|\sum_{m\in{\cal C}_k}\hat{g}_{mk}^{*}g_{mk}\right|^2}{\sum_{k'\neq k}p_{k'}\left|\sum_{m\in{\cal C}_k}\hat{g}_{mk}^{*}g_{mk'}\right|^2+\sigma_m^2\sum_{m\in{\cal C}_k}\left|\hat{g}_{mk}\right|^2}. \quad (5)$$

Given the system bandwidth $W$, the uplink rate of user $k$ is given as $R_k=W\log_2(1+\gamma_k)$ by Shannon's theory.
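Continuing the NumPy sketch above, the snippet below forms the user-centric clusters from the largest $\beta_{mk}$, applies MRC combining at the CPU as in (4), and evaluates the SINR of (5) and the resulting uplink rates. The cluster size and powers are illustrative choices, not prescriptions from the paper.

```python
C_max = 30                                  # maximum cluster size N_k (illustrative, 0.3M)
W = 5e6                                     # system bandwidth in Hz
p = np.full(K, 0.1)                         # uplink powers p_k = eta_k * p_k^max (eta_k = 1 here)

# User-centric cluster C_k: the C_max APs with the largest beta_mk for user k
clusters = [np.argsort(beta[:, k])[-C_max:] for k in range(K)]

R = np.zeros(K)
for k in range(K):
    Ck = clusters[k]
    # sum_{m in C_k} g_hat_mk^* g_mk' for every user k' (MRC-combined channels)
    combined = g_hat[Ck, k].conj() @ g[Ck, :]
    signal = p[k] * np.abs(combined[k]) ** 2
    interference = np.sum(np.delete(p * np.abs(combined) ** 2, k))
    noise_term = noise_var * np.sum(np.abs(g_hat[Ck, k]) ** 2)
    gamma_k = signal / (interference + noise_term)            # SINR, eq. (5)
    R[k] = W * np.log2(1 + gamma_k)                           # uplink rate R_k in bit/s
```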

II-B Parallel Computation Model

We assume each user $k$ has a computationally intensive task of ${\cal T}_k(t)$ bits at the beginning of every discrete time step $t$, whose duration is $\Delta t=\tau_c$. The task sizes in different time steps are independent and uniformly distributed over $[{\cal T}_{\min},{\cal T}_{\max}]$ for every user $k\in{\cal K}$. Let $t_k^d$ denote the maximum tolerable delay of user $k$ to complete execution of the given task. Considering the delay constraint and energy consumption, the $k$-th user processes ${\cal T}_k^{local}(t)$ bits locally while offloading the remaining ${\cal T}_k^{offload}(t)$ bits to the edge server at the CPU.

II-B1 Local computation

Let $f_k^{local}(t)=\alpha_k(t)f_k^{\max}$, for $\alpha_k\in[0,1]$, denote the $k$-th user's processor speed (in cycles per second) allocated for local task execution. The entire task is offloaded to the edge server if $\alpha_k=0$, while $\alpha_k=1$ implies that the whole local processing capability is fully utilized and the remaining task bits are offloaded to the edge server. Accordingly, the locally computed task size at time step $t$ is expressed as ${\cal T}_k^{local}(t)=\min\left({\cal T}_k(t),\frac{t_k^d f_k^{local}(t)}{N_{cpb}}\right)$, where $N_{cpb}$ corresponds to the number of cycles required to process one bit of the task. Then, the time taken for local execution at time step $t$ is expressed as

$$t_k^{local}(t)=\min\left(\frac{{\cal T}_k^{local}(t)\,N_{cpb}}{f_k^{local}(t)},\,t_k^d\right). \quad (6)$$

The corresponding energy consumption is given by

$$E_k^{local}(t)=\varsigma\,{\cal T}_k^{local}(t)\,N_{cpb}\left(f_k^{local}(t)\right)^2, \quad (7)$$

where $\varsigma$ corresponds to the effective switched capacitance.
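As a sketch of the local computation model, the helper below splits a task according to $\alpha_k$ and returns the locally processed bits, the execution time of (6), and the energy of (7); the default parameter values follow Section IV.

```python
import numpy as np

def local_computation(T_k, alpha_k, f_max=1e9, t_d=1e-3, N_cpb=500, kappa=1e-27):
    """Local split of a T_k-bit task: returns (T_local, t_local, E_local) per (6)-(7).
    kappa plays the role of the effective switched capacitance."""
    f_local = alpha_k * f_max
    if f_local == 0.0:
        return 0.0, 0.0, 0.0                          # everything is offloaded
    T_local = min(T_k, t_d * f_local / N_cpb)         # bits that fit within the deadline
    t_local = min(T_local * N_cpb / f_local, t_d)     # eq. (6)
    E_local = kappa * T_local * N_cpb * f_local ** 2  # eq. (7)
    return T_local, t_local, E_local

# Example: a 5-kbit task with half of the local processor speed allocated
print(local_computation(T_k=5e3, alpha_k=0.5))
```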

II-B2 Computation Offloading

The remaining ${\cal T}_k^{offload}(t)=\max\left(0,\,{\cal T}_k(t)-{\cal T}_k^{local}(t)\right)$ bits are offloaded to the edge server at the CPU for parallel computation. While offloading computation to the edge server, the experienced latency can be broken down into the transmission delay, the computing delay at the edge server, and the transmission delay for retrieving the processed result. Assuming the retrieved data size after computation is much smaller than the offloaded data size, we ignore the retrieval delay. Then, the transmission delay to offload ${\cal T}_k^{offload}(t)$ data bits is expressed as $t_k^{tr}(t)=\frac{{\cal T}_k^{offload}(t)}{R_k(t)}$, where $R_k(t)$ is the uplink rate of user $k$ at time step $t$. The computing time at the edge server is given by $t_k^{comp}(t)=\frac{{\cal T}_k^{offload}(t)\,N_{cpb}}{f_k^{CPU}(t)}$, where $f_k^{CPU}(t)$ is the computation resource at the CPU allocated to user $k$, which is proportional to ${\cal T}_k^{offload}(t)$. The total offloading delay for edge computation is then given as

$$t_k^{offload}(t)=t_k^{comp}(t)+t_k^{tr}(t). \quad (8)$$

The corresponding energy consumption for offloading computation is then expressed as

$$E_k^{offload}(t)=p_k(t)\,t_k^{tr}(t), \quad (9)$$

where $p_k(t)=\eta_k(t)\,p_k^{\max}$, $\eta_k\in[0,1]$. Then, the overall latency $t_k(t)$ experienced by user $k$ to execute ${\cal T}_k(t)$ bits locally and at the edge is given as:

$$t_k(t)=\max\left(t_k^{local}(t),\,t_k^{offload}(t)\right). \quad (10)$$

Similarly, the energy consumption $E_k(t)$ of user $k$ at time step $t$ can be expressed as

$$E_k(t)=E_k^{local}(t)+E_k^{offload}(t). \quad (11)$$
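The companion helper below completes the parallel computation model: it computes the offloading delay (8), the offloading energy (9), the overall latency (10), and the total energy (11). The edge CPU share $f_k^{CPU}$ is passed in directly, assuming the CPU allocates it in proportion to the offloaded bits as stated above.

```python
def offload_and_total(T_k, T_local, t_local, E_local, eta_k, R_k, f_k_cpu,
                      p_max=0.1, N_cpb=500):
    """Returns (t_k, E_k) of eqs. (10)-(11) given the local-split results."""
    T_off = max(0.0, T_k - T_local)                   # bits sent to the edge server
    p_k = eta_k * p_max                               # uplink transmit power
    t_tr = T_off / R_k if T_off > 0 else 0.0          # transmission delay
    t_comp = T_off * N_cpb / f_k_cpu if T_off > 0 else 0.0
    t_off = t_comp + t_tr                             # eq. (8)
    E_off = p_k * t_tr                                # eq. (9)
    return max(t_local, t_off), E_local + E_off       # eqs. (10) and (11)
```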

II-C JCCRA Problem Formulation

According to the aforementioned communication and parallel computation models, we intend to jointly optimize the local processor speed $f_k^{local}(t)$ and uplink transmission power $p_k(t)$ for every user $k\in{\cal K}$ at each time step $t$, in order to minimize the total energy consumption of all users subject to the respective user-specific delay requirements. The JCCRA problem can mathematically be formulated to jointly determine $\left(\alpha_k(t),\eta_k(t)\right),\ \forall k\in{\cal K}$, as follows:

$$\begin{array}{ll}
\min\limits_{\{\alpha_k(t),\,\eta_k(t)\,|\,\forall k\}} & \sum_{k=1}^{K}E_k(t)\\
\text{subject to} & t_k(t)\leq t_k^d,\quad\forall k\\
& 0\leq\alpha_k(t)\leq 1,\quad\forall k\\
& 0\leq\eta_k(t)\leq 1,\quad\forall k.
\end{array} \quad (12)$$

Problem (12) is a stochastic optimization problem in which the objective function and the first constraint involve constantly changing random variables, i.e., task sizes and channel conditions. Consequently, the problem must be solved frequently, i.e., at every time step $t$, and the solution must converge rapidly within the ultra-low delay constraint. Therefore, it is challenging to solve the problem with iterative optimization algorithms of reasonable complexity. Furthermore, relying on one-shot optimization, it is difficult to guarantee steady long-term performance with conventional algorithms since they cannot adapt to changes in the mobile edge network.
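To make the per-step decision in (12) concrete, the sketch below performs an exhaustive grid search over $(\alpha_k,\eta_k)$ for a single user, using the two helpers sketched in Section II-B. It treats the uplink rate as a given function of $\eta_k$ and ignores the coupling between users through interference and the shared edge server, so it is only an illustrative baseline rather than a solution to the joint problem.

```python
import itertools
import numpy as np

def grid_search_jccra(T_k, rate_of_eta, f_k_cpu, t_d=1e-3, grid=21):
    """Pick the feasible (alpha_k, eta_k) pair with the lowest energy for one user.
    rate_of_eta(eta) -> uplink rate in bit/s is assumed to be supplied by the caller."""
    best_pair, best_energy = None, np.inf
    for alpha, eta in itertools.product(np.linspace(0.0, 1.0, grid), repeat=2):
        T_loc, t_loc, E_loc = local_computation(T_k, alpha, t_d=t_d)
        if eta == 0.0:                      # no uplink power: the whole task must fit locally
            if T_loc < T_k:
                continue
            t_tot, E_tot = t_loc, E_loc
        else:
            t_tot, E_tot = offload_and_total(T_k, T_loc, t_loc, E_loc,
                                             eta, rate_of_eta(eta), f_k_cpu)
        if t_tot <= t_d and E_tot < best_energy:
            best_pair, best_energy = (alpha, eta), E_tot
    return best_pair, best_energy
```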

III The Proposed Multi-Agent Reinforcement Learning-based Distributed JCCRA

To cope with the dynamics of the MEC system while supporting efficient and adaptive joint resource allocation for every user, we propose a fully distributed JCCRA scheme based on cooperative multi-agent reinforcement learning (MARL). We note that solving the JCCRA problem centrally at the CPU might allow for efficient management of the available radio and computing resources, owing to global processing of network-wide information. However, the associated signaling and communication overheads are prohibitively large, especially given the ultra-low delay constraints. Accordingly, we transform the JCCRA problem into a partially observable cooperative Markov game that consists of $K$ learning agents. One of the main challenges in the multi-agent domain is the non-stationarity of the environment, since the agents update their policies concurrently, which might lead to learning instability. To deal with this challenge, the agents are trained with the state-of-the-art multi-agent deep deterministic policy gradient (MADDPG) algorithm under the framework of centralized training and decentralized execution [8].

III-A DRL Formulation for JCCRA

Each user is implemented as a DRL agent that learns to map local observations of the environment to efficient JCCRA actions by sequentially interacting with the mobile edge network in discrete time steps. Let ${\cal O}_k$ and ${\cal S}$ denote the local observation space of agent $k\in{\cal K}$ and the complete environment state space, respectively. At the beginning of each step $t$, agent $k$ perceives a local observation of the environment state, $o_k(t):{\cal S}\mapsto{\cal O}_k$, and takes an action $a_k(t)\in{\cal A}_k$ from its action space ${\cal A}_k$ according to the current JCCRA policy $\mu_k$. The interaction with the environment produces the next observation of the agent, $o_k(t+1)\in{\cal O}_k$, and a real-valued scalar reward $r_k(t):{\cal S}\times{\cal A}_k\mapsto\mathbb{R}$ that guides the agent to constantly improve its policy until it converges. The goal of the agent is, therefore, to derive an optimal JCCRA policy $\mu_k^{*}$ that maximizes the expected long-term discounted cumulative reward, defined as $J_k(\mu_k)=\mathbb{E}\left[\sum_{t=1}^{T}\varepsilon^{t-1}r_k(t)\right]$, where $\varepsilon\in[0,1]$ is a discount factor and $T$ is the total number of time steps (horizon). In the sequel, we define the local observation, action, and reward of agent $k$ for JCCRA at a given time $t$.

III-A1 Local observation

At the beginning of each time step $t$, the local observation of agent $k$ is defined as $o_k(t)\triangleq\left[{\cal T}_k(t),t_k^d,R_k(t-1)\right]$, where ${\cal T}_k(t)$, $t_k^d$, and $R_k(t-1)$ are the incoming task size, the user-specific application deadline, and the uplink rate from the previous time step, respectively. Note that the local observation is only a partial view of the environment, i.e., it does not include information about the other agents.

III-A2 Action

Based on the local observation of the environment, the agent takes an action $a_k(t)\triangleq\left[\alpha_k(t),\eta_k(t)\right]$, i.e., the local processor speed $f_k^{local}(t)=\alpha_k(t)f_k^{\max}$ and the uplink transmit power $p_k(t)=\eta_k(t)\,p_k^{\max}$.

III-A3 Reward

To reflect the design objective of the JCCRA problem and enforce a cooperative behavior among the agents, we define joint reward as

$$r_k(t)=-\sum_{k=1}^{K}\xi_k E_k(t),\quad\forall k\in{\cal K}, \quad (13)$$

where $\xi_k=1$ if $t_k(t)\leq t_k^d$; otherwise, $\xi_k=10$ to punish potentially selfish behavior of the agents that leads to failure in meeting the delay constraint.
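The DRL formulation above maps directly onto a small per-agent interface, sketched below under the stated definitions: building the local observation, translating the bounded action into physical resources, and computing the shared reward of (13).

```python
import numpy as np

def local_observation(T_k, t_d, R_prev):
    """o_k(t) = [incoming task size, deadline, previous uplink rate]."""
    return np.array([T_k, t_d, R_prev], dtype=np.float32)

def apply_action(action, f_max=1e9, p_max=0.1):
    """Map a_k(t) = [alpha_k, eta_k] in [0,1]^2 to the local CPU speed and uplink power."""
    alpha_k, eta_k = np.clip(action, 0.0, 1.0)
    return alpha_k * f_max, eta_k * p_max

def joint_reward(E, t, t_d):
    """Shared reward of eq. (13); E, t, t_d are length-K arrays of energies,
    latencies, and deadlines, and xi_k = 10 punishes deadline violations."""
    xi = np.where(t <= t_d, 1.0, 10.0)
    return -float(np.sum(xi * E))
```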

III-B MADDPG Algorithm for Distributed JCCRA

During the offline training phase, the agents can be trained in a central unit, which presents an opportunity to share extra information among the agents to ease and accelerate the training process. Specifically, agent $k$ now has access to the joint action $a(t)=\left(a_1(t),\ldots,a_K(t)\right)$ and the full observation of the environment state $s_k(t)=\left(o_k(t),o_{-k}(t)\right)$, where $o_{-k}(t)$ denotes the local observations of the other agents at time step $t$. The extra information endowed to the agents, however, is discarded during the execution phase, meaning the agents rely entirely on their local observations to make JCCRA decisions in a fully distributed manner.

As an extension of the single-agent deep deterministic policy gradient (DDPG) algorithm [9], MADDPG is an actor-critic policy gradient algorithm. Specifically, MADDPG agent $k$ employs two main neural networks: an actor network with parameters $\theta_k^{\mu}$ to approximate a JCCRA policy $\mu_k(o_k|\theta_k^{\mu})$, and a critic network, parametrized by $\theta_k^{Q}$, to approximate an action-value function $Q_k(s_k,a|\theta_k^{Q})$, along with their respective time-delayed copies, $\theta_k^{\mu'}$ and $\theta_k^{Q'}$, which serve as targets. At time step $t$, the actor deterministically maps the local observation $o_k(t)$ to a specific continuous action $\mu_k(o_k(t)|\theta_k^{\mu})$, and then a random noise process ${\cal N}_k$ is added to generate an exploratory policy such that $a_k(t)=\mu_k(o_k(t)|\theta_k^{\mu})+{\cal N}_k(t)$. The environment, which is shared among the agents, collects the joint action $a(t)=\{a_k(t),\forall k\in{\cal K}\}$ and returns the immediate reward $r_k(t)$ and the next observation $o_k(t+1)$ to the respective agents. To make efficient use of the experience in later decision-making steps and to improve the stability of the training, the agent's transition along with the extra information, $e_k(t)=\left(s_k(t),a(t),r_k(t),s_k(t+1)\right)$, is saved in the replay buffer ${\cal D}_k$ of the agent.

To train the main networks, a mini-batch of $B$ samples, $\left.\left(s_k^i,a^i,r_k^i,s_k^{i+1}\right)\right|_{i=1}^{B}$, is randomly drawn from the replay buffer ${\cal D}_k$, where $i$ denotes the sample index. The critic network is updated to minimize the following loss function:

$${\cal L}_k\left(\theta_k^Q\right)=\frac{1}{B}\sum_{i}\left(y_k^i-Q_k\left(s_k^i,a^i|\theta_k^Q\right)\right)^2, \quad (14)$$

where $y_k^i$ is the target value expressed as $y_k^i=r_k^i+\varepsilon\,Q'_k\left(s_k^{i+1},a^{i+1}|\theta_k^{Q'}\right)\big|_{a^{i+1}=\{\mu'_k(o_k^{i+1}),\,\forall k\in{\cal K}\}}$. On the other hand, the parameters of the actor network are updated along the deterministic policy gradient given as

$$\nabla_{\theta_k^{\mu}}J\left(\mu_k|\theta_k^{\mu}\right)\approx\frac{1}{B}\left[\sum_{i}\nabla_{a_k}Q_k\left(s_k^i,a^i|\theta_k^Q\right)\times\nabla_{\theta_k^{\mu}}\mu_k\left(o_k^i|\theta_k^{\mu}\right)\right]\Big|_{a^i=\{\mu_k(o_k^i),\,\forall k\in{\cal K}\}}. \quad (15)$$

The target parameters of both the actor and critic networks are softly updated as $\theta_k^{\mu'}\leftarrow\tau\theta_k^{\mu}+(1-\tau)\theta_k^{\mu'}$ and $\theta_k^{Q'}\leftarrow\tau\theta_k^{Q}+(1-\tau)\theta_k^{Q'}$, respectively, where $\tau$ is a constant close to zero.
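The update rules (14)-(15) and the soft target update can be condensed into the PyTorch sketch below for a single agent $k$. It assumes that the critics take the concatenated full observation and joint action, that the replay batch is already organized per agent, and that exploration noise is handled outside this function; it is an illustrative sketch of the training step under these assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def maddpg_update(k, batch, actors, critics, target_actors, target_critics,
                  actor_opts, critic_opts, eps=0.99, tau=0.005):
    """One training iteration for agent k. 'batch' = (obs, acts, rew_k, next_obs), where
    obs/next_obs are lists of [B, obs_dim] tensors per agent and acts has shape [B, K*act_dim]."""
    obs, acts, rew_k, next_obs = batch
    K = len(actors)

    # Target value y_k of eq. (14), using the target actors of all agents
    with torch.no_grad():
        next_acts = torch.cat([target_actors[j](next_obs[j]) for j in range(K)], dim=-1)
        y_k = rew_k + eps * target_critics[k](torch.cat(next_obs, dim=-1), next_acts)

    state = torch.cat(obs, dim=-1)

    # Critic update: minimize the mean-squared TD error of eq. (14)
    critic_loss = F.mse_loss(critics[k](state, acts), y_k)
    critic_opts[k].zero_grad(); critic_loss.backward(); critic_opts[k].step()

    # Actor update: deterministic policy gradient of eq. (15),
    # differentiating only through agent k's own action
    act_chunks = [a.detach() for a in torch.split(acts, acts.shape[-1] // K, dim=-1)]
    act_chunks[k] = actors[k](obs[k])
    actor_loss = -critics[k](state, torch.cat(act_chunks, dim=-1)).mean()
    actor_opts[k].zero_grad(); actor_loss.backward(); actor_opts[k].step()

    # Soft update of the target networks with rate tau
    for net, tgt in ((actors[k], target_actors[k]), (critics[k], target_critics[k])):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```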

It is worth mentioning that the proposed JCCRA scheme is trained offline. Once it converges, real-time decision making can be conducted in a fully distributed fashion by simply performing inference with the trained actor network of each agent, relying on the respective local observation only.

IV Performance Analysis

We consider $M=100$ APs and $K=10$ users uniformly distributed at random over an area of 1 km$^2$. The APs are connected to the CPU via ideal fronthaul links. We assume only 30% of the APs are clustered to serve each user, i.e., $N_k=0.3M,\ \forall k\in{\cal K}$. The system bandwidth is set to 5 MHz, and all active users share this bandwidth without channelization. Denoting the standard deviation of the shadow fading by $\sigma_{sh}$, the large-scale channel gain is given as $\beta_{mk}=10^{\frac{PL_{mk}}{10}}10^{\frac{\sigma_{sh}z_{mk}}{10}}$, where $z_{mk}\sim{\cal N}(0,1)$. The path loss $PL_{mk}$ is given according to the three-slope model [10] as follows:

$$PL_{mk}=\begin{cases}-L-35\log_{10}(d_{mk}) & \text{if } d_{mk}>d_1\\ -L-10\log_{10}\left(d_{mk}^2\,d_1^{1.5}\right) & \text{if } d_0<d_{mk}\leq d_1\\ -L-10\log_{10}\left(d_0^2\,d_1^{1.5}\right) & \text{if } d_{mk}\leq d_0,\end{cases} \quad (16)$$

where $d_{mk}$ is the distance between user $k$ and AP $m$, and $L$ is

$$L=46.3+33.9\log_{10}(f)-13.82\log_{10}(h_{AP})-\left(1.1\log_{10}(f)-0.7\right)h_u+1.56\log_{10}(f)-0.8. \quad (17)$$

Here, $f$ is the carrier frequency (in MHz), $h_{AP}$ is the AP antenna height (in m), and $h_u$ denotes the user antenna height (in m). Further, no shadowing is considered unless $d_{mk}>d_1$. The edge server at the CPU has a computation capacity of $f^{CPU}=100$ GHz, while the users are equipped with local processors with $f_k^{\max}=1$ GHz. The length of a time step $\Delta t$ is set to 1 ms, and each user generates a task of random size, uniformly distributed in the range of 2.5 to 7.5 kbits, at the beginning of every time step $t$. Furthermore, additional simulation parameters are set to $N_{cpb}=500$, $\varsigma=10^{-27}$, $p_k^{\max}=0.1$ W, $\sigma_{sh}=10$ dB, and $t_k^d=\Delta t=\tau_c=1$ ms.
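For reference, the large-scale gain model of (16)-(17), including the rule that shadowing is applied only beyond $d_1$, can be sketched as below. The carrier frequency, antenna heights, and breakpoints $d_0$, $d_1$ are not specified above; the defaults here are typical values from the cell-free massive MIMO literature [5] and should be treated as assumptions.

```python
import numpy as np

def large_scale_gain(d_km, f=1900.0, h_ap=15.0, h_u=1.65,
                     d0=0.01, d1=0.05, sigma_sh=10.0, rng=None):
    """beta_mk from the three-slope path loss (16)-(17); d_km, d0, d1 in km, f in MHz."""
    L = (46.3 + 33.9 * np.log10(f) - 13.82 * np.log10(h_ap)
         - (1.1 * np.log10(f) - 0.7) * h_u + 1.56 * np.log10(f) - 0.8)
    if d_km > d1:
        pl = -L - 35.0 * np.log10(d_km)
        z = rng.standard_normal() if rng is not None else 0.0
        shadow_db = sigma_sh * z                      # shadowing only when d_mk > d1
    elif d_km > d0:
        pl = -L - 10.0 * np.log10(d_km ** 2 * d1 ** 1.5)
        shadow_db = 0.0
    else:
        pl = -L - 10.0 * np.log10(d0 ** 2 * d1 ** 1.5)
        shadow_db = 0.0
    return 10.0 ** (pl / 10.0) * 10.0 ** (shadow_db / 10.0)
```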

For the proposed MADDPG-based JCCRA implementation, the actor and critic networks of each agent are implemented with fully connected neural networks consisting of 128, 64, and 64 nodes. All hidden layers are activated by the ReLU function, while the output of the actor is activated by a sigmoid function. The parameters of the critic and actor networks are updated with the adaptive moment estimation (Adam) optimizer with learning rates of 0.001 and 0.0001, respectively. Further, we set the discount factor $\varepsilon=0.99$, the target update parameter $\tau=0.005$, and the mini-batch size $B=128$. In our simulations, the agents are trained for several episodes, each consisting of 100 steps.
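Under the reported layer sizes and activations, the actor and critic of each agent can be sketched in PyTorch as follows. The observation and action dimensions (3 and 2) follow Section III-A, while the centralized critic input (concatenated observations plus joint action) is an assumption consistent with the MADDPG training described above.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the local observation o_k to [alpha_k, eta_k] in [0,1]^2."""
    def __init__(self, obs_dim=3, act_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, act_dim), nn.Sigmoid(),   # bounded JCCRA action
        )

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """Centralized critic Q_k(s_k, a) over the full state and the joint action."""
    def __init__(self, state_dim, joint_act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_act_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, state, joint_action):
        return self.net(torch.cat([state, joint_action], dim=-1))

# Example instantiation for K = 10 agents
actor_k = Actor()
critic_k = Critic(state_dim=10 * 3, joint_act_dim=10 * 2)
opt_actor = torch.optim.Adam(actor_k.parameters(), lr=1e-4)
opt_critic = torch.optim.Adam(critic_k.parameters(), lr=1e-3)
```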

To evaluate the performance of the proposed MADDPG-based JCCRA, we compare it with three benchmarks:

IV-1 Centralized DDPG-based JCCRA scheme

This approach refers to a DDPG-based centralized resource allocation scheme implemented at the CPU. We adopt the same neural network structure and other hyperparameters as the MADDPG-based scheme to train the actor and critic in the single DDPG agent. Since global information of the entire network is processed centrally at the CPU to make JCCRA decisions for all users, we can potentially obtain the most efficient resource allocation. Consequently, this baseline serves as a target performance benchmark. However, as it involves significant signaling and communication overheads, this scheme might be infeasible to support time-sensitive applications.

Figure 2: Average reward with training process
Figure 3: Average success rate with training process

IV-2 Offloading-first with an uplink fractional power control (FPC) scheme

This approach preferentially offloads the computation to the edge server, aiming to aggressively exploit the reliable access links provided by cell-free massive MIMO while saving local processing energy. The uplink transmit power of the $k$-th user is determined by the standard fractional power control (FPC) [11] as follows:

$$p_k=\min\left(p_k^{\max},\,p_0\lambda_k^{-\nu}\right), \quad (18)$$

where $p_0=-35$ dBm, $\lambda_k=\sum_{m=1}^{N_k}\beta_{mk}$, and $\nu=0.5$.
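A small sketch of the FPC rule in (18), using the stated $p_0$, $\nu$, and $p_k^{\max}$:

```python
import numpy as np

def fpc_power(beta_k, cluster, p_max=0.1, p0_dbm=-35.0, nu=0.5):
    """Uplink power of eq. (18): beta_k holds the large-scale gains of this user
    towards all M APs, and 'cluster' is the index set C_k of its serving APs."""
    p0 = 10.0 ** (p0_dbm / 10.0) / 1000.0     # -35 dBm converted to watts
    lam = np.sum(beta_k[cluster])             # lambda_k = sum_{m in C_k} beta_mk
    return min(p_max, p0 * lam ** (-nu))
```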

IV-3 Local execution-first with an uplink fractional power control (FPC) scheme

The entire local processing capability is fully utilized, i.e., $f_k^{local}=f_k^{\max}$, and the remaining task bits are offloaded to the edge server with the uplink transmit power given according to (18).

In Fig. 2, we investigate the convergence and performance of the proposed MADDPG-based JCCRA scheme by evaluating the total reward accumulated during the training process. It can be observed that the proposed distributed scheme converges to the target performance benchmark, i.e., the centralized DDPG-based JCCRA, while relieving the large overhead of the latter. This implies that, by relying on local observations and the additional information provided during the offline centralized training phase, the agents manage to learn efficient joint resource allocation. Moreover, the proposed scheme significantly outperforms the heuristic baselines in terms of the total accumulated reward.

In Fig. 3, we compare the average success rate of the users in meeting the delay constraint of the time-sensitive applications. One can observe that the proposed MADDPG-based JCCRA achieves more than 99% success rate in accomplishing the tasks within the deadline as the training episodes progress. Even though the local execution-first policy achieves a 100% success rate, this performance comes at the cost of high energy consumption, as reflected in Fig. 2. Moreover, the fact that the policy of the learning agents outperforms the offloading-first baseline implies that, even if cell-free massive MIMO can provide reliable access links to the users, aggressive computation offloading can result in performance degradation due to ineffective use of the limited resources.

Figure 4: Performance comparison of the proposed algorithm over cell-free and cellular MEC architectures: (a) average success rate; (b) reward

We then implemented the proposed MADDPG-based JCCRA scheme in two different cellular MEC architectures, namely a small-cell system and a single-cell system with co-located massive MIMO, to further investigate the gain obtained from the cell-free architecture. For the small-cell MEC system, each user is served by the single AP with the largest large-scale fading coefficient, following the methodology in [5], whereas in the co-located massive MIMO case, each user connects to a base station with a total of $N_k$ antennas. From Figs. 4(a) and 4(b), we can observe that the proposed algorithm over the cell-free system significantly outperforms the cellular MEC systems in terms of average success rate and total accumulated reward. This is attributed to the channel quality degradation caused by co-channel interference from nearby cells in the small-cell case, while in the single-cell MEC system, users are subjected to large interference from all co-located antennas under MRC, limiting the uplink rate for computation offloading.

V Conclusion and Future Works

In this paper, we presented a novel distributed joint communication and computing resource allocation (JCCRA) scheme based on a cooperative multi-agent reinforcement learning framework for a cell-free massive MIMO-enabled mobile edge network. More specifically, each trained agent can make intelligent and adaptive joint resource allocation decisions in real time, based on local observations only, in order to minimize the total energy consumption while meeting the respective delay constraints in the dynamic MEC system. The performance of the proposed distributed JCCRA scheme is shown to converge to the centralized target baseline, without resorting to large overhead. Finally, we have shown that the proposed algorithm in the cell-free system outperforms two cellular MEC systems by wide margins. In the future, it would be interesting to extend the current formulation to determine a set of serving APs that adapts to dynamic user mobility.

Acknowledgment

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.2020R1A2C100998413).

References

  • [1] J. Li, H. Gao, T. Lv, and Y. Lu, “Deep reinforcement learning based computation offloading and resource allocation for MEC,” 2018 IEEE Wireless Communications and Networking Conference (WCNC), Barcelona, 2018.   
  • [2] L. Huang, X. Feng, C. Zhang, L. Qian, and Y. Wu, “Deep reinforcement learning-based joint task offloading and bandwidth allocation for multi-user mobile edge computing,” Digit. Commun. Netw., vol. 5, no. 1, pp. 10–17, 2019.   
  • [3] Y. Dai, K. Zhang, S. Maharjan, and Y. Zhang, “Edge Intelligence for Energy-Efficient Computation Offloading and Resource Allocation in 5G Beyond,” in IEEE Transactions on Vehicular Technology, vol. 69, no. 10, pp. 12175-12186, Oct. 2020.   
  • [4] Z. Chen and X. Wang, “Decentralized computation offloading for multi-user mobile edge computing: A deep reinforcement learning approach,” EURASIP Journal on Wireless Communications and Networking, vol. 2020, no. 188, 2020.    
  • [5] H. Q. Ngo, A. Ashikhmin, H. Yang, E. G. Larsson and T. L. Marzetta, “Cell-Free Massive MIMO Versus Small Cells,” in IEEE Transactions on Wireless Communications, vol. 16, no. 3, pp. 1834-1850, March 2017.    
  • [6] S. Mukherjee and J. Lee, “Edge Computing-Enabled Cell-Free Massive MIMO Systems,” in IEEE Transactions on Wireless Communications, vol. 19, no. 4, pp. 2884-2899, April 2020.    
  • [7] M. Ke, Z. Gao, Y. Wu, X. Gao, and K. -K. Wong, “Massive Access in Cell-Free Massive MIMO-Based Internet of Things: Cloud Computing and Edge Computing Paradigms,” in IEEE Journal on Selected Areas in Communications, vol. 39, no. 3, pp. 756-772, March 2021.    
  • [8] R. Lowe et al., “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments”, arXiv preprint arXiv:1706.02275, 2020.    
  • [9] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in Proc. International Conference on Learning Representations (ICLR), 2016.    
  • [10] Ao Tang, JiXian Sun, and Ke Gong, “Mobile propagation loss with a low base station antenna for NLOS street microcells in urban area,” IEEE VTS 53rd Vehicular Technology Conference, Spring 2001. Proceedings (Cat. No.01CH37202), Rhodes, Greece, 2001, pp. 333-336 vol.1.    
  • [11] R. Nikbakht, R. Mosayebi, and A. Lozano, “Uplink Fractional Power Control and Downlink Power Allocation for Cell-Free Networks,” in IEEE Wireless Communications Letters, vol. 9, no. 6, pp. 774-777, June 2020.