
DRL-based Distributed Resource Allocation for Edge Computing in Cell-Free Massive MIMO Network

Fitsum Debebe Tilahun1, Ameha Tsegaye Abebe2, and Chung G. Kang1 1School of Electrical Engineering, Korea University, Seoul, Republic of Korea
2Samsung Research, Seoul, Republic of Korea
Email: 1{fitsum_debebe, ccgkang}@korea.ac.kr, [email protected]
Abstract

In this paper, with the aim of addressing the stringent computing and quality-of-service (QoS) requirements of recently introduced advanced multimedia services, we consider a cell-free massive MIMO-enabled mobile edge network. In particular, benefiting from the reliable cell-free links to offload intensive computation to the edge server, resource-constrained end-users can augment on-board (local) processing with edge computing. To this end, we formulate a joint communication and computing resource allocation (JCCRA) problem to minimize the total energy consumption of the users, while meeting the respective user-specific deadlines. To tackle the problem, we propose a fully distributed solution approach based on a cooperative multi-agent reinforcement learning framework, wherein each user is implemented as a learning agent that makes joint resource allocation decisions relying on local information only. The simulation results demonstrate that the proposed distributed approach outperforms the heuristic baselines and converges to a centralized target benchmark, without resorting to large overhead. Moreover, we show that the proposed algorithm performs significantly better in the cell-free system than in cellular MEC systems, e.g., a small-cell-based MEC system.

Index Terms:
cell-free massive MIMO, joint communication and computing resource allocation, deep reinforcement learning (DRL), multi-agent reinforcement learning, edge computing

I Introduction

Over the past few years, there has been a surge in computation-hungry and yet highly delay-intolerant multimedia applications. To support these applications on resource-limited end-user devices, it is essential to augment on-board (local) processing by offloading part of the intensive computation to more powerful computing platforms at the network edge. To this end, the ubiquitous radio and computing resources at the mobile edge network should be jointly optimized to improve system objectives such as execution delay and energy consumption. In this regard, a plethora of joint communication and computing resource allocation (JCCRA) schemes based on conventional optimization methods has been investigated. However, most of these approaches either assume accurate knowledge of network-wide information, which is difficult to obtain in practice, or are limited to quasi-static settings, e.g., deterministic task sizes and time-invariant channel conditions. Thus, their applicability to time-sensitive applications in a dynamic mobile edge network might be critically limited.

Driven by recent advances in machine learning and deep reinforcement learning (DRL), several learning-based schemes have been proposed to provide flexible joint resource allocation. For instance, a deep Q-network (DQN) is employed in [1, 2] for designing joint offloading and resource allocation decisions, while [3] presents a deep deterministic policy gradient (DDPG)-based scheme for heterogeneous networks with multiple users. However, these schemes rely on central entities to make joint resource allocation decisions, suffering from considerable signaling overhead and scalability issues. [4] investigates a distributed JCCRA problem in a single-cell mobile edge computing (MEC) system; however, the adopted computation model is not applicable to tasks with extreme reliability and ultra-low hard deadline requirements. We note that most of the effort on optimizing joint resource allocation in MEC networks focuses on cellular systems. However, as we move towards the next decade, current cellular MEC systems may not keep up with the diverse and more stringent requirements of envisaged advanced applications, such as holographic displays, multi-sensory extended reality, and others.

On the other hand, only a limited number of works consider the recently proposed cell-free massive MIMO architecture, envisioned for beyond-5G and 6G networks. Notably, the cell-free architecture can provide sufficiently fast and reliable access links without cell edges by serving a relatively small number of users from several geographically distributed access points (APs) [5]. It therefore opens up a new opportunity to support energy-efficient and consistently low-latency computational task offloading, in contrast to cellular MEC systems. [6] presents several performance analyses of a cell-free MEC system considering the coverage radius of the APs, while the issues of active user detection and channel estimation are discussed in [7]. In this paper, we consider a JCCRA problem that minimizes the total energy consumption of the users while meeting the user-specific application deadlines by jointly optimizing the allocation of local processor clock speed and uplink transmission power for each user. We then present a fully distributed solution approach based on cooperative multi-agent reinforcement learning, alleviating the signaling and communication overheads associated with centralized implementations. In sharp contrast to our work, neither [6] nor [7] deals with the JCCRA problem. To the best of our knowledge, this is the first attempt to solve the JCCRA problem in a fully distributed fashion for a cell-free massive MIMO-enabled mobile edge network. The fully distributed and intelligent JCCRA in our framework, combined with the reliable access links of the cell-free massive MIMO architecture, can be a promising means to handle the stringent requirements of newly introduced multimedia services.

The rest of the paper is organized as follows: In Section II, we discuss the system model and problem formulation for the JCCRA framework. Section III presents the proposed distributed JCCRA scheme based on cooperative multi-agent reinforcement learning (MARL) framework. Simulation results are presented in Section IV, followed by some concluding remarks in Section V.

II System Model and Problem Formulation

II-A User-centric Cell-free Massive MIMO: Overview

We consider a cell-free massive MIMO-enabled mobile edge network, depicted in Fig. 1, where a central processing unit (CPU) with an edge server of finite capacity $f^{CPU}$ (in cycles per second) is connected to $M$ single-antenna access points (APs) via ideal fronthaul links. Furthermore, the $M$ APs serve $K$ single-antenna users, such that $M\gg K$, in line with the widely known system model for cell-free massive MIMO [5]. Let ${\cal M}=\{1,2,\ldots,M\}$ and ${\cal K}=\{1,2,\ldots,K\}$ denote the sets of AP and user indices, respectively. Besides, each user $k\in{\cal K}$ is equipped with a limited local processor with maximum capability $f_k^{\max}$ (in cycles per second).

Let the channel between the $m$-th AP and the $k$-th user be given as $g_{mk}=\beta_{mk}^{1/2}h_{mk}$, where $\beta_{mk}$ is a large-scale channel gain coefficient that models the effects of path loss and shadowing, while $h_{mk}\sim{\cal CN}(0,1)$ represents small-scale channel fading. Let $\tau_c$ denote the channel coherence period in which $h_{mk}$ remains the same. The coherence period $\tau_c$ is divided into a pilot transmission interval of $\tau_p$ samples and an uplink data transmission interval of $(\tau_c-\tau_p)$ samples for offloading computation to the edge server at the CPU.

Figure 1: User-centric cell-free massive MIMO-enabled mobile edge network: Illustrative system model

At the beginning of each coherence interval, each user $k\in{\cal K}$ simultaneously transmits a pilot sequence $\boldsymbol{\psi}_k\in\mathbb{C}^{\tau_p\times 1}$, such that $\|\boldsymbol{\psi}_k\|^2=1$, which is used to estimate the respective channel $\hat{g}_{mk}$ at each AP $m\in{\cal M}$. Assuming pairwise orthogonal sequences $\{\boldsymbol{\psi}_1,\boldsymbol{\psi}_2,\ldots,\boldsymbol{\psi}_K\}$, i.e., $\tau_p=K$, we ignore the pilot contamination problem in the current model. Then, the received pilot vector at the $m$-th AP, $\mathbf{y}_m^p\in\mathbb{C}^{\tau_p\times 1}$, can be represented as

$$\mathbf{y}_m^p=\sqrt{\tau_p}\sum_{k=1}^{K}\sqrt{q_k^p}\,g_{mk}\boldsymbol{\psi}_k+\boldsymbol{\omega}_m^p, \quad (1)$$

where $q_k^p$ is the pilot transmit power, and $\boldsymbol{\omega}_m^p$ is a $\tau_p$-dimensional additive noise vector with independent and identically distributed entries, $\omega_m^p\sim{\cal CN}(0,\sigma_m^2)$. Based on the received vector $\mathbf{y}_m^p$, the least-squares (LS) channel estimate $\hat{g}_{mk}$ can be expressed as:

$$\hat{g}_{mk}=\frac{1}{\sqrt{\tau_p q_k^p}}\boldsymbol{\psi}_k^H\mathbf{y}_m^p. \quad (2)$$
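For concreteness, the following NumPy sketch generates channels, forms the received pilot matrix of (1), and recovers the LS estimates of (2). The pilot power, noise variance, and large-scale gains used here are placeholder values chosen for illustration, not parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 100, 10                     # number of APs and users (Section IV values)
tau_p = K                          # orthogonal pilots, tau_p = K
q_p = 0.1                          # pilot transmit power in W (assumed)
noise_var = 1e-13                  # receiver noise variance sigma_m^2 (assumed)

beta = rng.uniform(1e-10, 1e-8, size=(M, K))                 # large-scale gains (placeholder)
h = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
g = np.sqrt(beta) * h                                        # g_mk = beta_mk^{1/2} h_mk

psi = np.eye(tau_p)                                          # pairwise orthogonal unit-norm pilots

# Received pilot matrix at all APs, eq. (1); row m holds y_m^p
noise = np.sqrt(noise_var / 2) * (rng.standard_normal((M, tau_p))
                                  + 1j * rng.standard_normal((M, tau_p)))
Y_p = np.sqrt(tau_p * q_p) * g @ psi.T + noise

# LS channel estimates, eq. (2); entry (m, k) is g_hat_mk
g_hat = (Y_p @ psi.conj()) / np.sqrt(tau_p * q_p)
```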

The channel estimates are used to decode the uplink data transmitted by the users. After pilot transmission, the users transmit the offloaded data to the APs. Let $x_k$ denote the uplink data of user $k$, with transmit power $p_k$, which is determined as $p_k=\eta_k p_k^{\max}$, where $\eta_k$ and $p_k^{\max}$ represent the power control coefficient and the maximum uplink power of the $k$-th user, respectively. Then, the received signal $y_m^u$ at the $m$-th AP is represented as

$$y_m^u=\sum_{k=1}^{K}\sqrt{p_k}\,g_{mk}x_k+\omega_m. \quad (3)$$

It is important to ensure the scalability of the cell-free massive MIMO system in terms of the complexity of pilot detection and data processing. To this end, we limit the number of APs that serve the $k$-th user to $N_k<M$ and form a user-centric cluster of APs ${\cal C}_k\subset\{AP_1,AP_2,\cdots,AP_M\}$ by including the APs with the largest $\beta_{mk}$ in ${\cal C}_k$ until the maximum allowable cluster size is reached, i.e., $N_k=C^{\max}$. Thus, the data transmitted by user $k$ is only decoded by the APs in ${\cal C}_k$. Then, after performing maximum ratio combining (MRC) at each AP $m\in{\cal C}_k$, the quantity $\hat{g}_{mk}^{*}y_m^u$ is transmitted to the CPU via a fronthaul link. The received soft estimates are combined at the CPU to decode the data transmitted by user $k$ as follows:

$$\hat{x}_k=\sum_{m=1}^{N_k}\hat{g}_{mk}^{*}y_m^u. \quad (4)$$

Then, the uplink signal-to-interference-plus-noise ratio (SINR) $\gamma_k$ for user $k$ can be expressed as:

$$\gamma_k=\frac{p_k\left|\sum_{m\in{\cal C}_k}\hat{g}_{mk}^{*}g_{mk}\right|^2}{\sum_{k'\neq k}p_{k'}\left|\sum_{m\in{\cal C}_k}\hat{g}_{mk}^{*}g_{mk'}\right|^2+\sigma_m^2\sum_{m\in{\cal C}_k}\left|\hat{g}_{mk}\right|^2}. \quad (5)$$

Given the system bandwidth $W$, the uplink rate of user $k$ is given as $R_k=W\log_2(1+\gamma_k)$ by Shannon's theory.
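Continuing the NumPy sketch above, the snippet below forms the user-centric clusters from the largest $\beta_{mk}$, applies MRC combining at the CPU as in (4), and evaluates the SINR of (5) and the resulting uplink rates. The cluster size and powers are illustrative choices, not prescriptions from the paper.

```python
C_max = 30                                  # maximum cluster size N_k (illustrative, 0.3M)
W = 5e6                                     # system bandwidth in Hz
p = np.full(K, 0.1)                         # uplink powers p_k = eta_k * p_k^max (eta_k = 1 here)

# User-centric cluster C_k: the C_max APs with the largest beta_mk for user k
clusters = [np.argsort(beta[:, k])[-C_max:] for k in range(K)]

R = np.zeros(K)
for k in range(K):
    Ck = clusters[k]
    # sum_{m in C_k} g_hat_mk^* g_mk' for every user k' (MRC-combined channels)
    combined = g_hat[Ck, k].conj() @ g[Ck, :]
    signal = p[k] * np.abs(combined[k]) ** 2
    interference = np.sum(np.delete(p * np.abs(combined) ** 2, k))
    noise_term = noise_var * np.sum(np.abs(g_hat[Ck, k]) ** 2)
    gamma_k = signal / (interference + noise_term)            # SINR, eq. (5)
    R[k] = W * np.log2(1 + gamma_k)                           # uplink rate R_k in bit/s
```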

II-B Parallel Computation Model

We assume each user $k$ has a computationally intensive task of ${\cal T}_k(t)$ bits at the beginning of every discrete time step $t$, whose duration is $\Delta t=\tau_c$. The task sizes in different time steps are independent and uniformly distributed over $[{\cal T}_{\min},{\cal T}_{\max}]$ for every user $k\in{\cal K}$. Let $t_k^d$ denote the maximum tolerable delay of user $k$ to complete execution of the given task. Considering the delay constraint and energy consumption, the $k$-th user processes ${\cal T}_k^{local}(t)$ bits locally while offloading the remaining ${\cal T}_k^{offload}(t)$ bits to the edge server at the CPU.

II-B1 Local computation

Let $f_k^{local}(t)=\alpha_k(t)f_k^{\max}$, for $\alpha_k\in[0,1]$, denote the $k$-th user's processor speed (in cycles per second) allocated for local task execution. The entire task is offloaded to the edge server if $\alpha_k=0$, while $\alpha_k=1$ implies that the whole local processing capability is fully utilized and the remaining task bits are offloaded to the edge server. Accordingly, the locally computed task size at time step $t$ is expressed as ${\cal T}_k^{local}(t)=\min\left({\cal T}_k(t),\frac{t_k^d f_k^{local}(t)}{N_{cpb}}\right)$, where $N_{cpb}$ corresponds to the number of cycles required to process one bit of the task. Then, the time taken for local execution at time step $t$ is expressed as

$$t_k^{local}(t)=\min\left(\frac{{\cal T}_k^{local}(t)\,N_{cpb}}{f_k^{local}(t)},\,t_k^d\right). \quad (6)$$

The corresponding energy consumption is given by

$$E_k^{local}(t)=\varsigma\,{\cal T}_k^{local}(t)\,N_{cpb}\left(f_k^{local}(t)\right)^2, \quad (7)$$

where $\varsigma$ corresponds to the effective switched capacitance.
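As a sketch of the local computation model, the helper below splits a task according to $\alpha_k$ and returns the locally processed bits, the execution time of (6), and the energy of (7); the default parameter values follow Section IV.

```python
import numpy as np

def local_computation(T_k, alpha_k, f_max=1e9, t_d=1e-3, N_cpb=500, kappa=1e-27):
    """Local split of a T_k-bit task: returns (T_local, t_local, E_local) per (6)-(7).
    kappa plays the role of the effective switched capacitance."""
    f_local = alpha_k * f_max
    if f_local == 0.0:
        return 0.0, 0.0, 0.0                          # everything is offloaded
    T_local = min(T_k, t_d * f_local / N_cpb)         # bits that fit within the deadline
    t_local = min(T_local * N_cpb / f_local, t_d)     # eq. (6)
    E_local = kappa * T_local * N_cpb * f_local ** 2  # eq. (7)
    return T_local, t_local, E_local

# Example: a 5-kbit task with half of the local processor speed allocated
print(local_computation(T_k=5e3, alpha_k=0.5))
```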

II-B2 Computation Offloading

The remaining ${\cal T}_k^{offload}(t)=\max\left(0,\,{\cal T}_k(t)-{\cal T}_k^{local}(t)\right)$ bits are offloaded to the edge server at the CPU for parallel computation. While offloading computation to the edge server, the experienced latency can be broken down into the transmission delay, the computing delay at the edge server, and the transmission delay for retrieving the processed result. Assuming the retrieved data size after computation is much smaller than the offloaded data size, we ignore the retrieval delay. Then, the transmission delay to offload ${\cal T}_k^{offload}(t)$ data bits is expressed as $t_k^{tr}(t)=\frac{{\cal T}_k^{offload}(t)}{R_k(t)}$, where $R_k(t)$ is the uplink rate of user $k$ at time step $t$. The computing time at the edge server is given by $t_k^{comp}(t)=\frac{{\cal T}_k^{offload}(t)\,N_{cpb}}{f_k^{CPU}(t)}$, where $f_k^{CPU}(t)$ is the computation resource at the CPU allocated to user $k$, which is proportional to ${\cal T}_k^{offload}(t)$. The total offloading delay for edge computation is then given as

$$t_k^{offload}(t)=t_k^{comp}(t)+t_k^{tr}(t). \quad (8)$$

The corresponding energy consumption for offloading computation is then expressed as

$$E_k^{offload}(t)=p_k(t)\,t_k^{tr}(t), \quad (9)$$

where $p_k(t)=\eta_k(t)\,p_k^{\max}$, $\eta_k\in[0,1]$. Then, the overall latency $t_k(t)$ experienced by user $k$ to execute ${\cal T}_k(t)$ bits locally and at the edge is given as:

$$t_k(t)=\max\left(t_k^{local}(t),\,t_k^{offload}(t)\right). \quad (10)$$

Similarly, the energy consumption $E_k(t)$ of user $k$ at time step $t$ can be expressed as

$$E_k(t)=E_k^{local}(t)+E_k^{offload}(t). \quad (11)$$
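The companion helper below completes the parallel computation model: it computes the offloading delay (8), the offloading energy (9), the overall latency (10), and the total energy (11). The edge CPU share $f_k^{CPU}$ is passed in directly, assuming the CPU allocates it in proportion to the offloaded bits as stated above.

```python
def offload_and_total(T_k, T_local, t_local, E_local, eta_k, R_k, f_k_cpu,
                      p_max=0.1, N_cpb=500):
    """Returns (t_k, E_k) of eqs. (10)-(11) given the local-split results."""
    T_off = max(0.0, T_k - T_local)                   # bits sent to the edge server
    p_k = eta_k * p_max                               # uplink transmit power
    t_tr = T_off / R_k if T_off > 0 else 0.0          # transmission delay
    t_comp = T_off * N_cpb / f_k_cpu if T_off > 0 else 0.0
    t_off = t_comp + t_tr                             # eq. (8)
    E_off = p_k * t_tr                                # eq. (9)
    return max(t_local, t_off), E_local + E_off       # eqs. (10) and (11)
```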

II-C JCCRA Problem Formulation

According to the aforementioned communication and parallel computation models, we intend to jointly optimize the local processor speed $f_k^{local}(t)$ and uplink transmission power $p_k(t)$ for every user $k\in{\cal K}$ at each time step $t$, in order to minimize the total energy consumption of all users subject to the respective user-specific delay requirements. The JCCRA problem can mathematically be formulated to jointly determine $\left(\alpha_k(t),\eta_k(t)\right),\ \forall k\in{\cal K}$, as follows:

$$\begin{array}{ll}
\min\limits_{\{\alpha_k(t),\,\eta_k(t)\,|\,\forall k\}} & \sum_{k=1}^{K}E_k(t)\\
\text{subject to} & t_k(t)\leq t_k^d,\quad\forall k\\
& 0\leq\alpha_k(t)\leq 1,\quad\forall k\\
& 0\leq\eta_k(t)\leq 1,\quad\forall k.
\end{array} \quad (12)$$

Problem (12) is a stochastic optimization problem in which the objective function and the first constraint involve constantly changing random variables, i.e., task sizes and channel conditions. Consequently, the problem must be solved frequently, i.e., at every time step $t$, and the solution must converge rapidly within the ultra-low delay constraint. Therefore, it is challenging to solve the problem with iterative optimization algorithms of reasonable complexity. Furthermore, relying on one-shot optimization, it is difficult to guarantee steady long-term performance with conventional algorithms since they cannot adapt to changes in the mobile edge network.
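To make the per-step decision in (12) concrete, the sketch below performs an exhaustive grid search over $(\alpha_k,\eta_k)$ for a single user, using the two helpers sketched in Section II-B. It treats the uplink rate as a given function of $\eta_k$ and ignores the coupling between users through interference and the shared edge server, so it is only an illustrative baseline rather than a solution to the joint problem.

```python
import itertools
import numpy as np

def grid_search_jccra(T_k, rate_of_eta, f_k_cpu, t_d=1e-3, grid=21):
    """Pick the feasible (alpha_k, eta_k) pair with the lowest energy for one user.
    rate_of_eta(eta) -> uplink rate in bit/s is assumed to be supplied by the caller."""
    best_pair, best_energy = None, np.inf
    for alpha, eta in itertools.product(np.linspace(0.0, 1.0, grid), repeat=2):
        T_loc, t_loc, E_loc = local_computation(T_k, alpha, t_d=t_d)
        if eta == 0.0:                      # no uplink power: the whole task must fit locally
            if T_loc < T_k:
                continue
            t_tot, E_tot = t_loc, E_loc
        else:
            t_tot, E_tot = offload_and_total(T_k, T_loc, t_loc, E_loc,
                                             eta, rate_of_eta(eta), f_k_cpu)
        if t_tot <= t_d and E_tot < best_energy:
            best_pair, best_energy = (alpha, eta), E_tot
    return best_pair, best_energy
```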

III The Proposed Multi-Agent Reinforcement Learning-based Distributed JCCRA

To cope with the dynamics of the MEC system while supporting efficient and adaptive joint resource allocation for every user, we propose a fully distributed JCCRA scheme based on cooperative multi-agent reinforcement learning (MARL). We note that solving the JCCRA problem centrally at the CPU might allow for efficient management of the available radio and computing resources, owing to global processing of network-wide information. However, the associated signaling and communication overheads are prohibitively large, especially given the ultra-low delay constraints. Accordingly, we transform the JCCRA problem into a partially observable cooperative Markov game that consists of $K$ learning agents. One of the main challenges in the multi-agent domain is the non-stationarity of the environment, since the agents update their policies concurrently, which might lead to learning instability. To deal with this challenge, the agents are trained with the state-of-the-art multi-agent deep deterministic policy gradient (MADDPG) algorithm under the framework of centralized training and decentralized execution [8].

III-A DRL Formulation for JCCRA

Each user is implemented as a DRL agent that learns to map local observations of the environment to efficient JCCRA actions by sequentially interacting with the mobile edge network in discrete time steps. Let ${\cal O}_k$ and ${\cal S}$ denote the local observation space of agent $k\in{\cal K}$ and the complete environment state space, respectively. At the beginning of each step $t$, agent $k$ perceives a local observation of the environment state, $o_k(t):{\cal S}\mapsto{\cal O}_k$, and takes an action $a_k(t)\in{\cal A}_k$ from its action space ${\cal A}_k$ according to the current JCCRA policy $\mu_k$. The interaction with the environment produces the next observation of the agent, $o_k(t+1)\in{\cal O}_k$, and a real-valued scalar reward $r_k(t):{\cal S}\times{\cal A}_k\mapsto\mathbb{R}$ that guides the agent to constantly improve its policy until it converges. The goal of the agent is, therefore, to derive an optimal JCCRA policy $\mu_k^{*}$ that maximizes the expected long-term discounted cumulative reward, defined as $J_k(\mu_k)=\mathbb{E}\left[\sum_{t=1}^{T}\varepsilon^{t-1}r_k(t)\right]$, where $\varepsilon\in[0,1]$ is a discount factor and $T$ is the total number of time steps (horizon). In the sequel, we define the local observation, action, and reward of agent $k$ for JCCRA at a given time $t$.

III-A1 Local observation

At the beginning of each time step $t$, the local observation of agent $k$ is defined as $o_k(t)\triangleq\left[{\cal T}_k(t),t_k^d,R_k(t-1)\right]$, where ${\cal T}_k(t)$, $t_k^d$, and $R_k(t-1)$ are the incoming task size, the user-specific application deadline, and the uplink rate from the previous time step, respectively. Note that the local observation is only a partial view of the environment, i.e., it does not include information about the other agents.

III-A2 Action

Based on the local observation of the environment, the agent takes an action $a_k(t)\triangleq\left[\alpha_k(t),\eta_k(t)\right]$, i.e., the local processor speed $f_k^{local}(t)=\alpha_k(t)f_k^{\max}$ and the uplink transmit power $p_k(t)=\eta_k(t)\,p_k^{\max}$.

III-A3 Reward

To reflect the design objective of the JCCRA problem and enforce a cooperative behavior among the agents, we define joint reward as

$$r_k(t)=-\sum_{k=1}^{K}\xi_k E_k(t),\quad\forall k\in{\cal K}, \quad (13)$$

where $\xi_k=1$ if $t_k(t)\leq t_k^d$; otherwise, $\xi_k=10$ to punish potentially selfish behavior of the agents that leads to failure in meeting the delay constraint.
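The DRL formulation above maps directly onto a small per-agent interface, sketched below under the stated definitions: building the local observation, translating the bounded action into physical resources, and computing the shared reward of (13).

```python
import numpy as np

def local_observation(T_k, t_d, R_prev):
    """o_k(t) = [incoming task size, deadline, previous uplink rate]."""
    return np.array([T_k, t_d, R_prev], dtype=np.float32)

def apply_action(action, f_max=1e9, p_max=0.1):
    """Map a_k(t) = [alpha_k, eta_k] in [0,1]^2 to the local CPU speed and uplink power."""
    alpha_k, eta_k = np.clip(action, 0.0, 1.0)
    return alpha_k * f_max, eta_k * p_max

def joint_reward(E, t, t_d):
    """Shared reward of eq. (13); E, t, t_d are length-K arrays of energies,
    latencies, and deadlines, and xi_k = 10 punishes deadline violations."""
    xi = np.where(t <= t_d, 1.0, 10.0)
    return -float(np.sum(xi * E))
```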

III-B MADDPG Algorithm for Distributed JCCRA

During the offline training phase, the agents can be trained in a central unit, which presents an opportunity to share extra information among the agents to ease and accelerate the training process. Specifically, agent $k$ now has access to the joint action $a(t)=\left(a_1(t),\ldots,a_K(t)\right)$ and the full observation of the environment state $s_k(t)=\left(o_k(t),o_{-k}(t)\right)$, where $o_{-k}(t)$ denotes the local observations of the other agents at time step $t$. The extra information endowed to the agents, however, is discarded during the execution phase, meaning the agents rely entirely on their local observations to make JCCRA decisions in a fully distributed manner.

As an extension of the single-agent deep deterministic policy gradient (DDPG) algorithm [9], MADDPG is an actor-critic policy gradient algorithm. Specifically, MADDPG agent $k$ employs two main neural networks: an actor network with parameters $\theta_k^{\mu}$ to approximate a JCCRA policy $\mu_k(o_k|\theta_k^{\mu})$, and a critic network, parametrized by $\theta_k^{Q}$, to approximate an action-value function $Q_k(s_k,a|\theta_k^{Q})$, along with their respective time-delayed copies, $\theta_k^{\mu'}$ and $\theta_k^{Q'}$, which serve as targets. At time step $t$, the actor deterministically maps the local observation $o_k(t)$ to a specific continuous action $\mu_k(o_k(t)|\theta_k^{\mu})$, and then a random noise process ${\cal N}_k$ is added to generate an exploratory policy such that $a_k(t)=\mu_k(o_k(t)|\theta_k^{\mu})+{\cal N}_k(t)$. The environment, which is shared among the agents, collects the joint action $a(t)=\{a_k(t),\forall k\in{\cal K}\}$ and returns the immediate reward $r_k(t)$ and the next observation $o_k(t+1)$ to the respective agents. To make efficient use of the experience in later decision-making steps and to improve the stability of the training, the agent's transition along with the extra information, $e_k(t)=\left(s_k(t),a(t),r_k(t),s_k(t+1)\right)$, is saved in the replay buffer ${\cal D}_k$ of the agent.

To train the main networks, a mini-batch of $B$ samples, $\left.\left(s_k^i,a^i,r_k^i,s_k^{i+1}\right)\right|_{i=1}^{B}$, is randomly drawn from the replay buffer ${\cal D}_k$, where $i$ denotes the sample index. The critic network is updated to minimize the following loss function:

$${\cal L}_k\left(\theta_k^Q\right)=\frac{1}{B}\sum_{i}\left(y_k^i-Q_k\left(s_k^i,a^i|\theta_k^Q\right)\right)^2, \quad (14)$$

where $y_k^i$ is the target value expressed as $y_k^i=r_k^i+\varepsilon\,Q'_k\left(s_k^{i+1},a^{i+1}|\theta_k^{Q'}\right)\big|_{a^{i+1}=\{\mu'_k(o_k^{i+1}),\,\forall k\in{\cal K}\}}$. On the other hand, the parameters of the actor network are updated along the deterministic policy gradient given as

$$\nabla_{\theta_k^{\mu}}J\left(\mu_k|\theta_k^{\mu}\right)\approx\frac{1}{B}\left[\sum_{i}\nabla_{a_k}Q_k\left(s_k^i,a^i|\theta_k^Q\right)\times\nabla_{\theta_k^{\mu}}\mu_k\left(o_k^i|\theta_k^{\mu}\right)\right]\Big|_{a^i=\{\mu_k(o_k^i),\,\forall k\in{\cal K}\}}. \quad (15)$$

The target parameters of both the actor and critic networks are softly updated as $\theta_k^{\mu'}\leftarrow\tau\theta_k^{\mu}+(1-\tau)\theta_k^{\mu'}$ and $\theta_k^{Q'}\leftarrow\tau\theta_k^{Q}+(1-\tau)\theta_k^{Q'}$, respectively, where $\tau$ is a constant close to zero.
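The update rules (14)-(15) and the soft target update can be condensed into the PyTorch sketch below for a single agent $k$. It assumes that the critics take the concatenated full observation and joint action, that the replay batch is already organized per agent, and that exploration noise is handled outside this function; it is an illustrative sketch of the training step under these assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def maddpg_update(k, batch, actors, critics, target_actors, target_critics,
                  actor_opts, critic_opts, eps=0.99, tau=0.005):
    """One training iteration for agent k. 'batch' = (obs, acts, rew_k, next_obs), where
    obs/next_obs are lists of [B, obs_dim] tensors per agent and acts has shape [B, K*act_dim]."""
    obs, acts, rew_k, next_obs = batch
    K = len(actors)

    # Target value y_k of eq. (14), using the target actors of all agents
    with torch.no_grad():
        next_acts = torch.cat([target_actors[j](next_obs[j]) for j in range(K)], dim=-1)
        y_k = rew_k + eps * target_critics[k](torch.cat(next_obs, dim=-1), next_acts)

    state = torch.cat(obs, dim=-1)

    # Critic update: minimize the mean-squared TD error of eq. (14)
    critic_loss = F.mse_loss(critics[k](state, acts), y_k)
    critic_opts[k].zero_grad(); critic_loss.backward(); critic_opts[k].step()

    # Actor update: deterministic policy gradient of eq. (15),
    # differentiating only through agent k's own action
    act_chunks = [a.detach() for a in torch.split(acts, acts.shape[-1] // K, dim=-1)]
    act_chunks[k] = actors[k](obs[k])
    actor_loss = -critics[k](state, torch.cat(act_chunks, dim=-1)).mean()
    actor_opts[k].zero_grad(); actor_loss.backward(); actor_opts[k].step()

    # Soft update of the target networks with rate tau
    for net, tgt in ((actors[k], target_actors[k]), (critics[k], target_critics[k])):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```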

It is worth mentioning that the proposed JCCRA scheme is trained offline. Once it converges, real-time decision making can be conducted in a fully distributed fashion by simply performing inference with the trained actor network of each agent, relying on the respective local observation only.

IV Performance Analysis

We consider $M=100$ APs and $K=10$ users uniformly distributed at random over an area of 1 km$^2$. The APs are connected to the CPU via ideal fronthaul links. We assume only 30% of the APs are clustered to serve each user, i.e., $N_k=0.3M,\ \forall k\in{\cal K}$. The system bandwidth is set to 5 MHz, and all active users share this bandwidth without channelization. Denoting the standard deviation of the shadow fading by $\sigma_{sh}$, the large-scale channel gain is given as $\beta_{mk}=10^{\frac{PL_{mk}}{10}}10^{\frac{\sigma_{sh}z_{mk}}{10}}$, where $z_{mk}\sim{\cal N}(0,1)$. The path loss $PL_{mk}$ is given according to the three-slope model [10] as follows:

$$PL_{mk}=\begin{cases}-L-35\log_{10}(d_{mk}) & \text{if } d_{mk}>d_1\\ -L-10\log_{10}\left(d_{mk}^2\,d_1^{1.5}\right) & \text{if } d_0<d_{mk}\leq d_1\\ -L-10\log_{10}\left(d_0^2\,d_1^{1.5}\right) & \text{if } d_{mk}\leq d_0,\end{cases} \quad (16)$$

where $d_{mk}$ is the distance between user $k$ and AP $m$, and $L$ is

$$L=46.3+33.9\log_{10}(f)-13.82\log_{10}(h_{AP})-\left(1.1\log_{10}(f)-0.7\right)h_u+1.56\log_{10}(f)-0.8. \quad (17)$$

Here, $f$ is the carrier frequency (in MHz), $h_{AP}$ is the AP antenna height (in m), and $h_u$ denotes the user antenna height (in m). Further, no shadowing is considered unless $d_{mk}>d_1$. The edge server at the CPU has a computation capacity of $f^{CPU}=100$ GHz, while the users are equipped with local processors with $f_k^{\max}=1$ GHz. The length of a time step $\Delta t$ is set to 1 ms, and each user generates a task of random size, uniformly distributed in the range of 2.5 to 7.5 kbits, at the beginning of every time step $t$. Furthermore, additional simulation parameters are set to $N_{cpb}=500$, $\varsigma=10^{-27}$, $p_k^{\max}=0.1$ W, $\sigma_{sh}=10$ dB, and $t_k^d=\Delta t=\tau_c=1$ ms.
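For reference, the large-scale gain model of (16)-(17), including the rule that shadowing is applied only beyond $d_1$, can be sketched as below. The carrier frequency, antenna heights, and breakpoints $d_0$, $d_1$ are not specified above; the defaults here are typical values from the cell-free massive MIMO literature [5] and should be treated as assumptions.

```python
import numpy as np

def large_scale_gain(d_km, f=1900.0, h_ap=15.0, h_u=1.65,
                     d0=0.01, d1=0.05, sigma_sh=10.0, rng=None):
    """beta_mk from the three-slope path loss (16)-(17); d_km, d0, d1 in km, f in MHz."""
    L = (46.3 + 33.9 * np.log10(f) - 13.82 * np.log10(h_ap)
         - (1.1 * np.log10(f) - 0.7) * h_u + 1.56 * np.log10(f) - 0.8)
    if d_km > d1:
        pl = -L - 35.0 * np.log10(d_km)
        z = rng.standard_normal() if rng is not None else 0.0
        shadow_db = sigma_sh * z                      # shadowing only when d_mk > d1
    elif d_km > d0:
        pl = -L - 10.0 * np.log10(d_km ** 2 * d1 ** 1.5)
        shadow_db = 0.0
    else:
        pl = -L - 10.0 * np.log10(d0 ** 2 * d1 ** 1.5)
        shadow_db = 0.0
    return 10.0 ** (pl / 10.0) * 10.0 ** (shadow_db / 10.0)
```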

For the proposed MADDPG-based JCCRA implementation, the actor and critic networks of each agent are implemented with fully connected neural networks consisting of 128, 64, and 64 nodes. All hidden layers are activated by the ReLU function, while the output of the actor is activated by a sigmoid function. The parameters of the critic and actor networks are updated with the adaptive moment estimation (Adam) optimizer with learning rates of 0.001 and 0.0001, respectively. Further, we set the discount factor $\varepsilon=0.99$, the target update parameter $\tau=0.005$, and the mini-batch size $B=128$. In our simulations, the agents are trained for several episodes, each consisting of 100 steps.
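Under the reported layer sizes and activations, the actor and critic of each agent can be sketched in PyTorch as follows. The observation and action dimensions (3 and 2) follow Section III-A, while the centralized critic input (concatenated observations plus joint action) is an assumption consistent with the MADDPG training described above.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the local observation o_k to [alpha_k, eta_k] in [0,1]^2."""
    def __init__(self, obs_dim=3, act_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, act_dim), nn.Sigmoid(),   # bounded JCCRA action
        )

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """Centralized critic Q_k(s_k, a) over the full state and the joint action."""
    def __init__(self, state_dim, joint_act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_act_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, state, joint_action):
        return self.net(torch.cat([state, joint_action], dim=-1))

# Example instantiation for K = 10 agents
actor_k = Actor()
critic_k = Critic(state_dim=10 * 3, joint_act_dim=10 * 2)
opt_actor = torch.optim.Adam(actor_k.parameters(), lr=1e-4)
opt_critic = torch.optim.Adam(critic_k.parameters(), lr=1e-3)
```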

To evaluate the performance of the proposed MADDPG-based JCCRA, we compare it with three benchmarks:

IV-1 Centralized DDPG-based JCCRA scheme

This approach refers to a DDPG-based centralized resource allocation scheme implemented at the CPU. We adopt the same neural network structure and other hyperparameters as the MADDPG-based scheme to train the actor and critic in the single DDPG agent. Since global information of the entire network is processed centrally at the CPU to make JCCRA decisions for all users, we can potentially obtain the most efficient resource allocation. Consequently, this baseline serves as a target performance benchmark. However, as it involves significant signaling and communication overheads, this scheme might be infeasible to support time-sensitive applications.

Figure 2: Average reward with training process
Figure 3: Average success rate with training process

IV-2 Offloading-first with an uplink fractional power control (FPC) scheme

This approach preferentially offloads the computation to the edge server, aiming to aggressively exploit the reliable access links provided by cell-free massive MIMO while saving local processing energy. The uplink transmit power of the $k$-th user is determined by the standard fractional power control (FPC) [11] as follows:

$$p_k=\min\left(p_k^{\max},\,p_0\lambda_k^{-\nu}\right), \quad (18)$$

where $p_0=-35$ dBm, $\lambda_k=\sum_{m=1}^{N_k}\beta_{mk}$, and $\nu=0.5$.
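A small sketch of the FPC rule in (18), using the stated $p_0$, $\nu$, and $p_k^{\max}$:

```python
import numpy as np

def fpc_power(beta_k, cluster, p_max=0.1, p0_dbm=-35.0, nu=0.5):
    """Uplink power of eq. (18): beta_k holds the large-scale gains of this user
    towards all M APs, and 'cluster' is the index set C_k of its serving APs."""
    p0 = 10.0 ** (p0_dbm / 10.0) / 1000.0     # -35 dBm converted to watts
    lam = np.sum(beta_k[cluster])             # lambda_k = sum_{m in C_k} beta_mk
    return min(p_max, p0 * lam ** (-nu))
```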

IV-3 Local execution-first with an uplink fractional power control (FPC) scheme

The entire local processing capability is fully utilized, i.e., $f_k^{local}=f_k^{\max}$, and the remaining task bits are offloaded to the edge server with the uplink transmit power given according to (18).

In Fig. 2, we investigate the convergence and performance of the proposed MADDPG-based JCCRA scheme by evaluating the total reward accumulated during the training process. It can be observed that the proposed distributed scheme converges to the target performance benchmark, i.e., the centralized DDPG-based JCCRA, while relieving the large overhead of the latter. This implies that, by relying on local observations and the additional information provided during the offline centralized training phase, the agents manage to learn efficient joint resource allocation. Moreover, the proposed scheme significantly outperforms the heuristic baselines in terms of the total accumulated reward.

In Fig. 3, we compare the average success rate of the users in meeting the delay constraint of the time-sensitive applications. One can observe that the proposed MADDPG-based JCCRA achieves more than 99% success rate in accomplishing the tasks within the deadline as the training episodes progress. Even though the local execution-first policy achieves a 100% success rate, this performance comes at the cost of high energy consumption, as reflected in Fig. 2. Moreover, the fact that the policy of the learning agents outperforms the offloading-first baseline implies that, even if cell-free massive MIMO can provide reliable access links to the users, aggressive computation offloading can result in performance degradation due to ineffective use of the limited resources.

Figure 4: Performance comparison of the proposed algorithm over cell-free and cellular MEC architectures: (a) average success rate; (b) reward

We then implemented the proposed MADDPG-based JCCRA scheme in two different cellular MEC architectures, namely a small-cell system and a single-cell system with co-located massive MIMO, to further investigate the gain obtained from the cell-free architecture. For the small-cell MEC system, each user is served by the single AP with the largest large-scale fading coefficient, following the methodology in [5], whereas in the co-located massive MIMO case, each user connects to a base station with a total of $N_k$ antennas. From Figs. 4(a) and 4(b), we can observe that the proposed algorithm over the cell-free system significantly outperforms the cellular MEC systems in terms of average success rate and total accumulated reward. This is attributed to the channel quality degradation caused by co-channel interference from nearby cells in the small-cell case, while in the single-cell MEC system, users are subjected to large interference from all co-located antennas under MRC, limiting the uplink rate for computation offloading.

V Conclusion and Future Works

In this paper, we presented a novel distributed joint communication and computing resource allocation (JCCRA) scheme based on a cooperative multi-agent reinforcement learning framework for a cell-free massive MIMO-enabled mobile edge network. More specifically, each trained agent can make intelligent and adaptive joint resource allocation decisions in real time, based on local observations only, in order to minimize the total energy consumption while meeting the respective delay constraints in the dynamic MEC system. The performance of the proposed distributed JCCRA scheme is shown to converge to the centralized target baseline, without resorting to large overhead. Finally, we have shown that the proposed algorithm in the cell-free system outperforms two cellular MEC systems by wide margins. In the future, it would be interesting to extend the current formulation to determine a set of serving APs that adapts to dynamic user mobility.

Acknowledgment

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.2020R1A2C100998413).

References

  • [1] J. Li, H. Gao, T. Lv, and Y. Lu, “Deep reinforcement learning based computation offloading and resource allocation for MEC,” 2018 IEEE Wireless Communications and Networking Conference (WCNC), Barcelona, 2018.   
  • [2] L. Huang, X. Feng, C. Zhang, L. Qian, and Y. Wu, “Deep reinforcement learning-based joint task offloading and bandwidth allocation for multi-user mobile edge computing,” Digit. Commun. Netw., vol. 5, no. 1, pp. 10–17, 2019.   
  • [3] Y. Dai, K. Zhang, S. Maharjan, and Y. Zhang, “Edge Intelligence for Energy-Efficient Computation Offloading and Resource Allocation in 5G Beyond,” in IEEE Transactions on Vehicular Technology, vol. 69, no. 10, pp. 12175-12186, Oct. 2020.   
  • [4] Z. Chen and X. Wang, “Decentralized computation offloading for multi-user mobile edge computing: A deep reinforcement learning approach,” EURASIP Journal on Wireless Communications and Networking, vol. 2020, no. 188, 2020.    
  • [5] H. Q. Ngo, A. Ashikhmin, H. Yang, E. G. Larsson and T. L. Marzetta, “Cell-Free Massive MIMO Versus Small Cells,” in IEEE Transactions on Wireless Communications, vol. 16, no. 3, pp. 1834-1850, March 2017.    
  • [6] S. Mukherjee and J. Lee, “Edge Computing-Enabled Cell-Free Massive MIMO Systems,” in IEEE Transactions on Wireless Communications, vol. 19, no. 4, pp. 2884-2899, April 2020.    
  • [7] M. Ke, Z. Gao, Y. Wu, X. Gao, and K. -K. Wong, “Massive Access in Cell-Free Massive MIMO-Based Internet of Things: Cloud Computing and Edge Computing Paradigms,” in IEEE Journal on Selected Areas in Communications, vol. 39, no. 3, pp. 756-772, March 2021.    
  • [8] R. Lowe et al., “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments”, arXiv preprint arXiv:1706.02275, 2020.    
  • [9] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in Proc. International Conference on Learning Representations (ICLR), 2016.    
  • [10] Ao Tang, JiXian Sun, and Ke Gong, “Mobile propagation loss with a low base station antenna for NLOS street microcells in urban area,” IEEE VTS 53rd Vehicular Technology Conference, Spring 2001. Proceedings (Cat. No.01CH37202), Rhodes, Greece, 2001, pp. 333-336 vol.1.    
  • [11] R. Nikbakht, R. Mosayebi, and A. Lozano, “Uplink Fractional Power Control and Downlink Power Allocation for Cell-Free Networks,” in IEEE Wireless Communications Letters, vol. 9, no. 6, pp. 774-777, June 2020.