
Distributed Multi-Agent Reinforcement Learning with One-hop Neighbors and Compute Straggler Mitigation

Baoqian Wang, Student Member, IEEE, Junfei Xie, Senior Member, IEEE, and Nikolay Atanasov, Senior Member, IEEE

We gratefully acknowledge support from NSF CAREER-2048266, NSF CCF-2402689 and ONR N00014-23-1-2353. Baoqian Wang is with Boeing AI, The Boeing Company, Huntsville, AL 35808, USA (e-mail: [email protected]). Junfei Xie is with the Department of Electrical and Computer Engineering, San Diego State University, San Diego, CA 92182, USA (e-mail: [email protected]). Corresponding author. Nikolay Atanasov is with the Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA 92093, USA (e-mail: [email protected]).
Abstract

Most multi-agent reinforcement learning (MARL) methods are limited in the scale of problems they can handle. With increasing numbers of agents, the number of training iterations required to find the optimal behaviors increases exponentially due to the exponentially growing joint state and action spaces. This paper tackles this limitation by introducing a scalable MARL method called Distributed multi-Agent Reinforcement Learning with One-hop Neighbors (DARL1N). DARL1N is an off-policy actor-critic method that addresses the curse of dimensionality by restricting information exchanges among the agents to one-hop neighbors when representing value and policy functions. Each agent optimizes its value and policy functions over a one-hop neighborhood, significantly reducing the learning complexity, yet maintaining expressiveness by training with varying neighbor numbers and states. This structure allows us to formulate a distributed learning framework to further speed up the training procedure. Distributed computing systems, however, contain straggler compute nodes, which are slow or unresponsive due to communication bottlenecks, software or hardware problems. To mitigate the detrimental straggler effect, we introduce a novel coded distributed learning architecture, which leverages coding theory to improve the resilience of the learning system to stragglers. Comprehensive experiments show that DARL1N significantly reduces training time without sacrificing policy quality and is scalable as the number of agents increases. Moreover, the coded distributed learning architecture improves training efficiency in the presence of stragglers.

Index Terms

Multi-Agent Reinforcement Learning, Scalability, Distributed Computing, Coded Computation.

1 Introduction

Recent years have witnessed tremendous success of reinforcement learning (RL) in challenging decision making problems, such as robot control and video games. Research efforts are currently focused on multi-agent settings, including cooperative robot navigation [1], multi-player games [2], and traffic management [3]. Direct application of RL techniques in multi-agent settings by running single-agent algorithms simultaneously on each agent exhibits poor performance [4]. This is because, without considering interactions among the agents, the environment becomes non-stationary from the perspective of a single agent.

Multi-agent reinforcement learning (MARL) [5] addresses this challenge by considering all agents and their dynamics collectively when learning the value function and policy of an individual agent. Most effective MARL algorithms, such as multi-agent deep deterministic policy gradient (MADDPG) [4] and counterfactual multi-agent (COMA) [6], adopt this strategy. However, learning a joint-state value or action-value (Q) or policy function is challenging due to the exponentially growing joint state and action spaces with increasing number of agents [7, 8]. Policies trained with joint state-action pairs have poor performance in large-scale settings as demonstrated in recent work [9, 10], because their accurate approximation requires models with extremely large capacity.

MARL algorithms that improve the quality of learned policies for large-scale multi-agent settings often employ value function factorization that factorizes the global value/Q function into individual value/Q functions depending on local states and actions, e.g., as in Value Decomposition Network (VDN) [11], QMIX [12], QTran [13], mean-field MARL (MFAC) [9] or scalable actor critic (SAC) [8]. In addition to value factorization, there are several other methods proposed to enable scalable MARL. MAAC [14] uses an attention module to abstract states of other agents when training an agent’s Q function, which reduces the quadratically increasing input space to a linear space. EPC [10] applies curriculum learning to gradually scale MARL up. While these methods achieve great policy performance, the training time can be significant when the number of agents increases because these methods cannot be easily trained in an efficient distributed or parallel manner over multiple computers.

To address the challenge of training policies for large numbers of agents over a distributed computing architecture, we propose a MARL algorithm called Distributed multi-Agent Reinforcement Learning with One-hop Neighbors (DARL1N). DARL1N’s main advantage over state-of-the-art MARL methods is that it allows distributed training across compute nodes (devices with networking, storage, and computing capabilities) running in parallel, with each compute node simulating only a very small subset of the agent transitions. This is made possible by modeling the agent team topology as a proximity graph and representing the Q function and policy of each agent as a function of its one-hop neighbors only. This structure significantly reduces the representation complexity of the Q and policy functions and yet maintains expressiveness when training is done over varying states and numbers of neighbors. Furthermore, when agent interactions are restricted to one-hop neighborhoods, training an agent’s Q function and policy requires transitions only of the agent itself and its potential two-hop neighbors. This enables highly efficient distributed training because each compute node needs to simulate only the transitions of the agents assigned to it and their two-hop neighbors.

RL or MARL policies can be trained over a distributed computing architecture in either a centralized [15, 16] or decentralized [17] manner. Decentralized architectures [17] offer greater resilience to node failures and malicious attacks but introduce significant communication overhead, as frequent information exchanges are required. Additionally, achieving global coordination in such systems is inherently challenging: global information must either be inferred under the assumption of globally observable states, which limits applicability, or obtained through consensus, which is difficult to achieve in large-scale systems. In contrast, centralized architectures with a central controller are more communication-efficient and facilitate easier global coordination, with communication occurring either asynchronously [18] or synchronously [19]. Asynchronous training faces multiple challenges, including slow convergence, difficult debugging and analysis, and sometimes subpar quality of the learned policies, as learners may return stale gradients evaluated with old parameters [20, 21, 22, 19, 23]. Synchronous training is superior in these respects but is vulnerable to computing stragglers [24], which are common in wireless and mobile networks. These stragglers, i.e., slow or unresponsive compute nodes caused by communication bottlenecks or software and hardware issues, can delay or even derail the training process. Coded computation [25], which employs coding theory to introduce redundant computation, can mitigate computing stragglers. While extensively explored in various distributed computation problems such as matrix multiplication [25], linear inverse problems [26], convolution [27], and MapReduce [28], its application to MARL remains under-studied. In our previous work [16], we explored the merits of coded computation in enhancing resilience and accelerating the distributed training of MADDPG. Building upon this, in this paper we propose a coded distributed learning architecture tailored for DARL1N. Unlike the architecture in [16], where the central controller simulates global environment interactions among all agents and sends the simulation data to each learner, in the new architecture each learner directly simulates local environment interactions among a small set of neighboring agents during individual agent training, which improves distributed computing efficiency and reduces communication overhead.

Contributions: The primary contribution of this paper is a new MARL algorithm called DARL1N, which employs one-hop neighborhood factorization of the value and policy functions, allowing distributed training with each compute node simulating a small number of agent transitions. DARL1N supports highly-efficient distributed training and generates high-quality multi-agent policies for large agent teams. The second contribution is a novel coded distributed learning architecture for DARL1N called Coded DARL1N, which allows individual agents to be trained by multiple compute nodes simultaneously, enabling resilience to stragglers. Our analysis shows that introducing redundant computations via coding theory does not introduce bias in the value and policy gradient estimates, and the training converges similarly to stochastic gradient descent-based methods. Four codes including Maximum Distance Separable (MDS), Random Sparse, Repetition, and Low Density Generator Matrix (LDGM) codes are investigated to introduce redundant computation. Moreover, we conduct comprehensive experiments comparing DARL1N with four state-of-the-art MARL methods, including MADDPG, MFAC, EPC and SAC, and evaluating their performance in different RL environments, including Ising Model, Food Collection, Grassland, Adversarial Battle, and Multi-Access Wireless Communication. We also conduct experiments to evaluate the resilience of Coded DARL1N to stragglers when trained under different coding schemes.

DARL1N was first presented in a short conference version [29]. This journal article extends DARL1N by introducing a new coded distributed learning architecture that enhances its resilience to stragglers while also improving training efficiency. Theoretical analysis is conducted to establish the convergence of Coded DARL1N. Additionally, this article presents a more comprehensive experimental study, covering not only the performance of DARL1N but also its new coded variant, and includes additional benchmarks, environments, and evaluation metrics for a more thorough assessment.

In the rest of the paper, Sec. 2 formulates the MARL problem to be addressed and describes the occurrence of stragglers within distributed learning systems. The proposed DARL1N algorithm is then introduced in Sec. 3, followed by the coded distributed learning architecture and different coding schemes, which are described in Sec. 4 and Sec. 5, respectively. Experiment results are presented in Sec. 6. Limitations and future work are discussed in Sec. 7. Finally, Sec. 8 concludes the paper.

2 Problem Statement

In MARL, $M$ agents learn to optimize their behavior by interacting with the environment. Denote the state and action of agent $i\in[M]:=\{1,\ldots,M\}$ by $s_i\in\mathcal{S}_i$ and $a_i\in\mathcal{A}_i$, respectively, where $\mathcal{S}_i$ and $\mathcal{A}_i$ are the corresponding state and action spaces. Let $\mathbf{s}:=(s_1,\ldots,s_M)\in\mathcal{S}:=\prod_{i\in[M]}\mathcal{S}_i$ and $\mathbf{a}:=(a_1,\ldots,a_M)\in\mathcal{A}:=\prod_{i\in[M]}\mathcal{A}_i$ denote the joint state and action of all agents. At time $t$, a joint action $\mathbf{a}(t)$ applied at state $\mathbf{s}(t)$ triggers a transition to a new state $\mathbf{s}(t+1)\in\mathcal{S}$ according to a conditional probability density function (pdf) $p(\mathbf{s}(t+1)|\mathbf{s}(t),\mathbf{a}(t))$. After each transition, each agent $i$ receives a reward $r_i(\mathbf{s}(t),\mathbf{a}(t))$, determined by the joint state and action according to the function $r_i:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R}$. The objective of each agent $i$ is to design a policy $\mu_i:\mathcal{S}\rightarrow\mathcal{A}_i$ to maximize the expected cumulative discounted reward (known as the value function):

$$V^{\boldsymbol{\mu}}_{i}(\mathbf{s}):=\mathbb{E}_{\substack{\mathbf{a}(t)=\boldsymbol{\mu}(\mathbf{s}(t))\\ \mathbf{s}(t)\sim p}}\biggl[\sum_{t=0}^{\infty}\gamma^{t}r_{i}(\mathbf{s}(t),\mathbf{a}(t))\;\big|\;\mathbf{s}(0)=\mathbf{s}\biggr],$$

where $\boldsymbol{\mu}:=(\mu_1,\ldots,\mu_M)$ denotes the joint policy of all agents and $\gamma\in(0,1)$ is a discount factor. Alternatively, an optimal policy $\mu_i^*$ for agent $i$ can be obtained by maximizing the action-value (Q) function:

$$Q_{i}^{\boldsymbol{\mu}}(\mathbf{s},\mathbf{a}):=\mathbb{E}_{\substack{\mathbf{a}(t)=\boldsymbol{\mu}(\mathbf{s}(t))\\ \mathbf{s}(t)\sim p}}\biggl[\sum_{t=0}^{\infty}\gamma^{t}r_{i}(\mathbf{s}(t),\mathbf{a}(t))\;\big|\;\mathbf{s}(0)=\mathbf{s},\mathbf{a}(0)=\mathbf{a}\biggr]$$

and setting $\mu_i^*(\mathbf{s})\in\arg\max_{a_i}\max_{\mathbf{a}_{-i}}Q_i^*(\mathbf{s},\mathbf{a})$, where $Q_i^*(\mathbf{s},\mathbf{a}):=\max_{\boldsymbol{\mu}}Q_i^{\boldsymbol{\mu}}(\mathbf{s},\mathbf{a})$ and $\mathbf{a}_{-i}$ denotes the actions of all agents except $i$.

To develop a distributed MARL algorithm, we impose additional structure on the MARL problem. Assume that all agents share a common state space, i.e., $\mathcal{S}_i=\mathcal{S}_j$ for all $i,j\in[M]$, and let $\operatorname{dist}:\mathcal{S}_i\times\mathcal{S}_i\rightarrow\mathbb{R}$ be a distance metric on the state space. Note that the distance metric can also be defined over a common state subspace; however, for notational simplicity, a common state space is assumed here. Consider a proximity graph [30] that models the topology of the agent team. A $d$-disk proximity graph is defined as a mapping that associates the joint state $\mathbf{s}\in\mathcal{S}$ with an undirected graph $(\mathcal{V},\mathcal{E})$ such that $\mathcal{V}=\{s_1,s_2,\ldots,s_M\}$ and $\mathcal{E}=\{(s_i,s_j)\,|\,\operatorname{dist}(s_i,s_j)\leq d,\ i\neq j\}$. Define the set of one-hop neighbors of agent $i$ as $\mathcal{N}_i:=\{j\,|\,(s_i,s_j)\in\mathcal{E}\}\cup\{i\}$. We make the following assumption about the agents' motion.

Assumption 1

The distance between two consecutive states, $s_i(t)$ and $s_i(t+1)$, of agent $i$ is bounded, i.e., $\operatorname{dist}(s_i(t),s_i(t+1))\leq\epsilon$, for some $\epsilon>0$.

This assumption is satisfied in many problems where, e.g., due to physical constraints, the agent states can only change by a bounded amount in a single time step.

Define the potential neighbors of agent $i$ at time $t$ as $\mathcal{P}_i(t):=\{j\,|\,\operatorname{dist}(s_j(t),s_i(t))\leq 2\epsilon+d\}$, which captures the set of agents that may become one-hop neighbors of agent $i$ at time $t+1$. Denote the joint state and action of the one-hop neighbors of agent $i$ by $\mathbf{s}_{\mathcal{N}_i}=(s_{j_1},\ldots,s_{j_{|\mathcal{N}_i|}})$ and $\mathbf{a}_{\mathcal{N}_i}=(a_{j_1},\ldots,a_{j_{|\mathcal{N}_i|}})$, respectively, where $j_1,\ldots,j_{|\mathcal{N}_i|}\in\mathcal{N}_i$. Our key idea is to let agent $i$'s policy, $a_i=\mu_i(\mathbf{s}_{\mathcal{N}_i})$, depend only on the one-hop neighbor states $\mathbf{s}_{\mathcal{N}_i}$ instead of all agent states $\mathbf{s}$, and we assume that each agent can obtain its one-hop neighbor states through observation or communication. The intuition is that agents that are far away from agent $i$ at time $t$ have little impact on its current action $a_i(t)$. To emphasize that the output of a function $f:\prod_{i\in[M]}\mathcal{S}_i\mapsto\mathbb{R}$ is affected only by a subset $\mathcal{N}\subseteq[M]$ of the input dimensions, we use the notation $f(\mathbf{s})=f(\mathbf{s}_{\mathcal{N}})$ for $\mathbf{s}\in\mathcal{S}$ and $\mathbf{s}_{\mathcal{N}}\in\prod_{i\in\mathcal{N}}\mathcal{S}_i$ in the remainder of the paper. We make two additional assumptions on the problem structure to ensure the validity of our policy model.
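As an illustration of these definitions, the following minimal Python sketch computes the one-hop neighbor sets $\mathcal{N}_i$ and potential neighbor sets $\mathcal{P}_i$ of a $d$-disk proximity graph, under the assumption of Euclidean agent positions as states; the function and variable names are illustrative and not taken from the paper's implementation.

import numpy as np

def one_hop_neighbors(states, d):
    """N_i = {j : dist(s_i, s_j) <= d} U {i} for a d-disk proximity graph.

    states: (M, dim) array of agent states (assumed Euclidean here).
    Returns a list of M sets of agent indices.
    """
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    M = states.shape[0]
    return [set(np.flatnonzero(dists[i] <= d)) | {i} for i in range(M)]

def potential_neighbors(states, d, eps):
    """P_i = {j : dist(s_i, s_j) <= 2*eps + d}, the agents that may become
    one-hop neighbors of i at the next step (Assumption 1 bounds motion by eps)."""
    return one_hop_neighbors(states, 2 * eps + d)

# Example: 5 agents on a line, d = 1.0, eps = 0.2
states = np.array([[0.0], [0.9], [2.0], [3.5], [3.6]])
print(one_hop_neighbors(states, d=1.0))
print(potential_neighbors(states, d=1.0, eps=0.2))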

Assumption 2

The reward of agent $i$ can be fully specified using its one-hop neighbor states $\mathbf{s}_{\mathcal{N}_i}$ and actions $\mathbf{a}_{\mathcal{N}_i}$, i.e., $r_i(\mathbf{s},\mathbf{a})=r_i(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i})$.

This assumption can always be satisfied by setting $d$ to the full environment range. In this case, the one-hop neighbor reward assumption reduces to the standard reward definition, which depends on the states and actions of all agents and is applicable to general MARL problems. For environments with local reward models, a smaller distance value $d$ can be chosen based on the specific environment configuration. For example, in collision avoidance problems, an agent's reward may depend only on the states and actions of nearby agents within a safe distance. In multi-agent network or sensing problems, the one-hop neighbors can be those within communication or observation range. Next, we make a similar assumption on agent $i$'s transition model.

Assumption 3

The transition model of agent $i$ depends only on the one-hop neighbor actions $\mathbf{a}_{\mathcal{N}_i}$ and states $\mathbf{s}_{\mathcal{N}_i}$, i.e., $p_i\left(s_i(t+1)\mid\mathbf{s}_{\mathcal{N}_i}(t),\mathbf{a}_{\mathcal{N}_i}(t)\right)$.

The objective of each agent $i$ is to obtain an optimal policy $\mu_i^*$ by solving the following problem:

$$\mu_i^*(\mathbf{s}_{\mathcal{N}_i})=\arg\max_{a_i}\max_{\mathbf{a}_{-i}}Q_i^*(\mathbf{s},\mathbf{a}), \qquad (1)$$

where $Q_i^*(\mathbf{s},\mathbf{a}):=\max_{\boldsymbol{\mu}}Q_i^{\boldsymbol{\mu}}(\mathbf{s},\mathbf{a})$ is the optimal action-value (Q) function defined above.

The goal of this paper is to develop a MARL algorithm that (i) utilizes policy and value representations that scale favorably with the number of agents $M$ and (ii) allows efficient training on a distributed computing system containing compute stragglers.

3 Distributed multi-Agent Reinforcement Learning with One-hop Neighbors (DARL1N)

This section develops the DARL1N algorithm to solve the MARL problem with proximity-graph structure introduced in Sec. 2. DARL1N considers the effect of the one-hop neighbors of an agent in representing its Q and policy functions, which allows updating the Q and policy function parameters using only local one-hop neighborhood transitions.

Specifically, the Q function of each agent $i$ can be expressed as a function of its one-hop neighbor states $\mathbf{s}_{\mathcal{N}_i}$ and actions $\mathbf{a}_{\mathcal{N}_i}$, as well as the states $\mathbf{s}_{\mathcal{N}_i^-}$ and actions $\mathbf{a}_{\mathcal{N}_i^-}$ of the remaining agents that are not immediate neighbors of $i$:

$$Q_i^{\boldsymbol{\mu}}(\mathbf{s},\mathbf{a})=Q_i^{\boldsymbol{\mu}}(\mathbf{s}_{\mathcal{N}_i},\mathbf{s}_{\mathcal{N}_i^-},\mathbf{a}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i^-}). \qquad (2)$$

Inspired by the SAC algorithm [8], we approximate the Q value with a function $\tilde{Q}_i^{\boldsymbol{\mu}}$ that depends only on one-hop neighbor states and actions:

$$\tilde{Q}_i^{\boldsymbol{\mu}}(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i})=\sum_{\mathbf{s}_{\mathcal{N}_i^-},\mathbf{a}_{\mathcal{N}_i^-}}w_i(\mathbf{s}_{\mathcal{N}_i},\mathbf{s}_{\mathcal{N}_i^-},\mathbf{a}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i^-})\,Q_i^{\boldsymbol{\mu}}(\mathbf{s}_{\mathcal{N}_i},\mathbf{s}_{\mathcal{N}_i^-},\mathbf{a}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i^-})$$

where the weights $w_i(\mathbf{s}_{\mathcal{N}_i},\mathbf{s}_{\mathcal{N}_i^-},\mathbf{a}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i^-})>0$ satisfy $\sum_{\mathbf{s}_{\mathcal{N}_i^-},\mathbf{a}_{\mathcal{N}_i^-}}w_i(\mathbf{s}_{\mathcal{N}_i},\mathbf{s}_{\mathcal{N}_i^-},\mathbf{a}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i^-})=1$. The approximation error is given in the following lemma.

Lemma 1

If the absolute value of agent $i$'s reward is upper bounded as $|r_i(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i})|\leq\bar{r}$, for some $\bar{r}>0$, the approximation error between $\tilde{Q}_i^{\boldsymbol{\mu}}(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i})$ and $Q_i^{\boldsymbol{\mu}}(\mathbf{s},\mathbf{a})$ is bounded as:

$$|\tilde{Q}_i^{\boldsymbol{\mu}}(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i})-Q_i^{\boldsymbol{\mu}}(\mathbf{s},\mathbf{a})|\leq\frac{2\bar{r}\gamma}{1-\gamma}.$$
Proof 3.1.

See Appendix A.1.

We parameterize the approximated Q function $\tilde{Q}_i^{\boldsymbol{\mu}}(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i})$ and the policy $\mu_i(\mathbf{s}_{\mathcal{N}_i})$ by $\theta_i$ and $\phi_i$, respectively. To handle the varying sizes of $\mathbf{s}_{\mathcal{N}_i}$ and $\mathbf{a}_{\mathcal{N}_i}$, in the implementation we set the input dimension of $\tilde{Q}_i^{\boldsymbol{\mu}}$ to the largest possible dimension of $(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i})$ and apply zero-padding for agents that are not within the one-hop neighborhood of agent $i$. The same procedure is applied to represent $\mu_i(\mathbf{s}_{\mathcal{N}_i})$.
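A minimal sketch of this zero-padding step is shown below, assuming fixed per-agent state and action dimensions and one padded slot per agent; the helper name and layout are illustrative and not taken from the paper's implementation.

import numpy as np

def pad_neighborhood(states, actions, neighbor_ids, M, s_dim, a_dim):
    """Build a fixed-size (M*s_dim + M*a_dim,) input for Q_i from a variable
    number of one-hop neighbors; non-neighbor slots are left as zeros."""
    s_pad = np.zeros((M, s_dim))
    a_pad = np.zeros((M, a_dim))
    for j in neighbor_ids:            # neighbor_ids = N_i (includes i itself)
        s_pad[j] = states[j]
        a_pad[j] = actions[j]
    return np.concatenate([s_pad.ravel(), a_pad.ravel()])

# Example: 4 agents, 2-D states, 1-D actions, N_i = {0, 2}
states = np.random.randn(4, 2)
actions = np.random.randn(4, 1)
x = pad_neighborhood(states, actions, neighbor_ids={0, 2}, M=4, s_dim=2, a_dim=1)
print(x.shape)  # (12,)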

To learn the approximated Q function $\tilde{Q}_i^{\boldsymbol{\mu}}$, instead of incremental on-policy updates to the Q function as in SAC [8], we apply off-policy temporal-difference learning with a replay buffer, similar to MADDPG [4]. The parameters $\theta_i$ of the approximated Q function are updated by minimizing the following temporal-difference error:

$$\mathcal{L}(\theta_i)=\mathbb{E}_{(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i},r_i,\{\mathbf{s}_{\mathcal{N}_l^{\prime}}\}_{l\in\mathcal{N}_i^{\prime}})\sim\mathcal{D}_i}\Big[\big(\tilde{Q}_i^{\boldsymbol{\mu}}(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i})-y\big)^2\Big], \quad y=r_i+\gamma\hat{Q}_i^{\hat{\boldsymbol{\mu}}}(\mathbf{s}_{\mathcal{N}_i^{\prime}},\mathbf{a}_{\mathcal{N}_i^{\prime}}) \qquad (3)$$

where $\mathcal{D}_i$ is a replay buffer for agent $i$ that contains information only from $\mathcal{N}_i$ and $\mathcal{N}_i^{\prime}$, the one-hop neighbors of agent $i$ at the current and next time steps, and from the one-hop neighbors $\mathcal{N}_l^{\prime}$ of each $l\in\mathcal{N}_i^{\prime}$. Data from the one-hop neighbors of the next-step one-hop neighbors $\mathcal{N}_i^{\prime}$ are needed to compute the next-step neighbor actions $\mathbf{a}_{\mathcal{N}_i^{\prime}}$. To stabilize training, a target Q function $\hat{Q}_i^{\hat{\boldsymbol{\mu}}}$ with parameters $\hat{\theta}_i$ and a target policy function $\hat{\mu}_i$ with parameters $\hat{\phi}_i$ are used. The parameters $\hat{\theta}_i$ and $\hat{\phi}_i$ are updated using Polyak averaging: $\hat{\theta}_i=\tau\hat{\theta}_i+(1-\tau)\theta_i$, $\hat{\phi}_i=\tau\hat{\phi}_i+(1-\tau)\phi_i$, where $\tau$ is a hyperparameter. In contrast to MADDPG [4], the replay buffer $\mathcal{D}_i$ for agent $i$ only needs to store its local interactions $(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i},r_i,\{\mathbf{s}_{\mathcal{N}_l^{\prime}}\}_{l\in\mathcal{N}_i^{\prime}})$ with nearby agents, where $\{\mathbf{s}_{\mathcal{N}_l^{\prime}}\}_{l\in\mathcal{N}_i^{\prime}}$ is required to calculate $\mathbf{a}_{\mathcal{N}_i^{\prime}}$. Also, in contrast to SAC [8], each agent $i$ only needs to collect its own training data by simulating local two-hop interactions, which reduces the correlation among agents' experiences and allows efficient distributed training, as explained in Sec. 4.
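The following PyTorch sketch illustrates one critic update of this form, assuming generic MLP critics, a sampled local mini-batch, and placeholder hyperparameter values; all module, key, and variable names are illustrative rather than the paper's actual implementation.

import torch
import torch.nn.functional as F

def critic_update(Q_i, Q_i_target, mu_targets, optimizer, batch, gamma=0.95, tau=0.99):
    """One TD update for agent i's Q function using only one-hop neighbor data.

    batch contains zero-padded tensors:
      s_Ni, a_Ni           -- current one-hop neighbor states/actions
      r_i                  -- agent i's reward
      s_Ni_next, s_Nl_next -- states needed to evaluate next-step neighbor actions
    mu_targets maps each next-step neighbor l to its target policy.
    """
    with torch.no_grad():
        # a'_{N'_i}: each next-step neighbor l acts via its target policy mu_hat_l(s_{N'_l})
        a_Ni_next = torch.cat([mu_targets[l](batch["s_Nl_next"][l]) for l in batch["Ni_next"]], dim=-1)
        y = batch["r_i"] + gamma * Q_i_target(batch["s_Ni_next"], a_Ni_next)   # TD target in (3)

    loss = F.mse_loss(Q_i(batch["s_Ni"], batch["a_Ni"]), y)                    # TD error in (3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Polyak averaging of the target critic parameters: theta_hat = tau*theta_hat + (1-tau)*theta
    for p, p_targ in zip(Q_i.parameters(), Q_i_target.parameters()):
        p_targ.data.mul_(tau).add_((1 - tau) * p.data)
    return loss.item()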

Agent $i$'s policy parameters $\phi_i$ are updated using the gradient

$$\mathbf{g}(\phi_i)=\mathbb{E}_{\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i}\sim\mathcal{D}_i}\big[\nabla_{\phi_i}\mu_i(\mathbf{s}_{\mathcal{N}_i})\nabla_{a_i}\tilde{Q}_i^{\boldsymbol{\mu}}(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i})\big], \qquad (4)$$

where again only data $\mathcal{D}_i$ from local interactions is needed.
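In an automatic-differentiation framework, the deterministic policy gradient in (4) can be obtained by maximizing $\tilde{Q}_i^{\boldsymbol{\mu}}$ with agent $i$'s own action replaced by $\mu_i(\mathbf{s}_{\mathcal{N}_i})$, as in the PyTorch sketch below; the batch keys are assumptions for illustration only.

import torch

def actor_update(mu_i, Q_i, optimizer, batch):
    """One policy update for agent i following the gradient in (4).

    The padded neighbor-action input to Q_i is rebuilt with agent i's own
    action replaced by mu_i(s_Ni), so backpropagating through Q_i yields
    grad_phi mu_i * grad_{a_i} Q_i as in (4).
    """
    a_i = mu_i(batch["s_Ni"])                                   # differentiable action of agent i
    a_Ni = torch.cat([batch["a_before"], a_i, batch["a_after"]], dim=-1)
    policy_loss = -Q_i(batch["s_Ni"], a_Ni).mean()              # ascend Q  <=>  descend -Q
    optimizer.zero_grad()
    policy_loss.backward()
    optimizer.step()
    return policy_loss.item()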

To implement the parameter updates proposed above, agent $i$ needs training data $\mathcal{D}_i=(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i},r_i,\{\mathbf{s}_{\mathcal{N}_l^{\prime}}\}_{l\in\mathcal{N}_i^{\prime}})$ from its one-hop neighbors at the current and next time steps. The relation between one-hop neighbors at the current and next time steps is captured by the following proposition.

Proposition 3.2.

Under Assumption 1, if an agent $j$ is not a potential neighbor of agent $i$ at time $t$, i.e., $j\notin\mathcal{P}_i(t)$, it will not be a one-hop neighbor of agent $i$ at time $t+1$, i.e., $j\notin\mathcal{N}_i(t+1)$.

Proof 3.3.

See Appendix A.2.

Proposition 3.2 allows us to decouple the global interactions among agents and limit the necessary observations to one-hop neighbors. To collect training data, at each time step, agent $i$ first interacts with its one-hop neighbors to obtain their states $\mathbf{s}_{\mathcal{N}_i}$ and actions $\mathbf{a}_{\mathcal{N}_i}$, and computes its reward $r_i(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i})$. To obtain $\mathbf{s}_{\mathcal{N}_l^{\prime}}$ for all $l\in\mathcal{N}_i^{\prime}$, we first determine agent $i$'s one-hop neighbors at the next time step, $\mathcal{N}_i^{\prime}$. Using Proposition 3.2, we let each potential neighbor $k\in\mathcal{P}_i$ perform a transition to a new state $s_k^{\prime}\sim p_k(\cdot|\mathbf{s}_{\mathcal{N}_k},\mathbf{a}_{\mathcal{N}_k})$, which is sufficient to determine $\mathcal{N}_i^{\prime}$. Then, we let the potential neighbors $\mathcal{P}_l$ of each new neighbor $l\in\mathcal{N}_i^{\prime}$ perform transitions to determine $\mathcal{N}_l^{\prime}$ and obtain $\mathbf{s}_{\mathcal{N}_l^{\prime}}$.

Fig. 1(a) illustrates the data collection process. At time $t$, agent $i$ obtains $\mathbf{s}_{\mathcal{N}_i}$, $\mathbf{a}_{\mathcal{N}_i}$, and $r_i(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i})$ for $\mathcal{N}_i=\{i,1\}$. Then, the potential neighbors of agent $i$, $\mathcal{P}_i=\{1,2,i\}$, proceed to their next states at time $t+1$. This is sufficient to determine that $\mathcal{N}_i^{\prime}=\{i,2\}$ and obtain $\mathbf{s}_{\mathcal{N}_i^{\prime}}$. Finally, we let agent $3$, which belongs to the set $\mathcal{P}_2=\{i,1,2,3\}$, perform a transition to determine that $\mathcal{N}_2^{\prime}=\{i,2,3\}$ and obtain $\mathbf{s}_{\mathcal{N}_2^{\prime}}$.

As each agent only needs to interact with one-hop neighbors to update its parameters, the agents can be trained in parallel on a distributed computing architecture, where each compute node only needs to simulate the two-hop neighbor transitions for agents assigned to it for training.
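A minimal sketch of this local data collection step is given below, assuming a per-agent transition simulator and reusing the neighbor helpers sketched in Sec. 2; names such as env.step_agent, env.reward, and env.dist are hypothetical and not the paper's API.

def collect_local_transition(i, states, actions, env, d, eps):
    """Collect (s_Ni, a_Ni, r_i, {s_N'_l}) for agent i using only two-hop simulation.

    env.step_agent(k, states, actions) is assumed to sample s'_k from
    p_k(. | s_Nk, a_Nk); env.reward(i, ...) evaluates r_i(s_Ni, a_Ni).
    """
    N = one_hop_neighbors(states, d)                 # current one-hop neighbors
    P = potential_neighbors(states, d, eps)          # potential neighbors (Prop. 3.2)
    s_Ni = {j: states[j] for j in N[i]}
    a_Ni = {j: actions[j] for j in N[i]}
    r_i = env.reward(i, s_Ni, a_Ni)

    # Step only the agents that can matter for agent i's next-step neighborhoods:
    # i's potential neighbors and the potential neighbors of those agents.
    to_step = set().union(*(P[k] for k in P[i])) | P[i]
    next_states = dict(enumerate(states))
    for k in to_step:
        next_states[k] = env.step_agent(k, states, actions)

    Ni_next = {j for j in P[i] if env.dist(next_states[i], next_states[j]) <= d} | {i}
    s_Nl_next = {l: {j: next_states[j] for j in P[l]
                     if j == l or env.dist(next_states[l], next_states[j]) <= d}
                 for l in Ni_next}
    return s_Ni, a_Ni, r_i, s_Nl_next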

Figure 1: (a) One-hop neighbor transitions from one time step to the next in a $d$-disk proximity graph; (b) Coded distributed learning architecture.

4 Coded Distributed Learning Architecture

In this section, we introduce an efficient and resilient distributed learning architecture for training DARL1N. The coded distributed learning architecture, illustrated in Fig. 1(b), consists of a central controller and $N$ compute nodes, called learners. The central controller stores a copy of all parameters of the policy $\phi_i$, target policy $\hat{\phi}_i$, Q function $\theta_i$, and target Q function $\hat{\theta}_i$, for all $i\in[M]$. In each training iteration, the central controller broadcasts all agents' parameters to all learners, which then calculate and return the gradients required for updating the parameters. In a traditional uncoded distributed learning architecture, each agent is trained (i.e., has its policy and value gradients computed) by a single learner. If any learner becomes slow or unresponsive, i.e., a straggler, the whole training procedure is delayed or may fail. Our coded distributed learning architecture addresses the possible presence of stragglers in the computing system by introducing redundant computations. We let more than one learner train each agent, which not only improves the system's resilience to stragglers but also accelerates training, as we show in Sec. 6.2. To describe which learners are assigned to train each agent, we introduce an assignment matrix $\mathbf{C}\in\mathbb{R}^{N\times M}$, where a non-zero entry $c_{j,i}\neq 0$ indicates that learner $j\in[N]$ is assigned to train agent $i\in[M]$. The complete set of learners assigned to train agent $i$ is then $\{j\,|\,c_{j,i}\neq 0,\ j\in[N]\}$. To construct the assignment matrix $\mathbf{C}$, we apply coding theory, as explained in Sec. 5.

To calculate the gradients for an agent $i$, each learner $j$ with $c_{j,i}\neq 0$ simulates transitions to obtain the interaction data $(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i},r_i,\{\mathbf{s}_{\mathcal{N}_l^{\prime}}\}_{l\in\mathcal{N}_i^{\prime}})$ as described in Sec. 3, which is stored in a replay buffer $\mathcal{D}_{j,i}$. After that, learner $j$ calculates the gradients of the temporal-difference error needed for updating the Q function parameters $\theta_i$ of agent $i$ using (3) and for updating the policy parameters $\phi_i$ using (4).

As the replay buffer $\mathcal{D}_{j,i}$ can be large, to improve efficiency we use a mini-batch $\mathcal{B}_{j,i}$ uniformly sampled from $\mathcal{D}_{j,i}$ to estimate the expectations in (3)-(4). In particular, the temporal-difference error in (3) is estimated with:

$$\hat{\mathcal{L}}_j(\theta_i)=\frac{1}{|\mathcal{B}_{j,i}|}\sum_{(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i},r_i,\{\mathbf{s}_{\mathcal{N}_l^{\prime}}\}_{l\in\mathcal{N}_i^{\prime}})\in\mathcal{B}_{j,i}}\big(\tilde{Q}_i^{\boldsymbol{\mu}}(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i})-y\big)^2, \quad y=r_i+\gamma\hat{Q}_i^{\hat{\boldsymbol{\mu}}}(\mathbf{s}_{\mathcal{N}_i^{\prime}},\mathbf{a}_{\mathcal{N}_i^{\prime}}). \qquad (5)$$

Similarly, the gradients used to update policy parameters are estimated with:

$$\hat{\mathbf{g}}_j(\phi_i)=\frac{1}{|\mathcal{B}_{j,i}|}\sum_{(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i})\in\mathcal{B}_{j,i}}\nabla_{\phi_i}\mu_i(\mathbf{s}_{\mathcal{N}_i})\nabla_{a_i}\tilde{Q}_i^{\boldsymbol{\mu}}(\mathbf{s}_{\mathcal{N}_i},\mathbf{a}_{\mathcal{N}_i}). \qquad (6)$$

Let $\hat{\mathbf{e}}_{j,i}=[\nabla\hat{\mathcal{L}}_j(\theta_i),\hat{\mathbf{g}}_j(\phi_i)]$ denote the concatenation of the estimated gradients. Instead of directly returning the estimated gradients for all agents trained by learner $j$, i.e., $\{\hat{\mathbf{e}}_{j,i}\,|\,i\in[M],\ c_{j,i}\neq 0\}$, learner $j$ calculates a linear combination of the gradients, $y_j=\sum_{i=1}^M c_{j,i}\hat{\mathbf{e}}_{j,i}$, with weights provided by the assignment matrix $\mathbf{C}$, and returns $y_j$ to the central controller.

At the central controller, let $\mathbf{y}_{\mathcal{J}}$ denote the results that have arrived by a certain time from the learners $\mathcal{J}=\{j\,|\,y_j\text{ is received}\}$. Moreover, let $\mathbf{C}_{\mathcal{J}}\in\mathbb{R}^{|\mathcal{J}|\times M}$ be the submatrix of $\mathbf{C}$ formed by the $j$-th rows of $\mathbf{C}$ for $j\in\mathcal{J}$. The received gradients $\mathbf{y}_{\mathcal{J}}$ satisfy:

$$\mathbf{y}_{\mathcal{J}}=\mathbf{D}\mathbf{q}, \qquad \mathbf{q}=[\hat{\mathbf{e}}_{1,1};\hat{\mathbf{e}}_{1,2};\ldots;\hat{\mathbf{e}}_{N,M-1};\hat{\mathbf{e}}_{N,M}] \qquad (7)$$

where $\mathbf{D}\in\mathbb{R}^{|\mathcal{J}|\times MN}$ is constructed as follows: for the $i$-th row of $\mathbf{D}$, fill the $(1+(\mathcal{J}_i-1)M)$-th to $(\mathcal{J}_i M)$-th entries with the $i$-th row of $\mathbf{C}_{\mathcal{J}}$ and set all other entries to 0, where $\mathcal{J}_i$ denotes the $i$-th element of $\mathcal{J}$. The vector $\mathbf{q}$ is the concatenation of all gradients estimated by all learners. The central controller updates the agents' parameters once it receives enough results to decode all estimated gradients, denoted $\tilde{\mathbf{e}}$. This happens when $\operatorname{rank}(\mathbf{C}_{\mathcal{J}})=M$, and the decoding equation is given as follows:

$$\tilde{\mathbf{e}}=(\mathbf{C}_{\mathcal{J}}^{T}\mathbf{C}_{\mathcal{J}})^{-1}\mathbf{C}_{\mathcal{J}}^{T}\mathbf{y}_{\mathcal{J}}. \qquad (8)$$
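The learner-side encoding $y_j=\sum_i c_{j,i}\hat{\mathbf{e}}_{j,i}$ and the decoding in (8) can be illustrated with the following NumPy sketch, where each per-agent gradient is flattened into a vector; the setup is an illustrative check, not the paper's implementation.

import numpy as np

rng = np.random.default_rng(0)
M, N, g = 4, 6, 5                         # agents, learners, flattened gradient length
C = rng.normal(size=(N, M))               # assignment/encoding matrix (dense here)

# For this check, let every learner compute the same per-agent gradient so that
# decoding is exact; in Coded DARL1N each learner uses its own mini-batch, so the
# estimates differ and (8) returns an unbiased least-squares estimate (Theorem 4.4).
E_true = rng.normal(size=(M, g))
E_hat = np.broadcast_to(E_true, (N, M, g))

# Learner-side encoding: y_j = sum_i c_{j,i} * e_hat_{j,i}
y = np.einsum("jm,jmg->jg", C, E_hat)

# Only a subset J of learners responds (the rest are stragglers).
J = [0, 2, 3, 5]
C_J, y_J = C[J], y[J]
assert np.linalg.matrix_rank(C_J) == M    # decodability condition rank(C_J) = M

# Central-controller decoding, eq. (8): e_tilde = (C_J^T C_J)^{-1} C_J^T y_J
e_tilde = np.linalg.solve(C_J.T @ C_J, C_J.T @ y_J)
assert np.allclose(e_tilde, E_true)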

Alg. 1 summarizes the coded training procedure of DARL1N over a distributed computing architecture, which we refer to as Coded DARL1N.

In Coded DARL1N, the gradients $\tilde{\mathbf{e}}$ used by the central controller for parameter updates are stochastic gradients computed from mini-batch samples, which are estimates of the true gradients $\mathbf{e}=[\mathbf{e}_1,\ldots,\mathbf{e}_M]$, where $\mathbf{e}_i=[\nabla\mathcal{L}(\theta_i),\mathbf{g}(\phi_i)]$ with $\mathcal{L}(\theta_i)$ and $\mathbf{g}(\phi_i)$ defined in (3) and (4), respectively. The estimation performance is characterized in the following theorem.

Theorem 4.4.

The mini-batch stochastic gradients $\tilde{\mathbf{e}}$ computed by Coded DARL1N are unbiased estimates of the true gradients $\mathbf{e}$, with variance determined by the assignment matrix $\mathbf{C}$.

Proof 4.5.

See Appendix A.3.

Based on Theorem 4.4, we can infer that Coded DARL1N converges asymptotically similarly to other stochastic gradient descent-based methods [31].

Algorithm 1: Coded DARL1N

// Central controller:
Initialize the policy, target policy, Q, and target Q parameters $\boldsymbol{\phi}=\{\phi_i,\hat{\phi}_i\}_{i\in[M]}$ and $\boldsymbol{\theta}=\{\theta_i,\hat{\theta}_i\}_{i\in[M]}$;
Broadcast $\boldsymbol{\phi},\boldsymbol{\theta}$ to the learners;
$\mathbf{y}_{\mathcal{J}}\leftarrow[\,]$;
do
    Listen to the channel and collect $y_j$ from the learners: $\mathbf{y}_{\mathcal{J}}\leftarrow[\mathbf{y}_{\mathcal{J}},y_j]$, $j\in[N]$;
while $\tilde{\mathbf{e}}$ is not recoverable;
Send acknowledgements to the learners;
Update $\boldsymbol{\phi},\boldsymbol{\theta}$ with $\tilde{\mathbf{e}}$;

// Learner $j$:
Initialize replay buffer $\mathcal{D}_{j,i}$;
for iter = 1 : max_iteration do
    Listen to the channel;
    if $\boldsymbol{\phi},\boldsymbol{\theta}$ received from the central controller then
        $y_j\leftarrow 0$; $i\leftarrow 1$;
        while $i\leq M$ and no acknowledgement received do
            if $c_{j,i}\neq 0$ then
                // Local interactions:
                Perform local interactions to collect training data for agent $i$ and store the data in $\mathcal{D}_{j,i}$;
                Sample a mini-batch from $\mathcal{D}_{j,i}$ and calculate $\hat{\mathbf{e}}_{j,i}$ using (4)-(6);
                $y_j\leftarrow y_j+c_{j,i}\hat{\mathbf{e}}_{j,i}$;
            $i\leftarrow i+1$;
            Send the updated $y_j$ to the central controller;

5 Assignment Matrix Construction and Assessment

The assignment matrix $\mathbf{C}$ affects both the policy variance and the computational efficiency of Coded DARL1N. In this section, we explore different schemes, both uncoded and coded, for constructing the assignment matrix, and conduct theoretical analyses of their performance.

5.1 Assignment Matrix Construction

5.1.1 Uncoded Assignment Scheme

In an uncoded distributed training architecture, different learner nodes train different agents exclusively. The assignment matrix can then be constructed as $\mathbf{C}^{\text{Uncoded}}=[\mathbf{I}_M\,|\,\mathbf{0}]^T$, where $\mathbf{I}_M\in\mathbb{R}^{M\times M}$ is the identity matrix.

5.1.2 Coded Assignment Schemes

Coded distributed training assigns each agent to multiple learners. Here, we investigate four codes, whose encoding matrices can be directly utilized as the assignment matrix; a code sketch illustrating these constructions follows the list.

  • MDS Code: An MDS code [32] is an erasure code with the property that any square submatrix of its encoding matrix $\mathbf{C}^{\text{MDS}}$ has full rank. A Vandermonde matrix [33, 16] is commonly used for encoding, with the $(j,i)$-th entry given by $\mathbf{C}_{j,i}=\alpha_i^{j-1}$, where the $\alpha_i\neq 0$, $i\in[M]$, can be any distinct non-zero real numbers.

  • Random Sparse Code: Compared to an MDS code, a Random Sparse code [34] results in a sparser assignment matrix, with the $(j,i)$-th entry given by:

    $$\mathbf{C}^{\text{Random}}_{j,i}=\begin{cases}0,&\text{with probability } 1-\xi,\\ \zeta,&\text{with probability } \xi,\end{cases} \qquad (9)$$

    where $\zeta\sim\mathcal{N}(0,1)$ and $\xi\in[0,1]$.

  • Repetition Code: A Repetition code [34] assigns agents to the learners repetitively in a round-robin fashion. The $(j,i)$-th entry of the assignment matrix is given by:

    $$\mathbf{C}^{\text{Repetition}}_{j,i}=\begin{cases}1,&\text{if } i=(j\bmod M),\\ 0,&\text{otherwise},\end{cases} \qquad (10)$$

    where $\bmod$ is the modulo operator.

  • LDGM Code: An LDGM code [35] is a special type of Low Density Parity Check code [36] that constructs a sparser assignment matrix. By applying a systematic biased random code ensemble [35], the LDGM assignment matrix takes the form $\mathbf{C}^{\text{LDGM}}=[\mathbf{I}_M\,|\,\hat{\mathbf{P}}]^T$, where each entry of $\hat{\mathbf{P}}$ is generated independently according to a Bernoulli distribution with success probability $\text{Pr}(\hat{\mathbf{P}}_{i,j}=1)=\rho$. Note that when $\rho\leq\frac{1}{2}$, the LDGM assignment matrix has low density.
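The following NumPy sketch illustrates how these assignment matrices could be constructed from the definitions above; it is a minimal illustration rather than the paper's implementation.

import numpy as np

def uncoded(N, M):
    """C = [I_M | 0]^T: learner i trains agent i, remaining learners are idle."""
    return np.vstack([np.eye(M), np.zeros((N - M, M))])

def mds_vandermonde(N, M):
    """MDS encoding via a Vandermonde matrix: C[j, i] = alpha_i^j with distinct alpha_i."""
    alpha = np.arange(1, M + 1, dtype=float)            # any distinct non-zero reals
    return alpha[None, :] ** np.arange(N)[:, None]      # row j is alpha**j (j = 0..N-1)

def random_sparse(N, M, xi, rng):
    """Entry is N(0,1) with probability xi and 0 otherwise, as in (9)."""
    mask = rng.random((N, M)) < xi
    return mask * rng.normal(size=(N, M))

def repetition(N, M):
    """Round-robin assignment of agents to learners, as in (10)."""
    C = np.zeros((N, M))
    for j in range(N):
        C[j, j % M] = 1.0
    return C

def ldgm(N, M, rho, rng):
    """C = [I_M | P_hat]^T with Bernoulli(rho) entries in P_hat."""
    P_hat = (rng.random((M, N - M)) < rho).astype(float)
    return np.vstack([np.eye(M), P_hat.T])

rng = np.random.default_rng(0)
for C in (uncoded(24, 12), mds_vandermonde(24, 12), random_sparse(24, 12, 0.8, rng),
          repetition(24, 12), ldgm(24, 12, 0.3, rng)):
    print(C.shape, np.count_nonzero(C))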

5.2 Analysis and Comparison of Assignment Schemes

We provide a theoretical analysis and comparison of different assignment schemes from the following aspects: 1) computation overhead and 2) resilience to stragglers.

5.2.1 Computation Overhead

The coded schemes mitigate the impact of stragglers by assigning each agent to multiple learners. The training performed by the extra learners is redundant. To quantify the computation overhead introduced by this redundancy, we use the following metric:

$$o_c=\frac{1}{M}\sum_{j=1}^{N}\sum_{i=1}^{M}\mathds{1}_{\mathbf{C}_{j,i}\neq 0}-1,$$

where the first term on the right-hand side is the average number of learners used to train each agent, and $o_c\geq 0$. Using the above metric, the computation overhead of each assignment scheme can be derived as follows:

  • Uncoded: $o_c^{\text{Uncoded}}=0$, as each agent is assigned to only one learner in the uncoded scheme.

  • MDS Code: $o_c^{\text{MDS}}=N-1$. All entries of the MDS assignment matrix are non-zero, indicating that each agent is assigned to all learners.

  • Random Sparse Code: $o_c^{\text{Random}}$ depends on the parameter $\xi$, but its expectation is $\mathbb{E}(o_c^{\text{Random}})=\xi N-1$.

  • Repetition Code: $o_c^{\text{Repetition}}=\frac{N}{M}-1$.

  • LDGM Code: $\mathbb{E}(o_c^{\text{LDGM}})=(N-M)\rho$.

Among these schemes, the MDS code incurs the highest computation overhead, while the uncoded scheme incurs the lowest. The overhead introduced by the Random Sparse and LDGM codes depends on their parameters, $\xi$ and $\rho$.
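As a quick numerical check of these expressions, the overhead metric can be computed directly from an assignment matrix, e.g., with the minimal helper below (reusing the illustrative constructors sketched in Sec. 5.1).

import numpy as np

def computation_overhead(C):
    """o_c = (1/M) * (number of non-zero entries of C) - 1."""
    return np.count_nonzero(C) / C.shape[1] - 1

# With N = 24 learners and M = 12 agents one obtains, e.g.,
#   uncoded -> 0,  MDS (Vandermonde) -> N - 1 = 23,  repetition -> N/M - 1 = 1,
# and values close to xi*N - 1 and (N - M)*rho for Random Sparse and LDGM.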

5.2.2 Resilience to Stragglers

According to (8), the central controller can update the agents' gradients only after receiving results from enough learners, specifically when $\operatorname{rank}(\mathbf{C}_{\mathcal{J}})=M$. To evaluate the resilience of the assignment schemes to stragglers, we analyze the probability of each scheme being influenced by stragglers under the following assumption.

Assumption 4

In each training iteration, each compute node in the distributed computing system has probability $\eta\in[0,1]$ of becoming a straggler.

The Random Sparse and LDGM codes have randomly generated entries that depend on the parameters $\xi$ and $\rho$, making them hard to analyze theoretically. We therefore focus our analysis on the Uncoded, MDS, and Repetition schemes.

Proposition 5.6.

The probability that the Uncoded scheme is influenced by stragglers is $1-(1-\eta)^M$.

Proposition 5.7.

The probability that the MDS code is influenced by stragglers is $\sum_{j=N-M+1}^{N}\binom{N}{j}(1-\eta)^{N-j}\eta^{j}$.

Proposition 5.8.

The probability that the Repetition code is influenced by stragglers is $1-(1-\eta^{\frac{N}{M}})^M$, given that $\frac{N}{M}$ is a positive integer.

The proof of Proposition 5.6 is direct since the Uncoded scheme is affected by any straggler. The proofs of Propositions 5.7 and 5.8 are provided in Appendices A.4 and A.5, respectively.
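A small Monte Carlo sketch of these probabilities, under Assumption 4 with stragglers returning nothing, is given below; it is an illustrative check rather than part of the paper's analysis.

import numpy as np

def straggler_influence_rate(C, eta, trials=20000, rng=None):
    """Fraction of iterations in which the surviving rows of C cannot decode,
    i.e., rank(C_J) < M, when each learner independently straggles w.p. eta."""
    rng = rng or np.random.default_rng(0)
    N, M = C.shape
    failures = 0
    for _ in range(trials):
        alive = rng.random(N) >= eta               # non-straggler learners
        if np.linalg.matrix_rank(C[alive]) < M:
            failures += 1
    return failures / trials

# Example with N = 24, M = 12, eta = 0.2, to compare against Propositions 5.6-5.8:
#   uncoded:    1 - (1 - eta)**M
#   MDS:        sum_{j=N-M+1}^{N} C(N, j) (1 - eta)^(N - j) eta^j
#   repetition: 1 - (1 - eta**(N // M))**M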

6 Experiments

In this section, we evaluate the DARL1N algorithm and our coding schemes for mitigating compute stragglers.

6.1 Performance of DARL1N

We conduct a series of comparisons between DARL1N and four state-of-the-art MARL algorithms. For fair comparison, we train DARL1N using a distributed learning architecture with uncoded assignments, and run our experiments on the Amazon EC2 computing clusters [37], which are considered reliable and free of stragglers.

6.1.1 Experiment Settings

Environment Configurations

We evaluate DARL1N in five environments, including the Ising Model [9], Food Collection, Grassland, Adversarial Battle [10], and Multi-Access Wireless Communication [8], which cover cooperative and mixed cooperative competitive games. Please refer to [8, 9, 10] for the description of each environment.

To understand the scalability of our method, we vary the number of agents $M$ and the size of the local state spaces. The specific configurations for the first four environments can be found in the conference version [29]. In the Multi-Access Wireless Communication environment, which was not considered in [29], we adopt the setting in [8] and consider a grid of $3\times 3$ agents, each with state space $\mathcal{S}_i=\{0,1\}^z$ indicating whether there is a packet to send by time step $z$, where $z$ is set to either $z=2$ or $z=10$.

Neighborhood Configuration

In both the Ising Model and Multi-Access Wireless Communication environments, the agents are arranged in a two-dimensional lattice graph, with rewards depending solely on their proximal agents. Consequently, an agent's one-hop neighbors are naturally defined as those directly connected to it, including itself. In the other three environments, the agents are trained to avoid one another, so the one-hop neighbor distance $d$ is naturally set to the Euclidean safety distance. Specifically, the safety distance (or $d$) is set to $0.15, 0.2, 0.25, 0.3, 0.35$ for $M=3, 6, 12, 24, 48$, respectively. Each agent observes its one-hop neighbors to obtain their states. Other distance metrics that account for velocity can be employed, which is left for future work.

Benchmarks

We compare our method with four state-of-the-art MARL algorithms: MADDPG [4], MFAC [9], EPC [10], and SAC [8]. The SAC algorithm only works in the Multi-Access Wireless Communication environment due to the reward assumption.

Evaluation Metrics

We measure the performance using two criteria: training efficiency and policy quality. To measure the training efficiency, we use two metrics: 1) average training time spent to run a specified number of training iterations and 2) convergence time. The convergence time is defined as the time when the variance of the average total training reward over 90 consecutive iterations does not exceed 2% of the absolute mean reward, where the average total training reward is the total reward of all agents averaged over 10 episodes in three training runs with different random seeds. To measure policy quality, we use convergence reward, which is the average total training reward at the convergence time.
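The convergence-time criterion can be evaluated directly from a logged reward curve, as in the sketch below, which is an illustrative implementation of the stated 90-iteration, 2% rule rather than the paper's code.

import numpy as np

def convergence_index(rewards, window=90, tol=0.02):
    """Return the first iteration index at which the variance of the average
    total training reward over `window` consecutive iterations does not exceed
    tol * |mean reward| over that window; None if the criterion is never met."""
    rewards = np.asarray(rewards, dtype=float)
    for t in range(len(rewards) - window + 1):
        seg = rewards[t:t + window]
        if seg.var() <= tol * abs(seg.mean()):
            return t + window - 1
    return None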

Figure 2: Average training time of different methods to run (a) 10 iterations in the Ising Model, (b) 30 iterations in the Food Collection, (c) 30 iterations in the Grassland, and (d) 30 iterations in the Adversarial Battle environments.
Computing Configurations

The computing resources are configured so that DARL1N utilizes roughly the same or fewer resources than the baseline methods, as described in [29]. For the Multi-Access Wireless Communication environment, we employ the Amazon EC2 c5n.large instance for DARL1N training, the z1d.3xlarge instance for MADDPG and MFAC, and the c5.12xlarge instance for EPC training. The configurations of the training parameters, as well as the representations of the policy and Q functions, can be found in [29].

6.1.2 Comparative Studies

Ising Model
Table 1: Convergence time and convergence reward of different methods in the Ising Model environment.

                 Convergence Time (s)            Convergence Reward
Method           M=9    M=16   M=25   M=64       M=9    M=16   M=25   M=64
MADDPG           62     263    810    1996       460    819    1280   1831
MFAC             63     274    851    2003       468    814    1276   1751
EPC              101    26     51     62         468    831    1278   3321
EPC Scratch      101    412    993    2995       468    826    1275   2503
DARL1N           38     102    210    110        465    828    1279   2282

As shown in Tab. 1, when the number of agents is small ($M=9$), all methods achieve roughly the same reward. DARL1N takes the least time to converge, while EPC takes the longest. When the number of agents increases, EPC converges immediately, and the convergence reward it achieves when $M=64$ is much higher than that of the other methods. The reason is that, in the Ising Model, each agent only needs information from its four fixed neighbors, so in EPC the policy obtained from the previous stage can be applied directly to the current stage. The other methods train the agents from scratch without curriculum learning. For illustration, we also show the performance achieved by training EPC from scratch without curriculum learning (labeled EPC Scratch in Tab. 1). The results show that EPC Scratch converges much more slowly than EPC as the number of agents increases. Note that when the number of agents is 9, EPC and EPC Scratch are identical. Moreover, DARL1N achieves a reward comparable to that of EPC Scratch but converges much faster. From Fig. 2, we can observe that DARL1N requires much less time per training iteration than the benchmark methods.

Food Collection
Table 2: Convergence time and convergence reward of different methods in the Food Collection environment.

                 Convergence Time (s)            Convergence Reward
Method           M=3    M=6    M=12   M=24       M=3    M=6    M=12   M=24
MADDPG           501    1102   4883   2005       24     24     -112   -364
MFAC             512    832    4924   2013       20     23     -115   -362
EPC              1314   723    2900   8104       31     34     -16    -87
DARL1N           502    382    310    1830       14     25     13     -61

Figure 3: Average total training reward of different methods in the Food Collection environment when there are (a) $M=12$ and (b) $M=24$ agents.

Tab. 2 shows that, in Food Collection, when the problem scale is small, DARL1N, MADDPG, and MFAC achieve similar performance in terms of policy quality. As the problem scale increases, the performance of MADDPG and MFAC degrades significantly and becomes much worse than that of DARL1N or EPC when $M=12$ and $M=24$, as also illustrated in Fig. 3. The convergence reward achieved by DARL1N is comparable to, and sometimes higher than, that achieved by EPC. Moreover, DARL1N has the fastest convergence among all methods in all scenarios. Notably, the convergence time of DARL1N and EPC increases as $M$ grows to 24, while that of MADDPG and MFAC decreases. This occurs because MADDPG and MFAC fail to handle such large-scale settings, causing them to stop learning earlier.

To evaluate the impact of the proposed one-hop neighbor reward formulation on the learning performance, we also present in Fig. 3 the training rewards of DARL1N with a standard reward definition, labeled as DARL1N (Full Range), whose neighbor distance $d$ is set to cover the entire environment, thereby including all agents as one-hop neighbors. The results show that DARL1N (Full Range) achieves performance comparable to EPC but performs worse than DARL1N with a small number of agents considered as one-hop neighbors. This suggests that in the Food Collection environment, agent behavior primarily depends on interactions with a nearby, smaller group of agents. Fig. 2 illustrates the training time of DARL1N (Full Range), which increases compared to DARL1N due to the inclusion of more agents in the reward calculations.

Fig. 2 also presents the training times of the four benchmarks. Among all methods compared, DARL1N achieves the highest training efficiency, and its training time grows linearly as the number of agents increases. When $M=24$, EPC takes the longest time to train. This is because of the complex policy and Q neural network architectures in EPC, whose input dimensions grow linearly and quadratically, respectively, with the number of agents.

To demonstrate DARL1N’s applicability to general MARL problems with global reward and transition models, we conduct a comparison study using a variant of the Food Collection environment where agents must coordinate to exclusively collect all the food. As shown in Fig. 4(a), DARL1N achieves the highest reward level with the fastest training speed. This is due to its distributed learning architecture, which reduces training experience correlation and accelerates training through parallel computing, even without decomposition. In contrast, EPC performs significantly worse, likely because curriculum learning struggles with global agent coordination.

Figure 4: (a) Average total training reward of different methods in the Food Collection environment with global reward and transition models when there are $M=8$ agents, and (b) mean and standard deviation of the normalized total reward of competing agents trained by different methods in the Adversarial Battle environment with $M=48$.
Grassland
Table 3: Convergence time and convergence reward of different methods in the Grassland environment.

                 Convergence Time (s)            Convergence Reward
Method           M=6    M=12   M=24   M=48       M=6    M=12   M=24   M=48
MADDPG           423    6271   2827   1121       21     11     -302   -612
MFAC             431    7124   3156   1025       23     9      -311   -608
EPC              4883   2006   3324   15221      12     38     105    205
DARL1N           103    402    1752   5221       18     46     113    210

Similar to the results in the Food Collection environment, the policy generated by DARL1N is as good as or better than those generated by the benchmark methods, as shown in Tab. 3 and Fig. 2, especially when the problem scale is large. DARL1N also has the fastest convergence and takes the shortest time to run a training iteration.

Adversarial Battle
Table 4: Convergence time and convergence reward of different methods in the Adversarial Battle environment.

                 Convergence Time (s)            Convergence Reward
Method           M=6    M=12   M=24   M=48       M=6    M=12   M=24   M=48
MADDPG           452    1331   1521   7600       -72    -211   -725   -1321
MFAC             463    1721   1624   6234       -73    -221   -694   -1201
EPC              1512   1432   2041   9210       -75    -215   -405   -642
DARL1N           121    756    1123   3110       -71    -212   -410   -682

Figure 5: Average total training reward of DARL1N and SAC in the Multi-Access Wireless Communication environment when (a) $z=2$ and (b) $z=10$.

In this environment, DARL1N again achieves good performance in terms of policy quality and training efficiency compared to the benchmark methods, as shown in Tab. 4 and Fig. 2. To further evaluate the performance, we reconsider the last scenario ($M=48$) and train the good agents and adversary agents using two different methods. The trained good agents and adversary agents then compete with each other. We apply Min-Max normalization to measure the normalized total reward achieved by the agents on each side in an episode. To reduce uncertainty, we generate 10 episodes and record the mean values and standard deviations. As shown in Fig. 4(b), DARL1N achieves the best performance, and both DARL1N and EPC significantly outperform MADDPG and MFAC.

Multi-Access Wireless Communication

Fig. 5 shows that SAC achieves a higher reward than DARL1N when $z$ is small. However, when $z$ increases, which causes an exponential growth of the state space, DARL1N achieves a much higher reward and converges much faster than SAC. This demonstrates that DARL1N scales better than SAC with the size of the state space.

6.1.3 Impact of Neighbor Distance

The neighbor distance parameter $d$ in DARL1N determines the number of one-hop neighbors of an agent, thereby influencing both training efficiency and policy quality. To evaluate its impact, we conduct experiments in the Grassland environment with three good and three adversary agents. The rewards are set to $10$ for good agents collecting a grass pellet and $-100$ for colliding with adversary agents.

The results shown in Fig. 6 indicate that, as the neighbor distance $d$ increases, the training reward increases while the training time also rises. This stems from the increased number of one-hop neighbors each agent must consider, which requires each learner to collect and process more data. This reveals a trade-off between training efficiency and policy quality controlled by the neighbor distance, which can be chosen to achieve a good balance.

6.2 Performance of Coded Distributed Learning Architecture

In this section, we first conduct numerical studies to evaluate the performance of different assignment schemes described in Sec. 5. We then train DARL1N over the proposed coded distributed learning architecture and conduct experimental studies to evaluate its performance in mitigating the effect of computing stragglers.

Figure 6: (a) Average training time of 30 iterations and (b) average total training reward of DARL1N in the Grassland environment as $d$ increases, with $M=6$.
Table 5: Comparison of different assignment schemes in terms of computation overhead, success rate, and average V.

                           Computation    Success rate                  Average V
Scheme                     overhead       η=0    η=0.2   η=0.5          |J|=12   |J|=18   |J|=24
Uncoded                    0              1      0       0              0        0        0
MDS                        23             1      1       1              124.82   79.75    73.64
Random Sparse (ξ=0.2)      4.5            1      0.78    0.17           11.48    1.62     -0.68
Random Sparse (ξ=0.4)      8.0            1      1       0.98           16.82    2.93     -1.28
Random Sparse (ξ=0.8)      18.3           1      1       1              13.15    1.40     -3.31
Repetition                 1              1      0.603   0              0        0        -8.32
LDGM (ρ=0.1)               0.91           1      0.253   0              2.08     -1.46    -4.59
LDGM (ρ=0.3)               4.41           1      0.79    0.11           8.93     1.05     -3.21
LDGM (ρ=0.5)               5.5            1      0.99    0.41           13.13    5.13     -0.31

6.2.1 Numerical Evaluation

We conduct numerical studies to evaluate the performance of different assignment schemes in the following three aspects.

  • Computation overhead: Metric (5.2.1) is applied; for the Random Sparse and LDGM schemes, the overhead is averaged over 10 experiment runs.

  • Resilience to stragglers: We use the success rate, computed as follows. Learners are randomly turned into stragglers that fail to return any results, according to Assumption 4. Monte Carlo simulations are then conducted to measure the success rate, i.e., the ratio of training iterations in which the gradients can be successfully estimated from the results returned by non-stragglers (see the sketch after this list).

  • Impact on policy quality: According to Theorem 4.4, the gradients estimated by Coded DARL1N are unbiased, but their variance depends on the assignment matrix. Therefore, we use the variance of the estimated gradients, 𝕍[𝐞~], to assess the impact of the assignment schemes on policy quality. Specifically, we vary the number of learners whose results are used by the central controller to estimate the gradients and compute the average value of V := log(det(𝕍[𝐞~])) over 100 Monte Carlo simulation runs.
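The sketch below illustrates, under stated assumptions, how the success rate can be estimated: each learner independently becomes a straggler with probability η, and an iteration counts as successful if the assignment-matrix rows returned by non-stragglers still span all M agents, i.e., rank(C_J) = M. The small repetition-style assignment matrix used here is purely illustrative.

import numpy as np

def success_rate(C: np.ndarray, eta: float, trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of simulated iterations in which the gradients are recoverable.

    C is the N x M assignment matrix; each learner (row) independently becomes
    a straggler with probability eta and returns nothing.
    """
    rng = np.random.default_rng(seed)
    N, M = C.shape
    ok = 0
    for _ in range(trials):
        alive = rng.random(N) >= eta           # non-straggler learners
        C_J = C[alive]                         # rows that were actually returned
        if C_J.shape[0] >= M and np.linalg.matrix_rank(C_J) == M:
            ok += 1
    return ok / trials

# Illustrative assignment: M = 3 agents, N = 6 learners, each agent assigned to two learners.
C_rep = np.vstack([np.eye(3), np.eye(3)])
print(success_rate(C_rep, eta=0.2))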

Tab. 5 presents results for M=12 and N=24. The performance of the Random Sparse and LDGM schemes, characterized by the parameters ξ and ρ, respectively, is evaluated across different parameter values. The results show that the Uncoded scheme has the lowest computation overhead and the smallest variance in most scenarios, but the poorest resilience to stragglers. In contrast, the MDS scheme exhibits the best resilience to stragglers but the largest computation overhead and variance. The Repetition scheme attains the smallest average V among all schemes and the lowest computation overhead among the coded schemes, though it is relatively less resilient. For the Random Sparse and LDGM schemes, increasing ξ or ρ leads to higher computation overhead and improved resilience to stragglers; however, the Random Sparse scheme generally exhibits larger variance than LDGM. In the following experiments, we set ξ=0.8 and ρ=0.3.

6.2.2 Experiments

To understand the performance of the coded distributed learning architecture, we train DARL1N using different assignment schemes and evaluate its performance in different straggler scenarios simulated on Amazon EC2.

Experiment Settings

We select the Food Collection environment and set the number of agents and learners in all experiments to M=12 and N=24, respectively. To evaluate the impact of stragglers, we vary the straggler probability η in Assumption 4. As Amazon EC2 computing instances are generally stable, we simulate stragglers by having selected compute nodes delay returning their results by Δ > 0 units of time. Evaluations on computing systems where stragglers are more common, such as wireless and mobile computing systems, are deferred to future work.
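A minimal sketch of how such a straggler can be simulated on an otherwise stable node is given below; it assumes a learner that computes its local gradient estimate and then, with probability η, sleeps for Δ seconds before returning. The function names are illustrative, not the actual implementation.

import random
import time

def learner_step(compute_gradients, eta: float = 0.2, delta: float = 4.0):
    """Run one learner iteration, optionally acting as a simulated straggler."""
    grads = compute_gradients()      # normal local computation
    if random.random() < eta:        # become a straggler with probability eta
        time.sleep(delta)            # delay returning the results by Delta seconds
    return grads

print(learner_step(lambda: [0.0, 0.0], eta=1.0, delta=0.1))  # toy demo with a 0.1 s delay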

Experiment Results

We first evaluate the average training time of DARL1N with different assignment schemes while varying the straggler probability η and the straggler effect Δ. The results are shown in Fig. 7(a) for Δ=1 and Fig. 7(b) for Δ=4, where the training time is measured by averaging the time to run 30 training iterations. When no stragglers exist (η=0), the Uncoded scheme is the most efficient, as it has zero computation overhead, while the MDS and Random Sparse schemes require a much longer training time than the other schemes due to their substantial computation overhead. When stragglers exist (η>0), the performance of the Uncoded scheme degrades significantly, especially when the straggler effect is large, as shown in Fig. 7(b). Compared to the Uncoded scheme, the LDGM and Repetition schemes are more resilient to stragglers, as indicated by the slower increase in training time, and they are also more efficient than the MDS and Random Sparse schemes in most cases. In contrast, the training time of the MDS and Random Sparse schemes grows little as Δ increases from 1 to 4, evidencing their high resilience to stragglers. Although they require more training time than the other schemes when the straggler effect Δ or the straggler probability η is small, they achieve higher training efficiency when Δ and/or η are large.

Figure 7: Average training time of different DARL1N implementations with straggler effects Δ=1 and Δ=4, respectively.

To evaluate the impact of the different assignment schemes on the quality of the trained policies, we measure the training reward achieved by each DARL1N implementation with Δ=1. Tab. 6 summarizes the convergence time, convergence reward, and variance V, averaged over 600 training iterations, for the different implementations as the straggler probability η increases. The Uncoded scheme converges quickly when no stragglers exist (η=0), but its convergence speed decreases significantly when stragglers are present (η>0). The MDS and Random Sparse schemes achieve the lowest reward and the slowest convergence, while the LDGM scheme achieves the highest reward and fastest convergence in most cases, especially when the straggler probability η is large. The Repetition scheme generally achieves a good training reward and converges quickly when the straggler probability is small. Moreover, a larger average V generally leads to a lower reward.

Table 6: Convergence time, convergence reward, and average V of different DARL1N implementations.

Scheme          Convergence Time (s)      Convergence Reward        Average V
                η=0    η=0.2   η=0.5      η=0     η=0.2   η=0.5     η=0      η=0.2    η=0.5
Uncoded         323    510     521        8       5       10        0        0        0
MDS             502    748     625        -231    -253    -212      82.08    100.10   104.06
Random Sparse   512    820     670        -252    -231    -227      15.28    13.47    12.88
Repetition      331    372     564        11      6       8         -3.16    -4.49    -4.15
LDGM            324    320     450        9       12      14        -0.51    0.14     0.35

6.2.3 Discussion

The experimental results above suggest guidelines for selecting an appropriate scheme for assigning agents to learners. The Uncoded scheme has zero computation overhead and low variance but is the least resilient to stragglers, making it best suited for stable distributed systems, such as server-based setups with wired communication. The MDS scheme, while offering the highest resilience, incurs the highest computation overhead and variance, leading to poor policy quality and making it unsuitable for distributed training with DARL1N. In unstable distributed systems where stragglers are present, a trade-off between policy quality and resilience must be considered: if policy quality is the priority, the Repetition scheme is an excellent choice; if resilience is more critical, the Random Sparse scheme is preferable; and for a balanced approach that addresses both aspects, the LDGM scheme is a good option.
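Purely as an illustrative sketch of this decision logic, rather than a prescriptive rule, the guidelines can be summarized as follows.

def choose_assignment_scheme(stragglers_expected: bool, priority: str = "balanced") -> str:
    """Map a deployment scenario to a suggested assignment scheme.

    priority is one of "policy_quality", "resilience", or "balanced".
    """
    if not stragglers_expected:
        return "Uncoded"  # zero overhead and low variance on stable systems
    return {
        "policy_quality": "Repetition",
        "resilience": "Random Sparse",
        "balanced": "LDGM",
    }[priority]

print(choose_assignment_scheme(True, "balanced"))  # -> "LDGM"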

7 Limitations and Future Work

In environments with local reward and transition models, DARL1N requires a suitably chosen distance metric to establish agent neighborhoods that balance policy quality against the ability to distribute the training efficiently. In future work, we will explore learning a neighbor distance metric that adapts to the environment, e.g., based on past episodes or contextual information, to achieve an effective balance between policy reward and training speed. Moreover, the coded distributed learning architecture for DARL1N assumes a distributed computing system with a stable central controller. In future work, we will design a new coded architecture that improves the resilience of the central controller, for example by introducing redundant central controllers using coding theory. Other directions for further improvement include integrating curriculum learning similar to EPC, handling partially observable states, and developing software infrastructure to support distributed learning with low-latency communication.

8 Conclusion

This paper introduced DARL1N, a scalable MARL algorithm that can be trained over a distributed computing architecture. DARL1N reduces the representation complexity of the value and policy functions of each agent in a MARL problem by disregarding the influence of agents that are not within one hop in a proximity graph. This model enables highly efficient distributed training, in which a compute node only needs data from the agent it is training and its potential one-hop neighbors. We conducted comprehensive experiments using five MARL environments and compared DARL1N with four state-of-the-art MARL algorithms. DARL1N generates equally good or better policies in almost all scenarios with significantly higher training efficiency than the benchmark methods, especially in large-scale settings. To improve the resilience of DARL1N to the stragglers common in distributed computing systems, we developed coding schemes that assign each agent to multiple learners. We evaluated the properties of the MDS, Random Sparse, Repetition, and LDGM codes and provided guidelines for selecting a suitable assignment scheme in different situations.


Appendix A

A.1 Proof of Lemma 1

Consider the Q-value function $Q_i^{\boldsymbol{\mu}}$ of agent $i$. For two different sets of non-neighbor states $\hat{\mathbf{s}}_{{\cal N}_i^-} \neq \mathbf{s}_{{\cal N}_i^-}$ and actions $\hat{\mathbf{a}}_{{\cal N}_i^-} \neq \mathbf{a}_{{\cal N}_i^-}$, we first show that:

\begin{align}
\bigl|Q_i^{\boldsymbol{\mu}}(\mathbf{s}_{{\cal N}_i},\mathbf{s}_{{\cal N}_i^-},\mathbf{a}_{{\cal N}_i},\mathbf{a}_{{\cal N}_i^-})
- Q_i^{\boldsymbol{\mu}}(\mathbf{s}_{{\cal N}_i},\hat{\mathbf{s}}_{{\cal N}_i^-},\mathbf{a}_{{\cal N}_i},\hat{\mathbf{a}}_{{\cal N}_i^-})\bigr|
\leq \frac{2\bar{r}\gamma}{1-\gamma}. \tag{11}
\end{align}

Letting $(\mathbf{s},\mathbf{a})$ and $(\hat{\mathbf{s}},\hat{\mathbf{a}})$ denote $(\mathbf{s}_{{\cal N}_i},\mathbf{s}_{{\cal N}_i^-},\mathbf{a}_{{\cal N}_i},\mathbf{a}_{{\cal N}_i^-})$ and $(\mathbf{s}_{{\cal N}_i},\hat{\mathbf{s}}_{{\cal N}_i^-},\mathbf{a}_{{\cal N}_i},\hat{\mathbf{a}}_{{\cal N}_i^-})$, respectively, we have:

\begin{align}
&\bigl|Q_i^{\boldsymbol{\mu}}(\mathbf{s},\mathbf{a})-Q_i^{\boldsymbol{\mu}}(\hat{\mathbf{s}},\hat{\mathbf{a}})\bigr| \notag\\
&= \Bigl|\mathbb{E}\Bigl[\textstyle\sum_{t=0}^{\infty}\gamma^{t}r_i(\mathbf{s}_{{\cal N}_i}(t),\mathbf{a}_{{\cal N}_i}(t)) \mid (\mathbf{s}(0),\mathbf{a}(0))=(\mathbf{s},\mathbf{a})\Bigr]
- \mathbb{E}\Bigl[\textstyle\sum_{t=0}^{\infty}\gamma^{t}r_i(\mathbf{s}_{{\cal N}_i}(t),\mathbf{a}_{{\cal N}_i}(t)) \mid (\mathbf{s}(0),\mathbf{a}(0))=(\hat{\mathbf{s}},\hat{\mathbf{a}})\Bigr]\Bigr| \notag\\
&\leq \sum_{t=0}^{\infty}\bigl|\mathbb{E}\bigl[\gamma^{t}r_i(\mathbf{s}_{{\cal N}_i}(t),\mathbf{a}_{{\cal N}_i}(t)) \mid (\mathbf{s}(0),\mathbf{a}(0))=(\mathbf{s},\mathbf{a})\bigr]
- \mathbb{E}\bigl[\gamma^{t}r_i(\mathbf{s}_{{\cal N}_i}(t),\mathbf{a}_{{\cal N}_i}(t)) \mid (\mathbf{s}(0),\mathbf{a}(0))=(\hat{\mathbf{s}},\hat{\mathbf{a}})\bigr]\bigr| \notag\\
&\stackrel{(a)}{=} \sum_{t=1}^{\infty}\bigl|\mathbb{E}\bigl[\gamma^{t}r_i(\mathbf{s}_{{\cal N}_i}(t),\mathbf{a}_{{\cal N}_i}(t)) \mid (\mathbf{s}(0),\mathbf{a}(0))=(\mathbf{s},\mathbf{a})\bigr]
- \mathbb{E}\bigl[\gamma^{t}r_i(\mathbf{s}_{{\cal N}_i}(t),\mathbf{a}_{{\cal N}_i}(t)) \mid (\mathbf{s}(0),\mathbf{a}(0))=(\hat{\mathbf{s}},\hat{\mathbf{a}})\bigr]\bigr| \notag\\
&\leq \sum_{t=1}^{\infty}\gamma^{t}\Bigl(\bigl|\mathbb{E}\bigl[r_i(\mathbf{s}_{{\cal N}_i}(t),\mathbf{a}_{{\cal N}_i}(t)) \mid (\mathbf{s}(0),\mathbf{a}(0))=(\mathbf{s},\mathbf{a})\bigr]\bigr|
+ \bigl|\mathbb{E}\bigl[r_i(\mathbf{s}_{{\cal N}_i}(t),\mathbf{a}_{{\cal N}_i}(t)) \mid (\mathbf{s}(0),\mathbf{a}(0))=(\hat{\mathbf{s}},\hat{\mathbf{a}})\bigr]\bigr|\Bigr) \notag\\
&\leq \sum_{t=1}^{\infty} 2\gamma^{t}\bar{r} = \frac{2\bar{r}\gamma}{1-\gamma}, \tag{12}
\end{align}

where $(a)$ derives from the fact that $(\mathbf{s}_{{\cal N}_i},\mathbf{a}_{{\cal N}_i})$ is part of both $(\mathbf{s},\mathbf{a})$ and $(\hat{\mathbf{s}},\hat{\mathbf{a}})$. In the above equations, the expectation $\mathbb{E}$ is over state-action trajectories generated by the policy $\boldsymbol{\mu}$ and the transition model $p$. Then, we have:

\begin{align}
&\bigl|\tilde{Q}_i^{\boldsymbol{\mu}}(\mathbf{s}_{{\cal N}_i},\mathbf{a}_{{\cal N}_i})-Q_i^{\boldsymbol{\mu}}(\mathbf{s},\mathbf{a})\bigr| \notag\\
&= \Bigl|\sum_{\mathbf{s}_{{\cal N}_i^-},\mathbf{a}_{{\cal N}_i^-}} \omega_i(\mathbf{s}_{{\cal N}_i},\mathbf{a}_{{\cal N}_i},\mathbf{s}_{{\cal N}_i^-},\mathbf{a}_{{\cal N}_i^-})\,Q_i^{\boldsymbol{\mu}}(\mathbf{s}_{{\cal N}_i},\mathbf{a}_{{\cal N}_i},\mathbf{s}_{{\cal N}_i^-},\mathbf{a}_{{\cal N}_i^-})
- Q_i^{\boldsymbol{\mu}}(\mathbf{s}_{{\cal N}_i},\mathbf{a}_{{\cal N}_i},\hat{\mathbf{s}}_{{\cal N}_i^-},\hat{\mathbf{a}}_{{\cal N}_i^-})\Bigr| \notag\\
&\leq \sum_{\mathbf{s}_{{\cal N}_i^-},\mathbf{a}_{{\cal N}_i^-}} \omega_i(\mathbf{s}_{{\cal N}_i},\mathbf{a}_{{\cal N}_i},\mathbf{s}_{{\cal N}_i^-},\mathbf{a}_{{\cal N}_i^-})\,\bigl|Q_i^{\boldsymbol{\mu}}(\mathbf{s}_{{\cal N}_i},\mathbf{a}_{{\cal N}_i},\mathbf{s}_{{\cal N}_i^-},\mathbf{a}_{{\cal N}_i^-})
- Q_i^{\boldsymbol{\mu}}(\mathbf{s}_{{\cal N}_i},\mathbf{a}_{{\cal N}_i},\hat{\mathbf{s}}_{{\cal N}_i^-},\hat{\mathbf{a}}_{{\cal N}_i^-})\bigr| \leq \frac{2\bar{r}\gamma}{1-\gamma}. \tag{13}
\end{align}

A.2 Proof of Proposition 3.2

If agent $j \notin {\cal P}_i(t)$, then based on the definition of potential neighbors, we have $\operatorname{dist}(\mathbf{s}_i(t),\mathbf{s}_j(t)) > d + 2\epsilon$. According to the triangle inequality, $\operatorname{dist}(\mathbf{s}_i(t),\mathbf{s}_j(t+1)) + \operatorname{dist}(\mathbf{s}_j(t+1),\mathbf{s}_j(t)) \geq \operatorname{dist}(\mathbf{s}_i(t),\mathbf{s}_j(t))$, and according to Assumption 1, $\operatorname{dist}(\mathbf{s}_j(t+1),\mathbf{s}_j(t)) \leq \epsilon$. Therefore, $\operatorname{dist}(\mathbf{s}_i(t),\mathbf{s}_j(t+1)) > d + \epsilon$. Using the triangle inequality again, we obtain $\operatorname{dist}(\mathbf{s}_i(t+1),\mathbf{s}_j(t+1)) + \operatorname{dist}(\mathbf{s}_i(t+1),\mathbf{s}_i(t)) \geq \operatorname{dist}(\mathbf{s}_i(t),\mathbf{s}_j(t+1))$. Since $\operatorname{dist}(\mathbf{s}_i(t+1),\mathbf{s}_i(t)) \leq \epsilon$, we have $\operatorname{dist}(\mathbf{s}_i(t+1),\mathbf{s}_j(t+1)) > d$. Therefore, agent $j$ will not be a one-hop neighbor of agent $i$ at time $t+1$.
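As a quick numerical sanity check of this argument, the sketch below assumes Euclidean distance and per-step displacements of at most ε (consistent with Assumption 1) and verifies that an agent farther than d + 2ε at time t never ends up within distance d at time t + 1; the specific values of d and ε are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
d, eps = 0.5, 0.1

def random_step(p):
    """Move a point by a random displacement of length at most eps."""
    v = rng.normal(size=2)
    return p + (v / np.linalg.norm(v)) * rng.uniform(0.0, eps)

violations = 0
for _ in range(10_000):
    s_i = np.zeros(2)
    # Place agent j strictly outside the potential-neighbor radius d + 2*eps.
    angle = rng.uniform(0.0, 2 * np.pi)
    s_j = (d + 2 * eps + rng.uniform(1e-6, 1.0)) * np.array([np.cos(angle), np.sin(angle)])
    if np.linalg.norm(random_step(s_i) - random_step(s_j)) <= d:
        violations += 1
print(violations)  # 0: agent j cannot become a one-hop neighbor of agent i in one step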

A.3 Proof of Theorem 4.4

The bias of the gradient estimator $\tilde{\mathbf{e}}$ can be calculated using (7) and (8) as follows:

\begin{align}
\mathbb{E}[\tilde{\mathbf{e}}]-\mathbf{e}
&= (\mathbf{C}_{\cal J}^{T}\mathbf{C}_{\cal J})^{-1}\mathbf{C}_{\cal J}^{T}\mathbb{E}[\mathbf{y}_{\cal J}]-\mathbf{e} \notag\\
&= (\mathbf{C}_{\cal J}^{T}\mathbf{C}_{\cal J})^{-1}\mathbf{C}_{\cal J}^{T}\mathbf{D}\,\mathbb{E}[\mathbf{q}]-\mathbf{e}. \tag{14}
\end{align}

Since each learner uses the same set of parameters broadcast by the central controller for agent-environment interaction in each training iteration, the replay buffers ${\cal D}_{j,i}$, $\forall j\in[N]$, all follow the same distribution as that of ${\cal D}_i$. Therefore, we have $\mathbb{E}[\hat{\mathbf{e}}_{j,i}]=\mathbf{e}_i$ and $\mathbf{D}\,\mathbb{E}[\mathbf{q}]=\mathbf{C}_{\cal J}\mathbf{e}$, leading to:

\begin{align}
\mathbb{E}[\tilde{\mathbf{e}}]-\mathbf{e}=(\mathbf{C}_{\cal J}^{T}\mathbf{C}_{\cal J})^{-1}\mathbf{C}_{\cal J}^{T}\mathbf{C}_{\cal J}\mathbf{e}-\mathbf{e}=0, \tag{15}
\end{align}

which shows that $\tilde{\mathbf{e}}$ is an unbiased estimator.

Next, we compute the variance of the gradient estimator $\tilde{\mathbf{e}}$:

\begin{align}
\mathbb{V}[\tilde{\mathbf{e}}]
&= \mathbb{V}\bigl[(\mathbf{C}_{\cal J}^{T}\mathbf{C}_{\cal J})^{-1}\mathbf{C}_{\cal J}^{T}\mathbf{y}_{\cal J}\bigr] \notag\\
&= (\mathbf{C}_{\cal J}^{T}\mathbf{C}_{\cal J})^{-1}\mathbf{C}_{\cal J}^{T}\mathbf{D}\,\mathbb{V}[\mathbf{q}]\,\mathbf{D}^{T}\bigl((\mathbf{C}_{\cal J}^{T}\mathbf{C}_{\cal J})^{-1}\mathbf{C}_{\cal J}^{T}\bigr)^{T}, \tag{16}
\end{align}

where $\mathbb{V}[\mathbf{q}]=\operatorname{diag}(\mathbb{V}(\hat{\mathbf{e}}_{1,1}),\ldots,\mathbb{V}(\hat{\mathbf{e}}_{N,M}))$ and $\operatorname{diag}(\cdot)$ creates a diagonal matrix with the $\mathbb{V}(\hat{\mathbf{e}}_{j,i})$ as diagonal elements; this diagonal structure holds because the $\hat{\mathbf{e}}_{j,i}$, $\forall i\in[M]$, are independent of each other for each $j\in[N]$. From (16), we can see that the variance of the gradient estimator $\tilde{\mathbf{e}}$ is affected by $\mathbf{C}_{\cal J}$, which is determined by the assignment matrix $\mathbf{C}$ as well as by which learners return their computations promptly.
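A minimal numerical sketch of the decoding step analyzed above is given below, using NumPy with small illustrative dimensions and a hand-picked 0/1 assignment matrix (both are assumptions for illustration): the central controller stacks the coded results returned by the responsive learners and recovers the gradient estimate via the least-squares solution $(\mathbf{C}_{\cal J}^{T}\mathbf{C}_{\cal J})^{-1}\mathbf{C}_{\cal J}^{T}\mathbf{y}_{\cal J}$.

import numpy as np

rng = np.random.default_rng(2)
M, N, dim = 3, 6, 4                       # agents, learners, gradient dimension (illustrative)
e = rng.normal(size=(M, dim))             # true per-agent gradients
C = np.array([                            # illustrative N x M assignment matrix
    [1., 0., 0.],
    [0., 1., 0.],
    [0., 0., 1.],
    [1., 1., 0.],
    [0., 1., 1.],
    [1., 1., 1.],
])

# Each learner j returns a coded combination of its own noisy local estimates of e.
y = np.stack([C[j] @ (e + 0.01 * rng.normal(size=(M, dim))) for j in range(N)])

J = np.array([0, 3, 4, 5])                # learners that responded in time (rank(C_J) = M)
C_J, y_J = C[J], y[J]
e_tilde = np.linalg.lstsq(C_J, y_J, rcond=None)[0]   # (C_J^T C_J)^{-1} C_J^T y_J
print(np.abs(e_tilde - e).max())           # small: e is recovered up to the injected noise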

A.4 Proof of Proposition 5.7

The performance of the MDS scheme is affected only if the number of stragglers exceeds $N-M$, because $\mathbf{C}^{\text{MDS}}$ has rank $M$. If there are $W > N-M$ stragglers, the results from the non-straggler nodes are insufficient for the central controller to decode the parameter gradients, and it must wait for results from the stragglers. Under Assumption 4, $W$ follows a binomial distribution with probability mass function $p(W=w)=\binom{N}{w}(1-\eta)^{N-w}\eta^{w}$. Therefore, the probability that the performance is affected by stragglers is $\sum_{j=N-M+1}^{N}p(W=j)=\sum_{j=N-M+1}^{N}\binom{N}{j}(1-\eta)^{N-j}\eta^{j}$.
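This probability can be evaluated directly; a small sketch using the M=12, N=24 setting from the experiments (chosen here only as an example) is:

from math import comb

def mds_failure_prob(N: int, M: int, eta: float) -> float:
    """Probability that more than N - M learners are stragglers (Proposition 5.7)."""
    return sum(comb(N, j) * (1 - eta) ** (N - j) * eta ** j for j in range(N - M + 1, N + 1))

print(mds_failure_prob(N=24, M=12, eta=0.2))  # essentially zero: MDS is highly resilient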

A.5 Proof of Proposition 5.8

The assignment matrix $\mathbf{C}^{\text{Repetition}}$ defined in (10) has $M$ linearly independent rows, each of which is duplicated $\frac{N}{M}$ times. The $j$-th copy of the $i$-th linearly independent row, $\forall i\in[M]$, $\forall j\in[\frac{N}{M}]$, corresponds to the learner with index $i+(j-1)M$ sending $\mathbf{y}_{i+(j-1)M}$ back to the central controller. The estimated gradients can be decoded when, for every $i\in[M]$, at least one learner with index $i+(j-1)M$ for some $j\in[\frac{N}{M}]$ sends its result back, so that $\operatorname{rank}(\mathbf{C}_{\cal J})=M$. Under Assumption 4, for a given $i\in[M]$, the probability that the learners with indices $i+(j-1)M$, $\forall j\in[\frac{N}{M}]$, are all stragglers that fail to send results back is $\eta^{\frac{N}{M}}$. Furthermore, the probability that there is at least one non-straggler learner for every $i\in[M]$ is $(1-\eta^{\frac{N}{M}})^{M}$. Therefore, the probability that the performance is affected by stragglers is $1-(1-\eta^{\frac{N}{M}})^{M}$.
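The analogous quantity for the Repetition scheme, evaluated under the same illustrative setting as above, is:

def repetition_failure_prob(N: int, M: int, eta: float) -> float:
    """Probability that some agent loses all of its N/M assigned learners (Proposition 5.8)."""
    return 1.0 - (1.0 - eta ** (N // M)) ** M

print(repetition_failure_prob(N=24, M=12, eta=0.2))  # noticeably larger than for the MDS scheme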

{IEEEbiography}

Baoqian Wang (S’20) is currently an advanced technologist in intelligent systems at The Boeing Company. He received a Ph.D. degree in Electrical and Computer Engineering from the University of California San Diego and San Diego State University in 2023. He received his B.S. degree from Yangtze University, Wuhan, China, in 2017, and his M.S. degree in Computer Science from Texas A&M University-Corpus Christi. His research interests include distributed computing, reinforcement learning, and robotics.

{IEEEbiography}

Junfei Xie (S’13-M’16-SM’21) is currently an Associate Professor in the Department of Electrical and Computer Engineering at San Diego State University. She received the B.S. degree in Electrical Engineering from University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2012. She received the M.S. degree in Electrical Engineering in 2013 and the Ph.D. degree in Computer Science and Engineering from University of North Texas, Denton, TX, in 2016. From 2016 to 2019, she was an Assistant Professor in the Department of Computing Sciences at Texas A&M University-Corpus Christi. She is the recipient of the NSF CAREER Award. Her current research interests include large-scale dynamic system design and control, airborne networks, airborne computing, and air traffic flow management.

{IEEEbiography}

Nikolay Atanasov (S’07-M’16-SM’23) is an Associate Professor of Electrical and Computer Engineering at the University of California San Diego, La Jolla, CA, USA. He obtained a B.S. degree in Electrical Engineering from Trinity College, Hartford, CT, USA in 2008 and M.S. and Ph.D. degrees in Electrical and Systems Engineering from the University of Pennsylvania, Philadelphia, PA, USA in 2012 and 2015, respectively. Dr. Atanasov’s research focuses on robotics, control theory, and machine learning with emphasis on active perception problems for autonomous mobile robots. He works on probabilistic models for simultaneous localization and mapping (SLAM) and on optimal control and reinforcement learning algorithms for minimizing probabilistic model uncertainty. Dr. Atanasov’s work has been recognized by the Joseph and Rosaline Wolf award for the best Ph.D. dissertation in Electrical and Systems Engineering at the University of Pennsylvania in 2015, the Best Conference Paper Award at the IEEE International Conference on Robotics and Automation (ICRA) in 2017, the NSF CAREER Award in 2021, and the IEEE RAS Early Academic Career Award in Robotics and Automation in 2023.