Distributed Multi-Agent Reinforcement Learning with One-hop Neighbors and Compute Straggler Mitigation
Abstract
Most multi-agent reinforcement learning (MARL) methods are limited in the scale of problems they can handle. With increasing numbers of agents, the number of training iterations required to find the optimal behaviors increases exponentially due to the exponentially growing joint state and action spaces. This paper tackles this limitation by introducing a scalable MARL method called Distributed multi-Agent Reinforcement Learning with One-hop Neighbors (DARL1N). DARL1N is an off-policy actor-critic method that addresses the curse of dimensionality by restricting information exchanges among the agents to one-hop neighbors when representing value and policy functions. Each agent optimizes its value and policy functions over a one-hop neighborhood, significantly reducing the learning complexity, yet maintaining expressiveness by training with varying neighbor numbers and states. This structure allows us to formulate a distributed learning framework to further speed up the training procedure. Distributed computing systems, however, contain straggler compute nodes, which are slow or unresponsive due to communication bottlenecks, software or hardware problems. To mitigate the detrimental straggler effect, we introduce a novel coded distributed learning architecture, which leverages coding theory to improve the resilience of the learning system to stragglers. Comprehensive experiments show that DARL1N significantly reduces training time without sacrificing policy quality and is scalable as the number of agents increases. Moreover, the coded distributed learning architecture improves training efficiency in the presence of stragglers.
Multi-Agent Reinforcement Learning, Scalability, Distributed Computing, Coded Computation.
1 Introduction
Recent years have witnessed tremendous success of reinforcement learning (RL) in challenging decision making problems, such as robot control and video games. Research efforts are currently focused on multi-agent settings, including cooperative robot navigation [1], multi-player games [2], and traffic management [3]. Direct application of RL techniques in multi-agent settings by running single-agent algorithms simultaneously on each agent exhibits poor performance [4]. This is because, without considering interactions among the agents, the environment becomes non-stationary from the perspective of a single agent.
Multi-agent reinforcement learning (MARL) [5] addresses this challenge by considering all agents and their dynamics collectively when learning the value function and policy of an individual agent. Most effective MARL algorithms, such as multi-agent deep deterministic policy gradient (MADDPG) [4] and counterfactual multi-agent (COMA) [6], adopt this strategy. However, learning a joint-state value or action-value (Q) or policy function is challenging due to the exponentially growing joint state and action spaces with increasing number of agents [7, 8]. Policies trained with joint state-action pairs have poor performance in large-scale settings as demonstrated in recent work [9, 10], because their accurate approximation requires models with extremely large capacity.
MARL algorithms that improve the quality of learned policies for large-scale multi-agent settings often employ value function factorization that factorizes the global value/Q function into individual value/Q functions depending on local states and actions, e.g., as in Value Decomposition Network (VDN) [11], QMIX [12], QTran [13], mean-field MARL (MFAC) [9] or scalable actor critic (SAC) [8]. In addition to value factorization, there are several other methods proposed to enable scalable MARL. MAAC [14] uses an attention module to abstract states of other agents when training an agent’s Q function, which reduces the quadratically increasing input space to a linear space. EPC [10] applies curriculum learning to gradually scale MARL up. While these methods achieve great policy performance, the training time can be significant when the number of agents increases because these methods cannot be easily trained in an efficient distributed or parallel manner over multiple computers.
To address the challenge of training policies for large numbers of agents over a distributed computing architecture, we propose a MARL algorithm called Distributed multi-Agent Reinforcement Learning with One-hop Neighbors (DARL1N). DARL1N’s main advantage over state-of-the-art MARL methods is that it allows distributed training across compute nodes (devices with networking, storage, and computing capabilities) running in parallel, with each compute node simulating only a very small subset of the agent transitions. This is made possible by modeling the agent team topology as a proximity graph and representing the Q function and policy of each agent as a function of its one-hop neighbors only. This structure significantly reduces the representation complexity of the Q and policy functions and yet maintains expressiveness when training is done over varying states and numbers of neighbors. Furthermore, when agent interactions are restricted to one-hop neighborhoods, training an agent’s Q function and policy requires transitions only of the agent itself and its potential two-hop neighbors. This enables highly efficient distributed training because each compute node needs to simulate only the transitions of the agents assigned to it and their two-hop neighbors.
RL or MARL policies can be trained over a distributed computing architecture in either a centralized [15, 16] or decentralized [17] manner. Decentralized architectures [17] offer greater resilience to node failures and malicious attacks but introduce significant communication overhead, as frequent information exchanges are required. Additionally, achieving global coordination in such systems is inherently challenging: global information must either be inferred under the assumption of globally observable states, which limits applicability, or obtained through consensus, which is difficult to achieve in large-scale systems. In contrast, centralized architectures with a central controller are more communication-efficient and facilitate easier global coordination, with communication occurring either asynchronously [18] or synchronously [19]. Asynchronous training faces multiple challenges, including slow convergence, difficult debugging and analysis, and sometimes subpar quality of learned policies, as learners may return stale gradients evaluated with old parameters [20, 21, 22, 19, 23]. Synchronous training is superior in these aspects but is vulnerable to computing stragglers [24], which are common in wireless and mobile networks. These stragglers, which are slow or unresponsive compute nodes caused by communication bottlenecks or software and hardware issues, can result in delays or failures in the training process. Coded computation [25], which employs coding theory to introduce redundant computation, can mitigate computing stragglers. While extensively explored in various distributed computation problems such as matrix multiplication [25], linear inverse problems [26], convolution [27], and map reduce [28], its application to MARL remains under-studied. In our previous work [16], we explored the merits of coded computation in enhancing resilience and accelerating the training of MADDPG [4] in a distributed manner. Building upon this, in this paper, we propose a coded distributed learning architecture tailored for DARL1N. Unlike the one introduced in [16], where the central controller simulates global environment interactions among all agents and sends simulation data to each learner to train an agent, in the new architecture, each learner directly simulates local environment interactions among a small set of neighboring agents during individual agent training, which improves distributed computing efficiency and reduces communication overhead.
Contributions: The primary contribution of this paper is a new MARL algorithm called DARL1N, which employs one-hop neighborhood factorization of the value and policy functions, allowing distributed training with each compute node simulating a small number of agent transitions. DARL1N supports highly-efficient distributed training and generates high-quality multi-agent policies for large agent teams. The second contribution is a novel coded distributed learning architecture for DARL1N called Coded DARL1N, which allows individual agents to be trained by multiple compute nodes simultaneously, enabling resilience to stragglers. Our analysis shows that introducing redundant computations via coding theory does not introduce bias in the value and policy gradient estimates, and the training converges similarly to stochastic gradient descent-based methods. Four codes including Maximum Distance Separable (MDS), Random Sparse, Repetition, and Low Density Generator Matrix (LDGM) codes are investigated to introduce redundant computation. Moreover, we conduct comprehensive experiments comparing DARL1N with four state-of-the-art MARL methods, including MADDPG, MFAC, EPC and SAC, and evaluating their performance in different RL environments, including Ising Model, Food Collection, Grassland, Adversarial Battle, and Multi-Access Wireless Communication. We also conduct experiments to evaluate the resilience of Coded DARL1N to stragglers when trained under different coding schemes.
It is noted that DARL1N was first presented in a short conference version [29]. Compared to that version, this journal article extends DARL1N by introducing a new coded distributed learning architecture to enhance its resilience to stragglers while also improving training efficiency. Theoretical analysis is conducted to elucidate the convergence of Coded DARL1N. Additionally, this article undertakes a more comprehensive experimental study, encompassing not only the performance of DARL1N but also its new coded variant, and includes additional benchmarks, environments, and evaluation metrics for a more thorough assessment from various aspects.
In the rest of the paper, Sec. 2 formulates the MARL problem to be addressed and describes the occurrence of stragglers within distributed learning systems. The proposed DARL1N algorithm is then introduced in Sec. 3, followed by the coded distributed learning architecture and different coding schemes, which are described in Sec. 4 and Sec. 5, respectively. Experiment results are presented in Sec. 6. Limitations and future work are discussed in Sec. 7. Finally, Sec. 8 concludes the paper.
2 Problem Statement
In MARL, $N$ agents learn to optimize their behavior by interacting with the environment. Denote the state and action of agent $i \in \{1, \ldots, N\}$ by $s_i \in \mathcal{S}_i$ and $a_i \in \mathcal{A}_i$, respectively, where $\mathcal{S}_i$ and $\mathcal{A}_i$ are the corresponding state and action spaces. Let $s := (s_1, \ldots, s_N)$ and $a := (a_1, \ldots, a_N)$ denote the joint state and action of all agents. At time $t$, a joint action $a(t)$ applied at state $s(t)$ triggers a transition to a new state $s(t+1)$ according to a conditional probability density function (pdf) $p(s(t+1) \mid s(t), a(t))$. After each transition, each agent $i$ receives a reward $r_i(s(t), a(t))$, determined by the joint state and action according to the function $r_i$. The objective of each agent $i$ is to design a policy $a_i = \mu_i(s)$ to maximize the expected cumulative discounted reward (known as the value function):

$$V_i(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_i(s(t), a(t)) \;\Big|\; s(0) = s,\; a(t) = \mu(s(t))\Big],$$

where $\mu := (\mu_1, \ldots, \mu_N)$ denotes the joint policy of all agents and $\gamma \in (0, 1)$ is a discount factor. Alternatively, an optimal policy for agent $i$ can be obtained by maximizing the action-value (Q) function:

$$Q_i(s, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_i(s(t), a(t)) \;\Big|\; s(0) = s,\; a(0) = a,\; a(t) = \mu(s(t)) \text{ for } t \geq 1\Big],$$

and setting $\mu_i(s) = \arg\max_{a_i} Q_i(s, (a_i, a_{-i}))$, where $a = (a_i, a_{-i})$ and $a_{-i}$ denotes the actions of all agents except agent $i$.
To develop a distributed MARL algorithm, we impose additional structure on the MARL problem. Assume that all agents share a common state space, i.e., $\mathcal{S}_i = \mathcal{S}$ for all $i$, and let $d(\cdot, \cdot)$ be a distance metric on the state space. Note that the distance metric can also be defined over a common state subspace. However, for notation simplicity, a common state space is assumed here. Consider a proximity graph [30] that models the topology of the agent team. A $\delta$-disk proximity graph is defined as a mapping that associates the joint state $s$ with an undirected graph $\mathcal{G}(s) = (\mathcal{V}, \mathcal{E}(s))$ such that $\mathcal{V} = \{1, \ldots, N\}$ and $\mathcal{E}(s) = \{(i, j) \mid d(s_i, s_j) \leq \delta\}$. Define the set of one-hop neighbors of agent $i$ as $\mathcal{N}_i := \{j \in \mathcal{V} \mid (i, j) \in \mathcal{E}(s)\} \cup \{i\}$. We make the following assumption about the agents' motion.
Assumption 1
The distance between two consecutive states, $s_i(t)$ and $s_i(t+1)$, of agent $i$ is bounded, i.e., $d(s_i(t), s_i(t+1)) \leq \epsilon$, for some $\epsilon > 0$.
This assumption is satisfied in many problems where, e.g., due to physical constraints, the agent states can only change by a bounded amount in a single time step.
Define the potential neighbors of agent $i$ at time $t$ as $\mathcal{P}_i(t) := \{j \in \mathcal{V} \mid d(s_i(t), s_j(t)) \leq \delta + 2\epsilon\}$, which captures the set of agents that may become one-hop neighbors of agent $i$ at time $t+1$. Denote the joint state and action of the one-hop neighbors of agent $i$ by $s_{\mathcal{N}_i}$ and $a_{\mathcal{N}_i}$, respectively. Our key idea is to let agent $i$'s policy, $a_i = \mu_i(s_{\mathcal{N}_i})$, depend only on the one-hop neighbor states $s_{\mathcal{N}_i}$ instead of all agent states $s$, and we assume that each agent can obtain its one-hop neighbor states through observation or communication. The intuition is that agents that are far away from agent $i$ at time $t$ have little impact on its current action $a_i(t)$. To emphasize that the output of a function is affected only by a subset of its input dimensions, we use the notation $s_{\mathcal{N}_i}$ and $a_{\mathcal{N}_i}$ for the arguments of such functions (e.g., $r_i(s_{\mathcal{N}_i}, a_{\mathcal{N}_i})$) in the remainder of the paper. We make two additional assumptions on the problem structure to ensure the validity of our policy model.
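To make these definitions concrete, the following minimal sketch (assuming planar positions as agent states and the Euclidean distance as the metric, both illustrative choices) computes the $\delta$-disk one-hop neighbor sets $\mathcal{N}_i$ and the potential-neighbor sets $\mathcal{P}_i(t)$ defined above.

```python
import numpy as np

def one_hop_neighbors(states, delta):
    """N_i: agents within distance delta of agent i (the set includes i itself)."""
    diff = states[:, None, :] - states[None, :, :]        # pairwise differences
    dist = np.linalg.norm(diff, axis=-1)                  # pairwise Euclidean distances
    return [set(np.flatnonzero(dist[i] <= delta)) for i in range(len(states))]

def potential_neighbors(states, delta, eps):
    """P_i(t): agents that may become one-hop neighbors at t+1 (uses Assumption 1)."""
    return one_hop_neighbors(states, delta + 2.0 * eps)

# Usage: 4 agents in the plane, delta = 1.0, per-step motion bound eps = 0.2.
states = np.array([[0.0, 0.0], [0.9, 0.0], [1.5, 0.0], [3.0, 0.0]])
print(one_hop_neighbors(states, delta=1.0))
print(potential_neighbors(states, delta=1.0, eps=0.2))
```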
Assumption 2
The reward of agent $i$ can be fully specified using its one-hop neighbor states and actions, i.e., $r_i(s, a) = r_i(s_{\mathcal{N}_i}, a_{\mathcal{N}_i})$.
This assumption can always be satisfied by setting $\delta$ to the full environment range. In this case, the one-hop neighbor reward assumption becomes the standard reward definition, which depends on the states and actions of all agents and is applicable to general MARL problems. For environments with local reward models, a smaller value of $\delta$ can be chosen based on the specific environment configuration. For example, in collision avoidance problems, an agent's reward may depend only on the states and actions of nearby agents within a safety distance. In multi-agent network or sensing problems, the one-hop neighbors can be those within communication or observation range. Next, we make a similar assumption for agent $i$'s transition model.
Assumption 3
The transition model of agent $i$ depends only on its own action $a_i$ and the states $s_{\mathcal{N}_i}$ of its one-hop neighbors, i.e., $p(s_i(t+1) \mid s(t), a(t)) = p(s_i(t+1) \mid s_{\mathcal{N}_i}(t), a_i(t))$.
The objective of each agent $i$ is to obtain an optimal policy $\mu_i^*$ by solving the following problem:

$$\mu_i^* = \arg\max_{\mu_i}\; \mathbb{E}\big[\, Q_i^*(s, a) \,\big|\, a_i = \mu_i(s_{\mathcal{N}_i}) \,\big], \quad (1)$$

where $Q_i^*$ is the optimal action-value (Q) function introduced in the previous section.
The goal of this paper is to develop a MARL algorithm that (i) utilizes policy and value representations that scale favorably with the number of agents and (ii) allows efficient training on a distributed computing system containing compute stragglers.
3 Distributed multi-Agent Reinforcement Learning with One-hop Neighbors (DARL1N)
This section develops the DARL1N algorithm to solve the MARL problem with proximity-graph structure introduced in Sec. 2. DARL1N considers the effect of the one-hop neighbors of an agent in representing its Q and policy functions, which allows updating the Q and policy function parameters using only local one-hop neighborhood transitions.
Specifically, the Q function of each agent $i$ can be expressed as a function of its one-hop neighbor states and actions as well as the states and actions $s_{-\mathcal{N}_i}, a_{-\mathcal{N}_i}$ of the remaining agents that are not immediate neighbors of $i$:

$$Q_i(s, a) = Q_i(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}, s_{-\mathcal{N}_i}, a_{-\mathcal{N}_i}). \quad (2)$$

Inspired by the SAC algorithm [8], we approximate the Q value with a function $\tilde{Q}_i$ that depends only on one-hop neighbor states and actions:

$$\tilde{Q}_i(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}) := \sum_{s_{-\mathcal{N}_i},\, a_{-\mathcal{N}_i}} w_i(s_{-\mathcal{N}_i}, a_{-\mathcal{N}_i};\, s_{\mathcal{N}_i}, a_{\mathcal{N}_i})\, Q_i(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}, s_{-\mathcal{N}_i}, a_{-\mathcal{N}_i}),$$

where the weights satisfy $w_i(\cdot) \geq 0$ and $\sum_{s_{-\mathcal{N}_i},\, a_{-\mathcal{N}_i}} w_i(s_{-\mathcal{N}_i}, a_{-\mathcal{N}_i};\, s_{\mathcal{N}_i}, a_{\mathcal{N}_i}) = 1$. The approximation error is given in the following lemma.
Lemma 1
If the absolute value of agent $i$'s reward is upper bounded as $|r_i(s_{\mathcal{N}_i}, a_{\mathcal{N}_i})| \leq \bar{r}$, for some $\bar{r} > 0$, the approximation error between $\tilde{Q}_i$ and $Q_i$ is bounded as:

$$\big|\tilde{Q}_i(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}) - Q_i(s, a)\big| \leq \frac{2\gamma \bar{r}}{1 - \gamma}.$$
Proof 3.1.
See Appendix A.1.
We parameterize the approximated Q function $\tilde{Q}_i$ and the policy $\mu_i$ by $\theta_i$ and $\omega_i$, respectively. To handle the varying sizes of $s_{\mathcal{N}_i}$ and $a_{\mathcal{N}_i}$, in the implementation we set the input dimension of $\tilde{Q}_i$ to the largest possible dimension of $(s_{\mathcal{N}_i}, a_{\mathcal{N}_i})$ and apply zero-padding for agents that are not within the one-hop neighborhood of agent $i$. The same procedure is applied to represent $\mu_i$.
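As an illustration of this fixed-size input construction, the sketch below (with hypothetical per-agent state and action dimensions and a slot-per-agent layout, both illustrative assumptions) zero-pads the entries of non-neighbors so that $\tilde{Q}_i$ and $\mu_i$ always receive inputs of the same dimension.

```python
import numpy as np

def padded_input(neighbors, states, actions, n_agents, ds, da):
    """Fixed-size (state, action) input: one slot per agent, zeros for non-neighbors.

    states/actions: dicts mapping agent id -> np.ndarray of shape (ds,) / (da,).
    """
    x = np.zeros(n_agents * (ds + da))
    for j in neighbors:                        # only one-hop neighbors contribute
        base = j * (ds + da)
        x[base: base + ds] = states[j]
        x[base + ds: base + ds + da] = actions[j]
    return x

# Usage: 5 agents, 2-D states, 1-D actions; agent 0's one-hop neighbors are {0, 2}.
states = {j: np.full(2, float(j)) for j in range(5)}
actions = {j: np.array([0.1 * j]) for j in range(5)}
print(padded_input({0, 2}, states, actions, n_agents=5, ds=2, da=1))
```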
To learn the approximated Q function $\tilde{Q}_i$, instead of incremental on-policy updates to the Q function as in SAC [8], we apply off-policy temporal-difference learning with a replay buffer similar to MADDPG [4]. The parameters $\theta_i$ of the approximated Q function are updated by minimizing the following temporal difference error:

$$\mathcal{L}(\theta_i) = \mathbb{E}_{(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}, r_i, s_{\mathcal{N}_i'}) \sim \mathcal{D}_i}\Big[\big(\tilde{Q}_i(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}; \theta_i) - y_i\big)^2\Big], \qquad y_i = r_i + \gamma\, \bar{Q}_i\big(s_{\mathcal{N}_i'}, a_{\mathcal{N}_i'}; \bar{\theta}_i\big)\Big|_{a_j' = \bar{\mu}_j(s_{\mathcal{N}_j'}),\; j \in \mathcal{N}_i'}, \quad (3)$$

where $\mathcal{D}_i$ is a replay buffer for agent $i$ that contains information only from $\mathcal{N}_i$ and $\mathcal{N}_i'$, the one-hop neighbors of agent $i$ at the current and next time steps, and from the one-hop neighbors $\mathcal{N}_j'$ of each $j \in \mathcal{N}_i'$. Data from the one-hop neighbors of the next-step one-hop neighbors are needed to compute the next-step one-hop neighbor actions $a_j' = \bar{\mu}_j(s_{\mathcal{N}_j'})$. To stabilize the training, a target Q function $\bar{Q}_i$ with parameters $\bar{\theta}_i$ and a target policy function $\bar{\mu}_i$ with parameters $\bar{\omega}_i$ are used. The parameters $\bar{\theta}_i$ and $\bar{\omega}_i$ are updated using Polyak averaging: $\bar{\theta}_i \leftarrow \tau \theta_i + (1 - \tau)\bar{\theta}_i$, $\bar{\omega}_i \leftarrow \tau \omega_i + (1 - \tau)\bar{\omega}_i$, where $\tau \in (0, 1)$ is a hyperparameter. In contrast to MADDPG [4], the replay buffer for agent $i$ only needs to store its local interactions $\big(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}, r_i, s_{\mathcal{N}_i'}, \{s_{\mathcal{N}_j'}\}_{j \in \mathcal{N}_i'}\big)$ with nearby agents, where $\{s_{\mathcal{N}_j'}\}_{j \in \mathcal{N}_i'}$ is required to calculate $a_{\mathcal{N}_i'}'$. Also, in contrast to SAC [8], each agent only needs to collect its own training data by simulating local two-hop interactions, which reduces the agents' experience correlations and allows efficient distributed training, as explained in Sec. 4.
Agent $i$'s policy parameters $\omega_i$ are updated using the gradient:

$$\nabla_{\omega_i} J(\omega_i) = \mathbb{E}_{(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}) \sim \mathcal{D}_i}\Big[\nabla_{\omega_i} \mu_i(s_{\mathcal{N}_i})\, \nabla_{a_i} \tilde{Q}_i(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}; \theta_i)\big|_{a_i = \mu_i(s_{\mathcal{N}_i})}\Big], \quad (4)$$

where again only data from local interactions is needed.
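The following PyTorch sketch illustrates one DARL1N-style update of a single agent's critic (Eq. (3)), actor (Eq. (4)), and target networks via Polyak averaging. The network interfaces (`Q(obs, act)`, `mu(obs)`), the mini-batch layout, and the assumption that the agent's own action occupies the first slot of the padded action vector are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def darl1n_update(Q, Q_targ, mu, mu_targ, q_opt, pi_opt, batch, gamma=0.95, tau=0.01):
    """One critic/actor update for a single agent from a mini-batch of local data.

    batch: dict of tensors -- 'obs', 'act' (zero-padded one-hop states/actions),
    'rew', 'next_obs', 'next_act' (next-step neighbor actions from target policies).
    """
    # Critic: minimize the temporal-difference error of Eq. (3).
    with torch.no_grad():
        y = batch['rew'] + gamma * Q_targ(batch['next_obs'], batch['next_act'])
    q_loss = F.mse_loss(Q(batch['obs'], batch['act']), y)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Actor: ascend the deterministic policy gradient of Eq. (4).
    act_i = mu(batch['obs'])                   # re-evaluate agent i's own action
    act = batch['act'].clone()
    act[:, :act_i.shape[1]] = act_i            # assumes agent i's slot comes first (illustrative)
    pi_loss = -Q(batch['obs'], act).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    # Target networks: Polyak averaging of the target parameters.
    with torch.no_grad():
        for net, net_targ in ((Q, Q_targ), (mu, mu_targ)):
            for p, p_targ in zip(net.parameters(), net_targ.parameters()):
                p_targ.mul_(1 - tau).add_(tau * p)
    return q_loss.item(), pi_loss.item()
```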
To implement the parameter updates proposed above, agent $i$ needs training data from its one-hop neighbors at the current and next time steps. The relation between the one-hop neighbors at the current and next time steps is captured by the following proposition.
Proposition 3.2.
Under Assumption 1, if an agent $j$ is not a potential neighbor of agent $i$ at time $t$, i.e., $j \notin \mathcal{P}_i(t)$, it will not be a one-hop neighbor of agent $i$ at time $t+1$, i.e., $j \notin \mathcal{N}_i(t+1)$.
Proof 3.3.
See Appendix A.2.
Proposition 3.2 allows us to decouple the global interactions among agents and limit the necessary observations to be among one-hop neighbors. To collect training data, at each time step $t$, agent $i$ first interacts with its one-hop neighbors to obtain their states $s_{\mathcal{N}_i}$ and actions $a_{\mathcal{N}_i}$, and computes its reward $r_i$. To obtain $s_{\mathcal{N}_j'}$ for all $j \in \mathcal{N}_i'$, we first determine agent $i$'s one-hop neighbors at the next time step, $\mathcal{N}_i'$. Using Proposition 3.2, we let each potential neighbor $j \in \mathcal{P}_i(t)$ perform a transition to a new state $s_j(t+1)$, which is sufficient to determine $\mathcal{N}_i'$. Then, we let the potential neighbors of each new neighbor $j \in \mathcal{N}_i'$ perform transitions to determine $\mathcal{N}_j'$ and obtain $s_{\mathcal{N}_j'}$.
Fig. 1(a) illustrates this data collection process: at time $t$, agent $i$ obtains $s_{\mathcal{N}_i}$, $a_{\mathcal{N}_i}$, and $r_i$ from its one-hop neighbors; the potential neighbors in $\mathcal{P}_i(t)$ then proceed to their next states at time $t+1$, which is sufficient to determine $\mathcal{N}_i'$ and obtain $s_{\mathcal{N}_i'}$; finally, the potential neighbors of each agent in $\mathcal{N}_i'$ perform transitions, which determines each $\mathcal{N}_j'$ and yields $s_{\mathcal{N}_j'}$.
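The toy sketch below mimics this data-collection step on a 1-D team with bounded random motion (an illustrative assumption): only agent $i$'s potential neighbors are advanced to determine $\mathcal{N}_i'$, and then only the potential neighbors of the agents in $\mathcal{N}_i'$ are advanced, so no global simulation is needed.

```python
import numpy as np
rng = np.random.default_rng(0)

def neighbors(pos, i, radius):
    """Indices of agents within `radius` of agent i (1-D positions, includes i)."""
    return set(np.flatnonzero(np.abs(pos - pos[i]) <= radius))

def collect_local_transition(pos, i, delta, eps):
    """Advance only agent i's potential (two-hop) neighbors, never the whole team."""
    N_i = neighbors(pos, i, delta)                       # one-hop neighbors at time t
    P_i = neighbors(pos, i, delta + 2 * eps)             # potential neighbors (Prop. 3.2)

    next_pos = pos.copy()
    moved = set(P_i)
    next_pos[list(P_i)] += rng.uniform(-eps, eps, len(P_i))   # advance P_i(t) only
    N_i_next = neighbors(next_pos, i, delta)             # next-step one-hop neighbors N_i'

    for j in N_i_next:                                   # advance P_j to obtain s_{N_j'}
        P_j = neighbors(pos, j, delta + 2 * eps) - moved
        next_pos[list(P_j)] += rng.uniform(-eps, eps, len(P_j))
        moved |= P_j
    return N_i, N_i_next, next_pos

pos = np.array([0.0, 0.4, 1.1, 2.5, 4.0])                # 5 agents on a line
print(collect_local_transition(pos, i=1, delta=0.5, eps=0.1))
```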
As each agent only needs to interact with one-hop neighbors to update its parameters, the agents can be trained in parallel on a distributed computing architecture, where each compute node only needs to simulate the two-hop neighbor transitions for agents assigned to it for training.
Fig. 1: (a) Illustration of the local data collection process in DARL1N. (b) The coded distributed learning architecture with a central controller and multiple learners.
4 Coded Distributed Learning Architecture
In this section, we introduce an efficient and resilient distributed learning architecture for training DARL1N. A coded distributed learning architecture, illustrated in Fig. 1(b), consists of a central controller and $M$ compute nodes, called learners. The central controller stores a copy of all parameters of the policy $\mu_i$, target policy $\bar{\mu}_i$, Q function $\tilde{Q}_i$, and target Q function $\bar{Q}_i$, for all agents $i = 1, \ldots, N$. In each training iteration, the central controller broadcasts all agents' parameters to all learners, which then calculate and return the gradients required for updating the parameters. In a traditional uncoded distributed learning architecture, each agent is trained (i.e., has its policy and value gradients computed) by a single learner. If any learner becomes slow or unresponsive, i.e., a straggler, the whole training procedure is delayed or may fail. Our coded distributed learning architecture addresses the possible presence of stragglers in the computing system by introducing redundant computations: we let more than one learner train each agent, which not only improves the system's resilience to stragglers but also accelerates the training speed, as we show in Sec. 6.2. To describe which learners are assigned to train each agent, we introduce an assignment matrix $A \in \mathbb{R}^{M \times N}$, with a non-zero entry $A_{j,i} \neq 0$ indicating that learner $j$ is assigned to train agent $i$. The complete set of learners assigned to train agent $i$ can then be determined as $\mathcal{L}_i := \{j \mid A_{j,i} \neq 0\}$. To construct the assignment matrix $A$, we apply coding theory, as explained in Sec. 5.
To calculate the gradients for an agent $i$, each learner $j \in \mathcal{L}_i$ simulates transitions to obtain the interaction data described in Sec. 3, which are stored in a replay buffer $\mathcal{D}_j^i$. After that, learner $j$ calculates the gradient of the temporal difference error (3) needed for updating the Q function parameters of agent $i$ and the policy gradient (4) needed for updating the policy parameters.
As the replay buffer $\mathcal{D}_j^i$ can have a large size, to improve efficiency we use a mini-batch $\mathcal{B}_j^i$ uniformly sampled from $\mathcal{D}_j^i$ to estimate the expectations in (3)-(4). In particular, the temporal difference error in (3) is estimated with:

$$\hat{\mathcal{L}}(\theta_i) = \frac{1}{|\mathcal{B}_j^i|} \sum_{(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}, r_i, s_{\mathcal{N}_i'}) \in \mathcal{B}_j^i} \big(\tilde{Q}_i(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}; \theta_i) - y_i\big)^2. \quad (5)$$

Similarly, the gradient used to update the policy parameters is estimated with:

$$\widehat{\nabla}_{\omega_i} J(\omega_i) = \frac{1}{|\mathcal{B}_j^i|} \sum_{(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}) \in \mathcal{B}_j^i} \nabla_{\omega_i} \mu_i(s_{\mathcal{N}_i})\, \nabla_{a_i} \tilde{Q}_i(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}; \theta_i)\big|_{a_i = \mu_i(s_{\mathcal{N}_i})}. \quad (6)$$
Let $g_j^i$ denote the concatenation of the gradient estimates computed by learner $j$ for agent $i$ (i.e., the gradient of (5) with respect to $\theta_i$ and the policy gradient (6)). Instead of directly returning the estimated gradients for all agents trained by learner $j$, i.e., $\{g_j^i\}_{j \in \mathcal{L}_i}$, learner $j$ calculates a linear combination of the gradients, $\tilde{g}_j = \sum_{i=1}^{N} A_{j,i}\, g_j^i$, with weights provided by the assignment matrix $A$, and returns $\tilde{g}_j$ back to the central controller.
At the central controller, let $R := [\tilde{g}_{j_1}; \ldots; \tilde{g}_{j_K}]$ denote the results that have arrived by a certain time from $K$ learners $j_1, \ldots, j_K$. Moreover, let $A_K$ be the submatrix of $A$ formed by its $j_1$-th, $\ldots$, $j_K$-th rows. The received gradients satisfy:

$$R = \bar{A} G, \quad (7)$$

where $\bar{A}$ is constructed as follows: for the $k$-th row of $\bar{A}$, fill the $\big((j_k - 1)N + 1\big)$-th to $(j_k N)$-th entries with the $k$-th row of $A_K$ and set all other entries to 0, where $j_k$ denotes the index of the $k$-th learner whose result has arrived. The matrix $G$ is the concatenation of all the gradients $g_j^i$ estimated by all learners. The central controller updates the agents' parameters once it receives enough results to decode estimates of all agents' gradients, denoted $\hat{G}$. This happens when $\mathrm{rank}(A_K) = N$, and the decoding equation is given as follows:

$$\hat{G} = A_K^{+} R, \quad (8)$$

where $A_K^{+} = (A_K^{\top} A_K)^{-1} A_K^{\top}$ is a left inverse (pseudo-inverse) of $A_K$.
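To make the encoding and decoding steps concrete, the numpy sketch below uses a small repetition-style assignment matrix and random stand-in gradients: each learner returns one weighted combination of its per-agent gradient estimates, and the controller decodes with a least-squares left inverse once the received rows of $A$ reach rank $N$. The shapes and the least-squares decoder are illustrative assumptions.

```python
import numpy as np

M, N, d = 4, 2, 3            # learners, agents, gradient dimension (illustrative)
A = np.array([[1., 0.],      # repetition-style assignment: learners 0,1 train agent 0,
              [1., 0.],      #                              learners 2,3 train agent 1
              [0., 1.],
              [0., 1.]])

rng = np.random.default_rng(0)
g = rng.normal(size=(M, N, d))           # g[j, i]: learner j's gradient estimate for agent i

# Learner side: each learner j returns one weighted combination of its own estimates.
encoded = np.einsum('ji,jid->jd', A, g)  # row j equals sum_i A[j, i] * g[j, i]

# Controller side: decode from the learners that responded (learner 1 is a straggler).
arrived = [0, 2, 3]
A_K = A[arrived]                                    # received rows of A
if np.linalg.matrix_rank(A_K) == N:                 # enough results to decode
    G_hat, *_ = np.linalg.lstsq(A_K, encoded[arrived], rcond=None)
    print(G_hat.shape)                              # (N, d): one decoded gradient per agent
```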
Alg. 1 summarizes the coded training procedure of DARL1N over a distributed computing architecture, which we refer to as Coded DARL1N.
In Coded DARL1N, the gradients $\hat{G}$ used by the central controller for the parameter updates are stochastic gradients computed from mini-batch samples, which are estimates of the true gradients $G^*$, where the $i$-th block of $G^*$ stacks $\nabla_{\omega_i} J(\omega_i)$ and $\nabla_{\theta_i} \mathcal{L}(\theta_i)$ defined in (4) and (3), respectively. The estimation performance is characterized in the following theorem.
Theorem 4.4.
The mini-batch stochastic gradients $\hat{G}$ computed by Coded DARL1N are unbiased estimates of the true gradients $G^*$, with variance determined by the assignment matrix $A$.
Proof 4.5.
See Appendix A.3.
Based on Theorem 4.4, we can infer that Coded DARL1N converges asymptotically similarly to other stochastic gradient descent-based methods [31].
5 Assignment Matrix Construction and Assessment
The assignment matrix affects both the policy variance and computational efficiency of Coded DARL1N. In this section, we explore different schemes, both uncoded and coded, for constructing the assignment matrix, and conduct theoretical analyses on their performance.
5.1 Assignment Matrix Construction
5.1.1 Uncoded Assignment Scheme
In an uncoded distributed training architecture, different learner nodes train different agents exclusively. The assignment matrix can then be constructed as $A = \begin{bmatrix} I_N \\ \mathbf{0} \end{bmatrix}$, where $I_N$ is the $N \times N$ identity matrix and $\mathbf{0}$ is an $(M - N) \times N$ zero matrix.
5.1.2 Coded Assignment Schemes
Coded distributed training assigns each agent to multiple learners. Here, we investigate four codes whose encoding matrices can be directly utilized as the assignment matrix (a construction sketch for all schemes is given after this list).
- MDS Code: A Maximum Distance Separable (MDS) code [32] constructs a dense assignment matrix, e.g., from a Vandermonde matrix [33], in which every entry is non-zero and any $N$ rows are linearly independent, so that each agent is assigned to all learners.
- Random Sparse Code: Compared to an MDS code, a Random Sparse code [34] results in a sparser assignment matrix, with the $(j,i)$-th entry given by $A_{j,i} = z$ with probability $q$ and $A_{j,i} = 0$ with probability $1 - q$ (9), where $z$ is drawn from a standard normal distribution and $q \in (0, 1]$ controls the density.
- Repetition Code: A Repetition code [34] assigns agents to the learners repetitively in a round-robin fashion. The $(j,i)$-th entry of the assignment matrix is given by $A_{j,i} = 1$ if $i = \big((j-1) \bmod N\big) + 1$ and $A_{j,i} = 0$ otherwise (10), where $\bmod$ is the modulo operator.
- LDGM Code: An LDGM code [35] is a special type of Low Density Parity Check code [36] that constructs an even sparser assignment matrix. By applying a systematic biased random code ensemble [35], the LDGM assignment matrix takes the form $A = \begin{bmatrix} I_N \\ G \end{bmatrix}$, where each entry of $G \in \{0,1\}^{(M-N) \times N}$ is generated independently according to a Bernoulli distribution with success probability $p_1$. Note that when $p_1$ is small, the assignment matrix of the LDGM code has a low density.
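The sketch below constructs illustrative $M \times N$ assignment matrices for the schemes above. The Vandermonde-based MDS construction, the standard-normal non-zero entries for the Random Sparse code, and the stacked identity-plus-Bernoulli form of the LDGM matrix are assumptions made for illustration.

```python
import numpy as np

def uncoded(M, N):
    """Each agent trained by exactly one learner."""
    return np.vstack([np.eye(N), np.zeros((M - N, N))])

def mds_vandermonde(M, N):
    """Dense MDS-style matrix: any N rows are linearly independent (Vandermonde)."""
    return np.vander(np.arange(1.0, M + 1), N, increasing=True)

def repetition(M, N):
    """Round-robin assignment; each agent is replicated on M // N learners."""
    A = np.zeros((M, N))
    A[np.arange(M), np.arange(M) % N] = 1.0
    return A

def random_sparse(M, N, q=0.3, seed=0):
    """Each entry is non-zero (standard normal) with probability q."""
    rng = np.random.default_rng(seed)
    return (rng.random((M, N)) < q) * rng.normal(size=(M, N))

def ldgm(M, N, p1=0.1, seed=0):
    """Systematic LDGM-style matrix: identity block stacked on a sparse Bernoulli block."""
    rng = np.random.default_rng(seed)
    G = (rng.random((M - N, N)) < p1).astype(float)
    return np.vstack([np.eye(N), G])

M, N = 8, 4
for build in (uncoded, mds_vandermonde, repetition, random_sparse, ldgm):
    print(build.__name__, np.count_nonzero(build(M, N)))
```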
5.2 Analysis and Comparison of Assignment Schemes
We provide a theoretical analysis and comparison of different assignment schemes from the following aspects: 1) computation overhead and 2) resilience to stragglers.
5.2.1 Computation Overhead
The coded schemes mitigate the impact of stragglers by assigning each agent to multiple learners. The training performed by the extra learners is redundant. To quantify the computation overhead introduced by this redundancy, we use the following metric:

$$O := \frac{1}{N} \sum_{i=1}^{N} \big\|A_{:,i}\big\|_0 \;-\; 1,$$

where the first term on the right-hand side calculates the average number of learners used for training each agent, and $\|A_{:,i}\|_0$ denotes the number of non-zero entries in the $i$-th column of $A$. Using the above metric, the computation overhead of each assignment scheme can be derived as follows:
- Uncoded: $O = 0$, as each agent is assigned to only one learner in the uncoded scheme.
- MDS Code: $O = M - 1$. All entries of the MDS assignment matrix are non-zero, indicating that each agent is assigned to all learners.
- Random Sparse Code: $O$ depends on the parameter $q$; under the construction in (9), its expectation is $\mathbb{E}[O] = qM - 1$.
- Repetition Code: $O = M/N - 1$.
- LDGM Code: $O$ depends on the parameter $p_1$; under the systematic construction above, its expectation is $\mathbb{E}[O] = (M - N)p_1$.
Among these schemes, the MDS code incurs the highest computation overhead, while the uncoded scheme results in the lowest. The overhead introduced by the Random Sparse and LDGM codes depends on their parameters, $q$ and $p_1$, respectively.
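A minimal sketch of this overhead metric, applied to two of the assignment-matrix constructions sketched in Sec. 5.1, is given below; the matrix sizes are illustrative.

```python
import numpy as np

def computation_overhead(A):
    """Average number of learners assigned per agent, minus one (the metric above)."""
    learners_per_agent = np.count_nonzero(A, axis=0)    # non-zeros in each column of A
    return learners_per_agent.mean() - 1.0

# Usage (M = 8 learners, N = 4 agents).
M, N = 8, 4
uncoded = np.vstack([np.eye(N), np.zeros((M - N, N))])
mds = np.vander(np.arange(1.0, M + 1), N, increasing=True)
print(computation_overhead(uncoded))   # 0.0
print(computation_overhead(mds))       # M - 1 = 7.0
```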
5.2.2 Resilience to Stragglers
According to (8), the central controller can decode the gradients and update the agents' parameters only after receiving results from enough learners, specifically when $\mathrm{rank}(A_K) = N$. To evaluate the resilience of the assignment schemes to stragglers, we analyze the probability of each scheme being influenced by stragglers under the following assumption.
Assumption 4
In each training iteration, each compute node in the distributed computing system has a probability $p$ of becoming a straggler.
The Random Sparse and LDGM codes have randomly generated entries that depend on the parameters $q$ and $p_1$, making them hard to analyze theoretically. We therefore focus our analysis on the Uncoded, MDS, and Repetition schemes.
Proposition 5.6.
The probability that the Uncoded scheme will be influenced by stragglers is $1 - (1 - p)^N$.
Proposition 5.7.
The probability that the MDS code will be influenced by stragglers is $\sum_{j = M - N + 1}^{M} \binom{M}{j} p^{j} (1 - p)^{M - j}$.
Proposition 5.8.
The probability that the Repetition code will be influenced by stragglers is $1 - \big(1 - p^{M/N}\big)^{N}$, given that $M/N$ is a positive integer.
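The sketch below checks these closed-form expressions against a Monte Carlo simulation of Assumption 4; the repetition matrix construction and the parameter values are illustrative assumptions.

```python
import numpy as np
from math import comb

def p_uncoded(p, N):          # influenced if any of the N assigned learners straggles
    return 1 - (1 - p) ** N

def p_mds(p, M, N):           # influenced only if more than M - N learners straggle
    return sum(comb(M, j) * p**j * (1 - p)**(M - j) for j in range(M - N + 1, M + 1))

def p_repetition(p, M, N):    # each of the N groups of M/N copies needs one survivor
    return 1 - (1 - p ** (M // N)) ** N

def monte_carlo(A, p, trials=20_000, seed=0):
    """Fraction of iterations in which the surviving rows of A fail to reach rank N."""
    rng = np.random.default_rng(seed)
    M, N = A.shape
    fail = 0
    for _ in range(trials):
        alive = rng.random(M) >= p                       # Assumption 4: straggle w.p. p
        ok = alive.any() and np.linalg.matrix_rank(A[alive]) == N
        fail += not ok
    return fail / trials

M, N, p = 8, 4, 0.1
A_rep = np.zeros((M, N)); A_rep[np.arange(M), np.arange(M) % N] = 1.0
print(p_uncoded(p, N), p_mds(p, M, N), p_repetition(p, M, N))
print(monte_carlo(A_rep, p))   # should be close to p_repetition(p, M, N)
```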
6 Experiments
In this section, we evaluate the DARL1N algorithm and our coding schemes for mitigating compute stragglers.
6.1 Performance of DARL1N
We conduct a series of comparisons between DARL1N and four state-of-the-art MARL algorithms. For fair comparison, we train DARL1N using a distributed learning architecture with uncoded assignments, and run our experiments on the Amazon EC2 computing clusters [37], which are considered reliable and free of stragglers.
6.1.1 Experiment Settings
Environment Configurations
We evaluate DARL1N in five environments, including the Ising Model [9], Food Collection, Grassland, Adversarial Battle [10], and Multi-Access Wireless Communication [8], which cover cooperative and mixed cooperative competitive games. Please refer to [8, 9, 10] for the description of each environment.
To understand the scalability of our method, we vary the number of agents and the size of the local state spaces. The specific configurations for the first four environments can be found in the conference version [29]. In the Multi-Access Wireless Communication environment, which was not considered in [29], we adopt the setting in [8] and consider a grid of agents, where each agent's state indicates whether it has a packet to send by a given time step; we evaluate two settings of this parameter, which controls the size of the local state space.
Neighborhood Configuration
In both the Ising Model and Multi-Access Wireless Communication environments, the agents are arranged in a two-dimensional lattice graph, with rewards depending solely on their proximal agents. Consequently, an agent's one-hop neighbors are naturally defined as those directly connected to it in the lattice, including itself. In the other three environments, the agents are trained to avoid one another. Therefore, the one-hop neighbor distances are naturally set to the Euclidean safety distances, which are scaled with the number of agents in each scenario. Each agent observes its one-hop neighbors to obtain their states. Other distance metrics that account for velocity could also be employed, which is left for future work.
Benchmarks
We compare DARL1N with four state-of-the-art MARL methods: MADDPG [4], MFAC [9], EPC [10], and SAC [8].
Evaluation Metrics
We measure the performance using two criteria: training efficiency and policy quality. To measure the training efficiency, we use two metrics: 1) average training time spent to run a specified number of training iterations and 2) convergence time. The convergence time is defined as the time when the variance of the average total training reward over 90 consecutive iterations does not exceed 2% of the absolute mean reward, where the average total training reward is the total reward of all agents averaged over 10 episodes in three training runs with different random seeds. To measure policy quality, we use convergence reward, which is the average total training reward at the convergence time.
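A minimal sketch of this convergence criterion is shown below, applied to a synthetic reward curve (an illustrative assumption); the 90-iteration window and the 2% threshold follow the definition above.

```python
import numpy as np

def convergence_time(rewards, times, window=90, tol=0.02):
    """First time at which the reward curve satisfies the convergence criterion above."""
    r = np.asarray(rewards, dtype=float)
    for k in range(window, len(r) + 1):
        w = r[k - window:k]                          # last `window` consecutive iterations
        if np.var(w) <= tol * abs(np.mean(w)):       # variance within 2% of |mean|
            return times[k - 1]
    return None

# Usage on a synthetic reward curve that rises and then flattens out.
t = np.arange(500.0)
rewards = 100 * (1 - np.exp(-t / 80)) + np.random.default_rng(0).normal(0, 0.3, t.size)
print(convergence_time(rewards, times=t))
```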
Fig. 2: Average time per training iteration versus the number of agents for each method in the four environments (panels (a)-(d)).
Computing Configurations
The computing resources are configured so that DARL1N utilizes roughly the same or fewer resources than the baseline methods, as described in [29]. For the Multi-Access Wireless Communication environment, we employ separate Amazon EC2 instance types for DARL1N, for MADDPG and MFAC, and for EPC training. The configurations of the training parameters, as well as the representations of the policy and Q functions, can be found in [29].
6.1.2 Comparative Studies
Ising Model
Tab. 1: Convergence time and convergence reward in the Ising Model environment; within each metric, columns correspond to increasing numbers of agents (left to right).

| Method | Convergence Time (s) | | | | Convergence Reward | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MADDPG | 62 | 263 | 810 | 1996 | 460 | 819 | 1280 | 1831 |
| MFAC | 63 | 274 | 851 | 2003 | 468 | 814 | 1276 | 1751 |
| EPC | 101 | 26 | 51 | 62 | 468 | 831 | 1278 | 3321 |
| EPC Scratch | 101 | 412 | 993 | 2995 | 468 | 826 | 1275 | 2503 |
| DARL1N | 38 | 102 | 210 | 110 | 465 | 828 | 1279 | 2282 |
As shown in Tab. 1, when the number of agents is small, all methods achieve roughly the same reward. DARL1N takes the least amount of time to converge, while EPC takes the longest. When the number of agents increases, EPC converges almost immediately, and the convergence reward it achieves in the largest scenario is much higher than that of the other methods. The reason is that, in the Ising Model, each agent only needs information from its four fixed neighbors, and hence in EPC the policy obtained from the previous stage can be applied directly to the current stage. The other methods train the agents from scratch without curriculum learning. For illustration, we also show the performance achieved by training EPC from scratch without curriculum learning (labeled EPC Scratch in Tab. 1). The results show that EPC Scratch converges much more slowly than EPC as the number of agents increases. Note that when the number of agents is 9, EPC and EPC Scratch are the same. Moreover, DARL1N achieves a reward comparable with that of EPC Scratch but converges much faster. From Fig. 2, we can observe that DARL1N requires much less time to perform a training iteration than the benchmark methods.
Food Collection
Tab. 2: Convergence time and convergence reward in the Food Collection environment; within each metric, columns correspond to increasing numbers of agents (left to right).

| Method | Convergence Time (s) | | | | Convergence Reward | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MADDPG | 501 | 1102 | 4883 | 2005 | 24 | 24 | -112 | -364 |
| MFAC | 512 | 832 | 4924 | 2013 | 20 | 23 | -115 | -362 |
| EPC | 1314 | 723 | 2900 | 8104 | 31 | 34 | -16 | -87 |
| DARL1N | 502 | 382 | 310 | 1830 | 14 | 25 | 13 | -61 |
Fig. 3: Training reward curves in the Food Collection environment (panels (a) and (b)).
Tab. 2 shows that, in Food Collection, when the problem scale is small, DARL1N, MADDPG, and MFAC achieve similar performance in terms of policy quality. As the problem scale increases, the performance of MADDPG and MFAC degrades significantly and becomes much worse than that of DARL1N or EPC in the larger scenarios, which is also illustrated in Fig. 3. The convergence reward achieved by DARL1N is comparable to, and sometimes higher than, that achieved by EPC. Moreover, the convergence speed of DARL1N is the highest among all methods in all scenarios. Notably, the convergence time of DARL1N and EPC increases, while that of MADDPG and MFAC decreases, as the number of agents increases to 24. This occurs because MADDPG and MFAC fail to handle such large-scale networks, causing them to stop learning earlier.
To evaluate the impact of the proposed one-hop neighbor reward formulation on the learning performance, we also present in Fig. 3 the training rewards of DARL1N with a standard reward definition, labeled as DARL1N (Full Range), whose neighbor distance is set to cover the entire environment, thereby including all agents as one-hop neighbors. The results show that DARL1N (Full Range) achieves performance comparable to EPC but performs worse than DARL1N with a small number of agents considered as one-hop neighbors. This suggests that in the Food Collection environment, agent behavior primarily depends on interactions with a nearby, smaller group of agents. Fig. 2 illustrates the training time of DARL1N (Full Range), which increases compared to DARL1N due to the inclusion of more agents in the reward calculations.
Fig. 2 also presents the training times of the four benchmarks. Among all methods compared, DARL1N achieves the highest training efficiency, and its training time grows linearly as the number of agents increases. In the largest scenario, EPC takes the longest time to train. This is because of the complex policy and Q neural network architectures in EPC, whose input dimensions grow linearly and quadratically, respectively, with more agents.
To demonstrate DARL1N’s applicability to general MARL problems with global reward and transition models, we conduct a comparison study using a variant of the Food Collection environment where agents must coordinate to exclusively collect all the food. As shown in Fig. 4(a), DARL1N achieves the highest reward level with the fastest training speed. This is due to its distributed learning architecture, which reduces training experience correlation and accelerates training through parallel computing, even without decomposition. In contrast, EPC performs significantly worse, likely because curriculum learning struggles with global agent coordination.
Fig. 4: (a) Training rewards in the Food Collection variant requiring coordinated, exclusive food collection. (b) Normalized total rewards of agents trained by different methods competing in the Adversarial Battle environment.
Grassland
Tab. 3: Convergence time and convergence reward in the Grassland environment; within each metric, columns correspond to increasing numbers of agents (left to right).

| Method | Convergence Time (s) | | | | Convergence Reward | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MADDPG | 423 | 6271 | 2827 | 1121 | 21 | 11 | -302 | -612 |
| MFAC | 431 | 7124 | 3156 | 1025 | 23 | 9 | -311 | -608 |
| EPC | 4883 | 2006 | 3324 | 15221 | 12 | 38 | 105 | 205 |
| DARL1N | 103 | 402 | 1752 | 5221 | 18 | 46 | 113 | 210 |
Similar to the results in the Food Collection environment, the policy generated by DARL1N is equally good or even better than those generated by the benchmark methods, as shown in Tab. 3 and Fig. 2, especially when the problem scale is large. DARL1N also has the fastest convergence speed and takes the shortest time to run a training iteration.
Adversarial Battle
Tab. 4: Convergence time and convergence reward in the Adversarial Battle environment; within each metric, columns correspond to increasing numbers of agents (left to right).

| Method | Convergence Time (s) | | | | Convergence Reward | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MADDPG | 452 | 1331 | 1521 | 7600 | -72 | -211 | -725 | -1321 |
| MFAC | 463 | 1721 | 1624 | 6234 | -73 | -221 | -694 | -1201 |
| EPC | 1512 | 1432 | 2041 | 9210 | -75 | -215 | -405 | -642 |
| DARL1N | 121 | 756 | 1123 | 3110 | -71 | -212 | -410 | -682 |
In this environment, DARL1N again achieves good performance in terms of policy quality and training efficiency compared to the benchmark methods, as shown in Tab. 4 and Fig. 2. To further evaluate the performance, we reconsider the largest scenario and train the good agents and the adversary agents using two different methods. The trained good agents and adversary agents then compete with each other. We apply Min-Max normalization to measure the normalized total reward achieved by the agents on each side in an episode. To reduce uncertainty, we generate 10 episodes and record the mean values and standard deviations. As shown in Fig. 4(b), DARL1N achieves the best performance, and both DARL1N and EPC significantly outperform MADDPG and MFAC.
Multi-Access Wireless Communication
Fig. 5 shows that SAC achieves a higher reward than DARL1N when the local state space is small. However, as the state-space parameter increases, causing an exponential growth of the state space, DARL1N achieves a much higher reward and converges much faster than SAC. This demonstrates that DARL1N scales better than SAC with the size of the state space.
6.1.3 Impact of Neighbor Distance
The neighbor distance $\delta$ in DARL1N determines the number of one-hop neighbors of an agent, thereby influencing both training efficiency and policy quality. To evaluate its impact, we conduct experiments in the Grassland environment with three good agents and three adversary agents. A positive reward is given to a good agent for collecting a grass pellet and a negative reward for colliding with an adversary agent.
The results shown in Fig. 6 indicate that, as the neighbor distance increases, the training reward increases, but so does the training time. This stems from the increased number of one-hop neighbors each agent must consider, which requires each learner to collect and process more data. This phenomenon reveals a trade-off between training efficiency and policy quality controlled by the neighbor distance, which can be chosen to strike a good balance.
6.2 Performance of Coded Distributed Learning Architecture
In this section, we first conduct numerical studies to evaluate the performance of different assignment schemes described in Sec. 5. We then train DARL1N over the proposed coded distributed learning architecture and conduct experimental studies to evaluate its performance in mitigating the effect of computing stragglers.
Fig. 7: Average training time of Coded DARL1N under different assignment schemes as the straggler probability varies, for (a) a milder and (b) a more severe straggler effect.
Tab. 5: Computation overhead, success rate, and average gradient variance $V$ of the different assignment schemes. Success rates are reported for three increasing straggler probabilities, and average $V$ for three settings of the number of learner results used (Sec. 6.2.1). The three Random Sparse rows and the three LDGM rows correspond to increasing values of $q$ and $p_1$, respectively.

| Scheme | Computation overhead | Success rate | | | Average $V$ | | |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Uncoded | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| MDS | 23 | 1 | 1 | 1 | 124.82 | 79.75 | 73.64 |
| Random Sparse | 4.5 | 1 | 0.78 | 0.17 | 11.48 | 1.62 | -0.68 |
| Random Sparse | 8.0 | 1 | 1 | 0.98 | 16.82 | 2.93 | -1.28 |
| Random Sparse | 18.3 | 1 | 1 | 1 | 13.15 | 1.40 | -3.31 |
| Repetition | 1 | 1 | 0.603 | 0 | 0 | 0 | -8.32 |
| LDGM | 0.91 | 1 | 0.253 | 0 | 2.08 | -1.46 | -4.59 |
| LDGM | 4.41 | 1 | 0.79 | 0.11 | 8.93 | 1.05 | -3.21 |
| LDGM | 5.5 | 1 | 0.99 | 0.41 | 13.13 | 5.13 | -0.31 |
6.2.1 Numerical Evaluation
We conduct numerical studies to evaluate the performance of different assignment schemes in the following three aspects.
- Computation overhead: The metric defined in Sec. 5.2.1 is applied, with the mean overhead averaged over 10 experiment runs used for the Random Sparse and LDGM schemes.
- Resilience to stragglers: The success rate, computed as follows, is used. We randomly turn some learners into stragglers that fail to return any results, according to Assumption 4. Monte Carlo simulations are then conducted to measure the success rate, which is the ratio of training iterations in which the gradients can be successfully estimated from the results returned by non-stragglers.
- Impact on policy quality: According to Theorem 4.4, the gradients estimated by Coded DARL1N are unbiased, but their variance depends on the assignment matrix. Therefore, we use the variance of the estimated gradients, denoted by $V$, to assess the impact of the assignment schemes on policy quality. Specifically, we vary the number of learners whose results are used by the central controller to estimate the gradients and calculate the average value of $V$ over Monte Carlo simulation runs.
Tab. 5 presents the results. The performance of the Random Sparse and LDGM schemes, characterized by the parameters $q$ and $p_1$, respectively, is evaluated across different parameter values. The results show that the Uncoded scheme has the lowest computation overhead and the smallest variance in most scenarios, but the poorest resilience to stragglers. In contrast, the MDS scheme exhibits the best resilience to stragglers but the largest computation overhead and variance. The Repetition scheme has the smallest variance among all schemes and the lowest computation overhead among the coded schemes, though it is relatively less resilient. For the Random Sparse and LDGM schemes, increasing $q$ or $p_1$ leads to higher computation overhead and improved resilience to stragglers. However, the Random Sparse scheme generally exhibits larger variance than LDGM. In the following experiments, we fix $q$ and $p_1$ to single representative values.
6.2.2 Experiments
To understand the performance of the coded distributed learning architecture, we train DARL1N using different assignment schemes and evaluate its performance in different straggler scenarios simulated on Amazon EC2.
Experiment Settings
We select the Food Collection environment and fix the number of agents and the number of learners across all experiments. To evaluate the impact of stragglers, we vary the straggler probability $p$ in Assumption 4. As Amazon EC2 computing instances are generally stable, we simulate stragglers by having selected compute nodes delay returning their results by a prescribed amount of time (the straggler effect). Evaluations on other computing systems where stragglers are more common, such as wireless and mobile computing systems, are deferred to future work.
Experiment Results
We first evaluate the average training time of DARL1N with different assignment schemes, varying the straggler probability $p$ and the straggler effect (the delay added to a straggler's response). The results are shown in Fig. 7(a) for a milder straggler effect and in Fig. 7(b) for a more severe one, where the training time is measured by averaging the time for running 30 training iterations. We can observe that when no stragglers exist ($p = 0$), the Uncoded scheme is the most efficient, as it has zero computation overhead. The MDS and Random Sparse schemes require a much longer training time than the other schemes due to the substantial computation overhead they introduce. When stragglers exist ($p > 0$), the performance of the Uncoded scheme degrades significantly, especially when the straggler effect is severe, as shown in Fig. 7(b). Compared to the Uncoded scheme, the LDGM and Repetition schemes are more resilient to stragglers, as indicated by the slower increase in training time, and they are also more efficient than the MDS and Random Sparse schemes in most cases. On the other hand, the training time of the MDS and Random Sparse schemes does not grow much as $p$ increases, evidencing their high resilience to stragglers. Although they require more training time than the other schemes when the straggler effect or the straggler probability is small, they achieve higher training efficiency when the straggler probability and/or the straggler effect are large.
To evaluate the impact of different assignment schemes on the quality of the trained policies, we measure the training reward achieved by each DARL1N implementation under a fixed straggler effect. Tab. 6 summarizes the convergence time, convergence reward, and variance $V$, averaged over 600 training iterations, for the different implementations as the straggler probability increases. We can see that the Uncoded scheme converges quickly when no stragglers exist ($p = 0$), but its convergence speed decreases significantly when stragglers exist ($p > 0$). The MDS and Random Sparse schemes achieve the lowest reward and slowest convergence rate, while the LDGM scheme achieves the highest reward and convergence rate in most cases, especially when the straggler probability is large. The Repetition scheme generally achieves a good training reward and converges quickly when the straggler probability is small. Moreover, a larger average $V$ generally corresponds to a lower reward.
Tab. 6: Convergence time, convergence reward, and average gradient variance $V$ of DARL1N under different assignment schemes, for three increasing straggler probabilities (left to right within each metric).

| Scheme | Convergence Time (s) | | | Convergence Reward | | | Average $V$ | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Uncoded | 323 | 510 | 521 | 8 | 5 | 10 | 0 | 0 | 0 |
| MDS | 502 | 748 | 625 | -231 | -253 | -212 | 82.08 | 100.10 | 104.06 |
| Random Sparse | 512 | 820 | 670 | -252 | -231 | -227 | 15.28 | 13.47 | 12.88 |
| Repetition | 331 | 372 | 564 | 11 | 6 | 8 | -3.16 | -4.49 | -4.15 |
| LDGM | 324 | 320 | 450 | 9 | 12 | 14 | -0.51 | 0.14 | 0.35 |
6.2.3 Discussion
The experiment results above suggest guidelines for selecting an appropriate assignment scheme of agents to learners. The Uncoded scheme has zero computation overhead and low variance but is the least resilient to stragglers, making it best suited for stable distributed systems, such as server-based setups with wired communication. The MDS scheme, while offering the highest resilience, incurs the highest computation overhead and variance, leading to poor policy quality, and is therefore unsuitable for distributed training of DARL1N. In unstable distributed systems, where stragglers are present, a trade-off between policy quality and resilience must be considered. If policy quality is the priority, the Repetition scheme is an excellent choice. Conversely, if resilience is more critical, the Random Sparse scheme is preferable. For a balanced approach that addresses both aspects, the LDGM scheme is a good option.
7 Limitations and Future Work
In environments with local reward and transition models, DARL1N needs a suitably chosen distance metric to establish agent neighborhoods that strike the right balance between policy quality and the ability to distribute the training efficiently. In future work, we will explore learning a neighbor distance metric that adapts to the environment, e.g., based on past episodes or contextual information, to achieve an effective balance between policy reward and training speed. Moreover, the coded distributed learning architecture for DARL1N is designed for a distributed computing system with a stable central controller in place. In future work, we will design a new coded architecture that improves the resilience of the central controller, e.g., by introducing redundant controllers using coding theory. Other directions for further improvement include the integration of curriculum learning similar to EPC, handling partially observable states, and software infrastructure to support distributed learning with low-latency communication.
8 Conclusion
This paper introduced DARL1N, a scalable MARL algorithm that can be trained over a distributed computing architecture. DARL1N reduces the representation complexity of the value and policy functions of each agent in a MARL problem by disregarding the influence of agents that are not within one hop of a proximity graph. This model enables highly efficient distributed training, in which a compute node only needs data from the agents it is training and their nearby (one- and two-hop) neighbors. We conducted comprehensive experiments using five MARL environments and compared DARL1N with four state-of-the-art MARL algorithms. DARL1N generates equally good or better policies in almost all scenarios, with significantly higher training efficiency than the benchmark methods, especially in large-scale problem settings. To improve the resilience of DARL1N to stragglers, which are common in distributed computing systems, we developed coding schemes that assign each agent to multiple learners. We evaluated the properties of the MDS, Random Sparse, Repetition, and LDGM codes and provided guidelines for selecting a suitable assignment scheme in different situations.
References
- [1] Y. Jin, S. Wei, J. Yuan, and X. Zhang, “Hierarchical and stable multiagent reinforcement learning for cooperative navigation control,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
- [2] R. Song, F. L. Lewis, and Q. Wei, “Off-policy integral reinforcement learning method to solve nonlinear continuous-time multiplayer nonzero-sum games,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 704–713, 2016.
- [3] G.-P. Antonio and C. Maria-Dolores, “Multi-agent deep reinforcement learning to manage connected autonomous vehicles at tomorrow’s intersections,” IEEE Transactions on Vehicular Technology, vol. 71, no. 7, pp. 7033–7043, 2022.
- [4] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 6379–6390.
- [5] L. Buşoniu, R. Babuška, and B. De Schutter, “Multi-agent reinforcement learning: An overview,” Innovations in Multi-agent Systems and Applications, pp. 183–221, 2010.
- [6] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” in AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
- [7] K. Gogineni, P. Wei, T. Lan, and G. Venkataramani, “Scalability bottlenecks in multi-agent reinforcement learning systems,” arXiv preprint arXiv:2302.05007, 2023.
- [8] G. Qu, A. Wierman, and N. Li, “Scalable reinforcement learning of localized policies for multi-agent networked systems,” in Learning for Dynamics and Control (L4DC). PMLR, 2020, pp. 256–266.
- [9] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang, “Mean field multi-agent reinforcement learning,” in International Conference on Machine Learning (ICML). PMLR, 2018, pp. 5571–5580.
- [10] Q. Long, Z. Zhou, A. Gupta, F. Fang, Y. Wu, and X. Wang, “Evolutionary population curriculum for scaling multi-agent reinforcement learning,” in International Conference on Learning Representations (ICLR), 2020.
- [11] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, and T. Graepel, “Value-decomposition networks for cooperative multi-agent learning,” arXiv preprint arXiv:1706.05296, 2017.
- [12] T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson, “Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning,” in International Conference on Machine Learning (ICML). PMLR, 2018, pp. 4295–4304.
- [13] K. Son, D. Kim, W. J. Kang, D. E. Hostallero, and Y. Yi, “Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning,” in International Conference on Machine Learning. PMLR, 2019, pp. 5887–5896.
- [14] S. Iqbal and F. Sha, “Actor-attention-critic for multi-agent reinforcement learning,” in International Conference on Machine Learning. PMLR, 2019, pp. 2961–2970.
- [15] A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen et al., “Massively parallel methods for deep reinforcement learning,” in International Conference on Machine Learning (ICML), 2015.
- [16] B. Wang, J. Xie, and N. Atanasov, “Coding for Distributed Multi-Agent Reinforcement Learning,” in IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 10 625–10 631.
- [17] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Basar, “Fully decentralized multi-agent reinforcement learning with networked agents,” in International Conference on Machine Learning (ICML). PMLR, 2018, pp. 5872–5881.
- [18] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning (ICML), 2016, pp. 1928–1937.
- [19] OpenAI, “A2C,” https://openai.com/blog/baselines-acktr-a2c/, 2022, accessed: 2022-01-13.
- [20] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing, “More effective distributed ml via a stale synchronous parallel parameter server,” in Advances in Neural Information Processing Systems (NeurIPS), 2013, pp. 1223–1231.
- [21] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, “Communication efficient distributed machine learning with the parameter server,” in Advances in Neural Information Processing Systems (NeurIPS), 2014, pp. 19–27.
- [22] S. Dutta, G. Joshi, S. Ghosh, P. Dube, and P. Nagpurkar, “Slow and stale gradients can win the race: Error-runtime trade-offs in distributed sgd,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2018, pp. 803–812.
- [23] M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative agents,” in Proceedings of the tenth international conference on machine learning, 1993, pp. 330–337.
- [24] A. Reisizadeh, S. Prakash, R. Pedarsani, and A. S. Avestimehr, “Coded computation over heterogeneous clusters,” IEEE Transactions on Information Theory, vol. 65, no. 7, pp. 4227–4242, 2019.
- [25] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding Up Distributed Machine Learning Using Codes,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, Mar. 2018.
- [26] Y. Yang, P. Grover, and S. Kar, “Coded distributed computing for inverse problems,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [27] B. Zhou, J. Xie, and B. Wang, “Dynamic Coded Distributed Convolution for UAV-based Networked Airborne Computing,” in IEEE International Conference on Unmanned Aircraft Systems (ICUAS), 2022, pp. 955–961.
- [28] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded MapReduce,” in Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2015, pp. 964–971.
- [29] B. Wang, J. Xie, and N. Atanasov, “DARL1N: Distributed multi-agent reinforcement learning with one-hop neighbors,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022, pp. 9003–9010.
- [30] F. Bullo, J. Cortés, and S. Martinez, Distributed control of robotic networks: a mathematical approach to motion coordination algorithms. Princeton University Press, 2009.
- [31] X. Qian and D. Klabjan, “The impact of the mini-batch size on the variance of gradients in stochastic gradient descent,” arXiv preprint arXiv:2004.13146, 2020.
- [32] J. Lacan and J. Fimes, “Systematic mds erasure codes based on vandermonde matrices,” IEEE Communications Letters, vol. 8, no. 9, pp. 570–572, 2004.
- [33] A. Klinger, “The vandermonde matrix,” The American Mathematical Monthly, vol. 74, no. 5, pp. 571–574, 1967.
- [34] K. Lee, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Coded computation for multicore setups,” in IEEE International Symposium on Information Theory (ISIT), 2017, pp. 2413–2417.
- [35] S. Cai, W. Lin, X. Yao, B. Wei, and X. Ma, “Systematic convolutional low density generator matrix code,” IEEE Transactions on Information Theory, vol. 67, no. 6, pp. 3752–3764, 2021.
- [36] R. Gallager, “Low-density parity-check codes,” IRE Transactions on information theory, vol. 8, no. 1, pp. 21–28, 1962.
- [37] AWS, “Amazon ec2,” https://aws.amazon.com/ec2/, 2022, accessed: 2022-01-13.
Appendix A Appendix
A.1 Proof of Lemma 1
Consider the Q-value function of agent $i$. For two different sets of non-neighbor states and actions, $(s_{-\mathcal{N}_i}, a_{-\mathcal{N}_i})$ and $(s_{-\mathcal{N}_i}', a_{-\mathcal{N}_i}')$, we first show that:

$$\big|Q_i(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}, s_{-\mathcal{N}_i}, a_{-\mathcal{N}_i}) - Q_i(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}, s_{-\mathcal{N}_i}', a_{-\mathcal{N}_i}')\big| \leq \frac{2\gamma\bar{r}}{1-\gamma}. \quad (11)$$

Letting $z$ and $z'$ denote $(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}, s_{-\mathcal{N}_i}, a_{-\mathcal{N}_i})$ and $(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}, s_{-\mathcal{N}_i}', a_{-\mathcal{N}_i}')$, respectively, we have:

$$\big|Q_i(z) - Q_i(z')\big| = \bigg|\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t r_i(t) \,\Big|\, z\Big] - \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t r_i(t) \,\Big|\, z'\Big]\bigg| \overset{(a)}{=} \bigg|\mathbb{E}\Big[\sum_{t=1}^{\infty}\gamma^t r_i(t) \,\Big|\, z\Big] - \mathbb{E}\Big[\sum_{t=1}^{\infty}\gamma^t r_i(t) \,\Big|\, z'\Big]\bigg| \leq \sum_{t=1}^{\infty} 2\gamma^t \bar{r} = \frac{2\gamma\bar{r}}{1-\gamma}, \quad (12)$$

where $(a)$ derives from the fact that $s_{\mathcal{N}_i}$ and $a_{\mathcal{N}_i}$ are part of both $z$ and $z'$, so by Assumption 2 the rewards at $t = 0$ are identical. In the above equations, the expectation is over state-action trajectories generated by the joint policy and the transition model. Then, since $\tilde{Q}_i(s_{\mathcal{N}_i}, a_{\mathcal{N}_i})$ is a convex combination of Q values that differ from $Q_i(s, a)$ only in the non-neighbor states and actions, we have:

$$\big|\tilde{Q}_i(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}) - Q_i(s, a)\big| \leq \max_{s_{-\mathcal{N}_i}', a_{-\mathcal{N}_i}'} \big|Q_i(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}, s_{-\mathcal{N}_i}', a_{-\mathcal{N}_i}') - Q_i(s, a)\big| \leq \frac{2\gamma\bar{r}}{1-\gamma}. \quad (13)$$
A.2 Proof of Proposition 3.2
If agent $j \notin \mathcal{P}_i(t)$, then based on the definition of potential neighbors, we have $d(s_i(t), s_j(t)) > \delta + 2\epsilon$. According to the triangle inequality, $d(s_i(t), s_j(t)) \leq d(s_i(t), s_i(t+1)) + d(s_i(t+1), s_j(t))$, and according to Assumption 1, $d(s_i(t), s_i(t+1)) \leq \epsilon$. Therefore, $d(s_i(t+1), s_j(t)) > \delta + \epsilon$. Using the triangle inequality again, we obtain $d(s_i(t+1), s_j(t)) \leq d(s_i(t+1), s_j(t+1)) + d(s_j(t+1), s_j(t))$. As $d(s_j(t+1), s_j(t)) \leq \epsilon$, we have $d(s_i(t+1), s_j(t+1)) > \delta$. Therefore, agent $j$ will not be a one-hop neighbor of agent $i$ at time $t+1$.
A.3 Proof of Theorem 4.4
The bias of the gradient estimator $\hat{G}$ can be calculated using (7) and (8) as follows:

$$\mathbb{E}[\hat{G}] - G^* = \mathbb{E}\big[A_K^{+} R\big] - G^* = A_K^{+}\, \mathbb{E}\big[\bar{A} G\big] - G^*. \quad (14)$$

Since each learner uses the same set of parameters broadcast by the central controller for agent-environment interaction in each training iteration, the replay buffers $\mathcal{D}_j^i$, $j \in \mathcal{L}_i$, all follow the same distribution as that of $\mathcal{D}_i$. Therefore, we have $\mathbb{E}[g_j^i] = g^{i*}$ for all $j \in \mathcal{L}_i$, where $g^{i*}$ is the $i$-th block of $G^*$, leading to:

$$\mathbb{E}[\hat{G}] = A_K^{+}\, \mathbb{E}[R] = A_K^{+} A_K G^{*} = G^{*}, \quad (15)$$

which shows that $\hat{G}$ is an unbiased estimator.
Next, we compute the variance of the gradient estimator $\hat{G}$:

$$\mathrm{Var}[\hat{G}] = A_K^{+}\, \mathrm{Var}[R]\, (A_K^{+})^{\top} = A_K^{+}\, \mathrm{diag}\Big(\sum_{i=1}^{N} A_{j_k, i}^{2}\, \Sigma_i\Big)_{k=1}^{K}\, (A_K^{+})^{\top}, \quad (16)$$

where $\Sigma_i := \mathrm{Var}[g_j^i]$ and $\mathrm{diag}(\cdot)$ creates a (block) diagonal matrix with the given terms as diagonal elements, since the mini-batch gradient estimates $g_j^i$ are independent from each other for each learner $j$. According to (16), we can see that the variance of the gradient estimator is impacted by $A_K$, which is determined by the assignment matrix $A$ as well as by the learners who return their computations promptly.
A.4 Proof of Proposition 5.7
The performance of the MDS code scheme will be affected only if the number of stragglers exceeds $M - N$, because $A$ has rank $N$ and any $N$ rows of the MDS assignment matrix are linearly independent. If there are more than $M - N$ stragglers, the results from the non-straggler nodes will be insufficient for the central controller to decode the parameter gradients, and it needs to wait for results from the stragglers. Under Assumption 4, the number of stragglers $X$ follows a binomial distribution with probability mass function $\Pr(X = j) = \binom{M}{j} p^{j} (1 - p)^{M - j}$. Therefore, the probability that the performance will be affected by the stragglers is $\Pr(X > M - N) = \sum_{j = M - N + 1}^{M} \binom{M}{j} p^{j} (1 - p)^{M - j}$.
A.5 Proof of Proposition 5.8
The assignment matrix defined in (10) has $N$ linearly independent rows, with each row repeated in $M/N$ duplicate copies. For the $k$-th copy of the $i$-th linearly independent row, $k = 1, \ldots, M/N$, the learner with index $j = (k-1)N + i$ needs to send its result back to the central controller. The estimated gradients can be decoded when, for every $i$, at least one learner holding a copy of the $i$-th row sends its result back, so that $\mathrm{rank}(A_K) = N$. Under Assumption 4, for a given $i$, the probability that all $M/N$ learners holding copies of the $i$-th row are stragglers that fail to send results back is $p^{M/N}$. Furthermore, the probability that there is at least one non-straggler learner for each $i$ is $\big(1 - p^{M/N}\big)^{N}$. Therefore, the probability that the performance will be affected by stragglers is $1 - \big(1 - p^{M/N}\big)^{N}$.
[]Baoqian Wang (S’20) is currently an advanced technologist in intelligent systems at The Boeing Company. He received a Ph.D. degree in Electrical and Computer Engineering from University of California San Diego and San Diego State University in 2023. He received his B.S. degree from Yangtze University, Wuhan China, in 2017, and M.S. degree in Computer Science from Texas A&M University-Corpus Christi. His research interests include distributed computing, reinforcement learning and robotics.
[]Junfei Xie (S’13-M’16-SM’21) is currently an Associate Professor in the Department of Electrical and Computer Engineering at San Diego State University. She received the B.S. degree in Electrical Engineering from University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2012. She received the M.S. degree in Electrical Engineering in 2013 and the Ph.D. degree in Computer Science and Engineering from University of North Texas, Denton, TX, in 2016. From 2016 to 2019, she was an Assistant Professor in the Department of Computing Sciences at Texas A&M University-Corpus Christi. She is the recipient of the NSF CAREER Award. Her current research interests include large-scale dynamic system design and control, airborne networks, airborne computing, and air traffic flow management, etc.
[]Nikolay Atanasov
(S’07-M’16-SM’23) is an Associate Professor of Electrical and Computer Engineering at the University of California San Diego, La Jolla, CA, USA. He obtained a B.S. degree in Electrical Engineering from Trinity College, Hartford, CT, USA in 2008 and M.S. and Ph.D. degrees in Electrical and Systems Engineering from the University of Pennsylvania, Philadelphia, PA, USA in 2012 and 2015, respectively. Dr. Atanasov’s research focuses on robotics, control theory, and machine learning with emphasis on active perception problems for autonomous mobile robots. He works on probabilistic models for simultaneous localization and mapping (SLAM) and on optimal control and reinforcement learning algorithms for minimizing probabilistic model uncertainty. Dr. Atanasov’s work has been recognized by the Joseph and Rosaline Wolf award for the best Ph.D. dissertation in Electrical and Systems Engineering at the University of Pennsylvania in 2015, the Best Conference Paper Award at the IEEE International Conference on Robotics and Automation (ICRA) in 2017, the NSF CAREER Award in 2021, and the IEEE RAS Early Academic Career Award in Robotics and Automation in 2023.