AAMAS '24: Proc. of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024), May 29 – June 2, 2024, London, United Kingdom. A. Ricci, W. Yeoh, N. Agmon, B. An (eds.). Copyright 2024. Affiliations: China University of Petroleum, Beijing, China; Télécom Paris, Paris, France.
PPS-QMIX: Periodically Parameter Sharing for Accelerating Convergence of Multi-Agent Reinforcement Learning
Abstract.
Training in multi-agent reinforcement learning (MARL) is time-consuming, largely because of the distribution shift experienced by each agent. A further drawback is that each agent's strategy is learned independently even though the agents must actually cooperate. A central issue in MARL is therefore how to accelerate the training process efficiently. To address this problem, existing research leverages a centralized function (CF) across agents to learn each agent's contribution to the team reward. However, CF-based methods introduce joint estimation error from the other agents into the value network. Inspired by federated learning, we instead propose three simple novel approaches: (i) Average Periodically Parameter Sharing (A-PPS), (ii) Reward-Scalability Periodically Parameter Sharing (RS-PPS), and (iii) Partial Personalized Periodically Parameter Sharing (PP-PPS). Agents share their Q-value networks periodically during training; agents with the same identity use the collected reward as an aggregation weight, and only part of the neural network is updated within each period so that agents retain different parameters. We integrate our approaches into the classical MARL method QMIX and evaluate them on various tasks in the StarCraft Multi-Agent Challenge (SMAC) environment. Numerical experiments show substantial gains, with an average improvement of 10%-30%, and enable winning tasks that QMIX cannot. Our code can be downloaded from https://github.com/ColaZhang22/PPS-QMIX.
Key words and phrases:
Multi-agent Reinforcement Learning, Federated Learning, Parameter Sharing, Reward Scalability, Partial Personalized Parameter Sharing
1. Introduction
In recent years, multi-agent reinforcement learning (MARL) has achieved enormous success on a variety of optimization tasks, ranging from video games Vinyals et al. (2019); Wang et al. (2022); Terry et al. (2021) to autonomous driving Zhao et al. (2022); Silva et al. (2019). Compared with single-agent reinforcement learning, MARL has to tackle several additional challenges. Because value functions or policies are decentralized and communication is limited, one issue is high instability and deviation caused by insufficient estimation and sampling. Another issue is the curse of dimensionality caused by large observation and action spaces. A common way Lowe et al. (2017); Yu et al. (2022) to address these problems, referred to as centralized training with decentralized execution (CTDE) Foerster et al. (2018), has been adopted in most MARL algorithms: all agents share a single parameterized policy network or Q-value function that estimates the contribution of each agent. Recent approaches such as VDN Sunehag et al. (2018), QTRAN Son et al. (2019), and QMIX Rashid et al. (2018) focus on building a heuristic function that measures each agent's contribution to the global reward.
However, in real-world circumstances such as autonomous vehicles Cao et al. (2013) and Internet-of-Things devices Mohammadi et al. (2018), the privacy of sensitive data is a crucial issue: surrounding state information cannot be shared with other vehicles or devices. We therefore consider a decentralized training, decentralized execution (DTDE) setting in which each agent can only access its own experience buffer and cannot read the trajectories of other agents. This setting brings a vital challenge: limited access to local information sharply increases the fluctuation of each agent's exploration. As shown in Fig. 1, because the agents' policies differ, the exploration of each agent can be regarded as non-independent and identically distributed (Non-IID) Li et al. (2022). For each agent, this distribution drift in exploration drives the policy toward a sub-optimal point and ignores the cooperative relationship with the other agents.

Motivated by these challenges and inspired by federated learning Wu et al. (2022); Yu et al. (2023), we first propose average periodically parameter sharing (A-PPS), in which every agent receives an equal weight during the aggregation phase. However, in some tasks, because agents explore differently, A-PPS cannot appropriately reflect how effective each agent's exploration has been, since some agents do not acquire useful experience. Instead of taking a plain average of the model weights, we use a reward buffer to weight the models and thereby measure the quality of each agent's exploration. Moreover, as shown in personalized federated learning, each agent's model can be viewed as a combination of a personalized feature-representation part and a value-representation part; we aggregate only the value-representation part and allow each agent to keep its personalized feature-representation part. Thus, building on the naive FedAvg method, we propose three simple novel approaches: average periodically parameter sharing (A-PPS), reward-scalability periodically parameter sharing (RS-PPS), and partial personalized periodically parameter sharing (PP-PPS).
Finally, we evaluate A-PPS, RS-PPS and PP-PPS in multiple scenarios of the SMAC environment, a standard benchmark for multi-agent cooperation. All three approaches bring a clear improvement over QMIX and VDN on different SMAC tasks Samvelyan et al. (2019). In several environments, compared with the plain A-PPS variant, RS-PPS and PP-PPS converge faster and are more efficient. Notably, on asymmetric adversarial tasks such as corridor, RS-PPS and PP-PPS offer both better privacy and better effectiveness. Our source code is available at https://github.com/ColaZhang22/PPS-QMIX. The contributions of this paper are as follows:
- (1) Inspired by FedAvg, we introduce federated learning into the classical multi-agent reinforcement learning approach QMIX, derive an average periodically parameter sharing (A-PPS) scheme, and evaluate A-PPS on SMAC to demonstrate its validity and efficiency.
- (2) To handle the distribution drift in multi-agent exploration trajectories, we propose a novel approach called reward-scalability periodically parameter sharing (RS-PPS), which uses an accumulated-reward function to weight each agent during aggregation.
- (3) We further propose partial personalized periodically parameter sharing (PP-PPS), which aggregates only the value-representation part of each agent's network while keeping a personalized feature-representation part, and we evaluate all three approaches against QMIX and VDN on SMAC.
2. Background
Multi-agent reinforcement learning (MARL) Foerster et al. (2016) involves multiple agents that make decisions cooperatively in a shared environment. Unlike traditional distributed single-agent reinforcement learning, the agents in MARL do not exist independently but stand in cooperative or competitive relationships. MARL can be formalized as a decentralized partially observable Markov decision process (Dec-POMDP) Oliehoek and Amato (2016). A Dec-POMDP is described by a tuple $\langle \mathcal{S}, \mathcal{A}, P, R \rangle$, where $\boldsymbol{s}_t \in \mathcal{S}$ is the joint state, $\boldsymbol{a}_t \in \mathcal{A}$ is the joint action taken by the agents at time $t$, and $n$ is the number of agents in the common environment. Compared with single-agent RL, the state-transition function experienced by an agent in the same state still changes because of the actions of the other agents, so the time invariance of the Markov process is not satisfied; that is, for each agent $i$,

$$P\big(s_{t+1} \mid s_t, a_t^i, \pi_1, \ldots, \pi_n\big) \neq P\big(s_{t+1} \mid s_t, a_t^i, \pi_1', \ldots, \pi_n'\big)$$

whenever the policies of the other agents change.
To address this problem, recent studies fall into several categories, including value-based and policy-based approaches. The former employs one or more centralized functions (CFs) to learn the impact of the agents' actions on the team goal, as in VDN and QMIX. The CFs make it possible to optimize the agents' parameters with respect to a global team reward.
At the same time, while exploring and gathering experience under an initially random strategy, some agents take the correct action in a given cooperative scenario while others take the wrong one. In the CF training paradigm, these correct and incorrect behaviors are treated as a whole and trained as a single sample. This enlarges the exploration space and makes the iteration directions of the agents' value networks inconsistent during training, which causes convergence difficulties and traps the agents in local optima. In this situation, the agents are unable to maximize the global value function and to complete the collaborative task. Therefore, another major challenge in multi-agent problems is how to extract correct information during differentiated exploration and share it with the other agents.
3. Related Work
3.1. Federated Learning
The distribution of data is a key component of federated learning Hebert et al. (2023). The pervasiveness of multiple agents in the real world, such as vehicles, has led to rapid growth of private data originating from distributed sources. Likewise, the distribution of trajectories in multi-agent reinforcement learning plays a significant role and influences training efficiency. Both the data distribution and the trajectory distribution in a multi-agent environment are non-independent and identically distributed (Non-IID) Wang et al. (2020a): each agent learns deviant knowledge from different experience and develops a preferred strategy, and therefore falls into a local optimum, as illustrated in Fig. 1.
To overcome Non-IID constraints and protect privacy, FedAvg was proposed by McMahan et al. (2017) to handle unbalanced data distributions. FedAvg is a promising method for training neural-network parameters while keeping the training data on the local devices, which protects sensitive data.
A typical FedAvg round with $K$ agents can be formalized as follows: each agent computes an update on its local data starting from the current model, and the updates are then aggregated. For federated multi-agent reinforcement learning, given a set of agents with loss functions $F_k$ and local data distributions that jointly form a cooperative model, FedAvg updates the centralized model as

$$w_{t+1}^{k} = w_t - \eta \nabla F_k(w_t), \qquad w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k},$$

where $w_{t+1}^{k}$ is the model update computed by agent $k$ at time step $t$, $\eta$ is a fixed learning rate, and $w$ are the parameters of the centralized model, which is periodically synchronized back to each individual agent; $n_k$ and $n$ denote the local and total dataset sizes. The aggregation period denotes the span between aggregations: after each span, an agent has learned new knowledge and experience from its own trajectory. Meanwhile, how to determine the weight of each agent in the environment is a critical issue. In traditional federated learning, because the dataset is unevenly distributed, the aggregated model can introduce knowledge that is unknown to the current model. The distribution of agent experience is likewise Non-IID, since each agent follows a different trajectory. To dissipate the negative impact of Non-IID data, several modified federated-learning approaches adjust the weight of each client according to distance and dataset size. Compared with supervised deep learning, reinforcement learning offers an additional criterion for estimating the quality of each agent's model in MARL. Inspired by these modified approaches, we use the reward as the criterion for aggregating the models.
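To make the aggregation rule above concrete, the following is a minimal PyTorch sketch of FedAvg-style weighted parameter averaging. The function name and the assumption that every agent's network exposes an identical `state_dict()` layout are ours, not part of the original FedAvg description.

```python
import torch

@torch.no_grad()
def fedavg_aggregate(local_states, weights):
    """Weighted average of per-agent parameter dictionaries (FedAvg-style).

    local_states: list of state_dicts with identical keys, one per agent.
    weights:      non-negative aggregation weights, e.g. n_k / n.
    """
    total = float(sum(weights))
    norm = [w / total for w in weights]
    aggregated = {}
    for key in local_states[0]:
        # Weighted sum of the corresponding parameter tensor across agents.
        aggregated[key] = sum(w * s[key].float() for w, s in zip(norm, local_states))
    return aggregated
```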
Therefore, in this work we introduce a reward-based weight and apply federated learning to multi-agent reinforcement learning to accelerate training. Instead of plain FedAvg, our approach measures and determines each agent's aggregation weight from the reward accumulated along its trajectory. We merge our approaches into the classical value-factorization method QMIX, replacing plain QMIX with A-PPS, RS-PPS, or PP-PPS; the same mechanisms can be merged into other value-factorization methods. The modified QMIX (A-PPS, RS-PPS, PP-PPS) yields a significant improvement over the baseline methods on different SMAC tasks.
3.2. Value Function Factorization
A naive approach to multi-agent reinforcement learning, which concatenates the states and actions of all agents into one huge state and action space, has been shown to fail to find the global optimum because of lazy agents Wang et al. (2020b). Such an approach ignores each agent's individual maximum reward, so in some scenarios agents become lazy and never learn to cooperate toward the global optimum. A key requirement in MARL is therefore the Individual-Global-Max (IGM) consistency. One family of methods that addresses this issue by determining the role and individual contribution of each agent is value function factorization. Current value-function-factorization methods Huang et al. (2022) employ a centralized function (CF) that learns each agent's contribution to the team reward.
The classical value-function-factorization method VDN considers a fully cooperative multi-agent setting in which each agent can only observe its own state and take its own action. In VDN, each agent $i$ possesses a local value network $Q_i$ that forms part of the global value $Q_{tot}$. VDN assumes that the global value function can be decomposed into individual value functions, one per agent:

$$Q_{tot}(\boldsymbol{s}, \boldsymbol{a}) = f\big(Q_1(o_1, a_1), \ldots, Q_n(o_n, a_n)\big),$$

in which $\boldsymbol{s}$ is the joint observation state and $\boldsymbol{a}$ is the joint action taken by all agents. Each agent's observation-based value function is then combined by a centralized function; for VDN, $Q_{tot}$ is simply the sum of the values of the individual agents:

$$Q_{tot}(\boldsymbol{s}, \boldsymbol{a}) = \sum_{i=1}^{n} Q_i(o_i, a_i).$$
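As a concrete illustration of this additive decomposition, a minimal sketch follows; the batch size and agent count are assumptions made only for the example.

```python
import torch

# Per-agent Q-values for the chosen actions: shape (batch, n_agents).
agent_qs = torch.randn(32, 3)

# VDN: the joint action value is simply the sum of the individual values.
q_tot = agent_qs.sum(dim=1, keepdim=True)   # shape (batch, 1)
```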
QMIX modifies VDN by replacing its CF, a simple sum of values, with a mixing network, and adds a restriction that keeps the CF monotonic:

$$\frac{\partial Q_{tot}}{\partial Q_i} \geq 0, \quad \forall i \in \{1, \ldots, n\},$$
in which each $Q_i$ is a recurrent (RNN-based) agent network that estimates the contribution of that agent. QMIX demonstrated that enforcing positive weights on the mixer network is sufficient to guarantee this monotonicity, improving on VDN. Several modified QMIX methods exist, such as TransfQMix Gallici et al. (2023); Parisotto et al. (2020), QTRAN Son et al. (2019), and OW-QMIX and CW-QMIX Rashid et al. (2020), which aim to enhance learning efficiency. TransfQMix introduces a multi-head attention mechanism Vaswani et al. (2017) to deal with the high-dimensional state space and learn a useful representation of the global state $s$.
OW-QMIX and CW-QMIX introduce a weighting function that estimates the importance of each joint action.
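To make the monotonic-mixing idea concrete, the following is a minimal PyTorch sketch of a QMIX-style mixer in which hypernetworks conditioned on the global state generate the mixing weights and an absolute value keeps them non-negative. The layer sizes and class name are illustrative assumptions, not the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn

class QMixer(nn.Module):
    """Monotonic mixing network in the spirit of QMIX (Rashid et al., 2018).

    Hypernetworks conditioned on the global state produce the mixing weights;
    taking their absolute value keeps dQ_tot/dQ_i >= 0 for every agent.
    """
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        bs = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.view(bs, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2        # (batch, 1, 1)
        return q_tot.view(bs, 1)
```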
However, previous approaches ignore a crucial factor: the imbalanced learning trajectories of the agents during training. Because of the exploration in the early steps, each agent acquires imbalanced trajectories and therefore different experience, so the agents cannot complete the cooperative task. We therefore introduce federated learning into our method so that each agent can learn from the experience of the others.

4. Method
We consider a multi-agent cooperation problem with $n$ agents, where each agent can only access its own observations and selects actions with its own value function. While the agents explore the environment, the large state space and the initially random action strategies make the empirical trajectories obtained by each agent biased and uneven. Sometimes an agent takes the right action for its observation but receives no reward because of the negative impact of another agent. As a consequence, model training requires a long time to converge to the optimal value and strategy. Meanwhile, because of access restrictions and privacy protection, the agents cannot directly learn to cooperate with one another, which results in locally sub-optimal solutions. One way to solve this problem is parameter sharing among agents, which enables them to learn to collaborate and complete the task. Inspired by federated learning, we propose periodically parameter sharing methods on top of the QMIX model to eliminate the negative impact of imbalanced experience between agents and to accelerate convergence. The overall architecture of our method is shown in Figure 2.
4.1. Average Periodically Parameter Sharing
To overcome the imbalanced experience gathered during exploration, and inspired by FedAvg, our first and simplest method averages the weights of all agents into one entity and distributes it back to every agent, as shown in Fig. 3(a). Because every agent learns different knowledge while exploring the environment, iterating only on its own experience and rewards easily leads to wrong update directions and local optima. Weight aggregation transfers the new knowledge learned by one agent to the others through weight sharing, which avoids local optima while also avoiding leakage of private data. In A-PPS the importance of every agent in MARL is equal, and the scheme can be summarized as follows:

$$\theta_{global} = \frac{1}{n} \sum_{i=1}^{n} \theta_i, \qquad \theta_i \leftarrow \theta_{global},$$
in which $\theta_{global}$ denotes the global parameters of the value network and $\theta_i$ denotes the local parameters of agent $i$'s value network. As the number of agents in the task and environment increases, the observation space of each agent explodes. Small parameter differences between the agents' value functions can then lead to unstable action selection and a huge deviation from the optimal distribution. Plain average parameter sharing is therefore not suitable in this situation, because it cannot guarantee that effective knowledge is acquired from the other agents during aggregation. Consequently, a criterion is needed to measure the knowledge an agent has learned during training.
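Below is a minimal PyTorch sketch of one A-PPS sharing round, assuming each agent holds its own Q-network module with an identical architecture; the function name is ours and the sketch omits the surrounding QMIX training code.

```python
import copy
import torch

@torch.no_grad()
def a_pps_share(agent_nets):
    """A-PPS round: average all agents' Q-network parameters with equal
    weights and broadcast the result back to every agent."""
    avg_state = copy.deepcopy(agent_nets[0].state_dict())
    for key in avg_state:
        stacked = torch.stack([net.state_dict()[key].float() for net in agent_nets])
        avg_state[key] = stacked.mean(dim=0)
    for net in agent_nets:          # distribute the shared parameters
        net.load_state_dict(avg_state)
```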
4.2. Reward-Scalability Periodically Parameter Sharing
To reduce the deviation caused by the high-dimensional state space discussed above, we borrow from FedProx, which was designed for Non-IID datasets. FedProx Li et al. (2020) is an improved federated-learning algorithm that addresses non-independent, identically distributed (Non-IID) data and device heterogeneity; the distribution and amount of data determine the weight of each model during aggregation. In multi-agent reinforcement learning, because the tasks are cooperative, at time $t$ one agent may take the correct action while the others act incorrectly, so the agents' exploration is likewise not independent and identically distributed. To address this, our method uses the rewards obtained by the agents during exploration to measure the differences between their distributions. We therefore propose reward-scalability periodically parameter sharing (RS-PPS), shown in Fig. 3(b), together with a reward buffer that evaluates the differences in the experience gathered by different agents:

$$\theta_{global} = \sum_{i=1}^{n} \frac{R_i}{\sum_{j=1}^{n} R_j}\, \theta_i, \qquad R_i = \sum_{t \in \mathcal{T}} r_t^i,$$
where $R_i$ denotes the reward accumulated by agent $i$ during the aggregation interval $\mathcal{T}$ and is used to estimate the quality of that agent's exploration. Compared with A-PPS, RS-PPS adjusts the aggregation weights according to the rewards obtained by each agent during exploration. The more reward an agent collects, the closer it is assumed to be to the global optimum, and the higher the weight it receives. This process can be seen as imparting new experiential knowledge to the other agents, while the normalization also avoids drifting entirely toward the single agent with the highest reward and falling into a sub-optimal solution.
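A minimal sketch of one RS-PPS sharing round follows, assuming each agent keeps a list of the rewards it received since the last aggregation (the reward buffer above); the clamping of negative returns and the small epsilon are our own safeguards, not details specified in the text.

```python
import torch

@torch.no_grad()
def rs_pps_share(agent_nets, reward_buffers, eps=1e-8):
    """RS-PPS round: weight each agent's Q-network by the reward accumulated
    in its reward buffer, then broadcast the aggregated parameters."""
    # Accumulated reward per agent over the aggregation interval.
    returns = torch.tensor([float(sum(buf)) for buf in reward_buffers])
    returns = returns.clamp(min=0.0)             # assume non-negative weights
    weights = (returns + eps) / (returns + eps).sum()
    shared = {}
    for key in agent_nets[0].state_dict():
        shared[key] = sum(w * net.state_dict()[key].float()
                          for w, net in zip(weights.tolist(), agent_nets))
    for net in agent_nets:
        net.load_state_dict(shared)
    return weights                               # useful for logging
```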
4.3. Partial Personalized Periodically Parameter Sharing
Characterizing the agents is a very effective way to solve collaborative tasks: agents play different roles in the team, complete their own sub-tasks, and thereby maximize the global reward. Determining one's own role in the team is challenging, since only correctly characterized agents can complete their part of the task. We therefore propose partial personalized periodically parameter sharing (PP-PPS), shown in Fig. 3(c), which divides each agent's neural network into two parts. The first part is the agent's characterization (representation) network: agents with different roles attend to different parts of the environment. The second part is the value network, which evaluates the value of state-action pairs. During aggregation, only the second part of the network is aggregated:

$$\theta_i^{(l)} \leftarrow \begin{cases} \theta_i^{(l)}, & l \le l_p \quad \text{(personalized representation layers)},\\[2pt] \sum_{j=1}^{n} \dfrac{R_j}{\sum_{k=1}^{n} R_k}\, \theta_j^{(l)}, & l > l_p \quad \text{(shared value layers)}, \end{cases}$$
in which $l$ denotes the layer index in each agent's QMIX network and $l_p$ the fixed split point between the two parts. Currently a fixed split point divides the network: parameters located in the characterization layers do not participate in parameter sharing, which preserves the agent's characterization, whereas parameters located in the value part join parameter sharing to accelerate convergence. Because the roles of the agents in a collaborative task differ, the representation networks they need also differ; the instability introduced by a fixed split point is why PP-PPS converges more slowly than the first two methods on most tasks.
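A minimal sketch of the partial aggregation is shown below. It selects the shared parameters by name prefix, whereas the experiments use a fixed layer index (the "Personalized Layer" in Table 1), so the prefixes "fc1" and "rnn" and the function name are purely illustrative assumptions.

```python
import torch

@torch.no_grad()
def pp_pps_share(agent_nets, weights, personalized_prefixes=("fc1", "rnn")):
    """PP-PPS round: keep the personalized (representation) layers local and
    aggregate only the remaining (value) layers with the given weights.

    weights: per-agent aggregation weights that sum to one (e.g. from RS-PPS).
    """
    shared = {}
    for key in agent_nets[0].state_dict():
        if any(key.startswith(p) for p in personalized_prefixes):
            continue                              # personalized layer: not shared
        shared[key] = sum(w * net.state_dict()[key].float()
                          for w, net in zip(weights, agent_nets))
    for net in agent_nets:
        # strict=False: only the shared (value) parameters are overwritten.
        net.load_state_dict(shared, strict=False)
```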
5. Experiment
The three approaches, A-PPS, RS-PPS and PP-PPS, are implemented in Python with PyTorch, and all experiments were run in parallel on a computing cluster with two RTX 3080 GPUs and 24 GB of memory. In each task we use the same hyper-parameters, listed in Table 1, except that we train for 1e6 or 2e6 iterations depending on the computational budget.
To verify the effectiveness of the three proposed approaches, we selected QMIX as the base model and evaluated the approaches in the SMAC environment. We then compared the results with the conventional value-decomposition approaches QMIX and VDN on different SMAC tasks.
The modified QMIX architecture described in Section 4 is employed for A-PPS, RS-PPS and PP-PPS. For each task we fixed the random seed and ran 32 evaluation episodes every 5000 steps to obtain the win rate. After a certain number of training steps, the agents are assumed to have learned new, differently oriented knowledge, which they then share with the other agents through one of the three methods to accelerate training and complete the collaborative task. In A-PPS each agent shares knowledge equally with the others. RS-PPS uses a reward buffer of the same length as the batch size to evaluate the useful knowledge learned during training and aggregates the models with the reward-based weights of Section 4.2. In PP-PPS, part of the value network is treated as a personalized representation layer, while the remaining part is regarded as the value-representation layer and aggregated through the RS-PPS rule. The experimental results show that, compared with conventional QMIX, our methods accelerate network training and allow the agents to complete tasks that conventional QMIX cannot.
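The overall control flow can be summarized by the skeleton below: a sketch of where the periodic sharing round sits relative to ordinary rollouts and TD updates. All callables (`run_episode`, `train_step`, `share_fn`) are placeholders supplied by the surrounding training code; they are assumptions for illustration, not part of our released implementation.

```python
def train_with_pps(run_episode, train_step, share_fn, agents,
                   max_steps=1_000_000, aggregation_freq=300):
    """Skeleton training loop: standard decentralized rollouts and QMIX-style
    updates, with a parameter-sharing round every `aggregation_freq` episodes."""
    total_steps, episode = 0, 0
    reward_buffers = [[] for _ in agents]          # one reward buffer per agent
    while total_steps < max_steps:
        episode_rewards, steps = run_episode()     # per-agent rewards, env steps
        total_steps += steps
        episode += 1
        for buf, r in zip(reward_buffers, episode_rewards):
            buf.append(r)
        train_step()                               # usual TD / QMIX update
        if episode % aggregation_freq == 0:        # periodic sharing round
            share_fn(agents, reward_buffers)       # e.g. A-PPS / RS-PPS / PP-PPS
            reward_buffers = [[] for _ in agents]  # start a fresh interval
    return agents
```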
5.1. SMAC
The SMAC environment, a micro unit control game environment based on electric game,StarCraft 2, was used to test the three methods we proposed. In SMAC environment, multiple agents need to collaborate against and beat enemies to win the game. Each agent has an limited observation range and can only observe the status of friendly forces within the range. The observed feature vectors consist of attribute information of friends and enemies: [distance, relative x, relative y, health, shield, unit type]. Each agent only enable to access their own observation state and rewards obtained during the exploration process.
In every exploration step, agent just receives local information from its perspective. There is no difference between friendly units in the field and dead friendly units from the perspective of the intelligent agent. According to the number of enemies and the number of our intelligent agents, tasks can be divided into asymmetric tasks, which the number of allied agents varies from enemies, and symmetric tasks ,which the number of allied agents is same as enemies. Finally, we tested our proposed algorithm in both circumstance: 3m, 2s3z and asymmetric adversarial tasks: 10m vs 11m, MMM2, Corridor and compared it with QMIX and VDN algorithms. A-PPS, RS-PPS and PP-PPS can be directly applied in conventional MARL algorithm and do not need to add other architecture.
Table 1. Hyper-parameters used in all experiments.

Parameter | Value
---|---
Max Train Steps | 1e6
Evaluate Freq | 5000
Target Update Freq | 200
Base Algorithm | VDN, QMIX
Epsilon Decay Steps | 50000
Epsilon Max | 1
Epsilon Min | 0.05
Buffer Size | 5000
Batch Size | 96
Learning Rate | 5e-4
Gamma | 0.99
Mixed Hidden Num | 1
Mixed Hidden Dim | (State Dim, 64)
Optimizer | Adam
Grad Clip | True
Activation Function | ReLU
Orthogonal Initialization | True
Lr Decay | False
Parameter Sharing |
Soft Update | 0.05
Reward Buffer Size | 96
Aggregation Freq | 300
Personalized Layer | 4
6. Results
As shown in Figure 5, after 1 million training steps our methods converge only slightly faster than QMIX and VDN on the SMAC 3m and 2s3z tasks. We speculate that because the number of agents in the 3m (six units in total) and 2s3z (ten units in total) maps is low, the state space is low-dimensional and the distribution deviation caused by exploration is relatively small. The benefit of our method is therefore limited, and sharing may even amplify erroneous experience. Moreover, both 3m and 2s3z are symmetric adversarial tasks with the same number and types of units on both sides, so the local observations of the agents do not differ much, erroneous experience is rare, and the experience distribution is relatively even. This explains why A-PPS performs best among our methods on these maps. For instance, an agent in 3m only needs to consider twelve situations, corresponding to 0, 1, or 2 allies and 0, 1, 2, or 3 enemies in its field of view, so its effective observation space is small.
Table 2. Final win rates of A-PPS, RS-PPS, PP-PPS, QMIX and VDN on the evaluated SMAC scenarios.

Scenario | A-PPS | RS-PPS | PP-PPS | QMIX | VDN
---|---|---|---|---|---
3m | 0.96 | 0.91 | 0.83 | 0.87 | 0.95
2s3z | 0.85 | 0.82 | 0.71 | 0.73 | 0.77
MMM2 | 0.06 | 0.12 | 0 | 0 | 0
10m vs 11m | 0.04 | 0.31 | 0.05 | 0 | 0
Corridor | 0.03 | 0.29 | 0.1 | 0 | 0.09


With the A-PPS method, averaging the value networks of all agents achieved a better win rate on these symmetric tasks. Because experience and trajectories are distributed relatively evenly there, A-PPS effectively accelerates training, even though the final performance improves only modestly.
To verify the robustness of our method, we further evaluate it on tasks with many agents in asymmetric environments: MMM2 (1 Medivac, 2 Marauders and 7 Marines versus 1 Medivac, 3 Marauders and 8 Marines, 22 units in total), 10m vs 11m (10 Marines versus 11 Marines, 21 units in total), and Corridor (6 Zealots versus 24 Zerglings, 30 units in total).
Compared with A-PPS, RS-PPS achieved more outstanding results in these experiments. In contrast with QMIX and VDN, our proposed method achieved significant improvements on the MMM2, 10m vs 11m, and Corridor tasks. In asymmetric tasks, because other types and quantities of agents are present, the observed state space of each agent fluctuates strongly and grows explosively. Plain A-PPS can therefore propagate a large amount of erroneous experience, which is why RS-PPS introduces the reward buffer and reward-scalability mechanism. The reward-buffer-based aggregation corrects the weights of the models in the aggregation process using the rewards obtained during the interval: an agent that makes correct decisions during exploration accumulates more reward and thus receives a larger weight, so the other agents learn value functions from the agent with the largest accumulated reward. RS-PPS therefore accelerates convergence better in high-dimensional observation spaces and guides the agents toward the right choices in complex observation states.
As for PP-PPS, it relies on a fixed personalization point during training, i.e. a fixed number of network layers that remain unchanged during aggregation. In our experiments, however, different agents need different representation layers, and the fixed split forces them all to use the same number. PP-PPS therefore increases the deviation between the representations of different agents, although, compared with the conventional methods, it still introduces some new knowledge during sharing. Consequently, PP-PPS performs worse than A-PPS and RS-PPS on several tasks, but the experiments still show that it outperforms QMIX.
7. Conclusion and Further Work
In this paper we consider the slow training caused by the uneven experience distribution of individual agents in multi-agent reinforcement learning, as well as the negative impact that the incorrect actions of some agents have on the correct actions of their teammates. Inspired by federated-learning algorithms for the Non-IID problem, we introduced three methods, A-PPS, RS-PPS, and PP-PPS, to address these issues. We tested the three methods on top of the QMIX model in symmetric and asymmetric SMAC tasks. A-PPS converges faster than the basic QMIX model in symmetric confrontations with a small number of agents. Compared with A-PPS, in complex multi-agent settings RS-PPS achieves better win rates and completes tasks that QMIX and VDN cannot. PP-PPS performs worse on several tasks because of its fixed personalized representation layer, but its results are still better than QMIX.
In conclusion, the proposed approaches, A-PPS, RS-PPS, and PP-PPS, effectively accelerate the convergence of multi-agent reinforcement learning and enable agents to complete tasks that QMIX and VDN previously could not. The approaches can be applied to other multi-agent reinforcement-learning algorithms and produce effective results. In future work we will modify the personalized layer of the PP-PPS algorithm and explore novel feature-representation architectures so that agents can adaptively select effective feature representations instead of relying on manually chosen fixed network layers.
This research was supported by the Science Foundation of China University of Petroleum, Beijing (No. 2462020YXZZ024).
References
- Cao et al. (2013) Yongcan Cao, Wenwu Yu, Wei Ren, and Guanrong Chen. 2013. An Overview of Recent Progress in the Study of Distributed Multi-Agent Coordination. IEEE Trans. Ind. Informatics 9, 1 (2013), 427–438. https://doi.org/10.1109/TII.2012.2219061
- Dai et al. (2022) Rong Dai, Li Shen, Fengxiang He, Xinmei Tian, and Dacheng Tao. 2022. DisPFL: Towards Communication-Efficient Personalized Federated Learning via Decentralized Sparse Training. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (Eds.). PMLR, 4587–4604. https://proceedings.mlr.press/v162/dai22b.html
- Foerster et al. (2016) Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson. 2016. Learning to Communicate with Deep Multi-Agent Reinforcement Learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett (Eds.). 2137–2145. https://proceedings.neurips.cc/paper/2016/hash/c7635bfd99248a2cdef8249ef7bfbef4-Abstract.html
- Foerster et al. (2018) Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. 2018. Counterfactual Multi-Agent Policy Gradients. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, Sheila A. McIlraith and Kilian Q. Weinberger (Eds.). AAAI Press, 2974–2982. https://doi.org/10.1609/aaai.v32i1.11794
- Gallici et al. (2023) Matteo Gallici, Mario Martin, and Ivan Masmitja. 2023. TransfQMix: Transformers for Leveraging the Graph Structure of Multi-Agent Reinforcement Learning Problems. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2023, London, United Kingdom, 29 May 2023 - 2 June 2023, Noa Agmon, Bo An, Alessandro Ricci, and William Yeoh (Eds.). ACM, 1679–1687. https://doi.org/10.5555/3545946.3598825
- Hebert et al. (2023) Liam Hebert, Lukasz Golab, Pascal Poupart, and Robin Cohen. 2023. FedFormer: Contextual Federation with Attention in Reinforcement Learning. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2023, London, United Kingdom, 29 May 2023 - 2 June 2023, Noa Agmon, Bo An, Alessandro Ricci, and William Yeoh (Eds.). ACM, 810–818. https://doi.org/10.5555/3545946.3598716
- Huang et al. (2022) Wenhan Huang, Kai Li, Kun Shao, Tianze Zhou, Jun Luo, Dongge Wang, Hangyu Mao, Jianye Hao, Jun Wang, and Xiaotie Deng. 2022. Multiagent Q-learning with Sub-Team Coordination. In 21st International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2022, Auckland, New Zealand, May 9-13, 2022, Piotr Faliszewski, Viviana Mascardi, Catherine Pelachaud, and Matthew E. Taylor (Eds.). International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS), 1630–1632. https://doi.org/10.5555/3535850.3536058
- Li et al. (2022) Qinbin Li, Yiqun Diao, Quan Chen, and Bingsheng He. 2022. Federated Learning on Non-IID Data Silos: An Experimental Study. In 38th IEEE International Conference on Data Engineering, ICDE 2022, Kuala Lumpur, Malaysia, May 9-12, 2022. IEEE, 965–978. https://doi.org/10.1109/ICDE53745.2022.00077
- Li et al. (2020) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. 2020. Federated Optimization in Heterogeneous Networks. In Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2-4, 2020, Inderjit S. Dhillon, Dimitris S. Papailiopoulos, and Vivienne Sze (Eds.). mlsys.org. https://proceedings.mlsys.org/book/316.pdf
- Lowe et al. (2017) Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. 2017. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). 6379–6390. https://proceedings.neurips.cc/paper/2017/hash/68a9750337a418a86fe06c1991a1d64c-Abstract.html
- McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2017. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA (Proceedings of Machine Learning Research, Vol. 54), Aarti Singh and Xiaojin (Jerry) Zhu (Eds.). PMLR, 1273–1282. http://proceedings.mlr.press/v54/mcmahan17a.html
- Mohammadi et al. (2018) Mehdi Mohammadi, Ala I. Al-Fuqaha, Mohsen Guizani, and Jun-Seok Oh. 2018. Semisupervised Deep Reinforcement Learning in Support of IoT and Smart City Services. IEEE Internet Things J. 5, 2 (2018), 624–635. https://doi.org/10.1109/JIOT.2017.2712560
- Oliehoek and Amato (2016) Frans A. Oliehoek and Christopher Amato. 2016. A Concise Introduction to Decentralized POMDPs. Springer. https://doi.org/10.1007/978-3-319-28929-8
- Ozkara et al. (2023) Kaan Ozkara, Antonious M. Girgis, Deepesh Data, and Suhas N. Diggavi. 2023. A Statistical Framework for Personalized Federated Learning and Estimation: Theory, Algorithms, and Privacy. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=FUiDMCr_W4o
- Parisotto et al. (2020) Emilio Parisotto, H. Francis Song, Jack W. Rae, Razvan Pascanu, Çaglar Gülçehre, Siddhant M. Jayakumar, Max Jaderberg, Raphaël Lopez Kaufman, Aidan Clark, Seb Noury, Matthew M. Botvinick, Nicolas Heess, and Raia Hadsell. 2020. Stabilizing Transformers for Reinforcement Learning. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research, Vol. 119). PMLR, 7487–7498. http://proceedings.mlr.press/v119/parisotto20a.html
- Rashid et al. (2020) Tabish Rashid, Gregory Farquhar, Bei Peng, and Shimon Whiteson. 2020. Weighted QMIX: Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https://proceedings.neurips.cc/paper/2020/hash/73a427badebe0e32caa2e1fc7530b7f3-Abstract.html
- Rashid et al. (2018) Tabish Rashid, Mikayel Samvelyan, Christian Schröder de Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. 2018. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018 (Proceedings of Machine Learning Research, Vol. 80), Jennifer G. Dy and Andreas Krause (Eds.). PMLR, 4292–4301. http://proceedings.mlr.press/v80/rashid18a.html
- Samvelyan et al. (2019) Mikayel Samvelyan, Tabish Rashid, Christian Schröder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob N. Foerster, and Shimon Whiteson. 2019. The StarCraft Multi-Agent Challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’19, Montreal, QC, Canada, May 13-17, 2019, Edith Elkind, Manuela Veloso, Noa Agmon, and Matthew E. Taylor (Eds.). International Foundation for Autonomous Agents and Multiagent Systems, 2186–2188. http://dl.acm.org/citation.cfm?id=3332052
- Silva et al. (2019) Maria Amélia Lopes Silva, Sérgio Ricardo de Souza, Marcone Jamilson Freitas Souza, and Ana Lúcia C. Bazzan. 2019. A reinforcement learning-based multi-agent framework applied for solving routing and scheduling problems. Expert Syst. Appl. 131 (2019), 148–171. https://doi.org/10.1016/j.eswa.2019.04.056
- Son et al. (2019) Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Hostallero, and Yung Yi. 2019. QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, 5887–5896. http://proceedings.mlr.press/v97/son19a.html
- Sunehag et al. (2018) Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinícius Flores Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. 2018. Value-Decomposition Networks For Cooperative Multi-Agent Learning Based On Team Reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS 2018, Stockholm, Sweden, July 10-15, 2018, Elisabeth André, Sven Koenig, Mehdi Dastani, and Gita Sukthankar (Eds.). International Foundation for Autonomous Agents and Multiagent Systems Richland, SC, USA / ACM, 2085–2087. http://dl.acm.org/citation.cfm?id=3238080
- Terry et al. (2021) Justin K. Terry, Benjamin Black, Nathaniel Grammel, Mario Jayakumar, Ananth Hari, Ryan Sullivan, Luis S. Santos, Clemens Dieffendahl, Caroline Horsch, Rodrigo Perez-Vicente, Niall L. Williams, Yashas Lokesh, and Praveen Ravi. 2021. PettingZoo: Gym for Multi-Agent Reinforcement Learning. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (Eds.). 15032–15043. https://proceedings.neurips.cc/paper/2021/hash/7ed2d3454c5eea71148b11d0c25104ff-Abstract.html
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
- Vinyals et al. (2019) Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander Sasha Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom Le Paine, Çaglar Gülçehre, Ziyu Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy P. Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, and David Silver. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nat. 575, 7782 (2019), 350–354. https://doi.org/10.1038/s41586-019-1724-z
- Wang et al. (2020a) Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H. Vincent Poor. 2020a. Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https://proceedings.neurips.cc/paper/2020/hash/564127c03caab942e503ee6f810f54fd-Abstract.html
- Wang et al. (2022) Rose E. Wang, Sarah A. Wu, James A. Evans, David C. Parkes, Joshua B. Tenenbaum, and Max Kleiman-Weiner. 2022. Too Many cooks: Bayesian inference for coordinating Multi-agent Collaboration. In Human-Like Machine Intelligence, Stephen H. Muggleton and Nicholas Chater (Eds.). Oxford University Press, 152–170. https://doi.org/10.1093/oso/9780198862536.003.0008
- Wang et al. (2020b) Tonghan Wang, Jianhao Wang, Chongyi Zheng, and Chongjie Zhang. 2020b. Learning Nearly Decomposable Value Functions Via Communication Minimization. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=HJx-3grYDB
- Wu et al. (2022) Di Wu, Rehmat Ullah, Paul Harvey, Peter Kilpatrick, Ivor T. A. Spence, and Blesson Varghese. 2022. FedAdapt: Adaptive Offloading for IoT Devices in Federated Learning. IEEE Internet Things J. 9, 21 (2022), 20889–20901. https://doi.org/10.1109/JIOT.2022.3176469
- Yu et al. (2022) Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre M. Bayen, and Yi Wu. 2022. The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. In NeurIPS. http://papers.nips.cc/paper_files/paper/2022/hash/9c1535a02f0ce079433344e14d910597-Abstract-Datasets_and_Benchmarks.html
- Yu et al. (2023) Xiaofan Yu, Lucy Cherkasova, Harsh Vardhan, Quanling Zhao, Emily Ekaireb, Xiyuan Zhang, Arya Mazumdar, and Tajana Rosing. 2023. Async-HFL: Efficient and Robust Asynchronous Federated Learning in Hierarchical IoT Networks. In Proceedings of the 8th ACM/IEEE Conference on Internet of Things Design and Implementation, IoTDI 2023, San Antonio, TX, USA, May 9-12, 2023. ACM, 236–248. https://doi.org/10.1145/3576842.3582377
- Zhao et al. (2022) Yinuo Zhao, Kun Wu, Zhiyuan Xu, Zhengping Che, Qi Lu, Jian Tang, and Chi Harold Liu. 2022. CADRE: A Cascade Deep Reinforcement Learning Framework for Vision-Based Autonomous Urban Driving. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022. AAAI Press, 3481–3489. https://doi.org/10.1609/aaai.v36i3.20259