
FedKL: Tackling Data Heterogeneity in Federated Reinforcement Learning by Penalizing KL Divergence

Zhijie Xie and S.H. Song
Zhijie Xie and S.H. Song are with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong (e-mail: [email protected]; [email protected]).
Abstract

As a distributed learning paradigm, Federated Learning (FL) faces the communication bottleneck issue due to many rounds of model synchronization and aggregation. Heterogeneous data further deteriorates the situation by causing slow convergence. Although the impact of data heterogeneity on supervised FL has been widely studied, the related investigation for Federated Reinforcement Learning (FRL) is still in its infancy. In this paper, we first define the type and level of data heterogeneity for policy gradient based FRL systems. By inspecting the connection between the global and local objective functions, we prove that local training can benefit the global objective, if the local update is properly penalized by the total variation (TV) distance between the local and global policies. A necessary condition for the global policy to be learn-able from the local policy is also derived, which is directly related to the heterogeneity level. Based on the theoretical result, a Kullback-Leibler (KL) divergence based penalty is proposed, which, different from the conventional method that penalizes the model divergence in the parameter space, directly constrains the model outputs in the distribution space. Convergence proof of the proposed algorithm is also provided. By jointly penalizing the divergence of the local policy from the global policy with a global penalty and constraining each iteration of the local training with a local penalty, the proposed method achieves a better trade-off between training speed (step size) and convergence. Experiment results on two popular Reinforcement Learning (RL) experiment platforms demonstrate the advantage of the proposed algorithm over existing methods in accelerating and stabilizing the training process with heterogeneous data.

Index Terms:
Federated Reinforcement Learning, Data Heterogeneity, Policy Gradient.

I Introduction

Federated Learning (FL) was proposed as a distributed Machine Learning (ML) scheme where data from distributed devices are utilized for training without violating user privacy. The idea has been successfully implemented (e.g., the Google Keyboard) [1] and has attracted much research attention [2, 3]. One of the key challenges for FL is the communication bottleneck due to the frequent communication of big ML models. There are mainly two methods to mitigate the issue, i.e., reducing the communication workload per round and decreasing the number of communication rounds. To this end, model compression has been proposed to reduce the communication workload per round [4, 5], and another line of research tried to improve the convergence speed to minimize the number of communication rounds [6, 7, 8]. Unfortunately, data heterogeneity between different devices significantly slows down the convergence [9, 10].

As an important ML paradigm, Reinforcement Learning (RL) trains the policy of an agent in an iterative way by interacting with the environment. In each time step, the agent makes a decision to manipulate the environment and receives a reward for its action, where the objective of the training is to maximize the long-term reward [11]. Due to the distributed data and computing power in large-scale applications such as autonomous driving, training RL algorithms under the FL framework is inevitable. Unfortunately, many challenges faced by supervised FL, e.g., data heterogeneity and the communication bottleneck, remain and are even more severe for FRL [12]. For example, centralized policy gradient methods already suffer from high variance, which is detrimental to the convergence speed and training performance [13, 14, 15], and data heterogeneity imposes another layer of difficulty on the convergence of FRL.

Data heterogeneity in FL refers to the situation where the data collected by different devices are not independent and identically distributed (IID) [3, 16]. Despite its empirical success, FedAvg does not perform well on highly skewed non-IID data [10, 17, 18]. As a result, many innovative methods have been proposed to tackle the data heterogeneity issue [9]. Data-level approaches aim to modify the data distribution to address the root of the problem. For example, a data-sharing strategy [17] was proposed to create a small but globally shared dataset to mitigate the impact of heterogeneous data. On the other hand, algorithm-level approaches try to solve the problem by improving the local training algorithms. Along this line of research, FedProx [19] added a weighted proximal term to the local objective function to constrain the divergence of the local model from the global model. Finally, system-level approaches focus on client clustering/selection [20] to counterbalance the bias introduced by non-IID data and speed up convergence.

The heterogeneity issue of FRL is different from that of supervised FL and the related study is very limited [21, 22, 23]. There are mainly two types of RL methods, namely, the action-value method and the policy gradient (PG) method [12, 11]. The first performs action selection according to the value function, while the second utilizes ML models to approximate the policy and updates the policy by gradient based algorithms. Data heterogeneity affects different RL methods in different ways, and thus requires different solutions. For the action-value method, Lifelong Reinforcement Learning (LFRL) [24] explored the combination of Q-learning and FL in real-world applications, where the knowledge learned in one particular environment is transferred to another. Federated RL (FedRL) [25] considered the scenario where some agents can only observe part of the environment and have no access to the reward. To handle this issue, agents maintain their individual Q-networks in addition to the global Q-network, and exchange their predicted Q values for cooperation. For policy gradient methods, Federated Transfer Reinforcement Learning (FTRL) [22] proposed an online transfer process to numerically align the scale of the observations and actions in different environments so that the policy learnt in one environment is portable to another. Given that devices of the same type may have different dynamics, an FRL architecture, which adapts Proximal Policy Optimization (PPO) and Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [26, 27, 28] to FRL in Internet of Things (IoT) networks, was proposed in [23, 29]. This FRL scheme alternates between sharing gradients and transferring mature models to other agents. For offline learning, data heterogeneity exists in different local datasets. To address this problem, Federated Dynamic Treatment Regime (FDTR) [30] proposed the first federated policy optimization algorithm for offline RL. There are also works that aim to improve the communication efficiency with heterogeneous data [31, 32]. For instance, Federated Multi-Agent Reinforcement Learning (FMARL) [32] proposed to gradually decay the gradients during local training to avoid model divergence and improve communication efficiency. QAvg and PAvg [33] theoretically proved that federated Q-Learning and federated policy gradient can converge in the tabular case when agents’ environments are heterogeneous.

In this paper, we will investigate the data heterogeneity issue in FRL systems whose agents employ PG methods with an advantage estimator [15, 34]. For that purpose, we first classify the types of data heterogeneity in FRL and define the heterogeneity level. By analyzing the connection between the global and local objective functions, we prove that although it is possible to improve the global objective by updating the local policy according to the local environment, the local update must be properly constrained. Based on the theory, we propose an algorithm that utilizes a Kullback–Leibler (KL) divergence based term to penalize the local training for diverging too far away from the global policy and name it the FedKL algorithm. The proposed KL penalty will be referred to as the global penalty because it guarantees that the local training of one agent will not diverge too far away from the global policy. Note that this global penalty is different from the per-iteration penalty utilized in centralized RL [26]. We will investigate the benefits of adding the global penalty and how its interplay with the local per-iteration penalty helps mitigate the data heterogeneity issue in FRL. Experiment results show that FedKL is able to stabilize training and speed up convergence when compared with existing algorithms, including FedAvg, FedProx and FMARL.

The major contributions of this work can be summarized as follows:

  • We classify the data heterogeneity of FRL into two types, due to different initial state distributions and different environment dynamics, respectively. Then, we define the level of heterogeneity, based on which a necessary condition for the local policy improvement to be beneficial for the global objective is derived.

  • We prove that the improvement of the local policy on one agent can benefit the global objective, as long as the local policy update is properly penalized by the total variation (TV) distance between the local and global policies. Based on the theoretical result, we propose the FedKL algorithm and adopt several approximations to simplify the implementation of FedKL. Convergence proof of the proposed algorithm is also provided.

  • Experiment results show that FedKL outperforms existing algorithm-level solutions with faster convergence and better stabilization. Several interesting insights are also obtained: 1) The proposed heterogeneity level is very effective and can be utilized to predict the learning behavior; 2) The KL divergence based penalty is more effective than the penalty defined in the parameter space; 3) The proposed global penalty enables a larger learning step size, which speeds up the convergence without causing divergence.

II Preliminaries

II-A Federated Learning

A typical FL system consists of one central server and $N$ agents (clients/devices), where the data of the agents are jointly utilized to train a global model without violating their privacy. In particular, the global optimization problem can be formulated as [18]

\min_{\theta}f(\theta)=\sum_{n=1}^{N}q_{n}f_{n}(\theta), (1)

where $f_{n}(\theta)$ represents the local objective function, e.g., the loss function, of the $n$-th agent with respect to the model parameters $\theta$. $q_{n}$ is a weighting factor often set to $q_{n}=l_{n}/L$, with $l_{n}$ denoting the number of data points at the $n$-th agent and $L=\sum_{n=1}^{N}l_{n}$. In each communication round, $K$ out of the total $N$ agents are selected and the central server broadcasts the global model to the selected agents. Then, each agent updates its local model utilizing the local data and uploads the trained model to the server for model aggregation. This process will repeat for some rounds until a preset learning objective is achieved.
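For concreteness, the aggregation step behind (1), i.e., the FedAvg rule used throughout this paper, can be sketched as follows. This is a minimal illustration assuming each local model is stored as a dictionary of NumPy arrays; the function and variable names are ours for illustration and are not taken from any specific library.

```python
# Minimal sketch of the FedAvg aggregation behind (1): the server averages
# local parameters with weights q_n = l_n / L (illustrative names only).
import numpy as np

def aggregate(local_models, data_counts):
    """Weighted average of local models, each a dict of NumPy arrays."""
    total = float(sum(data_counts))
    weights = [l / total for l in data_counts]          # q_n = l_n / L
    return {name: sum(w * m[name] for w, m in zip(weights, local_models))
            for name in local_models[0]}
```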

II-B Federated Reinforcement Learning

The system setup of FRL is similar, i.e., one central server federates the learning of $N$ distributed agents. In the $t$-th training round, the central server broadcasts the current global policy $\pi^{t}$ to $K$ selected agents, which will perform $I$ iterations of local training. In each iteration, the agent interacts with its environment to collect $T$ time-steps of data, and optimizes its local policy according to the local objective with gradient ascent methods. At the end of each round, the training results will be uploaded to the server for aggregation. An example of FRL systems for autonomous driving is illustrated in Fig. 1.

Refer to caption
Figure 1: An example of FRL systems: autonomous driving. Many vehicles run on different roads. In each training round, selected vehicles (white cars) will participate in the training. With the PG method, policies are represented by ML models, which will be communicated between agents and the server for model aggregation and synchronization. The model aggregation is the same as that of FedAvg.

We model the local training problem of each agent as a finite Markov Decision Process (MDP). Accordingly, the FRL system consists of $N$ finite MDPs $\{(\mathcal{S},\mu_{n},\mathcal{A},P_{n},\gamma)\,|\,n\in[1,N]\}$, where $\mathcal{S}$ denotes a finite set of states, $\mu_{n}$ represents the initial state distribution of the $n$-th MDP, $\mathcal{A}$ is a finite set of actions and $\gamma\in(0,1)$ is the discount factor. The dynamics function $P_{n}(s^{\prime},r|s,a):\mathcal{S}\times\mathbb{R}\times\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ represents the probability that the $n$-th MDP transits from state $s$ to $s^{\prime}$ and obtains a reward of $r\in\mathbb{R}$ after taking action $a$, which characterizes the environment dynamics of the $n$-th MDP [11]. We can define the transition probability $P_{n}(s^{\prime}|s,a):\mathcal{S}\times\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ as $P_{n}(s^{\prime}|s,a)=\sum_{r\in\mathbb{R}}P_{n}(s^{\prime},r|s,a)$. Furthermore, based on the dynamics function, we can define the reward function $\mathcal{R}_{n}(s,a):\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ as $\mathcal{R}_{n}(s,a)=\sum_{r\in\mathbb{R}}r\sum_{s^{\prime}\in\mathcal{S}}P_{n}(s^{\prime},r|s,a)$, which gives the expected reward of the $n$-th MDP for taking action $a$ in state $s$. As a result, the $n$-th MDP $\mathcal{M}_{n}$ can be represented by a 5-tuple $(\mathcal{S},\mu_{n},\mathcal{A},P_{n},\gamma)$, sharing the same state and action spaces with other agents, but with possibly different initial state distributions and environment dynamics. An agent’s behavior is controlled by a stochastic policy $\pi_{n}:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$, which outputs the probability of taking an action $a$ in a given state $s$. Specifically, each agent $n$ aims to train a local policy to maximize its expected discounted reward over the long run

\eta_{n}(\pi)=\mathbb{E}_{s_{0}\thicksim\mu_{n},a_{t}\thicksim\pi,s_{t+1}\thicksim P_{n}}\left[\sum_{t=0}^{\infty}\gamma^{t}\mathcal{R}_{n}(s_{t},a_{t})\right], (2)

where the notation $\mathbb{E}_{s_{0}\thicksim\mu_{n},a_{t}\thicksim\pi,s_{t+1}\thicksim P_{n}}$ indicates that the reward is averaged over all states and actions according to the initial state distribution, the transition probability and the policy. Accordingly, the optimization problem for FRL can be formulated as

\max_{\pi}\eta(\pi)=\sum_{n=1}^{N}q_{n}\eta_{n}(\pi). (3)

Note that $\mu_{n}$ is added to the formulation in (2) to represent the different initial state distributions of different agents and will be discussed in detail later. The above formulation covers both IID and non-IID cases. In particular, the different MDPs, i.e., different initial state distributions and environment dynamics, represent the heterogeneous environments experienced by different agents (non-IID). All MDPs will be identical for the IID case [35].

II-C Policy Gradient Methods

In this section, we will introduce the policy gradient (PG) method [15], which is taken by the agents to update their local policies in this paper. While action-value methods learn the values of actions and then select actions based on their estimated values, PG methods directly learn the decision-making policy.

To update the policy by gradient ascent methods, we need to estimate the policy gradient. For that purpose, we first define the state-value function and the action-value function. Specifically, the state-value function $V_{\pi,P_{n}}(s)$ of a state $s$ is the expected return when the agent starts from state $s$ and follows policy $\pi$ thereafter:

V_{\pi,P_{n}}(s)=\mathbb{E}_{\pi,P_{n}}\left[\sum_{l=0}^{\infty}\gamma^{l}\mathcal{R}(s_{t+l},a_{t+l})|s_{t}=s\right], (4)

where the expectation is performed over actions sampled from policy $\pi$ and the next state is sampled based on the transition probability $P_{n}$. Similarly, the action-value function $Q_{\pi,P_{n}}(s,a)$ of a state-action pair is the expected return when the agent starts from state $s$, takes the action $a$, and follows policy $\pi$ thereafter:

Q_{\pi,P_{n}}(s,a)=\mathbb{E}_{\pi,P_{n}}\left[\sum_{l=0}^{\infty}\gamma^{l}\mathcal{R}(s_{t+l},a_{t+l})|s_{t}=s,a_{t}=a\right]. (5)

By subtracting the state-value function from the action-value function, we can define the advantage function $A_{\pi,P_{n}}(s,a)=Q_{\pi,P_{n}}(s,a)-V_{\pi,P_{n}}(s)$, which measures the value of action $a$ in state $s$. The author of [15] proved that the parameters $\theta$ of a differentiable policy $\pi(\theta)$ can be updated using the policy gradient

\hat{g}=\mathbb{E}_{\pi,P_{n}}\left[\nabla_{\theta}\log\pi(a|s)A_{\pi,P_{n}}(s,a)\right], (6)

where $\nabla_{\theta}$ denotes the derivative with respect to $\theta$.
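As an illustration of how (6) is typically turned into a training objective, the sketch below builds a surrogate loss whose gradient equals the sampled policy gradient. It assumes a discrete action space and PyTorch tensors of logits, sampled actions and estimated advantages; it is a generic sketch, not the exact implementation used in our experiments.

```python
# Sketch of the policy-gradient estimator in (6): minimizing this loss
# performs gradient ascent on E[log pi(a|s) * A(s,a)].
import torch
from torch.distributions import Categorical

def policy_gradient_loss(logits, actions, advantages):
    dist = Categorical(logits=logits)          # pi(.|s) for a batch of states
    log_probs = dist.log_prob(actions)         # log pi(a|s)
    return -(log_probs * advantages.detach()).mean()
```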

However, the high variance of the gradient estimation often leads to large policy updates and causes instability. As a result, the step size must be limited. For that purpose, [36, 37] investigated the performance gap between two different policies $\pi$ and $\pi^{\prime}$,

\eta_{n}(\pi^{\prime})-\eta_{n}(\pi)=\mathbb{E}_{s\thicksim\rho_{\pi^{\prime},\mu_{n},P_{n}}}\left[\mathbb{E}_{a\thicksim\pi^{\prime}(a|s)}\left[A_{\pi,P_{n}}(s,a)\right]\right], (7)

where

\rho_{\pi,\mu_{n},P_{n}}(s)=\sum_{t=0}^{\infty}\gamma^{t}\mbox{Pr}(s_{t}=s|\pi,\mu_{n},P_{n}), (8)

denotes the unnormalized discounted visitation frequency.

Equation (7) implies that the reward can be increased by any policy update that improves the expected advantage. However, it is difficult to optimize the right-hand side (RHS) of (7) since $\rho_{\pi^{\prime},\mu_{n},P_{n}}$ is dependent on $\pi^{\prime}$. Therefore, the following surrogate objective was proposed

L_{\pi,\mu_{n},P_{n}}(\pi^{\prime})=\eta_{n}(\pi)+\mathbb{A}_{\pi,\mu_{n},P_{n}}(\pi^{\prime}), (9)

where

\mathbb{A}_{\pi,\mu_{n},P_{n}}(\pi^{\prime})=\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{n},P_{n}}}\left[\mathbb{E}_{a\thicksim\pi^{\prime}(\cdot|s)}\left[A_{\pi,P_{n}}(s,a)\right]\right], (10)

denotes the policy advantage, which measures the advantage of policy $\pi^{\prime}$ over $\pi$, starting from state $s\thicksim\mu_{n}$ and following $\pi^{\prime}$. $L_{\pi,\mu_{n},P_{n}}$ is an approximation to $\eta_{n}(\pi^{\prime})$ but easier to optimize because it replaces the dependency of $\rho_{\pi^{\prime},\mu_{n},P_{n}}$ on $\pi^{\prime}$ with that of $\rho_{\pi,\mu_{n},P_{n}}$ on $\pi$.

Although the above step involves an approximation, it is shown that a sufficiently small step that improves $L_{\pi,\mu_{n},P_{n}}(\pi^{\prime})$ will also improve $\eta_{n}(\pi^{\prime})$. Specifically, the following policy improvement bound was derived for any policies $\pi$ and $\pi^{\prime}$

\eta_{n}(\pi^{\prime})\geq L_{\pi,\mu_{n},P_{n}}(\pi^{\prime})-cD_{KL}^{max}(\pi,\pi^{\prime}), (11)

or equivalently

\eta_{n}(\pi^{\prime})-\eta_{n}(\pi)\geq\mathbb{A}_{\pi,\mu_{n},P_{n}}(\pi^{\prime})-cD_{KL}^{max}(\pi,\pi^{\prime}), (12)

where $c=\frac{4\max_{s,a}|A_{\pi,P_{k}}(s,a)|\gamma}{(1-\gamma)^{2}}$ is a constant and $D_{KL}^{max}(\pi,\pi^{\prime})=\max_{s}D_{KL}\left(\pi(\cdot|s)\|\pi^{\prime}(\cdot|s)\right)$ denotes the maximum KL divergence between two policies $\pi$ and $\pi^{\prime}$ among all states. As a result, one can optimize Equation (2) by optimizing the surrogate objective. A tighter bound for the policy improvement was later derived [38]

\eta_{n}(\pi^{\prime})-\eta_{n}(\pi)\geq\mathbb{A}_{\pi,\mu_{n},P_{n}}(\pi^{\prime})-c^{\text{CPO}}\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{n},P_{n}}}\left[D_{TV}(\pi(\cdot|s)\|\pi^{\prime}(\cdot|s))\right], (13)

where $c^{\text{CPO}}=\frac{2\max_{s}\left|\mathbb{E}_{a\thicksim\pi^{\prime}(\cdot|s)}\left[A_{\pi,P_{k}}(s,a)\right]\right|\gamma}{(1-\gamma)^{2}}$ is upper bounded by $c$, and $D_{TV}(\pi(\cdot|s)\|\pi^{\prime}(\cdot|s))=\frac{1}{2}\sum_{a}\left|\pi(a|s)-\pi^{\prime}(a|s)\right|$ denotes the TV distance between the two discrete probability distributions $\pi(\cdot|s)$ and $\pi^{\prime}(\cdot|s)$. For the continuous case, it is straightforward to replace the summation by integration. In this work, (13) is useful to derive the policy improvement bound of FRL.
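For discrete action distributions, the TV distance in (13) and the Pinsker bound used later in Section IV can be computed directly, as in the short sketch below (illustrative only).

```python
# TV distance for discrete distributions and a numerical check of Pinsker's
# inequality D_TV(p||q) <= sqrt(0.5 * D_KL(p||q)).
import numpy as np

def tv_distance(p, q):
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def kl_divergence(p, q, eps=1e-12):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps   # eps for numerical safety
    return float(np.sum(p * np.log(p / q)))

p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
assert tv_distance(p, q) <= np.sqrt(0.5 * kl_divergence(p, q))
```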

III Data Heterogeneity and its Impact on FRL

In this section, we will first investigate the types of data heterogeneity in FRL and define the heterogeneity level. Then, by studying the connection between the local and global objectives, we will investigate how the local update can help the global objective and derive a necessary condition for the global policy to be learn-able.

III-A Types of Data Heterogeneity in FRL

Given (8), the expected discounted reward of the $n$-th MDP $\mathcal{M}_{n}$, defined in (2), can be written as

\eta_{n}(\pi)=\sum_{s}\rho_{\pi,\mu_{n},P_{n}}(s)\sum_{a}\pi(a|s)\mathcal{R}_{n}(s,a). (14)

It can be observed that the performance of a policy on $\mathcal{M}_{n}$ is jointly determined by its initial state distribution $\mu_{n}$ and the dynamics function $P_{n}$, both of which can be either stochastic or deterministic. As a result, agents with different initial state distributions and environment dynamics will end up with different policies, causing the heterogeneity issue. Therefore, we classify the data heterogeneity in FRL into two types:

  1. Heterogeneity due to initial state distribution;

  2. Heterogeneity due to environment dynamics.

For ease of illustration, we consider two extreme cases where only one type of heterogeneity exists. Consider two finite MDPs, $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$, with the same environment dynamics, $P_{1}=P_{2}$, but different initial state distributions, $\mu_{1}\neq\mu_{2}$. This corresponds to the situation where two self-driving vehicles have the same car model and training routine, but tend to enter the same road from different entrances. Now, consider the other extreme case, where $\mu_{1}=\mu_{2}$ but $P_{1}\neq P_{2}$, i.e., the two MDPs have the same initial state distribution but different environment dynamics. This can be utilized to model the situation where two different cars always enter the same road at the same entrance. The two types of heterogeneity can occur simultaneously. However, the heterogeneity due to different initial state distributions is of more concern for episodic tasks, whose on-policy distribution is different from the stationary distribution [11] and heavily relies on the initial state distribution.

III-B Level of Heterogeneity

In the following, we define a measurement of the level (impact) of heterogeneity, which will be utilized to investigate the connection between the global and local policy advantages. For ease of illustration, we define $\mathbf{D}_{\pi,\mu_{n},P_{n}}$ as a $\left|\mathcal{S}\right|\times\left|\mathcal{S}\right|$ diagonal matrix with $\rho_{\pi,\mu_{n},P_{n}}(s_{i})$ being the $i$-th diagonal entry, where $\left|\mathcal{S}\right|$ denotes the cardinality of $\mathcal{S}$. Similarly, we denote $\mathbf{\Pi}_{\pi}$ as a $\left|\mathcal{S}\right|\times\left|\mathcal{A}\right|$ matrix whose $(i,j)$-th entry is $\pi(a_{j}|s_{i})$, and $\mathbf{A}_{\pi,P_{n}}$ as a $\left|\mathcal{S}\right|\times\left|\mathcal{A}\right|$ matrix whose $(i,j)$-th entry is $A_{\pi,P_{n}}(s_{i},a_{j})$. With the above notation, we can rewrite the policy advantage in (10) as

\mathbb{A}_{\pi,\mu_{n},P_{n}}(\pi^{\prime})=\sum_{s}\rho_{\pi,\mu_{n},P_{n}}(s)\sum_{a}\pi^{\prime}(a|s)A_{\pi,P_{n}}(s,a) (15)
=\mbox{Tr}\left(\mathbf{D}_{\pi,\mu_{n},P_{n}}\mathbf{A}_{\pi,P_{n}}\mathbf{\Pi}_{\pi^{\prime}}^{T}\right), (16)

where ${\rm Tr}(\cdot)$ denotes the trace operator. It can be observed that the advantage of taking a new policy $\pi^{\prime}$ on a specific agent $n$ is realized through the matrix product $\mathbf{D}_{\pi,\mu_{n},P_{n}}\mathbf{A}_{\pi,P_{n}}$. Thus, we propose the following definition regarding the level of heterogeneity. Without loss of generality, we assume that all states are reachable, i.e., $\rho_{\pi,\mu_{n},P_{n}}(s)>0$ for all $s$ and $\pi$.

Definition I.

(Level of heterogeneity). We define the level of heterogeneity of the $n$-th agent under policy $\pi$ by the Frobenius norm of the matrix

\mathbf{B}_{\pi,\mu_{n},P_{n}}=\sum_{k=1}^{N}q_{k}\mathbf{D}_{\pi,\mu_{n},P_{n}}^{-1}\mathbf{D}_{\pi,\mu_{k},P_{k}}\mathbf{A}_{\pi,P_{k}}-\mathbf{A}_{\pi,P_{n}} (17)

as $\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}$.

Note that $\mathbf{B}_{\pi,\mu_{n},P_{n}}$ measures the deviation of the policy advantage of the $n$-th agent from the average policy advantage of all agents. Given that the local policy advantage directly affects the policy gradient, its deviation from the average will lead to potential gradient divergence. For the IID case, $\mathbf{B}_{\pi,\mu_{n},P_{n}}=\mathbf{0}$ and thus $\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}=0$ for an arbitrary policy $\pi$. With non-IID data, $\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}$ will be positive and its magnitude indicates the level of heterogeneity of the $n$-th agent. Intuitively, if the level of heterogeneity of one agent is too high, i.e., the policy advantage calculated over its MDP is too different from that of the others, improving the global policy according to the local environment may not be beneficial for the whole system. In Corollary I, we will prove a necessary condition, with respect to $\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}$, for the global policy to be learn-able from the local policy.
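In the tabular case, Definition I can be evaluated directly once the visitation frequencies and advantage matrices have been estimated (e.g., by Monte Carlo, as done for the experiments in Section V-B). The following sketch assumes these quantities are given as NumPy arrays; it is for illustration only.

```python
# Level of heterogeneity (Definition I): Frobenius norm of B in (17).
import numpy as np

def heterogeneity_level(n, rhos, advantages, weights):
    """rhos[k]: visitation frequencies rho_k(s); advantages[k]: |S|x|A| matrix A_k."""
    D_n_inv = np.diag(1.0 / rhos[n])                       # D_{pi,mu_n,P_n}^{-1}
    weighted_avg = sum(q * np.diag(rho) @ A                # sum_k q_k D_k A_k
                       for q, rho, A in zip(weights, rhos, advantages))
    B_n = D_n_inv @ weighted_avg - advantages[n]
    return np.linalg.norm(B_n, ord='fro')
```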

III-C Connection between the Global and Local Objectives in FRL

In FRL, the local objective $\eta_{n}(\pi)$ serves as a proxy of the global objective $\eta(\pi)$. Specifically, each agent tries to optimize its local objective, but the global objective is the actual performance measure. Given the global policy $\pi^{t}$ in the $t$-th training round, the $n$-th agent interacts with its environment and obtains a new local policy $\pi_{n}^{t+1}$. Without access to any information of other agents, the $n$-th agent can only update the local policy by optimizing the local objective $\eta_{n}(\pi)$. However, a good local performance does not necessarily lead to a good global performance, especially with data heterogeneity. The question we want to answer is how we can optimize the local policy to best improve the global performance. For that purpose, we start from the local improvement.

It follows from (13) that

\eta_{n}(\pi_{n}^{t+1})-\eta_{n}(\pi^{t})\geq\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})-c^{\text{CPO}}\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right], (18)

which indicates that the improvement brought by the local policy $\pi_{n}^{t+1}$ to the $n$-th agent over the current global policy $\pi^{t}$ is lower bounded by the policy advantage. The next question is whether we can bound the global improvement similarly. For that purpose, we first investigate the relation between the global and local policy advantage, and the result is given in the following theorem.

Theorem I.

The following bound holds for all agents $n=1,\ldots,N$

\sum_{k=1}^{N}q_{k}\mathbb{A}_{\pi^{t},\mu_{k},P_{k}}(\pi_{n}^{t+1})\geq\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})-\alpha\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right], (19)

where $\alpha=2\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F}$.

The proof is given in Appendix A. Note that the left-hand side (LHS) of (19) represents the average policy advantage of $\pi_{n}^{t+1}$ over $\pi^{t}$ when applying $\pi_{n}^{t+1}$ on all agents, and the RHS represents the penalized policy advantage at the $n$-th agent. Thus, Theorem I indicates that the policy advantage of $\pi_{n}^{t+1}$ on all agents is lower bounded by its local policy advantage with a penalty term. As a result, improving the penalized local policy advantage of a given agent may improve its local policy’s performance on all agents.

Based on Theorem I, we have the following corollary.

Corollary I.

The condition

\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}<\left\|\mathbf{A}_{\pi,P_{n}}\right\|_{F} (20)

is a necessary condition for the local policy update to be able to improve the global objective.

The proof is given in Appendix B.

Remark I.

The RHS of (20) measures the policy advantage and can be regarded as the potential of the local policy to be further improved according to the local environment. The LHS of (20) indicates the heterogeneity level of the $n$-th agent. Corollary I implies that, in order to improve the global objective by optimizing the local objective, we need $\left\|\mathbf{A}_{\pi,P_{n}}\right\|_{F}-\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}>0$. So, the heterogeneity level serves as a penalty, i.e., the local policy’s contribution to the global objective is penalized by the heterogeneity level. In other words, even if the local training is beneficial for the local objective, it may not help the global objective unless the improvement is larger than the heterogeneity level. We have more to say about this in the experiment result section.

Remark II.

Note that Corollary I includes the IID case as a special case. In particular, without data heterogeneity, the LHS of (20) will be zero. Thus, the necessary condition holds as long as the RHS is greater than zero, which indicates that it is much easier for the local training to help the global training. This makes sense because for the IID case, all agents have the same MDP, thus any improvement for one agent will probably be beneficial for all agents.

However, the average policy advantage is not a direct measure of the final performance, as it is a proxy for the improvement of the global objective. A more useful result should directly measure the improvement in terms of the global expected discounted reward. For that purpose, we have the following corollary.

Corollary II.

Based on Theorem I, we have the following inequality showing the relation between the global objective and the local update:

\eta(\pi_{n}^{t+1})\geq\sum_{k=1}^{N}q_{k}\eta_{k}(\pi^{t})+\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})-(\alpha+\beta)\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]-\delta D_{TV}^{\max}(\pi^{t},\pi_{n}^{t+1}), (21)

where $\delta=2\sum_{k=1}^{N}q_{k}\frac{4\epsilon_{k}\gamma}{(1-\gamma)^{2}}D_{TV}(\rho_{\pi^{t},\mu_{k},P_{k}}\|\rho_{\pi^{t},\mu_{n},P_{n}})$, $\beta=\sum_{k=1}^{N}q_{k}\frac{4\epsilon_{k}\gamma}{(1-\gamma)^{2}}$ and $\epsilon_{k}=\max_{s,a}\left|A_{\pi^{t},P_{k}}(s,a)\right|$ denotes the maximum action advantage among all state-action pairs.

The proof is given in Appendix C. The LHS of (21) represents the average expected discounted reward of the $n$-th local policy $\pi_{n}^{t+1}$ over all agents. The first term on the RHS of (21) is a constant, representing the return of the current policy, and the second term represents the policy advantage of $\pi_{n}^{t+1}$ over $\pi^{t}$ on the $n$-th agent. Corollary II indicates that it is possible to improve a local policy’s average performance on all agents by improving the local policy advantage. In particular, by penalizing the local update with the TV distance, weighted by a coefficient ($\alpha$) proportional to the heterogeneity level, we may improve the performance of a local policy over all environments. In fact, it can be proved that optimizing the RHS of (21) will effectively improve the expected discounted reward on all agents. Please refer to Appendix D for more details.

The last two terms in (21) are penalty terms representing the expected TV distance and the maximum TV distance between $\pi_{n}^{t+1}$ and $\pi^{t}$. The two penalties indicate that, even if it is possible to improve the global performance by optimizing the local objective, the improvement only happens when the local policy $\pi_{n}^{t+1}$ is close to the global policy $\pi^{t}$. Given that the first term on the RHS of (21) is a constant, Corollary II suggests the following objective for the local training

h_{n}(\pi_{n}^{t+1};\pi^{t})=\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})-(\alpha+\beta)\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]-\delta D_{TV}^{\max}(\pi^{t},\pi_{n}^{t+1}). (22)
Remark III.

There are three penalty terms in (22), namely the $\alpha$, $\beta$ and $\delta$ penalty terms, penalizing the difference between the local and global policies. Thus, all of them can be regarded as a global penalty. However, they come from different sources and were added for different purposes. The $\alpha$ penalty arose when investigating the connection between the local and global policy advantage, i.e., Theorem I, to tackle the data heterogeneity, while the $\beta$ and $\delta$ penalties came from (12) and (13) to constrain the policy update at one agent.

Remark IV.

The IID case is covered by Corollary II. This special case occurs when all agents share the same initial state distribution and environment dynamics, i.e., the level of heterogeneity $\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}=0$, hence $\alpha=0$ and $\beta$ becomes the constant $c$ in (11) for all agents. As a result, the $\alpha$ penalty term disappears. However, the $\beta$ and $\delta$ penalty terms are still valid. This indicates that, even with IID data, a global penalty term is needed, and data heterogeneity will require a higher global penalty.

Next, we show the convergence of policy updates according to (22). The theoretical analysis is motivated by the convergence analysis of the Heterogeneous-Agent Trust Region Policy Optimization (HATRPO) algorithm [39]. Similar to [33], we focus our discussion on the tabular case, i.e., we assume the aggregation step can linearly combine local policies to obtain the global policy. A summary of this learning procedure is presented in Algorithm 1.

Algorithm 1 Federated Policy Iteration with Monotonic Improvement Guarantee (Tabular Case).
0:  $\pi^{0}$
1:  for round t = 0,1,… do
2:     Synchronize the global model $\pi^{t}$ to every agent.
3:     Compute the exact values of $\alpha$, $\beta$ and $\delta$.
4:     for agent k = 1,2,…,N do
5:        Optimize $\pi_{k}^{t+1}=\arg\max_{\pi}\left[h_{k}(\pi;\pi^{t})\right]$.
6:        Upload $\pi_{k}^{t+1}$ and $q_{k}$ to the central server.
7:     end for
8:     The central server aggregates the local policies as $\pi^{t+1}(a|s)\leftarrow\sum_{k=1}^{N}q_{k}\pi_{k}^{t+1}(a|s),\forall s,a$.
9:  end for
Theorem II.

Algorithm 1 is guaranteed to generate a monotonically improving sequence of policies, i.e., $\eta(\pi^{t+1})\geq\eta(\pi^{t})$ for $t\in\mathbb{N}$.

The proof is given in Appendix E. With Theorem II, the convergence of Algorithm 1 is given by the following theorem.

Theorem III.

A sequence of policies $\left(\pi^{t}\right)_{t=0}^{\infty}$ generated by Algorithm 1 is convergent.

The proof is given in Appendix F. When policies are parameterized with tabular methods, Theorems II and III are exact. For non-linear parameterization, the algorithm can be similarly implemented, but the error term introduced by the aggregation step should be considered.
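For reference, the aggregation step in line 8 of Algorithm 1 reduces, in the tabular case, to a weighted average of the local policy tables, as sketched below (illustrative only).

```python
# Tabular aggregation of Algorithm 1, line 8: pi^{t+1}(a|s) = sum_k q_k pi_k^{t+1}(a|s).
import numpy as np

def aggregate_tabular_policies(local_policies, weights):
    """local_policies[k]: |S|x|A| table of pi_k(a|s); weights: q_k summing to one."""
    global_pi = sum(q * pi for q, pi in zip(weights, local_policies))
    return global_pi / global_pi.sum(axis=1, keepdims=True)  # numerical safety
```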

IV The Proposed Algorithm: FedKL

Although we have obtained a theory justifying the objective in (22), directly optimizing it is difficult due to several issues. First, it is impractical to compute the last two terms of (22) by traversing all states. Moreover, despite its theoretical importance, the TV distance does not admit a convenient closed-form expression. Second, the advantage function has to be estimated by sampling. Third, one agent cannot access information of other agents, making it impossible to determine the coefficients in (22). In the following, we tackle these issues one by one.

IV-A TV Distance

To handle the first issue, we replace the maximum TV distance over all states, $D_{TV}^{\max}(\pi^{t},\pi_{n}^{t+1})$, by its average value $\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\lVert\pi_{n}^{t+1}(\cdot|s))\right]$. It has been shown that this heuristic approximation has similar empirical performance as the one with the maximum divergence [36]. Furthermore, we replace the TV distance in (22) by the KL divergence based on Pinsker’s inequality [40], i.e., $D_{TV}(p\lVert q)\leq\sqrt{\frac{1}{2}D_{KL}(p\lVert q)}$, and obtain

h_{n}(\pi_{n}^{t+1};\pi^{t})\geq\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})-(\alpha+\beta+\delta)\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[\sqrt{\frac{1}{2}D_{KL}(\pi^{t}(\cdot|s)\lVert\pi_{n}^{t+1}(\cdot|s))}\right]. (23)

IV-B Approximation of the Policy Advantage and Penalty Terms

To optimize (23), we need to calculate the policy advantage and the penalty terms. To this end, we assume all agents adopt the training algorithm proposed by [26]. Next, we show how we can approximate (23) by Monte Carlo simulation. In each training round, the agent will perform $I$ iterations of local training. Let $\pi^{t+1}_{n,i}$ denote the local policy of the $n$-th agent after $i$ iterations of local training with $i=0,\ldots,I$, where $\pi^{t+1}_{n,0}=\pi^{t}$ and $\pi^{t+1}_{n,I}=\pi^{t+1}_{n}$. Thus, the objective function for the $i$-th iteration of the $n$-th agent to obtain $\pi^{t+1}_{n,i}$ from $\pi^{t+1}_{n,i-1}$ can be given by

h_{n}(\pi_{n,i}^{t+1};\pi^{t},\pi_{n,i-1}^{t+1})=\mathbb{E}_{s\thicksim\rho_{\pi^{t+1}_{n,i-1},\mu_{n},P_{n}},\,a\thicksim\pi_{n,i-1}^{t+1}}\Big[\omega(s,a)A_{\pi^{t},P_{n}}(s,a)-c_{1}\sqrt{\frac{1}{2}D_{KL}(\pi^{t}(\cdot|s)\lVert\pi_{n,i}^{t+1}(\cdot|s))}-c_{2}D_{KL}(\pi_{n,i-1}^{t+1}(\cdot|s)\lVert\pi_{n,i}^{t+1}(\cdot|s))\Big], (24)

where $c_{1}$ is used to approximate $(\alpha+\beta+\delta)$, the last term is the KL penalty introduced by the local training, and $c_{2}$ is used to approximate $c$ defined in (11). Here, $\omega(s,a)=\frac{\rho_{\pi^{t},\mu_{n},P_{n}}(s)}{\rho_{\pi^{t+1}_{n,i-1},\mu_{n},P_{n}}(s)}\frac{\pi_{n,i}^{t+1}(a|s)}{\pi_{n,i-1}^{t+1}(a|s)}$ is the importance sampling ratio, where we estimate $\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})$ by states and actions sampled from $\rho_{\pi^{t+1}_{n,i-1},\mu_{n},P_{n}}$ and $\pi^{t+1}_{n,i-1}$. The expected value of the first term in (24) is an unbiased estimate of the policy advantage $\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})$. However, it is difficult and inefficient to compute $\omega(s,a)$ directly. Therefore, we ignore the difference in the discounted visitation frequencies and use $\omega(s,a)=\frac{\pi_{n,i}^{t+1}(a|s)}{\pi_{n,i-1}^{t+1}(a|s)}$, which yields a slightly biased estimate of the policy advantage but with smaller variance.

In the following discussion, we will refer to the second and third terms of (24) as the global and local penalties, respectively. In particular, the second term penalizes the divergence of one agent’s policy from the global policy, and the third term penalizes the change in the policy outputs between two consecutive iterations. Given that both penalties are based on the KL divergence, we refer to the proposed algorithm as FedKL.
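A possible implementation of the per-iteration objective (24) for a discrete action space is sketched below in PyTorch, with the visitation-frequency ratio dropped as discussed above. The tensor names and the use of Categorical distributions are illustrative assumptions, not the exact code released with this paper.

```python
# Sketch of the FedKL surrogate (24): importance-weighted advantage minus the
# global sqrt-KL penalty (w.r.t. pi^t) and the local KL penalty (w.r.t. the
# previous iterate). Minimizing the returned loss maximizes (24).
import torch
from torch.distributions import Categorical, kl_divergence

def fedkl_loss(new_logits, prev_logits, global_logits, actions, advantages, c1, c2):
    new_d = Categorical(logits=new_logits)
    prev_d = Categorical(logits=prev_logits.detach())
    global_d = Categorical(logits=global_logits.detach())
    # omega(s,a) = pi_{n,i}(a|s) / pi_{n,i-1}(a|s)
    ratio = torch.exp(new_d.log_prob(actions) - prev_d.log_prob(actions))
    surrogate = ratio * advantages.detach()
    global_pen = torch.sqrt(0.5 * kl_divergence(global_d, new_d).clamp_min(1e-12))
    local_pen = kl_divergence(prev_d, new_d)
    return -(surrogate - c1 * global_pen - c2 * local_pen).mean()
```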

IV-C Adaptive Penalty Coefficients

In FRL, the initial state distribution and dynamics function of one agent are not known to others. Thus, the penalty coefficients $c_{1}$ and $c_{2}$ cannot be determined locally. In this paper, we adopt the adaptive method in [26] to determine the coefficients. Specifically, we introduce two hyper-parameters, $d_{local}$ and $d_{global}$, as the target local and global KL divergence. Then, $c_{1}$ and $c_{2}$ will be adjusted in an adaptive manner so that the target $d_{local}$ and $d_{global}$ are achieved in each policy update. After initializing $c_{1}$ and $c_{2}$, we perform a three-phase update in each policy update as follows:

  • Perform the policy update using $E$ epochs of minibatch gradient ascent to optimize $h_{n}(\pi_{n,i}^{t+1};\pi^{t},\pi_{n,i-1}^{t+1})$ in (24).

  • Compute $d=\mathbb{E}_{s\thicksim\rho_{\pi^{t+1}_{n,i-1},\mu_{n},P_{n}}}\left[D_{KL}\left(\pi_{n,i-1}^{t+1}(\cdot|s)\lVert\pi_{n,i}^{t+1}(\cdot|s)\right)\right]$

    • If $d<d_{local}/1.1$, $c_{2}\leftarrow c_{2}/2$

    • If $d>d_{local}\times 1.1$, $c_{2}\leftarrow c_{2}\times 2$

  • Compute $d=\sqrt{\frac{1}{2}D_{KL}(\pi^{t}(\cdot|s)\lVert\pi_{n,i}^{t+1}(\cdot|s))}$

    • If $d<d_{global}/1.1$, $c_{1}\leftarrow c_{1}/2$

    • If $d>d_{global}\times 1.1$, $c_{1}\leftarrow c_{1}\times 2$

Specifically, in each training round, FedKL performs several iterations of SGD updates. After each iteration, FedKL updates its estimates of $c_{1}$ and $c_{2}$ for the next iteration so that the target local and global KL divergences are achieved.
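The adaptation rule can be summarized by the small helper below, called once after each local iteration with the measured local and global KL steps (a sketch of the logic above, not the released code).

```python
# Adaptive penalty coefficients: halve/double c1 and c2 so that the measured
# KL steps track the targets d_local and d_global (Section IV-C).
def adapt_coefficients(d_local_meas, d_global_meas, c1, c2, d_local, d_global):
    if d_local_meas < d_local / 1.1:
        c2 /= 2.0
    elif d_local_meas > d_local * 1.1:
        c2 *= 2.0
    if d_global_meas < d_global / 1.1:
        c1 /= 2.0
    elif d_global_meas > d_global * 1.1:
        c1 *= 2.0
    return c1, c2
```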

Algorithm 2 FedKL using adaptive penalty coefficients.
0:  $K,I,T,E,\pi^{0}$
1:  for round t = 0,1,… do
2:     Select a subset of K agents according to some criteria.
3:     Synchronize the global model to every selected agent.
4:     for agent k = 1,2,…,K do
5:        for iteration i = 1,2,…,I do
6:           Run policy $\pi_{k,i-1}^{t+1}$ in the environment for T timesteps.
7:           Optimize $h_{k}(\pi_{k,i}^{t+1};\pi^{t},\pi_{k,i-1}^{t+1})$ for $E$ epochs to obtain $\pi_{k,i}^{t+1}$.
8:        end for
9:        Upload $\theta_{k}^{t+1}$ and $l_{k}$ to the central server.
10:     end for
11:     The central server aggregates the $\theta$’s as $\theta^{t+1}=\sum_{k=1}^{K}\frac{l_{k}}{L}\theta_{k}^{t+1}$.
12:  end for
Remark V.

A larger target KL divergence ($d_{global}$ and $d_{local}$) will lead to a smaller penalty coefficient on the KL divergence ($c_{1}$ and $c_{2}$). Given the KL divergence between two policies before and after training represents the learning step size, the target KL divergence can be regarded as a target learning step size. Thus, $d_{global}$ and $d_{local}$ can help maintain a constant step size for each training round and iteration, respectively.

The heuristics for tuning $d_{local}$ and $d_{global}$ will be discussed in the experiments section. The training procedure of FedKL is illustrated in Algorithm 2.

V Experiments

Refer to caption
(a)
Refer to caption
(b)
Figure 2: Reacher’s field splitting. Sub-fields in grey are reachable by the robotic arm. Sub-environments are created based on these sub-fields.

In the experiments, we will compare the proposed FedKL with other algorithm-level solutions for tackling the heterogeneity issue, including FedAvg, FedProx and FMARL. In particular, agents of the FedAvg system will perform local training utilizing the algorithm proposed in [26], with the local penalty term. Agents of the FedProx system will deploy the same algorithm as FedAvg and add the proximal term [19]. Agents of the FMARL system will deploy the same algorithm as FedAvg and utilize the gradient decay scheme in [32]. On the other hand, agents of the proposed FedKL system will implement the same algorithm as FedAvg, but with the global KL divergence penalty. All code, experiments and modified environments can be found at: https://github.com/ShiehShieh/FedKL.

V-A Experiment Platforms

To the best of authors’ knowledge, there are few works focusing on the data heterogeneity issue in FRL, making it difficult to find benchmarks for comparison purpose. Therefore, we made several light-weight modifications to popular RL simulators to accommodate the FRL setting. There are two widely used simulation environments for evaluating the performance of RL controllers, namely the MuJoCo simulator [41, 42] and the Flow simulator [43]. In the following, we introduce the two simulators and the settings for FRL experiments.

V-A1 Heterogeneous Robotic Tasks

In Experiment 1, we consider a robotic task in the Reacher-v2 environment provided by OpenAI Gym. As shown in Fig. 2a, the agent actuates a robotic arm to move the end-effector to reach a target, which is denoted by a red sphere and spawns at a random position in a $0.4\times 0.4$ field. The environment consists of an 11-dimensional state space, a 2-dimensional action space and a reward function that determines the reward based on the length of the action vector as well as the distance between the end-effector and the target.

To simulate the heterogeneity due to different initial state distributions, we split the field into $Q$ sub-fields and create $Q$ sub-environments. For each sub-environment, we assume the robotic arm always starts from one specific sub-field among the $Q$ sub-fields and tries to reach the target. Note that this corresponds to an extreme setup for the different initial state distributions. The case with $Q=60$ is illustrated in Fig. 2b. To simulate the heterogeneity due to different environment dynamics, we add noise to each action as in [29]. Specifically, the noise distorts all actions such that the resulting dynamics are different for different devices. Let $m_{noise}$ and $\sigma_{noise}$ denote the mean and variance of the added noise. We assume $m_{noise}=0$ and pick different variances to represent different levels of heterogeneity.
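The dynamics heterogeneity described above can be realized with a simple action wrapper around the Gym environment, as sketched below; the wrapper name and structure are illustrative and may differ from the modified environments in our repository.

```python
# Sketch of dynamics heterogeneity for Experiment 1: each agent's actions are
# perturbed by zero-mean Gaussian noise with an agent-specific variance.
import gym
import numpy as np

class NoisyActionEnv(gym.ActionWrapper):
    def __init__(self, env, noise_std):
        super().__init__(env)
        self.noise_std = noise_std                        # sigma_noise

    def action(self, action):
        noise = np.random.normal(0.0, self.noise_std, size=np.shape(action))
        return np.clip(action + noise, self.action_space.low, self.action_space.high)
```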

V-A2 Traffic Control by Isolated RL Agents

In Experiment 2, we consider the traffic control in the “figure eight” network [43]. As shown in Fig. 3, there are in total 14 vehicles on the road with half of them controlled by RL agents and the rest controlled by the underlying Intelligent Driver Model (IDM) controller provided by the SUMO library. The IDM controller can be regarded as the human-driver model. This experiment is designed for mixed-autonomy, where a centralized controller coordinates a set of connected autonomous vehicles (CAVs) to improve the system-level velocities.

Refer to caption
(a)
Refer to caption
(b)
Figure 3: Vehicle Placement Schema. Red cars are human-driven vehicles. White cars are RL-controlled vehicles.

Some modifications are required to run FRL experiments in the above network. Fortunately, a few modifications were proposed by [32], where autonomous vehicles (AVs) are not connected by a centralized controller but act as isolated agents. Each RL-controlled vehicle is able to observe the position and speed of itself, and those of the two vehicles ahead of and behind it. As a result, each state is represented by a 6-dimensional vector. Since each agent is controlled by its local policy, every action is a scalar specifying the acceleration for the current time-step. We further assume that the average speed of all vehicles is accessible to all AVs.

By default, AVs and human-driven vehicles are positioned alternately, as shown in Fig. 3a. If we utilize ‘h’ and ‘r’ to represent human-driven vehicles and AVs, respectively, the default placement can be written as [h,r,h,r,h,r,h,r,h,r,h,r,h,r]. To simulate the heterogeneity due to different initial state distributions, we fix the starting position of all AVs such that every vehicle starts from the same position in each run. Given that the AVs have access to the information about the two vehicles ahead and behind, we can create different environment dynamics by adjusting the placement pattern. For example, the car placement [h,r,h,h,r,r,h,r,h,h,h,r,r,r], as shown in Fig. 3b, will lead to different environment dynamics.

V-A3 Implementation Details

For both experiments, we use neural networks to represent policies as in [36, 43]. Specifically, fully-connected Multilayer Perceptrons (MLPs) with $\tanh$ non-linearity are utilized for the two experiments, with hidden layers (64, 64) and (100, 50, 25) [43], respectively. Generalized Advantage Estimation (GAE) [34] is used to estimate the advantage function. The value functions of the two experiments are represented by two MLPs with $\tanh$ non-linearity and hidden layers (64, 64) and (256, 256), respectively. The learning rate of an algorithm remains the same within one simulation without decaying, but it is tuned for different simulations. The hyperparameters $d_{local}$, $d_{global}$ and $\mu$ (a hyperparameter defined by FedProx to control the weight of the proximal term) are carefully tuned so that they are near-optimal. In each training round, we set $K=3$, $I=20$ and $T=2048$ for Experiment 1, and $K=7$, $I=50$ and $T=1500$ for Experiment 2. For both experiments, we set the discount factor $\gamma$, the batch size, and the GAE discount factor $\lambda$ in [34] to 0.99, 64 and 0.95, respectively.
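For completeness, a minimal sketch of the GAE computation with the settings above ($\gamma=0.99$, $\lambda=0.95$) is given below; it assumes a single trajectory ending in a terminal state and is not the exact implementation in our code.

```python
# Generalized Advantage Estimation [34]: A_t = sum_l (gamma*lam)^l * delta_{t+l},
# with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    T = len(rewards)
    adv, last = np.zeros(T), 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0   # terminal state assumed
        delta = rewards[t] + gamma * next_value - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```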

V-B Level of Heterogeneity and Its Impact

Refer to caption
Figure 4: Visualization of $\left\|\mathbf{D}_{\pi,\mu_{n},P_{n}}\mathbf{A}_{\pi,P_{n}}\right\|_{F}-\left\|\mathbf{D}_{\pi,\mu_{n},P_{n}}\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}$ on modified Reacher-v2 environments. Results are averaged across three runs.
Refer to caption
Figure 5: The effect of negative $G$ on modified Reacher-v2 environments with different environment dynamics (action noise $\thicksim\mathcal{N}(0,0.4)$). Red lines indicate the rounds where $G<0$ happens.

In this section, we will investigate the type and level of heterogeneity and how they affect the performance of FRL. For ease of illustration, we will use Experiment 1 to analyze the heterogeneity type and level. Detailed performance comparison between different algorithms and in-depth discussion regarding the key parameters will be given in the next section based on results from Experiment 2.

Refer to caption
Figure 6: Comparison of FedAvg, FedProx, FMARL and FedKL on modified Reacher-v2 environments with different initial state distributions ($Q=60$). For all algorithms, the learning rate is 0.001. For FedProx, $\mu=0.02$. For FMARL, $\lambda=0.9999$. For FedKL, $d_{global}=0.6$. Results are averaged across three runs.
Refer to caption
Figure 7: Comparison of FedAvg, FedProx, FMARL and FedKL on modified Reacher-v2 environments with different environment dynamics (action noise $\thicksim\mathcal{N}(0,0.4)$). For all algorithms, the learning rate is 0.001. For FedProx, $\mu=0.02$. For FMARL, $\lambda=0.9999$. For FedKL, $d_{global}=0.6$. Results are averaged across three runs.

To illustrate the impact of data heterogeneity, we show in Fig. 4 the changes of $G=\left\|\mathbf{D}_{\pi,\mu_{n},P_{n}}\mathbf{A}_{\pi,P_{n}}\right\|_{F}-\left\|\mathbf{D}_{\pi,\mu_{n},P_{n}}\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}$. Each term in (20) is rescaled by $\mathbf{D}_{\pi,\mu_{n},P_{n}}$ to overcome the numerical instability introduced by the inverse operation $\mathbf{D}_{\pi,\mu_{n},P_{n}}^{-1}$. Note that, according to Corollary I, the local training is useful only when $\left\|\mathbf{A}_{\pi,P_{n}}\right\|_{F}-\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}>0$. Thus, we may regard $G$ as the potential of one agent to contribute to the global policy. In particular, if we treat $\left\|\mathbf{D}_{\pi,\mu_{n},P_{n}}\mathbf{A}_{\pi,P_{n}}\right\|_{F}$ as the local training potential, then $G$ is the local potential penalized by the heterogeneity level, i.e., even if there is potential improvement from local training, it may not be helpful for the global policy when the heterogeneity level is high. We consider four different scenarios, including the IID case, the case with different initial state distributions ($Q=60$), and two cases with different levels of environment dynamics, where the heterogeneity level is controlled by the variance of the action noise. It can be observed that $G$ decreases with training because training reduces the agents’ potential to further contribute to the global policy. On the other hand, comparing the two cases with different environment dynamics, we notice that a higher level of heterogeneity leads to a smaller $G$, indicating that heterogeneity will limit the potential contribution of one agent. Note that the comparison between different types of heterogeneity may not be meaningful, as their local advantage functions may have different ranges. Fluctuations of the curves are due to the limited number of simulations and matrix estimation error.

To further illustrate the impact of the heterogeneity level, we show in Fig. 5 the training performance with a high level of action noise. The red lines indicate the training rounds where $G<0$ happens in selected agents. It can be observed that, in those rounds, the global training slows down, which agrees with Corollary I, i.e., $G<0$ indicates that the associated local training is not helpful for the global policy. In Figs. 6 and 7, we show the performance comparison of FedAvg, FedProx, FMARL and FedKL with different initial state distributions and different transition probabilities, respectively. It can be observed that, with the same learning step size, FedKL converges to a stable performance, but both FedAvg and FedProx may diverge or fluctuate. We can certainly reduce the step size for FedAvg and FedProx, but this will lead to a slow learning speed. See the next section for more related discussions.

V-C Advantage of the KL-based Global Penalty

Refer to caption
Figure 8: Learning curves of FedAvg, FedProx, FMARL and FedKL for Experiment 2. The best performance is shown for each algorithm. For all algorithms, the learning rate is 0.01. For FedAvg, $d_{local}=0.0002$. For FedProx, $d_{local}=0.0002$, $\mu=0.001$. For FedKL, $d_{local}=0.0003$, $d_{global}=0.15$. Results are averaged across three runs.
Figure 9: The settings are the same as in Fig. 8, but all algorithms use the same local penalty, i.e., $d_{local}=0.0003$.

In the following, we show the results of Experiment 2, where both types of heterogeneity are present. The best performance of each algorithm is shown in Fig. 8, where the total reward averaged over three runs is reported. Each run uses the same seed for the SUMO simulator but a different seed for the parameter initialization of the neural network. It can be observed that FedKL converges much faster than FedAvg, FedProx and FMARL. The faster convergence of FedKL indicates a larger step size. A natural question is then whether we can increase the step size of FedAvg, FedProx and FMARL to speed up their training. In Fig. 9, we show the results where $d_{local}$ for FedAvg, FedProx and FMARL is increased. As a result, all of them converge at a similar speed to FedKL, but they diverge after reaching the maximum reward. The above results provide empirical evidence that directly constraining the model output by the KL divergence is more effective in handling data heterogeneity than penalizing model divergence in the parameter space. Furthermore, applying a proper global KL penalty can prevent divergence without sacrificing learning speed.

Why is the KL Penalty Better? It can be observed from Fig. 9 (after Round 120) that the proximal term of FedProx can slow down the divergence but fails to avoid it. This is because the proximal term $\left\|\theta_{n}^{t+1}-\theta^{t}\right\|$ constrains the learning in the model parameter space via the $\ell_{2}$-norm. Although a change in the model parameter space will lead to changes in the output distribution space (actions) and eventually influence the training objective, it is more effective to directly constrain the training in the distribution space [44, 45], which justifies the performance advantage of the proposed KL-divergence based penalty.
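To make the contrast concrete, the following is a minimal PyTorch-style sketch (illustrative only; the exact loss composition, KL direction and coefficients used by FedProx and FedKL are simplified, and the tensor shapes are assumed): the proximal term acts on parameters, while the KL term acts directly on the policy outputs at sampled states.

import torch

def proximal_term(local_params, global_params, mu=0.001):
    # FedProx-style penalty in the parameter space: (mu/2) * ||theta_n - theta||^2
    return 0.5 * mu * sum((p - g).pow(2).sum()
                          for p, g in zip(local_params, global_params))

def kl_term(global_logits, local_logits):
    # Distribution-space penalty: KL between the global and local policies,
    # averaged over a batch of sampled states (logits over discrete actions).
    log_p = torch.log_softmax(global_logits, dim=-1)   # global policy pi^t
    log_q = torch.log_softmax(local_logits, dim=-1)    # local policy pi_n^{t+1}
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()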

Figure 10: The effect of $d_{global}$ and $d_{local}$. (a) Without the global penalty, agents must use a small $d_{local}$ at the cost of slow learning. (b) With the global penalty, agents can explore a wider region (faster learning) without worrying about divergence.

Why Does the Global Penalty Help? We now investigate how $d_{global}$ helps speed up convergence and stabilize training. Assume there are $I$ iterations of local training in each round and let $d_{i}$ denote the policy update in the $i$-th iteration. We consider two cases, without and with the global penalty (target step size), respectively, as illustrated in Fig. 10. We use the red and green arrows to represent the optimal optimization directions of the global and local objectives, respectively. Note that they are not in the same direction due to data heterogeneity. For the first case, we only apply the local penalty, which constrains the step size of each iteration with $|d_{i}|\leq d_{local}$, i.e., the radius of the green circle in part (a) of Fig. 10. Let $d_{max}$ denote the largest step size tolerable in one training round without causing training divergence. Thus, we need $d_{local}\leq d_{max}/I$, which is determined by the worst-case scenario where all updates are in the same direction. However, the worst case happens with very low probability. As a result, $d_{local}$ is pessimistically small for most training rounds. The performance of FedAvg in Fig. 8 corresponds to the case where $d_{local}$ is small enough to avoid divergence but causes slow convergence. The performance of FedAvg in Fig. 9 corresponds to the case where $d_{local}$ is large enough for fast training but causes divergence.

Now consider the second case with $d_{global}$, denoted by the radius of the red circle in part (b) of Fig. 10. Under such circumstances, we only need to guarantee $\left|\sum_{i}^{I}d_{i}\right|\leq d_{global}\leq d_{max}$ and $\left|d_{i}\right|\leq d_{local}$. As a result, $d_{local}$ can be much larger than $d_{max}/I$, leading to a much larger per-iteration step size, represented by the larger radius of the green circle in part (b) of Fig. 10. Thus, adding $d_{global}$ enables a larger $d_{local}$, and their joint effect achieves a better trade-off between training speed (step size) and convergence.
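As a rough numerical illustration of this argument (a sketch with arbitrary values of $I$ and $d_{local}$, assuming random per-iteration update directions), the aggregate step of a round is typically far below the worst-case bound $I\cdot d_{local}$, which is why the local-penalty-only choice $d_{local}\leq d_{max}/I$ is pessimistic.

import numpy as np

rng = np.random.default_rng(0)
I, d_local = 10, 0.1                               # local iterations, per-iteration cap
d = rng.normal(size=(100_000, I, 2))               # random update directions (2-D toy model)
d = d / np.linalg.norm(d, axis=-1, keepdims=True) * d_local
total = np.linalg.norm(d.sum(axis=1), axis=-1)     # |sum_i d_i| for each simulated round
print("worst case (all aligned):", I * d_local)    # 1.0
print("typical aggregate step  :", total.mean())   # about 0.28, far below the worst case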

Heuristics for Tuning the Target Step Size: The global KL divergence is a good indicator of model divergence, and we usually observe a large global KL divergence followed by a drastic performance degradation. This observation suggests a heuristic way of tuning the global step-size target $d_{global}$. For example, one can first run the algorithm without activating the global KL penalty (set $c_{1}$ to 0), find the optimal value of $d_{local}$, and record the average KL divergence between $\pi^{t}$ and $\pi^{t+1}_{n}$ over a certain number of rounds. We observed that using a $d_{global}$ slightly larger than this average, together with a $d_{local}$ larger than the optimal value found in the previous runs, can significantly speed up learning in the early stage of training.
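A compact sketch of this heuristic is given below; run_frl, its arguments, the statistics key and the 1.2/1.5 factors are placeholders for illustration, not part of FedKL itself.

def tune_step_size_targets(run_frl, d_local_opt=3e-4, warmup_rounds=50):
    # Step 1: run without the global KL penalty (c1 = 0) at the tuned d_local and
    # record the per-round KL divergence between the global and local policies.
    stats = run_frl(c1=0.0, d_local=d_local_opt, rounds=warmup_rounds)
    avg_kl = sum(stats["global_kl_per_round"]) / warmup_rounds
    # Step 2: set d_global slightly above the observed average and allow a
    # larger d_local than the warm-up optimum to speed up early training.
    return {"d_global": 1.2 * avg_kl, "d_local": 1.5 * d_local_opt}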

V-D Communication-Efficient FRL

There are many benefits to solving the data heterogeneity issue, including faster convergence, stable performance, etc. Faster convergence also leads to fewer rounds of communication and mitigates the communication bottleneck of FRL. For example, to achieve a total reward of 350 in Experiment 2, both FedAvg and FedProx need around 110 rounds of communication, while FedKL needs only 50 rounds, a saving of more than $50\%$ of the communication workload.
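For reference, the saving quoted above corresponds to
\[
\frac{110-50}{110}\approx 54.5\%
\]
of the communication rounds.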

Local agents in FedKL use the Actor-Critic architecture [27, 15], where the actor refers to the policy $\pi$ and the critic refers to the value function or the advantage function. One possible way to reduce the communication workload is to keep the critic local, since it is not needed for inference. We will investigate this idea in future work.

Another issue for large-scale FRL applications like autonomous driving is the model aggregation over many agents moving in a large area. A possible solution is the hierarchical FL scheme [46], where aggregations happen in two layers. First, local aggregations among agents within a small region are performed on a local server. Then, the aggregation results from multiple local servers are sent to a global server for global aggregation. The local and global aggregations can happen with different frequencies, which may significantly reduce the communication workload. Unfortunately, the convergence analysis for the hierarchical scheme with data heterogeneity is not available, and it will be interesting to investigate the performance of FedKL under such a hierarchical framework.

VI Conclusion

In this paper, we investigated the data heterogeneity issue in FRL systems. For that purpose, we first classified the types of data heterogeneity in FRL and defined the heterogeneity level. It was proved that, although optimizing the local policy may improve the global policy, the deviation of the local policy from the global one must be properly constrained. A necessary condition for the local policy improvement to be beneficial for the global objective was also derived with respect to the proposed heterogeneity level. Based on the theoretical result, a KL-divergence based global penalty term was proposed, which was shown to enable a larger local step size without causing model divergence. A convergence proof of the proposed algorithm was also provided. Experiment results demonstrated the advantage of the proposed scheme in speeding up and stabilizing the training. One conclusion we can draw from this work is that heterogeneity not only reduces the contribution of one agent's local training to the global policy, but also increases the risk of divergence. It is thus important to balance learning speed and convergence with a proper learning step size. Due to the lack of global information at the agents, a feasible solution is to maximize the learning speed while avoiding divergence by jointly tuning the local and global penalties. In the future, a major challenge for implementing large-scale FRL will be the model aggregation among many moving agents, and it is critical to develop distributed aggregation schemes that can handle heterogeneous data.

Appendix A Proof of Theorem I

Proof.

Following [47], we define the “absolute” value of a matrix 𝐗\mathbf{X} as

\left|\mathbf{X}\right|=(\mathbf{X}^{\dagger}\mathbf{X})^{\frac{1}{2}},\qquad\left|\mathbf{X}^{\dagger}\right|=(\mathbf{X}\mathbf{X}^{\dagger})^{\frac{1}{2}}, (25)

where $\mathbf{X}^{\dagger}$ denotes the conjugate transpose of $\mathbf{X}$. In other words, $\left|\mathbf{X}\right|$ can be regarded as the square root of $\mathbf{X}^{\dagger}\mathbf{X}$. For a real matrix $\mathbf{X}$, $\mathbf{X}^{T}\mathbf{X}$ is symmetric and positive semi-definite, so there is a unique real positive semi-definite square root $\left|\mathbf{X}\right|$ of $\mathbf{X}^{T}\mathbf{X}$. For a given scalar $x$, the above operator degenerates to the absolute value operator $\left|x\right|=(xx)^{\frac{1}{2}}$.
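As a concrete illustration (not part of the original proof), consider the real matrix
\[
\mathbf{X}=\begin{pmatrix}0 & -2\\ 1 & 0\end{pmatrix},\qquad
\mathbf{X}^{T}\mathbf{X}=\begin{pmatrix}1 & 0\\ 0 & 4\end{pmatrix},\qquad
\left|\mathbf{X}\right|=\begin{pmatrix}1 & 0\\ 0 & 2\end{pmatrix},
\]
so that $\left|\mathbf{X}\right|$ is the unique real positive semi-definite square root of $\mathbf{X}^{T}\mathbf{X}$.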

To understand the impact of the local optimization on the global objective, we define the gap between the global and local policy advantages of taking the updated local policy $\pi_{n}^{t+1}$ of the $n$-th agent as

G(πnt+1)=|k=1Nqk𝔸πt,μk,Pk(πnt+1)𝔸πt,μn,Pn(πnt+1)|.\displaystyle G(\pi_{n}^{t+1})=\left|\sum_{k=1}^{N}q_{k}\mathbb{A}_{\pi^{t},\mu_{k},P_{k}}(\pi_{n}^{t+1})-\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})\right|. (26)

Utilizing the vector notation from (16), we can calculate the gap as

G(πnt+1)\displaystyle G(\pi_{n}^{t+1})
=\displaystyle= |Tr(k=1Nqk𝐃πt,μn,Pn𝐃πt,μn,Pn1𝐃πt,μk,Pk𝐀πt,Pk𝚷πnt+1T\displaystyle\left|\mbox{Tr}\left(\sum_{k=1}^{N}q_{k}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{D}^{-1}_{\pi^{t},\mu_{n},P_{n}}\mathbf{D}_{\pi^{t},\mu_{k},P_{k}}\mathbf{A}_{\pi^{t},P_{k}}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\right.\right.
𝐃πt,μn,Pn𝐀πt,Pn𝚷πnt+1T)|\displaystyle\left.\left.-\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{A}_{\pi^{t},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\right)\right| (27)
=\displaystyle= |Tr(k=1Nqk𝚷πnt+1T𝐃πt,μn,Pn𝐃πt,μn,Pn1𝐃πt,μk,Pk𝐀πt,Pk\displaystyle\left|\mbox{Tr}\left(\sum_{k=1}^{N}q_{k}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{D}^{-1}_{\pi^{t},\mu_{n},P_{n}}\mathbf{D}_{\pi^{t},\mu_{k},P_{k}}\mathbf{A}_{\pi^{t},P_{k}}\right.\right.
𝚷πnt+1T𝐃πt,μn,Pn𝐀πt,Pn)|\displaystyle\left.\left.-\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{A}_{\pi^{t},P_{n}}\right)\right| (28)
=\displaystyle= |Tr(𝐁πt,μn,Pn𝚷πnt+1T𝐃πt,μn,Pn)|,\displaystyle\left|\mbox{Tr}\left(\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right)\right|, (29)

where the second equation follows from (17).

The Cauchy-Schwarz inequality [47] states that, for two complex matrices 𝐗\mathbf{X} and 𝐘\mathbf{Y},

|Tr(𝐗𝐘)|Tr(|𝐗||𝐘|)Tr(|𝐗||𝐘|).\displaystyle\left|\mbox{Tr}\left(\mathbf{X}^{\dagger}\mathbf{Y}\right)\right|\leq\sqrt{\mbox{Tr}\left(\left|\mathbf{X}\right|\cdot\left|\mathbf{Y}\right|\right)\mbox{Tr}\left(\left|\mathbf{X}^{\dagger}\right|\cdot\left|\mathbf{Y}^{\dagger}\right|\right)}. (30)

Accordingly, we can obtain

|Tr(𝐁πt,μn,Pn𝚷πnt+1T𝐃πt,μn,Pn)|\displaystyle\left|\mbox{Tr}\left(\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right)\right|
\displaystyle\leq Tr(|𝐁πt,μn,PnT||𝚷πnt+1T𝐃πt,μn,Pn|)\displaystyle\sqrt{\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}^{T}\right|\left|\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right|\right)}
Tr(|𝐁πt,μn,Pn||𝐃πt,μn,Pn𝚷πnt+1|).\displaystyle\cdot\sqrt{\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right|\left|\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\right|\right)}. (31)

According to the definition in (25), we can always find real positive semi-definite matrices |𝐁πt,μn,PnT|\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}^{T}\right|, |𝐁πt,μn,Pn|\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right|, |𝚷πnt+1T|\left|\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\right| and |𝚷πnt+1|\left|\mathbf{\Pi}_{\pi_{n}^{t+1}}\right|. By the Cauchy-Schwarz inequality, we can further obtain

Tr(|𝐁πt,μn,PnT||𝚷πnt+1T𝐃πt,μn,Pn|)\displaystyle\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}^{T}\right|\left|\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right|\right)
\displaystyle\leq Tr(|𝐁πt,μn,PnT|2)Tr(|𝚷πnt+1T𝐃πt,μn,Pn|2),\displaystyle\sqrt{\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}^{T}\right|^{2}\right)\mbox{Tr}\left(\left|\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right|^{2}\right)}, (32)

and

Tr(|𝐁πt,μn,Pn||𝐃πt,μn,Pn𝚷πnt+1|)\displaystyle\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right|\left|\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\right|\right)
\displaystyle\leq Tr(|𝐁πt,μn,Pn|2)Tr(|𝐃πt,μn,Pn𝚷πnt+1|2).\displaystyle\sqrt{\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right|^{2}\right)\mbox{Tr}\left(\left|\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\right|^{2}\right)}. (33)

Substituting (32) and (33) into (31) gives

G(πnt+1)\displaystyle G(\pi_{n}^{t+1}) (Tr(|𝐁πt,μn,PnT|2)Tr(|𝚷πnt+1T𝐃πt,μn,Pn|2)\displaystyle\leq\left(\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}^{T}\right|^{2}\right)\mbox{Tr}\left(\left|\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right|^{2}\right)\right.
Tr(|𝐁πt,μn,Pn|2)Tr(|𝐃πt,μn,Pn𝚷πnt+1|2))14.\displaystyle\quad\left.\cdot\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right|^{2}\right)\mbox{Tr}\left(\left|\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\right|^{2}\right)\right)^{\frac{1}{4}}. (34)

Given

\mbox{Tr}\left(\left|\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right|^{2}\right)\mbox{Tr}\left(\left|\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\right|^{2}\right)
= \mbox{Tr}\left(\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right)
\cdot\mbox{Tr}\left(\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\right) (35)
= \mbox{Tr}\left(\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right)^{2}, (36)

and similarly

Tr(|𝐁πt,μn,PnT|2)Tr(|𝐁πt,μn,Pn|2)\displaystyle\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}^{T}\right|^{2}\right)\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right|^{2}\right)
=\displaystyle= Tr(𝐁πt,μn,Pn𝐁πt,μn,PnT)2,\displaystyle\mbox{Tr}\left(\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}^{T}\right)^{2}, (37)

we have

G(πnt+1)\displaystyle G(\pi_{n}^{t+1}) Tr(𝐁πt,μn,Pn𝐁πt,μn,PnT)\displaystyle\leq\sqrt{\mbox{Tr}\left(\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}^{T}\right)}
Tr(𝐃πt,μn,Pn𝚷πnt+1𝚷πnt+1T𝐃πt,μn,Pn)\displaystyle\quad\cdot\sqrt{\mbox{Tr}\left(\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right)} (38)
=𝐁πt,μn,PnFs,a[ρπt,μn,Pn(s)πnt+1(a|s)]2.\displaystyle=\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F}\sqrt{\sum_{s,a}[\rho_{\pi^{t},\mu_{n},P_{n}}(s)\pi_{n}^{t+1}(a|s)]^{2}}. (39)

Thus, using the notation $\left(\pi^{\prime}-\pi\right)(s,a)=\pi^{\prime}(a|s)-\pi(a|s),\forall s\in\mathcal{S},a\in\mathcal{A}$, we can further obtain

G(πnt+1πnt)\displaystyle G(\pi_{n}^{t+1}-\pi_{n}^{t}) (40)
\displaystyle\leq 𝐁πt,μn,PnFs,a[ρπt,μn,Pn(s)(πnt+1(a|s)πt(a|s))]2.\displaystyle\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F}\sqrt{\sum_{s,a}\left[\rho_{\pi^{t},\mu_{n},P_{n}}(s)\left(\pi_{n}^{t+1}(a|s)-\pi^{t}(a|s)\right)\right]^{2}}.

Given aπt(a|s)Aπt,μ(s,a)=0,s𝒮\sum_{a}\pi^{t}(a|s)A_{\pi^{t},\mu}(s,a)=0,\forall{s\in\mathcal{S}}, we have

𝔸πt,μn,Pn(πnt+1)\displaystyle\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})
=\displaystyle= sρπt,μn,Pn(s)aπnt+1(a|s)Aπt,Pn(s,a)\displaystyle\sum_{s}\rho_{\pi^{t},\mu_{n},P_{n}}(s)\sum_{a}\pi_{n}^{t+1}(a|s)A_{\pi^{t},P_{n}}(s,a)
sρπt,μn,Pn(s)aπt(a|s)Aπt,Pn(s,a)\displaystyle\quad-\sum_{s}\rho_{\pi^{t},\mu_{n},P_{n}}(s)\sum_{a}\pi^{t}(a|s)A_{\pi^{t},P_{n}}(s,a) (41)
=\displaystyle= sρπt,μn,Pn(s)a[πnt+1(a|s)πt(a|s)]Aπt,Pn(s,a)\displaystyle\sum_{s}\rho_{\pi^{t},\mu_{n},P_{n}}(s)\sum_{a}\left[\pi_{n}^{t+1}(a|s)-\pi^{t}(a|s)\right]A_{\pi^{t},P_{n}}(s,a) (42)
=\displaystyle= 𝔸πt,Pn((πnt+1πt)).\displaystyle\mathbb{A}_{\pi^{t},P_{n}}\left(\left(\pi_{n}^{t+1}-\pi^{t}\right)\right). (43)

It thus follows from (26) that

G(πnt+1)=G(πnt+1πnt).\displaystyle G(\pi_{n}^{t+1})=G(\pi_{n}^{t+1}-\pi_{n}^{t}). (44)

By combining (44) and (40), we can obtain

G(πnt+1)\displaystyle G(\pi_{n}^{t+1}) 𝐁πt,μn,PnF\displaystyle\leq\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F} (45)
s,a[ρπt,μn,Pn(s)(πnt+1(a|s)πt(a|s))]2.\displaystyle\quad\cdot\sqrt{\sum_{s,a}\left[\rho_{\pi^{t},\mu_{n},P_{n}}(s)\left(\pi_{n}^{t+1}(a|s)-\pi^{t}(a|s)\right)\right]^{2}}.

Based on (26), we can obtain

k=1Nqk𝔸πt,μk,Pk(πnt+1)\displaystyle\sum_{k=1}^{N}q_{k}\mathbb{A}_{\pi^{t},\mu_{k},P_{k}}(\pi_{n}^{t+1})
\displaystyle\geq 𝔸πt,μn,Pn(πnt+1)𝐁πt,μn,PnF\displaystyle\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})-\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F}
s,a[ρπt,μn,Pn(s)(πnt+1(a|s)πt(a|s))]2.\displaystyle\cdot\sqrt{\sum_{s,a}\left[\rho_{\pi^{t},\mu_{n},P_{n}}(s)\left(\pi_{n}^{t+1}(a|s)-\pi^{t}(a|s)\right)\right]^{2}}. (46)

Given $\sqrt{\left|a\right|^{2}+\left|b\right|^{2}}\leq\sqrt{\left(\left|a\right|+\left|b\right|\right)^{2}}=\left|a\right|+\left|b\right|$, and hence, by induction, $\sqrt{\sum_{i}x_{i}^{2}}\leq\sum_{i}\left|x_{i}\right|$, we can obtain

k=1Nqk𝔸πt,μk,Pk(πnt+1)\displaystyle\sum_{k=1}^{N}q_{k}\mathbb{A}_{\pi^{t},\mu_{k},P_{k}}(\pi_{n}^{t+1})
\displaystyle\geq 𝔸πt,μn,Pn(πnt+1)𝐁πt,μn,PnF\displaystyle\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})-\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F}
s,aρπt,μn,Pn(s)|πnt+1(a|s)πt(a|s)|.\displaystyle\cdot\sum_{s,a}\rho_{\pi^{t},\mu_{n},P_{n}}(s)\left|\pi_{n}^{t+1}(a|s)-\pi^{t}(a|s)\right|. (47)

Denote the TV distance between $\pi(\cdot|s)$ and $\pi^{\prime}(\cdot|s)$ as $D_{TV}(\pi(\cdot|s)\|\pi^{\prime}(\cdot|s))=\frac{1}{2}\sum_{a}\left|\pi(a|s)-\pi^{\prime}(a|s)\right|$. We can finally obtain

\sum_{k=1}^{N}q_{k}\mathbb{A}_{\pi^{t},\mu_{k},P_{k}}(\pi_{n}^{t+1}) (48)
\geq \mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})-\alpha\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right],

where $\alpha=2\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F}$. ∎

Appendix B Proof of Corollary I

Proof.

It follows from (16) that 𝔸πt,μn,Pn(πnt+1)=Tr(𝐃πt,μn,Pn𝐀πt,Pn𝚷πnt+1T).\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})=\mbox{Tr}\left(\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{A}_{\pi^{t},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\right). By following the same procedure from (29) to (40), we can prove

𝔸πt,μn,Pn(πnt+1)\displaystyle\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})
\displaystyle\leq 𝐀πt,PnF\displaystyle\left\|\mathbf{A}_{\pi^{t},P_{n}}\right\|_{F}
s,a[ρπt,μn,Pn(s)(πnt+1(a|s)πt(a|s))]2.\displaystyle\cdot\sqrt{\sum_{s,a}\left[\rho_{\pi^{t},\mu_{n},P_{n}}(s)\left(\pi_{n}^{t+1}(a|s)-\pi^{t}(a|s)\right)\right]^{2}}. (49)

Thus, if 𝐀πt,PnF𝐁πt,μn,PnF\left\|\mathbf{A}_{\pi^{t},P_{n}}\right\|_{F}\leq\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F}, then

𝔸πt,μn,Pn(πnt+1)\displaystyle\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})
\displaystyle\leq 𝐁πt,μn,PnF\displaystyle\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F}
s,a[ρπt,μn,Pn(s)(πnt+1(a|s)πt(a|s))]2.\displaystyle\cdot\sqrt{\sum_{s,a}\left[\rho_{\pi^{t},\mu_{n},P_{n}}(s)\left(\pi_{n}^{t+1}(a|s)-\pi^{t}(a|s)\right)\right]^{2}}. (50)

As a result, the lower bound on the global policy advantage given by Theorem I will be less than or equal to zero, where the equality holds when $\pi_{n}^{t+1}=\pi^{t}$. This indicates that there is no benefit from further updating $\pi^{t}$ at the $n$-th agent, and there will be no improvement for the global objective. Thus, (20) is a necessary condition for the local update to be able to improve the global objective. ∎

Appendix C Proof of Corollary II

Proof.

It follows from (3) that the global discounted reward of the updated local policy $\pi_{n}^{t+1}$ can be written as

η(πnt+1)=k=1Nqkηk(πnt+1).\displaystyle\text{$\eta$}(\pi_{n}^{t+1})=\sum_{k=1}^{N}q_{k}\eta_{k}(\pi_{n}^{t+1}). (51)

By substituting (13) into (51), we can obtain

η(πnt+1)\displaystyle\text{$\eta$}(\pi_{n}^{t+1}) k=1Nqk(Lπt,μk,Pk(πnt+1)\displaystyle\geq\sum_{k=1}^{N}q_{k}\left(L_{\pi^{t},\mu_{k},P_{k}}(\pi_{n}^{t+1})\right. (52)
ckCPO𝔼sρπt,μk,Pk[DTV(πt(|s)πnt+1(|s))]).\displaystyle\quad\left.-\text{$c_{k}^{\text{CPO}}\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{k},P_{k}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]$}\right).

By substituting (9) into (52) and upper bounding $c_{k}^{\text{CPO}}$ by $c^{\prime}_{k}$, we have

η(πnt+1)\displaystyle\text{$\eta$}(\pi_{n}^{t+1}) k=1Nqk(ηk(πt)+𝔸πt,Pk(πnt+1)\displaystyle\geq\sum_{k=1}^{N}q_{k}\left(\eta_{k}(\pi^{t})+\mathbb{A}_{\pi^{t},P_{k}}(\pi_{n}^{t+1})\right. (53)
ck𝔼sρπt,μk,Pk[DTV(πt(|s)πnt+1(|s))]),\displaystyle\quad\left.-\text{$c^{\prime}_{k}\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{k},P_{k}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]$}\right),

where ck=4ϵkγ(1γ)2c^{\prime}_{k}=\frac{4\epsilon_{k}\gamma}{(1-\gamma)^{2}}, and ϵk=maxs,a|Aπt,Pk(s,a)|\epsilon_{k}=\max_{s,a}|A_{\pi^{t},P_{k}}(s,a)| denotes the maximum absolute value of the advantage function over all state and action pairs.

To remove the dependency on ρπt,μk,Pk\rho_{\pi^{t},\mu_{k},P_{k}}, we have

𝔼sρπt,μk,Pk[DTV(πt(|s)πnt+1(|s))]\displaystyle\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{k},P_{k}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]
𝔼sρπt,μn,Pn[DTV(πt(|s)πnt+1(|s))]\displaystyle-\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]
=\displaystyle= sρπt,μk,Pk(s)DTV(πt(|s)πnt+1(|s))\displaystyle\sum_{s}\rho_{\pi^{t},\mu_{k},P_{k}}(s)D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))
sρπt,μn,Pn(s)DTV(πt(|s)πnt+1(|s))\displaystyle-\sum_{s}\rho_{\pi^{t},\mu_{n},P_{n}}(s)D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s)) (54)
\displaystyle\leq DTVmax(πt,πnt+1)s|ρπt,μk,Pk(s)ρπt,μn,Pn(s)|,\displaystyle D_{TV}^{\max}(\pi^{t},\pi_{n}^{t+1})\sum_{s}\left|\rho_{\pi^{t},\mu_{k},P_{k}}(s)-\rho_{\pi^{t},\mu_{n},P_{n}}(s)\right|, (55)

where DTVmax(π,π)=maxsDTV(π(|s)π(|s))D_{TV}^{max}(\pi,\pi^{\prime})=\max_{s}D_{TV}\left(\pi(\cdot|s)\|\pi^{\prime}(\cdot|s)\right) denotes the maximum TV divergence between two policies π\pi and π\pi^{\prime} among all states. It thus follows from (C) that

𝔼sρπt,μk,Pk[DTV(πt(|s)πnt+1(|s))]\displaystyle\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{k},P_{k}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]
\displaystyle\leq 𝔼sρπt,μn,Pn[DTV(πt(|s)πnt+1(|s))]\displaystyle\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]
+2DTVmax(πt,πnt+1)DTV(ρπt,μk,Pkρπt,μn,Pn).\displaystyle+2D_{TV}^{\max}(\pi^{t},\pi_{n}^{t+1})D_{TV}(\rho_{\pi^{t},\mu_{k},P_{k}}\|\rho_{\pi^{t},\mu_{n},P_{n}}). (56)

Substituting (19) and (C) into (53) gives:

η(πnt+1)\displaystyle\eta(\pi_{n}^{t+1}) k=1Nqkηk(πt)+𝔸πt,μn,Pn(πnt+1)\displaystyle\geq\sum_{k=1}^{N}q_{k}\eta_{k}(\pi^{t})+\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})
(α+β)𝔼sρπt,μn,Pn[DTV(πt(|s)πnt+1(|s))]\displaystyle\quad-(\alpha+\beta)\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]
δDTVmax(πt,πnt+1),\displaystyle\quad-\delta D_{TV}^{\max}(\pi^{t},\pi_{n}^{t+1}), (57)

where δ=2k=1Nqk4ϵkγ(1γ)2DTV(ρπt,μk,Pkρπt,μn,Pn)\delta=2\sum_{k=1}^{N}q_{k}\frac{4\epsilon_{k}\gamma}{(1-\gamma)^{2}}D_{TV}(\rho_{\pi^{t},\mu_{k},P_{k}}\|\rho_{\pi^{t},\mu_{n},P_{n}}), β=k=1Nqk4ϵkγ(1γ)2\beta=\sum_{k=1}^{N}q_{k}\frac{4\epsilon_{k}\gamma}{(1-\gamma)^{2}} and ϵk=maxs,a|Aπt,Pk(s,a)|\epsilon_{k}=\max_{s,a}\left|A_{\pi^{t},P_{k}}(s,a)\right|. ∎

Appendix D Effectiveness of Improving the Local Objective

Let g(πnt+1)g(\pi_{n}^{t+1}) denote the RHS of (II) as a function of πnt+1\pi_{n}^{t+1}. Thus, we have the following relation

η(πnt+1)\displaystyle\eta(\pi_{n}^{t+1}) g(πnt+1),\displaystyle\geq g(\pi_{n}^{t+1}), (58)
η(πt)\displaystyle\eta(\pi^{t}) =g(πt),\displaystyle=g(\pi^{t}), (59)
η(πnt+1)η(πt)\displaystyle\eta(\pi_{n}^{t+1})-\eta(\pi^{t}) g(πnt+1)g(πt).\displaystyle\geq g(\pi_{n}^{t+1})-g(\pi^{t}). (60)

The first inequality comes from (II). The equality in (59) can be checked by replacing $\pi_{n}^{t+1}$ with $\pi^{t}$ in the RHS of (II). Finally, subtracting (59) from (58) gives (60). These conditions guarantee that the global reward $\eta$ can be improved at each round $t$ by improving the surrogate term $g$: any $\pi_{n}^{t+1}$ that increases $g$ over $g(\pi^{t})$ also increases $\eta$ over $\eta(\pi^{t})$. In particular, the surrogate function $g(\pi)$ minorizes $\eta(\pi)$ at $\pi^{t}$ [36, 48].

Appendix E Proof of Theorem II

Lemma I.

(Linearity of policy advantage $\mathbb{A}_{\pi}$ and surrogate objective $L_{\pi}$). Given arbitrary policies $\pi$, $\hat{\pi}$ and $\pi^{\prime}$, the policy advantage of the linearly combined policy $\gamma\hat{\pi}+(1-\gamma)\pi^{\prime}$ over $\pi$ is the linear combination of the individual policy advantages, i.e., for all $0\leq\gamma\leq 1$, we have

𝔸π(γπ^+(1γ)π)\displaystyle\mathbb{A}_{\pi}(\gamma\hat{\pi}+(1-\gamma)\pi^{\prime}) =γ𝔸π(π^)+(1γ)𝔸π(π)\displaystyle=\gamma\mathbb{A}_{\pi}(\hat{\pi})+(1-\gamma)\mathbb{A}_{\pi}(\pi^{\prime}) (61)
Lπ(γπ^+(1γ)π)\displaystyle L_{\pi}(\gamma\hat{\pi}+(1-\gamma)\pi^{\prime}) =γLπ(π^)+(1γ)Lπ(π).\displaystyle=\gamma L_{\pi}(\hat{\pi})+(1-\gamma)L_{\pi}(\pi^{\prime}). (62)
Proof.
Lπ(γπ^+(1γ)π)\displaystyle L_{\pi}(\gamma\hat{\pi}+(1-\gamma)\pi^{\prime})
=\displaystyle= η(π)+γsρπ(s)aπ^(a|s)Aπ(s,a)\displaystyle\eta(\pi)+\gamma\sum_{s}\rho_{\pi}(s)\sum_{a}\hat{\pi}(a|s)A_{\pi}(s,a)
+(1γ)sρπ(s)aπ(a|s)Aπ(s,a)\displaystyle+(1-\gamma)\sum_{s}\rho_{\pi}(s)\sum_{a}\pi^{\prime}(a|s)A_{\pi}(s,a) (63)
=\displaystyle= η(π)+γ𝔸π(π^)+(1γ)𝔸π(π)\displaystyle\eta(\pi)+\gamma\mathbb{A}_{\pi}(\hat{\pi})+(1-\gamma)\mathbb{A}_{\pi}(\pi^{\prime}) (64)
=\displaystyle= γLπ(π^)+(1γ)Lπ(π).\displaystyle\gamma L_{\pi}(\hat{\pi})+(1-\gamma)L_{\pi}(\pi^{\prime}). (65)

Lemma II.

In the tabular case, for all states, the TV distance between $\pi^{t+1}(\cdot|s)$ and $\pi^{t}(\cdot|s)$ is upper bounded by the expectation of the TV distance between the local policies $\pi_{k}^{t+1}(\cdot|s)$ and $\pi^{t}(\cdot|s)$.

Proof.

For all states ss,

D_{TV}(\pi^{t}(\cdot|s)\|\pi^{t+1}(\cdot|s))
= \frac{1}{2}\left\|\pi^{t}(\cdot|s)-\pi^{t+1}(\cdot|s)\right\|_{1} (66)
= \frac{1}{2}\left\|\sum_{k=1}^{N}q_{k}\left(\pi^{t}(\cdot|s)-\pi_{k}^{t+1}(\cdot|s)\right)\right\|_{1} (67)
\leq \sum_{k=1}^{N}q_{k}\frac{1}{2}\left\|\pi^{t}(\cdot|s)-\pi_{k}^{t+1}(\cdot|s)\right\|_{1} (68)
= \sum_{k=1}^{N}q_{k}D_{TV}(\pi^{t}(\cdot|s)\|\pi_{k}^{t+1}(\cdot|s)) (69)
= \mathbb{E}_{k}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{k}^{t+1}(\cdot|s))\right]. (70)
∎
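As an illustrative numerical check of Lemma II for a single state (not part of the proof), the inequality can be verified with randomly drawn policies and aggregation weights:

import numpy as np

rng = np.random.default_rng(1)
tv = lambda p, q: 0.5 * np.abs(p - q).sum()     # total variation distance

N, A = 4, 5                                     # agents, actions
pi_t = rng.dirichlet(np.ones(A))                # global policy pi^t(.|s)
pi_k = rng.dirichlet(np.ones(A), size=N)        # local policies pi_k^{t+1}(.|s)
q = rng.dirichlet(np.ones(N))                   # aggregation weights q_k
pi_next = q @ pi_k                              # aggregated policy pi^{t+1}(.|s)
lhs = tv(pi_t, pi_next)
rhs = sum(qk * tv(pi_t, pk) for qk, pk in zip(q, pi_k))
assert lhs <= rhs + 1e-12                       # Lemma II holds for this draw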

Now the proof of Theorem II follows.

Proof.

By (13), with $c_{n}^{\text{CPO}}$ replaced by its upper bound $c_{n}$, we can obtain

\eta(\pi^{t+1}) = \sum_{n=1}^{N}q_{n}\eta_{n}\left(\sum_{k=1}^{N}q_{k}\pi_{k}^{t+1}\right) (71)
\geq \sum_{n=1}^{N}q_{n}\left\{\eta_{n}(\pi^{t})+\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}\left(\sum_{k=1}^{N}q_{k}\pi_{k}^{t+1}\right)\right.
\left.-c_{n}\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi^{t+1}(\cdot|s))\right]\right\}. (72)

By the linearity of advantage function, i.e. Lemma I , we have

η(πt+1)\displaystyle\eta(\pi^{t+1}) η(πt)+n=1Nqn{𝔸πt,μn,Pn(k=1Nqkπkt+1)\displaystyle\geq\eta(\pi^{t})+\sum_{n=1}^{N}q_{n}\left\{\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\sum_{k=1}^{N}q_{k}\pi_{k}^{t+1})\right.
cn𝔼sρπ,μn,Pn[DTV(πt(|s)πt+1(|s))]}\displaystyle\quad\left.-c_{n}\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi^{t+1}(\cdot|s))\right]\right\} (73)
=η(πt)+n=1Nqn{k=1Nqk𝔸πt,μn,Pn(πkt+1)\displaystyle=\eta(\pi^{t})+\sum_{n=1}^{N}q_{n}\left\{\sum_{k=1}^{N}q_{k}\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{k}^{t+1})\right.
cn𝔼sρπ,μn,Pn[DTV(πt(|s)πt+1(|s))]}.\displaystyle\quad\left.-c_{n}\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi^{t+1}(\cdot|s))\right]\right\}. (74)

By Lemma II, we have

η(πt+1)\displaystyle\eta(\pi^{t+1})
\displaystyle\geq η(πt)+n=1Nqn{k=1Nqk𝔸πt,μn,Pn(πkt+1)\displaystyle\eta(\pi^{t})+\sum_{n=1}^{N}q_{n}\left\{\sum_{k=1}^{N}q_{k}\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{k}^{t+1})\right.
cn𝔼sρπ,μn,Pn[kNqkDTV(πt(|s)πkt+1(|s))]}\displaystyle\left.-c_{n}\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{n},P_{n}}}\left[\sum_{k}^{N}q_{k}D_{TV}(\pi^{t}(\cdot|s)\|\pi_{k}^{t+1}(\cdot|s))\right]\right\} (75)
=\displaystyle= η(πt)+n=1Nqn{k=1Nqk𝔸πt,μn,Pn(πkt+1)\displaystyle\eta(\pi^{t})+\sum_{n=1}^{N}q_{n}\left\{\sum_{k=1}^{N}q_{k}\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{k}^{t+1})\right.
k=1Nqkcn𝔼sρπ,μn,Pn[DTV(πt(|s)πkt+1(|s))]}.\displaystyle\left.-\sum_{k=1}^{N}q_{k}c_{n}\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{k}^{t+1}(\cdot|s))\right]\right\}. (76)

By exchanging the summation order, we have

η(πt+1)\displaystyle\eta(\pi^{t+1}) η(πt)+k=1Nqk{n=1Nqn𝔸πt,μn,Pn(πkt+1)\displaystyle\geq\eta(\pi^{t})+\sum_{k=1}^{N}q_{k}\left\{\sum_{n=1}^{N}q_{n}\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{k}^{t+1})\right.
β𝔼sρπ,μn,Pn[DTV(πt(|s)πkt+1(|s))]}.\displaystyle\quad\left.-\beta\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{k}^{t+1}(\cdot|s))\right]\right\}. (77)

By following the same procedure from Appendix C, we have

η(πt+1)\displaystyle\eta(\pi^{t+1}) η(πt)+k=1Nqk{𝔸πt,μk,Pk(πkt+1)\displaystyle\geq\eta(\pi^{t})+\sum_{k=1}^{N}q_{k}\left\{\mathbb{A}_{\pi^{t},\mu_{k},P_{k}}(\pi_{k}^{t+1})\right.
(α+β)𝔼sρπ,μk,Pk[DTV(πt(|s)πkt+1(|s))]}\displaystyle\quad\left.-(\alpha+\beta)\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{k},P_{k}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{k}^{t+1}(\cdot|s))\right]\right\}
δDTVmax(πt,πkt+1)}.\displaystyle\quad\left.-\delta D_{TV}^{\max}(\pi^{t},\pi_{k}^{t+1})\right\}. (78)

This proves that the RHS of Corollary II forms a lower bound on the policy performance difference between $\pi^{t+1}$ and $\pi^{t}$. Moreover, since the equality holds when $\pi_{k}^{t+1}=\pi^{t},\forall k=1,\ldots,N$, optimizing the RHS of (78) or (22) in each round monotonically improves the policy performance difference between $\pi^{t+1}$ and $\pi^{t}$, by the argument in Appendix D. This completes the proof of Theorem II. ∎

Appendix F Proof of Theorem III

This proof uses some useful lemmas and corollaries from [39].

Lemma III.

(Continuity of ρπ\rho_{\pi}, [39, Lemma 4]). The (unnormalized) discounted visitation frequency ρπ\rho_{\pi} is continuous in π\pi.

Lemma IV.

(Continuity of QπQ_{\pi}, [39, Lemma 5]). Let π\pi be a policy. Then Qπ(s,a)Q_{\pi}\left(s,a\right) is Lipschitz-continuous in π\pi.

Corollary III.

([39, Corollary 1]). Due to the continuity of QπQ_{\pi}, the following functions are Lipschitz-continuous in π\pi: The state value function VπV_{\pi}, the advantage function Aπ(s,a)A_{\pi}\left(s,a\right), and the expected total reward η(π)\eta\left(\pi\right).

Lemma V.

(Continuity of policy advantage, [39, Lemma 6]). Let π\pi and π^\hat{\pi} be policies. The policy advantage 𝔸π(π^)\mathbb{A}_{\pi}(\hat{\pi}) is continuous in π\pi.

Lemma VI.

(Continuity of $\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}$). For every agent $n=1,\ldots,N$, the heterogeneity level $\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}$ is continuous in $\pi$.

Proof.

To simplify the proof, we expand the matrix norm 𝐁π,μn,PnF\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F} as follows

\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}
= \left\|\sum_{k=1}^{N}q_{k}\mathbf{D}_{\pi,\mu_{n},P_{n}}^{-1}\mathbf{D}_{\pi,\mu_{k},P_{k}}\mathbf{A}_{\pi,P_{k}}-\mathbf{A}_{\pi,P_{n}}\right\|_{F} (79)
= \sqrt{\sum_{s}\sum_{a}\left[\sum_{k=1}^{N}q_{k}\frac{\rho_{\pi,\mu_{k},P_{k}}(s)}{\rho_{\pi,\mu_{n},P_{n}}(s)}A_{\pi,P_{k}}(s,a)-A_{\pi,P_{n}}(s,a)\right]^{2}}. (80)

The continuity of 𝐁π,μn,PnF\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F} follows from the assumption that every state is reachable and the continuity of ρπ(s)\rho_{\pi}(s) and Aπ(s,a)A_{\pi}(s,a). ∎

Lemma VII.

(Continuity of local objective). For every agent k=1,,Nk=1,...,N, the local objective hk(π;π)h_{k}(\pi^{\prime};\pi) in (22) is continuous in π\pi.

Proof.

By Lemma V, $\mathbb{A}_{\hat{\pi},\mu_{k},P_{k}}(\pi)$ is continuous in $\hat{\pi}$. By Corollary III and Lemma VI, $\alpha$, $\beta$ and $\delta$ are continuous. By the continuity of $D_{TV}$ and the $\max$ operator, the last two terms of (22) are continuous. As a result, $h_{k}(\pi^{\prime};\pi)$ is continuous in $\pi$. ∎

Definition II.

(FedKL-Stationarity) A policy π\pi^{\prime} is FedKL-stationary if, for every agent k=1,,Nk=1,...,N,

π\displaystyle\pi^{\prime} =argmaxπ[𝔸π,μk,Pk(π)\displaystyle=\arg\max_{\pi}\left[\mathbb{A}_{\pi^{\prime},\mu_{k},P_{k}}(\pi)\right.
(α+β)𝔼sρπ,μk,Pk[DTV(π(|s),π(|s))]\displaystyle\quad-\left.\left(\alpha+\beta\right)\mathbb{E}_{s\thicksim\rho_{\pi^{\prime},\mu_{k},P_{k}}}\left[D_{TV}(\pi^{\prime}(\cdot|s),\pi(\cdot|s))\right]\right.
δDTVmax(π,π)]\displaystyle\quad\left.-\delta D_{TV}^{\max}(\pi^{\prime},\pi)\right] (81)
=argmaxπ[hk(π;π)].\displaystyle=\arg\max_{\pi}\left[h_{k}(\pi;\pi^{\prime})\right]. (82)

Next, we prove Theorem III. To this end, we start from the convergence of (22) and then show the limit points of the sequence generated by Algorithm 1 are FedKL-stationary.

Proof.

By Theorem II, the sequence $\left(\eta\left(\pi^{t}\right)\right)_{t=0}^{\infty}$ is non-decreasing and bounded above by $\frac{R_{\max}}{1-\gamma}$, where $R_{\max}$ is the maximum reward. As a result, the sequence converges. Let $\hat{\eta}$ denote the limit point. Since the sequence of policies $\left(\pi^{t}\right)_{t=0}^{\infty}$ is bounded, it has at least one convergent subsequence by the Bolzano-Weierstrass Theorem. Denote by $\hat{\pi}$ any limit point of the sequence, and let $\left(\pi^{t_{j}}\right)_{t_{j}=0}^{\infty}$ be a subsequence converging to $\hat{\pi}$. By the continuity of $\eta$ in $\pi$ (Corollary III), we have

η(π^)\displaystyle\eta\left(\hat{\pi}\right) =η(limjπtj)=limjη(πtj)=η^.\displaystyle=\eta\left(\lim_{j\to\infty}\pi^{t_{j}}\right)=\lim_{j\to\infty}\eta\left(\pi^{t_{j}}\right)=\hat{\eta}. (83)

We will now establish the FedKL-stationarity of any limit point π^\hat{\pi}. By Theorem II, we have

0\displaystyle 0 =limt[η(πt+1)η(πt)]\displaystyle=\lim_{t\to\infty}\left[\eta\left(\pi^{t+1}\right)-\eta\left(\pi^{t}\right)\right] (84)
limt𝔼k[𝔸πt,μk,Pk(πkt+1)\displaystyle\geq\lim_{t\to\infty}\mathbb{E}_{k}\left[\mathbb{A}_{\pi^{t},\mu_{k},P_{k}}(\pi_{k}^{t+1})\right.
(α+β)𝔼sρπt,μk,Pk[DTV(πt(|s),πkt+1(|s))]\displaystyle\quad\left.-\left(\alpha+\beta\right)\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{k},P_{k}}}\left[D_{TV}(\pi^{t}\left(\cdot|s\right),\pi_{k}^{t+1}\left(\cdot|s\right))\right]\right.
δDTVmax(πt,πkt+1)].\displaystyle\quad\left.-\delta D_{TV}^{\max}(\pi^{t},\pi_{k}^{t+1})\right]. (85)

Consider an arbitrary limit point π^\hat{\pi} and a subsequence (πtj)j=0\left(\pi^{t_{j}}\right)_{j=0}^{\infty} that converges to π^\hat{\pi}. Then, we can obtain

0\displaystyle 0 limj𝔼k[𝔸πtj,μk,Pk(πktj+1)\displaystyle\geq\lim_{j\to\infty}\mathbb{E}_{k}\left[\mathbb{A}_{\pi^{t_{j}},\mu_{k},P_{k}}(\pi_{k}^{t_{j}+1})\right.
(α+β)𝔼sρπtj,μk,Pk[DTV(πtj(|s),πktj+1(|s))]\displaystyle\quad\left.-\left(\alpha+\beta\right)\mathbb{E}_{s\thicksim\rho_{\pi^{t_{j}},\mu_{k},P_{k}}}\left[D_{TV}(\pi^{t_{j}}\left(\cdot|s\right),\pi_{k}^{t_{j}+1}\left(\cdot|s\right))\right]\right.
δDTVmax(πtj,πktj+1)].\displaystyle\quad\left.-\delta D_{TV}^{\max}(\pi^{t_{j}},\pi_{k}^{t_{j}+1})\right]. (86)

As every agent is optimizing its local objective, the expectation is with respect to non-negative random variables. Thus, for arbitrary k=1,,Nk=1,...,N, the RHS of (86) is bounded from below by

qklimjmaxπ[𝔸πtj,μk,Pk(π)\displaystyle q_{k}\lim_{j\to\infty}\max_{\pi}\left[\mathbb{A}_{\pi^{t_{j}},\mu_{k},P_{k}}(\pi)\right.
(α+β)𝔼sρπtj,μk,Pk[DTV(πtj(|s),π(|s))]]\displaystyle\left.-\left(\alpha+\beta\right)\mathbb{E}_{s\thicksim\rho_{\pi^{t_{j}},\mu_{k},P_{k}}}\left[D_{TV}(\pi^{t_{j}}\left(\cdot|s\right),\pi\left(\cdot|s\right))\right]\right]
δDTVmax(πtj,π)]0.\displaystyle\left.-\delta D_{TV}^{\max}(\pi^{t_{j}},\pi)\right]\geq 0. (87)

By Lemma VII and the Squeeze theorem, it follows from (86) and (87) that

limjmaxπ[hk(π;πtj)]=maxπ[hk(π;π^)]=0.\displaystyle\lim_{j\to\infty}\max_{\pi}\left[h_{k}(\pi;\pi^{t_{j}})\right]=\max_{\pi}\left[h_{k}(\pi;\hat{\pi})\right]=0. (88)

This proves that, for any limit point $\hat{\pi}$ of the policy sequence induced by Algorithm 1, $\max_{\pi}\left[h_{k}(\pi;\hat{\pi})\right]=0$ for every agent $k=1,\ldots,N$, which is equivalent to Definition II. ∎

References

  • [1] T. Chen, K. Zhang, G. B. Giannakis, and T. Basar, “Communication-efficient distributed reinforcement learning,” CoRR, vol. abs/1812.03239, 2018.
  • [2] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learning: Concept and applications,” ACM Trans. Intell. Syst. Technol., vol. 10, no. 2, Jan. 2019.
  • [3] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020.
  • [4] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” CoRR, vol. abs/1610.05492, 2016.
  • [5] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “Qsgd: Communication-efficient sgd via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [6] F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, “Local sgd with periodic averaging: Tighter analysis and adaptive synchronization,” in Advances in Neural Information Processing Systems, 2019, vol. 32, pp. 11 082–11 094.
  • [7] J. Wang and G. Joshi, “Cooperative sgd: A unified framework for the design and analysis of local-update sgd algorithms,” Journal of Machine Learning Research, vol. 22, no. 213, pp. 1–50, 2021.
  • [8] H. Yu, S. Yang, and S. Zhu, “Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 5693–5700, Jul. 2019.
  • [9] H. Zhu, J. Xu, S. Liu, and Y. Jin, “Federated learning on non-iid data: A survey,” CoRR, vol. abs/2106.06843, 2021.
  • [10] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of fedavg on non-iid data,” in ICLR 2020 : Eighth International Conference on Learning Representations, 2020.
  • [11] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed.   The MIT Press, 2018.
  • [12] J. Qi, Q. Zhou, L. Lei, and K. Zheng, “Federated reinforcement learning: Techniques, applications, and open challenges,” CoRR, vol. abs/2108.11887, 2021.
  • [13] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proceedings of the 31st International Conference on International Conference on Machine Learning, vol. 32, 2014, pp. 387–395.
  • [14] K. Ciosek and S. Whiteson, “Expected policy gradients for reinforcement learning,” Journal of Machine Learning Research, vol. 21, no. 52, pp. 1–51, 2020.
  • [15] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, vol. 12, 2000.
  • [16] Q. Li, Y. Diao, Q. Chen, and B. He, “Federated learning on non-iid data silos: An experimental study,” CoRR, vol. abs/2102.02079, 2021.
  • [17] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-iid data,” CoRR, vol. abs/1806.00582, 2018.
  • [18] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol. 54, 20–22 Apr 2017, pp. 1273–1282.
  • [19] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” in Proceedings of Machine Learning and Systems, vol. 2, 2020, pp. 429–450.
  • [20] H. Wang, Z. Kaplan, D. Niu, and B. Li, “Optimizing federated learning on non-iid data with reinforcement learning,” in IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, 2020, pp. 1698–1707.
  • [21] B. Liu, L. Wang, and M. Liu, “Lifelong federated reinforcement learning: A learning architecture for navigation in cloud robotic systems,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 4555–4562, 2019.
  • [22] X. Liang, Y. Liu, T. Chen, M. Liu, and Q. Yang, “Federated transfer reinforcement learning for autonomous driving,” CoRR, vol. abs/1910.06001, 2019.
  • [23] H.-K. Lim, J.-B. Kim, J.-S. Heo, and Y.-H. Han, Federated Reinforcement Learning for Training Control Policies on Multiple IoT Devices.   Sensors, 2020, vol. 20, no. 5.
  • [24] B. Liu, L. Wang, and M. Liu, “Lifelong federated reinforcement learning: A learning architecture for navigation in cloud robotic systems,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 1688–1695.
  • [25] H. H. Zhuo, W. Feng, Q. Xu, Q. Yang, and Y. Lin, “Federated reinforcement learning,” CoRR, vol. abs/1901.08277, 2019.
  • [26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017.
  • [27] V. Konda and J. Tsitsiklis, “Actor-critic algorithms,” in Advances in Neural Information Processing Systems, vol. 12.   MIT Press, 2000, pp. 1008–1014.
  • [28] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6382–6393.
  • [29] H.-K. Lim, J.-B. Kim, I. Ullah, J.-S. Heo, and Y.-H. Han, “Federated reinforcement learning acceleration method for precise control of multiple devices,” IEEE Access, vol. 9, pp. 76 296–76 306, 2021.
  • [30] D. Zhou, Y. Zhang, A. Sonabend-W, Z. Wang, J. Lu, and T. Cai, “Federated offline reinforcement learning,” 2022.
  • [31] Y. Cao, S.-Y. Lien, Y.-C. Liang, and K.-C. Chen, “Federated deep reinforcement learning for user access control in open radio access networks,” in ICC 2021 - IEEE International Conference on Communications, 2021, pp. 1–6.
  • [32] X. Xu, R. Li, Z. Zhao, and H. Zhang, “The gradient convergence bound of federated multi-agent reinforcement learning with efficient communication,” CoRR, vol. abs/2103.13026, 2021.
  • [33] H. Jin, Y. Peng, W. Yang, S. Wang, and Z. Zhang, “Federated reinforcement learning with environment heterogeneity,” in Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, G. Camps-Valls, F. J. R. Ruiz, and I. Valera, Eds., vol. 151.   PMLR, 28–30 Mar 2022, pp. 18–37.
  • [34] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” in Proceedings of the International Conference on Learning Representations (ICLR), 2016.
  • [35] X. Fan, Y. Ma, Z. Dai, W. Jing, C. Tan, and B. K. H. Low, “Fault-tolerant federated reinforcement learning with theoretical guarantee,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [36] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” CoRR, vol. abs/1502.05477, 2015.
  • [37] S. Kakade and J. Langford, “Approximately optimal approximate reinforcement learning,” in Proceedings of the Nineteenth International Conference on Machine Learning, 2002, pp. 267–274.
  • [38] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70.   PMLR, 06–11 Aug 2017, pp. 22–31.
  • [39] J. G. Kuba, R. Chen, M. Wen, Y. Wen, F. Sun, J. Wang, and Y. Yang, “Trust region policy optimisation in multi-agent reinforcement learning,” 2022.
  • [40] I. Csiszár and J. Körner, Information theory: coding theorems for discrete memoryless systems.   Cambridge University Press, 2011.
  • [41] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 5026–5033.
  • [42] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” CoRR, vol. abs/1606.01540, 2016.
  • [43] E. Vinitsky, A. Kreidieh, L. L. Flem, N. Kheterpal, K. Jang, C. Wu, F. Wu, R. Liaw, E. Liang, and A. M. Bayen, “Benchmarks for reinforcement learning in mixed-autonomy traffic,” in Proceedings of The 2nd Conference on Robot Learning, vol. 87, 2018, pp. 399–409.
  • [44] S. M. Kakade, “A natural policy gradient,” Advances in neural information processing systems, vol. 14, 2001.
  • [45] J. A. Bagnell and J. Schneider, “Covariant policy search,” IJCAI, 2003.
  • [46] L. Liu, J. Zhang, S. Song, and K. B. Letaief, “Client-edge-cloud hierarchical federated learning,” in IEEE International Conference on Communications (ICC), 2020, pp. 1–6.
  • [47] B. Baumgartner, “An inequality for the trace of matrix products, using absolute values,” 2011. [Online]. Available: https://arxiv.org/abs/1106.6189
  • [48] D. R. Hunter and K. Lange, “A tutorial on mm algorithms,” The American Statistician, vol. 58, no. 1, pp. 30–37, 2004.