
FedKL: Tackling Data Heterogeneity in Federated Reinforcement Learning by Penalizing KL Divergence

Zhijie Xie and S.H. Song
Zhijie Xie and S.H. Song are with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong (e-mail: [email protected]; [email protected]).
Abstract

As a distributed learning paradigm, Federated Learning (FL) faces the communication bottleneck issue due to many rounds of model synchronization and aggregation. Heterogeneous data further deteriorates the situation by causing slow convergence. Although the impact of data heterogeneity on supervised FL has been widely studied, the related investigation for Federated Reinforcement Learning (FRL) is still in its infancy. In this paper, we first define the type and level of data heterogeneity for policy gradient based FRL systems. By inspecting the connection between the global and local objective functions, we prove that local training can benefit the global objective, if the local update is properly penalized by the total variation (TV) distance between the local and global policies. A necessary condition for the global policy to be learn-able from the local policy is also derived, which is directly related to the heterogeneity level. Based on the theoretical result, a Kullback-Leibler (KL) divergence based penalty is proposed, which, different from the conventional method that penalizes the model divergence in the parameter space, directly constrains the model outputs in the distribution space. Convergence proof of the proposed algorithm is also provided. By jointly penalizing the divergence of the local policy from the global policy with a global penalty and constraining each iteration of the local training with a local penalty, the proposed method achieves a better trade-off between training speed (step size) and convergence. Experiment results on two popular Reinforcement Learning (RL) experiment platforms demonstrate the advantage of the proposed algorithm over existing methods in accelerating and stabilizing the training process with heterogeneous data.

Index Terms:
Federated Reinforcement Learning, Data Heterogeneity, Policy Gradient.

I Introduction

Federated Learning (FL) was proposed as a distributed Machine Learning (ML) scheme where data from distributed devices are utilized for training without violating user privacy. The idea has been successfully implemented (e.g., the Google Keyboard) [1] and has attracted much research attention [2, 3]. One of the key challenges for FL is the communication bottleneck due to the frequent communication of big ML models. There are mainly two methods to mitigate the issue, i.e., reducing the communication workload per round and decreasing the number of communication rounds. To this end, model compression has been proposed to reduce the communication workload per round [4, 5], and another line of research tried to improve the convergence speed to minimize the number of communication rounds [6, 7, 8]. Unfortunately, data heterogeneity between different devices significantly slows down the convergence [9, 10].

As an important ML paradigm, Reinforcement Learning (RL) trains the policy of an agent in an iterative way by interacting with the environment. In each time step, the agent makes a decision to manipulate the environment and receives a reward for its action, where the objective of the training is to maximize the long-term reward [11]. Due to the distributed data and computing power in large-scale applications such as autonomous driving, training RL algorithms under the FL framework is inevitable. Unfortunately, many challenges faced by supervised FL, e.g., data heterogeneity and the communication bottleneck, remain and are even more severe for FRL [12]. For example, centralized policy gradient methods already suffer from high variance, which is detrimental to the convergence speed and training performance [13, 14, 15], and data heterogeneity imposes another layer of difficulty on the convergence of FRL.

Data heterogeneity in FL refers to the situation where the data collected by different devices are not independent and identically distributed (IID) [3, 16]. Despite its empirical success, FedAvg does not perform well on highly skewed non-IID data [10, 17, 18]. As a result, many innovative methods have been proposed to tackle the data heterogeneity issue [9]. Data-level approaches aim to modify the data distribution to address the root of the problem. For example, a data-sharing strategy [17] was proposed to create a small but globally shared dataset to mitigate the impact of heterogeneous data. On the other hand, algorithm-level approaches try to solve the problem by improving the local training algorithms. Along this line of research, FedProx [19] added a weighted proximal term to the local objective function to constrain the divergence of the local model from the global model. Finally, system-level approaches focus on client clustering/selection [20] to counterbalance the bias introduced by non-IID data and speed up convergence.

The heterogeneity issue of FRL is different from that of supervised FL and the related study is very limited [21, 22, 23]. There are mainly two types of RL methods, namely, the action-value method and the policy gradient (PG) method [12, 11]. The first performs action selection according to the value function, while the second utilizes ML models to approximate the policy and updates the policy by gradient based algorithms. Data heterogeneity affects different RL methods in different ways, and thus requires different solutions. For the action-value method, Lifelong Reinforcement Learning (LFRL) [24] explored the combination of Q-learning and FL in real-world applications, where the knowledge learned in one particular environment is transferred to another. Federated RL (FedRL) [25] considered the scenario where some agents can only observe part of the environment and have no access to the reward. To handle this issue, agents maintain their individual Q-networks in addition to the global Q-network, and exchange their predicted Q values for cooperation. For policy gradient methods, Federated Transfer Reinforcement Learning (FTRL) [22] proposed an online transfer process to numerically align the scale of the observations and actions in different environments so that the policy learnt in one environment is portable to another. Given that devices of the same type may have different dynamics, an FRL architecture, which adapts Proximal Policy Optimization (PPO) and Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [26, 27, 28] to FRL in Internet of Things (IoT) networks, was proposed in [23, 29]. This FRL scheme alternates between sharing gradients and transferring mature models to other agents. For offline learning, data heterogeneity exists in different local datasets. To address this problem, Federated Dynamic Treatment Regime (FDTR) [30] proposed the first federated policy optimization algorithm for offline RL. There are also works that aim to improve the communication efficiency with heterogeneous data [31, 32]. For instance, Federated Multi-Agent Reinforcement Learning (FMARL) [32] proposed to gradually decay the gradients during local training to avoid model divergence and improve communication efficiency. QAvg and PAvg [33] theoretically proved that federated Q-Learning and federated policy gradient can converge in the tabular case when agents’ environments are heterogeneous.

In this paper, we will investigate the data heterogeneity issue in FRL systems whose agents employ PG methods with an advantage estimator [15, 34]. For that purpose, we first classify the types of data heterogeneity in FRL and define the heterogeneity level. By analyzing the connection between the global and local objective functions, we prove that although it is possible to improve the global objective by updating the local policy according to the local environment, the local update must be properly constrained. Based on the theory, we propose an algorithm that utilizes a Kullback–Leibler (KL) divergence based term to penalize the local training for diverging too far away from the global policy and name it the FedKL algorithm. The proposed KL penalty will be referred to as the global penalty because it guarantees that the local training of one agent will not diverge too far away from the global policy. Note that this global penalty is different from the per-iteration penalty utilized in centralized RL [26]. We will investigate the benefits of adding the global penalty and how its interplay with the local per-iteration penalty helps mitigate the data heterogeneity issue in FRL. Experiment results show that FedKL is able to stabilize training and speed up convergence when compared with existing algorithms, including FedAvg, FedProx and FMARL.

The major contributions of this work can be summarized as follows:

  • We classify the data heterogeneity of FRL into two types, due to different initial state distributions and different environment dynamics, respectively. Then, we define the level of heterogeneity, based on which a necessary condition for the local policy improvement to be beneficial for the global objective is derived.

  • We prove that the improvement of the local policy on one agent can benefit the global objective, as long as the local policy update is properly penalized by the total variation (TV) distance between the local and global policies. Based on the theoretical result, we propose the FedKL algorithm and adopt several approximations to simplify the implementation of FedKL. Convergence proof of the proposed algorithm is also provided.

  • Experiment results show that FedKL outperforms existing algorithm-level solutions with faster convergence and better stabilization. Several interesting insights are also obtained: 1) The proposed heterogeneity level is very effective and can be utilized to predict the learning behavior; 2) The KL divergence based penalty is more effective than the penalty defined in the parameter space; 3) The proposed global penalty enables a larger learning step size, which speeds up the convergence without causing divergence.

II Preliminaries

II-A Federated Learning

A typical FL system consists of one central server and $N$ agents (clients/devices), where the data of the agents are jointly utilized to train a global model without violating their privacy. In particular, the global optimization problem can be formulated as [18]

\min_{\theta}f(\theta)=\sum_{n=1}^{N}q_{n}f_{n}(\theta), (1)

where $f_{n}(\theta)$ represents the local objective function, e.g., the loss function, of the $n$-th agent with respect to the model parameters $\theta$. $q_{n}$ is a weighting factor often set to $q_{n}=l_{n}/L$, with $l_{n}$ denoting the number of data points at the $n$-th agent and $L=\sum_{n=1}^{N}l_{n}$. In each communication round, $K$ out of the total $N$ agents are selected and the central server broadcasts the global model to the selected agents. Then, each agent updates its local model utilizing the local data and uploads the trained model to the server for model aggregation. This process will repeat for some rounds until a preset learning objective is achieved.
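For concreteness, the aggregation step behind (1), i.e., the FedAvg rule used throughout this paper, can be sketched as follows. This is a minimal illustration assuming each local model is stored as a dictionary of NumPy arrays; the function and variable names are ours for illustration and are not taken from any specific library.

```python
# Minimal sketch of the FedAvg aggregation behind (1): the server averages
# local parameters with weights q_n = l_n / L (illustrative names only).
import numpy as np

def aggregate(local_models, data_counts):
    """Weighted average of local models, each a dict of NumPy arrays."""
    total = float(sum(data_counts))
    weights = [l / total for l in data_counts]          # q_n = l_n / L
    return {name: sum(w * m[name] for w, m in zip(weights, local_models))
            for name in local_models[0]}
```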

II-B Federated Reinforcement Learning

The system setup of FRL is similar, i.e., one central server federates the learning of $N$ distributed agents. In the $t$-th training round, the central server broadcasts the current global policy $\pi^{t}$ to $K$ selected agents, which will perform $I$ iterations of local training. In each iteration, the agent interacts with its environment to collect $T$ time-steps of data, and optimizes its local policy according to the local objective with gradient ascent methods. At the end of each round, the training results will be uploaded to the server for aggregation. An example of FRL systems for autonomous driving is illustrated in Fig. 1.

Refer to caption
Figure 1: An example of FRL systems: autonomous driving. Many vehicles run on different roads. In each training round, selected vehicles (white cars) will participate in the training. With the PG method, policies are represented by ML models, which will be communicated between agents and the server for model aggregation and synchronization. The model aggregation is the same as that of FedAvg.

We model the local training problem of each agent as a finite Markov Decision Process (MDP). Accordingly, the FRL system consists of $N$ finite MDPs $\{(\mathcal{S},\mu_{n},\mathcal{A},P_{n},\gamma)\,|\,n\in[1,N]\}$, where $\mathcal{S}$ denotes a finite set of states, $\mu_{n}$ represents the initial state distribution of the $n$-th MDP, $\mathcal{A}$ is a finite set of actions and $\gamma\in(0,1)$ is the discount factor. The dynamics function $P_{n}(s^{\prime},r|s,a):\mathcal{S}\times\mathbb{R}\times\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ represents the probability that the $n$-th MDP transits from state $s$ to $s^{\prime}$ and obtains a reward of $r\in\mathbb{R}$ after taking action $a$, which characterizes the environment dynamics of the $n$-th MDP [11]. We can define the transition probability $P_{n}(s^{\prime}|s,a):\mathcal{S}\times\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ as $P_{n}(s^{\prime}|s,a)=\sum_{r\in\mathbb{R}}P_{n}(s^{\prime},r|s,a)$. Furthermore, based on the dynamics function, we can define the reward function $\mathcal{R}_{n}(s,a):\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ as $\mathcal{R}_{n}(s,a)=\sum_{r\in\mathbb{R}}r\sum_{s^{\prime}\in\mathcal{S}}P_{n}(s^{\prime},r|s,a)$, which gives the expected reward of the $n$-th MDP for taking action $a$ in state $s$. As a result, the $n$-th MDP $\mathcal{M}_{n}$ can be represented by a 5-tuple $(\mathcal{S},\mu_{n},\mathcal{A},P_{n},\gamma)$, sharing the same state and action spaces with other agents, but with possibly different initial state distributions and environment dynamics. An agent’s behavior is controlled by a stochastic policy $\pi_{n}:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$, which outputs the probability of taking an action $a$ in a given state $s$. Specifically, each agent $n$ aims to train a local policy to maximize its expected discounted reward over the long run

\eta_{n}(\pi)=\mathbb{E}_{s_{0}\thicksim\mu_{n},a_{t}\thicksim\pi,s_{t+1}\thicksim P_{n}}\left[\sum_{t=0}^{\infty}\gamma^{t}\mathcal{R}_{n}(s_{t},a_{t})\right], (2)

where the notation $\mathbb{E}_{s_{0}\thicksim\mu_{n},a_{t}\thicksim\pi,s_{t+1}\thicksim P_{n}}$ indicates that the reward is averaged over all states and actions according to the initial state distribution, the transition probability and the policy. Accordingly, the optimization problem for FRL can be formulated as

\max_{\pi}\eta(\pi)=\sum_{n=1}^{N}q_{n}\eta_{n}(\pi). (3)

Note that $\mu_{n}$ is added to the formulation in (2) to represent the different initial state distributions of different agents and will be discussed in detail later. The above formulation covers both IID and non-IID cases. In particular, the different MDPs, i.e., different initial state distributions and environment dynamics, represent the heterogeneous environments experienced by different agents (non-IID). All MDPs will be identical for the IID case [35].

II-C Policy Gradient Methods

In this section, we will introduce the policy gradient (PG) method [15], which is taken by the agents to update their local policies in this paper. While action-value methods learn the values of actions and then select actions based on their estimated values, PG methods directly learn the decision-making policy.

To update the policy by gradient ascent methods, we need to estimate the policy gradient. For that purpose, we first define the state-value function and the action-value function. Specifically, the state-value function $V_{\pi,P_{n}}(s)$ of a state $s$ is the expected return when the agent starts from state $s$ and follows policy $\pi$ thereafter:

V_{\pi,P_{n}}(s)=\mathbb{E}_{\pi,P_{n}}\left[\sum_{l=0}^{\infty}\gamma^{l}\mathcal{R}(s_{t+l},a_{t+l})|s_{t}=s\right], (4)

where the expectation is performed over actions sampled from policy $\pi$ and the next state is sampled based on the transition probability $P_{n}$. Similarly, the action-value function $Q_{\pi,P_{n}}(s,a)$ of a state-action pair is the expected return when the agent starts from state $s$, takes the action $a$, and follows policy $\pi$ thereafter:

Q_{\pi,P_{n}}(s,a)=\mathbb{E}_{\pi,P_{n}}\left[\sum_{l=0}^{\infty}\gamma^{l}\mathcal{R}(s_{t+l},a_{t+l})|s_{t}=s,a_{t}=a\right]. (5)

By subtracting the state-value function from the action-value function, we can define the advantage function $A_{\pi,P_{n}}(s,a)=Q_{\pi,P_{n}}(s,a)-V_{\pi,P_{n}}(s)$, which measures the value of action $a$ in state $s$. The author of [15] proved that the parameters $\theta$ of a differentiable policy $\pi(\theta)$ can be updated using the policy gradient

\hat{g}=\mathbb{E}_{\pi,P_{n}}\left[\nabla_{\theta}\log\pi(a|s)A_{\pi,P_{n}}(s,a)\right], (6)

where $\nabla_{\theta}$ denotes the derivative with respect to $\theta$.
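As an illustration of how (6) is typically turned into a training objective, the sketch below builds a surrogate loss whose gradient equals the sampled policy gradient. It assumes a discrete action space and PyTorch tensors of logits, sampled actions and estimated advantages; it is a generic sketch, not the exact implementation used in our experiments.

```python
# Sketch of the policy-gradient estimator in (6): minimizing this loss
# performs gradient ascent on E[log pi(a|s) * A(s,a)].
import torch
from torch.distributions import Categorical

def policy_gradient_loss(logits, actions, advantages):
    dist = Categorical(logits=logits)          # pi(.|s) for a batch of states
    log_probs = dist.log_prob(actions)         # log pi(a|s)
    return -(log_probs * advantages.detach()).mean()
```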

However, the high variance of the gradient estimation often leads to large policy updates and causes instability. As a result, the step size must be limited. For that purpose, [36, 37] investigated the performance gap between two different policies $\pi$ and $\pi^{\prime}$,

\eta_{n}(\pi^{\prime})-\eta_{n}(\pi)=\mathbb{E}_{s\thicksim\rho_{\pi^{\prime},\mu_{n},P_{n}}}\left[\mathbb{E}_{a\thicksim\pi^{\prime}(a|s)}\left[A_{\pi,P_{n}}(s,a)\right]\right], (7)

where

\rho_{\pi,\mu_{n},P_{n}}(s)=\sum_{t=0}^{\infty}\gamma^{t}\mbox{Pr}(s_{t}=s|\pi,\mu_{n},P_{n}), (8)

denotes the unnormalized discounted visitation frequency.

Equation (7) implies that the reward can be increased by any policy update that improves the expected advantage. However, it is difficult to optimize the right-hand side (RHS) of (7) since $\rho_{\pi^{\prime},\mu_{n},P_{n}}$ is dependent on $\pi^{\prime}$. Therefore, the following surrogate objective was proposed

L_{\pi,\mu_{n},P_{n}}(\pi^{\prime})=\eta_{n}(\pi)+\mathbb{A}_{\pi,\mu_{n},P_{n}}(\pi^{\prime}), (9)

where

\mathbb{A}_{\pi,\mu_{n},P_{n}}(\pi^{\prime})=\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{n},P_{n}}}\left[\mathbb{E}_{a\thicksim\pi^{\prime}(\cdot|s)}\left[A_{\pi,P_{n}}(s,a)\right]\right], (10)

denotes the policy advantage, which measures the advantage of policy $\pi^{\prime}$ over $\pi$, starting from state $s\thicksim\mu_{n}$ and following $\pi^{\prime}$. $L_{\pi,\mu_{n},P_{n}}$ is an approximation to $\eta_{n}(\pi^{\prime})$ but easier to optimize because it replaces the dependency of $\rho_{\pi^{\prime},\mu_{n},P_{n}}$ on $\pi^{\prime}$ with that of $\rho_{\pi,\mu_{n},P_{n}}$ on $\pi$.

Although the above step involves an approximation, it is shown that a sufficiently small step that improves $L_{\pi,\mu_{n},P_{n}}(\pi^{\prime})$ will also improve $\eta_{n}(\pi^{\prime})$. Specifically, the following policy improvement bound was derived for any policies $\pi$ and $\pi^{\prime}$

\eta_{n}(\pi^{\prime})\geq L_{\pi,\mu_{n},P_{n}}(\pi^{\prime})-cD_{KL}^{max}(\pi,\pi^{\prime}), (11)

or equivalently

\eta_{n}(\pi^{\prime})-\eta_{n}(\pi)\geq\mathbb{A}_{\pi,\mu_{n},P_{n}}(\pi^{\prime})-cD_{KL}^{max}(\pi,\pi^{\prime}), (12)

where $c=\frac{4\max_{s,a}|A_{\pi,P_{k}}(s,a)|\gamma}{(1-\gamma)^{2}}$ is a constant and $D_{KL}^{max}(\pi,\pi^{\prime})=\max_{s}D_{KL}\left(\pi(\cdot|s)\|\pi^{\prime}(\cdot|s)\right)$ denotes the maximum KL divergence between two policies $\pi$ and $\pi^{\prime}$ among all states. As a result, one can optimize Equation (2) by optimizing the surrogate objective. A tighter bound for the policy improvement was later derived [38]

\eta_{n}(\pi^{\prime})-\eta_{n}(\pi)\geq\mathbb{A}_{\pi,\mu_{n},P_{n}}(\pi^{\prime})-c^{\text{CPO}}\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{n},P_{n}}}\left[D_{TV}(\pi(\cdot|s)\|\pi^{\prime}(\cdot|s))\right], (13)

where $c^{\text{CPO}}=\frac{2\max_{s}\left|\mathbb{E}_{a\thicksim\pi^{\prime}(\cdot|s)}\left[A_{\pi,P_{k}}(s,a)\right]\right|\gamma}{(1-\gamma)^{2}}$ is upper bounded by $c$, and $D_{TV}(\pi(\cdot|s)\|\pi^{\prime}(\cdot|s))=\frac{1}{2}\sum_{a}\left|\pi(a|s)-\pi^{\prime}(a|s)\right|$ denotes the TV distance between the two discrete probability distributions $\pi(\cdot|s)$ and $\pi^{\prime}(\cdot|s)$. For the continuous case, it is straightforward to replace the summation by integration. In this work, (13) is useful to derive the policy improvement bound of FRL.
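For discrete action distributions, the TV distance in (13) and the Pinsker bound used later in Section IV can be computed directly, as in the short sketch below (illustrative only).

```python
# TV distance for discrete distributions and a numerical check of Pinsker's
# inequality D_TV(p||q) <= sqrt(0.5 * D_KL(p||q)).
import numpy as np

def tv_distance(p, q):
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def kl_divergence(p, q, eps=1e-12):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps   # eps for numerical safety
    return float(np.sum(p * np.log(p / q)))

p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
assert tv_distance(p, q) <= np.sqrt(0.5 * kl_divergence(p, q))
```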

III Data Heterogeneity and its Impact on FRL

In this section, we will first investigate the types of data heterogeneity in FRL and define the heterogeneity level. Then, by studying the connection between the local and global objectives, we will investigate how the local update can help the global objective and derive a necessary condition for the global policy to be learn-able.

III-A Types of Data Heterogeneity in FRL

Given (8), the expected discounted reward of the $n$-th MDP $\mathcal{M}_{n}$, defined in (2), can be written as

\eta_{n}(\pi)=\sum_{s}\rho_{\pi,\mu_{n},P_{n}}(s)\sum_{a}\pi(a|s)\mathcal{R}_{n}(s,a). (14)

It can be observed that the performance of a policy on $\mathcal{M}_{n}$ is jointly determined by its initial state distribution $\mu_{n}$ and the dynamics function $P_{n}$, both of which can be either stochastic or deterministic. As a result, agents with different initial state distributions and environment dynamics will end up with different policies, causing the heterogeneity issue. Therefore, we classify the data heterogeneity in FRL into two types:

  1. Heterogeneity due to initial state distribution;

  2. Heterogeneity due to environment dynamics.

For ease of illustration, we consider two extreme cases where only one type of heterogeneity exists. Consider two finite MDPs, $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$, with the same environment dynamics, $P_{1}=P_{2}$, but different initial state distributions, $\mu_{1}\neq\mu_{2}$. This corresponds to the situation where two self-driving vehicles have the same car model and training routine, but tend to enter the same road from different entrances. Now, consider the other extreme case, where $\mu_{1}=\mu_{2}$ but $P_{1}\neq P_{2}$, i.e., the two MDPs have the same initial state distribution but different environment dynamics. This can be utilized to model the situation where two different cars always enter the same road at the same entrance. The two types of heterogeneity can occur simultaneously. However, the heterogeneity due to different initial state distributions is of more concern for episodic tasks, whose on-policy distribution is different from the stationary distribution [11] and heavily relies on the initial state distribution.

III-B Level of Heterogeneity

In the following, we define a measurement of the level (impact) of heterogeneity, which will be utilized to investigate the connection between the global and local policy advantages. For ease of illustration, we define $\mathbf{D}_{\pi,\mu_{n},P_{n}}$ as a $\left|\mathcal{S}\right|\times\left|\mathcal{S}\right|$ diagonal matrix with $\rho_{\pi,\mu_{n},P_{n}}(s_{i})$ being the $i$-th diagonal entry, where $\left|\mathcal{S}\right|$ denotes the cardinality of $\mathcal{S}$. Similarly, we denote $\mathbf{\Pi}_{\pi}$ as a $\left|\mathcal{S}\right|\times\left|\mathcal{A}\right|$ matrix whose $(i,j)$-th entry is $\pi(a_{j}|s_{i})$, and $\mathbf{A}_{\pi,P_{n}}$ as a $\left|\mathcal{S}\right|\times\left|\mathcal{A}\right|$ matrix whose $(i,j)$-th entry is $A_{\pi,P_{n}}(s_{i},a_{j})$. With the above notation, we can rewrite the policy advantage in (10) as

\mathbb{A}_{\pi,\mu_{n},P_{n}}(\pi^{\prime})=\sum_{s}\rho_{\pi,\mu_{n},P_{n}}(s)\sum_{a}\pi^{\prime}(a|s)A_{\pi,P_{n}}(s,a) (15)
=\mbox{Tr}\left(\mathbf{D}_{\pi,\mu_{n},P_{n}}\mathbf{A}_{\pi,P_{n}}\mathbf{\Pi}_{\pi^{\prime}}^{T}\right), (16)

where ${\rm Tr}(\cdot)$ denotes the trace operator. It can be observed that the advantage of taking a new policy $\pi^{\prime}$ on a specific agent $n$ is realized through the matrix product $\mathbf{D}_{\pi,\mu_{n},P_{n}}\mathbf{A}_{\pi,P_{n}}$. Thus, we propose the following definition regarding the level of heterogeneity. Without loss of generality, we assume that all states are reachable, i.e., $\rho_{\pi,\mu_{n},P_{n}}(s)>0$ for all $s$ and $\pi$.

Definition I.

(Level of heterogeneity). We define the level of heterogeneity of the $n$-th agent under policy $\pi$ by the Frobenius norm of the matrix

\mathbf{B}_{\pi,\mu_{n},P_{n}}=\sum_{k=1}^{N}q_{k}\mathbf{D}_{\pi,\mu_{n},P_{n}}^{-1}\mathbf{D}_{\pi,\mu_{k},P_{k}}\mathbf{A}_{\pi,P_{k}}-\mathbf{A}_{\pi,P_{n}} (17)

as $\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}$.

Note that $\mathbf{B}_{\pi,\mu_{n},P_{n}}$ measures the deviation of the policy advantage of the $n$-th agent from the average policy advantage of all agents. Given that the local policy advantage directly affects the policy gradient, its deviation from the average will lead to potential gradient divergence. For the IID case, $\mathbf{B}_{\pi,\mu_{n},P_{n}}=\mathbf{0}$ and thus $\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}=0$ for an arbitrary policy $\pi$. With non-IID data, $\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}$ will be positive and its magnitude indicates the level of heterogeneity of the $n$-th agent. Intuitively, if the level of heterogeneity of one agent is too high, i.e., the policy advantage calculated over its MDP is too different from that of the others, improving the global policy according to the local environment may not be beneficial for the whole system. In Corollary I, we will prove a necessary condition, with respect to $\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}$, for the global policy to be learn-able from the local policy.
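In the tabular case, Definition I can be evaluated directly once the visitation frequencies and advantage matrices have been estimated (e.g., by Monte Carlo, as done for the experiments in Section V-B). The following sketch assumes these quantities are given as NumPy arrays; it is for illustration only.

```python
# Level of heterogeneity (Definition I): Frobenius norm of B in (17).
import numpy as np

def heterogeneity_level(n, rhos, advantages, weights):
    """rhos[k]: visitation frequencies rho_k(s); advantages[k]: |S|x|A| matrix A_k."""
    D_n_inv = np.diag(1.0 / rhos[n])                       # D_{pi,mu_n,P_n}^{-1}
    weighted_avg = sum(q * np.diag(rho) @ A                # sum_k q_k D_k A_k
                       for q, rho, A in zip(weights, rhos, advantages))
    B_n = D_n_inv @ weighted_avg - advantages[n]
    return np.linalg.norm(B_n, ord='fro')
```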

III-C Connection between the Global and Local Objectives in FRL

In FRL, the local objective $\eta_{n}(\pi)$ serves as a proxy of the global objective $\eta(\pi)$. Specifically, each agent tries to optimize its local objective, but the global objective is the actual performance measure. Given the global policy $\pi^{t}$ in the $t$-th training round, the $n$-th agent interacts with its environment and obtains a new local policy $\pi_{n}^{t+1}$. Without access to any information of other agents, the $n$-th agent can only update the local policy by optimizing the local objective $\eta_{n}(\pi)$. However, a good local performance does not necessarily lead to a good global performance, especially with data heterogeneity. The question we want to answer is how we can optimize the local policy to best improve the global performance. For that purpose, we start from the local improvement.

It follows from (13) that

\eta_{n}(\pi_{n}^{t+1})-\eta_{n}(\pi^{t})\geq\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})-c^{\text{CPO}}\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right], (18)

which indicates that the improvement brought by the local policy $\pi_{n}^{t+1}$ to the $n$-th agent over the current global policy $\pi^{t}$ is lower bounded by the policy advantage. The next question is whether we can bound the global improvement similarly. For that purpose, we first investigate the relation between the global and local policy advantage, and the result is given in the following theorem.

Theorem I.

The following bound holds for all agents $n=1,\ldots,N$

\sum_{k=1}^{N}q_{k}\mathbb{A}_{\pi^{t},\mu_{k},P_{k}}(\pi_{n}^{t+1})\geq\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})-\alpha\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right], (19)

where $\alpha=2\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F}$.

The proof is given in Appendix A. Note that the left-hand side (LHS) of (19) represents the average policy advantage of $\pi_{n}^{t+1}$ over $\pi^{t}$ when applying $\pi_{n}^{t+1}$ on all agents, and the RHS represents the penalized policy advantage at the $n$-th agent. Thus, Theorem I indicates that the policy advantage of $\pi_{n}^{t+1}$ on all agents is lower bounded by its local policy advantage with a penalty term. As a result, improving the penalized local policy advantage of a given agent may improve its local policy’s performance on all agents.

Based on Theorem I, we have the following corollary.

Corollary I.

The condition

\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}<\left\|\mathbf{A}_{\pi,P_{n}}\right\|_{F} (20)

is a necessary condition for the local policy update to be able to improve the global objective.

The proof is given in Appendix B.

Remark I.

The RHS of (20) measures the policy advantage and can be regarded as the potential of the local policy to be further improved according to the local environment. The LHS of (20) indicates the heterogeneity level of the $n$-th agent. Corollary I implies that, in order to improve the global objective by optimizing the local objective, we need $\left\|\mathbf{A}_{\pi,P_{n}}\right\|_{F}-\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}>0$. So, the heterogeneity level serves as a penalty, i.e., the local policy’s contribution to the global objective is penalized by the heterogeneity level. In other words, even if the local training is beneficial for the local objective, it may not help the global objective unless the improvement is larger than the heterogeneity level. We have more to say about this in the experiment result section.

Remark II.

Note that Corollary I includes the IID case as a special case. In particular, without data heterogeneity, the LHS of (20) will be zero. Thus, the necessary condition holds as long as the RHS is greater than zero, which indicates that it is much easier for the local training to help the global training. This makes sense because for the IID case, all agents have the same MDP, thus any improvement for one agent will probably be beneficial for all agents.

However, the average policy advantage is not a direct measure of the final performance, as it is a proxy for the improvement of the global objective. A more useful result should directly measure the improvement in terms of the global expected discounted reward. For that purpose, we have the following corollary.

Corollary II.

Based on Theorem I, we have the following inequality showing the relation between the global objective and the local update:

\eta(\pi_{n}^{t+1})\geq\sum_{k=1}^{N}q_{k}\eta_{k}(\pi^{t})+\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})-(\alpha+\beta)\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]-\delta D_{TV}^{\max}(\pi^{t},\pi_{n}^{t+1}), (21)

where $\delta=2\sum_{k=1}^{N}q_{k}\frac{4\epsilon_{k}\gamma}{(1-\gamma)^{2}}D_{TV}(\rho_{\pi^{t},\mu_{k},P_{k}}\|\rho_{\pi^{t},\mu_{n},P_{n}})$, $\beta=\sum_{k=1}^{N}q_{k}\frac{4\epsilon_{k}\gamma}{(1-\gamma)^{2}}$ and $\epsilon_{k}=\max_{s,a}\left|A_{\pi^{t},P_{k}}(s,a)\right|$ denotes the maximum action advantage among all state-action pairs.

The proof is given in Appendix C. The LHS of (21) represents the average expected discounted reward of the $n$-th local policy $\pi_{n}^{t+1}$ over all agents. The first term on the RHS of (21) is a constant, representing the return of the current policy, and the second term represents the policy advantage of $\pi_{n}^{t+1}$ over $\pi^{t}$ on the $n$-th agent. Corollary II indicates that it is possible to improve a local policy’s average performance on all agents by improving the local policy advantage. In particular, by penalizing the local update with the TV distance, weighted by a coefficient ($\alpha$) proportional to the heterogeneity level, we may improve the performance of a local policy over all environments. In fact, it can be proved that optimizing the RHS of (21) will effectively improve the expected discounted reward on all agents. Please refer to Appendix D for more details.

The last two terms in (21) are penalty terms representing the expected TV distance and the maximum TV distance between $\pi_{n}^{t+1}$ and $\pi^{t}$. The two penalties indicate that, even if it is possible to improve the global performance by optimizing the local objective, the improvement only happens when the local policy $\pi_{n}^{t+1}$ is close to the global policy $\pi^{t}$. Given that the first term on the RHS of (21) is a constant, Corollary II suggests the following objective for the local training

h_{n}(\pi_{n}^{t+1};\pi^{t})=\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})-(\alpha+\beta)\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]-\delta D_{TV}^{\max}(\pi^{t},\pi_{n}^{t+1}). (22)
Remark III.

There are three penalty terms in (22), namely the $\alpha$, $\beta$ and $\delta$ penalty terms, penalizing the difference between the local and global policies. Thus, all of them can be regarded as a global penalty. However, they come from different sources and were added for different purposes. The $\alpha$ penalty arose when investigating the connection between the local and global policy advantage, i.e., Theorem I, to tackle the data heterogeneity, while the $\beta$ and $\delta$ penalties came from (12) and (13) to constrain the policy update at one agent.

Remark IV.

The IID case is covered by Corollary II. This special case occurs when all agents share the same initial state distribution and environment dynamics, i.e., the level of heterogeneity $\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}=0$, hence $\alpha=0$ and $\beta$ becomes the constant $c$ in (11) for all agents. As a result, the $\alpha$ penalty term disappears. However, the $\beta$ and $\delta$ penalty terms are still valid. This indicates that, even with IID data, a global penalty term is needed, and data heterogeneity will require a higher global penalty.

Next, we show the convergence of policy updates according to (22). The theoretical analysis is motivated by the convergence analysis of the Heterogeneous-Agent Trust Region Policy Optimization (HATRPO) algorithm [39]. Similar to [33], we focus our discussion on the tabular case, i.e., we assume the aggregation step can linearly combine local policies to obtain the global policy. A summary of this learning procedure is presented in Algorithm 1.

Algorithm 1 Federated Policy Iteration with Monotonic Improvement Guarantee (Tabular Case).
0:  $\pi^{0}$
1:  for round t = 0,1,… do
2:     Synchronize the global model $\pi^{t}$ to every agent.
3:     Compute the exact values of $\alpha$, $\beta$ and $\delta$.
4:     for agent k = 1,2,…,N do
5:        Optimize $\pi_{k}^{t+1}=\arg\max_{\pi}\left[h_{k}(\pi;\pi^{t})\right]$.
6:        Upload $\pi_{k}^{t+1}$ and $q_{k}$ to the central server.
7:     end for
8:     The central server aggregates the local policies as $\pi^{t+1}(a|s)\leftarrow\sum_{k=1}^{N}q_{k}\pi_{k}^{t+1}(a|s),\forall s,a$.
9:  end for
Theorem II.

Algorithm 1 is guaranteed to generate a monotonically improving sequence of policies, i.e., $\eta(\pi^{t+1})\geq\eta(\pi^{t})$ for $t\in\mathbb{N}$.

The proof is given in Appendix E. With Theorem II, the convergence of Algorithm 1 is given by the following theorem.

Theorem III.

A sequence of policies $\left(\pi^{t}\right)_{t=0}^{\infty}$ generated by Algorithm 1 is convergent.

The proof is given in Appendix F. When policies are parameterized with tabular methods, Theorems II and III are exact. For non-linear parameterization, the algorithm can be similarly implemented, but the error term introduced by the aggregation step should be considered.
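For reference, the aggregation step in line 8 of Algorithm 1 reduces, in the tabular case, to a weighted average of the local policy tables, as sketched below (illustrative only).

```python
# Tabular aggregation of Algorithm 1, line 8: pi^{t+1}(a|s) = sum_k q_k pi_k^{t+1}(a|s).
import numpy as np

def aggregate_tabular_policies(local_policies, weights):
    """local_policies[k]: |S|x|A| table of pi_k(a|s); weights: q_k summing to one."""
    global_pi = sum(q * pi for q, pi in zip(weights, local_policies))
    return global_pi / global_pi.sum(axis=1, keepdims=True)  # numerical safety
```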

IV The Proposed Algorithm: FedKL

Although we have obtained a theory justifying the objective in (22), directly optimizing it is difficult due to several issues. First, it is impractical to compute the last two terms of (22) by traversing all states. Moreover, despite its theoretical importance, the TV distance does not admit a convenient closed-form expression. Second, the advantage function has to be estimated by sampling. Third, one agent cannot access information of other agents, making it impossible to determine the coefficients in (22). In the following, we tackle these issues one by one.

IV-A TV Distance

To handle the first issue, we replace the maximum TV distance over all states, $D_{TV}^{\max}(\pi^{t},\pi_{n}^{t+1})$, by its average value $\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\lVert\pi_{n}^{t+1}(\cdot|s))\right]$. It has been shown that this heuristic approximation has similar empirical performance as the one with the maximum divergence [36]. Furthermore, we replace the TV distance in (22) by the KL divergence based on Pinsker’s inequality [40], i.e., $D_{TV}(p\lVert q)\leq\sqrt{\frac{1}{2}D_{KL}(p\lVert q)}$, and obtain

h_{n}(\pi_{n}^{t+1};\pi^{t})\geq\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})-(\alpha+\beta+\delta)\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[\sqrt{\frac{1}{2}D_{KL}(\pi^{t}(\cdot|s)\lVert\pi_{n}^{t+1}(\cdot|s))}\right]. (23)

IV-B Approximation of the Policy Advantage and Penalty Terms

To optimize (23), we need to calculate the policy advantage and the penalty terms. To this end, we assume all agents adopt the training algorithm proposed by [26]. Next, we show how we can approximate (23) by Monte Carlo simulation. In each training round, the agent will perform $I$ iterations of local training. Let $\pi^{t+1}_{n,i}$ denote the local policy of the $n$-th agent after $i$ iterations of local training with $i=0,\ldots,I$, where $\pi^{t+1}_{n,0}=\pi^{t}$ and $\pi^{t+1}_{n,I}=\pi^{t+1}_{n}$. Thus, the objective function for the $i$-th iteration of the $n$-th agent to obtain $\pi^{t+1}_{n,i}$ from $\pi^{t+1}_{n,i-1}$ can be given by

h_{n}(\pi_{n,i}^{t+1};\pi^{t},\pi_{n,i-1}^{t+1})=\mathbb{E}_{s\thicksim\rho_{\pi^{t+1}_{n,i-1},\mu_{n},P_{n}},\,a\thicksim\pi_{n,i-1}^{t+1}}\Big[\omega(s,a)A_{\pi^{t},P_{n}}(s,a)-c_{1}\sqrt{\frac{1}{2}D_{KL}(\pi^{t}(\cdot|s)\lVert\pi_{n,i}^{t+1}(\cdot|s))}-c_{2}D_{KL}(\pi_{n,i-1}^{t+1}(\cdot|s)\lVert\pi_{n,i}^{t+1}(\cdot|s))\Big], (24)

where $c_{1}$ is used to approximate $(\alpha+\beta+\delta)$, the last term is the KL penalty introduced by the local training, and $c_{2}$ is used to approximate $c$ defined in (11). Here, $\omega(s,a)=\frac{\rho_{\pi^{t},\mu_{n},P_{n}}(s)}{\rho_{\pi^{t+1}_{n,i-1},\mu_{n},P_{n}}(s)}\frac{\pi_{n,i}^{t+1}(a|s)}{\pi_{n,i-1}^{t+1}(a|s)}$ is the importance sampling ratio, where we estimate $\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})$ by states and actions sampled from $\rho_{\pi^{t+1}_{n,i-1},\mu_{n},P_{n}}$ and $\pi^{t+1}_{n,i-1}$. The expected value of the first term in (24) is an unbiased estimate of the policy advantage $\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})$. However, it is difficult and inefficient to compute $\omega(s,a)$ directly. Therefore, we ignore the difference in the discounted visitation frequencies and use $\omega(s,a)=\frac{\pi_{n,i}^{t+1}(a|s)}{\pi_{n,i-1}^{t+1}(a|s)}$, which yields a slightly biased estimate of the policy advantage but with smaller variance.

In the following discussion, we will refer to the second and third terms of (24) as the global and local penalties, respectively. In particular, the second term penalizes the divergence of one agent’s policy from the global policy, and the third term penalizes the change in the policy outputs between two consecutive iterations. Given that both penalties are based on the KL divergence, we refer to the proposed algorithm as FedKL.
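A possible implementation of the per-iteration objective (24) for a discrete action space is sketched below in PyTorch, with the visitation-frequency ratio dropped as discussed above. The tensor names and the use of Categorical distributions are illustrative assumptions, not the exact code released with this paper.

```python
# Sketch of the FedKL surrogate (24): importance-weighted advantage minus the
# global sqrt-KL penalty (w.r.t. pi^t) and the local KL penalty (w.r.t. the
# previous iterate). Minimizing the returned loss maximizes (24).
import torch
from torch.distributions import Categorical, kl_divergence

def fedkl_loss(new_logits, prev_logits, global_logits, actions, advantages, c1, c2):
    new_d = Categorical(logits=new_logits)
    prev_d = Categorical(logits=prev_logits.detach())
    global_d = Categorical(logits=global_logits.detach())
    # omega(s,a) = pi_{n,i}(a|s) / pi_{n,i-1}(a|s)
    ratio = torch.exp(new_d.log_prob(actions) - prev_d.log_prob(actions))
    surrogate = ratio * advantages.detach()
    global_pen = torch.sqrt(0.5 * kl_divergence(global_d, new_d).clamp_min(1e-12))
    local_pen = kl_divergence(prev_d, new_d)
    return -(surrogate - c1 * global_pen - c2 * local_pen).mean()
```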

IV-C Adaptive Penalty Coefficients

In FRL, the initial state distribution and dynamics function of one agent are not known to others. Thus, the penalty coefficients $c_{1}$ and $c_{2}$ cannot be determined locally. In this paper, we adopt the adaptive method in [26] to determine the coefficients. Specifically, we introduce two hyper-parameters, $d_{local}$ and $d_{global}$, as the target local and global KL divergence. Then, $c_{1}$ and $c_{2}$ will be adjusted in an adaptive manner so that the target $d_{local}$ and $d_{global}$ are achieved in each policy update. After initializing $c_{1}$ and $c_{2}$, we perform a three-phase update in each policy update as follows:

  • Perform the policy update using $E$ epochs of minibatch gradient ascent to optimize $h_{n}(\pi_{n,i}^{t+1};\pi^{t},\pi_{n,i-1}^{t+1})$ in (24).

  • Compute $d=\mathbb{E}_{s\thicksim\rho_{\pi^{t+1}_{n,i-1},\mu_{n},P_{n}}}\left[D_{KL}\left(\pi_{n,i-1}^{t+1}(\cdot|s)\lVert\pi_{n,i}^{t+1}(\cdot|s)\right)\right]$

    • If $d<d_{local}/1.1$, $c_{2}\leftarrow c_{2}/2$

    • If $d>d_{local}\times 1.1$, $c_{2}\leftarrow c_{2}\times 2$

  • Compute $d=\sqrt{\frac{1}{2}D_{KL}(\pi^{t}(\cdot|s)\lVert\pi_{n,i}^{t+1}(\cdot|s))}$

    • If $d<d_{global}/1.1$, $c_{1}\leftarrow c_{1}/2$

    • If $d>d_{global}\times 1.1$, $c_{1}\leftarrow c_{1}\times 2$

Specifically, in each training round, FedKL performs several iterations of SGD updates. After each iteration, FedKL updates its estimates of $c_{1}$ and $c_{2}$ for the next iteration so that the target local and global KL divergences are achieved.
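The adaptation rule can be summarized by the small helper below, called once after each local iteration with the measured local and global KL steps (a sketch of the logic above, not the released code).

```python
# Adaptive penalty coefficients: halve/double c1 and c2 so that the measured
# KL steps track the targets d_local and d_global (Section IV-C).
def adapt_coefficients(d_local_meas, d_global_meas, c1, c2, d_local, d_global):
    if d_local_meas < d_local / 1.1:
        c2 /= 2.0
    elif d_local_meas > d_local * 1.1:
        c2 *= 2.0
    if d_global_meas < d_global / 1.1:
        c1 /= 2.0
    elif d_global_meas > d_global * 1.1:
        c1 *= 2.0
    return c1, c2
```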

Algorithm 2 FedKL using adaptive penalty coefficients.
0:  $K,I,T,E,\pi^{0}$
1:  for round t = 0,1,… do
2:     Select a subset of K agents according to some criteria.
3:     Synchronize the global model to every selected agent.
4:     for agent k = 1,2,…,K do
5:        for iteration i = 1,2,…,I do
6:           Run policy $\pi_{k,i-1}^{t+1}$ in the environment for T timesteps.
7:           Optimize $h_{k}(\pi_{k,i}^{t+1};\pi^{t},\pi_{k,i-1}^{t+1})$ for $E$ epochs to obtain $\pi_{k,i}^{t+1}$.
8:        end for
9:        Upload $\theta_{k}^{t+1}$ and $l_{k}$ to the central server.
10:     end for
11:     The central server aggregates the $\theta$’s as $\theta^{t+1}=\sum_{k=1}^{K}\frac{l_{k}}{L}\theta_{k}^{t+1}$.
12:  end for
Remark V.

A larger target KL divergence ($d_{global}$ and $d_{local}$) will lead to a smaller penalty coefficient on the KL divergence ($c_{1}$ and $c_{2}$). Given the KL divergence between two policies before and after training represents the learning step size, the target KL divergence can be regarded as a target learning step size. Thus, $d_{global}$ and $d_{local}$ can help maintain a constant step size for each training round and iteration, respectively.

The heuristics for tuning $d_{local}$ and $d_{global}$ will be discussed in the experiments section. The training procedure of FedKL is illustrated in Algorithm 2.

V Experiments

Refer to caption
(a)
Refer to caption
(b)
Figure 2: Reacher’s field splitting. Sub-fields in grey are reachable by the robotic arm. Sub-environments are created based on these sub-fields.

In the experiments, we will compare the proposed FedKL with other algorithm-level solutions for tackling the heterogeneity issue, including FedAvg, FedProx and FMARL. In particular, agents of the FedAvg system will perform local training utilizing the algorithm proposed in [26], with the local penalty term. Agents of the FedProx system will deploy the same algorithm as FedAvg and add the proximal term [19]. Agents of the FMARL system will deploy the same algorithm as FedAvg and utilize the gradient decay scheme in [32]. On the other hand, agents of the proposed FedKL system will implement the same algorithm as FedAvg, but with the global KL divergence penalty. All code, experiments and modified environments can be found at: https://github.com/ShiehShieh/FedKL.

V-A Experiment Platforms

To the best of authors’ knowledge, there are few works focusing on the data heterogeneity issue in FRL, making it difficult to find benchmarks for comparison purpose. Therefore, we made several light-weight modifications to popular RL simulators to accommodate the FRL setting. There are two widely used simulation environments for evaluating the performance of RL controllers, namely the MuJoCo simulator [41, 42] and the Flow simulator [43]. In the following, we introduce the two simulators and the settings for FRL experiments.

V-A1 Heterogeneous Robotic Tasks

In Experiment 1, we consider a robotic task in the Reacher-v2 environment provided by OpenAI Gym. As shown in Fig. 2a, the agent actuates a robotic arm to move the end-effector to reach a target, which is denoted by a red sphere and spawns at a random position in a $0.4\times 0.4$ field. The environment consists of an 11-dimensional state space, a 2-dimensional action space and a reward function that determines the reward based on the length of the action vector as well as the distance between the end-effector and the target.

To simulate the heterogeneity due to different initial state distributions, we split the field into $Q$ sub-fields and create $Q$ sub-environments. For each sub-environment, we assume the robotic arm always starts from one specific sub-field among the $Q$ sub-fields and tries to reach the target. Note that this corresponds to an extreme setup for the different initial state distributions. The case with $Q=60$ is illustrated in Fig. 2b. To simulate the heterogeneity due to different environment dynamics, we add noise to each action as in [29]. Specifically, the noise distorts all actions such that the resulting dynamics are different for different devices. Let $m_{noise}$ and $\sigma_{noise}$ denote the mean and variance of the added noise. We assume $m_{noise}=0$ and pick different variances to represent different levels of heterogeneity.
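The dynamics heterogeneity described above can be realized with a simple action wrapper around the Gym environment, as sketched below; the wrapper name and structure are illustrative and may differ from the modified environments in our repository.

```python
# Sketch of dynamics heterogeneity for Experiment 1: each agent's actions are
# perturbed by zero-mean Gaussian noise with an agent-specific variance.
import gym
import numpy as np

class NoisyActionEnv(gym.ActionWrapper):
    def __init__(self, env, noise_std):
        super().__init__(env)
        self.noise_std = noise_std                        # sigma_noise

    def action(self, action):
        noise = np.random.normal(0.0, self.noise_std, size=np.shape(action))
        return np.clip(action + noise, self.action_space.low, self.action_space.high)
```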

V-A2 Traffic Control by Isolated RL Agents

In Experiment 2, we consider the traffic control in the “figure eight” network [43]. As shown in Fig. 3, there are in total 14 vehicles on the road with half of them controlled by RL agents and the rest controlled by the underlying Intelligent Driver Model (IDM) controller provided by the SUMO library. The IDM controller can be regarded as the human-driver model. This experiment is designed for mixed-autonomy, where a centralized controller coordinates a set of connected autonomous vehicles (CAVs) to improve the system-level velocities.

Refer to caption
(a)
Refer to caption
(b)
Figure 3: Vehicle Placement Schema. Red cars are human-driven vehicles. White cars are RL-controlled vehicles.

Some modifications are required to run FRL experiments in the above network. Fortunately, a few modifications were proposed by [32], where autonomous vehicles (AVs) are not connected by a centralized controller but act as isolated agents. Each RL-controlled vehicle is able to observe the position and speed of itself, and those of the two vehicles ahead of and behind it. As a result, each state is represented by a 6-dimensional vector. Since each agent is controlled by its local policy, every action is a scalar specifying the acceleration for the current time-step. We further assume that the average speed of all vehicles is accessible to all AVs.

By default, AVs and human-driven vehicles are positioned alternately, as shown in Fig. 3a. If we utilize ‘h’ and ‘r’ to represent human-driven vehicles and AVs, respectively, the default placement can be written as [h,r,h,r,h,r,h,r,h,r,h,r,h,r]. To simulate the heterogeneity due to different initial state distributions, we fix the starting position of all AVs such that every vehicle starts from the same position in each run. Given that the AVs have access to the information about the two vehicles ahead and behind, we can create different environment dynamics by adjusting the placement pattern. For example, the car placement [h,r,h,h,r,r,h,r,h,h,h,r,r,r], as shown in Fig. 3b, will lead to different environment dynamics.

V-A3 Implementation Details

For both experiments, we use neural networks to represent policies as in [36, 43]. Specifically, fully-connected Multilayer Perceptrons (MLPs) with $\tanh$ non-linearity are utilized for the two experiments, with hidden layers (64, 64) and (100, 50, 25) [43], respectively. Generalized Advantage Estimation (GAE) [34] is used to estimate the advantage function. The value functions of the two experiments are represented by two MLPs with $\tanh$ non-linearity and hidden layers (64, 64) and (256, 256), respectively. The learning rate of an algorithm remains the same within one simulation without decaying, but it is tuned for different simulations. The hyperparameters $d_{local}$, $d_{global}$ and $\mu$ (a hyperparameter defined by FedProx to control the weight of the proximal term) are carefully tuned so that they are near-optimal. In each training round, we set $K=3$, $I=20$ and $T=2048$ for Experiment 1, and $K=7$, $I=50$ and $T=1500$ for Experiment 2. For both experiments, we set the discount factor $\gamma$, the batch size, and the GAE discount factor $\lambda$ in [34] to 0.99, 64 and 0.95, respectively.
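For completeness, a minimal sketch of the GAE computation with the settings above ($\gamma=0.99$, $\lambda=0.95$) is given below; it assumes a single trajectory ending in a terminal state and is not the exact implementation in our code.

```python
# Generalized Advantage Estimation [34]: A_t = sum_l (gamma*lam)^l * delta_{t+l},
# with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    T = len(rewards)
    adv, last = np.zeros(T), 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0   # terminal state assumed
        delta = rewards[t] + gamma * next_value - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```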

V-B Level of Heterogeneity and Its Impact

Refer to caption
Figure 4: Visualization of $\left\|\mathbf{D}_{\pi,\mu_{n},P_{n}}\mathbf{A}_{\pi,P_{n}}\right\|_{F}-\left\|\mathbf{D}_{\pi,\mu_{n},P_{n}}\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}$ on modified Reacher-v2 environments. Results are averaged across three runs.
Refer to caption
Figure 5: The effect of negative $G$ on modified Reacher-v2 environments with different environment dynamics (action noise $\thicksim\mathcal{N}(0,0.4)$). Red lines indicate the rounds where $G<0$ happens.

In this section, we will investigate the type and level of heterogeneity and how they affect the performance of FRL. For ease of illustration, we will use Experiment 1 to analyze the heterogeneity type and level. Detailed performance comparison between different algorithms and in-depth discussion regarding the key parameters will be given in the next section based on results from Experiment 2.

Refer to caption
Figure 6: Comparison of FedAvg, FedProx, FMARL and FedKL on modified Reacher-v2 environments with different initial state distributions ($Q=60$). For all algorithms, the learning rate is 0.001. For FedProx, $\mu=0.02$. For FMARL, $\lambda=0.9999$. For FedKL, $d_{global}=0.6$. Results are averaged across three runs.
Refer to caption
Figure 7: Comparison of FedAvg, FedProx, FMARL and FedKL on modified Reacher-v2 environments with different environment dynamics (action noise $\thicksim\mathcal{N}(0,0.4)$). For all algorithms, the learning rate is 0.001. For FedProx, $\mu=0.02$. For FMARL, $\lambda=0.9999$. For FedKL, $d_{global}=0.6$. Results are averaged across three runs.

To illustrate the impact of data heterogeneity, we show in Fig. 4 the changes of $G=\left\|\mathbf{D}_{\pi,\mu_{n},P_{n}}\mathbf{A}_{\pi,P_{n}}\right\|_{F}-\left\|\mathbf{D}_{\pi,\mu_{n},P_{n}}\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}$. Each term in (20) is rescaled by $\mathbf{D}_{\pi,\mu_{n},P_{n}}$ to overcome the numerical instability introduced by the inverse operation $\mathbf{D}_{\pi,\mu_{n},P_{n}}^{-1}$. Note that, according to Corollary I, the local training is useful only when $\left\|\mathbf{A}_{\pi,P_{n}}\right\|_{F}-\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}>0$. Thus, we may regard $G$ as the potential of one agent to contribute to the global policy. In particular, if we treat $\left\|\mathbf{D}_{\pi,\mu_{n},P_{n}}\mathbf{A}_{\pi,P_{n}}\right\|_{F}$ as the local training potential, then $G$ is the local potential penalized by the heterogeneity level, i.e., even if there is potential improvement from local training, it may not be helpful for the global policy when the heterogeneity level is high. We consider four different scenarios, including the IID case, the case with different initial state distributions ($Q=60$), and two cases with different levels of environment dynamics, where the heterogeneity level is controlled by the variance of the action noise. It can be observed that $G$ decreases with training because training reduces the agents’ potential to further contribute to the global policy. On the other hand, comparing the two cases with different environment dynamics, we notice that a higher level of heterogeneity leads to a smaller $G$, indicating that heterogeneity will limit the potential contribution of one agent. Note that the comparison between different types of heterogeneity may not be meaningful, as their local advantage functions may have different ranges. Fluctuations of the curves are due to the limited number of simulations and matrix estimation error.

To further illustrate the impact of the heterogeneity level, we show in Fig. 5 the training performance with a high level of action noise. The red lines indicate the training rounds where $G<0$ happens in selected agents. It can be observed that, in those rounds, the global training slows down, which agrees with Corollary I, i.e., $G<0$ indicates that the associated local training is not helpful for the global policy. In Figs. 6 and 7, we show the performance comparison of FedAvg, FedProx, FMARL and FedKL with different initial state distributions and different transition probabilities, respectively. It can be observed that, with the same learning step size, FedKL converges to a stable performance, but both FedAvg and FedProx may diverge or fluctuate. We can certainly reduce the step size for FedAvg and FedProx, but this will lead to a slow learning speed. See the next section for more related discussions.

V-C Advantage of the KL-based Global Penalty

Refer to caption
Figure 8: Learning curves of FedAvg, FedProx, FMARL and FedKL for Experiment 2. The best performance is shown for each algorithm. For all algorithms, the learning rate is 0.01. For FedAvg, $d_{local}=0.0002$. For FedProx, $d_{local}=0.0002$, $\mu=0.001$. For FedKL, $d_{local}=0.0003$, $d_{global}=0.15$. Results are averaged across three runs.
Figure 9: The settings are the same as in Fig. 8, but all algorithms use the same local penalty, i.e., $d_{local}=0.0003$.

In the following, we show the results of Experiment 2, where both types of heterogeneity are present. The best performance of each algorithm is shown in Fig. 8, where the total reward averaged over three runs is reported. Each run uses the same seed for the SUMO simulator but a different seed for the parameter initialization of the neural network. It can be observed that FedKL converges much faster than FedAvg, FedProx and FMARL. The faster convergence of FedKL indicates a larger step size. A natural question is then whether we can increase the step size of FedAvg, FedProx and FMARL to speed up their training. In Fig. 9, we show the results where $d_{local}$ for FedAvg, FedProx and FMARL is increased. As a result, all of them converge at a similar speed to FedKL, but they diverge after reaching the maximum reward. The above results provide empirical evidence that directly constraining the model output by the KL divergence is more effective in handling data heterogeneity than penalizing model divergence in the parameter space. Furthermore, applying a proper global KL penalty can prevent divergence without sacrificing learning speed.

Why is the KL Penalty Better? It can be observed from Fig. 9 (after Round 120) that the proximal term of FedProx can slow down the divergence but fails to avoid it. This is because the proximal term $\left\|\theta_{n}^{t+1}-\theta^{t}\right\|$ constrains the learning in the model parameter space via the $\ell_{2}$-norm. Although a change in the model parameter space will lead to changes in the output distribution space (actions) and eventually influence the training objective, it is more effective to directly constrain the training in the distribution space [44, 45], which justifies the performance advantage of the proposed KL-divergence based penalty.
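To make the contrast concrete, the following is a minimal PyTorch-style sketch (illustrative only; the exact loss composition, KL direction and coefficients used by FedProx and FedKL are simplified, and the tensor shapes are assumed): the proximal term acts on parameters, while the KL term acts directly on the policy outputs at sampled states.

import torch

def proximal_term(local_params, global_params, mu=0.001):
    # FedProx-style penalty in the parameter space: (mu/2) * ||theta_n - theta||^2
    return 0.5 * mu * sum((p - g).pow(2).sum()
                          for p, g in zip(local_params, global_params))

def kl_term(global_logits, local_logits):
    # Distribution-space penalty: KL between the global and local policies,
    # averaged over a batch of sampled states (logits over discrete actions).
    log_p = torch.log_softmax(global_logits, dim=-1)   # global policy pi^t
    log_q = torch.log_softmax(local_logits, dim=-1)    # local policy pi_n^{t+1}
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()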

Figure 10: The effect of $d_{global}$ and $d_{local}$. (a) Without the global penalty, agents must use a small $d_{local}$ at the cost of slow learning. (b) With the global penalty, agents can explore a wider region (faster learning) without worrying about divergence.

Why Does the Global Penalty Help? We now investigate how $d_{global}$ helps speed up convergence and stabilize training. Assume there are $I$ iterations of local training in each round and let $d_{i}$ denote the policy update in the $i$-th iteration. We consider two cases, without and with the global penalty (target step size), respectively, as illustrated in Fig. 10. We use the red and green arrows to represent the optimal optimization directions of the global and local objectives, respectively. Note that they are not in the same direction due to data heterogeneity. For the first case, we only apply the local penalty, which constrains the step size of each iteration with $|d_{i}|\leq d_{local}$, i.e., the radius of the green circle in part (a) of Fig. 10. Let $d_{max}$ denote the largest step size tolerable in one training round without causing training divergence. Thus, we need $d_{local}\leq d_{max}/I$, which is determined by the worst-case scenario where all updates are in the same direction. However, the worst case happens with very low probability. As a result, $d_{local}$ is pessimistically small for most training rounds. The performance of FedAvg in Fig. 8 corresponds to the case where $d_{local}$ is small enough to avoid divergence but causes slow convergence. The performance of FedAvg in Fig. 9 corresponds to the case where $d_{local}$ is large enough for fast training but causes divergence.

Now consider the second case with $d_{global}$, denoted by the radius of the red circle in part (b) of Fig. 10. Under such circumstances, we only need to guarantee $\left|\sum_{i}^{I}d_{i}\right|\leq d_{global}\leq d_{max}$ and $\left|d_{i}\right|\leq d_{local}$. As a result, $d_{local}$ can be much larger than $d_{max}/I$, leading to a much larger per-iteration step size, represented by the larger radius of the green circle in part (b) of Fig. 10. Thus, adding $d_{global}$ enables a larger $d_{local}$, and their joint effect achieves a better trade-off between training speed (step size) and convergence.
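As a rough numerical illustration of this argument (a sketch with arbitrary values of $I$ and $d_{local}$, assuming random per-iteration update directions), the aggregate step of a round is typically far below the worst-case bound $I\cdot d_{local}$, which is why the local-penalty-only choice $d_{local}\leq d_{max}/I$ is pessimistic.

import numpy as np

rng = np.random.default_rng(0)
I, d_local = 10, 0.1                               # local iterations, per-iteration cap
d = rng.normal(size=(100_000, I, 2))               # random update directions (2-D toy model)
d = d / np.linalg.norm(d, axis=-1, keepdims=True) * d_local
total = np.linalg.norm(d.sum(axis=1), axis=-1)     # |sum_i d_i| for each simulated round
print("worst case (all aligned):", I * d_local)    # 1.0
print("typical aggregate step  :", total.mean())   # about 0.28, far below the worst case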

Heuristics for Tuning the Target Step Size: The global KL divergence is a good indicator of model divergence, and we usually observe a large global KL divergence followed by a drastic performance degradation. This observation suggests a heuristic way of tuning the global step-size target $d_{global}$. For example, one can first run the algorithm without activating the global KL penalty (set $c_{1}$ to 0), find the optimal value of $d_{local}$, and record the average KL divergence between $\pi^{t}$ and $\pi^{t+1}_{n}$ over a certain number of rounds. We observed that using a $d_{global}$ slightly larger than this average, together with a $d_{local}$ larger than the optimal value found in the previous runs, can significantly speed up learning in the early stage of training.
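A compact sketch of this heuristic is given below; run_frl, its arguments, the statistics key and the 1.2/1.5 factors are placeholders for illustration, not part of FedKL itself.

def tune_step_size_targets(run_frl, d_local_opt=3e-4, warmup_rounds=50):
    # Step 1: run without the global KL penalty (c1 = 0) at the tuned d_local and
    # record the per-round KL divergence between the global and local policies.
    stats = run_frl(c1=0.0, d_local=d_local_opt, rounds=warmup_rounds)
    avg_kl = sum(stats["global_kl_per_round"]) / warmup_rounds
    # Step 2: set d_global slightly above the observed average and allow a
    # larger d_local than the warm-up optimum to speed up early training.
    return {"d_global": 1.2 * avg_kl, "d_local": 1.5 * d_local_opt}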

V-D Communication-Efficient FRL

There are many benefits to solving the data heterogeneity issue, including faster convergence, stable performance, etc. Faster convergence also leads to fewer rounds of communication and mitigates the communication bottleneck of FRL. For example, to achieve a total reward of 350 in Experiment 2, both FedAvg and FedProx need around 110 rounds of communication, while FedKL needs only 50 rounds, a saving of more than $50\%$ of the communication workload.
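For reference, the saving quoted above corresponds to
\[
\frac{110-50}{110}\approx 54.5\%
\]
of the communication rounds.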

Local agents in FedKL use the Actor-Critic architecture [27, 15], where the actor refers to the policy $\pi$ and the critic refers to the value function or the advantage function. One possible way to reduce the communication workload is to keep the critic local, since it is not needed for inference. We will investigate this idea in future work.

Another issue for large-scale FRL applications like autonomous driving is the model aggregation over many agents moving in a large area. A possible solution is the hierarchical FL scheme [46], where aggregations happen in two layers. First, local aggregations among agents within a small region are performed on a local server. Then, the aggregation results from multiple local servers are sent to a global server for global aggregation. The local and global aggregations can happen with different frequencies, which may significantly reduce the communication workload. Unfortunately, the convergence analysis for the hierarchical scheme with data heterogeneity is not available, and it will be interesting to investigate the performance of FedKL under such a hierarchical framework.

VI Conclusion

In this paper, we investigated the data heterogeneity issue in FRL systems. For that purpose, we first classified the types of data heterogeneity in FRL and defined the heterogeneity level. It was proved that, although optimizing the local policy may improve the global policy, the deviation of the local policy from the global one must be properly constrained. A necessary condition for the local policy improvement to be beneficial for the global objective was also derived with respect to the proposed heterogeneity level. Based on the theoretical result, a KL-divergence based global penalty term was proposed, which was shown to enable a larger local step size without causing model divergence. A convergence proof of the proposed algorithm was also provided. Experiment results demonstrated the advantage of the proposed scheme in speeding up and stabilizing the training. One conclusion we can draw from this work is that heterogeneity not only reduces the contribution of one agent's local training to the global policy, but also increases the risk of divergence. It is thus important to balance learning speed and convergence with a proper learning step size. Due to the lack of global information at the agents, a feasible solution is to maximize the learning speed while avoiding divergence by jointly tuning the local and global penalties. In the future, a major challenge for implementing large-scale FRL will be the model aggregation among many moving agents, and it is critical to develop distributed aggregation schemes that can handle heterogeneous data.

Appendix A Proof of Theorem I

Proof.

Following [47], we define the “absolute” value of a matrix 𝐗\mathbf{X} as

\left|\mathbf{X}\right|=(\mathbf{X}^{\dagger}\mathbf{X})^{\frac{1}{2}},\qquad\left|\mathbf{X}^{\dagger}\right|=(\mathbf{X}\mathbf{X}^{\dagger})^{\frac{1}{2}}, (25)

where $\mathbf{X}^{\dagger}$ denotes the conjugate transpose of $\mathbf{X}$. In other words, $\left|\mathbf{X}\right|$ can be regarded as the square root of $\mathbf{X}^{\dagger}\mathbf{X}$. For a real matrix $\mathbf{X}$, $\mathbf{X}^{T}\mathbf{X}$ is symmetric and positive semi-definite, so there is a unique real positive semi-definite square root $\left|\mathbf{X}\right|$ of $\mathbf{X}^{T}\mathbf{X}$. For a given scalar $x$, the above operator degenerates to the absolute value operator $\left|x\right|=(xx)^{\frac{1}{2}}$.
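As a concrete illustration (not part of the original proof), consider the real matrix
\[
\mathbf{X}=\begin{pmatrix}0 & -2\\ 1 & 0\end{pmatrix},\qquad
\mathbf{X}^{T}\mathbf{X}=\begin{pmatrix}1 & 0\\ 0 & 4\end{pmatrix},\qquad
\left|\mathbf{X}\right|=\begin{pmatrix}1 & 0\\ 0 & 2\end{pmatrix},
\]
so that $\left|\mathbf{X}\right|$ is the unique real positive semi-definite square root of $\mathbf{X}^{T}\mathbf{X}$.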

To understand the impact of the local optimization on the global objective, we define the gap between the global and local policy advantages of taking the updated local policy $\pi_{n}^{t+1}$ of the $n$-th agent as

G(πnt+1)=|k=1Nqk𝔸πt,μk,Pk(πnt+1)𝔸πt,μn,Pn(πnt+1)|.\displaystyle G(\pi_{n}^{t+1})=\left|\sum_{k=1}^{N}q_{k}\mathbb{A}_{\pi^{t},\mu_{k},P_{k}}(\pi_{n}^{t+1})-\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})\right|. (26)

Utilizing the vector notation from (16), we can calculate the gap as

G(πnt+1)\displaystyle G(\pi_{n}^{t+1})
=\displaystyle= |Tr(k=1Nqk𝐃πt,μn,Pn𝐃πt,μn,Pn1𝐃πt,μk,Pk𝐀πt,Pk𝚷πnt+1T\displaystyle\left|\mbox{Tr}\left(\sum_{k=1}^{N}q_{k}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{D}^{-1}_{\pi^{t},\mu_{n},P_{n}}\mathbf{D}_{\pi^{t},\mu_{k},P_{k}}\mathbf{A}_{\pi^{t},P_{k}}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\right.\right.
𝐃πt,μn,Pn𝐀πt,Pn𝚷πnt+1T)|\displaystyle\left.\left.-\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{A}_{\pi^{t},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\right)\right| (27)
=\displaystyle= |Tr(k=1Nqk𝚷πnt+1T𝐃πt,μn,Pn𝐃πt,μn,Pn1𝐃πt,μk,Pk𝐀πt,Pk\displaystyle\left|\mbox{Tr}\left(\sum_{k=1}^{N}q_{k}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{D}^{-1}_{\pi^{t},\mu_{n},P_{n}}\mathbf{D}_{\pi^{t},\mu_{k},P_{k}}\mathbf{A}_{\pi^{t},P_{k}}\right.\right.
𝚷πnt+1T𝐃πt,μn,Pn𝐀πt,Pn)|\displaystyle\left.\left.-\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{A}_{\pi^{t},P_{n}}\right)\right| (28)
=\displaystyle= |Tr(𝐁πt,μn,Pn𝚷πnt+1T𝐃πt,μn,Pn)|,\displaystyle\left|\mbox{Tr}\left(\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right)\right|, (29)

where the second equation follows from (17).

The Cauchy-Schwarz inequality [47] states that, for two complex matrices 𝐗\mathbf{X} and 𝐘\mathbf{Y},

|Tr(𝐗𝐘)|Tr(|𝐗||𝐘|)Tr(|𝐗||𝐘|).\displaystyle\left|\mbox{Tr}\left(\mathbf{X}^{\dagger}\mathbf{Y}\right)\right|\leq\sqrt{\mbox{Tr}\left(\left|\mathbf{X}\right|\cdot\left|\mathbf{Y}\right|\right)\mbox{Tr}\left(\left|\mathbf{X}^{\dagger}\right|\cdot\left|\mathbf{Y}^{\dagger}\right|\right)}. (30)

Accordingly, we can obtain

|Tr(𝐁πt,μn,Pn𝚷πnt+1T𝐃πt,μn,Pn)|\displaystyle\left|\mbox{Tr}\left(\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right)\right|
\displaystyle\leq Tr(|𝐁πt,μn,PnT||𝚷πnt+1T𝐃πt,μn,Pn|)\displaystyle\sqrt{\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}^{T}\right|\left|\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right|\right)}
Tr(|𝐁πt,μn,Pn||𝐃πt,μn,Pn𝚷πnt+1|).\displaystyle\cdot\sqrt{\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right|\left|\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\right|\right)}. (31)

According to the definition in (25), we can always find real positive semi-definite matrices |𝐁πt,μn,PnT|\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}^{T}\right|, |𝐁πt,μn,Pn|\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right|, |𝚷πnt+1T|\left|\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\right| and |𝚷πnt+1|\left|\mathbf{\Pi}_{\pi_{n}^{t+1}}\right|. By the Cauchy-Schwarz inequality, we can further obtain

Tr(|𝐁πt,μn,PnT||𝚷πnt+1T𝐃πt,μn,Pn|)\displaystyle\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}^{T}\right|\left|\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right|\right)
\displaystyle\leq Tr(|𝐁πt,μn,PnT|2)Tr(|𝚷πnt+1T𝐃πt,μn,Pn|2),\displaystyle\sqrt{\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}^{T}\right|^{2}\right)\mbox{Tr}\left(\left|\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right|^{2}\right)}, (32)

and

Tr(|𝐁πt,μn,Pn||𝐃πt,μn,Pn𝚷πnt+1|)\displaystyle\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right|\left|\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\right|\right)
\displaystyle\leq Tr(|𝐁πt,μn,Pn|2)Tr(|𝐃πt,μn,Pn𝚷πnt+1|2).\displaystyle\sqrt{\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right|^{2}\right)\mbox{Tr}\left(\left|\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\right|^{2}\right)}. (33)

Substituting (32) and (33) into (31) gives

G(πnt+1)\displaystyle G(\pi_{n}^{t+1}) (Tr(|𝐁πt,μn,PnT|2)Tr(|𝚷πnt+1T𝐃πt,μn,Pn|2)\displaystyle\leq\left(\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}^{T}\right|^{2}\right)\mbox{Tr}\left(\left|\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right|^{2}\right)\right.
Tr(|𝐁πt,μn,Pn|2)Tr(|𝐃πt,μn,Pn𝚷πnt+1|2))14.\displaystyle\quad\left.\cdot\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right|^{2}\right)\mbox{Tr}\left(\left|\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\right|^{2}\right)\right)^{\frac{1}{4}}. (34)

Given

\mbox{Tr}\left(\left|\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right|^{2}\right)\mbox{Tr}\left(\left|\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\right|^{2}\right)
= \mbox{Tr}\left(\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right)
\cdot\mbox{Tr}\left(\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\right) (35)
= \mbox{Tr}\left(\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right)^{2}, (36)

and similarly

Tr(|𝐁πt,μn,PnT|2)Tr(|𝐁πt,μn,Pn|2)\displaystyle\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}^{T}\right|^{2}\right)\mbox{Tr}\left(\left|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right|^{2}\right)
=\displaystyle= Tr(𝐁πt,μn,Pn𝐁πt,μn,PnT)2,\displaystyle\mbox{Tr}\left(\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}^{T}\right)^{2}, (37)

we have

G(πnt+1)\displaystyle G(\pi_{n}^{t+1}) Tr(𝐁πt,μn,Pn𝐁πt,μn,PnT)\displaystyle\leq\sqrt{\mbox{Tr}\left(\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}^{T}\right)}
Tr(𝐃πt,μn,Pn𝚷πnt+1𝚷πnt+1T𝐃πt,μn,Pn)\displaystyle\quad\cdot\sqrt{\mbox{Tr}\left(\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\right)} (38)
=𝐁πt,μn,PnFs,a[ρπt,μn,Pn(s)πnt+1(a|s)]2.\displaystyle=\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F}\sqrt{\sum_{s,a}[\rho_{\pi^{t},\mu_{n},P_{n}}(s)\pi_{n}^{t+1}(a|s)]^{2}}. (39)

Thus, using the notation $\left(\pi^{\prime}-\pi\right)(s,a)=\pi^{\prime}(a|s)-\pi(a|s),\forall s\in\mathcal{S},a\in\mathcal{A}$, we can further obtain

G(πnt+1πnt)\displaystyle G(\pi_{n}^{t+1}-\pi_{n}^{t}) (40)
\displaystyle\leq 𝐁πt,μn,PnFs,a[ρπt,μn,Pn(s)(πnt+1(a|s)πt(a|s))]2.\displaystyle\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F}\sqrt{\sum_{s,a}\left[\rho_{\pi^{t},\mu_{n},P_{n}}(s)\left(\pi_{n}^{t+1}(a|s)-\pi^{t}(a|s)\right)\right]^{2}}.

Given aπt(a|s)Aπt,μ(s,a)=0,s𝒮\sum_{a}\pi^{t}(a|s)A_{\pi^{t},\mu}(s,a)=0,\forall{s\in\mathcal{S}}, we have

𝔸πt,μn,Pn(πnt+1)\displaystyle\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})
=\displaystyle= sρπt,μn,Pn(s)aπnt+1(a|s)Aπt,Pn(s,a)\displaystyle\sum_{s}\rho_{\pi^{t},\mu_{n},P_{n}}(s)\sum_{a}\pi_{n}^{t+1}(a|s)A_{\pi^{t},P_{n}}(s,a)
sρπt,μn,Pn(s)aπt(a|s)Aπt,Pn(s,a)\displaystyle\quad-\sum_{s}\rho_{\pi^{t},\mu_{n},P_{n}}(s)\sum_{a}\pi^{t}(a|s)A_{\pi^{t},P_{n}}(s,a) (41)
=\displaystyle= sρπt,μn,Pn(s)a[πnt+1(a|s)πt(a|s)]Aπt,Pn(s,a)\displaystyle\sum_{s}\rho_{\pi^{t},\mu_{n},P_{n}}(s)\sum_{a}\left[\pi_{n}^{t+1}(a|s)-\pi^{t}(a|s)\right]A_{\pi^{t},P_{n}}(s,a) (42)
=\displaystyle= 𝔸πt,Pn((πnt+1πt)).\displaystyle\mathbb{A}_{\pi^{t},P_{n}}\left(\left(\pi_{n}^{t+1}-\pi^{t}\right)\right). (43)

It thus follows from (26) that

G(πnt+1)=G(πnt+1πnt).\displaystyle G(\pi_{n}^{t+1})=G(\pi_{n}^{t+1}-\pi_{n}^{t}). (44)

By combining (44) and (40), we can obtain

G(πnt+1)\displaystyle G(\pi_{n}^{t+1}) 𝐁πt,μn,PnF\displaystyle\leq\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F} (45)
s,a[ρπt,μn,Pn(s)(πnt+1(a|s)πt(a|s))]2.\displaystyle\quad\cdot\sqrt{\sum_{s,a}\left[\rho_{\pi^{t},\mu_{n},P_{n}}(s)\left(\pi_{n}^{t+1}(a|s)-\pi^{t}(a|s)\right)\right]^{2}}.

Based on (26), we can obtain

k=1Nqk𝔸πt,μk,Pk(πnt+1)\displaystyle\sum_{k=1}^{N}q_{k}\mathbb{A}_{\pi^{t},\mu_{k},P_{k}}(\pi_{n}^{t+1})
\displaystyle\geq 𝔸πt,μn,Pn(πnt+1)𝐁πt,μn,PnF\displaystyle\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})-\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F}
s,a[ρπt,μn,Pn(s)(πnt+1(a|s)πt(a|s))]2.\displaystyle\cdot\sqrt{\sum_{s,a}\left[\rho_{\pi^{t},\mu_{n},P_{n}}(s)\left(\pi_{n}^{t+1}(a|s)-\pi^{t}(a|s)\right)\right]^{2}}. (46)

Given $\sqrt{\left|a\right|^{2}+\left|b\right|^{2}}\leq\sqrt{\left(\left|a\right|+\left|b\right|\right)^{2}}=\left|a\right|+\left|b\right|$, and hence, by induction, $\sqrt{\sum_{i}x_{i}^{2}}\leq\sum_{i}\left|x_{i}\right|$, we can obtain

k=1Nqk𝔸πt,μk,Pk(πnt+1)\displaystyle\sum_{k=1}^{N}q_{k}\mathbb{A}_{\pi^{t},\mu_{k},P_{k}}(\pi_{n}^{t+1})
\displaystyle\geq 𝔸πt,μn,Pn(πnt+1)𝐁πt,μn,PnF\displaystyle\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})-\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F}
s,aρπt,μn,Pn(s)|πnt+1(a|s)πt(a|s)|.\displaystyle\cdot\sum_{s,a}\rho_{\pi^{t},\mu_{n},P_{n}}(s)\left|\pi_{n}^{t+1}(a|s)-\pi^{t}(a|s)\right|. (47)

Denote the TV distance between $\pi(\cdot|s)$ and $\pi^{\prime}(\cdot|s)$ as $D_{TV}(\pi(\cdot|s)\|\pi^{\prime}(\cdot|s))=\frac{1}{2}\sum_{a}\left|\pi(a|s)-\pi^{\prime}(a|s)\right|$. We can finally obtain

\sum_{k=1}^{N}q_{k}\mathbb{A}_{\pi^{t},\mu_{k},P_{k}}(\pi_{n}^{t+1}) (48)
\geq \mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})-\alpha\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right],

where $\alpha=2\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F}$. ∎

Appendix B Proof of Corollary I

Proof.

It follows from (16) that 𝔸πt,μn,Pn(πnt+1)=Tr(𝐃πt,μn,Pn𝐀πt,Pn𝚷πnt+1T).\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})=\mbox{Tr}\left(\mathbf{D}_{\pi^{t},\mu_{n},P_{n}}\mathbf{A}_{\pi^{t},P_{n}}\mathbf{\Pi}_{\pi_{n}^{t+1}}^{T}\right). By following the same procedure from (29) to (40), we can prove

𝔸πt,μn,Pn(πnt+1)\displaystyle\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})
\displaystyle\leq 𝐀πt,PnF\displaystyle\left\|\mathbf{A}_{\pi^{t},P_{n}}\right\|_{F}
s,a[ρπt,μn,Pn(s)(πnt+1(a|s)πt(a|s))]2.\displaystyle\cdot\sqrt{\sum_{s,a}\left[\rho_{\pi^{t},\mu_{n},P_{n}}(s)\left(\pi_{n}^{t+1}(a|s)-\pi^{t}(a|s)\right)\right]^{2}}. (49)

Thus, if 𝐀πt,PnF𝐁πt,μn,PnF\left\|\mathbf{A}_{\pi^{t},P_{n}}\right\|_{F}\leq\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F}, then

𝔸πt,μn,Pn(πnt+1)\displaystyle\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})
\displaystyle\leq 𝐁πt,μn,PnF\displaystyle\left\|\mathbf{B}_{\pi^{t},\mu_{n},P_{n}}\right\|_{F}
s,a[ρπt,μn,Pn(s)(πnt+1(a|s)πt(a|s))]2.\displaystyle\cdot\sqrt{\sum_{s,a}\left[\rho_{\pi^{t},\mu_{n},P_{n}}(s)\left(\pi_{n}^{t+1}(a|s)-\pi^{t}(a|s)\right)\right]^{2}}. (50)

As a result, the lower bound on the global policy advantage given by Theorem I will be less than or equal to zero, where the equality holds when $\pi_{n}^{t+1}=\pi^{t}$. This indicates that there is no benefit from further updating $\pi^{t}$ at the $n$-th agent, and there will be no improvement for the global objective. Thus, (20) is a necessary condition for the local update to be able to improve the global objective. ∎

Appendix C Proof of Corollary II

Proof.

It follows from (3) that the global discounted reward of the updated local policy $\pi_{n}^{t+1}$ can be written as

η(πnt+1)=k=1Nqkηk(πnt+1).\displaystyle\text{$\eta$}(\pi_{n}^{t+1})=\sum_{k=1}^{N}q_{k}\eta_{k}(\pi_{n}^{t+1}). (51)

By substituting (13) into (51), we can obtain

η(πnt+1)\displaystyle\text{$\eta$}(\pi_{n}^{t+1}) k=1Nqk(Lπt,μk,Pk(πnt+1)\displaystyle\geq\sum_{k=1}^{N}q_{k}\left(L_{\pi^{t},\mu_{k},P_{k}}(\pi_{n}^{t+1})\right. (52)
ckCPO𝔼sρπt,μk,Pk[DTV(πt(|s)πnt+1(|s))]).\displaystyle\quad\left.-\text{$c_{k}^{\text{CPO}}\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{k},P_{k}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]$}\right).

By substituting (9) into (52) and upper bounding $c_{k}^{\text{CPO}}$ by $c^{\prime}_{k}$, we have

η(πnt+1)\displaystyle\text{$\eta$}(\pi_{n}^{t+1}) k=1Nqk(ηk(πt)+𝔸πt,Pk(πnt+1)\displaystyle\geq\sum_{k=1}^{N}q_{k}\left(\eta_{k}(\pi^{t})+\mathbb{A}_{\pi^{t},P_{k}}(\pi_{n}^{t+1})\right. (53)
ck𝔼sρπt,μk,Pk[DTV(πt(|s)πnt+1(|s))]),\displaystyle\quad\left.-\text{$c^{\prime}_{k}\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{k},P_{k}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]$}\right),

where ck=4ϵkγ(1γ)2c^{\prime}_{k}=\frac{4\epsilon_{k}\gamma}{(1-\gamma)^{2}}, and ϵk=maxs,a|Aπt,Pk(s,a)|\epsilon_{k}=\max_{s,a}|A_{\pi^{t},P_{k}}(s,a)| denotes the maximum absolute value of the advantage function over all state and action pairs.

To remove the dependency on ρπt,μk,Pk\rho_{\pi^{t},\mu_{k},P_{k}}, we have

𝔼sρπt,μk,Pk[DTV(πt(|s)πnt+1(|s))]\displaystyle\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{k},P_{k}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]
𝔼sρπt,μn,Pn[DTV(πt(|s)πnt+1(|s))]\displaystyle-\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]
=\displaystyle= sρπt,μk,Pk(s)DTV(πt(|s)πnt+1(|s))\displaystyle\sum_{s}\rho_{\pi^{t},\mu_{k},P_{k}}(s)D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))
sρπt,μn,Pn(s)DTV(πt(|s)πnt+1(|s))\displaystyle-\sum_{s}\rho_{\pi^{t},\mu_{n},P_{n}}(s)D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s)) (54)
\displaystyle\leq DTVmax(πt,πnt+1)s|ρπt,μk,Pk(s)ρπt,μn,Pn(s)|,\displaystyle D_{TV}^{\max}(\pi^{t},\pi_{n}^{t+1})\sum_{s}\left|\rho_{\pi^{t},\mu_{k},P_{k}}(s)-\rho_{\pi^{t},\mu_{n},P_{n}}(s)\right|, (55)

where DTVmax(π,π)=maxsDTV(π(|s)π(|s))D_{TV}^{max}(\pi,\pi^{\prime})=\max_{s}D_{TV}\left(\pi(\cdot|s)\|\pi^{\prime}(\cdot|s)\right) denotes the maximum TV divergence between two policies π\pi and π\pi^{\prime} among all states. It thus follows from (C) that

𝔼sρπt,μk,Pk[DTV(πt(|s)πnt+1(|s))]\displaystyle\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{k},P_{k}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]
\displaystyle\leq 𝔼sρπt,μn,Pn[DTV(πt(|s)πnt+1(|s))]\displaystyle\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]
+2DTVmax(πt,πnt+1)DTV(ρπt,μk,Pkρπt,μn,Pn).\displaystyle+2D_{TV}^{\max}(\pi^{t},\pi_{n}^{t+1})D_{TV}(\rho_{\pi^{t},\mu_{k},P_{k}}\|\rho_{\pi^{t},\mu_{n},P_{n}}). (56)

Substituting (19) and (C) into (53) gives:

η(πnt+1)\displaystyle\eta(\pi_{n}^{t+1}) k=1Nqkηk(πt)+𝔸πt,μn,Pn(πnt+1)\displaystyle\geq\sum_{k=1}^{N}q_{k}\eta_{k}(\pi^{t})+\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{n}^{t+1})
(α+β)𝔼sρπt,μn,Pn[DTV(πt(|s)πnt+1(|s))]\displaystyle\quad-(\alpha+\beta)\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{n}^{t+1}(\cdot|s))\right]
δDTVmax(πt,πnt+1),\displaystyle\quad-\delta D_{TV}^{\max}(\pi^{t},\pi_{n}^{t+1}), (57)

where δ=2k=1Nqk4ϵkγ(1γ)2DTV(ρπt,μk,Pkρπt,μn,Pn)\delta=2\sum_{k=1}^{N}q_{k}\frac{4\epsilon_{k}\gamma}{(1-\gamma)^{2}}D_{TV}(\rho_{\pi^{t},\mu_{k},P_{k}}\|\rho_{\pi^{t},\mu_{n},P_{n}}), β=k=1Nqk4ϵkγ(1γ)2\beta=\sum_{k=1}^{N}q_{k}\frac{4\epsilon_{k}\gamma}{(1-\gamma)^{2}} and ϵk=maxs,a|Aπt,Pk(s,a)|\epsilon_{k}=\max_{s,a}\left|A_{\pi^{t},P_{k}}(s,a)\right|. ∎

Appendix D Effectiveness of Improving the Local Objective

Let g(πnt+1)g(\pi_{n}^{t+1}) denote the RHS of (II) as a function of πnt+1\pi_{n}^{t+1}. Thus, we have the following relation

η(πnt+1)\displaystyle\eta(\pi_{n}^{t+1}) g(πnt+1),\displaystyle\geq g(\pi_{n}^{t+1}), (58)
η(πt)\displaystyle\eta(\pi^{t}) =g(πt),\displaystyle=g(\pi^{t}), (59)
η(πnt+1)η(πt)\displaystyle\eta(\pi_{n}^{t+1})-\eta(\pi^{t}) g(πnt+1)g(πt).\displaystyle\geq g(\pi_{n}^{t+1})-g(\pi^{t}). (60)

The first inequality comes from (II). The equality in (59) can be checked by replacing $\pi_{n}^{t+1}$ with $\pi^{t}$ in the RHS of (II). Finally, subtracting (59) from (58) gives (60). These conditions guarantee that the global reward $\eta$ can be improved at each round $t$ by improving the surrogate term $g$: any $\pi_{n}^{t+1}$ that increases $g$ over $g(\pi^{t})$ also increases $\eta$ over $\eta(\pi^{t})$. In particular, the surrogate function $g(\pi)$ minorizes $\eta(\pi)$ at $\pi^{t}$ [36, 48].

Appendix E Proof of Theorem II

Lemma I.

(Linearity of policy advantage $\mathbb{A}_{\pi}$ and surrogate objective $L_{\pi}$). Given arbitrary policies $\pi$, $\hat{\pi}$ and $\pi^{\prime}$, the policy advantage of the linearly combined policy $\gamma\hat{\pi}+(1-\gamma)\pi^{\prime}$ over $\pi$ is the linear combination of the individual policy advantages, i.e., for all $0\leq\gamma\leq 1$, we have

𝔸π(γπ^+(1γ)π)\displaystyle\mathbb{A}_{\pi}(\gamma\hat{\pi}+(1-\gamma)\pi^{\prime}) =γ𝔸π(π^)+(1γ)𝔸π(π)\displaystyle=\gamma\mathbb{A}_{\pi}(\hat{\pi})+(1-\gamma)\mathbb{A}_{\pi}(\pi^{\prime}) (61)
Lπ(γπ^+(1γ)π)\displaystyle L_{\pi}(\gamma\hat{\pi}+(1-\gamma)\pi^{\prime}) =γLπ(π^)+(1γ)Lπ(π).\displaystyle=\gamma L_{\pi}(\hat{\pi})+(1-\gamma)L_{\pi}(\pi^{\prime}). (62)
Proof.
Lπ(γπ^+(1γ)π)\displaystyle L_{\pi}(\gamma\hat{\pi}+(1-\gamma)\pi^{\prime})
=\displaystyle= η(π)+γsρπ(s)aπ^(a|s)Aπ(s,a)\displaystyle\eta(\pi)+\gamma\sum_{s}\rho_{\pi}(s)\sum_{a}\hat{\pi}(a|s)A_{\pi}(s,a)
+(1γ)sρπ(s)aπ(a|s)Aπ(s,a)\displaystyle+(1-\gamma)\sum_{s}\rho_{\pi}(s)\sum_{a}\pi^{\prime}(a|s)A_{\pi}(s,a) (63)
=\displaystyle= η(π)+γ𝔸π(π^)+(1γ)𝔸π(π)\displaystyle\eta(\pi)+\gamma\mathbb{A}_{\pi}(\hat{\pi})+(1-\gamma)\mathbb{A}_{\pi}(\pi^{\prime}) (64)
=\displaystyle= γLπ(π^)+(1γ)Lπ(π).\displaystyle\gamma L_{\pi}(\hat{\pi})+(1-\gamma)L_{\pi}(\pi^{\prime}). (65)

Lemma II.

In the tabular case, for all states, the TV distance between $\pi^{t+1}(\cdot|s)$ and $\pi^{t}(\cdot|s)$ is upper bounded by the expectation of the TV distance between the local policies $\pi_{k}^{t+1}(\cdot|s)$ and $\pi^{t}(\cdot|s)$.

Proof.

For all states ss,

D_{TV}(\pi^{t}(\cdot|s)\|\pi^{t+1}(\cdot|s))
= \frac{1}{2}\left\|\pi^{t}(\cdot|s)-\pi^{t+1}(\cdot|s)\right\|_{1} (66)
= \frac{1}{2}\left\|\sum_{k=1}^{N}q_{k}\left(\pi^{t}(\cdot|s)-\pi_{k}^{t+1}(\cdot|s)\right)\right\|_{1} (67)
\leq \sum_{k=1}^{N}q_{k}\frac{1}{2}\left\|\pi^{t}(\cdot|s)-\pi_{k}^{t+1}(\cdot|s)\right\|_{1} (68)
= \sum_{k=1}^{N}q_{k}D_{TV}(\pi^{t}(\cdot|s)\|\pi_{k}^{t+1}(\cdot|s)) (69)
= \mathbb{E}_{k}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{k}^{t+1}(\cdot|s))\right]. (70)
∎
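As an illustrative numerical check of Lemma II for a single state (not part of the proof), the inequality can be verified with randomly drawn policies and aggregation weights:

import numpy as np

rng = np.random.default_rng(1)
tv = lambda p, q: 0.5 * np.abs(p - q).sum()     # total variation distance

N, A = 4, 5                                     # agents, actions
pi_t = rng.dirichlet(np.ones(A))                # global policy pi^t(.|s)
pi_k = rng.dirichlet(np.ones(A), size=N)        # local policies pi_k^{t+1}(.|s)
q = rng.dirichlet(np.ones(N))                   # aggregation weights q_k
pi_next = q @ pi_k                              # aggregated policy pi^{t+1}(.|s)
lhs = tv(pi_t, pi_next)
rhs = sum(qk * tv(pi_t, pk) for qk, pk in zip(q, pi_k))
assert lhs <= rhs + 1e-12                       # Lemma II holds for this draw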

Now the proof of Theorem II follows.

Proof.

By (13), with $c_{n}^{\text{CPO}}$ replaced by its upper bound $c_{n}$, we can obtain

\eta(\pi^{t+1}) = \sum_{n=1}^{N}q_{n}\eta_{n}\left(\sum_{k=1}^{N}q_{k}\pi_{k}^{t+1}\right) (71)
\geq \sum_{n=1}^{N}q_{n}\left\{\eta_{n}(\pi^{t})+\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}\left(\sum_{k=1}^{N}q_{k}\pi_{k}^{t+1}\right)\right.
\left.-c_{n}\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi^{t+1}(\cdot|s))\right]\right\}. (72)

By the linearity of advantage function, i.e. Lemma I , we have

η(πt+1)\displaystyle\eta(\pi^{t+1}) η(πt)+n=1Nqn{𝔸πt,μn,Pn(k=1Nqkπkt+1)\displaystyle\geq\eta(\pi^{t})+\sum_{n=1}^{N}q_{n}\left\{\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\sum_{k=1}^{N}q_{k}\pi_{k}^{t+1})\right.
cn𝔼sρπ,μn,Pn[DTV(πt(|s)πt+1(|s))]}\displaystyle\quad\left.-c_{n}\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi^{t+1}(\cdot|s))\right]\right\} (73)
=η(πt)+n=1Nqn{k=1Nqk𝔸πt,μn,Pn(πkt+1)\displaystyle=\eta(\pi^{t})+\sum_{n=1}^{N}q_{n}\left\{\sum_{k=1}^{N}q_{k}\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{k}^{t+1})\right.
cn𝔼sρπ,μn,Pn[DTV(πt(|s)πt+1(|s))]}.\displaystyle\quad\left.-c_{n}\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi^{t+1}(\cdot|s))\right]\right\}. (74)

By Lemma II, we have

η(πt+1)\displaystyle\eta(\pi^{t+1})
\displaystyle\geq η(πt)+n=1Nqn{k=1Nqk𝔸πt,μn,Pn(πkt+1)\displaystyle\eta(\pi^{t})+\sum_{n=1}^{N}q_{n}\left\{\sum_{k=1}^{N}q_{k}\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{k}^{t+1})\right.
cn𝔼sρπ,μn,Pn[kNqkDTV(πt(|s)πkt+1(|s))]}\displaystyle\left.-c_{n}\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{n},P_{n}}}\left[\sum_{k}^{N}q_{k}D_{TV}(\pi^{t}(\cdot|s)\|\pi_{k}^{t+1}(\cdot|s))\right]\right\} (75)
=\displaystyle= η(πt)+n=1Nqn{k=1Nqk𝔸πt,μn,Pn(πkt+1)\displaystyle\eta(\pi^{t})+\sum_{n=1}^{N}q_{n}\left\{\sum_{k=1}^{N}q_{k}\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{k}^{t+1})\right.
k=1Nqkcn𝔼sρπ,μn,Pn[DTV(πt(|s)πkt+1(|s))]}.\displaystyle\left.-\sum_{k=1}^{N}q_{k}c_{n}\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{k}^{t+1}(\cdot|s))\right]\right\}. (76)

By exchanging the summation order, we have

η(πt+1)\displaystyle\eta(\pi^{t+1}) η(πt)+k=1Nqk{n=1Nqn𝔸πt,μn,Pn(πkt+1)\displaystyle\geq\eta(\pi^{t})+\sum_{k=1}^{N}q_{k}\left\{\sum_{n=1}^{N}q_{n}\mathbb{A}_{\pi^{t},\mu_{n},P_{n}}(\pi_{k}^{t+1})\right.
β𝔼sρπ,μn,Pn[DTV(πt(|s)πkt+1(|s))]}.\displaystyle\quad\left.-\beta\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{n},P_{n}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{k}^{t+1}(\cdot|s))\right]\right\}. (77)

By following the same procedure from Appendix C, we have

η(πt+1)\displaystyle\eta(\pi^{t+1}) η(πt)+k=1Nqk{𝔸πt,μk,Pk(πkt+1)\displaystyle\geq\eta(\pi^{t})+\sum_{k=1}^{N}q_{k}\left\{\mathbb{A}_{\pi^{t},\mu_{k},P_{k}}(\pi_{k}^{t+1})\right.
(α+β)𝔼sρπ,μk,Pk[DTV(πt(|s)πkt+1(|s))]}\displaystyle\quad\left.-(\alpha+\beta)\mathbb{E}_{s\thicksim\rho_{\pi,\mu_{k},P_{k}}}\left[D_{TV}(\pi^{t}(\cdot|s)\|\pi_{k}^{t+1}(\cdot|s))\right]\right\}
δDTVmax(πt,πkt+1)}.\displaystyle\quad\left.-\delta D_{TV}^{\max}(\pi^{t},\pi_{k}^{t+1})\right\}. (78)

This proves that the RHS of Corollary II forms a lower bound on the policy performance difference between $\pi^{t+1}$ and $\pi^{t}$. Moreover, since the equality holds when $\pi_{k}^{t+1}=\pi^{t},\forall k=1,\ldots,N$, optimizing the RHS of (78) or (22) in each round monotonically improves the policy performance difference between $\pi^{t+1}$ and $\pi^{t}$, by the argument in Appendix D. This completes the proof of Theorem II. ∎

Appendix F Proof of Theorem III

This proof uses some useful lemmas and corollaries from [39].

Lemma III.

(Continuity of ρπ\rho_{\pi}, [39, Lemma 4]). The (unnormalized) discounted visitation frequency ρπ\rho_{\pi} is continuous in π\pi.

Lemma IV.

(Continuity of QπQ_{\pi}, [39, Lemma 5]). Let π\pi be a policy. Then Qπ(s,a)Q_{\pi}\left(s,a\right) is Lipschitz-continuous in π\pi.

Corollary III.

([39, Corollary 1]). Due to the continuity of QπQ_{\pi}, the following functions are Lipschitz-continuous in π\pi: The state value function VπV_{\pi}, the advantage function Aπ(s,a)A_{\pi}\left(s,a\right), and the expected total reward η(π)\eta\left(\pi\right).

Lemma V.

(Continuity of policy advantage, [39, Lemma 6]). Let π\pi and π^\hat{\pi} be policies. The policy advantage 𝔸π(π^)\mathbb{A}_{\pi}(\hat{\pi}) is continuous in π\pi.

Lemma VI.

(Continuity of $\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}$). For every agent $n=1,\ldots,N$, the heterogeneity level $\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}$ is continuous in $\pi$.

Proof.

To simplify the proof, we expand the matrix norm 𝐁π,μn,PnF\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F} as follows

\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F}
= \left\|\sum_{k=1}^{N}q_{k}\mathbf{D}_{\pi,\mu_{n},P_{n}}^{-1}\mathbf{D}_{\pi,\mu_{k},P_{k}}\mathbf{A}_{\pi,P_{k}}-\mathbf{A}_{\pi,P_{n}}\right\|_{F} (79)
= \sqrt{\sum_{s}\sum_{a}\left[\sum_{k=1}^{N}q_{k}\frac{\rho_{\pi,\mu_{k},P_{k}}(s)}{\rho_{\pi,\mu_{n},P_{n}}(s)}A_{\pi,P_{k}}(s,a)-A_{\pi,P_{n}}(s,a)\right]^{2}}. (80)

The continuity of 𝐁π,μn,PnF\left\|\mathbf{B}_{\pi,\mu_{n},P_{n}}\right\|_{F} follows from the assumption that every state is reachable and the continuity of ρπ(s)\rho_{\pi}(s) and Aπ(s,a)A_{\pi}(s,a). ∎

Lemma VII.

(Continuity of local objective). For every agent k=1,,Nk=1,...,N, the local objective hk(π;π)h_{k}(\pi^{\prime};\pi) in (22) is continuous in π\pi.

Proof.

By Lemma V, $\mathbb{A}_{\hat{\pi},\mu_{k},P_{k}}(\pi)$ is continuous in $\hat{\pi}$. By Corollary III and Lemma VI, $\alpha$, $\beta$ and $\delta$ are continuous. By the continuity of $D_{TV}$ and the $\max$ operator, the last two terms of (22) are continuous. As a result, $h_{k}(\pi^{\prime};\pi)$ is continuous in $\pi$. ∎

Definition II.

(FedKL-Stationarity) A policy π\pi^{\prime} is FedKL-stationary if, for every agent k=1,,Nk=1,...,N,

π\displaystyle\pi^{\prime} =argmaxπ[𝔸π,μk,Pk(π)\displaystyle=\arg\max_{\pi}\left[\mathbb{A}_{\pi^{\prime},\mu_{k},P_{k}}(\pi)\right.
(α+β)𝔼sρπ,μk,Pk[DTV(π(|s),π(|s))]\displaystyle\quad-\left.\left(\alpha+\beta\right)\mathbb{E}_{s\thicksim\rho_{\pi^{\prime},\mu_{k},P_{k}}}\left[D_{TV}(\pi^{\prime}(\cdot|s),\pi(\cdot|s))\right]\right.
δDTVmax(π,π)]\displaystyle\quad\left.-\delta D_{TV}^{\max}(\pi^{\prime},\pi)\right] (81)
=argmaxπ[hk(π;π)].\displaystyle=\arg\max_{\pi}\left[h_{k}(\pi;\pi^{\prime})\right]. (82)

Next, we prove Theorem III. To this end, we start from the convergence of (22) and then show the limit points of the sequence generated by Algorithm 1 are FedKL-stationary.

Proof.

By Theorem II, the sequence $\left(\eta\left(\pi^{t}\right)\right)_{t=0}^{\infty}$ is non-decreasing and bounded above by $\frac{R_{\max}}{1-\gamma}$, where $R_{\max}$ is the maximum reward. As a result, the sequence converges. Let $\hat{\eta}$ denote the limit point. Since the sequence of policies $\left(\pi^{t}\right)_{t=0}^{\infty}$ is bounded, it has at least one convergent subsequence by the Bolzano-Weierstrass Theorem. Denote by $\hat{\pi}$ any limit point of the sequence, and let $\left(\pi^{t_{j}}\right)_{t_{j}=0}^{\infty}$ be a subsequence converging to $\hat{\pi}$. By the continuity of $\eta$ in $\pi$ (Corollary III), we have

η(π^)\displaystyle\eta\left(\hat{\pi}\right) =η(limjπtj)=limjη(πtj)=η^.\displaystyle=\eta\left(\lim_{j\to\infty}\pi^{t_{j}}\right)=\lim_{j\to\infty}\eta\left(\pi^{t_{j}}\right)=\hat{\eta}. (83)

We will now establish the FedKL-stationarity of any limit point π^\hat{\pi}. By Theorem II, we have

0\displaystyle 0 =limt[η(πt+1)η(πt)]\displaystyle=\lim_{t\to\infty}\left[\eta\left(\pi^{t+1}\right)-\eta\left(\pi^{t}\right)\right] (84)
limt𝔼k[𝔸πt,μk,Pk(πkt+1)\displaystyle\geq\lim_{t\to\infty}\mathbb{E}_{k}\left[\mathbb{A}_{\pi^{t},\mu_{k},P_{k}}(\pi_{k}^{t+1})\right.
(α+β)𝔼sρπt,μk,Pk[DTV(πt(|s),πkt+1(|s))]\displaystyle\quad\left.-\left(\alpha+\beta\right)\mathbb{E}_{s\thicksim\rho_{\pi^{t},\mu_{k},P_{k}}}\left[D_{TV}(\pi^{t}\left(\cdot|s\right),\pi_{k}^{t+1}\left(\cdot|s\right))\right]\right.
δDTVmax(πt,πkt+1)].\displaystyle\quad\left.-\delta D_{TV}^{\max}(\pi^{t},\pi_{k}^{t+1})\right]. (85)

Consider an arbitrary limit point π^\hat{\pi} and a subsequence (πtj)j=0\left(\pi^{t_{j}}\right)_{j=0}^{\infty} that converges to π^\hat{\pi}. Then, we can obtain

0\displaystyle 0 limj𝔼k[𝔸πtj,μk,Pk(πktj+1)\displaystyle\geq\lim_{j\to\infty}\mathbb{E}_{k}\left[\mathbb{A}_{\pi^{t_{j}},\mu_{k},P_{k}}(\pi_{k}^{t_{j}+1})\right.
(α+β)𝔼sρπtj,μk,Pk[DTV(πtj(|s),πktj+1(|s))]\displaystyle\quad\left.-\left(\alpha+\beta\right)\mathbb{E}_{s\thicksim\rho_{\pi^{t_{j}},\mu_{k},P_{k}}}\left[D_{TV}(\pi^{t_{j}}\left(\cdot|s\right),\pi_{k}^{t_{j}+1}\left(\cdot|s\right))\right]\right.
δDTVmax(πtj,πktj+1)].\displaystyle\quad\left.-\delta D_{TV}^{\max}(\pi^{t_{j}},\pi_{k}^{t_{j}+1})\right]. (86)

As every agent is optimizing its local objective, the expectation is with respect to non-negative random variables. Thus, for arbitrary k=1,,Nk=1,...,N, the RHS of (86) is bounded from below by

qklimjmaxπ[𝔸πtj,μk,Pk(π)\displaystyle q_{k}\lim_{j\to\infty}\max_{\pi}\left[\mathbb{A}_{\pi^{t_{j}},\mu_{k},P_{k}}(\pi)\right.
(α+β)𝔼sρπtj,μk,Pk[DTV(πtj(|s),π(|s))]]\displaystyle\left.-\left(\alpha+\beta\right)\mathbb{E}_{s\thicksim\rho_{\pi^{t_{j}},\mu_{k},P_{k}}}\left[D_{TV}(\pi^{t_{j}}\left(\cdot|s\right),\pi\left(\cdot|s\right))\right]\right]
δDTVmax(πtj,π)]0.\displaystyle\left.-\delta D_{TV}^{\max}(\pi^{t_{j}},\pi)\right]\geq 0. (87)

By Lemma VII and the Squeeze theorem, it follows from (86) and (87) that

limjmaxπ[hk(π;πtj)]=maxπ[hk(π;π^)]=0.\displaystyle\lim_{j\to\infty}\max_{\pi}\left[h_{k}(\pi;\pi^{t_{j}})\right]=\max_{\pi}\left[h_{k}(\pi;\hat{\pi})\right]=0. (88)

This proves that, for any limit point $\hat{\pi}$ of the policy sequence induced by Algorithm 1, $\max_{\pi}\left[h_{k}(\pi;\hat{\pi})\right]=0$ for every agent $k=1,\ldots,N$, which is equivalent to Definition II. ∎

References

  • [1] T. Chen, K. Zhang, G. B. Giannakis, and T. Basar, “Communication-efficient distributed reinforcement learning,” CoRR, vol. abs/1812.03239, 2018.
  • [2] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learning: Concept and applications,” ACM Trans. Intell. Syst. Technol., vol. 10, no. 2, Jan. 2019.
  • [3] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020.
  • [4] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” CoRR, vol. abs/1610.05492, 2016.
  • [5] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “Qsgd: Communication-efficient sgd via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [6] F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, “Local sgd with periodic averaging: Tighter analysis and adaptive synchronization,” in Advances in Neural Information Processing Systems, 2019, vol. 32, pp. 11 082–11 094.
  • [7] J. Wang and G. Joshi, “Cooperative sgd: A unified framework for the design and analysis of local-update sgd algorithms,” Journal of Machine Learning Research, vol. 22, no. 213, pp. 1–50, 2021.
  • [8] H. Yu, S. Yang, and S. Zhu, “Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 5693–5700, Jul. 2019.
  • [9] H. Zhu, J. Xu, S. Liu, and Y. Jin, “Federated learning on non-iid data: A survey,” CoRR, vol. abs/2106.06843, 2021.
  • [10] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of fedavg on non-iid data,” in ICLR 2020 : Eighth International Conference on Learning Representations, 2020.
  • [11] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed.   The MIT Press, 2018.
  • [12] J. Qi, Q. Zhou, L. Lei, and K. Zheng, “Federated reinforcement learning: Techniques, applications, and open challenges,” CoRR, vol. abs/2108.11887, 2021.
  • [13] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proceedings of the 31st International Conference on International Conference on Machine Learning, vol. 32, 2014, pp. 387–395.
  • [14] K. Ciosek and S. Whiteson, “Expected policy gradients for reinforcement learning,” Journal of Machine Learning Research, vol. 21, no. 52, pp. 1–51, 2020.
  • [15] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, vol. 12, 2000.
  • [16] Q. Li, Y. Diao, Q. Chen, and B. He, “Federated learning on non-iid data silos: An experimental study,” CoRR, vol. abs/2102.02079, 2021.
  • [17] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-iid data,” CoRR, vol. abs/1806.00582, 2018.
  • [18] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol. 54, 20–22 Apr 2017, pp. 1273–1282.
  • [19] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” in Proceedings of Machine Learning and Systems, vol. 2, 2020, pp. 429–450.
  • [20] H. Wang, Z. Kaplan, D. Niu, and B. Li, “Optimizing federated learning on non-iid data with reinforcement learning,” in IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, 2020, pp. 1698–1707.
  • [21] B. Liu, L. Wang, and M. Liu, “Lifelong federated reinforcement learning: A learning architecture for navigation in cloud robotic systems,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 4555–4562, 2019.
  • [22] X. Liang, Y. Liu, T. Chen, M. Liu, and Q. Yang, “Federated transfer reinforcement learning for autonomous driving,” CoRR, vol. abs/1910.06001, 2019.
  • [23] H.-K. Lim, J.-B. Kim, J.-S. Heo, and Y.-H. Han, Federated Reinforcement Learning for Training Control Policies on Multiple IoT Devices.   Sensors, 2020, vol. 20, no. 5.
  • [24] B. Liu, L. Wang, and M. Liu, “Lifelong federated reinforcement learning: A learning architecture for navigation in cloud robotic systems,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 1688–1695.
  • [25] H. H. Zhuo, W. Feng, Q. Xu, Q. Yang, and Y. Lin, “Federated reinforcement learning,” CoRR, vol. abs/1901.08277, 2019.
  • [26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017.
  • [27] V. Konda and J. Tsitsiklis, “Actor-critic algorithms,” in Advances in Neural Information Processing Systems, vol. 12.   MIT Press, 2000, pp. 1008–1014.
  • [28] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6382–6393.
  • [29] H.-K. Lim, J.-B. Kim, I. Ullah, J.-S. Heo, and Y.-H. Han, “Federated reinforcement learning acceleration method for precise control of multiple devices,” IEEE Access, vol. 9, pp. 76 296–76 306, 2021.
  • [30] D. Zhou, Y. Zhang, A. Sonabend-W, Z. Wang, J. Lu, and T. Cai, “Federated offline reinforcement learning,” 2022.
  • [31] Y. Cao, S.-Y. Lien, Y.-C. Liang, and K.-C. Chen, “Federated deep reinforcement learning for user access control in open radio access networks,” in ICC 2021 - IEEE International Conference on Communications, 2021, pp. 1–6.
  • [32] X. Xu, R. Li, Z. Zhao, and H. Zhang, “The gradient convergence bound of federated multi-agent reinforcement learning with efficient communication,” CoRR, vol. abs/2103.13026, 2021.
  • [33] H. Jin, Y. Peng, W. Yang, S. Wang, and Z. Zhang, “Federated reinforcement learning with environment heterogeneity,” in Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, G. Camps-Valls, F. J. R. Ruiz, and I. Valera, Eds., vol. 151.   PMLR, 28–30 Mar 2022, pp. 18–37.
  • [34] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” in Proceedings of the International Conference on Learning Representations (ICLR), 2016.
  • [35] X. Fan, Y. Ma, Z. Dai, W. Jing, C. Tan, and B. K. H. Low, “Fault-tolerant federated reinforcement learning with theoretical guarantee,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [36] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” CoRR, vol. abs/1502.05477, 2015.
  • [37] S. Kakade and J. Langford, “Approximately optimal approximate reinforcement learning,” in Proceedings of the Nineteenth International Conference on Machine Learning, 2002, pp. 267–274.
  • [38] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70.   PMLR, 06–11 Aug 2017, pp. 22–31.
  • [39] J. G. Kuba, R. Chen, M. Wen, Y. Wen, F. Sun, J. Wang, and Y. Yang, “Trust region policy optimisation in multi-agent reinforcement learning,” 2022.
  • [40] I. Csiszár and J. Körner, Information theory: coding theorems for discrete memoryless systems.   Cambridge University Press, 2011.
  • [41] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 5026–5033.
  • [42] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” CoRR, vol. abs/1606.01540, 2016.
  • [43] E. Vinitsky, A. Kreidieh, L. L. Flem, N. Kheterpal, K. Jang, C. Wu, F. Wu, R. Liaw, E. Liang, and A. M. Bayen, “Benchmarks for reinforcement learning in mixed-autonomy traffic,” in Proceedings of The 2nd Conference on Robot Learning, vol. 87, 2018, pp. 399–409.
  • [44] S. M. Kakade, “A natural policy gradient,” Advances in neural information processing systems, vol. 14, 2001.
  • [45] J. A. Bagnell and J. Schneider, “Covariant policy search,” IJCAI, 2003.
  • [46] L. Liu, J. Zhang, S. Song, and K. B. Letaief, “Client-edge-cloud hierarchical federated learning,” in IEEE International Conference on Communications (ICC), 2020, pp. 1–6.
  • [47] B. Baumgartner, “An inequality for the trace of matrix products, using absolute values,” 2011. [Online]. Available: https://arxiv.org/abs/1106.6189
  • [48] D. R. Hunter and K. Lange, “A tutorial on mm algorithms,” The American Statistician, vol. 58, no. 1, pp. 30–37, 2004.