
On the Convergence Theory of Meta Reinforcement Learning with Personalized Policies

Haozhi Wang, Qing Wang, Yunfeng Shao, Dong Li, Jianye Hao, Yinchuan Li. Corresponding author: Yinchuan Li (email: [email protected]). Haozhi Wang, Qing Wang and Jianye Hao are with the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. This work was completed while Haozhi Wang was a member of the Huawei Noah’s Ark Lab for Advanced Study. Yunfeng Shao, Dong Li and Yinchuan Li are with Huawei Noah’s Ark Lab, Beijing, China.
Abstract

Modern meta-reinforcement learning (Meta-RL) methods are mainly developed based on model-agnostic meta-learning, which performs policy gradient steps across tasks to maximize policy performance. However, the gradient conflict problem is still poorly understood in Meta-RL, and it may lead to performance degradation when encountering distinct tasks. To tackle this challenge, this paper proposes a novel personalized Meta-RL (pMeta-RL) algorithm, which aggregates task-specific personalized policies to update a meta-policy used for all tasks, while maintaining personalized policies to maximize the average return of each task under the constraint of the meta-policy. We also provide a theoretical analysis under the tabular setting, which demonstrates the convergence of our pMeta-RL algorithm. Moreover, we extend the proposed pMeta-RL algorithm to a deep network version based on soft actor-critic, making it suitable for continuous control tasks. Experimental results show that the proposed algorithms outperform previous Meta-RL algorithms on the Gym and MuJoCo suites.

Index Terms:
Meta Reinforcement Learning, Personalized policies.

I Introduction

Reinforcement learning (RL) [1, 2] has long been an active research topic in artificial intelligence and has demonstrated extraordinary capabilities in many fields, such as game playing [3, 4] and robot control [5, 6]. Despite these successful applications, deep RL still suffers from data inefficiency when training agents in specific environments, and the learned policy may overfit and fail to generalize to unseen tasks.

Meta-learning, also known as learning to learn, has recently gained increasing attention for its success in improving sample efficiency in regression, classification and RL tasks. Unlike meta-supervised learning, meta reinforcement learning (Meta-RL) needs to automatically extract prior knowledge from previous tasks and achieve fast adaptation [7]. Mainstream Meta-RL methods are based on model-agnostic meta-learning algorithms, such as MAML [8] and E-MAML [9], which update model parameters using gradients across tasks by differentiating through the learning and adaptation process. However, these methods do not consider the problem of task diversity.

Some other Meta-RL algorithms attempt to learn an inference network from historical transition samples of different tasks during the meta-training process and build a mapping from observation data to task-relevant information [10, 11, 12]. Then the agent can distinguish the current task according to the inference network and perform action selection to alleviate the gradient conflict problem. However, these methods jointly learn diverse robot manipulation tasks with a shared policy network, which may hurt the final performance compared to independent training in each task [13]. A major reason is that it is unclear how tasks will interact with each other when jointly trained, and optimizing some tasks may negatively affect others [14].

To tackle this problem, in this paper we propose a personalized meta reinforcement learning (pMeta-RL) framework that does not require sharing trajectories. More specifically, we let each task train its own policy and obtain a meta-policy for multiple tasks by sharing and aggregating policy model parameters. In order for each task to have its own personalized policy while still contributing to the meta-policy, we introduce a personalization constraint in the optimization objective that associates the meta-policy with the personalized policies. Furthermore, we extend the framework to the deep RL setting and employ alternating optimization to solve the resulting problems. Under the constraint of the meta-policy, each task updates its personalized policy using its own transition samples and maintains an auxiliary policy that is used for meta-policy learning.

I-A Main Contributions

First, we propose a personalization framework to improve the performance of Meta-RL on distinct tasks, which learns a meta-policy and personalized policies for all tasks and for specific tasks, respectively. In particular, by learning multiple task-specific personalized policies, the performance on each task can be improved. Meanwhile, to aggregate these differentiated personalized policies into a meta-policy applicable to all tasks, our framework addresses the gradient conflict problem by adopting a personalization constraint in the objective function. Under the tabular setting, we couple the meta Q-table and the personalized Q-tables through a differentiable function that encourages each task to pursue its personalized Q-table around the meta Q-table. Under the deep RL setting, we obtain personalization by constraining the personalized and meta networks (e.g., actor, critic, and inference networks), so that the networks can refer to each other during training.

Second, we propose an alternating minimization algorithm, named pMeta-RL, to solve the personalized Meta-RL problem. The personalized policies are updated based on value iteration, while the auxiliary policies used to aggregate the meta-policy are updated based on gradient descent. More importantly, theoretical analysis shows that the proposed algorithm converges well, with a convergence speed that is approximately linear in the number of iterations. We also give an upper bound on the difference between the personalized policies and the meta-policy, which is influenced by the regularization parameter and the task diversity. Moreover, we extend the proposed pMeta-RL algorithm to a deep network version (named deep pMeta-RL) based on soft actor-critic (SAC), making it suitable for continuous control tasks.

Finally, we evaluate the performance of pMeta-RL and deep pMeta-RL on the Gridworld environment and the MuJoCo suites, respectively. The experimental results show that the proposed pMeta-RL achieves better performance than the model-averaging method and verify the conclusions of the convergence analysis. Compared with other previous Meta-RL algorithms, our deep pMeta-RL algorithm achieves the highest average return on the MuJoCo locomotion tasks.

II Related Works

II-A Meta Reinforcement Learning:

Existing Meta-RL methods learn dynamics models and policies that can quickly adapt to unseen tasks, building on the meta-learning framework in the context of RL [7, 15, 8]. $\text{RL}^{2}$ [15] formulates the meta-learning problem as a second RL procedure: it represents a “fast” RL algorithm as a recurrent neural network and learns it from data, while updating the parameters at a “slow” pace. MAML [8] meta-learns model parameters by differentiating through the learning process for fast adaptation on unseen tasks. E-MAML and E-$\text{RL}^{2}$ [9] explicitly optimize the per-task sampling distributions during adaptation with respect to the expected future returns, which is closely related to the MAML algorithm. ProMP [16] theoretically analyzes the MAML formulation and addresses the biased meta-gradient issue. To achieve sample efficiency, PEARL [10] performs structured exploration by reasoning about uncertainty over tasks and enables fast adaptation by accumulating experience online. VariBAD [11] learns a policy that conditions on a posterior belief over tasks, which can trade off exploration and exploitation under task uncertainty. Other Meta-RL works, such as MAME [17] and MetaCURE [12], learn a separate exploration policy for general dense- and sparse-reward tasks.

II-B Multi-task Reinforcement Learning:

Multi-task RL methods have demonstrated that learning different tasks jointly can be mutually beneficial in robotics and other fields [18, 19, 20]. However, the gradient conflict problem still exists in multi-task RL, and various approaches based on compositional models [21, 22, 23, 24], gradient similarity [25, 26], or policy distillation [27, 14, 28] have been proposed to address this challenge. In particular, multi-head multi-task SAC [29, 13] and experience sharing networks [22, 23] train on diverse robot manipulation tasks jointly with a shared network backbone and multiple task-specific action heads. Hard routing [21] and soft modularization [24] learn a routing network that automatically reconfigures different modules for different tasks without explicitly specifying the policy structure, which effectively avoids gradient conflicts. Other works, such as gradient surgery [25] and gradient normalization [26], leverage the similarity between gradients from different tasks to enhance the learning process and mitigate interference. Existing algorithms have shown superior performance on multiple tasks; however, they need to share trajectories between tasks and are difficult to generalize to unseen tasks.

II-C Federated Reinforcement Learning:

Another related line of work is federated reinforcement learning (FRL). These methods encourage multiple agents to federatively build a better decision-making policy in a privacy-preserving manner and have been applied to robot navigation and the Internet of Things [30, 31, 32]. [31] analyzes the communication overhead in FRL and gives a convergence bound. [32] studies adversarial attacks in FRL systems in detail and gives theoretical guarantees in the event of participant failure. However, existing FRL algorithms usually consider only identical tasks and lack the ability to handle multiple distinct tasks.

Figure 1: (a) Several distinct tasks on MuJoCo. The tasks vary in either the transition function (e.g., robots with different dynamics) or the reward function (e.g., different goal locations). (b) Visualization of the 2D Sparse-Point-Robot. The blue circles represent the different tasks. We highlight two distinct goals in dark blue and visualize the policies leading the agent to reach them. The $x$ and $y$ axes of the heatmap represent the two dimensions of the action, while the $z$ axis represents the state-action value at the initial state. Red represents higher values. For goals 1 and 2, the personalized policies prefer left and right actions, respectively. In contrast, the traditional gradient-aggregation-based meta-policy has similar propensities for left and right actions. (c) Performance comparison of our algorithm and PEARL at 50000 training steps. The agent using the personalized policy can navigate to more goals.

III Problem Formulation

III-A Preliminaries

Reinforcement learning (RL) algorithms are designed to solve sequential decision problems [2, 1, 3]. Generally speaking, the agent makes an action decision at each discrete time step based on its observation of the environment $\mathcal{E}$. After an action is performed, the environment generates a reward and transfers the agent to a new state. This can be formulated as a standard Markov decision process (MDP), consisting of a tuple $\mathcal{G}=\left\langle\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma\right\rangle$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $\mathcal{R}(s,a):\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ is the bounded reward function with $\mathcal{R}(s,a)\leq\mathcal{R}_{\max}$, $\mathcal{P}(s^{\prime}|s,a):\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\left[0,1\right]$ denotes the state transition function, and $\gamma\in\left[0,1\right)$ is the discount factor. The goal of RL is to find the optimal policy $\pi^{*}$ that maximizes the expected long-term return.

III-B Conventional Meta-RL methods

Most gradient-based Meta-RL algorithms are model-agnostic meta-learners built upon MAML, performing vanilla policy gradient steps towards maximizing the performance of the policy for each task. These methods generally consider tasks drawn from a distribution $\mathcal{T}_{i}\sim p\left(\mathcal{T}\right)$, where each task $\mathcal{T}_{i}$ is a different MDP. During meta-training, the goal of Meta-RL is to learn a policy $\pi(a|s)$ that can adapt to the multiple tasks. In particular, this policy $\pi(a|s)$ can be optimized by maximizing the average expected return across the task distribution,

\max_{\pi}\Big\{\mathbb{E}_{\mathcal{T}_{i}\sim p(\mathcal{T})}\mathbb{E}_{s\sim\rho_{0},\,a\sim\pi}[Q^{\pi}_{i}(s,a)]\Big\}, \qquad (1)

where $\rho_{0}$ is the initial state distribution and $Q^{\pi}_{i}(s,a)$ is the state-action value function of task $\mathcal{T}_{i}$.
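For intuition only, the objective in (1) can be estimated by Monte Carlo rollouts over sampled tasks. The following minimal sketch is our illustration, not part of the original formulation; `sample_tasks`, `env_for`, and `rollout_return` are hypothetical helpers that draw tasks from $p(\mathcal{T})$, build the corresponding MDP, and return a discounted episode return.

```python
import numpy as np

def estimated_meta_objective(policy, sample_tasks, env_for, rollout_return,
                             num_tasks=8, rollouts_per_task=4, gamma=0.99):
    """Monte Carlo estimate of Eq. (1): the average expected return across tasks.

    Assumed helpers: `sample_tasks(n)` draws n tasks from p(T), `env_for(task)`
    builds the MDP of a task, and `rollout_return(env, policy, gamma)` returns
    the discounted return of one episode under `policy`.
    """
    returns = []
    for task in sample_tasks(num_tasks):          # T_i ~ p(T)
        env = env_for(task)
        for _ in range(rollouts_per_task):        # s ~ rho_0, a ~ pi
            returns.append(rollout_return(env, policy, gamma))
    return float(np.mean(returns))                # empirical average return
```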

III-C Towards Personalized Meta-RL

Existing Meta-RL algorithms leverage transitions from similar tasks to improve sample efficiency and achieve good performance on future new tasks. However, most gradient-based methods suppose that the task distribution is “homogeneous”, meaning that all tasks are from the same domain (i.e., a single dataset, multiple datasets with the same input feature space [33], or robot control with different transition functions but similar dynamics [8]). This limits the application of these algorithms to distinct or “heterogeneous” tasks, i.e., tasks whose environments or transition functions vary over a large interval [34]. Figure 1(a) shows several distinct tasks, such as navigating the Ant robot in two completely opposite directions or controlling the speed of the Half-Cheetah to reach the two endpoints of a speed interval. In this setting, optimizing some tasks in Meta-RL algorithms may negatively affect others.

Figure 1(b) illustrates this problem. For distinct tasks, gradient aggregation makes the values of different state-action pairs similar, i.e., a recommended action may not be given accurately. Other Meta-RL methods, such as PEARL [10], distinguish different tasks by learning a task-relevant context $\mathbf{z}$ based on the history. These algorithms maintain an inference network $q_{\zeta}(\mathbf{z}|\mathbf{c})$, parameterized by $\zeta$, to encode salient information about tasks from the context $\mathbf{c}$, which consists of transitions from different tasks. Nevertheless, the gradient conflict problem still exists due to the shared policy network parameters. To address this challenge, we aim to learn a meta-policy that adapts to multiple distinct tasks, while optimizing a personalized policy for each task. In this subsection, we first give the definition of personalized Meta-RL under the tabular and deep network settings; the corresponding algorithms for solving personalized Meta-RL are proposed in the next section.

Definition 1 (Personalized Meta-RL)

Assume that there are $N$ tasks and that the $i$-th ($i=1,\ldots,N$) task $\mathcal{T}_{i}$ is drawn from $p(\mathcal{T})$. Define $Q^{\pi}(s,a)$, $s\in\mathcal{S}$, $a\in\mathcal{A}$, as the meta Q-table, where $\mathcal{S}=\cup_{i\in N}\mathcal{S}_{i}$ and $\mathcal{A}=\cup_{i\in N}\mathcal{A}_{i}$, with $\mathcal{S}_{i}$ and $\rho_{i,0}$ being the set of the $i$-th task's states and its initial state distribution, respectively. Personalized Meta-RL aims to train a meta-policy $\pi$ by maximizing

\max_{\pi}\Big\{\mathcal{L}(\pi):=\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{i}(\pi)\Big\}, \qquad (2)

where

\mathcal{L}_{i}(\pi)=\max_{\pi_{i}}L_{i}(\pi_{i})-\frac{\lambda}{2}\varphi_{i}(Q^{\pi_{i}}_{i},Q^{\pi}),
L_{i}(\pi_{i})=\mathbb{E}_{s_{i}\sim\rho_{i,0},\,a_{i,t}\sim\pi_{i}}\Big[\sum_{t=0}^{\infty}\gamma^{t}\mathcal{R}_{i}(s_{i,t},a_{i,t})\,\Big|\,s_{i,0}=s_{i}\Big],

with $\lambda$ being a weight parameter that controls the level of personalization, $Q^{\pi_{i}}_{i}(s_{i},a_{i})$ being the personalized Q-table of the $i$-th task, $L_{i}(\pi_{i})$ being the expected long-term return under the policy $\pi_{i}$, and $\varphi_{i}(\cdot)$ being a differentiable constraint function, such as the squared $\ell_{2}$-norm.

Definition 1 describes a personalization approach based on Q-learning, where the personalization constraint enables each task to optimize its own policy around $Q^{\pi}(s,a)$. Since Q-learning is difficult to scale to large problems, we use a function approximator to represent the Q-values and extend Definition 1 to deep personalized Meta-RL based on SAC [29] and an inference network $q_{\zeta}$.

Definition 2 (Deep Personalized Meta-RL)

Assume that there are $N$ training tasks drawn from $p(\mathcal{T})$. Define $\pi_{\omega}(a|s,\mathbf{z})$ as the meta-policy, where $\mathbf{z}$ is the task-relevant vector. Deep personalized Meta-RL aims to train a meta actor model $\pi_{\omega}$, a meta Q-function $Q_{\vartheta}$ and a meta inference network $q_{\varpi}$ by alternately minimizing

\min_{\omega\in\mathbb{R}^{d}}F(\omega):=\frac{1}{N}\sum_{i=1}^{N}F_{i,\pi}(\omega),\qquad
\min_{\vartheta\in\mathbb{R}^{d}}F(\vartheta):=\frac{1}{N}\sum_{i=1}^{N}F_{i,Q}(\vartheta),\qquad
\min_{\varpi\in\mathbb{R}^{d}}F(\varpi):=\frac{1}{N}\sum_{i=1}^{N}F_{i,q}(\varpi),

where

F_{i,\pi}(\omega)=\min_{\phi_{i}\in\mathbb{R}^{d}}J_{\pi_{i}}(\phi_{i})+\frac{\lambda}{2}\varphi_{i}(\phi_{i},\omega),
F_{i,Q}(\vartheta)=\min_{\theta_{i}\in\mathbb{R}^{d}}J_{Q_{i}}(\theta_{i})+\frac{\lambda}{2}\varphi_{i}(\theta_{i},\vartheta),
F_{i,q}(\varpi)=\min_{\zeta_{i}\in\mathbb{R}^{d}}J_{Q_{i}}(\theta_{i})+\frac{\lambda}{2}\varphi_{i}(\zeta_{i},\varpi)+\mathbb{E}_{q_{\zeta}(\mathbf{z}|\mathbf{c}^{\mathcal{T}})}\left[D_{\text{KL}}\big(q_{\zeta}(\mathbf{z}|\mathbf{c}^{\mathcal{T}})\,\big\|\,p(\mathbf{z})\big)\right],

with $\phi_{i}$, $\theta_{i}$ and $\zeta_{i}$ being the personalized actor model, soft Q-function model and inference model of the $i$-th task, respectively, and $J_{\pi_{i}}(\phi_{i})$ and $J_{Q_{i}}(\theta_{i})$ being the actor and critic objective functions, which are given by

J_{\pi_{i}}(\phi_{i})=\mathbb{E}_{s_{t}\sim\mathcal{D}_{i},\,a_{t}\sim\pi_{\phi_{i}}}\left[\alpha\log\pi_{\phi_{i}}(a_{t}|s_{t},\mathbf{z})-Q_{\theta_{i}}(s_{t},a_{t},\mathbf{z})\right],

and

J_{Q_{i}}(\theta_{i})=\mathbb{E}_{(s_{t},a_{t})\sim\mathcal{D}_{i},\,\mathbf{z}\sim q_{\zeta_{i}}(\mathbf{z}|\mathbf{c})}\Big[\frac{1}{2}\big(Q_{\theta_{i}}(s_{t},a_{t},\mathbf{z})-(r(s_{t},a_{t})+\gamma\mathbb{E}_{s_{t+1}\sim p}\left[V_{\bar{\theta}_{i}}(s_{t+1},\bar{\mathbf{z}})\right])\big)^{2}\Big],

respectively, where $\bar{\mathbf{z}}$ indicates that gradients are not computed through it.

Definition 2 applies the personalization approach to the three models $(\phi_{i},\theta_{i},\zeta_{i})$ and establishes the relationship between them, so that each personalized policy is optimized around the meta-policy. In the next section, we propose the corresponding algorithms to solve these problems.
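To make the constraint terms in Definition 2 concrete, the sketch below is our illustration: following the $\ell_{2}$-norm example in Definition 1, we take $\varphi_{i}$ as the squared $\ell_{2}$ distance between the parameter vectors of two networks with identical architectures and compute $\frac{\lambda}{2}\varphi_{i}(\phi_{i},\omega)$; the result can then be added to the task's actor, critic, or inference loss.

```python
import torch
import torch.nn as nn

def personalization_penalty(personal_net: nn.Module, meta_net: nn.Module,
                            lam: float) -> torch.Tensor:
    """(lambda / 2) * ||phi_i - omega||^2 summed over all matching parameters.

    The meta network is treated as a fixed reference (detached), so gradients
    only flow into the personalized network, as in the update (12) below.
    """
    penalty = torch.zeros(())
    for p_i, w in zip(personal_net.parameters(), meta_net.parameters()):
        penalty = penalty + torch.sum((p_i - w.detach()) ** 2)
    return 0.5 * lam * penalty
```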

IV Proposed Algorithms

In this section, we first propose the pMeta-RL algorithm to solve personalized Meta-RL in Definition 1 and present the corresponding theoretical analysis. Then we extend this algorithm with deep neural networks to solve deep personalized Meta-RL in Definition 2.

IV-A Personalized Multi-task Value Iteration

To solve personalized Meta-RL, our pMeta-RL algorithm updates $Q_{i}^{\pi_{i}}$ and $Q^{\pi}$ by alternately solving the two subproblems of pMeta-RL. More specifically, we first update $Q_{i}^{\pi_{i}}$ by solving

\hat{Q}_{i}^{\pi_{i}}=\arg\min_{Q_{i}^{\pi_{i}}}-L_{i}(\pi_{i})+\frac{\lambda}{2}\varphi_{i}(Q^{\pi_{i}}_{i},Q^{\pi}), \qquad (3)

then we introduce auxiliary meta Q-values $Q_{i}^{\pi}$ for each task, which are updated by solving

\hat{Q}_{i}^{\pi}=\arg\min_{Q_{i}^{\pi}}\frac{\lambda}{2}\varphi_{i}(Q^{\pi_{i}}_{i},Q_{i}^{\pi}). \qquad (4)

After that, all auxiliary meta Q-values $Q_{i}^{\pi},i=1,\cdots,N$, are aggregated to update $Q^{\pi}$.

We first focus on solving (3). Note that in model-free RL, standard Q-learning can be used to solve $\arg\min_{\pi_{i}}-L_{i}(\pi_{i})$. For each state-action pair $(s,a)$, $Q_{i}^{\pi_{i}}(s,a)$ is estimated by iteratively applying the Bellman optimality operator $\mathcal{B}^{*}\left[Q_{i}^{\pi_{i}}(s,a)\right]=\mathcal{R}_{i}(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathcal{P}(s^{\prime}|s,a)}\left[\max_{a^{\prime}}Q_{i}^{\pi_{i}}(s^{\prime},a^{\prime})\right]$, i.e.,

Q^{\pi_{i}}_{i,k+1}(s,a)\leftarrow\arg\min_{Q_{i}^{\pi_{i}}}\frac{1}{2}\left\|Q_{i}^{\pi_{i}}(s,a)-\mathcal{B}^{*}\left[Q^{\pi_{i}}_{i,k}(s,a)\right]\right\|_{2}^{2},

where $Q^{\pi_{i}}_{i,k}(s,a)$ is the Q-table at the $k$-th iteration. Then an exact or approximate maximization scheme is used to recover the greedy policy. This inspires us to update $Q^{\pi_{i}}_{i}(s_{i},a_{i})$ by performing the following iterations

Q^{\pi_{i}}_{i,k+1}(s_{i},a_{i})\leftarrow\arg\min_{Q^{\pi_{i}}}\Xi_{i,k}(Q^{\pi_{i}}), \qquad (5)

where

\Xi_{i,k}(Q^{\pi_{i}})=\frac{1}{2}\left\|Q^{\pi_{i}}(s_{i},a_{i})-\mathcal{B}^{*}[Q^{\pi_{i}}_{i,k}(s_{i},a_{i})]\right\|_{2}^{2}+\frac{\lambda}{2}\varphi_{i}(Q^{\pi_{i}}_{i}(s_{i},a_{i}),Q^{\pi}_{i}(s_{i},a_{i})).

We solve (5) via a one-step gradient descent based on the gradient $\nabla\Xi_{i,k}\left(Q_{i}^{\pi_{i}}\right)$ as follows:

Q_{i,k+1}^{\pi_{i}}(s_{i},a_{i})=Q_{i,k}^{\pi_{i}}(s_{i},a_{i})+\eta^{k}\Big[\mathcal{R}_{i}(s_{i},a_{i})+\gamma\mathbb{E}_{s_{i}^{\prime}\sim\mathcal{P}_{i}}\{\max_{a_{i}^{\prime}}Q_{i,k}^{\pi_{i}}(s_{i}^{\prime},a_{i}^{\prime})\}-Q_{i,k}^{\pi_{i}}(s_{i},a_{i})\Big]+\frac{\eta^{k}\lambda}{2}\nabla\varphi_{i}, \qquad (6)

where $\nabla\varphi_{i}=2\left(Q_{i}^{\pi_{i}}(s_{i},a_{i})-Q_{i}^{\pi}(s_{i},a_{i})\right)$ and $\eta^{k}$ is a learning rate. However, it is not straightforward to update $Q_{i}^{\pi_{i}}$ based on (6), since the system transition function $\mathcal{P}_{i}$ is usually unknown. As in most model-free RL algorithms, we need to sample transition data to update the Q-values. We hence leverage several samples $(s_{i},a_{i},r_{i},s_{i}^{\prime})$ to obtain an approximated Q-table $\widetilde{Q}_{i}^{\pi_{i}}$ as follows:

\widetilde{Q}_{i,k+1}^{\pi_{i}}(s_{i},a_{i})=\widetilde{Q}_{i,k}^{\pi_{i}}(s_{i},a_{i})+\eta^{k}\Big\{\mathcal{R}_{i}(s_{i},a_{i})+\gamma\max_{a_{i}^{\prime}}\widetilde{Q}_{i,k}^{\pi_{i}}(s_{i}^{\prime},a_{i}^{\prime})-\widetilde{Q}_{i,k}^{\pi_{i}}(s_{i},a_{i})\Big\}+\frac{\eta^{k}\lambda}{2}\nabla\varphi_{i}. \qquad (7)

The iterations in (6) and (7) continue until the maximum iteration number $K$ is reached. Then we have the following Theorem 1, which is proved in Appendix A-C. Theorem 1 indicates that the error introduced by using the approximated Q-table $\widetilde{Q}_{i}^{\pi_{i}}$ is small after sufficiently many iterations. We hence set $Q_{i}^{\pi_{i}}(s_{i},a_{i})\leftarrow\widetilde{Q}_{i,K}^{\pi_{i}}(s_{i},a_{i})$.

Theorem 1

For a large enough $K$, there exists a small $\delta$ such that

\left|Q_{i,K}^{\pi_{i}}(s_{i},a_{i})-\widetilde{Q}_{i,K}^{\pi_{i}}(s_{i},a_{i})\right|\leq\delta,\quad\forall s_{i},a_{i},i.

Once $Q_{i}^{\pi_{i}}(s_{i},a_{i})$ is available, we update the auxiliary Q-values $Q_{i}^{\pi}$ in (4) as follows:

Q_{i}^{\pi}(s_{i},a_{i})=Q_{i}^{\pi}(s_{i},a_{i})-\eta\nabla\mathcal{L}_{i}(Q_{i}^{\pi}(s_{i},a_{i}))=Q_{i}^{\pi}(s_{i},a_{i})+\eta\lambda\big[Q_{i}^{\pi_{i}}(s_{i},a_{i})-Q_{i}^{\pi}(s_{i},a_{i})\big], \qquad (8)

where $\eta$ is a learning rate.

We solve these two subproblems alternately until the maximum number $R$ of iterations is reached, and then update the meta Q-values. Note that since $s\in\mathcal{S}=\cup_{i\in N}\mathcal{S}_{i}$ and $a\in\mathcal{A}=\cup_{i\in N}\mathcal{A}_{i}$, the states and actions in the meta Q-table may not necessarily appear in a specific personalized Q-table. Hence all the auxiliary meta Q-tables $Q_{i,R}^{\pi}(s_{i},a_{i}),i=1,\ldots,N$, are collected to perform the following aggregation:

\hat{Q}^{\pi}(s,a)=(1-\beta)Q^{\pi}(s,a)+\frac{\beta}{N_{s}}\sum_{i=1}^{N}Q_{i}^{\pi}\cdot\mathbb{I}\{Q_{i}^{\pi}(s,a)\neq\emptyset\}, \qquad (9)

where $\beta$ is a weight parameter, $\mathbb{I}\{\cdot\}$ is the indicator operator, and $N_{s}$ denotes the number of tasks that satisfy $Q_{i}^{\pi}(s,a)\neq\emptyset$, i.e., $|\mathbb{I}\{Q_{i}^{\pi}(s,a)\neq\emptyset\}|$. We repeat the above process until the meta-policy converges or the maximum number of iterations $C$ is reached.
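A minimal tabular sketch of the loop formed by (7), (8) and (9) is given below. This is our illustration rather than the authors' implementation: the Gym-style environment interface, the dictionary-based Q-tables, the $\epsilon$-greedy exploration and the default constants are assumptions, and, following the closed-form solution used in the proof of Theorem 1, the $\frac{\lambda}{2}\varphi_{i}$ term is implemented as a pull of the personalized value toward the auxiliary meta value.

```python
import numpy as np
from collections import defaultdict

def pmeta_rl_tabular(envs, C=10, R=3, K=1, lam=10.0, eta=0.1, beta=1.0,
                     gamma=0.99, eps=0.3):
    """Sketch of tabular pMeta-RL: personalized update (7), auxiliary update (8)
    and meta aggregation (9). States are assumed discrete and hashable."""
    n = len(envs)
    Q_meta = defaultdict(float)                      # meta Q-table Q^pi
    Q_pers = [defaultdict(float) for _ in range(n)]  # personalized Q_i^{pi_i}
    for c in range(C):
        Q_aux = [dict(Q_meta) for _ in range(n)]     # auxiliary meta Q-tables Q_i^pi
        for i, env in enumerate(envs):
            for _ in range(R):
                s = env.reset()
                for _ in range(K):
                    acts = list(range(env.action_space.n))
                    if np.random.rand() < eps:       # epsilon-greedy exploration
                        a = int(np.random.choice(acts))
                    else:
                        a = max(acts, key=lambda u: Q_pers[i][(s, u)])
                    s2, rew, done, _ = env.step(a)
                    # Eq. (7): sampled Bellman error ...
                    td = rew + gamma * max(Q_pers[i][(s2, u)] for u in acts) \
                         - Q_pers[i][(s, a)]
                    # ... plus the regularizer, pulling the personalized value
                    # toward the auxiliary meta value
                    pull = lam * (Q_aux[i].get((s, a), 0.0) - Q_pers[i][(s, a)])
                    Q_pers[i][(s, a)] += eta * (td + pull)
                    # Eq. (8): auxiliary meta value moves toward the personalized one
                    Q_aux[i][(s, a)] = Q_aux[i].get((s, a), 0.0) + eta * lam * (
                        Q_pers[i][(s, a)] - Q_aux[i].get((s, a), 0.0))
                    s = env.reset() if done else s2
        # Eq. (9): aggregate over the tasks that visited each (s, a) pair
        for key in set().union(*[set(q) for q in Q_aux]):
            vals = [q[key] for q in Q_aux if key in q]
            Q_meta[key] = (1 - beta) * Q_meta[key] + beta * float(np.mean(vals))
    return Q_meta, Q_pers
```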

IV-B Theoretical Analysis

In this subsection, we present the convergence analysis of our pMeta-RL algorithm. First, we present the following assumption, which quantifies the diversity of the reward and transition function among tasks.

Assumption 1

For any state-action pair $(s,a)$, the reward function of each task is bounded by $\mathcal{R}_{i}(s,a)-\frac{1}{N}\sum_{j=1}^{N}\mathcal{R}_{j}(s,a)\leq\sigma_{1,i}$, while the transition function is bounded by $\mathcal{P}_{i}(s^{\prime}|s,a)-\frac{1}{N}\sum_{j=1}^{N}\mathcal{P}_{j}(s^{\prime}|s,a)\leq\sigma_{2,i}$.

Then we have the following Theorem 2 based on Assumption 1, which is proved in Appendix A-D.

Theorem 2

Let Assumption 1 hold. When $\eta\leq\frac{\tilde{\eta}}{\beta R}$ and

\tilde{\eta}L\left(\frac{3}{2}+\frac{12}{\lambda^{2}-8}+\frac{24\lambda^{2}}{\lambda^{2}-8}\right)\leq\frac{1}{4},

we have

(a)\; \mathbb{E}\left[\left\|\nabla\mathcal{L}(Q^{\pi,*})\right\|^{2}\right]\leq\mathcal{O}\Bigg(\frac{\Delta}{\hat{\eta}_{2}C}+\frac{\Delta^{\frac{2}{3}}\left(96L^{2}(\lambda^{2}\delta^{2}+\sigma_{2}^{2})\right)^{\frac{1}{3}}}{\left(\beta C\right)^{\frac{2}{3}}}+\frac{\left(3\Delta L\sigma_{2}^{2}\right)^{\frac{1}{2}}}{\sqrt{C}}+3\lambda^{2}\delta^{2}\Bigg),

(b)\; \sum_{i=1}^{N}\mathbb{E}\left[\left\|\hat{Q}_{i,c}^{\pi_{i}}-Q_{c}^{\pi}\right\|^{2}\right]\leq\mathcal{O}\left(\mathbb{E}\left[\left\|\nabla\mathcal{L}(Q^{\pi,*})\right\|^{2}\right]\right)+\mathcal{O}\left(\delta^{2}+\frac{2\sigma_{2}^{2}}{\lambda^{2}}\right),

where $L$ is the smoothness constant of $\mathcal{L}(\pi)$, $\Delta=\mathcal{L}(Q^{\pi}_{0})-\mathcal{L}^{*}$, and $\sigma_{2}^{2}=\frac{2\lambda^{2}}{\lambda^{2}-8}\sigma^{2}+\frac{2\lambda^{2}\gamma}{(\lambda^{2}-8)(1-\gamma)}\mathcal{R}_{\text{max}}$ with $\sigma^{2}=\frac{1}{N}\sum_{i=1}^{N}\left(\sigma_{1,i}+\frac{\sigma_{2,i}\gamma}{1-\gamma}\mathcal{R}_{\text{max}}\right)^{2}$.

Remark 1

Theorem 2(a) shows the convergence of the meta-policy. The first term is caused by the initial error $\Delta$ and decreases linearly as the number of training iterations increases. The second and third terms are due to task drift and the initial error $\Delta$, respectively, and also decrease as training proceeds. Finally, the last term shows that the gradient of the optimal policy model can be close to zero when $\delta$ is small enough. Theorem 2(b) describes an upper bound on the distance between the personalized policies and the meta-policy. The first term reflects a convergence rate consistent with that of the meta-policy. The $\sigma^{2}$ in the second term indicates that task diversity can increase the upper bound, while a larger regularization factor $\lambda$ strengthens the connection to the meta-policy.

IV-C Deep Personalized Meta-Reinforcement Learning

In this subsection, we extend the meta-policy iteration algorithm to a deep learning version with neural network function approximators. Taking the actor model $\phi_{i}$ as an example, deep pMeta-RL updates the personalized policy by solving the following two subproblems:

\hat{\phi}_{i}=\arg\min_{\phi_{i}}J_{\pi_{i}}(\phi_{i})+\frac{\lambda}{2}\varphi_{i}(\phi_{i},\omega), \qquad (10)
\hat{\omega}_{i}=\arg\min_{\omega_{i}}\frac{\lambda}{2}\varphi_{i}(\phi_{i},\omega_{i}), \qquad (11)

where $\omega_{i}$ is the auxiliary model, which is used to update the meta-model $\omega$.

To solve (10), similar to Theorem 1, we sample a mini-batch of data $D_{i}$ to obtain an approximation of $\nabla J_{\pi_{i}}(\phi_{i})$ by computing $\mathbb{E}\left[\nabla J_{\pi_{i}}(\phi_{i};D_{i})\right]$. Then, the personalized policy is updated by stochastic gradient descent as

\hat{\phi}_{i}=\phi_{i}-\alpha\Big(\nabla J_{\pi_{i}}(\phi_{i};D_{i})+\frac{\lambda}{2}\nabla\varphi_{i}(\phi_{i},\omega_{i})\Big), \qquad (12)

where we use the $\ell_{2}$-norm constraint such that $\nabla\varphi_{i}(\phi_{i},\omega_{i})=2(\phi_{i}-\omega_{i})$, and $\alpha$ is the personalized policy learning rate.

For subproblem (11), $\omega_{i}$ is updated by

\hat{\omega}_{i}=\omega_{i}-\eta\nabla F_{i,\pi}(\omega_{i})=\omega_{i}-\eta\lambda\left(\omega_{i}-\phi_{i}\right), \qquad (13)

where $\eta$ denotes the auxiliary model learning rate. The critic and inference models are updated similarly, which we omit here for brevity.
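For concreteness, a hedged PyTorch-style sketch of one personalized/auxiliary round, i.e., (12) followed by (13), is given below for the actor model. Here `actor_loss_fn` stands for the SAC actor objective $J_{\pi_{i}}(\phi_{i};D_{i})$ and is an assumed callable, the auxiliary model is kept as a list of tensors, and the plain SGD steps and default constants are our simplifications rather than the exact training setup.

```python
import torch

def personalized_actor_step(actor_i, omega_i, batch, actor_loss_fn,
                            lam=15.0, alpha_lr=1e-3, eta_lr=1e-3):
    """One alternating update: (12) for the personalized actor phi_i,
    then (13) for the auxiliary model omega_i (a list of tensors)."""
    # Eq. (12): gradient step on J_pi_i + (lambda/2) * ||phi_i - omega_i||^2
    loss = actor_loss_fn(actor_i, batch)
    for p, w in zip(actor_i.parameters(), omega_i):
        loss = loss + 0.5 * lam * torch.sum((p - w.detach()) ** 2)
    actor_i.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in actor_i.parameters():
            p -= alpha_lr * p.grad
    # Eq. (13): move the auxiliary model toward the personalized model
    with torch.no_grad():
        for p, w in zip(actor_i.parameters(), omega_i):
            w -= eta_lr * lam * (w - p)
    return loss.item()
```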

After $R$ iterations, the meta-policy is updated as follows:

\omega^{c+1}=(1-\beta)\omega^{c}+\frac{\beta}{N}\sum_{i=1}^{N}\omega_{i,R}^{c}, \qquad (14)

where $\beta$ is a weight parameter and $c$ indexes the current meta-policy update round. Finally, we obtain a well-trained meta-policy $(\omega^{C},\vartheta^{C},\varpi^{C})$. We summarize the pMeta-RL and deep pMeta-RL algorithms in Algorithm 1 for clarity.

Algorithm 1 pMeta-RL / Deep pMeta-RL Algorithms

Input: $C, R, \lambda, \eta, \beta, \omega^{0}, \vartheta^{0}, \varpi^{0}$ and a set of training tasks $\{\mathcal{T}_{i}\}_{i=1\cdots N}$ from $p(\mathcal{T})$.

1:  for $c=1,\cdots,C$ do
2:     for each $\mathcal{T}_{i}$ do {For deep pMeta-RL only}
3:        Initialize context $\mathbf{c}_{i}=\{\}$
4:        Sample $\mathbf{z}\sim q_{\zeta_{i}}(\mathbf{z}|\mathbf{c}_{i})$
5:        Update $\mathbf{c}_{i}=\{(s_{j},a_{j},s_{j}^{\prime},r_{j})\}_{j=1\cdots T}\sim\mathcal{D}_{i}$
6:     end for
7:     Set the meta-policy $Q^{\pi}/(\omega^{c},\vartheta^{c},\varpi^{c})$ as the personalized policy for each task
8:     for each $\mathcal{T}_{i}$ in parallel do
9:        for $r=1,\cdots,R$ do
10:           Personalized policy update: update $Q_{i}^{\pi_{i}}/(\phi_{i,r}^{c},\theta_{i,r}^{c},\zeta_{i,r}^{c})$ by (7) or (12)
11:           Auxiliary meta-policy update: update $Q_{i}^{\pi}/(\omega_{i,r}^{c},\vartheta_{i,r}^{c},\varpi_{i,r}^{c})$ by (8) or (13)
12:        end for
13:     end for
14:     Update the meta-policy $Q^{\pi}/(\omega^{c},\vartheta^{c},\varpi^{c})$ by (9) or (14)
15:  end for
16:  Output: meta-policy $Q^{\pi}/(\omega,\vartheta,\varpi)$, personalized policies $Q_{i}^{\pi_{i}}/(\phi_{i},\theta_{i},\zeta_{i}),i=1,\cdots,N$
Remark 2

Algorithm 1 integrates two algorithms, pMeta-RL and deep pMeta-RL. In the tabular setting, our pMeta-RL algorithm updates $Q_{i}^{\pi}$ without using lines 3-7 in Algorithm 1. Note that our personalization method is a plug-and-play module that can be directly combined with many RL algorithms (e.g., DQN, SAC). For example, each task learns a SAC policy whose actor and critic models are updated according to lines 9-11. The vector $\mathbf{z}$ used to distinguish different tasks can be replaced by a task embedding vector, such as a one-hot vector (a minimal sketch is given after this remark). Therefore, we can learn a policy (line 15) to solve multiple distinct tasks in a decentralized manner.
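As a small illustration of the one-hot embedding mentioned above (our sketch; the layer sizes are arbitrary), the task index can simply be concatenated to the state-action input of a shared Q-network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskConditionedQ(nn.Module):
    """Q-network conditioned on a one-hot task embedding instead of an
    inferred latent z, as suggested in Remark 2 (illustrative sizes)."""
    def __init__(self, state_dim, action_dim, num_tasks, hidden=256):
        super().__init__()
        self.num_tasks = num_tasks
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + num_tasks, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, task_id):
        one_hot = F.one_hot(task_id, self.num_tasks).float()
        return self.net(torch.cat([state, action, one_hot], dim=-1))
```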

Deep pMeta-RL in Algorithm 1 provides a general meta-training procedure, and the meta-testing phase is consistent with that of PEARL. By training an inference network, deep pMeta-RL can estimate the state-action value function $Q_{\vartheta}(s,a,\mathbf{z})$ under different tasks and generalize to unseen tasks.
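To illustrate the inference-network component, a simplified sketch is given below (our illustration, not the exact architecture used in the paper or in PEARL): each transition in the context is encoded independently and mean-pooled into the parameters of a Gaussian over $\mathbf{z}$, from which $\mathbf{z}$ is sampled with the reparameterization trick.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Simplified inference network q_zeta(z | c): encode each transition
    (s, a, r, s') in the context, mean-pool, and output a diagonal Gaussian."""
    def __init__(self, transition_dim: int, latent_dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),   # mean and log-variance
        )
        self.latent_dim = latent_dim

    def forward(self, context):                  # context: [num_transitions, transition_dim]
        stats = self.mlp(context).mean(dim=0)    # permutation-invariant pooling
        mean, log_var = stats[:self.latent_dim], stats[self.latent_dim:]
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * log_var)  # reparameterization
        return z, mean, log_var
```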

V Experiment Results

V-A Environments Setting

For the Gym suite [35], we consider two classical control environments, “CartPole” and “MountainCar” (the IDs of these environments in the OpenAI Gym library are CartPole-v0 and MountainCar-v0). For each environment, we modify its physical parameters to induce different transition functions as different tasks.

TABLE I: The system parameters of different tasks on the Gym suites.
Environments Parameters task1 task2 task3 task4 task5
CartPole force 20.0 1.0 10.0 10.0 10.0
masspole 0.1 0.1 1.0 0.01 0.1
lengthpole 0.5 0.5 0.5 0.5 0.05
MountainCar force 0.001 0.001 0.001 0.001 0.001
gravity 0.0025 0.0025 0.0025 0.0025 0.0025
inclination $\zeta$ 3.0 3.5 4.0 4.5 5.0

CartPole: The goal of CartPole is to balance a pole on top of a cart. The agent observes the position and velocity of the cart and the angle and angular velocity of the pole, a four-dimensional continuous space. The action space consists of pushing the cart left or right. The episode ends when the pole falls over or the cart moves out of bounds. The system parameters are the force, the mass of the pole (masspole) and the length of the pole (lengthpole).

MountainCar: The goal of MountainCar is to reach a specific position on a one-dimensional track between two mountains. The agent learns to drive back and forth to build up enough momentum for the car to reach its goal. The states are the position $\dot{p}$ and velocity $\dot{v}$ of the car, and the actions consist of pushing left, pushing right and no push. The system transition function can be written as

\dot{v}_{t+1}=\dot{v}_{t}+(a-1)\hat{F}-\hat{G}\cos(\zeta\dot{p}),\qquad\dot{p}_{t+1}=\dot{p}_{t}+\dot{v}_{t+1}, \qquad (15)

where $a=0,1,2$ represents pushing left, no push and pushing right, respectively, $\hat{F}$ is the magnitude of the force, $\hat{G}$ is the gravity, and $\zeta$ controls the inclination of the track. We modify $\zeta$ to generate different tasks, as shown in Table I.
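For reference, a minimal sketch of one step of the modified transition (15) is shown below (our illustration; clipping the velocity and position to the standard Gym MountainCar bounds is an assumption not stated in the text):

```python
import numpy as np

def mountaincar_step(p, v, a, F_hat=0.001, G_hat=0.0025, zeta=3.0):
    """One transition of the modified MountainCar dynamics, Eq. (15).

    a in {0, 1, 2} maps to push left, no push, push right via (a - 1).
    zeta controls the inclination of the track and is varied across tasks
    (see Table I). The clipping bounds below follow the standard Gym version.
    """
    v_next = v + (a - 1) * F_hat - G_hat * np.cos(zeta * p)
    v_next = np.clip(v_next, -0.07, 0.07)
    p_next = np.clip(p + v_next, -1.2, 0.6)
    return p_next, v_next
```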

For the MuJoCo suite, we consider the same meta-learning domains as [36, 12].

Ant-Fwd-Back: Move forward or backward (2 train tasks, 2 test tasks).

Half-Cheetah-Dir: Move forward or backward (2 train tasks, 2 test tasks).

Half-Cheetah-Vel: Reach a target velocity (10 train tasks, 3 test tasks).

Walker-2D-Params: The agent is randomly initialized with different system dynamics parameters and must keep walking (10 train tasks, 3 test tasks).

TABLE II: The hyperparameters of pMeta-RL on CartPole and MountainCar.
Hyperparameters                Value
CartPole MountainCar
Learning rate 1e-3 1e-3
Hidden layer size [512 512] [512 512]
Batch size $D$ 64 64
Replay buffer size 2e5 2e5
Meta-policy update round $C$ 30 30
Personalized update round $R$ 200 500
Action selector $\epsilon$-greedy $\epsilon$-greedy
$\epsilon$-start 0.3 0.3
$\epsilon$-finish 0.01 0.01
Target update interval 350 350
Regularization coefficient $\lambda$ 20 20
Discount factor $\gamma$ 0.99 0.99
TABLE III: The hyperparameters of pMeta-RL on the MuJoCo suite.
Hyperparameters                      Value
Ant-Fwd-Back Half-Cheetah-Dir Half-Cheetah-Vel Walker-2D-Params
Actor Learning rate 1e-3 1e-3 1e-3 1e-3
Critic Learning rate 1e-3 1e-3 1e-3 1e-3
Actor Hidden layer size [512 512] [512 512] [512 512] [512 512]
Critic Hidden layer size [512 512] [512 512] [512 512] [512 512]
Batch size $D$ 256 256 256 256
Replay buffer $\mathcal{D}_{i}$ size 1e5 1e5 2e4 2e4
Meta-policy update round $C$ 200 300 100 100
Personalized update round $R$ 5000 4000 2000 2000
Total timesteps 2e6 2.4e6 2e6 4e6
Regularization coefficient $\lambda$ 15 15 20 20
Target smoothing coefficient 0.005 0.005 0.005 0.005
Discount factor $\gamma$ 0.99 0.99 0.99 0.99

V-B Experiments Details

We implement our personalized method based on DQN with a prioritized experience replay buffer; all the details can be found in Table II. Moreover, our code is based on the Tianshou [37] framework (https://github.com/thu-ml/tianshou), and all experiments were conducted on an NVIDIA Quadro RTX 6000 GPU.

We now present the hyperparameter settings for the Gym and MuJoCo tasks. Table II shows the hyperparameters used on CartPole and MountainCar, which are largely the same as in [38]. Table III shows the hyperparameters used on the MuJoCo suite, which are largely the same as in [10, 12]. For all compared algorithms, we set the number of training steps to be the same as that used in our pMeta-RL. The remaining hyperparameters of the compared algorithms are consistent with the settings recommended in their papers [10, 12].

V-C Performance of pMeta-RL: A Warm-Up

Figure 2: Average returns on the simple Gridworld and the Gym suite. (Solid lines: the average over 5 seeds; shaded areas: the standard deviation over 5 seeds.) MP and PP denote the meta-policy and personalized policy, respectively. Top: average returns under different algorithms and different $\lambda$. Bottom: average returns of different algorithms on the Gym suite [35].

To evaluate the performance of the proposed pMeta-RL, we construct a simple Gridworld environment with a finite state-action space based on [39]. In this Gridworld environment, the goal of each task is to reach a landmark in the grid, and the reward is set as the distance between the agent and the landmark. The number of tasks is $N=5$, and the tasks vary in grid size. We use model averaging of the Q-values as the baseline, whose performance has been validated in [38]; this method only performs weighted average aggregation on the Q-tables of each task to obtain a multi-task Q-table. We set $C=10, R=3, K=1, \lambda=10$ in the experiments. Figure 2 shows the convergence of pMeta-RL: the average returns obtained by the personalized policy and the meta-policy are better than those obtained by the model-averaging method. Note that the performance of the model-averaging method can gradually approach that of the personalized model as the number of training epochs increases. This is because the multi-task Q-table also retains all the state-action values of each task, which can adapt to different tasks. Furthermore, we directly extend the Q-learning-based approach to DQN and test its performance in the Gym environments, whose tasks vary in the transition function. From Figure 2, we find that the personalized policy performs better than averaged-DQN, which does not adapt well to distinct tasks.

Figure 2 also shows the performance of pMeta-RL in the Gridworld environment for different values of $\lambda$, where we set $C=10, R=3, K=1$. When $\lambda=10$, pMeta-RL achieves the best performance. However, a larger $\lambda$ may degrade performance, which means this parameter needs to be chosen carefully for each environment.

V-C1 Effects of regularization $\lambda$

Table IV shows the performance of pMeta-RL with different values of $\lambda$, where we set $|D_{i}|=64, C=30, \alpha=0.001, \eta=0.001, \beta=1$. In our experiments, we found that a proper $\lambda$ achieves better performance, while a very large $\lambda$ may hurt performance. Therefore, $\lambda$ should be carefully chosen for each environment. We hence choose $\lambda=20$ for “CartPole” and “MountainCar” in the remaining experiments.

TABLE IV: Effect of regularization $\lambda$
Average Return of pMeta-RL
CartPole Change $\lambda$ $\lambda=5$ $\lambda=20$ $\lambda=30$ $\lambda=50$
MP 185.11 ± 2.52 192.54 ± 1.56 191.51 ± 1.23 190.86 ± 2.09
PP 196.47 ± 0.89 197.13 ± 0.45 196.53 ± 1.03 196.72 ± 0.68
MountainCar Change $\lambda$ $\lambda=5$ $\lambda=20$ $\lambda=30$ $\lambda=50$
GM -111.15 ± 4.31 -107.08 ± 3.21 -110.49 ± 3.89 -108.41 ± 2.78
PM -93.05 ± 1.07 -92.13 ± 0.95 -92.41 ± 1.25 -98.16 ± 2.13
TABLE V: Best average return comparison results for each algorithm on the Gym suites.
Algorithms CartPole MountainCar
DQN* 198.05 ± 1.03 -92.96 ± 2.08
DQN 169.20 ± 2.89 -133.96 ± 3.25
Model-Averaging 191.92 ± 1.36 -111.34 ± 5.95
pMeta-RL (MP) 192.54 ± 2.54 -107.15 ± 4.10
pMeta-RL (PP) 197.13 ± 0.45 -93.05 ± 0.95

V-C2 Ablation study

We compare the performance of the proposed personalization method with DQN [3] and Averaged-DQN [38]. Table V shows the best average reward achieved by each algorithm. $\text{DQN}^{*}$ denotes the average reward obtained by the optimal DQN policy for each task, while DQN denotes the average reward achieved by a single DQN policy trained on multiple tasks. We find that our personalized policy is closer to the optimal policy, and our meta-policy is better than the other algorithms. Figure 3 shows that the personalized policy is better suited to its specific task than the meta-policy, and also illustrates the generalization ability of the personalized policy under the constraint of the meta-policy: it can generalize to other tasks while completing its corresponding task.

Figure 3: Visualization of policies and trajectories on MountainCar. The orange and red lines represent the trajectories on task 1 and task 2, respectively. The heatmaps from left to right correspond to the meta-policy, the personalized policy for task 1 and the optimal policy for task 1. Left: The car using the meta-policy can reach the goals (i.e., position $\geq 0.5$) on both tasks 1 and 2, but it takes more time steps on task 1. Middle: The car using the personalized policy can reach the goal quickly on task 1, but more slowly on task 2. Right: The optimal policy plans the optimal trajectory for the car on task 1, but it is unable to complete task 2.

V-D Performance of Deep pMeta-RL

Figure 4: Comparison results of the average returns of different meta-RL algorithms on the MuJoCo suites. The dotted line represents the final performance.

To evaluate the proposed deep pMeta-RL algorithm, we conducted extensive experiments on the continuous control tasks in the MuJoCo suite [40], which are benchmarks commonly used by meta-RL algorithms. These tasks vary in either the reward function (target velocity for Half-Cheetah-Vel, and walking direction for Half-Cheetah-Fwd-Back and Ant-Fwd-Back) or the transition function (Walker-Params). We compare deep pMeta-RL against several representative meta-RL algorithms, including PEARL [10], ProMP [16], MetaCURE [12] and MAML-TRPO [8]. We also compare against the recurrence-based policy gradient $\text{RL}^{2}$ method, as in [10]. For a fair comparison, we set the maximum episode length to 200 for all the above tasks, which is the same as in PEARL. In addition, we set the same number of training steps per epoch for all the above algorithms.

Figure 4 shows that the personalized policies of deep pMeta-RL based on SAC outperform prior meta-RL methods across all domains in terms of average return. The meta-policy achieves performance comparable to other algorithms, albeit with a drop in the Half-Cheetah-Vel environment. This is likely due to the larger number of tasks in this environment and the higher similarity between tasks, which weakens the personalization ability and slows the convergence of the meta-policy. Furthermore, we observe that MetaCURE is unable to obtain reasonable results on Walker-2D-Params under the same setting. MetaCURE is an efficient algorithm for solving sparse-reward problems; however, in the dense-reward setting, the intrinsic rewards proposed in the original paper may conflict with the dense rewards, resulting in performance degradation. The average returns obtained using the personalized policies are much better than those of other meta-RL algorithms (e.g., $12.59\%$ and $11.86\%$ higher than PEARL in the Ant-Fwd-Back and Cheetah-Fwd-Back environments, respectively), with only slight gains in the Cheetah-Vel and Walker-2D-Params environments. We also found that PEARL obtains almost the same performance as the personalized policy on Walker-2D-Params. Summarizing the above, we infer that our personalization method is better suited to more distinct tasks, e.g., navigating in exactly opposite directions.

Figure 5: Trajectory visualization on the Sparse-Point-Robot environment. The robot needs to navigate to goals on a semicircle (dark blue, or other goals in light blue). We compare our personalized policy with PEARL under different contexts $\mathbf{c}_{1:5}$ and $\mathbf{c}_{1:15}$.

We visualize the trajectories to illustrate the superiority of the personalized policy on Sparse-Point-Robot, a 2D navigation task in which a point robot must navigate to different goal locations on the edge of a semicircle. A reward is given only when the robot is within a certain radius of the goal, which is set to 0.2 in our experiments. By randomly sampling 10 goals, we compare against PEARL with different contexts $\mathbf{c}_{1:5}$ and $\mathbf{c}_{1:15}$, whose subscripts denote the number of trajectories contained in the context. Specifically, the first trajectory is collected with the probabilistic context variable $\mathbf{z}$ sampled from the prior $p(\mathbf{z})$, and the subsequent trajectories are collected with $\mathbf{z}\sim q_{\zeta}(\mathbf{z}|\mathbf{c})$, where the context is aggregated over all collected trajectories. As shown in Figure 5, we observe that the agent with the personalized policy can navigate to more targets than PEARL and achieves higher returns. Moreover, the robot navigates more accurately with more context information; for example, a robot making decisions with $\mathbf{c}_{1:15}$ can navigate to the center of the goals, while one using $\mathbf{c}_{1:5}$ may overshoot the goal radius area.

Figure 6: Performance of the meta-policy on unseen tasks

We also validate the performance of the meta-policy on unseen tasks in Figure 6, which is almost consistent with the performance in Figure 4. We find that the meta-policy adapts well to multiple tasks in most environments, except for slower convergence on Half-Cheetah-Vel. Combining Figures 4 and 6, we conclude that the performance of the personalized policy is promising and that our algorithm is well suited to decentralized multi-task learning: each task maintains its personalized policy while learning a meta-policy for multiple tasks without sharing trajectories.

VI Conclusions

In this paper, we propose a personalization approach for meta-RL to solve the gradient conflict problem, which learns a meta-policy and personalized policies for all tasks and specific tasks, respectively. By adopting a personalization constraint in the objective function, our algorithm encourages each task to pursue its personalized policy around the meta-policy under both the tabular and deep network settings. We introduce an auxiliary policy to decouple the personalized and meta-policy learning processes and propose an alternating minimization method for policy improvement. Moreover, theoretical analysis shows that our algorithm converges approximately linearly in the number of iterations and gives an upper bound on the difference between the personalized policies and the meta-policy. Experimental results demonstrate that pMeta-RL outperforms many advanced meta-RL algorithms on continuous control tasks.

Appendix A Proof of Main Theorems

A-A Some Useful Results

Proposition 1

[Jensen’s inequality] For any vectors $x_{i}\in\mathbb{R}^{d},i=1,\cdots,M$, we have

\left\|\sum_{i=1}^{M}x_{i}\right\|^{2}\leq M\sum_{i=1}^{M}\left\|x_{i}\right\|^{2}.
Proposition 2

[Jensen’s inequality] For any vectors $x_{1},x_{2}\in\mathbb{R}^{d}$, we have

\mathbb{E}\left[\left\|x_{1}-x_{2}\right\|_{2}^{2}\right]\leq\left(1+\frac{1}{R}\right)\mathbb{E}\left[\left\|x_{1}\right\|_{2}^{2}\right]+\left(1+R\right)\mathbb{E}\left[\left\|x_{2}\right\|_{2}^{2}\right],

for any constant $R$.

Lemma 1

[[41]] The random process $\{\Delta_{t}\}$ taking values in $\mathbb{R}^{n}$ and defined as

\Delta_{t+1}(x)=(1-\alpha_{t}(x))\Delta_{t}(x)+\alpha_{t}(x)M_{t}(x)

converges to zero with probability 1 under the following assumptions:

  • $0\leq\alpha_{t}\leq 1$, $\sum_{t}\alpha_{t}(x)=\infty$ and $\sum_{t}\alpha_{t}^{2}(x)<\infty$;

  • $\left\|\mathbb{E}\left[M_{t}(x)|\mathcal{F}_{t}\right]\right\|_{W}\leq\gamma\left\|\Delta_{t}\right\|_{W}$, with $\gamma<1$;

  • $\text{var}\left[M_{t}(x)|\mathcal{F}_{t}\right]\leq C(1+\left\|\Delta_{t}\right\|^{2}_{W})$, for $C>0$.

where $\mathcal{F}_{t}=\left\{\Delta_{t},\Delta_{t-1},\cdots,M_{t-1},\cdots,\alpha_{t-1},\cdots\right\}$ stands for the past at step $t$, and the notation $\left\|\cdot\right\|_{W}$ refers to some weighted maximum norm.

A-B Some Important Lemmas

Lemma 2

For each state-action pair, the meta Q-values are updated as

Q_{c+1}^{\pi}(s,a)=Q_{c}^{\pi}(s,a)-\tilde{\eta}Z_{c}(s,a),

where $\tilde{\eta}=\eta\beta R$ and

Z_{c}(s,a)=\frac{1}{NR}\sum_{i=1}^{N}\sum_{r=0}^{R-1}Z_{i,c,r}(s,a), \qquad (16)

with $Z_{i,c,r}(s,a)=\lambda(Q_{i,c,r}^{\pi}(s,a)-Q_{i,c,r+1}^{\pi_{i}}(s,a))$.

Lemma 3

Let Assumption 1 hold. For any $\tilde{Q}_{i}$ and $s\in\mathcal{S},a\in\mathcal{A}$, the variance over tasks is bounded as

\frac{1}{N}\sum_{i=1}^{N}\Big\|\mathcal{R}_{i}(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathcal{P}_{i}(s^{\prime}|s,a)}\big[\max_{a^{\prime}}\tilde{Q}_{i}(s^{\prime},a^{\prime})\big]-\frac{1}{N}\sum_{j=1}^{N}\Big(\mathcal{R}_{j}(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathcal{P}_{j}(s^{\prime}|s,a)}\big[\max_{a^{\prime}}\tilde{Q}_{i}(s^{\prime},a^{\prime})\big]\Big)\Big\|^{2}\leq\sigma^{2}.
Lemma 4

If Assumption 1 holds, we have

\frac{1}{N}\sum_{i=1}^{N}\left\|\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))-\nabla\mathcal{L}(Q_{c}^{\pi}(s,a))\right\|^{2}\leq\frac{8}{\lambda^{2}-8}\left\|\nabla\mathcal{L}(Q_{c}^{\pi}(s,a))\right\|^{2}+2\sigma_{2}^{2},

where $\sigma_{2}^{2}=\frac{2\lambda^{2}}{\lambda^{2}-8}\sigma^{2}+\frac{2\lambda^{2}\gamma}{(\lambda^{2}-8)(1-\gamma)}\mathcal{R}_{\text{max}}$.

Lemma 5

The task drift error satisfies

\frac{1}{NR}\sum_{i,r}^{N,R}\mathbb{E}\big[\left\|Z_{i,c,r}(s,a)-\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))\right\|^{2}\big]\leq 2\lambda^{2}\delta^{2}+\frac{32L^{2}\tilde{\eta}^{2}}{\beta^{2}}\left(\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\big[\left\|\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))\right\|\big]+2\lambda^{2}\delta^{2}\right).

A-C Proof of Theorem 1

We first prove that the update in (6) is a contraction mapping. By setting the derivative of $\Xi_{i}$ to 0, we obtain the one-step optimal solution $Q_{i,c,r+1}^{\pi_{i}}$ as

Q_{i,c,r+1}^{\pi_{i}}(s,a)=\frac{\lambda}{1+\lambda}Q_{i,c,r}^{\pi}(s,a)+\frac{1}{1+\lambda}\Big(\mathcal{R}_{i}(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathcal{P}(s^{\prime}|s,a)}\big[\max_{a^{\prime}}Q_{i,c,r}^{\pi_{i}}(s^{\prime},a^{\prime})\big]\Big)

for each state-action pair $(s,a)$.

To simplify notation, we omit the subscripts $c$, $r$ and the superscript $\pi_{i}$ from now on. We define an additional Bellman backup operator as

\hat{\mathcal{B}}Q_{i}(s,a):=\frac{\lambda}{1+\lambda}Q^{\pi}(s,a)+\frac{1}{1+\lambda}\Big(\mathcal{R}_{i}(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathcal{P}(s^{\prime}|s,a)}\Big[\max_{a^{\prime}}Q_{i}(s^{\prime},a^{\prime})\Big]\Big), \qquad (17)

and a norm on $Q_{i}$ values as $\big\|Q_{i}^{1}-Q_{i}^{2}\big\|:=\max_{s,a}\big|Q_{i}^{1}(s,a)-Q_{i}^{2}(s,a)\big|$.

Suppose that $\big\|Q_{i}^{1}-Q_{i}^{2}\big\|=\varepsilon$; then we have

\mathbb{E}_{s^{\prime}\sim\mathcal{P}(s^{\prime}|s,a)}\Big[\max_{a^{\prime}}Q_{i}^{1}(s^{\prime},a^{\prime})\Big]\leq\mathbb{E}_{s^{\prime}\sim\mathcal{P}(s^{\prime}|s,a)}\Big[\max_{a^{\prime}}\big(Q_{i}^{2}(s^{\prime},a^{\prime})+\varepsilon\big)\Big]\leq\mathbb{E}_{s^{\prime}\sim\mathcal{P}(s^{\prime}|s,a)}\Big[\max_{a^{\prime}}Q_{i}^{2}(s^{\prime},a^{\prime})\Big]+\varepsilon.

Therefore

\left\|\hat{\mathcal{B}}Q_{i}^{1}-\hat{\mathcal{B}}Q_{i}^{2}\right\|\leq\frac{\gamma}{1+\lambda}\varepsilon=\frac{\gamma}{1+\lambda}\left\|Q_{i}^{1}-Q_{i}^{2}\right\|, \qquad (18)

when $Q^{\pi}(s,a)$ is constant and $\frac{\gamma}{1+\lambda}<1$. Hence $\hat{\mathcal{B}}$ is a contraction, and there exists a fixed point $Q_{i}^{*}(s,a)$ such that $Q_{i}^{*}(s,a)=\hat{\mathcal{B}}Q_{i}^{*}(s,a)$.

We next prove that the update of $Q_{i}$ in (7) converges to this fixed point. We rewrite (7), ignoring the superscript, as

Q_{i}^{k+1}(s,a)=(1-\alpha(1+\lambda))Q_{i}^{k}(s,a)+\alpha\Big[\big(\mathcal{R}_{i}(s,a)+\gamma\max_{a^{\prime}}Q_{i}^{k}(s^{\prime},a^{\prime})\big)+\lambda Q^{\pi}(s,a)\Big]=(1-\hat{\alpha})Q_{i}^{k}(s,a)+\hat{\alpha}\Big[\frac{1}{1+\lambda}\big(\mathcal{R}_{i}(s,a)+\gamma\max_{a^{\prime}}Q_{i}^{k}(s^{\prime},a^{\prime})\big)+\frac{\lambda}{1+\lambda}Q^{\pi}(s,a)\Big],

where $\hat{\alpha}=\alpha(1+\lambda)$ and $k$ is the iteration index.

By subtracting the fixed point Q^{*}_{i}(s,a) from both sides and defining

\Delta_{i}^{k}(s,a)=Q_{i}^{k}(s,a)-Q^{*}_{i}(s,a),

we can obtain

\Delta_{i}^{k+1}(s,a)=(1-\hat{\alpha})\Delta_{i}^{k}(s,a)+\hat{\alpha}M_{i}^{k}(s,a), (19)

where

\begin{split}M_{i}^{k}(s,a)&=\frac{1}{1+\lambda}\left(\mathcal{R}_{i}(s,a)+\gamma\max_{a^{\prime}}Q_{i}^{k}(s^{\prime},a^{\prime})\right)\\ &\quad+\frac{\lambda}{1+\lambda}Q^{\pi}(s,a)-Q^{*}_{i}(s,a)\end{split}

with s^{\prime} being a random next state sampled from the Markov chain \mathcal{P}. Then we have

\begin{split}\mathbb{E}\left[M_{i}^{k}(s,a)|\mathcal{F}_{k}\right]&=\sum_{s^{\prime}\in\mathcal{S}}\mathcal{P}(s^{\prime}|s,a)\bigg{[}\frac{1}{1+\lambda}\bigg{(}\mathcal{R}_{i}(s,a)+\gamma\max_{a^{\prime}}Q_{i}^{k}(s^{\prime},a^{\prime})\bigg{)}\\ &\quad+\frac{\lambda}{1+\lambda}Q^{\pi}(s,a)-Q^{*}_{i}(s,a)\bigg{]}\\ &\overset{(a)}{=}\hat{\mathcal{B}}Q_{i}^{k}(s,a)-Q^{*}_{i}(s,a)\overset{(b)}{=}\hat{\mathcal{B}}Q_{i}^{k}(s,a)-\hat{\mathcal{B}}Q^{*}_{i}(s,a),\end{split}

where (a) uses the definition of \hat{\mathcal{B}} in (17) and (b) uses the fixed-point property of the contraction, i.e., Q_{i}^{*}(s,a)=\hat{\mathcal{B}}Q_{i}^{*}(s,a).

According to (18), we have

\begin{split}\left\|\mathbb{E}\left[M_{i}^{k}(s,a)|\mathcal{F}_{k}\right]\right\|_{\infty}&\leq\frac{\gamma}{1+\lambda}\left\|Q_{i}^{k}-Q^{*}_{i}\right\|_{\infty}=\frac{\gamma}{1+\lambda}\left\|\Delta_{i}^{k}\right\|_{\infty}.\end{split}

Moreover, the variance of M_{i}^{k}(s,a) can be written as

\begin{split}\text{var}\left[M_{i}^{k}(s,a)|\mathcal{F}_{k}\right]&=\mathbb{E}\left[\left(M_{i}^{k}(s,a)-\mathbb{E}\left[M_{i}^{k}(s,a)\right]\right)^{2}\right]\\ &=\mathbb{E}\Big{[}\Big{(}\frac{1}{1+\lambda}\big{(}\mathcal{R}_{i}(s,a)+\gamma\max_{a^{\prime}}Q_{i}^{k}(s^{\prime},a^{\prime})\big{)}\\ &\quad+\frac{\lambda}{1+\lambda}Q^{\pi}(s,a)-Q^{*}_{i}(s,a)-\hat{\mathcal{B}}Q_{i}^{k}(s,a)+Q^{*}_{i}(s,a)\Big{)}^{2}\Big{]}\\ &=\frac{1}{(1+\lambda)^{2}}\text{var}\Big{[}\big{(}\mathcal{R}_{i}(s,a)+\gamma\max_{a^{\prime}}Q_{i}^{k}(s^{\prime},a^{\prime})\big{)}\Big{|}\mathcal{F}_{k}\Big{]}.\end{split} (20)

Since \mathcal{R}_{i}(s,a) is bounded and Q^{\pi} is constant in each personalized iteration, there exists a constant C such that

\text{var}\left[M_{i}^{k}(s,a)|\mathcal{F}_{k}\right]\leq\frac{C}{(1+\lambda)^{2}}\left(1+\left\|\Delta_{i}^{k}\right\|^{2}\right).
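For the reader's convenience, we recall the conditions required by Lemma 1 (the standard stochastic-approximation result of [41]; we state them here in a slightly simplified form, so this is a paraphrase rather than the exact statement): for a process \Delta^{k+1}(s,a)=(1-\hat{\alpha}_{k})\Delta^{k}(s,a)+\hat{\alpha}_{k}M^{k}(s,a), convergence to zero with probability 1 holds whenever

\sum_{k}\hat{\alpha}_{k}=\infty,\quad\sum_{k}\hat{\alpha}_{k}^{2}<\infty,\quad\left\|\mathbb{E}\left[M^{k}|\mathcal{F}_{k}\right]\right\|_{\infty}\leq\kappa\left\|\Delta^{k}\right\|_{\infty}\ \text{with}\ \kappa<1,\quad\text{var}\left[M^{k}|\mathcal{F}_{k}\right]\leq C\left(1+\left\|\Delta^{k}\right\|^{2}\right).

The displays above verify the last two conditions with \kappa=\frac{\gamma}{1+\lambda}, and the step-size conditions are the usual Robbins-Monro requirements on \hat{\alpha}=\alpha(1+\lambda).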

To sum up, the update in (19) satisfies all the assumptions of Lemma 1; therefore, \Delta_{i}^{k} converges to 0 with probability 1. In other words, there exists a K such that

\left|Q_{i}^{K}(s,a)-Q_{i}^{*}(s,a)\right|\leq\delta.

This completes the proof.

A-D Proof of Theorem 2

We first consider the smoothness of \mathcal{L}_{i}. For the first term L_{i}(\pi_{i})=\mathbb{E}_{s\sim\rho_{0}}\mathbb{E}_{a\sim\pi_{i}}\big{[}Q_{i}^{\pi_{i}}(s,a)\big{]}, when \pi_{i} is a Boltzmann policy [42], i.e., \pi_{i}(a|s)=\exp{Q_{i}^{\pi_{i}}(s,a)}/\sum_{a}\exp{Q_{i}^{\pi_{i}}(s,a)}, L_{i} is differentiable. In addition, the regularization term \frac{\lambda}{2}\varphi_{i} is a squared \ell_{2}-norm, which is a \lambda-smooth function. Thus \mathcal{L}_{i} is a smooth function.
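In particular, writing the regularizer as \varphi_{i}=\left\|Q_{i}^{\pi_{i}}(s,a)-Q^{\pi}(s,a)\right\|_{2}^{2} (we restate this form only as a working assumption; the exact definition is given in the main text), the gradient of \mathcal{L}_{i} with respect to the meta Q-values evaluated at the personalized minimizer Q_{i}^{*} is

\nabla\mathcal{L}_{i}(Q^{\pi}(s,a))=\lambda\big{(}Q^{\pi}(s,a)-Q_{i}^{*}(s,a)\big{)},

which is the identity used repeatedly in the proofs of Lemmas 4 and 5 below.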

We next prove the convergence of the meta-policy. Let \mathcal{L} be an L-smooth function; then we have

\begin{split}&\mathbb{E}\big{[}\mathcal{L}(Q_{c+1}^{\pi}(s,a))-\mathcal{L}(Q_{c}^{\pi}(s,a))\big{]}\\ &\leq\mathbb{E}\big{[}\left\langle\nabla\mathcal{L}(Q_{c}^{\pi}(s,a)),Q_{c+1}^{\pi}(s,a)-Q_{c}^{\pi}(s,a)\right\rangle\big{]}+\frac{L}{2}\left\|Q_{c+1}^{\pi}(s,a)-Q_{c}^{\pi}(s,a)\right\|^{2}\\ &=-\tilde{\eta}\mathbb{E}\big{[}\left\langle\nabla\mathcal{L}(Q_{c}^{\pi}(s,a)),Z_{c}(s,a)\right\rangle\big{]}+\frac{\tilde{\eta}^{2}L}{2}\mathbb{E}\big{[}\left\|Z_{c}(s,a)\right\|^{2}\big{]}\\ &=-\tilde{\eta}\mathbb{E}\big{[}\left\|\nabla\mathcal{L}(Q_{c}^{\pi}(s,a))\right\|^{2}\big{]}-\tilde{\eta}\mathbb{E}\big{[}\left\langle\nabla\mathcal{L}(Q_{c}^{\pi}(s,a)),Z_{c}(s,a)-\nabla\mathcal{L}(Q_{c}^{\pi}(s,a))\right\rangle\big{]}\\ &\quad+\frac{\tilde{\eta}^{2}L}{2}\mathbb{E}\big{[}\left\|Z_{c}(s,a)\right\|^{2}\big{]}.\end{split} (21)

By using Proposition 1, \left\|Z_{c}(s,a)\right\|^{2} can be decomposed as

\begin{split}\mathbb{E}\big{[}\big{\|}Z_{c}(s,a)\big{\|}^{2}\big{]}&\leq 3\mathbb{E}\Big{[}\Big{\|}\frac{1}{NR}\sum_{i,r}^{N,R}Z_{i,c,r}(s,a)-\frac{1}{N}\sum_{i=1}^{N}\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))\Big{\|}^{2}\Big{]}\\ &+3\Big{\|}\frac{1}{N}\sum_{i=1}^{N}\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))-\nabla\mathcal{L}(Q_{c}^{\pi}(s,a))\Big{\|}^{2}+3\left\|\nabla\mathcal{L}(Q_{c}^{\pi}(s,a))\right\|^{2}.\end{split} (22)
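Here the factor 3 comes from the elementary bound (presumably the content of Proposition 1, which we use in this form)

\Big{\|}\sum_{j=1}^{n}x_{j}\Big{\|}^{2}\leq n\sum_{j=1}^{n}\left\|x_{j}\right\|^{2},\qquad n=3,

applied to the three-term split of Z_{c}(s,a) around the averaged per-task gradients and the meta-gradient.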

According to the Cauchy-Schwarz and AM-GM inequalities, substituting (22) into (21) yields

\begin{split}&\mathbb{E}\big{[}\mathcal{L}(Q_{c+1}^{\pi}(s,a))-\mathcal{L}(Q_{c}^{\pi}(s,a))\big{]}\\ &\leq-\frac{\tilde{\eta}(1-3L\tilde{\eta})}{2}\mathbb{E}\big{[}\left\|\nabla\mathcal{L}(Q_{c}^{\pi}(s,a))\right\|^{2}\big{]}\\ &+\frac{3\tilde{\eta}^{2}L}{2}\mathbb{E}\Big{\|}\frac{1}{N}\sum_{i=1}^{N}\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))-\nabla\mathcal{L}(Q_{c}^{\pi}(s,a))\Big{\|}^{2}\\ &+\frac{\tilde{\eta}(1+3\tilde{\eta}L)}{2}\frac{1}{NR}\sum_{i,r}^{N,R}\mathbb{E}\big{[}\left\|Z_{i,c,r}(s,a)-\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))\right\|^{2}\big{]}.\end{split}

Then we have

\begin{split}&\mathbb{E}\big{[}\mathcal{L}(Q_{c+1}^{\pi}(s,a))-\mathcal{L}(Q_{c}^{\pi}(s,a))\big{]}\\ &\overset{(a)}{\leq}-\frac{\tilde{\eta}(1-3L\tilde{\eta})}{2}\mathbb{E}\big{[}\left\|\nabla\mathcal{L}(Q_{c}^{\pi}(s,a))\right\|^{2}\big{]}\\ &\quad+\frac{3\tilde{\eta}^{2}L}{2}\mathbb{E}\Big{\|}\frac{1}{N}\sum_{i=1}^{N}\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))-\nabla\mathcal{L}(Q_{c}^{\pi}(s,a))\Big{\|}^{2}\\ &\quad+\frac{\tilde{\eta}(1+3L\tilde{\eta})}{2}\Bigg{[}2\lambda^{2}\delta^{2}+\frac{32\tilde{\eta}^{2}L^{2}}{\beta^{2}}\Big{(}2\lambda^{2}\delta^{2}+\mathbb{E}\big{[}\left\|\nabla\mathcal{L}(Q_{c}^{\pi}(s,a))\right\|^{2}\big{]}\Big{)}\Bigg{]}\\ &\quad+\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\big{[}\left\|\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))-\nabla\mathcal{L}(Q_{c}^{\pi}(s,a))\right\|^{2}\big{]}\\ &\overset{(b)}{\leq}-\frac{\tilde{\eta}(1-3L\tilde{\eta})}{2}\mathbb{E}\big{[}\left\|\nabla\mathcal{L}(Q_{c}^{\pi}(s,a))\right\|^{2}\big{]}\\ &\quad+\tilde{\eta}^{2}L\bigg{(}\frac{12}{\lambda^{2}-8}+\frac{16\tilde{\eta}(1+3L\tilde{\eta})\lambda^{2}L}{\beta^{2}(\lambda^{2}-8)}\bigg{)}\mathbb{E}\big{[}\left\|\nabla\mathcal{L}(Q_{c}^{\pi}(s,a))\right\|^{2}\big{]}\\ &\quad+\tilde{\eta}^{3}(1+3L\tilde{\eta})\frac{32L^{2}(\lambda^{2}\delta^{2}+\sigma^{2}_{2})}{\beta^{2}}+3\tilde{\eta}^{2}L\sigma^{2}_{2}+\tilde{\eta}(1+3L\tilde{\eta})\lambda^{2}\delta^{2}\\ &\leq-\tilde{\eta}\left[\frac{1}{2}-\tilde{\eta}L\left(\frac{3}{2}+\frac{12}{\lambda^{2}-8}+\frac{24\lambda^{2}}{\lambda^{2}-8}\right)\right]\mathbb{E}\big{[}\left\|\nabla\mathcal{L}(Q_{c}^{\pi}(s,a))\right\|^{2}\big{]}\\ &\quad+\tilde{\eta}^{3}(1+3L\tilde{\eta})\frac{32L^{2}(\lambda^{2}\delta^{2}+\sigma^{2}_{2})}{\beta^{2}}+3\tilde{\eta}^{2}L\sigma^{2}_{2}+\tilde{\eta}(1+3L\tilde{\eta})\lambda^{2}\delta^{2}\\ &\overset{(c)}{\leq}-\frac{\tilde{\eta}}{2}\mathbb{E}\big{[}\left\|\nabla\mathcal{L}(Q_{c}^{\pi}(s,a))\right\|^{2}\big{]}+\frac{96\tilde{\eta}^{3}L^{2}(\lambda^{2}\delta^{2}+\sigma_{2}^{2})}{\beta}+3\tilde{\eta}^{2}L\sigma^{2}_{2}+3\tilde{\eta}\lambda^{2}\delta^{2},\end{split} (23)

where (a) is based on Lemma 5 and the fact that \mathbb{E}\big{[}\big{\|}X\big{\|}^{2}\big{]}=\mathbb{E}\big{[}\big{\|}X-\mathbb{E}[X]\big{\|}^{2}\big{]}+\big{\|}\mathbb{E}[X]\big{\|}^{2} for a random variable X, and (b) is obtained by Lemma 4. When \tilde{\eta}\leq\frac{\beta}{2L}, we have 1+3L\tilde{\eta}\leq 1+\frac{3\beta}{2}\leq 3\beta, and we assume that

\frac{1}{2}-\tilde{\eta}L\left(\frac{3}{2}+\frac{12}{\lambda^{2}-8}+\frac{24\lambda^{2}}{\lambda^{2}-8}\right)\geq\frac{1}{4},

then we have (c).

Taking the average over c=0,\ldots,C-1, we have

\begin{split}\frac{1}{2C}\sum_{c=0}^{C-1}\mathbb{E}\left[\left\|\nabla\mathcal{L}\left(Q_{c}^{\pi}(s,a)\right)\right\|^{2}\right]\leq\frac{\mathbb{E}\left[\mathcal{L}\left(Q_{0}^{\pi}(s,a)\right)-\mathcal{L}\left(Q^{\pi}_{C}(s,a)\right)\right]}{\tilde{\eta}C}+\frac{\tilde{\eta}^{2}}{\beta}Y_{4}+\tilde{\eta}Y_{5}+Y_{6},\end{split}

where Y_{4}=96L^{2}(\lambda^{2}\delta^{2}+\sigma_{2}^{2}), Y_{5}=3L\sigma^{2}_{2} and Y_{6}=3\lambda^{2}\delta^{2}.

Let \Delta:=\mathcal{L}(Q_{0}^{\pi}(s,a))-\mathcal{L}(Q^{g,*}(s,a)). Consider the case where \tilde{\eta}^{3}\leq\frac{\beta\Delta}{CY_{4}} and \tilde{\eta}^{2}\leq\frac{\Delta}{CY_{5}}; then we have

\begin{split}\frac{1}{2C}\sum_{c=0}^{C-1}\mathbb{E}\left[\left\|\nabla\mathcal{L}\left(Q_{c}^{\pi}\right)\right\|^{2}\right]\leq\frac{\Delta}{\tilde{\eta}C}+\frac{\Delta^{2/3}Y_{4}^{1/3}}{\left(\beta C\right)^{2/3}}+\frac{\left(\Delta Y_{5}\right)^{1/2}}{\sqrt{C}}+Y_{6}.\end{split}
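Read as a convergence rate (treating \Delta, Y_{4} and Y_{5} as constants and keeping only the dependence on C; this restatement is ours and adds nothing beyond the display above), the bound says that

\frac{1}{2C}\sum_{c=0}^{C-1}\mathbb{E}\left[\left\|\nabla\mathcal{L}\left(Q_{c}^{\pi}\right)\right\|^{2}\right]\leq\mathcal{O}\left(\frac{1}{\tilde{\eta}C}\right)+\mathcal{O}\left(\frac{1}{C^{2/3}}\right)+\mathcal{O}\left(\frac{1}{\sqrt{C}}\right)+3\lambda^{2}\delta^{2},

i.e., the averaged squared gradient norm of the meta-objective decays at the usual \mathcal{O}(1/\sqrt{C}) rate up to a floor of order \lambda^{2}\delta^{2} induced by the tolerance \delta of the personalized updates.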

Finally, we prove the upper bound on the difference between the personalized policies and the meta-policy:

\begin{split}&\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\left[\left\|Q_{i,c}^{\pi_{i}}(s,a)-Q_{c}^{\pi}(s,a)\right\|^{2}\right]\\ &\leq\frac{2}{N}\sum_{i=1}^{N}\mathbb{E}\bigg{[}\left\|Q_{i,c}^{\pi_{i}}(s,a)-Q_{i,c}^{*}(s,a)\right\|^{2}+\left\|Q_{i,c}^{*}(s,a)-Q_{c}^{\pi}(s,a)\right\|^{2}\bigg{]}\\ &\leq 2\delta^{2}+\frac{2}{N}\sum_{i=1}^{N}\frac{\mathbb{E}\left[\left\|\nabla\mathcal{L}_{i}\left(Q_{c}^{\pi}(s,a)\right)\right\|^{2}\right]}{\lambda^{2}}\\ &\overset{(a)}{\leq}2\delta^{2}+\frac{2}{\lambda^{2}-8}\mathbb{E}\left[\left\|\nabla\mathcal{L}\left(Q_{c}^{\pi}(s,a)\right)\right\|^{2}\right]+\frac{2\sigma_{2}^{2}}{\lambda^{2}},\end{split} (24)

where Q_{i,c}^{*}(s,a) is the current optimal Q-value and (a) is based on Lemma 4. Averaging (24) over c=0,\ldots,C-1, we have

\begin{split}&\frac{1}{CN}\sum_{c=0}^{C-1}\sum_{i=1}^{N}\mathbb{E}\left[\left\|Q_{i,c}^{\pi_{i}}(s,a)-Q_{c}^{\pi}(s,a)\right\|^{2}\right]\\ &\leq\frac{2}{\lambda^{2}-8}\frac{1}{C}\sum_{c=0}^{C-1}\mathbb{E}\left[\left\|\nabla\mathcal{L}\left(Q_{c}^{\pi}(s,a)\right)\right\|^{2}\right]+2\delta^{2}+\frac{2\sigma_{2}^{2}}{\lambda^{2}}.\end{split}

This completes the proof.

A-E Proof of Important Lemmas

A-E1 Proof of Lemma 2

We rewrite the update of the auxiliary Q-table Q_{i}^{\pi} in (8) as follows:

Q_{i,c,r+1}^{\pi}(s,a)=Q_{i,c,r}^{\pi}(s,a)-\eta\lambda(Q_{i,c,r}^{\pi}(s,a)-Q_{i,c,r+1}^{\pi_{i}}(s,a)),

which implies that

\begin{split}\eta\sum_{r=0}^{R-1}Z_{i,c,r}(s,a)&=\sum_{r=0}^{R-1}(Q_{i,c,r}^{\pi}(s,a)-Q_{i,c,r+1}^{\pi}(s,a))\\ &=Q_{c}^{\pi}(s,a)-Q_{i,c,R}^{\pi}(s,a).\end{split} (25)

Note that the meta Q-values are updated as

Q_{c+1}^{\pi}(s,a)=(1-\beta)Q_{c}^{\pi}(s,a)+\frac{\beta}{N}\sum_{i=1}^{N}Q_{i,c,R}^{\pi}(s,a). (26)

By substituting (25) into (26), we complete the proof; the resulting update is written out below.
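Concretely, the substitution gives the following (here we write Z_{c}(s,a):=\frac{1}{NR}\sum_{i,r}^{N,R}Z_{i,c,r}(s,a) and \tilde{\eta}:=\beta\eta R; these identifications match how Z_{c} and \tilde{\eta} are used in (21), but the exact definitions live in the main text, so they should be read as assumptions on notation):

\begin{split}Q_{c+1}^{\pi}(s,a)&=Q_{c}^{\pi}(s,a)-\frac{\beta}{N}\sum_{i=1}^{N}\big{(}Q_{c}^{\pi}(s,a)-Q_{i,c,R}^{\pi}(s,a)\big{)}\\ &=Q_{c}^{\pi}(s,a)-\frac{\beta\eta}{N}\sum_{i=1}^{N}\sum_{r=0}^{R-1}Z_{i,c,r}(s,a)=Q_{c}^{\pi}(s,a)-\tilde{\eta}Z_{c}(s,a).\end{split}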

A-E2 Proof of Lemma 3

Note that \mathcal{R}_{i} is a bounded reward function, so Q_{i}(s,a) is bounded by \frac{\mathcal{R}_{\max}}{1-\gamma}. Then we have

\begin{split}&\frac{1}{N}\sum_{i=1}^{N}\Big{\|}\mathcal{R}_{i}(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathcal{P}_{i}(s^{\prime}|s,a)}\left[\max_{a^{\prime}}\tilde{Q}_{i}(s^{\prime},a^{\prime})\right]\\ &\quad-\frac{1}{N}\sum_{j=1}^{N}\left(\mathcal{R}_{j}(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathcal{P}_{j}(s^{\prime}|s,a)}\left[\max_{a^{\prime}}\tilde{Q}_{i}(s^{\prime},a^{\prime})\right]\right)\Big{\|}^{2}\\ &\overset{(a)}{\leq}\frac{1}{N}\sum_{i=1}^{N}\Big{\|}\sigma_{1,i}+\frac{\sigma_{2,i}\gamma}{1-\gamma}\mathcal{R}_{\text{max}}\Big{\|}^{2}\leq\sigma^{2},\end{split}

where (a) is based on Assumption 1. This completes the proof.

A-E3 Proof of Lemma 4

Since \mathcal{L}_{i} is a smooth function, we have \nabla\mathcal{L}_{i}(Q^{\pi}(s,a))=\lambda(Q^{\pi}(s,a)-Q_{i}^{*}(s,a)). Then

\begin{split}&\left\|\nabla\mathcal{L}_{i}(Q^{\pi}(s,a))-\nabla\mathcal{L}(Q^{\pi}(s,a))\right\|^{2}\\ &=\Big{\|}\lambda\left(Q^{\pi}(s,a)-Q_{i}^{*}(s,a)\right)-\frac{1}{N}\sum_{j=1}^{N}\lambda\left(Q^{\pi}(s,a)-Q_{j}^{*}(s,a)\right)\Big{\|}^{2}\\ &\overset{(a)}{=}\Big{\|}\mathcal{R}_{i}(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathcal{P}_{i}}\left[\max_{a^{\prime}}Q_{i}^{*}(s^{\prime},a^{\prime})\right]-Q_{i}^{*}(s,a)\\ &\quad-\frac{1}{N}\sum_{j=1}^{N}\left(\mathcal{R}_{j}(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathcal{P}_{j}}\left[\max_{a^{\prime}}Q_{j}^{*}(s^{\prime},a^{\prime})\right]-Q_{j}^{*}(s,a)\right)\Big{\|}^{2}\\ &\overset{(b)}{\leq}2\Big{\|}\mathcal{R}_{i}(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathcal{P}_{i}}\left[\max_{a^{\prime}}Q_{i}^{*}(s^{\prime},a^{\prime})\right]-Q_{i}^{*}(s,a)\\ &\quad-\frac{1}{N}\sum_{j=1}^{N}\left(\mathcal{R}_{j}(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathcal{P}_{j}}\left[\max_{a^{\prime}}Q_{i}^{*}(s^{\prime},a^{\prime})\right]-Q_{i}^{*}(s,a)\right)\Big{\|}^{2}\\ &\quad+2\Big{\|}\frac{1}{N}\sum_{j=1}^{N}\left(\mathcal{R}_{j}(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathcal{P}_{j}}\left[\max_{a^{\prime}}Q_{i}^{*}(s^{\prime},a^{\prime})\right]-Q_{i}^{*}(s,a)\right)\\ &\quad\quad-\frac{1}{N}\sum_{j=1}^{N}\left(\mathcal{R}_{j}(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathcal{P}_{j}}\left[\max_{a^{\prime}}Q_{j}^{*}(s^{\prime},a^{\prime})\right]-Q_{j}^{*}(s,a)\right)\Big{\|}^{2},\end{split}

where (a) is due to the property of the fixed point of \hat{\mathcal{B}} mentioned in (17), i.e.,

\begin{split}Q_{i}^{*}(s,a)&=\frac{1}{1+\lambda}\Big{(}\mathcal{R}_{i}(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathcal{P}_{i}(s^{\prime}|s,a)}\Big{[}\max_{a^{\prime}}Q_{i}^{*}(s^{\prime},a^{\prime})\Big{]}\Big{)}\\ &\quad+\frac{\lambda}{1+\lambda}Q^{\pi}(s,a),\end{split}

and (b) is due to Proposition 1. Taking the average over the number of tasks, we have

\begin{split}&\frac{1}{N}\sum_{i=1}^{N}\left\|\nabla\mathcal{L}_{i}(Q^{\pi}(s,a))-\nabla\mathcal{L}(Q^{\pi}(s,a))\right\|^{2}\\ &\overset{(a)}{\leq}2\sigma^{2}+\frac{2}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N}\Big{\|}\mathcal{R}_{j}(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathcal{P}_{j}}\left[\max_{a^{\prime}}Q_{i}^{*}(s^{\prime},a^{\prime})\right]-Q_{i}^{*}(s,a)\\ &\quad-\Big{(}\mathcal{R}_{j}(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathcal{P}_{j}}\left[\max_{a^{\prime}}Q_{j}^{*}(s^{\prime},a^{\prime})\right]-Q_{j}^{*}(s,a)\Big{)}\Big{\|}^{2}\\ &\leq 2\sigma^{2}+\frac{2\gamma}{1-\gamma}\mathcal{R}_{\text{max}}+\frac{2}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N}\left\|Q_{i}^{*}(s,a)-Q_{j}^{*}(s,a)\right\|^{2}\\ &\overset{(b)}{\leq}2\sigma^{2}+\frac{2\gamma}{1-\gamma}\mathcal{R}_{\text{max}}+\frac{2}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N}\frac{2}{\lambda^{2}}\left(\left\|\nabla\mathcal{L}_{i}(Q^{\pi}(s,a))\right\|^{2}+\left\|\nabla\mathcal{L}_{j}(Q^{\pi}(s,a))\right\|^{2}\right)\\ &\overset{(c)}{\leq}2\sigma^{2}+\frac{2\gamma}{1-\gamma}\mathcal{R}_{\text{max}}+\frac{8}{\lambda^{2}}\bigg{[}\left\|\nabla\mathcal{L}(Q^{\pi}(s,a))\right\|^{2}+\frac{1}{N}\sum_{i=1}^{N}\left\|\nabla\mathcal{L}_{i}(Q^{\pi}(s,a))-\nabla\mathcal{L}(Q^{\pi}(s,a))\right\|^{2}\bigg{]},\end{split} (27)

where (a) is due to Assumption 1 and Proposition 1, and (b) is based on Jensen's inequality. By re-arranging the terms in (c), we obtain

\begin{split}\frac{1}{N}\sum_{i=1}^{N}\left\|\nabla\mathcal{L}_{i}(Q^{\pi}(s,a))-\nabla\mathcal{L}(Q^{\pi}(s,a))\right\|^{2}\leq\sigma_{2}^{2}+\frac{8}{\lambda^{2}-8}\left\|\nabla\mathcal{L}(Q^{\pi}(s,a))\right\|^{2},\end{split}

where \sigma_{2}^{2}=\frac{2\lambda^{2}}{\lambda^{2}-8}\sigma^{2}+\frac{2\lambda^{2}\gamma}{(\lambda^{2}-8)(1-\gamma)}\mathcal{R}_{\text{max}}. This completes the proof.

A-E4 Proof of Lemma 5

According to the fact that \nabla\mathcal{L}_{i}(Q_{i,c,r}^{\pi}(s,a))=\lambda\big{(}Q_{i,c,r}^{\pi}(s,a)-Q_{i,c,r}^{*}(s,a)\big{)}, for the r-th iteration we have

\begin{split}&\mathbb{E}\big{[}\left\|Z_{i,c,r}(s,a)-\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))\right\|^{2}\big{]}\\ &\overset{(a)}{\leq}2\mathbb{E}\big{[}\left\|Z_{i,c,r}(s,a)-\nabla\mathcal{L}_{i}(Q_{i,c,r}^{\pi}(s,a))\right\|^{2}\big{]}+2\mathbb{E}\big{[}\left\|\nabla\mathcal{L}_{i}(Q_{i,c,r}^{\pi}(s,a))-\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))\right\|^{2}\big{]}\\ &\overset{(b)}{\leq}2\lambda^{2}\mathbb{E}\big{[}\left\|Q_{i,c,r}^{\pi_{i}}(s,a)-Q_{i,c,r}^{*}(s,a)\right\|^{2}\big{]}+2L^{2}\mathbb{E}\big{[}\left\|Q_{i,c,r}^{\pi}(s,a)-Q_{c}^{\pi}(s,a)\right\|^{2}\big{]}\\ &\leq 2\Big{(}\lambda^{2}\delta^{2}+L^{2}\mathbb{E}\big{[}\left\|Q_{i,c,r}^{\pi}(s,a)-Q_{c}^{\pi}(s,a)\right\|^{2}\big{]}\Big{)},\end{split} (28)

where (a) and (b) are due to Proposition 1.

We next bound the difference between Q_{i,c,r}^{\pi}(s,a) and Q_{c}^{\pi}(s,a) in (29)-(32) below. The inequalities (a), (b) and (c) there are based on the Peter-Paul inequality, Proposition 1 and (28), respectively. Let m=R; when \tilde{\eta}^{2}\leq\frac{\beta^{2}R}{4(1+R)L^{2}}, we have 4(1+m)\eta^{2}L^{2}\leq\frac{1}{R}, so that inequality (d) holds. Inequality (e), i.e., (31), follows from unrolling (30) recursively, and (f) uses the fact that \sum_{r=0}^{R-1}\left(1+\frac{2}{R}\right)^{r}=\frac{(1+2/R)^{R}-1}{2/R}\leq\frac{(e^{2}-1)R}{2}\leq 4R.

Specifically, we have

\begin{split}&\mathbb{E}\left[\left\|Q_{i,c,r}^{\pi}(s,a)-Q_{c}^{\pi}(s,a)\right\|^{2}\right]\\ &=\mathbb{E}\left[\left\|Q_{i,c,r-1}^{\pi}(s,a)-Q_{c}^{\pi}(s,a)-\eta Z_{i,c,r-1}(s,a)\right\|^{2}\right]\\ &\overset{(a)}{\leq}\left(1+\frac{1}{m}\right)\mathbb{E}\left\|Q_{i,c,r-1}^{\pi}(s,a)-Q_{c}^{\pi}(s,a)\right\|^{2}+(1+m)\eta^{2}\mathbb{E}\left\|Z_{i,c,r-1}(s,a)\right\|^{2}\\ &\overset{(b)}{\leq}\left(1+\frac{1}{m}\right)\mathbb{E}\left\|Q_{i,c,r-1}^{\pi}(s,a)-Q_{c}^{\pi}(s,a)\right\|^{2}+2(1+m)\eta^{2}\mathbb{E}\left\|Z_{i,c,r-1}(s,a)-\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))\right\|^{2}\\ &\quad+2(1+m)\eta^{2}\mathbb{E}\left\|\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))\right\|^{2}\\ &\overset{(c)}{\leq}\left(1+\frac{1}{m}\right)\mathbb{E}\left\|Q_{i,c,r-1}^{\pi}(s,a)-Q_{c}^{\pi}(s,a)\right\|^{2}+2(1+m)\eta^{2}\mathbb{E}\left\|\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))\right\|^{2}\\ &\quad+4(1+m)\eta^{2}\left(\lambda^{2}\delta^{2}+L^{2}\mathbb{E}\left\|Q_{i,c,r-1}^{\pi}(s,a)-Q_{c}^{\pi}(s,a)\right\|^{2}\right)\\ &=\left(1+\frac{1}{m}+4(1+m)\eta^{2}L^{2}\right)\mathbb{E}\left\|Q_{i,c,r-1}^{\pi}(s,a)-Q_{c}^{\pi}(s,a)\right\|^{2}\\ &\quad+2(1+m)\eta^{2}\mathbb{E}\left\|\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))\right\|^{2}+4(1+m)\eta^{2}\lambda^{2}\delta^{2} \quad (29)\\ &\overset{(d)}{\leq}\left(1+\frac{2}{R}\right)\mathbb{E}\left\|Q_{i,c,r-1}^{\pi}(s,a)-Q_{c}^{\pi}(s,a)\right\|^{2}+\frac{4\tilde{\eta}^{2}}{\beta^{2}R}\mathbb{E}\left\|\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))\right\|^{2}+\frac{8\tilde{\eta}^{2}\lambda^{2}\delta^{2}}{\beta^{2}R} \quad (30)\\ &\overset{(e)}{\leq}\frac{4\tilde{\eta}^{2}}{\beta^{2}R}\left(\mathbb{E}\left\|\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))\right\|^{2}+2\lambda^{2}\delta^{2}\right)\sum_{r=0}^{R-1}\left(1+\frac{2}{R}\right)^{r} \quad (31)\\ &\overset{(f)}{\leq}\frac{16\tilde{\eta}^{2}}{\beta^{2}}\left(\mathbb{E}\left\|\nabla\mathcal{L}_{i}(Q_{c}^{\pi}(s,a))\right\|^{2}+2\lambda^{2}\delta^{2}\right), \quad (32)\end{split}

Substituting (32) into (28) and averaging over i and r then yields the bound in Lemma 5, which completes the proof.

References

  • [1] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine learning, vol. 3, no. 1, pp. 9–44, 1988.
  • [2] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
  • [3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [4] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning.   PMLR, 2016, pp. 1928–1937.
  • [5] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
  • [6] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
  • [7] S. Thrun and L. Pratt, Learning to learn.   Springer Science & Business Media, 2012.
  • [8] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in International Conference on Machine Learning.   PMLR, 2017, pp. 1126–1135.
  • [9] B. C. Stadie, G. Yang, R. Houthooft, X. Chen, Y. Duan, Y. Wu, P. Abbeel, and I. Sutskever, “Some considerations on learning to explore via meta-reinforcement learning,” arXiv preprint arXiv:1803.01118, 2018.
  • [10] K. Rakelly, A. Zhou, C. Finn, S. Levine, and D. Quillen, “Efficient off-policy meta-reinforcement learning via probabilistic context variables,” in International conference on machine learning.   PMLR, 2019, pp. 5331–5340.
  • [11] L. Zintgraf, K. Shiarlis, M. Igl, S. Schulze, Y. Gal, K. Hofmann, and S. Whiteson, “Varibad: A very good method for bayes-adaptive deep rl via meta-learning,” arXiv preprint arXiv:1910.08348, 2019.
  • [12] J. Zhang, J. Wang, H. Hu, T. Chen, Y. Chen, C. Fan, and C. Zhang, “Metacure: Meta reinforcement learning with empowerment-driven exploration,” in International Conference on Machine Learning.   PMLR, 2021, pp. 12 600–12 610.
  • [13] T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine, “Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning,” in Conference on Robot Learning.   PMLR, 2020, pp. 1094–1100.
  • [14] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu, “Distral: Robust multitask reinforcement learning,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [15] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel, “RL2\text{RL}^{2}: Fast reinforcement learning via slow reinforcement learning,” arXiv preprint arXiv:1611.02779, 2016.
  • [16] J. Rothfuss, D. Lee, I. Clavera, T. Asfour, and P. Abbeel, “Promp: Proximal meta-policy search,” arXiv preprint arXiv:1810.06784, 2018.
  • [17] S. Gurumurthy, S. Kumar, and K. Sycara, “Mame: Model-agnostic meta-exploration,” in Conference on Robot Learning.   PMLR, 2020, pp. 910–922.
  • [18] L. Pinto, D. Gandhi, Y. Han, Y.-L. Park, and A. Gupta, “The curious robot: Learning visual representations via physical interactions,” in European Conference on Computer Vision.   Springer, 2016, pp. 3–18.
  • [19] L. Pinto and A. Gupta, “Learning to push by grasping: Using multiple tasks for effective learning,” in 2017 IEEE international conference on robotics and automation (ICRA).   IEEE, 2017, pp. 2161–2168.
  • [20] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. Wiele, V. Mnih, N. Heess, and J. T. Springenberg, “Learning by playing solving sparse reward tasks from scratch,” in International Conference on Machine Learning.   PMLR, 2018, pp. 4344–4353.
  • [21] C. Rosenbaum, T. Klinger, and M. Riemer, “Routing networks: Adaptive selection of non-linear functions for multi-task learning,” arXiv preprint arXiv:1711.01239, 2017.
  • [22] C. D’Eramo, D. Tateo, A. Bonarini, M. Restelli, and J. Peters, “Sharing knowledge in multi-task deep reinforcement learning,” in International Conference on Learning Representations, 2019.
  • [23] T. L. Vuong, D. V. Nguyen, T. L. Nguyen, C. M. Bui, H. D. Kieu, V. C. Ta, Q. L. Tran, and T. H. Le, “Sharing experience in multitask reinforcement learning,” 2019.
  • [24] R. Yang, H. Xu, Y. WU, and X. Wang, “Multi-task reinforcement learning with soft modularization,” Advances in Neural Information Processing Systems, vol. 33, pp. 4767–4777, 2020.
  • [25] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” arXiv preprint arXiv:2001.06782, 2020.
  • [26] Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich, “Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks,” in International Conference on Machine Learning.   PMLR, 2018, pp. 794–803.
  • [27] E. Parisotto, J. L. Ba, and R. Salakhutdinov, “Actor-mimic: Deep multitask and transfer reinforcement learning,” arXiv preprint arXiv:1511.06342, 2015.
  • [28] Z. Xu, K. Wu, Z. Che, J. Tang, and J. Ye, “Knowledge transfer in multi-task deep reinforcement learning for continuous control,” arXiv preprint arXiv:2010.07494, 2020.
  • [29] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International conference on machine learning.   PMLR, 2018, pp. 1861–1870.
  • [30] B. Liu, L. Wang, and M. Liu, “Lifelong federated reinforcement learning: a learning architecture for navigation in cloud robotic systems,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 4555–4562, 2019.
  • [31] X. Wang, C. Wang, X. Li, V. C. Leung, and T. Taleb, “Federated deep reinforcement learning for internet of things with decentralized cooperative edge caching,” IEEE Internet of Things Journal, vol. 7, no. 10, pp. 9441–9455, 2020.
  • [32] X. Fan, Y. Ma, Z. Dai, W. Jing, C. Tan, and B. K. H. Low, “Fault-tolerant federated reinforcement learning with theoretical guarantee,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [33] R. Vuorio, S.-H. Sun, H. Hu, and J. J. Lim, “Multimodal model-agnostic meta-learning via task-aware modulation,” arXiv preprint arXiv:1910.13616, 2019.
  • [34] J. Chen and A. Zhang, “Hetmaml: Task-heterogeneous model-agnostic meta-learning for few-shot learning across modalities,” arXiv preprint arXiv:2105.07889, 2021.
  • [35] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” 2016.
  • [36] M. Li, Z. Qin, Y. Jiao, Y. Yang, J. Wang, C. Wang, G. Wu, and J. Ye, “Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning,” in The World Wide Web Conference, 2019, pp. 983–994.
  • [37] J. Weng, H. Chen, D. Yan, K. You, A. Duburcq, M. Zhang, H. Su, and J. Zhu, “Tianshou: A highly modularized deep reinforcement learning library,” arXiv preprint arXiv:2107.14171, 2021.
  • [38] O. Anschel, N. Baram, and N. Shimkin, “Averaged-dqn: Variance reduction and stabilization for deep reinforcement learning,” in International conference on machine learning.   PMLR, 2017, pp. 176–185.
  • [39] M. Chevalier-Boisvert, L. Willems, and S. Pal, “Minimalistic gridworld environment for openai gym,” https://github.com/maximecb/gym-minigrid, 2018.
  • [40] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2012, pp. 5026–5033.
  • [41] T. Jaakkola, M. I. Jordan, and S. P. Singh, “On the convergence of stochastic iterative dynamic programming algorithms,” Neural computation, vol. 6, no. 6, pp. 1185–1201, 1994.
  • [42] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.