Efficient Multi-Task Reinforcement Learning via Task-Specific Action Correction*
Abstract
Multi-task reinforcement learning (MTRL) demonstrates potential for enhancing the generalization of a robot, enabling it to perform multiple tasks concurrently. However, the performance of MTRL may still be susceptible to conflicts between tasks and negative interference. To facilitate efficient MTRL, we propose Task-Specific Action Correction (TSAC), a general and complementary approach designed for the simultaneous learning of multiple tasks. TSAC decomposes policy learning into two separate policies: a shared policy (SP) and an action correction policy (ACP). To alleviate conflicts resulting from excessive focus on specific tasks’ details in SP, ACP incorporates goal-oriented sparse rewards, enabling an agent to adopt a long-term perspective and achieve generalization across tasks. These additional rewards transform the original problem into a multi-objective MTRL problem. Furthermore, to convert the multi-objective MTRL problem into a single-objective formulation, TSAC assigns a virtual expected budget to the sparse rewards and employs the Lagrangian method to transform the resulting constrained single-objective optimization into an unconstrained one. Experimental evaluations conducted on Meta-World’s MT10 and MT50 benchmarks demonstrate that TSAC outperforms existing state-of-the-art methods, achieving significant improvements in both sample efficiency and effective action execution.
Index Terms:
Multi-task reinforcement learning, robotic manipulation tasks, goal-oriented sparse rewards, Lagrangian method
I Introduction
Empowering generalist robots through reinforcement learning is one of the essential targets of robotic learning. Reinforcement learning (RL) with the assistance of neural networks has become a crucial methodology in various domains, such as gaming[1, 2, 3, 4], large language models[5] and real-world applications including robotics[6]. However, the majority of research in RL predominantly focuses on specific problem scenarios, prioritizing mastery of individual tasks through the learning of single policies, often at the expense of generalization. Multi-task reinforcement learning (MTRL), on the other hand, emerges as a promising approach to improve generalization by leveraging domain information obtained from related tasks.
MTRL naturally incorporates a curriculum, as it enables the learning of more manageable tasks to facilitate the teaching of more challenging tasks[7]. However, MTRL is prone to negative transfer[8], a phenomenon where the task-specific knowledge from a task can impede the overall learning process of other tasks. This phenomenon is also referred to as task conflict, which becomes more acute as the number of tasks increases. From the optimization standpoint, task conflict arises from conflicting gradients[9] between tasks, where the gradients move in opposite directions.
A multitude of methods have been developed to alleviate negative transfer and achieve efficient MTRL. Classical approaches include policy distillation[10, 11], explicit measurement of task relatedness[12, 13], and information sharing[14, 15], among others. Policy distillation trains a smaller network to reach expert level on previous tasks, thereby integrating multiple policies into a single policy. However, these methods increase the number of network parameters, as they require separate networks for different tasks and an additional distillation step. Some researchers use validation loss on tasks[12] or causal influence[13] to determine better task groupings. A drawback of these methods is the need for substantial computational resources: calculating task correlations and adjusting learning methods can often only be done through trial and error, leading to high costs. In MTRL, information sharing can be achieved through data sharing, parameter sharing, representation sharing, or behavior sharing. For instance, CDS[14] routes data across tasks based on task-specific data to improve information sharing. Similarly, a simple method proposed in [15] applies a zero reward to unlabeled data, facilitating data sharing in theory and practice. Parameter sharing through learning shared representations can effectively transfer knowledge between policies. For instance, a soft modularization method presented in [16] shares parameters by generating soft combinations of different modules. Similarly, AdaShare[17] and an automated multi-task learning algorithm in [18] adaptively determine the feature-sharing mode across multiple tasks. However, these methods suffer from high computational complexity, as they require dynamic exploration of network connectivity. Attention mechanisms can also be leveraged to share representations, as proposed in [19, 20], which group task knowledge into sub-networks without the need for prior assumptions. CARE[21] leverages metadata to capture the shared structure among tasks. Another significant line of MTRL work addresses multi-objective optimization from a gradient perspective, aiming to reduce the impact of conflicting gradients by manipulating the gradients according to various criteria[9, 22]; however, these methods impose an additional computational burden. The aforementioned methods reduce or coordinate conflicts between tasks from the perspectives of representation, gradient, task grouping, etc., effectively achieving MTRL. However, they neglect a potential cause of task conflict: when an agent focuses solely on task-specific information and details within each individual task, it may become short-sighted, which hinders generalization across tasks. By introducing goal-oriented sparse learning signals, it is possible to strike a balance between task-specific performance and generalization across tasks.
Imagine that humans are simultaneously learning a variety of related manipulation tasks, for instance, as depicted in Fig. 1. Within these tasks, there are several actions that exhibit similarity, such as approaching objects like doors or drawers and interacting with them. Humans have the ability to leverage previously learned behaviors when encountering a specific task, making slight adjustments based on the task’s goal. Furthermore, when humans are confronted with numerous tasks simultaneously, having detailed instructions and requirements for each task can lead to confusion and a sense of being overwhelmed, even though they can be helpful in completing individual tasks. In contrast, when each task has a clearly defined goal, humans naturally adopt a longer-term perspective regarding conflicts and priorities among tasks. The above human behaviors imply their recognition that current conflicts may not be necessary for achieving the overall objectives of the tasks. Our idea is inspired by this recognition.

In this paper, we propose a novel approach called Task-Specific Action Correction (TSAC) as a general and complementary method for MTRL. TSAC decomposes policy learning into two policies: a shared policy (SP) and an action correction policy (ACP). SP maximizes well-shaped, dense rewards, which focus on task-specific information and accelerate the learning process. Its output actions, referred to as shared actions, are potentially short-sighted. On the other hand, ACP utilizes goal-oriented sparse rewards to generate far-sighted edited actions that generalize across tasks. The goal-oriented sparse rewards are sparse but strongly correlated with completion of the task objective. SP and ACP collaborate with each other: SP provides a suboptimal policy that facilitates the training of ACP in the sparse-reward setting, and ACP, in return, improves the overall performance. To balance the training of these two policies, we assign a virtual expected budget to the sparse rewards and use the Lagrangian method to dynamically adjust the weights of the loss in ACP. Our two-policy paradigm draws inspiration from works in safe reinforcement learning[23, 24, 25]. However, our approach differs significantly in terms of motivation and interpretation.
To implement our approach, we employ the Soft Actor-Critic algorithm[26] as the underlying reinforcement learning policy. It is worth noting that our approach is algorithm-agnostic and can be integrated with existing MTRL methods. Our experimental results demonstrate the efficiency and significance of the cooperation between the two policies and show that simply combining the two rewards does not yield comparable results. Moreover, our experiments conducted on the Meta-World benchmark[27] using MT10 and MT50 showcase significant improvements in both sample efficiency and final performance compared to previous state-of-the-art multi-task policies. In summary, our contributions are as follows:
(1) We propose Task-Specific Action Correction (TSAC), a general and complementary approach for MTRL, which decomposes policy learning into two policies, facilitating efficient MTRL. TSAC can be combined with any existing MTRL methods.
(2) We introduce goal-oriented sparse rewards to provide agents with a long-term perspective for handling task conflicts that arise from excessive focus on specific tasks’ details.
(3) We assign a virtual expected budget to the sparse rewards and utilize the Lagrangian method to transform a constrained single-objective optimization into an unconstrained one. The Lagrangian multiplier dynamically adjusts the loss weights in the two policy networks.
(4) We demonstrate the efficiency and significance of the cooperation between the two policies in MTRL, and show that TSAC achieves significant improvements in both sample efficiency and effective action execution.
II Preliminaries
II-A Multi-Task Reinforcement Learning
We extend the single-task Markov Decision Process (MDP) problem to a multi-task MDP for an agent under the framework of the Contextual MDP (CMDP)[28]. A CMDP is defined by a tuple $(\mathcal{C}, \mathcal{S}, \mathcal{A}, M, \gamma)$. Here, $\mathcal{C}$ can be viewed as a task set $\{\tau_1, \dots, \tau_N\}$, where $\tau_i$ denotes the $i$-th task and $N$ is the number of tasks. $\mathcal{S}$ represents the shared state space, $\mathcal{A}$ denotes the shared action space, and $\gamma$ is the discount factor. State $s \in \mathcal{S}$ and action $a \in \mathcal{A}$. $M$ is a function that maps a task to MDP parameters, such that $M(\tau) = \{P_\tau, r_\tau, \rho_\tau\}$. The transition probability $P_\tau(s' \mid s, a)$, reward function $r_\tau(s, a)$ and initial state distribution $\rho_\tau(s_0)$ vary with each task. During training, tasks are sampled from a uniform distribution $\tau \sim \mathcal{U}(\mathcal{C})$. The agent’s policy $\pi(a \mid s, \tau)$ takes the state as input and outputs an action. The objective of the agent’s policy is to maximize the expected return $\mathbb{E}_{\tau \sim \mathcal{U}(\mathcal{C})}\,\mathbb{E}_{\pi}\big[\sum_{t}\gamma^{t} r_{\tau}(s_t, a_t)\big]$, where $s_t$ and $a_t$ represent the state and the action at timestep $t$.
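To make the formulation concrete, here is a minimal sketch (not the paper's code) of how a task-conditioned policy would be rolled out under this CMDP: a task is drawn uniformly and the discounted return is accumulated. The gym-style environment interface and the `policy(state, task)` signature are assumptions for illustration.

```python
import random

def collect_multitask_return(envs, policy, gamma=0.99, horizon=500):
    """envs: dict mapping a task id to a single-task environment (gym-like API assumed)."""
    task = random.choice(list(envs.keys()))        # tau ~ Uniform(C)
    env = envs[task]
    state, _ = env.reset()
    discounted_return, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy(state, task)               # pi(a | s, tau)
        state, reward, terminated, truncated, _ = env.step(action)
        discounted_return += discount * reward     # accumulate gamma^t * r_tau(s_t, a_t)
        discount *= gamma
        if terminated or truncated:
            break
    return task, discounted_return
```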
II-B Soft Actor-Critic
In this paper, we adopt Soft Actor-Critic (SAC) [26] as the fundamental policy. As observed in [27], SAC, being an off-policy actor-critic algorithm with a strong exploration ability based on maximum entropy, exhibits superior performance.
The concept of SAC goes beyond merely maximizing cumulative rewards: it also introduces additional stochasticity into the policy. Thus, a regularization term that incorporates entropy is included in the reinforcement learning objective. The entropy-augmented objective can be defined as follows:

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\Big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big] \qquad (1)$$

In this context, $\alpha$ represents a regularization coefficient that controls the significance of the entropy term, and $\mathcal{H}$ denotes the entropy.
III Method
In this section, we present a general and complementary approach named Task-Specific Action Correction (TSAC) to decompose policy learning across two policies: the shared policy (SP) and the action correction policy (ACP).
III-A Overall Structure of TSAC
As illustrated in Fig. 2, TSAC consists of a pair of cooperative policies. The first policy, called the shared policy (SP) and denoted as $\pi_{sp}$, optimizes the guiding dense rewards by proposing a preliminary action $a_{sp}$. However, this preliminary action may be short-sighted. The second policy, referred to as the action correction policy (ACP) and denoted as $\pi_{acp}$, corrects the preliminary action by providing an action correction $\Delta a$ so that an effective action is executed. The resulting action, denoted as $a = f(a_{sp}, \Delta a)$, is then output to the environment, where $f$ represents an editing function. ACP is conditioned on SP’s output $a_{sp}$. Together, these two policies cooperate to improve sample efficiency and performance. For simplicity, we denote the overall composed policy as $\pi$.
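As a concrete illustration of this composition, the following is a minimal sketch of the composed policy $\pi$, assuming both SP and ACP are PyTorch modules with actions in $[-1, 1]$. The network definitions, the concatenated ACP input, and the scaling coefficient `beta` are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ComposedPolicy(nn.Module):
    """pi: compose SP's preliminary action with ACP's correction via an editing function f."""
    def __init__(self, shared_policy: nn.Module, correction_policy: nn.Module, beta: float = 0.5):
        super().__init__()
        self.sp = shared_policy       # pi_sp(a_sp | s): proposes a preliminary action
        self.acp = correction_policy  # pi_acp(delta_a | s, a_sp): proposes a correction
        self.beta = beta              # assumed scaling of the correction (SP-dominated)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        a_sp = self.sp(state)                                  # preliminary (possibly myopic) action
        delta_a = self.acp(torch.cat([state, a_sp], dim=-1))   # correction conditioned on a_sp
        # editing function f: additive correction followed by clipping to the action bounds
        return torch.clamp(a_sp + self.beta * delta_a, -1.0, 1.0)
```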

Motivation: TSAC decomposes policy learning into two subtasks that focus on guiding dense rewards or goal-oriented sparse rewards. This decomposition is motivated by the following considerations:
III-A1 Different Conflict Horizons
To achieve effective MTRL, most applications incorporate guiding dense rewards into each task. However, SP is myopic, as it only focuses on specific tasks’ details and overlooks whether the final goal is accomplished, despite the final goal being the most important aspect. SP coordinates task conflicts only in the short term. In contrast, ACP aims to maximize goal-oriented sparse rewards, enabling an agent to adopt a long-term perspective and achieve generalization across tasks. With the assistance of SP, ACP has more opportunities to obtain goal-oriented sparse rewards, thus alleviating the challenge of learning from sparse rewards in ACP.
III-A2 Efficient Exploration
From the perspective of SP, its action is altered by ACP. Instead of discouraging SP from taking suboptimal actions, ACP offers suggestions to improve the preliminary action, enabling SP to continue exploration in an effective and far-sighted manner. This guarded exploration leads to a better overall exploration policy because ACP is less likely to hinder SP’s exploration. From the perspective of ACP, it determines the action based on SP, enhancing its ability to explore. From an entropy perspective, the decomposition of policy learning into two policy networks introduces additional entropy.
III-B Goal-oriented sparse rewards
Manually designed dense rewards incorporate prior knowledge and effectively guide policy learning. However, the magnitude of the reward values does not directly indicate a policy’s ability to accomplish tasks. Therefore, we introduce goal-oriented sparse rewards, which are correlated with the completion of the objective. The goal-oriented sparse reward of a task is characterized by an “$\epsilon$-region” in the state space, represented by:

$$r^{g}_{\tau}(s) = \begin{cases} R, & d(g_\tau, s) \le \epsilon \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

In this equation, $r^{g}_{\tau}$ denotes the goal-oriented sparse reward of task $\tau$, $s$ is the current state, $g_\tau$ denotes the goal state, and $d$ is a function that maps the goal state and the current state to a latent space and computes the distance between them. $R$ defines the reward value, and $\epsilon$ represents a small distance threshold.
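As an illustration of Eq. (2), the sketch below computes the sparse reward under the simplifying assumption that the latent mapping is the identity and the distance is Euclidean; the values of `epsilon` and `R` are placeholders, not those used in the paper.

```python
import numpy as np

def goal_oriented_sparse_reward(state: np.ndarray, goal: np.ndarray,
                                epsilon: float = 0.05, R: float = 1.0) -> float:
    """Return R if the state lies inside the epsilon-region around the goal, else 0."""
    distance = np.linalg.norm(state - goal)   # assumed instance of d(g_tau, s)
    return R if distance <= epsilon else 0.0
```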
Following the CMDP description (Section II-A), the initial state distribution $\rho_\tau$ determines the probability density of an episode starting at state $s_0$. For each transition from task $\tau$ at timestep $t$, the environment produces a scalar $r^{d}_{\tau}(s_t, a_t)$. It is worth mentioning that this reward function is hand-designed and dense, which is why we refer to it as the guiding dense reward. Similarly, the environment produces a scalar $r^{g}_{\tau}(s_t)$ as a goal-oriented sparse reward. Higher values for both the guiding dense rewards and the goal-oriented sparse rewards indicate better performance.
For each state $s$, the guiding dense reward state value of policy $\pi$ is denoted as $V^{d}_{\pi}(s)$, and the guiding dense reward state-action value as $Q^{d}_{\pi}(s, a)$. Similarly, we define $V^{g}_{\pi}(s)$ and $Q^{g}_{\pi}(s, a)$ for the goal-oriented sparse rewards.
We consider the MTRL objective from the perspective of guiding dense rewards and goal-oriented sparse rewards:
$$\max_{\pi}\ \mathbb{E}_{\tau \sim \mathcal{U}(\mathcal{C})}\bigg[ \mathbb{E}_{\pi}\Big[\sum_{t}\gamma^{t} r^{d}_{\tau}(s_t, a_t)\Big],\ \mathbb{E}_{\pi}\Big[\sum_{t}\gamma^{t} r^{g}_{\tau}(s_t)\Big] \bigg] \qquad (3)$$
MTRL can be viewed as a multi-objective optimization problem. The introduction of sparse rewards adds an additional objective to the MTRL framework, which increases the problem’s complexity. To convert multi-objective MTRL into single-objective MTRL, we assign a virtual expected budget to the sparse rewards. This allows us to write the MTRL objective as follows:
$$\max_{\pi}\ \mathbb{E}_{\tau \sim \mathcal{U}(\mathcal{C})}\,\mathbb{E}_{\pi}\Big[\sum_{t}\gamma^{t} r^{d}_{\tau}(s_t, a_t)\Big] \quad \text{s.t.}\quad \mathbb{E}_{\tau \sim \mathcal{U}(\mathcal{C})}\,\mathbb{E}_{\pi}\Big[\sum_{t}\gamma^{t} r^{g}_{\tau}(s_t)\Big] \ge B \qquad (4)$$

In this equation, $B$ represents the virtual expected budget. Written per time step, Eq. (4) can be rewritten as:

$$\max_{\pi}\ \mathbb{E}_{\tau \sim \mathcal{U}(\mathcal{C})}\,\mathbb{E}_{\pi}\Big[\sum_{t}\gamma^{t} r^{d}_{\tau}(s_t, a_t)\Big] \quad \text{s.t.}\quad \mathbb{E}_{\tau \sim \mathcal{U}(\mathcal{C})}\,\mathbb{E}_{\pi}\big[ r^{g}_{\tau}(s_t) \big] \ge b \qquad (5)$$
Here, $b$ denotes the expected budget specific to each step; it relates to $B$ through $B = b/(1-\gamma)$. Notably, the expected budget represents an average target to be achieved rather than a strict enforcement.
To simplify the problem, we transform the multi-objective MTRL into a single-objective MTRL using the Lagrangian method. This method converts the constrained optimization problem in Eq. (4) into an unconstrained one by introducing a multiplier $\lambda \ge 0$:

$$\max_{\pi}\ \min_{\lambda \ge 0}\ \mathbb{E}_{\tau \sim \mathcal{U}(\mathcal{C})}\,\mathbb{E}_{\pi}\Big[\sum_{t}\gamma^{t} r^{d}_{\tau}(s_t, a_t)\Big] + \lambda\bigg( \mathbb{E}_{\tau \sim \mathcal{U}(\mathcal{C})}\,\mathbb{E}_{\pi}\Big[\sum_{t}\gamma^{t} r^{g}_{\tau}(s_t)\Big] - B \bigg) \qquad (6)$$

In this formulation, the weight of the goal-oriented sparse rewards relative to the guiding dense rewards is represented by $\lambda$. We solve this objective using model-free MTRL algorithms.
III-C Objective function
We utilize an off-policy actor-critic approach to train the two policies. Given the overall policy $\pi$, we use the typical Temporal Difference (TD)[29] backup to learn $Q^{d}_{\pi}$ and $Q^{g}_{\pi}$, which are parameterized as $Q^{d}_{\theta}$ and $Q^{g}_{\theta}$ respectively. Here, $\theta$ collectively represents the network parameters of the two state-action values. Given a transition $(s, a, s')$ from task $\tau$ and the policy $\pi$, the Bellman backup operator for the guiding dense reward state-action value is expressed as:

$$\mathcal{T}^{\pi} Q^{d}_{\theta}(s, a, \tau) = r^{d}_{\tau}(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi}\big[ Q^{d}_{\theta}(s', a', \tau) \big] \qquad (7)$$

where $r^{d}_{\tau}(s, a)$ represents the guiding dense reward obtained from task $\tau$. The backup operator for the goal-oriented sparse reward state-action value $Q^{g}_{\theta}$ is defined similarly with $r^{g}_{\tau}$. Both $Q^{d}_{\theta}$ and $Q^{g}_{\theta}$ can be learned from transitions sampled from a replay buffer.
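The sketch below shows the corresponding TD targets and critic loss for the two state-action value functions, assuming batched PyTorch tensors sampled from the replay buffer; target networks and SAC's entropy term are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def td_targets(r_dense, r_sparse, next_q_dense, next_q_sparse, done, gamma=0.99):
    """next_q_* are Q(s', a', tau) evaluated at a' ~ pi(.|s'); all arguments are batched tensors."""
    not_done = 1.0 - done
    target_dense = r_dense + gamma * not_done * next_q_dense     # backup for Q^d, Eq. (7)
    target_sparse = r_sparse + gamma * not_done * next_q_sparse  # analogous backup for Q^g
    return target_dense, target_sparse

def critic_loss(q_dense, q_sparse, target_dense, target_sparse):
    # Mean-squared TD error for both critics, trained from the same replay transitions.
    return (F.mse_loss(q_dense, target_dense.detach())
            + F.mse_loss(q_sparse, target_sparse.detach()))
```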
To achieve off-policy training of $\pi_{sp}$ and $\pi_{acp}$, we convert Eq. (6) into a bi-level optimization surrogate. The formulation is presented as follows:
$$\begin{aligned} \text{(a)}\quad & \max_{\pi}\ \mathbb{E}_{\tau \sim \mathcal{U}(\mathcal{C})}\,\mathbb{E}_{s \sim \mathcal{D}}\,\mathbb{E}_{a \sim \pi}\big[ Q^{d}_{\theta}(s, a, \tau) + \lambda\, Q^{g}_{\theta}(s, a, \tau) \big] \\ \text{(b)}\quad & \min_{\lambda \ge 0}\ \lambda \Big( \mathbb{E}_{\tau \sim \mathcal{U}(\mathcal{C})}\,\mathbb{E}_{s_0 \sim \rho_\tau}\,\mathbb{E}_{a \sim \pi}\big[ Q^{g}_{\theta}(s_0, a, \tau) \big] - B \Big) \end{aligned} \qquad (8)$$
Here, $\mathcal{D}$ represents a replay buffer containing a historical marginal state distribution used to train the policies. The initial state distribution $\rho_\tau$ is employed to train $\lambda$. This distinction arises from the idea that, when fine-tuning $\lambda$, we should primarily consider how well the policy satisfies the virtual expected budget starting from $s_0 \sim \rho_\tau$, rather than from some historical state distribution. Subsequently, we further transform the off-policy objective (Eq. (8a)) into two parts:
$$\begin{aligned} \text{(a)}\quad & \max_{\pi_{sp}}\ \mathbb{E}_{\tau \sim \mathcal{U}(\mathcal{C})}\,\mathbb{E}_{s \sim \mathcal{D}}\,\mathbb{E}_{a_{sp} \sim \pi_{sp}}\big[ Q^{d}_{\theta}(s, a_{sp}, \tau) \big] \\ \text{(b)}\quad & \max_{\pi_{acp}}\ \mathbb{E}_{\tau \sim \mathcal{U}(\mathcal{C})}\,\mathbb{E}_{s \sim \mathcal{D}}\,\mathbb{E}_{a_{sp} \sim \pi_{sp},\ \Delta a \sim \pi_{acp}}\big[ \lambda\, Q^{g}_{\theta}(s, a, \tau) - D(a_{sp}, a) \big],\quad a = f(a_{sp}, \Delta a) \end{aligned} \qquad (9)$$
In this modified formulation, the distance function $D(a_{sp}, a)$ quantifies the change from $a_{sp}$ to $a$. It is not necessarily proportional to the change in the action itself, as the editing function $f$ can introduce nonlinearity. ACP aims to maximize the goal-oriented sparse rewards while minimizing the distance between the actions before and after the correction. On the other hand, SP solely focuses on maximizing the guiding dense rewards. Importantly, ACP modifies SP’s action, which aligns with the discussed motivation for efficient exploration. The training objective (Eq. (9b)) for ACP relies on a critic $Q^{g}_{\theta}$, which learns the expected future goal-oriented sparse rewards. Consequently, guided by $Q^{g}_{\theta}$, ACP explores actions with greater potential over long-term sequences.
III-D Action correction function and Distance function
In this section, we present the design of the action correction function $f$ and the distance function $D$.
III-D1 Action correction function
We opt for the correction function $f$ to be primarily additive and simple, which helps to reduce training difficulty. Without loss of generality, we assume a bounded action space $[-1, 1]^{|\mathcal{A}|}$, and that both $a_{sp}$ and $\Delta a$ are already within this space. Consequently, we define:

$$a = f(a_{sp}, \Delta a) = \max\big( \min( a_{sp} + \beta\,\Delta a,\ 1 ),\ -1 \big) \qquad (10)$$

Here, the element-wise min and max, together with the multiplication of the correction by the coefficient $\beta$, clip the result and ensure that $a \in [-1, 1]^{|\mathcal{A}|}$. This means that SP retains full control over the final action and can overwrite the correction if necessary. This is crucial because ACP faces challenges in learning an effective correction policy over the short horizon, when SP still dominates the learning process. Although the additive operation is simple, the overall editing process is sufficiently general to encompass any modification.
This additive editing function is motivated by the goal of achieving sparsity. Policies are typically evaluated based on metrics such as success, which are only triggered for certain states. To explicitly incorporate this inductive bias, we adopt the additive correction function, which ensures that ACP learns a policy that maximizes metrics such as success based on SP. This simplifies the optimization landscape of ACP.
III-D2 Distance function
We utilize a hinge loss to compare the guiding dense reward state-action values of $a_{sp}$ and $a$:

$$D(a_{sp}, a) = \max\big( 0,\ Q^{d}_{\theta}(s, a_{sp}, \tau) - Q^{d}_{\theta}(s, a, \tau) \big) \qquad (11)$$

Here, $Q^{d}_{\theta}$ represents the critic and $\theta$ denotes its parameters. This loss yields zero if the edited action $a$ already attains a higher state-action value than the preliminary action $a_{sp}$. In such cases, only the goal-oriented sparse-reward term is optimized by ACP. Otherwise, the inner part of Eq. (9b) is recovered as $\lambda\, Q^{g}_{\theta}(s, a, \tau) + Q^{d}_{\theta}(s, a, \tau) - Q^{d}_{\theta}(s, a_{sp}, \tau)$. Our distance function in the critic space is more appropriate than a distance in the action space, because we ultimately care about how the state-action value changes after the action is edited.
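A minimal sketch of this hinge-style distance is given below: the correction is penalized only when it lowers the dense-reward critic's value. Here `q_dense` is assumed to be a callable critic $Q^{d}_{\theta}(s, a)$ operating on batched tensors.

```python
import torch

def hinge_distance(q_dense, state, a_sp, a_corrected):
    """D(a_sp, a) = max(0, Q^d(s, a_sp) - Q^d(s, a)), computed batch-wise (Eq. (11))."""
    value_gap = q_dense(state, a_sp) - q_dense(state, a_corrected)
    return torch.clamp(value_gap, min=0.0)
```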
III-E Training process
To practically train the objectives, we apply stochastic gradient descent (SGD) simultaneously to Eq. (8b) and Eq. (9). The re-parameterization trick is utilized for both $\pi_{sp}$ and $\pi_{acp}$ to enable the application of SGD. To assess the budget constraint, a batch of rollout experiences following $\pi$ is provided. The gradient of Eq. (8b) with respect to $\lambda$ is approximated as:

$$\nabla_{\lambda} \approx \frac{1}{|\mathcal{B}|} \sum_{(s_t, \tau) \in \mathcal{B}} \big( r^{g}_{\tau}(s_t) - b \big) \qquad (12)$$

Here, $b$ represents the per-step virtual expected budget as defined in Eq. (5). After each rollout, a batch of goal-oriented sparse rewards is collected, each reward is compared to $b$, and the mean of the differences is used to adjust $\lambda$. This approximation enables the updating of $\lambda$ using mini-batches of data, rather than waiting for complete episodes to conclude or relying on the often inaccurate estimated $Q^{g}_{\theta}$. Multiple parallel environments are employed to mitigate temporal correlation within the rollout batch data, which constitutes the data placed into the replay buffer.
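The following sketch illustrates this dual update: after each rollout batch, the mean gap between the collected sparse rewards and the per-step budget $b$ adjusts the multiplier. The learning rate and the log-parameterization used to keep $\lambda \ge 0$ are assumptions for illustration.

```python
import numpy as np

class LagrangeMultiplier:
    def __init__(self, init_value: float = 1.0, lr: float = 1e-3):
        self.log_lam = np.log(init_value)  # keep lambda >= 0 via a log-parameterization
        self.lr = lr

    @property
    def value(self) -> float:
        return float(np.exp(self.log_lam))

    def update(self, sparse_rewards, budget_b: float) -> float:
        # Descend on lambda * (mean(r^g) - b): lambda shrinks when the budget is met
        # and grows when the collected sparse rewards fall short of b (Eq. (12)).
        gap = float(np.mean(sparse_rewards) - budget_b)
        self.log_lam -= self.lr * gap
        return self.value
```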

The computational graph representing Eq. (9) is illustrated in Fig. 3. Our approach is applicable to various goal-oriented sparse rewards and action distance functions. To encourage exploration, we integrate SAC[26], which incorporates the entropy terms of $\pi_{sp}$ and $\pi_{acp}$ in Eq. (9), dynamically adjusting their weights based on two entropy targets as described in [26]. In our experiments, both SP and ACP are trained from scratch. The pseudocode for the overall TSAC is shown in Algorithm 1, and a simplified sketch of one training iteration is given below.
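The sketch is illustrative only and is not Algorithm 1 reproduced verbatim: `sample_task`, the dual-reward environment interface, the replay buffer, and the `update_*` helpers are hypothetical placeholders; `policy` is a composed SP+ACP module such as the ComposedPolicy sketch above, and `lagrange` is the multiplier sketch from this section.

```python
def tsac_train_step(envs, policy, critics, lagrange, buffer, budget_b, rollout_length=200):
    task, env = sample_task(envs)                    # tau ~ Uniform(C)
    state, _ = env.reset()
    sparse_rewards = []
    for _ in range(rollout_length):
        action = policy(state)                       # a = f(a_sp, delta_a), Eq. (10)
        state_next, r_dense, r_sparse, done = env.step(action)   # assumed dual-reward env
        buffer.add(task, state, action, r_dense, r_sparse, state_next, done)
        sparse_rewards.append(r_sparse)
        state = state_next
        if done:
            break
    batch = buffer.sample()
    update_critics(critics, batch)                               # TD backups for Q^d and Q^g, Eq. (7)
    update_sp(policy.sp, critics, batch)                         # maximize Q^d, Eq. (9a)
    update_acp(policy.acp, critics, batch, lam=lagrange.value)   # lambda*Q^g - D, Eq. (9b)
    lagrange.update(sparse_rewards, budget_b)                    # dual update, Eq. (12)
```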
IV Experiments
In this section, we evaluate TSAC in the Meta-World multi-task RL environment[27] and use Meta-World’s MT10 and MT50 benchmarks. The MT10 and MT50 evaluation protocols consist of 10 and 50 tasks, respectively (shown in Fig. 4). We compare TSAC against several baseline methods and conduct ablation studies to verify the effectiveness of our method.
The first goal of our experimental evaluation is to assess whether TSAC improves the performance of a multi-task agent. We compare the performance of TSAC in two different settings: a short horizon to evaluate its sample efficiency and a long horizon to measure its overall performance. For comparison, we select CARE[21], MT-SAC, Soft Modularization[16] and PCGrad[9] as our baselines.
Furthermore, we evaluate different action correction functions to identify which yields the best performance. The action correction functions considered are SP-dominated, ACP-dominated, equal contribution, and Softclip (see Section IV-C for the definitions).

IV-A Baselines
We will compare our method against the following baselines:
CARE: A representation-sharing method that learns a mixture of representations and how to compose them by leveraging additional information about each task.
MT–SAC: This approach directly applies the SAC algorithm to the multi-task setting. It utilizes a shared backbone with disentangled alphas.
Soft Modularization: A parameter-sharing method that uses a routing network to softly combine all possible routes for each task.
PCGrad: It is a gradient manipulation method that projects a task’s gradient onto the normal plane of any other conflicting task. However, it has high time complexity and is not suitable for MT50 in the long horizon.
TSAC (ours): Our proposed method builds upon CARE, leveraging its representation-sharing module. Beyond that, our approach adds behavior sharing and utilizes goal-oriented sparse rewards.

TABLE I: Success rates at 150 thousand steps on MT10 and MT50 (mean ± standard error).

| Success rate @ 150K | MT10 | MT50 |
|---|---|---|
| TSAC (ours) | 0.390 ± 0.115 | 0.362 ± 0.051 |
| CARE | 0.260 ± 0.062 | 0.277 ± 0.028 |
| MT-SAC | 0.198 ± 0.068 | 0.220 ± 0.035 |
| Soft-Mod | 0.180 ± 0.108 | 0.151 ± 0.040 |
| PCGrad | 0.276 ± 0.107 | 0.238 ± 0.035 |
TABLE II: Success rates on MT10 over the long horizon (mean ± standard error).

| Success rate (MT10) | 0.8M | 1M | Best |
|---|---|---|---|
| TSAC (ours) | 0.762 ± 0.109 | 0.722 ± 0.052 | 0.827 ± 0.038 |
| CARE | 0.667 ± 0.082 | 0.642 ± 0.076 | 0.708 ± 0.114 |
| MT-SAC | 0.574 ± 0.097 | 0.555 ± 0.166 | 0.635 ± 0.120 |
| Soft Mod | 0.549 ± 0.087 | 0.560 ± 0.107 | 0.596 ± 0.102 |
| PCGrad | 0.655 ± 0.150 | 0.644 ± 0.090 | 0.708 ± 0.114 |
TABLE III: Success rates on MT50 over the long horizon (mean ± standard error).

| Success rate (MT50) | 0.8M | 1M | Best |
|---|---|---|---|
| TSAC (ours) | 0.450 ± 0.046 | 0.445 ± 0.045 | 0.524 ± 0.030 |
| CARE | 0.418 ± 0.057 | 0.395 ± 0.031 | 0.497 ± 0.035 |
| MT-SAC | 0.381 ± 0.044 | 0.390 ± 0.045 | 0.431 ± 0.046 |
| Soft Mod | 0.155 ± 0.033 | 0.162 ± 0.034 | 0.207 ± 0.051 |
IV-B Comparative evaluation
Fig. 5a shows the average success rate on the 10 tasks of the MT10 benchmark from Meta-World for TSAC, CARE, MT-SAC, Soft Modularization, and PCGrad. Since the success rate is a binary variable, it is noisy; therefore, the results were averaged across multiple seeds and the curves were smoothed. Mean and standard error are reported for each value.
We consider 1 million steps as a long horizon, and 150 thousand steps as a short horizon. The short horizon is utilized to observe the exploration ability of different methods, while the long horizon is used to visualize the performance of the method at various time points. It is worth noting that all methods are trained using SAC with disentangled alphas.
Table I and Fig. 5a demonstrate that our method outperforms all the baselines on the short horizon in MT10. Additionally, Fig. 5c illustrates that even on MT50 our method still outperforms all baselines. For comparison, it takes around 1 million steps for the multi-task SAC agent to reach the accuracy that our TSAC agent achieves at around 300 thousand steps, suggesting that our method is highly sample-efficient and exploration-efficient.
Tables II and III as well as Fig. 5b and Fig. 5d show that TSAC is able to learn a good policy on the long horizon. For MT10, TSAC performs best and reaches a top success rate of 0.827 during training. Furthermore, sampling the success rate around 0.8 million and 1 million steps reveals that TSAC has the best performance. For MT50, conflicts between tasks become more acute. CARE stops learning around 600 thousand steps, and its performance begins to decline over the following training steps. TSAC's performance also declines as it suffers from conflicts between tasks, but it still performs better than CARE. MT-SAC achieves smooth learning and has performance similar to that of CARE and TSAC at 0.8 million and 1 million steps. However, in terms of the best performance throughout training, MT-SAC is far inferior to TSAC, which outperforms all methods. Since the success metric is a binary variable and very noisy, the best performance is obtained by smoothing the curve. Importantly, TSAC achieves average success rates of 0.450 and 0.445 on MT50 at 0.8 million and 1 million steps, surpassing the reported results of CARE, MT-SAC and Soft Modularization.
IV-C Ablation study
The results in the previous section suggest that TSAC is both sample efficient and exploration efficient. Furthermore, TSAC yields a good policy on long horizons. In this section, we further investigate the impact of the various action correction functions employed by TSAC and discuss the potential for improved performance.
In the ablation study, we consider four different action correction functions $f$ (illustrative sketches follow the list):
SP dominated (ours): SP's action dominates the final action, as in Eq. (10).
ACP dominated: ACP's action dominates the final action.
Equal contribution: SP and ACP contribute equally to the final action.
Softclip: a softclip function smooths the output action and introduces nonlinearity.
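The sketches below illustrate the four variants; the exact formulas and coefficients used in the paper are not reproduced, so the weightings here are assumptions chosen only to convey which policy dominates the final action.

```python
import numpy as np

def sp_dominated(a_sp, delta_a, beta=0.5):
    # SP's action carries full weight; the correction is scaled down (ours, cf. Eq. (10)).
    return np.clip(a_sp + beta * delta_a, -1.0, 1.0)

def acp_dominated(a_sp, delta_a, beta=0.5):
    # The correction carries full weight; SP's preliminary action is scaled down.
    return np.clip(beta * a_sp + delta_a, -1.0, 1.0)

def equal_contribution(a_sp, delta_a):
    # SP and ACP contribute equally to the final action.
    return np.clip(0.5 * (a_sp + delta_a), -1.0, 1.0)

def softclip(a_sp, delta_a):
    # A smooth, nonlinear squashing of the summed actions instead of a hard clip.
    return np.tanh(a_sp + delta_a)
```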

Fig. 6 illustrates that the SP-dominated action correction function outperforms the other functions in producing the final action. However, when SP and ACP contribute equally to the final action, the agent encounters difficulties in learning a policy. This conflicting relationship between SP and ACP presents challenges in optimizing each respective goal. Conversely, when either SP or ACP dominates the action correction function, the agent can successfully learn a well-performing policy. Comparatively, the SP-dominated variant outperforms the ACP-dominated one because ACP focuses on optimizing the goal-oriented sparse rewards, which are challenging to train on due to their sparsity. In contrast, the Softclip function exhibits a slight decrease in performance compared to our action correction function. This decline can be attributed to the introduction of nonlinearity, which further complicates and hampers the learning process. Consequently, we opt for a simple, linear operation instead of training a learnable network.
V Conclusion and Future Work
In this work, we present Task-Specific Action Correction (TSAC), a general and complementary MTRL method inspired by safe reinforcement learning, which surpasses several well-performing baselines on the MT10 and MT50 benchmarks from Meta-World.
In this paper, we showed that our method is able to learn a high-performing policy and achieve significant improvements in both sample efficiency and final performance compared to previous state-of-the-art multi-task policies. Furthermore, TSAC is general and complementary enough to be integrated with existing methods such as CARE and MT-SAC to improve them. Finally, we showed the benefits of decomposing policy learning into two policies and the validity of introducing goal-oriented sparse rewards. TSAC still performs well on more difficult and diverse task sets (MT50).
In future work, we will introduce a pre-training paradigm in which a policy is pretrained to maximize the guiding dense rewards and then used as the initialization for SP. In this case, SP cannot be frozen, because ACP continuously modifies its MDP. Instead, SP needs to be fine-tuned to adapt to the evolving actions of ACP. This pre-trained SP has the potential to accelerate the convergence of TSAC.
References
- [1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
- [2] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
- [3] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse et al., “Dota 2 with large scale deep reinforcement learning,” arXiv preprint arXiv:1912.06680, 2019.
- [4] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev et al., “Grandmaster level in starcraft ii using multi-agent reinforcement learning,” Nature, vol. 575, no. 7782, pp. 350–354, 2019.
- [5] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
- [6] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
- [7] L. Pinto and A. Gupta, “Learning to push by grasping: Using multiple tasks for effective learning,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 2161–2168.
- [8] S. Ruder, “An overview of multi-task learning in deep neural networks,” arXiv preprint arXiv:1706.05098, 2017.
- [9] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 5824–5836, 2020.
- [10] A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell, “Policy distillation,” arXiv preprint arXiv:1511.06295, 2015.
- [11] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu, “Distral: Robust multitask reinforcement learning,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [12] S. Liu, S. James, A. J. Davison, and E. Johns, “Auto-lambda: Disentangling dynamic task relationships,” arXiv preprint arXiv:2202.03091, 2022.
- [13] C. Fifty, E. Amid, Z. Zhao, T. Yu, R. Anil, and C. Finn, “Efficiently identifying task groupings for multi-task learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 27503–27516, 2021.
- [14] T. Yu, A. Kumar, Y. Chebotar, K. Hausman, S. Levine, and C. Finn, “Conservative data sharing for multi-task offline reinforcement learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 11501–11516, 2021.
- [15] T. Yu, A. Kumar, Y. Chebotar, K. Hausman, C. Finn, and S. Levine, “How to leverage unlabeled data in offline reinforcement learning,” in International Conference on Machine Learning. PMLR, 2022, pp. 25611–25635.
- [16] R. Yang, H. Xu, Y. Wu, and X. Wang, “Multi-task reinforcement learning with soft modularization,” Advances in Neural Information Processing Systems, vol. 33, pp. 4767–4777, 2020.
- [17] X. Sun, R. Panda, R. Feris, and K. Saenko, “Adashare: Learning what to share for efficient deep multi-task learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 8728–8740, 2020.
- [18] P. Guo, C.-Y. Lee, and D. Ulbricht, “Learning to branch for multi-task learning,” in Proceedings of the 37th International Conference on Machine Learning. PMLR, 2020, pp. 3854–3863.
- [19] T. Bram, G. Brunner, O. Richter, and R. Wattenhofer, “Attentive multi-task deep reinforcement learning,” in Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part III. Springer, 2020, pp. 134–149.
- [20] G. Cheng, L. Dong, W. Cai, and C. Sun, “Multi-task reinforcement learning with attention-based mixture of experts,” IEEE Robotics and Automation Letters, 2023.
- [21] S. Sodhani, A. Zhang, and J. Pineau, “Multi-task reinforcement learning with context-based representations,” in International Conference on Machine Learning. PMLR, 2021, pp. 9767–9779.
- [22] B. Liu, X. Liu, X. Jin, P. Stone, and Q. Liu, “Conflict-averse gradient descent for multi-task learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 18878–18890, 2021.
- [23] H. Yu, W. Xu, and H. Zhang, “Towards safe reinforcement learning with a safety editor policy,” Advances in Neural Information Processing Systems, vol. 35, pp. 2608–2621, 2022.
- [24] Y. Zhang, Q. Vuong, and K. Ross, “First order constrained optimization in policy space,” Advances in Neural Information Processing Systems, vol. 33, pp. 15338–15349, 2020.
- [25] Z. Qin, Y. Chen, and C. Fan, “Density constrained reinforcement learning,” in International Conference on Machine Learning. PMLR, 2021, pp. 8682–8692.
- [26] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proceedings of the 35th International Conference on Machine Learning. PMLR, 2018, pp. 1861–1870.
- [27] T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine, “Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning,” in Proceedings of the Conference on Robot Learning. PMLR, 2020, pp. 1094–1100.
- [28] A. Hallak, D. Di Castro, and S. Mannor, “Contextual markov decision processes,” arXiv preprint arXiv:1502.02259, 2015.
- [29] G. Tesauro et al., “Temporal difference learning and td-gammon,” Communications of the ACM, vol. 38, no. 3, pp. 58–68, 1995.