
Episodic Return Decomposition by Difference of Implicitly Assigned Sub-Trajectory Reward

Haoxin Lin1,2,3,   Hongqiu Wu1,2,   Jiaji Zhang1,2,   Yihao Sun1,2,   Junyin Ye1,2,3,   Yang Yu1,2,3,4
1National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
2School of Artificial Intelligence, Nanjing University, Nanjing, China
3Polixir Technologies, Nanjing, China
4Peng Cheng Laboratory, Shenzhen, 518055, China
{linhx, wuhq, zhangjj, sunyh, yejy}@lamda.nju.edu.cn, [email protected]
Corresponding Author
Abstract

Real-world decision-making problems are usually accompanied by delayed rewards, which degrade the sample efficiency of Reinforcement Learning, especially in the extremely delayed case where the only feedback is the episodic reward obtained at the end of an episode. Episodic return decomposition is a promising way to deal with this episodic-reward setting. Several corresponding algorithms have demonstrated the remarkable effectiveness of step-wise proxy rewards learned via return decomposition. However, these existing methods lack either attribution or representation capacity, leading to inefficient decomposition in the case of long-term episodes. In this paper, we propose a novel episodic return decomposition method called Diaster (Difference of implicitly assigned sub-trajectory reward). Diaster decomposes any episodic reward into credits of two divided sub-trajectories at any cut point, and the step-wise proxy rewards come from differences in expectation. We theoretically and empirically verify that the decomposed proxy reward function can guide the policy to be nearly optimal. Experimental results show that our method outperforms previous state-of-the-art methods in terms of both sample efficiency and performance. The code is available at https://github.com/HxLyn3/Diaster.

Introduction

Reinforcement Learning (RL) has achieved empirical successes in several simulated domains and attracted close attention in recent years. Sparse and delayed rewards, which arise in a range of real-world problems such as industrial control [1], molecular design [2], and recommendation systems [3], are one of the critical challenges hindering the application of RL in reality. The trouble with a delayed reward function is that it degrades RL’s sample efficiency when the reward is used to estimate the value function. Specifically, delayed rewards introduce numerous ineffective updates and fitting capacity loss [4] into Temporal-Difference (TD) Learning and result in high variance in Monte-Carlo (MC) Learning [5]. The problem is even more severe in the extremely delayed case where the feedback appears only at the end of the episode.

It is essential to design a proxy Markovian reward function as a substitute for the original delayed one, since current standard RL algorithms prefer instant feedback at each environmental step. The obvious solution is to design the required proxy reward function by hand, but this depends on domain knowledge and does not generalize across tasks. Reward shaping [6, 7, 8], an automatic reward design method, has been widely observed to accelerate RL. However, the reshaped rewards are coupled with the original rewards and are ineffective in environments where only an episodic reward is fed back. Recent work [5, 9, 10, 11] proposes to decompose the episodic reward into a sequence of step-wise rewards along the trajectory, which is promising for dealing with long-term delay.

(a) Attribute a return to the full trajectory.
(b) Attribute a return to each step.
(c) Attribute a return to any two cut sub-trajectories.
Figure 1: Comparison of different episodic return decompositions. (a) introduces a temporal structure to represent the episodic return function $\hat{R}_{\text{ep}}(\tau_{0:T})$. The step-wise proxy rewards can be assigned via contribution analysis from a backward view. (b) directly decomposes the episodic return into the step-wise proxy rewards $\{\hat{r}(s_{i},a_{i})\}_{i=0}^{T-1}$. (c) decomposes the episodic return into two sub-trajectory rewards, $\hat{R}_{\text{sub}}(\tau_{0:c})$ and $\hat{R}_{\text{sub}}(\tau_{c:T})$, for any cut point $c$.

Previous return decomposition methods can be divided into two classes, as shown in Figures 1(a) and 1(b). The first class pays attention to precisely predicting the long-term episodic return while thinking little of reward redistribution. The representative work is RUDDER [5], which utilizes an LSTM [12] network to predict the return and assigns rewards to each state-action pair via contribution analysis from a backward view. The second class focuses on step-wise return decomposition without considering temporal structural representations. A line of work [10, 13, 11] that optimizes a least-squares-based return decomposition objective falls into this class. Nevertheless, the above work suffers from a lack of either attribution or representation capacity, leading to inefficient episodic return decomposition in the case of long-term episodes.

In this paper, we propose to cut an entire trajectory at any time step and decompose the episodic return into two sub-trajectory rewards, as demonstrated in Figure 1(c). The step-wise reward can be straightforwardly obtained by differencing the implicitly assigned sub-trajectory rewards in expectation. Based on this key idea, we formulate a simple yet effective episodic return decomposition method called Diaster (Difference of implicitly assigned sub-trajectory reward), which can be embedded in any model-free RL algorithm. This new class of return decomposition not only introduces temporal structural representations into the episodic reward function but also properly attributes the return to different parts of a given trajectory. Hence, a practical proxy reward function can be learned to guide the policy to be optimal.

In general, our contributions are summarized as follows:

  • We elaborate on the formulation of our new episodic return decomposition method, i.e., Diaster, and present its practical neuron-based implementation.

  • We theoretically verify that the step-wise proxy reward function learned through Diaster is return-equivalent to the original MDP in expectation and capable of guiding the policy to be nearly optimal.

  • We empirically show that Diaster can provide effective proxy rewards for RL algorithms and outperform previous state-of-the-art return decomposition methods in terms of both sample efficiency and performance.

  • We conduct an ablation study showing that the number of cut points in Diaster controls a trade-off between long-term representation and attribution.

Related Work

Delayed Reward

The setting of delayed rewards [14, 15, 16] usually accompanies real-world RL problems, resulting in high bias in TD Learning and high variance in MC Learning [5]. This paper considers the extremely delayed case in which only one episodic feedback can be obtained at the end of an episode. The problem is how to automatically find a dense proxy reward function that enables efficient value estimation in any task with this episodic-reward setting.

Reward Shaping

Reward shaping is one of the popular methods that replace the original rewards with more frequent feedback to accelerate RL agents’ learning. Early work [17, 6, 7] focuses on designing the shaping reward function and shows improved learning efficiency. However, these methods do not guarantee consistency of the optimal policy. Potential-based reward shaping (PBRS) [8] guarantees the policy invariance property by restricting the shaping function to a difference of potential functions defined over states. Several variants [18, 19, 20] of PBRS introduce more information into the potential function to provide additional expressive power for proxy rewards and improve sample efficiency. These potential-based methods do not correspond to an optimal return decomposition, as their proxy rewards are additively coupled with the original rewards, according to [5].

Return Decomposition

Episodic return decomposition is a promising way to tackle rewards with an extreme delay. RUDDER [5] uses an LSTM network [12] to predict the return of an episode and decomposes the predicted return into contributions of each state-action pair in the sequence. [10] directly optimizes the least-squares-based decomposition objective, i.e., the expected squared error between the sum of the state-action pairs’ proxy rewards and the corresponding trajectory-wise reward. This direct return decomposition method is extended by [13] to decompose the episodic return into the rewards of all time intervals. IRCR [9] considers a uniform reward redistribution of the return to each constituent state-action pair, which is an intuitive solution of the optimal reward problem (ORP) [21, 22]. To bridge direct return decomposition and IRCR, RRD [11] fits the episodic return with the proxy rewards of a random subsequence of the trajectory, achieving a trade-off between decomposition complexity and estimation accuracy.

Temporal Credit Assignment

Long-term temporal credit assignment (TCA) [23], which attributes future outcomes to agents’ actions even when those outcomes are observed only after a long time interval, is a long-standing problem in RL. General TCA frameworks [24, 25] are proposed to extend the $\lambda$-return [26], which is one of the earliest heuristic mechanisms for TCA. Hindsight credit assignment (HCA) [27] assigns credit to past actions by reasoning in the backward view, opening up a new family of algorithms. Temporal value transport (TVT) [28] augments the capabilities of memory-based agents and uses recall of specific memories to credit actions from the past, achieving efficient credit transport.

Preliminaries

A finite-horizon Markov Decision Process (MDP) with an episodic-reward setting is characterized by a tuple $\langle\mathcal{S},\mathcal{A},P,\rho_{0},r,R_{\text{ep}},T\rangle$. The episodic horizon is limited by the scalar $T$. Each episode starts with an environmental state $s_{0}\in\mathcal{S}$ sampled from the initial state distribution $\rho_{0}$. At each time step $t\in\{0,\ldots,T-1\}$, after the agent observes the state $s_{t}$ and performs an action $a_{t}\in\mathcal{A}$, the environment transitions to the next state $s_{t+1}\in\mathcal{S}$ according to the unknown transition function $P(s_{t+1}|s_{t},a_{t})$. The step-wise reward $r(s_{t},a_{t})$ is unobservable to the agent; the only feedback, called the episodic reward $R_{\text{ep}}(\tau_{0:T})$, is accessible at the end of the trajectory $\tau_{0:T}=(s_{0},a_{0},s_{1},a_{1},\ldots,s_{T-1},a_{T-1})$. In this paper, we assume that the episodic reward is the sum of the step-wise rewards, i.e., $R_{\text{ep}}(\tau_{0:T})=\sum_{t=0}^{T-1}r(s_{t},a_{t})$, which is satisfied in most cases. The goal under this setting is to find the optimal policy that maximizes the expected episodic return (i.e., episodic reward) $\mathbb{E}_{\rho^{\pi}}\left[R_{\text{ep}}(\tau_{0:T})\right]$, where $\rho^{\pi}$, induced by the dynamics function $P(s_{t+1}|s_{t},a_{t})$ and the policy $\pi(a_{t}|s_{t})$, is the on-policy distribution over states.

The state-action value function, defined as

Q^{\pi}(s_{t},a_{t})=\mathbb{E}_{P,\pi}\left[\left.\sum_{i=0}^{T-t-1}r(s_{t+i},a_{t+i})\right|s_{t},a_{t}\right] (1)

cannot be estimated since the immediate reward is not fed back from the environment. Directly using the delayed episodic reward to learn the state-action value function still leads to the optimal policy consistently, as proven by [5]. However, it is not an appropriate choice, as it introduces high bias in TD Learning and high variance in MC Learning. A promising way to solve any MDP with an episodic-reward setting is to find a dense proxy reward function $\hat{r}(s,a)$ that nearly satisfies the sum-form decomposition,

R_{\text{ep}}(\tau_{0:T})\approx\hat{R}_{\text{ep}}(\tau_{0:T})=\sum_{t=0}^{T-1}\hat{r}(s_{t},a_{t}), (2)

which is considered in several previous works [5, 10, 13, 11, 29].

Diaster: Difference of Implicitly Assigned Sub-Trajectory Reward

In this section, we first elaborate on our return decomposition method Diaster (Difference of implicitly assigned sub-trajectory reward) and then verify its effectiveness theoretically. Finally, we present a practical neuron-based implementation of Diaster.

Description of Diaster

Rather than directly decomposing the episodic reward into step-wise rewards, we consider finding a proxy sub-trajectory reward function $\hat{R}_{\text{sub}}$ that satisfies

\hat{R}_{\text{sub}}(\varnothing)=0, (3)
\hat{R}_{\text{sub}}(\tau_{0:T})\approx R_{\text{ep}}(\tau_{0:T}), (4)
\hat{R}_{\text{sub}}(\tau_{0:c})+\hat{R}_{\text{sub}}(\tau_{c:T})\approx R_{\text{ep}}(\tau_{0:T}),\quad\forall c\in\{1,\ldots,T-1\}, (5)

for any episode $\tau_{0:T}$, where $\tau_{0:c}$ is the former $c$-step sub-trajectory, $\tau_{c:T}$ is the latter $(T-c)$-step sub-trajectory, and $\hat{R}_{\text{sub}}(\varnothing)$ is the supplementary definition for the empty sub-trajectory $\tau_{0:0}=\varnothing$. Compared to the step-wise decomposition (2), this formulation eases the difficulty of credit assignment for several reasons:

  • Only two components need to be assigned credit after cutting an episode at any cut point $c$.

  • The $T$ possible cut points of an episode provide more relations between the proxy reward function and the episodic reward function, increasing the number of training samples for episodic return decomposition.

  • The step-wise decomposition (2) can be regarded as a special case of the sub-trajectory-wise decomposition (5) that lets $\hat{R}_{\text{sub}}(\tau_{0:c})=\sum_{t=0}^{c-1}\hat{r}(s_{t},a_{t})$ and $\hat{R}_{\text{sub}}(\tau_{c:T})=\sum_{t=c}^{T-1}\hat{r}(s_{t},a_{t})$. However, a sub-trajectory reward does not have to come from the sum of its state-action pairs’ rewards in practice; it can be any other function of the sub-trajectory. The general formulation relaxes the assumption about how the episodic reward arises.

  • This form of return decomposition introduces temporal structural representations into the episodic reward function, enabling the use of RNNs [30] to exploit the sequential information.

Given any episode $\tau_{0:T}$ sampled by the policy $\pi$, the proxy reward associated with each step’s preceding sub-trajectory is obtained by difference as

\hat{R}_{d}(s_{0},a_{0})=\hat{R}_{\text{sub}}(\tau_{0:1})-\hat{R}_{\text{sub}}(\varnothing), (6)
\hat{R}_{d}(\tau_{0:t},s_{t},a_{t})=\hat{R}_{\text{sub}}(\tau_{0:t+1})-\hat{R}_{\text{sub}}(\tau_{0:t}), (7)

for any $(s_{t},a_{t})$ belonging to $\tau_{0:T}$. The method of difference is also considered by [5] to guarantee an equivalent episodic return, since

\sum_{t=0}^{T-1}\hat{R}_{d}(\tau_{0:t},s_{t},a_{t})=\hat{R}_{\text{sub}}(\tau_{0:T})\approx R_{\text{ep}}(\tau_{0:T}). (8)

However, the above reward function $\hat{R}_{d}$ is still delayed and non-Markovian. Thus we define the step-wise proxy reward function as

\hat{r}^{\pi}(s_{t},a_{t})=\mathbb{E}_{\tau_{0:t}\sim\Gamma^{\pi}_{t}(\cdot|s_{t})}\left[\hat{R}_{d}(\tau_{0:t},s_{t},a_{t})\right], (9)

where

\Gamma^{\pi}_{t}(\tau_{0:t}|s_{t})=\frac{\mathcal{T}_{t}^{\pi}(\tau_{0:t})P(s_{t}|s_{t-1},a_{t-1})}{\rho^{\pi}(s_{t})} (10)

is the distribution over the $t$-step sub-trajectory space conditioned on the state $s_{t}$, while $\mathcal{T}_{t}^{\pi}(\tau_{0:t})$ is the unconditioned distribution given by

\mathcal{T}_{t}^{\pi}(\tau_{0:t})=\rho_{0}(s_{0})\pi(a_{0}|s_{0})\prod_{i=1}^{t-1}P(s_{i}|s_{i-1},a_{i-1})\pi(a_{i}|s_{i}). (11)

Analysis of Diaster

Our analysis in this subsection focuses on determining whether the proxy reward function and the corresponding proxy state-action value function are effective for policy optimization. We start by presenting a simple theorem, offering some insights about the relationship between the step-wise reward $\hat{r}^{\pi}$ and the sub-trajectory reward $\hat{R}_{\text{sub}}$.

Theorem 1.

Given any policy $\pi$, the following equation holds for any subsequence length $h\in\{2,\ldots,T\}$ and time step $t\in\{0,\ldots,T-h\}$:

\mathbb{E}_{\rho^{\pi}}\left[\sum_{i=t}^{t+h-1}\hat{r}^{\pi}(s_{i},a_{i})\right]=\mathbb{E}_{\rho^{\pi}}\left[\hat{R}_{\text{sub}}(\tau_{0:t+h})-\hat{R}_{\text{sub}}(\tau_{0:t})\right]. (12)
Proof.

See Appendix A.1. ∎

This theorem states that the sum of any consecutive $h$-step sequence of $\hat{r}^{\pi}$ equals, in expectation, the difference of two sub-trajectory rewards that differ by $h$ steps. In particular, letting $t=0$ and $h=T$, and using $\hat{R}_{\text{sub}}(\varnothing)=0$ together with the approximation (4), it becomes

\mathbb{E}_{\rho^{\pi}}\left[\sum_{t=0}^{T-1}\hat{r}^{\pi}(s_{t},a_{t})\right]\approx\mathbb{E}_{\rho^{\pi}}\left[R_{\text{ep}}(\tau_{0:T})\right], (13)

which reveals that $\hat{r}^{\pi}$ is one of the return-equivalent proxy rewards of the original MDP. Therefore, the goal of policy optimization does not change when using $\hat{r}^{\pi}$ as the immediate guidance.

Next we aim to show that using the proxy reward $\hat{r}^{\pi}(s_{t},a_{t})$ to learn the proxy state-action value function, defined as

\hat{Q}^{\pi}(s_{t},a_{t})=\mathbb{E}_{P,\pi}\left[\left.\sum_{i=0}^{T-t-1}\hat{r}^{\pi}(s_{t+i},a_{t+i})\right|s_{t},a_{t}\right], (14)

nearly leads to the optimal policy in the original MDP. A sufficient condition for $\hat{Q}^{\pi}(s_{t},a_{t})$ to preserve the optimal policy is provided by Lemma 1.

Lemma 1.

For any finite MDP with a state-action value function $Q^{\pi}(s,a)$ defined based on the unobservable step-wise reward function $r(s,a)$, any proxy state-action value function $\hat{Q}^{\pi}(s,a)$ that satisfies

\hat{Q}^{\pi}(s,a)=Q^{\pi}(s,a)+\delta(s), (15)

where $\delta(s)$ is any function defined over the state space, will lead to the same optimal policy as $Q^{\pi}$, whether using value-based or policy gradient methods.

Proof.

See Appendix A.2. ∎

This lemma indicates that as long as the difference between $\hat{Q}^{\pi}(s,a)$ and $Q^{\pi}(s,a)$ depends only on the state $s$, the original optimal policy is preserved even when $\hat{Q}^{\pi}(s,a)$ is used to optimize the policy. In order to verify that $\hat{Q}^{\pi}(s,a)$ satisfies condition (15), we further extend Theorem 1 to a conditional form.

Theorem 2.

Given any policy $\pi$, the following equation holds for any $s_{t}\in\mathcal{S}$ and $a_{t}\in\mathcal{A}$ at any time step $t\in\{0,\ldots,T-1\}$:

\mathbb{E}_{P,\pi}\left[\left.\sum_{i=1}^{T-t-1}\hat{r}^{\pi}(s_{t+i},a_{t+i})\right|s_{t},a_{t}\right]=\mathbb{E}_{P,\pi}\left[\left.\hat{R}_{\text{sub}}(\tau_{0:T})-\hat{R}_{\text{sub}}(\tau_{0:t+1})\right|s_{t},a_{t}\right]. (16)
Proof.

See Appendix A.3. ∎

By applying Theorem 2, it is observed that

\hat{Q}^{\pi}(s_{t},a_{t})=\hat{r}^{\pi}(s_{t},a_{t})+\mathbb{E}_{P,\pi}\left[\left.\hat{R}_{\text{sub}}(\tau_{0:T})-\hat{R}_{\text{sub}}(\tau_{0:t+1})\right|s_{t},a_{t}\right] (17)
\approx\mathbb{E}_{P,\pi}\left[\left.\sum_{i=0}^{T-1}r(s_{i},a_{i})-\hat{R}_{\text{sub}}(\tau_{0:t})\right|s_{t},a_{t}\right]
\approx Q^{\pi}(s_{t},a_{t})+\mathbb{E}_{P,\pi}\left[\left.\sum_{i=0}^{t-1}r(s_{i},a_{i})-\hat{R}_{\text{sub}}(\tau_{0:t})\right|s_{t}\right],

which conforms to the form of condition (15) with $\delta(s_{t})=\mathbb{E}_{P,\pi}\left[\left.\sum_{i=0}^{t-1}r(s_{i},a_{i})-\hat{R}_{\text{sub}}(\tau_{0:t})\right|s_{t}\right]$. Hence $\hat{Q}^{\pi}$ is an effective substitute for $Q^{\pi}$ in optimizing the policy when the original reward $r$ is inaccessible.

Algorithm 1 Diaster
1:  Input: Neural parameters $\psi$, $\phi$, replay buffer $\mathcal{D}$, learning rates $\lambda_{\hat{R}}$, $\lambda_{\hat{r}}$, batch size $B$, and RL algorithm RL-ALGO containing the initial policy $\pi_{\theta}$.
2:  for $N$ episodes do
3:     Collect a trajectory $\tau_{0:T}$ using the policy $\pi_{\theta}$.
4:     Store $\tau_{0:T}$ and its feedback $R_{\text{ep}}(\tau_{0:T})$ in $\mathcal{D}$.
5:     for $M$ batches do
6:        Sample $B$ trajectories $\{\tau_{0:T}^{i}\}_{i=0}^{B-1}$ from $\mathcal{D}$.
7:        Sample $B$ scalars $\{c^{i}\}_{i=0}^{B-1}$ from $\{0,\ldots,T-1\}$.
8:        $\psi\leftarrow\psi-\lambda_{\hat{R}}\nabla_{\psi}\frac{1}{B}\sum_{i=0}^{B-1}l_{\hat{R}}(\psi,\tau_{0:T}^{i},c^{i})$ by Eq. (19).
9:        $\phi\leftarrow\phi-\lambda_{\hat{r}}\nabla_{\phi}\frac{1}{B}\sum_{i=0}^{B-1}l_{\hat{r}}(\phi,\tau_{0:c^{i}+1}^{i})$ by Eq. (21).
10:        Sample $B$ transitions $\{(s_{i},a_{i},s_{i+1})\}_{i=0}^{B-1}$ from $\mathcal{D}$.
11:        Use RL-ALGO to optimize the policy $\pi_{\theta}$ with $\{(s_{i},a_{i},\hat{r}^{\phi}(s_{i},a_{i}),s_{i+1})\}_{i=0}^{B-1}$.
12:     end for
13:  end for

Practical Implementation of Diaster

The primary problem is how to obtain an implicitly assigned sub-trajectory reward that satisfies (5). The error of the approximation (5) directly determines the optimality of the policy learned from the proxy rewards. To extract the temporal structures contained in sub-trajectories, we use an RNN with a GRU [31] cell, parameterized by $\psi$, to represent the sub-trajectory reward function $\hat{R}_{\text{sub}}^{\psi}$. We train $\hat{R}_{\text{sub}}^{\psi}$ by minimizing the expected loss of episodic return decomposition:

\mathcal{L}_{\hat{R}}(\psi)=\mathop{\mathbb{E}}_{\tau_{0:T}\in\mathcal{D}}\left[\mathop{\mathbb{E}}\limits_{c\sim U(\{0,\ldots,T-1\})}\left[l_{\hat{R}}(\psi,\tau_{0:T},c)\right]\right], (18)

with the replay buffer $\mathcal{D}$, and

l_{\hat{R}}(\psi,\tau_{0:T},c)=\left(\hat{R}_{\text{sub}}^{\psi}(\tau_{0:c})+\hat{R}_{\text{sub}}^{\psi}(\tau_{c:T})-R_{\text{ep}}(\tau_{0:T})\right)^{2}. (19)
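To make the decomposition concrete, the following is a minimal PyTorch sketch of a GRU-based sub-trajectory reward model and the decomposition loss in Eq. (19). The class name SubTrajRewardNet, the network sizes, and the use of a single shared cut point per batch are illustrative assumptions of ours, not the released implementation.

import torch
import torch.nn as nn

class SubTrajRewardNet(nn.Module):
    """GRU that maps a (state, action) sub-trajectory to a scalar sub-trajectory reward."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.gru = nn.GRU(state_dim + action_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, states, actions):
        # states: (B, L, state_dim), actions: (B, L, action_dim); L may be 0.
        if states.shape[1] == 0:
            # Empty sub-trajectory: R_sub(empty) = 0 by definition (3).
            return torch.zeros(states.shape[0], device=states.device)
        x = torch.cat([states, actions], dim=-1)
        out, _ = self.gru(x)
        return self.head(out[:, -1]).squeeze(-1)  # reward of the whole sub-trajectory

def decomposition_loss(net, states, actions, ep_return, cut):
    # Squared error between R_sub(tau_{0:c}) + R_sub(tau_{c:T}) and the episodic return, Eq. (19).
    r_front = net(states[:, :cut], actions[:, :cut])
    r_back = net(states[:, cut:], actions[:, cut:])
    return ((r_front + r_back - ep_return) ** 2).mean()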

The step-wise proxy reward function $\hat{r}^{\phi}$, with neural parameters $\phi$, is trained to predict the difference of sub-trajectory rewards. The expected loss is

\mathcal{L}_{\hat{r}}(\phi)=\mathop{\mathbb{E}}_{\tau_{0:T}\in\mathcal{D}}\left[\mathop{\mathbb{E}}\limits_{t\sim U(\{0,\ldots,T-1\})}\left[l_{\hat{r}}(\phi,\tau_{0:t+1})\right]\right], (20)

where

l_{\hat{r}}(\phi,\tau_{0:t+1})=\left(\hat{r}^{\phi}(s_{t},a_{t})-\left(\hat{R}_{\text{sub}}^{\psi}(\tau_{0:t+1})-\hat{R}_{\text{sub}}^{\psi}(\tau_{0:t})\right)\right)^{2} (21)

is the prediction loss for any given sub-trajectory $\tau_{0:t+1}=(s_{0},a_{0},\ldots,s_{t},a_{t})$.
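Continuing the sketch above, the step-wise reward network can be regressed onto the difference of two consecutive sub-trajectory rewards as in Eq. (21). Here r_net is assumed to be any MLP mapping a state-action pair to a scalar; treating the GRU outputs as fixed targets (no gradient) is a choice made for this sketch.

import torch

def stepwise_reward_loss(r_net, subtraj_net, states, actions, t):
    # states: (B, T, state_dim), actions: (B, T, action_dim), t: index in {0, ..., T-1}.
    with torch.no_grad():
        # Target: R_sub(tau_{0:t+1}) - R_sub(tau_{0:t}) from the decomposition model.
        target = (subtraj_net(states[:, :t + 1], actions[:, :t + 1])
                  - subtraj_net(states[:, :t], actions[:, :t]))
    pred = r_net(states[:, t], actions[:, t]).squeeze(-1)  # r_phi(s_t, a_t), shape (B,)
    return ((pred - target) ** 2).mean()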

The complete algorithm of Diaster is described in Algorithm 1, where RL-ALGO can be any off-policy RL algorithm. Diaster is easy to apply to tasks with an episodic-reward setting thanks to its simple implementation and the few hyper-parameters that require tuning.

Experiments

In this section, we focus on four primary questions: 1) What does the step-wise proxy reward function learned through Diaster look like? 2) How well does Diaster perform on RL benchmark tasks, in contrast to previous state-of-the-art episodic return decomposition methods? 3) What happens if the episodic return is decomposed into implicit rewards of more than two sub-trajectories? 4) Is it necessary to learn the step-wise proxy reward function?

Experimental Setting

We conduct a series of experiments on MuJoCo [32] and PointMaze [33] benchmarks with the same episodic-reward setting as [9, 11]. Specifically, we modify the original environments to prevent the agent from accessing the instant reward; instead, the agent receives a zero signal at non-terminal states. The only feedback is the episodic reward $R_{\text{ep}}(\tau_{0:T})$ returned at the end of an episode. Each episode is set up with a maximum trajectory horizon $T=1000$. Other environmental parameters are kept at their defaults.
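For illustration, a Gymnasium-style wrapper like the sketch below reproduces this episodic-reward setting by hiding step-wise rewards and revealing their sum only on the last step; it is our own construction under these assumptions, not the benchmark code of [9, 11].

import gymnasium as gym

class EpisodicRewardWrapper(gym.Wrapper):
    """Zero out step-wise rewards and return their sum only at the end of an episode."""
    def __init__(self, env):
        super().__init__(env)
        self._ep_return = 0.0

    def reset(self, **kwargs):
        self._ep_return = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._ep_return += reward  # accumulate the hidden step-wise rewards
        if terminated or truncated:
            return obs, self._ep_return, terminated, truncated, info  # episodic reward R_ep
        return obs, 0.0, terminated, truncated, info  # zero signal at non-terminal states

env = EpisodicRewardWrapper(gym.make("Hopper-v4"))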

During the following experiments, our Diaster is implemented based on SAC [34], one of the state-of-the-art model-free RL algorithms in continuous control tasks. The algorithmic hyper-parameters are given in Appendix B.

Figure 2: Proxy reward visualization. For each sampled state-action pair $(s,a)$, we show the forward motion of the agent along with the learned proxy reward $\hat{r}^{\phi}(s,a)$ (blue point) and the original unobservable reward $r(s,a)$ (orange point).

Proxy Reward Visualization

We aim to visualize the step-wise proxy rewards learned through Diaster on four MuJoCo tasks with the episodic-reward setting, including Hopper, Swimmer, Walker2d, and Humanoid. The principal goal of these environments is to make the robot move forward as fast as possible within the limited time horizon, so the episodic reward function is mainly determined by the agent’s forward motion. To coincide with the episodic goal, the proxy reward function has to be related to the forward motion at any time step. Therefore, revealing the relationship between the learned proxy rewards and the agent’s forward motion can indicate the effectiveness of return decomposition. We sample several state-action pairs in the environment and show their proxy rewards along with the step-wise forward displacements. In addition, we show the original environmental rewards for reference. The results are demonstrated in Figure 2.

We observe that state-action pairs with larger forward displacement tend to receive greater proxy rewards. This nearly monotonic tendency shows how the learned proxy rewards depend on the forward motion and indicates that Diaster can discover the latent attribution of the episodic reward function and make an effective decomposition. Intuitively, such a proxy reward function encourages the agent to step forward, consistent with the original goal of MuJoCo tasks.

Performance Comparison

We evaluate our Diaster on seven MuJoCo tasks with the episodic-reward setting, including InvertedPendulum, Hopper, Swimmer, Walker2d, Ant, Humanoid, and HumanoidStandup. Three existing episodic return decomposition methods are selected for comparison:

  • RUDDER [5] trains an LSTM network to predict the episodic reward at every time step. The step-wise proxy reward is assigned by the difference between return predictions of two consecutive states.

  • IRCR [9] considers setting the proxy reward as the normalized value of the corresponding episodic return instead of learning any parametric step-wise reward function.

  • RRD [11] carries out return decomposition on randomly sampled short subsequences of trajectories, which is adaptive to long-horizon tasks and achieves state-of-the-art performance.

Figure 3: Learning curves of our Diaster (red) and the other three baselines on MuJoCo continuous control tasks. The solid lines indicate the mean and shaded areas indicate the standard error over five different random seeds. Each evaluation, taken every 1,000 environmental steps, calculates the average return over ten episodes.

Figure 3 shows the learning curves of our Diaster (red) and the other three baselines. RUDDER is no match for Diaster in any of the seven tasks. As for RRD and IRCR, Diaster shows better sample efficiency than these two algorithms in most of the environments; only in the HumanoidStandup task does Diaster have a slight disadvantage compared to RRD and IRCR. Although not overwhelmingly superior, these results show that Diaster has both high sample efficiency and competitive performance on MuJoCo benchmarks.

On Number of Cut Points

Above, we proposed to cut an entire trajectory at any cut point and decompose the episodic return into two sub-trajectory rewards. A natural question is what happens if the episodic return is decomposed into implicit rewards of more than two sub-trajectories. In this subsection, we conduct an ablation study to show how Diaster performs as the number of cut points changes.

With any $m$ sampled cut points $\{c_{i}\}_{i=0}^{m-1}$ satisfying $0<c_{0}<c_{1}<\cdots<c_{m-1}<T$, the condition (5) for the sub-trajectory reward function can be extended as

\hat{R}_{\text{sub}}(\tau_{0:c_{0}})+\sum_{i=1}^{m-1}\hat{R}_{\text{sub}}(\tau_{c_{i-1}:c_{i}})+\hat{R}_{\text{sub}}(\tau_{c_{m-1}:T})\approx R_{\text{ep}}(\tau_{0:T}), (22)

while $\hat{R}_{\text{sub}}(\varnothing)=0$ and $\hat{R}_{\text{sub}}(\tau_{0:T})\approx R_{\text{ep}}(\tau_{0:T})$ nonetheless hold. If $m=0$, this formulation just learns a function to fit $R_{\text{ep}}(\tau_{0:T})$, without any attribution of the episodic reward. If $m=T-1$, it reduces to the step-wise return decomposition (2), which is incapable of bringing in any temporal structural representation. Any $m$ in $\{1,2,\ldots,T-2\}$ achieves a trade-off between representation and attribution of the episodic reward function.
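As an illustration, the decomposition loss sketched earlier extends to Eq. (22) by splitting each trajectory into m+1 segments at m random cut points; the helper below reuses the hypothetical SubTrajRewardNet from the previous sketch and is not part of the released code.

import torch

def multi_cut_decomposition_loss(net, states, actions, ep_return, m, T):
    # Sample m distinct, sorted cut points in {1, ..., T-1}; m = 0 fits R_ep directly.
    cuts = sorted(torch.randperm(T - 1)[:m].add(1).tolist())
    bounds = [0] + cuts + [T]
    total = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # Predicted reward of the segment tau_{lo:hi}; segment rewards should sum to R_ep.
        total = total + net(states[:, lo:hi], actions[:, lo:hi])
    return ((total - ep_return) ** 2).mean()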

Figure 4: Heatmap of expected cumulative proxy rewards.

First, we visualize the influence of $m$ on the learned proxy value function. For the sake of explainability, the experiment is conducted on one PointMaze task (PointMaze_UMaze), where a 2-degree-of-freedom ball, force-actuated in the Cartesian directions x and y, aims to reach a target goal in a closed maze. We plot the heatmap of the normalized expected cumulative future proxy rewards for different numbers of cut points, as shown in Figure 4. The red star is the goal on the map. The closer the position (x, y) is to the goal, the greater its expected proxy value tends to be. The tendency with 10 cut points is the clearest, indicating that a proper $m$ between 0 and $T-1$ can yield reasonable proxy value estimation, since the attribution and representation of the episodic return are both considered.

Figure 5: Learning curves of our Diaster with a set of choices of the number of cut points, $m\in\{0,1,5,10,100,T-1\}$, on MuJoCo continuous control tasks. The solid lines indicate the mean and shaded areas indicate the standard error over five different random seeds. Each evaluation, taken every 1,000 environmental steps, calculates the average return over ten episodes.

Next, we evaluate Diaster with a set of choices of the number of cut points, $m\in\{0,1,5,10,100,T-1\}$, on four MuJoCo tasks with the episodic-reward setting, including InvertedPendulum, Hopper, Swimmer, and Walker2d. The results are demonstrated in Figure 5. The performance first increases and then decreases as the number of cut points $m$ grows from 0 to $T-1$. Neither $m=0$ nor $m=T-1$ achieves acceptable performance, because $m=0$ restricts the attribution capability while $m=T-1$ restricts the long-term representation. Although the sensitivity of this hyper-parameter depends on the task, an appropriate integer $m$ between 0 and $T-1$ can better approximate the return decomposition objective and achieve better performance, since it balances attribution and representation of the episodic return. In all the experimental tasks, $m=1$ or $m=5$ performs the best, which suggests that a relatively small $m$ is sufficient.

In conclusion, attribution and representation are both necessary for long-term episodic return decomposition. The lack of either of them would lead to poor performance. The trade-off between attribution and representation achieved by our Diaster improves the sample efficiency of RL in the tasks with episodic-reward settings.

Figure 6: Learning curves of our Diaster, with and without learning the step-wise proxy reward function, respectively.

On Step-wise Proxy Reward Function Learning

We propose to learn the sub-trajectory reward function $\hat{R}_{\text{sub}}^{\psi}$ by minimizing (19), then learn the step-wise proxy reward function $\hat{r}^{\phi}$ by minimizing (21), and use $\hat{r}^{\phi}$ to provide incentives for the agent. It is also feasible to directly use the difference of $\hat{R}_{\text{sub}}^{\psi}$ to optimize the policy for each trajectory. In this subsection, we conduct an ablation study to show how Diaster performs without learning $\hat{r}^{\phi}$.

We evaluate Diaster, with and without learning the step-wise proxy reward function $\hat{r}^{\phi}$, on four MuJoCo tasks with the episodic-reward setting, including InvertedPendulum, Hopper, Swimmer, and Walker2d. The results are demonstrated in Figure 6. We point out that $\hat{R}_{\text{sub}}^{\psi}(\tau_{0:t})$ is conditioned on the sub-trajectory $\tau_{0:t}$, so the difference of $\hat{R}_{\text{sub}}^{\psi}$ is non-Markovian. Without a Markovian reward signal, the policy optimization becomes non-stationary. Therefore, the performance deteriorates after removing the step-wise rewards in all four tasks.

Conclusion

In this paper, we present a new episodic return decomposition approach for the episodic-reward setting, where the only accessible feedback is the episodic reward obtained at the end of an episode. Specifically, we propose to cut an entire trajectory at any time step and decompose the episodic return into two sub-trajectory rewards. The step-wise reward can then be straightforwardly obtained by differencing the implicitly assigned sub-trajectory rewards in expectation. The new method, called Diaster, not only introduces temporal structural representations into the episodic reward function but also properly attributes the return to different parts of a given trajectory. The theoretical results verify that the step-wise proxy reward function learned through Diaster is return-equivalent to the original MDP in expectation and capable of guiding the policy to be nearly optimal. The empirical results show that Diaster can provide effective proxy rewards for RL algorithms and outperform previous state-of-the-art return decomposition methods in terms of both sample efficiency and performance.

Acknowledgments

This work is supported by the National Science Foundation of China (61921006), and The Major Key Project of PCL (PCL2021A12).

References

  • [1] Daniel Hein, Stefan Depeweg, Michel Tokic, Steffen Udluft, Alexander Hentschel, Thomas A. Runkler, and Volkmar Sterzing. A benchmark environment motivated by industrial control problems. In 2017 IEEE Symposium Series on Computational Intelligence (SSCI’17), Honolulu, Hawaii, 2017.
  • [2] Marcus Olivecrona, Thomas Blaschke, Ola Engkvist, and Hongming Chen. Molecular de-novo design through deep reinforcement learning. Journal of Cheminformatics, 9(1):48:1–48:14, 2017.
  • [3] Shi-Yong Chen, Yang Yu, Qing Da, Jun Tan, Hai-Kuan Huang, and Hai-Hong Tang. Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’18), London, UK, 2018.
  • [4] Clare Lyle, Mark Rowland, and Will Dabney. Understanding and preventing capacity loss in reinforcement learning. In The 10th International Conference on Learning Representations (ICLR’22), Virtual Event, 2022.
  • [5] Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, and Sepp Hochreiter. RUDDER: return decomposition for delayed rewards. In Advances in Neural Information Processing Systems 32 (NeurIPS’19), Vancouver, Canada, 2019.
  • [6] Maja J. Mataric. Reward functions for accelerated learning. In Proceedings of the 11th International Conference on Machine Learning (ICML’94), New Brunswick, NJ, 1994.
  • [7] Jette Randløv and Preben Alstrøm. Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of the 15th International Conference on Machine Learning (ICML’98), Madison, Wisconsin, 1998.
  • [8] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning (ICML’99), Bled, Slovenia, 1999.
  • [9] Tanmay Gangwani, Yuan Zhou, and Jian Peng. Learning guidance rewards with trajectory-space smoothing. In Advances in Neural Information Processing Systems 33 (NeurIPS’20), Virtual Event, 2020.
  • [10] Yonathan Efroni, Nadav Merlis, and Shie Mannor. Reinforcement learning with trajectory feedback. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI’21), Virtual Event, 2021.
  • [11] Zhizhou Ren, Ruihan Guo, Yuan Zhou, and Jian Peng. Learning long-term reward redistribution via randomized return decomposition. In 10th International Conference on Learning Representations (ICLR’22), Virtual Event, 2022.
  • [12] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [13] Yang Liu, Yunan Luo, Yuanyi Zhong, Xi Chen, Qiang Liu, and Jian Peng. Sequence modeling of temporal credit assignment for episodic reinforcement learning. CoRR, abs/1905.13420, 2019.
  • [14] Thomas J. Walsh, Ali Nouri, Lihong Li, and Michael L. Littman. Learning and planning in environments with delayed feedback. Autonomous Agents and Multi-Agent Systems, 18(1):83–105, 2009.
  • [15] Yann Bouteiller, Simon Ramstedt, Giovanni Beltrame, Christopher J. Pal, and Jonathan Binas. Reinforcement learning with random delays. In 9th International Conference on Learning Representations (ICLR’21), Virtual Event, 2021.
  • [16] Beining Han, Zhizhou Ren, Zuofan Wu, Yuan Zhou, and Jian Peng. Off-policy reinforcement learning with delayed rewards. In Proceedings of the 39th International Conference on Machine Learning (ICML’22), Baltimore, Maryland, 2022.
  • [17] Marco Dorigo and Marco Colombetti. Robot shaping: Developing autonomous agents through learning. Artificial Intelligence, 71(2):321–370, 1994.
  • [18] Eric Wiewiora, Garrison W. Cottrell, and Charles Elkan. Principled methods for advising reinforcement learning agents. In Proceedings of the 20th International Conference on Machine Learning (ICML’03), Washington, DC, 2003.
  • [19] Sam Devlin and Daniel Kudenko. Dynamic potential-based reward shaping. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS’12), Valencia, Spain, 2012.
  • [20] Anna Harutyunyan, Sam Devlin, Peter Vrancx, and Ann Nowé. Expressing arbitrary reward functions as potential-based advice. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI’15), Austin, Texas, 2015.
  • [21] Andrew G. Barto, Richard L. Lewis, and Satinder Singh. Where do rewards come from? In Proceedings of the 31st Annual Meeting of the Cognitive Science Society (CogSci’09), 2009.
  • [22] Satinder Singh, Richard L. Lewis, Andrew G. Barto, and Jonathan Sorg. Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development, 2(2):70–82, 2010.
  • [23] Richard S. Sutton. Temporal credit assignment in reinforcement learning. PhD thesis, University of Massachusetts Amherst, 1984.
  • [24] Zhongwen Xu, Hado van Hasselt, and David Silver. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems 31 (NeurIPS’18), Montréal, Canada, 2018.
  • [25] Zeyu Zheng, Risto Vuorio, Richard L. Lewis, and Satinder Singh. Adaptive pairwise weights for temporal credit assignment. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI’22), Virtual Event, 2022.
  • [26] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
  • [27] Anna Harutyunyan, Will Dabney, Thomas Mesnard, Mohammad Gheshlaghi Azar, Bilal Piot, Nicolas Heess, Hado van Hasselt, Gregory Wayne, Satinder Singh, Doina Precup, and Rémi Munos. Hindsight credit assignment. In Advances in Neural Information Processing Systems 32 (NeurIPS’19), Vancouver, Canada, 2019.
  • [28] Chia-Chun Hung, Timothy P. Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico Carnevale, Arun Ahuja, and Greg Wayne. Optimizing agent behavior over long time scales by transporting value. Nature Communications, 10(1):1–12, 2019.
  • [29] David Raposo, Samuel Ritter, Adam Santoro, Greg Wayne, Theophane Weber, Matt M. Botvinick, Hado van Hasselt, and H. Francis Song. Synthetic returns for long-term credit assignment. CoRR, abs/2102.12425, 2021.
  • [30] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
  • [31] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, (EMNLP’14), Doha, Qatar, 2014.
  • [32] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’12), Vilamoura, Portugal, 2012.
  • [33] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: datasets for deep data-driven reinforcement learning. CoRR, abs/2004.07219, 2020.
  • [34] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML’18), Stockholmsmässan, Stockholm, Sweden, 2018.
  • [35] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [36] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI’16), Phoenix, Arizona, 2016.
  • [37] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML’16), New York City, NY, 2016.
  • [38] Sham M. Kakade. A natural policy gradient. In Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani, editors, Advances in Neural Information Processing Systems 14 (NeurIPS’01), Vancouver, British Columbia, 2001.
  • [39] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML’15), Lille, France, 2015.
  • [40] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
  • [41] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS’11), Fort Lauderdale, USA, 2011.
  • [42] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR’15), San Diego, CA, 2015.

Appendix A. Proofs of the Theoretical Results

A.1. Proof of Theorem 1

Any environmental state $s_{t}$ is sampled according to the on-policy distribution $\rho^{\pi}(s_{t})$, and the action $a_{t}$ is chosen by the policy $\pi(a_{t}|s_{t})$. We expand the expectation as

\mathbb{E}_{\rho^{\pi}}\left[\sum_{i=t}^{t+h-1}\hat{r}^{\pi}(s_{i},a_{i})\right]=\sum_{i=t}^{t+h-1}\sum_{s_{i},a_{i}}\rho^{\pi}(s_{i})\pi(a_{i}|s_{i})\hat{r}^{\pi}(s_{i},a_{i}). (23)

Replacing the step-wise proxy reward function $\hat{r}^{\pi}$ with its definition (9), we obtain

\mathbb{E}_{\rho^{\pi}}\left[\sum_{i=t}^{t+h-1}\hat{r}^{\pi}(s_{i},a_{i})\right] (24)
=\sum_{i=t}^{t+h-1}\sum_{s_{i},a_{i}}\rho^{\pi}(s_{i})\pi(a_{i}|s_{i})\sum_{\tilde{\tau}_{0:i}}\Gamma^{\pi}_{i}(\tilde{\tau}_{0:i}|s_{i})\hat{R}_{d}(\tilde{\tau}_{0:i},s_{i},a_{i})
=\sum_{i=t}^{t+h-1}\sum_{s_{i},a_{i}}\rho^{\pi}(s_{i})\pi(a_{i}|s_{i})\sum_{\tilde{\tau}_{0:i}}\Gamma^{\pi}_{i}(\tilde{\tau}_{0:i}|s_{i})\hat{R}_{\text{sub}}(\tilde{\tau}_{0:i},s_{i},a_{i})
\quad\quad-\sum_{i=t}^{t+h-1}\sum_{s_{i},a_{i}}\rho^{\pi}(s_{i})\pi(a_{i}|s_{i})\sum_{\tilde{\tau}_{0:i}}\Gamma^{\pi}_{i}(\tilde{\tau}_{0:i}|s_{i})\hat{R}_{\text{sub}}(\tilde{\tau}_{0:i}).

The first term can be rewritten as

\sum_{i=t}^{t+h-1}\sum_{s_{i},a_{i}}\rho^{\pi}(s_{i})\pi(a_{i}|s_{i})\sum_{\tilde{\tau}_{0:i}}\Gamma^{\pi}_{i}(\tilde{\tau}_{0:i}|s_{i})\hat{R}_{\text{sub}}(\tilde{\tau}_{0:i},s_{i},a_{i}) (25)
=\sum_{i=t}^{t+h-1}\sum_{\tilde{\tau}_{0:i}}\sum_{s_{i},a_{i}}\mathcal{T}_{i}^{\pi}(\tilde{\tau}_{0:i})P(s_{i}|\tilde{s}_{i-1},\tilde{a}_{i-1})\pi(a_{i}|s_{i})\hat{R}_{\text{sub}}(\tilde{\tau}_{0:i},s_{i},a_{i})
=\sum_{i=t}^{t+h-1}\sum_{\tilde{\tau}_{0:i}}\sum_{s_{i},a_{i}}\mathcal{T}_{i+1}^{\pi}(\tilde{\tau}_{0:i},s_{i},a_{i})\hat{R}_{\text{sub}}(\tilde{\tau}_{0:i},s_{i},a_{i})
=\sum_{i=t}^{t+h-1}\sum_{\tilde{\tau}_{0:i+1}}\mathcal{T}_{i+1}^{\pi}(\tilde{\tau}_{0:i+1})\hat{R}_{\text{sub}}(\tilde{\tau}_{0:i+1}),

while the second term can be rewritten as

\sum_{i=t}^{t+h-1}\sum_{s_{i},a_{i}}\rho^{\pi}(s_{i})\pi(a_{i}|s_{i})\sum_{\tilde{\tau}_{0:i}}\Gamma^{\pi}_{i}(\tilde{\tau}_{0:i}|s_{i})\hat{R}_{\text{sub}}(\tilde{\tau}_{0:i}) (26)
=\sum_{i=t}^{t+h-1}\sum_{s_{i},a_{i}}\rho^{\pi}(s_{i})\pi(a_{i}|s_{i})\sum_{\tilde{\tau}_{0:i}}\frac{\mathcal{T}_{i}^{\pi}(\tilde{\tau}_{0:i})P(s_{i}|\tilde{s}_{i-1},\tilde{a}_{i-1})}{\rho^{\pi}(s_{i})}\hat{R}_{\text{sub}}(\tilde{\tau}_{0:i})
=\sum_{i=t}^{t+h-1}\sum_{\tilde{\tau}_{0:i}}\mathcal{T}_{i}^{\pi}(\tilde{\tau}_{0:i})\hat{R}_{\text{sub}}(\tilde{\tau}_{0:i}).

Since the first term partially cancels (telescopes) with the second term, the final result is obtained as

\mathbb{E}_{\rho^{\pi}}\left[\sum_{i=t}^{t+h-1}\hat{r}^{\pi}(s_{i},a_{i})\right] (27)
=\sum_{\tau_{0:t+h}}\mathcal{T}_{t+h}^{\pi}(\tau_{0:t+h})\hat{R}_{\text{sub}}(\tau_{0:t+h})-\sum_{\tau_{0:t}}\mathcal{T}_{t}^{\pi}(\tau_{0:t})\hat{R}_{\text{sub}}(\tau_{0:t})
=\mathbb{E}_{\rho^{\pi}}\left[\hat{R}_{\text{sub}}(\tau_{0:t+h})-\hat{R}_{\text{sub}}(\tau_{0:t})\right].

A.2. Proof of Lemma 1

For value-based methods [35, 36, 37], the policy is improved by directly applying $\arg\max_{a}Q(s,a)$. Since it is observed that

\arg\max_{a}\hat{Q}^{\pi}(s,a)=\arg\max_{a}[Q^{\pi}(s,a)+\delta(s)]=\arg\max_{a}Q^{\pi}(s,a), (28)

for any $s\in\mathcal{S}$, any policy $\pi$ can be updated to the same policy during the policy improvement stage at each policy iteration, whether evaluated by $\hat{Q}^{\pi}$ or $Q^{\pi}$.

For policy gradient methods [38, 39, 34, 40], given any policy $\pi_{\theta}$ parameterized by $\theta$, the derivation

\mathbb{E}_{s_{t},a_{t}}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\hat{Q}^{\pi_{\theta}}(s_{t},a_{t})\right] (29)
=\mathbb{E}_{s_{t},a_{t}}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\left[Q^{\pi_{\theta}}(s_{t},a_{t})+\delta(s_{t})\right]\right]
=\mathbb{E}_{s_{t},a_{t}}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})Q^{\pi_{\theta}}(s_{t},a_{t})\right]+\mathbb{E}_{s_{t},a_{t}}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\delta(s_{t})\right]
=\mathbb{E}_{s_{t},a_{t}}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})Q^{\pi_{\theta}}(s_{t},a_{t})\right]+\mathbb{E}_{s_{t}}\left[\sum_{a_{t}}\pi_{\theta}(a_{t}|s_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\delta(s_{t})\right]
=\mathbb{E}_{s_{t},a_{t}}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})Q^{\pi_{\theta}}(s_{t},a_{t})\right]+\mathbb{E}_{s_{t}}\left[\delta(s_{t})\sum_{a_{t}}\nabla_{\theta}\pi_{\theta}(a_{t}|s_{t})\right]
=\mathbb{E}_{s_{t},a_{t}}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})Q^{\pi_{\theta}}(s_{t},a_{t})\right]+\mathbb{E}_{s_{t}}\left[\delta(s_{t})\nabla_{\theta}\sum_{a_{t}}\pi_{\theta}(a_{t}|s_{t})\right]
=\mathbb{E}_{s_{t},a_{t}}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})Q^{\pi_{\theta}}(s_{t},a_{t})\right],

where the last equality holds because $\sum_{a_{t}}\pi_{\theta}(a_{t}|s_{t})=1$ and hence its gradient vanishes. This indicates that the policy gradient computed with $\hat{Q}^{\pi}$ is the same as the one computed with $Q^{\pi}$ in expectation, thus leading to the same optimal policy.

A.3. Proof of Theorem 2

Any sub-trajectory starting from the pair $(s_{t},a_{t})$ is sampled according to the dynamics function $P$ and the policy $\pi$. The probability of the state $s_{t+h}$ ($h\in\{1,\ldots,T-t-1\}$) conditioned on $(s_{t},a_{t})$ is given by

\mu^{\pi}(s_{t+h}|s_{t},a_{t})=\sum_{s_{t+1},a_{t+1}}\sum_{s_{t+2},a_{t+2}}\cdots\sum_{s_{t+h-1},a_{t+h-1}}P(s_{t+1}|s_{t},a_{t})\left(\prod_{i=1}^{h-1}\pi(a_{t+i}|s_{t+i})P(s_{t+i+1}|s_{t+i},a_{t+i})\right), (30)

thus we can expand the expectation as

\mathbb{E}_{P,\pi}\left[\left.\sum_{i=1}^{T-t-1}\hat{r}^{\pi}(s_{t+i},a_{t+i})\right|s_{t},a_{t}\right]=\sum_{i=1}^{T-t-1}\sum_{s_{t+i},a_{t+i}}\mu^{\pi}(s_{t+i}|s_{t},a_{t})\pi(a_{t+i}|s_{t+i})\hat{r}^{\pi}(s_{t+i},a_{t+i}). (31)

Replacing the step-wise proxy reward function $\hat{r}^{\pi}$ with its definition (9), we obtain

\mathbb{E}_{P,\pi}\left[\left.\sum_{i=1}^{T-t-1}\hat{r}^{\pi}(s_{t+i},a_{t+i})\right|s_{t},a_{t}\right] (32)
=\sum_{i=1}^{T-t-1}\sum_{s_{t+i},a_{t+i}}\mu^{\pi}(s_{t+i}|s_{t},a_{t})\pi(a_{t+i}|s_{t+i})\sum_{\tilde{\tau}_{0:t+i}}\Gamma^{\pi}_{t+i}(\tilde{\tau}_{0:t+i}|s_{t+i})\hat{R}_{d}(\tilde{\tau}_{0:t+i},s_{t+i},a_{t+i})
=\sum_{i=1}^{T-t-1}\sum_{s_{t+i},a_{t+i}}\mu^{\pi}(s_{t+i}|s_{t},a_{t})\pi(a_{t+i}|s_{t+i})\sum_{\tilde{\tau}_{0:t+i}}\Gamma^{\pi}_{t+i}(\tilde{\tau}_{0:t+i}|s_{t+i})\hat{R}_{\text{sub}}(\tilde{\tau}_{0:t+i},s_{t+i},a_{t+i})
\quad\quad-\sum_{i=1}^{T-t-1}\sum_{s_{t+i},a_{t+i}}\mu^{\pi}(s_{t+i}|s_{t},a_{t})\pi(a_{t+i}|s_{t+i})\sum_{\tilde{\tau}_{0:t+i}}\Gamma^{\pi}_{t+i}(\tilde{\tau}_{0:t+i}|s_{t+i})\hat{R}_{\text{sub}}(\tilde{\tau}_{0:t+i}).

Letting $\Upsilon^{\pi}_{t+i+1}(\tilde{\tau}_{0:t+i},s_{t+i},a_{t+i}|s_{t},a_{t})=\mu^{\pi}(s_{t+i}|s_{t},a_{t})\pi(a_{t+i}|s_{t+i})\Gamma^{\pi}_{t+i}(\tilde{\tau}_{0:t+i}|s_{t+i})$, the first term can be rewritten as

\sum_{i=1}^{T-t-1}\sum_{s_{t+i},a_{t+i}}\mu^{\pi}(s_{t+i}|s_{t},a_{t})\pi(a_{t+i}|s_{t+i})\sum_{\tilde{\tau}_{0:t+i}}\Gamma^{\pi}_{t+i}(\tilde{\tau}_{0:t+i}|s_{t+i})\hat{R}_{\text{sub}}(\tilde{\tau}_{0:t+i},s_{t+i},a_{t+i}) (33)
=\sum_{i=1}^{T-t-1}\sum_{s_{t+i},a_{t+i}}\sum_{\tilde{\tau}_{0:t+i}}\Upsilon^{\pi}_{t+i+1}(\tilde{\tau}_{0:t+i},s_{t+i},a_{t+i}|s_{t},a_{t})\hat{R}_{\text{sub}}(\tilde{\tau}_{0:t+i},s_{t+i},a_{t+i})
=\sum_{i=1}^{T-t-1}\sum_{\tilde{\tau}_{0:t+i+1}}\Upsilon^{\pi}_{t+i+1}(\tilde{\tau}_{0:t+i+1}|s_{t},a_{t})\hat{R}_{\text{sub}}(\tilde{\tau}_{0:t+i+1}),

while the second term can be rewritten as

\sum_{i=1}^{T-t-1}\sum_{s_{t+i},a_{t+i}}\mu^{\pi}(s_{t+i}|s_{t},a_{t})\pi(a_{t+i}|s_{t+i})\sum_{\tilde{\tau}_{0:t+i}}\Gamma^{\pi}_{t+i}(\tilde{\tau}_{0:t+i}|s_{t+i})\hat{R}_{\text{sub}}(\tilde{\tau}_{0:t+i}) (34)
=\sum_{i=1}^{T-t-1}\sum_{s_{t+i},a_{t+i}}\sum_{\tilde{\tau}_{0:t+i}}\Upsilon^{\pi}_{t+i+1}(\tilde{\tau}_{0:t+i},s_{t+i},a_{t+i}|s_{t},a_{t})\hat{R}_{\text{sub}}(\tilde{\tau}_{0:t+i})
=\sum_{i=1}^{T-t-1}\sum_{\tilde{\tau}_{0:t+i}}\Upsilon^{\pi}_{t+i+1}(\tilde{\tau}_{0:t+i}|s_{t},a_{t})\hat{R}_{\text{sub}}(\tilde{\tau}_{0:t+i}).

Since the first term partially cancels (telescopes) with the second term, the final result is obtained as

\mathbb{E}_{P,\pi}\left[\left.\sum_{i=1}^{T-t-1}\hat{r}^{\pi}(s_{t+i},a_{t+i})\right|s_{t},a_{t}\right] (35)
=\sum_{\tau_{0:T}}\Upsilon^{\pi}_{T}(\tau_{0:T}|s_{t},a_{t})\hat{R}_{\text{sub}}(\tau_{0:T})-\sum_{\tau_{0:t+1}}\Upsilon^{\pi}_{t+1}(\tau_{0:t+1}|s_{t},a_{t})\hat{R}_{\text{sub}}(\tau_{0:t+1})
=\mathbb{E}_{P,\pi}\left[\left.\hat{R}_{\text{sub}}(\tau_{0:T})-\hat{R}_{\text{sub}}(\tau_{0:t+1})\right|s_{t},a_{t}\right].

Appendix B. Hyper-parameters

Hyper-parameter Default
hidden layers (all networks) 2
hidden neurons 256
activation ReLU [41]
optimizer (all losses) Adam [42]
RL-ALGO SAC [34]
learning rate (all networks) $3\times 10^{-4}$
discount factor 0.99
initial temperature 1.0
target entropy $-\dim(\mathcal{A})$
target network smoothing coefficient 0.005
replay buffer capacity $10^{6}$
mini-batch size 256
Table 1: The hyper-parameter configuration of Diaster in MuJoCo experiments.
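For convenience, the defaults in Table 1 might be gathered into a single configuration dict as in the sketch below; the key names are our own illustration and not the interface of the released code.

DIASTER_CONFIG = {
    "hidden_layers": 2,              # all networks
    "hidden_neurons": 256,
    "activation": "relu",
    "optimizer": "adam",             # all losses
    "rl_algo": "sac",
    "learning_rate": 3e-4,           # all networks
    "discount_factor": 0.99,
    "initial_temperature": 1.0,
    "target_entropy": "-dim(A)",     # negative action dimensionality
    "target_smoothing_coef": 0.005,
    "replay_buffer_capacity": 1_000_000,
    "mini_batch_size": 256,
}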