
Sequential Action-Induced Invariant Representation for Reinforcement Learning

Dayang Liang, Qihang Chen, Yunlong Liu*
Xiamen University
Xiamen, China
ylliu@xmu.edu.cn
*Corresponding Author
Abstract

How to accurately learn task-relevant state representations from high-dimensional observations with visual distractions is a realistic and challenging problem in visual reinforcement learning. Recently, unsupervised representation learning methods based on bisimulation metrics, contrastive learning, prediction, and reconstruction have shown the ability to extract task-relevant information. However, due to the lack of appropriate mechanisms for extracting task information in the prediction-, contrastive-, and reconstruction-related approaches, and the limitations of bisimulation-related methods in domains with sparse rewards, it is still difficult to effectively extend these methods to environments with distractions. To alleviate these problems, in this paper, action sequences, which contain task-intensive signals, are incorporated into representation learning. Specifically, we propose a Sequential Action-induced invariant Representation (SAR) method, in which the encoder is optimized by an auxiliary learner to preserve only the components that follow the control signals of sequential actions, so that the agent is induced to learn a representation robust against distractions. We conduct extensive experiments on DeepMind Control suite tasks with distractions, where SAR achieves the best performance over strong baselines. We also demonstrate the effectiveness of our method at disregarding task-irrelevant information by deploying SAR to realistic CARLA-based autonomous driving with natural distractions. Finally, we provide generalization analyses based on the generalization decay ratio and t-SNE visualization. Code and demo videos are available at https://github.com/DMU-XMU/SAR.git.

1 Introduction

Visual Deep Reinforcement Learning (DRL), which takes high-dimensional images as input, can deal with decision-making problems in various complex scenarios and has achieved great success in robot control Li et al. (2021); Kalashnikov et al. (2018), autonomous driving Zhao et al. (2022); Wu et al. (2022), video games Jaderberg et al. (2019), and health care Liang et al. (2022), among others. However, observation signals in real-world applications are usually high-rank and unstructured, and directly using such high-dimensional input for visual DRL often results in information redundancy and sample inefficiency.

In the literature, many techniques have been adopted to learn representations efficiently for visual DRL, and the main approach is to learn the state representation jointly in an end-to-end manner with the help of convolutional encoders Mnih et al. (2015), attention mechanisms Liang et al. (2021), and Transformers Vaswani et al. (2017). Although these methods perform well against some low-rank, high-dimensional backgrounds, they still struggle with images from real-world applications that contain many task-irrelevant signals, where the sample cost of policy learning becomes very high. Fortunately, the number of basic states that directly guide the decision may be much lower than the dimensionality of the image Schölkopf (2022), which allows a representation learner to extract background-independent task states from complex observations, so that agents can generalize to environments with similar tasks but different backgrounds.

Recently, data augmentation and encoder reconstruction have been widely employed to address the insufficient generalization performance of visual DRL. For example, self-supervised vision transformers Dosovitskiy et al. (2020) have been used to reconstruct masked hidden states (MLR) Yu et al. (2022), masked autoencoders (MAE) Liu et al. (2022); He et al. (2022), and masked world models (MWM) Seo et al. (2022), promoting the encoder to learn task-relevant features and to predict current and future states. Random data augmentation has also been widely used for sample augmentation (RAD) Laskin et al. (2020b), noise contrastive estimation (CURL) Laskin et al. (2020a), and robust value estimation (DrQ) Yarats et al. (2021). However, naively introducing data augmentation to arbitrarily transform observations and reconstruct encoders in a task-agnostic manner may cause high variance and overfitting, further deteriorating generalization performance Yuan et al. (2022) and incurring additional computational costs.

By inducing representations with behavioral similarity metrics (BSM) Zang et al. (2022), recent research has achieved some success in task-relevant representation learning, where an element of the Markov decision process (MDP) tuple that is able to determine the task, such as the reward, action, or transition model, is adopted to establish a task-equivalence relationship with the hidden state, thereby enhancing the learning of a lossless representation of task-relevant information. Related works mainly include CRESP Yang et al. (2022) based on reward sequences, the contrastive learning method of Liu et al. (2020) based on MC-returns, DBC Zhang et al. (2020) based on rewards and transitions, and PSM Agarwal et al. (2020) based on one-step actions and transitions.

However, existing methods still have some limitations. For the reward-based metric methods, the limitations mainly include: i) It is inaccurate to directly group observations with equal one-step rewards into one class of states. Distinguishing different encoded states requires trajectories with different rewards, whereas only the one-step reward is adopted in current related approaches Kemertas & Aumentado-Armstrong (2021). ii) In environments with sparse rewards, insufficient reward signals may directly lead to the failure of reward-based behavioral similarity metrics, thereby worsening representation learning Agarwal et al. (2020); Agarwal et al. (2021). For the action-based approaches, e.g., the method with PSM Agarwal et al. (2020), although more task-relevant information, i.e., the action, is considered in representation learning, a one-step metric is still not enough to learn an accurate representation (an issue similar to i)).

Figure 1: We illustrate the idea of extracting task-relevant information using action sequences on a simple control task. Taking Cartpole-swingup as an example, the task is to control the slider to move left and right through the learned policy to finally balance the connecting rod. We can see that, as the agent navigates the scenario, the movement of the regions it controls (Middle: slider and connecting rod) is consistent with the sequential action signals (Left: action sequence). If we can lock onto these background-irrelevant task regions (Right: red highlighted part) through those action signals, a more accurate task-relevant representation can be obtained for decision making.

To address the aforementioned issues, the action sequence distribution is modeled and adopted for task-relevant representation learning, with the purpose of accurately decoupling task-related states from distractions. This method stems from a crucial insight (further illustrated in Fig. 1): task-related regions in a static observation are hard to recognize without prior knowledge, but they can be moved by interactive actions; furthermore, sequential multi-step actions cause them to move noticeably, i.e., to generate motion trajectories. Intuitively, these regions, i.e., the task-relevant information, can therefore be efficiently recognized and located by multi-step actions. With this insight, we can model multi-step actions to accurately represent task-relevant information while avoiding the problems of inaccurate single-step metrics and unstable metrics under sparse rewards.

In this paper, we propose a Sequential Action-induced invariant Representation (SAR) method, which makes the output of the state encoder consistent with the control signals of the real sequential actions (i.e., the labels) by minimizing the distance between the predicted and the real action sequences, thereby locking onto the task-relevant representation in the stacked observations. To realize this idea, we utilize the characteristic function to model a probability distribution over real action sequences, while leveraging the latest policy to guarantee the accuracy of the real actions, thereby achieving an effective metric between real and predicted action sequences. Empirical evaluations on the DeepMind Control (DMControl) suite with unseen distractions demonstrate that our approach outperforms recent representation-related baselines. To further verify the effectiveness of our approach, SAR is also applied to the field of autonomous driving with real natural distractions; the experimental results show that it still significantly outperforms the baselines. In addition, the results of the generalization decay ratio and t-SNE visualization analyses also strongly confirm that the method using sequential actions can extract task-relevant information more accurately.

2 Background

We model the underlying system of visual reinforcement learning as a Markov decision process (MDP), described by a tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the continuous action space, $\mathcal{P}(s'|s,a):\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to\mathbb{R}$ is the transition probability from state $s$ to the next state $s'$ after executing action $a\in\mathcal{A}$, $\mathcal{R}(s,a):\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is the reward signal $r$ obtained after executing action $a$ in state $s$, and $\gamma\in[0,1]$ is the discount factor. Generally, the accurate state $s$ is difficult to obtain in environments with visual distractions, so it is necessary to extract a hidden state from the observation $o\in\mathcal{O}$ with the help of a state encoder $\phi$, where $\mathcal{O}$ is the observation space containing task-irrelevant information.

The aim of visual reinforcement learning is to jointly learn an efficient encoder $\phi$ for state representation and a policy $\pi$ for decision making. They allow the agent to extract a task-relevant low-dimensional hidden state $\phi(o):o\to z$ from a given high-dimensional $o\sim\mathcal{O}$ and to choose an action $a\sim\pi(z)$ according to $z\in\mathbb{R}^{n}$, thus obtaining the reward signal $r=\mathcal{R}(s,a)$ from the environment. By iterating the above process, the agent finally learns an optimal policy that maximizes the expected cumulative discounted reward, formulated as $\max_{\pi}\mathbb{E}_{s_{t}\sim\mathcal{P},a_{t}\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}\mathcal{R}(s_{t},a_{t})\right]$.
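As a concrete illustration of this interaction loop, the following is a minimal sketch assuming placeholder `env`, `encoder`, and `policy` objects (with `env.step` assumed to return an observation, a reward, and a done flag); it is illustrative rather than the paper's implementation.

```python
def rollout_return(env, encoder, policy, gamma=0.99, horizon=1000):
    """Run one episode: z = phi(o), a ~ pi(z), and accumulate sum_t gamma^t * r_t."""
    o = env.reset()
    episode_return, discount = 0.0, 1.0
    for _ in range(horizon):
        z = encoder(o)            # low-dimensional hidden state z = phi(o)
        a = policy(z)             # action chosen from the latent state
        o, r, done = env.step(a)  # reward r = R(s, a) from the environment
        episode_return += discount * r
        discount *= gamma
        if done:
            break
    return episode_return
```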

2.1 Soft Actor Critic

SAC is an off-policy reinforcement learning algorithm for continuous control, which consists of a Critic network that learns the value function $Q_{\varphi}(o,a)$ with parameters $\varphi$ and an Actor network that learns the policy function $\pi_{\psi}(a|o)$ with parameters $\psi$. Unlike value-based methods, SAC optimizes a stochastic policy to maximize the expected trajectory reward. The main highlight of SAC is that a maximum entropy term $\mathcal{H}=-\log(\pi_{\psi}(a|o))$ is added to encourage more dispersed action sampling, which enhances the exploration and robustness of the algorithm.

Specifically, the Critic network first samples transitions $e_{t}=(o_{t},a_{t},o_{t+1},r_{t},d)$ from the replay buffer $\mathcal{D}$ to minimize the Bellman error, where $d$ denotes the done signal. The training of the parameters $\varphi$ minimizes the following loss:

$\mathcal{L}^{V}(\varphi)=\mathbb{E}_{e\sim\mathcal{D}}\left[\left(Q_{\varphi}(o_{t},a_{t})-(r_{t}+\gamma(1-d)\mathcal{T})\right)^{2}\right]. \quad (1)$

The target $\mathcal{T}$ is the expectation over next actions sampled from the current policy, defined as:

$\mathcal{T}=\mathbb{E}_{a'\sim\pi}\left[\hat{Q}_{\hat{\varphi}}(o_{t+1},a')-\alpha\log(\pi_{\psi}(a'|o_{t+1}))\right], \quad (2)$

where the target network $\hat{\varphi}$ comes from the Exponential Moving Average (EMA) of the Critic parameters $\varphi$, and the parameter $\alpha$ is a positive entropy coefficient that determines the priority of entropy maximization over value function optimization.

Finally, we sample actions $a\sim\pi_{\psi}$ from the policy with parameters $\psi$ and train the Actor by maximizing the expected soft value of the sampled actions:

$\mathcal{L}^{\pi}(\psi)=\mathbb{E}_{a\sim\pi}\left[Q^{\pi}(o,a)-\alpha\log(\pi_{\psi}(a|o))\right]. \quad (3)$
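The two losses above can be written compactly in code. The following PyTorch sketch assumes placeholder `critic`, `target_critic`, and `actor` modules, where `actor.sample` is assumed to return an action and its log-probability; it illustrates Eqs. (1)-(3) rather than reproducing the exact implementation.

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, target_critic, actor, batch, gamma=0.99, alpha=0.1):
    """Eq. (1): squared Bellman error against the entropy-regularized target of Eq. (2)."""
    o, a, r, o_next, done = batch
    with torch.no_grad():
        a_next, logp_next = actor.sample(o_next)          # a' ~ pi_psi(.|o_{t+1})
        target_q = target_critic(o_next, a_next) - alpha * logp_next
        y = r + gamma * (1.0 - done) * target_q           # target T of Eq. (2)
    return F.mse_loss(critic(o, a), y)

def actor_loss(critic, actor, o, alpha=0.1):
    """Eq. (3): maximize Q - alpha * log pi, written here as a loss to be minimized."""
    a, logp = actor.sample(o)
    return (alpha * logp - critic(o, a)).mean()
```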

2.2 Behavior Similarity Metric

Behavior Similarity Metric (BSM)-related methods generally employ MDP elements, such as the reward, action, or policy, to measure the task equivalence between observations, where an auxiliary loss is usually constructed to promote the extraction of task-relevant state representations. The Policy Similarity Metric (PSM) is a BSM-based method that utilizes the policy distance. Since our work is closely related to PSM in terms of optimization, we briefly introduce the basic definition of PSM and the differences between the two.

Regarding the similarity, both our method and PSM are based on the BSM principle, i.e., if the encoder distance is equal to the distance between certain MDP elements that can determine the task (such as the PSM distance in Theorem 2.1), it is considered that all the information extracted by the encoder is task-related. Regarding the differences, the PSM method optimizes the one-step distance $\lVert z_{i}-z_{j}\rVert_{1}$ of the encoded states, making it close to the one-step distance between the MDP elements that determine the task, i.e., the sum of the policy distance $DIST(\pi^{*}(x),\pi^{*}(y))$ and the next-state transition distance under $\mathcal{W}_{1}$. In our method, besides the basic idea of sequential actions, the most important difference is that we minimize the more difficult multi-step distance, and there is no need to learn the transition model $\mathcal{P}$, which avoids the impact of model accuracy.

Theorem 2.1

(Policy Similarity Metric, Agarwal et al. (2020)). Let $\mathfrak{M}$ be the set of all pseudometrics on the state space $\mathcal{S}$. For a given $DIST$ and $\pi^{*}$, a pseudometric transformation function $\mathcal{F}_{PSM}^{\pi^{*}}:\mathfrak{M}\to\mathfrak{M}$ is defined as

$\mathcal{F}_{PSM}^{\pi^{*}}(d)(s_{i},s_{j})=DIST(\pi^{*}(s_{i}),\pi^{*}(s_{j}))+\gamma\,\mathcal{W}_{1}(d)(\mathcal{P}^{\pi^{*}}_{s_{i}},\mathcal{P}^{\pi^{*}}_{s_{j}}), \quad (4)$

where $DIST$ denotes a probability pseudometric between policies, $\pi^{*}$ is the optimal policy, the transition is $\mathcal{P}_{s}^{\pi}=\sum_{a}\pi(a|s)\mathcal{P}(\cdot|s,a)$, and $\mathcal{W}_{1}$ is the 1-Wasserstein distance given the pseudometric $d$. Then $\mathcal{F}_{PSM}^{\pi^{*}}$ has a unique fixed point $d^{*}$, which is the PSM metric.
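To make the operator in Eq. (4) concrete, the following is a hedged sketch of one application of $\mathcal{F}_{PSM}^{\pi^{*}}$ for a small deterministic MDP with scalar actions, where $DIST$ is taken as the absolute difference between the policy outputs and the 1-Wasserstein term collapses to the current metric value at the deterministic next states; all names are illustrative assumptions.

```python
import numpy as np

def psm_backup(d, pi_actions, next_state, gamma=0.99):
    """One application of F_PSM: d'(i, j) = |pi*(s_i) - pi*(s_j)| + gamma * d(s_i', s_j')."""
    n = len(pi_actions)
    d_new = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dist = abs(pi_actions[i] - pi_actions[j])                 # DIST between policy outputs
            d_new[i, j] = dist + gamma * d[next_state[i], next_state[j]]
    return d_new

# Iterating psm_backup from d = 0 converges to the unique fixed point d*, the PSM metric.
```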

2.3 Distraction Background

As SAR uses sequential actions to distinguish the task-relevant representation from the distraction, it requires that the task foreground and the distraction background be loosely coupled, i.e., that their movements do not exhibit a consistent relationship, which holds in many real-world applications. In reality, the distraction backgrounds of most scenes are random and not coupled with the task foreground, and it is also difficult for a multi-degree-of-freedom system to become coupled with distractions, so our method is applicable to the state representation of most scenes with distractions. Nevertheless, we still state the following assumption.

Assumption 2.1

Given a reinforcement learning system where the transition model of the states is denoted as $\mathcal{P}_{S}$ and the transition model of the distraction background is denoted as $\mathcal{P}_{X}$, we assume that $D_{KL}(\mathcal{P}_{S}\parallel\mathcal{P}_{X})\gg 0$, i.e., the similarity between the two distributions is nearly zero.

3 Algorithm

In this section, we propose the Sequential Action-induced invariant Representation (SAR), a visual DRL method that leverages optimal action sequences to capture task-relevant information from sequential observations. The key of SAR is to optimize, through an auxiliary loss, the latent distance between the encoded state and the real sequential actions, making the prediction vector derived from the encoded state consistent with the control signals of the sequential actions. In this way, SAR promotes the encoded hidden state to preserve only the components that follow the control signals of real sequential actions, thereby ensuring that the representation extracted from the observation is task-relevant.

Figure 2: Overview of the SAR architecture. SAR consists of two parts, an auxiliary task for learning representations and a reinforcement learning task for learning policies. The boxed region is the core module, i.e., the auxiliary task, which aims to minimize the error between the predicted output signal $\mathcal{G}$ and the optimal action sequence distribution $\mathcal{Y}$. Thus, the gradient of the auxiliary loss can be back-propagated to $\phi$, assisting $\phi$ to learn state representations associated with real action sequences from unstructured scenes. Finally, the efficient representation learned by the encoder is used for the decision task of reinforcement learning.

The overall framework of SAR is shown in Fig. 2, which is based on the Soft Actor-Critic reinforcement learning architecture. In detail, SAR consists of an auxiliary module (Aux loss) for self-supervised representation learning and an RL module (RL loss) for policy learning, where we mainly focus on the auxiliary module. In the architecture, the two modules share an encoder $\phi$ for state representation but affect it differently. Specifically, in the backward gradient-update phase, the RL module trains both the encoder network $\phi$ and the Actor-Critic networks, while the auxiliary module only trains $\phi$. During forward inference, the pipeline of the auxiliary module is ignored, and only the RL module is used to obtain the current action. In addition, to improve the generalization of the encoder network, we also randomly augment the observations before encoding, following existing methods.

3.1 Construction of Auxiliary Task

The way the auxiliary task is constructed largely determines the expected performance. Therefore, a key step in constructing the auxiliary task is designing auxiliary objectives that can efficiently optimize the state encoder. To that end, we construct an auxiliary task with optimal action sequences and make them guide the extraction of task-relevant information. In detail, to take full advantage of the properties of the action sequence, we establish a probability model for sequential actions and finally employ the characteristic function of the probability distribution as the target of the auxiliary loss. Before this, we first use the newest policy to recompute the actions of the sampled observations in order to obtain better action sequences.

Action Sequences  Sequential action-induced representation aims to lock onto task-relevant information in observations that has a specific mapping relationship with action sequences, so action sequences directly guide the extraction of task-relevant information. In theory (Theorem 2.1), SAR should also utilize the optimal action sequences sampled from the optimal policy $\pi^{*}$, which helps to learn representations accurately Agarwal et al. (2020). As for the reason, we empirically believe that, because optimal sequential actions are sampled from an optimal probability distribution function (PDF), the information controlled by optimal actions is deeply decoupled from random noise. In contrast, information controlled by random actions cannot be accurately extracted due to its coupling with random noise.

In practice, the optimal actions cannot be obtained during training, but actions sampled from a suboptimal policy can still be used in representation learning, with a limitation on accuracy Agarwal et al. (2020). Therefore, to further improve the optimality of the action sequences, we make the following improvement: at each training step, for the action $a_{t}$ sampled from the replay buffer, we do not directly use the $a_{t}$ that was computed with the old policy $\pi^{old}$, as in Agarwal et al. Instead, we update $a_{t}$ with the latest policy. To be specific, we first sample sequential observations $\{o_{t},o_{t+1},\ldots,o_{t+T}\}$ of length $T$ from the replay buffer, and then use the current policy $\pi^{new}$ to calculate the action sequence from the sampled sequential observations:

$a^{*}_{t},a^{*}_{t+1},\ldots,a^{*}_{t+T}\approx\pi^{new}(o_{t}),\pi^{new}(o_{t+1}),\ldots,\pi^{new}(o_{t+T}). \quad (5)$

Compared with the old policy adopted by Agarwal et al., the current policy $\pi^{new}$ is assumed to be closer to the optimal distribution $\pi^{*}$ (see Assumption 3.1), so adopting $\pi^{new}$ guarantees a tighter lower bound on the action optimality. For easier reading, the suboptimal policy $\pi^{new}$ and the suboptimal action $a^{new}$ ($a^{new}\sim\pi^{new}(\cdot)$) are denoted by $\pi^{*}$ and $a^{*}$, respectively, below.
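The relabeling step of Eq. (5) amounts to re-running the current policy on the sampled observation sequence. Below is a hedged sketch assuming the same placeholder `encoder` and `actor` (with `actor.sample` returning an action and its log-probability) as in the SAC sketch above.

```python
import torch

@torch.no_grad()
def relabel_action_sequence(encoder, actor, obs_seq):
    """obs_seq: iterable of observation batches (o_t, ..., o_{t+T}) sampled from the buffer."""
    actions = []
    for o_k in obs_seq:
        z_k = encoder(o_k)           # z = phi(o_k)
        a_k, _ = actor.sample(z_k)   # a*_k ~ pi^new, replacing the stored old-policy action
        actions.append(a_k)
    return torch.stack(actions)      # shape: (T+1, batch, action_dim)
```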

Assumption 3.1

Consider a reinforcement learning process that trains stably. Before the policy $\pi$ converges, we empirically assume that the distribution of the current policy $\pi^{new}$ becomes closer to the distribution of the optimal policy $\pi^{*}$ as the number of gradient update steps increases, i.e., $D_{KL}(\pi^{new}\parallel\pi^{*})<D_{KL}(\pi^{old}\parallel\pi^{*})$.

Characteristic Functions of Action Sequence Distributions  We now model the probability distribution of action sequences to facilitate the construction of the auxiliary loss target. It is worth noting that, since the PDF of the action sequence is difficult to handle, we use the characteristic function of the action sequence distribution in place of the PDF, following the work of Yang et al. (2022). The following Lemma 3.1 shows that the characteristic function is a probabilistic tool that has an equivalence relationship with the PDF.

Lemma 3.1

[21] Two random vectors $X$ and $Y$ have the same characteristic function if and only if they have the same probability distribution function.

Next, we describe how the characteristic function of the action sequence distribution is modeled and computed as the target $\mathcal{Y}$ of the auxiliary loss.

Consider a sequence of actions $a^{s}=(a^{*}_{t},a^{*}_{t+1},\ldots,a^{*}_{t+T})\in\Psi$ under $\pi^{*}$, where $\Psi$ is the combined space of the $T$-step actions defined on the original action space $\mathcal{A}$, and the probability distribution function of $a^{s}$ is denoted $f_{\Psi}(a^{s})$. Then the characteristic function defined on $\Psi$ is expressed as follows:

$\mathcal{Y}_{\Psi}(\theta,a^{s})=\mathbb{E}_{a^{s}\sim f_{\Psi}}\left[e^{i\langle\theta,a^{s}\rangle}\right] \quad (6)$
$=\int e^{i\langle\theta,a^{s}\rangle}f_{\Psi}(a^{s})\,da^{s}. \quad (7)$

where the imaginary unit is $i=\sqrt{-1}$, the inner product is $\langle\theta,a^{s}\rangle=\sum_{k=t}^{t+T}\theta_{k}a^{*}_{k}$, and $\mathcal{Y}$ is a complex-valued function of the real variable $\theta\sim\Theta$. Here $\Theta$ is a probability density function on $\mathbb{R}^{T}$, for which a Gaussian distribution $\Theta=\mathcal{N}(\mu,\sigma^{2})$ is adopted in practice.

Finally, the characteristic function of the action sequence distribution can be expressed by its real and imaginary parts as follows:

$\mathcal{Y}^{cos}=\mathrm{Re}\left(\mathcal{Y}_{\Psi}(\theta,a^{s})\right)=\mathbb{E}_{a^{s}\sim f_{\Psi}}\left[\cos\langle\theta,a^{s}\rangle\right],~\mathcal{Y}^{sin}=\mathrm{Im}\left(\mathcal{Y}_{\Psi}(\theta,a^{s})\right)=\mathbb{E}_{a^{s}\sim f_{\Psi}}\left[\sin\langle\theta,a^{s}\rangle\right]. \quad (8)$
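In practice, as in the upper-bound loss of Eq. (13) below, each sampled pair $(\theta,a^{s})$ yields the single-sample targets $\cos\langle\theta,a^{s}\rangle$ and $\sin\langle\theta,a^{s}\rangle$. The following numpy sketch illustrates this target computation; the flattened sequence shape and the Gaussian parameters are illustrative assumptions.

```python
import numpy as np

def characteristic_targets(action_seq, mu=0.0, sigma=1.0, seed=0):
    """action_seq: array of shape (batch, L) holding flattened action sequences a^s."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(mu, sigma, size=action_seq.shape)   # theta ~ N(mu, sigma^2)
    inner = np.sum(theta * action_seq, axis=-1)            # <theta, a^s> = sum_k theta_k a*_k
    return np.cos(inner), np.sin(inner), theta             # single-sample Y^cos, Y^sin targets
```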

3.2 Optimization Implementation

In this subsection, we mainly introduce the implementation and optimization details of the auxiliary task and give the overall training algorithm of the SAR framework.

First, we estimate the characteristic function of the action sequence distribution using a neural network predictor $Pred$. To improve the generalization of representation learning, the input of $Pred$ is the concatenation of the encoded state $z$, the real variable $\theta$, and the reward sequence $r^{s}=\{r_{t},r_{t+1},\ldots,r_{t+T}\}$. Therefore, the output of $Pred$ can be denoted as:

$\mathcal{G}_{\phi}(o,\theta,r^{s})=Pred\left(\left[\phi(o),\theta,r^{s}\right]\right). \quad (9)$
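A hedged PyTorch sketch of such a predictor is given below. The layer sizes are illustrative assumptions (the latent dimension 50 and sequence length 5 loosely follow Table 3), and multiple draws of $\theta$ can be batched along an extra dimension.

```python
import torch
import torch.nn as nn

class Pred(nn.Module):
    """Predicts [G^cos, G^sin] from the concatenation [phi(o), theta, r^s] of Eq. (9)."""
    def __init__(self, z_dim=50, theta_dim=5, reward_dim=5, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + theta_dim + reward_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                  # one cos and one sin prediction
        )

    def forward(self, z, theta, r_seq):
        x = torch.cat([z, theta, r_seq], dim=-1)   # [phi(o), theta, r^s]
        g_cos, g_sin = self.net(x).chunk(2, dim=-1)
        return g_cos.squeeze(-1), g_sin.squeeze(-1)
```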

Second, we compute the $L_{2}$ distance (mean squared error) between the predicted and the true characteristic function of the action sequence, and then optimize the encoder $\phi$ through the preliminary loss function:

$\mathcal{L}^{MSE}(\mathcal{G}_{\phi},\mathcal{Y}) \quad (10)$
$=\mathbb{E}_{\langle o,r^{s},a^{s}\rangle\sim D,\theta\sim\Theta}\left\lVert\mathcal{G}_{\phi}(o,\theta,r^{s})-\mathcal{Y}_{\Psi}(\theta,a^{s})\right\rVert_{2}^{2}. \quad (11)$
Algorithm 1 Sequential Action-Induced Invariant Representation

Input: replay buffer $D$ with size $N$, learning rate, batch size, etc.
Output: optimal $\pi$

1:  Initialize the Critic network $\varphi$, the Actor network $\psi$, and the encoder network $\phi$.
2:  for episode $m\leftarrow 0$ to $M$ do
3:     Encode the state $z_{t}=\phi(o_{t})$.
4:     Execute the action $a_{t}\sim\pi(\cdot|z_{t})$.
5:     Collect data $D\leftarrow D\cup\{o_{t},a_{t},r_{t},o_{t+1}\}$.
6:  end for
7:  for gradient step $i\leftarrow 0$ to $I$ do
8:     Sample a batch $B_{i}\sim D$.
9:     Get the sequences $\{a_{k:k+T},r_{k:k+T}\}$ via $B_{i}$.
10:     Update the action sequence $a^{*}_{k:k+T}=\pi(o_{k:k+T})$.
11:     Train the Actor-Critic with $\mathcal{L}^{V}+\mathcal{L}^{\pi}$ (Eqs. 1 and 3).
12:     Train the auxiliary task with $\mathcal{L}^{MSE}+\mathcal{L}^{CS}$ (Eqs. 15 and 17).
13:  end for
14:  return the optimal $\pi$

In fact, since the true characteristic function cannot be directly obtained, following Yang et al. (2022), we optimize an upper bound of $\mathcal{L}^{MSE}(\mathcal{G},\mathcal{Y})$:

$\mathcal{L}^{MSE}(\mathcal{G}_{\phi},\mathcal{Y}) \quad (12)$
$=\mathbb{E}_{\langle o,r^{s},a^{s}\rangle\sim D,\theta\sim\Theta}\left\lVert\mathcal{G}_{\phi}(o,\theta,r^{s})-e^{i\langle\theta,a^{s}\rangle}\right\rVert_{2}^{2} \quad (13)$
$\geq\mathbb{E}_{\langle o,r^{s}\rangle\sim D,\theta\sim\Theta}\left\lVert\mathcal{G}_{\phi}(o,\theta,r^{s})-\mathbb{E}_{a^{s}\sim f_{\Psi}}\left[e^{i\langle\theta,a^{s}\rangle}\right]\right\rVert_{2}^{2}. \quad (14)$

In Eq. (12), since the true characteristic function is represented by its real and imaginary parts, the predicted vector is also decoupled into two corresponding parts, $[\mathcal{G}_{\phi}^{cos},\mathcal{G}_{\phi}^{sin}]=\mathcal{G}_{\phi}$. Finally, the loss function with the decoupled real and imaginary parts is expressed as:

$\mathcal{L}^{MSE}(\mathcal{G}_{\phi},\mathcal{Y}) \quad (15)$
$=\mathbb{E}_{\langle o,r^{s},a^{s}\rangle\sim D,\theta\sim\Theta}\left[\lVert\mathcal{G}_{\phi}^{cos}-\mathcal{Y}^{cos}\rVert_{2}^{2}+\lVert\mathcal{G}_{\phi}^{sin}-\mathcal{Y}^{sin}\rVert_{2}^{2}\right]. \quad (16)$

Additionally, we construct a cosine similarity (CS) loss on the cosine component between the true characteristic vector and the predicted vector. Unlike the $L_{2}$ distance, which optimizes the magnitude of the prediction vector, the cosine similarity loss constrains the angular (phase) distance between the predicted and the target vectors, making the prediction vector as close as possible to the characteristic vector of the true action sequence.

$\mathcal{L}^{CS}=Cosine(\mathcal{G}_{\phi}^{cos},\mathcal{Y}^{cos}) \quad (17)$
$=1-\frac{1}{K}\sum_{i=0}^{K-1}\frac{\mathcal{G}^{cos}_{\phi,i}}{\lVert\mathcal{G}^{cos}_{\phi,i}\rVert}\cdot\frac{\mathcal{Y}^{cos}_{i}}{\lVert\mathcal{Y}^{cos}_{i}\rVert}. \quad (18)$
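A hedged sketch of the two auxiliary terms is given below, under one plausible reading of Eq. (18) in which each batch item carries $K$ characteristic-function samples; the tensor shapes are assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def aux_loss(g_cos, g_sin, y_cos, y_sin):
    """g_*, y_*: tensors of shape (batch, K) with K characteristic-function samples per item."""
    loss_mse = F.mse_loss(g_cos, y_cos) + F.mse_loss(g_sin, y_sin)        # Eq. (15)
    loss_cs = 1.0 - F.cosine_similarity(g_cos, y_cos, dim=-1).mean()      # Eq. (17)
    return loss_mse + loss_cs
```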

In the end, the above auxiliary representation learning module with $\mathcal{L}^{MSE}$ and $\mathcal{L}^{CS}$ is integrated into the Soft Actor-Critic (SAC) reinforcement learning framework, which includes the Actor loss $\mathcal{L}^{\pi}$ for training the policy and the Critic loss $\mathcal{L}^{V}$ for training the value network. Therefore, we train SAR by minimizing the total loss $\mathcal{L}=\mathcal{L}^{MSE}+\mathcal{L}^{CS}+\mathcal{L}^{\pi}+\mathcal{L}^{V}$; the corresponding pseudocode is given in Algorithm 1.

4 Results

To verify the generalization performance of SAR and recent baselines in distracting environments, the main experiments use different distraction sources for the training and evaluation phases, i.e., the evaluation phase uses video backgrounds not seen during training. We then report the comparison results of SAR and the baselines on the challenging DeepMind Control (DMControl) suite tasks, and comprehensively analyze the generalization ability of SAR with the generalization decay ratio and t-SNE visualization. Finally, we also deploy the algorithms in the field of autonomous driving with real natural distractions to verify their performance.

4.1 Evaluation Setting

DMControl with Background Distraction  The DMControl suite Tassa et al. (2018) is a challenging visual environment based on the MuJoCo physics simulator, which is widely used for the performance verification of visual reinforcement learning. Following the distracting DMControl setting Stone et al. (2021), the background distractions of the experiment are sampled from natural videos in the DAVIS 2017 dataset. Specifically, the algorithms alternately use 2 videos as backgrounds during the training phase and are evaluated on 30 unseen videos. We adopt the above challenging setting to evaluate the anti-distraction representation ability of the algorithms.

Baselines  We compare SAR against recent strong visual RL baselines on the DMC suite: (i) CRESP Yang et al. (2022), an algorithm that induces task-relevant state representations through reward sequences and is one of the current algorithms with outstanding generalization performance; (ii) DrQ Yarats et al. (2021), a recent state-of-the-art algorithm that utilizes data augmentation techniques to learn robust representations from pixels; (iii) DBC Zhang et al. (2020), a common RL baseline for learning compact encodings based on the task equivalence principle under the bisimulation metric; (iv) CURL Laskin et al. (2020a), a contrastive unsupervised representation learning method for RL, which has achieved state-of-the-art data efficiency on clean pixel-based environments.

4.2 Main Results

The evaluation results on the DMControl tasks with unseen distracting backgrounds are shown in Fig. 3. In general, we evaluate the performance of SAR by comparing it with four common baselines, DrQ, DBC, CURL, and the latest CRESP, on 6 tasks of different difficulty from the four scenarios Cartpole, Cheetah, Hopper, and Ball_in_cup. In these experiments, all baselines adopt settings consistent with their original work. For SAR, the learning rate is 5e-4, the initial temperature is 0.1, the update frequency of the Critic and Actor is 2, and we use the Adam optimizer Kingma & Ba (2014) to optimize the auxiliary task, Critic, and Actor networks with a batch size of 256. See Table 3 for the full list of hyperparameters.

Figure 3: The evaluation results for 500K environment steps on unseen natural video background settings in DMC tasks. Each method is trained using 5 different seeds; for every seed, the mean episode return is computed every 10000 environment steps, averaged over 10 episodes. All baseline data in this experiment come from reproduced results of the corresponding standard algorithms.
Table 1: Comparison results of the best episode scores on 6 DMC tasks. (mean & standard deviation over 5 seeds)
Methods Bic-catch Cartpole-swingup Cheetah-run Reacher-easy Hopper-stand Cartpole-swingup-sp
CURL 659±84 509±22 348±47 589±183 140±135 0±0
DBC 193±10 214±17 43±7 210±15 6±1 0±0
DrQ 659±22 533±67 379±57 573±161 159±96 43±40
CRESP 638±65 593±49 311±42 672±71 185±108 43±46
SAR (ours) 768±66 651±34 446±23 795±71 313±117 67±70

To highlight that our method has a stronger representation ability for eliminating background distractions, we choose 4 difficult tasks (Ball_in_cup-catch, Cartpole-swingup, Cheetah-run, and Reacher-easy) from the DMC500K benchmark, and an additional 2 high-difficulty tasks with sparse rewards (Hopper-stand, Cartpole-swingup-sparse). As shown in Table 1, based on 5 repeated experiments, the SAR method with sequential action-induced task information representation achieves the best performance on these tasks compared to the baselines.

4.3 Probing the Generalization of Representations

Figure 4: The performance comparison curves of the models during the training stage with the seen distracting backgrounds. Results for each curve are averaged from 5 seeds.
Figure 5: Comparison results of the generalization decay ratio. It is calculated from the scores of the models in the training stage with seen distractions and the evaluation stage with unseen distractions. For visualization, we halve the decay ratio value of Cartpole-swingup-sparse.

We quantify and analyze the ability of different representations to induce task-relevant information during the training and evaluation stages. Fig. 4 shows the performance comparison curves of the baselines during the training stage. In this stage, with iterative training, the distracting backgrounds (videos) are gradually learned and become familiar to the agents, i.e., the distracting backgrounds are seen by the agents, so it is easier for the agents to generalize. In contrast, since the distracting backgrounds used during training and evaluation are independent of each other, i.e., the background videos used during the evaluation stage are not seen during the training stage, some models with weak generalization ability may suffer significant performance degradation due to the unseen distracting backgrounds. Of course, if the representation learner can sufficiently understand and determine the task-relevant information, the agent should perform equally well in the training and evaluation stages.

To quantify the generalization ability of the models, we define the generalization decay ratio of the model performance from the training stage (seen background distractions) to the evaluation stage (unseen background distractions) as $\rho=\frac{score_{train}-score_{eval}}{score_{train}}$. The comparison bar charts of the generalization decay ratio for each method on different tasks are shown in Fig. 5. To be specific, we can summarize two main conclusions: i) Our SAR method, which learns state representations via action sequences, achieves the lowest generalization decay ratio compared with the baselines on most tasks and shows better robustness to unseen video backgrounds. ii) There is a performance loss when the methods generalize from the training to the evaluation stage, and the general decay range is $10\%\sim 30\%$, where the largest performance loss occurs on the complex tasks Hopper-stand and Cartpole-swingup-sparse.
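For clarity, the decay ratio can be computed as in the following sketch; the numbers are purely illustrative and are not results from the paper.

```python
def generalization_decay_ratio(score_train, score_eval):
    """rho = (score_train - score_eval) / score_train."""
    return (score_train - score_eval) / score_train

# Illustrative only: a drop from 600 (seen backgrounds) to 480 (unseen) gives rho = 0.2.
print(generalization_decay_ratio(600.0, 480.0))
```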

4.4 t-SNE Visualization for Latent Spaces

Figure 6: t-SNE visualization of the latent state spaces learned by the converged SAR (left) and DrQ (right). We color-code the embedded points with reward values (higher values yellow, lower values purple). The middle part of the figure shows three groups of observations with similar tasks but different background distractions, and the linked points represent the projection positions of the observation encodings in the latent spaces of SAR and DrQ.

We employ t-SNE visualization to inspect the distribution maps of observation transitions in the latent state spaces of the SAR and DrQ encoders, where the transitions come from the same batch of 4-episode interaction data, and to establish the relationship between the distribution and the observations with background distractions. In addition, we associate observation images and embedding points according to their sequence numbers annotated in advance. As shown in Fig. 6, the embedding maps on the left and right correspond to the 2D projections of the observations in the SAR and DrQ latent state spaces, respectively, and the middle shows three groups of random observation samples with similar task information but different video backgrounds. For the projection pairs of these samples, we find that they are very close in the latent state space of our SAR and have similar state values (similar colors). Contrary to SAR, in the latent space of the DrQ encoder the same pairs are far apart and have dissimilar state values, which obviously deviates from reality. In other words, for unseen background distractions, the proposed SAR method can accurately extract task regions through action sequences and compress them into a low-dimensional representation whose task and value information is nearly consistent with the real state. In summary, the t-SNE visualization experiment strongly confirms our previous hypothesis that the method of sequential action-induced invariant representations can learn better representations than the baselines.

4.5 Application in Autonomous Driving

The inputs of real-world visual control systems such as autonomous driving often contain task-irrelevant elements, such as clouds, mountains, and lighting, so it is necessary to extract as much task-relevant information as possible, e.g., information about the road, other vehicles, or obstacles, in order to drive better under complex and unstructured distractions. To verify the state representation and generalization ability of the algorithms under realistic visual observations, we apply them to visual autonomous driving based on the CARLA simulator Dosovitskiy et al. (2017).

Figure 7: Left: the bird's-eye view of the Town04 map. Right: the upper right shows two live snapshots from the training period, in which the red vehicle is the trained agent. Corresponding to the upper scenes, the lower right, marked with "obs_", shows two wide-angle down-sampled images stitched from 5 cameras arranged on the roof, which are directly used as the observation input for agent training.

CARLA Task  CARLA is a powerful autonomous driving simulator whose visual inputs and physical effects are close to those of real-world cars, and in which we can configure rich task components, weather, lighting, and NPC vehicles in the scene to meet the requirements of different tasks. In this experiment, we choose the officially provided Town04 map with a ring highway containing a crossroad, as shown in Fig. 7. Concretely, we set up a 300-degree wide-angle image collector composed of 5 cameras with 60-degree views on the roof of the vehicle agent. The agent needs to operate the steering, brake, and accelerator to control the vehicle's movement under changing wind, rain, cloud, and sunlight conditions.

Task Setting  The goal of the agent is to safely drive as far as possible within a limit of 1000 environment steps, and the episode is reset when the agent exceeds the maximum environment step, goes out of the road bounds, or crashes. The observation input for training is an RGB image concatenated from the 5 cameras, with a size of $84\times 420$ pixels, and the control actions are the standardized speed and steering. Following Zhang et al. (2020), the reward function is set as $r_{t}(s,a)=\textbf{v}^{T}\hat{\textbf{u}}_{highway}\cdot\Delta t-C_{i}\cdot impulse-C_{s}\cdot|steer|$, where $\hat{\textbf{u}}_{highway}$ is the unit vector of the highway, $\textbf{v}^{T}\hat{\textbf{u}}_{highway}$ represents the effective speed obtained by projecting the vehicle velocity onto the highway, which is then multiplied by the unit time $\Delta t$ to represent the effective distance, $impulse$ is the collision force (N/s) obtained via physical calculation, and $steer$ is the output amplitude of the steering. Detailed parameters can be seen in Table 4. Finally, it should be emphasized that learning a basic driving policy within very short training (1e5 steps) in such realistic, complex scenarios is a challenging task.
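As a hedged sketch of this reward (not the simulator's exact code), assuming $\hat{\textbf{u}}_{highway}$ is already a unit vector and that the velocity, steering, and impulse values are read from the simulator:

```python
import numpy as np

def carla_reward(velocity, u_highway, steer, impulse, dt=0.05, c_i=1e-4, c_s=1.0):
    """r_t = v^T u_hat * dt - C_i * impulse - C_s * |steer| (coefficients as in Table 4)."""
    effective_distance = float(np.dot(velocity, u_highway)) * dt   # projected speed times dt
    return effective_distance - c_i * impulse - c_s * abs(steer)
```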

Figure 8: Performance comparison on the CARLA task over 400K environment steps. Left: evaluation curves of the average episode reward for SAR and the baselines. Right: average episode driving distance for SAR and the baselines. Each curve is averaged over 3 seeds and smoothed for visual clarity, where the shading is the standard deviation. It is worth noting that in this experiment the DBC algorithm was excluded because its public code could not reproduce the standard performance.
Table 2: Comparison results of the best episode scores on the CARLA task. (mean & standard deviation over 3 seeds)
Methods Reward Distance (m) Successful episodes Average steer Average brake Crash intensity
CURL 149.1±14.1 172.5±7.6 14% 9.76% 1.75% 4311
DrQ 122.8±2.58 158.1±1.54 21% 12.02% 1.66% 3313
CRESP 111.1±39.4 115.7±50.2 8% 15.53% 1.93% 3898
SAR (ours) 193.2±4.5 219.4±11.5 30% 7.86% 1.43% 4023

Evaluation  We choose the CRESP, CURL, and DrQ algorithms as the baselines. As shown in Fig. 8, our algorithm is significantly better than the baselines in both crucial comparisons of episode reward and episode distance. To be specific, the best average reward and driving distance of the proposed algorithm are 193 and 219 m, which are 29.5% and 27.3% higher than the suboptimal baseline, respectively, as shown in Table 2. In addition, although our algorithm has slightly more learnable parameters than CURL and DrQ, its convergence rate is still higher than that of all baselines. The successful episode ratio is another important metric, which represents the proportion of episodes that reached the episode horizon. As can be seen from Table 2, our method also improves by 42.9% over the suboptimal baseline. Furthermore, for the average steering amplitude, average brake degree, and crash intensity, which reflect the stability of the vehicle during driving, our method outperforms the baselines on the first two metrics. For crash intensity, our algorithm performs worse than the baselines; a possible explanation is that the baseline agents travel a shorter distance than ours, so some of the crashes can be avoided. As a whole, the significant advantages in the comparative experiments empirically indicate that the proposed algorithm can better handle task-irrelevant distractions and thereby learn an excellent policy.

5 Related work

To facilitate the computation and generalization of downstream decision-making tasks with high-dimensional inputs, it is usually required to learn low-dimensional representations from the high-dimensional images Li et al. (2023); Yang et al. (2023); Mazoure et al. (2021); Chen et al. (2020); Dong et al. (2022); Huang et al. (2022). Traditional methods, including DQN Mnih et al. (2015), Reformer Kitaev et al. (2019), GMAQN Liang et al. (2021), GTrXL Parisotto et al. (2020), and RAD Laskin et al. (2020b), typically train a state encoder in an end-to-end manner, but they struggle to handle unstructured high-dimensional scenarios in reality. In the literature, the other commonly adopted approaches are as follows.

Reconstruction-based Representations  By decoupling representation learning from policy learning, pixel-level image reconstruction and latent state encoding were typical methods in the early stage of the development of representation learning Fu et al. (2021); Higgins et al. (2017). For example, Lange et al. (2012) optimize a reconstruction loss to predict the information required by the policy in a two-stage learning process. Yu et al. (2022) and Liu et al. (2022) introduce mask-based reconstruction losses, which aim to reconstruct data and facilitate the learning of state representations. Furthermore, model-based methods, such as the PlaNet Hafner et al. (2019) and SLAC Lee et al. (2020) algorithms, use the reconstruction loss to learn latent state models. However, these methods often require extensive manual tuning, and moreover, the learned representations are often hardly task-relevant.

Contrastive-based Representations  Recently, self-supervised learning of representations in reinforcement learning has attracted much attention, and contrastive representation learning is one of the typical approaches Laskin et al. (2020a), where the InfoNCE loss Oord et al. (2018) is optimized to maximize the mutual information between anchors and positive samples while staying away from the information of negative samples Henaff (2020). To embed task-relevant information into the representation, Liu et al. (2020) and Chen et al. (2022) divide positive and negative samples by task-related rewards, Laskin et al. (2022) use contrastive learning between state transitions and abstract skills to learn behavior representations, Fan & Li (2022) introduce temporal information to establish multi-view contrastive learning, and Agarwal et al. (2020) adopt the policy similarity metric (PSM) as the temperature parameter for contrastive learning.

Metric-based Representations  Most of the work on self-supervised representation learning focuses on how to determine a representation that is equivalent to the task. Although some work with contrastive learning has achieved a degree of success, as the related methods mainly optimize a relaxed lower bound of mutual information, the problem of out-of-distribution generalization cannot be handled well. Recently, by directly optimizing the distance between the encoded states and MDP elements, task-related representations can be extracted via metrics of task equivalence, which have shown better generalization in environments with similar tasks but unfamiliar backgrounds Chen & Pan (2022a). The methods mainly include: deep bisimulation metrics (DBC) Zhang et al. (2020), which optimize the encoding distance between samples to make it equal to the sum of the corresponding reward and transition distances; the policy similarity metric (PSM) Agarwal et al. (2020); Chen & Pan (2022b), which introduces a one-step action distance to measure the similarity of states; return-based contrastive learning Liu et al. (2020), which uses the return to distinguish task information and maximize the mutual information between the encoder and high-return samples; and CRESP Yang et al. (2022), which takes the insight that the same reward should correspond to the same task and minimizes the distance between the predicted and real distributions of reward sequences.

Inspired by the aforementioned work, especially the PSM and CRESP methods, we also aim to optimize an auxiliary loss to promote task-relevant representation learning. But rather than using rewards or one-step actions, our method models sequential actions to capture favorable state representations, which the experiments show can determine task-relevant representations more accurately.

6 Conclusion

We propose the SAR method, which leverages action sequences to induce invariant state representations. SAR is derived from the idea that there is a consistent relationship between the real task information in sequential observations and the control signals of the sequential actions. By modeling this relationship, SAR gains the ability to quickly lock onto the task-relevant information controlled by the sequential actions, which greatly improves representation performance in unstructured scenes with background distractions.

We compare SAR against strong baselines on the DMControl suite tasks and evaluate it under the challenging setting with unseen background distractions. The results show that SAR achieves the best performance on most of the tasks in the DMC500K benchmark as well as on complex tasks with sparse rewards. We also demonstrate the effectiveness of SAR at disregarding task-irrelevant information by applying it to realistic autonomous driving with natural distractions. Further, we quantitatively analyzed the performance decay of the models from the training to the evaluation stage and visualized the encoded observations in latent space, both of which further support our conclusions. Finally, the core auxiliary module in the SAR method can be extended straightforwardly to arbitrary visual DRL algorithms.

Table 3: Hyperparameters used for the experiments on DMControl tasks
Hyperparameter Value
Observation rendering 100 × 100
Observation downsampling 84 × 84
Augmentation Random shift
Training frames 500000
Replay buffer capacity 100000
Initial exploration steps 1000
Action repeat 8 (Cartpole-swingup); 4 otherwise
Stacked frames 3
Evaluation episodes 10
Batch size 256
Learning rate 0.0005
Discount factor 0.99
Actor update frequency 2
Critic update frequency 2
Actor log stddev bounds [-5, 2]
Init temperature 0.1
Action sequence length 5
State representation dimension 50
Optimizer Adam
Table 4: Partial hyperparameters used for the experiments on CARLA tasks
Hyperparameter Value
Camera number 5
Full FOV angles 5 × 60 degrees
Observation downsampling 84 × 420
Initial exploration steps 100
Training frames 400000
Action repeat 4
Batch size 128
$\Delta t$ 0.05 seconds
$C_{i}$ 0.0001
$C_{s}$ 1.0

Acknowledgments

We would like to acknowledge the support of the National Natural Science Foundation of China (No. 61772438 and No. 61375077).

References

  • Agarwal et al. (2020) Agarwal, Rishabh, Machado, Marlos C, Castro, Pablo Samuel, and Bellemare, Marc G. Contrastive behavioral similarity embeddings for generalization in reinforcement learning. In International Conference on Learning Representations, 2020.
  • Agarwal et al. (2021) Agarwal, Siddhant, Courville, Aaron, and Agarwal, Rishabh. Behavior predictive representations for generalization in reinforcement learning. In Deep RL Workshop NeurIPS 2021, 2021.
  • Chen & Pan (2022a) Chen, Jianda and Pan, Sinno. Learning representations via a robust behavioral metric for deep reinforcement learning. In Advances in Neural Information Processing Systems, 2022a.
  • Chen & Pan (2022b) Chen, Jianda and Pan, Sinno Jialin. Learning generalizable representations for reinforcement learning via adaptive meta-learner of behavioral similarities. International Conference on Learning Representations, 2022b.
  • Chen et al. (2022) Chen, Qihang, Liang, Dayang, and Liu, Yunlong. Hard negative sample mining for contrastive representation in reinforcement learning. In Advances in Knowledge Discovery and Data Mining: 26th Pacific-Asia Conference, PAKDD 2022, Chengdu, China, May 16–19, 2022, Proceedings, Part II, pp.  277–288. Springer, 2022.
  • Chen et al. (2020) Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, and Hinton, Geoffrey. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. PMLR, 2020.
  • Dong et al. (2022) Dong, Wenkai, Zhang, Zhaoxiang, Song, Chunfeng, and Tan, Tieniu. Identifying the key frames: An attention-aware sampling method for action recognition. Pattern Recognition, 130:108797, 2022.
  • Dosovitskiy et al. (2017) Dosovitskiy, Alexey, Ros, German, Codevilla, Felipe, Lopez, Antonio, and Koltun, Vladlen. Carla: An open urban driving simulator. In Conference on robot learning, pp.  1–16. PMLR, 2017.
  • Dosovitskiy et al. (2020) Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
  • Fan & Li (2022) Fan, Jiameng and Li, Wenchao. Dribo: Robust deep reinforcement learning via multi-view information bottleneck. In International Conference on Machine Learning, pp. 6074–6102. PMLR, 2022.
  • Fu et al. (2021) Fu, Xiang, Yang, Ge, Agrawal, Pulkit, and Jaakkola, Tommi. Learning task informed abstractions. In International Conference on Machine Learning, pp. 3480–3491. PMLR, 2021.
  • Hafner et al. (2019) Hafner, Danijar, Lillicrap, Timothy, Fischer, Ian, Villegas, Ruben, Ha, David, Lee, Honglak, and Davidson, James. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pp. 2555–2565. PMLR, 2019.
  • He et al. (2022) He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Dollár, Piotr, and Girshick, Ross. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  16000–16009, 2022.
  • Henaff (2020) Henaff, Olivier. Data-efficient image recognition with contrastive predictive coding. In International conference on machine learning, pp. 4182–4192. PMLR, 2020.
  • Higgins et al. (2017) Higgins, Irina, Pal, Arka, Rusu, Andrei, Matthey, Loic, Burgess, Christopher, Pritzel, Alexander, Botvinick, Matthew, Blundell, Charles, and Lerchner, Alexander. Darla: Improving zero-shot transfer in reinforcement learning. In International Conference on Machine Learning, pp. 1480–1490. PMLR, 2017.
  • Huang et al. (2022) Huang, Fuxian, Li, Weichao, Cui, Jiabao, Fu, Yongjian, and Li, Xi. Unified curiosity-driven learning with smoothed intrinsic reward estimation. Pattern Recognition, 123:108352, 2022.
  • Jaderberg et al. (2019) Jaderberg, Max, Czarnecki, Wojciech M, Dunning, Iain, Marris, Luke, Lever, Guy, Castaneda, Antonio Garcia, Beattie, Charles, Rabinowitz, Neil C, Morcos, Ari S, Ruderman, Avraham, et al. Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.
  • Kalashnikov et al. (2018) Kalashnikov, Dmitry, Irpan, Alex, Pastor, Peter, Ibarz, Julian, Herzog, Alexander, Jang, Eric, Quillen, Deirdre, Holly, Ethan, Kalakrishnan, Mrinal, Vanhoucke, Vincent, and Levine, Sergey. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. ArXiv, abs/1806.10293, 2018.
  • Kemertas & Aumentado-Armstrong (2021) Kemertas, Mete and Aumentado-Armstrong, Tristan. Towards robust bisimulation metric learning. Advances in Neural Information Processing Systems, 34:4764–4777, 2021.
  • Kingma & Ba (2014) Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kitaev et al. (2019) Kitaev, Nikita, Kaiser, Lukasz, and Levskaya, Anselm. Reformer: The efficient transformer. In International Conference on Learning Representations, 2019.
  • Lange et al. (2012) Lange, Sascha, Riedmiller, Martin, and Voigtländer, Arne. Autonomous reinforcement learning on raw visual input data in a real world application. In The 2012 international joint conference on neural networks (IJCNN), pp.  1–8. IEEE, 2012.
  • Laskin et al. (2020a) Laskin, Michael, Srinivas, Aravind, and Abbeel, Pieter. Curl: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pp. 5639–5650. PMLR, 2020a.
  • Laskin et al. (2022) Laskin, Michael, Liu, Hao, Peng, Xue Bin, Yarats, Denis, Rajeswaran, Aravind, and Abbeel, Pieter. Cic: Contrastive intrinsic control for unsupervised skill discovery. Advances in Neural Information Processing Systems, 2022.
  • Laskin et al. (2020b) Laskin, Misha, Lee, Kimin, Stooke, Adam, Pinto, Lerrel, Abbeel, Pieter, and Srinivas, Aravind. Reinforcement learning with augmented data. Advances in neural information processing systems, 33:19884–19895, 2020b.
  • Lee et al. (2020) Lee, Alex X, Nagabandi, Anusha, Abbeel, Pieter, and Levine, Sergey. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. Advances in Neural Information Processing Systems, 33:741–752, 2020.
  • Li et al. (2023) Li, Tianyi, Yang, Genke, and Chu, Jian. Implicit posteriori parameter distribution optimization in reinforcement learning. IEEE Transactions on Cybernetics, pp.  1–14, 2023. doi: 10.1109/TCYB.2023.3254596.
  • Li et al. (2021) Li, Zhongyu, Cheng, Xuxin, Peng, Xue Bin, Abbeel, P., Levine, Sergey, Berseth, Glen, and Sreenath, Koushil. Reinforcement learning for robust parameterized locomotion control of bipedal robots. 2021 IEEE International Conference on Robotics and Automation (ICRA), pp.  2811–2817, 2021.
  • Liang et al. (2021) Liang, Dayang, Chen, Qihang, and Liu, Yunlong. Gated multi-attention representation in reinforcement learning. Knowledge-Based Systems, 233:107535, 2021.
  • Liang et al. (2022) Liang, Dayang, Deng, Huiyi, and Liu, Yunlong. The treatment of sepsis: an episodic memory-assisted deep reinforcement learning approach. Applied Intelligence, pp.  1–11, 2022.
  • Liu et al. (2022) Liu, Fangchen, Liu, Hao, Grover, Aditya, and Abbeel, Pieter. Masked autoencoding for scalable and generalizable decision making. In Advances in Neural Information Processing Systems, 2022.
  • Liu et al. (2020) Liu, Guoqing, Zhang, Chuheng, Zhao, Li, Qin, Tao, Zhu, Jinhua, Jian, Li, Yu, Nenghai, and Liu, Tie-Yan. Return-based contrastive representation learning for reinforcement learning. In International Conference on Learning Representations, 2020.
  • Mazoure et al. (2021) Mazoure, Bogdan, Ahmed, Ahmed M, Hjelm, R Devon, Kolobov, Andrey, and MacAlpine, Patrick. Cross-trajectory representation learning for zero-shot generalization in rl. In International Conference on Learning Representations, 2021.
  • Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
  • Oord et al. (2018) Oord, Aaron van den, Li, Yazhe, and Vinyals, Oriol. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Parisotto et al. (2020) Parisotto, Emilio, Song, Francis, Rae, Jack, Pascanu, Razvan, Gulcehre, Caglar, Jayakumar, Siddhant, Jaderberg, Max, Kaufman, Raphael Lopez, Clark, Aidan, Noury, Seb, et al. Stabilizing transformers for reinforcement learning. In International conference on machine learning, pp. 7487–7498. PMLR, 2020.
  • Schölkopf (2022) Schölkopf, Bernhard. Causality for machine learning. In Probabilistic and Causal Inference: The Works of Judea Pearl, pp.  765–804. 2022.
  • Seo et al. (2022) Seo, Younggyo, Hafner, Danijar, Liu, Hao, Liu, Fangchen, James, Stephen, Lee, Kimin, and Abbeel, Pieter. Masked world models for visual control. arXiv preprint arXiv:2206.14244, 2022.
  • Stone et al. (2021) Stone, Austin, Ramirez, Oscar, Konolige, Kurt, and Jonschkowski, Rico. The distracting control suite–a challenging benchmark for reinforcement learning from pixels. arXiv preprint arXiv:2101.02722, 2021.
  • Tassa et al. (2018) Tassa, Yuval, Doron, Yotam, Muldal, Alistair, Erez, Tom, Li, Yazhe, de Las Casas, Diego, Budden, David, Abdolmaleki, Abbas, Merel, Josh, Lefrancq, Andrew, Lillicrap, Timothy P., and Riedmiller, Martin A. Deepmind control suite. ArXiv, abs/1801.00690, 2018.
  • Vaswani et al. (2017) Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, and Polosukhin, Illia. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wu et al. (2022) Wu, Peng, Jia, Xiaosong, Chen, Li, Yan, Junchi, Li, Hongyang, and Qiao, Yu. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. Advances in Neural Information Processing Systems, 2022.
  • Yang et al. (2022) Yang, Rui, Wang, Jie, Geng, Zijie, Ye, Mingxuan, Ji, Shuiwang, Li, Bin, and Wu, Fengli. Learning task-relevant representations for generalization via characteristic functions of reward sequence distributions. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022.
  • Yang et al. (2023) Yang, Zhiyou, Qu, Hong, Fu, Mingsheng, Hu, Wang, and Zhao, Yongze. A maximum divergence approach to optimal policy in deep reinforcement learning. IEEE Transactions on Cybernetics, 53(3):1499–1510, 2023. doi: 10.1109/TCYB.2021.3104612.
  • Yarats et al. (2021) Yarats, Denis, Kostrikov, Ilya, and Fergus, Rob. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International Conference on Learning Representations, 2021.
  • Yu et al. (2022) Yu, Tao, Zhang, Zhizheng, Lan, Cuiling, Chen, Zhibo, and Lu, Yan. Mask-based latent reconstruction for reinforcement learning. ArXiv, abs/2201.12096, 2022.
  • Yuan et al. (2022) Yuan, Zhecheng, Ma, Guozheng, Mu, Yao, Xia, Bo, Yuan, Bo, Wang, Xueqian, Luo, Ping, and Xu, Huazhe. Don’t touch what matters: Task-aware lipschitz data augmentationfor visual reinforcement learning. arXiv preprint arXiv:2202.09982, 2022.
  • Zang et al. (2022) Zang, Hongyu, Li, Xin, and Wang, Mingzhong. Simsr: Simple distance-based state representations for deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  8997–9005, 2022.
  • Zhang et al. (2020) Zhang, Amy, McAllister, Rowan Thomas, Calandra, Roberto, Gal, Yarin, and Levine, Sergey. Learning invariant representations for reinforcement learning without reconstruction. In International Conference on Learning Representations, 2020.
  • Zhao et al. (2022) Zhao, Yinuo, Wu, Kun, Xu, Zhiyuan, Che, Zhengping, Lu, Qi, Tang, Jian, and Liu, Chi Harold. Cadre: A cascade deep reinforcement learning framework for vision-based autonomous urban driving. In AAAI, 2022.