Cross-domain Random Pre-training with Prototypes for Reinforcement Learning

Xin Liu, Yaran Chen, Haoran Li, Boyu Li, Dongbin Zhao This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.X. Liu, Y. Chen, H. Li, B. Li, and D. Zhao are with the State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China. (email: [email protected], [email protected], [email protected], [email protected], [email protected])B. Li is also with Beijing Academy of Artificial Intelligence, Beijing, 100084, China.

Abstract

Unsupervised cross-domain Reinforcement Learning (RL) pre-training shows great potential for challenging continuous visual control but poses a big challenge. In this paper, we propose Cross-domain Random Pre-Training with prototypes (CRPTpro), a novel, efficient, and effective self-supervised cross-domain RL pre-training framework. CRPTpro decouples data sampling from encoder pre-training, proposing decoupled random collection to easily and quickly generate a qualified cross-domain pre-training dataset. Moreover, a novel prototypical self-supervised algorithm is proposed to pre-train an effective visual encoder that is generic across different domains. Without finetuning, the cross-domain encoder can be implemented for challenging downstream tasks defined in different domains, either seen or unseen. Compared with recent advanced methods, CRPTpro achieves better performance on downstream policy learning without extra training on exploration agents for data collection, greatly reducing the burden of pre-training. We conduct extensive experiments across eight challenging continuous visual-control domains, including balance control, robot locomotion, and manipulation. CRPTpro significantly outperforms the next best Proto-RL(C) on 11/12 cross-domain downstream tasks with only 54% wall-clock pre-training time, exhibiting state-of-the-art pre-training performance with greatly improved pre-training efficiency. The complete code is available at https://github.com/liuxin0824/CRPTpro.

Index Terms:

Deep reinforcement learning, Self-supervised visual pre-training, Unsupervised exploration, Cross-domain representation learning, Prototypical representation learning.

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

I Introduction

Representation learning is crucial for Deep Reinforcement Learning (DRL), especially image-based RL. Traditional task-specific RL algorithms [1, 2, 3] rely on the reward function to learn feature representation and downstream policy simultaneously, succeeding in many fields [4, 5, 6, 7, 8, 9, 10, 11, 12]. However, it is sample inefficient when faced with high-dimensional input like images, especially complex visual input in challenging continuous visual motor control. Besides, the task-specified encoder learned with lots of effort cannot generalize to novel tasks. For problems mentioned above, self-supervised task-agnostic pre-training, the combination of Self-Supervised Learning (SSL) and unsupervised visual pre-training, is proposed. By designing auxiliary tasks, SSL improves the perception ability of the visual encoder in a targeted manner, thus learning better representation for downstream policy. Employing SSL to achieve pre-training over a task-agnostic dataset makes the encoder generic across different downstream tasks. Recent works [13, 14] have proven it possible to pre-train a powerful single-domain encoder which enables efficient downstream RL on different challenging visual-control tasks defined in the same domain.

Refer to caption — Figure 1: Difference between three kinds of visual encoders in image-based RL on DMControl. (a) The single-task encoder. It is dedicated to only one task. (b) The single-domain encoder. It generalizes across tasks in a single domain, i.e., tasks from the same domain can share the same encoder. (c) The cross-domain encoder. It generalizes across both tasks and domains, i.e., tasks from different domains can share the same encoder. CRPTpro pre-trains a powerful cross-domain encoder enabling state-of-the-art downstream policy learning across multiple domains in challenging continuous visual control.

Now that the single-domain encoder has been implemented, another question naturally arises: Can we pre-train a cross-domain encoder that is generic across both tasks and domains? This means that different tasks defined in different domains can share the same encoder. To avoid confusion, we use DeepMind Control suite (DMControl) [15] as an example to illustrate the differences between these encoders, as shown in Fig. 1. There are many benefits of pre-training a cross-domain encoder. Intuitively, its versatility can greatly reduce the training burden when faced with multiple domains. In the long run, a generic encoder is necessary for many promising sub-fields of DRL, such as multi-tasks RL [16], meta-RL [17], transfer RL [18], and the long-term goal of RL: generalist agent [19].

Currently, unsupervised active pre-training methods [13, 14] are state-of-the-art, enabling effective encoder pre-training for multiple tasks defined in a single domain. As a branch of unsupervised RL [20, 21, 22, 23], these approaches train extra exploration agents via unsupervised intrinsic reward to collect task-agnostic data for encoder learning. The exploration policy is built on the visual encoder, and they are updated simultaneously during the exploration agent training (i.e., pre-training). However, two problems emerge when transplanting these methods into cross-domain pre-training. First, the pre-trained encoder degenerates, leading to poor downstream policy learning. In cross-domain settings, the chicken-and-egg problem between exploration agent training and visual encoder pre-training is much more severe than that in single-domain pre-training. It’s difficult for current active methods to train multiple qualified exploration agents on a substandard encoder that is under cross-domain pre-training, and vice versa. Second, the pre-training efficiency is insufficient. As mentioned above, current advanced active pre-training methods employ extra RL to train exploration agents for data collection. This leads to a severe pre-training burden, which multiplies with the number of domains.

In this work, we address the above problems through CRPTpro, a novel, efficient, and effective self-supervised cross-domain RL pre-training framework. Instead of training extra exploration agents for data collection like recent advanced unsupervised methods, CRPTpro completely decouples data sampling from encoder pre-training, proposing decoupled random collection. It employs an off-the-shelf random policy to achieve steady exploration across multiple domains, easily and quickly producing a qualified cross-domain pre-training dataset to improve both the performance and efficiency of encoder pre-training. Moreover, a novel self-supervised algorithm named efficient prototypical learning is proposed to further improve the pre-training performance. It improves original prototypical representation learning [24, 14, 25, 26] by a novel intrinsic loss that facilitates the diffusion of prototypes. After pre-training, the cross-domain encoder obtained by CRPTpro is frozen and achieves efficient downstream policy learning via RL on challenging visual-control tasks from different domains. Besides, the pre-trained encoder can generalize well to unseen domains directly or through only a few-shot finetuning. We conduct extensive experiments on eight different, representative and challenging continuous visual-control environments, including classical balance control like pendulum, multi-joint robot locomotion, robotic manipulation and so on. Results demonstrate that CRPTpro outperforms all cross-domain pre-training baselines significantly, enabling state-of-the-art cross-domain downstream policy learning. For pre-training efficiency, CRPTpro also exceeds the most advanced approach by a large margin. In addition, we demonstrate and analyze the effectiveness of our decoupled random collection and efficient prototypical learning in different pre-training settings.

The contributions of our paper can be summarized as:

•

We propose CRPTpro, a novel, efficient, and effective self-supervised cross-domain RL pre-training framework. The cross-domain encoder obtained by CRPTpro enables efficient downstream RL on different challenging visual-control tasks defined in different domains. In addition, it can generalize well to unseen domains directly or through a few-shot finetuning.
•

Unlike recent advanced methods, CRPTpro decouples data sampling from encoder pre-training, proposing decoupled random collection to easily and quickly generate a qualified cross-domain pre-training dataset, which improves both pre-training efficiency and downstream policy performance. In addition, CRPTpro proposes efficient prototypical learning, a novel prototypical self-supervised algorithm to further improve the pre-training.
•

Extensive experiments demonstrate that CRPTpro significantly outperforms all cross-domain baselines on downstream policy learning. It improves current state-of-the-art method by a large margin, with greatly improved pre-training efficiency. In addition, we provide detailed analysis about the proposed decoupled random collection and efficient prototypical learning.

II Related Work

II-A Self-supervised Learning in RL

End-to-end RL shows sample inefficiency when faced with high-dimensional observations like images. One solution is to employ auxiliary objectives to do SSL, for example, predicting the designed properties of the environments [27, 28]. Following these, CURL [29] introduced a contrastive auxiliary task into end-to-end RL to improve sample efficiency. SPR [30] combined data augmentation with an auxiliary SSL objective. CltrFormer [18] employed transformer [31] in visual-control tasks, using contrastive learning to get transferable representation between different domains on DMControl [15]. Recently, a novel SSL approach: prototypical representation learning [24] based on the Sinkhorn-Knopp algorithm [32] was introduced into RL by Proto-RL [14]. Their success motivated DreamerPro [25], ProtoCAD [26] and our CRPTpro. Different from previous prototypical algorithm, we introduce an extra intrinsic loss between prototypes along with original comparative loss, proposing efficient prototypical learning, a novel and effective self-supervised algorithm. In addition, CRPTpro performs efficient prototypical learning on a static dataset as original SwAV [24] does.

II-B Encoder Pre-training for Downstream RL

Inspired by the success that SSL can pre-train a strong feature extractor without labels in the field of CV [33, 34] and NLP [35, 36], ATC [37] tried to decouple the representation learning from downstream policy learning and first achieved considerable results. MVP [38] utilized the offline dataset from the internet, pre-training a visual encoder for different motor tasks. SGI [39] employed finetuning on the pre-trained encoder, achieving efficient policy learning on Atari 100k benchmark [40]. Inspired by unsupervised RL [20, 41, 21], APT [13] pre-trained an extra exploration agent via an unsupervised intrinsic reward for data collection, along with visual representation learning by SimCLR [34]. Subsequently, Proto-RL [14] used prototypical representation learning [24] to simultaneously achieve encoder pre-training and strengthen APT unsupervised reward, setting state-of-the-art downstream policy performance on DMControl. Different from these active pre-training methods [13, 14, 23], CRPTpro gives up unsupervised RL but decouples data collection from encoder pre-training by an off-the-shelf exploration policy, achieving state-of-the-art cross-domain pre-training performance with dramatically improved pre-training efficiency.

III Methodology

Problem definition: In most circumstances, a visual-control task could be formulated as an infinite-horizon Partially Observable Markov Decision Process (POMDP) [42, 43], denoted by $\mathcal{M}^{p}=(\mathcal{O},\mathcal{A},\mathcal{P},\mathcal{R},\gamma,d_{0})$ , where $\mathcal{O}$ is the high-dimensional observation space, i.e. pixels, $\mathcal{A}$ is the action space, $\mathcal{P}$ is the distribution of next observation given the history and current action, $R$ is the reward function, $\gamma$ is the discount factor, and $d_{0}$ is the distribution of the initial observation. By stacking three consecutive previous observations into a state, this POMDP is converted into an Markov Decision Process (MDP) $\mathcal{M}=(\mathcal{X},\mathcal{A},\mathcal{P},\mathcal{R},\gamma,d_{0})$ , where the next state only depends on the current state, unrelated to the history. RL can be performed on the MDP to obtain a downstream policy.

The proposed CRPTpro is shown in Fig. 2. During pre-training, we propose decoupled random collection to efficiently produce an effective cross-domain pre-training dataset, as shown in section III.A. Then, we describe how the novel efficient prototypical learning is implemented and how to pre-train the cross-domain encoder and prototypes, which is detailed in section III.B. After pre-training, the frozen cross-domain encoder and prototypes enable efficient downstream RL on different challenging visual-control tasks from different domains, which we describe in section III.C.

III-A Decoupled Random Collection

The dataset for pre-training should be diverse to cover the observation space as much as possible. As a branch of unsupervised RL, active pre-training methods [13, 14] design an intrinsic reward related to exploration, based on which they train extra agents to explore the observation space along with learning visual encoders. These methods are able to explore and collect far-reaching states in hard-exploration domains and achieve state-of-the-art performance in single-domain visual-control pre-training. However, due to the requirements of extra agent training, these methods suffer a severe pre-training burden that even exceeds that of downstream RL many times. Especially when facing multiple domains, this extra burden is dramatically increased. Meanwhile, these approaches exhibit huge performance drops when transplanted into cross-domain pre-training due to a chicken-and-egg problem: pre-training an effective encoder requires effective exploration agents for qualified data collection, while effective exploration agents rely on an effective encoder. This problem is severely amplified in cross-domain pre-training because multiple exploration policies are required for multiple domains. It’s hard for current active methods to train multiple qualified exploration strategies from scratch on a substandard visual encoder, and it’s also hard to achieve ideal pre-training on unqualified data sampled by unqualified exploration policies.

Due to the above problems, CRPTpro gives up training extra unsupervised exploration agents but decouples data sampling from encoder pre-training, proposing decoupled random collection for cross-domain pre-training. It employs an off-the-shelf random policy across multiple domains for pre-training dataset collection, which dramatically reduces the pre-training burden. Specifically, it chooses a simple uniform distribution to sample actions for environment interaction and data (image observations) collection. Data from different domains is saved into different data buffers, forming the cross-domain pre-training dataset, which is collected once and used permanently. This cross-domain pre-training dataset is then used for SSL.

In terms of the ability to explore far-reaching states, the uniform distribution policy is not as good as active exploration. However, for visual-control pre-training across multiple domains, our decoupled random collection provides better and qualifier data than active exploration due to the following reasons: (i) The above chicken-and-egg conflict makes it hard to train qualified exploration policies for active exploration, while our off-the-shelf uniform distribution policy is not affected. (ii) The difficulty of seeking far-reaching states in common motor control is much lower than that in a ’maze’. The uniform distribution policy may have trouble finding the far-reaching maze end but can drive a cheetah robot to exhibit different behaviors. (iii) A single exploration agent may explore very deeply, but it is hard to explore widely like a random policy. For example, a uniform distribution policy can manipulate a robotic arm to extend in all directions due to its randomness, but a single exploration agent may deeply explore only one direction. Here, a wider exploration is more important because more diverse data can help the encoder better understand motion changes. In conclusion, decoupled random collection can sample a qualified dataset (see visualization in Section IV.G). In addition, due to the changing exploration agent, active pre-training methods train their encoders on a changing dataset, which is unstable. By contrast, our approach decouples data sampling from encoder learning, achieving data collection before pre-training, which enables a steady learning process. Therefore, the proposed decoupled random is the better cross-domain data collection method due to (i) a qualifier pre-training dataset and (ii) a more stable learning process.

In summary, employing decoupled random collection brings the following advantages compared with recent advanced active pre-training methods: (i) CRPTpro has an unparalleled efficiency advantage in the pre-training stage. It employs an off-the-shelf data collection policy and doesn’t need to spend time on extra RL for exploration agents. (ii) CRPTpro achieves better pre-training performance, especially when pre-training efficiency is required. Recent advanced active pre-training suffers from the severe chicken-and-egg problem mentioned above, while CRPTpro enables robust and steady learning on a qualified cross-domain pre-training dataset.

III-B Efficient Prototypical Learning

CRPTpro learns a cross-domain visual encoder and several basic vectors called prototypes over the cross-domain pre-training dataset produced by decoupled random collection. As illustrated in Fig. 2, CRPTpro selects data buffers from different domains sequentially and cyclically to achieve the pre-training of one generic encoder. Concretely, in each step of the training, one data buffer is chosen to update the encoder only once, while another one will be chosen in the next step. This prevents the encoder from favoring the latest domain.

Over the chosen buffer, CRPTpro employs efficient prototypical learning, a novel prototypical self-supervised algorithm. Following previous prototypical algorithms [24, 14, 25, 26], it sets several trainable prototypes as the cluster centers in the latent space, comparing observations projected onto prototypes with their cluster assignment targets. In addition to the comparison, a novel intrinsic loss is proposed to facilitate the diffusion of prototypes in the latent space, further improving the pre-training performance, seen in Fig. 3.

The comparative loss is calculated as the following. First of all, M frames $\{x_{t_{i}}\}_{i=1}^{M}$ and their next frames $\{x_{t_{i}+1}\}_{i=1}^{M}$ are sampled randomly from the data buffer. The subscript represents the timing number. The current frames $\{x_{t_{i}}\}_{i=1}^{M}$ are used to predict the cluster assignment targets computed over the next frames $\{x_{t_{i}+1}\}_{i=1}^{M}$ and M trainable vectors $\{c_{1},...,c_{M}\}$ called prototypes. $x_{t_{i}}$ is one of the current frames. It undergoes augmentation by random image shifts, encoding by the generic convolutional encoder $f_{\theta}$ , and projection by the Multi-Layer Perceptron (MLP) $g_{\theta}$ in turn to produce a vector $z_{t_{i}}$ in the latent space where prototypes exist. Here, another MLP $v_{\psi}$ is used to predict $z_{t_{i}}$ into $u_{t_{i}}$ in order to avoid collapse to trivial solutions. $v_{\psi}$ doesn’t change the dimension of input, so both $z_{t_{i}}$ and $u_{t_{i}}$ have the same dimension as prototypes. We then take a softmax over the dot products of $u_{t_{i}}$ and all the prototypes $\{c_{j}\}_{j=1}^{M}$ :

\left(p_{t_{i}}^{(1)},...p_{t_{i}}^{(M)}\right)=\rm{softmax}\it\left(\frac{\hat{u}_{t_{i}}^{T}\hat{c}_{1}}{\tau},...,\frac{\hat{u}_{t_{i}}^{T}\hat{c}_{M}}{\tau}\right),

(1)

where $p_{t_{i}}$ is the probabilities that $x_{t_{i}}$ maps to the prototypes for comparison, ${\tau}$ is a temperature hyper-parameter, and the hats on $u_{t_{i}}$ and $\{c_{j}\}_{j=1}^{M}$ denotes the $l_{2}$ -normalization. The parameters $\theta$ , $\psi$ , and all prototypes are trainable and updated simultaneously when minimizing the comparative loss.

To obtain the target compared with $p_{t_{i}}$ , all next frames $\{x_{t_{i}+1}\}_{i=1}^{M}$ undergo augmentation by random image shifts, encoding by target encoder $f_{\bar{\theta}}$ , and projection by target MLP $g_{\bar{\theta}}$ sequentially to produce latent embeddings $\{z_{t_{i}+1}\}_{i=1}^{M}$ analogously. To avoid collapse to trivial solutions, parameters $\bar{\theta}$ of these target networks are updated using the Exponential Moving Average (EMA) [44] of $\theta$ :

\bar{\theta}\xleftarrow{}(1-\eta)\bar{\theta}+\eta\theta.

(2)

Now we can apply the Sinkhorn-Knopp algorithm [32] on $l_{2}$ -normalized embeddings $\{\hat{z}_{t_{i}+1}\}_{i=1}^{M}$ and prototypes $\{\hat{c}_{j}\}_{j=1}^{M}$ . Concretely, the algorithm begins with the square matrix $C$ , whose elements are computed by the dot product over each embedding and prototype:

C_{ij}=\hat{z}_{t_{i}+1}\hat{c}_{j}.

(3)

Then it employs iterative doubly-normalization on the matrix $C$ to obtain target matrix $T$ , constraining every column and row to have the same sum with as little change of original $C$ as possible. The row normalization and column normalization are used in doubly-normalization. The row normalization is formulated as the following:

{\rm row}(C)=\frac{1}{M}C\cdot{\rm diag}(\frac{1}{{\rm sum}(C,1)}),

(4)

where ${\rm sum}(C,1)$ denotes the row addition and $\rm diag(\cdot)$ denotes the diagonalization of a matrix. Similarly, the column normalization is defined as the following:

{\rm col}(C)=\frac{1}{M}{\rm diag}(\frac{1}{{\rm sum}(C,0)})\cdot C,

(5)

where ${\rm sum}(C,0)$ denotes the column addition. The doubly-normalization, $\rm dou(\cdot)$ , consists of one row normalization and one column normalization:

\rm dou\it(C)=\rm col\it(\rm row\it(C)).

(6)

Three times of doubly-normalization is applied on $C$ to obtain target matrix $T$ . The $i$ -th row of $T$ is the cluster assignment target $q_{t_{i}+1}$ of the frame $x_{t_{i}}$ . Combined with the probabilities $p_{t_{i}}$ computed by Eq.(1), the comparative loss over all frames is calculated as the following:

\mathcal{L}_{comp}=-\frac{1}{M}\sum_{i=1}^{M}q_{t_{i}+1}^{T}\rm log\it p_{t_{i}}.

(7)

During the training process, the prototypes serve as cluster centers for the selected samples. They are randomly initialized at the beginning and then gradually expand their coverage to the visited states in the latent space, which is done through the traction of samples.

Algorithm 1 Full pseudo-code of CRPTpro.

Require:

N

reward-free domains

D_{i}

i\in[1,...N]

N

data buffers

B_{i},i\in[1,...N]

K

downstream task MDPs

\mathcal{M}_{k}

k\in[1,...K]

, buffer size

I_{s}

, pre-training update times

I_{p}

, (optional) finetuning update times

I_{f}

, number of downstream RL steps

I_{d}

Randomly Initialize: generic visual encoder

f_{\theta}

, projector

g_{\theta}

, predictor

v_{\theta}

, target encoder

f_{\bar{\theta}}

, target projector

g_{\bar{\theta}}

and M prototypes

\{c_{j}\}_{j=1}^{M}

###decoupled random collection

for

index=1,...,I_{s}

for

D_{i}\in[D_{1},...D_{N}]

Take random action for

D_{i}

by uniform distribution.

Interact with

D_{i}

to get observation

x_{index}^{i}

Save

x_{index}^{i}

into data buffer

B_{i}

end for

###efficient prototypical learning

for

index=1,...,I_{p}

Choose data buffer

B_{o}

, where

o=index\%N

Randomly sample

M

frames and next frames from

B_{c}

Compute the self-supervised loss

\mathcal{L}_{SSL}

in Eq.(10).

Update

f_{\theta}

g_{\theta}

v_{\theta}

and

\{c_{j}\}_{j=1}^{M}

by backpropagation.

Update

f_{\bar{\theta}}

and

g_{\bar{\theta}}

by EMA shown in Eq.(2).

end for

###(optional) single-domain finetuning

Choose data buffer

B_{f}

corresponding to target domain. If the domain is unseen, employ decoupled random collection to produce the data buffer.

for

index=1,...,I_{f}

Randomly sample

M

frames and next frames from

B_{f}

Compute the self-supervised loss

\mathcal{L}_{SSL}

in Eq.(10).

Update

f_{\theta}

g_{\theta}

v_{\theta}

and

\{c_{j}\}_{j=1}^{M}

by backpropagation.

Update

f_{\bar{\theta}}

and

g_{\bar{\theta}}

by EMA shown in Eq.(2).

end for

###downstream RL in multiple domains

for

\mathcal{M}_{k}\in[\mathcal{M}_{1},...\mathcal{M}_{K}]

freeze the pre-trained

f_{\theta}

and

\{c_{j}\}_{j=1}^{M}

Use

f_{\theta}

to encode the original observation space.

Use

\{c_{j}\}_{j=1}^{M}

for reward augmentation by Eq.(11).

Perform

I_{d}

steps RL by RAD-SAC over novel MDP.

end for

Intuitively, facilitating the diffusion and coverage of prototypes can (i) accelerate the training process and (ii) make prototypes wider cluster centers, leading to a wider and more distinguishable latent space, i.e., a stronger encoder. To this end, we design a novel intrinsic loss aiming to accelerate the diffusion of the prototypes, which is done by increasing the difference between prototypes. Specifically, the difference between two $l_{2}$ -normalized prototypes (unit vectors) can be well measured by their cosine similarity:

{\rm dif}(\hat{c}_{j},\hat{c}_{k})\propto-{\rm sim}(\hat{c}_{j},\hat{c}_{k})=-\hat{c}_{j}^{T}\hat{c}_{k},

(8)

where the hats over $c_{j}$ and $c_{k}$ mean $l_{2}$ -normalization. Our intrinsic loss is based on this cosine similarity:

\mathcal{L}_{intr}=\sum_{j=1}^{M}\sum_{k=1,k\neq j}^{M}\frac{\rm det\it(\hat{c}_{j}^{T})\cdot\hat{c}_{k}}{\rm det\it(\hat{c}_{j}^{T}\hat{c}_{k})+w},w>1,

(9)

where $\rm det(\cdot)$ denotes the detach operation which prevents the gradient from backpropagation. $w$ is a weight hyper-parameter. It is employed to balance different update speeds of different prototypes. In practice, in the case of the dot product as the loss, a remote prototype is updated slower than a near prototype after $l_{2}$ -normalization. The denominator defined as the sum of $\hat{c}_{j}^{T}\hat{c}_{k}$ and $w$ increases the gradient of the remote prototypes, further accelerating the diffusion of prototypes.

With the comparative loss $\mathcal{L}_{comp}$ computed by Eq.(7) and the intrinsic loss $\mathcal{L}_{intr}$ computed by Eq.(9), the overall self-supervised loss $\mathcal{L}_{SSL}$ of efficient prototypical learning in practice is:

\mathcal{L}_{SSL}=\mathcal{L}_{comp}+\alpha\mathcal{L}_{intr},

(10)

where $\alpha$ is a coefficient scaling the intrinsic reward. Significantly, CRPTpro updates the parameters $\theta$ , $\psi$ and prototypes $\{c_{j}\}_{j=1}^{M}$ during the pre-training. The cross-domain encoder $f_{\theta}$ and prototypes $\{c_{j}\}_{j=1}^{M}$ are then used to conduct efficient downstream RL in different domains.

III-C Downstream RL in Multiple Domains

After pre-training, the encoder can be frozen and used to perform efficient downstream policy learning on challenging visual-control tasks from different domains either seen or unseen, with the help of frozen prototypes. Specifically, the encoder maps the state space $\mathcal{X}$ into the embedding space $\mathcal{Y}$ , converting the old MDP $(\mathcal{X},\mathcal{A},\mathcal{P},\mathcal{R},\gamma,d_{0})$ into $(\mathcal{Y},\mathcal{A},\mathcal{P},\mathcal{R},\gamma,d_{0})$ . Correspondingly, the transition $(x_{t_{i}},a_{t_{i}},r_{t_{i}},x_{t_{i}+1})$ is converted into $(y_{t_{i}},a_{t_{i}},r_{t_{i}},y_{t_{i}+1})$ . Following [14], we augment the extrinsic reward $r_{t_{i}}$ by adding an exploration reward $\hat{r}_{t_{i}}$ based on prototypes to encourage exploration. $\hat{r}_{t_{i}}$ is a particle-based entropy estimation [45] which is positively correlated with the exploration ability of the current strategy:

\hat{r}_{t_{i}}=||z_{t_{i}+1}-z_{t_{i}+1}^{k\rm{NN}(Q)}||,

(11)

where $z_{t_{i}+1}$ is the projection of $y_{t_{i}+1}$ in the latent space, as mentioned in section III.B. $z_{t_{i}+1}^{k\rm{NN}(Q)}$ means the k-nearest neighbor of $z_{t_{i}+1}$ in the set $Q$ . $Q$ is a projection set defined as $\{z_{t_{l}+1}\}$ , where $z_{t_{l}+1}$ is chosen from the off-policy replay buffer by each prototype, according to their dot products. We refer the readers to [14] for the detailed description of why and how $r_{t_{i}}$ works. With reward augmentation, the transition $(y_{t_{i}},a_{t_{i}},r_{t_{i}},y_{t_{i}+1})$ is converted into $(y_{t_{i}},a_{t_{i}},r_{t_{i}}+\beta\hat{r}_{t_{i}},y_{t_{i}+1})$ , on which we employ RAD-SAC[3, 46] with the augmentation of random shift [47] as the RL algorithm to learn a control policy.

In the cross-domain setting, we don’t employ finetuning to maintain cross-domain versatility. However, we note that finetuning the cross-domain encoder on a single target domain is able to further improve its performance but reduce its cross-domain versatility, especially in the unseen domains. After finetuning, the encoder focuses more on a certain domain and reduces its versatility, which means CRPTpro-finetuning can be regarded as a single-domain pre-training method. We provide the full pseudo-code of CRPTpro in Algorithm 1.

TABLE I: 8 challenging and diverse visual-control domains are divided into 4 different groups. The experiments of cross-domain pre-training are conducted in groups. For example, experiments on Group-A denote pre-training encoders over domains of Cheetah, Walker, and Quadruped.

Group	$1^{st}$ domain	$2^{nd}$ domain	$3^{rd}$ domain
Group-A	Cheetah	Walker	Quadruped
Group-B	Cartpole	Pendulum	Manipulation
Group-C	Manipulation	Hopper	Finger
Group-D	Walker	Pendulum	Hopper

TABLE II: This table intuitively shows that the selected 4 groups cover 8 domains as evenly as possible. We use the first three letters of the domain name as an abbreviation, e.g., Che denotes Cheetah.

Group	Che	Wal	Qua	Car	Pen	Man	Hop	Fin
Group-A	✓	✓	✓
Group-B				✓	✓	✓
Group-C						✓	✓	✓
Group-D		✓			✓		✓

IV Experiments

In this section, we verify the superiority of our CRPTpro across (i) cross-domain downstream RL performance in section IV.B, (ii) visual generalization in section IV.C, (iii) single-domain finetuning in section IV.D, and (iv) pre-training efficiency in section IV.E. In addition, extensive analysis is also provided, including (i) numerical ablation study in section IV.F, (ii) effect and reasons of efficient prototypical learning in section IV.F, and (iii) visualized analysis of decoupled random collection in section IV.G.

IV-A Setup

Environment Details

We evaluate CRPTpro on DMControl [15], a representative benchmark containing many and different types of continuous visual-control tasks. It is proven and widely considered to be the most challenging motor-control benchmark for unsupervised RL [48, 41, 22]. According to the recent influential works [37, 13, 14, 47], we select 8 popular and challenging tasks defined in 8 different domains: Cheetah-Run, Walker-Run, Quadruped-Run, Cartpole-Swingup sparse, Pendulum-Swingup, Manipulation-Reach duplo, Hopper-Hop and Finger-Spin. These tasks cover different fields including balance control problems, multi-joint robot locomotion, robotic arm manipulation and so on.

To the best of our knowledge, we are the first to primarily concentrate on cross-domain pre-training on challenging DMControl benchmark, whereas previous related works [37, 13] only conduct a small number of exploratory experiments. Therefore, we divide the selected 8 domains into groups for cross-domain experiments by ourselves. They are further divided into 4 domain groups for cross-domain pre-training, as shown in Table I. Under this division, the selected 4 groups cover the 8 domains as evenly as possible (shown by a more intuitive Table II). Moreover, different groups also have different characteristics (For example, Group-A contains only multi-joint robot domains while Group-C contains 3 different types of domains).

Following prior works, visual observations are represented as 84 $\times$ 84 $\times$ 3 pixel rendering and 3 consecutive previous observations are stacked to form the 84 $\times$ 84 $\times$ 9 state as input. Action repeat is set to 2 across all tasks. The episode length is set to 1000 for all tasks except Manipulation-Reach duplo, with 250 episode length.

TABLE III: Performance comparison on cross-domain pre-training. For all cross-domain methods, encoders are pre-trained on 3 domains contained in each group and then frozen to conduct downstream RL on 3 different downstream tasks defined in 3 seen domains. DrQ serves as the end-to-end expert method to demonstrate the score levels of different tasks for mean expert-normalized score calculation. CRPTpro significantly outperforms all cross-domain pre-training baselines, achieving considerable results (outperforming DrQ by 95.6% ) across all downstream tasks.

Task	Cross-domain pre-training methods					End-to-end expert
Task	CRPTpro(ours)	ATC[37]	APT(C)[13]	Proto-RL(C)[14]	BeCL[23]	DrQ[47]
(a) Cross-domain methods pre-train encoders on Group-A.
Cheetah-Run	611±44	300±29	284±8	421±82	287±222	807±75
Walker-Run	513±33	271±16	312±37	431±83	130±79	485±181
Quadruped-Run	242±78	145±73	201±95	191±80	343±45	139±64
(b) Cross-domain methods pre-train encoders on Group-B.
Cartpole-Swingup sparse	722±51	0±0	20±23	693±43	529±397	315±243
Pendulum-Swingup	865±26	524±341	182±98	868±35	21±9	635±218
Manipulation-Reach duplo	163±22	7±6	5±4	47±31	80±20	27±11
(c) Cross-domain methods pre-train encoders on Group-C.
Manipulation-Reach duplo	146±31	8±11	10±10	37±31	93±24	27±11
Hopper-Hop	206±4	2±1	4±7	160±30	3±1	283±32
Finger-Spin	873±157	883±125	954±23	858±182	336±425	938±103
(d) Cross-domain methods pre-train encoders on Group-D.
Walker-Run	509±60	200±16	215±5	433±73	151±96	485±181
Pendulum-Swingup	875±25	357±249	86±115	513±294	22±8	635±218
Hopper-Hop	210±12	3±3	22±25	146±32	2±1	283±32
Overall evaluation across all 12 downstream tasks of 4 groups.
Mean Score	495	225	191	400	166	-
Mean Expert-Normalized Score	1.956	0.440	0.412	1.096	0.993	-

Implementation of CRPTpro

The neural networks in CRPTpro all use the same architecture from [14]. 512 prototypes are learned by CRPTpro, each parameterized as a 128-dimensional vector. During pre-training, the encoder is updated a total of 50k times across 3 domains. In downstream RL, 500k steps are allowed. All hyper-parameters of RAD-SAC[3, 46] are the same as [14] except the RL reply buffer size changed from 100k to 40k. In both pre-training and downstream RL, Adam [49] is chosen as optimizer with learning rate of 1e-4 and mini-batch size of 512. Results of CRPTpro are over at least 6 different evaluations (60 episodes). Table VII provides the full settings of hyper-parameters in Appendix.

Baselines

4 cross-domain pre-training baselines: APT(C) [13], Proto-RL(C) [14], ATC [37] and BeCL [23], 2 single-domain pre-training baselines: APT(S) [13] and Proto-RL(S) [14], and 1 end-to-end image-based RL baseline: DrQ [47] are selected through the experiment section. They are all the most recent and advanced methods, where Proto-RL(C) and Proto-RL(S) are respectively state-of-the-art cross-domain pre-training and single-domain pre-training. In particular, Proto-RL(S) is one of state-of-the-art image-based RL methods on DMControl. DrQ is a popular end-to-end method which can be employed as an expert method to show the score level of different tasks.

APT [13] learns a representation through contrastive learning by actively searching for novel states in reward-free environments. It designs a novel task-agnostic reward based on particle-based entropy maximization and trains an exploration agent over the reward to sample pre-training data. Over the sampled data, APT employs SimCLR [34] to achieve self-supervised encoder pre-training. It is proposed for both cross-domain pre-training and single-domain pre-training, marked as APT(C) and APT(S). Proto-RL [14] uses prototypes[24] to enhance both task-agnostic exploration and representation learning. It employs particle-based entropy maximization to train an exploration agent for data collection like APT and uses prototypes to select candidate particles better. The prototypes and visual encoder are pre-trained simultaneously by prototypical representation learning in the pre-training stage. It is the best single-domain pre-training method and a state-of-the-art visual RL algorithm on DMControl. It is marked as Proto-RL(S) in single-domain pre-training while Proto-RL(C) in cross-domain pre-training. ATC [37] proposed an unsupervised task tailored to reinforcement learning. It requires a model to associate observations from nearby time steps within the same trajectory. Note that original ATC uses the task-specific expert dataset in pre-training and doesn’t provide task-agnostic data collection method. Therefore, (i) we employ our decoupled random collection for ATC to compare its unsupervised task with our efficient prototypical learning and (ii) ATC is not included in the pre-training efficiency comparison in Section IV.F. BeCL [23] is a recent state-of-the-art unsupervised skill discovery method. It employs contrastive learning [34] to maximize a novel mutual information objective between observations. Unsupervised skill discovery aims to learn task-agnostic exploration, which makes it suitable for unsupervised pre-training. DrQ [47] is a powerful and popular end-to-end DRL algorithm augmenting Q-function based on SAC [46].

For cross-domain pre-training baselines, the encoder is pre-trained 50k update times (200k task-agnostic steps RL for active pre-training methods) in 3 domains and performs 500k steps downstream RL on each task like CRPTpro. For non-cross-domain baselines, we follow the settings of[14]: 500k steps (i.e., 125k update times) task-agnostic active pre-training in one domain and 500k steps downstream RL on each task, for APT(S) and Proto-RL(S); 1M steps task-specific end-to-end RL on each task for DrQ.

IV-B Cross-domain Pre-training

We compare CRPTpro with 4 cross-domain pre-training baselines over 4 groups defined in Table I. CRPTpro hyper-parameter settings are shown in Table VII. DrQ is set as an end-to-end expert to show the score level of different downstream tasks and calculate an normalized score. Results are summarized in Table III. CRPTpro significantly improves APT(C), ATC and BeCL on 11/12 tasks. Compared with Proto-RL(C), CRPTpro achieves better downstream policy learning on 11/12 tasks and similar performance on the rest task. Numerically, we calculate the mean score and the mean expert-normalized score over all 12 downstream tasks of 4 groups, where our CRPTpro improves upon the next best cross-domain pre-training baseline by 23.8% and 78.5% respectively. Perhaps the most exciting result is that CRPTpro achieves a 1.956 mean expert-normalized score, demonstrating the huge advantages of cross-domain visual pre-training for image-based RL, which is ignored by previous works.

In summary, our CRPTpro significantly beats all baselines, becoming a novel state-of-the-art cross-domain pre-training method. This is attributed to the following reasons: First, our decoupled random collection avoids the severe chicken-and-egg problem between exploration agent training and visual encoder pre-training (detailed description in section III.A), generating a qualified and diverse cross-domain pre-training dataset and enabling stable pre-training process. Second, our efficient prototypical learning helps the encoder learn more effective image embeddings.

TABLE IV: Performance comparison of cross-domain methods generalizing to unseen domains and unseen colors without finetuning. Pre-trained encoders are frozen and used to conduct 500k steps RL on unseen downstream tasks defined in unseen domains (or domains with unseen colors). CRPTpro overall outperforms all cross-domain baselines.

(i) Encoders are pre-trained on Group-A and tested on 5 tasks from 5 unseen domains.
Task	CRPTpro(ours)	ATC [37]	APT(C) [13]	Proto-RL(C) [14]	BeCL [23]
Pendulum-Swingup	671±273	214±159	502±346	118±75	23±8
Finger-Spin	809±232	879±133	748±47	850±194	3±1
Hopper-Hop	187±19	23±27	36±31	138±47	2±1
Manipulation-Reach duplo	62±18	10±13	18±19	20±7	64±37
Cartpole-Swingup sparse	93±17	20±12	15±12	39±24	0±0
(ii) Encoders are pre-trained on Group-C and tested on 5 tasks from 5 unseen domains.
Cheetah-Run	448±12	241±10	283±31	374±10	350±223
Walker-Run	570±26	268±8	231±33	497±75	139±67
Quadruped-Run	270±32	132±66	114±63	234±40	322±123
Pendulum-Swingup	872±27	135±102	778±106	524±370	23±10
Cartpole-Swingup sparse	682±98	0±1	9±10	94±79	0±0
(iii) Encoders are pre-trained on Group-C and tested on 3 tasks with unseen background color.
Manipulation-Reach duplo (unseen color)	163±16	34±20	15±9	55±52	5±6
Pendulum-Swingup (unseen color)	567±317	136±101	556±104	126±37	21±9
Cheetah-Run (unseen color)	485±26	184±18	174±21	337±36	1±1

TABLE V: Comparison between CRPTpro and five pre-training algorithms on pre-training efficiency. Outcomes come from experiments conducted on the NVIDIA Tesla P100. CRPTpro is the most efficient pre-training algorithm.

CRPTpro

(ours)

Proto-RL

(C)

BeCL

APT

(C)

Proto-RL

(S)

APT

(S)

Wall- clock time

1 domain

4.8h

8.8h

9.4h

12.2h

18.8h

20.6h

2 domains

4.8h

8.8h

9.4h

12.2h

37.6h

41.2h

3 domains

4.8h

8.8h

9.4h

12.2h

56.4h

61.8h

Update times

1 domain

50k

125k

2 domains

50k

250k

3 domains

50k

375k

IV-C Generalization in Unseen Domains

In addition to the seen domains, the encoder pre-trained by CRPTpro can generalize well to unseen domains without finetuning. We conduct three sets of representative experiments to show the generalization comparison of all cross-domain methods. (i) We pre-train encoders on Group-A which contains only robot domains. Then we freeze the pre-trained encoders to directly perform 500k steps downstream RL on the rest tasks defined in five unseen and different types of domains: manipulation (Manipulation), robot (Hopper), and balance control (Pendulum, Cartpole, and Finger). (ii) We pre-train encoders on Group-C which contains three different types of domains: manipulation (Manipulation), robot (Hopper) and balance control (Finger), and then freeze the encoder to perform 500k steps RL on the other five unseen tasks. (iii) For encoders pre-trained on Group-C, we further change the background color (by exchanging the RGB value in reverse order) of three different types of control tasks from both the seen domain (Manipulation) and unseen domains (Cheetah and Pendulum), testing the generalization to unseen background colors. CRPTpro hyper-parameters are provided in Table VII.

The results are shown in Table IV. CRPTpro overall exceeds all cross-domain baselines across all three sets of experiments. In some unseen tasks (e.g., Pendulum-Swingup in (i) and Walker-Run in (ii)), CRPTpro even enables better policy learning than the end-to-end expert DrQ. Note that CRPTpro has not trained its encoder in these unseen domains, while DrQ trains its encoder for 1M steps on each task. This indicates that the encoder of CRPTpro can well capture the movement changes that are generic across domains. Moreover, CRPTpro is the only method that doesn’t exhibit performance degradation on Cheetah-Run when facing unseen color, which means it can effectively filter our useless information irrelevant to movements. We mainly attribute these movement understanding advantages to the proposed decoupled random collection because we note that CRPTpro on Group-C (containing three different types of domains) performs much better than CRPTpro on Group-A (containing only robot domains). For example, CRPTpro on Group-C can effectively solve all unseen tasks, including Cartpole-Swingup sparse, while CRPTpro on Group-A can not. Even on Walker-Run which is included in Group-A but not in Group-C, CRPTpro on Group-C performs better than CRPTpro on Group-A (result in Table III). This phenomenon shows that data diversity is the key to pre-training performance, which is also verified in the ablation study (Section IV.F). The more diverse dataset sampled by decoupled random collection enables CRPTpro to better understand generic movement changes than all baselines.

IV-D Finetuning

As mentioned in section III.C, the cross-domain encoder obtained by CRPTpro can be finetuned on a target domain for better downstream policy learning, especially in unseen domains. To highlight this ability, we finetune the encoder pre-trained on Group-A by CRPTpro in five different domains (three unseen domains: Pendulum, Manipulation, and Cartpole & two seen domains: Cheetah and Walker) respectively with default hyper-parameters in Table VII. Then, we freeze the finetuned encoder to conduct 500k steps RL on the corresponding downstream task. CRPTpro-finetuning could be regarded as a single-domain pre-training method and we compare it with three non-cross-domain baselines.

The results are shown in Fig. 4, demonstrating that CRPTpro-finetuning is an effective single-domain pre-training method. It is comparable with Proto-RL(S), the best single-domain pre-training method and also one of state-of-the-art image-based RL methods. In addition, CRPTpro-finetuning uses only 58.5k update times on encoder (50k pre-training update times on Group-A, 2k finetuning update times in Pendulum, 2k finetuning update times in Manipulation, 3k finetuning update times in Cartpole, 1k finetuning update times in Cheetah, and 0.5k finetuning update times in Walker) to surpass most non-cross-domain baselines. As a comparison, the total pre-training update times for Proto-RL(S) and APT(S) in 5 domains is 625k. In each unseen domain, CRPTpro only updates its encoder at most 3k times while the baselines update their encoders at least 125k times. This indicates that the cross-domain prior knowledge learned with our novel self-supervised algorithm can greatly promote representation learning even in unseen domains.

TABLE VI: Employing efficient prototypical learning facilitates the diffusion of prototypes and makes them cover wider after pre-training.

(

\times 10^{-4}

)

CRPTpro

w/o

\mathcal{L}_{intr}

Proto-RL

(C)

\mathcal{L}_{intr}

Proto-RL

(C)

Proto-RL

(S)

\mathcal{L}_{intr}

Proto-RL

(S)

KNE

\downarrow

-26.545

-10.871

-38.602

-7.701

-80.131

-44.036

ANE

\downarrow

7.996

23.778

21.635

51.280

20.587

48.774

\mathcal{L}_{intr}

-31.080

-20.505

-50.294

-26.860

-140.213

-80.839

IV-E Pre-training Efficiency

In this section, we demonstrate the pre-training efficiency comparison between different pre-training methods when facing different numbers of domains. Three cross-domain pre-training methods, including the current state-of-the-art cross-domain pre-training method, Proto-RL(C), and two single-domain pre-training methods, including the current state-of-the-art single-domain pre-training method, Proto-RL(S), are employed for comparison with our CRPTpro. ATC is not in comparison because it doesn’t provide a task-agnostic pre-training approach in their paper (see baseline details in Section IV.A.c). The outcomes are based on experiments conducted on the NVIDIA Tesla P100.

The results are shown in Table V. CRPTpro becomes a novel state-of-the-art cross-domain pre-training method with greatly improved pre-training efficiency. Concretely, it spends only 54% wall-clock pre-training time to outperform the next best Proto-RL(C) by 78.5% on the mean expert-normalized score. The main contributor is our decoupled random collection, which employs an off-the-shelf exploration policy to avoid the severe training burden of extra RL. Compared with the state-of-the-art single-domain pre-training method (Proto-RL(S)), all the cross-domain pre-training methods exhibit huge efficiency improvements which multiply with the number of domains. This is because cross-domain pre-training methods can pre-train a generic encoder across multiple domains, while Proto-RL(S) cannot. However, the downstream policy performance of all cross-domain baselines is much worse than that of Proto-RL(S), which makes the efficiency comparison between them meaningless. In contrast, CRPTpro can achieve competitive downstream policy learning compared with Proto-RL(S) after few-shot finetuning, with much less pre-training consumption (only 9.3% update times, see Section IV.D).

In downstream RL, CRPTpro augments the extrinsic reward with a k-NN-based unsupervised reward. According to our experimental tests on the NVIDIA Tesla P100, the consumption of this reward is tiny, about 1‰ of the downstream RL consumption. This is due to (i) the small size of the k-NN buffer and (ii) the low dimension of the k-NN particles, which we detail in Section III.C. In addition, the current state-of-the-art Proto-RL(C) and Proto-RL(S) also utilize this reward. Therefore, it doesn’t affect the efficiency advantage of our approach.

IV-F Ablations & Analysis of Efficient Prototypical Learning

CRPTpro employs the proposed efficient prototypical learning, which introduces a novel $\mathcal{L}_{intr}$ to facilitate the diffusion of prototypes and improve pre-training. In this section, we demonstrate its effectiveness and rationality through three different sets of experiments

First, we prove its numerical effectiveness in three different prototypical pre-training settings. Ablation study is also done here. (i) We ablate the $\mathcal{L}_{intr}$ from CRPTpro with default hyper-parameter settings to observe the difference, shown in Fig. 5a. (ii) We add $\mathcal{L}_{intr}$ into Proto-RL(C) with default hyper-parameter settings to show its effectiveness in Fig. 5a. (iii) We add $\mathcal{L}_{intr}$ into Proto-RL(S) of 200k steps task-agnostic pre-training and 500k steps downstream RL. The settings are all shown in Table VII except $\alpha$ set to $1e-3$ for Cheetah-Run and $5e-4$ for Hopper-Hop. The curves are shown in Fig. 5b.

The numerical results verify the effectiveness of $\mathcal{L}_{intr}$ in efficient prototypical learning across all prototypical pre-training settings. Since the Proto-RL(C) w/ $\mathcal{L}_{intr}$ is actually CRPTpro w/o decoupled random collection, Fig. 5a can also serve as the ablation study, verifying the effectiveness of both the decoupled random collection and efficient prototypical learning in CRPTpro.

Second, we show that the coverage and diffusion of prototypes are facilitated when employing efficient prototypical learning. Two metrics are used to evaluate prototypes’ coverage: All-Neighbor Estimation (ANE) proportional to the cosine similarity between all prototype pairs; K-Neighbor Estimation (KNE) proportional to k-nearest neighbor’s cosine similarity of all prototypes. We test the pre-trained models’ prototypes in 3 different prototypical pre-training settings. Results in Table VI show that $\mathcal{L}_{intr}$ makes prototypes finally cover wider in the latent space. In addition, we show the curve of metric difference between Proto-RL(S) and Proto-RL(S) w/ $\mathcal{L}_{intr}$ during pre-training in Fig. 6a. This indicates that the $\mathcal{L}_{intr}$ continues to function throughout the training process of efficient prototypical learning, demonstrating that $\mathcal{L}_{intr}$ makes prototypes cover faster in the self-supervised pre-training. In summary, our efficient prototypical learning indeed improves the diffusion and coverage of prototypes, leading to better encoder pre-training and the final performance.

Finally, we observe that optimizing $\mathcal{L}_{intr}$ is an instinctual process in vanilla prototypical algorithms. We show the $\mathcal{L}_{intr}$ curve in the single-domain active pre-training setting, as shown in Fig. 6b. We find that even in Proto-RL(S), where $\mathcal{L}_{intr}$ is not employed, this loss is also instinctively decreased, which means reducing $\mathcal{L}_{intr}$ is an instinctual process in vanilla prototypical representation learning. This phenomenon motivates us to optimize $\mathcal{L}_{intr}$ in a targeted manner to facilitate the training and accelerate the convergence.

IV-G Visualization of Decoupled Random Collection

To further analyze the effectiveness of our decoupled random collection and pre-trained encoder, we employ PCA to visualize two pre-training datasets: the cross-domain pre-training dataset produced by decoupled random collection (CRPTpro) and the single-domain exploration dataset produced by Proto-RL(S) [14]. Note that Proto-RL(S) is a state-of-the-art image-based RL method, providing the state-of-the-art unsupervised exploration. We don’t visualize Proto-RL(C) because its performance is much less than CRPTpro and Proto-RL(S). For cross-domain dataset produced by decoupled random collection (CRPTpro), we sample 125 frames from each domain. For Proto-RL(S), we conduct adequate single-domain pre-training and then sample 125 frames with the pre-trained exploration. All the sampled frames are encoded by corresponding CRPTpro encoder and then reduced dimensionality to 4 by PCA (4-dimensional PCA results retain more than half of the original information here).

The results are shown in Fig. 7. In a single domain, the decoupled random collection in CRPTpro (orange) is not as exploratory as the exploration agents in Proto-RL(S) (blue), but not a lot worse. It also explores some observations that Proto-RL(S) cannot reach. Moreover, the cross-domain pre-training dataset produced by CRPTpro (orange, green and red) overall exhibits wider coverage than the single-domain exploration dataset produced by Proto-RL(S) (blue) because of the cross-domain diversity. In addition, we observe that our encoder can ideally handle unseen data sampled by exploration agents and distinguish them clearly, showing the powerful generalization. These explain why CRPTpro can pre-train a generic encoder conducting efficient downstream policy learning in multiple domains, even reaching the same level as the best active pre-training method (also a state-of-the-art image-based RL method), Proto-RL(S), after few-shot finetuning.

V Conclusion

In this paper, we present CRPTpro, a novel cross-domain RL pre-training framework enabling efficient downstream policy learning on sets of visual-control tasks defined in different domains. It improves current state-of-the-art cross-domain pre-training algorithm by a large margin in downstream policy learning performance, with greatly improved pre-training efficiency. The proposed decoupled random collection employs a simple and off-the-shelf random policy to produce a qualified pre-training dataset of cross-domain diversity, successfully circumventing the chicken-and-egg problem (conflict between exploration policy training and representation learning) which plagues current advanced active pre-training. We hope this could inspire the use of cross-domain diversity in various RL settings (e.g., designing cross-domain auxiliary tasks) and the rethinking of the random policy in challenging motor control. In addition, we hope the proposed efficient prototypical learning can encourage further research on prototypical algorithms in self-supervised RL.

TABLE VII: Hyper-parameter settings in CRPTpro.

Parameter	Setting
Convolution channels	$(32,32,32,32)$
Convolution stride	$(2,1,1,1)$
Filter size	$3\times 3$
Representation dimensionality	$39200$
Latent dimensionality	$128$
Number of prototypes $\rm{M}$	$512$
Predictor hidden units	$1024$
Actor feature dimensionality	$50$
Actor MLP hidden units	$1024$
Critic feature dimensionality	$50$
Critic MLP hidden units	$1024$
SSL optimizer	Adam
SSL learning rate	$10^{-4}$
Pre-training data buffer capacity	$100000$
Intrinsic loss weight $w$	$1.5$
Intrinsic loss coefficient $\alpha$	$5\times 10^{-3}$
Encoder target update frequency	$1$
Encoder target EMA momentum	$0.05$
RL optimizer	Adam
RL learning rate	$10^{-4}$
RL algorithm	RAD-SAC
Random shift pad	$\pm 4$
SAC initial temperature	$0.1$
Discount $\gamma$	$0.99$
SAC replay buffer capacity	$40000$
Actor update frequency	$2$
Actor log stddev bounds	$[-10,2]$
Critic update frequency	$1$
Critic target update frequency	$2$
Critic target EMA momentum	$0.01$
Exploration reward $k$ in $\rm{NN}$	$3$
Exploration reward coefficient $\beta$	$0.2$
Exploration reward buffer size	$2048$
Softmax temperature	$0.1$

References

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
[2] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
[3] M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas, “Reinforcement learning with augmented data,” Advances in neural information processing systems, vol. 33, pp. 19 884–19 895, 2020.
[4] Z. Ding, Y. Chen, N. Li, D. Zhao, and C. P. Chen, “Stacked bnas: Rethinking broad convolutional neural network for neural architecture search,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023.
[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” nature, vol. 518, no. 7540, pp. 529–533, 2015.
[6] J. Sharma, P.-A. Andersen, O.-C. Granmo, and M. Goodwin, “Deep q-learning with q-matrix transfer learning for novel fire evacuation environment,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 51, no. 12, pp. 7363–7381, 2021.
[7] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
[8] Z. Ding, Y. Chen, N. Li, D. Zhao, Z. Sun, and C. P. Chen, “Bnas: Efficient neural architecture search using broad scalable architecture,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
[9] J. Chai, W. Chen, Y. Zhu, Z.-X. Yao, and D. Zhao, “A hierarchical deep reinforcement learning framework for 6-dof ucav air-to-air combat,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023.
[10] S. Narvekar, J. Sinapov, and P. Stone, “Autonomous task sequencing for customized curriculum design in reinforcement learning.” in IJCAI, 2017, pp. 2536–2542.
[11] K. Wu, M. Wu, J. Yang, Z. Chen, Z. Li, and X. Li, “Deep reinforcement learning boosted partial domain adaptation.” in IJCAI, 2021, pp. 3192–3199.
[12] Z. Ding, Y. Chen, N. Li, and D. Zhao, “Bnas-v2: Memory-efficient and performance-collapse-prevented broad neural architecture search,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 52, no. 10, pp. 6259–6272, 2022.
[13] H. Liu and P. Abbeel, “Behavior from the void: Unsupervised active pre-training,” Advances in Neural Information Processing Systems, vol. 34, pp. 18 459–18 473, 2021.
[14] D. Yarats, R. Fergus, A. Lazaric, and L. Pinto, “Reinforcement learning with prototypical representations,” in International Conference on Machine Learning. PMLR, 2021, pp. 11 920–11 931.
[15] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq et al., “Deepmind control suite,” arXiv preprint arXiv:1801.00690, 2018.
[16] Z. Yang, K. E. Merrick, H. A. Abbass, and L. Jin, “Multi-task deep reinforcement learning for continuous action control.” in IJCAI, vol. 17, 2017, pp. 3301–3307.
[17] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in International conference on machine learning. PMLR, 2017, pp. 1126–1135.
[18] Y. M. Mu, S. Chen, M. Ding, J. Chen, R. Chen, and P. Luo, “Ctrlformer: Learning transferable state representation for visual control via transformer,” in International Conference on Machine Learning. PMLR, 2022, pp. 16 043–16 061.
[19] S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y. Sulsky, J. Kay, J. T. Springenberg et al., “A generalist agent,” arXiv preprint arXiv:2205.06175, 2022.
[20] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine, “Diversity is all you need: Learning skills without a reward function,” arXiv preprint arXiv:1802.06070, 2018.
[21] D. Pathak, D. Gandhi, and A. Gupta, “Self-supervised exploration via disagreement,” in International conference on machine learning. PMLR, 2019, pp. 5062–5071.
[22] M. Laskin, H. Liu, X. B. Peng, D. Yarats, A. Rajeswaran, and P. Abbeel, “Unsupervised reinforcement learning with contrastive intrinsic control,” Advances in Neural Information Processing Systems, vol. 35, pp. 34 478–34 491, 2022.
[23] R. e. a. Yang, “Behavior contrastive learning for unsupervised skill discovery,” arXiv, 2023.
[24] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” Advances in Neural Information Processing Systems, vol. 33, pp. 9912–9924, 2020.
[25] F. Deng, I. Jang, and S. Ahn, “Dreamerpro: Reconstruction-free model-based reinforcement learning with prototypical representations,” in International Conference on Machine Learning. PMLR, 2022, pp. 4956–4975.
[26] J. Wang, Y. Mu, D. Li, Q. Zhang, D. Zhao, Y. Zhuang, P. Luo, B. Wang, and J. Hao, “Prototypical context-aware dynamics generalization for high-dimensional model-based reinforcement learning,” arXiv preprint arXiv:2211.12774, 2022.
[27] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu et al., “Learning to navigate in complex environments,” arXiv preprint arXiv:1611.03673, 2016.
[28] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks,” arXiv preprint arXiv:1611.05397, 2016.
[29] M. Laskin, A. Srinivas, and P. Abbeel, “Curl: Contrastive unsupervised representations for reinforcement learning,” in International Conference on Machine Learning. PMLR, 2020, pp. 5639–5650.
[30] M. Schwarzer, A. Anand, R. Goel, R. D. Hjelm, A. Courville, and P. Bachman, “Data-efficient reinforcement learning with self-predictive representations,” arXiv preprint arXiv:2007.05929, 2020.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[32] M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” Advances in neural information processing systems, vol. 26, 2013.
[33] O. Henaff, “Data-efficient image recognition with contrastive predictive coding,” in International conference on machine learning. PMLR, 2020, pp. 4182–4192.
[34] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607.
[35] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[36] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[37] A. Stooke, K. Lee, P. Abbeel, and M. Laskin, “Decoupling representation learning from reinforcement learning,” in International Conference on Machine Learning. PMLR, 2021, pp. 9870–9879.
[38] T. Xiao, I. Radosavovic, T. Darrell, and J. Malik, “Masked visual pre-training for motor control,” arXiv preprint arXiv:2203.06173, 2022.
[39] M. Schwarzer, N. Rajkumar, M. Noukhovitch, A. Anand, L. Charlin, R. D. Hjelm, P. Bachman, and A. C. Courville, “Pretraining representations for data-efficient reinforcement learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 12 686–12 699, 2021.
[40] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine et al., “Model-based reinforcement learning for atari,” arXiv preprint arXiv:1903.00374, 2019.
[41] M. Laskin, D. Yarats, H. Liu, K. Lee, A. Zhan, K. Lu, C. Cang, L. Pinto, and P. Abbeel, “Urlb: Unsupervised reinforcement learning benchmark,” arXiv preprint arXiv:2110.15191, 2021.
[42] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artificial intelligence, vol. 101, no. 1-2, pp. 99–134, 1998.
[43] R. Bellman, “A markovian decision process,” Journal of mathematics and mechanics, pp. 679–684, 1957.
[44] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738.
[45] H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, “Nearest neighbor estimates of entropy,” American journal of mathematical and management sciences, vol. 23, no. 3-4, pp. 301–321, 2003.
[46] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International conference on machine learning. PMLR, 2018, pp. 1861–1870.
[47] D. Yarats, I. Kostrikov, and R. Fergus, “Image augmentation is all you need: Regularizing deep reinforcement learning from pixels,” in International Conference on Learning Representations, 2020.
[48] M. Laskin, H. Liu, X. B. Peng, D. Yarats, A. Rajeswaran, and P. Abbeel, “Cic: Contrastive intrinsic control for unsupervised skill discovery,” arXiv preprint arXiv:2202.00161, 2022.
[49] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.