Intrinsically Motivated Self-supervised Learning in Reinforcement Learning

Yue Zhao¹, Chenzhuang Du², Hang Zhao², Tiejun Li^1,⋆ ¹Peking University, Haidian, Beijing, China²Tsinghua University, Haidian, Beijing, China^⋆Correspondence author [email protected]

Abstract

In vision-based reinforcement learning (RL) tasks, it is prevalent to assign auxiliary tasks with a surrogate self-supervised loss so as to obtain more semantic representations and improve sample efficiency. However, abundant information in self-supervised auxiliary tasks has been disregarded, since the representation learning part and the decision-making part are separated. To sufficiently utilize information in auxiliary tasks, we present a simple yet effective idea to employ self-supervised loss as an intrinsic reward, called Intrinsically Motivated Self-Supervised learning in Reinforcement learning (IM-SSR). We formally show that the self-supervised loss can be decomposed as exploration for novel states and robustness improvement from nuisance elimination. IM-SSR can be effortlessly plugged into any reinforcement learning with self-supervised auxiliary objectives with nearly no additional cost. Combined with IM-SSR, the previous underlying algorithms achieve salient improvements on both sample efficiency and generalization in various vision-based robotics tasks from the DeepMind Control Suite, especially when the reward signal is sparse.

I Introduction

Reinforcement learning has achieved significant success in many fields with visual observations [1, 2, 3, 4]. Sample efficiency and representation learning in vision-based reinforcement learning tasks have, though, hitherto been a challenging problem [2, 5], especially when applying RL into real-world robotics [6]. Assigning an auxiliary task with a surrogate loss, which introduces the powerful and promising self-supervised learning (SSL) methods in computer vision to RL, is an efficacious technique for representation learning and sample efficiency. With self-supervised auxiliary objectives, the RL algorithms learn more semantic representations and achieve faster convergence.

Albeit combining with SSL naively can improve sample efficiency in vision-based RL, there is still room for improvements. Especially in a real-world setting with sparse rewards, exploration becomes desiderata for RL to obtain non-trivial reward signals, which remains a predicament that previous SSL-RL baselines can not solve efficiently. Adequate information in self-supervised auxiliary tasks can offer practical assistance; however, such paramount information has been disregarded, since the representation learning part is separated from the decision-making part.

Based on that, we propose a framework for SSL-RL to assign the surrogate self-supervised loss into policy learning as an intrinsic reward, namely Intrinsically Motivated Self-Supervised learning in Reinforcement learning (IM-SSR). It is a simple yet effective modification with nearly no additional cost, which can be effortlessly plugged in any reinforcement learning with self-supervised auxiliary tasks. Intuitively, the self-supervised loss measures the quality of the observation representation learning, which would be relatively large for the states rarely encountered or vulnerable to nuisance. Also we theoretically decompose the self-supervised loss as a metric in the feature space: one motivates exploration for novel states, the other improves robustness from nuisance elimination. Thereby, it is reasonable to design the self-supervised loss as an intrinsic reward to award extra bonuses for those states. Combining with this intrinsic reward, the agent is encouraged to explore novel and vulnerable states, thus improving sample efficiency along side generalization ability.

Empirical experiments are conducted on robotics control tasks in DeepMind Control Suite [7] based on a generalization benchmark DMControl-GB¹¹1https://github.com/nicklashansen/dmcontrol-generalization-benchmark. Complex testing tasks are included, like environments with unseen natural video backgrounds, which can evaluate the generalization ability of the algorithm to various real-world settings. Two of the classical vision-based reinforcement learning algorithms with self-supervised methods, CURL [2] and SODA [1], are chosen to be baselines. Combined with IM-SSR, the underlying baselines improve sample efficiency and generalization, especially when the reward signals are sparse, resulting in new state-of-the-art performance.

We summarize our contributions as follows:

•

We present a general framework, IM-SSR, which is an effective idea to utilize the self-supervised loss in auxiliary tasks as an intrinsic reward. The philosophy of associating self-supervised learning with reinforcement learning is promising since the information in auxiliary tasks should not be disregarded.
•

We formally show that the self-supervised loss can be decomposed as exploration for novel states and robustness in representations from nuisance elimination.
•

With nearly no additional cost, we offer an approach to promote improvements on the previous underlying reinforcement learning algorithms with self-supervised auxiliary loss, which achieves new state-of-the-art performance on vision-based reinforcement learning.

II Methodology

Refer to caption — Figure 1: Intrinsically Motivated Self-supervised Learning in Reinforcement Learning Framework

As demonstrated in Figure 1, the framework of reinforcement learning combined with self-supervised learning is: First, we maintain a replay buffer with transitions. Various kinds of augmentations can be employed to obtain different views of the original observations. It remains ambiguous whether to use the original observation or an augmented view into the encoder for RL, which we demonstrate as above just for simplicity. In reinforcement learning part, we encode inputs into latent variable $z$ and train the policy based on $z$ and related transitions; in self-supervised learning part, an auxiliary task is conducted to constrain the consistency of $z^{\prime}$ encoded from different views of observations. The self-supervised loss in auxiliary task is generally denoted as $\mathcal{L}_{\text{SSL}}$ and explicitly formulated in Appendix B. To utilize the self-supervised learning part, we introduce $\mathcal{L}_{\text{SSL}}^{i}$ on each observation as an intrinsic reward $R^{intrinsic}$ added to the original extrinsic reward in Equation (1).

R_{t}=R^{extrinsic}_{t}+\beta_{t}*R^{intrinsic}_{t}

(1)

When $\beta_{t}=0$ , i.e., $R_{t}=R_{t}^{extrinsic}$ , the framework degenerates into previous SSL-RL methods, which have no interplay between SSL and RL.

Motivation. Intuitively, the self-supervised loss contains information about how well the learned representation is. The loss would be relatively large for rarely encountered states, which motivates us to award novel or vulnerable states. The bonus motivates the agent to explore, which helps with the searching for non-trivial rewards. Based on that, sample efficiency can be improved significantly, especially in sparse-rewarded tasks. We have to clarify that IM-SSR utilizes the inherent yet unemployed information in SSL, which is different from those manually designed exploration methods. Also, generalization ability is a by-product gained from building robust representations for vulnerable states.

Summary. To utilize inherent information in SSL, we present a generic framework, IM-SSR, where the SSR can be replaced by any suitable self-supervised learning method. For instance, IM-CURL means a CURL baseline combined with the intrinsic reward modification. Our method requires no additional network architectures nor any modification on the self-supervised task, thus making it easy to be plugged-in.

The pseudo-code of IM-SSR is as follows.

Algorithm 1 IM-SSR Framework

Input: Replay buffer

\mathcal{B}

, RL update frequency

m

, RL batch size

N

, Intrinsic parameter

\beta

repeat

for

0

m

Sample a batch of

N

transitions

\{(o_{i},a_{i},o_{i}^{\prime},r_{i},\text{done}_{i})\}^{N}_{i=1}\sim\mathcal{B}

Compute self-supervised loss

l_{i}

between each observation and its augmentation

Normalize within the batch:

\{r^{intrinsic}_{i}\}^{N}_{i=1}\leftarrow\texttt{Normalize}(\{l_{i}\}^{N}_{i=1})

Add intrinsic reward:

r_{i}\leftarrow r_{i}+\beta*r^{intrinsic}_{i}

Optimize reinforcement learning objective

\mathcal{L}_{\text{{RL}}}

end for

Sample a new batch of transitions from

\mathcal{B}

or use transitions sampled in RL updates

Optimize self-supervised learning objective

\mathcal{L}_{\text{{SSL}}}

until reach max training steps

Implementation details. The intrinsic rewards need to be normalized, since it is the relative value inside one batch/update that matters, not the exact quantity. The hyper-parameter $\beta_{t}$ can be elaborately designed to decay; however, we implement the method with a fixed $\beta$ or a naive decaying schedule without fine-tuning and it still achieves desired performance. Our method can be easily deployed on any SSL-RL framework. In the experiments, we strictly follow the same way as the corresponding baseline does and use exactly the same augmentation method. Details of the related SSL-RL baselines are shown in Appendix B.

III Analysis on Self-supervised Loss in Reinforcement Learning

We formally show that it is reasonable to employ the self-supervised loss as an intrinsic reward, which can be interpreted as an exploration bonus and robustness improvement from nuisance elimination.

III-A Notations and Preliminaries

Figure 2: Representation MDP.

A Markov Decision Process (MDP) is defined for the information transmission in Figure 2. $x\in\mathcal{X}$ denotes an optimal representation in the latent space, which is necessary and sufficient for the reinforcement learning task $y$ . Another random variable $n\in\mathcal{N}$ denotes nuisance for the task, i.e., $n\perp y$ or $I(y;n)=0$ ; also, we assume $n\perp x$ and $I(x;n)=0$ . Observation $o$ is generated by an implicit function $g(\cdot,\cdot)$ , i.e., $o=g(x,n)\in\mathcal{O}$ . For instance, $o$ denotes the stacked pixel observation captured from the environment; $x$ is the desired optimal representation for downstream tasks, such as state-based proprioceptive features; $n$ denotes task-irrelevant information like textures, shadows and backgrounds, which may be a culprit for the representation learning, thus seriously depreciating sample efficiency and generalization ability to unseen environments of the algorithm.

We now consider a generic setting for contrastive learning as an auxiliary self-supervised task along side reinforcement learning. The encoding function is denoted by $f:\mathcal{O}\rightarrow\mathcal{Z}$ . A main principle for contrastive learning is that features from different views of the same observation should be forced to be close in the feature space, i.e., the ideal encoder $f^{\star}$ would satisfy $\text{dist}(f^{\star}(o_{i}),f^{\star}(o_{j}))\propto\text{dist}(x_{i},x_{j})$ . The contrastive loss depends on the SSL architecture, which has the form like Equation (5) in CURL or Equation (6) in SODA as shown in Appendix B. Then we use ${\rho}(z_{i},z_{j})$ to denote the underlying pair-wise contrastive loss as a metric defined on a metric space $\mathcal{H}$ , measuring the distance between different latent representations $z_{i}\in\mathcal{Z}$ . It satisfies the well-known properties: $\rho(x,y)\geq 0$ ; $\rho(x,y)=\rho(y,x)$ ; $\rho(x,y)\leq\rho(x,z)+\rho(y,z)$ .

III-B Decomposition and Interpretation of Contrastive Loss

For each observation $o=g(x,n)$ , we analyze the pair-wise contrastive loss. Augmentations of $o$ are denoted by $o^{\prime}=t(o)$ . Since the augmentation function is designed to perturb nuisance and maintain significant information of the original observation, we assume $\exists\ n^{\prime}$ , s.t. $o^{\prime}=t(o)=g(x,n^{\prime})$ . The corresponding features are $z=f(o)$ and $z^{\prime}=f(o^{\prime})$ . We further introduce an optimal representation function $f^{\star}$ . Ideally, it would ignore information related to nuisance $n$ and maintain the information in $x$ related to downstream tasks, i.e., $f^{\star}(g(x,n))=f^{\star}(g(x,n^{\prime})),\forall n,n^{\prime}\in\mathcal{N}$ . We may use a fixed $n_{0}$ to denote $z^{\star}=f^{\star}(g(x,n))=f^{\star}(g(x,n_{0})),\forall n\in\mathcal{N}$ .

	$\displaystyle\rho({z}^{\prime},z)=\rho(f(g(x,n^{\prime})),f(g(x,n)))$	(2)
$\displaystyle\leq$	$\displaystyle\rho(f(g(x,n^{\prime})),f^{\star}(g(x,n_{0})))+\rho(f(g(x,n)),f^{\star}(g(x,n_{0})))$
$\displaystyle=$	$\displaystyle\rho(h(x,n^{\prime}),h^{\star}(x,n_{0}))+\rho(h(x,n),h^{\star}(x,n_{0})),$

The loss can be decomposed as above, where $h/h^{\star}$ denotes the composite function of $g$ and $f/f^{\star}$ , and the second inequality is from the triangle inequality of the metric. Each term on the right hand side can be further bounded by

		$\displaystyle\rho(h(x,n),h^{\star}(x,n_{0}))$		(3)
	$\displaystyle\leq$	$\displaystyle\underbrace{\rho(h(x,n_{0}),h^{\star}(x,n_{0}))}_{\text{(i) projected distance of x}}+\underbrace{\rho(h(x,n),h(x,n_{0}))}_{\text{(ii) nuisance elimination}}.$		(3)

The first term, interpreted as the projected distance of $x$ , is indeed an exploration bonus. It is the distance in feature space between different projection functions $h(.,n_{0})$ and $h^{\star}(.,n_{0})$ , which contrastive learning is trying to minimize; therefore, the distance would be smaller if the state has been encountered. From the perspective of exploration in reinforcement learning, this term can be treated as a prediction error, and it would bring extra bonus for novel states that the contrastive part is not trained on. The prediction error is often considered to motivate exploration like RND [8] does; however, it requires additional networks for distillation. In the most desired case, this term should vanish, as the function $h$ is optimized to $h^{\star}$ .

The second term, related to nuisance elimination, encourages the agent to visit vulnerable states and introduces randomness into the optimization. States that are more sensitive to distortions will give larger distance between $h(x,n)$ and $h^{\star}(x,n_{0})$ , which is encouraged to be visited or revisited. To be clarified, since the nuisance $n$ in the considered MDP is independent of $x$ , our method can not directly improve the generalization ability of the baseline; however, it can contribute by building more robust representations for vulnerable states. Robustness in representations makes the learned representations more invariant to distortions, thus improving the generalization ability to untrained environments. While the function $h$ converges to $h^{\star}$ , the information in $n$ is getting ignored, and the whole term of nuisance elimination would also vanish.

To summarize, the intrinsic reward, i.e., pair-wise contrastive loss of the auxiliary tasks in reinforcement learning, can be interpreted as: exploration for novel states that improves sample efficiency, and robustness in representations from nuisance elimination that improves generalization ability.

	$\displaystyle\rho({z}^{\prime},z)$	$\displaystyle\leq\underbrace{2\rho(h(x,n_{0}),h^{\star}(x,n_{0}))}_{\text{(i) {Exploration} for novel states}}$		(4)
	$\displaystyle+$	$\displaystyle\underbrace{\left[\rho(h(x,n),h(x,n_{0}))+\rho(h(x,n^{\prime}),h(x,n_{0}))\right]}_{\text{(ii) {Robustness} from nuisance elimination}}.$		(4)

Inequality (4) is tightly bounded, since $h$ is optimized towards $h^{\star}$ and $h^{\star}$ will ignore the information in $n$ ideally. Apart from that, with a decaying parameter $\beta_{t}$ in front of the intrinsic reward, our method is able to converge to the optimal policy asymptotically. Specifically, $\rho({z}^{\prime},z)$ can be a pair-wise loss of Equation (5) in CURL or Equation (6) in SODA, which will serve as an intrinsic reward in IM-SSR. The pair-wise loss can also be designed by ourselves to utilize the information in encoders, which will be further explored in Appendix D.

IV Experiments

TABLE I: The results of CURL, SODA, IM-CURL and IM-SODA in unmodified environments.

DMControl Suite (training)

0.2T Training Frames

0.5T Training Frames

1.0T Training Frames

CURL

IM-CURL

SODA

IM-SODA

CURL

IM-CURL

SODA

IM-SODA

CURL

IM-CURL

SODA

IM-SODA

cartpole

swingup_sparse

50.1

\pm

50.3

318.4

\pm

240.3

6.6

\pm

11.1

10.3

\pm

11.7

726.1

\pm

48.0

721.9

\pm

48.7

435.5

\pm

361.9

632.1

\pm

249.2

744.2

\pm

23.8

744.8

\pm

54.2

617.3

\pm

331.4

792.0

\pm

38.5

finger

spin

784.2

\pm

112.4

789.9

\pm

94.2

779.5

\pm

108.8

861.8

\pm

76.8

862.3

\pm

132.7

916.8

\pm

37.4

909.3

\pm

50.8

975.9

\pm

8.2

920.9

\pm

124.8

949.7

\pm

39.4

937.1

\pm

35.3

983.1

\pm

2.4

pendulum

swingup

22.6

\pm

38.3

21.5

\pm

41.4

22.3

\pm

34.1

36.5

\pm

38.5

41.8

\pm

46.0

239.5

\pm

285.3

175.4

\pm

222.2

494.3

\pm

359.5

171.9

\pm

289.8

780.3

\pm

38.9

474.1

\pm

391.0

736.2

\pm

196.1

reacher

easy

871.1

\pm

97.0

877.6

\pm

50.1

474.9

\pm

56.5

513.5

\pm

127.8

952.7

\pm

36.8

904.7

\pm

86.7

472.0

\pm

179.2

644.7

\pm

153.2

932.6

\pm

46.5

970.0

\pm

4.9

539.8

\pm

285.2

709.7

\pm

169.7

walker

walk

680.3

\pm

79.5

766.7

\pm

107.7

618.1

\pm

65.6

641.7

\pm

43.3

813.4

\pm

63.4

832.4

\pm

118.1

735.2

\pm

61.3

776.3

\pm

41.5

855.3

\pm

66.8

873.8

\pm

88.0

840.3

\pm

47.2

834.9

\pm

52.1

(a) training

(b) color_easy

(d) video_easy

(e) video_hard

Figure 3: DMControl-GB Environments [1]

IV-A Experimental Settings

Empirically, we conduct experiments on DMControl suite. All experiments are evaluated over 5 random seeds, except for those mentioned in the related caption. DMControl-GB provides diverse test environments for policy evaluation, which are important for sim-to-real in robotics. The agent is trained in the unmodified training environment, while it is evaluated in some new environments as well as the unmodified environment of DMControl suite. The new test environments can be summarized as two types: changing the color randomly and replacing the background with natural videos. Specifically, it includes color_easy, color_hard, video_easy, video_hard, as shown in Figure 3 and we select training, color_hard and video_hard as the representative environments.

Primarily, to validate the priority in sample efficiency of IM-SSR, we compare IM-CURL, IM-SODA with CURL and SODA in the training environment. Especially, we elaborately design tasks with sparse rewards to further prove the significant exploration brought by IM-SSR. We conduct all experiments with 500K training frames. We denote the total training frames as T, using 0.2T as early-term, 0.5T as mid-term and 1.0T as final-term to represent the training progress of 20%, 50% and 100%. We report the mean and standard deviation in the tables. Additional results and more details can be found in Appendix A and C.

Besides, to verify that IM-SSR is helpful for robustness and generalization, we compare IM-SODA and SODA in challenging generalization environments. First, in Section IV-C, we intend to evaluate the generalization ability of sim-to-real, where agents are trained in simulated unmodified mujoco environment training, and tested in real-world like environment color_hard and video_hard. Second, we design tasks which are trained on one scene and tested on another to further validate the generalization in transferring. The detailed analysis of generalization tasks and results can be found in Appendix E.

IV-B Sample Efficiency

In this subsection, we demonstrate that IM-SSR can improve sample efficiency on the training environment as Figure 3(a). The results shown in Table I can be summarized:

•

IM-SSR surpasses its underlying baselines in the early-term in most cases and maintains its priority in the mid-term significantly.
•

The improvements in mid-term are much larger than in other terms, since IM-SSR converges much faster than underlying baselines. The baseline may still struggle in learning representations or searching for non-trivial rewards, while IM-SSR has already learned semantic representations and obtained positive and effective reward signals to be trained on.
•

In the final-term, we expect IM-SSR and baselines to achieve similar results like in walker_walk. However, we surprisingly find that in several tasks like cartpole_swingup_sparse, IM-SSR achieves higher mean and much lower variance in performance than the underlying baselines, which means IM-SODA is much more stable in these environments.
•

There are few cases that IM-SSR provides slight improvements, which mainly attribute to the dense reward. When setting the reward signal to be sparse, the improvements of IM-SSR become significant. Related experiments are implemented in Section IV-D.

IV-C Generalization

We demonstrate that IM-SSR can improve robustness and generalization, and the evaluation is conducted on color_hard and video_hard environments as Figure 3(c) and Figure 3(e).

TABLE II: Generalization results of SODA and IM-SODA on color_hard and video_hard.

random colors

video backgrounds

DMControl-GB (evaluating)

0.2T

0.5T

1.0T

0.2T

0.5T

1.0T

SODA

IM-SODA

SODA

IM-SODA

SODA

IM-SODA

SODA

IM-SODA

SODA

IM-SODA

SODA

IM-SODA

cartpole

swingup_sparse

5.8

\pm

9.1

10.7

\pm

11.3

341.9

\pm

282.4

429.0

\pm

223.5

468.6

\pm

257.1

573.8

\pm

104.2

1.5

\pm

2.5

4.9

\pm

5.3

73.5

\pm

55.7

89.8

\pm

69.7

111.9

\pm

77.5

139.8

\pm

86.7

finger

spin

717.7

\pm

109.3

811.5

\pm

74.7

831.5

\pm

48.7

925.1

\pm

50.1

879.2

\pm

40.8

931.2

\pm

37.4

354.6

\pm

50.9

361.1

\pm

64.2

417.8

\pm

57.3

481.4

\pm

31.6

333.6

\pm

18.1

353.6

\pm

11.4

pendulum

swingup

10.4

\pm

20.6

33.8

\pm

37.0

133.5

\pm

148.1

303.6

\pm

223.1

306.0

\pm

272.9

471.9

\pm

179.7

1.4

\pm

1.3

20.9

\pm

23.7

38.9

\pm

30.5

45.1

\pm

39.4

30.4

\pm

26.5

42.1

\pm

23.3

reacher

easy

363.4

\pm

113.2

397.3

\pm

117.2

300.7

\pm

127.7

422.1

\pm

136.1

380.4

\pm

209.7

437.2

\pm

137.9

335.9

\pm

117.6

357.1

\pm

111.7

297.6

\pm

168.0

468.7

\pm

130.4

350.3

\pm

205.1

434.3

\pm

147.7

walker

walk

511.9

\pm

63.8

553.1

\pm

47.4

591.7

\pm

77.0

630.0

\pm

71.9

644.6

\pm

66.0

620.2

\pm

60.2

257.7

\pm

103.2

318.2

\pm

70.3

303.0

\pm

107.7

327.9

\pm

89.7

331.3

\pm

80.5

333.1

\pm

53.0

As shown in Table II, in the early-term and mid-term, IM-SODA outperforms SODA nearly on all tasks, which proves that its generalization ability in unseen challenging evaluation environments is much better. Since video_hard is relatively harder, it makes both algorithms perform poorly in some environments. Still, the peak of IM-SODA is higher than SODA, which also shows better generalization ability of IM-SODA.

IV-D Sparse Cases

Although methods like CURL and SODA achieve comparable performance in major tasks in DMControl, they still perform poorly on tasks with sparse rewards like cartpole_swingup_sparse and pendulum_swingup. As discussed above, simply combining RL with self-supervised learning cannot help solve environments with sparse rewards, but the agent of IM-SSR is encouraged to explore for novel states that have non-trivial rewards potentially.

To verify this hypothesis, we visualize the training process on these two environments with sparse reward signals in Figure 4. Each row in Figure 4 corresponds to an environment, pendulum_swingup is on the top, while cartpole_swingup_sparse is on the bottom. The first column shows the comparison between CURL and IM-CURL on training, while other columns on the right show the comparison between SODA and IM-SODA on training, color_hard and video_hard respectively. IM-SSR improves the sample efficiency significantly in almost all cases. Specifically, the result on the left top panel shows that in IM-CURL pendulum_swingup converges to much higher averaged returns at about 400K steps, while CURL still performs extremely bad at 500K steps.

What makes the improvement to be marginal or significant? As we notice in Table I, sometimes IM-SSR and the corresponding baselines achieve similar performance. The decomposition and interpretation of pair-wise contrastive loss in Equation (4) can be an explanation, especially the exploration term. When the reward signal is dense, the baseline already learns well without exploring for novel states, and there is not much room for IM-SSR to improve. However, when the reward signal is sparse, the significance of exploration emerges and the intrinsic reward does help.

To verify this idea, we take walker_walk as an example and modify the reward to be sparse as either 0 or 1. The new task is called walker_walk_sparse. In walker_walk, IM-SSR and the corresponding baselines have already achieved high performance in early-term. However, as shown in Figure 5 with designed sparse settings, IM-CURL and IM-SODA perform much better than CURL and SODA due to the lack of reward supervision.

V Related Work

Self-Supervised Learning for Visual Representation. Self-supervised learning has attracted interests of researchers in Computer Vision [9, 10, 11, 12, 13, 14]. MoCo (Momentum Contrastive) [9] builds a dynamic queue with the help of the momentum encoder. It trains the encoder by matching queries and keys using InfoNCE loss. SimCLR [10] introduces a learnable nonlinear transformation between the representation and the contrastive loss. With the projection layers, we can remove the momentum encoder and the memory bank. BYOL [11] trains an online network to predict the target network representation of the same image with the different views. Amazingly, they remove negative samples. SimSiam [13] explores what makes good representation in BYOL by modifying the predictor and removing the momentum encoder.

Auxiliary Task in Reinforcement Learning. Directly using images as the input in deep reinforcement learning always causes sample inefficiency, researchers propose to jointly train the policy and a self-supervised loss to solve the inefficiency problem to improve performances [15, 1, 2, 5, 16, 17]. Inspired by [9, 10], CURL [2] uses contrastive learning as the auxiliary tasks. They just treat the different randomly cropped views of the same image as the positive pair. SODA [1] introduces a new overlay augmentation that they use the augmentation and randomly crop the same image as the positive pair, which shows State-of-The-Art performance in generalization [7]. In [11], they use a online network to predict the target network and do not need negative pairs. Besides, they do not use the same batch data to optimize the RL loss and the auxiliary task loss. PAD [16] uses inverse dynamics prediction error to optimize the encoder and introduces policy adaptation tricks to improve generalization performance. Besides, SVEA [17] stables the learning of Q-value to improve generalization. However, the policy training part and SSL part are separated, which prevents the policy from referencing to representation learning to make decision.

Exploration. Exploration is an important topic in reinforcement learning. A natural and effective way is count-based exploration bonuses [18]. When the space is becoming larger and larger, simple counting fails, and lots of work have studied how to generalize count exploration to large state spaces [8, 19, 20]. Another class of exploration is based on the errors in predicting dynamics [8, 21, 22]. For example, Random Network Distillation [8] treats the error of predicting features of the next observations given by a fixed random model as the bonus. When policy gives an action to find a non-seen state, it will be awarded. However, exploration based on predicted errors needs additional network architectures, while our method only borrows the representation learning error as the intrinsic reward.

VI Conclusions and Future Work

In summary, we present a simple yet effective idea to introduce the information in self-supervised learning into policy learning as intrinsically motivated self-supervised learning in reinforcement learning (IM-SSR). Theoretical analysis is made to interpret decomposition of the self-supervised loss, as exploration for novel states and robustness from nuisance elimination. The implementation of the IM-SSR is quite simple, without any extra modification on architectures nor costs in computation, yet the improvement is significant. Both IM-CURL and IM-SODA achieve faster convergence than CURL and SODA. Besides, IM-SODA generalizes much better than SODA in unseen real-world like environments. We also emphasize the IM-SSR performs much better than the underlying baselines in sparse reward environments. Therefore, we are offering an approach to further improve sample efficiency and generalization of suitable SSL-RL baselines, with nearly no additional cost.

Albeit it is promising to introduce interaction between SSL and RL, intrinsic reward is not the only way to implement this idea; thereby, many other approaches remain to be developed. Also, an adaptive schedule can be designed for the intrinsic parameter $\beta_{t}$ to achieve a more robust and tuning-free algorithm. In the future, we desire to further extend IM-SSR to more robotics tasks in more complex environments.

Appendix A Performance Curves

To show the effectiveness of IM-SSR, we visualize the performance curves in DMControl environments. To further verify the generalization ability of IM-SSR from simulation to real world, evaluation on complex unseen environments is conducted based on DMControl-GB. As we can see, almost in all environments, IM-SSR outperforms the baseline methods. We mainly follow the experimental setting in CURL and SODA, which trains the agent in unmodified training environments and test in various environments: unmodified training, color_hard and video_hard. The x-axis denotes training frames (1 million).

Additional results related to PAD and SVEA are in Figure 8 and Figure 10, where $\beta_{0}=0.05$ . We focus on the performance on pendulum_swingup.

The performance curves clearly demonstrate the priority of IM-SSR on sample efficiency over baselines. The green curve which represents IM-SSR is higher than the red baseline almost every time. Especially in tasks with sparse rewards, like the first and the third columns, IM-SSR obtains non-trivial rewards much earlier and surpasses the underlying baseline significantly. Also in generalization evaluation, IM-SODA shows better generalization ability in color_hard and video_hard.

Appendix B Basic SSL-RL Architectures

Our method can be easily deployed on any SSL-RL framework. In this paper, we use CURL [2], SODA [1] and PAD [16] as the basic SSL architectures in the IM-SSR framework, which are built based on SAC [23].

CURL. Vision-based RL is disturbed by sample-inefficiency [5]. Inspired by the success of self supervised learning in computer vision [9, 10], CURL [2] proposes to training an extra contrastive objective as an auxiliary task to make encoder to learn high level representation faster and better. More specifically, they maximize the agreement between augmented versions of the same observation by InfoNCE loss as Equation (5):

\mathcal{L}_{\text{CURL}}=-\log\frac{\exp\left(q^{T}Wk_{+}\right)}{\exp\left(q^{T}Wk_{+}\right)+\sum_{i=0}^{K-1}\exp\left(q^{T}Wk_{i}\right)},

(5)

where $q$ is the encoder query, $W$ is parameters to be optimized and $k_{+}$ denotes the key that matches $q$ [2]. Intuitively, the value of the loss is low when $q$ is similar to its positive key $k_{+}$ and dissimilar to other negative keys (denoted as $k_{i}$ ). In CURL implementation, we simply treat different augmented views of the same state as the positive pair.

SODA. SODA [1] maximizes the mutual information between latent representations of augmented and non-augmented data by employing BYOL-like [11] architecture, which does not need negative samples at all. The RL policy and the self-supervised auxiliary task share a common encoder $f_{\theta}$ . $o^{\prime}$ is an augmented observation of $o$ . They use the encoder $f_{\theta}$ and an projection network $g_{\theta}$ to extract $z^{\prime}=g_{\theta}(f_{\theta}(o^{\prime}))$ , and use the target encoder $f_{\psi}$ and target projection network $g_{\psi}$ to extract $z^{\star}=g_{\psi}(f_{\psi}(o^{\star}))$ , where $\psi$ is an exponential moving average (EMA) of $\theta$ . The objective of SODA is to predict $z^{\star}$ from $z^{\prime}$ by $h_{\theta}$ , which is formulated as a consistency loss:

\mathcal{L}_{\text{SODA}}\left(\hat{z},z^{\star};\theta\right)=\mathbb{E}_{t\sim\mathcal{T}}\left[\left\|\hat{z}_{\circ}-z_{\circ}^{\star}\right\|_{2}^{2}\right],

(6)

where $\hat{z}\triangleq h_{\theta}(z^{\prime})$ , $\hat{z_{o}}\triangleq\hat{z}/\left\|\hat{z}\right\|_{2}$ and $z_{o}^{\star}\triangleq z^{\star}/\left\|z^{\star}\right\|_{2}$ .

PAD. PAD [16] explores to use self-supervision in new environments for continue training the policy for generalization. They employ the inverse dynamics model as the SSL part. Specifically, at each step, a transition sequence ( $s_{t}$ , $a_{t}$ , $s_{t+1}$ ) is observed and an inverse model takes $s_{t}$ and $s_{t+1}$ to predict $a_{t}$ , which can adapt the policy to the new environments without any reward signal. Formally, the inverse dynamics objective for continuous actions can be written as:

\mathcal{L}_{\text{PAD}}=MSE(a_{t},f(s_{t},s_{t+1}))

(7)

where MSE means mean squared error and $f$ means the inverse prediction model.

SVEA. SVEA [17] improves the generalization of RL algorithms by stabling the learning of the Q-values. To be more specific, SVEA minimizes a nonnegative objective $\mathcal{L}_{\text{SVEA}}$ , where

	$\displaystyle\mathcal{L}_{\text{SVEA}}=\mathbb{E}$	$\displaystyle{[\alpha\left\\|Q\left(f\left(s_{t}\right),a_{t}\right)-q_{t}^{\text{target }}\right\\|}$		(8)
		$\displaystyle+\beta\left\\|Q\left(f\left(s_{t}^{aug}\right),a_{t}\right)-q_{t}^{\text{target }}\right\\|]$		(8)

$q_{t}^{target}$ is the target q-value in Bellman equation. $\alpha$ and $\beta$ is the constant coefficients and $s^{aug}_{t}$ means we use the augmented data to compute the Q-values.

Data Augmentations. In order to smoothly apply SSL to RL tasks, we need to augment the same observation. CURL only augments data by random crop, while SODA proposes a novel and stronger augmentation method, random overlay, which linearly interpolates between an observation and another image. In CURL and PAD, the inputs of different encoders are different crops of the same observation; in SODA, both inputs are the same crop, while the momentum encoder’s input is additionally processed by the random overlay augmentation method. More details can be found in [1, 16, 2]. Noting that our IM-SSR uses the same augmentation method as baselines.

Update Details. Besides, there are differences in the update of auxiliary tasks: Some, like CURL, update the auxiliary task with same sampled transitions as in RL update, while others do not, like SODA. We find it interesting that using different transitions improves the performance, which remains to be further studied. Also, the update frequency of RL and SSL can be different. We mention these differences in the pseudo-code and strictly follow the same way as the corresponding baseline does.

Appendix C Implementation Details

C-A Baseline Details

We show the implementation details for CURL, SODA and PAD in this subsection. Specifically, we present in detail about the hyperparameters for both algorithms in Table III and the choice of the $\beta$ for intrinsic reward in Table IV. Noting that the basic SSL-RLs are mainly based on the official released implementation²²2https://github.com/MishaLaskin/curl ³³3https://github.com/nicklashansen/dmcontrol-generalization-benchmark.

We utilize a simple decaying schedule for $\beta$ , by multiplying a rate every 100k steps, like $\beta\leftarrow\beta*0.8$ . All experiments follow this rule, except for cartpole_swingup_sparse, which uses a fixed parameter $\beta=\beta_{0}$ without any complicated designed schedule. Such a hyper-parameter, which is fixed or naively scheduled, can still obtain solid performance, thus proving the stability of IM-SSR. In the future, adaptive schedule can be used to adjust $\beta$ . Besides, similar to RND [8], we use the predicted observation error of the next state.

For data augmentation, both CURL and PAD only use random cropping while SODA proposes a new augmentation method for generalization: Random Overlay [1], which linearly interpolates between an observation and another natural image as Figure 9 shows. We can formally write it as:

t_{overlay}=(1-\alpha)*o+\alpha*\epsilon,

(9)

where $o$ is the observation, $\epsilon$ is the natural image and $\alpha$ is the interpolation coefficient. In SODA [1], $\alpha$ is set as $0.5$ . We emphasize that we use exactly the same augmentations as the baseline methods.

C-B Hyper-parameters

Hyper-parameters are summarized in Table III and Table IV.

TABLE III: The Hyper-parameters used in our experiments.

Hyper-parameter	Value
Frame rendering	$3\times 100\times 100$
Frame after crop	$3\times 84\times 84$
Stacked frames	3
Number of conv. layers	11 (SODA, PAD, SVEA)
	4 (CURL)
Number of filters in conv.	32
Action repeat	2 (finger_spin)
	8 (cartpole_swingup_sparse, pendulum_swingup)
	4 (otherwise)
Discount factor $\gamma$	$0.99$
Episode time steps	1,000
Learning algorithm	Soft Actor-Critic
Number of training steps	500,000
Replay buffer size	500,000 (SODA, PAD, SVEA)
	100,000 (CURL)
Optimizer (RL/aux.)	Adam $\left(\beta_{1}=0.9,\beta_{2}=0.999\right)$
Optimizer $(\alpha)$	Adam $\left(\beta_{1}=0.5,\beta_{2}=0.999\right)$
Learning rate (RL)	1e-3
Learning rate $(\alpha)$	1e-4
Learning rate $($ SODA $)$	3e-4
Batch size in RL	128
Batch size in SSL	128 (CURL)
	256 (SODA, PAD, SVEA)
Actor update freq.	2
Critic update freq.	1
Auxiliary update freq.	1 (CURL, PAD)
	2 (SODA)
Momentum coef. $\tau$ (SODA)	$0.005$

TABLE IV: The initial

\beta_{0}

of Intrinsic Reward in different environments.

IM-CURL	0.01	0.005	0.05	0.005	0.005	0.1
	cartpole	finger	pendulum	reacher	walker	walker
	swingup_sparse	spin	swingup	easy	walk	walk_sparse
IM-SODA	0.01	0.005	0.1	0.001	0.05	0.1

C-C Sparse Reward Settings

In Section IV-D, we modify the reward signal in walker_walk to compare the improvements between dense rewards and sparse rewards. The original reward signal designed in DMControl Suite is a combination of terms related to upright torso, torso height and horizontal velocity [7]. The returned reward is scaled from 0 to 1, which is computed by the following codes.

⬇

def get_reward(self, physics):

"""Returns a reward to the agent."""

standing = rewards.tolerance(

physics.torso_height(),

bounds=(_STAND_HEIGHT, float(’inf’)),

margin=_STAND_HEIGHT/2)

upright = (1+physics.torso_upright())/2

stand_reward = (3*standing+upright)/4

if self._move_speed == 0:

return stand_reward

else:

move_reward = rewards.tolerance(

physics.horizontal_velocity(),

bounds=(self._move_speed, float(’inf’)),

margin=self._move_speed/2,

value_at_margin=0.5,

sigmoid=’linear’)

return stand_reward*(5*move_reward+1)/6

To maintain the desired combination of rewards in walking, we keep the same setting of the dense reward, and only change it from the continuous version in $[0,1]$ to a discrete version in $\{0,1\}$ . The sparse reward is defined as follows.

\text{sparse\_reward}=\begin{cases}1.0&\text{dense\_reward > 0.5}\\ 0.0&\text{otherwise}\end{cases}

(10)

Appendix D Further Extensions

Various architectures of SSL-RL can also be explored, such as SVEA [24] whose self-supervised learning part is merged into the Q-learning procedure. Hence, we cannot directly utilize the SSL loss as an intrinsic reward, since no auxiliary loss is available.

However, the philosophy of IM-SSR to utilize the paramount information in SSL can still be realized by information maintained in encoders. We simply add an MSE loss as the metric to evaluate the distance between encoded augmentations and encoded original observations, and then use the pair-wise MSE as the intrinsic reward. The additional computation is still very little.

The detailed implementation is as follows. $\hat{z}_{i}$ denotes the encoded variable of the original observation by encoder in Q function, while $z_{i}^{\star}$ denotes the encoded variable of the original observation by the encoder in target function. By this way, we can calculate $l_{i}=\left\|\hat{z}_{i}-z_{i}^{\star}\right\|_{2}^{2}$ and normalize them within a batch to design the intrinsic reward.

The results show that IM-SVEA achieves better performance than SVEA in Figure 10. The improvements in video_hard can be ignored, which may be attributed to the baseline SVEA who fails.

This section motivates us to explore various approaches to design the intrinsic reward, when the formulation of SSL loss is not desired or even there is no available auxiliary loss. More attempts on how to efficiently utilize the information in the SSL part are needed in the future.

Appendix E Generalization Benchmarks

E-A Motivations on Generalization Tasks

We include two types of experiments related to generalization, one is more likely to sim-to-real, and another is more likely to transferring. In Section IV-C, we train agents based on unmodified environments and tested on unseen challenging environments which is designed for simulation to reality. We also implemented experiments that are trained on one scene and tested on another unseen scene with a dynamical changing camera pose, which is designed for transferring in real world.

First, we would like to explain for generalization tasks in Section IV-C based on DMControl-GB. Researchers mainly focus on the original DMControl baseline in previous works, which is still away from real life. If we have a task in the real world, sometimes we do not train directly in the real world; we can model the object and simulate the task in a mujoco-like world instead. The physical objects are already included in the simulated environment, and the main differences will be nuisances we mentioned in the paper, like color, background, and others. Therefore, it is reasonable to treat the benchmark as a useful generalization benchmark, which helps transfer from simulation to real life.

In this work, to the best of our ability, we try to simulate transferring from one scene to another. The implementation of transferring tasks is based on DMControl Generalization Benchmark [1] and Distracting Control Suite [25]. Additional challenging experiments are implemented, where the agent is trained based on natural videos instead of simulated mujoco environments and tested in another unseen scene. Apart from the changing of scenes, camera poses are also dynamic in training and testing as shown in Figure 11.

Figure 11: Various Scenes in Distracting Control Suite Environments [25]

E-B Performance in Transferring Tasks

As is shown in Figure 12, we conduct experiments on walker_walk_sparse to compare between the trained scene and another unseen test scene. The performance show that IM-SODA performs better than SODA as desired.

Appendix F Sensitivity Analysis on $\beta_{0}$

Here, we fix the schedule of $\beta$ and explore the utility of various $\beta_{0}$ in IM-SSR. Figure 13(a) shows that different $\beta_{0}$ will lead to different performance, where a suitable $\beta_{0}=0.05$ leads to the best performance. As in shown in Figure 13(b), with the increasing of $\beta_{0}$ , the final performance will increase first and then decrease. The magnitude of $\beta_{0}$ matters for a proper modification on the original reward. If the $\beta_{0}$ is too big, the intrinsic reward will be the most important target of the agent, and the extrinsic reward will be neglected, which results in low performance. If the $\beta_{0}$ is too small, the performance is closer to CURL, which is a special case for $\beta_{0}=0$ .

References

[1] N. Hansen and X. Wang, “Generalization in reinforcement learning by soft data augmentation,” arXiv preprint arXiv:2011.13389, 2020.
[2] A. Srinivas, M. Laskin, and P. Abbeel, “Curl: Contrastive unsupervised representations for reinforcement learning,” arXiv preprint arXiv:2004.04136, 2020.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” nature, vol. 518, no. 7540, pp. 529–533, 2015.
[4] K. Schmeckpeper, O. Rybkin, K. Daniilidis, S. Levine, and C. Finn, “Reinforcement learning with videos: Combining offline observations with interaction,” arXiv preprint arXiv:2011.06507, 2020.
[5] D. Yarats, A. Zhang, I. Kostrikov, B. Amos, J. Pineau, and R. Fergus, “Improving sample efficiency in model-free reinforcement learning from images,” arXiv preprint arXiv:1910.01741, 2019.
[6] J. Ibarz, J. Tan, C. Finn, M. Kalakrishnan, P. Pastor, and S. Levine, “How to train your robot with deep reinforcement learning: lessons we have learned,” The International Journal of Robotics Research, vol. 40, no. 4-5, pp. 698–721, 2021.
[7] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al., “Deepmind control suite,” arXiv preprint arXiv:1801.00690, 2018.
[8] Y. Burda, H. Edwards, A. Storkey, and O. Klimov, “Exploration by random network distillation,” arXiv preprint arXiv:1810.12894, 2018.
[9] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[10] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607.
[11] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al., “Bootstrap your own latent: A new approach to self-supervised learning,” arXiv preprint arXiv:2006.07733, 2020.
[12] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[13] X. Chen and K. He, “Exploring simple siamese representation learning,” arXiv preprint arXiv:2011.10566, 2020.
[14] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” arXiv preprint arXiv:2103.03230, 2021.
[15] E. Shelhamer, P. Mahmoudieh, M. Argus, and T. Darrell, “Loss is its own reward: Self-supervision for reinforcement learning,” arXiv preprint arXiv:1612.07307, 2016.
[16] N. Hansen, R. Jangir, Y. Sun, G. Alenyà, P. Abbeel, A. A. Efros, L. Pinto, and X. Wang, “Self-supervised policy adaptation during deployment,” arXiv preprint arXiv:2007.04309, 2020.
[17] N. Hansen, H. Su, and X. Wang, “Stabilizing deep q-learning with convnets and vision transformers under data augmentation,” Advances in Neural Information Processing Systems, vol. 34, 2021.
[18] A. L. Strehl and M. L. Littman, “An analysis of model-based interval estimation for markov decision processes,” Journal of Computer and System Sciences, vol. 74, no. 8, pp. 1309–1331, 2008.
[19] M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, “Unifying count-based exploration and intrinsic motivation,” arXiv preprint arXiv:1606.01868, 2016.
[20] J. Fu, J. D. Co-Reyes, and S. Levine, “Ex2: Exploration with exemplar models for deep reinforcement learning,” arXiv preprint arXiv:1703.01260, 2017.
[21] J. Schmidhuber, “A possibility for implementing curiosity and boredom in model-building neural controllers,” in Proc. of the international conference on simulation of adaptive behavior: From animals to animats, 1991, pp. 222–227.
[22] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros, “Large-scale study of curiosity-driven learning,” arXiv preprint arXiv:1808.04355, 2018.
[23] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning. PMLR, 2018, pp. 1861–1870.
[24] N. Hansen, H. Su, and X. Wang, “Stabilizing deep q-learning with convnets and vision transformers under data augmentation,” 2021.
[25] A. Stone, O. Ramirez, K. Konolige, and R. Jonschkowski, “The distracting control suite–a challenging benchmark for reinforcement learning from pixels,” arXiv preprint arXiv:2101.02722, 2021.

Intrinsically Motivated Self-supervised Learning in Reinforcement Learning

Abstract

I Introduction

II Methodology

III Analysis on Self-supervised Loss in Reinforcement Learning

III-A Notations and Preliminaries

III-B Decomposition and Interpretation of Contrastive Loss

IV Experiments

IV-A Experimental Settings

IV-B Sample Efficiency

IV-C Generalization

IV-D Sparse Cases

V Related Work

VI Conclusions and Future Work

Appendix A Performance Curves

Appendix B Basic SSL-RL Architectures

Appendix C Implementation Details

C-A Baseline Details

C-B Hyper-parameters

C-C Sparse Reward Settings

Appendix D Further Extensions

Appendix E Generalization Benchmarks

E-A Motivations on Generalization Tasks

E-B Performance in Transferring Tasks

Appendix F Sensitivity Analysis on β0\beta_{0}

References

Appendix F Sensitivity Analysis on $\beta_{0}$