Don’t Touch What Matters: Task-Aware Lipschitz Data Augmentation
for Visual Reinforcement Learning

Zhecheng Yuan1, Guozheng Ma1, Yao Mu2, Bo Xia1, Bo Yuan1,
Xueqian Wang1, Ping Luo2, Huazhe Xu3
1Tsinghua University 2The University of Hong Kong 3Stanford University
[email protected], [email protected]
Abstract

One of the key challenges in visual Reinforcement Learning (RL) is to learn policies that can generalize to unseen environments. Recently, data augmentation techniques aiming at enhancing data diversity have demonstrated strong performance in improving the generalization ability of learned policies. However, due to the sensitivity of RL training, naively applying data augmentation, which transforms each pixel in a task-agnostic manner, may suffer from instability and damage sample efficiency, thus further degrading generalization. At the heart of this phenomenon are the divergent action distribution and high-variance value estimation in the face of augmented images. To alleviate this issue, we propose Task-aware Lipschitz Data Augmentation (TLDA) for visual RL, which explicitly identifies the task-correlated pixels with large Lipschitz constants and only augments the task-irrelevant pixels. To verify the effectiveness of TLDA, we conduct extensive experiments on the DeepMind Control suite, CARLA, and DeepMind Manipulation tasks, showing that TLDA improves both sample efficiency at training time and generalization at test time. It outperforms previous state-of-the-art methods across the 3 different visual control benchmarks (project page: https://sites.google.com/view/algotlda/home).

1 Introduction

Deep Reinforcement Learning (DRL) from visual observations has achieved remarkable success in many domains such as video games Mnih et al. (2015), robotic manipulation Kalashnikov et al. (2018), and visual navigation Zhu et al. (2017). However, it remains challenging to obtain generalizable policies across environments with visual variations due to overfitting Zhang et al. (2018).

Refer to caption
Figure 1: Augmenting observations in a task-agnostic manner (middle) distracts the agent's decision-making and hence damages the agent's asymptotic performance. This problem can be alleviated by task-aware data augmentation (bottom).

Data Augmentation Shorten and Khoshgoftaar (2019) and Domain Randomization Tobin et al. (2017) based approaches are widely used to learn generalizable visual representations. However, recent work Hansen et al. (2021) finds that visual RL faces a dilemma: heavy data augmentation is vital for better generalization, but it causes a significant decrease in both sample efficiency and training stability.

Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Figure 2: Action Distribution. We use t-SNE to show high-dimensional actions employed by the same agent. The grey dots are the actions given the observations without augmentation (w/o aug); the blue dots (a) and orange dots (c) are the actions given the same observations under strong (random_conv) and weak augmentation (random_shift), respectively. The visualized results demonstrate that there is a significant action distribution shift under strong augmentation, while under weak augmentation, the policy is closer to the original distribution. The red dots (b) are TLDA under strong augmentation, which gives rise to an action distribution that remains similar to the grey dots.

One of the main reasons is that data augmentation conventionally performs pixel-level image transformations, where each pixel is transformed in a task-agnostic manner. However, in visual RL, each pixel in the observation has different relevance to the task and the reward function. Hence, it is worth rethinking data augmentation in the new context of visual RL.

To better understand the effect of data augmentation in visual RL, we visualize in Figure 2 the action distributions output by policies trained with various data augmentation choices. We find that the agent's actions vary dramatically when faced with different augmentation methods. Specifically, when weak augmentation such as shifting is applied, the action distribution remains close to the original distribution without augmentation (Figure 2(c)); however, when strong augmentation, e.g., random convolution, is applied, the action distribution drastically changes (Figure 2(a)) and the Q-estimation diverges from that of the un-augmented data, as shown in Figure 3. This comparison reveals the instability caused by applying data augmentation blindly, without task information.

In this work, we propose a task-aware data augmentation method for visual RL that learns to augment the pixels less correlated to the task, namely Task-aware Lipschitz Data Augmentation (TLDA), as shown in Figure 1. A desirable quality for such a method is that it maintains a stable policy output even on augmented observations. Following this insight, we introduce the Lipschitz constant to measure the relevance between a pixel and the task, which then guides the augmentation strategy. Specifically, we first impose a perturbation on a certain pixel and calculate the corresponding Lipschitz constant for that pixel via the policy change before and after the perturbation. Then, to avoid drastic policy changes, we treat pixels with larger Lipschitz constants as task-correlated and avoid augmenting them. Therefore, the output is more stable while the diversity of the augmented data is preserved.

We conduct experiments on 3 benchmarks: DMControl Generalization Benchmark (DMC-GB) Hansen and Wang (2021), CARLA Dosovitskiy et al. (2017), and DMControl manipulation tasks Tunyasuvunakool et al. (2020). We train agents in a fixed environment and evaluate their generalization on environments that are unseen during training. Extensive experiments show that TLDA outperforms prior state-of-the-art methods, with more stable and efficient training and more robust generalization performance.

Our main contributions are summarized as follows:

  • We propose Task-aware Lipschitz Data Augmentation (TLDA), which can be easily applied to any downstream visual RL algorithm without adding auxiliary objectives or additional learnable parameters.

  • We provide theoretical understanding and empirical results showing that TLDA effectively alleviates action distribution shift and high-variance Q-estimation.

  • TLDA achieves competitive or better sample efficiency and generalization ability than previous state-of-the-art methods on 3 different benchmarks.

2 Related Work

Generalization in RL. Researchers have investigated RL generalization from various perspectives, such as different visual appearances Cobbe et al. (2019), dynamics Packer et al. (2018), and environment structures Cobbe et al. (2020). In this paper, we focus on generalization over different visual appearances. Two popular paradigms have been proposed to address the overfitting issue in current visual RL research. The first is to regard generalization as a representation learning problem. For example, the bisimulation metric Ferns et al. (2011) has been used to learn robust representation features Zhang et al. (2020). The other paradigm is to design auxiliary tasks. SODA Hansen and Wang (2021) adds a BYOL-like Grill et al. (2020) architecture and introduces an auxiliary loss that encourages the representation to be invariant to task-irrelevant properties of the environment. In contrast to these previous efforts, our method requires neither a specific metric for representation learning nor additional modules.

Data Augmentation for RL. Data augmentation is an efficient way to improve the generalization of visual RL. RAD Laskin et al. (2020) compares different data augmentation methods and reveals that their benefits to RL tasks are not the same. SECANT Fan et al. (2021) notes that weak augmentation can improve sample efficiency but not generalization ability; it also shows that naively using strong augmentation is likely to cause training divergence, even though generalization improves. Automatic data augmentation is proposed in Raileanu et al. (2021) to make better use of data augmentation. We advocate this paradigm and believe that one crucial factor for improving sample efficiency and generalization lies in the design of data augmentation, namely, how to diversify the input as much as possible while maintaining the invariance of the output. We show how strong augmentation causes action distribution shift and high-variance Q-estimation, and illustrate that our approach effectively alleviates these two problems.

3 Preliminaries

We consider learning in a Markov Decision Process (MDP) formulated by the tuple $\langle\mathcal{S},\mathcal{A},r,\mathcal{P},\gamma\rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R}$ is a reward function, $\mathcal{P}(s_{t+1}\mid s_{t},a_{t})$ is the state transition function, and $\gamma\in[0,1)$ is the discount factor. The goal is to learn a policy $\pi^{*}$ that maximizes the expected cumulative return, $\pi^{*}=\operatorname{argmax}_{\pi}\mathbb{E}_{a_{t}\sim\pi(\cdot\mid s_{t}),\,s_{t}\sim\mathcal{P}}\left[\sum_{t=1}^{T}\gamma^{t}r(s_{t},a_{t})\right]$, starting from an initial state $s_{0}\in\mathcal{S}$ and following the policy $\pi_{\theta}(\cdot\mid s_{t})$ parameterized by a set of learnable parameters $\theta$. Meanwhile, we expect the learned policy $\pi^{*}_{\theta}$ to generalize well to new environments that share the structure and definition of the original MDP but have a different observation space $\mathcal{O}$ constructed from the same state space $\mathcal{S}$.

Refer to caption
Figure 3: Q-estimation error. Top: We measure the mean squared error of the Q-estimation for different augmented observations vs. the non-augmented observation. The blue and red bars show the error of strongly augmented data and TLDA-augmented data, respectively, relative to the non-augmented data. TLDA significantly reduces the Q-estimation error and thus alleviates the high-variance estimation problem. Bottom: The distribution of Q-estimation. TLDA yields a Q-estimation distribution closer to the original one.

3.1 Data Augmentation

Definition 1 (Optimality-Invariant State Transformation)

Given an MDP $\mathcal{M}$, we define an augmentation method $\phi:\mathcal{O}\rightarrow\mathcal{O}^{\prime}$ as an optimality-invariant transformation if $\forall o\in\mathcal{O},a\in\mathcal{A}$, $\phi(o)\in\mathcal{O}^{\prime}$, where $\mathcal{O}^{\prime}$ is a new observation set satisfying:

Q(o,a)=Q(\phi(o),a),\quad\pi(\cdot\mid o)=\pi(\cdot\mid\phi(o)) (1)

A desirable quality for data augmentation is to satisfy the Optimality-Invariant State Transformation even when distortion or distracting noise is added to the observation.

3.2 Lipschitz constant

The Lipschitz constant is frequently used to measure the robustness of a model; here we introduce the Lipschitz continuity of a policy. A function $f:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}$ is Lipschitz continuous on $\mathcal{X}\subseteq\mathbb{R}^{n}$ if there exists a non-negative constant $K\geq 0$ such that

\|f(x)-f(y)\|\leq K\|x-y\|\quad\text{for all }x,y\in\mathcal{X} (2)

The smallest such $K$ is called the Lipschitz constant of $f$ Pauli et al. (2021).

Definition 2 (Lipschitz constant of the policy)

Assume the state space is equipped with a distance metric $d(\cdot,\cdot)$. Under a certain augmentation method $\phi$, the Lipschitz constant of a policy $\pi$ is defined as follows:

K_{\pi}=\sup_{s\in\mathcal{S}}\frac{D_{TV}\left(\pi\left(\cdot\mid\phi(s)\right)\,\|\,\pi\left(\cdot\mid s\right)\right)}{d(\phi(s),s)} (3)

where $D_{TV}(P\,\|\,Q)=\frac{1}{2}\sum_{a\in\mathcal{A}}|P(a)-Q(a)|$ is the total variation distance between distributions. If $K_{\pi}$ is finite, the policy $\pi$ is Lipschitz continuous.
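As a concrete reading of Definition 2, the following minimal sketch estimates $K_{\pi}$ empirically over a finite set of states; the `policy`, `augment`, and `dist` callables are placeholders for this sketch, not part of the paper's implementation.

```python
import numpy as np

def total_variation(p: np.ndarray, q: np.ndarray) -> float:
    # D_TV(P || Q) = 1/2 * sum_a |P(a) - Q(a)| over a discretized action space
    return 0.5 * float(np.abs(p - q).sum())

def estimate_lipschitz_constant(policy, augment, dist, states) -> float:
    """Empirical estimate of K_pi: the supremum in Eq (3) is replaced by a
    maximum over the sampled states."""
    ratios = []
    for s in states:
        s_aug = augment(s)                                 # phi(s)
        num = total_variation(policy(s_aug), policy(s))    # numerator of Eq (3)
        den = dist(s_aug, s)                               # d(phi(s), s)
        if den > 0:
            ratios.append(num / den)
    return max(ratios) if ratios else 0.0
```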

For a given model, a smaller Lipschitz constant generally indicates higher stability against variations of the input Finlay et al. (2018). The following proposition shows that the Q-value estimation error can be bounded in terms of the Lipschitz constant:

Proposition 1

We consider an MDP $\mathcal{M}$, a policy $\pi$, and an augmentation method $\phi$. Suppose the rewards are bounded by $r_{max}$, i.e., $\forall a\in\mathcal{A},\forall s\in\mathcal{S},|r(s,a)|\leq r_{max}$, and the state space is equipped with a distance metric $d(\cdot,\cdot)$. Then the following inequality holds, where $\left\|d(\phi)\right\|_{\infty}=\sup_{s\in\mathcal{S}}d(\phi(s),s)$:

\left|Q^{\pi}(s,a)-Q^{\pi}(\phi(s),a)\right|\leq\frac{2r_{max}\left(K_{\pi}\left\|d(\phi)\right\|_{\infty}+1\right)}{1-\gamma} (4)

The formal statement and the proof are given in Appendix A. This proposition indicates that a smaller Lipschitz constant under a given augmentation yields a tighter bound on the Q-value estimation error, and hence lower variance when applying data augmentation.
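To make the role of $K_{\pi}$ in Eq (4) concrete, here is a small numeric sketch of the bound; the input values are illustrative assumptions, not numbers from the paper.

```python
def q_error_bound(r_max: float, gamma: float, k_pi: float, d_inf: float) -> float:
    # Right-hand side of Eq (4): 2 * r_max * (K_pi * ||d(phi)||_inf + 1) / (1 - gamma)
    return 2.0 * r_max * (k_pi * d_inf + 1.0) / (1.0 - gamma)

# Halving the Lipschitz constant tightens the bound on the Q-estimation error.
print(q_error_bound(r_max=1.0, gamma=0.99, k_pi=2.0, d_inf=0.1))  # 240.0
print(q_error_bound(r_max=1.0, gamma=0.99, k_pi=1.0, d_inf=0.1))  # 220.0
```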

4 Method

To maintain training stability and improve generalization ability, we propose Task-aware Lipschitz Data Augmentation (TLDA), an efficient and general task-aware data augmentation method for visual RL.

Refer to caption
Figure 4: Overview of TLDA. This figure shows two examples of TLDA and the pipeline implementing it. The agent generates the K-matrix for a stacked frame and preserves the areas with larger Lipschitz constants when applying strong augmentation. The preserved areas are highlighted in the K-matrix.

4.1 Construct the K-matrix

We first calculate the Lipschitz constant from perturbed input images. By using a kernel to perturb the original image $o\in\mathbb{R}^{H\times W}$, we obtain a perturbed image denoted as $A(o)$. Next, as in Eq (5), we select the pixels of $A(o)$ centered at location $(i,j)$, denoting the result as $\Phi(o,i,j)$. Specifically, we use the Hadamard product $\odot$ with an image mask $M(i,j)\in(0,1)^{H\times W}$ to select the perturbed pixels around location $(i,j)$:

\Phi(o,i,j)=o\odot(1-M(i,j))+A(o)\odot M(i,j) (5)

To derive the Lipschitz constant, we use the notation $d(\Phi(o,i,j),o)$ to represent the distance between the input $o$ and $\Phi(o,i,j)$ under the metric $d(\cdot,\cdot)$. As in Definition 2, for a given observation $o$, the Lipschitz constant of pixel $(i,j)$ can be computed as follows:

K_{ij}^{\pi}=\frac{D_{TV}\left(\pi\left(\cdot\mid\Phi(o,i,j)\right)\,\|\,\pi(\cdot\mid o)\right)}{d(\Phi(o,i,j),o)} (6)

where the numerator can be interpreted as the distance between the two action distributions $\pi(\cdot\mid\Phi(o,i,j))$ and $\pi(\cdot\mid o)$, and the denominator is the distance between the original observation and the perturbed one.
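A minimal sketch of Eqs (5)-(6) for a single pixel follows, assuming a single-channel observation, a precomputed perturbed image $A(o)$, and placeholder `policy` and `dist` callables:

```python
import numpy as np

def perturb_at(o: np.ndarray, a_o: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Eq (5): keep o outside the mask, take the perturbed image A(o) inside it
    return o * (1.0 - mask) + a_o * mask

def pixel_lipschitz(policy, dist, o: np.ndarray, a_o: np.ndarray, mask: np.ndarray) -> float:
    # Eq (6): ratio of the action-distribution gap to the observation distance
    phi_o = perturb_at(o, a_o, mask)
    numerator = 0.5 * np.abs(policy(phi_o) - policy(o)).sum()  # D_TV of discretized distributions
    return float(numerator / dist(phi_o, o))
```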

With the per-pixel Lipschitz constants in hand, we then construct a matrix that reflects the task-relevance information over the whole observation. Arranging $K_{ij}^{\pi}$ into a matrix of the same size as $o$ following Eq (7), we denote this matrix the K-matrix:

\textit{K-matrix}\triangleq\begin{bmatrix}K_{11}^{\pi}&K_{12}^{\pi}&\cdots&K_{1n}^{\pi}\\ K_{21}^{\pi}&K_{22}^{\pi}&\cdots&K_{2n}^{\pi}\\ \vdots&\vdots&\ddots&\vdots\\ K_{m1}^{\pi}&K_{m2}^{\pi}&\cdots&K_{mn}^{\pi}\end{bmatrix} (7)

We aim to capture the task-related locations, i.e., those with large Lipschitz constants, which tend to cause high variance in the policy/value output under the same level of perturbation.

4.2 Task-Aware Lipschitz Augmentation (TLDA) with the K-matrix

Intuitively, data augmentation should not modify the task-related pixels indicated by large Lipschitz constants. We follow this intuition and propose a simple yet effective way to decide which areas can be modified. We use the mean value of the K-matrix as a threshold and binarize the K-matrix as follows, where $N$ is the number of pixels ($H\times W$) and $K^{mean}=\frac{1}{N}\sum_{ij}K_{ij}^{\pi}$:

M_{ij}^{K}=\begin{cases}1,&\text{if }K_{ij}^{\pi}\geq K^{mean}\\ 0,&\text{otherwise}\end{cases} (8)

The obtained mask $M^{K}$ is used to decide which pixels can be augmented. For any data augmentation method $o^{\prime}=\text{Aug}(o)$, we apply the following operation:

\tilde{o}=M^{K}\odot o+(1-M^{K})\odot o^{\prime} (9)

We note that the output $\tilde{o}$ is only modified in areas that have low relevance to the task.
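A minimal sketch of Eqs (8)-(9), assuming the K-matrix and the observation share the same spatial shape (stacked frames would broadcast the mask over channels):

```python
import numpy as np

def tlda_augment(o: np.ndarray, o_aug: np.ndarray, k_matrix: np.ndarray) -> np.ndarray:
    mask = (k_matrix >= k_matrix.mean()).astype(o.dtype)  # M^K, Eq (8)
    return mask * o + (1.0 - mask) * o_aug                # o_tilde, Eq (9)
```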

Refer to caption
Figure 5: Sample efficiency in the training environment. We compare TLDA, SVEA, and random patch under two kinds of augmentation. The top and bottom rows show episode-return training curves under random convolution and random overlay, respectively. TLDA (red line) shows better sample efficiency during training. Mean and standard deviation over 5 runs.
Algorithm 1 Task-aware Lipschitz Data Augmentation (TLDA)
1:  Denote the network parameters as $\theta$, $\psi$
2:  Denote the momentum coefficient $\tau$, batch size $N$, strong augmentation $\mathcal{F}$, replay buffer $\mathcal{B}$
3:  for every update iteration do
4:     Sample a batch of $N$ observations from $\mathcal{B}$
5:     for $i=1,2,\ldots,N$ do
6:        Apply the strong augmentation: $o^{\prime}_{i}=\mathcal{F}(o_{i})$
7:        for each pixel do
8:           Calculate $K_{ij}^{\pi}$ based on the distance between $\pi(\cdot\mid\Phi(o_{i},i,j))$ and $\pi(\cdot\mid o_{i})$
9:        end for
10:        Arrange $K_{ij}^{\pi}$ into the K-matrix
11:        Get the preserved locations via the mask $M^{K}(o_{i})$ based on Eq (8)
12:        Acquire the TLDA output based on Eq (9)
13:     end for
14:     Optimize $\mathcal{L}_{Q}(\theta)$ w.r.t. $\theta$
15:     Update $\psi\leftarrow(1-\tau)\psi+\tau\theta$
16:  end for

As mentioned above, our approach preserves the pixels with large $K_{ij}^{\pi}$ and augments only the pixels associated with small ones, which adds an implicit constraint that keeps the output of the policy and value networks stable. Hence, it echoes the Optimality-Invariant State Transformation in Definition 1. Figure 4 shows the overall framework of TLDA. During training, the K-matrix is computed on the fly at every training step on augmented observations. Take cutout (adding a black patch to the image) in Figure 4 as an example: since the corresponding K-matrix shows that the upper part of the robot's body features large Lipschitz constants, blindly augmenting the image might touch the pixels in this area and cause catastrophic action/value changes. In contrast, TLDA preserves the critical parts of the original observations indicated by the K-matrix, which helps maintain the stability of the action/value outputs.

4.3 Reinforcement Learning with TLDA

We use Soft Actor-Critic (SAC) as the base reinforcement learning algorithm for TLDA. Similar to previous work, we add a regularization term $\mathcal{R}_{Q}(\theta)$ to the SAC critic loss $\mathcal{J}_{Q}(\theta)$ to handle augmented data. Our critic loss $\mathcal{L}_{Q}(\theta)$ is as follows, where $s_{t}^{\text{aug}}$ is computed by Eq (9) and $\hat{Q}(s_{t},a_{t})=r(s_{t},a_{t})+\gamma\mathbb{E}_{s_{t+1}\sim\mathcal{P}}\left[V(s_{t+1})\right]$:

\mathcal{L}_{Q}(\theta)=\mathcal{J}_{Q}(\theta)+\lambda\mathcal{R}_{Q}(\theta) (10)

with

\mathcal{J}_{Q}(\theta)=\mathbb{E}_{(s_{t},a_{t})\sim\mathcal{D}}\left[\frac{1}{2}\left(Q_{\theta}(s_{t},a_{t})-\hat{Q}(s_{t},a_{t})\right)^{2}\right]
\mathcal{R}_{Q}(\theta)=\mathbb{E}_{(s_{t},a_{t})\sim\mathcal{D}}\left[\frac{1}{2}\left(Q_{\theta}(s_{t}^{\text{aug}},a_{t})-\hat{Q}(s_{t},a_{t})\right)^{2}\right]

The instantiated RL algorithm is shown in Algorithm 1, and more implementation details are summarized in Appendix B.
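A minimal PyTorch-style sketch of the critic objective in Eq (10); the `critic` module, the detached target `q_target`, and the use of a mean squared error in place of the expectation are assumptions of this sketch rather than the paper's exact implementation.

```python
import torch.nn.functional as F

def tlda_critic_loss(critic, obs, obs_aug, action, q_target, lam: float = 1.0):
    # J_Q: standard SAC critic loss on the (weakly augmented) observation
    j_q = F.mse_loss(critic(obs, action), q_target)
    # R_Q: regularizer tying the TLDA-augmented observation to the same target
    r_q = F.mse_loss(critic(obs_aug, action), q_target)
    return j_q + lam * r_q  # Eq (10)
```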

5 Experiment

In this section, we explore how TLDA can affect the agent’s sample efficiency and generalization performance. We compare our method with other baselines on a wide spectrum of tasks including DeepMind control suite, CARLA simulator, as well as DeepMind Manipulation tasks. We also ablate TLDA and investigate its effect on action distributions and value estimation.

Table 1: DMC-GB Generalization Performance. We report the episode return in test environments. The agents are trained in a fixed environment and evaluated on two unseen test environments, i.e., random colors (bottom) and video backgrounds (top). Our method achieves competitive or better performance in 7 out of 10 tasks. The best result per row is marked with *.
Setting: video backgrounds

Task               | DrQ     | PAD    | SVEA (conv) | SVEA (overlay) | SODA (conv) | SODA (overlay) | TLDA (conv) | TLDA (overlay)
Cartpole, Swingup  | 485±105 | 521±76 | 606±85      | 782±27*        | 474±143     | 758±62         | 607±74      | 671±57
Walker, Stand      | 873±83  | 935±20 | 795±70      | 961±8          | 903±56      | 955±13         | 962±15      | 973±6*
Walker, Walk       | 682±89  | 717±79 | 612±144     | 819±71         | 635±48      | 768±38         | 873±34*     | 868±63
Ball_in_cup, Catch | 318±157 | 436±55 | 659±110     | 871±106        | 539±111     | 875±56         | 887±58*     | 855±56
Cheetah, Run       | 102±30  | 206±34 | 292±32      | 249±20         | 229±29      | 223±32         | 356±52*     | 336±57

Setting: random colors

Task               | DrQ     | PAD    | SVEA (conv) | SVEA (overlay) | SODA (conv) | SODA (overlay) | TLDA (conv) | TLDA (overlay)
Cartpole, Swingup  | 586±52  | 630±63 | 837±23*     | 832±23         | 831±21      | 805±28         | 748±40      | 760±60
Walker, Stand      | 770±71  | 797±46 | 942±26      | 933±24         | 930±12      | 893±12         | 919±24      | 947±26*
Walker, Walk       | 520±91  | 468±47 | 760±145     | 749±61         | 697±66      | 692±68         | 753±83      | 823±58*
Ball_in_cup, Catch | 365±210 | 563±50 | 961±7*      | 959±5          | 892±37      | 949±19         | 932±32      | 930±40
Cheetah, Run       | 100±27  | 159±28 | 264±51      | 273±23         | 294±34      | 238±28         | 371±51*     | 358±25

5.1 Evaluation on DeepMind Control Suite

Setup. We implement our method with SAC as the base algorithm. Convolutional Neural Networks are used for the image inputs. We include a detailed description of all hyper-parameters and the architecture in Appendix B. For comparison, we mainly consider the two augmentations applied in prior state-of-the-art methods: random convolution (passing the input through a random convolutional layer) and random overlay (linearly combining the observation $o$ with an extra image $\mathcal{I}$, i.e., $\phi(o)=\alpha o+(1-\alpha)\mathcal{I}$).
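For reference, minimal sketches of the two strong augmentations follow, assuming image batches of shape (N, C, H, W); the exact weight initialization and output normalization used by prior work may differ.

```python
import torch
import torch.nn.functional as F

def random_conv(obs: torch.Tensor) -> torch.Tensor:
    # Pass the observation through a freshly sampled 3x3 convolution
    c = obs.shape[1]
    weight = torch.randn(c, c, 3, 3, device=obs.device) / 9.0
    return F.conv2d(obs, weight, padding=1)

def random_overlay(obs: torch.Tensor, distract: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    # phi(o) = alpha * o + (1 - alpha) * I, with I drawn from an external dataset
    return alpha * obs + (1.0 - alpha) * distract
```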

Baselines. We benchmark TLDA against the following state-of-the-art methods: (1) DrQ Kostrikov et al. (2020): SAC with weak augmentation (random shift); (2) PAD Hansen et al. (2020): adding an auxiliary task for adapting to the unseen environment; (3) SODA Hansen and Wang (2021): maximizing the mutual information between latent representation by employing a BYOL-like Grill et al. (2020) architecture; (4) SVEA Hansen et al. (2021): modifying the form of Q-target. We run 5 random seeds and report the mean and standard deviation of episode rewards.

Sample efficiency under strong augmentations. We compare the sample efficiency with SVEA to exhibit the effectiveness of TLDA.

We also include another baseline that preserves random patches of the un-augmented observation, as opposed to TLDA, which preserves task-related parts; we call this baseline random patch (see the sketch below). By contrast, SVEA only uses the strong augmentation and retains no raw pixels. Figure 5 shows that TLDA achieves better or comparable asymptotic performance than the baselines in the DMControl training environments, while having better sample efficiency. The results also indicate that random patch hinders performance in some tasks. We reckon that, since random patch has no knowledge of pixel-to-task relevance, it inevitably destroys the image's integrity and may even further distort the observations after data augmentation. Therefore, blindly keeping parts of the original observation does not improve the agent's training performance; it is the retention of areas with larger Lipschitz constants, rather than random areas, that boosts the sample efficiency of training agents.
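A sketch of the random patch baseline under the stated assumptions (batched observations; the patch size is an illustrative choice, not taken from the paper):

```python
import torch

def random_patch(obs: torch.Tensor, obs_aug: torch.Tensor, patch: int = 24) -> torch.Tensor:
    # Keep a randomly located square of raw pixels, use the strong augmentation elsewhere
    n, _, h, w = obs.shape
    mask = torch.zeros(n, 1, h, w, device=obs.device)
    for i in range(n):
        top = int(torch.randint(0, h - patch + 1, (1,)))
        left = int(torch.randint(0, w - patch + 1, (1,)))
        mask[i, :, top:top + patch, left:left + patch] = 1.0
    return mask * obs + (1.0 - mask) * obs_aug
```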

Generalization Performance. We evaluate the agent's generalization ability on two settings from DMControl-GB Hansen and Wang (2021): (i) random colors of the background and agent; (ii) dynamic video backgrounds. Results are shown in Table 1. TLDA outperforms prior state-of-the-art methods in 7 out of 10 instances. The agent trained with TLDA acquires a robust policy across different unseen environments. Meanwhile, we notice that prior methods are sensitive to the choice of augmentation, which makes their test performance vary dramatically. On the contrary, our method with task-aware observations is more stable and not susceptible to this issue.

Qualitative Results of TLDA. As shown in Figure 6, in the K-matrix on the test environments, agents trained with TLDA assign larger Lipschitz constants to the robot's body, while SVEA agents are prone to focus on the distracting video background. Our method learns the main factors that influence performance and neglects the irrelevant areas that hinder generalization.

Refer to caption
Figure 6: Visualization of the K-matrix in Generalization. We visualize the K-matrix (in red) of the same observation frame for SVEA (bottom row) and TLDA (top row) during generalization. The K-matrix calculated for SVEA highlights the video background, while TLDA still focuses on the robot body.

Effect on Action Distribution and Q-estimation. In this section, we analyze how TLDA influences the output of the policy and value networks. Given a DrQ agent trained in the original environment, we assess the Q-value estimation and the action distribution under different augmentations. To better understand this issue, we visualize the action distribution of the agent under different augmentation methods, as shown in Figure 2. For weak augmentation, although its action distribution is closest to the un-augmented one (Figure 2(c)), it cannot improve generalization, as shown in Table 1 (DrQ). Strong augmentation, on the other hand, causes an obvious distribution shift (Figure 2(a)), thus significantly hindering the training process.

We find that TLDA, by using the Lipschitz constant to identify and preserve task-aware areas, yields an action distribution closer to the original than simply applying strong augmentation (Figure 2(b)). Furthermore, as shown in Figure 3, the Q-estimation of TLDA has lower variance than that of naively applying strong augmentation. These two results illustrate that TLDA has the potential to achieve higher sample efficiency in training and to learn a more robust policy that performs well in unseen environments.

5.2 Evaluation on Autonomous Driving in CARLA

To further evaluate TLDA's performance, we apply the method to tasks with more realistic observations: autonomous driving in the CARLA simulator. In our experiment, we use one camera as the input observation for the driving task, where the goal of the agent is to drive along a curvy road as far as possible within 1000 time steps without colliding with moving vehicles, pedestrians, or barriers. We adopt the reward function and the training weather setting of previous work Zhang et al. (2020). The training results are shown in Figure 7: our method achieves the best training sample efficiency. For generalization, CARLA provides different weather conditions with built-in parameters. We evaluate our method in 4 kinds of weather with different lighting conditions, realistic rain, and road slipperiness. Results are in Table 2, where we use the success rate of reaching a 100 m distance as the driving evaluation metric. TLDA outperforms all baseline algorithms in both sample efficiency and generalization ability, yielding a more stable driving policy. Additional results are in Appendix D.3.

Table 2: CARLA Driving. We report the success rate of reaching a 100 m distance under unseen weather, over 250 episodes across 5 seeds for each weather condition (50 episodes per seed).
Setting         | DrQ  | SVEA | Ours
Training        | 24%  | 49%  | 52%
Wet Noon        | 0.8% | 8.8% | 18%
SoftRain Noon   | 0.4% | 1.2% | 7.6%
Wet Sunset      | 0.8% | 1.6% | 9.2%
MidRain Sunset  | 0.0% | 5.2% | 12%
Refer to caption
Figure 7: CARLA Training Performance. We evaluate three algorithms across 5 seeds. TLDA (red line) achieves better performance than SVEA (blue line) and DrQ (green line) in sample efficiency.

5.3 Evaluation on DMC Manipulation Tasks

Robot manipulation is another set of challenging and meaningful tasks for visual RL. DM control Tunyasuvunakool et al. (2020) provides a set of configurable manipulation tasks with a robotic Jaco arm and snap-together bricks. We consider two tasks for experiments: reach and push. More details are in Appendix C.3.

All agents are trained on the default background and evaluated on different colors of arms and platforms. Training results and generalization performance are shown in Appendix D.2 and Table 3. The results show that our method adapts to the unseen environments more effectively. The Modified Platform and Modified Both settings are challenging, as agents must discern the target objects from the noisy backgrounds. SVEA under strong data augmentation suffers from instability and divergence during training, while TLDA augments pixels in a task-aware manner and thus maintains training stability. Although DrQ shows better training performance, it barely generalizes to environments with different visual layouts. Taken together, the sample efficiency and generalization results demonstrate the superiority of the proposed algorithm.

Table 3: DMC Manipulation Tasks. We evaluate the episode return in different modified (M) visual settings; M in the Setting column means Modified. TLDA better focuses on the target objects against noisy and colorful visual backgrounds.
Task  | Setting    | DrQ     | SVEA  | Ours
Reach | Training   | 136±20  | 49±48 | 124±32
Reach | M Arm      | 68±20   | 21±25 | 55±21
Reach | M Platform | 0.8±1.3 | 24±25 | 89±40
Reach | M Both     | 1±2     | 13±14 | 36±25
Push  | Training   | 141±47  | 42±40 | 109±27
Push  | M Arm      | 88±52   | 21±16 | 60±43
Push  | M Platform | 4±1     | 34±28 | 95±33
Push  | M Both     | 5±1     | 32±20 | 56±42

6 Conclusion

In this paper, we propose Task-aware Lipschitz Data Augmentation (TLDA) for visual RL, which can reliably identify and augment pixels that are not strongly correlated with the learning task while keeping task-related pixels untouched. This technique aims to provide a principled mechanism for boosting the generalization ability of RL agents and can be seamlessly incorporated into various existing visual RL frameworks. Experimental results on three challenging benchmarks confirm that, compared with the baselines, TLDA not only features higher sample efficiency but also helps the agents generalize well to the unseen environments.

References

  • Cobbe et al. (2020) Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. 2020. Leveraging procedural generation to benchmark reinforcement learning. In International conference on machine learning, pages 2048–2056. PMLR.
  • Cobbe et al. (2019) Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. 2019. Quantifying generalization in reinforcement learning. In International Conference on Machine Learning, pages 1282–1289. PMLR.
  • Dosovitskiy et al. (2017) Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. 2017. Carla: An open urban driving simulator. In Conference on robot learning, pages 1–16. PMLR.
  • Fan et al. (2021) Linxi Fan, Guanzhi Wang, De-An Huang, Zhiding Yu, Li Fei-Fei, Yuke Zhu, and Anima Anandkumar. 2021. Secant: Self-expert cloning for zero-shot generalization of visual policies. arXiv preprint arXiv:2106.09678.
  • Ferns et al. (2011) Norm Ferns, Prakash Panangaden, and Doina Precup. 2011. Bisimulation metrics for continuous markov decision processes. SIAM Journal on Computing, 40(6):1662–1714.
  • Finlay et al. (2018) Chris Finlay, Jeff Calder, Bilal Abbasi, and Adam Oberman. 2018. Lipschitz regularized deep neural networks generalize and are adversarially robust. arXiv preprint arXiv:1808.09540.
  • Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733.
  • Hansen et al. (2020) Nicklas Hansen, Rishabh Jangir, Yu Sun, Guillem Alenyà, Pieter Abbeel, Alexei A Efros, Lerrel Pinto, and Xiaolong Wang. 2020. Self-supervised policy adaptation during deployment. arXiv preprint arXiv:2007.04309.
  • Hansen et al. (2021) Nicklas Hansen, Hao Su, and Xiaolong Wang. 2021. Stabilizing deep q-learning with convnets and vision transformers under data augmentation. arXiv e-prints, pages arXiv–2107.
  • Hansen and Wang (2021) Nicklas Hansen and Xiaolong Wang. 2021. Generalization in reinforcement learning by soft data augmentation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13611–13617. IEEE.
  • Kalashnikov et al. (2018) Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. 2018. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pages 651–673. PMLR.
  • Kostrikov et al. (2020) Ilya Kostrikov, Denis Yarats, and Rob Fergus. 2020. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649.
  • Laskin et al. (2020) Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. 2020. Reinforcement learning with augmented data. Advances in Neural Information Processing Systems, 33.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. nature, 518(7540):529–533.
  • Packer et al. (2018) Charles Packer, Katelyn Gao, Jernej Kos, Philipp Krähenbühl, Vladlen Koltun, and Dawn Song. 2018. Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282.
  • Pauli et al. (2021) Patricia Pauli, Anne Koch, Julian Berberich, Paul Kohler, and Frank Allgower. 2021. Training robust neural networks using lipschitz bounds. IEEE Control Systems Letters.
  • Raileanu et al. (2021) Roberta Raileanu, Maxwell Goldstein, Denis Yarats, Ilya Kostrikov, and Rob Fergus. 2021. Automatic data augmentation for generalization in reinforcement learning. Advances in Neural Information Processing Systems, 34.
  • Shorten and Khoshgoftaar (2019) Connor Shorten and Taghi M Khoshgoftaar. 2019. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):1–48.
  • Tobin et al. (2017) Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. 2017. Domain randomization for transferring deep neural networks from simulation to the real world. In IROS, pages 23–30. IEEE.
  • Tunyasuvunakool et al. (2020) Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. 2020. dm_control: Software and tasks for continuous control. Software Impacts, 6:100022.
  • Zhang et al. (2020) Amy Zhang, Rowan McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. 2020. Learning invariant representations for reinforcement learning without reconstruction. arXiv preprint arXiv:2006.10742.
  • Zhang et al. (2018) Chiyuan Zhang, Oriol Vinyals, Remi Munos, and Samy Bengio. 2018. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893.
  • Zhu et al. (2017) Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. 2017. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, pages 3357–3364. IEEE.

Appendix A The Proof of Proposition 1

Proposition 1. We consider an MDP $\mathcal{M}$ and an augmentation method $\phi$. Let $\pi$ be a policy. Suppose the rewards are bounded by $r_{max}$ such that $\forall a\in\mathcal{A},\forall s\in\mathcal{S},|r(s,a)|\leq r_{max}$. Then for $\phi$ and $\pi$, the following inequality holds:

\left|Q^{\pi}(s,a)-Q^{\pi}(\phi(s),a)\right|\leq\frac{2r_{max}\left(K_{\pi}\left\|d(\phi)\right\|_{\infty}+1\right)}{1-\gamma} (11)

Proof. We are interested in the Q-value function which is defined as follows:

Q^{\pi}(s,a)=E_{\pi}\left[r_{t+1}+\gamma r_{t+2}+\cdots\mid A_{t}=a,S_{t}=s\right]=\sum_{s}\sum_{a}\sum_{t}\gamma^{t}p^{t}(s)\pi(a|s)r_{t}(s,a) (12)

Let $p_{\phi}^{t}(s)$ be the probability of visiting state $\phi(s)$ at time $t$ and $\pi_{\phi}(a|s)$ be the probability of taking action $a$ in state $\phi(s)$; thus

Q^{\pi}(\phi(s),a)=\sum_{s}\sum_{a}\sum_{t}\gamma^{t}p_{\phi}^{t}(s)\pi_{\phi}(a|s)r_{t}(s,a) (13)

By these definitions, we can write that:

\left|Q^{\pi}(s,a)-Q^{\pi}(\phi(s),a)\right|\leq r_{max}\sum_{s}\sum_{a}\sum_{t}\gamma^{t}\left|p_{\phi}^{t}(s)\pi_{\phi}(a|s)-p^{t}(s)\pi(a|s)\right| (14)
=r_{max}\sum_{s}\sum_{a}\sum_{t}\gamma^{t}\left|p_{\phi}^{t}(s)\pi_{\phi}(a|s)+p_{\phi}^{t}(s)\pi(a|s)-p_{\phi}^{t}(s)\pi(a|s)-p^{t}(s)\pi(a|s)\right|
\leq r_{max}\sum_{s}\sum_{a}\sum_{t}\gamma^{t}\left(p_{\phi}^{t}(s)\left|\pi_{\phi}(a|s)-\pi(a|s)\right|+\pi(a|s)\left|p_{\phi}^{t}(s)-p^{t}(s)\right|\right)
=r_{max}\left(\sum_{s}\sum_{a}\sum_{t}\gamma^{t}p_{\phi}^{t}(s)\left|\pi_{\phi}(a|s)-\pi(a|s)\right|+2\sum_{t}\gamma^{t}D_{TV}\left(p_{\phi}^{t}(\cdot)\,\|\,p^{t}(\cdot)\right)\right)
\leq 2r_{max}\sum_{t}\gamma^{t}\left(\max_{s}D_{TV}\left(\pi_{\phi}(\cdot|s)\,\|\,\pi(\cdot|s)\right)+D_{TV}\left(p_{\phi}^{t}(\cdot)\,\|\,p^{t}(\cdot)\right)\right)

Here $D_{TV}(P\,\|\,Q)=\frac{1}{2}\sum_{a\in\mathcal{A}}|P(a)-Q(a)|$. The above inequality shows that the Q-value estimation error is bounded by two total variation distances (the smaller the total variation distances, the closer the Q-value estimates).

Lemma 2. For a given state $s$ and a given policy $\pi(\cdot|s)$ under the data augmentation method $\phi$, assuming the state space is equipped with a distance metric $d(\cdot,\cdot)$, the following bound holds:

\max_{s}D_{TV}\left(\pi_{\phi}(\cdot|s)\,\|\,\pi(\cdot|s)\right)\leq K_{\pi}\left\|d(\phi)\right\|_{\infty} (15)

Proof.

\max_{s}D_{TV}\left(\pi_{\phi}(\cdot|s)\,\|\,\pi(\cdot|s)\right)=\max_{s}\frac{D_{TV}\left(\pi_{\phi}(\cdot|s)\,\|\,\pi(\cdot|s)\right)}{d(\phi(s),s)}\,d(\phi(s),s) (16)
\leq\max_{s}\frac{D_{TV}\left(\pi_{\phi}(\cdot|s)\,\|\,\pi(\cdot|s)\right)}{d(\phi(s),s)}\,\max_{s}d(\phi(s),s)
=K_{\pi}\left\|d(\phi)\right\|_{\infty}

We also note that $D_{TV}(\cdot\,\|\,\cdot)\in[0,1]$, so according to the bound of Eq (14) we have

\left|Q^{\pi}(s,a)-Q^{\pi}(\phi(s),a)\right|\leq 2r_{max}\sum_{t}\gamma^{t}\left(\max_{s}D_{TV}\left(\pi_{\phi}(\cdot|s)\,\|\,\pi(\cdot|s)\right)+D_{TV}\left(p_{\phi}^{t}(\cdot)\,\|\,p^{t}(\cdot)\right)\right) (17)
\leq 2r_{max}\sum_{t}\gamma^{t}\left(K_{\pi}\left\|d(\phi)\right\|_{\infty}+1\right)
\leq\frac{2r_{max}\left(K_{\pi}\left\|d(\phi)\right\|_{\infty}+1\right)}{1-\gamma}

Appendix B Implementation Details

In this section, we provide details of our algorithm's implementation and hyper-parameter settings. Table 4 lists the hyper-parameters of TLDA on the three benchmarks. Note that calculating the Lipschitz constant for every pixel increases the computational complexity and training time; therefore, during training we calculate the Lipschitz constant every 5 pixels per observation, which is sufficient to obtain good performance. We choose a Gaussian blur as the perturbation kernel and use a 2D Gaussian centered at $(i,j)$ as the mask $M(i,j)$. Since we only change a few perturbed pixels (near pixel $(i,j)$, selected by $M(i,j)$) in the whole image for every location $(i,j)$ when calculating the distance between $\Phi(o,i,j)$ and $o$, we can approximate that this distance does not depend on the specific location $(i,j)$; thus, the Lipschitz constant is proportional to the distance between the two action distributions $\pi(\cdot\mid\Phi(o,i,j))$ and $\pi(\cdot\mid o)$.
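A sketch of the strided K-matrix computation described above, assuming a single-channel observation and a placeholder `policy` that returns a flat action-distribution vector; following the approximation above, only the numerator (the gap between action distributions) is used per location.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_mask(h: int, w: int, i: int, j: int, sigma: float = 3.0) -> np.ndarray:
    # 2D Gaussian bump M(i, j) centered at (i, j), peak value 1
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - i) ** 2 + (xs - j) ** 2) / (2.0 * sigma ** 2))

def strided_k_matrix(policy, o: np.ndarray, stride: int = 5, sigma: float = 3.0) -> np.ndarray:
    h, w = o.shape
    blurred = gaussian_filter(o, sigma=sigma)  # Gaussian blur as the perturbation A(o)
    base = policy(o)
    k = np.zeros((h, w))
    for i in range(0, h, stride):
        for j in range(0, w, stride):
            m = gaussian_mask(h, w, i, j, sigma)
            phi_o = o * (1.0 - m) + blurred * m                  # Eq (5)
            # l2 gap between action distributions fills the stride x stride block
            k[i:i + stride, j:j + stride] = np.linalg.norm(policy(phi_o) - base)
    return k
```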

Meanwhile, we compare different metrics for TLDA, as shown in Figure 8. The results show that the $\ell_{2}$ distance yields a sharper and more concentrated K-matrix, whereas the total variation distance appears blurrier and the KL divergence is darker. We therefore empirically choose the $\ell_{2}$ distance as the metric for calculating the Lipschitz constant in all experiments. DrQ shows that shifting is an effective way to improve sample efficiency, so we first apply shifting to the observation. The linear combination factor $\alpha$ of random overlay is 0.5, with the dataset from DMC-GB. We summarize our method in Algorithm 1; the inner two for-loops can be computed in parallel.
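A sketch of the three candidate metrics for the numerator, assuming discretized action distributions `p` and `q` (continuous policy outputs would be handled accordingly):

```python
import numpy as np

def action_dist_gap(p: np.ndarray, q: np.ndarray, metric: str = "l2") -> float:
    if metric == "l2":   # sharper, more concentrated K-matrix (used in the paper)
        return float(np.linalg.norm(p - q))
    if metric == "tv":   # total variation: blurrier K-matrix
        return float(0.5 * np.abs(p - q).sum())
    if metric == "kl":   # KL divergence: darker K-matrix
        eps = 1e-8
        return float(np.sum(p * np.log((p + eps) / (q + eps))))
    raise ValueError(f"unknown metric: {metric}")
```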

Table 4: Hyper-parameters of TLDA on the 3 benchmarks.
Hyperparameter           | DMControl-GB                          | CARLA       | Manipulation Tasks
Input dimension          | 9 × 84 × 84                           | 9 × 84 × 84 | 9 × 84 × 84
Stacked frames           | 3                                     | 3           | 3
Discount factor γ        | 0.99                                  | 0.99        | 0.99
Action repeat            | 8 (cartpole), 4 (otherwise)           | 4           | 2
Actor learning rate      | 5e-4 (walker walk), 1e-3 (otherwise)  | 1e-3        | 3e-5
Critic learning rate     | 5e-4 (walker walk), 1e-3 (otherwise)  | 1e-3        | 3e-5
Random cropping padding  | 4                                     | 4           | 4
Batch size               | 128                                   | 128         | 128
Regularization term λ    | 1                                     | 1           | 1
Training steps           | 500k                                  | 100k        | 500k
Replay buffer size       | 500,000                               | 100,000     | 500,000
Encoder conv layers      | 4                                     | 4           | 4
Optimizer (θ)            | Adam                                  | Adam        | Adam
Refer to caption
Figure 8: Different metrics for TLDA. We compare the K-matrix under different metrics. The figure shows that the $\ell_{2}$ distance is sharper and more concentrated, whereas the total variation distance appears blurrier and the KL divergence is darker.

Appendix C Environment Details

We conduct our method on three challenging visual control benchmarks, as shown in Figure 9.

Refer to caption
Figure 9: Three Benchmarks for visualization. Top to Bottom: DMC Control Suite, CARLA simulator autonomous driving, and DMC manipulation tasks.

C.1 DeepMind Control Suite

DMC-GB, a popular benchmark modified from the DMControl suite, is introduced in Hansen and Wang (2021) for visual RL. We use 5 typical tasks that support random colors and dynamic video backgrounds. The detailed settings are listed in Table 4.

C.2 CARLA

CARLA is a widely used autonomous driving simulator. In our experiment, we choose the highway of CARLA Town4 as the map for the driving task. The goal of the agent is to drive along a highway as far as possible under diverse weather conditions. We use a stable version of CARLA, 0.9.6 Dosovitskiy et al. (2017), and adopt the reward function and network architecture of Zhang et al. (2020). We use one camera as the input observation, an $84\times 84\times 3$ image. The action consists of two continuous controls: thrust and steering. We choose random overlay as the strong augmentation during training, with linear combination factor $\alpha=0.5$ and the dataset from DMC-GB.

C.3 DeepMind Manipulation Task

The DeepMind Manipulation tasks, introduced in Tunyasuvunakool et al. (2020) for continuous robot control, provide a Kinova robotic arm and a list of objects for building reward functions.

We additionally consider two tasks for the experiments:

  • reach: the agent needs to reach the shown red brick by manipulating the arm;

  • push: the goal is to push the red brick to the position of a white marker point;

The input observation is a stack of RGB images of $84\times 84$ pixels. There are two available observation versions: feature vectors and pixel images. All environments return a reward $r(s,a)\in[0,1]$ per step and have an episode time limit of 10 seconds. It is challenging for a SAC-based RL algorithm to perform well in these tasks without additional tools. We choose the task reach_duplo_vision, which aims to move the arm to a brick resting on the ground, and create a task push_brick_vision, whose goal is to push a brick to a goal position. The reward function of push is based on the reach task, with an added term depending on the distance between the red brick and the goal position: the closer they are, the higher the reward (see the sketch below). We visualize the observations of the two tasks in Figure 10. We choose random overlay as the strong augmentation during training, with linear combination factor $\alpha=0.5$ and the dataset from DMC-GB.
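A hypothetical sketch of how the added distance term for push_brick_vision might look; the exponential shaping, the `scale` parameter, and the averaging to stay in [0, 1] are assumptions, since the paper only states that a smaller brick-to-goal distance yields a higher reward.

```python
import numpy as np

def push_reward(reach_reward: float, brick_pos: np.ndarray, goal_pos: np.ndarray,
                scale: float = 0.1) -> float:
    # Reach-based reward plus a term that increases as the brick approaches the goal;
    # both terms are assumed to lie in [0, 1], so the average also lies in [0, 1].
    distance = float(np.linalg.norm(brick_pos - goal_pos))
    return 0.5 * (reach_reward + np.exp(-distance / scale))
```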

Refer to caption
(a) Reach
Refer to caption
(b) Push
Figure 10: Two tasks of Manipulation. (a) is the reach task; the goal of the agent is to reach the red brick. (b) is the push task; we add a white point as a goal position; the goal of the agent is to push the red brick to the white position.
Refer to caption
Figure 11: Training Performance in Manipulation Tasks. DrQ (green line) shows the best sample efficiency during training. Under strong augmentation, SVEA (blue line) may suffer from training divergence while TLDA (red line) can still maintain stability.

Appendix D Additional Result

D.1 TLDA in DMC

As shown in Figure 12, we provide more comparisons between TLDA and ordinary strong augmentations to show the superiority of our method. We use the same converged DrQ agent under the same seed to evaluate how the augmentation methods influence the agent's performance. More details are available at https://sites.google.com/view/algotlda/home. TLDA effectively helps the agent to alleviate the degradation of performance when facing strong augmentation.

D.2 Training Curves of Manipulation Tasks.

All methods are trained for 500k steps on the Manipulation Tasks, and the training results are shown in Figure 11. Although DrQ shows the best training performance, the generalization comparison in Table 3 demonstrates that our method significantly outperforms the baselines, suggesting that DrQ tends to overfit the training environment and that SVEA easily suffers from divergence on these more challenging benchmarks.

D.3 CARLA

We adopt the reward function and train agents under the same weather setting as previous work Zhang et al. (2020). The training results are shown in Figure 7 and indicate that TLDA outperforms the other baselines in sample efficiency.

In addition to the success rate of reaching a 100 m distance, crash intensity is also an essential driving metric. We evaluate the crash intensity under different weather conditions and report the average value in the training and unseen environments in Table 5. The results show that TLDA has the lowest crash intensity among the baselines in unseen environments.

Table 5: CARLA Crash Intensity. We report the average crash intensity in different unseen environments with dynamic weather conditions across 5 seed runs. A lower crash intensity indicates more stable driving.
Setting         | DrQ  | SVEA | Ours
Training        | 1520 | 1834 | 1862
Wet Noon        | 2550 | 2380 | 1524
SoftRain Noon   | 2463 | 1300 | 975
Wet Sunset      | 1893 | 1227 | 1160
MidRain Sunset  | 1808 | 1561 | 767
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 12: TLDA in the DMControl suite. We compare the converged DrQ agent facing different kinds of augmentation at the same timestep. The results show that TLDA helps the agent achieve better asymptotic performance.