Unsupervised Feature Learning for Manipulation with Contrastive Domain Randomization

Carmel Rabinovitz1, Niko Grupen2, and Aviv Tamar1
1Department of Electrical Engineering, Technion, Israel ([email protected], [email protected])
2Department of Computer Science, Cornell University, NY, USA ([email protected])
Abstract

Robotic tasks such as manipulation with visual inputs require image features that capture the physical properties of the scene, e.g., the position and configuration of objects. Recently, it has been suggested to learn such features in an unsupervised manner from simulated, self-supervised, robot interaction; the idea being that high-level physical properties are well captured by modern physical simulators, and their representation from visual inputs may transfer well to the real world. In particular, learning methods based on noise contrastive estimation have shown promising results. To robustify the simulation-to-real transfer, domain randomization (DR) was suggested for learning features that are invariant to irrelevant visual properties such as textures or lighting. In this work, however, we show that a naive application of DR to unsupervised learning based on contrastive estimation does not promote invariance, as the loss function maximizes mutual information between the features and both the relevant and irrelevant visual properties. We propose a simple modification of the contrastive loss to fix this, exploiting the fact that we can control the simulated randomization of visual properties. Our approach learns physical features that are significantly more robust to visual domain variation, as we demonstrate using both rigid and non-rigid objects.

I Introduction

If a robot is to perform object manipulation tasks, it must process its sensory inputs to extract physical properties that are relevant to the objects and the task. For example, for picking an object, the robot may extract the pose of the object and its size from a camera image. However, for general objects and tasks, defining the relevant properties can be difficult – what properties should be extracted for manipulating a deformable object such as a rope? And what should be encoded about objects with nontrivial geometries?

If the task (or task distribution) is known in advance and can be described using a reward function, reinforcement learning (RL) can be used to learn a policy for solving the task [1]. By definition, however, the RL agent will only learn features that are relevant for the task it trained on, as specified through the reward function, and the features may not be applicable for other tasks. In addition, for many real world tasks it is difficult to define a reward function [2, 3, 4].

An alternative and promising approach is to learn features in an unsupervised fashion, from self-supervised data of the robot interacting with objects. The idea is that features that are useful for making predictions about the future state of the environment, likely encode relevant object properties, and will therefore be useful for a variety of downstream tasks. A popular method for learning such features is based on contrastive learning: maximizing mutual information between features of past and future observations [5, 6, 7].

However, collecting a diverse set of self-supervised robot interactions requires significant resources [8, 9, 10]. Exploiting the recent advances in physical simulators, Yan et al. proposed to learn physical features using contrastive learning from simulated interaction data [11]. Since simulated data is easy to generate, and no reward function needs to be manually defined for unsupervised learning, the approach in [11] offers a principled and scalable method for learning general physical features, such as the configuration of non-rigid objects.

The features learned in simulation can only be accurate up to the differences between simulation and reality – the sim-to-real gap. To robustify learning, a popular trick is domain randomization (DR) – randomly modifying irrelevant visual properties such as lighting and object textures during training, in the hope of learning features that are invariant to these perturbations. Indeed, [11] demonstrated successful sim-to-real transfer using DR. In this work, however, we claim that a straightforward application of DR to contrastive learning is fundamentally flawed: by carefully formulating the problem, we show that the standard CPC loss function aims to maximize information between the features and both relevant and irrelevant visual properties, and therefore does not learn features that are robustly invariant to the irrelevant randomization. Indeed, the experiments in [11] were designed with a very small sim-to-real gap, and we demonstrate that when this gap is increased, the quality of the features degrades significantly. Our derivation also prescribes a simple fix – by applying a different randomization to past and future observations, which can easily be done in simulation, we guarantee that invariant representations are learned.

We demonstrate that our method, termed Contrastive Domain Randomization (CDR), is able to learn relevant physical features of rigid and non-rigid objects that are highly robust to irrelevant visual perturbations such as background, texture, and lighting. As such, CDR paves the way for learning visual representations of physical properties that are general, robust, and easy to obtain.

II Related Work

Our work sits at the intersection of three areas: domain randomization, contrastive learning, and object manipulation.

II-A Domain Randomization

Domain randomization is a class of domain adaptation techniques [12, 13] that targets the transition of learned models from simulation to the real-world. DR has been successful on vision problems (e.g. object detection, pose estimation [14]) often using a rendering engine to perturb non-essential properties in synthetic scenes [15]. Recent work has extended DR to robotic grasping [16], navigation [17], locomotion [18], and, relevantly, deformable object manipulation [19]. Further methods alter simulation dynamics in addition to visual features [20, 21], randomizing physical properties, such as friction and latency, to learn robust manipulation policies. Implicit in DR is an assumption that the environment is composed of relevant and irrelevant properties, which together generate an observation. The goal is to learn a representation that attends to relevant properties (e.g. physics) while ignoring the irrelevant ones. In this work, we make a connection between the domain randomization objective and the idea of learning invariances, as studied in the causal inference literature [22]. Recent work on invariant risk minimization [23] also aims to learn invariant features. However, our setting controls the simulated domain (the ‘intervention’ in causal inference parlance), enabling more effective representation learning.

II-B Contrastive Learning

Contrastive learning has gained popularity as a method for self-supervised feature learning, and relates to information-theoretic concepts such as mutual information [5, 6, 7]. Leveraging spatially- or temporally-coherent data (e.g., images or videos), contrastive learning techniques construct "pretext" tasks [24] in which a neural network must discriminate between similar and dissimilar input samples [25, 26]. Pretext tasks can involve predicting image patches [27, 5], frames in videos [5], optical flow [6], programs [28], and other visual features [29, 26, 25]. Crucially, contrastive learning can be interpreted as learning a non-linear transformation that is invariant to distortions of the input data [30]. Our method combines these ideas with domain randomization to learn self-supervised domain-invariant features.

II-C Deformable Object Manipulation

We focus on deformable object manipulation approaches that combine predictive models with forward planning. Recent work has proposed learning non-linear deformation functions to assist feedback control [31, 32]. Generative modeling [33, 34] has enabled neural networks to learn richer dynamics that can be used by model-predictive control (MPC) for manipulation [35, 36]. Contrastive learning methods, which represent spatial/temporal patterns in the data as latent vectors, are particularly useful for model-based planning [37, 38]. For example, recent methods applied MPC for cloth and rope manipulation, using a forward model learned using contrastive learning [39, 11]. Our work extends these approaches to planning that is invariant to irrelevant visual properties.

III Background

Our work builds on contrastive learning and domain randomization. To set the stage for our development, we formalize both ideas here under a unified notation.

III-A Contrastive Learning

Contrastive learning methods learn compact representations of high-dimensional observations such as images, video, or audio. Unlike supervised learning, where a parametric model $f$ is trained to maximize the similarity between a prediction $\hat{z}$ and a known label $y$, contrastive learning trains an encoder $f_{\theta}$, parametrized by $\theta$, to map observations $o$ into a latent representation $\hat{z}=f_{\theta}(o)$, and makes predictions directly in the latent space. For example, Contrastive Predictive Coding (CPC) [5] introduces a self-supervised instance discrimination task in which $f_{\theta}$ is used to predict a single positive label $y$ amongst $N-1$ contrastive labels $y_{j}$. This objective, known as the InfoNCE loss, simultaneously maximizes the similarity $h(\hat{z},y)$ between an observation-based prediction $\hat{z}$ and the positive label $y$ while minimizing the similarity $h(\hat{z},y_{j})$ between $\hat{z}$ and the contrastive labels $y_{j}$. As a self-supervised task, positive labels $y$ are computed from the observation itself, while contrastive labels $y_{j}$ are computed from different observations in the data. InfoNCE frames the discrimination task as a standard classification problem through the cross-entropy loss:

$$\mathcal{L}_{\text{INCE}}^{N}=\mathbb{E}_{o}\,\mathbb{E}_{y|o}\left[-\log\frac{h(\hat{z},y)}{\sum_{j=1}^{N}h(\hat{z},y_{j})}\right], \qquad (1)$$

where $N$ represents the total number of positive and negative samples. Prior work showed that minimizing (1) maximizes a lower bound on the mutual information (MI) between the observation $o$ and the representation $\hat{z}$, according to the inequality [5, 6]: $I(o,\hat{z})\geq\log(N)-\mathcal{L}_{\text{INCE}}^{N}$. Note that increasing the number of contrastive samples $N$ improves the lower bound on the mutual information, and in turn, the learnt representations.

Typically, the InfoNCE loss is minimized using stochastic gradient descent, where a batch contains random observations from the data, and the loss for a batch of size NN is:

$$\sum_{i=1}^{N}-\log\frac{h(\hat{z}_{i},y_{i})}{\sum_{j=1}^{N}h(\hat{z}_{i},y_{j})}.$$

That is, we set the contrastive labels for an observation to be the labels of the other observations in the batch.
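To make the batched objective concrete, the following is a minimal PyTorch sketch of this InfoNCE loss, assuming the dot-product similarity $h(z_{i},y_{j})=\exp(z_{i}^{T}y_{j})$ (one of the choices discussed below); the encoder and data pipeline are assumed to exist elsewhere.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """z_hat: (N, d) predictions; y: (N, d) labels. Row i of `y` is the
    positive label for row i of `z_hat`; every other row in the batch
    serves as a contrastive (negative) label, as in Eq. (1)."""
    logits = z_hat @ y.t()                                # logits[i, j] = z_hat_i . y_j
    targets = torch.arange(z_hat.size(0), device=logits.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)               # -log softmax over each row
```

With $h=\exp(\cdot)$, the cross-entropy over the row-wise logits is exactly the per-batch InfoNCE sum above, averaged over the $N$ samples.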

Contrastive samples often leverage spatial, temporal, or semantic consistency in the input data. Given image data $o$, for example, it is possible to generate a prediction $\hat{z}=f_{\theta}(\tau'(o))$ and positive label $y=f_{\theta}(\tau(o))$ by applying separate visual transformations ($\tau'$ and $\tau$, respectively) to a single image $o$. Contrastive labels can then be generated by applying transformations to different images, $y_{j}=f_{\theta}(\tau(o_{j}))$ [7, 25]. For sequential data, the temporal dimension of the input serves as a useful signal with which to generate contrastive labels. CPC follows this approach, defining the instance discrimination task over frames in a video [5]. Specifically, from the latent representation $z_{t}$ of the image $o_{t}$ at time $t$, CPC predicts the latent representation of a frame $k$ steps in the future, $\hat{z}_{t+k}$, using the representation of the true frame $z_{t+k}$ as a positive label, and representations of frames from random time-steps (or different videos altogether) as contrastive labels $z_{j\neq t+k}$. In doing so, CPC simultaneously maximizes the similarity $h(\hat{z}_{t+k},z_{t+k})$ between the predicted and true future frame representations and minimizes the similarity $h(\hat{z}_{t+k},z_{j\neq t+k})$ between the prediction and each of the contrastive sample representations. Key to this approach is the joint optimization of an encoder $z_{t}=f_{\theta}(o_{t})$ and an auto-regressive model $\hat{z}_{t+k}=g_{\phi}(z_{\leq t})$, parametrized by $\phi$, that at time $t$ produces a latent prediction $\hat{z}_{t+k}$ given a summary $z_{\leq t}$ of observations up to time $t$. Our method similarly exploits the temporal structure of video data for contrastive sampling.

Several similarity metrics have been proposed in the literature. A common choice is the weighted dot-product, $\exp(z_{i}^{T}Wz_{j})$ or $\sigma(z_{i}^{T}Wz_{j})$ [7, 25]. The cosine distance and L2 distance have also proven effective [5, 11, 40].

III-B Domain Randomization

Domain randomization is a popular method of simulation-to-real transfer for neural networks [17, 14]. By exposing the network during training to extensive variations of scene parameters that are irrelevant for decision making (e.g. lighting conditions and textures), the goal of DR is to learn robust features that transfer well to the real world. Here, we cast DR in a probabilistic formulation that will serve our subsequent development.

We assume that an observation (either simulated or real) $o(x,e)$ is generated from two independent random variables $x$ and $e$, representing relevant and irrelevant domain properties, respectively. We further assume access to the observation generation process, i.e., given $x$ and $e$, we can generate $o(x,e)$. For example, $x$ can represent the pose of an object, $e$ its texture, and $o(x,e)$ is generated by a rendering engine. Under these assumptions, DR can be framed as a supervised learning problem over the following loss:
$$\mathcal{L}_{\text{DR}}(l)=\mathbb{E}_{e}\,\mathbb{E}_{x}\,\mathbb{E}_{y|x,e}\left[l(o(x,e),y)\right],$$
where $y$ is a label for the observation $o(x,e)$, sampled from $p(y|x,e)$, and $l$ is some supervised learning loss function, such as a regression or classification loss. If $y$ is independent of $e$, optimizing this loss encourages the network to learn a prediction that is agnostic to $e$.

IV Contrastive Domain Randomization (CDR)

In this section, we describe our proposed framework for learning domain-invariant representations: Contrastive Domain Randomization (CDR). We begin by presenting a formalism for incorporating DR into the contrastive learning framework. We then show that a straightforward combination of CPC with DR does not learn domain-invariant representations. Finally, we introduce our method for learning contrastive predictive models and show that, under mild assumptions, the learned representations of our model are in fact domain-invariant.

Refer to caption

Figure 1: A causal model that produces environment observations $o(x,e)$, encodings $z$, and labels $y$. (a) Naive DR applied to InfoNCE, where labels $y$ depend on both the relevant properties $x$ and the irrelevant environmental properties $e$. (b) Contrastive DR, in which $y\perp e$ is enforced by randomly replacing the environmental properties $e$ that generate $y$ with $e'$.

IV-A Problem Formulation

Recall that observations $o(x,e)$ are generated from two independent variables, relevant properties $x$ and irrelevant environmental properties $e$, and $z=f_{\theta}(o(x,e))$ is an encoded representation of $o(x,e)$. As in DR, we assume access to the observation generation process. Let $y(x,e)$ be a random variable we wish to predict from $o(x,e)$, such as the outcome of an action $a_{t}$ or an encoding of future observations. This framework supports two types of causal models: (i) domain-dependent models, in which $y$ is influenced by both $e$ and $x$ (Figure 1a); (ii) domain-independent models, in which $x$ predicts $y$ regardless of interventions over $e$ (Figure 1b). In the next section, we show that straightforwardly applying domain randomization to the CPC objective results in domain-dependent representations.

IV-B Naive CPC does not Learn Domain-Invariant Features

A simple application of DR to CPC is to domain-randomize each sample in our data, and apply CPC on this data [11]. The corresponding loss for a batch of size NN can be written as:

$$\sum_{i=1}^{N}-\log\frac{h(\hat{z}(x_{i},e_{i}),y(x_{i},e_{i}))}{\sum_{j=1}^{N}h(\hat{z}(x_{i},e_{i}),y(x_{j},e_{j}))}, \qquad (2)$$

where $x_{i},e_{i}$ correspond to a random observation $o(x_{i},e_{i})$ in the data, and we denote by $\hat{z}(x_{i},e_{i})$ and $y(x_{i},e_{i})$ the prediction and label that correspond to that observation.

Though Eq. (1) suggests that CPC maximizes the mutual information between the representation $z$ and the observation $o$, inspecting Eq. (2) reveals that nothing in the loss function prevents both $y$ and $z$ from becoming dependent on $e$. That is, Eq. (2) may learn to maximize mutual information between $e$ and $z$, effectively learning features that encode spurious properties of the random domain. In fact, depending on the task, $e$ might serve well in distinguishing positive labels from contrastive labels! Indeed, we empirically show that this naive CPC formulation learns representations that depend on the random domain.

An alternative approach to Eq. (2) is to contrast between different observations in the data that share the same domain ee. This will prevent CPC from learning to discriminate between labels based on ee. The corresponding loss for a batch is:

$$\sum_{i=1}^{N}-\log\frac{h(\hat{z}(x_{i},e),y(x_{i},e))}{\sum_{j=1}^{N}h(\hat{z}(x_{i},e),y(x_{j},e))}, \qquad (3)$$

where $x_{i},e$ correspond to random observations from the data that share the same domain $e$. While this approach is appealing, it suffers from two technical problems. First, SGD-based optimization tends to work well when samples in a batch are uncorrelated, whereas here all samples in a batch share the same domain. Second, this method requires generating at least $N$ different observations for each combination of domain properties $e$, which becomes prohibitively expensive when we want high variation in $e$.

IV-C Domain-Invariant CPC

We propose a novel contrastive loss function that learns domain-invariant representations while preserving high sample efficiency. Our method exploits domain randomization to remove the dependence between the predictive variables $y$ and the spurious domain features $e$.

First, we sample $\{x,e,y\}$ from the joint distribution $p(e)p(x)p(y|x,e)$. Using the observation generation process, we then generate both a positive dataset $D=\{o(x,e),y\}$ and an intervened dataset $D'=\{o',y'\}$, where $o'=o(x,e')$ and $y'=y(x,e')$, and $e'$ is randomly sampled. Our key observation is that since $y'$ is sampled from $p(y'|x,e')$, we ensure that $y'$ is independent of $e$, namely, $y'\perp e$. We next introduce the Contrastive Domain Randomization loss:

$$\sum_{i=1}^{N}-\log\frac{h(\hat{z}(x_{i},e_{i}),y'(x_{i},e'_{i}))}{\sum_{j=1}^{N}h(\hat{z}(x_{i},e_{i}),y'(x_{j},e'_{j}))}, \qquad (4)$$

which is optimized using both $D$ and $D'$. Note that, in contrast to naive CPC (Eq. 2), CDR does not induce learning representations that depend on $e$. In comparison to Eq. 3, however, we can sample from different domains in the same batch, leading to improved efficiency.
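To illustrate how the intervened labels enter the objective, the following PyTorch sketch computes the CDR batch loss of Eq. 4. The simulator hook `render(x, e)` and the randomizer `sample_domain()` are illustrative assumptions standing in for the observation generation process, not an API from the paper, and the dot-product similarity is one choice of $h$.

```python
import torch
import torch.nn.functional as F

def cdr_batch_loss(f_theta, x_batch, e_batch, render, sample_domain):
    # Observations (and hence predictions) keep their original domains e_i.
    o = torch.stack([render(x, e) for x, e in zip(x_batch, e_batch)])
    z_hat = f_theta(o)  # stands in for \hat{z}(x_i, e_i); in the temporal
                        # setting this would also pass through the predictor g_phi

    # Labels are re-rendered under freshly sampled domains e'_i, so that
    # y' is independent of e and similarity cannot be scored on textures.
    o_prime = torch.stack([render(x, sample_domain()) for x in x_batch])
    y_prime = f_theta(o_prime)  # y'(x_i, e'_i)

    logits = z_hat @ y_prime.t()  # h as a dot-product; positives on the diagonal
    targets = torch.arange(len(x_batch), device=logits.device)
    return F.cross_entropy(logits, targets)
```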

IV-D Domain-Invariance Guarantees

We now show that CDR indeed learns domain-invariant representations. We focus on neural-network encoders, and specifically assume that the encoder can be written as
$$f(x,e)=\sigma(W_{x}^{T}\phi_{x}(x)+W_{e}^{T}\phi_{e}(e)+b),$$
that is, the encoder has separable features for $x$ and $e$. In this case, the following proposition shows that the CDR loss will learn to ignore $e$.

Proposition 1

Let $x$ and $e$ be independent random variables defined over spaces $X$ and $E$, respectively. Let $y\in Y$ be a random variable that satisfies $y\perp e$. Consider a family of functions $f$, parameterized by $W_{x}$, $W_{e}$, and $b$, that satisfy:

$$f(x,e)=\sigma(W_{x}^{T}\phi_{x}(x)+W_{e}^{T}\phi_{e}(e)+b), \qquad (5)$$

where $\phi_{x}(x)$ and $\phi_{e}(e)$ are functions that depend solely on $x$ and $e$, respectively, and $\sigma$ is some activation function. Given the following loss:

$$\mathcal{L}(f)=\mathbb{E}_{e}\,\mathbb{E}_{x}\,\mathbb{E}_{y|x}\left[l(f(x,e),y)\right], \qquad (6)$$

an optimal set of parameters is achieved when $\phi_{e}(e)$ is nullified (i.e., $W_{e}=0$).

Proof:

Our proof assumes that $X$ and $E$ are finite, but can be extended to the infinite setting. Define $g(e)$ as the expectation

$$g(e)=\mathbb{E}_{x}\,\mathbb{E}_{y|x}\left[l(f(x,e),y)\right].$$

The objective in (6) can be restated as $\mathcal{L}(f)=\mathbb{E}_{e}\,g(e)$. We can then define $e^{*}=\operatorname{arg\,min}_{e}g(e)$ as the environmental variable that minimizes $g(e)$, and write:

$$\mathcal{L}(f)=\mathbb{E}_{e}\,g(e)=\sum_{e}p(e)g(e)\geq\sum_{e}p(e)g(e^{*})=g(e^{*}).$$

For any parameters $W_{x}$ and $b$, consider the case in which $W_{e}=0$. If we define a new parameter $b'$ as $b'=b+W_{e}^{*T}\phi_{e}(e^{*})$, where $e^{*},W_{e}^{*}=\operatorname{arg\,min}_{e,W_{e}}g(e)$, then we can rewrite:

$$g(e)=\mathbb{E}_{x}\,\mathbb{E}_{y|x}\left[l(\sigma(W_{x}^{T}\phi_{x}(x)+b'),y)\right]=g(e^{*}).$$

It follows that for the set of parameters $\{W_{x}^{opt},b^{opt}\}$ that minimizes $\mathcal{L}(f)$, the parameters $\{W_{x}=W_{x}^{opt},W_{e}=0,b=b'\}$ produce the same loss. ∎

If $f$ learns to produce $\phi_{x}(o(x,e))=\phi_{x}(x)$ and $\phi_{e}(o(x,e))=\phi_{e}(e)$ in any layer of the neural network, then according to Proposition 1 the optimal representation will be domain-invariant. While the assumption of separable features is idealistic, modern neural-network architectures are often expressive enough to separate $x$ and $e$. In such cases, the result above guarantees invariant features. In our experiments, we further demonstrate this result empirically.

V CDR Feature Learning from Videos

In this section, we describe how to use CDR to learn domain-invariant image representations that capture intuitive physical concepts and generalize to the real world. We assume access to a physics simulator that follows real-world physical rules and renders image observations. We consider two experimental paradigms, controlled and uncontrolled, as we describe next.

V-A Uncontrolled environment

In the uncontrolled setting, we only observe sequences of images $\{o_{t}\}$ depicting some physical interaction.

We follow CPC [5] and define $g_{\phi}$ to be an auto-regressive model (a GRU) that summarizes the past observations in the latent space, $z_{\leq t}$, and predicts the next $K$ future latent states, $\hat{z}_{t+k}=W_{k}^{T}\text{GRU}(z_{\leq t})$. We set our positive labels $y'$ to be the true representations at time $t+k$, $z'_{t+k}=f_{\theta}(o(x_{t+k},e'))$, and draw contrastive labels from different time-steps or from different videos. We average the $K$ predictions and optimize:

$$\sum_{i=1}^{N}\left[\frac{1}{K}\sum_{k=1}^{K}-\log\frac{h(\hat{z}_{t+k},z'_{t+k})}{\sum_{j=1}^{N}h(\hat{z}_{t+k},z_{j})}\right]. \qquad (7)$$
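Below is a hedged sketch of one training step for this objective, assuming an encoder `f_theta`, a `torch.nn.GRU`, and a list `heads` of per-offset linear layers standing in for the matrices $W_{k}$; `frames_prime` holds the same trajectories re-rendered under intervened domains $e'$. The batching scheme (negatives taken from the other videos in the batch at the same time index) and all names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def uncontrolled_cdr_step(f_theta, gru, heads, frames, frames_prime, t, K):
    """frames, frames_prime: (T, N, C, H, W) batches of N videos, where
    frames_prime is the same trajectory re-rendered under domains e'."""
    N = frames.shape[1]
    z_past = torch.stack([f_theta(frames[s]) for s in range(t + 1)])  # z_{<=t}
    c_t = gru(z_past)[0][-1]          # GRU summary of z_{<=t}, shape (N, d)

    loss = 0.0
    for k in range(1, K + 1):
        z_hat = heads[k - 1](c_t)             # \hat{z}_{t+k} = W_k^T GRU(z_{<=t})
        z_pos = f_theta(frames_prime[t + k])  # intervened positives z'_{t+k}
        logits = z_hat @ z_pos.t()            # other videos in the batch act
        targets = torch.arange(N, device=logits.device)  # as contrastive labels
        loss = loss + F.cross_entropy(logits, targets)
    return loss / K                           # average over the K predictions
```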

V-B Controlled environment

In the controlled setting, an agent interacts with the environment and we can observe its actions. That is, we observe an observation $o_{t}$ and an action $a_{t}$ at each time-step. Here, we define $g_{\phi}$ to be a 1-step forward model, $\hat{z}_{t+1}=g_{\phi}(z_{t},a_{t})$. We set our positive labels $y'$ to be the true representation of the next frame, $z'_{t+1}=f_{\theta}(o(x_{t+1},e'))$.

Refer to caption

Figure 2: A short trajectory sampled from each simulated environment. Top: two objects colliding in a low-friction environment. Middle: a robotic arm manipulating a deformable rope. Bottom: a small cube agent pushing a hammer and a banana.

VI Experimental Results

In this section, we empirically evaluate our method on both simulated and real world data. Our evaluations are built to answer the following questions:

(1) Does CDR learn to produce both informative and domain-invariant representations?
(2) Is CDR robust to the simulation-reality gap?

Since our representation learning is unsupervised, it is difficult to define a quantitative performance measure to evaluate the quality of the learned representations. To address this challenge, we devised both simulated experiments that measure the representation quality, and real-robot experiments that evaluate the representations when used as input to a downstream visual planning task. In simulation, we designed experiments in which certain physical properties, such as the position and orientation of objects, are necessary for making accurate predictions about the future evolution of the system. In such domains, we know that a good representation learning method must represent these features reliably, and we correspondingly design an evaluation protocol, termed information retrieval, to reflect this. For the real-robot evaluation, we evaluate representations that were trained solely in simulation, by using them as input to an MPC-style planner based on image goals [11]. We term this sim-to-real evaluation planning. We next describe our domains, training details, and results.

Refer to caption

Figure 3: Qualitative comparison of CDR and naive DR in the Uncontrolled Rigid objects environment. We retrieve the nearest neighbour (in latent space) of a real-world OOD reference image from a held-out simulated dataset of 30K images, and report the cosine distance between the features of the reference image and the retrieved one. Note that the CDR latent representations retrieve images that are much more similar to the reference.

Refer to caption

Figure 4: Qualitative comparison of CDR and naive DR in the Controlled environments. Left: deformable rope. Right: rigid objects. We retrieve the nearest neighbour (in latent space) of a real-world OOD reference image from a held-out simulated dataset of 30K images, and report the cosine distance between the features of the reference image and the retrieved one. Note that the CDR latent representations yield images that are much more similar to the reference.
TABLE I: Accuracy in Predicting Physical Properties. Euclidean distance is reported for the uncontrolled rigid environment (lower is better); IOU is reported for all environments (higher is better).

Euclidean distance (Rigid, uncontrolled):

|            | In distribution | OOD         |
|------------|-----------------|-------------|
| Baseline   | 0.45 ± 0.39     | 0.55 ± 0.46 |
| CDR (Ours) | 0.30 ± 0.36     | 0.42 ± 0.48 |

IOU:

|            | Rigid uncontrolled | Rigid controlled | Rope        | Rigid uncontrolled (OOD) | Rigid controlled (OOD) | Rope (OOD)  |
|------------|--------------------|------------------|-------------|--------------------------|------------------------|-------------|
| Baseline   | 0.18 ± 0.14        | 0.19 ± 0.13      | 0.15 ± 0.12 | 0.15 ± 0.14              | 0.15 ± 0.13            | 0.14 ± 0.11 |
| CDR (Ours) | 0.34 ± 0.16        | 0.27 ± 0.12      | 0.24 ± 0.15 | 0.28 ± 0.17              | 0.26 ± 0.13            | 0.22 ± 0.13 |
TABLE II: Invariance to Random Texture

| Metric                               | Method     | Rigid (uncontrolled) | Rigid (controlled) | Rope        |
|--------------------------------------|------------|----------------------|--------------------|-------------|
| Cosine similarity (higher is better) | Baseline   | 0.69 ± 0.22          | 0.80 ± 0.19        | 0.86 ± 0.12 |
|                                      | CDR (Ours) | 0.86 ± 0.21          | 0.91 ± 0.10        | 0.97 ± 0.06 |
| MSE distance (lower is better)       | Baseline   | 0.51 ± 0.44          | 0.80 ± 0.67        | 1.06 ± 0.89 |
|                                      | CDR (Ours) | 0.13 ± 0.20          | 0.12 ± 0.12        | 0.11 ± 0.24 |

VI-A Simulated Environments

We focus on the following environments (see Fig. 2).

Uncontrolled Rigid objects

Two simple rigid objects, a cube or a ball, initialized with random size, position, and mass, are placed in a blocked frame with low friction. At the start of each episode, a force with random direction and magnitude is applied to one of the objects, making it collide with the frame and with the second object over time.

Controlled Rigid objects

Two complex-shaped rigid objects, a hammer and a banana, initialized with random size, position, and orientation, are placed in a blocked frame with high friction. An agent, represented by a small cube or sphere of random size, interacts with the environment by pushing the other objects in random directions. We use a two-dimensional action space: the direction and magnitude, in Cartesian coordinates, of the force applied to the agent at each time-step.

Controlled Deformable Rope

A rope with random length and thickness, represented in simulation as $n\in\{35,\dots,45\}$ connected small spheres with random radius, is pushed in a random direction by a robotic arm. The arm is not visible in the overhead images. We use a four-dimensional action space: the 2-d initial and final positions of the robot end effector. At the start of each episode, the rope state is randomly initialized by applying a random number of random actions.

All our experiments use the Bullet [41] physics simulator with an overhead camera that renders 128×128×3 RGB images as observations. We apply random textures sampled from the Describable Textures Dataset (DTD) [42].

Refer to caption

Figure 5: Real environment: a Panda arm needs to push the red cube toward a goal position represented as an image.

VI-B Real Robot Environment

To evaluate sim-to-real transfer, we use the Franka Emika Panda robotic arm, with an Intel RealSense D415 camera positioned to look diagonally down at the scene (see Fig. 5). We make sure that the arm is not visible in the images by moving it out of sight before taking an image. We also use a simulation of the real Panda robotic arm environment to generate training data and for evaluation. Our manipulation task is moving a randomly sized cube. The actions are similar to those in the Controlled Deformable Rope environment – the Panda robotic arm pushes the cube in a random direction parallel to the table, specified by a start and goal position for the tip of the gripper.

Refer to caption

Figure 6: Sample trajectories using CDR representations. Top: the goal image is taken from the real domain. Bottom: the goal image is generated in simulation.

Refer to caption

Figure 7: Planning results: each row is a trajectory sampled by planning with CDR representations. In each frame, a 1-step MPC planner selects an action, which is then executed by the Panda arm. Goal images can be generated in simulation, while the various real-world environments are dramatically different from the simulated training environments.

VI-C Training and Baselines

Real Robot Experiment

For a fair comparison with the state of the art, we use Contrastive Forward Modeling (CFM) [11] as our baseline. Both the baseline and CDR use the same model: a ResNet18 [43] pre-trained on the ImageNet [44] dataset as the backbone encoder, the CFM forward-model architecture, and the L2 similarity function $h_{i,j}=-\|z_{i}-z_{j}\|^{2}$. We replace the last fully connected (FC) layer of the encoder with an FC layer that compresses the encoder outputs to an 8-dimensional latent vector $z$. (Our code is publicly available at https://github.com/carmelrabinov/cdr.)

We applied DR and generated a set of 4,000 videos with 15 frames each, where textures are sampled independently from DTD. The baseline is trained with the objective presented in Eq. 2, as in CFM, and CDR with the objective presented in Eq. 4. We also use 10% of the training videos as a validation set for early stopping.

Simulated Experiments

For the Controlled environments (Rigid and Deformable), we use the dot-product similarity $h_{i,j}=\exp(z_{i}^{T}z_{j})$ and replace the CFM forward-model architecture, for both the baseline and CDR, with the following: an action encoder, a 2-layer MLP with tanh activations and a hidden layer of size 64, encodes the actions into a 16-dimensional vector; the state encoding $z_{t}$ and the action encoding are then concatenated and fed to a 3-layer MLP with ReLU activations and a skip connection. We found this architecture beneficial for both the baseline and CDR.
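For concreteness, here is a sketch of this forward model. The sizes stated in the text (16-dimensional action encoding, hidden layer of size 64, 8-dimensional latent from the real-robot setup) are used where given; the remaining hidden sizes and the exact placement of the skip connection are our assumptions.

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """1-step forward model z_hat_{t+1} = g_phi(z_t, a_t), as described above."""
    def __init__(self, z_dim: int = 8, action_dim: int = 2, hidden: int = 64):
        super().__init__()
        # Action encoder: 2-layer MLP, tanh activations, hidden size 64 -> 16-d.
        self.action_encoder = nn.Sequential(
            nn.Linear(action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 16), nn.Tanh(),
        )
        # 3-layer ReLU MLP over the concatenated state and action encodings.
        self.mlp = nn.Sequential(
            nn.Linear(z_dim + 16, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim),
        )

    def forward(self, z_t: torch.Tensor, a_t: torch.Tensor) -> torch.Tensor:
        a = self.action_encoder(a_t)
        delta = self.mlp(torch.cat([z_t, a], dim=-1))
        return z_t + delta  # skip connection from z_t (residual prediction)
```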

For the Uncontrolled Rigid objects environment, we use a bilinear similarity and replace the CFM forward-model architecture with a GRU that summarizes past observations and predicts $K$ steps into the future, as in CPC, i.e., $\hat{z}_{t+k}=W_{k}^{T}\text{GRU}(z_{\leq t})$. We use the objective in Eq. 7 with $K=6$.

We generated a set of 10,000 videos with 30 frames each for both rigid environments, and a set of 8,000 videos with 15 frames each for the deformable rope environment. DTD is divided into 47 categories; we use 3 categories as out-of-distribution (OOD) test textures and sample uniformly from the rest during training.

VI-D Information Retrieval Results

Our goal is to learn representations that capture the relevant properties of the scene and ignore spurious features. To evaluate this behavior on out-of-distribution data, we test the ability to retrieve observations that share similar physical properties (e.g. position, orientation, size) but come from different domains. Given a test image $o$, we first encode it into $z=f_{\theta}(o)$, and retrieve its latent nearest neighbour from an external dataset $D_{ext}$, unseen during training: $z_{NN}=\operatorname{arg\,min}_{z'\in D_{ext}}h(z,z')$. For $h$, we use the cosine distance, as we found it to perform well for both CDR and the baseline. We evaluate test images both in distribution (following the terminology of Sec. III-B, with $e$ seen in training but different $x$) and OOD (both $e$ and $x$ were not seen in training). To evaluate performance, we measure the Intersection over Union (IOU) of the object pixels between the nearest neighbour and the test image (higher is better). For the rigid object environments, we also report the sum of Euclidean distances between the objects in the test image and those in its retrieved nearest neighbour (lower is better). Table I shows quantitative results of CDR against the baseline. Note that CDR outperforms the baseline by a large margin.
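A minimal sketch of this retrieval protocol, assuming the external dataset has been pre-encoded into a tensor of latent vectors (names are illustrative):

```python
import torch
import torch.nn.functional as F

def retrieve_nearest(f_theta, o_test: torch.Tensor, D_ext_z: torch.Tensor) -> int:
    """o_test: a single (C, H, W) image; D_ext_z: (M, d) pre-encoded latents.
    Returns the index of the latent nearest neighbour under cosine similarity."""
    z = f_theta(o_test.unsqueeze(0))                 # (1, d) test encoding
    sims = F.cosine_similarity(z, D_ext_z, dim=-1)   # (M,) similarity scores
    return int(sims.argmax())                        # z_NN = argmax similarity
```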

To test whether our representations are in fact domain-invariant, we randomly sample observations from different domains but with the same object configuration, $o_{1}=o(x,e_{1})$ and $o_{2}=o(x,e_{2})$, and compare the distance between them in the latent space, i.e., $h(o_{1},o_{2})$. Table II shows that CDR is significantly more domain-invariant compared to the baseline.

To evaluate sim-to-real transfer, we use real-world images that are very different from the simulated data in both domain appearance and physical properties (see Fig. 3 and Fig. 4). We then qualitatively evaluate our ability to retrieve similar simulated data. In Fig. 3 we compare the ability of CDR and the baseline to retrieve observations for a challenging reference image with very different lighting and frame. Though the retrieved positions are not always perfect, CDR tends to correctly capture the relative positions and sizes of the objects, an essential property for predicting the outcome of a collision. For the controlled rigid objects data, we use a hammer and a banana much bigger than the maximal size in simulation. Nevertheless, CDR tends to correctly capture the relationship between the objects and the cube agent. For the rope data, we use a colored rope, and unlike previous work [11], we use a long rope that can form complex shapes. Furthermore, to demonstrate generalization outside lab conditions, we also use the same rope material as the background for some reference images. Note that CDR yields retrievals that are much more accurate than the baseline's.

TABLE III: Euclidean distance (cm) from final state to goal state. Each experiment is given 10 trials to reach a goal. Simulation results are based on 80 experiments; real-robot results with goal images from the real world and from simulation are based on 16 and 21 experiments, respectively.

|            | Simulation: in-domain goal | Simulation: different-domain goal | Real robot: real-world goal | Real robot: simulated goal |
|------------|----------------------------|-----------------------------------|-----------------------------|----------------------------|
| Baseline   | 15.18 ± 9.01               | 15.47 ± 8.20                      | 15.26 ± 9.13                | 10.02 ± 5.88               |
| CDR (Ours) | 9.31 ± 7.93                | 7.93 ± 7.46                       | 7.59 ± 8.04                 | 3.48 ± 3.34                |

VI-E Planning Results

To evaluate whether representations learned with CDR are useful for planning, we test our model on reaching a goal image from a random state using a simple 1-step Model Predictive Control (MPC) planner. The planner is provided with a goal image $o_{goal}$, which can be either a real-world image or a simulated one (see Fig. 6), and encodes it into $z_{goal}=f_{\theta}(o_{goal})$. At each time step, the planner samples a set $A$ of 1000 possible actions that are guaranteed to push the cube in some random direction and with some random magnitude. We feed these into our forward model, along with the representation of the current image $z_{t}=f_{\theta}(o_{t})$, to obtain 1000 latent representations of next states, corresponding to the different actions: $\hat{z}_{t+1}=g_{\phi}(z_{t},a)$. We greedily choose an action to execute based on distance in latent space: $a_{t}=\operatorname{arg\,min}_{a\in A}h(z_{goal},g_{\phi}(z_{t},a))$.
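A sketch of this planning loop for a single step follows. `sample_push_actions` is a hypothetical helper returning candidate pushes, and the L2 distance stands in for $h$ (one choice consistent with the similarity functions used above); all names are illustrative.

```python
import torch

def plan_one_step(f_theta, g_phi, o_t, o_goal, sample_push_actions, n=1000):
    """Greedy 1-step MPC in latent space over n candidate actions."""
    with torch.no_grad():
        z_t = f_theta(o_t.unsqueeze(0))             # (1, d) current latent
        z_goal = f_theta(o_goal.unsqueeze(0))       # (1, d) goal latent
        actions = sample_push_actions(n)            # (n, action_dim) candidates
        z_next = g_phi(z_t.expand(n, -1), actions)  # (n, d) predicted next states
        dists = ((z_next - z_goal) ** 2).sum(-1)    # latent L2 distance to goal
        return actions[dists.argmin()]              # a_t = argmin_a h(z_goal, .)
```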

To evaluate sim-to-real transfer, we use both simulated and real-world goal images, and test the robot in a variety of challenging environments. For example, we add multiple objects to the scene, even though no additional objects were seen during training (see Fig. 7). We measure the Euclidean distance from the final state to the goal state. Table III shows that CDR significantly outperforms the CFM baseline.

VII Conclusion

Learning from simulated data is a promising approach to scale up robot learning to complex tasks. In this work, we proposed a principled robustification of unsupervised representation learning using domain randomization, and demonstrated that it can learn relevant representations that are robust to irrelevant features of the domain appearance. Key to our approach is exploiting the fact that we can intervene on a simulated image, and change some of its irrelevant features.

Our framework is general, and can be used with any physical simulator, any domain randomization technique, and any InfoNCE based representation learning method. As such, it is a promising direction for learning general representations for robotic tasks.

Acknowledgments

The authors wish to thank Shadi Endrawis for helping to set up the Panda arm interface. Aviv Tamar is partly funded by the Israel Science Foundation (ISF-759/19).

References

  • [1] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
  • [2] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proceedings of the twenty-first international conference on Machine learning, 2004, p. 1.
  • [3] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” in Advances in Neural Information Processing Systems, 2017, pp. 4299–4307.
  • [4] A. Singh, L. Yang, K. Hartikainen, C. Finn, and S. Levine, “End-to-end robotic reinforcement learning without reward engineering,” arXiv preprint arXiv:1904.07854, 2019.
  • [5] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  • [6] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” arXiv preprint arXiv:1906.05849, 2019.
  • [7] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” arXiv preprint arXiv:2002.05709, 2020.
  • [8] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 421–436, 2018.
  • [9] L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” in 2016 IEEE international conference on robotics and automation (ICRA).   IEEE, 2016, pp. 3406–3413.
  • [10] S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn, “Robonet: Large-scale multi-robot learning,” arXiv preprint arXiv:1910.11215, 2019.
  • [11] W. Yan, A. Vangipuram, P. Abbeel, and L. Pinto, “Learning predictive representations for deformable objects using contrastive estimation,” arXiv preprint arXiv:2003.05436, 2020.
  • [12] M. Wang and W. Deng, “Deep visual domain adaptation: A survey,” Neurocomputing, vol. 312, pp. 135–153, 2018.
  • [13] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa, “Visual domain adaptation: A survey of recent advances,” IEEE signal processing magazine, vol. 32, no. 3, pp. 53–69, 2015.
  • [14] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 23–30.
  • [15] X. Ren, J. Luo, E. Solowjow, J. A. Ojea, A. Gupta, A. Tamar, and P. Abbeel, “Domain randomization for active pose estimation,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 7228–7234.
  • [16] J. Tobin, L. Biewald, R. Duan, M. Andrychowicz, A. Handa, V. Kumar, B. McGrew, A. Ray, J. Schneider, P. Welinder et al., “Domain randomization and generative models for robotic grasping,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2018, pp. 3482–3489.
  • [17] F. Sadeghi and S. Levine, “Cad2rl: Real single-image flight without a single real image,” arXiv preprint arXiv:1611.04201, 2016.
  • [18] I. Mordatch, K. Lowrey, and E. Todorov, “Ensemble-cio: Full-body dynamic motion planning that transfers to physical humanoids,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2015, pp. 5307–5314.
  • [19] J. Matas, S. James, and A. J. Davison, “Sim-to-real reinforcement learning for deformable object manipulation,” arXiv preprint arXiv:1806.07851, 2018.
  • [20] R. Antonova, S. Cruciani, C. Smith, and D. Kragic, “Reinforcement learning for pivoting task,” arXiv preprint arXiv:1703.00472, 2017.
  • [21] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” in 2018 IEEE international conference on robotics and automation (ICRA).   IEEE, 2018, pp. 1–8.
  • [22] J. Peters, D. Janzing, and B. Schölkopf, Elements of causal inference.   The MIT Press, 2017.
  • [23] M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz, “Invariant risk minimization,” arXiv preprint arXiv:1907.02893, 2019.
  • [24] I. Misra and L. v. d. Maaten, “Self-supervised learning of pretext-invariant representations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6707–6717.
  • [25] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton, “Big self-supervised models are strong semi-supervised learners,” arXiv preprint arXiv:2006.10029, 2020.
  • [26] X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines with momentum contrastive learning,” arXiv preprint arXiv:2003.04297, 2020.
  • [27] O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord, “Data-efficient image recognition with contrastive predictive coding,” arXiv preprint arXiv:1905.09272, 2019.
  • [28] P. Jain, A. Jain, T. Zhang, P. Abbeel, J. E. Gonzalez, and I. Stoica, “Contrastive code representation learning,” arXiv preprint arXiv:2007.04973, 2020.
  • [29] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
  • [30] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2.   IEEE, 2006, pp. 1735–1742.
  • [31] Z. Hu, P. Sun, and J. Pan, “Three-dimensional deformable object manipulation using fast online gaussian process regression,” IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 979–986, 2018.
  • [32] B. Jia, Z. Hu, Z. Pan, D. Manocha, and J. Pan, “Learning-based feedback controller for deformable object manipulation,” arXiv preprint arXiv:1806.09618, 2018.
  • [33] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “Infogan: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in neural information processing systems, 2016, pp. 2172–2180.
  • [34] C. Finn, I. Goodfellow, and S. Levine, “Unsupervised learning for physical interaction through video prediction,” in Advances in neural information processing systems, 2016, pp. 64–72.
  • [35] F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine, “Visual foresight: Model-based deep reinforcement learning for vision-based robotic control,” arXiv preprint arXiv:1812.00568, 2018.
  • [36] T. Kurutach, A. Tamar, G. Yang, S. J. Russell, and P. Abbeel, “Learning plannable representations with causal infogan,” in Advances in Neural Information Processing Systems, 2018, pp. 8733–8744.
  • [37] A. Wang, T. Kurutach, K. Liu, P. Abbeel, and A. Tamar, “Learning robotic manipulation through visual planning and acting,” arXiv preprint arXiv:1905.04411, 2019.
  • [38] A. Nair, D. Chen, P. Agrawal, P. Isola, P. Abbeel, J. Malik, and S. Levine, “Combining self-supervised learning and imitation for vision-based rope manipulation,” in 2017 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2017, pp. 2146–2153.
  • [39] Y. Ding, I. Clavera, and P. Abbeel, “Mutual information maximization for robust plannable representations,” arXiv preprint arXiv:2005.08114, 2020.
  • [40] S. Schneider, A. Baevski, R. Collobert, and M. Auli, “wav2vec: Unsupervised pre-training for speech recognition,” arXiv preprint arXiv:1904.05862, 2019.
  • [41] E. Coumans et al., “Bullet physics library,” Open source: bulletphysics. org, vol. 15, no. 49, p. 5, 2013.
  • [42] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3606–3613.
  • [43] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [44] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   Ieee, 2009, pp. 248–255.