
SNeRL: Semantic-aware Neural Radiance Fields for Reinforcement Learning

Dongseok Shim    Seungjae Lee    H. Jin Kim
Abstract

As previous representations for reinforcement learning cannot effectively incorporate a human-intuitive understanding of the 3D environment, they usually suffer from sub-optimal performance. In this paper, we present Semantic-aware Neural Radiance Fields for Reinforcement Learning (SNeRL), which jointly optimizes semantic-aware neural radiance fields (NeRF) with a convolutional encoder to learn 3D-aware neural implicit representations from multi-view images. We introduce 3D semantic and distilled feature fields in parallel to the RGB radiance fields in NeRF to learn semantic and object-centric representations for reinforcement learning. SNeRL outperforms not only previous pixel-based representations but also recent 3D-aware representations in both model-free and model-based reinforcement learning.


1 Introduction

Developing agents that can achieve complex control tasks directly from image inputs has been a long-standing problem in reinforcement learning (RL). Previous works over the past few years have made notable progress in the data efficiency of learning visual control problems. The most challenging part of solving visual control tasks is obtaining low-dimensional latent representations from high-dimensional observations. To this end, prior works pre-train the encoder in various ways, such as unsupervised representation learning via image reconstruction using offline datasets (Finn et al., 2016; Kulkarni et al., 2019; Islam et al., 2022), contrastive learning (Zhan et al., 2022), reconstructing task information (Yang & Nachum, 2021; Yamada et al., 2022), and enforcing multi-view consistency (Dwibedi et al., 2018). Other approaches utilize joint learning of auxiliary unsupervised tasks (Laskin et al., 2020b; Schwarzer et al., 2020) and data-augmented reinforcement learning (Laskin et al., 2020a; Yarats et al., 2021).

Figure 1: Semantic-aware NeRF for reinforcement learning. We present SNeRL, a reinforcement learning framework that learns 3D-aware representation with a convolutional encoder and semantic-aware NeRF decoder. The latent vectors from the encoder are propagated to the policy network to generate an action for RL agents.

While a number of works have been proposed to improve data efficiency in visual control problems, the majority of the encoders trained by those methods have limited capability in obtaining 3D structural information and lack equivariance to 3D transformations. Such limitations arise because these encoders learn visual representations from a single-view observation and therefore ignore the 3D structure of the scene.

Recently, there have been attempts to consider 3D information of the environment in robot control and manipulation (Li et al., 2022; Driess et al., 2022) by learning implicit spatial representation via neural radiance fields (NeRF) (Mildenhall et al., 2020). They map pixel-level multi-view observations of a scene to a latent vector through an autoencoder structure, where the NeRF decoder provides 3D structured neural scene representation via RGB self-supervision for each view.

Even though these pioneering works achieve better performance than previous RL algorithms with single-view observations, they still do not take full advantage of 3D-aware representation learning. This is because they exploit only RGB supervision to train NeRF, which makes it difficult for the encoder to learn object-centric or semantic representations for RL downstream tasks. Although NeRF-RL (Driess et al., 2022) proposes a compositional NeRF to mitigate this limitation, its RL agents require object-individual masks during both training and deployment to utilize semantic representations, which is unrealistic in practice.

In this work, we propose Semantic-aware Neural Radiance Fields for Reinforcement Learning (SNeRL), which learns both 3D-aware semantic and geometric representations for RL agents in a 3-dimensional environment. First, our proposed method learns a 3D-aware semantic representation by predicting 3D semantic fields with ground-truth labels. As a result, SNeRL enables downstream visual control tasks without object-individual masks and addresses the limitation of the prior work (Driess et al., 2022). Also, to capture further fine-grained features that cannot be fully expressed in semantic fields and to take advantage of data-driven approaches, we employ an off-the-shelf feature descriptor (Caron et al., 2021) as a teacher network and learn to predict feature fields via distillation, following Kobayashi et al. (2022).

We also introduce a multi-view adaptation of recent self-predictive representation learning (Chen & He, 2021) as an auxiliary task, which further improves the performance of SNeRL. In the proposed auxiliary task, SNeRL computes the target representation by utilizing observations from different camera views at the same timestep to learn a spatially consistent representation.

Our proposed SNeRL outperforms not only the previous single-view representation learning algorithms for RL (Laskin et al., 2020b; Yarats et al., 2021) but also the state-of-the-art method with multi-view observations (Driess et al., 2022) in four different visual control tasks.

To sum up, our contribution can be summarized as follows:

  • We present SNeRL, a framework that utilizes NeRF with semantic and distilled feature fields to learn 3D-aware semantic representation for reinforcement learning.

  • We validate the effectiveness of SNeRL with both model-free and model-based methods. To the best of our knowledge, SNeRL is the first work that leverages semantic-aware representations without object masks in RL downstream tasks. It is also the first study to utilize 3D-aware representations in model-based RL.

  • The proposed SNeRL outperforms previous single- and multi-view image-based RL algorithms in four different 3D environments from Meta-world. In addition, the auxiliary self-predictive representation learning with multi-view observations, proposed for spatially consistent representations, enables further improvements.

2 Related Work

2.1 3D Scene Representation Learning

To learn 3D-aware representations from a single-view image, previous methods exploit a standard convolutional autoencoder architecture conditioned on camera poses, which generates scenes from arbitrary views with either deterministic (Tatarchenko et al., 2016; Worrall et al., 2017) or stochastic (Eslami et al., 2018) latent vectors. Recently, neural radiance fields (NeRF) have achieved exceptional progress in understanding 3D scenes and synthesizing novel views. Subsequently, several approaches have proposed latent-conditioned NeRF (Martin-Brualla et al., 2021; Yu et al., 2021; Wang et al., 2021), but the major objective of these methods is to improve the quality of the synthesized images rather than to extract time-variant latent vectors that capture a dynamic 3D understanding of the scene from multi-view inputs. In this paper, we leverage an autoencoder with a convolutional encoder and a NeRF-style decoder (Li et al., 2022; Driess et al., 2022) so that the encoder can extract 3D-aware representations from multi-view inputs for RL downstream tasks.

2.2 Representation Learning for RL

RL frameworks with image inputs typically have an encoder which maps high-dimensional observations to a low-dimensional latent vector. The RL agent is trained over the latent state space to maximize its objective function, e.g., the total discounted reward for each episode. While a number of works have made significant advancements, this still remains a challenging open problem.

To address the sample inefficiency of image-based RL, prior works adopt various data-augmentation techniques (Laskin et al., 2020a; Yarats et al., 2021), contrastive learning with data augmentation (Laskin et al., 2020b; Schwarzer et al., 2020; Stooke et al., 2021; Liu & Abbeel, 2021; Zhan et al., 2022), representation learning from image reconstruction (Islam et al., 2022; Kulkarni et al., 2019), or task information reconstruction (Yang & Nachum, 2021; Yamada et al., 2022). Other approaches propose to capture the relations between multi-view data (Dwibedi et al., 2018; Kinose et al., 2022; Sermanet et al., 2018) or keypoints (Manuelli et al., 2020). There are also approaches leveraging transition sequence data (Hansen et al., 2020; You et al., 2022) or pre-training with offline image-based RL (Wang et al., 2022). Unfortunately, these works have limited capability in learning 3D structural information and cannot obtain the intuitive understanding of 3D environments that humans have, due to the 2D bias inherent in 2D convolutional neural networks.

Recently, there have been attempts to learn the 3D structure of the real world (Li et al., 2022; Driess et al., 2022). Li et al. (2022) first proposed an autoencoder with a convolutional encoder and a NeRF (Mildenhall et al., 2020) decoder for visuomotor control with a learned dynamics model and model-predictive control (MPC). Subsequently, NeRF-RL (Driess et al., 2022) extends this study and is the first to introduce a NeRF-based architecture into the general model-free RL framework. However, these methods could not learn semantic features due to the limited RGB supervision of naïve NeRF. To learn object-centric representations with RGB supervision only, NeRF-RL presents a compositional NeRF with object-individual masks, but requiring masks during the deployment of RL agents is a strong assumption.

In this paper, we propose SNeRL which learns both geometric and semantic information with RGB, semantic, and distilled feature supervision for RL downstream tasks without any object masks during the inference phase.

3 Preliminaries

3.1 Neural Radiance Fields

The concept of neural radiance fields (NeRF) (Mildenhall et al., 2020) is to represent a 3D scene with a learnable and continuous volumetric field $f_{\theta}$. Specifically, at any 3D world coordinate $\mathbf{x}\in\mathbb{R}^{3}$ and unit viewing direction $\mathbf{d}\in\mathbb{R}^{3}$, $f_{\theta}$ estimates the differentiable volume density $\sigma$ and RGB color $\mathbf{c}$: $f_{\theta}(\mathbf{x},\mathbf{d})=(\sigma,\mathbf{c})$. Let the camera ray of a pixel be $\mathbf{r}=\mathbf{o}+t\mathbf{d}$, where $\mathbf{o}$ indicates the camera origin. The corresponding pixel value from an arbitrary view can be rendered through the volumetric radiance field as:

C(\mathbf{r})=\int^{t_{f}}_{t_{n}}T(t)\sigma(t)\mathbf{c}(t)\,dt    (1)

where $T(t)=\mathrm{exp}(-\int^{t}_{t_{n}}\sigma(s)\,ds)$, and $t_{n}$ and $t_{f}$ indicate the pre-defined lower and upper bounds of the depth, respectively.

Then, $f_{\theta}$, which is usually formulated as an MLP, is optimized through pixel-wise RGB supervision from multiple views as:

\mathcal{L}=\sum_{i,j}||\hat{C}(\mathbf{r}_{i,j})-C(\mathbf{r}_{i,j})||_{2}^{2},    (2)

where $\mathbf{r}_{i,j}$ indicates ray $j$ from the image of the $i^{th}$ view, and $\hat{C}$ and $C$ represent the volumetric field rendered into the 2D image and the ground-truth pixel value, respectively.
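For concreteness, the snippet below is a minimal PyTorch sketch of the standard discretized quadrature behind Eqs. (1)-(2); variable names are illustrative and this is not the exact implementation used later in the paper.

import torch

def render_pixel(sigmas, colors, t_vals):
    # sigmas: (N,) densities sampled along a ray between t_n and t_f
    # colors: (N, 3) RGB values predicted at the same samples
    # t_vals: (N,) depths of the samples
    deltas = t_vals[1:] - t_vals[:-1]                    # distances between adjacent samples
    deltas = torch.cat([deltas, deltas[-1:]])            # pad the last interval
    alphas = 1.0 - torch.exp(-sigmas * deltas)           # per-sample opacity
    # accumulated transmittance T_i = prod_{j<i} (1 - alpha_j), cf. Eq. (1)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas + 1e-10])[:-1], dim=0)
    weights = trans * alphas                             # contribution of each sample
    return (weights.unsqueeze(-1) * colors).sum(dim=0)   # rendered pixel color C(r)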

3.2 Reinforcement Learning

We consider a finite-horizon Markov Decision Process (MDP) $\mathcal{M}=(\mathcal{O},\mathcal{A},\mathcal{T},\mathcal{R},\gamma)$, where $\mathcal{O}$ denotes the high-dimensional observation space (image pixels), $\mathcal{A}$ the action space, $\mathcal{T}(o^{\prime}|o,a)$ the transition dynamics ($o,o^{\prime}\in\mathcal{O},\,a\in\mathcal{A}$), $\mathcal{R}:\mathcal{O}\times\mathcal{A}\rightarrow\mathbb{R}$ the reward function, and $\gamma\in[0,1)$ the discount factor. Following the general idea of learning RL downstream tasks with pre-trained scene representations, we consider an encoder $\Omega:\mathcal{O}\rightarrow\mathcal{Z}$ that maps a high-dimensional observation $o\in\mathcal{O}$ to a low-dimensional latent state $z\in\mathcal{Z}$ on which an RL agent operates. To learn how to succeed in downstream tasks, the RL policy $\pi_{\theta}(a\in\mathcal{A}|z=\Omega(o))$ maximizes the total discounted reward $\sum_{t=0}^{H-1}\gamma^{t}\mathcal{R}(o_{t},a_{t})$ of trajectories $\tau_{i}=(z_{0},o_{0},...,z_{H},o_{H})_{i}$.

4 Method

Figure 2: SNeRL Overview. SNeRL consists of two stages, which are pre-training the NeRF-based autoencoder and fine-tuning to the downstream RL tasks, respectively. With observations from three different camera views, an encoder produces a single latent vector $z$, and a decoder with the neural rendering function $f_{\theta}$ takes the 3D position $\mathbf{x}$, the viewing direction $\mathbf{d}$, and $z$ as inputs to synthesize three different fields from arbitrary views. An auxiliary multi-view self-prediction loss is applied to enable view-invariant representation. Then, the encoder and the decoder are jointly optimized in a supervised manner with an offline dataset. The pre-trained encoder is utilized as a feature extractor to train the policy with off-the-shelf RL algorithms.

In this section, we describe the details of SNeRL, which consists of a multi-view convolutional image encoder and a latent-conditioned NeRF decoder to learn 3D-aware representations. Compared to the previous method (Driess et al., 2022), which also proposes NeRF supervision for RL, SNeRL is capable of extracting object-centric or semantic representations without any object-individual masks during deployment. The pre-trained image encoder is exploited as a feature extractor for downstream RL tasks, and an overview of the SNeRL framework is depicted in Figure 2.

4.1 Multi-view Encoder

Similar to Li et al. (2022), we adopt a multi-view encoder $\Omega$ which fuses observations from multiple camera views together to learn a single latent vector $z$ for RL tasks. The encoder takes the pixel-level observations $o^{i}\in\mathbb{R}^{H\times W\times 3}$ and the corresponding camera projection matrices $K^{i}\in\mathbb{R}^{3\times 4}$ captured from $V$ different camera views as inputs, i.e., $i=1,\cdots,V$. To generate $z\in\mathcal{Z}$ from the inputs, a convolutional network $E_{\mathrm{CNN}}$ first extracts viewpoint-invariant features from each image. The features from different camera views are channel-wise concatenated with their corresponding (flattened) camera projection matrices to inject the viewpoint information into the following feature vectors. Then, the concatenated vectors are passed through MLP layers, $g_{\mathrm{MLP}}$, to produce mid-level viewpoint-aware encodings. Lastly, the feature encodings from different camera views are averaged to generate a single encoding, and the averaged encoding is projected to the latent space $\mathcal{Z}$ with the latent encoder $h_{\mathrm{MLP}}$ as follows:

z=\Omega(o^{1:V},K^{1:V})=h_{\mathrm{MLP}}\Big(\frac{1}{V}\sum_{i=1}^{V}g_{\mathrm{MLP}}(E_{\mathrm{CNN}}(o^{i}),K^{i})\Big)    (3)

4.2 Semantic-aware NeRF Decoder

To inject 3D structural information into the latent vector $z$, we leverage a latent-conditioned NeRF architecture (Yu et al., 2021; Martin-Brualla et al., 2021; Wang et al., 2021) for the decoder. The difference between previous latent-conditioned NeRF and our proposed SNeRL is that the neural rendering function $f_{\theta}$ in SNeRL not only synthesizes novel views with RGB pixel values $\mathbf{c}$ but also with semantic labels $\mathbf{s}$ (Zhi et al., 2021; Fu et al., 2022; Kundu et al., 2022) and high-dimensional distilled features $\mathbf{f}$ from a large-scale teacher network (Kobayashi et al., 2022) as follows:

\mathbf{c}=f_{\theta}(z,\mathbf{x},\mathbf{d}),\quad \mathbf{s},\,\mathbf{f}=f_{\theta}(z,\mathbf{x})    (4)

By estimating three different radiance fields (semantic, feature, and RGB), the latent vector $z$ is jointly optimized to learn the geometric and semantic representations of the 3D environment. Unlike the RGB value $\mathbf{c}$, which depends on both the position $\mathbf{x}$ and the viewing direction $\mathbf{d}$, we formulate the semantic label and the distilled feature to be invariant to the viewing direction $\mathbf{d}$, because the inherent properties of the scene or the object do not change according to the direction of the camera ray.

Since SNeRL predicts the three different fields (RGB, semantic, and distilled feature) by adding field-wise branches, they share the neural rendering function $f_{\theta}$ up to the density estimate $\sigma$. This means that the three radiance fields have the same accumulated transmittance $T(t)$ at depth $t\in[t_{n},t_{f}]$ along the ray $\mathbf{r}=\mathbf{o}+t\mathbf{d}$ as

T(t)=\mathrm{exp}\Big(-\int^{t}_{t_{n}}\sigma(\mathbf{r}(s))\,ds\Big).    (5)
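To make the branching concrete, the sketch below shows one possible layout of the shared trunk and the field-wise heads; the layer sizes, module names, and the ReLU density activation are our own assumptions rather than the exact architecture of SNeRL.

import torch
import torch.nn as nn

class SemanticNeRFDecoder(nn.Module):
    """Sketch: shared trunk up to the density, then RGB / semantic / feature branches."""
    def __init__(self, z_dim, n_classes, feat_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(                      # shared f_theta trunk on (z, x)
            nn.Linear(z_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma_head = nn.Linear(hidden, 1)           # density sigma (shared by all fields)
        self.rgb_head = nn.Sequential(                   # view-dependent branch: takes direction d
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid())
        self.sem_head = nn.Linear(hidden, n_classes)     # view-independent semantic logits s
        self.feat_head = nn.Linear(hidden, feat_dim)     # view-independent distilled feature f

    def forward(self, z, x, d):
        h = self.trunk(torch.cat([z, x], dim=-1))
        sigma = torch.relu(self.sigma_head(h))           # same density -> same transmittance T(t)
        rgb = self.rgb_head(torch.cat([h, d], dim=-1))   # c = f_theta(z, x, d)
        sem = self.sem_head(h)                           # s = f_theta(z, x)
        feat = self.feat_head(h)                         # f = f_theta(z, x)
        return sigma, rgb, sem, feat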

For rendering the RGB field, we follow the same training framework as general latent-conditioned NeRF (Yu et al., 2021; Martin-Brualla et al., 2021; Wang et al., 2021), which optimizes the neural rendering function $f_{\theta}$ via pixel-wise RGB supervision. RGB supervision enables the encoder to extract geometric features from the observed environment by learning the RGB and density distribution in 3-dimensional space. The rendered pixel value $\hat{C}(\mathbf{r})$ can be calculated as

\hat{C}(\mathbf{r})=\int^{t_{f}}_{t_{n}}T(t)\sigma(\mathbf{r})\mathbf{c}(\mathbf{r},\mathbf{d})\,dt,    (6)

and the loss function for the RGB field, $\mathcal{L}_{\mathrm{RGB}}$, can be formulated as a simple L2 loss between the rendered $\hat{C}(\mathbf{r})$ and the ground-truth pixel color $C(\mathbf{r})$,

\mathcal{L}_{\mathrm{RGB}}=\sum_{i,j}||\hat{C}(\mathbf{r}_{i,j})-C(\mathbf{r}_{i,j})||_{2}^{2},    (7)

where $\mathbf{r}_{i,j}$ indicates camera ray $j$ from observation $i$, $o^{i}$.

Unfortunately, an encoder optimized with only RGB reconstruction struggles to capture the semantic or object-centric properties of the 3D scene, which are crucial for downstream RL tasks. Therefore, we extend the NeRF-based decoder by appending additional branches for semantic segmentation before the viewing direction $\mathbf{d}$ is injected into the rendering function $f_{\theta}$. The rendered semantic labels $\hat{S}(\mathbf{r})$ can be calculated as

\hat{S}(\mathbf{r})=\int^{t_{f}}_{t_{n}}T(t)\sigma(\mathbf{r})\mathbf{s}(\mathbf{r})\,dt    (8)

and the loss function for the semantic field, $\mathcal{L}_{sem}$, can be formulated with the standard cross-entropy loss,

\mathcal{L}_{sem}=-\sum_{i,j}\sum_{l=1}^{L}S^{l}(\mathbf{r}_{i,j})\,\mathrm{log}\,\hat{S}^{l}(\mathbf{r}_{i,j}),    (9)

where $\hat{S}^{l}$ and $S^{l}$ denote the predicted probability that ray $j$ in observation $i$ belongs to class $l$ and its corresponding ground-truth semantic label, respectively.

To capture further fine-grained features that cannot be fully expressed in semantic fields, SNeRL also synthesizes distilled feature fields (Kobayashi et al., 2022) that predict the output of a pre-trained feature descriptor in a knowledge-distillation manner (Hinton et al., 2015). It is well known from prior literature (Caron et al., 2021) that a Vision Transformer (ViT) (Dosovitskiy et al., 2020) trained in a self-supervised manner, e.g., DINO (Caron et al., 2021), can work as an excellent feature descriptor which explicitly represents scene layouts such as object boundaries. Since the output of the ViT feature descriptor contains high-dimensional information that varies across pixels depending on geometric relationships and semantic meaning, the pre-trained ViT serves as a good feature descriptor that provides complementary signals beyond the semantic labels.

Therefore, we bring these benefits to the NeRF-based decoder so that the latent vector $z$ learns high-level information, distilled from the ViT teacher network, that cannot be learned via ground-truth semantic supervision. The distilled feature field can be rendered as

\hat{F}(\mathbf{r})=\int^{t_{f}}_{t_{n}}T(t)\sigma(\mathbf{r})\mathbf{f}(\mathbf{r})\,dt.    (10)

The loss function for the distilled feature field, $\mathcal{L}_{feat}$, is formulated by penalizing the difference between the rendered features $\hat{F}(\mathbf{r})$ and the outputs of the ViT feature descriptor $F(o,\mathbf{r})$ as

\mathcal{L}_{feat}=\sum_{i,j}||\hat{F}(\mathbf{r}_{i,j})-F(o^{i},\mathbf{r}_{i,j})||_{1}.    (11)

Finally, the total loss function $\mathcal{L}$ for jointly optimizing the multi-view encoder and the NeRF-based decoder can be formulated as the linear combination of the aforementioned losses:

\mathcal{L}=\mathcal{L}_{\mathrm{RGB}}+\lambda_{sem}\mathcal{L}_{sem}+\lambda_{feat}\mathcal{L}_{feat}    (12)

where $\lambda_{sem}$ and $\lambda_{feat}$ are set to 0.004 and 0.04, respectively, to balance the losses (Zhi et al., 2021; Kobayashi et al., 2022). After training, the multi-view encoder $\Omega$ is exploited as a 3D structural and semantic feature extractor for any off-the-shelf downstream RL algorithm.
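To summarize Eqs. (7), (9), (11), and (12), the snippet below sketches how the per-ray losses could be combined in code; treating the rendered semantic field as class logits and using PyTorch's cross-entropy is a simplifying assumption, not the exact formulation above.

import torch.nn.functional as F

def snerl_loss(rgb_hat, rgb_gt, sem_hat, sem_gt, feat_hat, feat_gt,
               lam_sem=0.004, lam_feat=0.04):
    # rgb_hat, rgb_gt:   (R, 3) rendered / ground-truth colors for R sampled rays
    # sem_hat, sem_gt:   (R, L) rendered class scores, (R,) ground-truth labels
    # feat_hat, feat_gt: (R, D) rendered / teacher (DINO) features
    l_rgb = F.mse_loss(rgb_hat, rgb_gt)                   # Eq. (7)
    l_sem = F.cross_entropy(sem_hat, sem_gt)              # Eq. (9)
    l_feat = F.l1_loss(feat_hat, feat_gt)                 # Eq. (11)
    return l_rgb + lam_sem * l_sem + lam_feat * l_feat    # Eq. (12)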

4.3 Multi-view Self Predictive Representation

We additionally enforce a multi-view self-predictive loss on the latent vector $z$ to ensure that the encoder learns a viewpoint-invariant representation from observations of the same scene. Randomly sampled observations from two different camera poses, $o^{1}$ and $o^{2}$, are processed by the convolutional feature extractor $E_{\mathrm{CNN}}$, whose weights are shared between the two inputs. A feature from one view, $z_{1}$, is mapped with a prediction network, $h_{pred}$, to match the feature from the other view, $z_{2}$. We formulate the self-predictive loss function $\mathcal{D}$ with negative cosine similarity as follows:

\mathcal{D}(p_{1},z_{2})=-\frac{p_{1}}{||p_{1}||_{2}}\cdot\frac{z_{2}}{||z_{2}||_{2}},    (13)

where $p_{1}$ and $z_{2}$ indicate the two output vectors $p_{1}\triangleq h_{pred}(E_{\mathrm{CNN}}(o^{1}))$ and $z_{2}\triangleq E_{\mathrm{CNN}}(o^{2})$, respectively. Following Chen & He (2021), we treat $z_{2}$ as a constant so that the encoder $E_{\mathrm{CNN}}$ only receives gradients through $p_{1}$.

The symmetrized auxiliary representation loss function can be formulated as follows:

\mathcal{L}_{\mathrm{aux}}=\frac{1}{2}\mathcal{D}(p_{1},z_{2})+\frac{1}{2}\mathcal{D}(p_{2},z_{1}).    (14)
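A minimal sketch of Eqs. (13)-(14), assuming encoder and predictor stand for $E_{\mathrm{CNN}}$ and $h_{pred}$, is given below; the stop-gradient follows Chen & He (2021).

import torch.nn.functional as F

def multiview_aux_loss(encoder, predictor, o1, o2):
    z1, z2 = encoder(o1), encoder(o2)       # features from two camera views of the same scene
    p1, p2 = predictor(z1), predictor(z2)   # outputs of the prediction network h_pred

    def d(p, z):
        z = z.detach()                      # stop-gradient: treat the target branch as constant
        return -F.cosine_similarity(p, z, dim=-1).mean()   # negative cosine similarity, Eq. (13)

    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)                # symmetrized loss, Eq. (14)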

5 Experiments

Figure 3: Episode returns of the evaluation results. Shading indicates one standard deviation across 4 seeds. Some curves are not visible in the Hammer-v2 environment as they overlap each other. Note that SNeRL in this figure is trained without the auxiliary loss in Section 4.3 (multi-view self-predictive representation), which could enable further improvements in some environments.

In this section, we present several experiments in 3-dimensional environments to explore the effectiveness of SNeRL compared to existing state-of-the-art RL algorithms in both model-free and model-based settings. We fix the downstream RL algorithms and adopt Soft Actor-Critic (Haarnoja et al., 2018) in the model-free setting and Dreamer (Hafner et al., 2019) in the model-based setting for SNeRL and all the baselines for a fair evaluation.

Environments.

We evaluate SNeRL on four visual control environments based on the MuJoCo physics engine (Todorov et al., 2012), including some complex control tasks that require clever use of interactions between the objects to obtain high rewards. All the tasks are performed by a simulated Sawyer robot with a single arm and a gripper (4-DoF). The action space of the Sawyer robot consists of the position (x, y, z) of the end-effector and gripper control (open/close). The agent takes 128x128 images from three different camera views as pixel-level inputs and receives dense rewards from the environment provided by Meta-world (Yu et al., 2020).

  • Window-open-v2 : This environment involves the Sawyer robot opening a sliding window with a handle. The initial state of the robotic hand is [0, 0.4, 0.2] and the robot receives rewards for pushing the handle and opening a window located in [-0.1, 0.785, 0.16].

  • Hammer-v2 : The Sawyer robot is supposed to grasp the handle of the hammer, which is generated in a random position, and hit the head of the nail to drive it. The initial state of the robotic hand is generated randomly in $\{(x,y,z)\,|\,-0.5\leq x\leq 0.5,\ 0.4\leq y\leq 1,\ 0.05\leq z\leq 0.5\}$. The robot receives rewards for picking up the hammer and inserting the nail into a piece of wood.

  • Drawer-open-v2 : The Sawyer robot is supposed to open a drawer by holding the handle of the drawer and pulling it. The initial state of the robotic hand is the same as Hammer-v2. The robot receives rewards for opening a drawer.

  • Soccer-v2 : In this task, the Sawyer robot tries to score by pushing a soccer ball that is generated in a random position. The initial state of the robotic hand is the same as Hammer-v2. The robot receives rewards for touching the soccer ball and putting it into the net.

We refer to Meta-world (Yu et al., 2020) for more details including the reward function and the range of the random positions.

Figure 4: Ablation study. (a): SNeRL with both semantic and feature supervision shows higher performance than the case where only one of the two is applied. Also, in relatively simple environments, using only one of the two could be enough to improve the performance of the prior work. (b): Additional multi-view self-predictive loss can further improve SNeRL in some environments. (c): Learned representations via SNeRL can also be adopted in model-based RL.

Baselines.

We compare SNeRL to several state-of-the-art visual RL methods and a 3D-aware RL method, which are briefly described below. DrQ-v2 (Yarats et al., 2021) is an improved version of DrQ (Yarats et al., 2020), which solves visual control tasks with data augmentation and scheduled exploration noise. CURL (Laskin et al., 2020b) trains RL agents with an auxiliary contrastive loss which ensures that the embeddings for data-augmented versions of observations match. CURL-multiview is a multi-view adaptation of CURL, which utilizes 3 different camera views and has a CNN encoder with the same structure as that of SNeRL. CNN-AE uses a standard CNN autoencoder (instead of a NeRF decoder) to pre-train an encoder using the reconstruction loss proposed in Finn et al. (2016). NeRF-RL (Driess et al., 2022) pre-trains an autoencoder with a convolutional encoder and a naïve NeRF-style decoder, without semantic and feature supervision.

We note that learning downstream RL tasks in CNN-AE and NeRF-RL follows the identical procedure as SNeRL, and they use the same offline dataset collected by random actions and the policies provided by Meta-world (half-and-half mixed). We refer the reader to Appendix B.2 for experiments on other datasets. Also, all the multi-view methods (CURL-multiview, CNN-AE, NeRF-RL, SNeRL) receive the same observations and do not receive per-object masks from the environment. For the remaining baselines, which operate on a single view, we choose a single camera position from which the states of each object can be observed clearly.

5.1 Experiment Result

Figure 3 shows the episode returns of SNeRL and baselines in 4 different visual control tasks. Thanks to the learned object-centric representation via semantic and distilled feature supervision, SNeRL consistently outperforms state-of-the-art visual RL methods and the prior 3D-aware RL method (NeRF-RL) in terms of data efficiency and performance.

Specifically, the contrastive baselines (CURL, CURL-multiview) and DrQ-v2 could not achieve high returns in the difficult environments (soccer and hammer), even though some of them succeed in the relatively easy environment (window). The results also show that pre-training a CNN via naïve reconstruction loss (CNN-AE) with offline data does not succeed in these environments at all. These results imply that extracting not only 3D-aware geometric but also object-centric and semantic information from multi-view observations is critical for RL performance.

Interestingly, we observe that pre-training a NeRF-based autoencoder only with RGB supervision (NeRF-RL) is not sufficient to learn useful features for RL downstream tasks, and it cannot outperform the multi-view adaptation of the contrastive visual RL method (CURL-multiview). This is contrary to the results reported in the prior work (Driess et al., 2022), which we analyze as follows: the environments we adopt in this work are more challenging than those of Driess et al. (2022), which consist of simple-shaped objects with primary colors, so it is relatively difficult to obtain semantic information using RGB supervision alone. Thus, in the practical use of 3D-aware RL, leveraging a semantic-aware NeRF decoder is required to extract features that yield better performance on RL downstream tasks, which is consistent with our analysis.

Figure 5: Qualitative results on the image reconstruction in 3 different camera views via neural rendering. The synthesized images from SNeRL achieve better fidelity compared to NeRF-RL in several environments.

5.2 Ablation Study

Semantic and Distilled Feature Fields.

To validate how each semantic-aware radiance field leveraged in SNeRL (semantic and distilled feature fields) contributes to downstream RL performance, we evaluate its two ablated variants without semantic and feature supervision, respectively. As shown in Figure 4(a), SNeRL, which takes advantage of both semantic and feature supervision from ground-truth labels and a ViT-based feature descriptor, achieves the best performance compared to all the ablated models. We observe that the performance gap between SNeRL and the ablated models depends on the environment: semantic labels are sufficient to learn semantic information in a relatively simple environment (window), while a complex environment (soccer) requires both forms of supervision.

Multi-view Representation Learning.

We also introduce an auxiliary representation learning objective with a self-predictive loss that is suited to multi-view observations. To demonstrate its effectiveness on downstream RL tasks, we evaluate two different models, SNeRL with and without multi-view self-predictive representation learning. By enforcing the latent vector to be invariant to the viewpoint of the observation, the proposed representation learning improves the RL agent's performance in some environments, as shown in Figure 4(b).

5.3 Image Reconstruction via Neural Rendering

Even though we only leverage the convolutional encoder for downstream RL tasks, we compare the image rendering performance of the NeRF-based decoders from SNeRL and NeRF-RL to explore the relationship between synthesized image quality and RL performance. As NeRF originally aims to synthesize images of arbitrary camera views from a static scene, NeRF-RL, which trains the volumetric field with RGB supervision alone, cannot reconstruct dynamic objects in the input images, e.g., the robot arm, without semantic information. On the other hand, SNeRL, which utilizes semantic labels and feature outputs from the ViT teacher network as additional supervision signals, not only achieves better RL performance but also represents the dynamic scene well and produces high-fidelity rendering outputs, as shown in Figure 5.

5.4 SNeRL for Model-based RL

In this section, we evaluate whether the learned representation via SNeRL can also be adopted in off-the-shelf model-based reinforcement learning algorithms, which train a world model to characterize the environment and conduct planning over the learned model. We adopt Dreamer (Hafner et al., 2019) as a downstream model-based RL agent and replace the encoder of the representation model with our pre-trained encoder. Refer to Appendix A for additional implementation details and an architectural overview.

Our results are shown in Figure 4(c). We observe that learning model-based RL with the pre-trained encoder of SNeRL outperforms the pre-trained weights of the prior 3D-aware RL method (NeRF-RL) and naïve CNN autoencoder. This empirical evidence is consistent with the case of model-free RL in section 5.1, indicating that the proposed method allows the encoder to learn representations that are important for general off-the-shelf RL agents.

6 Conclusion

In this paper, we present SNeRL, a semantic-aware radiance field framework for RL that outperforms existing representation learning methods for RL across four different 3-dimensional environments. SNeRL leverages semantic and distilled feature supervision with a latent-conditioned NeRF autoencoder, as well as RGB supervision, to enable the image encoder to express 3D-aware geometric and semantic representations for downstream RL tasks. We also propose a multi-view self-predictive loss as an auxiliary representation learning objective that encourages the latent vectors to be viewpoint-invariant. Finally, we verify that SNeRL is effective in both model-free and model-based RL algorithms.

Limitations.

Despite these improvements, SNeRL inherits the limitations of the prior 3D-aware RL methods. First, SNeRL requires multi-view offline data, and collecting an offline dataset that covers the state space in some complex control tasks might be challenging. Also, our method uses a NeRF decoder that consumes a larger computational budget than a CNN decoder, so there might be limitations in extending our method to an online setup that trains the encoder concurrently with the RL agent.

7 Acknowledgement

This research was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [NO.2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)]. Also, this research was supported by the Unmanned Vehicles Core Technology Research and Development Program through the National Research Foundation of Korea (NRF) and the Unmanned Vehicle Advanced Research Center (UVARC) funded by the Ministry of Science and ICT, the Republic of Korea (NRF-2020M3C1C1A01086411). Seungjae Lee would like to acknowledge financial support from the Hyundai Motor Chung Mong-Koo Foundation.

References

  • Caron et al. (2021) Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  9650–9660, 2021.
  • Chen & He (2021) Chen, X. and He, K. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  15750–15758, 2021.
  • Dosovitskiy et al. (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
  • Driess et al. (2022) Driess, D., Schubert, I., Florence, P., Li, Y., and Toussaint, M. Reinforcement learning with neural radiance fields. arXiv preprint arXiv:2206.01634, 2022.
  • Dwibedi et al. (2018) Dwibedi, D., Tompson, J., Lynch, C., and Sermanet, P. Learning actionable representations from visual observations. In 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp.  1577–1584. IEEE, 2018.
  • Eslami et al. (2018) Eslami, S. A., Jimenez Rezende, D., Besse, F., Viola, F., Morcos, A. S., Garnelo, M., Ruderman, A., Rusu, A. A., Danihelka, I., Gregor, K., et al. Neural scene representation and rendering. Science, 360(6394):1204–1210, 2018.
  • Finn et al. (2016) Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp.  512–519. IEEE, 2016.
  • Fu et al. (2022) Fu, X., Zhang, S., Chen, T., Lu, Y., Zhu, L., Zhou, X., Geiger, A., and Liao, Y. Panoptic nerf: 3d-to-2d label transfer for panoptic urban scene segmentation. arXiv preprint arXiv:2203.15224, 2022.
  • Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
  • Hafner et al. (2019) Hafner, D., Lillicrap, T., Ba, J., and Norouzi, M. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019.
  • Hansen et al. (2020) Hansen, N., Jangir, R., Sun, Y., Alenyà, G., Abbeel, P., Efros, A. A., Pinto, L., and Wang, X. Self-supervised policy adaptation during deployment. arXiv preprint arXiv:2007.04309, 2020.
  • Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Islam et al. (2022) Islam, R., Zang, H., Goyal, A., Lamb, A., Kawaguchi, K., Li, X., Laroche, R., Bengio, Y., and Combes, R. T. D. Discrete factorial representations as an abstraction for goal conditioned reinforcement learning. arXiv preprint arXiv:2211.00247, 2022.
  • Kinose et al. (2022) Kinose, A., Okada, M., Okumura, R., and Taniguchi, T. Multi-view dreaming: Multi-view world model with contrastive learning. arXiv preprint arXiv:2203.11024, 2022.
  • Kobayashi et al. (2022) Kobayashi, S., Matsumoto, E., and Sitzmann, V. Decomposing nerf for editing via feature field distillation. arXiv preprint arXiv:2205.15585, 2022.
  • Kulkarni et al. (2019) Kulkarni, T. D., Gupta, A., Ionescu, C., Borgeaud, S., Reynolds, M., Zisserman, A., and Mnih, V. Unsupervised learning of object keypoints for perception and control. Advances in neural information processing systems, 32, 2019.
  • Kundu et al. (2022) Kundu, A., Genova, K., Yin, X., Fathi, A., Pantofaru, C., Guibas, L. J., Tagliasacchi, A., Dellaert, F., and Funkhouser, T. Panoptic neural fields: A semantic object-aware neural scene representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  12871–12881, 2022.
  • Laskin et al. (2020a) Laskin, M., Lee, K., Stooke, A., Pinto, L., Abbeel, P., and Srinivas, A. Reinforcement learning with augmented data. Advances in neural information processing systems, 33:19884–19895, 2020a.
  • Laskin et al. (2020b) Laskin, M., Srinivas, A., and Abbeel, P. Curl: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pp. 5639–5650. PMLR, 2020b.
  • Li et al. (2022) Li, Y., Li, S., Sitzmann, V., Agrawal, P., and Torralba, A. 3d neural scene representations for visuomotor control. In Conference on Robot Learning, pp.  112–123. PMLR, 2022.
  • Liu & Abbeel (2021) Liu, H. and Abbeel, P. Behavior from the void: Unsupervised active pre-training. Advances in Neural Information Processing Systems, 34:18459–18473, 2021.
  • Manuelli et al. (2020) Manuelli, L., Li, Y., Florence, P., and Tedrake, R. Keypoints into the future: Self-supervised correspondence in model-based reinforcement learning. arXiv preprint arXiv:2009.05085, 2020.
  • Martin-Brualla et al. (2021) Martin-Brualla, R., Radwan, N., Sajjadi, M. S., Barron, J. T., Dosovitskiy, A., and Duckworth, D. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  7210–7219, 2021.
  • Mildenhall et al. (2020) Mildenhall, B., Srinivasan, P., Tancik, M., Barron, J., Ramamoorthi, R., and Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, 2020.
  • Schwarzer et al. (2020) Schwarzer, M., Anand, A., Goel, R., Hjelm, R. D., Courville, A., and Bachman, P. Data-efficient reinforcement learning with self-predictive representations. arXiv preprint arXiv:2007.05929, 2020.
  • Sermanet et al. (2018) Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S., Levine, S., and Brain, G. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE international conference on robotics and automation (ICRA), pp.  1134–1141. IEEE, 2018.
  • Stooke et al. (2021) Stooke, A., Lee, K., Abbeel, P., and Laskin, M. Decoupling representation learning from reinforcement learning. In International Conference on Machine Learning, pp. 9870–9879. PMLR, 2021.
  • Tatarchenko et al. (2016) Tatarchenko, M., Dosovitskiy, A., and Brox, T. Multi-view 3d models from single images with a convolutional network. In European Conference on Computer Vision, pp.  322–337. Springer, 2016.
  • Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pp.  5026–5033. IEEE, 2012.
  • Wang et al. (2022) Wang, C., Luo, X., Ross, K., and Li, D. Vrl3: A data-driven framework for visual deep reinforcement learning. arXiv preprint arXiv:2202.10324, 2022.
  • Wang et al. (2021) Wang, Q., Wang, Z., Genova, K., Srinivasan, P. P., Zhou, H., Barron, J. T., Martin-Brualla, R., Snavely, N., and Funkhouser, T. Ibrnet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  4690–4699, 2021.
  • Worrall et al. (2017) Worrall, D. E., Garbin, S. J., Turmukhambetov, D., and Brostow, G. J. Interpretable transformations with encoder-decoder networks. In Proceedings of the IEEE International Conference on Computer Vision, pp.  5726–5735, 2017.
  • Yamada et al. (2022) Yamada, J., Pertsch, K., Gunjal, A., and Lim, J. J. Task-induced representation learning. arXiv preprint arXiv:2204.11827, 2022.
  • Yang & Nachum (2021) Yang, M. and Nachum, O. Representation matters: offline pretraining for sequential decision making. In International Conference on Machine Learning, pp. 11784–11794. PMLR, 2021.
  • Yarats et al. (2020) Yarats, D., Kostrikov, I., and Fergus, R. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International Conference on Learning Representations, 2020.
  • Yarats et al. (2021) Yarats, D., Fergus, R., Lazaric, A., and Pinto, L. Mastering visual continuous control: Improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645, 2021.
  • You et al. (2022) You, B., Arenz, O., Chen, Y., and Peters, J. Integrating contrastive learning with dynamic models for reinforcement learning from images. Neurocomputing, 476:102–114, 2022.
  • Yu et al. (2021) Yu, A., Ye, V., Tancik, M., and Kanazawa, A. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  4578–4587, 2021.
  • Yu et al. (2020) Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., and Levine, S. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pp.  1094–1100. PMLR, 2020.
  • Zhan et al. (2022) Zhan, A., Zhao, R., Pinto, L., Abbeel, P., and Laskin, M. Learning visual robotic control efficiently with contrastive pre-training and data augmentation. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.  4040–4047. IEEE, 2022.
  • Zhi et al. (2021) Zhi, S., Laidlow, T., Leutenegger, S., and Davison, A. J. In-place scene labelling and understanding with implicit scene representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.

Appendix A Algorithms

A.1 Model-free RL : Soft Actor-Critic

In this project, we adopt the Soft Actor-Critic algorithm (Haarnoja et al., 2018) (SAC) in all the experiments with model-free downstream RL. SAC optimizes stochastic policies to maximize both the expected trajectory return and the expected entropy of the actions. Although SAC shows promising performance on a range of continuous control tasks including many benchmark tasks, it fails or suffers from data inefficiency in some visual control tasks. To train the actor network $\pi_{\phi}$ and the critic networks $Q_{\theta_{1}},Q_{\theta_{2}}$, the SAC algorithm minimizes the following objective functions

J_{\pi}(\phi)=\mathbb{E}_{s_{t}\sim\mathcal{D}}\big[\mathbb{E}_{a_{t}\sim\pi_{\phi}}[\alpha\log(\pi_{\phi}(a_{t}|s_{t}))-\min_{i=1,2}Q_{\theta_{i}}(s_{t},a_{t})]\big]    (15)
J_{Q}(\theta_{i})=\mathbb{E}_{(s_{t},a_{t},s_{t+1},r)\sim\mathcal{D},\,a_{t+1}\sim\pi_{\phi}(s_{t+1})}\Big[\frac{1}{2}\big(Q_{\theta_{i}}(s_{t},a_{t})-(r+\gamma\mathcal{T})\big)^{2}\Big],    (16)

where $\mathcal{D}$ denotes the replay buffer, $\alpha$ the temperature hyperparameter, and $\gamma$ the discount factor. The target value $\mathcal{T}$ in Eq. 16 is

\mathcal{T}=\min_{i=1,2}Q_{\hat{\theta}_{i}}(s_{t+1},a_{t+1})-\alpha\log\pi_{\phi}(a_{t+1}|s_{t+1}).    (17)

SAC also utilizes target networks $Q_{\hat{\theta}_{1}},Q_{\hat{\theta}_{2}}$, which are obtained as an Exponential Moving Average (EMA) of the Q networks ($Q_{\theta_{1}},Q_{\theta_{2}}$), for better learning stability, and gradient-based temperature tuning to determine the relative importance of the entropy term,

J(\alpha)=\mathbb{E}_{a_{t}\sim\pi_{\phi}(s_{t})}[-\alpha\log\pi_{\phi}(a_{t}|s_{t})-\alpha\bar{\mathcal{H}}].    (18)
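For reference, a small sketch of the critic target in Eqs. (16)-(17); variable names and the use of PyTorch are illustrative assumptions.

import torch

def sac_critic_target(reward, next_log_prob, q1_next, q2_next, alpha, gamma=0.99):
    # r + gamma * ( min_i Q_target_i(s', a') - alpha * log pi(a'|s') ), cf. Eqs. (16)-(17)
    target_v = torch.min(q1_next, q2_next) - alpha * next_log_prob
    return reward + gamma * target_v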

A.2 Model-based RL : Dreamer

To evaluate whether SNeRL can also be adopted in model-based RL algorithms, we adopt Dreamer (Hafner et al., 2019). Dreamer learns the world model which consists of the following components:

the representation model p_{\theta}(s_{t}|s_{t-1},a_{t-1},o_{t}), the observation model q_{\theta}(o_{t}|s_{t}), the reward model q_{\theta}(r_{t}|s_{t}), and the transition model q_{\theta}(s_{t}|s_{t-1},a_{t-1}).    (19)

These components are jointly optimized to increase the variational lower bound, which includes the following terms:

\mathcal{J}_{O}^{t}=\ln q(o_{t}|s_{t}),\quad \mathcal{J}_{R}^{t}=\ln q(r_{t}|s_{t}),\quad \mathcal{J}_{D}^{t}=-\beta\,\mathrm{KL}\big(p(s_{t}|s_{t-1},a_{t-1},o_{t})\,||\,q(s_{t}|s_{t-1},a_{t-1})\big).    (20)
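A hedged sketch of how these three terms could be combined per timestep is shown below; the distribution objects, the helper name, and the use of torch.distributions are our assumptions rather than the exact implementation.

import torch

def world_model_objective(obs_dist, rew_dist, post, prior, obs, rew, beta=1.0):
    # obs_dist, rew_dist: predicted distributions q(o_t|s_t) and q(r_t|s_t)
    # post, prior: distributions appearing in the KL term of Eq. (20)
    j_o = obs_dist.log_prob(obs).mean()                                   # J_O
    j_r = rew_dist.log_prob(rew).mean()                                   # J_R
    j_d = -beta * torch.distributions.kl_divergence(post, prior).mean()   # J_D
    return j_o + j_r + j_d   # variational lower bound to be maximized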

We replace the convolutional encoder of Dreamer with our feature extractor $\Omega$ (the shared encoder of SNeRL) to build an RL agent whose dynamics model operates over the pre-trained latent space. To learn the action and value models, Dreamer optimizes the value model $v_{\psi}$ and the action model $q_{\phi}$ using the objectives

\max_{\phi}\;\mathbb{E}\Big(\sum_{\tau=t}^{t+H}V_{\lambda}(s_{\tau})\Big),\qquad \min_{\psi}\;\mathbb{E}_{q_{\theta},q_{\phi}}\Big(\sum_{\tau=t}^{t+H}\frac{1}{2}\big|\big|v_{\psi}(s_{\tau})-V_{\lambda}(s_{\tau})\big|\big|^{2}\Big),    (21)

where $H$ denotes the imagination horizon and $V_{\lambda}$ the exponentially-weighted value estimate. Refer to Dreamer (Hafner et al., 2019) for details.
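As an illustration, the recursion below sketches one common way to compute the exponentially-weighted value estimate $V_{\lambda}$ over an imagined trajectory; the default values of gamma and lam and the variable shapes are assumptions rather than the exact settings of this work.

import torch

def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    # rewards: (H,)   predicted rewards for imagined states
    # values:  (H+1,) predicted values v_psi, including the bootstrap state
    H = rewards.shape[0]
    returns = torch.zeros(H)
    last = values[-1]                                   # bootstrap from the final value
    for t in reversed(range(H)):
        # V_lambda(s_t) = r_t + gamma * ((1 - lam) * v(s_{t+1}) + lam * V_lambda(s_{t+1}))
        last = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * last)
        returns[t] = last
    return returns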

A.3 Pseudo-code

Algorithm 1 Stage 1. Pre-train multi-view encoder with SNeRL
1:  Input: encoder $\Omega$, off-the-shelf feature descriptor $F$, offline dataset $\mathcal{D}$
2:  for iteration = 1, 2, ..., N do
3:     for each minibatch $d$ sampled from $\mathcal{D}$ do
4:        $z \leftarrow \Omega(o^{1:V}, K^{1:V})$
5:        $\mathcal{L} \leftarrow \mathcal{L}_{\mathrm{RGB}} + \lambda_{sem}\mathcal{L}_{sem} + \lambda_{feat}\mathcal{L}_{feat}$
6:        update the parameters of $\Omega$ to minimize $\mathcal{L}$
7:     end for
8:  end for
Algorithm 2 Stage 2. Downstream Model-free RL (SAC)
1:  Input: total training episodes $N$, Env, environment horizon $H$, actor network $\pi_{\phi}$, critic networks $Q_{\theta_{i=1,2}}$, target critic networks $Q_{\hat{\theta}_{i=1,2}}$, temperature $\alpha$, replay buffer $\mathcal{B}$, pre-trained encoder $\Omega$.
2:  for iteration = 1, 2, ..., N do
3:     $o_{t=0} \leftarrow$ Env.reset()
4:     for $t$ = 0, 1, ..., $H-1$ do
5:        $a_{t} \leftarrow \pi_{\phi}(\cdot\,|\,\Omega(o_{t}))$
6:        $r_{t}, o_{t+1} \leftarrow$ Env.step($a_{t}$)
7:        $\mathcal{B} \leftarrow \mathcal{B} \cup \{(o_{t}, a_{t}, r_{t}, o_{t+1})\}$
8:     end for
9:     for each gradient step do
10:        $\theta_{i} \leftarrow \theta_{i} - \lambda_{Q}\nabla_{\theta_{i}}J_{Q}(\theta_{i})$
11:        $\phi \leftarrow \phi - \lambda_{\pi}\nabla_{\phi}J_{\pi}(\phi)$
12:        $\alpha \leftarrow \alpha - \lambda\nabla_{\alpha}J(\alpha)$
13:        if update target critic networks then
14:           $\hat{\theta}_{i} \leftarrow \tau\theta_{i} + (1-\tau)\hat{\theta}_{i}$
15:        end if
16:     end for
17:  end for
Algorithm 3 Stage 2. Downstream Model-based RL (Dreamer)
1:  Input: total training episodes $N$, update steps $C$, Env, environment horizon $T$, imagination horizon $H$, neural network parameters $\theta, \phi, \psi$, replay buffer $\mathcal{B}$, pre-trained encoder $\Omega$.
2:  for iteration = 1, 2, ..., N do
3:     for $c$ = 1, ..., $C$ do
4:        sample a data sequence $\{(a_{t}, o_{t}, r_{t})\}_{t=0,...,H-1} \sim \mathcal{B}$
5:        compute model states $s_{t} \sim p_{\theta}(s_{t}\,|\,\Omega(o_{t-1}), a_{t-1})$ and rewards $q_{\theta}(r_{t}\,|\,\Omega(o_{t}))$
6:        update $\theta$ using representation learning
7:        imagine trajectories and compute value estimates $V_{\lambda}(s_{\tau})$
8:        $\phi \leftarrow \phi + \alpha\nabla_{\phi}\sum_{\tau=t}^{t+H}V_{\lambda}(s_{\tau})$
9:        $\psi \leftarrow \psi - \alpha\nabla_{\psi}\sum_{\tau=t}^{t+H}\frac{1}{2}||v_{\psi}(s_{\tau})-V_{\lambda}(s_{\tau})||^{2}$
10:     end for
11:     $o_{t=0} \leftarrow$ Env.reset()
12:     for $t$ = 0, 1, ..., $T-1$ do
13:        $a_{t} \leftarrow q_{\phi}(\cdot\,|\,\Omega(o_{t}))$
14:        $r_{t}, o_{t+1} \leftarrow$ Env.step($a_{t}$)
15:        $\mathcal{B} \leftarrow \mathcal{B} \cup \{(o_{t}, a_{t}, r_{t}, o_{t+1})\}$
16:     end for
17:  end for

Appendix B Training & Experiments Details

B.1 Encoder Architecture

We design the encoder architecture similar to Laskin et al. (2020b), which consists of multiple convolutional layers and ReLU activations, but modify it to be applicable to multi-view observation inputs. The same encoder is also adopted in the actor and critic to embed the pixel-level (multi-view) observations. We describe the details of the convolutional encoder with PyTorch-like pseudo-code below.

Algorithm 4 Multi-view Encoder Pseudocode, PyTorch-like
def encoder(x1, x2, x3, K1, K2, K3, z_dim):
    """
    Multi-view ConvNet encoder
    args:
        B = batch_size, C = channels,
        H, W = spatial_dims
        x1, x2, x3: images from 3 different camera views
        x1, x2, x3 shape: [B, C, H, W]
        K1, K2, K3: camera poses from 3 different camera views
        K1, K2, K3 shape: [B, 4, 4]
        z_dim: latent dimension
    """
    B = x1.shape[0]
    # normalize pixel values to [0, 1]
    x1, x2, x3 = x1 / 255., x2 / 255., x3 / 255.
    # c: channels, f: filters
    # k: kernel, s: stride
    z1 = ReLU(Conv2d(c=x1.shape[1], f=32, k=3, s=2)(x1))
    z2 = ReLU(Conv2d(c=x2.shape[1], f=32, k=3, s=2)(x2))
    z3 = ReLU(Conv2d(c=x3.shape[1], f=32, k=3, s=2)(x3))
    for _ in range(num_layers - 1):   # num_layers: number of convolution layers (Table 1)
        z1 = ReLU(Conv2d(c=32, f=32, k=3, s=1)(z1))
        z2 = ReLU(Conv2d(c=32, f=32, k=3, s=1)(z2))
        z3 = ReLU(Conv2d(c=32, f=32, k=3, s=1)(z3))
    z1, z2, z3 = flatten(z1), flatten(z2), flatten(z3)
    # concatenate each view feature with its own flattened camera pose
    z1 = concat([z1, K1.view(B, 16)], dim=1)
    z2 = concat([z2, K2.view(B, 16)], dim=1)
    z3 = concat([z3, K3.view(B, 16)], dim=1)
    # per-view projection to the latent dimension (g_MLP)
    z1 = Linear(z1.shape[1], z_dim)(z1)
    z2 = Linear(z2.shape[1], z_dim)(z2)
    z3 = Linear(z3.shape[1], z_dim)(z3)
    # average the view-wise encodings into a single encoding
    z = stack([z1, z2, z3], dim=0).mean(dim=0)
    # latent encoder (h_MLP)
    z = Linear(z.shape[1], z_dim)(z)
    z = LayerNorm(z)
    z = tanh(z)
    return z
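For reference, a hypothetical usage of the encoder above; the shapes, the latent dimension, and the policy call are illustrative.

# Fuse three 128x128 views and their camera poses into a single latent vector,
# then feed it to a downstream policy network (e.g., the SAC actor).
z = encoder(x1, x2, x3, K1, K2, K3, z_dim=50)   # z: [B, z_dim]
a = policy(z)                                    # off-the-shelf RL policy head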

B.2 Datasets

The offline datasets for SNeRL and baselines consist of 14400 scenes. Each scene consists of three image observations taken from different camera views. The observations from each camera view are represented in Figure 1 (Window-open-v2) and Figure 5 (Soccer-v2, Hammer-v2, Drawer-open-v2). To collect the dataset, we utilized random actions and the policies provided by Meta-world (half-and-half mixed).

To observe how the performance of the proposed method varies with the quality of the dataset, we further trained the SNeRL encoder with a dataset collected by a single expert demo and random actions. Only 120 of the total 14400 scenes (120/14400 ≃ 1%) were obtained from the path of the expert demo, and the remaining 14280 scenes were obtained by taking random actions starting from a moment along the expert demo path. As shown in Figure 6, we observe that the quality of the dataset slightly affects the learning stability, but there is no dramatic performance degradation. The results suggest that there would be no significant degradation in the performance of SNeRL as long as the dataset adequately covers the state space, even if the policy that collects the offline dataset is suboptimal.

Figure 6: The performance of SNeRL with offline datasets of different quality.

B.3 Computational Resources

Stage 1 (pre-training encoder) in our experiments has been performed using a single NVIDIA RTX A6000 and AMD Ryzen 2950X, and stage 2 (RL downstream tasks) has been performed using an NVIDIA RTX A5000 and AMD Ryzen 2950X. Training the SNeRL encoder takes 4-5 days and learning model-based RL and model-free RL takes 1-2 days.

B.4 Hyperparameters

Table 1: Hyperparameters for pre-training multi-view encoder
SNeRL
Convolution layers 4
Number of filters 32
Non-linearity ReLU
MLP layers for NeRF 8
Hidden units (MLP) 256
Number of different views 3
NeRF learning rate 5e-4
Number of rays per gradient step 1024
Number of samples per ray 64
Table 2: Hyperparameters for SAC (for SNeRL and baselines)
SAC
hidden layer (1024, 1024)
frame stack 2
replay buffer size 100000
initial random steps 1000
batch size 128
actor learning rate 1e-3
critic learning rate 1e-3
$\alpha$ learning rate 1e-4
$\beta$ for Adam optimizer (actor, critic) 0.9
eps for Adam optimizer (actor, critic) 1e-08
$\beta$ for Adam optimizer ($\alpha$) 0.5
eps for Adam optimizer ($\alpha$) 1e-08
critic target update interval 2
actor network update interval 2
actor log std min, max -10, 2
init temperature 0.1
$\tau$ for EMA 0.01
discount factor $\gamma$ 0.99
Table 3: Hyperparameters for Dreamer (for SNeRL and baselines)
Dreamer
embedding size 63
hidden / belief size 128
state size 30
action noise 0.3
batch size 32
world model learning rate 1e-3
actor learning rate 5e-5
value network learning rate 5e-5
discount factor $\gamma$ 0.99
replay buffer size 100000
planning horizon 15
eps for Adam optimizer 1e-07