
Few-Shot Image-to-Semantics Translation for Policy Transfer in Reinforcement Learning

This research is partially supported by the JSPS KAKENHI Grant Number 19H04179, and based on a project, JPNP18002, commissioned by NEDO.

Rei Sato (1,3), Kazuto Fukuchi (2,3), Jun Sakuma (2,3), and Youhei Akimoto (2,3)
1 Graduate School of Science and Technology, University of Tsukuba, Tsukuba, Japan
2 Faculty of Engineering, Information and Systems, University of Tsukuba, Tsukuba, Japan
3 RIKEN Center for Advanced Intelligence Project
Abstract

We investigate policy transfer using image-to-semantics translation to mitigate learning difficulties in vision-based robotics control agents. This problem assumes two environments: a simulator environment with semantics, that is, low-dimensional and essential information, as the state space, and a real-world environment with images as the state space. By learning a mapping from images to semantics, we can transfer a policy, pre-trained in the simulator, to the real world, thereby eliminating costly and risky real-world on-policy interactions for learning. In addition, using image-to-semantics mapping is advantageous over other types of sim-to-real transfer strategies in terms of the computational efficiency of training the policy and the interpretability of the obtained policy. To tackle the main difficulty in learning image-to-semantics mapping, namely the human annotation cost of producing a training dataset, we propose two techniques: pair augmentation with the transition function in the simulator environment and active learning. We observed a reduction in the annotation cost without a decline in the performance of the transfer, and the proposed approach outperformed the existing approach without annotation.

Index Terms:
deep reinforcement learning, policy transfer, sim-to-real

I Introduction

Deep reinforcement learning (DRL) has been actively studied for robot control applications in real-world environments because of its ability to train vision-based agents; that is, the robot control actions are output directly from the observed images [1, 2, 3, 4]. One of the major advantages of vision-based agents in robotics is that camera-captured images can be incorporated into the decision-making of the agent without using a handcrafted feature extractor.

However, allowing vision-based robot control agents to learn by reinforcement learning in the real world is challenging in terms of risk and cost because it requires a large number of real-world interactions with unstable robots. Reinforcement learning involves a learning policy interacting with the environment, and it is theoretically and empirically known that the length of the interaction required for training increases with the dimension of the state space [5, 6].

To address the difficulty associated with reinforcement learning in a real-world environment, methods have been proposed that pre-train a policy on a simulator environment and transfer it to the real-world environment [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. In this methodology, policies are learned in a simulator, that is, a reinforcement learning environment on a computer that mimics the real-world environment. The policy pre-trained in the simulator is expected to be the optimal policy in the real-world environment.

However, developing a simulator that imitates the real-world environment is not always an easy task. In particular, because the real world provides image observations, a simulator environment requires a renderer to generate images as states. Producing a renderer that can generate photorealistic images, though, is fraught with financial and technical difficulties.

In cases where a photorealistic renderer cannot be produced, another style of observation must be adopted as the state during pre-training of the policy in a simulator environment. Most existing approaches substitute non-photorealistic observations for photorealistic ones and bridge the gap using transfer techniques [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17].

We investigated a type of transfer strategy called image-to-semantics, introduced by [18], to deal with the absence of a photorealistic renderer. In this approach, the semantics—low-dimensional and essential information describing the state shown in an image—are employed as the form of state observation in the simulator environment instead of images. The transfer procedure consists of the following steps: pre-training a policy on the simulator environment with semantics as its observation; obtaining a mapping from photorealistic images to their corresponding semantics; and using the image-to-semantics mapping as a pre-processing component of the policy in the real-world environment. A semantics-based pre-trained policy can thus be operated in the real-world environment using image observations. In addition to being a solution for the case without a photorealistic renderer, image-to-semantics mapping has advantages in terms of the computational cost of policy pre-training in the simulator and the interpretability of the acquired policy.

The crucial part of this approach is obtaining the image-to-semantics translation mapping. To the best of our knowledge, [19, 18] are the only studies that have dealt with learning image-to-semantics translation. We highlight the remaining problems of [19, 18]: (1) [19] used a paired dataset, that is, multiple pairs of images and corresponding semantics, to train the mapping. Considerable human effort is required to make a paired dataset because human annotators provide semantics that represent images. (2) Although the style translation method without a paired dataset [18] aims at saving annotation cost, its performance is often unsatisfactory owing to the low approximation quality of the image-to-semantics translation mapping, as confirmed in our experiments.

In this study, we tackled learning image-to-semantics translation using a paired dataset; however, we reduced the cost of creating a paired dataset using two strategies: pair augmentation and active learning. In our experiments, we confirmed the following claims: first, compared to [19], we reduced the cost of making a paired dataset while preserving the performance of the policy transfer. Second, we achieved significantly higher performance than [18], in which a paired dataset was not used, by using a small paired dataset. For practicality, we conducted experiments under the condition that only inaccurate paired data can be obtained due to various errors, such as annotation errors, and confirmed that the proposed method has a certain robustness against errors.

Our code is publicly available at https://github.com/madoibito80/im2sem.

II Problem Formulation

II-A Markov Decision Process (MDP)

We defined a vision-based robotics task in the real world; that is, the real-world environment is a target MDP $\mathcal{M}^{\tau}=(\mathcal{S}^{\tau},\mathcal{A},p^{\tau},r^{\tau},\gamma)$, where $\mathcal{S}^{\tau}$ is a state space, $\mathcal{A}$ is an action space, $p^{\tau}:\mathcal{S}^{\tau}\times\mathcal{A}\times\mathcal{S}^{\tau}\to\mathbb{R}$ is a transition probability density, $r^{\tau}:\mathcal{S}^{\tau}\times\mathcal{A}\times\mathcal{S}^{\tau}\to\mathbb{R}$ is a reward function, and $\gamma\in[0,1]$ is a discount factor. Because we assumed that the target MDP is a vision-based task, $\mathcal{S}^{\tau}$ consists of images, and each $s\in\mathcal{S}^{\tau}$ contains single or multiple image frames. In standard model-free reinforcement learning (RL) settings, agents can interact with the environment: they observe $s_{t+1}\sim p^{\tau}(\cdot\mid a_{t},s_{t})$ and reward $r_{t}=r^{\tau}(s_{t+1},a_{t},s_{t})$ by performing action $a_{t}$ at state $s_{t}$, which is internally preserved in the environment at timestep $t$; after the transition, $s_{t+1}$ is stored in the environment. However, there are concerns in terms of the risk and cost associated with learning a policy through extensive interaction with $\mathcal{M}^{\tau}$.

To reduce the risk and cost of training a policy in the target MDP, we pre-trained a policy on a simulator environment, called the source MDP: $\mathcal{M}^{\sigma}=(\mathcal{S}^{\sigma},\mathcal{A},p^{\sigma},r^{\sigma},\gamma)$. Note that the action space $\mathcal{A}$ is the same between the two MDPs. In contrast, the state space $\mathcal{S}^{\sigma}$, the transition probability density $p^{\sigma}:\mathcal{S}^{\sigma}\times\mathcal{A}\times\mathcal{S}^{\sigma}\to\mathbb{R}$, and the reward function $r^{\sigma}:\mathcal{S}^{\sigma}\times\mathcal{A}\times\mathcal{S}^{\sigma}\to\mathbb{R}$ are different from those of the target MDP. Because we considered robotics tasks, we assumed that the deterministic transition function $Tr^{\sigma}(s,a)=s^{\prime}\sim p^{\sigma}(\cdot\mid a,s)$ could be defined in the simulator environment and that $p^{\sigma}$ resembled a Dirac delta distribution.

The source state space $\mathcal{S}^{\sigma}$ corresponded to a semantic space, that is, each $s\in\mathcal{S}^{\sigma}$ was semantic information. For example, consider a robot-arm grasping task; each $s\in\mathcal{S}^{\tau}$ is a single or multiple image frame showing a robot arm and objects to be grasped. Each $s\in\mathcal{S}^{\sigma}$ consists of semantics such as the $xyz$-coordinates of the end-effector and target objects and the angles of the joints.

The source MDP and target MDP are expected to have some structural correspondence. Here, we describe our assumptions regarding the relations of the two MDPs. We assumed the existence of a function $F:\mathcal{S}^{\tau}\to\mathcal{S}^{\sigma}$ satisfying the following conditions:

Transition Condition: For all $(s^{\prime},a,s)\in\mathcal{S}^{\tau}\times\mathcal{A}\times\mathcal{S}^{\tau}$, $p^{\sigma}(F(s^{\prime})\mid a,F(s))=\int_{\bar{s}\in\bar{\mathcal{S}}}p^{\tau}(\bar{s}\mid a,s)\,\mathrm{d}\bar{s}$, where $\bar{\mathcal{S}}=\{\bar{s}\in\mathcal{S}^{\tau}\mid F(\bar{s})=F(s^{\prime})\}$.

Reward Condition: For all $(s^{\prime},a,s)\in\mathcal{S}^{\tau}\times\mathcal{A}\times\mathcal{S}^{\tau}$, $r^{\sigma}(F(s^{\prime}),a,F(s))=r^{\tau}(s^{\prime},a,s)$.

In the above conditions, $F$ is considered an oracle that takes an image and outputs the corresponding semantics; that is, $F$ is the true image-to-semantics translation mapping. In the transition condition, $\bar{\mathcal{S}}$ is the set of images that share the common semantics $F(s^{\prime})$. Consider a transition from $s\in\mathcal{S}^{\tau}$ to $s^{\prime}\in\mathcal{S}^{\tau}$ with action $a\in\mathcal{A}$ in the target MDP; under the transition condition and the deterministic transition, $F(s^{\prime})=Tr^{\sigma}(F(s),a)$ holds. The reward condition indicates that the reward for this transition, $r^{\tau}(s^{\prime},a,s)$, equals that for the transition from $F(s)\in\mathcal{S}^{\sigma}$ to $F(s^{\prime})\in\mathcal{S}^{\sigma}$ with action $a$ in the source MDP.
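To make the deterministic form of this correspondence concrete, the following minimal Python sketch (not part of the original method) checks $F(s^{\prime})=Tr^{\sigma}(F(s),a)$ on logged target-MDP transitions; the callables `annotate` (standing in for the oracle $F$) and `tr_sigma` (the simulator transition $Tr^{\sigma}$) are hypothetical interfaces assumed only for illustration.

```python
import numpy as np

def check_transition_condition(transitions, annotate, tr_sigma, atol=1e-6):
    """Verify F(s') == Tr^sigma(F(s), a) on logged (s, a, s') target transitions.

    `annotate` plays the role of the oracle F (image -> semantics) and
    `tr_sigma` the deterministic simulator transition Tr^sigma; both are
    assumptions of this sketch, not interfaces fixed by the paper.
    """
    for s_img, action, s_next_img in transitions:
        sem = annotate(s_img)              # F(s): semantics of the current image
        sem_next = annotate(s_next_img)    # F(s'): semantics of the next image
        predicted = tr_sigma(sem, action)  # Tr^sigma(F(s), a)
        if not np.allclose(sem_next, predicted, atol=atol):
            return False
    return True
```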

II-B Transfer via Image-to-Semantics


Figure 1: Illustration of transfer via image-to-semantics. We approximated the image-to-semantics translation mapping $F$ as $\hat{F}$. Because the action space was common to both MDPs, we operated the composite of the source policy $\pi^{\sigma}$ and the approximated image-to-semantics translation mapping $\hat{F}$, that is, $\pi^{\sigma}\circ\hat{F}$, in the target MDP.

II-B1 Policy Transfer

The objective of RL is the expectation of the discounted cumulative reward:

$J(\pi;p,r,\gamma,p_{0})=\mathbb{E}_{\pi,p,p_{0}}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t+1},a_{t},s_{t})\right]$ (1)

and maximizing it w.r.t. $\pi$. Here, $\pi:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is a policy, that is, a conditional distribution of $a_{t}$ given $s_{t}$, and $p_{0}$ is the distribution of the initial state $s_{0}$ over the state space. Our objective was to obtain a well-trained policy on the target MDP: $\pi^{\tau}=\mathrm{arg\,max}_{\bar{\pi}^{\tau}}J(\bar{\pi}^{\tau};p^{\tau},r^{\tau},\gamma,p^{\tau}_{0})$.

Under the situation in which the transition and reward conditions mentioned above hold for some $F$, we can replace $\pi^{\tau}$ by $\pi^{\sigma}\circ F$, where $\pi^{\sigma}$ is a well-trained policy on the source MDP, that is, $\pi^{\sigma}=\mathrm{arg\,max}_{\bar{\pi}^{\sigma}}J(\bar{\pi}^{\sigma};p^{\sigma},r^{\sigma},\gamma,p^{\sigma}_{0})$. Solving this maximization by RL requires interaction only with $\mathcal{M}^{\sigma}$ instead of $\mathcal{M}^{\tau}$. As noted, interactions with $\mathcal{M}^{\tau}$ require real-world operations; however, interactions with $\mathcal{M}^{\sigma}$ are performed on the simulator, which is cost-effective.

Based on this property, we studied the following transfer procedure: pre-train $\pi^{\sigma}$ on $\mathcal{M}^{\sigma}$, approximate $F$ as $\hat{F}$, and output the target agent $\pi^{\sigma}\circ\hat{F}$. This procedure was investigated by [18]. Figure 1 illustrates the transfer via image-to-semantics.
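As an illustration of how the composite agent is deployed, the following sketch rolls out $\pi^{\sigma}\circ\hat{F}$ in the target MDP; the Gym-style `reset`/`step` interface and the callables `policy_sigma` and `f_hat` are assumptions of this sketch, not interfaces specified in the paper.

```python
def run_target_agent(env, policy_sigma, f_hat, max_steps=1000):
    """Roll out the composite agent pi^sigma o F_hat in the target MDP.

    `env` is assumed to follow a Gym-style reset/step interface, `f_hat`
    maps images to semantics, and `policy_sigma` maps semantics to actions.
    """
    s_img = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        semantics = f_hat(s_img)          # translate the image observation to semantics
        action = policy_sigma(semantics)  # the source policy acts on semantics
        s_img, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```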

II-B2 Advantages

The above-mentioned transfer strategy, that is, transfer via image-to-semantics, has the following three advantages over the approaches using a renderer in the source MDP shown in Table I. First, a renderer is not required. Existing methods that use a renderer generally aim to transfer an agent based on non-photorealistic images in a simulator to photorealistic images in the real world [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. Therefore, they require a renderer on the simulator to generate non-photorealistic images as state observations. Transfer via image-to-semantics performs similar transfer learning; however, it does not require a renderer because the source MDP has a semantic space as its state space. This can reduce the development cost of the simulator for some tasks. Second, because semantics are low-dimensional variables compared to images, we can improve the sample efficiency of training the policy $\pi^{\sigma}$ on $\mathcal{M}^{\sigma}$ [5, 6]. Learning vision-based agents is generally associated with large computational costs, even on a simulator [20], but transfer via image-to-semantics is relatively lightweight in this respect and occasionally even allows a human to design the policy. Third, using semantics as the intermediate representation of the target agent contributes to its high interpretability because of the low dimensionality and interpretability of semantics. Similar to [19, 21], because the real-world agent $\pi^{\sigma}\circ\hat{F}$ can be separated into two components that are trained independently, it is easier to assess than one trained in an end-to-end manner.

II-C Resource Strategy

In this section, in addition to the two MDP environments, we define the resources that can be used to approximate $F$.

II-C1 Transition Function

In the target MDP, the state transition result $s_{t+1}$ for the selected action $a_{t}$ can be observed only for the state $s_{t}$ stored inside the environment. In contrast, in the source MDP, we assumed that the state transition result for any $s\in\mathcal{S}^{\sigma}$ could be observed by replacing the $s_{t}$ stored inside the environment with $s$. This is because the actual state transition probability $p^{\tau}$ in the target MDP is a physical phenomenon in the real world, whereas the state transition rule $Tr^{\sigma}$ in the source MDP is a black-box function on the computer.
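A minimal sketch of how such a queryable transition function can be exposed is given below, assuming the simulator offers hypothetical `set_state` and `step` methods; the method names are not taken from any specific simulator API.

```python
class SourceTransition:
    """Expose the simulator's deterministic transition as Tr^sigma(s, a).

    Assumes the simulator object offers `set_state(semantics)` and a
    Gym-style `step(action)`; both method names are hypothetical.
    """

    def __init__(self, simulator):
        self.simulator = simulator

    def __call__(self, semantics, action):
        # Overwrite the internally stored state with an arbitrary semantics vector,
        # which is possible on a simulator but not in the real world.
        self.simulator.set_state(semantics)
        next_semantics, _, _, _ = self.simulator.step(action)
        return next_semantics
```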

II-C2 Offline Dataset

The offline dataset comprised observations of the target MDP, that is, $\mathcal{T}^{\tau}=\{(s_{t},a_{t},\mathbbm{1}_{\mathrm{end}}(s_{t+1}))\in\mathcal{S}^{\tau}\times\mathcal{A}\times\{0,1\}\}_{t}$, where $\mathbbm{1}_{\mathrm{end}}(s_{t+1})=1$ indicates that $s_{t+1}$ is a terminal state, and $0$ otherwise. Note that successive indices in the offline dataset share the same episode context, except at the end of an episode. $\mathcal{T}^{\tau}$ can be obtained before training starts and is collected by a behavior policy. Because the offline dataset can be reused for any trial and can be obtained by a safety-guaranteed behavior policy, we assumed it could be created at a relatively low cost.

We solely used the offline dataset for supervised and unsupervised learning purposes. If offline reinforcement learning were executed, the vision-based agent could be trained directly without approximating $F$. However, training a vision-based agent from an offline dataset by reinforcement learning requires large-scale trajectories in the scope of millions [22]. In this study, we considered situations in which the total number of timesteps in the offline dataset was limited, for example, less than 100k timesteps.

We did not need to generate reward signals while collecting the offline dataset. World models [23] have been studied for the following procedure: approximate the MDP $\mathcal{M}$ as $\hat{\mathcal{M}}$ using an offline dataset of $\mathcal{M}$, and then train a policy by reinforcement learning by interacting with the approximated environment $\hat{\mathcal{M}}$ instead of the original environment $\mathcal{M}$. One could imagine replacing interactions with the target MDP by interactions with such an approximated one. However, to accomplish this, we must observe reward signals in the real world while collecting the offline dataset, and we must approximate a reward function that is often sparse; neither of these is always easy [24]. Therefore, we did not consider approximating the target MDP and did not assume the reward was contained in $\mathcal{T}^{\tau}$.

II-C3 Paired Dataset

The paired dataset $\mathcal{P}$ consisted of multiple pairs of target state observations and their corresponding source state observations. Let $\mathcal{I}$ denote the set of indices that indicate positions in the offline dataset. Using the true image-to-semantics translation mapping $F$, we can denote $\mathcal{P}=\{(F(s_{i}),s_{i})\mid(s_{i},a_{i},e_{i})\in\mathcal{T}^{\tau},i\in\mathcal{I}\}$. In practical situations, querying $F$ amounts to having human annotators provide the semantics corresponding to the images at indices $\mathcal{I}$ of the offline dataset $\mathcal{T}^{\tau}$. Because of this annotation cost, we assumed the size of the paired dataset $\lvert\mathcal{I}\rvert$ to be significantly smaller than that of the offline dataset, for example, $\lvert\mathcal{I}\rvert\leq 100$.
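The construction of $\mathcal{P}$ can be sketched as follows; the tuple layout of the offline dataset and the `annotate` oracle are assumptions used only for illustration.

```python
def build_paired_dataset(offline_dataset, indices, annotate):
    """Construct P = {(F(s_i), s_i)} for selected indices I of the offline dataset.

    `offline_dataset` is assumed to be a list of (image, action, end_flag)
    tuples and `annotate` stands in for the (human) oracle F.
    """
    paired = []
    for i in indices:
        image, _, _ = offline_dataset[i]
        semantics = annotate(image)      # costly human annotation F(s_i)
        paired.append((semantics, image))
    return paired
```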

III Related Work

TABLE I: Related policy transfer methods for observation style shift. Each method requires different resources: renderer, offline dataset (OFF), and paired dataset (PAIR).
Method | Renderer | OFF | PAIR
Tobin et al.[7] | ✓ | – | –
RCAN[8] | ✓ | – | –
DARLA[9] | ✓ | – | –
Pinto et al.[10] | ✓ | – | –
MLVR[11] | ✓ | – | –
Tzeng et al.[12] | ✓ | ✓ | –
GraspGAN[13] | ✓ | ✓ | –
RL-CycleGAN[14] | ✓ | ✓ | –
RetinaGAN[15] | ✓ | ✓ | –
MDQN[16] | ✓ | ✓ | ✓
ADT[17] | ✓ | ✓ | ✓
Zhang et al.[18] | – | ✓ | –
CRAR[19] | – | ✓ | ✓
Ours | – | ✓ | ✓

We introduced some existing sim-to-real transfer methods that use a non-photorealistic renderer on the simulator. Table I lists the transfer methods that do not require on-policy interaction in the target MDP, assuming vision-based agents. The main difficulty tackled by these methods was the absence of a photorealistic renderer on the simulator. In the real world, images captured by a camera are input to the agent; however, generating photorealistic images on the simulator is generally difficult because it requires developing a high-quality renderer.

In [7, 8, 9, 10, 11, 25], the algorithms learned policies or intermediate representations that were robust to changes in image style using a non-photorealistic renderer. Thus, these algorithms were expected to perform well even when a photorealistic style was applied in a real-world environment. In particular, the domain randomization technique has been widely used [7, 8, 9, 10].

III-A Transfer via Image-to-Image Translation

In contrast to the above methods, [12, 13, 14, 15, 17, 16, 18, 19] aimed to perform style translation between specific styles. To accomplish this, these methods required an offline dataset of the target MDP. Because these methods do not require on-policy interaction for data collection, the offline dataset can be collected by a safety-guaranteed policy. Unsupervised style translation techniques, such as domain adaptation [26] and CycleGAN [27], are often used to change styles in state-of-the-art methods [13, 14, 15, 17, 18, 28, 24]. Using this translation mapping as a pre-processing function of the target agent, the pre-trained policy can determine actions in the target MDP from observations translated into the same image style as the source MDP.

However, domain adaptation and cycle-consistency [27] only have a weak alignment ability [18], and some existing methods use paired datasets to properly transfer styles [19, 17, 16]. Therefore, these two datasets have been widely employed in previous studies and can be assumed to be a common setting.

The similarity between transfer via image-to-semantics and transfer via image-to-image is that both train a style translation mapping $\hat{F}$ between the source and target state spaces that preserves essential information; furthermore, in both cases the agent is the composite $\pi\circ\hat{F}$, where $\pi$ is a policy.

Again, the above methods use a non-photorealistic renderer on the simulator. Thus, these methods cannot be compared with transfer via image-to-semantics, as explained in Section II-B2.

III-B Learning Image-to-Semantics

Previous studies have used semantics in the source MDP [18, 10, 17, 16, 12]. An important perspective on the applicability of these methods to image-to-semantics is whether they use a renderer on the simulator, as shown in Table I and as discussed in Section II-B2. Because methods using a renderer assume that the source state space is an image space, image-to-semantics is beyond their scope, and it is not certain that their mechanism will be successful in image-to-semantics. For example, CycleGAN, which has been successfully used for image-to-image learning, failed in image-to-semantics [18]. In this regard, we refer to [18], an unpaired method that applies the findings from image-to-image to image-to-semantics. In addition, [19] is compared as a representative method that uses a paired dataset as in this study.

III-B1 CRAR

We refer to Section 4.4 of CRAR [19] as a baseline of image-to-semantics learning. They described the following policy transfer strategy: pre-train a source state encoder $E^{\sigma}:\mathcal{S}^{\sigma}\to\mathcal{Z}$, where $\mathcal{Z}$ is a latent space of the encoder; train the source policy $\pi^{\sigma}:\mathcal{Z}\to\mathcal{A}$; and train a target state encoder $E^{\tau}:\mathcal{S}^{\tau}\to\mathcal{Z}$ with the regularization term $\sum_{(s^{\sigma},s^{\tau})\in\mathcal{P}}\lVert E^{\sigma}(s^{\sigma})-E^{\tau}(s^{\tau})\rVert_{2}^{2}$, where $\mathcal{P}$ is a paired dataset. Then, the target agent is the composite $\pi^{\sigma}\circ E^{\tau}:\mathcal{S}^{\tau}\to\mathcal{A}$. Here, $E^{\tau}$ can be regarded as a style translation mapping. Note that they only performed this experiment in the setting where $\mathcal{S}^{\sigma}$ and $\mathcal{S}^{\tau}$ are both image spaces; however, it can be applied easily where $\mathcal{S}^{\sigma}$ is the semantic space.

III-B2 Zhang et al.

We referred to the cross-modality setting of their experiment as our baseline for image-to-semantics [18]. This setting is the same as the transfer via image-to-semantics.

There remain some challenges in [19, 18]. In [18], the human annotation cost was eliminated because they did not use a paired dataset. However, the loss function defined by [18] for unpaired image-to-semantics style translation does not necessarily yield a well-approximated $F$. Therefore, we decided to use a paired dataset to efficiently supervise the loss function as performed in [19], but with a paired dataset smaller than that of [19].

IV Methodology


Figure 2: Illustration of pair augmentation. The oracle $F$ generates semantics corresponding to a particular image in the offline dataset. The next state in semantics is computed using the transition function $Tr^{\sigma}$ with the current semantics and the action taken while collecting the offline dataset. This allows us to obtain semantics corresponding to the image at the next timestep in the offline dataset without any annotation cost. Augmented pairs, marked with green dual directional arrows, are stored in $\mathcal{P}^{\prime}$. Note that rendered (non-photorealistic) images are shown in the offline dataset in this figure, but in reality, camera-captured (photorealistic) images are contained.

Our approach approximates the image-to-semantics translation $F$ using an offline dataset $\mathcal{T}^{\tau}$. Similar to [19], we used a paired dataset $\mathcal{P}=\{(F(s_{i}),s_{i})\mid(s_{i},a_{i},e_{i})\in\mathcal{T}^{\tau},i\in\mathcal{I}\}$, which was constructed by querying $F(s_{i})$ to human annotators for image observations $s_{i}\in\mathcal{S}^{\tau}$ of the target MDP included in $\mathcal{T}^{\tau}$. We incorporated two main ideas to reduce the annotation cost. Pair augmentation generates an augmented paired dataset $\mathcal{P}^{\prime}$ using the offline dataset $\mathcal{T}^{\tau}$. Active learning defines $\mathcal{I}$, that is, it selects the subset of $\mathcal{T}^{\tau}$ to be annotated to construct $\mathcal{P}$ (Algorithm 2). We present the overall procedure of our method in Algorithm 1.

We assumed that we have an offline dataset $\mathcal{T}^{\tau}$ comprising multiple episodes in the target MDP. Let $\mathcal{O}$ denote the set of indices corresponding to the beginnings of episodes in $\mathcal{T}^{\tau}$, that is, $\mathcal{O}=\{0\}\cup\{i\mid 0<i<\lvert\mathcal{T}^{\tau}\rvert \text{ and } e_{i-1}=1 \text{ for } (s^{\tau}_{i-1},a_{i-1},e_{i-1})\in\mathcal{T}^{\tau}\}$, where $e_{i}$ is the indicator: $e_{i}=1$ when timestep $i$ is the end of an episode. For each $i\in\mathcal{O}$, let $\mathcal{E}_{i}=\{t\mid 1\leq t\leq\min(\{k\mid k\geq 1,\ e_{i+k}=1\})\}$. Then, for each $i\in\mathcal{O}$, the subsequence of $\mathcal{T}^{\tau}$ starting from timestep $i$ and ending at $i+\lvert\mathcal{E}_{i}\rvert$ corresponds to an episode.
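The sets $\mathcal{O}$ and $\mathcal{E}_{i}$ can be computed from the end-of-episode flags as in the following sketch, which assumes the offline dataset is stored as a list of (image, action, end_flag) tuples and that every episode is terminated by an end flag within the dataset.

```python
def episode_starts_and_ranges(offline_dataset):
    """Return the episode-start indices O and, for each start i, the offsets E_i.

    `offline_dataset` is assumed to be a list of (image, action, end_flag)
    tuples, where end_flag == 1 marks the last transition of an episode.
    """
    n = len(offline_dataset)
    starts = [0] + [i for i in range(1, n) if offline_dataset[i - 1][2] == 1]
    ranges = {}
    for i in starts:
        k = 1
        # smallest k >= 1 with e_{i+k} == 1, so E_i = {1, ..., k}
        while offline_dataset[i + k][2] != 1:
            k += 1
        ranges[i] = list(range(1, k + 1))
    return starts, ranges
```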

Algorithm 1 Overall Procedure
1: Input: Source MDP $\mathcal{M}^{\sigma}$, offline dataset $\mathcal{T}^{\tau}$, oracle $F$
2: Train the source MDP's policy $\pi^{\sigma}$ on $\mathcal{M}^{\sigma}$
3: Train the VAE encoder $E^{\tau}$ using $\mathcal{T}^{\tau}$
4: Determine indices $\mathcal{I}$ by active learning (Algorithm 2) using $E^{\tau}$ and $\mathcal{T}^{\tau}$
5: Create $\mathcal{P}$ for $\mathcal{I}$ and $\mathcal{T}^{\tau}$ by the oracle (human annotator) $F$
6: Create augmented pairs $\mathcal{P}^{\prime}$ using $\mathcal{P}$, $\mathcal{T}^{\tau}$, and $Tr^{\sigma}$ of $\mathcal{M}^{\sigma}$
7: Train $\hat{F}$ by minimizing Equation 2
8: Output: Target MDP's agent $\pi^{\sigma}\circ\hat{F}$

IV-A Pair Augmentation by Transition Function

The objective of pair augmentation is to construct artificial paired data $\mathcal{P}^{\prime}$ such that $s^{\sigma}\approx F(s^{\tau})$ for $(s^{\sigma},s^{\tau})\in\mathcal{P}^{\prime}$ and $s^{\tau}\in\mathcal{T}^{\tau}$. Using the augmented paired dataset, we aimed to obtain $\hat{F}$ approximating $F$ by minimizing the loss

$\mathcal{L}(\hat{F},\mathcal{P}\cup\mathcal{P}^{\prime})=\frac{1}{\lvert\mathcal{P}\cup\mathcal{P}^{\prime}\rvert}\sum_{(s^{\sigma},s^{\tau})\in\mathcal{P}\cup\mathcal{P}^{\prime}}\lVert s^{\sigma}-\hat{F}(s^{\tau})\rVert_{2}^{2}\,.$ (2)

Note that CRAR [19] adopts $\mathcal{L}(\hat{F},\mathcal{P})$ instead of $\mathcal{L}(\hat{F},\mathcal{P}\cup\mathcal{P}^{\prime})$.
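A minimal PyTorch-style sketch of minimizing Equation 2 over $\mathcal{P}\cup\mathcal{P}^{\prime}$ is shown below; the optimizer, batch size, and number of epochs are illustrative placeholders rather than the settings used in our experiments.

```python
import torch
from torch import optim
from torch.utils.data import DataLoader, TensorDataset

def train_f_hat(f_hat, pairs, epochs=100, lr=1e-3, batch_size=64):
    """Minimize Eq. (2): the mean squared L2 error between (annotated or
    augmented) semantics and F_hat(image) over P ∪ P'.

    `pairs` is a list of (semantics, image) arrays and `f_hat` a torch
    nn.Module; both are assumptions of this sketch.
    """
    semantics = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, _ in pairs])
    images = torch.stack([torch.as_tensor(x, dtype=torch.float32) for _, x in pairs])
    loader = DataLoader(TensorDataset(images, semantics), batch_size=batch_size, shuffle=True)
    optimizer = optim.Adam(f_hat.parameters(), lr=lr)
    for _ in range(epochs):
        for x, s in loader:
            loss = ((s - f_hat(x)) ** 2).sum(dim=1).mean()  # Eq. (2) on a mini-batch
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return f_hat
```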

Our principle is as follows. Let $\mathcal{I}\subseteq\mathcal{O}$ be a subset of the indices corresponding to the beginnings of episodes in $\mathcal{T}^{\tau}$. Suppose we have a paired dataset $\mathcal{P}$ constructed by querying the semantics $s_{i}^{\sigma}=F(s_{i}^{\tau})$ corresponding to images $s_{i}^{\tau}$ in $\mathcal{T}^{\tau}$ for time indices $i\in\mathcal{I}$. Although the semantics $s_{i+1}^{\sigma}$ representing the image at the next timestep, $s^{\tau}_{i+1}$ in $\mathcal{T}^{\tau}$, are unknown, because of the transition condition given in Section II-A and the deterministic transition, they equal $s_{i+1}^{\sigma}=Tr^{\sigma}(s_{i}^{\sigma},a_{i})$, where $a_{i}$ is the action taken at timestep $i$ when collecting the offline dataset $\mathcal{T}^{\tau}$ and is included in $\mathcal{T}^{\tau}$. In reality, because human annotations and state transitions contain errors, the generated semantics $s^{\sigma}_{i+1}$ do not exactly represent the image $s^{\tau}_{i+1}$. However, even with errors in $F$ and $Tr^{\sigma}$, the semantics generated in this way are expected to be a valuable approximation. By recursively applying this generation, we obtained the augmented paired dataset $\mathcal{P}^{\prime}$.

Formally, $\mathcal{P}^{\prime}$ was constructed as follows: for each index $i\in\mathcal{I}$, we defined a sequence $\{\widehat{s}_{i+t}^{\sigma}\}_{t\in\mathcal{E}_{i}}$ by $\widehat{s}_{i}^{\sigma}=s_{i}^{\sigma}$ (contained in $\mathcal{P}$) and $\widehat{s}_{i+t}^{\sigma}=Tr^{\sigma}(\widehat{s}_{i+t-1}^{\sigma},a_{i+t-1})$ for $t\in\mathcal{E}_{i}$, where $a_{i+t-1}$ is contained in $\mathcal{T}^{\tau}$. The augmented paired dataset is then $\mathcal{P}^{\prime}=\{(\widehat{s}^{\sigma}_{i+t},s^{\tau}_{i+t})\}_{i\in\mathcal{I},t\in\mathcal{E}_{i}}$, where $s^{\tau}_{i+t}$ is contained in $\mathcal{T}^{\tau}$. Thus, we could construct an augmented paired dataset $\mathcal{P}^{\prime}$ of size $\lvert\mathcal{P}^{\prime}\rvert=\sum_{i\in\mathcal{I}}\lvert\mathcal{E}_{i}\rvert$ from the paired dataset $\mathcal{P}$ of size $\lvert\mathcal{P}\rvert=\lvert\mathcal{I}\rvert$.
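The roll-out that generates $\mathcal{P}^{\prime}$ can be sketched as follows; the dictionary layout of the annotated pairs and the `tr_sigma` callable are assumptions of this sketch.

```python
def augment_pairs(annotated, offline_dataset, episode_ranges, tr_sigma):
    """Build P' by rolling the simulator transition Tr^sigma along logged actions.

    `annotated` maps an episode-start index i to its annotated semantics s_i^sigma,
    `offline_dataset` holds (image, action, end_flag) tuples, and
    `episode_ranges[i]` is E_i; all layouts are assumptions of this sketch.
    """
    augmented = []
    for i, sem in annotated.items():
        current = sem
        for t in episode_ranges[i]:
            action = offline_dataset[i + t - 1][1]  # a_{i+t-1}, logged in the offline dataset
            current = tr_sigma(current, action)     # s_hat_{i+t} = Tr^sigma(s_hat_{i+t-1}, a_{i+t-1})
            augmented.append((current, offline_dataset[i + t][0]))  # pair with image s_{i+t}^tau
    return augmented
```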

Figure 2 illustrates the pair augmentation scheme.

The reason why $\mathcal{I}$ was chosen as a subset of the episode-start indices $\mathcal{O}$ rather than $\mathcal{I}\subseteq\{j\mid 0\leq j<\lvert\mathcal{T}^{\tau}\rvert,j\in\mathbb{Z}\}$ was to maximize the number of augmented pairs $\lvert\mathcal{E}_{i}\rvert$. In other words, because we could augment $s^{\sigma}_{i}=F(s^{\tau}_{i})$ until the end of the episode containing $s^{\tau}_{i}$, to maximize $\lvert\mathcal{P}\cup\mathcal{P}^{\prime}\rvert$, human annotations should be conducted at the beginnings of episodes of $\mathcal{T}^{\tau}$.

IV-B Active Learning for Pair Augmentation

To select episodes for annotation, that is, to decide $\mathcal{I}$, we incorporated the idea of diversity-based active learning (AL) [29, 30, 31]. The motivation is to select dissimilar samples to effectively reduce the approximation error. Intuitively, if $\mathcal{P}\cup\mathcal{P}^{\prime}$ has many similar pairs, they might have a similar effect on training $\hat{F}$; this may lead to a waste of annotation cost. Therefore, we attempted to select the episodes to be annotated (indexed by $\mathcal{I}\subset\mathcal{O}$) so as to ensure the inclusion of diverse pairs.

We successively selected the episodes to annotate, and we called each selection step the $n$-th round. For $i\in\mathcal{O}$, let $B_{i}=\{s_{i+t}^{\tau}\}_{t\in\{0\}\cup\mathcal{E}_{i}}$ be the set of target state observations present in the episode starting at timestep $i\in\mathcal{O}$. We refer to it as a batch. Let $\mathcal{I}_{n-1}$ be the set of selected indices before the $n$-th round, and let $S_{n-1}=\bigcup_{k\in\mathcal{I}_{n-1}}B_{k}$ be the set of all state observations in the episodes selected before the $n$-th round. Let $d:\mathcal{S}^{\tau}\times\mathcal{S}^{\tau}\to\mathbb{R}$ be an appropriate distance measure. In the $n$-th round, a batch was selected based on the following two diversity measures. The inter-batch diversity

$f_{\mathrm{inter}}(B_{i},S_{n-1})=\sum_{s^{\tau}\in B_{i}}\min_{s_{j}^{\tau}\in S_{n-1}}d(s^{\tau},s_{j}^{\tau})$ (3)

can evaluate the dissimilarity between $B_{i}$ and $S_{n-1}$. The batch with the greatest $f_{\mathrm{inter}}$ was considered to be the most dissimilar to the pre-selected batches. The intra-batch diversity

$f_{\mathrm{intra}}(B_{i})=\sum_{s^{\tau}_{p}\in B_{i}}\sum_{s^{\tau}_{q}\in B_{i}}d(s_{p}^{\tau},s_{q}^{\tau})$ (4)

can evaluate the dissimilarity of the states inside $B_{i}$. The batch with the greatest $f_{\mathrm{intra}}$ was considered to contain the most diverse states.

Algorithm 2 Active Learning
1: Input: Trained VAE encoder $E^{\tau}$, offline dataset $\mathcal{T}^{\tau}$
2: Initialize $\mathcal{I}_{0}=\{c\}$ with $c\sim\mathrm{Uniform}(\mathcal{O})$
3: for $1\leq n<N$ do ▷ $n$-th round
4:   Set $S_{n-1}=\bigcup_{k\in\mathcal{I}_{n-1}}B_{k}$
5:   Measure $f_{\mathrm{inter}}(B_{i},S_{n-1})$ for all $i\in\mathcal{O}$
6:   Pick the top $b\%$ of indices in terms of $f_{\mathrm{inter}}$ as $\mathcal{Q}$
7:   Measure $f_{\mathrm{intra}}(B_{i})$ for all $i\in\mathcal{Q}$
8:   Pick the index $c$ from $\mathcal{Q}$ with the greatest $f_{\mathrm{intra}}$
9:   Set $\mathcal{I}_{n}=\{c\}\cup\mathcal{I}_{n-1}$
10: end for
11: Output: Indices $\mathcal{I}_{N-1}$ (of size $N$) as $\mathcal{I}$

We selected a batch that maximizes the above two diversity measures; that is, we performed a bi-objective optimization for selection. To avoid overemphasizing one measure over the other, we employed two separate single-objective selections, one for each measure. In each round, we picked the indices of batches whose $f_{\mathrm{inter}}$ is in the top $b\%$ ($b=10$ in our experiments) among the unselected episodes as $\mathcal{Q}$, and subsequently selected the batch with the greatest $f_{\mathrm{intra}}$ from $\mathcal{Q}$. $\mathcal{I}_{0}$ was initialized with an episode sampled from $\mathcal{O}$ uniformly at random.
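A sketch of this selection loop, operating on precomputed latent feature vectors (see Section IV-C), is given below; the `features` layout and the use of NumPy arrays are assumptions for illustration.

```python
import numpy as np

def select_episodes(features, num_rounds, top_percent=10, rng=None):
    """Diversity-based episode selection (a sketch of Algorithm 2).

    `features` maps each episode-start index i to an array of shape
    (episode_length, latent_dim) holding the VAE mean vectors of batch B_i.
    """
    rng = np.random.default_rng() if rng is None else rng
    candidates = list(features.keys())
    selected = [candidates[rng.integers(len(candidates))]]   # I_0: one random episode
    for _ in range(1, num_rounds):
        pool = np.concatenate([features[i] for i in selected])        # states of S_{n-1}
        remaining = [i for i in candidates if i not in selected]
        # f_inter: sum over states in B_i of the distance to the nearest selected state
        inter = []
        for i in remaining:
            dists = np.linalg.norm(features[i][:, None, :] - pool[None, :, :], axis=-1)
            inter.append(dists.min(axis=1).sum())
        k = max(1, int(len(remaining) * top_percent / 100))
        top = [remaining[j] for j in np.argsort(inter)[::-1][:k]]     # top b% by f_inter

        def intra(i):
            # f_intra: sum of pairwise distances inside B_i
            b = features[i]
            return np.linalg.norm(b[:, None, :] - b[None, :, :], axis=-1).sum()

        selected.append(max(top, key=intra))
    return selected
```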

IV-C Representation Learning Using Offline Dataset

For $d:\mathcal{S}^{\tau}\times\mathcal{S}^{\tau}\to\mathbb{R}$ to be a reasonable distance measure in the image space, we employed a VAE encoder [32] $E^{\tau}:\mathcal{S}^{\tau}\to\mathcal{Z}$, which stochastically outputs a latent vector $z\in\mathcal{Z}$ for $s^{\tau}\in\mathcal{S}^{\tau}$. The distance between two states $s_{p}^{\tau}\in\mathcal{S}^{\tau}$ and $s_{q}^{\tau}\in\mathcal{S}^{\tau}$ was given by the Euclidean distance between the mean vectors of their latent representations, that is, $d(s_{p}^{\tau},s_{q}^{\tau})=\lVert\mathbb{E}[E^{\tau}(s_{p}^{\tau})]-\mathbb{E}[E^{\tau}(s_{q}^{\tau})]\rVert_{2}$. We trained $E^{\tau}$ using all states in the offline dataset $\mathcal{T}^{\tau}$ before performing the active learning procedure.
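A sketch of this distance is given below, assuming the encoder returns a (mean, log-variance) pair for an input image; only the mean is used, so the distance is deterministic.

```python
import numpy as np

def latent_distance(encoder, s_p, s_q):
    """d(s_p, s_q) = || E[E^tau(s_p)] - E[E^tau(s_q)] ||_2 using the VAE encoder mean.

    `encoder` is assumed to return a (mean, log_var) pair for an image.
    """
    mu_p, _ = encoder(s_p)
    mu_q, _ = encoder(s_q)
    return float(np.linalg.norm(np.asarray(mu_p) - np.asarray(mu_q)))
```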

We used the states contained in $\bigcup_{i\in\mathcal{I}}B_{i}$ to train $\hat{F}$ by Equation 2; however, the remaining states $\bigcup_{i\in\mathcal{O}\setminus\mathcal{I}}B_{i}$ were not used. To exploit them as well, we included $E^{\tau}$ as a feature extractor for $\hat{F}$, thereby receiving the benefit of representation learning for downstream tasks. We modeled $\hat{F}=\phi\circ E^{\tau}$ and trained $\phi$ by Equation 2 while $E^{\tau}$ was kept fixed.

V Experiments

We aimed to verify the following two claims: (1) the proposed pair augmentation and AL reduce the annotation cost for learning $\hat{F}$ while maintaining its performance level; and (2) the paradigm with a paired dataset performs better than the method without paired datasets.

V-A Evaluation Metrics

V-A1 Policy Performance (PP)

The most important evaluation metric for $\hat{F}$ is the expected cumulative reward of the target agent, defined using Equation 1:

$\mathrm{PP}(\hat{F};\pi^{\sigma},\mathcal{M}^{\tau})=J(\pi^{\sigma}\circ\hat{F};p^{\tau},r^{\tau},\gamma,p^{\tau}_{0})\,.$ (5)

In our experiments, we approximated it by averaging the cumulative reward over 50 episodes with $\gamma=1$. This metric was commonly used in [19, 18].

V-A2 Matching Distance (MD)

Because our technical contribution was mainly to approximate $F$, we used the following empirical approximation error:

$\mathrm{MD}(\hat{F};\mathcal{T},F)=\frac{1}{\lvert\mathcal{T}\rvert}\sum_{(s^{\tau},a,e)\in\mathcal{T}}\lVert F(s^{\tau})-\hat{F}(s^{\tau})\rVert_{2}^{2},$ (6)

where $\mathcal{T}$ is a trajectory collected by a behavior policy in the target MDP and not used for learning $\hat{F}$. Unfortunately, in a real-world environment, evaluating Equation 6 for a large $\mathcal{T}$ is challenging because $F$ requires human annotation. To enable MD in our experiments, we used the simulator for both the source MDP and the target MDP and adopted the rendered image space as the state space of the target MDP. Because both semantics and images were generated in the simulator, $F$ was freely available to calculate Equation 6. A metric similar to Equation 6 was used in [18].
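The following sketch computes MD on a held-out trajectory; `annotate` stands in for $F$, which is available in our experiments only because both MDPs run on the simulator, and the tuple layout of the trajectory is an assumption.

```python
import numpy as np

def matching_distance(f_hat, trajectory, annotate):
    """Empirical approximation error of Eq. (6) on a held-out trajectory.

    `trajectory` is a list of (image, action, end_flag) tuples not used for
    training F_hat, and `annotate` stands in for the oracle F.
    """
    errors = [np.sum((np.asarray(annotate(s)) - np.asarray(f_hat(s))) ** 2)
              for s, _, _ in trajectory]
    return float(np.mean(errors))
```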

V-B Environment

We evaluated the proposed approach on three environments.

V-B1 ViZDoom Shooting (Shooting)

ViZDoom Shooting [33] is a first-person view shooting task in which the agent obtains 64×64 RGB images from the first-person perspective in the target MDP. The agent can move left and right in the room, changing its $x$-coordinate, and attack forward ($\lvert\mathcal{A}\rvert=3$). An enemy spawns at a random $x$-coordinate on the other side of the room at the start of the episode and does not move or attack. The agent can destroy the enemy by moving in front of it and shooting; the time to destruction is directly related to the reward. The semantics are the $x$-coordinates of the agent and the enemy; hence, $\mathcal{S}^{\sigma}$ is a 2-dimensional space. The maximum number of timesteps is 50 per episode. The behavior policy used to collect the offline dataset $\mathcal{T}^{\tau}$ is a random policy, and $\mathcal{T}^{\tau}$ consists of 200 episodes, that is, 10k timesteps in total.

V-B2 PyBullet KUKA Grasp (KUKA)

This is a grasping task using PyBullet's KUKA iiwa robot arm [34]. Success is achieved by manipulating the end-effector of the robot arm and lifting a randomly placed cylinder. The semantics are the $xyz$-coordinates and the 3-dimensional Euler angle of the end-effector and the $xyz$-coordinates of the cylinder; hence, $\mathcal{S}^{\sigma}$ is a 9-dimensional space. We used rendered 64×64 RGB images captured simultaneously from three different viewpoints as the state observations in the target MDP. The total number of timesteps per episode is fixed at 40. The behavior policy used to collect $\mathcal{T}^{\tau}$ is a random policy, and $\mathcal{T}^{\tau}$ comprises 250 episodes, that is, 10k timesteps in total.

V-B3 PyBullet HalfCheetah-v0 (HalfCheetah)

This is a PyBullet version of HalfCheetah, a task in which a 2-dimensional cheetah is controlled continuously to run faster. The torques of the six joints can be controlled ($\mathcal{A}=[-1,1]^{6}$), and the semantic space is 26-dimensional. We collected 64×64 images captured from three different viewpoints at two consecutive timesteps and defined $\mathcal{S}^{\tau}$ as an image space containing a total of 6 frames. The total number of timesteps per episode is fixed at 1000. The behavior policy used to collect $\mathcal{T}^{\tau}$ is a random policy, and $\mathcal{T}^{\tau}$ consists of 100 episodes, that is, 100k timesteps in total.

In our experiments, information such as $xyz$-coordinates and velocities can be recovered from a combination of multiple images because images are captured from multiple viewpoints at consecutive timesteps; such a setup is necessary in practice.

V-C Setting

We used a 7-layer convolutional neural network for the VAE encoder $E^{\tau}$ and a 4-layer fully connected neural network for $\phi$, for both the proposed and existing methods. We trained them by gradient descent using Adam [35]. The dimension of the latent space $\mathcal{Z}$ of the VAE was set to 32, 96, and 192 for Shooting, KUKA, and HalfCheetah, respectively. For CRAR [19], we uniformly selected indices $\mathcal{I}$ from $\{i\mid 0\leq i<\lvert\mathcal{T}^{\tau}\rvert,i\in\mathbb{Z}\}$. For our method without the AL setting, $\mathcal{I}$ was selected uniformly at random from $\mathcal{O}$. For Shooting and KUKA, we used a handcrafted policy as $\pi^{\sigma}$ instead of one trained by RL. For HalfCheetah, we trained $\pi^{\sigma}$ using PPO [36].

V-D Results

TABLE II: Results of Shooting. MD values were scaled by $10^{2}$ for convenience. $\pi^{\sigma}$ has PP $=45.99$, and the behavior policy has PP $=16.39$.
Method | MD | PP
$\lvert\mathcal{I}\rvert=0$
Zhang et al. | 37.42 ± 6.23 | 22.46 ± 4.99
$\lvert\mathcal{I}\rvert=10$
CRAR | 11.00 ± 1.72 | 35.16 ± 4.31
Ours w/o AL | 3.44 ± 0.89 | 43.99 ± 1.29
Ours | 0.16 ± 0.12 | 44.73 ± 1.07
$\lvert\mathcal{I}\rvert=50$
CRAR | 2.80 ± 0.70 | 42.29 ± 2.12
Ours w/o AL | 0.06 ± 0.02 | 46.02 ± 0.31
Ours | 0.02 ± 0.00 | 45.66 ± 0.34
Figure 3: Scatter plots of the obtained semantics on ViZDoom Shooting with $\lvert\mathcal{P}\rvert=\lvert\mathcal{I}\rvert=10$: $\{F(s^{\tau})\mid(s^{\sigma},s^{\tau})\in\mathcal{P}\}$ for CRAR, and $\{F(s^{\tau})\mid(s^{\sigma},s^{\tau})\in\mathcal{P}\cup\mathcal{P}^{\prime}\}$ for our method. Each square represents the 2-dimensional semantic space. The plots show that both pair augmentation and AL contribute to expanding the coverage of the semantic space.
TABLE III: Results of KUKA. PP corresponds to the grasp success probability. $\pi^{\sigma}$ has PP $=1.0$, and the behavior policy has PP $=0.048$.
Method | MD | PP
$\lvert\mathcal{I}\rvert=0$
Zhang et al. | 0.90 ± 0.12 | 0.12 ± 0.08
$\lvert\mathcal{I}\rvert=10$
CRAR | 0.59 ± 0.07 | 0.24 ± 0.22
Ours w/o AL | 0.35 ± 0.05 | 0.52 ± 0.19
Ours | 0.37 ± 0.03 | 0.65 ± 0.07
$\lvert\mathcal{I}\rvert=100$
CRAR | 0.32 ± 0.02 | 0.52 ± 0.15
Ours w/o AL | 0.11 ± 0.01 | 0.76 ± 0.09
Ours | 0.12 ± 0.01 | 0.90 ± 0.04
TABLE IV: Results of HalfCheetah. $\pi^{\sigma}$ has PP $=2735.73$, and the behavior policy has PP $=-1230.01$.
Method | MD | PP
$\lvert\mathcal{I}\rvert=0$
Zhang et al. | 3.71 ± 0.32 | -1511.97 ± 192.51
$\lvert\mathcal{I}\rvert=10$
CRAR | 1.80 ± 0.09 | -1411.83 ± 144.21
Ours w/o AL | 0.40 ± 0.04 | 596.20 ± 121.43
Ours | 0.37 ± 0.05 | 580.21 ± 71.95
$\lvert\mathcal{I}\rvert=50$
CRAR | 0.88 ± 0.05 | -818.29 ± 300.08
Ours w/o AL | 0.12 ± 0.01 | 878.79 ± 88.52
Ours | 0.07 ± 0.01 | 968.49 ± 99.53

Tables II, III and IV show the results of image-to-semantics learning in the three environments. These tables report the average ± standard deviation over five trials. $\lvert\mathcal{I}\rvert$ denotes the number of paired data, which is the annotation cost. Because of the transition and reward conditions, the PP of $\pi^{\sigma}\circ F$ on $\mathcal{M}^{\tau}$ coincides with that of $\pi^{\sigma}$ on $\mathcal{M}^{\sigma}$.

Note that most image-to-image methods shown in Table I cannot be compared with image-to-semantics methods because some assumptions cannot be satisfied under image-to-semantics settings. One way to speculate on the performance of the image-to-image techniques in an image-to-semantics setting is to see Zhang et al. [18]. Zhang et al. used domain adaptation [26], which is commonly used in image-to-image learning; thus, their method can be interpreted as a representative example in which the techniques cultivated in image-to-image are imported to image-to-semantics. Although CycleGAN [27] is also widely employed in image-to-image learning, along with domain adaptation, they confirmed in their experiments that this method did not outperform their method in the image-to-semantics setting [18].

In all cases, compared with the approach of Zhang et al. [18], our approaches with and without AL achieved a smaller MD and a greater PP. Zhang et al.'s approach is designed to learn $\hat{F}$ without a paired dataset to eliminate the annotation cost. However, learning without pairs does not necessarily lead to the true image-to-semantics translation mapping, as observed in the high MD and low PP in our results. This result shows the effectiveness of the paradigm using paired data when aiming for higher-performance policy transfer at the cost of the annotation effort needed to prepare a small number of paired data.

By comparing the results of our approaches with and without AL against those of CRAR, we confirmed the efficacy of pair augmentation in achieving a smaller MD and a higher PP. We achieved PP $=44.73\pm 1.07$ in Shooting with 10 pairs using pair augmentation and AL, which exceeds the PP $=42.29\pm 2.12$ achieved by CRAR with 50 pairs. In addition, in KUKA, we achieved PP $=0.65\pm 0.07$ with 10 pairs, which exceeds the PP $=0.52\pm 0.15$ achieved by CRAR with 100 pairs. This means that the annotation cost was reduced by factors of more than 5 and 10, respectively. This difference is even more pronounced in HalfCheetah. This may be because the ratio $\lvert\mathcal{P}\cup\mathcal{P}^{\prime}\rvert/\lvert\mathcal{P}\rvert$ is the greatest in this environment: CRAR uses $\lvert\mathcal{P}\rvert=\lvert\mathcal{I}\rvert$ paired data, whereas our proposed approach uses an additional $\lvert\mathcal{P}^{\prime}\rvert=999\lvert\mathcal{I}\rvert$ augmented paired data because the number of timesteps per episode is 1000 in this environment.

A tendency of reduced MD and increased PP was observed in the proposed approach with AL compared to that without AL. Specifically, AL reduced MD except in KUKA, and clearly improved PP in KUKA, while achieving competitive PP in the other two environments.

Figure 3 illustrates the effectiveness of our approach in Shooting. The proposed AL maximizes the diversity in the latent space representing the image space, and the diversity also appears maximized when the result is visualized in the semantic space.

V-E Experiments with Errors

In this section, we verify the robustness of the proposed method against errors in annotation and state transitions.

V-E1 Annotation Error

In our previous discussion and in the experiments of Section V-D, we assumed that we could query the oracle $F$, that is, the true image-to-semantics mapping, via human annotation. However, because human annotation is the process of assigning semantics to images by humans, errors are expected to occur in the output semantics. Therefore, we provide a new experimental setup here: for some $s^{\tau}\in\mathcal{S}^{\tau}$, we observe $F(s^{\tau})+\epsilon$ instead of $F(s^{\tau})$ while creating the paired dataset $\mathcal{P}$, where $\epsilon\in\mathbb{R}^{\dim(\mathcal{S}^{\sigma})}$ is a random vector representing the annotation error.

V-E2 Transition Error

In reality, the state transition function $Tr^{\sigma}$ on the simulator is expected to contain modeling errors. For example, environment parameters such as friction coefficients and motor torques in the real world cannot be accurately estimated in the simulator, and thus state transitions in reality cannot be imitated exactly. Therefore, we provide a new experimental setup here: for some $(s,a)\in\mathcal{S}^{\sigma}\times\mathcal{A}$, we obtain $Tr^{\sigma}(s,a)+\epsilon$ instead of $Tr^{\sigma}(s,a)$ while augmenting the paired dataset, where $\epsilon\in\mathbb{R}^{\dim(\mathcal{S}^{\sigma})}$ is a random vector representing the transition error. Note that when training $\pi^{\sigma}$, we used the transition function without errors in our experiments.

V-E3 Error Generation

We generated the two types of errors by adding a random variable $\epsilon\in\mathbb{R}^{\dim(\mathcal{S}^{\sigma})}$. Here, we denote the value of the $h$-th dimension of $x\in\mathbb{R}^{H}$ as $x_{(h)}\in\mathbb{R}$. We sampled $\epsilon_{(h)}\sim\mathcal{N}_{(h)}$, where $\mathcal{N}_{(h)}$ is a Gaussian distribution with mean $\mu=0$ and standard deviation $\sigma=\alpha\cdot\mathrm{std}[s^{\sigma}_{(h)}]_{s^{\sigma}\in\mathcal{T}^{\sigma}}$. Here, $\mathrm{std}[s^{\sigma}_{(h)}]_{s^{\sigma}\in\mathcal{T}^{\sigma}}$ is the sample standard deviation over a source trajectory $\mathcal{T}^{\sigma}$ collected by a behavior policy in the source MDP, and $\alpha\geq 0$ is the noise scale.

For the annotation error, using the semantics sequence of augmented pairs $\{\widehat{s}_{i+t}^{\sigma}\}_{t\in\mathcal{E}_{i}}$ generated without error, we provided the semantics as $F(s^{\tau}_{i})+\bar{\epsilon}_{i}$ for $\mathcal{P}$ and $\{\widehat{s}_{i+t}^{\sigma}+\bar{\epsilon}_{i}\}_{t\in\mathcal{E}_{i}}$ for $\mathcal{P}^{\prime}$, where $\bar{\epsilon}_{i}$ is a realized random vector with $\alpha>0$. For the transition error, we defined $\{\widehat{s}_{i+t}^{\sigma}+\sum_{j=1}^{t}\bar{\epsilon}_{i,j}\}_{t\in\mathcal{E}_{i}}$ for $\mathcal{P}^{\prime}$, where $\bar{\epsilon}_{i,j}$ are realized random vectors. Note that, here, we approximated the error generation based on the following assumption: $Tr^{\sigma}(s,a)=s+Tr^{\sigma}_{\Delta}(a)$, that is, $Tr^{\sigma}(s+\bar{\epsilon}_{1},a)+\bar{\epsilon}_{2}=s+Tr^{\sigma}_{\Delta}(a)+\bar{\epsilon}_{1}+\bar{\epsilon}_{2}$. This is an approximation that simplifies implementation; however, for Shooting and KUKA, the above assumption actually holds for almost all states and actions.
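A sketch of the noise generation is given below; the array layout of the source trajectory is an assumption, and the trailing comments summarize how the sampled errors are applied to $\mathcal{P}$ and $\mathcal{P}^{\prime}$.

```python
import numpy as np

def noise_sampler(source_states, alpha, rng=None):
    """Return a function that samples the error vector epsilon.

    Each dimension h is drawn from N(0, (alpha * std_h)^2), where std_h is the
    per-dimension sample standard deviation over a source trajectory T^sigma
    collected by a behavior policy.
    """
    rng = np.random.default_rng() if rng is None else rng
    std = np.std(np.asarray(source_states, dtype=np.float64), axis=0)

    def sample():
        return rng.normal(loc=0.0, scale=alpha * std)

    return sample

# Annotation error: one realized epsilon shifts the annotated pair and all
# augmented pairs of the same episode.
# Transition error: an independent epsilon is drawn and accumulated at every
# augmentation step along the episode.
```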

V-E4 Results

TABLE V: Results with annotation errors.
Method | α | MD | PP
Shooting ($\lvert\mathcal{I}\rvert=50$)
CRAR | 0.0 | 2.80 ± 0.70 | 42.29 ± 2.12
Ours w/o AL | 0.0 | 0.06 ± 0.02 | 46.02 ± 0.31
CRAR | 0.04 | 2.99 ± 0.93 | 42.91 ± 1.71
Ours w/o AL | 0.04 | 0.12 ± 0.05 | 45.90 ± 0.54
CRAR | 0.15 | 3.50 ± 1.21 | 43.51 ± 2.04
Ours w/o AL | 0.15 | 0.47 ± 0.07 | 44.56 ± 0.59
CRAR | 0.3 | 4.90 ± 0.85 | 40.83 ± 2.78
Ours w/o AL | 0.3 | 1.42 ± 0.15 | 42.47 ± 1.74
KUKA ($\lvert\mathcal{I}\rvert=100$)
CRAR | 0.0 | 0.32 ± 0.02 | 0.52 ± 0.15
Ours w/o AL | 0.0 | 0.11 ± 0.01 | 0.76 ± 0.09
CRAR | 0.04 | 0.33 ± 0.03 | 0.58 ± 0.11
Ours w/o AL | 0.04 | 0.12 ± 0.01 | 0.76 ± 0.18
CRAR | 0.15 | 0.32 ± 0.03 | 0.53 ± 0.15
Ours w/o AL | 0.15 | 0.14 ± 0.01 | 0.68 ± 0.15
HalfCheetah ($\lvert\mathcal{I}\rvert=50$)
CRAR | 0.0 | 0.88 ± 0.05 | -818.29 ± 300.08
Ours w/o AL | 0.0 | 0.12 ± 0.01 | 878.79 ± 88.52
CRAR | 0.04 | 0.91 ± 0.03 | -919.86 ± 328.14
Ours w/o AL | 0.04 | 0.12 ± 0.01 | 787.0 ± 230.32
CRAR | 0.15 | 0.99 ± 0.06 | -833.35 ± 464.76
Ours w/o AL | 0.15 | 0.15 ± 0.02 | 616.63 ± 260.73

Here, we analyze the effect of the two types of errors on $\mathcal{P}$ and $\mathcal{P}^{\prime}$ to understand how they affect the approximation of $F$. Therefore, we do not experiment with the method of Zhang et al., which does not utilize paired datasets. In addition, to eliminate the effect of the choice of $\mathcal{I}$ on the generation of $\mathcal{P}$ and $\mathcal{P}^{\prime}$ in the comparison between the proposed method and CRAR, we conducted experiments under the setting without AL.

The results with annotation errors are shown in Table V. In both CRAR and the proposed method, the semantics of the paired data deviate from the true values as the scale $\alpha$ of the annotation error increases; thus, we observed that MD tends to increase for both methods. Although PP tended to decrease only for the proposed method, the proposed method achieved better MD and PP than CRAR for the same error scale $\alpha$. In addition, we confirmed that the PP of our method with an annotation error remains comparable to that without an annotation error up to a certain degree of $\alpha$. For example, in the KUKA experiments, the proposed method achieved PP $=0.76\pm 0.18$ with $\alpha=0.04$, which is close to the PP $=0.76\pm 0.09$ without annotation error. We conclude that the proposed pair augmentation is effective in image-to-semantics learning even in the presence of annotation errors.

TABLE VI: Results with transition errors.
Method | α | MD | PP
Shooting ($\lvert\mathcal{I}\rvert=50$)
CRAR | 0.0 | 2.80 ± 0.70 | 42.29 ± 2.12
Ours w/o AL | 0.0 | 0.06 ± 0.02 | 46.02 ± 0.31
Ours w/o AL | 0.01 | 0.12 ± 0.04 | 45.90 ± 0.51
Ours w/o AL | 0.04 | 0.67 ± 0.12 | 44.78 ± 1.16
Ours w/o AL | 0.1 | 3.43 ± 1.00 | 43.44 ± 1.60
KUKA ($\lvert\mathcal{I}\rvert=100$)
CRAR | 0.0 | 0.32 ± 0.02 | 0.52 ± 0.15
Ours w/o AL | 0.0 | 0.11 ± 0.01 | 0.76 ± 0.09
Ours w/o AL | 0.01 | 0.12 ± 0.01 | 0.80 ± 0.11
Ours w/o AL | 0.04 | 0.14 ± 0.00 | 0.67 ± 0.22
HalfCheetah ($\lvert\mathcal{I}\rvert=50$)
CRAR | 0.0 | 0.88 ± 0.05 | -818.29 ± 300.08
Ours w/o AL | 0.0 | 0.12 ± 0.01 | 878.79 ± 88.52
Ours w/o AL | 0.01 | 0.18 ± 0.02 | 426.61 ± 221.12
Ours w/o AL | 0.04 | 1.21 ± 0.13 | -523.8 ± 336.51

The results with transition errors are shown in Table VI. Note that CRAR does not use $Tr^{\sigma}$; thus, its result does not depend on the transition error scale $\alpha$, and the result of CRAR with $\alpha>0$ matches that with $\alpha=0$. Because the proposed pair augmentation scheme uses $Tr^{\sigma}$ to generate semantics, for larger $t\in\mathcal{E}_{i}$, the variance of the accumulated error is expected to be large; the augmented semantics in $\mathcal{P}^{\prime}$ are then far from the actual semantics. In fact, we observed an increase in MD and a decrease in PP for the proposed method as $\alpha$ increased. Nevertheless, both MD and PP were better than those of CRAR up to $\alpha=0.04$ for Shooting and KUKA, and up to $\alpha=0.01$ for HalfCheetah. This indicates that the proposed pair augmentation is effective in reducing the annotation cost up to a certain level of transition error.

V-F Effect of Behavior Policy

To further reveal the behavior of image-to-semantics methods, we evaluated them on HalfCheetah by adopting a low performance policy, rather than the random policy, as the behavior policy. We pre-trained the low performance policy with a small number of iterations using PPO.

We observed, by comparing Table IV and Table VII, that the performance of the behavior policy affects the PP of the resulting target agent. Here, the PP of the random policy was $-1230.01$ and that of the low performance policy was $822.39$. Accordingly, the PP of our proposed method improved from $968.49\pm 99.53$ to $1527.37\pm 133.19$ when $\lvert\mathcal{I}\rvert=50$.

TABLE VII: Results of HalfCheetah when $\mathcal{T}^{\tau}$ was collected by the low performance policy. The behavior policy has PP $=822.39$, and $\pi^{\sigma}$ has PP $=2735.73$. The trajectory for calculating MD was collected by the low performance policy.
Method | MD | PP
$\lvert\mathcal{I}\rvert=0$
Zhang et al. | 1.03 ± 0.19 | -1241.31 ± 523.06
$\lvert\mathcal{I}\rvert=10$
CRAR | 0.74 ± 0.06 | -1549.82 ± 106.04
Ours w/o AL | 0.09 ± 0.02 | 1117.05 ± 127.93
Ours | 0.06 ± 0.00 | 1145.77 ± 110.29
$\lvert\mathcal{I}\rvert=50$
CRAR | 0.37 ± 0.04 | -1416.37 ± 194.31
Ours w/o AL | 0.03 ± 0.00 | 1393.11 ± 226.39
Ours | 0.02 ± 0.00 | 1527.37 ± 133.19

These results indicate that owing to the low performance of the random policy, faster-running states, that is, states with high velocity, cannot be observed; in other words, the random policy can only observe a limited region of the state space. This limitation can lead to an increase in the approximation error of $\hat{F}$. This implies that image-to-semantics is affected by the performance of the behavior policy in some tasks.

A promising result for the image-to-semantics framework is that the target agents obtained by our approach outperform the behavior policy. In particular, in Table VII, the PP of the behavior policy is 822.39, whereas image-to-semantics with 50 annotations ($\lvert\mathcal{I}\rvert=50$) achieved a PP of 1527.37±133.19. In other words, we achieved higher performance than the behavior policy using only a small number of annotations and the image-to-semantics protocol.

In the previous discussion, we found that the PP achieved by the image-to-semantics framework is affected by the quantity and quality of the paired data and by the region of the state space covered by the dataset for training $\hat{F}$. In fact, as an extreme example, $\hat{F}$ trained using $\lvert\mathcal{P}\rvert=100$k pairs from trajectories collected by the optimal policy achieved PP = 2624.65±29.09, which is almost identical to PP = 2735.73, the performance of the optimal source policy. Note that such near-complete policy transfer is already achieved in Shooting and KUKA, as shown in Tables II and III.

VI Conclusion

In this study, we investigated the image-to-semantics problem for vision-based agents in robotics. Using paired data to learn the image-to-semantics mapping is favorable for achieving high-performance policy transfer; however, the cost of creating paired data cannot be ignored. This study contributes to the existing literature by reducing the annotation cost with two techniques: pair augmentation and active learning. We confirmed the effectiveness of the proposed method in our experiments.

In future work, we must address the following limitations: (1) experiments have not been conducted on actual robots, so it is unknown how difficulties specific to physical robots affect image-to-semantics performance; (2) we cannot always freely query $Tr^{\sigma}$, so it would be beneficial to know whether a transition model learned from the source trajectories can be substituted, similar to [18, 23]; (3) when the transition error is large, the approximation accuracy of $\hat{F}$ might be improved by performing pair augmentation only over $\{t\mid t\in\mathcal{E}_{i},\,t\leq K\}$ rather than over all of $\mathcal{E}_{i}$, because augmented semantics with larger $t\in\mathcal{E}_{i}$ are less accurate; furthermore, we would like to find a way to automatically determine such a $K$ (see the sketch below).
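
As a rough sketch of item (3), and under the simplified Gaussian error model assumed in the earlier sketch, one could truncate the augmentation horizon so that the expected accumulated deviation stays below a tolerance. Here, augment_pairs is the hypothetical helper defined in that sketch, and the heuristic for choosing K is an assumption, not a validated method.

import math

def choose_horizon(alpha, tol):
    # Under i.i.d. Gaussian per-step errors of scale alpha, the accumulated
    # deviation after t steps has standard deviation of roughly alpha * sqrt(t),
    # so we keep only steps whose expected deviation stays below tol.
    if alpha <= 0:
        return math.inf  # no transition error: augment the entire segment
    return max(1, int((tol / alpha) ** 2))

def augment_truncated(images, actions, s_annotated, t0, segment_len, alpha, tol, rng):
    K = min(segment_len, choose_horizon(alpha, tol))
    return augment_pairs(images, actions, s_annotated, t0, K, alpha, rng)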

References

  • [1] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine, “Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” in Conference on Robot Learning (CoRL), 2018.
  • [2] M. A. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. V. de Wiele, V. Mnih, N. Heess, and J. T. Springenberg, “Learning by playing - solving sparse reward tasks from scratch,” in International Conference on Machine Learning (ICML), 2018, pp. 4344–4353.
  • [3] S. Joshi, S. Kumra, and F. Sahin, “Robotic grasping using deep reinforcement learning,” in IEEE International Conference on Automation Science and Engineering (CASE), 2020, pp. 1461–1466.
  • [4] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” in International Symposium on Experimental Robotics (ISER), 2017.
  • [5] S. S. Du, S. M. Kakade, R. Wang, and L. F. Yang, “Is a good representation sufficient for sample efficient reinforcement learning?” in International Conference on Learning Representations (ICLR), 2020.
  • [6] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. P. Lillicrap, and M. A. Riedmiller, “Deepmind control suite,” CoRR, vol. arXiv:1801.00690, 2018.
  • [7] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 23–30.
  • [8] S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis, “Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 12619–12629.
  • [9] I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner, “DARLA: Improving zero-shot transfer in reinforcement learning,” in International Conference on Machine Learning (ICML), 2017, pp. 1480–1490.
  • [10] L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba, and P. Abbeel, “Asymmetric actor critic for image-based robot learning,” in Robotics: Science and Systems (RSS), 2018.
  • [11] B. Chen, A. Sax, G. Lewis, I. Armeni, S. Savarese, A. Zamir, J. Malik, and L. Pinto, “Robust policies via mid-level visual representations: An experimental study in manipulation and navigation,” in Conference on Robot Learning (CoRL), 2020.
  • [12] E. Tzeng, C. Devin, J. Hoffman, and C. Finn, “Adapting deep visuomotor representations with weak pairwise constraints,” Algorithmic Foundations of Robotics, pp. 688–703, 2020.
  • [13] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, S. Levine, and V. Vanhoucke, “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” in IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 4243–4250.
  • [14] K. Rao, C. Harris, A. Irpan, S. Levine, J. Ibarz, and M. Khansari, “Rl-cyclegan: Reinforcement learning aware simulation-to-real,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11154–11163.
  • [15] D. Ho, K. Rao, Z. Xu, E. Jang, M. Khansari, and Y. Bai, “Retinagan: An object-aware approach to sim-to-real transfer,” in IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 10920–10926.
  • [16] F. Zhang, J. Leitner, M. Milford, and P. Corke, “Modular deep q networks for sim-to-real transfer of visuo-motor policies,” in Australasian Conference on Robotics and Automation (ACRA), 2017.
  • [17] F. Zhang, J. Leitner, Z. Ge, M. Milford, and P. Corke, “Adversarial discriminative sim-to-real transfer of visuo-motor policies,” International Journal of Robotics Research (IJRR), pp. 1229–1245, 2019.
  • [18] Q. Zhang, T. Xiao, A. A. Efros, L. Pinto, and X. Wang, “Learning cross-domain correspondence for control with dynamics cycle-consistency,” in International Conference on Learning Representations (ICLR), 2021.
  • [19] V. Francois-Lavet, Y. Bengio, D. Precup, and J. Pineau, “Combined reinforcement learning via abstract representations,” in AAAI Conference on Artificial Intelligence (AAAI), vol. 33, 2019, pp. 3582–3589.
  • [20] A. Srinivas, M. Laskin, and P. Abbeel, “CURL: contrastive unsupervised representations for reinforcement learning,” in International Conference on Machine Learning (ICML), 2020, pp. 5639–5650.
  • [21] J. Yang, G. Lee, S. Chang, and N. Kwak, “Towards governing agent’s efficacy: Action-conditional beta-vae for deep transparent reinforcement learning,” in Asian Conference on Machine Learning (ACML), 2019, pp. 32–47.
  • [22] R. Agarwal, D. Schuurmans, and M. Norouzi, “An optimistic perspective on offline reinforcement learning,” in International Conference on Machine Learning (ICML), 2020, pp. 104–114.
  • [23] D. Ha and J. Schmidhuber, “Recurrent world models facilitate policy evolution,” in Neural Information Processing Systems (NeurIPS), 2018, pp. 2451–2463.
  • [24] G. Zhang, L. Zhong, Y. Lee, and J. J. Lim, “Policy transfer across visual and dynamics domain gaps via iterative grounding,” in Robotics: Science and Systems (RSS), 2021.
  • [25] M. Müller, A. Dosovitskiy, B. Ghanem, and V. Koltun, “Driving policy transfer via modularity and abstraction,” in Conference on Robot Learning (CoRL), 2018.
  • [26] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2962–2971.
  • [27] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2017, pp. 2242–2251.
  • [28] S. Gamrian and Y. Goldberg, “Transfer learning for related reinforcement learning tasks via image-to-image translation,” in International Conference on Machine Learning (ICML), 2019, pp. 2063–2072.
  • [29] S. Sinha, S. Ebrahimi, and T. Darrell, “Variational adversarial active learning,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 5971–5980.
  • [30] D. Wu, C.-T. Lin, and J. Huang, “Active learning for regression using greedy sampling,” Information Sciences, vol. 474, pp. 90–105, 2019.
  • [31] F. Zhdanov, “Diverse mini-batch active learning,” CoRR, vol. arXiv:1901.05954, 2019.
  • [32] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in International Conference on Learning Representations (ICLR), 2014.
  • [33] M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaskowski, “Vizdoom: A doom-based AI research platform for visual reinforcement learning,” in IEEE Conference on Computational Intelligence and Games (CIG), 2016, pp. 1–8.
  • [34] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” 2016.
  • [35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), Y. Bengio and Y. LeCun, Eds., 2015.
  • [36] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. arXiv:1707.06347, 2017.