Few-Shot Image-to-Semantics Translation for Policy Transfer in Reinforcement Learning ††thanks: This research is partially supported by the JSPS KAKENHI Grant Number 19H04179, and based on a project, JPNP18002, commissioned by NEDO.
Abstract
We investigate policy transfer using image-to-semantics translation to mitigate learning difficulties in vision-based robotics control agents. This problem assumes two environments: a simulator environment whose state space consists of semantics, that is, low-dimensional and essential information, and a real-world environment whose state space consists of images. By learning the mapping from images to semantics, we can transfer a policy pre-trained in the simulator to the real world, thereby eliminating costly and risky on-policy agent interactions in the real world. In addition, using image-to-semantics mapping is advantageous over other types of sim-to-real transfer strategies in terms of the computational efficiency of training the policy and the interpretability of the obtained policy. To tackle the main difficulty in learning image-to-semantics mapping, namely the human annotation cost of producing a training dataset, we propose two techniques: pair augmentation with the transition function in the simulator environment and active learning. We observed a reduction in the annotation cost without a decline in transfer performance, and the proposed approach outperformed an existing approach that uses no annotation.
Index Terms:
deep reinforcement learning, policy transfer, sim-to-real

I Introduction
Deep reinforcement learning (DRL) has been actively studied for robot control applications in real-world environments because of its ability to train vision-based agents; that is, the robot control actions are output directly from the observed images [1, 2, 3, 4]. One of the major advantages of vision-based agents in robotics is that camera-captured images can be incorporated into the decision-making of the agent without using a handcrafted feature extractor.
However, allowing vision-based robot control agents to learn by reinforcement learning in the real-world is challenging in terms of risk and cost because it requires a large amount of real-world interactions with unstable robots. Reinforcement learning involves a learning policy interacting with the environment, and it is theoretically and empirically known that the length of the interaction required for training increases with the dimension of the state space [5, 6].
To address the difficulty associated with reinforcement learning in a real-world environment, methods have been proposed that pre-train a policy on a simulator environment and transfer it to the real-world environment [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. In this methodology, policies are learned in a simulator, that is, a reinforcement learning environment on a computer that mimics the real-world environment. The policy pre-trained in the simulator is expected to be the optimal policy in the real-world environment.
However, developing a simulator that imitates the real-world environment is not always an easy task. Particularly, because the real world provides image observations, a simulator environment requires a renderer to generate images as states. However, producing a renderer that can generate photorealistic images is fraught with financial and technical difficulties.
In cases where a photorealistic renderer cannot be produced, another style of observation must be adopted as the state during pre-training of the policy in the simulator environment. Most existing approaches pre-train on non-photorealistic observations and bridge the gap to photorealistic ones using transfer techniques [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17].
We investigated a type of transfer strategy called image-to-semantics, introduced by [18], to deal with the absence of a photorealistic renderer. In this approach, semantics—low-dimensional and essential information representing the content of an image—are employed as the form of state observation in the simulator environment instead of images. The transfer consists of two steps: pre-training a policy on the simulator environment with semantics as its observations, and obtaining a mapping from photorealistic images to their corresponding semantics; the image-to-semantics mapping is then used as a pre-processing component of the policy in the real-world environment. A semantics-based pre-trained policy can thus be operated in the real-world environment with image observations. In addition to handling the case without a photorealistic renderer, image-to-semantics mapping has advantages in terms of the computational cost of policy pre-training in the simulator and the interpretability of the acquired policy.
The crucial part of this approach is obtaining the image-to-semantics translation mapping. To the best of our knowledge, [19, 18] are the only studies that have dealt with learning image-to-semantics translation. We highlight the remaining problems of [19, 18]: (1) [19] used a paired dataset, that is, multiple pairs of images and corresponding semantics, to train the mapping. Considerable human effort is required to make a paired dataset because human annotators provide semantics that represent images. (2) Although the style translation method without a paired dataset [18] aims at saving annotation cost, its performance is not often satisfactory owing to the low approximation quality of the image-to-semantics translation mapping, as confirmed in our experiments.
In this study, we tackled learning image-to-semantics translation using a paired dataset; however, we reduced the cost of creating a paired dataset using two strategies: pair augmentation and active learning. In our experiments, we confirmed the following claims: first, compared to [19], we reduced the cost of making a paired dataset while preserving the performance of the policy transfer. Second, we achieved significantly higher performance than [18], in which a paired dataset was not used, by using a small paired dataset. For practicality, we conducted experiments under the condition that only inaccurate paired data can be obtained due to various errors, such as annotation errors, and confirmed that the proposed method has a certain robustness against errors.
Our code is publicly available at https://github.com/madoibito80/im2sem.
II Problem Formulation
II-A Markov Decision Process (MDP)
We defined a vision-based robotics task in the real world; that is, the real-world environment is the target MDP $\mathcal{M}^{\mathrm{tgt}} = (\mathcal{S}^{\mathrm{tgt}}, \mathcal{A}, p^{\mathrm{tgt}}, r^{\mathrm{tgt}}, \gamma)$, where $\mathcal{S}^{\mathrm{tgt}}$ is a state space, $\mathcal{A}$ is an action space, $p^{\mathrm{tgt}}$ is a transition probability density, $r^{\mathrm{tgt}}$ is a reward function, and $\gamma$ is a discount factor. Because we assumed that the target MDP is a vision-based task, $\mathcal{S}^{\mathrm{tgt}}$ consists of images, and each state contains single or multiple image frames. In standard model-free reinforcement learning (RL) settings, agents interact with the environment: by performing action $a_t$ at the state $s_t$ internally preserved in the environment at timestep $t$, they observe the next state $s_{t+1}$ and a reward; after the transition, $s_{t+1}$ is stored in the environment. However, there are concerns in terms of the risk and cost associated with learning a policy through extensive interaction with $\mathcal{M}^{\mathrm{tgt}}$.
To reduce the risk and cost of training a policy in the target MDP, we pre-trained a policy on a simulator environment, called the source MDP: $\mathcal{M}^{\mathrm{src}} = (\mathcal{S}^{\mathrm{src}}, \mathcal{A}, p^{\mathrm{src}}, r^{\mathrm{src}}, \gamma)$. Note that the action space is the same between the two MDPs. In contrast, the state space $\mathcal{S}^{\mathrm{src}}$, the transition probability density $p^{\mathrm{src}}$, and the reward function $r^{\mathrm{src}}$ are different from those of the target MDP. Because we considered robotics tasks, we assumed that a deterministic transition function $f^{\mathrm{src}}: \mathcal{S}^{\mathrm{src}} \times \mathcal{A} \to \mathcal{S}^{\mathrm{src}}$ could be defined in the simulator environment, that is, $p^{\mathrm{src}}$ resembles a Dirac delta distribution.
The source state space $\mathcal{S}^{\mathrm{src}}$ corresponded to a semantic space, that is, each $s^{\mathrm{src}} \in \mathcal{S}^{\mathrm{src}}$ is semantic information. For example, consider a robot-arm grasping task: each target state is a single or multiple image frame showing the robot arm and the objects to be grasped, whereas each source state consists of semantics such as the coordinates of the end-effector and the target objects and the angles of the joints.
The source MDP and target MDP are expected to have some structural correspondence. Here, we describe our assumptions regarding the relation between the two MDPs. We assumed the existence of a function $H: \mathcal{S}^{\mathrm{tgt}} \to \mathcal{S}^{\mathrm{src}}$ satisfying the following conditions:
Transition Condition: For all $s^{\mathrm{src}} \in \mathcal{S}^{\mathrm{src}}$, $a \in \mathcal{A}$, and $s^{\mathrm{tgt}} \in H^{-1}(s^{\mathrm{src}})$, a successor state $s'^{\mathrm{tgt}} \sim p^{\mathrm{tgt}}(\cdot \mid s^{\mathrm{tgt}}, a)$ satisfies $H(s'^{\mathrm{tgt}}) = f^{\mathrm{src}}(s^{\mathrm{src}}, a)$, where $H^{-1}(s^{\mathrm{src}}) = \{ s \in \mathcal{S}^{\mathrm{tgt}} : H(s) = s^{\mathrm{src}} \}$.
Reward Condition: For all $s^{\mathrm{tgt}} \in \mathcal{S}^{\mathrm{tgt}}$ and $a \in \mathcal{A}$, $r^{\mathrm{tgt}}(s^{\mathrm{tgt}}, a) = r^{\mathrm{src}}(H(s^{\mathrm{tgt}}), a)$.
In the above conditions, $H$ is considered an oracle that takes an image and outputs the corresponding semantics; that is, $H$ is the true image-to-semantics translation mapping. In the transition condition, $H^{-1}(s^{\mathrm{src}})$ is the set of images that share the common semantics $s^{\mathrm{src}}$. Imagine a transition from $s^{\mathrm{tgt}}$ to $s'^{\mathrm{tgt}}$ with action $a$ in the target MDP; the transition condition ensures that $H(s'^{\mathrm{tgt}}) = f^{\mathrm{src}}(H(s^{\mathrm{tgt}}), a)$. The reward condition indicates that the reward for this transition equals that for a transition from $H(s^{\mathrm{tgt}})$ to $f^{\mathrm{src}}(H(s^{\mathrm{tgt}}), a)$ with action $a$ in the source MDP.
II-B Transfer via Image-to-Semantics
II-B1 Policy Transfer
The objective of RL is to maximize the expected discounted cumulative reward

$J(\pi; \mathcal{M}) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \right], \quad s_0 \sim \rho,\ a_t \sim \pi(\cdot \mid s_t),\ s_{t+1} \sim p(\cdot \mid s_t, a_t), \qquad (1)$

with respect to the policy $\pi$, that is, a conditional distribution of the action given the current state, where $\rho$ is the distribution of the initial state over the state space. Our objective was to obtain a well-trained policy on the target MDP, that is, a maximizer of $J(\pi; \mathcal{M}^{\mathrm{tgt}})$.
Under the situation in which the transition and reward conditions mentioned above hold for some $H$, we can replace the target policy by the composite $\pi^{\mathrm{src}} \circ H$, where $\pi^{\mathrm{src}}$ is a well-trained policy on the source MDP, that is, a maximizer of $J(\pi; \mathcal{M}^{\mathrm{src}})$. Solving this maximization by RL requires interaction solely with $\mathcal{M}^{\mathrm{src}}$ instead of $\mathcal{M}^{\mathrm{tgt}}$. As noted, interactions with $\mathcal{M}^{\mathrm{tgt}}$ require real-world operations; in contrast, interactions with $\mathcal{M}^{\mathrm{src}}$ are performed on the simulator, which is cost-effective.
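As a concrete illustration, the following minimal sketch (assuming a Gym-style reset/step interface; the names env_tgt, policy_src, and translate are ours for illustration, not the authors' implementation) shows how a semantics-based source policy is operated in the image-based target MDP through such a composite.

```python
def run_target_episode(env_tgt, policy_src, translate, gamma=0.99):
    """Roll out the composite of pi_src and an image-to-semantics mapping.

    env_tgt    : target environment with a Gym-style reset()/step() API whose
                 observations are images (hypothetical interface).
    policy_src : callable mapping semantics -> action, pre-trained in the
                 semantics-based source MDP.
    translate  : callable mapping an image observation -> semantics.
    Returns the discounted cumulative reward of one episode.
    """
    obs = env_tgt.reset()                  # image observation of the target MDP
    ret, discount, done = 0.0, 1.0, False
    while not done:
        semantics = translate(obs)         # approximate semantics of the image
        action = policy_src(semantics)     # the source policy acts on semantics only
        obs, reward, done, _ = env_tgt.step(action)
        ret += discount * reward
        discount *= gamma
    return ret
```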
II-B2 Advantages
The above-mentioned transfer strategy, that is, transfer via image-to-semantics, has the following three advantages over approaches that use a renderer in the source MDP, as summarized in Table I. First, a renderer is not required. Existing methods that use a renderer generally aim to transfer an agent trained on non-photorealistic images in a simulator to photorealistic images in the real world [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]; therefore, they require a renderer on the simulator to generate non-photorealistic images as state observations. Transfer via image-to-semantics performs similar transfer learning, but it does not require a renderer because the source MDP has a semantic space as its state space. This can reduce the development cost of the simulator for some tasks. Second, because semantics are low-dimensional variables compared to images, we can improve the sample efficiency of training the policy on the source MDP [5, 6]. Learning vision-based agents is generally associated with large computational costs, even on a simulator [20], but transfer via image-to-semantics is relatively lightweight in this respect and occasionally allows a human to design the policy. Third, using semantics as an intermediate representation of the target agent contributes to its high interpretability because of the low dimensionality and interpretability of semantics. Similar to [19, 21], because the real-world agent can be separated into two components that are trained independently, it is easier to assess than an agent trained in an end-to-end manner.
II-C Resource Strategy
In this section, in addition to the two MDP environments, we define resources that can be used to approximate .
II-C1 Transition Function
In the target MDP, the state transition resulting from a selected action can be observed only for the state currently stored inside the environment. In contrast, in the source MDP, we assumed that the state transition result $f^{\mathrm{src}}(s^{\mathrm{src}}, a)$ could be observed for any $s^{\mathrm{src}} \in \mathcal{S}^{\mathrm{src}}$ by replacing the state stored inside the environment with $s^{\mathrm{src}}$. This is because the actual state transition in the target MDP is a physical phenomenon in the real world, whereas the state transition rule in the source MDP is a black-box function on the computer.
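This resource can be summarized by a thin wrapper such as the sketch below. The set_state/step interface is an assumption about the simulator API (real simulators such as PyBullet scenes need task-specific state setters), not something specified by the paper.

```python
class SourceTransitionModel:
    """Exposes the deterministic transition f_src(s, a) at an arbitrary
    semantic state, not only at the internally stored one."""

    def __init__(self, sim):
        self.sim = sim  # hypothetical simulator with set_state()/step()

    def next_state(self, semantics, action):
        self.sim.set_state(semantics)           # overwrite the internal state
        return self.sim.step(action)            # deterministic next semantics
```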
II-C2 Offline Dataset
The offline dataset $\mathcal{D}^{\mathrm{off}}$ comprised observations of the target MDP, that is, sequences of image observations, actions, and terminal flags, where the terminal flag indicates whether the corresponding state is a terminal state. Note that successive indices in the offline dataset share the same episode context, except at the end of an episode. $\mathcal{D}^{\mathrm{off}}$ can be obtained before training starts and is collected by a behavior policy. Because the offline dataset can be reused for any trial and can be obtained by a safety-guaranteed behavior policy, we assumed it could be created at a relatively low cost.
We solely used the offline dataset for supervised and unsupervised learning purposes. If offline reinforcement learning is executed, the vision-based agent can be trained directly without approximating . However, training a vision-based agent using an offline dataset by reinforcement learning requires large-scale trajectories in the scope of millions [22]. In this study, we considered situations in which the total number of timesteps in the offline dataset was limited, for example, less than 100k timesteps.
We did not need to generate reward signals while collecting the offline dataset. World models [23] have been studied for the procedure: approximate MDP as using an offline dataset of ; train a policy by reinforcement learning by interacting with the approximated environment instead of interacting with the original environment . One could imagine that we could replace interactions with the target MDP by interactions with the approximated one. However, to accomplish this, we must observe signals regarding reward in the real world while collecting the offline dataset, and we must approximate a reward function that is often sparse; both of these are not always easy [24]. Therefore, we did not consider approximating the target MDP and did not assume the reward was contained in .
II-C3 Paired Dataset
The paired dataset $\mathcal{D}^{\mathrm{pair}}$ consisted of multiple pairs of target state observations and their corresponding source state observations. Let $I$ denote the set of indices that indicate positions in the offline dataset. Using the true image-to-semantics translation mapping $H$, we can denote $\mathcal{D}^{\mathrm{pair}} = \{ (s^{\mathrm{tgt}}_i, H(s^{\mathrm{tgt}}_i)) : i \in I \}$. In practical situations, querying $H$ amounts to having human annotators attach the corresponding semantics to the images at the indices in $I$. Because of this annotation cost, we assumed the size of the paired dataset to be significantly smaller than that of the offline dataset, for example, $|\mathcal{D}^{\mathrm{pair}}| \ll |\mathcal{D}^{\mathrm{off}}|$.
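For concreteness, one possible in-memory layout of the two datasets and of the annotation query is sketched below; the class and function names (OfflineDataset, build_paired_dataset, annotate) are hypothetical and only illustrate the structure described above.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class OfflineDataset:
    """Offline target-MDP trajectories collected by a behavior policy."""
    images: np.ndarray     # shape (T, H, W, C): image observations
    actions: np.ndarray    # shape (T, action_dim): actions taken at each step
    terminals: np.ndarray  # shape (T,): 1 at the last timestep of an episode

def build_paired_dataset(offline, annotate, indices):
    """Construct the paired dataset by querying a human annotator.

    `annotate` stands in for the oracle H (image -> semantics) realized by
    human annotation; `indices` is the small set of positions to annotate.
    """
    return [(offline.images[i], annotate(offline.images[i])) for i in indices]
```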
III Related Work
Method | Renderer | OFF | PAIR |
Tobin et al.[7] | ✓ | ||
RCAN[8] | ✓ | ||
DARLA[9] | ✓ | ||
Pinto et al.[10] | ✓ | ||
MLVR[11] | ✓ | ||
Tzeng et al.[12] | ✓ | ✓ | |
GraspGAN[13] | ✓ | ✓ | |
RL-CycleGAN[14] | ✓ | ✓ | |
RetinaGAN[15] | ✓ | ✓ | |
MDQN[16] | ✓ | ✓ | ✓ |
ADT[17] | ✓ | ✓ | ✓ |
Zhang et al.[18] | ✓ | ||
CRAR[19] | ✓ | ✓ | |
Ours | ✓ | ✓ |
We introduced some existing sim-to-real transfer methods that use a non-photorealistic renderer on the simulator. Table I lists the transfer methods that do not require on-policy interaction in the target MDP, assuming vision-based agents. The main difficulty tackled by these methods was the absence of a photorealistic renderer on the simulator. In the real world, images captured by a camera are input to the agent; however, generating photorealistic images on the simulator is generally difficult because it requires developing a high-quality renderer.
In [7, 8, 9, 10, 11, 25], the algorithms learned policies or intermediate representations that were robust to changes in image style using a non-photorealistic renderer. Thus, these algorithms were expected to perform well even when a photorealistic style was applied in a real-world environment. In particular, the domain randomization technique has been widely used [7, 8, 9, 10].
III-A Transfer via Image-to-Image Translation
In contrast to the above methods, [12, 13, 14, 15, 17, 16, 18, 19] aimed to perform style translation mapping among specific styles. To accomplish this, these methods required an offline dataset of the target MDP. Because these methods followed the principle of collection without execution of on-policy interaction, the offline dataset could be collected by a safety-guaranteed policy. Unsupervised style translation, such as domain adaptation [26] and CycleGAN [27], are often used to change the styles for state-of-the-art methods [13, 14, 15, 17, 18, 28, 24]. Using this translation mapping as a pre-processing function of the target agent, the pre-trained policy can determine actions in the same image style as the source MDP in the target MDP.
However, domain adaptation and cycle-consistency [27] only have a weak alignment ability [18], and some existing methods use paired datasets to properly transfer styles [19, 17, 16]. Therefore, these two datasets have been widely employed in previous studies and can be assumed to be a common setting.
The similarity between transfer via image-to-semantics and via image-to-image is that both train a style translation mapping between the source and target state spaces that preserves essential information; furthermore, the target agent is the composite of the pre-trained source policy and the translation mapping.
Again, the above methods use a non-photorealistic renderer on the simulator. Thus, these methods cannot be compared with transfer via image-to-semantics, as explained in Section II-B2.
III-B Learning Image-to-Semantics
Previous studies have used semantics in the source MDP [18, 10, 17, 16, 12]. An important perspective on the applicability of these methods to image-to-semantics is whether they use a renderer on the simulator, as shown in Table I and as discussed in Section II-B2. Because methods using a renderer assume that the source state space is an image space, image-to-semantics is beyond their scope, and it is not certain that their mechanism will be successful in image-to-semantics. For example, CycleGAN, which has been successfully used for image-to-image learning, failed in image-to-semantics [18]. In this regard, we refer to [18], an unpaired method that applies the findings from image-to-image to image-to-semantics. In addition, [19] is compared as a representative method that uses a paired dataset as in this study.
III-B1 CRAR
We refer to Section 4.4 of CRAR [19] as a baseline for image-to-semantics learning. They described the following policy transfer strategy: pre-train a source state encoder that maps source states into a latent space; train the source policy on that latent space; and train a target state encoder by adding a regularization term that pulls the latent representations of paired states together, where the pairs come from a paired dataset. The target agent is then the composite of the source policy and the target state encoder, and the target encoder can be regarded as a style translation mapping. Note that they only performed this experiment in a setting where both state spaces are image spaces; however, the scheme can easily be applied when the source state space is a semantic space.
III-B2 Zhang et al.
We referred to the cross-modality setting of their experiment as our baseline for image-to-semantics [18]. This setting is the same as the transfer via image-to-semantics.
Some challenges remain in [19, 18]. In [18], the human annotation cost is eliminated because no paired dataset is used. However, the loss function defined by [18] for unpaired image-to-semantics style translation does not necessarily yield a well-approximated translation mapping. Therefore, we decided to use a paired dataset to supervise the loss function directly, as performed in [19], but with a smaller paired dataset than [19].
IV Methodology
Our approach approximates the true image-to-semantics translation mapping $H$ by a learned mapping $F$ using the offline dataset $\mathcal{D}^{\mathrm{off}}$. Similar to [19], we used a paired dataset $\mathcal{D}^{\mathrm{pair}}$, constructed by querying $H$ to human annotators for image observations of the target MDP included in $\mathcal{D}^{\mathrm{off}}$. We incorporated two main ideas to reduce the annotation cost. Pair augmentation generates an augmented paired dataset $\mathcal{D}^{\mathrm{aug}}$ from $\mathcal{D}^{\mathrm{pair}}$ using the offline dataset and the simulator transition function. Active learning selects the subset of $\mathcal{D}^{\mathrm{off}}$ to be annotated to construct $\mathcal{D}^{\mathrm{pair}}$ (Algorithm 2). We present the overall procedure of our method in Algorithm 1.
We assumed that we have an offline dataset $\mathcal{D}^{\mathrm{off}}$ comprising multiple episodes in the target MDP. A timestep is an episode-start index if it is the first timestep of the offline dataset or the immediately preceding timestep is marked as the end of an episode by its terminal flag. For each episode-start index, the subsequence of $\mathcal{D}^{\mathrm{off}}$ from that timestep to the next end-of-episode flag corresponds to one episode.
IV-A Pair Augmentation by Transition Function
The objective of pair augmentation is to construct artificial paired data of the form $(s^{\mathrm{tgt}}, s^{\mathrm{src}})$ with $s^{\mathrm{src}} \approx H(s^{\mathrm{tgt}})$ for images $s^{\mathrm{tgt}}$ in the offline dataset. Using the augmented paired dataset $\mathcal{D}^{\mathrm{aug}}$, we aimed to obtain an $F$ that approximates $H$ by minimizing the loss

$\mathcal{L}(F) = \frac{1}{|\mathcal{D}^{\mathrm{aug}}|} \sum_{(s^{\mathrm{tgt}},\, s^{\mathrm{src}}) \in \mathcal{D}^{\mathrm{aug}}} \left\| F(s^{\mathrm{tgt}}) - s^{\mathrm{src}} \right\|_2^2. \qquad (2)$

Note that CRAR [19] adopts $\mathcal{D}^{\mathrm{pair}}$ instead of $\mathcal{D}^{\mathrm{aug}}$ in this loss.
Our principle is as follows. Let the annotated indices be a subset of the episode-start indices of $\mathcal{D}^{\mathrm{off}}$, and suppose we have a paired dataset $\mathcal{D}^{\mathrm{pair}}$ constructed by querying the semantics corresponding to the images at these indices. Although the semantics representing the image at the next timestep of an annotated timestep $t$ are unknown, because of the transition condition given in Section II-A and the deterministic transition, they equal $f^{\mathrm{src}}(H(s^{\mathrm{tgt}}_t), a_t)$, where $a_t$ is the action taken at timestep $t$ when collecting the offline dataset and is included in $\mathcal{D}^{\mathrm{off}}$. In reality, because human annotations and the simulator transition contain errors compared with the truth, the generated semantics do not exactly represent the image at the next timestep. However, even with errors in the annotation and in $f^{\mathrm{src}}$, the generated semantics are expected to be a valuable approximation. By recursively applying this generation, we obtained the augmented paired dataset $\mathcal{D}^{\mathrm{aug}}$.
Formally, $\mathcal{D}^{\mathrm{aug}}$ was constructed as follows: for each annotated episode-start index $t_0$, we defined a semantics sequence by $s^{\mathrm{src}}_{t_0} = H(s^{\mathrm{tgt}}_{t_0})$ (contained in $\mathcal{D}^{\mathrm{pair}}$) and $s^{\mathrm{src}}_{t+1} = f^{\mathrm{src}}(s^{\mathrm{src}}_{t}, a_t)$ for the subsequent timesteps of the episode, where the actions $a_t$ are contained in $\mathcal{D}^{\mathrm{off}}$. The augmented paired dataset is then the set of all pairs $(s^{\mathrm{tgt}}_{t}, s^{\mathrm{src}}_{t})$ along these episodes, where $s^{\mathrm{tgt}}_{t}$ is contained in $\mathcal{D}^{\mathrm{off}}$. Thus, we could construct an augmented paired dataset whose size equals the total number of timesteps of the annotated episodes from a paired dataset whose size equals the number of annotated episodes.
Figure 2 illustrates the pair augmentation scheme.
The reason the annotated indices were chosen from the episode-start indices rather than from arbitrary positions was to maximize the size of the augmented paired dataset $\mathcal{D}^{\mathrm{aug}}$. In other words, because we can augment only up to the end of the episode containing an annotated timestep, human annotations should be conducted at the beginnings of episodes of $\mathcal{D}^{\mathrm{off}}$ to maximize $|\mathcal{D}^{\mathrm{aug}}|$.
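The pair augmentation described above can be sketched as follows, reusing the hypothetical OfflineDataset layout from Section II-C and assuming a deterministic simulator transition function; this is an illustrative sketch under those assumptions, not the authors' reference implementation.

```python
def augment_pairs(offline, annotate, episode_starts, transition):
    """Pair augmentation: annotate only the first image of selected episodes
    and propagate semantics forward with the simulator transition function.

    offline        : OfflineDataset with images, actions, and terminal flags.
    annotate       : oracle H (image -> semantics), queried once per episode.
    episode_starts : episode-start indices chosen for annotation.
    transition     : deterministic simulator transition f_src(semantics, action).
    Returns a list of (image, semantics) pairs, i.e., the augmented dataset.
    """
    augmented = []
    for t0 in episode_starts:
        semantics = annotate(offline.images[t0])   # single human annotation
        augmented.append((offline.images[t0], semantics))
        t = t0
        # Roll semantics forward until the episode in the offline data ends.
        while not offline.terminals[t]:
            semantics = transition(semantics, offline.actions[t])
            t += 1
            augmented.append((offline.images[t], semantics))
    return augmented
```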
IV-B Active Learning for Pair Augmentation
To select the episodes for annotation, that is, to decide the set of annotated episode-start indices, we incorporated the idea of diversity-based active learning (AL) [29, 30, 31], whose motivation is to select dissimilar samples to effectively reduce the approximation error. Intuitively, if $\mathcal{D}^{\mathrm{aug}}$ contains many similar pairs, they might have a similar effect on training $F$, which may waste annotation cost. Therefore, we attempted to select the episodes to be annotated so as to ensure the inclusion of diverse pairs.
We successively selected the episodes to annotate, and we called each selection step a round. For each episode-start index, let the corresponding batch be the set of target state observations contained in the episode starting at that timestep. Let the selected set be the union of all state observations in the episodes selected in previous rounds, and let $d(\cdot,\cdot)$ be an appropriate distance measure. In each round, a batch was selected based on the following two diversity measures. The inter-batch diversity

(3)

evaluates the dissimilarity between a batch and the selected set; the batch with the greatest inter-batch diversity was considered the most dissimilar to the pre-selected batches. The intra-batch diversity

(4)

evaluates the dissimilarity among the states inside a batch; the batch with the greatest intra-batch diversity was considered to contain the most diverse states.
We selected a batch that maximizes the above two diversity measures, that is, the selection is a bi-objective optimization. To avoid overemphasizing one measure over the other, we employed two successive single-objective selections. In each round, we first picked the unselected batches whose inter-batch diversity was within the top ranks (a fixed number in our experiments) as candidates, and subsequently selected the candidate with the greatest intra-batch diversity. The selected set was initialized with one episode sampled uniformly at random.
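A possible implementation of this round-based selection is sketched below. Since the exact forms of the two diversity measures (Equations 3 and 4) are not reproduced here, the sketch uses one standard choice as an assumption: average minimum distance to the already selected states for the inter-batch diversity, and average pairwise distance within a batch for the intra-batch diversity. All names are ours.

```python
import numpy as np

def select_episodes(batches, n_rounds, top_k=5, rng=None):
    """Diversity-based selection of episodes to annotate.

    batches  : list of arrays; batches[i] has shape (episode_len, latent_dim)
               and holds latent representations of that episode's images.
    n_rounds : number of episodes to select (annotation budget).
    top_k    : number of inter-diversity candidates kept per round
               (the value used in the paper is not reproduced here).
    """
    rng = np.random.default_rng() if rng is None else rng
    selected = [int(rng.integers(len(batches)))]            # random initialization
    pool = np.concatenate([batches[i] for i in selected])   # already-selected states

    def inter_div(batch):
        # Assumed form: average distance from each state to its nearest
        # already-selected state (larger = more dissimilar to the pool).
        dists = np.linalg.norm(batch[:, None, :] - pool[None, :, :], axis=-1)
        return dists.min(axis=1).mean()

    def intra_div(batch):
        # Assumed form: average pairwise distance inside the batch
        # (zero self-distances only rescale the measure).
        dists = np.linalg.norm(batch[:, None, :] - batch[None, :, :], axis=-1)
        return dists.mean()

    for _ in range(n_rounds - 1):
        remaining = [i for i in range(len(batches)) if i not in selected]
        # Keep the top_k batches by inter-batch diversity ...
        candidates = sorted(remaining, key=lambda i: inter_div(batches[i]),
                            reverse=True)[:top_k]
        # ... then pick the candidate with the greatest intra-batch diversity.
        best = max(candidates, key=lambda i: intra_div(batches[i]))
        selected.append(best)
        pool = np.concatenate([pool, batches[best]])
    return selected
```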
IV-C Representation Learning Using Offline Dataset
For $d(\cdot,\cdot)$ to be a reasonable distance measure in the image space, we employed a VAE encoder [32], which stochastically outputs a latent vector for an input image. The distance between two states $s$ and $s'$ was given by the Euclidean distance between the mean vectors of their latent representations, that is, $d(s, s') = \lVert \mu(s) - \mu(s') \rVert_2$, where $\mu(\cdot)$ denotes the encoder's mean vector. We trained the VAE using all states in the offline dataset before performing the active learning procedure.
The states contained in $\mathcal{D}^{\mathrm{aug}}$ are used to train $F$ with Equation 2; however, the remaining states of $\mathcal{D}^{\mathrm{off}}$ are not. To exploit them as well, we included the VAE encoder as a feature extractor for $F$, thereby receiving the benefit of representation learning for downstream tasks. We modeled $F = g \circ \mu$, and we trained $g$ with Equation 2 while $\mu$ was kept fixed.
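The latent distance and the factorization of $F$ into a frozen VAE encoder and a trainable head could be realized as in the following PyTorch-flavored sketch. The encoder interface (returning a mean and log-variance) and the layer sizes are placeholder assumptions, not the configuration reported in Section V-C.

```python
import torch
import torch.nn as nn

class LatentDistance:
    """d(s, s') = ||mu(s) - mu(s')||_2 computed from the mean vectors of a
    pre-trained VAE encoder (assumed to return (mu, log_var))."""

    def __init__(self, encoder):
        self.encoder = encoder

    @torch.no_grad()
    def __call__(self, s1, s2):
        mu1, _ = self.encoder(s1)
        mu2, _ = self.encoder(s2)
        return torch.linalg.norm(mu1 - mu2, dim=-1)

class ImageToSemantics(nn.Module):
    """F = g o mu: a frozen VAE encoder followed by a trainable head g,
    trained with the loss of Equation (2). Layer sizes are illustrative."""

    def __init__(self, encoder, latent_dim, semantics_dim):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # keep the encoder fixed
            p.requires_grad_(False)
        self.head = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, semantics_dim),
        )

    def forward(self, images):
        with torch.no_grad():
            mu, _ = self.encoder(images)
        return self.head(mu)                  # predicted semantics
```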
V Experiments
We aimed to verify the following two claims: (1) the proposed paired augmentation and AL reduces the annotation cost for approximating while maintaining its performance level; and (2) the paradigm with the paired dataset performs better than the method without paired datasets.
V-A Evaluation Metrics
V-A1 Policy Performance (PP)
The most important evaluation metric for $F$ is the expected cumulative reward of the target agent, defined via Equation 1 as

$\mathrm{PP} = J(\pi^{\mathrm{src}} \circ F;\ \mathcal{M}^{\mathrm{tgt}}). \qquad (5)$

In our experiments, we approximated it by averaging the cumulative rewards over 50 episodes with $\pi^{\mathrm{src}} \circ F$. This metric was commonly used in [19, 18].
V-A2 Matching Distance (MD)
Because our technical contribution mainly concerns the approximation of $H$, we also used the following empirical approximation error:

$\mathrm{MD} = \frac{1}{|\tau|} \sum_{s^{\mathrm{tgt}} \in \tau} \left\| F(s^{\mathrm{tgt}}) - H(s^{\mathrm{tgt}}) \right\|_2, \qquad (6)$

where $\tau$ is a trajectory collected by a behavior policy in the target MDP and not used for learning $F$. Unfortunately, in a real-world environment, evaluating Equation 6 for a large $\tau$ is challenging because $H$ requires human annotation. To enable MD in our experiment, we performed experiments using the simulator for both the source MDP and the target MDP, adopting the rendered image space as the state space of the target MDP. Because both semantics and images were generated in the simulator, $H$ was freely available to calculate Equation 6. A metric similar to Equation 6 was used in [18].
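As a check of the metric's definition, MD can be computed on a held-out trajectory as in the sketch below; `oracle` stands for the simulator's ground-truth semantics (available here only because both MDPs are simulated), and the mean L2 distance reflects our reading of Equation 6. Names are illustrative.

```python
import numpy as np

def matching_distance(F, trajectory_images, oracle):
    """Empirical approximation error of F on a held-out target trajectory.

    F                 : learned image-to-semantics mapping, image -> semantics.
    trajectory_images : iterable of image observations collected by a behavior
                        policy and not used for training F.
    oracle            : ground-truth mapping H, available in simulation.
    """
    errors = [np.linalg.norm(F(img) - oracle(img)) for img in trajectory_images]
    return float(np.mean(errors))
```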
V-B Environment
We evaluated the proposed approach on three environments.
V-B1 ViZDoom Shooting (Shooting)
ViZDoom Shooting [33] is a first-person-view shooting task, in which the agent obtains 64×64 RGB images from the first-person perspective in the target MDP. The agent can change its coordinate by moving left and right in the room and can attack forward. An enemy spawns at a random coordinate on the other side of the room at the start of the episode and does not move or attack. The agent can destroy the enemy by moving in front of it and shooting it; the time to destruction is directly related to the reward. The semantics are the coordinates of the agent and the enemy; hence, the semantic space is 2-dimensional. The maximum number of timesteps per episode is 50. The behavior policy used to collect the offline dataset is a random policy, and the dataset consists of 200 episodes, that is, 10k timesteps in total.
V-B2 PyBullet KUKA Grasp (KUKA)
This is a grasping task using PyBullet's KUKA iiwa robot arm [34]. Success is achieved by manipulating the end-effector of the robot arm and lifting a randomly placed cylinder. The semantics are the 3-dimensional coordinates and the 3-dimensional Euler angles of the end-effector and the coordinates of the cylinder; hence, the semantic space is 9-dimensional. We used rendered 64×64 RGB images captured simultaneously from three different viewpoints as the state observations in the target MDP. The total number of timesteps per episode is fixed to 40. The behavior policy used to collect the offline dataset is a random policy, and the dataset comprises 250 episodes, that is, 10k timesteps in total.
V-B3 PyBullet HalfCheetah-v0 (HalfCheetah)
This is the PyBullet version of HalfCheetah, that is, a task in which a 2-dimensional cheetah is manipulated by continuous control to run faster. The torques of the six joints can be controlled, and the semantic space is 26-dimensional. We collected 64×64 images captured from three different viewpoints at two consecutive timesteps and defined the target state space as an image space containing a total of six frames. The total number of timesteps per episode is fixed to 1000. The behavior policy used to collect the offline dataset is a random policy, and the dataset consists of 100 episodes, that is, 100k timesteps in total.
In our experiments, information such as -coordinates and velocity can be recovered from a combination of multiple images by capturing images from multiple viewpoints at consecutive times, and such a setup is necessary in practice.
V-C Setting
We used a 7-layer convolutional neural network for the VAE encoder and a 4-layer fully connected neural network for the head $g$, for both the proposed and existing methods, and trained them by gradient descent using Adam [35]. The dimensions of the VAE latent space were set to 32, 96, and 192 for Shooting, KUKA, and HalfCheetah, respectively. For CRAR [19], the annotated indices were selected uniformly at random from the offline dataset. For our method without AL, the annotated episodes were selected uniformly at random from the episode-start indices. For Shooting and KUKA, we used a handcrafted policy instead of one trained by RL as $\pi^{\mathrm{src}}$. In HalfCheetah, we trained $\pi^{\mathrm{src}}$ using PPO [36].
V-D Results
Method | MD | PP |
---|---|---|
Zhang et al. | ||
CRAR | ||
Ours w/o AL | ||
Ours | ||
CRAR | ||
Ours w/o AL | ||
Ours |

Method | MD | PP |
---|---|---|
Zhang et al. | ||
CRAR | ||
Ours w/o AL | ||
Ours | ||
CRAR | ||
Ours w/o AL | ||
Ours |
Method | MD | PP |
---|---|---|
Zhang et al. | ||
CRAR | ||
Ours w/o AL | ||
Ours | ||
CRAR | ||
Ours w/o AL | ||
Ours |
Tables II, III and IV show the results of image-to-semantics learning in the three environments, reported as average ± standard deviation over five trials, together with the number of paired data, that is, the annotation cost. Because of the transition and reward conditions, the PP of $\pi^{\mathrm{src}} \circ F$ on the target MDP approaches that of $\pi^{\mathrm{src}}$ on the source MDP as $F$ approaches $H$.
Note that most image-to-image methods shown in Table I cannot be compared with image-to-semantics methods because some assumptions cannot be satisfied under image-to-semantics settings. One way to speculate on the performance of the image-to-image techniques in an image-to-semantics setting is to see Zhang et al. [18]. Zhang et al. used domain adaptation [26], which is commonly used in image-to-image learning; thus, their method can be interpreted as a representative example in which the techniques cultivated in image-to-image are imported to image-to-semantics. Although CycleGAN [27] is also widely employed in image-to-image learning, along with domain adaptation, they confirmed in their experiments that this method did not outperform their method in the image-to-semantics setting [18].
In all cases, compared with the approach of Zhang et al. [18], our approaches with and without AL achieved a smaller MD and a greater PP. Zhang et al.’s approach is designed to learn without a paired dataset to eliminate the annotation cost. However, learning without pairs does not necessarily lead to the true image-to-semantics translation mapping, as observed in the high MD and low PP in our results. This result shows the effectiveness of the paradigm using paired data when aiming for higher performance policy transfer while compromising the annotation cost to prepare a small number of paired data.
By comparing the results of our approaches with and without AL against those of CRAR, we confirmed the efficacy of pair augmentation in achieving a smaller MD and a higher PP. In Shooting, using pair augmentation and AL with 10 pairs, we achieved a PP exceeding that of CRAR with 50 pairs. In addition, in KUKA, our PP with 10 pairs exceeded that of CRAR with 100 pairs. This means that the annotation cost was reduced by factors of more than 5 and 10, respectively. The difference is even more pronounced in HalfCheetah. This may be because the ratio of augmented pairs to annotated pairs is greatest in this environment: CRAR uses only the annotated paired data, whereas our proposed approach additionally uses the augmented paired data, and the number of timesteps per episode is 1000 in this environment.
A tendency of reduced MD and increased PP was observed in the proposed approach with AL compared to that without AL. Specifically, AL reduced MD except in KUKA, and clearly improved PP in KUKA, while achieving competitive PP in the other two environments.
In Figure 3, we present the effectiveness of our approach in Shooting. The proposed AL maximized the diversity in the latent space that represents the image space, but the diversity was also maximized when this result was visualized in the semantic space.
V-E Experiments with Errors
In this section, we verify the robustness of the proposed method against errors in annotation and state transitions.
V-E1 Annotation Error
In the preceding discussion and in the experiments of Section V-D, we assumed that we could query the oracle $H$, that is, the true image-to-semantics mapping, through human annotation. However, because human annotation means assigning semantics to images manually, errors are expected to occur in the resulting semantics. Therefore, we introduced a new experimental setup: while creating the paired dataset, we observe $H(s^{\mathrm{tgt}}) + \epsilon$ instead of $H(s^{\mathrm{tgt}})$, where $\epsilon$ is a random vector representing the annotation error.
V-E2 Transition Error
In reality, the state transition function of the simulator is expected to contain modeling errors. For example, environment parameters such as friction coefficients and motor torques in the real world cannot be accurately estimated in the simulator, and thus real state transitions cannot be imitated exactly. Therefore, we introduced a new experimental setup: while augmenting the paired dataset, we obtain $f^{\mathrm{src}}(s^{\mathrm{src}}, a) + \epsilon$ instead of $f^{\mathrm{src}}(s^{\mathrm{src}}, a)$, where $\epsilon$ is a random vector representing the transition error. Note that when training $\pi^{\mathrm{src}}$, we used the transition function without errors in our experiments.
V-E3 Error Generation
We generated both types of errors by adding a random vector $\epsilon$. Let $\epsilon_j$ denote the value of the $j$-th dimension of $\epsilon$. We sampled $\epsilon_j \sim \mathcal{N}(0, (\alpha \sigma_j)^2)$, where $\mathcal{N}$ denotes a Gaussian distribution, $\sigma_j$ is the sample standard deviation of the $j$-th dimension of a source trajectory collected by a behavior policy in the source MDP, and $\alpha$ is the noise scale.
For the annotation error, starting from the semantics sequence of augmented pairs generated without error, we added a single realized error vector to the semantics of every timestep of the episode; that is, the error of the annotated first state propagates unchanged through the augmentation. For the transition error, we added an independently realized error vector at every augmentation step, so that the errors accumulate along the episode. Note that, here, we approximated the error generation based on the assumption $f^{\mathrm{src}}(s^{\mathrm{src}} + \epsilon, a) \approx f^{\mathrm{src}}(s^{\mathrm{src}}, a) + \epsilon$. This approximation simplifies the implementation; moreover, for Shooting and KUKA, the assumption is actually satisfied for almost all states and actions.
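A minimal sketch of the error injection is given below, assuming the noise is added directly to the annotator output and to each simulated transition, which, under the approximation stated above, corresponds to the scheme used in the experiments; the helper names are ours. Wrapping the annotator and the transition function with these helpers before running the pair augmentation sketch of Section IV-A would reproduce the two error settings at the level of this sketch.

```python
import numpy as np

def make_noise_sampler(source_trajectory, scale, rng=None):
    """Return a sampler for the error vector epsilon.

    source_trajectory : array (T, semantics_dim) collected by a behavior
                        policy in the source MDP; its per-dimension sample
                        standard deviation sets the noise magnitude.
    scale             : noise scale alpha.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = source_trajectory.std(axis=0)
    return lambda: rng.normal(0.0, scale * sigma)

def noisy_annotation(annotate, sample_eps):
    """Annotation error: the queried semantics are perturbed once per query."""
    return lambda image: annotate(image) + sample_eps()

def noisy_transition(transition, sample_eps):
    """Transition error: every simulated step is perturbed independently."""
    return lambda semantics, action: transition(semantics, action) + sample_eps()
```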
V-E4 Results
Method | MD | PP | |
---|---|---|---|
Shooting () | |||
CRAR | 0.0 | ||
Ours w/o AL | |||
CRAR | 0.04 | ||
Ours w/o AL | |||
CRAR | 0.15 | ||
Ours w/o AL | |||
CRAR | 0.3 | ||
Ours w/o AL | |||
KUKA () | |||
CRAR | 0.0 | ||
Ours w/o AL | |||
CRAR | 0.04 | ||
Ours w/o AL | |||
CRAR | 0.15 | ||
Ours w/o AL | |||
HalfCheetah () | |||
CRAR | 0.0 | ||
Ours w/o AL | |||
CRAR | 0.04 | ||
Ours w/o AL | |||
CRAR | 0.15 | ||
Ours w/o AL |
Here, we analyze the effect of the two types of errors on the paired and augmented datasets, and how they affect the approximation of $H$. Therefore, we do not experiment with the method of Zhang et al., which does not use paired datasets. In addition, to eliminate the effect of the choice of annotated indices on the generation of the paired and augmented datasets when comparing the proposed method with CRAR, we conducted these experiments without AL.
The results with annotation errors are shown in Table V. In both CRAR and the proposed method, the semantics of the paired data deviate from the truth as the scale of the annotation error increases; accordingly, we observed that MD tends to increase for both methods. Although PP tended to decrease only for the proposed method, the proposed method achieved better MD and PP than CRAR at the same error scale $\alpha$. In addition, we confirmed that the PP of our method under annotation error remains comparable to the case without annotation error for a certain range of $\alpha$; for example, in the KUKA experiments, the proposed method with a moderate annotation error achieved a PP close to that without annotation error. We conclude that the proposed pair augmentation is effective for image-to-semantics learning even in the presence of annotation errors.
Method | MD | PP | |
---|---|---|---|
Shooting () | |||
CRAR | 0.0 | ||
Ours w/o AL | |||
Ours w/o AL | 0.01 | ||
Ours w/o AL | 0.04 | ||
Ours w/o AL | 0.1 | ||
KUKA () | |||
CRAR | 0.0 | ||
Ours w/o AL | |||
Ours w/o AL | 0.01 | ||
Ours w/o AL | 0.04 | ||
HalfCheetah () | |||
CRAR | 0.0 | ||
Ours w/o AL | |||
Ours w/o AL | 0.01 | ||
Ours w/o AL | 0.04 |
The results with transition errors are shown in Table VI. Note that CRAR does not use $f^{\mathrm{src}}$; thus, its result does not depend on the transition error scale $\alpha$, and the CRAR entry coincides with its error-free result. Because the proposed pair augmentation scheme uses $f^{\mathrm{src}}$ to generate semantics, the variance of the accumulated error is expected to grow with $\alpha$, and the augmented semantics then deviate from the actual semantics. In fact, we observed an increase in MD and a decrease in PP for the proposed method as $\alpha$ increased. Nevertheless, both MD and PP remained better than those of CRAR up to a certain noise scale in each environment. This indicates that the proposed pair augmentation is effective in reducing the annotation cost up to a certain level of transition error.
V-F Effect of Behavior Policy
To further reveal the behavior of image-to-semantics methods, we evaluated them on HalfCheetah by adopting a low performance policy, rather than the random policy, as the behavior policy. We pre-trained the low performance policy with a small number of iterations using PPO.
We observed that, compared with Table IV and Table VII, the performance of the behavior policy affected the PP of the resulting target agent. Here, the PP of the random policy was and that of the low performance policy was . Therefore, the PP of our proposed method was improved from to when .
Method | MD | PP |
---|---|---|
Zhang et al. | ||
CRAR | ||
Ours w/o AL | ||
Ours | ||
CRAR | ||
Ours w/o AL | ||
Ours |
These results indicate that, owing to the low performance of the random policy, faster-running states, that is, states with high velocity, cannot be observed; in other words, the random policy can observe only a limited region of the state space. This limitation could lead to an increase in the approximation error of $F$ in the unobserved regions. This implies that image-to-semantics is affected by the performance of the behavior policy in some tasks.
A promising result for the image-to-semantics framework is that the target agents obtained by our approach outperform the behavior policy. In particular, in Table VII, when image-to-semantics was performed with 50 annotations, the obtained PP exceeded that of the behavior policy. In other words, we could achieve higher performance than the behavior policy using a small number of annotations and the image-to-semantics protocol.
From the previous discussion, the PP achieved by the image-to-semantics framework is affected by the quantity and quality of the paired data and by the region of the state space covered by the dataset for training $F$. In fact, as an extreme example, $F$ trained on a 100k-timestep offline dataset collected by the optimal policy achieved a PP almost identical to the performance of the optimal source policy. Note that such near-complete policy transfer is already achieved in Shooting and KUKA, as shown in Tables II and III.
VI Conclusion
In this study, we investigated the image-to-semantics problem for vision-based agents in robotics. Using paired data for learning image-to-semantics mapping is favorable for achieving high-performance policy transfer; however, the cost of creating paired data cannot be ignored. This study contributes to existing literature by reducing the annotation cost using two techniques: pair augmentation and active learning. We also confirmed the effectiveness of the proposed method in our experiments.
In future work, we must address the following limitations: (1) Experiments have not been conducted on actual robots; therefore, it is unknown how difficulties specific to real robots affect image-to-semantics performance. (2) We cannot always freely query the simulator transition function $f^{\mathrm{src}}$ at arbitrary states; therefore, it would be beneficial to know whether a transition model learned from source trajectories can be substituted, similar to [18, 23]. (3) In some cases, the transition error is too large, and we would like to improve the approximation accuracy of $F$ by performing pair augmentation only for a limited number of timesteps ahead of each annotated state rather than until the end of the episode, because augmented semantics generated over a longer horizon become inaccurate. Furthermore, we would like to find a way to determine such a horizon automatically.
References
- [1] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine, “Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” in Conference on Robot Learning (CoRL), 2018.
- [2] M. A. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. V. de Wiele, V. Mnih, N. Heess, and J. T. Springenberg, “Learning by playing - solving sparse reward tasks from scratch,” in International Conference on Machine Learning (ICML), 2018, pp. 4344–4353.
- [3] S. Joshi, S. Kumra, and F. Sahin, “Robotic grasping using deep reinforcement learning,” in IEEE International Conference on Automation Science and Engineering (CASE), 2020, pp. 1461–1466.
- [4] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” in International Symposium on Experimental Robotics (ISER), 2017.
- [5] S. S. Du, S. M. Kakade, R. Wang, and L. F. Yang, “Is a good representation sufficient for sample efficient reinforcement learning?” in International Conference on Learning Representations (ICLR), 2020.
- [6] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. P. Lillicrap, and M. A. Riedmiller, “Deepmind control suite,” CoRR, vol. arXiv:1801.00690, 2018.
- [7] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 23–30.
- [8] S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis, “Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 12 619–12 629.
- [9] I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner, “DARLA: Improving zero-shot transfer in reinforcement learning,” in International Conference on Machine Learning (ICML), 2017, pp. 1480–1490.
- [10] L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba, and P. Abbeel, “Asymmetric actor critic for image-based robot learning,” in Robotics: Science and Systems (RSS), 2018.
- [11] B. Chen, A. Sax, G. Lewis, I. Armeni, S. Savarese, A. Zamir, J. Malik, and L. Pinto, “Robust policies via mid-level visual representations: An experimental study in manipulation and navigation,” in Conference on Robot Learning (CoRL), 2020.
- [12] E. Tzeng, C. Devin, J. Hoffman, and C. Finn, “Adapting deep visuomotor representations with weak pairwise constraints,” Algorithmic Foundations of Robotics, pp. 688–703, 2020.
- [13] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, S. Levine, and V. Vanhoucke, “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” in IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 4243–4250.
- [14] K. Rao, C. Harris, A. Irpan, S. Levine, J. Ibarz, and M. Khansari, “Rl-cyclegan: Reinforcement learning aware simulation-to-real,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 154–11 163.
- [15] D. Ho, K. Rao, Z. Xu, E. Jang, M. Khansari, and Y. Bai, “Retinagan: An object-aware approach to sim-to-real transfer,” in IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 10 920–10 926.
- [16] F. Zhang, J. Leitner, M. Milford, and P. Corke, “Modular deep q networks for sim-to-real transfer of visuo-motor policies,” in Australasian Conference on Robotics and Automation (ACRA), 2017.
- [17] F. Zhang, J. Leitner, Z. Ge, M. Milford, and P. Corke, “Adversarial discriminative sim-to-real transfer of visuo-motor policies,” International Journal of Robotics Research (IJRR), pp. 1229–1245, 2019.
- [18] Q. Zhang, T. Xiao, A. A. Efros, L. Pinto, and X. Wang, “Learning cross-domain correspondence for control with dynamics cycle-consistency,” in International Conference on Learning Representations (ICLR), 2021.
- [19] V. Francois-Lavet, Y. Bengio, D. Precup, and J. Pineau, “Combined reinforcement learning via abstract representations,” in AAAI Conference on Artificial Intelligence (AAAI), vol. 33, 2019, pp. 3582–3589.
- [20] A. Srinivas, M. Laskin, and P. Abbeel, “CURL: contrastive unsupervised representations for reinforcement learning,” in International Conference on Machine Learning (ICML), 2020, pp. 5639–5650.
- [21] J. Yang, G. Lee, S. Chang, and N. Kwak, “Towards governing agent’s efficacy: Action-conditional beta-vae for deep transparent reinforcement learning,” in Asian Conference on Machine Learning (ACML), 2019, pp. 32–47.
- [22] R. Agarwal, D. Schuurmans, and M. Norouzi, “An optimistic perspective on offline reinforcement learning,” in International Conference on Machine Learning (ICML), 2020, pp. 104–114.
- [23] D. Ha and J. Schmidhuber, “Recurrent world models facilitate policy evolution,” in Neural Information Processing Systems (NeurIPS), 2018, pp. 2451–2463.
- [24] G. Zhang, L. Zhong, Y. Lee, and J. J. Lim, “Policy transfer across visual and dynamics domain gaps via iterative grounding,” in Robotics: Science and Systems (RSS), 2021.
- [25] M. Müller, A. Dosovitskiy, B. Ghanem, and V. Koltun, “Driving policy transfer via modularity and abstraction,” in Conference on Robot Learning (CoRL), 2018.
- [26] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2962–2971.
- [27] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2017, pp. 2242–2251.
- [28] S. Gamrian and Y. Goldberg, “Transfer learning for related reinforcement learning tasks via image-to-image translation,” in International Conference on Machine Learning (ICML), 2019, pp. 2063–2072.
- [29] S. Sinha, S. Ebrahimi, and T. Darrell, “Variational adversarial active learning,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 5971–5980.
- [30] D. Wu, C.-T. Lin, and J. Huang, “Active learning for regression using greedy sampling,” Information Sciences, vol. 474, pp. 90–105, 2019.
- [31] F. Zhdanov, “Diverse mini-batch active learning,” CoRR, vol. arXiv:1901.05954, 2019.
- [32] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in International Conference on Learning Representations (ICLR), 2014.
- [33] M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaskowski, “Vizdoom: A doom-based AI research platform for visual reinforcement learning,” in IEEE Conference on Computational Intelligence and Games (CIG), 2016, pp. 1–8.
- [34] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” 2016.
- [35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), Y. Bengio and Y. LeCun, Eds., 2015.
- [36] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. arXiv:1707.06347, 2017.