
Grasping with Chopsticks: Combating Covariate Shift in Model-free Imitation Learning for Fine Manipulation

Liyiming Ke, Jingqiang Wang, Tapomayukh Bhattacharjee, Byron Boots and Siddhartha Srinivasa University of Washington, Seattle WA 98105 USA. {kayke, jwq123, tapo, bboots, siddh}@uw.edu
Abstract

Billions of people use chopsticks, a simple yet versatile tool, for fine manipulation of everyday objects. The small, curved, and slippery tips of chopsticks pose a challenge for picking up small objects, making them a suitably complex test case. This paper leverages human demonstrations to develop an autonomous chopstick-equipped robotic manipulator. Due to the lack of accurate models for fine manipulation, we explore model-free imitation learning, which traditionally suffers from the covariate shift phenomenon that causes poor generalization. We propose two approaches to reduce covariate shift, neither of which requires access to an interactive expert or a model, unlike previous approaches. First, we alleviate single-step prediction errors by applying an invariant operator to increase the data support at critical steps for grasping. Second, we generate synthetic corrective labels by adding bounded noise and combining parametric and non-parametric methods to prevent error accumulation. We demonstrate our methods on a real chopstick-equipped robot that we built, and observe the agent's success rate increase from 37.3% to 80%, which is comparable to the human expert performance of 82.6%.

I Introduction

Although complex end effectors are inherently suited to fine manipulation [1] due to fewer design constraints, simple tools are easier to study and deploy, and they are ubiquitous in industrial manipulators. With a human-in-the-loop, simple end effectors can also perform general fine manipulation [2, 3]. We choose chopsticks, a simple tool that is very familiar to humans, as an example for learning and automating fine manipulation strategies from human demonstrations. To that end, we have built an automated chopstick-equipped robot comprising a 6-DOF robot arm outfitted with a 1-DOF actuated chopstick (Fig. 1). Our goal is to demonstrate autonomous superhuman chopstick dexterity with our robot.

The efficacy of chopsticks' design has inspired researchers to adapt them for diverse robotic applications, such as surgery [4, 5, 6], micro-manipulation [7], and meal assistance [8, 9]. However, chopsticks' practicality and generality come at the cost of control complexity [10]. Their small, curved, and slippery tips require precise movements to grasp small, rigid objects such as a toy marble. Their low tolerance for failure makes them a suitably complex test case for evaluating fine manipulation. Notably, humans have demonstrated impressive adaptability in teleoperating a robot equipped with chopsticks to pick up hard-to-grasp small objects [11]. We aim to leverage human demonstrations to learn control policies using imitation learning [12, 13]. The challenge we face is further exacerbated by the lack of accurate models for our assembled robotic test-bed [14], a common constraint for fine manipulation tasks [15, 16].

The lack of accurate models motivates our study of model-free imitation learning [17, 18, 19, 20]. Here, we have access to demonstration data but not to the expert’s policy function or the environment’s transition model. Under these conditions, supervised learning methods like behavior cloning [21] learn a policy function by matching the expert’s action distribution. Minimizing action distribution divergence, however, does not necessarily guarantee the recovery of parsimonious states that lead to task success [22]. A learned agent can suffer from covariate shift [23], i.e., compounding errors in the action space that lead the agent to unseen states during test time. This problem can be especially detrimental for fine manipulation, the success of which critically depends on a few steps that usually occur near the end of a trajectory.

Figure 1: Fine manipulation using chopsticks.

To remedy covariate shift, researchers have proposed interactive imitation learning methods, such as DAgger [24] and DART [25], which query an expert online for corrective labels. DAgger rolls out a learned agent and asks the expert to label the states the learner visits, which can be computationally expensive and unnatural on a teleoperation interface [25]. DART injects noise during data collection, disturbing expert teleoperation and forcing the expert to provide corrective labels. However, injecting noise during data collection can burden the expert: adding a small amount of random noise for our fine manipulation task, as DART suggests, would require the expert to spend 43% more time collecting data.¹ ¹Though 95% of the injected noise resulted in at most a 0.35° deviation per joint, it lowered the expert success rate by 18% and forced the expert to spend more time completing each trajectory.

These challenges prompt us to address covariate shift in model-free imitation learning in a non-interactive setting, where we have access to demonstration data but not to an interactive expert. Since covariate shift results from the interplay of single-step errors and their accumulation over time, our key ideas are to (1) increase data support to address single-step errors, and (2) provide corrective labels to address the accumulation of errors. Specifically, we provide:

  • Enhanced data support by transforming the data to an object-centric frame that preserves the relative transformation between the end effector and object, while making training data denser around the critical region for grasp success.

  • Corrective labels by injecting noise into the collected states, assuming the same action may serve as the corrective label for the deviated state. Thus, we implicitly enforce smoothness in the learned policy and teach the agent how to recover from deviated states.

  • Corrective labels by choosing a combination of parametric and non-parametric methods that improve matching of the action distribution at unseen states. Because of our problem structure, a better match in action distribution leads to a higher likelihood of matching the state distribution, preventing error accumulation.

We demonstrate our proposal's effectiveness on a physical robot equipped with chopsticks to pick up small cube- and ball-shaped objects, as shown in Fig. 1. Our proposed agent achieves a 60% success rate picking up even the most challenging item, a small ball, whereas a naive behavior cloning agent succeeds only 12% of the time. Our agent achieves an 80% average success rate across all three objects tested, comparable to the expert human performance of 82.7%. We conduct ablation tests, visualize the resulting state distributions, and observe a smaller covariate shift for our proposed agents. We also validate the generality of the noise injection method on several MuJoCo simulated tasks.

Our promising empirical results, based on pragmatic assumptions of data support and policy smoothness, open the door for further theoretical analysis of combating covariate shift. Furthermore, although we have focused on the non-interactive setting, our techniques directly transfer to the interactive setting, enhancing robustness while reducing user burden.

II Methods

II-A Transform: Increasing Data Support

Our goal is to develop an agent that can generalize from demonstration data to predict an action for any query state. However, we lack data support for some states (e.g., the “unseen state” during rollout). We propose to apply an invariant operator to transform the data, making it denser around the region of interest and thus increasing the data support.

In manipulation, changing the frame of reference can significantly change the distribution of trajectories (Fig. 2). We could choose a robot-centric frame, with the robot base as the origin, or an object-centric frame [26], with the object location as the origin. We propose that an object-centric frame can reduce covariate shift and improve policy generalization, especially for fine manipulation. The transformation to an object-centric frame yields a denser distribution of trajectories near the origin where the object is located, increasing data support in the critical region that determines grasping success. An object-centric frame also makes the learned policy invariant to translations of the object location, which makes it more sample efficient when generalizing to novel object locations.
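The sketch below illustrates this transformation, under the assumption that the first three entries of each state (and action) hold the end-effector xyz position in the robot-centric frame; the array layout and function names are ours, for illustration only.

```python
import numpy as np

def to_object_centric(poses, object_xyz):
    """Re-express end-effector positions relative to the tracked object.

    A minimal sketch: `poses` is a (T, D) array whose first three columns
    are assumed to hold the end-effector xyz position in the robot-centric
    frame; `object_xyz` is the object position in the same frame. Only the
    translational part shifts, preserving the relative transformation
    between the end effector and the object while centering trajectories
    at the origin (the grasp region).
    """
    shifted = poses.copy()
    shifted[:, :3] -= object_xyz  # position now relative to the object
    return shifted

# Usage: because both states and actions contain an end-effector pose,
# the same shift would be applied to each before training.
# demos = [(states, actions, object_xyz), ...]
# transformed = [(to_object_centric(s, o), to_object_centric(a, o), o)
#                for s, a, o in demos]
```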

(a) Robot-centric frame.
(b) Object-centric frame.
Figure 2: Visualizing the end-effector positions of all demonstrations under different coordinate frames. Each black dot is the xyz-position of the end effector at one step. We highlight one trajectory, which starts with red dots and ends with blue.

II-B Noise: Generating Synthetic Corrective Labels

(a) Covariate shift: a learner rollout (black) deviates from the demonstration (red), and errors accumulate.
(b) Injecting noise into the collected states and reusing the collected actions as synthetic corrective labels.
(c) Using non-parametric methods (such as a k-NN) to return the agent to the proper region when deviations occur.
Figure 3: Preventing error accumulation in imitation learning.

Although the transformation technique improves the agent's success rate, we still observe significant deviations during test time that result in task failure (Fig. 3(a)). This is understandable because machine learning algorithms generally need exponentially more data for progressive improvement [25]. Instead of naively collecting more data, we introduce corrective action labels that help the agent recover from deviations. For example, Venkatraman et al. [27] rolled out trained agents, collected the states where they deviated, and used model-predictive control to generate corrective labels that steer the agent back to the demonstrated trajectory. Unfortunately, models sufficiently accurate for fine manipulation can be challenging to build.

We propose to generate synthetic corrective labels by injecting noise into the collected demonstration states (“deviated state”) and reusing the collected action (“corrective labels”), thus not requiring access to an expert or a model. Unlike DART and DAgger, which emphasize collecting corrective labels for the states that the agent will visit during rollout (test state distribution), we hypothesize that we do not need to match the deviated states’ distribution accurately. Instead, we need to collect enough corrective labels to cover the deviated states’ distribution. Since we can generate labels for free without burdening an expert, we choose to generate labels for randomly sampled deviated states, thus simplifying the selection of states for which to generate synthetic corrective labels. Fig. 3(b) shows an example where we sample states around a demonstrated state and reuse the demonstrated action as synthetic corrective labels.

Researchers have injected noise [28] into problems that reduce a high-dimensional input to a low-dimensional output, e.g., for classification [29] and object recognition in visual and language domains [30]. In these works, such tasks are invariant under a wide variety of transformations [31]. However, our robotic manipulation task has low-dimensional states and actions, where the mapping learned may not be invariant to the noise. We provide two insights to justify why injecting noise can still be desirable.

First, we apply a small amount of additive Gaussian noise to the demonstration states, rather than a large amount that could pollute the data by mapping a state to a detrimental action. Inspired by [32], which showed the effectiveness of noise injection for autoencoders when the noise magnitude is carefully tuned, we generate Gaussian noise $\epsilon \sim \mathcal{N}(0,\sigma)$ and add it to the collected states, where $\sigma$ is the covariance of the noise. For simplicity, we correlate the noise size $\sigma$ with the variance of the data.

Second, because of the structure of our problem, the collected action can serve as the corrective label for the noise-injected deviated state. Our state and action representations both include the end-effector pose. Therefore, when an agent starts drifting from a demonstrated trajectory and enters a deviated state, our algorithm can teach it to return to the original trajectory by reusing the same action label. Injecting noise also encourages the learned policy to be smooth, which is desirable since we assume the actions are Lipschitz continuous w.r.t. the states.
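A minimal sketch of this augmentation follows; the parameters `eta` (noise magnitude) and `copies` (augmentation factor) are illustrative, and the per-dimension noise scale tied to the data's spread is one plausible reading of the scaling described above.

```python
import numpy as np

def augment_with_noise(states, actions, eta=0.1, copies=1, rng=None):
    """Generate synthetic corrective labels by noising demonstration states.

    Each demonstrated state is perturbed by zero-mean Gaussian noise whose
    per-dimension scale is eta times the standard deviation of the data;
    the original action is reused as the corrective label for the
    resulting "deviated" state, requiring no expert or model.
    """
    rng = rng or np.random.default_rng(0)
    sigma = eta * states.std(axis=0)       # noise scale tied to data spread
    aug_states, aug_actions = [states], [actions]
    for _ in range(copies):
        eps = rng.normal(0.0, sigma, size=states.shape)
        aug_states.append(states + eps)    # synthetic deviated states
        aug_actions.append(actions)        # same action as corrective label
    return np.concatenate(aug_states), np.concatenate(aug_actions)
```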

II-C Ensemble: Following the Expert Advice

We can reduce error accumulation at unseen states by choosing methods that match the action distribution more effectively. A neural network's optimization objective is limited to its training data and will not necessarily generalize well to unseen inputs [33]. In contrast, non-parametric methods generate test outputs by combining the training data; their predictions must come from the training data and are therefore constrained [34]. For example, a k-nearest-neighbor (k-NN) agent will not move the robot's joints beyond an interpolation of its training data.

We use k-NN in conjunction with behavior cloning (BC). Specifically, our agent follows the k-NN predicted action if the query state deviates from the training data (Fig. 3(c)).

By using the k-NN method, we force a known action onto a new, unseen state at test time, ensuring that the action distributions during training and testing match. For our manipulation task, the state and action both include the robot's end-effector pose; sending a known action is therefore equivalent to sending the agent to a known state, implicitly reducing the agent's deviation from the training data and thus reducing covariate shift. However, a nonparametric method's performance depends on its distance function, which can be difficult to design for high-dimensional data.

The distance function for non-parametric methods serves two purposes: (1) to evaluate the proximity of a query to the stored data points, and (2) to weight and combine the expert labels. Our key observation is that (1) requires only a rough estimate of distance to decide whether a query state is far from the training data, while (2) needs a carefully tuned distance function to assign weights to expert labels. Therefore, we propose a decision tree that invokes the k-NN agent only when the query state is far from its nearest neighbors and invokes the behavior cloning neural network otherwise. By invoking k-NN only when the agent is far away, we bypass the need to carefully design a distance function for it, favor BC's scalability with data when inside the training distribution, and rely on k-NN to correct the agent's deviation when outside it.
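A minimal sketch of this decision rule follows; `bc_policy` and `knn_policy` are assumed callables, and the threshold `alpha` (matching the appendix's switching condition) must be tuned for the task.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class Ensemble:
    """Switch between a BC network and a k-NN agent by data proximity.

    If the mean distance from the query state to its k nearest training
    states exceeds alpha, the query is treated as out-of-support and the
    k-NN action is used to pull the agent back; otherwise BC is trusted.
    """

    def __init__(self, train_states, bc_policy, knn_policy, k=5, alpha=1.0):
        self.index = NearestNeighbors(n_neighbors=k).fit(train_states)
        self.bc_policy = bc_policy
        self.knn_policy = knn_policy
        self.alpha = alpha

    def act(self, state):
        dists, _ = self.index.kneighbors(state.reshape(1, -1))
        if dists.mean() < self.alpha:     # inside the training support
            return self.bc_policy(state)
        return self.knn_policy(state)     # far from data: correct with k-NN
```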

III Experiments

(a) Robot platform.
(b) Example demonstration.
(c) Evaluation.
Figure 4: Experiment setup.

III-A Experimental Setup

We built a 6-DOF robot (Fig. 4(a)) equipped with a pair of chopsticks as its end effector in order to develop algorithms that control the chopsticks to pick up challenging objects: a cube with a 1 cm edge length, a ball with a 2 cm diameter, and another ball with a 1.4 cm diameter, as shown in Fig. 4(c). The kinematic model for our inexpensive hardware is not highly accurate since the robot is assembled from parts with joints that are not strictly rigid. Even with the best calibration, inaccuracies still accumulate along robot links and result in position errors ranging from 1 mm to 6 mm at the robot's end effector. This implies that the difference between the calculated chopstick tip position and its actual position is comparable to the radius of the small objects used in our experiments. For each object, we collected 500 trajectories from an expert teleoperating the robot to pick up the object (Fig. 4(b)). The data collection setup is the same as in our previous work [11].

Our agent had access to the tracked location of the objects and the robot's end-effector pose. We defined success as grasping the object using chopsticks, lifting it above the workstation, and holding it in the air for 1 s. We evaluated the performance of each method on each object by computing the success rate over 25 trials. During evaluation, we divided the square workstation plate into a 5×5 grid (Fig. 4(c)) and placed the object in the center of each grid cell to ensure effective coverage of the entire workspace. See Appendix V-A for more details.
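A minimal sketch of the placement protocol follows; the plate extents `x_range` and `y_range` are placeholders, since the true plate dimensions are not restated here.

```python
import numpy as np

def grid_placements(x_range, y_range, n=5):
    """Centers of an n-by-n grid over the square workstation plate.

    Dividing the plate into a 5x5 grid and placing the object at each
    cell center yields the 25 evaluation trials per method per object.
    """
    xs = np.linspace(*x_range, 2 * n + 1)[1::2]  # cell centers, not edges
    ys = np.linspace(*y_range, 2 * n + 1)[1::2]
    return [(x, y) for x in xs for y in ys]      # 25 placements
```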

III-B Experimental Procedure

We compared our methods from Section II against human demonstrations collected during teleoperation (Expert) and a replay of the successful demonstrations (Replay). Replay tests the repeatability of our hardware: we chose successful demonstrations, placed objects at exactly the locations used during data collection, and replayed the demonstrations to see whether the robot could pick up the objects.

We used two baselines. The first is a parametric method, BC+RobotC, a neural-network-based behavior cloning agent that uses the default robot-centric frame. The second is a non-parametric method, k-NN+RobotC, a k-nearest-neighbor agent that also uses the robot-centric frame.

We evaluated three methods as described in Section II: (1) using the object-centric frame to train the behavior cloning and k-nearest-neighbor agents, BC+ObjC and k-NN+ObjC, respectively; (2) injecting a small amount of Gaussian noise into the behavior cloning agent, BC+ObjC+Noise; and (3) combining the parametric method BC+ObjC+Noise and the non-parametric method k-NN+ObjC via a decision tree model, denoted Ensemble. Implementation details are given in Appendix V-B.

IV Results

Method           Cube   Ball (⌀20 mm)   Ball (⌀14 mm)   All
Expert           100    80              68              82.7
Replay           100    80              80              86.7
BC+RobotC        84     16              12              37.3
BC+ObjC          92     16              24              44.0
BC+ObjC+Noise    92     76              48              72.0
k-NN+RobotC      64     28              8               33.3
k-NN+ObjC        84     64              12              53.3
Ensemble         96     84              60              80.0
TABLE I: Percentage success rates, evaluated over 25 trials per object.
(a) k-NN+ObjC versus BC+ObjC.
(b) BC+ObjC versus BC+ObjC+Noise.
(c) Ensemble versus BC+ObjC+Noise.
Figure 5: Comparing state distributions as a proxy for covariate shift across trained agents. Each dot is a state in an agent's rollout. Green states lie farther from the demonstrations (purple), indicating that the corresponding agent suffers more covariate shift than the yellow agent.

IV-A Success Rates for Fine Manipulation

The experimental results are shown in Table I, with the best performer in each column highlighted. Our parametric baseline, BC+RobotC, and non-parametric baseline, k-NN+RobotC, had relatively low success rates, but the causes of their failures differ. BC+RobotC has difficulty picking up objects placed farther away from the robot: the agent tends to reach toward the wrong location after moving over a long distance to approach the object, highlighting covariate shift's impact. In contrast, the k-NN+RobotC agent's poses look more similar to the expert demonstrations, but its trajectories are not smooth and sometimes end abruptly on top of the object without picking it up. This occurs because k-NN does not guarantee a smooth policy function; even after careful tuning of the distance function, it was challenging to eliminate the jerky motions. k-NN's sudden stops stem from direct imitation of the training data: during demonstration, the human expert often slowed or even paused around the object, adjusting the approach pose before closing the chopsticks and lifting the object. The distance function we chose fails to select and mix the more relevant action labels, confirming k-NN's sensitivity to its distance function.

Transforming to the ObjC frame improved the success rates of k-NN and BC by 20% and 6.7%, respectively. k-NN becomes less likely to generate jerky motions or stop prematurely because it benefits from the increased data support. BC still suffers from covariate shift, but the agent is more likely to reach toward the object due to the denser data distribution near it.

Injecting noise into BC+ObjC during training increases its success rate by 28%. When items are close to the robot, the agent has an almost 100% success rate even on the most challenging item. For objects that are far away, the robot sometimes succeeds by reaching the correct location; at other times, it ends up merely rotating the chopsticks.

Using a decision tree to combine our best parametric method (BC+ObjC+Noise) and non-parametric method (k-NN+ObjC) yields the highest-performing agent, which achieves near-expert performance. At test time, if a state's distance to its nearest neighbors exceeds a threshold, the agent triggers the non-parametric method to bring the state back. We observe that almost all rollouts trigger the non-parametric method at least once. No matter how far away an object is placed, the Ensemble agent reaches it in a "standard" way similar to the pose demonstrated by the expert. Occasional failures occur when the agent misses the grasping point by a sub-millimeter error.

IV-B Covariate Shift Across Methods

To gauge the covariate shift of different agents, we visualize the distributions of their test states. We collect 25 rollouts from each agent, record the robot-visited states, and plot the state distribution after dimensionality reduction with Principal Component Analysis (PCA), as shown in Fig. 5. First, we observe that BC incurs more covariate shift than k-NN, i.e., the states visited by k-NN are closer to the demonstrated states, confirming that better matching of the action distribution leads to better matching of the state distribution. Second, injecting noise into BC results in less covariate shift than no noise, verifying that noise injection provides effective corrective labels. Third, the Ensemble model that combines BC with k-NN has less covariate shift than BC alone.
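A minimal sketch of one way to produce such a projection follows; fitting the PCA on the demonstration states alone is our assumption, one plausible setup rather than the paper's exact procedure.

```python
from sklearn.decomposition import PCA

def project_states(demo_states, rollout_states):
    """Project demonstration and rollout states into a shared 2D space.

    PCA is fit on the demonstrated states, and the agent's rollout states
    are projected onto the same two principal components, so the spread
    of rollout points away from the demonstration cloud serves as a
    visual proxy for covariate shift.
    """
    pca = PCA(n_components=2).fit(demo_states)
    return pca.transform(demo_states), pca.transform(rollout_states)
```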

IV-C Noise Injection: Validation through MuJoCo Environments

Figure 6: Performance comparison before and after noise injection. For each MuJoCo environment and each condition, we trained 5 agents using consecutive random seeds. The performance difference is statistically significant under paired t-tests.

We apply the noise injection method to MuJoCo simulated environments [35] to test its generality. We use demonstration data from [36], train 5 behavior cloning agents with consecutive random seeds as baselines, and train another 5 agents with noise injection for comparison. Figure 6 compares performance before and after noise injection. A paired t-test gives p < 0.05 for all environments, providing strong evidence that, on average, noise injection improves the imitation learner.
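A minimal sketch of the significance test, with per-seed returns paired by random seed; the numbers are purely illustrative, not results from the paper.

```python
from scipy.stats import ttest_rel

# Illustrative per-seed mean returns for 5 agents per condition,
# paired by consecutive random seed (placeholder values).
baseline_returns = [312.0, 298.5, 305.1, 290.7, 310.2]
noised_returns = [355.4, 341.9, 348.0, 338.2, 352.6]

t_stat, p_value = ttest_rel(noised_returns, baseline_returns)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")  # significant if p < 0.05
```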

Though the performance gains in some simulated environments are not as large as those for our real robot, we suspect the difference stems from the demonstration source: we use real human data for the robot experiment, versus synthetic expert demonstrations generated by a reinforcement learning (RL) agent for the MuJoCo tasks. Human experts are known to exhibit multi-modal behavior during demonstrations, whereas trained RL agents tend to be uni-modal in their responses [37]. Given that noise injection improves the success rate on our physical robot task by a considerable 28%, further inquiry is needed to determine whether noise injection is better suited to learning from multi-modal demonstration data.

V Discussion

We leave several topics for future work. During noise injection, for simplicity, we experimented only with independent multivariate Gaussian noise with a fixed covariance. It is worth exploring how to formalize the bounded noise and how different task domains may benefit from different noise shapes. For the ensemble model, future work could explore alternative ways to switch between the k-NN and BC agents, perhaps by learning a threshold condition from the data.

Our work depends critically on two assumptions. First, to increase data support by applying an invariant operator, we assume the existence of a critical region that demands more data support. Second, to reuse collected action labels and to leverage a nonparametric method for generating corrective labels, we assume that a more accurate match of the action distribution leads to a more accurate match of the state distribution. This assumption holds when part of the state and action representations are directly connected, e.g., the robot state contains its joint positions and the robot command accepts target joint positions. It does not hold, for example, if the robot is torque-controlled; in such cases, further exploration of how a learner can generate synthetic corrective labels is needed.

Nevertheless, our proposals do not assume access to a model or an interactive expert and are therefore more easily applicable to fine manipulation tasks. Compared to DAgger and DART, which collect corrective labels from experts, we can generate synthetic corrective labels for free. Because of the relatively lower cost of doing so, we generate labels for randomly sampled state distributions that cover the deviated state distribution without accurately matching it. Though our proposals focus on a non-interactive setting, they can directly transfer to an interactive one.

We chose model-free imitation learning because accurate models for fine manipulation are rare. However, it remains to be seen how to leverage an inaccurate model in imitation learning. This work is but a first step toward general-purpose autonomous fine manipulation with simple tools. We look forward to extending it by combining model-free and model-based methods to manipulate a more diverse set of hard-to-grasp small objects.

Appendix

V-A Experimental Setup

Robotic Testbed.

We use the end-effector (EE) pose to describe the robot's state: an 8D vector containing (1) the x-y-z position of the bottom chopstick tip, (2) a quaternion representing the rotation of the chopsticks, and (3) the opening angle of the last joint. We command the robot by sending a target end-effector pose at 100 Hz, using an inverse kinematics solver to translate it into joint positions and a PID controller running at 500 Hz to move each joint.
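As an illustration of this representation, a minimal sketch follows; the field names and packing order are ours, not taken from the paper's codebase.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EEPose:
    """The 8D end-effector pose used for both state and action."""
    position: np.ndarray      # (3,) x-y-z of the bottom chopstick tip
    quaternion: np.ndarray    # (4,) rotation of the chopsticks
    opening_angle: float      # opening angle of the last joint

    def to_vector(self) -> np.ndarray:
        """Pack the pose into the 8D vector consumed by the policy."""
        return np.concatenate(
            [self.position, self.quaternion, [self.opening_angle]]
        )
```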

Calibration Improved Performance.

The default model and controller for our hardware were not highly accurate; the average EE position error was 10 mm. After careful calibration, we reduced this error to 4 mm. Initially, even a well-tuned controller had low replay success rates for picking up the cube and the small ball (90% and 15%, respectively). We implemented a custom PID controller and gain tuning to reach 100% and 80%, respectively.

Demonstrations.

We collected demonstrations at 100 Hz to match the test scenario. Each trajectory contains an average of 600 (state, action) pairs. The state is an 11-D vector containing the robot's state and the object's x-y-z position; the action is the target end-effector pose. For each trajectory, we initialized the robot near a fixed home configuration and placed the object at a random location on the workstation. One expert user collected all trajectories to reduce multi-modal behavior that might interfere with learning (e.g., picking up the object using different strategies). We filtered out failed trajectories and kept only the 500 successful ones.

V-B Implementation Details

BC.

We trained a two-layer fully connected neural network of size 64×32 with ReLU activations. It outputs the 8D target end-effector pose. To compute the loss, we split the 8D pose into position, rotation, and opening angle and computed a loss for each component, using mean squared error or, for the rotation component, a rotation difference. We then summed the component losses into a single scalar via a weighted linear combination; the weights are tunable parameters.
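A minimal sketch of such a loss follows, assuming the layout (xyz, quaternion, opening angle) and illustrative weights; the quaternion distance 1 − |⟨q₁, q₂⟩| used here is one common choice for the "rotation difference," not necessarily the paper's exact formula.

```python
import torch

def bc_loss(pred, target, w_pos=1.0, w_rot=1.0, w_open=1.0):
    """Weighted combination of per-component losses over the 8D pose.

    Assumes columns 0:3 are position, 3:7 a quaternion, and 7 the
    opening angle. Position and opening angle use mean squared error;
    rotation uses a quaternion distance insensitive to the q/-q sign.
    """
    pos_loss = torch.mean((pred[:, :3] - target[:, :3]) ** 2)
    q_pred = pred[:, 3:7] / pred[:, 3:7].norm(dim=1, keepdim=True)
    q_tgt = target[:, 3:7] / target[:, 3:7].norm(dim=1, keepdim=True)
    rot_loss = torch.mean(1.0 - torch.abs((q_pred * q_tgt).sum(dim=1)))
    open_loss = torch.mean((pred[:, 7] - target[:, 7]) ** 2)
    return w_pos * pos_loss + w_rot * rot_loss + w_open * open_loss
```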

k-NN.

We used the last 3 end-effector poses and the current object location as input to the k-NN agent. Its distance function is similar to the BC loss function but uses a different set of weights.

Noise.

During training of BC agents, instead of optimizing $\sum_{i\in\text{batch}}[f(x_i)-a_i]$, where $x_i$ is the state and $a_i$ is the action, we sample 20% of the data in each batch and replace the state $x_i$ with $x'_i = x_i + \epsilon$, where $\epsilon \sim \mathcal{N}(0,\sigma)$. Here $\sigma$ is a diagonal matrix whose diagonal entries are $\eta\hat{\sigma}$: we choose a fixed noise magnitude $\eta$ for all dimensions of the state, and $\hat{\sigma}$ is the variance of each dimension of the state. Empirically, setting $\sigma = \eta$ also achieves comparable performance.
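A minimal sketch of this training-time variant follows; for simplicity it treats $\eta\hat{\sigma}$ as a per-dimension standard deviation, one reading of the scaling above.

```python
import torch

def noisy_batch(states, eta, data_std, frac=0.2):
    """Noise-inject a fraction of the states in a training batch.

    For `frac` (here 20%) of the (state, action) pairs in the batch,
    the state x_i is replaced by x_i + eps with eps drawn from a
    zero-mean Gaussian with diagonal, per-dimension scale eta * sigma_hat,
    while the paired action label is kept unchanged.
    """
    mask = torch.rand(states.shape[0]) < frac
    eps = torch.randn_like(states) * (eta * data_std)  # per-dimension scale
    return torch.where(mask.unsqueeze(1), states + eps, states)
```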

Ensemble.

Given that k-NN yields the nearest neighbors of a state and the corresponding distances, we set a threshold parameter $\alpha$ such that the agent follows BC iff $\sum_i d_i / k < \alpha$, where $d_i$ is the distance to the i-th closest neighbor. Further details are in [38].

Acknowledgement

Research reported in this publication was supported by the Eunice Kennedy Shriver National Institute Of Child Health & Human Development of the National Institutes of Health under Award Number F32HD101192. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This work was also (partially) funded by the National Science Foundation IIS (#2007011), National Science Foundation DMS (#1839371), the Office of Naval Research, US Army Research Laboratory CCDC, Amazon, and Honda Research Institute USA.

References

  • [1] “Gross and fine manipulation.” [Online]. Available: https://www.bls.gov/ors/factsheet/gross-and-fine-manipulation.htm#
  • [2] C. M. R. Marohn and C. E. J. Hanly, “Twenty-first century surgery using twenty-first century technology: Surgical robotics,” Current Surgery, vol. 61, no. 5, pp. 466–473, 2004.
  • [3] D. Kent, C. Saldanha, and S. Chernova, “A comparison of remote robot teleoperation interfaces for general object manipulation,” in Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, 2017, pp. 371–379.
  • [4] H. Sakurai, T. Kanno, and K. Kawashima, “Thin-diameter chopsticks robot for laparoscopic surgery,” in International Conference on Robotics and Automation, 2016, pp. 4122–4127.
  • [5] R. A. Joseph, A. C. Goh, S. P. Cuevas, M. A. Donovan, M. G. Kauffman, N. A. Salas, B. Miles, B. L. Bass, and B. J. Dunkin, “Chopstick surgery: a novel technique improves surgeon performance and eliminates arm collision in robotic single-incision laparoscopic surgery,” Surgical Endoscopy, vol. 24, no. 6, pp. 1331–1335, 2010.
  • [6] M. Ragupathi, D. I. Ramos-Valadez, R. Pedraza, and E. M. Haas, “Robotic-assisted single-incision laparoscopic partial cecectomy,” The International Journal of Medical Robotics and Computer Assisted Surgery, vol. 6, no. 3, pp. 362–367, 2010.
  • [7] A. A. Ramadan, T. Takubo, Y. Mae, K. Oohara, and T. Arai, “Developmental process of a chopstick-like hybrid-structure two-fingered micromanipulator hand for 3-d manipulation of microscopic objects,” IEEE Transactions on Industrial Electronics, vol. 56, no. 4, pp. 1121–1135, 2009.
  • [8] B.-C. Chang, B.-S. Huang, C.-K. Chen, and S.-J. Wang, “The pincer chopsticks: The investigation of a new utensil in pinching function,” Applied Ergonomics, vol. 38, no. 3, pp. 385–390, 2007.
  • [9] A. Yamazaki and R. Masuda, “Autonomous foods handling by chopsticks for meal assistant robot,” in ROBOTIK 2012; 7th German Conference on Robotics.   VDE, 2012, pp. 1–6.
  • [10] M. T. Mason, A. Rodriguez, S. S. Srinivasa, and A. S. Vazquez, “Autonomous manipulation with a general-purpose simple hand,” The International Journal of Robotics Research, vol. 31, no. 5, pp. 688–703, 2012.
  • [11] L. Ke, A. Kamat, J. Wang, T. Bhattacharjee, C. Mavrogiannis, and S. S. Srinivasa, “Telemanipulation with chopsticks: Analyzing human factors in user demonstrations,” in International Conference on Intelligent Robots and Systems.   IEEE, 2020.
  • [12] A. Billard and D. Grollman, “Imitation learning (of robots),” Springer, Tech. Rep., 2011.
  • [13] K. Mülling, J. Kober, O. Kroemer, and J. Peters, “Learning to select and generalize striking movements in robot table tennis,” The International Journal of Robotics Research, vol. 32, no. 3, pp. 263–279, 2013.
  • [14] HEBI Robotics, "HEBI Robotics X-series actuator datasheet."
  • [15] M. R. Cutkosky, Robotic grasping and fine manipulation.   Springer Science & Business Media, 2012, vol. 6.
  • [16] A. Billard and D. Kragic, “Trends and challenges in robot manipulation,” Science, vol. 364, no. 6446, 2019.
  • [17] J. Ho, J. Gupta, and S. Ermon, “Model-free imitation learning with policy optimization,” in International Conference on Machine Learning, 2016, pp. 2760–2769.
  • [18] P. Pastor, H. Hoffmann, T. Asfour, and S. Schaal, “Learning and generalization of motor skills by learning from demonstration,” in International Conference on Robotics and Automation.   IEEE, 2009, pp. 763–768.
  • [19] J. S. Dyrstad, E. R. Øye, A. Stahl, and J. R. Mathiassen, “Teaching a robot to grasp real fish by imitation learning from a human supervisor in virtual reality,” in International Conference on Intelligent Robots and Systems.   IEEE, 2018, pp. 7185–7192.
  • [20] T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg, and P. Abbeel, “Deep imitation learning for complex manipulation tasks from virtual reality teleoperation,” in International Conference on Robotics and Automation.   IEEE, 2018, pp. 1–8.
  • [21] D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neural network,” in Advances in Neural Information Processing Systems, 1989, pp. 305–313.
  • [22] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters, “An algorithmic perspective on imitation learning,” arXiv preprint arXiv:1811.06711, 2018.
  • [23] J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Dataset Shift in Machine Learning.   The MIT Press, 2009.
  • [24] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, 2011, pp. 627–635.
  • [25] M. Laskey, J. Lee, R. Fox, A. Dragan, and K. Goldberg, “Dart: Noise injection for robust imitation learning,” arXiv preprint arXiv:1703.09327, 2017.
  • [26] M. T. Mason, S. S. Srinivasa, and A. S. Vazquez, “Generality and simple hands,” in Robotics Research.   Springer, 2011, pp. 345–361.
  • [27] A. Venkatraman, M. Hebert, and J. A. Bagnell, “Improving multi-step prediction of learned time series models,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • [28] C. M. Bishop, “Training with noise is equivalent to tikhonov regularization,” Neural Computation, vol. 7, no. 1, pp. 108–116, 1995.
  • [29] J. Sietsma and R. J. Dow, “Creating artificial neural networks that generalize,” Neural Networks, vol. 4, no. 1, pp. 67–79, 1991.
  • [30] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of Big Data, vol. 6, no. 1, p. 60, 2019.
  • [31] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning.   MIT press Cambridge, 2016, vol. 1.
  • [32] B. Poole, J. Sohl-Dickstein, and S. Ganguli, “Analyzing noise in autoencoders and deep networks,” arXiv preprint arXiv:1406.1831, 2014.
  • [33] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd ed.   USA: Prentice Hall Press, 2009.
  • [34] N. S. Altman, “An introduction to kernel and nearest-neighbor nonparametric regression,” The American Statistician, vol. 46, no. 3, pp. 175–185, 1992.
  • [35] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2012, pp. 5026–5033.
  • [36] "University of California Berkeley CS 285: Deep reinforcement learning," http://rail.eecs.berkeley.edu/deeprlcourse/, accessed: 2020-10-27.
  • [37] L. Ke, S. Choudhury, M. Barnes, W. Sun, G. Lee, and S. Srinivasa, "Imitation learning as f-divergence minimization," in Proceedings of the Workshop on Algorithmic Foundations of Robotics, 2020.
  • [38] “Additional supplementary details (google drive).” [Online]. Available: https://drive.google.com/file/d/1PsryvqkxB9bNuRzgoqYIvaIm0rLt0sd7/view?usp=sharing