Human-Agent Joint Learning for Efficient Robot Manipulation Skill Acquisition

Shengcheng Luo^1*, Quanquan Peng^1*, Jun Lv^1*, Kaiwen Hong²,
Katherine Rose Driggs-Campbell², Cewu Lu¹, Yong-Lu Li¹ * denotes equal contribution ¹Shanghai Jiao Tong University, ²University of Illinois Urbana-Champaign.

Abstract

Employing a teleoperation system for gathering demonstrations offers the potential for more efficient learning of robot manipulation. However, teleoperating a robot arm equipped with a dexterous hand or gripper, via a teleoperation system presents inherent challenges due to the task’s high dimensionality, complexity of motion, and differences between physiological structures. In this study, we introduce a novel system for joint learning between human operators and robots, that enables human operators to share control of a robot end-effector with a learned assistive agent, simplifies the data collection process, and facilitates simultaneous human demonstration collection and robot manipulation training. As data accumulates, the assistive agent gradually learns. Consequently, less human effort and attention are required, enhancing the efficiency of the data collection process. It also allows the human operator to adjust the control ratio to achieve a trade-off between manual and automated control. We conducted experiments in both simulated environments and physical real-world settings. Through user studies and quantitative evaluations, it is evident that the proposed system could enhance data collection efficiency and reduce the need for human adaptation while ensuring the collected data is of sufficient quality for downstream tasks. For more details, please refer to our webpage https://norweig1an.github.io/HAJL.github.io/.

{strip}

Figure 1: Human-agent joint learning overview. Traditional frameworks typically separate human and agent training, requiring operators first to learn the task environment before data collection. This often leads to inefficiencies due to delayed and insufficient data gathering. In our framework, we integrate human and agent training from the start in a joint learning model. This enables simultaneous development and adapts the agents to human operation more effectively, enhancing overall efficiency and promoting better collaboration between humans and machines allowing for human effortless adaptation data collection.

I INTRODUCTION

In recent years, significant progress has been made on learning robot manipulation policies from demonstrations. Previous studies have utilized teleoperation systems [1, 2, 3, 4, 5, 6] to collect human demonstrations, and learning-based policies [7, 8, 9] have been formulated using the gathered data. Despite the notable advancements, several challenges still need to be addressed. For example, in vision-based teleoperation systems, even with state-of-the-art 3D hand pose estimation algorithms [10, 11, 12, 13], errors persist that significantly degrade the teleoperation. Additionally, discrepancies between the structures of human hands and robot end-effectors, along with the lack of haptic feedback during contact-rich manipulation, also pose challenges. As a result, current teleoperation systems demand substantial human effort, in addition, collecting high-quality datasets remains a challenging and labor-intensive task in many scenarios.

Naturally, a question was raised: in data-collection, how to make human effort less while improving the data quality? Here, we aim to address this question and argue that human-agent joint learning can help. That said, an effective and efficient teleoperation system should be designed to preferentially capture the operator’s intentions for directing a robot end effector and pose the main frame, while concurrently enabling an autonomous agent to help us ensure motion stability and interpolate the details. To this end, we propose a framework that achieves shared control between the human and a learned assistive agent during data collection. As shown in Fig. 1, our human-agent joint learning framework seeks to integrate the data collection and policy learning to enhance the efficiency of the whole process, reducing human effort, and improving the data quality.

Given our human-agent joint learning approach, we allow the data acquisition agent to grow and learn along with the human operator. Inspired by shared autonomy [14, 15, 16], we introduce a novel teleoperation system that enables collaboration between humans and learning-based agents to control a robot jointly during the data collection and learning process. In particular, our proposed system provides the flexibility to adjust a “control ratio” between the human operator and a learning-based agent. A lower control ratio, in the beginning, emphasizes the human’s role in teaching the agent finer-grained knowledge under the structure of human intention and principal actions. As the agent’s learning improves, a higher ratio indicates greater autonomy from the learned agent to replace the human effort to “inpaint” the whole process given only human intention and principal actions.

With the proposed system, the human effort will be reduced due to the shared control during data collection. Additionally, the agent learning process is integrated with the data collection, improving the efficiency of the whole process. In addition, the quality of the collected data is also improved, benefiting different kinds of downstream tasks. We conducted experiments in six different simulation environments using two types of end-effectors: a dexterous hand and a gripper. Additionally, we performed experiments on three real-world tasks to validate our findings. Evaluation results indicate that our proposed system significantly enhances data collection efficiency, increasing the collection success rate by 30% and nearly doubling the collection speed. Additionally, data collected in shared autonomy mode is as effective for downstream tasks and models as data collected directly from the teleoperation system, demonstrating comparable validity. Our main contributions are summarized as follows:

•

We study how to reduce human adaptation while keeping data quality in teleoperation data collection and propose a human-agent joint learning paradigm.
•

We build a system that fosters concurrent development between the human operator and assistive agent, which not only streamlines the learning process but also expedites the robot’s ability to perform robot manipulations autonomously.
•

Conducting both simulation and real-world experiments to demonstrate the efficiency and effectiveness of our proposed system. Our system achieved significant performance improvements, including a 30% increase in data collection success rate and double the collection speed.

II RELATED WORKS

Teleoperation for Data Collection. Data has always been a crucial foundation, and robots are no exception. Teleoperation serves as a significant source for collecting robot data [7, 17, 18, 19, 20, 21, 22]. Some works achieve teleoperation through wearable devices [1, 2, 3, 4, 23], and vision-based teleoperation systems offer a low-cost and easily developed alternative [5, 6, 24, 25]. For instance, [25] utilizes neural networks for markerless vision-based teleoperation of dexterous robotic hands from depth images. [5] set up a vision-based teleoperation system to control the Allegro Hand, accomplishing various contact-rich manipulation tasks in the real world. Recently, [6] introduced AnyTeleop, a unified teleoperation system designed to accommodate various arms, hands, realities, and camera setups within a singular framework. In this paper, we introduce a joint learning paradigm to assist teleoperation by sharing control between the human operator and a learning-based agent, improving the efficiency of data collection using teleoperation.

Interactive robot learning. Collecting fine-grained human demonstration data for robotic manipulation is an effective but labor-intensive and time-consuming way to enable robots to complete a wide range of tasks [26, 27]. Previous work uses shared autonomy to assist people with disability in performing tasks by arbitrating human inputs and robot actions [28]. Many of the shared autonomy algorithms aim to estimate human intents from a set of pre-defined goals [29, 30, 31, 32], using clothoid curves to parametrize the state and control [33] or by mapping low-dimension control input to high-dimension robot actions [28, 34]. In this work, we introduce a system that integrates the agent’s learning process with data collection, facilitating both data collection and robot learning.

III PROPOSED METHOD

The primary contribution of this work is the development of a novel and highly efficient data collection method. To achieve this, the system is designed in two key stages: first, the proposed system allows human operators to control the robot via a teleoperation system to gather an initial but insufficient training dataset, which serves as the foundation for the second stage (Sec. III-B). Second, using these data, we train a diffusion-model-based assistive agent (Sec. III-C) to establish shared control between the human operator and the agent, thereby improving the efficiency of the data collection process (Sec. III-D). This approach mirrors the concept of “bootstrapping” [35], where, as more data is accumulated, the system progressively reduces the effort required from human operators, facilitating further data collection and iterative system improvement. Additionally, once sufficient data has been gathered, the system offers the option to transition the shared control agent to full autonomy.

III-A Preliminary

To get a learned agent in Sec. III-C, enabling human-agent joint learning, we follow the Denoising Diffusion Probabilistic Model (DDPM) [36] training paradigm. Here we first briefly introduce the DDPM algorithm. The forward process of the Diffusion Model can be regarded as adding Gaussian noise to the data $x^{0}$ according to a variance schedule $\beta_{1:K}$ by

x_{k}=\sqrt{\alpha_{k}}x_{k-1}+\sqrt{1-\alpha_{k}}\epsilon,

(1)

where $\epsilon\sim\mathcal{N}(\mathbf{0,I}),\alpha_{k}=1-\beta_{k}$ . DDPM models the output generation as a denoising process (Stochastic Langevin Dynamics). A line of works [37, 38, 8, 39] use diffusion model to generate the action for agents: given $x^{K}$ sampled from Gaussian noise $\mathcal{N}(\mathbf{0,I})$ , it utilizes a parameterized diffusion process to model how $x^{K}$ is denoised in order to get noise-free action $x^{0}$ by

p_{\theta}(x^{0})=\int p(x^{K})\prod_{k=1}^{K}p_{\theta}(x^{k-1}|x^{k})\mathrm{d}x^{1:K},

(2)

where $p_{\theta}(x^{k-1}|x^{k})=\mathcal{N}(\mu_{\theta}(x^{k},k),\Sigma(x^{k},k))$ is usually referred as reverse process. [40] shows that $p_{\theta}(x^{t-1}|x^{k})$ becomes tractable when conditioned on $x_{0}$ and Eq. 2 can be reformulated as minimizing the error in the noise prediction. [36] simplify the training loss function as

\mathcal{L}:=\mathbb{E}_{k,\boldsymbol{x}_{0},\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\boldsymbol{I})}\left[\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}(\boldsymbol{x}_{k}(\boldsymbol{x}_{0},\boldsymbol{\epsilon}),k)\|_{2}^{2}\right],

(3)

where step $k$ is sampled uniformly as $k\in[1,K]$ , $\boldsymbol{\epsilon}_{\theta}$ is the noise prediction model. During the inference phase, we can generate $x_{0}$ by recursively sample $\boldsymbol{z}\sim\mathcal{N}(\mathbf{0},\boldsymbol{I})$ :

x_{k-1}=\mu_{\theta}(x_{k},k)+\sigma_{k}\boldsymbol{z}.

(4)

Similar to [8, 41], with the collected trajectory $\{(s_{i},a_{i})\}_{i=0}^{T}$ , we aim to train an agent to imitate the trajectory, accomplishing a specific task $\mathcal{T}$ . Therefore, we utilize DDPM to capture the conditional distribution of $p(a|s)$ and the training loss in Eq. 3 shall be modified as

\mathcal{L}:=\mathbb{E}_{k,(s_{i},a_{i}),\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\boldsymbol{I})}\left[\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}({a}_{i}+\boldsymbol{\epsilon},s_{i},k)\|_{2}^{2}\right].

(5)

Figure 7: Simulation tasks overview. Here are six task settings and their task flow for Pick-and-Place (left), Articulated-Manipulation (middle), Gripper-Push (upper-right) and Dexterous-Tool-Use (bottom-right).

III-B Teleoperation System.

Our system initially captures the raw sensory signal $\mathcal{I}$ . Human hand pose $\mathcal{P}^{h}$ can be obtained from the captured signal using off-the-shelf 3D hand pose estimation [10, 11, 13]. The pose $\mathcal{P}^{h}$ consists of the positions of the human hand’s keypoints. Then, employing an inverse kinematic function $f_{\textit{IK}}$ , we compute the action of the robot $a\in\mathbb{R}^{m}$ , such that $a=f_{\textit{IK}}(\mathcal{P}^{h}_{t},\mathcal{P}^{h}_{t+1})$ , where it is calculated upon the change in the hand pose. Given this teleoperation system, the human operator will move the hand to produce a sequence of hand poses $\{\mathcal{P}^{h}_{i}\}_{i=0}^{T}$ to teleoperate the robot with an action sequence $\{a_{i}\}_{i=0}^{T}$ to achieve the task $\mathcal{T}$ . The human collected demonstration trajectory $\{(s_{i},a_{i})\}_{i=0}^{T}$ , where $s\in\mathbb{R}^{n}$ is the robot state, could be used for downstream tasks.

III-C Diffusion-Model-Based Assistive Agent.

After collecting data via teleoperation mentioned in Sec. III-B, we train a diffusion-model-based assistive agent to learn how to assist humans in collecting data in a shared control manner.

At an abstract level, the diffusion-model-based assist agent, noted as $f(\cdot|\cdot)$ , is provided with the state $s$ , denoising step number $k$ , and a noise action $a^{k}$ , which could be an imperfect action gathered from the teleoperation system or sampled from a Gaussian distribution, to predict the desired action

a=f(a^{k}|s,k).

(6)

Refer to caption — Figure 8: Diffusion based shared control. To achieve shared control between the human and agent, we blend the action from the human operator $a^{h}$ using the forward and reverse process. The parameter $\gamma$ governs the control ratio, where a lower $\gamma$ results in the action better aligning with the human operator’s intention. In contrast, a higher $\gamma$ allows the learned agent to exert more influence over the blended action.

During data collection, the proposed system offers the option to control the robot in a shared control mode rather than directly applying the collected action $a^{h}$ from the teleoperation system. This leads to a reduced human workload during the data collection process. The classical shared autonomy method is achieved through the equation [29]:

a^{s}=\gamma a^{h}+(1-\gamma)a^{r},

(7)

where $a^{r}$ is generated by the learned agent. However, considering that the agent operates as a diffusion policy (Fig. 8), we blend the action from the human with the forward and reverse processes. Given action $a^{h}$ , a forward process diffuses the action as follows: $a^{k}=a^{h}+\epsilon^{k}.$ Subsequently, a reverse process denoises the action $a^{k}$ :

a^{s}=f(a^{k}|s,k).

(8)

By applying action $a$ , the control of the robot is shared between the human and the diffusion-model-based assistive agent. We can adjust the control ratio $\gamma=k/K$ between the human operator and the diffusion-model-based assistive agent by varying $k$ . When $\gamma=0$ , the action $a^{s}$ represents the teleoperation action $a^{h}$ , which is the dexterous robot directly controlled by a human operator. As $\gamma$ approaches $1.0$ , the action $a^{s}$ transitions to full autonomy $a^{r}$ . A higher $\gamma$ value indicates a higher level of autonomy, allowing the learning-based agent more control rights to stabilize and direct the dexterous hand.

Algorithm 1 Overall Process

1:The human operator

\mathcal{H}

;

2:The collected dataset

\mathcal{D}

; assistive agent

f

; control ratio

\gamma

;

3:Initialization:

\mathcal{D}\leftarrow\emptyset,\gamma\leftarrow 0

;

4:while

|\mathcal{D}|

is small do

\triangleright

not enough data is collected

\mathcal{H}

collects data

d

under

f

’s help;

\triangleright

see III-D2 for control ratio adjustment

6: if

d

is valid then

\mathcal{D}\leftarrow\{d\}\cup\mathcal{D}

;

8: end if

9: Finetune

f

with

\mathcal{D}

;

10:end while

11:return

\mathcal{D}

and

f

;

III-D Integrating Data Collection and Manipulation Learning.

In this section, we show how to integrate data collection and manipulation learning into a unified framework that progressively reduces human effort and enhances robot autonomy.

III-D1 Detailed Algorithm Explanation

We outline the overall process in Algo. 1. The assistive agent is trained in three steps as follows:

Step 1. Initially, we collect a dataset for pre-training agent $f$ under full manual control by human operators, i.e., with the control ratio $\gamma=0$ .

Step 2. Given the initial dataset, we train a relatively low performance assistive agent to aid in further data collection. The training process has been formulated in Eq. 5 and Eq. 6, where a neural network $\epsilon_{\theta}$ is trained to predict noise $\epsilon$ out of the noisy action $a^{k}$ .

Step 3. The trained agent assists in a second data collection round, aiming for higher efficiency and success. We then refine the agent using data from both rounds to enhance its performance. This cycle repeats until the agent achieves full autonomy and the required data volume is collected.

III-D2 Control Ratio Adjustment

For each data collection, we offer users two options to adjust the control ratio $\gamma$ : (1) Users can empirically adjust $\gamma$ based on their needs. (2) Alternatively, set $\gamma=\frac{1}{2}(1+\cos\theta)$ , where $\theta$ is obtained by calculating the dot product of the previous timestep’s human action $a^{h}$ and shared action $a^{s}$ . This assesses alignment, increasing $\gamma$ for positive alignment to enhance agent control, and reducing it for misalignment to increase user control.

After obtaining the control ratio $\gamma$ , we calculate the shared action $a^{s}$ , using the human operator’s action $a^{h}$ as input, as defined in Eq. 8 (shown in Fig. 8).

IV EXPERIMENTS

In this section, we introduce the settings for both real world tasks and simulation tasks, along with the experimental results and data validation. Due to page limitations, we have included some detailed information such as training details and ablation study results on our webpage.

IV-A Tasks.

We adopt six multi-stage manipulation tasks (Fig. 7). Pick-and-Place aims at picking an object on the table and placing it into a container. Articulated-Manipulation’s objective for the dexterous hand is to grasp and unscrew a door handle to open it, while for the gripper, it is to grab a drawer handle and pull the drawer open. Push-cube requires the robot to push the cube to the target position. Tool-Use aims at picking a hammer and using it to drive a nail into a board.

		Pick-and-Place			Door-Open			Tool-Use
		Success	Horizon	Collection	Success	Horizon	Collection	Success	Horizon	Collection
		Rate $\uparrow$	Length $\downarrow$	Speed $\uparrow$	Rate $\uparrow$	Length $\downarrow$	Speed $\uparrow$	Rate $\uparrow$	Length $\downarrow$	Speed $\uparrow$
Group1	w/ Ours	86.96	219.01	320	87.11	142.29	460	66.50	232.17	200
	w/o Ours	51.53	378.49	176	62.49	258.27	252	42.38	487.95	129
Group2	w/ Ours	94.06	214.16	324	80.29	134.16	424	55.55	275.71	172
	w/o Ours	45.42	471.48	120	53.45	317.21	176	34.47	511.03	124

TABLE I: User studies on three dexterous hand tasks.

Figure 11: Shared control process overview. The white one is the hand controlled purely by the human operator, while the cyan one is under shared control between the human and the assistive agent.

IV-B Efficiency of Data Collection.

Our proposed system leverages shared control between human operators and learned agents to enhance the efficiency of data collection. To learn how the assistant agent could improve the data collection process, we conducted a user study.

In the user study, 10 human operators participate, collecting data under two modes: one where control is shared between the operator and the learned agent (w/ Ours), and the other where control is directly by the operator alone (w/o Ours). Each participant is instructed to collect as much data as possible within three minutes under two different modes for three dexterous hand tasks. Three metrics are evaluated: Success Rate (Percent) indicates the percentage of attempts where data collection was successful. Horizon Length (Steps per Sample) measures the length of each collected trajectory, with a lower horizon length indicating smoother data collection. Collection Speed (Samples per Hour) refers to the number of successful trajectories that can be collected in one hour.

In Tab. I, by sharing control between humans and learned agents, our system shows improvements in both success rate and collection speed, while the average horizon length of the collected trajectories is reduced. This suggests that our system enhances the efficiency of data collection by facilitating a process that is easier to succeed, faster, and more fluid in terms of trajectory smoothness. To ensure the fairness of the experiment wasn’t compromised, we equally divided the user group into two parts, Group 1 first collected data directly by themselves (w/o Ours) and then collected data with an assistive agent (w/ Ours), while the Group 2 reversed the order, first (w/ Ours) mode and then (w/o Ours) mode.

IV-C Quantitative Evaluation.

To gain deeper insight into how the learned agent assists the human operator, we visualize several keyframes from the data collection process of three dexterous hand tasks. From Fig. 11, it is evident that human operators are not required to provide too precise control with the assistive agent facilitating shared control over the dexterous hand. Instead, they only need to convey high-level intentions, such as the direction of hand movement or finger grasp motions. In multi-stage tasks, like picking up a hammer and then using it to drive a nail, operators only need to provide a trigger action to guide the agent to transition from one sub-stage to the next. As a result, less effort and attention are required, making the data collection easier to execute successfully and speeding it up.

When the learned agent shares control with users, the system effectively corrects imperfect human control signals to accomplish specific tasks. Given the challenge of directly measuring the level of imperfection in user signals and the correction ability of our system, we simulate human input using a baseline agent trained with Behavior Cloning (BC) as a proxy for user control.

In Fig. LABEL:fig:simulateduser, additional data collected under our framework effectively contributes to training improvements. The graph illustrates that with limited data availability, the agent can assist the simulated operator more effectively. As the agent gains access to and trains on more data, its ability to correct actions improves. These results indicate that our system gradually reduces the demand for the operator’s attention and effort, thereby enhancing the overall efficiency of data collection process.

Furthermore, once sufficient data is collected and the assistive agent is trained, it can transition into full autonomy mode by setting $\gamma=1$ and denoising actions from the noise sampled from Gaussian distributions. Across three different dexterous manipulation tasks, we can achieve success rates of 0.76, 0.78, and 0.89, indicating that the assistive agent can effectively transform into an automated dexterous manipulation agent.

From our experiments, we have observed that the assistive agent significantly aids human operators in managing fine control, especially in scenarios where accurate observation by humans is challenging. For instance, tasks such as grasping an egg or moving a hammer present visual challenges. It can be difficult to visually confirm whether the egg is securely grasped or if there’s a risk of it being dropped. This uncertainty makes it hard for human operators to react promptly to sudden changes. However, within our proposed joint learning framework, human operators are primarily required to focus on high-level intentions and task planning during data collection, while the assistive agent manages the detailed low-level actions. This division of labor significantly reduces the burden on human operators by clearly separating strategic planning from execution tasks, streamlining the collaboration between humans and machines.

Dexterous	Pick-and-Place		Articulated-Manipulation		Tool-Use
Hand	40 $\mathcal{H}$	10 $\mathcal{H}$ + 30 $\mathcal{S}$	40 $\mathcal{H}$	10 $\mathcal{H}$ + 30 $\mathcal{S}$	40 $\mathcal{H}$	10 $\mathcal{H}$ + 30 $\mathcal{S}$
BC	0.30	0.50	0.22	0.57	0.39	0.40
BC-RNN	0.54	0.67	0.47	0.50	0.27	0.25
DP	0.73	0.76	0.77	0.78	0.88	0.89
Parallel	Pick-and-Place		Articulated-Manipulation		Push-cube
Gripper	40 $\mathcal{H}$	10 $\mathcal{H}$ + 30 $\mathcal{S}$	40 $\mathcal{H}$	10 $\mathcal{H}$ + 30 $\mathcal{S}$	40 $\mathcal{H}$	10 $\mathcal{H}$ + 30 $\mathcal{S}$
BC	0.42	0.44	0.35	0.37	0.88	0.85
BC-RNN	0.39	0.36	0.71	0.73	0.59	0.67
DP	0.51	0.60	0.42	0.67	0.83	0.82

TABLE II: Data quality on downstream tasks.

	Dexterous Tool-Use		Gripper Push-cube
	BC	DP	BC	DP
10 $\mathcal{H}$	0.29	0.45	0.23	0.42
10 $\mathcal{H}$ + 10 $\mathcal{H}$	0.28	0.67	0.37	0.78
10 $\mathcal{H}$ + 20 $\mathcal{H}$	0.28	0.82	0.51	0.67
10 $\mathcal{H}$ + 30 $\mathcal{H}$	0.39	0.88	0.88	0.83
10 $\mathcal{H}$ + 10 $\mathcal{S}$	0.31	0.71	0.33	0.81
10 $\mathcal{H}$ + 20 $\mathcal{S}$	0.30	0.79	0.61	0.62
10 $\mathcal{H}$ + 30 $\mathcal{S}$	0.40	0.89	0.85	0.82

TABLE III: Agent performance under increasing data.

IV-D Data Quality on Downstream Task.

In this section, we illustrate that collecting data under shared control does not compromise the quality of the data. We gather dexterous hand and gripper manipulation demonstrations via the proposed system in two modes: fully controlling the robots by a human ( $\mathcal{H}$ ) and sharing control ( $\mathcal{S}$ ) between the human operator and the learned assistive agent. And utilize these data to train different kinds of agents, like BC, BC-RNN [7], and Diffusion Policy (DP) [8].

In Tab. II, compared to directly collecting human demonstrations from the expert human operator, who can achieve success rates and efficiency comparable to those with agent assistance, the data collected by sharing control between the human and the assistive agent can achieve comparable or even surprisingly better results with BC and BC-RNN. Their results are comparable with DP, possibly as DP can better fit the tasks, which is in line with [8].

In Tab. III, we compare the effects of using different sets of data to train BC and DP. We can find that utilizing more data collected under the shared control mode leads to comparable performance on the tool-use and push-cube tasks. This verifies that the new data contributes significantly to policy learning and can achieve a similar effect compared to the data from human experts but at a much lower cost. These results indicate that the data collected under the proposed paradigm have sufficient quality and efficiency for downstream tasks.

IV-E Real World Experiment and User Feedback.

To better evaluate our system, we further conduct real-world experiments. Three tasks are adopted: Pick-and-Place, Articulated-Manipulation, and Push-cube in Fig. 14. Following the same rules as Sec. IV-B, four human volunteers are invited to participate in the user study to collect data under two modes: one where control is shared between the human operator and the learned agent (w/ Ours), and the other where control is directly by the human operator alone (w/o Ours). Our proposed system achieves significant improvements in success rate and collection speed by sharing control between human operators and learned agents, as demonstrated in Tab. IV. Additionally, data gathered under our proposed joint learning shared control mode yield performance on the three tasks that are comparable to those pure human datasets using BC and DP, further substantiated by the results presented in Tab. V.

	Success Rate $\uparrow$	Horizon Length $\downarrow$	Collection Speed $\uparrow$
w/ Ours	0.79	18.72	151
w/o Ours	0.70	21.54	121

TABLE IV: Real world gripper Pick-and-Place task user study.

	Pick-and-Place		Articulated-Manipulation		Push-cube
	40 $\mathcal{H}$	20 $\mathcal{H}$ + 20 $\mathcal{S}$	30 $\mathcal{H}$	10 $\mathcal{H}$ + 20 $\mathcal{S}$	20 $\mathcal{H}$	10 $\mathcal{H}$ + 10 $\mathcal{S}$
BC	13 / 20	14 / 20	18 / 20	19 / 20	15 / 20	15 / 20
DP	11 / 20	12 / 20	16 / 20	12 / 20	15 / 20	13 / 20

TABLE V: Real world gripper experiments of data quality.

Satisfaction: $\alpha=0.769$
1. It is fun to use.
2. It works the way I want it to work.
3. It is wonderful.
4. It helps me be more effective.
5. It is flexible.
User-Friendly: $\alpha=0.852$
6. It is simple to use.
7. It is effortless.
8. I can use it without written instructions.
9. I do not notice any inconsistencies as I use it.

TABLE VI: Subjective Measures.

We have developed a questionnaire comprising shown in Tab. VI to capture various dimensions of user experience and ergonomics, and we invited 10 volunteers to rate our system based on their feedback.

This questionnaire assesses ease of use and overall satisfaction. The reliability of our questionnaire is supported by strong Cronbach’s alpha values: $\alpha=0.769$ for the satisfaction section and $\alpha=0.852$ for the user-friendly section, indicating internal consistency.

V CONCLUSION

In this paper, we introduce a novel human-agent joint learning paradigm that enables simultaneous human demonstration collection and robot manipulation teaching. This approach allows the human operator to share control with a diffusion-model-based assistive agent within a vision-based teleoperation system to control multiple robot end-effectors such as grippers and dexterous hands. Given our paradigm, the human operator can reduce the effort spent on data collection and adjust the control ratio between the human and agent based on different scenarios. Our system offers a more efficient and flexible solution for data collection and robot manipulation learning via teleoperation.

References

Arunachalam et al. [2023] S. P. Arunachalam, I. Güzey, S. Chintala, and L. Pinto, “Holo-dex: Teaching dexterity with immersive mixed reality,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 5962–5969.
Gharaybeh et al. [2019] Z. Gharaybeh, H. Chizeck, and A. Stewart, Telerobotic control in virtual reality. IEEE, 2019.
Liu et al. [2017] H. Liu, X. Xie, M. Millar, M. Edmonds, F. Gao, Y. Zhu, V. J. Santos, B. Rothrock, and S.-C. Zhu, “A glove-based system for studying hand-object manipulation via joint pose and force sensing,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 6617–6624.
Liu et al. [2019] H. Liu, Z. Zhang, X. Xie, Y. Zhu, Y. Liu, Y. Wang, and S.-C. Zhu, “High-fidelity grasping in virtual reality using a glove-based system,” in 2019 international conference on robotics and automation (ICRA). IEEE, 2019, pp. 5180–5186.
Handa et al. [2020] A. Handa, K. Van Wyk, W. Yang, J. Liang, Y.-W. Chao, Q. Wan, S. Birchfield, N. Ratliff, and D. Fox, “Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 9164–9170.
Qin et al. [2023] Y. Qin, W. Yang, B. Huang, K. Van Wyk, H. Su, X. Wang, Y.-W. Chao, and D. Fox, “Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system,” arXiv preprint arXiv:2307.04577, 2023.
Mandlekar et al. [2021] A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y. Zhu, and R. Martín-Martín, “What matters in learning from offline human demonstrations for robot manipulation,” arXiv preprint arXiv:2108.03298, 2021.
Chi et al. [2023] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” 2023.
Florence et al. [2022] P. Florence, C. Lynch, A. Zeng, O. A. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mordatch, and J. Tompson, “Implicit behavioral cloning,” in Conference on Robot Learning. PMLR, 2022, pp. 158–168.
Lv et al. [2021] J. Lv, W. Xu, L. Yang, S. Qian, C. Mao, and C. Lu, “Handtailor: Towards high-precision monocular 3d hand recovery,” British Machine Vision Conference (BMVC), 2021.
Rong et al. [2021] Y. Rong, T. Shiratori, and H. Joo, “Frankmocap: A monocular 3d whole-body pose estimation system via regression and integration,” in IEEE International Conference on Computer Vision Workshops, 2021.
Schmidt et al. [2014] T. Schmidt, R. A. Newcombe, and D. Fox, “Dart: Dense articulated real-time tracking.” in Robotics: Science and systems, vol. 2, no. 1. Berkeley, CA, 2014, pp. 1–9.
Weichert et al. [2013] F. Weichert, D. Bachmann, B. Rudak, and D. Fisseler, “Analysis of the accuracy and robustness of the leap motion controller,” Sensors, vol. 13, no. 5, pp. 6380–6393, 2013.
Javdani et al. [2015] S. Javdani, S. S. Srinivasa, and J. A. Bagnell, “Shared autonomy via hindsight optimization,” Robotics science and systems: online proceedings, 2015.
Reddy et al. [2018] S. Reddy, A. D. Dragan, and S. Levine, “Shared autonomy via deep reinforcement learning,” arXiv preprint arXiv:1802.01744, 2018.
Schaff and Walter [2020] C. Schaff and M. R. Walter, “Residual policy learning for shared autonomy,” arXiv preprint arXiv:2004.05097, 2020.
Brohan et al. [2022] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu et al., “Rt-1: Robotics transformer for real-world control at scale,” arXiv preprint arXiv:2212.06817, 2022.
Ebert et al. [2021] F. Ebert, Y. Yang, K. Schmeckpeper, B. Bucher, G. Georgakis, K. Daniilidis, C. Finn, and S. Levine, “Bridge data: Boosting generalization of robotic skills with cross-domain datasets,” arXiv preprint arXiv:2109.13396, 2021.
Fang et al. [2023] H.-S. Fang, H. Fang, Z. Tang, J. Liu, J. Wang, H. Zhu, and C. Lu, “Rh20t: A robotic dataset for learning diverse skills in one-shot,” arXiv preprint arXiv:2307.00595, 2023.
Kofman et al. [2005] J. Kofman, X. Wu, T. J. Luu, and S. Verma, “Teleoperation of a robot manipulator using a vision-based human-robot interface,” IEEE transactions on industrial electronics, vol. 52, no. 5, pp. 1206–1219, 2005.
Mandlekar et al. [2018] A. Mandlekar, Y. Zhu, A. Garg, J. Booher, M. Spero, A. Tung, J. Gao, J. Emmons, A. Gupta, E. Orbay et al., “Roboturk: A crowdsourcing platform for robotic skill learning through imitation,” in Conference on Robot Learning. PMLR, 2018, pp. 879–893.
Fang et al. [2024] H. Fang, H.-S. Fang, Y. Wang, J. Ren, J. Chen, R. Zhang, W. Wang, and C. Lu, “Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 15 031–15 038.
Lipton et al. [2017] J. I. Lipton, A. J. Fay, and D. Rus, “Baxter’s homunculus: Virtual reality spaces for teleoperation in manufacturing,” IEEE Robotics and Automation Letters, vol. 3, no. 1, pp. 179–186, 2017.
Antotsiou et al. [2019] D. Antotsiou, G. Garcia-Hernando, and T.-K. Kim, “Task-oriented hand motion retargeting for dexterous manipulation imitation,” in Computer Vision–ECCV 2018 Workshops: Munich, Germany, September 8-14, 2018, Proceedings, Part VI 15. Springer, 2019, pp. 287–301.
Li et al. [2019] S. Li, X. Ma, H. Liang, M. Görner, P. Ruppel, B. Fang, F. Sun, and J. Zhang, “Vision-based teleoperation of shadow dexterous hand using end-to-end deep neural network,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 416–422.
Liu et al. [2022] H. Liu, S. Nasiriany, L. Zhang, Z. Bao, and Y. Zhu, “Robot learning on the job: Human-in-the-loop autonomy and learning during deployment,” arXiv preprint arXiv:2211.08416, 2022.
Walke et al. [2023] H. R. Walke, K. Black, T. Z. Zhao, Q. Vuong, C. Zheng, P. Hansen-Estruch, A. W. He, V. Myers, M. J. Kim, M. Du et al., “Bridgedata v2: A dataset for robot learning at scale,” in Conference on Robot Learning. PMLR, 2023, pp. 1723–1736.
Jeon et al. [2020] H. J. Jeon, D. P. Losey, and D. Sadigh, “Shared autonomy with learned latent actions,” arXiv preprint arXiv:2005.03210, 2020.
Dragan and Srinivasa [2013] A. D. Dragan and S. S. Srinivasa, “A policy-blending formalism for shared control,” The International Journal of Robotics Research, vol. 32, no. 7, pp. 790–805, 2013.
Javdani et al. [2018] S. Javdani, H. Admoni, S. Pellegrinelli, S. S. Srinivasa, and J. A. Bagnell, “Shared autonomy via hindsight optimization for teleoperation and teaming,” The International Journal of Robotics Research, vol. 37, no. 7, pp. 717–742, 2018.
Muelling et al. [2017] K. Muelling, A. Venkatraman, J.-S. Valois, J. E. Downey, J. Weiss, S. Javdani, M. Hebert, A. B. Schwartz, J. L. Collinger, and J. A. Bagnell, “Autonomy infused teleoperation with application to brain computer interface controlled manipulation,” Autonomous Robots, vol. 41, pp. 1401–1422, 2017.
Sadigh et al. [2016] D. Sadigh, S. S. Sastry, S. A. Seshia, and A. Dragan, “Information gathering actions over human internal state,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 66–73.
Mower et al. [2021] C. Mower, J. Moura, and S. Vijayakumar, “Skill-based shared control,” in Robotics: Science and Systems XVII. The Robotics: Science and Systems Foundation, Jul. 2021, robotics: Science and Systems 2021, R:SS 2021 ; Conference date: 12-07-2021 Through 16-07-2021. [Online]. Available: https://roboticsconference.org/
Losey et al. [2022] D. P. Losey, H. J. Jeon, M. Li, K. Srinivasan, A. Mandlekar, A. Garg, J. Bohg, and D. Sadigh, “Learning latent actions to control assistive robots,” Autonomous robots, vol. 46, no. 1, pp. 115–147, 2022.
Chu et al. [2023] X. Chu, Y. Tang, L. H. Kwok, Y. Cai, and K. W. S. Au, “Bootstrapping robotic skill learning with intuitive teleoperation: Initial feasibility study,” 2023. [Online]. Available: https://arxiv.org/abs/2311.06543
Ho et al. [2020] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
Janner et al. [2022] M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine, “Planning with diffusion for flexible behavior synthesis,” 2022.
Ajay et al. [2023] A. Ajay, Y. Du, A. Gupta, J. Tenenbaum, T. Jaakkola, and P. Agrawal, “Is conditional generative modeling all you need for decision-making?” 2023.
Xu et al. [2023] M. Xu, Z. Xu, C. Chi, M. Veloso, and S. Song, “Xskill: Cross embodiment skill discovery,” 2023.
Luo [2022] C. Luo, “Understanding diffusion models: A unified perspective,” 2022.
Yoneda et al. [2023] T. Yoneda, L. Sun, G. Yang, B. C. Stadie, and M. R. Walter, “To the noise and back: Diffusion for shared autonomy,” in Robotics: Science and Systems XIX, Daegu, Republic of Korea, July 10-14, 2023, 2023.