
UniDoorManip: Learning Universal Door Manipulation Policy Over Large-scale and Diverse Door Manipulation Environments

Yu Li1*  Xiaojie Zhang1*  Ruihai Wu2*
Zilong Zhang1  Yiran Geng2  Hao Dong2†  Zhaofeng He1†
1Beijing University of Posts and Telecommunications  2School of CS, Peking University
*Equal contribution.  †Corresponding author.
Abstract

Learning a universal manipulation policy encompassing doors with diverse categories, geometries and mechanisms is crucial for future embodied agents to work effectively in complex and broad real-world scenarios. Due to limited datasets and unrealistic simulation environments, previous works fail to achieve good performance across various doors. In this work, we build a novel door manipulation environment reflecting different realistic door manipulation mechanisms, and further equip this environment with a large-scale door dataset covering 6 door categories with hundreds of door bodies and handles, making up thousands of different door instances. Additionally, to better emulate real-world scenarios, we introduce a mobile robot as the agent and use the partial and occluded point cloud as the observation, factors that previous works do not consider but that are significant for real-world deployment. To learn a universal policy over diverse doors, we propose a novel framework that disentangles the whole manipulation process into three stages and integrates them by training in the reverse order of inference. Extensive experiments validate the effectiveness of our designs and demonstrate our framework’s strong performance. Code, data and videos are available at https://unidoormanip.github.io/.

[Uncaptioned image]
Figure 1: Our Proposed Environment, Dataset and Universal Manipulation Policy. We build a novel door manipulation environment equipped with a large-scale door dataset covering 6 door categories with hundreds of door bodies and handles, and configure it with different realistic door manipulation mechanisms. To learn the universal door manipulation policy, we propose a novel framework which can generalize to unseen shapes and categories.

1 INTRODUCTION

To enable robots to exhibit human-like abilities in performing a wide range of tasks, it is crucial for them to acquire proficiency in manipulating articulated objects. Among these tasks, door manipulation holds significant importance due to the frequent need to open or close doors in various scenarios. While previous works have focused primarily on interior doors [29, 30], we aim to extend door manipulation to a more general setting, e.g., doors of windows, cars and safes, as illustrated in Figure 1. In these broad scenarios, the door manipulation task covers doors with diverse types, geometries and manipulation mechanisms, which poses a great challenge to learning a universal door manipulation policy.

Prior works in the field have struggled to learn a universal manipulation policy due to the lack of diversity in types, geometries, and manipulation mechanisms. Beyond works that open doors with a simple push or pull [21, 37, 33, 6, 5, 4], DoorGym [29] presents an approach that automatically generates door bodies and handles from hard-coded templates, which results in poor performance on new unseen doors due to the limited geometric diversity. To facilitate the learning of a universal door manipulation policy, we build a comprehensive door manipulation environment that encompasses doors across diverse types, geometries, and manipulation mechanisms. Our efforts in constructing this environment have focused on the following two key aspects.

Firstly, recognizing the limited door types, geometries and quantities in existing datasets [6, 17, 36, 29], we propose a large-scale door dataset with diverse categories and geometries. Our dataset consists of two door components, body and handle, providing users the flexibility to configure doors according to their specific requirements, as illustrated in Figure 1. Table 1 shows that our dataset encompasses 6 distinct types of doors, comprising 328 door bodies and 204 handles, enabling the composition of thousands of door objects while ensuring their compatibility. Secondly, to narrow the gap between simulation and the real world, we introduce more realistic settings in our door manipulation environment. Concretely, our goal is to address the intricacies associated with doors featuring latching mechanisms, which require handle grasping and manipulation prior to opening the door. Compared with previous works utilizing a sole parallel or suction gripper [21, 33, 37, 4], we employ a mobile robot arm equipped with a parallel gripper as our agent. Furthermore, due to occlusion between the door and the robot arm, as well as occlusion within the door itself, our visual observations during the door manipulation process are partially occluded.

To manipulate doors with latching mechanisms, previous works [29, 5] have explored training a single universal policy using reinforcement learning (RL) algorithms. However, directly training such a policy for the entire door manipulation process in an end-to-end manner poses great challenges. This is because door manipulation contains three separate but related stages: handle grasping, handle manipulation, and door opening. The inherent separation of these stages results in complicated manipulation mechanisms and a vast exploration space, making it difficult for a single RL policy to learn. To tackle this challenge, we propose a novel framework that disentangles these three stages, each of which contains a specific manipulation process, making it easier to learn a generalizable manipulation policy. Besides, we train these policies in a conditioned way to reflect their interrelations and dependencies.

In the three-stage manipulation process, the first stage, handle grasping, only requires a grasp pose for the end-effector, while the latter two stages require action sequence generation. Hence we employ a policy specifically for predicting the handle grasping action. Leveraging the inherent generalizability across diverse geometries provided by visual affordance [8, 35], this policy proposes the grasp action by predicting a point-level score map, where a higher score indicates a greater chance of successful door manipulation. After grasping the handle steadily, the goal of the following stages is to manipulate the handle to unlock the door and then open it. Unlike most works that configure the door with a simple manipulation mechanism, our proposed environment simulates diverse realistic door manipulation mechanisms that closely resemble real-world scenarios, e.g., lever, round, key and valve, as shown in the right part of Figure 1. Although handle manipulation and door opening share the similarity of generating action sequences, their specific action types differ significantly. Hence we train separate policies for each stage to alleviate the burden on the model, enabling a universal capability for handle manipulation and door opening. In contrast to open-loop manipulation [33], which lacks real-time adjustment, we employ a closed-loop formulation for these two stages that dynamically adapts subsequent actions based on the current observation. Furthermore, to seamlessly integrate the three separate but related stage-wise policies into a single universal policy for the entire manipulation process, we introduce a conditioned training strategy that bridges the gaps among them.

Through extensive experiments in simulation, we validate the effectiveness of our design choices and demonstrate that our approach significantly outperforms previous methods. Additionally, we conduct real-world experiments to show the generalization capability of our approach in real-world scenarios.

In summary, our main contributions are:

  • We are the first to build a door manipulation environment with diverse realistic manipulation mechanisms and equip this environment with a large-scale door dataset covering diverse types, handles, and geometries for universal manipulation policy learning.

  • To achieve universal and realistic door manipulation, we propose a novel framework that disentangles the whole manipulation process into three stages with respective universal policies and integrates them into the whole universal policy leveraging conditioned training.

  • Extensive experiments validate the effectiveness of our designs and demonstrate our framework’s strong performance in learning the universal and realistic policy.

2 RELATED WORK

2.1 Door Manipulation Environment and Datasets

Building a door manipulation environment that simulates the real world and transferring policies trained in simulation to the real world has been the main approach to door manipulation tasks in recent years. However, recent works covering door manipulation have two main drawbacks. 1) Unrealistic simulation. Besides works using a simple push or pull to open the door [21, 37, 33, 6, 5, 4], Robosuite [40] benchmarks opening doors with a latching mechanism as a standardized task but employs small doors with large handles. 2) Lack of diversity in the dataset. PartNet-Mobility [36] and AKB-48 [17] provide diverse datasets for articulated objects, including doors; focusing on cross-category diversity, they ignore the intra-category diversity of doors. To address these two problems, we build a door manipulation environment with diverse realistic manipulation mechanisms, and equip it with a large-scale door dataset covering diverse types, handles and geometries for universal manipulation policy learning.

2.2 3D Articulated Object Manipulation

Extensive investigations have been conducted in the field of articulated objects, encompassing various facets such as reconstructing the dynamic structure of articulated objects [1, 10, 14, 23, 19, 12, 3], estimating 6D pose [11, 18, 15, 32, 6], comprehending the intricacies of manipulating them [13, 37, 4, 5, 22, 9, 7, 2, 26], and point-level visual affordance [21, 33, 31]. Among these areas of study, particular attention has been dedicated to the manipulation of articulated objects, with a specific emphasis on generating optimal action sequences. Previous works [29, 5] have explored training a single universal policy using state-based [29] or vision-based RL [5, 7], which suffers from the vast exploration space and complicated manipulation mechanisms of the door manipulation task. In contrast, we propose a novel framework that disentangles the whole manipulation process into three stages with respective universal policies and integrates them into a whole universal policy leveraging conditioned training.

Datasets              |   Int.      |   Win.    |   Car.    |   Saf.     |   Sto.      |  Ref.
                      | B   H   CO  | B  H  CO  | B  H  CO  | B   H  CO  | B   H  CO   | B  H  CO
AKB-48 [17]           | -   9   -   | -  -  -   | -  -  -   | -   -  -   | -   -  -    | -  -  -
PartNet-Mobility [36] | 26  22  26  | 3  1  3   | -  -  -   | 30  14 30  | 155 -  -    | 4  -  -
GAPartNet [6]         | 14  11  14  | -  -  -   | -  -  -   | 29  1  29  | 133 -  -    | 4  -  -
DoorGym [29]          | -   20  -   | -  -  -   | -  -  -   | -   -  -   | -   -  -    | -  -  -
Ours                  | 57  96  5472| 18 37 666 | 22 15 330 | 61  39 2379| 160 8  1280 | 10 9  90
Table 1: Statistical comparison between previous datasets and ours. For category, Int., Win., Car., Saf., Sto. and Ref. respectively denote doors from Interior, Window, Car, Safe, StorageFurniture and Refrigerator. For asset numbers, B, H and CO indicate the numbers of body, handle and composited object assets built from the two parts.
Env.              Data.  Mob.  Latch.  Part.  Occ.
GAPartNet [6]     P + A
W2A [21, 33, 31]  P      ✓
RLAfford [7]      P      ✓
PartManip [5]     G      ✓ ✓
DoorGym [29]      D      ✓ ✓
EnvAfford [34]    P      ✓ ✓
Ours              Ours   ✓     ✓       ✓      ✓
Table 2: Comparison between Our Environment and Others. ✓ marks a supported feature. For simplicity, Data., Mob., Latch., Part. and Occ. respectively denote Dataset, Mobile Robot Arm, Latching Mechanism, Partial Observation and Occlusion in Observation. Besides, P, A, G and D respectively denote PartNet-Mobility, AKB-48, GAPartNet and DoorGym in Table 1.

3 Large-scale Diverse Door Manipulation Environment

An environment based on a large-scale, diverse door dataset and realistic simulation is a necessity for training a universal manipulation policy. Considering the lack of diversity and realism in current simulation environments [29, 40, 21, 37, 33, 6, 5, 4], we propose a novel environment with a large-scale diverse door dataset (Section 3.1) and realistic simulation (Section 3.2) based on IsaacGym [20].

3.1 Large-scale Diverse Door Dataset

In recent years, several works have proposed datasets for door manipulation, as illustrated in Table 1. DoorGym [29] claims a large-scale and scalable dataset specifically for door manipulation. Due to hard-coded generation from the same templates, the doors it constructs lack geometric diversity, which leads to poor performance on unseen doors. Limited by the environment, the door boards do not have corresponding URDF files or object entities, so they cannot be transferred to other environments. PartNet-Mobility [36] and AKB-48 [17] provide diverse datasets for articulated objects, including doors. Focusing on cross-category diversity, they ignore the intra-category diversity of doors, which makes them unsuitable for our goal. Hence we introduce a large-scale diverse dataset specifically for door manipulation.

Compared with current datasets that provide whole doors without part composition over diverse configurations, we construct our dataset from two door parts, named body and handle. Given the composition method we provide, users of our dataset can configure personalized door objects with diverse settings according to their needs. In total, we create 328 bodies and 204 handles, which can be composed into thousands of door objects, as shown in Table 1.

To satisfy the need for diversity, we make efforts at both the category and intra-category level. At the category level, we select 6 representative categories that cover most of the door-opening scenes we encounter in the real world: Interior, Window, Car, Safe, StorageFurniture and Refrigerator. For intra-category diversity, we collect hundreds of door bodies and handles with irregular geometries and diverse manipulation mechanisms, as illustrated in Figure 1. Table 1 provides a detailed statistical comparison between previous datasets and ours. Note that we count object assets based on whether their manipulation mechanisms are similar to doors.
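As Table 1 suggests, the composited-object (CO) count within each category is the full body × handle cross product. A minimal sketch of this bookkeeping (the dictionary below simply restates the per-category counts from Table 1):

```python
# Per-category (bodies, handles) counts from Table 1; the composited
# object count for each category is the full cross product.
CATEGORIES = {
    "Interior": (57, 96),
    "Window": (18, 37),
    "Car": (22, 15),
    "Safe": (61, 39),
    "StorageFurniture": (160, 8),
    "Refrigerator": (10, 9),
}

def composed_count(bodies: int, handles: int) -> int:
    """Number of door instances from pairing every body with every handle."""
    return bodies * handles
```

Summing the body and handle columns recovers the totals reported above (328 bodies, 204 handles).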

3.2 Settings of Our Realistic Environment

To closely emulate the real world, we specifically configure our environment in terms of both manipulation and observation. To simulate the latching mechanism, we apply a force \mathcal{F}_{door}(\theta) in the reverse direction of opening to the door until the handle joint angle \theta_h reaches the opening threshold thre_{door}. Furthermore, we also add a resilient force \mathcal{F}_{handle}(\theta) to both the handle and the door to ensure the robustness of the manipulation policy. The forces are formulated as:

\mathcal{F}_{door}(\theta_d, \theta_h) = \begin{cases} F_f, & \theta_h \leq thre \\ k_1\theta_d, & \theta_h > thre \end{cases}   (1)

\mathcal{F}_{handle}(\theta_h) = k_2\theta_h   (2)

where \theta_d denotes the door joint angle. Here we set k_1 = 3, k_2 = 3 and F_f = 150; forces are measured in Newtons. Instead of a floating parallel or suction gripper, we utilize a whole robot arm with a mobile base and parallel gripper as the agent to manipulate the doors. In this setting, collisions between the doors and the robot arm, as well as the joint limits of the robot arm itself, must be taken into consideration. Unlike previous works using pseudo-perfect observations [7, 6, 33, 21] or partial observations [5, 37, 4] that do not consider the occlusion caused by the door itself or the robot arm, we place a fixed camera in front of the door and robot arm and acquire the partially occluded point cloud generated from the depth image as the observation. Table 2 shows a detailed comparison between our environment and others from various aspects.
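The piecewise latching-force model of Eqs. (1)–(2) can be sketched in a few lines, using the constants above (k_1 = k_2 = 3, F_f = 150 N); the function names are ours, not from the released environment code:

```python
# Constants from the paper: spring gains and latching force (Newtons).
K1, K2, F_F = 3.0, 3.0, 150.0

def door_force(theta_d: float, theta_h: float, thre: float) -> float:
    """Resistive force on the door, Eq. (1).

    While the handle joint angle has not passed the opening threshold,
    a constant latching force F_f holds the door shut; afterwards only
    a resilient force proportional to the door angle remains.
    """
    if theta_h <= thre:
        return F_F
    return K1 * theta_d

def handle_force(theta_h: float) -> float:
    """Resilient force pulling the handle back to rest, Eq. (2)."""
    return K2 * theta_h
```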

Refer to caption
Figure 2: Pipeline of Our Framework. We disentangle the entire door manipulation process into three stages: handle grasping, handle manipulation and door opening. We predict an affordance map to grasp the handle, and employ a similar formulation but separate policies for handle manipulation and door opening. Besides, we integrate the three stage-wise policies leveraging conditioned training.

4 METHOD

As illustrated in Figure 2, we propose a novel framework that disentangles door manipulation into three distinct but related stages, each with a corresponding universal manipulation policy (Section 4.1). We leverage conditioned training to train these policies, as they have inter-dependencies, so that they can be integrated into a unified universal policy (Section 4.2). In the first stage, we employ generalizable point-level visual affordance [8, 24, 38, 16] to propose stable grasp poses (Section 4.3). In the second stage, we train a universal policy covering multiple handle manipulation mechanisms in our proposed realistic environment (Section 4.4). In the third stage, we train a policy to open doors with unlocked handles (Section 4.5). Additionally, we provide a comprehensive description of our data collection and training strategy (Section 4.6).

4.1 Problem Formulation

We disentangle the door manipulation task into three stages: handle grasping, handle manipulation and door opening. Following UMPNet [37], we define the policy \pi for each stage as generating a robot action a_t, given the partial point cloud observation O_t and the robot state S_t at time t. Here, the robot action a_t = (p_t, r_t) indicates the next end-effector pose of the robot arm, consisting of position p_t \in \mathbb{R}^3 and orientation r_t \in SO(3). The robot state S_t consists of the joint angles and velocities.

In stage one, given the initial point cloud observation O_0 and a contact point p_0 \in O_0, the model outputs a per-point affordance map A_{O_0}, where a higher score on p_0 indicates a greater chance of successful door manipulation after grasping p_0 with action a_0.

In stage two, given the observation O_1 after grasping and the robot state S_1, the goal is to manipulate the handle in the right direction to unlock the door. We train a manipulation policy to generate the action a_1 for diverse mechanisms.

In stage three, the goal is to open the door as wide as possible. Given the current visual observation O_t and robot state S_t at time t, we employ a manipulation policy to generate the opening action a_t in a closed-loop manner until the door joint angle \theta_d exceeds a threshold.
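The per-stage interface above can be made concrete with a small type sketch. Storing r_t as a 3×3 rotation matrix is our own convention for illustration (the formulation only states r_t \in SO(3)):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Action:
    """Robot action a_t = (p_t, r_t): the next end-effector pose."""
    position: np.ndarray   # p_t in R^3
    rotation: np.ndarray   # r_t in SO(3), here a 3x3 orthonormal matrix

def is_valid(a: Action) -> bool:
    """Check shapes and that the rotation is (numerically) in SO(3)."""
    ortho = np.allclose(a.rotation @ a.rotation.T, np.eye(3), atol=1e-6)
    det_one = np.isclose(np.linalg.det(a.rotation), 1.0, atol=1e-6)
    return a.position.shape == (3,) and a.rotation.shape == (3, 3) and ortho and det_one
```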

4.2 Disentanglement and Conditioned Training

As door manipulation includes multiple different stages, it is difficult to directly train a universal policy in an end-to-end manner covering all the manipulation behaviors in those diverse stages. Therefore, we disentangle door manipulation into three distinct stages. Because the manipulation within each stage is homogeneous, it is much easier to train a universal and generalizable policy per stage and merge them into a whole universal manipulation policy.

While separated, these related stages have internal dependencies. For example, if the robot arm grasps the handle too far outward, it may disengage while rotating the handle. Also, if the robot arm ends handle manipulation in an awkward configuration, it can easily collide with the door during opening. Hence, to integrate the stage-wise policies, we employ a conditioned training formulation leveraging these internal relations. Because we can evaluate the result of handle grasping during handle manipulation, and the result of handle manipulation during door opening, we train the policy for door opening first. Then we train the handle manipulation policy using data collected based on the door opening policy. Given the policies for the latter two stages, we can obtain the first policy for initial handle grasping.

4.3 Handle Grasping

Due to the diverse geometries of different handle types, it is challenging to train a universal and generalizable manipulation policy for initial handle grasping. While existing heuristic methods leveraging pose estimation [6] or keypoint representations [30] cannot generalize to objects with irregular shapes, we employ point-level visual affordance [8] for manipulation, which predicts a point-level score map on the target object, indicating actionability for downstream tasks.

The policy for handle grasping consists of three models: an affordance predictor \mathcal{E}_{grasp}, an action sampler, and an action discriminator \mathcal{D}_{grasp}. During inference, taking the initial observed point cloud O_0 \in \mathbb{R}^{3\times4096} and the contact point p \in \mathbb{R}^3 as input, the affordance predictor \mathcal{E}_{grasp} predicts a point-level score map, from which we choose the point with the highest score as the grasp point. We employ the segmentation version of PointNet++ [25] to extract per-point features from O_0, where f_{p_0^i|O_0} \in \mathbb{R}^{128} represents the feature of point p_0^i, and use an MLP to encode the contact point p into a latent representation f_{p_0} \in \mathbb{R}^{128}. Based on the grasp point, the action sampler proposes candidate actions \{a_0^0, a_0^1, \ldots, a_0^k\} by sampling random orientations from the normal plane of the handle axis and combining each of them with the grasp point. Taking the concatenation of the point cloud feature and the action feature extracted by another MLP, the action discriminator \mathcal{D}_{grasp} outputs an action score. We select the action with the highest score to grasp the handle.
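The stage-one inference loop can be sketched as follows, with \mathcal{E}_{grasp} and \mathcal{D}_{grasp} stubbed out by plain callables; the real modules are a PointNet++ backbone and MLP scorers, and all names here are illustrative:

```python
import numpy as np

def select_grasp(points, affordance_fn, sample_orientations, score_fn):
    """Pick a grasp: highest-affordance point, then best-scored orientation.

    points: (N, 3) partial point cloud O_0.
    affordance_fn: (N, 3) -> (N,) per-point actionability scores (E_grasp).
    sample_orientations: point -> list of candidate orientations (sampled
        from the normal plane of the handle axis in the real system).
    score_fn: (point, orientation) -> scalar action score (D_grasp).
    """
    scores = affordance_fn(points)
    grasp_point = points[int(np.argmax(scores))]      # highest-affordance point
    candidates = sample_orientations(grasp_point)     # k candidate orientations
    best = max(candidates, key=lambda r: score_fn(grasp_point, r))
    return grasp_point, best
```

A toy call with stub networks (e.g., height as the affordance) exercises the same selection logic the trained models would drive.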

In the reverse order of inference, we first train the action discriminator \mathcal{D}_{grasp}, supervised by the final door joint angle \theta_d with an L_1 loss. To train the affordance predictor \mathcal{E}_{grasp}, we sample 50 random actions at the contact point p_0 selected from O_0, estimate their action scores using \mathcal{D}_{grasp}, and regress the affordance prediction to the mean score of the top-10 rated actions with an L_1 loss.
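The affordance supervision target described above (score 50 sampled actions with the discriminator, regress toward the mean of the top-10) reduces to a few lines; the sample counts match the text, the function names are ours:

```python
import numpy as np

def affordance_target(action_scores, top_k=10):
    """Mean of the top-k discriminator scores among actions sampled at a point."""
    top = np.sort(np.asarray(action_scores, dtype=float))[-top_k:]
    return float(top.mean())

def l1_loss(pred: float, target: float) -> float:
    """L1 regression loss used for both D_grasp and E_grasp supervision."""
    return abs(pred - target)
```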

4.4 Handle Manipulation

Leveraging the diverse manipulation mechanisms provided by our proposed environment, we employ a generative network architecture to learn the high-dimensional action distribution and output the handle manipulation action. The policy for handle manipulation is composed of two modules: an action generator \mathcal{G}_{handle} and an action discriminator \mathcal{D}_{handle}. The action generator \mathcal{G}_{handle} is implemented as a conditional variational autoencoder (cVAE) [28], composed of an action encoder that maps the input action a_t into a Gaussian noise z and an action decoder that reconstructs the action from the noise vector. We set the current point cloud O_t and robot state S_t as the conditions of the encoder and decoder, both implemented as MLPs. Given the candidate actions \{a_1^0, a_1^1, \ldots, a_1^k\} generated by \mathcal{G}_{handle}, plus O_t and S_t, the action discriminator predicts action scores, and we select the action with the highest score as the manipulation action.
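Stage-two inference — decode candidate actions from Gaussian noise conditioned on (O_t, S_t), then let the discriminator pick one — can be sketched with the networks stubbed out; both interfaces here are assumptions for illustration:

```python
import numpy as np

def propose_and_select(decode, score, condition, num_candidates=16,
                       noise_dim=128, seed=0):
    """decode: (z, condition) -> action, standing in for the cVAE decoder.
    score: (action, condition) -> float, standing in for D_handle.
    Returns the highest-scored of `num_candidates` decoded actions."""
    rng = np.random.default_rng(seed)
    candidates = [decode(rng.normal(size=noise_dim), condition)
                  for _ in range(num_candidates)]
    return max(candidates, key=lambda a: score(a, condition))
```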

To train the action generator \mathcal{G}_{handle}, we use a KL-divergence loss to regularize the Gaussian bottleneck noise, an L_1 loss to regress the action position p_t, and a 6D-rotation loss [39] to supervise the action orientation r_t. The total loss can be formulated as:

\mathcal{L}_h = \lambda_{KL}\mathcal{L}_{kl} + \lambda_{pos}\mathcal{L}_{pos} + \lambda_{rot}\mathcal{L}_{rot}   (3)

where a set of hyper-parameters balances the loss terms. To train the action discriminator \mathcal{D}_{handle}, we use an L_1 loss to supervise the action score with the residual handle joint angle before and after the action execution.
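Eq. (3) is a plain weighted sum; a minimal sketch follows, where the default lambda values are placeholders since the exact weights are not reported here:

```python
def total_loss(l_kl, l_pos, l_rot, lam_kl=1.0, lam_pos=1.0, lam_rot=1.0):
    """Weighted combination of the KL, position (L1) and 6D-rotation terms, Eq. (3)."""
    return lam_kl * l_kl + lam_pos * l_pos + lam_rot * l_rot
```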

4.5 Door Opening

To tackle the diverse geometries and long action sequences, instead of open-loop manipulation [33], we employ a closed-loop formulation that iteratively generates the next door opening action a_t based on the current observation, including the partially occluded point cloud O_t and the robot state S_t. We follow the implementation of the handle manipulation policy and employ an action generator \mathcal{G}_{door} and an action discriminator \mathcal{D}_{door} to output the door opening action. Different from the previous discriminator, we supervise the action score with the residual door joint angle before and after the action execution.
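The closed-loop stage-three rollout — re-observe, act, and stop once \theta_d passes the threshold — can be sketched as below; the `env` and `policy` interfaces are our assumptions, not the released API:

```python
def open_door(env, policy, thre_door, max_steps=50):
    """Closed-loop opening: each action is conditioned on a fresh observation."""
    for _ in range(max_steps):
        obs, state = env.observe()        # partially occluded cloud + joint state
        action = policy(obs, state)       # next end-effector pose (p_t, r_t)
        env.step(action)
        if env.door_angle() > thre_door:  # theta_d exceeded the threshold
            return True
    return False
```

Re-observing every step is what lets the policy correct for occlusion and contact dynamics that an open-loop trajectory predicted from O_0 cannot.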

Refer to caption
Figure 3: Manipulation Sequence Guided by Our Universal Manipulation Policy.

4.6 Data Collection and Training Strategy

To train the policies for the three stages, we collect data, including the input observations and ground-truth supervision, leveraging rule-based methods. For different manipulation mechanisms and stages, we use door states, such as the joint axes of the handle and body, which can be acquired in the simulation environment, to calculate the next action.

In the reverse order of inference, we train the door-opening policy first. By supervising the policy with the final door manipulation result, the door opening policy learns to generate actions that open the door as wide as possible. Then we replace the rule for door opening with the trained policy and collect data for handle manipulation policy training. In this way, the handle manipulation policy is integrated with the door opening policy and outputs actions that facilitate door opening. Finally, we replace the rules for the latter two stages with the two trained policies and collect data for affordance training. Because the affordance score is supervised by the final door joint angle, the policy for initial handle grasping can predict a foresightful grasp pose that helps avoid failure cases in the following stages, such as disengagement between the end-effector and handle, or collision between the robot arm and door.
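The reversed-order schedule above can be summarized as a small driver; `collect` and `train` stand in for the rule-based data collection and per-stage training routines, and every name is illustrative:

```python
def train_pipeline(rules, collect, train):
    """Train stage policies in the reverse order of inference.

    `rules` holds scripted controllers for each stage; trained policies
    progressively replace them in the data-collection rollouts.
    """
    # 1) Door opening, with rule-based grasping and handle manipulation.
    pi_door = train("door_opening",
                    collect(grasp=rules.grasp, handle=rules.handle, door=rules.door))
    # 2) Handle manipulation, conditioned on the trained opening policy.
    pi_handle = train("handle_manipulation",
                      collect(grasp=rules.grasp, handle=rules.handle, door=pi_door))
    # 3) Handle grasping (affordance), conditioned on both trained policies.
    pi_grasp = train("handle_grasping",
                     collect(grasp=rules.grasp, handle=pi_handle, door=pi_door))
    return pi_grasp, pi_handle, pi_door
```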

5 EXPERIMENTS

5.1 Task, Metric, Evaluation and Settings

We conduct our experiments on the representative door manipulation task: pull door. The initially closed door can only be unlocked and then pulled open. The robot arm needs to pull the door until the door joint angle \theta_d is larger than a threshold thre_{door}. Here, we set thre_{door} to 45°.

For training and testing, to show the generalizability of our proposed framework across diverse categories, geometries and manipulation mechanisms, we train the universal manipulation policy on the door categories Interior, Window, Safe and Car. We carefully split the object assets of these categories into train and test parts, which ensures the universal policy is tested on unseen shapes. Furthermore, we hold out two more categories, StorageFurniture and Refrigerator, for extensive tests on unseen novel categories.

For evaluation, we use the task success rate as our main metric. To reduce evaluation noise, we run each experiment 3 times with different random seeds and report the mean performance as well as the variance.

5.2 Baselines

We compare our method with the following baselines: 1) GAPartNet + GT: following the heuristic manipulation method of GAPartNet, we directly obtain the initial ground truth, such as the poses of the handle and board, from simulation and implement a heuristic motion planning method for door manipulation. 2) DoorGym: the state-based PPO [27] implemented in DoorGym; here we use a flying gripper as the agent and place the gripper close to the handle at the beginning. 3) PartManip: the vision-based PPO used in PartManip, with a setting similar to DoorGym. 4) VAT-MART: a method that proposes visual action trajectories for downstream manipulation tasks in an open-loop formulation; for a fair comparison, we develop an implementation in our environment.

5.3 Result Analysis

Figure 3 shows the whole manipulation sequence guided by our universal policy in four categories: Safe, Window, Interior and Car. The results show that our universal policy generalizes over diverse categories, geometries and manipulation mechanisms. In the upper left of the initial frames, we visualize the affordance score with a heat map, where redder points indicate higher scores. It is worth mentioning that the affordance map represents where the fingers of the end-effector contact the object, which matches the resulting affordance of the round handle in the Interior category.

Table 3 shows that our framework outperforms all baselines on both the train and test categories in the pull door task. When encountering handles with irregular types and geometries, GAPartNet + GT cannot compute a proper initial grasp pose even when provided with the ground-truth handle pose; besides, its heuristic motion planning method does not generalize well to unseen objects. The two RL-based methods, DoorGym and PartManip, fail to train an end-to-end policy for door manipulation due to the large exploration space across multiple stages and the complicated reward engineering required for diverse categories. The open-loop method VAT-MART predicts the action trajectory from the initial observation, which leads to accumulated errors and an inability to make real-time adjustments.

Task Pull Door
Method Train Test
(per-category columns, indicated by icons in the original table)
GAPartNet [6]+GT 0.62 0.88 0.41 0.44 0.52 0.26
DoorGym [29] 0.56 0.72 0.61 0.41 0.19 0.23
PartManip [5] 0.47 0.61 0.54 0.34 0.42 0.19
VAT-MART [33] 0.59 0.62 0.57 0.43 0.51 0.25
Ours w/o disentangle. 0.44 0.88 0.20 0.19 0.05 0.22
Ours w/o condition. 0.77 0.31 0.58 0.51 0.54 0.33
Ours w/o state. 0.73 0.59 0.16 0.36 0.45 0.37
Ours w/o mobile. 0.87 0.60 0.00 0.43 0.50 0.81
Ours 0.99 0.91 0.81 0.72 0.75 0.89
Table 3: Experimental results of the baselines and ablation studies on the Pull Door task. In the header, Train denotes testing on unseen shapes from train categories; Test denotes testing on unseen shapes from unseen test categories.
Refer to caption
Figure 4: Comparison of the ablations and ours for different door joint angles. For each door joint angle, we run experiments on all categories and report the average success rate.
Refer to caption
Figure 5: Failure cases of ablated versions.

5.4 Ablation Study

To further evaluate the importance of different components of our method, we conduct an ablation study comparing our method with four ablations: 1) Ours w/o disentangle.: instead of training policies for the latter two stages separately, we use a single policy to generate the manipulation actions after handle grasping. 2) Ours w/o condition.: ours without conditioned training. 3) Ours w/o state.: ours without the robot state in the input observation. 4) Ours w/o mobile.: ours with a fixed-base robot arm instead of a mobile one.

The experimental results in Table 3 and Figure 4 support the efficacy of the four components in enhancing the universal manipulation policy. Figure 4 depicts how the performance of Ours and Ours w/o condition. evolves as the door joint angle θd increases. While both exhibit a similar downward trend, Ours consistently achieves higher success rates, indicating that conditioned training contributes to performance improvement. Ours w/o disentangle. yields the lowest success rate across most categories and all θd values due to inappropriate states (Figure 5(b)), underscoring that disentangling the door manipulation process is critical for effective universal policy learning. Notably, Ours w/o mobile. declines rapidly even as θd increases slightly and achieves a zero success rate on the Car category, suggesting that without a mobile base the robot is prone to collisions with doors, as shown in Figure 5(a). Finally, Ours w/o state. degrades rapidly once θd exceeds 30°, demonstrating that incorporating robot state information mitigates the degradation of point cloud data (Figure 5(c)) as θd increases.

Figure 6: Real-World Experiments.

5.5 Real-world Experiment

We conduct experiments interacting with various real-world doors, including Interior, Window, Safe and Storagefurniture, as shown in Figure 6. We employ a Franka Emika Panda robot arm to perform the door manipulation tasks. As observations, we obtain real-time point clouds from an Azure Kinect DK depth camera and the robot state from the arm. The manipulation processes in Figure 6 demonstrate that our universal policy transfers effectively to real-world scenarios. See the supplementary material for more details.

6 Conclusion

In this work, we aim to learn a universal manipulation policy for doors with diverse categories, geometries and mechanisms. For realistic simulation, we introduce a novel door manipulation environment that includes a large-scale door dataset and diverse realistic latching mechanisms. Building on this environment, we present a novel framework for universal policy learning that disentangles the entire manipulation process into three stages and integrates them by training in the reversed order of inference. Extensive experiments validate the effectiveness of our designs and demonstrate that our framework outperforms previous works.

References

  • Abbatematteo et al. [2019] Ben Abbatematteo, Stefanie Tellex, and George Konidaris. Learning to generalize kinematic models to novel objects. In Proceedings of the 3rd Conference on Robot Learning, 2019.
  • Borja-Diaz et al. [2022] Jessica Borja-Diaz, Oier Mees, Gabriel Kalweit, Lukas Hermann, Joschka Boedecker, and Wolfram Burgard. Affordance learning from play for sample-efficient policy learning. In 2022 International Conference on Robotics and Automation (ICRA), pages 6372–6378. IEEE, 2022.
  • Du et al. [2023] Yushi Du, Ruihai Wu, Yan Shen, and Hao Dong. Learning part motion of articulated objects using spatially continuous neural implicit representations. In British Machine Vision Conference (BMVC), 2023.
  • Eisner et al. [2022] Ben Eisner, Harry Zhang, and David Held. Flowbot3d: Learning 3d articulation flow to manipulate articulated objects. arXiv preprint arXiv:2205.04382, 2022.
  • Geng et al. [2023a] Haoran Geng, Ziming Li, Yiran Geng, Jiayi Chen, Hao Dong, and He Wang. Partmanip: Learning cross-category generalizable part manipulation policy from point cloud observations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2978–2988, 2023a.
  • Geng et al. [2023b] Haoran Geng, Helin Xu, Chengyang Zhao, Chao Xu, Li Yi, Siyuan Huang, and He Wang. Gapartnet: Cross-category domain-generalizable object perception and manipulation via generalizable and actionable parts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7081–7091, 2023b.
  • Geng et al. [2022] Yiran Geng, Boshi An, Haoran Geng, Yuanpei Chen, Yaodong Yang, and Hao Dong. End-to-end affordance learning for robotic manipulation. arXiv preprint arXiv:2209.12941, 2022.
  • Gibson [1977] James J Gibson. The theory of affordances. Hilldale, USA, 1(2):67–82, 1977.
  • Gu et al. [2023] Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. Maniskill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659, 2023.
  • Hausman et al. [2015] Karol Hausman, Scott Niekum, Sarah Osentoski, and Gaurav S Sukhatme. Active articulation model estimation through interactive perception. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 3305–3312. IEEE, 2015.
  • Jain et al. [2021] Ajinkya Jain, Rudolf Lioutikov, Caleb Chuck, and Scott Niekum. Screwnet: Category-independent articulation model estimation from depth images using screw theory. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13670–13677. IEEE, 2021.
  • Jiang et al. [2022] Zhenyu Jiang, Cheng-Chun Hsu, and Yuke Zhu. Ditto: Building digital twins of articulated objects from interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5616–5626, 2022.
  • Katz and Brock [2008] Dov Katz and Oliver Brock. Manipulating articulated objects with interactive perception. In 2008 IEEE International Conference on Robotics and Automation, pages 272–277. IEEE, 2008.
  • Katz et al. [2013] Dov Katz, Moslem Kazemi, J Andrew Bagnell, and Anthony Stentz. Interactive segmentation, tracking, and kinematic modeling of unknown 3d articulated objects. In 2013 IEEE International Conference on Robotics and Automation, pages 5003–5010. IEEE, 2013.
  • Li et al. [2020] Xiaolong Li, He Wang, Li Yi, Leonidas J Guibas, A Lynn Abbott, and Shuran Song. Category-level articulated object pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3706–3715, 2020.
  • Ling et al. [2024] Suhan Ling, Yian Wang, Shiguang Wu, Yuzheng Zhuang, Tianyi Xu, Yu Li, Chang Liu, and Hao Dong. Articulated object manipulation with coarse-to-fine affordance for mitigating the effect of point cloud noise. ICRA, 2024.
  • Liu et al. [2022] Liu Liu, Wenqiang Xu, Haoyuan Fu, Sucheng Qian, Qiaojun Yu, Yang Han, and Cewu Lu. Akb-48: A real-world articulated object knowledge base. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14809–14818, 2022.
  • Liu et al. [2020] Qihao Liu, Weichao Qiu, Weiyao Wang, Gregory D Hager, and Alan L Yuille. Nothing but geometric constraints: A model-free method for articulated object pose estimation. arXiv preprint arXiv:2012.00088, 2020.
  • Lv et al. [2022] Jun Lv, Qiaojun Yu, Lin Shao, Wenhai Liu, Wenqiang Xu, and Cewu Lu. Sagci-system: Towards sample-efficient, generalizable, compositional, and incremental robot learning. In 2022 International Conference on Robotics and Automation (ICRA), pages 98–105. IEEE, 2022.
  • Makoviychuk et al. [2021] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.
  • Mo et al. [2021] Kaichun Mo, Leonidas J. Guibas, Mustafa Mukadam, Abhinav Gupta, and Shubham Tulsiani. Where2act: From pixels to actions for articulated 3d objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6813–6823, 2021.
  • Mu et al. [2021] Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Yang, Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, and Hao Su. Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations. arXiv preprint arXiv:2107.14483, 2021.
  • Nie et al. [2022] Neil Nie, Samir Yitzhak Gadre, Kiana Ehsani, and Shuran Song. Structure from action: Learning interactions for articulated object 3d structure discovery. arXiv preprint arXiv:2207.08997, 2022.
  • Ning et al. [2023] Chuanruo Ning, Ruihai Wu, Haoran Lu, Kaichun Mo, and Hao Dong. Where2explore: Few-shot affordance learning for unseen novel categories of articulated objects. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  • Qi et al. [2017] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017.
  • Schiavi et al. [2023] Giulio Schiavi, Paula Wulkop, Giuseppe Rizzi, Lionel Ott, Roland Siegwart, and Jen Jen Chung. Learning agent-aware affordances for closed-loop interaction with articulated objects. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 5916–5922. IEEE, 2023.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Sohn et al. [2015] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28, 2015.
  • Urakami et al. [2019] Yusuke Urakami, Alec Hodgkinson, Casey Carlin, Randall Leu, Luca Rigazio, and Pieter Abbeel. Doorgym: A scalable door opening environment and baseline agent. arXiv preprint arXiv:1908.01887, 2019.
  • Wang et al. [2020] Jiayu Wang, Shize Lin, Chuxiong Hu, Yu Zhu, and Limin Zhu. Learning semantic keypoint representations for door opening manipulation. IEEE Robotics and Automation Letters, 5(4):6980–6987, 2020.
  • Wang et al. [2022] Yian Wang, Ruihai Wu, Kaichun Mo, Jiaqi Ke, Qingnan Fan, Leonidas Guibas, and Hao Dong. Adaafford: Learning to adapt manipulation affordance for 3d articulated objects via few-shot interactions. European conference on computer vision (ECCV 2022), 2022.
  • Weng et al. [2021] Yijia Weng, He Wang, Qiang Zhou, Yuzhe Qin, Yueqi Duan, Qingnan Fan, Baoquan Chen, Hao Su, and Leonidas J Guibas. Captra: Category-level pose tracking for rigid and articulated objects from point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13209–13218, 2021.
  • Wu et al. [2022] Ruihai Wu, Yan Zhao, Kaichun Mo, Zizheng Guo, Yian Wang, Tianhao Wu, Qingnan Fan, Xuelin Chen, Leonidas Guibas, and Hao Dong. VAT-mart: Learning visual action trajectory proposals for manipulating 3d ARTiculated objects. In International Conference on Learning Representations, 2022.
  • Wu et al. [2023a] Ruihai Wu, Kai Cheng, Yan Zhao, Chuanruo Ning, Guanqi Zhan, and Hao Dong. Learning environment-aware affordance for 3d articulated object manipulation under occlusions. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a.
  • Wu et al. [2023b] Ruihai Wu, Chuanruo Ning, and Hao Dong. Learning foresightful dense visual affordance for deformable object manipulation. In IEEE International Conference on Computer Vision (ICCV), 2023b.
  • Xiang et al. [2020] Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. Sapien: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11097–11107, 2020.
  • Xu et al. [2022] Zhenjia Xu, Zhanpeng He, and Shuran Song. Umpnet: Universal manipulation policy network for articulated objects. IEEE Robotics and Automation Letters, 2022.
  • Zhao et al. [2022] Yan Zhao, Ruihai Wu, Zhehuan Chen, Yourong Zhang, Qingnan Fan, Kaichun Mo, and Hao Dong. Dualafford: Learning collaborative visual affordance for dual-gripper object manipulation. arXiv preprint arXiv:2207.01971, 2022.
  • Zhou et al. [2019] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2019.
  • Zhu et al. [2020] Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020.