Peking University — email: {yianwang,wuruihai,kjq001220,hao.dong}@pku.edu.cn
Stanford University — email: {kaichun,guibas}@cs.stanford.edu
Tencent AI Lab — email: [email protected]
Peng Cheng Lab
Project page: https://hyperplane-lab.github.io/AdaAfford
AdaAfford: Learning to Adapt Manipulation Affordance for 3D Articulated Objects via Few-shot Interactions
Abstract
Perceiving and interacting with 3D articulated objects, such as cabinets, doors, and faucets, pose particular challenges for future home-assistant robots performing daily tasks in human environments.
Besides parsing the articulated parts and joint parameters, researchers have recently advocated learning manipulation affordance over the input shape geometry, which is more task-aware and geometrically fine-grained. However, taking only passive observations as inputs, these methods ignore many hidden but important kinematic constraints (e.g., joint location and limits) and dynamic factors (e.g., joint friction and restitution), and therefore lose significant accuracy on test cases with such uncertainties. In this paper, we propose a novel framework, named AdaAfford, that learns to perform very few test-time interactions for quickly adapting the affordance priors to more accurate instance-specific posteriors. We conduct large-scale experiments using the PartNet-Mobility dataset and demonstrate that our system performs better than baselines.
1 Introduction
For future home-assistant robots to aid humans in accomplishing diverse everyday tasks, we must equip them with strong capabilities for perceiving and interacting with diverse 3D objects in human environments. Articulated objects, such as cabinets, doors, and faucets, are particularly interesting kinds of 3D shapes in our daily lives, since agents can interact with them and trigger functionally important state changes (e.g., pushing closed a cabinet drawer, rotating the handle and pulling open a door, turning the faucet's water on or off by rotating the switch). However, because robots need to understand more complicated part semantics and manipulate articulated parts with more degrees of freedom than rigid objects, perceiving and interacting with 3D articulated objects remains a very important yet challenging task.
Many previous works have investigated the problem of perceiving and interacting with 3D articulated objects. Researchers have been pushing the state of the art in segmenting articulated parts [32, 42], tracking them [30, 35], and estimating joint parameters [34, 40], enabling robotic systems [23, 2, 33] to successfully perform sophisticated planning and control over 3D articulated objects.

More recently, beyond recognizing articulated parts and joints, researchers have proposed learning more task-aware and geometrically fine-grained manipulation affordance over the input 3D geometry. Where2Act [18], the work most related to ours, learns densely labeled manipulation affordance heatmaps over partial 3D scans of articulated objects, as illustrated in Fig. 1 (b), by performing self-supervised trial-and-error interactions in a physical simulator. Many other works leverage similar dense affordance predictions over 3D scenes [21] and rigid objects [17]. Such densely labeled affordance predictions over 3D data provide more geometrically fine-grained actionable information and can be learned task-specifically for different manipulation actions, showing promise in bridging the perception-interaction gap for robotic manipulation over large-scale 3D data across different tasks.
However, taking only a single-frame observation of the 3D shape as input (e.g., a single 2D image, a single partial 3D scan), these methods systematically fail to capture many hidden but important kinematic or dynamic factors and therefore predict inaccurate affordance heatmaps, similar to Fig. 1 (b), by averaging out such uncertainties. For example, given a fully closed cabinet door with no obvious handle, as shown in Fig. 1 (top row), it is uncertain whether the door axis is on the left or the right side, which significantly affects the manipulation affordance predictions. Other kinematic uncertainties include joint limits (e.g., push inward or pull outward for a door) and joint types (e.g., slide or rotate to open a door). Besides, various dynamic or physical parameters (e.g., part mass, joint friction) are also unobservable from single-frame inputs but largely affect manipulation affordance. For example, with an increasing friction coefficient for a cabinet drawer (Fig. 1, bottom row), robots would be able to push on the inner board.
In this paper, we propose a novel framework, AdaAfford, that learns to perform very few test-time interactions to reduce such kinematic or dynamic uncertainties and quickly adapts the affordance prior predictions to instance-specific posteriors for a novel test shape. Our system learns a data-efficient strategy that sequentially samples very few uncertain or interesting locations to interact with, as illustrated by the interacting grippers in Fig. 1 (b), according to the current affordance predictions and past interaction trials (we begin with the affordance prior predictions of Where2Act [18] and an empty interaction history). The interaction outcomes, each of which includes the interaction location, direction, and the resulting part motion, are then observed and incorporated to produce posterior affordance predictions, as illustrated in Fig. 1 (c), by a proposed fast-adaptation mechanism. We set up a benchmark for experiments and evaluation using the large-scale PartNet-Mobility dataset [20] and the SAPIEN physical simulator [37]. We use in total 972 shapes from 15 object categories, conduct experiments for several action types, and randomly sample the kinematic and dynamic parameters of the 3D articulated objects in simulation. Experiments show that our method can successfully and efficiently adapt manipulation affordance to novel test shapes with as few as one to four interactions. Quantitative evaluation further proves the effectiveness of our proposed approach.
In summary, our main contributions are: 1) we point out and investigate an important limitation of methods that learn densely labeled visual manipulation affordance – their unawareness of hidden yet important kinematic and dynamic uncertainties; 2) we propose a novel framework, AdaAfford, that learns to perform very few test-time interactions to reduce uncertainties and quickly adapt to predicting an instance-specific affordance posterior; 3) we set up a large-scale benchmark, built upon PartNet-Mobility [20] and SAPIEN [37], for experiments and evaluations; results demonstrate the effectiveness and efficiency of the proposed approach.
2 Related Work
Visual Affordance on 3D Shapes.
Affordance [9] suggests possible ways for agents to interact with objects. Many past works have investigated learning grasp [29, 15, 26, 13, 11] and manipulation [21, 27, 17, 18, 36, 39] affordance for robot-object interaction, while there are also many works studying affordance for hand-object [12, 4, 17, 41, 3], object-object [31, 46, 19], and human-scene [8, 16, 25, 21] interaction scenarios. Among these works, researchers have proposed different representations for visual affordance, including detection locations [29, 15], parts [17], keypoints [27], heatmaps [21, 18], etc. In this work, we mostly follow the settings in [18] for learning visual affordance heatmaps for manipulating 3D articulated objects. Different from previous works that infer possible agent-object visual affordance heatmaps passively from static visual observations, our framework leverages active interactions to efficiently query uncertain kinematic or dynamic factors for learning more accurate instance-adaptive visual affordance.
Fast Adaptation via Few-shot Interactions.
Researchers have explored various approaches [45, 7, 28, 44, 5] for fast adaptation via few-shot interactions. Many past works have also designed interactive perception methods to figure out object mass [14], dynamic parameters [38, 1, 6, 10], or parameters of known models [43]. Different from these studies, which propose general algorithms for policy adaptation or recover explicit system parameters for rigid objects, we focus on designing a working solution for our specific task of learning visual affordance heatmaps for manipulating 3D articulated objects, with dedicated designs for predicting geometry-grounded interaction proposals and interaction-adaptive affordance predictions.
3 Problem Formulation
Given as input a single-frame 3D partial point cloud observation of an articulated object (e.g., lifted from a depth scanner with known camera intrinsics), the Where2Act framework [18] directly outputs a per-point manipulation affordance heatmap, where a higher score at a point indicates a bigger chance that interacting there accomplishes a given short-term manipulation task (e.g., pushing, pulling). Additionally, a diverse set of gripper orientations is proposed at each point, suggesting possible ways for robot agents to interact, each of which is also associated with a success likelihood. No interaction is allowed at test time in Where2Act, and a fixed set of system dynamic parameters is used across all shapes.
We follow most of the Where2Act settings, except that we randomly vary the system dynamics and allow test-time interactions over the 3D shape to reduce kinematic or dynamic uncertainties. Our AdaAfford system proposes a few interactions $I_1, I_2, \dots, I_n$ sequentially. Each interaction $I_i = (O_i, u_i, m_i)$ executes a task-specific hard-coded short-term trajectory defined in Where2Act, parametrized by the action $u_i = (p_i, R_i)$ with interaction point $p_i$ and gripper orientation $R_i$, and observes a part motion $m_i$. Starting from the input shape observation $O_1$, every interaction that moves the part changes the part state and thus produces a new shape point cloud observation $O_{i+1}$ for the next interaction $I_{i+1}$. Leveraging the interaction observations $I_1, \dots, I_n$, our system then adapts the per-point manipulation affordance predicted by Where2Act to a posterior that reduces uncertainties and provides more accurate instance-specific predictions. For each proposed gripper orientation, we also update its success likelihood score considering the test-time interactions.
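To make the setting concrete, below is a minimal sketch of the data record carried by each interaction $I_i$ and the adaptation interface; the field and function names are illustrative assumptions, not the paper's.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Interaction:
    """One test-time interaction I_i = (O_i, u_i, m_i) (hypothetical field names)."""
    point_cloud: np.ndarray   # O_i: (N, 3) partial scan the action was executed on
    point: np.ndarray         # p_i: (3,) interaction point on the shape
    orientation: np.ndarray   # R_i: (3, 3) gripper orientation
    part_motion: float        # m_i: observed motion of the target articulated part

def adapt_affordance(prior: np.ndarray, interactions: list) -> np.ndarray:
    """Interface only: map a per-point affordance prior and few-shot interactions
    to an instance-specific posterior (realized by the AAP module in Sec. 4)."""
    raise NotImplementedError
```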
4 Method
Our proposed AdaAfford framework primarily consists of two modules – an Adaptive Interaction Proposal (AIP) module and an Adaptive Affordance Prediction (AAP) module. While the AIP module learns a greedy yet effective strategy for sequentially proposing few-shot test-time interactions, the AAP module is trained to adapt the affordance predictions from the Where2Act [18] prior to a posterior by observing the sampled interactions. We iterate the two modules recurrently at test time to produce a sequence of few-shot interactions leading to the final affordance posterior prediction. During training, we iteratively alternate the training of the two modules until joint convergence.
Below, we first introduce the test-time inference procedure for a brief overview. Next, we describe the input backbone encoders that are shared among all networks in our framework. Then, we describe the detailed architectures and system designs of the two modules. We conclude with the training losses and strategy.

Test-time Overview.
Fig. 2 presents an overview of the method. We apply a recurrent structure at test time. Starting from the affordance prediction without any interaction, the AIP module proposes the first action, producing the interaction data $I_1$. Then, at each timestep $t$, we feed the current set of interactions $\{I_1, \dots, I_t\}$ as inputs to the AAP module and extract hidden information $z_t$ that adapts the affordance map prediction to a posterior. The AIP module then takes $z_t$ as input and proposes an action, composed of an interaction point and a gripper orientation, for the next interaction. Performing this action in the environment, we obtain the next-step interaction data $I_{t+1}$ and add it to the interaction set. We iterate until the interaction budget has been reached or our AIP module decides to stop. When the procedure stops at timestep $n$, we output the final affordance posterior.
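A sketch of this recurrent test-time loop, where `aip_propose`, `execute`, and `aap_adapt` are hypothetical stand-ins for the AIP module, the simulator, and the AAP module described below:

```python
def infer_posterior(obs, prior_affordance, aip_propose, execute, aap_adapt,
                    budget=4, stop_thresh=0.05):
    """Recurrent test-time adaptation loop (schematic)."""
    interactions = []
    affordance = prior_affordance            # start from the Where2Act-style prior
    for _ in range(budget):
        # AIP: propose the most informative next action given what we know so far
        action, aip_score = aip_propose(obs, interactions)
        if aip_score < stop_thresh:          # AIP decides nothing is worth querying
            break
        # Execute the short-term primitive in the environment, record the outcome
        obs, part_motion = execute(obs, action)
        interactions.append((action, part_motion))
        # AAP: fold the accumulated evidence into an updated affordance posterior
        affordance = aap_adapt(obs, interactions)
    return affordance, interactions
```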
Input Encoders.
This paragraph details how we encode inputs into features, as all the encoder networks in the two modules take the same input entities (e.g., the shape observation, the interaction action) and thus share the same architecture. We use the PointNet++ segmentation network [24] to encode the input shape point cloud into per-point feature maps, and we take the per-point feature at the query point. We use Multilayer Perceptron (MLP) networks to encode other vector inputs (e.g., the interaction action and the part motion) into feature vectors. The networks in the following subsections first encode their inputs with these two encoders and then concatenate the resulting features into a joint feature. The encoders do not share weights across different modules.
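A minimal PyTorch sketch of the vector-input branch of this shared encoding scheme (the PointNet++ backbone is omitted and the layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class VectorEncoder(nn.Module):
    """MLP encoder for vector inputs such as the action (p, R) or the part motion."""
    def __init__(self, in_dim, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, x):
        return self.mlp(x)

def fuse(point_feat, *vector_feats):
    """Concatenate the per-point geometry feature with the encoded vector inputs."""
    return torch.cat([point_feat, *vector_feats], dim=-1)
```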

4.1 Adaptive Affordance Prediction Module
The Adaptive Affordance Prediction (AAP) module takes as input a set of few-shot interactions and predicts the affordance posterior. This module is composed of three subnetworks: 1) an Adaptive Information Encoder that extracts hidden information $z$ from a set of interactions; 2) an Adaptive Affordance Network that predicts the posterior affordance heatmap conditioned on the hidden information $z$; and 3) an Adaptive Critic Network that predicts the AAP action score of an action conditioned on the hidden information $z$. Here, an action $u = (p, R)$ is represented by an interaction point $p$ and a gripper orientation $R$.
Adaptive Information Encoder.
Given a set of interactions as inputs, the Adaptive Information Encoder outputs a 128-dim hidden information representation (denoted $z$ for brevity). It first encodes each interaction $I_i$ using the input encoders mentioned before, and then uses an MLP network to encode the features into a 128-dim latent code $z_i$ representing the hidden information extracted from $I_i$. As different interactions contain different amounts of hidden information, we use another MLP network to predict an attention score $s_i$ for each interaction. To obtain a summarized representation from the set of interactions, we simply compute a weighted average over all the $z_i$'s according to the weights $s_i$'s and use the resulting feature as the hidden information $z$.
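A sketch of this attention-weighted aggregation; the input is assumed to be one fused feature row per interaction, and the layer sizes and softmax normalization are assumptions:

```python
import torch
import torch.nn as nn

class AdaptiveInfoEncoder(nn.Module):
    """Aggregate per-interaction codes z_i into a single 128-dim code z."""
    def __init__(self, in_dim=256, z_dim=128):
        super().__init__()
        self.to_code = nn.Sequential(nn.Linear(in_dim, z_dim), nn.ReLU(),
                                     nn.Linear(z_dim, z_dim))
        self.to_score = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(),
                                      nn.Linear(64, 1))

    def forward(self, interaction_feats):          # (K, in_dim), one row per interaction
        z_i = self.to_code(interaction_feats)      # (K, z_dim) per-interaction codes
        s_i = self.to_score(z_i)                   # (K, 1) attention scores
        w_i = torch.softmax(s_i, dim=0)            # normalize for a weighted average
        return (w_i * z_i).sum(dim=0)              # (z_dim,) summarized hidden code z
```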
Adaptive Critic Network.
Given the object partial point cloud observation, an arbitrary interaction point $p$, an arbitrary gripper orientation $R$, and the latent code $z$, the Adaptive Critic Network predicts an AAP action score indicating the likelihood that the interaction action $u = (p, R)$ succeeds given the interaction information $z$. It first encodes the inputs using the input encoders mentioned before and then employs an MLP network, taking the concatenated features together with $z$ as inputs, to predict the AAP action score. A higher AAP action score for an action $u$ indicates a higher chance for $u$ to succeed in accomplishing the given manipulation task.
Adaptive Affordance Network.
Given the input object partial point cloud, an arbitrary point $p$, and the latent code $z$, the Adaptive Affordance Network predicts an actionability score at point $p$. It first encodes the inputs using the aforementioned input encoders and then uses an MLP network that takes the concatenated features together with $z$ as inputs and produces the actionability score as output. A higher actionability score indicates a higher chance of successfully interacting at point $p$.
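Both prediction heads follow the same pattern of an MLP conditioned on the hidden code $z$; a minimal PyTorch sketch, with layer sizes and the sigmoid output as assumptions (the AIP heads in Sec. 4.2 mirror this structure):

```python
import torch
import torch.nn as nn

class ConditionalScorer(nn.Module):
    """Shared pattern for the AAP heads: score = sigmoid(MLP([features, z])).
    Instantiated with point + action features for the Adaptive Critic Network,
    and with per-point features only for the Adaptive Affordance Network."""
    def __init__(self, feat_dim, z_dim=128, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim + z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, feat, z):                       # feat: (B, feat_dim), z: (z_dim,)
        z = z.unsqueeze(0).expand(feat.shape[0], -1)  # condition every query on z
        return torch.sigmoid(self.mlp(torch.cat([feat, z], dim=-1))).squeeze(-1)
```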
4.2 Adaptive Interaction Proposal Module
The Adaptive Interaction Proposal (AIP) module proposes an action for the next-step interaction (denoting the current timestep as $t$), given the hidden information $z_t$ extracted from the current interaction observations. This module contains two networks: 1) an Adaptive Interaction Proposal Affordance Network that predicts an AIP actionability score indicating how worthwhile the next interaction is at a point $p$, and 2) an Adaptive Interaction Proposal Critic Network that predicts an AIP action score suggesting which gripper orientation to pick for the next interaction. We leverage the predictions of the two networks to propose the next action.
Adaptive Interaction Proposal Critic Network.
Given the input object partial point cloud, an arbitrary interaction point $p$, an arbitrary gripper orientation $R$, the latent code $z$, and the AAP action score produced by the Adaptive Critic Network, the AIP Critic Network predicts the AIP action score of the action $u = (p, R)$. It first encodes the inputs using the input encoders and then uses an MLP network that takes the concatenated features together with $z$ as inputs and generates the AIP action score for the action. A higher AIP action score suggests that the action may query more unknown yet interesting hidden information and thus is worth exploring next.
Adaptive Interaction Proposal Affordance Network.
Given the input partial shape observation, an arbitrary interaction point $p$, the latent code $z$, and the AAP actionability score at point $p$ estimated by the Adaptive Affordance Network, the AIP Affordance Network predicts the AIP actionability score at point $p$. It first encodes the inputs using the aforementioned input encoders and then employs an MLP network, taking the concatenated features together with $z$ as inputs, to predict the AIP actionability score. A higher AIP actionability score at $p$ indicates that more unknown yet helpful hidden information may be obtained by executing an interaction at $p$.
Next-step Interaction Action Proposal.
To propose an action for the next interaction, given the hidden information $z$ and the input shape partial point cloud, we first obtain the AIP actionability heatmap over all points predicted by the AIP Affordance Network and select the point with the highest AIP actionability score. Then, we sample 100 random actions at that point using Where2Act's pre-trained Action Proposal Network, use our AIP Critic Network to generate an AIP action score for each sampled action, and choose the action with the highest AIP action score.
Stopping Criterion for the Few-shot Interactions.
The AIP procedure for generating few-shot interactions stops when a preset budget is reached or the maximal AIP actionability score is below a certain threshold (e.g., 0.05).
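A schematic sketch combining the action proposal rule with this stopping criterion; the helpers `sample_actions` (standing in for Where2Act's pre-trained Action Proposal Network), `aip_affordance`, and `aip_critic` are assumed interfaces:

```python
import torch

def propose_next_action(point_feats, z, aip_affordance, aip_critic,
                        sample_actions, n_samples=100, stop_thresh=0.05):
    """Pick the most informative point, then the best of the sampled actions there."""
    point_scores = aip_affordance(point_feats, z)        # (N,) AIP actionability
    best_pt = int(torch.argmax(point_scores))
    if point_scores[best_pt] < stop_thresh:              # stopping criterion
        return None                                      # no interaction worth making
    actions = sample_actions(best_pt, n_samples)         # candidate orientations at p*
    action_scores = aip_critic(point_feats[best_pt], actions, z)   # (n_samples,)
    return best_pt, actions[int(torch.argmax(action_scores))]
```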
4.3 Training and Losses
In brief, for the AAP module, we use ground-truth part motions to supervise the Adaptive Critic Network, and use the Adaptive Critic Network to supervise the training of the Adaptive Affordance Network. For the AIP module, we use the AAP module to supervise the training of the AIP Critic Network, and use the AIP Critic Network to supervise the AIP Affordance Network. Below, we describe the losses and the training strategy in detail.
AAP Action Scoring Loss.
To supervise the Adaptive Critic Network, we use a standard binary cross-entropy loss, which measures the error between the network's prediction and the ground-truth outcome of an interaction derived from the target part's observed motion. Specifically, given the hidden information $z$, a batch of interaction observations $\{I_i = (O_i, u_i, m_i)\}_{i=1}^{B}$, and the AAP action score prediction $r_i$ for each interaction $I_i$, the loss is defined as
$$\mathcal{L}_{\text{AAP-C}} = -\frac{1}{B} \sum_{i=1}^{B} \left( \hat{r}_i \log r_i + (1 - \hat{r}_i) \log (1 - r_i) \right),$$
where $\hat{r}_i = 1$ if the observed part motion $m_i$ exceeds a task-specific threshold and $\hat{r}_i = 0$ otherwise, rendering a binary discretization of each interaction outcome.
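A minimal PyTorch sketch of this loss; the motion threshold value below is a placeholder, not the paper's:

```python
import torch
import torch.nn.functional as F

def aap_action_scoring_loss(pred_scores, part_motions, motion_thresh=0.01):
    """Binary cross entropy between predicted AAP action scores and binarized outcomes.
    pred_scores: (B,) in [0, 1]; part_motions: (B,) observed part motions."""
    targets = (part_motions > motion_thresh).float()
    return F.binary_cross_entropy(pred_scores, targets)
```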
AAP Actionability Scoring Loss.
To train the Adaptive Affordance Network, we apply a regression loss that measures the difference between the predicted actionability score and the ground truth. To estimate the ground-truth actionability score at a point, we randomly sample 100 actions at that point using the pre-trained Where2Act Action Proposal Network, predict the AAP action scores of these actions using the Adaptive Critic Network, and take the average of the top-5 scores as the ground-truth actionability score.
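A sketch of generating this regression target, with `sample_actions` standing in for the pre-trained Where2Act Action Proposal Network and `critic` for the Adaptive Critic Network:

```python
import torch

def actionability_target(point, z, sample_actions, critic, n=100, topk=5):
    """Regression target for the actionability score at a point: the mean of the
    top-k AAP action scores over n sampled action proposals (schematic helpers)."""
    actions = sample_actions(point, n)        # Where2Act action proposals at the point
    scores = critic(point, actions, z)        # (n,) AAP action scores
    return scores.topk(topk).values.mean()
```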
AIP Action Scoring Loss.
To supervise the AIP Critic Network, we use a regression loss that measures the difference between our predicted AIP action score and the ground-truth AIP action score. We design a greedy yet effective way to estimate the ground-truth scores. Given a set of interactions, to generate the ground truth for an interaction $I_j$, we respectively encode the interaction set without and with $I_j$ into latent codes $z_{\setminus j}$ and $z$. Then, we feed $z_{\setminus j}$ and $z$ as conditional inputs to the Adaptive Critic Network separately and take the resulting difference of the AAP action scoring loss as the ground-truth AIP action score. More concretely, letting the AAP action scoring losses conditioned on $z_{\setminus j}$ and $z$ be $\mathcal{L}_{\setminus j}$ and $\mathcal{L}$ respectively, we define the ground-truth AIP action score as $\mathcal{L}_{\setminus j} - \mathcal{L}$. The AIP action score is thus trained to regress the estimated positive influence of executing the interaction on the AAP action score predictions, where an action with more influence is preferred as it helps discover more hidden information useful to the task.
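A sketch of this ground-truth construction, under our reading that the two subsets are the interaction set without and with the interaction being scored; the helper names and the evaluation batch are assumptions:

```python
import torch
import torch.nn.functional as F

def aip_action_target(interactions, j, encode, critic, eval_feats, eval_labels):
    """Ground-truth AIP action score for interaction j, estimated as how much adding
    it to the conditioning set reduces the AAP action scoring loss (schematic).
    `encode` maps a set of interactions to the hidden code z; (eval_feats, eval_labels)
    is a batch of action features with binarized outcomes used to evaluate the loss."""
    z_without = encode([it for k, it in enumerate(interactions) if k != j])
    z_with = encode(interactions)
    loss_without = F.binary_cross_entropy(critic(eval_feats, z_without), eval_labels)
    loss_with = F.binary_cross_entropy(critic(eval_feats, z_with), eval_labels)
    return (loss_without - loss_with).detach()   # larger = more informative interaction
```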
AIP Actionability Scoring Loss.
To train the AIP Affordance Network, we use another regression loss. For each position, we sample 100 actions using the pre-trained Where2Act Action Proposal Network, obtain the AIP action scores of these actions predicted by the AIP Critic Network, and use the average of the top-5 scores as the regression target.
Training Strategy.
We iteratively train the AAP module and the AIP module until joint convergence, since updating the subnetworks in one module affects the training of the subnetworks in the other. More specifically, updating the Adaptive Information Encoder and the Adaptive Critic Network in the AAP module changes the ground-truth AIP action scores, while updating the subnetworks in the AIP module changes the proposed interactions used to extract hidden information in the AAP module. Therefore, our final solution is to train the AAP and AIP modules iteratively. The procedure starts by training the AAP module with randomly sampled interactions. We then train the AIP module to learn to propose more efficient and effective interactions. Next, with the trained subnetworks in the AIP module, we finetune the AAP module with the proposed few-shot interactions. We alternate this training until both modules converge.
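A sketch of this alternating schedule; the `train_on` methods and the environment's interaction samplers are hypothetical stand-ins for the per-module updates:

```python
def train_adaafford(aap, aip, env, n_rounds=3):
    """Alternating training schedule (schematic)."""
    # Stage 0: train the AAP module on randomly sampled interactions
    aap.train_on(env.random_interactions)
    for _ in range(n_rounds):
        # Train the AIP module, supervised by the current AAP module
        aip.train_on(aap)
        # Finetune the AAP module on interactions proposed by the current AIP module
        aap.train_on(lambda obs: env.propose_interactions(obs, aip))
    return aap, aip
```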
5 Experiments
We perform experiments using the large-scale PartNet-Mobility dataset [20] and the SAPIEN simulator [37], and set up several baselines for comparisons. Results demonstrate the effectiveness and superiority of the proposed approach.
5.1 Data and Settings
Data.
Following the settings of Where2Act [18], we conduct our experiments in the SAPIEN [37] simulator equipped with the NVIDIA PhysX [22] physics engine, using the large-scale PartNet-Mobility dataset [20]. We use 972 articulated 3D objects covering 15 object categories, mostly following Where2Act, to carry out the experiments. The dataset is divided into 10 training and 5 testing categories. The shapes in the training categories are further divided into two disjoint sets of training and test shapes. See the supplementary material for detailed statistics.
Experiment Settings.
Following Where2Act [18], we perform experiments over all object categories under different manipulation action types. We train one network for each downstream manipulation task over training shapes from the 10 training object categories and evaluate the performance over test shapes from the training categories as well as shapes from the unseen test categories. Besides, to further demonstrate the effectiveness of our method, we conduct two additional experiments on challenging tasks with clear kinematic ambiguity, each conducted over a single object category: 1) pulling closed cabinet doors for which one cannot easily tell which side to pull open; 2) pushing faucets where it is uncertain which direction the switch rotates (clockwise or counter-clockwise). These are particularly interesting yet challenging cases on which the previous work Where2Act [18] fails drastically and on which we hope to test our framework.
Table 1: Quantitative comparisons against the baselines.
Task | Method | F-score (%) | Sample-Succ (%)
pushing all (train cat.) | Where2Act | 56.44 | 20.85
| Where2Act-interaction | 72.13 | 31.53
| Where2Act-adaptation | 64.16/65.42/64.99 | 20.77/22.72/26.82
| ours-random | 70.24/70.58/70.85 | 29.59/31.35/32.57
| ours-fps | 64.32/69.58/70.99 | 26.22/27.30/30.65
| ours-final | 72.78/73.12/75.18 | 33.82/33.23/35.23
pushing all (test cat.) | Where2Act | 59.95 | 21.69
| Where2Act-interaction | 76.12 | 37.10
| Where2Act-adaptation | 51.09/53.28/55.56 | 19.06/22.27/24.50
| ours-random | 75.12/76.92/76.98 | 30.78/30.78/29.48
| ours-fps | 66.17/67.27/69.08 | 33.64/35.19/37.79
| ours-final | 77.58/77.63/78.42 | 34.97/36.75/37.40
pulling all (train cat.) | Where2Act | 31.19 | 1.92
| Where2Act-interaction | 38.28 | 3.89
| Where2Act-adaptation | 37.22/38.48/39.13 | 1.11/2.15/1.62
| ours-random | 35.03/34.48/36.84 | 4.44/2.78/6.11
| ours-fps | 39.88/42.74/43.55 | 2.78/5.56/4.44
| ours-final | 42.62/43.87/44.08 | 7.78/9.44/10.55
pulling all (test cat.) | Where2Act | 36.36 | 10.00
| Where2Act-interaction | 45.80 | 9.73
| Where2Act-adaptation | 40.11/45.52/48.80 | 3.40/6.25/10.17
| ours-random | 41.97/44.88/46.11 | 6.13/4.78/8.26
| ours-fps | 43.67/42.77/48.33 | 4.35/3.91/4.78
| ours-final | 49.51/50.00/51.33 | 5.21/7.39/10.45
pulling closed door | Where2Act | 48.44 | 4.38
| Where2Act-interaction | 66.79 | 9.09
| Where2Act-adaptation | 50.21/55.75/56.81 | 6.60/7.18/6.83
| ours-random | 52.41/54.25/53.37 | 7.14/6.84/6.53
| ours-fps | 59.79/63.43/69.13 | 8.88/11.33/12.10
| ours-final | 57.83/65.60/79.65 | 10.86/11.57/22.14
pushing faucet | Where2Act | 64.92 | 55.46
| Where2Act-interaction | 79.85 | 80.97
| Where2Act-adaptation | 66.25/62.18/67.15 | 57.50/52.08/61.70
| ours-random | 72.61/76.29/79.16 | 61.81/79.01/80.82
| ours-fps | 74.19/79.36/77.95 | 60.44/70.12/77.41
| ours-final | 77.42/83.06/83.83 | 65.90/81.66/82.14
Environment Settings.
Following Where2Act, we abstract away the robot arm and use only a Franka Panda flying gripper as the robot actuator. The input shape point cloud is assumed to be cleanly segmented out. To generate the input partial point cloud scans, we mount an RGB-D camera with known intrinsic parameters 5 unit-lengths away, pointing at the center of the target object.
To simulate manipulating shapes with uncertain dynamics, we randomly vary the following three physical parameters in SAPIEN: 1) the friction of the target part's joint, 2) the mass of the target part, and 3) the friction coefficient of the target part's surface. For the "pulling closed door" task, we manually select cabinets from the PartNet-Mobility dataset [37] whose doors have no clear handle geometry and set the poses of those doors to closed. The gripper cannot tell which side to pull the door open, because it is impossible to determine from passive visual observations whether the joint axis is on the left or the right of the door. For the "pushing faucet" task, we randomly set the rotation direction of the faucet switch to one of three modes: only clockwise, only counter-clockwise, or both ways.
5.2 Baselines and Evaluation Metrics.
We set up several baselines and employ two metrics for quantitative comparisons.
Baselines and Ablation Study.
We compare our framework with several baselines (see the supplementary material for more detailed descriptions of the baseline designs):
- Where2Act: the original method proposed in [18], where only pure visual information is used to predict the visual actionable information and no interaction data is used at test time;
- Where2Act-interaction: the Where2Act method augmented with four additional interaction observations as inputs, where the interaction positions are uniformly sampled over the predicted affordance heatmap using Furthest Point Sampling (FPS), and an additional encoding branch, similar to the Adaptive Information Encoder, is trained to extract the additional input feature;
- Where2Act-adaptation: the Where2Act method augmented with a heuristic-based adaptation mechanism that replaces the AAP module, in which, given the interaction observations, we locally adjust the predictions for similar points;
- Ours-random: a variant of our proposed method that uses randomly sampled interaction trials over the geometry instead of the AIP proposals;
- Ours-fps: a variant of our proposed method that uses FPS over the predicted affordance to sample interactions instead of the AIP proposals.
We compare to Where2Act to show that the few-shot interactions indeed help remove ambiguities and improve the performance. The Where2Act-interaction baseline uses FPS to sample interaction positions, relying on the intuition that FPS sparsely covers all possible regions of uncertainty; comparing to this baseline helps validate that our iterative framework learns smarter strategies for performing more effective interaction trials. Furthermore, the Where2Act-adaptation baseline helps substantiate the effectiveness of our proposed AAP module, while the Ours-random and Ours-fps baselines are designed to verify the usefulness of the proposed AIP module.
Besides, we compare to an ablated version of our method to verify the significance of the iterative training between the AAP module and the AIP module:
- Ours w/o iter: an ablated version that trains the whole framework without the iterative training process.

Evaluation Metrics.
Following Where2Act [18], we use the F-score, which balances precision and recall, to evaluate the predictions of the Adaptive Critic Network, and we use the sample-success rate (Sample-Succ) to evaluate the joint performance of the Adaptive Critic Network and the Adaptive Affordance Network. To compute the sample-success rate, we apply the learned test-time strategy to collect the few-shot interactions and then use the extracted hidden information as the conditional input to the two networks. After that, we randomly select a point to interact with from the points with the top-100 actionability scores, sample 100 actions at that point, obtain the AAP action scores of these actions predicted by the Adaptive Critic Network, and choose the action with the highest score to execute. We perform 10 interaction trials per test shape and report the final Sample-Succ as the percentage of successful interactions in simulation.
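A sketch of how Sample-Succ could be computed, with all callables standing in for the trained networks and the simulator:

```python
import random
import torch

def sample_success_rate(test_shapes, collect_interactions, encode, affordance,
                        sample_and_score, execute, n_trials=10):
    """Estimate Sample-Succ (schematic): condition on few-shot interactions, pick a
    random point among the top-100 actionability scores, execute the best-scored of
    100 sampled actions there, and count successful part motions."""
    n_succ, n_total = 0, 0
    for shape in test_shapes:
        interactions = collect_interactions(shape)          # learned few-shot strategy
        z = encode(interactions)                             # hidden information
        for _ in range(n_trials):
            scores = affordance(shape, z)                    # (N,) actionability scores
            top_pts = torch.topk(scores, k=min(100, scores.numel())).indices.tolist()
            p = random.choice(top_pts)
            actions, action_scores = sample_and_score(shape, p, z, n=100)
            best = actions[int(torch.argmax(action_scores))]
            n_succ += int(execute(shape, p, best))           # 1 if the part moved
            n_total += 1
    return n_succ / n_total
```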
5.3 Results and Analysis
Table 1 presents the quantitative comparisons against the baselines, showing that our method achieves the best performance in most comparison entries. Specifically, compared to Where2Act, we observe that our method noticeably improves the performance with only one interaction, and the performance increases as the number of interactions increases in most cases. Compared to the Where2Act-adaptation baseline, our method with the proposed AAP module shows better performance, revealing that learning an adaptation network works better than using simple heuristics for adaptation. Compared to the Where2Act-interaction baseline, which is fed four interactions in one shot, our whole framework works better because our recurrent structure strategically and successively selects the most effective interaction trials. Finally, the superior performance against the Ours-random and Ours-fps baselines, which use random and FPS-sampled interaction trials, further validates that our proposed AIP module is effective in strategically and iteratively picking interaction trials.

Fig. 4 shows example visualizations of our predicted affordance map posteriors given interactions under different hidden kinematic or dynamic information (see the caption for more details explaining the different scenarios). In these figures, it is clear that our proposed method successfully adapts the affordance prediction conditioned on different hidden information. The affordance predictions within one shape share the same visual inputs but produce different results, showing that our hidden embedding carries the distinguishing information. For example, in the first (second) column, we observe that with bigger joint friction coefficients the door (handle) is harder to manipulate, and one needs to push (pull) only at points very far away from the joint axis to accomplish the task. In the right-most (second-from-the-right) column, our network successfully figures out which side of the door (faucet switch) one needs to push.
Fig. 5 further shows some interaction proposals made by our AIP module, together with their influence on the predicted AAP affordance map and on the AIP affordance map itself. In the first row, for example, the AIP module first proposes to interact at both sides of the faucet, since it knows little about the hidden information, but at the second timestep proposes the right side, as it has already learned that the left side is actionable. Cases in the first and third rows demonstrate that the past few interactions influence the selection of future interaction points, justifying the necessity of our recurrent structure for interaction selection. Specifically, in the third row, after the failure of the first interaction, our AIP module proposes interaction points farther from the joint, since it already knows that interactions at points closer to the joint than the first interaction point are not likely to succeed. In the last row, we show cases that require only one step to adapt.
Ablation Study.
In Table 2, comparing against the ablated version of our method, Ours w/o iter, which trains the whole system without the iterative training process, we see that Ours-final achieves better results in most cases, which proves the effectiveness of the iterative training scheme. By iteratively alternating the training between the AAP module and the AIP module, the networks are trained under the distribution of test-time interactions and thus achieve improved performance. The scores on the test categories also show that our method generalizes well to novel shapes and even to shapes from unseen object categories.
Table 2: Ablation study comparing ours w/o iter and ours-final.
Task | Method | F-score (%) | Sample-Succ (%)
pushing all (train cat.) | ours w/o iter | 71.21/72.64/73.16 | 30.67/31.62/32.56
| ours-final | 72.78/73.12/75.18 | 33.82/33.23/35.23
pushing all (test cat.) | ours w/o iter | 77.24/77.33/77.17 | 31.03/33.89/38.83
| ours-final | 77.58/77.63/78.42 | 34.97/36.75/37.40
pulling all (train cat.) | ours w/o iter | 41.19/42.10/42.81 | 6.67/7.22/8.33
| ours-final | 42.62/43.87/44.08 | 7.78/9.44/10.55
pulling all (test cat.) | ours w/o iter | 48.31/48.28/50.50 | 5.65/6.52/9.13
| ours-final | 49.51/50.00/51.33 | 5.21/7.39/10.45
pulling closed door | ours w/o iter | 56.74/64.88/80.64 | 9.77/11.50/22.00
| ours-final | 57.83/65.60/79.65 | 10.86/11.57/22.14
pushing faucet | ours w/o iter | 73.81/83.03/84.32 | 61.11/81.60/84.03
| ours-final | 77.42/83.06/83.83 | 65.90/81.66/82.14

Real-world and Real-robot Experiments.
Finally, we perform real-world and real-robot experiments to show that our method can, to some degree, work beyond synthetic data. We use a Franka Panda robot with a two-finger parallel gripper as the actuator to pull open a cabinet door in the real world. Fig. 6 shows that our system proposes two interaction trials to acquire more information about this real-world cabinet and successfully adapts to the posterior predictions.
Please refer to the supplementary materials for a video better illustrating this example, more experiment settings, more example results, and more experiments with additional analysis.
6 Conclusion
This work addresses an important limitation of previous works that learn visual actionable affordance for manipulating 3D articulated objects – the hidden kinematic or dynamic uncertainties. We propose a novel framework, AdaAfford, that samples a few test-time interactions to quickly adapt to a more accurate affordance posterior prediction that removes such ambiguities. Experimental results validate the effectiveness of our method compared to baseline approaches.
Limitations and Future Works.
This work only considers two action types and 3D articulated objects; future work may study more interaction and data types. Also, we only perform short-term interactions; future work can investigate how to extend the framework to long-term manipulation trajectories. Finally, we abstract away the complexity of robot arms and only use flying grippers in our experiments; future work should take robot arm constraints into account.
Acknowledgements.
This work was supported by the National Natural Science Foundation of China – Youth Science Fund (No. 62006006). Leonidas and Kaichun were supported by the Toyota Research Institute (TRI) University 2.0 program, NSF grant IIS-1763268, a Vannevar Bush Faculty Fellowship, and a gift from the Amazon Research Awards program. TRI provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
References
- [1] Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. arXiv preprint arXiv:1606.07419, 2016.
- [2] Sachin Chitta, Benjamin Cohen, and Maxim Likhachev. Planning for autonomous door opening with a mobile manipulator. In 2010 IEEE International Conference on Robotics and Automation, pages 1799–1806. IEEE, 2010.
- [3] Enric Corona, Albert Pumarola, Guillem Alenya, Francesc Moreno-Noguer, and Grégory Rogez. Ganhand: Predicting human grasp affordances in multi-object scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5031–5041, 2020.
- [4] Kuan Fang, Te-Lin Wu, Daniel Yang, Silvio Savarese, and Joseph J Lim. Demo2vec: Reasoning object affordances from online videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2139–2147, 2018.
- [5] Karim Farid and Nourhan Sakr. Few shot system identification for reinforcement learning. arXiv preprint arXiv:2103.08850, 2021.
- [6] Fabio Ferreira, Lin Shao, Tamim Asfour, and Jeannette Bohg. Learning visual dynamics models of rigid objects using relational inductive biases. arXiv preprint arXiv:1909.03749, 2019.
- [7] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
- [8] David F Fouhey, Vincent Delaitre, Abhinav Gupta, Alexei A Efros, Ivan Laptev, and Josef Sivic. People watching: Human actions as a cue for single view geometry. In European Conference on Computer Vision, pages 732–745. Springer, 2012.
- [9] James J Gibson. The theory of affordances. Hilldale, USA, 1(2):67–82, 1977.
- [10] Michael Janner, Sergey Levine, William T Freeman, Joshua B Tenenbaum, Chelsea Finn, and Jiajun Wu. Reasoning about physical interactions with object-oriented prediction and planning. arXiv preprint arXiv:1812.10972, 2018.
- [11] Zhenyu Jiang, Yifeng Zhu, Maxwell Svetlik, Kuan Fang, and Yuke Zhu. Synergies between affordance and geometry: 6-dof grasp detection via implicit representations. Proceedings of Robotics: Science and Systems (RSS), 2021.
- [12] Hedvig Kjellström, Javier Romero, and Danica Kragić. Visual object-action recognition: Inferring object affordances from human demonstration. Computer Vision and Image Understanding, 115(1):81–90, 2011.
- [13] Mia Kokic, Danica Kragic, and Jeannette Bohg. Learning task-oriented grasping from human activity datasets. IEEE Robotics and Automation Letters, 5(2):3352–3359, 2020.
- [14] K Niranjan Kumar, Irfan Essa, Sehoon Ha, and C Karen Liu. Estimating mass distribution of articulated objects using non-prehensile manipulation. arXiv preprint arXiv:1907.03964, 2019.
- [15] Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 34(4-5):705–724, 2015.
- [16] Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, and Jan Kautz. Putting humans in a scene: Learning affordance in 3d indoor environments. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
- [17] Priyanka Mandikal and Kristen Grauman. Learning dexterous grasping with object-centric visual affordances. In IEEE International Conference on Robotics and Automation (ICRA), 2021.
- [18] Kaichun Mo, Leonidas J. Guibas, Mustafa Mukadam, Abhinav Gupta, and Shubham Tulsiani. Where2act: From pixels to actions for articulated 3d objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6813–6823, October 2021.
- [19] Kaichun Mo, Yuzhe Qin, Fanbo Xiang, Hao Su, and Leonidas Guibas. O2O-Afford: Annotation-free large-scale object-object affordance learning. In Conference on Robot Learning (CoRL), 2021.
- [20] Kaichun Mo, Shilin Zhu, Angel X. Chang, Li Yi, Subarna Tripathi, Leonidas J. Guibas, and Hao Su. PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
- [21] Tushar Nagarajan and Kristen Grauman. Learning affordance landscapes for interaction exploration in 3d environments. In NeurIPS, 2020.
- [22] NVIDIA. NVIDIA PhysX.
- [23] L Peterson, David Austin, and Danica Kragic. High-level control of a mobile manipulator for door opening. In Proceedings. 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2000)(Cat. No. 00CH37113), volume 3, pages 2333–2338. IEEE, 2000.
- [24] Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413, 2017.
- [25] William Qi, Ravi Teja Mullapudi, Saurabh Gupta, and Deva Ramanan. Learning to move with affordance maps. arXiv preprint arXiv:2001.02364, 2020.
- [26] Yuzhe Qin, Rui Chen, Hao Zhu, Meng Song, Jing Xu, and Hao Su. S4g: Amodal single-view single-shot se (3) grasp detection in cluttered scenes. In Conference on robot learning, pages 53–65. PMLR, 2020.
- [27] Zengyi Qin, Kuan Fang, Yuke Zhu, Li Fei-Fei, and Silvio Savarese. Keto: Learning keypoint representations for tool manipulation. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 7278–7285. IEEE, 2020.
- [28] Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International conference on machine learning, pages 5331–5340. PMLR, 2019.
- [29] Joseph Redmon and Anelia Angelova. Real-time grasp detection using convolutional neural networks. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 1316–1322. IEEE, 2015.
- [30] Tanner Schmidt, Richard A Newcombe, and Dieter Fox. Dart: Dense articulated real-time tracking. In Robotics: Science and Systems, volume 2. Berkeley, CA, 2014.
- [31] Yu Sun, Shaogang Ren, and Yun Lin. Object–object interaction affordance learning. Robotics and Autonomous Systems, 62(4):487–496, 2014.
- [32] Dimitrios Tzionas and Juergen Gall. Reconstructing articulated rigged models from rgb-d videos. In European Conference on Computer Vision, pages 620–633. Springer, 2016.
- [33] Yusuke Urakami, Alec Hodgkinson, Casey Carlin, Randall Leu, Luca Rigazio, and Pieter Abbeel. Doorgym: A scalable door opening environment and baseline agent. Deep RL workshop at NeurIPS 2019, 2019.
- [34] Xiaogang Wang, Bin Zhou, Yahao Shi, Xiaowu Chen, Qinping Zhao, and Kai Xu. Shape2motion: Joint analysis of motion parts and attributes from 3d shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8876–8884, 2019.
- [35] Yijia Weng, He Wang, Qiang Zhou, Yuzhe Qin, Yueqi Duan, Qingnan Fan, Baoquan Chen, Hao Su, and Leonidas J. Guibas. Captra: Category-level pose tracking for rigid and articulated objects from point clouds. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 13209–13218, October 2021.
- [36] Ruihai Wu, Yan Zhao, Kaichun Mo, Zizheng Guo, Yian Wang, Tianhao Wu, Qingnan Fan, Xuelin Chen, Leonidas Guibas, and Hao Dong. VAT-mart: Learning visual action trajectory proposals for manipulating 3d ARTiculated objects. In International Conference on Learning Representations, 2022.
- [37] Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, and Hao Su. SAPIEN: A simulated part-based interactive environment. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
- [38] Zhenjia Xu, Jiajun Wu, Andy Zeng, Joshua B Tenenbaum, and Shuran Song. Densephysnet: Learning dense physical object representations via multi-step dynamic interactions. arXiv preprint arXiv:1906.03853, 2019.
- [39] Zhenjia Xu, He Zhanpeng, and Shuran Song. Umpnet: Universal manipulation policy network for articulated objects. IEEE Robotics and Automation Letters, 2022.
- [40] Zhihao Yan, Ruizhen Hu, Xingguang Yan, Luanmin Chen, Oliver van Kaick, Hao Zhang, and Hui Huang. RPM-NET: Recurrent prediction of motion and parts from point cloud. ACM Trans. on Graphics, 38(6):Article 240, 2019.
- [41] Lixin Yang, Xinyu Zhan, Kailin Li, Wenqiang Xu, Jiefeng Li, and Cewu Lu. Cpf: Learning a contact potential field to model the hand-object interaction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11097–11106, 2021.
- [42] Li Yi, Haibin Huang, Difan Liu, Evangelos Kalogerakis, Hao Su, and Leonidas Guibas. Deep part induction from articulated object pairs. ACM Trans. Graph., 37(6), Dec. 2018.
- [43] Wenhao Yu, Jie Tan, C Karen Liu, and Greg Turk. Preparing for the unknown: Learning a universal policy with online system identification. arXiv preprint arXiv:1702.02453, 2017.
- [44] Tony Z Zhao, Anusha Nagabandi, Kate Rakelly, Chelsea Finn, and Sergey Levine. Meld: Meta-reinforcement learning from images via latent state models. arXiv preprint arXiv:2010.13957, 2020.
- [45] Wenxuan Zhou, Lerrel Pinto, and Abhinav Gupta. Environment probing interaction policies. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
- [46] Yixin Zhu, Yibiao Zhao, and Song Chun Zhu. Understanding tools: Task-oriented object modeling, learning and recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2855–2864, 2015.
7 More Detailed Data Statistics

Table 3: Detailed data statistics.
Train-Cats | Box | Microwave | Door | Faucet | TrashCan | Kettle | Refrigerator | Switch | Cabinet | Window | Total
Train / Test | 20 / 8 | 9 / 3 | 23 / 12 | 65 / 19 | 52 / 17 | 22 / 7 | 32 / 11 | 53 / 17 | 270 / 75 | 40 / 18 | 586 / 187
Test-Cats | Table | Washer | Bucket | Pot | Safe | Total
Test | 95 | 16 | 36 | 23 | 29 | 199
ADDL Exp. | Category | Train data | Test data
Closed door | Cabinet | 74 | 11
Faucet | Faucet | 15 | 4
8 More Details about Method
In our setting, each subsequent interaction continues from the state left by the previous interactions. This does not actually matter for our method: we can take any distribution of test-time interactions in the training process without violating our design.
The input point cloud is always the current observation, and it may change if an interaction successfully moves the object.
9 More Experiment Settings
Details of Baselines
For the Ours-fps, Where2Act-interaction, and Where2Act-adaptation baselines, we augment the FPS method with an actionability score. In detail, we only select points whose actionability scores are higher than a preset threshold (e.g., 0.5). If there are not enough such points, the threshold is lowered until at least 50 points have actionability scores above it. For the Where2Act-adaptation baseline, we first train a network that estimates the similarity between points: for points $p$ and $q$, given their point features extracted by PointNet++ and the distance between them, the network outputs a similarity score $w_{p,q}$. An interaction acting on $q$ with an observed action score then influences point $p$ in proportion to $w_{p,q}$: both the actionability score of $p$ and the action score of an arbitrary action at $p$ are adjusted toward the observed interaction outcome, weighted by the similarity $w_{p,q}$. To train the similarity network, similar to our method, we use the ground-truth result of the action as the regression target.
More Baselines
We employ several additional baselines that use the FPS method to sample interaction points; the results further show the usefulness of the proposed AIP module of our framework.
- Ours-purefps: directly uses the FPS method to sample interaction points without using actionability scores;
- Ours-argfps: uses FPS augmented with actionability scores to select interaction points; when sampling a new point, we combine its distance to the already-sampled point set with its actionability score into a weighted distance for FPS.
10 More Results and Analysis
In Figures 8 and 9, we show more qualitative results; see the captions of these two figures for more details.
Table 4 shows the comparisons between the different FPS-based methods. In most cases, both Ours-argfps and Ours-fps achieve better results than Ours-purefps. This is because, in the Ours-purefps baseline, FPS only considers the 3D positions of points and discards the point features, whereas Ours-argfps and Ours-fps utilize the action scores, which are derived from point features, and thus achieve better results. Overall, our full framework achieves better performance than these baselines in most cases, which further demonstrates the effectiveness of our AIP module.
Table 4: Comparisons between different methods using FPS to sample interaction points.
Task | Method | F-score (%) | Sample-Succ (%)
pushing all (train cat.) | ours-purefps | 66.78/69.43/70.65 | 28.23/31.50/29.51
| ours-argfps | 66.78/69.43/70.65 | 28.23/31.50/29.51
| ours-fps | 64.32/69.58/70.99 | 26.22/27.30/30.65
| ours-final | 72.78/73.12/75.18 | 33.82/33.23/35.23
pushing all (test cat.) | ours-purefps | 66.35/66.55/67.19 | 34.15/32.60/35.06
| ours-argfps | 74.04/75.03/76.63 | 33.11/34.54/36.49
| ours-fps | 66.17/67.27/69.08 | 33.64/35.19/37.79
| ours-final | 77.58/77.63/78.42 | 34.97/36.75/37.40
pulling all (train cat.) | ours-purefps | 35.46/37.54/37.35 | 2.78/5.56/2.78
| ours-argfps | 35.46/37.54/37.35 | 3.89/4.44/6.11
| ours-fps | 39.88/42.74/43.55 | 2.78/5.56/4.44
| ours-final | 42.62/43.87/44.08 | 7.78/9.44/10.55
pulling all (test cat.) | ours-purefps | 43.60/48.91/47.36 | 6.96/5.22/3.91
| ours-argfps | 45.17/47.39/50.60 | 8.69/7.22/10.00
| ours-fps | 43.67/42.77/48.33 | 4.35/3.91/4.78
| ours-final | 49.51/50.00/51.33 | 5.21/7.39/10.45
pulling closed door | ours-purefps | 53.53/59.81/67.20 | 6.67/7.64/10.71
| ours-argfps | 58.42/62.31/68.72 | 8.94/11.25/13.75
| ours-fps | 59.79/63.43/69.13 | 8.88/11.33/12.10
| ours-final | 57.83/65.60/79.65 | 10.86/11.57/22.14
pushing faucet | ours-purefps | 73.39/79.13/79.85 | 61.88/76.59/72.50
| ours-argfps | 74.66/78.30/79.61 | 61.42/66.65/74.75
| ours-fps | 74.19/79.36/77.95 | 60.44/70.12/77.41
| ours-final | 77.42/83.06/83.83 | 65.90/81.66/82.14

