
Put the Bear on the Chair! Intelligent Robot Interaction
with Previously Unseen Chairs via Robot Imagination

Hongtao Wu, Xin Meng, Sipu Ruan, Gregory S. Chirikjian This work was supported by NUS Startup grants R-265-000-665-133, R-265-000-665-731, Faculty Board account C-265-000-071-001, and the National Research Foundation, Singapore under its Medium Sized Center for Advanced Robotics Technology Innovation (CARTIN) R-261-521-002-592.H. Wu, X. Meng, S. Ruan, and G. S. Chirikjian are with the Department of Mechanical Engineering, National University of Singapore, Singapore. H. Wu is also with the Laboratory for Computational Sensing and Robotics (LCSR), Johns Hopkins University, Baltimore, MD 21218, USA. Address all correspondence to G. S. Chirikjian: [email protected]
Abstract

In this paper, we study the problem of autonomously seating a teddy bear on a previously unseen chair. To achieve this goal, we present a novel method for robots to imagine the sitting pose of the bear by physically simulating a virtual humanoid agent sitting on the chair. We also develop a robotic system which leverages motion planning to plan SE(2) motions for a humanoid robot to walk to the chair and whole-body motions to put the bear on it. Furthermore, to cope with cases where the chair is not in an accessible pose for placing the bear, a human assistance module is introduced in which a human follows language instructions given by the robot to rotate the chair and make it accessible. We implement our method with a robot arm and a humanoid robot. We calibrate the proposed system with 3 chairs and test it extensively on 12 previously unseen chairs in both accessible and inaccessible poses. Results show that our method enables the robot to autonomously seat the teddy bear on the 12 previously unseen chairs with a very high success rate. The human assistance module is also shown to be very effective in changing the accessibility of the chair. Video demos and more details are available at https://chirikjianlab.github.io/putbearonchair/.

I INTRODUCTION

Object affordances describe the potential interactions with an object [1] and are related to the object's functionality. For example, a chair affords the functionality of sitting; thus, it possesses the sitting affordance. For robots operating in unstructured environments and interacting with previously unseen objects, e.g., personal robots, the understanding of object affordances is very desirable [2]: it allows robots to understand how to interact with objects efficiently and intelligently. Reasoning about object affordances can also help robots discover the potential of an object to afford a functionality, despite it not being a typical instance of the object class to which the functionality is related (e.g., the improvised chair assembled from books and boxes in Fig. 5(d)). Independently identifying affordances represents a higher level of robot intelligence.

Figure 1: Overview. (a) The robot imagines the sitting pose for a previously unseen chair and seats the bear on the chair. A robot arm with a mounted depth camera scans the chair and tracks the chair and the humanoid robot. The coordinate frame at the lower left corner indicates the base frame of the robot arm, which is used as the world frame throughout the paper. (b) Result of seating the bear on the chair. (c) The humanoid robot retracts its hands and straightens its body after seating the bear on the chair.

To address the problem of affordance reasoning, we introduced the interaction-based definition (IBD) which defines objects from the perspective of interactions instead of appearances in our previous work [3]. In particular, IBD defines a chair as “an object which can be stably placed on a flat horizontal surface in such a way that a typical human is able to sit (to adopt or rest in a posture in which the body is supported on the buttocks and thighs and the torso is more or less upright) on it stably above the ground”.

Figure 2: Pipeline. The chair is randomly placed in front of the robot arm for 3D scanning, which reconstructs the 3D model of the chair. The three snapshots in the sitting imagination show the results of three sitting trials; only the left one, marked with a check, is classified as a correct sitting. The instruction given by the robot in the human assistance module is "Please rotate the chair about the vertical axis counterclockwise for 90 degrees!"

In this paper, we go beyond our previous work. We propose a novel method for robots to imagine how to sit on a previously unseen chair, i.e., the sitting pose for the chair. We develop a robotic system, with a Franka Emika Panda robot arm and a NAO humanoid robot, to actualize the understanding of the sitting affordance by autonomously seating a teddy bear on the chair (Fig. 1). To "put the bear on the chair", the robot needs to carry the bear to the chair and seat the bear on it. To accomplish this task, the robot first needs to understand the sitting affordance of the chair, i.e., how the chair can be sat on. It also needs to plan a whole-body motion to seat the bear on the chair. Moreover, if the chair is in an inaccessible pose (e.g., when the chair is facing a wall), the robot should be able to reason about its accessibility and understand how to make it accessible. In this paper, we introduce a simple but efficient human assistance module with closed-loop visual feedback which enables the robot to instruct a human to rotate the chair and make it accessible for seating the bear. Fig. 2 shows the pipeline of our method. Results show that our method enables the robot to autonomously seat the bear on 12 previously unseen chairs with diverse shapes and appearances in 72 trials, covering both accessible and inaccessible chair poses, with a success rate of 94.4%. Our human assistance module also proves very effective: it successfully changes the accessibility of the chair for seating the bear with a 100% success rate in 36 trials with inaccessible poses. To our knowledge, this work is the first to physically seat an agent on a previously unseen chair in the real world. The aim of this work is to examine how well robots are able to accomplish this challenging task by leveraging different components of robotics. We envision promising future applications of our method for robots operating in household environments and interacting with humans.

This work differs from our previous work [3] in various ways. Rather than classifying whether a previously unseen object is a chair or not, in this work we assume the given object is known a priori to be a chair, since this can be determined with [3]. Also, unlike [3], which focuses on perceiving the sitting affordance without real-robot experiments, this work goes further by actualizing the understanding of affordances with physical experiments. The main contributions of the paper are as follows:

  • A method for robots to imagine the sitting pose for a previously unseen chair.

  • A human assistance module to enable the robot to reason about the accessibility of a chair and interact with a human to change the accessibility if necessary.

  • A robotic system which is able to autonomously seat a teddy bear on a previously unseen chair.

II RELATED WORK

Object Affordance of Chairs. Chairs are an important object class in human life, and the exploration of the sitting affordance has become popular in recent decades [4, 5, 6, 3, 7]. Hinkle and Olson [4] simulate dropping spheres onto objects and classify objects into containers, chairs, and tables based on the final configuration of the spheres. Grabner et al. [5] fit a humanoid mesh onto an object mesh and exploit the vertex distance and the triangle intersection to evaluate the object's affordance as a chair for classification. Instead of object classification, we use object affordances to understand how to interact with a chair.

Affordance Reasoning. There is growing interest in reasoning about object affordances in the fields of computer vision [8, 9, 10, 11, 12, 13, 14, 2, 15] and robotics [16, 17, 18, 19, 20, 21, 22, 2, 23]. Stoytchev [16] introduces an approach to ground tool affordances by dynamically applying different behaviors from a behavioral repertoire. [17, 18, 9, 10, 14] use convolutional neural networks (CNNs) to detect regions of affordance in an image. Ruiz and Mayol-Cuevas [11] predict affordance candidate locations in environments via the interaction tensor. In contrast to detecting different types of affordances, we focus on understanding the sitting affordance and use it in real-robot experiments.

Physics Reasoning. Physics reasoning has also been introduced to the study of object affordances [3, 19, 24, 25, 26, 27]. Battaglia et al. [24] introduce the concept of an intuitive physics engine which simulates physics to explain human physical perception of the world. Zhu et al. [26] employ physical simulations to infer forces exerted on humans and learn human utilities from videos. The digital twin [28] makes use of physical simulations to optimize operations and predict failures in manufacturing. We do not infer physics or use it to predict outcomes; instead, we exploit physics to imagine interactions with objects and thereby perceive object affordances.

III METHODS

III-A Problem Formulation

Throughout the paper, we assume the given unseen chair is upright and the back of the agent is supported while sitting. The virtual agent in the imagination is an articulated humanoid body (Fig. 3(a)). To seat the agent on the chair, we need to know 1) the sitting pose $g=(R,\mathbf{p})\in SE(3)$, where $R\in SO(3)$ and $\mathbf{p}\in\mathbb{R}^{3}$ specify the rotation and position of the agent's base link in the world frame, and 2) the joint angles of the agent. However, the real agent is a teddy bear whose joints have large damping coefficients. The joints can be considered fixed, and the joint angles are already close to those of a sitting configuration (Fig. 1(b)). Thus, in this paper, we simplify the problem to finding only $g$.

According to the interaction-based definition (IBD) of chairs (Sec. I), the torso is more or less upright when sitting. Thus, we further simplify the problem by restricting $R=R_{z}(\gamma)R_{0}$, where $R_{z}(\gamma)$ denotes the rotation about the z-axis of the world frame and $\gamma$ is the yaw angle of the base link. $R_{0}$ denotes the initial rotation which sets the agent to an upright sitting configuration with its face pointing towards the x-axis of the world frame. That is, given an unseen chair, the problem becomes finding the position $\mathbf{p}$ and the yaw angle $\gamma$ of the base link in the world frame for sitting. We denote the direction indicated by $\gamma$ in the xy-plane as the sitting direction. The base link of the agent is the pelvis link.
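For concreteness, a minimal sketch of assembling the simplified sitting pose from $\mathbf{p}$ and $\gamma$ is shown below; the helper names are illustrative, and only the yaw is free while $R_{0}$ fixes the upright, forward-facing orientation.

```python
import numpy as np

def rot_z(gamma):
    """Rotation about the world z-axis by yaw angle gamma (rad)."""
    c, s = np.cos(gamma), np.sin(gamma)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def sitting_pose(p, gamma, R0):
    """Assemble g = (R, p) in SE(3) with R = Rz(gamma) @ R0.

    p     : (3,) position of the agent's base (pelvis) link in the world frame
    gamma : yaw angle defining the sitting direction in the xy-plane
    R0    : initial rotation putting the agent in an upright sitting
            configuration facing the world x-axis
    """
    g = np.eye(4)
    g[:3, :3] = rot_z(gamma) @ R0
    g[:3, 3] = p
    return g
```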

III-B Revisit of Sitting Affordance Model (SAM)

In sitting imagination (Sec. III-C), we simulate a virtual humanoid agent sitting on the chair in physical simulations. The resultant configuration of the agent at the end of the simulation is denoted as $C_{res}$. The sitting affordance model (SAM) [3] evaluates whether $C_{res}$ is a correct sitting by comparing it with a key configuration $C_{key}$. We briefly review SAM here; further details can be found in [3].

The evaluation is based on four criteria. 1) Joint Angle Score. The joint angles of a configuration $C$ can be described with a vector $\boldsymbol{\theta}\in\mathbb{R}^{n}$, where $n$ denotes the total number of the agent's joints. The joint angle score is defined as $J=\sum_{i}w_{J}^{i}|\theta_{res}^{i}-\theta_{key}^{i}|$, where $w_{J}^{i}$ denotes the weight of the $i$-th joint and $\theta_{res}^{i}$ ($\theta_{key}^{i}$) is the $i$-th element of $\boldsymbol{\theta}_{res}$ ($\boldsymbol{\theta}_{key}$). Lower is better. 2) Link Rotation Score. According to IBD, the torso is more or less upright. Thus, SAM considers the link rotations in the world frame when evaluating $C_{res}$. The link rotation score is defined as $L=\sum_{i}w_{L}^{i}(1-\mathbf{z}_{res}^{i}\cdot\mathbf{z}_{key}^{i})$, where $w_{L}^{i}$ is the weight for the $i$-th link and $\mathbf{z}_{res}^{i}$ ($\mathbf{z}_{key}^{i}$) is the z-axis unit vector of the $i$-th link frame in $C_{res}$ ($C_{key}$). Lower is better. 3) Sitting Height. The sitting height $H$ is also an important factor in sitting, and SAM takes it into account in the evaluation of $C_{res}$. 4) Contact. SAM counts the number of contact points on the agent's links, which can be described with a vector $\mathbf{T}\in\mathbb{R}^{m}$, where $m$ denotes the total number of links.

$C_{res}$ is classified as a correct sitting if all of the following are satisfied: $J<J_{\max}$, $L<L_{\max}$, $H\in(H_{\min},H_{\max})$, and $\phi(\mathbf{T})>0$. $J_{\max}$, $L_{\max}$, $H_{\min}$, and $H_{\max}$ are four thresholds. $\phi(\cdot):\mathbb{R}^{m}\rightarrow\{0,1\}$ is a binary function which outputs 1 if 1) the total number of contact points is larger than a threshold and 2) the numbers of contact points on the lower-body and upper-body links are both larger than zero, and outputs 0 otherwise.
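A minimal sketch of the SAM scores and classification rule above, assuming the resultant and key configurations are already available as plain arrays (all names here are illustrative, not the authors' code):

```python
import numpy as np

def joint_angle_score(theta_res, theta_key, w_J):
    """J = sum_i w_J[i] * |theta_res[i] - theta_key[i]| (lower is better)."""
    return float(np.sum(w_J * np.abs(theta_res - theta_key)))

def link_rotation_score(z_res, z_key, w_L):
    """L = sum_i w_L[i] * (1 - z_res[i] . z_key[i]) over link z-axis unit vectors."""
    return float(np.sum(w_L * (1.0 - np.sum(z_res * z_key, axis=1))))

def is_correct_sitting(J, L, H, contacts_lower, contacts_upper,
                       J_max, L_max, H_min, H_max, contact_min):
    """Classify a resultant configuration C_res as a correct sitting."""
    total_contacts = contacts_lower + contacts_upper
    # phi(T) > 0: enough contacts overall, and both body halves touch the chair.
    phi = int(total_contacts > contact_min
              and contacts_lower > 0 and contacts_upper > 0)
    return J < J_max and L < L_max and H_min < H < H_max and phi > 0
```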

Figure 3: Sitting imagination. (a) Imagination setting. (b) OBB transformation $g_{obb}$. (c) and (d) show snapshots at the beginning and the end of the sitting imagination, respectively. The checks and crosses in (d) indicate whether each sitting is correct or not. (e) shows the chair used for imagination in (c) and (d). (f), (g), and (h) show the sitting poses obtained from three rotations $\alpha_{i}$; $N_{correct}=3$ for all of them. The black dot and the red arrow indicate the sitting position $\mathbf{p}$ and the yaw angle $\gamma$, respectively. (f) has the smallest value of $JL$ and is thus selected as the imagined sitting pose for the robot to seat the bear on the chair.

III-C Sitting Imagination

We physically simulate the agent sitting on the chair under different chair rotations $R_{chr}\in SO(3)$ to find $\mathbf{p}$ and $\gamma$ for sitting (Fig. 3). Given the 3D model of the chair, we first compute the minimum-volume oriented bounding box (OBB) [29]. As in [3], we then apply a rigid body transformation $g_{obb}=(R_{obb},\mathbf{p}_{obb})\in SE(3)$ to the model (Fig. 3(b)). $\mathbf{p}_{obb}$ horizontally translates the chair to align the OBB center with the origin of the world frame in the xy-plane. $R_{obb}=R_{z}(\gamma_{obb})$ rotates the chair about the z-axis to align the OBB with the coordinate axes of the world frame. We apply $g_{obb}$ because, heuristically, the back of the chair coincides with one of the OBB faces, which helps find correct sittings in the imagination. After applying $g_{obb}$, we take the world frame as the body frame of the chair.
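A minimal sketch of computing the OBB and an alignment transform $g_{obb}$ with the trimesh library cited above; the yaw extraction and the exact transform convention are simplifying assumptions for an upright chair, not the authors' implementation:

```python
import numpy as np
import trimesh

def compute_obb_alignment(mesh_path):
    """Compute g_obb that aligns the chair's OBB with the world frame.

    The returned 4x4 transform rotates the chair about the world z-axis so the
    OBB edges align with the world x/y axes and translates the OBB center to
    the world origin in the xy-plane.
    """
    mesh = trimesh.load(mesh_path)
    obb = mesh.bounding_box_oriented          # minimum-volume OBB primitive
    T_obb = obb.primitive.transform           # 4x4 pose of the OBB box

    # Yaw of the OBB in the xy-plane (heuristic: the chair is upright).
    yaw_obb = np.arctan2(T_obb[1, 0], T_obb[0, 0])
    center = T_obb[:3, 3]

    c, s = np.cos(-yaw_obb), np.sin(-yaw_obb)
    g_obb = np.eye(4)
    g_obb[:2, :2] = [[c, -s], [s, c]]
    # Horizontal translation only: the OBB center goes to the xy-origin.
    g_obb[:3, 3] = g_obb[:3, :3] @ np.array([-center[0], -center[1], 0.0])
    return g_obb

# The aligned model is obtained with mesh.apply_transform(g_obb).
```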

The rotation of the chair $R_{chr}$ in the simulation is enumerated by setting $R_{chr}=R_{z}(\alpha_{i})R_{obb}$ with $\alpha_{i}=i\pi/4,\ i=0,1,2,\ldots,7$ (Fig. 3(c)). Note that $R_{chr}=R_{z}(\alpha_{i})R_{obb}=R_{z}(\alpha_{i}+\gamma_{obb})$ is a rotation about the z-axis of the world frame. We drop the agent from above the chair to simulate sitting (Fig. 3(d)). Before the drop, the agent is set to a pre-sitting configuration facing the x-axis (Fig. 3(a)). The base link of the agent is placed on a plane 15 cm above the chair OBB. For each rotation, we first sample three drop positions on the plane: the origin and two positions translated by $\pm L_{sit}$ along the x-axis from the origin. $L_{sit}$ scales linearly with the size of the OBB. If no more than one correct sitting is found across all rotations, four extra drop positions are sampled: positions translated by $\pm 2L_{sit}$ and $\pm 3L_{sit}$ along the x-axis from the origin. We start sampling drops close to the origin of the plane, which is horizontally aligned with the OBB center, because most chairs have their seats positioned close to the OBB center. However, the seats of some chairs are close to the OBB periphery; thus, if not enough correct sittings can be found, our search expands towards the periphery. In total, we simulate 24 drops if no extra drops are needed and 56 drops otherwise. A sketch of this enumeration is given below.
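A minimal sketch of enumerating the chair yaws and drop positions described above; the offsets live in the drop plane 15 cm above the OBB, and the function names are illustrative:

```python
import numpy as np

def drop_candidates(L_sit, expand=False):
    """Enumerate (alpha_i, drop_offset) pairs for the sitting imagination.

    Eight chair yaws alpha_i = i*pi/4 are tried. Per yaw, drops start at the
    plane origin and +/- L_sit along x; when too few correct sittings are
    found, the search expands to +/- 2 L_sit and +/- 3 L_sit, giving
    8 x 3 = 24 or 8 x 7 = 56 drops in total.
    """
    alphas = [i * np.pi / 4 for i in range(8)]
    offsets = [0.0, L_sit, -L_sit]
    if expand:
        offsets += [2 * L_sit, -2 * L_sit, 3 * L_sit, -3 * L_sit]
    return [(alpha, np.array([dx, 0.0, 0.0]))   # offset in the drop plane
            for alpha in alphas for dx in offsets]
```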

The rotation with the largest number of correct sittings $N$ is selected as the best rotation $\alpha^{*}$ for sitting. If more than one rotation has the largest $N$ (Fig. 3(f)(g)(h)), we select the one with the smallest averaged value of $JL$ (lower is better). The sitting pose $g=(R,\mathbf{p})$ in the world frame is:

$$\mathbf{p}=g_{obb}^{-1}\circ\overline{\mathbf{p}'} \qquad (1)$$
$$R=R_{z}(\gamma)R_{0},\quad\gamma=\overline{\gamma'}-\gamma_{obb} \qquad (2)$$

where $g\circ\mathbf{p}'=R\mathbf{p}'+\mathbf{p}$ and the inverse is $g^{-1}\circ\mathbf{p}'=R^{-1}\mathbf{p}'-R^{-1}\mathbf{p}$; $\overline{\mathbf{p}'}$ and $\overline{\gamma'}$ are the weighted averages of the agent's position and yaw angle relative to the chair frame over all the correct sittings with the chair rotated by $\alpha^{*}$. The weight of each sitting is the reciprocal of its $JL$ value.
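A minimal sketch of Eqs. (1)-(2), assuming each correct sitting for the best rotation $\alpha^{*}$ is stored as a (position, yaw, JL) triple in the chair frame (names are illustrative):

```python
import numpy as np

def fuse_sitting_pose(correct_sittings, g_obb_inv_R, g_obb_inv_t, gamma_obb):
    """Fuse the correct sittings into one sitting pose (Eqs. (1)-(2)).

    correct_sittings : list of (p_chair, yaw_chair, JL) for the best rotation,
                       with p_chair / yaw_chair relative to the chair frame
    g_obb_inv_R/t    : rotation and translation of g_obb^{-1}
    gamma_obb        : yaw angle of R_obb
    """
    # Weights are reciprocals of JL, normalized to form a weighted average.
    weights = np.array([1.0 / jl for _, _, jl in correct_sittings])
    weights /= weights.sum()
    p_bar = sum(w * p for w, (p, _, _) in zip(weights, correct_sittings))
    gamma_bar = sum(w * yaw for w, (_, yaw, _) in zip(weights, correct_sittings))

    p = g_obb_inv_R @ p_bar + g_obb_inv_t   # Eq. (1): p = g_obb^{-1} o p_bar
    gamma = gamma_bar - gamma_obb           # Eq. (2)
    return p, gamma
```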

Figure 4: Motion Planning. (a) The motion of seating the bear on the chair. (b) After seating the bear, the robot releases the bear by opening its hands, retracts its hands, and straightens its body. (c) The setting of the $SE(2)$ planning in the real and rviz environments. The red dashed lines in the top figure indicate the planning arena in the real world. The yellow ellipsoid in the bottom figure shows the $SE(2)$ goal.

III-D Motion Planning

We assume the motion of seating the bear is quasi-static. Due to the complexity of the motion constraints, we formulate the planning of this motion as a trajectory optimization problem [30]:

$$\begin{aligned}
\underset{\mathbf{q},\dot{\mathbf{q}}}{\text{minimize}}\quad & J=\sum_{k=1}^{N}L(\mathbf{q}(k)) && (3)\\
\text{subject to}\quad & \text{Kinematics constraints} && (4)\\
& \text{Stability constraints} && (5)\\
& \text{Collision constraints} && (6)
\end{aligned}$$

where $L(\mathbf{q}(k))=\frac{1}{2}\lVert\mathbf{q}(k)-\mathbf{q}_{goal}\rVert_{Q}^{2}$. Kinematics constraints describe the state transition and limit the magnitudes of the joint values and control inputs. Stability constraints confine the horizontal projection of the robot's center of mass (COM) to the supporting polygon $\mathbf{S}$ formed by the feet. Collision constraints ensure that the trajectory is collision-free.

The initial configuration of the robot is a pre-defined standing posture (Fig. 4(a)). The goal configuration $\mathbf{q}_{goal}$ is generated via a constrained optimization [30]:

$$\begin{aligned}
\underset{\mathbf{q}}{\text{minimize}}\quad & w_{1}d_{\mathrm{COM}}(\mathbf{q})+w_{2}\mathrm{Torso}(\mathbf{q}) && (7)\\
\text{subject to}\quad & \text{Kinematics constraints} && (8)\\
& \text{Stability constraints} && (9)\\
& \text{Collision constraints} && (10)
\end{aligned}$$

The cost (Eq. (7)) aims to minimize 1) $d_{\mathrm{COM}}(\mathbf{q})$, the distance between the horizontal projection of the COM and the center of the supporting polygon $\mathbf{S}$, and 2) $\mathrm{Torso}(\mathbf{q})$, the bending angle of the torso. A less bent torso exerts less torque on the motors, making it easier to straighten the body after seating the bear. The kinematics constraints ensure that the bear sits in the imagined sitting pose $g$ when the robot is in the goal configuration $\mathbf{q}_{goal}$.
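A minimal sketch of the goal-configuration optimization (Eqs. (7)-(10)) using a generic SLSQP solver; the forward-kinematics, COM, support-polygon, and collision helpers are hypothetical stand-ins, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def solve_goal_configuration(q0, w1, w2, d_com, torso_bend, bear_pose_error,
                             com_support_margin, min_clearance,
                             joint_lower, joint_upper):
    """Find q_goal minimizing w1*d_COM(q) + w2*Torso(q) (Eq. (7)).

    d_com(q)              : distance of the COM projection to the support-polygon center
    torso_bend(q)         : bending angle of the torso
    bear_pose_error(q)    : residual between the bear's pose and the imagined pose g
    com_support_margin(q) : signed margin of the COM projection inside the support polygon
    min_clearance(q)      : smallest distance between the robot and obstacles
    """
    cost = lambda q: w1 * d_com(q) + w2 * torso_bend(q)
    constraints = [
        # Kinematics: the bear must end up in the imagined sitting pose g (Eq. (8)).
        {"type": "eq", "fun": bear_pose_error},
        # Stability: COM projection stays inside the supporting polygon (Eq. (9)).
        {"type": "ineq", "fun": com_support_margin},
        # Collision: keep a non-negative clearance along the body (Eq. (10)).
        {"type": "ineq", "fun": min_clearance},
    ]
    bounds = list(zip(joint_lower, joint_upper))  # joint limits
    res = minimize(cost, q0, method="SLSQP", bounds=bounds,
                   constraints=constraints)
    return res.x
```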

We use RRTConnect [31] in the OMPL library [32] to plan an $SE(2)$ trajectory for the robot to carry the bear to the chair. The robot is encapsulated with an ellipsoid for collision checking with the FCL library [33]. The setting is shown in Fig. 4(c).
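For illustration, a minimal sketch of $SE(2)$ planning with RRTConnect through OMPL's Python bindings; the arena bounds, time budget, and the validity check (e.g., wrapping an FCL ellipsoid-vs-chair query) are placeholders:

```python
from ompl import base as ob, geometric as og

def plan_se2(start_xyyaw, goal_xyyaw, is_state_valid, arena=1.0):
    """Plan an SE(2) path with RRTConnect; is_state_valid wraps collision checking."""
    space = ob.SE2StateSpace()
    bounds = ob.RealVectorBounds(2)
    bounds.setLow(-arena)
    bounds.setHigh(arena)              # square planning arena (placeholder size)
    space.setBounds(bounds)

    ss = og.SimpleSetup(space)
    ss.setStateValidityChecker(ob.StateValidityCheckerFn(is_state_valid))

    start, goal = ob.State(space), ob.State(space)
    start().setX(start_xyyaw[0]); start().setY(start_xyyaw[1]); start().setYaw(start_xyyaw[2])
    goal().setX(goal_xyyaw[0]);   goal().setY(goal_xyyaw[1]);   goal().setYaw(goal_xyyaw[2])
    ss.setStartAndGoalStates(start, goal)

    ss.setPlanner(og.RRTConnect(ss.getSpaceInformation()))
    if ss.solve(5.0):                  # planning time budget in seconds (placeholder)
        ss.simplifySolution()
        return ss.getSolutionPath()    # waypoints to follow with a PID controller
    return None
```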

IV SYSTEM PIPELINE

Fig. 2 shows the pipeline of our method. The robot arm first scans and reconstructs the 3D model of the chair (Sec. V-A). The sitting imagination is then conducted to find the imagined sitting pose $g$ (Sec. III-C). With $g$, we first determine the $SE(2)$ goal pose $s_{goal}$ for the NAO to walk to and place the bear. The rotation of $s_{goal}$ is set such that the NAO faces the opposite of the sitting direction indicated by $\gamma$. The position of $s_{goal}$ lies on a horizontal ray which originates from the projection of $\mathbf{p}$ onto the xy-plane and points along the sitting direction. It is initially set such that the NAO is $l_{init}$ away from $\mathbf{p}$. If the NAO is in collision with the chair, we move it along the ray until it is collision-free. After that, if the distance between the NAO and $\mathbf{p}$ is too large, we move $\mathbf{p}$ horizontally along the ray closer to the robot to satisfy the robot's workspace constraint. We then pre-plan the motion to seat the bear and check the validity of $s_{goal}$. To reduce the planning time, the motion is simplified to a bilaterally symmetric motion: the motion of the left half of the body mirrors that of the right half. The bear is held in the hands facing the sitting direction at the beginning of the motion (Fig. 4(a)). We restrict the motion to be an $SE(2)$ motion, in which only the pitch joints are activated, to maintain the bear's facing direction throughout the motion.
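A minimal sketch of computing the initial $SE(2)$ goal from the imagined sitting pose; the subsequent collision and workspace adjustments along the ray are omitted, and the helper name is illustrative:

```python
import numpy as np

def initial_se2_goal(p, gamma, l_init):
    """Initial SE(2) goal s_goal = (x, y, yaw) for the NAO.

    The goal lies on the horizontal ray from the projection of p along the
    sitting direction gamma, at distance l_init, with the NAO facing the
    opposite of the sitting direction.
    """
    direction = np.array([np.cos(gamma), np.sin(gamma)])
    xy = p[:2] + l_init * direction
    yaw = gamma + np.pi                      # face back towards the chair
    yaw = np.arctan2(np.sin(yaw), np.cos(yaw))  # wrap to (-pi, pi]
    return np.array([xy[0], xy[1], yaw])
```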

The $SE(2)$ trajectory to walk to $s_{goal}$ is then planned. If $s_{goal}$ is outside the planning arena or blocked by obstacles, no plan can be made. In this case, the NAO gives a language instruction asking a human to assist by rotating the chair about the vertical axis such that the sitting direction points towards the NAO with no obstacles in between (Fig. 2). The instruction is generated from a template: "Please rotate the chair about the vertical axis ${\tt direction}$ for ${\tt rotation\_angle}$ degrees". ${\tt direction}\in\{{\tt clockwise},{\tt counterclockwise}\}$. ${\tt rotation\_angle}\in\{\beta\,|\,\beta=i\cdot 30,\ i=0,1,2,\ldots,6\}$ is the multiple of 30 degrees closest to the precise rotation angle that would point the sitting direction towards the NAO. We use multiples of 30 degrees instead of the precise angle because they are easier for humans to understand. The pose of the chair is tracked by the iterative closest point (ICP) algorithm [34]. The transformation of the chair resulting from the human rotation is denoted as $g_{rot}$. The imagined sitting pose $g$ and the $SE(2)$ goal $s_{goal}$ are updated by transforming them with $g_{rot}$ after the rotation. The robot then tries to plan a trajectory to the updated $s_{goal}$. This process is repeated until a valid $SE(2)$ trajectory is found. In practice, we regard the trial as a failure if no $SE(2)$ trajectory can be found after three rotations. When a valid $SE(2)$ trajectory is found, the NAO is controlled to walk and follow the waypoints along the trajectory via a PID controller. The walking motion is controlled by the NAOqi SDK (https://developer.softbankrobotics.com/nao6/naoqi-developer-guide/naoqi-apis). The bear is passed to the NAO manually before it starts walking. When the NAO arrives at $s_{goal}$, a whole-body motion is planned to place the bear based on the robot's current pose. Finally, the robot executes the motion, releases the bear, and straightens its body (Fig. 4(b)).
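A minimal sketch of filling the instruction template, assuming the precise rotation angle (in degrees, signed, counterclockwise-positive) that would point the sitting direction at the NAO has already been computed from the tracked chair pose; the function name is illustrative:

```python
def make_rotation_instruction(precise_angle_deg):
    """Round the required chair rotation to the nearest multiple of 30 degrees
    and fill the language-instruction template."""
    direction = "counterclockwise" if precise_angle_deg >= 0 else "clockwise"
    # Nearest multiple of 30 degrees in {0, 30, ..., 180}.
    rotation_angle = min(round(abs(precise_angle_deg) / 30.0) * 30, 180)
    return ("Please rotate the chair about the vertical axis "
            f"{direction} for {rotation_angle} degrees!")

# Example: make_rotation_instruction(95.0)
# -> "Please rotate the chair about the vertical axis counterclockwise for 90 degrees!"
```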

V EXPERIMENTS

Fig. 1(a) shows the experiment setup. A PrimeSense Carmine 1.09 RGB-D camera is mounted on the robot arm. Besides scanning the chair, the camera is also used for tracking the chair and the NAO. The NAO is tracked with an ArUco tag attached on top of its head.

Figure 5: Data. (b) shows a captured depth image when the robot is at the capturing pose shown in (a). (c) shows three capturing poses of the 3D scanning. (d) shows all the chairs in the dataset. Among the chairs in the test set, two (indicated by the red box) are not typical chairs but still afford sitting: one is a step stool; the other is improvised by assembling books and boxes.

V-A 3D Scanning

In the experiment, the chair is placed randomly, in its upright pose, within a 50 cm × 50 cm square area in front of the robot arm. The robot arm moves to 9 pre-defined collision-free configurations to capture depth images of the scene (Fig. 5(a)(b)(c)). The pose of the camera at each view is obtained from the forward kinematics of the robot arm. This allows us to use TSDF Fusion [35] to densely reconstruct the scene and obtain the point cloud. The chair is segmented from the scene by plane segmentation with the PCL library [36].
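As an illustration of the scan-and-segment step (not the authors' implementation, which uses TSDF Fusion [35] and PCL [36]), a sketch using Open3D as a stand-in, assuming depth frames, camera intrinsics, and camera poses from forward kinematics are available:

```python
import open3d as o3d

def reconstruct_and_segment(rgbd_frames, extrinsics, intrinsic):
    """Fuse the captured frames into a TSDF volume and remove the support plane.

    rgbd_frames : list of o3d.geometry.RGBDImage captured at the 9 views
    extrinsics  : list of 4x4 world-to-camera transforms from forward kinematics
    intrinsic   : o3d.camera.PinholeCameraIntrinsic of the depth camera
    """
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=0.005, sdf_trunc=0.02,   # placeholder resolution
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)
    for rgbd, T in zip(rgbd_frames, extrinsics):
        volume.integrate(rgbd, intrinsic, T)
    pcd = volume.extract_point_cloud()

    # Remove the dominant plane (table/floor) and keep the chair points.
    _, inliers = pcd.segment_plane(distance_threshold=0.01,
                                   ransac_n=3, num_iterations=1000)
    chair = pcd.select_by_index(inliers, invert=True)
    return chair
```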

V-B Data

Our dataset contains 15 chairs with diverse shapes and appearances (Fig. 5(d)). They are designed for children aged 0 to 3. We choose small chairs because the size of the chair is restricted by the workspaces of the NAO and the robot arm: if the chair is too tall, the NAO will be too short to put the bear on it; if the chair is too large, the robot arm will not be able to scan it. We use 3 chairs (calibration set) to calibrate the simulation and the motion planning and control. The remaining 12 chairs (test set), which are unseen by the robot, are used for testing.

V-C Physical Simulation

PyBullet [37] is used as the physics engine for the sitting imagination. The chair and the virtual humanoid agent are imported from URDF files which specify the mass, COM, inertia matrix, friction coefficient, and joint properties. Collisions in the simulation are modelled as inelastic. The physical attributes of the chair are computed with Meshlab [38]. As the chairs in the dataset are designed for children, we set the height and weight of the agent in the simulation accordingly [39].
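A minimal sketch of one imagination drop in PyBullet, assuming chair and agent URDF files are available; the agent's pre-sitting configuration, SAM scoring, and the exact simulation parameters are omitted:

```python
import pybullet as p

def simulate_one_drop(chair_yaw, drop_pos, steps=480):
    """Drop the humanoid agent above the chair and return its final state."""
    p.connect(p.DIRECT)                      # headless physics simulation
    p.setGravity(0, 0, -9.81)

    chair_quat = p.getQuaternionFromEuler([0, 0, chair_yaw])
    chair = p.loadURDF("chair.urdf", [0, 0, 0], chair_quat, useFixedBase=True)
    agent = p.loadURDF("agent.urdf", drop_pos)  # drop plane 15 cm above the OBB

    for _ in range(steps):                   # let the agent settle on the chair
        p.stepSimulation()

    base_pos, base_orn = p.getBasePositionAndOrientation(agent)
    contacts = p.getContactPoints(bodyA=agent, bodyB=chair)
    p.disconnect()
    return base_pos, base_orn, len(contacts)  # fed to SAM for classification
```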

VI RESULTS & DISCUSSIONS

Figure 6: Result: Accessible. (a) Snapshot of the chair. (b) Imagined sitting pose. (c) The robot walks to the chair. (d) and (e) show the beginning and the end of the motion of seating the bear, respectively. (f) and (g) show the motion of retracting the hands and straightening the body after seating the bear. (h) Results.
Figure 7: Result: Inaccessible. (a) Snapshot of the chair. (b) Imagined sitting pose. (c) Before human assistance. (d) The human obeys/disobeys the instruction given by the robot to rotate the chair. (e) After the first rotation. In the experiments of Human Obeys, the chair is already accessible after the first rotation. (f) After the last rotation. (g) and (h) show the beginning and the end of seating the bear on the chair, respectively. (i) Results.

We implement our method with the Robot Operating System (ROS) on a computer with an Intel Core i9-10920X CPU @ 3.5 GHz. Our unoptimized implementation takes about 70 s, 4 s, and 29 s for 3D scanning and reconstruction, sitting imagination, and motion planning, respectively. The walking and seating motions take about 67 s and 24 s, respectively.

Accessible. In the first set of experiments, the chair is placed such that it is accessible for seating the bear: the sitting direction points towards the NAO with a deviation within $\pm 30$ degrees. No human assistance is needed in this case. For each chair in the test set, we place it in 3 different poses for testing (36 trials in total) (Fig. 6).

Inaccessible + Human Obeys. In the second set of experiments, the chair is placed such that it is inaccessible: the sitting direction points either 1) towards the robot arm or 2) towards the edges of the planning arena (Fig. 7(a)). In both cases, no valid $SE(2)$ trajectory can be found in the initial configuration, and human assistance is needed to rotate the chair and make it accessible. We recruit 6 volunteers to participate in the experiments. For each chair in the test set, we place it in 2 different inaccessible poses (24 trials in total). We ask the human to obey the instructions given by the NAO throughout the whole trial (Fig. 7).

Inaccessible + Human Disobeys. There exist many uncertainties in human assistance (e.g., the human is distracted or misunderstands the instruction), which may leave the chair inaccessible even after a rotation. In this set of experiments, we test the robustness of our method against these uncertainties. For each chair in the test set, we place it in an inaccessible pose as in the Inaccessible + Human Obeys setting (12 trials in total). We ask the human to deliberately disobey the first instruction given by the NAO and obey the following ones (Fig. 7).

We recruit 15 annotators to annotate the experiment results; each trial is annotated by five different annotators. For each trial, we show them the experiment video and images of the bear at the end of the trial. The annotator is then asked 1) "Do you think the robot has been successful in seating the bear on the chair?" For trials with human assistance, we also ask 2) "Is the chair accessible after the human assistance?" The results on the test set are shown in Tab. I.

(1) Seating Bear Success (number of trials with the given number of positive annotations out of 5)

| Experiment | Trial Num. | = 5 | ≥ 4 | ≥ 3 |
| Accessible | 36 | 26 | 34 | 34 |
| Inaccessible + Human Obeys | 24 | 18 | 22 | 23 |
| Inaccessible + Human Disobeys | 12 | 10 | 10 | 11 |
| Total | 72 | 54 | 66 | 68 |

(2) Chair Accessible After Human Assistance (number of trials with the given number of positive annotations out of 5)

| Experiment | Assist. Num. | = 5 | ≥ 4 | ≥ 3 |
| Inaccessible + Human Obeys | 23 | 23 | 23 | 23 |
| Inaccessible + Human Disobeys | 11 | 10 | 11 | 11 |
| Total | 34 | 33 | 34 | 34 |

TABLE I: Experiment Results on the Test Set.

In part (1) of Tab. I, we show the number of trials with at least 5, 4, and 3 positive annotations to the first question. We count a trial as successful if the sitting imagination found a sitting pose and more than half of the 5 annotations (i.e., at least 3) are positive. In general, we achieve a very high success rate of 94.4% on the 12 unseen chairs in the test set (68 out of 72). The step stool accounts for all 4 failure cases in the 72 trials: the sitting imagination fails to find a sitting pose because the depth of its seat is too small. A successful trial with the step stool is shown in Fig. 6. Remarkably, the success rate over all 6 trials with the improvised chair is 100%. This shows the promising potential of our imagination method to discover that an object affords sitting despite not being a typical chair.

In part (2), we show the total number of trials with human assistance and the number of trials with at least 5, 4, and 3 positive annotations to the second question. The goal of human assistance is to make the chair accessible for placing the bear, so we evaluate the accessibility of the chair at the end of human assistance. We only count trials in which human assistance is involved, i.e., we disregard the failure trials in which the sitting imagination fails to find a sitting pose and no instructions are given. We consider the chair accessible if more than half of the annotations are positive. The results show that our human assistance module is very effective: it achieves a 100% success rate in making the chair accessible in all the trials with successful sitting imagination.

VII CONCLUSIONS & FUTURE WORK

We propose a novel method for robots to imagine the sitting pose for a previously unseen chair. We develop a robotic system which is able to seat a teddy bear on the chair autonomously via robot imagination. Moreover, we introduce a module for robots to ask a human to assist with changing the accessibility of the chair when it is in an inaccessible pose. Results show that our method enables the robot to seat the bear on 12 previously unseen chairs in 72 trials with a very high success rate. The human assistance module is also shown to be very effective in making the chair accessible from inaccessible poses. Future work can focus on exploring more versatile whole-body motion planning methods for seating the bear on chairs.

ACKNOWLEDGMENT

The authors thank Yuanfeng Han for discussions, all the volunteers for the human assistance in the experiments, and all the annotators for the human annotations.

References

  • [1] J. J. Gibson, The Ecological Approach to Visual Perception.   Boston, MA, USA: Houghton Mifflin, 1979.
  • [2] L. Bogoni and R. Bajcsy, “Interactive recognition and representation of functionality,” Computer Vision and Image Understanding, vol. 62, no. 2, pp. 194–214, 1995.
  • [3] H. Wu, D. Misra, and G. S. Chirikjian, “Is That a Chair? Imagining Affordances Using Simulations of an Articulated Human Body,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 7240–7246.
  • [4] L. Hinkle and E. Olson, “Predicting object functionality using physical simulations,” in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2013, pp. 2784–2790.
  • [5] H. Grabner, J. Gall, and L. Van Gool, “What makes a chair a chair?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1529–1536.
  • [6] V. Seib, M. Knauf, and D. Paulus, “Detecting Fine-grained Sitting Affordances with Fuzzy Sets.” in Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2016), 2016, pp. 289–298.
  • [7] E. Bar-Aviv and E. Rivlin, “Functional 3D Object Classification Using Simulation of Embodied Agent,” in Proceedings of the British Machine Vision Conference (BMVC), 2006, pp. 307–316.
  • [8] A. Myers, C. L. Teo, C. Fermüller, and Y. Aloimonos, “Affordance detection of tool parts from geometric features,” in 2015 IEEE International Conference on Robotics and Automation (ICRA), 2015.
  • [9] J. Sawatzky, A. Srikantha, and J. Gall, “Weakly Supervised Affordance Detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2795–2804.
  • [10] A. Roy and S. Todorovic, “A Multi-scale CNN for Affordance Segmentation in RGB Images,” in European Conference on Computer Vision (ECCV).   Springer, 2016, pp. 186–201.
  • [11] E. Ruiz and W. Mayol-Cuevas, “Geometric Affordance Perception: Leveraging Deep 3d Saliency With the Interaction Tensor,” Frontiers in Neurorobotics, vol. 14, p. 45, 2020.
  • [12] Y. Zhu, A. Fathi, and L. Fei-Fei, “Reasoning about Object Affordances in a Knowledge Base Representation,” in European Conference on Computer Vision (ECCV), 2014, pp. 408–424.
  • [13] S.-B. Ho, “Representing and using functional definitions for visual recognition,” Ph.D. dissertation, The University of Wisconsin-Madison, 1987.
  • [14] C. Desai and D. Ramanan, “Predicting Functional Regions on Objects,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013, pp. 968–975.
  • [15] A. Bansal, S. S. Rambhatla, A. Shrivastava, and R. Chellappa, “Detecting human-object interactions via functional generalization,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 10 460–10 469.
  • [16] A. Stoytchev, “Behavior-Grounded Representation of Tool Affordances,” in 2005 IEEE International Conference on Robotics and Automation (ICRA), 2005, pp. 3060–3065.
  • [17] T.-T. Do, A. Nguyen, and I. Reid, “AffordanceNet: An End-to-End Deep Learning Approach for Object Affordance Detection,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 5882–5889.
  • [18] F.-J. Chu, R. Xu, and P. A. Vela, “Learning Affordance Segmentation for real-World Robotic Manipulation via Synthetic Images,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1140–1147, 2019.
  • [19] H. Wu and G. S. Chirikjian, “Can I pour into it? Robot Imagining Open Containability Affordance of Previously Unseen Objects via Physical Simulations,” IEEE Robotics and Automation Letters, vol. 6, no. 1, pp. 271–278, 2020.
  • [20] P. Abelha and F. Guerin, “Learning how a tool affords by simulating 3D models from the web,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 4923–4929.
  • [21] L. Piyathilaka and S. Kodagoda, “Affordance-map: Mapping human context in 3D scenes using cost-sensitive SVM and virtual human models,” in 2015 IEEE International Conference on Robotics and Biomimetics (ROBIO), 2015, pp. 2035–2040.
  • [22] R. Bajcsy, “Active perception,” Proceedings of the IEEE, vol. 76, no. 8, pp. 966–1005, 1988.
  • [23] H. Kjellström, J. Romero, and D. Kragić, “Visual object-action recognition: Inferring object affordances from human demonstration,” Computer Vision and Image Understanding, vol. 115, no. 1, pp. 81–90, 2011.
  • [24] P. W. Battaglia, J. B. Hamrick, and J. B. Tenenbaum, “Simulation as an engine of physical scene understanding,” Proceedings of the National Academy of Sciences, vol. 110, no. 45, pp. 18 327–18 332, 2013.
  • [25] Y. Zhu, Y. Zhao, and S.-C. Zhu, “Understanding Tools: Task-Oriented Object Modeling, Learning and Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [26] Y. Zhu, C. Jiang, Y. Zhao, D. Terzopoulos, and S.-C. Zhu, “Inferring Forces and Learning Human Utilities from Videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3823–3833.
  • [27] L. Kunze and M. Beetz, “Envisioning the qualitative effects of robot manipulation actions using simulation-based projections,” Artificial Intelligence, vol. 247, pp. 352–380, 2017.
  • [28] S. Boschert and R. Rosen, “Digital Twin—The Simulation Aspect,” in Mechatronic Futures, 2016, pp. 59–74.
  • [29] M. Dawson-Haggerty et al., “Trimesh,” https://trimsh.org/, 2019–2021.
  • [30] Y. Han, R. Li, and G. S. Chirikjian, “Can I lift it? Humanoid robot reasoning about the feasibility of lifting a heavy box with unknown physical properties,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 3877–3883.
  • [31] J. J. Kuffner and S. M. LaValle, “RRT-Connect: An efficient approach to single-query path planning,” in 2000 IEEE International Conference on Robotics and Automation (ICRA), vol. 2, 2000, pp. 995–1001.
  • [32] I. A. Şucan, M. Moll, and L. E. Kavraki, “The Open Motion Planning Library,” IEEE Robotics & Automation Magazine, vol. 19, no. 4, pp. 72–82, 2012.
  • [33] J. Pan, S. Chitta, and D. Manocha, “FCL: A general purpose library for collision and proximity queries,” in 2012 IEEE International Conference on Robotics and Automation (ICRA), 2012, pp. 3859–3866.
  • [34] P. J. Besl and N. McKay, “A method for registration of 3-D shapes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 2, pp. 239–256, 1992.
  • [35] B. Curless and M. Levoy, “A Volumetric Method for Building Complex Models from Range Images,” in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, 1996, pp. 303–312.
  • [36] R. B. Rusu and S. Cousins, “3D is here: Point Cloud Library (PCL),” in 2011 IEEE International Conference on Robotics and Automation (ICRA), 2011, pp. 1–4.
  • [37] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” http://pybullet.org, 2016–2021.
  • [38] P. Cignoni, M. Callieri, M. Corsini, M. Dellepiane, F. Ganovelli, and G. Ranzuglia, “Meshlab: an Open-Source Mesh Processing Tool.” in Eurographics Italian Chapter Conference, vol. 2008, 2008, pp. 129–136.
  • [39] M. D. Onis, “The WHO child growth standards.” Pediatric nutrition in practice, pp. 254–269, 2008.