
Flipbot: Learning Continuous Paper Flipping via Coarse-to-Fine Exteroceptive-Proprioceptive Exploration

Chao Zhao*1, Chunli Jiang*1, Junhao Cai1, Michael Yu Wang1,2, Hongyu Yu1,2, and Qifeng Chen1 *Authors with equal contribution. 1The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong.2HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute, Futian, Shenzhen.
Abstract

This paper tackles the task of singulating and grasping paper-like deformable objects, which we refer to as paper-flipping. In contrast to deformable objects that lack compression strength (such as shirts and ropes), minor variations in the physical properties of paper-like deformable objects significantly impact the outcome, making manipulation highly challenging. Here, we present Flipbot, a novel solution for flipping paper-like deformable objects. Flipbot allows the robot to capture object physical properties by integrating the exteroceptive and proprioceptive perception that are indispensable for manipulating deformable objects. Furthermore, by incorporating a proposed coarse-to-fine exploration process, the system learns the optimal control parameters for effective paper-flipping from proprioceptive and exteroceptive inputs. We deploy our method on a real-world robot with a soft gripper and train it in a self-supervised manner. The resulting policy demonstrates the effectiveness of Flipbot on paper-flipping tasks in various settings beyond the reach of prior studies, including but not limited to flipping pages throughout a book and emptying a box of paper sheets. The code is available at https://robotll.github.io/Flipbot/

I Introduction

Deformable object manipulation has achieved notable progress in robotics. However, robots still cannot match the generalization and robustness of humans in manipulating thin and flexible objects. One such task is flipping book pages, as shown in Fig. 1, which requires singulating and grasping paper page by page. Humans can briskly turn the pages of a book by watching the target and using the tactile sensations on their fingertips to adjust their actions. In this process, humans instinctively combine exteroceptive and proprioceptive perception to accommodate the irregular paper thickness and physical properties, such as slipperiness, stiffness, and friction. Endowing robots with such a capability remains a grand challenge in robotics.

One of the foremost challenges in manipulating thin and flexible objects is incomplete and noisy perception [1]. For example, a stack of paper is unstable, and the contact between each layer is not observable. Therefore, the robot may have to perceive physical properties between sheets of paper, such as friction and elasticity, to successfully singulate and grasp a sheet from a stack. Exteroceptive perception obtained from camera sensors is incomplete for such tasks and unreliable in real-world conditions. Depth sensors, which most existing works rely on, cannot distinguish the different layers of stacked paper due to the small thickness of the paper. Depth sensors are also inherently incapable of capturing the surface's physical properties, such as hardness and flexibility [2]. Some works use tactile sensors as proprioception to estimate deformable objects' physical properties. For example, [3] uses a high-precision tactile sensor to measure the geometry of the contact surface and the object's hardness. [4] manipulates cables with a pair of robotic grippers using real-time tactile feedback. Nevertheless, high-precision tactile sensors are often expensive and require specific finger shapes to fit. In addition to the challenge of environment perception, manipulating thin and flexible objects may require a gripper with the dexterity and compliance of human fingers, which further adds to the difficulty [5].

Figure 1: A soft gripper with the learned policy flips a book. The time-lapse image depicts the operation of the gripper as it interacts with the book to singulate and grasp a piece of paper. The cropped depth image in the red box at the upper right corner shows the exteroceptive observation from the depth camera. The readings at the bottom right show the proprioceptive observation from the force-torque sensor.

To address the above challenges, we present Flipbot, a self-supervised method for singulating and grasping paper-like deformable objects with unprecedented robustness, enabling continuous paper flipping. At its core, Flipbot is a principled solution that integrates exteroceptive and proprioceptive perception into policy learning. We obtain proprioception from Force/Torque (F/T) sensor readings and exteroception from a depth camera. We use a procedural motion, referred to as "Swipe", to actively interact with the environment. When a "Swipe" motion is applied to a piece of paper, the deformation brought about by the interaction between the finger and the object reveals imperceptible physical characteristics such as mass, flexural rigidity, and friction. Meanwhile, visual observation provides global information about the environment. We design a cross-sensory encoder to integrate exteroceptive and proprioceptive perception into an implicit state representation. The encoder is trained end-to-end in a self-supervised manner as part of policy learning. By incorporating exteroceptive-proprioceptive information into policy learning, the robot is able to discover the optimal policy for paper-flipping through continuous exploration. Furthermore, since the reward signal for policy learning is derived from visual observation, Flipbot is fully trained by self-exploration without human demonstrations or annotations.

The primary contribution of this work is the proposed new approach, Flipbot, for singulating and grasping paper-like objects. It achieves substantial improvements over prior studies while maintaining exceptional robustness. Our extensive experiments show that Flipbot is able to perform page-flipping from the beginning to the end of a book accurately and consistently, and it exhibits remarkable zero-shot generalization under conditions never encountered during training: novel paper materials such as coated and plastic paper, and novel tasks such as emptying a box filled with paper.

II Related Work

Deformable object manipulation is a long-standing challenge in robotics. Conventional analytic approaches rely on modeling object dynamics and then using model predictive control [6] or trajectory optimization [7] for manipulation. However, analytic approaches require substantial prior knowledge of the geometry and physical properties of the object [1]. For example, [8] presents an approach for manipulating a thin deformable object by analyzing the object's internal energy exchange with respect to object poses, and [9] proposes a closed-loop shape control method utilizing visual markers, which limits its generality. Moreover, the high-dimensional state representation and complex dynamics of deformable objects pose additional challenges to generalizing to novel objects and environments.

Recently, learning-based methods have become increasingly popular alternatives for deformable object manipulation. Most works [10, 11] learn object dynamics from visual features rather than explicitly modeling physical processes. For example, [12] encodes visual observations into a latent space with self-supervision, followed by model-based planning. Another line of work defines a set of primitives for deformable object manipulation and learns a mapping from images to the predefined primitives [13]. Such an image-to-primitive formulation has been applied across various tasks, including manipulating ropes [14], smoothing fabric [15], and blowing bags [16]. However, the physical information of the environment, which deformable object manipulation necessitates, is challenging to obtain from visual perception. In this regard, [17] estimates the physical properties of fabric materials with a high-resolution tactile sensor, GelSight [18]. Further, [4] proposes an approach to manipulate a cable based on tactile feedback without vision. [19] employs tactile sensors and manually collected data to train a classifier that differentiates between towels of one to three layers; a heuristic approach then repeatedly attempts to grasp specific layers of towels based on the classifier's predictions. Nevertheless, tactile sensors alone can hardly provide global information about the environment, which inevitably restricts the range of manipulation or requires prior knowledge of objects.

More recently, a small number of papers have explored the use of soft grippers in deformable object manipulation; soft grippers are known for grasping objects without high-precision control [8, 20]. The authors of [21] demonstrated a soft gripper system capable of handling a wide range of food products by reconfiguring the fingers into different poses. In addition, [5] quantitatively shows that the compliance of a soft gripper can facilitate the manipulation of thin deformable objects.

Compared with the above studies, our approach, Flipbot, incorporates exteroceptive and proprioceptive feedback in deformable object manipulation rather than relying on a single perception source. Flipbot thus combines the best of both worlds: the global information about the environment afforded by exteroception and the local information about physical properties afforded by proprioception. Flipbot also leverages the compliance of a soft pneumatic gripper to perform dexterous behavior. The resulting policy enables a real robot to perform a variety of tasks that surpass prior published work in deformable object manipulation.

Figure 2: System overview. We train the policy with SAC in the real world and follow a coarse-to-fine exploration process to obtain exteroception and proprioception. First, the camera captures a depth image, and a cropped area is used as the exteroceptive observation. Next, the soft finger performs a "Swipe" on the paper, and the force $(f_x, f_y, f_z)$ and torque $(m_x, m_y, m_z)$ readings from the F/T sensor are used as proprioception. The RL agent receives the observations, predicts the actions to be executed by the robot, and receives a reward based on the change in page numbers.

III Method

Flipbot aims to empower robots to effectively singulate and grasp thin and flexible objects through exteroceptive and proprioceptive perception. Our key insight is that the global information about positions and shapes provided by vision and the local information about contact and force provided by proprioceptive perception are both indispensable for manipulating deformable objects like paper. Moreover, fusing proprioception and exteroception reveals physical information that helps robots better explore and make decisions. The overview of Flipbot is shown in Fig. 2.

First, we utilize a simple soft gripper for manipulation (see Fig. 4(c)). The natural compliance of the soft gripper provides unique benefits for manipulating thin and flexible objects while avoiding damage to the object. Another advantage is that, compared with fully actuated rigid grippers, the soft gripper has a more straightforward actuation strategy for movements such as bending the fingers.

Then, we use a coarse-to-fine exploration process to obtain unobservable physical information about deformable objects. In this process, the depth camera first provides a rough observation of the object. We then use a procedural motion, "Swipe", and an F/T sensor to monitor the object's state. One advantage of using an F/T sensor instead of a tactile sensor is that the force sensor can be assembled seamlessly with soft hands without a specific finger design.

Last, we use a cross-sensory encoder to fuse proprioception and exteroception, and we use model-free reinforcement learning (RL) to learn the policy, which avoids explicitly modeling the diverse and frequent transitions in the contact state between the object and the soft hand.

III-A Problem Formulation

We formulate the paper-flipping problem as a Markov Decision Process (MDP). An MDP consists of four components: a state space $S$, an action space $A$, a reward function $R(s_t, s_{t+1})$, and a transition probability $P(s_{t+1} \mid s_t, a_t)$. In our framework, an agent uses a policy $\pi(a_t \mid s_t)$ to select an action $a_t$ for controlling the robot and receives a reward $r_t$. The goal of the reinforcement learning framework is to obtain the optimal policy $\pi^{*}$, which maximizes the expected discounted sum of rewards over a finite time horizon. To achieve this objective, we use the Soft Actor-Critic (SAC) algorithm [22] for training. SAC learns an actor network that maps observations to actions and a critic network that estimates the expected future rewards.
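For reference, SAC optimizes an entropy-regularized version of this discounted return; with discount factor $\gamma$ and temperature $\alpha$ (training hyperparameters not reported here), its objective can be written as

$$\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\Bigl(r_{t} + \alpha\,\mathcal{H}\bigl(\pi(\cdot \mid s_{t})\bigr)\Bigr)\right],$$

where $\mathcal{H}$ denotes the policy entropy [22].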

III-B Observations via Coarse-to-Fine Exploration

The state is defined as $s_t = (o_{tv}, o_{tf})$, where $o_{tv}$ is the exteroceptive observation and $o_{tf}$ is the proprioceptive observation, as shown in Fig. 2. We deploy a two-step coarse-to-fine exploration procedure to obtain $o_{tv}$ and $o_{tf}$. First, a wrist-mounted camera captures the environment's point cloud $p_t$ from above and converts it to a depth image. We then crop the depth image with a $60 \times 60$ window to obtain the exteroceptive observation $o_{tv}$. Next, we perform an exploratory "Swipe" motion to obtain physical information about the contact surface between the paper and the finger. The robot first descends by a distance, computed from the point cloud $p_t$, so that the finger of the soft hand diagonally approaches the surface at the top-right corner of the paper. Then, we apply a positive air pressure to the soft gripper so that the fingers touch and interact with the paper. After this process, we record the F/T sensor readings, including the forces $(f_x, f_y, f_z)$ along the $x$, $y$, $z$ axes and the three simultaneous torques $(m_x, m_y, m_z)$ about the same axes. Thus, the proprioceptive observation $o_{tf}$ is defined as the tuple $(f_x, f_y, f_z, m_x, m_y, m_z)$. Fig. 3 shows the forces and torques after "Swipe" on different pages of the book. By incorporating an F/T sensor and an exploratory action, $o_{tf}$ latently contains rich information about the contact state between the fingers and the object, such as gripper-object friction.
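To make the two-step procedure concrete, the minimal sketch below assembles the observation $s_t = (o_{tv}, o_{tf})$; the sensor and robot interfaces (depth_image, robot, gripper, ft_sensor) are hypothetical placeholders, not the actual drivers used in our system.

```python
# Minimal sketch of assembling s_t = (o_tv, o_tf); hardware interfaces are
# hypothetical placeholders, not the authors' implementation.
import numpy as np

CROP = 60  # 60 x 60 crop window used as the exteroceptive observation

def get_exteroception(depth_image, center_uv):
    """Crop a 60x60 window (o_tv) around the paper corner in the depth image."""
    u, v = center_uv
    return depth_image[u - CROP // 2:u + CROP // 2,
                       v - CROP // 2:v + CROP // 2]

def get_proprioception(robot, gripper, ft_sensor, descend_dist):
    """Run the exploratory 'Swipe' and read (fx, fy, fz, mx, my, mz) as o_tf."""
    robot.move_down(descend_dist)        # distance computed from the point cloud
    gripper.set_pressure(positive=True)  # fingers touch and swipe the paper
    return np.asarray(ft_sensor.read(), dtype=np.float32)  # 6-D F/T reading
```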

Figure 3: Visualization of forces and torques after "Swipe" on different pages of the book.

III-C Action and Reward

After the coarse-to-fine exploration procedure, the robot predicts an action based on the observations to singulate and grasp the paper. The action includes a gripper displacement, denoted as $(x_t, z_t, \theta_t)$, as shown in Fig. 4(a). The gripper displacement refers to the relative difference between the current pose after the "Swipe" exploration procedure and the desired one. Specifically, $x_t \in [-6\,\mathrm{mm}, 6\,\mathrm{mm}]$ is the relative displacement along the line $\alpha$ connecting the two fingertips, where $\alpha$ lies in the longitudinal plane $A$ formed by the two fingers. $\theta_t \in [0\degree, 3\degree]$ is the orientation of the gripper about the normal $\beta \perp A$. $z_t \in [-6\,\mathrm{mm}, 6\,\mathrm{mm}]$ is the relative displacement along the line $\gamma$, where $\gamma \perp (\alpha \times \beta)$. Furthermore, an additional action component $\Lambda$ governs the closing or opening of the gripper; operationally, we control the gripper aperture by commanding the pressure change. Thus, the action is formally defined as $a_t = (x_t, z_t, \theta_t, \Lambda)$, where each coordinate of the action is discretized based on the characteristics of the workspace.
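As an illustration of the bounded, discretized action, the sketch below maps a normalized policy output in $[-1, 1]^4$ onto $a_t = (x_t, z_t, \theta_t, \Lambda)$; the ranges follow the text above, while the discretization step sizes are our own illustrative assumptions.

```python
# Sketch of mapping a normalized policy output onto a_t = (x_t, z_t, theta_t, Lambda).
# Ranges follow the text; step sizes are illustrative assumptions.
import numpy as np

X_RANGE = (-6.0, 6.0)      # mm, displacement along alpha
Z_RANGE = (-6.0, 6.0)      # mm, displacement along gamma
THETA_RANGE = (0.0, 3.0)   # deg, rotation about beta

def to_action(raw, x_step=1.0, z_step=1.0, theta_step=0.5):
    """Scale raw outputs in [-1, 1] to workspace units and snap to a grid."""
    def scale(v, lo, hi):
        return lo + (v + 1.0) * 0.5 * (hi - lo)

    x = np.round(scale(raw[0], *X_RANGE) / x_step) * x_step
    z = np.round(scale(raw[1], *Z_RANGE) / z_step) * z_step
    theta = np.round(scale(raw[2], *THETA_RANGE) / theta_step) * theta_step
    close_gripper = raw[3] > 0.0   # Lambda: close (True) or open (False) the gripper
    return x, z, theta, close_gripper
```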

At the end of an episode, a reward of 1 is given for successfully flipping a single layer of paper and 0 otherwise. In other words, flipping two or more layers of paper simultaneously is treated as a failure. The reward is determined automatically by identifying the page numbers of the book, which we describe further in Sec. III-E.

III-D Policy architecture

The policy $\pi(a_t \mid s_t)$ is modeled with a cross-sensory encoder and a multilayer perceptron (MLP) block, as shown in Fig. 2. The cross-sensory encoder takes the exteroceptive observation $o_{tv}$ and the proprioceptive observation $o_{tf}$ as inputs and embeds them into a latent vector, which represents an abstraction of proprioception and exteroception. More specifically, $o_{tv}$ is processed by a global pooling layer and concatenated with $o_{tf}$ into a vector of size $1 \times 7$. The concatenated vector is then fed into a subsequent MLP block that compresses the inputs into a more compact representation $l_t$. Finally, $l_t$ is fed through a subsequent MLP layer to predict the action.
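The sketch below shows one way the cross-sensory encoder and MLP head could be implemented in PyTorch; only the global pooling, the $1 \times 7$ concatenation, and the two MLP stages follow the description above, while the layer widths and latent size are assumptions.

```python
# Sketch of the cross-sensory encoder + MLP head; hidden sizes are assumptions.
import torch
import torch.nn as nn

class CrossSensoryPolicy(nn.Module):
    def __init__(self, latent_dim=32, action_dim=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # global pooling of o_tv
        self.encoder = nn.Sequential(            # compress to the latent l_t
            nn.Linear(1 + 6, 32), nn.ReLU(),
            nn.Linear(32, latent_dim), nn.ReLU())
        self.head = nn.Sequential(               # predict the action from l_t
            nn.Linear(latent_dim, 32), nn.ReLU(),
            nn.Linear(32, action_dim))

    def forward(self, o_tv, o_tf):
        # o_tv: (B, 1, 60, 60) depth crop; o_tf: (B, 6) F/T readings
        pooled = self.pool(o_tv).flatten(1)      # (B, 1) pooled depth feature
        fused = torch.cat([pooled, o_tf], dim=1) # (B, 7) cross-sensory vector
        l_t = self.encoder(fused)
        return self.head(l_t)
```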

Figure 4: (a): Visualization of our action coordinate system. (b): Success rate curve of our policy training. (c): Our hardware setting for policy training.

III-E Training via self-supervision

We train the policy on a real robot platform. Fig. 4(c) shows our hardware setup for training, which includes the following major components: a Universal Robots UR10 robot arm equipped with a 3D-printed thermoplastic polyurethane soft gripper, an ATI Gamma F/T sensor, an Intel RealSense L515 depth camera, and a recycling mechanism. During the whole training process, we only use a book assembled from printer paper, as shown in Fig. 4(c). We train our model through trial and error with the following procedure:

At each training step, the robot executes the coarse-to-fine exploration from an initial pose. During the "Swipe" process, the wrist-mounted camera captures an RGB-D image. The depth channel is used to construct the exteroceptive observation $o_{tv}$, and the page number $n_t$ is recognized from the RGB channels for reward calculation. The robot then descends so that the finger approaches the paper's surface and performs the exploratory action, obtaining the proprioceptive observation $o_{tf}$ from the F/T sensor readings. After this, the robot downloads the latest policy parameters from the optimizer, predicts the action $a_t$, and executes it. Rewards are calculated automatically from the change in page numbers without human intervention: the reward $r_t$ is 1 if $n_{t+1} = n_t + 2$, and 0 otherwise. Page numbers are identified with Tesseract [23]. Finally, the generated episode is added to a replay buffer, and the optimizer samples from this replay buffer to update the policy. We use the Adam optimizer with a learning rate of $3 \times 10^{-3}$. The robot continuously collects episodes until it reaches the last page of the book, at which point the book is reset to the first page using the recycling mechanism. In this way, human intervention is kept to a minimum during training.
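The reward computation described above can be sketched as follows; the use of the pytesseract wrapper and the exact OCR configuration are assumptions, since the paper only states that page numbers are identified with Tesseract [23].

```python
# Sketch of the self-supervised reward: r_t = 1 iff the page number advanced
# by exactly 2 (one sheet flipped). OCR details are assumptions.
import pytesseract

def read_page_number(rgb_crop):
    """OCR the visible page number; return None if nothing is recognized."""
    text = pytesseract.image_to_string(rgb_crop, config="--psm 7")
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else None

def compute_reward(n_t, n_t1):
    """Reward 1 only when exactly one sheet was flipped; 0 otherwise."""
    if n_t is None or n_t1 is None:
        return 0
    return 1 if n_t1 == n_t + 2 else 0
```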

The final model took four hours to train; the learning curve is shown in Fig. 4(b).

Figure 5: A subset (9 of 27) of our test scene settings. Columns from left to right show different paper materials: printer paper, coated paper, and plastic paper. Rows from top to bottom show different test scenarios and workspace tilt angles.
Table I: Results of experiments in the real world. SR stands for success rate; PPH is successful paper flips per hour.

Full book page flipping
Method | Tilt (deg) | Printer paper (SR / PPH) | Coated paper (SR / PPH) | Plastic paper (SR / PPH)
Flex&Flip [8] | 0 | 72% / 223 | 77% / 239 | 52% / 161
Flipbot-w/o prop | 0 | 85% / 264 | 93% / 288 | 66% / 205
Flipbot | 0 | 94% / 291 | 96% / 298 | 82% / 254
Flex&Flip [8] | 30 | 76% / 236 | 74% / 229 | 44% / 136
Flipbot-w/o prop | 30 | 88% / 273 | 87% / 270 | 63% / 195
Flipbot | 30 | 93% / 288 | 91% / 282 | 72% / 223
Flex&Flip [8] | 60 | 64% / 198 | 56% / 174 | 47% / 192
Flipbot-w/o prop | 60 | 76% / 236 | 72% / 223 | 62% / 192
Flipbot | 60 | 84% / 260 | 82% / 253 | 70% / 217

Paper-box emptying
Method | Tilt (deg) | Printer paper (SR / PPH) | Coated paper (SR / PPH) | Plastic paper (SR / PPH)
Flex&Flip [8] | 0 | 69% / 214 | 82% / 254 | 49% / 152
Flipbot-w/o prop | 0 | 81% / 251 | 91% / 282 | 60% / 186
Flipbot | 0 | 90% / 279 | 94% / 291 | 68% / 211
Flex&Flip [8] | 30 | 62% / 192 | 72% / 223 | 42% / 130
Flipbot-w/o prop | 30 | 84% / 260 | 88% / 273 | 55% / 171
Flipbot | 30 | 88% / 273 | 91% / 282 | 62% / 192
Flex&Flip [8] | 60 | 56% / 174 | 58% / 180 | 38% / 118
Flipbot-w/o prop | 60 | 77% / 239 | 70% / 217 | 58% / 179
Flipbot | 60 | 82% / 254 | 80% / 248 | 66% / 205

Single paper grasping
Method | Tilt (deg) | Printer paper (SR / PPH) | Coated paper (SR / PPH) | Plastic paper (SR / PPH)
Flex&Flip [8] | 0 | 83% / 260 | 91% / 282 | 74% / 229
Flipbot-w/o prop | 0 | 95% / 295 | 98% / 304 | 85% / 264
Flipbot | 0 | 99% / 307 | 98% / 304 | 92% / 285
Flex&Flip [8] | 30 | 80% / 248 | 87% / 270 | 76% / 236
Flipbot-w/o prop | 30 | 85% / 264 | 92% / 295 | 86% / 267
Flipbot | 30 | 92% / 285 | 95% / 295 | 90% / 279
Flex&Flip [8] | 60 | 84% / 260 | 82% / 254 | 83% / 257
Flipbot-w/o prop | 60 | 86% / 267 | 85% / 264 | 91% / 282
Flipbot | 60 | 96% / 298 | 92% / 285 | 94% / 291
Figure 6: Flipbot performs paper-flipping in different scenes. A-D: Flipbot successfully singulates and grasps a piece of paper in various settings; E: Flipbot fails to singulate and grasp a piece of printer paper with a 60-degree tilt angle. The circled area in red denotes that two layers of paper were flipped.

IV Experiments

We design a set of real-world experiments to evaluate the system's generalization to novel object physical parameters and the advantage of using exteroceptive and proprioceptive exploration. For all of the following experiments, we use the same robot hardware setup and the same model trained on the book assembled from printer paper, as described in Sec. III-E. The system's performance is evaluated on its generalization to unseen paper types (i.e., flipping different types of paper when trained only on printer paper) and unseen scenarios (e.g., emptying paper from a box), as well as its efficiency (i.e., the speed and accuracy of paper-flipping).

Scene setup: We investigate the performance of our system across various object settings and scene configurations. In total, we have 27 different test scenes formed by combining test scenarios, paper types, and tilt angles. We test the following three scenarios:

  • Full book page flipping. This scenario is similar to the one used in policy training: the robot needs to flip book pages one by one throughout the book.

  • Paper-box emptying. The robot grasps each sheet one by one from a pile of paper dumped into a box until the box is empty. This is more challenging than the book setup because the physical interaction between the sheets is more complex without the constraint of the book spine.

  • Single paper grasping. The robot grasps a single piece of paper lying on a flat surface.

In each scenario, we use three types of paper with different physical properties: printer paper, coated paper, and plastic paper. The printer paper is the same as that used during training and has the highest friction coefficient among the three types. The coated paper and plastic paper are unseen paper types: the coated paper has the lowest friction coefficient, and the plastic paper has a medium friction coefficient. The detailed physical properties of the three paper types are shown in Tab. II. Meanwhile, we also vary the tilt angle of the workspace (0, 30, 60 degrees) to test the effect of gravity on paper flipping.

Table II: Physical properties of test paper
Physical property | Printer paper (seen type) | Coated paper (unseen type) | Plastic paper (unseen type)
Static coefficient of friction | 0.462 ± 0.0087 | 0.283 ± 0.0104 | 0.334 ± 0.0066
Kinetic coefficient of friction | 0.417 ± 0.0542 | 0.174 ± 0.0229 | 0.259 ± 0.0263
Young's modulus in machine direction (GPa) | 2.84 ± 0.17 | 2.62 ± 0.14 | 1.54 ± 0.23
Areal density (g/m²) | 102.5 ± 2.32 | 59.8 ± 0.93 | 385.4 ± 1.74
Thickness (mm) | 0.096 ± 0.006 | 0.057 ± 0.012 | 0.151 ± 0.017

Metric: We use two evaluation metrics to validate algorithm performance: success rate (SR, successful paper flips / total attempts) and PPH (successful paper flips per hour). An attempt is considered successful only if the gripper detaches and flips strictly one piece of paper; for example, in the book page flipping task, detaching and flipping two pieces of paper simultaneously counts as a failure. PPH is the number of flip attempts per hour multiplied by the success rate, and it includes the time for perception, network inference, and robot execution. It is important to note that our Flipbot implementation is not optimized for high-speed execution; thus, the reported PPH is only used to compare relative performance.
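A minimal sketch of the two metrics, assuming the elapsed time already covers perception, inference, and execution:

```python
# Success rate and PPH as defined above; elapsed_hours covers perception,
# network inference, and robot execution time.
def success_rate(successful_flips, total_attempts):
    return successful_flips / total_attempts

def pph(successful_flips, total_attempts, elapsed_hours):
    """Successful paper flips per hour = attempts per hour x success rate."""
    attempts_per_hour = total_attempts / elapsed_hours
    return attempts_per_hour * success_rate(successful_flips, total_attempts)
```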

Algorithm comparisons: We compare with the following methods:

  • Flex&Flip [8]: it simplifies a piece of paper as a linear object and uses a physical model to analyze its motion. The original version can only grasp a single piece of paper lying on a flat surface. We adapt and extend the physical model provided by the authors and hardcode the thickness of each paper type to allow multi-layered paper flipping.

  • Flipbot-w/o prop: the policy learns from exteroceptive sensing only (i.e., the depth camera) and directly maps the visual observation to an action.

  • Flipbot: the policy learns with coarse-to-fine exteroceptive-proprioceptive exploration; this is the full, non-ablated method proposed in this article.

IV-A Experimental Results

Comparison to prior work. We first compare the performance of our approach with Flex&Flip [8] across different paper types and scenarios (rows 1 vs. 3 in Tab. I). Note that Flex&Flip [8] is the state-of-the-art method for single-layer paper grasping, and we extend it to multi-layered paper scenarios (i.e., paper-box emptying and full book page flipping). In the single paper grasping case, Flipbot performs better (+16%) than Flex&Flip [8] on printer paper. The advantage is much more pronounced in the multi-layered paper cases, with Flipbot outperforming Flex&Flip [8] by around 20%. In all three test scenarios, the quantitative results in Tab. I suggest that our method (Flipbot) maintains comparable success rates on unseen paper types (i.e., coated and plastic paper) with respect to the seen paper type (i.e., printer paper). In contrast, the performance of Flex&Flip [8] degrades significantly on unseen paper types, by up to 20% on plastic paper.

Effectiveness of exteroceptive-proprioceptive exploration. We conduct controlled experiments to quantitatively evaluate the contribution of exteroceptive-proprioceptive exploration. Proprioceptive perception provides information about unobservable physical features, facilitating effective policy learning. As a result, compared with Flipbot-w/o prop, which does not use proprioception, Flipbot achieves a higher success rate. The quantitative results in Tab. I indicate that the success rate of Flipbot is higher than that of Flipbot-w/o prop by 4% to 24% across test cases.

Generalization to novel tilt angles of the workspace. In this experiment, we investigate the generalization ability of these methods to gravity changes by varying the tilt angle of the workspace (0, 30, 60 degrees; see Fig. 6C-D). At larger tilt angles, detaching a single sheet of paper becomes more challenging because the contact conditions between the different layers of paper change with the direction of gravity. The quantitative results in Tab. I show that the performance of our learned policy degrades slightly as the tilt angle increases. We hypothesize that this is because the physics in these test scenes differs from that during training, increasing the difficulty of generalization. Nevertheless, Flipbot still outperforms the other methods in terms of success rate and PPH in all test cases.

Overall, our experimental evaluation demonstrates that Flipbot is an efficient approach for paper-flipping tasks. We find that exteroceptive and proprioceptive perception are essential for paper-flipping, particularly for singulating and detaching a sheet from a pile of paper. The learned policy outperforms state-of-the-art methods and is also applicable to tasks beyond the reach of prior studies, such as turning pages throughout a book. Our work is not without limitations. First, when the working area is at a larger inclination angle, the friction between the paper sheets tends to be smaller; hence, multiple layers of paper are easily grasped simultaneously (see Fig. 6E). Also, two layers of paper sometimes stick together, which we attribute to van der Waals forces. A dual-arm system may be needed to address this issue, suggesting exciting opportunities for future study.

V Conclusion

We have presented Flipbot, a novel solution for singulating and grasping thin and flexible deformable objects that utilizes cross-sensory encoding of exteroceptive and proprioceptive perception. The system also takes advantage of the underactuation and compliance of the soft pneumatic actuator to precisely control contact forces for singulating a thin layer of deformable objects. We deploy the algorithm on a real-robot system and show that integrating exteroceptive and proprioceptive inputs can effectively facilitate deformable object manipulation. Extensive controlled experiments demonstrate the robustness and effectiveness of Flipbot. Beyond the experimental results, our work extends the frontiers of deformable object manipulation, and the methodology presented here can have broad applications. A future direction is to extend the proposed approach to long-horizon deformable object manipulation tasks, such as origami folding, cleaning messy desktops, and collecting mail and letters.

References

  • [1] J. Zhu, A. Cherubini, C. Dune, D. Navarro-Alarcon, F. Alambeigi, D. Berenson, F. Ficuciello, K. Harada, J. Kober, X. LI, J. Pan, W. Yuan, and M. Gienger, “Challenges and outlook in robotic manipulation of deformable objects,” IEEE Robotics & Automation Magazine, pp. 2–12, 2022.
  • [2] T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter, “Learning robust perceptive locomotion for quadrupedal robots in the wild,” Science Robotics, vol. 7, no. 62, p. eabk2822, 2022.
  • [3] W. Yuan, M. A. Srinivasan, and E. H. Adelson, “Estimating object hardness with a gelsight touch sensor,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2016, pp. 208–215.
  • [4] Y. She, S. Wang, S. Dong, N. Sunil, A. Rodriguez, and E. Adelson, “Cable manipulation with a tactile-reactive gripper,” The International Journal of Robotics Research, vol. 40, no. 12-14, pp. 1385–1401, 2021.
  • [5] C. B. Teeple, J. Werfel, and R. J. Wood, “Multi-dimensional compliance of soft grippers enables gentle interaction with thin, flexible objects,” in 2022 International Conference on Robotics and Automation (ICRA).   IEEE, 2022, pp. 728–734.
  • [6] F. Allgöwer and A. Zheng, Nonlinear model predictive control.   Birkhäuser, 2012, vol. 26.
  • [7] S. Zimmermann, R. Poranne, and S. Coros, “Dynamic manipulation of deformable objects with implicit integration,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 4209–4216, 2021.
  • [8] C. Jiang, A. Nazir, G. Abbasnejad, and J. Seo, “Dynamic flex-and-flip manipulation of deformable linear objects,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 3158–3163.
  • [9] Y. Guo, X. Jiang, and Y. Liu, “Deformation control of a deformable object based on visual and tactile feedback,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2021, pp. 675–681.
  • [10] H. Shi, H. Xu, Z. Huang, Y. Li, and J. Wu, “Robocraft: Learning to see, simulate, and shape elasto-plastic objects with graph networks,” Robotics: Science and Systems (RSS), 2022.
  • [11] B. Shen, Z. Jiang, C. Choy, L. J. Guibas, S. Savarese, A. Anandkumar, and Y. Zhu, “Acid: Action-conditional implicit visual dynamics for deformable object manipulation,” Robotics: Science and Systems (RSS), 2022.
  • [12] W. Yan, A. Vangipuram, P. Abbeel, and L. Pinto, “Learning predictive representations for deformable objects using contrastive estimation,” in Conference on Robot Learning.   PMLR, 2022.
  • [13] H. Ha and S. Song, “Flingbot: The unreasonable effectiveness of dynamic manipulation for cloth unfolding,” in Conference on Robot Learning.   PMLR, 2022, pp. 24–33.
  • [14] P. Sundaresan, J. Grannen, B. Thananjeyan, A. Balakrishna, M. Laskey, K. Stone, J. E. Gonzalez, and K. Goldberg, “Learning rope manipulation policies using dense object descriptors trained on synthetic depth data,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 9411–9418.
  • [15] Y. Avigal, L. Berscheid, T. Asfour, T. Kröger, and K. Goldberg, “Speedfolding: Learning efficient bimanual folding of garments,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2022, pp. 1–8.
  • [16] Z. Xu, C. Chi, B. Burchfiel, E. Cousineau, S. Feng, and S. Song, “Dextairity: Deformable manipulation can be a breeze,” in Proceedings of Robotics: Science and Systems (RSS), 2022.
  • [17] W. Yuan, Y. Mo, S. Wang, and E. H. Adelson, “Active clothing material perception using tactile sensing and deep learning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 4842–4849.
  • [18] W. Yuan, S. Dong, and E. H. Adelson, “Gelsight: High-resolution robot tactile sensors for estimating geometry and force,” Sensors, vol. 17, no. 12, p. 2762, 2017.
  • [19] S. Tirumala, T. Weng, D. Seita, O. Kroemer, Z. Temel, and D. Held, “Learning to singulate layers of cloth using tactile feedback,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2022, pp. 7773–7780.
  • [20] J. Hughes, U. Culha, F. Giardina, F. Guenther, A. Rosendo, and F. Iida, “Soft manipulators and grippers: a review,” Frontiers in Robotics and AI, vol. 3, p. 69, 2016.
  • [21] J. H. Low, P. M. Khin, Q. Q. Han, H. Yao, Y. S. Teoh, Y. Zeng, S. Li, J. Liu, Z. Liu, P. V. y Alvarado, et al., “Sensorized reconfigurable soft robotic gripper system for automated food handling,” IEEE/ASME Transactions On Mechatronics, 2021.
  • [22] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International conference on machine learning.   PMLR, 2018, pp. 1861–1870.
  • [23] R. Smith, “An overview of the tesseract ocr engine,” in Ninth international conference on document analysis and recognition (ICDAR 2007), vol. 2.   IEEE, 2007, pp. 629–633.