
Benchmarking the Sim-to-Real Gap in Cloth Manipulation

David Blanco-Mulero1, Oriol Barbany2, Gokhan Alcan1, Adrià Colomé2, Carme Torras2, and Ville Kyrki1

Manuscript accepted for publication at IEEE Robotics and Automation Letters. This research has received funding from: Academy of Finland (grant number 317020), Business Finland (decision 9249/31/2021), ERC under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 741930, project CLOTHILDE) and European Union's Horizon 2020 research and innovation programme (grant agreement No. 101070600, project SoftEnable). (Corresponding author: David Blanco-Mulero.)

1 David Blanco-Mulero, Gokhan Alcan and Ville Kyrki are with the Department of Electrical Engineering and Automation (EEA), Aalto University, 02150 Espoo, Finland (e-mail: david.blancomulero@aalto.fi).

2 Oriol Barbany, Adrià Colomé and Carme Torras are with the Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Spain.
Abstract

Realistic physics engines play a crucial role in learning to manipulate deformable objects such as garments in simulation. By doing so, researchers can circumvent challenges such as sensing the deformation of the object in the real world. In spite of the extensive use of simulations for this task, few works have evaluated the reality gap between deformable object simulators and real-world data. We present a benchmark dataset to evaluate the sim-to-real gap in cloth manipulation. The dataset is collected by performing a dynamic as well as a quasi-static cloth manipulation task involving contact with a rigid table. We use the dataset to evaluate the reality gap, computational time, and simulation stability of four popular deformable object simulators: MuJoCo, Bullet, Flex, and SOFA. Additionally, we discuss the benefits and drawbacks of each simulator. The benchmark dataset is open-source. Supplementary material, videos, and code can be found at https://sites.google.com/view/cloth-sim2real-benchmark.

I Introduction

Cloth manipulation is a crucial component in applications ranging from care-giving [1] and household chores [2] to the textile industry. Endowing robots with cloth manipulation skills is non-trivial. First, deformable objects have infinite degrees of freedom, which makes it challenging to represent their state in the world [3]. Second, deformable objects have complex dynamics, which is even more pronounced when performing dynamic manipulation actions that require acceleration forces to succeed at the task [4, 5]. Third, deploying a robot in the real world presents safety challenges, such as damaging the physical system or the environment the robot interacts with.

Figure 1: The real-world cloth manipulation dataset was collected, pre-processed and benchmarked against multiple simulation engines, assessing their sim-to-real gap. (Panels: 1. Record Dataset; 2. Pre-Process Dataset; 3. Measure Reality Gap in Simulated Environments, computing the distance between real and simulated cloths.)

Considerable research on cloth manipulation addresses these challenges with the aid of simulation engines [1, 6, 7, 8]. This relaxes the safety issues and provides a vast number of trials in which controllers can be evaluated and improved. However, these simulators only approximate the dynamics of the real world, which results in a gap when compared to reality [9]. This reality gap becomes even more apparent when performing dynamic cloth manipulation tasks [10, 11]: over longer prediction horizons the errors accumulate, widening the reality gap and resulting in poor sim-to-real transfer. Yet, no studies are available that quantify the reality gap when performing dynamic cloth manipulation tasks.

Figure 2: Robot performing a cloth manipulation task that includes significant features for benchmarking simulation engines, starting from an initial grasped state. The task exhibits: 1) how the high acceleration values of the robot trajectory affect the cloth, 2) multiple time steps in which the cloth follows its own dynamics, and 3) contact with a rigid surface.

Although state-of-the-art methods continue to use the available simulators for learning cloth manipulation tasks, the fidelity of simulators for these tasks has not been thoroughly evaluated. While domain randomisation has been used to obtain more robust controllers that partially alleviate the sim-to-real gap [12], it does not necessarily solve the issue. We present a dataset for benchmarking cloth manipulation and evaluate the reality gap of current state-of-the-art simulators (Fig. 1). In addition, we provide insights about the available simulators, pointing out their benefits and drawbacks. Our contributions can be summarised as:

  • A dataset for benchmarking cloth manipulation using cloths from a publicly available benchmarking dataset.

  • A benchmark of the most popular, currently available physics engines that simulate deformable objects, compared against a real-world scenario.

  • An evaluation of the capabilities of physics engines to simulate dynamic in-air manipulation and quasi-static in-contact manipulation of cloths.

The work will also enable researchers to evaluate new simulators using the benchmark and the open-source code.

II Related Works

II-A Deformable Object Simulation

There exists a broad variety of deformable object simulators. One of their main differences lies in the dynamics model used, ranging from particle-based systems such as mass-spring models (MuJoCo [13]) or Position Based Dynamics (PBD) (Flex [14]), to constitutive models such as the Finite Element Method (FEM) (Bullet [15], SOFA [16]). Although simulators such as ArcSim have been fine-tuned to match the dynamics of fabric materials [17], the reality gap when performing manipulation tasks has not been evaluated.

As a result of the benefits of learning controllers in simulation, recent work has focused on measuring the simulators’ accuracy against real-world data. Collins et al. [9] benchmarked the accuracy of different simulation engines in a rigid-object manipulation task. More recently, Acosta et al. [18] measured the error of simulated rigid-body contact after optimising the parameters of different simulators. However, no prior work has evaluated the reality gap in dynamic deformable object manipulation.

II-B Benchmarking Deformable Object Manipulation

The problem of benchmarking manipulation tasks can be viewed from different perspectives: 1) designing datasets [2] and tasks [19] for benchmarking robotic systems, 2) measuring the performance of multiple algorithms on a task, 3) evaluating the disparity between simulation and real task performance for a given algorithm, and 4) measuring the reality gap between simulation and a real-world dataset.

Most works in deformable object manipulation have focused either on 2) evaluating multiple algorithms in a simulation engine [20, 21], or 3) evaluating the gap when transferring a skill to the real world [4, 22]. However, these works do not quantify the reality gap of the simulations used to train the learning algorithms, which can result in poor performance when performing sim-to-real transfer in a zero-shot manner.

More recently, Lim et al. [23] proposed an approach to learn controllers from real data and simulators fine-tuned with real data for planar cable manipulation, evaluating the reality gap in terms of the cable trajectory. Similarly, Sundaresan et al. [24] fine-tuned a differentiable simulator with data from the real-world, evaluating the reality gap in quasi-static tasks. To the best of our knowledge, our work is the first to study the reality gap in a dynamic cloth manipulation task against a real-world benchmark dataset. In this work we measure the performance of four simulation engines widely used for deformable object manipulation: MuJoCo [6, 7, 11], Bullet [8, 25, 26], Flex [4, 12, 20], and SOFA [10, 27]. Our benchmark is agnostic to the simulator and can be easily applied to forthcoming simulators.

III The Benchmark

The proposed benchmark consists of the following:

  • a real-world dataset composed of point clouds and RGB-D images at each time step of the cloth manipulation, using three cloths with different material properties;

  • dynamic and quasi-static manipulation tasks performed with a bi-manual system, simulated in four simulation engines;

  • metrics to evaluate the reality gap of the simulated environment, along with the stability and computational cost of the simulator.

III-A Task Description

We propose a fabric placement manipulation task performed by a bi-manual system that involves two pre-defined trajectories (see Fig. 2). The first trajectory consists of a dynamic motion of the fabric. The second trajectory brings the garment into contact with a rigid surface and then drags it across the planar surface. The objective of the two trajectories is to evaluate two different dynamics: 1) the dynamics of the fabric without contact, and 2) the dynamics of the contact between the garment and a rigid object.

The goal of the fabric placement task is to end in a flattened configuration starting from a contact-free position. In order to focus on the accuracy of the simulation, the task assumes a successful grasp. Thus, the fabric starts in a grasped position, where two corners of a rectangular piece of cloth are each held by a manipulator using a pinch grasp [28]. We chose a pinch grasp as it does not require any additional set-up, such as that required for interfacing with and simulating, e.g., touch-based sensors [29].

In order to place the cloth in a flattened configuration, dynamic motions can control the fabric beyond the workspace of the manipulators while efficiently placing the cloth flat in a single attempt. Thus, we design a fling motion [4] and define it with a quintic polynomial

$x(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 + a_5 t^5$, (1)

which is detailed in Sec. III-B. When performing a highly dynamic motion, the fabric undergoes abrupt deformation. This is challenging to simulate due to the inertial forces generated by the high accelerations and the high number of degrees of freedom of the garment, making it a strong candidate trajectory for evaluating the reality gap. In addition, to evaluate the capability of different physics engines to simulate frictional and inertial forces, we design a quasi-static motion that consists of slowly lowering the fabric into contact with the rigid surface and then dragging it.
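As an illustration, the coefficients of each quintic segment can be obtained by solving the linear system given by the boundary conditions on position, velocity and acceleration, following [30]. Below is a minimal sketch of this computation; the function and variable names are our own and not taken from the released code:

```python
import numpy as np

def quintic_coeffs(x0, v0, a0, xf, vf, af, T):
    """Coefficients (a_0, ..., a_5) of Eq. (1) for one via-point segment,
    given boundary position x, velocity v and acceleration a at t=0 and t=T."""
    M = np.array([
        [1.0, 0.0,  0.0,     0.0,      0.0,      0.0],
        [0.0, 1.0,  0.0,     0.0,      0.0,      0.0],
        [0.0, 0.0,  2.0,     0.0,      0.0,      0.0],
        [1.0,   T, T**2,    T**3,     T**4,     T**5],
        [0.0, 1.0,  2*T, 3*T**2,   4*T**3,   5*T**4],
        [0.0, 0.0,  2.0,    6*T,  12*T**2,  20*T**3],
    ])
    b = np.array([x0, v0, a0, xf, vf, af])
    return np.linalg.solve(M, b)  # [a0, a1, a2, a3, a4, a5]
```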

(a) Dynamic motion trajectory with $n_Y=4$, $n_Z=3$ and $n_\phi=3$, spanning a cloth dynamics phase followed by an in-contact phase over 4.5 s.
(b) Quasi-static motion trajectory with $n_Y=2$ and $n_Z=2$, with the in-contact phase spanning up to 14 s.
Figure 3: Quintic trajectories composed of different numbers of via-points for the $Y$-axis, $Z$-axis, and roll angle $\phi$.

III-B Real-World Dataset

Our dataset is collected using three different cloths from the public household dataset [2]: a towel rag, a linen rag, and a chequered rag. The garments have a size of 50 × 70 cm, each with different weight and elasticity, providing a variety that helps assess the ability of the engines to simulate different fabric materials and dynamics. We chose these cloths from the dataset as they can be easily lifted by two robotic manipulators and placed on a flat surface. In addition, the fabrics have a rectangular rather than square shape, which results in larger deformations of the non-manipulated corners of the cloth when a high velocity is applied. As shown in Sec. IV, the towel and the chequered rags have a similar final configuration after the fling motion. However, the linen rag, which is more brittle, ends up partially folded. By contrast, all fabrics exhibit a similar final configuration after the quasi-static motion.

We use two Franka Emika Panda robots to perform the quintic trajectories. The dynamic trajectory performs a motion on the YZ axes and the roll angle $\phi$, keeping the other axes fixed throughout the trajectory (see Fig. 3a). By contrast, the quasi-static trajectory performs a motion only on the YZ axes (see Fig. 3b). Each trajectory is computed using multiple via-points, where the number of via-points is $n_Y=4$, $n_Z=3$ and $n_\phi=3$ for the dynamic trajectory, and $n_Y=2$, $n_Z=2$ for the quasi-static trajectory. For each via-point we define a quintic polynomial, where the coefficients of the polynomial are computed following [30]. The starting and final velocity and acceleration values, as well as the duration of each via-point segment, are defined empirically. The position and velocity trajectories for each axis are shown in Fig. 3. During the trajectory, both robots have the $X$-axis fixed at 51 cm from their origin. Since one of the manipulators is rotated 180 degrees with respect to the other, its roll angle is inverted.

In the dynamic motion (Fig. 3a), we distinguish between the phase where the cloth undergoes its natural dynamics and when it makes contact with the surface. The cloth dynamics phase, concluding approximately 3 seconds after starting the trajectory, is used in the benchmarking. Although the trajectory remains consistent across all trials and materials, the cloth contacts the table at slightly different time steps due to inherent randomness in cloth behaviour and material differences (the specific values can be found in our open-source code). Consequently, we manually refine the change-of-phase time step for each case. By contrast, in the quasi-static trajectory we evaluate both the time instant where the fabric enters into contact and the entire contact phase.

The point clouds of the dataset are captured using a Microsoft Azure Kinect RGB-D sensor. The RGB-D images have a resolution of 1280 × 720 and are captured at a frame rate of 30 fps. To compare how well the garments resemble reality in a simulator, we propose to compare the dense point cloud $\mathcal{P}$ obtained by the sensor in the real setup with the mesh vertices $\mathcal{V}$ of the garment provided by the simulator. This enables a quantitative comparison of the reality gap, as we can measure the distance between the simulated and real fabric points, rather than performing a qualitative comparison by, e.g., comparing their deformation using RGB images.

To obtain the point clouds $\mathcal{P}$ we use the real-world RGB-D images, as well as the position of the camera w.r.t. the manipulators and the intrinsic and extrinsic camera parameters. First, we segment the RGB images with MiVOS [31], which allows us to interactively refine the segmentation on individual frames and obtain temporally coherent results. This enables filtering out points that are not part of the garment. Since the positions of the robots and the boundary positions of the garment are known, we use these positions to filter the points. Then, we discard points that lie further away from their neighbours than the average by applying statistical outlier removal. Finally, we need to account for the possibly different coordinate systems used across simulators. To achieve this, we define the appropriate coordinate transformation matrices and apply them to convert the simulated meshes into the observation space.
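A minimal sketch of this filtering pipeline, assuming Open3D and hypothetical inputs (the segmentation mask from MiVOS and workspace bounds derived from the known robot and garment positions; names are illustrative, not from the released code):

```python
import open3d as o3d

def preprocess_cloud(points, mask, bound_min, bound_max):
    """points: (H*W, 3) back-projected depth points; mask: (H*W,) garment mask."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points[mask])  # keep garment points only
    # Crop using the known robot and garment boundary positions.
    pcd = pcd.crop(o3d.geometry.AxisAlignedBoundingBox(bound_min, bound_max))
    # Statistical outlier removal: drop points far from their neighbours.
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
    return pcd
```

The simulated mesh vertices are then mapped into the same observation frame with the coordinate transformation matrices mentioned above before computing any distance.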

III-C Simulation Engine Set-up

To benchmark the sim-to-real gap in the cloth manipulation tasks, we design a framework that is agnostic to the simulation engine and share it open-source (https://sites.google.com/view/cloth-sim2real-benchmark). In order to benchmark a simulation engine using our dataset, the simulator needs to have the following capabilities:

  • simulation of both rigid and deformable objects,

  • control over the cloth points,

  • information about the position of the mesh points of the garment,

  • adjustable frequency of the simulation engine.

The simulation scene comprises the same elements as the real-world dataset: a rigid surface, the fabric to manipulate, and two manipulators. Given the limitations of some simulators (see Sec. IV-A), we do not assume that a robot model is available in the simulator, and instead assume only a pinch grasp performed by a dummy manipulator.

The simulated manipulators must take as input the desired target Cartesian position. The trajectories are given in Cartesian coordinates and calculated according to the specific simulator time step $\Delta t$; thus, the trajectories are agnostic to the simulator frequency. To accurately follow the trajectory of the real-world dataset, the simulator must not modify the dataset trajectories or repeat the same action in case frame-skipping is used, as is often done in MuJoCo or Bullet.
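For instance, once the coefficients of a quintic segment are known (see the `quintic_coeffs` sketch in Sec. III-A), the dataset trajectory can be resampled at the native time step of each engine. A sketch under the same naming assumptions:

```python
import numpy as np

def sample_segment(coeffs, T, dt):
    """Evaluate one quintic segment x(t) at the simulator time step dt,
    so that no action repetition or frame-skipping is needed."""
    t = np.arange(0.0, T, dt)
    powers = np.stack([t**i for i in range(6)], axis=1)  # (len(t), 6)
    return powers @ coeffs  # one target position per simulation step
```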

Given the variety of dynamic models used to approximate the behaviour of cloths in simulation engines, there is no restriction on the cloth parameters. In addition, our benchmark can be used to fine-tune simulator parameters such as the damping coefficient or stiffness that better approximate the dynamics of the garments.

III-D Performance Metrics

The objectives of our metrics are to: 1) quantitatively measure the reality gap, 2) evaluate the stability of the simulated cloth, and 3) assess the capability of using the simulated scenes for real-time control (hardware-in-the-loop).

There are multiple candidate metrics for measuring the distance between two point clouds, such as the Chamfer Distance (CD), the Hausdorff Distance (HD), or the Earth Mover's Distance. We select the CD and HD as they do not require point correspondences between the real point cloud and the simulated mesh, are efficient to compute, and are permutation invariant. We use the unidirectional (also known as one-way or one-sided) CD and HD to handle different mesh resolutions and point clouds that are incomplete due to self-occlusions, as done in previous works facing the same issues on clothes [24, 32]. For a point cloud $\mathcal{P}_t$ and a simulated mesh with vertices $\mathcal{V}_t$, the CD used for evaluating the reality gap is defined as

$\text{CD}(\mathcal{V}_t, \mathcal{P}_t) := \frac{1}{|\mathcal{V}_t|} \sum_{v \in \mathcal{V}_t} \min_{p \in \mathcal{P}_t} \|v - p\|_1$. (2)

The unidirectional HD with the $\ell_1$ norm is defined as

$\text{HD}(\mathcal{V}_t, \mathcal{P}_t) := \max_{v \in \mathcal{V}_t} \min_{p \in \mathcal{P}_t} \|v - p\|_1$. (3)

Both metrics typically use the squared Euclidean distance. However, we empirically find that the error values obtained with the Manhattan distance are more representative. The reason is that the $\ell_1$ norm is more robust to outliers, an observation consistent with the use of the un-squared $\ell_2$ norm as an evaluation measure in previous works [33, 34]. Note that $|\mathcal{V}_t| \ll |\mathcal{P}_t|$, which further motivates the use of the $\ell_1$ norm: with so few mesh vertices, a handful of extreme values could cause a squared metric to blow up.
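Equations (2) and (3) can be computed efficiently with a KD-tree over the point cloud. A minimal sketch using SciPy (illustrative, not the released implementation):

```python
import numpy as np
from scipy.spatial import cKDTree

def cd_hd(V, P):
    """Unidirectional Chamfer and Hausdorff distances with the l1 norm,
    from mesh vertices V (|V| x 3) to point cloud P (|P| x 3), Eqs. (2)-(3)."""
    d, _ = cKDTree(P).query(V, k=1, p=1)  # l1 distance to nearest point in P
    return d.mean(), d.max()              # CD is the average, HD the maximum
```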

To evaluate the modelling of cloth dynamics, we use the recorded trajectory before the collision, as detailed in Sec. III-B. For this purpose, we report in Tab. III the average of the Chamfer and Hausdorff distances between the simulated mesh vertices and the point clouds up to the change-of-phase time step, denoted $\text{CD}_d$ and $\text{HD}_d$.

The quasi-static trajectory is used in its entirety to evaluate the simulation of contacts in the absence of fast dynamic motions. The reported metrics in this case are $\text{CD}_q$ and $\text{HD}_q$, representing the average of distances across all time steps.

The HD is closely related to the CD, and, by definition, it has a value greater than or equal to it. Both distances determine point correspondences by finding the closest pairs between sets. However, the CD reports the average of distances and hence has higher tolerance for outliers, while the HD is a stricter metric that focuses on the maximum dissimilarity. Overall, both metrics offer complementary and valuable information about the reality gap. One of the drawbacks of both the CD and HD is that they do not consider the connectivity of the mesh [35]. However, in our case, the mesh connectivity is already enforced by the physics simulator.

We provide as a reference the error metrics between each pair of dataset point clouds in Tab. II. The table reports the differences in their deformation and serves as a guide for interpreting the metric values in Sec. IV-B.

To evaluate the simulator stability, we apply a moving average filter to the simulated vertices and compute the difference between the filtered and non-filtered vertices as

$\mathcal{L}_{\text{s}} = \frac{1}{N}\sum_{t=1}^{N-1}\left|\frac{\mathcal{V}_{t-1}+\mathcal{V}_{t}+\mathcal{V}_{t+1}}{3}-\mathcal{V}_{t}\right|$. (4)
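A sketch of Eq. (4) in NumPy, interpreting $|\cdot|$ as the mean absolute deviation over all vertex coordinates (our reading of the equation, not necessarily the released implementation):

```python
import numpy as np

def stability(V):
    """V: (N, M, 3) array of M mesh vertex positions over N time steps."""
    moving_avg = (V[:-2] + V[1:-1] + V[2:]) / 3.0  # 3-step moving average
    return np.abs(moving_avg - V[1:-1]).mean()     # deviation of raw from filtered
```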

Finally, to measure the capability of using the simulators in real-time control, we measure the computational time taken to perform a single simulation step and contrast it against the simulator frequency and the aforementioned error metrics.
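This timing measurement is straightforward; a sketch where `sim.step()` is a hypothetical engine-agnostic wrapper:

```python
import time

def step_time_ms(sim, n_steps=1000):
    """Average wall-clock time per simulation step, in milliseconds."""
    t0 = time.perf_counter()
    for _ in range(n_steps):
        sim.step()  # hypothetical engine-agnostic step call
    return (time.perf_counter() - t0) * 1e3 / n_steps
```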

TABLE I: Comparison of the evaluated simulators: MuJoCo, Bullet, Flex and SOFA. Here, we compare whether the simulator: 1) has visual feedback (RGB-D), 2) has robotic systems, 3) the type of grasp, 4) the numerical integrator, and 5) CPU or GPU acceleration. Specifically for deformable objects, we contrast whether 1) meshes can be used (variable shape), and 2) the dynamics model used for deformable objects. The type of grasp is considered as points (P) or lines (L) [28].

| Physics Simulator | RGB-D | Robot | Integration | Grasp | CPU/GPU | Shape | Dynamics Model |
|---|---|---|---|---|---|---|---|
| MuJoCo [13] | Yes | Yes | Semi-implicit Euler | P/L | CPU & GPU | Variable (see Sec. IV-A4) | Mass-Spring |
| Bullet [15] | Yes | Yes | Default | P/L | CPU | Variable | PBD / FEM |
| Flex [14, 20] | Yes | No (see Sec. IV-A2) | Default | P/L | GPU | Variable | PBD |
| SOFA [16] | RGB | No | Implicit conjugate gradient | P | CPU & GPU | Variable | Mass-Spring / FEM |
TABLE II: Error metrics between the dataset point clouds. The table rows refer to the source point cloud $\mathcal{V}$ and the columns to the target point cloud $\mathcal{P}$ for both the Chamfer Distance (CD) and Hausdorff Distance (HD).

| Rag | Metric | Towel | Cheq. | Linen |
|---|---|---|---|---|
| Towel | $\text{CD}_d$ | – | 0.023 ± 0.000 | 0.050 ± 0.001 |
| | $\text{HD}_d$ | – | 0.119 ± 0.000 | 0.163 ± 0.008 |
| | $\text{CD}_q$ | – | 0.022 ± 0.000 | 0.018 ± 0.000 |
| | $\text{HD}_q$ | – | 0.136 ± 0.003 | 0.161 ± 0.003 |
| Cheq. | $\text{CD}_d$ | 0.026 ± 0.000 | – | 0.036 ± 0.000 |
| | $\text{HD}_d$ | 0.087 ± 0.001 | – | 0.124 ± 0.004 |
| | $\text{CD}_q$ | 0.024 ± 0.000 | – | 0.023 ± 0.000 |
| | $\text{HD}_q$ | 0.068 ± 0.001 | – | 0.088 ± 0.002 |
| Linen | $\text{CD}_d$ | 0.036 ± 0.001 | 0.054 ± 0.001 | – |
| | $\text{HD}_d$ | 0.133 ± 0.002 | 0.145 ± 0.005 | – |
| | $\text{CD}_q$ | 0.022 ± 0.000 | 0.022 ± 0.000 | – |
| | $\text{HD}_q$ | 0.121 ± 0.001 | 0.125 ± 0.000 | – |
TABLE III: Quantitative results showing the mean and standard deviation of the Chamfer Distance (CD) and Hausdorff Distance (HD) for three rags (towel, chequered and linen), over three real-world datasets of dynamic and quasi-static tasks for each fabric and 20 different random seeds, in the physics engines MuJoCo, Bullet, Flex and SOFA.

| Rag | Metric | MuJoCo | Bullet | Flex | SOFA |
|---|---|---|---|---|---|
| Towel | $\text{CD}_d$ | 0.079 ± 0.031 | 0.155 ± 0.093 | 0.168 ± 0.129 | 0.078 ± 0.029 |
| | $\text{HD}_d$ | 0.167 ± 0.037 | 0.206 ± 0.083 | 0.277 ± 0.158 | 0.194 ± 0.074 |
| | $\text{CD}_q$ | 0.079 ± 0.027 | 0.097 ± 0.045 | 0.080 ± 0.021 | 0.089 ± 0.022 |
| | $\text{HD}_q$ | 0.182 ± 0.040 | 0.243 ± 0.078 | 0.169 ± 0.024 | 0.174 ± 0.032 |
| Cheq. | $\text{CD}_d$ | 0.067 ± 0.026 | 0.119 ± 0.060 | 0.164 ± 0.134 | 0.068 ± 0.024 |
| | $\text{HD}_d$ | 0.154 ± 0.035 | 0.242 ± 0.063 | 0.280 ± 0.180 | 0.178 ± 0.051 |
| | $\text{CD}_q$ | 0.076 ± 0.025 | 0.094 ± 0.034 | 0.072 ± 0.019 | 0.102 ± 0.034 |
| | $\text{HD}_q$ | 0.186 ± 0.055 | 0.243 ± 0.070 | 0.171 ± 0.024 | 0.198 ± 0.020 |
| Linen | $\text{CD}_d$ | 0.071 ± 0.031 | 0.116 ± 0.054 | 0.160 ± 0.131 | 0.061 ± 0.024 |
| | $\text{HD}_d$ | 0.154 ± 0.033 | 0.235 ± 0.076 | 0.281 ± 0.186 | 0.150 ± 0.064 |
| | $\text{CD}_q$ | 0.083 ± 0.037 | 0.078 ± 0.023 | 0.075 ± 0.023 | 0.073 ± 0.021 |
| | $\text{HD}_q$ | 0.177 ± 0.068 | 0.182 ± 0.024 | 0.148 ± 0.045 | 0.137 ± 0.038 |

IV Experiments

IV-A Simulation Engines

The simulation engines selected for our experiments (MuJoCo v3.1.1, Bullet v3.26, Flex v1.2 and SOFA v23.06) and their main differences are depicted in Tab. I.

IV-A1 Visual Feedback

Starting from a more generic point of view, we note that all simulators provide visual feedback, such as an RGB-D camera. However, setting up specific camera properties, such as the intrinsics or extrinsics of a camera, is not straightforward in either Flex or SOFA, both of which require modifying the source code of the simulation engine. Similarly, there are available solutions for domain randomisation in MuJoCo and Bullet [36], while Flex and SOFA require additional software such as Blender to randomise properties such as the texture or colour of objects.

IV-A2 Robot Integration and Type of Grasp

It is important to note that SOFA does not simulate robot systems, and these are not available by default in Flex (IsaacSim, which incorporates Flex, does simulate robotic systems). The lack of an end-effector will have a larger impact on the sim-to-real gap when learning visual feedback controllers. Regarding the type of grasping, the simulation engines that have robotic models, and therefore grippers, enable both point (P) and line (L) grasps [28]. Moreover, although SOFA lacks robotic models, its grasping technique could be modified to perform line grasps.

IV-A3 GPU acceleration

In terms of GPU acceleration, Bullet is a CPU-based simulation engine, while Flex supports only CUDA simulation. By contrast, both MuJoCo and SOFA support both CPU- and GPU-based simulation, although in our benchmark we only use their CPU-based versions for a fair evaluation against the other CPU-based simulators.

IV-A4 Deformable Object Shape

Regarding the shape of deformable objects, all simulators provide the capability of loading 3D meshes (in MuJoCo, as of its latest versions). Although Flex is limited by default to rectangular shapes, defining garments by their width and length, recent work by Ha et al. [4] has extended Flex to non-rectangular shapes.

IV-A5 Deformable Object Dynamics Model

As discussed in Sec. II, the dynamics model used for simulating a cloth approximates its behaviour and has an effect on the reality gap. MuJoCo models cloths as mass-spring systems whose particles are connected by joints; the simulator allows for the definition of shear and stretch joints, enabling more complex behaviours. Bullet uses PBD by default to model object dynamics, but this must be switched to FEM for simulating deformable objects. Similarly, SOFA provides an FEM implementation to simulate object deformation. Finally, Flex uses PBD to model the object dynamics.
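To make the distinction concrete, a mass-spring model integrates explicit spring forces between connected particles, whereas PBD projects particle positions directly onto constraints. The following is a generic, illustrative mass-spring update with semi-implicit Euler integration; it is a minimal sketch of the model class, not the actual update rule of any of the engines above:

```python
import numpy as np

def mass_spring_step(x, v, springs, rest_len, k, c, m, g, dt):
    """One semi-implicit Euler step for particle positions x and velocities v (M x 3).
    springs: (S, 2) particle index pairs; rest_len: (S,) spring rest lengths."""
    f = np.tile(m * g, (x.shape[0], 1))                    # gravity on every particle
    d = x[springs[:, 0]] - x[springs[:, 1]]                # spring vectors
    length = np.linalg.norm(d, axis=1, keepdims=True)
    f_s = -k * (length - rest_len[:, None]) * d / length   # Hooke's law
    np.add.at(f, springs[:, 0], f_s)                       # accumulate spring forces
    np.add.at(f, springs[:, 1], -f_s)
    v = (v + dt * f / m) * (1.0 - c)                       # damping c in [0, 1)
    return x + dt * v, v
```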

IV-B Evaluating the Sim-to-Real Gap

Prior to evaluating the reality gap for each physics engine, we optimise the simulator cloth parameters that best fit the behaviour of each dataset using two standard optimisation procedures: Bayesian Optimisation (BO) [37] and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [38]. To do so, we run 500 sweeps of both BO and CMA-ES in each simulation engine, where we minimise the CD against each fabric and quintic trajectory while keeping the random seed constant. Once the number of sweeps is reached, or the optimisation converges, we select the parameters leading to the lowest distance across BO and CMA-ES. The specific parameters used for each simulator can be found in our open-source code. We use the default numerical integrators for Bullet and Flex, semi-implicit Euler for MuJoCo, and an implicit conjugate gradient for SOFA.
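As an illustration of this procedure, a CMA-ES sweep with the `cma` package might look as follows; `set_cloth_params` and `rollout` are hypothetical placeholders (not the released code), and `cd_hd` is the metric sketch from Sec. III-D:

```python
import cma
import numpy as np

def objective(params):
    set_cloth_params(sim, params)      # hypothetical helper: stiffness, damping, ...
    meshes = rollout(sim, trajectory)  # hypothetical helper: mesh vertices per step
    return np.mean([cd_hd(V, P)[0] for V, P in zip(meshes, point_clouds)])

es = cma.CMAEvolutionStrategy(initial_params, 0.5, {'maxfevals': 500})
es.optimize(objective)                 # budget of 500 evaluations
best_params = es.result.xbest
```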

We evaluate the reality gap against each fabric using 20 random seeds per simulation engine. The quantitative results for each fabric, task and simulator are reported in Tab. III. Overall, all engines perform similarly on the quasi-static task, where all distances are within the same order of magnitude for both CD and HD. In contrast, the difference in performance on the dynamic task is more noticeable across engines: both MuJoCo and SOFA present distances roughly half of those of Bullet and Flex.

In general, the values for all metrics are comparable to, or greater than, the distances between the chequered and linen rags shown in Tab. II.

In addition, we qualitatively assess the reality gap in each simulator by visualising both the simulated and dataset cloth point clouds in Fig. 4, where we randomly select one of the simulations obtained for the optimal parameters of the towel rag. The figure also shows the distances associated with each time step, which helps to further interpret the metrics. Note that MuJoCo is the only engine that closely follows the dynamic trajectory. In contrast, although Bullet presents low values at $t=1.5$, the identified parameters do not produce a stable simulation, resulting in an inconsistent configuration at $t=3.5$. On the other hand, all simulators are able to closely match the quasi-static trajectory, where Flex has the lowest error at the final time step.

Figure 4: Qualitative and quantitative results of the simulated cloth mesh (red) in the selected simulators (MuJoCo, Bullet, Flex and SOFA) and one of the towel point clouds from the dataset (black) at different time steps of the dynamic (left, $t=0.5$ to $3.5$ s) and quasi-static (right, $t=1.0$ to $14.0$ s) trajectories. The Chamfer Distance (CD) and Hausdorff Distance (HD) are shown for each time step of each simulator.

IV-C Simulator Stability, Computational Time and Reality Gap

We report the relationship between reality gap, computational time (all experiments were run using an Intel i7-10875H and an RTX 2070) and simulator frequency in Fig. 5, where we report the average CD for the dynamic and quasi-static tasks. We only report the stability values for the dynamic task due to the difficulty the engines have simulating this trajectory. We used the benchmark data from the towel rag with 10 random seeds per simulation engine, while keeping the same fine-tuned simulation parameters as in Sec. IV-B. We selected 10, 100 and 1000 Hz as frequencies for each engine.

As shown in Fig. 5 b), both Bullet and MuJoCo become unstable at a low frequency, while Flex and SOFA are more consistent across frequencies. There is a drastic improvement in performance for MuJoCo when increasing the frequency in both Fig. 5 c) and Fig. 5 d). By contrast, Flex and SOFA present similar values at different frequencies. Although higher frequencies result in a more stable computation of the system dynamics, the distance does not improve, and in some cases performance even degrades. This suggests that the physics engines are quite sensitive to the cloth parameters; therefore, all engines need to be fine-tuned for the specific simulation frequency.

The computational time taken per simulation step is depicted in Fig. 5 a). For Bullet and MuJoCo, the step time drastically increases when the simulation is unstable. Given that the time taken per simulation step at 100 Hz is on the order of milliseconds for all simulators, it is unfeasible to perform real-time dynamic manipulation with hardware-in-the-loop. Similarly, although the simulation step time at 10 Hz in Flex and SOFA is lower, and they are more stable than MuJoCo and Bullet, their CD remains too high for hardware-in-the-loop manipulation.

Figure 5: Comparison of a) step time taken per simulation step (in milliseconds), b) simulator stability ($\mathcal{L}_{\text{s}}$), c) dynamic task $\text{CD}_d$, and d) quasi-static task $\text{CD}_q$, at frequencies of 10, 100 and 1000 Hz for each of the simulation engines (MuJoCo, Bullet, Flex and SOFA) evaluated using the towel from the benchmark.

V Discussion and Future Work

Our results show that the largest reality gap arises when performing dynamic cloth manipulation. Although the impact of this gap is less pronounced for tasks that do not require high accelerations, for instance diagonal folding [25], techniques such as sim-to-real-to-sim [23] might be beneficial for closing this gap [39].

Although SOFA presents lower errors on both tasks, its lack of robotic models makes it less suitable as a simulation engine for learning robotic controllers that require visual feedback. Regarding Bullet, the identification of the system parameters requires greater effort to produce reasonable results, as the simulation parameters also affect the grasping of the fabrics, which is why it performed poorly on the dynamic task in our benchmark. On the other hand, although Flex was able to produce a swing motion, it was not able to match the real cloth behaviour; in addition, it is restricted to GPU acceleration. Given the lower distances in both dynamic and quasi-static manipulation tasks shown by MuJoCo, as well as its capability of integrating robotic models and the availability of both CPU and GPU acceleration, we recommend MuJoCo for learning cloth manipulation tasks in a simulation engine.

Although our dataset is designed with three cloths with different properties, none of these fabrics has a strong resistance to deformation. During our research we found that only Bullet and MuJoCo were able to approximate the behaviour of stiff garments. The evaluation of the reality gap for stiff cloths and other types of garments, such as jeans and t-shirts, remains future work.

VI Conclusion

In this letter, we presented a benchmark that evaluates the reality gap of physics engines simulating cloth manipulation tasks, and we evaluated four well-established open-source simulation engines: MuJoCo, Bullet, Flex, and SOFA. Our benchmark dataset was collected using three cloths from a public household dataset, each with different material properties, in both a dynamic and a quasi-static manipulation task. The benchmark dataset provides the point clouds of the post-processed cloths, as well as the trajectory performed by the robots. Our experiments evaluate qualitatively and quantitatively the discrepancy between the benchmark dataset and the simulated cloth for each fabric and task. Furthermore, we analysed the computational time taken by each simulator at different frequencies, along with their stability and reality gap. Our results show that all engines are able to produce low errors for the quasi-static task. However, although none of the simulators was able to precisely match the dynamic manipulation task, MuJoCo performed the best at closely following the dynamic trajectory. The remaining reality gap emphasises that, in order to transfer controllers learnt in simulation to the real world, techniques such as domain randomisation, real-to-sim or real-time visual feedback are required.

Our benchmark was designed to aid researchers in cloth manipulation by depicting the current capabilities of simulation engines. Our work also provides the open-source code, which enables evaluating the reality gap of other simulation engines, as well as performing other tasks and trajectories using the same set-up as the one depicted in this letter. We foresee that the next generation of simulators will have a lower reality gap evaluated against these benchmarks, leading to controllers that match more faithfully the behaviour learnt in simulation when applied in the real world.

Acknowledgement

The authors would like to acknowledge the computational resources provided by the Aalto Science-IT project.

References

  • [1] A. Clegg, Z. Erickson, P. Grady, G. Turk, C. C. Kemp, and C. K. Liu, “Learning to collaborate from simulation for robot-assisted dressing,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 2746–2753, 2020.
  • [2] I. Garcia-Camacho, J. Borràs, B. Calli, A. Norton, and G. Alenyà, “Household cloth object set: Fostering benchmarking in deformable object manipulation,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 5866–5873, 2022.
  • [3] V. E. Arriola-Rios, P. Guler, F. Ficuciello, D. Kragic, B. Siciliano, and J. L. Wyatt, “Modeling of deformable objects for robotic manipulation: A tutorial and review,” Frontiers in Robotics and AI, vol. 7, 2020.
  • [4] H. Ha and S. Song, “Flingbot: The unreasonable effectiveness of dynamic manipulation for cloth unfolding,” in Conference on Robot Learning (CoRL), 2021.
  • [5] C. Chi, B. Burchfiel, E. Cousineau, S. Feng, and S. Song, “Iterative Residual Policy for Goal-Conditioned Dynamic Manipulation of Deformable Objects,” in Proceedings of Robotics: Science and Systems, New York City, NY, USA, 2022.
  • [6] Y. Wu, W. Yan, T. Kurutach, L. Pinto, and P. Abbeel, “Learning to Manipulate Deformable Objects without Demonstrations,” in Proceedings of Robotics: Science and Systems, Corvalis, Oregon, USA, 2020.
  • [7] W. Yan, A. Vangipuram, P. Abbeel, and L. Pinto, “Learning predictive representations for deformable objects using contrastive estimation,” in Conference on Robot Learning (CoRL), 2021.
  • [8] D. Seita, P. Florence, J. Tompson, E. Coumans, V. Sindhwani, K. Goldberg, and A. Zeng, “Learning to Rearrange Deformable Cables, Fabrics, and Bags with Goal-Conditioned Transporter Networks,” in IEEE International Conference on Robotics and Automation, 2021.
  • [9] J. Collins, J. McVicar, D. Wedlock, R. Brown, D. Howard, and J. Leitner, “Benchmarking simulated robotic manipulation through a real world dataset,” IEEE Robotics and Automation Letters, vol. 5, no. 1, pp. 250–257, 2020.
  • [10] R. Jangir, G. Alenyà, and C. Torras, “Dynamic cloth manipulation with deep reinforcement learning,” in 2020 IEEE International Conference on Robotics and Automation, 2020, pp. 4630–4636.
  • [11] J. Hietala, D. Blanco–Mulero, G. Alcan, and V. Kyrki, “Learning visual feedback control for dynamic cloth folding,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2022.
  • [12] D. Blanco-Mulero, G. Alcan, F. J. Abu-Dakka, and V. Kyrki, “Qdp: Learning to sequentially optimise quasi-static and dynamic manipulation primitives for robotic cloth manipulation,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023.
  • [13] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
  • [14] M. Macklin, M. Müller, N. Chentanez, and T.-Y. Kim, “Unified particle physics for real-time applications,” ACM Trans. Graph., vol. 33, no. 4, 2014.
  • [15] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” http://pybullet.org, 2016–2021.
  • [16] F. Faure, C. Duriez, H. Delingette, J. Allard, B. Gilles, S. Marchesseau, H. Talbot, H. Courtecuisse, G. Bousquet, I. Peterlik, et al., “Sofa: A multi-model framework for interactive physical simulation,” in Soft Tissue Biomechanical Modeling for Computer Assisted Surgery.   Springer, 2012, pp. 283–321.
  • [17] H. Wang, R. Ramamoorthi, and J. F. O’Brien, “Data-driven elastic models for cloth: Modeling and measurement,” ACM Transactions on Graphics, vol. 30, no. 4, pp. 71:1–11, 2011, proceedings of ACM SIGGRAPH 2011.
  • [18] B. Acosta, W. Yang, and M. Posa, “Validating robotics simulators on real-world impacts,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 6471–6478, 2022.
  • [19] I. Garcia-Camacho, M. Lippi, M. C. Welle, H. Yin, R. Antonova, A. Varava, J. Borras, C. Torras, A. Marino, G. Alenyà, and D. Kragic, “Benchmarking bimanual cloth manipulation,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1111–1118, 2020.
  • [20] X. Lin, Y. Wang, J. Olkin, and D. Held, “Softgym: Benchmarking deep reinforcement learning for deformable object manipulation,” in Conference on Robot Learning (CoRL), 2021.
  • [21] S. Chen, Y. Xu, C. Yu, L. Li, X. Ma, Z. Xu, and D. Hsu, “Daxbench: Benchmarking deformable object manipulation with differentiable physics,” in The 11th International Conference on Learning Representations, 2023.
  • [22] G. Salhotra, I.-C. A. Liu, M. Dominguez-Kuhne, and G. S. Sukhatme, “Learning deformable object manipulation from expert demonstrations,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 8775–8782, 2022.
  • [23] V. Lim, H. Huang, L. Y. Chen, J. Wang, J. Ichnowski, D. Seita, M. Laskey, and K. Goldberg, “Real2sim2real: Self-supervised learning of physical single-step dynamic actions for planar robot casting,” in IEEE International Conference on Robotics and Automation, 2022.
  • [24] P. Sundaresan, R. Antonova, and J. Bohg, “Diffcloud: Real-to-sim from point clouds with differentiable simulation and rendering of deformable objects,” arXiv preprint arXiv:2204.03139, 2022.
  • [25] J. Matas, S. James, and A. J. Davison, “Sim-to-real reinforcement learning for deformable object manipulation,” in Conference on Robot Learning (CoRL), 2018.
  • [26] R. Antonova, P. Shi, H. Yin, Z. Weng, and D. Kragic, “Dynamic environments with deformable objects,” in NeurIPS Datasets and Benchmarks Track, 2021.
  • [27] F. Ficuciello, A. Migliozzi, E. Coevoet, A. Petit, and C. Duriez, “Fem-based deformation control for dexterous manipulation of 3d soft objects,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2018, pp. 4007–4013.
  • [28] J. Borràs, G. Alenyà, and C. Torras, “A grasping-centered analysis for cloth manipulation,” IEEE Transactions on Robotics, vol. 36, no. 3, pp. 924–936, 2020.
  • [29] R. Proesmans, A. Verleysen, and F. Wyffels, “Unfoldir: Tactile robotic unfolding of cloth,” IEEE Robotics and Automation Letters, vol. 8, no. 8, pp. 4426–4432, 2023.
  • [30] M. Spong, S. Hutchinson, and M. Vidyasagar, Robot Modeling and Control.   Wiley, 2020.
  • [31] H. K. Cheng, Y.-W. Tai, and C.-K. Tang, “Modular interactive video object segmentation: Interaction-to-mask, propagation and difference-aware fusion,” in Proc. IEEE/CVF CVPR, 2021, pp. 5559–5568.
  • [32] Z. Huang, X. Lin, and D. Held, “Self-supervised cloth reconstruction via action-conditioned cloth tracking,” in IEEE International Conference on Robotics and Automation, 2023.
  • [33] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger, “Occupancy networks: Learning 3d reconstruction in function space,” in Proc. IEEE/CVF CVPR, 2019.
  • [34] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger, “Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision,” in Proc. IEEE/CVF CVPR, June 2020.
  • [35] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry, “A papier-mâché approach to learning 3d surface generation,” in Proc. IEEE CVPR, 2018.
  • [36] Y. Zhu, J. Wong, A. Mandlekar, and R. Martín-Martín, “robosuite: A modular simulation framework and benchmark for robot learning,” arXiv preprint arXiv:2009.12293, 2020.
  • [37] R. Garnett, Bayesian optimization.   Cambridge University Press, 2023.
  • [38] N. Hansen and A. Auger, “Cma-es: evolution strategies and covariance matrix adaptation,” in Proc. 13th annual conference companion on Genetic and evolutionary computation, 2011.
  • [39] S. Höfer, K. Bekris, A. Handa, J. C. Gamboa, M. Mozifian, F. Golemo, C. Atkeson, D. Fox, K. Goldberg, J. Leonard, C. Karen Liu, J. Peters, S. Song, P. Welinder, and M. White, “Sim2real in robotics and automation: Applications and challenges,” IEEE Transactions on Automation Science and Engineering, vol. 18, no. 2, pp. 398–400, 2021.