Grasping as Inference: Reactive Grasping in Heavily Cluttered Environment
Abstract
Although closed-loop feedback and predicting 6 degrees of freedom (DoF) grasps, rather than the conventional 4DoF top-down grasps, have each been shown to improve data-driven grasping individually, few systems combine both. Moreover, the sequential nature of the task is rarely exploited, even though the approaching motion necessarily generates a series of observations. This paper therefore synthesizes three approaches and proposes a closed-loop framework that predicts 6DoF grasps in a heavily cluttered environment from continuously received vision observations. This is realized by formulating the grasping problem as a Hidden Markov Model and applying a particle filter to infer grasps. Additionally, we introduce a novel lightweight Convolutional Neural Network (CNN) model that evaluates and initializes grasp samples in real time, which makes the particle filtering feasible. Experiments conducted on a real robot in a heavily cluttered environment show that our framework not only quantitatively improves the grasping success rate significantly compared to baseline algorithms, but also qualitatively reacts to dynamic changes in the environment and cleans up the table.
Index Terms:
Deep Learning in Grasping and Manipulation; Grasping
I Introduction
This paper aims to find a robust solution for grasping objects in a heavily cluttered environment from partial RGB and depth (RGBD) observations obtained by a camera rigidly attached to the gripper, as shown in Fig. 1. Recently, many data-driven methods have been proposed to address this problem; in particular, grasping by predicting explicit grasp poses has shown promising results. Whereas most research in the field focuses on improving the precision of the deep network, there are three other experimentally verified approaches that improve the success rate of the grasping task. The work in [1] focuses on the effect of closed-loop grasping and verifies that it improves not only robustness to control errors but also reactiveness to a dynamic environment. The grasp prediction models in [2, 3] use a 6 degrees of freedom (DoF) grasp space rather than the conventional 4DoF (i.e., top-down) space, and show that this extension of the grasp space is beneficial in cluttered environments. Concurrently, the work of [4] finds that utilizing sequential multi-view information while approaching can improve performance. These three approaches are orthogonal and can therefore be combined, but to the best of our knowledge few works achieve this.

We hypothesize that merging the benefits of these three approaches can dramatically improve the task success rate as well as generate reactive motion, and we verify this by developing a novel framework with the three above-mentioned properties and by conducting real robot experiments. In short, we develop a closed-loop framework that estimates 6DoF grasps in a heavily cluttered environment given a history of partial RGBD pixel inputs. We achieve this by 1) formulating grasping as an online estimation problem and 2) developing a novel lightweight grasp prediction network. We first formulate the problem as a Hidden Markov Model (HMM) and derive a recurrent formulation to infer grasps. Since this inference formulation resembles the Bayes filter [5], a particle filter, one powerful realization of it, is applied. Through this framework, named Grasp Particle Filter (GraspPF), the robot can not only retain a prior grasp distribution but also refine the grasps while approaching (i.e., online estimation).
However, applying a particle filter requires a precise and lightweight grasp prediction model that can generate grasps from partial RGBD observations in real time, as well as a grasp evaluation model that can evaluate arbitrary grasp candidates in a bounded space. We therefore develop a unified data-driven model that achieves both. To realize this, we design the Directional Grasp Quality CNN (DGQ-CNN), which predicts pixel-wise grasp quality from a grasp rotation and an RGBD input. The additional grasp-rotation input frees the network from a predefined rotation set, which is prevalent in prior works [3, 6, 7]. Additionally, it is computationally efficient enough to run in a closed-loop manner during the approach, making the resulting implementation reactive. The proposed framework is verified in a comparison experiment against state-of-the-art algorithms, which it outperforms by a large margin in terms of success rate. Additionally, ablation studies show the performance contribution of the individual factors.
In summary, the contribution of this paper is that, for the first time, we fuse three effective components of the data-driven grasp prediction field: 1) utilizing closed-loop feedback, 2) predicting 6DoF grasps, and 3) utilizing sequential multi-view observations. This is achieved by introducing GraspPF and DGQ-CNN.
Methods | CL (Reactive) | Multi View | 6DoF | Offline Training | Dataset Source |
---|---|---|---|---|---|
GG-CNN-cl [1] | ✓ | ✓ | ✗ | ✓ | real |
MVP [4] | ✗ | ✓ | ✗ | ✓ | real |
Dex-Net [8] | ✗ | ✗ | ✗ | ✓ | syn |
[3, 6, 9] | ✗ | ✗ | ✓ | ✓ | real |
[7, 10, 11] | ✗ | ✗ | ✓ | ✓ | syn |
GPD [12] | ✗ | ✓ | ✓ | ✓ | real |
QT-Opt [13] | ✓ | ✗ | ✗ | ✗ | real |
Song et al. [14] | ✓ | ✗ | ✓ | ✗ | real |
GraspPF (Ours) | ✓ | ✓ | ✓ | ✓ | both |
II Related Works
We compare the proposed algorithm with other works and summarize the results in Tab. I. Data-driven grasping methods that predict explicit grasps in cluttered environments [9, 6, 11, 3, 7] usually follow these steps: 1) obtain an RGBD (or depth-only) measurement from a global position that views all objects in a fixed workspace, 2) generate grasp candidates and choose one using a trained model, and 3) execute an open-loop pick-and-place motion with the target grasp configuration obtained in the previous step. Specifically, the method in [3] follows the steps above with a grasp prediction model trained on a large-scale real dataset, while [11] uses a synthetic dataset. However, these methods do not sufficiently utilize the partial observations streamed during the motion in step 3).
Although GG-CNN [15] uses sequential observations with a network light enough for closed-loop control, it does not sufficiently exploit prior observations, because it merely picks the grasp closest to the previously estimated one after local peak clustering. Additionally, the predicted grasp space of GG-CNN is restricted to 4DoF. On the other hand, the works of [4, 12] utilize multi-view observations to predict grasp poses in a cluttered scene by stacking predictions in a discretized table [4] or merging them into one point cloud [12]. Still, both approaches struggle to generate reactive motion. In contrast, our approach retains a 6DoF grasp distribution in the form of particles and runs in an online estimation manner, which enables it to react to a dynamic environment.
Another branch of data-driven grasping directly predicts continuous actions without estimating an explicit grasp. For example, the works in [13, 14] realize closed-loop grasping through reinforcement learning. Still, they do not take advantage of past observations and, moreover, require real robot interaction, which makes the resulting model dependent on a particular robot. In contrast, grasp estimation methods can be trained on large-scale open-source datasets [16, 3] that are independent of the robot, and they can be trained offline without real robot interaction. For these reasons, we focus on the grasp estimation approach while keeping the advantages of the closed loop.

III Problem Statement
In this paper, we assume that there are a scene and a robot arm, as shown in Fig. 2. The scene contains a flat table and cluttered objects resting on the table in stable poses. Besides the scene, there is an N-DoF robot equipped with a two-finger parallel gripper on its end-flange, and a camera is rigidly attached to the gripper (i.e., a wrist camera). The observation $O_t$ contains the camera pose ${}^{W}T_{C}$, where $W$ denotes the world frame and $C$ the camera frame, the camera intrinsic parameters $K$, and the RGB and depth images captured from the wrist camera. Regarding grasp representation, the grasp configuration space $\mathcal{G}$ is defined over configurations $g = (p, R, w, d)$, which are also illustrated in Fig. 2. A configuration includes the 6DoF pose, which consists of a point $p$ on the observed point cloud and a rotation $R$, the gripper width $w$, and the grasp depth $d$, whose direction follows the gripper approach axis (the z-axis of $R$). The grasp pose can then be calculated from $p$, $R$, and $d$. The grasp width is discretized into 2 bins to simplify the grasp space.
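As a minimal illustration of this grasp representation, the Python sketch below composes a 4x4 gripper pose from $(p, R, d)$ under the reconstructed notation; the function name and the sign convention of the depth offset are assumptions, not the paper's implementation.

```python
import numpy as np

def grasp_to_pose(p, R, d):
    """Compose a 4x4 gripper pose from a grasp configuration (sketch, assumed notation).

    p : (3,) contact point on the observed point cloud, expressed in the world frame
    R : (3,3) grasp rotation; its z-axis is the gripper approach direction
    d : grasp depth, applied along the approach (z) axis of R (sign is an assumption)
    """
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p + R[:, 2] * d   # slide the gripper frame by depth d along its approach axis
    return T

# Example: a top-down grasp 5 mm deep at a point on the cloud
T_grasp = grasp_to_pose(np.array([0.4, 0.0, 0.02]),
                        np.diag([1.0, -1.0, -1.0]),   # rotation with z-axis pointing down
                        d=0.005)
```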


A successful grasp is defined as a grasp configuration that has a high probability of picking and placing the object without any collision among the gripper, robot arm, objects, and table. Grasps are evaluated by a grasp quality measure $Q$, which predicts the quality value for each width bin from $g$ and $O_t$. Note that the symbol $Q$ is also used for the quality value itself in this paper. Finally, the problem this paper deals with is that, given a scene, a robot arm, and a wrist camera, the robot tidies up objects in the scene by estimating the successful-grasp distribution from the stream of observations $O_t$ at each time step. In the next section, we express this problem formally and explain its connection to online estimation.
IV Grasping as Inference
IV-A HMM and Online Estimation for Grasping
A grasping problem can be expressed as estimating the current grasp distribution conditioned on the history of observations, $p(g_t \mid S_t{=}1, O_{0:t})$, where the subscript $0{:}t$ denotes the history of a variable from time step 0 to $t$, $g_t$ is the grasp as defined in Sec. III, $O_{0:t}$ is the history of observations from time step 0 to $t$, and $S_t$ is a binary success variable that is 1 if the task succeeds and 0 otherwise, as also introduced in [8]. Note that the conditioning is simplified to $S_t{=}1$ because we do not consider cases conditioning on failure. In this paper, the grasp estimation problem is formulated as inference in an HMM, as shown in Fig. 3(a). We assume that $g_t$ depends on the prior grasp $g_{t-1}$, because we can reasonably assume there is an optimal grasp distribution of the scene in terms of the grasp success metric, so the current belief state of the grasp is closely related to the prior one. We also set the grasp success variable $S_t$ to depend on $g_t$ and $O_t$. Then, the belief state of $g_t$, defined as $b_t(g_t) \triangleq p(g_t \mid S_t{=}1, O_{0:t})$, can be propagated from the prior time step as,
$b_t(g_t) = \eta \, p(S_t{=}1 \mid g_t, O_t) \int p(g_t \mid g_{t-1}) \, b_{t-1}(g_{t-1}) \, \mathrm{d}g_{t-1},$ (1)
where $\eta$ is a normalization factor. This formulation has a recurrent structure and is similar to the Bayes filter, except that the motion model (or transition model) $p(g_t \mid g_{t-1})$ does not involve an action and the measurement model is expressed as $p(S_t{=}1 \mid g_t, O_t)$ by introducing $S_t$. This means that well-established tools for estimation problems can be utilized to solve it. To apply the estimation toolset, the motion model and the measurement model must be identified first. We reasonably assume that the transition $p(g_t \mid g_{t-1})$ follows a Gaussian noise model, because the scene can be regarded as temporarily static between time steps $t{-}1$ and $t$. On the other hand, the measurement model can be obtained by evaluating the grasp quality measure $Q(g_t, O_t)$. In the next section, the method for recurrently solving (1) is explained.
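For completeness, the following sketch shows how (1) can be derived from Bayes' rule and the HMM independence assumptions, using the notation above; the final step, which replaces the predictive prior $p(g_{t-1} \mid O_{0:t-1})$ by the previous belief $b_{t-1}$, is an assumption of this reconstruction.

```latex
\begin{aligned}
b_t(g_t) &\triangleq p(g_t \mid S_t{=}1,\, O_{0:t})
  \;\propto\; p(S_t{=}1 \mid g_t,\, O_{0:t})\, p(g_t \mid O_{0:t}) \\
&= p(S_t{=}1 \mid g_t,\, O_t)\int p(g_t \mid g_{t-1})\, p(g_{t-1} \mid O_{0:t-1})\,\mathrm{d}g_{t-1} \\
&\approx \eta\, p(S_t{=}1 \mid g_t,\, O_t)\int p(g_t \mid g_{t-1})\, b_{t-1}(g_{t-1})\,\mathrm{d}g_{t-1}.
\end{aligned}
```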
IV-B Grasp Particle Filter

As explained before, grasp estimation can be formulated as a Bayes filter, but it involves highly non-linear models, which makes the realization non-trivial. Among the possible realizations, we use the particle filter [5], because it has been verified to work well with non-linear models [17, 18, 19]. The resulting algorithm is GraspPF, which is illustrated in Fig. 3(b) and Alg. 1; a detailed description follows. The first step is the initialization of the successful-grasp distribution. The belief state of the grasp is approximated by a batch of particles $G_t = \{g_t^{(i)}\}_{i=1}^{N}$ in the grasp configuration space defined in Sec. III, where $N$ is the batch size. Then the time update is applied to these particles with the motion model $p(g_t \mid g_{t-1})$, resulting in the predicted batch $\bar{G}_t$. This transition is easily implemented by noise sampling for each particle; specifically, we apply Gaussian noise to the point $p$, the rotation $R$, and the grasp depth $d$ of each particle. However, the transition through the noise distribution makes the particles violate a constraint: $p$ should lie on the surface of the scene. To resolve this, we project each particle onto the scene surface based on the current depth observation included in $O_t$. Concretely, the point $p$ of each particle is projected along the line of sight onto the observed point cloud, and the batch of particles becomes $\bar{G}_t'$. Afterward, the particles are evaluated by the approximated grasp quality measure $\hat{Q}$, which is conditioned on the current observation $O_t$. The current belief state is then obtained by resampling based on the evaluation values. These procedures are repeated during the approaching motion, and when the target is close enough, a scripted closing motion is executed.
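The following minimal Python sketch illustrates one GraspPF iteration (time update, surface projection, measurement update, resampling). The particle layout, the `project_to_surface` helper, the noise magnitudes, and the use of Euler angles for rotation noise are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def grasp_pf_step(particles, obs, dgq_cnn, sigma_p=0.01, sigma_rot=0.1, sigma_d=0.005):
    """One GraspPF iteration (minimal sketch; interfaces and parameters are assumptions).

    particles : dict {"p": (N,3) points, "euler": (N,3) zxy-Euler angles, "d": (N,) depths}
    obs       : object exposing project_to_surface(points) for the current depth image
    dgq_cnn   : callable scoring each particle against obs (measurement model, Sec. IV-C)
    """
    N = particles["p"].shape[0]

    # 1) time update: Gaussian noise on point, rotation (here on Euler angles) and grasp depth
    particles["p"]     += np.random.normal(0.0, sigma_p,   size=(N, 3))
    particles["euler"] += np.random.normal(0.0, sigma_rot, size=(N, 3))
    particles["d"]     += np.random.normal(0.0, sigma_d,   size=N)

    # 2) projection: keep grasp points on the observed surface, along the line of sight
    particles["p"] = obs.project_to_surface(particles["p"])

    # 3) measurement update: evaluate particles with the approximated quality measure
    q = dgq_cnn(particles, obs)                  # shape (N,)
    w = q / (q.sum() + 1e-9)                     # normalized importance weights

    # 4) resample proportionally to quality
    idx = np.random.choice(N, size=N, p=w)
    return {k: v[idx] for k, v in particles.items()}
```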
Throughout the GraspPF process, past information is kept in the prior distribution in the form of particles, and the particles are refined with continuously received observations. This lets GraspPF exploit sequential information even though the measurement model only sees the current observation. It is also worth noting that this process can be used as iterative refinement with a fixed observation (i.e., open loop), which can resolve the roughness of the initial distribution; we demonstrate this in the comparison experiments on the real robot in Sec. V-C. However, to make this inference framework work, two problems remain: evaluation and initialization. We explain each problem and what makes it challenging.
First, the evaluation problem can be regarded as obtaining $p(S_t{=}1 \mid g_t, O_t)$. To obtain exact results, oracle information about the scene and gripper is needed, such as the mesh and pose of each object. Unfortunately, this information is hard to access in real robot settings, so many works adopt bypass methods that predict it from implicit information in the RGBD pixels. This can be achieved by learning-free methods such as [20], but these are hard to extend to partial depth observations and are also difficult to make computationally efficient, whereas data-driven methods [8, 10, 12] can be.
On the other hand, a reasonable initial grasp distribution is also essential to performance, since the approaching motion takes only a few seconds, limiting the number of GraspPF iterations to a relatively small number. However, initialization is not trivial because the grasp space is continuous and the result should sufficiently cover the workspace. In this paper, we resolve the evaluation and initialization problems simultaneously by introducing a novel network design that can not only evaluate grasp candidates but also generate them efficiently, whereas the authors of [10, 8] introduce an additional network or antipodal heuristics for sampling. In the next section, we explain the details of the network.
IV-C Directional Grasp Quality Network
Before starting the detailed explanation of the network, we summarize the properties required of the network model in GraspPF, as mentioned in the previous section. First, it should be computationally efficient enough for a closed loop that refines grasps during the approaching motion. Second, it should be able to evaluate any grasp in a bounded space, which is necessary for the measurement model in GraspPF. Lastly, it should be able to generate grasp candidates efficiently to obtain a reasonable initial grasp distribution. We found that these can be achieved through the novel concept of directional grasp quality.
Directional grasp quality is defined as the grasp quality conditioned on a fixed grasp rotation $R$. To utilize this concept, we design DGQ-CNN to predict the pixel-wise directional grasp quality from an RGBD image and a fixed rotation $R$, as shown in Fig. 4. We first express the grasp rotation as zxy-Euler parameters $(\theta_z, \theta_x, \theta_y)$ with respect to the camera frame $C$, where each parameter is the angle of rotation about the corresponding axis. Then, $\theta_x$ and $\theta_y$ are included in the inputs of the parameterized CNN, and $\theta_z$ is applied by rotating the RGBD image by $\theta_z$, inspired by [21]. Additionally, the grasp configuration also contains the continuous grasp depth $d$ and the grasp width $w$; we set $d$ as an additional input of the CNN and make $w$ an output with a discretization of size 2. From the perspective of the CNN, it receives the rotated RGBD image together with $(\theta_x, \theta_y, d)$, and predicts pixel-wise grasp quality at 3 levels, an object mask, and a valid-collision mask for each of the two grasp-width bins, so the total number of output channels is 6 (3+1+2). Finally, the pixel-wise directional grasp quality is calculated from the network outputs after post-processing, which extracts the continuous final grasp quality values and applies a Gaussian filter. It is worth noting that DGQ-CNN lessens the sampling burden by removing the grasp point from the sampling space: DGQ-CNN receives the components of the grasp configuration except the point $p$, which can be reconstructed from the point cloud by a pixel-index operation.
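The sketch below illustrates the input/output interface of DGQ-CNN described above: a rotated RGBD image plus a small conditioning vector $(\theta_x, \theta_y, d)$, and 6 pixel-wise output channels. The backbone layers, the additive conditioning, and the channel ordering are assumptions made only to show the shapes; the paper does not specify the architecture at this level.

```python
import torch
import torch.nn as nn

class DGQCNNSketch(nn.Module):
    """Interface sketch of DGQ-CNN (architecture details are assumptions).

    Input : RGBD image rotated by the z-Euler angle, plus scalar conditioning
            inputs (x- and y-Euler angles and grasp depth d).
    Output: 6 pixel-wise channels = 3 grasp-quality levels
            + 1 object mask + 2 valid-collision masks (one per width bin).
    """
    def __init__(self, cond_dim=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.cond = nn.Linear(cond_dim, 64)          # broadcast conditioning over the feature map
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 6, 4, stride=2, padding=1),  # 3 + 1 + 2 output channels
        )

    def forward(self, rgbd, cond):
        feat = self.encoder(rgbd)                            # (B, 64, H/4, W/4)
        feat = feat + self.cond(cond)[:, :, None, None]      # additive conditioning on (theta_x, theta_y, d)
        return torch.sigmoid(self.decoder(feat))             # pixel-wise probabilities

# Usage: a rotated RGBD batch and its (theta_x, theta_y, d) conditioning
out = DGQCNNSketch()(torch.randn(2, 4, 224, 224), torch.randn(2, 3))
print(out.shape)   # torch.Size([2, 6, 224, 224])
```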
Once DGQ-CNN is obtained, sampling and evaluation are straightforward. For sampling, the rotation and grasp depth are sampled uniformly within predefined bounds and fed into DGQ-CNN, which generates the pixel-wise quality prediction. The grasp candidates are then recovered from the pixel indices where the predicted quality exceeds a threshold, and finally the grasps are sampled with probability proportional to their grasp quality. Regarding evaluation, $\hat{Q}(g, O_t)$ can be obtained by index-gathering after the pixel-wise grasp quality prediction of DGQ-CNN, with the grasp configuration $g$ and observation $O_t$ as input. Therefore, introducing DGQ-CNN resolves the initialization and evaluation issues simultaneously. Next, details about the dataset and training for DGQ-CNN are explained.
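A minimal sketch of the forward sampling procedure follows; the sampling bounds, the threshold value, and the `dgq_cnn(rgbd, euler, depth)` interface are assumptions for illustration only.

```python
import numpy as np

def sample_grasps(dgq_cnn, rgbd, point_cloud, n_samples=64, n_dirs=8, thresh=0.5):
    """Forward sampling with DGQ-CNN (minimal sketch; bounds and interfaces are assumptions).

    dgq_cnn(rgbd, euler, depth) is assumed to return an (H, W) pixel-wise quality map,
    and point_cloud is the organized (H, W, 3) cloud aligned with the image.
    """
    candidates, qualities = [], []
    for _ in range(n_dirs):
        # sample a rotation (zxy-Euler) and a grasp depth uniformly within assumed bounds
        euler = np.random.uniform([-np.pi, -0.5, -0.5], [np.pi, 0.5, 0.5])
        depth = np.random.uniform(0.0, 0.04)
        q_map = dgq_cnn(rgbd, euler, depth)
        ys, xs = np.where(q_map > thresh)                 # pixels whose quality exceeds the threshold
        for y, x in zip(ys, xs):
            candidates.append((point_cloud[y, x], euler, depth))  # recover p from the pixel index
            qualities.append(q_map[y, x])
    if not candidates:
        return []
    q = np.asarray(qualities, dtype=np.float64)
    n = min(n_samples, len(candidates))
    idx = np.random.choice(len(candidates), size=n, replace=False, p=q / q.sum())
    return [candidates[i] for i in idx]                   # grasps sampled proportionally to quality
```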
IV-D Dataset and Training

To generate the dataset for training DGQ-CNN, a labeling method is needed to measure grasp quality (i.e., how successful a grasp is). Following the results of [8, 11, 10], we utilize synthetic labeling, because directional grasp quality per pixel requires massive numbers of evaluations, which is not feasible with real grasping [14] or human labeling [21]. We use a force-closure-based metric motivated by [12, 8, 3], where prediction models trained with this metric are verified to perform well; in particular, a validation experiment with real grasping is conducted in [3]. Although some works such as [22] report opposite results, we find that it also works well in our case. Specifically, the grasp quality label for each pixel has 3 dimensions, corresponding to 3 levels of grasp quality. The label also contains an object mask and valid-collision results per width bin evaluated with the simplified gripper in Fig. 2, so the final labels have 6 channels, as shown in Fig. 5.
The labeling method above can be applied to real datasets in which 3D meshes and RGBD observations are carefully matched, for example [23, 24], and this approach is used in [12, 3]. However, real data alone is insufficient to train DGQ-CNN, because it needs vision data at various distances to work well even in close proximity to objects. We therefore generate additional data with various distances to the scene and various camera parameters using synthetic rendering and simulation [25]. As previously studied in [16], a diverse object set is also important, so we use multiple sources [26, 23, 27]. In total, 2.1M synthetic and 400K real RGBD pairs are used.
After generating the dataset, the parameterized CNN in DGQ-CNN can be trained with a binary classification loss. However, the dataset has a critical problem: sparse segmentation. The positive pixels in pixel space are too sparse, so a naive binary cross-entropy loss causes the well-known problems of reduced training stability and performance [28]. We solve this by utilizing the hierarchical structure of the label. We decompose the probability of grasp success, conditioned on the pixel index $u$ and the additional inputs $x$ of the CNN, into three conditional probabilities as,

$p(S{=}1 \mid u, x) = p(S{=}1 \mid C{=}1, M{=}1, u, x) \, p(C{=}1 \mid M{=}1, u, x) \, p(M{=}1 \mid u, x),$

where $u$ is the pixel index, $M$ is a binary random variable for object presence, and $C$ is a binary random variable for valid collision. We then make DGQ-CNN predict each conditional probability, and this decomposition reduces the imbalance problem significantly.
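A minimal sketch of how such a decomposed loss could be computed is shown below. The channel ordering, the masking scheme (supervising each factor only where its conditioning holds), and the assumption that the network outputs are already sigmoid probabilities are illustrative choices, not the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def hierarchical_bce_loss(pred, labels):
    """Sketch of a decomposed training loss (channel order and masking are assumptions).

    pred / labels : (B, 6, H, W) probabilities in [0, 1], ordered as
                    3 quality levels + 1 object mask + 2 collision masks.
    """
    q_p,   obj_p, col_p = pred[:, 0:3],   pred[:, 3:4],   pred[:, 4:6]
    q_l,   obj_l, col_l = labels[:, 0:3], labels[:, 3:4], labels[:, 4:6]

    # p(M=1 | u, x): supervised on every pixel
    loss_obj = F.binary_cross_entropy(obj_p, obj_l)
    # p(C=1 | M=1, u, x): supervised only on object pixels
    loss_col = F.binary_cross_entropy(col_p, col_l, weight=obj_l.expand_as(col_p))
    # p(S=1 | C=1, M=1, u, x): supervised only on valid object pixels
    mask = (obj_l * col_l[:, :1]).expand_as(q_p)
    loss_q = F.binary_cross_entropy(q_p, q_l, weight=mask)
    return loss_obj + loss_col + loss_q
```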
V Experiment
The final step of this paper is to apply GraspPF on a real robot. We use a 6-DoF UR5e robot equipped with a Robotis Hand (RH-P12-RN), an Intel i9-9900K CPU at 3.60 GHz, and an Nvidia RTX 4000 GPU as the verification setting. The object set contains tools, dishes, cups, and toys, as shown in Fig. 1, which are common in households and not contained in the training dataset. To make the real robot perform the task, a controller and task planning are needed in addition to the grasp estimation framework. This section explains the implementation details, protocol, and results of the experiments.
V-A Controller Structure on Real Robot

In the experiment, we use two layers of hierarchy, as shown in Fig. 6: a high-level controller and a low-level controller. The high-level controller runs at 8 Hz and contains GraspPF, which estimates successful grasps, and the decision maker, which decides high-level motion semantics such as move-to-target or move-to-box. The low-level controller includes a target follower and admittance control. The simplest way to make the gripper approach the grasp target is to compute the vector from the gripper to the grasp target in the workspace and move the robot arm in that direction. However, this does not consider the gripper approach direction and can cause collisions between the gripper tip and objects. To prevent this, we design a target-following controller that considers the approach direction. It is based on waypoints; the details are omitted because they are out of the scope of this paper.
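For orientation, a hypothetical sketch of the high-level loop is given below; every interface name (`grasp_pf`, `decision_maker`, `target_follower`, `get_observation`, `box_pose`) is an assumption used only to illustrate the two-rate structure.

```python
import time

def high_level_loop(grasp_pf, decision_maker, target_follower, get_observation,
                    box_pose, rate_hz=8.0):
    """Hypothetical high-level control loop (all interfaces are assumptions).

    Runs at ~8 Hz: GraspPF refines the grasp belief from the latest wrist-camera
    observation, the decision maker picks the motion semantic, and the low-level
    target follower tracks the selected target at its own, faster rate.
    """
    period = 1.0 / rate_hz
    while True:
        t_start = time.time()
        obs = get_observation()                      # wrist-camera RGBD + camera pose
        particles = grasp_pf.update(obs)             # one GraspPF iteration (Sec. IV-B)
        best_grasp = grasp_pf.best(particles)        # highest-quality particle
        mode = decision_maker.step(best_grasp, obs)  # e.g. "move-to-target" / "move-to-box"
        target = best_grasp if mode == "move-to-target" else box_pose
        target_follower.set_target(target)           # consumed by the low-level controller
        time.sleep(max(0.0, period - (time.time() - t_start)))
```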
V-B Reachability Network
In the real robot implementation with the proposed framework and a heavily cluttered environment, there are cases where the robot cannot reach the target due to kinematic constraints or self-collision. To resolve this reachability issue, we split the grasp quality measure $Q$ into two sub-functions: one for the robot, $Q_{robot}$, and one for the scene, $Q_{scene}$. $Q_{scene}$ evaluates grasps based on collisions between the scene and the gripper and on the grasp quality with respect to the target object, which can be predicted by DGQ-CNN as explained in Sec. IV-C.
$Q_{robot}$, which evaluates grasps in terms of kinematics and self-collision, is relatively easy to obtain, because computing reachability only requires the occupancy and configuration of the robot, which are precisely accessible even on the real robot. It can therefore be calculated without any approximation, but empirically, methods based on collision checking and inverse kinematics (IK) are not fast enough to check all particles in real time when the number of particles is large. We therefore introduce a simple neural network to approximate $Q_{robot}$, for which data is generated with a simulator [25]. Regarding training details, states in the workspace are sampled uniformly within bounds and labeled based on IK and self-collision results. A simple 2-layer multilayer perceptron (MLP) is then trained with a binary classification loss. The final reachability network takes only 100 microseconds to check 1500 grasp candidates, and its accuracy is 97%.
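A minimal sketch of such a reachability classifier is shown below; the input parameterization (position plus Euler angles) and the hidden layer size are assumptions, with only the "2-layer MLP with binary labels" structure taken from the text.

```python
import torch
import torch.nn as nn

# Sketch of the reachability classifier Q_robot (layer sizes and input encoding are assumptions):
# a 2-layer MLP trained on binary labels from IK + self-collision checks, fast enough to
# score on the order of 1500 grasp candidates per high-level control step.
reachability_net = nn.Sequential(
    nn.Linear(6, 128), nn.ReLU(),     # input: grasp pose, e.g. position + zxy-Euler angles
    nn.Linear(128, 1), nn.Sigmoid(),  # output: probability that the grasp is reachable
)

def q_robot(grasp_poses: torch.Tensor) -> torch.Tensor:
    """Score an (N, 6) batch of candidate grasp poses for reachability."""
    with torch.no_grad():
        return reachability_net(grasp_poses).squeeze(-1)
```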
V-C Experiment Results
In this section, we present experimental results that quantitatively and qualitatively verify the performance of GraspPF and DGQ-CNN. First, to compare the predictions of DGQ-CNN with our intuition about grasp quality, the grasp quality predictions per direction are visualized in Fig. 7. In the example of the first row, a box-shaped object is placed at an angle; the high-quality region for the grasp direction that matches the inclined angle covers a larger area than for other directions (second column). Similarly, in the second example, the high-quality region shifts from left to right following the transition of the y-Euler angle. These examples show that the additional grasp-rotation input changes the high-quality areas in a way that matches our intuition.

To verify GraspPF quantitatively, a comparison experiment is conducted against baseline grasp prediction algorithms. We adopt GraspNet [3], Contact-GraspNet (abbreviated Con-GraspNet) [11], and GG-CNN [1] as baselines, because they are reported to perform well in cluttered environments and their original implementations are available. We use the pre-trained models provided by the authors, although our experiments use a different gripper (Robotis Hand) from those of the baselines; we reduce the impact by using the predicted grasp width for the pre-grasp motion, which keeps the interaction between the gripper and other entities minimal. GG-CNN is executed in a closed-loop manner with an implementation that reproduces the original Kinova Mico experiment. For all baselines, we filter out unreachable grasp candidates with $Q_{robot}$. GraspPF is executed in two versions: GraspPF-ol and GraspPF-cl. GraspPF-ol is the open-loop version, which uses the particle filter as a refinement step with a fixed observation. GraspPF-cl runs in a closed-loop manner and continuously refines grasps during the approaching motion.
Additionally, we implement five ablation versions of GraspPF: Sampling-ol, Sampling-cl, GraspPF-TD, GraspPF-real, and GraspPF-syn. Sampling-ol predicts grasps through forward sampling with DGQ-CNN and $Q_{robot}$ (without refinement) from a global observation at a fixed camera pose, and grasping is executed in an open-loop manner. Sampling-cl is the closed-loop version of Sampling-ol, in which a batch of grasps is predicted at every time step and the grasp is selected by the predicted quality and a distance regularizer with respect to the previously selected grasp. GraspPF-TD is identical to GraspPF-cl except that the grasp space is reduced to top-down (i.e., 4DoF). Lastly, GraspPF-real uses only real data to train DGQ-CNN, whereas GraspPF-syn uses only synthetic data.
The experiment protocol is as follows: 1) 62 unknown objects are placed in a box, 2) the box is poured onto the flat table to construct a heavily cluttered environment, 3) the initial gripper pose is set with a distance to the table uniformly selected within 0.45-0.65 m, 4) the grasping algorithm is executed and the result is logged, and 5) steps 3)-4) are repeated 20 times and the success rate is recorded. We repeat the experiment 7 times for each algorithm, so the total number of grasping trials per algorithm is 140. The comparison results are reported in Tab. II.
Columns 1-7 report the number of successes out of 20 grasps per scene-reset trial; Total is the overall success rate.

Methods | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Total |
---|---|---|---|---|---|---|---|---|
Baselines | | | | | | | | |
GraspNet [3] | 13 | 13 | 16 | 14 | 14 | 15 | 12 | 69% |
Con-GraspNet [11] | 12 | 13 | 11 | 13 | 15 | 13 | 13 | 64% |
GGCNN-cl [15] | 9 | 9 | 7 | 12 | 9 | 10 | 10 | 47% |
Proposed | | | | | | | | |
GraspPF-cl | 16 | 17 | 16 | 18 | 17 | 19 | 17 | 86% |
GraspPF-ol | 14 | 15 | 14 | 17 | 16 | 16 | 15 | 76% |
Ablations | | | | | | | | |
Sampling-cl | 14 | 12 | 14 | 18 | 12 | 13 | 14 | 69% |
Sampling-ol | 15 | 12 | 11 | 12 | 14 | 15 | 11 | 64% |
GraspPF-TD | 11 | 13 | 14 | 19 | 18 | 13 | 17 | 75% |
GraspPF-real | 7 | 7 | 9 | 6 | 7 | 9 | 7 | 37% |
GraspPF-syn | 14 | 18 | 16 | 15 | 17 | 14 | 14 | 77% |
First, in terms of open-loop performance, the proposed GraspPF-ol outperforms the baseline algorithms GraspNet and Con-GraspNet. Furthermore, the margin increases to 17% when the closed loop is applied (GraspPF-cl). GraspPF-cl also exceeds GGCNN-cl by 39%; we find that GG-CNN struggles to predict quality grasps in highly dense clutter. Next, to study the influence of each individual factor, the results of the 5 ablation experiments can be compared. Fixing the approach direction to top-down (GraspPF-TD) degrades performance compared with GraspPF-cl, which is evidence that introducing the 6DoF grasp space is beneficial, as also studied in [9]. Next, to see how introducing the closed loop affects performance, GraspPF-cl, Sampling-cl, GraspPF-ol, and Sampling-ol can be compared. Overall, applying the closed loop brings a performance gain, which comes from improved robustness to control errors and sensor noise, as previously studied in [15].
Regarding the influence of the particle filter, which amounts to extracting more information from sequential observations, Sampling-cl shows a 16% drop in success rate compared with GraspPF-cl. This result implies that the system keeps the prior distribution better through the particle filter and updates the distribution with multi-view observations. On the other hand, the effect of the particle filter as a refinement step is revealed by comparing Sampling-ol and GraspPF-ol, which shows an improvement of 13%. This improvement can be explained by the fact that forward sampling is based on Monte Carlo sampling, so it cannot sufficiently cover the large continuous grasp space, even though the space is reduced by the pixel-wise prediction as explained in Sec. IV-C; the particle filter can then refine the initial grasps and find better ones even with the same observation. Lastly, regarding the dataset source, performance declines in both GraspPF-real and GraspPF-syn, but it degrades much more in GraspPF-real. The slump of GraspPF-real is due to the lack of variation in camera parameters, including intrinsics and viewpoint, while GraspPF needs more dynamic variation. The 9% drop of GraspPF-syn mainly comes from the sim-to-real gap of the depth sensor.
Furthermore, in the real cleaning-up experiment, GraspPF-cl can chase an object, re-grasp after a failure, and react to a sudden object change. These abilities allow the framework to clean up all objects, including flat or deformable ones. The grasping failures mainly result from slippage: slippage arises during gripping when the estimated grasp region has too obtuse an angle, or axial slippage causes the object to fall during placing because the center of gravity is not considered.
VI Conclusion
In this paper, we show that a significant performance improvement can be obtained by merging three effective but orthogonal approaches: running in a closed loop, extending the grasp space to 6DoF, and utilizing sequential multi-view observations. This is achieved by developing two main components: 1) GraspPF, which enables the framework to retain a reasonable prior grasp distribution in the form of particles and to update the grasps from continuously received observations, and 2) DGQ-CNN, which is computationally efficient enough to run in a closed loop at test time and removes the restriction of grasp rotations to a fixed set by taking direction information as a network input. The resulting framework outperforms state-of-the-art algorithms in terms of success rate in real robot experiments in a heavily cluttered environment. Qualitatively, it can also chase an object by tracking the grasp and react to a sudden object change.
Future work includes extending the framework to more complex tasks by utilizing grasping as an action primitive, including discovering unintended behavior and long-horizon tasks. Additionally, high cost is one of the main obstacles to household robots, so grasping with a cost-efficient robot is an interesting direction. The cost can be reduced by lowering the DoF of the robot, and fortunately DGQ-CNN can generate various grasps, which can then be filtered by an adequate feasibility model.
Acknowledgment
The authors would like to thank Myungsin Kim, Unkyu Park, Jieun Park, Jaecheol Sim, Wonsik Shin, Joonmo Ahn, Jeongmin Lee, Jaeyoung Lim, Jaemin Yoon, Jaesik Chang, Rakjoon Chung, and Daekyoung Jung for their help on reviewing the idea and manuscript.
References
- [1] D. Morrison, P. Corke, and J. Leitner, “Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach,” in Robotics: Science and Systems, 2018.
- [2] A. Murali, A. Mousavian, C. Eppner, C. Paxton, and D. Fox, “6-dof grasping for target-driven object manipulation in clutter,” in International Conference on Robotics and Automation. IEEE, 2020, pp. 6232–6238.
- [3] H.-S. Fang, C. Wang, M. Gou, and C. Lu, “Graspnet-1billion: A large-scale benchmark for general object grasping,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 444–11 453.
- [4] D. Morrison, P. Corke, and J. Leitner, “Multi-view picking: Next-best-view reaching for improved grasping in clutter,” in IEEE International Conference on Robotics and Automation. IEEE, 2019, pp. 8762–8768.
- [5] S. Thrun, W. Burgard, and D. Fox, “Probabilistic robotics,” 2005.
- [6] C. Wang, H.-S. Fang, M. Gou, H. Fang, J. Gao, and C. Lu, “Graspness discovery in clutters for fast and accurate grasp detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 964–15 973.
- [7] A. ten Pas, C. Keil, and R. Platt, “Efficient and accurate candidate generation for grasp pose detection in se (3),” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, pp. 5725–5732.
- [8] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” Robotics: Science and Systems, 2017.
- [9] Y. Li, T. Kong, R. Chu, Y. Li, P. Wang, and L. Li, “Simultaneous semantic and collision learning for 6-dof grasp pose estimation,” IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021.
- [10] A. Mousavian, C. Eppner, and D. Fox, “6-dof graspnet: Variational grasp generation for object manipulation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2901–2910.
- [11] M. Sundermeyer, A. Mousavian, R. Triebel, and D. Fox, “Contact-graspnet: Efficient 6-dof grasp generation in cluttered scenes,” IEEE International Conference on Robotics and Automation, 2021.
- [12] A. ten Pas, M. Gualtieri, K. Saenko, and R. Platt, “Grasp pose detection in point clouds,” The International Journal of Robotics Research, vol. 36, no. 13-14, pp. 1455–1473, 2017.
- [13] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke et al., “Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” arXiv preprint arXiv:1806.10293, 2018.
- [14] S. Song, A. Zeng, J. Lee, and T. Funkhouser, “Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations,” IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4978–4985, 2020.
- [15] D. Morrison, P. Corke, and J. Leitner, “Learning robust, real-time, reactive robotic grasping,” The International journal of robotics research, vol. 39, no. 2-3, pp. 183–201, 2020.
- [16] C. Eppner, A. Mousavian, and D. Fox, “Acronym: A large-scale grasp dataset based on simulation,” in IEEE International Conference on Robotics and Automation. IEEE, 2021, pp. 6222–6227.
- [17] X. Deng, A. Mousavian, Y. Xiang, F. Xia, T. Bretl, and D. Fox, “Poserbpf: A rao–blackwellized particle filter for 6-d object pose tracking,” IEEE Transactions on Robotics, 2021.
- [18] L. Manuelli and R. Tedrake, “Localizing external contact using proprioceptive sensors: The contact particle filter,” in IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2016, pp. 5062–5069.
- [19] F. Wirnshofer, P. S. Schmitt, P. Meister, G. v. Wichert, and W. Burgard, “State estimation in contact-rich manipulation,” in International Conference on Robotics and Automation. IEEE, 2019, pp. 3790–3796.
- [20] M. Adjigble, N. Marturi, V. Ortenzi, V. Rajasekaran, P. Corke, and R. Stolkin, “Model-free and learning-free grasping by local contact moment matching,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 2933–2940.
- [21] A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo et al., “Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching,” in IEEE International Conference on Robotics and Automation. IEEE, 2018, pp. 3750–3757.
- [22] D. Kappler, J. Bohg, and S. Schaal, “Leveraging big data for grasp planning,” in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 4304–4311.
- [23] R. Kaskman, S. Zakharov, I. Shugurov, and S. Ilic, “Homebreweddb: Rgb-d dataset for 6d pose estimation of 3d objects,” in IEEE/CVF International Conference on Computer Vision, 2019.
- [24] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes,” Robotics: Science and Systems, 2018.
- [25] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” http://pybullet.org, 2016–2019.
- [26] D. Morrison, P. Corke, and J. Leitner, “Egad! an evolved grasping analysis dataset for diversity and reproducibility in robotic manipulation,” IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4368–4375, 2020.
- [27] B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, and A. M. Dollar, “The ycb object and model set: Towards common benchmarks for manipulation research,” in 2015 international conference on advanced robotics (ICAR). IEEE, 2015, pp. 510–517.
- [28] S. Jadon, “A survey of loss functions for semantic segmentation,” in 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology. IEEE, 2020, pp. 1–7.