Learning Efficient Policies for Picking
Entangled Wire Harnesses: An Approach to
Industrial Bin Picking

Xinyi Zhang¹, Yukiyasu Domae², Weiwei Wan¹, and Kensuke Harada¹,²

Manuscript received: July 10, 2022; Revised October 6, 2022; Accepted October 29, 2022. This paper was recommended for publication by Editor Hong Liu upon evaluation of the Associate Editor and Reviewers’ comments. This work was supported by Toyota Motor Corporation.

¹Xinyi Zhang and Weiwei Wan are with the Graduate School of Engineering Science, Osaka University, Toyonaka, Osaka 560-0043, Japan (e-mail: [email protected]; [email protected]).

²Yukiyasu Domae is with the National Institute of Advanced Industrial Science and Technology, Tokyo, Japan (e-mail: [email protected]).

¹,²Kensuke Harada is with the Graduate School of Engineering Science, Osaka University, Toyonaka, Osaka 560-0043, Japan, and also with the National Institute of Advanced Industrial Science and Technology, Tokyo, Japan (e-mail: [email protected]).

Supplementary material, code and video can be found at https://github.com/xinyiz0931/aspnet.

Abstract

Wire harnesses are essential connecting components in the manufacturing industry but are challenging to automate in industrial tasks such as bin picking. They are long and flexible and tend to get entangled when randomly placed in a bin. This makes it difficult for the robot to grasp a single one in dense clutter. Besides, training or collecting data in simulation is challenging due to the difficulties in modeling the combination of deformable and rigid components of wire harnesses. In this work, instead of directly lifting wire harnesses, we propose to grasp and extract the target following a circle-like trajectory until it is untangled. We learn a policy from real-world data that can infer grasps and separation actions from visual observation. Our policy enables the robot to efficiently pick and separate entangled wire harnesses by maximizing success rates and reducing execution time. To evaluate our policy, we present a set of real-world experiments on picking wire harnesses. Our policy achieves an overall 84.6% success rate compared with 49.2% for the baseline. We also evaluate the effectiveness of our policy under different clutter scenarios using unseen types of wire harnesses. Results suggest that our approach is feasible for handling wire harnesses in industrial bin picking. Supplementary material, code, and videos can be found at https://xinyiz0931.github.io/aspnet.

Index Terms:

Grasping, Deep Learning in Grasping and Manipulation, Factory Automation

I Introduction

Bin picking is a vital task in the manufacturing industry that enables a robot to pick objects randomly placed in a bin. If we try to automate an assembly process without using bin picking, we need to prepare a large number of parts feeders according to the number of assembly parts. Although robotic bin picking has been researched for decades [1, 2, 3, 4, 5, 6], some objects (e.g., wire harnesses) can still be challenging when automating this process. A wire harness is an indispensable component used in almost every electric drive product. Fig. 1(a) shows its appearance. It comprises a group of bundled wires and multi-conducted connectors and is used for transmitting signals and power. The structure of a wire harness also poses challenges in robotic bin picking: (1) The existence of both deformable and rigid components makes them easily form an entangled clutter in the bin; (2) The complex geometries and deformable nature cause difficulties in 3D modeling; (3) The length of a wire harness often exceeds the operation range of a robot, making it difficult to extract one from the bin. To successfully perform bin picking with wire harnesses, the robot must be capable of effectively isolating each one from the entanglement. Because this capability is lacking, the manufacturing industry still relies on human workers to grasp and separate entangled wire harnesses. Therefore, an intelligent system that automates this process is in high demand.

Figure 1: (a-b) Wire harnesses are composed of both deformable and rigid components. They get entangled easily in clutter and their length may exceed the robot arm’s reach. (c) Directly lifting a wire harness causes entanglement. (d) We learn a bin picking policy to efficiently extract an entangled wire harness from an unstructured bin.

Existing works on industrial bin picking have primarily focused on rigid parts. These methods grasp objects by avoiding collisions in highly cluttered environments [1, 4, 5, 6, 7, 8]. For picking simply shaped objects, the robot usually lifts the target in the vertical direction after a successful grasp. Unlike those objects, wire harnesses involve complex entanglement when randomly placed in a bin. Besides, they are much longer than the rigid parts already automated in bin picking. The physical reach of the robot in a bin picking working cell is too limited to lift them completely. Simply adapting the existing bin picking strategies yields unsatisfactory performance (see Fig. 1(c)). Some previous studies have addressed entanglement problems, but only for picking curved rigid parts by avoiding the potentially tangled parts [9, 10]. However, problems remain for densely cluttered wire harnesses, where the bin often contains no isolated objects, as Fig. 1(b) shows. Motivated by modeling and manipulating deformable linear objects, studies on visually processing wire harnesses start with segmenting or generating synthetic data for wire harnesses with pure linear shapes [11], [12]. For those with complex geometries, obtaining precise models or training in simulation remains difficult. Alternatively, employing a real robot to collect large-scale data is time-consuming. Annotating ground truth labels is also challenging due to the lack of entanglement metrics.

In this paper, we (1) design an effective motion to untangle wire harnesses in clutter and (2) learn a policy to perform bin picking tasks with higher success rates and lower execution time. The key components of our system are:

• We propose a post-grasping action to untangle wire harnesses. Instead of lifting in the vertical direction, the robot separates the entangled objects in the horizontal direction. The action continuously follows a circle-like trajectory to extract the target within the robot’s limited reach. Fig. 1(d) shows this process.

• We learn a bin picking policy to infer an optimal grasp and a post-grasping action from a depth image. Our policy can prioritize grasping the untangled objects, avoid grasping at bad positions (e.g., the ends of the object) and reason about the extraction distance to reduce the execution time of a successful pick. Additionally, we train the policy with real-world data by leveraging active learning to achieve satisfactory convergence.

Our contributions are three-fold. (1) We develop a unique bin picking system that can disentangle wire harnesses from dense clutter. (2) Instead of lifting the target in the vertical direction after grasping, our policy proposes to simultaneously lift and move in the horizontal direction for separating wire harnesses. (3) We learn a policy using real-world data to infer the optimal actions, which further improves bin picking efficiency. Real-world experiments suggest our policy can significantly improve the average success rates and reduce operation time compared with baselines.

II Related Work

II-A Industrial Bin Picking

Industrial bin picking has been developed for decades. Prior works have primarily focused on model-based approaches such as 3D object localization or 6D pose identification [2, 13, 14] and grasp planning [7, 8, 15]. Alternatively, model-free methods do not require known object information and can produce grasp poses for novel objects. Domae et al. [1] proposed to plan grasps considering collisions between the gripper and the objects from a single depth image. Several works leverage deep learning to mitigate the challenges of complex physical interactions and environment uncertainties. Mahler et al. [16] trained a model from synthetic data to produce collision-free grasps for daily objects. Matsumura et al. [5] proposed a learning-based method to plan robust grasps for industrial parts. However, there remain challenges in handling difficult objects. Recently, several works tackled the challenges of objects that are (1) difficult to recognize, e.g., transparent or shiny objects [6, 17], (2) difficult to grasp, e.g., thin and elliptical objects [18, 19], and (3) involved in rich physical interactions, e.g., tangle-prone objects [9, 10]. So far, these approaches have focused on rigid objects. Leao et al. [20] proposed a method to pick up soft tubes by fitting shape primitives to clutter, but it does not work on dense clutter or objects with irregular shapes. Non-rigid objects with complex geometries, like wire harnesses, are relatively unexplored and remain challenging in the industrial bin picking domain.

II-B Deformable Object Manipulation

Deformable object manipulation has primarily focused on two object classes: 1D (cable, rope) and 2D (fabric, cloth). Several studies adopt specially designed motion primitives to accomplish various manipulation tasks such as knot tying/untying [21, 22, 23], spreading cloth [24] or whipping ropes [25]. Using deformable and long objects in industrial bin picking poses new challenges. The cluttered scenes are more complex due to the entanglement caused by their deformable nature. Ray et al. [26] proposed to untangle herbs from a pile using a two-finger gripper. Takahashi et al. [27] proposed a learning-based separation strategy for grasping a specified mass of small food pieces. Although some works have addressed factory automation problems for wire harnesses [11, 28, 29], robotic picking of wire harnesses is less studied. In this paper, we propose a novel and efficient bin picking strategy to deal with wire harnesses.

III Motion Primitives for Disentangling

When a robot tries to isolate small and rigid objects from a bin, it can lift them in a vertical direction after a successful grasp. However, this movement is insufficient for isolating long and flexible objects like a wire harness, whose length exceeds the bin picking workspace. To extract such objects, the required motion primitives must be designed to (1) provide enough space for effectively disentangling long objects and (2) handle various tangle patterns. Instead of directly lifting, the horizontal movement of the gripper can help pull the target object out. The possible positions of the gripper should also remain in the outer part of the parts bin during disentangling. In the end, we propose two motion primitives for effectively disentangling a long and flexible object:

Helix motion: $\psi_{H}=(H,\theta_{H})$ where $H$ denotes the helix trajectory represented by $(c_{H},r_{x},r_{y},h_{0},h)$ and $\theta_{H}$ denotes the execution angle following the trajectory (see Fig. 2(a)). $c_{H}$ denotes the base center of $H$ and $r_{x},r_{y}$ constrain the smallest and largest radius from the center. $h$ denotes the height of $H$ . The helix starts after the gripper lifts the target and reaches $h_{0}$ . The stop point of the helix is determined by the execution angle $\theta_{H}$ . It is a post-grasping motion where the gripper simultaneously lifts and pulls following a helix-like trajectory. Let the gripper move around the bin while holding an entangled object. Part of this object is also moving outside the bin. When the gripper continuously moves like drawing circles, the grasped object can be disentangled softly along a side angle. Fig. 3(a) shows that this movement provides adequate space to pull the target (green) out of the entangled objects (yellow). Meanwhile, we also observe that other entangled objects remain in the bin during or after this process, making the workspace clean for the next picking.
Spinning motion: $\psi_{S}=(c_{S},\theta_{S})$ where $c_{S}$ denotes the position of the gripper tip and $\theta_{S}$ denotes the one-way rotation angle of the spinning (see Fig. 2(b)). The robot performs a two-way spinning about the axis that is vertical to the robot workspace. The gripper spins to handle the entanglement that may be occluded from the observation. As Fig. 3(b) shows, when the rigid components of the wire harness still slightly hang on the others after the helix motion, an extra spinning can help separate them with less execution time. It can also handle the length of a wire harness by extracting it inside a limited working cell.
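The helix trajectory can be illustrated with a short sketch that samples gripper waypoints from $(c_{H},r_{x},r_{y},h_{0},h)$ and $\theta_{H}$. The exact parametric form is not given in the text, so the elliptical parametrization below, the linear height profile, the function name `helix_waypoints` and the waypoint count `n` are all assumptions made for illustration. The numeric values are the ones reported in Section V-B.

```python
import math

def helix_waypoints(c_h, r_x, r_y, h0, h, theta_h, n=32):
    """Sample n + 1 gripper positions (x, y, z) along an elliptical helix.

    Assumed parametrization (not stated in the paper): the angle t sweeps
    from 0 to theta_h around the center c_h, with semi-axes r_x and r_y,
    while the height rises linearly from h0 to h0 + h.
    """
    points = []
    for k in range(n + 1):
        t = theta_h * k / n
        x = c_h[0] + r_x * math.cos(t)
        y = c_h[1] + r_y * math.sin(t)
        z = h0 + h * k / n
        points.append((x, y, z))
    return points

# One full circle (theta_H = 2*pi) with the parameters fixed in Section V-B.
full_circle = helix_waypoints((0.525, 0.065), 0.1, 0.225, 0.32, 0.14, 2 * math.pi)
```

Under this assumed form, a larger $\theta_{H}$ simply extends the sweep, which is consistent with the idea that more turns give the target more room to slide out of the pile.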

IV Learning Bin Picking Policies

The goal of our bin picking policy is to pick up a single wire harness at a time by inferring the optimal grasp and action from the current entanglement situation. If the scene contains isolated objects, the robot prefers directly lifting them after grasping. Otherwise, the robot infers disentangling actions and grasp poses to extract the target from the bin. Given a top-down depth image $o$ as observation, we formulate our bin picking policy $\pi$ with a trained model parameterized by $\tau$ using:

$$a^{*},g^{*}=\pi_{\tau}(o) \tag{1}$$

where the outputs are an action $a^{*}$ and a grasp $g^{*}$ with the maximal task effectiveness. The action $a$ comprises the proposed motion primitives. Fig. 4 shows the three essential modules in our policy:
Module I. Model-Free Grasp Detection: A grasp detection algorithm using a depth image without object models.
Module II. Action Success Prediction (ASP): A trained model using real-world data that predicts the success possibilities of the disentangling actions.
Module III. Action-Grasp Inference: A method to infer the action-grasp pair with the highest effectiveness using the trained ASP model.

IV-A Model-Free Grasp Detection

We select Fast Graspability Evaluation (FGE) [1], a model-free approach, to detect collision-free grasps. FGE calculates pixel-wise graspability scores by convolving a gripper template of contact and collision areas with the input depth map. A grasp comprises a pixel location $g=(u,v)$ on the depth map and a rotation angle $\phi$ indicating the gripper’s orientation. We transform $(u,v,\phi)$ to the grasp with four degrees of freedom $(g_{x},g_{y},g_{z},g_{\phi})$ denoting the grasp point and the gripper’s orientation in the robot coordinate frame. This module outputs a set of grasps ordered by their FGE scores.

IV-B Action Success Prediction (ASP)

1) Action Formulation: We formulate each disentangling action $a$ with a motion scheme $\psi$ and two parameters as follows:

$$a=(\psi,\theta_{H},\theta_{S})\;|\;\psi=\{\psi_{H}\}\;\text{or}\;\{\psi_{H},\psi_{S}\} \tag{2}$$

where the robot only performs the helix motion $\psi_{H}$ or performs the spinning motion $\psi_{S}$ after $\psi_{H}$ . Note that directly executing $\psi_{S}$ after grasping may not be effective since the extracting displacement of the target object is small. We propose six separation actions $a_{h},a_{hs},a_{f},a_{fs},a_{tf},a_{tfs}$ using two motion primitives and a direct lifting action $a_{dl}$ . Table I shows their notations and illustrations. We use $M$ to represent the collection of these seven actions.

2) Action Parameter Determination: To determine the parameters of each action and search for the best action, we define a numerical metric, action complexity, for exploring the trade-off between success rates and execution time. Let $\mathcal{A}(a)$ denote the action complexity of the action $a\in\{a_{dl},a_{h},a_{hs},a_{f},a_{fs},a_{tf},a_{tfs}\}$ . It is defined by assuming that actions with larger $\theta_{H}$ or $\theta_{S}$ involve higher complexity. To reduce the search cost during exploration, we assume that the action complexity scales linearly with the success rate of each action. We find this linear relationship by executing 80 physical attempts for each action, as Table I presents. Then, we use this hypothesis to determine the action parameters experimentally. Specifically, we predefine a set of possible values of $\theta_{H},\theta_{S}$ for our policy to select from. $\theta_{H}$ is selected from $\{0,\pi,2\pi,4\pi\}$ and $\theta_{S}$ is selected from $\{0,\pi/2\}$ . Note that the other parameters of the motion primitives $H=(c_{H},r_{x},r_{y},h_{0},h)$ and $c_{S}$ are fixed in our policy. Finally, we assign integers 0 to 6 as the action complexity for the discrete actions from $a_{dl}$ to $a_{tfs}$ . The action parameters and execution details are included in Table I.

TABLE I: Action Parameters and Execution Details

	$a_{dl}$	$a_{h}$	$a_{hs}$	$a_{f}$	$a_{fs}$	$a_{tf}$	$a_{tfs}$

$\psi$	-	$\{\psi_{H}\}$	$\{\psi_{H},\psi_{S}\}$	$\{\psi_{H}\}$	$\{\psi_{H},\psi_{S}\}$	$\{\psi_{H}\}$	$\{\psi_{H},\psi_{S}\}$
$\theta_{H}$	$0$	$\pi$	$\pi$	$2\pi$	$2\pi$	$4\pi$	$4\pi$
$\theta_{S}$	0	0	$\pi/2$	0	$\pi/2$	0	$\pi/2$
Time (s)	1.2	2.3	2.8	5	5.5	8.2	8.7
SR	31/80	47/80	60/80	65/80	66/80	70/80	72/80
$\mathcal{A}$	0	1	2	3	4	5	6

* Time (s) - Execution time of performing the action trajectory.
* SR - Success Rate of picking a single object.
* $\mathcal{A}$ - Action complexity.
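As a quick check of the assumed relationship between action complexity and success rate, the Table I values can be correlated directly. This is only an illustrative sketch, not the analysis used in the paper; the helper `pearson` is a hypothetical name.

```python
import math

# Values copied from Table I: action complexity and successes out of 80 attempts.
complexity = [0, 1, 2, 3, 4, 5, 6]
successes = [31, 47, 60, 65, 66, 70, 72]
rates = [s / 80 for s in successes]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson(complexity, rates)
# True when each more complex action succeeds at least as often as the previous one.
is_monotonic = all(a <= b for a, b in zip(rates, rates[1:]))
print(f"Pearson r = {r:.3f}, monotonic = {is_monotonic}")
```

The success rate never decreases as complexity grows and the correlation is above 0.9, although the gains flatten for the most complex actions.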

Our policy can explore the optimal action by minimizing the action complexity as much as possible. Let us consider a case when the robot performs $a_{tf}$ to extract an entangled object. Suppose the target object is entirely disentangled after a full circle ( $a_{f}$ ) while the robot still needs to perform the second circle. Thus, the current observation only requires $a_{f}$ as the optimal action to ensure a successful separation with less execution time, while $a_{tf}$ is a redundant action which can also solve the entanglement but costs more time. We can observe that an optimal action has lower action complexity than a redundant action. Thus, the optimal action is required to untangle the target with minimal action complexity.

3) Prediction Model: The inference of the optimal action without object models should be conditioned on the grasp locations. We propose Action Success Prediction (ASP) to predict if the action-grasp pair can successfully separate the target. ASP learns a function parameterized by $\tau$ :

$$p=f_{\tau}(o,g,a) \tag{3}$$

where the input is a depth image $o$ $\in\mathbb{R}^{224\times 224\times 3}$ with triplicated depth values across three channels to match with the default input size of the image encoder’s backbone, a pixel-wise grasp pose $g=(u,v)$ $\in\mathbb{R}^{2}$ , a categorical action $a$ $\in\mathbb{R}^{7}$ and the output is a success possibility in the range of $[0,1]$ . We encode the image using a ResNet-50 backbone [30], the grasp point using a single fully-connected layer with 256 units, and the categorical action using a fully-connected layer with 14 units. Then we concatenate the output from all three branches and feed it to a fully-connected layer with 256 units and produce an action success possibility.

Algorithm 1: Active Learning Algorithm

Input: Data pool, transfer ratio $r$ , actions $M$
Output: ASP model $\tau$

Select training data from data pool
Train ASP model $\tau$ using training data
while data pool is not empty do
    $N\leftarrow$ number of samples in data pool
    $i\leftarrow 0$
    while $i\leqslant r\times N$ do
        Randomly select $\{o,g,a,S\}$ from data pool
        $a_{p}\leftarrow$ ActionGraspInference( $o,g,M,\tau$ )
        if $S=1$ and $\mathcal{A}(a_{p})\leqslant\mathcal{A}(a)$ then    // Success-logical
            Move to training data, $i=i+1$
        else if $S=0$ and $\mathcal{A}(a_{p})>\mathcal{A}(a)$ then    // Failure-logical
            Move to training data, $i=i+1$
    Fine-tune ASP model $\tau$ using training data

4) Training via Active Learning: The dataset for training ASP is entirely collected from real-world experiments. Each sample has a depth image $o$ , a grasp location $g$ , a labeled action $a$ and a binary success label $S\in\{0,1\}$ . We execute each action on clutter containing 6, 10, 12 and 18 objects. We label each attempt with success ( $S=1$ ) or failure ( $S=0$ ) depending on whether the robot picks a single wire harness. Because of this data collection procedure, some samples in the dataset are labeled with redundant actions instead of optimal actions. To deal with this problem, we leverage active learning to train the ASP model, making it possible to predict optimal actions using this dataset. In general, we first manually select several samples as training data to train the model, use the trained model to predict the remaining samples, query and transfer samples for training, and fine-tune the model repeatedly. Specifically, we manually select the initial training data with approximately optimal actions. Note that the number of samples for each action is roughly equal. Let the data pool denote the remaining samples that are not in the training data. After training, we query the samples in the data pool and transfer the logical samples to the training data. Here, a sample $(o,g,a,S)$ can be determined as success-logical or failure-logical using the trained model $\tau$ and our proposed Action-Grasp Inference module (Section IV-C). Let $a_{p}=$ ActionGraspInference( $o,g,M,\tau$ ) denote the predicted action:

• Success-logical: $\mathcal{A}(a_{p})\leqslant\mathcal{A}(a)$ . For samples labeled with $S=1$ , the labeled action $a$ is a redundant action compared with the predicted action $a_{p}$ (Fig. 5(a)).

• Failure-logical: $\mathcal{A}(a_{p})>\mathcal{A}(a)$ . For samples labeled with action $a$ and failure $S=0$ , the predicted action $a_{p}$ has higher action complexity (Fig. 5(b)).

During each iteration, as the number of logical samples increases, the model’s ability to predict the optimal actions also improves. We define a transfer ratio $r$ as the ratio of the number of samples transferred in each iteration to the number of samples in the current data pool. The iteration stops when the data pool is empty or when training is stopped early to avoid overfitting. Algorithm 1 shows the details of training ASP via active learning.
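The sample-transfer loop of Algorithm 1 can be sketched in a few lines. This is a simplified illustration rather than the released implementation: `infer_action` and `fine_tune` are hypothetical callbacks wrapping the ASP model, each sample is visited at most once per round, and `max_rounds` is an added cap standing in for early stopping.

```python
import random

def active_learning(train_set, pool, r, infer_action, complexity, fine_tune, max_rounds=10):
    """Move 'logical' samples from pool to train_set, fine-tuning after each round.

    Each sample is a dict with keys "o", "g", "a", "S". infer_action(o, g)
    returns the predicted action a_p; fine_tune(train_set) updates the model.
    """
    for _ in range(max_rounds):
        if not pool:
            break
        quota = r * len(pool)  # transfers continue while moved <= r * N
        moved = 0
        candidates = pool[:]
        random.shuffle(candidates)
        remaining = []
        for sample in candidates:
            logical = False
            if moved <= quota:
                a_p = infer_action(sample["o"], sample["g"])
                c_p, c_a = complexity[a_p], complexity[sample["a"]]
                success_logical = sample["S"] == 1 and c_p <= c_a
                failure_logical = sample["S"] == 0 and c_p > c_a
                logical = success_logical or failure_logical
            if logical:
                train_set.append(sample)
                moved += 1
            else:
                remaining.append(sample)
        pool[:] = remaining
        fine_tune(train_set)
    return train_set, pool
```

Samples that are neither success-logical nor failure-logical stay in the pool, which matches the 92 samples reported as left over after the final fine-tuning.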

IV-C Action-Grasp Inference

At this point, we have obtained a set of grasp candidates, action candidates and the scores of each action-grasp pair. Our policy then needs to determine which action-grasp pair to execute. This module evaluates all possible action-grasp pairs and selects the one expected to achieve a successful pick with minimal action complexity:

$$a^{*},g^{*}=\texttt{ActionGraspInference}(o,G,M,\tau) \tag{4}$$

where the inputs are a depth image $o$ , a collection of actions $M$ , grasp candidates $G$ with FGE scores from the Model-Free Grasp Detection module and ASP model $\tau$ . This module first predicts the action success possibilities of all action-grasp pairs $P=f_{\tau}(o,G,M)$ . If all possibilities in $P$ are lower than the threshold $p_{thld}$ , which means all action-grasp pairs cannot solve the entanglement, we select the grasp with the highest FGE score and the most complex action $a_{tfs}$ . Otherwise, the best solution is determined by the action-grasp pair with the lowest action complexity. If multiple grasps share the same action complexity, we select the pair with the highest FGE score.
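The selection rule described above can be written as a minimal sketch. The function name `action_grasp_inference` and the `predict` callback are hypothetical; `predict(grasp, action)` is assumed to wrap $f_{\tau}$ with the depth image $o$ already bound.

```python
# Action complexities as assigned in Table I.
COMPLEXITY = {"a_dl": 0, "a_h": 1, "a_hs": 2, "a_f": 3, "a_fs": 4, "a_tf": 5, "a_tfs": 6}

def action_grasp_inference(grasps, predict, p_thld=0.5):
    """Return (action, grasp) following the rule of Section IV-C.

    grasps: list of (grasp, fge_score) pairs.
    predict(grasp, action): success possibility in [0, 1].
    """
    feasible = [
        (COMPLEXITY[a], -fge, a, g)
        for g, fge in grasps
        for a in COMPLEXITY
        if predict(g, a) >= p_thld
    ]
    if not feasible:
        # No pair is predicted to succeed: fall back to the grasp with the
        # highest FGE score and the most complex action.
        best_grasp, _ = max(grasps, key=lambda item: item[1])
        return "a_tfs", best_grasp
    # Lowest action complexity first, then highest FGE score.
    _, _, action, grasp = min(feasible, key=lambda item: item[:2])
    return action, grasp
```

Ordering by complexity before FGE score means a slightly worse grasp is preferred when it permits a cheaper action, which is how the policy trades grasp quality against execution time.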

V Experiments and Results

We conduct several real-world experiments to answer the following three questions: (1) How does the learned ASP model perform using active learning? (Section V-A) (2) Does our bin picking policy perform more accurately and effectively than baselines? (Section V-B) (3) How does our method qualitatively improve the performance of picking wire harnesses? (Section V-C)

V-A ASP Model Performance

Our dataset contains 722 samples. We set the transfer ratio of active learning to $r=0.4$ and use a simple decision threshold of $p_{thld}=0.5$ over the softmax of each action’s success possibility to classify success (1) or failure (0). We train the network using the binary cross-entropy loss function and the Adam optimizer. We stop training after three rounds of fine-tuning, as this achieves the best performance. Fig. 6 shows the accuracy and loss during active learning. The gray curve refers to the Initial Model (IM) trained using manually determined samples, which is potentially accurate but lacks robustness because it is trained on less data. The green line indicates the Final Model (FM), which performs the best as the fine-tuning goes on since it converges to IM but with higher data-driven accuracy.

Moreover, Table II shows the details of each iteration in active learning. Rows 1-2 show the number of samples used as the training data and left in the data pool. In particular, the 92 samples left in the data pool after the final fine-tuning are used to validate all models by checking the number of logical samples. Rows 3-4 show that the ratios of logical samples increase overall over the fine-tuning process. Finally, Table III validates our hypothesis that more complex actions correspond to higher success rates. We present the average scores predicted by FM for each action. FM correctly predicts a broadly ascending order of possibilities as the action complexity increases. We can observe that $a_{fs},a_{tf},a_{tfs}$ share similar scores since the validation samples contain 18 objects at most. $a_{tfs}$ does not show a significantly higher score because of the accumulated low scores in cases where all predictions fail and $a_{tfs}$ is forced to be selected.

TABLE II: Details and Validation Results of Active Learning

	IM	2nd	3rd	FM
# Samples in Training Data	282	453	558	618
# Samples in Data pool	428	257	152	92
Ratio of Success-Logical ( $\%$ )	78.5	85.7	87.8	88.9
Ratio of Failure-Logical ( $\%$ )	85.1	80.1	90.1	91.7

TABLE III: Average Predicted Scores using Validation Samples

	$a_{dl}$	$a_{h}$	$a_{hs}$	$a_{f}$	$a_{fs}$	$a_{tf}$	$a_{tfs}$
$S=1$	0.352	0.489	0.702	0.750	0.783	0.730	0.787
$S=0$	0.257	0.375	0.581	0.636	0.678	0.606	0.685

TABLE IV: Performance of Bin Picking Experiments

		5 Objects			10 Objects			15 Objects
	Method	Success Rate (%)	PPH	Avg. $\mathcal{A}$	Success Rate (%)	PPH	Avg. $\mathcal{A}$	Success Rate (%)	PPH	Avg. $\mathcal{A}$
Consecutive Picking	DL	64.0	128	-	60.0	92	-	56.0	108	-
	RAND	88.0	115	2.3	92.0	117	2.5	76.0	99	2.8
	TFS	96.0	133	-	92.0	127	-	90.0	124	-
	Ours-IM	84.0	131	0.8	76.0	117	2.3	74.0	111	2.9
	Ours-FM	88.0	156	0.8	88.0	140	2.8	86.0	143	2.3
	Ours-FM-R	89.8	154	0.8	89.8	142	2.8	84.4	123	2.9
		18-20 Objects			20-22 Objects			22-25 Objects
	Method	Success Rate (%)	PPH	Avg. $\mathcal{A}$	Success Rate (%)	PPH	Avg. $\mathcal{A}$	Success Rate (%)	PPH	Avg. $\mathcal{A}$
Randomized Picking	DL	46.6	93	-	40.0	80	-	23.3	47	-
	Ours-FM	86.7	113	2.9	80.0	112	3.3	73.3	103	4.3
	Ours-FM-R	92.6	108	2.6	91.7	103	4.5	76.9	107	4.6

V-B Bin Picking Performance

1) Physical Experiment Setup: We use a NEXTAGE robot from Kawada Industries Inc. The robot is required to grasp objects from the parts bin lying in front of it and transport them to another bin located on its left side. The robot’s left arm operates over a workspace captured as a top-down depth image by a Photoneo PhoXi 3D scanner M. A two-fingered parallel gripper is attached at the arm tip. The setup is shown in Fig. 7(a). The length of the wire harness used in this work is 74cm. After performing the analysis and physical experiments, we fix the parameters of the proposed trajectory as $c_{H}=$ (0.525,0.065)[m], $r_{x}=$ 0.1m, $r_{y}=$ 0.225m, $h_{0}=$ 0.32m, $h=$ 0.14m as well as the speed of the action since they yield high task effectiveness. We sample several waypoints on the trajectory and plan motions with a uniform velocity. We use a PC with an Intel Core i7-CPU and 16GB memory without GPU for real-world experiments and a PC with an Intel Core i5-6400 CPU, 16GB memory and an Nvidia GeForce 1080 GPU for learning.

We present three baselines. DL (directly lifting) uses FGE to detect the grasp point and executes by directly lifting ( $a_{dl}$ ). RAND executes a random action and the grasp with the highest FGE score. TFS only executes the most complex action $a_{tfs}$ and the grasp of the highest FGE score. We also present three versions of our policy. Ours-IM is our policy using the initial model in active learning while Ours-FM uses the final model. Ours-FM-R denotes Ours-FM with a recovery module using force feedback. After performing the predicted action, we record the force from an F/T sensor mounted on the robot’s wrist to determine if the grasped wire harness is still entangled. If there exists a sudden increase of force, the target is not disentangled and the robot places it back to the parts bin.
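The recovery check can be pictured as a simple test on the recorded force signal. The paper does not state how a "sudden increase" is detected, so the consecutive-difference rule, the function name `still_entangled` and the threshold value below are assumptions for illustration only.

```python
def still_entangled(forces, jump_thld=5.0):
    """Return True if the force magnitude (N) rises by more than jump_thld
    between two consecutive readings.

    jump_thld is a hypothetical value; a real system would tune it to the
    sensor, the sampling rate and the weight of one wire harness.
    """
    return any(later - earlier > jump_thld for earlier, later in zip(forces, forces[1:]))
```

When the check returns True, the robot would place the grasped object back into the parts bin instead of transporting it.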

We leverage two metrics to evaluate the bin picking performance. Success rate refers to the number of successful attempts of picking up a single object divided by the number of attempts of placing. PPH (Pickings Per Hour) is the number of successful attempts the robot can execute in one hour. Additionally, we present Avg. $\mathcal{A}$ (Average action complexity) to evaluate how the action complexity predicted by our policy varies under different entanglement scenarios.
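Both metrics reduce to simple ratios, sketched below. The function names and the example numbers are hypothetical and are not taken from the experiments.

```python
def success_rate(n_success, n_attempts):
    """Percentage of attempts that picked up exactly one object."""
    return 100.0 * n_success / n_attempts

def pickings_per_hour(n_success, total_seconds):
    """Successful picks extrapolated to one hour of operation."""
    return n_success * 3600.0 / total_seconds

# Hypothetical run: 22 successful picks out of 25 attempts in 10 minutes.
sr = success_rate(22, 25)             # 88.0
pph = pickings_per_hour(22, 600.0)    # 132.0
```

Because PPH counts only successful picks, a slow but reliable action and a fast but unreliable one can reach similar PPH, which is why the two metrics are reported together.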

2) Task Design: We prepare two real-world bin picking tasks. Consecutive picking aims to empty the bin filled with respectively 5, 10, or 15 objects. The robot picks up objects one by one until the bin is empty. Randomized picking refers to picking up objects from the bin filled with respectively 18-20, 20-22 and 22-25 objects. After each picking, we reload the bin and shuffle the wire harnesses to provide randomness during the task. It can encourage the robot to confront different patterns of entanglement as much as possible. Fig. 7(b) shows the bins filled with different numbers of wire harnesses.

3) Comparisons with Baselines: Table IV compares the performance of the three versions of our policy and three baselines in success rate and PPH. For consecutive picking where the goal is to empty the bin, Ours-FM and Ours-FM-R significantly increase the average success rate from 56.7% to 87.3% and 88.1% compared to DL. TFS achieves higher success rates than Ours-FM but has lower PPH since TFS only executes the time-consuming action $a_{tfs}$ . Especially in the latter half of a consecutive picking task, when fewer objects remain in the bin, our policy can shorten the execution time by inferring adequate actions. Ours-FM-R also has lower PPH since this policy needs extra actions to place the entangled objects back in the parts bin. Furthermore, the average action complexities for the predicted actions using RAND, Ours-IM, Ours-FM and Ours-FM-R are also presented in Table IV. The average action complexity for 5 objects is significantly lower than that for 10 and 15 objects. As the number of objects in the bin increases, the action complexity of the predicted action tends to increase. This suggests that entanglement occurs more frequently when the bin contains more objects and requires more complex actions. We also observe that the failed attempts by baselines always drag objects outside the workspace, requiring human workers to rearrange after each attempt. Our policy helps maintain a relatively clean workspace during consecutive picking thanks to the horizontal separation and our action-grasp inference algorithm.

For randomized picking, we compare the performance of Ours-FM and Ours-FM-R with a DL baseline, as Table IV shows. More objects are involved in this task than in consecutive picking. Thus, the probability of encountering complex entanglement patterns is higher. Ours-FM completes the task with 80% accuracy and 109 PPH, almost twice as high as DL. The results suggest that our policy can grasp tightly intertwined objects in dense clutter. All three proposed modules collaboratively contribute to efficient bin picking from perception to manipulation planning. However, as the number of objects increases, both metrics of Ours-FM decrease. Due to heavier occlusions and visual noise, fewer grasp candidates are detected and some entanglement patterns can hardly be recognized from the depth image. Despite this, the most complex action $a_{tfs}$ can still succeed in some of these cases. Additionally, Ours-FM-R outperforms Ours-FM in success rate, especially when the number of objects increases, thanks to the recovery module, but generally has lower PPH. When the bin contains more than 22 objects, Ours-FM-R shows a higher success rate and PPH than Ours-FM, indicating that the feedback module can help further improve the bin picking performance.

V-C Qualitative Analysis

1) Visualized Results: We present visualized results of picking attempts with grasps, actions and input depth images. First, Fig. 8(a) presents the predicted action-grasp pairs for each action. It shows that our policy infers the actions not only from the number of objects in the scene but also by reasoning about the occlusions around the input grasp point. Additionally, if the robot grasps close to the end of a wire harness, our policy tends to predict more complex actions, since the gripper may need to move a larger distance to handle the length. Then, Fig. 8(b) shows a set of successful picks with the inferred action-grasp candidates ranked by descending prediction score. The optimal action-grasp pairs inferred by our policy are marked in red. Our policy can recognize objects that are barely entangled with others and only require $a_{dl}$. For scenes that do not contain such objects, our policy can reason about the entanglement and predict proper actions. When the predicted scores of all action-grasp pairs are lower than $p_{thld}$, our policy executes $a_{tfs}$ with the grasp that has the highest FGE score, where the target is likely on top of the pile.
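The selection rule described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the dictionary layout and the rule of taking the single highest-scoring pair whenever at least one score reaches $p_{thld}$ are assumptions; only the fallback to $a_{tfs}$ with the highest-FGE grasp is taken from the text.

```python
from typing import Dict, List, Tuple

FALLBACK_ACTION = "a_tfs"  # most complex action, used when no pair is confident


def select_action_grasp(
    pair_scores: Dict[Tuple[str, int], float],
    fge_scores: List[float],
    p_thld: float,
) -> Tuple[str, int]:
    """Return (action, grasp_index) for one picking attempt.

    pair_scores maps (action name, grasp index) to a predicted score.
    fge_scores[i] is the graspability (FGE) score of grasp candidate i.
    Both containers are hypothetical data layouts for this sketch.
    """
    if not fge_scores:
        raise ValueError("no grasp candidates")
    if pair_scores:
        # Assumed rule: take the highest-scoring pair if it clears the threshold.
        best_pair = max(pair_scores, key=lambda k: pair_scores[k])
        if pair_scores[best_pair] >= p_thld:
            return best_pair
    # Every pair scored below p_thld: fall back to a_tfs with the top FGE grasp.
    best_grasp = max(range(len(fge_scores)), key=lambda i: fge_scores[i])
    return (FALLBACK_ACTION, best_grasp)
```

For example, with scores of 0.3 and 0.4 for two candidate pairs and a threshold of 0.5, the sketch ignores both predictions and returns $a_{tfs}$ paired with whichever grasp has the highest FGE score.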

2) Novel Wire Harnesses: To demonstrate the breadth of our method, we apply Ours-FM to two unseen wire harnesses. They differ from those used for training in length and structure but have similar components (e.g., deformable cables and rigid connectors). Fig. 9 shows the two novel wire harnesses and the corresponding action-grasp pairs predicted by our policy. Table V lists their lengths and the average action complexity of the predictions for different numbers of objects. For the shorter objects (see Fig. 9(a)), our model does not predict actions of excessively high complexity. The robot tends to select $a_{dl}$ and $a_{h}$ to pick up objects. Since this type of wire harness is less tangle-prone, the picking accuracy relies primarily on the grasp detection module, while our policy handles the potential entanglement. On the other hand, for the long wire harnesses (Fig. 9(b)), whose length exceeds our bin picking working cell, Table V suggests that our policy tends to output more complex actions. However, even $a_{tfs}$ is insufficient to separate them. More complex manipulation strategies are needed for such objects.

V-D Failure Modes and Limitations

TABLE V: Predicted Average Action Complexity (Avg. $\mathcal{A}$) for Two Types of Unseen Wire Harnesses

Type	Length (cm)	5 Objects	10 Objects	15 Objects
Short	45	0.7	1.3	1.7
Long	115	4.8	4.6	-

We observe four failure modes in the physical experiments.

• Objects outside of the bin: The input image of the ASP model does not capture the complete objects when parts of them lie outside the bin.
• Grasp failure: The grasp failure rate is 2.1% (24/1170). A grasp fails when the robot grasps multiple objects at once or grasps nothing. Such failures mainly stem from vision sensor noise and heavy occlusion.
• Tightly wedged objects: The target is tightly wedged into another object's cable bundles or rigid components, making it extremely difficult to disentangle.
• Action prediction failure: Our policy sometimes predicts the wrong separation action due to visual noise or heavily occluded objects.

Our policy also has limitations. First, for long wire harnesses, the robot fails to extract them from the entanglement since their length exceeds the robot's reachable area. Second, training is specific to the structure of the objects in the dataset. It would be difficult to adapt our current policy to wire harnesses with completely different geometries.

We divide the causes of these failure modes and limitations into two categories and outline future extensions for each. (1) Poor visual prediction for heavily occluded clutter: We will extend our policy with multi-sensory inputs, going beyond the current combination of a vision-only predetermined policy and force-only feedback control. We will also consider online closed-loop learning and more effective recovery methods to further improve the robustness of our policy. (2) Insufficient motion primitives: The proposed motion primitives cannot solve some complex cases, and the reach of a single robot manipulator is limited. We will consider more effective motion primitives that use dual-arm manipulation or dynamic motions. It would also be interesting to design more general motion primitives so that our policy can be applied to wire harnesses with different geometries.

VI Conclusions

We present a novel bin picking system for grasping and separating entangled wire harnesses. We design an efficient post-grasping action for disentangling the target in clutter and learn a policy from real-world data that reasons about the extracting distance and produces the optimal action and grasp from a single depth image. Real-world experiments suggest that our policy can successfully untangle intertwined wire harnesses in different cluttered scenes and pick them up one at a time with high accuracy.

References

[1] Y. Domae, H. Okuda, Y. Taguchi, K. Sumi, and T. Hirai, “Fast graspability evaluation on single depth maps for bin picking with general grippers,” in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 1997–2004.
[2] M.-Y. Liu, O. Tuzel, A. Veeraraghavan, Y. Taguchi, T. K. Marks, and R. Chellappa, “Fast object localization and pose estimation in heavy clutter for robotic bin picking,” The International Journal of Robotics Research, vol. 31, no. 8, pp. 951–973, 2012.
[3] J. Kirkegaard and T. B. Moeslund, “Bin-picking based on harmonic shape contexts and graph-based matching,” in 18th International Conference on Pattern Recognition (ICPR’06), vol. 2. IEEE, 2006, pp. 581–584.
[4] K. Harada, W. Wan, T. Tsuji, K. Kikuchi, K. Nagata, and H. Onda, “Initial experiments on learning-based randomized bin-picking allowing finger contact with neighboring objects,” in 2016 IEEE International Conference on Automation Science and Engineering (CASE). IEEE, 2016, pp. 1196–1202.
[5] R. Matsumura, K. Harada, Y. Domae, and W. Wan, “Learning based industrial bin-picking trained with approximate physics simulator,” in International Conference on Intelligent Autonomous Systems. Springer, 2018, pp. 786–798.
[6] H. Tachikake and W. Watanabe, “A learning-based robotic bin-picking with flexibly customizable grasping conditions,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 9040–9047.
[7] D. C. Dupuis, S. Léonard, M. A. Baumann, E. A. Croft, and J. J. Little, “Two-fingered grasp planning for randomized bin-picking,” in Proc. of the Robotics: Science and Systems 2008 Manipulation Workshop-Intelligence in Human Environments, 2008.
[8] D. Buchholz, D. Kubus, I. Weidauer, A. Scholz, and F. M. Wahl, “Combining visual and inertial features for efficient grasping and bin-picking,” in 2014 IEEE international conference on robotics and automation (ICRA). IEEE, 2014, pp. 875–882.
[9] R. Matsumura, Y. Domae, W. Wan, and K. Harada, “Learning based robotic bin-picking for potentially tangled objects,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 7990–7997.
[10] X. Zhang, K. Koyama, Y. Domae, W. Wan, and K. Harada, “A topological solution of entanglement for complex-shaped parts in robotic bin-picking,” in 2021 IEEE 17th International Conference on Automation Science and Engineering (CASE). IEEE, 2021, pp. 461–467.
[11] J. Guo, J. Zhang, Y. Gai, D. Wu, and K. Chen, “Visual recognition method for deformable wires in aircrafts assembly based on sequential segmentation and probabilisitic estimation,” in 2022 IEEE 6th Information Technology and Mechatronics Engineering Conference (ITOEC), vol. 6. IEEE, 2022, pp. 598–603.
[12] A. Caporali, K. Galassi, R. Zanella, and G. Palli, “Fastdlo: Fast deformable linear objects instance segmentation,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 9075–9082, 2022.
[13] C. Choi, Y. Taguchi, O. Tuzel, M.-Y. Liu, and S. Ramalingam, “Voting-based pose estimation for robotic assembly using a 3d sensor,” in 2012 IEEE International Conference on Robotics and Automation. IEEE, 2012, pp. 1724–1731.
[14] J. Yang, D. Li, and S. L. Waslander, “Probabilistic multi-view fusion of active stereo depth maps for robotic bin-picking,” IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 4472–4479, 2021.
[15] K. Harada, K. Nagata, T. Tsuji, N. Yamanobe, A. Nakamura, and Y. Kawai, “Probabilistic approach for object bin picking approximated by cylinders,” in 2013 IEEE International Conference on Robotics and Automation. IEEE, 2013, pp. 3742–3747.
[16] J. Mahler and K. Goldberg, “Learning deep policies for robot bin picking by simulating robust grasping sequences,” in Conference on robot learning. PMLR, 2017, pp. 515–524.
[17] X. Li, R. Cao, Y. Feng, K. Chen, B. Yang, C.-W. Fu, Y. Li, Q. Dou, Y.-H. Liu, and P.-A. Heng, “A sim-to-real object recognition and localization framework for industrial robotic bin picking,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 3961–3968, 2022.
[18] Z. Tong, Y. H. Ng, C. H. Kim, T. He, and J. Seo, “Dig-grasping via direct quasistatic interaction using asymmetric fingers: An approach to effective bin picking,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 3033–3040, 2021.
[19] K. Morino, S. Kikuchi, S. Chikagawa, M. Izumi, and T. Watanabe, “Sheet-based gripper featuring passive pull-in functionality for bin picking and for picking up thin flexible objects,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 2007–2014, 2020.
[20] G. Leão, C. M. Costa, A. Sousa, and G. Veiga, “Detecting and solving tube entanglement in bin picking operations,” Applied Sciences, vol. 10, no. 7, p. 2264, 2020.
[21] Y. Yamakawa, A. Namiki, M. Ishikawa, and M. Shimojo, “Knotting manipulation of a flexible rope by a multifingered hand system based on skill synthesis,” in 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2008, pp. 2691–2696.
[22] W. H. Lui and A. Saxena, “Tangled: Learning to untangle ropes with rgb-d perception,” in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2013, pp. 837–844.
[23] J. Grannen, P. Sundaresan, B. Thananjeyan, J. Ichnowski, A. Balakrishna, V. Viswanath, M. Laskey, J. Gonzalez, and K. Goldberg, “Untangling dense knots by learning task-relevant keypoints,” in Conference on Robot Learning. PMLR, 2021, pp. 782–800.
[24] H. Ha and S. Song, “Flingbot: The unreasonable effectiveness of dynamic manipulation for cloth unfolding,” in Conference on Robot Learning. PMLR, 2022, pp. 24–33.
[25] C. Chi, B. Burchfiel, E. Cousineau, S. Feng, and S. Song, “Iterative Residual Policy for Goal-Conditioned Dynamic Manipulation of Deformable Objects,” in Proceedings of Robotics: Science and Systems, New York City, NY, USA, June 2022.
[26] P. Ray and M. J. Howard, “Robotic untangling of herbs and salads with parallel grippers,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 2624–2629.
[27] K. Takahashi, N. Fukaya, and A. Ummadisingu, “Target-mass grasping of entangled food using pre-grasping & post-grasping,” IEEE Robotics and Automation Letters, 2021.
[28] X. Jiang, K.-m. Koo, K. Kikuchi, A. Konno, and M. Uchiyama, “Robotized assembly of a wire harness in a car production line,” Advanced Robotics, vol. 25, no. 3-4, pp. 473–489, 2011.
[29] H. Zhou, S. Li, Q. Lu, and J. Qian, “A practical solution to deformable linear object manipulation: A case study on cable harness connection,” in 2020 5th International Conference on Advanced Robotics and Mechatronics (ICARM). IEEE, 2020, pp. 329–333.
[30] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
