
Learning to Localize, Grasp, and Hand Over
Unmodified Surgical Needles

Albert Wilcox, Justin Kerr, Brijen Thananjeyan, Jeffrey Ichnowski,
Minho Hwang, Samuel Paradis, Danyal Fer, Ken Goldberg
* equal contribution
The AUTOLab at UC Berkeley ([email protected])
Abstract

Robotic Surgical Assistants (RSAs) are commonly used to perform minimally invasive surgeries by expert surgeons. However, long procedures filled with tedious and repetitive tasks such as suturing can lead to surgeon fatigue, motivating the automation of suturing. As visual tracking of a thin reflective needle is extremely challenging, prior work has modified the needle with nonreflective contrasting paint. As a step towards automation of a suturing subtask without modifying the needle, we propose HOUSTON: Handoff of Unmodified, Surgical, Tool-Obstructed Needles, a problem and algorithm that uses a learned active sensing policy with a stereo camera to localize and align the needle into a visible and accessible pose for the other arm. To compensate for robot positioning and needle perception errors, the algorithm then executes a high-precision grasping motion that uses multiple cameras. In physical experiments using the da Vinci Research Kit (dVRK), HOUSTON successfully passes unmodified surgical needles with a success rate of 96.7% and is able to perform handover sequentially between the arms 32.4 times on average before failure. On needles unseen in training, HOUSTON achieves a success rate of 75-92.9%. To our knowledge, this work is the first to study handover of unmodified surgical needles. See https://tinyurl.com/houston-surgery for additional materials.

I Introduction

Robotic Surgical Assistants (RSAs) currently rely on human supervision for the entirety of surgical tasks, which can consist of many very repetitive subtasks such as suturing. Automation of surgical subtasks may reduce surgeon fatigue [37], with initial results in surgical cutting [34, 23], debridement [30, 23], suturing [31, 10, 33, 7, 28, 32, 9, 8], hemostasis [26], and peg transfer [24, 12, 13, 27].

This paper considers automation of the bimanual regrasping subtask [7] of surgical suturing, which involves passing a surgical needle from one end effector to another. This handover motion is performed in between stitches during suturing, and is a critical step, as accurately positioning the needle in the end effector affects the stability of its trajectory when guided through tissue. Because varying tension in the cables that drive the arms causes inaccuracies in motion, a high-precision task such as passing a needle between the end effectors is challenging [21, 12, 24, 13, 25].

The task is also difficult because 3D pose information is critical for successfully manipulating needles, and surgical needles are challenging to perceive with RGB or active depth sensors due to their reflective surface, thin profile [17], and self-occlusions (Figure 2). Prior work has mitigated this by painting needles [31, 7] and using color segmentation, but this solution is not practical for clinical use.

To manipulate unmodified surgical needles, we combine recent advances in deep learning, active sensing, and visual servoing. We present HOUSTON: Handoff of Unmodified, Surgical, Tool-Obstructed Needles, a problem and algorithm for using stereo vision with coarse and fine-grained control policies (Figure 1) to sequentially localize, orient, and handover unmodified surgical needles. We present a localization method using stereo RGB with a deep segmentation network to output a point cloud of the needle in the workspace. This point cloud is used to define a coarse robot policy that uses visual servoing to reorient the needle for handover in a pose that is visible to the cameras and accessible by the other end effector. However, due to inaccuracies of robot positioning and the perception system, further corrections may be necessary. We train a fine robot policy from a small set of human demonstrations to perform these subtle but critical corrections from images for unmodified surgical needles.

Figure 1: Algorithm overview: The algorithm first servos the needle into a position that is easily visible to the stereo camera to produce a high confidence pose estimate. Using this estimate, it coarsely reorients the needle towards the grasping arm, and then refines this iteratively to correct for positioning errors. Finally, it executes a learned visual servoing grasping policy to complete the handover.

This paper makes the following contributions:

  1. A perception pipeline using stereo RGB to accurately estimate the pose of surgical steel needles in 3D space, enabling needle manipulation without active depth sensors or painted needles.

  2. A visual servoing algorithm to perform coarse reorientation of a surgical needle for grasping.

  3. A needle grasping policy that performs fine control of the needle, learned from a small set of human demonstrations to compensate for robot positioning and needle sensing inaccuracies.

  4. A combination of the pose estimator (1), the servoing algorithm (2), and the needle controller (3) to perform bimanual surgical needle regrasping, where physical experiments on the da Vinci Research Kit (dVRK) [15] suggest a success rate of 96.7% on needles used in training, and 75-92.9% on needles unseen in training. On sequential handovers, HOUSTON successfully executes 32.4 handovers on average before failure.

II Related Work

II-A Automation in Surgical Robotics

Automation of surgical subtasks is an active area of research with a rich history. Prior literature has studied automation of tasks related to surgical cutting [34, 23, 19], debridement [23, 16], hemostasis [26], peg transfer [12, 13, 24], and suturing [31, 33, 7, 32, 9, 28]. While automated suturing has been studied in prior work [28, 31], suturing without modifications such as painted fiducial markers is an open research problem. Recent work studies robust and general approaches to specific subproblems within suturing, including the precise manipulation of surgical needles during suturing from needle extraction [32] to bimanual regrasping [7], which is the focus of this work.

Needle manipulation is also studied by [32], where the approach studies the extraction of needles from tissue phantoms to compute robust grasps of the needle even in self-occluded configurations. Bimanual needle regrasping was studied in detail by [7], with impressive results on simulation-trained policies that take needle end effector poses as input. We extend their problem definition to consider multiple handoffs of unmodified needles and end effectors, which requires perception of the needle and robot pose from images without color segmentation. Needle grasping has also been studied using visual servoing policies in [9], where the needle is painted with green markers to track its position during closed loop visual servoing. [35] study tabular RL policies for needle regrasping in a discretized space in a fixed setup with known needle pose and experiments without the needle on the dVRK. The experiments in [35] suggest that value iteration-trained policies can mimic expert trajectories used for inverse reinforcement learning. In contrast, we present an algorithm compatible with significantly varying initial needle and gripper poses using only image observations. We additionally present many physical experiments with a needle, evaluating the success rate and speed of the algorithm.

II-B Visual Servoing and Active Perception

Visual servoing (VS) is a popular technique in robotics [11, 18], and has recently been applied to compensate for surgical robot imprecision in the surgical peg transfer task [24]. While classical VS approaches typically make use of hand-tuned visual features and known system dynamics [6, 5], recent work proposes learning end-to-end visual servoing policies from examples [20, 14]. To reduce the need for tuned features and dynamics models and also reduce the number of training samples required to create a robust VS policy for bimanual needle regrasping, we present a hybrid approach that combines coarse motion planning with fine control, where a learned VS policy is only used in parts of the task where high precision is required. This framework, called intermittent visual servoing (IVS), was studied in detail by [24], where the system switches between a classical trajectory optimizer and an imitation learning VS policy based on the precision required at the time. Inspired by this technique, we present an IVS approach to bimanual needle regrasping that combines coarse perception and planning with fine VS control. Because this task requires reasoning about depth across several directions, we present a multi-view VS policy that learns to precisely hand over the needle based on several camera views.

Active perception is a popular technique with many variations to localize objects prior to manipulation by maximizing information gain about their poses [22, 29, 36, 2, 1]. This has been studied in the context of robot-assisted surgery, where the endoscope position is automatically adjusted via a policy learned from demonstrations to center the camera focus on inclusions during surgeon teleoperation. In this work, we actively servo the needle to highly visible poses to maximize the accuracy of its pose estimate. This is most similar to [1], where the authors propose an algorithm to actively select views of a grasping workspace to uncover enough information about unknown objects to plan grasps.

III Problem Formulation

The HOUSTON problem extends and generalizes the previous problem definition from [7] to include unmodified needles, occlusion, and multiple handoffs.

III-A Overview

In the HOUSTON problem, the surgical robot starts with a curved surgical needle with known curvature and radius grasped by one gripper and must accurately pass it to the other gripper and back. This is challenging due to the needle’s reflective surface, thin profile, and pathological configurations [32, 7] as depicted in Figure 2. Once the needle is successfully passed to the other end effector, it is passed back to the first end effector. This process is repeated N_max times, or until the needle is dropped. We also consider a special case of this problem, the single-handover version, in which N_max = 1.

III-B Notation

Let p_L(t) and p_R(t) denote the poses of the left and right grippers, respectively, at discrete timestep t with respect to a world coordinate frame. The needle has pose p_N(t) with respect to the world frame. Observations of the workspace are available via RGB images from a stereo camera, I_L(t) and I_R(t), or overhead monocular RGB images I_O(t) from an RGB camera. The left and right cameras in the stereo pair have world poses p_st,left and p_st,right, respectively. The overhead camera has pose p_O in the world frame. Each trial starts with the needle in the left gripper and ends when the needle is dropped. Additionally, the trial terminates if no successful handoff occurs within τ_max timesteps.

Figure 2: Visible and occluded configurations: The left two frames depict needle orientations that are easily identifiable. However, the needle frequently reaches states that are occluded by the gripper or self-occluded, which makes estimating its state challenging.

At timestep t, the algorithm is provided an observation y(t) which contains images from the sensors: y(t) = (I_L(t), I_R(t), I_O(t)). The algorithm outputs a target pose and jaw state for each end effector, u(t) = ((p_L(t+1), φ_L(t+1)), (p_R(t+1), φ_R(t+1))), where φ_L(t+1) ∈ {0, 1} indicates whether the left jaw is closed at timestep t+1.
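To make this interface concrete, the following is a minimal sketch of how the observation y(t) and command u(t) could be represented in code; the class and field names are illustrative assumptions, not part of the HOUSTON implementation.

```python
# Illustrative containers for y(t) and u(t); names and types are assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:            # y(t) = (I_L(t), I_R(t), I_O(t))
    left_stereo: np.ndarray   # I_L(t): RGB image from the left stereo camera
    right_stereo: np.ndarray  # I_R(t): RGB image from the right stereo camera
    overhead: np.ndarray      # I_O(t): RGB image from the overhead camera

@dataclass
class ArmCommand:             # (p_a(t+1), phi_a(t+1)) for one arm a
    target_pose: np.ndarray   # 4x4 homogeneous end-effector pose in the world frame
    jaw_closed: bool          # phi_a(t+1) in {0, 1}

@dataclass
class Action:                 # u(t) = ((p_L, phi_L), (p_R, phi_R))
    left: ArmCommand
    right: ArmCommand
```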

III-C Assumptions

In order to deterministically evaluate HOUSTON policies in a wide variety of needle configurations, we discretize the needle-in-gripper pose possibilities by choosing a number of categories across three degrees of freedom:

  1. The needle’s curve can face either towards or away from the camera, providing 2 possibilities.

  2. The gripper may hold the needle either at the tip, or 30° inwards following the curvature of the needle. This degree of freedom has 2 possibilities.

  3. The rotation of the needle about the tangent line ω at the point where the gripper holds the needle ranges from 0° to 180°, and is discretized into 7 bins as in Figure 3.

This gives a total of 28 possible needle configurations and we perform a grid search over these possibilities. We chose these configurations to be representative of those seen in suturing tasks post-needle extraction.
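As a concrete illustration, the 28 start configurations are the Cartesian product of the three discretized degrees of freedom; the 30° spacing of the rotation bins in the sketch below is an assumption consistent with 7 bins spanning 0° to 180°.

```python
# Enumerate the 2 x 2 x 7 = 28 evaluation start configurations (illustrative).
from itertools import product

curve_faces_camera = [True, False]            # needle arc towards / away from camera
grasp_at_tip = [True, False]                  # grasped at the tip or 30 deg inward
rotation_deg = [30.0 * i for i in range(7)]   # 0, 30, ..., 180 deg about omega

configurations = list(product(curve_faces_camera, grasp_at_tip, rotation_deg))
assert len(configurations) == 28
```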

While the robot encoders provide an estimate of the gripper poses p_L(0) and p_R(0), the precise needle pose is unknown due to cabling effects of the arms. We assume access to a stereo RGB pair in the workspace, an overhead RGB camera, and the transforms between the coordinate frames of these cameras and the robot arms.

III-D Evaluation Metrics

We evaluate HOUSTON by recording: i) the number of successful handoffs in a multi-handoff trial, ii) the success rate per arm of single handoffs beginning from each configuration in III-C, and iii) the average time for each handoff.

Figure 3: Needle Reset Degrees of Freedom: We vary the starting configurations of the needle relative to the gripper in three degrees of freedom as described in III-C. Left: We rotate the needle into 7 discretized states about the axis ω tangent to the needle. Middle: An example where the needle is held 30° inward from the tip. Right: An example where the needle’s arc faces towards the camera.

IV HOUSTON Algorithm

HOUSTON uses active stereo visual servoing with both a coarse-motion and fine-motion learned policy for the bimanual regrasping task.

IV-A Phase 1: Active Needle Presentation

In the first phase, the algorithm repositions the needle to a pose where the other arm can easily grasp it without collisions and the cameras can clearly view it. Throughout execution of the coarse policy, we parameterize the needle as a circle of known radius, and measure its state in the world frame as a center point c_p, a normal vector c_n, and a needle tip point c_t, as shown in Figure 6. Active Needle Presentation consists of two stages: needle acquisition and handover positioning. The needle acquisition stage moves the needle to maximize visibility, and the positioning stage uses visual servoing to move the needle into a graspable state.

IV-A1 Needle State Estimation

The state estimator passes stereo images into a fully convolutional neural network that is trained to output segmentation masks for the needle in each image. See the project website for architecture details. It computes a distance transform of the segmentation mask to label each pixel with its distance to the nearest unactivated pixel. Next, it finds peaks in the distance transform along horizontal lines in each image, which correspond to points near the center of activated patches. It then triangulates all pairs of peaks along each horizontal line in the images to obtain a point cloud as in Figure 4. Because this may triangulate outlier points from the gripper or incorrectly match points on different parts of the needle, RANSAC is applied to filter out incorrect point correspondences and return the final predicted needle state. At each iteration, it samples a set of 3 points, to which a plane is fit. Each sampled set of 3 points generates 2 candidate circles in the plane, corresponding to the two circles of the known needle radius that pass through one of the sampled pairs of points. RANSAC uses an inlier radius of 1 mm and runs for 300 iterations.
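The RANSAC circle-fitting step can be sketched as follows. This is our reading of the procedure above (sample 3 points, fit a plane, generate the two known-radius circles through a sampled pair, score by inliers), with units assumed to be meters so the 1 mm inlier radius becomes 1e-3; it is not the released implementation.

```python
import numpy as np

def fit_needle_circle(points, radius, iters=300, inlier_thresh=1e-3, rng=None):
    """RANSAC-fit a circle of known radius to a triangulated 3D point cloud.

    Returns (center c_p, plane normal c_n, inlier mask) of the best candidate.
    """
    rng = np.random.default_rng() if rng is None else rng
    best, best_count = None, -1
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(normal) < 1e-9:
            continue                              # degenerate (collinear) sample
        normal = normal / np.linalg.norm(normal)
        chord = p1 - p0
        half = np.linalg.norm(chord) / 2.0
        if half >= radius:
            continue                              # sampled pair too far apart for this radius
        mid = (p0 + p1) / 2.0
        perp = np.cross(normal, chord / np.linalg.norm(chord))  # in-plane, perpendicular to chord
        offset = np.sqrt(radius ** 2 - half ** 2)
        for center in (mid + offset * perp, mid - offset * perp):
            v = points - center
            out_of_plane = v @ normal                                    # signed distance to plane
            radial = np.linalg.norm(v - np.outer(out_of_plane, normal), axis=1) - radius
            dists = np.sqrt(out_of_plane ** 2 + radial ** 2)             # distance to the circle
            inliers = dists < inlier_thresh
            if inliers.sum() > best_count:
                best_count, best = inliers.sum(), (center, normal, inliers)
    return best
```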

The network is first trained on a dataset of 2000 simulated stereo images of randomly-placed, textured and scaled floating needles and random objects [4] above a surface plane generated in Blender 2.92. Lighting intensity, size, and position, as well as stereo camera position, are also randomized. The segmentation network is fine-tuned on a dataset of 200 manually-labeled images of the surgical needle in the end effector. Training a network on a PC with an NVIDIA V100 GPU takes 3 hours, and fine-tuning takes 10 minutes.

Figure 4: Needle state estimation: Execution of the stereo RGB pipeline on a highly visible needle. The network takes in raw stereo images as input, producing segmasks of the needle. The triangulated segmasks produce a point-cloud to which a circle is fit with RANSAC (3rd panel). Inliers are shown in blue, the best-fit circle in green, and outliers in red. The resulting observation reprojected into the left image is shown in the final panel.

IV-A2 Visual servoing

HOUSTON uses Algorithm 1 to compute updates for visual servoing in both the needle presentation and the handover positioning phases. Algorithm 1 is a fixed point iteration method that uses a state estimator to iteratively visually servo to a target state. Similar to first and second order optimization algorithms, it computes a global update based on its current state and iterates until the computed update is zero. In each iteration, it queries the current 3D state estimate of the needle and then computes an update step in the direction of the target state. To compensate for estimation errors due to challenging needle poses, this process is repeated at each iteration until the algorithm converges to a local optimum within a pose error tolerance.

During needle acquisition, the arm moves to a home pose, then rotates around the world z and x axes, stopping when the state estimator observes a threshold number of inlier points n_thresh = 20 from RANSAC circle fitting. During trials, at most 2 consecutive rotations sufficed to resolve the state to this degree. After this initial acquisition, we apply Algorithm 1 to align c_n towards the left stereo camera position p_st,left^pos, with d defined as a rotation about the axis c_n × (p_st,left^pos − c_p). Once the needle is clearly presented to the camera, we measure c_t by choosing the inlier point from the circle-fitting step that is furthest from the gripper in 3D space.

Subsequently, during handover positioning, we compute inverse kinematics to move c_p towards the center of the workspace with c_t pointing towards the other gripper and c_n orthogonal to the table plane. This flat configuration is critical for the grasping step, since the dVRK arm is primarily designed for top-down grasps near the center of its workspace. Because the arm only has 5 rotational degrees of freedom, we use a numerical IK solver from [3] and attempt to find a configuration minimizing rotational error within a 6 × 6 × 8 cm tolerance region on end effector translation. After moving to this pose, we repeat Algorithm 1, with d defined as a rotation aligning c_t towards the other gripper and c_n orthogonal to the table. We calculate IK to a configuration with needle curvature towards the camera and one with curvature away from the camera, then pick the configuration that minimizes rotational error to the goal.
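As one concrete instantiation of the distance metric d in Algorithm 1, the sketch below shows our interpretation (not the authors' code) of the acquisition-stage update: a rotation about c_n × (p_st,left^pos − c_p) that turns the needle normal c_n toward the left stereo camera. The step-scaling parameter is an added assumption.

```python
# Sketch of the acquisition-stage update d: rotate c_n toward the camera.
import numpy as np
from scipy.spatial.transform import Rotation as R

def acquisition_update(c_p, c_n, cam_pos, step=1.0):
    """Return a 3x3 rotation to apply to the holding gripper's orientation.

    step in (0, 1] scales the correction; smaller steps can help when the
    needle state estimate is noisy (an assumption, not from the paper).
    """
    to_cam = (cam_pos - c_p) / np.linalg.norm(cam_pos - c_p)
    n = c_n / np.linalg.norm(c_n)
    axis = np.cross(n, to_cam)                 # c_n x (p_st,left - c_p), normalized below
    s = np.linalg.norm(axis)
    if s < 1e-9:
        return np.eye(3)                       # already aligned (or exactly opposite)
    angle = np.arctan2(s, float(np.dot(n, to_cam)))
    return R.from_rotvec(step * angle * axis / s).as_matrix()
```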

Figure 5: Fine-grained grasping policy: We split corrective actions along the x- and y-axes and learn two corresponding policies π_grasp,x and π_grasp,y. Each policy begins by performing an ego-centric crop by projecting the gripper’s kinematically calculated approximate location into the input image and cropping around it. Then, it feeds the cropped image through a neural network to predict a correction direction.

IV-B Phase 2: Executing a Grasping Policy

After the active presentation phase described in Section IV-A, the pose of the needle is known with reasonable accuracy and is accessible to the grasping arm, and we execute a grasping policy π_grasp to grasp the needle. However, the needle pose estimate after the first phase may not be perfect, so we must visually servo to compensate for these errors when grasping. Even if the needle pose were perfectly known, reliable grasping of a small needle is still challenging due to the positioning errors of the robot, which are a result of its cable-driven arms [24, 13, 12, 30, 25, 21]. The policy splits corrective actions between the x- and y-axes with two sub-policies, π_grasp,x and π_grasp,y. Each policy uses RGB inputs ego-centrically cropped around the grasping arm, with the x- and y-axis policies using 140 × 200 pixel crops from the inclined camera and 70 × 200 pixel crops from the overhead camera, respectively. The cropping forces the policy to condition on the relative position of the gripper and needle without the ability to overfit to texture cues from other parts of the scene. The fine-grained grasping policy and image crops are displayed in Figure 5. We ablate different design choices for the grasping correction policy and also present open-loop grasping results in Section V.

The grasping subpolicy π_grasp,y is a neural network classifier that outputs whether the grasping arm should move in the +y (down in the crop) or -y (up in the crop) direction. π_grasp,x is trained similarly to output whether the grasping arm should move in the +x or -x direction. The policies are trained by collecting offline human demonstrations through two methods: 1) we sample poses for the arms in the workspace such that the needle orientation is perturbed by 10° about each axis, then move the robot to a good grasping position via a keyboard teleoperation interface; 2) we execute the pre-handover positioning routine and position the robot in the desired grasp location by hand, after which the robot autonomously iterates through offsets in the ±x and ±y directions, labeling actions according to the offset from the goal position. We experimentally find that separating the policy across two axes significantly improves grasp accuracy (Section V). A separate grasping policy is trained for each arm on 100 demonstrations each. Each demonstration takes 5-10 actions and each dataset takes about an hour to collect. The policies π_grasp,x and π_grasp,y are each represented by voting ensembles of 5 classifiers, each of which has three convolutional layers and two fully connected layers. Details about model architectures are located on the project website.

During policy execution, we iteratively sample actions first from π_grasp,x, then π_grasp,y, waiting for each to converge before continuing. We multiply the action magnitude by a scalar β_decay = 0.5 every time the network outputs the opposite action of the previous timestep. Servoing terminates when the action magnitude decays to under 0.2 mm. This enables implicit convergence to the goal without explicitly training the policy to stop. After x and y convergence, we execute a simple downward motion of 1 cm to grasp the needle.
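This servoing loop with a decayed step size can be sketched as follows; the initial step size and the helper names (policy_x, policy_y, the crop functions, and the move commands) are assumptions for illustration rather than the paper's implementation.

```python
# Sketch of the fine-grained grasp servoing loop with step-size decay.
BETA_DECAY = 0.5       # multiply the step size on a direction reversal
STOP_MM = 0.2          # terminate when the step decays below 0.2 mm
INIT_STEP_MM = 1.0     # assumed initial step size (not specified in the text)

def servo_axis(policy, get_crop, move_along_axis, step_mm=INIT_STEP_MM):
    """Query the direction classifier and step until the decayed step is tiny."""
    prev_direction = 0
    while step_mm >= STOP_MM:
        direction = policy(get_crop())             # returns +1 or -1
        if prev_direction != 0 and direction != prev_direction:
            step_mm *= BETA_DECAY                  # overshoot detected: shrink the step
        move_along_axis(direction * step_mm)
        prev_direction = direction

def fine_grasp(policy_x, policy_y, crop_inclined, crop_overhead,
               move_x, move_y, move_down):
    servo_axis(policy_x, crop_inclined, move_x)    # converge along x first
    servo_axis(policy_y, crop_overhead, move_y)    # then along y
    move_down(10.0)                                # final 1 cm downward grasp motion (in mm)
```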

Algorithm 1 Presentation Visual Servoing Policy
1: Input: state estimator Φ, target needle state x_N^targ, arm grasping needle a ∈ {L, R}, current time t, number of iterations N_max, tolerance ε, distance metric d.
2: for i = 0; i < N_max; i = i + 1 do
3:     Predict needle state x̂_N(t+i) = Φ(o(t+i))
4:     Compute gripper update p_δ = d(x̂_N(t+i), x_N^targ)
5:     Update arm a pose: p_a(t+i+1) = p_a(t+i) + p_δ
6:     if |p_δ| < ε then
7:         break
8:     end if
9: end for
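For reference, Algorithm 1 translates into a short fixed-point loop; the sketch below uses placeholder callables (state_estimator for Φ, observe for o, distance for d, move_arm for the pose update), which are assumptions about the interface rather than the authors' code.

```python
import numpy as np

def presentation_visual_servoing(state_estimator, observe, target_state,
                                 distance, move_arm, n_max=10, eps=1e-3):
    """Fixed-point visual servoing: estimate, step toward the target, repeat."""
    for _ in range(n_max):
        needle_state = state_estimator(observe())      # x_hat_N(t+i) = Phi(o(t+i))
        delta = distance(needle_state, target_state)   # p_delta = d(x_hat_N, x_N^targ)
        move_arm(delta)                                 # p_a(t+i+1) = p_a(t+i) + p_delta
        if np.linalg.norm(delta) < eps:                 # |p_delta| < epsilon
            break
```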
Figure 6: Acquisition Stage Rollout: Rollout of the phase described in IV-A. Images are taken from the left camera and cropped to the gripper, with the circle observation projected in green. The pose goal is one in which the needle faces towards the camera. Note how uncertainty in the second image is resolved in later images as the needle reaches a more observable configuration.

V Physical Experiments

The experiments aim to answer the following question: how efficient and reliable is HOUSTON compared to baseline approaches? We also perform several ablation studies of method components in this section.

V-A Baselines

To evaluate the method for the task of active needle presentation, we compare to the following baselines:

  • Depth-based presentation: Instead of using the stereo RGB network to detect the needle pose, we use a depth image-based detection algorithm to detect the needle pose and servo it to the flat grasping pose. This method takes the depth image from the stereo camera’s built-in depth calculation, I_D(t), as input, masks the gripper out of the depth image using the dVRK’s forward kinematics, then performs a volume crop around the end effector and fits a circle of known radius to the points using RANSAC to extract the state.

  • No-Sim-Data: This is an ablation of the RGB stereo segmentation network that is only trained on the small dataset of real data.

Figure 7: Needles used: The needles are shown with a coin for scale. All models were trained using needle 1, while needles 2, 3, and 4 were used only for testing. Needles 1 and 3 have a radius of 1.25 cm, and needles 2 and 4 have radii of 1.75 cm and 0.75 cm, respectively.
Table I: Single Handover Physical Experiments: We report success rates, 95% confidence intervals, and durations taken over a grid search of the 28 start configurations described in Section III-C for the full surgical needle bimanual regrasping task. We report the frequency of three failure modes: (P) error in the presentation procedure, (X) error along the x-axis, and (Y) error along the y-axis. HOUSTON significantly outperforms baselines, which either have many presentation failures or grasp positioning failures. The No-Sim-Data ablation also has a high success rate, but we find that the segmentation masks are qualitatively less accurate and have more false positives in the workspace. We present results with HOUSTON using three needles (Figure 7) that were unseen in training samples. HOUSTON performs best on the larger two needles (2 and 3), and performs worse on the smaller needle 4, where occlusions with the gripper are more severe.
| Method | Successes / Total | % Success | 95% CI Low | 95% CI High | Completion Time (s) | Failures P / X / Y |
| Open Loop | 28/56 | 50.0 | 36.3 | 63.7 | 14.85 ± 2.43 | 0 / 0 / 28 |
| Shared (x, y) Grasp Policy | 2/28 | 7.1 | 0.9 | 23.5 | 19.41 ± 4.23 | 0 / 0 / 26 |
| No Sim Data | 53/56 | 94.6 | 85.1 | 98.9 | 24.58 ± 3.39 | 3 / 0 / 0 |
| Depth-Based Presentation | 6/28 | 21.4 | 8.3 | 41.0 | 32.61 ± 7.42 | 22 / 0 / 0 |
| HOUSTON (Left to Right) | 108/112 | 96.3 | 91.1 | 99.0 | 21.45 ± 3.08 | 4 / 0 / 0 |
| HOUSTON (Right to Left) | 108/112 | 96.3 | 91.1 | 99.0 | 23.93 ± 3.20 | 1 / 0 / 3 |
| Needles Unseen in Training | | | | | | |
| HOUSTON (Right to Left, Needle 2) | 26/28 | 92.9 | 76.5 | 99.1 | 23.79 ± 2.55 | 1 / 0 / 1 |
| HOUSTON (Right to Left, Needle 3) | 25/28 | 89.3 | 71.8 | 97.7 | 23.15 ± 3.46 | 3 / 0 / 0 |
| HOUSTON (Right to Left, Needle 4) | 21/28 | 75.0 | 55.1 | 89.3 | 23.44 ± 4.13 | 5 / 0 / 2 |
Table II: Multiple Handover Physical Experiments: To evaluate the consistency of HOUSTON, we evaluate it on the full multiple handover HOUSTON task with maximum handoffs N_max = 50. We run HOUSTON with two different needle orientations during the trial, and report the number of successful handovers (Num), time per handover (Time), and failure mode (F). HOUSTON averages 26.20 and 38.60 successful passes in each configuration, and has three runs with no failures.
| Config | Trial 1 (Num / Time / F) | Trial 2 | Trial 3 | Trial 4 | Trial 5 | Avg. Num | Avg. Time/Handover |
| Away config. | 12 / 25.40 / Y | 45 / 25.34 / P | 23 / 25.42 / P | 35 / 24.71 / P | 16 / 25.28 / X | 26.20 | 25.13 |
| Towards config. | 50 / 26.57 / - | 50 / 26.76 / - | 50 / 27.00 / - | 20 / 25.08 / Y | 23 / 26.48 / P | 38.60 | 26.56 |

To evaluate the design choices used in the fine-grained grasping policy, we compare to the following baselines:

  • Open Loop: Executes an open loop grasping motion to grasp the needle based only on needle geometry and inverse kinematics.

  • Shared (x, y) Grasp Policy: Trains a single policy to output x and y displacements, and takes both I_O(t) and I_L(t) as input.

To evaluate whether the system can transfer to needles unseen in training, we evaluate HOUSTON on three additional needles as in Figure 7.

V-B Experimental Setup

We perform experiments using the da Vinci Research Kit (dVRK) surgical robot [15], a cable-driven surgical robot with two needle driver arms, each with a gripper that can open 1 cm.

For perception, the setup includes a Zed Mini stereo camera angled slightly downwards to face the arms, and an overhead Zivid One Plus M camera facing directly down. Stereo images are captured at 2K resolution, and overhead images are captured at 1080p. The locations of the arms relative to each of the cameras are statically calibrated.

V-B1 Single handover

For single handover experiments, we manually vary the orientation of the gripper before each trial to the orientations described in III-C. A handoff is considered successful if the needle switches from one gripper to the other, and at the end is fully supported by the other gripper.

V-B2 Multiple handovers

For multiple handover experiments, we start the needle in the left gripper in a visible configuration to the camera, so that all errors are a result of handoffs rather than initialization. We evaluate two configurations: one where the needle arc ends facing the stereo camera in the grasping configuration (Towards), and one where it faces the opposite direction (Away). This configuration is typically maintained throughout each multi-handover trial because of the consistency of the needle presentation step.

V-C Single Handover Results

We evaluate HOUSTON and baselines on the single handover task in Table I, and perform multiple systematic passes over the 28 starting configurations described in Section III-C. We find that HOUSTON performs the task more reliably than the comparisons, which either experience many presentation errors or many grasp positioning errors.

V-D Multiple Handover Results

We evaluate HOUSTON on the multiple handover task with N_max = 50 with two different starting configurations (Table II). We observe that in the first configuration, the algorithm completes 26.20 successful handovers on average and 38.60 in the second. In three trials, no errors occur, and we manually terminate them after 50 successful handovers.

V-E Failure Analysis

HOUSTON encounters three failure modes:

  • P: Presentation error: the robot fails to present the needle in an orientation that is in the plane of the table with the needle tip pointing toward the grasping arm. This may lead to grasping angles that are unreachable or out of the training distribution for the grasping arm.

  • X: Grasping positioning error (X): the x-axis grasping policy fails to line up with the needle prior to executing the y-axis grasping policy.

  • Y: Grasping positioning error (Y): the y-axis grasping policy fails to line up with the needle prior to grasping.

We categorize all of the failure modes encountered in Table I. We find that the open loop grasping policies are not able to consistently position well for grasping. HOUSTON has failures that are evenly distributed across the failure modes. Grasp policy servoing errors stem mainly from needle configurations that are far outside the distribution seen in training. Presentation phase failures stem primarily from mis-detection of the needle’s true tip, either because of incomplete segmentation masks or from drift in robot kinematics causing the most distal needle point to not be the tip. This causes the servoing policy to rotate the needle away from the camera, after which it sometimes loses visibility and fails to bring the needle to the pre-handover pose. Multi-handoff failures most frequently arise because of subtle imperfections in grasp execution where the needle rotates to a difficult angle. During the subsequent handover, the needle can become obstructed by the holding gripper, inhibiting the grasping policy.

VI Discussion

In this work we present HOUSTON, a problem and an algorithm for reliably completing the bimanual regrasping task on unpainted surgical needles. To our knowledge, this work is the first to study the unmodified variant of the regrasping task. The main limitations of this approach are its reliance on human demonstrations to learn the grasping policy, and sensitivity to needle and environment appearance. We hypothesize that the former could be mitigated via self-supervised demonstration collection, or by exploring unsupervised methods for fine-tuning behavior cloned policies. Future work will address the latter issue by exploring more powerful network architectures leveraging stereo disparity such as [17], and designing more autonomous data collection techniques which can label real needle data without human input. In future work, we will also study how to reorient needles between handovers for precise control of needle-in-hand pose and attempt to make needle tracking more robust to occlusions from tissue phantoms.

References

  • [1] Ermano Arruda, Jeremy Wyatt and Marek Kopicki “Active vision for dexterous grasping of novel objects” In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016, pp. 2881–2888 IEEE
  • [2] Ruzena Bajcsy “Active perception” In Proceedings of the IEEE 76.8 IEEE, 1988, pp. 966–1005
  • [3] Patrick Beeson and Barrett Ames “TRAC-IK: An open-source library for improved solving of generic inverse kinematics” In 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), 2015, pp. 928–935 DOI: 10.1109/HUMANOIDS.2015.7363472
  • [4] Berk Calli et al. “Yale-CMU-Berkeley dataset for robotic manipulation research” In The International Journal of Robotics Research 36.3 SAGE Publications Sage UK: London, England, 2017, pp. 261–268
  • [5] Guillaume Caron, Eric Marchand and El Mustapha Mouaddib “Photometric visual servoing for omnidirectional cameras” In Autonomous Robots 35.2 Springer, 2013, pp. 177–193
  • [6] François Chaumette and Seth Hutchinson “Visual servo control. I. Basic approaches” In IEEE Robotics & Automation Magazine 13.4 IEEE, 2006, pp. 82–90
  • [7] Zih-Yun Chiu et al. “Bimanual Regrasping for Suture Needles using Reinforcement Learning for Rapid Motion Planning” In arXiv preprint arXiv:2011.04813, 2020
  • [8] Der-Lin Chow and Wyatt Newman “Improved Knot-Tying Methods for Autonomous Robot Surgery” In IEEE Conference on Automation Science and Engineering (CASE), 2013
  • [9] C. D’Ettorre et al. “Automated Pick-up of Suturing Needles for Robotic Surgical Assistance” In IEEE International Conference on Robotics and Automation (ICRA), 2018
  • [10] “Definition of superhuman by Merriam Webster” Accessed: 2021-05-24, https://www.merriam-webster.com/dictionary/superhuman
  • [11] Seth Hutchinson, Gregory D Hager and Peter I Corke “A tutorial on visual servo control” In IEEE transactions on robotics and automation 12.5 IEEE, 1996, pp. 651–670
  • [12] Minho Hwang et al. “Applying Depth-Sensing to Automated Surgical Manipulation with a da Vinci Robot” In International Symposium on Medical Robotics (ISMR), 2020
  • [13] Minho Hwang et al. “Efficiently Calibrating Cable-Driven Surgical Robots With RGBD Fiducial Sensing and Recurrent Neural Networks” In IEEE Robotics and Automation Letters (RA-L), 2020
  • [14] Dmitry Kalashnikov et al. “QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation” In Conference on Robot Learning (CoRL), 2018
  • [15] P. Kazanzides et al. “An Open-Source Research Kit for the da Vinci Surgical System” In IEEE International Conference on Robotics and Automation (ICRA), 2014
  • [16] Ben Kehoe et al. “Autonomous multilateral debridement with the raven surgical robot” In 2014 IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 1432–1439 IEEE
  • [17] Thomas Kollar et al. “SimNet: Enabling Robust Unknown Object Manipulation from Pure Synthetic Data via Stereo” In arXiv preprint arXiv:2106.16118, 2021
  • [18] Danica Kragic and Henrik I Christensen “Survey on visual servoing for manipulation” In Computational Vision and Active Perception Laboratory, Fiskartorpsv 15 Citeseer, 2002, pp. 2002
  • [19] Sanjay Krishnan et al. “SWIRL: A sequential windowed inverse reinforcement learning algorithm for robot tasks with delayed rewards” In The International Journal of Robotics Research 38.2-3 SAGE Publications Sage UK: London, England, 2019, pp. 126–145
  • [20] Sergey Levine et al. “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection” In The International Journal of Robotics Research 37.4-5 SAGE Publications Sage UK: London, England, 2018, pp. 421–436
  • [21] J. Mahler et al. “Learning Accurate Kinematic Control of Cable-Driven Surgical Robots Using Data Cleaning and Gaussian Process Regression” In IEEE Conference on Automation Science and Engineering (CASE), 2014
  • [22] Lyudmila Mihaylova et al. “A comparison of decision making criteria and optimization methods for active robotic sensing” In International Conference on Numerical Methods and Applications, 2002, pp. 316–324 Springer
  • [23] Adithyavairavan Murali et al. “Learning by Observation for Surgical Subtasks: Multilateral Cutting of 3D Viscoelastic and 2D Orthotropic Tissue Phantoms” In IEEE International Conference on Robotics and Automation (ICRA), 2015
  • [24] Samuel Paradis et al. “Intermittent Visual Servoing: Efficiently Learning Policies Robust to Instrument Changes for High-precision Surgical Manipulation” In IEEE International Conference on Robotics and Automation (ICRA), 2021
  • [25] Haonan Peng, Xingjian Yang, Yun-Hsuan Su and Blake Hannaford “Real-time Data Driven Precision Estimator for RAVEN-II Surgical Robot End Effector Position” In IEEE International Conference on Robotics and Automation (ICRA), 2020
  • [26] Florian Richter et al. “Autonomous Robotic Suction to Clear the Surgical Field for Hemostasis using Image-based Blood Flow Detection” In arXiv preprint arXiv:2010.08441, 2020
  • [27] Jacob Rosen and Ji Ma “Autonomous Operation in Surgical Robotics” In Mechanical Engineering 137.9, 2015
  • [28] H. Saeidi et al. “Autonomous Laparoscopic Robotic Suturing with a Novel Actuated Suturing Tool and 3D Endoscope” In IEEE International Conference on Robotics and Automation (ICRA), 2019
  • [29] Paolo Salaris, Riccardo Spica, Paolo Robuffo Giordano and Patrick Rives “Online optimal active sensing control” In 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 672–678 IEEE
  • [30] Daniel Seita et al. “Fast and Reliable Autonomous Surgical Debridement with Cable-Driven Robots Using a Two-Phase Calibration Procedure” In IEEE International Conference on Robotics and Automation (ICRA), 2018
  • [31] S. Sen et al. “Automating Multiple-Throw Multilateral Surgical Suturing with a Mechanical Needle Guide and Sequential Convex Optimization” In IEEE International Conference on Robotics and Automation (ICRA), 2016
  • [32] Priya Sundaresan et al. “Automated Extraction of Surgical Needles from Tissue Phantoms” In IEEE Conference on Automation Science and Engineering (CASE), 2019
  • [33] Brijen Thananjeyan et al. “Safety Augmented Value Estimation from Demonstrations (SAVED): Safe Deep Model-Based RL for Sparse Cost Robotic Tasks” In IEEE Robotics and Automation Letters (RA-L), 2020
  • [34] Brijen Thananjeyan et al. “Multilateral Surgical Pattern Cutting in 2D Orthotropic Gauze with Deep Reinforcement Learning Policies for Tensioning” In IEEE International Conference on Robotics and Automation (ICRA), 2017
  • [35] Vignesh Manoj Varier et al. “Collaborative Suturing: A Reinforcement Learning Approach to Automate Hand-off Task in Suturing for Surgical Robots” In 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 2020, pp. 1380–1386 IEEE
  • [36] Steven D Whitehead and Dana H Ballard “Active perception and reinforcement learning” In Machine Learning Proceedings 1990 Elsevier, 1990, pp. 179–188
  • [37] Michael Yip and Nikhil Das “Robot Autonomy for Surgery” In The Encyclopedia of Medical Robotics, 2017