
Continuous close-range 3D object pose estimation

Bjarne Grossmann, Francesco Rovida and Volker Krüger
The research leading to these results has received funding from the European Commission's Horizon 2020 Programme under grant agreement no. 723658 (SCALABLE).
Robotics, Vision, and Machine Intelligence (RVMI) Lab at Aalborg University Copenhagen, 2450 Copenhagen, Denmark; Robotics and Semantic Systems (RSS), Lund University, Lund, Sweden
e-mail: bjarne|[email protected]; [email protected]
Abstract

In the context of future manufacturing lines, removing fixtures will be a fundamental step to increase the flexibility of autonomous systems in assembly and logistic operations. Vision-based 3D pose estimation is a necessity to accurately handle objects that might not be placed at fixed positions during the robot task execution. Industrial tasks bring multiple challenges for the robust pose estimation of objects such as difficult object properties, tight cycle times and constraints on camera views. In particular, when interacting with objects, we have to work with close-range partial views of objects that pose a new challenge for typical view-based pose estimation methods.

In this paper, we present a 3D pose estimation method based on a gradient-ascent particle filter that integrates new observations on-the-fly to improve the pose estimate. Thereby, we can apply this method online during task execution to save valuable cycle time. In contrast to other view-based pose estimation methods, we model potential views in the full 6-dimensional space, which allows us to cope with close-range partial object views. We demonstrate the approach on a real assembly task, in which the algorithm usually converges to the correct pose within 10-15 iterations with an average error of less than 8 mm.

I INTRODUCTION

With the endeavor of industrial manufacturers towards agile factories with flexible production lines, robots become more and more self-contained autonomous working units equipped with sensory devices to interact with their environment. Workstations are not specially designed for the robot, i.e. there are no fixtures for the objects or cameras observing the work space, and the work pieces can also be in motion on a conveyor belt or moved around by a human. In these dynamic environments, one crucial ability of the robot to autonomously perform common manufacturing tasks such as picking, placing or assembling is to visually estimate the accurate pose of objects, often by means of an RGB-D camera that is mounted on the robot's wrist for the sake of flexibility (fig. 1, top). Additionally, we often have to deal with tight cycle times, and we therefore cannot afford a time-consuming pose estimation procedure that e.g. includes finding better views; instead, we have to solve the problem during the task execution.

However, when approaching and interacting with an object, we face a new challenge in maintaining a good pose estimate: the closer we get to the object, the less we see of it. When dealing with objects at close range, we can therefore only use partial observations of the object (fig. 1, bottom) to estimate the object's pose. Additionally, we need to continuously update the pose estimate due to the object's potential movements, which requires basic tracking capabilities of the algorithm.

Figure 1: Robot setup for the (left) piston and (right) crankcase cap insertion with a wrist-mounted camera. During the task execution, only close-range partial views of the engine block are available, posing a major challenge for the pose estimation.

In [1], we showed that pose estimation of industrial objects is already a highly challenging task due to the difficult object properties themselves. Especially local feature-based methods [2] struggle with industrial objects, because they require robust surface patches that often cannot be provided by the depth sensors. View-based methods, including descriptor-based [3] and learning-based [4] methods, often used in tracking algorithms (tracking-by-detection) [5], are based on manually captured or prerendered views of a CAD model. They rephrase the pose estimation problem as a recognition or classification task by finding the best (most similar) view among the generated views with respect to the observed scene [6]. However, the number of rendered views is usually a trade-off between speed and robustness, thus restricting the accuracy of the pose estimate. Additionally, the pose estimation problem can often not be solved by one-shot pose estimation approaches due to view ambiguities [7]. Next-best-view methods [8] try to compensate for noise and uncertainties by finding a more informative view or by integrating multiple views of a static object to improve the pose estimate; however, they are not suited for moving objects and additionally increase the cycle time of the task execution. Integrating multiple views in a probabilistic way using Bayesian propagation over time ([9, 10]) has recently been explored, but these approaches either consider only the rotational space due to the high computational demands or rely on local feature descriptors. Additionally, all view-based approaches rely on the assumption of a fully visible object at a reasonable distance. Situations in which we have to deal with partial observations of the object from a close range have not been considered.

In this paper, we therefore discuss how close-range partial observations constitute an inherent problem for typical view-based pose estimation methods due to the way the object model is represented. Furthermore, we provide a solution to the close-range pose estimation problem based on our previous approach in [11], which uses an adapted particle filter to continuously estimate the pose of a known object in almost real-time without relying on precomputed views of the object model. Here, we extend the approach to deal with close-range observations, even when the target object is slightly moving. We demonstrate our solution on a realistic use case: a challenging assembly task of a piston and a crankcase cap (fig. 1) that have to be inserted into a moving engine block.

In the following section, we discuss the impact of close-range observations on view-based pose estimation methods in detail. Section III presents our solution to the problem and describes its implementation, whereas Section IV validates the method by means of an experimental evaluation. In Section V, we summarize our findings and outline the limitations of the proposed method.

Figure 2: View generation from a 3D model [10]: A synthetic view is generated from each viewpoint on a sphere. However, this model covers only 4 dimensions of the object's pose.

II The problem of close-range observations

View-based pose estimation methods usually rely on a 3D representation of an object, manually captured beforehand or often given as a CAD model. From this model, multiple different views with known viewpoints are generated (fig. 2) that represent the classes for comparison with the real object.

This way of modelling the object implicitly makes an important assumption: The object is centered in the view of the camera. Since the views are taken from viewpoints that are located on a sphere (or multiple spheres of varying radii) with the same focus point, this model only accounts for 4 degrees of freedom, namely the distance to the object and yaw, pitch, roll of the camera. This means that the remaining two dimensions (the unknown offset of the object with respect to the focus point) render the pose estimate ambiguous. For objects located far enough from the camera, this does not directly pose an issue since the missing two parameters can be retrieved through an additional alignment step. Here, 2D-based methods often rely on template matching [5] or learning-based solutions [4] in order to retrieve the location of the object, whereas 3D-based methods compute a local reference frame [3] of the segmented scene object to achieve the same.
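To make this 4-DoF view model concrete, the following sketch (not part of the paper's implementation; a minimal NumPy illustration with hypothetical helper names) samples camera poses on a sphere around the object center, each looking at the same focus point:

```python
import numpy as np

def look_at_pose(cam_pos, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """Camera-to-world pose whose viewing axis (+z) points at `target`."""
    z = target - cam_pos                      # viewing direction
    z /= np.linalg.norm(z)
    x = np.cross(z, up)                       # camera right axis
    if np.linalg.norm(x) < 1e-8:              # viewing direction parallel to `up`
        x = np.array([1.0, 0.0, 0.0])
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    T = np.eye(4)
    T[:3, :3] = np.stack([x, y, z], axis=1)   # columns: camera axes in world frame
    T[:3, 3] = cam_pos
    return T

def sphere_viewpoints(radius, n_azimuth=12, n_elevation=6):
    """Look-at poses on a sphere of given radius around the object center.
    Covered DoF: distance + viewing angles (in-plane roll could be sampled per view);
    the 2-DoF offset of the object from the focus point is NOT modelled."""
    poses = []
    for el in np.linspace(0.1, np.pi - 0.1, n_elevation):   # avoid the poles
        for az in np.linspace(0.0, 2 * np.pi, n_azimuth, endpoint=False):
            p = radius * np.array([np.sin(el) * np.cos(az),
                                   np.sin(el) * np.sin(az),
                                   np.cos(el)])
            poses.append(look_at_pose(p))
    return poses
```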

Figure 3: View of an engine from the real camera (blue) and the corresponding synthetic view (red) modelled with the focus point in the center of the object. The difference in camera pose can be computed using the offset. (left) At long distances, visible parts of the observed object match the scene object. (top right) When approaching the object, the real and synthetic view start to deviate. (bottom row) At close distances, the real and generated view show huge differences in the visibility of the object's surface.

However, the initial assumption of a centered object implies that the view generation uses a slightly different camera pose compared to the real scene. This change of perspective can usually be ignored when the object is farther away (depending on the focal length of the camera), since we can assume that the object remains fully in view and the projection of the object onto the image plane can be considered almost orthographic (fig. 3, left). When moving the camera towards the object, however, these assumptions no longer hold (fig. 3, right): the object might not be fully in view anymore, such that changes of the viewing direction of the camera expose more or fewer visible parts of the object. Identifying the correct view under these circumstances becomes a difficult challenge for any classifier, as the real and synthetic views differ too much. Additionally, at close distance to the object, we have to deal with perspective distortion (transversal and axial magnification) of the object, which reduces the similarity between correct views even more (fig. 4).


Figure 4: An example of a close-range view of an engine block: The difference between the real view (left) and the corresponding modelled view (right) shows the shortcomings of typical view-based methods at close range: different surface areas of the object are visible, in addition to the perspective distortion.

An obvious solution to this problem is to adapt the view generation to cover all 6 dimensions instead of modelling only 4 degrees of freedom. However, this usually becomes computationally intractable due to the curse of dimensionality, especially when we try to compute all possible views beforehand. Instead, the generation of the synthetic views should be performed online and only for relevant views. However, finding these relevant views is not trivial, since we still have to search the entire high-dimensional view space. In order to find camera poses with a higher probability of generating views similar to the observed scene object, we would have to know the object's pose in the first place. This chicken-or-egg problem can, however, be solved recursively by implementing a smart search strategy and by tracking promising views over time, as described in Section III.

III Sequential Bayesian pose estimation with gradient-ascent

In our previous work [11], we showed that the formulation of a sequential Bayes filter in the form of a particle filter is a well-suited method to integrate multiple observations over time and resolve the ambiguities of single-shot pose estimation methods. Here, we extend this approach to handle close-range observations and moving objects.

III-A Particle filter

The sequential Bayes filter is governed by two alternating steps of updating (eq. (1)) the belief of the current state (pose estimate) given all observations $z_{0:t}$ and predicting (eq. (2)) the next state given the current state $\theta_t$, the current control input $u_t$ and all observations $z_{0:t}$.

$$p(\theta_t \mid z_{0:t}) \propto p(z_t \mid \theta_t)\ p(\theta_t \mid z_{0:t-1})$$ (1)
$$p(\theta_{t+1} \mid z_{0:t}) = \int p(\theta_{t+1} \mid \theta_t, z_t, u_{t+1})\ p(\theta_t \mid z_{0:t})\, d\theta_t\,.$$ (2)

The control input $u_t$ is used to incorporate the directly measured motion of the robot arm's wrist (mounting point for the camera). We model the state space in terms of full viewing poses (which is equivalent to the inverse of the object's pose) given as a 6-dimensional vector for the translational and rotational part, encoded in a transformation matrix:

$$\theta = (t_x, t_y, t_z, \alpha_{roll}, \beta_{pitch}, \gamma_{yaw}) \;\widehat{=}\; [R \mid \vec{t}\,]\,,$$ (3)

whereas the observations $z$ are captured with a depth sensor in the form of a 3D point cloud. The definition of a sequential Bayesian filter for representing pose probabilities requires the formulation of a motion model $p(\theta_{t+1} \mid \theta_t, z_t, u_{t+1})$ and an observation model $p(z_t \mid \theta_t)$.
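For concreteness, a minimal sketch of the mapping in eq. (3) between the 6-dimensional state vector and the transformation matrix, assuming an xyz (roll-pitch-yaw) Euler convention, which the paper does not specify:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def state_to_matrix(theta):
    """Map the 6D state (tx, ty, tz, roll, pitch, yaw) to a 4x4 transform [R | t]."""
    tx, ty, tz, roll, pitch, yaw = theta
    T = np.eye(4)
    T[:3, :3] = Rotation.from_euler("xyz", [roll, pitch, yaw]).as_matrix()
    T[:3, 3] = [tx, ty, tz]
    return T

def matrix_to_state(T):
    """Inverse mapping, useful for reporting the pose estimate."""
    roll, pitch, yaw = Rotation.from_matrix(T[:3, :3]).as_euler("xyz")
    return np.concatenate([T[:3, 3], [roll, pitch, yaw]])
```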

III-B Fast online view generation

When considering the full 6-DoF view space, generating the synthetic views beforehand is no longer feasible due to the high memory consumption and computational demands. The view generation therefore has to be performed online and becomes a crucial factor for the time performance of the algorithm, since a separate view has to be generated for each particle.

We implemented a fast method to generate point cloud views from a CAD file on-the-fly, as proposed in our previous work [11]. Our approach works similarly to the rasterization process of typical graphics pipelines, but instead of creating a projected image from the model, we are interested in the 3D points that are used to generate that image. This could be done by reprojecting the image into 3D space again; however, this is computationally more expensive and reduces the accuracy of the object's surface due to the discretization. Therefore, we first sample the CAD model in a preprocessing step (done only once), and then filter non-visible points by projecting the points onto a depth buffer, keeping only the index of the closest point for each pixel. Visible points can then be extracted using these indices.
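A minimal, unoptimized Python sketch of this depth-buffer visibility filtering (function and parameter names are ours; the paper's implementation is a faster, rasterization-style variant of the same idea):

```python
import numpy as np

def visible_point_indices(model_pts, cam_pose, K, width, height):
    """Filter a pre-sampled model point cloud to the points visible from `cam_pose`.

    model_pts: (N, 3) points sampled once from the CAD model (world frame)
    cam_pose:  4x4 camera-to-world transform; K: 3x3 pinhole intrinsics
    Returns the indices of the closest point per pixel (a point-wise z-buffer).
    """
    # Transform points into the camera frame
    world_to_cam = np.linalg.inv(cam_pose)
    pts_cam = (world_to_cam[:3, :3] @ model_pts.T + world_to_cam[:3, 3:4]).T
    in_front = pts_cam[:, 2] > 0
    idx = np.flatnonzero(in_front)
    pts_cam = pts_cam[in_front]

    # Project onto the image plane
    uv = (K @ pts_cam.T).T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, idx = u[valid], v[valid], pts_cam[valid, 2], idx[valid]

    # Depth buffer: keep only the index of the closest point for each pixel
    depth = np.full((height, width), np.inf)
    index_buf = np.full((height, width), -1, dtype=int)
    for ui, vi, zi, ii in zip(u, v, z, idx):
        if zi < depth[vi, ui]:
            depth[vi, ui] = zi
            index_buf[vi, ui] = ii
    return index_buf[index_buf >= 0]          # indices of visible model points
```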

III-C Observation model: Comparing views

The observation model describes the likelihood of the observation given a viewing pose. A direct comparison of a viewing pose (state) with a point cloud (observation) is not possible. We therefore encode the viewing pose in a joint space with the observation. This requires a partial view of the object model (in industrial use cases often given as a CAD model), rendered from the given viewing pose with the fast view generation described above. The similarity measure for two point clouds can then be derived as follows:

Given the point cloud of a rendered CAD model with $m$ points $p_m \in M$ from the viewing pose $\theta_t$, the observation point cloud $z_t$ with points $p_v \in O$, and a correspondence map $c_t \in C: M \mapsto O$ that maps $n$ model points to observation points $(p_v^k, p_m^k) \in c_t$, we define our similarity measure as the mean squared error between the two point clouds:

$$f(c_t, \theta_t) = \frac{1}{n}\sum_{k=1}^{n} \left\| R_\theta \cdot p_v^k + \vec{t}_\theta - p_m^k \right\|^2$$ (4)

However, since we are modelling the full 6-dimensional view space, we also allow generated views in which the object is only partially visible. Two parallel views that differ only in the size of the visible area (fig. 5) are therefore not directly distinguishable with our similarity metric. In order to compensate for this error, we scale the similarity with the overlap ratio between the real view and the synthetic view, based on the found correspondences.

Figure 5: Two parallel views of an object: the real view (blue) and a synthetic view (red) have almost the same point cloud. The only way to differentiate them is to account for the overlap ratio.

However, the correspondences between the real and synthetic view are usually unknown. We apply a straightforward scheme, which is also used in the iterative closest point (ICP) algorithm, by simply selecting for each point in the synthetic view the nearest neighbor within a limited distance in the real view. Additionally, we apply a sample consensus (SAC) correspondence rejector to be more robust against outliers, e.g. due to sensor noise. Integrating the overlap ratio ($n$ matches given $m$ model points), we can now define the log-likelihood of the observation given the current state and the correspondences as

$$\log p(z_t \mid \theta_t, c_t) = \frac{m}{n}\cdot f(c_t, \theta_t)$$ (5)
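A minimal sketch of this overlap-scaled score, using a k-d tree for the distance-limited nearest-neighbor correspondences; the SAC correspondence rejection step is omitted here, and the function name and distance threshold are illustrative assumptions:

```python
import numpy as np
from scipy.spatial import cKDTree

def log_likelihood_score(observed_pts, model_pts, max_dist=0.01):
    """Overlap-scaled mean squared error between a synthetic view and the observation,
    following eq. (5): (m / n) * MSE over the n distance-limited correspondences.
    The returned value acts as a log-space error weight: lower is better."""
    tree = cKDTree(observed_pts)
    dists, _ = tree.query(model_pts, distance_upper_bound=max_dist)
    matched = np.isfinite(dists)              # correspondences found within max_dist
    n, m = matched.sum(), len(model_pts)
    if n == 0:
        return np.inf                         # no overlap at all: worst possible score
    mse = np.mean(dists[matched] ** 2)
    return (m / n) * mse
```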

III-D Motion model: Gradient ascent

In most sequential Bayesian filters, we assume that the proposal distribution is fully determined by the transition prior probability distribution

$$q(\theta_t \mid \theta_{0:t-1}, u_{0:t}, z_{0:t}) = \pi(\theta_t \mid \theta_{t-1}, u_t)\,.$$ (6)

The motion model then describes the transition from one state $\theta_t$ to the next predicted state $\theta_{t+1}$ and is defined by the recursive state transition function

$$\theta_{t+1} = h(\theta_t, u_t)\,.$$ (7)

Due to the high-dimensional space of potential poses, we direct the search towards the more likely states based on the Langevin Monte Carlo method [12] by using the gradient of the log-likelihood function. With the gradient in the motion model, the state transition function reads

$$\theta_{t+1} = \theta_t \cdot u_t \cdot N(0, \Sigma) \cdot \gamma\nabla\log\theta_t$$
$$\text{with}\quad \nabla\log\theta_t = \nabla\log p(z_t \mid c_t, \theta_t)$$ (8)

where $\gamma$ is the step-size of the gradient (for brevity, the scaling of the gradient is written as a simple multiplication with the scaling factor $\gamma$; since we are dealing with transformation matrices, this operation is actually a linear interpolation in SE(3)). Eq. (8) can also be seen as a gradient ascent method (neglecting the noise and control input, as they can be applied afterwards):

$$\hat{\theta}_{t+1} = \theta_t \cdot \gamma\nabla\log p(z_t \mid c_t, \theta_t)\,.$$ (9)

Intuitively, we are therefore trying to find the state $\hat{\theta}$ that maximizes the likelihood. However, the gradient of the log-likelihood cannot be computed analytically, since it relies on the correspondence estimation itself. The correct computation of the correspondences $\hat{c_t}$, in contrast, requires a good pose estimate $\hat{\theta}$ in the first place. Instead of computing the gradient directly, we express this problem as a maximization-maximization problem [13] with

$$\hat{c_t} = \arg\max_{c_t} p(z_t \mid c_t, \theta_t)$$ (10)
$$\hat{\theta} = \arg\max_{\theta} p(z_t \mid c_t, \theta_t)\,.$$ (11)

Since the observation model also requires the correspondences, we exploit this and use the very same ICP iterations to solve this problem. Here, we are only interested in the gradient and not the maximum; therefore a few iterations (2-3) are sufficient for the gradient estimation. Note that the step-size $\gamma$ is implicitly retrieved during the estimation process.
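A sketch of this gradient estimation via a few ICP iterations, using the standard closed-form least-squares (SVD/Kabsch) alignment; the SAC rejection step is again omitted and the names are ours:

```python
import numpy as np
from scipy.spatial import cKDTree

def rigid_least_squares(src, dst):
    """Closed-form least-squares rigid alignment (Kabsch/SVD) of matched points."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = mu_d - R @ mu_s
    return T

def icp_gradient(model_pts, observed_pts, iterations=3, max_dist=0.02):
    """Run a few ICP iterations and return the accumulated transform,
    used as the (implicitly scaled) gradient direction of the log-likelihood."""
    T = np.eye(4)
    pts = model_pts.copy()
    tree = cKDTree(observed_pts)
    for _ in range(iterations):
        dists, idx = tree.query(pts, distance_upper_bound=max_dist)
        matched = np.isfinite(dists)          # distance-limited correspondences
        if matched.sum() < 3:
            break
        dT = rigid_least_squares(pts[matched], observed_pts[idx[matched]])
        pts = (dT[:3, :3] @ pts.T + dT[:3, 3:4]).T
        T = dT @ T                            # accumulate the incremental alignment
    return T
```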

Using the gradient in the motion model has several advantages: the particles are drawn to local maxima representing multiple modes of the posterior, capturing multiple pose hypotheses at the same time. We can therefore cover the posterior in the high-dimensional state space with fewer particles while converging faster to potential modes. Even when the modes of the posterior shift rather rapidly due to large object movements, we are able to track the object's pose. Since particles in areas of low probability move towards higher probabilities, the gradient also counteracts particle degeneracy and reduces the need for frequent resampling. The particle diffusion, on the other hand, works as an antagonist to the gradient ascent and avoids particle impoverishment, i.e. particles located in a local maximum will not collapse to a single point.
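As noted above, scaling the gradient transform by $\gamma$ in eqs. (8) and (9) corresponds to a linear interpolation in SE(3). One possible implementation (an assumption; the paper does not specify the exact interpolation used) scales the rotation vector and the translation separately:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def scale_transform(delta, gamma):
    """Interpolate the relative transform `delta` (4x4) between identity and itself,
    i.e. apply the 'gamma * gradient' step of eqs. (8)/(9) in SE(3)."""
    scaled = np.eye(4)
    rotvec = Rotation.from_matrix(delta[:3, :3]).as_rotvec()
    scaled[:3, :3] = Rotation.from_rotvec(gamma * rotvec).as_matrix()
    scaled[:3, 3] = gamma * delta[:3, 3]
    return scaled

def gradient_step(theta, delta, gamma=0.5):
    """theta_{t+1} = theta_t * (gamma * delta), with theta and delta as 4x4 matrices."""
    return theta @ scale_transform(delta, gamma)
```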

III-E Implementation and Optimization

Our implementation is based on a particle filter that has been extended with a gradient-ascent-driven motion model. The input for our algorithm is a CAD model (or dense point cloud) and a segmented object in the form of a point cloud from the real scene. Additionally, we use the measured arm motion as a control input to further improve our motion model. We initialize the particles on a sphere centered around the observed object with a radius equal to the mean distance. Resampling only has to be done when the effective sample size drops below 50%. The update of the particles is done in parallel: we first propagate and diffuse the particles, then generate the corresponding views, followed by 2-3 ICP iterations. Finally, we compute the log-likelihood using the mean squared error between the observed and synthetic point cloud, scaled by the overlap. In order to optimize the execution time, we do all computations in log-space. As a result, the weight update becomes a simple sum, which we scale by 0.5 to ensure that the weights stay in a representable range. Unnormalized weights do not pose an issue, as we are only interested in the MAP estimate, which is scale independent. Additionally, due to the log-space, the MAP estimate now has to be retrieved as the particle with the minimum weight.

Data: Model $M$, Motions $U = u_{0:t}$, Observations $Z = z_{0:t}$
Data: Particles (Pose, Weight) $\langle P, W \rangle$
Result: Pose estimate $\hat{\theta}$
$dist \leftarrow \textit{mean}(z_0)$
$P \leftarrow \textit{sample}(dist)$
foreach $(z_k, u_k) \in (Z, U)$ do
       if $\textit{effectiveSampleSize} < 0.5$ then
              $P \leftarrow \textit{resample}(P, W)$
       end if
       foreach $p_i \in P$ do
              $p_i \leftarrow \textit{propagate}(p_i, u_k)$
              $p_i \leftarrow \textit{diffuse}(p_i, \sigma^2)$
              $m_i \leftarrow \textit{generateView}(p_i, M)$
              $t_i \leftarrow p_i^{-1}$
              for $0$ to $N$ do
                     $m_i \leftarrow \textit{transform}(m_i, t_i)$
                     $c_i \leftarrow \textit{nearestNeighbors}(z_k, m_i)$
                     $c_i \leftarrow \textit{rejectCorrespondences}(c_i)$
                     $t_i \leftarrow \textit{leastSquares}(z_k^{[c_i]}, m_i^{[c_i]})$
              end for
              $p_i \leftarrow t_i^{-1}$
              $\epsilon_i \leftarrow \frac{\#m_i}{\#c_i} \cdot \textit{MSE}(z_k^{[c_i]}, m_i^{[c_i]})$
              $w_i \leftarrow 0.5 \cdot (w_i + \epsilon_i)$
       end foreach
       $\textit{map} \leftarrow \arg\min_i w_i$
       $\hat{\theta}_k \leftarrow p_{\textit{map}}$
end foreach
Algorithm 1: Particle filter with gradient-ascent
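A compact sketch of the weight bookkeeping in Algorithm 1 (log-space errors, lower is better, MAP = minimum-weight particle). How the effective sample size is computed from these unnormalized log-space weights is not specified in the paper, so the mapping used below is only one possible choice:

```python
import numpy as np

def update_weights(weights, errors):
    """Running log-space update w_i <- 0.5 * (w_i + eps_i) from Algorithm 1."""
    return 0.5 * (np.asarray(weights) + np.asarray(errors))

def map_estimate(poses, weights):
    """The weights are (scaled) log-space errors, so the MAP is the minimum."""
    return poses[int(np.argmin(weights))]

def effective_sample_size(weights):
    """Standard ESS used to trigger resampling; assumes the log-space errors are
    first mapped back to normalized probabilities (an assumption, not from the paper).
    Compare the result against 0.5 * len(weights)."""
    w = np.exp(-np.asarray(weights) + np.min(weights))   # stabilized exp(-w)
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)
```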

IV Application and Evaluation

For our experiments, we use a KUKA iiwa robot arm with a wrist-mounted RealSense F200 camera. A typical sequence of captured images during the tasks is shown in Fig. 7. The ground truth for the engine's pose has been generated semi-automatically by aligning a fully visible, discriminative view to the model, which we afterwards verified manually. Furthermore, we recorded the trajectory of the camera during the process to backtrack the engine's pose over time. The particle filter runs with only 200 particles in all experiments. In order to evaluate our algorithm, we ran multiple trials. For each trial, we used one baseline test (see Sect. IV-A below, with the full view of the engine and a static camera) and 4 realistic task executions from different starting poses for the piston insertion and the crankcase cap placement, respectively. For each trial, we ran multiple alignments, which in all cases converged to the correct baseline pose.

Figure 6: Evaluation of the pose estimation error with respect to the distance and viewing volume of the camera.
Figure 7: Typical sequence of (top) a piston insertion and (bottom) a crankcase cap placement. During the sequence, the engine block is never fully in view.

IV-A Baseline

For the baseline, we conducted a simple experiment with a static camera placed roughly 40 cm above the engine block. From that view, the engine was fully visible. The approach used for the baseline is the one from [11], which was shown (and has since proven) to be precise. We evaluated the pose estimation 5 times to compute the precision and accuracy of the algorithm in terms of the mean error and the standard deviation. With the whole object in view and neglecting the burn-in phase of the particle filter, we obtain a mean error of $3.2\,mm \pm 1.3\,mm$ for the translation and $3.3\deg \pm 2.2\deg$ for the rotation after convergence. Here, the error is computed with respect to the object's coordinate frame (center). Note that the accuracy of the points on the visible surface is usually even higher. We can easily handle errors of this magnitude during assembly using the compliance of our KUKA iiwa arms [14].

IV-B Close-range partial views

The goal of this experiment was to determine the impact of the camera distance to the object, and thereby the ratio of the visible parts of the object, on the pose estimate. Therefore, we captured the distance to the object based on the mean of the point cloud and its extent by computing an ellipsoid around the visible points using its principal axes. The volume of the ellipsoid gives a rough estimate of the visibility ratio.
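A sketch of how such a distance and visibility-volume measure could be computed from the segmented point cloud; the 2-sigma scaling of the principal semi-axes is our assumption, since the paper does not give the exact ellipsoid fit:

```python
import numpy as np

def visibility_stats(visible_pts):
    """Distance and a rough visibility volume for a segmented object point cloud
    given in the camera frame: distance = norm of the centroid, volume = ellipsoid
    spanned by the principal axes (here with 2-sigma semi-axes, an assumption)."""
    centroid = visible_pts.mean(axis=0)
    distance = np.linalg.norm(centroid)
    cov = np.cov((visible_pts - centroid).T)
    eigvals = np.linalg.eigvalsh(cov)                 # variances along principal axes
    semi_axes = 2.0 * np.sqrt(np.maximum(eigvals, 0.0))
    volume = 4.0 / 3.0 * np.pi * np.prod(semi_axes)
    return distance, volume
```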

When setting the distance and volume in contrast to the error development (here: error norm), as shown in fig. 6, we can see that especially when the camera reduces the distance to the object at the end of the sequence (cf. fig. 7), the viewing volume shrinks to a point where the view becomes too ambiguous, and the particle filter starts failing at a distance of less than 25 cm (around observation 85). Depending on the sensor used (due to increasing noise at short distances), this value might differ and has to be estimated empirically for other RGB-D cameras.

IV-C Piston insertion and crankcase cap placement

The main experiments were conducted under realistic movement of the robot arm during (fig. 8, top row) the piston insertion and (fig. 8, bottom row) the crankcase cap placement. As usual, the particle filter requires a burn-in phase of 10 to 15 observations. Afterwards, the particle filter is, in both cases, able to converge rather quickly to the real pose. The estimated pose during the piston insertion converges with average translational errors (x, y, z) and maximal deviations of $(6.1\pm 5.5,\ 3.9\pm 3.6,\ 0.8\pm 0.8)\,mm$ and rotational errors and maximal deviations of $(0.5\pm 0.4,\ 0.6\pm 0.5,\ 3.9\pm 3.2)\deg$. The pose during the crankcase cap placement converges slightly slower to the correct pose, with average translational errors (x, y, z) and maximal deviations of $(0.8\pm 0.9,\ 3.1\pm 2.9,\ 0.5\pm 0.4)\,mm$ and rotational errors and maximal deviations of $(0.4\pm 0.4,\ 0.2\pm 0.2,\ 2.2\pm 1.9)\deg$. The higher accuracy here can be explained by the structure of the engine block: compared to the piston insertion, where the engine is viewed from the top with little structure, the bottom view looks into the engine and has much more depth information available for the alignment.

In both cases, the error in depth is the smallest translational error. This is related to the alignment of the points using ICP, since displacements in depth (without nearby corresponding points) are recovered immediately. Movements along the object's surface are not so easily detected and depend mainly on a discriminative surface structure. The majority of the rotational error is related to the camera roll. This is a general problem, but it is amplified when using partial views, since the visible surface of the object is in most cases rectangular. Registration algorithms such as ICP therefore prefer to align the rectangular structure rather than the often noisy surface structure of the object.

Figure 8: (top) Piston insertion and (bottom) Crankcase cap placement: Rotational and translational errors while approaching the top of the engine block from a random direction.

V Conclusions

In this paper, we presented a fast particle-based pose estimation method using multiple observations over time. We focused on scenarios in which robots with wrist-mounted cameras continuously have to estimate and track an object's pose while approaching and interacting with it. In these cases, observations can only be obtained from a close range and we have to deal with very limited partial views of the objects, which poses a challenge to view-based pose estimation algorithms. We showed that these methods make the assumption of a 4-dimensional view space for generating synthetic model views, which leads to wrong perspectives at close distance. We proposed to solve this issue by modelling the full 6D view space and rendering the views of the model online. Our experiments on a realistic assembly task demonstrate that, using only partial close-range observations, our algorithm is able to compute the pose of the work piece with an accuracy sufficient for many manufacturing tasks. In cases where higher accuracy is needed, the compliance of a robot arm and force-based exploration strategies can be exploited to cope with the residual uncertainty.

In the introduction, we emphasized that cycle time is critical and that the pose estimation approach must be fast. Particle filters in higher dimensions are known to be rather slow, but our approach, based on the gradient search strategy, allows us to run the algorithm in almost real-time (1.5 Hz) with plenty of room for optimization. We reach this speed on an Intel Xeon CPU (2.4 GHz) with 8 cores and 12 GB of RAM. The particle filter is parallelized using all 8 cores; GPU optimization is not used.

Even though our approach gives good initial results and shows that the concept holds, it also has its limitations: one of the major drawbacks is that, even though we can cope with a certain amount of noise due to the probabilistic approach, we still require the target object to be segmented from the scene, which by itself is a non-trivial task. However, since our approach models the full pose of the camera (or object), it actually allows the free positioning of virtual cameras, and with an improved observation model, it could be used directly in the scene to estimate the pose of one or multiple objects.

Furthermore, at a certain distance, even after convergence of the particle filter, the prior is not strong enough to maintain the correct pose estimate. In our experiments, we found empirically that for the Intel RealSense camera a threshold of 25 cm was a good cut-off point to stop capturing more images and to accept the computed pose. However, this threshold is use-case dependent and would have to be adapted for other depth sensors and objects. We will explore this cut-off threshold more systematically in the future.

References

  • [1] B. Großmann, M. Siam, and V. Krüger, “Comparative evaluation of 3d pose estimation of industrial objects in rgb pointclouds,” in International Conference on Computer Vision Systems.   Springer, 2015, pp. 329–342.
  • [2] Y. Guo, M. Bennamoun, F. Sohel, M. Lu, J. Wan, and N. M. Kwok, “A comprehensive performance evaluation of 3d local feature descriptors,” International Journal of Computer Vision, vol. 116, no. 1, pp. 66–89, 2016.
  • [3] A. Aldoma, F. Tombari, R. B. Rusu, and M. Vincze, “Our-cvfh - oriented, unique and repeatable clustered viewpoint feature histogram for object recognition and 6dof pose estimation,” in DAGM/OAGM Symposium, ser. Lecture Notes in Computer Science, A. Pinz, T. Pock, H. Bischof, and F. Leberl, Eds., vol. 7476.   Springer, 2012, pp. 113–122.
  • [4] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes,” arXiv preprint arXiv:1711.00199, 2017.
  • [5] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. R. Bradski, K. Konolige, and N. Navab, “Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes.” in ACCV (1), 2012, pp. 548–562.
  • [6] V. Lepetit, J. Pilet, and P. Fua, “Point matching as a classification problem for fast and robust object pose estimation,” in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, vol. 2.   IEEE, 2004, pp. II–II.
  • [7] S. D. Roy, S. Chaudhury, and S. Banerjee, “Active recognition through next view planning: a survey,” Pattern Recognition, vol. 37, no. 3, pp. 429–446, 2004.
  • [8] C. Connolly, “The determination of next best views,” in Proceedings. 1985 IEEE International Conference on Robotics and Automation, vol. 2, March 1985, pp. 432–435.
  • [9] S. Riedel, Z.-C. Marton, and S. Kriegel, “Multi-view orientation estimation using bingham mixture models,” in Automation, Quality and Testing, Robotics (AQTR), 2016 IEEE International Conference on.   IEEE, 2016, pp. 1–6.
  • [10] Ö. Erkent, D. Shukla, and J. Piater, “Integration of probabilistic pose estimates from multiple views,” in European Conference on Computer Vision.   Springer, 2016, pp. 154–170.
  • [11] B. Grossmann and V. Krüger, “Fast view-based pose estimation of industrial objects in point clouds using a particle filter with an icp-based motion model,” in 2017 IEEE 15th International Conference on Industrial Informatics (INDIN), July 2017, pp. 331–338.
  • [12] C. Nemeth, C. Sherlock, and P. Fearnhead, “Particle metropolis-adjusted langevin algorithms,” Biometrika, vol. 103, no. 3, pp. 701–717, 2016.
  • [13] R. M. Neal and G. E. Hinton, “A view of the em algorithm that justifies incremental, sparse, and other variants,” in Learning in graphical models.   Springer, 1998, pp. 355–368.
  • [14] F. Rovida, D. Wuthier, B. Grossmann, M. Fumagalli, and V. Krueger, “Motion generators combined with behavior trees: a novel approach to skill modelling,” in 2018 IEEE/RSJ International Conference on Intelligent Robots (IROS), Barcelona, Oct., 2018, to be published.