
Symmetry and Uncertainty-Aware Object SLAM
for 6DoF Object Pose Estimation

Nathaniel Merrill1  Yuliang Guo2  Xingxing Zuo3  Xinyu Huang2
Stefan Leutenegger3  Xi Peng1  Liu Ren2  Guoquan Huang1
1University of Delaware  2Bosch Research North America  3Technical University of Munich
{nmerrill,xipeng,ghuang}@udel.edu  {yuliang.guo2,xinyu.huang,liu.ren}@us.bosch.com
{xingxing.zuo,stefan.leutenegger}@tum.de
Abstract

We propose a keypoint-based object-level SLAM framework that can provide globally consistent 6DoF pose estimates for symmetric and asymmetric objects alike. To the best of our knowledge, our system is among the first to utilize the camera pose information from SLAM to provide prior knowledge for tracking keypoints on symmetric objects – ensuring that new measurements are consistent with the current 3D scene. Moreover, our semantic keypoint network is trained to predict the Gaussian covariance for the keypoints that captures the true error of the prediction, and thus is not only useful as a weight for the residuals in the system’s optimization problems, but also as a means to detect harmful statistical outliers without choosing a manual threshold. Experiments show that our method provides competitive performance to the state of the art in 6DoF object pose estimation, and at real-time speed. Our code, pre-trained models, and keypoint labels are available at https://github.com/rpng/suo_slam.

1 Introduction

Object pose estimation in 6 degrees of freedom (DoF) plays a key role in a variety of downstream applications (e.g., autonomous driving, robotic navigation, manipulation, and augmented reality), and has been extensively studied in the computer vision and robotics communities [17, 28, 26, 22, 14, 5, 33]. Some methods rely on RGB input [22, 37, 15, 32, 31], while others utilize additional depth input to improve performance [29, 35, 15, 32]. Some deal with a single view [35, 28, 22], while others utilize multiple views to enhance the results [3, 2, 29, 15, 14, 5]. In particular, multi-view methods can be further categorized into offline structure from motion (SfM) – where all the frames are given at once [3, 14] – and online SLAM, where frames are provided sequentially and real-time performance is expected [29, 5]. This paper focuses on image-based 6DoF pose estimation for multiple objects in the context of an online monocular SLAM system.

Figure 1: Our proposed method leverages the detected keypoints of asymmetric objects and the 3D scene created from the SLAM system to consistently track the keypoints of symmetric objects. Given the current camera pose estimated from asymmetric objects’ keypoints, the projections of the existing 3D keypoints into the current image act as informative prior input to guide the network in predicting keypoints with consistent symmetry over time.

A typical multi-view 6DoF pose estimation method can be decomposed into a single-view estimation stage and a multi-view enhancement stage. While pose estimates from multiple views can be fused for better performance [3, 14, 5], handling extreme inconsistencies – e.g., those caused by the rotational symmetry of objects – is still challenging. It is also unreliable to manually tune the thresholds for outlier rejection and to assign residual weights for nonlinear optimization. To tackle these challenges, in this paper we propose a symmetry and uncertainty-aware 6DoF object pose estimation method which fuses semantic keypoint measurements from all views within a SLAM framework. The main contributions of this work are:

  • We design a keypoint-based object SLAM system that jointly estimates the globally-consistent object and camera poses in real time – even in the presence of incorrect detections and symmetric objects.

  • We propose a method able to consistently predict and track 2D semantic keypoints for symmetric objects over time, which leverages the projection of existing 3D keypoints into the current image as an informative prior input to the keypoint network.

  • We develop a method to train the keypoint network to estimate the uncertainty of its predictions such that the uncertainty measure quantifies the true error of the keypoints, and significantly improves object pose estimation in the object SLAM system.

The rest of this paper is organized as follows: After briefly reviewing the related literature in Sec. 2, we describe our method in detail in Sec. 3 – including the keypoint detector and how it is used in the entire system. A thorough evaluation of our framework is presented in Sec. 4 before concluding in Sec. 5.

2 Related Work

Single-view object pose estimation.

A large number of single-view object pose estimation methods have been presented in recent years. One major trend is to utilize deep networks to predict the relative pose of an object with respect to the camera in a regress-and-refine fashion [35, 15, 14, 37]. Although effective, the iterative refinement process usually comes at a high computational cost. Another trend is to either estimate the 2D projected locations of sparse 3D semantic points from the CAD model [28, 25, 26], or to regress the 3D coordinates of the dense 2D pixels within object masks [22, 33], and then solve a perspective-n-point (PnP) problem to estimate object poses. This type of approach is more efficient; however, it is not always as reliable under occlusion. In order to achieve superior robustness to occlusion and real-time efficiency simultaneously, we develop a multi-view method which integrates sparse semantic keypoint detection into an object-level SLAM system. Instead of adopting traditional descriptors [3, 2], we opt to develop a CNN-based keypoint detector in order to leverage more global context to reason about the keypoint locations and distinguish their semantics. We show that an object SLAM system can effectively utilize the sparse set of semantic keypoints to optimize the poses in a bundle adjustment (BA) optimization with outlier rejection at the keypoint level.

Object-level SLAM.

Object-level SLAM typically builds upon single-view object pose estimators, improving the estimated poses’ robustness to occlusions and missing detections, as well as their global consistency, via multi-view optimization. SLAM++ [29] was notably the first work along this line, but their system only worked on depth images. There are also some works which model objects as a sparse set of 3D keypoints and use 2D keypoint detectors to estimate the correspondences, which are fused over time [23, 30]; however, none have considered symmetric objects. PoseRBPF [4], on the other hand, proposed a method to track objects over time with an autoencoder and particle filter to reason about the symmetry, but their system is only able to track one object at a time – limiting its applicability. CosyPose [14] presented a method to disambiguate pose estimates of symmetric objects from multiple views through object-level RANSAC, but their method is an offline SfM approach and not directly comparable to ours. Fu et al. [5] proposed a multi-hypothesis SLAM approach to estimate the pose of symmetric objects, which is optimized with a max-mixture model. In contrast, our approach only tracks one hypothesis, and is shown to have superior performance.

Keypoint uncertainty estimation.

A typical global optimization that uses the predicted object keypoints as measurements (i.e., PnP or multi-view graph optimization) requires a proper weighting of the residuals. Without any measure of certainty to accompany the keypoint measurements, this weight is typically set to identity or some manually-tuned value. Some works have retrieved a weight directly from the output of the keypoint network [25, 26] to be used in PnP as a scalar measure of certainty [25] or a Gaussian covariance matrix [26], while [30] adapted the Bayesian method of [10] to estimate a covariance matrix for the keypoints by sampling over a randomized batch. Although these methods have been shown to work in practice, none have shown that the predicted uncertainty actually bounds the true error of the prediction with respect to the ground truth.

Besides residual weighting, the uncertainty is especially useful for outlier rejection, since, assuming that the uncertainty is a Gaussian covariance matrix, the $\chi^2$ distribution can determine an outlier threshold more systematically than manual tuning. Inspired by a plethora of recent works (unrelated to keypoint prediction) on self-uncertainty prediction of networks [1, 12, 16, 36, 9, 38, 18], we design a maximum likelihood estimator (MLE) loss, which trains the network to predict keypoint locations accurately and to jointly predict the uncertainty to be tightly bound around the actual error of the prediction.

Figure 2: An overview of the proposed symmetry and uncertainty aware object SLAM pipeline.

3 The Proposed Method

Our multi-view 6DoF object pose estimation method is unified in an object SLAM framework, which jointly estimates object and camera poses – while accounting for the symmetry of detected objects and utilizing the uncertainty estimates from the network to robustify the system. A depiction of the full pipeline can be seen in Fig. 2. The pipeline involves two passes to deal with asymmetric and symmetric objects separately. In the first pass, the asymmetric objects are tracked from the 3D scene to estimate the camera pose. In the second pass, the estimated 3D keypoints for symmetric objects are projected into the current camera view and used as prior knowledge to help predict keypoints for these objects that are consistent with the 3D scene. The object SLAM system is primarily composed of two modules: front-end tracking using the keypoint network, and back-end global optimization to refine the object and camera pose estimates. As a result, the proposed system can operate on sequential inputs and estimate the current state in real time for the use of an operator or robot requiring object and camera poses in a feedback loop.

3.1 Keypoint Network

We develop a keypoint network that not only predicts the 2D keypoint coordinates but also their uncertainty. In addition, to make it able to provide consistent keypoint tracks for symmetric objects, the network optionally takes prior keypoint heatmap inputs that are expected to be somewhat noisy. The architecture of our keypoint network can be seen in Fig. 3. The backbone architecture of our keypoint network is the stacked hourglass network [20], which has been shown to be a good choice for object pose estimation [23, 25, 30]. Similar to the original [20] we choose a multi-channel keypoint parameterization due to its simplicity. With this formulation, each channel is responsible for predicting a single keypoint, and we can combine all of the keypoints for the dataset into one output tensor – allowing for a single network to be used for all of the objects.

Given the image and prior input cropped to a bounding box and resized to a static input resolution, the network predicts an $N\times H/d\times W/d$ tensor $p$, where $H\times W$ is the input resolution, $d$ is the downsampling ratio (4 in our experiments), and $N$ is the total number of keypoints for the dataset. From $p$, a set of $N$ 2D keypoints $\{\bm{u}_1,\bm{u}_2,\ldots,\bm{u}_N\}$ and $2\times 2$ covariance matrices $\{\bm{\Sigma}_1,\bm{\Sigma}_2,\ldots,\bm{\Sigma}_N\}$ are predicted. Every channel $p_i$ of $p$ is enforced to be a 2D probability mass by utilizing a spatial softmax. The predicted keypoint is taken as the expected value of the 2D coordinates over this probability mass, $\bm{u}_i=\sum_{u,v}p_i(u,v)\,[u~v]^\top$. Unlike the non-differentiable argmax operation, this allows us to use the keypoint coordinate directly in the loss function – which is important for our uncertainty estimation.

Keypoints with uncertainty.

Since the keypoint $\bm{u}_i$ is the expected value of the distribution of 2D coordinates with probability mass given by the values of $p_i$, it is straightforward to estimate an uncertainty measure as the covariance of this distribution via the second moment about the mean

\bm{\Sigma}_i = \sum_{u,v} p_i(u,v)\,\big([u~v]^\top - \bm{u}_i\big)\big([u~v]^\top - \bm{u}_i\big)^\top.   (1)
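For illustration, a minimal PyTorch sketch of this spatial soft-argmax and second-moment covariance (Eq. 1) might look as follows; the function name and tensor layout are our own assumptions rather than the released implementation:

```python
import torch

def keypoints_with_covariance(logits):
    """Soft-argmax keypoints and 2x2 covariances from raw heatmap logits.

    logits: (B, N, H, W) tensor of raw network outputs (one channel per keypoint).
    Returns u: (B, N, 2) expected (x, y) coordinates, Sigma: (B, N, 2, 2) covariances.
    """
    B, N, H, W = logits.shape
    # Spatial softmax: each channel becomes a 2D probability mass.
    p = torch.softmax(logits.view(B, N, H * W), dim=-1).view(B, N, H, W)

    # Pixel coordinate grids (x along width, y along height).
    ys, xs = torch.meshgrid(torch.arange(H, dtype=logits.dtype),
                            torch.arange(W, dtype=logits.dtype), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1)            # (H, W, 2)

    # Expected value of the 2D coordinates under p.
    u = torch.einsum("bnhw,hwc->bnc", p, coords)      # (B, N, 2)

    # Second moment about the mean gives the covariance (Eq. 1).
    diff = coords.view(1, 1, H, W, 2) - u.view(B, N, 1, 1, 2)
    Sigma = torch.einsum("bnhw,bnhwi,bnhwj->bnij", p, diff, diff)
    return u, Sigma
```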

However, without any particular criteria for the covariance, there is nothing to enforce that the uncertainty actually captures the true error of the prediction. To this end, we propose to use a Gaussian maximum-likelihood estimator (MLE) loss to jointly optimize the keypoint coordinates as well as the covariance:

L^{(i)}_{\mathrm{MLE}} = \big(\bm{u}^*_i - \bm{u}_i\big)^\top \bm{\Sigma}^{-1}_i \big(\bm{u}^*_i - \bm{u}_i\big) + \log|\bm{\Sigma}_i|,   (2)

where $\bm{u}^*_i$ is the ground-truth keypoint coordinate. From a high-level perspective, the first term enforces that the covariance bounds the true error of the prediction, while the second prevents it from becoming too large. This way, the network can predict its own uncertainty in the form of a Gaussian covariance matrix, which is trained to tightly bound the true error of the estimated keypoint.

While our network predicts a total of $N$ keypoints, only a subset of these, $\mathcal{K}(\ell)\subset\{1,2,\ldots,N\}$, are valid for a particular object $\ell$. Furthermore, considering a single image, only a subset of keypoints $\mathcal{B}\subseteq\mathcal{K}(\ell)$ lie within the bounding box for object $\ell$ (note that occluded keypoints are still predicted). However, during deployment, while $\mathcal{K}(\ell)$ is known from the object class and keypoint labeling, it may be impossible to know which keypoints lie within the detected bounding box. For this reason, we add another head onto the network to predict a sigmoid vector $\bm{m}\in[0,1]^N$, which is trained to estimate the ground-truth binary mask $\bm{m}^*\in\{0,1\}^N$, where $m^*_i=1$ if $i\in\mathcal{B}$ and 0 otherwise (see Fig. 3 for the architecture). Thus, for a single object in a single image, the full loss becomes

L_{\mathrm{tot}} = \mathrm{BCE}(\bm{m},\bm{m}^*) + \frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}} L^{(i)}_{\mathrm{MLE}},   (3)

where $\mathrm{BCE}(\cdot)$ is the binary cross-entropy loss function. For the rest of the paper, to simplify notation, we will denote $k\in\{1,2,\ldots,K\}$ as the indices of keypoints which pass the ground-truth mask $\bm{m}^*$ during training (i.e., the next section) or the estimated mask $\bm{m}$ (as well as the known $\mathcal{K}(\ell)$) during deployment in the SLAM system (Sec. 3.2).
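A compact sketch of the total loss of Eq. (3), with the MLE term of Eq. (2), is given below; the tensor shapes are assumptions, and a small regularizer is added to the covariance purely for numerical stability:

```python
import torch
import torch.nn.functional as F

def total_loss(u, Sigma, m_logits, u_gt, m_gt, eps=1e-6):
    """Sketch of Eq. (3): BCE on the keypoint mask plus the Gaussian MLE loss
    (Eq. (2)) averaged over the keypoints inside the bounding box.

    u:        (N, 2) predicted keypoints for one object crop
    Sigma:    (N, 2, 2) predicted covariances
    m_logits: (N,) raw logits for the keypoint mask (sigmoid applied inside BCE)
    u_gt:     (N, 2) ground-truth keypoints
    m_gt:     (N,) ground-truth binary mask (1 if the keypoint lies in the box)
    """
    mask_loss = F.binary_cross_entropy_with_logits(m_logits, m_gt.float())

    in_box = m_gt.bool()
    r = (u_gt - u)[in_box].unsqueeze(-1)                      # (K, 2, 1) residuals
    S = Sigma[in_box] + eps * torch.eye(2, device=u.device)   # regularize inversion
    maha = (r.transpose(-1, -2) @ torch.linalg.inv(S) @ r).squeeze(-1).squeeze(-1)
    mle = maha + torch.logdet(S)                              # Eq. (2) per keypoint
    return mask_loss + mle.mean()
```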

Figure 3: The overall architecture of our keypoint network. The network input is augmented to include $N$ additional channels for the prior keypoint inputs. When no prior is available, these channels are filled with zeros. The network outputs an $N$-channel feature map corresponding to the raw logits, from which a spatial softmax head predicts keypoints $\bm{u}_i$ and uncertainty $\bm{\Sigma}_i$, while an average-pool head predicts the keypoint mask $\bm{m}$.

Keypoints for symmetric objects.

Since we want to efficiently track the keypoints over time during deployment, it is convenient to obtain keypoint predictions whose symmetry hypothesis is consistent with the 3D scene. Inspired by [19], we opt to include $N$ extra channels as input to the keypoint network, which contain a prior detection of the object’s keypoints. As shown in Fig. 2, during deployment in the SLAM system, the prior keypoint detections come from projecting the 3D keypoints from the global object frame into the current image once the corresponding camera pose is found (i.e., the second pass). With this paradigm, there are two main issues to address: how to create training examples of the prior detections (since the SLAM system is not run during training), and how to detect the initial keypoints on symmetric objects when there is not yet an object pose estimate available. Here we describe the training scheme used to address these issues.

To create the training prior, we simulate a noisy prior detection like the one the SLAM system would create, by projecting the 3D keypoints from the object frame into the image plane with a perturbed ground-truth object pose $\delta\mathbf{T}\,{}^{C}_{O}\mathbf{T}^{*}$ (see the supplementary Sec. A about the notation). To further ensure that the network can learn to follow the prior detections for the symmetry hypothesis, we utilize the set of symmetry transforms $\mathcal{S}=\{{}^{O}_{S_1}\mathbf{T},{}^{O}_{S_2}\mathbf{T},\ldots,{}^{O}_{S_M}\mathbf{T}\}$ that we expect to be available for each object (discretized for objects with continuous axes of symmetry). Each ${}^{O}_{S_m}\mathbf{T}\in\mathcal{S}$, when applied to the object CAD model, makes the rendering look (nearly) exactly the same, and in practice these transforms can be chosen manually fairly easily. Thus, when constructing a training example with a prior detection, we pick a random symmetry transform and apply it to the ground-truth object pose before doing the projection.
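To make this concrete, the following is a rough sketch of how such a simulated prior could be constructed; the noise magnitudes, heatmap width, and helper names are illustrative assumptions, not values from the paper:

```python
import numpy as np

def simulate_prior_heatmaps(kps_obj, T_co_gt, symmetries, K, H, W, valid_idx,
                            sigma_px=4.0, rot_noise=0.05, trans_noise=0.01):
    """Simulate the noisy prior keypoint heatmaps used during training (a sketch).

    kps_obj:    (N, 3) 3D keypoints in the object frame
    T_co_gt:    (4, 4) ground-truth object-to-camera transform
    symmetries: list of (4, 4) symmetry transforms S_m (object frame)
    K:          (3, 3) camera intrinsics for the crop
    valid_idx:  indices of keypoints valid for this object class
    Returns an (N, H, W) array of prior heatmaps (zeros for unused channels).
    """
    N = kps_obj.shape[0]
    prior = np.zeros((N, H, W), dtype=np.float32)

    # Random symmetry transform so the network learns to follow the prior's hypothesis.
    T_sym = symmetries[np.random.randint(len(symmetries))]
    # Small random pose perturbation to mimic SLAM estimation error (Rodrigues formula).
    dT = np.eye(4)
    dT[:3, 3] = np.random.randn(3) * trans_noise
    angle_axis = np.random.randn(3) * rot_noise
    theta = np.linalg.norm(angle_axis) + 1e-12
    k = angle_axis / theta
    Kx = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    dT[:3, :3] = np.eye(3) + np.sin(theta) * Kx + (1 - np.cos(theta)) * Kx @ Kx

    T = dT @ T_co_gt @ T_sym                     # perturbed, symmetry-rotated pose
    ys, xs = np.mgrid[0:H, 0:W]
    for i in valid_idx:
        p_c = T[:3, :3] @ kps_obj[i] + T[:3, 3]  # keypoint in the camera frame
        if p_c[2] <= 0:
            continue
        u = K @ (p_c / p_c[2])                   # perspective projection
        prior[i] = np.exp(-((xs - u[0])**2 + (ys - u[1])**2) / (2 * sigma_px**2))
    return prior
```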

In order for the network to learn to predict initial keypoints on symmetric objects (when no prior is available), we only provide this simulated prior randomly for roughly half of the training examples. Without the prior detection, however, the network is left to its own devices to reason about the absolute orientation of the object – which is theoretically impossible for symmetric objects without special care. As opposed to the mirroring technique and additional symmetry classifier proposed by [28], we instead teach the network to deal with this issue with a simple criterion: choose the keypoints that correspond to the symmetrically-valid pose closest to a canonical view, where the front of the object faces the camera and the top of the object faces the top of the image. We refer the reader to the supplementary material (Sec. B) for more details on this procedure.

3.2 Object SLAM System

Our symmetry and uncertainty-aware object SLAM system is comprised of two modules: the front-end tracking, and the back-end global optimization. The front end is responsible for processing the incoming frames – running the keypoint network, estimating the current camera pose, and initializing new objects – while the back end is responsible for refining the camera and object poses for the whole scene. We refer the reader again to Fig. 2 for a visual representation of our system.

Front-end tracking.

The first step of our front end is to split the bounding boxes detected in the current image into two information streams – the first for asymmetric objects and first-time detections of symmetric ones, and the second for symmetric objects that already have 3D estimates. Again, we expect the symmetry information (i.e., symmetric or not) to be included with each object class. The first information stream sends the images, cropped at the bounding boxes, to the keypoint network without any prior to detect keypoints and uncertainty. These keypoints are then used to estimate the pose ${}^{C}_{O}\mathbf{T}_{\mathrm{pnp}}$ of each asymmetric object in the current camera frame by using PnP with RANSAC. These PnP poses are then used to coarsely estimate the current camera pose and to initialize objects which do not yet have 3D estimates. See the supplementary material Sec. C for more details on how this is done, as well as on the more detailed behavior of the front end.

With a rough estimate of the current camera pose, we move on to the second information stream of the front end. We use the coarse camera pose estimate to create the prior detections for the keypoints of symmetric objects by projecting the 3D keypoints for these objects into the current image and constructing the prior keypoint heatmaps for network input. After running the keypoint network on these symmetric objects, we store the keypoint measurements from both information streams for later use in the global optimization.

Back-end global optimization.

The global optimization step runs periodically to refine the whole scene (object and camera poses) based on the measurements from each image. Rather than reducing the problem to a pose graph (i.e., using relative pose measurements from PnP), we keep the original noise model of using the keypoint detections as measurements, which allows us to weight each residual with the covariance prediction from the network. The global optimization problem is formulated by creating residuals that constrain the pose ${}^{C_j}_{G}\mathbf{T}$ of image $j$ and the pose ${}^{G}_{O_\ell}\mathbf{T}$ of object $\ell$ with the $k$th keypoint

\bm{r}_{j,\ell,k} = \bm{u}_{j,\ell,k} - \Pi_{j,\ell}\left({}^{C_j}_{G}\mathbf{T}\;{}^{G}_{O_\ell}\mathbf{T}\;{}^{O_\ell}\bar{\bm{p}}_k\right),   (4)

where $\Pi_{j,\ell}$ is the perspective projection function for the bounding box of object $\ell$ in image $j$. Thus the full problem becomes minimizing the cost over the entire scene

C = \sum_{j,\ell,k} s_{j,\ell,k}\,\rho_H\left(\bm{r}_{j,\ell,k}^\top\,\bm{\Sigma}_{j,\ell,k}^{-1}\,\bm{r}_{j,\ell,k}\right),   (5)

where $\bm{\Sigma}_{j,\ell,k}$ is the $2\times 2$ covariance matrix predicted by the network for the keypoint $\bm{u}_{j,\ell,k}$, $s_{j,\ell,k}\in\{0,1\}$ is a constant indicator that is 1 if the measurement was deemed an inlier before the optimization started and 0 otherwise, and $\rho_H$ is the Huber norm, which reduces the effect of outliers during the optimization steps. Both $\rho_H$ and $s_{j,\ell,k}$ use the same outlier threshold $\tau$, which is derived from the 2-dimensional $\chi^2$ distribution and is always set to the 95% confidence threshold $\tau=5.991$. Thus we do not need to manually tune the outlier threshold as long as the covariance matrix $\bm{\Sigma}_{j,\ell,k}$ properly captures the true error of keypoint $\bm{u}_{j,\ell,k}$.
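The following sketch evaluates a single residual of Eq. (4) together with the $\chi^2$ gating and Huber robust cost of Eq. (5); in the actual system these terms are built as g2o edges, so this is only an illustrative stand-alone version with assumed argument names:

```python
import numpy as np

CHI2_2DOF_95 = 5.991  # 95% threshold of the 2-DoF chi-square distribution

def keypoint_residual_cost(u_meas, Sigma, T_cg, T_go, p_obj, K,
                           delta=np.sqrt(CHI2_2DOF_95)):
    """Evaluate one residual of Eq. (4) and its robust cost from Eq. (5).

    u_meas: (2,) measured keypoint, Sigma: (2, 2) its predicted covariance
    T_cg:   (4, 4) global-to-camera transform for image j
    T_go:   (4, 4) object-to-global transform for object l
    p_obj:  (3,) keypoint position in the object frame
    K:      (3, 3) intrinsics of the crop (stands in for the projection Pi_{j,l})
    """
    p_cam = (T_cg @ T_go @ np.append(p_obj, 1.0))[:3]
    u_proj = (K @ (p_cam / p_cam[2]))[:2]
    r = u_meas - u_proj
    maha = float(r @ np.linalg.inv(Sigma) @ r)        # squared Mahalanobis distance
    inlier = maha < CHI2_2DOF_95                      # the indicator s_{j,l,k}
    # Huber norm on the squared Mahalanobis distance with the same threshold tau.
    cost = maha if maha <= delta**2 else 2 * delta * np.sqrt(maha) - delta**2
    return (cost if inlier else 0.0), inlier
```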

4 Experiments

Our experiments are conducted on two of the most challenging object pose estimation datasets: the YCB-Video dataset [35] and the T-LESS dataset [6]. Both datasets provide ground truth poses for symmetric and asymmetric objects in cluttered environments over multiple keyframe sequences. YCB-Video contains 21 household objects, including 4 objects with discrete symmetries and one object (the bowl) with a continuous axis of symmetry. The T-LESS dataset contains 30 industry-relevant objects with very little texture, and most are symmetric. Note that the symmetry information of each object is provided by [7].

4.1 Implementation Details

Choice of keypoints.

While our design is agnostic to the choice of keypoint, to reduce the number of channels that the network needs to predict, we created a set of rules to annotate keypoints manually in such a way that each keypoint can be applied to multiple object instances, and the same rules can be applied to both the YCB-Video and T-LESS datasets. We manually label the 3D CAD models for both datasets, and project the keypoints from 3D to 2D to create the ground-truth keypoints described in Sec. 3.1. We refer the reader to the supplementary material Sec. D for more details on how we annotated the keypoints.

Training procedure.

We implemented the keypoint network in PyTorch [24]. For all training, we used the Adam optimizer [11] with a learning rate of $10^{-3}$. For the YCB-Video dataset, we utilized the real training data provided along with the official 80k synthetic images. Due to the high redundancy in the real training data, we used only every 5th image. We trained on this dataset for 60 epochs using a batch size of 24, with randomized backgrounds for the synthetic data as well as randomized bounding boxes, color, and image warping. For the T-LESS dataset, the only real training images are of single objects on a dark background, so for the synthetic data we opted to use the physics-based rendering (pbr) data provided by [8]. For both the real and pbr splits we augment the examples with randomized backgrounds, bounding boxes, color, and warping, as well as randomly pasted objects for the real data only – since it only contains images of isolated objects. We trained the T-LESS model for 89 epochs with a batch size of 8, which was smaller than that for YCB-Video due to the higher image resolution of the pbr data.

SLAM system.

Our SLAM system is implemented in Python. The GPU is only used for network inference, while all other operations are performed on the CPU. All optimizations are implemented using Python wrappers for the g2o library [13] (https://github.com/uoip/g2opy), besides PnP, which is done using the Lambda Twist solver [27] with RANSAC (https://github.com/midjji/pnp). Our front-end tracking works on every incoming frame, while the back end runs every 10th frame. Note that the testing sequences for both datasets are already provided as keyframes, so no keyframing procedure is needed. While for actual deployment it is ideal to run the back-end graph optimization on a separate worker thread, this would make reproducing the exact results impossible due to randomness in the operating system’s allocation of resources between the two threads. In order to make the results reproducible, we simply execute both the front end and back end on the main thread for evaluation. Our front-end tracking can typically run at 11Hz on our desktop with a GTX 1080Ti graphics card, and the back end can run at an average speed of 2Hz.

4.2 YCB-Video Dataset

Table 1: Results on the YCB-Video dataset. Data means what synthetic data was used in addition to the real data, and U.M. (unified model) is checked if only one model was trained for all objects instead of one model trained for each object separately. Bold is best, underlined is second best.
Method Data U.M. ADD-S ADD(-S)
PoseCNN [35] syn 75.3 61.3
DeepIM [15] syn 88.1 81.9
PoseRBPF [4] syn 76.3 64.4
MHPE [5] syn 82.9 69.7
CosyPose [14] pbr 89.8 84.5
GDR-Net [33] pbr 89.1 80.2
GDR-Net [33] pbr 91.6 84.4
Ours syn 90.3 84.7
no prior det syn 88.7 83.3
manual cov syn 59.1 46.1
no MLE loss syn 47.0 35.2
single view syn 65.7 56.9
Figure 4: Qualitative results on YCB-Video. From left to right, the columns show the detected object boxes with prior input to the keypoint network, the predicted keypoints with uncertainty ellipses, and the 3D model projection on the image based on the predicted 6DoF object poses and camera pose. Top: the uncertainty ellipses tend to be smaller for visible keypoints on textured surfaces or corner points, while appearing larger for occluded keypoints and keypoints on smooth surfaces (like the clamp). Center: our system is able to consistently track the keypoints throughout the scene despite the presence of symmetric objects. Bottom: the network trained with a fixed-variance loss predicts uncertainty ellipses that are visibly too small – leading to unreliable outlier rejection and object poses.

For the YCB-Video dataset, we compare to the single-view methods [35, 15, 14, 33] and SLAM methods [4, 5]. Note that we do not include the multi-view results of CosyPose [14] since it is an offline SfM method that is not comparable to real-time SLAM methods. Following [35, 15, 14, 33, 4, 5], we report the area under curve (AUC) of the ADD-S and ADD(-S) by varying the accuracy threshold from 0 to 10cm, which is calculated for each object separately and then averaged. To fairly compare the methods, we used the same bounding boxes as PoseCNN. In practice, the bounding boxes can come from any real-time bounding box detector. The benchmark results as well as several ablation studies are reported in Table 1, with our method labeled as “Ours”. Methods in Table 1 are marked as using standard synthetic data (syn) with randomly-placed objects or physics-based rendering (pbr) training data in addition to the real data. Note that while the pbr data is generally considered superior to the randomly-placed objects [8], it is not a part of the official YCB-Video training splits. Regardless, while using only one network for all objects, our method beats all of the state-of-the-art single-view and SLAM methods in terms of the AUC of ADD(-S) metric – even those utilizing the pbr data. The AUC of ADD(-S) is the most important metric here, since it takes into account the actual object symmetries rather than just shape matching as ADD-S does. This shows that our system can provide highly accurate, globally-consistent poses for symmetric objects, while still maintaining high accuracy on the texture-asymmetric objects. Qualitative results can be seen in Fig. 4. More detailed per-object results can be found in the supplementary material Sec. E.
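For reference, a simplified sketch of the ADD-S error and its AUC over thresholds from 0 to 10cm, following the commonly used definitions (this is not the official evaluation toolkit):

```python
import numpy as np

def add_s_error(pts_model, R_est, t_est, R_gt, t_gt):
    """ADD-S: average distance from each ground-truth-posed model point to the
    closest point of the estimated-posed model (closest-point matching)."""
    est = pts_model @ R_est.T + t_est                      # (M, 3)
    gt = pts_model @ R_gt.T + t_gt                         # (M, 3)
    d = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=-1)
    return float(d.min(axis=1).mean())

def auc(errors, max_thresh=0.10):
    """Area under the accuracy-vs-threshold curve for thresholds in [0, 10cm],
    normalized to [0, 1]; errors are in meters."""
    errors = np.sort(np.asarray(errors))
    n, area, prev_t, acc = len(errors), 0.0, 0.0, 0.0
    for i, e in enumerate(errors):
        t = min(e, max_thresh)
        area += acc * (t - prev_t)      # accuracy is piecewise constant in the threshold
        prev_t, acc = t, (i + 1) / n
        if e >= max_thresh:
            break
    area += acc * (max_thresh - prev_t)
    return area / max_thresh
```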

Effect of prior detection.

The first ablation study is to run our same system without the prior detection. The results drop slightly, but this is expected on this dataset where only 5 out of 21 objects are considered symmetric, and only the bowl displays a continuous rotational symmetry. In the next section, we will see that the prior detection actually makes a much bigger difference on the T-LESS dataset, where most of the objects are symmetric, and the camera rotates many times completely around the scene – whereas the camera motion in YCB-Video is much simpler.

Manual covariance weight.

For the next ablation in Table 1, “manual cov”, we manually tune a weight to replace the covariance in the SLAM system’s residuals and outlier rejection mechanism. Here, we found that the weight corresponding to $2\times$ the average predicted standard deviation of the network (which was about 2.5 pixels) achieved the best scores. As observed, the results dropped significantly compared to using the network-predicted covariance.

Effect of MLE loss.

For the ablation labeled “no MLE loss”, we trained a network with the same procedure, but replaced the MLE loss with a fixed-variance loss with variance regularization, similar to that used in popular human pose estimation work [21]. As observed, when this network is placed in the SLAM system, the results are significantly lower than those with our network trained with the MLE loss. The qualitative results of this experiment are also shown in Fig. 4.

Beyond the accuracy of the SLAM system with this change, we have also tested the accuracy of the predicted covariance itself. To do so, we ran both networks (with and without the MLE loss) on a separate set of simulated YCB-Video objects (the pbr data, which was not used in training), which has perfect ground truth for the keypoints. Here, we ran the networks with the ground-truth bounding boxes and no prior detection. To evaluate the accuracy of the predicted covariance, we plotted the keypoint error against the predicted standard deviation of the network. Ideally, the error will always lie within the cone $e_r<3\sigma$, where $e_r$ is the scalar $x$ or $y$ component of the error residual of the keypoint prediction. The results of this experiment can be viewed in Fig. 5. As observed, the network trained with the MLE loss has many more of its errors within the $3\sigma$ cone. In fact, 91.0% of the data points on the left in Fig. 5 pass a 99% confidence $\chi^2$ test, while only 7.1% pass from the points on the right. This shows that the predicted uncertainty describes the actual error distribution well (besides some expected outliers due to heavy occlusion and symmetry), and that including the MLE loss is crucial to achieve this.
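The $\chi^2$ consistency check described above can be sketched as follows (the function name and the SciPy dependency are our own choices):

```python
import numpy as np
from scipy.stats import chi2

def covariance_calibration(errors, Sigmas, confidence=0.99):
    """Fraction of keypoint errors consistent with the predicted covariances
    under a chi-square test.

    errors: (M, 2) keypoint errors against ground truth
    Sigmas: (M, 2, 2) predicted covariances
    """
    thresh = chi2.ppf(confidence, df=2)         # ~9.21 for 99% confidence, 2 DoF
    maha = np.einsum("mi,mij,mj->m", errors, np.linalg.inv(Sigmas), errors)
    return float(np.mean(maha < thresh))
```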

Figure 5: The plot of the error of the predicted keypoints against the standard deviation predicted by the network over a separate set of rendered YCB-Video objects. The $3\sigma$ bounds are shown as the cone drawn with red dotted lines. Left: The result of our network trained with the MLE loss. Right: The result of the same network trained with a typical fixed-variance loss instead, which has far fewer points within the $3\sigma$ cone.

Comparing to single view.

For the final ablation in Table 1, we ran just our single view network and compared the accuracy. Specifically, for each view we just ran PnP and refined it using the same procedure as Eq. 5, but with only one fixed camera pose per optimization. Clearly the full SLAM system is more accurate. It is interesting to note that the results for single view are actually more accurate than the SLAM results using the manual covariance or the fixed-variance network. This is most likely due to the fact that incorrect covariance in our SLAM system can cause the outlier rejection mechanism to be unreliable, and outliers can then pull the object pose in an incorrect direction and hurt the accuracy for all views despite the fact that most of the keypoints are correct.

Accuracy of camera poses.

The effect of initializing the camera poses with the poses provided by the dataset was minor in this experiment. Using the given camera poses the system achieved a 90.5 AUC of ADD-S score, while the system with the estimated camera poses scored the 90.3 shown in Table 1. This shows that the estimated camera poses are very accurate on this dataset.

4.3 T-LESS Dataset

Table 2: Benchmark Results on the T-LESS dataset.
Method Data U.M. $e_{\mathrm{vsd}}<0.3$
Implicit [31] syn 26.8
Pix2Pose [22] syn 29.5
PoseRBPF [4] syn 41.7
CosyPose [14] pbr 63.8
Ours pbr 63.7
real only N/A 45.9
no prior det pbr 16.2
manual cov pbr 13.8
single view pbr 48.1
Figure 6: Qualitative results on T-LESS. Top: Under misalignment between the prior detection and the objects (left column), the network still predicts keypoints accurately (center column), using the prior only as a general guide for the symmetry. Center: the system displays robustness to missing and bad bounding boxes. Bottom: the same system, but without the prior detection, fails to track keypoints corresponding to the same 3D locations once they rotate to the back side of the symmetric objects, which causes the estimated object poses to fly away. Note that the predicted covariance was used in all of these images, but is left out of the visualization for clarity. Best viewed in color.

For the T-LESS dataset, we compare to two single-view baselines [31, 22] as well as, again, PoseRBPF [4] and CosyPose [14]. To fairly compare to the other methods, we use the same RetinaNet bounding boxes as [22], taking the top-scoring bounding box for each object. We use the standard visual surface discrepancy (vsd) recall metric, $e_{\mathrm{vsd}}<0.3$ [7], that the other methods reported. Since the T-LESS dataset has multiple scenes that contain only symmetric objects, and our system requires asymmetric objects to estimate a camera pose, we initialize our camera poses with the poses provided by the dataset. While this is a potential drawback of our system, typical deployment scenarios will contain asymmetric objects or allow for retrieving external odometry from another source, such as an additional IMU sensor or traditional feature-based SLAM.

The benchmark results and ablation studies are reported in Table 2, where our system is shown to achieve a 63.7 recall score – second best to the 63.8 of CosyPose. However, it is interesting to note that CosyPose is an iterative refinement method that utilizes initial object poses rendered at 1m from the camera, which is close to the distance of all the objects, while our method makes no such assumption. Qualitative results can be also seen in Fig. 6.

Effect of training data.

To test the sensitivity to the training data, we train the network on only the small real training split, which contains 1,231 images of each object on a dark background. From Table 2 we observe that, even with this small amount of data, we still beat all of the state-of-the-art methods besides CosyPose – all of which used large amounts of synthetic data on top of the real data. This shows the ability of our method to work with a limited amount of data which does not even cover all orientations of the objects.

Effect of prior detection.

On the T-LESS dataset, where most of the objects are symmetric in some way, the 63.7 recall score drops to 16.2 in Table 2 when the prior detection is removed. This shows that the prior detection is crucial for tackling these challenging T-LESS objects when the camera is orbiting around their axes of symmetry multiple times. Without the prior detection, the SLAM system’s outlier rejection simply rejects most of the keypoint measurements on the symmetric objects, as they do not correspond to the same 3D location. Fig. 6 also includes some qualitative results of this experiment.

Manual covariance weight.

Here again we set the covariance in the SLAM system’s residuals to a manually-tuned weight. The result in this case drops to a 13.8 recall, which further substantiates the usefulness of our covariance estimate in the SLAM system. Furthermore, we found that the optimal weight for this dataset was much larger than that for YCB-Video, which is not surprising, but shows that removing the need to manually tune weights by using the predicted covariance is a useful property of our system.

Comparing to single view.

In this case, the single view result in Table 2 outperformed that from the SLAM system when it either used a manual covariance weight or no prior detections. Since the single-view results use no prior detection, this shows that the keypoints considered independently for each view are reasonable, while the prior detection is crucial for tracking them across time.

5 Conclusions and Future Work

In this work, we have designed a keypoint-based object-level SLAM system that provides globally consistent 6DoF pose estimates for objects with or without symmetry. Our method can track semantic keypoints on symmetric objects consistently with the aid of the proposed prior detection, and the uncertainty that our network predicts has been shown to capture the true error of the predicted keypoints as well as greatly improve the object pose accuracy. In the future, we would like to adapt our system to larger environments and generalize to class-level keypoint prediction with unseen instances.

Acknowledgement.

We would like to thank the reviewers for their constructive feedback. This work was partially supported by the University of Delaware College of Engineering, the NSF (IIS-1924897), the ARL (W911NF-19-2-0226, W911NF-20-2-0098), Bosch Research North America, and the Technical University of Munich.

References

  • [1] Michael Bloesch, Jan Czarnowski, Ronald Clark, Stefan Leutenegger, and Andrew J. Davison. Codeslam – learning a compact, optimisable representation for dense visual slam. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [2] Alvaro Collet, Manuel Martinez, and Siddhartha Srinivasa. The moped framework: Object recognition and pose estimation for manipulation. The International Journal of Robotics Research, 30:1284–1306, 09 2011.
  • [3] Alvaro Collet and Siddhartha S. Srinivasa. Efficient multi-view object recognition and full pose estimation. In IEEE International Conference on Robotics and Automation, ICRA 2010, Anchorage, Alaska, USA, 3-7 May 2010, 2010.
  • [4] Xinke Deng, Arsalan Mousavian, Yu Xiang, Fei Xia, Timothy Bretl, and Dieter Fox. Poserbpf: A rao-blackwellized particle filter for 6d object pose tracking. In Robotics: Science and Systems (RSS), 2019.
  • [5] Jiahui Fu, Qiangqiang Huang, Kevin Doherty, Yue Wang, and John J. Leonard. A multi-hypothesis approach to pose ambiguity in object-based slam. In International Conference on Intelligent Robots and Systems (IROS), 2021.
  • [6] Tomáš Hodaň, Pavel Haluza, Štěpán Obdržálek, Jiří Matas, Manolis Lourakis, and Xenophon Zabulis. T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects. IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.
  • [7] Tomáš Hodaň, Frank Michel, Eric Brachmann, Wadim Kehl, Anders Glent Buch, Dirk Kraft, Bertram Drost, Joel Vidal, Stephan Ihrke, Xenophon Zabulis, Caner Sahin, Fabian Manhardt, Federico Tombari, Tae-Kyun Kim, Jiří Matas, and Carsten Rother. BOP: Benchmark for 6D object pose estimation. European Conference on Computer Vision (ECCV), 2018.
  • [8] Tomáš Hodaň, Martin Sundermeyer, Bertram Drost, Yann Labbé, Eric Brachmann, Frank Michel, Carsten Rother, and Jiří Matas. BOP challenge 2020 on 6D object localization. European Conference on Computer Vision Workshops (ECCVW), 2020.
  • [9] Tong Ke, Tien Do, Khiem Vuong, Kourosh Sartipi, and Stergios I. Roumeliotis. Deep multi-view depth estimation with predicted uncertainty, 2021.
  • [10] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding, 2016.
  • [11] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR, 2015.
  • [12] Maria Klodt and Andrea Vedaldi. Supervising the new with the old: learning sfm from sfm. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
  • [13] Rainer Kuemmerle, Giorgio Grisetti, Hauke Strasdat, Kurt Konolige, and Wolfram Burgard. g2o: A general framework for graph optimization. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 3607–3613, Shanghai, China, May 2011.
  • [14] Yann Labbé, Justin Carpentier, Mathieu Aubry, and Josef Sivic. Cosypose: Consistent multi-view multi-object 6d pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • [15] Yi Li, Gu Wang, Xiangyang Ji, Yu Xiang, and Dieter Fox. Deepim: Deep iterative matching for 6d pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [16] Wenxin Liu, David Caruso, Eddy Ilg, Jing Dong, Anastasios I. Mourikis, Kostas Daniilidis, Vijay Kumar, and Jakob Engel. Tlio: Tight learned inertial odometry. IEEE Robotics and Automation Letters, pages 5653–5660, Oct 2020.
  • [17] David G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision, Kerkyra, Corfu, Greece, September 20-25, 1999, 1999.
  • [18] Hidenobu Matsuki, Raluca Scona, Jan Czarnowski, and Andrew J. Davison. Codemapping: Real-time dense mapping for sparse slam using compact scene representations. In IEEE Robotics and Automation Letters (RA-L), 2021.
  • [19] Oliver Moolan-Feroze, Konstantinos Karachalios, Dimitrios N. Nikolaidis, and Andrew Calway. Improving drone localisation around wind turbines using monocular model-based tracking, 2019.
  • [20] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Proceedings of the European Conference on Computer Vision (ECCV), pages 483–499, 2016.
  • [21] Aiden Nibali, Zhen He, Stuart Morgan, and Luke Prendergast. Numerical coordinate regression with convolutional neural networks, 2018.
  • [22] Kiru Park, Timothy Patten, and Markus Vincze. Pix2pose: Pixel-wise coordinate regression of objects for 6d pose estimation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2019.
  • [23] Parv Parkhiya, Rishabh Khawad, J. Krishna Murthy, Brojeshwar Bhowmick, and K. Madhava Krishna. Constructing category-specific models for monocular object-slam. In International Conference on Robotics and Automation (ICRA), 2018.
  • [24] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • [25] Georgios Pavlakos, Xiaowei Zhou, Aaron Chan, Konstantinos G Derpanis, and Kostas Daniilidis. 6-dof object pose from semantic keypoints. In International Conference on Robotics and Automation (ICRA), 2017.
  • [26] Sida Peng, Yuan Liu, Qixing Huang, Xiaowei Zhou, and Hujun Bao. Pvnet: Pixel-wise voting network for 6dof pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [27] Mikael Persson and Klas Nordberg. Lambda twist: An accurate fast robust perspective three point (p3p) solver. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
  • [28] Mahdi Rad and Vincent Lepetit. Bb8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. 2017 IEEE International Conference on Computer Vision (ICCV), pages 3848–3856, 2017.
  • [29] Renato F. Salas-Moreno, Richard A. Newcombe, Hauke Strasdat, Paul H.J. Kelly, and Andrew J. Davison. Slam++: Simultaneous localisation and mapping at the level of objects. In 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1352 – 1359, 2013.
  • [30] Mo Shan, Qiaojun Feng, and Nikolay A. Atanasov. Orcvio: Object residual constrained visual-inertial odometry. 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5104 – 5111, 2020.
  • [31] Martin Sundermeyer, Zoltan-Csaba Marton, Maximilian Durner, Manuel Brucker, and Rudolph Triebel. Implicit 3d orientation learning for 6d object detection from rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
  • [32] Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martin Martin, Cewu Lu, Li Fei-Fei, and Silvio Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 2019.
  • [33] Gu Wang, Fabian Manhardt, Federico Tombari, and Xiangyang Ji. GDR-Net: Geometry-guided direct regression network for monocular 6d object pose estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16611–16621, June 2021.
  • [34] Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond pascal: A benchmark for 3d object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision, pages 75 – 82, 2014.
  • [35] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. In Robotics: Science and Systems (RSS), 2018.
  • [36] Nan Yang, Lukas von Stumberg, Rui Wang, and Daniel Cremers. D3vo: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [37] Sergey Zakharov, Ivan Shugurov, and Slobodan Ilic. Dpod: 6d pose object detector and refiner. In International Conference on Computer Vision (ICCV), October 2019.
  • [38] Xingxing Zuo, Nathaniel Merrill, Wei Li, Yong Liu, Marc Pollefeys, and Guoquan Huang. Codevio: Visual-inertial odometry with learned optimizable dense depth. In Proc. of the IEEE International Conference on Robotics and Automation, Xi’an, China, 2021.

Supplementary Material

Appendix A Rigid Body Transform Notation

Throughout the paper, we have regularly included rigid-body transforms in many equations. Here, we briefly explain the notation. A 6DoF rigid-body transform ${}^{B}_{A}\mathbf{T}\in SE(3)$ transforms a point defined in the reference frame $\{A\}$ into the reference frame $\{B\}$. We write this in two possible ways. For the first, and most common, we separate ${}^{B}_{A}\mathbf{T}$ into its rotational and positional components, ${}^{B}_{A}\mathbf{R}\in SO(3)$ and ${}^{B}\bm{p}_{A}\in\mathbb{R}^3$ respectively. In this form we write ${}^{B}\bm{p}_k={}^{B}_{A}\mathbf{R}\,{}^{A}\bm{p}_k+{}^{B}\bm{p}_{A}$ to transform the 3D point ${}^{A}\bm{p}_k$ from the $\{A\}$ frame into the $\{B\}$ frame.

In the other form, which shows up in Eq. 4, we leave the transform in its full $4\times 4$ $SE(3)$ form, and use the homogeneous form of translation vectors ${}^{A}\bar{\bm{p}}_k=[{}^{A}\bm{p}_k^\top~1]^\top$. In this way, we write ${}^{B}\bar{\bm{p}}_k={}^{B}_{A}\mathbf{T}\,{}^{A}\bar{\bm{p}}_k$. This form specifically allows us to chain together multiple transformations with simplified notation, for example: ${}^{B}\bar{\bm{p}}_k={}^{B}_{A_2}\mathbf{T}\,{}^{A_2}_{A_1}\mathbf{T}\,{}^{A_1}_{A}\mathbf{T}\,{}^{A}\bar{\bm{p}}_k$.
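A tiny example of this homogeneous chaining in code, assuming the transforms are stored as 4×4 NumPy arrays:

```python
import numpy as np

def transform_point(T_BA, p_A):
    """Apply {}^B_A T to a 3D point expressed in frame {A}."""
    p_bar_A = np.append(p_A, 1.0)      # homogeneous form [p, 1]
    return (T_BA @ p_bar_A)[:3]

# Chaining, e.g. object frame -> global frame -> camera frame (as in Eq. 4):
# p_C = transform_point(T_CG @ T_GO, p_O)
```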

Appendix B Choice of Symmetry Without Prior

As mentioned in Section 3.1, as opposed to the mirroring technique and additional symmetry classifier proposed by [28], we need to teach the network to predict the initial keypoints of symmetric objects before the prior is available. We opt to utilize the set of symmetry transforms to solve this issue in a more concise manner, with a simple intuition: when the prior detection is not available for a symmetric object, we simply instruct the network to choose the orientation which brings the object pose closest to a canonical pose, where the front of the object faces the camera and the top of the object faces the top of the image. This intuition is learned by the network during training by choosing the symmetry for the keypoint labels that brings the 3D keypoints closest (in orientation) to those transformed into the canonical view $\{O_c\}$ in the camera frame:

{}^{O}_{S}\mathbf{T} = \underset{{}^{O}_{S_m}\mathbf{T}\in\mathcal{S}}{\mathrm{argmin}}~\frac{1}{K}\sum_{k=1}^{K}\left\lVert {}^{C}\tilde{\bm{p}}_k - {}^{C}\tilde{\bm{p}}^{c}_k \right\rVert_2   (6)
{}^{C}\bm{p}_k = {}^{C}_{O}\mathbf{R}\left({}^{O}_{S_m}\mathbf{R}\,{}^{O}\bm{p}_k + {}^{O}\bm{p}_{S_m}\right), \qquad {}^{C}\bm{p}^{c}_k = {}^{C}_{O_c}\mathbf{R}\,{}^{O}\bm{p}_k

where $\tilde{\bm{p}}_k=\bm{p}_k-\frac{1}{K}\sum_{k'=1}^{K}\bm{p}_{k'}$ denotes the $k$th point of a mean-subtracted point cloud. We provide some visual examples of the effect of Eq. 6 in Fig. 7. Remember that Eq. 6 is used to pick the symmetry transform to apply to the ground-truth keypoints during training when the simulated prior detection is not given to the network – otherwise a random symmetry transform is applied to the prior and ground-truth keypoints together so that the network can learn to follow the prior for the symmetry. The main effect of Eq. 6 is that it chooses the symmetry to apply to the ground-truth keypoints that best matches the canonical view in terms of orientation, which essentially tells the network to always pick the symmetry that brings the front of the object closest to the camera and the top of the object closest to the negative $y$-axis of the camera frame (i.e., the top of the image) if no prior is given.
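A sketch of the selection in Eq. (6), assuming the symmetry transforms are given as rotation/translation pairs (the argument layout is our own):

```python
import numpy as np

def pick_symmetry(kps_obj, R_co, symmetries, R_cOc):
    """Pick the symmetry transform of Eq. (6): the one whose rotated, mean-subtracted
    keypoints in the camera frame are closest to those seen from the canonical view.

    kps_obj:    (K, 3) 3D keypoints in the object frame
    R_co:       (3, 3) ground-truth object-to-camera rotation
    symmetries: list of (R_sm (3, 3), p_sm (3,)) object-frame symmetry transforms
    R_cOc:      (3, 3) canonical-view rotation
    """
    canon = kps_obj @ R_cOc.T
    canon -= canon.mean(axis=0)                           # mean-subtracted canonical cloud
    best, best_err = None, np.inf
    for R_sm, p_sm in symmetries:
        pts = (kps_obj @ R_sm.T + p_sm) @ R_co.T          # keypoints in the camera frame
        pts -= pts.mean(axis=0)
        err = np.linalg.norm(pts - canon, axis=1).mean()  # the cost of Eq. (6)
        if err < best_err:
            best, best_err = (R_sm, p_sm), err
    return best
```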

Of course there is still the issue of detecting keypoints near the inflection point of a symmetry [28]. While we could utilize the mirroring technique of [28] to avoid this issue, we only need to detect keypoints once without the prior detection in practice (i.e., the first detection), and the mirroring technique of [28] requires an additional classifier during test time – which complicates the pipeline and adds additional computation. If the object is at an inflection point for the symmetry, and it is difficult to decide which symmetry to use, in our full SLAM system we can typically just reject bad measurements until the camera moves to a better viewpoint on the object in order for the network to more confidently choose the initial symmetry based on its training with Eq. 6.

Figure 7: Examples of how we pick the symmetry to use for the keypoints during training when a prior detection is not given to the network. Top: the two possible symmetries for the clamp are shown on the left, and the keypoints in the canonical view are shown on the right. The first symmetry is chosen since the points are closer to the points in the canonical view. Bottom: the bowl has a continuous axis of symmetry about its vertical axis which are discretized into 64 symmetry transforms. For brevity we only show two – a random symmetry that is not chosen for the label (left) and one that is chosen (center) since it matches the canonical view (right) the best in terms of orientation. Best viewed in color.
Figure 8: Our keypoint labels for the YCB-Video dataset. We labeled identifiable features based on the shape class of the objects (box-like, cylinder-like, and hand tool) which are common within different instances of the same shape class (such as box corners, cylinder top/bottom center, etc), and then instance-specific keypoints of other identifiable features such as brand names, bar codes, etc.
Figure 9: Our keypoint labels for the T-LESS dataset. Here, only shape class-specific keypoints were used due to the lack of texture on each object.

Appendix C Front-End Tracking Details

Besides the first image, whose camera frame becomes the global reference frame $\{G\}$, we need to estimate the camera pose ${}^{C}_{G}\mathbf{T}$ from the set of object PnP poses and the current estimates of the objects in the global frame. For each asymmetric object that is both detected in the current frame with a successful PnP pose ${}^{C}_{O}\mathbf{T}_{\mathrm{pnp}}$ and has an estimated global pose ${}^{G}_{O}\mathbf{T}$, we create a hypothesis about the current camera pose as ${}^{C}_{G}\mathbf{T}_{\mathrm{hyp}}={}^{C}_{O}\mathbf{T}_{\mathrm{pnp}}\,{}^{G}_{O}\mathbf{T}^{-1}$. We then project the 3D keypoints of all objects that have both a global 3D estimate and a detection in the current image into the current image plane with this camera pose, and count inliers with a $\chi^2$ test using the detected keypoints and uncertainty. We take the camera pose hypothesis with the most inliers as the final ${}^{C}_{G}\mathbf{T}$, and reject any hypothesis that has too few. After this, any objects that have valid PnP poses but are not yet initialized in the scene are given an initial pose ${}^{G}_{O}\mathbf{T}={}^{C}_{G}\mathbf{T}^{-1}\,{}^{C}_{O}\mathbf{T}_{\mathrm{pnp}}$.
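A rough sketch of this hypothesis selection follows; the minimum-inlier threshold and data layout are illustrative assumptions, not the system's actual values:

```python
import numpy as np

def select_camera_pose(pnp_poses, object_poses_G, detections,
                       chi2_thresh=5.991, min_inliers=8):
    """Pick the camera pose hypothesis with the most keypoint inliers.

    pnp_poses:      {obj_id: (4, 4) T_CO from PnP} for asymmetric detections
    object_poses_G: {obj_id: (4, 4) T_GO} current global object estimates
    detections:     {obj_id: list of (u (2,), Sigma (2, 2), p_O (3,), K (3, 3))}
    """
    best_T, best_count = None, -1
    for obj_id, T_CO in pnp_poses.items():
        if obj_id not in object_poses_G:
            continue
        T_CG = T_CO @ np.linalg.inv(object_poses_G[obj_id])   # hypothesis T^C_G
        count = 0
        for oid, T_GO in object_poses_G.items():
            for u, Sigma, p_O, K in detections.get(oid, []):
                p_C = (T_CG @ T_GO @ np.append(p_O, 1.0))[:3]
                if p_C[2] <= 0:
                    continue
                r = u - (K @ (p_C / p_C[2]))[:2]
                if r @ np.linalg.inv(Sigma) @ r < chi2_thresh:  # chi-square inlier test
                    count += 1
        if count > best_count:
            best_T, best_count = T_CG, count
    return best_T if best_count >= min_inliers else None
```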

Since each object is initialized with a PnP pose, it is possible that the initialization can be very poor due to a PnP failure, and, if the pose is bad enough (e.g., off by a large orientation error), the optimization cannot fix it since it only reaches a local minimum. To address this issue, we check if the PnP pose from the current image yields more inliers over the last few views than the current estimated pose, and, if this is true, we re-initialize the object with the new pose. After this, we perform a quick local refinement of the camera pose by fixing the object poses and optimizing just the current camera pose to better register it into the scene.

Appendix D Keypoint Labeling

Choice of keypoints.

The choice of keypoints for the network to learn is important, but there is no general consensus about which choice is best. Some have proposed to detect the corners of the 3D bounding boxes [28], while others chose keypoints that lie on the object [25, 26] – which seems to be the more accurate approach [26]. Inspired by [34], we try to pick keypoints that carry some semantic meaning. Our keypoint labels on the YCB-Video dataset can be seen in Fig. 8, and Fig. 9 for the T-LESS dataset. Specifically, we split the objects into three categories based on the overall shape – box-like, cylinder-like, and hand tool – and choose a unified set of keypoints for each of these shape classes based on the most identifiable features. We found that picking a set of keypoints for each one of these classes can accurately describe the shape of the objects for the YCB-Video and T-LESS datasets, and the keypoint network had a relatively easy time learning the keypoints despite the fact that the shapes of some objects are not exactly rectangular, cylindrical, etc. In order to increase the number of keypoints and their potential usefulness in a downstream application, we also add some instance-specific keypoints, such as brand names, bar codes, and hand grips, which only show up in the YCB-Video dataset. Such keypoints can still be shared among multiple instances of objects in the YCB-Video dataset, but sometimes occur between shape classes (e.g., bar codes show up on the box-like cracker box and also the cylindrical soup can).

Labeling tool.

To label the keypoints, we created a simple labeling program which allows the user to pick the same keypoint (say keypoint $k$) multiple times on the CAD model, and takes the average 3D location in the CAD model frame as the final 3D keypoint location ${}^{O}\bm{p}_k$. The tool also allows the user to pick the canonical view $\{O_c\}$ used in Eq. 6 by simply rotating the object into the correct view. This is especially important for the YCB-Video dataset, where the object models are not already rotated into a canonical view as they are for T-LESS. The labeling program will be included along with our keypoint labels in the software release, which will be made available upon publication of this work. Detailed instructions for how to reproduce our keypoint labels will also be included in this release (i.e., the rules we used to determine where each keypoint goes), which can also be used to label keypoints on other datasets with objects similar to those in YCB-Video and T-LESS. We found that, after the user is acquainted with the labeling program, it only takes a few minutes per object to label the keypoints. In the future, we would like to reduce the labeling effort for the shape class-specific keypoints, since there should be a simple set of heuristics to automatically label these when given the CAD model in a canonical view.

Appendix E Extended Results

Table 3: Detailed results on the YCB-Video dataset (AUC of ADD / ADD-S). Objects marked with * are symmetric.
| Objects | PoseCNN [35] ADD / ADD-S | DeepIM [15] ADD / ADD-S | PoseRBPF [4] ADD / ADD-S | MHPE [5] ADD / ADD-S | Ours ADD / ADD-S |
| 002_master_chef_can | 50.9 / 84.0 | 71.2 / 93.1 | 63.3 / 87.5 | 67.9 / 93.8 | 75.0 / 87.8 |
| 003_cracker_box | 51.7 / 76.9 | 83.6 / 91.0 | 77.8 / 87.6 | 67.8 / 82.9 | 84.0 / 90.6 |
| 004_sugar_box | 68.6 / 84.3 | 94.1 / 96.2 | 79.6 / 89.4 | 83.1 / 91.3 | 86.4 / 91.5 |
| 005_tomato_soup_can | 66.0 / 80.9 | 86.1 / 92.4 | 73.0 / 83.6 | 79.5 / 92.2 | 85.3 / 93.5 |
| 006_mustard_bottle | 79.9 / 90.2 | 91.5 / 95.1 | 84.7 / 92.0 | 81.6 / 90.8 | 94.2 / 96.2 |
| 007_tuna_fish_can | 70.4 / 87.9 | 87.7 / 96.1 | 64.2 / 82.7 | 78.0 / 92.5 | 84.3 / 92.7 |
| 008_pudding_box | 62.9 / 79.0 | 82.7 / 90.7 | 64.5 / 77.2 | 45.4 / 71.5 | 84.1 / 92.4 |
| 009_gelatin_box | 75.2 / 87.1 | 91.9 / 94.3 | 83.0 / 90.8 | 76.1 / 87.8 | 94.0 / 95.9 |
| 010_potted_meat_can | 59.6 / 78.5 | 76.2 / 86.4 | 51.8 / 66.9 | 69.1 / 85.5 | 83.7 / 91.7 |
| 011_banana | 72.3 / 85.9 | 81.2 / 91.3 | 18.4 / 66.9 | 87.7 / 93.7 | 87.3 / 94.3 |
| 019_pitcher_base | 52.5 / 76.8 | 90.1 / 94.6 | 63.7 / 82.1 | 76.8 / 88.8 | 89.4 / 93.9 |
| 021_bleach_cleanser | 50.5 / 71.9 | 81.2 / 90.3 | 60.5 / 74.2 | 47.7 / 70.3 | 61.7 / 70.5 |
| 024_bowl* | 6.5 / 69.7 | 8.6 / 81.4 | 28.4 / 85.6 | 40.2 / 80.1 | 32.8 / 76.9 |
| 025_mug | 57.7 / 78.0 | 81.4 / 91.3 | 77.9 / 89.0 | 40.6 / 72.8 | 84.8 / 92.6 |
| 035_power_drill | 55.1 / 72.8 | 85.5 / 92.3 | 71.8 / 84.3 | 39.5 / 71.2 | 85.5 / 92.2 |
| 036_wood_block* | 31.8 / 65.8 | 60.0 / 81.9 | 2.3 / 31.4 | 64.6 / 85.5 | 0.0 / 86.3 |
| 037_scissors | 35.8 / 56.2 | 60.9 / 75.4 | 38.7 / 59.1 | 64.5 / 88.9 | 79.2 / 91.2 |
| 040_large_marker | 58.0 / 71.4 | 75.6 / 86.2 | 67.1 / 76.4 | 81.1 / 90.6 | 84.9 / 94.7 |
| 051_large_clamp* | 25.0 / 49.9 | 48.4 / 74.3 | 38.3 / 59.3 | 49.2 / 70.7 | 47.2 / 83.0 |
| 052_extra_large_clamp* | 15.8 / 47.0 | 31.0 / 73.3 | 32.3 / 44.3 | 8.6 / 47.4 | 86.3 / 94.1 |
| 061_foam_brick* | 40.4 / 87.8 | 35.9 / 81.9 | 84.1 / 92.6 | 75.1 / 92.6 | 87.4 / 93.8 |
| Mean | 51.7 / 75.3 | 71.7 / 88.1 | 58.4 / 76.3 | 63.1 / 82.9 | 76.1 / 90.3 |
Figure 10: Supplementary qualitative results for the YCB-Video (left) and T-LESS (right) datasets. The top three rows show some successful pose estimates from our system, while the bottom row shows a failure case. The failure in both cases comes from initializing objects upside down: the bowl on the bottom left and the orange object on the bottom right are both estimated upside down.

YCB-Video per-object results.

As mentioned in Sec. 4.2, we provide more detailed results for each object on the YCB-Video dataset. The results are presented in Table 3. Here our method displays superior AUC of ADD and ADD-S for the majority of the objects. For the five symmetric objects, which are marked in Table 3, our method has the best AUC of ADD-S for four of them – which shows our ability to handle these symmetric objects effectively. Note that the ADD metric is not very important for symmetric objects, since it checks for a match to the actual ground-truth pose – which is arbitrary due to the symmetry – while ADD-S simply checks whether the shape of the object matches well between the ground-truth and estimated poses [35]. This is especially clear for the case of the wood block, where our method scores a 0.0 AUC of ADD while beating all other methods in the AUC of ADD-S metric. This is because our estimated pose for this object correctly aligned the CAD model to the scene to match the shape, but with a symmetry transform that yielded a completely different orientation from the ground truth.

Qualitative results.

More qualitative results are shown in Fig. 10. Here we show three success cases and one failure case for both the YCB-Video and T-LESS datasets. Our system is able to estimate correct poses for a wide variety of difficult objects even in the presence of occlusion and bad or missing detections. A common failure case that we saw is the system initializing objects (especially symmetric ones) upside down. While we showed the only such case we found in the YCB-Video dataset, this is especially common in the T-LESS dataset where it is harder to distinguish the top from the bottom for many objects. Reliably solving such edge cases is an interesting question to answer in future research.