
PR-GCN: A Deep Graph Convolutional Network with Point Refinement
for 6D Pose Estimation

Guangyuan Zhou1, Huiqun Wang1,2, Jiaxin Chen2 and Di Huang1,2
1State Key Laboratory of Software Development Environment, Beihang University, Beijing, China
2School of Computer Science and Engineering, Beihang University, Beijing, China
{zhouguangyuan, hqwangscse, jiaxinchen, dhuang}@buaa.edu.cn
indicates the corresponding author.
Abstract

RGB-D based 6D pose estimation has recently achieved remarkable progress, but still suffers from two major limitations: (1) ineffective representation of depth data and (2) insufficient integration of different modalities. This paper proposes a novel deep learning approach, namely Graph Convolutional Network with Point Refinement (PR-GCN), to simultaneously address the issues above in a unified way. It first introduces the Point Refinement Network (PRN) to polish 3D point clouds, recovering missing parts with noise removed. Subsequently, the Multi-Modal Fusion Graph Convolutional Network (MMF-GCN) is presented to strengthen RGB-D combination, which captures geometry-aware inter-modality correlation through local information propagation in the graph convolutional network. Extensive experiments are conducted on three widely used benchmarks, and state-of-the-art performance is reached. It is also shown that the proposed PRN and MMF-GCN modules generalize well to other frameworks.

1 Introduction

6D pose estimation aims to predict the orientation and location of an object in 3D space with respect to a canonical frame. It has received extensive attention in computer vision, since it is a fundamental step for a wide range of applications, such as robotic grasping [6, 35, 47] and augmented reality [22, 23]. Traditional methods [10, 11] attempt to accomplish this task based on RGB images only. They adopt handcrafted features (e.g., SIFT [21] and SURF [1]) to establish correspondence between input and canonical images. Inspired by the great success in detection/recognition, deep neural networks have recently been explored to address this issue, including single-stage regression methods [15] and key-point based methods [13, 32, 36, 35, 26, 25, 20]. Despite the remarkable gains in accuracy, RGB-based deep models heavily rely on textures and are thus sensitive to illumination variations, severe occlusions, and cluttered backgrounds.

Figure 1: An example object: (a)/(b) the RGB/depth images; (c)/(d) the generated incomplete, noisy point cloud and the ground truth.

Along with the emergence and innovation of depth sensors, 6D pose estimation on RGB-D data has become popular, with the expectation of delivering performance gains by adding geometry information. Early works [11, 24, 45] estimate object poses from RGB images and refine them according to depth maps. Later studies [38, 17] are dedicated to integrating RGB and depth clues in a more sophisticated way. In particular, [41, 31, 30, 38, 9] represent depth images as 3D point clouds, and the resulting models are more efficient in computation and storage than those operating on original depth maps. By jointly making use of both modalities, RGB-D based solutions report better scores, showing superiority under the aforementioned difficulties as well as in low-texture cases.

However, current RGB-D pose estimation suffers from two major limitations: ineffective representation of depth data and insufficient combination of the two modalities. For the former, depth information captured in cluttered scenes is usually noisy and incomplete (see Fig. 1). Inferring poses from such data, either as 2D depth maps or 3D point clouds, is not robust, leading to accuracy deterioration. For the latter, RGB and depth clues are fused by concatenating separately learned single-modal features [38] or by applying a simple point-wise encoder [9], where inter-modality correlations are either not considered or only roughly modeled in a global manner, leaving much room for improvement.

In this paper, we propose a novel deep learning approach, namely Graph Convolutional Network with Point Refinement (PR-GCN), to simultaneously address the two limitations in a unified way. As in Fig. 2, given the RGB image and 3D point cloud (generated from the depth map) of an object, we first introduce a Point Refinement Network (PRN) to polish the point cloud. Endowed with an encoder-decoder structure and trained with a regularized multi-resolution regression loss, PRN recovers the missing parts of the raw input with noise removed. Subsequently, we integrate RGB-D clues by a Multi-Modal Fusion Graph Convolutional Network (MMF-GCN). It constructs a $k$-Nearest Neighbor ($k$-NN) graph and extracts geometry-aware inter-modality correlation through local information propagation in the Graph Convolutional Network (GCN). An additional $k$-NN graph and GCN are employed to encode local geometry attributes of the refined point cloud as a complement to the original data. The features from the two GCNs are then combined and fed into several fully-connected layers for final 6D pose prediction. We extensively evaluate PR-GCN on three public benchmarks, Linemod [11], Occlusion Linemod [2], and YCB-Video [40], and achieve state-of-the-art performance. We also show that the proposed PRN and MMF-GCN modules generalize well to other frameworks.

The contributions are threefold: 1) We propose the PR-GCN approach to 6D pose estimation by enhancing both depth representation and multi-modal combination. 2) We present the PRN module with a regularized multi-resolution regression loss for point-cloud refinement. To the best of our knowledge, it is the first work that applies 3D point generation to this task. 3) We develop the MMF-GCN module to capture local geometry-aware inter-modality correlation for RGB-D fusion.

Figure 2: Illustration of PR-GCN. Given an RGB-D image, it first localizes objects on the RGB image and generates their raw 3D point clouds. Subsequently, PRN generates refined 3D points to polish the shape clues, and MMF-GCN integrates multi-modal features by propagating local geometry-aware information and leveraging the refined 3D points. The 6D pose is finally inferred based on the feature delivered by MMF-GCN.

2 Related Work

RGB based 6D Pose Estimation. Traditional methods [11, 7, 16] establish correspondence between object appearances and poses from single RGB images. Linemod [11] predicts poses by modeling the relationship between texture gradients and surface normals on 3D templates. [3] exploits key-points of specific objects for pose estimation by iteratively matching them between input and canonical frames. As in other vision tasks, deep models have also been investigated to build more powerful features for this task. DeepIM [18] adopts CNNs to learn reliable representations for template matching. BB8 [32] applies CNNs in a multi-stage segmentation scheme to regress key-point coordinates. PVNet [29] proposes a deep offset prediction model to alleviate the negative impact of occlusions. CDPN [19] and Pix2Pose [28] map 3D coordinates to 2D pixels and regress pose parameters on 2D images. LatentFusion [27] handles unseen object poses by reconstructing a latent 3D representation.

RGB-D based 6D Pose Estimation. With geometry information, depth maps contribute to pose estimation under various lighting conditions and for low-textured appearances, complementary to RGB images. MCN [17] employs two CNNs for representation learning on RGB and depth respectively, and the resulting features are then concatenated for pose prediction. PoseCNN [40] and SSD-6D [14] follow a coarse-to-fine scheme, where poses are initially estimated on RGB frames and subsequently refined on depth maps. [37] builds a multi-view model to jointly reconstruct whole scenes and optimize multi-object poses.

Recently, there has emerged a trend to represent geometry clues as 3D point clouds rather than depth maps for higher efficiency [38, 5, 4, 31, 9]. DenseFusion [38] designs a heterogeneous network to integrate texture and shape features, and such representation proves more discriminative than single-modal ones. CF [5] introduces attention modules to combine the two modalities for further improvements. G2L [4] segments point clouds of objects in scenes by frustum pointnet [31] and regresses pose parameters via extra coordinate constraints. PVN3D [9] incorporates DenseFusion into 3D key-point detection and instance semantic segmentation, significantly boosting the performance.

Unfortunately, the point clouds generated from depth maps are often of low quality, since the shape information is frequently incomplete and noisy, as Fig. 1 shows. Besides, the combination of RGB and depth clues is performed in a rather coarse way, e.g., by direct concatenation or point-wise encoding. In contrast, our approach develops the PRN and MMF-GCN modules to polish depth clues by generating refined point clouds and to enhance integration by capturing local geometry-aware inter-modality correlations, respectively, both of which are beneficial to pose estimation.

3 The Proposed Method

3.1 Framework Overview

RGB-D based 6D pose estimation recovers 6D poses of objects in RGB-D images, where a 6D pose is usually represented by a rotation matrix $\bm{R}\in SO(3)$ and a translation vector $\bm{t}\in\mathbb{R}^{3}$. For this task, we propose the PR-GCN approach. As Fig. 2 depicts, it consists of four steps: object localization and 3D point generation, 3D point refinement, GCN-based multi-modal fusion, and 6D pose prediction.

Object Localization and 3D Points Generation. Given an RGB-D image $I=(I_{rgb},I_{d})$, we firstly locate objects on $I_{rgb}$ using the off-the-shelf Faster R-CNN [33] detector, where $I_{rgb}$ and $I_{d}$ denote the RGB and depth channels of $I$. According to the detected bounding boxes, we crop the sub-images $\{I_{o}\}=\{(I_{o,rgb},I_{o,d})\}$, each of which contains an instance $o$; $I_{o,rgb}$ and $I_{o,d}$ are the RGB and depth channels of $I_{o}$. As in PoseCNN [40], we add a segmentation head to remove the background of $I_{o,rgb}$. With $M_{o}$ (the foreground mask) and $I_{o,d}$, the raw 3D points $\bm{P}_{o}=[\bm{p}^{(1)}_{o};\cdots;\bm{p}^{(i)}_{o};\cdots;\bm{p}^{(N)}_{o}]\in\mathbb{R}^{N\times 3}$ are rendered by the Point-Cloud Transform (PCT) [8], where $N$ is the number of points and $\bm{p}^{(i)}_{o}\in\mathbb{R}^{3}$ is the 3D coordinate of the $i$-th point. It is worth noting that $\bm{P}_{o}$ could be severely incomplete and noisy due to external occlusions and sensor noise (see Fig. 1).
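For reference, the sketch below shows one possible realization of this point-cloud generation step: back-projecting the masked depth pixels of a cropped object into camera-frame 3D coordinates with the standard pinhole model. The intrinsics (fx, fy, cx, cy), the depth scale, and the random sampling to a fixed number of points are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np

def depth_to_point_cloud(depth, mask, fx, fy, cx, cy, depth_scale=1000.0, n_points=100):
    """Back-project masked depth pixels into an (n_points, 3) camera-frame point cloud.

    depth: (H, W) raw depth map (e.g., in millimetres); mask: (H, W) boolean foreground mask.
    fx, fy, cx, cy: pinhole intrinsics (assumed known); depth_scale converts depth to metres.
    """
    vs, us = np.nonzero(mask & (depth > 0))          # pixel coordinates of valid foreground points
    z = depth[vs, us].astype(np.float32) / depth_scale
    x = (us - cx) * z / fx                           # pinhole back-projection
    y = (vs - cy) * z / fy
    points = np.stack([x, y, z], axis=1)             # (num_valid, 3)

    # Randomly sample a fixed number of points, since the network expects N points per object.
    if len(points) == 0:
        return np.zeros((n_points, 3), dtype=np.float32)
    idx = np.random.choice(len(points), n_points, replace=len(points) < n_points)
    return points[idx]
```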

3D Points Refinement. To polish the quality of the generated raw 3D points $\bm{P}_{o}$, we propose the PRN module. As in Fig. 2, it is composed of an MLP-based encoder and a multi-resolution decoder to recover the complete and accurate 3D point cloud $\hat{\bm{P}}_{o}=[\hat{\bm{p}}^{(1)}_{o};\cdots;\hat{\bm{p}}^{(m)}_{o};\cdots;\hat{\bm{p}}^{(M)}_{o}]\in\mathbb{R}^{M\times 3}$, where $M$ is the number of refined points. A regularized multi-resolution regression loss is formulated for training, enhancing its ability to filter out noise in $\bm{P}_{o}$.

GCN-based Multi-Modal Fusion. For more sufficient RGB-D fusion, we propose the MMF-GCN module. As in Fig. 2, it extracts texture and geometry features from $I_{o,rgb}$ and $\bm{P}_{o}$, respectively, and a graph is built based on the geometry distribution. Accordingly, $I_{o,rgb}$ and $\bm{P}_{o}$ are initially integrated by applying a GCN $GCN_{f}(\cdot)$ on the previously built graph through local information propagation. The geometry clues from the refined 3D points are encoded by an extra GCN $GCN_{ref}(\cdot)$ and then incorporated into the initially fused features, which are fed into several stacked fully-connected layers $T(\cdot)$ for further fusion. The resulting feature $\bm{G}_{o}=[\bm{g}_{o}^{(k)}]_{k=1,\cdots,K}\in\mathbb{R}^{K\times d}$ is therefore the multi-modal representation for the subsequent 6D pose estimation, where $K$ and $d$ refer to the number of points and the feature dimension, respectively. Since MMF-GCN captures local geometry-aware inter-modality correlation and leverages the refined 3D point cloud, it is expected to deliver more discriminative and robust features.

6D Pose Prediction. $[\bm{g}_{o}^{(k)}]_{k=1,\cdots,K}$ is finally fed into three regression branches, $REG_{r}(\cdot)$, $REG_{t}(\cdot)$ and $REG_{c}(\cdot)$, for the rotations $\{\hat{\bm{R}}_{o}^{(k)}=REG_{r}(\bm{g}_{o}^{(k)})\}$, translations $\{\hat{\bm{t}}_{o}^{(k)}=REG_{t}(\bm{g}_{o}^{(k)})\}$ and confidence scores $\{s_{o}^{(k)}=REG_{c}(\bm{g}_{o}^{(k)})\}$, respectively. Each branch has four fully-connected layers. Similar to [9, 29, 38], we select the candidate with the highest confidence score as the estimated pose, formulated as:

(\hat{\bm{R}}_{o},\hat{\bm{t}}_{o})=\mathop{\mathrm{argmax}}\limits_{\{(\hat{\bm{R}}_{o}^{(k)},\hat{\bm{t}}_{o}^{(k)})\,|\,k=1,\cdots,K\}}\;s_{o}^{(k)}.   (1)
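As a small illustration, selecting the final pose from the $K$ per-point hypotheses according to Eq. (1) amounts to a simple argmax over the confidence branch; the tensor shapes below are assumptions for the sketch.

```python
import torch

def select_pose(R_pred, t_pred, conf):
    """Pick the rotation/translation hypothesis with the highest confidence (Eq. 1).

    R_pred: (K, 3, 3), t_pred: (K, 3), conf: (K,) confidence scores.
    """
    k = torch.argmax(conf)
    return R_pred[k], t_pred[k]
```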
Figure 3: Detailed structure of PRN ($\bigoplus$: additive operation).

3.2 Point Refinement Network

Recall that PRN aims to generate the refined 3D point cloud $\hat{\bm{P}}_{o}$ from the low-quality raw one $\bm{P}_{o}$. As in Fig. 3, PRN is endowed with an encoder-decoder architecture. To deal with the change in point density (resolution), we downsample $\bm{P}_{o}$ at the $1/2$ and $1/4$ scales, resulting in two extra point clouds: $\bm{P}_{o,1/2}\in\mathbb{R}^{\frac{N}{2}\times 3}$ and $\bm{P}_{o,1/4}\in\mathbb{R}^{\frac{N}{4}\times 3}$.

Accordingly, the encoder $E(\cdot)$ has three branches $E_{1}(\cdot)$, $E_{1/2}(\cdot)$ and $E_{1/4}(\cdot)$, whose inputs are $\bm{P}_{o}$, $\bm{P}_{o,1/2}$ and $\bm{P}_{o,1/4}$ and whose outputs are the three representations $\bm{v}_{1}=E_{1}(\bm{P}_{o})$, $\bm{v}_{1/2}=E_{1/2}(\bm{P}_{o,1/2})$ and $\bm{v}_{1/4}=E_{1/4}(\bm{P}_{o,1/4})$. Each branch is a stack of six MLP layers. The concatenation of $\bm{v}_{1}$, $\bm{v}_{1/2}$ and $\bm{v}_{1/4}$, followed by an MLP layer, forms the intermediate latent representation $\bm{v}$.

The decoder $DEC(\cdot)$ employs a multi-resolution structure as [43] does. The first branch, with four fully-connected (FC) layers and one reshape (RS) operation, produces the coarse low-resolution point cloud $\hat{\bm{P}}_{o,1/8}=RS_{5}(FC_{4}(FC_{3}(FC_{2}(FC_{1}(\bm{v})))))\in\mathbb{R}^{\frac{M}{8}\times 3}$ (or equivalently $\mathbb{R}^{\frac{M}{8}\times 3\times 1}$). The second branch consists of the first two shared FC layers, one additional FC layer, one MLP layer and one RS operation, generating the medium-resolution point cloud $\hat{\bm{P}}_{o,1/4}=RS_{2,3}(MLP_{2,2}(FC_{2,1}(FC_{2}(FC_{1}(\bm{v})))))$. Then, $\hat{\bm{P}}_{o,1/8}$ is integrated into $\hat{\bm{P}}_{o,1/4}$ by a broadcasting additive operation $\hat{\bm{P}}_{o,1/4}:=\hat{\bm{P}}_{o,1/4}\bigoplus\hat{\bm{P}}_{o,1/8}$, and $\hat{\bm{P}}_{o,1/4}$ is reshaped into $\hat{\bm{P}}_{o,1/4}:=RS_{2,4}(\hat{\bm{P}}_{o,1/4})\in\mathbb{R}^{\frac{M}{4}\times 3}$. Similarly, the third branch finally renders the high-resolution point cloud $\hat{\bm{P}}_{o}\in\mathbb{R}^{M\times 3}$ by sharing the first FC layer, adding one FC layer, three MLP layers and one RS operation, and incorporating the medium-resolution information as $\hat{\bm{P}}_{o}=RS_{1,7}(RS_{1,6}(MLP_{1,5}(MLP_{1,4}(MLP_{1,3}(FC_{1,2}(FC_{1}(\bm{v}))))))\bigoplus\hat{\bm{P}}_{o,1/4})$.
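To make the coarse-to-fine structure concrete, the following is a minimal PyTorch sketch of such an encoder-decoder with a multi-resolution decoder and additive skips. The single encoder branch, layer widths, and the exact broadcasting pattern are simplifying assumptions for illustration; only the coarse-to-fine additive design follows the description above.

```python
import torch
import torch.nn as nn

class PRNSketch(nn.Module):
    """Simplified point-refinement sketch: shared-MLP encoder + multi-resolution decoder.

    Layer widths and the sharing pattern are illustrative assumptions; only the
    coarse-to-fine structure with additive skips mirrors the paper's description.
    """
    def __init__(self, m=512, feat_dim=256):
        super().__init__()
        self.m = m
        # One encoder branch (the paper uses three, one per input resolution).
        self.encoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                     nn.Linear(64, feat_dim), nn.ReLU())
        self.fc_coarse = nn.Linear(feat_dim, (m // 8) * 3)   # -> M/8 coarse points
        self.fc_mid = nn.Linear(feat_dim, (m // 4) * 3)      # -> M/4 points
        self.fc_fine = nn.Linear(feat_dim, m * 3)            # -> M points

    def forward(self, points):                               # points: (B, N, 3)
        v = self.encoder(points).max(dim=1).values           # global latent code (B, feat_dim)
        p_coarse = self.fc_coarse(v).view(-1, self.m // 8, 3)
        # Medium resolution: broadcast-add each coarse point to its group of 2 offsets.
        p_mid = self.fc_mid(v).view(-1, self.m // 8, 2, 3)
        p_mid = (p_mid + p_coarse.unsqueeze(2)).view(-1, self.m // 4, 3)
        # High resolution: broadcast-add each medium point to its group of 4 offsets.
        p_fine = self.fc_fine(v).view(-1, self.m // 4, 4, 3)
        p_fine = (p_fine + p_mid.unsqueeze(2)).view(-1, self.m, 3)
        return p_coarse, p_mid, p_fine
```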

For training PRN, we develop a multi-resolution regression loss formulated as follows:

\mathcal{L}_{\rm{mr}}=d_{\rm C}(\hat{\bm{P}}_{o},\bm{P}_{o,GT})+\sum\limits_{r\in\{\frac{1}{8},\frac{1}{4}\}}d_{\rm C}(\hat{\bm{P}}_{o,r},\bm{P}_{o,GT})   (2)

where $\bm{P}_{o,GT}=[\bm{p}_{o,GT}^{(1)};\cdots;\bm{p}_{o,GT}^{(H)}]\in\mathbb{R}^{H\times 3}$ is the ground-truth point cloud of object $o$, and $d_{\rm C}(\cdot,\cdot)$ is the Chamfer distance defined as $d_{\rm C}(\bm{P},\bm{Q})=\frac{1}{M}\sum_{i}\min_{j}\|\bm{p}^{(i)}-\bm{q}^{(j)}\|_{2}^{2}+\frac{1}{N}\sum_{j}\min_{i}\|\bm{q}^{(j)}-\bm{p}^{(i)}\|_{2}^{2}$, given $\bm{P}=[\bm{p}^{(m)}]_{m=1,\cdots,M}\in\mathbb{R}^{M\times 3}$ and $\bm{Q}=[\bm{q}^{(n)}]_{n=1,\cdots,N}\in\mathbb{R}^{N\times 3}$.
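As a concrete reference, the following PyTorch sketch computes the symmetric Chamfer distance $d_{\rm C}(\bm{P},\bm{Q})$ defined above; it is a brute-force O(MN) implementation chosen for clarity, not necessarily what the authors use.

```python
import torch

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point clouds p (M, 3) and q (N, 3).

    Returns mean_i min_j ||p_i - q_j||^2 + mean_j min_i ||q_j - p_i||^2,
    matching d_C in Eq. (2); a brute-force version for illustration.
    """
    diff = p.unsqueeze(1) - q.unsqueeze(0)          # (M, N, 3) pairwise differences
    dist2 = (diff ** 2).sum(dim=-1)                 # (M, N) squared distances
    return dist2.min(dim=1).values.mean() + dist2.min(dim=0).values.mean()
```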

In Eq. (2), $\hat{\bm{P}}_{o}$ is forced to fit the ground truth at low-to-high resolutions when minimizing $\mathcal{L}_{\rm{mr}}$. In other words, $DEC(\cdot)$ is forced to predict high-quality points at multiple resolutions within a unified structure, and is thus optimized with more supervision than in the single-resolution case. Moreover, as shown in Fig. 3, the high-resolution output $\hat{\bm{P}}_{o}$ integrates multi-resolution information from $\hat{\bm{P}}_{o,1/4}$ and $\hat{\bm{P}}_{o,1/8}$. As a consequence, PRN mitigates the incompleteness and decreases the noise of the raw 3D points.

Despite the aforementioned advantages of $\mathcal{L}_{\rm{mr}}$, it fails to perceive the global point distribution of $\bm{P}_{o,GT}$. We handle this problem by introducing the adversarial loss:

\mathcal{L}_{\rm adv}=\sum\limits_{h=1}^{H}\log(D(\bm{p}_{o,GT}^{(h)}))+\sum\limits_{m=1}^{M}\log(1-D(\hat{\bm{p}}_{o}^{(m)})),   (3)

where $D(\cdot)$ is the discriminator to classify whether a point belongs to $\bm{P}_{o,GT}$ (“real”) or not (“fake”). By minimizing $\mathcal{L}_{\rm{adv}}$, PRN is expected to generate a $\hat{\bm{P}}_{o}$ that captures the holistic point distribution of $\bm{P}_{o,GT}$, benefiting the quality over $\bm{P}_{o}$.

The regularized multi-resolution regression loss is thus formulated as:

\mathcal{L}_{\rm prn}=\sum\limits_{o}\left(\lambda\cdot\mathcal{L}_{\rm{adv}}+\beta\cdot\mathcal{L}_{\rm{mr}}\right),   (4)

where $\lambda$ and $\beta$ are the trade-off hyper-parameters.
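Putting Eqs. (2)–(4) together, a hedged sketch of the generator-side PRN loss for a single object is shown below. It reuses the chamfer_distance helper sketched above, assumes a point-wise discriminator disc mapping points to scores in (0, 1), and omits the discriminator's own training objective; the λ and β defaults follow Section 4.2.

```python
import torch

def prn_loss(p_fine, p_mid, p_coarse, p_gt, disc, lam=0.05, beta=0.95):
    """Regularized multi-resolution regression loss (Eq. 4) for one object.

    p_fine/p_mid/p_coarse: refined clouds at full, 1/4 and 1/8 resolution; p_gt: ground truth.
    disc: a point-wise discriminator mapping (K, 3) points to per-point scores in (0, 1).
    Only the generator view of the adversarial term is shown; this is an illustrative sketch.
    """
    l_mr = (chamfer_distance(p_fine, p_gt)
            + chamfer_distance(p_mid, p_gt)
            + chamfer_distance(p_coarse, p_gt))                        # Eq. (2)
    eps = 1e-7
    l_adv = (torch.log(disc(p_gt).clamp_min(eps)).sum()
             + torch.log((1.0 - disc(p_fine)).clamp_min(eps)).sum())   # Eq. (3)
    return lam * l_adv + beta * l_mr
```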

3.3 Multi-Modal Fusion GCN

Table 1: Comparison with state-of-the-art methods in terms of ADD(-S) (%) on Linemod. Symmetric objects are marked in bold. */† indicates that the method only uses real/synthetic data for training. PF and DF refer to PointFusion [41] and DenseFusion [38], respectively.
RGB based methods RGB-D based methods
Object PoseCNN* [40,18] PVNet [29] CDPN [19] DPOD [44] DPVL [42] PF* [41] SSD6D [14] DF* [38] PVN3D [9] G2L* [4] Ours* Ours
ape 77.0 43.6 64.4 87.7 69.1 70.4 65.0 92.3 97.3 96.8 97.6 99.2
benchvise 97.5 99.9 97.8 98.5 100.0 80.7 80.0 93.2 99.7 96.1 99.2 99.8
camera 93.5 86.9 91.7 96.1 94.1 60.8 78.0 94.4 99.6 98.2 99.4 100.0
can 96.5 95.5 95.9 99.7 98.5 61.1 86.0 93.1 99.5 98.0 98.4 99.4
cat 82.1 79.3 83.8 94.7 83.1 79.1 70.0 96.5 99.8 99.2 98.7 99.8
driller 95.0 96.4 96.2 98.8 99.0 47.3 73.0 87.0 99.3 99.8 98.8 99.8
duck 77.7 52.6 66.8 86.3 63.5 63.0 66.0 92.3 98.2 97.7 98.9 98.7
eggbox 97.1 99.2 99.7 99.9 100.0 99.9 100.0 99.8 99.8 100.0 99.9 99.6
glue 99.4 95.7 99.6 96.8 98.0 99.3 100.0 100.0 100.0 100.0 100.0 100.0
holepuncher 52.8 82.0 85.8 86.9 88.2 71.8 49.0 92.1 99.9 99.0 99.4 99.8
iron 98.3 98.9 97.9 100.0 99.9 83.2 78.0 97.0 99.7 99.3 98.5 99.5
lamp 97.5 99.3 97.9 96.8 99.8 62.3 73.0 95.3 99.8 99.5 99.2 100.0
phone 87.7 92.4 90.8 94.7 96.4 78.8 79.0 92.8 99.5 98.9 98.4 99.7
MEAN 88.6 86.3 89.9 95.2 91.5 73.7 79.0 94.3 99.4 98.7 98.9 99.6

As mentioned before, given the RGB ($I_{o,rgb}$) and point cloud ($\bm{P}_{o}$ and $\hat{\bm{P}}_{o}$) data of object $o$, MMF-GCN integrates the multi-modal information into a more effective representation ($\bm{G}_{o}$) for accurate 6D pose estimation.

Specifically, MMF-GCN first extracts the geometry feature $\bm{f}_{o,d}^{(i)}$ from $\bm{P}_{o}$ and the texture feature $\bm{f}_{o,rgb}^{(i)}$ from $I_{o,rgb}$ for the $i$-th point $\bm{p}^{(i)}_{o}\in\bm{P}_{o}$. The normalized coordinate of $\bm{p}_{o}^{(i)}$ is directly used as the geometry feature, and by mapping this coordinate to the corresponding pixel on $I_{o,rgb}$, PSPNet [46] with the ResNet-18 backbone is adopted to compute the pixel-wise representation as the texture feature.

When $\{\bm{f}_{o,rgb}^{(i)}\}$ and $\{\bm{f}_{o,d}^{(i)}\}$ are ready, a $k$-Nearest Neighbor ($k$-NN) graph $\mathcal{G}_{f}=(\mathcal{V}_{f},\mathcal{E}_{f})$ is constructed, where $\mathcal{V}_{f}=\{\bm{p}_{o}^{(1)},\cdots,\bm{p}_{o}^{(N)}\}$ and $\mathcal{E}_{f}=\{(\bm{p}_{o}^{(i)},\bm{p}_{o}^{(j)})\,|\,\bm{p}_{o}^{(j)}\in\mathcal{N}_{k}(\bm{p}_{o}^{(i)})\}$ denote the vertices and the edges, and $\mathcal{N}_{k}(\bm{p}_{o}^{(i)})$ indicates the $k$ nearest neighbors of $\bm{p}_{o}^{(i)}$. The edge feature is defined as $\bm{e}^{(i,j)}=h_{\bm{\theta}}(\bm{f}_{o}^{(i)}-\bm{f}_{o}^{(j)},\bm{f}_{o}^{(i)})$ with $\bm{f}_{o}^{(i)}=[\bm{f}_{o,rgb}^{(i)},\bm{f}_{o,d}^{(i)}]$, where $h_{\bm{\theta}}(\cdot,\cdot)$ is a nonlinear function parameterized by $\bm{\theta}$.

Afterwards, a graph convolutional network $GCN_{f}(\cdot)$ is employed to capture local inter-modality correlations, with EdgeConv [39] for the graph convolutions. The basic updating scheme is formulated as:

\bm{g}_{o,f}^{(i,l)}=MP\left(h_{\bm{\theta}^{(l-1)}}\left(\bm{g}_{o,f}^{(i,l-1)}-\bm{g}_{o,f}^{(j,l-1)},\bm{g}_{o,f}^{(i,l-1)}\right)\right),

where $\bm{g}_{o,f}^{(i,l)}$ denotes the $i$-th edge feature in the $l$-th layer, $h_{\bm{\theta}^{(l-1)}}(\cdot,\cdot)$ is a nonlinear function in the $(l-1)$-th layer, and $MP(\cdot)$ refers to max pooling. The representation $\bm{G}_{o,f}=[\bm{g}_{o,f}^{(j)}]_{j=1,\cdots,J}\in\mathbb{R}^{J\times d_{f}}$ is then obtained, where $\bm{G}_{o,f}=GCN_{f}(\{\bm{f}_{o,rgb}^{(i)},\bm{f}_{o,d}^{(i)}\})$; $J$ and $d_{f}$ are the point number and the feature dimension, respectively.
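A minimal PyTorch sketch of this building block is given below: a $k$-NN graph over the 3D coordinates and one EdgeConv-style update, where $h_{\bm{\theta}}$ is a small shared MLP and max pooling aggregates over the neighbors. The feature dimensions, the MLP design, and the PSPNet feature stand-in are illustrative assumptions.

```python
import torch
import torch.nn as nn

def knn_indices(points, k):
    """Indices of the k nearest neighbors for each point; points: (N, 3)."""
    dist = torch.cdist(points, points)               # (N, N) pairwise distances
    return dist.topk(k + 1, largest=False).indices[:, 1:]   # drop self -> (N, k)

class EdgeConvBlock(nn.Module):
    """One EdgeConv-style update g_i = max_j h_theta(g_i - g_j, g_i) over a k-NN graph."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.h_theta = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, feats, nbr_idx):               # feats: (N, in_dim), nbr_idx: (N, k)
        nbr_feats = feats[nbr_idx]                   # (N, k, in_dim) neighbor features
        center = feats.unsqueeze(1).expand_as(nbr_feats)
        edge = torch.cat([center - nbr_feats, center], dim=-1)   # edge-feature input
        return self.h_theta(edge).max(dim=1).values  # max-pool over the k neighbors

# Sketch of the fused branch: per-point texture features concatenated with 3D coordinates,
# propagated on the geometry-defined k-NN graph (k = 30 as in Section 4.2).
points = torch.rand(100, 3)                          # raw point cloud P_o
rgb_feats = torch.rand(100, 32)                      # hypothetical per-point texture features
fused_in = torch.cat([rgb_feats, points], dim=-1)    # [f_rgb, f_d]
block = EdgeConvBlock(in_dim=35, out_dim=64)
g_of = block(fused_in, knn_indices(points, k=30))    # (100, 64) local multi-modal features
```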

As $\bm{P}_{o}$ is usually incomplete and noisy, MMF-GCN encodes the geometry attribute of $\hat{\bm{P}}_{o}$ and incorporates it into $\bm{G}_{o,f}$ as a complement. Concretely, similar to $\mathcal{G}_{f}$, another $k$-NN graph $\mathcal{G}_{ref}$ is built based on $\hat{\bm{P}}_{o}$, and an extra GCN $GCN_{ref}(\cdot)$ is employed. The refined geometry feature is calculated using EdgeConv: $\bm{G}_{o,ref}=[\bm{g}_{o,ref}^{(j)}]_{j=1,\cdots,J}\in\mathbb{R}^{J\times d_{ref}}$, where $\bm{G}_{o,ref}=GCN_{ref}(\hat{\bm{P}}_{o})$ and $d_{ref}$ is the feature dimension. $\bm{G}_{o,ref}$ is subsequently combined with $\bm{G}_{o,f}$ through simple concatenation, which is further integrated by a few stacked FC layers $T(\cdot)$. At last, the multi-modal representation is formed as $\bm{G}_{o}=T([\bm{G}_{o,f},\bm{G}_{o,ref}])$ for 6D pose estimation.
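Continuing the sketch above, the fusion of the two GCN branches can be pictured as follows: the refined-cloud feature $\bm{G}_{o,ref}$ and the raw RGB-D graph feature $\bm{G}_{o,f}$ are concatenated point-wise and passed through the FC stack $T(\cdot)$. The refined cloud is assumed to be down-sampled to the same number of points (Section 4.2 uses FPS to 100 points), and the layer widths are again illustrative.

```python
# Hypothetical fusion head: concatenate the two branch features point-wise and apply T(.).
refined_points = torch.rand(100, 3)                  # refined cloud, FPS-sampled to 100 points
ref_block = EdgeConvBlock(in_dim=3, out_dim=64)
g_oref = ref_block(refined_points, knn_indices(refined_points, k=30))   # (100, 64)

T = nn.Sequential(nn.Linear(64 + 64, 128), nn.ReLU(), nn.Linear(128, 128))
g_o = T(torch.cat([g_of, g_oref], dim=-1))           # (100, 128) per-point fused representation
```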

3.4 Training Objectives

The objective function for training PR-GCN consists of two parts: the pose estimation loss $\mathcal{L}_{\rm{pose}}$ and the regularized multi-resolution regression loss $\mathcal{L}_{\rm{prn}}$ as depicted in Eq. (4).

Given the ground-truth 6D pose $(\bm{R}_{o},\bm{t}_{o})$ and the predictions $\{(\hat{\bm{R}}_{o}^{(k)},\hat{\bm{t}}_{o}^{(k)},s^{(k)}_{o})\}$ at the $K$ points $\{\bm{x}_{o}^{(k)}\}$, the pose estimation error of the $i$-th prediction $(\hat{\bm{R}}_{o}^{(i)},\hat{\bm{t}}_{o}^{(i)})$ is defined as $e_{o}^{(i)}=\frac{1}{K}\sum_{j=1}^{K}\min_{k}\|(\bm{R}_{o}\bm{x}_{o}^{(j)}+\bm{t}_{o})-(\hat{\bm{R}}_{o}^{(i)}\bm{x}^{(k)}_{o}+\hat{\bm{t}}_{o}^{(i)})\|^{2}_{2}$. Based on $e_{o}^{(i)}$, we adopt an extra regularization term on the prediction scores $\{s^{(i)}_{o}\}$ as in [38] and formulate the pose estimation loss as:

\mathcal{L}_{\rm{pose}}=\frac{1}{K}\sum_{o}\sum_{i}e^{(i)}_{o}\cdot\left(s^{(i)}_{o}-\log(s^{(i)}_{o})\right).   (5)
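For illustration, a hedged PyTorch sketch of this confidence-regularized loss (Eq. 5) for one object is shown below, using the symmetry-aware error $e_{o}^{(i)}$ defined above; the model-point sampling, tensor shapes, and batching details are assumptions.

```python
import torch

def pose_loss(R_gt, t_gt, R_pred, t_pred, conf, model_pts):
    """Confidence-regularized pose loss for one object (Eq. 5).

    R_gt: (3, 3), t_gt: (3,); R_pred: (K, 3, 3), t_pred: (K, 3), conf: (K,) in (0, 1);
    model_pts: (K, 3) sampled object points x_o^(k). Illustrative sketch only.
    """
    gt = model_pts @ R_gt.T + t_gt                        # (K, 3) ground-truth transformed points
    pred = torch.einsum('kij,mj->kmi', R_pred, model_pts) + t_pred[:, None, :]  # (K, K, 3)
    # e_o^(i): for each prediction i, average over j of the distance to the closest predicted point.
    dist2 = ((gt[None, :, None, :] - pred[:, None, :, :]) ** 2).sum(-1)  # (K, K, K): i, j, k
    err = dist2.min(dim=2).values.mean(dim=1)             # (K,) per-prediction error e_o^(i)
    return (err * (conf - torch.log(conf))).mean()        # average over the K predictions
```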

By combining Eq. (5) and Eq. (4), the overall training objective function is written as:

\mathcal{L}=\mathcal{L}_{\rm{pose}}+\mu\cdot\mathcal{L}_{\rm{prn}},   (6)

where $\mu$ is the trade-off hyper-parameter.

4 Experiments

4.1 Datasets and Metrics

Extensive evaluation is made on three datasets: Linemod [11], Occlusion Linemod [2] and YCB-Video [40].

Linemod [11] is composed of 15 RGB-D videos of 15 low-textured objects. Following [32], 13 objects are considered and the standard training/testing split is adopted as in [38, 40]. Occlusion Linemod is collected by annotating a subset of Linemod (8 out of 15 objects), where each image has multiple occluded objects, making it more challenging. YCB-Video [40] includes 21 objects with various textures and sizes. It provides RGB-D images and detailed pose annotations. There are 130K real images from 92 videos and 80K synthetically rendered ones, and 16,189 real and all the synthesized images are used in training, according to [38].

Table 2: Comparison of ADD(-S) AUC (%) on Occlusion Linemod. Symmetric objects are marked in bold.
Object PoseCNN [40] DeepHeat [26] SS [12] Pix2pose [28] PVNet [29] HybridPose [34] PVN3D [9] Ours
Ape 9.6 12.1 17.6 22.0 15.8 20.9 33.9 40.2
Can 45.2 39.9 53.9 44.7 63.3 75.3 88.6 76.2
Cat 0.9 8.2 3.3 22.7 16.7 24.9 39.1 57.0
Driller 41.4 45.2 62.4 44.7 65.7 70.2 78.4 82.3
Duck 19.6 17.2 19.2 15.0 25.2 27.9 41.9 30.0
Eggbox 22.0 22.1 25.9 25.2 50.2 52.4 80.9 68.2
Glue 38.5 35.8 39.6 32.4 49.6 53.8 68.1 67.0
Holepuncher 22.1 36.0 21.3 49.5 39.7 54.2 74.7 97.2
MEAN 24.9 27.0 27.0 32.0 40.8 47.5 63.2 65.0
Table 3: Comparison of AUC (%) and ADD-S < 2cm (%) (“<2cm” for short) on YCB-Video. Symmetric objects are highlighted in bold.
PoseCNN+ICP [40] DenseFusion [38] PVN3D [9] CF [5] G2L [4] Ours
AUC <2cm AUC <2cm AUC <2cm AUC <2cm AUC AUC <2cm
002_master_chef_can 95.8 100.0 96.4 100.0 96.0 100.0 92.5 98.7 94.0 97.1 100.0
003_cracker_box 92.7 91.6 95.5 99.5 96.1 100.0 95.4 98.6 88.7 97.6 100.0
004_sugar_box 98.2 100.0 97.5 100.0 97.4 100.0 96.7 99.9 96.0 98.3 100.0
005_tomato_soup_can 94.5 96.9 94.6 96.9 96.2 98.1 92.0 95.8 86.4 95.3 97.6
006_mustard_bottle 98.6 100.0 97.2 100.0 97.5 100.0 94.8 97.5 95.9 97.9 100.0
007_tuna_fish_can 97.1 100.0 96.6 100.0 96.0 100.0 88.8 84.1 84.1 97.6 100.0
008_pudding_box 97.9 100.0 96.5 100.0 97.1 100.0 93.2 98.6 93.5 98.4 100.0
009_gelatin_box 98.8 100.0 98.1 100.0 97.7 100.0 95.7 100.0 96.8 96.2 94.4
010_potted_meat_can 92.7 93.6 91.3 93.1 93.3 94.6 86.2 83.9 86.2 96.6 99.1
011_banana 97.1 99.7 96.6 100.0 96.6 100.0 92.6 98.9 96.3 98.5 100.0
019_pitcher_base 97.8 100.0 97.1 100.0 97.4 100.0 95.4 98.4 91.8 98.1 100.0
021_bleach_cleanser 96.9 99.4 95.8 100.0 96.0 100.0 89.0 86.2 92.0 97.9 100.0
024_bowl 81.0 54.9 88.2 98.8 90.2 80.5 86.1 94.3 86.7 90.3 96.6
025_mug 95.0 99.8 97.1 100.0 97.6 100.0 93.5 94.8 95.4 98.1 100.0
035_power_drill 98.2 99.6 96.0 98.7 96.7 100.0 82.9 84.8 95.2 98.1 100.0
036_wood_block 87.6 80.2 89.7 94.6 90.4 93.8 92.3 99.6 86.2 96.0 100.0
037_scissors 91.7 95.6 95.2 100.0 96.7 100.0 90.1 89.5 83.8 96.7 100.0
040_large_marker 97.2 99.7 97.5 100.0 96.7 99.8 93.9 99.8 96.8 97.9 100.0
051_large_clamp 75.2 74.9 72.9 79.2 93.6 93.6 70.3 76.7 94.4 87.5 93.3
052_extra_large_clamp 64.4 48.8 69.8 76.3 88.4 83.6 69.5 74.5 92.3 79.7 84.6
061_foam_brick 97.2 100.0 92.5 100.0 96.8 100.0 94.6 100.0 94.7 97.8 100.0
MEAN 93.0 93.2 93.1 96.8 95.5 97.6 89.8 93.1 92.4 95.8 98.5

As in the literature, two main metrics are employed for evaluation, i.e., the Average Distance (ADD) [40] and the ADD-Symmetric (ADD-S) [40], designed for general objects and symmetric objects, respectively. DenseFusion [38] reports the percentage of predictions with ADD-S smaller than 2 centimeters (ADD-S<2cm), which matters for real applications, e.g., robotic manipulation, and PoseCNN [40] and DenseFusion [38] report the Area Under the ADD-S Curve (AUC) with the maximum threshold at 0.1 m. We also report these metrics for comparison.
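To make the two metrics concrete, the sketch below computes ADD and ADD-S for a single prediction given sampled model points; accuracy thresholds (e.g., 10% of the object diameter, or 2 cm for ADD-S<2cm) are applied on top of these distances. The point sampling and units are assumptions for illustration.

```python
import torch

def add_metric(R_gt, t_gt, R_pred, t_pred, model_pts):
    """ADD: mean distance between corresponding model points under the two poses."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R_pred.T + t_pred
    return (gt - pred).norm(dim=1).mean()

def adds_metric(R_gt, t_gt, R_pred, t_pred, model_pts):
    """ADD-S: mean closest-point distance, suited to symmetric objects."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R_pred.T + t_pred
    return torch.cdist(gt, pred).min(dim=1).values.mean()
```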

4.2 Implementation Details

We fix the size of the RGB images to 480×640. The numbers of raw/refined 3D points, i.e., $N$/$M$, are set to 100/512 on Linemod (and Occlusion Linemod) and 100/1,024 on YCB-Video. In MMF-GCN, the refined point clouds are down-sampled to 100 points by FPS. When building the graphs $\mathcal{G}_{f}$ and $\mathcal{G}_{ref}$, we utilize 30 nearest neighbors, i.e., $k=30$. The hyper-parameters $\lambda$ and $\beta$ in $\mathcal{L}_{\rm prn}$ and $\mu$ in the overall training loss $\mathcal{L}$ are set to 0.05, 0.95 and 1.0, respectively. To train PR-GCN in a more stable way, PRN and MMF-GCN are progressively optimized. For instance, on YCB-Video, PRN and MMF-GCN are first alternately trained for 15 epochs and then jointly optimized for 30 epochs.

For PRN training, we adopt the ADAM optimizer with a learning rate of 0.0001 and a batch size of 48. The remaining parts of PR-GCN are trained for 40, 20 and 60 epochs on Linemod, Occlusion Linemod and YCB-Video, respectively, where the learning rate is initially set to 0.0001 and decayed by a factor of 0.3 after half of the maximal epochs.
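A hedged sketch of the optimizer setup described above is shown below, using Adam and a step decay of 0.3 at the halfway point; the exact scheduler class and the PRNSketch stand-in for the full model are assumptions.

```python
import torch

model = PRNSketch()                      # stand-in for the trained network; see the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
max_epochs = 60                          # e.g., YCB-Video
# Decay the learning rate by a factor of 0.3 after half of the maximal epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[max_epochs // 2], gamma=0.3)

for epoch in range(max_epochs):
    # ... training pass over the dataset would go here ...
    scheduler.step()
```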

4.3 Comparison with the State-of-the-art Methods

Results on Linemod. We first compare PR-GCN to the state-of-the-art methods on Linemod, including the RGB based models PoseCNN (+DeepIM) [40, 18], PVNet [29], CDPN [19], DPOD [44] and DPVL [42], and the RGB-D based ones PointFusion [41], SSD6D (+ICP) [14], DenseFusion [38], PVN3D [9] and G2L [4]. Several approaches, denoted by ‘*’ or ‘†’ in Table 1, only adopt real or synthetic training data, whilst the others use both. In our work, we mainly consider the setting with both types of training data, and also report the performance with real data only.

Table 1 summarizes the ADD(-S) of different methods on Linemod. The RGB-D based deep models (e.g., PVN3D and G2L) outperform the RGB based ones by a large margin due to the additional geometry information given by the depth channel. Among the RGB-D counterparts, the proposed PR-GCN achieves the best performance, improving on PointFusion and DenseFusion by 25.2% and 4.6%, respectively. Our approach also surpasses the keypoint-based methods, including PVN3D and G2L. It is worth noting that the second best method, i.e., PVN3D, uses 70,000 synthetic training images and needs to train a separate model for each object category. In contrast, our method trains a universal model for all object categories and merely utilizes 3,500 synthetic training images, which is much more efficient than PVN3D.

Figure 4: Qualitative analysis. (a) Visualization results on YCB-Video. From left to right: results of DenseFusion (with 2 refinement iterations), PVN3D, PR-GCN (ours) and the ground truth (GT). Orange bounding boxes highlight inaccurate estimations. (b) Failure cases. Left: heavy occlusion (‘Holepuncher’ from Occlusion Linemod); right: a symmetric object (‘Bowl’ from YCB-Video).

Results on Occlusion Linemod. To evaluate the robustness of PR-GCN to inter-object occlusions, we report detailed results on Occlusion Linemod, in comparison with PoseCNN [40], DeepHeat [26], SS [12], Pix2Pose [28], PVNet [29], HybridPose [34] and PVN3D [9]. As shown in Table 2, our method reaches the top ADD(-S) AUC on several objects and achieves the best mean ADD(-S) AUC, highlighting its superiority in the presence of heavy occlusions.

Results on YCB-Video. We then extend our analysis to this database and compare PR-GCN with PoseCNN (+ICP) [40], DenseFusion [38], PVN3D [9], CF [5] and G2L [4]. Table 3 shows the AUC and ADD-S<2cm of the various methods. It can be observed that our method achieves the highest performance on both metrics. For instance, compared to PVN3D and DenseFusion, PR-GCN improves the ADD-S<2cm by 0.9% and 1.7%, respectively.

Qualitative results. We additionally provide qualitative results in Fig. 4, comparing to DenseFusion and PVN3D. Due to cluttered backgrounds and severe occlusions, DenseFusion and PVN3D predict inaccurate poses in many cases, while our PR-GCN performs more robustly with much better results. We also demonstrate failure cases in Fig. 4, revealing that PR-GCN fails when dealing with extremely occluded objects and some symmetric ones.

Inference efficiency. Besides accuracy, we evaluate the efficiency of our method on Linemod. As shown in Table 4, each key component runs fast, and the full pipeline takes 68 ms on an Nvidia 1080Ti GPU, which is acceptable for downstream tasks such as robotic grasping.

Table 4: Inference time of Segmentation (Seg), Point Refinement (PR), Pose Estimation (PE) and full PR-GCN (Full) on Linemod.
Component Seg PR PE Full
Time (s) 0.030 0.008 0.030 0.068
Table 5: Ablation study of PR-GCN in ADD(-S) (%) on Linemod.
Method PRN MMF-GCN MEAN
Baseline (with DGCNN) × × 94.8
Baseline+PRN ✓ × 96.8
Baseline+MMF-GCN × ✓ 96.9
Full model ✓ ✓ 98.9
Table 6: Generalization of PRN and MMF-GCN to other frameworks in terms of ADD-S (%) and <2cm (%) on YCB-Video.
Method PVN3D DenseFusion
Metric ADD-S <2cm ADD-S <2cm
Original model 95.5 97.6 93.1 96.8
w/ PRN - - 94.1 97.2
w/ MMF-GCN 96.2 98.4 93.5 97.2
w/ both - - 94.9 98.1
Table 7: The influence of segmentation on different frameworks on YCB-Video in terms of AUC (%) and <2cm (%) (‘-’ indicates that the result is not reported).
PoseCNN segmentation PVN3D segmentation GT segmentation
PoseCNN DenseFusion Ours DenseFusion PVN3D Ours Densefusion PVN3D Ours
AUC 93.0 93.1 95.0 91.8 95.5 95.8 94.5 96.4 96.9
<2cm 93.2 96.8 97.6 92.8 97.6 98.5 98.1 - 99.9
Table 8: Ablation study of the multi-resolution loss on YCB-Video in terms of ADD-S (%) and <2cm (%).
WO-PRN PRN-SR PRN-MR
ADD-S 94.0 94.6 95.8
< 2cm 97.1 96.6 98.5

4.4 Ablation Study

We comprehensively validate individual components of PR-GCN in the following.

The impact of PRN and MMF-GCN. The baseline method removes PRN and replaces MMF-GCN with DGCNN [39], which adopts the same basic point cloud aggregator as our PR-GCN. As Table 5 shows, PRN boosts the baseline by 2.0%, indicating that refined point clouds contribute to pose estimation, while MMF-GCN achieves an improvement of 2.1%, demonstrating its advantage in integrating multi-modal features. The combination of PRN and MMF-GCN further enhances the performance.

The generalizability of PRN and MMF-GCN. We apply the PRN and MMF-GCN modules to two state-of-the-art frameworks, PVN3D [9] and DenseFusion [38], and evaluate their performance on YCB-Video. Note that PVN3D cannot utilize PRN directly, since it requires segmentation of the whole scene while PRN focuses on specific objects; we thus only evaluate the effect of MMF-GCN on PVN3D. As shown in Table 6, PRN improves the ADD-S of DenseFusion by 1%, and a similar improvement is observed when applying MMF-GCN. The results indicate that PRN and MMF-GCN benefit other frameworks for 6D pose estimation.

The influence of segmentation. As in Fig. 2, our framework introduces RGB-based segmentation to extract foreground objects, while PoseCNN [40], DenseFusion [38] and PVN3D [9] adopt different instance segmentation models. To validate the effect of segmentation, we replace the segmentation model in our framework with the counterparts used in PoseCNN and PVN3D as well as with the ground truth, and report the AUC and ADD-S<2cm metrics on YCB-Video. Similarly, we evaluate this factor on the other frameworks, including PoseCNN, DenseFusion and PVN3D. As reported in Table 7, all these frameworks achieve their highest AUC and ADD-S<2cm using ground-truth segmentation, indicating that better segmentation boosts the estimation accuracy. Meanwhile, with each segmentation alternative, our framework consistently outperforms the others, showing that PR-GCN is superior regardless of which segmentation model is used.

Figure 5: Visualization of the refined 3D points generated by PRN with/without the multi-resolution regression loss.

The effect of the multi-resolution regression loss on PRN. We finally validate the contribution of the regularized multi-resolution regression loss $\mathcal{L}_{\rm{prn}}$. For comparison, we apply the loss to $\hat{\bm{P}}_{o}$ only, denoted by PRN-SR, while the multi-resolution case is denoted by PRN-MR. We also report the result without PRN (WO-PRN). As summarized in Table 8, adopting the single-resolution loss already promotes the performance of our method. When the multi-resolution loss is added, ADD-S is further boosted to 95.8%, demonstrating its effectiveness. Moreover, we visualize the refined 3D points in Fig. 5, and the results clearly show the advantage of PRN in handling incompleteness and noise once the loss $\mathcal{L}_{\rm{prn}}$ is added.

5 Conclusion

In this paper, we propose a novel approach, namely the deep Graph Convolutional Network with Point Refinement (PR-GCN), for RGB-D based 6D pose estimation. We develop a Point Refinement Network (PRN) to improve the quality of the depth representation, together with a Multi-Modal Fusion Graph Convolutional Network (MMF-GCN) to fully explore local geometry-aware inter-modality correlations for sufficient combination. Extensive experiments validate the superiority of PR-GCN as well as the effectiveness and generalizability of the PRN and MMF-GCN modules.

Acknowledgment

This work is partly supported by the National Natural Science Foundation of China (No. 62022011), the Research Program of State Key Laboratory of Software Development Environment (SKLSDE-2021ZX-04), and the Fundamental Research Funds for the Central Universities.

References

  • [1] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3):346–359, 2008.
  • [2] Eric Brachmann, Alexander Krull, Frank Michel, Stefan Gumhold, Jamie Shotton, and Carsten Rother. Learning 6d object pose estimation using 3d object coordinates. In European Conference on Computer Vision, pages 536–551, 2014.
  • [3] Eric Brachmann, Alexander Krull, Frank Michel, Stefan Gumhold, Jamie Shotton, and Carsten Rother. Learning 6d object pose estimation using 3d object coordinates. In European Conference on Computer Vision, pages 536–551, 2014.
  • [4] Wei Chen, Xi Jia, Hyung Jin Chang, Jinming Duan, and Ales Leonardis. G2l-net: Global to local network for real-time 6d pose estimation with embedding vector features. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4232–4241, 2020.
  • [5] Yi Cheng, Hongyuan Zhu, Cihan Acar, Wei Jing, Yan Wu, Liyuan Li, Cheston Tan, and Joo-Hwee Lim. 6d pose estimation with correlation fusion. CoRR, abs/1909.12936, 2019.
  • [6] Alvaro Collet, Manuel Martinez, and Siddhartha S. Srinivasa. The MOPED framework: Object recognition and pose estimation for manipulation. International Journal of Robotics Research, 30(10):1284–1306, 2011.
  • [7] Chunhui Gu and Xiaofeng Ren. Discriminative mixture-of-templates for viewpoint classification. In European Conference on Computer Vision, volume 6315, pages 408–421, 2010.
  • [8] Andrew Harltey and Andrew Zisserman. Multiple view geometry in computer vision (2. ed.). Cambridge University Press, 2006.
  • [9] Yisheng He, Wei Sun, Haibin Huang, Jianran Liu, Haoqiang Fan, and Jian Sun. PVN3D: A deep point-wise 3d keypoints voting network for 6dof pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 11629–11638, 2020.
  • [10] Stefan Hinterstoisser, Stefan Holzer, Cedric Cagniart, Slobodan Ilic, Kurt Konolige, Nassir Navab, and Vincent Lepetit. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In IEEE International Conference on Computer Vision, pages 858–865, 2011.
  • [11] Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary R. Bradski, Kurt Konolige, and Nassir Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian Conference on Computer Vision, volume 7724, pages 548–562, 2012.
  • [12] Yinlin Hu, Pascal Fua, Wei Wang, and Mathieu Salzmann. Single-stage 6d object pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2927–2936, 2020.
  • [13] Yinlin Hu, Joachim Hugonot, Pascal Fua, and Mathieu Salzmann. Segmentation-driven 6d object pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3385–3394, 2019.
  • [14] Wadim Kehl, Fabian Manhardt, Federico Tombari, Slobodan Ilic, and Nassir Navab. SSD-6D: making rgb-based 3d detection and 6d pose estimation great again. In IEEE International Conference on Computer Vision, pages 1530–1538, 2017.
  • [15] Alex Kendall, Matthew Grimes, and Roberto Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In IEEE International Conference on Computer Vision, pages 2938–2946, 2015.
  • [16] Vincent Lepetit and Pascal Fua. Monocular model-based 3d tracking of rigid objects: A survey. Foundations and Trends in Computer Graphics and Vision, 1(1), 2005.
  • [17] Chi Li, Jin Bai, and Gregory D. Hager. A unified framework for multi-view multi-class object pose estimation. In European Conference on Computer Vision, volume 11220, pages 263–281, 2018.
  • [18] Yi Li, Gu Wang, Xiangyang Ji, Yu Xiang, and Dieter Fox. Deepim: Deep iterative matching for 6d pose estimation. In European Conference on Computer Vision, volume 11210, pages 695–711, 2018.
  • [19] Zhigang Li, Gu Wang, and Xiangyang Ji. CDPN: coordinates-based disentangled pose network for real-time rgb-based 6-dof object pose estimation. In IEEE International Conference on Computer Vision, pages 7677–7686, 2019.
  • [20] Xingyu Liu, Rico Jonschkowski, Anelia Angelova, and Kurt Konolige. Keypose: Multi-view 3d labeling and keypoint estimation for transparent objects. In IEEE Conference on Computer Vision and Pattern Recognition, pages 11599–11607, 2020.
  • [21] David G. Lowe. Object recognition from local scale-invariant features. In IEEE International Conference on Computer Vision, pages 1150–1157, 1999.
  • [22] Fabian Manhardt, Wadim Kehl, Nassir Navab, and Federico Tombari. Deep model-based 6d pose refinement in RGB. In European Conference on Computer Vision, pages 833–849, 2018.
  • [23] Eric Marchand, Hideaki Uchiyama, and Fabien Spindler. Pose estimation for augmented reality: A hands-on survey. IEEE Transactions on Visualization and Computer Graphics, 22(12):2633–2651, 2016.
  • [24] Frank Michel, Alexander Kirillov, Eric Brachmann, Alexander Krull, Stefan Gumhold, Bogdan Savchynskyy, and Carsten Rother. Global hypothesis generation for 6d object pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 115–124, 2017.
  • [25] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, volume 9912, pages 483–499, 2016.
  • [26] Markus Oberweger, Mahdi Rad, and Vincent Lepetit. Making deep heatmaps robust to partial occlusions for 3d object pose estimation. In European Conference on Computer Vision, volume 11219, pages 125–141, 2018.
  • [27] Keunhong Park, Arsalan Mousavian, Yu Xiang, and Dieter Fox. Latentfusion: End-to-end differentiable reconstruction and rendering for unseen object pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10710–10719, 2020.
  • [28] Kiru Park, Timothy Patten, and Markus Vincze. Pix2pose: Pixel-wise coordinate regression of objects for 6d pose estimation. In IEEE International Conference on Computer Vision, pages 7667–7676, 2019.
  • [29] Sida Peng, Yuan Liu, Qixing Huang, Xiaowei Zhou, and Hujun Bao. Pvnet: Pixel-wise voting network for 6dof pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4561–4570, 2019.
  • [30] Charles R. Qi, Or Litany, Kaiming He, and Leonidas J. Guibas. Deep hough voting for 3d object detection in point clouds. In IEEE International Conference on Computer Vision, pages 9276–9285, 2019.
  • [31] Charles R. Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J. Guibas. Frustum pointnets for 3d object detection from RGB-D data. In IEEE Conference on Computer Vision and Pattern Recognition, pages 918–927, 2018.
  • [32] Mahdi Rad and Vincent Lepetit. BB8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In IEEE International Conference on Computer Vision, pages 3848–3856, 2017.
  • [33] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Conference on Neural Information Processing Systems, pages 91–99, 2015.
  • [34] Chen Song, Jiaru Song, and Qixing Huang. Hybridpose: 6d object pose estimation under hybrid representations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 428–437, 2020.
  • [35] Bugra Tekin, Sudipta N. Sinha, and Pascal Fua. Real-time seamless single shot 6d object pose prediction. In IEEE Conference on Computer Vision and Pattern Recognition, pages 292–301, 2018.
  • [36] Shubham Tulsiani and Jitendra Malik. Viewpoints and keypoints. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1510–1519, 2015.
  • [37] Kentaro Wada, Edgar Sucar, Stephen James, Daniel Lenton, and Andrew J. Davison. Morefusion: Multi-object reasoning for 6d pose estimation from volumetric fusion. In IEEE Conference on Computer Vision and Pattern Recognition, pages 14528–14537, 2020.
  • [38] Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martin Martin, Cewu Lu, Li Fei-Fei, and Silvio Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3343–3352, 2019.
  • [39] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.
  • [40] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. In Robotics: Science and Systems Conference, 2018.
  • [41] Danfei Xu, Dragomir Anguelov, and Ashesh Jain. Pointfusion: Deep sensor fusion for 3d bounding box estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 244–253, 2018.
  • [42] Xin Yu, Zheyu Zhuang, Piotr Koniusz, and Hongdong Li. 6dof object pose estimation via differentiable proxy voting loss. CoRR, abs/2002.03923, 2020.
  • [43] Wentao Yuan, Tejas Khot, David Held, Christoph Mertz, and Martial Hebert. PCN: point completion network. In IEEE International Conference on 3D Vision, pages 728–737, 2018.
  • [44] Sergey Zakharov, Ivan Shugurov, and Slobodan Ilic. DPOD: 6d pose object detector and refiner. In IEEE International Conference on Computer Vision, pages 1941–1950, 2019.
  • [45] Andy Zeng, Kuan-Ting Yu, Shuran Song, Daniel Suo, Ed Walker Jr., Alberto Rodriguez, and Jianxiong Xiao. Multi-view self-supervised deep learning for 6d pose estimation in the amazon picking challenge. In IEEE International Conference on Robotics and Automation, pages 1383–1386, 2017.
  • [46] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6230–6239, 2017.
  • [47] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.