
Robust Point Cloud Registration Framework Based on Deep Graph Matching

Kexue Fu*, Jiazheng Luo*, Xiaoyuan Luo, Shaolei Liu, Chenxi Zhang🖂, Manning Wang🖂
Kexue Fu, Jiazheng Luo, Xiaoyuan Luo, Shaolei Liu, Chenxi Zhang and Manning Wang are with the Digital Medical Research Center, School of Basic Medical Science, Fudan University, Shanghai 200032, China, and also with the Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Shanghai 200032, China.
* These authors have contributed equally to this work.
🖂Corresponding author.
E-mail: {fukexue, jzluo20, xyluo19, liushaolei, chenxizhang, mnwang}@fudan.edu.cn
Manuscript received xxxxx xxxx, xxxx; revised xxxx xx, xxxx.
Abstract

3D point cloud registration is a fundamental problem in computer vision and robotics. Recently, learning-based point cloud registration methods have made great progress. However, these methods are sensitive to outliers, which lead to a large number of incorrect correspondences. In this paper, we propose a novel deep graph matching-based framework for point cloud registration. Specifically, we first transform the point clouds into graphs and extract deep features for each point. Then, we develop a module based on deep graph matching to calculate a soft correspondence matrix. By using graph matching, not only the local geometry of each point but also its structure and topology in a larger range are considered in establishing correspondences, so that more correct correspondences are found. We train the network with a loss directly defined on the correspondences, and in the test stage the soft correspondences are transformed into hard one-to-one correspondences so that registration can be performed by a correspondence-based solver. Furthermore, we introduce a transformer-based method to generate edges for graph construction, which further improves the quality of the correspondences. Extensive experiments on object-level and scene-level benchmark datasets show that the proposed method achieves state-of-the-art performance. The code is available at: https://github.com/fukexue/RGM.

Index Terms:
Point Cloud Registration, Graph Matching, Correspondence.

1 Introduction

Rigid point cloud registration is the task of finding a rigid transformation to align two point clouds, and it has long been a fundamental task in computer vision and robotics, with many important applications, such as autonomous driving [1, 2, 3], surgical navigation [4] and SLAM [5, 6]. There are two interlocked subproblems in point cloud registration: finding the transformation to align the two point clouds and finding the correspondences between the points [7]. Although either subproblem is easy to solve once the solution to the other is known, it is difficult to solve both together. Point cloud registration becomes even harder when there are outliers, i.e., points with no correspondences in the other point cloud. Outliers may come from imperfections of the sensors used to collect the point clouds or from situations in which the two point clouds to be registered do not fully overlap.

Figure 1: The idea of point cloud registration based on graph matching. Dashed lines represent correspondences. Point features and graph features are the features extracted directly from points and the features extracted based on graphs, respectively. The two points $x_{i}$ and $y_{j}$ have similar point features because they have similar local geometries, but they have different graph features because the graph topologies around them are different, so they are not mismatched when graph-based matching is used.

Iterative closest point (ICP) [8, 9] is arguably the most widely used method for rigid point cloud registration; it starts from an initial transformation and alternately updates the correspondences and the transformation. One major limitation of ICP is that it can only converge to a local optimum near the initialization, and its convergence basin is fairly small, especially in the presence of noise and outliers. A series of global registration methods based on branch-and-bound (BnB) [10, 11, 12] have been proposed to remove the need for initialization by seeking the globally optimal solution, but the time-consuming BnB limits their practical applications. Another way to mitigate the need for initialization is through keypoint extraction and matching [13, 14]. However, these methods [13, 14] are sensitive to outliers and repetitive geometry [15].

Recent registration studies have been dominated by learning-based methods, most of which establish an end-to-end trainable network. Some of them learn global features of each point cloud and regress the transformation directly from those global features [16, 17]. Other methods integrate neural networks that predict soft correspondences with a differentiable singular value decomposition (SVD) algorithm to solve for the transformation, such as DCP [18], RPM-Net [19] and IDAM [20], and they do not need transformation initialization. Although these end-to-end methods have made great progress in object-level registration, they do not work well in scene-level registration [21]. Generally, scene-level registration [21, 22, 23, 24, 25] begins with a neural network that predicts hard correspondences and then solves the transformation using RANdom SAmple Consensus (RANSAC) [26]. These methods explore deep features to establish correspondences, but the discrimination ability of the features extracted from point clouds is limited, as shown in Figure 1, which leads to a large proportion of outliers in the putative correspondences and consequently degrades the registration accuracy.

In this paper, we propose a robust point cloud registration framework that utilizes deep graph matching to better handle outliers, and we denote it as RGM (Registration by Graph Matching). By constructing graphs from the point clouds to be registered and capturing the high-order structure of the graphs, RGM can find robust and accurate point-to-point correspondences to better solve the point cloud registration problem. To the best of our knowledge, this is the first time that deep graph matching has been applied to point cloud registration. RGM contains an end-to-end deep neural network, the first part of which is a feature extractor that extracts deep local features for each point from its neighboring points. Instead of matching these local point features directly, we construct a graph for each of the two point clouds and embed both the graph nodes (local features for each point) and the graph structure (second-order or higher-order structure) into the node feature space [27]. Then, we introduce a module consisting of an affinity layer, instance normalization and Sinkhorn to predict soft correspondences from the node features of the two graphs, and we denote it as the AIS module. By using graph matching in the AIS module, not only the local geometry of each node but also its structure and topology in a larger range are considered in establishing correspondences, so that more correct correspondences are found. In training, the focal loss between the predicted soft correspondences and the ground-truth correspondences is adopted, which directly promotes the network to learn better point-to-point correspondences. In testing, we use a linear assignment problem (LAP) solver [28] based on the Hungarian algorithm [29] to transform the soft correspondences into one-to-one hard correspondences, and then a correspondence-based estimator is employed to calculate the transformation from the hard correspondences. Similar to existing methods such as RPM-Net and ICP, we iteratively optimize the registration results.

Our main contributions are as follows:

  • We propose using deep graph matching to solve the point cloud registration problem for the first time. Instead of only using the features of each point, graph matching can leverage the features of other nodes and the structural information of graphs when establishing correspondences so that it can better address the problem of outliers.

  • We propose using a transformer to generate soft graph edges at the beginning of each iteration. Utilizing the attention and co-attention mechanisms in the transformer, increasingly many meaningful edges are built in the overlapping regions of the two point clouds. As a result, better correspondences can be established for the overlapping parts when registering partial-to-partial point clouds.

  • We introduce the AIS module to establish reliable correspondences between nodes of two given graphs. The AIS module calculates an affinity matrix between the nodes of the two graphs based on the embedded features, and by analyzing the affinity matrix globally and utilizing the Sinkhorn algorithm, it can effectively reduce the proportion of incorrect correspondences.

  • Our method achieves state-of-the-art performance on both object-level and scene-level benchmark datasets.

A preliminary version of this paper was published as a conference paper [30]. This paper significantly improves on the previous work, making it applicable to more difficult scenarios and achieving better performance. Concretely, (i) we change the binary cross-entropy loss to the focal loss [31] to make the network pay more attention to the hardest correspondences; (ii) instead of only using SVD as the transformation estimator, we extend the framework to use a more robust correspondence-based estimator, RANSAC; (iii) we add a new local feature extractor for scene-level data; and (iv) many new experiments are conducted on the improved method, including cross-dataset generalization, registration under full-range rotation, and scene-level registration experiments on 3DMatch and 3DLoMatch. Experimental results show that our method significantly outperforms the state-of-the-art methods.

2 Related Work

2.1 Traditional Registration Method

A large proportion of traditional methods need an initial transformation and find a locally optimal solution near the initialization, among which ICP [8, 9] is an early and representative method. ICP starts with an initial transformation and iteratively alternates between solving two trivial subproblems: finding the closest points as correspondences under the current transformation, and computing the optimal transformation by SVD from the found correspondences. Many variants have been proposed to improve ICP [32, 33, 34]. Nevertheless, ICP and its variants can only converge to a local optimum, and their success heavily relies on a good initialization. To improve the robustness to noise and outliers and enlarge the convergence basin, some methods transform point clouds into probability distributions and reformulate point cloud registration as matching two probability distributions, such as GMM [35] and HGMR [36]. These methods do not need to alternately solve for correspondences and transformation, but their objective functions are nonconvex, so they still need a good initialization to avoid converging to a bad local optimum. Recently, a series of globally optimal methods based on BnB have been proposed, such as Go-ICP [10], GOGMA [11], GOSMA [12], and GoTS [37], but they are very slow and only practical in some limited scenarios. Another line of work avoids transformation initialization by establishing correspondences. These methods usually first extract keypoints from the original point clouds, construct feature descriptors for them, and then establish potential correspondences through feature matching [13, 14]. After that, RANSAC-like algorithms can be used to find the correct correspondences for registration. Different from RANSAC-like methods, FGR [38] optimizes a correspondence-based objective function with a graduated nonconvexity strategy and achieves state-of-the-art performance in correspondence-based point cloud registration. However, correspondence-based methods are sensitive to duplicate structures and partial-to-partial point clouds because a large proportion of the potential correspondences will be incorrect in these scenarios. In summary, the lack of a good initialization, a large proportion of outliers and time constraints remain major challenges for traditional point cloud registration methods.

2.2 Direct Deep Registration Method

The development of deep learning for point cloud processing allows researchers to make good use of existing techniques, such as PointNet [39] and DGCNN [40], to extract point cloud features for downstream tasks. These studies have stimulated interest in using deep learning for point cloud registration. One of the earliest works is PointNetLK [41], which calculates global feature descriptors of the two point clouds through PointNet and iteratively uses the IC-LK algorithm [42, 43] to minimize the distance between the two global feature descriptors to achieve registration. PCRNet [15] replaces the IC-LK algorithm in PointNetLK with a deep neural network. DCP [18] utilizes a transformer [44, 45] to compute soft correspondences between two point clouds and a differentiable SVD algorithm to calculate the transformation. Although these methods are fast and some of them do not need transformation initialization, they cannot effectively handle partial-to-partial point cloud registration. PRNet [46] proposes a keypoint detector and uses keypoint-to-keypoint correspondences in a self-supervised way to solve partial-to-partial point cloud registration. DeepGMR [47] extracts pose-invariant correspondences between raw point clouds and Gaussian mixture model (GMM) parameters, and then recovers the transformation from the matched Gaussian mixture models. IDAM [20] integrates an iterative distance-aware similarity convolution module into the matching process, which can overcome the shortcomings of using inner products to obtain pointwise similarity. RPM-Net [19] proposes a network to predict optimal annealing parameters and uses annealing and Sinkhorn [48] to obtain soft correspondences from local features. Soft correspondences can increase robustness, but they decrease registration accuracy, as shown in our experiments on clean point clouds. Although these methods can handle partial-to-partial point cloud registration to some extent, there is still room for improvement in their accuracy and robustness. The difference between our method and existing learning-based methods is that we construct graphs from the original point clouds and merge the structural information of the graphs into the node features so that the nodes can be better matched.

2.3 Feature-based Deep Registration Method

Most of the direct point cloud registration methods used in object-level registration do not work well in scene-level registration. For scene-level registration, it has become popular to first build correspondences based on learned local feature descriptors of the point clouds and then compute the transformation using the robust estimator RANSAC [26]. These methods focus on how to obtain more efficient and robust local features. For example, 3DMatch [24] converts point cloud patches into volumetric voxel grids and uses a 3D convolutional network to extract local features. The benchmark dataset presented in 3DMatch [24] is popular for evaluating scene-level registration algorithms. PPFNet [49] presents a 3D local patch descriptor with an augmented set of simple geometric relationships: points, normals and point pair features (PPF). These two methods [24, 49] rely on the supervision of correspondences between patches. Later methods utilize self-supervised or unsupervised learning for feature extraction. For example, PPF-FoldNet [50] and CapsuleNet [51] train an autoencoder to reconstruct the original input point clouds without supervision and use the trained encoder for feature extraction. However, features are extracted only for predefined patches, so the perception field of these methods [24, 49, 50, 51] is greatly limited. Therefore, computing a feature descriptor for each point is a better option, and the key issue is how to efficiently extract point features for large scene-level point clouds. FCGF [22] uses 3D sparse convolution to process the complete point cloud and generate dense feature predictions, which is more efficient than patch-based methods. Similar to FCGF, D3Feat [23] uses the fully convolutional neural network KPConv [52] to compute dense features, and it jointly predicts dense features and a point saliency score. Predator [21] utilizes a cross-attention mechanism to obtain more robust feature descriptors. StickyPillars [53] uses a graph neural network and the Sinkhorn algorithm to solve outdoor scene registration. Different from Predator and StickyPillars, we design an edge generator so that the features of points in overlapping areas are updated using the features of the overlapping areas, while the features of points in non-overlapping areas are updated only by themselves. This prevents the features of overlapping areas from being disturbed by the features of non-overlapping areas.

2.4 Learning for Graph Matching

Graph matching has been widely studied in computer vision and pattern recognition [54, 55, 56, 57, 58]. Recently, learning-based graph matching has attracted considerable research interest [59, 60, 27], but, to the best of our knowledge, there is no research on using learning-based graph matching to solve the point cloud registration problem.

Figure 2: The pipeline of the proposed 3D rigid point cloud registration framework, RGM, where $\bigoplus$ represents feature concatenation and $\bigotimes$ represents matrix multiplication. The solid lines are the data flow of both training and testing, and the dashed lines are the data flow that exists only in testing.

3 Problem Formulation

3D rigid point cloud registration refers to estimating a rigid transformation $\mathbf{T}\in\mathbf{SE}(3)$ with parameters $\mathbf{R}\in\mathbf{SO}(3)$ and $\mathbf{t}\in\mathbb{R}^{3}$ to align a source point cloud $\mathbf{X}=\{x_{i}\in\mathbb{R}^{3}\,|\,i=1,\cdots,N\}$ and a target point cloud $\mathbf{Y}=\{y_{j}\in\mathbb{R}^{3}\,|\,j=1,\cdots,M\}$. $N$ and $M$ represent the number of points in $\mathbf{X}$ and $\mathbf{Y}$, respectively. The correspondences between points in $\mathbf{X}$ and $\mathbf{Y}$ are represented by a matrix $\mathbf{C}\in\{0,1\}^{N\times M}$. If $x_{i}$ and $y_{j}$ are a pair of corresponding points, $\mathbf{C}_{i,j}$ is 1; otherwise, it is 0. We first consider the simple case where there are strict one-to-one correspondences between points in $\mathbf{X}$ and $\mathbf{Y}$, in which case $N=M$. The rigid point cloud registration problem can be formulated as minimizing the following objective function:

\bm{e}(\mathbf{C},\mathbf{R},\mathbf{t})=\sum_{i}^{N}\sum_{j}^{M}\mathbf{C}_{i,j}\left\|\mathbf{R}x_{i}+\mathbf{t}-y_{j}\right\|_{2}^{2}, (1)

subject to $\sum_{j}^{M}\mathbf{C}_{i,j}=1,\forall i$, $\sum_{i}^{N}\mathbf{C}_{i,j}=1,\forall j$, and $\mathbf{C}_{i,j}\in\{0,1\},\forall i,j$. In the more difficult case where there are no one-to-one correspondences, the equality constraints no longer hold, and they become inequality constraints. We can introduce slack variables in $\mathbf{C}$ as in [19] to convert the inequality constraints back into equality constraints. The row constraints are converted as follows, and the column constraints are converted similarly:

\sum_{j}^{M}\mathbf{C}_{i,j}\leq 1,\forall i\ \rightarrow\ \sum_{j}^{M+1}\mathbf{C}_{i,j}=1,\forall i\leq N. (2)

Please note that $\mathbf{C}$ becomes an $(N+1)\times(M+1)$ matrix after introducing one row and one column of slack variables, and the sums of the added row and column are not restricted to be one.

In this paper, we use an end-to-end neural network to predict $\mathbf{C}$. Once we know the correspondences, the rigid transformation can be obtained by a correspondence-based solver.
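
A minimal NumPy sketch may help make the formulation concrete: it evaluates the objective of Eq. 1 for a given correspondence matrix and pads $\mathbf{C}$ with one slack row and one slack column in the spirit of Eq. 2. The function names are ours, and the way the slack mass is filled here is only an illustration (it assumes the row and column sums of $\mathbf{C}$ do not exceed one).

```python
import numpy as np

def registration_objective(C, R, t, X, Y):
    """Evaluate Eq. 1: sum_ij C_ij * ||R x_i + t - y_j||_2^2.

    X: (N, 3) source points, Y: (M, 3) target points,
    C: (N, M) correspondence matrix, R: (3, 3) rotation, t: (3,) translation.
    """
    X_aligned = X @ R.T + t                                            # (N, 3)
    sq_dist = ((X_aligned[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # (N, M)
    return (C * sq_dist).sum()

def pad_with_slack(C):
    """Append one slack row and one slack column (Eq. 2) so that every row and
    column of the padded matrix can still sum to one even when a point is an
    outlier with no true correspondence."""
    N, M = C.shape
    padded = np.zeros((N + 1, M + 1), dtype=float)
    padded[:N, :M] = C
    padded[:N, M] = 1.0 - C.sum(axis=1)   # leftover row mass goes to "no match"
    padded[N, :M] = 1.0 - C.sum(axis=0)   # leftover column mass goes to "no match"
    return padded
```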

4 RGM

Figure 2 (a) shows the overall pipeline of RGM. RGM consists of five components: local feature extractor, edge generator, graph feature extractor & AIS module, Soft2Hard Correspondence Solver and Correspondence-based Estimator. During training, we use the shared local feature extractor to extract local features for each point in $\mathbf{X}$ and $\mathbf{Y}$, and take these local features as the node features $\mathcal{F}$ of the initial graphs. Next, the edge generator generates edges and builds the source graph and target graph, and the graphs are input into the graph feature extractor, which processes the two graphs, outputs new node features $\mathcal{F}^{\prime}$ and uses them to update $\mathcal{F}$. The AIS module predicts the soft correspondence matrix $\widetilde{\mathbf{C}}$ between nodes of the two graphs. By applying blocks composed of these three modules (edge generator, graph feature extractor and AIS module), with the same structure but different weights, $L$ times, we obtain node features $\mathcal{F}$ with better discrimination capability and a more accurate soft correspondence matrix $\widetilde{\mathbf{C}}$. Finally, we adopt the focal loss between $\widetilde{\mathbf{C}}$ and the ground-truth correspondences to train the network. During testing, the two point clouds are first input into the network to obtain the soft correspondence matrix $\widetilde{\mathbf{C}}$. Then, the soft correspondences are converted to hard correspondences using the Soft2Hard Correspondence Solver, and the transformation is solved by a Correspondence-based Estimator. Similar to ICP, we also update the transformation iteratively during test time. The details of each component are explained in the following subsections.

4.1 Local Feature Extractor

To establish the correspondence matrix between two point clouds, it is necessary to embed the source point cloud $\mathbf{X}$ and the target point cloud $\mathbf{Y}$ into a common feature space. We choose different backbones to extract high-dimensional features to better fit different kinds of datasets. For object-level datasets, we first define the local feature descriptor $\mathcal{P}_{x_{i}}$ of $x_{i}$ as $\mathcal{P}_{x_{i}}=\{(x_{i},x_{n})\mid\forall x_{n}\in\mathcal{K}_{i}\}$, where $\mathcal{K}_{i}$ represents the $K$-nearest neighboring points of $x_{i}$. The low-dimensional local feature descriptors are then mapped to a high-dimensional local feature space through a nonlinear function $f_{\theta}$: $\mathbb{R}^{K\times 6}\rightarrow\mathbb{R}^{V}$, where $V$ is the dimensionality of the final high-dimensional local feature and $\theta$ represents the parameters of the nonlinear function, which consists of a shared multilayer perceptron (MLP), max-pooling and concatenation. The implementation of $f_{\theta}$ can be found in Section 1 of the supplementary materials. We use the high-dimensional local features as the node features $\mathcal{F}$ of the initial graph. The node feature $\mathcal{F}_{x_{i}}$ of $x_{i}$ can be expressed as $\mathcal{F}_{x_{i}}=f_{\theta}(\mathcal{P}_{x_{i}})$, $\mathcal{F}_{x_{i}}\in\mathbb{R}^{V}$.
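
As an illustration, the sketch below implements a simplified $f_{\theta}$ in PyTorch: the $K$-nearest-neighbor pairs $(x_{i},x_{n})$ are encoded by a shared MLP and max-pooled over the neighborhood. The layer sizes and class names are our assumptions; the actual network (including its concatenation of intermediate features) is specified in the supplementary materials.

```python
import torch
import torch.nn as nn

def knn_descriptor(points, K=20):
    """Build P_{x_i} = {(x_i, x_n) | x_n in KNN(x_i)}: concatenate each point's
    coordinates with those of its K nearest neighbors. (N, 3) -> (N, K, 6)."""
    dist = torch.cdist(points, points)                        # (N, N) pairwise distances
    knn_idx = dist.topk(K + 1, largest=False).indices[:, 1:]  # drop the point itself
    neighbors = points[knn_idx]                               # (N, K, 3)
    center = points.unsqueeze(1).expand(-1, K, -1)            # (N, K, 3)
    return torch.cat([center, neighbors], dim=-1)             # (N, K, 6)

class LocalFeatureExtractor(nn.Module):
    """A toy f_theta: shared MLP over the (x_i, x_n) pairs followed by
    max-pooling over the K neighbors, mapping R^{K x 6} -> R^V."""
    def __init__(self, V=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, V),
        )

    def forward(self, points):
        desc = self.mlp(knn_descriptor(points))   # (N, K, V)
        return desc.max(dim=1).values             # (N, V) node features F_{x_i}
```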

Unlike object-level point clouds, scene-level point clouds contain a great number of points and have more complex geometric structures. To improve computational efficiency and feature representation capability on scene-level point clouds, we build the local feature extractor $f_{\theta}$ based on KPFCN [52] to extract high-dimensional local features. The local feature extractor takes the source point cloud $\mathbf{X}$ and the target point cloud $\mathbf{Y}$ as inputs, and outputs the coordinates and local features of a new source point cloud $\mathbf{X}$ and a new target point cloud $\mathbf{Y}$, which are obtained by layer-by-layer grid-based subsampling. KPFCN uses kernel point convolution to extract the local features of each point, which can better capture the geometric information of neighboring points. In addition, it uses multi-layer grid-based subsampling to reduce the number of points and expand the feature receptive field. For more details about the implementation of KPFCN, we recommend reading the original paper [52]. The above process can be formulated as $(\mathbf{X},\mathcal{F}_{x_{i}})=f_{\theta}(\mathbf{X})$. To balance computation cost and registration accuracy, we construct the local feature extractor by removing the decoder units from the 2nd to the last un-pooling layer from the default UNet-like KPFCN structure.

Inspired by the idea of the Siamese network [61], the two point clouds share the same local feature extractor. When the two point clouds become closer, the local features also become similar, so this structure is suitable for iterative registration.

If only the local features are used to predict the correspondences between point clouds, it is easy to obtain incorrect correspondences, especially when there are outliers. The reason is that the local features do not contain the structural information of the point cloud on a larger scale (self-correlation) and the association between the two point clouds (cross-correlation). Inspired by Wang’s research on deep graph matching [27], we construct graphs from point clouds and use deep graph matching to establish better correspondences. Section 4.2 describes how to build graphs from point clouds, and Section 4.3 introduces how to predict the correspondences by using deep graph matching and the AIS module.

4.2 Transformer based Edge Generator

The graphs built from $\mathbf{X}$ and $\mathbf{Y}$ are denoted as the source graph $\mathcal{G}_{s}=\{\mathbf{X},\mathbf{E}_{\mathbf{X}}\}$ and the target graph $\mathcal{G}_{t}=\{\mathbf{Y},\mathbf{E}_{\mathbf{Y}}\}$, respectively. The graph nodes are the original points, and the graph edges are represented by the adjacency matrix $\mathbf{E}$. The node features of $\mathcal{G}_{s}$ and $\mathcal{G}_{t}$ are denoted by $\mathcal{F}_{x_{i}}$ and $\mathcal{F}_{y_{j}}$, respectively. There are trivial methods to generate the edges, such as full connection, nearest-neighbor connection and Delaunay triangulation, but with them the features of the graphs cannot be effectively aggregated, resulting in poor correspondences, as shown in Figure 6 (d). Inspired by the success of BERT [44] in NLP, we introduce a transformer [45] module to dynamically learn the soft edges between any two nodes within a point cloud. The transformer-based edge generator is illustrated in Figure 2 (b). The detailed structure of the edge generator can be found in Section 1.3 of the supplementary materials. The transformer consists of several stacked encoder-decoder layers. The encoder uses a self-attention layer and a shared MLP to encode node features, and the decoder associates and encodes features based on the co-attention mechanism. The transformer takes the node features $\mathcal{F}_{\mathbf{X}},\mathcal{F}_{\mathbf{Y}}$ as input and encodes them into embedding features $\mathcal{T}_{\mathbf{X}},\mathcal{T}_{\mathbf{Y}}$. Soft edge adjacency matrices are obtained by applying a softmax function to the inner products of the embedding features as follows:

\mathcal{T}_{\mathbf{X}},\mathcal{T}_{\mathbf{Y}}=f_{\text{transformer}}(\mathcal{F}_{\mathbf{X}},\mathcal{F}_{\mathbf{Y}}), (3)
\mathbf{E}_{\mathbf{X}}=\operatorname{softmax}(\langle(\mathcal{T}_{\mathbf{X}})^{T},\mathcal{T}_{\mathbf{X}}\rangle), (4)
\mathbf{E}_{\mathbf{Y}}=\operatorname{softmax}(\langle(\mathcal{T}_{\mathbf{Y}})^{T},\mathcal{T}_{\mathbf{Y}}\rangle). (5)
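
A minimal sketch of Eqs. 4 and 5, assuming the transformer of Eq. 3 is available as a black box that returns one embedding per node; the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def soft_adjacency(T):
    """Eqs. 4/5: soft edge weights from the transformer embeddings of one
    point cloud. T: (N, D) embeddings -> (N, N) row-stochastic adjacency."""
    scores = T @ T.T                    # inner products between all node pairs
    return F.softmax(scores, dim=-1)    # each row sums to one

# Usage sketch (Eq. 3): T_x, T_y = transformer(F_x, F_y); then
# E_x, E_y = soft_adjacency(T_x), soft_adjacency(T_y).
```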

4.3 Graph Feature Extractor and AIS Module

This part is shown in Figure 2 (c) and consists of three consecutive steps. First, we use intra-graph conv to explore the self-correlation of node features, where features are aggregated from nodes along edges within each graph. The message passing scheme between nodes is the same as in PCA-GM [27]. The node self-correlation feature $\mathcal{F}_{x_{i}}^{corr}$ in $\mathcal{G}_{s}$ is computed by intra-graph convolution as follows:

\mathcal{F}_{x_{i}}^{corr}=\sum_{j=1}^{N}\breve{\mathbf{E}}_{i,j}*f_{adj}(\mathcal{F}_{x_{j}})+f_{self}(\mathcal{F}_{x_{i}}), (6)

and likewise for $\mathcal{G}_{t}$. Here, $\breve{\mathbf{E}}$ is the L1 column-normalized adjacency matrix calculated from $\mathbf{E}$, and $f_{adj}$ and $f_{self}$ are message passing functions, which are implemented by fully connected layers and ReLU.
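
A compact PyTorch sketch of Eq. 6, assuming $f_{adj}$ and $f_{self}$ are single fully connected layers with ReLU; the class name is ours.

```python
import torch
import torch.nn as nn

class IntraGraphConv(nn.Module):
    """Eq. 6: F^corr_i = sum_j E_breve_{i,j} * f_adj(F_j) + f_self(F_i),
    with E_breve the L1 column-normalized soft adjacency matrix."""
    def __init__(self, dim):
        super().__init__()
        self.f_adj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.f_self = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, F_nodes, E):
        E_breve = E / (E.sum(dim=0, keepdim=True) + 1e-8)  # L1 column normalization
        return E_breve @ self.f_adj(F_nodes) + self.f_self(F_nodes)
```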

Second, the AIS module is used to calculate a soft correspondence matrix. The AIS module consists of an affinity layer, instance normalization and Sinkhorn. An affinity matrix $\mathbf{A}$ between the two graphs is computed as follows:

\mathbf{A}_{i,j}=(\mathcal{F}_{x_{i}}^{corr})^{T}\mathbf{W}(\mathcal{F}_{y_{j}}^{corr}), (7)

where $\mathbf{W}$ is the learnable parameter of the affinity layer. If $\mathcal{F}_{x_{i}}^{corr},\mathcal{F}_{y_{j}}^{corr}\in\mathbb{R}^{Q}$, then $\mathbf{W}\in\mathbb{R}^{Q\times Q}$.

Before using Sinkhorn to compute the soft correspondence matrix $\widetilde{\mathbf{C}}$, we need to transform $\mathbf{A}$ into a normalized matrix while keeping the same distribution. There are two approaches to do so. The naive approach is to apply softmax to the rows or columns. The problem with this approach is that it processes each row or column independently and does not consider the matrix as a whole, which may cause a smaller value in $\mathbf{A}$ to be transformed into a larger value in the transformed matrix [30]. To avoid this situation, we do not use softmax but instead use instance normalization [62] to transform $\mathbf{A}$. Instance normalization not only considers all the elements globally, but also avoids the above problem caused by softmax. To handle outliers, we add an additional row and an additional column of ones to the transformed matrix and then input it into Sinkhorn [48], which computes the soft correspondence matrix $\widetilde{\mathbf{C}}$ by alternating row and column normalization.
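
The sketch below illustrates Eq. 7 together with the slack-augmented Sinkhorn step. It assumes the affinity values passed to Sinkhorn are already non-negative (i.e., after instance normalization and any rescaling the actual module applies), which is a simplification of the full AIS module.

```python
import torch

def affinity(Fx_corr, Fy_corr, W):
    """Eq. 7: A_ij = (F^corr_{x_i})^T W F^corr_{y_j}, with learnable W in R^{Q x Q}."""
    return Fx_corr @ W @ Fy_corr.T                     # (N, M)

def sinkhorn_with_slack(A, n_iters=20, eps=1e-8):
    """Append one slack row and one slack column of ones to the normalized
    affinity matrix, then alternate row and column normalization.
    The top-left N x M block of the result is the soft correspondence matrix."""
    N, M = A.shape
    S = torch.ones(N + 1, M + 1, dtype=A.dtype, device=A.device)
    S[:N, :M] = A                                      # assumes A >= 0
    for _ in range(n_iters):
        S = S / (S.sum(dim=1, keepdim=True) + eps)     # row normalization
        S = S / (S.sum(dim=0, keepdim=True) + eps)     # column normalization
    return S
```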

Finally, we enhance the node features by exploring cross-correlation through cross-graph conv. Cross-graph conv is similar to intra-graph conv, except that features are aggregated from the node features of the other graph, with the edges replaced by $\widetilde{\mathbf{C}}$. The more similar a node pair between the two graphs is, the higher the corresponding weight in $\widetilde{\mathbf{C}}$ will be. We obtain a new node feature $\mathcal{F}_{x_{i}}^{\prime}$ of node $x_{i}$ from its self-correlation feature and cross-correlation feature as follows:

\mathcal{F}_{x_{i}}^{\prime}=f_{\text{cross}}(\mathcal{F}_{x_{i}}^{\text{corr}},\sum_{j=1}^{M}\widetilde{\mathbf{C}}_{i,j}*\mathcal{F}_{y_{j}}^{\text{corr}}), (8)

and likewise for $\mathcal{G}_{t}$. Here, $f_{cross}$ consists of a feature concatenation and a fully connected layer, and it is shared by $\mathcal{G}_{s}$ and $\mathcal{G}_{t}$.
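
A minimal sketch of Eq. 8, assuming $f_{cross}$ is a single fully connected layer applied to the concatenated features; the class name is ours.

```python
import torch
import torch.nn as nn

class CrossGraphConv(nn.Module):
    """Eq. 8: fuse each node's self-correlation feature with features aggregated
    from the other graph, weighted by the soft correspondence matrix."""
    def __init__(self, dim):
        super().__init__()
        self.f_cross = nn.Linear(2 * dim, dim)   # concatenation + fully connected layer

    def forward(self, Fx_corr, Fy_corr, C_soft):
        aggregated = C_soft @ Fy_corr            # sum_j C~_{i,j} * F^corr_{y_j}
        return self.f_cross(torch.cat([Fx_corr, aggregated], dim=-1))
```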

4.4 Soft2Hard Correspondence Solver

To compute the hard correspondence matrix $\overline{\mathbf{C}}^{pre}$, which is binary, we sum the elements of each row and each column of $\widetilde{\mathbf{C}}$, take out the rows and columns whose sums are greater than a confidence threshold $\tau$, and apply an LAP solver based on the Hungarian algorithm [29] to the resulting matrix to obtain a binary matrix. Then, the elements of the binary matrix are assigned to a zero matrix with the shape of $\widetilde{\mathbf{C}}$ according to their positions in $\widetilde{\mathbf{C}}$, and the result is the hard correspondence matrix $\overline{\mathbf{C}}^{pre}$ we need. Finally, for object-level registration, we take $\overline{\mathbf{C}}^{pre}$ as input to predict the transformation $\{\hat{\mathbf{R}},\hat{\mathbf{t}}\}$ by weighted SVD or RANSAC, and set $\tau=0.5$. Since scene-level registration is more difficult, we only use RANSAC as the Correspondence-based Estimator to estimate the transformation parameters and set $\tau=0.7$. In addition, we also provide an analysis of the influence of $\tau$ on scene-level registration in Section 6.3 of the supplementary materials.
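
The following sketch illustrates the Soft2Hard step with SciPy's Hungarian-based LAP solver, followed by a standard SVD (Kabsch) estimator for the object-level case; the RANSAC variant is not shown, and the details of the released implementation may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def soft_to_hard(C_soft, tau=0.5):
    """Keep rows/columns of C~ whose sums exceed tau, solve the LAP on the
    reduced matrix, and scatter the result back into a binary matrix C_bar."""
    rows = np.where(C_soft.sum(axis=1) > tau)[0]
    cols = np.where(C_soft.sum(axis=0) > tau)[0]
    C_hard = np.zeros_like(C_soft)
    if len(rows) == 0 or len(cols) == 0:
        return C_hard
    sub = C_soft[np.ix_(rows, cols)]
    r_idx, c_idx = linear_sum_assignment(-sub)     # maximize the total soft score
    C_hard[rows[r_idx], cols[c_idx]] = 1.0
    return C_hard

def svd_estimator(X, Y, C_hard):
    """Closed-form rigid transform from hard one-to-one correspondences (Kabsch)."""
    src_idx, tgt_idx = np.nonzero(C_hard)
    src, tgt = X[src_idx], Y[tgt_idx]
    src_c, tgt_c = src - src.mean(0), tgt - tgt.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ tgt_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt.mean(0) - R @ src.mean(0)
    return R, t
```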

4.5 Loss

Our loss function takes the ground-truth correspondences directly as supervision, which is different from previous studies [18, 19, 47] that define the loss on transformation parameters. The focal loss [31] between the soft correspondence matrix $\widetilde{\mathbf{C}}$ and the ground-truth correspondence matrix $\overline{\mathbf{C}}^{gt}$ is adopted to train our model. The formula is as follows:

\text{loss}=-\sum_{i=1}^{N}\sum_{j=1}^{M}\Big(\alpha(1-\widetilde{\mathbf{C}}_{i,j})^{\gamma}\overline{\mathbf{C}}_{i,j}^{gt}\log\widetilde{\mathbf{C}}_{i,j}+(1-\alpha)(\widetilde{\mathbf{C}}_{i,j})^{\gamma}(1-\overline{\mathbf{C}}_{i,j}^{gt})\log(1-\widetilde{\mathbf{C}}_{i,j})\Big), (9)

where $\alpha$ is a balance factor to address class imbalance, with weight $\alpha\in[0,1]$ for correct correspondences and $1-\alpha$ for incorrect correspondences, and $\gamma$ is a modulating factor that down-weights easy correspondences and thus focuses training on correspondences that are hard to distinguish [31]. We empirically set $\alpha=0.5$ and $\gamma=0$ in object-level registration, which is equivalent to the CE loss used in the conference paper [30]. Finding enough good correspondences for scene-level registration is more difficult, so we focus more on those hard correspondences and empirically set $\alpha=0.25$ and $\gamma=2$ as in [31]. Since our loss function is only related to the soft correspondence matrix $\widetilde{\mathbf{C}}$, the calculations in Section 4.4 do not need to be differentiable.
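
Eq. 9 can be transcribed directly; with $\alpha=0.5$ and $\gamma=0$ it reduces (up to a constant factor) to the binary cross-entropy used in the conference version.

```python
import torch

def correspondence_focal_loss(C_soft, C_gt, alpha=0.25, gamma=2.0, eps=1e-8):
    """Eq. 9: focal loss between the predicted soft correspondence matrix and the
    binary ground-truth matrix (alpha=0.5, gamma=0 for object-level data;
    alpha=0.25, gamma=2 for scene-level data)."""
    pos = alpha * (1.0 - C_soft) ** gamma * C_gt * torch.log(C_soft + eps)
    neg = (1.0 - alpha) * C_soft ** gamma * (1.0 - C_gt) * torch.log(1.0 - C_soft + eps)
    return -(pos + neg).sum()
```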

4.6 Implementation Details

The network is implemented in PyTorch and trained on a single Nvidia Tesla V100. We train the network using the SGD optimizer and set $L=2$ in this study. For object-level datasets, we consider a neighborhood of $K=20$ for the local feature extractor and set $V=1024$ for the high-dimensional local features. For scene-level datasets, we set $V=528$ for the high-dimensional local features. The number of heads in the multi-head attention of the transformer in the edge generator is 4. For all experiments, we run only one iteration during training. To achieve more precise registration, we iteratively update the transformation twice during test time. For more training details and the network architecture implementation, please refer to the supplementary materials.

5 Experiments

Extensive experiments are conducted on four datasets, and this section is organized as follows. First, we demonstrate the effectiveness of our registration method on the object-level datasets ModelNet40 and ShapeNet with different settings. Next, we employ the scene-level datasets 3DMatch and 3DLoMatch to further evaluate the performance of our registration method in real scene-level registration. To show more real-world applications, experimental results on the outdoor dataset KITTI are given in Section 4 of the supplementary materials. Finally, to demonstrate the importance of our two key components, we perform ablation studies of RGM on the ModelNet40 dataset. The ablation studies include replacing the distance matrix in the AIS module with the L2 norm between node features, and using different methods to build edges instead of building them with a transformer.

5.1 Object-level Registration

5.1.1 Benchmark Datasets

ModelNet40. The ModelNet40 [63] dataset includes 12,311 meshed CAD models from 40 categories. We randomly sample 2,048 points from the mesh faces and rescale the points into a unit sphere. Each category has official train/test splits. To select models for evaluation, we take 80% and 20% of the official train split as the training set and validation set, respectively, and use the official test split for testing. For each object in the dataset, we randomly sample 1,024 points as the source point cloud $\mathbf{X}$, and then apply a random transformation to $\mathbf{X}$ to obtain the target point cloud $\mathbf{Y}$ and shuffle the point order. For the transformation, we randomly sample three Euler angles in the range of $[0, 45]^{\circ}$ for rotation and three displacements in the range of $[-0.5,\ 0.5]$ along each axis for translation. Unless otherwise noted, these settings are used by default in all experiments.
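
For reference, one possible way to generate such a training pair with NumPy/SciPy is sketched below; the function name and the use of SciPy's Rotation class are our choices, not taken from the released code.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def random_pair(points, max_angle=45.0, max_trans=0.5, n_points=1024, seed=None):
    """Subsample a model, apply a random rotation (Euler angles in
    [0, max_angle] degrees) and translation (each axis in [-max_trans, max_trans]),
    and shuffle the point order to obtain the (source, target) pair."""
    rng = np.random.default_rng(seed)
    X = points[rng.choice(len(points), n_points, replace=False)]
    R = Rotation.from_euler('xyz', rng.uniform(0.0, max_angle, 3), degrees=True).as_matrix()
    t = rng.uniform(-max_trans, max_trans, 3)
    Y = (X @ R.T + t)[rng.permutation(n_points)]
    return X, Y, R, t
```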

ShapeNet. The ShapeNet [64] dataset collects 16,881 CAD shape models across 16 categories. The official splits are 14,006 models for training and 2,874 models for testing. The data preprocessing is the same as for ModelNet40, and testing of registration methods is conducted only on the test set.

5.1.2 Evaluation Metrics

For the ModelNet40 dataset, we use six evaluation metrics, the first four of which are calculated from the estimated transformation parameters. They are the mean isotropic errors (MIE) of $\mathbf{R}$ and $\mathbf{t}$ proposed in RPM-Net [19], and the mean absolute errors (MAE) of $\mathbf{R}$ and $\mathbf{t}$ used in DCP [18], which are anisotropic. All rotation-related metrics are in units of degrees.

In addition, we propose a new metric, clip chamfer distance (CCD), which measures how close the two point clouds are brought to each other, and it is calculated as follows:

\operatorname{CCD}(\widehat{\mathbf{X}},\mathbf{Y})=\sum_{\hat{x}_{i}\in\widehat{\mathbf{X}}}\min\Big(\min_{y_{j}\in\mathbf{Y}}\big(\|\hat{x}_{i}-y_{j}\|_{2}^{2}\big),d\Big)+\sum_{y_{j}\in\mathbf{Y}}\min\Big(\min_{\hat{x}_{i}\in\widehat{\mathbf{X}}}\big(\|\hat{x}_{i}-y_{j}\|_{2}^{2}\big),d\Big), (10)

where $\widehat{\mathbf{X}}$ is the transformed source point cloud after registration and $\hat{x}_{i}$ is its $i$th point. To avoid the influence of outliers in partial-to-partial registration, point pairs whose distance is larger than 0.1 are not included in the calculation. This is implemented by setting the threshold $d=0.1$.
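
A small NumPy sketch of Eq. 10 (a brute-force implementation; a KD-tree could be used instead for large point clouds):

```python
import numpy as np

def clip_chamfer_distance(X_hat, Y, d=0.1):
    """Eq. 10: symmetric nearest-neighbor squared distances between the transformed
    source X_hat and the target Y, with each term clipped at d to discount outliers."""
    sq = ((X_hat[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # (N, M) squared distances
    return np.minimum(sq.min(axis=1), d).sum() + np.minimum(sq.min(axis=0), d).sum()
```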

Finally, we also report the recall with MAE($\mathbf{R}$) $<1^{\circ}$ and MAE($\mathbf{t}$) $<0.1$. The best results are marked in bold font in the tables.

5.1.3 Comparing Methods

We compare our method with ICP [8], fast global registration (FGR) [38], and three recent learning-based methods, RPM-Net [19], IDAM [20] and DeepGMR [47]. We also compare with methods proposed concurrently with ours, including OMNet [65] and Predator [21]. Other early learning-based methods, such as DCP and PointNetLK, are not directly compared, because experiments in [19, 20, 47] have already shown that these newer methods perform better. We adopt the ICP and FGR implementations of Intel Open3D [66]. For IDAM, DeepGMR, OMNet and Predator, we use the code provided by the authors and train the models according to the authors' settings. For RPM-Net, we use the code provided by the authors and estimate the normals in all experiments except the clean experiment. The number of iterations of RPM-Net is set to 5 according to the authors' article. ICP uses the identity matrix as initialization, and none of the other methods need transformation initialization. All models of the comparing learning-based methods are retrained because no trained models are available. We denote our method as RGM when using SVD as the Correspondence-based Estimator, and as RGM* when using RANSAC as the Correspondence-based Estimator. In addition to using the default MLP network as the backbone, we also evaluate the performance of using KPFCN as the backbone; the corresponding experimental results are provided in the supplementary materials.

5.1.4 Clean Point Cloud

We first evaluate the registration performance on clean point clouds, following the sampling and transformation settings in Section 5.1.1. The ground-truth correspondences are obtained from the strict correspondences between $\mathbf{X}$ and $\mathbf{Y}$. All models are trained and evaluated on clean data, and Table I shows the performance of our method and its peers. Our method achieves the best performance and greatly outperforms the strongest learning-based competitor. In addition, the success rate of RGM reaches 100%, and most of its error metrics are close to 0, which is not achieved by any other existing method. Although DeepGMR also achieves a 100% success rate, its errors are larger than those of RGM. Some qualitative comparisons are shown in Figure 3 (a).

TABLE I: Performance on clean point clouds
Methods MAE(R) MAE(t) MIE(R) MIE(t) CCD Recall
ICP 3.079 0.02442 6.4467 0.05446 0.03009 74.19%
FGR 0.006 0.00005 0.0099 0.00010 0.00019 99.96%
RPM-Net 0.109 0.00050 0.2464 0.00112 0.00089 98.14%
IDAM 0.731 0.01244 1.3536 0.02605 0.04470 75.81%
DeepGMR 0.001 0.00001 0.0156 0.00002 0.00003 \mathbf{100.00%}
OMNet 0.471 0.00468 0.9511 0.00985 0.01318 91.09%
Predator 0.302 0.00393 0.5716 0.00833 0.01572 95.54%
RGM <\mathbf{0.001} <\mathbf{0.00001} \mathbf{0.0096} <\mathbf{0.00001} <\mathbf{0.00001} \mathbf{100.00%}
RGM* <\mathbf{0.001} <\mathbf{0.00001} 0.0100 <\mathbf{0.00001} <\mathbf{0.00001} \mathbf{100.00%}

If there is a star (*), RANSAC is used as the estimator; otherwise, SVD is used.

TABLE II: Performance on point clouds with Gaussian noise
Methods MAE(R) MAE(t) MIE(R) MIE(t) CCD Recall
ICP 3.127 0.02256 6.5030 0.04944 0.05387 77.59%
FGR 5.405 0.03386 10.0079 0.07080 0.06918 30.75%
RPM-Net 0.305 0.00253 0.5773 0.00532 0.04257 96.68%
IDAM 1.818 0.01416 3.4916 0.02915 0.05436 49.59%
DeepGMR 1.178 0.00716 2.2736 0.01498 0.05029 56.52%
OMNet 0.939 0.01511 1.8557 0.03090 0.05169 84.24%
Predator 0.515 0.00481 0.9633 0.00961 0.04469 94.37%
RGM 0.080 0.00069 0.1496 0.00141 0.04185 99.51%
RGM* \mathbf{0.073} \mathbf{0.00065} \mathbf{0.1389} \mathbf{0.00132} \mathbf{0.04184} \mathbf{99.84%}

5.1.5 Gaussian Noise

To evaluate the robustness to noise, Gaussian noise sampled from $\mathcal{N}(0,\ 0.01)$ and clipped to $[-0.05,\ 0.05]$ is independently added to each coordinate of the points in the clean point clouds. This noise may destroy the original correspondences, so we need to rebuild them for training the models that need ground-truth correspondences. First, we compute the point pair distances between $\mathbf{Y}$ and $\mathbf{X}^{\prime}$, which is obtained by applying the ground-truth transformation to $\mathbf{X}$. Then, if $x_{i}^{\prime}\in\mathbf{X}^{\prime}$ and $y_{j}\in\mathbf{Y}$ satisfy Eq. 11, they are regarded as a corresponding point pair and no longer appear in the next round of calculation. Finally, we find corresponding point pairs again according to Eq. 11 from the remaining points. To avoid long-distance point pairs being selected as correspondences, we only consider point pairs whose distance is less than 0.1. The reason why we search for corresponding point pairs again among the remaining points is that the distance between two corresponding points may not be the smallest but the second smallest, so they are not found in the first round.

\min_{x_{n}^{\prime}\in\mathbf{X}^{\prime}}(\|x_{n}^{\prime}-y_{j}\|_{2}^{2})=\|x_{i}^{\prime}-y_{j}\|_{2}^{2}=\min_{y_{m}\in\mathbf{Y}}(\|x_{i}^{\prime}-y_{m}\|_{2}^{2}). (11)
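
The two-round mutual-nearest-neighbor procedure described above can be sketched as follows; the function name and the exact handling of the 0.1 distance threshold (applied here to the Euclidean distance) are our assumptions.

```python
import numpy as np

def rebuild_correspondences(X_prime, Y, max_dist=0.1, rounds=2):
    """Ground-truth correspondences on noisy data: X_prime is the source after the
    ground-truth transformation. In each round, mutual nearest neighbors closer
    than max_dist are accepted (Eq. 11) and removed before the next round."""
    sq = ((X_prime[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    active = sq.copy()
    pairs = []
    for _ in range(rounds):
        row_nn = active.argmin(axis=1)            # nearest y_j for every x'_i
        col_nn = active.argmin(axis=0)            # nearest x'_i for every y_j
        new_pairs = [(i, j) for i, j in enumerate(row_nn)
                     if col_nn[j] == i and np.isfinite(active[i, j])
                     and sq[i, j] < max_dist ** 2]
        for i, j in new_pairs:                    # matched points leave the pool
            active[i, :] = np.inf
            active[:, j] = np.inf
        pairs.extend(new_pairs)
    return pairs
```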

All models are trained and evaluated on the noisy data. The results are shown in Table II. Our method is clearly much more accurate than both the latest learning-based methods and the traditional methods, and its recall is close to 100%. Some qualitative comparisons are shown in Figure 3 (b).

Figure 3: Qualitative registration results on ModelNet40: (a) clean, (b) noise, (c) partial-to-partial, (d) unseen categories, (e) cross dataset, (f) full-range rotation.
TABLE III: Performance on partial-to-partial point clouds
Methods MAE(R) MAE(t) MIE(R) MIE(t) CCD Recall
ICP 12.456 0.12465 24.8777 0.26685 0.11511 6.56%
FGR 23.185 0.14560 42.4292 0.30214 0.12118 5.23%
RPM-Net 0.864 0.00834 1.6985 0.01763 0.08457 80.59%
IDAM 8.905 0.09192 16.9724 0.19209 0.12393 0.81%
DeepGMR 43.683 0.22479 70.9143 0.45705 0.14401 0.08%
OMNet 1.348 0.01399 2.6452 0.02975 0.08882 71.55%
Predator 0.502 0.00455 0.9334 0.00908 0.08303 93.47%
RGM 0.492 0.00414 0.9298 0.00874 0.08238 93.31%
RGM* \mathbf{0.308} \mathbf{0.00275} \mathbf{0.6109} \mathbf{0.00574} \mathbf{0.08221} \mathbf{95.54%}

5.1.6 Partial-to-Partial

Partial-to-partial registration is the most challenging case for point cloud registration, and it is important because it occurs frequently in real-world applications. To generate partial-to-partial point cloud pairs, we follow the protocol in RPM-Net [19], which is closer to real-world applications. For each point cloud, we independently create a random plane passing through the origin, translate it along its normal, and retain 70% of the points. All models are trained and evaluated on partial-to-partial data, and the results are reported in Table III. Our method is clearly more accurate than the other methods, and its success rate is higher than 90%. RPM-Net is the third-best method, but its error is still twice as large as ours. Predator [21] is significantly better than RPM-Net, but slightly worse than our method. Some qualitative comparisons are shown in Figure 3 (c). For the inference time of our method and the comparing methods, please refer to the supplementary materials.

TABLE IV: Performance on unseen categories point clouds
Methods MAE(R) MAE(t) MIE(R) MIE(t) CCD Recall
ICP 13.326 0.13033 26.6447 0.27774 0.11879 6.71%
FGR 23.950 0.14067 41.9631 0.29106 0.12370 5.13%
RPM-Net 1.041 0.01067 1.9826 0.02276 0.08704 75.59%
IDAM 10.158 0.10063 19.3249 0.20729 0.12921 0.95%
DeepGMR 44.363 0.22039 71.0677 0.44632 0.14728 0.24%
OMNet 1.742 0.01766 3.41012 0.03772 0.09291 59.00%
Predator 0.517 0.00499 0.96325 0.01014 0.08431 \mathbf{95.57%}
RGM 0.837 0.00674 1.5457 0.01418 0.08469 84.28%
RGM* \mathbf{0.464} \mathbf{0.00396} \mathbf{0.8696} \mathbf{0.00574} \mathbf{0.08364} 90.60%
TABLE V: Performance on full-range rotation point clouds
Methods MAE(R) MAE(t) MIE(R) MIE(t) CCD Recall
ICP 79.5 0.44 138.2 0.69 0.138 1.10%
FGR 72.6 0.28 101.5 0.58 0.143 0.86%
RPM-Net 33.8 0.09 33.4 0.21 0.123 4.19%
IDAM 73.1 0.27 83.2 0.55 0.165 0.15%
DeepGMR 66.7 0.21 69.9 0.45 0.156 <0.01%
OMNet 34.2 0.11 38.7 0.23 0.125 17.85%
Predator 3.1 \mathbf{0.01} \mathbf{1.4} \mathbf{0.01} \mathbf{0.083} \mathbf{74.96%}
RGM 19.1 0.06 15.9 0.13 0.101 24.80%
RGM* \mathbf{2.9} \mathbf{0.01} 1.5 \mathbf{0.01} \mathbf{0.083} 67.14%

5.1.7 Unseen Categories

To test each method's generalization capability to unseen shape categories, we take the official train and test splits of the first 20 categories as the training and validation sets, respectively, and test on the official test splits of the last 20 categories. The other experimental settings are the same as those in the partial-to-partial experiment. The experimental results are summarized in Table IV. We find that the performance of the traditional methods does not change significantly. The generalization capability of RPM-Net is also good, but our method clearly works better. Our method is superior to Predator [21] in mean error but slightly lower in registration recall. The other learning-based methods do not generalize well to unseen categories. Some qualitative comparisons are shown in Figure 3 (d).

5.1.8 Cross Dataset

The purpose of this experiment is to validate the generalization ability of the different methods across different object-level datasets. Every learning-based model is trained on ModelNet40 and tested on ShapeNet. To more fully demonstrate the generalization of each model, we test under three different settings: clean, Gaussian noise, and partial-to-partial. The test results for the different settings are summarized in Table VI. Our method shows no significant performance degradation on the ShapeNet dataset, and its results are much better than those of the other methods. Some qualitative comparisons are shown in Figure 3 (e).

TABLE VI: Performance of cross-dataset generalization (learning-based methods are trained on ModelNet40 and tested on ShapeNet)
Methods MIE(R) MIE(t) MAE(R) MAE(t) CCD Recall

Clean
ICP 2.866 0.02265 5.7820 0.04968 0.01986 85.56%
FGR 0.001 0.00001 \mathbf{0.0010} 0.00002 0.00005 \mathbf{100.00%}
RPM-Net 0.139 0.00093 0.2847 0.00198 0.00186 96.56%
IDAM 7.756 0.05623 13.8014 0.11481 0.07644 51.53%
DeepGMR 0.011 0.00007 0.0358 0.00015 0.00015 99.79%
OMNet 1.483 0.07260 3.0031 0.14919 0.07782 53.54%
Predator 0.406 0.00483 0.7677 0.01014 0.02011 95.82%
RGM <\mathbf{0.001} <\mathbf{0.00001} 0.0094 <\mathbf{0.00001} <\mathbf{0.00001} \mathbf{100.00%}
RGM* <\mathbf{0.001} <\mathbf{0.00001} 0.0104 0.00001 <\mathbf{0.00001} \mathbf{100.00%}

Gaussian Noise
ICP 2.893 0.02319 5.8934 0.05097 0.05173 83.89%
FGR 7.985 0.05075 13.6898 0.10500 0.08048 29.23%
RPM-Net 0.429 0.00363 0.7936 0.00775 0.04111 95.41%
IDAM 1.818 0.01565 3.4658 0.03177 0.05380 55.81%
DeepGMR 1.065 0.00653 2.0160 0.01380 0.04885 64.37%
OMNet 0.996 0.02268 1.9571 0.04725 0.05312 80.20%
Predator 0.423 0.00425 0.7949 0.00863 0.04182 95.12%
RGM 0.143 0.00122 0.2810 0.00257 0.03995 98.19%
RGM* \mathbf{0.012} \mathbf{0.00097} \mathbf{0.2422} \mathbf{0.00204} \mathbf{0.03994} \mathbf{98.61%}

Partial-to-Partial
ICP 12.207 0.13438 23.8867 0.29026 0.11853 9.12%
FGR 26.873 0.18319 47.4291 0.37820 0.13043 5.08%
RPM-Net 2.917 0.03698 5.6309 0.08199 0.09150 49.37%
IDAM 10.090 0.10870 19.2204 0.22631 0.13025 1.81%
DeepGMR 46.901 0.25167 74.9521 0.51363 0.15012 0.28%
OMNet 2.418 0.04577 4.7783 0.09840 0.10659 36.81%
Predator 0.589 0.00644 1.0931 0.01377 0.08162 \mathbf{91.61%}
RGM 0.887 0.00791 1.6236 0.01696 0.08201 87.47%
RGM* \mathbf{0.487} \mathbf{0.00446} \mathbf{0.9152} \mathbf{0.00929} \mathbf{0.08096} 91.02%

5.1.9 Full-range Rotation

Most existing learning-based object-level registration methods only consider rotations within 45 degrees. However, in practice, the rotation between a point cloud pair can be any value within the full range of [0, 180]°. To compare the performance of each method within the [0, 180]° rotation range, we conduct this full-range rotation experiment, in which we change the random sampling range of the Euler angles to [0, 180]°; the other settings are consistent with the partial-to-partial experiment. We use the same rotation augmentations to train the models for all methods, and the results are summarized in Table V. As the results show, almost all previous methods completely fail when the rotation range is [0, 180]°. The previously best method, RPM-Net, has a recall of only 4.19%. Our method achieves a recall of 24.80% with SVD as the Correspondence-based Estimator, which is six times higher than RPM-Net. When we use RANSAC as the Correspondence-based Estimator, the recall is further improved to 67.14%. Compared with the most recent methods, OMNet [65] and Predator [21], our method is significantly better than OMNet and comparable to Predator. Some qualitative comparisons under large rotations are shown in Figure 3 (f).

Figure 4: Qualitative registration results of RGM on 3DMatch.

5.2 Scene-level Registration

5.2.1 Benchmark Datasets

3DMatch [67] is a benchmark for indoor rigid scan matching and registration, which includes 46 scenes for training, 8 scenes for validation and 8 scenes for testing. 3DMatch considers only scan pairs with more than 30% overlap. However, in practice, there are scan pairs with a lower overlap of between 10% and 30%. For this scenario, Huang et al. [21] constructed a more difficult dataset named 3DLoMatch.

5.2.2 Evaluation Metrics

Following the actual aim of point cloud registration, we use Registration Recall (RR), the percentage of successful alignments whose transformation error is smaller than a certain threshold (e.g., RMSE $<0.2$ m), as our main evaluation metric. Given a dataset $\mathcal{D}$ with $|\mathcal{D}|$ point cloud pairs, Registration Recall is calculated as follows:

\text{RR}(\mathcal{D})=\frac{1}{|\mathcal{D}|}\sum_{(\mathbf{X},\mathbf{Y})\in\mathcal{D}}\mathds{1}\left(\text{RMSE}(\mathbf{X},\mathbf{Y})<\tau_{1}\right), (12)

where $\mathds{1}(\cdot)$ represents the indicator function and $\tau_{1}=0.2$ m. For each scan pair $(\mathbf{X},\mathbf{Y})\in\mathcal{D}$, the RMSE of the ground-truth correspondence set $\overline{\mathbf{C}}^{gt}$ after applying the estimated transformation $\mathbf{T}^{pre}$ is defined as:

\text{RMSE}=\sqrt{\frac{1}{|\overline{\mathbf{C}}^{gt}|}\sum_{(x_{i},y_{j})\in\overline{\mathbf{C}}^{gt}}\left\|\mathbf{T}^{pre}(x_{i})-y_{j}\right\|^{2}}. (13)

To keep consistent with the original evaluation protocol of 3DMatch [67], immediately adjacent point clouds with very high overlap ratios are excluded. Following previous works [22, 23, 21, 25], we also report two other metrics, namely Inlier Ratio (IR) and Feature Match Recall (FMR). The Inlier Ratio indicates the fraction of correspondences whose residual error in the geometric space is less than a threshold $\tau_{2}=10$ cm. Given the estimated correspondence set $(x_{i},y_{j})\in\overline{\mathbf{C}}^{pre}$, the Inlier Ratio of a single scan pair is defined as:

\text{IR}=\frac{1}{|\overline{\mathbf{C}}^{pre}|}\sum_{(x_{i},y_{j})\in\overline{\mathbf{C}}^{pre}}\mathds{1}\left(\left\|\mathbf{T}^{gt}(x_{i})-y_{j}\right\|_{2}<\tau_{2}\right), (14)

where $\mathbf{T}^{gt}$ indicates the ground-truth transformation between $\mathbf{X}$ and $\mathbf{Y}$. The Feature Match Recall is defined as the percentage of point cloud pairs whose Inlier Ratio is larger than 5%.
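
For clarity, the three metrics can be computed as in the following NumPy sketch, where correspondences are given as lists of index pairs and transformations as (R, t); the names are ours.

```python
import numpy as np

def rmse(T_pre, corr_gt, X, Y):
    """Eq. 13: RMSE of the ground-truth correspondences after applying T_pre."""
    R, t = T_pre
    src = X[[i for i, _ in corr_gt]]
    tgt = Y[[j for _, j in corr_gt]]
    return np.sqrt(((src @ R.T + t - tgt) ** 2).sum(axis=1).mean())

def inlier_ratio(T_gt, corr_pre, X, Y, tau2=0.10):
    """Eq. 14: fraction of predicted correspondences within tau2 (10 cm) of the
    target point after applying the ground-truth transformation."""
    R, t = T_gt
    src = X[[i for i, _ in corr_pre]]
    tgt = Y[[j for _, j in corr_pre]]
    return (np.linalg.norm(src @ R.T + t - tgt, axis=1) < tau2).mean()

# Registration Recall (Eq. 12) is the fraction of scan pairs with RMSE < 0.2 m;
# Feature Match Recall is the fraction of scan pairs with Inlier Ratio > 5%.
```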

5.2.3 Comparing Methods

We compare our method with other state-of-the-art approaches for scene-level registration, including 3DSN [68], FCGF [22], D3Feat [23], Predator [21], SpinNet [69] and CoFiNet [25], in Table VII. Earlier methods, such as FPFH [70], PPFNet [49] and RPM-Net [19], are not compared directly, because the experiments in [21, 68] have already demonstrated that the newer methods perform better. For the comparing methods, we report the best results from their original papers. Because the scene-level datasets are more complex and challenging, the correspondences are not as good as those obtained on the object-level dataset, and SVD performs poorly at the scene level. Therefore, like the comparing methods, we only report results using RANSAC as the correspondence-based estimator.

5.2.4 3DMatch and 3DLoMatch

In this experiment, we evaluated the performance of our method on indoor scenes and compared it with state-of-the-art methods. We take the KPFCN-based $f_{\theta}$ described in Section 4.1 as our feature extractor. Following previous literature [68, 21, 25], we apply a Hungarian-algorithm-based LAP solver [29] to the correspondence matrix $\widetilde{\mathbf{C}}$ and estimate the transformation parameters with RANSAC for calculating RR. For a fair comparison with methods that randomly sample different points, we report the maximum Registration Recall (RR) for these methods. As presented in Table VII and noted in a previous study [21], there is no clear relationship between RR and FMR. Although our method does not have the highest FMR and IR, it significantly outperforms existing methods in terms of RR, which is the more important metric in point cloud registration. This demonstrates that our method is more effective than others at reducing erroneous correspondences, which can easily result in misregistration. Considering the large variation of overlap rates in indoor scenes, we tried both the default CE loss and the focal loss. Experiments show that our method with the focal loss performs better in indoor scenes, outperforming the state-of-the-art method by margins of 4.9% and 3.7% on 3DMatch and 3DLoMatch, respectively. Qualitative comparisons are shown in Figure 4.
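The sketch below illustrates this two-step estimation, assuming SciPy's Hungarian solver and a simple RANSAC loop with an SVD (Kabsch) estimator; the iteration count and inlier threshold are illustrative, and this is not the exact RANSAC implementation used in our experiments.

import numpy as np
from scipy.optimize import linear_sum_assignment

def kabsch(src, tgt):
    # Least-squares rigid transform (R, t) mapping src onto tgt via SVD.
    src_c, tgt_c = src - src.mean(0), tgt - tgt.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ tgt_c)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = tgt.mean(0) - R @ src.mean(0)
    return R, t

def lap_plus_ransac(src_pts, tgt_pts, soft_corr, iters=1000, thresh=0.05):
    # Hungarian (LAP) assignment on the soft correspondence matrix,
    # followed by RANSAC with a 3-point SVD estimator.
    row, col = linear_sum_assignment(-soft_corr)   # maximize total score
    src_m, tgt_m = src_pts[row], tgt_pts[col]

    best_R, best_t, best_inliers = np.eye(3), np.zeros(3), -1
    for _ in range(iters):
        idx = np.random.choice(len(src_m), 3, replace=False)
        R, t = kabsch(src_m[idx], tgt_m[idx])
        residuals = np.linalg.norm(src_m @ R.T + t - tgt_m, axis=1)
        inliers = int(np.sum(residuals < thresh))
        if inliers > best_inliers:
            best_R, best_t, best_inliers = R, t, inliers
    return best_R, best_t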

5.3 Visualization of Graph Edges

In this section, we visualize the graph edges generated by different methods to demonstrate the superiority of building graphs via our edge generator. The edges for an object model are shown in Figure 5. For the KNN graph, we choose the five nearest neighbors of each point and draw an edge between the point and each neighbor. For the proposed method, edges with a weight greater than 0.1 in the edge weight matrix are chosen for visualization. As illustrated in Figure 5, our edge generator creates edges mostly in the overlapping parts, avoiding information interference from non-overlapping parts. Points in non-overlapping areas are not connected to any other points, making it easier to identify them.
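A small sketch of the two edge-construction schemes being visualized, using SciPy for the KNN queries; the edge-weight matrix stands in for the output of our edge generator and is a placeholder here.

import numpy as np
from scipy.spatial import cKDTree

def knn_edges(points, k=5):
    # KNN graph: connect each point to its k nearest neighbours
    # (the first returned neighbour is the point itself and is skipped).
    _, idx = cKDTree(points).query(points, k=k + 1)
    return [(i, int(j)) for i, nbrs in enumerate(idx) for j in nbrs[1:]]

def weighted_edges(edge_weight_matrix, thresh=0.1):
    # Our graph: keep only edges whose predicted weight exceeds the threshold.
    rows, cols = np.nonzero(edge_weight_matrix > thresh)
    return list(zip(rows.tolist(), cols.tolist()))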

Figure 5: Visualization of graphs obtained by different edge generation methods. For our graph, down-sampled points and edges with large weights are illustrated. Note that most edges of our graph lie in the overlapping region and the edges of the two graphs roughly correspond to each other; both properties are desirable for better registration.
TABLE VII: Performance on indoor datasets 3DMatch and 3DLoMatch (FMR, IR and RR in %)

Method       |      3DMatch      |     3DLoMatch
             |  FMR   IR    RR   |  FMR   IR    RR
3DSN         |  95.0  36.0  78.4 |  63.6  11.4  33.0
FCGF         |  97.4  56.8  85.1 |  75.4  20.0  41.7
D3Feat       |  95.4  38.8  84.5 |  67.0  14.0  46.9
Predator     |  96.5  57.1  90.6 |  76.3  28.3  62.4
SpinNet      |  97.6  47.5  88.6 |  75.3  20.5  59.8
CoFiNet      |  98.1  49.8  89.3 |  83.1  24.4  67.5
RGM-CE       |  96.7  67.4  93.1 |  79.6  40.9  62.9
RGM-focal    |  97.4  50.5  95.5 |  81.1  22.4  70.2

5.4 Ablation Studies

In this section, we present the results of ablation studies to analyze the effectiveness of two key components. All ablation studies are performed on the partial-to-partial dataset with the same settings as in Section 5.1.7. We analyze the two key components as follows:

To demonstrate the effectiveness of the AIS module, we design a variant that replaces the AIS module; the resulting method is denoted as RGMVar1. The variant computes the distance matrix $\mathbf{D}$ between the nodes of the two graphs as the L2 norm of node feature differences, transforms $\mathbf{D}$ into a matrix of positive finite values via $e^{-(\mathbf{D}_{i,j}-0.5)}$, and uses Sinkhorn to calculate the soft correspondences. The results are listed in the first row of Table VIII. We find that the registration accuracy becomes very poor when using this variant, which shows that the proposed AIS module effectively improves the registration performance. This is because the AIS module generates more correct matches than its variant; an illustrative example of the correspondences generated by AIS and its variant is shown in Figure 6 (e) and (b).
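A minimal PyTorch sketch of this RGMVar1 baseline; the number of Sinkhorn iterations is an assumption, and the slack row/column used to absorb outliers is omitted for brevity.

import torch

def ais_variant_soft_correspondence(feat_x, feat_y, n_iters=20):
    # RGMVar1: pairwise L2 feature distances -> exp(-(D - 0.5)) affinities
    # -> Sinkhorn normalization into a (near) doubly-stochastic matrix.
    D = torch.cdist(feat_x, feat_y, p=2)   # (N, M) distance matrix
    log_alpha = -(D - 0.5)                 # log of exp(-(D - 0.5))
    for _ in range(n_iters):
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=1, keepdim=True)
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=0, keepdim=True)
    return torch.exp(log_alpha)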

Figure 6: An illustrative case of the ground-truth correspondences and the correspondences generated by RGM and its variants. RGM generates many more correct correspondences.
TABLE VIII: Ablation studies

Variants  |  MAE(R)  MAE(t)   MIE(R)   MIE(t)   CCD      Recall
RGMVar1   |  10.746  0.07014  19.2722  0.14255  0.11304  18.33%
RGMVar2   |   1.554  0.01454   2.9051  0.03101  0.08632  74.17%
RGMVar3   |   1.197  0.01083   2.2612  0.02236  0.08605  75.59%
RGM       |   0.837  0.00674   1.5457  0.01418  0.08469  84.28%
RGM       |   0.464  0.00396   0.8696  0.00574  0.08364  90.60%

To understand the importance of our edge generator, we design two variants that use fully connected edges and sparsely connected edges instead of building edges with a transformer; the resulting methods are denoted as RGMVar2 and RGMVar3, respectively. For the fully connected edges, we connect each point to all other points in the point cloud. For the sparsely connected edges, we only connect each point to the points inside a sphere centered at that point with a radius of 0.2. The results are shown in the second and third rows of Table VIII, and they are also inferior to the performance obtained by using a transformer to generate edges. Examples of the hard correspondences generated by these two variants are shown in Figure 6 (c) and (d).
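A minimal sketch of these two edge-construction variants on an (N, 3) point cloud; function names are illustrative.

import numpy as np
from scipy.spatial.distance import cdist

def full_connection_adjacency(points):
    # RGMVar2: every point is connected to every other point.
    n = len(points)
    return np.ones((n, n)) - np.eye(n)

def sparse_connection_adjacency(points, radius=0.2):
    # RGMVar3: connect each point to the points inside a sphere of the given radius.
    adj = (cdist(points, points) < radius).astype(np.float64)
    np.fill_diagonal(adj, 0.0)
    return adj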

6 Conclusion

We introduce deep graph matching to solve the point cloud registration problem for the first time and propose a novel deep learning framework, RGM, that achieves state-of-the-art performance in both object-level and scene-level point cloud registration. We propose the AIS module to establish accurate correspondences between graph nodes, which greatly improves registration performance. In addition, the transformer-based edge generator provides a new way to build graph edges beyond full connection, nearest-neighbor connection and Delaunay triangulation. We believe that the deep graph matching approach has the potential to be applied to other registration problems, including 2D-3D registration and deformable registration.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant 62076070.

References

  • [1] W. Lu, Y. Zhou, G. Wan, S. Hou, and S. Song, “L3-net: Towards learning based lidar localization for autonomous driving,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6389–6398.
  • [2] Y. Li, L. Ma, Z. Zhong, F. Liu, M. A. Chapman, D. Cao, and J. Li, “Deep learning for lidar point clouds in autonomous driving: a review,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 8, pp. 3412–3432, 2020.
  • [3] G. Wan, X. Yang, R. Cai, H. Li, Y. Zhou, H. Wang, and S. Song, “Robust and precise vehicle localization based on multi-sensor fusion in diverse city scenes,” in IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 4670–4677.
  • [4] H. Yoo, A. Choi, and J. H. Mun, “Acquisition of point cloud in ct image space to improve accuracy of surface registration: Application to neurosurgical navigation system,” Journal of Mechanical Science and Technology, vol. 34, no. 6, pp. 2667–2677, 2020.
  • [5] L. Han, L. Xu, D. Bobkov, E. Steinbach, and L. Fang, “Real-time global registration for globally consistent rgb-d slam,” IEEE Transactions on Robotics, vol. 35, no. 2, pp. 498–508, 2019.
  • [6] J.-E. Deschaud, “Imls-slam: scan-to-model matching based on 3d data,” in IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 2480–2485.
  • [7] H. Li and R. Hartley, “The 3d-3d registration problem revisited,” in IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–8.
  • [8] P. J. Besl and N. D. McKay, “Method for registration of 3-d shapes,” in Sensor fusion IV: control paradigms and data structures, vol. 1611.   International Society for Optics and Photonics, 1992, pp. 586–606.
  • [9] A. Segal, D. Haehnel, and S. Thrun, “Generalized-icp.” in Robotics: Science and Systems, vol. 2, no. 4, 2009, pp. 435–443.
  • [10] J. Yang, H. Li, D. Campbell, and Y. Jia, “Go-icp: A globally optimal solution to 3d icp point-set registration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 11, pp. 2241–2254, 2015.
  • [11] D. Campbell and L. Petersson, “Gogma: Globally-optimal gaussian mixture alignment,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5685–5694.
  • [12] D. Campbell, L. Petersson, L. Kneip, H. Li, and S. Gould, “The alignment of the spheres: Globally-optimal spherical mixture alignment for camera pose estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11 796–11 806.
  • [13] R. B. Rusu, N. Blodow, and M. Beetz, “Fast point feature histograms (fpfh) for 3d registration,” in IEEE International Conference on Robotics and Automation (ICRA), 2009, pp. 3212–3217.
  • [14] S. Salti, F. Tombari, and L. Di Stefano, “Shot: Unique signatures of histograms for surface and texture description,” Computer Vision and Image Understanding, vol. 125, pp. 251–264, 2014.
  • [15] V. Sarode, X. Li, H. Goforth, Y. Aoki, R. A. Srivatsan, S. Lucey, and H. Choset, “Pcrnet: Point cloud registration network using pointnet encoding,” in IEEE International Conference on Computer Vision (ICCV), 2019.
  • [16] Y. Aoki, H. Goforth, R. A. Srivatsan, and S. Lucey, “Pointnetlk: Robust & efficient point cloud registration using pointnet,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7163–7172.
  • [17] X. Li, J. K. Pontes, and S. Lucey, “Pointnetlk revisited,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 12 763–12 772.
  • [18] Y. Wang and J. M. Solomon, “Deep closest point: Learning representations for point cloud registration,” in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 3523–3532.
  • [19] Z. J. Yew and G. H. Lee, “Rpm-net: Robust point matching using learned features,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 824–11 833.
  • [20] J. Li, C. Zhang, Z. Xu, H. Zhou, and C. Zhang, “Iterative distance-aware similarity matrix convolution with mutual-supervised point elimination for efficient point cloud registration,” European Conference on Computer Vision (ECCV), 2019.
  • [21] S. Huang, Z. Gojcic, M. Usvyatsov, A. Wieser, and K. Schindler, “Predator: Registration of 3d point clouds with low overlap,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 4267–4276.
  • [22] C. Choy, J. Park, and V. Koltun, “Fully convolutional geometric features,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 8958–8966.
  • [23] X. Bai, Z. Luo, L. Zhou, H. Fu, L. Quan, and C.-L. Tai, “D3feat: Joint learning of dense detection and description of 3d local features,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 6359–6367.
  • [24] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser, “3dmatch: Learning local geometric descriptors from rgb-d reconstructions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1802–1811.
  • [25] H. Yu, F. Li, M. Saleh, B. Busam, and S. Ilic, “Cofinet: Reliable coarse-to-fine correspondences for robust pointcloud registration,” Advances in Neural Information Processing Systems (NeurIPS), vol. 34, 2021.
  • [26] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
  • [27] R. Wang, J. Yan, and X. Yang, “Learning combinatorial embedding networks for deep graph matching,” in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 3056–3065.
  • [28] R. Jonker and A. Volgenant, “A shortest augmenting path algorithm for dense and sparse linear assignment problems,” Computing, vol. 38, no. 4, pp. 325–340, 1987.
  • [29] H. W. Kuhn, “The hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
  • [30] K. Fu, S. Liu, X. Luo, and M. Wang, “Robust point cloud registration framework based on deep graph matching,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8893–8902.
  • [31] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2980–2988.
  • [32] S. Rusinkiewicz, “A symmetric objective function for icp,” ACM Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1–7, 2019.
  • [33] F. Pomerleau, F. Colas, and R. Siegwart, “A review of point cloud registration algorithms for mobile robotics,” Now Foundations and Trends in Robotics, vol. 4, no. 1, pp. 1–104, 2015.
  • [34] S. Rusinkiewicz and M. Levoy, “Efficient variants of the icp algorithm,” in International Conference on 3-D Digital Imaging and Modeling, 2001, pp. 145–152.
  • [35] B. Jian and B. C. Vemuri, “Robust point set registration using gaussian mixture models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1633–1645, 2010.
  • [36] B. Eckart, K. Kim, and J. Kautz, “Hgmr: Hierarchical gaussian mixtures for adaptive 3d registration,” in European Conference on Computer Vision (ECCV), 2018, pp. 705–721.
  • [37] Y. Liu, C. Wang, Z. Song, and M. Wang, “Efficient global point cloud registration by matching rotation invariant features through translation search,” in European Conference on Computer Vision (ECCV), 2018, pp. 448–463.
  • [38] Q.-Y. Zhou, J. Park, and V. Koltun, “Fast global registration,” in European Conference on Computer Vision (ECCV), 2016, pp. 766–782.
  • [39] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 652–660.
  • [40] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph cnn for learning on point clouds,” ACM Transactions on Graphics, vol. 38, no. 5, pp. 1–12, 2019.
  • [41] Y. Aoki, H. Goforth, R. A. Srivatsan, and S. Lucey, “Pointnetlk: Robust & efficient point cloud registration using pointnet,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7163–7172.
  • [42] S. Baker and I. Matthews, “Lucas-kanade 20 years on: A unifying framework,” International Journal of Computer Vision, vol. 56, no. 3, pp. 221–255, 2004.
  • [43] B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in Proceedings of Imaging Understanding Workshop, 1981, pp. 121–130.
  • [44] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [45] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 5998–6008.
  • [46] Y. Wang and J. M. Solomon, “Prnet: Self-supervised learning for partial-to-partial registration,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 8814–8826.
  • [47] W. Yuan, B. Eckart, K. Kim, V. Jampani, D. Fox, and J. Kautz, “Deepgmr: Learning latent gaussian mixture models for registration,” in European Conference on Computer Vision (ECCV), 2020.
  • [48] R. Sinkhorn, “A relationship between arbitrary positive matrices and doubly stochastic matrices,” The Annals of Mathematical Statistics, vol. 35, no. 2, pp. 876–879, 1964.
  • [49] H. Deng, T. Birdal, and S. Ilic, “Ppfnet: Global context aware local features for robust 3d point matching,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 195–205.
  • [50] ——, “Ppf-foldnet: Unsupervised learning of rotation invariant 3d local descriptors,” in European Conference on Computer Vision (ECCV), 2018, pp. 602–618.
  • [51] Y. Zhao, T. Birdal, H. Deng, and F. Tombari, “3d point capsule networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1009–1018.
  • [52] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas, “Kpconv: Flexible and deformable convolution for point clouds,” in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 6411–6420.
  • [53] K. Fischer, M. Simon, F. Olsner, S. Milz, H.-M. Gross, and P. Mader, “Stickypillars: Robust and efficient feature matching on point clouds using graph neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 313–323.
  • [54] O. Duchenne, A. Joulin, and J. Ponce, “A graph-matching kernel for object categorization,” in IEEE International Conference on Computer Vision (ICCV), 2011, pp. 1792–1799.
  • [55] N. Ufer and B. Ommer, “Deep semantic feature matching,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6914–6923.
  • [56] F. Zhou and F. De la Torre, “Factorized graph matching,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 127–134.
  • [57] M. Cho, K. Alahari, and J. Ponce, “Learning graphs to match,” in IEEE International Conference on Computer Vision (ICCV), 2013, pp. 25–32.
  • [58] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 4938–4947.
  • [59] A. Nowak, S. Villar, A. S. Bandeira, and J. Bruna, “Revised note on learning quadratic assignment with graph neural networks,” in IEEE Data Science Workshop (DSW), 2018, pp. 1–5.
  • [60] A. Zanfir and C. Sminchisescu, “Deep learning of graph matching,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2684–2693.
  • [61] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a ”siamese” time delay neural network,” in Advances in Neural Information Processing Systems (NeurIPS), 1994, pp. 737–744.
  • [62] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv preprint arXiv:1607.08022, 2016.
  • [63] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1912–1920.
  • [64] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.
  • [65] H. Xu, S. Liu, G. Wang, G. Liu, and B. Zeng, “Omnet: Learning overlapping mask for partial-to-partial point cloud registration,” in IEEE International Conference on Computer Vision (ICCV), 2021, pp. 3132–3141.
  • [66] Q.-Y. Zhou, J. Park, and V. Koltun, “Open3d: A modern library for 3d data processing,” arXiv preprint arXiv:1801.09847, 2018.
  • [67] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser, “3dmatch: Learning local geometric descriptors from rgb-d reconstructions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [68] Z. Gojcic, C. Zhou, J. D. Wegner, and A. Wieser, “The perfect match: 3d point cloud matching with smoothed densities,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5545–5554.
  • [69] S. Ao, Q. Hu, B. Yang, A. Markham, and Y. Guo, “Spinnet: Learning a general surface descriptor for 3d point cloud registration,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 11 753–11 762.
  • [70] R. B. Rusu, N. Blodow, and M. Beetz, “Fast point feature histograms (fpfh) for 3d registration,” in IEEE International Conference on Robotics and Automation.   IEEE, 2009, pp. 3212–3217.