
COTReg: Coupled Optimal Transport based Point Cloud Registration

Guofeng Mei, Xiaoshui Huang, Litao Yu, Jian Zhang, Qiang Wu, and Mohammed Bennamoun Guofeng Mei, Litao Yu, Qiang Wu, and Jian Zhang are with the Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney NSW 2007, Australia. (e-mail: litao.yu; qiang.wu; [email protected]; [email protected]). Xiaoshui Huang is with the Image X Institute of Faculty of Medicine and Health, University of Sydney, Sydney NSW 2015, Australia. Mohammed Bennamoun is with the Department of Computer Science and Software Engineering, the University of Western Australia, WA 6009, Australia. Corresponding author: Jian Zhang ([email protected]).
Abstract

Generating a set of high-quality correspondences or matches is one of the most critical steps in point cloud registration. This paper proposes a learning framework, COTReg, that jointly considers pointwise and structural matchings to estimate correspondences in a coarse-to-fine manner. Specifically, we transform the two matchings into a Wasserstein distance-based and a Gromov-Wasserstein distance-based optimization, respectively. The task of establishing correspondences can thus be naturally recast as a coupled optimal transport problem. Furthermore, we design an overlap attention module, built primarily on transformer layers, and train it to predict the probability (overlap score) that each point lies in the overlapping region. The overlap scores then guide the correspondence prediction. Comprehensive experiments on the 3DMatch, KITTI, and 3DCSR benchmarks show the state-of-the-art performance of the proposed method.

Index Terms:
Point cloud, Registration, Point-wise matching, Structural matching, Optimal transport, Coarse-to-fine.

1 Introduction

Point cloud registration aims to align two or more 3D point clouds, acquired from different views, platforms, or times, into a unified coordinate system [1]. This technique is the cornerstone of many 3D computer vision applications, such as 3D reconstruction [2], augmented reality [3], autonomous driving [4, 5, 6], cancer radiotherapy [7, 8], and robotics [9, 10].

Recently, various deep learning-based point cloud registration approaches have achieved outstanding performance. Some of these methods show that combining conventional optimization with deep neural networks yields better accuracy than traditional optimization alone [11]. Correspondence-based algorithms account for a significant proportion of deep learning-based registration methods. State-of-the-art correspondence-based pipelines commonly consist of the following stages [12]: feature extraction, correspondence prediction, outlier rejection, and transformation estimation. Such approaches mainly focus on improving registration performance by extracting highly discriminative features [13, 14, 15, 16] or removing outlier correspondences [11, 12, 17]. Correspondence prediction is also a critical stage, since the estimation of the rigid motion parameters depends on correct correspondences, yet very few works have been devoted to improving correspondence prediction algorithms [11]. Therefore, this paper focuses on developing a correspondence prediction algorithm to obtain more accurate correspondences for pairwise point cloud registration.

Figure 1: Key concept of the proposed correspondence prediction algorithm. It consists of pointwise matching based on feature similarities, e.g., between $\bm{p}_i$ and $\bm{q}_j$, and structural matching based on geometric similarities between two pairs of points, e.g., $\{\bm{p}_i,\bm{p}_k\}$ and $\{\bm{q}_j,\bm{q}_l\}$. The size of each point represents its overlap score.

In learning-based 3D point cloud registration, pointwise matching is often used to establish matches between two point clouds [18]. Its core idea is that a pair of points from the source and target point clouds with the most similar feature representations is identified as a correspondence. However, the putative correspondences produced by pointwise matching contain many false matches and outliers [19]. For example, the feature matching recall of the recent FCGF [20] drops to 80% when the required inlier ratio is set to 0.2, meaning that for at least 20% of the point cloud pairs, more than 80% of the putative correspondences are outliers [20]. We observe two factors that cause false matches and outliers: first, some inliers are mis-assigned as outliers because inliers and outliers are treated equally in the correspondence prediction stage [15]; second, pointwise matching alone is insufficient to find the correct correspondences due to the ambiguous and repeated patterns in 3D point clouds [16]. To address these problems, we introduce overlap scores, which act as auxiliary information to assist correspondence prediction. Combining overlap scores with pointwise matching induces an optimal transport problem [21], i.e., the Wasserstein distance (WD) [21]. Furthermore, inspired by the success of structural information (edge length) in graph matching [22, 23], we also consider structure as an additional constraint for producing correspondences. In our case, the structure is defined as the distance between two points within a point cloud in both the Euclidean and feature spaces, so the structural matching is formulated by comparing geometric differences based on the Gromov-Wasserstein distance (GWD) [21]. The pointwise and structural matchings are illustrated in Figure 1. By integrating both matchings into a unified learning framework, we formulate the problem of establishing correspondences for point cloud registration as a coupled optimal transport problem, which can easily be integrated into deep neural networks for end-to-end training. Our method is based on the intuition that incorporating both pointwise and structural matchings into correspondence generation can resolve the ambiguity issue and achieve better performance than using either matching alone. We also design a positional encoding scheme that assigns intrinsic geometric properties to per-point features to enhance the distinctions among point features in indistinctive regions. Specifically, the coupled optimal transport incorporates both the pointwise and the structural matchings, with the overlap scores acting as empirical distributions, to effectively improve the accuracy of the generated correspondences. To reduce the complexity of the matching process, our model adopts a coarse-to-fine mechanism, i.e., a hierarchical matching strategy, to generate the correspondences.

To the best of our knowledge, we are the first to apply coupled optimal transport to predict correspondences for point cloud registration. The contributions of this paper are summarized as follows:

  • We incorporate overlap scores, pointwise and structural matchings in a joint model based on coupled optimal transport to produce correspondences in a coarse-to-fine manner.

  • We transform the pointwise matching into Wasserstein distance-based optimization to generate correspondences.

  • We cast the structural matching into Gromov-Wasserstein distance-based optimization to predict matches. The structural matching is formulated by considering the structural difference in both Euclidean and feature spaces.

  • We introduce an overlap attention module to learn point-wise features and overlap scores.

  • Comprehensive experiments on indoor and outdoor, synthetic, and cross-source point clouds demonstrate that the proposed method can effectively improve the accuracy of 3D point cloud registration.

2 Related Work

We review both traditional and deep learning-based point cloud registration methods. Since optimal transport is a major component in the 3D correspondence prediction of the proposed learning framework, we also review the relevant works.

2.1 Traditional Point Cloud Registration

The best-known traditional registration algorithm is the Iterative Closest Point (ICP) algorithm [24], which alternates between rigid motion estimation and correspondence search [25, 26] by solving an $L_2$-optimization. However, ICP can converge to spurious local minima due to the non-convexity of the problem. Many variants of this algorithm have been proposed. For example, Levenberg-Marquardt ICP [27] produces a transformation by replacing the singular value decomposition with gradient descent and Gauss-Newton approaches, accelerating convergence while ensuring high accuracy. Go-ICP [28] solves the point cloud alignment problem globally using a Branch-and-Bound (BnB) optimization framework without prior information on correspondences or the transformation. RANSAC-like algorithms are widely used for robustly finding correct correspondences for registration [29]. FGR [30] optimizes a Geman-McClure cost-induced, correspondence-based objective function in a graduated non-convexity strategy and achieves high performance. TEASER [31] reformulates the registration problem as an intractable optimization and provides readily checkable conditions to verify the optimal solution. However, most existing methods still face challenges when point clouds contain a mixture of noise, outliers, and partial overlap [32]. In contrast, learning-based methods tend to be more robust [33].

2.2 Deep Learning-Based Point Cloud Registration

Deep point cloud registration methods can be broadly classified into correspondence-free and correspondence-based methods [32]. The core idea of correspondence-free approaches [34, 35] is to estimate the transformation by minimizing the difference between global features extracted from the two input point clouds. However, the global features rely on all the points of a point cloud, which makes correspondence-free approaches inapplicable to real scenes with partial overlap [11, 32]. In contrast, correspondence-based methods extract per-point or per-patch embeddings of the two input point clouds to generate a mapping and estimate a rigid transformation [32]. For instance, DCP [25] employs DGCNN [36] for feature extraction and an attention module to generate soft matching pairs. FCGF [20] uses $1\times1\times1$ kernels to extract compact metric features for geometric correspondence. RGM [16] considers information exchange between two point clouds to extract discriminative features for pointwise matching. Predator [15] and PRNet [1] focus on detecting points in the overlap region and use the features of the detected points to generate matches. DGR [11] proposes a 6-dimensional convolutional network architecture for inlier-likelihood prediction and estimates the transformation via a weighted Procrustes module. 3DRegNet [17] uses an inlier prediction model to estimate the inliers. PointDSC [12] integrates pairwise similarity as a constraint to improve correspondence accuracy. RPMNet [37] and [38] handle partial visibility of point clouds by integrating the Sinkhorn algorithm into a network to obtain soft correspondences from local features; soft correspondences can increase the robustness of the registration. IDAM [39] incorporates both geometric and distance features into an iterative matching process. Some works [13, 14, 40, 41] also focus on alleviating the dependence on manual annotation when extracting features. Most of these methods generate correspondences by computing point feature similarities and then directly matching the highest response [39]. This strategy has two obvious limitations. First, some inliers are assigned as outliers, since inliers and outliers are treated equally in the correspondence prediction stage [15]. Second, one-shot matching may fail because multiple possible correspondences exist due to randomness [39], while the per-point features themselves are not powerful enough to distinguish the geometric properties in point clouds [16]. Based on these observations, additional constraints should be imposed to obtain high-quality correspondences. Inspired by edge similarity in graph matching [22, 23], where matched pairs hold an equal-length constraint in both the Euclidean and feature spaces, we propose a coupled optimal transport-based framework that incorporates overlap scores into pointwise matching as well as structural matching to generate more accurate correspondences. We also design a positional encoding scheme that assigns intrinsic geometric properties to per-point features to enhance the distinctions among point features in indistinctive regions.

2.3 Optimal Transport

Optimal transport (OT) [21] is a method for finding the best assignment between two general sets of objects and is widely applied in machine learning [42]. OT provides a way to predict correspondences and measure the similarity between two distributions. The distance induced by OT is called the Wasserstein distance (WD), which has been used in various computer vision tasks [43]. For instance, Su et al. [44] employed the Wasserstein distance for 3D shape matching and surface registration, and Dang et al. [38] applied it to 3D point cloud registration. The Gromov-Wasserstein distance (GWD) [21] extends WD by computing couplings between metric measure spaces. Unlike WD, which calculates the distance between two entity sets, GWD can be utilized to compare the structural distances within each domain. The GWD has been used for shape analysis [45], graph matching [46], etc. The complementary properties of WD and GWD, which focus on feature and structure information respectively, motivate us to use WD to build the pointwise matching and GWD to build the structural matching. In contrast to previous correspondence prediction algorithms for point cloud registration, we treat point clouds as probability distributions, defined by the overlap scores and embedded in a metric space based on feature and structure similarities. Correspondence prediction is then cast as computing a coupled optimal transport distance between the two distributions, which endows the pointwise and structural matchings with overlap scores to generate more accurate correspondences.

3 Problem Formulation

Before introducing our method, we explain the formulation and notation of the problem. We consider two point clouds $\bm{\mathcal{P}}=\{\bm{p}_i\in\mathbb{R}^3 \mid i=1,\cdots,N\}$ and $\bm{\mathcal{Q}}=\{\bm{q}_i\in\mathbb{R}^3 \mid i=1,\cdots,M\}$, with their associated features $\bm{\mathcal{F}}_p=\{\bm{f}_{\bm{p}_i}\in\mathbb{R}^d \mid i=1,\cdots,N\}$ and $\bm{\mathcal{F}}_q=\{\bm{f}_{\bm{q}_i}\in\mathbb{R}^d \mid i=1,\cdots,M\}$, respectively. The registration problem is to recover a rigid transformation $\bm{T}$ with rotation $\bm{R}\in SO(3)$ and translation $\bm{t}\in\mathbb{R}^3$ that aligns the source set $\bm{\mathcal{P}}$ to the target set $\bm{\mathcal{Q}}$. Obviously, $\bm{T}$ can only be determined from the data in the overlapping areas of $\bm{\mathcal{P}}$ and $\bm{\mathcal{Q}}$, provided both sets have sufficient overlap. We define an assignment matrix $\Gamma\in\mathbb{R}^{N\times M}$ with elements $\Gamma_{ij}\in\{0,1\}$ to represent the matching confidence between points $\bm{p}_i$ and $\bm{q}_j$, where each element satisfies

\Gamma_{ij}=\begin{cases}1,&\text{if point }\bm{p}_i\text{ corresponds to }\bm{q}_j,\\ 0,&\text{otherwise}.\end{cases} \quad (1)

Let us first consider the case where the overlapping region is given, and we use the overlap scores to depict the overlapping region. The overlap score of the point $\bm{p}_i$ is defined as

\bm{\mu}_{\bm{p}_i}=\begin{cases}1,&\text{if point }\bm{p}_i\text{ is an inlier of }\bm{\mathcal{P}},\\ 0,&\text{otherwise},\end{cases} \quad (2)

where $\bm{\mu}_{\bm{q}_j}$ is defined similarly. Let $\bm{\mu}_p=\{\bm{\mu}_{\bm{p}_i} \mid i=1,\cdots,N\}$ and $\bm{\mu}_q=\{\bm{\mu}_{\bm{q}_j} \mid j=1,\cdots,M\}$. In practice, it is usually impossible to determine with certainty whether a point lies in the overlap region. Thus, we relax the overlap scores to $\bm{\mu}_{\bm{x}}\in[0,1]$ and use a neural network to estimate them. The details of the overlap score prediction module are given in Sec. 4. When considering the overlap scores, the registration task can be cast as the following problem:

\min_{\Gamma,\bm{R},\bm{t}}\ \sum_{i=1}^{N}\sum_{j=1}^{M}\Gamma_{ij}\|\bm{R}\bm{p}_i+\bm{t}-\bm{q}_j\|_2, \quad (3)
\text{s.t.}\ \sum_{j=1}^{M}\Gamma_{ij}=\bm{\mu}_{\bm{p}_i},\ \sum_{i=1}^{N}\Gamma_{ij}=\bm{\mu}_{\bm{q}_j},\ \Gamma_{ij}\in\{0,1\}.

The three constraints enforce $\Gamma$ to be a permutation matrix. If we know the optimal rigid transformation $\{\bm{R},\bm{t}\}$, the assignment matrix $\Gamma$ can be recovered from Eq. (3). Conversely, given the assignment matrix $\Gamma$, we generate correspondences $\mathcal{M}=\{(\bm{p}_i,\bm{q}_j) \mid j=\arg\max_k\Gamma_{ik}\}$, which can be directly fed to RANSAC [47] or an SVD-based solver to estimate the transformation.
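For concreteness, the following NumPy sketch shows how a correspondence set $\mathcal{M}$ could be turned into a rigid transformation with the SVD-based (Kabsch) solver mentioned above; the function name and the optional confidence weights are illustrative choices of ours, not part of the paper (the RANSAC variant would wrap this step in a hypothesize-and-verify loop).

import numpy as np

def estimate_rigid_transform(src, tgt, weights=None):
    """Weighted Kabsch/Procrustes: minimize sum_i w_i ||R p_i + t - q_i||^2.

    src, tgt: (K, 3) arrays of corresponding points (p_i -> q_i).
    weights:  optional (K,) non-negative confidences.
    """
    if weights is None:
        weights = np.ones(len(src))
    w = weights / weights.sum()

    # Weighted centroids and centered coordinates.
    src_c = (w[:, None] * src).sum(0)
    tgt_c = (w[:, None] * tgt).sum(0)
    A = (src - src_c) * w[:, None]
    B = tgt - tgt_c

    # Cross-covariance and SVD; the sign correction keeps det(R) = +1.
    H = A.T @ B
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = tgt_c - R @ src_c
    return R, t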

The following notation will be used throughout the paper. $\bm{p}_i\rightarrow\bm{q}_j$ indicates a correspondence between $\bm{p}_i$ and $\bm{q}_j$. $\{\bm{p}_i,\bm{p}_k\}$ represents a pair. $\left<\cdot,\cdot\right>$ is the inner product operator. The Kullback-Leibler ($\mathcal{KL}$) divergence between two non-negative vectors $\bm{a}=(a_1,a_2,\cdots,a_n)$ and $\bm{b}=(b_1,b_2,\cdots,b_n)$ is defined as

\mathcal{KL}\left(\bm{a}|\bm{b}\right)=\sum_{i=1}^{n}\left(a_i\log\left(\frac{a_i}{b_i}\right)-a_i+b_i\right), \quad (4)

with the convention $0\log 0=0$.
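The generalized $\mathcal{KL}$ divergence in Eq. (4), including the $0\log 0=0$ convention, can be written directly in NumPy; the small sketch below is only an illustration of the formula (the epsilon guard is our numerical choice).

import numpy as np

def generalized_kl(a, b, eps=1e-12):
    """KL(a | b) = sum_i a_i log(a_i / b_i) - a_i + b_i for non-negative vectors.

    The eps guard implements the 0*log(0) = 0 convention numerically.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    ratio = np.log((a + eps) / (b + eps))
    return float(np.sum(a * ratio - a + b))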

4 Proposed Optimal Transport-Based Point Cloud Registration

Figure 2: Overview of the correspondence prediction. Point clouds $\bm{\mathcal{P}}$ and $\bm{\mathcal{Q}}$, with their features $\bm{\mathcal{F}}_p$ and $\bm{\mathcal{F}}_q$ and overlap scores $\bm{\mu}_p,\bm{\mu}_q$. $\bm{C}^{pq}$ is the cross-distance matrix; $\bm{C}^p$ and $\bm{C}^q$ are the structure matrices. The assignment matrix $\Gamma$ is predicted by solving a coupled optimal transport problem. $\bm{\mathcal{P}}$ and $\bm{\mathcal{Q}}$ have $N$ and $M$ points, respectively. $\mathcal{M}$ denotes the set of estimated correspondences.

In this paper, we propose a new method to generate correspondences by jointly considering pointwise and structural matchings, as illustrated in Figure 2. We first use $\bm{\mathcal{F}}_p$ and the coordinates of $\bm{\mathcal{P}}$ to calculate $\bm{C}^p$; similarly, $\bm{\mathcal{F}}_q$ and the coordinates of $\bm{\mathcal{Q}}$ are used to calculate $\bm{C}^q$. Then, $\bm{C}^p$, $\bm{C}^q$, and the overlap scores $\bm{\mu}_p$ and $\bm{\mu}_q$ are used for structural matching based on the Gromov-Wasserstein distance. $\bm{\mathcal{F}}_p$ and $\bm{\mathcal{F}}_q$ are used to calculate $\bm{C}^{pq}$, which, together with the overlap scores $\bm{\mu}_p$ and $\bm{\mu}_q$, is used for pointwise matching based on the Wasserstein distance. Finally, we couple the Wasserstein and Gromov-Wasserstein distances in a mutually beneficial way by sharing the assignment matrix $\Gamma$, which is estimated via the coupled optimal transport algorithm. $\bm{\mathcal{P}}$ and $\bm{\mathcal{Q}}$ have $N$ and $M$ points, respectively. The correspondences are finally obtained as $\mathcal{M}=\{(\bm{p}_i,\bm{q}_j) \mid j=\arg\max_k\Gamma_{ik}\}$. Next, we introduce how the coupled optimal transport is used to predict the correspondences.

4.1 Coupled Optimal Transport-Based Correspondence Prediction

As mentioned above, considering only pointwise matching is not sufficient to find accurate correspondences, due to the ambiguous and repeated patterns in 3D point clouds. Therefore, we exploit both pointwise and structural matchings to establish the correspondences between the source and target point clouds.

For pointwise matching, the cross-distance matrix $\bm{C}^{pq}\in\mathbb{R}^{N\times M}$ is derived from the feature space between point clouds $\bm{\mathcal{P}}$ and $\bm{\mathcal{Q}}$, with each element satisfying

\bm{C}^{pq}_{ij}=\mathcal{D}_f\left(\bm{f}_{\bm{p}_i},\bm{f}_{\bm{q}_j}\right),\quad 1\leq i\leq N,\ 1\leq j\leq M, \quad (5)

where $\mathcal{D}_f\left(\bm{f}_{\bm{p}_i},\bm{f}_{\bm{q}_j}\right)=\left\|\frac{\bm{f}_{\bm{p}_i}}{\|\bm{f}_{\bm{p}_i}\|_2}-\frac{\bm{f}_{\bm{q}_j}}{\|\bm{f}_{\bm{q}_j}\|_2}\right\|_2$ denotes the distance between $\bm{f}_{\bm{p}_i}$ and $\bm{f}_{\bm{q}_j}$.

To construct the structural matching, we first calculate the discrepancy (structure) matrix $\bm{C}^{p}\in\mathbb{R}^{N\times N}$ of the point pairs within $\bm{\mathcal{P}}$ in both the Euclidean and feature spaces, with the element related to $\{\bm{p}_i,\bm{p}_k\}$ satisfying

\bm{C}^{p}_{ik}=\lambda\underbrace{\mathcal{D}_e\left(\bm{p}_i,\bm{p}_k\right)}_{\text{Euclidean}}+(1-\lambda)\underbrace{\mathcal{D}_f\left(\bm{f}_{\bm{p}_i},\bm{f}_{\bm{p}_k}\right)}_{\text{Feature}}, \quad (6)

where $\mathcal{D}_e(\cdot,\cdot)$ measures the distance between two points in Euclidean space, $\mathcal{D}_e\left(\bm{p}_i,\bm{p}_k\right)=2\tanh(\|\bm{p}_i-\bm{p}_k\|_2)$, and $\lambda\in[0,1]$ is a hyperparameter controlling the contribution of the feature and coordinate information. Similarly, the element of $\bm{C}^{q}\in\mathbb{R}^{M\times M}$ related to $\{\bm{q}_j,\bm{q}_l\}$ for $\bm{\mathcal{Q}}$ satisfies

\bm{C}^{q}_{jl}=\lambda\mathcal{D}_e\left(\bm{q}_j,\bm{q}_l\right)+(1-\lambda)\mathcal{D}_f\left(\bm{f}_{\bm{q}_j},\bm{f}_{\bm{q}_l}\right). \quad (7)
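Under the definitions in Eqs. (5)-(7), the cross-distance and structure matrices can be computed with a few NumPy operations. The sketch below is illustrative only; the function names and the default $\lambda$ are assumptions of ours, not values fixed by the paper.

import numpy as np

def feature_dist(Fa, Fb):
    """D_f: pairwise Euclidean distance between L2-normalized feature rows."""
    Fa = Fa / np.linalg.norm(Fa, axis=1, keepdims=True)
    Fb = Fb / np.linalg.norm(Fb, axis=1, keepdims=True)
    # ||x - y||^2 = 2 - 2 x.y for unit vectors
    d2 = np.clip(2.0 - 2.0 * Fa @ Fb.T, 0.0, None)
    return np.sqrt(d2)

def euclid_dist(Xa, Xb):
    """D_e: 2 * tanh of the pairwise Euclidean distance between coordinates."""
    diff = Xa[:, None, :] - Xb[None, :, :]
    return 2.0 * np.tanh(np.linalg.norm(diff, axis=-1))

def build_cost_matrices(P, Q, Fp, Fq, lam=0.5):
    """C^{pq} (Eq. 5) and the structure matrices C^p, C^q (Eqs. 6-7)."""
    C_pq = feature_dist(Fp, Fq)                                        # (N, M)
    C_p = lam * euclid_dist(P, P) + (1 - lam) * feature_dist(Fp, Fp)   # (N, N)
    C_q = lam * euclid_dist(Q, Q) + (1 - lam) * feature_dist(Fq, Fq)   # (M, M)
    return C_pq, C_p, C_q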

We jointly exploit the pointwise and structural matchings, equipped with the overlap scores, to establish the correspondences between the source and target point clouds. This leads to the following optimization problem, which we solve with a coupled optimal transport method (so named because it couples two types of distance, i.e., WD and GWD):

\min_{\Gamma}\ \xi_1\sum_{ij}\underbrace{\Gamma_{ij}\bm{C}^{pq}_{ij}}_{\text{WD}}+\xi_2\sum_{ijkl}\underbrace{\Gamma_{ij}\Gamma_{kl}\left(\bm{C}^{p}_{ik}-\bm{C}^{q}_{jl}\right)^2}_{\text{GWD}}, \quad (8)
\text{s.t.}\ \Gamma\bm{1}_M=\bm{\mu}_p,\ \Gamma^{\top}\bm{1}_N=\bm{\mu}_q,\ \Gamma_{ij}\in[0,1],

where $\bm{1}_n$ denotes an $n$-dimensional all-one vector, and $\xi_1$ and $\xi_2$ are two non-negative hyperparameters weighting the pointwise and structural matchings, respectively. If $\xi_1>0$ and $\xi_2=0$, the problem depends only on the pointwise matching; if $\xi_1=0$ and $\xi_2>0$, it considers only the structural matching. When $\xi_1>0$ and $\xi_2>0$, the framework effectively considers both pointwise and structural matchings for better correspondence prediction by sharing the assignment matrix $\Gamma$. After obtaining $\Gamma$, methods such as SVD or RANSAC can be applied to estimate the transformation.

Next, we introduce the Wasserstein distance-based pointwise matching and the Gromov-Wasserstein distance-based structural matching.

4.2 Wasserstein Distance-Based Pointwise Matching

Our Wasserstein distance-based pointwise matching can be regarded as a variant of the nearest neighbor search with an additional bijectivity constraint that enforces global matching consistency. Since the overlap scores have been introduced, the model is also appropriate for partial registration. Given two point clouds $\bm{\mathcal{P}}$ and $\bm{\mathcal{Q}}$ with their associated features $\bm{\mathcal{F}}_p$ and $\bm{\mathcal{F}}_q$ and overlap scores $\bm{\mu}_p$ and $\bm{\mu}_q$, the assignment matrix $\Gamma$ can be estimated based on the following theorem.

Theorem 1.

Given features $\bm{\mathcal{F}}_p$ and $\bm{\mathcal{F}}_q$ and overlap scores $\bm{\mu}_p$ and $\bm{\mu}_q$, suppose $\bm{\mathcal{F}}_p,\bm{\mathcal{F}}_q$ are invariant to rigid transformations and $\bm{R}^{\star},\bm{t}^{\star},\Gamma^{\star}$ are the optimal solutions of the problem in Eq. (3). Then $\Gamma^{\star}$ is an optimal solution of the following optimization problem:

\min_{\Gamma}\left<\bm{C}^{pq},\Gamma\right>=\min_{\Gamma}\sum_{i=1}^{N}\sum_{j=1}^{M}\Gamma_{ij}\bm{C}^{pq}_{ij}, \quad (9)
\text{s.t.}\ \Gamma\bm{1}_M=\bm{\mu}_p,\ \Gamma^{\top}\bm{1}_N=\bm{\mu}_q,\ \Gamma_{ij}\in[0,1],

where $\bm{f}_{\bm{p}_i}\in\bm{\mathcal{F}}_p$ and $\bm{f}_{\bm{q}_j}\in\bm{\mathcal{F}}_q$ are the features of points $\bm{p}_i$ and $\bm{q}_j$, respectively, and $\bm{C}^{pq}_{ij}=\left\|\frac{\bm{f}_{\bm{p}_i}}{\|\bm{f}_{\bm{p}_i}\|_2}-\frac{\bm{f}_{\bm{q}_j}}{\|\bm{f}_{\bm{q}_j}\|_2}\right\|_2$. The constraint on the assignment matrix $\Gamma$ is relaxed to a doubly stochastic form, i.e., $\Gamma_{ij}\in[0,1]$.

Please refer to the proof of this theorem in Appendix A.

Remark 1.

Theorem 1 signifies that the assignment matrix $\Gamma$ can be calculated by solving the optimization problem in Eq. (9), which is an optimal transport [21] (Wasserstein distance) problem. It can be solved using the Sinkhorn algorithm [48].
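As an illustration of Remark 1, a minimal Sinkhorn sketch for the entropy-regularized version of Eq. (9) is given below. It assumes the overlap scores carry equal total mass on both sides (the balanced case); the regularization weight eps and the iteration count are arbitrary illustrative values, not values prescribed by the paper.

import numpy as np

def sinkhorn_wd(C_pq, mu_p, mu_q, eps=0.05, n_iters=100):
    """Entropic OT: min <C, Gamma> s.t. Gamma 1 = mu_p, Gamma^T 1 = mu_q."""
    K = np.exp(-C_pq / eps)            # Gibbs kernel
    u = np.ones_like(mu_p)
    v = np.ones_like(mu_q)
    for _ in range(n_iters):
        u = mu_p / (K @ v + 1e-16)     # enforce row marginals
        v = mu_q / (K.T @ u + 1e-16)   # enforce column marginals
    return u[:, None] * K * v[None, :] # transport plan Gamma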

4.3 Gromov-Wasserstein Distance-Based Structural Matching

Our method is based on the following observation: for any $\bm{p}_i,\bm{p}_k\in\bm{\mathcal{P}}$ and $\bm{q}_j,\bm{q}_l\in\bm{\mathcal{Q}}$ with associated features $\bm{f}_{\bm{p}_i},\bm{f}_{\bm{p}_k}\in\bm{\mathcal{F}}_p$ and $\bm{f}_{\bm{q}_j},\bm{f}_{\bm{q}_l}\in\bm{\mathcal{F}}_q$, if $\bm{p}_i\rightarrow\bm{q}_j$ and $\bm{p}_k\rightarrow\bm{q}_l$ are correct correspondences, then the distance between $\bm{p}_i$ and $\bm{p}_k$ should be similar to the distance between $\bm{q}_j$ and $\bm{q}_l$. Thus, the structural differences in both the Euclidean and feature spaces, i.e., $|c^e(\bm{p}_i,\bm{p}_k)-c^e(\bm{q}_j,\bm{q}_l)|$ and $|c^f(\bm{f}_{\bm{p}_i},\bm{f}_{\bm{p}_k})-c^f(\bm{f}_{\bm{q}_j},\bm{f}_{\bm{q}_l})|$, should be small. The Gromov-Wasserstein distance is often applied to find correspondences between two sets of samples based on their pairwise intra-domain similarity (or distance) matrices. We can therefore approximately transform correspondence prediction into a structural matching problem based on the Gromov-Wasserstein distance, as stated in the following theorem.

Theorem 2.

Given two point clouds $\bm{\mathcal{P}}$ and $\bm{\mathcal{Q}}$ with their associated features $\bm{\mathcal{F}}_p$ and $\bm{\mathcal{F}}_q$ and overlap scores $\bm{\mu}_p$ and $\bm{\mu}_q$, suppose $\bm{\mathcal{F}}_p,\bm{\mathcal{F}}_q$ are invariant to rigid transformations and $\bm{R}^{\star},\bm{t}^{\star},\Gamma^{\star}$ are the optimal solutions of problem (3). Then $\Gamma^{\star}$ is an optimal solution of the following Gromov-Wasserstein distance-based optimization:

\min_{\Gamma}\sum_{i=1}^{N}\sum_{j=1}^{M}\sum_{k=1}^{N}\sum_{l=1}^{M}\Gamma_{ij}\Gamma_{kl}\left(\bm{C}^{p}_{ik}-\bm{C}^{q}_{jl}\right)^2 \quad (10)
\text{s.t.}\ \Gamma\bm{1}_M=\bm{\mu}_p,\ \Gamma^{\top}\bm{1}_N=\bm{\mu}_q,\ \Gamma_{ij}\in[0,1],

where $\bm{C}^{p}_{ik}$ and $\bm{C}^{q}_{jl}$ are defined in Eq. (6) and Eq. (7), respectively.

The proof is available in Appendix B.

Remark 2.

The assignment matrix $\Gamma$ can be obtained from the problem in Eq. (10) by solving an entropy-regularized optimization. The structural matching is formulated by jointly considering the structural differences in both the Euclidean and feature spaces.

4.4 Model Optimization

Now, we introduce how to solve the problem in Eq. (8). For simplicity, we define a matrix $\bm{H}\left(\bm{C}^p,\bm{C}^q,\Gamma\right)\in\mathbb{R}^{N\times M}$ with elements $[\bm{H}\left(\bm{C}^p,\bm{C}^q,\Gamma\right)]_{kl}=\sum_{i=1}^{N}\sum_{j=1}^{M}\left(\bm{C}^{p}_{ik}-\bm{C}^{q}_{jl}\right)^2\Gamma_{ij}$, so that $\sum_{ijkl}\Gamma_{ij}\Gamma_{kl}\left(\bm{C}^{p}_{ik}-\bm{C}^{q}_{jl}\right)^2=\left<\bm{H}\left(\bm{C}^p,\bm{C}^q,\Gamma\right),\Gamma\right>$. The problem in Eq. (8) can then be rewritten as

\min_{\Gamma\geq 0}\ \xi_1\underbrace{\left<\bm{C}^{pq},\Gamma\right>}_{\text{WD}}+\xi_2\underbrace{\left<\bm{H}\left(\bm{C}^{p},\bm{C}^{q},\Gamma\right),\Gamma\right>}_{\text{GWD}}, \quad (11a)
\text{s.t.}\ \Gamma\bm{1}_M=\bm{\mu}_p,\ \Gamma^{\top}\bm{1}_N=\bm{\mu}_q. \quad (11b)
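The quadratic GWD term in Eq. (11a) need not be evaluated with the explicit four-index sum: expanding the square reduces $\bm{H}(\bm{C}^p,\bm{C}^q,\Gamma)$ to three matrix products, as in standard Gromov-Wasserstein solvers. The NumPy sketch below is an illustration of this decomposition, not the authors' code.

import numpy as np

def gw_grad_matrix(C_p, C_q, Gamma):
    """H with [H]_{kl} = sum_{ij} (C^p_{ik} - C^q_{jl})^2 Gamma_{ij}.

    Expanding the square turns the four-index sum into three matrix products.
    """
    a = Gamma.sum(axis=1)                 # row marginals, shape (N,)
    b = Gamma.sum(axis=0)                 # column marginals, shape (M,)
    term_p = (C_p ** 2).T @ a             # (N,): sum_i (C^p_{ik})^2 a_i
    term_q = (C_q ** 2).T @ b             # (M,): sum_j (C^q_{jl})^2 b_j
    cross = C_p.T @ Gamma @ C_q           # (N, M): sum_{ij} C^p_{ik} Gamma_{ij} C^q_{jl}
    return term_p[:, None] + term_q[None, :] - 2.0 * cross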

Standard optimal transport only allows a meaningful comparison of measures with the same total mass, i.e., $\sum_{i=1}^{N}\bm{\mu}_{p_i}=\sum_{j=1}^{M}\bm{\mu}_{q_j}$, which does not always hold in registration because of multiple correspondences. Following [49], we replace the constraints in Eq. (11b) with soft marginals ($\mathcal{KL}$ divergences). The optimization in Eq. (11) is then translated into the unconstrained approximate transport problem

\min_{\Gamma\geq 0}\ \xi_1\left<\bm{C}^{pq},\Gamma\right>+\xi_2\left<\bm{H}\left(\bm{C}^{p},\bm{C}^{q},\Gamma\right),\Gamma\right>+\tau\left(\mathcal{KL}\left(\Gamma\bm{1}_M|\bm{\mu}_p\right)+\mathcal{KL}\left(\Gamma^{\top}\bm{1}_N|\bm{\mu}_q\right)\right), \quad (12)

where $\tau>0$ is a regularization parameter that adjusts the strength of the soft-marginal penalization. We use the generalized proximal point method [50] and projected gradient descent, based on the $\mathcal{KL}$ metric, to solve the problem in Eq. (12). Following [45, 50], at iteration $k+1$ (for $k\geq 0$) we fix $\Gamma^{(k)}$, and $\mathcal{KL}\left(\Gamma|\Gamma^{(k)}\right)$ acts as a regularizer centered on the previous solution $\Gamma^{(k)}$. The update rule for Eq. (12) at iteration $k+1$ can be written as

\Gamma^{(k+1)}=\mathop{\arg\min}_{\Gamma\geq 0}\ \epsilon\,\mathcal{KL}\left(\Gamma|\Gamma^{(k)}\right)+\xi_1\left<\bm{C}^{pq},\Gamma\right>+\xi_2\left<\bm{H}\left(\bm{C}^{p},\bm{C}^{q},\Gamma^{(k)}\right),\Gamma\right>+\tau\left(\mathcal{KL}\left(\Gamma\bm{1}_M|\bm{\mu}_p\right)+\mathcal{KL}\left(\Gamma^{\top}\bm{1}_N|\bm{\mu}_q\right)\right), \quad (13)

with initialization $\Gamma^{(0)}=\bm{\mu}_p\bm{\mu}_q^{\top}$, where $\epsilon>0$ is a regularization parameter. $\mathcal{KL}\left(\Gamma|\Gamma^{(k)}\right)$ can be interpreted as a damping term that discourages $\Gamma^{(k+1)}$ from moving far away from $\Gamma^{(k)}$. For $\epsilon$ small enough, $\Gamma^{(k+1)}$ in Eq. (13) converges to the optimal solution of the problem in Eq. (11) as $\tau$ increases. The choice of $\epsilon$ trades off convergence speed against closeness to the original transport problem [48]. The solution of the problem in Eq. (13) is given by the following theorem.

Theorem 3.

Let $f(\Gamma^{(k)})=\xi_1\bm{C}^{pq}+\xi_2\bm{H}\left(\bm{C}^{p},\bm{C}^{q},\Gamma^{(k)}\right)-\epsilon\log\Gamma^{(k)}$ and let $\bm{C}\in\mathbb{R}^{N\times M}$ have elements $\bm{C}_{ij}=[f(\Gamma^{(k)})]_{ij}$. The optimal solution of the objective in Eq. (13) can be obtained by solving the following dual entropic regularized objective,

\min_{\bm{u},\bm{v}}h(\bm{u},\bm{v})=\min_{\bm{u},\bm{v}}\ \epsilon\sum_{i=1}^{N}\sum_{j=1}^{M}\exp\left(\frac{\bm{u}_i+\bm{v}_j-\bm{C}_{ij}}{\epsilon}\right)+\tau\left\langle\exp\left(-\frac{\bm{u}}{\tau}\right),\bm{\mu}_p\right\rangle+\tau\left\langle\exp\left(-\frac{\bm{v}}{\tau}\right),\bm{\mu}_q\right\rangle, \quad (14)

where $\bm{u}\in\mathbb{R}^{N}$ and $\bm{v}\in\mathbb{R}^{M}$ are dual variables.

Please refer to the proof of the theorem in Appendix C. The problem in Eq. (14) can be solved using the Sinkhorn algorithm [48, 49]. Specifically,

\frac{\partial h}{\partial\bm{u}_i}=0\ \Rightarrow\ \sum_{j=1}^{M}\exp\left(\frac{\bm{u}_i+\bm{v}_j-\bm{C}_{ij}}{\epsilon}\right)-\exp\left(-\frac{\bm{u}_i}{\tau}\right)\bm{\mu}_{\bm{p}_i}=0
\Rightarrow\ \exp\left(\frac{\bm{u}_i}{\epsilon}\right)\sum_{j=1}^{M}\exp\left(\frac{\bm{v}_j-\bm{C}_{ij}}{\epsilon}\right)=\exp\left(-\frac{\bm{u}_i}{\tau}\right)\bm{\mu}_{\bm{p}_i}.

Let $(\bm{u}^k,\bm{v}^k,\bm{a}^k,\bm{b}^k)$ be the solution returned at the $k$-th iteration of the algorithm, where $\bm{a}=B(\bm{u},\bm{v})\bm{1}_M$ and $\bm{b}=B(\bm{u},\bm{v})^{\top}\bm{1}_N$ with $B(\bm{u},\bm{v})=\operatorname{diag}\left(\exp\left(\frac{\bm{u}}{\epsilon}\right)\right)\cdot\exp\left(-\frac{\bm{C}}{\epsilon}\right)\cdot\operatorname{diag}\left(\exp\left(\frac{\bm{v}}{\epsilon}\right)\right)$. Suppose we are at iteration $k+1$ (for $k\geq 0$) with $\bm{v}^k$ fixed, i.e.,

\exp\left(\frac{\bm{u}_i^{k+1}}{\epsilon}\right)\sum_{j=1}^{M}\exp\left(\frac{\bm{v}_j^{k}-\bm{C}_{ij}}{\epsilon}\right)=\exp\left(-\frac{\bm{u}_i^{k+1}}{\tau}\right)\bm{\mu}_{\bm{p}_i}.

Multiplying both sides by $\exp\left(\frac{\bm{u}_i^{k}}{\epsilon}\right)$, we get

\exp\left(\frac{\bm{u}_i^{k+1}}{\epsilon}\right)\bm{a}_i^{k}=\exp\left(\frac{\bm{u}_i^{k}}{\epsilon}\right)\exp\left(-\frac{\bm{u}_i^{k+1}}{\tau}\right)\bm{\mu}_{\bm{p}_i}
\Rightarrow\ \bm{u}^{k+1}=\left[\frac{\bm{u}^{k}}{\epsilon}+\log\left(\bm{\mu}_p\right)-\log\left(\bm{a}^{k}\right)\right]\frac{\epsilon\tau}{\epsilon+\tau}.

Similarly, with $\bm{u}^k$ fixed, we get

\bm{v}^{k+1}=\left[\frac{\bm{v}^{k}}{\epsilon}+\log\left(\bm{\mu}_q\right)-\log\left(\bm{b}^{k}\right)\right]\frac{\epsilon\tau}{\epsilon+\tau}.

The pseudocode in Algorithm 1 summarizes the solution. The number of inner iterations can be determined from $\epsilon$, $\bm{C}_{ij}$, and $\max\{M,N\}$; the proof is similar to that of Theorem 2 in [49]. In our experiments, we found that setting $\xi_1=1.0$, $\tau=5.0$, $\epsilon=0.001$, $N_I=100$, and $N_O=20$ gives satisfactory results.

Algorithm 1 Coupled optimal transport algorithm.

Input: Distance matrices $\bm{C}^p$, $\bm{C}^q$, and $\bm{C}^{pq}$, overlap scores $\bm{\mu}_p$ and $\bm{\mu}_q$, hyperparameters $\tau,\epsilon>0$, $\xi_1=1$, and the numbers of outer/inner iterations $N_O,N_I$.

1:  Initialize $\Gamma^{(0)}=\bm{\mu}_p\bm{\mu}_q^{\top}$, $\bm{u}^{0}=\bm{0}$, $\bm{v}^{0}=\bm{0}$.
2:  for $k=0:N_O-1$ do
3:     $\xi_2=\frac{k}{N_O}$
4:     Compute $\bm{C}_{ij}=\left[f\left(\Gamma^{(k)}\right)\right]_{ij}$
5:     for $t=0:N_I-1$ do
6:        $\bm{a}^{t}=B(\bm{u}^{t},\bm{v}^{t})\bm{1}_M$, $\bm{b}^{t}=B(\bm{u}^{t},\bm{v}^{t})^{\top}\bm{1}_N$
7:        if $t$ is even then
8:           $\bm{u}^{t+1}=\left[\frac{\bm{u}^{t}}{\epsilon}+\log\left(\bm{\mu}_p\right)-\log\left(\bm{a}^{t}\right)\right]\frac{\epsilon\tau}{\epsilon+\tau}$
9:           $\bm{v}^{t+1}=\bm{v}^{t}$
10:       else
11:          $\bm{v}^{t+1}=\left[\frac{\bm{v}^{t}}{\epsilon}+\log\left(\bm{\mu}_q\right)-\log\left(\bm{b}^{t}\right)\right]\frac{\epsilon\tau}{\epsilon+\tau}$
12:          $\bm{u}^{t+1}=\bm{u}^{t}$
13:       end if
14:    end for
15:    $\Gamma^{(k+1)}=B(\bm{u}^{N_I},\bm{v}^{N_I})$
16: end for
17: Output: $\Gamma^{(N_O)}$
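A compact NumPy re-implementation of Algorithm 1 is sketched below for illustration. It follows the updates derived above, with two choices of our own that are not prescribed by the paper: the Sinkhorn scaling is carried out in the log domain (via logsumexp) for numerical stability, and $\bm{H}$ is evaluated with the expanded-square decomposition shown earlier. The hyperparameter defaults follow Sec. 4.4.

import numpy as np
from scipy.special import logsumexp

def coupled_ot(C_pq, C_p, C_q, mu_p, mu_q,
               xi1=1.0, tau=5.0, eps=0.001, n_outer=20, n_inner=100):
    """Sketch of Algorithm 1: coupled WD + GWD transport with KL-relaxed marginals."""
    N, M = C_pq.shape
    Gamma = np.outer(mu_p, mu_q)                      # Gamma^(0) = mu_p mu_q^T
    u, v = np.zeros(N), np.zeros(M)
    scale = eps * tau / (eps + tau)
    log_mu_p, log_mu_q = np.log(mu_p + 1e-30), np.log(mu_q + 1e-30)

    for k in range(n_outer):
        xi2 = k / n_outer                             # ramp up the GWD weight
        # H(C^p, C^q, Gamma) via the expanded-square decomposition.
        H = ((C_p ** 2).T @ Gamma.sum(1))[:, None] \
            + ((C_q ** 2).T @ Gamma.sum(0))[None, :] \
            - 2.0 * C_p.T @ Gamma @ C_q
        C = xi1 * C_pq + xi2 * H - eps * np.log(Gamma + 1e-30)
        for t in range(n_inner):
            logB = (u[:, None] + v[None, :] - C) / eps
            if t % 2 == 0:                            # update u with v fixed
                log_a = logsumexp(logB, axis=1)
                u = scale * (u / eps + log_mu_p - log_a)
            else:                                     # update v with u fixed
                log_b = logsumexp(logB, axis=0)
                v = scale * (v / eps + log_mu_q - log_b)
        Gamma = np.exp((u[:, None] + v[None, :] - C) / eps)
    return Gamma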

4.5 Combined with Learning Network

The solution of the optimization problem in Eq. (8) is sought over the space of $N\times M$ permutation matrices. Because of memory and speed limitations, solving it directly is not suitable for large-scale registration problems. To this end, COTReg adopts a hierarchical matching strategy that first establishes superpoint-level correspondences and then predicts point-level correspondences according to the superpoint-level matches. The pipeline of COTReg is illustrated in Figure 3; it is a shared-weight two-stream encoder-decoder network. Given a pair of point clouds $\bm{\mathcal{P}}$ and $\bm{\mathcal{Q}}$, the encoder aggregates the raw points into superpoints $\bar{\bm{\mathcal{P}}}=\{\bar{\bm{p}}_i\in\mathbb{R}^3 \mid i=1,2,\ldots,\bar{N}\}$ and $\bar{\bm{\mathcal{Q}}}=\{\bar{\bm{q}}_j\in\mathbb{R}^3 \mid j=1,2,\ldots,\bar{M}\}$, while jointly learning the associated features $\bm{\mathcal{F}}_{\bar{p}}=\{\bm{f}_{\bar{p}_i}\in\mathbb{R}^b \mid i=1,2,\ldots,\bar{N}\}$ and $\bm{\mathcal{F}}_{\bar{q}}=\{\bm{f}_{\bar{q}_j}\in\mathbb{R}^b \mid j=1,2,\ldots,\bar{M}\}$. The overlap attention block updates the features to $\bar{\bm{\mathcal{F}}}_{\bar{p}}$ and $\bar{\bm{\mathcal{F}}}_{\bar{q}}$ and projects them to coarse-level overlap score vectors $\bm{\mu}_{\bar{p}}=\{\bm{\mu}_{\bar{p}_i}\in[0,1]\}_{i=1}^{\bar{N}}$ and $\bm{\mu}_{\bar{q}}=\{\bm{\mu}_{\bar{q}_j}\in[0,1]\}_{j=1}^{\bar{M}}$. The updated features and overlap scores are then used to calculate the coarse-level correspondences. Finally, the decoder transforms the superpoint-level features and overlap scores into per-point feature descriptors $\bm{\mathcal{F}}_p$ and $\bm{\mathcal{F}}_q$ and overlap scores $\bm{\mu}_p$ and $\bm{\mu}_q$, which are used to estimate the fine-level correspondences.

Figure 3: Overview of the proposed COTReg combined with the learning network. COTReg adopts a hierarchical matching strategy that first establishes superpoint-level correspondences and then predicts point-level correspondences according to the superpoint-level matches.

4.5.1 Encoder

Inspired by Predator [15], a shared KPConv [51] backbone, which consists of a series of ResNet-like blocks and strided convolutions, simultaneously down-samples the raw point clouds $\bm{\mathcal{P}}$ and $\bm{\mathcal{Q}}$ into superpoints $\bar{\bm{\mathcal{P}}}$ and $\bar{\bm{\mathcal{Q}}}$ and extracts the associated features $\bm{\mathcal{F}}_{\bar{p}}$ and $\bm{\mathcal{F}}_{\bar{q}}$, respectively.

4.5.2 Overlap Attention Module

The overlap attention module, which estimates the probability (overlap score) that a point lies in the overlapping area, consists of positional encoding, self-attention, cross-attention, and overlap score prediction. The positional encoding assigns intrinsic geometric properties to each per-point feature, thus enhancing the distinctions among point features in indistinctive regions. The extracted local features have a limited receptive field and may not distinguish indistinctive regions; humans, in contrast, find correspondences in such regions not only from the local neighborhood but also from a larger global context. Self-attention is therefore introduced to model long-range dependencies, and the cross-attention module exploits the inter-relationship between the source and target point clouds, which models the potential overlap regions. We detail the individual parts hereafter.

Positional Encoding. The positional encoding scheme assigns intrinsic geometric properties to each per-point feature by adding unique positional information, enhancing the distinctions among point features in indistinctive regions. Given a superpoint $\bar{\bm{p}}_i$ of $\bar{\bm{\mathcal{P}}}$, we first select its $k=5$ nearest neighbors $\mathcal{K}_i$ and compute the centroid $\bar{\bm{p}}_c=\frac{1}{\bar{N}}\sum_{i=1}^{\bar{N}}\bar{\bm{p}}_i$ of $\bar{\bm{\mathcal{P}}}$. For each $\bar{\bm{p}}_x\in\mathcal{K}_i$, we denote the angle between the two vectors $\bar{\bm{p}}_i-\bar{\bm{p}}_c$ and $\bar{\bm{p}}_x-\bar{\bm{p}}_c$ as $\alpha_{ix}$. The positional encoding $\bm{f}^{pos}_{\bar{p}_i}$ of $\bar{\bm{p}}_i$ is defined as follows:

\bm{f}^{pos}_{\bar{p}_i}=\varphi\left(\|\bar{\bm{p}}_i-\bar{\bm{p}}_c\|_2\right)+\max_{x\in\mathcal{K}_i}\{\phi\left(\alpha_{ix}\right)\}, \quad (15)

where $\varphi$ and $\phi$ are two MLPs, each consisting of a linear layer followed by a ReLU nonlinearity.
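A possible PyTorch realization of the positional encoding in Eq. (15) is sketched below. The module name, the exclusion of a point from its own neighborhood, and the use of the mean of the superpoints as the centroid are our assumptions; only $k=5$ and the Linear+ReLU structure of $\varphi$ and $\phi$ come from the text.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionalEncoding(nn.Module):
    """Sketch of Eq. (15): geometric positional encoding for superpoints."""
    def __init__(self, dim, k=5):
        super().__init__()
        self.k = k
        self.phi_dist = nn.Sequential(nn.Linear(1, dim), nn.ReLU())   # varphi
        self.phi_angle = nn.Sequential(nn.Linear(1, dim), nn.ReLU())  # phi

    def forward(self, pts):                        # pts: (N, 3) superpoints
        centroid = pts.mean(dim=0, keepdim=True)   # (1, 3)
        d2c = pts - centroid                       # vectors p_i - p_c
        dist = d2c.norm(dim=1, keepdim=True)       # ||p_i - p_c||

        # k nearest neighbours of every superpoint (self excluded).
        idx = torch.cdist(pts, pts).topk(self.k + 1, largest=False).indices[:, 1:]

        # Angle alpha_ix between (p_i - p_c) and (p_x - p_c) for every neighbour x.
        nbr = d2c[idx]                             # (N, k, 3)
        anchor = d2c.unsqueeze(1).expand_as(nbr)
        cos = F.cosine_similarity(nbr, anchor, dim=-1).clamp(-1.0, 1.0)
        alpha = torch.acos(cos).unsqueeze(-1)      # (N, k, 1)

        f_dist = self.phi_dist(dist)                         # (N, dim)
        f_angle = self.phi_angle(alpha).max(dim=1).values    # max-pool over neighbours
        return f_dist + f_angle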

Self-Attention and Cross-Attention. Let $\bm{\mathcal{F}}^{l}_{\bar{p}}$ be the intermediate representation of $\bar{\bm{\mathcal{P}}}$ at layer $l$, with $\bm{\mathcal{F}}^{0}_{\bar{p}}=\{\bm{f}^{pos}_{\bar{p}_i}+\bm{f}_{\bar{p}_i}\}_{i=1}^{\bar{N}}$. We use a multi-head attention layer consisting of four parallel attention heads to update $\bm{\mathcal{F}}^{l}_{\bar{p}}$ via

\bm{S}_{\bar{p}}=\bm{W}^{l}_{1}\bm{\mathcal{F}}^{l}_{\bar{p}}+\bm{b}^{l}_{1},\quad \bm{K}_{\bar{x}}=\bm{W}^{l}_{2}\bm{\mathcal{F}}^{l}_{\bar{x}}+\bm{b}^{l}_{2},\quad \bm{V}_{\bar{x}}=\bm{W}^{l}_{3}\bm{\mathcal{F}}^{l}_{\bar{x}}+\bm{b}^{l}_{3},\quad \bm{A}=\mbox{softmax}\left(\bm{S}_{\bar{p}}^{\top}\bm{K}_{\bar{x}}\big{/}\sqrt{b}\right),\quad \bm{\mathcal{F}}^{l+1}_{\bar{p}}=\bm{\mathcal{F}}^{l}_{\bar{p}}+g^{l}\left(\left[\bm{\mathcal{F}}^{l}_{\bar{p}}\,\big{\|}\,\bm{A}\bm{V}_{\bar{x}}\right]\right). \quad (16)

Here, $\bar{x}=\bar{p}$ corresponds to self-attention and $\bar{x}=\bar{q}$ to cross-attention. $[\cdot\|\cdot]$ denotes concatenation, and $g^{l}(\cdot)$ is a three-layer fully connected network consisting of a linear layer, instance normalization, and a LeakyReLU activation. The same attention module is simultaneously applied to all points of point cloud $\bar{\bm{\mathcal{Q}}}$. A fixed number of layers $L=2$ with different parameters are chained and alternate between self- and cross-attention: starting from $l=0$, we set $\bar{x}=\bar{p}$ if $l$ is even and $\bar{x}=\bar{q}$ if $l$ is odd. The final outputs of the attention module are $\bar{\bm{\mathcal{F}}}_{\bar{p}}=\bm{\mathcal{F}}^{3}_{\bar{p}}$ for $\bar{\bm{\mathcal{P}}}$ and $\bar{\bm{\mathcal{F}}}_{\bar{q}}=\bm{\mathcal{F}}^{3}_{\bar{q}}$ for $\bar{\bm{\mathcal{Q}}}$. In this way, each point can incorporate non-local information, which intuitively strengthens long-range correlation dependencies; the latent features $\bar{\bm{\mathcal{F}}}_{\bar{p}}$ have knowledge of $\bar{\bm{\mathcal{F}}}_{\bar{q}}$ and vice versa.
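The attention update of Eq. (16) can be sketched in PyTorch as a single-head layer, as below; the paper uses four parallel heads, which would split the channel dimension analogously. Passing the same features as src and ctx gives self-attention, while passing the other point cloud's features gives cross-attention. The class name and the LeakyReLU slope are illustrative choices.

import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    """Single-head sketch of the update in Eq. (16)."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # g^l: linear -> instance normalization -> LeakyReLU, as described in the text.
        self.lin = nn.Linear(2 * dim, dim)
        self.norm = nn.InstanceNorm1d(dim)
        self.act = nn.LeakyReLU(0.1)
        self.dim = dim

    def forward(self, src, ctx):                   # src: (N, d), ctx: (M, d)
        S, K, V = self.q(src), self.k(ctx), self.v(ctx)
        A = torch.softmax(S @ K.t() / self.dim ** 0.5, dim=-1)   # (N, M)
        msg = A @ V                                               # aggregated message
        h = self.lin(torch.cat([src, msg], dim=-1))               # (N, d)
        h = self.norm(h.t().unsqueeze(0)).squeeze(0).t()          # instance norm over points
        return src + self.act(h)                                  # residual update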

Overlap Score Prediction. To deal with points in non-overlapping regions, we separately predict the overlap scores $\bm{\mu}_{\bar{p}}=\{\bm{\mu}_{\bar{p}_1},\bm{\mu}_{\bar{p}_2},\cdots,\bm{\mu}_{\bar{p}_{\bar{N}}}\}$ and $\bm{\mu}_{\bar{q}}=\{\bm{\mu}_{\bar{q}_1},\bm{\mu}_{\bar{q}_2},\cdots,\bm{\mu}_{\bar{q}_{\bar{M}}}\}$ from the conditioned features $\bar{\bm{\mathcal{F}}}_{\bar{p}}$ and $\bar{\bm{\mathcal{F}}}_{\bar{q}}$. Here, $\bm{\mu}_{\bar{p}_i}\in[0,1]$ and $\bm{\mu}_{\bar{q}_j}\in[0,1]$ are computed by a single fully connected layer $g_{\beta}(\cdot)$ followed by a sigmoid activation:

\bm{\mu}_{\bar{p}}=\mbox{sigmoid}\left(g_{\beta}\left(\bar{\bm{\mathcal{F}}}_{\bar{p}}\right)\right),\quad \bm{\mu}_{\bar{q}}=\mbox{sigmoid}\left(g_{\beta}\left(\bar{\bm{\mathcal{F}}}_{\bar{q}}\right)\right). \quad (17)

The overlap scores are used to mask out the influence of points outside the overlap region.

4.5.3 Coarse-Level Correspondence Prediction

$\bar{\bm{\mathcal{F}}}_{\bar{p}}$, $\bar{\bm{\mathcal{F}}}_{\bar{q}}$, $\bm{\mu}_{\bar{p}}$, and $\bm{\mu}_{\bar{q}}$ are used to calculate an assignment matrix $\bar{\bm{\Gamma}}$ at the superpoint level by solving the coupled optimal transport problem

\min_{\bar{\bm{\Gamma}}\geq 0}\ \xi_1\left<\bm{C}^{\bar{p}\bar{q}},\bar{\bm{\Gamma}}\right>+\xi_2\left<\bm{H}\left(\bm{C}^{\bar{p}},\bm{C}^{\bar{q}},\bar{\bm{\Gamma}}\right),\bar{\bm{\Gamma}}\right>+\tau\left(\mathcal{KL}\left(\bar{\bm{\Gamma}}\bm{1}_{\bar{M}}|\bm{\mu}_{\bar{p}}\right)+\mathcal{KL}\left(\bar{\bm{\Gamma}}^{\top}\bm{1}_{\bar{N}}|\bm{\mu}_{\bar{q}}\right)\right), \quad (18)

where $\bm{C}^{\bar{p}\bar{q}}$, $\bm{C}^{\bar{p}}$, and $\bm{C}^{\bar{q}}$ are the distance matrices with elements $\bm{C}^{\bar{p}\bar{q}}_{ij}=\mathcal{D}_f\left(\bar{\bm{f}}_{\bar{p}_i},\bar{\bm{f}}_{\bar{q}_j}\right)$, $\bm{C}^{\bar{p}}_{ij}=\lambda\mathcal{D}_e\left(\bar{p}_i,\bar{p}_j\right)+(1-\lambda)\mathcal{D}_f\left(\bar{\bm{f}}_{\bar{p}_i},\bar{\bm{f}}_{\bar{p}_j}\right)$, and $\bm{C}^{\bar{q}}_{ij}=\lambda\mathcal{D}_e\left(\bar{q}_i,\bar{q}_j\right)+(1-\lambda)\mathcal{D}_f\left(\bar{\bm{f}}_{\bar{q}_i},\bar{\bm{f}}_{\bar{q}_j}\right)$, respectively, and $\lambda\in[0,1]$ is a hyperparameter. Eq. (18) is an instance of the optimal transport problem [48] and can be solved efficiently using the Sinkhorn-Knopp algorithm [48]. After obtaining $\bar{\bm{\Gamma}}$, we select the correspondences with the maximum confidence score of $\bar{\bm{\Gamma}}$ in each row and column and further enforce the mutual nearest neighbor (MNN) criterion, which filters possible outlier coarse matches. The coarse-level correspondences are defined as follows:

\bar{\mathcal{M}}=\{(\bar{\bm{p}}_{\hat{i}},\bar{\bm{q}}_{\hat{j}})\ \big{|}\ \forall(\hat{i},\hat{j})\in\mbox{MNN}(\bar{\bm{\Gamma}}),\ \hat{j}=\arg\max_{k}\bar{\bm{\Gamma}}_{\hat{i}k}\}. \quad (19)
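A small NumPy sketch of the coarse match selection in Eq. (19), i.e., keeping row-wise maxima that also satisfy the mutual-nearest-neighbor criterion, is given below for illustration; the function name is our own.

import numpy as np

def mutual_nn_matches(Gamma):
    """Keep (i, j) only if j is the best match of i AND i is the best match of j."""
    row_best = Gamma.argmax(axis=1)                # best column for every row
    col_best = Gamma.argmax(axis=0)                # best row for every column
    return [(i, j) for i, j in enumerate(row_best) if col_best[j] == i]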

4.5.4 Decoder

Our decoder starts from the conditioned features $\bar{\bm{\mathcal{F}}}_{\bar{p}}$, concatenates them with the overlap scores $\bm{\mu}_{\bar{p}}$, and outputs the per-point feature descriptors $\bm{\mathcal{F}}_p\in\mathbb{R}^{N\times 32}$ and refined per-point overlap scores $\bm{\mu}_p$. The decoder architecture combines NN-upsampling with linear layers and includes skip connections from the corresponding encoder layers. The same operations are applied to generate $\bm{\mathcal{F}}_q\in\mathbb{R}^{M\times 32}$ and $\bm{\mu}_q$.

4.5.5 Fine-Level Prediction

At the finer stage, we refine the coarse correspondences into point-level correspondences, which are then utilized for point cloud registration.

Point-Level Correspondence Prediction. We first group the points into clusters by assigning each point to its nearest superpoint in geometric space. After grouping, the points, with their associated overlap scores and descriptors, form patches, from which we extract point correspondences. For a superpoint $\bar{\bm{p}}_i\in\bar{\bm{\mathcal{P}}}$, its associated point set $G_{\bar{p}_i}$, feature set $G_{\bm{f}_{\bar{p}_i}}$, and overlap score set $G_{\bm{\mu}_{\bar{p}_i}}$ are denoted as

\begin{cases}G_{\bar{p}_i}=\{\bm{p}\in\bm{\mathcal{P}}\ \big{|}\ \|\bm{p}-\bar{\bm{p}}_i\|_2\leq\|\bm{p}-\bar{\bm{p}}_j\|_2,\ \forall j\neq i\},\\ G_{\bm{f}_{\bar{p}_i}}=\{\bm{f}_{\bm{x}_j}\in\bm{\mathcal{F}}_p\ \big{|}\ \bm{x}_j\in G_{\bar{p}_i}\},\\ G_{\bm{\mu}_{\bar{p}_i}}=\{\mu_{\bm{x}_j}\in\bm{\mu}_p\ \big{|}\ \bm{x}_j\in G_{\bar{p}_i}\}.\end{cases} \quad (20)
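As an illustration of the grouping in Eq. (20), the following NumPy sketch assigns every point to its nearest superpoint and collects the per-patch point, feature, and overlap-score sets; the function name and the returned tuple layout are our own choices.

import numpy as np

def group_by_superpoint(points, feats, overlaps, superpoints):
    """Assign every point to its nearest superpoint and gather per-patch sets."""
    d = np.linalg.norm(points[:, None, :] - superpoints[None, :, :], axis=-1)
    owner = d.argmin(axis=1)                       # nearest superpoint index per point
    patches = []
    for i in range(len(superpoints)):
        idx = np.where(owner == i)[0]
        patches.append((points[idx], feats[idx], overlaps[idx]))
    return patches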

The coarse-level correspondence set $\bar{\mathcal{M}}$ is expanded to its corresponding patch correspondence sets in geometric space, $\mathcal{M}_C=\{(G_{\bar{p}_i},G_{\bar{q}_j})\}$, in feature space, $\mathcal{M}_F=\{(G_{\bm{f}_{\bar{p}_i}},G_{\bm{f}_{\bar{q}_j}})\}$, and in overlap scores, $\mathcal{M}_{\bm{\mu}}=\{(G_{\bm{\mu}_{\bar{p}_i}},G_{\bm{\mu}_{\bar{q}_j}})\}$. For computational efficiency, $K$ points are sampled from every patch based on the overlap scores. Given a pair of overlapping patches $(G_{\bar{p}_i},G_{\bm{f}_{\bar{p}_i}},G_{\bm{\mu}_{\bar{p}_i}})$ and $(G_{\bar{q}_j},G_{\bm{f}_{\bar{q}_j}},G_{\bm{\mu}_{\bar{q}_j}})$, we first calculate the cross-distance matrix $\bm{C}^{\bar{p}_i\bar{q}_j}=\{\bm{C}^{\bar{p}_i\bar{q}_j}_{kl}\}$ and the structure matrices $\bm{C}^{\bar{p}_i}=\{\bm{C}^{\bar{p}_i}_{kl}\}$ and $\bm{C}^{\bar{q}_j}=\{\bm{C}^{\bar{q}_j}_{kl}\}$, with elements satisfying

\bm{C}^{\bar{p}_i\bar{q}_j}_{kl}=\mathcal{D}_f\left(G^{k}_{\bm{f}_{\bar{p}_i}},G^{l}_{\bm{f}_{\bar{q}_j}}\right),\quad \bm{C}^{\bar{p}_i}_{kl}=\lambda\mathcal{D}_e\left(G^{k}_{\bar{p}_i},G^{l}_{\bar{p}_i}\right)+(1-\lambda)\mathcal{D}_f\left(G^{k}_{\bm{f}_{\bar{p}_i}},G^{l}_{\bm{f}_{\bar{p}_i}}\right),\quad \bm{C}^{\bar{q}_j}_{kl}=\lambda\mathcal{D}_e\left(G^{k}_{\bar{q}_j},G^{l}_{\bar{q}_j}\right)+(1-\lambda)\mathcal{D}_f\left(G^{k}_{\bm{f}_{\bar{q}_j}},G^{l}_{\bm{f}_{\bar{q}_j}}\right),

where $\lambda\in[0,1]$ is a hyperparameter. Extracting point correspondences is analogous to matching two smaller-scale point clouds: we solve a coupled optimal transport problem to calculate a matrix $\Gamma_{\bar{p}_i}$,

\min_{\Gamma_{\bar{p}_i}\geq 0}\ \xi_1\left<\bm{C}^{\bar{p}_i\bar{q}_j},\bm{\Gamma}_{\bar{p}_i}\right>+\xi_2\left<\bm{H}\left(\bm{C}^{\bar{p}_i},\bm{C}^{\bar{q}_j},\bm{\Gamma}_{\bar{p}_i}\right),\bm{\Gamma}_{\bar{p}_i}\right>+\tau\left(\mathcal{KL}\left(\bm{\Gamma}_{\bar{p}_i}\bm{1}_{M_{\bar{q}_j}}|G_{\bm{\mu}_{\bar{p}_i}}\right)+\mathcal{KL}\left(\bm{\Gamma}_{\bar{p}_i}^{\top}\bm{1}_{N_{\bar{p}_i}}|G_{\bm{\mu}_{\bar{q}_j}}\right)\right).

To obtain correspondences, we choose the maximum confidence score of $\Gamma_{\bar{p}_i}$ in every row and column to guarantee a higher precision. The final point correspondence set $\mathcal{M}$ is the union of all the per-patch correspondence sets. After obtaining the correspondences $\mathcal{M}$, following [52, 53], a variant of RANSAC [47] that is specialized to 3D correspondence-based registration [54] is utilized to estimate the transformation.

4.6 Loss Function and Training

Our model is an end-to-end learning framework that uses the ground-truth correspondences as supervision. The loss function $\mathcal{L}=\mathcal{L}_C+\mathcal{L}_F+\mathcal{L}_{CO}+\mathcal{L}_{FO}$ is composed of a coarse-level loss $\mathcal{L}_C$ for superpoint matching, a fine-level loss $\mathcal{L}_F$ for point matching, a binary classification loss $\mathcal{L}_{CO}$ for the coarse-level overlap scores, and a binary classification loss $\mathcal{L}_{FO}$ for the fine-level overlap scores.

4.6.1 Coarse-Level Loss

Superpoint Matching Loss. Existing methods [53, 16] usually formulate superpoint matching as a multi-label classification problem and adopt a cross-entropy loss with optimal transport, which requires unrolling the Sinkhorn layer to compute gradients during training. To avoid this, we adopt a circle loss [55] to optimize the superpoint-wise feature descriptors. As there is no direct supervision for superpoint matching, we use the ratio $r^j_i$ of points in $G_{\bar{p}_i}$ that have correspondences in $G_{\bar{q}_j}$ to depict the matching probability between superpoints $\bar{p}_i$ and $\bar{q}_j$. $r^j_i$ is defined as:

r^{j}_{i}=\frac{1}{|G_{\bar{p}_i}|}\left|\{\bm{p}\in G_{\bar{p}_i}\ \big{|}\ \min_{\bm{q}\in G_{\bar{q}_j}}\|\hat{\bm{T}}\left(\bm{p}\right)-\bm{q}\|_2<r_p\}\right|,

where $\hat{\bm{T}}$ is the ground-truth transformation and $r_p$ is a preset threshold. For the circle loss, a pair of superpoints is positive if their corresponding patches share at least $10\%$ overlap, and negative if they do not overlap; all other pairs are omitted. We select the superpoints in $\bar{\bm{\mathcal{P}}}$ that have at least one positive superpoint in $\bar{\bm{\mathcal{Q}}}$ to form a set of anchor superpoints $\tilde{\bm{\mathcal{P}}}$. For each anchor $\tilde{\bm{p}}_i\in\tilde{\bm{\mathcal{P}}}$, we denote the set of its positive superpoints in $\bar{\bm{\mathcal{Q}}}$ as $\mathcal{N}_p^{\tilde{\bm{p}}_i}$ and the set of its negative superpoints as $\mathcal{N}_n^{\tilde{\bm{p}}_i}$. The superpoint matching loss (circle loss) $\mathcal{L}_C^{\bar{\bm{\mathcal{P}}}}$ on $\bar{\bm{\mathcal{P}}}$ is then defined as:

\mathcal{L}_{C}^{\bar{\bm{\mathcal{P}}}}=\frac{1}{|\tilde{\bm{\mathcal{P}}}|}\sum_{\tilde{\bm{p}}_i\in\tilde{\bm{\mathcal{P}}}}\log\left[1+\zeta_i\right],\quad \zeta_i=\sum_{\tilde{\bm{q}}_k\in\mathcal{N}_p^{\tilde{\bm{p}}_i}}e^{r^{k}_{i}\beta_p^{ik}(d_i^k-\Delta p)}\cdot\sum_{\tilde{\bm{q}}_l\in\mathcal{N}_n^{\tilde{\bm{p}}_i}}e^{\beta_n^{il}(\Delta n-d_i^l)}, \quad (21)

where $d_i^k=\mathcal{D}_f(\bm{f}_{\tilde{p}_i},\bm{f}_{\tilde{q}_k})$ is the distance in the feature space. The weights $\beta_p^{ik}=\gamma d_i^k$ and $\beta_n^{il}=\gamma(2.0-d_i^l)$ are determined individually for each positive and negative example, using the empirical margins $\Delta p=0.1$ and $\Delta n=1.4$ with a learned scale factor $\gamma\geq 1$. The circle loss reweights the loss values on $\mathcal{N}_p^{\tilde{\bm{p}}_i}$ by the overlap ratio so that patch pairs with higher overlap are given more importance. The loss $\mathcal{L}_C^{\bar{\bm{\mathcal{Q}}}}$ on $\bar{\bm{\mathcal{Q}}}$ is computed in the same way. The overall superpoint matching loss is

\mathcal{L}_C=\frac{1}{2}(\mathcal{L}_C^{\bar{\bm{\mathcal{P}}}}+\mathcal{L}_C^{\bar{\bm{\mathcal{Q}}}}). \quad (22)
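A possible PyTorch sketch of the overlap-weighted circle loss in Eq. (21) is shown below. It assumes the pairwise feature distances, overlap ratios, and positive/negative masks have been precomputed, and it treats the scale factor $\gamma$ as a constant rather than a learned parameter; the function name and default value are illustrative.

import torch

def superpoint_circle_loss(dist, overlap_ratio, pos_mask, neg_mask,
                           gamma=10.0, delta_p=0.1, delta_n=1.4):
    """dist, overlap_ratio: (Nbar, Mbar) tensors of d_i^k and r_i^k;
    pos_mask / neg_mask: boolean masks of positive / negative superpoint pairs."""
    beta_p = gamma * dist                          # beta_p^{ik} = gamma * d_i^k
    beta_n = gamma * (2.0 - dist)                  # beta_n^{il} = gamma * (2 - d_i^l)

    pos_term = torch.exp(overlap_ratio * beta_p * (dist - delta_p))
    neg_term = torch.exp(beta_n * (delta_n - dist))
    pos_sum = (pos_term * pos_mask.float()).sum(dim=1)   # sum over positives per anchor
    neg_sum = (neg_term * neg_mask.float()).sum(dim=1)   # sum over negatives per anchor

    anchors = pos_mask.any(dim=1)                  # anchors with at least one positive
    zeta = pos_sum[anchors] * neg_sum[anchors]
    return torch.log1p(zeta).mean()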

Coarse-Level Overlap Loss. We use the ratio of points in $G_{\bar{p}_i}$ that are visible in $\bm{\mathcal{Q}}$ to define the ground-truth overlap score $\bar{\bm{\mu}}_{\bar{p}_i}$ of the superpoint $\bar{p}_i$. It is calculated as

\bar{\bm{\mu}}_{\bar{p}_i}=\frac{1}{|G_{\bar{p}_i}|}\left|\{\bm{p}\in G_{\bar{p}_i}\ \big{|}\ \min_{\bm{q}\in\bm{\mathcal{Q}}}\|\hat{\bm{T}}\left(\bm{p}\right)-\bm{q}\|_2<r_o\}\right|, \quad (23)

where $r_o$ is the overlap threshold. If $\bar{\bm{\mu}}_{\bar{p}_i}$ is close to 1, $\bar{p}_i$ tends to lie in the overlap region. $\bar{\bm{\mu}}_{\bar{q}_j}$ is calculated in the same way. The predicted overlap scores for $\bar{\bm{\mathcal{P}}}$ are supervised using the binary cross-entropy loss, i.e.,

\mathcal{L}_{\bm{\mathcal{\bar{P}}}}=-\frac{1}{\bar{N}}\sum_{i}\left(\bar{\bm{\mu}}_{\bar{p}_{i}}\log\bm{\mu}_{\bar{p}_{i}}+\left(1-\bar{\bm{\mu}}_{\bar{p}_{i}}\right)\log\left(1-\bm{\mu}_{\bar{p}_{i}}\right)\right). (24)

The loss 𝓠¯\mathcal{L}_{\bm{\bar{\mathcal{Q}}}} for 𝓠¯\bm{\bar{\mathcal{Q}}} is calculated in the same way. The loss for coarse-level overlap scores is

CO=12(𝓟¯+𝓠¯).\mathcal{L}_{CO}=\frac{1}{2}\left(\mathcal{L}_{\bar{\bm{\mathcal{P}}}}+\mathcal{L}_{\bar{\bm{\mathcal{Q}}}}\right).
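For reference, the ground-truth quantities in Eqs. (23)-(24) amount to a nearest-neighbour test inside each patch. The following sketch uses a brute-force search via torch.cdist and tensor names of our own choosing (patch_pts, q_pts); it illustrates both the patch overlap score and the binary cross-entropy of Eq. (24) and is not the authors' released code.

```python
import torch

def patch_overlap_score(patch_pts, q_pts, radius):
    """Fraction of patch points with a neighbour in q_pts within `radius`.

    patch_pts: (K, 3) points of a patch G_{p_i}, already transformed by T_hat.
    q_pts:     (M, 3) points of Q (or of a patch G_{q_j} when computing the
               matching overlap ratio r_i^j used by the superpoint loss).
    """
    nn_dist = torch.cdist(patch_pts, q_pts).min(dim=1).values  # (K,)
    return (nn_dist < radius).float().mean()

def coarse_overlap_bce(pred_scores, gt_scores, eps=1e-6):
    """Binary cross-entropy of Eq. (24) on the predicted overlap scores."""
    pred = pred_scores.clamp(eps, 1.0 - eps)
    return -(gt_scores * torch.log(pred)
             + (1.0 - gt_scores) * torch.log(1.0 - pred)).mean()
```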

4.6.2 Fine-Level Loss

Point Matching Loss. We apply the circle loss again to supervise the point matching. Consider a pair of matched superpoints \bar{\bm{p}}_{i} and \bar{\bm{q}}_{j} with associated patches G_{\bar{p}_{i}} and G_{\bar{q}_{j}}. We first extract a set of anchor points \tilde{G}_{\bar{p}_{i}}\subseteq G_{\bar{p}_{i}} such that each \bm{g}^{k}_{\bar{p}_{i}}\in\tilde{G}_{\bar{p}_{i}} has at least one (possibly multiple) correspondence in G_{\bar{q}_{j}}, i.e.,

\tilde{G}_{\bar{p}_{i}}=\{\bm{g}^{k}_{\bar{p}_{i}}\in G_{\bar{p}_{i}}\,\big{|}\,\min_{\bm{g}^{l}_{\bar{q}_{j}}\in G_{\bar{q}_{j}}}\|\hat{\bm{T}}\left(\bm{g}^{k}_{\bar{p}_{i}}\right)-\bm{g}^{l}_{\bar{q}_{j}}\|_{2}<r_{p}\}.

For each anchor \bm{g}^{k}_{\bar{p}_{i}}\in\tilde{G}_{\bar{p}_{i}}, we denote the set of its positive points in G_{\bar{q}_{j}} as \mathcal{N}_{p}^{\bm{g}^{k}_{\bar{p}_{i}}}. All points of \bm{\mathcal{Q}} outside a (larger) radius r_{n} form the set of its negative points, denoted \mathcal{N}_{n}^{\bm{g}^{k}_{\bar{p}_{i}}}. The fine-level matching loss \mathcal{L}_{F}^{\bm{\mathcal{P}}} on \bm{\mathcal{P}} is calculated as:

\mathcal{L}_{F}^{\bm{\mathcal{P}}} =\frac{1}{|\bm{\tilde{\mathcal{P}}}|}\sum_{\tilde{\bm{p}}_{i}\in\bm{\tilde{\mathcal{P}}}}\frac{1}{|\tilde{G}_{\bar{p}_{i}}|}\sum_{\bm{g}^{s}_{\bar{p}_{i}}\in\tilde{G}_{\bar{p}_{i}}}\log\left[1+\xi_{s}\right], (25)
\xi_{s} =\sum_{\bm{g}^{k}_{\bar{q}_{j}}\in\mathcal{N}_{p}^{\bm{g}^{s}_{\bar{p}_{i}}}}e^{r^{k}_{s}\beta_{p}^{sk}(d_{s}^{k}-\Delta p)}\cdot\sum_{\bm{g}^{l}_{\bar{q}_{j}}\in\mathcal{N}_{n}^{\bm{g}^{s}_{\bar{p}_{i}}}}e^{\beta_{n}^{sl}(\Delta n-d_{s}^{l})},

where d_{s}^{k}=\mathcal{D}_{f}(\bm{f}_{\bm{g}^{s}_{\bar{p}_{i}}},\bm{f}_{\bm{g}^{k}_{\bar{q}_{j}}}) is the distance in the feature space. The weights \beta_{p}^{sk}=\omega d_{s}^{k} and \beta_{n}^{sl}=\omega(2.0-d_{s}^{l}) are determined individually for each positive and negative example with a learned scale factor \omega\geq 1, and \Delta p=0.1 and \Delta n=1.4. The loss \mathcal{L}_{F}^{\bm{\mathcal{Q}}} on \bm{\mathcal{Q}} is computed in the same way. The overall point matching loss is

F=12(F𝓟+F𝓠).\mathcal{L}_{F}=\frac{1}{2}(\mathcal{L}_{F}^{\mathcal{\bm{P}}}+\mathcal{L}_{F}^{\mathcal{\bm{Q}}}). (26)

Fine-Level Overlap Loss. The overlap score loss is FO=12(1|𝓟¯|p¯ip¯i+1|𝓠¯|q¯jq¯j)\mathcal{L}_{FO}=-\frac{1}{2}\left(\frac{1}{|\bm{\mathcal{\bar{P}}}|}\sum_{\bar{p}_{i}}\mathcal{L}_{\bar{p}_{i}}+\frac{1}{|\bm{\mathcal{\bar{Q}}}|}\sum_{\bar{q}_{j}}\mathcal{L}_{\bar{q}_{j}}\right) with

\mathcal{L}_{\bar{p}_{i}}=\frac{1}{|\tilde{G}_{\bar{p}_{i}}|}\sum_{\bm{g}^{k}_{\bar{p}_{i}}\in\tilde{G}_{\bar{p}_{i}}}\left(\bar{\bm{\mu}}_{\bm{g}^{k}_{\bar{p}_{i}}}\log\bm{\mu}_{\bm{g}^{k}_{\bar{p}_{i}}}+\left(1-\bar{\bm{\mu}}_{\bm{g}^{k}_{\bar{p}_{i}}}\right)\log\left(1-\bm{\mu}_{\bm{g}^{k}_{\bar{p}_{i}}}\right)\right).

The ground-truth label 𝝁¯𝒈p¯ik\bar{\bm{\mu}}_{\bm{g}^{k}_{\bar{p}_{i}}} of the point 𝒈p¯ikG~p¯i\bm{g}^{k}_{\bar{p}_{i}}\in\tilde{G}_{\bar{p}_{i}} is defined as

\bar{\bm{\mu}}_{\bm{g}^{k}_{\bar{p}_{i}}}=\begin{cases}1,&\min_{\bm{q}_{j}\in\bm{\mathcal{Q}}}\|\hat{\bm{T}}(\bm{g}^{k}_{\bar{p}_{i}})-\bm{q}_{j}\|_{2}<r_{o}\\ 0,&\text{otherwise},\end{cases} (27)

\mathcal{L}_{\bar{q}_{j}} is calculated in the same way.

4.6.3 Implementation Details

The proposed method is implemented in PyTorch and trained on a single Quadro GV100 GPU (32 GB) and two Intel(R) Xeon(R) Gold 6226 CPUs. The hyperparameters are set as follows: \xi_{1}=1.0, \tau=5.0, \lambda=0.1, \epsilon=0.001, N_{I}=100, and N_{O}=20. We train our model for 120 epochs with a batch size of 1 in all experiments, using the Adam optimizer with an initial learning rate of 5\times 10^{-4} and an exponential decay factor of 0.99. We adopt the same encoder and decoder architectures used in [53]. For training the network, we sample 128 coarse correspondences with a truncated patch size of K=64 on 3DMatch (3DLoMatch). On KITTI, the numbers are 128 and 32, respectively.
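The optimization settings above translate, up to naming, into the following sketch; COTRegNet and build_dataloader are placeholders for the actual network and data pipeline, which are not specified here.

```python
import torch

model = COTRegNet().cuda()          # placeholder for the full network
train_loader = build_dataloader()   # placeholder data pipeline, batch size 1

# Adam with the stated initial learning rate of 5e-4 ...
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# ... and an exponential learning-rate decay of 0.99 applied once per epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)

for epoch in range(120):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(batch)         # sum of the matching and overlap losses
        loss.backward()
        optimizer.step()
    scheduler.step()
```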

5 Experiments

We conduct extensive experiments to evaluate the performance of our method on indoor 3DMatch [18] and 3DLoMatch [15] benchmarks, outdoor KITTI [56] benchmark, and cross-source 3DCSR [57] benchmark.

Baselines. COTReg was compared with the recent state-of-the-art methods: FCGF [20], D3Feat [58], SpinNet [33], Predator [15], YOHO [59], CoFiNet [53], and GeoTransformer [52].

5.1 Evaluation on 3DMatch and 3DLoMatch.

Datasets. 3DMatch [18] and 3DLoMatch [15] are two widely used indoor benchmarks whose scene pairs have more than 30\% overlap and 10\%\sim 30\% overlap, respectively. 3DMatch contains 62 scenes, of which we use 46 for training, 8 for validation, and 8 for testing. The test set contains 1,623 partially overlapping point cloud fragments and their corresponding transformation matrices. We use the training data preprocessed by [15] and evaluate under both the 3DMatch and 3DLoMatch [15] protocols. We first voxel-downsample the point clouds with a 2.5cm voxel size and then extract the feature descriptors. Following [15], we set r_{o}=3.75cm, r_{p}=3.75cm, and r_{n}=10.0cm.
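As an illustration of this preprocessing step, the voxel downsampling can be carried out with Open3D [54]; the snippet below is only a sketch, with the file path being a placeholder.

```python
import numpy as np
import open3d as o3d

VOXEL_SIZE = 0.025  # 2.5 cm for 3DMatch/3DLoMatch (a 0.30 m voxel is used for KITTI)

# Placeholder path; the fragments come from the preprocessed data of [15].
pcd = o3d.io.read_point_cloud("fragment_000.ply")
pcd_down = pcd.voxel_down_sample(voxel_size=VOXEL_SIZE)
points = np.asarray(pcd_down.points)  # (N, 3) array fed to the backbone
```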

Metrics. Following Predator[15] and CoFiNet [53], we evaluate performance with three metrics: (1) Inlier Ratio (IR), the fraction of putative correspondences whose residuals are below a certain threshold (i.e., 0.1m) under the ground-truth transformation, (2) Feature Matching Recall (FMR), the fraction of point cloud pairs whose inlier ratio is above a certain threshold (i.e., 5%), and (3) Registration Recall (RR), the fraction of point cloud pairs whose transformation error is smaller than a certain threshold (i.e., RMSE<0.2mRMSE<0.2m).
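These metrics can be computed with a few lines of NumPy. The sketch below assumes corresponding point arrays and a 4x4 ground-truth transformation, with thresholds matching the values quoted above; it is meant only to make the definitions explicit, not to reproduce the exact evaluation scripts.

```python
import numpy as np

def inlier_ratio(src_corr, tgt_corr, T_gt, tau1=0.1):
    """IR: fraction of putative correspondences within tau1 (m) of their
    counterpart after applying the ground-truth transformation T_gt (4x4)."""
    src_aligned = src_corr @ T_gt[:3, :3].T + T_gt[:3, 3]
    return np.mean(np.linalg.norm(src_aligned - tgt_corr, axis=1) < tau1)

def feature_matching_recall(inlier_ratios, tau2=0.05):
    """FMR: fraction of point cloud pairs whose inlier ratio exceeds 5%."""
    return np.mean(np.asarray(inlier_ratios) > tau2)

def registration_recall(rmses, tau3=0.2):
    """RR: fraction of pairs whose RMSE under the estimated transformation,
    measured over ground-truth correspondences, is below 0.2 m."""
    return np.mean(np.asarray(rmses) < tau3)
```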

Table I: Results on both 3DMatch and 3DLoMatch datasets under different numbers of samples.
3DMatch 3DLoMatch
# Samples 5000 2500 1000 500 250 5000 2500 1000 500 250
Method Inlier Ratio (%)(\%)\uparrow
FCGF[20] 56.8 54.1 48.7 42.5 34.1 21.4 20.0 17.2 14.8 11.6
D3Feat[58] 39.0 38.8 40.4 41.5 41.8 13.2 13.1 14.0 14.6 15.0
SpinNet [33] 47.5 44.7 39.4 33.9 27.6 20.5 19.0 16.3 13.8 11.1
Predator [15] 58.0 58.4 57.1 54.1 49.3 26.7 28.1 28.3 27.5 25.8
CoFiNet[53] 49.8 51.2 51.9 52.2 52.2 24.4 25.9 26.7 26.8 26.9
YOHO [59] 64.4 60.7 55.7 46.4 41.2 25.9 23.3 22.6 18.2 15.0
GeoTransformer[52] 71.9 75.2 76.0 82.2 85.1 43.5 45.3 46.2 52.9 57.7
COTReg (Ours) 85.4 85.7 86.1 86.4 86.9 54.2 55.1 56.3 57.7 59.3
Feature Matching Recall (%)(\%)\uparrow
FCGF[20] 97.4 97.3 97.0 96.7 96.6 76.6 75.4 74.2 71.7 67.3
D3Feat [58] 95.6 95.4 94.5 94.1 93.1 67.3 66.7 67.0 66.7 66.5
SpinNet [33] 97.6 97.2 96.8 95.5 94.3 75.3 74.9 72.5 70.0 63.6
Predator[15] 96.6 96.6 96.5 96.3 96.5 78.6 77.4 76.3 75.7 75.3
CoFiNet[53] 98.1 98.3 98.1 98.2 98.3 83.1 83.5 83.3 83.1 82.6
YOHO[59] 98.2 97.6 97.5 97.7 96.0 79.4 78.1 76.3 73.8 69.1
GeoTransformer[52] 97.9 97.9 97.9 97.9 97.6 88.3 88.6 88.8 88.6 88.3
COTReg (Ours) 98.5 98.6 98.5 98.6 98.6 89.5 89.7 89.7 89.6 89.4
Registration Recall (%)(\%)\uparrow
FCGF[20] 85.1 84.7 83.3 81.6 71.4 40.1 41.7 38.2 35.4 26.8
D3Feat[58] 81.6 84.5 83.4 82.4 77.9 37.2 42.7 46.9 43.8 39.1
SpinNet[33] 88.6 86.6 85.5 83.5 70.2 59.8 54.9 48.3 39.8 26.8
Predator[15] 89.0 89.9 90.6 88.5 86.6 59.8 61.2 62.4 60.8 58.1
CoFiNet[53] 89.3 88.9 88.4 87.4 87.0 67.5 66.2 64.2 63.1 61.0
YOHO [59] 90.8 90.3 89.1 88.6 84.5 65.2 65.5 63.2 56.5 48.0
GeoTransformer[52] 92.0 91.8 91.8 91.4 91.2 75.0 74.8 74.2 74.1 73.5
COTReg (Ours) 93.1 92.8 92.9 92.5 92.4 80.9 80.4 79.7 76.0 74.6

5.1.1 Inlier Ratio and Feature Matching Recall

As the main contribution of COTReg is the joint use of point-wise and structural matchings to estimate more correct correspondences, we first examine the Inlier Ratio of COTReg, which directly reflects the quality of the extracted correspondences. Following [15, 53, 52], we report results with different numbers of correspondences. As shown in Table I (top), COTReg outperforms all previous methods on Inlier Ratio on both benchmarks, showing an outstanding accuracy improvement. In particular, COTReg surpasses GeoTransformer, the second-best baseline, consistently by 1.8\%\sim 13.5\% on 3DMatch and 1.6\%\sim 10.7\% on 3DLoMatch when the number of samples ranges from 250 to 5000. Furthermore, the fact that sampling fewer correspondences leads to a higher Inlier Ratio indicates that our learned scores are well-calibrated, i.e., higher confidence scores indicate more reliable correspondences. For Feature Matching Recall, Table I (middle) shows that our method obtains the best results. Especially on 3DLoMatch, which is more challenging due to its low-overlap scenarios, our method achieves improvements of at least 0.9%, demonstrating its effectiveness in low-overlap cases. The main difference between our method and the baselines is the correspondence prediction strategy, which jointly considers point-wise and structural matchings equipped with overlap scores, whereas the baselines only consider point-wise matching based on feature similarity.

5.1.2 Registration Recall

Registration Recall reflects the final performance of point cloud registration. To evaluate the registration performance, we compare the RR obtained with RANSAC in Table I (bottom); our method outperforms all the other models for every number of sampled points on both datasets. Specifically, COTReg achieves 93.1\% and 80.9\% Registration Recall, exceeding the previous best, GeoTransformer (92.0\% RR on 3DMatch and 75.0\% RR on 3DLoMatch), by 1.1\% and 5.9\%, respectively, showing its efficacy in both high- and low-overlap scenarios. This demonstrates that incorporating both point-wise and structural matchings with overlap scores into the correspondence prediction process alleviates the ambiguity issue and thus yields better performance than the counterparts that only consider point-wise matching. Figures 4 and 5 show visual comparisons on 3DMatch and 3DLoMatch, respectively; our method achieves better results in challenging indoor scenes with a low overlap ratio.

We also compare the registration results obtained with weighted SVD over correspondences in Table II. Some baselines either fail to achieve reasonable results or suffer from severe performance degradation. In contrast, COTReg with weighted SVD achieves a registration recall of 87.2% on 3DMatch and 60.7% on 3DLoMatch. Without outlier filtering by RANSAC, a high inlier ratio is necessary for successful registration. However, a high inlier ratio does not necessarily lead to a high registration recall, since the correspondences could cluster together, as noted in [15].
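The weighted SVD estimator referred to in Table II solves for the rigid transformation in closed form from correspondences and their confidence weights (a weighted Kabsch solver). A minimal NumPy sketch under that standard formulation is given below, where w stands for the correspondence confidences; this is a generic solver, not the exact implementation used in our experiments.

```python
import numpy as np

def weighted_svd(src, tgt, w):
    """Closed-form rigid alignment of weighted correspondences src_i -> tgt_i.

    src, tgt: (N, 3) corresponding points; w: (N,) non-negative weights.
    Returns R (3x3), t (3,) minimizing sum_i w_i ||R src_i + t - tgt_i||^2.
    """
    w = w / (w.sum() + 1e-12)
    mu_src = (w[:, None] * src).sum(0)            # weighted centroids
    mu_tgt = (w[:, None] * tgt).sum(0)
    src_c, tgt_c = src - mu_src, tgt - mu_tgt
    H = (w[:, None] * src_c).T @ tgt_c            # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])  # avoid reflections
    R = Vt.T @ S @ U.T
    t = mu_tgt - R @ mu_src
    return R, t
```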

We further report the average inference time of COTReg and compare it with that of the baselines. Notably, all methods consist of two stages: they first extract dense features or correspondences and then recover the transformation using RANSAC or SVD. We report the inference times of the two stages separately. Although our method is slightly slower than some baselines in the correspondence prediction stage, it extracts more reliable correspondences.

Figure 4: Example qualitative registration results for 3DMatch.
Figure 5: Example qualitative registration results for 3DLoMatch.
Table II: Registration recall with RANSAC and weighted SVD estimators, and average runtime, on 3DMatch (3DM) and 3DLoMatch (3DLM).
RR Time (s)
Method Estimator Samples 3DM 3DLM Model Pose Total
FCGF[20] RANSAC-50k 5000 85.1 40.1 0.052 3.326 3.378
D3Feat[58] RANSAC-50k 5000 81.6 37.2 0.024 3.088 3.112
SpinNet [33] RANSAC-50k 5000 88.6 59.8 60.248 0.388 60.636
Predator [15] RANSAC-50k 5000 89.0 59.8 0.032 5.120 5.152
CoFiNet[53] RANSAC-50k 5000 89.3 67.5 0.115 1.807 1.922
GeoTrans[52] RANSAC-50k 5000 92.0 75.0 0.075 1.558 1.633
COTReg (Ours) RANSAC-50k 5000 93.1 80.9 0.652 1.463 2.115
FCGF[20] weighted SVD 250 42.1 3.9 0.052 0.008 0.056
D3Feat[58] weighted SVD 250 37.4 2.8 0.024 0.008 0.032
SpinNet [33] weighted SVD 250 34.0 2.5 60.248 0.006 60.254
Predator [15] weighted SVD 250 50.0 6.4 0.032 0.009 0.041
CoFiNet [53] weighted SVD 250 64.6 21.6 0.115 0.003 0.118
GeoTrans[52] weighted SVD 250 86.5 59.9 0.075 0.003 0.078
COTReg (Ours) weighted SVD 250 87.2 60.7 0.652 0.002 0.654

5.2 Evaluation on KITTI

5.2.1 Datasets

KITTI [56] contains 11 sequences of LiDAR-scanned outdoor driving scenarios. For a fair comparison, we follow the same data split as [20, 11] and use sequences 0-5 for training, 6-7 for validation, and 8-10 for testing. Following [11], we further refine the provided ground-truth poses using ICP and only evaluate on point cloud pairs that are at most 10m apart. Following [15], we use a 30cm voxel size to downsample the point clouds and set the thresholds r_{o}, r_{p}, and r_{n} to 45cm, 21cm, and 75cm, respectively.

5.2.2 Metrics

Following Predator [15] and CoFiNet [53], we use three metrics, Registration Recall (RR), Rotation Error (RRE), and Translation Error (RTE), to evaluate the performance of the proposed registration algorithm. RR is the percentage of successful alignments whose rotation and translation errors are below the set thresholds (i.e., RRE <5^{\circ} and RTE <2m). RRE and RTE are defined as RRE=\arccos\frac{\textbf{Tr}\left(\bm{R}^{\top}\bm{R}^{\star}\right)-1}{2} and RTE=\|\bm{t}-\bm{t}^{\star}\|_{2}, where \bm{R}^{\star} and \bm{t}^{\star} denote the ground-truth rotation matrix and translation vector, respectively.
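A small sketch of how these quantities can be evaluated in practice is shown below; the function names are ours, and the rotation error is reported in degrees.

```python
import numpy as np

def pose_errors(R_est, t_est, R_gt, t_gt):
    """RRE (degrees) and RTE (m) between an estimated and ground-truth pose."""
    cos_angle = np.clip((np.trace(R_est.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
    rre = np.degrees(np.arccos(cos_angle))
    rte = np.linalg.norm(t_est - t_gt)
    return rre, rte

def registration_recall_kitti(rres, rtes, rre_thresh=5.0, rte_thresh=2.0):
    """RR on KITTI: both errors must fall below their respective thresholds."""
    ok = (np.asarray(rres) < rre_thresh) & (np.asarray(rtes) < rte_thresh)
    return ok.mean()
```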

5.2.3 Registration results

We compare to the state-of-the-art RANSAC-based methods: FCGF [20], D3Feat [58], SpinNet [33], Predator [15], CoFiNet [53], and GeoTransformer [52]. As shown in Table III, our model still achieves the best performance in terms of registration recall as well as the lowest average RTERTE and RRERRE. The results verify the effectiveness of considering both pointwise and structural matchings to produce correspondences.

Table III: Results on KITTI dataset. Best performance is highlighted in bold.
Method Estimator RTE (cm) \downarrow RRE ()(^{\circ})\downarrow RR(%) \uparrow
FCGF [20] RANSAC 9.5 0.30 96.6
D3Feat [58] RANSAC 7.2 0.30 99.8
SpinNet [33] RANSAC 9.9 0.47 99.1
Predator [15] RANSAC 6.8 0.27 99.8
CoFiNet [53] RANSAC 8.5 0.41 99.8
GeoTrans [52] RANSAC 7.4 0.27 99.8
COTReg (Ours) RANSAC 4.9 0.22 99.8

5.3 Generalization on Cross-source Dataset

The generalization ability of learning-based registration algorithms is essential when point clouds are acquired from different sensors. To validate the generalizability of our model, we experiment on our own cross-source dataset (3DCSR) [57]. 3DCSR is a challenging dataset for registration due to a mixture of noise, outliers, density differences, partial overlap, and scale variation.

5.3.1 3DCSR

This dataset contains two subsets: Kinect-Lidar and Kinect-SFM. Kinect-Lidar contains 19 scenes captured by both Kinect and LiDAR sensors, where each scene is cropped into different parts. Kinect-SFM consists of 2 scenes captured by Kinect and RGB cameras; the RGB images have been reconstructed into point clouds using the VSFM software. We use the model trained on 3DMatch since the cross-source dataset is also captured in indoor environments. RR is the percentage of successful alignments whose rotation and translation errors are below the set thresholds (i.e., RRE <15^{\circ} and RTE <6m).

Table IV: Registration results on Cross Source Datasets. Best performance is highlighted in bold.
Method Estimator RRE (^{\circ}) \downarrow RTE (m) \downarrow RR(%) \uparrow
FCGF [20] RANSAC 7.47 0.21 49.6
D3Feat [58] RANSAC 6.41 0.26 52.0
SpinNet [33] RANSAC 6.56 0.24 53.5
Predator [15] RANSAC 6.26 0.27 54.6
CoFiNet [53] RANSAC 5.76 0.26 57.3
GeoTrans [52] RANSAC 5.60 0.24 60.2
COTReg (Ours) RANSAC 5.49 0.21 63.4

5.3.2 Registration Results

We use FCGF [20], D3Feat [58], SpinNet [33], Predator [15], CoFiNet [53], and GeoTransformer [52] as the baselines. Table IV shows that our method obtains the highest accuracy in generalizing to the real-world cross-source dataset. Specifically, it outperforms the second-best method, GeoTransformer, by 3.2% in terms of registration recall (63.4% vs. 60.2%). However, the recall is still not high, showing that registration on 3DCSR remains challenging.

Figure 6: Qualitative registration results on the cross-source dataset.

5.4 Ablation Study

To fully understand COTReg, we conduct an ablation study on 3DMatch and 3DLoMatch to investigate the contribution of each component. First, we replace the overlap scores with a uniform distribution, i.e., treating points in the overlapping and non-overlapping regions equally, to evaluate the effectiveness of the overlap scores. As shown in Table V, on 3DMatch, the learned overlap scores improve the performance by nearly 2.0% (92.9% vs. 90.9%) in RR, 0.7% (98.5% vs. 97.8%) in FMR, and 17.8% (86.1% vs. 68.3%) in IR. Structure matching boosts RR by 1.1% (92.9% vs. 91.8%), FMR by 0.5% (98.5% vs. 98.0%), and IR by 10.2% (86.1% vs. 75.9%). This indicates that COTReg benefits from both the overlap scores and structure matching. Table V also shows that the positional encoding improves the performance in terms of RR, FMR, and IR. The same conclusions hold on 3DLoMatch.

Table V: Ablation study of individual modules, tested with #Samples=1000. PM and SM indicate point matching and structure matching, respectively. OS indicates endowing points with overlap scores. PE represents positional embedding.
3DMatch 3DLoMatch
PE OS PM SM RR FMR IR RR FMR IR
\checkmark \checkmark \checkmark \checkmark 92.9 98.5 86.1 79.7 89.7 55.1
\checkmark \checkmark \checkmark 91.8 98.0 75.9 74.6 88.9 46.4
\checkmark \checkmark \checkmark 90.9 97.8 68.3 67.2 85.6 35.4
\checkmark \checkmark 90.2 97.6 63.4 66.1 84.7 33.5
\checkmark \checkmark 89.6 97.6 62.9 65.0 84.3 32.1
\checkmark 88.9 97.5 59.8 64.8 84.0 30.8

6 Conclusion

We propose a method to improve the accuracy of the putative correspondences for point cloud registration. Specifically, we design a coupled optimal transport-based approach to generate correspondences by jointly considering overlap scores, point-wise, and structural matchings. Comprehensive experiments show that our approach achieves a new state of the art. We believe our technique has potential in applications that require more accurate 3D point cloud registration.

References

  • [1] Y. Wang and J. M. Solomon, “Prnet: Self-supervised learning for partial-to-partial registration,” in NeurIPS, 2019.
  • [2] C. Wang and X. Guo, “Feature-based rgb-d camera pose optimization for real-time 3d reconstruction,” Computational Visual Media, vol. 3, no. 2, pp. 95–106, 2017.
  • [3] D. Borrmann, A. Nuechter, and T. Wiemann, “Large-scale 3d point cloud processing for mixed and augmented reality,” in ISMAR-Adjunct, 2018.
  • [4] S. Chen, B. Liu, C. Feng, C. Vallespi-Gonzalez, and C. Wellington, “3d point cloud processing and learning for autonomous driving: Impacting map creation, localization, and perception,” IEEE Signal Processing Magazine, vol. 38, no. 1, pp. 68–86, 2020.
  • [5] B. Nagy and C. Benedek, “Real-time point cloud alignment for vehicle localization in a high resolution 3d map,” in ECCV Workshops, 2018.
  • [6] R. Wang, Y. Xu, M. A. Sotelo, Y. Ma, T. Sarkodie-Gyan, Z. Li, and W. Li, “A robust registration method for autonomous driving pose estimation in urban dynamic environment using lidar,” Electronics, vol. 8, no. 1, p. 43, 2019.
  • [7] Z. Ma et al., “Point reg net: Invariant features for point cloud registration using in image-guided radiation therapy,” Journal of Computer and Communications, vol. 6, no. 11, p. 116, 2018.
  • [8] W. Li, Z. Jiang, K. Chu, J. Jin, Y. Ge, and J. Cai, “A noninvasive method to reduce radiotherapy positioning error caused by respiration for patients with abdominal or pelvic cancers,” Technology in Cancer Research & Treatment, vol. 18, 2019.
  • [9] R. P. Saputra and P. Kormushev, “Casualty detection from 3d point cloud data for autonomous ground mobile rescue robots,” in SSRR.   IEEE, 2018, pp. 1–7.
  • [10] F. Pomerleau, F. Colas, and R. Siegwart, “A review of point cloud registration algorithms for mobile robotics,” 2015.
  • [11] C. Choy, W. Dong, and V. Koltun, “Deep global registration,” in CVPR, 2020, pp. 2514–2523.
  • [12] X. Bai, Z. Luo, L. Zhou, H. Chen, L. Li, Z. Hu, H. Fu, and C.-L. Tai, “Pointdsc: Robust point cloud registration using deep spatial consistency,” in CVPR, 2021, pp. 15 859–15 869.
  • [13] Z. J. Yew and G. H. Lee, “3dfeat-net: Weakly supervised local 3d features for point cloud registration,” in ECCV, 2018, pp. 607–623.
  • [14] Z. Gojcic, C. Zhou, J. D. Wegner, and A. Wieser, “The perfect match: 3d point cloud matching with smoothed densities,” in CVPR, 2019, pp. 5545–5554.
  • [15] S. Huang, Z. Gojcic, M. Usvyatsov, A. Wieser, and K. Schindler, “Predator: Registration of 3d point clouds with low overlap,” in CVPR, 2021, pp. 4267–4276.
  • [16] K. Fu, S. Liu, X. Luo, and M. Wang, “Robust point cloud registration framework based on deep graph matching,” in CVPR, 2021, pp. 8893–8902.
  • [17] G. D. Pais, S. Ramalingam, V. M. Govindu, J. C. Nascimento, R. Chellappa, and P. Miraldo, “3dregnet: A deep neural network for 3d point registration,” in CVPR, 2020, pp. 7193–7203.
  • [18] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser, “3dmatch: Learning local geometric descriptors from rgb-d reconstructions,” in CVPR, 2017, pp. 1802–1811.
  • [19] J. Li, P. Zhao, Q. Hu, and M. Ai, “Robust point cloud registration based on topological and cauchy weighted lq-norm,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 160, pp. 244–259, 2020.
  • [20] C. Choy, J. Park, and V. Koltun, “Fully convolutional geometric features,” in ICCV, 2019, pp. 8958–8966.
  • [21] G. Peyré, M. Cuturi et al., “Computational optimal transport: With applications to data science,” Foundations and Trends® in Machine Learning, vol. 11, no. 5-6, pp. 355–607, 2019.
  • [22] M. Fey, J. E. Lenssen, C. Morris, J. Masci, and N. M. Kriege, “Deep graph matching consensus,” in ICLR, 2020.
  • [23] A. Zanfir and C. Sminchisescu, “Deep learning of graph matching,” in CVPR, 2018, pp. 2684–2693.
  • [24] P. J. Besl and N. D. McKay, “Method for registration of 3-d shapes,” in Sensor fusion IV: control paradigms and data structures, vol. 1611.   International Society for Optics and Photonics, 1992, pp. 586–606.
  • [25] Y. Wang and J. M. Solomon, “Deep closest point: Learning representations for point cloud registration,” in ICCV, 2019, pp. 3523–3532.
  • [26] X. Huang, J. Zhang, Q. Wu, L. Fan, and C. Yuan, “A coarse-to-fine algorithm for matching and registration in 3d cross-source point clouds,” TCSVT, vol. 28, no. 10, pp. 2965–2977, 2017.
  • [27] A. W. Fitzgibbon, “Robust registration of 2d and 3d point sets,” Image and vision computing, vol. 21, no. 13-14, pp. 1145–1153, 2003.
  • [28] J. Yang, H. Li, D. Campbell, and Y. Jia, “Go-icp: A globally optimal solution to 3d icp point-set registration,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 11, pp. 2241–2254, 2015.
  • [29] J. Li, Q. Hu, and M. Ai, “Point cloud registration based on one-point ransac and scale-annealing biweight estimation,” TGRS, 2021.
  • [30] Q.-Y. Zhou, J. Park, and V. Koltun, “Fast global registration,” in ECCV.   Springer, 2016, pp. 766–782.
  • [31] H. Yang, J. Shi, and L. Carlone, “Teaser: Fast and certifiable point cloud registration,” IEEE Transactions on Robotics, 2020.
  • [32] Z. Zhang, Y. Dai, and J. Sun, “Deep learning based point cloud registration: an overview,” Virtual Reality & Intelligent Hardware, vol. 2, no. 3, pp. 222–246, 2020.
  • [33] S. Ao, Q. Hu, B. Yang, A. Markham, and Y. Guo, “Spinnet: Learning a general surface descriptor for 3d point cloud registration,” in CVPR, 2021, pp. 11 753–11 762.
  • [34] Y. Aoki, H. Goforth, R. A. Srivatsan, and S. Lucey, “Pointnetlk: Robust & efficient point cloud registration using pointnet,” in CVPR, 2019, pp. 7163–7172.
  • [35] X. Huang, G. Mei, and J. Zhang, “Feature-metric registration: A fast semi-supervised approach for robust point cloud registration without correspondences,” in CVPR, 2020, pp. 11 366–11 374.
  • [36] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph cnn for learning on point clouds,” ACM TOG, vol. 38, no. 5, pp. 1–12, 2019.
  • [37] Z. J. Yew and G. H. Lee, “Rpm-net: Robust point matching using learned features,” in CVPR, 2020, pp. 11 824–11 833.
  • [38] Z. Dang, F. Wang, and M. Salzmann, “Learning 3d-3d correspondences for one-shot partial-to-partial registration,” arXiv preprint arXiv:2006.04523, 2020.
  • [39] J. Li, C. Zhang, Z. Xu, H. Zhou, and C. Zhang, “Iterative distance-aware similarity matrix convolution with mutual-supervised point elimination for efficient point cloud registration,” in ECCV, 2019.
  • [40] M. El Banani, L. Gao, and J. Johnson, “Unsupervisedr&r: Unsupervised point cloud registration via differentiable rendering,” in CVPR, 2021, pp. 7129–7139.
  • [41] L. Ding and C. Feng, “Deepmapping: Unsupervised map estimation from multiple point clouds,” in CVPR, 2019, pp. 8650–8659.
  • [42] L. Chapel, M. Z. Alaya, and G. Gasso, “Partial optimal tranport with applications on positive-unlabeled learning,” NeurIPS, vol. 33, pp. 2903–2913, 2020.
  • [43] J. Solomon, F. De Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas, “Convolutional wasserstein distances: Efficient optimal transportation on geometric domains,” ACM TOG, vol. 34, no. 4, pp. 1–11, 2015.
  • [44] Z. Su, Y. Wang, R. Shi, W. Zeng, J. Sun, F. Luo, and X. Gu, “Optimal mass transport for shape matching and comparison,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 37, no. 11, pp. 2246–2259, 2015.
  • [45] J. Solomon, G. Peyré, V. G. Kim, and S. Sra, “Entropic metric alignment for correspondence problems,” ACM TOG, vol. 35, no. 4, pp. 1–13, 2016.
  • [46] H. Xu, D. Luo, H. Zha, and L. C. Duke, “Gromov-wasserstein learning for graph matching and node embedding,” in ICML.   PMLR, 2019, pp. 6932–6941.
  • [47] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
  • [48] M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” NeurIPS, vol. 26, pp. 2292–2300, 2013.
  • [49] K. Pham, K. Le, N. Ho, T. Pham, and H. Bui, “On unbalanced optimal transport: An analysis of sinkhorn algorithm,” in ICML.   PMLR, 2020, pp. 7673–7682.
  • [50] A. Iusem and R. D. Monteiro, “On dual convergence of the generalized proximal point method with bregman distances,” Mathematics of Operations Research, vol. 25, no. 4, pp. 606–624, 2000.
  • [51] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas, “Kpconv: Flexible and deformable convolution for point clouds,” in ICCV, 2019, pp. 6411–6420.
  • [52] Z. Qin, H. Yu, C. Wang, Y. Guo, Y. Peng, and K. Xu, “Geometric transformer for fast and robust point cloud registration,” in CVPR, 2022.
  • [53] H. Yu, F. Li, M. Saleh, B. Busam, and S. Ilic, “Cofinet: Reliable coarse-to-fine correspondences for robust pointcloud registration,” NeurIPS, vol. 34, 2021.
  • [54] Q.-Y. Zhou, J. Park, and V. Koltun, “Open3d: A modern library for 3d data processing,” arXiv preprint arXiv:1801.09847, 2018.
  • [55] Y. Sun, C. Cheng, Y. Zhang, C. Zhang, L. Zheng, Z. Wang, and Y. Wei, “Circle loss: A unified perspective of pair similarity optimization,” in CVPR, 2020, pp. 6398–6407.
  • [56] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in CVPR.   IEEE, 2012, pp. 3354–3361.
  • [57] X. Huang, G. Mei, J. Zhang, and R. Abbas, “A comprehensive survey on point cloud registration,” 2021.
  • [58] X. Bai, Z. Luo, L. Zhou, H. Fu, L. Quan, and C.-L. Tai, “D3feat: Joint learning of dense detection and description of 3d local features,” in CVPR, 2020, pp. 6359–6367.
  • [59] H. Wang, Y. Liu, Z. Dong, W. Wang, and B. Yang, “You only hypothesize once: Point cloud registration with rotation-equivariant descriptors,” in ICCV, 2021.
  • [60] L. Chizat, G. Peyré, B. Schmitzer, and F.-X. Vialard, “Scaling algorithms for unbalanced optimal transport problems,” Mathematics of Computation, vol. 87, no. 314, pp. 2563–2609, 2018.

                            

Appendix A Proof of the Theorem 1

Proof.

We assume that the correct matches in the correspondence set are free of noise. We denote h_{1}(\Gamma)=\left<\Gamma,\bm{C}^{pq}\right>.

As the minimum value of h_{1}(\Gamma) is non-negative, if we can prove that h_{1}(\Gamma^{\star})=0, then \Gamma^{\star} is an optimal solution of (9). As \bm{R}^{\star},\bm{t}^{\star},\Gamma^{\star} are the optimal solutions of problem (3), we have

i=1Nj=1MΓij𝑹𝒑i+𝒕𝒒j22=0\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{M}\Gamma_{ij}^{\star}\|\bm{R}^{\star}\bm{p}_{i}+\bm{t}^{\star}-\bm{q}_{j}\|_{2}^{2}=0
\displaystyle\Rightarrow {𝑹𝒑i+𝒕𝒒j22=0,Γij=1Γij𝒟f(𝒇𝒑i,𝒇𝒒j)=0,Γij=0,\displaystyle\begin{cases}\|\bm{R}^{\star}\bm{p}_{i}+\bm{t}^{\star}-\bm{q}_{j}\|_{2}^{2}=0,&\Gamma_{ij}^{\star}=1\\ \Gamma_{ij}^{\star}\mathcal{D}_{f}\left(\bm{f}_{\bm{p}_{i}},\bm{f}_{\bm{q}_{j}}\right)=0,&\Gamma_{ij}^{\star}=0\end{cases},

i.e., when Γij=1\Gamma_{ij}^{\star}=1, 𝒑i𝒒j\bm{p}_{i}\rightarrow\bm{q}_{j} is an aligned correspondence since 𝑹𝒑i+𝒕𝒒j22=0\|\bm{R}^{\star}\bm{p}_{i}+\bm{t}^{\star}-\bm{q}_{j}\|_{2}^{2}=0. As the features are invariant to rigid transformation, then 𝒇𝒑i=𝒇𝒒jΓij𝒟f(𝒇𝒑i,𝒇𝒒j)=0\bm{f}_{\bm{p}_{i}}=\bm{f}_{\bm{q}_{j}}\Rightarrow\Gamma_{ij}^{\star}\mathcal{D}_{f}\left(\bm{f}_{\bm{p}_{i}},\bm{f}_{\bm{q}_{j}}\right)=0. Thus h1(Γ)=0h_{1}(\Gamma^{\star})=0. ∎

Appendix B Proof of the Theorem 2

Proof.

Denote h_{2}(\Gamma)=\sum_{ijkl}\Gamma_{ij}\Gamma_{kl}\left(\bm{C}^{p}_{ik}-\bm{C}^{q}_{jl}\right)^{2}. The minimum value of h_{2}(\Gamma) is non-negative, so if we can prove that h_{2}(\Gamma^{\star})=0, then \Gamma^{\star} is an optimal solution of (10). If \Gamma_{ij}^{\star}=1 and \Gamma_{kl}^{\star}=1, then \|\bm{R}^{\star}\bm{p}_{i}+\bm{t}^{\star}-\bm{q}_{j}\|_{2}^{2}=0 and \|\bm{R}^{\star}\bm{p}_{k}+\bm{t}^{\star}-\bm{q}_{l}\|_{2}^{2}=0, which implies \|\bm{p}_{i}-\bm{p}_{k}\|_{2}=\|\bm{R}^{\star}\bm{p}_{i}+\bm{t}^{\star}-(\bm{R}^{\star}\bm{p}_{k}+\bm{t}^{\star})\|_{2}=\|\bm{q}_{j}-\bm{q}_{l}\|_{2}. Meanwhile, since the features are invariant to the rigid transformation, \bm{f}_{\bm{p}_{i}}=\bm{f}_{\bm{q}_{j}} and \bm{f}_{\bm{p}_{k}}=\bm{f}_{\bm{q}_{l}}, so that \mathcal{D}_{f}(\bm{f}_{\bm{p}_{i}},\bm{f}_{\bm{p}_{k}})=\mathcal{D}_{f}(\bm{f}_{\bm{q}_{j}},\bm{f}_{\bm{q}_{l}}). Then we have

(𝑪ikp𝑪jlq)2=\displaystyle\left(\bm{C}^{p}_{ik}-\bm{C}^{q}_{jl}\right)^{2}= [(1λ)𝒟f(𝒇𝒑i,𝒇𝒑k)+λ𝒟e(𝒑i,𝒑k)\displaystyle[(1-\lambda)\mathcal{D}_{f}\left(\bm{f}_{\bm{p}_{i}},\bm{f}_{\bm{p}_{k}}\right)+\lambda\mathcal{D}_{e}\left(\bm{p}_{i},\bm{p}_{k}\right)
\displaystyle- (1λ)𝒟f(𝒇𝒒j,𝒇𝒒l)λ𝒟e(𝒒j,𝒒l)]2\displaystyle(1-\lambda)\mathcal{D}_{f}\left(\bm{f}_{\bm{q}_{j}},\bm{f}_{\bm{q}_{l}}\right)-\lambda\mathcal{D}_{e}\left(\bm{q}_{j},\bm{q}_{l}\right)]^{2}
=\displaystyle= 0ΓijΓkl(𝑪ikp𝑪jlq)2=0.\displaystyle 0\Rightarrow\Gamma_{ij}\Gamma_{kl}\left(\bm{C}^{p}_{ik}-\bm{C}^{q}_{jl}\right)^{2}=0.

If \Gamma_{ij}^{\star}=0 or \Gamma_{kl}^{\star}=0, the corresponding term vanishes. Thus h_{2}(\Gamma^{\star})=0. ∎

Appendix C Proof of the Theorem 3

Proof.

We denote

f(Γ(k))\displaystyle f(\Gamma^{(k)}) =ξ1𝑪pq+ξ2𝑯(𝑪p,𝑪q,Γ(k))ϵlogΓ(k),\displaystyle=\xi_{1}\bm{C}^{pq}+\xi_{2}\bm{H}\left(\bm{C}^{p},\bm{C}^{q},\Gamma^{(k)}\right)-\epsilon\log\Gamma^{(k)}, (28)
(Γ)\displaystyle\mathcal{H}(\Gamma) =i=1Nj=1MΓij(logΓij1).\displaystyle=\sum_{i=1}^{N}\sum_{j=1}^{M}\Gamma_{ij}(\log\Gamma_{ij}-1).

And then we have

𝒦(Γ|Γ(k))\displaystyle\mathcal{KL}\left(\Gamma|\Gamma^{(k)}\right) =i=1Nj=1M(Γijlog(ΓijΓij(k))Γij+Γij(k))\displaystyle=\sum_{i=1}^{N}\sum_{j=1}^{M}\left(\Gamma_{ij}\log\left(\frac{\Gamma_{ij}}{\Gamma^{(k)}_{ij}}\right)-\Gamma_{ij}+\Gamma^{(k)}_{ij}\right)
=(Γ)logΓ(k),Γ+𝟏NΓ(k)𝟏M.\displaystyle=\mathcal{H}(\Gamma)-\left<\log\Gamma^{(k)},\Gamma\right>+\bm{1}^{\top}_{N}\Gamma^{(k)}\bm{1}_{M}.

After algebraic simplification, Eq. (13) can be rewritten as

Γ(k+1)\displaystyle\Gamma^{(k+1)} =argminΓ0f(Γ(k)),Γ+ϵ(Γ)\displaystyle=\mathop{\arg\min}_{\Gamma\geq 0}\left<f\left(\Gamma^{(k)}\right),\Gamma\right>+\epsilon\mathcal{H}(\Gamma) (29)
+τ(𝒦(Γ𝟏M|𝝁p)+𝒦(Γ𝟏N|𝝁q))\displaystyle+\tau\left(\mathcal{KL}\left(\Gamma\bm{1}_{M}|\bm{\mu}_{p}\right)+\mathcal{KL}\left(\Gamma^{\top}\bm{1}_{N}|\bm{\mu}_{q}\right)\right)
+ϵ𝟏NΓ(k)𝟏M,\displaystyle+\epsilon\bm{1}^{\top}_{N}\Gamma^{(k)}\bm{1}_{M},

with the initialization \Gamma^{(0)}=\bm{\mu}_{p}{\bm{\mu}_{q}}^{\top}. We solve it iteratively with the help of the Sinkhorn-Knopp algorithm [48, 60]. For any \epsilon>0, problem (29) is strongly convex and lower semi-continuous. Since \bm{\mu}_{p} and \bm{\mu}_{q} are given non-negative vectors, strong duality and the existence of a minimizer for (29) are thus given by the Fenchel-Legendre dual form, which states that

max𝒖N,𝒗MF(𝒖)G(𝒗)ϵijexp(ui+vj𝑪ijϵ),\max_{\bm{u}\in\mathbb{R}^{N},\bm{v}\in\mathbb{R}^{M}}-F^{\star}\left(-\bm{u}\right)-G^{\star}\left(-\bm{v}\right)-\epsilon\sum_{ij}\exp\left(\frac{u_{i}+v_{j}-\bm{C}_{ij}}{\epsilon}\right),

where \bm{C}_{ij}=[f\left(\Gamma^{(k)}\right)]_{ij}, and the functions F^{\star}\left(\cdot\right) and G^{\star}\left(\cdot\right) take the following forms:

F(𝒖)\displaystyle F^{\star}\left(\bm{u}\right) =sup𝒛N𝒛𝒖τ𝒦(𝒛|𝝁p)\displaystyle=\sup_{\bm{z}\in\mathbb{R}^{N}}\bm{z}^{\top}\bm{u}-\tau\mathcal{KL}\left(\bm{z}|\bm{\mu}_{p}\right) (30)
=τexp(𝒖τ),𝝁p𝝁p𝟏N,\displaystyle=\tau\left\langle\exp\left(\frac{\bm{u}}{\tau}\right),\bm{\mu}_{p}\right\rangle-{\bm{\mu}_{p}}^{\top}\bm{1}_{N},
G(𝒗)\displaystyle G^{\star}\left(\bm{v}\right) =sup𝒛M𝒛𝒗τ𝒦(𝒛|𝝁q)\displaystyle=\sup_{\bm{z}\in\mathbb{R}^{M}}\bm{z}^{\top}\bm{v}-\tau\mathcal{KL}\left(\bm{z}|\bm{\mu}_{q}\right)
=τexp(𝒗τ),𝝁q𝝁q𝟏M,\displaystyle=\tau\left\langle\exp\left(\frac{\bm{v}}{\tau}\right),\bm{\mu}_{q}\right\rangle-{\bm{\mu}_{q}}^{\top}\bm{1}_{M},

Thus, it is proved by denoting h(𝒖,𝒗)=F(𝒖)+G(𝒗)+ϵijexp(𝒖i+𝒗j𝑪ijϵ)h(\bm{u},\bm{v})=F^{\star}\left(-\bm{u}\right)+G^{\star}\left(-\bm{v}\right)+\epsilon\sum_{ij}\exp\left(\frac{\bm{u}_{i}+\bm{v}_{j}-\bm{C}_{ij}}{\epsilon}\right). ∎
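For completeness, the dual problem above is the one solved by the scaling (Sinkhorn-Knopp) iterations of [48, 60]. The sketch below shows the generic unbalanced-Sinkhorn updates for a fixed cost matrix C=f(\Gamma^{(k)}), using the default values of \epsilon, \tau, and N_{I} quoted in the implementation details; the variable names are ours, and no claim is made that this matches the exact released implementation.

```python
import torch

def unbalanced_sinkhorn(C, mu_p, mu_q, eps=1e-3, tau=5.0, n_iter=100):
    """Entropic unbalanced OT via Sinkhorn-like scaling iterations [48, 60].

    C:    (N, M) cost matrix, e.g. f(Gamma^(k)) in Eq. (28).
    mu_p: (N,)   source marginal (overlap scores of P).
    mu_q: (M,)   target marginal (overlap scores of Q).
    The KL relaxation of the marginals with strength tau enters through the
    exponent tau / (tau + eps) applied to each scaling update. A log-domain
    version is preferable numerically when eps is very small.
    """
    K = torch.exp(-C / eps)                    # Gibbs kernel
    a = torch.ones_like(mu_p)
    b = torch.ones_like(mu_q)
    rho = tau / (tau + eps)
    for _ in range(n_iter):
        a = (mu_p / (K @ b + 1e-30)) ** rho
        b = (mu_q / (K.t() @ a + 1e-30)) ** rho
    return a[:, None] * K * b[None, :]         # transport plan Gamma^(k+1)
```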