LoGG3D-Net: Locally Guided Global Descriptor Learning
for 3D Place Recognition

Kavisha Vidanapathirana^1,2, Milad Ramezani¹, Peyman Moghadam^1,2,
Sridha Sridharan², Clinton Fookes² ¹ Kavisha Vidanapathirana, Milad Ramezani and Peyman Moghadam are with the Robotics and Autonomous Systems Group, DATA61, CSIRO, Brisbane, QLD 4069, Australia. E-mails: firstname.lastname @data61.csiro.au² Kavisha Vidanapathirana, Peyman Moghadam, Sridha Sridharan, Clinton Fookes are with the SAIVT research programme in the School of Electrical Engineering and Robotics, Queensland University of Technology (QUT), Brisbane, Australia. E-mails: {kavisha.vidanapathirana, peyman.moghadam, s.sridharan, c.fookes}@qut.edu.au

Abstract

Retrieval-based place recognition is an efficient and effective solution for re-localization within a pre-built map, or global data association for Simultaneous Localization and Mapping (SLAM). The accuracy of such an approach is heavily dependent on the quality of the extracted scene-level representation. While end-to-end solutions - which learn a global descriptor from input point clouds - have demonstrated promising results, such approaches are limited in their ability to enforce desirable properties at the local feature level. In this paper, we introduce a local consistency loss to guide the network towards learning local features which are consistent across revisits, hence leading to more repeatable global descriptors resulting in an overall improvement in 3D place recognition performance. We formulate our approach in an end-to-end trainable architecture called LoGG3D-Net. Experiments on two large-scale public benchmarks (KITTI and MulRan) show that our method achieves mean $F1_{max}$ scores of $0.939$ and $0.968$ on KITTI and MulRan respectively, achieving state-of-the-art performance while operating in near real-time. The open-source implementation is available at: https://github.com/csiro-robotics/LoGG3D-Net.

I Introduction

Despite considerable progress in the field of 3D point cloud perception for robotics and self-driving cars, existing methods for data-association remain fragile and limited in applicability, especially in large-scale outdoor scenes. Accurate data-association is vital for enabling long-term autonomy, as autonomous agents need to construct and maintain an accurate representation of the environment they operate in. Recognizing previously visited places provides global constraints to restrict cumulative errors within Simultaneous Localization and Mapping (SLAM) systems [1, 2]. This is known as the Place Recognition (PR) task.

Compared to visual place recognition [3], the use of 3D point clouds extracted from LiDAR sensors benefits from its inherent invariance to view-point and illumination. However extracting useful information from the point cloud representation remains a challenge due to higher sparsity and a complex and variable distribution of points. In this paper, we consider the task of place recognition on 3D point clouds.

While there have been many handcrafted approaches for extracting useful information for the task of place recognition [4, 5, 6], discriminative learning-based approaches have demonstrated competitive performance in terms of both accuracy and efficiency [7, 8].

Refer to caption — Figure 1: LoGG3D-Net retrieval results for challenging orthogonal revisits and non-revisit queries on KITTI sequence 08.

We propose LoGG3D-Net, a novel end-to-end 3D place recognition method for LiDAR point clouds. Fig. 1 shows an example of retrieval results on KITTI sequence 08 which demonstrates some challenging orthogonal revisits. In contrast to state-of-the-art end-to-end methods, which rely solely on global descriptor learning, we propose to jointly optimize local and scene-level embeddings. We achieve this by introducing a local consistency loss that takes a pair of LiDAR point clouds and maximizes the similarity of their corresponding points’ features and minimizes the similarity of all non-corresponding points’ features.

We further introduce the use of second-order pooling followed by differentiable Eigen-value power normalization to aggregate local features and generate a global descriptor in an end-to-end setting. We note that all current learning-based methods use NetVLAD[9] or other first-order aggregation methods [10] to compute the global descriptor, and higher-order aggregation methods have not been explored in an end-to-end setting for LiDAR-based place recognition. We evaluate the accuracy and robustness of LoGG3D-Net on 6 KITTI sequences and 5 sequences of the MulRan dataset. Note that these datasets are collected using different LiDAR sensors (Velodyne HDL-64E in KITTI, Ouster OS1-64 in MulRan) and different countries (Germany, Korea). Our main contributions are summarized as follows:

•

We introduce a local consistency loss that can be used in an end-to-end global descriptor learning setting to enforce consistency of the local embeddings extracted from point clouds of the same location. We demonstrate how enforcing this property in the local features contributes towards better performance of the global descriptor.
•

We introduce the use of second-order pooling with differentiable Eigen-value power normalization in an end-to-end setting for LiDAR-based place recognition.
•

Using 2 large scale public datasets, we demonstrate the superiority of our method in an end-to-end setting while operating in near real-time.

The remainder of this paper is organized as follows. In Sec. II we discuss related work. Sec. III explains our proposed method. The experimental setup is outlined in Sec. IV. Sec. V discusses the experimental results and concluding remarks are provided in Sec. VI.

II Related Work

Solutions for 3D LiDAR-based place recognition can be loosely classified under local feature matching based methods and retrieval-based methods. Local feature matching based methods rely on accurate global pose estimates to limit the search space of feature correspondences and hence do not scale well for the task of re-localization in large-scale environments. Majority of local feature matching based approaches operate by detecting keypoints and describing their local neighborhood [6, 11]. Recently features extracted from larger local regions such as point segments have shown superior performance [12, 13].

Retrieval-based methods are inexpensive due to simple matching of global descriptors and therefore scale well to large-scale scenarios. They operate by encoding each point cloud to a single vector representation that can be used for querying a database of previously visited places. Global descriptors can be categorized as handcrafted [4, 5], hybrid [14], and end-to-end learning-based [7]. Handcrafted methods have the benefit of not needing re-training to adapt to different environments and sensor-types. Prominent methods such as ScanContext [5] have demonstrated reliable performance in various scenarios. However the discriminative power of such methods remains limited.

Hybrid methods aim to combine principled mathematical models with data-driven models to benefit from the advantages of both [15]. Locus [14] first demonstrated such an approach for lidar-based PR by mathematically modeling the topological relationships and temporal consistencies of point segments, while structural appearance of the segments were encoded using a data-driven 3D-CNN. Locus achieved state-of-the-art on the KITTI dataset, but struggles to adapt to environments where extracted point segments are structurally different to its training data (eg. unstructured environments). While (in general) hybrid methods show promise in scenarios with inadequate training data [15], the universal function approximation property of neural networks [16] implies that end-to-end methods have the potential to perform better if trained properly. This paper presents an improved method of training of end-to-end models for lidar-based PR.

End-to-end methods formulate the learning of global descriptors using a contrastive approach to obtain discriminative scene representations. PointNetVLAD[7] pioneered the use of an end-to-end trainable global descriptor for 3D point cloud place recognition. PointNetVLAD extracts local features from PointNet [17] and employs the NetVLAD aggregator [9] to form a global descriptor of the scene. To address the limited descriptive power of the PointNet backbone, other approaches such as LPD-Net [18] have been proposed. More recently MinkLoc3D[8] proposed an efficient architecture outperforming its end-to-end predecessors. MinkLoc3D utilized sparse convolutions which have been demonstrated to be effective at capturing useful point-level features. We also benefit from the use of sparse convolutions.

We note that all end-to-end methods rely solely on supervisory signals applied on the global descriptors. Additionally, all end-to-end methods are currently utilizing first-order aggregation methods in the formulation of the global descriptors. In many visual recognition tasks, higher-order aggregation methods have demonstrated superior performance [19, 20, 21]. Higher-order aggregation has previously been applied for 3D place recognition [14], but not in an end-to-end trainable architecture. In this work, we address the above limitations by introducing an additional training signal to the local features, and using differentiable second-order pooling for global descriptor generation respectively.

III Proposed Method

The overall architecture of our LoGG3D-Net is presented in Fig. 2. Given a raw point cloud input, a sparse convolution-based U-Net (SparseConv U-Net) is used to embed each point into a high dimensional feature space. The ‘Local Consistency Loss’ acts on a pair of point clouds from nearby locations (with considerable overlap of points) to maximize the similarity of corresponding points’ features in the embedding space. Next, local features are aggregated using second-order pooling followed by differentiable Eigen-value power normalization to form a global scene-level descriptor. A quadruplet loss is used to train our scene-level global descriptors. Finally, we combine our local consistency loss with our scene-level loss to optimize our network.

III-A Problem Formulation

The task of point cloud based retrieval for place recognition is generally formulated as follows. Given a point cloud $\mathcal{P}\in\mathbb{R}^{N\times 4}$ representing varying number of $N$ points with associated $x,y,z$ and intensity, a mapping function $\Phi:\mathcal{P}\rightarrow g\in\mathbb{R}^{d^{\prime}}$ is developed that represents the point cloud with a fixed-size global descriptor $g$ . End-to-end learning of global descriptors is formulated as follows:

Problem: Given a training set $\{\mathcal{S}_{i}\}=\{(\mathcal{P}_{i},\mathtt{x}_{i})\}$ consisting of pairs of point cloud $\mathcal{P}_{i}$ with associated geo-location $\mathtt{x}_{i}$ , find the parameters $\theta$ of the mapping function $\Phi_{\theta}:\mathcal{P}\rightarrow g\in\mathbb{R}^{d^{\prime}}$ , such that for any subset of samples $\{\mathcal{S}_{a},\mathcal{S}_{i},\mathcal{S}_{j}\}$ ,

\mathcal{D}(\mathtt{x_{a}},\mathtt{x_{i}})\leq\mathcal{D}(\mathtt{x_{a}},\mathtt{x_{j}})\implies\lVert g_{a}-g_{i}\rVert\leq\lVert g_{a}-g_{j}\rVert,

(1)

where, $\mathcal{D}$ represents geometric distance and $\lVert.\rVert$ represents a distance in the feature space (typically ${L}_{2}$ ).

Learning the parameters $\theta$ that address the above problem is generally performed using a metric learning setting by applying a loss function $\mathcal{L}_{g}$ that acts on the global descriptors extracted from a tuple $\mathcal{T}$ of training samples. We use the subscript ‘ $g$ ’ in $\mathcal{L}_{g}$ to highlight that the loss is typically only applied on the global descriptors, i.e., at the scene-level.

III-B Our Approach

We note that generally $\Phi$ can be decomposed into two functions such that,

\Phi\equiv\varphi\circ\phi,

(2)

where, $\phi:\mathcal{P}\rightarrow\{f\}$ extracts local features $f\in\mathbb{R}^{d}$ , and $\varphi:\{f\}\rightarrow g$ aggregates the local features into a single global descriptor $g$ . Under this setting, optimizing $\theta$ with a training signal applied only to the global descriptor seems limited, i.e., there are desirable properties of the intermediate representations $\{f\}$ (local features) that cannot be fully enforced by $\mathcal{L}_{g}$ .

Hypothesis: Given two point clouds from nearby locations (i.e., with considerable overlap of points), enforcing consistency of local features of corresponding points will result in a more repeatable global descriptors after aggregation.

We define ‘corresponding points’ as two points from different point clouds that are nearby when represented with respect to a global coordinate frame. ‘Consistency of local features’ implies that local features of corresponding points should be nearby in the embedding space. Towards testing the above hypothesis, we introduce an additional training signal to enforce the features $\{f\}$ of corresponding points to be consistent in the embedding space.

III-C Local Descriptor

For the local feature extractor $\phi$ , we use a Sparse U-Net style backbone [22] which performs sparse point-voxel convolution. The backbone consists of two branches: a voxel-based branch and a point-based branch. The voxel-based branch learns local neighborhood information at varying receptive field sizes using convolutional layers, and the point-based branch learns high-resolution information that may be lost in the voxelized branch using MLP layers. The information across the two branches are fused at intermediate steps to capture complementary information from the input 3D point cloud.

III-C1 Point Correspondences

Given a pair of samples $\{\mathcal{S}_{1},\mathcal{S}_{2}\}$ from nearby locations (i.e. $\mathcal{D}(\mathtt{x_{1}},\mathtt{x_{2}})<\tau_{p}$ , where $\tau_{p}$ is a distance threshold), corresponding points in the two point clouds $\{\mathcal{P}_{1},\mathcal{P}_{2}\}$ are calculated by first transforming the point clouds to a common coordinate frame using the geo-location information $\{\mathtt{x_{1}},\mathtt{x_{2}}\}$ followed by ICP [23] for better alignment (to account for possible errors in geo-locations). After alignment, point correspondences can be found using radius-based nearest neighbor search. This search is made efficient using the approximate nearest neighbor search algorithm FLANN [24]. For each point $\tensor[^{1}]{p}{{}_{i}}$ in point cloud $\mathcal{P}_{1}$ and radius $r$ , the corresponding points in point cloud $\mathcal{P}_{2}$ are found as $C^{1\leftrightarrow 2}_{i}=\{(i,j)\,|\,\mathcal{D}(\tensor[^{1}]{p}{{}_{i}},\tensor[^{2}]{p}{{}_{j}})<r\}$ . The set of all corresponding points’ indices between the two point clouds is represented as $C^{1\leftrightarrow 2}$ .

III-C2 Local Consistency Loss

Given two samples $\{\mathcal{S}_{1},\mathcal{S}_{2}\}$ , their associated local features $\{\{f\}_{1},\{f\}_{2}\}$ and the point correspondences $C^{1\leftrightarrow 2}$ , a contrastive loss is applied to minimize the distance for features of corresponding points (positive pairs) while maximizing the distance of features of non-corresponding points (negative pairs). We adopt the Hardest-Contrastive loss [25] which is defined as,

\begin{split}\mathcal{L}_{lc}&\!=\!\sum_{(i,j)\in C^{1\leftrightarrow 2}}\Bigg{\{}\left[||f(\tensor[^{1}]{p}{{}_{i}})\!-\!f(\tensor[^{2}]{p}{{}_{j}})||_{2}^{2}\!-\!m_{p}\right]_{+}/\,|C^{1\leftrightarrow 2}|\\ &\!+\!\,\lambda{{}_{n}}I_{i}\left[m_{n}\!-\!\underset{k\in\mathcal{M}}{{\rm min}}||f(\tensor[^{1}]{p}{{}_{i}})\!-\!f(\tensor[^{2}]{p}{{}_{k}})||_{2}^{2}\right]_{+}/\,|C^{1\leftrightarrow 2}_{1}|\\ &\!+\!\,\lambda{{}_{n}}I_{j}\left[m_{n}\!-\!\underset{k\in\mathcal{M}}{{\rm min}}||f(\tensor[^{2}]{p}{{}_{j}})\!-\!f(\tensor[^{1}]{p}{{}_{k}})||_{2}^{2}\right]_{+}/\,|C^{1\leftrightarrow 2}_{2}|\Bigg{\}},\end{split}

(3)

where, $\mathcal{M}$ is a random subset of features used for hard negative mining and $\left[.\right]_{+}$ denotes the hinge loss. $I_{i}$ is short for $I(i,k_{i},r)$ , which is an indicator function that returns 1 if the point $k_{i}$ is non-corresponding (outside radius $r$ ) to point $i$ and 0 otherwise, where $k_{i}=\operatorname*{argmin}_{k\in\mathcal{M}}||f_{i}-f_{k}||_{2}$ . $|C^{1\leftrightarrow 2}_{1}|=\sum_{(i,j)\in C^{1\leftrightarrow 2}}I(i,k_{i},r)$ is the total number of valid mined negatives for points in $\mathcal{P}_{1}$ ( $|C^{1\leftrightarrow 2}_{2}|$ for $\mathcal{P}_{2}$ ). Hyperparameters $m_{p}$ , $m_{n}$ are scalar margins and $\lambda{{}_{n}}$ is a scalar weight.

The local consistency loss $\mathcal{L}_{lc}$ acts on the parameters of $\phi$ in the decomposition of Eq. (2) to get well-formed repeatable local features.

III-D Global Descriptor

Given a point cloud $\mathcal{P}=\{p_{i}\}$ and the set of point features $\{f(p_{i})\}$ where $f(p_{i})\in\mathbb{R}^{d}$ , the second-order pooling $F^{O_{2}}$ of the features is defined as,

F^{O_{2}}=\{F^{O_{2}}_{xy}\},\quad F^{O_{2}}_{xy}=\max_{p_{i}\in\mathcal{P}}\ f_{xy}^{o_{2}}(p_{i}),

(4)

where $F^{O_{2}}$ is a matrix with elements $F^{O_{2}}_{xy}(1\leq x,y\leq d)$ and $f^{o_{2}}(p_{i})=f(p_{i})f(p_{i})^{T}\in\mathbb{R}^{d\times d}$ is the outer product of the point feature with itself. This accounts to taking the element-wise maximum of the second-order features of all points in the point cloud.

In order to make the scene descriptor matrix $F^{O_{2}}$ more discriminative, we use Eigen-value Power Normalization (ePN) [20, 19, 21]. Given the singular value decomposition $F^{O_{2}}=U\lambda V^{T}$ the ePN result $F^{O_{2}}_{\alpha}$ is obtained by raising each of the singular values by a power of $\alpha$ as follows,

F^{O_{2}}_{\alpha}=U\hat{\lambda}V^{T},\quad\hat{\lambda}=\textit{diag}(\lambda^{\alpha}_{1,1},..,\lambda^{\alpha}_{d,d}),

(5)

where, $\alpha=0.5$ . The matrix $F^{O_{2}}_{\alpha}$ is flattened and normalized to obtain the final global descriptor vector $g\in\mathbb{R}^{d^{2}}$ . To incorporate Eq. (5) into our end-to-end pipeline, we utilize differentiable SVD as introduced in [26], and its PyTorch implementation.

After the aggregation of point features into a global descriptor using second-order pooling, the scene-level loss $\mathcal{L}_{g}$ is applied to a tuple $\mathcal{T}$ of training samples. We use the quadruplet loss [27] where a tuple is denoted as $\mathcal{T}_{i}=(\mathcal{S}_{a},\{\mathcal{S}_{p}\},\{\mathcal{S}_{n}\},\mathcal{S}_{n^{*}})$ , where $\mathcal{S}_{a}$ is the anchor sample, { $\mathcal{S}_{p}$ } a set of positives (such that $\mathcal{D}(\mathtt{x_{a}},\mathtt{x_{p}})<\tau_{p}$ ), { $\mathcal{S}_{n}$ } a set of negatives (such that $\mathcal{D}(\mathtt{x_{a}},\mathtt{x_{n}})>\tau_{n}$ ), and $\mathcal{S}_{n^{*}}$ is sampled such that it is not a positive to the query nor to all previous negatives (such that $\mathcal{D}(\mathtt{x_{n^{*}}},\mathtt{x})>\tau_{p}\,\forall\,\mathtt{x}=\{\mathtt{x_{a}},\mathtt{x_{n}}\}$ ).

In each tuple we first find the hardest positive sample,

\mathcal{P}_{hp}=\underset{\mathcal{P}_{p^{i}}\in\left\{\mathcal{P}_{p}\right\}}{\rm max}\,\,||g(\mathcal{P}_{a})-g(\mathcal{P}_{p^{i}})||_{2},

(6)

The quadruplet loss is then defined as:

\begin{split}\mathcal{L}_{g}&\!=\sum_{i}^{\mathcal{N}}\Bigg{\{}\left[||g(\mathcal{P}_{a})\!-\!g(\mathcal{P}_{hp})||_{2}^{2}\!-\!||g(\mathcal{P}_{a})\!-\!g(\mathcal{P}_{n^{i}})||_{2}^{2}\!+\!\alpha\right]_{+}\\ &\!+\!\left[||g(\mathcal{P}_{a})\!-\!g(\mathcal{P}_{hp})||_{2}^{2}\!-\!||g(\mathcal{P}_{n^{*}})\!-\!g(\mathcal{P}_{n^{i}})||_{2}^{2}\!+\!\beta\right]_{+}\Bigg{\}},\end{split}

(7)

where, $\alpha$ and $\beta$ are constant margins and $\mathcal{N}$ is the number of sampled negatives.

III-E Joint Local and Global Loss

Our network is jointly optimized by a weighted sum of the global scene-level loss and the local consistency loss described as:

\mathcal{L}=\mathcal{L}_{g}+\omega\cdot\mathcal{L}_{lc}

(8)

where, $\omega$ is a scalar hyperparameter term.

IV Experimental Setup

IV-A Implementation and Training Setup

The proposed network is implemented using the PyTorch framework and trained on 12 Nvidia Tesla P100-16GB GPUs using $\mathtt{DistributedDataParallel}$ . The TorchSparse library [22] is used for sparse convolutions. During training, the ground plane is first removed using RANSAC plane fitting followed by down-sampling using a voxel grid filter of $10cm$ . Finally, input point clouds are limited to a maximum of $35K$ points. To reduce overfitting, we apply the following data augmentations for training. Random point jitter is introduced using Gaussian noise sampled from $\mathcal{N}(\mu=0,\sigma=0.01)$ clipped at $0.03m$ . Each point cloud is also randomly rotated about the $z$ -axis by an angle between $\pm 180^{\circ}$ . Note that ground plane removal is not used during evaluation to speed up inference time as it does not affect evaluation performance of our proposed method¹¹1The additional features from the ground plane are still well-formed (property in Fig. 3), and after aggregation, the global descriptors still remain discriminative due to the robustness of ePN Eq. (5) to feature burstiness [20]..

For a fair comparison with PointNetVLAD [7], we set the same global descriptor dimension ( $d^{2}=256$ ), hence the dimension of the local features is set to $d=16$ . In the local consistency loss $\mathcal{L}_{lc}$ , the margins $m_{p}$ and $m_{n}$ are set to 0.1 and 2.0 respectively, and $\lambda{{}_{n}}$ is set to 0.5. The quadruplet loss margins are set to $\alpha=0.5$ and $\beta=0.3$ . The distances for sampling positive and negative point cloud pairs are set to $\tau_{p}=3m$ and $\tau_{n}=20m$ . For $\mathcal{L}_{g}$ we use 2 positives, 9 negatives and 1 other negative. We train our model using the Adam optimizer with an initial learning rate of 0.001 and a multi-step scheduler to drop the learning rate by a factor of 10 after 10 epochs and train until convergence for a maximum of 24 hours.

IV-B Datasets

We evaluate the proposed method on two public LiDAR datasets (KITTI, MulRan), both of which were collected from a moving vehicle in multiple dynamic urban environments. Note that these datasets are collected using different LiDAR sensors (Velodyne HDL-64E in KITTI, Ouster OS1-64 in MulRan) and in different countries (Germany, Korea).

KITTI: The KITTI odometry dataset [28] contains 11 sequences of Velodyne HDL-64E LiDAR scans collected in Karlsruhe, Germany. We train on these 11 sequences using the leave-one-out cross-validation strategy and evaluate on the 6 sequences with revisits (00, 02, 05, 06, 07 and 08).

MulRan: The MulRan dataset [29] contains scans collected from an Ouster-64 sensor from multiple environments in South Korea. The dataset contains 12 sequences, 9 of which we use for evaluation. We train on DCC1, DCC2, Riverside1 and Riverside3 sequences and evaluate on the remaining sequences of DCC, Riverside and KAIST. To assess the generalization capabilities of methods, the KAIST sequences are unseen test sets for evaluation.

IV-C Evaluation Criteria

We compute the cosine similarity between the global descriptors of each query with a database of global descriptors of previously seen point clouds in each sequence. Previous entries adjacent to the query by less than $t_{r}$ time difference are excluded from the search to avoid matching to the same instance. For $t_{r}$ we use $30s$ and $90s$ for KITTI and MulRan, respectively. Methods are compared using the Precision-Recall curve and its scalar metric the maximum $F1$ score ( $F1_{max}$ ). The 3m, 20m thresholds are used to classify true positives and false positives respectively, as done in [14].

V Results

We first demonstrate the performance improvement from the inclusion of $\mathcal{L}_{lc}$ . We evaluate our method in comparison to other state-of-the-art methods and conduct a run-time analysis to judge the suitability for real-time operation.

V-A Ablation study on point-wise loss

Method	DCC2	Riverside2	mean
$\mathcal{L}_{g}$	0.355	0.472	0.413
$\mathcal{L}_{g}+0.1\cdot\mathcal{L}_{lc}$	0.471	0.578	0.524
$\mathcal{L}_{g}+1.0\cdot\mathcal{L}_{lc}$	0.591	0.747	0.669

TABLE I: Ablation study on the effects of the

\omega

term for our joint loss.

{L}_{g}

is the scene-level loss and

{L}_{lc}

is the local consistency loss.

We evaluate the effect of inclusion of the local consistency loss through an ablation study on selected sequences of the MulRan dataset. We train on sequences DCC1, Riverside1 and evaluate performance on DCC2 and Riverside2. Table I summarizes the $F1_{max}$ for each test sequence by varying the weight of the local consistency loss $\omega$ in Eq. (8). It is evident that the inclusion on the local consistency loss leads to an improvement in place recognition performance. $\omega=0.1$ leads to an improvement of $26\%$ mean $F1_{max}$ with respect to the baseline while $\omega=1.0$ leads to an improvement of $61\%$ leading to the best performance. All the following experiments are carried out with $\omega=1.0$ .

A qualitative depiction of the effect of $\mathcal{L}_{lc}$ is shown in Fig. 3 between two point clouds extracted from nearby locations. The left half shows the alignment of the point cloud after ground plane removal to estimate point correspondences during training. The right half shows the two point clouds separately with each point colored based on the t-SNE embedding of the local features extracted using our pre-trained model (note that point correspondence information is not used during inference). The visualization clearly highlights that distinct regions in a single point cloud are in distinct regions in the feature space. Additionally, across point clouds, corresponding regions have similar point features implying that the local features extracted from our method are repeatable.

V-B Comparison to State-of-the-Art

KITTI

MulRan

mean

ScanContext [5]

0.966

0.871

0.914

0.985

0.698

0.610

0.841

0.954

0.969

0.994

0.893

0.826

0.916

PointNetVLAD [7]

0.909

0.637

0.859

0.924

0.171

0.437

0.656

0.952

0.856

0.979

0.685

0.868

Locus [14]

0.983

0.762

0.981

0.992

1.000

0.931

0.942

0.938

0.874

0.969

0.718

0.994

0.899

LoGG3D-Net (Ours)

0.953

0.888

0.976

0.977

1.000

0.843

0.939

0.966

0.938

0.991

0.977

0.969

0.968

TABLE II: Evaluation of sequential place recognition on the KITTI and MulRan datasets using the

F1_{max}

metric under the

3m,20m

revisit criteria.

We compare LoGG3D-Net with the state-of-the-art handcrafted method ScanContext²²2https://github.com/irapkaist/scancontext [5], the recently proposed hybrid method Locus³³3https://github.com/csiro-robotics/locus [14], and the popular end-to-end method PointNetVLAD [7]. For ScanContext we use the python version of the code provided by the authors. We use the 20x60 descriptor size and use ring-key search to find the top-10 candidates for descriptor matching. For PointNetVLAD we use a PyTorch-based re-implementation of the original tensorflow implementation⁴⁴4https://github.com/mikacuy/pointnetvlad.

The results are summarized in Table II. On the KITTI dataset, Locus remains the highest performing method with a mean $F1_{max}$ score $1\%$ higher than ours. On the MulRan dataset we obtain the best mean $F1_{max}$ score which is $5\%$ higher than the next best performing method, i.e., ScanContext. It should be noted that the local feature extractor of Locus was trained on the KITTI sequences 05 and 06 thus leading to its extremely high performance on KITTI and relatively low performance on MulRan.

Precision-Recall plots for the sequences KITTI 02 and DCC 03 are depicted in Fig. 4. KITTI 02 contains several repetitive environments and a single revisit to an intersection in the opposite direction. DCC 03 contains long traversals of revisits from the reverse direction. All these scenarios prove challenging for most methods while the adverse effect on the performance of LoGG3D-Net is not as pronounced.

The feature dimensions of Locus and ScanContext give these methods an unfair advantage in terms of representation power. The ScanContext descriptor is 20 $\times$ 60 (a total of 1200 floating point numbers) and the Locus descriptor is 4096 dimensions. Both LoGG3D-Net and PointNetVLAD have descriptor sizes of 256 which are compact and scale well to large databases essential for real-time robotic applications.

V-C Runtime Analysis

The computation time for pre-processing, description and querying is demonstrated in Table III. All experiments in this section are run on a system with an 8 core Intel i7-9700 processor with 32GB RAM and a single Nvidia RTX 2080Ti GPU. Since the time for retrieval increases with the size of the database, we present the average retrieval time for all methods on MulRan DCC1 which consists of 5541 point clouds. ScanContext uses ring-key retrieval to find the top-10 candidates for full descriptor distance calculation.

The results show that Locus has very high pre-processing time due to ground-plane removal and a high description time (due to around 30-80 sequential forward passes through the segment feature extraction network which is not parallelized). ScanContext has very high retrieval time even after limiting the number of ring-key candidates to just 10. The need for cosine similarity computation on each column shifted variant of the descriptor increases the descriptors query time. PointNetVLAD is the most efficient method for descriptor extraction due to its light network architecture. However, we note that PointNetVLAD uses a considerable amount of pre-processing to first remove the ground plane and then iteratively downsample the point cloud to exactly 4096 points which makes its total time longer than ours.

Our method has the lowest pre-processing and retrieval times allowing it to run at real-time ( $\sim{10Hz}$ ), enabling its integration into SLAM systems as a loop closure detection module.

	Pre-	Description	Querying	Total
	proc.
ScanContext [5]	63	582	3968	4613
PointNetVLAD [7]	534	7	1	542
Locus [14]	573	633	6	1212
LoGG3D-Net (Ours)	15	74	1	90

TABLE III: Runtime analysis: Average time taken on MulRan DCC1 (in ms).

VI Conclusion

This paper introduced the use of a local consistency loss in addition to a global contrastive loss for the training of end-to-end models for 3D place recognition. This additional constraint enforces corresponding points in different point clouds of the same place to have similar embeddings. The implementation, named LoGG3D-Net, is based on a U-Net architecture which uses sparse point-voxel convolution to enable efficient and fine-grained inference on high resolution point clouds. Second-order pooling along with differentiable Eigen-value power normalization ensures that point clouds are encoded into a single vector representation which better captures the distribution of local features. Evaluation of LoGG3D-Net on 11 sequences of two large-scale public benchmarks (KITTI and MulRan) resulted in mean $F1_{max}$ scores of $0.939$ and $0.968$ on KITTI and MulRan respectively, achieving state-of-the-art performance. Ablation studies demonstrated that the local consistency loss provides a consistent and significant improvement. Run-time analysis demonstrated real-time inference enabling integration into SLAM systems as a loop closure detection module.

References

[1] C. Park, P. Moghadam, S. Kim, A. Elfes, C. Fookes, and S. Sridharan, “Elastic LiDAR Fusion: Dense Map-Centric Continuous-Time SLAM,” in Proceedings - IEEE International Conference on Robotics and Automation, sep 2018, pp. 1206–1213.
[2] C. Park, P. Moghadam, J. L. Williams, S. Kim, S. Sridharan, and C. Fookes, “Elasticity Meets Continuous-Time: Map-Centric Dense 3D LiDAR SLAM,” IEEE Transactions on Robotics, 2021.
[3] S. Lowry, N. Sunderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford, “Visual Place Recognition: A Survey,” IEEE Transactions on Robotics, vol. 32, no. 1, pp. 1–19, feb 2016.
[4] L. He, X. Wang, and H. Zhang, “M2DP: A Novel 3D Point Cloud Descriptor and Its Application in Loop Closure Detection,” in IEEE International Conference on Intelligent Robots and Systems, nov 2016, pp. 231–237.
[5] G. Kim and A. Kim, “Scan Context: Egocentric Spatial Descriptor for Place Recognition Within 3D Point Cloud Map,” in IEEE International Conference on Intelligent Robots and Systems, dec 2018, pp. 4802–4809.
[6] S. Salti, F. Tombari, and L. Di Stefano, “SHOT: Unique signatures of histograms for surface and texture description,” Computer Vision and Image Understanding, vol. 125, pp. 251–264, 2014.
[7] M. A. Uy and G. H. Lee, “PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, dec 2018, pp. 4470–4479.
[8] J. Komorowski, “MinkLoc3D: Point Cloud Based Large-Scale Place Recognition,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), January 2021, pp. 1790–1799.
[9] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2016-Decem. IEEE Computer Society, dec 2016, pp. 5297–5307.
[10] F. Radenović, G. Tolias, and O. Chum, “Fine-tuning cnn image retrieval with no human annotation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 7, pp. 1655–1668, 2019.
[11] J. Guo, P. V. Borges, C. Park, and A. Gawel, “Local Descriptor for Robust Place Recognition Using LiDAR Intensity,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1470–1477, apr 2019.
[12] R. Dubé, A. Cramariuc, D. Dugas, H. Sommer, M. Dymczyk, J. Nieto, R. Siegwart, and C. Cadena, “Segmap: Segment-based mapping and localization using data-driven descriptors,” The International Journal of Robotics Research, vol. 39, no. 2-3, pp. 339–355, 2020.
[13] G. Tinchev, A. Penate-Sanchez, and M. Fallon, “Learning to See the Wood for the Trees: Deep Laser Localization in Urban and Natural Environments on a CPU,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1327–1334, 2019.
[14] K. Vidanapathirana, P. Moghadam, B. Harwood, M. Zhao, S. Sridharan, and C. Fookes, “Locus: LiDAR-based Place Recognition using Spatiotemporal Higher-Order Pooling,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021.
[15] N. Shlezinger, J. Whang, Y. C. Eldar, and A. G. Dimakis, “Model-based deep learning: Key approaches and design guidelines,” in 2021 IEEE Data Science and Learning Workshop (DSLW), 2021, pp. 1–6.
[16] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural networks, vol. 2, no. 5, pp. 359–366, 1989.
[17] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[18] Z. Liu, S. Zhou, C. Suo, P. Yin, W. Chen, H. Wang, H. Li, and Y. Liu, “LPD-Net: 3D Point Cloud Learning for Large-Scale Place Recognition and Environment Analysis,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 2831–2840.
[19] P. Koniusz, F. Yan, P. Gosselin, and K. Mikolajczyk, “Higher-Order Occurrence Pooling for Bags-of-Words: Visual Concept Detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 2, pp. 313–326, 2017.
[20] P. Koniusz and H. Zhang, “Power normalizations in fine-grained image, few-shot image and graph classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021.
[21] P. Li, J. Xie, Q. Wang, and W. Zuo, “Is Second-Order Information Helpful for Large-Scale Visual Recognition?” in Proceedings of the IEEE International Conference on Computer Vision, dec 2017, pp. 2089–2097.
[22] H. Tang, Z. Liu, S. Zhao, Y. Lin, J. Lin, H. Wang, and S. Han, “Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution,” in European Conference on Computer Vision (ECCV), 2020.
[23] P. J. Besl and N. D. McKay, “Method for registration of 3-d shapes,” in Sensor fusion IV: control paradigms and data structures, vol. 1611. International Society for Optics and Photonics, 1992, pp. 586–606.
[24] M. Muja and D. G. Lowe, “Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration,” VISAPP (1), vol. 2, no. 331-340, p. 2, 2009.
[25] C. Choy, J. Park, and V. Koltun, “Fully Convolutional Geometric Features,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[26] T. Papadopoulo and M. I. A. Lourakis, “Estimating the jacobian of the singular value decomposition: Theory and applications,” in ECCV, 2000.
[27] W. Chen, X. Chen, J. Zhang, and K. Huang, “Beyond triplet loss: a deep quadruplet network for person re-identification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 403–412.
[28] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, sep 2013.
[29] G. Kim, Y. S. Park, Y. Cho, J. Jeong, and A. Kim, “MulRan: Multimodal Range Dataset for Urban Place Recognition,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 6246–6253.

LoGG3D-Net: Locally Guided Global Descriptor Learning for 3D Place Recognition

Abstract

I Introduction

II Related Work

III Proposed Method

III-A Problem Formulation

III-B Our Approach

III-C Local Descriptor

III-C1 Point Correspondences

III-C2 Local Consistency Loss

III-D Global Descriptor

III-E Joint Local and Global Loss

IV Experimental Setup

IV-A Implementation and Training Setup

IV-B Datasets

IV-C Evaluation Criteria

V Results

V-A Ablation study on point-wise loss

V-B Comparison to State-of-the-Art

V-C Runtime Analysis

VI Conclusion

References

LoGG3D-Net: Locally Guided Global Descriptor Learning
for 3D Place Recognition