Loop closure detection using local 3D deep descriptors
Abstract
We present a simple yet effective method to address loop closure detection in simultaneous localisation and mapping using local 3D deep descriptors (L3Ds). L3Ds are emerging compact representations of patches extracted from point clouds that are learnt from data using a deep learning algorithm. We propose a novel overlap measure for loop detection by computing the metric error between points that correspond to mutually-nearest-neighbour descriptors after registering the loop candidate point cloud by its estimated relative pose. This novel approach enables us to accurately detect loops and estimate six degrees-of-freedom poses in the case of small overlaps. We compare our L3D-based loop closure approach with recent approaches on LiDAR data and achieve state-of-the-art loop closure detection accuracy. Additionally, we embed our loop closure approach in RESLAM, a recent edge-based SLAM system, and perform the evaluation on real-world RGBD-TUM and synthetic ICL datasets. Our approach enables RESLAM to achieve a better localisation accuracy compared to its original loop closure strategy. Our project page is available at github.com/yiming107/l3d_loop_closure.
Index Terms:
Loop closure detection; 3D local descriptors; Simultaneous localisation and mapping; Deep learning.

I Introduction
Loop closure aims to recognise already visited places in order to mitigate tracking drifts in simultaneous localisation and mapping (SLAM) [1]. Loop closure is a computational module that typically involves loop detection, pose estimation and verification, which can be executed on-board an autonomous robot and that runs in parallel with other processes, e.g. feature extraction and tracking [2, 3].
Loop closure is triggered when there is enough 2D or 3D overlap between the viewpoint of a current keyframe and that of a previous one [3]. Therefore, reliably measuring the overlap is key to successfully detecting loops. The overlap can be estimated in 2D by extracting visual representations from keyframe images, e.g., through bags of visual words [4, 2, 5, 3], or in 3D by matching geometric or semantic representations between point clouds of keyframes across time [6, 7, 8, 9, 10]. These 3D representations can be obtained by aggregating local geometric information [6] or by globally encoding the whole point cloud into a signature [7, 8, 11, 9]. Then, the overlap can be either measured through a suitable distance between global keyframe representations, e.g., using the Hamming distance [9], or inferred via a pre-trained Siamese network [8]. Siamese networks can also be used to directly regress the pose between viewpoints; however, a minimum overlap is required to obtain an accurate estimate: to estimate a 1DoF transformation between viewpoints, the minimum overlap should be about 30% [8], whereas the overlap should be greater than 70% for a 6DoF transformation [12]. Although loop closure methods that use global representations can be computationally efficient, the literature about geometric representations shows that they lack generalisation ability across domains [13, 14]. In the context of deep-learning-based algorithms, this limitation implies that when a new domain is visited, extra effort for data collection, annotation and training is needed, thus hindering the use of loop closure for robotic exploration in unseen environments. On the other hand, recent research shows that local 3D deep descriptors (L3Ds) have better generalisation abilities across domains than their global counterparts [13, 14]. This is what motivates us to leverage local geometric information for loop closure, in particular to address the specific problems of loop detection and pose estimation.
In this letter, we present a novel approach to detect loop closures where L3Ds are employed to estimate the poses and to quantify the overlaps between loop candidates. The main technical novelty of our approach is an overlap measure that is defined as the Ratio Of Nearest points in both the descriptor and metric spaces, namely RON. Specifically, we quantify the overlap between a point cloud pair through the metric error of the registered points that are mutually nearest neighbours in the descriptor space. Because L3Ds are inferred by a deep network from randomly sampled points within each point cloud, the relative 6DoF transformation between point clouds can be directly estimated with the RANSAC algorithm [15]. Then, RON is computed as the ratio of corresponding points that have a small metric error.
Our L3D-based loop detection approach outperforms the state-of-the-art methods OverlapNet [8] and LiDAR Iris [9], which are validated using the LiDAR point clouds of the KITTI odometry dataset [16]. We use the same evaluation strategy as OverlapNet [8]. Moreover, we show the efficacy of our approach for RGBD SLAM systems by embedding it into RESLAM [3], a recent RGBD edge-based SLAM approach. We evaluate the absolute trajectory error [17] on the real-world TUM-RGBD [17] and synthetic ICL [18] datasets. Our approach enables RESLAM to detect loops more frequently and to estimate 6DoF transformations more accurately than those estimated with RESLAM's original loop closure method. In summary, our contributions are:
• a novel use of L3Ds for the loop closure problem;
• an overlap measure based on the ratio of nearest points;
• a validation of cross-domain applicability with experiments on both LiDAR and RGBD SLAM systems.
II Related work
We review loop closure detection methods that are designed for point cloud data. Readers are referred to [1] for a thorough review of these. Methods for loop detection operating in 3D can be categorised into three groups [19]: feature-based [20, 7, 21, 11, 22, 9], segmentation-based [23, 24], and learning-based [25, 10, 8, 26].
Feature-based methods typically focus on building hand-crafted rotation-invariant 3D descriptors, for example based on feature histograms [27] or global shape features [28, 9]. The Normal Distributions Transform (NDT) builds global features from histograms of local shape descriptions, where local shapes can be linear, planar or spherical [20]. Each point cloud is divided into overlapping cells and each cell is classified as a specific shape based on the estimated surface normal. Rotation invariance is achieved by aligning point clouds with respect to dominant surface orientations. Unlike NDT, M2DP [7] first builds intermediate signatures by projecting point clouds onto multiple 2D planes and generating spatial density distributions for each plane. Then, a global representation is computed by aggregating the left and right singular vectors of these signatures. Unlike global descriptors that often encode only the geometric properties of the point cloud, the Intensity Scan Context method [11] encodes both the geometry and the intensity of LiDAR scans. The geometric consistency is then verified through a RANSAC-based registration using FPFH point features [21]. Granstrom and Schon [22] define 41 features of the full scan to create decision stumps, and learn a binary classifier with AdaBoost for loop detection. The LiDAR Iris method [9] takes inspiration from the human iris signature used for identification in order to detect loop closures efficiently. Each LiDAR scan is first converted into a binary signature by using a series of LoG-Gabor filtering and thresholding operations. Then, the Hamming distance between point cloud pairs is used for detecting loops. Similarly to [22], we perform a verification step through a RANSAC-based registration, but by leveraging deep-learning-based descriptors. Unlike global methods, we do not compute a global representation of a given point cloud/scan; we instead match local representations between scans to determine a score that indicates whether a loop closure has occurred.
Segmentation-based methods encode the point cloud into a set of discriminative features to reduce the likelihood of false matches. SegMatch [23] uses two hand-crafted features, i.e. eigenvalue-based and shape histograms, to describe the semantic elements of a scene and performs point cloud matching by using a Random Forest together with a RANSAC-based geometric verification step. Seed [24] employs handcrafted features that encode the topological information of segmented objects to reduce noise and resolution effects. Both SegMatch and Seed use a cluster-all approach to segment the point cloud, which requires the removal of the ground plane prior to segmentation. Our approach uses neither priors nor high-level semantic representations of a scene; it relies only on low-level geometric representations in the form of deep-learning-based descriptors.
Learning-based methods include the LocNet approach [25], which computes a handcrafted rotation-invariant representation of a point cloud in an image-like format that is then processed by a Siamese network to learn global features for place matching and loop closure detection. Zaganidis et al. [29] use semantic information processed by PointNet++ [30] together with NDT-based histogram descriptors for loop closure detection. GOSMatch [26] builds a global representation in the form of a histogram-based graph descriptor to encode semantic relationships between objects that are segmented by RangeNet++ [31]. Instead of performing frame-by-frame loop detection, SegMap [10] segments the scene incrementally as the robot navigates and inputs these segments to a deep network to generate a signature per segment. Loop detection is then performed based on these segments. The authors of [8] and [32] present OverlapNet, which employs a Siamese deep neural network to exploit different types of information from LiDAR scans, including depth, normals, intensity or remission values, to predict the overlap and relative yaw angle (1DoF) between pairs of 3D scans. Our approach differs from OverlapNet [8] in three aspects: i) in addition to providing a quantitative indication of the overlap between point clouds, our approach can estimate the 6DoF transformation between point clouds; ii) our approach is local, thus making it suitable for use in different domains (sensors and scenes); iii) we operate on point clouds instead of range images.
Most of the above-mentioned methods only estimate the relative yaw angle between LiDAR scans, mainly because public LiDAR datasets, such as KITTI [16] or nuScenes [33], are captured from road vehicles where scans are often co-planar. Moreover, learning-based methods are typically trained and tested on data belonging to the same domain, e.g. LiDAR data captured outdoors, and cannot generalise well to other domains without retraining or fine-tuning, for example on the sparser point clouds that are reconstructed with vision-based SLAM systems [34]. Differently, our approach exploits deep local 3D descriptors that are trained on point clouds captured with RGBD sensors, estimates the 6DoF transformation between a pair of point clouds, and measures their relative overlap, thus serving the loop closure detection task across a domain gap.
III L3D-based Loop Closure
A typical SLAM system involves three modules running in parallel: tracking, local mapping and global mapping [2, 3]. Tracking computes the relative camera motion between consecutive frames. The most informative frames are stored as keyframes. Local mapping processes every new keyframe to incrementally reconstruct the map of the environment and optimises the reconstruction using a time-shifting window. Global mapping performs loop closure and relocalisation using pose graph optimisation. Loop closure addresses the problems of finding keyframes that can potentially form trajectory loops, estimating the poses between candidate keyframe pairs, and verifying that the candidate loop closure is correctly estimated. It is important to detect and solve loops correctly, as incorrect loops would otherwise worsen the trajectory error.
Our proposed L3D-based approach identifies the candidate loop frames by measuring the overlap amongst local 3D descriptors that are extracted between keyframe pairs, and then confirms the occurrence of a loop through pose estimation and RON computation. In Sec. III-A we describe the 3D descriptor extraction, while in Sec. III-B we describe the algorithms for loop detection and confirmation.
III-A 3D descriptor extraction
Given a 3D point and a set of its neighbouring points, namely a patch, a local 3D descriptor is a compact numerical representation of these points [35, 36, 14, 13].
Let $P_t$ be the point cloud of the current keyframe at frame $t$, defined as an unordered set of 3D points, and let $\mathcal{P}_t = \{P_k\}$ be an ordered collection of point clouds stored up to frame $t$, with $k < t$. The size of this collection and the number of points of each point cloud may vary over time and across scenes. At each $t$, we randomly choose a set of points as centres for our 3D descriptors, obtaining the subsampled point cloud $\tilde{P}_t = \{p_{t,j}\}_{j=1}^{J} \subset P_t$, where $J = |\tilde{P}_t| \ll |P_t|$ and $|\cdot|$ is the cardinality of a set. Let $\mathcal{F}_t = \{f_{t,j}\}_{j=1}^{J}$ be the set of $D$-dimensional descriptors, where $f_{t,j} \in \mathbb{R}^D$.
Given a point $p_{t,j} \in \tilde{P}_t$, we can extract a patch composed of its neighbouring points within a spherical region of a certain radius, and use a deep neural network to encode this patch into a compact descriptor. Let $x_{t,j}$ be the patch extracted from $P_t$ around $p_{t,j}$, and let $\Phi_\theta$ be the deep neural network with parameters $\theta$ that encodes $x_{t,j}$ into the descriptor $f_{t,j}$, such that $f_{t,j} = \Phi_\theta(x_{t,j})$. We experimentally select the best-performing deep neural network for the descriptor encoding; more details can be found in Sec. IV-C.
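To make this step concrete, here is a minimal sketch of patch extraction and encoding, assuming the point cloud is stored as a numpy array and that `encoder` stands in for the L3D network (DIP in our experiments); the function and parameter names are ours and not the authors' implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def extract_patches(points, num_centres=512, radius=0.2, rng=None):
    """Randomly sample descriptor centres and gather their spherical patches.

    points: (N, 3) array with the 3D points of one keyframe.
    Returns the sampled centres (J, 3) and a list of (n_j, 3) patches.
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(points), size=min(num_centres, len(points)), replace=False)
    centres = points[idx]
    tree = cKDTree(points)
    patches = [points[tree.query_ball_point(c, r=radius)] for c in centres]
    return centres, patches

def compute_descriptors(patches, encoder):
    """Encode each patch into a D-dimensional descriptor.

    `encoder` is a placeholder for the trained L3D network, assumed to
    canonicalise the patch and return a fixed-length descriptor.
    """
    return np.stack([encoder(patch) for patch in patches])
```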
III-B Loop detection and verification
When a loop closure occurs, it is highly likely that a portion of the current keyframe’s point cloud overlaps with a portion of another point cloud in previous keyframes. Except for some occlusions, there should exist corresponding surfaces between overlapping point cloud regions, thus corresponding descriptors. In general, the higher this overlap is, the higher the likelihood of detecting a loop. We detect candidate loops through the estimation of the overlap between point cloud pairs in the descriptor space, and confirm the occurrence of a loop by computing (i) the transformation to register a candidate point cloud pair and (ii) the novel overlap measure RON that measures the ratio of nearest points in both the descriptor and metric space.
Specifically, we determine the overlap region between two point clouds $\tilde{P}_t$ and $\tilde{P}_k$, with $k < t$, by selecting points whose descriptors, $f_{t,i} \in \mathcal{F}_t$ and $f_{k,j} \in \mathcal{F}_k$, are mutually nearest neighbours (MNN). The computation of MNN descriptors is efficient and does not require the estimation of the transformation matrix between the two viewpoints. Let $\mathcal{C}_{t,k}$ be the set of corresponding points defined as

$$\mathcal{C}_{t,k} = \left\{ (p_{t,i}, p_{k,j}) \;:\; \phi(f_{t,i}, \mathcal{F}_k) = f_{k,j} \,\wedge\, \phi(f_{k,j}, \mathcal{F}_t) = f_{t,i} \right\}, \quad (1)$$

where $\phi(f, \mathcal{F})$ is the nearest-neighbour search based on the L2 norm, i.e. it returns the descriptor of $\mathcal{F}$ that is closest to $f$. We compute the MNN overlap between the two point clouds as

$$o_{t,k} = \frac{|\mathcal{C}_{t,k}|}{\min(|\tilde{P}_t|, |\tilde{P}_k|)}, \quad (2)$$

where $|\cdot|$ denotes the cardinality of a set. If the overlap $o_{t,k}$ is greater than a threshold $\tau_o$, then we deem these keyframes to form a candidate loop.
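The MNN matching of Eq. (1) and the overlap of Eq. (2) can be sketched as follows, assuming the descriptors of the two keyframes are stored as numpy arrays (variable names are ours):

```python
import numpy as np

def mutual_nearest_neighbours(feat_t, feat_k):
    """Return index pairs (i, j) such that feat_t[i] and feat_k[j] are
    mutually nearest neighbours under the L2 norm (Eq. 1)."""
    # Pairwise L2 distances between the two descriptor sets.
    d = np.linalg.norm(feat_t[:, None, :] - feat_k[None, :, :], axis=-1)
    nn_t2k = d.argmin(axis=1)        # nearest neighbour in feat_k for each feat_t
    nn_k2t = d.argmin(axis=0)        # nearest neighbour in feat_t for each feat_k
    i = np.arange(len(feat_t))
    mutual = nn_k2t[nn_t2k[i]] == i  # i -> j -> back to i
    return np.stack([i[mutual], nn_t2k[i[mutual]]], axis=1)

def mnn_overlap(feat_t, feat_k):
    """MNN overlap (Eq. 2): ratio of mutual matches over the smaller set."""
    matches = mutual_nearest_neighbours(feat_t, feat_k)
    return len(matches) / min(len(feat_t), len(feat_k))
```

A keyframe pair then becomes a loop candidate when `mnn_overlap` exceeds the threshold $\tau_o$.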
Points whose descriptors are mutually nearest neighbours are not guaranteed to also be close to each other in the metric space; for example, flat regions may be ambiguous. To verify the loop, we further register the candidate pair by estimating the relative 6DoF transformation and compute RON for the loop confirmation.
Different methods can be used to register two point clouds [15, 37]. Without loss of generality, we use RANSAC [15] as we found that it performs well in practice in terms of computational efficiency and robustness to noise. Let $T_{t,k}$ be the transformation estimated with RANSAC between $\tilde{P}_t$ and $\tilde{P}_k$ using $\mathcal{F}_t$ and $\mathcal{F}_k$, respectively.
Lastly, with the registered point clouds, we compute RON as the ratio of MNN points in the descriptor space that are also close in the metric space, i.e., whose distance after registration is below a certain error. RON is defined as

$$r_{t,k} = \frac{1}{|\mathcal{C}_{t,k}|} \left| \left\{ (p_{t,i}, p_{k,j}) \in \mathcal{C}_{t,k} \;:\; \| p_{t,i} - T_{t,k} \, p_{k,j} \|_2 < \epsilon \right\} \right|, \quad (3)$$

where $\|\cdot\|_2$ is the L2 norm, $T_{t,k} \, p_{k,j}$ denotes the application of $T_{t,k}$ to the 3D point $p_{k,j}$, and $\epsilon$ is the maximum error between two corresponding points that takes into account the reconstruction noise. If $r_{t,k}$ is greater than a threshold $\tau_r$, then we confirm the occurrence of a loop. The transformation $T_{t,k}$ can be fed to the module in charge of solving the pose graph problem to optimise the camera poses in any SLAM system. The pseudocode for our loop verification approach is shown in Algorithm 1.
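The sketch below illustrates the verification step under our notation; it is not the authors' Algorithm 1 but a simplified stand-in, using a basic RANSAC loop over the MNN correspondences and the standard Kabsch least-squares rigid fit.

```python
import numpy as np

def rigid_fit(src, dst):
    """Least-squares rigid transform (Kabsch) mapping src onto dst."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - mu_s).T @ (dst - mu_d))
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # handle reflections
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

def ransac_pose(p_t, p_k, matches, iters=5000, inlier_thr=0.05, rng=None):
    """Estimate the rigid transform mapping points of P_k onto P_t from the
    MNN correspondences `matches` (column 0 indexes p_t, column 1 indexes p_k)."""
    rng = np.random.default_rng() if rng is None else rng
    dst, src = p_t[matches[:, 0]], p_k[matches[:, 1]]
    best = (np.eye(3), np.zeros(3), -1)
    if len(matches) < 3:
        return best[0], best[1]
    for _ in range(iters):
        sel = rng.choice(len(matches), size=3, replace=False)
        R, t = rigid_fit(src[sel], dst[sel])
        inliers = np.sum(np.linalg.norm(dst - (src @ R.T + t), axis=1) < inlier_thr)
        if inliers > best[2]:
            best = (R, t, inliers)
    return best[0], best[1]

def ron(p_t, p_k, matches, R, t, eps=0.03):
    """RON (Eq. 3): fraction of MNN correspondences whose metric error,
    after applying the estimated transform, is below eps."""
    err = np.linalg.norm(p_t[matches[:, 0]] - (p_k[matches[:, 1]] @ R.T + t), axis=1)
    return float(np.mean(err < eps))
```

The loop is confirmed when the returned RON value exceeds $\tau_r$, and the estimated transformation can then be passed to the pose graph optimisation.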
Fig. 1 contains two examples of loops, and illustrates the estimated overlap regions of point clouds reconstructed using RGBD SLAM [3] and captured with LiDAR. Although the L3Ds that we use are trained on a different domain than that of the examples, we can observe how our approach successfully determines mutually-nearest-neighbour points in the overlap region (purple) and how the estimated transformation can be used to register each point cloud pair. In (d) we can observe that there are no mutually-nearest neighbours in the centre of the point clouds: this is due to flat regions that carry little geometric information, so the corresponding descriptors lack distinctiveness and cannot be matched reliably.
Fig. 1: Examples of estimated overlap regions between loop point cloud pairs: (a, b) point clouds reconstructed with RGBD SLAM [3]; (c, d) point clouds captured with LiDAR (KITTI).
IV Experiments
We evaluate our approach with two experiments using LiDAR SLAM and visual RGBD SLAM to show its general applicability. Firstly, we compare our approach against recent LiDAR-based loop closure approaches, namely OverlapNet [8] and LiDAR Iris [9], using the LiDAR dataset KITTI odometry [16]. Secondly, we embed our loop closure method in a recent visual SLAM system, and test it on a real RGBD dataset, TUM-RGBD [17], and on a synthetic RGBD dataset, ICL [18]. We choose the edge-based visual SLAM method, namely RESLAM [3], as the point clouds produced by the mapping are different from those of KITTI.
Local 3D deep descriptors. We use DIP descriptors [13] as they achieve the best performance in terms of generalisation for point cloud registration applications (see Sec. IV-C for more details), hence there is no need to retrain them when they are employed in new domains. DIP descriptors are computed from an input point cloud in two main steps. First, patches are extracted from the point cloud and for each patch a local reference frame is estimated to canonicalise the patch and thus make the descriptor rotation-invariant. Second, a deep neural network encodes the canonicalised patch into a 32-dimensional descriptor. The deep network employs a PointNet backbone [38] that uses a Transformation Network (TNet) applied to the input points to improve the canonicalisation of the first step in the case of inaccurately estimated local reference frames. The PointNet backbone is composed of a series of multilayer perceptrons that augment the input channels as 3→256→512→1024, followed by max pooling to achieve permutation invariance with respect to the input points, which is then followed by another series of multilayer perceptrons that produce the output as 1024→512→256→32. TNet uses an equivalent architecture, but produces a 3×3 transformation as output. These deep networks are trained on the 3DMatch dataset [39] through a Siamese approach using contrastive learning. Please refer to [13] for more details. We use the same descriptor parameters as in [13], except for the radius of the patch. We use 2.5m as the patch radius for KITTI, and 0.2m for TUM-RGBD and ICL.
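For reference, a minimal PyTorch sketch of a PointNet-style encoder with the channel progression described above (3→256→512→1024, max pooling, then 1024→512→256→32); this is only an approximation of the DIP backbone [13] and omits the TNet and the training losses.

```python
import torch
import torch.nn as nn

class PointNetEncoder(nn.Module):
    """PointNet-style patch encoder: per-point MLPs, max pooling, then an
    MLP head that outputs a 32-dimensional, L2-normalised descriptor."""
    def __init__(self, out_dim=32):
        super().__init__()
        self.point_mlp = nn.Sequential(            # 3 -> 256 -> 512 -> 1024
            nn.Conv1d(3, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Conv1d(256, 512, 1), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Conv1d(512, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        self.head = nn.Sequential(                 # 1024 -> 512 -> 256 -> 32
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, patch):
        # patch: (B, N, 3) canonicalised patch points.
        x = self.point_mlp(patch.transpose(1, 2))  # (B, 1024, N)
        x = torch.max(x, dim=2).values             # permutation-invariant pooling
        f = self.head(x)                           # (B, 32)
        return nn.functional.normalize(f, dim=1)   # unit-norm descriptor
```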
IV-A Loop closure detection for LiDAR SLAM
OverlapNet [8] estimates the overlap with a deep neural network whose inputs are extracted from the LiDAR scans and contain both geometric and semantic information. OverlapNet defines the overlap between a scan pair as the ratio of the projected range pixels within 1m over all valid range pixels, where the candidate scan pairs are projected into range images in a common coordinate system. For a fair evaluation, we reproduce the results using the provided model with inputs reflecting only geometric information, i.e., depths, normals and intensities.
LiDAR Iris [9] uses a global descriptor for LiDAR scans. Each scan is converted into a binary signature image using a series of LoG-Gabor filtering and thresholding operations. LiDAR Iris defines the overlap as the Hamming distance of a pair of the binary signature images. We reproduce the results using the provided code to generate the signature images and compute the Hamming distance.
Within this comparative evaluation, we also assess three variants of our L3D-based loop detection approach, namely L3D-based Overlap that quantifies the overlap following the ground-truth overlap computation as in OverlapNet [8], L3D-based MNN that quantifies the overlap only as the ratio of MNN (Eq. 2), and L3D-based RON that quantifies the overlap based on the proposed RON (Eq. 3).
Evaluation protocol. We follow the evaluation protocol proposed in OverlapNet [8], where a set of criteria is used for the detection of loops in a sequence. We use Sequence 00 of KITTI odometry for the test. As the vehicle moves, each current scan serves as a query. We adopt two evaluation settings. Setting 1: for each query, the latest 100 scans are excluded from the candidate loops in order to avoid detecting a loop closure in the most recent scans. As in most SLAM systems, the candidate loop scans are constrained within the 3σ area around the current pose estimate. We use the same series of pose uncertainties as in [8] for this estimation. Setting 2: we perform an experiment where we lift the search constraint of the 3σ area and the exclusion of the latest 100 scans. For each frame, we perform the search against all the previously seen frames to find the loop. In order to reduce the computational load in this setting, we first downsample the sequence by a factor of 20, forming a total of 25878 pairs, 51 of which are positive loop closures. Only the candidate loop scan with the highest estimated overlap is considered for the evaluation in these settings. A true positive occurs if the ground-truth overlap is larger than 30%.
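As an illustration of this protocol (our own sketch, with assumed variable names), each query keeps only its best-scoring candidate, a loop is reported if that score exceeds the detection threshold, and the detection counts as a true positive if the ground-truth overlap of that candidate is above 30%:

```python
import numpy as np

def count_detections(scores, gt_overlaps, threshold, min_gt_overlap=0.3):
    """scores[q] and gt_overlaps[q] are arrays over the candidate set of
    query q: the estimated overlap (e.g. RON) and the ground-truth overlap.
    Returns true-positive and false-positive counts for one threshold."""
    tp = fp = 0
    for s, gt in zip(scores, gt_overlaps):
        if len(s) == 0:
            continue                      # no candidates for this query
        best = int(np.argmax(s))          # keep only the top-scoring candidate
        if s[best] < threshold:
            continue                      # no loop reported for this query
        if gt[best] > min_gt_overlap:
            tp += 1
        else:
            fp += 1
    return tp, fp
```

Sweeping the threshold over its range produces the Precision-Recall curves reported in Figs. 2 and 3.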
Fig. 2: Precision-Recall curves of the evaluated methods on KITTI Sequence 00 under Setting 1.
Discussion. Fig. 2 illustrates the Precision-Recall (PR) curves under Setting 1. Note that no training on KITTI was performed for our L3D-based methods, while OverlapNet is trained on the KITTI odometry Sequences 03-10, and LiDAR Iris is specifically designed for matching LiDAR scans. Generalisation to unseen scenarios (sensors, environments) is a desired property for the deployment of robots in real-world applications. L3D-based Overlap produces a PR curve similar to that of OverlapNet, indicating that the relative transformations computed by RANSAC using the local 3D descriptors can achieve performance comparable to global loop closure detection methods. Note that here we estimate the 6DoF transformation, while OverlapNet estimates a 1DoF transformation. As shown by the L3D-based MNN curve, when we use the ratio of MNN in the descriptor space as the overlap measure, the PR curve further improves. L3D-based RON achieves the best PR curve, which shows that the proposed overlap measure is more effective and reliable: a high RON value requires the points to be not only MNN in the descriptor space but also spatially close to each other after applying the estimated transformation.
Fig. 3 reports the Precision-Recall (PR) curves under Setting 2, where each frame is queried against all the previously seen frames to find the loop. Without the positional prior as a search constraint, all methods achieve lower precision values along the PR curves due to the large number of false positives as the threshold increases. Precision is generally low for all the methods because querying a point cloud against all the others at previous time steps leads to a higher likelihood of producing false positive loop detections due to similar neighbourhoods along the trajectory. In terms of the Area Under the Curve (AUC), OverlapNet is better than LiDAR Iris, while our proposed L3D-based RON remains the best-performing method under this more challenging setup.
Fig. 3: Precision-Recall curves of the evaluated methods on KITTI Sequence 00 under Setting 2.
IV-B Loop Closure for Visual SLAM
For the evaluation of visual RGBD SLAM, we use nine sequences from TUM-RGBD and seven sequences from ICL. TUM-RGBD contains real scenes recorded with a Kinect at 30Hz and its ground truth is captured using a motion capture system working at 100Hz. ICL contains photo-realistic scenes rendered with a simulated Kinect at 20-30Hz, featuring real-world trajectories captured using a motion capture system working at 100Hz.
We use the recent RESLAM [3] as the overall SLAM system in which we embed our proposed approach. The original RESLAM performs the selection of candidate keyframe pairs based on the similarity between the current and the previous keyframes using 2D visual cues, i.e. randomised ferns [5]. If none of the previous keyframes yields a sufficient similarity with the current one, the loop closure is terminated, and a global and compact representation of the new keyframe is added to a database. Otherwise, a geometric assessment is further carried out using pose graph optimisation.
We use the original parameters of RESLAM and set the loop closure parameters $\tau_o$, $\tau_r$, $\epsilon$ (in cm) and the number of sampled points (see notations in Sec. III-B). We set the maximum number of RANSAC iterations to 500K and the termination policy to 1K iterations for an average inlier error of 3cm. We report the results of RESLAM from their paper [3] and also include our reproduced results, as the authors of [3] stated that different hardware may provide different results.
We use RESLAM [repro.1] and RESLAM [repro.2] to refer to the results reproduced by us using two configurations of RESLAM: the former uses the original parameters, while in the latter we set the parameters to detect loop closures more frequently. The latter was evaluated to show the behaviour of RESLAM with a number of detected loops comparable to that of our approach. We also evaluate an ablated version of our approach that only uses the L3D-based module (DIP [13] in our experiments) to estimate the transformation matrix between the keyframe and the loop frame. This ablation is designed to show the benefit of using L3Ds for pose estimation after loops are detected with RESLAM's visual cues. We name this version RESLAM [w. L3D].
Evaluation metrics. We evaluate our approach by comparing the absolute errors between the estimated and the ground-truth trajectories, noted as ATE, by computing the root mean squared error (RMSE) of the translational component [17]. Given $S$, the rigid-body transformation corresponding to the least-squares solution that maps the estimated camera poses onto the ground-truth poses, the ATE at frame $i$ can be computed as $E_i = Q_i^{-1} S P_i$, where $Q_i$ and $P_i$ are the ground-truth and the estimated camera poses at frame $i$, respectively. The RMSE is computed over all the frames of each sequence as

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\| \mathrm{trans}(E_i) \right\|^2}, \quad (4)$$

where $N$ is the number of frames of a given sequence and $\mathrm{trans}(\cdot)$ extracts the translational component.
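A short sketch of Eq. (4), assuming the two trajectories are given as frame-aligned lists of 4×4 homogeneous poses and that the alignment $S$ has already been computed with the standard least-squares (Horn/Umeyama) method:

```python
import numpy as np

def ate_rmse(gt_poses, est_poses, S):
    """ATE RMSE (Eq. 4): RMSE of the translational part of E_i = Q_i^{-1} S P_i.

    gt_poses, est_poses: lists of 4x4 poses Q_i and P_i, one per frame.
    S: 4x4 rigid alignment mapping the estimated trajectory onto the ground truth.
    """
    errors = [np.linalg.norm((np.linalg.inv(Q) @ S @ P)[:3, 3])
              for Q, P in zip(gt_poses, est_poses)]
    return float(np.sqrt(np.mean(np.square(errors))))
```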
TABLE I: ATE RMSE for the TUM-RGBD sequences; the number of detected loops is reported in parentheses.

| seq. | RESLAM [3] | RESLAM [repro.1] | RESLAM [repro.2] | RESLAM [w. L3D] | OURS |
|---|---|---|---|---|---|
| fr1/xyz | 1.1 | 2.3 (27) | 2.3 (27) | 2.3 (27) | 1.8 (27) |
| fr1/room | - | 8.6 (3) | 9.6 (11) | 8.5 (3) | 6.4 (14) |
| fr1/plant | - | 8.6 (1) | 6.7 (6) | 7.3 (3) | 7.0 (5) |
| fr1/desk | - | 2.7 (5) | 2.7 (7) | 2.7 (5) | 2.7 (7) |
| fr1/rpy | - | 2.9 (19) | 2.6 (25) | 2.9 (19) | 2.6 (25) |
| fr1/desk2 | 4.8 | 6.3 (7) | 5.8 (15) | 3.8 (8) | 3.6 (17) |
| fr2/desk | 1.9 | 2.2 (24) | 128.8 (25) | 2.2 (24) | 2.2 (38) |
| fr2/xyz | 0.5 | 0.4 (14) | 0.4 (15) | 0.5 (16) | 0.5 (16) |
| fr3/office | 3.5 | 3.8 (8) | 24.0 (15) | 3.8 (10) | 3.4 (15) |
| average | - | 4.2 | 20.3 | 3.8 | 3.4 |
TABLE II: ATE RMSE for the ICL sequences; the number of detected loops is reported in parentheses.

| seq. | RESLAM [repro.1] | RESLAM [repro.2] | RESLAM [w. L3D] | OURS |
|---|---|---|---|---|
| deer/walk | 10.1 (15) | 30.7 (33) | 8.6 (26) | 6.0 (48) |
| deer/Mfast | 1.0 (89) | 1.0 (92) | 1.0 (82) | 1.0 (106) |
| deer/Mslow | 1.7 (110) | 3.6 (126) | 1.7 (112) | 1.6 (139) |
| deer/run | 5.4 (4) | 19.1 (9) | 3.5 (7) | 2.6 (16) |
| diamond/walk | 7.2 (9) | 19.3 (22) | 7.4 (11) | 3.8 (46) |
| diamond/Mfast | 1.0 (87) | 0.9 (92) | 1.0 (89) | 0.9 (105) |
| diamond/run | 14.4 (2) | 10.8 (7) | 8.6 (6) | 6.0 (10) |
| average | 5.8 | 12.1 | 4.5 | 3.1 |
Fig. 4: Estimated trajectories compared to the ground truth for a) fr1/room, b) fr1/desk2 and c) deer/run, with zoomed-in views of the regions marked by purple bounding boxes.
Fig. 5: Registration of keyframe pairs from detected loops for a) fr1/desk, b) fr1/room, c) fr1/room and d) fr3/office: original keyframes (left), RESLAM [3] (centre), ours (right).
Fig. 6: Localisation and mapping results obtained with our loop closure approach on a) fr1/room and b) fr1/desk2.

Fig. 7: Localisation and mapping results obtained with our loop closure approach on a) deer/run and b) diamond/Mfast.
Discussion. Tab. I and Tab. II report the results of our approach, where we can observe that for the majority of the sequences we improve the localisation accuracy of RESLAM. We ran our algorithm multiple times on a subset of the sequences of Tab. I and Tab. II to assess the numerical stability of the results and obtained stable results across runs. We can observe a reduced RMSE in almost all sequences (except for fr1/xyz) when we use L3D descriptors only to estimate the transformation between the keyframes forming loops (i.e. RESLAM [w. L3D]). On average, when comparing RESLAM [w. L3D] to RESLAM [repro.1], the RMSE decreases from 4.2 to 3.8 on TUM-RGBD and from 5.8 to 4.5 on ICL. By employing our loop detection approach, we can observe a further reduction of the RMSE: on average, when comparing OURS to RESLAM [repro.1], the RMSE decreases from 4.2 to 3.4 on TUM-RGBD and from 5.8 to 3.1 on ICL. This is mainly due to the detection of a larger number of loops, and thus a more frequent triggering of the pose graph optimisation, which reduces the trajectory error.
We show in Fig. 4 some qualitative results on the TUM-RGBD and ICL datasets. These sequences include a variety of trajectory patterns, from the handheld camera of fr1/room to the camera mounted on a micro aerial vehicle in deer/run. We can visually appreciate how the trajectory estimated using our proposed L3D-based loop closure approach is much closer to the ground-truth one. Some parts of these trajectories are marked with a purple bounding box and zoomed in to highlight the improvements.
Fig. 5 shows some registration results on keyframes corresponding to detected loops, where the 6DoF transformation is estimated using RESLAM's loop closure approach and ours. We can observe that the registration obtained with RESLAM exhibits a systematic misalignment; we noticed this also on other sequences. At times, the transformation estimated with RESLAM is incorrect, see c) and d), while our loop closure approach can often register the keyframes correctly. Lastly, Fig. 6 and Fig. 7 show qualitative localisation and mapping results using our loop closure approach. From these examples we can observe the full structure of the environment in the form of a coloured point cloud.
IV-C Analysis of 3D descriptors
We provide a quantitative analysis of traditional and recent 3D deep descriptors using the TUM-RGBD dataset. This experiment motivates the choice of the DIP descriptor by comparing it with other descriptors in the literature. We adopt the registration recall as the performance measure [36, 13], i.e. the ratio of point cloud pairs for which the average distance error between corresponding points, after registration with the estimated transformation, is below 0.2m.
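For clarity, a small sketch of how we interpret this measure, assuming ground-truth point correspondences are available for each point cloud pair (names are ours):

```python
import numpy as np

def registration_recall(pairs, max_avg_error=0.2):
    """Ratio of point cloud pairs whose average correspondence error, after
    applying the estimated transformation, is below `max_avg_error` (0.2m).

    Each element of `pairs` is (src, dst, R, t): (M, 3) arrays of ground-truth
    corresponding points and the estimated rigid transform mapping src onto dst.
    """
    hits = sum(np.linalg.norm(dst - (src @ R.T + t), axis=1).mean() < max_avg_error
               for src, dst, R, t in pairs)
    return hits / len(pairs)
```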
We use the edge-based point clouds produced by RESLAM from four sequences of TUM-RGBD. For each sequence we randomly select 1000 point cloud pairs with at least 30% overlap. We compare four descriptors, one handcrafted approach, namely FPFH [27], and three deep learning based approaches: one global approach, namely FCGF [35], and two local approaches, namely SpinNet [14] and DIP [13]. For FPFH we used its Open3D implementation, whereas for the others, we used the code and models provided by the authors. For all the deep learning based descriptors, we use their models trained on the 3DMatch dataset [39].
Tab. III shows the results of this experiment. FPFH and FCGF are the fastest descriptors to compute, but this comes at a large cost in terms of registration recall. DIP achieves the best registration recall and offers a better compromise between computational efficiency and registration performance than SpinNet.
TABLE III: Registration recall on four TUM-RGBD sequences and descriptor computation time in milliseconds.

| descriptor | fr1/desk | fr1/desk2 | fr1/plant | fr1/room | mean | time [ms] |
|---|---|---|---|---|---|---|
| FPFH [27] | 0.35 | 0.39 | 0.34 | 0.41 | 0.37 | 0.02 |
| FCGF [35] | 0.58 | 0.46 | 0.41 | 0.52 | 0.50 | 0.03 |
| SpinNet [14] | 0.71 | 0.75 | 0.85 | 0.75 | 0.76 | 10.18 |
| DIP [13] | 0.73 | 0.76 | 0.85 | 0.77 | 0.78 | 3.79 |
V Conclusions
We presented a novel approach to address loop closures using deep-learning-based 3D local descriptors [13]. Our approach detects loop candidates using the ratio of mutually-nearest-neighbour descriptors and confirms their quality by computing the novel measure RON, i.e. the ratio of metrically near points amongst the points that are mutually nearest neighbours in the descriptor space. We evaluated the approach for both LiDAR-based and visual SLAM systems. Results showed that our approach outperforms state-of-the-art LiDAR-based loop closure detection methods and further improves the localisation accuracy of the visual SLAM system. Future research directions include fusing 2D visual cues extracted from images with 3D local descriptors to handle scenes lacking geometric structure, and optimising the descriptor computation.
References
- [1] S. Arshad and G.-W. Kim, “Role of deep learning in loop closure detection for visual and LiDAR SLAM: A survey,” Sensors, vol. 21, no. 4, pp. 12–43, Feb 2021.
- [2] R. Mur-Artal and J. Tardos, “ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,” IEEE Trans. on Robotics, vol. 33, no. 5, pp. 1255–1262, Oct 2017.
- [3] F. Schenk and F. Fraundorfer, “RESLAM: A real-time robust edge-based SLAM system,” in Proc. of IEEE ICRA, May 2019.
- [4] D. Galvez-Lopez and J. Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Trans. on Robotics, vol. 28, no. 5, pp. 1188–1197, Oct 2012.
- [5] B. Glocker, J. Shotton, A. Criminisi, and S. Izadi, “Real-time RGB-D camera relocalization via randomized ferns for keyframe encoding,” IEEE Trans. on Visualization and Computer Graphics, vol. 21, no. 5, p. 571–583, Sep 2015.
- [6] B. Steder, M. Ruhnke, S. Grzonka, and W. Burgard, “Place recognition in 3D scans using a combination of bag of words and point feature based relative pose estimation,” in Proc. of IEEE IROS, Sep 2011.
- [7] L. He, X. Wang, and H. Zhang, “M2DP: A novel 3D point cloud descriptor and its application in loop closure detection,” in Proc. of IEEE IROS, Oct 2016.
- [8] X. Chen, T. Labe, A. Milioto, T. Rohling, O. Vysotska, A. Haag, J. Behley, and C. Stachniss, “OverlapNet: Loop closing for LiDAR-based SLAM,” in Proc. of RSS, Virtual, Jul 2020.
- [9] Y. Wang, Z. Sun, C.-Z. Xu, S. Sarma, J. Yang, and H. Kong, “LiDAR Iris for Loop-Closure Detection,” in Proc. of IEEE IROS, Jan 2021.
- [10] R. Dube, A. Cramariuc, D. Dugas, J. Nieto, R. Siegwart, and C. Cadena, “SegMap: Segment-based mapping and localization using data-driven descriptors,” International Journal of Robotics Research, vol. 39, no. 2-3, pp. 339–355, Mar 2020.
- [11] H. Wang, C. Wang, and L. Xie, “Intensity scan context: Coding intensity and geometry relations for loop closure detection,” in Proc. of IEEE ICRA, Aug 2020.
- [12] X. Li, J. Pontes, and S. Lucey, “PointNetLK Revisited,” in Proc. of IEEE CVPR, Jun 2021.
- [13] F. Poiesi and D. Boscaini, “Distinctive 3D local deep descriptors,” in Proc. of IEEE ICPR, Jan 2021.
- [14] S. Ao, Q. Hu, B. Yang, A. Markham, and Y. Guo, “SpinNet: Learning a General Surface Descriptor for 3D Point Cloud Registration,” in Proc. of IEEE CVPR, Jun 2021.
- [15] M. Fischler and R. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 341–406, Jun 1981.
- [16] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite,” in Proc. of IEEE CVPR, Jun 2012.
- [17] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of RGBD SLAM systems,” in Proc. of IEEE IROS, Oct 2012.
- [18] S. Saeedi, E. Carvalho, W. Li, D. Tzoumanikas, S. Leutenegger, P. Kelly, and A. Davison, “Characterizing visual localization and mapping datasets,” in Proc. of IEEE ICRA, May 2019.
- [19] Z. Wang, Y. Shen, B. Cai, and M. Saleem, “A brief review on loop closure detection with 3D point cloud,” in Proc. of IEEE RCAR, Aug 2019.
- [20] M. Magnusson, H. Andreasson, A. Nuchter, and A. Lilienthal, “Automatic appearance-based loop detection from three-dimensional laser data using the normal distributions transform,” Journal of Field Robotics, vol. 26, no. 11-12, pp. 892–914, Nov 2009.
- [21] T.-L. Habich, M. Stuede, M. Labbe, and S. Spindeldreier, “Have I been here before? Learning to close the loop with LiDAR data in graph-based SLAM,” in Proc. of IEEE AIM, May 2021.
- [22] K. Granstrom and T. Schon, “Learning to close the loop from 3D point clouds,” in Proc. of IEEE IROS, Oct 2010.
- [23] R. Dube, D. Dugas, E. Stumm, J. Nieto, R. Siegwart, and C. Cadena, “SegMatch: Segment based place recognition in 3D point clouds,” in Proc. of IEEE ICRA, May 2017.
- [24] Y. Fan, Y. He, and U.-X. Tan, “Seed: A segmentation-based egocentric 3D point cloud descriptor for loop closure detection,” in Proc. of IEEE IROS, Oct 2020.
- [25] H. Yin, L. Tang, X. Ding, Y. Wang, and R. Xiong, “LocNet: Global Localization in 3D Point Clouds for Mobile Vehicles,” in Proc. of IEEE IV, Jun 2018.
- [26] Y. Zhu, Y. Ma, L. Chen, C. Liu, M. Ye, and L. Li, “GOSMatch: Graph-of-Semantics Matching for Detecting Loop Closures in 3D LiDAR data,” in Proc. of IEEE IROS, Oct 2020.
- [27] R. Rusu, N. Blodow, and M. Beetz, “Fast point feature histograms (FPFH) for 3D registration,” in Proc. of IEEE ICRA, May 2009.
- [28] T. Rohling, J. Mack, and D. Schulz, “A fast histogram-based similarity measure for detecting loop closures in 3-D LiDAR data,” in Proc. of IEEE IROS, Sep 2015.
- [29] A. Zaganidis, A. Zerntev, T. Duckett, and G. Cielniak, “Semantically assisted loop closure in SLAM using NDT histograms,” in Proc. of IEEE IROS, Nov 2019.
- [30] C. Qi, L. Yi, H. Su, and L. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in Proc. of NeurIPS, Dec 2017.
- [31] A. Milioto, I. Vizzo, J. Behley, and C. Stachniss, “RangeNet++: Fast and accurate LiDAR semantic segmentation,” in Proc. of IEEE IROS, Nov 2019.
- [32] X. Chen, T. Labe, A. Milioto, T. Rohling, J. Behley, and C. Stachniss, “OverlapNet: a siamese network for computing LiDAR scan similarity with applications to loop closing and localization,” Autonomous Robots, vol. 46, pp. 61–81, Jan 2022.
- [33] H. Caesar, V. Bankiti, A. Lang, S. Vora, V. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuScenes: A multimodal dataset for autonomous driving,” in Proc. of IEEE CVPR, Jun 2020.
- [34] S. Zhang, L. Zheng, and W. Tao, “Survey and evaluation of RGB-D SLAM,” IEEE Access, vol. 9, pp. 21367–21387, 2021.
- [35] C. Choy, J. Park, and V. Koltun, “Fully Convolutional Geometric Features,” in Proc. of IEEE ICCV, Oct 2019.
- [36] Z. Gojcic, C. Zhou, J. Wegner, and W. Andreas, “The perfect match: 3D point cloud matching with smoothed densities,” in Proc. of IEEE CVPR, Jun 2019.
- [37] Q.-Y. Zhou, J. Park, and V. Koltun, “Fast global registration,” in Proc. of ECCV, Oct 2016.
- [38] C. Qi, H. Su, K. Mo, and L. Guibas, “PointNet: deep learning on point sets for 3D classification and segmentation,” in Proc. of IEEE CVPR, Jun 2017.
- [39] A. Zeng, S. Song, M. Niessner, M. Fisher, J. Xiao, and T. Funkhouser, “3DMatch: Learning the matching of local 3D geometry in range scans,” in Proc. of IEEE CVPR, Jul 2017.