Deep Robust Multi-Robot Re-localisation in Natural Environments
Abstract
The success of re-localisation has crucial implications for the practical deployment of robots operating within a prior map or relative to one another in real-world scenarios. Relying on a single modality, place recognition and localisation can be compromised in challenging environments such as forests. To address this, we propose a strategy to prevent lidar-based re-localisation failure using lidar-image cross-modality. Our solution relies on self-supervised 2D-3D feature matching to predict alignment and misalignment. Leveraging a deep network for lidar feature extraction and relative pose estimation between point clouds, we train a model to evaluate the estimated transformation. A model predicting the presence of misalignment is learned by analysing image-lidar similarity in the embedding space and the geometric constraints available in the region seen by both modalities in Euclidean space. Experimental results on real datasets (in offline and online modes) demonstrate the effectiveness of the proposed pipeline for robust re-localisation in unstructured, natural environments.
I Introduction
(Re)-localisation in robotics refers to a process of determining a robot’s current pose (position and orientation) in a known environment that has been previously mapped. This task is crucial for robots to perform their operations seamlessly, even if they experience temporary difficulty in tracking their location. For example, the “wake-up” problem involves a robot that needs to determine its location after being turned off or losing power. Despite significant advances in learning-based approaches for re-localisation that rely on vision [1, 2, 3, 4] or lidar data [5, 6, 7, 8], designing a robust and reliable re-localisation technique remains a challenge, especially in unstructured, natural environments. Such environments lack distinctive features and change over time due to vegetation growth and weather conditions, affecting re-localisation’s robustness [9].
Due to the inherent limitations of both lidar and camera, it is difficult to extract appropriate features in complex natural scenes relying on a single modality for re-localisation. To address this, we propose integrating a self-supervised image-to-lidar feature matching process to predict re-localisation failure in a pipeline consisting of three modules: place recognition, pose estimation, and hypothesis verification, each leveraging learning methods. For the place recognition and pose estimation modules, we use EgoNN [10], an end-to-end deep re-localisation network. Using our lidar SLAM system, Wildcat [11], we generate lidar submaps and a pose graph with the robot nodes and the geometric information between them, and store them in a database. EgoNN is trained on Wildcat submaps offline. At inference time, the pre-trained network is used for re-localisation by comparing a query submap with the submaps in the database. Once the relative pose between the query and top-candidate submaps is estimated (Fig. 1), the proposed hypothesis verification module evaluates the correctness of the transformation via a cross-modality comparison between an image captured at the same time as the query submap and the top-candidate submap. Experimental results demonstrate the effectiveness of the proposed pipeline in achieving accurate re-localisation. The main contributions of this work can be summarised as follows:


- We propose integrating a self-supervised image-to-lidar feature matching process to predict re-localisation failure.
- We present a full pipeline of a deep re-localisation method (R3Loc) to address multi-robot re-localisation.
- We demonstrate our pipeline’s effectiveness on a large-scale natural dataset offline and in a forest-like environment on real robots online.
II Related Work
This section reviews existing Lidar Place Recognition (LPR) algorithms and research on re-localisation. Finally, works on image-lidar perception for cross-modal place recognition and registration are reviewed.
II-A Lidar-Based Localisation
A range of algorithms has been proposed for LPR. Conventional approaches [12, 13, 14, 15, 16] encode point clouds into either a global descriptor representing the entire point cloud or several local descriptors by segmenting the point cloud into patches. However, these handcrafted methods are often rotation dependent and are not effective in generating discriminative descriptors for unstructured environments.
Deep LPR has demonstrated outstanding results in the past few years. These methods process point clouds through a deep neural network to extract local features. The features are either used directly for place recognition, as in [17, 18], or aggregated into a global descriptor of the point cloud [7, 6, 8, 5, 22] using either a first-order pooling technique, e.g., GeM [19] or NetVLAD [20], or second-order pooling as employed in [21, 7]. Methods such as EgoNN [10] and LCDNet [23] estimate the relative pose between two point clouds upon place recognition. EgoNN computes keypoint coordinates, local descriptors, and saliency in a local head. It then estimates the 6DoF relative transformation between the query and top-candidate point clouds by matching keypoints and employing RANSAC to remove outliers. LCDNet trains local features end-to-end, utilising Optimal Transport (OT) theory for matching features and estimating the relative pose with Singular Value Decomposition (SVD), which makes the entire pipeline differentiable and therefore learnable. However, at test time, LCDNet employs RANSAC for relative pose estimation, which is prone to divergence in natural environments. Focusing on re-ranking top-k retrieval candidates, SpectralGV [24] introduces a computationally efficient spectral re-ranking method to improve localisation.
II-B Cross-Modal Localisation
There are PR-related works that aim to enhance place recognition by leveraging lidar scans and images captured in the same place. Works such as [25, 26, 27] integrate lidar and visual measurements at an early stage of multi-modal fusion, encoding them into a global descriptor using a projection technique, albeit at the cost of dimensionality loss. In contrast, works such as [28, 29, 30] encode lidar and visual data separately (late fusion) into image and point cloud embeddings and later aggregate them to create a bimodal global descriptor. To deal with lighting conditions (which affect the quality of image features), AdaFusion [31] employs an attention mechanism that avoids weighting the two modalities equally when image quality is too poor for recognition, and vice versa.
In computer vision, works such as I2P [32] and 2D3D-MatchNet [33] have been proposed with a focus on image-to-lidar registration. I2P estimates the pose between an image and a point cloud in two steps: classification and inverse camera projection. It uses an attention mechanism to classify lidar points as inside or outside the camera frustum and then optimises the pose in the lidar frame using inverse camera projection and the classification prediction. 2D3D-MatchNet learns 2D image and 3D point cloud descriptors with a triplet loss (anchor image, positive and negative point clouds), in which similar image-lidar descriptors are pulled closer while negative pairs are pushed apart. Recently, SLidR [34] proposed to find similarities between point cloud and image pairs based on locally similar regions of 2D images and their corresponding 3D patches obtained via knowledge distillation.
III (R3Loc): Deep Robust Multi-Robot Re-Localisation
Our aim is to improve the robustness and reliability of (re)-localisation of a robot within a revisit session based on a prior (reference) map generated at the initial session by a fleet of robots in unstructured, natural environments.
Our prior map, created by Wildcat SLAM [11], is a pose graph consisting of robots’ poses (nodes) and the edges between them. In short, Wildcat integrates lidar and inertial measurements within a sliding-window localisation and mapping module. This module uses a continuous-time trajectory representation to reduce map distortion caused by motion. Undistorted sub-maps are further used in pose-graph optimisation to remove drift upon loop closure. Generated sub-maps are also stored in the prior map. Further details can be found in the Wildcat paper [11] and the references therein.

After generation of a new sub-map, i.e., the query point cloud $\mathcal{P}_q$, in the revisit session, a deep lidar PR network, described in Sec. III-A, is used to compare it with all the submaps of the prior map to find the top candidate, $\mathcal{P}_c$, using a similarity metric. The initial relative pose between sub-maps $\mathcal{P}_q$ and $\mathcal{P}_c$ is then estimated from corresponding keypoints (see Sec. III-A) through RANSAC [35]. This initial guess is later refined with ICP, an iterative algorithm for 3D shape registration [36]. However, it needs to be evaluated before being used to merge the new node into the pose graph. A false-positive edge can result in an inferior trajectory or optimisation failure in SLAM.
To sanity-check the refined relative pose, we propose a comparison between the query image $\mathcal{I}_q$ of size $W \times H$ (where $W$ and $H$ are the image width and height), i.e., the image obtained at the same time as point cloud $\mathcal{P}_q$, and the point cloud $\mathcal{P}_c$ using the estimated relative pose. To this end, we train a self-supervised network to detect 2D and 3D corresponding features and investigate the correctness of the PR output. Furthermore, we project 3D keypoints of $\mathcal{P}_c$ onto the image $\mathcal{I}_q$ using the relative pose to check whether the image-lidar correspondences fall in the same region of the image. If so, the relative pose is passed to the SLAM system, which merges the new edge into the pose graph (prior map). Otherwise, we reject the relative pose. Sec. III-B details the hypothesis verification. Fig. 2 overviews our R3Loc pipeline, its components, and their relationships.
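As a concrete illustration of this projection check, the snippet below is a minimal sketch (not the authors' implementation) that transforms candidate-submap points into the query camera frame and projects them with a pinhole model. The intrinsics `K`, the lidar-to-camera extrinsics `T_cam_lidar`, and the estimated relative pose `T_q_c` are placeholders for the actual calibration and the EgoNN/ICP output.

```python
import numpy as np

def project_points(points_c, T_q_c, T_cam_lidar, K, img_w, img_h):
    """Project candidate-submap points into the query image.

    points_c    : (N, 3) points of the top-candidate submap (its lidar frame).
    T_q_c       : (4, 4) estimated relative pose, candidate -> query lidar frame.
    T_cam_lidar : (4, 4) extrinsic calibration, lidar -> camera frame.
    K           : (3, 3) pinhole camera intrinsics.
    Returns the pixel coordinates (M, 2) of the points that land inside the image.
    """
    pts_h = np.hstack([points_c, np.ones((points_c.shape[0], 1))])   # homogeneous coords
    pts_cam = (T_cam_lidar @ T_q_c @ pts_h.T)[:3]                    # into the query camera frame
    in_front = pts_cam[2] > 0.1                                      # keep points ahead of the camera
    uvw = K @ pts_cam[:, in_front]
    uv = (uvw[:2] / uvw[2]).T                                        # perspective division
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < img_w) & (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    return uv[inside]
```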
III-A Deep Re-localisation Module
Our deep re-localisation module is based upon EgoNN [10]. Using a light 3D CNN, EgoNN trains a global descriptor $\mathcal{G}$ and several local embeddings $\{\boldsymbol{\ell}_k\}_{k=1}^{K}$, where $K$ is the number of keypoints detected by USIP [37], for each point cloud. The global descriptor is the aggregation of the feature map elements in the global head using GeM pooling [19] over the $N$ local features in the global head. Keypoint descriptors are generated in the local head by processing the elements of a local feature map. A two-layer Multi-Layer Perceptron (MLP) followed by a functional module is used to compute the local embeddings’ coordinates in each point cloud. Global descriptors are used for PR, while local descriptors are used for localisation.
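To illustrate the aggregation step, the following sketch implements generalised-mean (GeM) pooling [19] over a set of local features; the tensor shapes and the exponent `p` are illustrative assumptions rather than EgoNN's exact configuration.

```python
import torch
import torch.nn.functional as F

def gem_pool(features, p=3.0, eps=1e-6):
    """GeM pooling: (B, N, C) local features -> (B, C) global descriptor.

    p -> 1 recovers average pooling; large p approaches max pooling.
    """
    clamped = features.clamp(min=eps)                    # avoid negative bases for the power
    pooled = clamped.pow(p).mean(dim=1).pow(1.0 / p)
    return F.normalize(pooled, dim=1)                    # L2-normalised descriptor

# Example: 4096 local features of dimension 256 for one point cloud
desc = gem_pool(torch.rand(1, 4096, 256))                # -> shape (1, 256)
```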
III-B Deep Hypothesis Verification
To accept or reject the re-localisation module output, we leverage cross-modal perception to compare the image captured at the time of the query point cloud with the top-candidate point cloud returned by the re-localisation module. For this, the top candidate needs to be projected onto the query image using the relative pose estimated by the local branch. If the pose estimate is correct, the projected points must overlap with their corresponding image pixels. To evaluate this, corresponding 2D and 3D features must be extracted and matched.
However, handcrafted approaches such as [38] are not suitable for extracting features from sparse lidar point clouds, nor for detecting corresponding features in images to establish accurate point-to-pixel matches. Point-wise deep feature descriptors, e.g., [33, 32], despite outperforming conventional techniques, can be affected by occlusion or motion blur, which is inevitable in robotics. Hence, we leverage a deep image-to-lidar self-supervised distillation approach called Superpixel-driven Lidar Representations (SLidR) [34], which relates groups of pixels to groups of points.
SLidR trains a 3D point representation using visual features for semantic segmentation and object detection. Cross-modal representation learning is motivated by the scarcity of annotated 3D point data and the abundance of image labels. SLidR transfers feature knowledge from superpixels, i.e., regions of the image with visual similarity, to superpoints, i.e., groups of points segmented through superpixel back-projection. The image is segmented into, at most, 250 superpixels using SLIC [39]. Importantly, SLidR requires no data labels for pre-training the 3D network. Given a synchronised lidar and camera data stream and the calibration parameters, SLidR extracts features of superpixels and their corresponding superpoints. The 2D features, extracted from a ResNet-50 backbone pre-trained using [40], serve as a supervisory signal for training a 3D sparse residual U-Net backbone [41] with a contrastive loss that aligns the pooled 3D point and 2D pixel features.
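A simplified sketch of this superpixel/superpoint pooling and contrastive objective is given below; the scatter-style averaging, tensor shapes, and temperature value are illustrative assumptions and not SLidR's exact implementation.

```python
import torch
import torch.nn.functional as F

def pool_by_group(features, group_ids, num_groups):
    """Average-pool per-element features (N, C) into group features (G, C)."""
    sums = torch.zeros(num_groups, features.shape[1]).index_add_(0, group_ids, features)
    counts = torch.zeros(num_groups).index_add_(0, group_ids, torch.ones(len(group_ids)))
    return sums / counts.clamp(min=1).unsqueeze(1)

def superpixel_superpoint_loss(pixel_feats, pixel_sp_ids, point_feats, point_sp_ids,
                               num_superpixels, temperature=0.07):
    """Contrastive loss pulling each superpoint feature towards its paired superpixel feature."""
    f2d = F.normalize(pool_by_group(pixel_feats, pixel_sp_ids, num_superpixels), dim=1)
    f3d = F.normalize(pool_by_group(point_feats, point_sp_ids, num_superpixels), dim=1)
    logits = f3d @ f2d.t() / temperature       # (G, G) cosine-similarity matrix
    targets = torch.arange(num_superpixels)    # diagonal entries are the positive pairs
    return F.cross_entropy(logits, targets)
```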
Employing SLidR, our approach compares the extracted features of superpixels $\{s_i\}_{i=1}^{M}$, where $M$ is the number of superpixels in image $\mathcal{I}_q$, with those of superpoints $\{p_j\}_{j=1}^{N_s}$, where $N_s$ is the number of superpoints in point cloud $\mathcal{P}_c$, using cosine similarity:

$$S_{ij} = \frac{\langle \mathbf{f}^{2D}_i,\, \mathbf{f}^{3D}_j \rangle}{\|\mathbf{f}^{2D}_i\|_2\, \|\mathbf{f}^{3D}_j\|_2} \qquad (1)$$

where $\mathbf{f}^{2D}_i$ and $\mathbf{f}^{3D}_j$ denote the features of superpixel $s_i$ and superpoint $p_j$, respectively, after average pooling, $\langle \cdot,\cdot \rangle$ denotes the inner product, and $\|\cdot\|_2$ the L2 norm.
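For reference, Eq. (1) amounts to the following computation (a minimal numpy sketch, where `f2d` and `f3d` stand for the pooled superpixel and superpoint features):

```python
import numpy as np

def cosine_similarity_matrix(f2d, f3d):
    """Eq. (1): S[i, j] = <f2d_i, f3d_j> / (||f2d_i|| * ||f3d_j||).

    f2d: (M, C) pooled superpixel features, f3d: (N, C) pooled superpoint features.
    """
    f2d = f2d / np.linalg.norm(f2d, axis=1, keepdims=True)
    f3d = f3d / np.linalg.norm(f3d, axis=1, keepdims=True)
    return f2d @ f3d.T                       # (M, N) similarity matrix
```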
Now, we define two metrics, one in feature space and one in Euclidean space, to accept or reject re-localisation. First, we use the Mean Cosine Similarity (MCS) of corresponding superpixel and superpoint features, i.e., $\mathrm{MCS} = \frac{1}{D}\sum_{i=1}^{D} S_{ii}$, to decide whether the point clouds $\mathcal{P}_q$ and $\mathcal{P}_c$ represent the same place, where $D$ is the total number of superpixel-superpoint pairs on the main diagonal of the similarity matrix computed from Eq. (1). Low MCS values are an indicator of false-positive cases from our re-localisation module.

Second, to evaluate the accuracy of the relative pose estimated by EgoNN, we identify the top-5 candidate superpoints, denoted as $p_i@5$, for each superpixel $s_i$. We project the centroid of each of these top-5 superpoints onto the image $\mathcal{I}_q$. We find the superpoint whose projected centroid is closest to the centroid of $s_i$ and select it as the pair of $s_i$. We then check whether the projected centroid of this superpoint falls within $s_i$, counting it as a match if it does and a mismatch if it does not. We calculate the percentage of mismatched superpixel-superpoint pairs over the entire set of pairs to determine whether to reject or accept the re-localisation. We then define the alignment ratio as follows:

$$r_a = 1 - \frac{N_{\text{mis}}}{D} \qquad (2)$$

where $N_{\text{mis}}$ is the number of mismatched superpixel-superpoint pairs computed from the above procedure. Given these two metrics, similarity (MCS) and alignment ($r_a$), we train a simple multi-class Support Vector Classifier (SVC), $f_{\text{SVC}}$, to predict whether the pair $(\mathcal{I}_q, \mathcal{P}_c)$ belongs to the matched, mismatched or unmatched category.
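The verification step can be sketched as follows, assuming the similarity matrix from Eq. (1), the projected top-5 superpoint centroids, and per-superpixel masks are already available; the array layouts and the classifier handle `svc` are placeholders rather than the exact implementation.

```python
import numpy as np

def verification_features(S, centroid_uv, superpixel_masks):
    """Compute (MCS, alignment ratio) for one image / point-cloud pair.

    S                : (D, D) superpixel-superpoint similarity matrix (Eq. 1).
    centroid_uv      : (D, 5, 2) projected centroids of the top-5 superpoints per superpixel.
    superpixel_masks : list of D boolean (H, W) masks, one per superpixel.
    """
    mcs = float(np.mean(np.diag(S)))                        # mean cosine similarity
    mismatches = 0
    for i, mask in enumerate(superpixel_masks):
        cy, cx = np.argwhere(mask).mean(axis=0)             # superpixel centroid (row, col)
        # pick the top-5 superpoint whose projected centroid is closest to the superpixel centroid
        d = np.linalg.norm(centroid_uv[i] - np.array([cx, cy]), axis=1)
        u, v = centroid_uv[i][np.argmin(d)].round().astype(int)
        in_image = 0 <= v < mask.shape[0] and 0 <= u < mask.shape[1]
        if not (in_image and mask[v, u]):                   # projected centroid not inside its superpixel
            mismatches += 1
    r_a = 1.0 - mismatches / len(superpixel_masks)          # Eq. (2)
    return mcs, r_a

# label = svc.predict([[mcs, r_a]])   # 'matched', 'mismatched' or 'unmatched'
```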
IV Experimental Results
In this section, we present the following results: evaluation of the re-localisation module on a large-scale natural dataset, Wild-Places [9] (consisting of the Venman and Karawatha sequences), and its comparison with Scan Context [12] (a handcrafted PR approach widely integrated with lidar SLAM); and evaluation of cross-modal localisation on the same dataset. Finally, we evaluate the entire proposed R3Loc pipeline on a wake-up problem scenario on a real robot system.
Both EgoNN and SLidR were trained on the Wild-Places dataset. For EgoNN, we followed the training split described in [9]. For testing, however, we evaluated the model on two sequences of Venman collected in opposite directions. This inter-sequence PR evaluation mimics the wake-up problem in which the robot operating in the revisit session travels within a prior map generated in the opposite direction. The same sequences were used to evaluate Scan Context following its default settings. For evaluation, we define a true-positive revisit as a prediction within 3 m of a positive ground truth.
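As a sketch of this evaluation protocol (assuming global descriptors and submap positions are available as numpy arrays and retrieval uses cosine similarity), Recall@K with a 3 m revisit threshold can be computed as follows:

```python
import numpy as np

def recall_at_k(query_desc, db_desc, query_xyz, db_xyz, k=1, revisit_thresh=3.0):
    """Fraction of queries whose top-k retrieved submaps contain one within 3 m of the query."""
    # L2-normalise so the dot product equals cosine similarity
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    d = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)
    sims = q @ d.T                                   # (num_queries, num_db)
    topk = np.argsort(-sims, axis=1)[:, :k]          # indices of the k most similar submaps
    hits = 0
    for qi, cand in enumerate(topk):
        dists = np.linalg.norm(db_xyz[cand] - query_xyz[qi], axis=1)
        hits += np.any(dists <= revisit_thresh)      # true positive if any retrieval is a revisit
    return hits / len(query_desc)
```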
For SLidR we trained and validated the network using about 1750 matched lidar-image pairs (pairs captured at the same time) on one sequence from Venman. We created three test sets from the validation section by augmenting the relative transformation between the image and point cloud to create matched and mismatched pairs, and by randomly pairing images and point clouds captured in different places for unmatched pairs. This allows testing SLidR for the three most common EgoNN output cases. We also tested the proposed verification pipeline on a new dataset collected in an unstructured area at the Queensland Centre for Advanced Technology (QCAT), Brisbane, Australia.
IV-A Evaluation of EgoNN Offline
Fig. 3 illustrates the top-K Recall curves of EgoNN and Scan Context. As seen, the performance of EgoNN is almost twice that of Scan Context, indicating the limitation of Scan Context in producing distinctive and rotation-invariant descriptors in cluttered environments such as forests. To evaluate re-localisation accuracy, we compare the estimated relative transformation with the ground truth and count a success if the rotation and translation errors are within and m, respectively. This evaluation was not performed for Scan Context since this approach cannot estimate 6DoF rotation and translation based only on global descriptors.
The success rate for EgoNN, when the relative transformation was estimated using keypoints and RANSAC alone, is about . However, after refining the estimated transformation using ICP (we downsampled point clouds to 40 cm spatial resolution for online registration), the success rate increased to . This indicates that although EgoNN achieves high performance in place recognition, the extracted keypoints are not repeatable enough in unstructured environments for accurate re-localisation.
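The success criterion can be expressed as in the sketch below; the error formulas are the standard translation and geodesic rotation errors, and the thresholds are left as parameters since their specific values are set in the evaluation:

```python
import numpy as np

def relocalisation_success(T_est, T_gt, rot_thresh_deg, trans_thresh_m):
    """Check whether an estimated 4x4 relative pose is within the error thresholds."""
    dT = np.linalg.inv(T_gt) @ T_est                          # residual transformation
    trans_err = np.linalg.norm(dT[:3, 3])                     # translation error in metres
    cos_angle = np.clip((np.trace(dT[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    rot_err = np.degrees(np.arccos(cos_angle))                # geodesic rotation error in degrees
    return rot_err <= rot_thresh_deg and trans_err <= trans_thresh_m
```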



IV-B Evaluation of SLidR Offline
Fig. 4 illustrates examples of matched, unmatched and mismatched pairs in the top, middle and bottom rows, respectively. As seen, the similarity matrices (second column) and the projection vectors (third column) obtained from the procedure described in Sec. III-B are good measures to distinguish between matched, mismatched and unmatched pairs. Fig. 5 shows the boxplots of MCS and $r_a$ computed for about 250 matched pairs and 230 mismatched and unmatched pairs on the validation set (above 700 pairs in total). The substantial difference in MCS between unmatched and matched/mismatched pairs allows classifying unmatched pairs with high confidence. Additionally, the large $r_a$ of the matched pairs helps separate them from the other pairs. We observed, however, that using both MCS and $r_a$ together improves generalisation when the training and testing environments are different. Hence, we trained a multi-class fifth-degree polynomial SVC model, $f_{\text{SVC}}$, to predict whether a pair belongs to the matched, mismatched or unmatched category.
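A minimal sketch of such a classifier using scikit-learn is shown below; the toy data, feature scaling, and string labels are illustrative assumptions, not the actual training script.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative placeholder data: each row is [MCS, alignment ratio] for one image/point-cloud
# pair, labelled with one of the three verification categories.
X = np.array([[0.45, 0.90], [0.40, 0.15], [0.05, 0.10],
              [0.50, 0.85], [0.38, 0.20], [0.08, 0.05]])
y = np.array(["matched", "mismatched", "unmatched",
              "matched", "mismatched", "unmatched"])

# Fifth-degree polynomial SVC, with feature scaling for numerical stability
svc = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=5))
svc.fit(X, y)

print(svc.predict([[0.42, 0.88]]))   # query pair near the matched cluster
```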

IV-C Evaluation of the Entire Pipeline Online
To evaluate our pipeline, a tracked robot (as seen in Fig. 2), equipped with a lidar sensor and four cameras, was teleoperated in an unstructured area at QCAT, once for the initial session and once for the revisit session. The time difference between the two sessions was chosen to be reasonably large, allowing us to evaluate the performance of our verification under various lighting conditions. For the cross-modal perception, we only used frames from the front camera. Submaps were generated by Wildcat, and the Robot Operating System (ROS) was used for communication between the different components. Our re-localisation pipeline is triggered through a rosservice command. Upon requesting re-localisation, the query submap and the submaps in the prior map were fed into the already trained EgoNN model. By performing a forward pass with the pre-trained weights and using a kd-tree search, the top candidate was selected and the relative pose was estimated. Since PR is only performed on root nodes, there were at most 20 submaps in the prior map from the initial session. To thoroughly test the pipeline, the recorded data of the revisit session was played back, and re-localisation was requested for every root node spawned, resulting in testing the entire pipeline 20 times (i.e., 20 “wake-up” locations). Following this process, the average Recall@1 of EgoNN is . However, the success rate for re-localisation was , evidencing the necessity of the hypothesis verification.
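The kd-tree retrieval over global descriptors mentioned above can be sketched as follows (using scipy; the descriptor dimensionality and the use of Euclidean distance on L2-normalised descriptors, which is monotonically related to cosine similarity, are assumptions for illustration):

```python
import numpy as np
from scipy.spatial import cKDTree

# Global descriptors of the prior-map root-node submaps (placeholder shapes and values).
db_desc = np.random.rand(20, 256)
db_desc /= np.linalg.norm(db_desc, axis=1, keepdims=True)
tree = cKDTree(db_desc)                       # built once when the prior map is loaded

query_desc = np.random.rand(256)
query_desc /= np.linalg.norm(query_desc)
dist, idx = tree.query(query_desc, k=1)       # index of the top-candidate submap
```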
The pose estimate is not transferred to our lidar-inertial SLAM unless it passes through the hypothesis verification. For this, the top-candidate submap and the query image (which is already rectified) are fed into our pre-trained verification models. For the QCAT dataset, after 20 trials, the proposed hypothesis verification detected all the mismatched pairs for which EgoNN could not produce an accurate pose estimate. Fig. 6 shows a sample matched (top) and mismatched (bottom) scenario. The proposed verification pipeline, including the pre-trained feature matching and the SVC model , successfully separated these cases and detected the re-localisation failure.
Upon a verified re-localisation, the pose graph generated in the revisit session is safely merged into the existing map, as shown in Fig. 7. Fig. 8 presents qualitative results after merging the revisit-session robot into the map generated from the initial session, evidencing the feasibility of the proposed pipeline for multi-agent re-localisation.


IV-D Runtime Analysis
TABLE I: Runtime (in seconds) of the individual pipeline components.

| Processing time | EgoNN: Description | EgoNN: Localisation | SLidR: Superpixel | SLidR: Description | MCS | Verification | Total |
|---|---|---|---|---|---|---|---|
| Mean (s) | 0.087 | 0.408 | 0.186 | 0.209 | 0.013 | 0.019 | 0.905 |
| Std (s) | 0.002 | 0.207 | 0.085 | 0.022 | 0.037 | 0.002 | 0.215 |
To demonstrate that our system can run online, we evaluated the computation time of each component. The timing results were collected by running the pre-trained models on a single NVIDIA Quadro T2000 GPU and the rest of the pipeline on a unit with an Intel Xeon W-10885M CPU. Table I reports a breakdown of the individual modules’ runtime in our pipeline. Altogether, the total runtime (for the QCAT experiment at the scale shown in Fig. 1) is less than a second, allowing the system to run online.
V Conclusion
This work introduced a robust multi-robot re-localisation system. Our re-localisation pipeline benefits from deep lidar representations for place recognition and pose estimation. Self-supervised image-to-lidar knowledge distillation was used to reason about the alignment between the image captured at the same time as the query point cloud and the top-candidate point cloud. The system’s modules were separately tested on a large-scale public dataset, and the entire pipeline, integrated with our lidar SLAM system, was tested online in a wake-up scenario. In future work, we will investigate improving cross-modal perception through end-to-end representation learning, including image segmentation for superpixel creation in unstructured environments, and improved verification models.
Acknowledgments
The authors gratefully acknowledge funding of the project by the CSIRO’s Reimagine Farming initiative. They also thank CSIRO Robotics and Autonomous Systems members for the hardware and software support, particularly Mark Cox for his insights on computer vision. M.H. acknowledges ongoing support from the QUT SAIVT research program.
References
- [1] D. Gálvez-López and J. D. Tardos, “Bags of Binary Words for Fast Place Recognition in Image Sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.
- [2] Z. Li, C. D. W. Lee, B. X. L. Tung, Z. Huang, D. Rus, and M. H. Ang, “Hot-NetVLAD: Learning Discriminatory Key Points for Visual Place Recognition,” IEEE Robotics and Automation Letters, 2023.
- [3] A.-D. Doan, Y. Latif, T.-J. Chin, Y. Liu, T.-T. Do, and I. Reid, “Scalable Place Recognition Under Appearance Change for Autonomous Driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9319–9328.
- [4] H. Zhang, X. Chen, H. Jing, Y. Zheng, Y. Wu, and C. Jin, “ETR: An Efficient Transformer for Re-Ranking in Visual Place Recognition,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 5665–5674.
- [5] M. A. Uy and G. H. Lee, “PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [6] J. Komorowski, “MinkLoc3D: Point Cloud Based Large-Scale Place Recognition,” in 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 2021, pp. 1789–1798.
- [7] K. Vidanapathirana, M. Ramezani, P. Moghadam, S. Sridharan, and C. Fookes, “LoGG3D-Net: Locally Guided Global Descriptor Learning for 3D Place Recognition,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 2215–2221.
- [8] L. Hui, H. Yang, M. Cheng, J. Xie, and J. Yang, “Pyramid Point Cloud Transformer for Large-Scale Place Recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6098–6107.
- [9] J. Knights, K. Vidanapathirana, M. Ramezani, S. Sridharan, C. Fookes, and P. Moghadam, “Wild-Places: A large-scale dataset for lidar place recognition in unstructured natural environments,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 11 322–11 328.
- [10] J. Komorowski, M. Wysoczanska, and T. Trzcinski, “EgoNN: Egocentric Neural Network for Point Cloud Based 6DoF Relocalization at the City Scale,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 722–729, 2021.
- [11] M. Ramezani, K. Khosoussi, G. Catt, P. Moghadam, J. Williams, P. Borges, F. Pauling, and N. Kottege, “Wildcat: Online Continuous-Time 3D Lidar-Inertial SLAM,” arXiv preprint arXiv:2205.12595, 2022.
- [12] G. Kim and A. Kim, “Scan Context: Egocentric Spatial Descriptor for Place Recognition within 3D Point Cloud Map,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 4802–4809.
- [13] R. B. Rusu, N. Blodow, and M. Beetz, “Fast Point Feature Histograms (FPFH) for 3D Registration,” in 2009 IEEE international conference on robotics and automation. IEEE, 2009, pp. 3212–3217.
- [14] R. B. Rusu, N. Blodow, Z. C. Marton, and M. Beetz, “Aligning Point Cloud Views using Persistent Feature Histograms,” in 2008 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 2008, pp. 3384–3391.
- [15] S. Salti, F. Tombari, and L. Di Stefano, “SHOT: Unique signatures of histograms for surface and texture description,” Computer Vision and Image Understanding, vol. 125, pp. 251–264, 2014.
- [16] R. Dubé, D. Dugas, E. Stumm, J. Nieto, R. Siegwart, and C. Cadena, “SegMatch: Segment Based Place Recognition in 3D Point Clouds,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 5266–5272.
- [17] R. Dubé, A. Cramariuc, D. Dugas, J. Nieto, R. Siegwart, and C. Cadena, “SegMap: 3d segment mapping using data-driven descriptors,” Robotics: Science and Systems Online Proceedings, vol. 14, 2018.
- [18] G. Tinchev, A. Penate-Sanchez, and M. Fallon, “Learning to see the wood for the trees: Deep laser localization in urban and natural environments on a cpu,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1327–1334, 2019.
- [19] F. Radenović, G. Tolias, and O. Chum, “Fine-Tuning CNN Image Retrieval with No Human Annotation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 7, pp. 1655–1668, 2018.
- [20] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5297–5307.
- [21] K. Vidanapathirana, P. Moghadam, B. Harwood, M. Zhao, S. Sridharan, and C. Fookes, “Locus: LiDAR-based Place Recognition using Spatiotemporal Higher-Order Pooling,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 5075–5081.
- [22] W. Zhang and C. Xiao, “PCAN: 3d attention map learning using contextual information for point cloud based retrieval,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 436–12 445.
- [23] D. Cattaneo, M. Vaghi, and A. Valada, “LCDNet: Deep Loop Closure Detection and Point Cloud Registration for LiDAR SLAM,” IEEE Transactions on Robotics, vol. 38, no. 4, pp. 2074–2093, 2022.
- [24] K. Vidanapathirana, P. Moghadam, S. Sridharan, and C. Fookes, “Spectral Geometric Verification: Re-Ranking Point Cloud Retrieval for Metric Localization,” IEEE Robotics and Automation Letters, vol. 8, no. 5, pp. 2494–2501, 2023.
- [25] P. Yin, A. Abuduweili, S. Zhao, C. Liu, and S. Scherer, “BioSLAM: A Bio-inspired Lifelong Memory System for General Place Recognition,” arXiv preprint arXiv:2208.14543, 2022.
- [26] L. Bernreiter, L. Ott, J. Nieto, R. Siegwart, and C. Cadena, “Spherical Multi-Modal Place Recognition for Heterogeneous Sensor Systems,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 1743–1750.
- [27] Y. Pan, X. Xu, W. Li, Y. Cui, Y. Wang, and R. Xiong, “CORAL: Colored structural representation for bi-modal place recognition,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 2084–2091.
- [28] J. Komorowski, M. Wysoczańska, and T. Trzcinski, “MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition,” in 2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 2021, pp. 1–8.
- [29] S. Xie, C. Pan, Y. Peng, K. Liu, and S. Ying, “Large-Scale Place Recognition Based on Camera-LiDAR Fused Descriptor,” Sensors, vol. 20, no. 10, p. 2870, 2020.
- [30] S. Ratz, M. Dymczyk, R. Siegwart, and R. Dubé, “OneShot Global Localization: Instant LiDAR-Visual Pose Estimation,” in 2020 IEEE International conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 5415–5421.
- [31] H. Lai, P. Yin, and S. Scherer, “AdaFusion: Visual-LiDAR Fusion With Adaptive Weights for Place Recognition,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 12 038–12 045, 2022.
- [32] J. Li and G. H. Lee, “DeepI2P: Image-to-Point Cloud Registration via Deep Classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 960–15 969.
- [33] M. Feng, S. Hu, M. H. Ang, and G. H. Lee, “2D3D-MatchNet: Learning to Match Keypoints Across 2D Image and 3D Point Cloud,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 4790–4796.
- [34] C. Sautier, G. Puy, S. Gidaris, A. Boulch, A. Bursuc, and R. Marlet, “Image-to-Lidar self-supervised distillation for autonomous driving data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9891–9901.
- [35] M. A. Fischler and R. C. Bolles, “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
- [36] P. J. Besl and N. D. McKay, “Method for registration of 3-D shapes,” in Sensor fusion IV: control paradigms and data structures, vol. 1611. Spie, 1992, pp. 586–606.
- [37] J. Li and G. H. Lee, “USIP: Unsupervised Stable Interest Point Detection from 3D Point Clouds,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 361–370.
- [38] T. Sattler, B. Leibe, and L. Kobbelt, “Efficient & Effective Prioritized Matching for Large-Scale image-Based Localization,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 9, pp. 1744–1756, 2016.
- [39] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC Superpixels Compared to State-of-the-Art Superpixel Methods,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 11, pp. 2274–2282, 2012.
- [40] X. Chen, H. Fan, R. Girshick, and K. He, “Improved Baselines with Momentum Contrastive Learning,” arXiv preprint arXiv:2003.04297, 2020.
- [41] C. Choy, J. Gwak, and S. Savarese, “4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3075–3084.