
Discriminative and Semantic Feature Selection for Place Recognition towards Dynamic Environments

Yuxin Tian1, Jinyu Miao1, Xingming Wu, Haosong Yue, Zhong Liu, Weihai Chen. 1Equal contributions. The authors are with the School of Automation Science and Electrical Engineering, Beihang University, Beijing, 100191, P. R. China. *Corresponding author: [email protected]
Abstract

Features play an important role in various visual tasks, especially in visual place recognition applied in perceptually changing environments. In this paper, we address the challenges of place recognition caused by dynamics and confusable patterns by proposing a discriminative and semantic feature selection network, dubbed DSFeat. Supervised by both semantic information and an attention mechanism, we estimate the pixel-wise stability of features, indicating the probability that the region a feature is extracted from is static and stable, and then select features that are insensitive to dynamic interference and distinctive enough to be correctly matched. The designed feature selection model is evaluated in place recognition and in a SLAM system on several public datasets with varying appearances and viewpoints. Experimental results demonstrate the effectiveness of the proposed method. It should be noticed that our proposal can be readily plugged into any feature-based SLAM system.

I Introduction

Simultaneous Localization and Mapping (SLAM) [1] refers to the ability of a robot to localize itself while incrementally building a map of the environment during exploration. Visual place recognition plays an essential part in the (re-)localization and loop closure detection procedures of a SLAM system. It localizes the query image by retrieving the most similar images cached in a pre-built map, so that the current pose of the robot can be estimated and the drift accumulated after long-term navigation and mapping can be corrected.

Autonomous robots aiming at long-term exploration undergo various visual interference, such as changing weather, repetitive textures, dynamics, etc. Embedding images into appearance-invariant features helps to retrieve the same place and distinguish similar-looking ones under varying visual conditions, and is thus necessary for appearance-based place recognition and SLAM. Traditional hand-crafted features are usually divided into two categories, namely global features [2, 3, 4, 5] and local features [6, 7, 8]. In general, global features are computationally efficient in retrieval but sensitive to varying viewpoints and dynamic occlusions. On the contrary, local features are more robust against viewpoint and appearance changes. They detect hundreds of interest points, also called keypoints, and describe the visual information of the neighboring area around each keypoint by a vector, named a descriptor. Efficient place recognition algorithms [9, 10, 11] usually first extract reliable local features from the current frame and then embed them into a global vector [4, 12] to balance efficiency and accuracy. Such a scheme is the theoretical core of Bag-of-Words-based (BoW-based) algorithms [9] and is widely used in practical SLAM systems [13, 14, 15]. However, the performance of existing traditional features degrades when suffering from challenging interference in real environments.

With the incredible success of deep neural networks in various computer vision tasks, many researchers [16, 17, 18, 19] have applied deep Convolutional Neural Networks (CNNs) and achieved remarkable improvements. These learned counterparts perform better on image retrieval or matching tasks [20], but still have difficulties in accurate localization in complex environments. This is because the stability of the regions where features are detected is neglected. A stable region means a static, distinct image patch. Existing feature algorithms treat all regions in an image as equally important for feature extraction and simply extract features based on pixel-intensity properties, which is unreasonable. For place recognition, frequently changed items, repetitive textures, and dynamics should not be considered. Thus, feature selection is a meaningful way to improve performance.

As a simple solution to select features in static regions, semantic segmentation models [21, 22, 23] can be applied to obtain pixel-wise semantic labels, and features can then be selected according to a manually designated static property, e.g. walls and buildings. However, this property is hard to define [24]. For instance, moving cars and parked cars, which frequently occur in place recognition datasets, should be distinguished but cannot be judged from semantic labels alone. Motivated by the success of the attention mechanism [25] in computer vision tasks, many feature selection approaches [26, 27, 28, 24] have been proposed; they estimate the importance of regions to generate more robust descriptors or detect more reliable keypoints. By using attention models, regions receive different importance in different tasks and the invariance against changing appearances is improved. However, state-of-the-art methods have an apparent disadvantage: they only work when incorporated with specific features, limiting the generalization of feature selection mechanisms.

In this paper, we propose a novel fully convolutional network (FCN), named DSFeat, to estimate the pixel-wise stability of regions, which provides reliable guidance to select stable features regardless of the feature algorithm. We use both matching and semantic supervision to train our model for better convergence. A comparison of selected features is shown in Fig. LABEL:fig:feature. It can be clearly seen that all cars in the images are estimated as dynamic objects based on semantic labels, which sometimes discards useful visual information in a scene with plenty of parked cars. In comparison, our DSFeat can effectively distinguish moving cars from parked cars and provide more stable features for place recognition. Detailed experiments have proven that our proposed model can effectively select robust features located in stable regions. The main contributions of this paper can be summarized as follows:

  1. An FCN model, DSFeat, is proposed to estimate a pixel-wise activation map, which indicates the stability of the regions where features are detected;

  2. A hybrid supervision is introduced to train the model, combining strongly supervised regression and self-supervised optimization, i.e. manually designated semantic properties and metric learning;

  3. Selected features are experimentally shown to be more robust in BoW-based place recognition and in an overall SLAM system.

The remainder of this paper is organized as follows. Section II describes the methodology in detail, comprehensive experiments are presented in Section III, and conclusions and future work are discussed in Section IV.

Figure 1: Examples from the prepared training dataset. (a) shows a pair of matched images, called matching information, and (b) shows a semantic segmentation example, called semantic information.

II Methodology

In this section, the proposed approach is described in detail. First, we introduce the dataset preparation in Section II-A. Then, Section II-B describes the structure of our model, as illustrated in Fig. 2. Finally, the loss function used in our network is introduced in Section II-C.

II-A Dataset Preparation

In our proposed method, we apply a hybrid supervision to train the feature selection network, which includes strongly supervised regression from the semantic segmentation task and self-supervised learning from triplet optimization. The construction of triplets requires matching relationships between images. Therefore, it is necessary to prepare both the semantic information and the matching information of the public datasets.

Matching information in our work means triplets composed of a query image ($I_q$), a positively matching image ($I_p$), and a negatively matching image ($I_n$). Matched images are captured at the same place, that is, they form a loop closure in a SLAM system. To obtain accurate matches, we follow the method of [29] to detect loops: we extract SuperPoint [16] features from the images and apply a BoW-based loop closure algorithm to coarsely detect loop candidates, and then a topological graph is constructed to verify the candidates. For better accuracy, we manually check the obtained matches to make sure all of them are indeed from the same scene. These matched loops describe the same place under different visual conditions, as shown in Fig. 1(a), including pedestrians, varying illumination, moving cars, etc. In addition, we also need mismatches to construct triplets. For each query image, an image taken at a different place, or one that does not pass the graph verification, is considered as the mismatch, i.e. the negatively matching image $I_n$.

For semantic information, we simply apply HRNet [30], a semantic segmentation network with high performance, to obtain pixel-level semantic labels of images, as shown in Fig. 1(b). We do not consider any improvement to the segmentation itself, as it is beyond the scope of this work. Then, we manually designate some categories as static, like walls and buildings, and define the others as dynamic, following [31]. The label of static items is set to 1 and that of dynamics to 0, so that a higher activation means a higher probability of semantic stability. After that, we obtain a binary semantic stability map $S\in\mathcal{R}^{H\times W}$, where $H$ and $W$ are the height and width of the original frame.
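As a concrete illustration of this step, the following minimal Python sketch converts a per-pixel class-index map into the binary stability map $S$; the set of static class indices is hypothetical and depends on the label set of the segmentation network.

import numpy as np

# Hypothetical indices of classes treated as static (e.g. road, building, wall);
# the actual ids depend on the label set used by the segmentation network.
STATIC_CLASS_IDS = [0, 1, 2, 8, 10]

def semantic_stability_map(label_map):
    """Turn an H x W per-pixel class-index map into the binary stability map S:
    static pixels are labeled 1 and dynamic pixels 0."""
    return np.isin(label_map, STATIC_CLASS_IDS).astype(np.float32)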

II-B Network Structure

Our approach outputs a one-dimensional activation map with values between 0 and 1, whose resolution is the same as that of the original image. This is structurally similar to semantic segmentation, so we adopt a segmentation model as our backbone. Additionally, a trade-off between computational efficiency and accuracy is necessary for real-time SLAM, so we experimentally compare the overall performance of several well-known segmentation networks [22, 21, 23], as shown later in Section III-A, and finally choose U-Net [21] as our backbone.

To achieve better computational efficiency, we modify the vanilla U-Net to adapt it to our method. First, the number of down-sampling stages is reduced to speed up network inference. Second, we increase the depth of the network to improve accuracy. Finally, we use a convolution layer followed by a Sigmoid function as the head to obtain the one-dimensional activation map $A\in\mathcal{R}^{H\times W}$. The network structure is shown in Fig. 2.

Figure 2: Improved network structure with U-Net as the backbone.

II-C Loss Function

To train the proposed feature selection model, we use a hybrid loss function (see Section II-C5) to optimize the parameters, including a semantic loss (Section II-C1) based on semantic stability and a matching loss (Section II-C2) based on the matching information of triplets. Additionally, we design two different distance measurements between images according to the characteristics of dense and sparse features (Sections II-C3 and II-C4). Detailed descriptions are given in the following paragraphs.

II-C1 Semantic loss

The main purpose of the proposed network is to select stable features. Existing approaches [28, 24] only use the weakly supervised matching information provided by pairs or triplets and tend to be hard to converge. Therefore, a stronger supervision should be considered in the early warm-up stage. We use semantic segmentation information to achieve this goal. As described in Section II-A, the binary semantic stability labels can be seen as a rough target for the activation. Thus, the semantic loss serves as an initial supervision, and we use the binary cross entropy (BCE) function to directly regress the activation toward the semantic stability labels. The semantic loss is calculated as follows:

L_{sem} = \frac{1}{wh}\sum_{i=1}^{w}\sum_{j=1}^{h} BCE(A_{i,j}, S_{i,j})    (1)

where $h$ and $w$ are the height and width of the original input image, and $A_{i,j}$ and $S_{i,j}$ denote the values at $(i,j)$ of $A$ and $S$, respectively.
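A minimal sketch of Eq. (1), assuming the activation map $A$ and the stability map $S$ are given as tensors of the same shape:

import torch.nn.functional as F

def semantic_loss(activation, stability):
    # Eq. (1): binary cross entropy between activation A and the binary
    # stability map S, averaged over all w*h pixels (the default reduction).
    return F.binary_cross_entropy(activation, stability)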

II-C2 Matching loss

On the other hand, we introduce a triplet ranking loss as the matching loss. The triplet loss requires that the dissimilarity between the query image and the positive should be smaller than that between the query and the negative. It refines the activation map to obtain a more reliable stability estimate and automatically learns to emphasize stable and distinct regions.

L_{mat} = \max\{d(I_q, I_p) - d(I_q, I_n) + m, 0\}    (2)

where $m$ is a pre-defined margin and $d(\cdot)$ is our proposed distance measurement, designed to better indicate the correctness of feature matching after selection based on the activation. As the selection procedure is non-differentiable, we regard the activation as a weight when optimizing the network. In order to make the method readily pluggable into various features, we design different image distance measurements for dense features and sparse features, respectively.
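A minimal sketch of Eq. (2); d_pos and d_neg stand for $d(I_q, I_p)$ and $d(I_q, I_n)$ computed with one of the measurements below, and the margin value is an assumption of this illustration.

import torch

def matching_loss(d_pos, d_neg, margin=0.1):
    # Eq. (2): triplet ranking loss with margin m (the value here is illustrative).
    return torch.clamp(d_pos - d_neg + margin, min=0.0)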

II-C3 Distance measurement for dense features

For dense local features [16, 17, 18, 32], the feature algorithm provides a dense feature map before selection, so we can apply an embedding process similar to [28, 24]. We use the estimated activation to compute the weighted mean of the dense feature descriptors as a global descriptor, and then measure the similarity between two images based on these descriptors.

d(I_1, I_2) = \frac{1}{wh}\left\|\sum_{i=1}^{w}\sum_{j=1}^{h} A^{1}_{i,j} d^{1}_{i,j} - \sum_{i=1}^{w}\sum_{j=1}^{h} A^{2}_{i,j} d^{2}_{i,j}\right\|_2    (3)

where $d^{a}_{i,j}$ is the descriptor at $(i,j)$ of image $I_a$ and $A^{a}_{i,j}$ denotes the activation value at $(i,j)$ of image $I_a$.
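A sketch of Eq. (3) for dense features, assuming the dense descriptors are given as an (H, W, D) tensor and the activation as an (H, W) tensor:

import torch

def dense_distance(act1, desc1, act2, desc2):
    """Eq. (3): activation-weighted aggregation of dense descriptors into one
    global vector per image, followed by an L2 distance scaled by 1/(wh)."""
    h, w = act1.shape
    g1 = (act1.unsqueeze(-1) * desc1).sum(dim=(0, 1))  # weighted sum over all pixels
    g2 = (act2.unsqueeze(-1) * desc2).sum(dim=(0, 1))
    return torch.norm(g1 - g2, p=2) / (h * w)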

II-C4 Distance measurement for sparse features

For sparse local features [8, 6, 7], the keypoints have already been selected by the detector within a predetermined range, and too much information is discarded during this selection, which affects the convergence of model training. Thus, we use the activation value as a weight to compute a weighted matching re-projection error as a stronger supervised distance:

d(I_1, I_2) = A \cdot RE(I_1, I_2)    (4)

where $RE(I_1, I_2)$ is the re-projection error between images $I_1$ and $I_2$, which is measured as follows:

RE(I_1, I_2) = \frac{1}{2n}\sum_{k=1}^{n}\left(P(F, u^{1}_{k}, u^{2}_{k}) + P(F^{-1}, u^{2}_{k}, u^{1}_{k})\right)    (5)

P(T, u^{a}, u^{b}) = \frac{(T \times u^{a}) \cdot u^{b}}{\|T \times u^{a}\|}    (6)

where $F$ is the fundamental matrix from $I_1$ to $I_2$ calculated by the random sample consensus (RANSAC) algorithm, $u^{a}_{k}$ is the normalized coordinate of the $k$-th matched feature point in $I_a$, and the function $P$ computes the distance between the projected epipolar line $T \times u^{a}$ and the point $u^{b}$.

The fundamental matrix represents the epipolar consistency constraint between images from two viewpoints. Corresponding feature points in a positively matching image pair should satisfy the fundamental matrix constraint, so the re-projection error will be small, and vice versa. Therefore, we use the average re-projection error as the image distance measurement for sparse feature points, which leads to better convergence of the model.
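The sketch below illustrates the re-projection error of Eqs. (5)-(6) using OpenCV's RANSAC fundamental-matrix estimation; the coordinate convention, the RANSAC threshold, and the use of the transposed matrix for the reverse epipolar line are assumptions of this illustration. Eq. (4) then scales this error by the activation weights at the matched points.

import cv2
import numpy as np

def point_line_distance(line, pt):
    # Distance from the homogeneous point pt to the epipolar line (a, b, c), cf. Eq. (6).
    return abs(line @ pt) / np.linalg.norm(line[:2])

def reprojection_error(pts1, pts2):
    """Symmetric epipolar re-projection error of Eq. (5) over n matched points.
    pts1, pts2: (n, 2) arrays of matched keypoint coordinates."""
    F, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    h1 = np.hstack([pts1, np.ones((len(pts1), 1))])  # homogeneous coordinates
    h2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    err = 0.0
    for u1, u2 in zip(h1, h2):
        err += point_line_distance(F @ u1, u2)    # epipolar line of u1 in image 2
        err += point_line_distance(F.T @ u2, u1)  # reverse direction
    return err / (2 * len(h1))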

II-C5 Hybrid loss

In the end, the loss function we design combines the strongly supervised semantic loss and the self-supervised matching loss. The semantic loss offers a rough but easy initial optimization, while the matching loss provides accurate adaptation and refinement. The weighted combination of the two losses is used as the final hybrid loss.

L = (1 - \alpha)L_{sem} + \alpha L_{mat}    (7)

where $\alpha$ is a hyper-parameter that gradually increases as training progresses, indicating that the matching loss gradually takes over the guidance. The hybrid loss function yields highly accurate results while ensuring good convergence.
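Building on the sketches of the semantic and matching losses above, the hybrid loss of Eq. (7) can be written as:

def hybrid_loss(activation, stability, d_pos, d_neg, alpha):
    # Eq. (7): weighted combination of the strongly supervised semantic loss and
    # the self-supervised matching loss; alpha shifts the emphasis toward the
    # matching term as training progresses.
    return (1.0 - alpha) * semantic_loss(activation, stability) + \
           alpha * matching_loss(d_pos, d_neg)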

III Experiments

We provide experimental analysis in this section. First, we introduce the details of model training. Then we apply our approach to place recognition tasks to verify the robustness of the selected features. Finally, we incorporate our method into a practical SLAM system to verify the effectiveness of the proposed feature selection mechanism.

III-A Model Setups

To select a backbone that balances speed and accuracy, we compare the performance of SegNet [22], U-Net [21], and DeepLab [23] on multiple semantic segmentation datasets. The results are shown in Table I. For higher processing speed, we finally choose U-Net as the backbone for our method.

TABLE I: Segmentation backbone comparison (IoU / FPS)
 CamVid CityScapes SYNTHIA Mapillary
SegNet 46.4/62 70.6/21 62.1/44 45.4/55
U-Net 44.7/85 68.4/33 60.1/60 44.7/75
DeepLab v3 49.5/10 73.1/2.6 66.1/5.1 50.1/8.3

In the training process, we set the maximal number of training epochs to 50 to let the network be fully trained. The initial learning rate is set to 1e-3 and is multiplied by 0.1 at epochs 20, 30, and 40. In addition, the hyper-parameter $\alpha$ in the loss function is initially set to 1 and is multiplied by 0.9 every epoch, so that the semantic loss mainly affects the initial supervision of the network and the matching loss gradually takes over the optimization.
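For reference, the stated learning-rate schedule corresponds to the following PyTorch sketch; the choice of Adam as the optimizer is an assumption, as no optimizer is named above.

import torch

model = ActivationNet()  # the network sketched in Section II-B
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # optimizer choice assumed
# Learning rate multiplied by 0.1 at epochs 20, 30, and 40, as stated above.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20, 30, 40], gamma=0.1)

for epoch in range(50):
    # ... one pass over the triplet dataset, optimizing the hybrid loss ...
    scheduler.step()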

III-B Evaluation Criteria

We evaluate our method on typical outdoor datasets since dynamic occlusions frequently occur in them. CityCentre [33] contains 2474 images and was captured during a 2.0 km exploration of a campus with many similar buildings and shrubs. The KITTI odometry dataset [34] is one of the most famous stereo benchmarks and contains 22 sequences, 11 of which (00-10) have ground-truth trajectories that can be used for visual odometry or SLAM evaluation.

For place recognition, we use CityCentre and 12 sequences from KITTI to evaluate our method. The commonly used Precision and Recall metrics are measured. Precision is the ratio between correctly detected loops and all detections, while Recall is the ratio between correct detections and all loop events existing in the current scene. We use the ground-truth loops from the original publication for CityCentre, and the manually annotated detections in [35] are used as the reference on KITTI.
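The two metrics follow their standard definitions; as a sketch, with detections and ground-truth loops represented as sets of (query, match) pairs:

def precision_recall(detections, ground_truth_loops):
    """Precision: correct detections over all detections.
    Recall: correct detections over all loop events in the scene."""
    true_positives = len(detections & ground_truth_loops)
    precision = true_positives / len(detections) if detections else 1.0
    recall = true_positives / len(ground_truth_loops) if ground_truth_loops else 1.0
    return precision, recall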

For SLAM evaluation, we use all of the 11 sequences (00-10) with ground-truth trajectories. CityCentre is not used because no public ground-truth poses are available. We calculate the average rotation error and the standard deviation of the offset between the estimation and the ground truth for a fair and quantitative comparison.

III-C Place Recognition

Figure 3: Precision-recall curves. The upper row is evaluated on the KITTI-00 dataset using (a) ORB and (b) SuperPoint, while the bottom row is on CityCentre using (c) ORB and (d) SuperPoint. It should be noticed that the precision and recall axes begin from 0.5.

In this experiment, we plug our feature selection method into a BoW-based place recognition framework [9] and evaluate its effectiveness on KITTI and CityCentre. We test our approach with a popular sparse local feature, ORB [8], and a dense one, SuperPoint [16], to verify the versatility and generalization of our method. To comprehensively analyze its effectiveness, we compare with the original features and with features selected by semantic labels, dubbed “trad.” and “seman.”, respectively.

For the original features, we directly use the response values of the keypoints to select the Top-500 features with the highest responses. For the semantic selection method, we discard features located on dynamic objects to avoid interference from dynamics, and then also select the Top-500 feature points with the highest response values from the remaining features. As for our method, the feature points are selected by the estimated activation value: the 500 features with the highest activation are kept for further processing. We use RANSAC to estimate the transform matrix between the query and the loop candidate after place recognition, and calculate the average re-projection error of the inliers as a score to obtain the precision-recall curves shown in Fig. 3. For a quantitative comparison, we also provide the area under the curve (AUC) in Table II.
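As an illustration of the selection rule used for our method, the following sketch keeps the 500 keypoints with the highest activation; keypoints are assumed to be given as integer (x, y) pixel coordinates.

import numpy as np

def select_by_activation(keypoints, descriptors, activation, k=500):
    """Keep the k keypoints with the highest activation value.
    keypoints: (n, 2) array of integer (x, y) coordinates; activation: (H, W) map."""
    scores = activation[keypoints[:, 1], keypoints[:, 0]]  # activation at each keypoint
    order = np.argsort(-scores)[:k]
    return keypoints[order], descriptors[order]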

TABLE II: Area under curves of compared methods.
ORB(sparse) SuperPoint(dense)
trad. seman. ours. trad. seman. ours.
KITTI00 0.989 0.992 0.997 0.990 0.994 0.996
KITTI01 0.980 0.984 0.985 0.981 0.983 0.985
KITTI02 0.970 0.977 0.976 0.967 0.981 0.979
KITTI05 0.965 0.971 0.980 0.961 0.965 0.971
KITTI06 0.973 0.974 0.984 0.970 0.976 0.988
KITTI07 0.948 0.945 0.950 0.951 0.949 0.951
KITTI08 0.963 0.962 0.964 0.960 0.967 0.964
KITTI09 0.977 0.983 0.992 0.976 0.986 0.994
KITTI15 0.921 0.930 0.928 0.935 0.939 0.940
KITTI16 0.935 0.947 0.947 0.931 0.951 0.940
KITTI18 0.978 0.988 0.992 0.978 0.986 0.990
KITTI19 0.943 0.944 0.952 0.946 0.948 0.956
CityCentre 0.870 0.911 0.956 0.901 0.948 0.987

According to the results, using the semantic static/dynamic labels, which are manually defined based on the segmentation results, can sometimes help to discard confusing features, so the place recognition performance is improved. However, it sometimes negatively affects the system, as in KITTI-07 and KITTI-08, because the definition of dynamics and statics is unreasonable and unstable. Moreover, it cannot focus on distinctive regions based on the texture of images. On the contrary, our proposed DSFeat works well in all the experiments and outperforms the compared baselines with clear margins. This indicates that our method produces a more stable estimation of dynamic and static attributes and detects distinctive regions, which helps detection considerably. Additionally, our method achieves better performance when incorporated with both sparse and dense features, which, to the best of our knowledge, makes it the first feature selection model achieving such a goal.

III-D SLAM System

TABLE III: The accuracy of SLAM systems.
Rotation Error (deg/m) Offset Deviation (%)
ORB-SLAM2 ours ORB-SLAM2 ours
KITTI00 0.0026 0.0026 0.6943 0.7018
KITTI01 0.0025 0.0020 1.6632 1.3842
KITTI02 0.0029 0.0024 0.8394 0.7592
KITTI03 0.0028 0.0017 0.8155 0.7476
KITTI04 0.0025 0.0015 0.5069 0.5051
KITTI05 0.0017 0.0016 0.4188 0.3945
KITTI06 0.0015 0.0014 0.4722 0.4602
KITTI07 0.0028 0.0029 0.4672 0.4742
KITTI08 0.0031 0.0029 1.0438 1.0099
KITTI09 0.0030 0.0025 1.0405 0.8169
KITTI10 0.0030 0.0028 0.6945 0.6419

For a more comprehensive evaluation, in this experiment we integrate the proposed method into a practical SLAM system to verify the improvement. We use ORB-SLAM2 [14] as the framework, which uses ORB as its feature extraction method. Our method is used to select stable features in the feature extraction step of the SLAM system. The selected features are then used to track, map, and detect loops, and thus affect the performance of the overall system. Besides, the provided activation values are also used as weights in the RANSAC procedure during feature matching and pose estimation to improve accuracy.

We compare the original SLAM system with the one incorporating our proposed DSFeat. Fig. 4 shows an example of the trajectories estimated by the SLAM systems and the ground truth on the KITTI-02 sequence. Clearly, our estimated trajectory is more accurate than that of the original system, which qualitatively demonstrates the effectiveness of the feature selection procedure. Besides, Table III records the quantitative results of the average rotation error and the standard deviation of the offset on the 11 sequences of KITTI. After applying our method to select stable features, the accuracy of the entire SLAM system is improved on almost all scenes, which demonstrates the effectiveness and generalization of our proposed approach.

Figure 4: Visualization of the trajectories estimated by our method and ORB-SLAM2 on the KITTI-02 sequence.
Figure 5: Qualitative results on the KITTI dataset. Pictures in the top row are the original images, and those in the second row are the results of semantic segmentation. The third row shows the manually-defined classification of dynamic and static objects based on semantic labels. Activation maps provided by our method are shown in the bottom row. Higher activation indicates a static item that should be retained, while regions with lower values are probably dynamic and need to be discarded.

III-E Visualization of Activation Map

To intuitively visualize the improvement of our proposed DSFeat, we show the semantic labels, the semantic static/dynamic map, and our estimated heatmap (activation map) in Fig. 5. Using manually-defined semantic labels to judge dynamic/static regions is unreasonable and cannot detect distinctive regions. In the figure, all cars are designated as dynamic according to the semantic segmentation, but cars parked at the roadside (Fig. 5(b) and Fig. 5(c)) should be static while moving ones (Fig. 5(a)) are dynamic. Our method can reliably distinguish them and thus helps to accurately track, map, and detect loops. The selected feature keypoints can be seen in Fig. LABEL:fig:feature.

IV Conclusion

In this paper, we consider the SLAM problem in complex scenes and propose a method that uses an activation map for feature point selection. First, we improve the U-Net backbone and propose a network structure for generating one-dimensional activation maps. Second, we specifically design corresponding dissimilarity measurement functions for dense and sparse features. Third, we combine the strong supervision from semantic information with the self-supervision from matching information into a hybrid loss function to train the model, and regulate the ratio between the two supervisions over the course of training. Finally, we use the estimated activation map to select stable features and to weight the RANSAC algorithm. The experimental results show that our approach effectively improves image retrieval in complex scenes and also significantly improves localization accuracy in the SLAM system. It should be noticed that the proposed feature selection works well with both sparse and dense features, and its generalization is demonstrated in various scenes. The proposed DSFeat can be readily plugged into any odometry or SLAM implementation based on local features, and we will release the related code later for further studies.

References

  • [1] T. Bailey and H. Durrant-Whyte, “Simultaneous localization and mapping (SLAM): part II,” IEEE ROBOT. AUTOM. MAG., vol. 13, no. 3, pp. 108–117, 2006.
  • [2] C. Siagian and L. Itti, “Biologically inspired mobile robot vision localization,” IEEE Transactions on Robotics, vol. 25, no. 4, pp. 861–873, 2009.
  • [3] I. Ulrich and I. Nourbakhsh, “Appearance-based place recognition for topological localization,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), vol. 2, 2000, pp. 1023–1029 vol.2.
  • [4] H. Jegou, C. Schmid, M. Douze, and P. Perez, “Aggregating local descriptors into a compact image representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, jun 2010, pp. 3304–3311.
  • [5] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2005, pp. 886–893.
  • [6] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, pp. 91–110, 2004.
  • [7] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
  • [8] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in Proceedings of the International Conference on Computer Vision (ICCV), 2011, pp. 2564–2571.
  • [9] D. Gálvez-López and J. D. Tardós, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, October 2012.
  • [10] Y. Hou, H. Zhang, and S. Zhou, “BoCNF: efficient image matching with bag of ConvNet features for scalable and robust visual place recognition,” Autonomous Robots, vol. 42, pp. 1169–1185, 2018.
  • [11] S. Khan and D. Wollherr, “IBuILD: Incremental bag of binary words for appearance based loop closure detection,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2015, pp. 5441–5447.
  • [12] J. Sivic and A. Zisserman, “Video Google: a text retrieval approach to object matching in videos,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), vol. 2, 2003, pp. 1470–1477.
  • [13] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM: a versatile and accurate monocular SLAM system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
  • [14] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
  • [15] C. Campos, R. Elvira, J. J. Gómez, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM,” arXiv preprint arXiv:2007.11898, 2020.
  • [16] D. DeTone, T. Malisiewicz, and A. Rabinovich, “SuperPoint: Self-supervised interest point detection and description,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Deep Learning for Visual SLAM Workshop (CVPRW), 2018.
  • [17] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler, “D2-Net: A trainable CNN for joint detection and description of local features,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [18] J. Revaud, P. Weinzaepfel, C. R. de Souza, and M. Humenberger, “R2D2: Repeatable and reliable detector and descriptor,” in Proceedings of the Conference and Workshop on Neural Information Processing Systems (NeurIPS), 2019.
  • [19] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [20] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk, “HPatches: A benchmark and evaluation of handcrafted and learned local descriptors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [21] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, pp. 234–241.
  • [22] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv preprint arXiv:1511.00561, 2015.
  • [23] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
  • [24] D. Li, J. Miao, X. Shi, Y. Tian, Q. Long, P. Guo, H. Yu, W. Yang, H. Yue, Q. Wei, and F. Qiao, “RaP-Net: A region-wise and point-wise weighting network to extract robust features for indoor localization,” arXiv preprint arXiv:2012.00234, 2020.
  • [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017.
  • [26] H. J. Kim, E. Dunn, and J.-M. Frahm, “Learned contextual feature reweighting for image geo-localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 07 2017.
  • [27] H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han, “Large-scale image retrieval with attentive deep local features,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [28] Z. Xin, Y. Cai, T. Lu, X. Xing, S. Cai, J. Zhang, Y. Yang, and Y. Wang, “Localizing discriminative visual landmarks for place recognition,” arXiv preprint arXiv:1904.06635, 2019.
  • [29] H. Yue, J. Miao, Y. Yu, W. Chen, and C. Wen, “Robust loop closure detection based on bag of SuperPoints and graph verification,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 3787–3793.
  • [30] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” arXiv preprint arXiv:1902.09212, 2019.
  • [31] N. Merrill and G. Huang, “CALC2.0: Combining appearance, semantic and geometric information for robust and efficient visual loop closure,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019.
  • [32] Y. Song, L. Cai, J. Li, Y. Tian, and M. Li, “SEKD: Self-evolving keypoint detection and description,” arXiv preprint arXiv:2006.05077, 2020.
  • [33] M. Cummins and P. Newman, “FAB-MAP: Probabilistic localization and mapping in the space of appearance,” International Journal of Robotics Research, vol. 27, no. 6, 2008.
  • [34] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the KITTI vision benchmark suite,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [35] R. Arroyo, P. F. Alcantarilla, L. M. Bergasa, J. J. Yebes, and S. Bronte, “Fast and effective visual place recognition using binary codes and disparity information,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2014, pp. 3089–3094.