AEGIS-Net: Attention-guided Multi-Level Feature Aggregation for Indoor Place Recognition

Abstract

We present AEGIS-Net, a novel indoor place recognition model that takes in RGB point clouds and generates global place descriptors by aggregating lower-level color and geometry features with higher-level implicit semantic features. Rather than relying on simple feature concatenation, self-attention modules are employed to select the most important local features that best describe an indoor place. Our AEGIS-Net consists of a semantic encoder, a semantic decoder and an attention-guided feature embedding. The model is trained in a two-stage process, with the first stage focusing on an auxiliary semantic segmentation task and the second on the place recognition task. We evaluate our AEGIS-Net on the ScanNetPR dataset and compare its performance with a pre-deep-learning feature-based method and five state-of-the-art deep-learning-based methods. Our AEGIS-Net achieves exceptional performance and outperforms all six methods.

Index Terms—  place recognition, point cloud, point features, self-attention

1 Introduction

Place recognition allows autonomous robots to recognize a previously visited place in large environments. It is a popular research topic as it is crucial for global localization, preceding the 6 degree-of-freedom (DoF) pose estimation. The place recognition task is commonly treated as a retrieval problem: a global descriptor is created from local features and matched against a database of known place descriptors.

Fig. 1: Overview of the proposed AEGIS-Net.
Fig. 2: The architecture of the proposed AEGIS-Net.

Place recognition in outdoor settings has garnered significant attention. Pioneered by NetVLAD [1] with image input and PointNetVLAD [2] with point cloud input, numerous end-to-end approaches have emerged with remarkable performance [3, 4, 5], alongside substantial outdoor datasets such as OxfordRobotCar [6] and CrossSeason [7].

However, indoor place recognition remains under-explored and presents distinct challenges. For instance, sensors in indoor environments, such as rooms, often capture limited portions of the scene, resulting in strong locality in the available information. In addition, indoor locations frequently exhibit similar structures and appearances, which limits the discriminative power of 2-D data such as RGB images or 3-D data such as point clouds alone. Combining the two, [8] employs a Siamese network for feature extraction from intensity and depth pairs, while [9] enhances DH3D [10] for indoor scenes by adding color as additional features to the point cloud inputs. Exploiting structural features, LCD [11] feeds both RGB-D frames and line clusters to the recognition network, ensuring the generated global descriptors retain structural information. On the other hand, SpoxelNet [12] focuses on feature manipulation by leveraging multi-level features of point clouds. It also tackles indoor occlusion issues by introducing a quad-view integrator.

In this paper, we focus on large-scale indoor place recognition and propose AEGIS-Net, short for AttEntion, color, Geometry and Implicit Semantics. Extending our previous work CGiS-Net [13], it is the first work to utilize self-attention to guide local feature selection, resulting in more discriminative global place descriptors. We propose that integrating semantic information with appearance and structural data can significantly enhance indoor place recognition. To this end, we design an approach, depicted in Fig. 1, that blends color, geometry and implicit semantic features with self-attention. Inspired by [14], we utilize an auxiliary semantic segmentation task to train the network that extracts semantics-enriched local features. The local features are then selected with self-attention (SA) layers and embedded into global descriptors.

We evaluate the proposed AEGIS-Net on the ScanNetPR dataset [13], which is derived from the ScanNet dataset [15] and supports both point cloud and image inputs. For comparison, we include a baseline model with traditional hand-crafted features [16, 17] and five prominent deep learning models [1, 2, 4, 9, 13]. The results demonstrate AEGIS-Net's superiority over renowned place recognition methods.

2 Methodology

In this paper, we follow our previous work [13] and use RGB point clouds as the network input to leverage color and geometry features. Regarding the semantics, a semantic encoder, which is trained separately on an auxiliary semantic segmentation task, is employed. Then, to balance the impact of color, geometry and implicit semantic features, SA layers are used to learn the weights adaptively before generating global descriptors with the NetVLAD layer.

2.1 Network Architecture

The architecture of the proposed AEGIS-Net is shown in Fig. 2 (Left). It has three main parts: a semantic encoder, a semantic decoder and an attention-guided feature embedding. Considering the varying point cloud densities in indoor scenarios, we build the semantic encoder and semantic decoder on the advanced 3-D point cloud segmentation network KP-FCNN [18]. However, the encoder-decoder architecture can be replaced with any point cloud segmentation network. As for the attention-guided feature embedding, its core components are the SA layers and the NetVLAD layer.

The semantic encoder consists of 5 kernel point convolutional (KP-Conv) layers, while the semantic decoder uses the same number of nearest upsampling layers. In addition, skip connections are used to pass information between the corresponding encoder-decoder layers. Since the KP-Conv mimics the behavior of 2-D image convolutions, the earlier layers in the semantic encoder focus on lower-level color and geometry features such as corners, whereas later layers extract higher-level semantic features. For a deeper insight into the behavior of the KP-Conv layers, we refer the reader to the original KP-FCNN paper [18]. To incorporate color, geometry and semantic features for place recognition, we use the features generated by the semantic encoder and the intermediate features extracted by the first 2 and 4 kernel point convolutional layers. Since the number of local features extracted by different layers varies a lot, these multi-level local features are first selected with separate SA layers before concatenation. Then the combined local features undergo further enhancement with an additional SA layer. Finally, the enhanced representation is processed by a NetVLAD layer [1] to produce the global place descriptor. The detailed architectures of the SA layer and the NetVLAD layer are shown in Fig. 2 (Right). For efficient retrieval, a subsequent fully-connected (FC) layer is added after the NetVLAD layer to reduce the dimension of the global descriptor.
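To make this pipeline concrete, the following is a minimal PyTorch-style sketch of the attention-guided feature embedding. It assumes the per-level local feature matrices have already been produced by the semantic encoder; the layer widths, the simplified NetVLAD, and the way features from different levels are stacked along the point axis are illustrative assumptions rather than the released implementation.

```python
# Hypothetical sketch of the attention-guided feature embedding (not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SALayer(nn.Module):
    """Self-attention block: multi-head attention followed by a feed-forward layer."""
    def __init__(self, in_dim, out_dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(in_dim, num_heads, batch_first=True)
        self.ffn = nn.Linear(in_dim, out_dim)        # stretch features to the desired width

    def forward(self, x):                            # x: (B, N, in_dim)
        attn_out, _ = self.attn(x, x, x)             # Q = K = V = x
        return self.ffn(attn_out)                    # (B, N, out_dim)

class NetVLAD(nn.Module):
    """Simplified NetVLAD aggregation of local features into one global descriptor."""
    def __init__(self, dim, num_clusters=64):
        super().__init__()
        self.assign = nn.Linear(dim, num_clusters)   # soft cluster assignment
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):                                    # x: (B, N, dim)
        a = F.softmax(self.assign(x), dim=-1)                # (B, N, K)
        residuals = x.unsqueeze(2) - self.centroids          # (B, N, K, dim)
        vlad = (a.unsqueeze(-1) * residuals).sum(dim=1)      # (B, K, dim)
        vlad = F.normalize(vlad, dim=-1).flatten(1)          # intra-normalize and flatten
        return F.normalize(vlad, dim=-1)                     # (B, K * dim)

class AttentionGuidedEmbedding(nn.Module):
    """Selects multi-level local features with SA layers, refines the combination with
    another SA layer, and aggregates everything into a compact global descriptor."""
    def __init__(self, level_dims=(64, 256, 1024), fused_dim=256, out_dim=256):
        super().__init__()
        # One SA layer per feature level; each projects to a common width so features
        # from levels with different point counts can be stacked together.
        self.level_sa = nn.ModuleList([SALayer(d, fused_dim) for d in level_dims])
        self.fuse_sa = SALayer(fused_dim, fused_dim)
        self.netvlad = NetVLAD(fused_dim, num_clusters=64)
        self.fc = nn.Linear(fused_dim * 64, out_dim)  # compress for efficient retrieval

    def forward(self, level_feats):                   # list of (B, N_l, d_l) tensors
        selected = [sa(f) for sa, f in zip(self.level_sa, level_feats)]
        fused = self.fuse_sa(torch.cat(selected, dim=1))   # stack along the point axis
        return self.fc(self.netvlad(fused))                # (B, out_dim) global descriptor

# Example with dummy encoder outputs of different sizes and widths.
feats = [torch.randn(1, 1024, 64), torch.randn(1, 256, 256), torch.randn(1, 64, 1024)]
print(AttentionGuidedEmbedding()(feats).shape)             # torch.Size([1, 256])
```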

2.2 Self-Attention Layer

The core of our proposed AEGIS-Net lies in the SA layers. As shown in Fig. 2 (Right), each SA layer comprises a multi-head attention layer and a feed-forward layer. The SA layers allow the model to assign different weights to the local features and select the ones most important for describing the current place in the global descriptor.

Specifically, in each attention head, the attention is computed from three matrices: Query ($\mathbf{Q}$), Key ($\mathbf{K}$) and Value ($\mathbf{V}$), which are obtained by applying three separate FC layers to the input feature matrix $\mathbf{X}$:

$\mathbf{Q}=\mathbf{X}\mathbf{W}_{Q};\;\mathbf{K}=\mathbf{X}\mathbf{W}_{K};\;\mathbf{V}=\mathbf{X}\mathbf{W}_{V}$   (1)

where $\mathbf{W}_{Q}$, $\mathbf{W}_{K}$ and $\mathbf{W}_{V}$ are the weight matrices of the FC layers for the Query, Key and Value respectively. Then, the attention weights are computed using the softmax function:

$\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_{K}}}\right)\mathbf{V}$   (2)

where $d_{K}$ is the dimensionality of the Key vectors. The result is a weighted sum of the Value vectors, where the weights represent the attention. The results computed by the individual attention heads are then concatenated and passed through another FC layer for feature fusion. Finally, the output is stretched to the desired dimension with a feed-forward layer.
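As an illustration of Eqs. (1) and (2), here is a small self-contained sketch of one SA layer written directly from the definitions above. The feature dimensions in the example are placeholders, not the actual AEGIS-Net hyperparameters.

```python
# Illustrative from-scratch SA layer following Eqs. (1) and (2); dimensions are placeholders.
import math
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    def __init__(self, dim, out_dim, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d_k = num_heads, dim // num_heads
        # FC layers producing Query, Key and Value (Eq. 1).
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.fuse = nn.Linear(dim, dim)       # fuses the concatenated heads
        self.ffn = nn.Linear(dim, out_dim)    # feed-forward layer stretching to out_dim

    def forward(self, x):                      # x: (B, N, dim)
        B, N, _ = x.shape
        # Split into heads: (B, num_heads, N, d_k).
        def heads(t): return t.view(B, N, self.h, self.d_k).transpose(1, 2)
        q, k, v = heads(self.w_q(x)), heads(self.w_k(x)), heads(self.w_v(x))
        # Scaled dot-product attention (Eq. 2).
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)   # concatenate heads
        return self.ffn(self.fuse(out))

# Example: re-weigh 2048 local features of width 256 and keep the width at 256.
feats = torch.randn(1, 2048, 256)
print(SelfAttentionLayer(256, 256)(feats).shape)   # torch.Size([1, 2048, 256])
```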

2.3 Multi-Stage Learning

To force the encoder to learn actual scene semantics, we follow our previous work [13] and adopt a two-stage training process for our AEGIS-Net. In the first stage, we train the encoder and decoder on the semantic segmentation task, hence the names semantic encoder and semantic decoder. Unlike many other works that use the explicitly segmented output for place recognition and localization [20, 21], we favor the implicit features from the semantic encoder. Therefore, we call these features the implicit semantic features.

Then, the attention-guided feature embedding is trained in the second stage with the encoder's weights fixed. Following the approach of PointNetVLAD [2], we employ metric learning with the lazy quadruplet loss. The input to the model is a tuple of point clouds $\mathcal{T}=(P^{anc}, P^{pos}, P^{neg}, P^{*})$. Inside the tuple, $P^{anc}$ is the anchor point cloud representing the current place. $P^{pos}$ is the set of positive point clouds, which are taken from the same place as the anchor but from different viewpoints. $P^{neg}$ is the set of negative point clouds taken from places different from the anchor, i.e. from the same room but a different place or from a completely different room. Finally, $P^{*}$ is an additional negative point cloud that is negative to all other point clouds in the input tuple. With the input tuple constructed, the lazy quadruplet loss can be computed as:

$\mathcal{L}_{LazyQuad}(\mathcal{T})=\max_{i,j}\left([\alpha+\delta^{pos}_{i}-\delta^{neg}_{j}]_{+}\right)+\max_{i,k}\left([\beta+\delta^{pos}_{i}-\delta^{*}_{k}]_{+}\right)$   (3)

where $[\dots]_{+}$ denotes the hinge loss, $\alpha$ and $\beta$ are the margins, and $\delta^{pos}_{i}$, $\delta^{neg}_{j}$ and $\delta^{*}_{k}$ are the Euclidean distances between the global descriptors of the respective point clouds.
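For reference, below is a minimal sketch of Eq. (3), assuming the global descriptors of the anchor, the positives, the negatives and the extra negative have already been computed by the network; the function name and tensor layout are illustrative.

```python
# Illustrative implementation of the lazy quadruplet loss of Eq. (3).
import torch

def lazy_quadruplet_loss(d_anc, d_pos, d_neg, d_star, alpha=0.5, beta=0.2):
    """d_anc: (D,) anchor descriptor; d_pos: (P, D) positives; d_neg: (N, D) negatives;
    d_star: (D,) descriptor of the extra negative P*."""
    delta_pos = torch.norm(d_pos - d_anc, dim=1)      # distances anchor <-> positives
    delta_neg = torch.norm(d_neg - d_anc, dim=1)      # distances anchor <-> negatives
    delta_star = torch.norm(d_neg - d_star, dim=1)    # distances negatives <-> P*
    # max over all (i, j) pairs with hinge [.]_+ and margin alpha
    term1 = torch.relu(alpha + delta_pos.unsqueeze(1) - delta_neg.unsqueeze(0)).max()
    # max over all (i, k) pairs with margin beta and the extra-negative distances
    term2 = torch.relu(beta + delta_pos.unsqueeze(1) - delta_star.unsqueeze(0)).max()
    return term1 + term2
```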

3 Experiments and Results

Fig. 3: Top-1 retrieval examples. Red checkmarks stand for successful retrievals and red crosses for failed ones. The query entries are visualized as RGB point clouds and the retrieved database entries are visualized in the same form as their inputs.

Dataset: We use the ScanNetPR dataset [13], which is derived from the famous ScanNet dataset [15], to evaluate the proposed AEGIS-Net. ScanNetPR consists of 1,613 scans of 807 different rooms, with scan lengths varying according to the size of the room, the same as ScanNet. Overall, the dataset contains 35,102 training keyframes selected from 1,201 scans of 565 rooms, 9,693 validation keyframes selected from 312 scans of 142 rooms, and 3,608 testing keyframes covering the last 100 scans of 100 rooms. Each keyframe carries an RGB point cloud and a corresponding image to accommodate various input needs.

Training Setup: In training stage 1, we follow the KP-FCNN [18] paper and adopt the stochastic gradient descent (SGD) optimizer to train the semantic encoder and semantic decoder for 50 epochs. All the hyperparameters in this training stage are exactly the same as those of the SLAM segmentation setup in KP-FCNN.

Then, in training stage 2, we first define the criterion for determining positive and negative point clouds. Considering the size of typical rooms, two point clouds are positive if they share the same room ID and the distance between their centroids is less than 2 m; otherwise they are negative. However, we set the negative distance threshold to 4 m to maximize their differences. Additionally, constrained by hardware memory limits, we opt for 2 positives and 6 negatives when constructing input tuples. The attention-guided feature embedding is trained for 20 epochs using the Adam optimizer [22] with an initial learning rate of 0.0001 and learning rate decay. Weight decay is applied to counter overfitting. Mirroring typical SA layer and NetVLAD layer hyperparameters [2, 4, 19], we choose 4 attention heads and 64 VLAD clusters. To balance representation power and retrieval efficiency, we set the dimension of the final global descriptor to 256. Finally, the margin parameters of the lazy quadruplet loss are set to $\alpha=0.5$ and $\beta=0.2$.
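As an illustration, the following sketch encodes this pairing criterion, assuming each keyframe carries a room ID and a point cloud centroid in meters. Treating same-room pairs that fall between the 2 m and 4 m thresholds as neither positive nor negative is our reading of the criterion and may differ from the actual implementation.

```python
# Illustrative labeling of keyframe pairs for tuple construction (assumed, not official code).
import numpy as np

def pair_label(room_a, room_b, centroid_a, centroid_b,
               pos_thresh=2.0, neg_thresh=4.0):
    """Label a keyframe pair as 'positive', 'negative' or 'ignore'."""
    dist = np.linalg.norm(np.asarray(centroid_a) - np.asarray(centroid_b))
    if room_a == room_b and dist < pos_thresh:
        return "positive"      # same room, centroids closer than 2 m
    if room_a != room_b or dist > neg_thresh:
        return "negative"      # different room, or farther than 4 m apart
    return "ignore"            # same room, 2-4 m apart: assumed neither positive nor negative
```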

Table 1: Quantitative Results. Average Recall Rate (%).
Methods R@1 R@2 R@3
AEGIS-Net (Ours) 65.09 74.26 79.06
CGiS-Net [13] 61.12 70.23 75.06
SIFT [16] + BoW [17] 16.16 21.17 24.38
NetVLAD [1] 21.77 33.81 41.49
PointNetVLAD [2] 5.31 7.50 9.99
MinkLoc3D [4] 3.32 5.81 8.27
Indoor DH3D [9] 16.10 21.92 25.30
CGiS-Net-20 [13] 56.82 66.46 71.74
AEGIS-Net (w/o attention) 55.13 66.19 71.95

Results and discussions: Table 1 and Fig. 3 showcase the average recall rates and top-1 retrievals of various place recognition methods respectively, with the first row in the table and the second column in the figure highlighting the performance of our proposed AEGIS-Net. Remarkably, AEGIS-Net leads across both quantitative and qualitative results. Specifically, for the quantitative results, our AEGIS-Net records scores of 65.09%, 74.26% and 79.06% for top-1 recall (R@1), top-2 recall (R@2) and top-3 recall (R@3) respectively. When juxtaposed with CGiS-Net [13], which is our previous work in its default setting, AEGIS-Net displays notable superiority with a consistent 4% improvement, suggesting that the enhancements introduced in AEGIS-Net have significantly bolstered its performance. Apart from that, an image-based traditional method which combines SIFT [16] with bag-of-words (BoW) [17] is chosen as a baseline model. Following that, more advanced learning-based methods whose official implementations are available are chosen for comparison. In particular, NetVLAD [1] with image input, PointNetVLAD [2] and MinkLoc3D [4] with point cloud input, and indoor DH3D [9] with RGB point cloud input are re-trained and evaluated on the ScanNetPR dataset with their default parameters.
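For clarity, the recall rates reported here can be computed from the generated global descriptors as sketched below, assuming a boolean ground-truth match matrix between query and database entries; the function and variable names are illustrative.

```python
# Illustrative recall@K evaluation over global descriptors.
import numpy as np

def recall_at_k(query_desc, db_desc, is_match, ks=(1, 2, 3)):
    """A query counts as recalled if any of its K nearest database descriptors
    (Euclidean distance) is a true match.

    query_desc: (Q, D) array, db_desc: (M, D) array,
    is_match:   (Q, M) boolean array of ground-truth query/database matches."""
    dists = np.linalg.norm(query_desc[:, None, :] - db_desc[None, :, :], axis=-1)
    order = np.argsort(dists, axis=1)                        # nearest database entries first
    recalls = {}
    for k in ks:
        topk = order[:, :k]                                  # indices of the K nearest entries
        hits = [is_match[q, topk[q]].any() for q in range(len(query_desc))]
        recalls[k] = 100.0 * np.mean(hits)                   # average recall rate in %
    return recalls
```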

Notably, while NetVLAD [1] exhibits a marked improvement over the baseline, indicating the progression of early deep learning models, it still lags considerably behind AEGIS-Net. On the other hand, although PointNetVLAD [2], MinkLoc3D [4] and DH3D [10] prevail in outdoor settings, they surprisingly underperform indoors, even with the indoor modification [9]. This vast disparity in performance demonstrates that color or geometry features alone are not enough for indoor place recognition, further accentuating the efficacy and robustness of our AEGIS-Net and setting it apart in the realm of indoor place recognition.

Ablation study: To further prove the effectiveness of the attention mechanism of our AEGIS-Net, ablation experiments are performed, with the results shown in the bottom two rows of Table 1. From the efficiency perspective, our previous work CGiS-Net converges after 60 epochs of training, and the training process takes around 3 weeks. When the training is limited to 20 epochs, CGiS-Net still needs 7 days at the cost of a drastic drop in performance, shown in row "CGiS-Net-20". Our AEGIS-Net, in contrast, reaches convergence by the 20th training epoch, and the training time is largely reduced to 4 days. Beyond training speed, the use of attention in the model also delivers superior performance. Specifically, the attention-equipped model surpasses its attention-free counterpart by 10% for top-1 recall and 8% for top-2 and top-3 recall, as shown in the row "AEGIS-Net (w/o attention)". This improvement is not only a testament to the effectiveness of the attention mechanism but also indicates its role in helping the network select more discriminative and relevant features for place description.

4 Conclusion

We have presented AEGIS-Net for indoor place recognition, which uses attention guidance to select the color, geometry and semantic features that best describe a particular place in indoor scenes. The network is trained and evaluated on the ScanNetPR dataset and achieves superior performance compared to state-of-the-art learning-based methods.

References

  • [1] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla and J. Sivic, “NetVLAD: CNN Architecture for Weakly Supervised Place Recognition,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1437-1451, 2018.
  • [2] M. A. Uy and G. H. Lee, “PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4470-4479, 2018.
  • [3] S. Hausler, S. Garg, M. Xu, M. Milford and T. Fischer, “Patch-NetVLAD: Multi-Scale Fusion of Locally-Global Descriptors for Place Recognition,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14141-14152, 2021.
  • [4] J. Komorowski, “MinkLoc3D: Point Cloud Based Large-Scale Place Recognition,” in IEEE Winter Conference on Applications of Computer Vision, pp. 1789-1798, 2021.
  • [5] S. Garg, N. Suenderhauf and M. Milford, “Semantic–geometric visual place recognition: a new perspective for reconciling opposing views,” in The International Journal of Robotics Research, vol. 41, no. 6, pp. 573-598, 2022.
  • [6] W. Maddern, G. Pascoe, C. Linegar and P. Newman, “1 Year, 1000 km: The Oxford RobotCar Dataset,” in International Journal of Robotics Research, vol. 36, no. 1, pp. 3–15, 2017.
  • [7] M. Måns Larsson, E. Stenborg, L. Hammarstrand, M. Pollefeys, T. Sattler and F. Kahl, “A Cross-Season Correspondence Dataset for Robust Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9524-9534, 2019.
  • [8] E. Sizikova, V. K. Singh, B. Georgescu, et al., “Enhancing Place Recognition Using Joint Intensity - Depth Analysis and Synthetic Data,” in European Conference on Computer Vision Workshops, 2016.
  • [9] X. Yang, Y. Ming and A. Calway, “FD-SLAM: 3-D Reconstruction Using Features and Dense Matching,” in IEEE International Conference on Robotics and Automation, 2022.
  • [10] J. Du, R. Wang and D. Cremers, “DH3D: Deep Hierarchical 3D Descriptors for Robust Large-Scale 6DoF Relocalization,” in European Conference on Computer Vision, 2020.
  • [11] F. Taubner, F. Tschopp, T. Novkovic, R. Siegwart and F. Furrer, “LCD – Line Clustering and Description for Place Recognition,” in International Conference on 3D Vision, pp. 908-917, 2020.
  • [12] M. Y. Chang, S. Yeon, S. Ryu and D. Lee, “SpoxelNet: Spherical Voxel-based Deep Place Recognition for 3D Point Clouds of Crowded Indoor Spaces,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 8564-8570, 2020.
  • [13] Y. Ming, X. Yang, G. Zhang and A. Calway, “CGiS-Net: Aggregating Colour, Geometry and Implicit Semantic Features for Indoor Place Recognition,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6991-6997, 2022.
  • [14] J. L. Schönberger, M. Pollefeys, A. Geiger and T. Sattler, “Semantic Visual Localization” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6896-6906, 2018.
  • [15] A. Dai, A. X. Chang, M. Savva, et al., “ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 2432-2443, 2017.
  • [16] D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” in International Journal of Computer Vision, vol 60, no. 2, pp. 91-110, 2004.
  • [17] J. Sivic and A. Zisserman, “Efficient Visual Search of Videos Cast as Text Retrieval,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 591-606, 2009.
  • [18] H. Thomas, C. R. Qi, J. Deschaud, et al., “KPConv: Flexible and Deformable Convolution for Point Clouds,” in IEEE/CVF International Conference on Computer Vision, pp. 6410-6419, 2019.
  • [19] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is All You Need,” in Advances in Neural Information Processing Systems, 2017.
  • [20] B. Ramtoula, R. de Azambuja and G. Beltrame, “CAPRICORN: Communication Aware Place Recognition Using Interpretable Constellations of Objects in Robot Networks,” in IEEE International Conference on Robotics and Automation, pp. 8761-8768, 2020.
  • [21] Y. Ming, X. Yang and A. Calway, “Object-Augmented RGB-D SLAM for Wide-Disparity Relocalisation,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2180-2186, 2021.
  • [22] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in International Conference on Learning Representations, 2015.