SGL: Structure Guidance Learning for Camera Localization

Xudong Zhang¹, Shuang Gao¹, Xiaohu Nan², Haikuan Ning³, Yuchen Yang¹, Yishan Ping¹,
Jixiang Wan^1,4, Shuzhou Dong¹, Jijunnan Li¹, Yandong Guo¹ ¹ OPPO Research Institute, Shanghai, China. {zhangxudong, gaoshuang, yangyuchen, pingyishan, wanjixiang, dongshuzhou1, lijijunnan, guoyandong}@oppo.com² Xiaohu Nan is with University of Chinese Academy of Sciences, Beijing, China.³ Haikuan Ning is with Wayz Intelligent Manufacturing Technology Co. Ltd., Wuxi, China.⁴ Jixiang Wan is also with Department of Automation, Shanghai Jiao Tong University, Shanghai, China.

Abstract

Camera localization is a classical computer vision task that serves various Artificial Intelligence and Robotics applications. With the rapid developments of Deep Neural Networks (DNNs), end-to-end visual localization methods are prosperous in recent years. In this work, we focus on the scene coordinate prediction ones and propose a network architecture named as Structure Guidance Learning (SGL) which utilizes the receptive branch and the structure branch to extract both high-level and low-level features to estimate the 3D coordinates. We design a confidence strategy to refine and filter the predicted 3D observations, which enables us to estimate the camera poses by employing the Perspective-n-Point (PnP) with RANSAC. In the training part, we design the Bundle Adjustment trainer to help the network fit the scenes better. Comparisons with some state-of-the-art (SOTA) methods and sufficient ablation experiments confirm the validity of our proposed architecture.

I INTRODUCTION

Camera Localization aims to estimate the accurate position and orientation with the input image, which is a crucial and fundamental computer vision task in many Artificial Intelligence and Robotics applications, such as Autonomous Driving and AR/VR. Camera Localization also works as a key component in the Visual Simultaneous Localization and Mapping (V-SLAM) and Structure From Motion (SFM) algorithms. Traditionally, structure-based camera localization methods rely on the multi-view geometry theory to calculate the 6-DoF poses by extracting and matching the feature points between the input images and the retrieved database images. Generally, Image Retrieval Convolution Networks, e.g., NetVLAD [1], DELF [2] and DELG [3], are widely adopted to obtain the high-quality global features of every image to accomplish the retrieval task. Moreover, in the past decades, numerous local-feature descriptors including hand-craft ones, e.g., SIFT [4], SURF [5] and ORB [6], and learning-based ones, e.g., SuperPoint [7], R2D2 [8] and DISK [9] have been proposed to get more precise and stable 2D-3D matches. Commonly, the corresponding 3D points are generated by SFM, V-SLAM, or additional depth sensors. Given the above, 6-DoF Camera poses will be recovered by PnP [10] with RANSAC [11] ultimately. The described traditional structure-based camera localization pipeline has attracted dominated research interests in recent years, while there are still some challenging corner cases remaining challenging, such as illumination changes, texture-less scenes and repetitive structures.

Due to the recent advances of Deep Neural Networks (DNNs) in the computer vision tasks, a variety of researches report the end-to-end camera localization approaches, which estimate the camera poses directly through the neural networks inference without the additional 3D model or depth information. Different from structure-based methods, end-to-end ones replace the mapping stage (e.g. SFM) with the network training process. Essentially, during the training, the models encode the scenes and represent them with the trained weights instead of 3D point clouds in structure-based methods. With the input image and intrinsic parameters, the network produces the final estimated pose through network inference. According to the training pattern, end-to-end also can be separated into two branches, metrics regression (directly regressing the position and orientation vectors) and scene coordinates prediction (predicting the corresponding 3D positions with respect to the 2D key-points and estimate the poses with PnP). Different architectures (shown in Fig. 1) lead to distinct localization performances. The common target of most of the camera localization methods is to address the most accurate positions and orientations.

Refer to caption — Figure 1: Demonstration of different camera localization architectures. Upper: Classical Structure-based camera localization pipeline. SFM and Image Retrieval part take charge of getting image pairs and 3D point clouds. By applying the local feature matching method, feature correspondences are established, leading to 2D-3D matches. Ultimately, PnP + RANSAC will ensure us the stable estimated pose. Lower left: Metrics Regression style end-to-end localization pipeline. In this branch, DNNs are trained only to regress the metrics (position and orientation) without explicit relations to 3D structure information. Lower Right: Scene Coordinate Prediction style end-to-end localization pipeline. This branch of methods applies DNNs to predict key-points 3D position in the global frame, and use PnP + RANSAC to estimate the final pose. In the model training stage, reprojection error loss is favoured usually.

In this paper, we propose an end-to-end framework camera localization method, named as Structure Guidance Learning (SGL) Network. Basically, we take the advantage of DNNs to estimate the key-points 3D positions of the input image in the global coordinate frame, therefore producing the 2D-3D correspondences. Consequently, we are able to calculate the camera poses by PnP and RANSAC. In order to pursue the high performance, we not only carefully design and tune the network architecture, but also refer to the experience and technique in structure-based localization approaches like Bundle Adjustment (BA) [12] and Key-point filtering [13], and merge them into our network training pipeline. Hence, our proposed method holds both advantages of structure-based and end-to-end approaches, and outperforms some SOTA camera localization methods on open-source datasets. Our contributions are summarised as follows:

1. We propose an end-to-end scheme camera localization method, called Structure Guidance Learning Network (SGL). With sufficient experiments, our method outperforms some SOTA ones on the prevailing open-source indoor and outdoor localization datasets, including Microsoft 7Scenes [14] and Cambridge Landmarks Dataset [15].

2. We apply the bilateral network structure to extract and combine the high-level and low-level visual features. In the confidence branch, we format the confidence map and select the high-quality key-points and conduct PnP to calculate the pose afterwards. By merging the confidence branch with the structure branch, our model functions as a point-wise attention model [16] which succeed in reaching the high accuracy performances.

3. Inspired by the structural idea in traditional BA, we develop a novel training technique that feeds back the reprojection error of the retrieved images and their key-point correspondences. Accordingly, we utilize the additional retrieval and image-matching model to help us to make the end-to-end network fit the scenes better.

II Related works

In this section, we discuss the related works on the different camera localization frameworks, including the traditional structure-based methods, the end-to-end metrics regression ones and the end-to-end scene coordinates prediction ones.

Structure-based Localization. The core of structure-based method is how to obtain high-quality matching pairs between 2D features in the query image and 3D points in the SFM reconstruction model, and then get the camera poses according to the perspective geometry theory [17] [18]. [19] compares the descriptors of the query features and 3D points directly. It works well in some small-scale scenes, but faces some drawbacks in some large ones. [20] draws on the image retrieval techniques to narrow down the database, and obtains 2D-3D correspondences indirectly through matching 2D features between the query image and database images. It may cause losses of matching information due to different views angles. And [21] [22] extends the database volumes by generating the rendered synthetic images as the database. Active Search [23] implements the image retrieval process followed by the direct 2D-3D matching, so it covers both advantages. On these bases, the applications of some excellent feature extraction [7] [8], feature matching [24] and image retrieval [2] [3] methods can further improve the camera localization performances individually.

End-to-End Metrics Regression. Metrics regression localization methods aim to regress the camera pose directly from train images with the ground truth poses. PoseNet [25] [15] trains a CNN to regress the 6-DOF camera pose from a single RGB image without additional engineering or graph optimization. And it has been extended to video mode using LSTM to extract temporal information [26]. Later on, [27] uses a Bayesian CNN implementation to obtain an estimate of the localization uncertainty and improves the accuracy on the large-scale outdoor datasets. AtLoc [28] shows that the attention block can be used to force the network to focus on more geometrically robust objects and features, which can learn to reject dynamic objects and illumination conditions to achieve better performance. MapNet [29] exploits other sensory inputs like visual odometry and GPS in addition to images, and fuses them together for camera localization.

End-to-End Scene Coordinates Prediction. Unlike the metrics regression framework, the scene coordinates prediction models derive the poses indirectly by predicting the key-point positions in a global coordinate system through DNNs, and calculating the camera pose using traditional projection geometry theory. [30] [31] use Random Forests to infer the 3D scene coordinates corresponding to every pixel of the input image, and subsequently uses these coordinates to estimate the final camera pose via RANSAC. DSAC [32] learns from SCoRF [14] to predict scene coordinates and replaces the deterministic hypothesis selection by a probabilistic section to estimate camera pose. Any deep learning pipeline can use DSAC as a robust optimization component, like the role of RANSAC in the structure-based methods. [33] suggests a new, fully differentiable camera localization pipeline which has only one learnable component, a fully convolutional neural network for dense coordinate regression. SANet [34] presents a scene agnostic neural architecture for camera localization, where model parameters and scenes are independent from each other.

III Proposed Approach

III-A Overview

In this section, we introduce the proposed method in detail. Multiple researches [35, 36] have revealed that structure-based approaches and the scene coordinates prediction branch of the end-to-end ones yield higher localization accuracy than the metrics regression branch of the end-to-end ones. On the other hand, although a variety of learning-based components (e.g., local feature matching, image retrieval) are introduced into structure-based methods, the main calculation burden still lies on the classical epipolar geometry computing, especially on the mapping stage. Instead, our proposed method relies on the high-performance DNNs to predict the 3D points instead of geometry computing. Namely, we substitute the SFM model with the DNN to represent the scene structure. Fig. 2 indicates the whole architecture of our model together with the training procedure. Table I details the layer construction of both branches.

III-B Receptive Branch

Inspired by the previous solid works on the other computer vision topic [37, 38], we also split the network into the parallel format. Receptive branch $Recep(X)$ is supposed to deal with the deeper features including the context information of the target scene and the global information with the large receptive area obtained by the cascaded multiple convolution layers.

In this branch, we employ a Fully Convolution Network (FCN) style structure which is thoroughly testified by [33, 39] on the camera localization tasks. After the common pre-processing (e.g., normalization and dimensional reduction) of input image $X\in\mathbb{R}^{h\times w\times 3}$ with height $h$ and width $w$ , the shallow layers are applied to encode the visual clues into the 1/8 feature map. The following Convolution Neural Networks (CNNs) are responsible for recovering the deeper context information within the feature map. We also create residual shortcuts [40] to ensure the model convergence during the back-propagation process of the training stage. On the back-end of the Receptive Branch, there is the spatial block, which consists of four $1\times 1$ kernel convolution layers $Conv$ . Sufficient ablation studies in Sec. IV-D implicate that its complexity is highly related to the scene content and scale. Hence, we surmise that this block is capable of embedding the spatial information of the scene into the feature map and name it with spatial block.

III-C Structure Branch

Compared to the Receptive Branch, Structure Branch is responsible for extracting the low-level visual features. For that reason, the shallower network structure is more appropriate. Consequently, the 3-stacked convolution layers are adopted as the feature extractor. Meanwhile, this block is also occupied to form the confidence map $Conf_{u,v}\in\mathbb{R}^{64\times h/8\times w/8}$ for the further key-point refinement, therefore we mark this block as the confidence block. The channel dimension of 64 is used to remap the $8\times 8$ down-sampled window-size pixels. From the confidence map, we define the pixel-wise bias $B_{i,j}\in\mathbb{R}^{2}$ by

B_{u,v}=\mathop{\arg\max}\limits_{i,j}(Conf_{u,v}),

(1)

where $i\in[0,8)$ and $j\in[0,8)$ are the index of the $8\times 8$ patch, $u\in[0,w/8)$ and $v\in[0,h/8)$ are the path index within the image plane. Moreover, the confidence map also produces the absolute value of every sampled pixel, representing the reliability of the prediction. Considering the value disparity, we use the dynamic threshold (changing depending on the confidence outputs) for denying low-quality predictions in contrary to a fixed-threshold filter.

Thresh=\alpha\max_{u,v}(Conf_{u,v})+(1-\alpha)\min_{u,v}(Conf_{u,v}),

(2)

where $\alpha\in[0,1]$ is a tunable parameter, that adjusts the domination difference between datasets.

The output feature maps of the confidence block and the whole receptive branch are fused by the $concat$ operator. The following cascaded $Conv$ layers help to converge the channel-wise information and produce the grid-sampled 3-channel feature map $Coord\in\mathbb{R}^{3\times h/8\times w/8}$ representing the corresponding 3-axis global coordinate of the down-sampled pixels. However, the direct output of $Coord$ is corresponding to the sparse 2D pixel positions. Due to the resolution limits, the original output will face the wrong 2D position prediction problems. Consequently, we add an additional operator that helps us refine the 2D position on the image plane. The following equation depicts the details of the key-point position refinement and selection strategy, where $Coarse_{i,j}$ and $Fine_{i,j}$ are the original coarse 2D pixel positions and the refined positions. By using $Thresh$ we are able to filter the unreliable predictions.

Fine_{u,v}=\left\{\begin{aligned} Coarse_{u,v}+B_{u,v},\quad Conf_{u,v}\geq Thresh.\\ Null,\quad Conf_{u,v}<Thresh.\end{aligned}\right.

(3)

With predicted 2D points and corresponding 3D predictions, we are capable of conducting PnP + RANSAC to calculate the final camera poses.

TABLE I: Instantiation of the Receptive Branch and Structure Branch.

opr

means the operator name.

k

is the kernel size of the convolution layer.

c

is the output channel size.

s

is the stride size and

p

is the padding size. Input image sizes are different depending on the dataset, while we take

512\times 640

as an example. We use ReLU [41] as the non-linear activation layer after most of the convolution layers. After the Refinement block, we use the BReLU-1 [42] to control the value

\in[0,1)

	Receptive Branch						Structure Branch					Output Size
	opr	k	c	s	p		opr	k	c	s	p
Input	-	-	-	-	-	Input	-	-	-	-	-	$512\times 640$
R1	Conv+ReLU	3	32	1	1		Conv+ReLU	3	32	1	1	$512\times 640$
	Conv+ReLU	3	64	2	1	Confidence	Conv+ReLU	3	64	2	1	$256\times 320$
	Conv+ReLU	3	128	2	1	Block	Conv+ReLU	3	128	2	1	$128\times 160$
	Conv+ReLU	3	256	2	1		Conv+ReLU	3	256	2	1	$64\times 80$
R2	Conv+ReLU	3	256	1	1	S1	Conv+ReLU	1	512	1	0	$64\times 80$
	Conv+ReLU	1	256	1	0		Conv+ReLU	1	512	1	0
	Conv+ReLU	3	256	1	1		Conv	3	3	1	1
R3	Conv+ReLU	3	512	1	1
	Conv+ReLU	1	512	1	0	Refinement	Conv	1	128	1	0	$64\times 80$
	Conv+ReLU	3	512	1	1	Block	Conv+BReLU-1	1	64	1	0	$64\times 80$
	Conv	3	512	1	1
	Conv+ReLU	1	512	1	0	-		-				$64\times 80$
Spatial	Conv+ReLU	1	512	1	0
Block	Conv+ReLU	1	512	1	0
	Conv+ReLU	1	512	1	0

III-D Bundle Adjustment Training

Inspired by the traditional BA process in the SFM pipeline, we transfer BA into network training and further experiments also show its practicality. Fig. 3 thoroughly compare the algorithm components between SFM and the SGL training procedure. In the SFM pipeline, image retrieval model and local feature matching model are occupied to achieve the mapping database. With the following geometric methods, including triangulation, PnP, RANSAC and BA, we are able to produce the high accurate sparse point cloud.

Whereas, in the SGL training pipeline, triangulation is substituted by SGL model inference. SGL achieves to get sampled 3D positions in the global frame in the meantime produce the 2D-3D pixel-level pairs. Afterwards, we take the advantage of PnP + RANSAC to calculate the camera poses. BA trainer is the training strategy that fully uses the BA idea into model training and helps the model better fit the scene. Finally, we fuse the BA loss and the metric loss and back propagate the gradients into the model weights.

Given an image whose predicted pose is denoted by the rotation matrix $R$ and the translation vector $t$ , we can calculate the pose by PnP + RANSAC solver with 2D-3D matches. Relatively, the ground truth poses are rotation matrix $\overline{R}$ and translation vector $\overline{t}$ . The metric loss including angle loss $\mathcal{L}_{angle}$ and translation loss $\mathcal{L}_{trans}$ can be obtained by following two equations.

\mathcal{L}_{angle}=\frac{1}{\pi}\arccos{(\frac{1}{2}trace(\overline{R}^{T}\cdot R-1))}.

(4)

\mathcal{L}_{trans}=\|\overline{t}-t\|_{2}.

(5)

Subsequently, we define the BA loss $\mathcal{L}_{BA}$ by the following equation.

\mathcal{L}_{BA}=\sum\limits_{i=0}^{n}\left[{\|p_{i}-K\cdot\left(R_{i}\cdot y_{i}+t_{i}\right)\|_{2}+(\frac{1}{n}\sum\limits_{j=0}^{n}{y}_{j}-y_{i})}\right],

(6)

where $i$ is the index number in the total $n$ bundle images. $p_{i}$ represents the 2D positions on the image plane of the sampled pixels, while $y_{i}$ is the predicted corresponding 3D position vector in the global frame. $K$ is the $3\times 3$ camera intrinsic matrix. By applying the SGL predicted rotation $R_{i}$ and translation $t_{i}$ , we are able to calculate the reprojection errors of every image in the bundle set. Considering the instability of the SGL inference, we use the mean coordinates to compute the variance into the loss, in order to stabilize the network outputs. Consequently, we synthesize all the loss elements together, illustrated by the following equation.

\mathcal{L}_{all}=\mathcal{L}_{angle}+\beta\mathcal{L}_{trans}+\gamma\mathcal{L}_{BA},

(7)

where $\beta$ and $\gamma$ weigh the contributions of the different loss components.

IV Experimental Evaluations

IV-A Datasets

We quantify the visual localization performance of SGL on two public benchmark datasets Microsoft 7-Scenes [14] (indoor) and Cambridge Landmarks Dataset [15] (outdoor). Both datasets contain challenging conditions, including motion blur, large illumination change, dynamic obstacles and etc. Ground truth poses are provided by the higher accurate equipments, which are the Kinect RGB-D camera in 7-Scenes and the abundant SFM data in Cambridge Landmarks. Besides, both benchmark datasets are widely occupied for visual localization evaluation. Many state-of-the-art localization researches have reported their statistic results, which makes them the validate baselines.

IV-B Implementation Details

Hyperparameters $\beta=100$ and $\gamma=0.25$ are set fixed, while $\alpha$ varies from $0.70$ to $0.95$ on different subsets, which will be discussed in Sec. IV-D. The following experiments are conducted on the NVIDIA Tesla V100 GPU with the CUDA 10.2 and Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz.

IV-C Comparison with State-of-the-art Approaches

In quantitative analysis, we select two of the representative visual localization methods among each branch discussed before. Statistics show that structure-based and coordinates prediction methods perform better than metrics regression ones. Because the coordinates prediction methods benefit from the high performance of the DNN and achieve the better performance than structure-based ones. Furthermore, our proposed SGL network outperforms most methods on both indoor and outdoor benchmark datasets, with higher translation and rotation accuracy. In Fig. 4 we also demonstrate the localization trajectory of those datasets to prove the accuracy.

TABLE II: Localization accuracy comparison with SOTA approaches. The median translation error (cm) and angular error (°) of each subset are reported, but the dash (-) sign represents the failure of the certain method on the certain subset.

	HLoc [43]	MS-T. [44]	DSM [45]	DSAC* [39]	SGL
	S.	M. R.	C. P.
Chess	2.4, 0.77	11, 4.66	2, 0.71	1.9, 1.11	1.5, 0.52
Fire	1.8, 0.75	24, 9.6	2, 0.85	1.9, 1.24	1.8, 0.67
Heads	0.9, 0.59	14, 12.19	1, 0.85	1.1, 1.82	1.2, 0.78
Office	2.6, 0.77	17, 5.66	3, 0.84	2.6, 1.18	2.3, 0.68
Pumpkin	4.4, 1.15	18, 4.44	4, 1.16	4.2, 1.41	3.2, 0.94
Kitchen	4.0, 1.38	17, 5.94	4, 1.17	3.0, 1.70	2.9, 0.94
Stairs	5.1, 1.46	26, 8.45	5, 1.33	4.1, 1.42	3.1, 1.21

IV-D Ablation Studies and Discussions

Confidence Block. To illustrate the validity of the confidence block, we visualize the output of the confidence block in Fig. 5. The high values of the down-sampled map represent the high confidence, and vice versa. Eq. 2 ensures us to select the high-confidence points with high-quality observation. Qualitatively, with the help of key-point filter and refinement, the green points focus mostly on the objects with meaningful textures, such as architectures, cabinets and posters. Points on the objects with less texture information, like the sky, black screen and windows are neglected by our strategy. Moreover, although green points are formatted in a grid-like way because of the down-sample effect by the CNNs, some points that lie near the object edges are tunned to fit the contours, because of Eq. 1.

Correspondingly, due to the instability and inexplicability of the DNNs, the outputs of the confidence block vary greatly under different circumstances on a single sequence. Therefore instead of a fixed threshold, a dynamic threshold would avoid uneven remaining points circumstances caused by the unreasonable threshold. Fig. 6 provides the ablation study of the localization accuracy with respect to the tunable confidence ratio $\alpha$ . As $\alpha$ balances the number of SGL predictions, therefore medium number for preserved key-points shows better performance on localization precision. Correspondingly, Fig. 6 shows $\alpha$ that lies within 0.70 0.95 holds lower rotation and translation errors. Because too few observations will result in noise in the PnP process, and too large $\alpha$ leads to failures in excluding those low-quality predictions of SGL.

Bundle Adjustment Trainer. We occupy HOW [46] to achieve the images corresponding graph and SuperGlue [24] to acquire the local feature matches. Nevertheless, the original output of the local feature networks do not always share the same key-points with the SGL. As a result, we have to adjust the matching pairs to fit the SGL key-point predictions. In the BA training process, we fully trust the matches produced by SuperGlue. As shown in Fig 7, we downsample the original local feature matches produced by SuperGlue by $8\times 8$ pixels per patch. Namely, every patch only preserves a single match. On the other hand, SGL provides the key-point pixel-wise position in every patch. Fig. 7 shows the circumstances when the key-point is tunned, modified and even filtered during the training process.

Fig. 8 illustrates the training status of the SGL network. With retrieved $k$ candidates and local feature matches, we are able to establish the key-point level correspondences in every bundle of training images. SGL’s confidence module predicts the high-confidence patches and eliminates the low-confidence ones, which are represented by the dark mosaics in the mid and bottom rows of the Fig. 8. As the training carries forward, the confidence module will gradually fit the scene and locate the low texture patches to exclude them from the later training stage. Table. III interprets that with the BA trainer’s help the localization accuracy rises on both datasets.

Spatial Block. In the Sec. III we introduce the spatial block in the receptive branch, because the depth of this block is sensitive to the final localization performance in different scenes, according to the statistics in Table. III. By increasing the $Conv$ layer number of the spatial block, we can improve the precision of SGL on the most subsets of 7-Scenes (except Stairs), while on the Cambridge dataset such phenomenon goes oppositely. Typically, when we substitute the spatial block with a much more complicated block, Transformer [47], the accuracy polarized greatly on these scenes. But the trend of accuracy transfers synchronizes with the ablation study on the changes of Conv layer number. In some subsets of both datasets, the higher complexity of spatial block has great effects on precision, while its performance is not stable especially on Cambridge datasets. Through all the ablation experiments we attribute the sensitivity of the spatial block to the diversity of the scenes on different subsets.

TABLE III: Ablation study on BA Trainer and Spatial Block. S. represents the number of layers in the spatial block. The transformer block represents the network structure described in [47]. The median translation error (cm) and angular error (°) are reported.

	SGL (S.=4)	SGL (S.=4, w/o BA)	S.=2	S.=6	S.=Transformer
Chess	1.6, 0.77	1.8, 0.83	1.7, 0.71	1.6, 0.59	1.6, 0.52
Fire	1.9, 0.94	2.0, 0.95	1.9, 0.94	1.9, 0.79	1.8, 0.67
Heads	1.1, 0.84	1.2, 0.88	1.3, 0.72	1.2, 0.75	1.3, 0.79
Office	2.6, 0.74	2.8, 0.82	2.7, 0.78	2.5, 0.71	2.4, 0.68
Pumpkin	3.9, 1.04	3.9, 1.10	4.0, 1.09	3.7, 1.03	3.2, 0.94
Kitchen	3.8, 1.20	4.0, 1.24	3.9, 1.29	3.7, 1.12	3.0, 0.95
Stairs	3.9, 1.06	4.4, 1.22	4.1, 1.08	5.1, 1.54	11.2, 1.77
Great Court	27.5, 0.21	32.1, 0.25	30.6, 0.31	35.1, 0.39	59.0, 0.41
Kings College	12.9, 0.32	16.7, 0.32	16.2, 0.42	15.8, 0.41	20.4, 0.32
Old Hospital	22.9, 0.41	21.5, 0.60	20.2, 0.55	26.9, 0.60	30.7, 0.57
Shop Facade	5.1, 0.28	5.6, 0.27	5.2, 0.28	6.2, 0.31	37.5, 1.34
St M. Church	10.5, 0.44	12.1, 0.51	12.5, 0.59	14.8, 0.67	21.0, 0.75

V Conclusions

In this paper, we present an end-to-end framework camera localization network, SGL, which takes the advantage of DNN to predict the scene coordinates of the sampled key-points and utilizes PnP+RANSAC to estimate the predicted camera poses. To accomplish better accuracy, we divide the network into two branches, receptive branch and structure branch, in order to fuse the low-level feature and high-level one. Moreover, we design the key-point refinement strategy to escalate the precision of 3D key-point observations. In the training part, we introduce BA trainer to fully use the multi-view information to help the network fit the scene. We conduct sufficient experiments and compare our method to some SOTA ones. Statistics prove the validity and high precision of the proposed method. Our further works will concentrate on adapting SGL to the sequential inputs and exploring the probability to simplify the BA trainer by omitting the extra relyments on the image retrieval and the local feature matching.

References

[1] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad: Cnn architecture for weakly supervised place recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5297–5307, 2016.
[2] H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han, “Large-scale image retrieval with attentive deep local features,” in Proceedings of the IEEE international conference on computer vision, pp. 3456–3465, 2017.
[3] B. Cao, A. Araujo, and J. Sim, “Unifying deep local and global features for image search,” in European Conference on Computer Vision, pp. 726–743, Springer, 2020.
[4] C. Valgren and A. J. Lilienthal, “Sift, surf and seasons: Long-term outdoor localization using local features,” in 3rd European conference on mobile robots, ECMR’07, Freiburg, Germany, September 19-21, 2007, pp. 253–258, 2007.
[5] A. C. Murillo, J. J. Guerrero, and C. Sagues, “Surf features for efficient robot localization with omnidirectional images,” in Proceedings 2007 IEEE International Conference on Robotics and Automation, pp. 3901–3907, IEEE, 2007.
[6] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in 2011 International conference on computer vision, pp. 2564–2571, Ieee, 2011.
[7] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 224–236, 2018.
[8] J. Revaud, P. Weinzaepfel, C. De Souza, N. Pion, G. Csurka, Y. Cabon, and M. Humenberger, “R2d2: repeatable and reliable detector and descriptor,” arXiv preprint arXiv:1906.06195, 2019.
[9] M. J. Tyszkiewicz, P. Fua, and E. Trulls, “Disk: Learning local features with policy gradient,” arXiv preprint arXiv:2006.13566, 2020.
[10] X.-S. Gao, X.-R. Hou, J. Tang, and H.-F. Cheng, “Complete solution classification for the perspective-three-point problem,” IEEE transactions on pattern analysis and machine intelligence, vol. 25, no. 8, pp. 930–943, 2003.
[11] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[12] K. Konolige and M. Agrawal, “Frameslam: From bundle adjustment to real-time visual mapping,” IEEE Transactions on Robotics, vol. 24, no. 5, pp. 1066–1077, 2008.
[13] E. Royer, J. Chazalon, M. Rusiñol, and F. Bouchara, “Benchmarking keypoint filtering approaches for document image matching,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 343–348, IEEE, 2017.
[14] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in rgb-d images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2930–2937, 2013.
[15] A. Kendall, M. Grimes, and R. Cipolla, “Posenet: A convolutional network for real-time 6-dof camera relocalization,” in Proceedings of the IEEE international conference on computer vision, pp. 2938–2946, 2015.
[16] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141, 2018.
[17] Y. Zhou, H. Fan, S. Gao, Y. Yang, X. Zhang, J. Li, and Y. Guo, “Retrieval and localization with observation constraints,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 5237–5244, IEEE, 2021.
[18] S. Gao, J. Wan, Y. Ping, X. Zhang, S. Dong, Y. Yang, H. Ning, J. Li, and Y. Guo, “Pose refinement with joint optimization of visual points and lines,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2888–2894, IEEE, 2022.
[19] Y. Li, N. Snavely, and D. P. Huttenlocher, “Location recognition using prioritized feature matching,” in European conference on computer vision, pp. 791–804, Springer, 2010.
[20] A. Irschara, C. Zach, J.-M. Frahm, and H. Bischof, “From structure-from-motion point clouds to fast location recognition,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2599–2606, 2009.
[21] D. Sibbing, T. Sattler, B. Leibe, and L. Kobbelt, “Sift-realistic rendering,” in 2013 International Conference on 3D Vision - 3DV 2013, pp. 56–63, 2013.
[22] Z. Zhang, T. Sattler, and D. Scaramuzza, “Reference pose generation for visual localization via learned features and view synthesis,” arXiv preprint arXiv:2005.05179, vol. 5, no. 7, p. 9, 2020.
[23] T. Sattler, B. Leibe, and L. Kobbelt, “Efficient amp; effective prioritized matching for large-scale image-based localization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 9, pp. 1744–1756, 2017.
[24] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4938–4947, 2020.
[25] A. Kendall and R. Cipolla, “Geometric loss functions for camera pose regression with deep learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5974–5983, 2017.
[26] F. Walch, C. Hazirbas, L. Leal-Taixe, T. Sattler, S. Hilsenbeck, and D. Cremers, “Image-based localization using lstms for structured feature correlation,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 627–637, 2017.
[27] A. Kendall and R. Cipolla, “Modelling uncertainty in deep learning for camera relocalization,” in 2016 IEEE international conference on Robotics and Automation (ICRA), pp. 4762–4769, IEEE, 2016.
[28] B. Wang, C. Chen, C. X. Lu, P. Zhao, N. Trigoni, and A. Markham, “Atloc: Attention guided camera localization,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 10393–10401, 2020.
[29] S. Brahmbhatt, J. Gu, K. Kim, J. Hays, and J. Kautz, “Geometry-aware learning of maps for camera localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2616–2625, 2018.
[30] D. Massiceti, A. Krull, E. Brachmann, C. Rother, and P. H. Torr, “Random forests versus neural networks—what’s best for camera localization?,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 5118–5125, IEEE, 2017.
[31] X. Li, J. Ylioinas, and J. Kannala, “Full-frame scene coordinate regression for image-based localization,” arXiv preprint arXiv:1802.03237, 2018.
[32] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother, “Dsac-differentiable ransac for camera localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6684–6692, 2017.
[33] E. Brachmann and C. Rother, “Learning less is more-6d camera localization via 3d surface regression,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4654–4662, 2018.
[34] L. Yang, Z. Bai, C. Tang, H. Li, Y. Furukawa, and P. Tan, “Sanet: Scene agnostic network for camera localization,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 42–51, 2019.
[35] T. Sattler, Q. Zhou, M. Pollefeys, and L. Leal-Taixe, “Understanding the limitations of cnn-based absolute camera pose regression,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3302–3312, 2019.
[36] Z. Huang, H. Zhou, Y. Li, B. Yang, Y. Xu, X. Zhou, H. Bao, G. Zhang, and H. Li, “Vs-net: Voting with segmentation for visual localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6101–6111, 2021.
[37] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in Proceedings of the European conference on computer vision (ECCV), pp. 325–341, 2018.
[38] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang, “Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation,” International Journal of Computer Vision, vol. 129, no. 11, pp. 3051–3068, 2021.
[39] E. Brachmann and C. Rother, “Visual camera re-localization from rgb and rgb-d images using dsac,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[40] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
[41] A. F. Agarap, “Deep learning using rectified linear units (relu),” arXiv preprint arXiv:1803.08375, 2018.
[42] B. Hanin, “Universal function approximation by deep neural nets with bounded width and relu activations,” 2017.
[43] E. Brachmann, M. Humenberger, C. Rother, and T. Sattler, “On the limits of pseudo ground truth in visual camera re-localisation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6218–6228, 2021.
[44] Y. Shavit, R. Ferens, and Y. Keller, “Learning multi-scene absolute pose regression with transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2733–2742, 2021.
[45] S. Tang, C. Tang, R. Huang, S. Zhu, and P. Tan, “Learning camera localization via dense scene matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1831–1841, 2021.
[46] G. Tolias, T. Jenicek, and O. Chum, “Learning and aggregating deep local descriptors for instance-level recognition,” in European Conference on Computer Vision, pp. 460–477, Springer, 2020.
[47] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.