Deep auxiliary learning for visual localization using colorization task

Mi Tian⁺, Qiong Nie⁺, Hao Shen^∗, Xiahua Xia ⁺ indicates equal contributions. * indicates corresponding author. All authors are with Meituan group, Beijing, China. Email:

\{

tianmi02, nieqiong, shenhao04, xiahuaxia

\}

@meituan.com.

Abstract

Visual localization is one of the most important components for robotics and autonomous driving. Recently, inspiring results have been shown with CNN-based methods which provide a direct formulation to end-to-end regress 6-DoF absolute pose. Additional information like geometric or semantic constraints is generally introduced to improve performance. Especially, the latter can aggregate high-level semantic information into localization task, but it usually requires enormous manual annotations. To this end, we propose a novel auxiliary learning strategy for camera localization by introducing scene-specific high-level semantics from self-supervised representation learning task. Viewed as a powerful proxy task, image colorization task is chosen as complementary task that outputs pixel-wise color version of grayscale photograph without extra annotations. In our work, feature representations from colorization network are embedded into localization network by design to produce discriminative features for pose regression. Meanwhile an attention mechanism is introduced for the benefit of localization performance. Extensive experiments show that our model significantly improve localization accuracy over state-of-the-arts on both indoor and outdoor datasets.

Refer to caption — Figure 1: Schematic representation of our proposed deep auxiliary learning architecture for visual localization. Green stream is localization sub-network with a given RGB image and predicts 6DoF camera pose. Blue stream is colorization sub-network with corresponding grayscale photograph mapped from a RGB image and predicts pixel-wise color version. The feature representations of C-net are embedded into L-net.

I INTRODUCTION

Visual localization is the problem of estimating the camera pose of a given image related to a visual representation of a known scene. It is one of the most critical prerequisites for computer vision applications in robotics and autonomous driving. It provides fundamental support for other modules, e.g., perception and path planning by sensing where it is in the map.

Traditional localization methods estimate camera pose from 2D-3D matching between query images and a map by applying an Perspective-n-Point (PnP) in RANSAC loop. The map is predefined with scene information that is generally presented as sparse key points with 3D geometry information and feature descriptors. The early researches aim to improve the efficiency and robustness of such 2D-3D matching by using an intermediate step of keyframe matching [1, 2] or hashing algorithms [3]. Since these methods rely heavily on local features descriptors, which are sensitive to illumination and weather changes, some literatures propose to train more robust local features and descriptors such as [4] and [5]. In addition, high-level semantic information is also taken into account to score the matching of images and features by the semantic consistency for visual localization [6, 7]. Aforementioned methods are based on the traditional localization framework and convolutional neural networks (CNNs) are only used for learning semantics. The first end-to-end CNNs is proposed by [8] for localization. Different from traditional methods, CNN-based methods map the scene through training the network and predict corresponding 6-DoF camera pose of the given image at inference process. Thus, CNN-based methods can leverage execellent feature representation capability of deep learning for localization. Besides, using a network to map the scene makes these methods more scalable and memory-efficient than geometric methods that usually require a large database of landmarks. The main weakness is that the localization accuray is not comparable to prior geometric methods. Then many efforts are made to improve localization performance by taking advantages of more complex architectures [9], geometry-aware constraints [10] and semantic information [11], etc.

In this paper, we focus on investigating the impact of semantics on localization accuracy. State-of-the-art semantic segmentation methods incur heavy expense to collect enormous semantic annotations. Therefore we introduce a complementary self-supervised colorization task for the auxiliary learning of high-level semantic feature representations instead of using supervised semantic segmentation methods. The purpose of colorization is to convert grayscale photos to color. Colorization networks are performed in a self-supervised way as each image can be split into its intensity and its color, and the intensity can be used to predict its color. Besides, it is an intuitive sense that colorization system should be able to interpret the semantic composition of the scene (what is in the image: plants, sky, buildings, …) as well as localize objects (where things are) before assigning a probable color to it. Therefore, it is reasonable that semantic information is hidden in the feature representations which are trained for the purpose of colorization. In fact representation learning via colorization has been studied by several literatures [12, 13, 14, 15]. [16] even viewed it as a proxy task for self-supervised respresentation learning and generalized to other visual tasks well such as classification and segmentation. We build on these successes and aggregate the colorization task into our localization framework to leverage its high-level semantics and sel-supervised strategies. Moreover, an attention mechanism is introduced to boost network performance both in terms of localization results and training convergence by activating useful regions in features.

In summary, there are mainly three contributions in this paper:

•

We propose a novel deep auxiliary learning architecture that includes a localization sub-network and a complementary colorization sub-network. As our knowledge, it is the first time that automatic colorization is employed to help localization task by embedding high-level semantic representations.
•

The effect of attention mechanism on localization is discussed and a localization-specific attention strategy is designed to selectively activate salient objects and useful regions for localization.
•

We present extensive experimental evaluations on both indoor and outdoor datasets by comparing our approach with state-of-the-art methods. The results show that our method achieve excellent localization ability compared with other approaches.

The remainder of this paper is structured as follows: Sec. 2 gives a brief review about CNN-based localization, automatic colorization and attention strategies etc. The proposed architecture is presented in Sec. 3. Sec. 4 extensively evaluates our approach comparing with current state-of-the-arts. Finally, conclusions are drawn in Sec. 5.

II Related Work

CNN-based Localization. Unlike mapping processing of traditional methods aimed to structure a lot of feature points with geometry information and descriptor, and the more points, the better localization, CNN-based methods mapped scene information by learning weights of network and were more capable of handling large scale scenarios. Thus, many CNN-based methods for visual localization were explored and exploited in prior works. PoseNet [8] was the first proposed approach which utilized base architecture of GoogLeNet to directly regress 6DoF camera pose with a given RGB image. Following closely, Kendal et at. [17] utilized Bayesian CNNs to estimate the uncertainty of predicted pose. Melekhov et al. [18] proposed a symmetric encoder-decoder architecture. Walch et al. [19] and Clark et al. [9] introduced Long-Short Term Memory (LSTM) to exploit the advantage of feature learning from constraint of temporal smoothness of the video stream. Kendall et al. [20] and Brahmbhatt [10] proposed geometry-aware constraints as extra loss terms to boost training. Brachmann et al. Valada et al. [21], Lin et al. [22] and Xue et al. [23] joint learned visual odometry as an auxiliary to improve localization that were benefit from relative motion consistency of sequence images to penalize these contradictive pose prediction. Different from above works, Radwan et al. [11] firstly introduced multitask learning framework for visual localization, odometry estimate and semantic segmentation. Through introducing semantics from segmentation and exploiting the mutual benefit of each task, results of localization were more significantly improved than other methods. However, it should be noted that costly manual annotations and multi-sensor information are required that violates the advantages of CNN-based methods in large scale scenarios meanwhile limits generalization of the algorithm. Beside above methods predicting pose directly, Brachmann et al. [24, 25] and Zhou et al [26] predicted the scene 3D coordinates for pixels by training CNNs with ground truth scene coordinate and then calculated 6DoF pose by PnP with RANSAC. Laskar et at [27] and Ding et al [28] found the nearest images as key frames from database using retrieval model and regressed relative pose between query image and key frames using relative pose regression model. In this work, we aim to retain the edge of CNN-based methods and benefit from aggregating scene-specific high-level semantics into localization. To end this, we propose an auxiliary learning using self-supervised colorization to produce useful semantic feature representations

Colorization is a boutique computer graphics task that aims to recover a plausible color version for a given grayscale photograph. For the purpose of auxiliary learning, we mostly focus on CNN-based methods of colorization. Zhang et al. [15] implemented colorization using classification task to increase the diversity of results. Larsson et al. [14, 16] exploited both low-level and semantic representations to predict per-pixel color histograms. Isola et al. [29] and Cao et al. [12] introduced conditional GANs. Deshpande [13] exploited multi-modal colorization. In these works, colorization was viewed as an promising avenue for self-supervised visual representation learning and generalized well to other visual task such as objection classification, dection and segmentation that have been proven to be surprisingly useful. Considering two advantages that availability on capturing high-level visual representations to incorporate semantic parsing for scene understanding and freedom in training data to enable the capability of handling large scale scenarios, colorization task can serve visual localization perfectly. So, it is reasonable that using self-supervised colorization to auxiliary learn localization can overcome the drawback of prior CNN-based methods and be a breakthrough exploration in visual localization.

Attention mechanism is widely applied in CNN-based algorithms due to general performance improvements in variety of tasks, from object detection [30, 31], semantic segmentation [32, 33] to image captioning [34, 35]. The method proposed by [36] was a typically implementation by modeling channel-wise relationships in a computationally efficient manner and designed to enhance the representational power of basic modules throughout the network. The other work [37] focused on finding salient objects in images for sentiment classification by generating activation feature map. Considering when human localize where they are by visual perception, reference landmarks in the scene are more dependent. It suggests that enhancing feature representations of regions of interest is resonable, rather than that all the feature representations contribute equally for localization. In our work, following these methods, we introduce and discuss localization-specific attention mechanism to either optimize channel-wise information or activate regions of interest for the purpose of improving localization performance.

III Approach

Our proposed deep auxiliary learning architecture is depicted in Figure 1. It consists of two sub-nets, namely localization sub-network (L-net) and colorization sub-network (C-net). Given a RGB image and corresponding grayscale image for L-net and C-net respectively, absolute camera pose and color version are jointly predicted. As a complementary task, C-net is trained to learn colorization, but the goal is to provide scene-specific feature representations to help pose regression by aggregating high-level semantic into L-net. Beyond that, we design an attention mechanism by selectively activating meaningful regions for localization task. In the following, L-net and C-net as well as this localization-specific attention strategy will be presented in detail.

III-A Colorization Sub-Network

Following the typical CNN-based colorization algorithm [15], we perform our task in CIE Lab color space as distances in this space model perceptual distance. Our C-net predicts the corresponding a and b color channels $\tilde{Y}^{a,b}$ (we denote ground truth as $Y^{a,b}$ ) with lightness L channel $X^{L}$ as input. Different from [15] using a VGG-style architecture, we adopt the lightweight U-Net [38] that has four downsampling and followed four upsampling operations as well as skip connections employed between layers with same resolution. In order to search for a prediction of vibrant and realistic colorization to increase the diversity of colors, some algorithms think colorization as a classification problem and use cross-entropy loss. In our case, colorization is designed as an complementary task for auxiliary learning of localization, we do not overly entangle with colorization performance but focus on localization improvement by aggregating semantic representation. Thus, we define a $l_{1}$ Euclidean distance loss function between predicted and ground truth color version to constrain the weights updating of C-net.

L_{c}(X^{L})=\sum_{h,w}{\lVert{\tilde{Y}^{a,b}-Y^{a,b}}\rVert}_{1}

Some previous works evaluated how representation learning via colorization can serve for classification task. It was demonstrated that best classification performances are achieved when feature representations from the maximum downsampled layer are contributed into linear classifiers. From our C-net, we follow this conclusion and choose 16-times downsampled feature representations with respect to the input dimension in order to introduce high-level semantic information into localization. We define these representation as $M_{c}$ for the ease of following description.

III-B Localization Sub-Network

Given a RGB image $I\in\mathbb{R}^{H\times W\times 3}$ , L-net can predict the absolute pose $P=[x,q]$ , where $x\in\mathbb{R}^{3}$ denotes the translation and $q\in\mathbb{R}^{4}$ denotes the rotation in a unit quaternion representation. L-net consists of four parts, a backbone for feature extraction, a fusion operation, an attention module and a regressor to predict associated pose.

Backbone Following typical localization architecture, our backbone is designed with five standard residual blocks which have the same bottleneck structure and unit setting with ResNet-50 [39] architecture. The output feature representations of the Res5 block is 16-times downsampled with respect to input image, and we define it as $M_{l}$ .

Fusion operation The aim of this step is to fuse representations $M_{l}$ from localization backbone and semantic representations $M_{c}$ from colorization. Since L-net and C-net are two independent tasks with different input modalities and prediction purposes, it is crucial to intelligently select helpful information from $M_{c}$ and avoid introducing irrelevant feature representations which may have a negative effect on pose regression. To this end, we apply a typical linear weighting and activating operations after the cross-channel concatenation between $M_{l}$ and $M_{c}$ to active elements that are useful for pose regression. This fusion operation can be formulated as:

M_{fuse}=max(W\ast(M_{l}\oplus M_{c})+b,0)

Where $\oplus$ represents concatenation across channels; $W$ and $b$ are weight parameters updated by learning. $M_{fuse}$ is the feature representation after fusion. This fusion operation is completed by a convolution and activate layer in our experiments.

Attention module. As previously analyzed, different feature representations do not equally contribute to localization task. Thus, an attention module is designed upon prior similar works [36] and [37]. Reference to human visual localization mechanism, regions of interest consisting of salient objects or special texture ought to be enhanced in localization-specific attention module.

This attention operation can be described with two steps. The first squeeze and excitation process. Activating $M_{fuse}$ with an channel-wise vector $V=GMP(M_{fuse})$ , where $V\in\mathbb{R}^{C}$ obtained by global max pooling operation of squeezing spatial information into a global representation. And excitation is completed by per-channel multiplication between $M_{fuse}$ and $V$ .

M_{atten}=M_{fuse}\otimes(GAP(M_{SaE}))

M_{SaE}=M_{fuse}\otimes GMP(M_{fuse})

The next step aims to achieve an attention mask to weight different regions of interest in feature representations $M_{fuse}$ . Such mask is computed by applying a cross channel global average pooling operation on $M_{SaE}$ and is a single channel map with high response to regions of interest. As shown in Figure 2, some regions with salient objects, e.g., gates and windows of building, bicycles and stairs are activated to impel network to learn localization from more useful feature respresentations. Finaly, to not loss information, we fuse the original holistic representations $M_{fuse}$ with the region enhanced representations $M_{atten}$ via aforementioned fusion operation and use them for the final regression.

Regressor. After the attention module, pose regressor is designed with three inner fully connected layers with 2048 neurons, 3 neurons and 4 neurons respectively. The first one embeds feature representations into a high-dimensional vector and the last two separately output absolute position and orientation. In order to tackle the over parameterization problem of 4-dimension unit quaternion, we follow the method proposed by [10] and adopt a de-parameterization strategy during training process. The idea is to convert 4-dimentional unit quaternion $q=[u,\textbf{v}]$ ( $u$ is a scalar and v is a 3 dimensional vector) into 3 dimensions by the following formula:

\log{q}=\frac{\textbf{v}}{\|\textbf{v}\|}\cos{u^{-1}}

Finally, we define the loss function with typical $l_{2}$ Euclidean distance to constrain the weights updating of L-net. Meanwhile a hyper parameter is introduced to balance translation and rotation error.

L_{l}(I)={\lVert{\tilde{x}-x}\rVert}_{2}+\beta_{intra}\ast{\lVert{\tilde{q}-q}\rVert}_{2}

III-C Joint Learning

Our proposed architecture can be viewed from the perspective of multi-task learning as a soft-parameter sharing approach. The objective of auxiliary colorization task is to enable the model to learning representations that are helpful for the main localization task. Thus we adopt a joint training strategy that allows the model to learn beneficial representations and also to be more robust against random data noise. The final loss function is then defined as:

L=\beta_{inter}\ast L_{c}(X^{L})+L_{l}(I)

The former term corresponds to the colorization regularizer and the latter describes localization related loss function. Similar to aforementioned intra-task parameter, $\beta_{inter}$ is also a hyper parameter that keeps balancing inter-task loss terms. It can be trained online or preset.

IV Experiments

In this section, we evaluate our visual localization method compared with state-of-the-art methods both on indoor and outdoor datasets. The experimental results demonstrate a surprising performance improvement of our auxiliary learning method.

TABLE I: Comparison of median localization error with prior CNN-based models on outdoor datasets

	Cambridge landmarks					Oxford robotcar
Method	King’s College	Shop Facade	Church	Old Hospital	Average	Loop	DeepLoc
PoseNet15 [8]	1.66m, $4.86^{\circ}$	1.41m, $7.81^{\circ}$	2.45m, $7.96^{\circ}$	2.62m, $4.90^{\circ}$	2.04m, $6.23^{\circ}$	20.29m, $17.45^{\circ}$	2.42m, $3.66^{\circ}$
PoseNet16 [17]	1.74m, $4.06^{\circ}$	1.25m, $7.54^{\circ}$	2.11m, $8.38^{\circ}$	2.57m, $5.14^{\circ}$	1.92m, $6.28^{\circ}$	-	2.24m, $4.31^{\circ}$
SVS-Pose [40]	1.06m, $2.81^{\circ}$	0.63m, $5.73^{\circ}$	2.11m, $8.11^{\circ}$	1.50m, $4.03^{\circ}$	1.33m, $5.17^{\circ}$	-	1.61m, $3.52^{\circ}$
PoseNet17 [20]	0.99m, $1.06^{\circ}$	1.05m, $3.97^{\circ}$	1.49m, $3.43^{\circ}$	2.17m, $2.94^{\circ}$	1.43m, $2.85^{\circ}$	-	-
PoseNet17(geo) [20]	0.88m, $1.04^{\circ}$	0.88m, $3.78^{\circ}$	1.57m, $\mathbf{3.32^{\circ}}$	3.20m, $3.29^{\circ}$	1.63m, $2.86^{\circ}$	-	-
GPPoseNet [41]	1.61m, $2.29^{\circ}$	1.14m, $5.73^{\circ}$	2.93m, $6.46^{\circ}$	2.62m, $3.89^{\circ}$	2.08m, $4.59^{\circ}$	-	-
MapNet [10]	1.07m, $1.89^{\circ}$	1.49m, $4.22^{\circ}$	2.00m, $4.53^{\circ}$	1.94m, $3.91^{\circ}$	1.63m, $3.64^{\circ}$	9.84m, $\mathbf{3.96^{\circ}}$	-
MLFBPPose [42]	0.76m, $1.72^{\circ}$	0.75m, $5.10^{\circ}$	1.29m, $5.01^{\circ}$	1.99m, $2.85^{\circ}$	1.20m, $3.67^{\circ}$	-	-
Ours	0.72m, $\mathbf{0.55^{\circ}}$	0.73m, $\mathbf{3.46^{\circ}}$	1.65m, $3.34^{\circ}$	1.67m, $\mathbf{1.14^{\circ}}$	1.19m, $\mathbf{2.12^{\circ}}$	6.49m, $5.06^{\circ}$	0.48m, $\mathbf{2.76^{\circ}}$

IV-A Datasets

We benchmark performance of our method on three outdoor datasets: DeepLoc [11] , Cambridge landmark [8] and Oxford robotcar [43, 44]. DeepLoc is a large-scale urban outdoor localization dataset in an area spanning $110m\times 130m$ . It is collected by a robot moving along a loop road in an university campus. The captured images are with resolution of $1280\times 720$ and the ground truth is computed by a LiDAR-based SLAM algorithm. The other dataset King’s College, with an area of $140m\times 40m$ , is widely used for benchmarking localization tasks. It is collected around a landmark building and the ground truth pose is obtained by a Structure From Motion algorithm. Both datasets are very challenging since training data and test data are captured from different points in time and distinct walking paths. Moreover, significant urban clutter such as pedestrians and vehicles are present in King’s College dataset, and scene views vary largely within different paths of DeepLoc.

A well-known indoor dataset - Microsoft 7-Scene [45] is also used for evaluating our method. Seven difference scenes are recorded from a handheld Kinect RGB-D camera at $640\times 480$ resolution and ground truth camera pose is provided by KinectFusion algorithm. The existence of motion blur and weak texture under office environment makes this 7-scene dataset very challenging.

IV-B Implementation Details

Although preprocessing is widely used for many visual tasks, prior works demonstrate that typical preprocessing techniques such as cropping and mirroring can not yield performance improvements for localization. In some cases, they even negatively affect the pose accuracy. In our experiments, we only take already proved well-performanced preprocessing steps like resize input images into $320\times 240$ and normalized by pixel mean subtraction and standard deviation division operation.

We use the Adam solver for optimization with $\beta_{1}=0.9$ , $\beta_{2}=0.99$ and $\xi=10^{-10}$ . We initialize five residual blocks of L-net with weights of ResNet-50 pre-trained on ImageNet and remaining layers with Gaussian distribution. The balance parameters, $\beta_{intra}$ and $\beta_{inter}$ are set as 3 and 0.2 for all datasets. Then we train all layers with mini-batch size of 10. The learning rate of the backbone layers of L-net and all layers of C-net are initialized as 0.0003. All other layers of L-net are initialized as a learning rate of 0.001, and both learning rates decay by a power=0.9 every 10 epochs. The total number of iteration is about 150 epochs for two outdoor datasets and 80 epochs for indoor Microsoft 7Scene dataset. The work is implemented based on Tensorflow deep learning library. All the experiments are performed on a NVIDIA Titan V GPU with 16GB on-board memory.

IV-C Evaluation on Outdoor Datasets

The evaluation results of our method against prior CNN-based methods are shown in Table 1. PoseNet [8] and its variant BaysianPoseNet [17] as well as ours have input of a single image. SVS-Pose [40] leverages additional depth information to do data augmentation in 3D space. VLocNet [21] takes advantages of sequential information. And VLocNet++ [11], as a multi-task framework for odometry, localization and semantic segmentation learning, requires much more information such as successive images, semantic segmentations annotation and depth information. Results of PoseNet2 [20] on DeepLoc and VLocNet++ [11] on King’s College are absent since no publish data is available.

According to the median translation and rotation error indicated in Table 1, our method outperforms all other algorithms on King’s College. “ - ” denotes no data provided. On Deeploc dataset, our method also shows outstanding localization performance. Although VLocNet++ has slightly smaller localization error than ours, it should be noted that our method requires a single image for both training and inference, while VLocNet++ should be trained with tremendous semantic segmentation annotation cost and ground truth depth. Also, from camera pose trajactories for test sequences from DeepLoc and King’s College datasets shown in Figure 3, predicted results with our method have relatively few outliers and higher localization accuracy.

IV-D Colorization Results

Colorization results from our method are presented in Figure 4. Four groups of image colorization results are shown. Three each group, the first image is the original color image, the second image is grayscale and the last one is the colorization image. Apparently our colorization results are quite natural and indistinguishable from original image. In order to quantitatively evaluate colorization performance, we display statistical analysis in Table 2 by counting the percentage of pixel whose colorized value is close to the corresponding pixel in the original image. We define this closeness by judging if the color difference is smaller than a threshold (eg. 5 or 10). The results show that our colorization version is much similar to the original color. The main differences exist in some areas with bright and unfixing colors like sky, as the regression strategy used in our colorization has an averaging effect on color prediction.

TABLE II: Colorization results on outdoor datasets

Scene	DeepLoc		King’s College
Threshold	@5	@10	@5	@10
Accuracy	94.04 $\%$	97.84 $\%$	93.42 $\%$	95.87 $\%$

IV-E Ablation Studies

Since auxiliary learning and attention strategy are both introduced in our method, in this section we will separately analyze their influences on localization performance. Therefore we design three different network architectures for different ablative levels. At first, we evaluate the localization performance of the baseline L-net by cutting the representation input from C-Net and removing attention module. In the second network, we re-add C-Net for auxiliary learning to L-net but keep removing attention module. The third network is our whole architecture with both auxiliary learning and attention mechanism.

These three networks are trained with the same initialization and hyper-parameter setting, and tested on two outdoor datesets. The results are shown in Table 3. Apparently our introduced auxiliary learning strategy via colorization leads to significant performance improvements. This improvement is more evident from the test on DeepLoc. We believe that this is because environment situations in DeepLoc are more complicate than those in King’s College. Images from King’s College is mainly based on a landmark building. But DeepLoc is a normal urban scenario with various streets and buildings. Semantic information from this situation seems more difficult to be learned by localization network directly. In this case, semantic feature representations learned via colorization are especially beneficial for localization. Moreover, attention mechanism also contributes to localization task by proving accuracy improvement as well as training convergence acceleration. For instance, during the test in Deeploc, required training epoch for the whole network is 155, but it increases to 189 when we remove the attention module.

TABLE III: Comparison of median translation error (m) and rotation error (^∘) with different ablative levels on outdoor datasets

Scene

Auxiliary

learning

Attention

module

Median error

0.67m,

3.51^{\circ}

DeepLoc

✓

0.50m,

2.88^{\circ}

✓

0.48m,

\mathbf{2.76^{\circ}}

0.86m,

1.61^{\circ}

King’s College

✓

0.78m,

0.90^{\circ}

✓

0.72m,

\mathbf{0.55^{\circ}}

IV-F Comparison with Joint Semantic Segmentation

In this section we replace our colorization task by semantic segmentation and compare their auxiliary learning function on localization. This means that we keep C-net unchanged but its training purpose turns into semantic segmentation. Thus the input of C-net should be changed to a RGB image. And the output becomes into a pixel-wise segmentation prediction whose channel number is equal to the number of semantic annotation categories. This new network is then trained with DeepLoc dataset. Since semantic segmentation labels are available for only half training data in this dataset, we train the whole network (semantic segmenation task and localization task) with semantic segmentation labeled images. Then we fine tune L-net with rest semantic unlabeled images by fixing weights of C-net. The localization results after both stage are presented in Table 4.

According to the results shown in Table 4, it seems that colorization and semantic segmentation equally contribute when they are considered as auxiliary task for localization. But semantic segmentation suffers from additional annotation cost.

TABLE IV: Comparison of median translation error (m) and rotation error (^∘) with joint semantic segmentation method on DeepLoc dataset

	Stage-1	Stage-2	Ours
Median error	0.64, $3.18^{\circ}$	0.54, $2.70^{\circ}$	0.48, $\mathbf{2.76^{\circ}}$

Above evaluations are based on localization performance and segmentation is considered as auxiliary task. Here we present semantic segmentation performance and compare it with other segmentation algorithms. The experiment is performed on DeepLoc dataset and the Intersection over Union (IoU) score for major individual categories as well as the mean IoU are shown in Table 5. The segmentation images can be seen in Figure 5. From the results, our method outperforms AdaptNet [46] and DeepLabv3+ [47] as well as is very close to VLocNet++ that is attributed to the warping computation by successive images and depth information.

TABLE V: Comparison of semantic segmentation prediction with state-of-the-art methods on DeepLoc dataset

Approach	Sky	Road	Sidewalk	Grass	Vegetation	Building	MIoU
AdapNet [46]	94.65	98.98	64.97	82.14	84.48	87.68	78.59
DeepLabv3+ [47]	94.26	98.46	81.60	90.94	91.07	94.20	78.30
VLocNet++ [11]	95.84	98.99	80.85	88.15	91.28	94.72	80.44
Ours	95.49	98.42	81.30	91.26	91.55	94.60	79.45

TABLE VI: Comparison of median translation error and rotation error with various CNN-based methods on Microsoft 7-Scene dataset

Method	Chess	Fire	Heads	Office	Pumpkin	Kitchen	Stairs	Average
PoseNet15 [8]	0.32m, $8.12^{\circ}$	0.47m, $14.4^{\circ}$	0.29m, $12.0^{\circ}$	0.48m, $7.08^{\circ}$	0.47m, $8.42^{\circ}$	0.59m, $8.64^{\circ}$	0.47m, $13.8^{\circ}$	0.44m, $10.8^{\circ}$
PoseNet16 [17]	0.37m, $7.24^{\circ}$	0.43m, $13.7^{\circ}$	0.31m, $12.0^{\circ}$	0.48m, $8.04^{\circ}$	0.61m, $7.08^{\circ}$	0.58m, $7.54^{\circ}$	0.48m, $13.1^{\circ}$	0.47m, $9.81^{\circ}$
PoseNet17 [20]	0.14m, $4.50^{\circ}$	0.27m, $11.8^{\circ}$	0.18m, $12.1^{\circ}$	0.20m, $5.77^{\circ}$	0.25m, $4.82^{\circ}$	0.24m, $5.52^{\circ}$	0.37m, $10.6^{\circ}$	0.24m, $7.87^{\circ}$
PoseNet17(geo) [20]	0.13m, $4.48^{\circ}$	0.27m, $11.3^{\circ}$	0.17m, $13.0^{\circ}$	0.19m, $5.55^{\circ}$	0.26m, $4.75^{\circ}$	0.23m, $5.35^{\circ}$	0.35m, $12.4^{\circ}$	0.23m, $8.12^{\circ}$
GPPoseNet [41]	0.20m, $7.11^{\circ}$	0.38m, $12.3^{\circ}$	0.21m, $13.8^{\circ}$	0.28m, $8.83^{\circ}$	0.37m, $6.94^{\circ}$	0.35m, $8.15^{\circ}$	0.37m, $12.5^{\circ}$	0.31m, $9.95^{\circ}$
MapNet [10]	0.08m, $\mathbf{3.25^{\circ}}$	0.27m, $11.7^{\circ}$	0.18m, $13.3^{\circ}$	0.17m, $\mathbf{5.15^{\circ}}$	0.22m, $4.02^{\circ}$	0.23m, $\mathbf{4.93^{\circ}}$	0.30m, $12.1^{\circ}$	0.21m, $7.77^{\circ}$
ANNet [48]	0.12m, $4.30^{\circ}$	0.27m, $11.6^{\circ}$	0.16m, $12.4^{\circ}$	0.19m, $6.80^{\circ}$	0.21m, $5.20^{\circ}$	0.25m, $6.00^{\circ}$	0.28m, $8.40^{\circ}$	0.21m, $7.90^{\circ}$
MLFBPPose [42]	0.12m, $5.82^{\circ}$	0.26m, $12.0^{\circ}$	0.14m, $13.5^{\circ}$	0.18m, $8.24^{\circ}$	0.21m, $7.05^{\circ}$	0.22m, $8.14^{\circ}$	0.38m, $10.3^{\circ}$	0.22m, $9.29^{\circ}$
PoseNet20 [49]	0.09m, $4.39^{\circ}$	0.25m, $10.8^{\circ}$	0.14m, $12.6^{\circ}$	0.17m, $6.46^{\circ}$	0.19m, $5.91^{\circ}$	0.21m, $6.71^{\circ}$	0.26m, $11.5^{\circ}$	0.19m, $8.33^{\circ}$
Ours	0.09m, $3.77^{\circ}$	0.23m, $\mathbf{11.3^{\circ}}$	0.14m, $\mathbf{11.5^{\circ}}$	0.16m, $5.61^{\circ}$	0.18m, $\mathbf{3.80^{\circ}}$	0.20m, $6.35^{\circ}$	0.25m, $\mathbf{10.2^{\circ}}$	0.18m, $\mathbf{7.52^{\circ}}$

IV-G Evaluation on Indoor Dataset

In addition to the above outdoor datasets, experiments are also performed on indoor dataset Microsoft 7Scene in order to fully demonstrate our localization ability. We compare our method with other CNN-based methods like PoseNet [8], LSTM-Pose [19], PoseNet2 [20], and MapNet/MapNet+ [10]. The median translation and rotation errors for each method are shown in Table 6 meanwhile camera pose trajactories for test sequences of stairs scene are presented in Figure 3. Our method has similar accuracy with MapNet+, but outperforms other methods. To be mentioned that MapNet+ is an advanced version of MapNet by fusion additional information (eg. visual odometry) to update the weights of MapNet with self-supervised learning. Therefore a correct visual odometry algorithm is required for MapNet+. While our method is more independent as only baseline information (image and corresponding ground truth pose) is required. Moreover, our model is easier to converge than others. For example, our model takes about 80 epochs iterations to converging while MapNet takes 300 epochs iterations.

To sum up, our proposed auxiliary learning for visual localization via colorization exhibits significant accuracy-efficiency balance performance without requiring extra information. All these properties makes it suitable for many applications including indoor robots and outdoor autonomous vehicles.

V Conclusions

In most works, visual localization has been implemented as a independent task and performance improvement mainly comes from the optimization of network architecture and loss constraint. Some works take multi-task learning framework by introducing semantic segmentation with the goal of exploiting mutual benefit. But these semantic segmentation methods usually require tremendous manual annotations. Our proposed method aims to overcome above drawback. By taking self-supervised colorization as auxiliary task to learn high-level semantic representation for localization, our method, which requires no additional information, achieves surprising performance improvement on both outdoor and indoor datasets. Also an attention mechanism is introduced into our localization network and leads to further performance improvement on both accuracy and efficiency.

References

[1] T. Sattler, B. Leibe, and L. Kobbelt, “Efficient & effective prioritized matching for large-scale image-based localization,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 9, pp. 1744–1756, 2016.
[2] T. Sattler, T. Weyand, B. Leibe, and L. Kobbelt, “Image retrieval for image-based localization revisited.” in BMVC, vol. 1, no. 2, 2012, p. 4.
[3] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt, “Practical and optimal lsh for angular distance,” in Advances in neural information processing systems, 2015, pp. 1225–1233.
[4] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
[5] P.-E. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk, “From coarse to fine: Robust hierarchical localization at large scale,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 716–12 725.
[6] T. Shi, S. Shen, X. Gao, and L. Zhu, “Visual localization using sparse semantic 3d map,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 315–319.
[7] C. Toft, E. Stenborg, L. Hammarstrand, L. Brynte, M. Pollefeys, T. Sattler, and F. Kahl, “Semantic match consistency for long-term visual localization,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 383–399.
[8] A. Kendall, M. Grimes, and R. Cipolla, “Posenet: A convolutional network for real-time 6-dof camera relocalization,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2938–2946.
[9] R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen, “Vidloc: A deep spatio-temporal model for 6-dof video-clip relocalization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6856–6864.
[10] S. Brahmbhatt, J. Gu, K. Kim, J. Hays, and J. Kautz, “Geometry-aware learning of maps for camera localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2616–2625.
[11] N. Radwan, A. Valada, and W. Burgard, “Vlocnet++: Deep multitask learning for semantic visual localization and odometry,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4407–4414, 2018.
[12] Y. Cao, Z. Zhou, W. Zhang, and Y. Yu, “Unsupervised diverse colorization via generative adversarial networks,” in Joint European conference on machine learning and knowledge discovery in databases. Springer, 2017, pp. 151–166.
[13] A. Deshpande, J. Lu, M.-C. Yeh, M. Jin Chong, and D. Forsyth, “Learning diverse image colorization,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[14] G. Larsson, M. Maire, and G. Shakhnarovich, “Learning representations for automatic colorization,” in European Conference on Computer Vision. Springer, 2016, pp. 577–593.
[15] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in European conference on computer vision. Springer, 2016, pp. 649–666.
[16] G. Larsson, M. Maire, and G. Shakhnarovich, “Colorization as a proxy task for visual understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6874–6883.
[17] A. Kendall and R. Cipolla, “Modelling uncertainty in deep learning for camera relocalization,” in 2016 IEEE international conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 4762–4769.
[18] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu, “Image-based localization using hourglass networks,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 879–886.
[19] F. Walch, C. Hazirbas, L. Leal-Taixe, T. Sattler, S. Hilsenbeck, and D. Cremers, “Image-based localization using lstms for structured feature correlation,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 627–637.
[20] A. Kendall and R. Cipolla, “Geometric loss functions for camera pose regression with deep learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5974–5983.
[21] A. Valada, N. Radwan, and W. Burgard, “Deep auxiliary learning for visual localization and odometry,” in 2018 IEEE international conference on robotics and automation (ICRA). IEEE, 2018, pp. 6939–6946.
[22] Y. Lin, Z. Liu, J. Huang, C. Wang, G. Du, J. Bai, and S. Lian, “Deep global-relative networks for end-to-end 6-dof visual localization and odometry,” in Pacific Rim International Conference on Artificial Intelligence. Springer, 2019, pp. 454–467.
[23] F. Xue, X. Wang, S. Li, Q. Wang, J. Wang, and H. Zha, “Beyond tracking: Selecting memory and refining poses for deep visual odometry,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8575–8583.
[24] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother, “Dsac-differentiable ransac for camera localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6684–6692.
[25] E. Brachmann and C. Rother, “Learning less is more-6d camera localization via 3d surface regression,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4654–4662.
[26] L. Zhou, Z. Luo, T. Shen, J. Zhang, M. Zhen, Y. Yao, T. Fang, and L. Quan, “Kfnet: Learning temporal camera relocalization using kalman filtering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4919–4928.
[27] Z. Laskar, I. Melekhov, S. Kalia, and J. Kannala, “Camera relocalization by computing pairwise relative poses using convolutional neural network,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 929–938.
[28] M. Ding, Z. Wang, J. Sun, J. Shi, and P. Luo, “Camnet: Coarse-to-fine retrieval for camera re-localization,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2871–2880.
[29] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.
[30] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3156–3164.
[31] J. Liu, C. Gao, D. Meng, and A. G. Hauptmann, “Decidenet: Counting varying density crowds through attention guided detection and density estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5197–5206.
[32] A. W. Harley, K. G. Derpanis, and I. Kokkinos, “Segmentation-aware convolutional networks using local attention masks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5038–5047.
[33] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to scale: Scale-aware semantic image segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3640–3649.
[34] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6077–6086.
[35] L. Li, S. Tang, L. Deng, Y. Zhang, and Q. Tian, “Image caption with global-local attention,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[36] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
[37] J. Yang, D. She, Y.-K. Lai, P. L. Rosin, and M.-H. Yang, “Weakly supervised coupled networks for visual sentiment analysis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7584–7592.
[38] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
[39] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[40] T. Naseer and W. Burgard, “Deep regression for monocular camera-based 6-dof global localization in outdoor environments,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 1525–1530.
[41] M. Cai, C. Shen, and I. Reid, “A hybrid probabilistic model for camera relocalization,” 2019.
[42] X. Wang, X. Wang, C. Wang, X. Bai, J. Wu, and E. R. Hancock, “Discriminative features matter: Multi-layer bilinear pooling for camera localization,” in British Machine Vision Conference. York, 2019.
[43] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 Year, 1000km: The Oxford RobotCar Dataset,” The International Journal of Robotics Research (IJRR), vol. 36, no. 1, pp. 3–15, 2017.
[44] W. Maddern, G. Pascoe, M. Gadd, D. Barnes, B. Yeomans, and P. Newman, “Real-time kinematic ground truth for the oxford robotcar dataset,” arXiv preprint arXiv: 2002.10152, 2020. [Online]. Available: https://arxiv.org/pdf/2002.10152
[45] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in rgb-d images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2930–2937.
[46] A. Valada, J. Vertens, A. Dhall, and W. Burgard, “Adapnet: Adaptive semantic segmentation in adverse environmental conditions,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 4644–4651.
[47] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 801–818.
[48] M. Bui, C. Baur, N. Navab, S. Ilic, and S. Albarqouni, “Adversarial networks for camera pose regression and refinement,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
[49] M. Tian, Q. Nie, and H. Shen, “3d scene geometry-aware constraint for camera localization with deep learning,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 4211–4217.