Keeping an Eye on Things: Deep Learned Features for Long-Term Visual Localization
Abstract
In this paper, we learn visual features that we use to first build a map and then localize a robot driving autonomously across a full day of lighting change, including in the dark. We train a neural network to predict sparse keypoints with associated descriptors and scores that can be used together with a classical pose estimator for localization. Our training pipeline includes a differentiable pose estimator such that training can be supervised with ground-truth poses from data collected in 2016 and 2017 with multi-experience Visual Teach and Repeat (VT&R). We insert the learned features into the existing VT&R pipeline to perform closed-loop path following in unstructured outdoor environments. We show successful path following across all lighting conditions despite the robot’s map being constructed using daylight conditions. Moreover, we explore the generalizability of the features by driving the robot across all lighting conditions in new areas not present in the feature training dataset. In all, we validated our approach with 35.5 km of autonomous path-following experiments in challenging conditions.
Index Terms:
Localization, Deep Learning for Visual Perception, Vision-Based Navigation

I Introduction
Long-term navigation across large appearance change is a challenge for robots that use cameras for sensing. Visual Teach and Repeat (VT&R) tackles this problem with the help of multi-experience localization [1]. While the user manually drives the robot to teach a path, VT&R builds a visual map. Afterwards, it localizes live images to the map, allowing the robot to repeat the taught path autonomously. As the environment changes, VT&R stores data from each path repetition (experience). These intermediate experiences are used to bridge the appearance gap as localizing to the initial map becomes more challenging. With the help of machine learning, we aim to remove the need for such intermediate experiences for localization. Previously, we trained a neural network that would predict relative pose for localization directly from two input images [2]. The network was able to directly predict pose across large appearance change without using intermediate experiences. However, it learned the full pose estimation pipeline, including the parts that are easily solved with classical methods. Moreover, it did not generalize well to new paths not seen in the training data. These results align with Sattler et al. [3], who found that accuracy can be an issue for methods that regress global pose directly from sensor data.

In this paper, we choose a different strategy. Instead of predicting poses directly from images, we use learning for the perception front-end of long-term localization, while the remaining components of pose estimation are implemented with classical tools, see Figure 1. We insert the learned features into the existing VT&R pipeline and perform autonomous path following outdoors without needing intermediate bridging experiences. In particular, we teach a path (and build a visual map) during the daytime and repeat it across a full day, including after dark. In another experiment, we show that the learned features generalize by driving in new areas not present in the feature training dataset.
For training, we use data collected across lighting and seasonal change in 2016 and 2017 by a robot using multi-experience VT&R. We train a network to provide keypoints with associated descriptors and scores. Using a differentiable pose estimator allows us to backpropagate losses based on poses from the training data. The network is small enough to fit on a laptop GPU, fast enough to run in real time, and uses only two losses generated from pose ground truth.
Our method differs from others that use learned features for localization across environmental change, such as [4, 5, 6, 7, 8, 9], since they test localization standalone, while we include our features in the full VT&R system. Gladkova et al. [10] combine learned features for localization with visual odometry (VO), but only test on datasets. Since our goal is to learn features for the path-following task, we focus on closed-loop performance. Good localization performance on datasets does not guarantee successful real-life autonomous driving, which involves interaction of localization with VO and path-tracking control. Sun et al. [11] recently published closed-loop path following with learned features, but their lighting-change experiments are less extensive than ours.

II Related Work
There has been a wide range of work on deep learning for pose estimation. Chen et al. [12] provide a thorough survey of deep learning for mapping and localization. Some research has focused on learning pose for localization directly from images via absolute pose regression [13], relative pose regression [14], or combining learning for localization and VO [15]. Sattler et al. [3] note that learning pose directly from image data can struggle with accuracy.
More structure can be imposed on the learning problem by using features to tackle front-end visual matching, while retaining a classical method for pose estimation. A wide range of papers have been published on learning sparse visual features with examples in [16, 17, 18, 19, 20, 21, 22, 23, 24]. Another option is to learn descriptors densely for the image [25, 26, 5, 6, 7, 27], which can also be used for sparse matching [8, 9].
Although several papers have tackled descriptor learning for challenging appearance change, including seasonal change [4, 5, 6, 7, 8, 9], they test localization standalone. In our work, we include the learned features in the VT&R pipeline, where relative localization to a map is combined with VO for long-term path following. Moreover, we use mostly off-road data with fewer permanent structures, where appearance change can be more drastic than in urban areas.
Sarlin et al. [9] show good generalization to new domains with their learned features. For instance, they are able to use features trained on outdoor data for indoor localization. In [5, 6, 8], the authors generalize to unseen seasonal conditions on the same path, while Piasco et al. [4] train and test on different sections of a path. In our work, we drive three paths that are partially or entirely in areas not included in the training data, showing the generalizability of our features to novel environments.
The work from Gladkova et al. [10] is close to ours as they integrate learned features for localization in a VO pipeline. The localization poses are used as a prior for front-end tracking and integrated into back-end bundle adjustment, and global localization poses are fused with VO estimates. While the authors test on urban datasets with seasonal change, we test our approach in closed loop on a robot. Sun et al. [11] recently published work in which they drive a robot in closed loop using learned features in VT&R. They test day-to-night localization, though their experiments are less extensive and cover a shorter time range. Furthermore, they test lighting change in on-road areas such as a parking lot and by a church, while we also include more challenging off-road driving.
We base our learning pipeline and network architecture on the design presented by Barnes and Posner [28], which learns keypoints, descriptors, and scores for radar localization. We chose this method because it requires only a pose loss for supervision and provides a simple network design with a fully differentiable learning pipeline. In particular, a fixed number of keypoints is detected across the image, removing any need for pruning with techniques such as non-maximal suppression (NMS). Overall, network size and run-time are important for our real-time application. We made attempts at unsupervised alternatives, similar to [29], but struggled with estimation in the longitudinal direction and therefore opted for a supervised approach.
III Methodology
Our fully differentiable training pipeline takes a pair of source and target stereo images and estimates the relative pose, $\mathbf{T}_{ts}$, between their associated frames. We build on the approach for radar localization in [28] with some modifications to use a vision sensor. In short, the neural network detects keypoints and computes their descriptors and scores. We match keypoints from the source and target before computing their 3D coordinates with a stereo camera model. Finally, the point correspondences are used in a differentiable pose estimator. For an overview, see Figure 2.
III-A Keypoint Detection and Description

We start by detecting sparse 2D keypoints, $\mathbf{p}_i$, at sub-pixel locations in the left stereo image and computing descriptor vectors, $\mathbf{d}_i$, and scores, $s_i$, for all pixels in the image. The descriptor and score for a given keypoint are found using bilinear interpolation. The score determines how important a point is for pose estimation. We pass an image to an encoder-decoder neural network following the design from [28], illustrated in Figure 3. After the bottleneck, the network branches into two decoder branches, one to compute keypoint coordinates and one for the scores. We divide the image into uniformly sized windows and detect one keypoint per window by taking the weighted average of the pixel coordinates, where the weights are the softmax over the network output for that window. Applying a sigmoid to the output of the second decoder branch gives us the scores. Finally, the descriptors are found by resizing and concatenating the feature maps of each encoder block, leaving us with a length-496 descriptor vector for each pixel.
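To make the detection step concrete, the sketch below shows one way the windowed soft-argmax keypoint detection, sigmoid scores, and bilinearly interpolated descriptors could be implemented in PyTorch. The window size, tensor shapes, and function names are illustrative assumptions, not the exact values or code used in our network.

```python
import torch
import torch.nn.functional as F

def detect_keypoints(detector_logits, score_logits, descriptor_map, window=16):
    """Soft-argmax keypoint detection over non-overlapping windows (a sketch).

    detector_logits: (B, 1, H, W) output of the keypoint decoder branch
    score_logits:    (B, 1, H, W) output of the score decoder branch
    descriptor_map:  (B, C, H, W) resized and concatenated encoder feature maps
    Returns keypoints (B, N, 2), scores (B, N), descriptors (B, N, C),
    with N = (H / window) * (W / window) keypoints, one per window.
    """
    B, _, H, W = detector_logits.shape

    # Pixel coordinate grid, shape (2, H, W): channel 0 is u (column), 1 is v (row).
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing='ij')
    coords = torch.stack((u, v), dim=0).to(detector_logits.device)

    # Split the image into windows and take a softmax-weighted average of the
    # pixel coordinates inside each window -> one sub-pixel keypoint per window.
    logits = F.unfold(detector_logits, window, stride=window)          # (B, w*w, N)
    weights = F.softmax(logits, dim=1)
    win_coords = F.unfold(coords.unsqueeze(0), window, stride=window)  # (1, 2*w*w, N)
    win_coords = win_coords.view(1, 2, window * window, -1)
    keypoints = (weights.unsqueeze(1) * win_coords).sum(dim=2)         # (B, 2, N)
    keypoints = keypoints.permute(0, 2, 1)                             # (B, N, 2)

    # A sigmoid turns the second decoder branch into per-pixel scores; bilinear
    # interpolation picks up the score and descriptor at each sub-pixel keypoint.
    scores_map = torch.sigmoid(score_logits)
    grid = keypoints.clone()
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0   # normalize to [-1, 1]
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0
    grid = grid.view(B, 1, -1, 2)
    scores = F.grid_sample(scores_map, grid, align_corners=True).view(B, -1)
    descriptors = F.grid_sample(descriptor_map, grid, align_corners=True)
    descriptors = descriptors.view(B, descriptor_map.shape[1], -1).permute(0, 2, 1)
    return keypoints, scores, descriptors
```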
III-B Keypoint Matching
We have a set of keypoints for the source image and need to associate them with points in the target image. Descriptors are compared using zero-normalized cross correlation (ZNCC), which for real vectors is equivalent to the cosine similarity of the zero-mean descriptors, so the resulting value lies in the range $[-1, 1]$. For each keypoint in the source image, $\mathbf{p}^s_i$, we compute a matched point, $\hat{\mathbf{p}}^t_i$, in the target image. This point is the weighted sum of all image coordinates in the target image, where the weight is based on how well the descriptors match:

$$\hat{\mathbf{p}}^t_i = \sum_{j=1}^{N} \mathrm{softmax}_j\!\left(\frac{1}{T}\, f\!\left(\mathbf{d}^s_i, \mathbf{d}^t_j\right)\right) \mathbf{p}^t_j, \tag{1}$$

where $N$ is the total number of pixels in the target image, $f(\cdot, \cdot)$ computes the ZNCC between the descriptors, and $\mathrm{softmax}_j(\cdot)$ takes the temperature-weighted softmax with temperature $T$, which is determined empirically by trying a range of values. The keypoint matching is differentiable. We found that using all target pixels worked better in practice than only including keypoints detected in the target image. Finally, we find the descriptor, $\hat{\mathbf{d}}^t_i$, and score, $\hat{s}^t_i$, for each computed target point using bilinear interpolation.
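A minimal sketch of this dense, differentiable matching step is shown below, assuming the descriptors have already been extracted as in Section III-A. The ZNCC is computed as the cosine similarity of zero-mean, unit-norm descriptors, and the temperature value is a placeholder.

```python
import torch
import torch.nn.functional as F

def match_keypoints(src_desc, tgt_desc_map, tgt_coords, temperature=0.01):
    """Differentiable matching of source keypoints against ALL target pixels.

    src_desc:     (B, N, C) descriptors of the N source keypoints
    tgt_desc_map: (B, C, H*W) per-pixel target descriptors (flattened image)
    tgt_coords:   (B, H*W, 2) pixel coordinates of the target image
    Returns soft matched target points, shape (B, N, 2).
    """
    # Zero-normalize so the dot product equals ZNCC (cosine similarity of
    # zero-mean descriptors), giving similarity values in [-1, 1].
    src = src_desc - src_desc.mean(dim=2, keepdim=True)
    src = F.normalize(src, dim=2)
    tgt = tgt_desc_map - tgt_desc_map.mean(dim=1, keepdim=True)
    tgt = F.normalize(tgt, dim=1)

    zncc = torch.bmm(src, tgt)                       # (B, N, H*W)

    # Temperature-weighted softmax over all target pixels, then a weighted
    # sum of their coordinates gives one soft matched point per source keypoint.
    weights = F.softmax(zncc / temperature, dim=2)   # (B, N, H*W)
    matched = torch.bmm(weights, tgt_coords)         # (B, N, 2)
    return matched
```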
III-C Stereo Camera Model
In order to estimate the pose from matched 2D keypoints, we need their corresponding 3D coordinates, which is straightforward with a stereo camera. The camera model, $\mathbf{g}(\cdot)$, for a pre-calibrated stereo rig maps a 3D point, $\mathbf{p} = \begin{bmatrix} x & y & z \end{bmatrix}^T$, in the camera frame to a left stereo image coordinate, $(u, v)$, together with disparity, $d$, as follows:

$$\begin{bmatrix} u \\ v \\ d \end{bmatrix} = \mathbf{g}(\mathbf{p}) = \begin{bmatrix} f_u \frac{x}{z} + c_u \\ f_v \frac{y}{z} + c_v \\ f_u \frac{b}{z} \end{bmatrix}, \tag{2}$$

where $f_u$ and $f_v$ are the horizontal and vertical focal lengths in pixels, $c_u$ and $c_v$ are the camera's horizontal and vertical optical centre coordinates in pixels, and $b$ is the baseline in metres. We use the inverse stereo camera model to get each keypoint's 3D coordinates:

$$\mathbf{p} = \mathbf{g}^{-1}(u, v, d) = \frac{b}{d} \begin{bmatrix} u - c_u \\ \frac{f_u}{f_v}\left(v - c_v\right) \\ f_u \end{bmatrix}. \tag{3}$$

We perform stereo matching to obtain the disparity, $d$, using [30] as implemented in OpenCV.
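As a sanity check on (2) and (3), the following sketch implements the forward and inverse stereo models; the calibration numbers in the round-trip test are placeholders, not our actual camera calibration.

```python
import torch

def stereo_project(points, fu, fv, cu, cv, b):
    """Forward stereo model (2): 3D points (..., 3) -> (u, v, d)."""
    x, y, z = points[..., 0], points[..., 1], points[..., 2]
    u = fu * x / z + cu
    v = fv * y / z + cv
    d = fu * b / z                    # disparity is inversely proportional to depth
    return torch.stack((u, v, d), dim=-1)

def stereo_backproject(uvd, fu, fv, cu, cv, b):
    """Inverse stereo model (3): (u, v, d) -> 3D point in the camera frame."""
    u, v, d = uvd[..., 0], uvd[..., 1], uvd[..., 2]
    z = fu * b / d
    x = (u - cu) * z / fu
    y = (v - cv) * z / fv
    return torch.stack((x, y, z), dim=-1)

# Round trip with placeholder calibration values.
p = torch.tensor([[1.0, 0.5, 10.0]])
q = stereo_project(p, fu=388.0, fv=389.0, cu=254.0, cv=197.0, b=0.24)
assert torch.allclose(stereo_backproject(q, 388.0, 389.0, 254.0, 197.0, 0.24), p)
```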
III-D Pose Estimation
Given the correspondences between the source keypoints, $\mathbf{p}^s_i$, and matched target keypoints, $\hat{\mathbf{p}}^t_i$, we can compute the relative pose from the source to the target, $\mathbf{T}_{ts} = \begin{bmatrix} \mathbf{C}_{ts} & \mathbf{r}^{st}_t \\ \mathbf{0}^T & 1 \end{bmatrix}$, where $\mathbf{r}^{st}_t$ is the translation from the target frame to the source frame given in the target frame. As described in Section III-C, we use the inverse stereo camera model (3) to compute 3D coordinates for the corresponding 2D keypoints; below, $\mathbf{p}^s_i$ and $\hat{\mathbf{p}}^t_i$ refer to these 3D points. This allows us to minimize the following cost:

$$J = \frac{1}{2} \sum_{i=1}^{N} w_i \left\| \hat{\mathbf{p}}^t_i - \left( \mathbf{C}_{ts}\,\mathbf{p}^s_i + \mathbf{r}^{st}_t \right) \right\|^2_2. \tag{4}$$

The minimization is implemented using Singular Value Decomposition (SVD) (more details can be found in [31]), which is differentiable. The weight, $w_i$, for a matched point pair is a combination of the learned point scores, $s^s_i$ and $\hat{s}^t_i$, and how well the descriptors match:

$$w_i = \frac{1}{2} \left( f\!\left(\mathbf{d}^s_i, \hat{\mathbf{d}}^t_i\right) + 1 \right) s^s_i\, \hat{s}^t_i. \tag{5}$$

We additionally remove large outliers at training time based on ground-truth keypoint coordinate error, and we use Random Sample Consensus (RANSAC) at inference.
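The weighted least-squares problem in (4) has a closed-form solution via SVD. The sketch below shows a minimal differentiable version for the general 6-DOF case (without the outlier rejection), following the standard weighted point-alignment derivation rather than our exact implementation.

```python
import torch

def weighted_svd_pose(src, tgt, w):
    """Closed-form minimizer of the weighted cost (4) via SVD.

    src, tgt: (N, 3) corresponding 3D points in the source and target frames
    w:        (N,)   non-negative match weights
    Returns C_ts (3, 3) and r (3,) such that tgt ~= C_ts @ src + r.
    """
    w = w / w.sum()
    src_c = (w[:, None] * src).sum(dim=0)            # weighted centroids
    tgt_c = (w[:, None] * tgt).sum(dim=0)
    X = src - src_c
    Y = tgt - tgt_c
    W = (Y * w[:, None]).T @ X                       # 3x3 weighted cross-covariance
    U, _, Vt = torch.linalg.svd(W)
    # Reflection guard keeps the result a proper rotation (det = +1).
    d = torch.sign(torch.det(U @ Vt))
    S = torch.diag(torch.stack([torch.ones_like(d), torch.ones_like(d), d]))
    C = U @ S @ Vt
    r = tgt_c - C @ src_c
    return C, r
```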
III-E Loss Functions
Barnes and Posner [28] supervise training using only a loss on the estimated pose. We found this was insufficient for our approach and also include a loss on the 3D coordinates of the matched keypoints, similar to [32]. We generate the keypoint ground truth from the poses and do not require additional keypoint correspondence data. These two losses are sufficient, and we do not need to add losses for the descriptors or scores. For our training datasets, we only use a subset of the pose degrees of freedom (DOF) for supervision due to accuracy variability in the remaining DOF for some sequences. Specifically, we use the robot's longitudinal direction, $x$, lateral offset, $y$, and heading, $\theta$. For this reason, we modify the keypoint loss to only use these DOF.
Using (3), we can compute the 3D coordinates of the keypoints in the source and target camera frames. Given the ground-truth pose, we take its supervised DOF, $x$, $y$, and $\theta$, and form a pose

$$\tilde{\mathbf{T}}_{ts} = \begin{bmatrix} \cos\theta & -\sin\theta & 0 & x \\ \sin\theta & \cos\theta & 0 & y \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{6}$$

that we use to transform the source keypoints. Because we transform the source points in the plane, we only compare the $x$ and $y$ point coordinates:

$$\mathcal{L}_{\text{keypoint}} = \sum_{i=1}^{N} \left\| \mathbf{D} \left( \hat{\mathbf{p}}^t_i - \tilde{\mathbf{T}}_{ts}\, \mathbf{p}^s_i \right) \right\|^2_2, \qquad \mathbf{D} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}, \tag{7}$$

where the keypoints are expressed in homogeneous coordinates and $\mathbf{D}$ selects the $x$ and $y$ coordinates.
Forming $\tilde{\mathbf{T}}_{ts} = \{\tilde{\mathbf{C}}_{ts}, \tilde{\mathbf{r}}^{st}_t\}$ from the ground-truth pose as in (6) and taking $\mathbf{T}_{ts} = \{\mathbf{C}_{ts}, \mathbf{r}^{st}_t\}$ from the estimated pose, we get the following pose loss:

$$\mathcal{L}_{\text{pose}} = \left\| \mathbf{r}^{st}_t - \tilde{\mathbf{r}}^{st}_t \right\|^2_2 + \alpha \left\| \mathbf{I} - \mathbf{C}_{ts}\tilde{\mathbf{C}}_{ts}^T \right\|^2_F, \tag{8}$$

where $\alpha$ is used to balance rotation and translation and $\mathbf{I}$ is the identity matrix. The total loss is a weighted sum of $\mathcal{L}_{\text{keypoint}}$ and $\mathcal{L}_{\text{pose}}$, where the weight is determined empirically to balance the influence of the two loss terms.
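The sketch below illustrates the two losses using the notation above; the SE(2)-style transform follows (6), and the balance weight is a placeholder rather than our tuned value.

```python
import torch

def se2_to_se3(x, y, theta):
    """Embed the supervised DOF (x, y, heading) from (6) into a 4x4 transform."""
    x, y, theta = (torch.as_tensor(v, dtype=torch.float32) for v in (x, y, theta))
    c, s = torch.cos(theta), torch.sin(theta)
    T = torch.eye(4)
    T[0, 0], T[0, 1], T[0, 3] = c, -s, x
    T[1, 0], T[1, 1], T[1, 3] = s, c, y
    return T

def keypoint_loss(src_pts, tgt_pts, T_gt):
    """Loss (7): transform the 3D source points with the ground-truth transform
    and compare only the x and y coordinates against the matched target points."""
    src_h = torch.cat((src_pts, torch.ones(src_pts.shape[0], 1)), dim=1)  # (N, 4)
    src_in_tgt = (T_gt @ src_h.T).T[:, :2]
    return ((tgt_pts[:, :2] - src_in_tgt) ** 2).sum()

def pose_loss(C_est, r_est, C_gt, r_gt, alpha=10.0):
    """Loss (8): squared translation error plus a rotation term balanced by alpha."""
    I = torch.eye(3)
    trans = ((r_gt - r_est) ** 2).sum()
    rot = ((I - C_est @ C_gt.T) ** 2).sum()
    return trans + alpha * rot
```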
Table I: Median number of feature inliers, as determined by RANSAC, when localizing conditions from the Multiseason dataset to each other (rows: teach run, columns: repeat run). Each cell lists SuperPoint (SP) / D2-Net (D2) / R2D2 / Our method; diagonal entries (a run localized to itself) are omitted.

| Teach \ Repeat | 31/01 | 02/02 | 10/02 | 14/02 | 27/02 | 10/04 | 03/05 |
|---|---|---|---|---|---|---|---|
| 31/01 | – | 70 / 272 / 613 / 311 | 35 / 76 / 284 / 230 | 4 / 2 / 4 / 112 | 4 / 46 / 17 / 201 | 4 / 37 / 7 / 186 | 4 / 39 / 9 / 154 |
| 02/02 | 69 / 263 / 606 / 310 | – | 77 / 154 / 319 / 248 | 16 / 17 / 10 / 124 | 11 / 47 / 21 / 223 | 6 / 36 / 5 / 206 | 8 / 34 / 5 / 166 |
| 10/02 | 31 / 76 / 283 / 230 | 79 / 155 / 313 / 248 | – | 24 / 22 / 30 / 117 | 3 / 10 / 3 / 173 | 3 / 8 / 2 / 161 | 2 / 8 / 2 / 131 |
| 14/02 | 5 / 2 / 3 / 114 | 19 / 31 / 15 / 134 | 30 / 31 / 45 / 128 | – | 2 / 2 / 3 / 121 | 1 / 0 / 0 / 111 | 1 / 1 / 0 / 102 |
| 27/02 | 4 / 45 / 15 / 201 | 12 / 47 / 22 / 223 | 4 / 13 / 5 / 171 | 2 / 2 / 2 / 114 | – | 15 / 136 / 104 / 257 | 13 / 75 / 25 / 196 |
| 10/04 | 3 / 34 / 7 / 183 | 7 / 31 / 4 / 206 | 3 / 7 / 1 / 158 | 1 / 0 / 0 / 106 | 13 / 136 / 119 / 257 | – | 15 / 56 / 33 / 208 |
| 03/05 | 4 / 36 / 9 / 153 | 8 / 38 / 4 / 166 | 3 / 10 / 2 / 131 | 1 / 2 / 0 / 98 | 14 / 74 / 25 / 197 | 15 / 55 / 33 / 207 | – |
Table II: Localization failures during the closed-loop experiments. 'Runs with failures' lists the start times of repeats in which localization failed, 'Failed frames' is the total number of frames with fewer than 6 inliers, and 'Distance on VO' is the distance driven on VO alone while localization recovered.

| Path | Total distance | Runs with failures | Failed frames | Distance on VO |
|---|---|---|---|---|
| Lighting Change | 10.9 km | 22:18 | 3 | 0.4 m |
| Generalization 1 | 20.6 km | 03:55, 05:21, 20:59 | 364 | 29.2 m |
| Generalization 2 | 2.2 km | 14:56, 17:33 | 11 | 2.0 m |
| Generalization 3 | 1.8 km | – | 0 | 0.0 m |

IV Experiments
IV-A Data
The training data were previously collected using a Clearpath Grizzly robot with a Bumblebee XB3 camera, see Figure 1. By using multi-experience VT&R [1], the robot repeats paths accurately across large lighting and seasonal change with the help of intermediate bridging experiences and Speeded-Up Robust Features (SURF) [33]. We can use the resulting data as ground truth for supervised learning. In VT&R, stereo image keyframes are stored as vertices in a spatio-temporal pose graph. Edges contain the relative pose between temporally adjacent vertices and between a repeat vertex and the teach vertex to which it has been localized. We sample image pairs and poses from the pose graph.
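The sampling step can be pictured with the illustrative structures below; the class and field names are stand-ins for this sketch and do not match the actual VT&R data structures.

```python
import random
from dataclasses import dataclass

# Illustrative stand-ins only; the real VT&R pose graph classes differ.
@dataclass(frozen=True)
class Vertex:
    run_id: int        # 0 for the teach pass, 1..R for repeat experiences
    frame_id: int      # temporal index of the stereo keyframe within its run
    stereo_pair: str   # path to the stored left/right image pair

@dataclass
class PoseGraph:
    vertices: list                # all keyframes across all experiences
    temporal_edges: dict          # (vertex, next vertex) -> 4x4 relative pose
    localization_edges: dict      # repeat vertex -> (teach vertex, 4x4 pose)

def sample_training_pair(graph: PoseGraph):
    """Pick a repeat keyframe and the teach keyframe it was localized to;
    the localization edge supplies the ground-truth relative pose label."""
    repeats = [v for v in graph.vertices if v.run_id != 0]
    v_repeat = random.choice(repeats)
    v_teach, T_ts = graph.localization_edges[v_repeat]
    return v_repeat.stereo_pair, v_teach.stereo_pair, T_ts
```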
We use data from two different paths for training (datasets available at http://asrl.utias.utoronto.ca/datasets/2020-vtr-dataset). The In-The-Dark dataset contains 39 runs of a path collected at our campus in summer 2016 along a road and on grass. The path was repeated once per hour for 30 hours, systematically capturing lighting change. The Multiseason dataset contains 136 runs of a path in an area with more vegetation and undulating terrain. It was repeated from January until May 2017, capturing varying seasonal and weather conditions. Overhead path views can be seen in Figure 4. We generate two separate datasets from these two paths that each have 100,000 training samples and 20,000 validation samples.
IV-B Training and Inference
We train the network by giving it pairs of images and the ground-truth relative pose from the training dataset. Large outliers are removed based on keypoint error using the ground-truth pose during training and with RANSAC during inference. The network is trained using the Adam optimizer [34]; all parameters other than the learning rate are left at their default values. We determine the number of training epochs with early stopping based on the validation loss. The network is trained on an NVIDIA Tesla V100 DGXS GPU. Feature extraction on this server using PyTorch takes on average 14.8 ms, while it takes 7.3 ms using C++ in the VT&R system running on a ThinkPad P52 laptop with an Intel Core i7-8850H CPU, 32 GB of RAM, and an NVIDIA Quadro P2000 4 GB GPU. The code is available at https://github.com/utiasASRL/deep_learned_visual_features.
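A skeleton of this training procedure might look as follows; the learning rate, patience, data loader interface, and model call signature are assumptions for illustration, not the released training code.

```python
import torch

def train(model, train_loader, val_loader, lr=1e-4, patience=5):
    """Train with Adam and stop early when the validation loss stops improving."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # lr is a placeholder
    best_val, epochs_without_improvement = float('inf'), 0
    for epoch in range(1000):
        model.train()
        for src_img, tgt_img, T_gt in train_loader:
            optimizer.zero_grad()
            loss = model(src_img, tgt_img, T_gt)   # keypoint + pose loss (Sec. III-E)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(model(s, t, T) for s, t, T in val_loader) / len(val_loader)
        if val_loss < best_val:
            best_val, epochs_without_improvement = val_loss, 0
            torch.save(model.state_dict(), 'best_model.pt')
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break   # early stopping on the validation loss
```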
IV-C Visual Teach and Repeat
In order to test the performance of the learned features, we add them to the VT&R system. While a user manually drives the robot to teach a new path, VT&R creates a spatio-temporal pose graph that serves as the map. The path is repeated by alternating between VO and localizing keyframes to the map. VT&R does not compute global poses, but only finds the relative offset to a keyframe in the map. In multi-experience localization, the poses and detected features for each repeat are stored in the pose graph. More detail on VT&R can be found in [35], and the VT&R code base is available at http://utiasasrl.github.io/vtr3. We insert the learned feature detector and descriptors without making substantial changes to VT&R. In particular, we use the existing sparse descriptor matcher and do not use the keypoint scores from training, but plan to include dense descriptor matching and keypoint scores in future work. Note that in the experiments, we match live images directly to the map without using intermediate experiences. Since VT&R still relies on SURF for VO, the learned features introduce additional computation, and we had to reduce the maximum speed of the robot to 0.8 m/s in order to run the experiment.
We complete three experiments. We start by comparing to other state-of-the-art methods with code available online, specifically SuperPoint [19], D2-Net [20], and R2D2 [21]. This comparison is done offline in PyTorch (not in VT&R) for localization only on held-out repeats from the Multiseason dataset, see Figure 6. Our method uses a network trained on the Multiseason and In-The-Dark datasets, while we use weights provided with the other methods.
The last two experiments are run in closed loop on the robot. First, we train a network on the In-The-Dark dataset and teach a new path by physically driving the robot, rather than using held-out data from the training dataset. The new path is similar to the one from the training data (see Figure 5), albeit five years later. We will refer to this as the ‘lighting-change experiment’. We taught the approximately 265 m path at 12:14 on August 2nd and repeated it on August 3rd and 9th, covering lighting conditions from 3 a.m. until 10:30 p.m., see Figure 7. For this experiment, the network had a larger number of feature channels in each layer (a first layer of size 32 instead of 16). We later found this was unnecessary, and all other experiments use the architecture from Figure 3.
In the final closed-loop experiments, which we refer to as the ‘generalization experiment’, we test the learned features’ ability to generalize to new areas. We train a network using both the Multiseason and In-The-Dark datasets and teach three new paths. The first is approximately 760 m and, as shown in Figure 5, includes two new areas that are not covered by the training data. The larger of the two new sections is in an area with more vegetation, trees, and taller grass than seen in the most relevant Multiseason dataset. Moreover, the experiment is conducted in August, while the Multiseason dataset only contains data until May. This path was taught on August 14th and repeated on August 15th, 16th, and 20th, covering lighting change from 4 a.m. until 9 p.m., see Figure 8. Additionally, we include a second path taught in a field at 09:52 on November 11th (275 m) and a third path taught in a parking lot at 11:34 on November 11th (220 m). Neither path has any overlap with the training data, see Figure 5. The paths are repeated from morning until dark on November 11th and 12th, see Figure 9.

V Results
V-A Offline: Learned Features Comparison
We compared our method to state-of-the-art learned features on localization only in PyTorch. Keypoints, descriptors, and scores were extracted using the code provided by the authors, and the same sparse nearest-neighbour feature matching was used for all methods for fairness. For D2-Net and R2D2, we used multi-scale detection for best results. All conditions from Figure 6 were localized against each other. We did this test as a sanity check, since our method, trained on relevant data, should outperform the others. (Because our data differ from the more commonly used urban datasets, retraining the other methods on the same data would be fairer; this limits the usefulness of our comparison.)
The median number of inliers as determined by RANSAC is listed in Table I. R2D2 gets the best result localizing the first three conditions to each other. Our method, however, is the only one with a high number of inliers across all experiments, while the others deteriorate as the seasonal change increases. Table I shows that the run from 14/02 with snow and high brightness is the most difficult. Since our target application runs on a laptop in real time, memory consumption and speed also matter. R2D2 (1531 MB) and our method (2443 MB) used the least memory, while SuperPoint (5305 MB) and D2-Net (9521 MB) would be too large to fit on the 4 GB GPU. Our method was the fastest with 14.8 ms for feature extraction and, unlike SuperPoint (37.3 ms) and R2D2 (38.7 ms), does not rely on NMS to pick a subset of keypoints. D2-Net was the slowest with 477.4 ms, though it would likely be faster without multi-scale detection.
V-B Closed-loop: Lighting Change
We repeated the path described in Section IV-C 40 times with a 100% autonomy rate. The repeats span lighting change from 3 a.m. until 10:30 p.m., see Figure 7. All days had sunny weather with only occasional clouds. The robot repeated the path accurately for all conditions, including driving in the dark with headlights. For this experiment only, we collected the path-tracking error measured with RTK-GPS. The root mean squared error (RMSE) across all repeats is 0.049 m, with 0.070 m being the highest RMSE for an individual repeat. We provide box plots of the number of matched feature inliers in Figure 10. For every repeat, we get enough inliers to localize. Localization failure happens for frames with fewer than 6 inliers, which is the case for only three frames, see Table II. In this case, we rely on VO until localization recovers. As expected, we get the highest number of inliers in the middle of the day and the lowest number when it is dark. The dips in inlier numbers during sunrise and sunset are caused by sun flares. Before sunrise there was fog, explaining the lower number of inliers compared to driving after dark at night.
V-C Closed-loop: Generalization to New Areas
In the previous experiment, we showed that we can learn features based on data collected in 2016, teach a new path five years later, and repeat it across large lighting change. We want to make sure that the network is not just memorizing known locations from the training data. This experiment therefore tests generalization to areas and seasonal appearance not included in the training datasets. We train a network with data from both the In-The-Dark and Multiseason datasets. Both the on-road and off-road sections of the first path have one new unseen area, as shown in Figure 5, while the second and third paths are set in entirely new locations. The first path was repeated 26 times with a 100% autonomy rate, spanning lighting change from 4 a.m. until 9 p.m., see Figure 8. Table II shows localization failures for a small part of the path driven during night-time. We provide box plots of the number of inliers for all repeats in Figure 11 (a): we get the highest number of inliers during the day, some dips at sunrise and sunset due to sun flares, and the lowest inlier counts in the dark. In the same plot, we compare to localization using SURF with colour-constant images [1]. In eleven of the repeats, SURF fails to localize at the beginning of the path, and for the remaining repeats its number of inliers is much lower than for the learned features. Since a large part of the path falls in an area with tree cover and other vegetation, we were unable to collect RTK-GPS ground truth. In the new off-road area, the robot swayed slightly at a few spots in the dark or during strong sun flares, but quickly recovered and never drove off the path. In the new on-road area, for one repeat at 10:20, the robot turned too sharply and was off the path by roughly 1.5 metres, but recovered when exiting the turn.
We also present two plots in Figure 11 (b) and (c) that compare the number of matched feature inliers when driving in the new areas versus the areas already seen in the training dataset. We divide the path into on-road and off-road sections since the on-road section generally gets more inliers. The new off-road area consistently gets fewer inliers than the area contained in the training data, but still enough to drive. Moreover, the inlier numbers for the new area exhibit the same variation over time as those for the known area. In the new on-road area, the median number of inliers remains similar to or higher than in the known area. This is because, in the on-road case, the new area is more similar to the rest of the path. Moreover, the new part of the path is less affected by sun flares at sunrise and sunset than the known on-road section, explaining why it has higher values at these times.
Finally, the second path collected in a field and the third path collected in a parking lot were both repeated 7 times with a 100% autonomy rate from morning until dark. Figure 11 (d) and (e) show a high number of inliers for both experiments, which means that the learned features generalized to these new places outside the training data. Localization failures only occur for the path in the field, see Table II.
VI Conclusions and Future Work
We have shown that we can use learned features for real-time autonomous path following under large lighting change and extend our path to new areas not seen in the data used to train the features. From data gathered in summer 2016 and from January to May 2017, we have trained networks to predict visual features that work reliably several years later. We used the existing sparse feature matcher in VT&R, but plan to implement dense matching in VT&R and use the learned scores in future work. We also aim to make the implementation more efficient so that we can drive faster. Finally, we plan a longer closed-loop experiment to test against seasonal change.
References
- [1] M. Paton, K. MacTavish, M. Warren, and T. D. Barfoot, “Bridging the appearance gap: Multi-experience localization for long-term visual teach and repeat,” in IROS, 2016.
- [2] M. Gridseth and T. D. Barfoot, “Deepmel: Compiling visual multi-experience localization into a deep neural network,” in ICRA, 2020.
- [3] T. Sattler, Q. Zhou, M. Pollefeys, and L. Leal-Taixe, “Understanding the limitations of cnn-based absolute camera pose regression,” in CVPR, 2019.
- [4] N. Piasco, D. Sidibé, V. Gouet-Brunet, and C. Demonceaux, “Learning scene geometry for visual localization in challenging conditions,” in ICRA, 2019.
- [5] L. von Stumberg, P. Wenzel, Q. Khan, and D. Cremers, “Gn-net: The gauss-newton loss for multi-weather relocalization,” RAL, 2020.
- [6] L. Von Stumberg, P. Wenzel, N. Yang, and D. Cremers, “Lm-reloc: Levenberg-marquardt based direct visual relocalization,” in 3DV, 2020.
- [7] M. Kasper, F. Nobre, C. Heckman, and N. Keivan, “Unsupervised metric relocalization using transform consistency loss,” CoRR, 2020.
- [8] J. Spencer, R. Bowden, and S. Hadfield, “Same features, different day: Weakly supervised feature learning for seasonal invariance,” in CVPR, 2020.
- [9] P.-E. Sarlin, A. Unagar, M. Larsson, H. Germain, C. Toft, V. Larsson, M. Pollefeys, V. Lepetit, L. Hammarstrand, F. Kahl et al., “Back to the feature: Learning robust camera localization from pixels to pose,” in CVPR, 2021.
- [10] M. Gladkova, R. Wang, N. Zeller, and D. Cremers, “Tight integration of feature-based relocalization in monocular direct visual odometry,” in ICRA, 2021.
- [11] L. Sun, M. Taher, C. Wild, C. Zhao, F. Majer, Z. Yan, T. Krajnik, T. Prescott, and T. Duckett, “Robust and long-term monocular teach-and-repeat navigation using a single-experience map,” in IROS, 2021.
- [12] C. Chen, B. Wang, C. X. Lu, N. Trigoni, and A. Markham, “A survey on deep learning for localization and mapping: Towards the age of spatial machine intelligence,” arXiv preprint arXiv:2006.12567, 2020.
- [13] A. Kendall, M. Grimes, and R. Cipolla, “Posenet: A convolutional network for real-time 6-dof camera relocalization,” in ICCV, 2015.
- [14] Z. Laskar, I. Melekhov, S. Kalia, and J. Kannala, “Camera relocalization by computing pairwise relative poses using convolutional neural network,” in ICCV Workshops, 2017.
- [15] A. Valada, N. Radwan, and W. Burgard, “Deep Auxiliary Learning for Visual Localization and Odometry,” in ICRA, 2018.
- [16] E. Rosten, R. Porter, and T. Drummond, “Faster and better: A machine learning approach to corner detection,” TPAMI, 2008.
- [17] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, “Lift: Learned invariant feature transform,” in ECCV, 2016.
- [18] Y. Ono, E. Trulls, P. Fua, and K. M. Yi, “Lf-net: learning local features from images,” in NeurIPS, 2018, pp. 6234–6244.
- [19] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in CVPR Workshops, 2018.
- [20] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler, “D2-net: A trainable cnn for joint description and detection of local features,” in CVPR, 2019.
- [21] J. Revaud, P. Weinzaepfel, C. R. de Souza, and M. Humenberger, “R2D2: repeatable and reliable detector and descriptor,” in NeurIPS, 2019.
- [22] Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y. Yao, S. Li, T. Fang, and L. Quan, “Aslfeat: Learning local features of accurate shape and localization,” in CVPR, 2020.
- [23] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in CVPR, 2020.
- [24] Q. Wang, X. Zhou, B. Hariharan, and N. Snavely, “Learning feature descriptors using camera pose supervision,” in ECCV, 2020.
- [25] Z. Lv, F. Dellaert, J. M. Rehg, and A. Geiger, “Taking a deeper look at the inverse compositional algorithm,” in CVPR, 2019.
- [26] C. Tang and P. Tan, “Ba-net: Dense bundle adjustment networks,” in ICLR, 2019.
- [27] B. Xu, A. J. Davison, and S. Leutenegger, “Deep probabilistic feature-metric tracking,” RAL, 2020.
- [28] D. Barnes and I. Posner, “Under the radar: Learning to predict robust keypoints for odometry estimation and metric localisation in radar,” in ICRA, 2020.
- [29] J. Tang, R. Ambrus, V. Guizilini, S. Pillai, H. Kim, P. Jensfelt, and A. Gaidon, “Self-Supervised 3D Keypoint Learning for Ego-Motion Estimation,” in CoRL, 2020.
- [30] H. Hirschmuller, “Stereo processing by semiglobal matching and mutual information,” TPAMI, 2008.
- [31] T. D. Barfoot, State Estimation for Robotics. Cambridge University Press, 2017.
- [32] P. H. Christiansen, M. F. Kragh, Y. Brodskiy, and H. Karstoft, “Unsuperpoint: End-to-end unsupervised interest point detector and descriptor,” arXiv preprint arXiv:1907.04011, 2019.
- [33] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” in ECCV, 2006.
- [34] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ICLR, 2014.
- [35] M. Paton, K. MacTavish, L.-P. Berczi, S. K. van Es, and T. D. Barfoot, “I can see for miles and miles: An extended field test of visual teach and repeat 2.0,” in FSR, 2018.