Where is your place, Visual Place Recognition?
Abstract
Visual Place Recognition (VPR) is often characterized as the ability to recognize the same place despite significant changes in appearance and viewpoint. VPR is a key component of Spatial Artificial Intelligence, enabling robotic platforms and intelligent augmentation platforms such as augmented reality devices to perceive and understand the physical world. In this paper, we observe that there are three “drivers” that impose requirements on spatially intelligent agents and thus VPR systems: 1) the particular agent including its sensors and computational resources, 2) the operating environment of this agent, and 3) the specific task that the artificial agent carries out. We characterize and survey key works in the VPR area considering those drivers, including their place representation and place matching choices. We also provide a new definition of VPR based on visual overlap – akin to spatial view cells in the brain – that enables us to find similarities and differences to other research areas in the robotics and computer vision fields. We identify numerous open challenges and suggest areas that require more in-depth attention in future works.
1 Introduction
Visual Place Recognition (VPR) is a rapidly growing topic: Google Scholar lists over 2300 papers matching this exact term, with 1600 of them published since the pivotal survey paper by Lowry et al. in 2016 Lowry2016 . While exhaustive surveys of works on VPR are given elsewhere Lowry2016 ; zhang2020visual ; masone2021survey , our goal here is to establish a concrete understanding of VPR as a research problem. We argue that research on VPR has become increasingly dissociated: there is no standard definition of a ‘place’, and comparing methods is challenging as benchmark datasets and metrics vary substantially.
In light of this dissociation, and while retaining the accessibility of a compact treatment, we discuss VPR with regard to its definition (Section 2), how it closely relates to other areas of research (Section 3), what the key drivers of VPR research are (Section 4), how to evaluate VPR solutions (Section 5), and what key research problems still remain unsolved (Section 6). Fig. 1 illustrates the outline of our paper and shows how the various sections are interrelated.

2 What is Visual Place Recognition?
Lowry et al. Lowry2016 state that VPR addresses the question of “given an image of a place, can a human, animal, or robot decide whether or not this image is of a place it has already seen?”. One can easily see that such a capability is of crucial importance in tasks like localization and navigation, which in turn become ever more important with the advent of artificial intelligence (AI) in autonomous cars, mobile robots that interact with humans, and intelligent augmentation platforms such as the HoloLens 2. Inspiration for VPR is often drawn from the animal kingdom, given the remarkable localization and navigation capabilities of even “simple” animals, and the relatively well understood underlying mechanisms (even leading to 2014’s Nobel Prize for the discovery of place cells and grid cells, as further detailed below).
Although it might seem necessary to define a place first, we instead define VPR directly, remembering that it is a comparison of visual data observed from the same or different physical locations with the same or different viewpoints. We argue that two such observations can lead to successful recognition if there exists a certain degree of visual overlap due to the overlapping field-of-view of the underlying sensor, whereby the acceptable degree depends on the drivers introduced in Section 4. This implies that: 1) being at the same physical location is not sufficient, as the orientation (i.e. viewpoint) needs to be somewhat similar as well, and 2) places can also be recognized when observed from distant physical locations. In short, we define VPR as the ability to recognize one’s location based on two observations perceived from overlapping fields-of-view. Note that our definition requires rethinking the typical notion of a localization threshold (as used by almost all datasets and evaluation metrics, see Section 5), which is based on metric distances without considering orientation.
Our definition is complementary to that of Lowry et al. Lowry2016 , but has a different underlying motivation. Lowry et al.’s definition is in line with the notion of place cells, which fire when an animal is in a particular place in the environment, irrespective of the animal’s orientation. Instead, our definition is in line with spatial view cells, which fire when a specific area of the environment is gazed at by the animal, irrespective of the particular location 10.1093/cercor/9.3.197 .
We note that in the context of robotics, VPR often involves sequential imagery rather than single images, as this can significantly improve place recognition performance, especially so for challenging environments Lowry2016 . For such sequence-based methods, the equivalent of visual overlap is the overlap of the volume spanned by the sequence.
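To make the overlap-based definition concrete, the following minimal sketch (our illustration only; function and parameter names are hypothetical) approximates the visual overlap between two cameras in a 2D top-down view by sampling points inside the first camera's viewing sector and measuring which fraction is also visible from the second camera.

```python
import numpy as np

def in_fov(points, position, heading_deg, fov_deg, max_range):
    """Boolean mask of 2D points that lie inside a camera's viewing sector."""
    rel = points - position
    dist = np.linalg.norm(rel, axis=1)
    ang = np.degrees(np.arctan2(rel[:, 1], rel[:, 0])) - heading_deg
    ang = (ang + 180.0) % 360.0 - 180.0  # wrap angle difference to [-180, 180)
    return (dist < max_range) & (np.abs(ang) < fov_deg / 2.0)

def visual_overlap(pos_a, head_a, pos_b, head_b,
                   fov_deg=90.0, max_range=30.0, n_samples=20000, seed=0):
    """Monte-Carlo estimate of the fraction of the area visible from camera A
    that is also visible from camera B (asymmetric, like visual overlap)."""
    rng = np.random.default_rng(seed)
    pts = pos_a + rng.uniform(-max_range, max_range, size=(n_samples, 2))
    seen_a = in_fov(pts, pos_a, head_a, fov_deg, max_range)
    if not seen_a.any():
        return 0.0
    return float(in_fov(pts[seen_a], pos_b, head_b, fov_deg, max_range).mean())

# Same position, opposite viewing directions: (near) zero overlap despite zero positional error.
print(visual_overlap(np.zeros(2), 0.0, np.zeros(2), 180.0))
# Distant positions gazing at the same area: substantial overlap.
print(visual_overlap(np.array([0.0, 0.0]), 0.0, np.array([0.0, 10.0]), -45.0))
```

The two toy cases mirror the implications stated above: an identical position does not imply recognition, while distant positions can still yield a recognizable overlap.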
3 Related Areas
In this section, we highlight similarities and differences between VPR and a handful of related areas. While the relation to image retrieval has been discussed in other works Lowry2016 ; zhang2020visual ; masone2021survey , this is the first time that VPR’s relation to video retrieval, visual landmark recognition, and overlap detection is presented systematically. We argue that for each of those areas there is potential for mutual benefit: research into VPR can offer insights for these areas and vice versa.
Image Retrieval: Image retrieval refers to the general problem of retrieving relevant images from a large database arandjelovic2016netvlad . VPR is commonly cast as an image retrieval problem that involves a nearest neighbor search of compact global descriptors arandjelovic2016netvlad ; babenko2015aggregating ; sivic2003video ; Torii2018 ; jegou2010aggregating or cross-matching of local descriptors liu2020densernet ; taubner2020lcd ; yue2019robust ; hausler2021patch ; tourani2020early . With regard to solving the nearest neighbor search problem, VPR and image retrieval systems face similar challenges. However, the underlying goals differ between the two areas. For image retrieval, the similarity criterion could be based on semantic categories such as ‘clothes’ as a product category or ‘nighttime image’ as an environmental condition category. With the additional context of being a ‘place’ (see Section 2), VPR deviates from merely retrieving a “similar” image; indeed, retrieving an image that is visually similar but depicts a different place is one of the core challenges of VPR, referred to as perceptual aliasing (see Section 4). The notion of similarity in VPR is constrained to matching spatial information, where images captured from the same place are considered a true match even if environmental conditions are dissimilar (e.g. day vs night).
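As an illustration of the retrieval formulation, the sketch below (our example, assuming L2-normalized global descriptors compared by cosine similarity; all names are hypothetical) performs the nearest neighbor search over a reference database.

```python
import numpy as np

def retrieve(query_desc, ref_descs, k=5):
    """Rank reference images by cosine similarity of L2-normalized global
    descriptors and return the indices of the top-k candidate places."""
    q = query_desc / np.linalg.norm(query_desc)
    refs = ref_descs / np.linalg.norm(ref_descs, axis=1, keepdims=True)
    sims = refs @ q  # cosine similarity to every reference descriptor
    return np.argsort(-sims)[:k]

# Toy usage with random vectors standing in for learned descriptors (e.g. NetVLAD outputs).
rng = np.random.default_rng(0)
ref = rng.standard_normal((1000, 256))            # 1000 reference places, 256-D descriptors
query = ref[42] + 0.1 * rng.standard_normal(256)  # a perturbed observation of place 42
print(retrieve(query, ref))                       # place 42 should be ranked first
```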
Video Retrieval: With regards to video retrieval, we observe that sequence-based VPR is typically implemented as a decoupled approach, where single image-based retrieval is followed by sequence score aggregation milford2012seqslam . The recent introduction of explicit sequence-based place representations (where the representation itself describes the sequence), posing VPR as a video retrieval problem, opens up new opportunities to obtain solutions robust to extreme appearance variations garg2021seqnet ; garg2020delta ; facil2019condition ; arroyo2015towards ; neubert2019neurologically ; takeda2020dark .
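The decoupled approach can be sketched as follows: single-image similarities are computed first and then aggregated along short diagonals of the query-reference similarity matrix under a constant-velocity assumption (a simplified illustration in the spirit of milford2012seqslam ; the fixed alignment and names are our simplifications).

```python
import numpy as np

def sequence_scores(sim, seq_len=5):
    """Given a single-image similarity matrix sim[q, r] between query and reference
    frames, aggregate scores over aligned trailing windows of both sequences and
    return, for each query frame, the best-matching reference frame and its score."""
    n_q, n_r = sim.shape
    best_ref = np.full(n_q, -1)         # frames with too little history keep -1
    best_score = np.full(n_q, -np.inf)
    for q_end in range(seq_len - 1, n_q):
        for r_end in range(seq_len - 1, n_r):
            # Sum similarities along the diagonal ending at (q_end, r_end).
            score = sum(sim[q_end - i, r_end - i] for i in range(seq_len))
            if score > best_score[q_end]:
                best_score[q_end] = score
                best_ref[q_end] = r_end
    return best_ref, best_score
```

Explicit sequence-based representations instead fold this aggregation into the descriptor itself, so that a single nearest neighbor search over sequence-level descriptors replaces the two-step procedure.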
Visual Landmark Recognition and Retrieval: Visual landmark recognition is the classification problem of deciding, given an image and a set of images belonging to a large variety of landmarks, to which landmark this image belongs. Recently, the Google-Landmarks dataset weyand2020google presented a new large-scale instance-level recognition and retrieval challenge, with the number of landmarks increased from 30,000 to 200,000 in its second version; note that in the context of mobile robotics, the term ‘landmark’ is typically used to indicate any specific visual entity in the scene relevant for localization luo1992neural ; xin2019localizing . This large-scale recognition is an extreme classification problem choromanska2013extreme , where existing recognition solutions have relied on retrieval (nearest neighbor search) teichmann2019detect . Google-Landmarks comprises specific places (with the semantics of unique proper names) as opposed to general place categories (with the semantics of common names) zhou2017places ; wu2009visual .
In contrast, VPR refers to the ability to distinctively recognize any ordinary place or region in the 3D world, thus posing an even more extreme classification problem. It remains to be seen how methods developed for landmark recognition and retrieval can be leveraged in the context of VPR – recent advances include learning to aggregate ‘relevant’ landmarks teichmann2019detect , as well as jointly training local and global descriptors cao2020unifying ; yang2020ur2kid ; sarlin2019coarse .
Visual Overlap Detection: As discussed in Section 2, our definition of VPR is based on an overlapping field of view between the two observations that should be matched; VPR and the area of visual overlap detection thus become closely linked. The contrast between defining VPR using visual overlap as opposed to “positional offset” impacts the choice of ground truth for both training and evaluation procedures. This contrast has recently been shown to lead to noticeable changes in absolute performance when benchmarking localization algorithms pion2020benchmarking .
As the ground truth visual overlap might not be available for all datasets, overlap detection measures could be used as a supervision signal rau2020imageboxoverlap ; chen2020overlapnet to develop better VPR techniques. A noteworthy recent proposal on overlap detection rau2020imageboxoverlap introduced the ‘normalized surface overlap’ to measure the number of pixels of image A visible in image B (and vice versa). This leads to an asymmetric, but interpretable, measure that can also estimate the relative scale between pairs of images.
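Schematically (our paraphrase; rau2020imageboxoverlap define the measure precisely), the normalized surface overlap between images A and B can be written as

$$O_{A \rightarrow B} = \frac{|\{\text{pixels of } A \text{ visible in } B\}|}{|\{\text{pixels of } A\}|}, \qquad O_{B \rightarrow A} = \frac{|\{\text{pixels of } B \text{ visible in } A\}|}{|\{\text{pixels of } B\}|},$$

so that in general $O_{A \rightarrow B} \neq O_{B \rightarrow A}$. For instance, a close-up of a detail of a wider shot yields a high $O_{A \rightarrow B}$ but a low $O_{B \rightarrow A}$, and this asymmetry is what allows the relative scale between the pair to be estimated.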
4 What Drives VPR Research?
This section outlines the three key drivers of spatially intelligent systems, including intelligent autonomous systems operating in industry and household domains. By drivers, we refer to components that typically impose requirements on the system with regard to a) how the problem should be defined, b) how the solution (in the context of VPR: place representation and matching) should be designed, and c) how these solutions should be evaluated, both in terms of datasets and metrics. The three drivers are the Environment where an agent operates (Section 4.1), the Agent on which the spatially intelligent system is deployed (Section 4.2), and the Downstream Task that is performed (Section 4.3). In practice, different aspects of each of these drivers are simultaneously at play. We detail why it is crucial to understand the influence of these drivers to design better spatially intelligent systems, in particular in the VPR domain.
4.1 Environment
The first driver of VPR research is the operating environment, where research often branches out, as methods that work in certain environment types might cease to do so in others. Differing branches include indoor vs outdoor, suburban vs highway, structured vs open, and human-made vs natural. The operating environment is often tightly coupled with the choice of robotic agent (Section 4.2) – for example, driverless cars do not operate in office environments, or at least should not.
While the general aim of VPR systems is often stated to be invariance to changes in viewpoint as well as changes in appearance (including structural, seasonal, and illumination changes) Lowry2016 ; arandjelovic2016netvlad ; garg2018lost ; zhang2020visual , we argue that 1) not all environments/agents require invariance to both viewpoint and appearance (as detailed below), and 2) there is a trade-off between the viewpoint and appearance invariance achievable by current systems (as detailed in Section 6.1). Therefore, knowing the operating environment can provide crucial advantages when deciding how to represent and match places.
Structured Environments and Viewpoint Variations: In well-structured environments such as road infrastructure, the extent of 6-DoF viewpoint variations is generally confined; e.g. for driverless cars on roads, viewpoint varies mostly in the yaw direction RobotCarDatasetIJRR . Similar effects in viewpoint variations can be observed for other platforms too. For example, as soon as aerial vehicles reach a certain height, one can assume a planar homography (“flat world”), simplifying template matching saurer2016homography ; toft2020single ; mount2019automatic ; tourani2020early . Planar homographies are also present when the camera is mounted at a fixed distance from a surface and points towards it. In structured indoor environments such as warehouses and offices, aisles and corridors enable Manhattan world assumptions and often simplify Simultaneous Localization And Mapping (SLAM) li2018monocular .
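For reference, in one common convention, the homography induced by a world plane with unit normal $\mathbf{n}$ and distance $d$ in the first camera frame, relative pose $(\mathbf{R}, \mathbf{t})$ between the cameras, and intrinsics $\mathbf{K}_1$, $\mathbf{K}_2$ is

$$\mathbf{H} = \mathbf{K}_2 \left( \mathbf{R} - \frac{\mathbf{t}\,\mathbf{n}^\top}{d} \right) \mathbf{K}_1^{-1},$$

so that corresponding pixels satisfy $\mathbf{x}_2 \simeq \mathbf{H}\,\mathbf{x}_1$; when the planar assumption holds, a single such warp aligns entire images, which simplifies template matching in the scenarios listed above.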
Environment-Dependent Appearance Variations: Appearance invariance is similarly often constrained when assuming a certain operating environment. However, this kind of invariance is harder to quantify, as changes in appearance can originate from a wide range of factors: examples include changes in the time of day, seasonal changes, structural changes, and weather changes. Therefore, while viewpoint change can be quantified by the metric shift in translation and rotation, there is no comparable linear scale for the difficulty of appearance change. There are even counter-intuitive examples, where a reference image captured outdoors in the morning might be easier to match to a well-lit nighttime image than to an image captured at noon with shadows cast over a large area of the image corke2013dealing .
For different platforms and environments, the requirement of representing and matching places in an appearance-invariant or viewpoint-invariant manner can differ significantly. For example, driverless cars typically traverse a well-defined route and could thus trade off viewpoint-invariance for appearance-invariance, which is relatively more challenging due to variations in the time of day, season, and structure, including roadworks and differing traffic conditions Warburg_2020_CVPR . On the other hand, when an autonomous agent is deployed indoors, or when considering an unmanned aerial vehicle, its route or maneuvers may not always be constrained, thus requiring viewpoint-invariance more than appearance-invariance.
Perceptual Aliasing: Another consideration that is tied to the operating environment is the extent of perceptual aliasing. Perceptual aliasing is the problem that two distinct places can look strikingly similar, often more similar than the same place observed under different conditions Lowry2016 . For example, indoor environments often contain corridors and hallways that are hard to distinguish. In outdoor settings, different places along a highway or in a natural vegetated environment tend to be more perceptually aliased than different places within man-made urban or suburban areas.
Dynamic Environments: Problems related to the operating environment that – to our knowledge – have not yet been addressed in VPR research are sensor dust, reflections (in glass or puddles) and undesired objects close to the camera (e.g. windscreen wipers). Such conditions are expected in challenging environments like mines and forests, which have come into focus in recent years garforth2020lost . It would be interesting to model the impact of such ‘noise’ explicitly or measure the impact of sensor noise in existing VPR systems.
4.2 Agent
VPR has widespread applications and is thus deployed on a large variety of robotic platforms, including unmanned ground vehicles and autonomous cars doan2019scalable , unmanned aerial vehicles zaffar2019 and unmanned underwater vehicles li2015high . Other platforms where VPR is applied are those tightly coupled to human users such as virtual/augmented reality devices sattler2016efficient and mobile phones Torii2018 .
Computational Resources: A robotic agent typically runs a large number of processes, many of them interacting with each other through tools like the Robot Operating System (ROS) ROS . These processes share limited onboard resources and often require cognitive architectures FischerFRAI2018 to interact efficiently. Thus the resources dedicated to the VPR system might be relatively small, and a GPU (that significantly boosts inference times of deep networks) might not be available. Similarly, storage limitations could mean that the reference map of the operating environment (in the form of images, global/local descriptors, point clouds) has to be of reasonable size. Section 6 discusses some of the open problems in VPR in this context, for example, compact global description, efficient indexing and quantization, and hierarchical place matching pipelines.
Suitable Sensor Suite: Depending on the agent and the operating environment, robust VPR solutions can be developed by using additional suitable sensors. For example, event cameras perform exceptionally well when a high dynamic range is required, such as when exiting a dark tunnel into bright sunlight fischer2020event . LiDAR-based systems can perceive the scene’s geometry even in the most challenging nighttime conditions, although those systems lack appearance information guo2019local . Using omnidirectional cameras or multi-camera rigs increases the field-of-view and thus the visual overlap, which reduces the complexity of image matching.
Choosing the correct sensor type can also help tackle specific challenges such as nighttime conditions. Crucially, the sensor capabilities should drive research into which characteristics are required of learned descriptors. We believe that using novel sensor types such as 3D ultrasonic sensors (e.g. the Toposens TS3) and sensor fusion jacobson2015autonomous could further improve the robustness of VPR systems.
While some sensors can be a replacement for RGB cameras, another area worthy of more thorough investigation is the use of additional information in the form of prior position or ego-motion. For example, one can assume that autonomous cars are equipped with a GPS sensor. Still, despite the popularity of datasets such as Oxford RobotCar RobotCarDatasetIJRR that contain GPS information, it is rarely used for VPR vysotska2015efficient – using GPS information in environments where available could refocus research on GPS-denied environments like tunnels or underground mines that have distinct challenges. While there are many examples of GPS-denied environments, almost all mobile robots have some odometry information, but it has only been used in a limited manner for VPR pepperell2014all .
4.3 Downstream Task
Here, we consider the different tasks that an agent (robotic platform or intelligent augmentation) might perform.
Localization, SLAM and Kidnapped Robot Problem: Some tasks impact the requirements of VPR systems directly. For example, when VPR is used to provide a coarse estimate of the pose within a 6-DoF localization algorithm toft2020long , the error bounds need to be very tight and the visual overlap between the two places relatively large with sufficient parallax. This is opposed to a scenario where loose error bounds are sufficient – a rough location estimate might sufficiently narrow down the search space for a subsequent laser-based pose estimation for global re-localization of a mobile robot (“kidnapped robot problem”) jacobson2021localizes .
The requirements with regard to precision and recall also vary across scenarios. When using VPR to generate loop closures for SLAM (i.e. recognizing that a location has been visited previously, so that a globally consistent map can be built), incorrect matches can lead to catastrophic failures, thus requiring high-precision VPR cadena2016past . On the other hand, one could use VPR to select top matches which are then passed to computationally more intensive stages; in this case, higher recall is more important than precision. Thus, the downstream task is a key determining factor for formulating and evaluating VPR, as further discussed in Section 5.
We note that a purely topological visual SLAM system can be defined through VPR, which is highly relevant for large-scale mapping cummins2008fab ; doan2019scalable . Such a topological SLAM system requires determining whether the currently observed place is a revisited one or is a new ‘unseen’ place, thus posing unique design requirements on VPR.
Higher-level Tasks: The requirements of some downstream tasks like SLAM and Structure from Motion (SfM) are relatively well understood; yet, these requirements are very distinct and probably need a suitably tailored treatment. For example, SfM-based large-scale 3D reconstruction is typically performed offline schoenberger2016sfm and needs sub-pixel accurate alignment of images. The computational requirements of a VPR system then play a much lesser role than in real-time deployments on a mobile platform mapping an unknown environment using visual SLAM.
The requirements of other “higher-level” tasks such as those of augmented reality platforms and navigation are not yet well established. This is in part due to the complex hierarchical nature of typical spatially intelligent systems, for example an augmented reality platform would involve many interrelated components such as image retrieval, sequential localization, local feature matching, visual odometry, and pose refinement stenborgusing . Furthermore, the utility of VPR and mapping for navigation purposes milford2007spatial is a vastly unexplored area, and a deeper understanding of task requirements is needed.
5 How to Evaluate Visual Place Recognition?
This section discusses the evaluation datasets and metrics, in the context of the drivers.
5.1 Evaluation Datasets
There are numerous place recognition datasets, each covering different aspects of VPR (for recent overviews see Warburg_2020_CVPR ; masone2021survey ). Thus some datasets are better suited to investigate specific configurations of proposed drivers (i.e. environments, agents and downstream tasks, see Section 4), while other datasets better represent different scenarios. This highlights the importance of clearly stating the application scenario targeted by a particular VPR system – it may be sufficient that the VPR system excels in datasets that are close to the actual use-case (but not in others).
Recent progress has enabled easier comparison of different methods. VPR-Bench zaffar2020vpr provides a mechanism for comparing a new method on an extensive range of datasets. In light of highly successful standard benchmark datasets in other research areas like visual object tracking kristan2018sixth , we believe that such benchmarking will accelerate VPR research. Mapillary Street Level Sequences (MSLS) Warburg_2020_CVPR is a single dataset that tries to capture all variations of appearance/viewpoint change at once. MSLS notably also introduces ‘sub-tasks’ that can be investigated separately, including summer to winter, day to night, and old to new. An additional benefit of MSLS is that it provides a hold-out test set that can be used for challenges.
If the aim is to design a VPR system applicable in all different scenarios, an open challenge is to design systems that are equally applicable indoors and outdoors. Few studies evaluate systems both indoors and outdoors, one of them being the above-mentioned VPR-Bench zaffar2020vpr . VPR-Bench has shown that performance trends can vary noticeably across environment types, e.g. indoor vs outdoor. However, care should be taken not to make generic assumptions about an architecture when the trained descriptors heavily depend on the training data – the training data should be representative of the data encountered at deployment time. Most recently, Warburg_2020_CVPR have shown that training on more diverse data drastically improves performance on unseen data. This is distinct from the approach where different network configurations are explicitly trained for different scenarios, e.g. one for indoors and another for outdoors sarlin2020superglue .
5.2 Evaluation Metrics
The previous examples show that the downstream task and the relevant evaluation metrics are tightly coupled. However, we note that many VPR papers do not state why a particular evaluation metric was chosen. Notable exceptions include system papers where VPR is one of many components and a specific downstream task is considered, such as cummins2008fab . The computer vision community typically uses the Recall@K measure, which means that the VPR system is benchmarked on its ability to retrieve at least one correct match within the top-K retrievals, regardless of how many false matches are retrieved alongside it. On the other hand, the mean average precision (mAP) metric philbin2007object , used in the image retrieval community, explicitly penalizes the selection of false matches. The mAP metric could be adopted to measure VPR performance for SLAM-like downstream tasks (Section 4.3) where precision is more important, complementing measures like Recall at 100% Precision.
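To make the distinction concrete, both metrics can be sketched as follows (our illustrative implementation with hypothetical names; ranked_refs[q] lists reference indices ranked by similarity for query q, and gt_matches[q] is the set of references considered correct).

```python
import numpy as np

def recall_at_k(ranked_refs, gt_matches, k=5):
    """Fraction of queries with at least one correct match among the top-k retrievals."""
    hits = [len(set(ranked_refs[q][:k]) & gt_matches[q]) > 0
            for q in range(len(ranked_refs))]
    return float(np.mean(hits))

def average_precision(ranked_refs_q, gt_matches_q):
    """Average precision for a single query: mean of the precision values at the
    ranks where a correct match appears, so false matches ranked early are penalized.
    (Assumes the ranking covers all references; mAP is the mean over all queries.)"""
    precisions, n_correct = [], 0
    for rank, ref in enumerate(ranked_refs_q, start=1):
        if ref in gt_matches_q:
            n_correct += 1
            precisions.append(n_correct / rank)
    return float(np.mean(precisions)) if precisions else 0.0
```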
The area under the precision-recall curve and the F-score are sometimes used as summary statistics molloy2020intelligent . However, their practical use is unclear, as these summary statistics imply that recall and precision are of similar importance, which is unlikely to be the case for most downstream tasks. Moreover, these measures are based on the distribution of match scores, which may only be relevant for topological SLAM-like scenarios where VPR needs to be highly precise and no subsequent outlier rejection method is employed.
Most of the VPR datasets in robotics are in the form of trajectories with inherent sequential information (Section 2). Thus, evaluation metrics such as ‘maximum open-loop distance traveled’ (that is, the extent of visual odometry or dead reckoning based robot motion without loop closures) have also been considered in the literature clement2020learning ; porav2018adversarial . We believe it would be beneficial to investigate metrics that tightly couple single-image and sequence-based VPR.
6 What Are Open Research Problems?
This section aims to highlight open research problems, considering the drivers discussed in Section 4. For space reasons and to avoid duplication, we do not cover the open research problems discussed in recent surveys on deep learning methods for VPR zhang2020visual ; masone2021survey , which include using autoencoders as an alternative to Convolutional Neural Networks (CNNs), use of generative methods including Generative Adversarial Networks (GANs), using semantic information, making use of heterogeneous data including multi-sensory fusion, and the choice of loss function.
Here, we broadly classify the open research problems into: 1) representation, discussing the need for better global descriptors and enriched/synthesized reference maps, and 2) matching, discussing the need for better hierarchical matching frameworks, relevant distance metrics and ‘learning to match’.
6.1 Place Representation
Global Descriptors – Appearance & Viewpoint Invariance: Section 4.1 discussed the requirements on viewpoint and appearance invariance depending on the operating environment. Here we note that there is a trade-off when learning a descriptor of a fixed size/type: increasing viewpoint-invariance will inevitably reduce some degree of appearance invariance (assuming the same amount of training data) arandjelovic2016netvlad ; chen2017deep ; garg2019semantic . This is evident from significant differences observed in place recognition performance when considering a cross-combination of datasets such as Nordland (same view, varying appearance) sunderhauf2013we and Pittsburgh (similar appearance, varying view) arandjelovic2016netvlad with feature learning/aggregation methods such as HybridNet (viewpoint-assumed) chen2017deep and NetVLAD (viewpoint-agnostic) arandjelovic2016netvlad .
There is a vast research gap, and a need for global description mechanisms that go beyond the binary nature of encoding viewpoint, that is, viewpoint-assumed vs viewpoint-invariant. This might be achieved by learning novel ways to incorporate local geometric information in the global descriptor formation, such as using vertical blocks (Stixels) hernandez2019slanted , semantic blobs gawel2018x , objects qin2021semantic or superpixels neubert2015superpixel , where learning could be based on attention mechanisms such as those employed in Transformers vaswani2017attention and Graph Neural Networks velivckovic2017graph .
Global Descriptors – Efficiency: Most of the state-of-the-art global image descriptors are high-dimensional (with dimensions varying from 512 sarlin2019coarse to 70,000 sunderhauf2015performance ). Increasing the descriptor’s dimensionality directly leads to increased computational requirements. To improve efficiency, researchers commonly employ dimension-reduction methods such as Principal Component Analysis (PCA), and have also explored quantization jegou2010product ; brandt2010transform ; ge2013optimized ; sandhawalia2010searching ; kalantidis2014locally , binarization lowry2018lightweight ; arroyo2015towards ; jegou2008hamming , hashing vysotska2017relocalization ; gionis1999similarity ; andoni2006near ; weiss2009spectral ; han2017mild and efficient indexing cao2020unifying techniques.
However, there have not been any attempts to learn these efficiency-inducing processes for VPR, particularly considering that retrieving places can include additional information in the form of sequences or odometry. This could be achieved by learning to reduce dimensions mcinnes2018umap ; amid2019trimap ; chancan2020hybrid or to hash wang2017survey , while maintaining the overall structure of the appearance space learnt through the existing global descriptor methods. Existing efficient VPR techniques consider sequential or odometry information in a decoupled manner vysotska2017relocalization ; garg2020fast but could benefit from jointly considering additional information when optimizing for efficiency.
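A minimal non-learned baseline of the kind described above, combining a PCA-style projection with sign binarization (our sketch; the dimensions and names are illustrative), could look as follows.

```python
import numpy as np

def fit_pca(descs, out_dim=64):
    """Fit a simple PCA projection (mean and principal directions) on reference descriptors."""
    mean = descs.mean(axis=0)
    _, _, vt = np.linalg.svd(descs - mean, full_matrices=False)
    return mean, vt[:out_dim]

def compress(desc, mean, components):
    """Project a descriptor onto the principal components and binarize by sign,
    yielding a compact code that can be compared with the Hamming distance."""
    reduced = components @ (desc - mean)
    return np.packbits(reduced > 0)  # out_dim bits -> out_dim / 8 bytes

rng = np.random.default_rng(0)
ref = rng.standard_normal((2000, 512))   # stand-in for 512-D global descriptors
mean, comps = fit_pca(ref, out_dim=64)
code = compress(ref[0], mean, comps)     # 512 floats compressed to an 8-byte code
print(code.nbytes)
```

A learned counterpart would replace the fixed projection and thresholding with components optimized jointly with, or on top of, the global descriptor, ideally while also exploiting sequence or odometry information as discussed above.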
Enriched Reference Maps: With the rapid increase in data gathering, more so in the field of autonomous driving, it is high time to consider the use of an enriched reference map, which could be in the form of multiple reference images per location churchill2012practice or semantically-annotated 3D maps garg2020semantics . In the simplest case, choosing the best reference set can lead to vast performance improvements. More sophisticated approaches fuse multiple reference sets to achieve even better performance churchill2012practice ; linegar2015work ; molloy2020intelligent . Multiple reference sets are often used when long-term autonomy is required, as structural changes can be detected and incorporated over time. While significant progress using multiple reference maps has been made in the past, open questions remain around the increased storage and computational requirements when using multiple reference sets. There is clearly a trade-off, and preliminary efforts in that direction doan2019scalable need further attention.
View Synthesis: To deal with significant viewpoint variations, researchers have also explored matching through multiple synthesized views of the observed places Torii2018 ; milford2015sequence , although the process requires additional computation. However, this can be mitigated by performing view synthesis offline during the mapping traverse and using compact global descriptors with an efficient nearest neighbor search for retrieval. The current bottleneck of automatically generating accurate and relevant views can potentially be addressed by recent advances in volumetric rendering mildenhall2020nerf , performed offline to generate novel views under novel lighting conditions.
6.2 Place Matching
Mutually-informed Hierarchical Systems: Different downstream tasks can impose very different requirements on the VPR system (see Section 4.3). For example, a visual SLAM system can be built with cummins2011appearance or without angeli2009visual ; cummins2008fab a geometric verification step based on local feature matching; cummins2011appearance found this verification to be particularly essential for large datasets. However, it remains an open question how an effective hierarchical system should be designed, where the variables are: the number of complementary VPR techniques fused hausler2019multi , the number of stages, and the types of unique methods involved, e.g., query expansion chum2011total and keypoint filtering garg2018lost .
The complementarity and information transfer within different stages of the hierarchy require an in-depth investigation. This can reveal answers to several overarching questions: should these stages always operate independently, or could earlier stages better inform subsequent stages (beyond simply providing candidate images); could one selectively apply a subset of the techniques to save computational resources; how can such hierarchical retrieval behavior be learnt; and, in doing so, do some of the stages become redundant?
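A minimal two-stage hierarchy can be sketched as follows (our illustration; verify_fn stands in for any expensive verification step, such as local feature matching, and is a hypothetical callable).

```python
import numpy as np

def hierarchical_match(query_global, ref_globals, verify_fn, shortlist=20):
    """Two-stage retrieval: a cheap global-descriptor search produces a shortlist,
    which a more expensive verification function then re-ranks."""
    q = query_global / np.linalg.norm(query_global)
    refs = ref_globals / np.linalg.norm(ref_globals, axis=1, keepdims=True)
    candidates = np.argsort(-(refs @ q))[:shortlist]        # stage 1: global search
    scores = np.array([verify_fn(c) for c in candidates])   # stage 2: verification
    order = np.argsort(-scores)
    return candidates[order], scores[order]
```

In this decoupled form, the only information passed between stages is the candidate list; the open question raised above is whether the first stage could pass richer information (for instance, which image regions drove the global match) to make subsequent stages cheaper or more accurate.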
Choice of Distance Metric: When comparing the global descriptors of two images, one has to choose a suitable distance metric or similarity measure. Some of the most commonly employed measures include the Euclidean arandjelovic2016netvlad ; yu2019spatial ; chen2014convolutional ; sunderhauf2015place , cosine sunderhauf2015performance ; garg2018don ; garg2018lost , and Hamming lowry2018lightweight ; arroyo2014fast ; arroyo2016fusion ; neubert2019neurologically distances. While some descriptors are better matched using one distance than another, the distribution of distances is typically relatively narrow, even for non-matching images of completely different appearance. Therefore, a more systematic investigation considering both the theoretical viewpoint and the practical performance implications is needed. In particular, important factors to consider include suitability to loss functions (e.g. the max-margin triplet loss) arandjelovic2016netvlad ; revaud2019learning ; garg2021seqnet , descriptor normalization arandjelovic2013all ; garg2018don ; schubert2020unsupervised , whitening jegou2012negative ; arandjelovic2013all , feature scaling beatty1997dimensionality ; li2015feature , and quantization/binarization jegou2010product ; lowry2018lightweight .
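One relation worth keeping in mind is that, for L2-normalized descriptors, the Euclidean and cosine distances are monotonically related and therefore induce identical rankings:

$$\|\mathbf{x} - \mathbf{y}\|_2^2 = \|\mathbf{x}\|_2^2 + \|\mathbf{y}\|_2^2 - 2\,\mathbf{x}^\top\mathbf{y} = 2 - 2\cos(\mathbf{x}, \mathbf{y}) \quad \text{for } \|\mathbf{x}\|_2 = \|\mathbf{y}\|_2 = 1,$$

so the choice between the two matters mainly for unnormalized descriptors, or in interaction with the whitening, scaling and quantization steps listed above.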
Learning to Match: While learning to ‘describe’ (i.e. local or global descriptors) has been widely explored, there have been limited attempts to learn to ‘match’. Such matchers can be learnt either through siamese networks altwaijry2016learning or through cross-attention based on graph neural networks sarlin2020superglue . The outcome of such a matcher could be either a matched reference index or a relative 3D pose gridseth2020deepmel . Learning-to-match frameworks for VPR and localization could potentially eliminate the need for sophisticated matching pipelines.
7 Conclusions
This survey defines VPR based on the visual overlap of two observations, in line with spatial view cells found in primates. This new definition enabled us to discuss how VPR closely relates to other research areas. This paper also identified and detailed three key drivers of Spatial AI: the environment, agent and downstream task. Considering these drivers, we then discussed numerous open research problems that we think are worth addressing in future VPR research.
To date, VPR research has addressed the problems of representing, associating (matching), and searching of spatial data, and is a key enabler of Spatial AI. Further advances in VPR research will require unifying the efforts of the artificial intelligence, computer vision, robotics, and machine learning communities, particularly taking into account embodied agents. To achieve this, an in-depth understanding of the problem, research goals and evaluation protocols is necessary, and this paper takes a step in that direction.
Acknowledgments: This work received funding from the Australian Government, via grant AUSMURIB000001 associated with ONR MURI grant N00014-19-1-2571. The authors acknowledge continued support from the Queensland University of Technology (QUT) through the Centre for Robotics.
References
- (1) Hani Altwaijry, Eduard Trulls, James Hays, Pascal Fua, and Serge Belongie. Learning to match aerial images with deep attentive architectures. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3539–3547, 2016.
- (2) Ehsan Amid and Manfred K Warmuth. TriMap: Large-scale dimensionality reduction using triplets. arXiv preprint arXiv:1910.00204, 2019.
- (3) Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In IEEE Symposium Foundations Computer Science, pages 459–468, 2006.
- (4) Adrien Angeli, Stéphane Doncieux, Jean-Arcady Meyer, and David Filliat. Visual topological SLAM and global localization. In IEEE Int. Conf. Robot. Autom., pages 4300–4305, 2009.
- (5) Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. IEEE Trans. Pattern Anal. Mach. Intell., 40(6):1437–1451, 2017.
- (6) Relja Arandjelovic and Andrew Zisserman. All about VLAD. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1578–1585, 2013.
- (7) Roberto Arroyo, Pablo F Alcantarilla, Luis M Bergasa, and Eduardo Romera. Towards life-long visual localization using an efficient matching of binary sequences from images. In IEEE Int. Conf. Robot. Autom., pages 6328–6335, 2015.
- (8) Roberto Arroyo, Pablo F Alcantarilla, Luis M Bergasa, and Eduardo Romera. Fusion and binarization of CNN features for robust topological localization across seasons. In IEEE/RSJ Int. Conf. Intell. Robot. Syst., pages 4656–4663, 2016.
- (9) Roberto Arroyo, Pablo F Alcantarilla, Luis M Bergasa, J Javier Yebes, and Sebastián Bronte. Fast and effective visual place recognition using binary codes and disparity information. In IEEE/RSJ Int. Conf. Intell. Robot. Syst., pages 3089–3094, 2014.
- (10) Artem Babenko and Victor Lempitsky. Aggregating local deep features for image retrieval. In Int. Conf. Comput. Vis., pages 1269–1277, 2015.
- (11) Morris Beatty and BS Manjunath. Dimensionality reduction using multi-dimensional scaling for content-based retrieval. In IEEE Int. Conf. Image Process., volume 2, pages 835–838. IEEE, 1997.
- (12) Jonathan Brandt. Transform coding for fast approximate nearest neighbor search in high dimensions. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1815–1822, 2010.
- (13) Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, José Neira, Ian Reid, and John J Leonard. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Trans. Robot., 32(6):1309–1332, 2016.
- (14) Bingyi Cao, Andre Araujo, and Jack Sim. Unifying deep local and global features for efficient image search. In Eur. Conf. Comput. Vis., pages 726–743, 2020.
- (15) Marvin Chancán, Luis Hernandez-Nunez, Ajay Narendra, Andrew B Barron, and Michael Milford. A hybrid compact neural architecture for visual place recognition. IEEE Robot. Autom. Lett., 5(2):993–1000, 2020.
- (16) Xieyuanli Chen, Thomas Läbe, Andres Milioto, Timo Röhling, Olga Vysotska, Alexandre Haag, Jens Behley, and Cyrill Stachniss. OverlapNet: Loop closing for LiDAR-based SLAM. In Robotics: Science and Systems, 2020.
- (17) Zetao Chen, Adam Jacobson, Niko Sünderhauf, Ben Upcroft, Lingqiao Liu, Chunhua Shen, Ian Reid, and Michael Milford. Deep learning features at scale for visual place recognition. In IEEE Int. Conf. Robot. Autom., pages 3223–3230, 2017.
- (18) Zetao Chen, Obadiah Lam, Adam Jacobson, and Michael Milford. Convolutional neural network-based place recognition. In Australasian Conf. Robot. Autom., 2014.
- (19) Anna Choromanska, Alekh Agarwal, and John Langford. Extreme multi class classification. In NeurIPS Workshop: eXtreme Classification, 2013.
- (20) Ondřej Chum, Andrej Mikulik, Michal Perdoch, and Jiří Matas. Total recall II: Query expansion revisited. In IEEE Conf. Comput. Vis. Pattern Recog., pages 889–896, 2011.
- (21) Winston Churchill and Paul Newman. Practice makes perfect? Managing and leveraging visual experiences for lifelong navigation. In IEEE Int. Conf. Robot. Autom., pages 4525–4532, 2012.
- (22) Lee Clement, Mona Gridseth, Justin Tomasi, and Jonathan Kelly. Learning matchable image transformations for long-term metric visual localization. IEEE Robot. Autom. Lett., 5(2):1492–1499, 2020.
- (23) Peter Corke, Rohan Paul, Winston Churchill, and Paul Newman. Dealing with shadows: Capturing intrinsic scene appearance for image-based outdoor localisation. In IEEE/RSJ Int. Conf. Intell. Robot. Syst., pages 2085–2092, 2013.
- (24) Mark Cummins and Paul Newman. FAB-MAP: Probabilistic localization and mapping in the space of appearance. Int. J. Robot. Res., 27(6):647–665, 2008.
- (25) Mark Cummins and Paul Newman. Appearance-only SLAM at large scale with FAB-MAP 2.0. Int. J. Robot. Res., 30(9):1100–1123, 2011.
- (26) Anh-Dzung Doan, Yasir Latif, Tat-Jun Chin, Yu Liu, Thanh-Toan Do, and Ian Reid. Scalable place recognition under appearance change for autonomous driving. In Int. Conf. Comput. Vis., pages 9319–9328, 2019.
- (27) Jose M Facil, Daniel Olid, Luis Montesano, and Javier Civera. Condition-invariant multi-view place recognition. arXiv preprint arXiv:1902.09516, 2019.
- (28) Tobias Fischer and Michael Milford. Event-based visual place recognition with ensembles of temporal windows. IEEE Robot. Autom. Lett., 5(4):6924–6931, 2020.
- (29) Tobias Fischer, Jordi-Ysard Puigbò, Daniel Camilleri, Phuong DH Nguyen, Clément Moulin-Frier, Stéphane Lallée, Giorgio Metta, Tony J Prescott, Yiannis Demiris, and Paul FMJ Verschure. iCub-HRI: A software framework for complex human-robot interaction scenarios on the iCub humanoid robot. Front. Robot. AI, 5(22):1–9, 2018.
- (30) James Garforth and Barbara Webb. Lost in the woods? Place recognition for navigation in difficult forest environments. Front. Robot. AI, 7, 2020.
- (31) Sourav Garg, Ben Harwood, Gaurangi Anand, and Michael Milford. Delta descriptors: Change-based place representation for robust visual localization. IEEE Robot. Autom. Lett., 5(4):5120–5127, 2020.
- (32) Sourav Garg and Michael Milford. Fast, compact and highly scalable visual place recognition through sequence-based matching of overloaded representations. In IEEE Int. Conf. Robot. Autom., 2020.
- (33) Sourav Garg and Michael Milford. Seqnet: Learning descriptors for sequence-based hierarchical place recognition. IEEE Robot. Autom. Lett., 2021.
- (34) Sourav Garg, Niko Suenderhauf, and Michael Milford. Don’t look back: Robustifying place categorization for viewpoint- and condition-invariant place recognition. In IEEE Int. Conf. Robot. Autom., pages 3645–3652, 2018.
- (35) Sourav Garg, Niko Suenderhauf, and Michael Milford. LoST? appearance-invariant place recognition for opposite viewpoints using visual semantics. In Robotics: Science and Systems, 2018.
- (36) Sourav Garg, Niko Suenderhauf, and Michael Milford. Semantic–geometric visual place recognition: a new perspective for reconciling opposing views. Int. J. Robot. Res., page 0278364919839761, 2019.
- (37) Sourav Garg, Niko Sünderhauf, Feras Dayoub, Douglas Morrison, Akansel Cosgun, Gustavo Carneiro, Qi Wu, Tat-Jun Chin, Ian Reid, Stephen Gould, Peter Corke, and Michael Milford. Semantics for robotic mapping, perception and interaction: A survey. Found. Trends Robot., 8(1–2):1–224, 2020.
- (38) Abel Gawel, Carlo Del Don, Roland Siegwart, Juan Nieto, and Cesar Cadena. X-view: Graph-based semantic multi-view localization. IEEE Robot. Autom. Lett., 3(3):1687–1694, 2018.
- (39) Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized product quantization for approximate nearest neighbor search. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2946–2953, 2013.
- (40) Pierre Georges-François, Edmund T. Rolls, and Robert G. Robertson. Spatial View Cells in the Primate Hippocampus: Allocentric View not Head Direction or Eye Position or Place. Cerebral Cortex, 9(3):197–212, 1999.
- (41) Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. Similarity search in high dimensions via hashing. In Int. Conf. Very Large Data Bases, pages 518–529, 1999.
- (42) Mona Gridseth and Timothy D Barfoot. DeepMEL: Compiling visual multi-experience localization into a deep neural network. In IEEE Int. Conf. Robot. Autom., pages 1674–1681, 2020.
- (43) Jiadong Guo, Paulo VK Borges, Chanoh Park, and Abel Gawel. Local descriptor for robust place recognition using LiDAR intensity. IEEE Robot. Autom. Lett., 4(2):1470–1477, 2019.
- (44) Lei Han and Lu Fang. MILD: Multi-index hashing for loop closure detection. In Int. Conf. Multimedia and Expo, 2017.
- (45) Stephen Hausler, Sourav Garg, Ming Xu, Michael Milford, and Tobias Fischer. Patch-netvlad: Multi-scale fusion of locally-global descriptors for place recognition. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
- (46) Stephen Hausler, Adam Jacobson, and Michael Milford. Multi-process fusion: Visual place recognition using multiple image processing methods. IEEE Robot. Autom. Lett., 4(2):1924–1931, 2019.
- (47) Daniel Hernandez-Juarez, Lukas Schneider, Pau Cebrian, Antonio Espinosa, David Vazquez, Antonio M López, Uwe Franke, Marc Pollefeys, and Juan C Moure. Slanted stixels: A way to represent steep streets. Int. J. Comput. Vis., 127(11-12):1643–1658, 2019.
- (48) Adam Jacobson, Zetao Chen, and Michael Milford. Autonomous multisensor calibration and closed-loop fusion for SLAM. J. Field. Robot., 32(1):85–122, 2015.
- (49) Adam Jacobson, Fan Zeng, David Smith, Nigel Boswell, Thierry Peynot, and Michael Milford. What localizes beneath: A metric multisensor localization and mapping system for autonomous underground mining vehicles. J. Field. Robot., 38(1):5–27, 2021.
- (50) Hervé Jégou and Ondřej Chum. Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening. In Eur. Conf. Comput. Vis., pages 774–787. Springer, 2012.
- (51) Herve Jegou, Matthijs Douze, and Cordelia Schmid. Hamming embedding and weak geometric consistency for large scale image search. In Eur. Conf. Comput. Vis., pages 304–317, 2008.
- (52) Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–128, 2010.
- (53) Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3304–3311. IEEE, 2010.
- (54) Yannis Kalantidis and Yannis Avrithis. Locally optimized product quantization for approximate nearest neighbor search. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2321–2328, 2014.
- (55) Matej Kristan, Ales Leonardis, Jiri Matas, Michael Felsberg, Roman Pflugfelder, Luka Čehovin Zajc, Tomas Vojir, Goutam Bhat, Alan Lukezic, Abdelrahman Eldesokey, et al. The sixth visual object tracking VOT2018 challenge results. In Eur. Conf. Comput. Vis. Worksh., pages 3–53, 2018.
- (56) Dong Li, Baoxian Zhang, and Cheng Li. A feature-scaling-based k-nearest neighbor algorithm for indoor positioning systems. IoT J., 3(4):590–597, 2015.
- (57) Haoang Li, Jian Yao, Jean-Charles Bazin, Xiaohu Lu, Yazhou Xing, and Kang Liu. A monocular SLAM system leveraging structural regularity in Manhattan world. In IEEE Int. Conf. Robot. Autom., pages 2518–2525, 2018.
- (58) Jie Li, Ryan M Eustice, and Matthew Johnson-Roberson. High-level visual features for underwater place recognition. In IEEE Int. Conf. Robot. Autom., pages 3652–3659, 2015.
- (59) Chris Linegar, Winston Churchill, and Paul Newman. Work smart, not hard: Recalling relevant experiences for vast-scale but time-constrained localisation. In IEEE Int. Conf. Robot. Autom., pages 90–97, 2015.
- (60) Dongfang Liu, Yiming Cui, Liqi Yan, Christos Mousas, Baijian Yang, and Yingjie Chen. DenserNet: Weakly supervised visual localization using multi-scale feature aggregation. In AAAI, 2021.
- (61) Stephanie Lowry and Henrik Andreasson. Lightweight, viewpoint-invariant visual place recognition in changing environments. IEEE Robot. Autom. Lett., 3(2):957–964, 2018.
- (62) Stephanie Lowry, Niko Sunderhauf, Paul Newman, John J. Leonard, David Cox, Peter Corke, and Michael J. Milford. Visual place recognition: A survey. IEEE Trans. Robot., 32(1):1–19, 2016.
- (63) Ren C Luo, Harsh Potlapalli, and David W Hislop. Neural network based landmark recognition for robot navigation. In Int. Conf. Ind. Electron. Control Instrum. Autom., pages 1084–1088, 1992.
- (64) Will Maddern, Geoff Pascoe, Chris Linegar, and Paul Newman. 1 Year, 1000km: The Oxford RobotCar dataset. Int. J. Robot. Res., 36(1):3–15, 2017.
- (65) Carlo Masone and Barbara Caputo. A survey on deep visual place recognition. IEEE Access, 9:19516–19547, 2021.
- (66) Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. UMAP: Uniform manifold approximation and projection. J. Open Source Softw., 3(29), 2018.
- (67) Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In Eur. Conf. Comput. Vis., pages 405–421, 2020.
- (68) Michael Milford, Chunhua Shen, Stephanie Lowry, Niko Suenderhauf, Sareh Shirazi, Guosheng Lin, Fayao Liu, Edward Pepperell, Cesar Lerma, Ben Upcroft, et al. Sequence searching with deep-learnt depth for condition- and viewpoint-invariant route-based place recognition. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 18–25, 2015.
- (69) Michael Milford and Gordon Wyeth. Spatial mapping and map exploitation: a bio-inspired engineering perspective. In Int. Conf. Spatial Inf. Theory, pages 203–221, 2007.
- (70) Michael J. Milford and Gordon F. Wyeth. SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights. In IEEE Int. Conf. Robot. Autom., pages 1643–1649, 2012.
- (71) Timothy L. Molloy, Tobias Fischer, Michael Milford, and Girish N. Nair. Intelligent reference curation for visual place recognition via Bayesian selective fusion. IEEE Robot. Autom. Lett., 6(2):588–595, 2021.
- (72) James Mount, Les Dawes, and Michael J Milford. Automatic coverage selection for surface-based visual localization. IEEE Robot. Autom. Lett., 4(4):3900–3907, 2019.
- (73) Peer Neubert, Stefan Schubert, and Peter Protzel. A neurologically inspired sequence processing model for mobile robot place recognition. IEEE Robot. Autom. Lett., 4(4):3200–3207, 2019.
- (74) Peer Neubert, Niko Sünderhauf, and Peter Protzel. Superpixel-based appearance change prediction for long-term navigation across seasons. Robotics and Autonomous Systems, 69:15–27, 2015.
- (75) Edward Pepperell, Peter I Corke, and Michael J Milford. All-environment visual place recognition with smart. In IEEE Int. Conf. Robot. Autom., pages 1612–1618, 2014.
- (76) James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1–8, 2007.
- (77) Noé Pion, Martin Humenberger, Gabriela Csurka, Yohann Cabon, and Torsten Sattler. Benchmarking image retrieval for visual localization. In 3DV, 2020.
- (78) Horia Porav, Will Maddern, and Paul Newman. Adversarial training for adverse conditions: Robust metric localisation using appearance transfer. In IEEE Int. Conf. Robot. Autom., pages 1011–1018, 2018.
- (79) Cao Qin, Yunzhou Zhang, Yingda Liu, and Guanghao Lv. Semantic loop closure detection based on graph matching in multi-objects scenes. J. Vis. Comm. Image Rep., page 103072, 2021.
- (80) Morgan Quigley, Ken Conley, Brian P. Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y. Ng. ROS: an open-source Robot Operating System. In IEEE Int. Conf. Robot. Autom. Worksh., 2009.
- (81) Anita Rau, Guillermo Garcia-Hernando, Danail Stoyanov, Gabriel J. Brostow, and Daniyar Turmukhambetov. Predicting visual overlap of images through interpretable non-metric box embeddings. In Eur. Conf. Comput. Vis., pages 629–646, 2020.
- (82) Jerome Revaud, Jon Almazán, Rafael S Rezende, and Cesar Roberto de Souza. Learning with average precision: Training image retrieval with a listwise loss. In Int. Conf. Comput. Vis., pages 5107–5116, 2019.
- (83) Harsimrat Sandhawalia and Hervé Jégou. Searching with expectations. In IEEE Int. Conf. Acoustics, Speech, Signal Processing, pages 1242–1245, 2010.
- (84) Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In IEEE Conf. Comput. Vis. Pattern Recog., pages 12716–12725, 2019.
- (85) Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4938–4947, 2020.
- (86) Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. IEEE Trans. Pattern Anal. Mach. Intell., 39(9):1744–1756, 2016.
- (87) Olivier Saurer, Pascal Vasseur, Rémi Boutteau, Cédric Demonceaux, Marc Pollefeys, and Friedrich Fraundorfer. Homography based egomotion estimation with a common direction. IEEE Trans. Pattern Anal. Mach. Intell., 39(2):327–341, 2016.
- (88) Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4104–4113, 2016.
- (89) Stefan Schubert, Peer Neubert, and Peter Protzel. Unsupervised learning methods for visual place recognition in discretely and continuously changing environments. In IEEE Int. Conf. Robot. Autom., pages 4372–4378. IEEE, 2020.
- (90) Josef Sivic and Andrew Zisserman. Video google: A text retrieval approach to object matching in videos. In Int. Conf. Comput. Vis., page 1470. IEEE, 2003.
- (91) Erik Stenborg, Torsten Sattler, and Lars Hammarstrand. Using image sequences for long-term visual localization. In 3DV, pages 938–948, 2020.
- (92) Niko Sünderhauf, Peer Neubert, and Peter Protzel. Are we there yet? challenging SeqSLAM on a 3000 km journey across all four seasons. In IEEE Int. Conf. Robot. Autom. Worksh., 2013.
- (93) Niko Sünderhauf, Sareh Shirazi, Feras Dayoub, Ben Upcroft, and Michael Milford. On the performance of ConvNet features for place recognition. In IEEE/RSJ Int. Conf. Intell. Robot. Syst., pages 4297–4304, 2015.
- (94) Niko Sünderhauf, Sareh Shirazi, Adam Jacobson, Feras Dayoub, Edward Pepperell, Ben Upcroft, and Michael Milford. Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free. In Robotics: Science and Systems, 2015.
- (95) Koji Takeda and Kanji Tanaka. Dark reciprocal-rank: Boosting graph-convolutional self-localization network via teacher-to-student knowledge transfer. arXiv preprint arXiv:2011.00402, 2020.
- (96) Felix Taubner, Florian Tschopp, Tonci Novkovic, Roland Siegwart, and Fadri Furrer. LCD – Line clustering and description for place recognition. In 3DV, 2020.
- (97) Marvin Teichmann, Andre Araujo, Menglong Zhu, and Jack Sim. Detect-to-retrieve: Efficient regional aggregation for image search. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5109–5118, 2019.
- (98) Carl Toft, Will Maddern, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, et al. Long-term visual localization revisited. IEEE Trans. Pattern Anal. Mach. Intell., to appear, 2020.
- (99) Carl Toft, Daniyar Turmukhambetov, Torsten Sattler, Fredrik Kahl, and Gabriel J Brostow. Single-image depth prediction makes feature matching easier. In Eur. Conf. Comput. Vis., pages 473–492. Springer, 2020.
- (100) Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, and Tomas Pajdla. 24/7 place recognition by view synthesis. IEEE Trans. Pattern Anal. Mach. Intell., 40(2):257–271, 2018.
- (101) Satyajit Tourani, Dhagash Desai, Udit Singh Parihar, Sourav Garg, Ravi Kiran Sarvadevabhatla, and K Madhava Krishna. Early bird: Loop closures from opposing viewpoints for perceptually-aliased indoor environments. arXiv preprint arXiv:2010.01421, 2020.
- (102) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- (103) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In Int. Conf. Learn. Represent., 2018.
- (104) Olga Vysotska, Tayyab Naseer, Luciano Spinello, Wolfram Burgard, and Cyrill Stachniss. Efficient and effective matching of image sequences under substantial appearance changes exploiting gps priors. In IEEE Int. Conf. Robot. Autom., pages 2774–2779, 2015.
- (105) Olga Vysotska and Cyrill Stachniss. Relocalization under substantial appearance changes using hashing. In IEEE/RSJ Int. Conf. Intell. Robot. Syst. Worksh., 2017.
- (106) Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. A survey on learning to hash. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):769–790, 2017.
- (107) Frederik Warburg, Soren Hauberg, Manuel López-Antequera, Pau Gargallo, Yubin Kuang, and Javier Civera. Mapillary street-level sequences: A dataset for lifelong place recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2626–2635, 2020.
- (108) Yair Weiss, Antonio Torralba, Robert Fergus, et al. Spectral hashing. In Adv. Neural Inform. Process. Syst., 2008.
- (109) Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2575–2584, 2020.
- (110) Jianxin Wu, Henrik Christensen, and James Rehg. Visual place categorization: Problem, dataset, and algorithm. In IEEE/RSJ Int. Conf. Intell. Robot. Syst., pages 4763–4770, 2009.
- (111) Zhe Xin, Yinghao Cai, Tao Lu, Xiaoxia Xing, Shaojun Cai, Jixiang Zhang, Yiping Yang, and Yanqing Wang. Localizing discriminative visual landmarks for place recognition. In IEEE Int. Conf. Robot. Autom., pages 5979–5985, 2019.
- (112) Tsun-Yi Yang, Duy-Kien Nguyen, Huub Heijnen, and Vassileios Balntas. Ur2kid: Unifying retrieval, keypoint detection, and keypoint description without local correspondence supervision. arXiv preprint arXiv:2001.07252, 2020.
- (113) Jun Yu, Chaoyang Zhu, Jian Zhang, Qingming Huang, and Dacheng Tao. Spatial pyramid-enhanced NetVLAD with weighted triplet loss for place recognition. IEEE Transactions on Neural Networks and Learning Systems, 31(2):661–674, 2019.
- (114) Haosong Yue, Jinyu Miao, Yue Yu, Weihai Chen, and Changyun Wen. Robust loop closure detection based on bag of superpoints and graph verification. In IEEE/RSJ Int. Conf. Intell. Robot. Syst., pages 3787–3793. IEEE, 2019.
- (115) Mubariz Zaffar, Shoaib Ehsan, Michael Milford, David Flynn, and Klaus McDonald-Maier. VPR-Bench: An open-source visual place recognition evaluation framework with quantifiable viewpoint and appearance change. arXiv:2005.08135, 2020.
- (116) Mubariz Zaffar, Ahmad Khaliq, Shoaib Ehsan, Michael Milford, Kostas Alexis, and Klaus McDonald-Maier. Are state-of-the-art visual place recognition techniques any good for aerial robotics? In IEEE Int. Conf. Robot. Autom. Worksh., 2019.
- (117) Xiwu Zhang, Lei Wang, and Yan Su. Visual place recognition: A survey from deep learning perspective. Pattern Recognition, 2020.
- (118) Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 40(6):1452–1464, 2017.