
GOTLoc: General Outdoor Text-based Localization
Using Scene Graph Retrieval with OpenStreetMap

   Donghwi Jung, Keonwoo Kim, and Seong-Woo Kim The authors are with Seoul National University, Seoul, South Korea. {donghwijung,knwoo,snwoo}@snu.ac.kr
Abstract

We propose GOTLoc, a robust localization method capable of operating even in outdoor environments where GPS signals are unavailable. The method achieves this robust localization by leveraging comparisons between scene graphs generated from text descriptions and maps. Existing text-based localization studies typically represent maps as point clouds and identify the most similar scenes by comparing embeddings of text and point cloud data. However, point cloud maps have limited scalability as it is impractical to pre-generate maps for all outdoor spaces. Furthermore, their large data size makes it challenging to store and utilize them directly on actual robots. To address these issues, GOTLoc leverages compact data structures, such as scene graphs, to store spatial information, enabling individual robots to carry and utilize large amounts of map data. Additionally, by utilizing publicly available map data, such as OpenStreetMap, which provides global information on outdoor spaces, we eliminate the need for additional effort to create custom map data. For performance evaluation, we utilized the KITTI360Pose dataset in conjunction with corresponding OpenStreetMap data to compare the proposed method with existing approaches. Our results demonstrate that the proposed method achieves accuracy comparable to algorithms relying on point cloud maps. Moreover, in city-scale tests, GOTLoc required significantly less storage compared to point cloud-based methods and completed overall processing within a few seconds, validating its applicability to real-world robotics. Our code is available at https://github.com/donghwijung/GOTLoc.

I Introduction

In the current era of widespread adoption of autonomous platforms, such as delivery robots [1] or autonomous taxis [2], it is necessary to communicate the caller’s current location to the robot in order to summon it. In outdoor environments, GPS is commonly used for determining locations. However, adverse conditions such as weather [3], obstructing structures [4], and intentional jamming [5] can make GPS unreliable. Consequently, there has been a need for localization algorithms that operate without GPS [6]. Humans typically localize themselves by visually perceiving their surroundings and estimating their location based on that information [7]. When this visual localization is insufficient, they rely on external devices, such as smartphones. Translating this scenario into the delivery robot context, as illustrated in Fig. 1, a user provides environmental information to the robot, which then determines the location using its stored map data. Among the various ways information can be conveyed to robots, verbal communication is one of the most natural approaches for humans [8]. In this context, several studies have explored text-based localization, where natural language is used to communicate information to robots [9, 10, 11, 12, 13].
Studies such as [9, 10, 11, 12] propose methods where a pre-segmented point cloud map is created beforehand. Both the input text and the point cloud map are embedded using a neural network trained via contrastive learning [14]. The localization process involves finding the most similar text and map embedding pair to identify the corresponding scene. While these methods achieve high localization accuracy, they rely on pre-segmented point cloud maps [15], which have several limitations. The creation of such maps is time-consuming, and their large data size makes it impractical for robots to store map data for extensive areas. These limitations hinder the application of such methods in real-world robotic systems.

Figure 1: Concept of GOTLoc. The user describes the surrounding environment in text and provides it to the delivery robot. The robot interprets the text description to determine the user’s location. In this process, both the text description and the OSM data are converted into scene graphs. The robot then retrieves and identifies the most similar pair, estimating the current scene based on the highest similarity.

To address these issues, Where am I? [13] leveraged scene graphs [16] to model map data, effectively reducing storage demands. The method was evaluated in indoor settings using the 3DSSG dataset [17] and the ScanScribe dataset [18]. A transformer architecture was employed to compute joint embeddings for text graphs and scene graphs, and their similarities were analyzed to identify the most analogous graph pair for localization. Representing map data through scene graphs allowed for a more compact storage requirement in their approach. However, the method was tested exclusively in indoor environments and focused solely on settings with pre-generated scene graphs, leaving outdoor environments unaddressed. Outdoor environments present additional challenges due to the larger number of scenes to be compared. Calculating joint embeddings for all scene graphs, as in [13], is computationally expensive, making real-world application in robots infeasible. To overcome this computational burden, we introduce a method that first extracts matching candidates from a vectorDB [19] using text embeddings and then computes joint embeddings only with these candidates, thereby reducing overall processing time.
To further increase the practical applicability of our method, we utilize OpenStreetMap (OSM) [20] for outdoor map data. OSM is an open dataset containing information on outdoor spaces globally, making it versatile for use in various environments. Additionally, OSM’s categorization of data into nodes, ways, and relations makes it well-suited for conversion into scene graphs [21]. Unlike a previous study [22], which represents scene graphs through numerical spatial relationships based on distances and angles, our approach integrates text-based information to solve the text-based localization problem, estimating positions based on text queries.
The contributions of this paper are as follows:

  • We propose GOTLoc, a method leveraging scene graphs to reduce map data storage requirements and maintain consistent algorithm speed even as the number of frames increases, making it suitable for real-world applications.

  • The proposed GOTLoc utilizes OSM data to enable localization in various outdoor environments without the need for preprocessing steps such as map generation, thereby enhancing its practicality.

  • We validate the proposed approach in city-scale environments, demonstrating high accuracy, real-time performance, and minimal map data storage requirements, confirming its applicability to real-world robotic systems.

II Related Works

The characteristics of existing text-based localization studies and GOTLoc are compared in Table I.

II-A Point cloud map-based

Existing methods leveraging text for localization include [9, 10, 11, 12]. These studies employ the KITTI360Pose dataset [9] and focus on localization by comparing the similarity between point cloud maps and textual descriptions. Each study differs slightly in the methodology used to compare text and point cloud maps. First, [9] utilized a bidirectional Long Short-Term Memory (LSTM) [23] to identify similar scenes. Building on this work, [10] introduced the Relation Enhanced Transformer (RET) network, which improved performance by employing self-attention and cross-attention mechanisms to compare text and point cloud maps. Subsequently, [11] leveraged the Cascaded Cross Attention Transformer (CCAT) proposed in CASSPR [24], achieving superior localization performance compared to the RET-based approach [10]. Furthermore, Shang et al. [12] utilized three structures, Point Clouds Mamba (PCM), Text-Attention Mamba (TAM), and Cascaded Cross-Attention Mamba (CCAM), based on the Mamba architecture proposed in [25], to perform text-based localization. Their approach currently represents the state of the art for text-based localization on the KITTI360Pose dataset. However, these methods rely on point cloud maps for localization, which poses challenges such as storage requirements and scalability to larger spaces, limiting their practical applicability in real-world robotic systems.

TABLE I: Comparison with previous text-based localization works.
Methods | Outdoor | Lightweight map data | w/o prior map
[9, 10, 11, 12] | ✓ | ✗ | ✗
[13] | ✗ | ✓ | ✗
GOTLoc (ours) | ✓ | ✓ | ✓

II-B Scene graph-based

In contrast, Where am I? [13] employed 3D scene graphs [16] for text-based localization. The research assessed the model’s performance using indoor scene graph datasets, including 3DSSG [17] and ScanScribe [18]. In addition, word2vec [26] was applied to process node and edge labels when generating both text and scene graphs. These graphs were then embedded using a graph transformer network with message passing, as proposed by [27], and similarity scores were calculated through a Multi-Layer Perceptron (MLP) to identify the most similar scenes. By representing scene data as scene graphs, their method addresses the challenges of existing text-based localization techniques [9, 10, 11, 12], which rely on large-scale point cloud map data. However, the scope of Where am I? [13] is confined to indoor environments.
In this paper, we extend the scope of text-based localization using scene graph retrieval to outdoor environments, enabling it to operate in more diverse settings. To achieve this, we incorporate OSM as the map source, a graph transformer network for retrieval, and scene graph embeddings for an initial extraction of comparison candidates. As a result, the proposed method features compact scene data storage, fast processing speed, and high accuracy, establishing it as a real-time, high-performance, city-scale outdoor localization solution.

Figure 2: Process of GOTLoc. The method consists of three sequential steps. 1) Scene graph generation, 2) Scene graph candidates extraction, 3) Scene graph retrieval and scene selection. The input data are a query text description and OSM. The output is the matching scene id.

III Methods

The process of our proposed method is illustrated in Fig. 2. This process takes text descriptions and OSM data as inputs and returns the scene id with the highest degree of matching as an output. The detailed steps are as follows:

III-A Scene graph generation

First, the input data, consisting of text and OSM, is converted into scene graphs. The scene graph generation process is depicted in Fig. 3.

G = \{N, E\},  (1)
N = \{n\}, \quad E = \{e\},  (2)

where G refers to a scene graph. n and e denote the nodes and edges that constitute the scene graph G, respectively. At this point, the nodes and edges correspond to the labels and relationships shown in Fig. 3. Furthermore, the center positions of the nodes are used solely for the calculation of relationships and do not directly define the nodes themselves. N and E represent the sets of n and e, respectively.
For the text input, the source and target labels, along with the spatial relation between these labels, are parsed and represented as nodes and edges in the graph, respectively. For the OSM data, elements such as nodes, ways, and relations within a predefined distance from a reference point are extracted.

T_l, T_r = \phi(T),  (3)
G^T = g^T(T_l, T_r),  (4)

where T represents the text description that depicts the scene. ϕ is a function used to extract labels and spatial relations from the text description. T_l and T_r represent the sets of extracted labels and spatial relations, respectively. G^T refers to the scene graph generated from the text. Additionally, g^T denotes the function that generates G^T from the two textual components.
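To make the construction of G^T concrete, the following is a minimal sketch using NetworkX (which our implementation also relies on, Sec. IV-B). It assumes the parser ϕ has already produced (source label, spatial relation, target label) triples; the attribute names label and relation are illustrative choices rather than fixed by the method.

```python
import networkx as nx

def build_text_scene_graph(triples):
    """Assemble G^T from (source label, relation, target label) triples produced
    by the text parser phi (the parser itself is outside this sketch)."""
    graph = nx.DiGraph()
    for source, relation, target in triples:
        graph.add_node(source, label=source)                # nodes carry object labels
        graph.add_node(target, label=target)
        graph.add_edge(source, target, relation=relation)   # edges carry spatial relations
    return graph

# e.g., "the pole is east of the building; the tree is north of the building"
triples = [("pole", "East", "building"), ("tree", "North", "building")]
text_graph = build_text_scene_graph(triples)
print(list(text_graph.edges(data=True)))
```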
Positional relationships between nodes are calculated, and if the distance between two nodes is below a predefined threshold, an edge is created between them. The positional differences between nodes in global coordinates are represented as vectors, and these vectors are used to determine the relationship represented by the edges. This process ultimately generates the OSM scene graph. Additionally, under the assumption that the ego’s heading angle is known in advance, we represent the edge relations of both the text and the OSM scene graphs in the East-North-Up (ENU) coordinate system using four directional components: North, South, East, and West. This ensures that the edge relations of the scene graphs generated from the two different data sources, text and OSM, are expressed within the same coordinate system, enabling direct graph-to-graph comparisons. The scene graph generation process can be formally represented as follows:

\Delta p = p_j - p_i,  (5)
r^M_{i,j} = \begin{cases} \text{North}, & \text{if } \|\Delta p\|_x < \|\Delta p\|_y,\ (\Delta p)_y > 0 \\ \text{South}, & \text{else if } \|\Delta p\|_x < \|\Delta p\|_y,\ (\Delta p)_y < 0 \\ \text{East}, & \text{else if } \|\Delta p\|_x > \|\Delta p\|_y,\ (\Delta p)_x > 0 \\ \text{West}, & \text{else if } \|\Delta p\|_x > \|\Delta p\|_y,\ (\Delta p)_x < 0 \end{cases}  (6)

where p represents the position of an object in the global coordinate system, and Δp is the vector calculated as the positional difference between two objects. This vector is used to compute the spatial relation r^M_{i,j} as in Eq. (6). ‖·‖_x and ‖·‖_y denote the magnitudes of the x and y components of the vector, respectively.
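The directional rule of Eqs. (5)-(6) can be written compactly as in the sketch below. The positions are assumed to be ENU coordinates (x east, y north); ties between the two magnitudes, which the case distinction above leaves open, fall through to the east-west branch in this sketch.

```python
def spatial_relation(p_i, p_j):
    """Relation r^M_{i,j} per Eqs. (5)-(6): compare |dx| with |dy| in ENU
    coordinates and use the sign of the dominant component."""
    dx = p_j[0] - p_i[0]   # east component of Delta p
    dy = p_j[1] - p_i[1]   # north component of Delta p
    if abs(dx) < abs(dy):
        return "North" if dy > 0 else "South"
    return "East" if dx > 0 else "West"

assert spatial_relation((0.0, 0.0), (1.0, 5.0)) == "North"
assert spatial_relation((0.0, 0.0), (-4.0, 1.0)) == "West"
```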

G^M = g^M(M), \quad \forall M \in \mathcal{M},  (7)
\mathcal{G}^M = \{G^M\},  (8)

where M is the submap cropped from the OSM based on a predefined distance, and ℳ represents the set of all M, corresponding to the entirety of the OSM. G^M denotes the scene graph generated from the OSM, and g^M indicates the function that generates G^M from the map data. 𝒢^M refers to the set of G^M, constituting the scene graph database for the entire OSM.
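A possible sketch of g^M using OSMnx (Sec. IV-B) is shown below. The tag set, crop distance, and edge-distance threshold are placeholder assumptions, and the spatial_relation helper from the previous sketch is reused for the edge relations.

```python
import networkx as nx
import osmnx as ox

def build_osm_scene_graph(lat, lon, dist=30.0, edge_dist_thresh=25.0):
    """Sketch of g^M: crop OSM features around a reference point and connect
    nearby objects with directional edges (tags and thresholds are placeholders)."""
    tags = {"building": True, "highway": True, "natural": True}
    gdf = ox.features_from_point((lat, lon), tags=tags, dist=dist)
    gdf = gdf.to_crs(gdf.estimate_utm_crs())          # metric coordinates for distances

    graph = nx.DiGraph()
    centers = {}
    for idx, row in gdf.iterrows():
        label = next((k for k in tags if isinstance(row.get(k), str)), "object")
        centroid = row.geometry.centroid
        graph.add_node(idx, label=label)
        centers[idx] = (centroid.x, centroid.y)

    for i in graph.nodes:
        for j in graph.nodes:
            if i == j:
                continue
            dx = centers[j][0] - centers[i][0]
            dy = centers[j][1] - centers[i][1]
            if (dx * dx + dy * dy) ** 0.5 < edge_dist_thresh:
                # spatial_relation(...) is the Eq. (6) helper from the previous sketch
                graph.add_edge(i, j, relation=spatial_relation(centers[i], centers[j]))
    return graph
```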

Figure 3: Process of scene graph generation. (a) OSM scene graph generation. (b) Text scene graph generation.

III-B Similar scene graph candidates extraction

To efficiently compare the query against the scene graphs stored in the database, which are converted from OSM, we first extract a set of candidate graphs that are expected to have high similarity. For this purpose, we utilize the graph embedding module for text and OSM scene graphs introduced in Sec. III-C. Using this module, we compute the embeddings of the OSM scene graphs and store them in a vectorDB.

E^T = \varepsilon(G^T),  (9)
E^M_i = \varepsilon(G^M_i), \quad \forall i \in \mathcal{I},  (10)

where ε denotes a function that generates a graph embedding from either a text graph or a scene graph. E refers to the graph embedding produced by ε. ℐ represents the set of all scene ids stored in the database.
Then, we identify the embedding data most similar to the query text scene graph’s embedding data. The similarity between embedding vectors is calculated using cosine similarity.

S_i = \sigma(E^T, E^M_i) = \frac{E^T \cdot E^M_i}{\|E^T\| \, \|E^M_i\|},  (11)

where σ refers to a function that calculates the similarity between two scene graph embeddings. S corresponds to the similarity value produced by σ. ‖·‖ denotes the norm of a vector.
Subsequently, we extract the OSM scene ids with high similarities.

\mathbf{I}_n = \operatorname{argtop}(S_i, n),  (12)

where argtop(·, ·) corresponds to the argtop-k function, which extracts the indices of the elements with the largest values. The first argument represents the reference values used for comparison, while the second specifies the number of indices to be extracted. Additionally, n denotes the number of graph candidates to be extracted and is applied as the second argument of the argtop-k function, as shown in Eq. (12). Finally, 𝐈_n represents the set of scene ids selected as the output of the argtop-k function.
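Eqs. (11)-(12) amount to a cosine-similarity search followed by a top-n selection, which can be realized with NumPy as in the following sketch; the embedding dimension and database size are arbitrary placeholders.

```python
import numpy as np

def arg_top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k largest scores, ordered from most to least similar."""
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k in O(N)
    return top[np.argsort(-scores[top])]        # sort only the k survivors

rng = np.random.default_rng(0)
E_T = rng.standard_normal(256)                  # query text-graph embedding E^T
E_M = rng.standard_normal((5000, 256))          # stacked OSM-graph embeddings E^M_i

# Eq. (11): cosine similarity between the query and every stored embedding.
S = (E_M @ E_T) / (np.linalg.norm(E_M, axis=1) * np.linalg.norm(E_T))
I_n = arg_top_k(S, 10)                          # Eq. (12): the n = 10 candidate scene ids
```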
Using the scene ids with high similarities, the corresponding candidate map scene graphs are extracted from the map scene graphs stored in the vectorDB.

\mathcal{C} = \{G^M_i\} = \psi(G^T, \mathcal{G}^M), \quad \forall i \in \mathbf{I}_n,  (13)

where 𝒞 is the set composed of the map scene graphs G^M_i that exhibit high similarity with the text scene graph G^T, and ψ represents the function used to extract 𝒞.
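In our implementation, the candidate extraction ψ is backed by the Milvus vectorDB (Sec. IV-B). The sketch below assumes a local Milvus Lite setup through pymilvus’s MilvusClient with its quick-setup field names (id, vector); the embed function is a placeholder standing in for the graph embedding module ε, and the collection name and dimension are assumptions.

```python
import numpy as np
from pymilvus import MilvusClient

DIM = 256                                        # embedding dimension (assumed)

def embed(graph) -> np.ndarray:
    """Placeholder standing in for the graph embedding module epsilon."""
    rng = np.random.default_rng(abs(hash(graph)) % 2**32)
    return rng.standard_normal(DIM).astype(np.float32)

client = MilvusClient("gotloc_scene_graphs.db")  # local Milvus Lite file (assumed setup)
client.create_collection(collection_name="osm_scene_graphs",
                         dimension=DIM, metric_type="COSINE")

# Offline: insert one embedding per OSM scene graph, keyed by its scene id.
osm_scene_graphs = {0: "graph_0", 1: "graph_1"}  # stand-ins for G^M instances
client.insert(collection_name="osm_scene_graphs",
              data=[{"id": i, "vector": embed(g).tolist()}
                    for i, g in osm_scene_graphs.items()])

# Online: retrieve the n = 10 most similar scene ids for a query text scene graph,
# then load the corresponding candidate graphs (the set C of Eq. (13)).
hits = client.search(collection_name="osm_scene_graphs",
                     data=[embed("query_text_graph").tolist()], limit=10)
candidate_ids = [h["id"] for h in hits[0]]
candidates = [osm_scene_graphs[i] for i in candidate_ids if i in osm_scene_graphs]
```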

III-C Scene graph retrieval and scene selection

As illustrated in Fig. 4, we employ a joint embedding framework that leverages a graph transformer network to generate embeddings for scene graphs, following [13]. Based on this model, the similarity between the graph embeddings is calculated, and the scene with the highest similarity is selected as the one corresponding to the current query text description. The graph transformer network is configured as a joint embedding model composed of GraphGPS layers.

\bar{E}^T, \bar{E}^M_i = \bar{\varepsilon}(G^T, G^M_i), \quad \forall G^M_i \in \mathcal{C},  (14)
\bar{S}_i = \bar{\sigma}(\bar{E}^T, \bar{E}^M_i),  (15)

where ε̄ denotes the joint embedding model that transforms a text graph and a scene graph into joint embeddings, and Ē represents the joint embedding generated as the output of ε̄. S̄ is the cosine similarity between two joint embeddings, and σ̄ indicates the function that calculates S̄.
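For illustration, the following is a strongly simplified stand-in for ε̄ built from standard PyTorch Geometric layers: node labels are embedded, passed through two rounds of message passing, and mean-pooled per graph, after which Eq. (15) is a cosine similarity. Unlike the actual joint embedding model, this sketch encodes the two graphs independently and omits the GraphGPS blocks and cross modules; the layer sizes and label vocabulary are arbitrary.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class GraphEncoder(torch.nn.Module):
    """Simplified stand-in for the joint embedding model: label embedding,
    two rounds of message passing, and mean pooling per graph."""
    def __init__(self, num_labels: int, hidden: int = 128, out: int = 256):
        super().__init__()
        self.embed = torch.nn.Embedding(num_labels, hidden)   # node label lookup
        self.conv1 = GCNConv(hidden, hidden)
        self.conv2 = GCNConv(hidden, out)

    def forward(self, labels, edge_index, batch):
        x = self.embed(labels)
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        return global_mean_pool(x, batch)                      # one embedding per graph

encoder = GraphEncoder(num_labels=50)
# Two toy graphs in one batch: nodes 0-2 form the text graph, nodes 3-5 the OSM graph.
labels = torch.tensor([3, 7, 12, 3, 7, 9])
edge_index = torch.tensor([[0, 1, 3, 4], [1, 2, 4, 5]])
batch = torch.tensor([0, 0, 0, 1, 1, 1])
emb = encoder(labels, edge_index, batch)                       # shape (2, 256)
s_bar = F.cosine_similarity(emb[0:1], emb[1:2])                # Eq. (15)
```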
Finally, the scene id is predicted by applying the argtop-k operation to the cosine similarity scores. This process is formalized as follows:

I_k = \operatorname{argtop}(\bar{S}_i, k), \quad I_k \subset \mathbf{I}_n,  (16)

where k represents the number of scene ids selected as the result of the scene retrieval process and is applied as the second argument of the argtop-k function, as shown in Eq. (16). I_k denotes the set of predicted scene ids produced as the output of the scene retrieval process.
The constructed scene retrieval network is trained using a loss function composed of two components. First, the matching probability loss is computed based on the matching probability predicted by an MLP for two scene graph embeddings. Second, the cosine similarity loss is defined as the sum of the cosine similarity values directly calculated between the two embedding vectors. The joint embedding network is trained using this combined loss.

\mathcal{L}_{mat} = \mathrm{MLP}\left(\bar{E}^T \,\|\, \bar{E}^M_i\right),  (17)
\mathcal{L}_{sim} = \bar{S}_i,  (18)
\mathcal{L} = \mathcal{L}_{mat} + \mathcal{L}_{sim},  (19)

where MLP refers to the MLP component within the scene graph retrieval network architecture. ℒ denotes the loss function of the scene graph retrieval network, and ℒ_mat and ℒ_sim are its two components, representing the matching probability loss and the cosine similarity loss, respectively. (·‖·) indicates the concatenation of two vectors.
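A hedged PyTorch sketch of the combined objective in Eqs. (17)-(19) is given below. The exact formulation is not fully specified above, so the binary cross-entropy on the MLP output and the (1 − cos) form of the similarity term for matching pairs are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

class RetrievalLoss(torch.nn.Module):
    """Sketch of L = L_mat + L_sim (Eqs. 17-19); the BCE form of L_mat and the
    (1 - cos) form of L_sim for positive pairs are assumptions."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = torch.nn.Sequential(                 # MLP over concatenated embeddings
            torch.nn.Linear(2 * dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, 1))

    def forward(self, e_text, e_map, match_label):
        logits = self.mlp(torch.cat([e_text, e_map], dim=-1)).squeeze(-1)
        loss_mat = F.binary_cross_entropy_with_logits(logits, match_label)
        cos = F.cosine_similarity(e_text, e_map, dim=-1)
        # push matching pairs toward cos = 1 and non-matching pairs toward cos <= 0
        loss_sim = (match_label * (1 - cos) + (1 - match_label) * cos.clamp(min=0)).mean()
        return loss_mat + loss_sim

criterion = RetrievalLoss()
e_t, e_m = torch.randn(8, 256), torch.randn(8, 256)
y = torch.randint(0, 2, (8,)).float()                   # 1 = matching pair, 0 = negative
loss = criterion(e_t, e_m, y)
```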


Figure 4: Process of scene graph retrieval. The inputs are the query text scene graph and the extracted OSM scene graph candidates, and the output is the selected top-k scene id. The joint embedding model consists of multiple GPS convolution layers with self and cross modules.

IV Experiments

IV-A Baselines and metrics

In this paper, to evaluate the performance of the proposed GOTLoc, we adopted existing text-based localization methods, including Text2Pos [9], RET [10], Text2Loc [11], MambaPlace [12], and Where am I? [13], as baselines. Among these, Text2Pos, RET, Text2Loc, and MambaPlace were validated using the KITTI360Pose dataset. For a comparative analysis with these baselines, as shown in Table II, GOTLoc converted the segmented point cloud maps from the KITTI360Pose dataset into scene graphs. Additionally, the GPS coordinates corresponding to the poses in the KITTI360Pose dataset were utilized to generate submaps from OSM, which were then converted into scene graphs to serve as map data. To evaluate the performance of the proposed algorithm, we used Retrieval Recall as the evaluation metric. This metric calculates the proportion of text queries for which the true positive scene is correctly matched. Specifically, as formalized in Eq. (16), the final prediction was made by identifying the top k candidates with the highest similarity and determining whether the true positive scene was included among these candidates. In this paper, k was set to three values, 1, 3, and 5, following prior text-based localization studies [9, 10, 11, 12], and the algorithm’s performance was evaluated at each of these values. Moreover, for the ablation study and the city-scale case, we additionally evaluated the number of text-to-scene graph pairs processed per second, the time required to search the entire scene graph database for a given query scene graph, and the file size of the scene data.
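Retrieval Recall at k, as reported in Table II, can be computed as the fraction of text queries whose ground-truth scene id appears among the top-k retrieved scene ids; a minimal sketch with toy rankings follows.

```python
def retrieval_recall(retrieved_ids, gt_ids, k: int) -> float:
    """Fraction of queries whose ground-truth scene id is among the top-k retrievals.

    retrieved_ids: one ranked list of scene ids (best first) per text query.
    gt_ids:        the ground-truth scene id for each text query.
    """
    hits = sum(gt in ranked[:k] for ranked, gt in zip(retrieved_ids, gt_ids))
    return hits / len(gt_ids)

ranked = [[4, 9, 2, 7, 1], [3, 5, 8, 0, 6]]   # toy rankings for two queries
truth = [9, 6]
print(retrieval_recall(ranked, truth, k=1),   # 0.0
      retrieval_recall(ranked, truth, k=3),   # 0.5
      retrieval_recall(ranked, truth, k=5))   # 1.0
```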

IV-B Implementation

For the experiments conducted in our work, several Python packages were utilized. Specifically, OSMnx [28] was employed for processing OSM data, while GeoPandas [29] was used to convert GPS data into the UTM coordinate system. Additionally, Shapely [30] was utilized to handle geometric information from data loaded through OSMnx, and NetworkX [31] was adopted for processing and visualizing graph data. Following prior studies [32, 33], the Milvus [19] vector database was employed as the vectorDB for storing the map scene graphs. The hardware configuration for this paper consisted of an Intel Xeon Gold 6140 2.30 GHz CPU and a 12 GB NVIDIA TITAN Xp GPU. These specifications are comparable to those of commercially available edge computing devices, suggesting that the experimental results can be applied to real-world robotic scenarios.
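As a small usage example of the GeoPandas-based conversion mentioned above, the snippet below transforms the WGS84 GPS coordinates of Fig. 5 into metric UTM coordinates; letting GeoPandas estimate the UTM zone is a convenience choice of this sketch.

```python
import geopandas as gpd
from shapely.geometry import Point

# WGS84 GPS coordinates from Fig. 5 (lon, lat order for shapely Points).
points = gpd.GeoSeries([Point(-79.3868602, 43.6420799)], crs="EPSG:4326")
utm = points.to_crs(points.estimate_utm_crs())   # GeoPandas picks the local UTM zone
easting, northing = utm.iloc[0].x, utm.iloc[0].y
print(f"E {easting:.1f} m, N {northing:.1f} m")
```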

IV-C Research questions

The research questions we aim to address in this paper are as follows:
Q1. Can localization performance remain comparable even when the environment is represented with simplified data structures such as scene graphs, instead of precise data like point clouds?
Q2. Does the scene graph generated from OSM achieve overall algorithm performance comparable to the scene graph generated from segmented point cloud maps?
Q3. Does the proposed transformer network-based approach outperform the rule-based scene graph retrieval method in accuracy, or provide advantages in speed?
Q4. Compared to a brute-force retrieval method that evaluates all scenes, how much does the pre-selection of similar scene candidates improve speed, and does it maintain comparable performance?
Q5. When scene data is converted from point clouds to scene graphs, how much storage efficiency is gained? Is it feasible for application in real-world robots?
Q6. When applied to city-scale data, do accuracy, speed, and data storage requirements produce results that are practical for real-world robotic applications?

IV-D Evaluation

The data examples used in the experiments are illustrated in Fig. 5. These examples include text descriptions and GPS coordinates serving as input data for GOTLoc, the corresponding OSM data retrieved based on these inputs, and the resulting text and map scene graphs generated from this data. To further aid comprehension, street view images [34] of the respective areas are also included. It should be noted that the segmentations shown in Fig. 5(b) and 5(c) were utilized solely for visualization purposes to facilitate understanding. Using these data, the experiments address the research questions outlined in Sec. IV-C. The results of these validations are as follows:
Q1. Scene graph-based accuracy: The performance of methods leveraging scene graphs is shown in Table II. As described in Sec. IV-A, localization performance was evaluated by converting map data from the KITTI360Pose dataset into scene graphs. In the KITTI360Pose dataset, the direction was computed based on the relative positional differences between the observer and the object. However, in this paper, we did not include the observer as a node in the scene graph, rendering the direction information between the observer and the object unsuitable for use as edges in the scene graph. Therefore, we calculated the direction by applying the positional data of objects to Eq. (6) and utilized it as edges in the scene graph for this experiment. The results indicate that the performance shows minimal difference compared to using semantically segmented point cloud maps. This suggests that the scene graph-based map data used in GOTLoc contains sufficient information to be effectively utilized for localization.

Figure 5: Examples of the experimental data: (a) text description, (b) street view image with segmentation, (c) OSM with segmentation, (d) text scene graph, (e) OSM scene graph. The data were generated based on GPS coordinates 43.6420799, -79.3868602. The segmentations in (b) and (c) were included solely for visualization purposes to aid understanding. Additionally, the gray area in (e) represents the overlapping region with (d).
TABLE II: Comparison between our method and existing text-based localization methods (Submap Retrieval Recall ↑).
Map data | Methods | Validation Set (k = 1 / 3 / 5) | Test Set (k = 1 / 3 / 5)
KITTI360Pose | Text2Pos [9] | 0.14 / 0.28 / 0.37 | 0.12 / 0.25 / 0.33
KITTI360Pose | RET [10] | 0.18 / 0.34 / 0.44 | - / - / -
KITTI360Pose | Text2Loc [11] | 0.31 / 0.54 / 0.64 | 0.28 / 0.49 / 0.58
KITTI360Pose | MambaPlace [12] | 0.35 / 0.61 / 0.72 | 0.31 / 0.53 / 0.62
KITTI360Pose | GOTLoc (ours; rule-based) | 0.39 / 0.68 / 0.80 | 0.24 / 0.52 / 0.66
KITTI360Pose | GOTLoc (ours; Transformer) | 0.36 / 0.58 / 0.67 | 0.30 / 0.49 / 0.57
OSM | GOTLoc (ours; rule-based) | 0.24 / 0.48 / 0.57 | 0.29 / 0.50 / 0.58
OSM | GOTLoc (ours; Transformer) | 0.23 / 0.43 / 0.53 | 0.22 / 0.39 / 0.48
TABLE III: Comparison of the number of processed pairs per second.
Methods | # of processed pairs / s
GOTLoc (ours; rule-based) | 11.05
GOTLoc (ours; network-based; CPU) | 80.44
GOTLoc (ours; network-based; GPU) | 152.96

Q2. OSM-based accuracy: The localization performance of GOTLoc using scene graphs generated from OSM is presented in Table II. The localization performance achieved using the OSM scene graph is slightly lower or comparable to the performance obtained when directly utilizing the point cloud map or its transformation into a scene graph. This demonstrates that it is unnecessary to rely on pre-generated, complex data structures such as point clouds for map data. Instead, scene graphs derived from widely available OSM data, accessible for most outdoor environments globally, are sufficient to achieve effective localization.

TABLE IV: Comparison of processing time (s). (C.E.: Candidates Extraction)
# of frames | 447 | 1,616 | 3,409 | 5,225
w/o C.E. | 1.06 | 3.34 | 6.67 | 10.71
w/ C.E. | 0.03 | 0.07 | 0.13 | 0.17
TABLE V: Comparison of map data size (MB).
# of cells | 110 | 220 | 330 | 447
Point cloud | 1,310.67 | 2,618.63 | 3,898.24 | 5,546.19
Scene graph | 0.13 | 0.23 | 0.35 | 0.71
TABLE VI: Performance on city-scale data.
City | Accuracy (Top k = 1 / 3 / 5) | Speed (s/iter) | Size (MB)
Karlsruhe | 0.28 / 0.44 / 0.52 | 0.15 | 15.44
Sydney | 0.23 / 0.38 / 0.49 | 0.13 | 34.31
Toronto | 0.22 / 0.36 / 0.44 | 0.29 | 38.74

Q3. Network and rule-based: The comparison of different scene graph retrieval methods and network architectures is presented in Table II and Table III. For the rule-based approach, Graph Edit Distance (GED) was utilized, a widely used measure of graph similarity that is mathematically grounded in [35] and surveyed in [36]. The results indicate that the rule-based approach using GED achieves the highest accuracy on average. However, as shown in Table III, when comparing the number of graph pairs processed per second, the transformer network-based model executed on a GPU is approximately 13.84 times faster than the rule-based approach executed on a CPU. Considering the need for real-time operation, network-based methods are more practical.
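For reference, the rule-based GED baseline can be approximated with NetworkX’s graph_edit_distance, matching nodes by label and edges by relation as in the sketch below; the cost settings and matchers are assumptions, since the exact configuration is not detailed here.

```python
import networkx as nx

def ged_score(text_graph: nx.DiGraph, map_graph: nx.DiGraph) -> float:
    """Graph edit distance with label-aware matching; smaller means more similar."""
    return nx.graph_edit_distance(
        text_graph, map_graph,
        node_match=lambda a, b: a.get("label") == b.get("label"),
        edge_match=lambda a, b: a.get("relation") == b.get("relation"))

g_text = nx.DiGraph()
g_text.add_edge("pole", "building", relation="East")
g_map = nx.DiGraph()
g_map.add_edge("pole", "building", relation="East")
g_map.add_edge("tree", "building", relation="North")
for g in (g_text, g_map):
    for n in g.nodes:
        g.nodes[n]["label"] = n          # node labels double as identifiers here

print(ged_score(g_text, g_map))          # 2.0: one node and one edge must be inserted
```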
Q4. Candidates extraction: As shown in Table IV, we conducted experiments to compare the algorithm’s processing speed with and without the vectorDB-based scene graph candidate extraction used in GOTLoc. In this experiment, we set n, the number of extracted candidates in Eq. (12), to 10. Since candidate extraction had a negligible impact on accuracy, a direct performance comparison was omitted. The results revealed that the slope of processing time with respect to the number of frames differed by approximately 71.94 times between the cases with and without candidate extraction. Specifically, when candidate extraction was applied, the scene retrieval processing speed remained nearly constant regardless of the number of frames stored in the database. In contrast, without candidate extraction, processing time increased sharply as the number of frames grew. This difference became more pronounced as the number of frames increased, indicating that for large-scale environments, such as city-scale scenarios where the number of frames is significantly higher, it is challenging to run the algorithm in real time without candidate extraction. On the other hand, when candidate extraction was employed, the processing time remained almost constant, demonstrating the feasibility of applying the algorithm to real-world robotic systems.
Q5. Scene graph-based storage size: We also examined the changes in storage requirements when using scene graphs for map data, as presented in Table V. The size of the map data was measured based on the 2013_05_28_drive_0003_sync scene from the KITTI360 [15] and KITTI360Pose [9] datasets. In this case, the point cloud maps and scene graphs were constructed using static objects from the overall set of objects. This choice is motivated by the problem we aim to address, which is robust localization that operates independently of time variations. For the 2013_05_28_drive_0003_sync scene, according to the KITTI360Pose dataset, the map is divided into 447 cells, each forming a square with a side length of 30 m. Summing the areas of these cells gives a total area of approximately 0.40 km². Comparing the storage requirements of the map data by data type, the storage requirements for point clouds and scene graphs are 5,546.19 MB and 0.71 MB, respectively. This indicates that scene graphs require approximately 7,846 times less storage space than point clouds. If we assume the map area is expanded to encompass the entire city of Karlsruhe, where the KITTI360 dataset was collected, the point cloud map would require approximately 2.29 TB of storage capacity for the entire 173.5 km² area.
However, a storage requirement of more than 2 TB for map data would be a burdensome condition for delivery robots or general autonomous navigation robots equipped with standard desktop-level or lower hardware specifications. In contrast, when the map is represented using a scene graph, an approximate calculation based on the proportional relationship with the point cloud map’s size shows that the storage requirement would be about 0.30 GB, i.e., less than 1 GB. This demonstrates that, unlike point clouds, scene graph-based map data can be effectively utilized on the low-specification computers commonly used in robots.
Q6. City-scale: Finally, assuming GOTLoc’s application to real-world robots, we evaluated its accuracy, speed, and storage requirements using city-scale map data. The results, shown in Table VI, include tests in three regions: Karlsruhe, Sydney, and Toronto. These regions exhibit a range of characteristics, encompassing both larger and smaller cities, thereby facilitating a thorough evaluation of the proposed algorithm. The areas span 173.5, 12,145, and 630.2 km², respectively, and were divided into 1,000, 486, and 479 cells, with each cell generating text and map scene graphs for localization. The results show a top-5 prediction accuracy of 0.44 or higher, with an average processing time of less than 0.3 seconds, suitable for real-time robotic applications. Additionally, the map data size remains under 40 MB, making it feasible to store on the small storage devices typically used in robots.

V Conclusion

In this paper, we proposed GOTLoc, a method that identifies the current location by comparing the similarity between text descriptions of a scene and scene graphs generated from OSM data. GOTLoc reduces the storage burden compared to traditional methods using point cloud maps by storing map data as scene graphs. Additionally, it leverages a vectorDB of embedding vectors to identify matching candidates from the entire OSM scene graph database and performs scene retrieval within these candidates. This approach reduces localization time, enhancing the potential for real-world robotic applications. Lastly, the use of publicly available map data, such as OSM, which includes global outdoor information, provides advantages in terms of scalability and accessibility. In the future, we plan to expand this research beyond localization, aiming to use scene graphs for finding navigation goals from the scene graph database based on text descriptions. We intend to develop text-based navigation methods that enable robots to perceive their surroundings, localize themselves, and navigate to destinations described in text while driving.

References

  • [1] E. Alverhed, S. Hellgren, H. Isaksson, L. Olsson, H. Palmqvist, and J. Flodén, “Autonomous last-mile delivery robots: a literature review,” European Transport Research Review, vol. 16, no. 1, p. 4, 2024.
  • [2] S.-W. Kim, G.-P. Gwon, W.-S. Hur, D. Hyeon, D.-Y. Kim, S.-H. Kim, D.-K. Kye, S.-H. Lee, S. Lee, M.-O. Shin et al., “Autonomous campus mobility services using driverless taxi,” IEEE Transactions on intelligent transportation systems, vol. 18, no. 12, pp. 3513–3526, 2017.
  • [3] S. Zhang, L. He, and L. Wu, “Statistical study of loss of gps signals caused by severe and great geomagnetic storms,” Journal of Geophysical Research: Space Physics, vol. 125, no. 9, p. e2019JA027749, 2020.
  • [4] Y. Yun, J. Jin, N. Kim, J. Yoon, and C. Kim, “Outdoor localization with optical navigation sensor, imu and gps,” in 2012 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI).   IEEE, 2012, pp. 377–382.
  • [5] V. Ieropoulos, “The impact of gps interference in the middle east,” in 2024 IEEE International Conference on Cyber Security and Resilience (CSR).   IEEE, 2024, pp. 732–736.
  • [6] D. F. Wolf and G. S. Sukhatme, “Localization and mapping in urban environments using mobile robots,” Journal of the Brazilian Computer Society, vol. 13, pp. 69–79, 2007.
  • [7] K. J. Jeffery and J. M. O’Keefe, “Learned interaction of visual and idiothetic cues in the control of place field orientation,” Experimental brain research, vol. 127, pp. 151–161, 1999.
  • [8] S. Nikolaidis, M. Kwon, J. Forlizzi, and S. Srinivasa, “Planning with verbal communication for human-robot collaboration,” ACM Transactions on Human-Robot Interaction (THRI), vol. 7, no. 3, pp. 1–21, 2018.
  • [9] M. Kolmet, Q. Zhou, A. Ošep, and L. Leal-Taixé, “Text2pos: Text-to-point-cloud cross-modal localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6687–6696.
  • [10] G. Wang, H. Fan, and M. Kankanhalli, “Text to point cloud localization with relation-enhanced transformer,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 2501–2509.
  • [11] Y. Xia, L. Shi, Z. Ding, J. F. Henriques, and D. Cremers, “Text2loc: 3d point cloud localization from natural language,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14958–14967.
  • [12] T. Shang, Z. Li, W. Pei, P. Xu, Z. Deng, and F. Kong, “Mambaplace: Text-to-point-cloud cross-modal place recognition with attention mamba mechanisms,” arXiv preprint arXiv:2408.15740, 2024.
  • [13] J. Chen, D. Barath, I. Armeni, M. Pollefeys, and H. Blum, ““where am i?” scene retrieval with language,” in European Conference on Computer Vision.   Springer, 2025, pp. 201–220.
  • [14] P. H. Le-Khac, G. Healy, and A. F. Smeaton, “Contrastive representation learning: A framework and review,” IEEE Access, vol. 8, pp. 193907–193934, 2020.
  • [15] Y. Liao, J. Xie, and A. Geiger, “Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3292–3310, 2022.
  • [16] I. Armeni, Z.-Y. He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, and S. Savarese, “3d scene graph: A structure for unified semantics, 3d space, and camera,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 5664–5673.
  • [17] J. Wald, H. Dhamo, N. Navab, and F. Tombari, “Learning 3d semantic scene graphs from 3d indoor reconstructions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3961–3970.
  • [18] Z. Zhu, X. Ma, Y. Chen, Z. Deng, S. Huang, and Q. Li, “3d-vista: Pre-trained transformer for 3d vision and text alignment,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2911–2921.
  • [19] J. Wang, X. Yi, R. Guo, H. Jin, P. Xu, S. Li, X. Wang, X. Guo, C. Li, X. Xu et al., “Milvus: A purpose-built vector data management system,” in Proceedings of the 2021 International Conference on Management of Data, 2021, pp. 2614–2627.
  • [20] M. Haklay and P. Weber, “Openstreetmap: User-generated street maps,” IEEE Pervasive computing, vol. 7, no. 4, pp. 12–18, 2008.
  • [21] D. Tully, A. El Rhalibi, C. Carter, and S. Sudirman, “Generating a novel scene-graph structure for a modern gis rendering framework,” in 2016 9th International Conference on Developments in eSystems Engineering (DeSE).   IEEE, 2016, pp. 169–174.
  • [22] M. Zipfl and J. M. Zöllner, “Towards traffic scene description: The semantic scene graph,” in 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC).   IEEE, 2022, pp. 3748–3755.
  • [23] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [24] Y. Xia, M. Gladkova, R. Wang, Q. Li, U. Stilla, J. F. Henriques, and D. Cremers, “Casspr: Cross attention single scan place recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8461–8472.
  • [25] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
  • [26] K. W. Church, “Word2vec,” Natural Language Engineering, vol. 23, no. 1, pp. 155–162, 2017.
  • [27] Y. Shi, Z. Huang, S. Feng, H. Zhong, W. Wang, and Y. Sun, “Masked label prediction: Unified message passing model for semi-supervised classification,” arXiv preprint arXiv:2009.03509, 2020.
  • [28] G. Boeing, “Osmnx: New methods for acquiring, constructing, analyzing, and visualizing complex street networks,” Computers, environment and urban systems, vol. 65, pp. 126–139, 2017.
  • [29] K. Jordahl, J. Van den Bossche, J. Wasserman, J. McBride, J. Gerard, J. Tratner, M. Perry, and C. Farmer, “geopandas/geopandas: v0. 5.0,” Zenodo, 2021.
  • [30] S. Gillies, “The Shapely user manual,” URL https://pypi.org/project/Shapely, 2013.
  • [31] A. Hagberg, P. J. Swart, and D. A. Schult, “Exploring network structure, dynamics, and function using networkx,” Los Alamos National Laboratory (LANL), Los Alamos, NM (United States), Tech. Rep., 2008.
  • [32] D. Srivamsi, O. M. Deepak, M. A. Praveena, and A. Christy, “Cosine similarity based word2vec model for biomedical data analysis,” in 2023 7th International Conference on Trends in Electronics and Informatics (ICOEI).   IEEE, 2023, pp. 1400–1404.
  • [33] A. Anwar, J. Welsh, J. Biswas, S. Pouya, and Y. Chang, “Remembr: Building and reasoning over long-horizon spatio-temporal memory for robot navigation,” arXiv preprint arXiv:2409.13682, 2024.
  • [34] Google, “Google street view,” Online, accessed: 2025-01-09. [Online]. Available: https://www.google.com/maps
  • [35] A. Sanfeliu and K.-S. Fu, “A distance measure between attributed relational graphs for pattern recognition,” IEEE transactions on systems, man, and cybernetics, no. 3, pp. 353–362, 1983.
  • [36] X. Gao, B. Xiao, D. Tao, and X. Li, “A survey of graph edit distance,” Pattern Analysis and applications, vol. 13, pp. 113–129, 2010.