OSM vs HDMaps: Map Representations for Trajectory Prediction

Jing-Yan Liao^†, Parth Doshi^†∗, Zihan Zhang^†∗, David Paz^†, Henrik Christensen^†

^†Contextual Robotics Institute, University of California San Diego, La Jolla, CA 92093, USA
^∗These authors contributed equally to this work.

Abstract

While HDMaps have long been favored for their precise depictions of static road elements, their limited accessibility and susceptibility to rapid environmental changes impede the widespread deployment of autonomous driving, especially for the motion forecasting task. In this context, we propose OpenStreetMap (OSM) as a promising alternative to High Definition (HD) Maps for long-term motion forecasting. The contributions of this work are threefold. First, we extend the application of OSM to long-horizon forecasting, doubling the forecasting horizon compared to previous studies. Second, through an expanded receptive field and the integration of an intersection prior, our OSM-based approach exhibits competitive performance, narrowing the gap with HDMap-based models. Lastly, we conduct an exhaustive context-aware analysis, providing deeper insights into motion forecasting across diverse scenarios, together with class-aware comparisons. This research not only advances long-term motion forecasting with coarse maps but also offers a potential path toward scalability in autonomous driving.

I INTRODUCTION

Ensuring the safety of all road users is a fundamental mission of autonomous driving. To achieve this, self-driving systems must have a thorough understanding of their surroundings. This is where motion forecasting plays a crucial role by modeling the behavior of agents in a scene to predict their next action in the form of trajectories. Motion forecasting models primarily rely on two key components: a map to depict the static elements in the environment, and past trajectories of agents within the scene, which account for the dynamic components. While past paths are typically well-defined as tracks, recent research has emphasized the importance of the map itself.

High Definition Maps (HDMaps) have emerged as the prevalent choice for map representation in motion forecasting applications due to their precision in conveying road markings, lane boundaries, and road geometry. This information supplies the context and semantics needed in the prediction task. However, access to HDMaps is notably constrained in practice, and their creation and maintenance require considerable human and technological resources. This challenge hinders the large-scale expansion of autonomous driving, especially in areas without up-to-date HDMap coverage. For instance, the UCSD on-campus autonomous driving effort consumes a large amount of human labor on HDMap creation and maintenance. Unfortunately, this map quickly becomes outdated due to ongoing construction activities on campus, as depicted in Fig. 1. Not only do the locations of lane markings change, but parts of the road network itself undergo significant alterations within a short timeframe, rendering the previously available HDMap obsolete and ineffective for autonomous driving applications. This example underscores the pressing need to explore alternative map representations that can accommodate real-world changes with greater efficiency and less manual labeling.

Interestingly, coarse maps, which are widely used in localization [osm_loc_1] [osm_localization] and navigation [osm_nav_1] [osm_nav_2] for autonomous driving, have been somewhat overlooked as a potential map representation for motion forecasting because of the field's reliance on HDMaps. Open-source maps like OpenStreetMap (OSM) [OpenStreetMap] and proprietary maps such as Google Maps and Apple Maps offer scalability advantages over HDMaps. Although they lack precise lane-level information, coarse maps include information about road connectivity and intersections, as shown in Fig. 2, which holds substantial potential for motion forecasting. Although proprietary maps might contain more detailed and more standardized representations of the roads, we choose to conduct experiments with OSM due to its open-source nature.

Therefore, in this work, we explore long-horizon motion forecasting with OSM as a new map representation, using the HiVT [hivt] architecture as a basis. We specifically chose HiVT, HiVT-128 to be exact, for its state-of-the-art performance and its real-time capability, which scales well as the number of agents in the scene grows. Our key contributions are summarized as follows:

Figure 1: Spatial Evolution of Road Network on UC San Diego Campus in both satellite image [esri] and vector maps [avl]. The left side of the figure depicts the road network in 2021 and the current status of the road network is on the right.

• OSM for long-term forecasting: We extend the application of OSM to the more challenging task of long-term motion forecasting. Previous research [osm_av1] focused on short-term forecasting within the Argoverse1 Motion Forecasting Dataset [argoverse1], where models were tasked with projecting 3 seconds into the future. In contrast, our study evaluates the model on the Argoverse2 Motion Forecasting Dataset [Argoverse2]. This dataset doubles the forecasting horizon, providing a more complex and realistic testing ground to assess the effectiveness of our OSM-based approach.
• Competitive performance of the OSM-based method: Through the expansion of the receptive field and the incorporation of intersection flags, our OSM-based method achieves substantial improvements. Notably, we narrow the performance gap between HDMap-based models and our OSM-based model, demonstrating the competitive potential of our approach.
• Comprehensive Context-Aware Analysis: We conduct an exhaustive context-aware comparison, both quantitative and qualitative, considering both map representation and agent type. Our evaluation includes visualizations of motion forecasting outcomes across various scenarios, including straight roads, intersections, and curved roads. Furthermore, we conduct a class-aware analysis, exploring how different map representations impact the forecasting of each agent type under various map contexts.

By introducing OSM as a map representation and rigorously assessing HiVT’s performance in diverse scenarios, this research provides a fresh perspective on long-term motion forecasting. It offers valuable insights into scalability in the realm of autonomous driving and beyond.

II Related Work

II-A Map Representations for Motion Forecasting

Initially, rasterized HDMaps [bansal2018chauffeurnet, djuric2020uncertaintyaware, hong2019rules, salzmann2020trajectron, kamenev2022predictionnet] gained popularity because of the impressive results achieved by Convolutional Neural Networks (CNNs) in computer vision. These works incorporate different features of the surroundings to provide rich, context-aware information for the motion forecasting task. Some approaches convert map elements (such as roads and crosswalks) into distinct layers with color-coded lane direction. Later studies went beyond simple rasterization to render more complex map features, including roadmaps, traffic lights, and speed limits, into bird’s-eye view images. Rasterization techniques were subsequently extended to encode semantic map details into a top-down spatial grid. This holistic approach aims to provide richer and more context-aware map representations, which are essential for motion forecasting models to make informed predictions. However, processing these rasterized representations with CNNs is computationally demanding and offers a limited receptive field, which could lead to higher error in longer-term motion forecasting.

Recently, more and more research has shifted towards vectorized HDMaps [shi2023mtr, gu2021densetnt, wayformer, gilles2022gohome, zhao2022trajgat] due to their more efficient representation. For instance, VectorNet [vectornet] started this trend by sampling key points from lane splines to simplify the map and encoding it with graph neural networks. LaneGCN [lanegcn], on the other hand, builds a lane graph from centerline segments and uses a graph convolutional network to capture information. HiVT transforms map elements into relative positions to guarantee translational invariance. TPCN [tpcn] describes the map as an ordered point set and leverages point cloud learning methods to learn from the surroundings.

However, little research has been done on coarse maps, especially for the newer, longer-horizon motion forecasting setting. Therefore, in this work, we reviewed more than 1000 scenarios in the Argoverse 2 validation set and selected representative examples to illustrate our findings from visualizing the predictions.

III Approach

In this section, we present the methodology employed for extracting OpenStreetMap (OSM) data and its subsequent integration into the HiVT model. We delve into the specifics of OSM data formatting and elucidate the steps involved in the integration process.

III-A OSM Data format

OpenStreetMap (OSM) serves as an extensive repository of diverse geographical attributes, encompassing various features such as road networks, structures, and facilities distributed across the Earth’s surface. These geographical features are encapsulated within three fundamental data structures: nodes, ways, and relations.

Nodes, denoted as $n_{i}\in N$, encapsulate essential geographical information, including latitude, longitude, and a unique node identifier. Ways, represented as $w_{i}=\{n_{j}\}\in W$, constitute aggregations of nodes that collectively define contiguous segments of roadways. Relations, articulated as $r_{i}=\{n_{j},w_{k},r_{l}\}\in R$, serve to establish logical or geographic associations between disparate map objects. Each of these fundamental data structures accommodates supplementary metadata, such as road names or lane counts, stored as tags. This metadata enriches the core geographical information within the dataset.
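To make the three element types concrete, the following is a minimal sketch of how they might be held in memory. The class names, field names, helper function, and the coordinate values are illustrative assumptions, not the data model used in this work or by any particular OSM library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """A single OSM point: unique id plus WGS84 latitude/longitude."""
    node_id: int
    lat: float
    lon: float
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Way:
    """An ordered list of node ids describing, e.g., a road segment."""
    way_id: int
    node_ids: List[int]
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Relation:
    """A grouping of nodes, ways, or relations, referenced as (type, id) pairs."""
    relation_id: int
    members: List[Tuple[str, int]]
    tags: Dict[str, str] = field(default_factory=dict)


def way_coordinates(way: Way, nodes: Dict[int, Node]) -> List[Tuple[float, float]]:
    """Resolve a way into its (lat, lon) polyline, ignoring all tags."""
    return [(nodes[i].lat, nodes[i].lon) for i in way.node_ids]


# Made-up coordinates, for illustration only.
nodes = {
    1: Node(1, 32.8801, -117.2340),
    2: Node(2, 32.8805, -117.2335),
    3: Node(3, 32.8810, -117.2331),
}
road = Way(10, [1, 2, 3], tags={"highway": "residential", "lanes": "2"})
polyline = way_coordinates(road, nodes)
```

Resolving a way through the node table, as `way_coordinates` does, yields a tag-free polyline built only from nodes and ways.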

For the purpose of our study, which focuses on motion forecasting, we selectively extract nodes and ways from the OSM data, omitting metadata. This decision is motivated by the inconsistency in label attributes across different geographic locations, an inherent characteristic of crowd-sourced data. However, it is worth noting that future research endeavors may explore methods to incorporate this metadata into our representation to fully harness the potential of OSM data.

III-B Incorporating OSM data into HiVT

OSM, while sharing a graph-based structure with the Argoverse 2 HDMap, exhibits a sparser node distribution. Nevertheless, the inherently graph-based nature of OSM data makes it well-suited for integration with the HiVT model. The integration process can be summarized as follows:

1. Boundary Extraction: Our process begins by computing the boundary formed by all agent tracks within each Argoverse 2 scenario. The original agent coordinates, denoted as $a_{i}=(x_{i},y_{i})$, are provided in the city frame. We employ the Argoverse 2 API to transform these coordinates into the WGS84 coordinate system. This transformed boundary then serves as the basis for downloading relevant OSM map data.
2. Preprocessing for HiVT: Following boundary extraction, we undertake several preprocessing steps to prepare the downloaded OSM data for integration into the HiVT model. Firstly, we transform all OSM nodes from the WGS84 coordinate system into the city frame, leveraging the Argoverse 2 API. Secondly, we perform interpolation on OSM ways to ensure uniformity in node-to-node distances. This interpolation employs a distance of 1.5 meters, aligning closely with the average centerline segment length observed in Argoverse 2. Finally, we transform the entire map into relative positions, similar to the HDMap preprocessing from HiVT. These preprocessing steps ensure compatibility and coherence between the OSM-based data and the HiVT model.

This integrated approach facilitates the seamless incorporation of OSM data into HiVT, enabling further analysis of long-term motion forecasting within urban landscapes.
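The geometric part of the steps above can be sketched in a few lines of plain Python. This is a hedged illustration under several assumptions: all points are already in a metric city frame (the city-frame/WGS84 conversion is left to the Argoverse 2 API and is not reproduced here), resampling is done by arc length, and "relative positions" is read as an agent-centered, heading-aligned frame. The function names `track_bounds`, `resample_way`, and `to_local_frame` are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def track_bounds(tracks: List[List[Point]], margin: float = 0.0) -> Tuple[float, float, float, float]:
    """Axis-aligned box (min_x, min_y, max_x, max_y) over all agent tracks."""
    xs = [x for track in tracks for x, _ in track]
    ys = [y for track in tracks for _, y in track]
    return (min(xs) - margin, min(ys) - margin, max(xs) + margin, max(ys) + margin)


def resample_way(way: List[Point], spacing: float = 1.5) -> List[Point]:
    """Resample a polyline every `spacing` meters of arc length.

    The last point of the way is always kept, so the final gap may be shorter.
    """
    if len(way) < 2:
        return list(way)
    out = [way[0]]
    dist_to_next = spacing  # arc length left until the next sample
    for (x0, y0), (x1, y1) in zip(way[:-1], way[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        pos = 0.0  # distance already consumed on this segment
        while seg_len - pos >= dist_to_next:
            pos += dist_to_next
            t = pos / seg_len
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            dist_to_next = spacing
        dist_to_next -= seg_len - pos
    last_x, last_y = way[-1]
    if math.hypot(out[-1][0] - last_x, out[-1][1] - last_y) > 1e-9:
        out.append(way[-1])
    return out


def to_local_frame(points: List[Point], origin: Point, heading: float) -> List[Point]:
    """Express points relative to `origin`, rotated so `heading` (radians) maps to +x."""
    cos_h, sin_h = math.cos(heading), math.sin(heading)
    local = []
    for x, y in points:
        dx, dy = x - origin[0], y - origin[1]
        local.append((cos_h * dx + sin_h * dy, -sin_h * dx + cos_h * dy))
    return local


bounds = track_bounds([[(0.0, 0.0), (2.0, 1.0)], [(-1.0, 3.0)]], margin=1.0)
dense_way = resample_way([(0.0, 0.0), (4.0, 0.0)])
local_pts = to_local_frame([(0.0, 1.0)], origin=(0.0, 0.0), heading=math.pi / 2)
```

Arc-length resampling keeps spacing uniform along the road but can skip original vertices at sharp corners; whether that matters depends on how coarse the underlying OSM way is.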

IV Experiments

In this section, we present our comprehensive evaluation of the HiVT model on the publicly available Argoverse 2 Motion Forecasting Dataset.

IV-A Dataset

The HiVT model was originally evaluated on the Argoverse 1 dataset. For the reasons outlined below, we instead developed a tailored implementation that converts and evaluates on the Argoverse 2 dataset.

• Forecasting Horizon Expansion: The Argoverse 2 dataset presents a significant departure from its predecessor, notably in terms of forecasting horizons. With each scenario extending to a duration of 11 seconds, the track history lengthens from 2 seconds to 5 seconds, while the forecasting horizon doubles to 6 seconds. This substantial increase in forecasting duration poses a more intricate challenge for motion forecasting, rendering it a valuable dataset for in-depth investigation.
• Enhanced Class Information: In contrast to Argoverse 1, which provides no class information, Argoverse 2 introduces a richer classification scheme comprising 10 non-overlapping classes, encompassing both static and dynamic agents. The inclusion of detailed agent class information permits a more nuanced analysis of forecasting behavior, thereby enhancing our understanding of OSM-based model performance.
• Diverse Scenarios: The Argoverse 2 dataset is collected from six distinct cities, providing a diverse range of scenarios and environments. This geographical diversity allows us to assess the scalability and adaptability of the OSM-based method across varied urban landscapes.

By undertaking experiments on the Argoverse 2 Motion Forecasting Dataset, we aim to thoroughly examine the capabilities and performance of the HiVT model in addressing the challenges posed by extended forecasting horizons, enriched class information, and diverse real-world scenarios.

IV-B Metrics

We assess our model’s performance using standard metrics for multi-modal motion forecasting: minimum Average Displacement Error (minADE), minimum Final Displacement Error (minFDE), and Miss Rate (MR). Under these metrics, a model may forecast up to 6 trajectories for each agent, and the best one is scored. The metric minADE quantifies the average $\ell_2$ distance in meters between the best-predicted trajectory and the ground-truth trajectory across all future time steps, while minFDE measures the error specifically at the final future time step. MR represents the proportion of scenarios where the distance between the ground-truth endpoint and the best-predicted endpoint exceeds 2.0 meters.
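As a reference for the definitions above, here is a minimal sketch of the three metrics in plain Python. It assumes that "best" means the lowest value of each metric over the candidate trajectories; some benchmark implementations instead select the trajectory with the lowest endpoint error before computing the average error. The function names and toy trajectories are illustrative only.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]


def displacement_errors(pred: Trajectory, gt: Trajectory) -> List[float]:
    """Per-time-step Euclidean (l2) distance between prediction and ground truth."""
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]


def min_ade(preds: List[Trajectory], gt: Trajectory) -> float:
    """Lowest average displacement error over the candidate trajectories."""
    return min(sum(displacement_errors(p, gt)) / len(gt) for p in preds)


def min_fde(preds: List[Trajectory], gt: Trajectory) -> float:
    """Lowest final-step displacement error over the candidate trajectories."""
    return min(displacement_errors(p, gt)[-1] for p in preds)


def miss_rate(scenarios: List[Tuple[List[Trajectory], Trajectory]], threshold: float = 2.0) -> float:
    """Fraction of scenarios whose best endpoint error exceeds `threshold` meters."""
    misses = sum(1 for preds, gt in scenarios if min_fde(preds, gt) > threshold)
    return misses / len(scenarios)


# Toy example: one ground-truth track and two candidate forecasts.
gt = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
close = [(1.0, 0.0), (2.0, 0.0), (3.0, 1.0)]   # off by 1 m at the last step
far = [(1.0, 3.0), (2.0, 3.0), (3.0, 3.0)]     # off by 3 m at every step

ade = min_ade([close, far], gt)
fde = min_fde([close, far], gt)
mr = miss_rate([([close, far], gt), ([far], gt)])
```

In the toy example, the first scenario is not a miss because one candidate ends within 2.0 meters of the ground truth, while the second scenario, which only contains the distant forecast, is.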

TABLE I: HiVT performance on Argoverse 2 validation set

Map	Receptive field	minADE (m)	minFDE (m)	MR	Inference speed (Hz)	VRAM usage (MB)
HDMap	100 m	0.943	1.934	0.287	6.94	4404
HDMap	125 m	0.929	1.876	0.277	7.05	4408
OSM	100 m	1.375	3.234	0.43	5.41	6266
OSM	125 m	1.043	2.241	0.32	5.48	6274

IV-C Implementation Details

In our implementation, several key adjustments were made to enhance the compatibility of the HiVT model with the Argoverse 2 dataset and to leverage OpenStreetMap (OSM) data effectively.

IV-C1 HiVT Model Enhancement

We reconfigured some parts of the publicly available HiVT model to suit the requirements of the Argoverse 2 dataset. Specifically, we redesigned the dataloader tailored for Argoverse 2 data. Additionally, we modified the architecture to accommodate class information as an additional part of the input. This architectural adaptation allows the model to consider agent class information, enriching its contextual understanding and improving prediction accuracy.

IV-C2 OSM Data Preprocessing

On the OSM side, we performed preprocessing steps to prepare the OSM data for integration with HiVT. We interpolated the entire OSM graph so that the distance between consecutive nodes is approximately 1.5 meters. Furthermore, we employed a proximity-based approach to identify intersections, flagging nodes located within a 10-meter radius of specific markers such as stop signs and traffic lights. The resulting is_intersection attribute provides contextual information that can improve HiVT’s results at intersections and helps the model digest road network information effectively.
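The proximity-based flagging can be illustrated with a short brute-force sketch. It assumes node and marker positions are already in a metric frame and that marker locations (e.g., stop signs or traffic lights) are supplied by the caller; how the markers are obtained is not specified here, and the function name `flag_intersections` is hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def flag_intersections(nodes: List[Point], markers: List[Point], radius: float = 10.0) -> List[bool]:
    """Return one is_intersection flag per node.

    A node is flagged when it lies within `radius` meters of at least one marker.
    """
    return [
        any(math.hypot(nx - mx, ny - my) <= radius for mx, my in markers)
        for nx, ny in nodes
    ]


# Toy road along the x-axis with a single marker at x = 12 m.
road_nodes = [(0.0, 0.0), (3.0, 0.0), (15.0, 0.0), (30.0, 0.0)]
signal_markers = [(12.0, 0.0)]
is_intersection = flag_intersections(road_nodes, signal_markers)
```

The pairwise check costs time proportional to the number of nodes times the number of markers, which is usually acceptable for a single scenario; a spatial index would be a natural replacement for larger map extracts.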

These implementation details are pivotal in achieving the desired synergy between the HiVT model and OSM data, ultimately enhancing the model’s performance in the context of long-term motion forecasting on the Argoverse 2 dataset.

IV-D Quantitative Analysis

Our approach’s validation results on the Argoverse 2 dataset are presented in Table I. With the original receptive field, which encompasses map information within a 100-meter radius around the agent, a notable performance gap is apparent between the HDMap-based method and our OSM-based approach. In this setting, the OSM-based method only marginally outperforms a model that uses no map information. Note that all HiVT models used in this analysis are HiVT-128.

However, performance changes markedly when we expand the receptive field, i.e., the radius around the agent within which map information is provided, from 100 meters to 125 meters. While the HDMap-based results plateau, the OSM-based method improves substantially. We attribute this divergence to the limited capacity of the lightweight HiVT model, which may reach a performance bottleneck as additional HDMap information yields diminishing returns. In contrast, OSM, which carries less detailed information to begin with, benefits significantly from an expanded receptive field, allowing HiVT to better capture the overall road network in the surrounding environment. Even after this improvement, the OSM-based method comes close to the HDMap-based method in minADE (1.043 m vs. 0.929 m), but a larger gap remains in minFDE (2.241 m vs. 1.876 m). This highlights the need for more in-depth investigation.
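The size of these effects can be read directly off Table I. The short script below is only a worked arithmetic check on the reported validation numbers; the dictionary layout and function name are illustrative.

```python
# Validation results copied from Table I: (minADE, minFDE) in meters.
results = {
    ("HDMap", 100): (0.943, 1.934),
    ("HDMap", 125): (0.929, 1.876),
    ("OSM", 100): (1.375, 3.234),
    ("OSM", 125): (1.043, 2.241),
}


def relative_improvement(before: float, after: float) -> float:
    """Fractional error reduction when going from `before` to `after`."""
    return (before - after) / before


# Error reduction from widening the receptive field from 100 m to 125 m.
summary = {}
for source in ("HDMap", "OSM"):
    ade_100, fde_100 = results[(source, 100)]
    ade_125, fde_125 = results[(source, 125)]
    summary[source] = (
        relative_improvement(ade_100, ade_125),
        relative_improvement(fde_100, fde_125),
    )

# Remaining OSM-vs-HDMap gap at the 125 m receptive field, in meters.
gap_ade = results[("OSM", 125)][0] - results[("HDMap", 125)][0]
gap_fde = results[("OSM", 125)][1] - results[("HDMap", 125)][1]

for source, (ade_gain, fde_gain) in summary.items():
    print(f"{source}: minADE -{ade_gain:.1%}, minFDE -{fde_gain:.1%}")
print(f"gap at 125 m: minADE {gap_ade:.3f} m, minFDE {gap_fde:.3f} m")
```

Widening the receptive field reduces the OSM errors by roughly 24% (minADE) and 31% (minFDE), against roughly 1.5% and 3% for HDMap, and leaves a gap of about 0.11 m in minADE and 0.37 m in minFDE.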

To achieve a more comprehensive understanding, we conducted a frame-by-frame analysis of displacement errors, considering that tracked agent information is provided at a rate of 10 frames per second. This analysis is depicted in Fig. 3. With the additional OSM information, the error grows almost linearly over time and stays close to that of the HDMap-based method, emphasizing the benefits of incorporating more map data for motion forecasting. Furthermore, increasing the receptive field is more feasible for the OSM-based method, as the requested range is very unlikely to exceed the extent of the available OSM data. Importantly, expanding the receptive field in HiVT incurs minimal computational overhead during inference. Our evaluation, conducted on a Titan Xp GPU with a batch size of 32, reveals negligible differences after the receptive field increase, both in terms of inference speed and GPU VRAM usage. This makes receptive field expansion an attractive option for enhancing motion forecasting capabilities.
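A frame-by-frame error curve of this kind can be computed as sketched below. The sketch assumes that one best trajectory per agent has already been selected and that all trajectories share the same number of future frames; the function name and the toy data are illustrative.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]

FRAME_RATE_HZ = 10  # tracked agent states arrive at 10 frames per second


def per_frame_error(preds: List[Trajectory], gts: List[Trajectory]) -> List[float]:
    """Mean displacement error at each future frame, averaged over agents."""
    n_frames = len(gts[0])
    totals = [0.0] * n_frames
    for pred, gt in zip(preds, gts):
        for t in range(n_frames):
            totals[t] += math.hypot(pred[t][0] - gt[t][0], pred[t][1] - gt[t][1])
    return [total / len(gts) for total in totals]


# Toy example: two agents, three future frames each.
gts = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)],
]
preds = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    [(0.0, 0.0), (0.0, 1.0), (0.0, 4.0)],
]

curve = per_frame_error(preds, gts)
timestamps = [(t + 1) / FRAME_RATE_HZ for t in range(len(curve))]
```

Plotting `curve` against `timestamps` for each map configuration gives the kind of error-over-time comparison discussed above; a 6-second horizon at 10 frames per second corresponds to 60 points per curve.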

This quantitative analysis underscores the pivotal role of receptive field size in influencing the performance of our OSM-based approach compared to the HDMap-based method, revealing the potential for substantial enhancements in motion forecasting capabilities.

IV-E Qualitative Analysis

In addition to our quantitative findings, we conducted a qualitative analysis with a focus on class-specific performance and visualization, shown in Fig. 4, to gain deeper insights into the influence of map data on motion forecasting for different agent types.

In our class-specific analysis, we observed that the forecasting performance for pedestrians and cyclists showed limited improvement when employing HDMap compared to OSM. This result aligns with expectations, as pedestrians often do not adhere to lanes while walking on sidewalks, resulting in both OSM and HDMap providing similar information in such cases. Conversely, classes such as cars, motorcyclists, and buses demonstrated noticeable performance enhancements when utilizing HDMap, emphasizing the potential significance of lane-level information in certain scenarios.

Nevertheless, we aim to uncover the underlying reasons for the performance gap between the OSM-based and HDMap-based methods. To achieve this, we categorized scenarios into three cases: straight roads, curved roads, and intersections. We then selected two representative scenarios from the Argoverse 2 validation set after a meticulous review of over 1000 scenarios. Intuitively, we expected lane-level information to matter most at intersections, where it provides contextual understanding. To validate this hypothesis, we visualize straight and curved road scenarios in Fig. 5 and an intersection scenario in Fig. 6. As expected, in both the straight road and curved road scenarios, there was minimal difference in outcome between the OSM-based and HDMap-based methods. However, in the intersection scenario, where inferring intersection locations is essential, lane-level knowledge provided a significant contextual advantage. This observation highlights the limitation of the OSM-based method in motion forecasting given the information we extracted from OSM.

V CONCLUSIONS

In summary, this research represents a significant advancement in the utilization of OpenStreetMap (OSM) for long-term prediction when coupled with HiVT. Through the expansion of the receptive field and training our HiVT model with OSM data, we have achieved comparable results to HiVT trained with HDMap, as demonstrated on the Argoverse 2 dataset. Our exploration of map representation’s impact on forecasting performance, with a focus on different agent types through our by-class analysis, has yielded valuable insights. This comprehensive qualitative analysis has illuminated the pivotal role of map data in motion forecasting, especially in scenarios necessitating intricate lane-level information, such as intersections. As a result, our proposed methodology shows substantial potential for enhancing the scalability of autonomous driving systems by leveraging publicly accessible coarse map data.