
11institutetext: Horizon Robotics Inc.
11email: {yihan.hu96, Proliu}@gmail.com

Solving Motion Planning Tasks with a Scalable Generative Model

Yihan Hu     Siqi Chai    Zhening Yang    Jingyu Qian    Kun Li    Wenxin Shao    Haichao Zhang    Wei Xu    Qiang Liu
Abstract

As autonomous driving systems are deployed to millions of vehicles, there is a pressing need to improve the system's scalability and safety and to reduce engineering cost. A realistic, scalable, and practical simulator of the driving world is highly desired. In this paper, we present an efficient solution based on generative models that learns the dynamics of driving scenes. With this model, we can not only simulate the diverse futures of a given driving scenario but also generate a variety of driving scenarios conditioned on various prompts. Our innovative design allows the model to operate in both full-autoregressive and partial-autoregressive modes, significantly improving inference and training speed without sacrificing generative capability. This efficiency makes the model ideal for use as an online reactive environment for reinforcement learning, an evaluator for planning policies, and a high-fidelity simulator for testing. We evaluate our model on two real-world datasets: the Waymo motion dataset and the nuPlan dataset. On the simulation realism and scene generation benchmarks, our model achieves state-of-the-art performance, and on the planning benchmarks, our planner outperforms the prior art. We conclude that the proposed generative model may serve as a foundation for a variety of motion planning tasks, including data generation, simulation, planning, and online training. Source code is publicly available at https://github.com/HorizonRobotics/GUMP/.

1 Introduction

Autonomous driving (AD) has evolved from a visionary concept to tangible products in recent years [49, 74, 79]. Despite these advancements, challenges in technical scalability remain [27]. These challenges concern the system's ability to adapt to new and unseen environments, which is the root cause of ongoing safety concerns and of the extensive engineering effort required to address various failure scenarios. To continuously scale up in terms of safety and cost efficiency, developing a model that learns and represents the driving world is essential. Such a model could enable the continuous generation of data for training and testing, moving toward a more integrated and learned driving system that seamlessly interacts with real-world road users.

A driving world may be considered as a sequence of driving scenarios, which typically contain a map and multiple agents interacting within that space. Obtaining a local map for these scenarios is generally straightforward, whether through map providers [54] or through online mapping modules [43, 44, 45]. The real challenge, however, lies in understanding the dynamic nature of traffic - specifically, the interdependence among road users. Thus, a fundamental task in modeling the driving world involves accurately predicting the behaviors of these agents, formulating the plans based on these predictions and continuously refining these predictions as the agents take action.

This task has been intensively explored in recent years, including motion forecasting [21, 52, 51, 67, 34] and imitative traffic modeling [71, 42]. All of these works model either the marginal or the joint distributions of the agents' future trajectories. Despite these advancements, these models are typically trained in an open-loop setting, which may limit their ability to adapt to out-of-distribution states encountered in real-world, closed-loop settings. Consequently, their capability for traffic simulation is generally constrained to short durations, such as a few seconds.

Recently, closed-loop motion prediction has attracted increasing attention, addressing the problems of long-horizon traffic simulation and interactive simulation [50, 77, 65, 56]. Through the iterative forecasting of each agent's next states, autoregressive (AR) models articulate the interactions among agents as a series of cascading conditional distributions. At each subsequent time step, sampling from the model can generate a variety of actions for an agent, aligning with both the context of the scene and the preceding actions of other agents. Therefore, learning the dynamics of the driving world is approached in the same manner as learning sequence prediction, akin to the techniques used in language modeling [60].

In light of these works, we propose the Generative Unified model for Motion Planning tasks, namely GUMP, with a generative model structure and a simple tokenizer that uses an object's unique ID as its key and a compressed state space as its value. This design significantly enhances the model's flexibility, enabling us to adopt an efficient partial-AR structure. The model demonstrates high scalability in computational efficiency, and shows strong generalization capabilities in understanding complex traffic flows and handling long-tail cases. In addition, its generative ability allows infinite sampling from the learned distribution, which enables long-horizon reactive agent simulation.

Centered at this model, as shown in Fig. 1, we have explored various downstream tasks, and found that this model can be used as:

  1. a data generator that creates scenarios specific to user's prompts;

  2. a realistic simulator serving as a reactive closed-loop test bed;

  3. a planner that unravels interactions between agents to reduce infractions and improve human-like behavior;

  4. an online training module that enhances the effectiveness of reinforcement learning for policy models.

In summary, our main contributions are threefold: Firstly, we propose a novel generative model that features a simple key-value paired tokenizer. We show that this model achieves state-of-the-art performance on both the simulation and planning benchmarks. Secondly, we have extensively investigated the use of this model as a foundation model in a wide range of downstream tasks, and demonstrate that it can significantly improve the functionalities of these tasks. Thirdly, we provide a framework that uses the generative model as the central component of a closed-loop training and evaluation system. To the best of our knowledge, we are the first to solve all learning-based motion planning tasks with a unified framework.

Figure 1: We are motivated to provide a generative model as the central unit that supports all the learning-based motion planning tasks in the autonomous driving domain. We categorize the tasks into four distinct sub-domains: data generation, model evaluation, model training, and model inference. These sub-domains are visually distinguished in our diagram by different colors—green for data generation, blue for model evaluation, purple for model training, and orange for model inference. Our approach encompasses both offboard applications (the first three sub-domains) and an onboard application (the last sub-domain). Specifically, scene generation aims at data generation capable of producing specific traffic scenarios based on context information, such as high-definition maps or user prompts; reactive simulation aims at a closed-loop evaluator that provides realistic, human-like agents that respond to the behavior of the ego vehicle and its environment; online training aims at a closed-loop training module that allows the learned policy to interact with the environment, collect rewards, and perform back-propagation. Lastly, interactive planning aims at enhancing an onboard planner by parallel unrolling to seek the optimal trajectory that achieves the highest reward.

2 Methods

2.1 Formulation

Let the dynamic states of the agents at time $t$ be denoted by $s_t = (a_t^{AV}, a_t^{env})$, where $a_t^{AV}$ is the self-driving vehicle and $a_t^{env}$ is composed of the surrounding agents. Specifically, $a_t^{AV} = (a_t^0)$ and $a_t^{env} = (a_t^i)$ for $i \in \{1, 2, \ldots, n\}$. The context is denoted as $c$, which includes a static map [54] and language description prompts [69]. We factorize the joint probability distribution of traffic scenarios as follows:

P(s_t, s_{t-1}, \ldots, s_0, c) = \underbrace{P(s_t \mid s_{t-1}, \ldots, s_0, c) \cdot \ldots \cdot P(s_1 \mid s_0, c)}_{\text{scene extrapolation}} \cdot \underbrace{P(s_0 \mid c)}_{\text{scene generation}} \cdot \underbrace{P(c)}_{\text{context}}.   (1)

Naturally, traffic scenarios can be modeled in a probabilistic and sequential manner: given just the context $c$, we can generate the initial states of the agents $s_0$; with both the context $c$ and the initial states $s_0$, we can extrapolate the subsequent states $s_1, s_2, \ldots, s_{t-1}$ of the agents. The initial step allows us to create scenarios through scene generation, while the next step enables us to foresee how these scenarios evolve over time, which we refer to as scene extrapolation.
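As an illustration, the factorization in Eq. 1 can be sampled sequentially. The sketch below assumes hypothetical sample_initial_scene and sample_next_state callables that wrap the learned conditionals $P(s_0 \mid c)$ and $P(s_t \mid s_{<t}, c)$; it is not the paper's actual interface.

```python
# Minimal sketch of sampling a scenario from the factorization in Eq. (1).
# `sample_initial_scene` and `sample_next_state` are hypothetical wrappers
# around the learned conditionals P(s_0 | c) and P(s_t | s_{<t}, c).
def rollout_scenario(context, horizon, sample_initial_scene, sample_next_state):
    states = [sample_initial_scene(context)]                # scene generation: s_0 ~ P(s_0 | c)
    for _ in range(horizon):                                 # scene extrapolation
        states.append(sample_next_state(states, context))    # s_t ~ P(s_t | s_{<t}, c)
    return states
```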

2.1.1 Scene Generation

The initial states can be further factorized as:

P(s_0 \mid c) = \prod_{j} P(a_0^j \mid a_0^0, \ldots, a_0^{j-1}, c).   (2)

Specifically, we autoregressively generate the initial state of each agent in the scene, denoted as $a_0^i$, where $a_0^i = \{x, y, \theta, v_x, v_y, w, l\}$ represents the initial position ($x, y$), heading ($\theta$), velocity components ($v_x, v_y$), width ($w$), and length ($l$) of the $i^{th}$ agent.
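A minimal sketch of the per-agent autoregressive loop implied by Eq. 2 is given below; sample_agent is a hypothetical stand-in for one pass of the generative model, used only for illustration.

```python
# Sketch of the autoregressive initial-state generation of Eq. (2).
# `sample_agent` is a hypothetical stand-in that draws one agent's initial
# tuple (x, y, theta, vx, vy, w, l) conditioned on the agents placed so far.
def generate_initial_scene(context, num_agents, sample_agent):
    agents = []
    for _ in range(num_agents):
        agents.append(sample_agent(agents, context))   # a_0^j ~ P(a_0^j | a_0^{<j}, c)
    return agents
```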

2.1.2 Scene Extrapolation

We can further factorize the scene extrapolation task as [50]:

P(s_T, s_{T-1}, \ldots \mid s_0, c) = P(a_{1:T}^{AV}, a_{1:T}^{env} \mid s_0, c) = \prod_{t} \pi_{AV}(a_t^{AV} \mid a_{<t}^{AV}, a_{<t}^{env}, c) \cdot P(a_t^{env} \mid a_t^{AV}, a_{<t}^{AV}, a_{<t}^{env}, c).   (3)

In this model, $\pi_{AV}$ represents the policy for the autonomous vehicle. Note that this policy can be replaced with any planning policy. Moreover, we can categorize the environmental actions $a_t^{env}$ into two distinct groups: $a_t^{tracked}$ and $a_t^{newborn}$. Here, $a_t^{tracked}$ refers to objects that have been previously tracked, while $a_t^{newborn}$ represents objects that have just emerged in the environment.

We enable the probabilistic modeling of dynamic scene information without imposing a limit on the number of objects throughout the extrapolation. This capability is crucial for realistically modeling scenarios where objects may disappear or newly appear, addressing both currently obscured objects that may become visible later and those present now but might move out of view. Such a nuanced treatment of dynamic elements significantly improves our model’s capability to handle the complex, unpredictable, and long-tail problems in autonomous driving.
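To make the factorization concrete, one rollout step under Eq. 3 can be sketched as below; av_policy and sample_env_agents are hypothetical callables, and the handling of tracked versus newborn agents is folded into the latter.

```python
# Hedged sketch of scene extrapolation following Eq. (3). `av_policy` plays
# the role of pi_AV (any planner can be substituted); `sample_env_agents`
# samples the surrounding agents, including newly appearing ones.
def extrapolate(av_hist, env_hist, context, horizon, av_policy, sample_env_agents):
    for _ in range(horizon):
        av_next = av_policy(av_hist, env_hist, context)                     # a_t^AV
        env_next = sample_env_agents(av_next, av_hist, env_hist, context)   # a_t^env
        av_hist = av_hist + [av_next]
        env_hist = env_hist + [env_next]
    return av_hist, env_hist
```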

2.1.3 Evaluate Planning Policies

Suppose we have a series of candidate policies, denoted as $\pi_i$, where $i \in \{0, 1, \ldots, N_p - 1\}$. These policies can be rule-based or learning-based. Given the observations $s$ at a certain moment, these policies can output different response actions $a_i \sim \pi_i(s)$. Under the Markov assumption, the value function $V^{\pi_i}(s)$ of a chosen policy $\pi_i$ is denoted as:

V^{\pi_i}(s) = \sum_{s'} P(s' \mid s, \pi_i(s)) \left[ r(s, \pi_i(s)) + \gamma V^{\pi_i}(s') \right]   (4)

where $\gamma$ is the discount factor and $P(s' \mid s, \pi_i(s))$ is the state transition probability function, which can be modeled with the probabilistic rollouts of our world model. Here, $s'$ represents the next state, and $r(s, \pi_i(s))$ is the reward function, which may be defined as [2]:

r(s, \pi_i(s)) = \prod_{m} \Theta_m(s, \pi_i(s)) \cdot \sum_{n} \omega_n \Phi_n(s, \pi_i(s))   (5)

where $\Theta_m$ are the critical metrics, such as safety and driving direction, $\Phi_n$ are the less critical weighted metrics, such as progress, speed limit, and comfort, and $\omega_n$ is the corresponding metric weight. More details can be found in Appendix 0.B.2.3.

Given the vast and probabilistic nature of the state space, we estimate the value by sampling from its overall distribution. As in Eq. 3, the results are referred to as rollouts, denoted as $X_i^k = \{s_0, s_1, \ldots, s_T\}_i^k = \{a_t^{AV}, a_t^{env}\}_i^k$, where $i \in \{0, 1, 2, \ldots, N_p - 1\}$, $k \in \{0, 1, 2, \ldots, N_r - 1\}$, $N_p$ is the number of candidate policies, and $N_r$ is the number of parallel rollouts for each policy. The total number of rollouts is therefore $N_p \times N_r$. We then apply the evaluator to calculate the accumulated rewards for each policy following Eq. 4.

Accordingly, the ego vehicle’s optimal driving policy can be written as

\pi^{\star}(s_0) = \underset{\pi_i}{\mathrm{argmax}} \, \hat{V}^{\pi_i}(s_0)   (6)

where $\hat{V}^{\pi_i}(s_0)$ is the value estimated through rollouts obtained by sampling.
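The whole selection procedure of Eqs. 4-6 can be summarized by the following Monte Carlo sketch; the world-model step, the reward function, and the policy callables are assumed interfaces rather than the actual implementation.

```python
# Monte Carlo sketch of Eqs. (4)-(6): estimate V_hat^{pi_i}(s_0) with N_r
# sampled rollouts per candidate policy, then pick the best policy.
# `world_model_step`, `reward_fn`, and the policies are assumed interfaces.
def select_policy(s0, policies, world_model_step, reward_fn,
                  num_rollouts=8, horizon=40, gamma=0.99):
    values = []
    for policy in policies:
        total = 0.0
        for _ in range(num_rollouts):
            s, ret, discount = s0, 0.0, 1.0
            for _ in range(horizon):
                a = policy(s)                        # a ~ pi_i(s)
                ret += discount * reward_fn(s, a)    # r(s, pi_i(s)), Eq. (5)
                s = world_model_step(s, a)           # s' ~ P(s' | s, pi_i(s))
                discount *= gamma
            total += ret
        values.append(total / num_rollouts)          # V_hat^{pi_i}(s_0)
    best = max(range(len(policies)), key=lambda i: values[i])   # Eq. (6)
    return policies[best], values
```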

2.1.4 Online Training with Reinforcement Learning

In standard Reinforcement Learning (RL), the problem to be solved is typically described as a Markov Decision Process (MDP) [72], which is commonly defined by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state space and action space, respectively. $P(s' \mid s, a)$ denotes the state transition process, i.e., the probability of transitioning to $s'$ given the current state $s$ and action $a$. $r$ is the reward function and $\gamma$ is the discount factor. The goal of RL is to learn a policy $\pi(s; \theta)$ that maximizes the accumulated discounted return.

While much effort has been devoted to the development of RL algorithms in the past [64, 23], the importance of a high-fidelity simulator has started to gain more attention recently [53]. For autonomous driving, one simulation challenge is producing realistic reactive behaviors from other road participants (i.e., the implementation of $P$), which is arguably hard to fully characterize with scripted rules [16]. Therefore, instead of implementing $P$ by scripting (i.e., $P \triangleq P_{\rm scripted}$) as in standard simulators [16], we leverage our data-driven world model for the state transition:

P(s' \mid s, a) \triangleq P_{\rm WorldModel}(s' \mid s, a),   (7)

where $P_{\rm WorldModel}$ denotes our world model trained on real-world data, which therefore captures real-world behavior and avoids cumbersome rule crafting. This could significantly enhance RL by ensuring alignment with real-world logic.

The unrolling process can be expressed as:

s' \sim P_{\rm WorldModel}(s' \mid s, a), \qquad a \sim \pi(s; \theta).

The rewards can be calculated with Eq. 4 and Eq. 5. For training, one may use typical RL algorithms such as Soft Actor-Critic (SAC) [23].
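A minimal sketch of this unrolling-and-update loop is shown below, with the world model serving as the environment transition; the agent interface (act/observe/update) is a generic placeholder for an off-policy learner such as SAC, not the paper's training code.

```python
# Sketch of online RL training against the learned world model as the
# environment transition P_WorldModel(s' | s, a). The `agent` interface is a
# generic placeholder for an off-policy learner (e.g. SAC).
def online_training(world_model_step, reward_fn, agent, sample_start_state,
                    num_episodes=1000, episode_len=40):
    for _ in range(num_episodes):
        s = sample_start_state()                 # a logged or generated initial scene
        for _ in range(episode_len):
            a = agent.act(s)                     # a ~ pi(s; theta)
            s_next = world_model_step(s, a)      # s' ~ P_WorldModel(s' | s, a)
            r = reward_fn(s, a)                  # reward as in Eqs. (4)-(5)
            agent.observe(s, a, r, s_next)       # push transition to replay buffer
            agent.update()                       # one gradient step (e.g. SAC)
            s = s_next
```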

2.2 Network

Figure 2: GUMP is composed of a raster encoder that encodes static information including map, route and static objects, a key-value pair tokenizer that discretizes dynamic information including the states of road users and traffic lights, a Multimodal Causal Transformer (MCT) that predicts the next latent embedding based on key queries in an autoregressive manner, and a decoder that samples the probabilistic features and decodes to future scenarios.

The overall architecture of our model, depicted in Figure 2, is comprised of four main components: a static raster autoencoder, a dynamic tokenizer, a Multimodal Causal Transformer (MCT), and an auto-regressive decoder. For the static information, elements such as maps, navigation routes, and static obstacles are converted into a raster image. This image is then encoded with a 2D Convolutional Encoder [28]. Meanwhile, dynamic information is encapsulated into a sequence of tokens, providing a linguistic representation of driving behaviors. The static and dynamic information are then fused [1] and encoded with a multimodal visual-language model [60]. Due to its predictive nature, our model can be further accelerated by the intra-frame non-autoregressive (NAR) conversion.

2.2.1 Tokenization

In probabilistic modeling of continuous state spaces, discretization is required. Unlike previous studies that use complex transformations to map the state space to the action space and then reintegrate to recover the state space [65, 56], our method directly quantizes the state space.

Choosing state space over action space is motivated by several key factors. Although the action token space is more compact, featuring a smaller vocabulary size, this compactness often sacrifices interpretability. Moreover, encoding with action space requires state-dependent decoding, where each step depends on all preceding actions and the initial states. This dependency may result in compounded errors. Additionally, this dependency can also limit the network’s flexibility, thereby restricting its generative capabilities, such as predicting traffic lights, handling the emergence and disappearance of elements, or generating scenarios.

In our design, each object is composed of two tokens with distinct functionalities, akin to a “key-value pair”: a control token and a state token. The control token serves as the key, used to distinguish between different objects by summing the token embeddings of each object's unique ID and object category. The state token serves as the value, utilized to depict the specific state space of the object. Specifically, we select different state spaces for different types of objects, and perform quantization of the state space for each object. For instance, for traffic lights, the token space includes coordinates ($x, y, \theta$) and the traffic signal state ($s_{tl}$). For traffic participants like pedestrians and vehicles, the token space encompasses coordinates ($x, y, \theta$), velocity ($v_x, v_y$), and size ($w, l$). This key-value pair tokenization endows our model with indexing ability: we can selectively query and decode any object of interest, or arbitrarily add or remove objects. This approach allows the model to dynamically allocate computational resources for increased efficiency and to manage the disappearance and emergence of objects effectively.

Similar to language models in sequence modeling, we incorporate special tokens, including the beginning of sequence (BOS) token, traffic light end token, and newborn begin token. The final tokenized sequence structure can be depicted as shown in Fig. 2. Additionally, we use a predetermined set of scenario description prompts, which can be represented with a fixed vocabulary. This vocabulary can be embedded similarly to the special tokens.
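For illustration, one key-value tokenization step could look like the sketch below; the bin ranges, the number of quantization bins, and the dictionary layout are assumptions, not the exact quantization used by GUMP.

```python
# Illustrative key-value tokenization sketch. Bin ranges and the number of
# quantization bins are assumptions for demonstration only.
import numpy as np

def quantize(value, lo, hi, num_bins=128):
    """Uniformly discretize a continuous value into an integer token id."""
    value = float(np.clip(value, lo, hi))
    return int(round((value - lo) / (hi - lo) * (num_bins - 1)))

def tokenize_agent(agent_id, category, state):
    # Control ("key") token: identifies the object by unique id and class.
    key_token = {"id": agent_id, "class": category}
    # State ("value") token: quantized (x, y, theta, vx, vy, w, l).
    bounds = {"x": (-100.0, 100.0), "y": (-100.0, 100.0), "theta": (-np.pi, np.pi),
              "vx": (-30.0, 30.0), "vy": (-30.0, 30.0), "w": (0.0, 5.0), "l": (0.0, 20.0)}
    value_token = {name: quantize(state[name], lo, hi) for name, (lo, hi) in bounds.items()}
    return key_token, value_token

# Example: one vehicle at the origin, heading east at 5 m/s.
key, value = tokenize_agent(agent_id=3, category="vehicle",
                            state={"x": 0.0, "y": 0.0, "theta": 0.0,
                                   "vx": 5.0, "vy": 0.0, "w": 2.0, "l": 4.5})
```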

2.2.2 Token Embedding

Each token is embedded into latent features. For the key token of each agent, we sum the embedding features of the agent's unique ID and its class type. For its value token, we embed all the states with sinusoidal embeddings along with tokenized state embeddings. All embeddings are then summed with a learnable positional embedding, as in the equation below:

e_{\text{key}}^{i} = \text{PE}(i) + \sum_{s \in \{\text{id}, \text{class}\}} \text{Embed}(s)   (8)
e_{\text{value}}^{i} = \text{PE}(i) + \sum_{s \in \{x, y, \theta, v_x, v_y, w, l\}} \left( \text{Embed}(s) + \text{sinusoidal}(s) \right)

Here, $i$ is the position index in the token sequence, $\text{PE}$ is the learnable positional embedding, $\text{sinusoidal}$ is the sinusoidal embedding, and $\text{Embed}$ is the learnable token embedding. For static objects such as traffic lights, we only embed the coordinates ($x, y, \theta$) and the traffic light status ($s_{tl}$).
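A PyTorch-flavored sketch of Eq. 8 follows; the vocabulary sizes, hidden width, and exact sinusoidal formulation are assumptions for illustration.

```python
# Sketch of the key/value token embeddings of Eq. (8). Sizes and the
# sinusoidal formulation are assumptions for illustration.
import math
import torch
import torch.nn as nn

def sinusoidal(x, dim):
    """Sinusoidal embedding of a batch of scalars: shape [...] -> [..., dim]."""
    freqs = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                      * (-math.log(10000.0) / dim))
    angles = x.float().unsqueeze(-1) * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class TokenEmbedding(nn.Module):
    STATE_KEYS = ["x", "y", "theta", "vx", "vy", "w", "l"]

    def __init__(self, dim=256, max_len=1024, num_ids=256, num_classes=8, bins=128):
        super().__init__()
        self.pos = nn.Embedding(max_len, dim)      # learnable positional embedding PE(i)
        self.id_emb = nn.Embedding(num_ids, dim)
        self.cls_emb = nn.Embedding(num_classes, dim)
        self.state_emb = nn.ModuleDict({k: nn.Embedding(bins, dim) for k in self.STATE_KEYS})
        self.dim = dim

    def key_embedding(self, pos_idx, agent_id, cls_id):
        # e_key^i = PE(i) + Embed(id) + Embed(class)
        return self.pos(pos_idx) + self.id_emb(agent_id) + self.cls_emb(cls_id)

    def value_embedding(self, pos_idx, token_ids, raw_values):
        # e_value^i = PE(i) + sum_s (Embed(s) + sinusoidal(s))
        e = self.pos(pos_idx)
        for k in self.STATE_KEYS:
            e = e + self.state_emb[k](token_ids[k]) + sinusoidal(raw_values[k], self.dim)
        return e
```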

2.2.3 Multi-modal Causal Transformer

The Multi-modal Causal Transformer (MCT) module forms the generative core. It builds on the GPT-2 architecture [60, 39] and includes Gated Cross Attention (GCA) Blocks [1] for fusing information from different modalities. We intermittently insert the GCA blocks among self-attention layers to progressively and interactively fuse the dynamic and static information.
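A schematic sketch of this interleaving is given below; the layer counts, gating scheme, and use of standard PyTorch transformer layers are assumptions rather than the exact GUMP architecture.

```python
# Schematic sketch of interleaving Gated Cross-Attention (GCA) blocks with a
# causal self-attention stack, in the spirit of Flamingo [1]. Layer counts,
# widths, and the gating scheme are assumptions, not the exact GUMP design.
import torch
import torch.nn as nn

class MCTSketch(nn.Module):
    def __init__(self, dim=256, num_heads=8, num_layers=8, gca_every=2):
        super().__init__()
        self.gca_every = gca_every
        self.self_attn = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
            for _ in range(num_layers))
        num_gca = num_layers // gca_every
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_gca))
        self.gates = nn.ParameterList(nn.Parameter(torch.zeros(1)) for _ in range(num_gca))

    def forward(self, tokens, raster_feats, causal_mask):
        # tokens: [B, T, dim] dynamic token embeddings; raster_feats: [B, N, dim].
        x = tokens
        for i, layer in enumerate(self.self_attn):
            if i % self.gca_every == 0:                      # interleave GCA blocks
                j = i // self.gca_every
                fused, _ = self.cross_attn[j](x, raster_feats, raster_feats)
                x = x + torch.tanh(self.gates[j]) * fused    # gated fusion of static info
            x = layer(x, src_mask=causal_mask)               # causal self-attention
        return x
```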

2.2.4 RNN Token Decoder

To decode the discrete state space $S$ from the compact latent embedding produced by the Multi-modal Causal Transformer (MCT), an additional GRU decoder is utilized after the MCT. By querying the MCT with the key embedding for each agent, we extract the corresponding state latent space, which is then autoregressively decoded using stacked GRU layers according to the sequence $\{w, l, v_x, v_y, x_0, y_0, \theta_0, \ldots, x_T, y_T, \theta_T\}$, where $T$ represents the prediction horizon for each agent. This process facilitates temporal aggregation and intra-frame NAR conversion. To maintain the stability of the decoding, we adopt a temperature-scaled top-k sampling strategy [29, 18] and dynamic valid masking to ignore invalid tokens.
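The per-token sampling step might look like the following sketch; the temperature, k, and masking interface are illustrative values, not the paper's settings.

```python
# Hedged sketch of temperature-scaled top-k sampling with dynamic valid
# masking. Temperature and k are illustrative values.
import torch

def sample_state_token(logits, valid_mask, temperature=1.0, k=40):
    """logits: [vocab]; valid_mask: [vocab] bool tensor marking allowed tokens."""
    logits = logits / temperature
    logits = logits.masked_fill(~valid_mask, float("-inf"))   # drop invalid tokens
    topk_vals, topk_idx = torch.topk(logits, k)
    probs = torch.softmax(topk_vals, dim=-1)                   # renormalize over top-k
    choice = torch.multinomial(probs, num_samples=1)
    return int(topk_idx[choice])
```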

2.2.5 Prediction Chunking and Temporal Aggregation

To combat compounding errors and stabilize the AR process, inspired by the concept of “action chunking” in [84], we have developed a prediction chunking and temporal aggregation module. We perform the chunking process by predicting a longer time horizon with the light-weight RNN described above. During the AR decoding process, we aggregate all the prediction results of each time step as illustrated in Fig. 3. Specifically, we take a weighted average across the different time steps with a decay rate $\gamma$; the final output state can be represented as $\hat{s}_t = \sum_{i=0}^{T} \gamma^i \cdot s_{t-i}^{i}$, where $t$ is the output time step and $T$ is the total number of time steps in the chunked data. This module significantly improves the model's performance on the Waymo Sim Agents metric [50], as shown by the experimental results in Appendix 0.E.2.
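As a concrete illustration of the aggregation rule, the sketch below blends the chunked predictions that target the same output time step with exponentially decaying weights; the decay rate and the normalization of the weights are assumptions added for a well-scaled average.

```python
# Sketch of temporal aggregation: s_hat_t = sum_i gamma^i * s_{t-i}^i.
# `preds_for_t[i]` holds the prediction for output time t that was produced
# i steps earlier in its chunk; normalizing the weights is an assumption here.
import numpy as np

def aggregate_predictions(preds_for_t, gamma=0.5):
    weights = np.array([gamma ** i for i in range(len(preds_for_t))])
    weights /= weights.sum()                               # keep the blend well-scaled
    return sum(w * np.asarray(p) for w, p in zip(weights, preds_for_t))

# Example: three overlapping predictions of (x, y, theta) for the same step.
fused = aggregate_predictions([[1.0, 2.0, 0.1], [1.1, 2.1, 0.12], [0.9, 1.9, 0.08]])
```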

Figure 3: Prediction Chunking and Temporal Aggregation. $s_t^p$ denotes the per-agent state at time step $t$ and time index $p$ within the chunked data. In our context, $s_t^p$ comprises the set $\{x, y, \theta\}$.

2.2.6 Intra-frame NAR Conversion

To speed up a full-AR model, one approach is to decode the parts that are less dependent on each other in parallel [4]. Considering the characteristics of the traffic simulator, we temporarily ignore the current interactions between agents and predict the next state of each agent in parallel with the GRU decoder mentioned earlier. By conditioning on these parallel-decoded states instead, we can eliminate the intra-frame sequential dependency of the AR process, as shown in Fig. 4. We can then compensate for the intra-frame dependencies with a single forward pass and solve for more precise agent states efficiently. By converting the full-AR mode to the partial-AR mode, we achieve a significant speedup, as demonstrated in Appendix 0.E.1. Moreover, this conversion is optional, allowing our model to maintain its AR ability, which is essential for scene generation tasks and managing newborn agents. Also, our method avoids altering the causal attention mask, and is therefore highly compatible with acceleration libraries [13, 12], leading to notable improvements in training and inference speed and memory usage compared to versions implemented with a customized masking strategy.
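The contrast between the two modes can be summarized by the sketch below; decode_one, predict_all_parallel, and refine_jointly are hypothetical stand-ins for the GRU decoder and the single compensating forward pass described above.

```python
# Conceptual sketch of full-AR vs. partial-AR (intra-frame NAR) decoding for
# one frame. The three callables are hypothetical stand-ins for the GRU
# decoder and the single compensating MCT forward pass.
def step_full_ar(agents, decode_one):
    # Sequential: each agent is conditioned on agents already decoded in this
    # frame, i.e. O(num_agents) autoregressive model calls.
    decoded = []
    for agent in agents:
        decoded.append(decode_one(agent, decoded))
    return decoded

def step_partial_ar(agents, predict_all_parallel, refine_jointly):
    # 1) Predict every agent's next state in parallel, ignoring intra-frame
    #    interactions; 2) compensate with one joint forward pass.
    surrogate = predict_all_parallel(agents)
    return refine_jointly(agents, surrogate)
```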

Figure 4: This figure compares the full-AR mode with the partial-AR mode. Here, $a_t^i$ represents the state of the $i^{th}$ agent at time $t$. Employing a GRU decoder alongside prediction chunking, we can simultaneously predict the next state of each agent, denoted as $\hat{a}_{t+1}^i$. These predictions serve as surrogate conditions to bypass intra-frame sequential dependencies, markedly speeding up the process by eliminating the need for an intra-frame sequential AR procedure.

2.3 Framework

In this section, we introduce how GUMP serves as a central unit to support a series of downstream tasks, as illustrated in Fig. 5. For those tasks, we have developed specific engines to interact with GUMP, categorized into scene generation, scene extrapolation, planning, and RL engines.

For scene generation, GUMP incorporates a scenario prompt as an additional condition, as described in Sec. 2.2.2, operating in full-AR mode. Specifically, we sequentially construct queried key tokens based on the unique ID and the category of each agent. By querying the MCT, we decode the latent features into initial states. At each iteration, we query and decode information for a single object and use it as the condition for the next query, repeatedly. This AR generation approach effectively manages the relationships between the dynamic objects and the context, and can be well controlled by users.

Different from the task of scene generation, scene extrapolation uses historical scenes as conditions, which can be either from log data or generated by the scene generation task. To accelerate the process, the MCT operates in a partial-AR mode using intra-frame NAR conversion. Through scene extrapolation, we are able to generate diverse, interactive, and closed-loop state predictions for multiple objects within the scene.

This realistic and efficient simulator, denoted as “Plan Engine” in Fig. 5, can be used both off-board as a policy evaluation environment and onboard for deployment, further enhancing existing policies to obtain a more powerful planning policy. This enhancement is achieved by overriding the next state of the ego vehicle in the “Sim Engine” with the action (i.e., $a$, or the “next ego state”) output by the existing driving policy, facilitating interaction between the policy and the world model. The resulting rollouts (i.e., $s$, or the “current states”) from this interaction are utilized by the reward module to compute relevant scores. After several rounds of interaction, the model's performance is assessed based on the accumulated score, which can be used for evaluating off-board models or selecting online proposals, as described in Sec. 2.1.3. This highly realistic and probabilistic simulator accurately simulates environmental changes, enabling more precise evaluation and selection of optimal policies.

Different from the “Plan Engine”, the “RL Engine” employs a trainable ML policy, which undergoes RL training via accumulated rewards. Unlike previous approaches that used IDM [41] or logsim [5], the simulator powered by GUMP offers a more efficient, realistic, diverse, and interactive environment. This reduces the simulation-to-reality domain gap and enhances the model's generalization capability.

Figure 5: GUMP serves as a central unit, bridging offline datasets with downstream applications. By learning from the collected offline data, we utilize a generative framework to produce a vast amount of affordable, interactive data, which benefits various downstream tasks such as scene generation, reactive simulation, planning, and online training.

3 Experiments

Our experiments are primarily conducted on two mainstream public datasets: the Waymo Open Dataset (WOD) [17, 6] and the nuPlan dataset [2]. Our experiments cover scene generation, trajectory prediction, WOD sim agents, and planning. We primarily use the Waymo dataset for the scene generation [19, 71, 73], trajectory prediction, and sim agents [50] tasks, and use the nuPlan dataset for prompt-conditioned scene generation and planning. More information about these datasets and our benchmarks can be found in Appendix 0.B, and the related experiment settings and training details in Appendix 0.C. Additional ablation studies, experimental results regarding trajectory prediction, and qualitative results are provided in Appendices 0.E, 0.G, and 0.F, respectively.

3.0.1 Scene Generation

As shown in Table 1, we compare our model with other methods on the WOD motion dataset. To ensure a fair comparison, we have closely followed the experimental setting of TrafficGen [19]. Our results outperform all other competitors across all metrics by a large margin, especially in terms of speed and size, where the error is reduced by 42.1% and 26.6% respectively compared to the previous state-of-the-art performance. This demonstrates our model's exceptional capability in scene generation, producing realistic data.

Method Position\downarrow Heading\downarrow Speed\downarrow Size\downarrow
SceneGen[73] 0.1362 0.1307 0.1772 0.1190
TrafficGen[19] 0.1192 0.1189 0.1602 0.0932
TrafficGen\text{TrafficGen}^{\star} 0.1221 0.1174 0.1661 0.0913
GUMP-m 0.1107(-9.3%) 0.099(-15.7%) 0.0961(-42.1%) 0.0670(-26.6%)
Table 1: Maximum mean discrepancy (MMD) results on WOD Motion Dataset: Lower MMD Indicate Better Performance Across All Metrics. \star: Results reproduced by us under the same experimental setting. GUMP-m refers to the medium variant.

3.0.2 World Simulator

We have further validated our model’s capability for scene extrapolation as a world simulator on the Waymo Sim Agents Benchmark. As shown in Table 2, our method significantly outperforms competitors in interactive, map-based metrics, and achieves the lowest overall minADE, marking it as state-of-the-art in terms of realism meta metric. The result strongly affirms the exceptional realism and interactivity of our method as a simulator. Moreover, it paves the way for providing a more realistic and interactive environment for various downstream applications.

Agent Policy Meta Metrics\uparrow minADE\downarrow Kinematic Metrics\uparrow Interactive Metrics\uparrow Map-based Metrics\uparrow
Logged Oracle 0.7220 0.000 0.4857 0.8415 0.9043
Constant Velocity 0.2870 7.923 0.0465 0.4087 0.4547
SBTA-ADIA [48] 0.4202 3.611 0.3574 0.4283 0.5087
CAD [11] 0.5314 2.315 0.3357 0.5638 0.7688
Joint-Multipath++[77] 0.5330 2.049 0.4078 0.5728 0.6677
Wayformer[51] 0.5750 2.498 0.3120 0.7048 0.7747
MTR+++[59] 0.6077 1.682 0.3597 0.7172 0.8151
MVTA [78] 0.6361 1.869 0.4175 0.7390 0.8139
Trajeglish [56] 0.6437 1.615 0.4157 0.7646 0.8104
MVTE[78] 0.6448 1.677 0.4202 0.7506 0.8271
GUMP-m 0.6432 1.590 0.3994 0.7657 0.8290
Table 2: Test set result of WOD Sim Agents Benchmark. Note: \star indicates the use of model ensemble techniques. Methods are ranked by composite metric on the V1 Leaderboard. Our method significantly outperforms others in interactive, map-based metrics, and achieves the lowest overall minADE, marking it as state-of-the-art in terms of realism meta metric. The highest score is highlighted in bold. Detailed results for each component metric and their descriptions are provided in the Appendix.

3.0.3 Interactive Planning

To validate the effectiveness of GUMP as a world simulator in the interactive planning task, we conducted both open-loop and closed-loop experiments on the large-scale planning dataset nuPlan. In the open-loop experiments, we treated planning as an ego-prediction task. By iteratively unrolling the predicted ego states, we can obtain possible future trajectories of the ego vehicle in an interactive environment. To eliminate randomness, we sampled $N$ rollouts in parallel and averaged these $N$ rollout trajectories. As shown in Table 3, compared to other methods, our method achieved the best results on most open-loop metrics, demonstrating the feasibility of imitating a planning policy by learning a world simulator.

Methods Score \uparrow 8sADE \downarrow 3sFDE \downarrow 5sFDE \downarrow 8sFDE \downarrow MR \downarrow
IDM [75] 37.7 9.600 6.256 10.076 16.993 0.552
PlanCNN [61] 64.0 2.468 0.955 2.486 5.936 0.064
Urban Driver [63] 76.0 2.667 1.497 2.815 5.453 0.064
PDM-Open [14] 85.8 2.375 0.715 2.06 5.296 0.042
CKS-124m [70] 88.0 1.777 0.951 2.105 4.515 0.053
CKS-1.5b [70] 86.6 1.783 0.971 2.140 4.460 0.047
GUMP-m 88.6 1.820 0.743 1.833 4.453 0.046
Table 3: Performance comparison of open-loop planning on the validation 14 set of the nuPlan dataset. Our model outperforms all the other models across the majority of metrics.

We further integrated the imitation policy with a set of rule-based policies, enhancing it with an interactive scoring module powered by GUMP, and validated its effectiveness in a closed-loop reactive environment. We compare it with previous state-of-the-art results on the test14-hard splits [61], as shown in Table 4. Additionally, we have benchmarked our planner and other competitors in two different reactive environments: IDM and GUMP. The experimental results demonstrate that our policy outperformed all the others in both reactive environments, showing improvements in progress, drivable area compliance, and time to collision, among other metrics, when compared to the previous state-of-the-art, PDM [14]. The reliance of PDM on a constant velocity assumption results in overly conservative decision-making. In contrast, our adoption of a stochastic modeling approach allows for better handling of future uncertainties, thereby achieving a better balance between progress and safety (TTC).

Policy  Reactive env  Driving Score  Col.  Driv. Comp.  Dir. Comp.  Making Progress  TTC  Speed Limit  Progress  Comfort
Expert [10] IDM 68.80 - - - - - - - -
UrbanDriver [63] IDM 49.07 - - - - - - - -
GC-PGP [26] IDM 39.63 - - - - - - - -
GameFormer [36] IDM 68.83 - - - - - - - -
planTF [10] IDM 61.70 - - - - - - - -
hoplan [32] IDM 75.06 89.33 94.85 96.13 97.05 80.51 95.28 85.02 98.52
PDM [14] IDM 75.18 95.22 95.58 99.08 93.38 84.19 99.53 75.47 83.45
GUMP hybrid IDM 77.77 94.36 98.98 98.95 94.41 87.46 97.51 77.08 79.84
hoplan [32] GUMP 69.56 93.63 96.25 97.37 86.14 88.01 94.17 76.74 98.87
PDM [14] GUMP 72.26 97.00 95.50 99.25 88.38 86.89 99.51 70.43 85.39
GUMP hybrid GUMP 73.60 94.98 98.97 99.31 88.76 89.82 97.18 72.76 82.00
Table 4: Results on the Hard-14 Test Set: Higher Scores Indicate Better Performance Across All Metrics. Our driving score in both IDM and GUMP evaluation environments are superior to other competitors, and we achieve a better balance between progress and safety (TTC). For detailed metrics definition, please refer to Appendix 0.B.2.3. In this context, "Col." refers to collision metric, "Driv. Comp." refers to drivable area compliance metric, and "Dir. Comp." refers to driving direction compliance, and "TTC" stands for time to collision.

3.0.4 Online Training

We further demonstrate the effectiveness of using GUMP as an online training environment, as detailed in Table 5. For comparison, we employed Imitation Learning (IL) and the Soft Actor-Critic (SAC) algorithm for online training, with all experiments utilizing a consistent ResNet18 as the policy model. We selected both logsim [5] and GUMP as our environments for online training and testing, respectively. Comparing experiment ID 1 with ID 2, and ID 3 with ID 4, we can conclude that using GUMP as the training environment significantly enhances the RL model's performance across different testing environments, substantially reducing the collision rate, improving progress, and outperforming the IL baseline. Moreover, policies trained in the GUMP environment achieved better results in the realistic, reactive GUMP testing environment compared to those trained in the logsim environment. These improvements are attributed to GUMP's capability to generate a vast amount of realistic, interactive, and diverse training scenarios, greatly enhancing the efficiency of RL training and reducing the sim-to-real domain gap. These findings highlight the effectiveness of GUMP as an environment for online training.

ID  Policy  Training env  Testing env  Driving Score ↑  Col. Rate ↓  Out of Driv. Rate ↓  Progress (meters) ↑  Comfort Score ↑
0 IL log data logsim 0.595 0.248 0.014 44.69 0.892
1 SAC logsim logsim 0.587 0.259 0.062 49.77 0.923
2 SAC GUMP logsim 0.643 0.193 0.077 56.78 0.845
3 SAC logsim GUMP 0.569 0.246 0.061 44.46 0.920
4 SAC GUMP GUMP 0.626 0.215 0.061 53.58 0.862
Table 5: In an ablation study on the nuPlan random 1000 splits validation set, we observed significant improvements when using GUMP for online RL training across different environments, compared to non-reactive logsim environments. This enhancement is evident in both logsim and GUMP testing environments. Please note that the metric definition utilized here is more strict than the original nuPlan metrics. For detailed metric definitions, see the Appendix 0.B.2.4.

3.0.5 Scaling Laws

As we adopt the same paradigm as large language models (LLMs), which are transformer-based and aim at predicting the next token as the training objective, we have reason to believe that our model also possesses strong scalability. To validate this, we trained three variants of our model with different MCT backbone capacities, i.e., GUMP-small, GUMP-base, and GUMP-medium, and measured the cross-entropy loss versus the number of tokens consumed during training, as well as the simulation realism meta metric on the WOD Sim Agents validation set, as illustrated in Fig. 6. We observe that our model demonstrates high scalability as both training FLOPs and model capacity increase. Furthermore, an increase in model capacity leads to a significant reduction in training loss and a notable enhancement in the validation meta metric, especially when contrasting the GUMP-medium with the GUMP-base model. This phenomenon suggests that our model still harbors considerable potential for improvement, as indicated by the scaling law [37, 31].

Figure 6: Left: Training loss versus number of training tokens consumed, with axes shown on a logarithmic scale. Right: WOD Sim Agents meta metric, where a larger marker area indicates a larger number of model parameters.

4 Conclusion and Future Work

In this work, we have proposed GUMP, a unified generative model designed to solve a broad spectrum of motion planning tasks in the domain of autonomous driving. Our model has demonstrated exceptional scalability and effectiveness across various downstream applications, showcasing its potential to serve as a foundation model in this field.

Limitations and Future Directions: Our model, while effective, has potential for further refinement. Firstly, there is room to enhance the model’s efficiency through engineering efforts such as model quantization and key-value caching. Additionally, adopting vectorized map inputs instead of raster map inputs may potentially yield more accurate map information. Furthermore, our approach primarily focuses on the structured dynamics of driving scenarios. Expanding our model to integrate sensor data directly would be a logical next step, enabling truly end-to-end operation and potentially enhancing its applicability and performance in real-world settings.

References

  • [1] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)
  • [2] Caesar, H., Kabzan, J., Tan, K., et al.: Nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. In: CVPR ADP3 workshop (2021)
  • [3] Chai, Y., Sapp, B., Bansal, M., Anguelov, D.: Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. In: Conference on Robot Learning. pp. 86–99. PMLR (2020)
  • [4] Chang, H., Zhang, H., Jiang, L., Liu, C., Freeman, W.T.: Maskgit: Masked generative image transformer. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2022)
  • [5] Chen, D., Koltun, V., Krähenbühl, P.: Learning to drive from a world on rails. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15590–15599 (2021)
  • [6] Chen, K., Ge, R., Qiu, H., Ai-Rfou, R., Qi, C.R., Zhou, X., Yang, Z., Ettinger, S., Sun, P., Leng, Z., Mustafa, M., Bogun, I., Wang, W., Tan, M., Anguelov, D.: Womd-lidar: Raw sensor dataset benchmark for motion forecasting. arXiv preprint arXiv:2304.03834 (April 2023)
  • [7] Chen, Y., Karkus, P., Ivanovic, B., Weng, X., Pavone, M.: Tree-structured policy planning with learned behavior models (2023)
  • [8] Chen, Y., Rosolia, U., Ubellacker, W., Csomay-Shanklin, N., Ames, A.D.: Interactive multi-modal motion planning with branch model predictive control. IEEE Robotics and Automation Letters 7(2), 5365–5372 (2022)
  • [9] Chen, Z., Ye, M., Xu, S., Cao, T., Chen, Q.: Deepemplanner: An end-to-end em motion planner with iterative interactions. arXiv e-prints pp. arXiv–2311 (2023)
  • [10] Cheng, J., Chen, Y., Mei, X., Yang, B., Li, B., Liu, M.: Rethinking imitation-based planners for autonomous driving (2023)
  • [11] Chiu, H., Smith, S.F.: Collision avoidance detour: A solution for 2023 waymo open dataset challenge - sim agents. Tech. rep., Carnegie Mellon University (2023)
  • [12] Dao, T.: FlashAttention-2: Faster attention with better parallelism and work partitioning (2023)
  • [13] Dao, T., Fu, D.Y., Ermon, S., Rudra, A., Ré, C.: FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In: Advances in Neural Information Processing Systems (2022)
  • [14] Dauner, D., Hallgarten, M., Geiger, A., Chitta, K.: Parting with misconceptions about learning-based vehicle motion planning. In: Conference on Robot Learning (CoRL) (2023)
  • [15] Devaranjan, J., Kar, A., Fidler, S.: Meta-sim2: Unsupervised learning of scene structure for synthetic data generation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16. pp. 715–733. Springer (2020)
  • [16] Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: Carla: An open urban driving simulator. In: Conference on robot learning. pp. 1–16. PMLR (2017)
  • [17] Ettinger, S., Cheng, S., Caine, B., Liu, C., Zhao, H., Pradhan, S., Chai, Y., Sapp, B., Qi, C.R., Zhou, Y., Yang, Z., Chouard, A., Sun, P., Ngiam, J., Vasudevan, V., McCauley, A., Shlens, J., Anguelov, D.: Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9710–9719 (October 2021)
  • [18] Fan, A., Lewis, M., Dauphin, Y.: Hierarchical neural story generation. arXiv preprint arXiv:1805.04833 (2018)
  • [19] Feng, L., Li, Q., Peng, Z., Tan, S., Zhou, B.: Trafficgen: Learning to generate diverse and realistic traffic scenarios. In: 2023 IEEE International Conference on Robotics and Automation (ICRA). pp. 3567–3575. IEEE (2023)
  • [20] Gao, Z., Mu, Y., Shen, R., Chen, C., Ren, Y., Chen, J., Li, S.E., Luo, P., Lu, Y.: Enhance sample efficiency and robustness of end-to-end urban autonomous driving via semantic masked world model. arXiv preprint arXiv:2210.04017 (2022)
  • [21] Gu, J., Sun, C., Zhao, H.: Densetnt: End-to-end trajectory prediction from dense goal sets. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15303–15312 (2021)
  • [22] Ha, D., Schmidhuber, J.: Recurrent world models facilitate policy evolution. In: Advances in Neural Information Processing Systems, pp. 2451–2463. Curran Associates, Inc. (2018), https://papers.nips.cc/paper/7512-recurrent-world-models-facilitate-policy-evolution, https://worldmodels.github.io
  • [23] Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290 (2018)
  • [24] Hafner, D., Lillicrap, T.P., Norouzi, M., Ba, J.: Mastering atari with discrete world models. In: International Conference on Learning Representations (2020)
  • [25] Hafner, D., Pasukonis, J., Ba, J., Lillicrap, T.: Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104 (2023)
  • [26] Hallgarten, M., Stoll, M., Zell, A.: From prediction to planning with goal conditioned lane graph traversals. arXiv preprint arXiv:2302.07753 (2023)
  • [27] Hawke, J., Badrinarayanan, V., Kendall, A., et al.: Reimagining an autonomous vehicle. arXiv preprint arXiv:2108.05805 (2021)
  • [28] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [29] Hinton, G.E., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. ArXiv abs/1503.02531 (2015), https://api.semanticscholar.org/CorpusID:7200347
  • [30] Hu, A., Corrado, G., Griffiths, N., Murez, Z., Gurau, C., Yeo, H., Kendall, A., Cipolla, R., Shotton, J.: Model-based imitation learning for urban driving. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)
  • [31] Hu, A., Russell, L., Yeo, H., Murez, Z., Fedoseev, G., Kendall, A., Shotton, J., Corrado, G.: Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080 (2023)
  • [32] Hu, Y., Li, K., Liang, P., Qian, J., Yang, Z., Zhang, H., Shao, W., Ding, Z., Xu, W., Liu, Q.: Imitation with spatial-temporal heatmap: 2nd place solution for nuplan challenge. arXiv preprint arXiv:2306.15700 (2023)
  • [33] Hu, Y., Shao, W., Jiang, B., Chen, J., Chai, S., Yang, Z., Qian, J., Zhou, H., Liu, Q.: Hope: Hierarchical spatial-temporal network for occupancy flow prediction. arXiv preprint arXiv:2206.10118 (2022)
  • [34] Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., Lu, L., Jia, X., Liu, Q., Dai, J., Qiao, Y., Li, H.: Planning-oriented autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)
  • [35] Huang, Z., Karkus, P., Ivanovic, B., Chen, Y., Pavone, M., Lv, C.: Dtpp: Differentiable joint conditional prediction and cost evaluation for tree policy planning in autonomous driving. arXiv preprint arXiv:2310.05885 (2023)
  • [36] Huang, Z., Liu, H., Lv, C.: Gameformer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 3903–3913 (October 2023)
  • [37] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D.: Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)
  • [38] Kar, A., Prakash, A., Liu, M.Y., Cameracci, E., Yuan, J., Rusiniak, M., Acuna, D., Torralba, A., Fidler, S.: Meta-sim: Learning to generate synthetic datasets. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4551–4560 (2019)
  • [39] Karpathy, A.: nanogpt. https://github.com/karpathy/nanoGPT (2023)
  • [40] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [41] Li, Q., Peng, Z., Feng, L., Zhang, Q., Xue, Z., Zhou, B.: Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)
  • [42] Li, X., Zhang, Y., Ye, X.: Drivingdiffusion: Layout-guided multi-view driving scene video generation with latent diffusion model. arXiv preprint arXiv:2310.07771 (2023)
  • [43] Liao, B., Chen, S., Wang, X., Cheng, T., Zhang, Q., Liu, W., Huang, C.: Maptr: Structured modeling and learning for online vectorized hd map construction. In: International Conference on Learning Representations (2023)
  • [44] Liao, B., Chen, S., Zhang, Y., Jiang, B., Zhang, Q., Liu, W., Huang, C., Wang, X.: Maptrv2: An end-to-end framework for online vectorized hd map construction. arXiv preprint arXiv:2308.05736 (2023)
  • [45] Liu, Y., Yuantian, Y., Wang, Y., Wang, Y., Zhao, H.: Vectormapnet: End-to-end vectorized hd map learning. In: International conference on machine learning. PMLR (2023)
  • [46] Lopez, P.A., Behrisch, M., Bieker-Walz, L., Erdmann, J., Flötteröd, Y.P., Hilbrich, R., Lücken, L., Rummel, J., Wagner, P., Wießner, E.: Microscopic traffic simulation using sumo. In: 2018 21st international conference on intelligent transportation systems (ITSC). pp. 2575–2582. IEEE (2018)
  • [47] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2018)
  • [48] Mo, X., Liu, H., Huang, Z., Lv, C.: Simulating behaviors of traffic agents for autonomous driving via interactive autoregression. Tech. rep., Nanyang Technological University, (2023)
  • [49] Mobileye: Mobileye under the hood. https://www.mobileye.com/ces-2024/ (2024)
  • [50] Montali, N., Lambert, J., Mougin, P., Kuefler, A., Rhinehart, N., Li, M., Gulino, C., Emrich, T., Yang, Z., Whiteson, S., et al.: The waymo open sim agents challenge. arXiv preprint arXiv:2305.12032 (2023)
  • [51] Nayakanti, N., Al-Rfou, R., Zhou, A., Goel, K., Refaat, K.S., Sapp, B.: Wayformer: Motion forecasting via simple & efficient attention networks. In: ICRA (2023)
  • [52] Ngiam, J., Caine, B., Vasudevan, V., Zhang, Z., Chiang, H.T.L., Ling, J., Roelofs, R., Bewley, A., Liu, C., Venugopal, A., et al.: Scene transformer: A unified architecture for predicting multiple agent trajectories. ICLR (2021)
  • [53] Niu, H., Sharma, S., Qiu, Y., Li, M., Zhou, G., HU, J., Zhan, X.: When to trust your simulator: Dynamics-aware hybrid offline-and-online reinforcement learning. In: Oh, A.H., Agarwal, A., Belgrave, D., Cho, K. (eds.) Advances in Neural Information Processing Systems (2022)
  • [54] OpenStreetMap contributors: Planet dump retrieved from https://planet.osm.org . https://www.openstreetmap.org (2017)
  • [55] Pan, M., Zhu, X., Wang, Y., Yang, X.: Iso-dream: Isolating and leveraging noncontrollable visual dynamics in world models. Advances in neural information processing systems 35, 23178–23191 (2022)
  • [56] Philion, J., Peng, X.B., Fidler, S.: Trajeglish: Learning the language of driving scenarios. arXiv preprint arXiv:2312.04535 (2023)
  • [57] Prakash, A., Boochoon, S., Brophy, M., Acuna, D., Cameracci, E., State, G., Shapira, O., Birchfield, S.: Structured domain randomization: Bridging the reality gap by context-aware synthetic data. In: 2019 International Conference on Robotics and Automation (ICRA). pp. 7249–7255. IEEE (2019)
  • [58] Pronovost, E., Ganesina, M.R., Hendy, N., Wang, Z., Morales, A., Wang, K., Roy, N.: Scenario diffusion: Controllable driving scenario generation with diffusion. Advances in Neural Information Processing Systems 36 (2024)
  • [59] Qian, C., Xiu, D., Tian, M.: A simple yet effective method for simulating realistic multi-agent behaviors. Tech. rep. (2023)
  • [60] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019)
  • [61] Renz, K., Chitta, K., Mercea, O.B., Koepke, A., Akata, Z., Geiger, A.: Plant: Explainable planning transformers via object-level representations. In: 6th Annual Conference on Robot Learning. pp. 459–470. MLResearchPress (2022)
  • [62] Santana, E., Hotz, G.: Learning a driving simulator. arXiv preprint arXiv:1608.01230 (2016)
  • [63] Scheel, O., Bergamini, L., Wolczyk, M., Osiński, B., Ondruska, P.: Urban driver: Learning to drive from real-world demonstrations using policy gradients. In: Conference on Robot Learning. pp. 718–728. PMLR (2022)
  • [64] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
  • [65] Seff, A., Cera, B., Chen, D., Ng, M., Zhou, A., Nayakanti, N., Refaat, K.S., Al-Rfou, R., Sapp, B.: Motionlm: Multi-agent motion forecasting as language modeling. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8579–8590 (2023)
  • [66] Shao, H., Wang, L., Chen, R., Waslander, S.L., Li, H., Liu, Y.: Reasonnet: End-to-end driving with temporal and global reasoning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13723–13733 (2023)
  • [67] Shi, S., Jiang, L., Dai, D., Schiele, B.: Motion transformer with global intention localization and local movement refinement. Advances in Neural Information Processing Systems (2022)
  • [68] Shi, S., Jiang, L., Dai, D., Schiele, B.: Mtr++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
  • [69] Sima, C., Renz, K., Chitta, K., Chen, L., Zhang, H., Xie, C., Luo, P., Geiger, A., Li, H.: Drivelm: Driving with graph visual question answering. arXiv preprint arXiv:2312.14150 (2023)
  • [70] Sun, Q., Zhang, S., Ma, D., Shi, J., Li, D., Luo, S., Wang, Y., Xu, N., Cao, G., Zhao, H.: Large trajectory models are scalable motion predictors and planners. arXiv preprint arXiv:2310.19620 (2023)
  • [71] Suo, S., Regalado, S., Casas, S., Urtasun, R.: Trafficsim: Learning to simulate realistic multi-agent behaviors. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10395–10404. IEEE Computer Society (2021)
  • [72] Sutton, R.S., Barto, A.G.: Reinforcement learning: An introduction. MIT press (2018)
  • [73] Tan, S., Wong, K., Wang, S., Manivasagam, S., Ren, M., Urtasun, R.: Scenegen: Learning to generate realistic traffic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 892–901 (2021)
  • [74] Tesla: Tesla AI Day. https://www.youtube.com/watch?v=ODSJsviD_SU (2022)
  • [75] Treiber, M., Hennecke, A., Helbing, D.: Congested traffic states in empirical observations and microscopic simulations. Physical review E 62(2),  1805 (2000)
  • [76] Varadarajan, B., Hefny, A., Srivastava, A., Refaat, K.S., Nayakanti, N., Cornman, A., Chen, K., Douillard, B., Lam, C.P., Anguelov, D., et al.: Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction. In: 2022 International Conference on Robotics and Automation (ICRA). pp. 7814–7821. IEEE (2022)
  • [77] Wang, W., Zhen, H.: Joint-multipath++ for simulation agents. Tech. rep. (2023)
  • [78] Wang, Y., Zhao, T., Yi, F.: Multiverse transformer: 1st place solution for waymo open sim agents challenge 2023. arXiv preprint arXiv:2306.11868 (2023)
  • [79] Xpeng: Xpeng CVPR2023 Workshops. https://www.youtube.com/watch?v=d6ucRgDDUWQ&list=PL3N9otbGBVLc3jdm6yrPtCWdE7C8AsNAy&index=6 (2023)
  • [80] Xu, W., Yu, H., Zhang, H., Hong, Y., Yang, B., Zhao, L., Bai, J., ALF-contributors: ALF: Agent Learning Framework (2021), https://github.com/HorizonRobotics/alf
  • [81] Yan, W., Zhang, Y., Abbeel, P., Srinivas, A.: Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157 (2021)
  • [82] Yang, Q.I., Koutsopoulos, H.N.: A microscopic traffic simulator for evaluation of dynamic traffic management systems. Transportation Research Part C: Emerging Technologies 4(3), 113–129 (1996)
  • [83] Zhang, L., Xiong, Y., Yang, Z., Casas, S., Hu, R., Urtasun, R.: Learning unsupervised world models for autonomous driving via discrete diffusion. arXiv preprint arXiv:2311.01017 (2023)
  • [84] Zhao, T.Z., Kumar, V., Levine, S., Finn, C.: Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705 (2023)

Appendix


Appendix 0.A Related Works

0.A.1 Scene Generation

As a supplement to real data, generated scenarios establish the initial conditions for reactive simulations. Previous efforts have involved the creation of agents guided by heuristic rules [82, 16, 46, 57] or fixed grammars [38, 15]. However, the manual generation of such scenarios requires extensive human labor and struggles to achieve the necessary realism, diversity, and generalization for downstream tasks, especially in new environments. Recent work has started to adopt an end-to-end fully learned approach, either through an autoregressive architecture that places agents one by one [73, 19] or a diffusion-based architecture [58]. Yet, no work has exploited a more scalable or unified transformer-based language model for controllable scene generation. Moreover, our method allows for a coarse control over the entire scene through a global description prompt and a fine control by directly modifying each agent’s states, e.g. enforcing traffic rule constraints. These features enable our model to strike a better balance between diversity and controllability throughout the generation process.

0.A.2 Reactive Simulation

Traditional simulators are built on computer graphics and human prior knowledge, including physical laws, lighting conditions, and hand-crafted traffic dynamics [16, 46, 41, 2]. However, relying on a vast collection of manually created scene assets and heuristic driving policies for intelligent agents not only introduces a significant domain gap but also results in a lack of diverse scenarios and realism.

Recently, there has been a growing emphasis on building world simulators as generative models through a data-driven approach [62, 81, 24, 31, 22, 42, 83]. Compared to those end-to-end models, structured input provides a simpler and more efficient setting, facilitating a variety of downstream applications and onboard deployments. Predictive models, based on the unfolding of agents' states over time, fall into open-loop and closed-loop categories. In an open-loop setting, each agent makes decisions, either marginally [76, 67, 21, 34] or jointly [52, 68, 33], based solely on historical information and generates predictions of its trajectory over a short future period. In contrast, closed-loop simulations incorporate feedback mechanisms, where the decisions of each agent are influenced by the current state of the system, including the actions of other agents [71, 65]. This leads to a more dynamic, interactive, and realistic simulation environment, where agents continuously update their decisions based on the evolving scenario. Notably, WOSAC [50] established the first evaluation benchmark for closed-loop simulators, attracting numerous participants [78, 56, 59, 51]. As a closed-loop simulator, our work is most related to MotionLM [65] and Trajeglish [56], as both employ a GPT-like autoregressive predictive model. However, in contrast to these works, which tokenize the action space, we employ a "key-value pair" tokenization strategy and directly quantize the state space. This enables Non-Autoregressive (NAR) transitions within frames, which significantly speeds up inference and endows the model with generative capabilities. Moreover, it offers the flexibility to handle the disappearance and emergence of agents. These capabilities foster a wider range of downstream applications that require higher scalability, efficiency, and flexibility.

0.A.3 Interactive Planning

The ability to effectively model the interactions between road users and autonomous vehicles is crucial for enhancing the safety and comfort of self-driving technology. An optimal policy planning algorithm should enable multi-stage reasoning, incorporating bidirectional interactions between the agent and its environment. Previous research has tackled this challenge through two main approaches: some studies employ neural networks to implicitly and iteratively capture the interactions between the ego vehicle and other road users [36, 67, 66, 34, 9], while others take a more explicit route, utilizing model predictive control (MPC) in conjunction with tree search expansion to navigate complex interactions [74, 8, 7, 35]. Among these approaches, certain studies [8, 7, 35] simplify the modeling of bidirectional interaction by combining a simple non-reactive rule-based planner with a model-based, ego-conditioned predictor. In contrast, PDM [14] simplifies interaction by assuming a non-reactive environment with constant-velocity agents and a reactive Intelligent Driver Model (IDM)-based [75] ego-planner. Our work advances beyond these methodologies by coupling our realistic ego-conditioned simulator with a reactive planner, adaptable to any rule-based or neural network-driven ego-planner. This approach allows for a more accurate modeling of interactions and of the stochastic nature of future scenarios.

0.A.4 Online Learning with RL

The reinforcement learning (RL) domain has witnessed considerable progress, fostering algorithms adept at solving various types of tasks via model-free [23, 64] and model-based approaches [24, 25]. This progress has contributed to the paradigm shift in autonomous driving from open-loop, trajectory-prediction-based approaches to closed-loop, interaction-based approaches. One critical component for enabling closed-loop training is a reactive environment, which generates the state of the next time step given the current state and the action taken [5, 30, 55, 20]. Nevertheless, the simplicity of these models often results in environments that fall short of realism and interactivity, leading to a pronounced simulation-to-reality gap in complex scenarios. Addressing this, some research focuses on the creation of more realistic and intelligent training environments [19, 41]. Different from previous approaches, we introduce a scalable, data-driven model capable of imitating interactive agent behaviors and generating realistic, diverse driving scenarios. Its efficiency aligns well with the demands of online RL training. Such an environment could bridge the simulation-reality gap and push RL closer to practical application in autonomous driving.

Appendix 0.B Dataset and Metrics

0.B.1 Dataset

0.B.1.1 The WOD Motion Dataset

We utilized the Waymo Open Motion Dataset (WOMD) to train and evaluate our model on the scene generation, reactive simulation, and motion prediction tasks. WOMD contains 7.64 million unique tracks from 574 driving hours across 1750 km of urban roadways collected in six cities in the United States. Specifically, the WOMD v1.2.0 release includes 486,995 training, 44,097 validation, and 44,920 test scenarios. Each scenario consists of a 9.1-second driving log at 10 Hz, partitioned into 11 history frames and 80 future frames. All reported variants of our model are trained on the full training dataset. To further augment the data volume, we treat different objects as the center object, ultimately yielding approximately 2.6 million training scenarios.

0.B.1.2 The nuPlan Dataset

We utilized the nuPlan Dataset [2] for scenario prompt conditioned scene generation, interactive planning and online training experiments. The dataset encompasses 1,500 hours of diverse driving scenarios from urban environments such as Las Vegas, Boston, Pittsburgh, and Singapore. These scenarios are annotated with various traffic situations including merges, lane changes, interactions with cyclists and pedestrians, among others. For training, a random sampling strategy is adopted, with 50,000 scenarios selected for each scenario type from the training subset, culminating in approximately 1.5 million scenarios.

0.B.2 Metrics

0.B.2.1 Scene Generation Metrics

To measure the quality of generated scenarios, we treat the generated agents as samples from one distribution and the ground-truth agents as samples from another. Similar to Trafficgen [19], we use the maximum mean discrepancy (MMD) to measure the difference between the generated distribution and the corresponding ground-truth distribution. Given two distributions $p$ and $q$ and a Gaussian kernel $k$, the maximum mean discrepancy is defined as

\mathrm{MMD}^{2}(p,q)=\mathbb{E}_{x,x^{\prime}\sim p}[k(x,x^{\prime})]+\mathbb{E}_{y,y^{\prime}\sim q}[k(y,y^{\prime})]-2\,\mathbb{E}_{x\sim p,\,y\sim q}[k(x,y)] \qquad (9)

We calculate this metric for generated attributes including agent center positions in $\mathbb{R}^{2}$, agent heading angle in $\mathbb{R}$, agent velocities in $\mathbb{R}^{2}$, and agent dimensions in $\mathbb{R}^{2}$.

For each pair of generated and ground-truth scenarios, we calculate the MMD for each attribute independently and average the resulting values. The final MMD score is then averaged across all selected test scenarios.
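The following minimal sketch illustrates how the per-attribute MMD of Eq. (9) can be estimated from samples with a Gaussian kernel; the kernel bandwidth `sigma` and the helper names are illustrative assumptions rather than the exact implementation used in our evaluation.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel between sample sets a (N, D) and b (M, D).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd_squared(x, y, sigma=1.0):
    # Biased sample estimate of MMD^2(p, q) for x ~ p and y ~ q, following Eq. (9).
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())

def scene_mmd(generated, ground_truth, sigma=1.0):
    # Average the per-attribute MMD^2 (positions, heading, velocities, dimensions).
    return float(np.mean([mmd_squared(g, t, sigma)
                          for g, t in zip(generated, ground_truth)]))
```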

0.B.2.2 The WOD Sim Agents Metrics

We evaluate our model's simulation capability on the WOMD Sim Agents Benchmark [50]. Each test scenario consists of a 9.1-second driving log at 10 Hz, partitioned into 11 history frames and 80 future frames. Trajectories of up to 128 agents (including the ego vehicle) are tracked in each scenario. The task is to simulate 32 parallel future rollouts for each agent and scenario in an autoregressive manner. Given an agent in a scenario, the predicted distribution of its future behavior is formed over the 32 rollouts. To parameterize the predicted distribution, a total of 3 categories and 9 component metrics are computed:

  • Kinematic-based:

    • Linear Speed: unsigned magnitude of the agent's speed along the x, y, z axes.

    • Linear Acceleration Magnitude: time derivative of the linear speed along the x, y, z axes.

    • Angular Speed: signed minimum difference between consecutive heading angles.

    • Angular Acceleration Magnitude: time derivative of the angular speed.

  • Interaction-based:

    • Distance to nearest object: for each agent, represented by a box polygon, the signed distance to the nearest object in the scenario.

    • Collisions: count of collisions with other objects, recognized when the distance to the nearest object is negative.

    • Time-to-collision: time before an agent collides with the agent it is following, assuming constant speed.

  • Map-based:

    • Distance to road edge: signed distance to the nearest road edge, where a road edge is represented as a 2D vector.

    • Road departures: indicator of an agent going off road at any time.

A component score is calculated by evaluating the negative log-likelihood (NLL) of the ground-truth agent behavior under the corresponding distribution, masked by the validity indicator $v_{t}$: $m=\exp\big(-\frac{1}{\sum_{t}\mathbbm{1}\{v_{t}\}}\sum_{t}\mathbbm{1}\{v_{t}\}\,\mathrm{NLL}_{t}\big)$. Each component metric is assigned a weight, and each categorical metric is computed as the weighted average of its component metrics: $\mathcal{M}_{c}=\frac{\sum_{i}w_{i}m_{i}}{\sum_{j}w_{j}}$. We report all categorical metrics on the test dataset in Tab. 13. The final composite metric is calculated as a weighted average over all component metrics. Our reported model performance is based on the V1 leaderboard, which improved the collision and offroad calculation over the previous V0 leaderboard.
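As a hedged illustration of the scoring rule above (not the official evaluation code), the validity-masked component score and the weighted categorical average can be computed as follows:

```python
import numpy as np

def component_score(nll_per_step, valid_mask):
    # m = exp(-mean NLL over valid timesteps), cf. the formula above.
    nll = np.asarray(nll_per_step, dtype=float)
    valid = np.asarray(valid_mask, dtype=bool)
    return float(np.exp(-nll[valid].mean()))

def categorical_metric(component_scores, weights):
    # Weighted average of the component metrics belonging to one category.
    m = np.asarray(component_scores, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float((w * m).sum() / w.sum())
```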

0.B.2.3 The nuPlan Planning Metrics

Our evaluation concentrates on the model's planning ability in the closed-loop with reactive agents (CL-R) task. In this setup, the planner generates a trajectory at each timestep, which is used as a reference by the controller to incrementally adjust the vehicle's state. In CL-R tasks, we apply two distinct world models for the surrounding agents: a rule-based IDM environment and a model-based environment powered by GUMP, both of which control all non-ego agents dynamically.

The planning score, as defined by equation (5), is reported on the test dataset. We exactly follow the nuPlan benchmark metrics [2], which reflect the model's performance in terms of safety, efficiency, and comfort in CL-R tasks. The score is aggregated over selected metrics for the trajectory driven by the planner. The aggregation function is a hybrid hierarchical-weighted average of individual metric scores.

The first part $\Theta$ of equation (5) assigns a zero score to a driven scenario if any of the following situations occurs:

  • There is an at_fault collision with a vehicle or a VRU (pedestrian or bicyclist);

  • There are multiple at_fault collisions with objects (e.g. a cone);

  • There is a drivable_area violation;

  • Ego drives into oncoming traffic for more than 6 m;

  • Ego is not making enough progress.

The second part $\Phi$ of equation (5) is a weighted average of the scores of the other selected metrics. These metrics and their corresponding weights are listed below.

Metric Name $\omega_{n}$
driving_direction_compliance 5
time_to_collision_within_bound 5
speed_limit_compliance 4
ego_progress_along_expert_route 5
ego_is_comfortable 2
Table 6: Selected metrics for the second part $\Phi$ of equation (5).
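For illustration, the hierarchical-weighted aggregation described above can be sketched as below; we assume the standard nuPlan convention of combining the multiplier part $\Theta$ and the weighted part $\Phi$ multiplicatively, and the dictionary keys simply mirror Table 6.

```python
PHI_WEIGHTS = {
    "driving_direction_compliance": 5,
    "time_to_collision_within_bound": 5,
    "speed_limit_compliance": 4,
    "ego_progress_along_expert_route": 5,
    "ego_is_comfortable": 2,
}

def planning_score(critical_violations, metric_scores):
    # Theta: zero out the scenario score on any critical violation
    # (at-fault collision, drivable-area violation, >6 m against traffic, ...).
    theta = 0.0 if any(critical_violations.values()) else 1.0
    # Phi: weighted average of the Table 6 metric scores (each in [0, 1]).
    phi = (sum(w * metric_scores[name] for name, w in PHI_WEIGHTS.items())
           / sum(PHI_WEIGHTS.values()))
    return theta * phi
```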

The following component metrics contribute to the overall planning score:

  • No at-fault collisions A collision is defined as the event of ego's bounding box intersecting another agent's bounding box. When computing the collision score for a scenario, we only consider collisions that could have been prevented had the planner performed properly. For simplicity, we call these collisions at-fault.

  • Drivable area compliance Ego should drive within the mapped drivable area at all times. The drivable area compliance metric identifies the frames in which ego drives outside the drivable area. Due to over-approximation of ego's bounding box, we allow a small infringement outside the drivable area (max_violation_threshold = 0.3 m).

  • Driving direction compliance This metric penalizes ego when it drives into oncoming traffic. It computes the movement of ego's center during a 1-second time_horizon along the driving direction defined by the baselines of ego's lanes or lane connectors. The score is 1 if ego does not move against the flow by more than driving_direction_compliance_threshold (= 2 m), 0 if it moves against the flow by more than driving_direction_violation_threshold (= 6 m), and 0.5 otherwise.

  • Ego progress along the expert's route ratio This metric evaluates the progress of the driven ego trajectory in a scenario by comparing it with the expert's progress along its route in that scenario.

  • Making progress This boolean metric is based on the "Ego progress along the expert's route ratio". Its score is 1 if the ratio exceeds the selected threshold (min_progress_threshold = 0.2), and 0 otherwise.

  • Time to Collision (TTC) within bound TTC is defined as the time required for ego and another track to collide if they continue at their present speed and heading.

  • Speed limit compliance This metric evaluates if ego’s speed exceeds the associated speed limit in the map. Speed limit violation at each frame is defined based on the difference between ego’s speed and the speed limit, if ego’s speed is higher than the speed limit (over-speeding).

  • Comfort We measure the comfort of ego’s driven trajectory by evaluating minimum and maximum longitudinal accelerations, maximum absolute value of lateral acceleration, maximum absolute value of yaw rate, maximum absolute value of yaw acceleration, maximum absolute value of longitudinal component of jerk, and maximum magnitude of jerk vector. These variables are compared to thresholds with default values determined empirically from examination of a dataset of expert trajectories.

0.B.2.4 RL metrics

We define safe episodes as those without any critical failures, such as collisions and violations of the drivable area. In the context of RL training, we utilize four specific types of metrics that differ slightly from the nuPlan Planning Metrics mentioned previously in 0.B.2.3. These metrics include:

  • Collision Rate We do not differentiate between at-fault and non-at-fault collisions, setting a higher standard for the planner to avoid all collisions, whether by not colliding with others or by not being collided into. The collision rate is calculated as the percentage of episodes that experience at least one collision.

  • Out of Drivable Area Rate Diverging from the nuPlan metric, our approach disallows any encroachments outside the drivable area, demanding strict adherence from the planner. The score for drivable area compliance is determined by the percentage of episodes where the drivable area is violated.

  • Progress We define a series of equally spaced waypoints along the planned route. Progress is measured by the physical distance the ego vehicle covers towards the next waypoint at each step. The overall progress score represents the average progress made in all safe episodes.

  • Comfort This metric maintains the same components as defined in the nuPlan Planning Metrics. The comfort score is the proportion of episodes deemed comfortable out of all safe episodes.

The final score calculation incorporates equation (5). Here, the $\Theta$ component comprises the binary collision and out-of-drivable-area metrics, meaning each is either 0 (no violation) or 1 (violation occurred). The $\Phi$ component is determined by first calculating the progress percentage, i.e., the actual progress made by the vehicle divided by an empirical value of 62, which represents the average progress length in the nuPlan dataset. The average of this progress percentage and the comfort metric then yields the $\Phi$ value.
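A minimal sketch of this per-episode score under the stated setup; normalizing progress by the empirical average of 62 follows the text, while clipping the ratio at 1 and treating comfort as a binary indicator are assumptions.

```python
def rl_episode_score(collided, off_drivable_area, progress, comfortable,
                     avg_progress=62.0):
    # Theta: binary gate from the collision and out-of-drivable-area metrics.
    theta = 0.0 if (collided or off_drivable_area) else 1.0
    # Phi: average of the normalized progress and the comfort indicator.
    progress_pct = min(progress / avg_progress, 1.0)  # clipping is an assumption
    phi = 0.5 * (progress_pct + float(comfortable))
    return theta * phi
```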

0.B.2.5 WOD motion metrics

WOMD’s metrics are used to evaluate the motion prediction performance of our model.

  • minADE The minimum Average Displacement Error computes the L2 norm between the ground truth and the closest joint prediction (see the sketch after this list).

  • minFDE The minimum Final Displacement Error is equivalent to evaluating the minADE at a single time step T.

  • Overlap rate (OR) The overlap rate is computed by taking the highest-confidence joint prediction from each multimodal joint prediction. If any of the A agents in the jointly predicted trajectories overlaps at any time with any other object that was visible at the prediction time step (compared at each time step up to T), or with any of the other jointly predicted trajectories, it is counted as a single overlap. The overlap rate is the total number of overlaps divided by the total number of predictions. Overlap is calculated using box intersection, with headings inferred from consecutive waypoint position differences.

  • Miss rate (MR) A binary match/miss indicator function IsMatch($\hat{s}_{t}$, $s_{t}$) is assigned to each sampled waypoint at time t. Averaging over the dataset yields the miss rate at that time step. A single distance threshold to determine IsMatch is insufficient: we want stricter criteria for slower-moving and closer-in-time predictions, and different criteria for lateral deviation (e.g., wrong lane) versus longitudinal deviation (e.g., wrong speed profile).

  • Mean average precision (mAP) The Average Precision computes the area under the precision-recall curve by applying confidence score thresholds $c_{k}$ across a validation set and using the definition of Miss Rate above to define true positives, false positives, etc. Consistent with object detection mAP metrics, only one true positive is allowed for each object and is assigned to the highest-confidence prediction; the others are counted as false positives. Further inspired by the object detection literature, we seek an overall metric balanced over semantic buckets, some of which may be much more infrequent (e.g., u-turns), so we report the mean AP over different driving behaviors. The final mAP metric averages over eight ground-truth trajectory shapes: straight, straight-left, straight-right, left, right, left u-turn, right u-turn, and stationary.
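As referenced in the minADE entry above, here is a minimal sketch of the displacement-error metrics; selecting the best of K modes jointly for ADE and at the final step is the standard definition, while the array layout is an illustrative assumption.

```python
import numpy as np

def min_ade_fde(pred_trajs, gt_traj):
    # pred_trajs: (K, T, 2) multimodal predictions; gt_traj: (T, 2) ground truth.
    errors = np.linalg.norm(pred_trajs - gt_traj[None], axis=-1)  # (K, T) L2 errors
    min_ade = errors.mean(axis=1).min()   # best mode by average displacement
    min_fde = errors[:, -1].min()         # best mode at the final time step T
    return float(min_ade), float(min_fde)
```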

Appendix 0.C Experimental setup

0.C.1 Scene Generation

For our quantitative experiments in scene generation, we utilized all validation scenarios from the Waymo Open Motion Dataset (WOMD) version 1.2.0. To ensure a fair comparison with Trafficgen [19], we followed the testing settings outlined in that work and limited our model to generating vehicles only within a range of -50 meters to 50 meters. Moreover, we omitted scenarios with fewer than 8 ground-truth vehicles, yielding a total of 28,341 scenarios. We evaluated the first frame of each selected scenario to compute the average score. Under identical testing conditions, we reassessed Trafficgen and found the results to align closely with those reported in the original study. For the scene generation task, we found through experimentation that a slightly higher temperature setting leads to a more diverse distribution of scenes, thereby improving our metrics. Consequently, we selected a temperature of 1.25 and a top-k of 200 for this task.

For our qualitative analysis of scene generation, we mainly used the nuPlan Dataset. This dataset was chosen for its diverse and detailed scenario description tags, which are essential for our prompt-based conditioning experiments. These experiments aim to generate scenes that follow the scenario descriptions, offering a more nuanced and targeted approach.

0.C.2 World Simulator

Our experiments are conducted on the WOD Sim Agents Benchmark, utilizing the WOD Motion Dataset. We adhere to the protocols of the Sim Agents challenge, unrolling 32 futures for each scenario in parallel. Specifically, for the experiments detailed in Tab.2, we utilize the test splits, with evaluation performed by the Waymo Sim Agent Challenge server. For other experiments, such as the Scaling Laws in Sec. 3.0.5 and various ablations, we employ the sub50 validation dataset, which comprises a total of 1024 scenarios.

Specifically, our model operates at 2 Hz, and we interpolate the results to 10 Hz for evaluation. For both the Waymo and nuPlan datasets, we use 1 second of history along with the current frame as the condition and predict the next 8 seconds. Our model does not directly output information along the z-axis. To meet the requirements of the Waymo Sim Agents Benchmark, we infer the z value for each agent from its predicted x and y positions in conjunction with map data. Specifically, we employ the k-nearest-neighbor algorithm to identify the k map points closest to the agent in the 2D xy-plane and average their z values to estimate the agent's z-axis position; we set k to 4. Additionally, to combat the jitter caused by the sampling and quantization processes, we employ a simple sliding-window algorithm to smooth the final prediction results, which improves stability and overall smoothness.
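A minimal sketch of these two post-processing steps; k = 4 follows the text, while the moving-average window size and helper names are illustrative assumptions.

```python
import numpy as np

def infer_agent_z(agent_xy, map_xy, map_z, k=4):
    # Average the z of the k map points nearest to the agent in the 2D xy-plane.
    d2 = np.sum((map_xy - agent_xy[None, :]) ** 2, axis=-1)
    nearest = np.argsort(d2)[:k]
    return float(map_z[nearest].mean())

def sliding_window_smooth(traj, window=3):
    # Simple moving-average smoothing of a (T, D) trajectory along the time axis.
    kernel = np.ones(window) / window
    return np.stack([np.convolve(traj[:, d], kernel, mode="same")
                     for d in range(traj.shape[1])], axis=1)
```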

0.C.3 Interactive Planning

Our planning experiments are conducted on the nuPlan Dataset. To develop an interactive planner that builds upon the PDM [14], we integrate multiple simple policies with an interactive simulator and scorer, powered by GUMP. Specifically, our planner comprises 15 Intelligent Driver Model (IDM)-based policies with varied parameters, in addition to one imitation policy. We utilize the same set of IDM policies as PDM and adopt the ego vehicle’s prediction within GUMP as the imitation policy. These policies are then simulated in parallel, interacting with GUMP. This interaction is achieved by overriding the ego vehicle’s next states with the output from the policy. Concurrently, the predictions of GUMP for other agents are used as input for the next frame of the policy. After interacting for 4 seconds, we generate a series of rollouts for future states.

By evaluating these simulated rollouts, we can derive the expected return. Since GUMP operates stochastically, to obtain a more accurate estimate we simulate and evaluate each policy four times and average the results to compute the expected return. Ultimately, we select the output of the policy with the highest return as the next action to be taken.
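The return-based selection can be sketched as follows; `simulate_rollout` and `score_rollout` are placeholders for the GUMP rollout and the scenario scorer, and four samples per policy follows the text above.

```python
import numpy as np

def select_best_policy(policies, simulate_rollout, score_rollout, n_samples=4):
    # Estimate each policy's expected return by averaging the scores of several
    # stochastic GUMP rollouts, then pick the policy with the highest average.
    expected_returns = []
    for policy in policies:
        scores = [score_rollout(simulate_rollout(policy)) for _ in range(n_samples)]
        expected_returns.append(float(np.mean(scores)))
    best = int(np.argmax(expected_returns))
    return policies[best], expected_returns[best]
```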

In selecting our evaluation environment, we updated the original IDM-based environment by substituting IDM-controlled smart agents with those controlled by GUMP. For a visual comparison of these two testing environments, one can refer to Appendix 0.F.3. Compared to IDM, GUMP-based smart agents exhibit more realistic and natural behavior, thus providing a more accurate representation of the planner’s real-world performance. To ensure the accuracy of our assessment of the planner’s performance, our experiments are conducted across both environments, with results presented side by side for reference. This method enables a comprehensive evaluation of our planning strategies in scenarios closely mimicking real-world driving conditions.

0.C.4 Online Training

We implement our experiments using an existing open-source RL framework [80], which provides implementations of a number of standard RL algorithms. We use the Soft Actor-Critic (SAC) algorithm [23] to train a simple policy model, which consists of a ResNet-18 model followed by a 2-layer MLP-based projection network, outputting a squashed Gaussian distribution representing the state-conditional action distribution.
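A sketch of such a policy network is given below, assuming a PyTorch implementation; the action dimensionality, hidden width, and log-std clamping range are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SquashedGaussianPolicy(nn.Module):
    # ResNet-18 encoder followed by a 2-layer MLP that parameterizes a
    # tanh-squashed Gaussian over actions, as used with SAC.
    def __init__(self, action_dim=2, hidden_dim=256):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()          # expose the 512-d feature vector
        self.encoder = backbone
        self.head = nn.Sequential(
            nn.Linear(512, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * action_dim),
        )

    def forward(self, obs):
        mean, log_std = self.head(self.encoder(obs)).chunk(2, dim=-1)
        std = log_std.clamp(-5.0, 2.0).exp()
        dist = torch.distributions.Normal(mean, std)
        action = torch.tanh(dist.rsample())  # squashed Gaussian sample
        return action, dist
```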

Using the same RL algorithm, network structure and hyperparameters, we train the policy network under the following two different environment setups respectively and compare the results:

  • Setup 1: Log Playback. Log playback environment is used in this setting which has no reactive capability.

  • Setup 2: World Model. A learned world model is used to implement the environment, in which participating agents react according to ego vehicle’s action.

For each episode, the world model sees the current observation and unrolls one step conditioned on the policy model's output. We run 128 such environments in parallel. During the policy gradient update, we sample batches of 2048 state transitions from the replay buffer and optimize with the Adam optimizer [40]. Each policy model is trained for 40k steps. The learning rate is set to 1e-4 and is reduced to 2e-5 after 70% of the training process. The training rewards are described in the metrics section, Appendix 0.B.2.4. During training, 20% of the samples are used in an open-loop fashion, producing imitation losses. Empirically, we found that mixing in open-loop samples speeds up the training process. We then simulate each trained policy model in both environments and report the metrics.

Appendix 0.D Training setup

0.D.1 Raster Input

The raster layers encode various static elements and the high-definition map, which contains:

  • Roadmap Raster Represents the road network layout, including lanes, intersections, and other road features.

  • Baseline Paths Raster Encodes baseline paths within the road network of the scene.

  • Route Raster Represents the ego vehicle’s desired route or path.

  • Drivable Area Raster Represents areas considered drivable or navigable for vehicles within the scene.

  • Speed Limit Raster Encodes speed limits at various locations within the scene.

  • Static Agents Raster Represents positions of static agents (e.g., traffic cones) within the scene.

  • Traffic Light Raster Encodes traffic light location and orientation at intersections within the scene.

0.D.2 Data Augmentation

To enhance the generalization capability of our model, we implemented various data augmentation techniques. First, we applied an agent and frame dropout strategy, randomly dropping certain agent tokens or the tokens of an entire frame with a probability of 0.1. Second, we employed a random crop strategy, randomly selecting segments from the entire scenario sequence based on a fixed window size for training. Additionally, we uniformly sampled rotation angles from $[-\frac{\pi}{2},\frac{\pi}{2}]$ to rotate the entire scene and translated the whole scene in both the x and y directions within a range of $[-5,5]$ meters. Finally, beyond using the self-driving car as the central focus of the scene, we also randomly selected several agents of interest from the dataset to serve as the center of the scene coordinate system.
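A simplified sketch of the geometric part of these augmentations; the map and other scene elements would be transformed analogously, and the helper names and argument layout are illustrative.

```python
import numpy as np

def augment_agents(agent_xy, agent_heading, drop_prob=0.1,
                   max_rot=np.pi / 2, max_trans=5.0, rng=np.random):
    # Agent dropout with probability 0.1.
    keep = rng.random(len(agent_xy)) > drop_prob
    xy, heading = agent_xy[keep].copy(), agent_heading[keep].copy()
    # Global rotation uniformly sampled from [-pi/2, pi/2].
    theta = rng.uniform(-max_rot, max_rot)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    xy = xy @ rot.T
    heading = heading + theta
    # Global translation uniformly sampled from [-5, 5] m in x and y.
    xy = xy + rng.uniform(-max_trans, max_trans, size=2)
    return xy, heading
```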

0.D.3 Loss

The loss function of GUMP is composed of three main components: reconstruction loss, cross entropy loss, and multipath loss [3].

The reconstruction loss employs the L1 loss to supervise the image reconstruction part of the image autoencoder, which is defined as follows:

L_{\text{rec}}=\frac{1}{N_{x}N_{y}}\sum_{i,j}\left|I(i,j)-\hat{I}(i,j)\right| \qquad (10)

where $N_{x},N_{y}$ denote the image dimensions in pixels, $I(i,j)$ is an image pixel, and $\hat{I}$ is the reconstructed image.

The cross entropy loss (CE) evaluates the accuracy of the categorical predictions made by the model. It is applied to key tokens, such as ID, class, and special tokens, as well as to value tokens such as position ($x,y$), orientation ($\theta$), velocity and dimensions ($v_{x},v_{y},w,l$). This loss is formulated as:

L_{\text{CE}}=\frac{1}{N_{k}}\sum_{k\in\{ID,\,class,\,special\}}\mathcal{H}(\hat{k},k)+\frac{1}{N_{v}}\sum_{v\in\{x,y,\theta,v_{x},v_{y},w,l\}}\mathcal{H}(\hat{v},v) \qquad (11)

where $\mathcal{H}$ denotes the cross entropy, $N_{k}$ and $N_{v}$ are the counts of key and value tokens, respectively, and $\hat{k}$ and $\hat{v}$ are the predicted logits.

To enhance the model's motion prediction capability, we introduced trajectory prediction as an auxiliary task. The specific results and metric comparisons can be found in Appendix 0.G. We employed the commonly used multipath loss [3], which combines a classification loss ($L_{cls}$) with a minimum Average Displacement Error ($L_{minADE}$). The classification loss supervises the likelihood of different future paths, while the minADE measures the accuracy of the predicted trajectory ($\hat{s}$) against the ground truth ($s$). The balance between these two components is regulated by coefficients $\alpha$ and $\beta$:

L_{\text{traj}}^{i}=\alpha L_{cls}^{i}+\beta L_{minADE}(\hat{s}_{i},s) \qquad (12)

In MCT, trajectory predictions are made every $n$ layers, resulting in multiple sets of trajectory prediction outcomes. As indicated above, $i$ indexes the trajectory prediction heads at different layers.

Finally, the overall loss $L$ is a weighted sum of these components, allowing for customized emphasis on different aspects of the model's predictions:

L=\omega_{rec}\,L_{\text{rec}}+\omega_{CE}\,L_{\text{CE}}+\sum_{i}\omega_{traj}^{i}\,L_{\text{traj}}^{i} \qquad (13)

where $\omega_{rec}$, $\omega_{CE}$, and $\omega_{traj}^{i}$ are the weights applied to the reconstruction, cross entropy, and trajectory prediction losses, respectively.
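A condensed sketch of how these terms might be combined in training code; the per-token-type averaging of Eq. (11) and the per-head multipath terms of Eq. (12) are abstracted into single calls, and the default weights follow Table 8.

```python
import torch.nn.functional as F

def gump_total_loss(recon, image,
                    key_logits, key_targets, value_logits, value_targets,
                    traj_losses, w_rec=1.0, w_ce=1.0,
                    w_traj=(0.25, 0.25, 0.5, 0.5)):
    # Eq. (10): L1 reconstruction loss on the rasterized image.
    l_rec = F.l1_loss(recon, image)
    # Eq. (11): cross entropy over key tokens plus cross entropy over value tokens.
    l_ce = (F.cross_entropy(key_logits, key_targets)
            + F.cross_entropy(value_logits, value_targets))
    # Eq. (13): weighted sum, with one multipath term (Eq. 12) per prediction head.
    l_traj = sum(w * l for w, l in zip(w_traj, traj_losses))
    return w_rec * l_rec + w_ce * l_ce + l_traj
```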

0.D.4 Hyperparameters

0.D.4.1 Model configuration

As shown in Table 7, we list the detailed hyperparameters of the three model variants.

Hyperparameter GUMP-small GUMP-base GUMP-medium
vocabulary size 2972
block size 2048
meshgrid $[x,y,\theta,w,l,v_{x},v_{y}]$ $[0.2\,m,\,0.2\,m,\,\frac{\pi}{100},\,0.5\,m,\,0.5\,m,\,0.25\,m/s,\,0.25\,m/s]$
range $[x,y,\theta,w,l]$ $[(-100,100)\,m,\,(-100,100)\,m,\,(-\pi,\pi),\,(0,7)\,m,\,(0,15)\,m]$
range $[v_{x},v_{y}]$ $[(0,25)\,m/s,\,(0,25)\,m/s]$
embedding dim 384 768 1024
MCT backbone gpt2-small gpt2-base gpt2-medium
$N_{heads}$ 6 12 16
$N_{layer}$ 12 12 24
# GRU decoder layers 2 2 2
chunking size $T$ (sec) 2 2 2
visual encoder ResNet18 ResNet34 ResNet50
total params 55.8M 184M 523M
Table 7: Model configurations of GUMP. Rows with a single value are shared across all three variants.

0.D.4.2 Training configuration

We detail all training specifics in Table 8. Our models undergo an initial training phase of 10 epochs on the nuPlan Dataset, followed by a fine-tuning stage on the Waymo Open Dataset for an additional 10 epochs. The GUMP-medium model is trained on 8 A100 GPUs over a period of 3 days.

Hyperparameter Value
lr 2e-4
weight decay 1e-3
optimizer adamw [47]
lr scheduler multistep lr
batch size 64
accumulate grad batches 2
training epochs 10
decay milestones [6, 8]
decay rate 0.1
precision fp16
$\omega_{rec}$ 1.0
$\omega_{CE}$ 1.0
$\omega_{traj}^{0}$ 0.25
$\omega_{traj}^{1}$ 0.25
$\omega_{traj}^{2}$ 0.5
$\omega_{traj}^{3}$ 0.5
temperature 1.1
topk 40
temperature at SceneGen 1.25
topk at SceneGen 400
condition length 3
decay rate $\gamma$ 1.2
chunking time horizon $T$ 2s
Table 8: Training hyperparameters of GUMP.

Appendix 0.E Ablation Study

0.E.1 Effectiveness of NAR conversion

To accelerate the model's inference process, we employed the Non-Autoregressive (NAR) conversion methods described in Section 2.2.6, transforming the full autoregressive (full-AR) model into a partially autoregressive (partial-AR) model. The specific experimental results, as outlined in Table 9, reveal that the transition to a partial-AR model results in a significant acceleration ($\times 132$ speedup) without any noticeable degradation across various simulation realism metrics. This transformation demonstrates the efficacy of NAR conversion methods in enhancing computational efficiency while maintaining the quality of simulation outcomes.

Inference Mode Inference Speed (FPS) Meta Metric $\uparrow$ minADE $\downarrow$ Kinematic Metrics $\uparrow$ Interactive Metrics $\uparrow$ Map-based Metrics $\uparrow$
full-AR 0.69 0.645 1.548 0.4022 0.7651 0.8334
partial-AR 82 ($\times 132$) 0.646 1.584 0.4036 0.7662 0.8331
Table 9: Comparison of full-AR and partial-AR inference modes. Inference speed is measured on a single A100 GPU. The transition from full-AR to partial-AR mode results in a significant acceleration ($\times 132$ speedup) without any noticeable degradation across various simulation realism metrics.

0.E.2 Effectiveness of Prediction Chunking and Temporal Aggregation

As shown in Table 10, we present the ablation study results on the Waymo Sim Agents validation set. It is evident that incorporating prediction chunking as an auxiliary task alone leads to a 1% enhancement in the realism meta-metric, with particularly notable improvements observed in the interactive and map-based metrics. Moreover, the minimum Average Displacement Error (minADE) significantly decreases. Further enhancements are attained through temporal aggregation, where the ensemble of multiple predictions from chunked results yields an additional 1.2% improvement in the meta-metric. It is worth mentioning that the kinematic metrics are the primary contributors to this overall improvement. These experiments underscore the effectiveness of the Prediction Chunking and Temporal Aggregation modules in increasing the realism of simulations conducted by GUMP.

Prediction Chunking Temporal Aggregation Meta Metric $\uparrow$ minADE $\downarrow$ Kinematic Metrics $\uparrow$ Interactive Metrics $\uparrow$ Map-based Metrics $\uparrow$
✗ ✗ 0.624 1.626 0.3903 0.7508 0.7903
✓ ✗ 0.634 1.562 0.3700 0.7674 0.8299
✓ ✓ 0.646 1.584 0.4036 0.7662 0.8331
Table 10: Ablation study of prediction chunking and temporal aggregation on the Waymo Sim Agents benchmark. These experiments demonstrate the efficacy of the Prediction Chunking and Temporal Aggregation modules in enhancing the realism of simulations conducted by GUMP.

0.E.3 Ablation Study of Decay Rate $\gamma$

In this section, we explore the optimal setting for the decay rate $\gamma$ in our temporal aggregation process, as depicted in Figure 7. Setting $\gamma=0.0$ equates to omitting temporal aggregation entirely. As we increase $\gamma$, future predictions within the chunked data are progressively weighted more heavily. Observing the components of the meta-metric, it becomes evident that a higher decay rate $\gamma$ distinctly improves the kinematic metrics while leading to a decrease in the interactive and map-based metrics. The overall meta-metric peaks at $\gamma=1.2$. This suggests that by incorporating predictions that extend further into the future, agents are better able to adhere to their initial goals, thereby achieving higher kinematic metrics. However, this reliance on more distant predictions in chunked data tends to overlook interactions between agents, and between agents and the map, at future time points, resulting in reduced interactive and map-compliance metrics.
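The exact weighting scheme is not spelled out here; one consistent reading, assumed in the sketch below, is an exponential weight $\gamma^{k}$ over the chunk offset $k$, so that $\gamma=0$ keeps only the freshest prediction (no aggregation) while larger $\gamma$ places more weight on predictions made further into a chunk's future.

```python
import numpy as np

def aggregate_chunked_predictions(preds_by_offset, gamma=1.2):
    # preds_by_offset[k]: estimate of the current state produced k steps ago,
    # i.e., k steps into that chunk's predicted future (assumed indexing).
    preds = np.stack(preds_by_offset)                       # (K, ...) candidates
    weights = np.power(float(gamma), np.arange(len(preds)))
    weights = weights / weights.sum()
    return np.tensordot(weights, preds, axes=1)             # decay-weighted average
```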

Figure 7: Impact of Decay Rate $\gamma$ on the Realism Metrics in the Waymo Sim Agents Benchmark. This figure illustrates the variation curves of the overall MetaMetric, Kinematic metrics, Interactive metrics, and Map-based metrics as a function of decay rate $\gamma$, from left to right. The experimental results highlight that the overall meta-metric reaches its peak performance at $\gamma=1.2$, which is highlighted with magenta.

0.E.4 Ablation Study of Temperature

In this section, we conduct a detailed analysis of how varying the temperature parameter affects the performance and behavior of simulations, as illustrated in Fig. 8. The data clearly show that the meta-metric peaks at a temperature of 1.1. Furthermore, as the temperature increases, kinematic metrics show a monotonically increasing trend, while both interactive and map-based metrics exhibit a monotonic decrease. This pattern indicates that higher temperatures facilitate better mode coverage by enabling more frequent sampling of low-probability states. However, such states may lead to an increase in collisions and deteriorate map compliance. Visual observations reveal that higher temperatures result in more aggressive behaviors among road participants, such as increased collisions or jaywalking. Conversely, lower temperatures are associated with more conservative behaviors, including more obedient drivers and more cautious pedestrians. This parameter thus offers a lever to more precisely control the behavior of traffic participants, enabling effective evaluation of driving policies across diverse scenarios.
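For reference, a minimal sketch of the temperature-scaled top-k token sampling this ablation varies; the defaults of temperature 1.1 and top-k 40 follow Table 8, while the function layout is illustrative.

```python
import numpy as np

def sample_token(logits, temperature=1.1, top_k=40, rng=np.random):
    # Higher temperature flattens the distribution (more diverse, more aggressive
    # agents); lower temperature concentrates it (more conservative agents).
    logits = np.asarray(logits, dtype=float) / temperature
    top = np.argsort(logits)[-top_k:]                 # indices of the k best tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs = probs / probs.sum()
    return int(rng.choice(top, p=probs))
```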

Figure 8: The graph illustrates the relationship between temperature settings and Realism Metrics in the Waymo Sim Agents Benchmark. It highlights how varying temperature levels affect overall simulation performance, with a focus on the interplay between kinematic metrics, interactive behaviors, and map compliance. The chart specifically showcases the optimal meta-metric peak at a temperature of 1.1, while also detailing the divergent trends of increasing kinematic metrics and decreasing interactive and map-based metrics performance at higher temperatures.

0.E.5 Ablation Study of the Number of Conditioned Frames

In this section, we explore the effects of varying the length of the conditioned history within the autoregressive process. As depicted in Fig. 9, performance is notably poor when there is no historical input (condition length = 1, i.e., only the current frame). As the input frame length increases, the overall meta-metric reaches its peak at 5 frames, corresponding to a 2-second history, after which the improvement plateaus. This suggests that, for the purpose of simulation, a 2-second window of information is sufficiently informative. However, considering that increasing the conditioned length significantly adds to the computational demands, thereby affecting the efficiency of downstream tasks, we opt for a conditioned frame length of 3 (1-second history). This decision balances the need for sufficient historical context with the imperative of maintaining computational efficiency.

Figure 9: Impact of the number of conditioned frames on the Realism Metrics in the Waymo Sim Agents Benchmark.

Appendix 0.F Qualitative Analysis

In this section, we present a qualitative analysis of GUMP on different tasks. The visualizations reflect GUMP's capability in controlled scene generation and its realistic simulation of diverse, complex driving scenarios, and provide a visual comparison with traditional IDM agents. Together, these results demonstrate the strength of GUMP as a controllable, realistic, data-driven simulator.

0.F.1 Scene Generation

Here we demonstrate visualizations of scene generation based on map and scenario description prompts. By providing a static map, specifying the types and numbers of agents, and offering scenario description prompts, we can autoregressively generate the corresponding initial states and simulate the future from those initial states. As illustrated in Figure 10, under the same map input, distinctly different scenes are created by varying the scenario descriptions and the numbers of objects. The 0th frame in each image is generated through fully autoregressive scene generation, while the subsequent simulation is generated through partial-autoregressive scene extrapolation. Examples include low magnitude speed, waiting for pedestrian to cross, and controlling the position of objects, such as behind bike. Through such controlled methods, we are able to generate a large number of driving scenarios, thereby mitigating the issue of data scarcity.

0.F.2 Diverse Future

GUMP demonstrates exceptional efficacy in generating diverse, interactive, and rational driving behaviors within identical map settings. Fig. 11 shows three pairs of samples.

In Fig. 11 (a), the red car proceeds to make a right turn because there is sufficient distance from the approaching green vehicle, demonstrating the model's effective distance assessment for safe maneuvers. Conversely, the right panel depicts the red car yielding to the faster-moving green vehicle, illustrating the model's prioritization of safety in tighter traffic scenarios.

On the left side of Fig. 11 (b), all vehicles stick to their predetermined routes. However, the right side illustrates a more complex scenario where a light blue car accelerates and changes lanes, forcing the dark blue car following it to also switch lanes in order to avoid congestion initiated by the leading vehicle.

Lastly, Fig. 11 (c) illustrates the model’s ability to choose highly probable actions, like making a left turn, as well as to perform less typical maneuvers, such as U-turns, given similar circumstances. This adaptability underscores the model’s sophisticated decision-making capabilities across diverse traffic situations.

(a) scenario prompt: low magnitude speed
(b) scenario prompt: waiting for pedestrian to cross
(c) scenario prompt: behind bike
Figure 10: We created and simulated three distinct scenarios using the same static map but with different scenario descriptions, including the scenario prompts and the classes and numbers of agents involved. It is worth noting that the 0th frame in the image is generated through fully autoregressive scene generation, while subsequent simulations are generated through partial-autoregressive scene extrapolation.
(a) Left: proceed, Right: yield
(b) Left: maintain, Right: change lane
(c) Left: left-turn, Right: u-turn
Figure 11: The left and right images display the outcomes simulated under the same initial conditions. We overlay the 5-second future trajectories of all agents on a single image to represent the motion of each agent. The areas of interest are highlighted with red dashed lines, showing that simulated scenarios with identical initial conditions can lead to significantly different agent behaviors.

0.F.3 Reactive Simulation

(a) In the IDM environment, one vehicle is running through a red light and collides with another vehicle which is making a left turn.
(b) In the IDM environment, the bus collides with the vehicle in front of it.
(c) In the IDM environment, one vehicle is running a red light and collides with another left-turning vehicle.
Figure 12: Agents’ behaviours comparison under GUMP’s environment (first row) and IDM environment (second row). Interesting areas are highlighted with a yellow square.

Fig. 12 illustrates the superior realism of GUMP compared to the rule-based reactive environment (IDM). The first row shows agents' movements generated by GUMP, and the second row shows those of IDM. These visualizations reveal that the simplistic IDM environment falls short of providing an adequate reactive simulation for planning policies. Conversely, GUMP presents an efficient and more human-like alternative.

In Figure (a), we observe collisions between agents in the IDM environment, where a vehicle disregards traffic lights and collides with another vehicle making a left turn. In contrast, in GUMP’s environment, vehicles exhibit more human-like behavior, avoiding any collisions.

In the second row of Figure (b), a bus fails to stop adequately behind another vehicle, leading to a rear-end collision. Meanwhile, in the top row showcasing GUMP’s environment, the bus stops appropriately, avoiding any collision with the vehicle in front.

Finally, Figure (c) also highlights a scenario in the IDM environment where a vehicle ignores a traffic light and collides with another vehicle executing a left turn. However, in the environment simulated by GUMP, vehicles navigate the intersection smoothly, without any collision.

Appendix 0.G WOD Motion Results

To further validate its capabilities as a foundation model for multiple tasks, we also benchmarked GUMP's performance on motion prediction. Using the latent features corresponding to each agent, 8-second future trajectories and their associated probabilities are predicted through a simple multilayer perceptron (MLP) prediction head. Following the Waymo motion prediction benchmark, we adopted 6 distinct modes and supervised the model using a multipath loss.

0.G.1 WODM Validation Set

As shown in Table 11, we compare the performance of GUMP-medium on the WOD motion validation set against other models. It is important to note that, to ensure a fair comparison, all models selected for this analysis are end-to-end models, i.e., they do not incorporate additional post-processing or ensembling in their results. Traditionally, agent-centric models have led the way in motion prediction due to their more unified and easier-to-learn representation forms, outpacing scene-centric models. Despite this, our findings reveal that GUMP-medium significantly surpasses other scene-centric models, such as SceneTransformer [52], and even approaches the performance of agent-centric models like MTR-e2e [67], without any bells and whistles. This demonstrates that GUMP can also serve as a pretraining framework, acting as a foundation model for motion planning to benefit a variety of related tasks.

Methods scene-centric mAP \uparrow minADE \downarrow minFDE \downarrow MR \downarrow
MTR-e2e [67] 0.32 0.52 1.10 0.12
CPS [70] 0.32 0.74 1.49 0.20
SceneTransformer [52] 0.28 0.61 1.22 0.16
GUMP-m 0.30 0.60 1.15 0.13
Table 11: Performance comparison of motion prediction on the validation set of the WOMD. GUMP-medium significantly outperforms scene-centric SceneTransformer, and even approaches the performance of agent-centric models like MTR-e2e, without any bells and whistles.

0.G.2 Per-type Results of WOMD Validation

We also present the per-type results of GUMP on the WOMD Validation set, as shown in reference Tab. 12.

Object type mAP \uparrow minADE \downarrow minFDE \downarrow MR \downarrow
Vehicle 0.32 0.75 1.42 0.14
Pedestrian 0.27 0.36 0.70 0.07
Cyclist 0.30 0.68 1.33 0.18
Avg 0.30 0.60 1.16 0.13
Table 12: Per-type performance of motion prediction on the validation set of WOMD.

Appendix 0.H Per-component WOD Sim Agent Metric

As shown in Table 13, to provide a more detailed showcase of our model's performance on the Waymo Sim Agents benchmark, we list the breakdown metrics for reference. Our method significantly leads the competition across a majority of these metric components, particularly in the linear kinematic metrics, the collision-related metrics, and the map compliance metrics. It achieves state-of-the-art performance in both the overall realism meta-metric and the minADE (minimum Average Displacement Error) metric, highlighting its effectiveness and efficiency in producing realistic and accurate simulations of real-world driving scenarios.

Agent Policy Meta Metric ($\uparrow$) minADE ($\downarrow$) linear speed ($\uparrow$) linear accel. ($\uparrow$) ang. speed ($\uparrow$) ang. accel. ($\uparrow$) dist. to obj. ($\uparrow$) collision ($\uparrow$) TTC ($\uparrow$) dist. to road edge ($\uparrow$) offroad ($\uparrow$)
Logged Oracle 0.722 0.000 0.561 0.330 0.563 0.489 0.485 1.000 0.881 0.713 1.000
SBTA-ADIA [48] 0.420 3.611 0.317 0.174 0.478 0.463 0.265 0.337 0.770 0.557 0.483
CAD [11] 0.531 2.308 0.349 0.253 0.432 0.310 0.332 0.568 0.789 0.637 0.834
Joint-Multipath++ [77] 0.533 2.049 0.434 0.230 0.515 0.452 0.345 0.567 0.812 0.639 0.682
Wayformer [51] 0.575 2.498 0.331 0.098 0.413 0.406 0.297 0.870 0.782 0.592 0.866
MTR+++ [59] 0.608 1.679 0.414 0.107 0.484 0.436 0.347 0.861 0.797 0.654 0.895
MVTA [78] 0.636 1.866 0.439 0.220 0.533 0.480 0.374 0.875 0.829 0.654 0.893
MVTE [78] 0.645 1.674 0.445 0.222 0.535 0.481 0.383 0.893 0.832 0.664 0.908
Trajeglish [56] 0.644 1.615 0.448 0.192 0.538 0.485 0.386 0.918 0.837 0.658 0.887
GUMP-m 0.643 1.590 0.463 0.256 0.467 0.412 0.391 0.920 0.832 0.672 0.908
Table 13: Per-component metric results on the test split of WOMD, representing likelihoods. Note: \star indicates the use of model ensemble techniques. Our method significantly leads the competition across a majority of metric components and achieves the state-of-the-art on the overall realism meta metric and the minADE metric. The best score is bolded.