
Deep Structured Reactive Planning

Jerry Liu1, Wenyuan Zeng1,2, Raquel Urtasun1,2, Ersin Yumer1
1 Uber ATG   2 University of Toronto
Correspondence to: [email protected], [email protected], [email protected], [email protected]
Abstract

An intelligent agent operating in the real-world must balance achieving its goal with maintaining the safety and comfort of not only itself, but also other participants within the surrounding scene. This requires jointly reasoning about the behavior of other actors while deciding its own actions, as these two processes are inherently intertwined: a vehicle will yield to us if we decide to proceed first at the intersection, but will proceed first if we decide to yield. However, this is not captured in most self-driving pipelines, where planning follows prediction. In this paper we propose a novel data-driven, reactive planning objective which allows a self-driving vehicle to jointly reason about its own plans as well as how other actors will react to them. We formulate the problem as an energy-based deep structured model that is learned from observational data and encodes both the planning and prediction problems. Through simulations based on both real-world driving and synthetically generated dense traffic, we demonstrate that our reactive model outperforms a non-reactive variant in successfully completing highly complex maneuvers (lane merges/turns in traffic) faster, without trading off collision rate.

Figure 1: Comparison of a non-reactive planner vs. a reactive one in a lane merge scenario. Potential ego future trajectories are shown in gray and the planned one in blue. The non-reactive planner does not reason about how the actor will react to the ego-agent’s candidate trajectories and thus concludes it is impossible to lane-change without colliding with other actors, while the reactive planner reasons that the neighboring actor will slow down, allowing it to complete the lane merge.

I Introduction

Self-driving vehicles (SDVs) face many challenging situations when dealing with complex dynamic environments. Consider a scenario where an SDV is trying to merge left into a lane that is currently blocked by traffic. The SDV cannot reasonably merge by simply waiting: it could be waiting for quite a while and inconvenience the cars behind it. On the other hand, it cannot aggressively merge into the lane disregarding the congestion, as this will likely lead to a collision. A human driver in this situation would reason that if they gently nudge in, other vehicles will have enough time to react without major inconvenience or safety risk, resulting in a successful and safe merge. While this is just one example, similar situations arise often, for example during rush hour, in downtown areas, or when merging onto a highway ramp. The key idea is that the human driver cannot be entirely passive with respect to the dynamic multi-actor environment; they must exercise some degree of control by reasoning about how other actors will react to their actions. Of course, the driver cannot use this control selfishly; they must act in a responsible manner, maximizing their own utility while minimizing the risk and inconvenience to others.

This complex reasoning is, however, seldom used in self-driving approaches. Instead, the autonomy stack of an SDV is composed of a set of modules executed one after another. The SDV first detects other actors in the scene (perception) and predicts their future trajectories (prediction). Given the output of perception and prediction, it plans a trajectory towards its intended goal, which is then executed by the control module. This implies that behavior forecasts of other actors are not affected by the SDV’s own plan; the SDV is a passive actor assuming a stochastic world that it cannot change. As a consequence, it might struggle when planning in high-traffic scenarios. In this paper we refer to prediction unconditioned on planning as non-reactive.

Recently, a line of work has identified similar issues and tried to incorporate how the ego-agent affects other actors into the planning process, for instance via game-theoretic planning [46, 45, 20] and reinforcement learning [48, 6]. Yet these works rely on a hand-picked prediction model or a manually tuned planning reward, which may not fully capture real-world actor dynamics or human-like behaviors. Thus there is a need for a more general approach to the problem.

Towards this goal, we propose a novel joint prediction and planning framework that can perform reactive planning. Our approach is based on cost minimization for planning, where we predict how other actors react to each potential ego-agent plan in order to cost the candidate ego-car trajectories. We formulate the problem as a deep structured model that defines a set of learnable costs across the future trajectories of all actors; these costs in turn induce a joint probability distribution over the actors’ future trajectories. A key advantage is that our model can be used jointly for prediction (with the derived probabilities) and planning (with the costs). Another key advantage is that our structured formulation allows us to explicitly model interactions between actors and ensure a higher degree of safety in our planning.

We evaluate our reactive model as well as a non-reactive variant in a variety of highly interactive, complex closed-loop simulation scenarios, consisting of lane merges and turns in the presence of other actors. Our simulation settings involve both real-world traffic as well as synthetic dense traffic settings. Importantly, we demonstrate that using a reactive objective can more effectively and efficiently complete these complex maneuvers without trading off safety. Moreover, we validate the choice of our learned joint structured model by demonstrating that it is competitive or outperforms prior works in open-loop prediction tasks.

II Related Work

Prediction

The prediction task, also referred to as motion forecasting, aims to predict the future states of each agent given the past. Early methods used physics-based models to unroll the past actor states [55, 28, 15]. This field has exploded in recent years thanks to advances in deep learning. One line of work performs prediction (often jointly with detection) from rich unstructured sensor and map data [32, 11, 59, 31, 30, 29], ranging from LiDAR context [32] to map rasterizations [11, 17, 16] and lane graphs [30, 22]. Modeling future motion with a multi-modal distribution is of key importance given the inherent future uncertainty [11, 41, 42, 52, 24, 60, 12] and the sequential nature of trajectories [42, 52]. Recent works also model interactions between actors [9, 42, 52, 27, 10, 29], mostly through graph neural networks. In our work, we tackle multi-modal and interactive prediction with a joint structured model, through which we can efficiently estimate probabilities.

Motion planning

Given observations of the environment and predictions of the future, the purpose of motion planning is to find a safe and comfortable trajectory towards a specified goal. Sample-based planning is a popular paradigm due to its low latency: a large set of trajectory candidates is first sampled and evaluated under a pre-defined cost function, and the minimal-cost trajectory is then chosen for execution. Traditionally, such a cost function is hand-crafted to reflect prior knowledge [19, 62, 35, 7, 1]. More recently, learned cost functions have also shown promising results; these costs can be learned through either imitation learning [44] or inverse reinforcement learning [61]. In most of these systems, predictions are made independently of planning. While there has been recent work on accounting for actor reactivity in the planning process [51, 46, 45, 20], such works still rely on hand-designed rewards or prediction models, which may have difficulty accounting for all real-world scenarios in complex driving situations.
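To make the sample-based paradigm concrete, the following is a minimal sketch (assuming a user-supplied trajectory sampler and cost function, both placeholders here) of the sample-evaluate-select loop:

```python
import numpy as np

def sample_based_plan(sample_trajectories, cost_fn, num_samples=100):
    """Generic sample-based planner: sample candidates, cost them, pick the argmin.

    sample_trajectories: callable returning an array of shape (num_samples, T, 2)
    cost_fn: callable mapping one (T, 2) trajectory to a scalar cost
    """
    candidates = sample_trajectories(num_samples)             # (num_samples, T, 2)
    costs = np.array([cost_fn(traj) for traj in candidates])  # evaluate each candidate
    return candidates[np.argmin(costs)]                       # minimum-cost trajectory
```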

Neural end-to-end motion planning

The traditional compartmentalization of prediction and planning results in the following issues. First, chaining both modules may result in a large system that is prohibitively slow for online settings. Second, classical planning usually assumes predictions to be very accurate or errors to be normally distributed, which is not realistic in practice. Third, the sequential ordering of prediction and planning makes it difficult to model the interactions between the ego-agent and other agents while making decisions.

To address these issues, prior works have started exploring end-to-end planning approaches that integrate perception, prediction and planning into a holistic model. Such methods enjoy fast inference while capturing prediction uncertainties and modeling prediction-planning interactions either implicitly or explicitly. One popular approach is to map sensor inputs directly to control commands via neural nets [39, 5, 14, 36, 2]. However, such methods lack interpretability and make it hard to verify safety. Recent works have proposed neural motion planners that produce interpretable intermediate representations, in the form of non-parametric cost maps [59], occupancy maps [43], or affordances [47]. The most closely related work, DSDNet [60], outputs a structured model representation similar to ours; however, DSDNet still follows the traditional pipeline of separating prediction from planning, and thus cannot perform reactive planning.

The two closest related works on modeling multi-agent predictions during end-to-end planning are PRECOG [42] and PiP [50]. PiP is a prediction model that generates joint actor predictions conditioned on the known future ego-agent trajectory, assuming planning is solved. However, in the real world, finding the future ego-trajectory (planning) is in itself a challenging open problem, since the ego-trajectory depends on other actors, creating a complicated feedback loop between planning and prediction. The PRECOG planning objective accounts for joint reactivity under a flow-based [40] framework, yet it requires sequential decoding for planning and prediction, making it hard to satisfy low-latency online requirements. Moreover, its planning objective does not ensure collision avoidance and can suffer from mode collapse in the SDV trajectory space.

Structured Models

Researchers have applied neural nets to learn the parameters of undirected graphical models, also known as Markov Random Fields (MRFs). One of the key challenges in training MRFs in the discrete setting is the computation of the partition function, where the number of states increases exponentially with the number of nodes in the worst case. Message-passing algorithms such as Loopy Belief Propagation (LBP) have been found to approximate the partition function well in practice [58, 34]. Other learning-based methods include dual minimization [13], directly optimizing the Bethe free energy [56], variational approximations [25], and mean-field approximations [49]. Some approaches take an energy-minimization view, unrolling the inference objective through differentiable optimization steps [3, 54] that can also be used to learn the model parameters [4, 53].

III Joint Reactive Prediction and Planning

Figure 2: Overview of the actor-specific and interaction energy terms in our joint structured model.

Suppose the SDV is driving in a scenario with $N$ actors. Let $\mathcal{Y}=(\mathbf{y}_0,\mathbf{y}_1,\cdots,\mathbf{y}_N)$ be the set of random variables representing future trajectories of both the SDV, $\mathbf{y}_0$, and all other traffic participants, $\mathcal{Y}_r=(\mathbf{y}_1,\cdots,\mathbf{y}_N)$. We define a reactive planner as one that considers actor predictions $\mathcal{Y}_r$ conditioned on $\mathbf{y}_0$ in the planning objective, and a non-reactive planner as one that assumes a prediction model which is independent of $\mathbf{y}_0$. In this section, we first outline the framework of our joint structured model, which simultaneously models the costs and probability distribution of the future (Sec. III-A). We then introduce our reactive objective, which enables us to safely plan under such a distribution considering the reactive behavior of other agents (Sec. III-B), and discuss how to evaluate it (Sec. III-C), including with a goal-based extension (Sec. III-D). We finally describe our training procedure (Sec. III-E). We highlight additional model properties, such as interpolation between our non-reactive/reactive objectives, in our supplementary.

III-A Structured Model for Joint Perception and Prediction

We define a probabilistic deep structured model to represent the distribution over the future trajectories of the actors conditioned on the environment context $\mathcal{X}$ as follows:

$p(\mathcal{Y}|\mathcal{X};\mathbf{w}) = \frac{1}{Z}\exp\left(-C(\mathcal{Y},\mathcal{X};\mathbf{w})\right)$   (1)

where $Z$ is the partition function, $C(\mathcal{Y},\mathcal{X};\mathbf{w})$ defines the joint energy of all future trajectories $\mathcal{Y}$, and $\mathbf{w}$ represents all the parameters of the model. In this setting, the context $\mathcal{X}$ includes each actor's past trajectory, LiDAR sweeps, and HD maps, represented by a birds-eye view (BEV) voxelized tensor representation [11, 32]. Actor trajectories $\mathcal{Y}$ can naturally be represented in continuous space. However, performing inference on continuous structured models is extremely challenging. We thus instead follow [60, 38] and discretize each actor's action space into $K$ possible trajectories (each continuous) using a realistic trajectory sampler inspired by [60], which takes past positions as input and samples a set of lines, circular curves, and Euler spirals as future trajectories. Thus, each $\mathbf{y}_i$ is a discrete random variable that can take on one of $K$ options, where each option is a full continuous trajectory; such a discretized distribution allows us to efficiently compute predictions (see Sec. III-C). Additional details on the input representation and trajectory sampling are provided in the supplementary material.

We decompose the joint energy $C(\mathcal{Y},\mathcal{X};\mathbf{w})$ into actor-specific energies, which encode the cost of a given trajectory for each actor, and interaction energies, which capture the plausibility of trajectories across pairs of actors:

$C(\mathcal{Y},\mathcal{X};\mathbf{w}) = \sum_{i=0}^{N} C_{\text{traj}}(\mathbf{y}_i,\mathcal{X};\mathbf{w}) + \sum_{i,j} C_{\text{inter}}(\mathbf{y}_i,\mathbf{y}_j)$   (2)

We exploit a learnable neural network, parameterized with weights $\mathbf{w}$, to compute the actor-specific energy $C_{\text{traj}}(\mathbf{y}_i,\mathcal{X};\mathbf{w})$. A convolutional network takes as input the context feature $\mathcal{X}$, a rasterized BEV 2D tensor grid centered around the ego-agent, and produces an intermediate spatial feature map $\mathbf{F}\in\mathbb{R}^{h\times w\times c}$, where $h, w$ are the dimensions of the feature map (downsampled from the input grid) and $c$ is the number of channels. These features are then combined with the candidate trajectories $\mathbf{y}_i$ and processed through an MLP, outputting an $(N+1)\times K$ matrix of trajectory scores, one per actor trajectory sample. Our interaction energy is a combination of collision and safety distance violation costs. We define the collision energy to be $\gamma$ if a pair of future trajectories collide and 0 otherwise. Following [44], we define the safety distance violation to be a squared penalty within some safety distance of each actor's bounding box, scaled by the speed of the SDV. In our setting, we define the safety distance to be 4 meters from other vehicles. Fig. 2 gives a graphical representation of the two energy terms. Full model details are in the supplementary, including the specific dataset-dependent input representation and model architecture.
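To make Eq. (2) concrete, below is a minimal sketch of how the joint energy of one configuration (one chosen trajectory index per actor) could be assembled from a learned unary score matrix and the non-learned pairwise terms. The helper functions `collides` and `safety_violation`, the value of gamma, and applying the SDV-speed scaling to every pair are illustrative assumptions; the actual unary scores come from the CNN+MLP described above.

```python
import numpy as np

GAMMA = 100.0  # collision penalty gamma; the value here is an illustrative choice

def joint_energy(unary, idx, trajs, collides, safety_violation, sdv_speed):
    """Joint energy C(Y, X; w) of one joint configuration, following Eq. (2).

    unary: (N+1, K) learned actor-specific energies C_traj (row 0 = SDV)
    idx:   length-(N+1) array with the chosen trajectory index per actor
    trajs: (N+1, K, T, 2) sampled candidate trajectories
    collides(t_a, t_b) -> bool and safety_violation(t_a, t_b) -> float are
    assumed pairwise helpers (polygon overlap / squared distance penalty).
    """
    n = unary.shape[0]
    energy = sum(unary[i, idx[i]] for i in range(n))           # actor-specific terms
    for i in range(n):
        for j in range(i + 1, n):
            t_i, t_j = trajs[i, idx[i]], trajs[j, idx[j]]
            if collides(t_i, t_j):                             # collision energy
                energy += GAMMA
            energy += sdv_speed * safety_violation(t_i, t_j)   # safety-distance energy
    return energy
```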

III-B Reactive Inference Objective

The structured model defines both a set of costs and probabilities over possible futures. We develop a planning policy on top of this framework which decides what the ego-agent should do in the next few seconds (i.e., planning horizon). Our reactive planning objective is based on an optimization formulation which finds the trajectory that minimizes a set of planning costs – these costs consider both the candidate SDV trajectory as well as other actor predictions conditioned on the SDV trajectory. In contrast to existing literature, we re-emphasize that both prediction and planning components of our objective are derived from the same set of learnable costs in our structured model, removing the need to develop extraneous components outside this framework; we demonstrate that such a formulation inherently considers both the reactivity and safety of other actors. We define our planning objective as

𝐲0=argmin𝐲0f(𝒴,𝒳;𝐰)\displaystyle{\mathbf{y}}_{0}^{*}=\text{argmin}_{{\mathbf{y}}_{0}}f({\mathcal{Y}},{\mathcal{X}};{\mathbf{w}}) (3)

where $\mathbf{y}_0$ is the ego-agent future trajectory and $f$ is the planning cost function defined over our structured model.

In our reactive setting, we define the planning costs to be an expectation of the joint energies, over the distribution of actor predictions conditioned on the current candidate SDV trajectory:

$f(\mathcal{Y},\mathcal{X};\mathbf{w}) = \mathbb{E}_{\mathcal{Y}_r \sim p(\mathcal{Y}_r|\mathbf{y}_0,\mathcal{X};\mathbf{w})}\left[C(\mathcal{Y},\mathcal{X};\mathbf{w})\right]$   (4)

Note that $\mathcal{Y}_r \sim p(\mathcal{Y}_r|\mathbf{y}_0,\mathcal{X};\mathbf{w})$ describes the future distribution of other actors conditioned on the current candidate trajectory $\mathbf{y}_0$ and is derived from the underlying joint distribution in Eq. (1). Meanwhile, the term $C(\mathcal{Y},\mathcal{X};\mathbf{w})$ represents the joint energy of a given configuration of future actor trajectories. We can expand the planning objective by decomposing the joint energy into the actor-specific and interaction terms as follows:

$C_{\text{traj}}(\mathbf{y}_0,\mathcal{X};\mathbf{w}) + \mathbb{E}_{\mathcal{Y}_r \sim p(\mathcal{Y}_r|\mathbf{y}_0,\mathcal{X};\mathbf{w})}\Big[\sum_{i=1}^{N} C_{\text{inter}}(\mathbf{y}_0,\mathbf{y}_i) + \sum_{i=1}^{N} C_{\text{traj}}(\mathbf{y}_i,\mathcal{X};\mathbf{w}) + \sum_{i=1,j=1}^{N,N} C_{\text{inter}}(\mathbf{y}_i,\mathbf{y}_j)\Big]$   (5)

The set of costs includes the SDV-specific cost, outside the expectation, as well as the SDV/actor interaction costs, the actor-specific costs, and the actor/actor interaction costs within the expectation. Note that the SDV-specific cost $C_{\text{traj}}(\mathbf{y}_0,\mathcal{X};\mathbf{w})$ uses a different set of parameters from those of other actors $\mathbf{y}_i$ to better exploit the ego-centric sensor data and model SDV-specific behavior. Moreover, the set of actor-specific and interaction costs within the expectation leads to an inherent balancing of additional control with additional responsibility: by explicitly modeling the reactive prediction distribution of other actors, we must also take their utilities into account. In the following, we further exclude the last energy term $\sum_{i,j} C_{\text{inter}}(\mathbf{y}_i,\mathbf{y}_j)$ for computational reasons. See the supplementary material for more details.

III-C Inference for Conditional Planning Objective

Due to our discrete setting and the nature of the actor-specific and interaction costs, for any given $\mathbf{y}_0$ we can directly evaluate the expectation in Eq. (4) without the need for Monte-Carlo sampling. We thus have

$f = C_{\text{traj}}^{\mathbf{y}_0} + \sum_{\mathcal{Y}_r} p_{\mathcal{Y}_r|\mathbf{y}_0}\Big[\sum_{i=1}^{N} C_{\text{inter}}^{\mathbf{y}_0,\mathbf{y}_i} + \sum_{i=1}^{N} C_{\text{traj}}^{\mathbf{y}_i}\Big]$   (6)

where $p_{\mathbf{y}_i|\mathbf{y}_0}$ is short-hand for $p(\mathbf{y}_i|\mathbf{y}_0,\mathcal{X};\mathbf{w})$, and $C_{\text{traj}}^{\mathbf{y}_i}$ for $C_{\text{traj}}(\mathbf{y}_i,\mathcal{X};\mathbf{w})$ (and similarly for the pairwise terms). Since the joint probabilities factorize over the actor-specific and pairwise interaction energies, they simplify into the marginal and pairwise marginal probabilities between all actors:

$f = C_{\text{traj}}^{\mathbf{y}_0} + \sum_{i,\mathbf{y}_i} p_{\mathbf{y}_i|\mathbf{y}_0}\, C_{\text{inter}}^{\mathbf{y}_0,\mathbf{y}_i} + \sum_{i,\mathbf{y}_i} p_{\mathbf{y}_i|\mathbf{y}_0}\, C_{\text{traj}}^{\mathbf{y}_i}$   (7)

where $p_{\mathbf{y}_i|\mathbf{y}_0}$ represents the marginal probability of an actor trajectory conditioned on the candidate ego-agent trajectory. These conditional marginals, which form a tensor of size $N\times K\times K$, can all be efficiently approximated by exploiting Loopy Belief Propagation (LBP) [58]. This in turn allows efficient batch evaluation of the planning objective: for every sample of every actor ($N\times K$ samples), we evaluate the conditional marginal probability times the corresponding energy term. Note that LBP can also be interpreted as a special form of recurrent network, and is thus amenable to end-to-end training. Then, since the ego-agent itself has $K$ trajectories to choose from, solving the minimization problem in (3) simply amounts to picking the trajectory with the minimum planning cost.
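As a concrete illustration of Eqs. (3) and (7), the sketch below evaluates the reactive planning cost for all $K$ candidate SDV trajectories in a batched fashion and returns the minimizer. It assumes the conditional marginals from LBP are already available as a $(K, N, K)$ tensor (SDV sample, actor, actor sample); the tensor names are illustrative.

```python
import numpy as np

def reactive_plan(ego_unary, actor_unary, ego_actor_inter, cond_marginals):
    """Pick the SDV trajectory minimizing the reactive objective of Eq. (7).

    ego_unary:       (K,)       C_traj for each SDV candidate y0
    actor_unary:     (N, K)     C_traj for each actor trajectory sample
    ego_actor_inter: (K, N, K)  C_inter(y0, y_i) for every SDV/actor sample pair
    cond_marginals:  (K, N, K)  p(y_i | y0) approximated by Loopy Belief Propagation
    """
    # Expected SDV/actor interaction cost, summed over actors and their samples.
    inter_term = np.einsum('kni,kni->k', cond_marginals, ego_actor_inter)
    # Expected actor-specific cost under the conditional marginals.
    traj_term = np.einsum('kni,ni->k', cond_marginals, actor_unary)
    f = ego_unary + inter_term + traj_term        # planning cost per SDV candidate
    return int(np.argmin(f)), f                   # index of y0* and all K costs
```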

Model              Succ (%) ↑   TTC (s) ↓   Goal (m) ↓   CR (%) ↓   Brake ↓
PRECOG (C)         12.0         16.3        13.5         18.0       39.2
Non-Reactive (C)   46.0         15.8        4.2          5.0        34.4
Reactive (C)       70.0         13.9        2.4          5.0        37.8
PRECOG (S)         21.0         9.4         16.8         20.5       -
Non-Reactive (S)   70.0         7.5         5.3          3.5        -
Reactive (S)       82.0         6.8         4.3          3.5        -
TABLE I: Results obtained from simulations in Simba/CARLA. C = CARLA, S = Simba.

III-D Goal Energy

Similar to [42, 21], we make the observation that our current formulation, which encodes both actor behavior and desirable SDV behavior in the energies of our structured model, can be extended to goal-directed planning to flexibly achieve arbitrary goals during inference. In addition to the learned ego-agent cost $C_{\text{traj}}^{\mathbf{y}_0}$, we can specify a goal state $\mathcal{G}$ in each scenario and encourage the ego-agent to reach it via a goal energy $C_{\text{goal}}^{\mathbf{y}_0}$. The goal state can take different forms depending on the scenario: in the case of a turn, $\mathcal{G}$ is a target position; in the case of a lane change, $\mathcal{G}$ is a polyline representing the centerline of the target lane in continuous coordinates. In particular, we define the goal energy $C_{\text{goal}}^{\mathbf{y}_0}$ as follows: if $\mathcal{G}$ is a single point, the energy is the $\ell_2$ distance from the final waypoint of the candidate trajectory to that point; if $\mathcal{G}$ represents a lane, the energy is the average projected distance to the lane polyline. We add the goal energy to the conditional planning objective during inference.
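A minimal sketch of this goal energy is given below; the point-to-segment projection is standard geometry and the function names are illustrative.

```python
import numpy as np

def point_to_segment(p, a, b):
    """Distance from point p to the segment from a to b."""
    ab, ap = b - a, p - a
    t = np.clip(np.dot(ap, ab) / (np.dot(ab, ab) + 1e-8), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def goal_energy(ego_traj, goal):
    """Goal energy C_goal for one candidate SDV trajectory.

    ego_traj: (T, 2) future waypoints of the candidate SDV trajectory
    goal: either a (2,) target position or an (M, 2) lane-centerline polyline
    """
    goal = np.asarray(goal, dtype=float)
    if goal.ndim == 1:
        # Target position: l2 distance of the final waypoint to the goal.
        return float(np.linalg.norm(ego_traj[-1] - goal))
    # Lane polyline: average projected distance of the waypoints onto the polyline.
    dists = [min(point_to_segment(p, goal[m], goal[m + 1])
                 for m in range(len(goal) - 1)) for p in ego_traj]
    return float(np.mean(dists))
```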

[Figure 3 panels: Reactive model at t=0s, 1s, 2s, 3s (top row); Non-Reactive model at t=0s, 1s, 2s, 3s (bottom row).]
Figure 3: Visualization of a Simba lane merge for the non-reactive (bottom) and reactive (top) models at four time steps, t=0s to t=3s (left to right). The AV is in green, other actors are in blue/purple, and the goal lane is in cyan. The reactive model is able to decisively complete the lane merge, while the non-reactive model is not.

III-E Learning

We train our joint structured model given observed ground-truth trajectories for the ego-car and all other agents in the scene. We want to learn the model energies such that they induce both optimal plans for the ego-agent and accurate probabilities for the actor behaviors. Since the model energies induce a probability distribution used in our prediction model, this implies that minimizing the cross-entropy between our predictive distribution and the ground-truth trajectories will also learn a good set of costs for planning. To this end, we minimize the following cross-entropy loss function:

$\mathcal{L} = \sum_i \mathcal{L}_i + \sum_{i,j} \mathcal{L}_{i,j}$   (8)

$\mathcal{L}_i = \frac{1}{K}\sum_{\mathbf{y}_i \notin \Delta(\mathbf{y}_i^*)} p_{\text{g.t.}}(\mathbf{y}_i)\log p(\mathbf{y}_i,\mathcal{X};\mathbf{w})$   (9)

$\mathcal{L}_{i,j} = \frac{1}{K^2}\sum_{\mathbf{y}_i \notin \Delta(\mathbf{y}_i^*),\,\mathbf{y}_j \notin \Delta(\mathbf{y}_j^*)} p_{\text{g.t.}}(\mathbf{y}_i,\mathbf{y}_j)\log p(\mathbf{y}_i,\mathbf{y}_j,\mathcal{X};\mathbf{w})$   (10)

where $p(\mathbf{y}_i)$ and $p(\mathbf{y}_i,\mathbf{y}_j)$ represent the marginal and pairwise marginal probabilities for every actor, including the ego-agent, and $p_{\text{g.t.}}$ is an indicator that is zero everywhere except when $\mathbf{y}_i,\mathbf{y}_j$ are equal to the ground-truth $\mathbf{y}_i^*,\mathbf{y}_j^*$. Recall that these marginals are computed through Loopy Belief Propagation, a differentiable iterative message-passing procedure. Note that our method has a subtle but important distinction from a raw cross-entropy loss: $\Delta(\mathbf{y}_i^*)$ is defined as the set of $k$ non-ground-truth trajectories for actor $i$ closest to $\mathbf{y}_i^*$ in $\ell_2$ distance, and we only compute the cross-entropy loss for trajectories outside of this set. We adopt this formulation since any trajectory within $\Delta$ can reasonably be considered a ground-truth substitute, and hence we do not wish to penalize the probabilities of these trajectories.
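As an illustration, the sketch below realizes the unary term of Eq. (9) with the ignore set $\Delta$ as a per-sample masked cross-entropy; the exact form used here (binary cross-entropy per trajectory sample, the distance computation, and the sign convention for a minimized loss) is an assumption made for the sketch, and the pairwise term would follow the same pattern. The marginals are assumed to come from the differentiable LBP pass.

```python
import torch

def unary_loss(marginals, gt_index, traj_samples, gt_traj, k_ignore=4):
    """One-actor sketch of the loss in Eq. (9) with an ignore set Delta.

    marginals:    (K,) marginal probabilities p(y_i | X; w) from LBP
    gt_index:     index of the sample matched to the ground truth y_i*
    traj_samples: (K, T, 2) candidate trajectories for this actor
    gt_traj:      (T, 2) ground-truth future trajectory
    k_ignore:     size k of the ignore set Delta(y_i*)
    """
    K = marginals.shape[0]
    # l2 distance of every sample to the ground truth, used to build Delta.
    dists = (traj_samples - gt_traj).norm(dim=-1).mean(dim=-1)       # (K,)
    dists[gt_index] = float('inf')                                   # Delta excludes the GT itself
    ignore = torch.topk(dists, k_ignore, largest=False).indices      # k closest non-GT samples
    target = torch.zeros(K)
    target[gt_index] = 1.0                                           # indicator p_g.t.
    mask = torch.ones(K)
    mask[ignore] = 0.0                                               # no penalty inside Delta
    eps = 1e-6
    ce = -(target * torch.log(marginals + eps)
           + (1.0 - target) * torch.log(1.0 - marginals + eps))
    return (mask * ce).sum() / K
```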

CARLA   DESIRE [27]   SocialGAN [23]   R2P2 [41]   MultiPath [12]   ESP [42]   MFP [52]   DSDNet [60]   Ours
Town1   2.422         1.141            0.770       0.680            0.447      0.279      0.195         0.210
Town2   1.697         0.979            0.632       0.690            0.435      0.290      0.213         0.205
TABLE II: CARLA prediction performance (minMSD, K=12).

IV Experiments

We demonstrate the effectiveness of our reactive planning objective in two closed-loop driving simulation settings: real-world traffic scenarios with our in-house simulator (Simba), and synthetically generated dense traffic with the open-source CARLA simulator [18]. We set up a large number of complex and highly interactive scenarios involving lane changes and (unprotected) turns in traffic, in order to tease apart the differences between reactive and non-reactive models.

To better showcase the importance of reactivity, we created a non-reactive variant of our model by defining the planning cost as $f_{\text{nonreactive}} = \mathbb{E}_{\mathcal{Y}_r \sim p(\mathcal{Y}_r|\mathcal{X};\mathbf{w})}[C(\mathcal{Y},\mathcal{X};\mathbf{w})]$, which uses a prediction model unconditioned on the SDV trajectory. This non-reactive assumption leads to a simplification of the joint cost, $\mathbb{E}_{\mathcal{Y}_r \sim p(\mathcal{Y}_r|\mathcal{X};\mathbf{w})}[C(\mathcal{Y},\mathcal{X};\mathbf{w})] = C_{\text{traj}}(\mathbf{y}_0,\mathcal{X};\mathbf{w}) + \mathbb{E}_{\mathcal{Y}_r \sim p(\mathcal{Y}_r|\mathcal{X};\mathbf{w})}[\sum_{i=1}^{N} C_{\text{inter}}(\mathbf{y}_0,\mathbf{y}_i)]$, where the considered terms are just the SDV-specific trajectory cost and the SDV/actor interaction cost.

Our results show a key insight: our pure reactive model alone achieves a higher success rate compared to the non-reactive model without trading off collision rate, implying it is able to effectively consider the reactive behavior of other actors and formulate a goal-reaching plan without being unreasonably aggressive. Moreover, we justify the choice of a deep structured model by demonstrating that when our model is used for actor trajectory prediction, it is competitive with the state-of-the-art in both CARLA and Nuscenes [8].

IV-A Experimental Setup

IV-A1 Training Datasets

Since our closed-loop evaluations are in Simba and CARLA, our models are trained on datasets in the corresponding domains. Our Simba model is trained on a large-scale, real-world dataset collected through our self-driving vehicles in numerous North American cities, which we call UrbanCity. The dataset consists of over 6,500 snippets of approximately 25 seconds from over 1000 different trips, with 10Hz LiDAR and HD map data, which are included as input context into the model $\mathcal{X}$ in addition to the past trajectories per actor. Meanwhile, the CARLA dataset is a publicly available simulated dataset [42] containing 60k training sequences. The input to the model for CARLA consists of rasterized LiDAR features and 2s of past trajectories, used to predict 4s future trajectories.

IV-A2 Simulation Setup

Simba runs at 10Hz and leverages a realistic LiDAR simulator [33] to generate LiDAR sweeps around the ego-agent at each timestep. HD map polygons are available per scenario. We first set up 12 different interactive “template” scenarios: we select these templates by analyzing logs in the validation set of UrbanCity and selecting a start time where there is a high degree of potential interaction with other actors. We set a goal state for the ego-agent, which for instance can be a turn or lane merge, and initialize actor positions according to their start-time positions in the log. We then generate 25 distinct scenarios per template by perturbing the initial position and velocity of each actor, for a total of 50 validation / 250 test scenarios. During simulation, each actor behaves according to a heuristic car-following model that performs basic hazard detection.

In CARLA we leverage the synthetic LiDAR sensor as input. Rather than initializing scenarios through ground-truth data, we manually create 6 “synthetic” template scenarios containing dense traffic, and spawn actors at specified positions with random perturbations. We extend the BasicAgent class given in CARLA 0.9.5 as an actor model per agent, which performs basic route following to a goal and hazard detection. We generate 50 val/100 test scenarios by perturbing the initial position / vehicle type / hazard detection range of each actor.

Scenarios in all settings are run with a fixed time limit. A scenario ends when 1) the ego-agent has reached the goal, 2) the timer has expired, or 3) the ego-agent has collided.

IV-A3 Closed-Loop Metrics

The output metrics include: 1) success rate (whether the ego-agent successfully completed the lane change or turn), 2) time to completion (TTC), 3) collision rate, and 4) number of actor brake events. Brake events are available in CARLA but not in Simba.

IV-B Reactive/Non-Reactive Simulation Results

The pure reactive model outperforms the non-reactive model on success rate, time to completion, and goal distance, with no difference in collision rate (Tab. I), on both Simba and CARLA. This implies that by considering the reactivity of other actors in its planning objective, the reactive model can more efficiently navigate to the goal in a highly interactive environment, without performing overly aggressive behaviors that would result in a higher collision rate. Moreover, both the reactive and non-reactive models within our joint structured framework outperform a strong joint prediction and planning model, PRECOG [42]; we present our PRECOG implementation and visualizations in the supplementary material.

Nuscenes    KDE       DESIRE [27]   SocialGAN [23]   R2P2 [41]   ESP [42]   Ours
5 agents    52.071    6.575         3.871            3.311       2.892      2.610
TABLE III: Nuscenes prediction performance (5 nearest, minMSD, K=12).

IV-C Qualitative Results

To complement the quantitative metrics, we provide scenario visualizations. In Fig. 3, we present a lane merge scenario in Simba to better highlight the difference between the reactive and non-reactive models in a highly complex, interactive scenario. We provide simulation snapshots at $t=0,1,2,3$ seconds. Note that the reactive model is able to take decisive action and complete the lane merge; the neighboring actor slows down with adequate buffer to let the ego-agent pass. Meanwhile, the non-reactive agent does not complete the lane merge but drifts slowly to the left side of the lane over time. We provide several more comparative visualizations of various scenarios in both Simba and CARLA in our supplementary document and video.

IV-D Prediction Metrics

To validate the general performance of our joint structured framework, we compute actor predictions with our model by using Loopy Belief Propagation to compute unconditioned actor marginals $p(\mathbf{y}_i)$, and compare against the state-of-the-art on standard prediction benchmarks in Tab. II and Tab. III: the CARLA PRECOG dataset and the Nuscenes dataset [8] (note that a separate model was trained for Nuscenes). We report minMSD [42], the minimum mean squared distance between a set of predicted/planned trajectory samples and the ground truth, as our metric. As shown, our method is competitive with or outperforms prior methods in minMSD. Similar to the findings of DSDNet [60], this implies that an energy-based model relying on discrete trajectory samples per actor is able to make accurate trajectory predictions for each actor.

IV-E Training Loss Functions

Loss               Pred FDE (3s)   Actor CR   Plan FDE (3s)   Ego CR
Cross-entropy      1.47            0.55%      2.01            0.24%
Chen et al. [13]   1.30            0.67%      2.10            0.22%
Ours               1.24            0.44%      1.73            0.18%
TABLE IV: Ablation study comparing training losses on UrbanCity. For all metrics, lower is better.

We also perform an ablation study on the UrbanCity validation set to analyze our proposed training loss compared against a vanilla cross-entropy loss (no ignore set), as well as the approach of Chen et al. [13]. We show in Tab. IV that our approach achieves the lowest Final Displacement Error (FDE) for both the SDV and the other actors, as well as the lowest collision rates, both among actors and between the planned ego-trajectory and the ground-truth actors.

V Conclusion

We have presented a novel reactive planning objective allowing the ego-agent to jointly reason about its own plans as well as how other actors will react to them. We formulated the problem with a deep energy-based model which enables us to explicitly model trajectory goodness as well as interaction cost between actors. Our experiments showed that our reactive model outperforms the non-reactive model in various highly interactive simulation scenarios without trading off collision rate. Moreover, we outperform or are competitive with state-of-the-art in prediction metrics.

References

  • [1] Bandyopadhyay, T., Won, K.S., Frazzoli, E., Hsu, D., Lee, W.S., Rus, D.: Intention-aware motion planning. Springer Tracts in Advanced Robotics, Algorithmic Foundations of Robotics X 86 (2013). https://doi.org/10.1007/978-3-642-36279-8_29
  • [2] Bansal, M., Krizhevsky, A., Ogale, A.: Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. RSS (2019)
  • [3] Belanger, D., McCallum, A.: Structured prediction energy networks. ICML (2016)
  • [4] Belanger, D., Yang, B., McCallum, A.: End-to-end learning for structured prediction energy networks. ICML (2017)
  • [5] Bojarski, M., Testa, D.D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L.D., Monfort, M., Muller, U., Zhang, J., Zhang, X., Zhao, J., Zieba, K.: End to end learning for self-driving cars. arXiv (2016)
  • [6] Bouton, M., Nakhaei, A., Fujimura, K., Kochenderfer, M.J.: Cooperation-aware reinforcement learning for merging in dense traffic. In: 2019 IEEE Intelligent Transportation Systems Conference (ITSC). pp. 3441–3447 (2019). https://doi.org/10.1109/ITSC.2019.8916924
  • [7] Buehler, M., Iagnemma, K., Singh, S.: The DARPA Urban Challenge: Autonomous Vehicles in City Traffic. Springer Publishing Company, Incorporated, 1st edn. (2009)
  • [8] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027 (2019)
  • [9] Casas, S., Gulino, C., Liao, R., Urtasun, R.: Spatially-aware graph neural networks for relational behavior forecasting from sensor data. ICRA (2020)
  • [10] Casas, S., Gulino, C., Suo, S., Luo, K., Liao, R., Urtasun, R.: Implicit latent variable model for scene-consistent motion forecasting. ECCV (2020)
  • [11] Casas, S., Luo, W., Urtasun, R.: Intentnet: Learning to predict intention from raw sensor data. CoRL (2018)
  • [12] Chai, Y., Sapp, B., Bansal, M., Anguelov, D.: Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. CoRL (2019)
  • [13] Chen, L.C., Schwing, A.G., Yuille, A.L., Urtasun, R.: Learning deep structured models. ICML (2015)
  • [14] Codevilla, F., Müller, M., López, A., Koltun, V., Dosovitskiy, A.: End-to-end driving via conditional imitation learning. ICRA (2018)
  • [15] Cosgun, A., Ma, L., Chiu, J., Huang, J., Demir, M., Anon, A.M., Lian, T., Tafish, H., Al-Stouhi, S.: Towards full automated drive in urban environments: A demonstration in gomentum station, california. 2017 IEEE Intelligent Vehicles Symposium (IV) pp. 1811–1818 (2017)
  • [16] Cui, H., Radosavljevic, V., Chou, F.C., Lin, T.H., Nguyen, T., Huang, T.K., Schneider, J., Djuric, N.: Multimodal trajectory predictions for autonomous driving using deep convolutional networks. ICRA (2019)
  • [17] Djuric, N., Radosavljevic, V., Cui, H., Nguyen, T., Chou, F.C., Lin, T.H., Singh, N., Schneider, J.: Uncertainty-aware short-term motion prediction of traffic actors for autonomous driving. WACV (2020)
  • [18] Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: An open urban driving simulator. CoRL (2017)
  • [19] Fan, H., Zhu, F., Liu, C., Zhang, L., Zhuang, L., Li, D., Zhu, W., Hu, J., Li, H., Kong, Q.: Baidu Apollo EM motion planner. arXiv (2018)
  • [20] Fisac, J.F., Bronstein, E., Stefansson, E., Sadigh, D., Sastry, S.S., Dragan, A.D.: Hierarchical game-theoretic planning for autonomous vehicles. ICRA (2019)
  • [21] Rhinehart, N., McAllister, R., Levine, S.: Deep imitative models for flexible inference, planning, and control. ICLR (2020)
  • [22] Gao, J., Sun, C., Zhao, H., Shen, Y., Anguelov, D., Li, C., Schmid, C.: Vectornet: Encoding hd maps and agent dynamics from vectorized representation. CVPR (2020)
  • [23] Gupta, A., Johnson, J., Fei-Fei, L., Savarese, S., Alahi, A.: Social GAN: Socially acceptable trajectories with generative adversarial networks. CVPR (2018)
  • [24] Hong, J., Sapp, B., Philbin, J.: Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions. CVPR (2019)
  • [25] Kuleshov, V., Ermon, S.: Neural variational inference and learning in undirected graphical models. NIPS (2017)
  • [26] Lamb, A., Goyal, A., Zhang, Y., Zhang, S., Courville, A., Bengio, Y.: Professor forcing: A new algorithm for training recurrent networks. NIPS (2016)
  • [27] Lee, N., Choi, W., Vernaza, P., Choy, C.B., Torr, P.H.S., Chandraker, M.: Desire: Distant future prediction in dynamic scenes with interacting agents. CVPR (2017)
  • [28] Lefevre, S., Vasquez, D., Laugier, C.: A survey on motion prediction and risk assessment for intelligent vehicles. Robomech Journal 1 (07 2014). https://doi.org/10.1186/s40648-014-0001-z
  • [29] Li, L.L., Yang, B., Liang, M., Zeng, W., Ren, M., Segal, S., Urtasun, R.: End-to-end contextual perception and prediction with interaction transformer. IROS (2020)
  • [30] Liang, M., Yang, B., Hu, R., Chen, Y., Liao, R., Feng, S., Urtasun, R.: Learning lane graph representations for motion forecasting. ECCV (2020)
  • [31] Liang, M., Yang, B., Zeng, W., Chen, Y., Hu, R., Casas, S., Urtasun, R.: Pnpnet: End-to-end perception and prediction with tracking in the loop. CVPR (2020)
  • [32] Luo, W., Yang, B., Urtasun, R.: Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. CVPR (2018)
  • [33] Manivasagam, S., Wang, S., Wong, K., Zeng, W., Sazanovich, M., Tan, S., Yang, B., Ma, W.C., Urtasun, R.: Lidarsim: Realistic lidar simulation by leveraging the real world. In: CVPR (2020)
  • [34] McEliece, R., MacKay, D., Cheng, J.F.: Turbo decoding as an instance of pearl’s ”belief propagation” algorithm. IEEE Journal on Selected Areas in Communications (1998)
  • [35] Montemerlo, M., Becker, J., Bhat, S., Dahlkamp, H., Dolgov, D., Ettinger, S., Haehnel, D., Hilden, T., Hoffmann, G., Huhnke, B., Johnston, D., Klumpp, S., Langer, D., Levandowski, A., Levinson, J., Marcil, J., Orenstein, D., Paefgen, J., Penny, I., Thrun, S.: Junior: The stanford entry in the urban challenge. Journal of Field Robotics 25, 569 – 597 (09 2008). https://doi.org/10.1002/rob.20258
  • [36] Müller, M., Dosovitskiy, A., Ghanem, B., Koltun, V.: Driving policy transfer via modularity and abstraction. CoRL (2018)
  • [37] Nowozin, S., Lampert, C.H.: Structured learning and prediction in computer vision. Found. Trends. Comput. Graph. Vis. 6(3–4), 185–365 (Mar 2011). https://doi.org/10.1561/0600000033
  • [38] Phan-Minh, T., Grigore, E.C., Boulton, F.A., Beijbom, O., Wolff, E.M.: Covernet: Multimodal behavior prediction using trajectory sets. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14074–14083 (2020)
  • [39] Pomerleau, D.A.: Alvinn: An autonomous land vehicle in a neural network. NIPS (1989)
  • [40] Rezende, D.J., Mohamed, S.: Variational inference with normalizing flows. ICML (2015)
  • [41] Rhinehart, N., Kitani, K., Vernaza, P.: R2p2: A reparameterized pushforward policy for diverse, precise generative path forecasting. ECCV (2018)
  • [42] Rhinehart, N., McAllister, R., Kitani, K., Levine, S.: Precog: Prediction conditioned on goals in visual multi-agent settings. ICCV (2019)
  • [43] Sadat, A., Casas, S., Ren, M., Wu, X., Dhawan, P., Urtasun, R.: Perceive, predict, and plan: Safe motion planning through interpretable semantic representations. ECCV (2020)
  • [44] Sadat, A., Ren, M., Pokrovsky, A., Lin, Y.C., Yumer, E., Urtasun, R.: Jointly learnable behavior and trajectory planning for self-driving vehicles. IROS (2019)
  • [45] Sadigh, D., Sastry, S.S., Seshia, S.A., Dragan, A.: Information gathering actions over human internal state. IROS (2016)
  • [46] Sadigh, D., Sastry, S., Seshia, S.A., Dragan, A.D.: Planning for autonomous cars that leverage effects on human actions. RSS (2016)
  • [47] Sauer, A., Savinov, N., Geiger, A.: Conditional affordance learning for driving in urban environments. CoRL (2018)
  • [48] Saxena, D.M., Bae, S., Nakhaei, A., Fujimura, K., Likhachev, M.: Driving in dense traffic with model-free reinforcement learning. In: 2020 IEEE International Conference on Robotics and Automation (ICRA). pp. 5385–5392 (2020). https://doi.org/10.1109/ICRA40945.2020.9197132
  • [49] Schwing, A.G., Urtasun, R.: Fully connected deep structured networks. ArXiv (2015)
  • [50] Song, H., Ding, W., Chen, Y., Shen, S., Wong, M.Y., Chen, Q.: Pip: Planning-informed trajectory prediction for autonomous driving. ECCV (2020)
  • [51] Sun, L., Zhan, W., Tomizuka, M., Dragan, A.D.: Courteous autonomous cars. IROS (2018)
  • [52] Tang, Y.C., Salakhutdinov, R.: Multiple futures prediction. NeurIPS (2019)
  • [53] Wang, S., Fidler, S., Urtasun, R.: Proximal deep structured models. NeurIPS (2016)
  • [54] Wang, S., Schwing, A., Urtasun, R.: Efficient inference for contiuous markov random fields with polynomial potentials. NeurIPS (2014)
  • [55] Welch, G., Bishop, G.: An introduction to the kalman filter (1995)
  • [56] Wiseman, S., Kim, Y.: Amortized bethe free energy minimization for learning mrfs. NeurIPS (2019)
  • [57] Yang, B., Luo, W., Urtasun, R.: Pixor: Real-time 3d object detection from point clouds. CVPR (2018)
  • [58] Yedidia, J.S., Freeman, W.T., Weiss, Y.: Generalized belief propagation. NIPS (2000)
  • [59] Zeng, W., Luo, W., Suo, S., Sadat, A., Yang, B., Casas, S., Urtasun, R.: End-to-end interpretable neural motion planner. CVPR (2019)
  • [60] Zeng, W., Wang, S., Liao, R., Chen, Y., Yang, B., Urtasun, R.: Dsdnet: Deep structured self-driving network. ECCV (2020)
  • [61] Ziebart, B.D., Maas, A., Bagnell, J., Dey, A.K.: Maximum entropy inverse reinforcement learning. AAAI (2008)
  • [62] Ziegler, J., Bender, P., Dang, T., Stiller, C.: Trajectory planning for bertha - a local, continuous method. IEEE Intelligent Vehicles Symposium (2014)

Appendix A Model Details

In this section, we provide more precise model details regarding our joint structured model. Specifically, we first detail the dataset-dependent input representations used by our model (Sec. A-A). We then present the architecture details of our network, which predicts actor-specific and interaction energies between actors (Sec. A-B, Sec. A-C). This also includes details regarding our discrete trajectory sampler (Sec. A-D).

A-A Input Representation

As mentioned in the main paper, we assume that the trajectory history of the other actors, including their bounding box width/height and heading, is known to the ego-agent at the given timestep. Hence, we directly feed the trajectories of all actors to our model, transformed into the current ego-agent coordinate frame.

In addition, we add dataset-dependent context to the model. In UrbanCity and Nuscenes, we use both LiDAR sweeps and HD maps as context. Meanwhile, we directly use the input representation provided by the CARLA dataset [42], which contains a rasterized LiDAR representation but no map data.

A-A1 UrbanCity

We use a past window of 1 second at 10Hz as input context, and use an input region of $[-72,72]\times[-72,72]\times[-2,4]$ meters centered around the ego-agent. From this input region, we collect the past 10 LiDAR sweeps (1s), voxelize them with a $0.2\times0.2\times0.2$ meter resolution, and combine the time and z dimensions, creating a $720\times720\times300$ input tensor. We additionally rasterize the given HD map information at the same resolution. The HD maps include lane polylines, as well as polygons representing roads, intersections, and crossings; we rasterize these elements into separate channels.
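As an illustration of this rasterization, the sketch below voxelizes the 10 past sweeps into the $720\times720\times300$ occupancy tensor (10 sweeps times 30 height bins merged into the channel dimension); the function name and the binary-occupancy choice are assumptions.

```python
import numpy as np

def voxelize_sweeps(sweeps, x_range=(-72, 72), y_range=(-72, 72),
                    z_range=(-2, 4), voxel=0.2):
    """Voxelize past LiDAR sweeps into a BEV occupancy tensor.

    sweeps: list of 10 (P_t, 3) point arrays in the current ego frame.
    Returns a (720, 720, 10 * 30) = (720, 720, 300) binary tensor, with the
    time and z dimensions merged into the channel dimension.
    """
    nx = int((x_range[1] - x_range[0]) / voxel)   # 720
    ny = int((y_range[1] - y_range[0]) / voxel)   # 720
    nz = int((z_range[1] - z_range[0]) / voxel)   # 30
    grid = np.zeros((nx, ny, len(sweeps) * nz), dtype=np.float32)
    for t, pts in enumerate(sweeps):
        ix = ((pts[:, 0] - x_range[0]) / voxel).astype(int)
        iy = ((pts[:, 1] - y_range[0]) / voxel).astype(int)
        iz = ((pts[:, 2] - z_range[0]) / voxel).astype(int)
        ok = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny) & (iz >= 0) & (iz < nz)
        grid[ix[ok], iy[ok], t * nz + iz[ok]] = 1.0   # mark occupied voxels
    return grid
```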

A-A2 Nuscenes

The details are similar to UrbanCity. The main difference is that the input region is sized $[-49.6,49.6]\times[-49.6,49.6]\times[-3,5]$ meters, and the voxel resolution is $0.2\times0.2\times0.25$ meters, creating a $496\times496\times320$ input LiDAR tensor.

A-A3 CARLA

We use the input representation directly provided by the CARLA PRECOG dataset [42], which consists of $200\times200\times4$ features derived from rasterized LiDAR. Each channel contains a histogram of points within each cell at a given z-threshold (or over all heights). Additional details can be found at https://github.com/nrhine1/precog_carla_dataset/blob/master/README.md#overhead_features-format.

A-B Network Architecture Details

Here, we present additional details of the actor-specific and interaction energies of our model. As mentioned in the main paper, the actor-specific energies are parameterized with neural nets consisting of a few core components. The first component is a backbone network that takes in the input representation and computes intermediate spatial feature maps. Then, given the past trajectory of each actor, we sample $K$ trajectories per actor using a realistic trajectory sampler, described in Sec. A-D. These future actor trajectory samples, as well as the past actor trajectories and the backbone feature map, are in turn passed to our unary module to predict the actor-specific energy for each actor trajectory. Finally, our interaction energy is determined by computing collision and safety distance violations between actor trajectories.

A-B1 Backbone Network

Given the input representation, we pass it through a backbone network to compute intermediate feature maps that are spatially downsampled from the input resolution. This backbone network is inspired by the detection networks in [57, 59, 60]. The network consists of 5 sub-blocks, containing $[2,2,3,6,5]$ Conv2d layers with $[32,64,128,256,256]$ output channels respectively, with a 2x downsampling max-pooling layer in front of the 2nd–4th blocks. The input is fed through the first 4 blocks, generating intermediate feature maps of different spatial resolutions. These intermediate feature maps are then pooled/upsampled to the same resolution (4x downsampled from the input), and fed to the fifth block to generate the final feature map (at 4x downsample from the input). Each convolution is followed by a BatchNorm2d and ReLU layer.
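A sketch of a backbone with this block structure is shown below. How the multi-resolution features are merged before the fifth block (here: bilinear resizing to the 4x-downsampled resolution followed by concatenation) is an assumption, since the text above only states that they are pooled/upsampled to a common resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch, n_layers):
    layers = []
    for i in range(n_layers):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class Backbone(nn.Module):
    """Five blocks with [2, 2, 3, 6, 5] conv layers and [32, 64, 128, 256, 256]
    channels; 2x max-pooling before blocks 2-4; output at 4x downsample."""
    def __init__(self, in_channels):
        super().__init__()
        chans, n_layers = [32, 64, 128, 256], [2, 2, 3, 6]
        self.blocks = nn.ModuleList()
        prev = in_channels
        for c, n in zip(chans, n_layers):
            self.blocks.append(conv_block(prev, c, n))
            prev = c
        self.pool = nn.MaxPool2d(2)
        # Fifth block fuses the resized multi-scale features (concatenation assumed).
        self.block5 = conv_block(sum(chans), 256, 5)

    def forward(self, x):
        feats = []
        for i, block in enumerate(self.blocks):
            if i > 0:
                x = self.pool(x)              # 2x downsample before blocks 2-4
            x = block(x)
            feats.append(x)
        target = feats[2].shape[-2:]          # block 3 output is at 4x downsample
        feats = [F.interpolate(f, size=target, mode='bilinear', align_corners=False)
                 for f in feats]
        return self.block5(torch.cat(feats, dim=1))   # final 4x-downsampled feature map
```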

A-B2 Unary Module

We broadcast the past trajectory per actor with their KK future trajectory samples to create a N×KN\times K matrix of concatenated trajectories. The purpose of the unary network is then to predict the actor-specific energy for each actor trajectory sample.

We first define a Region of Interest (ROI) centered on each actor's current position and oriented according to each actor's heading. In UrbanCity/Nuscenes, we define the ROI to be $12.8\times12.8$ meters. In CARLA we define it to be much bigger, at $100\times100$ meters, due to the absence of map data in the training set. We then use this rotated ROI to extract a corresponding feature per actor from the backbone feature map. We then obtain a 1D representation of this ROI feature per actor by feeding it through an MLP consisting of 6 Conv2d layers with $[512,512,1024,1024,512,512]$ output filters, with 2x downsampling in the 1st, 3rd, and 5th layers, and an adaptive max-pool collapsing the spatial dimension at the end.

We additionally extract positional embeddings for each actor trajectory sample at each timestep by indexing the backbone feature map at the corresponding sample position at the given timestep, with bilinear interpolation. This extracts an $N\times K\times T\times 512$ dimensional tensor, where $T$ represents the total horizon of the trajectory (both past and future timesteps), and $512$ is the channel dimension of the backbone feature map. We collapse this tensor into a 1D representation as well: $N\times K\times(T\cdot 512)$.

Finally, we directly encode the trajectory information per timestep into a trajectory feature, consisting of $[\Delta_x, \Delta_y, \cos(\Delta_\theta), \sin(\theta), \Delta_d]$, where $\Delta_x, \Delta_y$ represent the displacement in the x and y directions from the previous timestep, $\Delta_d$ represents the displacement magnitude, and $\theta$ represents the current heading. The trajectory feature, positional embeddings, and 1D ROI feature are all concatenated along the channel dimension to form a final unary feature. This is fed to a final MLP consisting of 5 fully-connected layers with $1024, 1024, 512, 256, 1$ output channels respectively, and the output is an $N\times K$ matrix of unary energies.
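The positional-embedding step described above can be sketched with bilinear sampling of the backbone feature map at each waypoint; the coordinate-normalization convention and names below are assumptions.

```python
import torch
import torch.nn.functional as F

def positional_embeddings(feature_map, waypoints, grid_extent):
    """Bilinearly index the backbone feature map at trajectory waypoints.

    feature_map: (1, C, H, W) backbone output over the ego-centered BEV grid
    waypoints:   (N, K, T, 2) sample positions in metric BEV coordinates
    grid_extent: half-width of the BEV region in meters (e.g. 72 for UrbanCity)
    Returns an (N, K, T, C) tensor of per-waypoint features.
    """
    N, K, T, _ = waypoints.shape
    # Normalize metric coordinates to [-1, 1] as expected by grid_sample
    # (assuming x maps to grid width and y to grid height).
    grid = (waypoints / grid_extent).reshape(1, N * K, T, 2)
    feats = F.grid_sample(feature_map, grid, mode='bilinear', align_corners=False)
    # grid_sample returns (1, C, N*K, T); reorder to (N, K, T, C).
    return feats[0].permute(1, 2, 0).reshape(N, K, T, -1)
```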

A-C Interaction Energies

Our interaction energy is a pairwise energy between two actor trajectory samples that contains two non-learnable components: a collision cost and a safety distance cost. The collision detection is efficiently implemented using a GPU kernel computing the IOU among every pairwise sample from two actors, given their timestep, positions, bounding box width/height, and heading. The output is an $N\times N\times K\times K$ matrix, where a given entry is 1 if the trajectory sample of actor $i$ and the trajectory sample of actor $j$ collide at any point in the future, and 0 if not. The collision energy is only computed over future samples, not past samples.

Similarly, the safety distance violation is computed over all future actor samples. For a given pairwise sample between actor $i$ and actor $j$ at a given timestep, the distance from the center point of actor $i$ to the polygon of actor $j$ (minimal point-to-polygon distance) is computed. If the distance is within a given safety threshold, the energy is the squared distance of violation within the threshold. Note that unlike the collision energy, this matrix is not symmetric. This choice was made to use a GPU kernel that efficiently computes point distances to polygons in parallel.
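The sketch below illustrates both interaction terms for a single pair of actors, using a disc approximation of each vehicle footprint in place of the oriented-box IOU and point-to-polygon GPU kernels described above; the radii, thresholds, and function name are illustrative.

```python
import numpy as np

def pairwise_interaction(trajs_i, trajs_j, radius_i, radius_j,
                         safety_dist=4.0, gamma=100.0):
    """Simplified collision and safety-distance energies between two actors.

    trajs_i, trajs_j: (K, T, 2) future waypoints of the K samples of each actor
    radius_i, radius_j: disc radii approximating the vehicle footprints
    Returns two (K, K) matrices: collision energy and safety-violation energy.
    """
    # Pairwise center distances at every future timestep: (K, K, T).
    diff = trajs_i[:, None, :, :] - trajs_j[None, :, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Collision if the discs overlap at any future timestep.
    collision = (dist < radius_i + radius_j).any(axis=-1).astype(float) * gamma
    # Squared penalty for entering the safety distance, accumulated over time.
    violation = np.clip(safety_dist - (dist - radius_j), 0.0, None)
    safety = (violation ** 2).sum(axis=-1)
    return collision, safety
```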

A-D Trajectory Sampler Details

We follow the discrete trajectory sampler used in [59, 60]. The sampler first estimates the initial speed and heading of each actor from the provided past trajectory. From these values, the sampler draws from three trajectory modes: a straight line, a circular trajectory, or a spiral trajectory, with probabilities $[0.3, 0.2, 0.5]$. Within each mode, control parameters such as radius and acceleration are uniformly sampled within a range to generate a sampled trajectory. We use 50 trajectories per actor (including the ego-agent) for CARLA, and 100 trajectories per actor on UrbanCity and Nuscenes. Additional details can be found in [59, 60].
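A sketch of such a sampler is given below; the mode probabilities follow the $[0.3, 0.2, 0.5]$ split above, while the parameter ranges (acceleration, curvature, curvature rate) and the curvature-based integration are illustrative assumptions.

```python
import numpy as np

def sample_trajectory(x, y, heading, speed, horizon=4.0, dt=0.5, rng=np.random):
    """Sample one future trajectory from a straight / circular / spiral mode."""
    mode = rng.choice(['line', 'circle', 'spiral'], p=[0.3, 0.2, 0.5])
    accel = rng.uniform(-2.0, 2.0)                                  # m/s^2
    kappa = 0.0 if mode == 'line' else rng.uniform(-0.05, 0.05)     # curvature (1/m)
    kappa_rate = rng.uniform(-0.01, 0.01) if mode == 'spiral' else 0.0
    pts = []
    for _ in range(int(horizon / dt)):
        speed = max(0.0, speed + accel * dt)
        heading += kappa * speed * dt        # integrate heading through curvature
        kappa += kappa_rate * dt             # spiral: linearly varying curvature
        x += speed * np.cos(heading) * dt
        y += speed * np.sin(heading) * dt
        pts.append((x, y))
    return np.array(pts)                     # (T, 2) future waypoints
```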

Appendix B Model Properties

In this section, we discuss some additional properties of our model. First, we discuss the reasons and tradeoffs for excluding the actor-actor interaction term in the reactive objective (Sec. B-A). Next, recall that in addition to our reactive objective, we also implemented a non-reactive planning objective as a baseline: $f_{\text{nonreactive}} = \mathbb{E}_{\mathcal{Y}_r \sim p(\mathcal{Y}_r|\mathcal{X};\mathbf{w})}[C(\mathcal{Y},\mathcal{X};\mathbf{w})]$. We demonstrate that we can flexibly interpolate between the non-reactive and reactive objectives within our deep structured model by varying the size of the conditioning set in the prediction model of the reactive objective (Sec. B-B). Experimental results showcasing this behavior are presented in Sec. D. Moreover, we demonstrate that the non-reactive objective under our joint structured model is related to maximizing the marginal likelihood of the ego-agent, marginalizing out other actors (Sec. B-C).

B-A Excluding the Actor-Actor Interaction Term

We mention in Sec. III-B of the main paper that we exclude the actor-actor interaction term as a cost from our reactive planning objective. The primary reason is computational. To illustrate, we distribute the full expectation in the objective over the costs:

$$
\begin{aligned}
f_{\text{reactive}} ={}& C_{\text{traj}}^{{\mathbf{y}}_{0}} + {\mathbb{E}}_{p_{{\mathcal{Y}}_{r}|{\mathbf{y}}_{0}}}\Big[\sum_{i=1}^{N} C_{\text{inter}}^{{\mathbf{y}}_{0},{\mathbf{y}}_{i}} + \sum_{i=1}^{N} C_{\text{traj}}^{{\mathbf{y}}_{i}} + \sum_{i=1,j=1}^{N,N} C_{\text{inter}}^{{\mathbf{y}}_{i},{\mathbf{y}}_{j}}\Big] && (11) \\
={}& C_{\text{traj}}^{{\mathbf{y}}_{0}} + \sum_{i,{\mathbf{y}}_{i}} p_{{\mathbf{y}}_{i}|{\mathbf{y}}_{0}} C_{\text{inter}}^{{\mathbf{y}}_{0},{\mathbf{y}}_{i}} + \sum_{i,{\mathbf{y}}_{i}} p_{{\mathbf{y}}_{i}|{\mathbf{y}}_{0}} C_{\text{traj}}^{{\mathbf{y}}_{i}} + \sum_{i,j,{\mathbf{y}}_{i},{\mathbf{y}}_{j}} p_{{\mathbf{y}}_{i},{\mathbf{y}}_{j}|{\mathbf{y}}_{0}} C_{\text{inter}}^{{\mathbf{y}}_{i},{\mathbf{y}}_{j}} && (14)
\end{aligned}
$$

First, note that the last summation term requires computing a full $N\times N\times K\times K$ matrix (containing both the interaction cost between every pair of samples from two actors and the corresponding probabilities) for every value of ${\mathbf{y}}_{0}$. For our values of $N, K$, one such matrix will generally fit in memory on an Nvidia 1080Ti GPU, but additionally batching over every candidate ego-agent trajectory ${\mathbf{y}}_{0}$ (of which there are $K$) will not. Moreover, the Loopy Belief Propagation (LBP) algorithm used to obtain actor marginals provides marginal probabilities $p_{{\mathbf{y}}_{i}}$ and pairwise probabilities $p_{{\mathbf{y}}_{i},{\mathbf{y}}_{j}}$ [37] for all actor samples, which directly gives us the conditional actor marginals $p_{{\mathbf{y}}_{i}|{\mathbf{y}}_{0}}$ with a single LBP pass. However, $p_{{\mathbf{y}}_{i},{\mathbf{y}}_{j}|{\mathbf{y}}_{0}}$ is not readily provided by the algorithm, which would require running LBP once for every value of ${\mathbf{y}}_{0}$ to obtain these conditional pairwise marginals.
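Concretely, the conditional actor marginals above follow from the (approximate) LBP pairwise beliefs by a single normalization,
$$p_{{\mathbf{y}}_{i}|{\mathbf{y}}_{0}}=\frac{p_{{\mathbf{y}}_{i},{\mathbf{y}}_{0}}}{p_{{\mathbf{y}}_{0}}}=\frac{p_{{\mathbf{y}}_{i},{\mathbf{y}}_{0}}}{\sum_{{\mathbf{y}}_{i}}p_{{\mathbf{y}}_{i},{\mathbf{y}}_{0}}},$$
so a single LBP pass over the joint model suffices, whereas the conditional pairwise terms $p_{{\mathbf{y}}_{i},{\mathbf{y}}_{j}|{\mathbf{y}}_{0}}$ would require re-running inference once per candidate ego trajectory.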

We acknowledge that the actor-actor interaction term can capture situations that the actor-specific term does not, specifically those where the ego-agent's actions lead to dangerous interactions between two other actors (e.g., the SDV causes a neighboring car to swerve and nearly collide with another car). The term could potentially be approximated by only considering actors neighboring the SDV; we leave this for future work.

B-B Interpolation

Our model offers an additional level of flexibility: the ability to interpolate between the reactive and non-reactive objectives, which have thus far been presented as distinct. A potential advantage of such interpolation is the ability to trade off between conservative, non-reactive driving and more efficient navigation depending on the user's preference. Recall that our reactive objective is defined as $f={\mathbb{E}}_{{\mathcal{Y}}_{r}\sim p({\mathcal{Y}}_{r}|{\mathbf{y}}_{0},{\mathcal{X}};{\mathbf{w}})}[C({\mathcal{Y}},{\mathcal{X}};{\mathbf{w}})]$, which can be simplified into

$$f=C_{\text{traj}}^{{\mathbf{y}}_{0}}+\sum_{i,{\mathbf{y}}_{i}}p_{{\mathbf{y}}_{i}|{\mathbf{y}}_{0}}C_{\text{inter}}^{{\mathbf{y}}_{0},{\mathbf{y}}_{i}}+\sum_{i,{\mathbf{y}}_{i}}p_{{\mathbf{y}}_{i}|{\mathbf{y}}_{0}}C_{\text{traj}}^{{\mathbf{y}}_{i}} \qquad (15)$$

(see Sec. III-B, III-C). Similarly, our non-reactive baseline objective is defined as $f_{\text{nonreactive}}={\mathbb{E}}_{{\mathcal{Y}}_{r}\sim p({\mathcal{Y}}_{r}|{\mathcal{X}};{\mathbf{w}})}[C({\mathcal{Y}},{\mathcal{X}};{\mathbf{w}})]$, which simplifies to

$$f_{\text{nonreactive}}=C_{\text{traj}}^{{\mathbf{y}}_{0}}+\sum_{i,{\mathbf{y}}_{i}}p_{{\mathbf{y}}_{i}}C_{\text{inter}}^{{\mathbf{y}}_{0},{\mathbf{y}}_{i}} \qquad (16)$$

The key to interpolating between these two objectives lies in the conditional prediction model for a given actor ${\mathbf{y}}_{i}$ within the reactive objective, $p({\mathbf{y}}_{i}|{\mathbf{y}}_{0},{\mathcal{X}};{\mathbf{w}})$, which is currently conditioned on a single ego-agent plan. We can instead condition on a set $S^{{\mathbf{y}}_{0}}$ with $k$ elements, $1\leq k\leq K$, consisting of the top-$k$ candidate trajectories closest to ${\mathbf{y}}_{0}$ in L2 distance. We then define $p({\mathbf{y}}_{i}|S^{{\mathbf{y}}_{0}},{\mathcal{X}};{\mathbf{w}})=\frac{1}{Z}\sum_{\bar{{\mathbf{y}}}_{0}\in S^{{\mathbf{y}}_{0}}}p({\mathbf{y}}_{i},\bar{{\mathbf{y}}}_{0},{\mathcal{X}};{\mathbf{w}})$, where $Z$ is a normalizing constant. Intuitively, conditioning actor predictions on this set implies that actors do not know the exact plan of the SDV, but may have a rough idea of its general intent. When $|S^{{\mathbf{y}}_{0}}|=1$, we recover our reactive model. When $|S^{{\mathbf{y}}_{0}}|=K$, it is straightforward to see that $Z=1$, and we recover the actor marginals $p({\mathbf{y}}_{i}|{\mathcal{X}};{\mathbf{w}})$ used in the non-reactive model. Moreover, when $|S^{{\mathbf{y}}_{0}}|=K$ the actor-specific cost term $\sum_{i,{\mathbf{y}}_{i}}p_{{\mathbf{y}}_{i}|S^{{\mathbf{y}}_{0}}}C_{\text{traj}}^{{\mathbf{y}}_{i}}=\sum_{i,{\mathbf{y}}_{i}}p_{{\mathbf{y}}_{i}}C_{\text{traj}}^{{\mathbf{y}}_{i}}$ no longer depends on the candidate SDV trajectory ${\mathbf{y}}_{0}$; hence we can drop it from the planning objective, which yields the non-reactive objective.
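As a small illustration, the numpy sketch below builds the set-conditioned marginal $p({\mathbf{y}}_{i}|S^{{\mathbf{y}}_{0}})$ from a table of joint probabilities over (ego sample, actor sample) pairs; representing the joint as an explicit table, and the array names, are simplifying assumptions for exposition rather than the implementation used in the paper.

```python
import numpy as np

def interpolated_actor_marginal(joint_y0_yi, ego_samples, y0_idx, set_size):
    """joint_y0_yi: (K, K_i) array of joint probabilities p(y0=k, yi=m) for one actor.
    ego_samples: (K, T, 2) candidate ego trajectories. Returns p(yi | S^{y0}), where
    S^{y0} holds the `set_size` ego candidates closest (in L2) to candidate y0_idx."""
    K = ego_samples.shape[0]
    dists = np.linalg.norm((ego_samples - ego_samples[y0_idx]).reshape(K, -1), axis=1)
    members = np.argsort(dists)[:set_size]      # top-k closest ego candidates
    unnorm = joint_y0_yi[members].sum(axis=0)   # sum over ybar0 in S of p(ybar0, yi)
    return unnorm / unnorm.sum()                # normalize by Z

# set_size = 1 recovers the reactive conditional p(yi | y0);
# set_size = K recovers the actor marginal p(yi) used by the non-reactive objective.
K, K_i = 50, 50
joint = np.random.dirichlet(np.ones(K * K_i)).reshape(K, K_i)  # toy joint table
ego = np.random.randn(K, 10, 2)
p_reactive = interpolated_actor_marginal(joint, ego, y0_idx=0, set_size=1)
p_nonreactive = interpolated_actor_marginal(joint, ego, y0_idx=0, set_size=K)
assert np.allclose(p_nonreactive, joint.sum(axis=0) / joint.sum())
```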

Please see Sec. D for experimental results in our simulation scenarios at different conditioning set sizes.

B-C Non-Reactive Objective and Marginal Likelihood

Additionally, it is straightforward to show that the non-reactive objective is closely related to maximizing the marginal likelihood of the ego-agent. Let the marginal likelihood of the ego-agent be denoted $p({\mathbf{y}}_{0}|{\mathcal{X}};{\mathbf{w}})$. Then:

$$
\begin{aligned}
\ln p({\mathbf{y}}_{0}|{\mathcal{X}};{\mathbf{w}}) &= \ln {\mathbb{E}}_{{\mathcal{Y}}_{r}\sim p({\mathcal{Y}}_{r}|{\mathcal{X}};{\mathbf{w}})}[p({\mathbf{y}}_{0}|{\mathcal{Y}}_{r},{\mathcal{X}};{\mathbf{w}})] && (17) \\
&\geq {\mathbb{E}}_{{\mathcal{Y}}_{r}\sim p({\mathcal{Y}}_{r}|{\mathcal{X}};{\mathbf{w}})}[\ln p({\mathbf{y}}_{0}|{\mathcal{Y}}_{r},{\mathcal{X}};{\mathbf{w}})] && (18)
\end{aligned}
$$

The lower bound in (18) follows from Jensen's inequality. Maximizing this lower bound over ${\mathbf{y}}_{0}$ recovers our non-reactive planning objective (a minimization over the joint costs):

$$
\begin{aligned}
&\operatorname*{argmax}_{{\mathbf{y}}_{0}}\,{\mathbb{E}}_{{\mathcal{Y}}_{r}\sim p({\mathcal{Y}}_{r}|{\mathcal{X}};{\mathbf{w}})}[\ln p({\mathbf{y}}_{0}|{\mathcal{Y}}_{r},{\mathcal{X}};{\mathbf{w}})] && (19) \\
&=\operatorname*{argmax}_{{\mathbf{y}}_{0}}\,{\mathbb{E}}_{{\mathcal{Y}}_{r}\sim p({\mathcal{Y}}_{r}|{\mathcal{X}};{\mathbf{w}})}[\ln p({\mathcal{Y}},{\mathcal{X}};{\mathbf{w}})] && (20) \\
&=\operatorname*{argmin}_{{\mathbf{y}}_{0}}\,{\mathbb{E}}_{{\mathcal{Y}}_{r}\sim p({\mathcal{Y}}_{r}|{\mathcal{X}};{\mathbf{w}})}[C({\mathcal{Y}},{\mathcal{X}};{\mathbf{w}})] && (21)
\end{aligned}
$$
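For completeness, the step from (19) to (20) uses the factorization
$$\ln p({\mathbf{y}}_{0}|{\mathcal{Y}}_{r},{\mathcal{X}};{\mathbf{w}})=\ln p({\mathcal{Y}},{\mathcal{X}};{\mathbf{w}})-\ln p({\mathcal{Y}}_{r},{\mathcal{X}};{\mathbf{w}}),$$
where the second term is constant in ${\mathbf{y}}_{0}$ and can be dropped from the argmax; the step from (20) to (21) then follows since $p({\mathcal{Y}},{\mathcal{X}};{\mathbf{w}})\propto\exp(-C({\mathcal{Y}},{\mathcal{X}};{\mathbf{w}}))$ under our energy-based model.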

This implies that maximizing the marginal likelihood of the SDV trajectory under our model can be viewed as a form of non-reactive planning.

Appendix C Effects of Varying Weights on Planning Costs

Figure 4: Analyzing the impact of varying $\lambda_{b}$ (SDV/actor interaction weight) in CARLA (top) and $\lambda_{c}$ (actor-specific weight) in Simba (bottom).

In practice, when implementing our reactive objective (Sec. III-B of the main paper), we place weights on the SDV/actor interaction energy ($\lambda_{b}$) and the actor-specific energy ($\lambda_{c}$) to more flexibly control for safety during planning. These weight values, for both our reactive planner and the non-reactive baseline, are determined on the scenario validation set. To provide further insight into the planning costs, we analyze the impact of varying $\lambda_{b},\lambda_{c}$ on our closed-loop evaluation metrics. We observe that as $\lambda_{b}$ decreases, collision rates for both the non-reactive and reactive models go up while time to completion trends slightly down (Fig. 4, top). This is reasonable given that $\lambda_{b}$ directly controls the weight on the pairwise energy, which includes collision. Additionally, when varying $\lambda_{c}$ in Simba, we observe that while the variation in time to completion is negligible for the reactive model, the collision rate does trend upward as $\lambda_{c}$ is decreased, implying that the actor unary energy does play a role in maintaining safer behavior. Of course, we also emphasize that $\lambda_{c}$ makes no difference in the non-reactive results, since the actor unary term cancels out of the non-reactive objective (see Sec. III-B in the main paper).
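One natural way to write the weighted reactive cost, assuming the weights simply scale the corresponding terms of Eq. (15), is
$$f=C_{\text{traj}}^{{\mathbf{y}}_{0}}+\lambda_{b}\sum_{i,{\mathbf{y}}_{i}}p_{{\mathbf{y}}_{i}|{\mathbf{y}}_{0}}C_{\text{inter}}^{{\mathbf{y}}_{0},{\mathbf{y}}_{i}}+\lambda_{c}\sum_{i,{\mathbf{y}}_{i}}p_{{\mathbf{y}}_{i}|{\mathbf{y}}_{0}}C_{\text{traj}}^{{\mathbf{y}}_{i}},$$
so that decreasing $\lambda_{b}$ down-weights the collision/safety-distance penalties and decreasing $\lambda_{c}$ down-weights the actor unary term.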

Appendix D Interpolation Results between Non-Reactive / Reactive

Simulator   |S^{y_0}|           Success (%)   TTC (s)   Goal Distance (m)   Collision Rate (%)   Actor Brake
CARLA       1 (full reactive)   72.0          12.8      2.0                 5.0                  41.8
CARLA       0.2K                52.0          14.6      3.3                 1.0                  37.7
CARLA       0.4K                42.0          15.2      3.7                 4.0                  37.3
CARLA       0.6K                46.0          14.9      4.7                 5.0                  32.8
CARLA       0.8K                52.0          14.5      4.3                 5.0                  36.7
CARLA       K (non-reactive)    45.0          16.0      4.4                 5.0                  37.1
Simba       1 (full reactive)   82.0          6.8       4.3                 3.5                  -
Simba       0.2K                73.5          7.5       5.4                 3.5                  -
Simba       0.4K                76.5          7.4       5.4                 2.5                  -
Simba       0.6K                70.5          7.5       5.2                 3.5                  -
Simba       0.8K                68.0          7.6       5.2                 3.5                  -
Simba       K (non-reactive)    70.0          7.5       5.2                 3.5                  -
TABLE V: Interpolating between the reactive and non-reactive models by varying the size of the conditioning set $S^{{\mathbf{y}}_{0}}$, in CARLA and Simba. TTC denotes time to completion.

Sec. B-B of this document showed that we can flexibly interpolate between the reactive and non-reactive objectives by increasing the size of the conditioning set $S^{{\mathbf{y}}_{0}}$. We demonstrate this interpolation, averaged across the CARLA/Simba scenarios, in Tab. V. The extremes $|S^{{\mathbf{y}}_{0}}|=1$ and $|S^{{\mathbf{y}}_{0}}|=K$ demonstrate the tradeoff between fully reactive and fully non-reactive behavior. The metrics are pulled more strongly towards the non-reactive side as the set size increases: success rates trend downward and times to completion trend upward, consistent with the difference in performance between the fully reactive and non-reactive objectives in Tab. I of the main paper. We additionally observe that collision rates are roughly similar across the different conditioning set sizes. While the results are not conclusive, they hint at a setting where an interpolated planning objective can achieve high success rates while planning more safely than either extreme.

Appendix E PRECOG Details

Here, we provide more details regarding our PRECOG implementation [42]. PRECOG is a planning objective based on a conditional forecasting model called Estimating Social-forecast Probability (ESP). We first implement ESP and verify that it reproduces prediction metrics provided by the authors in CARLA (also see Table III in the main paper). We then attach the PRECOG planning objective on top.

E-A ESP Architecture

The ESP architecture largely follows the details specified in the original paper, with slight modifications similar to the insights in [10]. First, we use the same whisker featurization scheme as in the paper, but due to memory limitations in UrbanCity we sample from a set of three radii $[1,2,4]$ instead of the original six. Our past-trajectory encoder is a GRU with a hidden state of size 128 that runs sequentially over the past time dimension and takes the vehicle-relative coordinates, width/height, and heading as inputs. Moreover, since our scenes can contain a variable number of actors, as opposed to the constant number in the original paper, we use $k$-nearest neighbors with $k=4$ to select the nearest-neighbor features at every future timestep. Finally, we found that in the autoregressive setting, training with direct teacher forcing [26], i.e., conditioning the next state on the ground-truth current state, caused a large mismatch between training and inference. Instead, we add white noise of 0.2 m to the conditioning ground-truth states during training to better reflect the error encountered at inference.
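The noisy teacher-forcing modification can be summarized by the short sketch below; the tensor layout and the Gaussian form of the perturbation are our assumptions (the text above only specifies 0.2 m of noise on the conditioning states).

```python
import torch

def noisy_teacher_forcing(gt_states, noise_scale=0.2):
    """gt_states: (batch, T, num_actors, 2) ground-truth (x, y) positions used to
    condition the autoregressive decoder during training. We perturb them with
    roughly 0.2 m of white noise so the conditioning distribution better matches
    the model's own (imperfect) rollouts at inference time. The Gaussian form of
    the noise is an assumption; any zero-mean perturbation of this scale plays
    the same role."""
    return gt_states + noise_scale * torch.randn_like(gt_states)

# During training, the next-step prediction is conditioned on
# noisy_teacher_forcing(gt_states[:, :t]) instead of gt_states[:, :t].
```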

E-B Implementation of PRECOG objective

The PRECOG planning objective is given by:

$$\mathbf{z}^{r*}=\operatorname*{argmax}_{\mathbf{z}^{r}}\;\mathbb{E}_{\mathbf{Z}^{h}}\big[\log q(f(\mathbf{Z})|\phi)+\log p(\mathcal{G}|f(\mathbf{Z}),\phi)\big] \qquad (22)$$

where the second term is the goal likelihood and the first term is the "multi-agent" prior, a joint density term that can be readily evaluated by the model. To plan with the PRECOG objective, one optimizes the ego-agent latent $\mathbf{z}^{r}$ under an expectation over latents sampled for the other actors, $\mathbf{Z}^{h}$.

The joint density term can be evaluated by accumulating the log-likelihood of the decoded Gaussian at each timestep: $\log q(f(\mathbf{Z}))=\log q(\mathbf{S})=\sum_{t=1}^{T}\log q(\mathbf{S}_{t}|\mathbf{S}_{1:t-1})=\sum_{t=1}^{T}\log\mathcal{N}(\mathbf{S}_{t};\mu_{t},\Sigma_{t})$. For the goal likelihood, the authors use the example of a goal state penalized by L2 distance (i.e., a Gaussian goal distribution), which translates directly into our definition of a goal energy for both goal states and goal lanes. Hence, we use the same goal-energy definition given in Sec. IV-A of the main paper to compute the goal likelihood. We weight the prior likelihood and the goal likelihood with two hyperparameters $\lambda_{1},\lambda_{2}$, which are determined on the validation sets.

In practice, we implement the PRECOG objective as follows. We sample 100 ego-agent latents $\mathbf{z}^{r}$, effectively using random shooting rather than gradient descent; in discussion with the authors, they confirmed that the results should be similar. For each ego-agent latent, we then sample 15 joint actor latents $\mathbf{Z}^{h}$ as a Monte-Carlo approximation of the expectation. We evaluate the goal/prior likelihood costs for each candidate ego-agent latent and select the one with the smallest cost. Evaluating the planning costs for all candidate samples can be done efficiently in a batched manner with a single forward GPU pass. Note that there is a subtlety in selecting the optimal ego-agent latent $\mathbf{z}^{r}$: since PRECOG is a joint autoregressive model, an ego-agent latent does not correspond to a fixed ego-agent trajectory, as the final trajectory also depends on the other actors' latents. We sidestep this issue in simulation by replanning frequently (every 0.3 s) and by observing that the other actors' latents generally do not perturb the ego-agent trajectory much.
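A schematic version of this random-shooting procedure is sketched below. The callables `esp_decode`, `joint_log_prob`, and `goal_log_prob` stand in for the ESP decoder and the two likelihood terms; they are placeholders rather than actual PRECOG API calls, and the nested loop is written out for clarity where our implementation batches everything into a single forward pass.

```python
import torch

def plan_with_precog(esp_decode, joint_log_prob, goal_log_prob, phi, goal,
                     n_ego=100, n_actor=15, n_actors=5, horizon=10, latent_dim=2):
    """Random-shooting approximation of Eq. (22): pick the ego latent whose expected
    (prior + goal) log-likelihood, averaged over sampled actor latents, is highest
    (equivalently, whose cost is smallest)."""
    z_ego = torch.randn(n_ego, horizon, latent_dim)                        # candidate ego latents
    z_actors = torch.randn(n_ego, n_actor, n_actors, horizon, latent_dim)  # Monte-Carlo actor latents
    scores = torch.zeros(n_ego)
    for i in range(n_ego):
        for j in range(n_actor):
            # Decode the joint scene rollout S = f(Z) for this latent combination.
            rollout = esp_decode(z_ego[i], z_actors[i, j], phi)
            scores[i] += joint_log_prob(rollout, phi) + goal_log_prob(rollout, goal)
    scores /= n_actor                                                      # expectation over actor latents
    best = torch.argmax(scores)
    return z_ego[best]
```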

E-C Discussion of Results

As indicated in Table I of the main paper, the PRECOG model underperforms both our non-reactive and reactive objectives based on the energy-based model. We qualitatively analyzed some of the simulation scenarios and offer a few hypotheses. First, since PRECOG does not explicitly define a collision prior, it is possible that the model does not try to avoid collisions in all cases, particularly on test scenarios that are out-of-distribution relative to the training data (especially in CARLA, where the traffic density in simulation is higher than in training). Second, sampling in latent space does not guarantee a diverse range of trajectories for the ego-agent. In fact, we notice that in some turning scenarios where we set the goal state to be the result of a turn, the ego-agent still goes straight, even when the prior-likelihood weight is set to 0 and the number of ego-agent samples is large (we tried up to 1000). We hypothesize that this is partially due to test-time distribution shift. Nevertheless, we consider learned autoregressive models a promising direction to keep in mind for the future.

We showcase a qualitative example comparing PRECOG with our model in the next section, in Fig. 6.

Appendix F Additional Qualitative Results

We present a few additional qualitative results to better highlight the difference between our reactive and non-reactive models in various interactive scenarios. For each demonstration, we provide snapshots of the simulation at different timesteps. As with the results in the main paper, the ego-agent and its planned trajectory are in green, the other actors are shown in an assortment of other colors, and the goal position or lane is in cyan. Results are best viewed by zooming in with a digital reader.

In Fig. 5 we show the ego-agent performing an unprotected left turn in front of an oncoming vehicle in Simba. We first emphasize that since this scenario was initialized from a real driving sequence, the actual “expert” trajectory also performed the unprotected left turn against the same oncoming vehicle, implying that such a maneuver is not unsafe given the oncoming vehicle's position and velocity. The visualizations show that the reactive model successfully performs the left turn, while the non-reactive model surprisingly gets stuck in place, even as the oncoming vehicle slows down. We speculate that this may be because the model prefers to stay still rather than violate the safety distance of the other actor.

Fig. 6 compares our reactive/non-reactive models with PRECOG on a left turn at a busy intersection. Both the reactive and non-reactive models reach the goal state, though admittedly they violate lane boundaries in doing so (lane following is not explicitly encoded as an energy in our model). Interestingly, the PRECOG model plans a trajectory to the goal at $t=1$ but is unable to complete it at later timesteps, implying either that the latent samples for the ego-agent do not capture such a behavior or that the prior-likelihood cost is too high to go any further. It is possible that the model could be tuned further, in terms of the data, the training scheme, and the PRECOG evaluation procedure, so we mostly present these as initial results for future investigation.

In Figs. 7 and 8 we show a lane merge scenario and a roundabout turn scenario in CARLA. These are complex scenarios involving multiple actors in the lane the ego-agent is supposed to merge into. In Fig. 7, the reactive agent spots a gap at $t=2$ and merges in at $t=3$, whereas the non-reactive agent keeps going straight until $t=3$, and even then it wavers between merging and going straight. Fig. 8 shows the ego-agent merging into a roundabout with multiple actors. While both models reach similar states initially, towards the end the reactive model reasons that it can still merge in, while the non-reactive model is left waiting for all the actors to pass.

Figure 5: Visualization of a Simba turn for the non-reactive (bottom) and reactive (top) models at four time steps: 0s, 1s, 2s, 3s (left-to-right).
Figure 6: Visualization of a Simba turn scenario for the reactive (top), non-reactive (middle), and PRECOG (bottom) models at four time steps: 0s, 1s, 2s, 3s (left-to-right).
Figure 7: Visualization of a CARLA lane merge for the non-reactive (bottom) and reactive (top) models at four time steps: 1s, 2s, 3s, 4s (left-to-right).
Figure 8: Visualization of a CARLA roundabout turn for the non-reactive (bottom) and reactive (top) models at four time steps: 0s, 1s, 2s, 3s (left-to-right).