
Artificial Intelligence Research Laboratory, ETRI
{d1024.choi,kwmin92}@etri.re.kr

Hierarchical Latent Structure for Multi-Modal Vehicle Trajectory Forecasting

Dooseop Choi (corresponding author), KyoungWook Min
Abstract

The variational autoencoder (VAE) has been widely utilized for modeling data distributions because it is theoretically elegant, easy to train, and has nice manifold representations. However, when applied to image reconstruction and synthesis tasks, the VAE exhibits the limitation that generated samples tend to be blurry. We observe that a similar problem, in which the generated trajectory is located between adjacent lanes, often arises in VAE-based trajectory forecasting models. To mitigate this problem, we introduce a hierarchical latent structure into the VAE-based forecasting model. Based on the assumption that the trajectory distribution can be approximated as a mixture of simple distributions (or modes), the low-level latent variable is employed to model each mode of the mixture and the high-level latent variable is employed to represent the weights for the modes. To model each mode accurately, we condition the low-level latent variable using two lane-level context vectors computed in novel ways, one corresponding to vehicle-lane interaction and the other to vehicle-vehicle interaction. The context vectors are also used to model the weights via the proposed mode selection network. To evaluate our forecasting model, we use two large-scale real-world datasets. Experimental results show that our model is not only capable of generating clear multi-modal trajectory distributions but also outperforms the state-of-the-art (SOTA) models in terms of prediction accuracy. Our code is available at https://github.com/d1024choi/HLSTrajForecast.

1 Introduction

Trajectory forecasting has long been of great interest in autonomous driving since accurate predictions of the future trajectories of traffic agents are essential for the safe motion planning of an autonomous vehicle (AV). Many approaches have been proposed for trajectory forecasting in the literature, and remarkable progress has been made in recent years. The recent trend in trajectory forecasting is to predict multiple possible trajectories for each agent in the traffic scene. This is because human drivers' future behavior is uncertain, and consequently, the future motion of an agent naturally exhibits a multi-modal distribution.

Latent variable models, such as variational autoencoders (VAEs) [19] and generative adversarial networks (GANs) [13], have been used for modeling the distribution over the agents’ future trajectories. Using latent variables, trajectory forecasting models can learn to capture agent-agent and agent-space interactions from data, and consequently, generate future trajectories that are compliant with the input scene contexts.

VAEs have been applied in many machine learning applications, including image synthesis [15, 33], language modeling [3, 34], and trajectory forecasting [21, 5] because they are theoretically elegant, easy to train, and have nice manifold representations. One of the limitations of VAEs is that the generated sample tends to be blurry (especially in image reconstruction and synthesis tasks) [37]. We found from our experiments that a similar problem often arises in VAE-based trajectory forecasting models. More specifically, it is often found that the generated trajectory is located between adjacent lanes, as illustrated in Figure 1. These false positive motion forecasts can cause uncomfortable rides for the AV, with frequent sudden braking and steering changes [6]. In the rest of this paper, we refer to this problem as mode blur, as instance-level lanes are closely related to the modes of the trajectory distribution [16]. Mode blur is also found in the recent SOTA model [8], as shown in the supplementary materials.

Many approaches have been proposed to mitigate the blurry sample generation problem, primarily for image reconstruction or synthesis tasks. In this paper, we introduce a hierarchical latent structure into a VAE-based forecasting model to mitigate mode blur. Based on the assumption that the trajectory distribution can be approximated as a mixture of simple distributions (or modes), the low-level latent variable is employed to model each mode of the mixture and the high-level latent variable is employed to represent the weights for the modes. As a result, the forecasting model is capable of generating clear multi-modal trajectory distributions. To model each mode accurately, we condition the low-level latent variable using two lane-level context vectors (one corresponding to vehicle-lane interaction (VLI) and the other to vehicle-vehicle interaction (V2I)) computed in novel ways. The context vectors are also used to model the weights via the proposed mode selection network. Lastly, we introduce two techniques to further improve the prediction performance of our model: 1) positional data preprocessing and 2) GAN-based regularization. The preprocessing is introduced based on the fact that vehicles moving along a lane usually try to stay parallel to the tangent vector of the lane. The regularization is intended to ensure that the proposed model generates trajectories that match the shape of the lanes well.

In summary, our contributions are as follows:

  • The hierarchical latent structure is introduced in the VAE-based forecasting model to mitigate mode blur.

  • Two context vectors (one corresponding to the VLI and the other to the V2I), calculated in novel ways, are proposed for lane-level scene contexts.

  • Positional data preprocessing and GAN-based regularization are introduced to further improve the prediction performance.

  • Our forecasting model outperforms the SOTA models in terms of prediction accuracy on two large-scale real-world datasets.

Figure 1: Mode blur in trajectory forecasting and our approach. (a) Generated trajectories tend to be located between adjacent lanes. (b) We let a latent variable model each mode of the trajectory distribution to mitigate mode blur. (c) The target vehicle (red) takes into account not only its reference lane (red dashed line) but also the surrounding lanes (black dashed lines) and the surrounding vehicles. Only the surrounding vehicles within a certain distance from the reference lanes (green lines with arrows) influence the future motion of the target vehicle.

2 Related Works

2.1 Limitations of VAEs

The VAE framework has been used to explicitly learn data distributions. Models based on the VAE framework learn mappings from samples in a dataset to points in a latent space and generate plausible samples from variables drawn from the latent space. VAE-based generative models are known to suffer from two problems: 1) posterior collapse (the model ignores the latent variable when generating samples) and 2) blurry sample generation. To mitigate these problems, many approaches have been proposed in the literature, primarily for image reconstruction or synthesis tasks [29, 15, 11, 18, 28, 14, 36, 33]. In trajectory forecasting, some researchers [5, 31] have employed techniques for mitigating posterior collapse. To mitigate blurry sample generation, [2] proposed a "best-of-many" sample objective that leads to accurate and diverse trajectory generation.

2.2 Forecasting with Lane Geometry

Because the movement of vehicles on the road is greatly restricted by the lane geometry, many works have been proposed that utilize the lane information provided by high-definition (HD) maps [9, 5, 31, 24, 10, 27, 12, 23, 16, 26]. There are two types of approaches to representing the lane information: 1) rasterizing the components of the HD maps on a 2D canvas to obtain top-view images of the HD maps, or 2) representing each component of the HD maps as a series of coordinate points. In general, a convolutional neural network (CNN) is utilized in the former case, while a long short-term memory (LSTM) network or 1D-CNN is utilized in the latter case to encode the lane information. In this paper, we adopt the second approach. The centerline of each lane in the HD maps is first represented as a series of equally spaced 2D coordinates and then encoded by an LSTM network. The ability to handle individual lanes in the HD maps allows us to calculate lane-level scene contexts.

2.3 Lane-level Scene Context

Since instance-level lanes are closely related to the modes of the trajectory distribution, recent works [24, 16, 26, 10] proposed calculating lane-level scene contexts and using them to generate trajectories. Our work shares this idea with the previous works. However, ours differs from them in the way it calculates the lane-level scene contexts, which leads to significant gains in prediction performance. Instead of considering only a single lane for a lane-level scene context, we also take into account the surrounding lanes along with their relative importance. The relative importance is calculated based on the past motion of the target vehicle, thus reflecting the vehicle-lane interaction. In addition, for the interaction between the target vehicle and the surrounding vehicles, we consider only the surrounding vehicles within a certain distance from the reference lane, as illustrated in Figure 1c. This approach shows improved prediction performance compared to the existing approaches that consider either all neighbors [26] or only the most relevant neighbor [16]. This result is consistent with the observation that only a subset of surrounding vehicles is indeed relevant when predicting the future trajectory of the target vehicle [22].

3 Proposed Method

In this section, we present the details of our trajectory forecasting model.

3.1 Problem Formulation

Assume that there are $N$ vehicles in the traffic scene. We aim to generate plausible trajectory distributions $p(\mathbf{Y}_i|\mathbf{X}_i,\mathcal{C}_i)$ for the vehicles $\{V_i\}_{i=1}^{N}$. Here, $\mathbf{X}_i=\mathbf{p}_i^{(t-H:t)}$ denotes the positional history of $V_i$ for the previous $H$ timesteps at time $t$, $\mathbf{Y}_i=\mathbf{p}_i^{(t+1:t+T)}$ denotes the future positions of $V_i$ for the next $T$ timesteps, and $\mathcal{C}_i$ denotes additional scene information available to $V_i$. For $\mathcal{C}_i$, we use the positional histories of the surrounding vehicles $\{\mathbf{X}_j\}_{j=1,j\neq i}^{N}$ and the lane candidates $\mathbf{L}^{(1:M)}$ available for $V_i$ at time $t$, where $\mathbf{L}^m=\mathbf{l}_{1,...,F}^{m}$ denotes the $F$ equally spaced coordinate points on the centerline of the $m$-th lane. Finally, we note that all positional information is expressed in the coordinate frame defined by $V_i$'s current position and heading. According to [16], $p(\mathbf{Y}_i|\mathbf{X}_i,\mathcal{C}_i)$ can be re-written as

p(\mathbf{Y}_i|\mathbf{X}_i,\mathcal{C}_i)=\sum_{m=1}^{M}\underbrace{p(\mathbf{Y}_i|E_m,\mathbf{X}_i,\mathcal{C}_i)}_{\text{mode}}\underbrace{p(E_m|\mathbf{X}_i,\mathcal{C}_i)}_{\text{weight}},    (1)

where $E_m$ denotes the event that $\mathbf{L}^m$ becomes the reference lane for $V_i$. Equation 1 shows that the trajectory distribution can be expressed as a weighted sum of distributions, which we call modes. The fact that the modes are usually much simpler than the overall distribution inspired us to model each mode through a latent variable and to sample trajectories from the modes in proportion to their weights, as illustrated in Figure 1b.

3.2 Forecasting Model with Hierarchical Latent Structure

We introduce two latent variables $\mathbf{z}_l\in\mathbb{R}^D$ and $\mathbf{z}_h\in\mathbb{R}^M$ to model the modes and the weights for the modes in Eq. 1. With the low-level latent variable $\mathbf{z}_l$, our forecasting model defines $p(\mathbf{Y}_i|E_m,\mathbf{X}_i,\mathcal{C}_i)$ by using the decoder network $p_\theta(\mathbf{Y}_i|\mathbf{z}_l,\mathbf{X}_i,\mathcal{C}_i^m)$ and the prior network $p_\gamma(\mathbf{z}_l|\mathbf{X}_i,\mathcal{C}_i^m)$ based on

p(\mathbf{Y}_i|E_m,\mathbf{X}_i,\mathcal{C}_i)=\int_{\mathbf{z}_l}p(\mathbf{Y}_i|\mathbf{z}_l,\mathbf{X}_i,\mathcal{C}_i^m)\,p(\mathbf{z}_l|\mathbf{X}_i,\mathcal{C}_i^m)\,d\mathbf{z}_l,    (2)

where $\mathcal{C}_i^m\subset\mathcal{C}_i$ denotes the scene information relevant to $\mathbf{L}^m$. To train our forecasting model, we employ the conditional VAE framework [32] and optimize the following modified ELBO objective [14]:

\mathcal{L}_{ELBO}=-\mathbb{E}_{\mathbf{z}_l\sim q_\phi}[\log p_\theta(\mathbf{Y}_i|\mathbf{z}_l,\mathbf{X}_i,\mathcal{C}_i^m)]+\beta\,KL(q_\phi(\mathbf{z}_l|\mathbf{Y}_i,\mathbf{X}_i,\mathcal{C}_i^m)\,||\,p_\gamma(\mathbf{z}_l|\mathbf{X}_i,\mathcal{C}_i^m)),    (3)

where $\beta$ is a constant and $q_\phi(\mathbf{z}_l|\mathbf{Y}_i,\mathbf{X}_i,\mathcal{C}_i^m)$ is the approximated posterior network. The weights for the modes $p(E_m|\mathbf{X}_i,\mathcal{C}_i)$ are modeled by the high-level latent variable $\mathbf{z}_h$, which is the output of the proposed mode selection network $\mathbf{z}_h=f_\varphi(\mathbf{X}_i,\mathcal{C}_i^{(1:M)})$.

As shown in Eq. 3 and the definition of the mode selection network, the performance of our forecasting model depends on how the lane-level scene information $\mathcal{C}_i^m$ is utilized along with $\mathbf{X}_i$ to define the lane-level scene context. One can consider two interactions for the lane-level scene context: the VLI and the V2I. This is because the future motion of the vehicle is highly restricted not only by the vehicle's motion history but also by the motion histories of the surrounding vehicles and the lane geometry of the road. For the VLI, the existing works [10, 16, 24, 26] considered only the reference lane. For the V2I, [16] considered only the one vehicle most relevant to the reference lane, while the others considered all vehicles. In this paper, we present novel ways of defining the two interactions. For the VLI, instead of considering only the reference lane, we also take into account the surrounding lanes along with their relative importance, which is calculated based on the target vehicle's motion history. The V2I is encoded through a GNN by considering only the surrounding vehicles within a certain distance from the reference lane. Our approach is based on the fact that human drivers often pay attention to surrounding lanes, and to the vehicles occupying those lanes, when driving along the reference lane; driving behaviors such as lane changes and overtaking are examples.

Figure 2: Overall architecture of our forecasting model. To generate $K$ future trajectories of $V_i$, lane-level scene context vectors $\{\mathbf{c}_i^m\}_{m=1}^{M}$, each of which corresponds to one of $\mathbf{L}^{(1:M)}$, are first calculated via the scene context extraction module. Next, $\{w_m\}_{m=1}^{M}$ ($w_m$ denotes the probability that $V_i$ will drive along $\mathbf{L}^m$ in the future) are calculated by using $\mathbf{z}_h$. Finally, $\lfloor K\times w_m\rfloor$ out of the $K$ future trajectories are generated by the decoder network using $\mathbf{c}_i^m$ and $\mathbf{z}_l$.

3.3 Proposed Network Structure

We show in Fig. 2 the overall architecture of our forecasting model. In the following sections, we describe the details of our model.

3.3.1 Feature Extraction Module:

Three LSTM networks are used to encode the positional data $\{\mathbf{X}_a\}_{a=1}^{N}$, $\mathbf{Y}_i$, and $\mathbf{L}^{(1:M)}$, respectively. The last hidden state vector of each network is used as the encoding result. Before the encoding process, we preprocess the positional data. For the vehicles, we calculate the speed and heading at each timestep and concatenate the sequential speed and heading data to the original data along the data dimension. As a result, $\{\mathbf{X}_a\}_{a=1}^{N}$ and $\mathbf{Y}_i$ have a data dimension of size 4 (x-position, y-position, speed, and heading). For the lanes, at each coordinate point we calculate the tangent vector and the direction of the tangent vector. The sequential tangential and directional data are concatenated to the original data along the data dimension. As a result, $\mathbf{L}^{(1:M)}$ have a data dimension of size 5 (2D position vector, 2D tangent vector, and direction). We introduce this preprocessing step so that our model can better infer the future positions of the target vehicle from the historical speed and heading records and the tangential data, based on the fact that vehicles moving along a lane usually try to stay parallel to the tangent vector of the lane. As shown in Table 1, the prediction performance of our model is improved by the preprocessing step. In the rest of this paper, we use a tilde on top of a variable to indicate that it is the result of the encoding process. For example, the encoding result of $\mathbf{X}_i$ is expressed as $\tilde{\mathbf{X}}_i$.

3.3.2 Scene Context Extraction Module:

Two lane-level context vectors are calculated in this stage. Assume that $\mathbf{L}^m$ is the reference lane for $V_i$. The context vector $\mathbf{a}_i^m$ for the VLI is calculated as follows:

\mathbf{a}_i^m=[\tilde{\mathbf{L}}^m;\sum_{l=1,l\neq m}^{M}\alpha_l\tilde{\mathbf{L}}^l],    (4)

where $\{\alpha_l\}_{l=1}^{M}$ are the weights calculated through the attention operation [1] between $\tilde{\mathbf{X}}_i$ and $\tilde{\mathbf{L}}^{(1:M)}$, and the semi-colon denotes the concatenation operation. $\alpha_l$ represents the relative importance of the surrounding lane $\mathbf{L}^l$ compared to the reference lane under consideration of the past motion of $V_i$. As a result, our model can generate plausible trajectories for vehicles that drive while paying attention to multiple lanes. For example, suppose that the vehicle is changing its lane from $\mathbf{L}^m$ to $\mathbf{L}^l$. Then $\alpha_l$ will be close to 1 and $\mathbf{a}_i^m$ can be approximated as $[\tilde{\mathbf{L}}^m;\tilde{\mathbf{L}}^l]$; thus, our model can generate plausible trajectories corresponding to the lane change. We show in the supplementary materials how the target vehicle interacts with the surrounding lanes of the reference lane using some driving scenarios.
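The following is a minimal PyTorch sketch of how the attention-weighted surrounding-lane feature of Eq. 4 could be computed. The additive attention form follows [1], and the encoding dimensions (16 for the motion encoding, 64 per lane) mirror the supplementary implementation details, but the attention width, the masking of the reference lane before the softmax, and all module names are assumptions of this sketch rather than the released implementation.

```python
import torch
import torch.nn as nn

class VLIContext(nn.Module):
    """Sketch of the vehicle-lane interaction context a_i^m of Eq. 4 (dimensions illustrative)."""
    def __init__(self, dim_x=16, dim_lane=64, dim_att=32):
        super().__init__()
        # additive (Bahdanau-style) attention between the motion encoding and the lane encodings
        self.w_x = nn.Linear(dim_x, dim_att)
        self.w_l = nn.Linear(dim_lane, dim_att)
        self.v = nn.Linear(dim_att, 1)

    def forward(self, x_enc, lane_enc, ref_idx):
        # x_enc: (dim_x,) encoding of X_i;  lane_enc: (M, dim_lane) encodings of L^(1:M)
        scores = self.v(torch.tanh(self.w_x(x_enc) + self.w_l(lane_enc))).squeeze(-1)   # (M,)
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask[ref_idx] = True                                          # exclude the reference lane from the sum
        alpha = torch.softmax(scores.masked_fill(mask, float('-inf')), dim=0)
        surround = (alpha.unsqueeze(-1) * lane_enc).sum(dim=0)        # weighted surrounding-lane feature
        return torch.cat([lane_enc[ref_idx], surround], dim=-1)       # a_i^m = [L~^m ; sum_{l != m} alpha_l L~^l]
```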

To model the interaction between $V_i$ and its surrounding vehicles $\{V_j\}_{j\neq i}$, we use a GNN. As mentioned above, only the surrounding vehicles within a certain distance from the reference lane are considered for the interaction; see Fig. 1c. Let $\mathcal{N}_i^m$ denote the set of vehicles including $V_i$ and its selected neighbors. The context vector $\mathbf{b}_i^m$ for the V2I is calculated as follows:

\mathbf{m}_{j\to i}=\text{MLP}([\mathbf{p}_j^t-\mathbf{p}_i^t;\mathbf{h}_i^k;\mathbf{h}_j^k]),    (5)
\mathbf{o}_i=\sum_{j\in\mathcal{N}_i^m,j\neq i}\mathbf{m}_{j\to i},    (6)
\mathbf{h}_i^{k+1}=\text{GRU}(\mathbf{o}_i,\mathbf{h}_i^k),    (7)
\mathbf{b}_i^m=\sum_{j\in\mathcal{N}_i^m,j\neq i}\mathbf{h}_j^{K-1},    (8)

where $\mathbf{h}^0=\tilde{\mathbf{X}}$ for all vehicles in $\mathcal{N}_i^m$. The message passing from $V_j$ to $V_i$ is defined in Eq. 5, and all messages coming to $V_i$ are aggregated by the sum operation as shown in Eq. 6. After $K$ rounds of message passing, the hidden feature vector $\mathbf{h}_j^{K-1}$ represents not only the motion history of $V_j$ but also the history of the interaction between $V_j$ and the others. The distance threshold $\tau$ for $\mathcal{N}_i^m$ plays an important role in the performance improvement. We explore the choice of the $\tau$ value and empirically find that the best performance is achieved with $\tau=5$ meters (the distance between two nearby lane centerlines on straight roads is around 5 meters). Finally, note that we use the zero vector for $\mathbf{b}_i^m$ when $\mathcal{N}_i^m$ contains the target vehicle only.
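Below is a minimal PyTorch sketch of the neighbor selection and the message passing of Eqs. 5-8. The feature sizes (16-dimensional hidden states, 34-dimensional message inputs, one round of message passing) follow the supplementary implementation details, while the use of torch.cdist for the point-to-centerline distance and the dense pairwise message computation are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def select_neighbors(positions, ref_lane, target_idx, tau=5.0):
    """Keep only the vehicles whose current position lies within tau meters of the
    reference-lane centerline (polyline of shape (F, 2)); the target vehicle is always kept."""
    dists = torch.cdist(positions, ref_lane).min(dim=1).values   # (N,) distance to the closest centerline point
    keep = dists <= tau
    keep[target_idx] = True
    return keep.nonzero(as_tuple=True)[0]

class V2IMessagePassing(nn.Module):
    """Sketch of the GNN of Eqs. 5-7 over the selected vehicles (dimensions illustrative)."""
    def __init__(self, dim_h=16, rounds=1):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 + 2 * dim_h, dim_h), nn.ReLU())
        self.gru = nn.GRUCell(dim_h, dim_h)
        self.rounds = rounds

    def forward(self, h, pos):
        # h: (N, dim_h) motion encodings X~ of the selected vehicles;  pos: (N, 2) current positions
        n = h.size(0)
        for _ in range(self.rounds):
            rel = pos.unsqueeze(0) - pos.unsqueeze(1)                 # rel[i, j] = p_j^t - p_i^t
            h_i = h.unsqueeze(1).expand(n, n, -1)                     # receiver states
            h_j = h.unsqueeze(0).expand(n, n, -1)                     # sender states
            m = self.msg(torch.cat([rel, h_i, h_j], dim=-1))          # Eq. 5: messages m_{j->i}
            m = m * (1.0 - torch.eye(n)).unsqueeze(-1)                # drop self-messages
            o = m.sum(dim=1)                                          # Eq. 6: aggregate incoming messages
            h = self.gru(o, h)                                        # Eq. 7: update every node state
        return h

def v2i_context(h, target_idx):
    # Eq. 8: b_i^m is the sum of the updated neighbor states (zero vector if there are no neighbors)
    mask = torch.ones(h.size(0), dtype=torch.bool)
    mask[target_idx] = False
    return h[mask].sum(dim=0) if mask.any() else torch.zeros(h.size(1))
```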

3.3.3 Mode Selection Network:

The weights for the modes of the trajectory distribution are calculated by the mode selection network $\mathbf{z}_h=f_\varphi(\mathbf{X}_i,\mathcal{C}_i^{(1:M)})$. As instance-level lanes are closely related to the modes, it can be assumed that there are $M$ modes, each corresponding to one of $\mathbf{L}^{(1:M)}$. We calculate the weights from the lane-level scene context vectors $\mathbf{c}_i^m=[\tilde{\mathbf{X}}_i;\mathbf{a}_i^m;\mathbf{b}_i^m]$, which condense the information about the modes:

\mathbf{z}_h=\text{MLP}_{f_\varphi}([\mathbf{c}_i^1;...;\mathbf{c}_i^M])\in\mathbb{R}^M.    (9)

The softmax operation is applied to $\mathbf{z}_h$ to obtain the final weights $\{w_m\}_{m=1}^{M}$. Let $\mathbf{z}_h^{SM}$ denote the result of applying the softmax operation to $\mathbf{z}_h$; $w_m$ is then equal to the $m$-th element of $\mathbf{z}_h^{SM}$. The lane-level scene context vector is the core feature vector for our encoder, prior, and decoder networks, as described in the next section.
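A minimal sketch of the mode selection network of Eq. 9 is shown below; the layer sizes follow the supplementary implementation details (a 160-to-64 per-lane embedding and a 640-to-10 head), while the module structure itself is an illustrative assumption.

```python
import torch
import torch.nn as nn

class ModeSelection(nn.Module):
    """Sketch of the mode selection network z_h = f_phi(.) of Eq. 9."""
    def __init__(self, dim_c=160, num_lanes=10, dim_emb=64):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(dim_c, dim_emb), nn.ReLU())   # per-lane embedding
        self.head = nn.Linear(num_lanes * dim_emb, num_lanes)              # concatenated embeddings -> M logits

    def forward(self, c):
        # c: (M, dim_c) lane-level scene context vectors c_i^m = [X~_i ; a_i^m ; b_i^m]
        z_h = self.head(self.embed(c).flatten())     # Eq. 9
        w = torch.softmax(z_h, dim=0)                # mode weights {w_m}
        return z_h, w
```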

3.3.4 Encoder, Prior, and Decoder:

The approximated posterior $q_\phi(\mathbf{z}_l|\mathbf{Y}_i,\mathbf{X}_i,\mathcal{C}_i^m)$, also known as the encoder or recognition network, is implemented as MLPs that take the encoding of the future trajectory and the lane-level scene context vector as inputs:

\mu_e,\sigma_e=\text{MLP}_{q_\phi}([\tilde{\mathbf{Y}}_i;\mathbf{c}_i^m]),    (10)

where $\mu_e$ and $\sigma_e$ are the mean and standard deviation vectors, respectively. The encoder is utilized only in the training phase because $\mathbf{Y}_i$ is not available in the inference phase. The prior $p_\gamma(\mathbf{z}_l|\mathbf{X}_i,\mathcal{C}_i^m)$ is also implemented as MLPs, with the context vector as input:

\mu_p,\sigma_p=\text{MLP}_{p_\gamma}(\mathbf{c}_i^m),    (11)

where $\mu_p$ and $\sigma_p$ are the mean and standard deviation vectors, respectively. The latent variable $\mathbf{z}_l$ is sampled from $(\mu_e,\sigma_e)$ via the re-parameterization trick [19] during training and from $(\mu_p,\sigma_p)$ during inference.
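A minimal sketch of the encoder/prior heads of Eqs. 10-11 and the re-parameterized sampling of $\mathbf{z}_l$ is given below. The hidden and latent sizes (64 and 16) follow the supplementary description, but producing a log-variance instead of a standard deviation is an implementation assumption of this sketch.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Sketch of the encoder / prior MLPs of Eqs. 10-11: map a feature vector to the
    parameters of a diagonal Gaussian over z_l."""
    def __init__(self, dim_in, dim_hidden=64, dim_z=16):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.ReLU(), nn.Linear(dim_hidden, dim_z))
        self.logvar = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.ReLU(), nn.Linear(dim_hidden, dim_z))

    def forward(self, feat):
        return self.mu(feat), self.logvar(feat)

def sample_z(mu, logvar):
    # re-parameterization trick [19]: z = mu + sigma * eps with eps ~ N(0, I)
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

# encoder (training): feat = [Y~_i ; c_i^m];  prior (inference): feat = c_i^m
```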

The decoder network generates the prediction of the future trajectory, $\hat{\mathbf{Y}}_i$, via an LSTM network as follows:

\mathbf{e}_i^t=\text{MLP}_{emb}(\hat{\mathbf{p}}_i^t),    (12)
\mathbf{h}_i^{t+1}=\text{LSTM}([\mathbf{e}_i^t;\mathbf{c}_i^m;\mathbf{z}_l],\mathbf{h}_i^t),    (13)
\hat{\mathbf{p}}_i^{t+1}=\text{MLP}_{dec}(\mathbf{h}_i^{t+1}),    (14)

where we initialize $\hat{\mathbf{p}}_i^0$ and $\mathbf{h}_i^0$ as the last observed position of $V_i$ and the zero vector, respectively.
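The autoregressive decoding of Eqs. 12-14 can be sketched as follows. The layer sizes mirror the supplementary implementation details; the unbatched loop, the default prediction horizon, and the variable names are simplifications of this sketch.

```python
import torch
import torch.nn as nn

class TrajectoryDecoder(nn.Module):
    """Sketch of the decoder of Eqs. 12-14 for a single mode and a single vehicle."""
    def __init__(self, dim_c=160, dim_z=16, dim_emb=16, dim_h=128, horizon=12):
        super().__init__()
        self.emb = nn.Sequential(nn.Linear(2, dim_emb), nn.ReLU())    # Eq. 12: MLP_emb
        self.cell = nn.LSTMCell(dim_emb + dim_c + dim_z, dim_h)       # Eq. 13: recurrent update
        self.out = nn.Linear(dim_h, 2)                                # Eq. 14: MLP_dec
        self.horizon = horizon

    def forward(self, p0, c, z):
        # p0: (2,) last observed position;  c: (dim_c,) context c_i^m;  z: (dim_z,) latent z_l
        h = torch.zeros(1, self.cell.hidden_size)
        s = torch.zeros(1, self.cell.hidden_size)
        p, traj = p0.unsqueeze(0), []
        for _ in range(self.horizon):
            e = self.emb(p)
            h, s = self.cell(torch.cat([e, c.unsqueeze(0), z.unsqueeze(0)], dim=-1), (h, s))
            p = self.out(h)                        # predicted next position
            traj.append(p)
        return torch.cat(traj, dim=0)              # (horizon, 2) predicted trajectory Y^_i
```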

3.4 Regularization Through GAN

To generate clearer image samples, [20] proposed a method that combines the VAE and GAN. Based on the observation that the discriminator network implicitly learns a rich similarity metric for images, the typical element-wise reconstruction metric (e.g., $L_2$ distance) in the ELBO objective is replaced with a feature-wise metric expressed in the discriminator. In this paper, we also propose training our forecasting model together with a discriminator network. However, we do not replace the element-wise reconstruction metric with the feature-wise metric, since the characteristics of trajectory data are quite different from those of images. We instead use the discriminator to regularize our forecasting model during training so that the trajectories generated by our model match the shape of the reference lane well.

The proposed discriminator network is defined as follows:

s=D(\mathbf{Y}_i,\mathbf{L}^m)=\text{MLP}_{dis}([\tilde{\mathbf{Y}}_i;\tilde{\mathbf{L}}^m])\in\mathbb{R}^1.    (15)

We explored different choices for the encoding of the inputs to the discriminator network and observed that the following approaches improve the prediction performance: 1) $\tilde{\mathbf{Y}}_i$ is the result of encoding $[\mathbf{Y}_i;\Delta\mathbf{Y}_i]$ through an LSTM network, where $\Delta\mathbf{Y}_i=\Delta\mathbf{p}_i^{(t+1:t+T)}$, $\Delta\mathbf{p}_i^t=\mathbf{p}_i^t-\mathbf{l}_f^m$, and $\mathbf{l}_f^m$ is the coordinate point of $\mathbf{L}^m$ closest to $\mathbf{p}_i^t$; 2) $\tilde{\mathbf{L}}^m$ is from the feature extraction module. We also observed that generating trajectories for the GAN objective ($\mathcal{L}_{GAN}$ defined in Eq. 18) from both the encoder and the prior yields better prediction performance, which is consistent with the observations in [20]. However, not back-propagating the error signal from the GAN objective to the encoder and prior does not lead to a performance improvement, which is inconsistent with the observations in [20].

3.5 Training Details

The proposed model is trained by optimizing the following objective:

\mathcal{L}=\mathcal{L}_{ELBO}+\alpha\mathcal{L}_{BCE}+\kappa\mathcal{L}_{GAN}.    (16)

Here, $\mathcal{L}_{BCE}$ is the binary cross-entropy loss for the mode selection network and is defined as follows:

\mathcal{L}_{BCE}=\text{BCE}(\mathbf{g}^m,\text{softmax}(\mathbf{z}_h)),    (17)

where $\mathbf{g}^m$ is the one-hot vector indicating the index, among the $M$ candidate lanes, of the lane in which the target vehicle traveled during the future timesteps. $\mathcal{L}_{GAN}$ is the typical adversarial loss, defined as follows:

\mathcal{L}_{GAN}=\mathbb{E}_{\mathbf{Y}\sim p_{data}}[\log D(\mathbf{Y},\mathbf{L})]+\mathbb{E}_{\mathbf{z}\sim p_z}[\log(1-D(G(\mathbf{z}),\mathbf{L}))],    (18)

where $G$ denotes our forecasting model. The hyper-parameters ($\alpha$, $\kappa$) in Eq. 16 and $\beta$ in Eq. 3 are set to 1, 0.01, and 0.5, respectively. More details can be found in the supplementary materials.
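As a rough sketch, the objective of Eq. 16 could be assembled as below for a single training example. Writing the reconstruction term as an $L_2$ distance, using a closed-form KL between diagonal Gaussians, and showing only the discriminator-side form of Eq. 18 (the generator is trained against it with the usual opposite sign) are assumptions of this sketch rather than details taken from the released code.

```python
import torch
import torch.nn.functional as F

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for diagonal Gaussians
    return 0.5 * torch.sum(
        logvar_p - logvar_q + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1.0
    )

def total_loss(y_hat, y, mu_q, logvar_q, mu_p, logvar_p,
               z_h, g_onehot, d_real, d_fake,
               alpha=1.0, kappa=0.01, beta=0.5):
    # Eq. 3: reconstruction term (written here as an L2 distance) + beta-weighted KL
    elbo = F.mse_loss(y_hat, y, reduction='sum') + beta * kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
    # Eq. 17: binary cross entropy between softmax(z_h) and the one-hot ground-truth lane indicator g^m
    bce = F.binary_cross_entropy(torch.softmax(z_h, dim=0), g_onehot)
    # Eq. 18, discriminator side: real trajectories scored toward 1, generated trajectories toward 0
    gan = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
          F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return elbo + alpha * bce + kappa * gan        # Eq. 16 with (alpha, kappa, beta) = (1, 0.01, 0.5)
```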

3.6 Inference

Future trajectories for the target vehicle are generated from the modes based on their weights. Assume that $K$ trajectories need to be generated for $V_i$. Then $\lfloor K\times w_m\rfloor$ of the $K$ future trajectories are generated by the decoder network using $\mathbf{c}_i^m$ and $\mathbf{z}_l$. In the end, a total of $K$ trajectories can be generated from $\{\mathbf{c}_i^m\}_{m=1}^{M}$ since $\sum_{m=1}^{M}w_m=1$.
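A minimal sketch of the sample allocation across modes is given below. Because the floor operation can leave a few of the $K$ samples unassigned, the remainder is handed here to the highest-weight modes; this tie-breaking rule is an assumption of the sketch, not something specified in the paper.

```python
import numpy as np

def allocate_samples(weights, K):
    """Allocate K trajectory samples to the M modes: floor(K * w_m) samples per mode,
    with any remainder given to the highest-weight modes (an assumption of this sketch)."""
    weights = np.asarray(weights, dtype=float)
    counts = np.floor(K * weights).astype(int)
    for idx in np.argsort(weights)[::-1]:          # highest-weight modes first
        if counts.sum() >= K:
            break
        counts[idx] += 1
    return counts                                   # counts[m] samples are drawn using c_i^m and z_l from the prior

# example: allocate_samples([0.62, 0.25, 0.13], K=15) -> array([10, 4, 1])
```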

4 Experiments

4.1 Dataset

Two large-scale real-world datasets, Argoverse Forecasting [7] and nuScenes [4], are used to evaluate the prediction performance of our model. Both provide 2D or 3D annotations of road agents, track IDs of the agents, and HD map data. nuScenes includes 1000 scenes, each 20 seconds in length; a 6-second future trajectory is predicted from a 2-second past trajectory for each target vehicle. Argoverse Forecasting is a dataset dedicated to the trajectory prediction task; it provides more than 300K scenarios, each 5 seconds in length, and a 3-second future trajectory is predicted from a 2-second past trajectory for each target vehicle. Argoverse Forecasting and nuScenes publicly release only their training and validation sets. Following the existing works [16, 31], we use the validation set for testing. For training, we use the training set only.

4.2 Evaluation Metric

For the quantitative evaluation of our forecasting model, we employ two popular metrics, average displacement error (ADE) and final displacement error (FDE), defined as follows:

ADE(\hat{\mathbf{Y}},\mathbf{Y})=\frac{1}{T}\sum_{t=1}^{T}||\hat{\mathbf{p}}^t-\mathbf{p}^t||_2,    (19)
FDE(\hat{\mathbf{Y}},\mathbf{Y})=||\hat{\mathbf{p}}^T-\mathbf{p}^T||_2,    (20)

where $\mathbf{Y}$ and $\hat{\mathbf{Y}}$ respectively denote the ground-truth trajectory and its prediction. In the rest of this paper, we denote by $ADE_K$ and $FDE_K$ the minimum ADE and FDE among the $K$ generated trajectories, respectively. It is worth noting that the $ADE_1$ and $FDE_1$ metrics shown in the tables presented in later sections represent the average quality of the trajectories generated for $\mathbf{Y}$. Our derivation can be found in the supplementary materials. On the other hand, $ADE_K$ and $FDE_K$ represent the quality of the trajectory closest to the ground truth among the $K$ generated trajectories. We will call the $ADE_{K\geq 12}$ and $FDE_{K\geq 12}$ metrics in the tables the best quality in the rest of this paper. According to [5], the average quality and the best quality are complementary and evaluate the precision and coverage of the predicted trajectory distributions, respectively.
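For reference, the metrics of Eqs. 19-20 and their best-of-$K$ variants can be computed as in the following sketch, where trajectories are arrays of shape (T, 2).

```python
import numpy as np

def ade(y_hat, y):
    # Eq. 19: mean Euclidean displacement over the T predicted timesteps
    return float(np.mean(np.linalg.norm(y_hat - y, axis=-1)))

def fde(y_hat, y):
    # Eq. 20: Euclidean displacement at the final timestep
    return float(np.linalg.norm(y_hat[-1] - y[-1]))

def min_ade_fde(y_hats, y):
    # ADE_K / FDE_K: minimum error among the K generated trajectories
    return min(ade(p, y) for p in y_hats), min(fde(p, y) for p in y_hats)
```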

(a) Contribution of each component (a checkmark indicates the component is used)
Model  PDP  VLI  V2I  GAN  ADE_1/FDE_1  ADE_15/FDE_15
M1      -    -    -    -    3.15/7.53    0.95/1.82
M2      ✓    -    -    -    3.03/7.22    0.93/1.80
M3      ✓    ✓    -    -    2.91/7.00    0.94/1.82
M4      ✓    ✓    ✓    -    2.67/6.38    0.91/1.77
M5      ✓    ✓    ✓    ✓    2.64/6.32    0.89/1.72

(b) Surrounding vehicle selection
Model          ADE_1/FDE_1  ADE_15/FDE_15
Ours (τ=1)     2.66/6.31    0.92/1.76
Ours (τ=5)     2.64/6.32    0.89/1.72
Ours (τ=10)    2.65/6.34    0.95/1.84
Ours+All       2.67/6.34    1.00/1.98
Ours+Rel       2.75/6.52    0.91/1.77

Table 1: Ablation study conducted on nuScenes
Figure 3: Mode blur example
Model              ADE_1         FDE_1        ADE_5        FDE_5       ADE_10        FDE_10        ADE_15        FDE_15
CoverNet [27]      3.87          9.26         1.96         -           1.48          -             -             -
Trajectron++ [31]  -             9.52         1.88         -           1.51          -             -             -
AgentFormer [35]   -             -            1.86         3.89        1.45          2.86          -             -
ALAN [26]          4.67          10.0         1.77         3.32        1.10          1.66          -             -
LaPred [16]        3.51          8.12         1.53         3.37        1.12          2.39          1.10          2.34
MHA-JAM [25]       3.69          8.57         1.81         3.72        1.24          2.21          1.03          1.7
Ours               2.64 (0.87↓)  6.32 (1.8↓)  1.33 (0.2↓)  2.92 (0.4↓) 1.04 (0.06↓)  2.15 (0.49↑)  0.89 (0.14↓)  1.72 (0.02↑)
Table 2: Quantitative comparison on nuScenes
Model                ADE_1         FDE_1         ADE_5         FDE_5        ADE_6         FDE_6        ADE_12        FDE_12
DESIRE [21]          2.38          4.64          1.17          2.06         1.09          1.89         0.90          1.45
R2P2 [30]            3.02          5.41          1.49          2.54         1.40          2.35         1.11          1.77
VectorNet [12]       1.66          3.67          -             -            -             -            -             -
LaneAttention [24]   1.46          3.27          -             -            1.05          2.06         -             -
LaPred [16]          1.48          3.29          0.76          1.55         0.71          1.44         0.60          1.15
Ours                 1.44 (0.02↓)  3.15 (0.12↓)  0.70 (0.06↓)  1.35 (0.2↓)  0.65 (0.06↓)  1.24 (0.2↓)  0.51 (0.09↓)  0.85 (0.3↓)
Table 3: Quantitative comparison on Argoverse Forecasting

4.3 Ablation Study

4.3.1 Performance Gain over Baseline

In Table 1(a), we present the contribution of each idea to the performance gain over a baseline. M1 denotes the baseline that uses none of the positional data preprocessing (PDP), VLI, V2I, and GAN regularization proposed in this paper. We can see from the table that the average quality of the generated trajectories is improved by both the PDP and the VLI (M1 vs. M2 vs. M3). The improvement due to the VLI is consistent with the observation in [16] that considering multiple lane candidates is more helpful than using a single best lane candidate when predicting the future trajectory. Both the average quality and the best quality are much improved by the V2I (M3 vs. M4). Accurate trajectory prediction for vehicles waiting at traffic lights is the most representative case of the performance improvement by the V2I: from the past movement of the neighboring vehicles waiting at the traffic light, our model can easily conclude that the target vehicle will also be waiting. Finally, the prediction performance is further improved by the GAN regularization (M4 vs. M5). As seen in Eq. 15, our discriminator uses a future trajectory along with the reference lane to discriminate between fake and real trajectories.

4.3.2 Effect of Surrounding Vehicle Selection Mechanism

In Table 1(b), we show the effect of the surrounding vehicle selection mechanism on the prediction performance of our model. Here, Ours ($\tau$) denotes our model in which only the surrounding vehicles within $\tau$ meters from the reference lane are considered. Ours+Rel and Ours+All denote our model in which only the most relevant vehicle and all the vehicles are considered, respectively. We can see from the table that Ours with $\tau=5$ shows the best performance. This result demonstrates that considering only the surrounding vehicles within a certain distance from the reference lane is effective in modeling the V2I from a lane-level perspective.

4.3.3 Hierarchical Latent Structure

We show in Fig. 3 the generated trajectories for a particular scenario to demonstrate how helpful the introduction of the hierarchical latent structure is for the mitigation of mode blur. In the figure, Baseline denotes the VAE-based forecasting model in which a single latent variable is trained to model the trajectory distribution. Baseline+BOM and Baseline+NF respectively denote Baseline trained with the best-of-many (BOM) sample objective [2] and with normalizing flows (NF) [29]. We introduce NF since blurry sample generation is often attributed to the limited capability of the approximated posterior [15], and NF is a powerful framework for building flexible approximated posterior distributions [18]. In the figure, gray and black circles indicate historical and future positions, respectively. Colored squares indicate the predictions of the future positions. Time is encoded in the rainbow color map ranging from red (0s) to blue (6s). Red solid lines indicate the centerlines of the candidate lanes. For the scenario, fifteen trajectories were generated. We can see in the figure that the proposed model generates trajectories that are aligned with the lane candidates. In contrast, neither normalizing flows nor the BOM objective helps much with the mitigation of mode blur.

Figure 4: Trajectory prediction examples of our forecasting model on nuScenes (the first and second rows) and Argoverse Forecasting (the third and fourth rows)

4.4 Performance Evaluation

4.4.1 Quantitative Evaluation

We compare our forecasting model with the existing models quantitatively. The results are shown in Tables 2 and 3. Note that the values in parentheses after our results indicate the performance gain over the second-best model (↓) or the loss relative to the best model (↑). Finally, the values in the tables are from the corresponding papers and [16]. Table 2 presents the results on nuScenes. It shows that our model outperforms the SOTA models [26, 16, 35] on most of the metrics. In particular, the performance gains over the SOTA models in the $ADE_{K\leq 5}$ and $FDE_{K\leq 5}$ metrics are significant. Consequently, it can be said that the trajectories generated by our model are, on average, more accurate than those from the SOTA models. On the other hand, [26] shows significant performance on $FDE_{10}$. This is because, in [26], the vehicle trajectory is defined along the centerlines in a 2D curvilinear normal-tangential coordinate frame, so that the predicted trajectory is well aligned with the centerlines. However, [26] shows the poorest performance in the average quality. Table 3 presents the results on Argoverse Forecasting. Our forecasting model outperforms the SOTA models [16, 24] on all the metrics. The $ADE_{12}$ and $FDE_{12}$ results show that our model achieves much better performance in the best quality compared to these models. However, the performance gain over the second-best model in the average quality is not significant. In short, our forecasting model exhibits remarkable performance in both the average and the best quality on the two large-scale real-world datasets.

4.4.2 Qualitative Evaluation

Figure 4 illustrates the trajectories generated by our model for particular scenarios in the test dataset. Note that fifteen and twelve trajectories were generated for each scenario in nuScenes and Argoverse Forecasting, respectively. We can see in the figure that the generated trajectories are well distributed along admissible routes. In addition, the shape of the generated trajectories matches the shape of the candidate lanes well. These results verify that the trajectory distribution is nicely modeled by the two latent variables conditioned on the proposed lane-level scene context vectors. It is noticeable that our model can generate plausible trajectories for driving behaviors that require simultaneous consideration of multiple lanes. The first and third figures in the first column show scenarios where the target vehicle has just started changing lanes, and the second shows a scenario where the target vehicle is in the middle of a lane change. For both scenarios, our model generates plausible trajectories corresponding to both changing lanes and returning to the original lane. Finally, the last figure in the first column shows a scenario where the target vehicle is in the middle of a right turn. Our model captures well the motion ambiguity of the vehicle, which can keep its lane or change lanes.

5 Conclusions

In this paper, we proposed a VAE-based trajectory forecasting model that exploits a hierarchical latent structure. The hierarchy in the latent space was introduced to the forecasting model to mitigate mode blur by modeling the modes of the trajectory distribution and the weights for the modes separately. For the accurate modeling of the modes and weights, we introduced two lane-level context vectors calculated in novel ways, one corresponding to the VLI and the other to the V2I. The prediction performance of the model was further improved by the two techniques introduced in this paper, positional data preprocessing and GAN-based regularization. Our experiments on two large-scale real-world datasets demonstrated that the model is not only capable of generating clear multi-modal trajectory distributions but also outperforms the SOTA models in terms of prediction accuracy.

Acknowledgment This research work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIP) (No. 2020-0-00002, Development of standard SW platform-based autonomous driving technology to solve social problems of mobility and safety for public transport-marginalized communities).

References

  • [1] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Int. Conf. on Learn. Represent. (2015)
  • [2] Bhattacharyya, A., Schiele, B., Fritz, M.: Accurate and diverse sampling of sequences based on a best-of-many sample objective. In: IEEE Conf. Comput. Vis. Pattern Recog. (2018)
  • [3] Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A.M., Jozefowicz, R., Bengio, S.: Generating sentences from a continuous space. In: arXiv:1511.06349 (2015)
  • [4] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: a multimodal dataset for autonomous driving. In: IEEE Conf. Comput. Vis. Pattern Recog. (2020)
  • [5] Casas, S., Gulino, C., Suo, S., Luo, K., Liao, R., Urtasun, R.: Implicit latent variable model for scene-consistent motion forecasting. In: Eur. Conf. Comput. Vis. (2020)
  • [6] Casas, S., Gulino, C., Suo, S., Urtasun, R.: The importance of prior knowledge in precise multimodal prediction. In: Int. Conf. Intell. Robots Syst. (2020)
  • [7] Chang, M.F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., Wang, D., Carr, P., Lucey, S., Ramanan, D., Hays, J.: Argoverse: 3d tracking and forecasting with rich maps. In: IEEE Conf. Comput. Vis. Pattern Recog. (2019)
  • [8] Cui, A., Sadat, A., Casas, S., Liao, R., Urtasun, R.: Lookout: diverse multi-future prediction and planning for self-driving. In: Int. Conf. Comput. Vis. (2021)
  • [9] Cui, H., Radosavljevic, V., F.-C.Chou, Lin, T.H., Nguyen, T., Huang, T.K., Schneider, J., Djuric, N.: Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In: IEEE Int. Conf. Robotics and Automation (2019)
  • [10] Fang, L., Jiang, Q., Shi, J., Zhou, B.: Tpnet: trajectory proposal network for motion prediction. In: IEEE Conf. Comput. Vis. Pattern Recog. (2020)
  • [11] Fu, H., Li, C., Liu, X., Gao, J., Celikyilmaz, A., Carin, L.: Cyclical annealing schedule: A simple approach to mitigating kl vanishing. In: NAACL (2019)
  • [12] Gao, J., Sun, C., Zhao, H., Shen, Y., Anguelov, D., Li, C., Schmid, C.: Vectornet: encoding hd maps and agent dynamics from vectorized representation. In: IEEE Conf. Comput. Vis. Pattern Recog. (2020)
  • [13] Goodfellow, I., Abadie, J.P., Mirza, M., Xu, B., Farley, D.W., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Adv. Neural Inform. Process. Syst. (2014)
  • [14] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., Lerchner, A.: beta-vae: learning basic visual concepts with a constrained variational framework. In: Int. Conf. on Learn. Represent. (2017)
  • [15] Huang, H., Li, Z., He, R., Sun, Z., Tan, T.: Introvae: Introspective variational autoencoders for photographic image synthesis. In: Adv. Neural Inform. Process. Syst. (2018)
  • [16] Kim, B., Park, S.H., Lee, S., Khoshimjonov, E., Kum, D., Kim, J., Kim, J.S., Choi, J.W.: Lapred: lane-aware prediction of multi-modal future trajectories of dynamic agents. In: IEEE Conf. Comput. Vis. Pattern Recog. (2021)
  • [17] Kingma, D.P., Ba, L.J.: Adam: a method for stochastic optimization. In: Int. Conf. on Learn. Represent. (2015)
  • [18] Kingma, D.P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., Welling, M.: Improved variational inference with inverse autoregressive flow. In: Adv. Neural Inform. Process. Syst. (2016)
  • [19] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: arXiv:1312.6114 (2013)
  • [20] Larsen, A.B.L., Sonderby, S.K., Larochelle, H., Winther, O.: Autoencoding beyond pixels using a learned similarity metric. In: Int. Conf. on Learn. Represent. (2016)
  • [21] Lee, N., Choi, W., Vernaza, P., Choy, C.B., Torr, P.H.S., Chan, M.: Desire: Distant future prediction in dynamic scenes with interacting agents. In: IEEE Conf. Comput. Vis. Pattern Recog. (2017)
  • [22] Li, J., Yang, F., Ma, H., Malla, S., Tomizuka, M., Choi, C.: Rain: reinforced hybrid attention inference network for motion forecasting. In: Int. Conf. Comput. Vis. (2021)
  • [23] Liang, M., Yang, B., Hu, R., Chen, Y., Liao, R., Feng, S., Urtasun, R.: Learning lane graph representations for motion forecasting. In: Eur. Conf. Comput. Vis. (2020)
  • [24] Luo, C., Sun, L., Dabiri, D., Yuille, A.: Probabilistic multi-modal trajectory prediction with lane attention for autonomous vehicles. In: IEEE Conf. Intell. Robots Syst. (2020)
  • [25] Messaoud, K., Deo, N., Trivedi, M.M., Nashashibi, F.: Trajectory prediction for autonomous driving based on multi-head attention with joint agent-map representation. In: arXiv:2005.02545 (2020)
  • [26] Narayanan, S., Moslemi, R., Pittaluga, F., Liu, B., Chandraker, M.: Divide-and-conquer for lane-aware diverse trajectory prediction. In: IEEE Conf. Comput. Vis. Pattern Recog. (2021)
  • [27] P-Minh, T., Grigore, E.C., Boulton, F.A., Beijbom, O., Wolff, E.M.: Covernet: multimodal behavior prediction using trajectory sets. In: IEEE Conf. Comput. Vis. Pattern Recog. (2020)
  • [28] Razavi, A., Oord, A., Poole, B., Vinyals, O.: Preventing posterior collapse with delta-vaes. In: Int. Conf. on Learn. Represent. (2019)
  • [29] Rezende, D.J., Mohamed, S.: Variational inference with normalizing flows. In: Int. Conf. on Mach. Learn. (2015)
  • [30] Rhinehart, N., Kitani, K.M., Vernaza, P.: R2p2: a reparameterized pushforward policy for diverse, precise generative path forecasting. In: Eur. Conf. Comput. Vis. (2018)
  • [31] Salzmann, T., Ivanovic, B., Chakravarty, P., Pavone, M.: Trajectron++: dynamically-feasible trajectory forecasting with heterogeneous data. In: Eur. Conf. Comput. Vis. (2020)
  • [32] Sohn, K., Lee, H., Yan, X.: Learning structured output representation using deep conditional generative models. In: Adv. Neural Inform. Process. Syst. (2015)
  • [33] Vahdat, A., Kautz, J.: Nvae: a deep hierarchical variational autoencoder. In: Adv. Neural Inform. Process. Syst. (2020)
  • [34] Yang, Z., Hu, Z., Salakhutdinov, R., B.-Kirkpatrick, T.: Improved variational autoencoders for text modeling using dilated convolutions. In: Int. Conf. on Mach. Learn. (2017)
  • [35] Yuan, Y., Weng, X., Ou, Y., Kitani, K.: Agentformer: agent-aware transformers for socio-temporal multi-agent forecasting. In: arXiv:2103.14023 (2021)
  • [36] Zhao, S., Song, J., Ermon, S.: Infovae: information maximizing variational autoencoders. In: arXiv:1706.02262 (2017)
  • [37] Zhao, S., Song, J., Ermon, S.: Towards a deeper understanding of variational autoencoding models. In: arXiv:1702.08658v1 (2017)

Appendix 0.A Visualization of Vehicle-Lane Interaction (VLI)

As mentioned in the paper, for the calculation of the lane-level context vector $\mathbf{a}_i^m$, we use not only the reference lane but also the surrounding lanes with their relative importance. This idea is based on the fact that human drivers often pay attention to surrounding lanes when driving along the reference lane. To show how our model pays attention to the surrounding lanes of the target vehicle, we use four scenarios in nuScenes and show the results in Figure 5. In the figure, blue lines denote the reference lanes while the others denote the surrounding lanes. The surrounding lanes of high importance are shown in red and those of low importance are shown in green. We can see in the figure that our forecasting model pays more attention to the surrounding lanes that are close to the reference lane.

Figure 5: VLI visualization

Appendix 0.B Mode Blur in SOTA Model

We show in Figure 6 the prediction examples of the state-of-the-art model [8]. We note that the figure is identical to the figure illustrated in the supplementary material of [8]. The model is built upon [5], which is based on the VAE framework and learns a diverse joint distribution over multi-agent future trajectories in a traffic scene. In the figure, green and light blue bounding boxes respectively denote the AV and the surrounding vehicles. The solid lines with light blue dots denote the predicted trajectories for the surrounding vehicles. We can see in the figure that some trajectories are located between adjacent lanes, which can cause uncomfortable rides for the AV with frequent sudden braking and steering changes [6].

Figure 6: Trajectory prediction examples of [8]

Appendix 0.C Further Explanation to Average Quality

We mentioned in the paper that the $ADE_1$ and $FDE_1$ metrics shown in the tables presented in the paper represent the average quality of the trajectories generated for the ground-truth trajectory $\mathbf{Y}$. The $ADE_1$ metric in the tables is calculated as

ADE_1=\frac{1}{|\mathcal{D}|}\sum_{\mathbf{Y}\in\mathcal{D}}ADE(\hat{\mathbf{Y}},\mathbf{Y}),    (21)

where $\mathcal{D}$ is the test dataset and $\hat{\mathbf{Y}}$ is the prediction of $\mathbf{Y}$. Because there are relatively few distinct actions that can be taken by a vehicle over a reasonable time horizon (3 to 6 seconds) [27], the ground-truth trajectories in $\mathcal{D}$ can be clustered into multiple groups, where the trajectories of each group are very close to each other in Euclidean space. Assume that there are $N$ groups in $\mathcal{D}$ and let $\mathcal{Y}_i$ denote the $i$-th group. Then Eq. 21 can be expressed as

ADE_1 = \frac{1}{|\mathcal{D}|}\Big\{\sum_{\mathbf{Y}\in\mathcal{Y}_1}ADE(\hat{\mathbf{Y}},\mathbf{Y})+...+\sum_{\mathbf{Y}\in\mathcal{Y}_N}ADE(\hat{\mathbf{Y}},\mathbf{Y})\Big\}
      = \frac{|\mathcal{Y}_1|}{|\mathcal{D}|}\frac{1}{|\mathcal{Y}_1|}\sum_{\mathbf{Y}\in\mathcal{Y}_1}ADE(\hat{\mathbf{Y}},\mathbf{Y})+...+\frac{|\mathcal{Y}_N|}{|\mathcal{D}|}\frac{1}{|\mathcal{Y}_N|}\sum_{\mathbf{Y}\in\mathcal{Y}_N}ADE(\hat{\mathbf{Y}},\mathbf{Y})
      = w_1\frac{1}{|\mathcal{Y}_1|}\sum_{\mathbf{Y}\in\mathcal{Y}_1}ADE(\hat{\mathbf{Y}},\mathbf{Y})+...+w_N\frac{1}{|\mathcal{Y}_N|}\sum_{\mathbf{Y}\in\mathcal{Y}_N}ADE(\hat{\mathbf{Y}},\mathbf{Y})
      = w_1\,AADE(\mathcal{Y}_1)+...+w_N\,AADE(\mathcal{Y}_N),    (22)

where $\sum_{i=1}^{N}w_i=1$. Since the trajectories of each group are very close to each other in Euclidean space, $AADE(\mathcal{Y}_i)$ in the last line of Eq. 22 can be approximated as

AADE(\mathcal{Y}_i)=\frac{1}{|\mathcal{Y}_i|}\sum_{\mathbf{Y}\in\mathcal{Y}_i}ADE(\hat{\mathbf{Y}},\mathbf{Y})\approx\frac{1}{K}\sum_{k=1}^{K}ADE(\hat{\mathbf{Y}}_k,\mathbf{Y}_r),    (23)

where $K=|\mathcal{Y}_i|$ is large enough. Here $\mathbf{Y}_r$ and $\hat{\mathbf{Y}}_k$ are the most representative trajectory in $\mathcal{Y}_i$ and its $k$-th prediction, respectively. The last term of Eq. 23 is the average quality of the $K$ trajectories generated for $\mathbf{Y}_r$. Consequently, the $ADE_1$ metric represents the average quality. The same derivation applies to the $FDE_1$ metric.

(a) nuScenes
Model        ADE_1/FDE_1  ADE_15/FDE_15
Ours+Multi   2.64/6.32    0.89/1.72
Ours+Single  2.64/6.32    0.97/1.95

(b) Argoverse Forecasting
Model        ADE_1/FDE_1  ADE_12/FDE_12
Ours+Multi   1.44/3.15    0.51/0.85
Ours+Single  1.44/3.16    0.53/0.92

Table 4: Trajectory generation from a single mode and from multiple modes

Appendix 0.D Trajectory Generation from The Most Prominent Mode

We show in Table 4 the ADE and FDE performance of our forecasting model when the $K$ trajectories are generated from the most prominent mode only. In the table, Ours+Multi denotes the inference method that generates the $K$ future trajectories from the $M$ modes; this method is the same as that described in the paper. Ours+Single denotes the inference method that generates the $K$ future trajectories from the most prominent mode, which is identified by the weight distribution $\{w_m\}_{m=1}^{M}$. We can observe from the table that the best quality ($K\geq 12$) is degraded when the trajectories are generated from the most prominent mode only. On the other hand, Ours+Single shows nearly the same average-quality performance as Ours+Multi. These are very natural results. When sampling a single future trajectory, the most prominent mode will be chosen for the sampling; therefore, Ours+Multi and Ours+Single show the same performance. On the other hand, when sampling multiple future trajectories, the trajectories generated by Ours+Multi better reflect the true future trajectory distribution; therefore, Ours+Multi outperforms Ours+Single in terms of the best quality.

Appendix 0.E Trajectory Generation Speed

We ran our model on a PC equipped with an Intel i7 CPU, 32GB RAM, and an RTX 2080Ti GPU. Generating 15 trajectories per vehicle takes around 0.02 sec.

Appendix 0.F Implementation Details

0.F.1 Candidate Lanes Acquisition

We identify $M=10$ lane candidates for each target vehicle based on the method proposed in [16, 26, 7]. The lane segments within the search radius (10 meters) from the current position of the vehicle are first found. Next, lane candidates 80 meters long in the vehicle's heading direction are obtained by attaching the preceding and succeeding lane segments based on the lane connectivity information provided by the HD maps. The set of coordinate points for each lane candidate is re-sampled such that any two adjacent coordinate points are an equal distance (1 meter) apart. The ground-truth lane, on which the target vehicle moves during the future timesteps, is identified by the Euclidean distance between the ground-truth future trajectory and the lane candidates. If the number of the identified lane candidates is less than $M$, we add fake lane candidates whose coordinate points are all (0, 0). If the number is greater than $M$, $M-1$ randomly selected lanes and the ground-truth lane are used.
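A sketch of the equal-spacing re-sampling and of the padding/sub-sampling to exactly $M$ candidates described above is given below. The linear arc-length interpolation and the helper names are illustrative assumptions, and the ground-truth-lane handling applies only when building training data.

```python
import numpy as np

def resample_equal_spacing(polyline, spacing=1.0):
    """Re-sample a centerline polyline of shape (P, 2) so that adjacent points are
    `spacing` meters apart along the arc length (linear interpolation)."""
    seg = np.linalg.norm(np.diff(polyline, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])           # arc-length position of each input point
    s_new = np.arange(0.0, s[-1] + 1e-9, spacing)
    return np.stack([np.interp(s_new, s, polyline[:, 0]),
                     np.interp(s_new, s, polyline[:, 1])], axis=1)

def pad_or_subsample(lanes, gt_idx, M=10, F=80, rng=np.random):
    """Return exactly M candidates: pad with fake all-zero lanes when there are fewer than M,
    or keep the ground-truth lane plus M-1 randomly selected lanes when there are more."""
    if len(lanes) < M:
        return lanes + [np.zeros((F, 2))] * (M - len(lanes))
    if len(lanes) > M:
        others = [l for i, l in enumerate(lanes) if i != gt_idx]
        keep = rng.choice(len(others), M - 1, replace=False)
        return [lanes[gt_idx]] + [others[i] for i in keep]
    return lanes
```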

0.F.2 Details of Our Implementation

0.F.2.1 Preprocessing:

Let $\mathbf{p}_i^t=(p_x^t,p_y^t)$ denote the position of the vehicle $V_i$ at time $t$. The speed $s$ (meters per second) and heading $h$ (radians) of the vehicle at $t$ are calculated as follows:

s=\psi\sqrt{(p_x^t-p_x^{t-1})^2+(p_y^t-p_y^{t-1})^2},    (24)
h=\arctan\Big(\frac{p_y^t-p_y^{t-1}}{p_x^t-p_x^{t-1}}\Big),    (25)

where $\psi$ is the sampling rate. Let $\mathbf{l}_f^m$ denote the coordinate of the $f$-th point of the lane $\mathbf{L}^m$. The tangent vector $\mathbf{v}_f=(v_{f,x},v_{f,y})$ and its direction $d_{\mathbf{v}_f}$ at the point are calculated as follows:

\mathbf{v}_f=\mathbf{l}_f^m-\mathbf{l}_{f-1}^m,    (26)
d_{\mathbf{v}_f}=\arctan\Big(\frac{v_{f,y}-v_{f-1,y}}{v_{f,x}-v_{f-1,x}}\Big).    (27)
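A minimal NumPy sketch of this preprocessing is shown below. The handling of the first timestep/point (repeating the first difference), the use of arctan2 for quadrant-correct angles, and taking the direction directly from the tangent components are implementation choices of this sketch rather than details given in the paper.

```python
import numpy as np

def preprocess_trajectory(p, sampling_rate):
    """Augment a trajectory of shape (T, 2) with per-timestep speed and heading
    (Eqs. 24-25), giving the 4-dimensional vehicle input."""
    d = np.diff(p, axis=0)
    d = np.concatenate([d[:1], d], axis=0)                 # repeat the first difference (assumption)
    speed = sampling_rate * np.linalg.norm(d, axis=1)
    heading = np.arctan2(d[:, 1], d[:, 0])
    return np.concatenate([p, speed[:, None], heading[:, None]], axis=1)   # (T, 4)

def preprocess_lane(lane):
    """Augment a lane centerline of shape (F, 2) with tangent vectors (Eq. 26) and their
    directions, giving the 5-dimensional lane input."""
    v = np.diff(lane, axis=0)
    v = np.concatenate([v[:1], v], axis=0)
    direction = np.arctan2(v[:, 1], v[:, 0])
    return np.concatenate([lane, v, direction[:, None]], axis=1)           # (F, 5)
```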

0.F.2.2 Feature Extraction Module:

The positional data $\mathbf{X}_i$, $\mathbf{Y}_i$, and $\mathbf{L}^m$ are first preprocessed by the method proposed in this paper. Next, the data are embedded by single-layer MLPs followed by ReLU activation. The MLPs for $\mathbf{X}_i$ and $\mathbf{Y}_i$ take as input a 4-dimensional vector and output a 16-dimensional vector. The MLP for $\mathbf{L}^m$ takes as input a 5-dimensional vector and outputs a 64-dimensional vector. Finally, the embedded sequential vectors are encoded by LSTM networks. The final hidden states of the LSTM networks are used as the final encodings. The hidden state size of the LSTM networks for $\mathbf{X}_i$ and $\mathbf{Y}_i$ is 16; the hidden state size for $\mathbf{L}^m$ is 64.

0.F.2.3 Scene Context Extraction Module:

The attention operation between $\tilde{\mathbf{X}}_i$ and $\tilde{\mathbf{L}}^{(1:M)}$ for the context vector $\mathbf{a}_i^m$ is based on [1]. The context vector $\mathbf{b}_i^m$ is calculated as follows. The messages coming to the node $V_i$ are first calculated by a single-layer MLP followed by ReLU activation, which takes as input a 34-dimensional vector and outputs a 16-dimensional vector, and are then summarized by the sum operation. The summarized message is used to update the hidden state of the node. To update the hidden state, we use a GRU cell, which takes as input a 16-dimensional vector and outputs a 16-dimensional hidden state vector. After one round of message passing, $\mathbf{b}_i^m$ is obtained by summing the hidden states of the neighboring nodes.

0.F.2.4 Mode Selection Network:

The ten lane-level scene context vectors $\{\mathbf{c}_i^m\}$ are first embedded by a single-layer MLP followed by ReLU activation, which takes as input a 160-dimensional vector and outputs a 64-dimensional vector. The embedded vectors are then concatenated and used as input to a single-layer MLP, which takes as input a 640-dimensional vector and outputs a 10-dimensional vector, to obtain the latent vector $\mathbf{z}_h$.

0.F.2.5 Encoder and Prior:

The encoder produces the mean and variance vectors from the lane-level scene context vector $\mathbf{c}_i^m$ and the positional data encoding $\tilde{\mathbf{Y}}_i$. We use two two-layer MLPs, one for the mean and one for the variance. The first layers of the MLPs take as input a 178-dimensional vector and output a 64-dimensional vector. The second layers take as input a 64-dimensional vector and output a 16-dimensional vector. The prior produces the mean and variance vectors from $\mathbf{c}_i^m$. The networks for the prior have the same structure as those for the encoder, except that the first layers of the MLPs take as input a 160-dimensional vector. Finally, note that we use ReLU activation for the first layers of the MLPs.

0.F.2.6 Decoder:

To produce the next position $\hat{\mathbf{p}}_i^{t+1}$, the current position $\hat{\mathbf{p}}_i^t$ is first embedded by a single-layer MLP followed by ReLU activation, which takes as input a 2-dimensional vector and outputs a 16-dimensional vector. Next, $\mathbf{c}_i^m$, $\mathbf{z}_l$, and the embedding are concatenated and used as input to an LSTM network, which takes as input a 192-dimensional vector and outputs a 128-dimensional hidden state vector, to update the hidden state. The next position is obtained by a single-layer MLP, which takes as input a 128-dimensional vector and outputs a 2-dimensional vector.

0.F.2.7 Discriminator:

The positional data $[\mathbf{Y}_i;\Delta\mathbf{Y}_i]$ are first embedded by a single-layer MLP followed by ReLU activation, which takes as input a 4-dimensional vector and outputs a 16-dimensional vector. The embedded sequential data are then encoded by an LSTM network, which takes as input 16-dimensional sequential vectors and outputs 16-dimensional sequential hidden state vectors. The future encoding and the lane encoding $\tilde{\mathbf{L}}^m$ are then used as input to a single-layer MLP to produce a scalar value. The MLP takes as input an 80-dimensional vector.

0.F.2.8 Training:

The Adam optimizer [17] is used for the optimization, with initial learning rates of $10^{-4}$ (nuScenes) and $5\times 10^{-4}$ (Argoverse Forecasting) and a batch size of 8, for 100 (nuScenes) and 50 (Argoverse Forecasting) epochs. We evaluate the prediction performance after every three consecutive training epochs by using the validation samples in the training dataset. Whenever the prediction performance improves over the past best, we save the model's network parameters. During training, we use a cyclical annealing schedule [11] for $\beta$.

0.F.3 Details of Ablation Study

We describe the details of the ablation study shown in Section 4.3 of the paper. For M1, we do not use the positional data preprocessing (PDP), VLI, V2I, or GAN regularization proposed in the paper. As a result, the lane-level scene context vector is defined as $\mathbf{c}_i^m=[\tilde{\mathbf{X}}_i;\tilde{\mathbf{L}}^m]$. For M3, we use the VLI, so that $\mathbf{c}_i^m=[\tilde{\mathbf{X}}_i;\mathbf{a}_i^m]$. Finally, $\mathbf{c}_i^m=[\tilde{\mathbf{X}}_i;\mathbf{a}_i^m;\mathbf{b}_i^m]$ is used for M4, which employs the VLI and V2I.

0.F.4 Details of Baselines

We describe the details of the baseline models shown in Figure 3 of the paper. For the figure, we exclude the scene context extraction module and the discriminator to show how helpful the introduction of the hierarchical latent structure is for the mitigation of mode blur. Finally, note that the trajectories depicted in Figure 3(a) of the paper are generated from M2.

0.F.4.1 Baseline:

We train a generative model with a single latent variable to model the trajectory distribution. One scene context vector $\mathbf{c}_i$ that condenses the information about all the modes of the distribution is first calculated as follows:

\mathbf{c}_i=[\tilde{\mathbf{X}}_i;\tilde{\mathbf{L}}^{ATT}],    (28)

where $\tilde{\mathbf{L}}^{ATT}$ is the result of the attention operation [1] between $\tilde{\mathbf{X}}_i$ and $\tilde{\mathbf{L}}^{(1:M)}$. $\mathbf{c}_i$ is then used as input to the encoder, prior, and decoder.

0.F.4.2 Baseline+BOM:

We train Baseline with the best-of-many (BOM) sample objective [2]. During the training, we let the model generate five trajectories per vehicle and select the trajectory with the minimum ADE out of the five for the $L_2$-distance loss calculation.

0.F.4.3 Baseline+NF:

We train Baseline with normalizing flows (NF) [29]. We apply ten planar flow operations to a random vector that follows the normal distribution to obtain the final latent variable.