PrognoseNet: A Generative Probabilistic Framework for Multimodal Position Prediction given Context Information

Thomas Kurbiel, Akash Sachdeva, Kun Zhao and Markus Buehren

Aptiv GmbH, Wuppertal, Germany
e-mail: [email protected], [email protected] [email protected], [email protected]

Abstract

The ability to predict multiple possible future positions of the ego-vehicle given the surrounding context while also estimating their probabilities is key to safe autonomous driving. Most of the current state-of-the-art Deep Learning approaches are trained on trajectory data to achieve this task. However, trajectory data captured by sensor systems is highly imbalanced, since by far most of the trajectories follow straight lines with an approximately constant velocity. This poses a huge challenge for the task of predicting future positions, which is inherently a regression problem. Current state-of-the-art approaches alleviate this problem only by major preprocessing of the training data, e.g. resampling, clustering into anchors etc.

In this paper we propose an approach which reformulates the prediction problem as a classification task, allowing for powerful tools, e.g. focal loss, to combat the imbalance. To this end we design a generative probabilistic model consisting of a deep neural network with a Mixture of Gaussian head. A smart choice of the latent variable allows for the reformulation of the log-likelihood function as a combination of a classification problem and a much simplified regression problem. The output of our model is an estimate of the probability density function of future positions, hence allowing for prediction of multiple possible positions while also estimating their probabilities. The proposed approach can easily incorporate context information and does not require any preprocessing of the data.

Index Terms:

deep neural networks, generative probabilistic model, trajectory prediction, static context

1 Introduction

Human drivers possess the fundamental skill of predicting a multitude of different possible future positions and movements of other traffic participants. Their predictions highly depend on the surroundings and interactions between traffic participants. This human ability ensures safe and efficient driving. Reproducing this ability by machines is key to safe autonomous driving. Due to the highly dynamic and complex driving environment a large amount of training data is needed in order to develop a system capable of operating at a level comparable to human drivers.

Trajectory data

Most of the current state-of-the-art Deep Learning approaches are trained using trajectory data: spatio-temporal data capturing the movement of vehicles, pedestrians etc. over a certain temporal interval. Trajectory data is usually organized into sequences with three or more dimensions, e.g. batch, time, features. The features can consist of ego x- and y-coordinates, heading angles, velocities etc. To reflect the interaction with other agents, trajectory data may contain multiple objects. Furthermore to reflect the interaction of the objects with the environment, a static map may be included. Trajectory data can be synthetic or based on a real data set.

An eligible approach has to cope with the following two challenges posed by trajectory data captured by sensor systems.

Imbalanced dataset

Trajectory data captured by sensor systems is highly imbalanced, since by far most of the trajectories follow straight lines with an approximately constant velocity. However, it is the abnormal behaviors, such as “unexpected stops”, “accelerations”, “turnings” and “deviation from standard routes”, which interest us and pose a challenge. Fig. 1 depicts the probabilities of future ego-positions evaluated on the Argoverse dataset [5]. The evaluation was performed in the vehicle coordinate system (VCS).

Figure 1 (a): Ego-position in 2000 ms

Fig. 1 clearly shows how straight line trajectories dominate the dataset. Please note that the colorbar is scaled logarithmically in order to make future ego-positions (not lying on the straight line) visible in the first place. Their probability is several orders of magnitude lower than that of the straight line trajectories.

Multimodality

The second challenge consists in predicting not only the most probable future position, but a multitude of different possible future positions and assigning a probability to each one of them. The multimodality of future positions occurs due to many factors, e.g. multiple possible paths, different acceleration patterns, interactions with other traffic participants, just to name a few.

2 Related Work

In the past, many sophisticated methods have been developed in the field of trajectory prediction. Due to the ongoing research, new approaches are presented almost on a weekly basis. In this section we will therefore only mention the most important ones and illustrate how well or poorly they deal with the problems described in the previous section.

One means to cope with imbalanced data is to resample the data, e.g. remove some observations of the majority class (undersampling) or add more copies of the minority class (oversampling) [6]. However, the definition of the minority class and the majority class is based on criteria which are oftentimes hand-engineered. A drawback of this approach is that we might remove information that is valuable. This could lead to underfitting and poor generalization to the test set. Think of the example where we want to balance out the dataset by removing trajectories which follow straight lines. We could classify the trajectories according to their curvature, thereby neglecting the acceleration behavior and hence dropping important samples. This simple example illustrates how difficult the definition of a proper criterion can get. Another fact to emphasize is that even trajectories which are characterized as curves partially consist of segments which follow straight lines.

Similar arguments apply to approaches which partition the trajectories into several clusters/anchors and treat the prediction problem as a classification of the correct cluster/anchor [3, 15]. Here again, the clustering itself is oftentimes based on hand-engineered criteria. Furthermore, the large number of situations encountered on roads and the high uncertainty of traffic behavior require a large number of clusters. The drawback of cluster-based approaches is that their prediction of multimodality is limited by the limited number of clusters.

The basic min-of-$N$ approach is formulated as a regression problem, where multimodality is achieved by introducing an additional latent noise as input [7, 1]. Each choice of the latent noise produces a different prediction. During training, optimization is performed only on the latent noise corresponding to the prediction with the minimal error to the ground truth position. This approach has the drawback that the predictions tend to lie on a uniform grid, as depicted in Fig. 2, especially if the number of latent noises is high. This way the neural network ensures that there is at least one prediction which lies close to the ground truth position. The major drawback of this approach is that no probabilities are assigned to the predictions. Besides, this approach suffers from the imbalance of data and hence requires preprocessing of the data. The training is computationally very expensive, since a prediction has to be computed for each latent noise.
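The training rule of the min-of-$N$ approach described above can be illustrated with a minimal sketch. The function name `min_of_n_loss` and the squared Euclidean error are our assumptions for illustration; the cited works may use a different error measure.

```python
def min_of_n_loss(predictions, ground_truth):
    """Return (loss, index) of the prediction closest to the ground truth.

    predictions:  list of (x, y) tuples, one per latent noise sample.
    ground_truth: (x, y) tuple.
    Only the winning prediction contributes to the loss (assumed squared error).
    """
    errors = [(px - ground_truth[0]) ** 2 + (py - ground_truth[1]) ** 2
              for px, py in predictions]
    best = min(range(len(errors)), key=errors.__getitem__)
    return errors[best], best


# Three hypothetical predictions generated from three latent noises.
preds = [(0.0, 10.0), (3.0, 9.0), (-3.0, 9.0)]
loss, idx = min_of_n_loss(preds, (2.5, 9.5))
print(loss, idx)
```

Note that the returned index carries no probability: the loss only states which prediction was closest, which is the missing-probability drawback mentioned above.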

Another approach is based on Conditional Variational Autoencoders (CVAE), which model multimodality by sampling from a conditional Gaussian distribution [10, 13]. Since CVAEs do not provide any means to inherently emphasize underrepresented samples, this approach too suffers from imbalance of data. Here, the conditional probability realized by the encoder network is dominated by the majority class. Furthermore, assigning probabilities to the predictions at the output of the decoder is difficult: even though the conditional probability of the latent variable is known, the probability density of a transformed random variable is not easily obtained [9].

3 Our Approach

In this section we introduce a novel approach which solves the imbalanced data problem by formulating the prediction problem as a classification problem and utilizing focal loss, thus not requiring any preprocessing of the data. The multimodality is achieved by introducing a generative probabilistic model which outputs an estimate of the probability density function of future positions. This way we create a fully generalizable system, which is not confined by any hand-engineered preprocessing of the data.

3.1 Input Output

The input trajectory (up to time step $t$ ) is defined as the sequence:

\mathscr{\mathscr{T}}^{\left\langle 0:t\right\rangle}=\left\{\left(\text{$\Delta$}x^{\left\langle\tau\right\rangle},\ \text{$\Delta$}y^{\left\langle\tau\right\rangle},\ v^{\left\langle\tau\right\rangle},\ h^{\left\langle\tau\right\rangle}\right)\right\}_{\tau=0}^{t},

(1)

with $\text{$\Delta$}x^{\left\langle t\right\rangle}=x^{\left\langle t\right\rangle}-x^{\left\langle t-1\right\rangle}$ and $\text{$\Delta$}y^{\left\langle t\right\rangle}=y^{\left\langle t\right\rangle}-y^{\left\langle t-1\right\rangle}$ . For brevity in the following we will write: $\boldsymbol{x}^{\left\langle t\right\rangle}=(x^{\left\langle t\right\rangle},y^{\left\langle t\right\rangle})$ . Both $\boldsymbol{x}^{\left\langle t\right\rangle}$ and $\boldsymbol{x}^{\left\langle t-1\right\rangle}$ in the definition of the deltas are expressed in the vehicle coordinate system of time step $t$ (please note that $\boldsymbol{x}^{\left\langle t\right\rangle}=(0,\ 0)$ in VCS of time step $t$ ). The terms $v^{\left\langle t\right\rangle}$ , $h^{\left\langle t\right\rangle}$ are the velocity and the heading angle respectively. Please note that $t$ is an integer index denoting measurements, which were obtained using a sampling rate of 10Hz.
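The computation of the deltas in the vehicle coordinate system can be sketched as follows. This is a minimal illustration under assumptions that are not stated in the text: the heading angle is given in radians, measured counter-clockwise from the global x-axis, and the VCS x-axis points in driving direction. The helper names `to_vcs` and `deltas_in_vcs` are hypothetical.

```python
import math


def to_vcs(point, origin, heading):
    """Express a global point in the VCS defined by origin and heading.

    Assumption: heading in radians, counter-clockwise from the global x-axis,
    VCS x-axis pointing in driving direction.
    """
    dx = point[0] - origin[0]
    dy = point[1] - origin[1]
    c, s = math.cos(heading), math.sin(heading)
    # Rotate the translated point by -heading.
    return (c * dx + s * dy, -s * dx + c * dy)


def deltas_in_vcs(prev_global, cur_global, heading):
    """Return (delta_x, delta_y) = x<t> - x<t-1>, both expressed in the VCS of step t."""
    cur_vcs = to_vcs(cur_global, cur_global, heading)   # always (0, 0)
    prev_vcs = to_vcs(prev_global, cur_global, heading)
    return (cur_vcs[0] - prev_vcs[0], cur_vcs[1] - prev_vcs[1])


# Vehicle heading "north" in global coordinates, moved one unit forward.
print(deltas_in_vcs((5.0, 3.0), (5.0, 4.0), math.pi / 2))
```

Under these assumptions a vehicle driving straight ahead always yields a delta along the VCS x-axis, regardless of its global orientation.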

The static map for time step $t$ is denoted as $\mathcal{\mathscr{M}}^{\left\langle t\right\rangle}$ . It is also expressed in the vehicle coordinate system of time step $t$ , see Fig. 3. The static maps used in this paper contain road boundaries and center lines.

The benefit of using the vehicle coordinate system is that it reduces the complexity of the input data. In the vehicle coordinate system, the two trajectories depicted in Fig. 3 are identical, whereas in global coordinates they are not.

The ground truth for time step $t$ is $\boldsymbol{x}^{\left\langle t+\Delta\right\rangle}$ , i.e. the position of the ego vehicle $\Delta$ time steps ahead. The ground truth position is expressed in the vehicle coordinate system of the current time step $t$ . Since we are in the unsupervised learning setting, these points do not come with any labels. For each time step $t$ we can hence define the following input output pair: $(\mathscr{\mathscr{T}}^{\left\langle 0:t\right\rangle},\ \mathcal{\mathscr{M}}^{\left\langle t\right\rangle};\ \boldsymbol{x}^{\left\langle t+\text{$\Delta$}\right\rangle})$ .

In this paper the ground truth is defined only as the future position of the ego vehicle. It can, however, be extended to encompass further information, e.g. the future heading angle.

3.2 Loss Function

In the following, we will denote random variables by an uppercase letter, while their realizations will be denoted by a lowercase letter.

We wish to model the data in a generative manner (a generative model includes the distribution of the data itself, and tells you how likely a given example is) by approximating the following conditional density function:

f\left(\boldsymbol{X}^{\left\langle t+\text{$\Delta$}\right\rangle}\ |\ \mathscr{T}^{\left\langle 0:t\right\rangle},\ \mathcal{\mathscr{M}}^{\left\langle t\right\rangle}\right)=\\ f\left(X^{\left\langle t+\text{$\Delta$}\right\rangle},\ Y^{\left\langle t+\text{$\Delta$}\right\rangle}\ |\ \mathscr{T}^{\left\langle 0:t\right\rangle},\ \mathcal{\mathscr{M}}^{\left\langle t\right\rangle}\right),

(2)

which gives the probabilities of all possible positions at time step $t+\text{$\Delta$}$ , given the input trajectory $\mathscr{T}^{\left\langle 0:t\right\rangle}$ up to time step $t$ and the static map $\mathscr{M}^{\left\langle t\right\rangle}$ at time step $t$ . Please note that $X^{\left\langle t+\text{$\Delta$}\right\rangle}$ , $Y^{\left\langle t+\text{$\Delta$}\right\rangle}$ are expressed in the vehicle coordinate system of time step $t$ .

We model the data by specifying a Gaussian mixture model for each time step $t$ :

p\left(\boldsymbol{X}^{\left\langle t+\text{$\Delta$}\right\rangle}\ |\ \mathscr{T}^{\left\langle 0:t\right\rangle},\ \mathcal{\mathscr{M}}^{\left\langle t\right\rangle};\ \theta_{1},\ \theta_{2}\right)=\\ \sum_{j=1}^{k}p\left(\boldsymbol{X}^{\left\langle t+\text{$\Delta$}\right\rangle}\ |\ Z^{\left\langle t+\text{$\Delta$}\right\rangle}=j,\ \mathcal{\mathscr{T}}^{\left\langle 0:t\right\rangle},\ \mathcal{\mathscr{M}}^{\left\langle t\right\rangle};\ \theta_{1}\right)\\ \cdot p\left(Z^{\left\langle t+\text{$\Delta$}\right\rangle}=j\ |\ \mathscr{T}^{\left\langle 0:t\right\rangle},\ \mathcal{\mathscr{M}}^{\left\langle t\right\rangle};\ \theta_{2}\right),

(3)

where $Z^{\left\langle t+\text{$\Delta$}\right\rangle}$ is a discrete latent random variable, which can take on $k$ different values. The $\theta_{1},\ \theta_{2}$ denote the parameters of the different parts of the Gaussian mixture model. The probability distribution:

p(Z^{\left\langle t+\text{$\Delta$}\right\rangle}=j\ |\ \mathscr{T}^{\left\langle 0:t\right\rangle},\ \mathcal{\mathscr{M}}^{\left\langle t\right\rangle};\ \theta_{2})

in the above equation is characterized by $k$ probability values $\phi_{j}$ satisfying $\phi_{j}\geq 0$ and $\sum_{j=1}^{k}\phi_{j}=1$ . Furthermore, for a Gaussian mixture model we assume:

\boldsymbol{X}^{\left\langle t+\text{$\Delta$}\right\rangle}\ |\ Z^{\left\langle t+\text{$\Delta$}\right\rangle}=j\ \sim\ \mathcal{N}(\boldsymbol{\mu}_{j},\ \boldsymbol{\Sigma}_{j}).

In the subsequent formulas, for the sake of brevity, we will not explicitly denote the dependency on $\mathscr{T}^{\left\langle 0:t\right\rangle}$ , $\mathcal{\mathscr{M}}^{\left\langle t\right\rangle}$ when there is no risk of ambiguity.

Assuming that the $m$ training examples were generated independently, we can write down the log-likelihood of the parameters $\theta_{1}$ and $\theta_{2}$ for a single time step $t$ as:

\mathcal{\ell}\left(t,\theta_{1},\theta_{2}\right)=\sum_{i=1}^{m}\log p\left(\boldsymbol{x}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}\ |\ \ldots\ ;\ \theta_{1},\theta_{2}\right)\\ =\sum_{i=1}^{m}\log\sum_{j=1}^{k}\left[p\left(\boldsymbol{x}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}\ |\ Z^{\left\langle t+\text{$\Delta$}\right\rangle}=j,\ \ldots\ ;\ \theta_{1}\right)\right.\\ \left.\cdot p\left(Z^{\left\langle t+\text{$\Delta$}\right\rangle}=j\ |\ \ldots\ ;\ \theta_{2}\right)\right],

(4)

where the superscripts in round brackets denote the sample number. We will design the latent random variable $Z^{\left\langle t+\text{$\Delta$}\right\rangle}$ in such a way that its true values $z^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}$ are known beforehand and can easily be deduced from the ground truth $\boldsymbol{x}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}$ . We can then write the log-likelihood as:

\mathcal{\ell}\left(t,\theta_{1},\theta_{2}\right)=\sum_{i=1}^{m}\left[\log p\left(\boldsymbol{x}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}\ |\ z^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)},\ \ldots\ ;\theta_{1}\right)\right.\\ \left.+\log p\left(z^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}\ |\ \ldots\ ;\theta_{2}\right)\right],

(5)

or using the ground truth of $Z^{\left\langle t+\text{$\Delta$}\right\rangle}$ we can rewrite the above expression as:

\mathcal{\ell}\left(t,\theta_{1},\theta_{2}\right)=\sum_{i=1}^{m}\sum_{j=1}^{k}1\left\{z^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}=j\right\}\cdot\log p\left(j\ |\ \ldots\ ;\theta_{2}\right)\\ +\sum_{i=1}^{m}\sum_{j=1}^{k}1\left\{z^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}=j\right\}\cdot\log p\left(\boldsymbol{x}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}\ |\ j,\ \ldots\ ;\theta_{1}\right),

(6)

where $1\{\ldots\}$ is the indicator function indicating from which Gaussian each sample came. Plugging in the definition of a multivariate Gaussian distribution and assuming $X^{\left\langle t+\text{$\Delta$}\right\rangle}$ and $Y^{\left\langle t+\text{$\Delta$}\right\rangle}$ to be independent, we get:

\mathcal{\ell}\left(t,\theta_{1},\theta_{2}\right)=\sum_{i=1}^{m}\sum_{j=1}^{k}1\left\{z^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}=j\right\}\cdot\log p\left(j\ |\ \ldots\ ;\theta_{2}\right)\\ -\sum_{i=1}^{m}\sum_{j=1}^{k}\left[\frac{\left(x^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}-\mu_{x,j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}\right)^{2}}{2\cdot\left(\sigma_{x,j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}\right)^{2}}+\frac{\left(y^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}-\mu_{y,j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}\right)^{2}}{2\cdot\left(\sigma_{y,j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}\right)^{2}}+\log\sigma_{x,j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}+\log\sigma_{y,j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}+\textrm{c}\right]\cdot 1\left\{z^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}=j\right\},

(7)

with constant $\textrm{c}=\log 2\pi$ . The final cost function is summed over all time steps $t$ :

\textrm{cost}=-\sum_{t=1}^{T_{\textrm{max}}}\mathcal{\ell}\left(t,\theta_{1},\theta_{2}\right),

where $T_{\textrm{max}}$ is the number of time steps.
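To make the structure of (7) concrete, the contribution of a single sample to the cost can be sketched as follows. This is a minimal illustration, not the training code: the function name `sample_nll`, the zero-based component index `z` and the plain Python lists standing in for the network outputs are our assumptions.

```python
import math


def sample_nll(x, y, z, phi, mu_x, mu_y, sigma_x, sigma_y):
    """Negative summand of (7) for one sample whose true component index is z.

    Because of the indicator function only component z contributes.
    z is zero-based here (assumption), i.e. z = j - 1.
    """
    # Classification part: cross-entropy on the latent variable.
    classification = -math.log(phi[z])
    # Regression part: negative log-density of two independent Gaussians.
    regression = ((x - mu_x[z]) ** 2 / (2.0 * sigma_x[z] ** 2)
                  + (y - mu_y[z]) ** 2 / (2.0 * sigma_y[z] ** 2)
                  + math.log(sigma_x[z]) + math.log(sigma_y[z])
                  + math.log(2.0 * math.pi))   # constant c
    return classification + regression


# Hypothetical mixture with two components; the sample belongs to component 0.
print(sample_nll(1.0, 2.0, 0,
                 phi=[0.5, 0.5],
                 mu_x=[1.0, 0.0], mu_y=[2.0, 0.0],
                 sigma_x=[1.0, 1.0], sigma_y=[1.0, 1.0]))
```

In this example the means of the active component coincide with the ground truth, so the quadratic terms vanish and only $-\log\phi_{j}$ and the constant remain.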

3.3 Choice of Latent Variable

We have to choose the latent variable $z^{\left\langle t+\text{$\Delta$}\right\rangle}$ in such a way that its ground truth value can be obtained beforehand. To this end we subdivide the static map into $N\times N$ subregions and assign an index $j=1,\text{\ldots},N^{2}$ to each of the subregions, as depicted in Fig. 5.

We can now easily determine in which of the $N^{2}$ different subregions the ground truth future position $\boldsymbol{x}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}$ is located. Choosing the subregion $j$ as our latent variable hence fulfills the required criteria.
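Determining the ground truth of the latent variable amounts to a simple quantization. The following sketch assumes a rectangular map extent in VCS and a row-major numbering of the subregions; the actual numbering used in Fig. 5 may differ, and the function name `subregion_index` is hypothetical.

```python
def subregion_index(x, y, x_min, x_max, y_min, y_max, n):
    """Return the 1-based index j of the subregion containing (x, y).

    Assumptions: the map covers [x_min, x_max) x [y_min, y_max) and the
    n x n subregions are numbered row by row. Returns None if the position
    lies outside the map.
    """
    if not (x_min <= x < x_max and y_min <= y < y_max):
        return None
    col = int((x - x_min) / (x_max - x_min) * n)
    row = int((y - y_min) / (y_max - y_min) * n)
    # Guard against floating point rounding at the upper border.
    col = min(col, n - 1)
    row = min(row, n - 1)
    return row * n + col + 1


# Hypothetical 40 x 40 map with the vehicle at the bottom center, N = 10.
print(subregion_index(1.0, 30.0, -20.0, 20.0, 0.0, 40.0, 10))  # -> 76
```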

3.4 Interpretation

Using the above definition of the latent variable let us dissect equation (7) to gain more insight. We will start with the first term:

-\sum_{i,j=1}^{m,k}1\left\{{\scriptstyle z^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}=j}\right\}\cdot\log p\left(j\ |\ \mathscr{T}^{\left\langle 0:t\right\rangle\left(i\right)},\ \mathcal{\mathscr{M}}^{\left\langle t\right\rangle\left(i\right)};\theta_{2}\right),

(8)

which is recognized as the cross-entropy loss in a multiclass setting with $k=N^{2}$ classes. Hence our choice of the latent variable leads to a cost function which contains a classification problem.

The cross-entropy loss is used in neural networks which have softmax activations in the output layer. Hence we will realize $p(j\ |\ \mathcal{\mathscr{T}}^{\left\langle 0:t\right\rangle\left(i\right)},\ \mathcal{\mathscr{M}}^{\left\langle t\right\rangle\left(i\right)};\ \theta_{2})$ by a neural network with inputs $\mathcal{\mathscr{T}}^{\left\langle 0:t\right\rangle\left(i\right)}$ and $\mathscr{M}^{\left\langle t\right\rangle\left(i\right)}$ and a softmax activation at the output layer. Please remember that this part of the neural network outputs the probability distribution of the discrete latent variable $Z^{\left\langle t+\text{$\Delta$}\right\rangle}$ , which is characterized by $k=N^{2}$ probability values $\phi_{j}$ satisfying $\phi_{j}\geq 0$ and $\sum_{j=1}^{k}\phi_{j}=1$ . In the following we will hence denote the output of this network as $\phi_{j}^{\left\langle t+\text{$\Delta$}\right\rangle}$ .

Each value $j$ which the latent variable $z^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}$ can assume corresponds to a different spatial region. As has been shown in section 1, the future positions of the ego vehicle are unequally distributed, see Fig. 1, which makes (8) a highly imbalanced classification problem. However, since we are in the classification domain, we have powerful tools to combat this issue, e.g. focal loss [14].
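To illustrate how focal loss modifies the cross-entropy term (8), consider the following sketch for a single sample. It uses the common form $-(1-p_{t})^{\gamma}\log p_{t}$ without class weighting; the value $\gamma=2$ and the function name `focal_loss` are assumptions for illustration, not settings taken from this paper.

```python
import math


def focal_loss(phi, z, gamma=2.0):
    """Focal loss for one sample.

    phi:   softmax output, one probability per subregion.
    z:     zero-based index of the true subregion (assumption).
    gamma: focusing parameter; gamma = 0 recovers the cross-entropy loss.
    """
    p_t = phi[z]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = [0.9, 0.1]   # well-classified majority sample (e.g. straight driving)
hard = [0.1, 0.9]   # badly classified minority sample

print(focal_loss(easy, 0), focal_loss(easy, 0, gamma=0.0))
print(focal_loss(hard, 0), focal_loss(hard, 0, gamma=0.0))
```

The factor $(1-p_{t})^{\gamma}$ scales down the loss of the well-classified sample far more strongly than that of the badly classified one, which shifts the training signal towards the rare classes.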

Let us now turn to the second part of the cost function:

\sum_{i,j=1}^{m,k}\left[\frac{\left(\mu_{x,j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}-x^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}\right)^{2}}{2\cdot\left(\sigma_{x,j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}\right)^{2}}+\frac{\left(\mu_{y,j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}-y^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}\right)^{2}}{2\cdot\left(\sigma_{y,j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}\right)^{2}}+\log\sigma_{x,j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}+\log\sigma_{y,j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}\right]\cdot 1\left\{z^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}=j\right\}.

(9)

First of all please note that by design the ground truth value $\boldsymbol{x}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}$ is contained in the subregion indicated by $z^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}$ , see Fig. 5. Because of the indicator function in (9), for each sample $i$ only one Gaussian component $\boldsymbol{\mu}_{j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}$ , $\boldsymbol{\sigma}_{j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}$ is active, with the same index as the selected subregion $j=z^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}$ . During training, the Gaussian components will therefore learn to correspond to different subregions of the static map.

In order to minimize the above expression, the means $\boldsymbol{\mu}_{j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}$ of the Gaussian components must come as close as possible to $\boldsymbol{x}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}$ , hence (9) can be interpreted as a (weighted) regression problem with $k=N^{2}$ regressors covering different spatial positions. The sigma terms $\boldsymbol{\sigma}_{j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}$ capture how much noise there is in the outputs [12]. Since each regressor covers different spatial positions, this helps to combat the imbalanced data problem even further.

In the next step, we will simplify the regression problem, thus emphasizing the classification task even further. Please note that the classification part (8) is already predicting the future position of the ego vehicle, however quantized to the resolution of the subdivision. We can hence design the regression part (9) only to refine that prediction by providing an offset to the actual position. To this end, we define a grid consisting of the center positions of each subregion: $(\textrm{center}_{x,j},\textrm{center}_{y,j})$ , with $j=1,\ldots,N^{2}$ . Now we can substitute in (9):

\mu_{x,j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}=\text{$\Delta$}\mu_{x,j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}+\textrm{center}_{x,j},\\ \mu_{y,j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}=\text{$\Delta$}\mu_{y,j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}+\textrm{center}_{y,j}.

(10)

This further reduces the complexity of the regression problem, since it has to predict only offsets, see Fig. 6.
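The substitution (10) can be sketched as a small decoding step. The map extent, the row-major ordering of the subregions and the helper names `cell_centers` and `decode_means` are assumptions made for illustration.

```python
def cell_centers(x_min, x_max, y_min, y_max, n):
    """Centers of the n x n subregions, listed row by row (assumed ordering)."""
    width = (x_max - x_min) / n
    height = (y_max - y_min) / n
    return [(x_min + (col + 0.5) * width, y_min + (row + 0.5) * height)
            for row in range(n) for col in range(n)]


def decode_means(delta_mu, centers):
    """Apply (10): add the constant grid of centers to the predicted offsets."""
    return [(dx + cx, dy + cy) for (dx, dy), (cx, cy) in zip(delta_mu, centers)]


# Toy example with N = 2 on a 4 x 4 map and identical offsets in every cell.
centers = cell_centers(0.0, 4.0, 0.0, 4.0, 2)
offsets = [(0.5, -0.5)] * 4
print(decode_means(offsets, centers))
```

Since the centers are constant, the network only has to learn offsets that are small relative to the map size, which is the simplification referred to above.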

3.5 Inference Time

At each time step $t$ the neural network outputs the parameters of a Gaussian mixture model:

\boldsymbol{\mu}_{j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)},\ \boldsymbol{\sigma}_{j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}\ \mathrm{and}\ \phi_{j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}\ \mathrm{with}\ j=1,\text{\ldots},N^{2}.

This model estimates the probability of the vehicle being at position $\boldsymbol{X}^{\left\langle t+\text{$\Delta$}\right\rangle}$ in $\Delta$ time steps:

p\left(\boldsymbol{X}^{\left\langle t+\text{$\Delta$}\right\rangle}\ |\ \mathscr{T}^{\left\langle 0:t\right\rangle\left(i\right)},\ \mathcal{\mathscr{M}}^{\left\langle t\right\rangle\left(i\right)};\ \theta_{1},\ \theta_{2}\right),

(11)

given the input trajectory $\mathscr{T}^{\left\langle 0:t\right\rangle}$ and the static map $\mathscr{M}^{\left\langle t\right\rangle}$ . We can visualize (11) by plugging in values for $\boldsymbol{X}^{\left\langle t+\text{$\Delta$}\right\rangle}$ located on a dense uniform grid, see Fig. 7. The position of the vehicle at the current time step is in the middle bottom position (green point). The ground truth position and orientation of the ego-vehicle in 2000 ms is shown as the blue arrow.

Figure 7: Probability heatmap $p(\boldsymbol{X}^{\left\langle t+\text{$\Delta$}\right\rangle}\ |\ \mathscr{T}^{\left\langle 0:t\right\rangle},\ \mathcal{\mathscr{M}}^{\left\langle t\right\rangle};\ \theta_{1},\ \theta_{2})$ showing multimodality (a) and impact of static map (b)
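Evaluating (11) on a dense uniform grid can be sketched as follows. The sketch assumes the independent (diagonal) Gaussians of (7); the function names and the toy parameters are hypothetical and merely stand in for actual network outputs.

```python
import math


def mixture_density(x, y, phi, mu, sigma):
    """Density of the Gaussian mixture at position (x, y).

    phi:   list of mixture weights.
    mu:    list of (mu_x, mu_y) tuples.
    sigma: list of (sigma_x, sigma_y) tuples (independent coordinates).
    """
    total = 0.0
    for weight, (mx, my), (sx, sy) in zip(phi, mu, sigma):
        gx = math.exp(-0.5 * ((x - mx) / sx) ** 2) / (sx * math.sqrt(2.0 * math.pi))
        gy = math.exp(-0.5 * ((y - my) / sy) ** 2) / (sy * math.sqrt(2.0 * math.pi))
        total += weight * gx * gy
    return total


def heatmap(phi, mu, sigma, xs, ys):
    """Evaluate the mixture on the grid xs x ys; one row per y value."""
    return [[mixture_density(x, y, phi, mu, sigma) for x in xs] for y in ys]


# Toy bimodal example: a dominant left mode and a weaker right mode.
grid = heatmap(phi=[0.6, 0.4],
               mu=[(-3.0, 10.0), (3.0, 10.0)],
               sigma=[(1.0, 1.0), (1.0, 1.0)],
               xs=[-3.0, 0.0, 3.0], ys=[10.0])
print(grid)
```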

In many applications, however, instead of a heatmap we want to have a list containing the most probable future positions. The list should contain only distinct positions, thus reflecting the multimodality of the future.

As can be seen in Fig. 7 (a), the $N^{2}$ predictions made by our approach tend to form distinct clusters. Out of each cluster we want to keep only the most probable prediction and get rid of all nearly identical ones. Non-maximum suppression (NMS) is a way to make sure that each position is detected only once [2]. To easily apply the standard NMS, we will interpret the $\boldsymbol{\mu}_{j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}$ as the centers of bounding boxes and choose the bounding box sizes to be proportional to $\boldsymbol{\sigma}_{j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}$ . The confidence required by NMS is obtained by using the fact that by design the Gaussian components are mostly bound to their corresponding subregions, yielding:

\underset{\boldsymbol{X}^{\left\langle t+\text{$\Delta$}\right\rangle}\in\ \mathrm{subregion}{}_{j}}{\mathrm{max}}\ p(\boldsymbol{X}^{\left\langle t+\text{$\Delta$}\right\rangle}\ |\ \mathscr{T}^{\left\langle 0:t\right\rangle\left(i\right)},\ \mathcal{\mathscr{M}}^{\left\langle t\right\rangle\left(i\right)})\thickapprox\phi_{j}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}

which gives the probability that the ego vehicle will be in subregion $j$ in $\Delta$ time steps. Please note that the number of predictions returned by NMS varies; however, at least one prediction is always returned. We denote the distinct predictions as $\boldsymbol{\mu}_{k,\mathrm{nms}}^{\left\langle t+\text{$\Delta$}\right\rangle\left(i\right)}$ , $k=1,\ldots,\mathrm{max}$ , where $k=1$ is the most probable prediction, $k=2$ is the second most probable prediction and so on.
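The NMS step described above can be sketched with a standard greedy procedure. The proportionality factor `scale` between box size and sigma, the IoU threshold and the function names are assumptions for illustration; the text does not specify them.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0.0 else 0.0


def nms_positions(mu, sigma, phi, scale=1.0, iou_threshold=0.3):
    """Greedy NMS: boxes centered at mu, half-size scale * sigma, confidence phi.

    Returns the kept (position, confidence) pairs, most probable first.
    """
    boxes = [(mx - scale * sx, my - scale * sy, mx + scale * sx, my + scale * sy)
             for (mx, my), (sx, sy) in zip(mu, sigma)]
    order = sorted(range(len(phi)), key=lambda j: phi[j], reverse=True)
    keep = []
    for j in order:
        # Keep a prediction only if it does not overlap a more confident one.
        if all(iou(boxes[j], boxes[i]) <= iou_threshold for i in keep):
            keep.append(j)
    return [(mu[j], phi[j]) for j in keep]


# Two nearly identical predictions and one distinct prediction.
result = nms_positions(mu=[(0.0, 10.0), (0.2, 10.1), (5.0, 8.0)],
                       sigma=[(1.0, 1.0)] * 3,
                       phi=[0.5, 0.3, 0.2])
print(result)
```

The most confident prediction is always kept, so at least one position is returned for any non-empty input, in line with the remark above.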

3.6 Network Architecture

In this section we describe a basic architecture needed for realizing the model described in the previous sections. As described in section 3.1, the input of the network consists both of the sequence describing the dynamics of the vehicle $\{(\text{$\Delta$}x^{\left\langle t\right\rangle},\ \text{$\Delta$}y^{\left\langle t\right\rangle},\ v^{\left\langle t\right\rangle},\ h^{\left\langle t\right\rangle})\}_{t=0}^{T_{\mathrm{max}}}$ and the sequence of static maps $\{\mathcal{\mathscr{M}}^{\left\langle t\right\rangle}\}_{t=0}^{T_{\mathrm{max}}}$ .

In order to generate sensible predictions, we need to aggregate the past values of $\text{$\Delta$}x^{\left\langle t\right\rangle},\ \text{$\Delta$}y^{\left\langle t\right\rangle},\ v^{\left\langle t\right\rangle},\ h^{\left\langle t\right\rangle}$ . Hence a natural choice for processing of the input data (describing the dynamics of the vehicle) is a recurrent neural network (RNN), as depicted in Fig. 8. In the actual implementation, a member of the broad family of RNN architectures e.g. Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU), as well as their variants, can be used. The low dimensional input vector can optionally be embedded into a higher dimensional space.

The static map is represented as spatial data and is condensed to a feature vector (representing semantic information of the static map) using a convolutional neural network (CNN). The static maps only act as constraints for possible predictions and hence do not require an aggregation of past values, see Fig. 8. To account for shifts and rotations of the static map arising from the use of VCS, a spatial transformer network can be utilized [11].

The outputs of both the RNN and the CNN are fed into the predictor, whose output consists of five spatial maps of $N\times N$ cells each. Each spatial map represents a different component of the Gaussian mixture model: $\phi_{j}^{\left\langle t+\Delta\right\rangle}$ , $\Delta\mu_{x,j}^{\left\langle t+\Delta\right\rangle}$ , $\sigma_{x,j}^{\left\langle t+\text{$\Delta$}\right\rangle}$ , $\text{$\Delta$}\mu_{y,j}^{\left\langle t+\text{$\Delta$}\right\rangle}$ and $\sigma_{y,j}^{\left\langle t+\text{$\Delta$}\right\rangle}$ . In a subsequent postprocessing step we add a constant grid to the coordinate offsets (delta terms), as described in (10).

Actual Implementation

As mentioned before, the context information we use encompasses only static context consisting of centerlines and the driveable area. For this reason, only a simple architecture is required to achieve reasonable results. More complex tasks, e.g. with agent interaction, will require more complicated architectures (feature pyramid networks, transposed convolution etc.).

The input to the CNN consists of a spatial map of size 128x128x2. The centerlines and the driveable area are in separate feature maps. The CNN depicted in Fig. 8 consists of 5 layers, see Tab. I (a). Each convolution layer is followed by a maximum pooling layer to halve the size of the feature map. We use “tanh” as activation function in all layers. The output of the CNN is flattened to generate a feature vector of size 256.

layer	size	nodes
0	128x128	2
1	64x64	4
2	32x32	8
3	16x16	8
4	8x8	16
5	4x4	16

(a) CNN

layer	nodes
0	4
1	8
2	16

(b) Embedding

layer	nodes
0	16
1	256
2	150

(c) LSTM

layer	nodes
0	256+150
1	256
2	128
3	500

(d) Upsampling

TABLE I: Actual implementation

The input to the recurrent part of the architecture consists of 4 values: the deltas between the current position and the last position, the velocity and the heading angle, as described in section 3.1. Prior to feeding the input into the RNN, we use an embedding network consisting of two layers. Again we use the “tanh” activation function. The resulting embedding has a dimension of 16, see Tab. I (b). The concrete implementation of the RNN consists of an LSTM with two layers, see Tab. I (c).

Finally, the outputs of the LSTM and the CNN are concatenated into a vector of size 256+150, which is fed to the Upsampling Network to obtain the output maps. The Upsampling Network depicted in Fig. 8 is implemented as a stack of three dense layers, the last of which is a linear layer. The output is reshaped to form spatial maps of size 10x10x5.
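The layer sizes quoted above can be cross-checked with a few lines of arithmetic. The following is only a bookkeeping sketch in Python, not the network itself; the helper name `cnn_feature_map_sizes` and the variable names are illustrative assumptions, while the numbers are taken from Tab. I.

```python
def cnn_feature_map_sizes(input_size, channels):
    """Return (spatial_size, channels) per layer, halving the size at every layer."""
    sizes = [(input_size, channels[0])]
    size = input_size
    for ch in channels[1:]:
        size //= 2  # each max-pooling layer halves the spatial size
        sizes.append((size, ch))
    return sizes


# Channel counts per layer as listed in Tab. I (a).
cnn_layers = cnn_feature_map_sizes(128, [2, 4, 8, 8, 16, 16])
last_size, last_channels = cnn_layers[-1]
cnn_features = last_size * last_size * last_channels  # flattened CNN output

lstm_features = 150                      # last LSTM layer, Tab. I (c)
predictor_input = cnn_features + lstm_features

grid_cells, num_maps = 10, 5             # N x N cells, five mixture parameters
predictor_output = grid_cells * grid_cells * num_maps

print(cnn_layers)
print(cnn_features, predictor_input, predictor_output)
```

The flattened CNN output (4x4x16) matches the stated feature vector of size 256, and the last dense layer with 500 nodes matches the reshaped output of size 10x10x5.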

4 Evaluation on the Argoverse dataset

In this section we describe the evaluation of the presented approach on the Argoverse dataset [5]. Argoverse contains a motion forecasting dataset with 324,557 sequences and rich context maps. Each sequence consists of exactly 50 samples and varies in length from 4 to 25 seconds. The motion forecasting dataset was mined to contain diverse scenarios, e.g. managing an intersection, slowing for a merging vehicle, accelerating after a turn, stopping for a pedestrian on the road, etc. The number of tracks in which a vehicle travels at nearly constant velocity (such tracks hardly represent real forecasting challenges) is hence drastically reduced. Furthermore, Argoverse provides rich multimodality, as can be seen in Fig. 9.

Figure 9: Multimodality of Argoverse. (a) Right turn. (b) Straight.

Our goal is to predict the future position of the ego-vehicle 2000ms ahead given the static context. Since we operate on a sampling interval of 100ms and want to predict 2000ms into the future, we choose the ground truth position to lie $\Delta=20$ time steps ahead, see Section 3.1. With 50 samples per sequence, our overall input trajectory hence consists of $T_{\mathrm{max}}=30$ time steps. However, not all motion forecasting sequences of Argoverse use a sampling rate of 10Hz (sampling interval of 100ms). Out of the 324,557 sequences we therefore select 150,000 sequences with a length of approximately 5s, corresponding to a sampling rate of 10Hz. We split the data into a training set of 140,000 samples and a test set of 10,000 samples. We would like to stress that no balancing of the data whatsoever is performed with regard to the type of trajectory, i.e. turns, straight lines etc.
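The relation between the prediction horizon, the sampling interval and the number of usable input steps can be sketched as follows. This is a minimal illustration in Python; the helper `input_target_pairs`, the constant names and the zero-based indexing are assumptions made for the example only.

```python
SAMPLING_INTERVAL_MS = 100
HORIZON_MS = 2000
SEQUENCE_LENGTH = 50  # samples per Argoverse sequence

delta = HORIZON_MS // SAMPLING_INTERVAL_MS  # time steps between input and target
t_max = SEQUENCE_LENGTH - delta             # input steps that still have a target


def input_target_pairs(sequence, delta):
    """Pair the sample at step t with the sample delta steps later (zero-based t)."""
    return [(sequence[t], sequence[t + delta]) for t in range(len(sequence) - delta)]


# Stand-in sequence: the sample indices themselves instead of real positions.
pairs = input_target_pairs(list(range(SEQUENCE_LENGTH)), delta)
print(delta, t_max, len(pairs), pairs[0], pairs[-1])
```

Every input step has exactly one ground truth position lying 2000ms ahead, which is why only the first $T_{\mathrm{max}}$ samples of a sequence can serve as input.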

We evaluate our approach using three metrics commonly used in motion forecasting: Average Displacement Error (ADE), Final Displacement Error (FDE) and Minimum Average Displacement Error (minADE) [16].

The ADE metric is defined as the average Euclidean distance over all time steps and samples, calculated for the most probable prediction $k=1$, which is obtained via non-maximum suppression (NMS) as described in Section 3.5:

\mathrm{ADE}=\frac{1}{m\cdot\left(T_{\mathrm{max}}-4\right)}\sum_{i=1}^{m}\sum_{t=5}^{T_{\mathrm{max}}}\left\|\boldsymbol{\mu}_{1,\mathrm{nms}}^{\left\langle t+\Delta\right\rangle\left(i\right)}-\boldsymbol{x}^{\left\langle t+\Delta\right\rangle\left(i\right)}\right\|,

(12)

where the length of trajectories is $T_{\mathrm{max}}=30$ . Since we are using an LSTM, we will ignore the first 5 time steps to account for the warm-up phase.
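A minimal sketch of how the ADE could be computed in Python is given below. It assumes that predictions and ground truth are stored per sample as lists of $(x, y)$ tuples already aligned to the same future time step; the function name `ade`, the parameter `first_step` and the mapping of the warm-up phase to list indices are assumptions of this example. The result is normalized by the number of evaluated terms.

```python
import math


def ade(predictions, ground_truth, first_step=5):
    """Average Euclidean distance over all samples and evaluated time steps.

    predictions[i][t] and ground_truth[i][t] are (x, y) tuples referring to the
    same future time step; entries before `first_step` are skipped (warm-up).
    """
    total, count = 0.0, 0
    for pred_seq, gt_seq in zip(predictions, ground_truth):
        for (px, py), (gx, gy) in zip(pred_seq[first_step:], gt_seq[first_step:]):
            total += math.hypot(px - gx, py - gy)
            count += 1
    return total / count


# Toy example: one sample with 7 steps, every prediction is off by (3, 4).
gt = [[(float(t), 0.0) for t in range(7)]]
pred = [[(x + 3.0, y + 4.0) for x, y in gt[0]]]
example_ade = ade(pred, gt)
print(example_ade)
```

In the toy example each evaluated prediction is 5 meters away from the ground truth, so the ADE is 5 meters regardless of how many warm-up steps are skipped.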

As can be seen in Fig. 7, in multimodal scenarios the most likely prediction is not necessarily the correct one. For this reason the minADE metric was defined, which accounts for multimodality:

\mathrm{minADE}=\frac{1}{m\cdot\left(T_{\mathrm{max}}-4\right)}\sum_{i=1}^{m}\sum_{t=5}^{T_{\mathrm{max}}}\min_{k=1,2,3}\left\|\boldsymbol{\mu}_{k,\mathrm{nms}}^{\left\langle t+\Delta\right\rangle\left(i\right)}-\boldsymbol{x}^{\left\langle t+\Delta\right\rangle\left(i\right)}\right\|.

(13)

Here we choose the best prediction, with respect to the Euclidean distance, out of a predefined number $K$ of predictions. Since our approach only predicts single future positions and not whole trajectories, we limit the selection to $K=3$. Due to the NMS, the 3 predicted positions should be quite distinct, i.e. not lie close to each other. This way we prevent the scenario of achieving a good minADE just by choosing the best prediction from a multitude of almost identical predictions.
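The minADE differs from the ADE only in the inner minimum over the $K$ most probable predictions. The sketch below illustrates this in Python under the assumption that each time step holds a list of candidate positions sorted by descending probability (as they would be after NMS); the function name `min_ade` and the data layout are assumptions of this example.

```python
import math


def min_ade(predictions, ground_truth, k=3, first_step=5):
    """Average over samples and time steps of the best of the top-k predictions.

    predictions[i][t] is a list of candidate (x, y) positions sorted by
    descending probability; ground_truth[i][t] is a single (x, y) tuple.
    """
    total, count = 0.0, 0
    for pred_seq, gt_seq in zip(predictions, ground_truth):
        for candidates, (gx, gy) in zip(pred_seq[first_step:], gt_seq[first_step:]):
            total += min(math.hypot(px - gx, py - gy) for px, py in candidates[:k])
            count += 1
    return total / count


# Toy example: the second most probable candidate is the closest of the top 3;
# the fourth candidate would be a perfect hit but is excluded for k=3.
gt = [[(0.0, 0.0)] * 6]
pred = [[[(3.0, 4.0), (0.0, 1.0), (6.0, 8.0), (0.0, 0.0)]] * 6]
top3 = min_ade(pred, gt, k=3)
top1 = min_ade(pred, gt, k=1)
print(top3, top1)
```

With `k=1` the function reduces to the ADE of the most probable prediction, which makes the effect of allowing several candidates directly visible.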

Finally, the last metric is the FDE:

\mathrm{FDE}=\frac{1}{m\cdot\left(T_{\mathrm{max}}-4\right)}\sum_{i=1}^{m}\left(\left\|\boldsymbol{\mu}_{k,\mathrm{nms}}^{\left\langle 5+\Delta\right\rangle\left(i\right)}-\boldsymbol{x}^{\left\langle 5+\Delta\right\rangle\left(i\right)}\right\|+\left\|\boldsymbol{\mu}_{k,\mathrm{nms}}^{\left\langle T_{\mathrm{max}}+\Delta\right\rangle\left(i\right)}-\boldsymbol{x}^{\left\langle T_{\mathrm{max}}+\Delta\right\rangle\left(i\right)}\right\|\right).

(14)

Resampling of the original data

Please note that all metrics are evaluated at discrete time steps $t$. Real measurements, however, are subject to jitter, i.e. deviations from the nominal sampling interval of 100ms. Jitter, when unaccounted for, may make the prediction error appear much higher than it really is. Imagine a vehicle traveling at a constant velocity of 20 m/s (72 km/h): a jitter of just $\pm$10ms corresponds to a deviation of $\pm$0.2 meters in the covered distance. Our architecture, however, will learn to make predictions for the average sampling interval of 100ms. We solve this issue by resampling the original data onto a precise sampling grid of 100ms. We have to be cautious with the way in which we perform the resampling: fitting a low-order polynomial, for example, would smooth the trajectories, thus making the prediction problem much easier. For this reason we only use linear interpolation between two adjacent samples. This way we do not suppress the measurement noise and do not introduce an advantage. We utilize the resampled data for both training and evaluation.
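Linear resampling onto a regular grid could look like the following Python sketch. It assumes strictly increasing timestamps in milliseconds and a grid that starts at the first timestamp; the function name `resample_linear` and these conventions are assumptions of the example.

```python
def resample_linear(timestamps_ms, positions, interval_ms=100):
    """Linearly interpolate (x, y) positions onto a regular time grid.

    timestamps_ms must be strictly increasing and contain at least two entries;
    the grid starts at the first timestamp and ends at or before the last one.
    """
    resampled = []
    j = 0  # index of the measured sample to the left of the current grid point
    t = timestamps_ms[0]
    while t <= timestamps_ms[-1]:
        # Advance until the grid point lies between samples j and j + 1.
        while timestamps_ms[j + 1] < t:
            j += 1
        t0, t1 = timestamps_ms[j], timestamps_ms[j + 1]
        w = (t - t0) / (t1 - t0)
        (x0, y0), (x1, y1) = positions[j], positions[j + 1]
        resampled.append((x0 + w * (x1 - x0), y0 + w * (y1 - y0)))
        t += interval_ms
    return resampled


# Vehicle moving at 20 m/s along x, measured with +/-10ms jitter.
timestamps = [0, 110, 190, 300]
measured = [(0.02 * t, 0.0) for t in timestamps]
resampled = resample_linear(timestamps, measured)
print(resampled)
```

For the constant-velocity example the resampled positions fall on multiples of 2 meters, i.e. the 0.2 meter offsets caused by the jitter are removed, while only neighbouring samples contribute to each interpolated value.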

Measurement Noise of the Argoverse Dataset

So far we have neglected the fact that the ground truth values $\boldsymbol{x}^{\left\langle t\right\rangle\left(i\right)}$ contain statistical noise and other inaccuracies. At time $t$, a measurement (or observation) $\boldsymbol{x}^{\left\langle t\right\rangle\left(i\right)}$ of the true vehicle position $\boldsymbol{\varkappa}^{\left\langle t\right\rangle\left(i\right)}$ is made according to $\boldsymbol{x}^{\left\langle t\right\rangle\left(i\right)}=\boldsymbol{\varkappa}^{\left\langle t\right\rangle\left(i\right)}+\boldsymbol{v}^{\left\langle t\right\rangle\left(i\right)}$, where $\boldsymbol{v}^{\left\langle t\right\rangle\left(i\right)}$ is the measurement noise, which is assumed to be zero-mean Gaussian white noise (i.e. the noise at different time steps is essentially uncorrelated).

In order to correctly assess the prediction performance of our approach, we have to determine the measurement noise $\sigma_{v}^{2}$ of the Argoverse dataset. To approximate the true vehicle position $\boldsymbol{\varkappa}^{\left\langle t\right\rangle\left(i\right)}$, we fit a curve $\tilde{C}\left(t\right)=(\tilde{x}\left(t\right),\ \tilde{y}\left(t\right))$ to the 50 time steps of each trajectory, where $\tilde{x}\left(t\right)$ and $\tilde{y}\left(t\right)$ are sixth-order polynomials. To take the jitter of the sampling rate into account, we use the exact time stamps $t$ (in ms) provided by the Argoverse dataset and not just multiples of 100ms. In order to cope with possible outliers we utilize RANSAC (Random Sample Consensus) [8]. Next we calculate the ADE metric between the fitted trajectory and the original data and obtain an estimate of the measurement noise of $\sigma_{v}=0.46$ meters. Please note that this value can be interpreted as a lower bound of the ADE for the Argoverse dataset.

Let us denote the true (unknown) prediction error as:

\mathbf{e}_{p}^{\left\langle t+\Delta\right\rangle\left(i\right)}=\boldsymbol{\mu}_{1,\mathrm{nms}}^{\left\langle t+\Delta\right\rangle\left(i\right)}-\boldsymbol{\varkappa}^{\left\langle t+\Delta\right\rangle\left(i\right)},

where $\boldsymbol{\varkappa}^{\left\langle t\right\rangle\left(i\right)}$ is the unknown true vehicle position. Since $\boldsymbol{\mu}_{1,\mathrm{nms}}^{\left\langle t+\Delta\right\rangle\left(i\right)}$ is calculated using only time steps $0,\ldots,t$, we can safely assume $\mathbf{e}_{p}^{\left\langle t+\Delta\right\rangle\left(i\right)}$ and $\boldsymbol{v}^{\left\langle t+\Delta\right\rangle\left(i\right)}$ to be statistically independent (the measurement noise of values lying 2000ms apart is assumed to be independent), hence $\mathrm{ADE}=\sqrt{\sigma_{\mathbf{e}_{p}}^{2}+\sigma_{v}^{2}}$. In the following we will use this formula to estimate the metrics that would be obtained if the true vehicle position $\boldsymbol{\varkappa}^{\left\langle t\right\rangle\left(i\right)}$ were accessible.

ADE	minADE (K=3)	FDE
0.93m	0.78m	0.96m

(a) Metrics evaluated on ground truth containing measurement noise

ADE	minADE (K=3)	FDE
0.81m	0.63m	0.84m

(b) Estimates of metrics evaluated on ground truth without measurement noise using $\sigma_{\mathbf{e}_{p}}=\sqrt{\mathrm{ADE}^{2}-\sigma_{v}^{2}}$

TABLE II: Evaluation on linearly resampled data for $\Delta=20$

[4]
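The noise-corrected estimates in Table II (b) follow from the measured values in Table II (a) and the estimated measurement noise $\sigma_{v}=0.46$ meters. The short Python sketch below reproduces this arithmetic; the function name `noise_corrected` and the dictionary layout are assumptions of the example.

```python
import math

SIGMA_V = 0.46  # estimated measurement noise of the Argoverse dataset in meters


def noise_corrected(metric, sigma_v=SIGMA_V):
    """Remove the measurement noise from a metric, assuming independent errors."""
    return math.sqrt(metric ** 2 - sigma_v ** 2)


# Values measured on noisy ground truth, Table II (a).
measured = {"ADE": 0.93, "minADE (K=3)": 0.78, "FDE": 0.96}
corrected = {name: round(noise_corrected(value), 2) for name, value in measured.items()}
print(corrected)
```

Rounded to two decimals, the results agree with the values listed in Table II (b).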

5 Conclusions and Future Work

In this paper, we have introduced a novel way to predict future positions of a vehicle taking static context information into account. Our main focus lay on the capability to make multimodal predictions with probabilities assigned to them. The second goal was to cope with the imbalance of data. We achieved both goals by introducing a generative probabilistic model based on a Gaussian mixture model. A smart choice of the latent variable allowed for the reformulation of the problem into a combination of a classification problem and a simplified regression problem.

The first benefit of this formulation arose from the fact that we obtained a classification problem, which allowed us to combat the imbalanced data problem by utilizing focal loss. The second benefit arose from the fact that the regression part was spatially distributed, allowing each regressor to specialize on a different subregion.

Our current focus of research lies in the extension of the proposed approach to predict whole trajectories instead of single points in time. In parallel we are examining methods to extend the context information to not only contain the static context like driveable area, centerlines and traffic signs, but also to contain dynamic context, i.e. other agents.

References

[1] Apratim Bhattacharyya, Bernt Schiele, and Mario Fritz. Accurate and diverse sampling of sequences based on a "best of many" sample objective. 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[2] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. Soft-nms – improving object detection with one line of code. 2017 IEEE International Conference on Computer Vision, 2017.
[3] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. 3rd Conference on Robot Learning (CoRL), 2019.
[4] Rohan Chandra, Tianrui Guan, Srujan Panuganti, Trisha Mittal, Uttaran Bhattacharya, Aniket Bera, and Dinesh Manocha. Forecasting trajectory and behavior of road-agents using spectral clustering in graph-lstms. arXiv, 2019.
[5] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, and James Hays. Argoverse: 3d tracking and forecasting with rich maps. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8740–8749, 2019.
[6] Nitesh V. Chawla. Data mining for imbalanced datasets: An overview. Maimon, Oded; Rokach, Lior (Eds) Data Mining and Knowledge Discovery Handbook, Springer, pages 875–886, 2010.
[7] Haoqiang Fan, Hao Su, and Leonidas Guibas. A point set generation network for 3d object reconstruction from a single image. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2463–2471, 2016.
[8] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. CACM, pages 381–395, 1981.
[9] David Gamarnick and John Tsitsiklis. Introduction to Probability. MIT OCW, 2008.
[10] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social gan: Socially acceptable trajectories with generative adversarial networks. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2018.
[11] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. arXiv, 2015.
[12] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[13] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B. Choy, Philip H. S. Torr, and Manmohan Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2165–2174, 2017.
[14] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 318–327, 2018.
[15] Tung Phan-Minh, Elena Corina Grigore, Freddy A. Boulton, Oscar Beijbom, and Eric M. Wolff. Covernet: Multimodal behavior prediction using trajectory sets. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14074–14083, 2020.
[16] Nicholas Rhinehart, Kris M Kitani, and Paul Vernaza. A reparameterized pushforward policy for diverse, precise generative path forecasting. Proceedings of the European Conference on Computer Vision (ECCV), pages 772–788, 2018.