Pedestrian Trajectory Forecasting Using Deep Ensembles Under Sensing Uncertainty

Anshul Nayak, Azim Eskandarian, Zachary Doerzaph, Prasenjit Ghorai
Autonomous Systems and Intelligent Machines lab, Virginia Tech.
Abstract

Robustness is a fundamental challenge in predicting the motion of dynamic agents. Most predictors output deterministic estimates of future states, which are over-confident and prone to error. A few recent works capture uncertainty when forecasting future states. However, these probabilistic estimation methods fail to account for the upstream noise in the perception data used for tracking. Sensors are always noisy, and state estimation becomes even more difficult under adverse weather conditions and occlusion. Traditionally, Bayes filters have been used to fuse information from noisy sensors and update states with an associated belief, but they fail to address non-linearities and long-term predictions. We therefore propose an end-to-end estimator that takes noisy sensor measurements and makes robust future state predictions with uncertainty bounds while accounting for the upstream perceptual uncertainty. Specifically, we consider an encoder-decoder based deep ensemble network that captures perception and predictive uncertainty simultaneously. We compared the model with other approximate Bayesian inference methods. Overall, deep ensembles provided more robust predictions, and accounting for upstream uncertainty further increased the estimation accuracy of the model.

Index Terms:
Uncertainty quantification, Bayesian Inference, Deep Ensembles, MC Dropout

I INTRODUCTION

Most prediction algorithms output deterministic estimates of future states from raw sensor data [1]. Deterministic predictions are over-confident and prone to error. It is therefore important to make probabilistic predictions of future states to improve robustness downstream, especially for uncertainty-aware planning [2][3]. Past research has addressed the issue by developing probabilistic prediction methods [4, 5, 6]. One popular approach is to use a Bayesian neural network (BNN) to capture uncertainty in both classification and regression problems [7]. However, exact Bayesian inference is computationally challenging because of the large number of model parameters. Approximate inference methods such as Monte Carlo (MC) dropout [8] and deep ensembles [9] have therefore been developed; they output probabilistic predictions that approximate the posterior without significant changes to the neural network (NN) architecture. A deep ensemble consists of independently trained neural networks, each with a random initialization. Each network outputs probabilistic predictions by minimizing a negative log-likelihood loss function, and the predictions are averaged over all networks to obtain the predictive mean and variance, assuming a Gaussian posterior distribution. Meanwhile, the MC dropout method applies dropout [10] layers during both training and inference to capture predictive uncertainty. During inference, weights are randomly dropped to generate a distribution of outputs rather than a deterministic prediction. The predictive uncertainty accounts for noise in the output data, called aleatoric uncertainty, as well as the variation in predictions by the neural network model, known as epistemic uncertainty [11]. Feng et al. [12] designed an NN architecture for Lidar 3D vehicle detection and localization that captures aleatoric and epistemic uncertainty during prediction.

Figure 1: Model: The Kalman filter module updates the state and covariance at each step. Trajectory Sampling: Conditional Trajectory Sampling propagates an initial state based on system dynamics to generate a trajectory. TS is called recursively to generate a distribution of trajectories.

Although the prediction model provides probabilistic outputs for future states, the model itself considers deterministic states during training and prediction. That is, the model assumes that the input states observed by the sensors are deterministic and makes predictions based on them. However, sensors are inherently noisy and cannot accurately estimate the state of an object. State estimation becomes even more uncertain under adverse weather or occlusion [13][14]. Deterministic state estimates may therefore be incorrect, and capturing the perception uncertainty associated with each state is necessary for robust predictions downstream. The main idea of the current paper is to investigate why incorporating and propagating perceptual uncertainty into the prediction pipeline is necessary. Traditionally, Bayes filters have been used to capture the state and its associated covariance during tracking [15][16]. A simple Bayes filter such as the Kalman filter (KF) recursively updates the state and covariance at each step by fusing raw noisy sensor measurements with the prior computed using a motion model. However, the KF can reliably estimate only the states for which it has measurements, whereas our problem requires the model to learn and predict the covariance associated with future states. Deep neural networks have been used to estimate covariance from raw sensor data [17]. The authors learned a representation of the measurement model by minimizing the loss between ground truth and raw sensor measurements. Similarly, Bertoni et al. [18] captured 3D localization uncertainty during tracking through a loss function based on the Laplace distribution. They used MC dropout and captured both aleatoric and epistemic uncertainty during state localization from monocular RGB images.
Recently, Rebecca et al. [19] modelled multivariate uncertainty for regression problems by training an NN end-to-end through a KF. Our model draws inspiration from this previous work and has a simple encoder-decoder architecture that learns the KF covariance by minimising the MSE loss between the model and ground truth covariance via supervised regression. The learned NN model can estimate the covariance of future states, capturing perceptual uncertainty. Once the model has learned to estimate covariance, we propagate the sensing uncertainty into the prediction pipeline.

We design the current NN model as an end-to-end estimator that simultaneously predicts the perception and prediction uncertainty associated with future states. In the past, end-to-end approaches such as FAF [20] projected Lidar points into a bird's eye view (BEV) grid and generated predictions by inferring detections multiple steps into the future. Further, PTP [21] unified multi-object tracking (MOT) and prediction under one framework. However, they used generative modeling for the trajectory distribution, which is not particularly efficient at propagating perception uncertainty into prediction. More recently, Pavone et al. [22] showed the importance of propagating state uncertainty and leveraged the idea of penalizing the loss function to encode state uncertainty. Our approach combines end-to-end tracking and prediction while simultaneously estimating the perceptual uncertainty for robust probabilistic trajectory predictions. We train our model with a combined loss function that minimises the mean squared error (MSE) and negative log-likelihood (NLL) losses simultaneously to perform robust end-to-end predictions while estimating perceptual uncertainty. The model learns to estimate the state covariance by minimising the MSE loss with respect to the KF ground truth covariance. Meanwhile, the predictive uncertainty is captured by minimising the NLL loss using a deep ensemble model. Overall, our end-to-end NN model can take raw sensory inputs with measurement noise and make robust probabilistic predictions of future states downstream without ignoring the upstream perceptual uncertainty.

Contributions. Our key contributions are as follows. First, we propose a simple end-to-end NN model that captures both perceptual and predictive uncertainty for robust state prediction. Second, we show the benefit of using deep ensembles for predictive uncertainty. Finally, we show how incorporating state uncertainty into the prediction pipeline improves overall robustness, and we compare the deep ensemble model with MC dropout on publicly available pedestrian datasets. Further, we performed offline experiments with the trained model to understand the out-of-distribution prediction accuracy.

TABLE I: NOTATION
Symbol  Description
$\mathbf{X}$  State
$F: R^{n}\rightarrow R^{n}$  State transition matrix
$z\in R^{n}$  Raw measurement
$P\in R^{n\times n}$  Posterior covariance
$K\in R^{k\times n}$  Kalman gain
$H: R^{n}\rightarrow R^{k}$  Observation matrix
$Q\in R^{n\times n}$  Process noise covariance
$R\in R^{n\times n}$  Measurement covariance

II PROPOSED METHOD

Our method is two-fold. First, the neural network model approximates a Bayes filter to predict the sensing uncertainty at each future state. Second, a prediction model considers the upstream sensing uncertainty and makes robust future state predictions downstream. We use deep ensembles for quantifying predictive uncertainty, while sensing covariance estimation is carried out using a simple Kalman filter.

Figure 2: Deep Ensemble: The NN model takes a single trajectory and generates a distribution of input trajectories using KF and Trajectory Sampling. Each trajectory is trained through an independent NN within the ensemble to generate predictive uncertainty.

II-A Covariance Estimation using Bayes Filter

In this section, we present an approach for estimating the state and covariance of a Bayes filter using a neural network (NN). Bayes filters account for uncertainty during sensing and localization; popular examples include the KF, the extended Kalman filter, and the particle filter. These filters estimate the states of a system as well as their associated beliefs, which are updated by fusing information from raw sensor measurements. The belief, which characterizes the state uncertainty, can have either uniform variance (homoscedastic noise) or state-dependent variance (heteroscedastic noise). Capturing this upstream uncertainty in the sensing data is critical for subsequent robust predictions. Therefore, the current objective is to design an NN model that can learn the heteroscedastic measurement covariance of a Bayes filter. We approach state estimation as a 2D object tracking problem, using the KF.

The KF module consists of two steps: a prediction step and an update step (Figure 1). The prediction step takes the previous state $\mathbf{X}_{k-1}$ and covariance $\mathbf{P}_{k-1}$ and computes the prior distribution based on the constant velocity motion model (1).

$$\bar{\mathbf{X}}_{k}=\mathbf{F}\mathbf{X}_{k-1}+\mathbf{B}u_{k}$$
$$\bar{\mathbf{P}}_{k}=\mathbf{F}\mathbf{P}_{k-1}\mathbf{F}^{T}+\mathbf{Q} \tag{1}$$

Here, $\mathbf{F}$ is the state transition matrix, $\mathbf{B}$ is the control-input matrix, and $u_{k}$ is the control input. For the tracking problem, the control input has no significance. Further, $\mathbf{Q}$ represents the process noise covariance.

We model the innovation $y_{k}$ as the difference between the actual measurement $z_{k}$ and the predicted measurement of the state,

$$y_{k}=z_{k}-\mathbf{H}\bar{\mathbf{X}}_{k}$$

which has covariance

$$\mathbf{S}_{k}=\mathbf{R}+\mathbf{H}\bar{\mathbf{P}}_{k}\mathbf{H}^{T}$$

It is the sum of the measurement noise covariance and the predicted state covariance $\bar{\mathbf{P}}_{k}$ projected into measurement space. $\mathbf{R}$ represents the covariance matrix associated with measurement noise. The Kalman gain is

$$\mathbf{K}_{k}=\bar{\mathbf{P}}_{k}\mathbf{H}^{T}(\mathbf{H}\bar{\mathbf{P}}_{k}\mathbf{H}^{T}+\mathbf{R})^{-1}$$

The state $\mathbf{X}_{k}$ and associated covariance $\mathbf{P}_{k}$ are updated at each step using the Kalman gain $\mathbf{K}_{k}$, which acts as a weighting factor between the predicted (prior) state and the actual measurement (likelihood) of the state.

$$\mathbf{X}_{k}=\bar{\mathbf{X}}_{k}+\mathbf{K}_{k}y_{k}$$
$$\mathbf{P}_{k}=(\mathbf{I}-\mathbf{K}_{k}\mathbf{H})\bar{\mathbf{P}}_{k} \tag{2}$$

The posterior state and covariance, $\mathbf{X}_{k}$ and $\mathbf{P}_{k}$ (2), are updated recursively at each step by fusing the actual noisy measurement from the sensors with the measurement predicted by the motion model.
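The predict and update steps (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `kf_step`, the state layout `[x, y, u, v]`, position-only measurements, and the numerical values of `Q` and `R` are assumptions made for the example. The time step of 0.4 s follows from 8 steps spanning 3.2 s. Here `z` denotes the raw measurement and `y` the innovation, and the control input is omitted.

```python
import numpy as np

def kf_step(x_prev, P_prev, z, F, H, Q, R):
    """One predict/update cycle of a linear Kalman filter (no control input)."""
    # Predict: propagate state and covariance through the motion model, Eq. (1).
    x_bar = F @ x_prev
    P_bar = F @ P_prev @ F.T + Q
    # Innovation and its covariance.
    y = z - H @ x_bar
    S = H @ P_bar @ H.T + R
    # Kalman gain weights the prior against the measurement.
    K = P_bar @ H.T @ np.linalg.inv(S)
    # Update: posterior state and covariance, Eq. (2).
    x = x_bar + K @ y
    P = (np.eye(len(x_prev)) - K @ H) @ P_bar
    return x, P

# Constant-velocity model with state [x, y, u, v] (assumed layout).
dt = 0.4
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)  # only position is measured (assumed)
Q = 0.01 * np.eye(4)  # illustrative process noise
R = 0.05 * np.eye(2)  # illustrative measurement noise

x, P = np.zeros(4), np.eye(4)
measurements = [np.array([0.1, 0.0]), np.array([0.5, 0.1]), np.array([0.9, 0.1])]
for z in measurements:
    x, P = kf_step(x, P, z, F, H, Q, R)
```

After each update, the posterior variance of a measured component is smaller than the corresponding measurement variance, which is the sense in which the filter fuses the prior with the noisy measurement.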

II-B Conditional Trajectory Sampling

Trajectories could simply be sampled at random from the posterior distribution of states obtained using the KF. However, random sampling generates non-smooth and infeasible trajectories, especially with high sensor noise. To address this infeasibility, we propose conditional trajectory sampling (CTS), in which we propagate an initial sampled point using a dynamics model [24] and then resample a new point from the adjacent posterior distribution. This process is repeated recursively until a trajectory is generated (Figure 1). In this way, the generated trajectories adhere to the system dynamics and are not random.

The CTS technique is used to sample from the posterior distribution of the KF state covariance at each time step. We randomly sample the initial state $x_{0}$ and then propagate it based on a motion model as a Markovian process, $x_{t}\sim P(x_{t}|x_{t-1},a)$ [25]. The prior distribution $P(x_{t-1})$ is propagated based on some action $a$. This transition predicts the likelihood of the next state based on the constant-velocity motion model. The process is repeated recursively based on the Markovian dynamics to generate a trajectory. Further, we repeat the trajectory generation process for each bootstrap $i\in\{1,\dots,M\}$, where $M$ is the number of trajectories. The ensemble of generated trajectories roughly represents the state uncertainty. For the current research, trajectory sampling can be considered a data augmentation step that lets the neural network capture this perceptual uncertainty during training. Another benefit of CTS is that each sampled trajectory within the distribution can be fed into an independent NN of an ensemble model to capture the total predictive uncertainty. Details of our implementation are discussed in Section II-C and Figure 2.

II-C Uncertainty Estimation

We treat trajectory prediction as a regression problem. For regression problems with deterministic predictions, the NN outputs a single value, say $\mu(x)$. It is estimated by minimizing the mean squared error on the training set, MSE = $\sum_{n=1}^{N}(y_{n}-\mu(x_{n}))^{2}$. However, these outputs are point estimates and do not contain any information on the associated uncertainty. To capture the uncertainty, we assume the outputs are sampled from a Gaussian distribution such that the final layer outputs two values, the predicted mean $\mu(x)$ and variance $\sigma^{2}(x)$ of the distribution. The variance $\sigma^{2}(x)$ of an NN model can be obtained by minimizing the Gaussian negative log-likelihood (NLL) loss on training samples with input $x_{n}$ and output $y_{n}$:

$$-\log P(y_{n}|x_{n})=\frac{\log\sigma^{2}(x_{n})}{2}+\frac{(y_{n}-\mu(x_{n}))^{2}}{2\sigma^{2}(x_{n})}+\text{constant} \tag{3}$$

Here, $\sigma^{2}(x)$ is the model's observation noise parameter, which indicates the amount of noise present in the model's outputs. However, the standard NLL depends strongly on the predictive variance $\sigma^{2}(x)$ and scales down the gradient for ill-predicted points [11]. Hence, an alternative loss function, the $\beta$-exponentiated negative log-likelihood loss ($\beta$-NLL) [23], has been used.

$$L_{\beta\text{-NLL}}=-\log P(y_{n}|x_{n})\,\mathrm{stop}(\sigma^{2\beta}) \tag{4}$$

$\beta$ controls the dependency of the gradients on the predictive variance, while stop() is the stop-gradient operation, which prevents gradients from flowing through its argument. $\beta=0$ recovers the standard NLL loss. Meanwhile, $\beta=1$ completely removes the dependency of the gradient with respect to the mean on the variance $\sigma^{2}(x)$, so the mean is trained as with the standard mean squared error (MSE) loss. The $\beta$-NLL loss function thus gives us the flexibility to interpolate between the NLL and MSE loss functions.
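The effect of $\beta$ on the gradient can be checked numerically. The sketch below evaluates (3) and (4) per sample and writes out the gradient of (4) with respect to the mean by hand, treating the stop-gradient weight as a constant. The function names are hypothetical, and NumPy is used only to keep the example free of an autodiff dependency; in an autodiff framework the weight would be detached from the graph instead.

```python
import numpy as np

def gaussian_nll(y, mu, var):
    """Per-sample Gaussian NLL of Eq. (3), without the additive constant."""
    return 0.5 * np.log(var) + 0.5 * (y - mu) ** 2 / var

def beta_nll(y, mu, var, beta):
    """Value of the beta-NLL loss of Eq. (4).

    The stop-gradient does not change the value of the loss; it only
    changes which terms receive gradients during training.
    """
    weight = var ** beta  # plays the role of stop(sigma^{2 beta})
    return weight * gaussian_nll(y, mu, var)

def beta_nll_grad_mu(y, mu, var, beta):
    """Gradient of Eq. (4) w.r.t. mu, with the weight held constant."""
    return var ** beta * (mu - y) / var

y, mu, var = 1.0, 0.0, 4.0
grad_nll = beta_nll_grad_mu(y, mu, var, beta=0.0)  # (mu - y) / var
grad_mse = beta_nll_grad_mu(y, mu, var, beta=1.0)  # (mu - y), independent of var
```

With $\beta=0$ the gradient is scaled by $1/\sigma^{2}$, so points with large predicted variance contribute little. With $\beta=1$ the variance cancels and the gradient equals that of $\tfrac{1}{2}(y-\mu)^{2}$.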

Deep Ensembles

A deep ensemble is an approximate Bayesian inference method that can capture predictive uncertainty during forecasting. It is simple and scalable compared to Bayesian NNs. As the name suggests, an ensemble consists of several NNs that differ from one another because of random initialization. Let $M$ denote the number of NNs within the ensemble. Then $\mu_{i}(x)$ and $\sigma^{2}_{i}(x)$ represent the mean and variance predicted by a single NN indexed by $i\in\{1,\dots,M\}$. Balaji et al. [9] treated the ensemble as a uniformly weighted mixture model and combined the predictions into a single Gaussian mixture distribution $p(y|x)$:

$$p(y|x)=M^{-1}\sum_{i}\mathcal{N}(\mu_{i}(x),\sigma^{2}_{i}(x)) \tag{5}$$

For ease of estimating predictive probabilities, they further approximated the ensemble prediction as a single Gaussian whose mean and variance correspond to the respective mean and variance of the mixture model.

$$\mu_{*}(x)=M^{-1}\sum_{i}\mu_{i}(x) \tag{6}$$
$$\sigma_{*}^{2}(x)=M^{-1}\sum_{i}\left(\sigma^{2}_{i}(x)+\mu^{2}_{i}(x)\right)-\mu^{2}_{*}(x) \tag{7}$$
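Equations (6) and (7) translate directly into array operations. The following is a small sketch with a hypothetical helper `combine_ensemble`; the member means and variances are made-up numbers chosen so the result can be checked by hand.

```python
import numpy as np

def combine_ensemble(mus, variances):
    """Collapse M Gaussian predictions into one Gaussian, Eqs. (6)-(7).

    mus, variances: arrays whose first axis indexes the M ensemble members.
    """
    mus = np.asarray(mus, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mu_star = mus.mean(axis=0)                                     # Eq. (6)
    var_star = (variances + mus ** 2).mean(axis=0) - mu_star ** 2  # Eq. (7)
    return mu_star, var_star

# Three members that agree on the noise level but disagree on the mean.
mu_star, var_star = combine_ensemble([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

In this example the combined variance (0.5 + 2/3) exceeds every member's own variance of 0.5, because disagreement between the member means adds to the total.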

Dropout as Bayesian approximation

We compare the performance of other approximate Bayesian inference methods used for uncertainty quantification with deep ensembles [9]. One such approach focuses on dropout [10] to capture the total predictive uncertainty. The key notion is to randomly drop weights during both training and inference. Concisely, we can formulate the MC Dropout algorithm as,

for b = 1:B
    $e_{(b)}^{*}=\textit{VariationalDropout}(g(x^{*}),p)$
    $y_{(b)}^{*}=\textit{Dropout}(h(e^{*}),p)$
end for

Given the input data $x^{*}$, an encoder-decoder network $g(\cdot)$, a prediction network $h(\cdot)$, dropout probability $p$, and number of iterations $B$, we train the encoder-decoder model $e=g(\cdot)$ with dropout probability $p$. During inference, for the same input $x^{*}$, the prediction network $h(\cdot)$ is evaluated $B$ times with randomly dropped weights to generate a distribution of $B$ outputs. It has been shown [8] that this output distribution approximates a Bayesian NN without the additional complexity. The mean and variance of the predicted samples are given below.

$$\hat{y}_{mc}^{*}=\frac{1}{B}\sum_{b=1}^{B}\hat{y}_{(b)}^{*}$$
$$\eta_{1}^{2}=\frac{1}{B}\sum_{b=1}^{B}(\hat{y}_{(b)}^{*}-\hat{y}_{mc}^{*})^{2} \tag{8}$$
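The sampling loop and the statistics in (8) can be illustrated with a toy network. The sketch below is not the encoder-decoder model used in this work: the two-layer network, its random weights, and the choice to apply inverted dropout to the hidden activations are all assumptions made to keep the example short and self-contained.

```python
import numpy as np

def mc_dropout_predict(x, W1, W2, p, B, rng):
    """Run B stochastic forward passes and return the statistics of Eq. (8)."""
    outputs = []
    for _ in range(B):
        h = np.tanh(W1 @ x)
        # Keep each hidden unit with probability 1 - p, also at inference time.
        mask = rng.random(h.shape) >= p
        h = h * mask / (1.0 - p)  # inverted dropout rescaling
        outputs.append(W2 @ h)
    outputs = np.stack(outputs)                      # shape (B, out_dim)
    y_mc = outputs.mean(axis=0)                      # predictive mean
    eta_sq = ((outputs - y_mc) ** 2).mean(axis=0)    # predictive variance
    return y_mc, eta_sq

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 4))   # toy weights, not trained
W2 = rng.normal(size=(2, 16))
x_star = np.array([0.1, 0.2, 0.3, 0.4])
y_mc, eta_sq = mc_dropout_predict(x_star, W1, W2, p=0.5, B=100, rng=rng)
```

Setting `p=0.0` disables the masking, so all B passes coincide and the variance collapses to zero, which is a quick sanity check that the spread comes from dropout alone.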

II-D Uncertainty Disentanglement

Total predictive variance (7) can be disentangled into aleatoric uncertainty, associated with the inherent noise of the data, and epistemic uncertainty, accounting for uncertainty in model predictions [11].

$$\begin{aligned}
\sigma_{*}^{2}(x) &= M^{-1}\sum_{i}\sigma^{2}_{i}(x)+M^{-1}\sum_{i}\mu^{2}_{i}(x)-\mu^{2}_{*}(x) \\
&= \mathbb{E}_{i}[\sigma^{2}_{i}(x)]+\mathbb{E}_{i}[\mu^{2}_{i}(x)]-\mathbb{E}_{i}[\mu_{i}(x)]^{2} \\
&= \underbrace{\mathbb{E}_{i}[\sigma^{2}_{i}(x)]}_{\text{Aleatoric uncertainty}}+\underbrace{\mathrm{Var}_{i}[\mu_{i}(x)]}_{\text{Epistemic uncertainty}}
\end{aligned} \tag{9}$$

Equation (9) shows that, across multiple output samples, the mean of the variances represents aleatoric uncertainty, while the variance of the means represents epistemic uncertainty. The predictive variance $\sigma_{i}^{2}(x)$ is obtained using the Gaussian NLL loss (4). However, predictive uncertainty only accounts for the data and model uncertainty during future trajectory prediction and carries no information about the upstream perceptual uncertainty obtained using the KF. To capture the perceptual uncertainty, the NN is trained on augmented trajectory samples that take both the state and its associated covariance as inputs. The perceptual uncertainty is then estimated by minimising the MSE loss between the actual covariance obtained using the KF and the covariance predicted by the NN. Details of the method and results are given in Section IV-C.
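The decomposition in (9) can be verified numerically on synthetic ensemble outputs. In this sketch the helper name `disentangle`, the array shapes (M members, 12 future steps, 2 coordinates), and the random values are illustrative assumptions; the only point is that the two terms sum to the total variance of (7).

```python
import numpy as np

def disentangle(mus, variances):
    """Split the ensemble variance into aleatoric and epistemic parts, Eq. (9)."""
    aleatoric = variances.mean(axis=0)  # mean of the member variances
    epistemic = mus.var(axis=0)         # variance of the member means (ddof=0)
    return aleatoric, epistemic

rng = np.random.default_rng(0)
M, horizon = 3, 12
mus = rng.normal(size=(M, horizon, 2))                    # synthetic member means
variances = rng.uniform(0.1, 1.0, size=(M, horizon, 2))   # synthetic member variances

aleatoric, epistemic = disentangle(mus, variances)
# Total variance computed independently from Eq. (7).
total = (variances + mus ** 2).mean(axis=0) - mus.mean(axis=0) ** 2
```

With a single member (M = 1) the epistemic term is exactly zero, so the total and aleatoric uncertainties coincide.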

Algorithm
function Kalman($\mathbf{X}_{k-1},\mathbf{P}_{k-1},\mathbf{R},\mathbf{Q}$) $\triangleright$ $\mathbf{X}_{k-1}$: state, $\mathbf{P}_{k-1}$: covariance, $\mathbf{R}$: measurement noise, $\mathbf{Q}$: process noise
     for $k=1$ to $N$ do
         $\bar{x}_{k}=Fx_{k-1}$ $\triangleright$ Predict Step
         $\bar{P}_{k}=FP_{k-1}F^{T}+Q$
         $S=H\bar{P}_{k}H^{T}+R$
         $K=\bar{P}_{k}H^{T}S^{-1}$
         $y_{k}=z_{k}-H\bar{x}_{k}$ $\triangleright$ Innovation
         $x_{k}=\bar{x}_{k}+Ky_{k}$
         $P_{k}=\bar{P}_{k}-KH\bar{P}_{k}$ $\triangleright$ Update Step
     end for
end function
for $i=1$ to $M$ do $\triangleright$ M samples for M ensembles
     function Trajectory Sampling($\mathbf{X}_{k},\mathbf{P}_{k}$)
         $x_{sample}=\mathrm{MultivariateNormal}(\mathbf{X}_{k},\mathbf{P}_{k})$
     end function
end for
function Model(input = $[\mathbf{X}_{k},\mathbf{\Sigma}_{k}]^{T}$, target = $[\mathbf{y}_{k},\mathbf{\Sigma}_{k}^{y}]^{T}$, num epochs, batch, M) $\triangleright$ End-to-End Training Model
     for epoch = 1 to num epochs do
         $[\hat{\mathbf{y}}_{k},\hat{\mathbf{\Sigma}}_{k}^{s},\hat{\mathbf{\Sigma}}_{k}^{p}]=\mathrm{Model}([\mathbf{X}_{k},\mathbf{\Sigma}_{k}]^{T})$ $\triangleright$ Outputs
         $MSE=\lVert\hat{\mathbf{\Sigma}}_{k}^{s}-\mathbf{\Sigma}_{k}^{y}\rVert^{2}$
         $NLL=\dfrac{\lVert\mathbf{y}_{k}-\hat{\mathbf{y}}_{k}\rVert^{2}}{2\hat{\mathbf{\Sigma}}_{k}^{p}}+\dfrac{\log(\hat{\mathbf{\Sigma}}_{k}^{p})}{2}$
     end for
end function

III Experiments

In this section, we discuss the datasets, data augmentation, implementation details for each network, and the performance metrics. Following common practice in the literature [27], we trained our models on publicly available pedestrian datasets. The two most popular are the ETH dataset [29], which contains the ETH and HOTEL scenes, and the UCY dataset [30], which contains the UNIV, ZARA1 and ZARA2 scenes. For comparability with past work [28], we used 8 historical steps (3.2 s) to predict 12 steps (4.8 s) into the future.

III-A Data Augmentation

Initially, we trained our model on the ETH dataset only, which contains approximately 420 pedestrian trajectories under varied crowd settings. However, such a small number of trajectories is insufficient for training. Therefore, we performed data augmentation using Takens' embedding theorem [31]. We used a sliding window with a stride of T = 1 step to generate multiple short trajectories from a single long trajectory. For instance, a pedestrian trajectory of 29 steps yields 10 short $\{x,y\}$ trajectory pairs of 20 steps each when 8 past steps are used to predict 12 steps into the future. In total, we constructed 1597 multivariate time series sequences, which we split into 1260 training and 337 testing sequences for the ETH hotel dataset. Further, each trajectory was augmented using the KF to generate the posterior state and covariance distribution. Then, M trajectories were sampled from the distribution for each original trajectory. Details of data augmentation using the KF and TS are discussed in Section II.
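The sliding-window step can be sketched as follows. The helper `sliding_windows` and the synthetic straight-line trajectory are assumptions for illustration; the window lengths (8 observed, 12 predicted) and the stride of 1 come from the description above.

```python
import numpy as np

def sliding_windows(traj, obs_len=8, pred_len=12, stride=1):
    """Cut one long trajectory into (observed, future) pairs."""
    window = obs_len + pred_len
    pairs = []
    for start in range(0, len(traj) - window + 1, stride):
        segment = traj[start:start + window]
        pairs.append((segment[:obs_len], segment[obs_len:]))
    return pairs

# Synthetic 29-step {x, y} trajectory moving along a straight line.
traj = np.cumsum(np.full((29, 2), 0.1), axis=0)
pairs = sliding_windows(traj)
print(len(pairs))  # 10 pairs of 20 steps each
```

A 29-step trajectory has 29 - 20 + 1 = 10 valid window positions, which matches the count stated in the text.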

III-B Implementation details

The encoder-decoder neural network was trained end-to-end using PyTorch. The Adam optimizer with a learning rate of 1e-3 was used to minimize the MSE and NLL losses. The MSE loss was minimized to estimate the covariance of the KF, while the NLL loss was minimized to capture the predictive uncertainty. Each model was trained for 150 epochs with a batch size of 64. For the ensemble model, M = 3 networks were used, while for MC dropout, a single model with dropout probability p = 0.5 was used, based on our previous research [4]. The model was fit on the training data, and the test data was used for predictions.

III-C Performance Metrics

The trained model is then used to predict the distribution for pedestrian future states. Overall, the predictions are averaged to generate the mean predicted path along with the associated variance that quantifies uncertainty. We adopt the widely used performance metrics [27] namely average displacement error (ADE) and final displacement error (FDE) for prediction comparison between the ground truth and mean predicted path. Further, we define valid prediction intervals for regression problems based on performance metrics like prediction interval coverage probability (PICP) and mean prediction interval width (MPIW) [26].

(a) Prediction Interval Coverage Probability (PICP): Coverage probability for a single state indicates whether the ground truth state $\mathbf{y}_{k}$ lies within the predicted covariance ellipse $\Gamma(X_{k})$ for the state $\mathbf{X}_{k}$,

$$\mathcal{C}(\Gamma)\approx\frac{1}{|\mathcal{D}^{*}|}\sum_{(x,y)\in\mathcal{D}^{*}}\mathbb{1}(y_{k}\in\Gamma(X_{k})) \tag{10}$$

where $\mathbb{1}$ denotes the indicator function, which takes Boolean values.

(b) Mean Prediction Interval Width (MPIW): The average width of the confidence interval. For the current results, we consider MPIW as the average of the major and minor axes of the predicted covariance ellipse.

$$\mathcal{W}(\Gamma)\approx\frac{1}{|\mathcal{D}^{*}|}\sum_{(x,y)\in\mathcal{D}^{*}}|u(x)-l(x)| \tag{11}$$

$u(x)$ and $l(x)$ refer to the upper and lower bounds of the prediction interval.
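For 2D covariance ellipses, (10) and (11) can be evaluated as sketched below. This is one plausible reading rather than the exact evaluation code: membership in the $n\sigma$ ellipse is tested with the squared Mahalanobis distance, and the interval width along each ellipse axis is taken as $2n\sqrt{\lambda}$ for the corresponding covariance eigenvalue $\lambda$. Both function names and the toy data are hypothetical.

```python
import numpy as np

def picp(y_true, mu, cov, n_sigma=1.0):
    """Fraction of ground-truth points inside the n-sigma covariance ellipse."""
    diff = y_true - mu                                   # shape (N, 2)
    # Squared Mahalanobis distance of each point to its predicted mean.
    d2 = np.einsum("ni,nij,nj->n", diff, np.linalg.inv(cov), diff)
    return np.mean(d2 <= n_sigma ** 2)

def mpiw(cov, n_sigma=1.0):
    """Mean full width of the major and minor axes of the ellipses."""
    eigvals = np.linalg.eigvalsh(cov)                    # ascending, shape (N, 2)
    widths = 2.0 * n_sigma * np.sqrt(eigvals)
    return widths[:, 1].mean(), widths[:, 0].mean()      # (major, minor)

mu = np.zeros((3, 2))
cov = np.tile(np.diag([4.0, 1.0]), (3, 1, 1))            # same ellipse for each point
y_true = np.array([[1.0, 0.0], [0.0, 0.5], [3.0, 0.0]])

coverage = picp(y_true, mu, cov)       # two of the three points fall inside
major, minor = mpiw(cov)               # widths 4.0 and 2.0
```

A good uncertainty estimate combines high coverage with small widths; either quantity alone is easy to improve at the expense of the other.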

(c) Average Displacement Error (ADE): Mean Euclidean distance between the predicted and ground truth positions.

$$\text{ADE}=\frac{1}{T}\sum_{t=t_{0}}^{t_{f}}\lVert\hat{\mathbf{y}}_{(t)}-\mathbf{y}_{(t)}\rVert \tag{12}$$

(d) Final Displacement Error (FDE): Euclidean distance between the predicted and true final state across all trajectories.

$$\text{FDE}=\lVert\hat{\mathbf{y}}_{(t_{f})}-\mathbf{y}_{(t_{f})}\rVert \tag{13}$$

where $\hat{\mathbf{y}}_{(t)}$ is the predicted location at timestamp $t$ and $\mathbf{y}_{(t)}$ is the ground truth position.
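Both displacement metrics reduce to a norm over the coordinate axis. The sketch below uses hypothetical helper names and a synthetic prediction that is offset from the ground truth by a constant (0.3, 0.4), so every per-step error equals 0.5.

```python
import numpy as np

def ade(pred, gt):
    """Average displacement error, Eq. (12): mean Euclidean error over all steps."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def fde(pred, gt):
    """Final displacement error, Eq. (13): Euclidean error at the last step."""
    return np.linalg.norm(pred[..., -1, :] - gt[..., -1, :], axis=-1).mean()

gt = np.zeros((12, 2))                 # 12 future steps, {x, y}
pred = gt + np.array([0.3, 0.4])       # constant offset of length 0.5

ade_value = ade(pred, gt)
fde_value = fde(pred, gt)
```

Both functions also accept a batch of trajectories with shape (N, 12, 2), in which case the result is averaged over the batch.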

IV Results

IV-A Why Ensemble?

For quantifying uncertainty, deep ensembles average predictions over an ensemble of independently trained networks. In the current scenario, each network is trained using the Gaussian negative log-likelihood (NLL) loss function (4) so that the network outputs probabilistic predictions with both a mean ($\mu$) and a variance ($\sigma^{2}$). Since a single network can already generate probabilistic outputs when trained with the NLL loss function, why average the predictions over an ensemble of M networks? To answer this question, we observe how the NLL and test MSE scale with the number of independent networks (M) within an ensemble. The losses were evaluated on the ETH [29] pedestrian dataset.

TABLE II: Scalability of NLL (nats) and MSE with number of networks (M) within an ensemble
M NLL MSE
1 -0.335 0.214
2 -0.362 0.205
3 -0.377 0.205
4 -0.371 0.208
5 -0.379 0.200

Each neural network within the ensemble was randomly initialized at the beginning of training. Additionally, for each network, a training trajectory was randomly sampled as input from the distribution of trajectories. Training was performed and the final NLL loss was averaged over the number of networks, M, within the ensemble. Test MSE was evaluated on a set of test trajectories different from the training data after the model was fully trained (Table II). The results indicate that an ensemble of five networks had the lowest NLL as well as the lowest test MSE. The lower NLL indicates that an ensemble captures predictive uncertainty better. Further, the lower test MSE shows that the predictions of an ensemble are closer to the ground truth than those of a single network. Overall, any ensemble with $M>2$ produced better NLL and MSE than a single network.

IV-B Predictions: single vs Ensemble

In this section, we compare the predictive uncertainty of a single network with an ensemble of three networks (M=3) on a pedestrian trajectory from the ETH dataset. Figure 3 shows the predictive uncertainty. The model takes 8 input states ($\bullet$, green dot) to predict 12 states into the future. $\blacktriangle$ represents the actual ground truth trajectory of the pedestrian. Further, the plot shows the mean predicted path ($\blacklozenge$, blue diamond) along with the $1\sigma$ covariance ellipse to quantify uncertainty during prediction.

Figure 3: Predictive uncertainty for (a) Deep Ensemble with M=3 networks (b) single network.

Figure 3b shows the predictive uncertainty for a single network, which fails to generate an accurate prediction interval with respect to the ground truth. A significant portion of the ground truth trajectory lies outside the $1\sigma$ predictive covariance. In contrast, the ensemble network (Figure 3a) produced better predictive uncertainty and a better mean path by averaging the mean and variance of predictions over an ensemble of networks. The plot shows that the ground truth lies completely within the confidence interval at each time step. Further, the ADE/FDE for the ensemble network (0.618/1.137) was significantly lower than the ADE/FDE for a single network (0.704/1.394).

Figure 3a shows the total predictive uncertainty, which combines aleatoric and epistemic uncertainty. The aleatoric uncertainty represents the inherent noise in the data, while the epistemic uncertainty arises from the variation in model predictions. Since the test data is sampled from the same distribution as the training data, the model uncertainty, highlighted in yellow, is negligible compared to the aleatoric uncertainty. In contrast, a single network has no model uncertainty, so its total predictive uncertainty and aleatoric uncertainty are the same. Thus, the ensemble network is better suited to handle epistemic uncertainty, which is critical for robust real-world applications.

Figure 4: Variation of prediction metrics (a) PICP (b) $\mathrm{MPIW_{x}}$ (c) $\mathrm{MPIW_{y}}$ with ensemble models.

Further, we compare the coverage probability PICP (10) and the prediction interval width MPIW (11) for different ensemble sizes on the ETH dataset (Figure 4). Results show that ensembles with $M\in\{3,5\}$ networks provide better predictive uncertainty estimates than a single network, with an average coverage probability of $\approx 63\%$ compared to $45\%$. In addition, the results reveal that the MPIW of an ensemble with multiple networks is comparable to or smaller than that of a single network, indicating that the ensemble can achieve a higher coverage probability even with a narrower confidence interval. We denote the average widths of the major and minor axes as $\mathrm{MPIW_{x}}$ and $\mathrm{MPIW_{y}}$, respectively.

IV-C Incorporating Perception uncertainty

Previous studies have focused on capturing predictive uncertainty while neglecting upstream state uncertainty during perception. Incorporating perception or state uncertainty into the prediction pipeline remains a challenge, as it is unclear how the total predictive uncertainty will be affected. To address this challenge, we propose incorporating and propagating state uncertainty by including the associated state covariance $\bar{P}_{k}$ obtained at each step from the KF. We append the heteroscedastic noise associated with each state to the state $\mathbf{X}_{k}$ and train on them together. In Section IV-B, the state $\mathbf{X}_{k}=[x,y]$ contained only the positions sampled from the posterior distribution of the state covariance using the KF. Here, we have neglected the velocity $[u,v]$ in the states for training, as no significant improvement was observed with its inclusion. In the current scenario, we append the states and the associated covariance together as $[\mathbf{X}_{k},\mathbf{\Sigma}_{k}]^{T}$ and train on them jointly. The outputs are $[\hat{\mathbf{y}}_{k},\hat{\mathbf{\Sigma}}_{k}^{s},\hat{\mathbf{\Sigma}}_{k}^{p}]^{T}$. $\hat{\mathbf{\Sigma}}_{k}^{s}$ corresponds to the estimated state covariance, which is trained by minimizing the MSE loss with respect to the actual covariance $\mathbf{\Sigma}_{k}$ obtained using the KF. This enables the NN to capture the upstream perceptual uncertainty. Meanwhile, $\hat{\mathbf{\Sigma}}_{k}^{p}$ corresponds to the predictive covariance estimated by the NN and is obtained by minimising the NLL loss for the state $\mathbf{X}_{k}$. Overall, our method enables us to estimate the state uncertainty and incorporate it into the prediction pipeline, which can help improve the total uncertainty estimation.

IV-C1 Perception uncertainty

Sensing uncertainty during state estimation of a dynamic object can be obtained recursively from sensor measurements using the KF. However, estimating perceptual uncertainty for future states over a long prediction horizon using the KF can be challenging. Further, the uncertainty represented by the covariance will grow continuously based on the motion model. To address these issues, a neural network (NN) model is trained to learn the KF for estimating perceptual uncertainty at any future state. We perform domain randomization on the training trajectories by varying the measurement covariance, $\mathbf{R}\in(2\%,20\%)$, to generate trajectories across a wide range of sensor noise. This makes the NN more robust in estimating the sensing uncertainty.

To achieve this, we apply the KF to both the input and the ground truth of every trajectory to obtain the estimated states and covariances. The resulting states and covariances are then concatenated and trained together using an encoder-decoder network. The network minimizes the mean squared loss between the KF ground truth covariance, $\mathbf{\Sigma_{k}}$, and the predicted covariance, $\mathbf{\hat{\Sigma}_{k}^{s}}$. Figure 5 depicts the outputs of the neural network for $\mathbf{R}=5\%$, which closely resemble the covariance predictions of the KF. The NN takes 8 states with the associated covariance obtained using the KF as input and predicts the 12 future states, as well as the estimated state covariance due to sensing uncertainty. Our results demonstrate that the $2\sigma$ confidence interval predicted by the neural network closely matches the state uncertainty on the ground truth obtained using the KF. This capability allows the NN to estimate the state covariance associated with perception uncertainty at each future state, which can then be integrated with the prediction uncertainty to enhance the system’s robustness.
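The augmentation step can be illustrated with a small constant-velocity Kalman filter. The sketch below rests on several assumptions that are not stated in this section: the motion model, the process noise `accel_std`, the initial covariance, and the reading of the 2% to 20% range as a measurement noise standard deviation between 0.02 and 0.20. All function names are hypothetical.

```python
import numpy as np

def kalman_filter_cv(measurements, dt=0.4, meas_std=0.05, accel_std=0.5):
    """Constant-velocity KF over [x, y] measurements (assumed motion model).

    Returns filtered states (T, 4) as [x, y, u, v] and covariances (T, 4, 4).
    """
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    G = np.array([[0.5 * dt**2, 0.0],
                  [0.0, 0.5 * dt**2],
                  [dt, 0.0],
                  [0.0, dt]])
    Q = accel_std**2 * (G @ G.T)   # process noise from random acceleration
    R = meas_std**2 * np.eye(2)    # measurement noise covariance

    x = np.array([measurements[0, 0], measurements[0, 1], 0.0, 0.0])
    P = np.eye(4)                  # assumed initial covariance
    states, covs = [], []
    for z in measurements:
        # Prediction step
        x = F @ x
        P = F @ P @ F.T + Q
        # Update step
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
        states.append(x.copy())
        covs.append(P.copy())
    return np.array(states), np.array(covs)

def augment_with_random_noise(trajectory, rng, low=0.02, high=0.20):
    """Domain randomization: draw a noise level, corrupt the path, run the KF.

    Returns (T, 6) rows of [x, y, S_xx, S_xy, S_yx, S_yy].
    """
    meas_std = rng.uniform(low, high)
    noisy = trajectory + rng.normal(0.0, meas_std, size=trajectory.shape)
    states, covs = kalman_filter_cv(noisy, meas_std=meas_std)
    pos_cov = covs[:, :2, :2].reshape(len(covs), 4)
    return np.hstack([states[:, :2], pos_cov])
```

Each call draws a different noise level, so repeated calls over the training set expose the network to a range of sensor qualities.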

Figure 5: Covariance estimation using neural network capturing sensing uncertainty.

IV-C2 Prediction uncertainty

Unlike perception uncertainty, which accounts for noise in the process or measurement during sensing, prediction uncertainty captures the unpredictability associated with future states. In Figure 6, we show the total predictive uncertainty for the same trajectory as before. For capturing predictive uncertainty, we train the network using the NLL loss, and the NN outputs both the mean, $\mu_{k}$, and covariance, $\mathbf{\hat{\Sigma}_{k}^{p}}$, of the predicted distribution. We treat $\mathbf{\hat{\Sigma}_{k}^{p}}=[\hat{\Sigma}_{xx},\hat{\Sigma}_{xy},\hat{\Sigma}_{yx},\hat{\Sigma}_{yy}]$ as the full state covariance of a bivariate distribution.

We generate results for an ensemble of 3 networks and average the predicted distributions to obtain the mean predicted path and uncertainty at each state. The average ADE/FDE of the mean predicted path across all test trajectories is 0.64/1.08. For the ensemble network, the mean of the predicted covariances, $\mathbf{\hat{\Sigma}_{k}^{p}}$, represents the aleatoric uncertainty across the ensemble. Meanwhile, the variance of the predicted means represents the "model" or epistemic uncertainty. As the test samples are from the same distribution as the training samples, the variation in model predictions is insignificant, and thus the epistemic uncertainty is negligible too. Further, the predictive uncertainty around each state is significantly larger than the predicted sensing uncertainty.
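The split into aleatoric and epistemic parts can be written down for a single time step as below. This is a sketch under the assumption that the ensemble members are weighted equally and combined by moment matching; `combine_ensemble` is a hypothetical name.

```python
import numpy as np

def combine_ensemble(means, covs):
    """Combine M Gaussian predictions for one time step.

    means: (M, 2) predicted means, covs: (M, 2, 2) predicted covariances.
    Returns the ensemble mean, aleatoric, epistemic and total covariance.
    """
    mean = means.mean(axis=0)
    aleatoric = covs.mean(axis=0)          # mean of predicted covariances
    dev = means - mean
    epistemic = dev.T @ dev / len(means)   # covariance of predicted means
    return mean, aleatoric, epistemic, aleatoric + epistemic
```

When all members predict nearly the same mean, as reported for in-distribution test samples, the epistemic term is close to zero and the total is dominated by the aleatoric term.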

Figure 6: Total Predictive uncertainty

IV-C3 State and Prediction Uncertainty

Why incorporate sensing uncertainty into prediction pipeline?

The primary objective of this paper is to design an end-to-end estimator that can effectively estimate state uncertainty by taking noisy sensor measurements and propagate the state uncertainty into the future predicted states. This approach enables the neural network to make precise future state predictions while remaining robust to upstream uncertainty. Mathematically, we formulate total uncertainty as the combination of sensing and predictive uncertainty.

Figure 7: Total uncertainty corresponds to the Minkowski addition of prediction and state uncertainty.

Assume the upstream state uncertainty due to noisy measurements is represented by the covariance ellipse $E1$:

$$E1=\{x_{1}\in\mathbb{R}^{2}:\,(x_{1}-\mu_{k})^{T}(\mathbf{\hat{\Sigma}_{k}^{s}})^{-1}(x_{1}-\mu_{k})\leq 1\}$$

Similarly, at each time step, a state can be randomly sampled from the covariance ellipse $E2$ representing prediction uncertainty:

$$E2=\{x_{2}\in\mathbb{R}^{2}:\,(x_{2}-\mu_{k})^{T}(\mathbf{\hat{\Sigma}_{k}^{p}})^{-1}(x_{2}-\mu_{k})\leq 1\}$$

Covariance ellipses $\mathrm{E1}$ and $\mathrm{E2}$ represent the convex sets of all reachable states during perception and prediction respectively. Any state $x_{2}$ sampled from the predictive uncertainty set $\mathrm{E2}$, defined by $\mathbf{\hat{\Sigma}_{k}^{p}}$, represents a possible future state of the tracked object. However, the sampled state carries no information about the upstream uncertainty $\mathrm{E1}$, defined by $\mathbf{\hat{\Sigma}_{k}^{s}}$. Therefore, perceptual uncertainty can be incorporated into the prediction uncertainty as the Minkowski addition of the closed convex sets $\mathrm{E1}$ and $\mathrm{E2}$ centred around the origin (14).

$$E=\{x_{1}\oplus x_{2}:\,x_{1}\in E1,\,x_{2}\in E2\}\tag{14}$$

Here $\oplus$ denotes vector addition. Further, we translate the summed covariance ellipse representing total uncertainty to the mean predicted state, $\mu_{k}$.

$$E^{\prime}=\{x+\mu_{k}:\,x\in E\}$$

Overall, $\mathrm{E^{\prime}}$ represents the total reachable set of states for the end-to-end estimator at any instance.
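One way to trace the boundary of the Minkowski sum numerically is through support points: for a unit direction $d$, the farthest point of an origin-centred ellipse with covariance $\Sigma$ is $\Sigma d/\sqrt{d^{T}\Sigma d}$, and the farthest point of the sum is the sum of the two farthest points. The sketch below uses this property; the function names and the number of sampled directions are illustrative assumptions.

```python
import numpy as np

def ellipse_support_points(cov, directions):
    """Points of {x : x^T cov^{-1} x <= 1} farthest along each unit direction."""
    proj = directions @ cov  # row i equals (cov @ d_i)^T because cov is symmetric
    scale = np.sqrt(np.einsum("ij,ij->i", proj, directions))  # sqrt(d^T cov d)
    return proj / scale[:, None]

def minkowski_sum_boundary(cov_state, cov_pred, mu, n=180):
    """Boundary of E1 (+) E2 for origin-centred ellipses, translated to mu."""
    angles = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    d = np.column_stack([np.cos(angles), np.sin(angles)])
    boundary = ellipse_support_points(cov_state, d) + ellipse_support_points(cov_pred, d)
    return boundary + mu
```

In general the Minkowski sum of two ellipses is not itself an ellipse, which is why the sketch samples the boundary instead of adding the two covariance matrices.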

Figure 8: Total uncertainty by incorporating state uncertainty into prediction

By including state uncertainty, the total uncertainty becomes the Minkowski sum of state and prediction uncertainty, as shown in Figure 8. We evaluated the coverage probability for an ensemble of three networks; with predictive uncertainty alone, the coverage probability was 0.63 (Figure 4). This predictive uncertainty was based on deterministic state inputs without considering any sensing uncertainty. However, when we accounted for measurement noise, $\mathrm{R}=5\%$, for the states and augmented the data using the KF for training, the coverage probability rose from 0.63 to 0.80, a gain of about 17 percentage points. This demonstrates the effectiveness of our proposed approach in capturing both state and predictive uncertainty for accurate and robust future predictions.

Figure 9: Comparison of perception uncertainty (left), prediction uncertainty (center), and total uncertainty (right) for Deep Ensembles (top) and MC Dropout (bottom) on trajectories from the ZARA 01 dataset.

IV-D UQ Methods

In this section, we compare the prediction efficacy of Deep Ensembles with Monte Carlo (MC) dropout, which is an approximate Bayesian inference method. MC dropout trains the NN with dropout layers and then performs inference at test time while randomly dropping weights. This generates a distribution of varying outputs instead of a single deterministic prediction. As with ensembles, the mean and variance of the output distribution can be computed to obtain the mean predicted path and quantify uncertainty. Details of the method have been described in section II-C.
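The test-time procedure can be sketched with a toy one-hidden-layer network. The architecture, the weight arguments and the number of stochastic passes are assumptions for illustration only, not the encoder-decoder used in the paper, and the sketch drops hidden activations as standard dropout layers do.

```python
import numpy as np

def mc_dropout_predict(x, w1, b1, w2, b2, p=0.5, passes=100, rng=None):
    """Run several stochastic forward passes with dropout kept active.

    Returns the mean and variance of the outputs across the passes.
    """
    rng = np.random.default_rng() if rng is None else rng
    outputs = []
    for _ in range(passes):
        hidden = np.maximum(0.0, x @ w1 + b1)   # ReLU hidden layer
        mask = rng.random(hidden.shape) >= p    # keep each unit with prob. 1 - p
        hidden = hidden * mask / (1.0 - p)      # inverted-dropout scaling
        outputs.append(hidden @ w2 + b2)
    outputs = np.array(outputs)
    return outputs.mean(axis=0), outputs.var(axis=0)
```

The spread across passes plays the role that the spread across ensemble members plays for deep ensembles.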

We show the uncertainty plots for three test trajectories from the ZARA01 dataset at $\mathbf{R}=5\%$, comparing both methods (Figure 9). All trajectories start from the origin. For the deep ensemble, we consider $M=3$ networks, while a dropout probability of $p=0.5$ has been used on a single network for training with the MC dropout method.

Left: The plot compares the $2\sigma$ perception uncertainty between the NN and the KF for each method. Both methods are slightly under-confident in their predictions and overestimate the uncertainty bounds when compared with the KF ground truth. This may arise from the simultaneous training with the NLL and MSE loss functions, in which the NN fails to minimise the MSE loss accurately while estimating covariance. However, the deep ensemble model slightly outperforms MC dropout in estimating the perception uncertainty at each state. The MC dropout model overestimates the covariance associated with the initial states.

Center: The predictive uncertainty plot shows the $1\sigma$ distribution for future states with the uncertainty bound for each method. The predictive uncertainty can be disentangled into epistemic and aleatoric uncertainty. The uncertainty estimation is scalable, and no significant difference was observed between the predictions of the two models. However, the ADE and FDE for the mean predicted path of the deep ensemble model (0.53/0.97) are closer to the ground truth than those of the dropout model (0.58/1.00), as seen in Table III.

Right: The top and bottom right plots show the $1\sigma$ total uncertainty for deep ensembles and MC dropout respectively. As discussed, the total uncertainty is the Minkowski addition of the covariance ellipses representing the perception and prediction uncertainty. Since MC dropout overestimates the perception uncertainty, the total uncertainty is affected too, and the MC dropout method makes under-confident predictions for total uncertainty. This phenomenon is more pronounced for the two trajectories along the negative y-axis when compared to the scalable predictions from deep ensembles.
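The ADE and FDE values quoted for the mean predicted path follow the usual displacement-error definitions: the average Euclidean distance over all predicted steps, and the distance at the final step. A minimal sketch for a single trajectory, with `ade_fde` as a hypothetical helper name:

```python
import numpy as np

def ade_fde(pred, truth):
    """Average and final displacement error for one trajectory.

    pred, truth: arrays of shape (T, 2) holding [x, y] at each future step.
    """
    dist = np.linalg.norm(pred - truth, axis=-1)  # per-step Euclidean error
    return dist.mean(), dist[-1]
```

Dataset-level figures would then be obtained by averaging these values over all test trajectories.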

Apart from uncertainty quantification, the current study also compared the performance metrics (sec. III-C) for both methods across the pedestrian datasets ETH [29] and UCY [30]. The ADE/FDE results indicate that the deep ensembles have a mean predicted path closer to the ground truth than MC dropout on all datasets, apart from a marginally higher FDE on Hotel. Meanwhile, the $1\sigma$ coverage probability results show that deep ensembles have a slightly higher coverage probability, although the difference is not significant except for the ETH dataset. We have combined the prediction interval widths along the major, $\mathrm{MPIW_{x}}$, and minor, $\mathrm{MPIW_{y}}$, axes of the predicted covariance ellipse to obtain the mean prediction interval width,

$$\mathrm{MPIW}=\sqrt{\frac{\mathrm{MPIW_{x}}^{2}}{2}+\frac{\mathrm{MPIW_{y}}^{2}}{2}}$$

Again, the deep ensembles have a lower MPIW than MC dropout on four of the five datasets (ZARA 02 being the exception), which shows that deep ensembles are able to achieve a slightly higher or equal coverage probability even with a narrower prediction interval. This shows that ensembles make robust predictions with scalable uncertainty and estimates closer to the ground truth.
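The two uncertainty metrics can be sketched for a single trajectory as follows. Coverage is computed here as the fraction of ground-truth points falling inside the predicted $k\sigma$ ellipse, and the per-axis width is taken as the full axis length of that ellipse. Both are assumptions about how (10) and (11) are applied to a bivariate prediction; only the final combination follows the formula above. The function names are hypothetical.

```python
import numpy as np

def picp(truth, means, covs, k=1.0):
    """Fraction of ground-truth points inside the k-sigma covariance ellipse.

    truth, means: (T, 2); covs: (T, 2, 2).
    """
    diff = truth - means
    maha_sq = np.einsum("ti,tij,tj->t", diff, np.linalg.inv(covs), diff)
    return np.mean(maha_sq <= k**2)

def mpiw(covs, k=1.0):
    """Mean axis widths of the k-sigma ellipses and their combined value."""
    eigvals = np.linalg.eigvalsh(covs)  # ascending eigenvalues, shape (T, 2)
    width_major = np.mean(2.0 * k * np.sqrt(eigvals[:, 1]))
    width_minor = np.mean(2.0 * k * np.sqrt(eigvals[:, 0]))
    combined = np.sqrt(width_major**2 / 2.0 + width_minor**2 / 2.0)
    return width_major, width_minor, combined
```

A method is preferable when it reaches the same or higher coverage with a smaller combined width.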

Figure 10: Real world test trajectory set. Online tracking and estimation using a depth camera provides the position and velocity of the pedestrian in real-time. Generated test trajectory is used for offline prediction.
TABLE III: Performance metrics showing ADE, FDE, PICP and MPIW for predicting 12 future time steps given 8 historical steps
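| Method | Dataset | ADE | FDE | PICP | MPIW |
|---|---|---|---|---|---|
| Deep Ensemble | ETH | 0.60 | 1.11 | 0.80 | 2.23 |
| Deep Ensemble | Hotel | 0.40 | 0.67 | 0.88 | 1.44 |
| Deep Ensemble | ZARA 01 | 0.53 | 0.97 | 0.81 | 1.82 |
| Deep Ensemble | ZARA 02 | 0.56 | 1.12 | 0.87 | 2.08 |
| Deep Ensemble | UNIV | 0.25 | 0.50 | 0.93 | 1.55 |
| Dropout | ETH | 0.7 | 1.2 | 0.73 | 2.38 |
| Dropout | Hotel | 0.44 | 0.66 | 0.88 | 1.46 |
| Dropout | ZARA 01 | 0.58 | 1.00 | 0.8 | 1.83 |
| Dropout | ZARA 02 | 0.59 | 1.15 | 0.86 | 2.06 |
| Dropout | UNIV | 0.27 | 0.54 | 0.94 | 1.59 |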

IV-E Out-of-distribution Results

The simulation results showed that the NN-based end-to-end estimator yields good performance on trajectories that follow the same distribution as the training data. However, one fundamental challenge for NN-based prediction models has been out-of-distribution (OOD) robustness. In particular, if the test samples are real-world pedestrian trajectories with distributional shift, how effectively can the trained NN model predict the future state as well as the prediction and sensing uncertainty? To test this, we studied multiple scenarios, namely walking fast, walking slow, turning left, turning right, and walking normally (Figure 11), which constitute a set of edge cases that differ from the training samples.

IV-E1 Test trajectory generation

To collect test trajectories, we use a depth camera recording at 30 frames per second (Figure 10). The camera estimates the depth of the object from a pair of images to obtain its real-world position and velocity in 3D Cartesian coordinates. For object detection, we train a simple Mask R-CNN [32] on the COCO dataset [33]. The object detection module accurately classifies the pedestrian and tracks it in real time. The sampling interval is set to 12 frames so that the camera obtains the object’s position and velocity every 0.4 seconds, matching the simulation. Every trajectory has a duration of 8 seconds, resulting in 20 $\{x,y,u,v\}$ samples, of which 8 samples (3.2 s) represent the past trajectory and 12 samples (4.8 s) represent the ground truth. We apply the current end-to-end estimator to the past trajectory to predict the future states with associated uncertainty and compare the predictions with the ground truth.
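The sampling arithmetic above can be checked with a short sketch. It assumes the tracker output is available as one $\{x,y,u,v\}$ row per camera frame; the constant and function names are hypothetical.

```python
import numpy as np

FPS = 30                 # camera frame rate
STRIDE_FRAMES = 12       # one sample every 12 frames
TRAJECTORY_SECONDS = 8   # duration of each recorded trajectory
OBSERVED_STEPS = 8       # past samples fed to the estimator
PREDICTED_STEPS = 12     # future samples used as ground truth

SAMPLE_PERIOD = STRIDE_FRAMES / FPS                        # seconds per sample
NUM_SAMPLES = TRAJECTORY_SECONDS * FPS // STRIDE_FRAMES    # samples per trajectory

def split_trajectory(frames):
    """Subsample per-frame states and split them into past and future parts.

    frames: array of shape (num_frames, 4) with [x, y, u, v] per camera frame.
    """
    samples = frames[::STRIDE_FRAMES][:NUM_SAMPLES]
    past = samples[:OBSERVED_STEPS]
    future = samples[OBSERVED_STEPS:OBSERVED_STEPS + PREDICTED_STEPS]
    return past, future
```

With these constants, an 8-second recording of 240 frames yields 20 samples spaced 0.4 s apart, split 8 to 12.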

IV-E2 Sensor measurement noise

To estimate the measurement covariance, $\mathbf{R}$, we perform a simple calibration test. A single object was placed exactly 3 m away from the camera, and 60 samples of the $\{x,y\}$ position of the object were recorded. The mean of the distribution was $2.9\pm 0.06$ m. This shows roughly $4\%$ noise on all measurements. A Kalman filter was applied to each test trajectory for data augmentation based on the estimated measurement covariance for the sensor.

IV-E3 Inference

The weights of the NN model are trained on the publicly available ETH and UCY datasets. Further, to ensure robustness to varying degrees of sensing noise, domain randomization was performed on the measurement covariance, $\mathbf{R}$. As discussed previously, the trajectories are augmented using the KF with a range of measurement covariances, $\mathbf{R}\in(2\%,20\%)$, and then used to train the NN. This makes the NN model more robust when predicting on test samples generated by a different sensor with different measurement noise. During inference, only model parameters such as the trained weights and biases are needed, which makes the inference process computationally cheap.

Figure 11 shows end-to-end prediction on out-of-distribution trajectories for the considered scenarios. To draw a parallel with the simulation results, we predict 12 states into the future based on 8 historical steps. The NN model based on deep ensembles with $M=3$ networks predicts both the $1\hat{\Sigma}$ sensing (blue) and prediction (olive) covariance ellipses along with the mean estimated path for each scenario. The ground truth trajectory lies within the predicted $1\hat{\Sigma}$ total covariance ellipse except for the left turn trajectory (Figure 11a). Typically, the NN estimation model fails to capture drastic changes in the trajectory. Overall, the current end-to-end prediction model is robust to out-of-distribution samples as well.

Figure 11: Out-of-distribution prediction on multiple scenarios based on real-world pedestrian trajectories. The scenarios are: (a) Turning Left (b) Walking Slow (c) Walking Fast (d) Turning Right (e) Walking Normal. The covariance ellipse shows the $1\sigma$ total uncertainty disentangled into blue: perception uncertainty and olive: prediction uncertainty.

V Conclusion

The current paper presents an end-to-end estimator that can take raw noisy sensor measurements and make robust future state predictions considering the upstream perceptual uncertainty. The NN model uses deep ensembles and averages outputs over a batch of networks to provide the mean predicted path and associated uncertainty for each state. For perceptual uncertainty, the NN model approximates the characteristics of a Bayes filter and estimates the associated covariance. Further, the model also estimates the predictive uncertainty associated with future states to which the perceptual uncertainty is incorporated to obtain the total uncertainty. Our results show that the incorporation of sensing uncertainty into the prediction pipeline enables the model to make robust downstream predictions. Overall, an ensemble model of 3 networks has been considered over a single network owing to better predictive uncertainty. The performance metrics indicate that the mean predicted path for an ensemble model is closer to the ground truth compared to the MC dropout predictions. Further, the coverage probability for an ensemble network is higher even with a smaller prediction interval width. Finally, the end-to-end prediction model showed robustness on out-of-distribution samples in quantifying both estimated future state and uncertainty.

In the future, it will be interesting to consider non-parametric filters such as the particle filter (PF), which does not assume a Gaussian distribution over the filter estimates, and to evaluate whether the NN model produces better estimates when trained with the PF.

References

  • [1] Houenou, Adam, Philippe Bonnifait, Véronique Cherfaoui, and Wen Yao. ”Vehicle trajectory prediction based on motion model and maneuver recognition.” In 2013 IEEE/RSJ international conference on intelligent robots and systems, pp. 4363-4369. IEEE, 2013.
  • [2] Kahn, Gregory, Adam Villaflor, Vitchyr Pong, Pieter Abbeel, and Sergey Levine. ”Uncertainty-aware reinforcement learning for collision avoidance.” arXiv preprint arXiv:1702.01182 (2017).
  • [3] Wu, Xihui, Anshul Nayak, and Azim Eskandarian. ”Motion planning of autonomous vehicles under dynamic traffic environment in intersections using probabilistic rapidly exploring random tree.” SAE International Journal of Connected and Automated Vehicles 4, no. 12-04-04-0029 (2021): 383-399.
  • [4] Nayak, Anshul, Azim Eskandarian, and Zachary Doerzaph. ”Uncertainty estimation of pedestrian future trajectory using Bayesian approximation.” IEEE Open Journal of Intelligent Transportation Systems 3 (2022): 617-630.
  • [5] Li, Jiachen, Hengbo Ma, and Masayoshi Tomizuka. ”Conditional generative neural system for probabilistic trajectory prediction.” In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6150-6156. IEEE, 2019.
  • [6] Wiest, Jürgen, Matthias Höffken, Ulrich Kreßel, and Klaus Dietmayer. ”Probabilistic trajectory prediction with Gaussian mixture models.” In 2012 IEEE Intelligent Vehicles Symposium, pp. 141-146. IEEE, 2012.
  • [7] Jospin, Laurent Valentin, Hamid Laga, Farid Boussaid, Wray Buntine, and Mohammed Bennamoun. ”Hands-on Bayesian neural networks—A tutorial for deep learning users.” IEEE Computational Intelligence Magazine 17, no. 2 (2022): 29-48.
  • [8] Gal, Yarin, and Zoubin Ghahramani. ”Dropout as a bayesian approximation: Representing model uncertainty in deep learning.” In international conference on machine learning, pp. 1050-1059. PMLR, 2016.
  • [9] Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. ”Simple and scalable predictive uncertainty estimation using deep ensembles.” Advances in neural information processing systems 30 (2017).
  • [10] Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. ”Dropout: a simple way to prevent neural networks from overfitting.” The journal of machine learning research 15, no. 1 (2014): 1929-1958.
  • [11] Valdenegro-Toro, Matias, and Daniel Saromo Mori. ”A deeper look into aleatoric and epistemic uncertainty disentanglement.” In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1508-1516. IEEE, 2022.
  • [12] Feng, Di, Lars Rosenbaum, and Klaus Dietmayer. ”Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3d vehicle detection.” In 2018 21st international conference on intelligent transportation systems (ITSC), pp. 3266-3273. IEEE, 2018.
  • [13] Yoneda, Keisuke, Naoki Suganuma, Ryo Yanase, and Mohammad Aldibaja. ”Automated driving recognition technologies for adverse weather conditions.” IATSS research 43, no. 4 (2019): 253-262.
  • [14] Schulz, Dirk, and Wolfram Burgard. ”Probabilistic state estimation of dynamic objects with a moving mobile robot.” Robotics and Autonomous Systems 34, no. 2-3 (2001): 107-115.
  • [15] Prevost, Carole G., Andre Desbiens, and Eric Gagnon. ”Extended KF for state estimation and trajectory prediction of a moving object detected by an unmanned aerial vehicle.” In 2007 American control conference, pp. 1805-1810. IEEE, 2007.
  • [16] Breitenstein, Michael D., Fabian Reichlin, Bastian Leibe, Esther Koller-Meier, and Luc Van Gool. ”Robust tracking-by-detection using a detector confidence particle filter.” In 2009 IEEE 12th International Conference on Computer Vision, pp. 1515-1522. IEEE, 2009.
  • [17] Liu, Katherine, Kyel Ok, William Vega-Brown, and Nicholas Roy. ”Deep inference for covariance estimation: Learning gaussian noise models for state estimation.” In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1436-1443. IEEE, 2018.
  • [18] Bertoni, Lorenzo, Sven Kreiss, and Alexandre Alahi. ”Monoloco: Monocular 3d pedestrian localization and uncertainty estimation.” In Proceedings of the IEEE/CVF international conference on computer vision, pp. 6861-6871. 2019.
  • [19] Russell, Rebecca L., and Christopher Reale. ”Multivariate uncertainty in deep learning.” IEEE Transactions on Neural Networks and Learning Systems 33, no. 12 (2021): 7937-7943.
  • [20] Luo, Wenjie, Bin Yang, and Raquel Urtasun. ”Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net.” In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 3569-3577. 2018.
  • [21] Weng, Xinshuo, Ye Yuan, and Kris Kitani. ”PTP: Parallelized tracking and prediction with graph neural networks and diversity sampling.” IEEE Robotics and Automation Letters 6, no. 3 (2021): 4640-4647.
  • [22] Ivanovic, Boris, Yifeng Lin, Shubham Shrivastava, Punarjay Chakravarty, and Marco Pavone. ”Propagating state uncertainty through trajectory forecasting.” In 2022 International Conference on Robotics and Automation (ICRA), pp. 2351-2358. IEEE, 2022.
  • [23] Seitzer, Maximilian, Arash Tavakoli, Dimitrije Antic, and Georg Martius. ”On the pitfalls of heteroscedastic uncertainty estimation with probabilistic neural networks.” arXiv preprint arXiv:2203.09168 (2022).
  • [24] Chua, Kurtland, Roberto Calandra, Rowan McAllister, and Sergey Levine. ”Deep reinforcement learning in a handful of trials using probabilistic dynamics models.” Advances in neural information processing systems 31 (2018).
  • [25] Rose, Dominic C., Jamie F. Mair, and Juan P. Garrahan. ”A reinforcement learning approach to rare trajectory sampling.” New Journal of Physics 23, no. 1 (2021): 013013.
  • [26] Dewolf, Nicolas, Bernard De Baets, and Willem Waegeman. ”Valid prediction intervals for regression problems.” Artificial Intelligence Review 56, no. 1 (2023): 577-613.
  • [27] Alahi, Alexandre, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. ”Social lstm: Human trajectory prediction in crowded spaces.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 961-971. 2016.
  • [28] Nikhil, Nishant, and Brendan Tran Morris. ”Convolutional neural network for trajectory prediction.” In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 0-0. 2018.
  • [29] Pellegrini, Stefano, Andreas Ess, and Luc Van Gool. ”Improving data association by joint modeling of pedestrian trajectories and groupings.” In European conference on computer vision, pp. 452-465. Springer, Berlin, Heidelberg, 2010.
  • [30] L. Leal-Taixe, M. Fenzi, A. Kuznetsova, B. Rosenhahn and S. Savarese, ”Learning an Image-Based Motion Context for Multiple People Tracking,” 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3542-3549, doi: 10.1109/CVPR.2014.453.
  • [31] Takens, Floris. ”Detecting strange attractors in turbulence.” In Dynamical systems and turbulence, Warwick 1980, pp. 366-381. Springer, Berlin, Heidelberg, 1981.
  • [32] He, Kaiming, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. ”Mask r-cnn.” In Proceedings of the IEEE international conference on computer vision, pp. 2961-2969. 2017.
  • [33] Lin, Tsung-Yi, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. ”Microsoft coco: Common objects in context.” In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740-755. Springer International Publishing, 2014.

VI Biography Section

Anshul Nayak received his B.Tech in mechanical engineering from NIT, Rourkela, India. He completed his Master’s degree in Mechanical engineering at Virginia Tech and is currently pursuing his Ph.D at the Autonomous Systems and Intelligent Machines (ASIM) lab at the same university. His research interests include cooperative planning and uncertainty estimation in prediction.
Azim Eskandarian has been a Professor and Head of the Mechanical Engineering Department at Virginia Tech since August 2015. He became the Nicholas and Rebecca Des Champs chaired Professor in April 2018. He also has a courtesy appointment as a Professor in the Electrical and Computer Engineering Department since 2021. He established the Autonomous Systems and Intelligent Machines laboratory at Virginia Tech and has conducted pioneering research in autonomous vehicles, human/driver cognition and vehicle interface, advanced driver assistance systems, and robotics. Before joining Virginia Tech, he was a Professor of Engineering and Applied Science at George Washington University (GWU) and the Founding Director of the Center for Intelligent Systems Research, from 1996 to 2015, the Director of the Transportation Safety and Security University Area of Excellence, from 2002 to 2015, and the Co-Founder of the National Crash Analysis Center in 1992 and its Director from 1998 to 2002 and 2013 to 2015. From 1989 to 1992, he was an Assistant Professor at Pennsylvania State University, York, PA, and an Engineer/Project Manager in the industry from 1983 to 1989. Dr. Eskandarian is a Fellow of ASME, a member of SAE, and a Senior Member of IEEE professional societies. He received SAE’s Vincent Bendix Automotive Electronics Engineering Award in 2021, IEEE ITS Society’s Outstanding Researcher Award in 2017, and GWU’s School of Engineering Outstanding Researcher Award in 2013.
Zachary Doerzaph is the Executive Director of the Virginia Tech Transportation Institute (VTTI), a global leader in transportation research. Working alongside a talented team, Doerzaph focuses on creating a future of ubiquitous, safe, and effective mobility by conducting innovative and impactful research today. Also, a faculty member within the Department of Biomedical Engineering and Mechanics at Virginia Tech, Doerzaph works with fellow faculty to provide experiential learning opportunities to prepare the next generation workforce. Doerzaph is known for innovative and extensive transportation research and leadership projects. His work focuses on maximizing performance at the interface of driver, vehicle, and infrastructure systems through the application of advanced technologies.
Prasenjit Ghorai received his B.Tech. degree from Maulana Abul Kalam Azad University of Technology (formerly West Bengal University of Technology, India) in electronics and instrumentation engineering, his M.Tech. degree in control & instrumentation engineering from the University of Calcutta, and his Ph.D. degree in engineering from the National Institute of Technology (NIT) Agartala (in collaboration with Indian Institute of Technology, Guwahati, India). He was an Assistant Professor of Electronics and Instrumentation Engineering with NIT Agartala from 2011 to 2019. He is currently working as a senior research associate at the Autonomous Systems and Intelligent Machines Laboratory at Virginia Tech and conducts research on cooperative and connected autonomous vehicles.