
Limits of Probabilistic Safety Guarantees when Considering Human Uncertainty

Richard Cheng, Richard M. Murray, and Joel W. Burdick
All authors are with the California Institute of Technology, Pasadena, CA, USA.
Abstract

When autonomous robots interact with humans, such as during autonomous driving, explicit safety guarantees are crucial in order to avoid potentially life-threatening accidents. Many data-driven methods have explored learning probabilistic bounds over human agents’ trajectories (i.e. confidence tubes that contain trajectories with probability $1-\delta$), which can then be used to guarantee safety with probability $1-\delta$. However, almost all existing works consider $\delta \geq 0.001$. The purpose of this paper is to argue that (1) in safety-critical applications, it is necessary to provide safety guarantees with $\delta < 10^{-8}$, and (2) current learning-based methods are ill-equipped to compute accurate confidence bounds at such low $\delta$. Using human driving data (from the highD dataset), as well as synthetically generated data, we show that current uncertainty models use inaccurate distributional assumptions to describe human behavior and/or require infeasible amounts of data to accurately learn confidence bounds for $\delta \leq 10^{-8}$. These two issues result in unreliable confidence bounds, which can have dangerous implications if deployed on safety-critical systems.

I INTRODUCTION

Autonomous robots will be increasingly deployed in unstructured human environments (e.g. roads and malls) where they must safely carry out tasks in the presence of other moving human agents. The cost of failure is high in these environments, as safety violations can be life-threatening. At present, safety is often enforced by learning an uncertainty distribution or confidence bounds over the future trajectory of other agents, and designing a controller that is robust to such uncertainty [Fisac2018]. Based on these learned trajectory distributions, probabilistic safety guarantees can be provided at a specified safety threshold $\delta$ over a given planning horizon (e.g. by enforcing chance constraints such that $\mathbb{P}(\text{collision}) \leq \delta$) [Aoude2013, FridovichKeil2020, Nakka2020, Fan2020]. However, for such guarantees to hold, it is critical that we accurately predict the uncertainty over other agents’ future trajectories with high probability $1-\delta$.

Current works that aim to provide probabilistic safety guarantees for autonomous navigation in uncertain, human environments consider safety thresholds in the range $\delta \geq 0.001$. While such guarantees are important, safety-critical applications require values of $\delta$ that are orders of magnitude lower [Shalev-Shwartz2017].

Suppose a robot/car is guaranteed safe with probability $1-\delta$ across every 10 s planning horizon. Given $\delta \approx 0.001$, we could expect a safety violation every 3 hrs. For reference, based on NHTSA data [nhtsa], human drivers have an effective safety threshold $\delta < 10^{-7}$.
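To make the arithmetic explicit (a sanity check on the figure above, treating consecutive 10 s planning windows as independent, each violated with probability $\delta$):

$\mathbb{E}[\text{time between violations}] \approx \frac{10\,\text{s}}{\delta} = \frac{10\,\text{s}}{10^{-3}} = 10^{4}\,\text{s} \approx 2.8\,\text{hr},$

whereas the same calculation at $\delta = 10^{-7}$ gives roughly $10^{8}\,\text{s} \approx 3$ years between violations.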

It is clear then that for safety-critical robotic applications, we must strive for extremely low safety thresholds, on the order of $\delta \leq 10^{-8}$. However, this paper argues that current learning-based approaches for modeling human trajectory uncertainty (a) rely on highly inaccurate distributional assumptions, invalidating the resulting safety guarantees, and/or (b) cannot adequately extend to safety-critical situations. To illustrate this, we applied different uncertainty models (see Table I) to human driving data from the highD dataset [highDdataset]. We found that even under extremely generous assumptions, the learned models are highly inaccurate in capturing human behavior at low $\delta$, often mispredicting the probability of rare events by several orders of magnitude. Furthermore, we show that increasing dataset sizes will not sufficiently improve the accuracy of learned uncertainty models.

Our results highlight potential danger in utilizing learned models of human uncertainty in safety-critical applications. Fundamental limitations prevent us from accurately learning the probability of rare trajectories with finite data, and using inaccurate confidence bounds can result in unexpected collisions. While this paper focuses on illustrating a crucial problem (rather than providing a solution), we conclude by discussing alternative approaches that can address these limitations by combining (a) learned patterns of behavior and (b) prior knowledge encoding human interaction rules.

Before proceeding, we emphasize three critical points regarding our results:

  • We focus on in-distribution error, rather than out-of-distribution error. That is, we highlight the fundamental inability of uncertainty models to accurately capture distributions at very low $\delta$, regardless of generalization.

  • We focus not on robust control algorithms, but rather on the learned uncertainties that such algorithms leverage.

  • We distinguish motion predictors from uncertainty models. While the performance of recent motion predictors has improved drastically [Gupta2018], they all leverage an underlying uncertainty model (see Table I) to capture the probability of uncommon events; e.g., most neural-network motion predictors output a Gaussian predictive distribution. This paper focuses on errors associated with uncertainty models (which propagate to the motion predictors).

Uncertainty Model Class | Example Works | Min. Safety Threshold
Gaussian Process | [Fisac2019, Aoude2013, Hakobyan2020, Cheng2020] | $\delta \geq 0.001$
Gaussian Uncertainty w/ Dynamics | [Xu2014, Sadigh2016a, forghani2016] | $\delta \geq 0.001$
Bayesian NN | [Michelmore2019, Fan2020] | $\delta \geq 0.05$
Noisy Rational Model | [FridovichKeil2020] | $\delta \geq 0.01$
Hidden Markov Model | [sadigh2014, Liu2015] | $\delta \geq 0.01$
Quantile Regression | [Fan2020] | $\delta \geq 0.05$
Scenario Optimization | [Cesari2017, Chen2020, Sartipizadeh2020] | $\delta \geq 0.01$
Generative Models (e.g. GANs) | [Gupta2018, Sadeghian2019, Salzmann2020] | N/A
TABLE I: Different model classes for capturing human trajectory uncertainty, used in safe planning algorithms to guarantee safety with probability $1-\delta$. The right column shows the lowest safety threshold, $\delta$, we could find used in the literature (in simulation or hardware experiments) for each model class. There is no entry for generative models, as these models have not yet been utilized to provide explicit safety guarantees during planning, though there is surely a trend in this direction.

II RELATED WORK

Most recent approaches for guaranteed safe navigation in proximity to humans or their cars approximate uncertainty in human trajectories as a random process (i.e. deviations from a nominal trajectory are drawn i.i.d. from a learned distribution). These uncertainty models capture noise and the effects of latent variables (e.g. intention), and enable probabilistic safety guarantees in uncertain, dynamic environments. Most models fall into one or more of the following categories:

  • Gaussian Process (GP): These approaches model other agents’ trajectories as Gaussian processes, treating trajectory uncertainty as a multivariate Gaussian [Ellis2009, Aoude2013, Hakobyan2020, Cheng2020]. Several extensions exist, such as the IGP model [Trautman2015] (which accounts for interaction between multiple agents) and others [ferguson2015, liu2019]; however, they all treat uncertainty as a multivariate Gaussian (a minimal illustrative sketch appears after this list).

  • Gaussian Noise with Dynamics Model: These approaches use a dynamics model with additive Gaussian noise; noise can also be added in state observations. This induces a Gaussian distribution over other agents’ future trajectory (or a situation where we can do moment-matching) [gray2013, forghani2016].

  • Quantile Regression: This approach computes quantile bounds over the trajectories of other agents at a given confidence level, $\delta$. This approach benefits from not assuming an uncertainty distribution over trajectories [Tagasovska2018, Fan2020].

  • Scenario Optimization: This approach computes a predicted set over other agents’ actions based on samples of previously observed scenarios [Campi2018]. It is distribution-free (i.e. it does not assume a parametric uncertainty distribution) [carvalho2015, Cesari2017, Chen2020, Sartipizadeh2020]. [driggs-campbell2017, driggs-campbell2018] do not use scenario optimization, but their approach, based on computing minimum support sets, is similar in spirit.

  • Noisy (i.e. Boltzmann) Rational Model: This model treats the human as a rational actor who takes “noisily optimal” actions according to an exponential-family (Boltzmann) distribution, typically of the form $\mathbb{P}(a \mid x) \propto \exp\big(\beta\, Q(x,a)\big)$ for a human value function $Q$ and rationality coefficient $\beta$. The uncertainty in the action is captured by this distribution, which relies on an accurate model of the human’s value function [Li2016, Sadigh2016, Fisac2018, FridovichKeil2020, Kwon2020].

  • Generative Models (CVAE, GAN): These models generally learn an implicit distribution over trajectories. Rather than give an explicit distribution, they generate random trajectories that attempt to model the true distribution [Gupta2018, Sadeghian2019]. However, other works have also utilized the CVAE framework to produce explicit parameterized distributions using a discrete latent space [Salzmann2020].

  • Hidden Markov Model (HMM) / Markov Chain: These models capture uncertainty over discrete sets of states/intentions (e.g. goal positions) – as opposed to capturing uncertainty over trajectories. Thus, the objective is to infer the other agents’ unobserved state/intention (from a discrete set) with very high certainty, $1-\delta$ [Kelley2008, McGhan2015, Bandyopadhyay2013, Lam2014, Tran2015, sadigh2014, Liu2015].

  • Uncertainty Quantifying (UQ) Neural Networks: These approaches do not constitute a separate class of uncertainty models, but refer to methods that train a neural network to capture the distribution over other agents’ trajectories [Gindele2010, Kahn2017, Fan2019, Michelmore2019]. We list them separately due to their popularity. Most often these networks output a Gaussian distribution or mixture of Gaussians (e.g. Bayesian neural networks [Blundell2015], deep ensembles [Lakshminarayanan2016], Monte-Carlo dropout [Gal2016]). These models can also quantify uncertainty over discrete states (i.e. infer the hidden state in HMMs) [Hu2018, Ding2019].
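To make the first (and most common) class above concrete, the following is a minimal sketch of fitting a GP to an agent’s observed positions and extracting a $\delta$-confidence tube. It is not code from any cited work: the data is synthetic, and the kernel and horizon are illustrative choices.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Synthetic observations: (time, lateral position) over the first 2 s of a trajectory.
t_obs = np.linspace(0.0, 2.0, 20).reshape(-1, 1)
y_obs = 0.1 * np.sin(2.0 * t_obs).ravel() + 0.02 * rng.standard_normal(20)

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(t_obs, y_obs)

# Predict 2-10 s ahead and form the delta-confidence tube mean +/- z * std.
# The tube is only as trustworthy as the Gaussian tail assumption -- the very
# assumption this paper questions at small delta.
t_pred = np.linspace(2.0, 10.0, 80).reshape(-1, 1)
mean, std = gp.predict(t_pred, return_std=True)
delta = 1e-3
z = norm.ppf(1.0 - delta / 2.0)          # two-sided Gaussian quantile
lower, upper = mean - z * std, mean + z * std
```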

Once a predicted trajectory and its uncertainty are learned, many mechanisms exist to guarantee safety (e.g. incorporating the uncertainty into chance constraints). In this work, we do not focus on these mechanisms (i.e. robust control algorithms) for guaranteeing safety; rather, we focus on the issue of learning/modeling trajectory uncertainty, which such mechanisms must leverage for their safety guarantees.
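As one concrete (and deliberately simplified) example of such a mechanism, a Gaussian position prediction can enter a chance constraint through a Mahalanobis-distance test. The sketch below is our illustration, not a method from the cited works; it assumes a 2-D Gaussian and ignores vehicle extent.

```python
import numpy as np
from scipy.stats import chi2

def violates_chance_constraint(p_ego, mu, Sigma, delta):
    """True if the ego position lies inside the (1 - delta) confidence ellipse of the
    other agent's predicted position ~ N(mu, Sigma); staying outside it keeps the
    (Gaussian-model) collision probability below delta at this time step."""
    d2 = (p_ego - mu) @ np.linalg.solve(Sigma, p_ego - mu)  # squared Mahalanobis distance
    return d2 <= chi2.ppf(1.0 - delta, df=2)

# Example: for delta = 1e-3 the squared Mahalanobis threshold is ~13.8.
print(violates_chance_constraint(np.array([2.0, 1.0]),
                                 np.array([0.0, 0.0]),
                                 np.diag([1.0, 0.5]), 1e-3))
```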

III EXPERIMENT SETUP

The remainder of this paper aims to highlight the limitations of the aforementioned uncertainty models when considering human behavior. We show that the prevalent model classes of uncertainty (see Table I) fail to capture human behavior at safety-critical thresholds ($\delta \leq 10^{-8}$), and exhibit significant errors when evaluated against real-world data. In particular, we test these uncertainty models on real-world driving data from the highD dataset [highDdataset], which uses overhead drones to capture vehicle trajectories from human drivers on German highways.

In this section, we detail how we processed the highD dataset to extract important features in order to train/test the different uncertainty models. In the following section, we evaluate the accuracy of these models.

Figure 1: (Left) In this example the red car must take into account the blue car’s trajectory – and its uncertainty – in its plan to progress safely through the intersection. The dashed yellow curves denote the boundary of a tube that defines the $\delta$ confidence bound over trajectories. The white circle depicts a distribution over trajectories. The blue lines are example trajectories. (Right) Simplified illustration of different stages of the control pipeline. While every stage (prediction, planning, tracking) is crucial to guaranteeing safety, this paper focuses exclusively on the yellow box, prediction.

III-A Processing Dataset

From the highD dataset, we extract all trajectories of length 10 seconds, $\tau_{[0,10]}$ (denoting the agent’s position over a 10 second horizon), as well as the corresponding environmental context, $\mathcal{E}_{\tau}$, denoting the presence and position/velocity of surrounding cars. The trajectory and its context are denoted by the tuple $(\tau_{[0,10]}, \mathcal{E}_{\tau})$. We then split the trajectories/context into a training set, $\mathcal{D}^{train}$, and a test set, $\mathcal{D}^{test}$. For a given test trajectory, $(\tau^{(test)}_{[0:10]}, \mathcal{E}^{(test)}_{\tau}) \in \mathcal{D}^{test}$, we define the set of equivalent scenarios, $\mathcal{M}(\tau^{(test)}_{[0:10]}, \mathcal{E}^{(test)}_{\tau})$, as the set of training trajectories with similar environmental context that are $\epsilon$-close ($\epsilon = 2$ ft) over their first 2 s:

$\mathcal{M}(\tau^{(test)}_{[0:10]}, \mathcal{E}^{(test)}_{\tau}) = \Big\{ \tau_{[0:10]} \;\Big|\; (\tau_{[0:10]}, \mathcal{E}_{\tau}) \in \mathcal{D}^{train},\ \|\tau_{[0:2]} - \tau^{(test)}_{[0:2]}\|_{\infty} < \epsilon,\ \|\mathcal{E}_{\tau} - \mathcal{E}^{(test)}_{\tau}\|_{\infty} < \epsilon_{\mathcal{E}} \Big\}. \qquad (1)$

Therefore, $\mathcal{M}(\tau^{(test)}_{[0:10]}, \mathcal{E}^{(test)}_{\tau}) \subset \mathcal{D}^{train}$ denotes the set of scenarios within the training set, $\mathcal{D}^{train}$, that are highly similar to $(\tau^{(test)}_{[0:10]}, \mathcal{E}^{(test)}_{\tau})$. Since we have several test trajectories within $\mathcal{D}^{test}$, we define a pruned training set

$\mathcal{M}(\mathcal{D}^{test}) = \bigcup_{(\tau, \mathcal{E}_{\tau}) \in \mathcal{D}^{test}} \mathcal{M}(\tau, \mathcal{E}_{\tau}) \subseteq \mathcal{D}^{train}. \qquad (2)$

Every trajectory in the test set, $\mathcal{D}^{test}$, has equivalent scenarios in the pruned training set, $\mathcal{M}(\mathcal{D}^{test})$, such that we alleviate the issue of out-of-distribution error in learning. For clarity, let us define $\mathcal{T}^{train} = \mathcal{M}(\mathcal{D}^{test})$.
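A minimal sketch of this matching step, assuming each trajectory is stored as a NumPy array of positions and each context as a fixed-length feature vector; the 25 Hz sampling rate and the context threshold value stand in for $\epsilon_{\mathcal{E}}$ and are our assumptions, not values specified above.

```python
import numpy as np

FPS = 25                      # assumed highD sampling rate
T2 = 2 * FPS                  # number of frames in the first 2 s
EPS_POS = 2.0                 # feet, epsilon for the position match
EPS_CTX = 1.0                 # assumed value of epsilon_E for the context match

def equivalent_scenarios(test_traj, test_ctx, train_set):
    """Return the training trajectories M(tau_test, E_test) per Eq. (1): those whose
    first 2 s and context lie within an infinity-norm ball of the test scenario."""
    matches = []
    for traj, ctx in train_set:                      # train_set: list of (traj, ctx) pairs
        close_pos = np.max(np.abs(traj[:T2] - test_traj[:T2])) < EPS_POS
        close_ctx = np.max(np.abs(ctx - test_ctx)) < EPS_CTX
        if close_pos and close_ctx:
            matches.append(traj)
    return matches
```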

III-B Training Learned Uncertainty Models

Given our test set, $\mathcal{D}^{test}$, and pruned training set, $\mathcal{T}^{train} \subseteq \mathcal{D}^{train}$, we would like to train a given uncertainty model $\hat{F}$ (e.g. Gaussian) on $\mathcal{T}^{train}$, and observe how accurately it captures the distribution of trajectories within $\mathcal{D}^{test}$.

Let us divide a given scenario $(\tau_{[0,10]}, \mathcal{E}_{\tau})$ into the agent’s state, $x = (\tau_{[0:2]}, \mathcal{E}_{\tau})$, and its action, $a = \tau_{[2:10]}$ (its future trajectory). Since the action is drawn from some unknown distribution over trajectories, $a \sim \mathcal{A}(x)$, our goal is to train a model $\hat{F}(x)$ that accurately approximates $\mathcal{A}(x)$, minimizing the following error,

$L^{out} = \mathbb{E}\big[ m\big( \mathcal{A}(x) \,\|\, \hat{F}(x) \big) \big], \qquad (3)$

where $m$ defines some metric over probability distributions. Clearly we do not know the true distribution $\mathcal{A}(x)$, but we can obtain an empirical estimate based on any dataset $\mathcal{D}$. We denote this empirical estimate $\hat{\mathcal{A}}(x, \mathcal{D})$.

Using our pruned training dataset, $\mathcal{T}^{train}$, we can train our uncertainty model $\hat{F}(x)$ (e.g. Gaussian, quantile, etc.) to minimize the following error function:

$L^{train} = \sum_{x \in \mathcal{T}^{train}} m\big( \hat{\mathcal{A}}(x, \mathcal{T}^{train}) \,\|\, \hat{F}(x) \big). \qquad (4)$

Then we can test the uncertainty model $\hat{F}(x)$ on the test dataset $\mathcal{D}^{test}$, yielding the error function

$L^{test}_{seen} = \sum_{x \in \mathcal{D}^{test}} m\big( \hat{\mathcal{A}}(x, \mathcal{D}^{test}) \,\|\, \hat{F}(x) \big). \qquad (5)$

Note that the pruned training set $\mathcal{T}^{train}$ contains data from all states $x$ represented in the test set $\mathcal{D}^{test}$. This alleviates issues associated with out-of-distribution data, such that $L^{test}_{seen}$ captures aleatoric uncertainty (vs. epistemic uncertainty). Because we do not have to consider generalization of our models to unseen (out-of-distribution) states, the following relationship generally holds,

$L^{out} \underbrace{\geq}_{\text{generalization gap}} L^{test}_{seen}. \qquad (6)$

In our analysis, we focus on $L^{test}_{seen}$ when measuring the performance of our model $\hat{F}$. As this ignores the generalization gap (i.e. how out-of-distribution examples affect model accuracy), it benchmarks the best potential performance of each model class.
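The metric $m$ and the fitted model $\hat{F}$ are left generic above. Purely to show how Eqs. (4)-(5) would be evaluated in practice, the sketch below instantiates $\hat{\mathcal{A}}$ as a smoothed histogram and $m$ as a KL divergence; both choices, and the `fit_model` / dictionary interfaces, are illustrative assumptions rather than the paper’s actual pipeline.

```python
import numpy as np

def empirical_estimate(actions, bins):
    """Smoothed histogram estimate A_hat(x, D) of the action distribution for one state x.
    `actions`: array of shape (n, d) of action feature vectors."""
    hist, _ = np.histogramdd(actions, bins=bins)
    hist = hist + 1e-9                               # avoid empty bins
    return hist / hist.sum()

def kl(p, q):
    """One possible metric m: KL divergence between discretized distributions."""
    return float(np.sum(p * np.log(p / q)))

def test_loss(test_states, matched_train, matched_test, fit_model, bins):
    """L_test_seen as in Eq. (5): sum over test states of m(A_hat(x, D_test) || F_hat(x)),
    where F_hat(x) is fit on the equivalent training scenarios in T_train.
    matched_train / matched_test: dicts mapping a state identifier to action arrays."""
    total = 0.0
    for x in test_states:
        f_hat = fit_model(matched_train[x], bins)    # model density on the same grid
        a_hat = empirical_estimate(matched_test[x], bins)
        total += kl(a_hat, f_hat)
    return total
```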

Accounting for replanning: Most motion planning algorithms re-plan their trajectory at some fixed frequency (e.g. 1 Hz). To account for this, we examine prediction error (e.g. violation of the $\delta$-uncertainty bound) only within a short re-planning horizon; that is, the prediction must be accurate only within this horizon, which we set to 2 s.

Incorporating conservative assumptions: To further highlight the fundamental limitations of learned uncertainty models of human behavior, and since many prediction algorithms leverage goal inference, we assume that an oracle gives us the target lane of every trajectory. Note that our aim is to illustrate the limitations of learned probabilistic models even under ideal conditions. Thus, this strong (though unrealistic) assumption helps us reason about the best-case scenario for each model class, providing an upper bound on performance.

Summarizing, we assume that (a) there is no generalization gap, and (b) we are given the target lane of every trajectory.

If the models perform poorly under these extremely generous assumptions, we cannot expect reasonable performance in realistic settings.

IV RESULTS - ERROR IN UNCERTAINTY MODELS

In this section, we analyze the accuracy of different uncertainty models in capturing the distribution of trajectories in $\mathcal{D}^{test}$, after being trained on $\mathcal{T}^{train}$.

IV-A Gaussian Uncertainty Models

We start by analyzing the popular Gaussian uncertainty model, used in most UQ neural networks [Michelmore2019], Gaussian process models [Aoude2013], and robust regression [liu2019, Nakka2020]. These approaches model the data and its uncertainty with a Gaussian distribution (see top 3 rows in Table I). Using the procedure outlined in Section III, we compute the best-fit Gaussian distribution, $\hat{F}$, over the training trajectories $\mathcal{T}^{train}$, and observe how well it captures the in-distribution test trajectories in $\mathcal{D}^{test}$.

Figure 2 ($K=1$) shows the ratio of observed to expected violations in the test set at each safety threshold, $\delta$. A violation occurs when the test trajectory lies outside the $\delta$-uncertainty bound predicted by $\hat{F}$ (within a 2 s re-planning horizon) for a specified $\delta$. If the data followed a perfect Gaussian distribution, each curve in Fig. 2 would track the dotted black line (i.e. a ratio near 1). If a curve falls below the dotted black line, the model is overly conservative, and vice versa. We see that while the Gaussian model may be valid for $\delta \geq 0.01$, it is highly inaccurate outside this range, posing a problem for safety-critical applications.
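A minimal sketch of this evaluation for the $K=1$ case, assuming each action has been reduced to a feature vector over the 2 s replanning horizon; the chi-squared (Mahalanobis) confidence region is our illustrative way of defining the Gaussian $\delta$-bound, and the variable names are not from the paper.

```python
import numpy as np
from scipy.stats import chi2

def violation_ratio(train_actions, test_actions, deltas):
    """Ratio of observed to expected delta-bound violations for a single Gaussian fit.
    train_actions, test_actions: arrays of shape (n, d)."""
    mu = train_actions.mean(axis=0)
    Sigma = np.cov(train_actions, rowvar=False) + 1e-9 * np.eye(train_actions.shape[1])
    diff = test_actions - mu
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)  # squared Mahalanobis

    ratios = []
    for delta in deltas:
        bound = chi2.ppf(1.0 - delta, df=train_actions.shape[1])  # delta-confidence region
        observed = np.mean(d2 > bound)          # empirical violation frequency
        ratios.append(observed / delta)         # ~1 if the Gaussian tail were correct
    return np.array(ratios)

# e.g. violation_ratio(train_a, test_a, deltas=np.logspace(-1, -5, 5))
```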

Figure 2: Prediction error vs. safety threshold, $\delta$, using Gaussian mixture models on the highD dataset, considering a 2 s re-planning horizon. $K$ denotes the number of mixture components, with $K=1$ denoting a standard Gaussian distribution. The dashed black line represents a perfect prediction model.

Gaussian mixture models (GMM): One might point out that the problems with the Gaussian model could be alleviated using GMMs over a discrete set of goals (e.g. left versus right turn). For example, interacting Gaussian processes (IGP) leverage this tool to alleviate the freezing robot problem [Trautman2015]. However, when we trained GMMs on the same data with different numbers of mixture components ($K = 2, \ldots, 4$), prediction performance on test data did not improve for low $\delta$ (see Fig. 2). These results illustrate the limitations of any Gaussian-based uncertainty model (IGP, GMM, etc.) by highlighting that human behavioral variation is inherently non-Gaussian.
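The same evaluation can be sketched for GMMs with scikit-learn. Here the $\delta$-confidence region of the mixture is approximated by a Monte-Carlo estimate of its highest-density region, which is one illustrative construction among several; it is not the construction used in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_violation_ratio(train_actions, test_actions, deltas, K=3, n_mc=200000, seed=0):
    """Observed/expected violation ratio for a K-component GMM fit on the training actions."""
    gmm = GaussianMixture(n_components=K, covariance_type='full', random_state=seed)
    gmm.fit(train_actions)

    # Estimate the log-density level enclosing probability 1 - delta via Monte Carlo.
    samples, _ = gmm.sample(n_mc)
    sample_logp = gmm.score_samples(samples)
    test_logp = gmm.score_samples(test_actions)

    ratios = []
    for delta in deltas:
        level = np.quantile(sample_logp, delta)   # log-density threshold of the delta-region
        observed = np.mean(test_logp < level)     # fraction of test actions outside it
        ratios.append(observed / delta)
    return np.array(ratios)
```

Note that the Monte-Carlo threshold itself becomes unreliable for very small $\delta$ unless n_mc is enormous, which mirrors the paper’s data-requirement argument.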

In addition to the issue of inaccurate distributional assumptions, the confidence bounds at level