
¹University of California, Berkeley    ²University of California, Santa Cruz

Parallel and Multi-Objective Falsification
with Scenic and VerifAI

Kesav Viswanadha¹, Edward Kim¹, Francis Indaheng¹, Daniel J. Fremont², and Sanjit A. Seshia¹
Abstract

Falsification has emerged as an important tool for simulation-based verification of autonomous systems. In this paper, we present extensions to the Scenic scenario specification language and VerifAI toolkit that improve the scalability of sampling-based falsification methods by using parallelism and extend falsification to multi-objective specifications. We first present a parallelized framework that is interfaced with both the simulation and sampling capabilities of Scenic and the falsification capabilities of VerifAI, reducing the execution time bottleneck inherently present in simulation-based testing. We then present an extension of VerifAI’s falsification algorithms to support multi-objective optimization during sampling, using the concept of rulebooks to specify a preference ordering over multiple metrics that can be used to guide the counterexample search process. Lastly, we evaluate the benefits of these extensions with a comprehensive set of benchmarks written in the Scenic language.

Keywords:
Runtime Verification Formal Methods Falsification Cyber-Physical Systems Autonomous Systems Parallelization

1 Introduction

The growing adoption of autonomous and semi-autonomous systems such as self-driving vehicles brings with it pressing questions about ensuring their safety and reliability. In particular, the increasing use of artificial intelligence (AI) and machine learning (ML) components requires significant advances in formal methods, of which simulation-based formal analysis is a key ingredient [24].

Even with notable development in simulators and methods for simulation-based verification, there are four practical issues which require further advances in tools. First, simulation time can be a huge bottleneck, as falsification is typically done with high-quality, realistic simulators such as CARLA [12], which can be computation-intensive. Second, modeling interactive, multi-agent behaviors using general programming languages like Python can be very time-consuming. Third, autonomous systems usually need to satisfy multiple properties and metrics, with differing priorities, and convenient notation is needed to formally specify these. Fourth, we need to develop specification and sampling methods for falsification that can support multiple objectives.

There has been prior work that addresses these four issues separately, including several ideas for falsification or, conversely, optimization of an autonomous system subject to multiple objectives [9, 4, 5, 26]. Other tools address simulation-based testing of systems [3, 22, 2]. In recent years, there have been significant advances in adversarial machine learning [18, 21] as a way to find flaws in neural network-based systems. However, these methods tend either to focus on testing specific components of autonomous systems or to be rather complex to use. More importantly, to our knowledge, no prior work has jointly addressed all of these issues and demonstrated them in a single tool. In this paper, we do so by extending the open-source VerifAI toolkit [13]; documentation of the extensions covered in this paper is available at https://verifai.readthedocs.io/en/kesav-v-multi-objective/.
This tool is reasonably mature, having been demonstrated in multiple industrial case studies [17, 14]. Our contributions to the toolkit support:

1. Parallelized falsification, running multiple simulations in parallel;

2. Falsification using the latest version of the Scenic formal scenario specification language, extending support to the “dynamic” features of Scenic for modeling interactive behaviors [16];

3. The ability to specify multiple objectives with priority orderings for falsification;

4. A multi-armed bandit algorithm that supports multi-objective falsification; and

5. Evaluation of these extensions with a comprehensive set of self-driving scenarios.

These contributions substantially extend the capabilities of VerifAI. With parallel falsification, we cut down drastically on execution time, achieving up to 5x speedup over VerifAI's existing falsification methods using 5 parallel simulation processes. Using the multi-objective multi-armed bandit sampler, we were able to find scenarios which falsify five objectives at the same time.

2 Background

Scenic is a probabilistic programming language [15, 16, 6] that allows users to intuitively model probabilistic scenarios for multi-agent systems. A concrete scenario is a set of objects and agents, together with values for their static attributes, initial state, and parameters of dynamic behavioral models describing how their attributes evolve over time. In other words, a concrete scenario defines a specific trace. The state of each object or agent, such as a car, includes its semantic properties such as its position, orientation, velocity, color, model, etc. We refer to the vector of such semantic properties as a semantic feature vector; the concatenation of the semantic feature vectors of all objects and agents at a given time instant defines the overall semantic feature vector at that time. Agents also have behaviors defining a (possibly stochastic) sequence of actions for them to take as a function of the state of the simulation at each time step. A Scenic program defines a distribution over concrete scenarios: by sampling an initial state and then executing the behaviors in a simulator, many different simulations can be obtained from a single Scenic program. Scenic provides a general formalism to express probabilistic scenarios for multiple domains, including traffic and other scenarios for autonomous vehicles, which can then be executed in a number of simulators including CARLA [12]. In previous work on VerifAI [13], the tool supported an earlier version of Scenic without interactive, behavioral specifications. In this paper, we provide full support for Scenic’s newer dynamic features.

VerifAI is a Python toolkit that provides capabilities for verification of AI-based systems [7]. A primary capability is falsification, the systematic search for inputs to a system that falsify a specification given in temporal logic or as a cost function. VerifAI can use Scenic as an environment modeling language, sampling from the distribution over semantic feature vectors defined by a Scenic program to generate test cases. It then simulates these cases, obtaining trajectories.

After simulating a test case, VerifAI evaluates the system’s specification over the obtained trajectory, saving the results for offline analysis. These results are also used to guide further falsification, specifically by VerifAI’s active samplers, such as the cross-entropy sampler [23]. These samplers use the history of previously generated samples and their outcomes in simulation to drive the search process to find more counterexamples.
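To make this loop concrete, the following minimal Python sketch shows the sample-simulate-evaluate-feedback cycle. The callables `sample_scenario`, `simulate`, and `rho` are hypothetical stand-ins for illustration, not the actual VerifAI API.

```python
def falsify(sample_scenario, simulate, rho, iterations=100):
    """Generic falsification loop (illustrative sketch, not VerifAI's API):
    sample a concrete scenario, simulate it, evaluate the specification,
    and feed the result back to guide the next sample."""
    history, error_table = [], []
    for _ in range(iterations):
        scenario = sample_scenario(history)  # active samplers use past results
        trajectory = simulate(scenario)      # run the scenario in a simulator
        value = rho(trajectory)              # specification as a cost function
        history.append((scenario, value))
        if value < 0:                        # negative value => violation found
            error_table.append((scenario, value))
    return error_table
```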

3 Parallel Falsification

In the typical pipeline used by a VerifAI falsifier driven by a Scenic program, semantic feature vectors (parameters) are generated using samplers in either Scenic or VerifAI. These parameter values are then sent by the VerifAI server to the client simulator to configure a simulation and generate a corresponding trajectory. This trajectory is then evaluated by the monitor, deemed either a safe example or a counterexample, and added to the corresponding table in the falsifier. Naturally, a bottleneck of this process is the generation of the trajectory in the simulator, as this is a rather compute-intensive task that can take a minute or more per sample, depending on the scenario description.

We present an improvement on this pipeline by parallelizing it using the Python library Ray [19], which encapsulates process-level parallelism optimized for distributed execution of computation-intensive tasks. Fig. 1 illustrates the new setup: we instantiate multiple instances of the simulator and open multiple Scenic server connections from VerifAI to the simulator instances for performing simulations (the connections now being bidirectional so that the behavior models in the Scenic program can respond to the current state of the simulation). We then aggregate the results of these simulations into a single error table documenting all the counterexamples found during falsification.
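As an illustration of the process-level parallelism Ray provides, the sketch below runs several simulations concurrently and aggregates their results into one error table; `run_simulation` is a placeholder for the actual simulator interface.

```python
import ray

ray.init()  # start Ray, which spawns local worker processes by default

@ray.remote
def run_simulation(params):
    # Placeholder body: configure a simulator instance with the sampled
    # parameters, run the scenario, and evaluate the monitor on the result.
    rho = min(params) - 5.0  # stand-in for a distance-style objective
    return params, rho

# Launch 5 simulations in parallel, mirroring the 5 worker processes
# used in our experiments.
futures = [run_simulation.remote([10.0 * i, 7.0]) for i in range(5)]
results = ray.get(futures)  # blocks until all workers have finished

# Aggregate counterexamples (negative objective values) into one error table.
error_table = [(p, r) for (p, r) in results if r < 0]
```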

[Figure 1 diagram: the VerifAI falsifier connects through the Scenic server to multiple simulator instances, each paired with a monitor, returning trajectories.]
Figure 1: Parallelized pipeline for falsification using VerifAI.

4 Multi-Objective Falsification

There are typically many different metrics of interest for evaluating autonomous systems. For example, there are many well-known metrics used in the autonomous driving community to measure safety: no collisions, obeying traffic laws, and maintaining a minimum safe distance from other objects, among others [25]. It is also natural to assert, for example, that it is more important to avoid collisions than to follow traffic laws. We now discuss how to specify these metrics and their relative priorities.

4.1 Specification of Multiple Objectives Using Rulebooks

Let $\rho(x)$ be a function mapping a simulation trajectory generated by Scenic or VerifAI to a vector-valued objective, where $\rho_j(x)$ is the value of the $j$-th metric. Censi et al. [10] have developed a way to specify preferences over these metrics using a rulebook, denoted $\mathcal{R}$: a directed acyclic graph (DAG) whose nodes are the metrics, with a directed edge from node $i$ to node $j$ meaning that $\rho_i(x)$ is more important than $\rho_j(x)$. We denote this using the $>_R$ operator, e.g. $\rho_i >_R \rho_j$.

Figure 2: Left: an example rulebook over functions $\rho_1 \dots \rho_6$ [10]. Right: the graph $G$ used in our experiments, over $\rho_1 \dots \rho_5$.

Fig. 2 shows an example of a rulebook over six metrics $\rho_1, \dots, \rho_6$. From this example we can make several inferences: $\rho_1$ is more important than $\rho_3$, $\rho_3$ is more important than $\rho_4$, and $\rho_5$ is more important than $\rho_3$. However, many pairs of objective components cannot be compared, for example $\rho_1$ and $\rho_5$. We would like a way to order objective vectors, so that during active sampling we know which values violate the specification most. Because some pairs are incomparable, the rulebook $\mathcal{R}$ only allows a partial ordering $\succ$ over the objective vectors. Intuitively, since falsification seeks to minimize the value of each objective, this partial ordering prefers examples with lower values of higher-priority objectives. However, if any other incomparable or higher-priority objective has a higher value, the $\succ$ relation does not hold. To satisfy these properties, we define the $\succ$ operator as follows:

$$\rho(x_1) \succ \rho(x_2) \;\triangleq\; \forall i\,\bigl(\rho_i(x_2) < \rho_i(x_1) \implies \exists j \neq i\,(\rho_j >_R \rho_i \,\land\, \rho_j(x_1) < \rho_j(x_2))\bigr)$$

As an example, consider our rulebook from Fig. 2. Let $\rho(x_1) = [1\ 1\ 1\ 1\ 1\ 1]^T$ and $\rho(x_2) = [1\ 1\ 2\ 1\ 0\ 1]^T$. In this case we have $\rho(x_2) \succ \rho(x_1)$ because $\rho_5(x_2) < \rho_5(x_1)$, and even though $\rho_3(x_2) > \rho_3(x_1)$, we have $\rho_5 >_R \rho_3$ according to the rulebook, so the comparison of $\rho_5$ for the trajectories takes precedence. Since the rulebook defines only a partial ordering over values of $\rho$, it is possible to have two trajectories $x_1$ and $x_2$ such that $\rho(x_1) \not\succ \rho(x_2)$ and $\rho(x_2) \not\succ \rho(x_1)$. In such cases, both values of $\rho$ are maintained in the sampling algorithm; see below for more details.
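To make the ordering concrete, the following Python sketch builds a rulebook as a DAG and implements the $\succ$ check; it uses only the edges stated in the text (the full left-hand rulebook of Fig. 2 contains further edges) and reproduces the worked example above.

```python
from itertools import product

class Rulebook:
    """A rulebook: a DAG over metric indices, where an edge (i, j) means
    rho_i >_R rho_j. Sketch for illustration, not VerifAI's implementation."""

    def __init__(self, edges, n):
        self.n = n
        self.gt = set(edges)
        # Take the transitive closure so >_R also covers indirect edges.
        changed = True
        while changed:
            changed = False
            for (i, j), (k, l) in product(list(self.gt), list(self.gt)):
                if j == k and (i, l) not in self.gt:
                    self.gt.add((i, l))
                    changed = True

    def succeeds(self, rho1, rho2):
        """rho1 succ rho2: wherever rho2 is strictly lower (more violating),
        some strictly more important component of rho1 must be lower."""
        for i in range(self.n):
            if rho2[i] < rho1[i]:
                if not any((j, i) in self.gt and rho1[j] < rho2[j]
                           for j in range(self.n) if j != i):
                    return False
        return True

# Edges readable from the text for the left rulebook of Fig. 2
# (0-based indices, so rho_1 is index 0):
# rho_1 >_R rho_3, rho_3 >_R rho_4, rho_5 >_R rho_3.
R = Rulebook(edges={(0, 2), (2, 3), (4, 2)}, n=6)
rho_x1 = [1, 1, 1, 1, 1, 1]
rho_x2 = [1, 1, 2, 1, 0, 1]
assert R.succeeds(rho_x2, rho_x1)      # rho(x2) succ rho(x1), as in the text
assert not R.succeeds(rho_x1, rho_x2)
```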

4.2 Multi-Objective Active Sampling

When performing active sampling to search for unsafe test inputs, we need a specialized sampler that supports multiple objectives guiding the search process. Most of the samplers previously available in VerifAI focus either entirely on exploration of the search space or entirely on exploitation to find unsafe inputs; we present a sampler that balances the two and, in the multi-objective case, builds up increasingly-violating counterexamples.

The Multi-Armed Bandit Sampler. We present a more robust version of VerifAI's cross-entropy sampler, called the multi-armed bandit sampler, whose idea is to balance the trade-off between exploitation and exploration. To understand the motivation for the sampler, we first look at the formulation of the multi-armed bandit problem. Consider an agent choosing among multiple lotteries, or “arms,” each a random variable offering a probabilistic reward. The agent does not know ahead of time which arm gives the highest expected reward, and must learn this information by sampling the arms efficiently, while also maximizing the average reward earned during the sampling process.

Carpentier et al. [8] present the Upper Confidence Bound (UCB) algorithm, which effectively balances both of these goals, subject to a confidence parameter $\delta$, by sampling the arm $j$ that maximizes a quantity $Q_j$ depending on the number of timesteps $t$, the number of times arm $j$ has been sampled $T_j(t-1)$, the observed reward $\hat{\mu}_j$ of arm $j$, and the confidence parameter $\delta$:

$$Q_j = \hat{\mu}_j + \sqrt{\frac{2}{T_j(t-1)} \ln\left(\frac{1}{\delta}\right)}$$

Qualitatively, this balances exploitation of the reward distribution learned so far (the first term) against exploration of seldom-sampled arms (the second term). This readily adapts to our cross-entropy sampler in VerifAI, which splits the range of each sampled variable into $N$ equally spaced buckets; these buckets can be treated as the “arms.” We take $\hat{\mu}_j$ to be the proportion of counterexamples found in bucket $j$.
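As a small numeric illustration (with made-up counts, not measured data), the UCB score of each bucket can be computed as follows; note how the exploration term dominates for a rarely visited bucket.

```python
import numpy as np

def ucb_scores(counterexamples, visits, t):
    """Q_j = mu_hat_j + sqrt((2 / T_j(t-1)) * ln(1/delta)), with 1/delta = t.
    mu_hat_j is the proportion of counterexamples found in bucket j."""
    mu_hat = counterexamples / visits
    return mu_hat + np.sqrt((2.0 / visits) * np.log(t))

# Bucket 0 yields counterexamples often but is well explored; bucket 2 has
# found none yet but has barely been sampled, so its exploration bonus wins.
print(ucb_scores(np.array([4.0, 1.0, 0.0]), np.array([10.0, 8.0, 2.0]), t=20))
```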

To compute $\hat{\mu}_j$ for a vector-valued objective, we present the following incremental algorithm, which builds up counterexamples that falsify more and more objectives (according to the priority order) over time. The algorithm assumes the sampler is responsible for generating a $d$-dimensional feature vector; its steps are listed below, followed by a condensed Python sketch of the complete sampler.
Setup

1. Split the range of each component of the feature vector into $N$ buckets, as in the cross-entropy sampler.

2. Initialize a matrix $T$ of size $d \times N$, where $T_{ij}$ keeps track of the number of times bucket $j$ was visited for variable $x_i$.

3. Initialize a dictionary $c$ mapping each maximal counterexample found so far to a matrix $c_b$ of size $d \times N$, where $c_{b,ij}$ counts how many times sampling bucket $j$ for variable $x_i$ resulted in the specific counterexample $b$.

4. Sample from each bucket once initially, updating $c$ and $T$ according to the update algorithm described below. The purpose of this is to avoid division by zero when computing $Q$, as $T_j(t-1) = 0$ at initialization [1].

Sampling

1. Compute a matrix $\hat{\mu}$ where $\hat{\mu}_{ij}$ represents the observed reward from sampling bucket $j$ for variable $x_i$, obtained by taking $\sum_b c_{b,ij}$.

2. Compute a matrix $Q$ using the upper confidence bound formula above. For the confidence parameter, we use the time-dependent value $1/\delta = t$.

3. To sample $x_i$, take the bucket $j^* = \arg\max_j Q_{ij}$, breaking ties uniformly at random. This is a key step in the sampling process: initially, several buckets frequently have exactly the same $Q_{ij}$ value, and we must avoid bias towards any specific bucket. Then sample uniformly at random within the range represented by bucket $j^*$.

Updating Internal State

1. Given the objective vector value $\rho$, compute the boolean vector $b$ recording which objectives were falsified.

2. If $b$ does not exist in the dictionary $c$ and is among the set of maximal counterexamples found so far, i.e. $\forall b' \in c,\ b' \not\succ b$ as defined by the rulebook $\mathcal{R}$, add $b$ as a key to the dictionary $c$ and initialize its value as $0^{d \times N}$.

3. For any $b' \in c$ such that $b \succ b'$, remove $b'$ from $c$.

4. Increment the count $c_b$ at each position $c_{b,ij}$ for the bucket $j$ sampled for variable $x_i$.
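The following condensed Python sketch puts the three phases together. It is a simplified illustration rather than the exact VerifAI implementation: it assumes each feature ranges over $[0, 1)$, encodes a counterexample $b$ with 0 for a falsified objective (so the rulebook ordering, which prefers lower values, applies directly), and normalizes $\sum_b c_{b,ij}$ by $T_{ij}$ so that $\hat{\mu}$ is the proportion of counterexamples, matching the description above. The `rulebook` argument is assumed to provide the `succeeds` check sketched in Sec. 4.1.

```python
import numpy as np

class MultiArmedBanditSampler:
    """Sketch of the multi-armed bandit sampler for a d-dimensional feature
    vector, each component split into N buckets over [0, 1)."""

    def __init__(self, d, N, rulebook):
        self.d, self.N, self.rulebook = d, N, rulebook
        self.T = np.zeros((d, N))   # visits of bucket j for variable x_i
        self.c = {}                 # maximal counterexample b -> d x N counts
        self.t = 0

    def sample(self):
        self.t += 1
        if self.t <= self.N:        # setup phase: visit every bucket once
            buckets = np.full(self.d, self.t - 1)
        else:
            counts = sum(self.c.values(), np.zeros((self.d, self.N)))
            mu_hat = counts / np.maximum(self.T, 1)   # counterexample rate
            Q = mu_hat + np.sqrt((2.0 / np.maximum(self.T, 1)) * np.log(self.t))
            buckets = np.array([    # argmax per variable, ties broken at random
                np.random.choice(np.flatnonzero(q == q.max())) for q in Q])
        self._last = buckets
        # Sample uniformly within each chosen bucket of [0, 1).
        return (buckets + np.random.rand(self.d)) / self.N

    def update(self, b):
        """b: tuple with b_i = 0 if objective i was falsified, else 1."""
        rows = np.arange(self.d)
        self.T[rows, self._last] += 1
        if all(b):                  # no objective falsified: a safe run
            return
        # Step 3: drop counterexamples strictly dominated by b.
        for bp in [bp for bp in self.c
                   if bp != b and self.rulebook.succeeds(b, bp)]:
            del self.c[bp]
        # Step 2: add b if no stored maximal counterexample dominates it.
        if b not in self.c and not any(self.rulebook.succeeds(bp, b)
                                       for bp in self.c):
            self.c[b] = np.zeros((self.d, self.N))
        # Step 4: credit the buckets that produced this counterexample.
        if b in self.c:
            self.c[b][rows, self._last] += 1
```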

5 Evaluation

We present a set of experiments designed to evaluate (i) the speedup in simulation time that we expect to see from parallelization; (ii) the benefits of the multi-armed bandit sampler in balancing exploration and exploitation; and (iii) the improved capabilities of falsification to support multiple objectives. We have developed a library of Scenic scripts (full listing and source code available at https://github.com/findaheng/Scenic/tree/behavior_prediction/examples/carla/Behavior_Prediction) based on the list of pre-crash scenarios described by the National Highway Traffic Safety Administration (NHTSA) [20]; for a list of the scenarios, see the Appendix. These scripts cover a wide variety of common driving situations, such as driving through intersections, bypassing vehicles, and accounting for pedestrians.

We selected 7 of these scenarios, running the VerifAI falsifier on each one in CARLA [12] for 30 minutes, with individual simulations limited to 300 timesteps (about 30 seconds). For all of these scenarios, the monitor specifies that the centers of the ego vehicle and other vehicles must stay at least 5 meters apart at all times. This specification means that counterexamples approximately correspond to collisions or near-collisions. All parallelized experiments were run using 5 worker processes to perform simulation.
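For concreteness, the per-vehicle distance requirement can be written as the following objective over trajectories, where a negative value indicates a violation of the 5-meter requirement; the trajectory format here (arrays of 2D center positions) is an assumption for illustration.

```python
import numpy as np

def distance_objective(ego_traj, other_traj, threshold=5.0):
    """rho(x): minimum over time of (center-to-center distance - 5 m).
    Negative values correspond to counterexamples (near-collisions)."""
    dists = np.linalg.norm(np.asarray(ego_traj) - np.asarray(other_traj),
                           axis=1)
    return float(dists.min() - threshold)

# e.g. two straight-line trajectories sampled at three timesteps
ego = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
adv = [(10.0, 3.0), (5.0, 3.0), (0.0, 3.0)]
print(distance_objective(ego, adv))  # 3.0 - 5.0 = -2.0 => violation
```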

Fig. 3 shows the results of running these scenarios under a variety of configurations. First, across the scenarios, we observed a 3-5x increase in the number of simulations completed when using 5 parallel simulation processes. The variation in the number of samples generated can be attributed to termination conditions set in Scenic, which end simulations early if specific conditions are met. For some scenarios, termination occurred much sooner on average than for others, leading to more simulations finishing within 30 minutes. These values also serve as partial evidence of the effectiveness of the multi-armed bandit sampler compared to cross-entropy: the proportion of counterexamples found is comparable for the two samplers despite the increased exploration component in the multi-armed bandit sampler.

Figure 3: Comparison of (i) the serial and parallel versions of the falsifier for cross-entropy and Halton sampling and (ii) the multi-armed bandit sampler with the cross-entropy and Halton samplers, all in parallel. The orange part of each bar represents the number of counterexamples found out of the total number of samples.

To validate the scalability and explorative aspects of parallelized falsification, we present two metrics in Table 1. The first is the speedup factor: the ratio of the number of sampled scenarios in parallel versus serial falsification, averaged across the Halton and cross-entropy samplers. We are also interested in a measure of coverage of the scenario search space, since this ensures that a wide range of scenarios is tested during falsification. To this end, we present the confidence interval width ratio. This metric is computed by generating a 95% confidence interval [11] providing lower and upper bounds on the probability that a randomly generated scenario results in unsafe behavior. Since confidence intervals assume uniform random sampling, we compute them only for the serial and parallel Halton samplers, which approximate random sampling. We take the ratio of the interval widths in the parallel versus serial case to compare how tight a bound we obtain in each case at the same level of confidence. The interval width in the parallel case is significantly smaller, up to half the width of the serial case. Since the interval width is proportional to $1/\sqrt{n}$ for $n$ samples, this is consistent with the observed speedup and can be viewed as roughly doubling the coverage of the search space.
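Since [11] is the classical Clopper-Pearson construction, both the interval and the width ratio can be reproduced as follows; the counts below are illustrative, not our measured data.

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for k unsafe runs out
    of n samples; alpha = 0.05 gives the 95% interval used in Table 1."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# Same empirical rate, but the parallel run completes ~4x as many samples.
lo_s, hi_s = clopper_pearson(20, 100)   # serial (illustrative counts)
lo_p, hi_p = clopper_pearson(80, 400)   # parallel
print((hi_p - lo_p) / (hi_s - lo_s))    # ~0.5, in line with Table 1
```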

Table 1: The speedup factor and confidence interval width ratio metrics for the 7 scenarios.

Scenario #        1     2     3     4     5     6     7
Speedup Factor    3.96  4.27  3.87  4.27  2.73  3.26  5.04
CI Width Ratio    0.51  0.48  0.48  0.53  0.61  0.56  0.44

Figs. 3 and 4 show the qualitative benefits of the multi-armed bandit sampler. The number of counterexamples generated by the multi-armed bandit sampler is higher than for the Halton sampler, but only slightly lower than cross-entropy. However, we can clearly see that multi-armed bandit sampling achieves a balance between number of counterexamples and their diversity that cross-entropy and Halton do not.

Figure 4: Comparison of points sampled by the cross-entropy, multi-armed bandit (MAB), and Halton samplers.

To demonstrate the effectiveness of the multi-objective multi-armed bandit sampler in falsifying multiple objectives, we used a Scenic program that instantiates the ego vehicle along with $m$ adversarial vehicles at random positions around a 4-way intersection, and has all of them drive towards the intersection and either go straight or make a turn. The monitor, similarly to before, specifies metric components $\rho_j$ requiring the ego vehicle to stay at least 5 meters away from vehicle $j$. We used the following three rulebooks: a completely disconnected graph, representing no preference ordering; a linked-list structure $L \triangleq \rho_1 >_R \rho_2 >_R \dots >_R \rho_5$, representing a total ordering; and the graph $G$ on the right in Fig. 2. We found that when using $L$ or $G$, we were able to falsify 4 of the 5 objectives with serial falsification, and all 5 objectives in the parallel case. With no preference ordering, we falsified 3 of the 5 objectives with serial falsification and 4 of the 5 in the parallel case. By contrast, when we combined all of these objectives in disjunction as a single objective (so that only falsifying all 5 objectives is considered unsafe), the cross-entropy sampler was unable to find any counterexamples.
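Two of these three rulebooks are fully determined by the description above and can be expressed with the `Rulebook` sketch from Sec. 4.1 (the edges of $G$ itself must be read off Fig. 2):

```python
# 0-based indices: rho_1 is index 0, ..., rho_5 is index 4.
# Assumes the Rulebook sketch class from Sec. 4.1 is in scope.
no_order = Rulebook(edges=set(), n=5)                    # no preference ordering
L = Rulebook(edges={(i, i + 1) for i in range(4)}, n=5)  # rho_1 >_R ... >_R rho_5
```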

6 Conclusion and Future Work

The extensions to Scenic and VerifAI we report in this paper address important problems in simulation-based falsification. First, we cut down significantly on execution time by supporting parallel simulations. Second, we allow the simple specification of high-level yet complex scenarios using the interface between dynamic Scenic and VerifAI. Third, we support multi-objective specification through the formalism of rulebooks. Lastly, we are able to falsify these multi-objective specifications in a way that is intuitive and scalable using the multi-armed bandit sampler. We hope these extensions prove useful to developers of autonomous systems.

There are many directions for future work. For example, it would be interesting to see whether generating random topological sorts of a rulebook to create total orderings works well in practice. One could also run covariance analysis on the features to determine whether they can be jointly optimized for better active sampling. Lastly, while there has been some work on connecting these ideas to real-world testing [17], extending such testing to multi-objective falsification is an interesting future direction.

References

  • [1] The upper confidence bound algorithm (Sep 2016), https://banditalgs.com/2016/09/18/the-upper-confidence-bound-algorithm/
  • [2] Abbas, H., Fainekos, G., Sankaranarayanan, S., Ivančić, F., Gupta, A.: Probabilistic temporal logic falsification of cyber-physical systems. ACM Trans. Embed. Comput. Syst. 12(2s) (May 2013). https://doi.org/10.1145/2465787.2465797
  • [3] Annpureddy, Y., Liu, C., Fainekos, G., Sankaranarayanan, S.: S-TaLiRo: A tool for temporal logic falsification for hybrid systems. In: Abdulla, P.A., Leino, K.R.M. (eds.) Tools and Algorithms for the Construction and Analysis of Systems. pp. 254–257. Springer Berlin Heidelberg, Berlin, Heidelberg (2011)
  • [4] Araujo, H., Carvalho, G., Mousavi, M.R., Sampaio, A.: Multi-objective search for effective testing of cyber-physical systems. In: Ölveczky, P.C., Salaün, G. (eds.) Software Engineering and Formal Methods. pp. 183–202. Springer International Publishing, Cham (2019)
  • [5] Arrieta, A., Wang, S., Markiegi, U., Sagardui, G., Etxeberria, L.: Employing multi-objective search to enhance reactive test case generation and prioritization for testing industrial cyber-physical systems. IEEE Transactions on Industrial Informatics 14(3), 1055–1066 (2018). https://doi.org/10.1109/TII.2017.2788019
  • [6] BerkeleyLearnVerify: BerkeleyLearnVerify/Scenic, https://github.com/BerkeleyLearnVerify/Scenic
  • [7] BerkeleyLearnVerify: BerkeleyLearnVerify/VerifAI, https://github.com/BerkeleyLearnVerify/VerifAI
  • [8] Carpentier, A., Lazaric, A., Ghavamzadeh, M., Munos, R., Auer, P.: Upper-confidence-bound algorithms for active learning in multi-armed bandits. In: Kivinen, J., Szepesvári, C., Ukkonen, E., Zeugmann, T. (eds.) Algorithmic Learning Theory. pp. 189–203. Springer Berlin Heidelberg, Berlin, Heidelberg (2011)
  • [9] Castro, L.I.R., Chaudhari, P., Tumova, J., Karaman, S., Frazzoli, E., Rus, D.: Incremental sampling-based algorithm for minimum-violation motion planning. CoRR abs/1305.1102 (2013), http://arxiv.org/abs/1305.1102
  • [10] Censi, A., Slutsky, K., Wongpiromsarn, T., Yershov, D.S., Pendleton, S., Fu, J.G.M., Frazzoli, E.: Liability, ethics, and culture-aware behavior specification using rulebooks. CoRR abs/1902.09355 (2019), http://arxiv.org/abs/1902.09355
  • [11] Clopper, C.J., Pearson, E.S.: The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial. Biometrika 26(4), 404–413 (Dec 1934). https://doi.org/10.1093/biomet/26.4.404
  • [12] Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: An open urban driving simulator. In: Proceedings of the 1st Annual Conference on Robot Learning. pp. 1–16 (2017)
  • [13] Dreossi, T., Fremont, D.J., Ghosh, S., Kim, E., Ravanbakhsh, H., Vazquez-Chanlatte, M., Seshia, S.A.: VerifAI: A toolkit for the formal design and analysis of artificial intelligence-based systems. In: 31st International Conference on Computer Aided Verification (CAV) (Jul 2019)
  • [14] Fremont, D.J., Chiu, J., Margineantu, D.D., Osipychev, D., Seshia, S.A.: Formal analysis and redesign of a neural network-based aircraft taxiing system with VerifAI. CoRR abs/2005.07173 (2020), https://arxiv.org/abs/2005.07173
  • [15] Fremont, D.J., Dreossi, T., Ghosh, S., Yue, X., Sangiovanni-Vincentelli, A.L., Seshia, S.A.: Scenic: A language for scenario specification and scene generation. In: Proceedings of the 40th annual ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI) (June 2019)
  • [16] Fremont, D.J., Kim, E., Dreossi, T., Ghosh, S., Yue, X., Sangiovanni-Vincentelli, A.L., Seshia, S.A.: Scenic: A language for scenario specification and data generation. CoRR abs/2010.06580 (2020), https://arxiv.org/abs/2010.06580
  • [17] Fremont, D.J., Kim, E., Pant, Y.V., Seshia, S.A., Acharya, A., Bruso, X., Wells, P., Lemke, S., Lu, Q., Mehta, S.: Formal scenario-based testing of autonomous vehicles: From simulation to the real world. In: 23rd IEEE International Conference on Intelligent Transportation Systems (ITSC) (Sep 2020)
  • [18] Huang, L., Joseph, A.D., Nelson, B., Rubinstein, B.I., Tygar, J.D.: Adversarial machine learning. In: Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence. pp. 43–58. AISec ’11, Association for Computing Machinery, New York, NY, USA (2011). https://doi.org/10.1145/2046684.2046692
  • [19] Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Paul, W., Jordan, M.I., Stoica, I.: Ray: A distributed framework for emerging AI applications. CoRR abs/1712.05889 (2017), http://arxiv.org/abs/1712.05889
  • [20] Najm, W.G., Smith, J.D., Yanagisawa, M.: Pre-crash scenario typology for crash avoidance research (Apr 2007), https://www.nhtsa.gov/sites/nhtsa.gov/files/pre-crash_scenario_typology-final_pdf_version_5-2-07.pdf
  • [21] Qayyum, A., Usama, M., Qadir, J., Al-Fuqaha, A.I.: Securing connected & autonomous vehicles: Challenges posed by adversarial machine learning and the way forward. CoRR abs/1905.12762 (2019), http://arxiv.org/abs/1905.12762
  • [22] Qin, X., Aréchiga, N., Best, A., Deshmukh, J.V.: Automatic testing and falsification with dynamically constrained reinforcement learning. CoRR abs/1910.13645 (2019), http://arxiv.org/abs/1910.13645
  • [23] Sankaranarayanan, S., Fainekos, G.: Falsification of temporal properties of hybrid systems using the cross-entropy method. In: Proceedings of the 15th ACM International Conference on Hybrid Systems: Computation and Control. pp. 125–134. HSCC ’12, Association for Computing Machinery, New York, NY, USA (2012). https://doi.org/10.1145/2185632.2185653
  • [24] Seshia, S.A., Sadigh, D., Sastry, S.S.: Towards Verified Artificial Intelligence. ArXiv e-prints (July 2016)
  • [25] Wishart, J., Como, S., Elli, M., Russo, B., Weast, J., Altekar, N., James, E.: Driving safety performance assessment metrics for ADS-equipped vehicles (04 2020). https://doi.org/10.4271/2020-01-1206
  • [26] Zhou, X., Gou, X., Huang, T., Yang, S.: Review on testing of cyber physical systems: Methods and testbeds. IEEE Access 6, 52179–52194 (2018). https://doi.org/10.1109/ACCESS.2018.2869834

Appendix

List of Evaluation Scenarios

Scenario 1: The ego vehicle drives straight at a 4-way intersection and must suddenly stop to avoid collision when an adversary vehicle from an oncoming parallel lane makes an unprotected left turn.
  Related NHTSA pre-crash scenario(s) [20]: Scenario 30 (Left Turn Across Path, Opposite Direction)

Scenario 2: The ego vehicle makes an unprotected left turn at a 4-way intersection and must suddenly stop to avoid collision when an adversary vehicle from an oncoming parallel lane drives straight.
  Related: Scenario 30 (Left Turn Across Path, Opposite Direction)

Scenario 3: The ego vehicle performs a lane change to bypass a leading vehicle before returning to its original lane.
  Related: Scenario 14 (Changing Lanes, Same Direction)

Scenario 4: A trailing vehicle performs a lane change to bypass the ego vehicle before returning to its original lane.
  Related: Scenario 14 (Changing Lanes, Same Direction)

Scenario 5: The ego vehicle performs a lane change to bypass a leading vehicle, but cannot return to its original lane because the leading vehicle accelerates. The ego vehicle must then slow down to avoid collision with the leading vehicle in its new lane.
  Related: Scenario 14 (Changing Lanes, Same Direction); Scenario 20 (Rear-End, Striking Maneuver); Scenario 22 (Rear-End, Lead Vehicle Moving)

Scenario 6: The ego vehicle must suddenly stop to avoid collision when a pedestrian crosses the road unexpectedly.
  Related: Scenario 10 (Pedestrian, No Maneuver)

Scenario 7: Both the ego vehicle and an adversary vehicle must suddenly stop to avoid collision when a pedestrian crosses the road unexpectedly.
  Related: Scenario 10 (Pedestrian, No Maneuver)