
Multi-Agent Vulnerability Discovery for Autonomous Driving
with Hazard Arbitration Reward

Weilin Liu1∗, Ye Mu1∗, Chao Yu1, Xuefei Ning1†,
Zhong Cao2, Yi Wu3, Shuang Liang4, Huazhong Yang1 and Yu Wang1†
1W. Liu, Y. Mu, C. Yu, X. Ning, H. Yang, and Y. Wang are with the Department of Electronic Engineering, Tsinghua University. W. Liu and Y. Mu contributed equally to this work. Corresponding authors: Y. Wang ([email protected]), X. Ning ([email protected]).
2Z. Cao is with the School of Vehicle and Mobility, Tsinghua University.
3Y. Wu is with the Shanghai Qi Zhi Institute and the Institute for Interdisciplinary Information Sciences, Tsinghua University.
4S. Liang is with Novauto Technology Co. Ltd., Beijing, China.
Abstract

Discovering hazardous scenarios is crucial in testing and further improving driving policies. However, conducting efficient driving policy testing faces two key challenges. On the one hand, the probability of naturally encountering hazardous scenarios is low when testing a well-trained autonomous driving strategy. Thus, discovering these scenarios purely through real-world road testing is extremely costly. On the other hand, a proper determination of accident responsibility is necessary for this task. Collecting scenarios with wrongly attributed responsibilities will lead to an overly conservative autonomous driving strategy. To be more specific, we aim to discover hazardous scenarios that are autonomous-vehicle responsible (AV-responsible), i.e., the vulnerabilities of the under-test driving policy.

To this end, this work proposes a Safety Test framework by finding AV-Responsible Scenarios (STARS) based on multi-agent reinforcement learning. STARS guides other traffic participants to produce AV-responsible scenarios and make the under-test driving policy misbehave by introducing the Hazard Arbitration Reward (HAR). HAR enables our framework to discover diverse, complex, and AV-responsible hazardous scenarios. Experimental results against four different driving policies in three environments demonstrate that STARS can effectively discover AV-responsible hazardous scenarios. These scenarios indeed correspond to the vulnerabilities of the under-test driving policies and are thus meaningful for their further improvement.

I INTRODUCTION

Autonomous driving is a safety-critical application field, since autonomous vehicles directly interact with other vehicles and pedestrians, and mistakes of the driving policy can result in severe accidents. Therefore, identifying possible hazardous scenarios of an autonomous driving policy through extensive testing is a key step in its iterative development process. Among all possible accident scenarios, we focus on the Autonomous-Vehicle Responsible Scenarios (AV-responsible scenarios) that correspond to vulnerabilities of the driving policy. In other words, these accident scenarios are caused by mistakes made by the driving policy. We attribute the responsibility of an accident scenario according to the common sense of human judgment, excluding accident scenarios that are not caused by the autonomous driving policy or are hard for it to avoid. To give an intuition of what an AV-responsible hazardous scenario corresponding to a policy vulnerability is, we show a comparative illustration in Fig. 1.

Since these hazardous cases are very sparse [1] and road testing is costly, collecting a single AV-responsible accident case through real-world road testing can require thousands of miles of driving, leading to excessively high costs and unpredictable safety risks [2, 3]. Discovering AV-responsible scenarios in simulators (e.g., CARLA [4], SMARTS [5], AirSim [6]) is a promising way to mitigate the extremely high costs and safety risks of real-world testing. However, despite the high efficiency of these simulators in generating driving experience, it is still hard to find hazardous cases due to their rare occurrence.

Refer to caption
Figure 1: A comparative illustration of AV-responsible scenarios. “Defender” denotes the vehicle controlled by the under-test policy. “Attackers” are the vehicles controlled by adversarial policies. “NPC” denotes other background vehicles. Non-AV-responsible scenario: a scenario caused by other vehicles, such as an unavoidable collision with a rushing attacker. AV-responsible scenario: a scenario caused by mistakes made by the under-test policy when a front car slightly decelerates. Such an accident corresponds to a vulnerability of the under-test policy.

In this paper, we propose the Safety Test framework by finding AV-Responsible Scenarios (STARS) to efficiently discover vulnerabilities of an under-test driving policy. First, we introduce Multi-Agent Reinforcement Learning (MARL) to control other traffic participants that interact with the under-test driving policy adversarially. In this way, the adversarial “attackers” can efficiently learn to construct hazardous scenarios in which the under-test policy makes mistakes. Second, we present a novel reward design, the Hazard Arbitration Reward (HAR), which attributes the responsibility of hazardous scenarios according to the common sense of human judgment. The power of HAR is three-fold: 1) It is agnostic to concrete hazardous scenarios, thus enabling our framework to discover novel and diverse hazardous scenarios. 2) It prevents our framework from getting stuck in trivial accident cases and enables us to discover valuable AV-responsible scenarios that indeed correspond to vulnerabilities of the under-test driving policy. 3) HAR is general across different environments.

We demonstrate the effectiveness of the proposed framework by testing against dynamic programming-based and reinforcement learning policies in different environments. Experimental results show that our framework can discover diverse and valuable vulnerabilities of the under-test policy: The “attackers” learn to collaboratively construct non-trivial hazardous scenarios in which the under-test policy makes mistakes. STARS can be utilized as a pressure testing platform for driving policies or used to compare the robustness of policies. Also, it can be used to guide further design or learning of the driving policy.

In summary, the contributions of this paper are:

1) We propose a Safety Test framework by finding AV-Responsible Scenarios (STARS), which aims at finding essential AV-responsible scenarios in autonomous driving simulators.

2) We present a novel and general HAR design, and employ MARL to collaboratively construct complex AV-responsible scenarios.

3) We experiment with four target driving policies, including dynamic programming-based control policies and reinforcement learning policies, in three different environments. The results demonstrate that STARS can effectively discover diverse AV-responsible scenarios.

II PRELIMINARY AND RELATED WORK

II-A Partially Observed Markov Decision Process

Our work can be described as a Partially Observed Markov Decision Process (POMDP) [7]. A POMDP is defined by the tuple $\langle\mathcal{S},\mathcal{A},\mathcal{O},N,\mathcal{P},\mathcal{R},\gamma\rangle$. $\mathcal{S}$ is the state space, including all possible conditions of the $N$ agents. $\mathcal{A}$ is the shared action space. $\mathcal{O}$ is the observation space of each agent. $\mathcal{P}$ is the state transition probability model. $\mathcal{R}$ is the shared reward function. $\gamma\in[0,1]$ is the reward discount factor. At each step, agent $i$ receives an observation $o_{i}=\mathcal{O}(s_{i})$, where $s_{i}\in\mathcal{S}$ is its local state, takes an action $a_{i}$ according to the shared policy $\pi_{\theta}(a_{i}|o_{i})$ with shared parameters $\theta$, transitions to the next state $s_{i}'$ according to the transition probability $p(s_{i}'|s_{i},a_{i})\in\mathcal{P}$, and finally receives a reward $r_{i}=\mathcal{R}(s_{i},a_{i})$. Considering homogeneous agents, each agent $i$ optimizes the discounted cumulative reward $R_{i}=\sum_{t}\gamma^{t}r_{i}^{t}$.
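As a concrete illustration of the discounted return above, the following minimal Python snippet computes $R_{i}$ for one agent's per-step reward sequence; the reward values are illustrative placeholders, not taken from our environments.

    def discounted_return(rewards, gamma=0.95):
        """Compute R_i = sum_t gamma^t * r_i^t for one agent's reward sequence."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # Illustrative sparse reward sequence: a single non-zero reward at step 3.
    print(discounted_return([0.0, 0.0, 0.0, 10.0]))  # 10 * 0.95**3 ≈ 8.57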

II-B Attacks on Policies

Generally speaking, there are two types of strategies to attack decision-making policies: (1) Attacking-on-States: directly tampering with the state sequence of the target policy. These methods [8, 9] impose small disturbances on the state observations to make the policy output wrong decisions or get a smaller reward. (2) Attacking-by-Policy: learning one or more adversarial policies through interacting with the target policy. Gleave et al. [10] train an adversarial policy to compete with the under-test policy in a competitive game, and can discover flaws of the under-test policy.

We adopt the Attacking-by-Policy strategy in our work since it is more suitable for discovering vulnerabilities in autonomous driving policies. This is because the perturbations crafted by Attacking-on-States strategies might be too noisy to occur in practice. In contrast, the adversarial scenarios constructed by Attacking-by-Policy strategies could appear naturally, since the adversaries themselves are interactive agents with well-defined action spaces.

There exist some studies [11, 12, 13, 14] that adopt the Attack-by-Policy strategy to attack an autonomous driving policy. However, most of these studies aim at directly causing accidents [11, 12, 13], rather than discovering the vulnerability of the under-test policy. We will elaborate on the multi-aspect differences between our work and these studies in Section III-D.

III FRAMEWORK

This section describes our framework, which can efficiently discover AV-responsible hazardous scenarios that reveal the vulnerabilities of driving policies. First, Section III-A illustrates the overall framework and introduces the terminology. To explore hazardous scenarios efficiently, we employ MARL to train the adversarial agents; this MARL framework is introduced in Section III-B. At the heart of our framework lies a novel reward design, HAR, which ensures that the discovered scenarios are AV-responsible and reveal vulnerabilities of the under-test policy. This reward design is introduced in Section III-C. Finally, Section III-D gives an in-depth comparison between related studies and our framework.

Refer to caption
Figure 2: Overview of STARS framework. Other traffic participants are controlled by adversarial policies learned by MAPPO, and the HAR reward design enables STARS to discover diverse and valuable hazardous scenarios.

III-A Safety Test Framework by Finding AV-Responsible Scenarios

We illustrate the overall framework of the Safety Test framework by finding AV-Responsible Scenarios (STARS) in Fig. 2. We follow the terminology used in the adversarial attack literature: the under-test policy is defined as the “defender”, the adversarial vehicles trained by MARL are called “attackers”, and the other background vehicles are named “NPCs”. The defender, serving as the attack target, is controlled by a well-trained or well-designed autonomous driving policy. The purpose of the attackers is to explore vulnerabilities that can make the defender behave wrongly, e.g., crash or drive out of bounds. In addition, we introduce NPCs to better simulate real driving scenarios.

III-B Multi-Agent PPO

We adopt the Multi-Agent PPO (MAPPO) framework [15] with parameter sharing. MAPPO extends Proximal Policy Optimization (PPO) [16] into an on-policy Centralized Training Decentralized Execution (CTDE) reinforcement learning algorithm. We train an actor network $\pi_{\theta}$ parameterized by $\theta$ and a value network $V_{\phi}$ as the critic. Using the Generalized Advantage Estimation method [17], we obtain the advantage function $A_{i}^{(k)}$, and the actor and critic loss functions are given as

L(\theta)=\left[\frac{1}{Bn}\sum_{i=1}^{B}\sum_{k=1}^{n}\min\left(r_{\theta,i}^{(k)}A_{i}^{(k)},\ \mathrm{clip}\left(r_{\theta,i}^{(k)},1-\epsilon,1+\epsilon\right)A_{i}^{(k)}\right)\right]+\sigma\frac{1}{Bn}\sum_{i=1}^{B}\sum_{k=1}^{n}S\left[\pi_{\theta}\left(o_{i}^{(k)}\right)\right],

L(\phi)=\frac{1}{Bn}\sum_{i=1}^{B}\sum_{k=1}^{n}\max\left[\left(V_{\phi}\left(s_{i}^{(k)}\right)-\hat{R}_{i}\right)^{2},\ \left(\mathrm{clip}\left(V_{\phi}\left(s_{i}^{(k)}\right),V_{\phi_{old}}\left(s_{i}^{(k)}\right)-\varepsilon,V_{\phi_{old}}\left(s_{i}^{(k)}\right)+\varepsilon\right)-\hat{R}_{i}\right)^{2}\right],

where $B$ is the batch size, $n$ is the number of homogeneous agents, $S$ is the policy entropy, $\sigma$ is the entropy coefficient hyperparameter, $r_{\theta,i}^{(k)}$ is the probability ratio between the current and the old policy, and $\hat{R}_{i}$ is the discounted return.
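For readers who prefer code, the snippet below is a minimal PyTorch-style sketch of the clipped actor objective and clipped value loss above, operating on rollout tensors that are already flattened to length $Bn$; it is our own simplified illustration, not the MAPPO authors' implementation.

    import torch

    def mappo_losses(log_probs, old_log_probs, advantages, values, old_values,
                     returns, entropy, eps=0.2, sigma=0.01):
        """Clipped PPO-style actor loss and clipped value loss over B*n flattened samples."""
        # Probability ratio r_theta = pi_theta(a|o) / pi_theta_old(a|o).
        ratio = torch.exp(log_probs - old_log_probs)
        # Clipped surrogate objective plus entropy bonus (to be maximized).
        surrogate = torch.min(ratio * advantages,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
        actor_objective = surrogate.mean() + sigma * entropy.mean()
        # Clipped value loss: take the elementwise max of clipped and unclipped errors.
        values_clipped = old_values + torch.clamp(values - old_values, -eps, eps)
        value_loss = torch.max((values - returns) ** 2,
                               (values_clipped - returns) ** 2).mean()
        return -actor_objective, value_loss  # negate the objective for gradient descent

    # Illustrative call with random tensors standing in for real rollout data.
    d = lambda: torch.randn(8)
    actor_loss, critic_loss = mappo_losses(d(), d(), d(), d(), d(), d(), entropy=torch.rand(8))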

In our training environment, all agents share the same observation and action spaces. The $i$-th agent's observation $o_{i}=[k_{0}^{i},k_{1}^{i},k_{2}^{i},k_{3}^{i},k_{4}^{i}]$ describes the states of the five closest vehicles, arranged by distance. Each $k=[p,x,y,v_{x},v_{y}]$ is a state vector, where $p$ indicates the presence of a vehicle, $x$ and $y$ are its coordinates, and $v_{x}$ and $v_{y}$ are its velocity components along the x and y axes. The discrete action $a$ takes one of five choices: accelerating, decelerating, changing to the left lane, changing to the right lane, and doing nothing.
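The snippet below sketches how an observation of this form could be assembled from raw vehicle states; the dictionary layout and helper names are hypothetical and only illustrate the $[p,x,y,v_{x},v_{y}]$ encoding and the padding used when fewer than five vehicles are present.

    import numpy as np

    # Discrete action set described above (index -> meaning); the names are illustrative.
    ACTIONS = ["accelerate", "decelerate", "lane_left", "lane_right", "idle"]

    def build_observation(ego, vehicles, k=5):
        """Stack the states of the k closest vehicles (by distance to the ego vehicle).
        Each row is [p, x, y, vx, vy], with p = 1 marking a present vehicle."""
        by_distance = sorted(vehicles, key=lambda v: (v["x"] - ego["x"]) ** 2 +
                                                     (v["y"] - ego["y"]) ** 2)
        rows = [[1.0, v["x"], v["y"], v["vx"], v["vy"]] for v in by_distance[:k]]
        while len(rows) < k:                        # pad with "absent" slots
            rows.append([0.0, 0.0, 0.0, 0.0, 0.0])
        return np.asarray(rows, dtype=np.float32)   # shape (k, 5)

    ego = {"x": 0.0, "y": 0.0, "vx": 25.0, "vy": 0.0}
    others = [{"x": 12.0, "y": 4.0, "vx": 22.0, "vy": 0.0}]
    obs = build_observation(ego, others)            # the last four rows are padding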

III-C Hazard Arbitration Reward

Vulnerabilities of the driving policy cause the vehicle to take improper actions, such as unsafe lane changes or risky overtaking, which lead to accident scenarios. However, some accident scenarios do not correspond to vulnerabilities in the under-test driving policy, since they are neither caused by nor avoidable by it, which means they are not valuable for improving the driving policy. An example of such a non-AV-responsible hazardous scenario is shown in Fig. 1 (Left), where the attacker suddenly makes a lane change and hits the under-test vehicle. In contrast, Fig. 1 (Right) shows a typical AV-responsible scenario, where the under-test policy should have decelerated but instead makes a lane change and thus causes a collision. To differentiate between AV-responsible and non-AV-responsible hazardous scenarios, we design a reward that penalizes trivial scenarios where the attackers are to blame. We name this reward design HAR, since it arbitrates the responsibility assignment of the accident scenes and provides behavior criteria for the attackers.

The design of HAR is as follows. HAR consists of two parts: the collision reward $R_{C}$ and the aggressiveness penalty $P_{agg}$. We give a sparse collision reward $R_{C}$ when the defender collides with NPC vehicles or attacker vehicles, since we expect the attackers to explore how to generate hazardous scenarios. We also give an aggressiveness penalty $P_{agg}$ to limit the behaviors of the attackers. Therefore, the HAR $R_{HAR}$ is

R_{C}=\begin{cases}\phi & \text{if the defender makes a collision}\\ 0 & \text{otherwise}\end{cases}
P_{agg}=\begin{cases}\rho & \text{if } \left|\dot{v}_{att}\right|>\lambda \text{ or } a_{att}=a_{lc} \text{ or } a_{att}=a_{rc}\\ 0 & \text{otherwise}\end{cases}
R_{HAR}=R_{C}+P_{agg},

where $\dot{v}_{att}$ is the acceleration of the attacker, $\lambda$ is an acceleration threshold, $a_{att}$ is the action of the attacker, and $a_{lc}$ and $a_{rc}$ are the left and right lane-change actions, respectively.
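A minimal sketch of HAR as defined above, for a single attacker at a single step; the acceleration threshold $\lambda$ is an assumed value, while $\phi=10$ and $\rho=-10.5$ follow Section IV-C.

    def hazard_arbitration_reward(defender_collided, attacker_accel, attacker_action,
                                  phi=10.0, rho=-10.5, accel_threshold=3.0):
        """HAR = R_C + P_agg for one attacker at one step (threshold value is assumed)."""
        # Sparse collision reward: only when the defender is involved in a collision.
        r_c = phi if defender_collided else 0.0
        # Aggressiveness penalty: hard acceleration/braking or any lane-change action.
        aggressive = (abs(attacker_accel) > accel_threshold
                      or attacker_action in ("lane_left", "lane_right"))
        p_agg = rho if aggressive else 0.0
        return r_c + p_agg

    # A gentle attacker near a defender-caused crash receives the full reward ...
    print(hazard_arbitration_reward(True, attacker_accel=0.5, attacker_action="idle"))       # 10.0
    # ... while an attacker that forces the crash by cutting in is net-penalized.
    print(hazard_arbitration_reward(True, attacker_accel=0.5, attacker_action="lane_left"))  # -0.5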

III-D Comparison Between STARS and Other Adversarial Attacks against Driving Policies

III-D1 Single-Attacker Entrypoint

There exist some studies that employ RL to learn one adversarial policy against the driving policy. Behzadan et al. [11] apply Deep Deterministic Policy Gradient (DDPG) [18] to train an attacker vehicle in an urban road environment, whose goal is to cause a direct collision with the target vehicle. Similarly, Feng et al. [12] also set direct collisions as the goal of the attacker. In [13], the authors put one Advantage Actor-Critic (A2C) attacker vehicle and one imitation learning-based defender vehicle on a one-lane one-way road. The defender is trained to keep a safe distance and avoid collision with the leading attacker. The goal of the attacker is to cause collisions, and the adversarial reward is the velocity of the defender divided by the distance between the two vehicles, limited to one hundred. This reward design is highly coupled with the simple car-following task and is hard to generalize to other environments or tasks.

Apart from this limited generalization ability, three other issues also limit these methods' effectiveness in discovering the vulnerabilities of driving policies.

  • Single-attacker entrypoint: It is hard for these methods to discover complex hazardous scenarios, since their attacking entrypoint is a single vehicle, which alone can hardly construct a complex hazardous scenario for the under-test policy.

  • Poor exploration: The reward design in the literature is tightly coupled to concrete hazardous scenarios such as direct collision, which limits the exploration of novel hazardous scenarios.

  • No responsibility assignment: The discovered scenarios are not necessarily valuable for improving the driving policy. A considerable part of the direct-collision scenarios do not correspond to vulnerabilities of the driving policy, since these hazardous scenarios are neither caused by nor avoidable by the under-test policy.

In contrast, our work adopts a more powerful threat model, in which multiple adversarial vehicles are trained with MAPPO to construct more complex hazardous scenarios. Also, we develop HAR, a general reward design that enables our framework to explore novel, complex, and valuable AV-responsible scenarios. In this way, STARS discovers diverse vulnerabilities of the under-test policy, instead of constructing only a few hazardous scenarios that are not necessarily valuable for improving the driving policy.

Refer to caption
(a) Highway
Refer to caption
(b) Merge
Refer to caption
(c) Roundabout
Figure 3: Demonstration of the three driving environments. From left to right are “Highway”, “Merge” and “Roundabout”. The red vehicle is the “defender”. The white vehicles are the “attackers”. The green vehicles are the “NPC” vehicles.

III-D2 Multi-Attacker Entrypoint

Another relevant study [14] uses multiple adversarial vehicles and designs a reward that encourages the attackers to reach their own goals as well as to attack the under-test policy. These two parts of the reward are named the driving reward and the adversarial reward, respectively. However, this work fails to discover complex and diverse failure scenarios, since the driving reward is too restrictive for the attackers to explore all interaction possibilities, and it also rules out many AV-responsible scenarios. Moreover, the authors only conduct experiments against simple rule-based policies in a simple environment. With such a restrictive reward design and under such a simple setting, they fail to demonstrate any collaborative behavior among multiple attackers.

In contrast, HAR is a more suitable reward design for policy vulnerability discovery and guides our framework to discover diverse and complex AV-responsible scenarios.

IV EXPERIMENT SETTING

IV-A Highway Environment Testbed

To evaluate the proposed method, this work chooses three typical driving environments, i.e., highway driving, merging, and roundabout driving. These three scenarios contain the most basic road elements, such as straight lanes, merging lanes, circular lanes, and crossroads. Meanwhile, the combination of these scenarios covers most safety-critical driving behaviors, such as lane changing, lane merging, overtaking, and crossroad driving [1]. Our testbed is based on the previous work [19], to which we add three roles, i.e., attacker, defender, and NPC. The details of the three scenarios are as follows:

IV-A1 “Highway” Scenario

The “Highway” scenario defines a four-lane one-way road of infinite length with several vehicles. All NPC vehicles and the defender are initialized randomly on the road. The NPC vehicles closest to the defender are turned into attackers at the start of each episode.

IV-A2 “Merge” Scenario

The “Merge” scenario defines a two-lane one-way main road with one downside lane merging in. One of the attackers is placed randomly on the downside lane. The other attackers, the NPC vehicles, and the defender are initialized randomly on the main road.

IV-A3 “Roundabout” Scenario

The “Roundabout” scenario defines a circular two-lane roundabout with two-lane entering straight roads from four directions. The defender is randomly placed on the downside straight road, heading towards the circular road. The attackers and NPC vehicles are randomly initialized on the circular road.
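Since the testbed builds on highway-env [19], the sketch below shows roughly how such a scenario might be configured; the configuration keys follow highway-env's interface as we understand it (and may differ across versions), and the attacker/defender/NPC role assignment is added by our modified testbed, not by the upstream library.

    import gym
    import highway_env  # registers environments such as "highway-v0", "merge-v0", "roundabout-v0"

    # Rough configuration sketch for the "Highway" scenario (four-lane one-way road).
    env = gym.make("highway-v0")
    env.configure({
        "lanes_count": 4,      # four-lane one-way road
        "vehicles_count": 20,  # randomly initialized background vehicles
    })
    obs = env.reset()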

IV-B Tested Defender Policy for Autonomous Driving

We evaluate four representative autonomous driving methods, which fall into three categories:

  • Planning-based algorithm: Value Iteration [20]

  • Safe-enhanced algorithm: Robust Value Iteration [21]

  • RL-based algorithms: Dueling Double Deep Q-Network [22] and Proximal Policy Optimization [16]

Value Iteration (VI) and Robust Value Iteration (RVI) are based on the finite MDP abstracted from the environment and use dynamic programming to optimize the Bellman equation and obtain the action for each state. The Dueling Double Deep Q-Network (D3QN) and Proximal Policy Optimization (PPO) defender models use the same observation and action spaces as the adversarial agents during training. Each model is trained in the three scenarios respectively, aiming at collision avoidance and maintaining high speed.
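To make the dynamic-programming defenders concrete, here is a generic value-iteration sketch on a toy finite MDP; the MDP abstraction actually used for VI and RVI in the testbed is not reproduced here.

    import numpy as np

    def value_iteration(P, R, gamma=0.95, tol=1e-6):
        """Generic value iteration on a finite MDP.
        P: transition tensor of shape (S, A, S); R: reward matrix of shape (S, A).
        Returns the optimal value function and a greedy policy."""
        V = np.zeros(P.shape[0])
        while True:
            Q = R + gamma * P @ V            # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q.argmax(axis=1)
            V = V_new

    # Toy 2-state, 2-action MDP, used only to exercise the routine.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    R = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    V, greedy_policy = value_iteration(P, R)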

IV-C Adversarial Settings for AV-Responsible Scenarios

We train MAPPO adversarial agents to explore the vulnerabilities of different driving policies. The reward function $\mathcal{R}$ of each attacker consists of two parts, the HAR $R_{HAR}$ and a distance reward $r_{d}$, i.e., $\mathcal{R}=R_{HAR}+r_{d}$. The HAR constrains the hazardous scenarios to encourage exploration of AV-responsible scenarios, with the parameters $\phi=10$ and $\rho=-10.5$. We also design a small distance reward $r_{d}$ to speed up training.
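Putting the pieces together, the snippet below sketches the total attacker reward $\mathcal{R}=R_{HAR}+r_{d}$, reusing the hazard_arbitration_reward sketch from Section III-C; the inverse-distance form and scale of $r_{d}$ are our assumptions, since the paper only states that $r_{d}$ is a small shaping term.

    def attacker_reward(defender_collided, attacker_accel, attacker_action,
                        dist_to_defender, phi=10.0, rho=-10.5, d_scale=0.1):
        """Total attacker reward R = R_HAR + r_d; the distance shaping term is an assumed form."""
        r_har = hazard_arbitration_reward(defender_collided, attacker_accel,
                                          attacker_action, phi=phi, rho=rho)
        r_d = d_scale / (1.0 + dist_to_defender)   # small bonus for staying close to the defender
        return r_har + r_d

    # Far from the defender and no crash: only a tiny shaping reward.
    print(attacker_reward(False, 0.0, "idle", dist_to_defender=40.0))  # ≈ 0.0024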

V RESULTS AND DISCUSSION

This section presents the experimental results of the STARS framework. First, we demonstrate and analyze the discovered AV-responsible scenarios in Section V-A. Then, Section V-B compares the results of using a single attacker versus multiple attackers. Finally, Section V-C discusses the robustness comparison of the four driving policies.

Refer to caption
Figure 4: Diverse AV-responsible scenarios of D3QN defender in Highway scenario
Refer to caption
Figure 5: Collaborative attack example of D3QN defender in Merge scenario
TABLE I: Attack success rates (standard deviation) against four defender policies across three environments
Attackers     Scenario     VI               RVI              PPO             D3QN
1 attacker    Highway      47.88% (0.232)   47.06% (0.143)   1.86% (0.035)   13.64% (0.115)
1 attacker    Merge        20.20% (0.422)   32.30% (0.818)   0.00% (0)       6.84% (0.018)
1 attacker    Roundabout   12.56% (0.091)   13.11% (0.093)   1.58% (0.018)   8.55% (0.076)
2 attackers   Highway      76.08% (0.214)   74.55% (0.630)   2.32% (0.060)   27.75% (0.076)
2 attackers   Merge        77.18% (0.922)   81.41% (0.270)   3.23% (0.130)   13.03% (0.016)
2 attackers   Roundabout   19.78% (0.122)   27.81% (0.168)   9.66% (0.079)   7.31% (0.133)
3 attackers   Highway      72.75% (0.232)   85.49% (0.110)   3.67% (0.042)   58.24% (0.178)
3 attackers   Merge        87.98% (0.414)   90.33% (0.092)   4.21% (0.026)   30.99% (0.016)
3 attackers   Roundabout   25.59% (0.034)   27.00% (0.294)   11.30% (0.084)  20.75% (0.225)

V-A The Discovered AV-Responsible Scenarios

STARS can discover AV-responsible hazardous scenarios that correspond to vulnerabilities of the under-test policy. Video demonstrations of the discovered scenarios are provided in the supplementary material. We discuss three key characteristics of the discovered scenarios as follows.

These scenarios are AV-responsible and correspond to vulnerabilities of the under-test policy. We show three example hazardous scenarios discovered by STARS in Fig. 4 and Fig. 5. In all of these accidents, the under-test policy (red car) makes improper decisions and causes the accident, while the other traffic participants do not exhibit any aggressive behavior. This indicates that the discovered hazardous scenarios are AV-responsible scenarios revealing vulnerabilities of the under-test policy, which should be improved to handle these scenarios.

STARS can discover diverse scenarios exhibiting various accident patterns, which distinguishes it from previous studies that conduct RL-based adversarial attacks on driving policies [11, 18, 12, 13, 14]. Most previous studies [11, 18, 12, 13] aim at obtaining direct-collision scenarios by designing direct-collision rewards. Consequently, their attackers can only learn to construct one type of direct-collision scenario. These scenarios are predictable before running the experiments and thus contribute only minor knowledge to the policy testing process.

In contrast, we do not embed any concrete hazardous scenario into the design of HAR, which enables the framework to explore and discover more diverse hazardous scenarios. Fig. 4 demonstrates two of the discovered hazardous scenarios in the Highway environment against the D3QN defender: 1) When the under-test vehicle is driving on the side lane, two attackers learn to form a siege by driving in front of and beside the under-test vehicle. The under-test D3QN policy then makes a risky decision to overtake between the two attackers and causes a crash. 2) When the under-test vehicle is driving on the side lane following another vehicle (possibly an NPC vehicle), two attackers can choose to occupy the adjacent lane. In this scenario, the under-test policy fails to slow down and collides with the vehicle ahead. Note that these two scenarios are constructed by the same attacker policy learned in a single experiment run, and the attackers construct different scenarios according to different initializations of the vehicle positions. This demonstrates the ability of STARS to automatically discover diverse hazardous scenarios.

Multiple attackers exhibit collaborative behaviors to construct complex hazardous scenarios that make the under-test policy misbehave. Our application of MAPPO and the design of HAR together enable the attackers to construct complex scenarios collaboratively. In other words, multiple attackers learn to collaborate and construct hazardous scenarios that cannot be constructed by a single attacker.

Fig. 5 shows an example of the collaborative behavior of the attackers in the Merge environment. After initialization, the attacker on the main road keeps a suitable distance in front of the defender, while the attacker on the downside merging lane adjusts its velocity and merges in at a suitable time. This scenario induces the under-test policy to make a lane change. At this moment, if another vehicle is driving in the adjacent lane of the under-test vehicle and falls slightly behind, the defender fails to predict the collision and collides with it. This scenario requires several attackers to appear at the right positions and take the right actions at the right time simultaneously, and it is extremely hard for a single attacker to trigger.

V-B The Power of Using Multiple Attackers

To demonstrate the power of using multiple attackers, we compare the attack success rates of a single attacker and of multiple attackers in Tab. I. In the single-attacker experiments, one of the attackers is changed into a non-adversarial NPC vehicle. The results show that the attack success rates with two or three attackers significantly outperform those with a single attacker.
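For completeness, a small sketch of how the "rate (standard deviation)" entries of Tab. I could be computed from per-seed evaluation outcomes; the outcome lists below are illustrative only, not our experimental data.

    import numpy as np

    def attack_success_rate(outcomes):
        """Fraction of evaluation episodes ending in an AV-responsible collision."""
        return float(np.mean(outcomes))

    # Mean and standard deviation across independent training seeds, matching the
    # "rate (std)" format of Tab. I.
    per_seed = [attack_success_rate(o) for o in ([1, 1, 0, 1], [1, 0, 1, 1], [1, 1, 1, 0])]
    print(f"{np.mean(per_seed):.2%} ({np.std(per_seed):.3f})")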

Refer to caption
Figure 6: Training curves of HAR in the Merge scenario

V-C Robustness Comparison of Policies

We compare the robustness of the four driving policies according to the attack success rates and the training curves. The attack success rate represents the occurrence rate of AV-responsible scenarios, which indicates the risk level of the under-test policy. The training curves of HAR in Fig. 6 reflect the complexity and difficulty of the attackers' exploration.

The hazardous scenarios exposing the vulnerabilities of the dynamic programming-based policies (VI and RVI) are easy to discover and highly reproducible. Limited by their finite estimation of infinitely many state-action pairs, many of their lane-change decisions are risky. Thus, the attackers only need to drive close to the defender vehicle on the adjacent lanes to trigger the hazardous scenarios. Therefore, even a single attacker can successfully attack the VI and RVI driving policies with high success rates.

Overall, the RL-based driving policies, PPO and D3QN, perform better than VI and RVI: attacks against them have lower success rates in all environments. Discovering the vulnerabilities of the D3QN policy requires sophisticated collaboration between attackers, as shown in Fig. 5. The PPO driving policy's vulnerabilities are even harder to discover, such that a single attacker fails to discover any vulnerability.

VI CONCLUSION

Discovering hazardous scenarios of an autonomous driving policy is a key step in its iterative development process. In this paper, we propose the STARS framework with a novel HAR design. Using MAPPO to adversarially control traffic participants and the design of HAR are the key components of our framework. They enable it to discover diverse, complex, and AV-responsible hazardous scenarios that are valuable for further improving the under-test driving policy. Our framework is general w.r.t. different target driving policies and driving environments. Experimental results in three typical driving environments (Highway, Merge, and Roundabout) against four driving policies (VI, RVI, D3QN, and PPO) demonstrate the effectiveness of our framework. Ablation studies show that our framework indeed takes advantage of the collaborative behaviors of the adversarial vehicles to construct complex hazardous scenarios. We also analyze the AV-responsible property and the diversity of the discovered scenarios. Moreover, STARS can be used to compare the robustness of driving policies.

We hope that our work can inspire further research towards a pressure-test platform for autonomous driving policies. Some interesting future directions include, but are not limited to, the following. First, our reward design can be extended to incorporate more common-sense or traffic rules [23, 24] according to actual needs. Second, considering the imperfection of the perception and control modules during vulnerability discovery is worth investigating. Last but not least, combining STARS and real-world testing (i.e., naturalistic driving) is an interesting direction. On the one hand, STARS can efficiently augment the hazardous scenarios collected by real-world testing by using those scenarios as initialization states of the search process. On the other hand, one can efficiently construct and test the hazardous scenarios discovered by STARS in the real world, saving thousands of miles of naturalistic driving.

ACKNOWLEDGMENT

The authors gratefully acknowledge the support from TOYOTA. This work was also supported by Beijing National Research Center for Information Science and Technology (BNRist).

References

  • [1] N. Kalra and S. M. Paddock, “Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?” Transportation Research Part A: Policy and Practice, vol. 94, pp. 182–193, 2016.
  • [2] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang, “The apolloscape dataset for autonomous driving,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
  • [3] X. Huang, P. Wang, X. Cheng, D. Zhou, Q. Geng, and R. Yang, “The apolloscape open dataset for autonomous driving and its application,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2702–2719, 2019.
  • [4] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla: An open urban driving simulator,” in Conference on robot learning.   PMLR, 2017, pp. 1–16.
  • [5] M. Zhou, J. Luo, J. Villella, Y. Yang, D. Rusu, J. Miao, W. Zhang, M. Alban, I. Fadakar, Z. Chen et al., “Smarts: Scalable multi-agent reinforcement learning training school for autonomous driving,” arXiv preprint arXiv:2010.09776, 2020.
  • [6] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and service robotics.   Springer, 2018, pp. 621–635.
  • [7] F. A. Oliehoek and C. Amato, “A concise introduction to decentralized pomdps,” 2015.
  • [8] Y.-C. Lin, Z.-W. Hong, Y.-H. Liao, M.-L. Shih, M.-Y. Liu, and M. Sun, “Tactics of adversarial attack on deep reinforcement learning agents,” arXiv preprint arXiv:1703.06748, 2017.
  • [9] B. Lütjens, M. Everett, and J. P. How, “Certified adversarial robustness for deep reinforcement learning,” in Conference on Robot Learning.   PMLR, 2020, pp. 1328–1337.
  • [10] A. Gleave, M. Dennis, C. Wild, N. Kant, S. Levine, and S. Russell, “Adversarial policies: Attacking deep reinforcement learning,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=HJgEMpVFwB
  • [11] V. Behzadan and A. Munir, “Adversarial reinforcement learning framework for benchmarking collision avoidance mechanisms in autonomous vehicles,” IEEE Intelligent Transportation Systems Magazine, 2019.
  • [12] S. Feng, X. Yan, H. Sun, Y. Feng, and H. X. Liu, “Intelligent driving intelligence test for autonomous vehicles with naturalistic and adversarial environment,” Nature communications, vol. 12, no. 1, pp. 1–14, 2021.
  • [13] S. Kuutti, S. Fallah, and R. Bowden, “Training adversarial agents to exploit weaknesses in deep control policies,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 108–114.
  • [14] A. Wachi, “Failure-scenario maker for rule-based agent using multi-agent adversarial reinforcement learning and its application to autonomous driving,” in IJCAI, 2019, pp. 6006–6012. [Online]. Available: https://doi.org/10.24963/ijcai.2019/832
  • [15] C. Yu, A. Velu, E. Vinitsky, Y. Wang, A. Bayen, and Y. Wu, “The surprising effectiveness of mappo in cooperative, multi-agent games,” arXiv preprint arXiv:2103.01955, 2021.
  • [16] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [17] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.
  • [18] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
  • [19] E. Leurent, “An environment for autonomous driving decision-making,” https://github.com/eleurent/highway-env, 2018.
  • [20] R. Bellman, “Dynamic programming,” Science, vol. 153, no. 3731, pp. 34–37, 1966.
  • [21] A. Nilim and L. El Ghaoui, “Robust markov decision processes with uncertain transition matrices,” Ph.D. dissertation, University of California, Berkeley, 2004.
  • [22] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures for deep reinforcement learning,” in International conference on machine learning.   PMLR, 2016, pp. 1995–2003.
  • [23] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “On a formal model of safe and scalable self-driving cars,” arXiv preprint arXiv:1708.06374, 2017.
  • [24] Z. Cao, S. Xu, S. Zhang, H. Peng, and D. Yang, “Driving-policy adaptive safeguard for autonomous vehicles using reinforcement learning,” arXiv preprint arXiv:2012.01010, 2020.