Social Coordination and Altruism
in Autonomous Driving

Behrad Toghi^∗, Rodolfo Valiente^∗, Dorsa Sadigh, Ramtin Pedarsani, Yaser P. Fallah ^∗Authors B. Toghi and R. Valiente contributed equally.Behrad Toghi, Rodolfo Valiente, and Yaser P. Fallah are with the Department of Electrical and Computer Engineering, University of Central Florida. [email protected]Dorsa Sadigh is with the Department of Electrical Engineering and the Department of Computer Science, Stanford University.Ramtin Pedarsani is with the Department of Electrical and Computer Engineering, UC Santa Barbara.This material is based upon work partially supported by the National Science Foundation under Grant No. CNS-1932037.

Abstract

Despite the advances in the autonomous driving domain, autonomous vehicles (AVs) are still inefficient and limited in terms of cooperating with each other or coordinating with vehicles operated by humans. A group of autonomous and human-driven vehicles (HVs) which work together to optimize an altruistic social utility can co-exist seamlessly and assure safety and efficiency on the road. Achieving this mission without explicit coordination among agents is challenging, mainly due to the difficulty of predicting the behavior of humans with heterogeneous preferences in mixed-autonomy environments. Formally, we model an AV’s maneuver planning in mixed-autonomy traffic as a partially-observable stochastic game and attempt to derive optimal policies that lead to socially-desirable outcomes using a multi-agent reinforcement learning framework (MARL), and propose a semi-sequential multi-agent training and policy dissemination algorithm for our MARL problem. We introduce a quantitative representation of the AVs’ social preferences and design a distributed reward structure that induces altruism into their decision-making process. Altruistic AVs are able to form alliances, guide the traffic, and affect the behavior of the HVs to handle competitive driving scenarios. We compare egoistic AVs to our altruistic autonomous agents in a highway merging setting and demonstrate the emerging behaviors that lead to improvement in the number of successful merges and the overall traffic flow and safety.

Index Terms:

Cooperative Driving, Social Navigation, Mixed-autonomy Traffic, Multi-agent Reinforcement Learning

I Introduction

Refer to caption — Figure 1: (a) AV-HV interaction to benefit another HV: Altruistic agents have the opportunity to form alliances and guide the behavior of HVs in order to improve the traffic flow and avoid hazardous situations. AV1 & AV2 can build a formation to slow down HV2 and open up a pathway for HV1, enabling it to trust the AVs, change lanes, and navigate towards the exit ramp. (b) AV-AV interaction to benefit another HV: HV1 is intended to merge into the highway. Egoistic AVs ignore the merging vehicle and do not open up space for it which can potentially lead to hazardous scenarios, whereas if they show sympathy for the merging HV, they can compromise on their own interest in order to create a safe path for HV1 to merge into the highway. (c) AV-AV interaction to benefit another AV: AV1 attempts to exit the highway. If AV2-AV5 act egoistically, AV1 might miss the exit and not be able to follow its planned mission. However, if AV2-AV5 take into account the interest of AV1 and act altruistically, they can open up space in the platoon, by AV2 & AV3 decelerating and AV4 & AV5 accelerating, to enable a safe exit for AV1.

Connected and automated vehicles (CAVs) pursue a mission to enhance driving safety and reliability by bringing automation and intelligence into vehicles, which lessens the inherent human limitations such as range of vision, reaction time, and distraction. Adding the communication component to intelligent vehicles further improves their ability to perceive their surroundings and creates an opportunity for mass coordination and cooperative decision-making. This inter-agent coordination is particularly important as the full potential of CAVs does not lie in operating a single vehicle on an empty road but rather from their seamless co-existence with other autonomous and human-driven vehicles (HVs). Hence, we narrow the focus of this work to studying the decision-making problem in the presence of multiple autonomous agents and human drivers, i.e. a mixed-autonomy multi-agent environment.

Leveraging vehicle-to-vehicle (V2V) communication, decision-making in a purely-autonomous environment can be simplified into a centralized control problem with essentially one agent. However, the presence of HVs makes the inter-agent coordination more challenging as they cannot explicitly communicate to coordinate with AVs in real-time. In order to make safe and socially-desirable decisions in the presence of humans, current solutions on social navigation for AVs mainly rely on learned or hand-coded models that predict the behavior of human drivers [1, 2]. We identify two key shortcomings in the existing schemes. First, the fidelity of the human models that are derived in the absence of autonomous agents is questionable in mixed-autonomy settings as human drivers tend to act differently when around AVs [3]. Second, single-agent solutions do not fully exploit the potential of CAVs in constituting a mass intelligence, forming alliances, and performing coordinated multi-agent maneuvers.

We study the mixed-autonomy decision-making problem from a multi-agent point of view, as opposed to the previous individual perspectives. Our key insight is that incentivizing AVs on adopting an altruistic behavior and accounting for the interest of other vehicles, allows them to see the big picture and find solutions that are optimal for the group in a longer term. In addition to the potential safety and efficiency benefits of altruistic decision-making, altruism leads to circumstances where no vehicle has superiority over the others, creating more societally beneficial outcomes [4]. To elaborate, Figure 1(a) shows that a group of AVs can guide the behavior of human drivers to improve safety and efficiency, Figures 1(b) and 1(c) illustrate examples of how AVs can work together to achieve a social goal that benefits another HV or AV.

We focus our work on inherently competitive driving scenarios, such as the examples illustrated in Figure 1, where safe and efficient traffic flow necessarily requires coordination among autonomous agents and egoistic behavior most likely compromises on traffic safety and efficiency. We build on our prior work in [5, 6] and proposed a novel semi-sequential multi-agent training and policy dissemination algorithm to alleviate the non-stationary problem. Additionally, we use a method for scoring the entries in the experience replay buffer that improves sample efficiency and speeds up the learning process. Furthermore, we emphasize the importance of finding the optimal social value orientation and in contrast to the other works, formulate it as a convex optimization problem. We formalize the mixed-autonomy driving problem as a partially observable stochastic game (POSG) and derive optimal policies using deep multi-agent reinforcement learning (MARL). With our solution, altruistic autonomous agents not only learn to drive safely but also master the inter-agent coordination and social navigation. Our main contributions are as follows:

•

We propose a MARL framework to train altruistic agents using a decentralized social reward signal. These agents are able to drive safely on the highway and coordinate with each other in the presence of human drivers.
•

We proposed a novel semi-sequential multi-agent training and policy dissemination algorithm for our MARL problem and utilized a network architecture that allows our agents to implicitly learn from experience, without the need for an explicit behavioral model of human drivers.
•

In contrast with the existing solutions, we formulate the problem of finding the optimal social value orientation angle as a convex optimization objective. We show that an optimal value for the level of altruism exists and when chosen properly between being absolutely selfless or selfish, despite some agents’ compromise on their local utility, the overall traffic safety and flow improve for the group of vehicles.

II Related Work

This section presents a short literature review on the main topics that are closely related to our problem, namely core MARL solutions, cooperative algorithms, human behavior modeling, and navigation in the presence of humans.

Multi-agent Reinforcement Learning. Early solutions for multi-agent value-learning algorithms assume independently trained agents and are proved to perform poorly [7]. To alleviate this problem, a learning rule is presented by Foerster et al. that relies on an additional term to take into account the effect of other agents’ evolution during the training. They have also attempted to leverage a multi-agent derivation of importance sampling and removing outdated samples from the experience replay buffer [8] to make it effective for multi-agent settings. Xie et al. employ latent representations of partner strategies to address this problem and enable a more scalable partner modeling [9]. Shih et al. further consider the effects of repeated interactions on partner modeling and develop a modular approach that separates rule-dependent representations from partner-dependent conventions [10].

Foerster et al. proposed the counterfactual multi-agent (COMA) algorithm that is expected to address the credit assignment problem in multi-agent environments [11]. COMA algorithm utilizes the set of joint actions of all agents as well as the full state of the world during the training. In contrast, we assume partial observability and a decentralized reward function during both training and execution. More application-oriented related works, include the centralized multi-agent solutions proposed by Gupta et al. [12]. More recently, Wang et al. proposed a gifting approach that enables the emergence of prosocial behaviors in general-sum coordination games [13]. Importantly, in contrast with our approach, the existing literature on multi-agent systems relies on assumptions on the social preference of agents [14, 15].

Human Behavior Modeling. Driving styles of human drivers can be learned either from demonstration through inverse RL, as proposed by Kuderer et al. , or employing statistical models such as Gaussian and Dirichlet processes [16, 17]. Kuefler et al. adopt a novel approach and apply generative adversarial networks to imitate the behavior of a human driver [18]. Schmerling et al. study the scenarios with inherent multimodal uncertainty, such as our driving, and leverage conditional variational autoencoders (CVAEs) to condition the policy on the present interaction history [19]. Recent data-driven approaches have shown achievements in classifying human driving maneuvers [20], and predicting human trajectories to enable fully-autonomous navigation of a robot in human-dense environments [21]. In contrast with works in the broad literature on human behavior modeling that take a game-theoretic or optimization-based approach, we rely on implicitly learning from interaction data within our MARL platform.

Social Navigation. Alahi et al. introduced the Social LSTM framework which leverages recurrent neural networks to extract the temporal information from the trajectory of pedestrians in large crowds [1]. Tsoi et al. present their high-fidelity simulation platform, SEAN, to accelerate the research on social robot navigation [22]. Vazquez et al. study the social interactions in a human-robot role-playing game and expand their observations to studying spatial behavior of a group of robots. More recent works in social navigation have revealed the potential for collaborative planning and interaction with humans. Examples include but are not limited to works by Trautman et al. and Nikolaidis et al. where a mutual reward function is optimized in order to enable joint trajectory planning for humans and robots [23, 24].

Mixed-autonomy Traffic Networks. Lazar et al. take a more abstract and traffic-level perspective to study the emergent behaviors in mixed-autonomy environments using model-free RL solutions [25]. Wu et al. explore the idea of stabilizing the traffic flow that is guided by autonomous vehicles as well as the emergent behaviors in a mixed AV-HV setting [26, 27]. Vinitsky et al. present a benchmark for traffic control based on RL in mixed-autonomy traffic [28]. Biyik et al. formalize the effects of altruistic driving in mixed-autonomy at a road-level and present a formal model of road congestion that can be used for optimal routing in road networks [29].

III Preliminaries

In this section, we provide the preliminary concepts that are essential in the following section and introduce our formal notation.

Partially-observable Stochastic Games. Decision-making process in a finite set of autonomous agents $\mathcal{I}$ with partial observability in stochastic environments can be formalized as a partially-observable stochastic game (POSG) defined by the tuple $\mathcal{M}_{\text{G}}\coloneqq(\mathcal{I},\mathcal{S},[\mathcal{A}_{i}],[\mathcal{O}_{i}],T,[R_{i}])$ for $i=1,...,N$ . At a given time, each agent receives a local observation $\textbf{o}_{i}:\mathcal{S}\rightarrow\mathcal{O}_{i}$ that is correlated with the underlying state of the environment $s\in\mathcal{S}$ and takes an action from the action space $a\in\mathcal{A}$ . Consequently, the environment evolves to a new state $s^{\prime}_{i}$ with probability $T=\Pr(s^{\prime}|s,a):\mathcal{S}\times\mathcal{A}_{1}\times...\times\mathcal{A}_{N}\rightarrow\mathcal{S}$ and the agent receives a decentralized reward $R_{i}:\mathcal{S}\times\mathcal{A}_{i}\rightarrow\mathbb{R}$ . The probability distribution over actions at a given state is known as the stochastic policy $\pi_{i}:\mathcal{O}_{i}\times\mathcal{A}_{i}\rightarrow[0,1]$ . The goal is to derive a distribution that maximizes the discounted sum of future rewards over an infinite time horizon, i.e., an optimal policy $\pi^{*}:\mathcal{S}\to\mathcal{A}$ ,

\pi^{*}\coloneqq\underset{\pi}{\arg\max}\;\mathbb{E}\Big{(}\sum_{i=0}^{\infty}\gamma^{i}R\big{(}s_{i},\pi(s_{i})\big{)}\Big{)}

(1)

in which, $\gamma\in[0,1)$ is the discount factor. The optimal policy maximizes the state-action value function, i.e., $\pi^{*}(s)=\arg\max_{a}Q^{*}(s,a)$ , where

Q^{\pi}(s,a)\coloneqq\mathbb{E}\Big{(}\sum_{i=1}^{\infty}\gamma^{i}R\big{(}s_{i},\pi(s_{i})\big{)}|s_{0}=s,a_{0}=a\Big{)}

(2)

and the optimal state-action value function can then be derived using the Bellman optimality equation,

Q^{*}(s,a)=\mathbb{E}_{s^{\prime}\sim P(.|s,a)}\Big{(}R(s,a)+\max_{a^{\prime}}\gamma Q^{*}(s^{\prime},a^{\prime})\Big{)}

(3)

Solving POSGs with Unknown Dynamics. Dynamics of the environment and reward function are usually stochastic and not fully-known in real-world problems. Reinforcement learning (RL) provides a possibility to solve POSGs with unknown reward and state transition functions through continuous interaction with the environment. RL algorithms such as off-policy temporal difference learning enable agents to update the value function from such interactions with the environment,

Q_{i+1}(s,a)-Q_{i}(s,a)=\\ \alpha_{i}\Big{(}R\big{(}s,\pi(s)\big{)}+\gamma\max_{a^{\prime}}Q_{i}(s^{\prime},a^{\prime})-Q_{i}(s,a)\Big{)},

(4)

where $\alpha_{i}$ is the learning rate at the $i$ th iteration.

Deep Q-networks. Parameterizing the state-action value function using a function approximator, i.e., $\tilde{Q}(.;\textbf{w})\cong Q(.)$ , results in more generalizable policies that can scale to larger state-spaces. Parameters w can be learned through mini-batch gradient descent steps,

\textbf{w}_{i+1}=\textbf{w}_{i}+\alpha_{i}\hat{\nabla}_{\textbf{w}}\mathcal{L}(\textbf{w}_{i})\\

(5)

where, the $\hat{\nabla}_{\textbf{w}}$ operator estimates the gradient at $\textbf{w}_{i}$ . Deep neural networks are widely used as function approximators and are also applicable to the Q-learning algorithm [30]. A deep Q-network (DQN) builds up on two major ideas, namely using two separate networks during training and employing an experience replay buffer to decorrelate the training samples. The former is done to stabilize the training process by updating the greedy network at each training iteration to compute the optimal Q-value and using another less-frequently updated target network. The loss function in Eq. (5) can be written as

\mathcal{L}(\textbf{w}_{i})=\mathbb{E}\Big{(}R+\gamma\underset{a^{\prime}}{\max}\widetilde{Q}^{*}(s^{\prime},a^{\prime};\hat{\textbf{w}})-\widetilde{Q}^{*}(s,a;\textbf{w})\Big{)}^{2}

(6)

where $\hat{\textbf{w}}$ is the target network which periodically gets updated during the training. Additionally, DQN algorithm draws batches of training data $(s,a,R,s^{\prime})$ from an experience replay buffer in order to decorrelate the training samples in Eq. (5) that are generated from simulation or real-world experience and thus naturally have temporal dependencies. This process is challenging in MARL since, $\Pr(s^{\prime}|s,a,\pi_{1},...,\pi_{n})\neq\Pr(s^{\prime}|s,a,\pi^{\prime}_{1},...,\pi^{\prime}_{n})$ if any $\pi_{i}\neq\pi^{\prime}_{i}$ . In other words, the environment becomes non-stationary when multiple agents are evolving concurrently. We will further discuss this issue and provide a solution to stabilize the multi-agent learning process in Section V-D.

V2V Networks. We are interested in a multi-agent setting where agents have no information about others’ actions and cannot explicitly coordinate. Instead, the decentralized coordination among agents is expected to arise from the social reward signal. We extend the earlier introduced concepts to a coordinated POSG defined as $(\mathcal{I},\mathcal{S},[\mathcal{A}_{i}],[\widetilde{\mathcal{O}}_{i}],T,[R_{i}],\mathcal{G})$ , where $\mathcal{G}=(\mathcal{I},\mathcal{E})$ is a stochastic, time-varying, undirected graph that encompasses the V2V communication among agents in the environment $\mathcal{E}$ . The communicated information can be as simple as kinematics information, e.g., speed, location, heading, or more bandwidth-intensive forms of sensory data, e.g., camera and LiDAR. Leveraging this shared situational awareness, agents can extend their range of perception and overcome obstacles and line-of-sight visibility limitations [31, 32]. An agent’s local observation $\tilde{\textbf{o}}_{i}\in\widetilde{\mathcal{O}}_{i}$ is created using the shared situational awareness and clearly depends on $\mathcal{G}$ which incorporates the flow of information among agents. We utilize the network analysis from [33] to model the V2V communication in a high-density highway.

IV Problem Statement

We investigate the maneuver-level decision-making problem for AVs to explore behaviors that can lead to socially-desirable outcomes. We are interested in the question of how autonomous agents can be trained from scratch to perform an individual task such as driving safely on a road, while considering the social aspects of their mission, i.e., optimizing for a social utility that also accounts for the interest of other vehicles around them. Figure 1 helps us to build more intuition on the topic by depicting instances of driving scenarios in which altruism leads to socially-valuable outcomes and clearly overcomes the limitations of egoistic and single-agent planning. Each example in Figure 1 provides an example on altruistic inter-agent coordination settings that can benefit both HVs and AVs. It is clear that in some instances, altruistic AVs have to compromise on their individual utility, e.g., by slowing down, in order to increase the group’s overall utility. The balance between an AV’s selflessness and selfishness is the key to reaching efficient and safe traffic flow. In [5, 6] we show that tuning the level of altruism in AVs leads to different emerging behaviors and affects the traffic flow and driving safety. In this work, we further explore that finding and formulate the problem as a convex optimization objective, to obtain an optimal social value orientation angle. Thus, we continue this section by providing a quantitative representation of an agent’s level of altruism and formally defining our case study scenario, before presenting our proposed solution in the next section.

IV-A Quantifying Social Value Orientation

In order to formally study the social dilemmas between humans and autonomous agents in heterogeneous environments, it is crucial to quantify the social preference of an individual, e.g., whether if they will defect or cooperate in a given situation such as opening a gap in our highway merging example. The degree of an agent’s egoism or altruism with regards to its counterparts is defined as Social Value Orientation (SVO), a widely used notion in the social psychology literature which has been recently adopted in robotics research. Specifically, we borrow the angular annotation for SVO as defined by Liebrand et al. [34]. The SVO angular preference $\phi$ , quantifies how an agent weights its own reward against the reward of others. An agent’s total utility $R_{i}$ can then be written as,

R_{i}=r_{i}\cos\phi_{i}+r^{-}_{i}\sin\phi_{i}

(7)

where $r_{i}$ is the agent’s individual utility, $r^{-}_{i}$ is the total utility of other agents from the perspective of the $i$ th agent which in general is a function $f(.)$ of their individual utilities,

r^{-}_{i}=f(r_{j}),\quad\text{where }j\neq i

(8)

Autonomous agents require an understanding of human drivers’ social preferences and their willingness to coordinate. However, it is well-established in the behavioral decision theory that humans are heterogeneous in SVO and thus their preference is rather ambiguous and unclear [36]. Current works on social navigation for AVs often make restrictive assumptions on human drivers’ social preference and compliance [2], whereas Figure 2 indicates an spectrum of altruism among humans with heterogeneous social value orientations. Thus, due to the large spectrum of altruistic behavior observed by humans, our insight is to rely on autonomous cars instead to guide the overall system toward more socially desirable objectives. Specifically, we plan to find policies for AVs that improve the utility of the group as a whole through emerging alliances and more importantly, affecting the behavior of human drivers. In our particular driving example, the desired social outcome is achieving seamless and safe highway merging while maximizing the distance traveled by all vehicles and avoiding collisions.

IV-B Formalism

We choose a highway merging scenario with a mixed group of AVs and HVs as our base experiment scenario, as illustrated in Figure 3. A merging vehicle, which can be either HV or AV, approaches the highway on the merging ramp and faces a mixed platoon of vehicles that are cruising on the highway. This configuration contains a group of AVs that hold the same SVO, as well as a group of HVs which are heterogeneous in their SVO and hence it is unclear if they are allies or foes. In this settings, it is obvious that the individual interest of the merging vehicle, i.e., seamless merging into the highway, does not align with that of the cruising vehicles, i.e., cruising with optimal speed and energy consumption. We design our case study scenario in a way that safe and seamless merging necessarily requires all AVs to work together and none of them alone can enable the merging of the mission vehicle without the cooperation of the others. Formally, the road section shown in Figure 3 is shared by a set of AVs $\mathcal{I}$ that are connected together via V2V communication and governed by a decentralized stochastic policy, a set of HVs $\mathcal{V}$ operated by humans with heterogeneous and unknown SVOs, and a human-driven or autonomous mission vehicle $M\in\mathcal{I}\cup\mathcal{V}$ that attempts to merge into the highway.

A human driver’s perception is often limited by their range of vision, occlusion, and obstacles. In contrast, CAVs share their observations to overcome these limitations. Each CAVs has a unique local observation $\tilde{\textbf{o}}_{i}([\textbf{o}_{i}];\mathcal{G})$ that is constructed using the its own local observation, as well as the local observations it receives from the neighboring CAVs. As mentioned before, graph $\mathcal{G}$ grasps this inter-agent communication. Therefore, an observer AV can detect a subset of other AVs, $\widetilde{\mathcal{I}}\subset\mathcal{I}$ , and a subset of HVs $\widetilde{\mathcal{V}}\subset\mathcal{V}$ . As we elaborated before, our aim is to find a decentralized control scheme that can induce altruism in the behavior of AVs. Hence, each AV must use its local observation $\textbf{o}_{i}$ to make independent decisions that optimize its utility. The value of the agent’s altruism, i.e., the SVO angular phase $\phi$ , determines the social implications of an agent’s local actions. To summarize, we state our problem as deriving a utility function that enables the AVs to handle competitive driving scenarios, such as those illustrated in Figure 1, and lead them into socially-desirable outcomes that improve traffic safety and efficiency for the group of vehicles.

V Sympathetic Cooperative Driving Framework

In their recent work, Silver et al. explained how artificial intelligence agents can learn complex tasks through experience and maximizing a generic reward function, rather than requiring task-specific specialized problem formulations [37]. Inspired by this approach to solving decision-making problems, rather than breaking down our problem into learning how to drive and learning social coordination, we train our autonomous agents from scratch using a decentralized reward structure and expect them to master the basics of highway driving, e.g., avoiding collisions and unnecessary lane change or acceleration, while learning inter-agent coordination to eventually achieve the goal of enabling a safe and seamless merging. To reiterate on our goal, we seek a decentralized solution that enables the autonomous agents to make independent socially-desirable decisions, with no explicit coordination or sharing of their decisions and future actions. In the rest of this section, we define the action and observation space in the POSG framework of Section III and introduce the notions of sympathy and cooperation that are essential for structuring the reward function.

V-A Action and Observation Spaces

We employ a numeric representation for an agent’s observation that embeds the kinematics of the neighboring vehicles. Additionally, we integrate the history of vehicles’ last $h$ meta-actions to extract temporal information and their past trajectories. An ego vehicle $I_{i}\in\mathcal{I}$ observes a set of HVs and AVs in its perception range. The Kinematic observation includes the relative Frenet coordinates of the closest $|\widetilde{\mathcal{I}}\cup\widetilde{\mathcal{V}}|+1$ vehicles in addition to the absolute Frenet coordinates of the ego vehicle. Formally, agent $I_{i}$ receives a local observation $\tilde{\textbf{o}}_{i}\in\widetilde{\mathcal{O}}_{i}$ ,

\tilde{\textbf{o}}_{i}=\big{[}o_{i},o_{m},o_{i+1},...,o_{i+|\widetilde{\mathcal{I}}\cup\widetilde{\mathcal{V}}|}\big{]}^{\top}

(9)

Each row of the local observation matrix $\textbf{o}^{(i)}$ is defined as,

o_{j}=\Big{[}p_{j},l_{j},d_{j},\mathrm{d}l_{j}/\mathrm{d}t,\mathrm{d}d_{j}/\mathrm{d}t,\cos\rho_{j},\sin\rho_{j},\lambda_{j},\overline{\mathrm{H}}^{\mathcal{A}}_{j}\Big{]}

(10)

in which, $l_{j}$ and $d_{j}$ are the longitudinal and lateral Frenet coordinates of the $j$ th vehicle, respectively. Vehicle’s yaw angle is denoted by $\rho$ and the autonomy flag is $\lambda_{j}=0$ if $I_{j}\in\widetilde{\mathcal{V}}$ and $\lambda_{j}=1$ otherwise. In case that the total number of observed vehicles is smaller than the set size of the observation matrix o, the remaining rows are filled with zeros with $p_{j}=0$ . $\overline{\mathrm{H}}^{\mathcal{A}}_{j}$ is the unrolled numeric representation of the action history array $\textbf{H}^{\mathcal{A}}_{j}$ that contains the last $h$ meta-actions taken by $I_{j}$ and is defined as,

\textbf{H}^{\mathcal{A}}_{j}(t)=\big{[}a_{j}(t-1),...,a_{j}(t-h)\big{]}

(11)

Our interest is in maneuver-level decision-making for autonomous vehicles. Thus, we define the action space $\mathcal{A}$ as the set of abstract meta-actions $\mathcal{A}_{i}=[\texttt{Lane Left}$ , Idle, Lane Right, Accelerate, $\texttt{Decelerate}]^{\top}$ . These meta-actions are then translated into admissible trajectories and low-level control signals that eventually govern the movement of the vehicle. The implementation details of how meta-actions render into steering and acceleration signals are discussed in Section VI. Additionally, the discrete meta-actions defined above must be translated into numeric values in Eq. (11). We experiment with three encodings and choose the one that leads to best performance after training:

-

Binary: A one-hot encoding with 5 bits for $a_{i}\in\mathcal{A}_{i}$ .
-

Discrete: An integer in $(0,5]$ for $a_{i}\in\mathcal{A}_{i}$ .
-

Frenet: Two integers in $[-1,1]$ for lateral and longitudinal actions.

V-B Disentangling Sympathy and Cooperation

Inter-agent relations in our mixed-autonomy problem can be broken down into the interactions among autonomous agents, i.e., AV-AV interactions, as well as between autonomous agents and human drivers, i.e., human-AI interactions. Decoupling the two enables us to systematically study the interactions between human drivers with ambiguous SVO and our autonomous agents. We refer to an autonomous agent’s altruism toward a human as sympathy and define cooperation as the altruistic behavior among autonomous agents. Our rationale for decoupling the components of altruism is that they differ in nature. As an instance, sympathy may not be reciprocal as humans are heterogeneous in their SVO but cooperation among autonomous agents is essentially homogeneous, assuming that they hold the same SVO. We investigate each component of altruism separately to better understand the emerging behaviors and the mechanics of inducing altruism in autonomous agents. Following this definition, we can rewrite Eq. (7) as,

$\displaystyle R_{i}={}$	$\displaystyle r_{i}\cos\phi_{i}+(\sin\theta_{i}R_{i}^{\mathrm{AV}}+\cos\theta_{i}R_{i}^{\mathrm{HV}})\sin\phi_{i}$	(12)
$\displaystyle=$	$\displaystyle\underbrace{r_{i}\cos\phi_{i}}_{\mathrm{egoistic\,\,term}}+$
	$\displaystyle\underbrace{\sin\theta_{i}\sin\phi_{i}R_{i}^{\mathrm{AV}}}_{\mathrm{cooperation\,\,term}}+\underbrace{\cos\theta_{i}\sin\phi_{i}R_{i}^{\mathrm{HV}}}_{\mathrm{sympathy\,\,term}}$

where $\theta$ is the sympathy angular phase determining the cooperation-to-sympathy ratio. Parameters $R_{i}^{\mathrm{AV}}$ and $R_{i}^{\mathrm{HV}}$ denote the total utility of other autonomous and human-driven vehicles, respectively, as perceived from the $i$ th agent’s perspective. We expand on this topic in Section V-C where we introduce the distributed reward structure.

V-C Decentralized Reward Structure

Following the notions of sympathy and cooperation and the notation of Eq. (12) we decompose the decentralized reward received by agent $I_{i}\in\mathcal{I}$ as,

$\displaystyle R_{i}(s_{i},a_{i})={}$	$\displaystyle R^{\mathrm{E}}+R^{\mathrm{C}}+R^{\mathrm{S}}$	(13)
$\displaystyle={}$	$\displaystyle r_{i}(s_{i},a_{i})\cos\phi_{i}$
	$\displaystyle+\sin\theta_{i}\sin\phi_{i}\sum_{j}\Big{(}r^{\mathrm{AV}}_{i,j}(\tilde{\textbf{o}}_{i})+r_{j}^{M}(\tilde{\textbf{o}}_{i})\Big{)}$
	$\displaystyle+\cos\theta_{i}\sin\phi_{i}\sum_{k}\Big{(}r^{\mathrm{HV}}_{i,k}(\tilde{\textbf{o}}_{i})+r_{j}^{M}(\tilde{\textbf{o}}_{i})\Big{)}$

in which $j\in\widetilde{\mathcal{I}}\setminus\{I_{i}\}$ , $k\in(\widetilde{\mathcal{V}}\cup\{M\})\setminus(\mathcal{I}\cap\{M\})$ . The $r_{i}$ term denotes the ego vehicle’s driving performance derived from metrics such as distance traveled, average speed, and a negative cost for changes in acceleration to promote a smooth and efficient movement by the vehicle. The cooperative reward term, $r^{\mathrm{AV}}_{i,j}$ accounts for the utility of the ego’s allies. It is important to note that ego vehicle only requires the observation $\tilde{\textbf{o}}_{i}$ to compute $R^{\mathrm{C}}$ and not any explicit coordination or knowledge of the actions of the other agents. The sympathetic reward term, $r^{\mathrm{HV}}_{i,k}$ is defined as

r^{\mathrm{HV}}_{i,k}=\sum_{k}\frac{1}{\eta d_{i,k}^{\psi}}u_{k},

(14)

where $u_{k}$ denotes an HV’s utility, e.g., its speed, $d_{i,k}$ is the distance between the observer autonomous agent and the $k$ th HV, and $\eta$ and $\psi$ are dimensionless coefficients. Moreover, the sparse scenario-specific mission reward term $r^{\mathrm{M}}_{e}$ in the case of our driving scenario is representing the success or failure of the merging maneuver,

r^{\mathrm{M}}_{e}=\begin{cases}1/2,&\text{if $I_{e}\equiv M$ and merge is successful}\\ 0,&\text{o.w.}\end{cases}

(15)

V-D Deep MARL for Sympathetic and Cooperative Driving

Two cascade multi-layer perceptron (MLP) networks are utilized as the feature extractor network (FEN) and the function approximator network (FAN), each with two layers of size 256 and 128 neurons, respectively, and rectified linear unit (ReLU) non-linearities. As introduced in Section V-A, the temporal information in a vehicle’s observations is captured through integrating the history of the past actions in the observations and the feature extractor network must be able to efficiently extract meaningful patterns from this information. Both networks are trained end-to-end to enforce the feature extractor network to extract the most vital information that is required for estimating the state-action value function. The policy is trained offline and deployed into all agents to be executed in a distributed and online fashion, meaning that each agent makes independent decisions based on its observation but they all follow the same stochastic policy.

As we elaborated in Section III, the non-stationarity of the environment is a major problem in concurrent training of multiple RL agents. We employ a semi-sequential training and policy dissemination algorithm to cope with this challenge and stabilize the training process. Algorithm 1 summarizes our overall methodology which is done in two stages. First, an experience replay buffer (ERB) is filled with data from simulation episodes and then, random samples drawn from this buffer is used for updating the weights of both FEN and FAN networks. For simplicity we refer to the set of all weights for both neural networks as w. We use a novel method for scoring the entries in ERB and drawing them with a probability proportional to that score.

ERB is highly skewed due to the nature of our highway merging scenario. To elaborate, each episode can be morphologically broken down into two parts, straight driving on the highway and the merging point. The former mostly provides information and training samples that are useful for learning the basics of driving and the latter contains the important information regarding the inter-agent coordination and altruistic behavior, which is of our interest. Only a few time steps of each episode contain the merging point and the rest is mostly related to highway cruising. To balance the training data drawn from the experience replay, we randomly draw samples with a probability $p_{\mathrm{ERB}}$ proportional to their spatial distance from the merging point. This method showed better performance when compared to the most common method of prioritizing the experience replay based on a sample’s last resulted reward.

After drawing a training sample from ERB, the agent $I_{i}\in\mathcal{I}$ performs $k_{\mathrm{diss}}$ iterations of training while the weights $\textbf{w}^{-}_{j}$ of all other agents $I_{j}(j\neq i)$ is frozen. The updated weights $\textbf{w}^{+}_{i}$ are then disseminated to the other agents to update their policy. This process is then repeated for all agents until convergence. Doing so enables us to stabilize the training and train all agents concurrently. The key idea is applying incremental updates and keeping the environment stationary in-between the updates so that the optimizer achieves convergence. This semi-sequential algorithm is illustrated in Figure 4 and Algorithm 1.

Algorithm 1 Semi-sequential multi-agent Q-learning

Initialize experience replay buffer (

\mathrm{ERB}

) of size

N_{\mathrm{buff}}

for

\mathrm{Episode}=1

N_{\mathrm{episode}}

Initialize episode with

l_{M}(t_{0})

and

v_{M}(t_{0})

for

t=1

T_{\mathrm{episode}}

Fill

\mathrm{ERB}

with the tuples

([\textbf{o}_{i}],[a_{i}],[\textbf{o}^{\prime}_{i}],[R_{i}])

Calculate the relevance factor

p_{\mathrm{ERB}}

for each entry in

\mathrm{ERB}

Initialize

\tilde{Q}(s,a;\textbf{w})

with random weights

\textbf{w}^{-}

Initialize target network

\hat{\textbf{w}}

with weights

\hat{\textbf{w}}=\textbf{w}^{-}

for

\mathrm{Frame}=1

N_{\mathrm{episode}}\times T_{\mathrm{episode}}

c_{\mathrm{target}}=0

for

I_{i}

\mathcal{I}

Freeze the weights

\textbf{w}^{-}

for

I_{j}

where

j\neq i

for

k=1

k_{\mathrm{diss}}

Calculate the spatial distance

Draw a sample from ERB based on

p_{\mathrm{ERB}}

values

\textbf{w}^{+}\leftarrow\textbf{w}+\alpha\hat{\nabla}_{\textbf{w}}\mathcal{L}(\textbf{w})

c_{\mathrm{target}}

c_{\mathrm{target}}

n_{\mathrm{target}}

then

\hat{\textbf{w}}\leftarrow\textbf{w}^{+}

\textbf{w}^{-}=\textbf{w}^{+}

for all

I_{i}\in\mathcal{I}

VI Implementation Details

We start this section with the 2D micro-traffic simulator we employed to generate simulation episodes and formulate the human driver model that imitates the behavior of a HV in mixed-autonomy environments. Practical details of training and validation are discussed before presenting our results in the next section.

VI-A Driving Simulator

We modified an OpenAI Gym environment [38] to enable multi-agent training and distributed execution in a mixed-autonomy highway merging scenario. The meta-actions determined by the stochastic policy are translated to low-level steering and acceleration control signals through a closed-loop proportional–integral–derivative (PID) controller. Motion of the vehicles is then governed by a Kinematic Bicycle Model that determines the vehicles’ yaw rate and acceleration. As a common practice in robotics, road segments and the motion of the agents are expressed in Frenet-Serret coordinates and broken into lateral and longitudinal movements.

In order to ensure learning generalizable policies rather than memorizing a sequence of actions by the function approximator network, the initial state of each simulation episode is randomized. This episode initialization is particularly critical as the resulting initial states must be still meaningful and valid for our desired conflictive highway merging scenario. Trivial episodes where the merging vehicle can easily merge into the highway regardless of the AVs’ actions or the episodes where the AVs’ do not have an opportunity to enable safe merging, not only do not add valuable information to the training process but also can lead into misleading measures. The initial longitude and speed of the cruising vehicles are uniformly randomized and the initial longitude $l_{M}(t_{0})$ and speed $v_{M}(t_{0})$ of the merging vehicle are drawn from a clipped-Gaussian distribution $\widetilde{\mathcal{N}}(x;\mu,\sigma,\delta)$ defined as,

\widetilde{\mathcal{N}}(x)=\mathcal{N}(x;\mu,\sigma)\Big{(}\mathds{1}(x-\mu+\delta)-\mathds{1}(x-\mu-\delta)\Big{)}

(16)

where $\mathcal{N}(x;\mu,\sigma)$ denotes a Gaussian distribution and $\mathds{1}$ is the Heaviside step function. We elaborate on initializing episodes via parameters $\mu$ , $\sigma$ , and $\delta$ in Section VII-E.

VI-B Human Driver Model

Lateral and longitudinal movements of HVs are mimicked by human driver models proposed by Treiber et al. and Kesting et al. [39, 40]. The lateral actions of HVs, i.e., the decision to perform a lane change, follow the Minimizing Overall Braking Induced by Lane changes (MOBIL) strategy [40]. MOBIL model allows a lane change only if the resulting acceleration $\mathrm{acc}_{n}>-b_{\mathrm{safe}}$ meets the safety criterion, and the incentive criterion is also satisfied,

\mathrm{acc}^{\prime}_{e}-\mathrm{acc}_{e}+\sin\phi_{e}\Big{(}(\mathrm{acc}^{\prime}_{n}-\mathrm{acc}_{n})+(\mathrm{acc}^{\prime}_{o}-\mathrm{acc}_{o})\Big{)}>\mathrm{acc}_{\mathrm{th}}

(17)

with $\mathrm{acc}_{e}$ , $\mathrm{acc}_{n}$ , and $\mathrm{acc}_{o}$ being the acceleration of the ego HV, the following vehicle in the target lane, and the following vehicle in the current lane, respectively, and $\mathrm{acc}^{\prime}_{e}$ , $\mathrm{acc}^{\prime}_{n}$ , and $\mathrm{acc}^{\prime}_{o}$ are the corresponding accelerations assuming the ego HV has performed the lane change. $\mathrm{acc}_{\mathrm{th}}$ is the threshold that determines if the ego HV shall performs the lane change. HV’s SVO angle $\phi_{e}$ is also referred to as the politeness factor in the literature and is extracted from the empirical probability distribution illustrated in Figure 2.

The longitudinal acceleration of HVs follows the Intelligent Driver Model (IDM) [39]. The longitudinal Frenet acceleration of a HV, $\ddot{l}_{\mathrm{IDM}}$ , is determined by

\ddot{l}_{\mathrm{IDM}}=\mathrm{acc}_{\mathrm{max}}\bigg{(}1-\Big{(}\frac{\dot{l}}{v_{\mathrm{set}}}\Big{)}^{4}-\Big{(}\frac{d^{*}(\dot{l},\Delta\dot{l})}{d}\Big{)}^{2}\bigg{)}

(18)

where $\dot{l}$ denotes the longitudinal Frenet speed of the HV, and the desired Frenet distance to the leading vehicle is controlled by $d^{*}$ , defined as,

d^{*}(\dot{l},\Delta\dot{l})=d_{0}+\dot{l}T_{\mathrm{set}}+\frac{\dot{l}\Delta\dot{l}}{2\sqrt{\mathrm{acc}_{\mathrm{max}}.\mathrm{acc}_{\mathrm{des}}}}

(19)

in which $\Delta\dot{l}$ is the approach rate, and the model parameters $v_{\mathrm{set}}$ , $T_{\mathrm{set}}$ , $d_{0}$ , $\mathrm{acc}_{\mathrm{max}}$ , and $\mathrm{acc}_{\mathrm{des}}$ are set speed, set time gap, minimum gap distance, maximum acceleration, and the desired acceleration, respectively. Additionally, the acceleration of the vehicle is a random variable defined as,

\ddot{l}=\ddot{l}_{\mathrm{IDM}}+\frac{\sigma_{\mathrm{vel}}}{\Delta t}\mathcal{N}(0,1)

(20)

with $\mathcal{N}(0,1)$ being a standard Gaussian random variable and $\sigma_{\mathrm{vel}}$ is the standard deviation of the velocity noise at the time step $\Delta t$ of the simulation.

VI-C Training and Validation

The autonomous agents are trained using the semi-sequential multi-agent Q-learning algorithm that we introduced in Figure 4 and Algorithm 1 for 15,000 episodes that are generated by the procedure discussed in Section VI-A. Training process is repeated and compared across multiple runs to assure the stability of training and that it converges to the similar policies every time. The trained policies are then evaluated for 2,000 randomized novel test episodes to gauge their efficacy. Test episodes are intentionally generated with different and broader initialization range than the training episodes to demonstrate that agents actually are able to learn generalizable policies and not only memorize sequences of actions.

VII Experimental Results

We break down the research questions of our interest into experimental hypotheses and investigate them through our experiments and ablation studies in this section.

VII-A Manipulated Variables

The two key variables in Eq. (13) are $\phi$ and $\theta$ that determine the level of altruism, which is the general term we use for both HVs and AVs, as well as level of sympathy, which is the term for altruism toward HVs only. Our experiments are done in 2 $\times$ 6 settings with different values of $phi$ and $\theta$ . Furthermore, we experiment with both autonomous, $M\in\mathcal{I}$ , and human-driven, $M\in\mathcal{V}$ , mission vehicle. Our experiment settings are:

•

HV+E. autonomous agents are egoistic ( $\phi_{i}=0$ for $I_{i}\in\mathcal{I}$ ), and the mission vehicle is HV ( $M\in\mathcal{V}$ );
•

HV+C. autonomous vehicles are cooperative only ( $\phi_{i}=\phi^{*}$ and $\theta_{i}=\pi/2$ for $I_{i}\in\mathcal{I}$ ), and the mission vehicle is HV ( $M\in\mathcal{V}$ );
•

HV+SC. autonomous vehicles are sympathetic and cooperative ( $\phi_{i}=\phi^{*}$ and $\theta_{i}=\pi/4$ for $I_{i}\in\mathcal{I}$ ), and the merging vehicle is HV ( $M\in\mathcal{V}$ );
•

AV+E/C/SC. Duals of the the above cases with autonomous mission vehicle ( $M\in\mathcal{I}$ ).

In HV+SC and AV+SC scenarios where autonomous agents have both sympathy and cooperation components, we set the sympathy angle to $\theta=\pi/4$ for the sake of fairness and to avoid imposing bias between HVs and AVs as they both carry humans or goods and neither should have a pre-assumed advantage over the other. The SVO angle $\phi$ is however tuned to reach the optimal level of altruism, we elaborate on this topic in Section VII-D and derive the optimal SVO angle $\phi^{*}$ .

VII-B Performance Measures

To gauge the impact of the aforementioned manipulated variables and other configurable parameters, 3 metrics are chosen that despite being correlated with each other, provide different insights on the efficacy of our solution. As a traffic-level metric, the average distance traveled by HVs and AVs is logged during simulation episodes. Additionally, counting the percentage of the episodes that experience a successful merge enables us to probe the overall social importance of a solution. Safety is also gauged through counting the percentage of episodes that contain at least one crash.

VII-C Hypotheses

Social and individual performance of altruistic and purely egoistic agents are compared through the 3 key hypotheses:

•

H1. While egoistic AVs fail to account for a merging HV, AVs that hold both sympathy and cooperation elements explore ways to enable safe and seamless merging. Therefore, we expect HV+SC to outperform HV+E and HV+C settings.
•

H2. AVs with $\phi\neq 0$ are able to implicitly learn the SVO of HVs and guide them to improve the overall performance of the group.
•

H3. There exists a social value orientation angle ${0<\phi^{*}<\pi/2}$ for autonomous agents that can both lessen the number of crashes and improve the number of successful merges.

VII-D Analysis and Results

Examining H1. The main claim of hypothesis H1 is the superiority of sympathetic cooperative AVs in creating socially optimal results when compared to egoistic autonomous AVs. To better understand the situation, we reiterate on the driving scenario: the merging vehicle $M$ , that can be either human-driven or autonomous, approaches a highway with a mixed group of HVs and AVs. $M$ requires the cruising vehicles’ assistance in order to be able to merge safely. Per our fundamental assumption, we do not rely on the HVs to compromise on their own utility as their SVO is unknown. Instead, it’s on the AVs to create a safe corridor for $M$ and, as we will show in Section VII-E, this goal cannot be achieved by a single AV alone and necessarily needs a cooperative action by the group of AVs.

Figure 5 illustrates an overall comparison between the settings defined in Section VII-A. Focusing on the cases with a human-driven merging vehicle, it is evident that in the absence of the sympathy component in AVs, i.e., in HV+E and HV+C settings, merging fails in the majority of episodes. Failed merging leads to a crash in our simulator as vehicles cannot stop on the highway nor the merging ramp and the merging vehicle that fails to merge collides with the barrier at the end of the merging ramp. This assumption is made to make our simulations more realistic and avoid unfeasible solutions that require full-stop on the highway. Therefore, most of the crash cases shown in Figure 5 are due to unsuccessful merging and not the lack of basic driving skills in HVs and AVs. As an additional evidence, independent crashes that are not relevant to a failed merge are also plotted in Figure 5, which confirms the fact that the vehicles hold sufficient basic skills to maneuver on a highway and avoid collisions.

Figures 5 and 6 clarify the positive social impact that sympathy and cooperation make in terms of reducing the total number of crashes and failed merging. However, a counter-argument against this comparison can be the fact that a rather conservative model is used to mimic HVs in our simulations and this might limit their capability in merging. To investigate this claim, we repeat the comparison with an autonomous mission vehicle that is more risk-tolerant and attempts more creative ways to merge into the highway. In the AV+E setting that AVs only care about their individual utility, although the results are better compared to HV+E, even the autonomous mission vehicle still fails to safely merge in more than 1/3 of the episodes. We conclude that our test case indeed creates a competitive and conflictive scene for the vehicles and showcases how incorporating sympathy and cooperation components in the reward structure of AVs leads to socially-desirable outcomes and improves safety and traffic flow. Figure 6 provides further intuition to this comparison by depicting a sampled set of mission vehicle’s trajectories in different experiment settings. It is evident that un-sympathetic does not allow the mission vehicle to merge, causing its trajectory to end in the merging ramp.

Examining H2. Figure 7 illustrates an example of autonomous agents trained with the sympathetic cooperative reward and a higher capacity neural network architecture. Although all AVs in this scenario work together to make the merging possible, we focus on the most impactful agent which is the “Guide AV” shown in orange color. Other AVs in this sample scenario (shown in green) compromise on their individual reward by accelerating, consuming more energy, and thus receiving less reward as defined in Section V-C. Interestingly, the Guide AV learns to first slow down and then change lane to left and open up space for $M$ . After $M$ successfully merges, the Guide AV finds its lane blocked by a HV so makes another lane change to the right and follows other AVs. Figure 7 demonstrates how AVs receive a significant reward when $M$ merges into the highway. Although the reward structure defined in Section V-C contains multiple parameters but the mission reward term $r^{M}$ of Eq. (15) has an order of magnitude larger impact and thus is the dominating reward signal in training our autonomous agents. In other words, the trained agents learn to take sequences of actions that lead to receiving $r^{M}$ . This learning process includes learning to avoid collisions, navigating through the traffic, and if required affecting the behavior of other HVs.

As it was emphasized before, the autonomous agents do not have access to an explicit behavior model of human drivers and instead implicitly learn this model from experience during the training episodes. Although we employ a rather conservative model of human drivers to showcase our concept, it is expected that given sufficient training data, the autonomous agents can extract models of more complex human behaviors as well. However, sensitivity of our solution to these models and the effect of human behaviors on inter-agent coordination is a topic worthy of investigation which we leave for our future work. As a relevant observation, AVs implicitly learn to predict the behavior of HVs and the fact that HVs commonly act egoistically (refer to Figure 2) and do not slow down for the merging vehicle. Hence, they do not rely on the HVs and instead compromise on their individual reward to enable the highway merging.

Examining H3.

Our experimental scenarios in Section VII-A are defined based on the optimal SVO angle $\phi^{*}$ of the autonomous agents. This parameter clearly has an important impact on the behavior of AVs and thus the safety and traffic-flow metrics. We trained a large set of agents with different SVO angles and tested them in our case study driving scenario. The optimal SVO angle is then defined as the angle that results the best performance metrics, i.e., least number of episodes with collisions and failed merges. We formulate this simple optimization objective as the convex combination of the two metrics,

\phi^{*}=\operatorname*{arg\,min}_{\phi}\Big{(}\xi.f_{\mathrm{C}}(\phi)+\big{(}1-\xi\big{)}.f_{\mathrm{MF}}(\phi)\Big{)}

(21)

where $f_{\mathrm{C}}$ and $f_{\mathrm{MF}}$ are the percentage of episodes with a crash and failed mission, respectively. The hyper-parameter $\xi$ determines the importance of each performance metric and we choose it to be $\xi=0.5$ as otherwise it could bias the training process by putting more emphasis on either of the metrics. Figure 8 illustrates how the two metrics change when the autonomous agents’ SVO is varied from $\phi=0$ (purely egoistic) towards $\phi=\pi/2$ (purely altruistic). It is worth mentioning that neither of the two extremes seems optimal and a point between caring about others and being selfish leads to the most socially-desirable outcome.

TABLE I: Necessity of Multi-agent Coordination: a single SC agent is not able to create socially-desirable outcomes.

	Mission Failed	Crashed	Distance Traveled
Single-agent (HV+1SC)	$74.4\%$	$74.5\%$	$268.5m$
Multi-agent (HV+SC)	$\textbf{12.2}\%$	$\textbf{12.8}\%$	$\textbf{334.4}m$

TABLE II: Ablation study on representing agent observation

\tilde{\textbf{o}}_{i}

	Mission Failed	Crashed
Adding Autonomy Flag $\lambda$
Without	$5.0\%$	$10.4\%$
With	$\textbf{3.8}\%$	$\textbf{8.9}\%$
Including Mission Vehicle $o_{M}$
Without	$4.2\%$	$9.2\%$
With	$\textbf{1.7}\%$	$\textbf{8.3}\%$

A fair critique to the behavior of sympathetic cooperative agents can be the fact that the Guide AV, i.e., AV3 in Figure 3, decelerates and therefore slows down the group of vehicles behind only to allow the mission vehicle to merge. In other words, the utility of a big group of vehicles is being compromised for the sake of the mission vehicle. To investigate the fairness and effectiveness of this outcome, we measure the average distance traveled by HVs and AVs. Figure 9 reveals how despite the fact that in the HV+SC setting a group of vehicles need to slow down to open up space for the mission vehicle, eventually both HVs and AVs manage to travel more distance when compared to a similar setup with egoistic agents (HV+E). It should be noted that the effect of Guide AV’s deceleration gradually propagates through the platoon of vehicles in behind and only affects a limited group of vehicles as the traffic in the platoon is not rigid and can contract and expand.

VII-E Ablation Studies

Necessity of Multi-agent Coordination. Consider the highway merge scenario of Figure 3. Our claim is that all AVs require to work together to enable a safe and seamless merging and none of them can achieve this goal if the others do not cooperate. As elaborated in Section VI-A, we particularly design our scenarios to gauge the effectiveness of altruistic agents and inter-agent coordination. To complement our results in Figure 5 that back the hypothesis H1, we conducted an ablation study in the driving scenario of Figure 3 with the difference that only $\mathrm{AV3}$ is sympathetic cooperative and label this scenario as HV+1SC. Table I demonstrate the necessity of multi-agent coordination and the fact that a single sympathetic cooperative AV, i.e., the Guide AV, is not able to achieve the mission of safe and seamless merging without help from the other AVs.

Designing Non-trivial and Fair Scenarios. Our method for initializing simulation episodes is described in Eq. (16). Parameters $\mu$ and $\delta$ determine the range of the allowed values for the merging vehicle’s initial longitude and speed. Trivial episodes that are too easy, i.e., always lead to successful merging, or too challenging, i.e., never result in a successful merge, can steer the training process into the wrong direction and must be avoided when initializing the episodes. Furthermore, the initial state of an episode can benefit different agents with various SVOs and thus, one may argue that the superior performance of sympathetic cooperative agents as observed in Figures 5 and 6 is an artifact of the episode’s initialization. We draw the initial values from a region that does not favor either of the social preferences. Two sets of parameters $(\mu_{l},\delta_{l},\sigma_{l}=2\delta_{l})$ and $(\mu_{v},\delta_{v},\sigma_{v}=2\delta_{v})$ are chosen for the initial longitude $l_{M}(t_{0})$ and initial speed $v_{M}(t_{0})$ of the merging vehicle, as listed in Table III. Figure 10 illustrates the intuition behind choosing these values.

Observation-space Representation. We discussed the details of how information is embedded into an agent’s observation in Section V-A. Here we justify the design choices and show their positive impact on the performance. Table II shows the impact of including $o_{m}$ in Eq. (9) as well as the autonomy flag $\lambda$ of Eq. (10). Figure 11 summarizes the effect of integrating $\overline{\mathrm{H}}^{\mathcal{A}}_{j}$ in Eq. (10), the history horizon $h$ , and the type of the action encoding. We also experimented with sorting the rows of $\textbf{o}^{(i)}$ in Eq. (9) based on vehicle ID and vehicles’ longitude, as shown in Figure 11.

TABLE III: Training and simulation hyper-parameters.

Parameter	Value	Parameter	Value
$N_{\mathrm{episode}}$	$10,000$	$\mu_{l}$	$95m$
Batch size	$32$	$\delta_{l}$	$2m$
$N_{\mathrm{buffer}}$	$100,000$	$\mu_{v}$	$24m/s$
$\alpha_{0}$	$0.0005$	$\delta_{v}$	$2m/s$
$n_{\mathrm{target}}$	$200$	$v_{\mathrm{set}}$	$25m/s$
Initial exploration $\epsilon_{0}$	$1.0$	$T_{\mathrm{set}}$	$0.5s$
Final exploration $\epsilon_{f}$	$0.1$	$d_{0}$	$1m$
$\epsilon-$ decay	$\mathrm{Linear}$	$\mathrm{acc}_{\mathrm{max}}$	$3m/s^{2}$
Optimizer	$\mathrm{ADAM}$	$\mathrm{acc}_{\mathrm{des}}$	$-5m/s^{2}$
$\gamma$	$0.95$	$\mathrm{acc}_{\mathrm{th}}$	$0.2m/s^{2}$
$\|\mathcal{V}\|$	$20$	$h$	$10$
$\|\mathcal{I}\|$	$4$	$\xi$	$0.5$
$T_{\mathrm{episode}}$	$18s$	$k_{\mathrm{diss}}$	$4$

VIII Concluding Remarks

Summary. Autonomous vehicles need to learn to co-exist with human-driven vehicles on the same road infrastructure. Deploying egoistic AVs that solely account for their individual interests on the road leads to sub-optimal and non-desirable social outcomes. In contrast, we compute the optimal SVO angle that optimizes the traffic metrics and demonstrate how altruistic AVs with the corresponding SVO can be trained to optimize a decentralized social utility that improves traffic flow, safety, and efficiency. We propose practical solutions to mitigate the non-stationarity problem in simultaneous multi-agent training and implicitly learn the behavior of human drivers from experience. Our experiments reveal that altruistic AVs are able to form alliances and affect the behavior of HVs in order to create socially-desirable outcomes that benefit the group of the vehicles.

Limitations and Future Work. While this paper captures the fundamentals of social coordination and altruism in autonomous driving, many tangential aspects of the problem can be further studied. For example, we employed a conservative and limited model of human drivers. Although we expect our solution to be effective with other human behavior models as well, it is important to study its performance under different human behaviors. Also, the impact of communication imperfections and packet drops on the inter-agent coordination can be further investigated using more complex communication models than those presented in this work. On the implementation side, more advanced neural architectures such as convolutional and recurrent networks can be leveraged to capture spatial and temporal information more effectively, a direction that we plan to explore in our future work.

References

[1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 961–971.
[2] D. Sadigh, S. Sastry, S. A. Seshia, and A. D. Dragan, “Planning for autonomous cars that leverage effects on human actions.” in Robotics: Science and Systems, vol. 2. Ann Arbor, MI, USA, 2016.
[3] D. Sadigh, “Influencing interactions between human drivers and autonomous vehicles,” in Frontiers of Engineering: Reports on Leading-Edge Engineering from the 2019 Symposium. National Academies Press, 2020.
[4] W. Schwarting, A. Pierson, J. Alonso-Mora, S. Karaman, and D. Rus, “Social behavior for autonomous vehicles,” Proceedings of the National Academy of Sciences, vol. 116, no. 50, pp. 24 972–24 978, 2019.
[5] B. Toghi, R. Valiente, D. Sadigh, R. Pedarsani, and Y. P. Fallah, “Altruistic maneuver planning for cooperative autonomous vehicles using multi-agent advantage actor-critic,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021.
[6] ——, “Cooperative autonomous vehicles that sympathize with human drivers,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021.
[7] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, “Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems.” Knowledge Engineering Review, vol. 27, no. 1, pp. 1–31, 2012.
[8] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. Torr, P. Kohli, and S. Whiteson, “Stabilising experience replay for deep multi-agent reinforcement learning,” in International conference on machine learning. PMLR, 2017, pp. 1146–1155.
[9] A. Xie, D. Losey, R. Tolsma, C. Finn, and D. Sadigh, “Learning latent representations to influence multi-agent interaction,” in Proceedings of the 4th Conference on Robot Learning (CoRL), November 2020.
[10] A. Shih, A. Sawhney, J. Kondic, S. Ermon, and D. Sadigh, “On the critical role of conventions in adaptive human-ai collaboration,” in 9th International Conference on Learning Representations (ICLR), 2021.
[11] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[12] J. K. Gupta, M. Egorov, and M. Kochenderfer, “Cooperative multi-agent control using deep reinforcement learning,” in International Conference on Autonomous Agents and Multiagent Systems. Springer, 2017, pp. 66–83.
[13] W. Z. Wang, M. Beliaev, E. Biyik, D. A. Lazar, R. Pedarsani, and D. Sadigh, “Emergent prosociality in multi-agent games through gifting,” in 30th International Joint Conference on Artificial Intelligence (IJCAI), 2021.
[14] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, “Deep decentralized multi-task multi-agent reinforcement learning under partial observability,” in International Conference on Machine Learning. PMLR, 2017, pp. 2681–2690.
[15] M. Lauer and M. Riedmiller, “An algorithm for distributed reinforcement learning in cooperative multi-agent systems,” in In Proceedings of the Seventeenth International Conference on Machine Learning. Citeseer, 2000.
[16] M. Kuderer, S. Gulati, and W. Burgard, “Learning driving styles for autonomous vehicles from demonstration,” in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 2641–2646.
[17] H. N. Mahjoub, B. Toghi, and Y. P. Fallah, “A stochastic hybrid framework for driver behavior modeling based on hierarchical dirichlet process,” in 2018 IEEE 88th Vehicular Technology Conference (VTC-Fall), 2018, pp. 1–5.
[18] A. Kuefler, J. Morton, T. Wheeler, and M. Kochenderfer, “Imitating driver behavior with generative adversarial networks,” in 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2017, pp. 204–211.
[19] E. Schmerling, K. Leung, W. Vollprecht, and M. Pavone, “Multimodal probabilistic model-based planning for human-robot interaction,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 3399–3406.
[20] B. Toghi, D. Grover, M. Razzaghpour, R. Jain, R. Valiente, M. Zaman, G. Shah, and Y. P. Fallah, “A maneuver-based urban driving dataset and model for cooperative vehicle applications,” 2020.
[21] Y. F. Chen, M. Everett, M. Liu, and J. P. How, “Socially aware motion planning with deep reinforcement learning,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 1343–1350.
[22] N. Tsoi, M. Hussein, J. Espinoza, X. Ruiz, and M. Vázquez, “Sean: Social environment for autonomous navigation,” in Proceedings of the 8th International Conference on Human-Agent Interaction, 2020, pp. 281–283.
[23] P. Trautman and A. Krause, “Unfreezing the robot: Navigation in dense, interacting crowds,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2010, pp. 797–803.
[24] S. Nikolaidis, R. Ramakrishnan, K. Gu, and J. Shah, “Efficient model learning from joint-action demonstrations for human-robot collaborative tasks,” in 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 2015, pp. 189–196.
[25] D. A. Lazar, E. Bıyık, D. Sadigh, and R. Pedarsani, “Learning how to dynamically route autonomous vehicles on shared roads,” arXiv preprint arXiv:1909.03664, 2019.
[26] C. Wu, A. M. Bayen, and A. Mehta, “Stabilizing traffic with autonomous vehicles,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 6012–6018.
[27] C. Wu, A. Kreidieh, E. Vinitsky, and A. M. Bayen, “Emergent behaviors in mixed-autonomy traffic,” in Conference on Robot Learning. PMLR, 2017, pp. 398–407.
[28] E. Vinitsky, A. Kreidieh, L. Le Flem, N. Kheterpal, K. Jang, C. Wu, F. Wu, R. Liaw, E. Liang, and A. M. Bayen, “Benchmarks for reinforcement learning in mixed-autonomy traffic,” in Conference on robot learning. PMLR, 2018, pp. 399–409.
[29] E. Bıyık, D. Lazar, R. Pedarsani, and D. Sadigh, “Altruistic autonomy: Beating congestion on shared roads,” arXiv preprint arXiv:1810.11978, 2018.
[30] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
[31] E. Emad Marvasti, A. Raftari, Y. P. Fallah, R. Guo, and H. Lu, “Feature sharing and integration for cooperative cognition and perception with volumetric sensors,” arXiv e-prints, pp. arXiv–2011, 2020.
[32] R. Valiente, M. Zaman, S. Ozer, and Y. P. Fallah, “Controlling steering angle for cooperative self-driving vehicles utilizing cnn and lstm-based deep networks,” in 2019 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2019, pp. 2423–2428.
[33] B. Toghi, M. Saifuddin, H. N. Mahjoub, M. Mughal, Y. P. Fallah, J. Rao, and S. Das, “Multiple access in cellular v2x: Performance analysis in highly congested vehicular networks,” in 2018 IEEE Vehicular Networking Conference (VNC). IEEE, 2018, pp. 1–8.
[34] W. B. Liebrand and C. G. McClintock, “The ring measure of social values: A computerized procedure for assessing individual differences in information processing and social value orientation,” European journal of personality, vol. 2, no. 3, pp. 217–230, 1988.
[35] A. Garapin, L. Muller, and B. Rahali, “Does trust mean giving and not risking? experimental evidence from the trust game,” Revue d’économie politique, vol. 125, no. 5, pp. 701–716, 2015.
[36] R. O. Murphy and K. A. Ackermann, “Social preferences, positive expectations, and trust based cooperation,” Journal of Mathematical Psychology, vol. 67, pp. 45–50, 2015.
[37] D. Silver, S. Singh, D. Precup, and R. S. Sutton, “Reward is enough,” Artificial Intelligence, p. 103535, 2021.
[38] E. Leurent, Y. Blanco, D. Efimov, and O.-A. Maillard, “Approximate robust control of uncertain dynamical systems,” arXiv preprint arXiv:1903.00220, 2019.
[39] M. Treiber, A. Hennecke, and D. Helbing, “Congested traffic states in empirical observations and microscopic simulations,” Physical review E, vol. 62, no. 2, p. 1805, 2000.
[40] A. Kesting, M. Treiber, and D. Helbing, “General lane-changing model mobil for car-following models,” Transportation Research Record, vol. 1999, no. 1, pp. 86–94, 2007.