Deep Attention Driven Reinforcement Learning (DAD-RL) for Autonomous Decision-Making in Dynamic Environment
Abstract
Autonomous Vehicle (AV) decision-making in urban environments is inherently challenging due to the dynamic interactions with surrounding vehicles. For safe planning, the ego AV must weigh the importance of the various spatio-temporal interactions in a scene. Contemporary works use colossal transformer architectures to encode interactions, mainly for trajectory prediction, resulting in increased computational complexity. To address this issue without compromising spatio-temporal understanding and performance, we propose the simple Deep Attention Driven Reinforcement Learning (DAD-RL) framework, which dynamically assigns and incorporates the significance of surrounding vehicles into the ego’s RL-driven decision-making process. We introduce an AV-centric spatio-temporal attention encoding (STAE) mechanism for learning the dynamic interactions with different surrounding vehicles. To understand map and route context, we employ a context encoder to extract features from context maps. The spatio-temporal representations, combined with the contextual encoding, provide a comprehensive state representation. The resulting model is trained using the Soft Actor-Critic (SAC) algorithm. We evaluate the proposed framework on the SMARTS urban benchmarking scenarios without traffic signals and demonstrate that DAD-RL outperforms recent state-of-the-art methods. Furthermore, an ablation study underscores the importance of the context encoder and the spatio-temporal attention encoder in achieving this superior performance.
I INTRODUCTION
Navigating safely in a dynamic environment populated with other vehicles remains a significant hurdle for Autonomous Vehicles (AVs). Decisions made by the AV should not only be safe but also comply with human driving behavior. Fig. 1 illustrates a left-turn scenario without a traffic signal, emphasizing the AV’s need to comprehend other road users’ actions and decide how much attention to pay to each of them to ensure safe navigation. Previous works have explored rule-based methods [1], which excel in the scenarios their rules were designed for but falter in new ones. An alternative approach [2, 3] involves explicit communication between the AV and other vehicles, enabling the AV to make informed decisions in collaboration with them. However, this method is limited because reliable communication channels can only be assured between vehicles from the same manufacturer. For effective decision-making, the AV must implicitly comprehend the dynamic driving context as it evolves. In a dynamic driving environment, the AV should understand the temporal behaviors of the surrounding vehicles and learn to make safe decisions; these behaviors can also be influenced by spatial structures such as road geometry.
Imitation Learning (IL) approaches learn from an expert’s actions. Several recent studies [4, 5, 6, 7] have utilized IL methods to develop decision-making abilities that mimic an expert driver. However, expert bias and distribution shift can considerably affect the efficacy of these IL-based approaches. Given the absence of near-collision situations in expert driving data, it is difficult for IL-based techniques to recover from such scenarios. Recent advances in Reinforcement Learning (RL) based decision-making algorithms [8] show promising performance. RL’s strength stems from its exploration capability, which allows it to recover from near-collision scenarios more effectively. However, these methods require a suitable state representation. To better capture the socially related spatio-temporal interactive behaviors between the AV and other vehicles, a method is needed that encodes spatio-temporal relationships and provides the information required for safe decision-making.

In the context of an AV navigating a roadway, the safety relevance of other vehicles varies. The work in [9] identified essential vehicles using rule-based expertise. However, such expert knowledge can be intricate and may not always scale to unfamiliar driving situations. In a dynamic real-world setting, the significance of each vehicle in the vicinity of the AV fluctuates with each passing moment. Therefore, a spatio-temporal state-space representation is needed to encapsulate the evolving importance of the interactions between the AV and surrounding vehicles. This work introduces the Deep Attention Driven Reinforcement Learning (DAD-RL) framework to encode the spatio-temporal interactions between the AV and nearby vehicles together with contextual information. This framework extracts an efficient state-space representation for safe decision-making in a dynamic environment. Temporal encoders capture temporal relationships, aiding in understanding the spatial dynamics of the AV and each surrounding vehicle. To encode socially interactive behaviors, we utilize an ego AV-oriented attention mechanism [10], which is instrumental in learning the spatio-temporal features crucial for RL-based decision-making.
The proposed DAD-RL framework models the dynamic spatio-temporal interactions between the AV and its surrounding vehicles for decision-making. The key contributions are:
1. DAD-RL introduces an innovative approach to model the ego AV-centric attention mechanism. The query vector associated with the AV dynamically learns the attention over the surrounding vehicles’ key and value vectors through a spatio-temporal attention encoder.
2. A context encoder is developed to extract the contextual features important for the AV. The final state encoder combines both encodings, which serves as the state-space representation for RL-based decision-making.
3. A dense reward structure is designed to encourage safe and efficient decision-making.
4. DAD-RL surpasses the performance of the recent, larger transformer-based model, the Scene-Rep Transformer (SRT) [11], on SMARTS [12] in terms of success rate, collision rate, and stagnation rate, with an overall improvement of 29.6% and 2.4% in success rate compared to SAC and SRT, respectively. Furthermore, an ablation study on humanness error and overall score underscores the significance of context encoding combined with the spatio-temporal attention mechanism.

II RELATED WORK
In the contemporary research landscape, the autonomous decision-making capabilities of Reinforcement Learning (RL) have been utilized for autonomous driving applications [13, 14, 15]. However, these studies require a spatio-temporal state representation that captures the dynamics between the Autonomous Vehicle (AV) and its surrounding vehicles. Recent research [16, 17] has utilized a Transformer encoder-decoder architecture to model the sequential decision-making process, effectively transforming RL decision-making into conditional sequence modeling. Recognizing the importance of interactive behavior modeling for social navigation, certain studies [18, 19] have employed Graph Neural Networks (GNNs) for trajectory prediction. Similarly, acknowledging the significance of spatio-temporal interaction modeling for decision-making, some works have incorporated graph-based models [20, 21] for state encoding in RL-based decision-making. Our work instead utilizes a streamlined, modified self-attention mechanism to encode the interactions between the AV and its surrounding vehicles into a state-space representation, which is subsequently used for RL-based decision-making.
III THE DAD-RL FRAMEWORK
This section explains the mechanics of the DAD-RL decision-making framework. The primary component of DAD-RL is the spatio-temporal deep attention state-encoding mechanism learned and used by the RL-driven ego vehicle. The framework also includes a context encoder to process route information. The RL driving task is formulated as a POMDP, since the ego vehicle has access only to limited knowledge owing to sensor-range limitations. The following subsections explain the input and observation space design, the DAD-RL processing, and the RL framework with its action space and reward structure.
III-A Observation Space (Input) Preprocessing
The ego vehicle can obtain historical information about its surrounding vehicles, a segmented Bird-Eye-View (BEV) context, and a detailed history of its own odometric state, as shown in Fig. 2. Together these form the observation tensor $O_t$, where the subscript denotes the value at time $t$. The BEV context $M_t$ consists of a drivable-area map and a waypoint map of fixed size. The surrounding-vehicle history is $N_t = \{x^1_t, \dots, x^n_t\}$ for $n$ surrounding vehicles, where $x^i_t$ stacks vehicle $i$’s current and past timesteps spaced four simulation steps apart; for vehicle $i$ it contains the historical relative position with respect to the ego, the heading, the speed, and the lane the vehicle is in. The ego observation consists of various historical odometric values split into two vectors according to their use in the DAD-RL framework: $x^{\mathrm{ego}}_t$ contains the same quantities as $x^i_t$, and $e_t$ is a tuple holding the ego’s steering, yaw rate, linear velocity, linear acceleration, and linear jerk at time $t$. The following section explains how the spatio-temporal deep attention encoder processes this observation space. The encoder has two parts: a) the Spatio-Temporal Attention Encoder (STAE) and b) the Context Encoder (CE).
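For concreteness, a minimal sketch of how this observation could be laid out as arrays is given below; the field names, the number of vehicle slots, the history length, and the map resolution are illustrative assumptions rather than values specified here.

```python
import numpy as np

def make_observation(n_vehicles=10, hist_len=6, map_size=80):
    """Illustrative container for the observation O_t. All shapes and field
    names are assumptions for exposition, not the paper's actual layout."""
    return {
        # Per-vehicle history x^i_t: relative position (x, y), heading, speed, lane index
        "neighbors": np.zeros((n_vehicles, hist_len, 5), dtype=np.float32),
        # Segmented BEV context M_t: drivable-area map and waypoint map
        "bev_context": np.zeros((2, map_size, map_size), dtype=np.float32),
        # Ego kinematic history x^ego_t (same quantities as the neighbors)
        "ego_history": np.zeros((hist_len, 5), dtype=np.float32),
        # Ego odometric tuple e_t: steering, yaw rate, velocity, acceleration, jerk
        "ego_state": np.zeros(5, dtype=np.float32),
        # Boolean mask m marking which neighbor slots are actually occupied
        "mask": np.zeros(n_vehicles, dtype=bool),
    }
```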
Spatio-Temporal Attention Encoder (STAE): This encoder takes the historical kinematic states of the ego and the surrounding vehicles, $x^{\mathrm{ego}}_t$ and $x^i_t$, at timestep $t$ as input. To encode the temporal kinematic relationships (the past 2.1 seconds, i.e., 21 simulation steps) of a dynamic traffic scenario, the state histories of the surrounding vehicles and the ego vehicle are passed through a shared Long Short-Term Memory (LSTM) network. Eq. 1 gives the intermediate temporal encoding $h^i_t$:
$h^i_t = \mathrm{LSTM}(x^i_t), \quad i \in \{\mathrm{ego}, 1, \dots, n\}$   (1)
The temporal encoding $h^{\mathrm{ego}}_t$ for the ego and $h^i_t$ for every surrounding vehicle $i$ are obtained from the hidden states of the LSTM. The ego vehicle’s attention is then modeled on top of these temporal kinematic encodings using an attention mechanism between them. Let $W_Q$, $W_K$, and $W_V$ be the query, key, and value weight matrices, respectively. The ego encoding $h^{\mathrm{ego}}_t$ acts as the query, and the encodings $h^i_t$ of all surrounding vehicles act as both the keys and the values, since the ego encoding must pay attention to each surrounding-vehicle encoding $h^i_t$. Eq. 2 shows the employed attention mechanism, with query $Q$, key $K$, value $V$, key dimension $d_k$, and attention output $A$:
$A = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$   (2)
The softmax term $\mathrm{softmax}(QK^{\top}/\sqrt{d_k})$ outputs the attention weights of each vehicle over every other vehicle present. However, since only the attention weights of the ego over the other vehicles are of interest, only the row of this matrix associated with the ego query is retained. After the attention block, the final spatio-temporal attention encoding $z_t$ is derived as given in Eq. 3, where $\mathrm{LN}$, $\oplus$, and $\mathrm{FC}$ denote layer normalization, concatenation, and a fully connected layer, respectively, $e_t$ is the ego state vector at the current time instant, and $A_m$ is the attention output of Eq. 2 computed with a mask $m$ that hides absent vehicles from the attention calculation. This mask handles cases where the number of surrounding vehicles within the sensor range is smaller than the maximum number of vehicles $n$.
$z_t = \mathrm{FC}\big(\mathrm{LN}(A_m \oplus e_t)\big)$   (3)
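A minimal PyTorch sketch of the STAE described above is shown below, assuming hidden sizes of our own choosing; the class name, dimensions, and the way the mask is applied inside the softmax are assumptions, not the paper’s implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STAE(nn.Module):
    """Spatio-Temporal Attention Encoder sketch. Hidden sizes, the output
    dimension, and all names are illustrative assumptions."""

    def __init__(self, feat_dim=5, ego_dim=5, hidden=64, out_dim=128):
        super().__init__()
        # Shared LSTM encodes the kinematic histories of ego and neighbors (Eq. 1)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.w_q = nn.Linear(hidden, hidden, bias=False)  # W_Q
        self.w_k = nn.Linear(hidden, hidden, bias=False)  # W_K
        self.w_v = nn.Linear(hidden, hidden, bias=False)  # W_V
        self.norm = nn.LayerNorm(hidden + ego_dim)
        self.fc = nn.Linear(hidden + ego_dim, out_dim)

    def forward(self, ego_hist, nbr_hist, ego_state, mask):
        # ego_hist: (B, T, feat_dim), nbr_hist: (B, N, T, feat_dim)
        # ego_state: (B, ego_dim), mask: (B, N) boolean, True = vehicle present
        B, N, T, D = nbr_hist.shape
        _, (h_ego, _) = self.lstm(ego_hist)                        # (1, B, hidden)
        _, (h_nbr, _) = self.lstm(nbr_hist.reshape(B * N, T, D))   # (1, B*N, hidden)
        h_ego = h_ego.squeeze(0)                                   # (B, hidden)
        h_nbr = h_nbr.squeeze(0).reshape(B, N, -1)                 # (B, N, hidden)

        # Ego encoding is the query; neighbor encodings are keys and values (Eq. 2)
        q = self.w_q(h_ego).unsqueeze(1)                           # (B, 1, hidden)
        k, v = self.w_k(h_nbr), self.w_v(h_nbr)                    # (B, N, hidden)
        scores = (q @ k.transpose(1, 2)) / (k.shape[-1] ** 0.5)    # (B, 1, N)
        scores = scores.masked_fill(~mask.unsqueeze(1), float("-inf"))
        attn = F.softmax(scores, dim=-1)                           # ego-to-neighbor weights
        z = (attn @ v).squeeze(1)                                  # (B, hidden)

        # Concatenate with the current ego state, then LayerNorm + FC (Eq. 3)
        return self.fc(self.norm(torch.cat([z, ego_state], dim=-1)))
```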
Context Encoder (CE): The BEV context maps $M_t$, which comprise the drivable-area map and the waypoint map as images, are processed by a Convolutional Neural Network (CNN). To obtain a concise vector, the CNN layers are arranged so that the spatial size decreases after each successive layer. The intermediate CNN output is then flattened and passed through a fully connected layer, giving the context encoding vector $c_t$. After obtaining the spatio-temporal attention encoding $z_t$ of the surrounding vehicles and the ego’s odometric state, and the context encoding $c_t$, the two vectors are concatenated to obtain the final state encoding $s_t$, as shown in Eq. 4:
$s_t = z_t \oplus c_t$   (4)
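The context encoder and the fusion of Eq. 4 could look as follows; channel counts, the embedding size, and the pooling step are assumptions (the text states only that the feature map shrinks layer by layer and is then flattened), and the observation fields from the earlier sketch are assumed to be batched torch tensors here.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """CNN context encoder sketch. Channel counts, the embedding size, and the
    pooling at the end are assumptions."""

    def __init__(self, in_ch=2, emb_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # collapse spatial dims so the input size is free
        )
        self.fc = nn.Linear(64, emb_dim)

    def forward(self, bev):            # bev: (B, 2, H, W) drivable-area + waypoint maps
        return self.fc(self.conv(bev).flatten(1))


def encode_state(stae, ctx_enc, obs):
    """Final state encoding (Eq. 4): concatenate the STAE output and the context vector."""
    z_t = stae(obs["ego_history"], obs["neighbors"], obs["ego_state"], obs["mask"])
    c_t = ctx_enc(obs["bev_context"])
    return torch.cat([z_t, c_t], dim=-1)
```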
III-B RL Algorithm
The ego vehicle has a stochastic policy network $\pi_\theta$, with $\theta$ as its parameters, mapping the state $s_t$ to actions. As in any RL task, the objective is to learn $\pi_\theta$, along with our STAE, such that the cumulative reward obtained by the ego vehicle is maximized. Soft Actor-Critic (SAC), a state-of-the-art off-policy RL algorithm, is used to train the policy network and the STAE.
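For reference, a standard squashed-Gaussian SAC actor head over the fused state vector $s_t$ might look like the sketch below; the layer sizes are assumptions, and the twin Q critics and automatic entropy tuning of standard SAC are omitted for brevity.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Squashed-Gaussian SAC policy head over the fused state vector (a sketch;
    layer sizes are assumptions, critics and entropy tuning are omitted)."""

    def __init__(self, state_dim, action_dim=2, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.body(state)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-5.0, 2.0)
        dist = torch.distributions.Normal(mu, log_std.exp())
        raw = dist.rsample()                  # reparameterized sample for SAC
        action = torch.tanh(raw)              # squash to [-1, 1]
        log_prob = (dist.log_prob(raw)
                    - torch.log(1.0 - action.pow(2) + 1e-6)).sum(-1)
        return action, log_prob
```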
Reward Structure: The ego vehicle is trained using a dense reward structure designed for safety and comfort (the latter evaluated by the ‘Humanness error’ metric). The reward (Equation 5) is a linear combination of rewards and penalties:
$r_t = w_1\, r_{\mathrm{goal}} + w_2\, r_{\mathrm{dist}} + w_3\, r_{\mathrm{speed}} - w_4\, p_{\mathrm{event}} - w_5\, p_{\mathrm{stag}}$   (5)
The positive reward terms are $r_{\mathrm{goal}}$ and $r_{\mathrm{dist}}$: $r_{\mathrm{goal}} = 1$ if the agent reaches the goal and $0$ otherwise, while $r_{\mathrm{dist}}$ represents the distance travelled towards the goal. To encourage the ego vehicle’s momentum towards the goal under the speed limit, $r_{\mathrm{speed}}$ rewards driving while below the limit and turns into a penalty when the ego vehicle is overspeeding. The penalty terms are $p_{\mathrm{event}}$, covering crashing, going off-road, going off-route, and driving the wrong way, and $p_{\mathrm{stag}}$, applied when the vehicle does not move for ten consecutive seconds. The coefficients $w_1,\dots,w_5$ scale the terms up or down and adjust the weight of each term in the reward structure.
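A hedged sketch of such a reward function is given below; the flag names in `info`, the weight keys, and the exact functional form of the speed term are assumptions, since the text specifies only which events are rewarded or penalized.

```python
def compute_reward(info, w):
    """Sketch of the dense reward in Eq. 5. The event flags in `info`, the exact
    functional forms, and the weights `w` are assumptions."""
    r = w["goal"] * (1.0 if info["reached_goal"] else 0.0)
    r += w["dist"] * info["dist_towards_goal"]
    # Encourage momentum under the speed limit; penalize overspeeding
    if info["speed"] <= info["speed_limit"]:
        r += w["speed"] * info["speed"]
    else:
        r -= w["overspeed"] * (info["speed"] - info["speed_limit"])
    # Safety / rule-violation penalties
    if info["collided"] or info["off_road"] or info["off_route"] or info["wrong_way"]:
        r -= w["event"]
    if info["stopped_for_10s"]:
        r -= w["stagnation"]
    return r
```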
III-C Action Space (Output) Representation
The action space is mid-level control, combining continuous and discrete actions. It is defined as $a_t = (v^{\mathrm{target}}_t, \ell_t)$, where $v^{\mathrm{target}}_t$ is the target speed of the vehicle at time $t$ and $\ell_t$ is a lane command such as ‘switch to left/right lane’ or ‘keep lane’. A classical controller inside the SMARTS simulator executes these mid-level commands. The SAC policy network is designed for continuous action spaces: it outputs the parameters of a Gaussian distribution from which actions are sampled. To accommodate the discrete lane-change commands, the continuous range from which the lane action is sampled is divided into three contiguous partitions, with the two outer partitions mapped to ‘switch to left lane’ and ‘switch to right lane’ and the middle partition mapped to ‘keep lane’.
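The mapping from the policy output to the mid-level command could be implemented as in the sketch below; the partition boundaries at ±1/3, the left/right ordering, and the speed rescaling are illustrative assumptions, as the text states only that the continuous range is split into three partitions.

```python
def map_action(raw_action, max_speed=13.9):
    """Map a policy sample in [-1, 1]^2 to the mid-level command. Partition
    boundaries, left/right ordering, and `max_speed` are assumptions."""
    target_speed = (raw_action[0] + 1.0) / 2.0 * max_speed   # rescale to [0, max_speed]
    lane_raw = raw_action[1]
    if lane_raw < -1.0 / 3.0:
        lane_cmd = "switch_to_left_lane"
    elif lane_raw > 1.0 / 3.0:
        lane_cmd = "switch_to_right_lane"
    else:
        lane_cmd = "keep_lane"
    return target_speed, lane_cmd
```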
IV EXPERIMENTS AND RESULTS
IV-A Driving Scenarios
The DAD-RL framework places strong emphasis on interaction-encoding techniques for comprehending intricate, realistic traffic settings. SMARTS has been chosen as the simulation platform to assess the effectiveness of the DAD-RL framework in handling such complex scenarios. Within SMARTS, several demanding scenarios were constructed for both training and testing the DAD-RL approach. These scenarios are designed to encompass diverse interactive and stochastic traffic dynamics and capture various aspects of realistic driving behavior.
Left Turn-T: This is an urban T-junction with heavy traffic and no traffic signals. The goal of the ego-vehicle is to take an unprotected left turn.
Roundabout: In this urban roundabout scenario, the ego transitions from a 2-lane bi-directional road to a 2-lane unidirectional roundabout with four exits. Three versions are used, Roundabout-A, B, and C, with increasing difficulty in that order: the ego vehicle must cover a quarter, half, and three-quarters of the roundabout, respectively, so the commute distance sets the difficulty level.
Double-Merge: In this scenario, an autonomous vehicle (’ego’) starts from a single-lane road, navigates a two-lane one-way road with two entrances/exits, and exits on the opposite side. The ’ego’ must effectively perform lane changes amidst traffic flows from all entrances to exits, honing its navigation and lane-changing skills.
The scenarios above incorporate heavy traffic flows randomly selected for each simulation episode. The agent engages in multiple consecutive scenarios to assess its overall performance during evaluation.
IV-B Training Setup
This subsection explains how the DAD-RL framework is trained. The SMARTS simulator is used as an RL-Gym environment for training, with simulation steps 0.1 s apart. At every step, the ego vehicle obtains information about the surrounding vehicles, itself, and the context, processes the observation, takes an action, and proceeds to the next simulation step. A route is initialized for the vehicle to follow at the beginning of each episode. Each step produces a tuple $(o_t, a_t, r_t, o_{t+1}, d_t)$, with $d_t$ being the done/terminal signal, and a series of such steps makes up one episode. The DAD-RL framework is trained on the five scenarios described in the previous subsection. A vectorized environment runs several simulations in parallel to speed up training. As steps and episodes progress, the ego vehicle stores the experience tuples in a buffer B. After gaining sufficient experience, a batch is sampled from B to estimate the objective function, and the loss is back-propagated from the end of the policy network to the beginning of the STAE using the SAC loss functions. After this network update, the ego vehicle collects more experience, and the process repeats. Training was done on a machine with at least 32 GB RAM, an Nvidia 2080Ti GPU, and Ubuntu 20.04.
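An outline of this interaction-and-update loop, with placeholder interfaces (`envs`, `agent`, and `buffer` are hypothetical, not actual SMARTS or SAC library APIs), is sketched below.

```python
def train(envs, agent, buffer, total_steps=1_000_000, batch_size=256, warmup=10_000):
    """Hypothetical outline of the training loop described above."""
    obs = envs.reset()
    for step in range(total_steps):
        actions = agent.act(obs)                     # STAE + CE + policy forward pass
        next_obs, rewards, dones, infos = envs.step(actions)
        buffer.add(obs, actions, rewards, next_obs, dones)
        obs = next_obs
        if step >= warmup:
            batch = buffer.sample(batch_size)
            agent.update(batch)   # SAC losses back-propagate through policy and STAE
```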
IV-C Comparison Baselines
To thoroughly compare the proposed framework’s performance, DAD-RL was evaluated against several baseline methods and the SOTA approach [11]. PPO: Proximal Policy Optimization is a SOTA policy-gradient approach with promising performance on robotic decision-making tasks. SAC: Soft Actor-Critic is another SOTA RL algorithm, an off-policy approach that capitalizes on entropy maximization to assist agent training. DrQ: Data-regularized Q-learning adds a CNN-based augmentation to SAC, facilitating direct learning of a policy from pixels. RDM: the Rule-based Driver Model mimics the driving behavior of neighboring vehicles within SMARTS simulations, offering a benchmark for rule-based performance. DT: the Decision Transformer reformulates the RL problem as a conditional sequence modeling task and employs a transformer to generate future actions from past states, actions, and rewards. SRT: the Scene-Rep Transformer is a transformer-based framework recently proposed in [11]. SRT employs two transformer architectures: a Multi-Stage Transformer (MST) for encoding the multi-modal scene input and a Sequential Latent Transformer (SLT) to instill predictive information into the latent state vector.
For all the mentioned baselines, the implementation provided by [11] is used for the performance comparison, and the results are sourced directly from that work.
IV-D Comparative Results
In evaluating the proposed framework, several performance metrics were used to measure the effectiveness of the experiments. These metrics provide insights into the agent’s behavior and performance:
- Succ.% (Success Rate): the proportion of evaluation episodes in which the agent successfully reaches the goal. A higher success rate indicates better goal achievement.
- Coll.% (Collision Rate): the proportion of episodes that ended in a collision between the agent and surrounding vehicles. A lower collision rate signifies better navigation and avoidance capabilities.
- Stag.% (Stagnation Rate): the proportion of episodes ending prematurely because the maximum number of time steps was exceeded. A lower stagnation rate indicates efficient decision-making and goal pursuit.
- Humanness Error: calculated from comfort factors such as jerk and angular acceleration; it assesses how closely the agent’s movements resemble human-like driving behavior.
- Overall Score: a comprehensive measure of driving performance that, as implemented in SMARTS, combines factors such as progress, rule violations, and comfort.
Humanness Error and Overall Score are both provided by SMARTS. To assess the performance of the DAD-RL framework on these metrics, we follow an evaluation strategy similar to [11]: 50 episodes are played in simulation to test the agent across varied traffic patterns, enabling a comprehensive analysis of the proposed framework’s effectiveness. The initialization of the testing environment is selected randomly.
Table I shows the results for three evaluation metrics, namely success rate, collision rate, and stagnation rate, for all the baselines mentioned in the previous section and for the DAD-RL framework. It is evident from the results that DAD-RL exhibits significant performance improvements over the baseline methods in most of the scenarios. DAD-RL shows exceptional performance on the left-turn scenario, with a 0% collision rate. DAD-RL also significantly improves the success and collision rates on all roundabout scenarios compared to SRT. The trend of success rates across the three roundabout scenarios reflects their difficulty: from R-A to R-B to R-C the route length increases, meaning the ego faces more traffic interactions and the difficulty rises. In the Double Merge scenario, DAD-RL faces tough competition from MST and SLT, yet it still surpasses other baselines such as DT and DrQ in overall performance. Unlike the other four scenarios, Double Merge requires the agent to judge the right moment to merge into a lane with ongoing traffic to avoid collisions. This could explain the discrepancy in the Double Merge results, which could be addressed with a spatio-temporal prediction module similar to SRT’s. On average, across all five scenarios, DAD-RL increases the success rate by 29.6% compared to SAC and 2.4% compared to SRT. These quantitative findings emphasize the superior capability of DAD-RL in completing driving tasks and mitigating collision incidents. The architectural difference between DAD-RL and SRT is noteworthy: the latter leverages a full Transformer model, while the former opts for a simpler approach using a compact spatio-temporal encoder with a single attention layer. This streamlined design enhances efficiency by concentrating on the crucial learning components. The lightweight nature of the DAD-RL framework not only delivers better average performance but also reduces architectural complexity, making it a practical and effective choice for driving tasks.
Table I: Success rate (Succ.%), collision rate (Coll.%), and stagnation rate (Stag.%) for all baselines and DAD-RL across the five scenarios.

| Scenario | Model | Succ.% | Coll.% | Stag.% |
|---|---|---|---|---|
| Left Turn-T | RDM | 2 | 54 | 44 |
|  | PPO | 36 | 50 | 10 |
|  | SAC | 68 | 28 | 0 |
|  | DrQ | 78 | 20 | 0 |
|  | DT | 66 | 32 | 0 |
|  | MST | 88 | 12 | 0 |
|  | SLT | 94 | 4 | 0 |
|  | DAD-RL | 98 | 0 | 0 |
| Roundabout-A | RDM | 68 | 30 | 0 |
|  | PPO | 66 | 34 | 0 |
|  | SAC | 76 | 24 | 0 |
|  | DrQ | 80 | 20 | 0 |
|  | DT | 76 | 22 | 0 |
|  | MST | 84 | 16 | 0 |
|  | SLT | 88 | 12 | 0 |
|  | DAD-RL | 96 | 4 | 0 |
| Roundabout-B | RDM | 2 | 98 | 0 |
|  | PPO | 42 | 58 | 0 |
|  | SAC | 48 | 52 | 0 |
|  | DrQ | 72 | 28 | 0 |
|  | DT | 68 | 32 | 0 |
|  | MST | 76 | 24 | 0 |
|  | SLT | 82 | 18 | 0 |
|  | DAD-RL | 88 | 12 | 0 |
| Roundabout-C | RDM | 0 | 100 | 0 |
|  | PPO | 38 | 50 | 12 |
|  | SAC | 46 | 48 | 6 |
|  | DrQ | 68 | 30 | 2 |
|  | DT | 66 | 30 | 0 |
|  | MST | 66 | 34 | 0 |
|  | SLT | 76 | 24 | 0 |
|  | DAD-RL | 80 | 20 | 0 |
| Double Merge | RDM | 0 | 100 | 0 |
|  | PPO | 36 | 64 | 0 |
|  | SAC | 62 | 22 | 0 |
|  | DrQ | 76 | 14 | 0 |
|  | DT | 70 | 30 | 0 |
|  | MST | 92 | 4 | 0 |
|  | SLT | 96 | 2 | 0 |
|  | DAD-RL | 86 | 14 | 0 |
IV-E Ablation Study
Multiple experiments on three distinct scenarios were conducted to understand the individual contributions of the spatio-temporal attention encoder and the context encoder. The results of these experiments are depicted in Fig. 3. Both the Roundabout and Double Merge scenarios exhibit similar trends across the metrics. Compared to SAC, which only utilizes a context encoder, using only the proposed spatio-temporal attention encoder (i.e., DAD-RL without the context encoder) improves the overall score while also reducing the humanness error. The combination of the spatio-temporal attention encoder and the context encoder (DAD-RL) achieves the highest Overall Score and the lowest humanness error. In the Left Turn-T scenario, the trends are largely consistent with the other scenarios except for one metric: SAC obtains a lower humanness error than both variants of DAD-RL. Upon visualizing SAC’s behavior in simulation, it was observed that the agent drives conservatively, making it difficult to maneuver through the traffic at the T-junction and leading to a lower Overall Score. This conservative driving style explains SAC’s lower humanness error. DAD-RL makes a slight compromise on humanness error to greatly improve the Overall Score.
V CONCLUSIONS
This paper highlights the challenges that recent RL-based decision-making approaches encounter in dynamic driving environments. Emphasizing the significance of interaction modeling, we introduce a simple spatio-temporal attention encoder, rather than a full transformer, for safe decision-making in AVs. A new approach to attention modeling for AVs is introduced that encodes dynamic interactions with surrounding vehicles using the Deep Attention Driven Reinforcement Learning (DAD-RL) framework. The framework improves performance compared to previous state-of-the-art RL-based decision-making methods for AVs, including the transformer-based one. Future research will concentrate on integrating AV dynamics with safety-layer design and interpretable decision-making, paving the way for more robust and reliable autonomous systems.
References
- [1] A. Aksjonov and V. Kyrki, ”Rule-Based Decision-Making System for Autonomous Vehicles at Intersections with Mixed Traffic Environment,” In IEEE Int. Intel. Trans. Sys. Conf. (ITSC), Indianapolis, IN, USA, 2021, pp. 660-666.
- [2] A. Bazzi, A. O. Berthet, C. Campolo, B. M. Masini, A. Molinaro and A. Zanella, ”On the Design of Sidelink for Cellular V2X: A Literature Review and Outlook for Future,” in IEEE Access, vol. 9, pp. 97953-97980, 2021.
- [3] J. Cui, H. Qiu, D. Chen, P. Stone and Y. Zhu, ”Coopernaut: End-to-End Driving with Cooperative Perception for Networked Vehicles,” In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 17231-17241.
- [4] Chen, J., Yuan, B. and Tomizuka, M., 2019, November. ”Deep imitation learning for autonomous driving in generic urban scenarios with enhanced safety”, In IEEE/RSJ Int. Conf. on Intel. Robots and Sys. (IROS), 2019, pp. 2884-2890.
- [5] Chowdhury, J., Sundaram, S., Rao, N. and Sundararajan, N., 2022. ”An efficient Deep Spatio-Temporal Context Aware decision Network (DST-CAN) for Predictive Manoeuvre Planning”. arXiv preprint arXiv:2205.10092.
- [6] Jamgochian, A., Buehrle, E., Fischer, J. and Kochenderfer, M.J., ”SHAIL: Safety-aware hierarchical adversarial imitation learning for autonomous driving in urban environments”, In IEEE Int. Conf. on Robotics and Automation (ICRA), 2023, pp. 1530-1536.
- [7] P. Cai, H. Wang, Y. Sun and M. Liu, ”DiGNet: Learning Scalable Self-Driving Policies for Generic Traffic Scenarios with Graph Neural Networks,” In IEEE/RSJ Int. Conf. on Intelligent Robots and Sys. (IROS), Prague, Czech Republic, 2021, pp. 8979-8984.
- [8] B. R. Kiran et al., ”Deep Reinforcement Learning for Autonomous Driving: A Survey,” in IEEE Trans. on Intel. Transportation Systems, vol. 23, no. 6, pp. 4909-4926, June 2022.
- [9] L. Zhang, W. Ding, J. Chen and S. Shen, ”Efficient Uncertainty-aware Decision-making for Automated Driving Using Guided Branching,” In IEEE Int. Conf. on Robotics and Automation (ICRA), Paris, France, 2020, pp. 3291-3297.
- [10] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. ”Attention is all you need”, In Advances in neural information processing systems. 2017; 30.
- [11] H. Liu, Z. Huang, X. Mo and C. Lv, ”Augmenting Reinforcement Learning with Transformer-based Scene Representation Learning for Decision-making of Autonomous Driving,” in IEEE Trans. on Intel. Vehicles, 2024, pp. 1 - 17.
- [12] Zhou, Ming, et al. ”SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving, 2020.” arXiv preprint arXiv:2010.09776.
- [13] J. Chen, S. E. Li and M. Tomizuka, ”Interpretable End-to-End Urban Autonomous Driving With Latent Deep Reinforcement Learning,” in IEEE Trans. on Intel. Transportation Sys., vol. 23, no. 6, pp. 5068-5078, June 2022.
- [14] X. Tang, B. Huang, T. Liu and X. Lin, ”Highway Decision-Making and Motion Planning for Autonomous Driving via Soft Actor-Critic,” in IEEE Trans. on Vehicular Technology, vol. 71, no. 5, pp. 4706-4717, May 2022.
- [15] J. Zhang, C. Chang, X. Zeng and L. Li, ”Multi-Agent DRL-Based Lane Change With Right-of-Way Collaboration Awareness,” in IEEE Trans. on Intel. Transportation. Sys. (ITS), vol. 24, no. 1, pp. 854-869, Jan. 2023.
- [16] Janner, M., Li, Q. and Levine, S., 2021. Offline reinforcement learning as one big sequence modeling problem. Advances in neural information processing systems, 34, pp.1273-1286.
- [17] Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A. and Mordatch, I., 2021. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34, pp.15084-15097.
- [18] J. Schmidt, J. Jordan, F. Gritschneder and K. Dietmayer, ”CRAT-Pred: Vehicle Trajectory Prediction with Crystal Graph Convolutional Neural Networks and Multi-Head Self-Attention,” 2022 Int. Conf. on Robotics and Automation (ICRA), Philadelphia, PA, USA, 2022, pp. 7799-7805.
- [19] Z. Wang, J. Zhang, J. Chen and H. Zhang, ”Spatio-Temporal Context Graph Transformer Design for Map-Free Multi-Agent Trajectory Prediction,” in IEEE Trans. on Intel. Vehicles, vol. 9, no. 1, pp. 1369-1381, Jan. 2024.
- [20] Cai P, Wang H, Sun Y, Liu M. DQ-GAT: Towards safe and efficient autonomous driving with deep Q-learning and graph attention networks. IEEE Trans. on Intel. Transportation Sys. (ITS), 2022 Jul 7;23(11):21102-12.
- [21] Chowdhury J, Shivaraman V, Sundaram S, Sujit PB. Graph-based Prediction and Planning Policy Network (GP3Net) for scalable self-driving in dynamic environments using Deep Reinforcement Learning. In Proceedings of the AAAI Conference on Artificial Intelligence 2024 Mar 24 (Vol. 38, No. 10, pp. 11606-11614).