Detecting Masquerade Attacks in Controller Area Networks Using Graph Machine Learning

William Marfo, Member, IEEE, Pablo Moriano*, Senior Member, IEEE, Deepak K. Tosh, Senior Member, IEEE, Shirley V. Moore, Member, IEEE
Department of Computer Science, *Computer Science and Mathematics Division
University of Texas at El Paso, El Paso, TX 79968, USA
*Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA
[email protected], [email protected], {dktosh, svmoore}@utep.edu
Abstract

Modern vehicles rely on a myriad of electronic control units (ECUs) interconnected via controller area networks (CANs) for critical operations. Despite their ubiquitous use and reliability, CANs are susceptible to sophisticated cyberattacks, particularly masquerade attacks, which inject false data that mimic legitimate messages at the expected frequency. These attacks pose severe risks such as unintended acceleration, brake deactivation, and rogue steering. Traditional intrusion detection systems (IDS) often struggle to detect these subtle intrusions due to their seamless integration into normal traffic. This paper introduces a novel framework for detecting masquerade attacks in the CAN bus using graph machine learning (ML). We hypothesize that the integration of shallow graph embeddings with time series features derived from CAN frames enhances the detection of masquerade attacks. We show that by representing CAN bus frames as message sequence graphs (MSGs) and enriching each node with contextual statistical attributes from time series, we can enhance detection capabilities across various attack patterns compared to using only graph-based features. Our method ensures a comprehensive and dynamic analysis of CAN frame interactions, improving robustness and efficiency. Extensive experiments on the ROAD dataset validate the effectiveness of our approach, demonstrating statistically significant improvements in the detection rates of masquerade attacks compared to a baseline that uses only graph-based features, as confirmed by Mann-Whitney $U$ and Kolmogorov-Smirnov tests ($p<0.05$).

Index Terms:
Controller area networks, intrusion detection systems, graph ML, masquerade attacks.

I Introduction

Controller area networks (CANs) have become ubiquitous in various industrial applications and in the automotive sector [1]. This protocol is essential for ensuring seamless communication among electronic control units (ECUs) that manage vital vehicle functions such as acceleration, braking, steering, and engine control [2, 3, 4]. The robustness, efficiency, and simplicity of CAN have established it as the industry standard for in-vehicle networks. However, the increasing connectivity and automation in modern vehicles, which require interfaces with external networks for diagnostics, firmware updates, and advanced driver assistance systems (ADAS), have exposed the inherent security vulnerabilities in CANs, which can lead to breaches of the integrity and confidentiality of in-vehicle communications [5, 6]. Among the attacks that exploit these vulnerabilities, masquerade attacks stand out due to their stealth and potential for significant impact [7]. In such attacks, adversaries inject deceptive messages that mimic legitimate communication, manipulating vehicle behavior without triggering immediate detection [8]. The critical nature of these systems and the potential for severe consequences, such as unintended acceleration or braking, highlight the urgent need for robust IDS capable of identifying and effectively mitigating such sophisticated threats.

IDS have been widely employed to detect attacks such as fabrication, suspension, and the stealthiest of all, masquerade attacks, on CANs to ensure vehicle safety and functionality. Despite the effectiveness of IDS in detecting various types of attacks, CAN’s vulnerability to cyberattacks still poses significant risks [9]. Traditional IDS have primarily focused on signature-based or anomaly-based methods. Signature-based IDS rely on known attack patterns, which limits their effectiveness against novel threats [10, 11, 12, 13]. Anomaly-based IDS, while more flexible, often struggle with high false-positive rates and may lack the granularity needed to detect subtle, sophisticated attacks like masquerade attacks [14]. Masquerade attacks are particularly challenging because they involve injecting false messages that appear legitimate, allowing attackers to manipulate vehicle systems covertly [15]. Masquerade attacks usually bypass conventional detection mechanisms, making it crucial to develop more robust and precise IDS solutions [16]. Current methods often fall short in near real-time detection and handling of masquerade attacks, especially given the high-dimensional data typical of CANs [17, 18, 19, 20].

Detecting masquerade attacks in CANs is inherently challenging due to their ability to blend seamlessly into normal traffic. Traditional IDS approaches often struggle to identify these sophisticated intrusions because they do not cause immediate disruptions or obvious anomalies in the frequency of CAN frames [21]. Researchers have explored various methods to tackle this problem, focusing predominantly on time series analysis [22] and graph-based approaches [21]. Time series analysis leverages temporal patterns within CAN signals to detect anomalies, allowing for the identification of deviations in the sequence or timing of messages. However, while time series methods can pinpoint irregularities in CAN signal patterns, they often fall short in capturing the complex interactions between different ECUs and their signals. In contrast, graph-based methods have shown promise in addressing these complexities. By representing CAN bus data as graphs, these methods can model the intricate relationships and communication patterns between ECUs. This structural representation enables the detection of deviations from normal behavior that may indicate an attack, specifically in the case of fabrication attacks. Despite these advancements, current graph-based IDS approaches still face significant limitations. In particular, they often lack robustness in masquerade attack scenarios, failing to adequately detect these sophisticated intrusions due to their subtle nature. The primary limitations include insufficient incorporation of contextual data, such as temporal patterns and inter-signal relationships, which are crucial for accurate and reliable detection [21], [23]. Consequently, these limitations result in a higher likelihood of false negatives, where masquerade attacks go undetected, potentially leading to severe vehicle malfunctions or safety risks. 
Potential ways to address these limitations include integrating more comprehensive contextual data, exploring machine learning (ML) techniques that can better capture the complexities of CAN communications, and developing hybrid models that combine graph-based and time series-based approaches for enhanced detection accuracy.

The pressing need for an efficient approach that can accurately detect these stealthy attacks in resource-constrained vehicular environments forms the core motivation for this research. For instance, in a real-world scenario, a vehicle experiencing a masquerade attack might have its braking system manipulated without the driver’s knowledge, potentially leading to dangerous situations on the road. Detecting such an attack promptly could prevent accidents and enhance the overall safety of the vehicle. Our work aims to address these research gaps by leveraging graph ML to model CAN traffic using message sequence graphs [21] and time series analysis, enhancing the detection and analysis of masquerade attacks in the CAN bus.

The main contributions of this paper can be summarized as follows:

  • We propose a comprehensive framework for detecting masquerade attacks in the CAN bus. Our approach uses graph ML by integrating shallow graph embeddings and time series analysis to capture both the structural and temporal aspects of CAN frames. By representing CAN frames as message sequence graphs (MSGs), we enhance the ability to detect subtle masquerade attacks that traditional IDS might miss.

  • We implement a robust node annotation technique within MSGs that significantly enhances the detection process. By incorporating key statistical attributes derived from time series data, such as the mean and standard deviation of signals, into each CAN ID node, our method provides a richer, context-aware analysis. This detailed annotation not only improves the accuracy of masquerade attack detection but also ensures continuous monitoring and dynamic adaptation to changing network behaviors.

  • We evaluate our proposed framework through extensive experiments using the ROAD dataset, a benchmark in vehicular network security research. Our results demonstrate significant improvements in detecting masquerade attacks when compared to an approach based only on graph topology. The evaluation highlights the scalability and robustness of our approach, ensuring its practical applicability in modern vehicular networks.

We have made the code available to reproduce all the results at [24].

II Background

This section provides the necessary background on CAN and related topics. We begin with an overview of the CAN protocol (Section II-A) and its operational details (Section II-B). Next, we discuss the security aspects and common attack vectors in CAN (Section II-C). We then introduce the concept of graph-based representation in CAN, laying the foundation for how CAN communication patterns are modeled through MSGs (Section II-D), and explain the construction of MSGs (Section II-E). Finally, we cover the use of graph embeddings in the context of CAN (Section II-F).

II-A Controller Area Network (CAN)

CAN is a robust communication protocol that has become a standard in automotive and industrial applications for facilitating the exchange of messages between electronic control units (ECUs). Initially introduced by Robert Bosch GmbH, the protocol’s latest version (2.0) was released in 1991, marking a significant advancement in automotive communication technologies [25]. CAN was further standardized internationally in 2011 by ISO 11898-1:2011, which specified the requirements for high-speed communication and data exchange among various vehicle subsystems. CAN operates on a multi-master serial bus standard for ECUs, which allows devices to communicate with one another within a vehicle without a central computer. This protocol is designed to operate at high speeds, up to 1 Mbps, depending on the distance and configuration, with the ability to support distributed real-time communication efficiently [16].

II-B CAN Protocol

CAN operates as a broadcast protocol, enabling all ECUs connected to the bus to receive signals or messages transmitted over the bus. These signals, called frames, contain messages instructing the vehicle or system on executing operations. CAN encompasses the first two layers of the open systems interconnection (OSI) model, namely the physical and data-link layers, as shown in Fig. 1. Within CAN, ECUs such as the transmission control module, engine control unit, and the antilock braking system broadcast data frames that provide information about the vehicle’s current state.

Figure 1: CAN protocol frame structure with an 8-byte payload.

A CAN frame comprises several fields, each defined below along with their possible values:

  • Start of frame – This bit indicates the beginning of a CAN frame transmission — 1 bit, dominant (0). A dominant value ensures that all nodes on the network recognize the start of a new frame.

  • Identifier (ID) – This unique identifier represents the message priority, where lower values indicate higher priority (11 bits for standard ID, 29 bits for extended ID). IDs are crucial for arbitration when multiple ECUs transmit simultaneously.

  • Remote transmission request (RTR) – Indicates whether the frame carries data or requests data (1 bit): dominant (0) for data frames, recessive (1) for remote request frames.

  • Identifier extension bit (IDE) – Specifies whether the identifier is standard or extended — 1 bit. Dominant (0) for standard ID, recessive (1) for extended ID. This bit differentiates between the two ID formats.

  • Reserved – Reserved for future use (1 bit). It must be transmitted as dominant (0), but receivers accept either a dominant or a recessive value, which ensures consistency and compatibility across different CAN standards.

  • Data length code (DLC) – Indicates the length of the payload, specifying how many bytes of data are in the frame — 4 bits, representing 0–8 bytes. This is essential for the receiving ECU to know how much data to expect.

  • Payload – The actual data carried by the frame, encapsulated in bits — 0–8 bytes. Contains the information being transmitted, such as sensor readings or control commands.

  • Cyclic redundancy check (CRC) – Used to ensure the integrity of the transmitted data by detecting errors — 15 bits. Helps in error detection to ensure reliable communication.

  • Acknowledge (ACK) – Indicates whether the frame was acknowledged by any node on the network — 1 bit, recessive (1) when the frame is sent. A dominant value (0) signifies that at least one node successfully received the frame.

  • End of frame (EOF) – Denotes the end of a CAN frame — 7 bits, recessive (1). Marks the completion of the frame transmission.

  • Inter-frame spacing (IFS) – The space between the end of one frame and the start of the next — 3 bits, must be recessive (1). Ensures proper spacing between frames to avoid collisions.

In our research, we use only the ID and data fields. The ID serves as the message header that identifies the frame and is used for arbitration, the process of prioritizing frames when multiple ECUs transmit simultaneously: the lower the ID, the higher the priority. The data field holds the actual message contents of up to 8 bytes, where each distinct piece of information carried in the message is termed a signal. CAN frames with the same ID encode the same set of signals in the same format and are typically transmitted at a fixed frequency to relay updated signal values.
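As a minimal sketch of the two fields used here, the snippet below models a frame as an ID plus a payload and resolves arbitration by picking the lowest ID. The class name `CanFrame`, the helper `arbitrate`, and the example IDs are assumptions introduced for illustration only; the sketch ignores bit-level arbitration on the wire and assumes standard 11-bit identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanFrame:
    """Hypothetical container for the two fields used in this work."""
    can_id: int   # 11-bit standard identifier (0x000-0x7FF)
    data: bytes   # payload of 0-8 bytes

    def __post_init__(self):
        if not 0 <= self.can_id <= 0x7FF:
            raise ValueError("standard CAN IDs are 11 bits wide")
        if len(self.data) > 8:
            raise ValueError("the payload holds at most 8 bytes")


def arbitrate(contenders):
    """Return the frame that wins arbitration: the one with the lowest ID."""
    return min(contenders, key=lambda frame: frame.can_id)


# Example IDs and payloads are made up for illustration.
brake = CanFrame(can_id=0x0A0, data=bytes([0x01, 0xF4]))
engine = CanFrame(can_id=0x1C4, data=bytes([0x10, 0x27, 0x00]))
winner = arbitrate([engine, brake])
print(hex(winner.can_id))
```

When both frames contend for the bus, the frame with the numerically smaller ID is transmitted first and the other one is retried afterwards.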

The CAN protocol’s design ensures reliable communication in environments that are prone to electrical noise and disruptions, making it ideal for automotive applications where safety and efficiency are paramount [26]. Furthermore, the signal-level representation of CAN data is facilitated using a database for CAN (DBC) file, which translates binary payloads into real-valued signals, providing a time series representation of the network’s activity. This feature is crucial for the development of ML-based IDS for masquerade attacks because it provides a structured representation of CAN data, making it possible to characterize the regular relationships between physical signals in a system [25, 27].
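The translation from a binary payload to a real-valued signal can be sketched as follows. This is an illustrative decoder, not a DBC parser: the function `decode_signal`, the little-endian unsigned layout, and the example signal definition (start bit, length, scale, offset) are all assumptions and are not taken from any real DBC file.

```python
def decode_signal(payload: bytes, start_bit: int, length: int,
                  scale: float = 1.0, offset: float = 0.0) -> float:
    """Decode one unsigned little-endian signal from a CAN payload."""
    raw = int.from_bytes(payload, byteorder="little")
    mask = (1 << length) - 1
    # Physical value = raw integer * scale + offset.
    return ((raw >> start_bit) & mask) * scale + offset


# Hypothetical layout: a 16-bit engine-speed signal in bytes 0-1,
# with a resolution of 0.25 rpm per bit and no offset.
payload = bytes([0x10, 0x27, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00])
rpm = decode_signal(payload, start_bit=0, length=16, scale=0.25)
print(rpm)
```

Applying such a decoder to every frame of a given ID, in order of arrival, yields one time series per signal.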

II-C CAN Security and Attacks

The security of CAN can be analyzed using the confidentiality, integrity, and availability (CIA) triad, which are fundamental principles of cybersecurity.

Confidentiality

CAN inherently lacks mechanisms for ensuring data confidentiality. Without encryption, all communications are susceptible to interception and interpretation. While some manufacturers have implemented cryptographic methods for specific functionalities like keyless entry [28], these are not standard features of the CAN protocol itself, leaving much of the network communication exposed.

Integrity

Although CAN employs a cyclic redundancy check (CRC) checksum to verify data integrity, this measure is insufficient for detecting malicious alterations. The CRC is designed to detect accidental alterations or corruptions, not intentional manipulations by sources that appear legitimate, as in the case of masquerade attacks. Thus, without an authentication mechanism, the protocol fails to guarantee the integrity of transmitted data [28].

Availability

The inherent design of CAN also impacts data and network availability. The arbitration rule allows higher-priority nodes to dominate communication and can be exploited to deny service to other nodes. While efficient under normal operations, this design choice becomes a liability when malicious actors manipulate the priority system to disrupt network access [28].

Our approach is designed to safeguard vehicles from various levels of cyberattacks, which can be broadly classified into three categories based on the attacker’s objectives [15]:

  • Fabrication attacks: A compromised ECU injects malicious IDs and data onto the CAN bus, manipulating the communication and potentially causing unintended behavior in the vehicle. All legitimate ECUs remain active and continue to send their original data. This type of attack is common and relatively easy to launch, as the attacker does not need to take control of any ECU [29].

  • Suspension attacks: An attacker disables a legitimate ECU, resulting in the disappearance of messages from the targeted ECU for a certain period. The attacker can achieve this by disconnecting the ECU from the in-vehicle network, thereby cutting off communication [30].

  • Masquerade attacks: These are the most sophisticated, stealthy, and destructive attacks in CAN. They combine elements of fabrication and suspension attacks, where the attacker silences a legitimate ECU and then impersonates it during ongoing operations to inject malicious messages at the expected frequency of benign messages [31].

In this research, we focus on masquerade attacks due to their intricate nature and their profound impact on in-vehicle networks. However, note that our framework will also be helpful for detecting fabrication and suspension attacks as they alter the regular patterns of frames and these changes are captured by the MSGs.
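To make the difference between the three categories concrete, the sketch below applies each attack to a toy trace of `(timestamp_ms, can_id, signal_value)` tuples. The trace, the IDs, the injected values, and the function names are hypothetical; real attacks operate on the bus, not on a recorded list.

```python
def fabrication(trace, target_id, value, period_ms):
    """Inject extra frames for target_id; legitimate frames stay on the bus."""
    start, end = trace[0][0], trace[-1][0]
    injected = [(t, target_id, value) for t in range(start, end + 1, period_ms)]
    return sorted(trace + injected)


def suspension(trace, target_id):
    """Silence the target: its frames disappear from the bus."""
    return [frame for frame in trace if frame[1] != target_id]


def masquerade(trace, target_id, value):
    """Replace the target's signal values while keeping every timestamp."""
    return [(t, cid, value if cid == target_id else v) for t, cid, v in trace]


def count(trace, can_id):
    """Number of frames with the given ID."""
    return sum(1 for _, cid, _ in trace if cid == can_id)


# Hypothetical benign trace: ID 0x0A0 every 10 ms, ID 0x1C4 every 20 ms.
benign = sorted([(t, 0x0A0, 50) for t in range(0, 100, 10)]
                + [(t + 5, 0x1C4, 7) for t in range(0, 100, 20)])

fab = fabrication(benign, 0x0A0, value=0, period_ms=5)
sus = suspension(benign, 0x0A0)
masq = masquerade(benign, 0x0A0, value=0)
print(count(benign, 0x0A0), count(fab, 0x0A0), count(sus, 0x0A0), count(masq, 0x0A0))
```

In this toy setting, fabrication raises the frame count of the target ID and suspension drops it to zero, whereas the masquerade trace keeps the same count and timing as the benign one and differs only in the signal values.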

II-D Graph-Based Representation in CAN

A graph $G$ is defined as an ordered pair $G=(V,E)$, where $V$ represents a set of vertices (or nodes), and $E\subseteq(V\times V)$ represents a set of edges (or links) [32, 33]. Vertices act as entities and edges capture the relationships between nodes. For example, let us consider a toy graph $G$ with a vertex set denoted by $V(G)=\{v_{1},v_{2},v_{3},v_{4}\}$ and edge set by $E(G)=\{e_{1},e_{2},e_{3},e_{4},e_{5},e_{6}\}$. Fig. 2 depicts this graph for detailed visualization and to support the subsequent descriptions and definitions.

Figure 2: A representation of the nodes and edges of graph $G$.

II-D1 Directed graph

A directed graph, or digraph, consists of a vertex set $V$ and an edge set $E$ with two functions: $\textit{init}:E\to V$ and $\textit{ter}:E\to V$. These assign to each edge $e$ an initial vertex $\textit{init}(e)$ and a terminal vertex $\textit{ter}(e)$. The edge is considered directed from $\textit{init}(e)$ to $\textit{ter}(e)$. Directed graphs can include multiple edges between the same two vertices or loops, where an edge leads from a vertex to itself.

II-D2 Subgraph

A subgraph $G^{\prime}$ of a graph $G$ is defined by $V^{\prime}\subseteq V$ and $E^{\prime}\subseteq E$, denoted $G^{\prime}\subseteq G$. This implies $G$ includes $G^{\prime}$, or $G$ is a supergraph of $G^{\prime}$.

II-D3 Walks

A walk of length $k$ is an alternating sequence of vertices and edges in $G$, represented as $v_{0},e_{1},v_{1},e_{2},\ldots,e_{k},v_{k}$, where $e_{i}=\{v_{i-1},v_{i}\}$ for all $1\leq i\leq k$ [34, 35]. Here, $v_{0},v_{1},\ldots,v_{k}$ represent the vertices visited in order during the walk. A random walk is a particular case where each step from $v_{i}$ to $v_{i+1}$ is selected randomly, and which allows $v_{i}$ to be equal to $v_{i+1}$. The probability that a random walk beginning at node $i$ ends up at node $j$ after $k$ steps is given by the entry $p^{k}_{ij}$ of the $k$-th power of the transition matrix. Assuming that, when $G$ is a directed graph (digraph), the out-degree $\deg^{+}(v_{i})$ of every vertex $v_{i}$ is greater than 0, we can define the transition matrix $P_{ij}$ as follows:

$$P_{ij}:=\begin{cases}\frac{1}{\deg(v_{i})}&\text{if }(v_{i},v_{j})\text{ is an edge in the graph }G,\\ \frac{1}{\deg^{+}(v_{i})}&\text{if }(v_{i},v_{j})\text{ is an edge in the digraph }G,\\ 0&\text{otherwise.}\end{cases}$$
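The following sketch builds a row-stochastic transition matrix for a small digraph and raises it to the $k$-th power to obtain $k$-step probabilities. It assumes the usual convention in which the step out of a vertex is drawn uniformly from its outgoing edges; the 3-vertex example graph and all function names are made up for illustration.

```python
from fractions import Fraction


def transition_matrix(n, edges):
    """Transition matrix of a simple random walk on a digraph with n vertices."""
    out_neighbors = [[] for _ in range(n)]
    for i, j in edges:
        out_neighbors[i].append(j)
    P = [[Fraction(0)] * n for _ in range(n)]
    for i, targets in enumerate(out_neighbors):
        if not targets:
            raise ValueError(f"vertex {i} has out-degree 0")
        for j in targets:
            # Each outgoing edge of vertex i is taken with equal probability.
            P[i][j] += Fraction(1, len(targets))
    return P


def matmul(A, B):
    """Plain matrix product of two square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def k_step(P, k):
    """k-th power of P: entry (i, j) is the probability of reaching j from i in k steps."""
    n = len(P)
    result = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, P)
    return result


# Hypothetical digraph with edges 0->1, 0->2, 1->2, 2->0.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
P = transition_matrix(3, edges)
P2 = k_step(P, 2)
print(P2[0])
```

Exact fractions are used instead of floats so that each row can be checked to sum to exactly one.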

II-E CAN Message Sequence Graph (MSG)

The MSG is a crucial tool for analyzing the behavior of the CAN network [21]. It captures the normal sequence pattern of messages in CAN. Nodes in this graph correspond to unique CAN identifiers (IDs), denoting different ECUs or functions within the vehicle network. Edges map the sequence of messages between these IDs, illustrating how information flows within the system. The MSG models regular patterns of communication between ECUs within specified sliding windows, allowing for detailed observation of both normal and potentially malicious activities. Typical sequence patterns found during various vehicle operations, such as parking or acceleration, can be captured by the MSG. These patterns form the basis for detecting deviations that may indicate cybersecurity threats like fabrication and suspension attacks.

Constructing an MSG involves parsing the CAN data to determine the sequence of CAN IDs and quantifying the frequency of these sequences. This process transforms the raw, time-stamped messages into a structured graph that highlights the dynamics of the network’s communication. For instance, typical sequences observed in vehicle operations, such as those related to increasing speed, involve a predictable order of CAN messages related to fuel delivery, RPM, and speed. Fig. 4 in Section IV provides a visualization of these sequences within the MSG. By focusing on the sequence and frequency of message exchanges, the MSG provides a powerful method for identifying disruptions in normal communication patterns indicative of potential attacks. This graph-based approach enhances the ability to scrutinize the integrated behavior of ECUs and ensure the integrity of vehicular communications.
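A minimal sketch of this construction is shown below, assuming that an edge from ID $a$ to ID $b$ is weighted by the number of times $b$ directly follows $a$, and that windows are measured in frames rather than in seconds. Both choices, as well as the function names and the example ID sequence, are illustrative assumptions and may differ from the released code [24].

```python
from collections import Counter


def build_msg(ids):
    """Weighted directed graph: edge (a, b) counts how often ID b directly follows ID a."""
    return Counter(zip(ids, ids[1:]))


def sliding_msgs(ids, window, offset):
    """Build one MSG per window of `window` frames, advancing by `offset` frames."""
    return [build_msg(ids[s:s + window])
            for s in range(0, len(ids) - window + 1, offset)]


# Hypothetical arbitration IDs in order of arrival.
ids = [0x0A0, 0x1C4, 0x2F0, 0x0A0, 0x1C4, 0x2F0, 0x0A0, 0x2F0]
msg = build_msg(ids)
graphs = sliding_msgs(ids, window=4, offset=2)

for (src, dst), weight in sorted(msg.items()):
    print(f"{src:#05x} -> {dst:#05x}: {weight}")
```

The set of nodes is implied by the edge keys, so the `Counter` can be handed to a graph library as a weighted edge list if richer graph operations are needed.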

II-F Graph Embeddings

Graph embeddings map nodes, edges, or entire graphs into a lower-dimensional vector space, preserving their structural and relational information [36]. Here, we focus on embedding the whole graph as our purpose is to detect intrusions in MSGs [37]. Consider $G=\{G_{1},\ldots,G_{m}\}$ as a collection of $m$ graphs, with each graph $G_{i}=\{V_{i},E_{i},l_{i}\}_{i=1}^{m}$ consisting of sets of vertices $V$ and edges $E$, and $l_{i}$ representing the class label of $G_{i}$. Whole-graph embedding aims to map an entire graph into a low-dimensional space $\mathbb{R}^{d}$ where $d\ll|V|$. This embedding vector should ideally retain as much of the graph’s structural information as possible [38].

In our analysis, we select node2vec [39], renowned for its effective network feature learning capabilities including node classification, link prediction, and community detection [39], [40]. Whole-graph embeddings are constructed by aggregating the node embeddings generated by node2vec, which captures the overall structure of the graph by summarizing the features of its individual nodes. This aggregation process typically involves averaging or pooling to combine node features into a single representation, providing a comprehensive view of the graph’s topology [41]. node2vec is particularly suitable for CAN bus data due to its proficiency in preserving the network topology and learning meaningful patterns of interactions between ECUs [42]. This is critical for identifying and classifying masquerade attacks within vehicular networks. The node2vec algorithm extracts detailed feature representations for ECUs based on their communication patterns and roles within the network. Its parameters include the walk length (l), which determines the scope of each random walk and therefore whether the embeddings reflect immediate connectivity or broader network contexts. This adjustment allows feature learning to cover both local interactions and the entire network structure. The number of walks (r) initiated from each node deepens network exploration, ensuring a thorough representation by capturing diverse neighborhood configurations. The random walks are further controlled through the return parameter (p) and the in-out parameter (q), which steer the walk between close neighborhoods and more distant connections. The return parameter (p) affects how likely the walk is to return to a node soon after departing, influencing the focus on local versus broad network connections. Conversely, the in-out parameter (q) shapes whether the walk tends to explore regions close to or far from the starting node, helping to balance the emphasis on tight community structures and a node’s role in the larger network topology. Additionally, the dimensions (d) parameter determines the size of the embedding vectors, balancing computational efficiency with the ability to capture rich graph-related properties.
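The role of the walk parameters can be illustrated with a simplified, unweighted version of the node2vec walk sampler, followed by mean pooling of node vectors into a whole-graph vector. This is a sketch under stated assumptions: the skip-gram training step that turns walks into embeddings is omitted, edge weights are ignored, and the toy graph, the function names, and the pooled example vectors are hypothetical.

```python
import random


def biased_walk(adj, start, walk_length, p, q, rng):
    """Sample one second-order walk; p is the return and q the in-out parameter."""
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)   # return to the previous node
            elif x in adj[prev]:
                weights.append(1.0)       # stay close to the previous node
            else:
                weights.append(1.0 / q)   # move further away
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


def generate_walks(adj, num_walks, walk_length, p, q, seed=0):
    """Start num_walks (r) walks of walk_length (l) nodes from every node."""
    rng = random.Random(seed)
    return [biased_walk(adj, node, walk_length, p, q, rng)
            for _ in range(num_walks) for node in sorted(adj)]


def mean_pool(node_vectors):
    """Average d-dimensional node embeddings into one whole-graph vector."""
    vectors = list(node_vectors.values())
    d = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]


# Hypothetical graph: each node maps to the set of its successors.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
walks = generate_walks(adj, num_walks=2, walk_length=5, p=1.0, q=0.5)
graph_vector = mean_pool({"A": [1.0, 0.0], "B": [3.0, 2.0]})
print(walks[0], graph_vector)
```

With `q < 1` the sampler favors nodes that are not adjacent to the previous node, so walks drift outward; with a small `p` they tend to backtrack and stay local.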

III Related Work

We discuss prior work closely related to graph-based CAN IDS (Section III-A) and other prior work related to general CAN IDS for detecting masquerade attacks (Section III-B).

III-A Prior Work Closely Related to the Present Study

Islam et al. [43] proposed a four-stage IDS utilizing the chi-squared method to identify both strong and weak cyberattacks on the CAN bus. Their approach, which is among the first graph-based defense systems for CAN, showed misclassification rates of 5.26% for DoS attacks, 10% for fuzzy attacks, 4.76% for replay attacks, and zero misclassification for spoofing attacks. The dataset used for evaluation was obtained from the Hacking and Countermeasure Research Lab [44]. This study emphasized the need for robust security mechanisms in modern vehicles and demonstrated superior accuracy compared to existing ID sequence-based methods.

Jedh et al. [21] proposed a message injection attack detection solution. Their approach leveraged MSGs and used graph-based analytics and anomaly detection to detect malicious message injections with high accuracy and low detection time. They validated their approach using a dataset collected from a moving vehicle under attack conditions. This study highlighted the importance of addressing the security of in-vehicle communication networks without relying on proprietary ECU information.

Islam et al. [45] introduced a novel graph-based Gaussian Naive Bayes (GGNB) algorithm for CAN IDS. They leveraged graph-based features from the nodes and edges. The GGNB method, applied to a real raw CAN dataset, achieved high detection accuracy across various attack types, including DoS, fuzzy, spoofing, and replay attacks. The study also utilized the OpelAstra dataset [46] and reported significant improvements in training and testing times compared to SVM classifiers. This research showed the efficiency and effectiveness of the GGNB-based methodology in a resource-constrained environment.

Sreelekshmi et al. [47] proposed a graph-based IDS for CANs achieving notable accuracy in detecting various types of attacks. Their method leverages bidirected graphs and parameters like degree variance to identify anomalies. The system was tested on the Car Hacking: Attack & Defense Challenge 2020 dataset [48] demonstrating superior performance with an accuracy of 98.38%.

Refat et al. [49] presented a lightweight IDS that translates CAN traffic into temporal graphs and applies neighborhood-based graph similarity techniques to detect message injection attacks. The system was evaluated using real vehicle data [50], achieving a detection accuracy of 96.01% for spoofing, fuzzy, and DoS attacks. This work contributed to vehicle security by providing a computationally efficient method without requiring changes to the CAN protocol.

Zhang et al. [23] proposed a CAN bus anomaly detection system using graph neural networks (GNNs) to detect message injection, suspension, and falsification attacks. Their approach involved creating directed attributed graphs from CAN message streams and training a two-stage classifier cascade. The evaluation on a Ford Transit 500 dataset [51] demonstrated the system’s efficiency in real-time detection and its ability to handle new anomalies through federated learning training.

Park et al. [52] introduced G-IDCS, a graph-based IDS integrating a threshold-based IDS and an ML-based attack classifier. This system reduced the number of required CAN messages for detection and improved detection accuracy by over 9% compared to existing methods. They used the Car Hacking: Attack & Defense Challenge 2020 dataset [48] for evaluation. Their study emphasized the system’s robustness to changes in attack types and its potential use in digital forensic investigations.

Xiao et al. [14] proposed the CAN-GAT model based on graph attention networks for in-vehicle networks. By transforming CAN bus messages into graph structures, the model captured the correlation between traffic bytes, improving detection accuracy and efficiency. The model was evaluated using datasets from the Hacking and Countermeasure Research Lab [44], demonstrating superior performance among compared GNNs.

Meng et al. [53] developed GB-IDS, an IDS leveraging a novel graph structure and a variational autoencoder for training classifiers without negative samples. Their system, evaluated on the OTIDS dataset [54], achieved high detection success rates for DoS, fuzzing, and impersonation attacks. This study addressed the limitations of traditional IDS, e.g., [49], by avoiding the need for protocol parsing and large training datasets.

In this paper, we propose a unique framework for detecting masquerade attacks in CANs, leveraging graph ML and time series analysis. By representing CAN frames as MSGs and annotating their nodes with time series features, we fill a gap in the current research. Our approach builds upon previous graph-based methods, such as those introduced by Islam et al. [43] and Jedh et al. [21], and demonstrates competitive performance against recent studies using the ROAD dataset [31], [55]. Unlike methods focusing solely on time series data, our framework captures both structural and temporal aspects of CAN frames, enhancing detection capabilities. The flexibility of our framework, adjustable via the window length and offset in a sliding window environment, allows for fine-tuning performance in constrained vehicular systems, addressing the gaps in [23]. We test our framework on the most realistic masquerade attacks from the ROAD dataset [15], demonstrating its potential for enhancing automotive cybersecurity.

III-B Other Prior Work Related to the Present Study

Zhou et al. [56] developed BTMonitor, a bit-time-based IDS for CANs, achieving a sender identification accuracy of 99.76% on a real vehicle. Their approach leverages the small discrepancies in bit times to fingerprint sender ECUs and can detect new types of masquerade attacks that other IDS fail to identify. They evaluated the system using a three-node CAN bus prototype and a 2012 Buick Regal production vehicle dataset.

Ying et al. [57] proposed a formal analysis of clock skew-based IDS and introduced the cloaking attack, a new type of masquerade attack that manipulates message inter-transmission times to avoid detection. Their study, validated using hardware testbeds and the UW EcoCAR dataset [58], predicted the attack success probability with an average prediction error within 3.0%.

Hanselmann et al. [59] proposed CANet, a neural network architecture for detecting intrusions on the CAN bus. CANet, the first deep learning-based IDS capable of handling the high dimensionality of CAN bus data, was evaluated on real and synthetic CAN data. The method outperformed previous ML-based approaches with a high true negative rate, typically over 0.99, demonstrating robustness in detecting a large number of synthetic masquerade attacks [59].

Moriano et al. [31] focused on detecting masquerade attacks on the CAN bus by analyzing time series extracted from raw CAN frames. Using hierarchical clustering, this study demonstrated that changes in cluster assignments could indicate anomalous behavior. The proposed forensic tool was tested on the ROAD dataset [15], showing significant differences in time series clustering similarity between benign and attack conditions.

Shahriar et al. [55] developed CANShield, a deep learning-based signal-level intrusion detection framework for the CAN bus. CANShield handles high-dimensional CAN data streams, analyzing time-series data from different temporal scales using multiple deep autoencoder networks. Evaluation on the SynCAN [59] and ROAD [15] datasets demonstrated the framework’s robustness against various advanced attacks, improving the overall AUC-ROC by 6.40% compared to conventional methods.

Sharmin et al. [8] contributed to CAN IDS benchmarking efforts by comparing six CAN IDS algorithms using the ROAD dataset [15]. The study included algorithms based on ID sequences, entropy, Hamming distance, frequency, and isolation forest. Their results showed that entropy- and frequency-based algorithms performed better, emphasizing the need for fine-tuning existing methods to detect sophisticated attacks like targeted ID and masquerade attacks.

Shahriar et al. [3] proposed CANtropy, a feature engineering-based lightweight CAN IDS. CANtropy explores a comprehensive set of features from both temporal and statistical domains utilizing PCA for anomaly detection. Evaluation on the SynCAN dataset [59] showed CANtropy’s effectiveness, with an average AUC-ROC score of 0.992, outperforming existing DL-based baselines.

Moriano et al. [60] conducted a benchmark study on four non-deep learning unsupervised online IDS for masquerade attacks in CANs. They controlled streaming data conditions in a sliding window setting and used realistic attacks from the ROAD dataset [15]. They noticed that there is no one-size-fits-all solution for detecting masquerade attacks in CAN in an online setting, i.e., there is no single algorithm that is optimal for detecting each attack type. Notably, a method that detects changes in the hierarchical structure of clusters of time series outperformed others in detecting various attack types at the expense of computational overhead.

IV Proposed data-driven Methods

This section introduces our proposed approach for detecting masquerade attacks in the CAN bus using graph ML. We begin by defining key terms and then go into the details of our processing pipeline illustrated in Fig. 3. We first discuss the partitioning of CAN data for building MSGs (Section IV-A), followed by the extraction and analysis of time series data (Section IV-B). Next, we detail the process to annotate the graphs (Section IV-C) and the generation of graph embeddings (Section IV-D). We then describe our supervised learning process (Section IV-E). Finally, we outline our evaluation setup and metrics (Section IV-F).

Refer to caption
Figure 3: Proposed data-driven framework for masquerade attack detection in CAN using graph ML.

IV-A CAN Data Partitioning and MSG Building

We use sliding windows of fixed size to partition and process the stream of CAN data [61]. The raw CAN data is first transformed into a structured dataframe format where each row represents a message with its timestamp, process ID, and data payload (see Fig. 4). This data is then partitioned into discrete time slices defined by specific window sizes and offsets, enabling a continuous analysis of CAN data.
We now define the key parameters controlling the sliding windows:

  • Window size ($\omega$): The length of each window, determining the extent of data included in each graph representation.

  • Offset ($\delta$): The sliding step between consecutive windows, where $\delta$ represents the number of samples separating each window, ensuring no loss of data and the capture of communication patterns that may span multiple windows.

An MSG is constructed within each defined sliding window to model CAN activity following the procedure introduced by [21] and [43]. Graphs are empirically recognized as effective tools to model relationships among relational data that are too complex for tabular representation or other simpler data structures [62]. The process of building MSGs is crucial as it sets the stage for subsequent analyses, including node annotation and embedding generation. By modeling CAN data using MSGs on sliding windows, we enhance our ability to detect and respond to potential masquerade attacks by fine-tuning the portion of analyzed data. We now detail each of the components of the MSG:

  • Nodes: Each node in the MSG corresponds to a unique ID in CAN representing the header of the CAN frame. This alignment allows for the detailed representation of each frame’s interactions on streams of CAN data.

  • Edges: Directed edges are established based on the sequential relationships of messages. That is, an edge is formed when one ID follows another, indicating the flow of communication [43]. This method effectively maps sequences of CAN messages, enabling the detection of irregular patterns that may signify an attack. Edge weights are assigned to MSG edges to quantify the frequency of observed communication sequences within a sliding window. Specifically, the weight of an edge from node $v_{i}$ to node $v_{j}$ is computed as the number of times the CAN message with ID $v_{j}$ directly follows the CAN message with ID $v_{i}$ within the same sliding window. Weights help to identify anomalous recurrent patterns that could indicate security threats.

Fig. 4 shows how MSGs are built and their corresponding components. Our modeling framework ensures that both structural and temporal aspects of CAN communications are captured, facilitating a comprehensive analysis of the CAN dynamics. Algorithm 1 describes the creation of MSGs.

Refer to caption
Figure 4: Sliding windows concept. The left side shows raw CAN bus messages partitioned by a window of length $\omega$ and offset $\delta$. The right side depicts a resulting MSG subgraph, highlighting the top five nodes with the highest degrees of connectivity within the first sliding window. Edges are annotated with weights.
Algorithm 1 Creation of message sequence graphs (MSGs)
Require:  Raw CAN data, $\omega$, $\delta$
Ensure:  MSG representing CAN activity over time
1:  Initialize an empty directed graph $G$
2:  Initialize $current\_time = start\_time$ of CAN data
3:  while $current\_time + \omega \leq end\_time$ of CAN data do
4:     Define $window\_start = current\_time$ and $window\_end = current\_time + \omega$
5:     Extract messages in $[window\_start, window\_end]$
6:     Initialize a new subgraph $G_{sub}$
7:     for each message $m$ in $[window\_start, window\_end]$ do
8:        Extract $id$ from $m$
9:        if $id$ not in $G_{sub}$ then
10:           Add $id$ as a node in $G_{sub}$
11:        end if
12:        Determine next $id$ in sequence
13:        if edge from current $id$ to next $id$ exists then
14:           Increment weight of edge
15:        else
16:           Add edge with weight = 1
17:        end if
18:     end for
19:     Merge $G_{sub}$ into main graph $G$ respecting $\delta$
20:     Increment $current\_time$ by $\delta$
21:  end while
22:  return  $G$
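
As an illustration of Algorithm 1, the following minimal Python sketch builds one weighted, directed MSG per time-based window using only the standard library. It is not the released implementation: the function name `build_msgs`, the tuple layout of the messages, and the toy trace are assumptions made for this example, and each window's graph is kept separately (as a node set plus an edge-weight counter) rather than merged into a single graph $G$.

```python
from collections import Counter


def build_msgs(messages, omega, delta):
    """Build one weighted, directed MSG per time-based sliding window.

    messages: list of (timestamp, can_id) tuples sorted by timestamp.
    Returns a list of (window_start, nodes, edge_weights) tuples.
    """
    if not messages:
        return []
    start, end = messages[0][0], messages[-1][0]
    graphs = []
    current = start
    while current + omega <= end:
        # IDs of messages whose timestamp falls inside the current window.
        ids = [cid for ts, cid in messages if current <= ts <= current + omega]
        nodes = set(ids)
        # Edge (u, v) counts how often ID v directly follows ID u.
        edges = Counter(zip(ids, ids[1:]))
        graphs.append((current, nodes, edges))
        current += delta
    return graphs


# Toy trace: (timestamp in seconds, arbitration ID); values are made up.
trace = [(0.0, "A"), (0.5, "B"), (1.0, "A"), (1.5, "B"), (2.0, "C"),
         (2.5, "A"), (3.0, "B"), (3.5, "C"), (4.0, "A")]
msgs = build_msgs(trace, omega=2.0, delta=1.0)
for window_start, nodes, edges in msgs:
    print(window_start, sorted(nodes), dict(edges))
```

Keeping the edge weights in a `Counter` keyed by `(source, target)` makes the weight update of steps 13 to 17 a single counting pass over consecutive ID pairs.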

To ensure the correctness of Algorithm 1, we establish a loop invariant for the main while loop. A loop invariant is a property that holds true before and after each iteration of the loop, helping to trace the algorithm’s functionality and correctness. We demonstrate that the loop invariant holds through three key steps:

  1. Loop Invariant: At the start of each iteration of the while loop, current_time is a valid start time for a new window, and all CAN messages up to current_time have been processed into the main graph $G$.

    • Initialization: The invariant holds when current_time is set to the start time of the CAN data. No messages have been processed yet, and $G$ is empty. This ensures that the loop invariant is true before the loop starts.

    • Maintenance: During each iteration, a window is defined from current_time to current_time + $\omega$. All messages in this window are extracted and processed into a subgraph $G_{\text{sub}}$, which is then merged into the main graph $G$. current_time is incremented by $\delta$, updating it to a new valid start time for the next window, ensuring all messages up to the new current_time have been processed. This demonstrates that the property holds after each loop iteration.

    • Termination: The loop terminates when current_time + $\omega$ exceeds the end_time of the CAN data. At this point, there are no more complete windows of size $\omega$ that can start at or after current_time. The invariant holds true because current_time has moved past the last valid window start, confirming that all messages in the valid windows have been processed. The loop invariant confirms that the algorithm functions as intended and supports its correctness.

    By maintaining the loop invariant throughout the execution, we demonstrate the correctness and robustness of the MSG creation algorithm.

  2. Time Complexity: The time complexity of our MSG creation algorithm depends on $\omega$ and $\delta$. In general, it can be expressed as $O\left(\frac{n}{\delta}\cdot\omega\right)$, where $n$ is the total number of messages in the CAN data. This is calculated as follows: The outer loop in Algorithm 1 iterates approximately $\frac{n}{\delta}$ times, processing the CAN data in steps of $\delta$. Within each iteration, a window of size $\omega$ is processed, involving operations such as extracting messages and constructing subgraphs, which take $O(\omega)$ time. Therefore, the overall time complexity is the product of the number of iterations and the operations per iteration, leading to $O\left(\frac{n}{\delta}\cdot\omega\right)$.

    In our experiments, we test various combinations of $\omega$ and $\delta$, ranging from (2, 1) to (15, 14) for time-based windows and from (50, 50) to (400, 400) for sample-based windows. For each attack, we run the algorithm with these combinations to analyze the impact of $\omega$ and $\delta$ on performance and detection accuracy. For example, with a combination of $\omega=4$ and $\delta=4$, the time complexity simplifies to $O(n)$, as each message is processed exactly once. This reflects the trade-off between $\omega$ and $\delta$, with larger $\omega$ or smaller $\delta$ increasing the number of operations. The flexibility of our implementation allows for easy adjustment of these parameters, enabling a balance between granularity of analysis and computational efficiency.
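
The trade-off above can be made concrete with a small counting sketch. It assumes sample-based windows, so $\omega$ and $\delta$ are numbers of messages; the helper `window_cost` and the trace length $n=1000$ are hypothetical and serve only to illustrate the $O\left(\frac{n}{\delta}\cdot\omega\right)$ estimate.

```python
def window_cost(n, omega, delta):
    """Count windows and total message visits for sample-based windows.

    n: total number of messages; omega: window size; delta: offset
    (both expressed as a number of samples).
    """
    starts = range(0, n - omega + 1, delta)  # start index of every full window
    num_windows = len(starts)
    return num_windows, num_windows * omega


# Non-overlapping windows (omega == delta): each message is visited once.
print(window_cost(n=1000, omega=50, delta=50))   # (20, 1000)
# Overlapping windows (delta < omega): messages are revisited.
print(window_cost(n=1000, omega=50, delta=10))   # (96, 4800)
```

With $\omega=\delta$ the visit count equals $n$, whereas reducing $\delta$ to 10 raises it to 4800, close to the estimate $\frac{n}{\delta}\cdot\omega = 5000$.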

IV-B Time Series Extraction and Analysis

Concurrently with the creation of the MSGs, raw CAN messages are decoded using a DBC as described in [63]. In doing so, CAN data is transformed into a time series format to capture detailed patterns of the signals contained in every CAN ID per sliding window. We used CAN-D [64] to extract time series from CAN logs. At the time of this writing, CAN-D is still the state-of-the-art method for CAN reverse engineering [65]. Fig. 5 shows a decoded representation of the signals representing the four wheels’ speed of a vehicle [15]. The plot is generated by translating the CAN frames into a continuous time series format, which captures the network activity over time. This representation facilitates a detailed characterization of masquerade attacks within CANs, highlighting variations and trends that may not be evident when relying on raw CAN data.

Refer to caption
Figure 5: Time series encoded in node ID 1760, illustrating signal patterns of four wheels’ speed. In this figure, the x-axis represents time and the y-axis represents the values of signals through time.

IV-C Graph Annotation

We enrich MSGs further by annotating each node with statistical attributes derived from the time series data. In particular, we include the mean and standard deviation of each signal associated with each CAN ID. Note that other time series features can be computed but we opted for simpler statistical features that have constant complexity [3]. Our hypothesis is that enriching MSGs with contextual data from the time series significantly enhances our ability to conduct graph ML and detect subsequent masquerade attacks. Note that the graph annotation process is performed as MSGs become available. This approach ensures that CAN communications’ temporal and semantic aspects are captured comprehensively. Fig. 6 shows how each node is annotated with statistical attributes, specifically the mean ($\mu$) and standard deviation ($\sigma$) of the associated signals. Algorithm 2 describes the pseudocode of this process.

Refer to caption
Figure 6: Illustration of graph annotation with statistical attributes in CAN. Each node represents a CAN ID and is annotated with the mean ($\mu$) and standard deviation ($\sigma$) of each signal associated with it. Note that the number of signals may vary per node.
Algorithm 2 Node annotation with statistical attributes
Require:  Data from CAN, list of sliding windows, list of signals
Ensure:  Graph with nodes annotated with statistical attributes
1:  for each sliding window in the list of sliding windows do
2:     Define the start and end of the current sliding window
3:     for each signal in the list of signals do
4:        Extract signal values within the current sliding window
5:        Interpolate missing values using linear interpolation
6:        Calculate the $\mu$ and $\sigma$ of the interpolated signal
7:        Assign calculated $\mu$ and $\sigma$ to the corresponding node in the graph
8:     end for
9:  end for
10:  return  Annotated graph
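
A minimal sketch of the inner loop of Algorithm 2 for a single sliding window is given below. The data layout, the signal names `wheel_fl` and `wheel_fr`, and the helper names are hypothetical. Two further assumptions are made because the text does not fix them: gaps at the edges of a window copy the nearest known value, and $\sigma$ is computed as the population standard deviation.

```python
from statistics import mean, pstdev


def interpolate(values):
    """Linearly interpolate None gaps by index; edge gaps copy the nearest value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return []
    filled = list(values)
    for i in range(len(values)):
        if filled[i] is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            frac = (i - left) / (right - left)
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled


def annotate_nodes(window_signals):
    """Map each CAN ID to {signal name: (mean, std)} for one sliding window.

    window_signals: {can_id: {signal_name: [values, with None for gaps]}}
    """
    annotations = {}
    for can_id, signals in window_signals.items():
        annotations[can_id] = {}
        for name, values in signals.items():
            series = interpolate(values)
            if series:  # skip signals with no observation in this window
                annotations[can_id][name] = (mean(series), pstdev(series))
    return annotations


# Toy window for node ID 1760 (wheel speeds); the numbers are made up.
window = {"1760": {"wheel_fl": [10.0, None, 14.0], "wheel_fr": [5.0, 5.0, 5.0]}}
attrs = annotate_nodes(window)
print(attrs)
```

Both statistics are obtained in a single pass over the window, in line with the choice of simple features in Section IV-C.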

IV-D Graph Embeddings

Graph embeddings provide a powerful way to represent the complex structure of graphs in a lower-dimensional space that ML algorithms can exploit effectively [66, 67, 68]. In our study, we focus on shallow graph embeddings and in particular the node2vec algorithm [39], which is well-suited for capturing the nuances of the weighted and directed graphs characteristic of MSGs. Moreover, node2vec facilitates the transformation of MSG nodes into vector representations, preserving the nodes’ network topology. We configure node2vec with the following parameters: 64 dimensions (d), a walk length (l) of 15, 100 walks (r) per node, a return parameter (p) of 1.5, and an in-out parameter (q) of 0.5. This configuration allows capturing the unique characteristics of MSGs while excelling at inference tasks included in the graph annotation phase [39]. Our configuration ensures that the embedding process remains efficient, thereby supporting near real-time detection capabilities, which is critical for timely identification of masquerade attacks in CANs. Note that this relatively low embedding dimension has been shown to be enough for capturing rich graph-related properties [69]. In our work, instead of focusing on node embeddings, we construct whole-graph embeddings [32]. We compute them by averaging node embeddings within sliding windows, leading to a unified representation of the network’s behavior. Specifically, we stack them per node in the MSGs. Each of these graphs is labeled indicating the normal or anomalous activity in CAN, i.e., we frame our IDS problem as a graph classification problem [70]. Algorithm 3 describes how we generate our node2vec embeddings.

Algorithm 3 node2vec embedding generation
Require:  List of graphs for each time slice, node2vec parameters (d, l, r, p, q)
Ensure:  List of node embeddings with associated statistical attributes for each graph
1:  for each graph in the list of time slice graphs do
2:     Initialize node2vec with the specified parameters (d, l, r, p, q)
3:     Fit node2vec model to the graph
4:     for each node in the graph do
5:        Retrieve or initialize the embedding vector for the node
6:        Annotate the embedding vector with statistical attributes
7:     end for
8:     Average the embedding vectors to form a consistent network representation
9:  end for
10:  return  DataFrame of averaged node embeddings with labels

The result of this process is a data frame that not only encapsulates node embeddings but also includes the augmented node attribute features with labels indicative of normal or anomalous activity states in the CAN. Before classification, we create a comprehensive feature vector for each graph by concatenating the averaged graph embeddings with the time series statistical features (mean and standard deviation of signals). This representation combines both structural (graph) and temporal (time series) information, providing a robust input for ML classifiers. These comprehensive feature vectors serve as the foundation for our supervised learning approach to masquerade attack detection.
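
The construction of the per-graph feature vector can be sketched as follows. The sketch assumes that node embeddings are already available (for instance, from a fitted node2vec model) and that every window contains the same set of CAN IDs with the same signals, so that all feature vectors have equal length; the text does not specify how varying node sets are handled. The function name and the toy values are hypothetical.

```python
def graph_feature_vector(node_embeddings, node_stats):
    """Average node embeddings and append per-node (mean, std) attributes.

    node_embeddings: {can_id: [float, ...]} with equal-length vectors.
    node_stats: {can_id: [mu_1, sigma_1, mu_2, sigma_2, ...]}
    """
    ids = sorted(node_embeddings)
    dim = len(node_embeddings[ids[0]])
    # Whole-graph embedding: element-wise mean over all node vectors.
    avg = [sum(node_embeddings[i][k] for i in ids) / len(ids) for k in range(dim)]
    # Time series block: attributes concatenated in a fixed (sorted) node order.
    stats = [x for i in ids for x in node_stats.get(i, [])]
    return avg + stats


# Toy 2-dimensional embeddings for two nodes; the study uses 64 dimensions.
embeddings = {"A": [1.0, 3.0], "B": [3.0, 5.0]}
stats = {"A": [12.0, 1.5], "B": [5.0, 0.0]}
features = graph_feature_vector(embeddings, stats)
print(features)  # [2.0, 4.0, 12.0, 1.5, 5.0, 0.0]
```

Sorting the node IDs keeps the position of each statistical attribute stable across windows, which a tree-based classifier requires in order to compare the same signal from one window to the next.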

IV-E Supervised Learning

To evaluate the effectiveness of our feature representation and overall anomaly detection framework, we employ supervised learning techniques. We experiment with two ensemble learning methods: Random Forest (RF) [71] and XGBoost [72]. These decision tree-based ensemble learning methods are renowned for their high accuracy, scalability, and robustness, making them exceptionally suitable for dealing with the high-dimensional space characteristic of CAN data. The RF model is instantiated with 100 trees, and depth is capped at 20 to prevent overfitting. A combination of a minimum samples split of 6 and minimum samples leaf of 2 is used to control the growth of the trees further. As the training dataset exhibits a 1:10 ratio between the minority (attack) and majority (normal) classes, we balance the dataset using the synthetic minority over-sampling technique (SMOTE) [73]. Stratified 5-fold cross-validation is performed to ensure the model’s generalizability during training. The XGBoost model yielded similar results to the RF classifier in our experiments. Despite the comparable performance, we opted to report only results for the RF classifier due to its simpler interpretability and less intensive hyperparameter tuning process, which aligns well with the goals of this study.

IV-F Evaluation Setup

Evaluation is performed at the sliding window level. In doing so, we verify the proportion of sliding windows coinciding with an attack region in the testing captures, which refers to the recorded CAN bus data used for testing. We found that for the combinations of $\omega$ and $\delta$, the proportion of windows containing attacks tended to be balanced, i.e., greater than 40%. To quantitatively assess the performance of our anomaly detection framework, we use the area under the receiver operating characteristic curve (AUC-ROC) [74]. The AUC-ROC is a comprehensive evaluation metric for classification problems across various threshold settings. It reflects the model’s ability to distinguish between classes, with higher values indicating better performance. The ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR) at different threshold levels. The TPR, also known as sensitivity or recall, measures the proportion of actual positives correctly identified, while the FPR measures the proportion of actual negatives incorrectly identified as positive. Mathematically, the AUC-ROC is defined as:

$$\text{AUC-ROC}=\int_{0}^{1}\text{TPR}\left(\text{FPR}^{-1}(x)\right)\,dx. \tag{1}$$

The value of the AUC-ROC ranges between 0 and 1. An AUC-ROC of 1 indicates perfect classification, whereas an AUC-ROC of 0.5 suggests no discriminative power, equivalent to random guessing. An AUC-ROC below 0.5 indicates that the model performs worse than random guessing. In practical terms, the AUC-ROC can be interpreted as the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. This interpretation is particularly useful in our context, where we aim to detect subtle masquerade attacks embedded within normal CAN traffic.
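
The ranking interpretation of the AUC-ROC can be illustrated with a short, library-free sketch. The labels and scores below are made up, and the function `auc_roc` is written only for this example (the experiments rely on the sklearn.metrics module). Ties between a positive and a negative score are counted as one half.

```python
def auc_roc(labels, scores):
    """AUC-ROC as P(score of a random positive > score of a random negative).

    labels: 1 for attack windows, 0 for normal windows.
    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative window")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
auc = auc_roc(labels, scores)
print(auc)
```

In this toy case, 8 of the 9 positive-negative pairs are ranked correctly, so the AUC-ROC equals 8/9.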

IV-G Dataset

Our unifying framework is evaluated on the Real ORNL Automotive Dynamometer (ROAD) dataset, a comprehensive set of CAN data collected from a real vehicle at Oak Ridge National Laboratory (ORNL) [15]. This dataset is particularly valuable due to the inclusion of physically verified fabrication and simulated masquerade attacks that provide a realistic environment for testing CAN security methods. The ROAD dataset stands out for its unparalleled quality of CAN information covering the most credible CAN attacks reflecting various driving scenarios. The dataset comprises 3.5 hours of recorded data, with 3 hours allocated for training and the remaining 30 minutes set aside for testing. The test data incorporates five masquerade attacks: ‘correlated signal’, ‘max engine coolant temperature’, ‘max speedometer’, ‘reverse light off’, and ‘reverse light on’. These attacks are designed to manipulate specific vehicle states. This dataset enables us to conduct a comprehensive evaluation of our framework under a variety of attack conditions. Table I provides a comprehensive overview of the masquerade attacks used in our study and their impacts on the vehicle’s state.

TABLE I: Characteristics of masquerade attacks in the ROAD dataset.

| Attack name | Attack details and consequences |
|---|---|
| Correlated signal | Injects varying values for wheel speeds, resulting in the vehicle coming to a halt. |
| Max speedometer | Injects the maximum value to be displayed on the speedometer. |
| Max coolant temp | Injects the maximum value, triggering the coolant warning light. |
| Reverse light on/off | Toggles the reverse light irrespective of the actual gear position. |

V Results

Our experimental setup was powered by a 12th Gen Intel® Core i9-12900HK processor, an NVIDIA GeForce RTX 3080 Ti GPU, and bolstered by 32GB of RAM. We employed NetworkX, a powerful library for creating and analyzing complex networks, as a key tool in our study. All methods were implemented using Python 3.8.18. For computing evaluation metrics, we utilized the sklearn.metrics module. Graph embeddings were facilitated by node2vec 0.4.6. We focus on comparing two different settings, i.e., a baseline using only graph embeddings and combining graph embeddings with time series contextual information. To ensure the reproducibility of our proposed settings, we have made the code available in a GitHub repository [24].

Section V-A shows detection results using a baseline model with only graph embeddings, demonstrating significant variation in performance with different $\omega$ and $\delta$ combinations using time-based windows (see Fig. 7). Section V-B shows how incorporating time series features with graph embeddings enhances detection capabilities, improving AUC-ROC values across various attacks (see Fig. 8) using time-based windows. Section V-C provides a comparative analysis of detection settings, showing that combining graph embeddings with time series features generally enhances detection performance, supported by detailed AUC-ROC metrics across different attack scenarios (see Table II). Finally, Section V-D investigates the effect of using more granular windows based on the number of samples. We show that even with more granular window lengths, the proposed framework effectively detects masquerade attacks (see Table V and Figs. 10 and 10).

V-A Baseline Model with Graph Embeddings Only

Our evaluation begins with the establishment of a baseline model that relies solely on graph embeddings, i.e., focusing only on the topology of the MSGs. This baseline model provides a fundamental comparison point and is essential for assessing the effectiveness of incorporating additional features. Fig. 7 shows the performance metrics for this baseline, serving as the benchmark for our subsequent enhancements. Specifically, the AUC-ROC values for ‘correlated signal attack’ reach an optimal performance of 0.98 with $\omega=8$ seconds and $\delta=1$ second, showcasing strong model effectiveness in this configuration. In contrast, ‘reverse light on attack’ achieves its highest AUC-ROC of 0.97 with $\omega=7$ seconds and $\delta=4$ seconds, demonstrating its dependence on extended time frames. In general, for the remaining attacks, we notice that high classification results can be obtained for $\omega$ and $\delta$ of several seconds. These insights confirm that adjusting the configurations to suit different types of attacks is crucial for maximizing detection capabilities. The variance in performance across different settings shows the importance of fine-tuning the sliding window parameters to enhance the efficacy of anomaly detection systems in vehicular networks.

Refer to caption
Figure 7: AUC-ROC across different window size and offset combinations for masquerade attacks detected by the baseline model using only graph embeddings.
Refer to caption
Figure 8: AUC-ROC across different window size and offset combinations for masquerade attacks detected by incorporating both graph embeddings and time series features.

V-B Enhanced Model with Graph Embeddings and Time Series Features

Building upon our baseline, we incorporate time series features of CAN signals alongside the graph embeddings. This setting harnesses both the structural patterns and the temporal dynamics of CAN frames, aiming to achieve a more comprehensive anomaly detection. Fig. 8 depicts the performance metrics of this enhanced model. The inclusion of time series data significantly improves detection capabilities, as evidenced by the AUC-ROC scores. For ‘correlated signal attack’, the model achieves an AUC-ROC peak of 0.99 at $\omega=3$ seconds and $\delta=3$ seconds. This suggests a strong model response to quick changes in CAN signal patterns. Similarly, ‘reverse light on attack’ records its highest AUC-ROC at 0.99 for $\omega=7$ seconds and $\delta=4$ seconds, indicating effective monitoring of critical engine parameters. In general, we notice that when integrating time series features, the required time to get optimal performance is lower than when using only graph embeddings. Specifically, across the five masquerade attacks, the average optimal $\omega$ decreases from 9.8 seconds to 6 seconds, while the average optimal $\delta$ increases from 3.2 seconds to 4.8 seconds when combining graph embeddings with time series features. This 38.8% reduction in $\omega$ coupled with a 50% increase in $\delta$ allows for more granular analysis of CAN messages while spacing the analyses further apart, enabling our method to capture detailed patterns more effectively.
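
The averages and percentage changes quoted above follow directly from the optimal ($\omega$, $\delta$) pairs listed next to the maximum AUC-ROC values in Table II. The short check below reproduces the arithmetic; the variable and function names are chosen for this example only.

```python
# Optimal (omega, delta) pairs in seconds, read off the max entries of Table II.
baseline = [(8, 1), (9, 8), (14, 2), (11, 1), (7, 4)]   # graph embeddings only
enhanced = [(3, 3), (7, 4), (8, 8), (5, 5), (7, 4)]     # embeddings + time series


def column_mean(pairs, index):
    """Average the index-th component over all pairs."""
    return sum(pair[index] for pair in pairs) / len(pairs)


omega_base, omega_enh = column_mean(baseline, 0), column_mean(enhanced, 0)
delta_base, delta_enh = column_mean(baseline, 1), column_mean(enhanced, 1)

# Relative changes, in percent, with respect to the baseline averages.
omega_reduction = 100 * (omega_base - omega_enh) / omega_base
delta_increase = 100 * (delta_enh - delta_base) / delta_base

print(f"omega: {omega_base:.1f} s -> {omega_enh:.1f} s ({omega_reduction:.1f}% lower)")
print(f"delta: {delta_base:.1f} s -> {delta_enh:.1f} s ({delta_increase:.1f}% higher)")
```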

TABLE II: Comparison of detection settings based on the AUC-ROC heatmaps. The table shows the mean ($\mu$), standard deviation ($\sigma$), median ($\eta$), minimum ($\min$), and maximum ($\max$) of the AUC-ROC for each setting across different attack scenarios, as well as the average ($\overline{\max}$) and standard deviation ($\sigma_{\max}$) of the maximum values obtained by each attack and experiment type, either embeddings only or embeddings with time series information.

| Experiment | Correlated signal attack | Max engine coolant temperature attack | Max speedometer attack | Reverse light off attack | Reverse light on attack | Experiment max |
|---|---|---|---|---|---|---|
| Graph embeddings only | $\mu=0.94$, $\sigma=0.02$, $\eta=0.93$, min = 0.90 (15,8), max = 0.98 (8,1) | $\mu=0.91$, $\sigma=0.04$, $\eta=0.92$, min = 0.84 (15,14), max = 0.98 (9,8) | $\mu=0.91$, $\sigma=0.03$, $\eta=0.91$, min = 0.85 (4,1), max = 0.98 (14,2) | $\mu=0.93$, $\sigma=0.02$, $\eta=0.93$, min = 0.89 (6,3), max = 0.98 (11,1) | $\mu=0.92$, $\sigma=0.03$, $\eta=0.92$, min = 0.88 (2,1), max = 0.97 (7,4) | $\overline{\max}=0.98$, $\sigma_{\max}=0.00$ |
| Graph embeddings + Time series features | $\mu=0.96$, $\sigma=0.02$, $\eta=0.96$, min = 0.92 (6,3), max = 0.99 (3,3) | $\mu=0.94$, $\sigma=0.04$, $\eta=0.94$, min = 0.86 (8,1), max = 0.99 (7,4) | $\mu=0.94$, $\sigma=0.03$, $\eta=0.93$, min = 0.87 (13,10), max = 0.99 (8,8) | $\mu=0.96$, $\sigma=0.02$, $\eta=0.96$, min = 0.91 (6,3), max = 0.99 (5,5) | $\mu=0.94$, $\sigma=0.03$, $\eta=0.94$, min = 0.90 (4,1), max = 0.99 (7,4) | $\overline{\max}=0.99$, $\sigma_{\max}=0.00$ |
| Attack max | $\overline{\max}=0.98$, $\sigma_{\max}=0.01$ | $\overline{\max}=0.98$, $\sigma_{\max}=0.01$ | $\overline{\max}=0.98$, $\sigma_{\max}=0.01$ | $\overline{\max}=0.98$, $\sigma_{\max}=0.01$ | $\overline{\max}=0.98$, $\sigma_{\max}=0.01$ | |

V-C Metrics Summary

We compare the effectiveness of two detection settings: using embeddings only and embeddings combined with time series features. The evaluation focuses on the AUC-ROC metric (see Section IV-F) to compare the performance of these settings against various attack scenarios.

Table II summarizes results across heatmaps. We place attack categories in the columns and detection settings in the rows. This means that each cell in this table shows the summary statistics from the heatmaps for a specific detection setting and attack. We focus on summary statistics including mean ($\mu$), standard deviation ($\sigma$), median ($\eta$), minimum ($\min$), and maximum ($\max$). In addition, we also compute the average ($\overline{\max}$) and standard deviation ($\sigma_{\max}$) of the maximum values for each of the attack detection settings and attack types in the ROAD dataset. They are displayed as the last row and column of the table.

Graph embeddings combined with time series features generally outperform the graph embeddings-only setting across all attack types. Specifically, for ‘correlated signal attack’, the graph embeddings with time series features setting achieves a mean AUC-ROC ($\mu$) of 0.96 with a standard deviation ($\sigma$) of 0.02, compared to the embeddings-only setting with a mean AUC-ROC of 0.94 and a standard deviation of 0.02. For ‘max engine coolant temperature attack’, the embeddings with time series setting has a mean AUC-ROC of 0.94 and a standard deviation of 0.04, which is higher than the graph embeddings-only setting with a mean AUC-ROC of 0.91 and a standard deviation of 0.04. In the case of ‘max speedometer attack’, the mean AUC-ROC for the graph embeddings with time series setting is 0.94 with a standard deviation of 0.03. In contrast, the embeddings-only setting achieves a mean AUC-ROC of 0.91 with a standard deviation of 0.03. For ‘reverse light off attack’, the graph embeddings with time series setting achieves a mean AUC-ROC of 0.96 with a standard deviation of 0.02, while the graph embeddings-only setting has a mean AUC-ROC of 0.93 and a standard deviation of 0.02. Finally, for ‘reverse light on attack’, the mean AUC-ROC for the graph embeddings with time series setting is 0.94 with a standard deviation of 0.03, compared to the graph embeddings-only setting with a mean AUC-ROC of 0.92 and a standard deviation of 0.03.

For ‘correlated signal attack’, the graph embeddings with time series setting achieves a median ($\eta$) of 0.96, a minimum ($\min$) of 0.92, and a maximum ($\max$) of 0.99. These metrics provide further insights into the detection performance by highlighting the central tendency and range of the results. The average ($\overline{\max}$) and standard deviation ($\sigma_{\max}$) of the maximum values obtained by each attack and experiment type are also reported. These values illustrate the peak performance and consistency of the settings, respectively. For instance, averaged over the five attacks, the maximum AUC-ROC ($\overline{\max}$) is 0.98 for the embeddings-only setting and 0.99 for the embeddings with time series setting. The standard deviation ($\sigma_{\max}$) for both settings is 0.00, demonstrating consistent peak performance across attacks.

To further validate the differences in detection effectiveness between the two settings, we employed the Mann-Whitney U test and the Kolmogorov-Smirnov (KS) test. The Mann-Whitney U test [75] is a non-parametric statistical test that evaluates whether there is a difference between two independent samples. The KS test [76] is another non-parametric test that assesses whether two samples come from the same distribution. In our analysis, we compared the AUC-ROC value distributions of the two settings: one using only embeddings and the other combining these with time series features. The null hypothesis is that the AUC-ROC values for the combined setting are less than or equal to those for the embeddings-only setting; the alternative hypothesis is that the AUC-ROC values for the combined setting are greater. A low $p$-value indicates a significant difference between the two settings. We used a significance level of 0.05. Table III shows significant differences between the observed distributions for all masquerade attacks. For each attack, both the Mann-Whitney U and KS tests produce $p$-values well below 0.05, indicating that the two AUC-ROC distributions differ significantly. Consequently, we reject the null hypothesis for all tested masquerade attacks, confirming that the combined setting performs significantly better. This supports our hypothesis that annotated MSGs enable effective detection of masquerade attacks across various window and offset combinations.
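
The one-sided comparison described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' analysis code: the AUC-ROC lists are hypothetical, the function names are our own, and the $p$-value uses a normal approximation without tie or continuity correction. A library routine (for example SciPy's `mannwhitneyu` with `alternative="greater"` and `ks_2samp`) would normally be preferred; the source does not state which implementation was used.

```python
import math

def mann_whitney_u_greater(x, y):
    """Return (U, p) for H1: values in x tend to exceed values in y.

    The p-value is a normal approximation with no tie or continuity
    correction, so it is only approximate for small or tied samples.
    """
    n1, n2 = len(x), len(y)
    # U counts pairs (xi, yj) with xi > yj; a tie contributes one half.
    u = sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0 for xi in x for yj in y)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability
    return u, p

def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in xs + ys:
        cdf_x = sum(1 for a in xs if a <= v) / len(xs)
        cdf_y = sum(1 for b in ys if b <= v) / len(ys)
        d = max(d, abs(cdf_x - cdf_y))
    return d

# Hypothetical AUC-ROC values, one per (window, offset) configuration.
combined = [0.96, 0.94, 0.97, 0.95, 0.98, 0.93]
embeddings_only = [0.92, 0.91, 0.94, 0.90, 0.93, 0.89]

u, p = mann_whitney_u_greater(combined, embeddings_only)
d = ks_statistic(combined, embeddings_only)
print(f"U = {u}, approximate one-sided p = {p:.4f}, KS statistic = {d:.3f}")
```

A large U relative to its maximum ($n_1 n_2$) together with a small $p$-value is what leads to rejecting the null hypothesis in favor of the combined setting.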

TABLE III: Mann-Whitney U and KS test results for different masquerade attacks.
| Attack name | Mann-Whitney U statistic | Mann-Whitney p-value | KS statistic | KS p-value |
|---|---|---|---|---|
| Correlated signal | 11337.0 | 5.55e-16 | 0.462 | 3.81e-12 |
| Max speedometer | 10082.0 | 7.98e-09 | 0.328 | 2.34e-06 |
| Reverse light off | 11033.0 | 4.98e-14 | 0.429 | 1.73e-10 |
| Reverse light on | 10818.0 | 9.84e-13 | 0.429 | 1.73e-10 |
| Max engine coolant | 9864.0 | 8.01e-08 | 0.328 | 2.34e-06 |
Figure 9: AUC-ROC across different window size and offset combinations for masquerade attacks detected by the baseline model using only graph embeddings.
Figure 10: AUC-ROC across different window size and offset combinations for masquerade attacks detected by incorporating both graph embeddings and time series features.

V-D Evaluation Based on Sample Windows

We investigate the effect of sample-based windows in our detection framework’s performance, as opposed to time-based windows. Our aim is to understand the impact of varying the number of CAN message samples in each sliding window on masquerade attack detection. In this context, time-based windows refer to windows defined by a fixed duration, whereas sample-based windows are defined by a fixed number of CAN messages. We experiment with different window sizes and offsets to find the optimal configurations for detecting anomalies.
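
The distinction between the two windowing schemes can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the frame layout `(timestamp_ms, arbitration_id)`, the function names, and the synthetic timestamps are hypothetical, and we assume the offset is the step between the starts of consecutive windows.

```python
from typing import Iterator, List, Sequence, Tuple

# Hypothetical frame layout: (timestamp in milliseconds, arbitration ID).
Frame = Tuple[int, int]

def sample_windows(frames: Sequence[Frame], omega: int, delta: int) -> Iterator[List[Frame]]:
    """Yield windows of exactly `omega` frames, advancing `delta` frames per step."""
    for start in range(0, len(frames) - omega + 1, delta):
        yield list(frames[start:start + omega])

def time_windows(frames: Sequence[Frame], omega_ms: int, delta_ms: int) -> Iterator[List[Frame]]:
    """Yield windows spanning `omega_ms` milliseconds, advancing `delta_ms` per step."""
    if not frames:
        return
    t0, t_end = frames[0][0], frames[-1][0]
    k = 0
    # Only emit windows that fit entirely inside the capture.
    while t0 + k * delta_ms + omega_ms <= t_end:
        lo = t0 + k * delta_ms
        yield [f for f in frames if lo <= f[0] < lo + omega_ms]
        k += 1

# Synthetic capture with bursty traffic, so frame density varies over time.
timestamps_ms = [0, 5, 10, 30, 35, 40, 70, 75, 80, 85]
frames = [(t, 0x100 + (i % 2)) for i, t in enumerate(timestamps_ms)]

by_count = list(sample_windows(frames, omega=4, delta=2))
by_time = list(time_windows(frames, omega_ms=40, delta_ms=20))
print([len(w) for w in by_count])  # every window holds the same number of frames
print([len(w) for w in by_time])   # frame count depends on bus activity
```

Sample-based windows always contain the same number of frames, whereas the number of frames in a time-based window depends on bus traffic during that interval. When the offset equals the window size, as in several of the configurations evaluated here, consecutive windows do not overlap.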

Figs. 9 and 10 are heatmaps of the AUC-ROC values for different $\omega$ and $\delta$ combinations in the embeddings-only and combined settings, respectively. These heatmaps show that the combined setting generally outperforms the embeddings-only setting, especially at certain window sizes and offsets. However, the optimal configuration varies significantly depending on the attack type, emphasizing the importance of careful parameter selection in practical applications. For instance, in the ‘correlated signal attack’, the combined setting achieves an AUC-ROC of 0.84 with $\omega = 100$ and $\delta = 100$, whereas the embeddings-only setting peaks at 0.79 with the same configuration. Similarly, in the ‘max engine coolant temperature attack’, the combined setting reaches an AUC-ROC of 0.97 with $\omega = 350$ and $\delta = 300$, outperforming the embeddings-only setting, which attains a maximum of 0.89 with $\omega = 200$ and $\delta = 200$. In the ‘reverse light off attack’, the combined setting reaches an AUC-ROC of 0.81 with $\omega = 400$ and $\delta = 400$, while the embeddings-only setting reaches 0.76 with the same configuration. In the remaining attacks (i.e., ‘max speedometer attack’ and ‘reverse light on attack’), combining graph embeddings with time series features allows the classifier to obtain higher AUC-ROC values without requiring more samples: the best window shrinks from $\omega = 350$ to $\omega = 300$ for the ‘max speedometer attack’ and stays at $\omega = 150$ for the ‘reverse light on attack’ (see Table V).

Table V presents the summary statistics for all masquerade attacks in the ROAD dataset. For both settings, the highest and lowest performances occur in the ‘max engine coolant temperature attack’ scenario, at different window sizes and offsets. Averaged across attacks, the graph embeddings-only setting achieves a mean AUC-ROC ($\mu$) of 0.61 with a standard deviation ($\sigma$) of 0.06. When we combine graph embeddings with time series features, the mean AUC-ROC ($\mu$) improves to 0.68, with a standard deviation ($\sigma$) of 0.05. These findings corroborate the time-based results: integrating time series features with graph embeddings generally enhances the model’s performance.

The granular window approach proves effective, although performance varies significantly across attack types and window configurations. Individual attack scenarios reveal interesting patterns. For instance, the ‘correlated signal attack’ shows consistent improvement when time series features are added, with the mean AUC-ROC increasing from 0.68 to 0.74. The ‘max engine coolant temperature attack’ exhibits the highest variability in performance, indicating high sensitivity to the choice of $\omega$ and $\delta$. The ‘max speedometer attack’ and ‘reverse light on attack’ prove more challenging to detect, with lower mean AUC-ROC values in both settings. The ‘reverse light off attack’ shows moderate improvement with the addition of time series features. Across the best-performing configurations, the average $\omega$ increased slightly from 240 to 260 samples when incorporating time series features, while the average $\delta$ remained constant at 220 samples. Despite this minor increase in $\omega$, the combined approach achieved higher peak AUC-ROC values in every attack scenario. For instance, in the ‘correlated signal attack’, the peak AUC-ROC improved from 0.79 to 0.84, while in the ‘max engine coolant temperature attack’, it increased from 0.89 to 0.97. These results suggest that the integration of time series features with graph embeddings enhances the model’s ability to detect anomalies, even when requiring slightly larger sample windows. This improvement in detection capability outweighs the small increase in computational requirements, further validating the effectiveness of our combined approach in various window configurations.
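
The averages of $\omega$ and $\delta$ quoted above follow directly from the best-performing configurations listed as the "max" entries of Table V. The short check below reproduces them; the dictionary and function names are our own, and the (window, offset) pairs are copied from the table.

```python
from statistics import mean

# Best (omega, delta) per attack, read from the "max" entries of Table V.
best_embeddings_only = {
    "correlated signal": (100, 100),
    "max engine coolant temperature": (200, 200),
    "max speedometer": (350, 250),
    "reverse light off": (400, 400),
    "reverse light on": (150, 150),
}
best_combined = {
    "correlated signal": (100, 100),
    "max engine coolant temperature": (350, 300),
    "max speedometer": (300, 150),
    "reverse light off": (400, 400),
    "reverse light on": (150, 150),
}

def average_config(best):
    """Average window size and offset over the per-attack best configurations."""
    omegas, deltas = zip(*best.values())
    return mean(omegas), mean(deltas)

avg_embeddings_only = average_config(best_embeddings_only)
avg_combined = average_config(best_combined)
print(avg_embeddings_only)  # (240, 220)
print(avg_combined)         # (260, 220)
```

The increase in average $\omega$ comes entirely from the ‘max engine coolant temperature attack’ (200 to 350 samples), partly offset by the ‘max speedometer attack’ (350 to 300 samples); the other three attacks peak at the same configuration in both settings.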

We also found the distributions of AUC-ROC values to be significantly different between the two settings. The Mann-Whitney U and KS tests confirmed this difference, indicating the superior detection performance of the combined setting. Table IV summarizes the results, showing significant differences across all tested masquerade attacks. For instance, the ‘correlated signal attack’ yielded a Mann-Whitney U statistic of 1069.5 with a $p$-value of 9.84e-07, and a KS statistic of 0.58 with a $p$-value of 2.61e-06. Similar trends were observed for other attacks, such as the ‘max engine coolant’ and ‘max speedometer’ attacks, confirming the effectiveness of incorporating time series features with graph embeddings. The results for each attack type consistently indicate that the combined setting outperforms the embeddings-only setting, validating the improvements seen in the AUC-ROC metrics.

TABLE IV: Mann-Whitney U and Kolmogorov-Smirnov test results for different masquerade attacks in the sample-based window experiment.
| Attack name | Mann-Whitney U statistic | Mann-Whitney p-value | KS statistic | KS p-value |
|---|---|---|---|---|
| Correlated signal | 1069.5 | 9.84e-07 | 0.58 | 2.61e-06 |
| Max engine coolant | 811.5 | 0.033 | 0.33 | 0.018 |
| Max speedometer | 1085.0 | 4.20e-07 | 0.56 | 9.30e-06 |
| Reverse light off | 1040.5 | 4.84e-06 | 0.50 | 9.36e-05 |
| Reverse light on | 1112.0 | 8.35e-08 | 0.61 | 6.75e-07 |
TABLE V: Summary statistics for the evaluation metrics based on sample-based windows. We show the mean ($\mu$), standard deviation ($\sigma$), median ($\eta$), minimum ($\min$), and maximum ($\max$) of the AUC-ROC for each setting across different attack scenarios, as well as the average ($\overline{\max}$) and standard deviation ($\sigma_{\max}$) of the maximum values obtained by each attack and experiment type, either embeddings-only or embeddings with time series information. Parenthesized pairs give the ($\omega$, $\delta$) configuration at which the minimum or maximum occurs.
| Experiment | Attack | $\mu$ | $\sigma$ | $\eta$ | $\min$ ($\omega$, $\delta$) | $\max$ ($\omega$, $\delta$) |
|---|---|---|---|---|---|---|
| Graph embeddings-only | Correlated signal attack | 0.68 | 0.05 | 0.68 | 0.54 (400, 400) | 0.79 (100, 100) |
| Graph embeddings-only | Max engine coolant temperature attack | 0.63 | 0.19 | 0.69 | 0.17 (400, 300) | 0.89 (200, 200) |
| Graph embeddings-only | Max speedometer attack | 0.56 | 0.05 | 0.56 | 0.45 (300, 250) | 0.65 (350, 250) |
| Graph embeddings-only | Reverse light off attack | 0.63 | 0.05 | 0.64 | 0.52 (300, 200) | 0.76 (400, 400) |
| Graph embeddings-only | Reverse light on attack | 0.55 | 0.05 | 0.56 | 0.45 (350, 350) | 0.68 (150, 150) |
| Graph embeddings + Time series features | Correlated signal attack | 0.74 | 0.05 | 0.75 | 0.60 (400, 400) | 0.84 (100, 100) |
| Graph embeddings + Time series features | Max engine coolant temperature attack | 0.70 | 0.19 | 0.76 | 0.24 (200, 150) | 0.97 (350, 300) |
| Graph embeddings + Time series features | Max speedometer attack | 0.63 | 0.04 | 0.64 | 0.52 (300, 250) | 0.70 (300, 150) |
| Graph embeddings + Time series features | Reverse light off attack | 0.69 | 0.05 | 0.69 | 0.58 (50, 50) | 0.81 (400, 400) |
| Graph embeddings + Time series features | Reverse light on attack | 0.62 | 0.04 | 0.61 | 0.52 (350, 350) | 0.72 (150, 150) |

| Experiment max | $\overline{\max}$ | $\sigma_{\max}$ |
|---|---|---|
| Graph embeddings-only | 0.75 | 0.10 |
| Graph embeddings + Time series features | 0.81 | 0.11 |

| Attack max | $\overline{\max}$ | $\sigma_{\max}$ |
|---|---|---|
| Correlated signal attack | 0.81 | 0.04 |
| Max engine coolant temperature attack | 0.93 | 0.06 |
| Max speedometer attack | 0.68 | 0.04 |
| Reverse light off attack | 0.79 | 0.04 |
| Reverse light on attack | 0.70 | 0.03 |
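
As a sanity check on the marginal statistics in Table V, the sketch below recomputes $\overline{\max}$ and $\sigma_{\max}$ from the per-attack maxima. It assumes the sample standard deviation (divisor $n - 1$), which reproduces the reported values to two decimals but is our inference rather than something the source states; the variable names are our own.

```python
from statistics import mean, stdev

attacks = [
    "correlated signal",
    "max engine coolant temperature",
    "max speedometer",
    "reverse light off",
    "reverse light on",
]
# Maximum AUC-ROC per attack (same order as `attacks`), copied from Table V.
max_auc = {
    "embeddings-only": [0.79, 0.89, 0.65, 0.76, 0.68],
    "embeddings + time series": [0.84, 0.97, 0.70, 0.81, 0.72],
}

# "Experiment max": aggregate each setting's maxima across the five attacks.
experiment_max = {
    setting: (mean(values), stdev(values)) for setting, values in max_auc.items()
}

# "Attack max": aggregate each attack's maxima across the two settings.
attack_max = {}
for i, attack in enumerate(attacks):
    pair = [values[i] for values in max_auc.values()]
    attack_max[attack] = (mean(pair), stdev(pair))

for name, (avg, sd) in {**experiment_max, **attack_max}.items():
    print(f"{name}: mean of maxima = {avg:.3f}, std of maxima = {sd:.3f}")
```

The experiment-level values come out at roughly 0.754 and 0.095 for the embeddings-only setting and 0.808 and 0.108 for the combined setting, matching the table after rounding. Some attack-level means fall exactly halfway between two-decimal values (for example 0.815 for the ‘correlated signal attack’), so the last digit depends on the rounding convention.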

VI Discussion

This study introduces a novel approach for detecting masquerade attacks in CANs using graph ML. By combining graph embeddings with time series features, our method demonstrates significant improvements in detection capabilities across various masquerade attack scenarios. Our results consistently show that this combined approach outperforms the graph embeddings-only method across all attack types examined in the ROAD dataset. The improvement in performance is statistically significant, as evidenced by the Mann-Whitney U and KS tests (see Tables III and IV). Additionally, our exploration of sample-based windows reveals promising results, particularly with smaller $\omega$ values, suggesting that fine-grained analysis of CAN messages can capture even subtle anomalies.

The enhanced performance of our combined approach emphasizes the importance of incorporating both structural and temporal information in CAN IDS. By leveraging graph embeddings, we capture the complex interactions between different ECUs, while the addition of time series features allows us to model the temporal dynamics of CAN signals. This comprehensive view enables our system to detect subtle deviations that are characteristic of sophisticated masquerade attacks. Moreover, our experimental results indicate that the integration of graph embeddings with time series features optimizes the detection parameters, reducing the required $\omega$ for optimal detection by 38.8% while increasing the optimal $\delta$ by 50%. This adjustment supports more efficient anomaly detection, crucial for timely responses in vehicular networks. The effectiveness of smaller $\omega$ values coupled with larger $\delta$ values suggests that detailed analyses of CAN messages can reveal anomalous behavior patterns while being performed less frequently, striking a balance between granularity and efficiency. This approach could be particularly beneficial in resource-constrained vehicular environments, with important implications for the design of potential real-time IDS.

While our results are promising, it is important to acknowledge the limitations of this study:

Lack of Validation in a Real Environment: Our study is simulation-based and lacks validation in a real vehicle environment. Real-world factors such as network latency, sensor noise, and varying driving conditions could potentially impact the performance of our system. We partitioned logs of already collected CAN captures to simulate a streaming environment of CAN data using windows. Future work should involve empirical validation on a moving vehicle, including the necessary hardware and performance tuning involved to do so.

Evaluation Based Solely on ROAD Dataset: Our evaluation is based solely on the ROAD dataset [15], which, although comprehensive, may not capture the full range of potential attack scenarios in real-world settings. This limitation suggests that future studies should incorporate a variety of datasets, including those with more recent and diverse attack scenarios, to enhance the robustness and generalizability of the findings.

Computational Complexity: The computational complexity of our approach, particularly in the graph embedding phase, may present challenges for real-time implementation in resource-constrained vehicular environments. Further optimization may be necessary for practical deployment. Future research should focus on developing more efficient algorithms and exploring hardware acceleration to make the approach viable for real-time applications.

Use of Default Parameter Values in Detection Algorithms: We used the same default parameter values as the original algorithms, such as the number of trees and maximum depth for the Random Forest classifier, and similar default settings for the XGBoost algorithm. While this ensures consistency in evaluation, it may not represent the optimal configuration for each algorithm in the context of our specific application. This limitation underscores the importance of performing extensive hyperparameter tuning in future work to identify the best settings for each algorithm to maximize their detection performance.

In summary, our study contributes to the growing body of research on graph-based CAN IDS by demonstrating the effectiveness of combining graph embeddings with time series features. This approach not only enhances detection capabilities but also provides a flexible framework adaptable to various vehicular systems and attack patterns.

VII Conclusion

This paper introduces a novel framework for detecting masquerade attacks in CANs using graph machine learning. Our approach uniquely combines graph embeddings with time series analysis, significantly enhancing detection capabilities across various attack patterns. By representing CAN frames as MSGs and enriching nodes with statistical attributes from time series data, we achieve a more comprehensive view of network behavior. This integration allows for the detection of subtle anomalies characteristic of sophisticated masquerade attacks, outperforming methods based solely on graph embeddings across all tested attack types in the ROAD dataset [15]. Our method demonstrates statistically significant improvements (see Tables III and IV), particularly in detecting challenging attacks like the max speedometer attack, which has proven elusive in previous studies [60].

Looking forward, we plan to extend this research by optimizing our approach for real-time implementation in resource-constrained vehicular environments. This could involve exploring more efficient graph neural network architectures, such as those proposed by Zhang et al. [23]. Additionally, extending the evaluation to a broader range of datasets and attack scenarios will be crucial for validating the generalizability of our method. As vehicles become increasingly connected and autonomous, our work contributes to the critical field of automotive cybersecurity, paving the way for more robust and adaptive IDS in future vehicles and cyber-physical systems.

Acknowledgments

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a non-exclusive, paid up, irrevocable, world-wide license to publish or reproduce the published form of the manuscript, or allow others to do so, for U.S. Government purposes. The DOE will provide public access to these results in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). This research was sponsored in part by Oak Ridge National Laboratory’s (ORNL’s) Laboratory Directed Research and Development program through the Sustainable Research Pathways (SRP) program and by the DOE. This research is also partially funded by US Department of Energy Award No. DE-FE0032089. There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of this manuscript.

References

  • [1] S. Halder, M. Conti, and S. K. Das, “Coids: A clock offset based intrusion detection system for controller area networks,” in Proc. 21st Int. Conf. Distrib. Comput. Netw., Jan 2020, pp. 1–10.
  • [2] B. Palaniswamy, S. A. Camtepe, E. Foo, and J. Pieprzyk, “An efficient authentication scheme for intra-vehicular controller area network,” IEEE Trans. Inf. Forensics Secur., vol. 15, pp. 3107–3122, 2020.
  • [3] M. H. Shahriar, W. Lou, and Y. T. Hou, “Cantropy: Time series feature extraction-based intrusion detection systems for controller area networks,” in Proc. Symp. Veh. Secur. Privacy (VehicleSec), 2023, pp. 1–8.
  • [4] S. Tariq, S. Lee, and S. S. Woo, “Cantransfer: Transfer learning based intrusion detection on a controller area network using convolutional lstm network,” in Proc. 35th Annu. ACM Symp. Appl. Comput., Mar 2020, pp. 1048–1055.
  • [5] H. Zhang, X. Meng, X. Zhang, and Z. Liu, “Cansec: A practical in-vehicle controller area network security evaluation tool,” Sensors, vol. 20, no. 17, p. 4900, 2020.
  • [6] Z. Yu, Y. Liu, G. Xie, R. Li, S. Liu, and L. T. Yang, “Tce-ids: Time interval conditional entropy-based intrusion detection system for automotive controller area networks,” IEEE Trans. Ind. Inf., vol. 19, no. 2, pp. 1185–1195, 2022.
  • [7] C. Miller and C. Valasek, “Remote exploitation of an unaltered passenger vehicle,” in Black Hat USA, vol. 2015, no. S 91, 2015, pp. 1–91.
  • [8] S. Sharmin, H. Mansor, A. F. A. Kadir, and N. A. Aziz, “Comparative evaluation of anomaly-based controller area network ids,” in Proc. 2023 12th Int. Conf. Softw. Comput. Appl., Feb 2023, pp. 218–226.
  • [9] V. Tanksale, “Intrusion detection system for controller area network,” Cybersecurity, vol. 7, no. 1, p. 4, 2024.
  • [10] H. S. Mavikumbure, V. Cobilean, C. S. Wickramasinghe, B. J. Varghese, B. Carlson, C. Rieger, and M. Manic, “Cy-phy ads: Cyber physical anomaly detection framework for ev charging systems,” IEEE Trans. Transp. Electr., 2024.
  • [11] N. Jeffrey, Q. Tan, and J. R. Villar, “A review of anomaly detection strategies to detect threats to cyber-physical systems,” Electron., vol. 12, no. 15, p. 3283, 2023.
  • [12] W. Marfo, D. K. Tosh, and S. V. Moore, “Condition monitoring and anomaly detection in cyber-physical systems,” in Proc. 2022 17th Annu. Syst. Syst. Eng. Conf. (SOSE), 2022, pp. 106–111.
  • [13] X. Wang, Y. Liu, K. Jiao, P. Liu, X. Luo, and T. Liu, “Intrusion device detection in fieldbus networks based on channel-state group fingerprint,” IEEE Trans. Inf. Forensics Secur., 2024.
  • [14] J. Xiao, L. Yang, F. Zhong, H. Chen, and X. Li, “Robust anomaly-based intrusion detection system for in-vehicle network by graph neural network framework,” Appl. Intell., vol. 53, no. 3, pp. 3183–3206, 2023.
  • [15] M. E. Verma, R. A. Bridges, M. D. Iannacone, S. C. Hollifield, P. Moriano, S. C. Hespeler et al., “A comprehensive guide to can ids data and introduction of the road dataset,” PLoS One, vol. 19, no. 1, p. e0296879, 2024.
  • [16] F. Martinelli, F. Mercaldo, V. Nardone, and A. Santone, “Car hacking identification through fuzzy logic algorithms,” in Proc. IEEE Int. Conf. Fuzzy Syst. (FUZZ-IEEE), Jul 2017, pp. 1–7.
  • [17] S. Devika, R. R. Shrivastava, P. Narang, T. Alladi, and F. R. Yu, “Vadgan: An unsupervised gan framework for enhanced anomaly detection in connected and autonomous vehicles,” IEEE Trans. Veh. Technol., 2024.
  • [18] H. Sun, M. Chen, J. Weng, Z. Liu, and G. Geng, “Anomaly detection for in-vehicle network using cnn-lstm with attention mechanism,” IEEE Trans. Veh. Technol., vol. 70, no. 10, pp. 10 880–10 893, 2021.
  • [19] Y. Lin, C. Chen, F. Xiao, O. Avatefipour, K. Alsubhi, and A. Yunianta, “Retracted: An evolutionary deep learning anomaly detection framework for in-vehicle networks-can bus,” IEEE Trans. Ind. Appl., 2020.
  • [20] H. J. Jo and W. Choi, “A survey of attacks on controller area networks and corresponding countermeasures,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 7, pp. 6123–6141, 2021.
  • [21] M. Jedh, L. B. Othmane, N. Ahmed, and B. Bhargava, “Detection of message injection attacks onto the can bus using similarities of successive messages-sequence graphs,” IEEE Trans. Inf. Forensics Secur., vol. 16, pp. 4133–4146, 2021.
  • [22] R. Zhao, C. Luo, F. Gao, Z. Gao, L. Li, D. Zhang, and W. Yang, “Application-layer anomaly detection leveraging time-series physical semantics in can-fd vehicle networks,” Electron., vol. 13, no. 2, p. 377, 2024.
  • [23] H. Zhang, K. Zeng, and S. Lin, “Federated graph neural network for fast anomaly detection in controller area networks,” IEEE Trans. Inf. Forensics Secur., vol. 18, pp. 1566–1579, 2023.
  • [24] W. Marfo, “GraphML Controller Area Network Attack Detection,” 2024, GitHub repository: https://github.com/billmj/GraphML-CONTROLLER-AREA-NETWORK-Attack-Detection.
  • [25] M. D. Natale, H. Zeng, P. Giusto, and A. Ghosal, Understanding and Using the Controller Area Network Communication Protocol: Theory and Practice.   New York, NY, USA: Springer, 2012.
  • [26] F. Martinelli, F. Mercaldo, A. Orlando, V. Nardone, A. Santone, and A. K. Sangaiah, “Human behavior characterization for driving style recognition in vehicle system,” Comput. Electr. Eng., vol. 83, p. Art. no. 102504, May 2020.
  • [27] M. D. Pesé, T. Stacer, C. A. Campos, E. Newberry, D. Chen, and K. G. Shin, “Librecan: Automated can message translator,” in Proc. ACM SIGSAC Conf. Comput. Commun. Security, 2019, pp. 2283–2300.
  • [28] T. Hoppe, S. Kiltz, and J. Dittmann, “Security threats to automotive can networks – practical examples and selected short-term countermeasures,” in SAFECOMP 2008: 27th International Conference on Computer Safety, Reliability, and Security, 2008, pp. 235–248.
  • [29] A. Gazdag, R. Ferenc, and L. Buttyán, “Crysys dataset of can traffic logs containing fabrication and masquerade attacks,” Sci. Data, vol. 10, no. 1, p. 903, 2023.
  • [30] S. Rajapaksha, G. Madzudzo, H. Kalutarage, A. Petrovski, and M. O. Al-Kadri, “Can-mirgu: A comprehensive can bus attack dataset from moving vehicles for intrusion detection system evaluation,” 2024, in preparation.
  • [31] P. Moriano, R. A. Bridges, and M. D. Iannacone, “Detecting CAN Masquerade Attacks with Signal Clustering Similarity,” in Workshop on Automotive and Autonomous Vehicle Security (AutoSec), 2022, pp. 1–8.
  • [32] P. Goyal and E. Ferrara, “Graph embedding techniques, applications, and performance: A survey,” Knowl.-Based Syst., vol. 151, pp. 78–94, 2018.
  • [33] D. Wang, P. Cui, and W. Zhu, “Structural deep network embedding,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., ser. KDD ’16.   New York, NY, USA: Assoc. Comput. Mach., 2016, pp. 1225–1234.
  • [34] W. Husen, Random Walks on Graphs.   Ohio State Univ., 2018, in: Linear Algebra with Applications.
  • [35] D. A. Spielman, Random Walks on Graphs.   Yale Univ., 2018, in: Spectral Graph Theory.
  • [36] I. Chami, S. Abu-El-Haija, B. Perozzi, C. Ré, and K. Murphy, “Machine learning on graphs: A model and comprehensive taxonomy,” J. Mach. Learn. Res., vol. 23, no. 89, pp. 1–64, 2022.
  • [37] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Yu, “A comprehensive survey on graph neural networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 1, pp. 4–24, 2021.
  • [38] R. P. Mondaini, Trends in Biomathematics: Chaos and Control in Epidemics, Ecosystems, and Cells.   Cham: Springer, 2021.
  • [39] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 2016, pp. 855–864.
  • [40] Z.-F. Wei, P. Moriano, and R. Kannan, “Robustness of graph embedding methods for community detection,” 2024. [Online]. Available: https://arxiv.org/abs/2405.00636
  • [41] L. Maddalena, I. Manipur, M. Manzo, and M. R. Guarracino, “On whole-graph embedding techniques,” in Trends in Biomathematics: Chaos and Control in Epidemics, Ecosystems, and Cells: Selected Works from the 20th BIOMAT Consortium Lectures, Rio de Janeiro, Brazil, 2020, R. P. Mondaini, Ed.   Cham: Springer International Publishing, 2021, pp. 115–131. [Online]. Available: https://doi.org/10.1007/978-3-030-73241-7_8
  • [42] Y. J. Zhang, K. C. Yang, and F. Radicchi, “Systematic comparison of graph embedding methods in practical tasks,” Phys. Rev. E, vol. 104, no. 4, p. 044315, 2021.
  • [43] R. Islam, R. U. D. Refat, S. M. Yerram, and H. Malik, “Graph-based intrusion detection system for controller area networks,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 3, pp. 1727–1736, 2020.
  • [44] H. Lee, S. H. Jeong, and H. K. Kim, “CAN Dataset for Intrusion Detection,” IEEE Trans. on Depend. and Secur. Comput., 2018. [Online]. Available: https://goo.gl/WiVeFj
  • [45] R. Islam, M. K. Devnath, M. D. Samad, and S. M. J. A. Kadry, “Ggnb: Graph-based gaussian naive bayes intrusion detection system for can bus,” Veh. Commun., vol. 33, p. 100442, 2022.
  • [46] G. Dupont, A. Lekidis, J. den Hartog, and S. Etalle, “Automotive controller area network (can) bus intrusion dataset v2,” 4TU.Centre for Research Data, 2020. [Online]. Available: https://data.4tu.nl/articles/dataset/Automotive_Controller_Area_Network_CAN_Bus_Intrusion_Dataset/12696950/2
  • [47] M. Sreelekshmi and S. Aji, “A graph-based strategy for intrusion detection in connected vehicles,” in International Conference on Information and Communication Technology for Competitive Strategies.   Springer, 2022, pp. 133–143.
  • [48] H. Kang, B. Kwak, Y. H. Lee, H. Lee, H. Lee, and H. K. Kim, “Car hacking: Attack and defense challenge 2020 dataset,” 2021. [Online]. Available: https://ieee-dataport.org/documents/car-hacking-attack-and-defense-challenge-2020-dataset
  • [49] R. U. D. Refat, A. A. Elkhail, and H. Malik, “A lightweight intrusion detection system for can protocol using neighborhood similarity,” in 2022 7th Int. Conf. Data Sci. Mach. Learn. Appl. (CDMA), Mar 2022, pp. 121–126.
  • [50] H. M. Song, J. Woo, and H. K. Kim, “In-vehicle network intrusion detection using deep convolutional neural network,” Veh. Commun., vol. 21, p. 100198, 2020.
  • [51] L. B. Othmane, L. Dhulipala, M. Abdelkhalek, N. Multari, and M. Govindarasu, “On the performance of detecting injection of fabricated messages into the can bus,” IEEE Trans. Dependable Secure Comput., vol. 19, no. 1, pp. 468–481, Feb 2022.
  • [52] S. B. Park, H. J. Jo, and D. H. Lee, “G-idcs: Graph-based intrusion detection and classification system for can protocol,” IEEE Access, vol. 11, pp. 39 213–39 227, 2023.
  • [53] Y. Meng, J. Li, F. Liu, S. Li, H. Hu, and H. Zhu, “Gb-ids: An intrusion detection system for can bus based on graph analysis,” in 2023 IEEE/CIC Int. Conf. Commun. China (ICCC), Aug 2023, pp. 1–6.
  • [54] H. Lee, S. H. Jeong, and H. K. Kim, “Otids: A novel intrusion detection system for in-vehicle network by using remote frame,” in 2017 15th Annual Conference on Privacy, Security and Trust (PST).   IEEE, 2017, pp. 57–5709.
  • [55] M. H. Shahriar, Y. Xiao, P. Moriano, W. Lou, and Y. T. Hou, “Canshield: Deep-learning-based intrusion detection framework for controller area networks at the signal level,” IEEE Internet of Things Journal, vol. 10, no. 24, pp. 22 111–22 127, 2023.
  • [56] J. Zhou, P. Joshi, H. Zeng, and R. Li, “Btmonitor: Bit-time-based intrusion detection and attacker identification in controller area network,” ACM Trans. Embed. Comput. Syst., vol. 18, no. 6, Nov 2019.
  • [57] X. Ying, S. U. Sagong, A. Clark, L. Bushnell, and R. Poovendran, “Shape of the cloak: Formal analysis of clock skew-based intrusion detection system in controller area networks,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 9, pp. 2300–2314, 2019.
  • [58] “University of washington ecocar 3,” Online, 2017, accessed: Sep. 26, 2017. [Online]. Available: http://uwecocar.com/
  • [59] M. Hanselmann, T. Strauss, K. Dormann, and H. Ulmer, “Canet: An unsupervised intrusion detection system for high dimensional can bus data,” IEEE Access, vol. 8, pp. 58 194–58 205, 2020.
  • [60] P. Moriano, S. C. Hespeler, M. Li, and R. A. Bridges, “Benchmarking unsupervised online ids for masquerade attacks in can,” 2024. [Online]. Available: https://arxiv.org/abs/2406.13778
  • [61] T. Qin, Z. Liu, P. Wang, S. Li, X. Guan, and L. Gao, “Symmetry degree measurement and its applications to anomaly detection,” IEEE Trans. Inf. Forensics Secur., vol. 15, pp. 1040–1055, 2020.
  • [62] J. Bryan and P. Moriano, “Graph-based machine learning improves just-in-time defect prediction,” PLoS One, vol. 18, no. 4, p. e0284077, 2023.
  • [63] CSS Electronics, “Understanding dbc files: Basics for can databases,” 2021, accessed: 2023-04-23. [Online]. Available: https://www.csselectronics.com/pages/can-dbc-file-database-intro
  • [64] M. E. Verma, R. A. Bridges, J. J. Sosnowski, S. C. Hollifield, and M. D. Iannacone, “Can-d: A modular four-step pipeline for comprehensively decoding controller area network data,” IEEE Trans. Veh. Technol., vol. 70, no. 10, pp. 9685–9700, 2021.
  • [65] A. Buscemi, I. Turcanu, G. Castignani, A. Panchenko, T. Engel, and K. G. Shin, “A survey on controller area network reverse engineering,” IEEE Comm. Surveys & Tutorials, vol. 25, no. 3, pp. 1445–1481, 2023.
  • [66] J. Cao, J. Fang, Z. Meng, and S. Liang, “Knowledge graph embedding: A survey from the perspective of representation spaces,” ACM Comput. Surv., vol. 56, no. 6, pp. 1–42, 2024.
  • [67] S. Khoshraftar and A. An, “A survey on graph representation learning methods,” ACM Trans. Intell. Syst. Technol., vol. 15, no. 1, pp. 1–55, 2024.
  • [68] N. Fanourakis, V. Efthymiou, D. Kotzinos, and V. Christophides, “Knowledge graph embedding methods for entity alignment: experimental review,” Data Min. Knowl. Discov., vol. 37, no. 5, pp. 2070–2137, 2023.
  • [69] W. Gu, A. Tandon, Y.-Y. Ahn, and F. Radicchi, “Principled approach to the selection of the embedding dimension of networks,” Nat. Commun., vol. 12, no. 1, p. 3772, 2021.
  • [70] M. Zhong, M. Lin, C. Zhang, and Z. Xu, “A survey on graph neural networks for intrusion detection systems: Methods, trends and challenges,” Computers & Security, p. 103821, 2024.
  • [71] L. Breiman, “Random forests,” Machine Learning, vol. 45, pp. 5–32, 2001.
  • [72] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 785–794.
  • [73] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Int. Res., vol. 16, no. 1, pp. 321–357, Jun. 2002.
  • [74] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006, ROC Analysis in Pattern Recognition.
  • [75] H. B. Mann and D. R. Whitney, “On a test of whether one of two random variables is stochastically larger than the other,” The Annals of Mathematical Statistics, pp. 50–60, 1947.
  • [76] V. W. Berger and Y. Zhou, “Kolmogorov–Smirnov test: Overview,” Wiley StatsRef: Statistics Reference Online, 2014.
William Marfo (Student Member, IEEE) is a computer science PhD student at the University of Texas at El Paso, USA. His research focuses on applying distributed deep learning techniques to identify and classify anomalous behavior in computer networks and cyber-physical systems, including cyber-attacks and security breaches. These efforts contribute to building more secure and resilient systems, safeguarding critical infrastructure and data integrity in an increasingly interconnected world. He is a member of ACM, AMS, and ISACA.
Pablo Moriano (Senior Member, IEEE) received B.S. and M.S. degrees in electrical engineering from Pontificia Universidad Javeriana in Colombia and M.S. and Ph.D. degrees in informatics from Indiana University Bloomington, Bloomington, IN, USA. He is a research scientist with the Computer Science and Mathematics Division at Oak Ridge National Laboratory, Oak Ridge, TN, USA. His research lies at the intersection of data science, network science, and cybersecurity. In particular, he uses data-driven and computational methods to discover, understand, and detect anomalous behavior in large-scale networked systems. Applications of his research span multiple disciplines, including the detection of exceptional events in social media, Internet route hijacking, insider threat behavior in version control systems, and anomaly detection in cyber-physical systems. Dr. Moriano is a member of ACM and SIAM.
Deepak Tosh (Senior Member, IEEE) is an associate professor of Computer Science at the University of Texas at El Paso. His research focuses on addressing various multi-disciplinary networking and cybersecurity challenges associated with critical national infrastructures, Industrial Internet of Things, Blockchain, and tactical battlefields. His research team has been working on multiple DOE and NSF-funded projects to develop resilient data/process provenance mechanisms for industrial operational technology environments and military applications. He has authored or co-authored more than 70 peer-reviewed conference papers, book chapters, and journal papers. Two of his research works on Blockchain were named among the “Top 50 Blockchain Papers in 2018” at the BlockchainConnect Conference, 2019. He is also a recipient of the prestigious NSF CAREER award (2023).
Shirley V. Moore (Member, IEEE) received a B.A. degree in mathematics and chemistry from Indiana University, an M.Ed. degree from the University of Illinois, an M.S. in mathematics from Wichita State University, and M.S. and Ph.D. degrees in Computer Sciences from Purdue University. She is an Associate Professor in the Computer Science Department at the University of Texas at El Paso, USA. Her research is in parallel and high-performance computing, including distributed deep learning. Her focus is on performance optimization methodologies and tools. She is a Senior Member of ACM.