Edge Conditional Node Update Graph Neural Network for Multi-variate Time Series Anomaly Detection

Hayoung Jo Seong-Whan Lee

Abstract

With the rapid advancement in cyber-physical systems, the increasing number of sensors has significantly complicated manual monitoring of system states. Consequently, graph-based time-series anomaly detection methods have gained attention due to their ability to explicitly represent relationships between sensors. However, these methods often apply a uniform source node representation across all connected target nodes, even when updating different target node representations. Moreover, the graph attention mechanism, commonly used to infer unknown graph structures, could constrain the diversity of source node representations. In this paper, we introduce the Edge Conditional Node-update Graph Neural Network (ECNU-GNN). Our model, equipped with an edge conditional node update module, dynamically transforms source node representations based on connected edges to represent target nodes aptly. We validate performance on three real-world datasets: SWaT, WADI, and PSM. Our model demonstrates 5.4%, 12.4%, and 6.0% higher performance, respectively, compared to best F1 baseline models.

keywords:

Multivariate time series , Anomaly detection , Graph neural network , Unsupervised Learning

^†^†journal: Information sciences

\affiliation

[1]organization=Department of Artificial Intelligence, Korea University, addressline=Anam-dong, Seongbuk-ku, city=Seoul, postcode=02841, country=Republic of Korea

1 Introduction

Anomaly detection, which identifies deviations from established norms, plays a critical role in various areas such as industrial operations [1, 2], intrusion detection [3, 4, 5], and fraud prevention [6, 7, 8]. In response to the challenges posed by the proliferation of sensors in cyber-physical systems such as industrial systems, vehicles, and data centers, there is an increasing reliance on multivariate time-series anomaly detection. Machine learning techniques, have been already applied successfully in various complex tasks such as face recognition [9], video shot detection [10], action recognition [11], and pose estimation [12], have also found their application in multivariate time-series anomaly detection. However, these data have the complex interrelationships among numerous sensors with high levels of non-linearity. Traditional anomaly detection models, typically focusing on individual sensor data or employing modeling techniques with simple function, are inadequate for handling this complexity. Moreover, these conventional methods often lack scalability, leading to increased computational time as the number of sensors grows.

The advancement of deep learning has led to the proposal of various models like autoencoder-based models [13, 14, 15, 16] and generative adversarial networks (GAN) [17, 18, 19, 20] for learning diverse and complex normal distributions. Transformers, known for their success in natural language processing, have also been utilized due to their proficiency in modeling time dependencies in time-series data [21, 22, 23, 24]. Despite these advancements, these methods often provide meaningful insights only when the anomalies and detection sensors are the same. For instance, if an attacked sensor appears to function normally while another sensor shows abnormalities, pinpointing the compromised sensor becomes a challenge.

Graph-based methods have become popular in addressing this limitation, as they illustrate the relationships between sensors using a graph structure, offering additional insights in the event of anomalies. This enhanced understanding of sensor interconnections not only provides a deeper analysis of anomaly events but also accelerates the process of identifying causes and implementing corrective measures. However, the primary challenge with a graph-based approach is that the connections between sensors necessary for forming a graph structure are typically unknown. Therefore, learning the graph structure and detecting anomalies must occur simultaneously. Many previous studies have addressed this problem using a graph attention mechanism. In approaches such as [25, 26, 27], spatial relationships are implicitly learned through a graph attention mechanism, which operates under the assumption that all nodes are interconnected. In [28, 29], node embedding vectors that represent node characteristics are introduced. These embedding vectors are used to extract graph structures based on the similarity of the vectors. During this process, the node embedding vectors are utilized to calculate the attention scores.

Previous graph-based methods apply the same source node representation to all connected target nodes, even when updating different target node representations. This approach is effective when source node representations are suitable for all connected targets, but this is often challenging in complex systems. It places a significant burden on the encoder to find a universally applicable source node representation. Additionally, using graph attention can act as an implicit constraint, forcing the source node representation to be similar.

To address these challenges, in this study, we propose Edge Conditional Node Update Graph Neural Network (ECNU-GNN), a novel graph neural network that dynamically updates target node representations. This is achieved by transforming source node representations into forms that are specifically tailored to each target node, guided by edge-specific conditions. This method learns the graph structure without relying on graph attention. Our model comprises four main components: node condition embedding, graph structure extraction, condition-based prediction, and anomaly detection. Node condition embedding utilizes node embedding vectors as conditions to capture the characteristics of sensors. These embedding vectors are used to construct the graph structure, modify source representations, and read out target node representations. Graph structure extraction constructs the graph structure using the relationships among all pairs of node embedding vectors. Condition-based prediction updates the target node representations using different source node representations based on edges and predicts the next time step values from the updated target node representations. From the predicted values, anomaly detection judges whether an anomaly is detected at that time step.

To summarize, the main contributions of our work are:

1.

We propose a state-of-the-art edge conditional node update graph neural network for time-series anomaly detection, featuring an edge conditional node update module. This module transforms source node representations based on edges, ensuring more adequate source node representation for updating the target node representations.
2.

Our model constructs the graph structure without graph attention. Graph attention may act as an implicit regularization, leading to the encoder’s source node representations becoming more similar. This effect limits the variety of source node representations.
3.

Our model excels in anomaly detection, outperforming baseline methods in tests on three real-world datasets. Our model features an innovative edge conditional node update module to effectively adjust source node representations to enhance target node updates and boost anomaly detection performance even in complex datasets.

The remainder of this paper is organized as follows. Section 2 describes the related work. The details of the proposed model are presented in Section 3. A performance evaluation and other experimental results are presented in Section 4. Finally, Section 5 concludes the paper.

2 Related Work

2.1 Graph Neural Networks

Recently, graph neural networks (GNN) [30] have become a successful method for modeling intricate patterns in data containing graph structures, such as social networks and medical data. A graph comprises a collection of nodes linked by edges that signify the connections between these nodes. GNNs are particularly adept at modeling complex and malleable data structures, surpassing the capabilities of conventional deep neural networks.

Typically, GNNs assume that node representations are constructed based on the influences of neighboring nodes in a graph. Graph convolutional networks (GCN) [31] aggregate the feature representations of nodes by directly considering the feature representations of neighbors.

The attention mechanism, which is a crucial component in numerous deep learning applications, is widely used in sequential models to efficiently capture relevant information. When applied to graphs, this concept gives rise to graph attention networks (GAT) [32] that introduce a weighted aggregation of node representations during the aggregation step in GCN. In GAT, nodes are assigned varying weights through training, prioritizing more important nodes with higher weights and assigning lower weights to less significant ones.

2.2 Time series anomaly detection

Time-series anomaly detection has been the subject of extensive research for slightly some time, resulting in several proposed methodologies. Classical approaches are characterized by their simplicity and are ill-equipped to handle increasing system complexity. As deep learning has demonstrated exceptional performance in diverse tasks and various domains, numerous methods have emerged to address the challenges of time-series anomaly detection using deep learning techniques.

A fundamental structure for anomaly detection using deep learning is the autoencoder, which trains a network to reconstruct the input data of normal behavior and detects anomalies based on the difference between the input and the reconstructions. USAD [13] is an autoencoder-based model equipped with two decoders. These two decoders are trained adversarially, with one decoder focused on producing accurate reconstructions and the other focused on producing reconstructions that differ significantly from the input data. This dual-decoder setup is designed to effectively detect anomalies that exhibited small deviations from normal data. ACLAE-DT [14] employs an attention-augmented ConvLSTM and a dynamic thresholding mechanism in its reconstruction-based framework. In particular, it constructs as input an inter-correlation feature map that represents the correlations between all pairs of input data and entity embedding vectors, thereby capturing the temporal and contextual dependencies within the data. CAE-AD [15] introduces two contrasting methods: contextual contrasting and instance contrasting. The contextual contrasting method enables the attention-based projection layer to embed shared information between adjacent time steps. The instance contrasting method allows the encoder to capture shared information within a specific time step, utilizing data augmentation in both the time and frequency domains. PAMFE [16] reconstructs clean inputs from noisy ones by using multiple parallel dilated convolutions with different dilation factors. This approach, combined with a feature fusion module, allows the model to capture features at different scales and increases its robustness to noise.

Generative Adversarial Networks (GAN)[33] consist of a generator that generates data and discriminator that distinguishes between real and generated data. Methods using GANs typically train a generator and discriminator on normal time-series data, and compute anomaly scores based on the outputs of the generator or discriminator. MAD-GAN [17] introduces inverse mapping to traditional GANs with LSTM-based generators and discriminators. It uses inverse transformations to compute anomaly scores using both the reconstruction loss and discriminator results. MAD-GAN aims to capture the hidden relationships and interactions that exist between sensors by considering the entire set of sensors simultaneously. FGANomaly [18] generates pseudo-labels indicating whether the input is normal, based on the deviation between the original input and its reconstruction. This model is adversarially trained using data labeled ’normal’ in the pseudo-labels, which increases its robustness to contaminated data. STAD-GAN [19] is designed within a teacher-student framework, combining a GAN structure with a classifier to overcome noise vulnerability and poor performance on anomaly-containing training data. Initially, the teacher model generates a refined dataset with pseudo-labels through unsupervised training. The student model then trains with this dataset, enhancing pseudo-label accuracy. This iterative process improves the classifier’s anomaly detection capabilities. TGAN-AD [20] is a GAN model that includes a transformer-based generator and a discriminator designed to capture contextual information. The generator is trained to identify the latent vector with the highest covariation between the input and generated data, ensuring that the generated data closely resembles the original.

Recent research has extended the success of transformer models [34] in natural language processing to multivariate anomaly detection by adapting them to the time-series domain. TransAnomaly [21] integrates the transformer architecture with a variational autoencoder to tackle the issue of future dependency in sequential data modeling. This approach leads to a model that not only captures future dependencies more effectively but also offers reduced computational complexity, improved parallel processing, and enhanced explainability. GTA [22] extracts graph structures through a learned network and encodes spatial features using GCN. Additionally, it employs transformer models to capture both global and local temporal features. This integrated approach facilitates anomaly detection by considering spatial and temporal dependencies effectively. TranAD [23] is a transformer-based model using two decoders. The first decoder performs reconstruction, whereas the second decoder uses the difference between the results of the first decoder and the input to perform additional reconstruction. Both results contributed to the adversarial training process. D³TN [24] predicts the next time step value using a GRU-based predictor that processes features comprising global spatial features generated by a disentangled convolution layer, local spatial features, and temporal features derived from a self-attention-based layer.

Graph neural networks offer the advantage of explicitly defining the relationships between nodes. Leveraging this capability, they utilize signals from each sensor as node features to learn relationships between sensors, thereby aiding in anomaly detection. MTAD-GAT [25] is a graph attention-based model that constructs a graph structure without prior knowledge. It uses a fully connected graph across both spatial and temporal dimensions. To capture the unique characteristics of these dimensions, MTAD-GAT employs two distinct GAT models: one for spatial features and the other for temporal features. GDN [28] introduces node-embedding vectors to obtain a graph structure based on the similarity of the embedding vectors. Unlike conventional GATs, which rely solely on node features, a GDN concatenates embedding vectors with node features to compute the attention weight. GTAD [26] extracts temporal features using a temporal convolution network and spatial features through a graph attention network. Utilizing these features, the model is trained for data reconstruction and prediction. Anomaly detection is then performed by analyzing both the reconstruction and prediction errors. HAD-MDGAT [27], similar to MTAD-GAT [25], utilizes two GAT models to extract both spatial and temporal features. Leveraging these features, the model employs a GAN for input reconstruction and an MLP for data prediction. MST-GAT [29] captures multimodal correlations by employing both inter-modal and intra-modal attention mechanisms, which leverage modality-specific information. Additionally, it utilizes multi-head attention that operates independently of modality information. PGRGAT [35] employs prior information of sensor relationships to learn graph structure at an edge predictor. This graph structure is then used to encode features via the gated current graph attention unit network, which is a graph-based GRU module.

3 Method

3.1 Problem Statement

The dataset consists of multivariate time-series data collected from $N$ sensors over a period of time $T$ , $\boldsymbol{s}=[\boldsymbol{s}^{1},\boldsymbol{s}^{2},\dots,\boldsymbol{s}^{T}]\in\mathbb{R}^{N\times T}$ . Each column $\boldsymbol{s}^{t}$ at time $t$ corresponds to a vector representing measurements from $N$ sensors, denoted as $\boldsymbol{s}^{t}=[{s}^{t}_{1},\dots,{s}^{t}_{N}]\in\mathbb{R}^{N}$ .

Both the training and testing datasets have the same number of sensors, but they differ in their time durations, denoted as $T_{train}$ and $T_{test}$ respectively. The training dataset comprises normal data, aligning with conventional unsupervised anomaly detection methods.

The model proposed in this study utilizes historical data to predict the measured values of the next time step. The input data to predict sensor’s value at time $t$ , $x^{(t)}\in\mathbb{R}^{N\times w}$ , is constructed using sliding window with window size $w$ over previous time series data, as follows:

x^{(t)}=[\boldsymbol{s}^{t-w},\boldsymbol{s}^{t-w+1},\cdots,\boldsymbol{s}^{t-1}]

(1)

After the prediction, our model calculates an anomaly score for that specific time step. If the anomaly score exceeds a predefined threshold, the model outputs 1 to indicate an anomaly and 0 if the data is considered normal.

3.2 Overview

Our model extracts the graph structure from node embedding vectors, predicts sensor values for the next time step, and detects anomalies. To achieve this, our model is composed of the following four main components:

1.

Node Condition Embedding uses embedding vectors to be utilized as conditions to transform source node representations and readout final node representations, and to construct the graphs structure.
2.

Graph Structure Extraction generates graph structure based on the similarity between all pairs of the node embedding vectors.
3.

Condition-based prediction forecasts each sensor value for the next step by employing node embedding vectors as conditions in the node conditional readout module and edge conditional node update module.
4.

Anomaly Detection calculates anomaly scores, if the score is higher than a threshold, the model determines the time step as an anomaly.

Fig. 1 shows an overview of our model.

Refer to caption — Figure 1: Overview of our model

3.3 Node Condition Embedding

We introduce node embedding vectors for getting relationships between sensors, as $d_{e}$ dimensional vectors:

v_{i}\in\mathbb{R}^{d_{e}},\textnormal{for $i$}\in\{1,2,\cdots,N\}

(2)

where $v_{i}$ represents the $i$ th node embedding vector.

In [28], the sensor-embedding vectors are trained to reflect the similar behavior of the sensors in graph attention. However, graph attention can cause node representations to become similar, which may contribute to the model’s limited capability in modeling sensor behavior.

To avoid this problem, our model trains the node embedding vector without attention, instead, employing the edge conditional node update module and node conditional readout module, as described in the following section.

3.4 Graph Structure Extraction

To discover the relationships between sensors in systems where these relationships are not predefined, we employ the method of assessing similarities among all node embedding vectors. This similarity measurement is crucial for constructing the graph structure of the system. Specifically, for each node, we determine the Top- $k$ most similar nodes, based on the similarity score $e_{i,j}$ , to form an adjacency matrix:

e_{i,j}=\frac{{v}_{i}\cdot{v}_{j}}{||{v}_{i}||\,||{v}_{j}||}

(3)

A_{ij}=\mathbb{1}_{\{j\in\textit{TopK}(\{e_{i,k}:\forall k\})\}}

(4)

where TopK is the operator used to select the indices of Top- $k$ values among the set of inputs. The value $k$ of TopK is given by the user to adjust the sparsity level of the graph structure.

3.5 Condition based prediction

Condition based prediction process comprises two main modules: the edge conditional node update module (ECNUM) and the node conditional readout module (NCRM). ECNUM uses a graph-based method to generate a node representation, which is then utilized by NCRM to predict the final result. In this paper, we use MLP as ECNUM and NCRM.

A simple encoder network such as a linear or 1D CNN could be adopted to encode the input to $d_{f}$ dimensional vector, $z^{(t)}\in\mathbb{R}^{N\times d_{f}}$ , from a given input $x$ , as formulated in Eq. 5.

z^{(t)}=f_{\textit{enc}}(x^{(t)})

(5)

Subsequently, each row of $z^{(t)}$ is adopted as the node representation. In this paper, we utilize a linear layer as an encoder.

3.5.1 Edge Conditional Node Update Module

As mentioned in Section 1, in existing models, the representation of target nodes is updated using the same representation from a source node, even when the target nodes are different, as illustrated in Fig. 2. This means that for any given source node, its representation is uniformly applied to update the representations of all connected target nodes, regardless of their distinctiveness.

The simplest method to address this issue is to set up transformation modules for each edge, as expressed in Eq. 6.

z^{(t)}_{i,j}=f_{i,j}(z_{j}^{(t)})

(6)

where $z_{j}^{(t)}$ is the node representation from $x^{(t)}_{j}$ , $f_{i,j}$ is the transformation function to create a new representation of source node $j$ to update target node $i$ . $z_{i,j}^{(t)}$ represents the modified node representation of source node $z_{j}^{(t)}$ to update target node $i$ . With this approach, despite using the same source representation, distinct transformation modules are utilized offering diverse representations to update different target nodes, as shown in Fig. 3.

However, this method faces limitations due to the exponential increase in the number of transformation functions as the number of nodes grows, resulting in elevated computational costs and greater data requirements for training. To address these challenges, we draw inspiration from GDN [28] and conditional GAN [36]. GDN utilizes node embedding vectors to capture the unique characteristics of each node, while conditional GAN are capable of generating varied outputs from the same input based on specific conditions. Building on these concepts, our proposed ECNUM dynamically transforms the representation of a source node in accordance with edge conditions, as denoted by the condition embedding vectors of the source and target nodes, as follows:

z_{i,j}^{(t)}=f_{\textit{ECNUM}}(z_{j}^{(t)};v_{i},v_{j})

(7)

where $v_{i},v_{j}$ are the node embedding vector of node $i,j$ , respectively. $f_{\textit{ECNUM}}$ represents ECNUM. ECNUM is a single module designed to adeptly transform source node representations to suit various target nodes, as shown in Fig. 4.

Finally, the transformed source representations, $z_{i,j}^{(t)}$ are aggregated to obtain the final node representation, as follow:

z_{i}^{t}=\sigma\left(\sum_{j\in\mathcal{N}(i)\cup\{i\}}z_{i,j}^{(t)}\right)

(8)

where $\mathcal{N}(i)$ is the set of neighbor node’s indices of node $i$ .

3.5.2 Node Conditional Readout Module

ECNUM enables the individual updating of each node representation with diverse source node representations, leading to a varied set of node representations. Consider a scenario where a single readout module predicts the next values for all nodes simultaneously. In an ideal case, the readout module would effectively extract predictions from diverse node representations, given that the module has sufficient capabilities. However, if the readout module’s capabilities are inadequate, it may limit the diversity of node representations to improve prediction accuracy.

For this reason, similar to ECNUM, we propose NCRM that takes node embedding vectors as input conditions, as illustrated in Fig. 5. This module can produce different results for the same node representation depending on the node embedding vector.

\hat{s}^{t}_{i}=f_{\textit{NCRM}}(z^{t}_{i};v_{i})

(9)

where, $\hat{s}^{t}_{i}$ is predicted value for sensor $i$ at time $t$ . $f_{\textit{readout}}$ is node conditional readout module.

To train our model, we use the Mean Squared Error as a loss function to minimize error between the predicted output $\hat{s}^{t}_{i}$ and the actual data $s^{t}_{i}$ .

\mathit{L}_{MSE}=\frac{1}{T_{train}-w}\sum_{t=w+1}^{T_{train}}\|s^{t}-\hat{s}^{t}\|^{2}_{2}

(10)

3.6 Anomaly Detection

Given the values predicted from the learned graph structure, we detect anomalies using graph deviation scoring (GDS) from [28]. GDS calculates the anomaly score by comparing each sensor’s scores, using the absolute error as the basis for normalization.

The first step in calculating the GDS is to calculate the absolute error between the predicted value and the actual value at time $t$ and sensor $i$ :

\textnormal{Err}_{i}(t)=|s_{i}^{t}-\hat{s}_{i}^{t}|

(11)

The number of absolute errors at time $t$ is $N$ number of sensors. It is not appropriate to compare these absolute errors directly since the absolute error of each sensor exhibits a different behavior. Therefore, we normalize each absolute error as follows:

a_{i}=\frac{\textnormal{Err}_{i}(t)-\tilde{\mu}_{i}}{\tilde{\sigma}_{i}}

(12)

where $\tilde{\mu}_{i}$ and $\tilde{\sigma}_{i}$ are the median and interquartile range (IQR) of the absolute error at sensor $i$ . IQR is defined as the difference between the 1st and 3rd quartiles of set of values. This normalization method is known to be more robust against anomalies than normalization using mean and standard variation.

To compute the anomaly score at given time $t$ , we use the max function to select the most anomalous sensor and score as follow:

A(t)=\max_{i}a_{i}(t)

(13)

The model cannot perfectly predict the sensor value; as a result, the anomaly score can exhibit a sharp change, even when the input is normal. To mitigate these changes, we use a simple moving average (SMA).

Finally, if the smoothed anomaly score is above a fixed threshold, the model labels the input as abnormal for a given time, $t$ .

Many previous methods use automatic thresholding methods, such as extreme value theory [37], to determine the threshold value. However, these methods produce different results depending on the initial set of parameters. To evaluate the anomaly detection capability of the model, we opt for a grid search method to identify an optimal threshold that maximizes the F1 score.

The pseudo-code of the edge conditional node update GNN is presented in Algorithm 1

Input: number of nodes (

N

), Top-k value (

k

), node embedding vectors (

\boldsymbol{v}

), pre-defined threshold (

\delta_{thre}

), input signal (

x^{(t)}=\{\textbf{s}^{t-w},\cdots,\textbf{s}^{t-1}\}

), ground truth signal (

\textbf{s}^{t}

)

Output: anomaly detection result at time

t

(

y^{t}

)

Extract the graph structure A following Eq. 3, 4

*Encoding node representation

\boldsymbol{z}^{(t)}\leftarrow f_{enc}({x}^{(t)})

*Edge conditional node update

for $0\leq i\leq$ N-1 do

z_{i}^{t}\leftarrow

AGGREGATE({

f_{ECNUM}(z_{k}^{(t)};v_{i},v_{k})

\forall k\in\mathcal{N}(i)\cup\{i\}

})

end for

\boldsymbol{z}^{t}\leftarrow\sigma(\boldsymbol{z}^{t})

*Node conditional readout

\hat{\boldsymbol{s}}^{t}\leftarrow f_{NCRM}(\boldsymbol{z}^{t};\boldsymbol{v})

*Anomaly detection

Err

\leftarrow|\boldsymbol{s}^{t}-\hat{\boldsymbol{s}}^{t}|

\tilde{\mathbf{\boldsymbol{\mu}}}\leftarrow

AVERAGE(Err)

\tilde{\boldsymbol{\sigma}}\leftarrow

IQR(Err)

\mathbf{a}\leftarrow\frac{\textbf{Err}-{\tilde{\boldsymbol{\mu}}}}{\tilde{\boldsymbol{\sigma}}}

A(t)\leftarrow max(\mathbf{a})

y^{t}\leftarrow A(t)>\delta_{thre}

return

y^{t}

Algorithm 1 A Pseudo-code of Edge Conditional Node Update GNN

4 Experiments

4.1 Datasets

Our model is evaluated using three real-world datasets. Two of these datasets, SWaT [38] and WADI [39], are sourced from water treatment system-based test beds. These datasets are generated by simulating operator-defined attack scenarios, with evidence labels provided for each scenario. The third dataset, PSM [40], is collected from real-world server machines and is manually labeled by engineers and application experts. The PSM dataset includes both planned and unplanned anomalies

The SWaT dataset is constructed using a water treatment test bed that emulates a small-scale system. It contains 51 features, 47,515 training samples, and 44,986 test samples, with an anomaly rate of approximately 11.97%.

The WADI dataset is built based on a water distribution test bed, which models a larger-scale system than SWaT. This dataset containes 127 features, 118,795 training samples, and 17,275 test samples, with an anomaly rate of approximately 5.99%.

The PSM dataset is collected internally from multiple application server nodes at eBay. It contains 25 features, 132,481 training samples, and 87,841 test samples, with an anomaly rate of approximately 27.75%

The SWaT and WADI datasets are downsampled following the methods outlined in GDN [28]. Data are downsampled every 10s using median values to speed up the training and the labels for the test dataset are determined based on the predominant label among the down-sampled labels. The missing data is replaced by mean value. As noted by [41], data in the first 5–6 hours after system start-up is considered unstable; therefore, we exclude the first 2,160 data points.

4.2 Experimental settings

We implemented our model using PyTorch version 1.13.0, CUDA 11.6, PyTorch Geometric library 2.2.0, and PyTorch Lightning 1.9.0. A server equipped with an Intel(R) Xeon(R) Gold 6148 2.4 Hz CPU and an NVIDIA RTX A6000 GPU is used to train the model. The model is trained using Adam optimizer with a learning rate of 0.001 and beta values of 0.9 and 0.99. We employ an early stopping strategy with a maximum of 100 epochs and patience of 5 and select the final model with the lowest validation loss.

To identify suitable hyperparameters for the model, we employ the grid search method. The final hyperparameters are presented in Table 1.

Table 1: Hyperparameters of the model.

w

, TopK,

d_{e}

d_{f}

n_{ECNUM}

n_{NCRM}

represent the size of the sliding window, Top-

k

factor, node embedding vector dimension, feature dimension, number of layers in ECNUM and NCRM, respectively.

Dataset	$w$	TopK	$d_{e}$	$d_{f}$	$n_{ECNUM}$	$n_{NCRM}$
SWaT	5	30	128	256	4	4
WADI	5	30	128	256	3	4
PSM	3	25	128	128	1	2

4.3 Baselines

We conduct a comprehensive comparison of our model with several baseline models. The comparison includes MAD-GAN [17] from GAN-based models, USAD [13] representing autoencoder-based models, MTAD-GAT [25] and GDN [28] from GNN-based models, as well as TranAD [23] from transformer-based models, and GTA [22] from a hybrid model between transformer and GNN. Since the point adjustment strategy proposed by [42] has the great possibility of overestimating the model’s performance, as noted in [43], we omit this strategy when calculating the performance.

4.4 Evaluation Metrics

As evaluation metrics, we use metrics commonly used in previous research, including F1 score (F1), recall (Rec), and precision (Prec).

The precision measures the ratio of correctly detected anomalies to the total number of detected anomalies. This is calculated by dividing the number of correctly detected anomalies by the total number of detected anomalies.

The recall measures the ratio of correctly detected anomalies to the total number of actual anomalies in the dataset. This is calculated by dividing the number of correctly detected anomalies by the total number of actual anomalies.

The F1 score is the harmonic mean of the precision and recall, providing a balanced measure that considers both in a single metric. The precision, recall, and F1 scores can be formulated as follows:

Prec=\frac{TP}{TP+FP}

(14)

Rec=\frac{TP}{TP+FN}

(15)

F1=\frac{2\times Prec\times Rec}{Prec+Rec}

(16)

where $TP$ , $FP$ , and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively.

4.5 Results

Table 2: Average results

Methods

SWaT

WADI

PSM

Recall

Precision

Recall

Precision

Recall

Precision

MAD-GAN [17]

0.7206

(0.01219)

0.5750

(0.01350)

0.9658

(0.02368)

0.2541

(0.00068)

0.1487

(0.00040)

0.8733

(0.00233)

0.4679

(0.01568)

0.8766

(0.08394)

0.3214

(0.02398)

USAD [13]

0.7406

(0.00020)

0.5902

(0.00016)

0.9941

(0.00027)

0.2543

(0.00076)

0.1488

(0.00045)

0.8743

(0.00277)

0.4547

(0.00274)

0.9401

(0.01106)

0.2999

(0.00323)

MTAD-GAT [25]

0.7495

(0.00299)

0.6187

(0.01176)

0.9515

(0.02527)

0.2645

(0.00406)

0.1857

(0.03330)

0.6210

(0.27438)

0.4716

(0.01595)

0.6688

(0.12744)

0.3834

(0.06902)

GDN [28]

0.7280

(0.04533)

0.6275

(0.01905)

0.8803

(0.13001)

0.4206

(0.33615)

0.3110

(0.02991)

0.6625

(0.09251)

0.5476

(0.01774)

0.7147

(0.06002)

0.4479

(0.03584)

TranAD [23]

0.5003

(0.23316)

0.7139

(0.14616)

0.5701

(0.40900)

0.2524

(0.00078)

0.1478

(0.00045)

0.8680

(0.00266)

0.4441

(0.01667)

0.9374

(0.07317)

0.2924

(0.02101)

GTA [22]

0.7778

(0.02253)

0.6411

(0.02949)

0.9899

(0.00449)

0.4083

(0.05022)

0.3326

(0.06289)

0.6195

(0.20943)

0.5801

(0.00816)

0.7565

(0.01835)

0.4707

(0.01102)

Ours

0.8089

(0.04337)

0.6891

(0.06278)

0.9845

(0.00511)

0.5730

(0.04547)

0.4610

(0.04782)

0.7652

(0.06837)

0.6027

(0.02335)

0.6607

(0.02064)

0.5562

(0.04222)

Table 2 presents a comprehensive overview of the average performance and standard deviation of both our models and the baseline models, derived from a total of 10 experiments. Our model shows performance with an average F1 score of 0.8089 on the SWaT dataset, 0.5730 on the WADI dataset, and 0.6027 on the PSM dataset. Notably, these scores represent a significant improvement over the best-performing baseline models, with increases of 4.0% for the SWaT dataset, 36.2% for the WADI dataset, and 3.9% for the PSM dataset, respectively.

Table 3: Experimental results of best F1 model. †means that value is extracted from the original paper.

Methods	SWaT			WADI			PSM
Methods	F1	Recall	Precsion	F1	Recall	Precision	F1	Recall	Precision
MAD-GAN [17]	0.7361	0.5866	0.9880	0.2555	0.1495	0.8779	0.5103	0.7790	0.3793
USAD [13]	0.7409	0.5904	0.9945	0.2538	0.1485	0.8721	0.4585	0.9087	0.3067
MTAD-GAT [25]	0.7566	0.6261	0.9558	0.2701	0.2158	0.3609	0.4970	0.5350	0.4641
GDN [28]	0.81†	0.6957†	0.9697†	0.57†	0.4019†	0.9750†	0.5896	0.7088	0.5047
TranAD [23]	0.7399	0.5895	0.9929	0.2536	0.1485	0.8721	0.4812	0.8496	0.3357
GTA [22]	0.7948	0.6654	0.9870	0.4694	0.3139	0.9215	0.5929	0.7501	0.4902
Ours	0.8541	0.7593	0.9761	0.6405	0.5257	0.8207	0.6285	0.6341	0.6230

Table 3 shows the performance of the models with the best F1 score. Our model shows performances with an F1 score of 0.8541 on the SWaT dataset, 0.6405 on the WADI dataset, and 0.6179 on the PSM dataset. When comparing our model to the best F1 baseline models, we observed a significant performance improvement. Specifically, our model achieved remarkable increases of 5.4%, 12.4%, and 6.0% in F1 scores for the SWaT, WADI, and PSM datasets, respectively.

These results consistently demonstrate a notably high performance improvement in the complex WADI dataset, with a 36.2% increase in average F1 score and a 12.4% increase in the best F1 score compared to baseline models. Moreover, the comparison with the graph-based baseline GDN model, which is similar to our approach, further validates the benefits of our model’s strategy to use distinct source node representations for each target node.

4.6 Case study

To evaluate the effectiveness of our model in detecting and localizing anomalies, we analyzed the second anomaly in the WADI dataset. This anomaly is triggered by an attack that manipulates the flow sensor 1_FIT_001_PV, causing it to report values different from the actual measurements. During the attack, the values remain within the normal sensor measurement range. As shown in Fig. 6, it is almost infeasible to detect anomalies using the sensor independently.

However, during this attack period, the proposed model correctly identified 1_MV_001_STATUS as a sensor with a high anomaly score, as shown in Fig. 7. Traditional models typically rely on attention weights to identify sensors associated with 1_MV_001_STATUS. In contrast, our model uses layer-wise relevance propagation (LRP)[44] instead of attention mechanisms to provide information about relevant sensors.

If we apply the standard LRP to our model, the edge conditional node update and the node conditional readout module do not propagate the relevance scores corresponding to the node-embedding vectors. To address this problem, we reassigned the relevance scores of the node-embedding vectors to the feature relevance scores. For the node conditional readout module, we reassigned the relevance score of the one-node embedding vector to the feature relevance score, as follows:

\hat{R}^{x_{j}}=\left(\frac{\sum_{i}R^{v_{j}}_{i}}{\sum_{i}R^{x_{j}}_{i}}+1\right)R^{x_{j}}

(17)

where, $R^{x_{j}}_{i}$ , $R^{v_{j}}_{i}$ are $i$ th element of relevance score of feature $x_{j}$ , and node embedding of node $v_{j}$ , respectively.

The edge conditional node update module utilizes the summation of the relevance scores of the two node embedding vectors. The reassignment considering redundancy is conducted as follows.

\hat{R}^{x_{i}}=\left(\frac{1}{|\mathcal{N}(i)\cup\{i\}|}\frac{\sum_{k}\sum_{j}(R^{S_{i,j}}_{k}+R^{T_{j,i}}_{k})}{\sum_{k}R^{x_{i}}_{k}}+1\right)R^{x_{i}}

(18)

where $|\mathcal{N}(i)\cup\{i\}|$ is the number of edges connected to node $i$ , $R^{S_{i,j}}$ is the relevance score vector for source node embedding at edge from node $i$ to node $j$ , $R^{T_{j,i}}$ is the relevance score vector for target node embedding at edge from node $j$ to node $i$ .

Due to the nature of the data, the inputs of some sensors may become zero. In this case, the relevance scores could also become zero, making it difficult to make accurate comparisons. To address this issue, we summed the relevance scores for each node from the relevance scores that correspond to the node representation immediately after input transformation. Since every input value is transformed independently into a node representation, the total relevance scores for each node remain consistent.

The graph structure learned by our model and LRP results centered on the 1_MV_001_STATUS sensor are shown in Fig. 8. In the figure, the red square represents 1_MV_001_STATUS and the red triangle represents 1_FIT_001_PV. The red and blue edges denote the positive and negative relevance scores, respectively, and the edge thickness represents their relative magnitudes. This result confirms the high relevance of 1_MV_001_STATUS and 1_FIT_001_PV.

4.7 Ablation study

4.7.1 Effects of node embedding condition

An ablation study is performed to evaluate the importance of node embedding vectors as conditions by omitting their use in the edge conditional node update and readout modules. However, if all node embedding vectors are excluded, the learning of node embedding vectors fails, resulting in a fixed graph structure at its initial values. Therefore, this case is excluded from the study.

Table 4: Results of ablation study for effects of node embedding vectors with SWaT and WADI

Methods	SWaT			WADI
Methods	F1	Recall	Precision	F1	Recall	Precision
Original	0.8541	0.7593	0.9761	0.6405	0.5257	0.8207
-Target(T)	0.8213	0.7470	0.9122	0.5716	0.5059	0.6577
-Source(S)	0.7967	0.6676	0.9881	0.4508	0.3218	0.7541
-Readout(R)	0.8497	0.7468	0.9858	0.5760	0.4238	0.9011
-T, R	0.8336	0.7240	0.9825	0.4955	0.4376	0.5689
-S, R	0.7992	0.7186	0.8999	0.4591	0.3277	0.7680
-S, T	0.7885	0.6769	0.9446	0.4676	0.3129	0.9159

We denote the cases where we excluded the target embedding vector in the edge conditional node update module as ’T’, the cases where we excluded the source embedding vector as ’S’, and the cases where we excluded the embedding vectors in the readout module as ’R’.

Ablation studies are performed on three datasets using the settings that produce the best F1 scores. In addtion, further experiments are performed on the SWaT dataset with the Top- $k$ factor set to 50, denoted as SWaT₅₀, to investigate the effect of a densely connected graph. The results of these studies are detailed in Table 4 and Table 5.

Table 5: Results of ablation study for effects of node embedding vectors with PSM and SWaT₅₀

Methods	PSM			$\textbf{SWaT}_{50}$
Methods	F1	Recall	Precision	F1	Recall	Precision
Original	0.6285	0.6341	0.6230	0.8114	0.6881	0.9888
-Target(T)	0.5174	0.7038	0.4091	0.7446	0.6161	0.9405
-Source(S)	0.6110	0.6907	0.5477	0.8020	0.6718	0.9943
-Readout(R)	0.6139	0.6304	0.5982	0.8037	0.6732	0.9965
-T,R	0.5469	0.7119	0.4441	0.7486	0.6042	0.9840
-S,R	0.6141	0.6748	0.5633	0.7947	0.6821	0.9520
-S,T	0.5464	0.6694	0.4616	0.7310	0.5898	0.9606

The node embedding vectors used in each module significantly contribute to performance enhancement, indicating that possessing appropriate source node representations for each target node is effective. Additionally, in the SWaT and WADI datasets, the absence of source node embedding vectors in ECNUM leads to notable performance differences. Conversely, in the PSM and SWaT₅₀ datasets, the absence of target node embeddings is more influential.

This variation appears to be attributable to the number of shared source nodes, which increases with larger Top- $k$ values. In less dense graph structures like SWaT or WADI, the number of overlapping source nodes is smaller when updating target nodes, allowing for effective transformation of individual source nodes to provide appropriate representations for target nodes. However, in dense graph structures like PSM or SWaT₅₀, where most source nodes are shared, relying solely on source node information for transformation can result in encoding similar to individual node encoding. This leads to the same issue observed in previous methods where target nodes are updated using identical source node representations regardless of the target nodes.

4.7.2 Effects of the graph attention

To assess the impact of graph attention, our model is compared with two models that utilize vanilla graph attention and node-wise graph attention, respectively. The comparative models are trained by applying each graph attention method to the node representation transformed by ECNUM. The model with vanilla graph attention employs a trainable parameter $\alpha$ that is shared across all nodes to compute attention scores, as follows:

g^{(t)}_{i}=Wz^{(t)}_{i,i}

(19)

g^{(t)}_{j}=Wz^{(t)}_{i,j}

(20)

\pi(i,j)=LeakyReLU(\alpha^{T}(g^{(t)}_{i}\oplus g^{(t)}_{j}))\\

(21)

a_{i,j}=\frac{exp(\pi(i,j))}{\sum_{k\in\mathcal{N}(i)\cup\{i\}}exp(\pi(i,k))}

(22)

z_{i}^{t}=\sigma\left(\sum_{j\in\mathcal{N}(i)\cup\{i\}}a_{i,j}z_{i,j}^{(t)}\right)

(23)

where $W,\alpha$ are trainable parameters, $\oplus$ denotes concatenation.

For the node-wise graph attention model, attention scores are computed using different $\alpha_{i}$ parameters corresponding to each node. It follows the same procedure as the vanilla attention method, except for the modification of Eq. 21 as follows:

\pi(i,j)=LeakyReLU(\alpha_{i}^{T}(g^{(t)}_{i}\oplus g^{(t)}_{j}))\\

(24)

To observe the effect of graph attention on ECNUM models, we utilize the similarity of the representations transformed by ECNUM between target nodes and source nodes at time $t$ , as follows:

sim^{t}(i,j)=\frac{z_{i,i}^{(t)}\cdot z_{i,j}^{(t)}}{|z_{i,i}^{(t)}||z_{i,j}^{(t)}|}

(25)

Table 6: Similarity comparison results for each dataset

Dataset	Model	F1	$sim^{t}$ Mean	$sim^{t}$ Std
SWaT	Origin	0.8541	0.5120	0.2351
	+Vanilla	0.8257	0.8857	0.0771
	+Node-wise	0.7939	0.8088	0.1000
WADI	Origin	0.6405	0.1940	0.2686
	+Vanilla	0.5134	0.6647	0.1389
	+Node-wise	0.5678	0.6361	0.2357
PSM	Origin	0.6285	0.3351	0.2186
	+Vanilla	0.6114	0.6532	0.1825
	+Node-wise	0.5669	0.6037	0.1393

The results for the mean and standard deviation of the similarity between neighboring nodes over time are shown in Table 6. The results show that applying graph attention significantly increases the similarity between node representations encoded by ECNUM. It shows that even though ECNUM modifies the representation of the source node, graph attention acts as a constraint to ensure similarity. Vanilla graph attention, which shares the $\alpha$ parameter across all nodes to compute attention scores, shows higher similarity than node-wise graph attention, which has an individual $\alpha_{i}$ parameter at node $i$ to compute the attention score. This indicates that sharing the $\alpha$ parameter increases similarity.

4.7.3 Efficiency of Edge Conditional Node Update Module

To evaluate the effectiveness of employing node embedding vectors to transform source node representation dynamically, we implement a comparative model based on the GDN [28] that is most similar in structure to our model. In this comparative model, the node conditional readout module replaces the output layer of the GDN model. Furthermore, we implemented transformation modules with a structure similar to ECNUM for each edge, as described in Section 3.5.1. This implementation follows the naive idea initially described in the same section.

Table 7: Results of Complexity

Model

Complexity

per Layer

Sequential

Operations

Maximum

Path Length

Comparative Model

O(Ed_{f}^{2})

O(E)

O(E)

ECNUM

O(E(d_{f}+2d_{e})d_{f})

O(1)

O(1)

As noted in Table 7, the complexity of the first layer of ECNUM is $O(E(d_{f}+2d_{e})d_{f})$ , compared to the comparative model’s $O(Ed_{f}^{2})$ . Regarding sequential operations and maximum path length, our model operates at $O(1)$ , while the comparative model requires $O(E)$ for both metrics.

In terms of memory efficiency, our model requires memory proportional to $(d_{f}+2d_{e})d_{f}+(n-1)d_{f}^{2}$ since the first layer transforms input feature dimension from $(d_{f}+2d_{e})$ to $d_{f}$ , and the subsequent layers operate from $d_{f}$ to $d_{f}$ . In the comparative model, each transformation module, consisting of $n$ layers that transform from $d_{f}$ to $d_{f}$ , is replicated across the square of the number of nodes, amounting to $N^{2}$ such modules. Consequently, the total memory required for all these transformation modules is proportional to $N^{2}nd_{f}^{2}$ . This indicates that as the number of nodes increases, our model becomes increasingly memory-efficient compared to the comparative model, making it particularly suitable for large-scale, complex systems.

Table 8: Results of efficiency

Model	F1	Time/Epoch	Parameters
GDN+1 layer	0.8163	35.7236	43155201
GDN+2 layer	0.8027	48.0908	86102913
GDN+3 layer	0.7941	55.2989	129050625
GDN+4 layer	0.7736	66.8926	171998337
ECNUM	0.8541	6.47009	305793

The models are evaluated based on the F1 score, time per epoch, and the number of model parameters. As illustrated in Table 8, our model surpasses the comparative model, achieving higher performance with fewer parameters and reducing time per epoch. Notably, the comparative model, equipped with four linear layers in its transformation module, possesses approximately 560 times more parameters and takes about 10 times longer per epoch. A trend of performance improvement is observed as the number of layers in the transformation module decreases, likely due to the reduction in the number of parameters to be learned. Additionally, we attribute the improved performance of our model to the effect of data augmentation. The transformation module learns from source node data corresponding to the edge, instead ECNUM utilizes as much source node data as there are edges.

4.8 Sensitivity

We assess the sensitivity of our model with respect to the window length and Top- $k$ factor. We report the average F1 score and standard deviation based on 10 repetitions of the experiments. For window length, experiments are conducted with window lengths of 3, 5, 7, 10, 15, 20, and 30. Regarding Top- $k$ , experiments are carried out with values of 5, 10, 15, 20, 25, 30, 35, and 40. However, for the PSM dataset, experiments are conducted only up to 25, as the dataset has a number of features lower than 30.

Figure 9 illustrates the results of the sensitivity analysis to the sliding window length. The figure shows that for the SWaT and WADI datasets, optimal performance is achieved with a window length of 5. In contrast, the PSM dataset exhibits stable performance across varying window lengths.

Figure 10 shows the results of the sensitivity analysis for the Top- $k$ factor. The figure shows that the performance of the models initially increases, reaches an optimal point, and then decreases. It is noteworthy that for the PSM dataset, there is no observed drop in performance, as the optimal point is identified at 25.

5 Conclusion

In our study, we propose ECNU-GNN model that addresses the problem of updating multiple target nodes with the same source node representation and learns the graph structure using graph attention. Our model innovatively utilizes node embedding vectors to transform source node representations according to connected target nodes and learn the graph structure without using graph attention, which acts as a constraint to enforce similarity in node representations. Our model shows a significant improvement in F1 scores across various datasets, including 5.4% for SWaT, 12.4% for WADI, and 6.0% for PSM. This advancement is particularly beneficial for large, complex industrial systems, where it can aid in identifying and diagnosing system anomalies, ultimately contributing to increased operational uptime and system reliability.

Nonetheless, there are certain limitations to our method. The use of fixed node embedding vectors in our model restricts its ability to reflect dynamic graph structures that evolve over time. Additionally, applying the same Top- $k$ factor to all nodes limits our model’s capability to determine the most appropriate number of neighboring nodes for each node. To overcome these challenges, our future work will concentrate on developing anomaly detection models that incorporate dynamic graph structures. These models will derive their structures from a graph inference network that learns from the temporal relationships between sensors, dynamically determining the optimal neighborhood size for each node. This advancement could enable our models to better adapt to changing data relationships, enhancing both the precision and effectiveness of anomaly detection in complex, evolving systems.

Acknowledgements

This work was supported by the Institute for Information & communications Technology Planning & Evaluation(IITP) grant funded by the Korea government(MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program(Korea University)) and Center for Applied Research in Artificial Intelligence (CARAI) grant funded by DAPA and ADD (UD230017TD).

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work the author(s) used ChatGPT in order to improve language and readability. After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.

References

[1] Y. Wu, H.-N. Dai, H. Tang, Graph neural networks for anomaly detection in industrial internet of things, IEEE Internet of Things Journal 9 (12) (2021) 9214–9231.
[2] F. Jin, H. Wu, Y. Liu, J. Zhao, W. Wang, Varying-scale hca-dbscan-based anomaly detection method for multi-dimensional energy data in steel industry, Information Sciences 647 (2023) 119479.
[3] D. Tang, S. Zhang, J. Chen, X. Wang, The detection of low-rate dos attacks using the sadbscan algorithm, Information Sciences 565 (2021) 229–247.
[4] P.-F. Marteau, Random partitioning forest for point-wise and collective anomaly detection—application to network intrusion detection, IEEE Transactions on Information Forensics and Security 16 (2021) 2157–2172.
[5] A. Basati, M. M. Faghih, Pdae: Efficient network intrusion detection in iot using parallel deep auto-encoders, Information Sciences 598 (2022) 57–74.
[6] F. Guo, Z. Wang, S. Du, H. Li, H. Zhu, Q. Pei, Z. Cao, J. Zhao, Detecting vehicle anomaly in the edge via sensor consistency and frequency characteristic, IEEE Transactions on Vehicular Technology 68 (6) (2019) 5618–5628.
[7] F. Jin, M. Chen, W. Zhang, Y. Yuan, S. Wang, Intrusion detection on internet of vehicles via combining log-ratio oversampling, outlier detection and metric learning, Information Sciences 579 (2021) 814–831.
[8] J. Liu, W. Huang, H. Li, S. Ji, Y. Du, T. Li, Slafusion: Attention fusion based on sax and lstm for dangerous driving behavior detection, Information Sciences 640 (2023) 119063.
[9] B. Weyrauch, B. Heisele, J. Huang, V. Blanz, Component-based face recognition with 3d morphable models, in: 2004 Conference on Computer Vision and Pattern Recognition Workshop, IEEE, 2004, pp. 85–85.
[10] M.-S. Lee, Y.-M. Yang, S.-W. Lee, Automatic video parsing using shot boundary detection and camera operation analysis, Pattern Recognition 34 (3) (2001) 711–719.
[11] M. Ahmad, S.-W. Lee, Human action recognition using multi-view image sequences, in: 7th International Conference on Automatic Face and Gesture Recognition (FGR06), IEEE, 2006, pp. 523–528.
[12] Shakhnarovich, Viola, Darrell, Fast pose estimation with parameter-sensitive hashing, in: Proceedings Ninth IEEE International Conference on Computer Vision, IEEE, 2003, pp. 750–757.
[13] J. Audibert, P. Michiardi, F. Guyard, S. Marti, M. A. Zuluaga, Usad: Unsupervised anomaly detection on multivariate time series, in: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, 2020, pp. 3395–3404.
[14] T. Tayeh, S. Aburakhia, R. Myers, A. Shami, An attention-based convlstm autoencoder with dynamic thresholding for unsupervised anomaly detection in multivariate time series, Machine Learning and Knowledge Extraction 4 (2) (2022) 350–370.
[15] H. Zhou, K. Yu, X. Zhang, G. Wu, A. Yazidi, Contrastive autoencoder for anomaly detection in multivariate time series, Information Sciences 610 (2022) 266–280.
[16] G. Zhang, X. Gao, L. Wang, B. Xue, S. Fu, J. Yu, Z. Huang, X. Huang, Probabilistic autoencoder with multi-scale feature extraction for multivariate time series anomaly detection, Applied Intelligence 53 (12) (2023) 15855–15872.
[17] D. Li, D. Chen, B. Jin, L. Shi, J. Goh, S.-K. Ng, Mad-gan: Multivariate anomaly detection for time series data with generative adversarial networks, in: International conference on artificial neural networks, Springer, 2019, pp. 703–716.
[18] B. Du, X. Sun, J. Ye, K. Cheng, J. Wang, L. Sun, Gan-based anomaly detection for multivariate time series using polluted training set, IEEE Transactions on Knowledge and Data Engineering (2021) 12208 – 12219.
[19] Z. Zhang, W. Li, W. Ding, L. Zhang, Q. Lu, P. Hu, T. Gui, S. Lu, Stad-gan: unsupervised anomaly detection on multivariate time series with self-training generative adversarial networks, ACM Transactions on Knowledge Discovery from Data 17 (5) (2023) 1–18.
[20] L. Xu, K. Xu, Y. Qin, Y. Li, X. Huang, Z. Lin, N. Ye, X. Ji, Tgan-ad: Transformer-based gan for anomaly detection of time series data, Applied Sciences 12 (16) (2022) 8085.
[21] H. Zhang, Y. Xia, T. Yan, G. Liu, Unsupervised anomaly detection in multivariate time series through transformer-based variational autoencoder, in: 2021 33rd Chinese Control and Decision Conference (CCDC), IEEE, 2021, pp. 281–286.
[22] Z. Chen, D. Chen, X. Zhang, Z. Yuan, X. Cheng, Learning graph structures with transformer for multivariate time-series anomaly detection in iot, IEEE Internet of Things Journal 9 (12) (2021) 9179–9189.
[23] S. Tuli, G. Casale, N. R. Jennings, Tranad: deep transformer networks for anomaly detection in multivariate time series data, Proceedings of the VLDB Endowment 15 (6) (2022) 1201–1214.
[24] C. Wang, S. Xing, R. Gao, L. Yan, N. Xiong, R. Wang, Disentangled dynamic deviation transformer networks for multivariate time series anomaly detection, Sensors 23 (3) (2023) 1104.
[25] H. Zhao, Y. Wang, J. Duan, C. Huang, D. Cao, Y. Tong, B. Xu, J. Bai, J. Tong, Q. Zhang, Multivariate time-series anomaly detection via graph attention network, in: 2020 IEEE International Conference on Data Mining (ICDM), IEEE, 2020, pp. 841–850.
[26] S. Guan, B. Zhao, Z. Dong, M. Gao, Z. He, Gtad: Graph and temporal neural network for multivariate time series anomaly detection, Entropy 24 (6) (2022) 759.
[27] L. Zhou, Q. Zeng, B. Li, Hybrid anomaly detection via multihead dynamic graph attention networks for multivariate time series, IEEE Access 10 (2022) 40967–40978.
[28] A. Deng, B. Hooi, Graph neural network-based anomaly detection in multivariate time series, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 35, 2021, pp. 4027–4035.
[29] C. Ding, S. Sun, J. Zhao, Mst-gat: A multimodal spatial–temporal graph attention network for time series anomaly detection, Information Fusion 89 (2023) 527–536.
[30] M. Defferrard, X. Bresson, P. Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering, Advances in neural information processing systems 29 (2016).
[31] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907 (2016).
[32] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, Graph attention networks, arXiv preprint arXiv:1710.10903 (2017).
[33] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Advances in neural information processing systems 27 (2014).
[34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
[35] W. Wu, C. Song, J. Zhao, Z. Xu, Physics-informed gated recurrent graph attention unit network for anomaly detection in industrial cyber-physical systems, Information Sciences 629 (2023) 618–633.
[36] M. Mirza, S. Osindero, Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784 (2014).
[37] A. Siffer, P.-A. Fouque, A. Termier, C. Largouet, Anomaly detection in streams with extreme value theory, in: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2017, pp. 1067–1075.
[38] A. P. Mathur, N. O. Tippenhauer, Swat: A water treatment testbed for research and training on ics security, in: 2016 international workshop on cyber-physical systems for smart water networks (CySWater), IEEE, 2016, pp. 31–36.
[39] C. M. Ahmed, V. R. Palleti, A. P. Mathur, Wadi: a water distribution testbed for research in the design of secure cyber physical systems, in: Proceedings of the 3rd international workshop on cyber-physical systems for smart water networks, 2017, pp. 25–28.
[40] A. Abdulaal, Z. Liu, T. Lancewicki, Practical approach to asynchronous multivariate time series anomaly detection and localization, in: Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, 2021, pp. 2485–2494.
[41] J. Goh, S. Adepu, K. N. Junejo, A. Mathur, A dataset to support research in the design of secure water treatment systems, in: Critical Information Infrastructures Security: 11th International Conference, CRITIS 2016, Paris, France, October 10–12, 2016, Revised Selected Papers 11, Springer, 2017, pp. 88–99.
[42] H. Xu, W. Chen, N. Zhao, Z. Li, J. Bu, Z. Li, Y. Liu, Y. Zhao, D. Pei, Y. Feng, et al., Unsupervised anomaly detection via variational auto-encoder for seasonal kpis in web applications, in: Proceedings of the 2018 world wide web conference, 2018, pp. 187–196.
[43] S. Kim, K. Choi, H.-S. Choi, B. Lee, S. Yoon, Towards a rigorous evaluation of time-series anomaly detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, 2022, pp. 7194–7201.
[44] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PloS one 10 (7) (2015) e0130140.