
CSCLog: A Component Subsequence Correlation-Aware Log Anomaly Detection Method

Ling Chen, College of Computer Science and Technology, Zhejiang University, 38 Zheda Road, Hangzhou, Zhejiang, China 310027, [email protected]; Chaodu Song, College of Software Technology, Zhejiang University, 38 Zheda Rd, Hangzhou, China 310027, [email protected]; Xu Wang, Database Products Business Unit, Alibaba Cloud Intelligence, 1008 CaiDent Street, Hangzhou, China 310030, [email protected]; Dachao Fu, Database Products Business Unit, Alibaba Cloud Intelligence, 1008 CaiDent Street, Hangzhou, China 310030, [email protected]; and Feifei Li, Database Products Business Unit, Alibaba Cloud Intelligence, 1008 CaiDent Street, Hangzhou, China 310030, [email protected]
Abstract.

Anomaly detection based on system logs plays an important role in intelligent operations, which is a challenging task due to the extremely complex log patterns. Existing methods detect anomalies by capturing the sequential dependencies in log sequences, which ignore the interactions of subsequences. To this end, we propose CSCLog, a Component Subsequence Correlation-Aware Log anomaly detection method, which not only captures the sequential dependencies in subsequences, but also models the implicit correlations of subsequences. Specifically, subsequences are extracted from log sequences based on components and the sequential dependencies in subsequences are captured by Long Short-Term Memory Networks (LSTMs). An implicit correlation encoder is introduced to model the implicit correlations of subsequences adaptively. In addition, Graph Convolution Networks (GCNs) are employed to accomplish the information interactions of subsequences. Finally, attention mechanisms are exploited to fuse the embeddings of all subsequences. Extensive experiments on four publicly available log datasets demonstrate the effectiveness of CSCLog, outperforming the best baseline by an average of 7.41% in Macro F1-Measure.

Log anomaly detection, component, correlation learning, attention mechanism
Journal: TKDD, volume 0, number 0, year xxxx, article 0. CCS Concepts: Computer systems organization → Reliability; Software and its engineering; Computing methodologies → Artificial intelligence.

1. Introduction

Log data are generated in software systems, e.g., high-performance computing systems, distributed file systems, and cloud services, which are unstructured or semi-structured data with temporal information (Zhang et al., 2020). Log anomaly detection aims to detect system anomalies in a timely manner based on log data, which plays an important role in intelligent operations. As the scale and complexity of the system grow, it becomes more difficult to detect and locate system anomalies manually (Liao et al., 2013). As a result, automated log anomaly detection methods (Chen et al., 2004; Liang et al., 2007; Bodik et al., 2010; Lou et al., 2010; Xu et al., 2009; Lin et al., 2016; Du et al., 2017; Zhang et al., 2019; Meng et al., 2019) are proposed. Generally, in these methods, log messages are parsed into log templates and other features, and detection algorithms are employed to locate anomalies from a series of log messages, i.e., log sequence.

Automated log anomaly detection methods are generally divided into two classes, i.e., traditional methods and neural network-based methods. Traditional methods (Chen et al., 2004; Liang et al., 2007; Bodik et al., 2010; Lou et al., 2010; Xu et al., 2009; Lin et al., 2016) detect anomalies by extracting the statistical features in log sequences, e.g., number of templates and keyword distribution, and using traditional machine learning models to capture the anomalous information of these features. However, due to the differences in distributions of statistical features of different datasets, it is difficult for these methods to achieve good performance on different datasets. In addition, these methods fail to capture the sequential dependencies in log sequences, which can also indicate the anomalies.


Figure 1. Two log sequences constructed from the OpenStack dataset. The anomalous subsequence of the component “nova.osapi_compute.wsgi.server” is highlighted in red. (Best viewed in color).

Neural network-based methods (Du et al., 2017; Zhang et al., 2019; Meng et al., 2019; Li et al., 2020; Yin et al., 2020; Nedelkoski et al., 2020; Wang et al., 2021) usually employ Recurrent Neural Networks (RNNs) and their variants, e.g., Long Short-Term Memory Networks (LSTMs) and Gated Recurrent Units (GRUs), to capture the sequential dependencies in log sequences. These methods focus on mining the patterns of log sequences and modeling them into unified feature representations for anomaly detection. In addition, some methods (Han and Yuan, 2021; Xia et al., 2021) introduce Generative Adversarial Networks (GANs) to enhance the ability to capture anomalous information by generating new log data. Some methods (Zhang et al., 2021; Wang et al., 2022) introduce Convolutional Neural Networks (CNNs) to capture the temporal dependencies in log sequences by arranging log templates as a matrix. Some methods (Wan et al., 2021) construct the log sequence as a graph and employ Graph Neural Networks (GNNs) to model information interactions in the graph. However, these methods ignore the interactions of subsequences, which can demonstrate the complex log patterns of components. For example, as shown in Fig. 1, two log sequences on the OpenStack dataset have the same sequence of log templates, but their anomaly labels are different. We find that the subsequences of components extracted from the two log sequences are different. The subsequences of the component “nova.osapi_compute.wsgi.server” are “T3, T4, T5” in the normal sequence and “T4, T5, T3” in the anomalous sequence. Log template “T5” indicates the stop of the compute node in the cloud service, and log template “T3” indicates data requests from the node, which cannot appear after the template “T5” during the normal running of the system. This example shows that subsequences can be used to detect anomalies that are difficult to detect by capturing the sequential dependencies in log sequences alone.

To address the aforementioned problems, we propose CSCLog, a Component Subsequence Correlation-Aware Log anomaly detection method, which not only captures the sequential dependencies in subsequences, but also models the implicit correlations of subsequences. Specifically, the main contributions are outlined as follows:

  • We introduce a subsequence modeling module to extract subsequences from log sequences based on components, and exploit LSTMs to capture the sequential dependencies in subsequences, which can model the complex log patterns of components.

  • We introduce an implicit correlation encoder to model the implicit correlations of subsequences adaptively, and employ Graph Convolutional Networks (GCNs) to accomplish the information interactions of subsequences, which can model the influences between subsequences.

  • We conduct extensive experiments on four publicly available log datasets. Experimental results demonstrate the state-of-the-art (SOTA) performances of CSCLog, outperforming the best baseline by an average of 7.41% in Macro F1-Measure.

2. Related Work

In this section, we provide an overview of works related to log anomaly detection, including traditional methods and neural network-based methods.

2.1. Traditional Methods

Traditional methods for log anomaly detection can be generally divided into two classes, i.e., supervised learning-based traditional methods and unsupervised or self-supervised learning-based traditional methods.

Supervised learning-based traditional methods aim to classify log sequences by extracting statistical features, e.g., the number of templates and keyword distribution, which require the data with labels. Chen et al. (Chen et al., 2004) introduced decision trees, which have strong interpretability, to model the number of templates in log sequences. Liang et al. (Liang et al., 2007) extracted six different features from log sequences, e.g., time interval and the number of templates, as the inputs of their classifier. Recently, Bodik et al. (Bodik et al., 2010) employed the Support Vector Machine (SVM) based on the Gaussian kernel to model features in log messages, e.g., frequency of occurrence, periodicity, and correlation of anomalies. Since the frequency of anomalies in a system is low and it is difficult to get a large amount of labeled log data, unsupervised or self-supervised learning-based traditional methods are proposed. Lou et al. (Lou et al., 2010) introduced Invariant Mining (IM) to derive linear relations of the number of log templates, which can capture the co-occurrence patterns of log templates. Xu et al. (Xu et al., 2009) applied Principal Component Analysis (PCA) to log anomaly detection, which uses the count vector of log templates and the parameter value vector of log messages as the inputs. Lin et al. (Lin et al., 2016) used the Agglomerative Hierarchical Clustering algorithm (AHC) to classify log sequences into normal and anomalous clusters, which detects whether a newly arriving log sequence is anomalous by calculating its distance to the two clusters.

However, due to the differences in the distribution of statistical features of different datasets, these methods fail to achieve good performance on different datasets. In addition, when an anomaly occurs, a series of program executions are logged in the system, while traditional methods fail to capture the sequential dependencies in log sequences.

2.2. Neural Network-Based Methods

With the development of deep learning, neural network-based methods began to emerge. These methods usually capture the sequential dependencies in log sequences by employing RNNs and their variants, e.g., LSTMs and GRUs. To portray the differences in patterns between normal and anomalous sequences, these methods model the patterns of log sequences into a unified feature representation. Du et al. (Du et al., 2017) predicted the log template of the next log message by employing LSTMs to learn the patterns of log sequences when the system runs normally. To capture the possible anomalies in the log parameters, the same LSTMs were applied to check the parameter values of log messages. Zhang et al. (Zhang et al., 2019) exploited bidirectional LSTMs to detect sequential anomalies with semantic features extracted by TextBoxes models (Liao et al., 2017). Meng et al. (Meng et al., 2019) captured quantitative anomalies by introducing the number of templates and detecting linear relations between log messages. Li et al. (Li et al., 2020) introduced time intervals in log sequences, and the bidirectional LSTMs were used to capture both the sequential dependencies and the patterns of time intervals.

To capture the complex log patterns, some methods enhance the perception of the features by introducing component sequences, auxiliary datasets, and multi-scale designs. Yin et al. (Yin et al., 2020) thought that the calling relations of components also contain log patterns and extracted the component sequences from log data as the inputs of the LSTMs. Nedelkoski et al. (Nedelkoski et al., 2020) introduced auxiliary datasets from other systems as the negative sample set, and assumed that the effectiveness of log anomaly detection methods mainly depended on the ability of the model to distinguish between normal and anomalous sequences. Wang et al. (Wang et al., 2021) introduced a multi-scale design to slice the log sequence into fixed-length subsequences containing local log patterns, and LSTMs were used to extract the sequential dependencies at different scales.

Besides RNNs, other deep neural networks, e.g., GANs, CNNs, and GNNs, are also employed to address different challenges. Han et al. (Han and Yuan, 2021) and Xia et al. (Xia et al., 2021) introduced GANs to generate new log data, which can enhance the ability of the model to capture anomalous information. To model the temporal correlation of log templates, Zhang et al. (Zhang et al., 2021) and Wang et al. (Wang et al., 2022) arranged log templates as a matrix and captured the temporal dependencies in log sequences with CNNs and their variants. Wan et al. (Wan et al., 2021) transferred the log sequence into a graph, where nodes represent log templates and edges represent the order of log templates in the sequence, and GNNs were employed to model the information interactions in the graph and capture the log patterns. However, these methods ignore the interactions of subsequences, which can demonstrate the complex log patterns of components.

In this work, we propose CSCLog, a Component Subsequence Correlation-Aware Log anomaly detection method, which extracts subsequences from log sequences based on components and introduces an implicit correlation encoder to model their implicit correlations adaptively.

3. Definitions and Preliminaries

In this section, we provide the definitions of the associated terms used in CSCLog and formulate the template prediction task and anomaly detection task.

Definition 1. Log Message. A log message can be formalized as a tuple of attributes, denoted as $m=(e,p,t)$, where $e$ denotes the log template, which indicates the parsed structured textual data of the log data, $p$ denotes the component, and $t$ denotes the timestamp.

Definition 2. Log Sequence. A log sequence is regarded as a series of log messages ordered in time, denoted as $S=\left\{m_{1},m_{2},\ldots,m_{N}\right\}$, where $N$ denotes the length of the log sequence.

Definition 3. Subsequence. A subsequence is regarded as a series of log messages that contain the same component, denoted as $P_{j}=\left\{m_{1},m_{2},\ldots,m_{N_{p_{j}}}\right\}$, where $p_{j}$ and $N_{p_{j}}$ denote the component and the length of the subsequence, respectively.

Problem 1. Template Prediction Task. The template prediction task aims to predict the log template of the next log message, formalized as $\hat{e}=f\left(S,\varepsilon_{\rm e},\varepsilon_{\rm p}\right)$, where $S$, $\varepsilon_{\rm e}$, and $\varepsilon_{\rm p}$ denote the input log sequence, the set of log templates, and the set of components, respectively.

Problem 2. Anomaly Detection Task. Given a set of log sequences $M=\left\{S_{1},S_{2},\ldots,S_{N_{\rm s}}\right\}$, where $N_{\rm s}$ denotes the number of log sequences and each log sequence $S_{i}$ is normal, the anomaly detection task aims to detect whether a new log sequence $\hat{S}$ is normal or not by capturing the log patterns in $M$.

4. Methodology

In this section, we first give the framework of CSCLog and then describe individual modules in detail.


Figure 2. The overall framework of CSCLog.

4.1. Overview

The framework of CSCLog is shown in Fig. 2, which consists of three modules: (1) subsequence modeling; (2) implicit correlation modeling; (3) subsequence feature fusion. For the subsequence modeling module, subsequences are extracted from the log sequence based on components, and LSTMs are introduced to capture the sequential dependencies in subsequences. For the implicit correlation modeling module, an implicit correlation encoder is introduced to model the implicit correlations of subsequences adaptively, and GCNs are employed to accomplish the interactions of subsequences, which can model the influences between subsequences. For the subsequence feature fusion module, attention mechanisms are exploited to fuse the embeddings of all subsequences. The details of these modules are introduced in Sections 4.2 to 4.4.

4.2. Subsequence Modeling

We extract subsequences from the log sequence, where the log messages in the same subsequence contain the same component, and create empty subsequences for components that are not contained in the log sequence. We denote the set of subsequences as $C=\left\{P_{1},P_{2},\ldots,P_{N_{\rm c}}\right\}$, where $N_{\rm c}$ is the number of components.
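
To make the extraction step concrete, the following is a minimal sketch of grouping log messages by component; the dictionary field names and the component names are illustrative assumptions, not the released data schema.

```python
# A minimal sketch of subsequence extraction by component (assumed data layout).
def extract_subsequences(log_sequence, all_components):
    """Group the messages of one log sequence by component; components that
    do not appear in the sequence get an empty subsequence."""
    subsequences = {p: [] for p in all_components}
    for message in log_sequence:
        subsequences[message["component"]].append(message)
    return subsequences

# Hypothetical usage with illustrative component names.
sequence = [
    {"template": "T1", "component": "compA", "timestamp": 0},
    {"template": "T3", "component": "compB", "timestamp": 2},
]
subs = extract_subsequences(sequence, ["compA", "compB", "compC"])
# subs["compC"] == []  (empty subsequence for an absent component)
```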

To capture anomalous information with different features in log data, we extract semantic and temporal features in log sequences and subsequences. The semantic features are the descriptions of the unstructured text in the log templates. We first create a set of keywords extracted from the text of the log templates by removing the non-character tokens and prepositions. Then, to obtain semantic information of the words in the set, we map each word to a 768-dimensional word vector by a pre-trained BERT model (Sun et al., 2019). Finally, we calculate the weighted average of the vectors of all keywords in the log template to obtain its semantic vector $\boldsymbol{v}^{\prime}$, which is formalized as:

(1) $\boldsymbol{v}^{\prime}=\frac{1}{N_{\rm e}}\sum_{i=1}^{N_{\rm e}}w_{i}\boldsymbol{v}_{i}$

where $\boldsymbol{v}_{i}$, $w_{i}$, and $N_{\rm e}$ denote the word vector of the $i$th keyword, the Term Frequency-Inverse Document Frequency (TF-IDF) weight of the $i$th keyword, and the number of keywords in the log template, respectively. By extracting semantic features for all log messages in the log sequence, we obtain the semantic features $\boldsymbol{V}\in\mathbb{R}^{N\times 768}$ of the log sequence, where $N$ denotes the length of the sequence.
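
As a concrete illustration of Eq. (1), the sketch below assumes the keywords, their pre-trained 768-dimensional word vectors, and their TF-IDF weights have already been prepared; the function and variable names are ours, not the released implementation's.

```python
import numpy as np

def semantic_vector(keywords, word_vectors, tfidf_weights):
    """Eq. (1): TF-IDF-weighted average of the keywords' word vectors.
    keywords: list of strings; word_vectors: dict word -> (768,) array;
    tfidf_weights: dict word -> float."""
    weighted = [tfidf_weights[w] * word_vectors[w] for w in keywords]
    return np.sum(weighted, axis=0) / len(keywords)   # (1/N_e) * sum_i w_i * v_i
```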

Temporal features focus on extracting the difference in time of log messages. Specifically, the timestamp of the first log message in the log sequence is regarded as the start time, and we take the interval between the timestamp of a log message and the start time as its temporal feature $t^{\prime}$, which is formalized as:

(2) $t^{\prime}=t_{i}-t_{1}$

where $t_{i}$ denotes the timestamp of the $i$th log message, $t_{i}\in\mathbb{N}$. By extracting the temporal features for all log messages in the log sequence, we obtain its temporal features $\boldsymbol{t}\in\mathbb{R}^{N\times 1}$.

We introduce a feature extractor to map features with different dimensions into a low-dimensional feature space. Semantic and temporal features are first sent to a corresponding Multilayer Perceptron (MLP) to obtain semantic and temporal embeddings, whose dimensions are $N\times d_{\rm sem}$ and $N\times d_{\rm tim}$, respectively. Then, we concatenate these two embeddings to obtain the feature embedding $\boldsymbol{X}\in\mathbb{R}^{N\times d}$, which is formalized as:

(3) $\boldsymbol{X}={\rm Concat}\left(\varphi_{1}\left(\boldsymbol{V}\right),\varphi_{2}\left(\boldsymbol{t}\right)\right)$

where $\varphi_{1}$ and $\varphi_{2}$ denote embedding operations, which are accomplished by MLP, and ${\rm Concat}$ denotes the concatenation operation. In addition, to balance the effects of semantic and temporal features on model performance, a threshold $\alpha_{\rm emb}\in(0,1)$ is applied to control the dimensions of the two embeddings, i.e., $d_{\rm sem}=\alpha_{\rm emb}d$ and $d_{\rm tim}=(1-\alpha_{\rm emb})d$.
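
A hedged PyTorch sketch of Eq. (3) follows (the paper's implementation uses PyTorch); the hidden width d, the value of alpha_emb, and the single-layer MLPs are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Eq. (3): map semantic and temporal features to embeddings and concatenate."""
    def __init__(self, d=256, alpha_emb=0.8):
        super().__init__()
        d_sem = int(alpha_emb * d)          # d_sem = alpha_emb * d
        d_tim = d - d_sem                   # d_tim = (1 - alpha_emb) * d
        self.phi1 = nn.Linear(768, d_sem)   # semantic embedding (phi_1)
        self.phi2 = nn.Linear(1, d_tim)     # temporal embedding (phi_2)

    def forward(self, V, t):
        # V: (N, 768) semantic features; t: (N, 1) temporal features
        return torch.cat([self.phi1(V), self.phi2(t)], dim=-1)  # (N, d)
```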

Finally, we employ the LSTMs to capture the sequential dependencies in the log sequence and obtain its sequential embedding $\boldsymbol{x}^{\prime}\in\mathbb{R}^{1\times d^{\prime}}$, which is formalized as:

(4) $\boldsymbol{x}^{\prime}={\rm LSTM}\left(\boldsymbol{X}\right)$

Similarly, we obtain the sequential embeddings of all the subsequences $\boldsymbol{X}_{\rm c}\in\mathbb{R}^{N_{\rm c}\times d^{\prime}}$, where $N_{\rm c}$ denotes the number of components. The parameters of the LSTMs are shared between subsequences.
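
A minimal sketch of Eq. (4), assuming the final hidden state of a two-layer LSTM (the depth stated in Section 5.2) is used as the sequential embedding; the pooling choice and dimensions are our assumptions.

```python
import torch
import torch.nn as nn

# One LSTM is shared between the log sequence and all subsequences.
lstm = nn.LSTM(input_size=256, hidden_size=256, num_layers=2, batch_first=True)

def sequential_embedding(X):
    # X: (batch, length, d) feature embeddings of a (sub)sequence
    _, (h_n, _) = lstm(X)
    return h_n[-1]   # (batch, d'): final hidden state of the last layer
```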

4.3. Implicit Correlation Modeling

To model the interactions of subsequences, we introduce the implicit correlation encoder to learn the implicit correlation of different subsequences end-to-end. Specifically, to avoid the correlation of any subsequence being lost, the embeddings of subsequences are concatenated in pairs based on the assumption that interactions exist between any subsequence pair (Kipf et al., 2018). We obtain the correlation embedding $\boldsymbol{X}_{\rm edge}\in\mathbb{R}^{N_{\rm edge}\times 2d^{\prime}}$, where $N_{\rm edge}$ denotes the number of subsequence pairs. The correlation embedding of a subsequence pair is formalized as:

(5) $\boldsymbol{X}_{\rm edge}^{i,j}=\sigma_{1}\left(\varphi_{3}\left({\rm Concat}\left(\boldsymbol{X}_{\rm c}^{i},\boldsymbol{X}_{\rm c}^{j}\right)\right)\right)$

where $\boldsymbol{X}_{\rm c}^{i}$ and $\boldsymbol{X}_{\rm c}^{j}$ denote the embeddings of the $i$th and $j$th subsequences, respectively, $\sigma_{1}$ denotes the ReLU activation function, and $\varphi_{3}$ denotes the MLP. Then, we obtain the correlation weight $\boldsymbol{x}_{\rm rel}\in\mathbb{R}^{N_{\rm edge}\times 1}$ of each subsequence pair based on their correlation embeddings, which is formalized as:

(6) $\boldsymbol{x}_{\rm rel}^{i,j}=\sigma_{2}\left({\rm Conv}\left(\boldsymbol{X}_{\rm edge}^{i,j}\right)\right)$

where $\boldsymbol{X}_{\rm edge}^{i,j}$ denotes the correlation embedding of the $i$th and $j$th subsequences, ${\rm Conv}$ denotes the convolution operation, and $\sigma_{2}$ denotes the Sigmoid activation function, which constrains the correlation weight to lie between 0 and 1.
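
The following is a hedged PyTorch sketch of Eqs. (5)-(6); the pairwise enumeration, the single-layer MLP, and the kernel size of the convolution are our assumptions rather than the released configuration.

```python
import torch
import torch.nn as nn

class ImplicitCorrelationEncoder(nn.Module):
    """Eqs. (5)-(6): pairwise correlation weights between subsequence embeddings."""
    def __init__(self, d_prime=256):
        super().__init__()
        self.phi3 = nn.Sequential(nn.Linear(2 * d_prime, 2 * d_prime), nn.ReLU())
        # A 1-D convolution whose kernel spans the whole embedding reduces it to a scalar.
        self.conv = nn.Conv1d(1, 1, kernel_size=2 * d_prime)

    def forward(self, X_c):
        # X_c: (N_c, d') subsequence embeddings
        n = X_c.size(0)
        i, j = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
        pairs = torch.cat([X_c[i.reshape(-1)], X_c[j.reshape(-1)]], dim=-1)  # (N_c^2, 2d')
        X_edge = self.phi3(pairs)                                            # Eq. (5)
        x_rel = torch.sigmoid(self.conv(X_edge.unsqueeze(1))).reshape(-1)    # Eq. (6)
        return x_rel   # one weight in (0, 1) per ordered subsequence pair
```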

Finally, GCNs are employed to accomplish the interactions of subsequences. Specifically, we construct the implicit correlation graph, in which the embeddings of subsequences are regarded as the embeddings of nodes in the graph, and the correlation weights between different subsequences are regarded as the weights of edges, which indicate the importance of the interactions of subsequences. The updated subsequence embedding $\boldsymbol{X}_{\rm m}\in\mathbb{R}^{N_{\rm c}\times d^{\prime}}$ is obtained by stacking the GCN layers, which is formalized as:

(7) $\boldsymbol{X}_{\rm m}=\sigma_{3}\left({\rm GCN}\left(\boldsymbol{X}_{\rm c},\boldsymbol{x}_{\rm rel}\right)\right)$

where $\boldsymbol{X}_{\rm c}$ denotes the embeddings of subsequences, $\boldsymbol{x}_{\rm rel}$ denotes the correlation weights, ${\rm GCN}$ denotes the message passing operation, and $\sigma_{3}$ denotes the ReLU activation function.
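
A sketch of Eq. (7) with torch_geometric, which the paper's implementation also relies on; the two-layer configuration matches Section 5.2, while the fully connected edge construction is our assumption.

```python
import torch
from torch_geometric.nn import GCNConv

class CorrelationGCN(torch.nn.Module):
    """Eq. (7): propagate subsequence embeddings over the implicit correlation graph."""
    def __init__(self, d_prime=256):
        super().__init__()
        self.gcn1 = GCNConv(d_prime, d_prime)
        self.gcn2 = GCNConv(d_prime, d_prime)   # two GCN layers, as in Section 5.2

    def forward(self, X_c, x_rel):
        # X_c: (N_c, d'); x_rel: (N_c * N_c,) pairwise correlation weights
        n = X_c.size(0)
        i, j = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
        edge_index = torch.stack([i.reshape(-1), j.reshape(-1)])   # fully connected graph
        h = torch.relu(self.gcn1(X_c, edge_index, edge_weight=x_rel))
        return torch.relu(self.gcn2(h, edge_index, edge_weight=x_rel))   # X_m
```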

4.4. Subsequence Feature Fusion

Attention mechanisms are employed to fuse the embeddings of different subsequences without considering the effect of positional relations. Specifically, we introduce a global vector $\boldsymbol{u}_{\rm att}\in\mathbb{R}^{1\times d^{\prime}}$, and the attention weights $\beta_{i}$ of the embeddings of different subsequences are calculated by the attention mechanisms (Chen et al., 2021), which is formalized as:

(8) $\beta_{i}=\frac{{\rm AttScore}\left(\boldsymbol{X}_{\rm m}^{i},\boldsymbol{u}_{\rm att}\right)}{\sum_{k=0}^{N_{\rm c}}\left({\rm AttScore}\left(\boldsymbol{X}_{\rm m}^{k},\boldsymbol{u}_{\rm att}\right)\right)}$

where ${\rm AttScore}$ denotes the attention scoring function, which is accomplished by the dot product of vectors, and $\boldsymbol{X}_{\rm m}^{i}$ denotes the embedding of the $i$th subsequence. To avoid the weight $\beta_{i}$ jumping between two neighboring iterations, resulting in unstable training, we introduce a constraint on it, which is formalized as:

(9) $\beta_{i}^{j}\leftarrow\gamma\beta_{i}^{j-1}+(1-\gamma)\beta_{i}^{j}$

where $\gamma$ denotes the dynamic threshold and $\beta_{i}^{j}$ denotes the attention weight of the $i$th subsequence at the $j$th iteration. Then, we fuse the embeddings of all subsequences to get the fused embedding $\boldsymbol{x}_{\rm att}\in\mathbb{R}^{1\times d^{\prime}}$, which is formalized as:

(10) $\boldsymbol{x}_{\rm att}=\sum_{i=0}^{N_{\rm c}}\beta_{i}\boldsymbol{X}_{\rm m}^{i}$

where $\boldsymbol{X}_{\rm m}^{i}$ denotes the embedding of the $i$th subsequence and $\beta_{i}$ denotes the attention weight of the $i$th subsequence.
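
Eqs. (8)-(10) can be sketched as follows; treating AttScore as a plain dot product and carrying the previous iteration's weights for the constraint in Eq. (9) are assumptions consistent with the text, and gamma is only an example value.

```python
import torch

def fuse_subsequences(X_m, u_att, beta_prev=None, gamma=0.5):
    # X_m: (N_c, d') updated subsequence embeddings; u_att: (d',) global vector
    scores = X_m @ u_att                      # AttScore via dot product
    beta = scores / scores.sum()              # Eq. (8): normalized attention weights
    if beta_prev is not None:
        beta = gamma * beta_prev + (1 - gamma) * beta   # Eq. (9): smoothing constraint
    x_att = (beta.unsqueeze(-1) * X_m).sum(dim=0)       # Eq. (10): fused embedding
    return x_att, beta
```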

Finally, to obtain the probabilities of the log template of the next log message $\hat{\boldsymbol{x}}\in\mathbb{R}^{N_{\rm event}\times 1}$, where $N_{\rm event}$ denotes the number of log templates, we concatenate the fused embedding with the sequential embedding of the log sequence and send it to the classifier, which is formalized as:

(11) $\hat{\boldsymbol{x}}=\sigma_{\rm out}\left(\varphi_{\rm out}\left({\rm Concat}\left(\boldsymbol{x}^{\prime},\boldsymbol{x}_{\rm att}\right)\right)\right)$

where $\boldsymbol{x}^{\prime}$ denotes the sequential embedding, $\boldsymbol{x}_{\rm att}$ denotes the fused embedding of subsequences, $\varphi_{\rm out}$ denotes the transformation function, which is implemented by the MLP, and $\sigma_{\rm out}$ denotes the Softmax activation function.
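
A minimal sketch of Eq. (11), assuming a single linear layer for $\varphi_{\rm out}$; the actual depth and width of the MLP in the released code may differ.

```python
import torch
import torch.nn as nn

class TemplateClassifier(nn.Module):
    """Eq. (11): predict a distribution over log templates for the next message."""
    def __init__(self, d_prime=256, n_event=47):
        super().__init__()
        self.phi_out = nn.Linear(2 * d_prime, n_event)   # phi_out as a single linear layer

    def forward(self, x_seq, x_att):
        # x_seq: (d',) sequential embedding; x_att: (d',) fused subsequence embedding
        logits = self.phi_out(torch.cat([x_seq, x_att], dim=-1))
        return torch.softmax(logits, dim=-1)   # sigma_out: probabilities over templates
```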

4.5. Training and Anomaly Detection

CSCLog is trained on the template prediction task. The training objective is to minimize the cross-entropy loss, which is formalized as:

(12) $\mathcal{L}=-\sum_{i=1}^{N_{\rm event}}y_{i}\log\left(\hat{\boldsymbol{x}}_{i}\right)$

where $\hat{\boldsymbol{x}}_{i}$ denotes the predicted probability of the $i$th log template, and $y_{i}$ indicates whether the $i$th log template is the true log template, which is formalized as:

(13) $y_{i}=\begin{cases}1,&i=\text{index of the true log template}\\0,&\text{otherwise}\end{cases}$
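
Under the assumption that the classifier outputs Softmax probabilities as in Eq. (11), Eqs. (12)-(13) reduce to the standard cross-entropy over the index of the true next template, sketched below.

```python
import torch
import torch.nn.functional as F

def template_loss(probabilities, true_template_idx):
    """Eqs. (12)-(13): cross-entropy between predicted template probabilities
    and the one-hot indicator of the true next template."""
    # probabilities: (batch, N_event) Softmax outputs; true_template_idx: (batch,)
    return F.nll_loss(torch.log(probabilities + 1e-12), true_template_idx)
```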

To improve the computational efficiency of the model, the input data are processed in parallel using batch processing.

To determine whether a log sequence is normal or anomalous, we use a sliding window to get a set of log sequence samples. Each sample $S_{\rm t}=\left\{m_{1},m_{2},\ldots,m_{N_{\rm w}}\right\}$, where $N_{\rm w}$ denotes the length of the sliding window, is sent to the trained model to get the set of top-$k$ ranked templates. $S_{\rm t}$ is regarded as an anomalous sample when the true log template of the next log message is not in the set, which is formalized as:

(14) $y_{\rm t}=\begin{cases}0,&e_{\rm t}\in\text{the set of top-}k\text{ ranked templates}\\1,&\text{otherwise}\end{cases}$

where $e_{\rm t}$ denotes the true log template of the next log message. We consider that an anomaly occurs in the log sequence when the number of anomalous samples reaches the set threshold $\alpha_{\rm anom}$.
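
As an illustration of Eq. (14) and the sequence-level decision, the sketch below flags a window when the true next template is outside the top-k predictions and flags the sequence once the count of flagged windows reaches alpha_anom; k = 7 and alpha_anom = 4 are the HDFS values from Table 3, used here only as example defaults.

```python
import torch

def detect_anomaly(probabilities, true_templates, k=7, alpha_anom=4):
    """Eq. (14) plus the alpha_anom threshold at the sequence level."""
    # probabilities: (num_windows, N_event) predictions for each sliding window
    # true_templates: (num_windows,) index of the true next template per window
    topk = probabilities.topk(k, dim=-1).indices                   # (num_windows, k)
    y_t = (topk != true_templates.unsqueeze(-1)).all(dim=-1)       # Eq. (14) per window
    return int(y_t.sum().item()) >= alpha_anom                     # sequence-level label
```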

Table 1. The details of the experimental datasets.
Datasets # Components # Log templates # Log sequences Average length Anomaly rate
HDFS 8 47 22,023 32.9 9.1%
BGL 10 1,372 13,074 62.3 30.6%
ThunderBird 48 302 12,841 37.8 22.4%
OpenStack 12 42 5,753 28.6 9.6%
Table 2. Search spaces of hyper-parameters and configurations of NNI.
Parameters Datasets Choice
Search spaces Learning rate All {0.0001, 0.001, 0.01}
Batch size All {8, 16, 32}
Threshold of feature extractor All {0.2, 0.5, 0.8}
# Hidden layer units All {128, 256, 512}
Dropout rate All {0.1, 0.5, 0.9}
Length of sliding window All {9, 11, 13, 15, 17, 19, 21, 23, 25}
Threshold of anomaly detection All {1, 2, 3, 4, 5, 6, 7, 8, 9}
# Top-k ranked templates HDFS {1, 3, 5, 7, 9, 11, 13, 15, 17}
BGL {1, 23, 45, 67, 89, 111, 133, 155, 177}
ThunderBird {1, 5, 9, 13, 17, 21, 25, 29, 33}
OpenStack {1, 3, 5, 7, 9, 11, 13, 15, 17}
Configurations of NNI Max trial number All 100
Optimization algorithm All TPE
Parallel number All 2
Table 3. The best hyper-parameters on different datasets.
      Parameters       HDFS       BGL       ThunderBird       OpenStack
      Learning rate       0.0001       0.0001       0.0001       0.0001
      Batch size       16       8       32       32
      Threshold of feature extractor       0.8       0.5       0.8       0.8
      # Hidden layer units       512       256       128       256
      Dropout rate       0.1       0.9       0.5       0.1
      Length of sliding window       19       25       11       9
      Threshold of anomaly detection       4       8       1       4
      # Top-k ranked templates       7       45       29       3

5. Experiments

In this section, we justify the superiority of CSCLog by conducting extensive experiments. Firstly, we introduce the experimental datasets and settings. Subsequently, we present the comparison with baselines, the ablation study, the parameter sensitivity analysis, and the case study.

5.1. Datasets

To evaluate the performance of CSCLog, we conduct experiments on four public log datasets, i.e., HDFS (Xu et al., 2009), BGL (Oliner and Stearley, 2007), ThunderBird (Oliner and Stearley, 2007), and OpenStack (Du et al., 2017), which contain anomaly labels. The details of the experimental datasets are shown in Table 1.

HDFS: The HDFS dataset is generated by Hadoop-based MapReduce jobs deployed on more than 2,000 Amazon EC2 nodes, and contains 11,175,629 log messages collected over 39 hours. The log sequences are extracted directly based on the block_id in a log message, and are manually labeled as anomalous or normal by Hadoop domain experts.

BGL: The BGL dataset contains 4,747,963 log messages generated by the BlueGene/L supercomputer deployed at Lawrence Livermore National Laboratory, with a time span of 7 months. Each log message in the dataset is manually labeled as anomalous or normal by the domain experts. A sliding window with a length of 10 seconds is used to extract the log sequences. Log sequences are labeled as anomalous when they contain anomalous log messages.

ThunderBird: ThunderBird is a large dataset of over 200 million log messages, generated on the ThunderBird supercomputer system at Sandia National Laboratories (SNL). We extracted 2 million consecutive log messages from 9:18:40 on November 11, 2005 to 20:28:59 on November 13, 2005, with a time span of about 59 hours. Similar to BGL, each log message is manually labeled as anomalous or normal by the domain experts. A sliding window with a length of 10 seconds is used to extract the log sequences. Log sequences are labeled as anomalous when they contain anomalous log messages.

OpenStack: OpenStack is a small dataset generated by a cloud operating system with 10 compute nodes, and contains 207,636 log messages collected over 30 hours. The dataset is divided into two normal log files and an anomaly file. A sliding window with a length of 10 seconds is used to extract the log sequences, and each log sequence is labeled with the same label as its file.

5.2. Experimental Settings

CSCLog is implemented in Python with PyTorch 1.10.0 (Paszke et al., 2019) and PyTorch Geometric 2.1.0 (Rozemberczki et al., 2021), and the source code is released on GitHub (https://github.com/Hang-Z/CSCLog). We conduct all the experiments with an Intel Xeon Gold 5118 CPU and an NVIDIA GeForce RTX 2080Ti GPU. We parse the log messages with the parsing tool Drain (He et al., 2017) using default parameters. All the datasets are split into training, validation, and test sets with a ratio of 7:1:2 according to the temporal order of log sequences.

Adam (Kingma and Ba, 2015) is adopted as the optimizer, and the weight decay rate is set to 0.0001. The number of iterations is set to 20. To avoid overfitting, we apply early stopping (Yao et al., 2007) in the training stage, where the stop patience is set to 50% of the number of iterations by default. The number of layers of the GCNs in the implicit correlation encoder is set to 2. The number of layers of the LSTMs in the subsequence modeling module is set to 2. The step length of the sliding window for sampling is set to 1. The global vector $\boldsymbol{u}_{\rm att}$ in the subsequence feature fusion module is initialized with the Xavier initialization. For the other hyper-parameters of the model, we exploit the Neural Network Intelligence toolkit (NNI) (https://nni.readthedocs.io/en/latest/) to search for the best values automatically. Due to the differences in the number of log templates across datasets, the number of top-k ranked templates is chosen separately for each dataset. The search spaces of these hyper-parameters and the configurations of NNI are given in Table 2. The best hyper-parameters on different datasets are given in Table 3.

We use Accuracy, Precision, Recall, and Macro F1-Measure as evaluation metrics. Macro F1-Measure combines Precision and Recall and allocates the same weight to every class, which reduces the impact of class imbalance on the evaluation results; a higher Macro F1-Measure indicates better performance. It is formalized as:

(15) ${\rm F1}=\frac{1}{N_{\rm class}}\times\sum_{k=1}^{N_{\rm class}}\frac{2\times{\rm Precision}_{k}\times{\rm Recall}_{k}}{{\rm Precision}_{k}+{\rm Recall}_{k}}$

where $k$ denotes the index of classes and $N_{\rm class}$ denotes the number of classes. The final reported result is the average Macro F1-Measure across five independent runs with different random seeds.
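
For reference, Eq. (15) corresponds to the macro-averaged F1 as computed by scikit-learn; the label arrays below are placeholders.

```python
from sklearn.metrics import f1_score

# Placeholder labels: 1 = anomalous, 0 = normal.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]
macro_f1 = f1_score(y_true, y_pred, average="macro")   # equal weight per class
```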

Table 4. Comparisons with baselines on the anomaly detection task. The best results are bold and the second best results are underlined.
Model HDFS BGL ThunderBird OpenStack
Acc. Pre. Rec. F1. Acc. Pre. Rec. F1. Acc. Pre. Rec. F1. Acc. Pre. Rec. F1.
IM (USENIX ATC 2010) 0.897 0.735 0.944 0.790 0.712 0.729 0.538 0.492 0.889 0.832 0.877 0.851 0.866 0.482 0.491 0.481
PCA (SOSP 2009) 0.900 0.738 0.945 0.793 0.660 0.415 0.482 0.416 0.960 0.929 0.966 0.945 0.892 0.485 0.498 0.479
LogCluster (ICSE 2016) 0.906 0.454 0.498 0.475 0.551 0.617 0.626 0.550 0.776 0.388 0.500 0.437 0.904 0.452 0.500 0.475
DeepLog (CCS 2017) 0.895 0.732 0.942 0.786 0.481 0.670 0.621 0.473 0.962 0.938 0.956 0.947 0.897 0.507 0.501 0.481
LogAnomaly (IJCAI 2019) 0.895 0.734 0.942 0.787 0.479 0.670 0.619 0.471 0.964 0.941 0.958 0.949 0.898 0.514 0.501 0.482
LogC (ICSME 2020) 0.898 0.735 0.944 0.790 0.727 0.858 0.593 0.582 0.504 0.643 0.671 0.502 0.874 0.477 0.492 0.480
OC4Seq (KDD 2021) 0.899 0.736 0.944 0.791 0.707 0.696 0.730 0.692 0.974 0.948 0.972 0.956 0.903 0.452 0.500 0.474
CSCLog (Ours) 0.966 0.824 0.952 0.895 0.739 0.749 0.793 0.731 0.982 0.978 0.969 0.973 0.906 0.535 0.528 0.530
Table 5. Comparisons with baselines on template prediction task. The best results are bold.
Model k = 1 k = 3 k = 5
Acc. Pre. Rec. F1. Acc. Pre. Rec. F1. Acc. Pre. Rec. F1.
DeepLog (CCS 2017) 0.904 0.215 0.187 0.193 0.983 0.325 0.305 0.312 0.987 0.327 0.333 0.330
LogAnomaly (IJCAI 2019) 0.909 0.220 0.196 0.199 0.983 0.261 0.273 0.267 0.987 0.348 0.344 0.341
LogC (ICSME 2020) 0.902 0.213 0.187 0.192 0.983 0.326 0.308 0.314 0.987 0.327 0.333 0.330
OC4Seq (KDD 2021) 0.910 0.226 0.201 0.203 0.981 0.245 0.268 0.256 0.987 0.335 0.352 0.342
CSCLog (Ours) 0.911 0.230 0.210 0.208 0.988 0.359 0.364 0.361 0.989 0.360 0.364 0.362

5.3. Comparison with Baselines

To justify the superiority of CSCLog, we compare it with other log anomaly detection methods, all of which are unsupervised or self-supervised learning-based methods. Among these methods, Invariant Mining (IM) (Lou et al., 2010), Principal Components Analysis (PCA) (Xu et al., 2009), and LogCluster (Lin et al., 2016) are traditional methods, and DeepLog (Du et al., 2017), LogAnomaly (Meng et al., 2019), LogC (Yin et al., 2020), and OC4Seq (Wang et al., 2021) are neural network-based methods. Due to the additional introduction of anomalous information, supervised learning-based methods usually perform better than unsupervised or self-supervised learning-based methods (Le and Zhang, 2022), so that they are not considered in our comparison experiment. We conduct experiments on the anomaly detection task and template prediction task, respectively. All compared methods are implemented by the code provided by the original papers. For the sake of fairness, all compared methods follow the same experimental settings and search spaces of hyper-parameters as CSCLog, and their parameters are also optimized. In addition, the kernel size of the one-class classifier of OC4Seq is set to 5 according to the original paper. The detailed descriptions of baselines are as follows:

IM (Lou et al., 2010), PCA (Xu et al., 2009), and LogCluster (Lin et al., 2016): IM, PCA, and LogCluster are traditional methods that detect anomalies by extracting the statistical features in log sequences. To use them on the log anomaly detection task, we extract the number of log templates for IM and PCA, and construct the set of frequent words in log messages for LogCluster. For a newly arriving log sequence, we detect whether it is anomalous according to the patterns captured from the statistical features.

DeepLog (Du et al., 2017): DeepLog predicts the log template of the next log message by modeling the log sequences as natural language sequences and employing LSTMs to learn the patterns of log sequences when the system runs normally. For a newly arriving log sequence, DeepLog detects whether it is anomalous according to whether the true log template is in the predicted log templates.

LogAnomaly (Meng et al., 2019): LogAnomaly captures quantitative anomalies by introducing the count vector of log templates in log sequences, and captures sequential anomalies by learning the patterns of log sequences when the system runs normally. Similar to DeepLog, LogAnomaly detects whether a newly arriving log sequence is anomalous according to whether the true log template is in the predicted log templates.

LogC (Yin et al., 2020): LogC extracts the component sequences from log data, and captures the calling relations of components and the patterns of log sequences when the system runs normally. For a newly arriving log sequence, LogC detects whether it is anomalous according to whether the true log template is in the predicted log templates and the true component is in the predicted components.

OC4Seq (Wang et al., 2021): OC4Seq introduces a multi-scale design to slice the log sequence into fixed-length subsequences containing local log patterns. LSTMs are used to capture the sequential dependencies at different scales. A one-class classifier is introduced to detect whether a newly arriving log sequence is anomalous by capturing the features of sequential embeddings.

Table 4 shows the Accuracy, Precision, Recall, and Macro F1-Measure of CSCLog and baselines on the anomaly detection task on HDFS, BGL, ThunderBird, and OpenStack datasets, from which we can observe the following phenomena:

1) Neural network-based methods (DeepLog, LogAnomaly, LogC, and OC4Seq) outperform traditional methods (IM, PCA, and LogCluster), which indicates the effectiveness of capturing the sequential dependencies in log sequences. In addition, the large differences in the performance of traditional methods across datasets demonstrate the limitations of the extracted statistical features.

2) The performance of LogAnomaly is slightly better than that of DeepLog, which indicates that the introduced number of templates can enhance the ability to capture anomalous information. However, LogAnomaly fails to perform well on the BGL dataset. The reason is that the large number of log templates (over 1,000) leads to a sparse template count vector, so the model fails to capture the linear relations between log templates.

3) LogC fails to perform well on the ThunderBird dataset, which contains a large number of components. The reason is that LogC focuses on modeling the calling relations of components, and the large number of components makes these relations more complex, which makes it challenging for the model to capture the complex log patterns of components.

4) As the SOTA method on the anomaly detection task, OC4Seq performs better than DeepLog, LogAnomaly, and LogC. OC4Seq can capture both global and local sequential dependencies in log sequences by slicing the subsequences of different lengths. In addition, by introducing the one-class classifier to classify log sequences directly, OC4Seq reduces the impact of the length of sequences on performance.

5) CSCLog outperforms the best baseline with relative improvements of 12.86%, 5.64%, 1.19%, and 9.96% in Macro F1-Measure on HDFS, BGL, ThunderBird, and OpenStack datasets, respectively. The results justify the advantage of capturing the sequential dependencies in subsequences and modeling the implicit correlations of subsequences.

Table 5 shows the Accuracy, Precision, Recall, and Macro F1-Measure of CSCLog and neural network-based methods on the template prediction task on the HDFS dataset. To demonstrate comprehensive results, we set the number of top-k ranked templates to 1, 3, and 5. We replace the one-class classifier of OC4Seq with the same classifier as CSCLog to accomplish the template prediction task. We can observe that CSCLog outperforms all the baselines, with relative improvements over the best baseline of 3.37%, 14.97%, and 5.85% in Macro F1-Measure for k = 1, 3, and 5, respectively. This further demonstrates the superiority of CSCLog.

Table 6. Comparisons with simplified models with different modules. The best results are bold.
Model Acc. Pre. Rec. F1.
CSCLog w/o IC 0.799 0.666 0.512 0.495
CSCLog w/o LSTM 0.838 0.719 0.640 0.587
CSCLog 0.966 0.824 0.952 0.895
Table 7. Comparisons with simplified models with different features. The best results are bold.
Model Acc. Pre. Rec. F1.
CSCLog w/o SEM 0.825 0.701 0.653 0.602
CSCLog w/o TIME 0.877 0.722 0.668 0.639
CSCLog 0.966 0.824 0.952 0.895

5.4. Ablation Studies

To justify the advantage of capturing the sequential dependencies in subsequences and modeling the implicit correlations of subsequences, we compare CSCLog with two simplified models. For the sake of fairness, all variants follow the same experimental settings as CSCLog, and their parameters are also optimized. The detailed descriptions of simplified models are as follows:

CSCLog w/o IC: CSCLog w/o IC removes the implicit correlation encoder and sets the correlation weight between subsequences to 1, which means that the correlations of subsequences are fixed and identical. The rest is the same as CSCLog.

CSCLog w/o LSTM: CSCLog w/o LSTM removes the LSTMs of the subsequence modeling module from CSCLog and only utilizes the average of the feature embeddings of a log sequence as its sequential embedding, which means that the model cannot capture the sequential dependencies. The rest is the same as CSCLog.

Table 6 shows the Accuracy, Precision, Recall, and Macro F1-Measure of CSCLog and simplified models on the anomaly detection task on the HDFS dataset, from which we can observe that CSCLog outperforms CSCLog w/o IC, which indicates the effectiveness of using the implicit correlation encoder to model the implicit correlation of subsequences. CSCLog outperforms CSCLog w/o LSTM, which indicates the effectiveness of employing LSTMs to capture the sequential dependencies in log sequences. The performance degradation of CSCLog w/o IC is more than that of CSCLog w/o LSTM, which indicates that modeling the implicit correlation of subsequences is more effective to improve the ability to capture anomalous information.

To justify the effectiveness of introduced features, we remove semantic and temporal features to study the impact on the model. CSCLog w/o SEM denotes the model without semantic features and CSCLog w/o TIME denotes the model without temporal features.

Table 7 shows the Accuracy, Precision, Recall, and Macro F1-Measure of CSCLog and simplified models on the anomaly detection task on the HDFS dataset, from which we can observe that CSCLog outperforms CSCLog w/o SEM and CSCLog w/o TIME, which indicates the effectiveness of the introduced semantic and temporal features. The performance degradation of CSCLog w/o SEM is more than that of CSCLog w/o TIME, which indicates the importance of semantic features. The reason is that the semantic features contain rich text information of log templates.

5.5. Parameter Sensitivity Analysis

To study the impact of several important parameters, including the length of sliding window, the number of top-k ranked templates, and the threshold of anomaly detection, we conduct the parameter sensitivity analysis on the anomaly detection task.

We evaluate the impact of the length of the sliding window, increasing it from 9 to 23 with a step size of 2. Fig. 3 shows the evaluation results on the HDFS dataset, from which we can observe that the performance of CSCLog rises gradually as the length of the sliding window increases. The best performance is achieved when the length of the sliding window is 19. The reason is that when the length of the sliding window increases from 9 to 19, the log sequences contain more patterns and the ability to capture anomalous information is improved. As the length of the sliding window increases beyond 19, the performance gradually drops. The reason may be that a larger sliding window increases the number of extracted subsequences, which makes it harder for the model to capture the correlations of subsequences.

We evaluate the impact of the number of top-k ranked templates, increasing it from 1 to 11 with a step size of 2. Fig. 4 shows the evaluation results on the HDFS dataset, from which we can observe that when the number of top-k ranked templates is less than 7, the performance rises as the number increases, which indicates that increasing the number of top-k ranked templates within a certain range can improve the fault tolerance of the model. When the number of top-k ranked templates is larger than 7, the performance drops slightly as the number increases. The reason may be that when the number is too large, the top-k ranked templates easily hit the true log template, which makes the model misclassify anomalous sequences as normal.

We evaluate the impact of the threshold of anomaly detection, increasing it from 1 to 5 with a step size of 1. Fig. 5 shows the evaluation results on the HDFS dataset, from which we can observe that as the threshold of anomaly detection increases, the performance first rises and then drops. The best performance is achieved when the threshold of anomaly detection is 4. The reason may be that when the threshold is set too small, log sequences are easily detected as anomalies, which leads to a higher false positive rate and lower accuracy. When the threshold is set too large, log sequences are less likely to be detected as anomalies, which leads to a higher false negative rate and lower recall.

Figure 3. Impact of the length of sliding window.
Figure 4. Impact of the number of top-k ranked templates.
Figure 5. Impact of the threshold of anomaly detection.

5.6. Computation Cost

We evaluate the computation costs of CSCLog and baselines, including the parameter number, training time, and inference time. Table 8 shows the evaluation results on the HDFS dataset, from which we can observe that DeepLog has the least parameter number and runs fastest in these methods, but it gets the worst anomaly detection performance. Compared with LogAnomaly, LogC, and OC4Seq, CSCLog gets the best anomaly detection performance while needing the least time for inference, which is more in line with the requirements of modern software systems for real-time anomaly detection. Overall, comprehensively considering the significant anomaly detection performance improvement and the computation costs, CSCLog exhibits its superiority over existing methods.

Table 8. The computation costs of different methods. The best results are bold.
Model # Parameters Training time/epoch (s) Total training time (h) Inference time/sequence (s) Acc. Pre. Rec. F1.
DeepLog 57,880 172.14 2.48 0.0007 0.895 0.732 0.942 0.786
LogAnomaly 428,056 258.31 4.87 0.0016 0.895 0.734 0.942 0.787
LogC 1,062,944 394.45 8.76 0.0044 0.898 0.735 0.944 0.790
OC4Seq 402,304 284.68 4.13 0.0014 0.899 0.736 0.944 0.791
CSCLog 207,207 238.55 4.02 0.0012 0.966 0.824 0.952 0.895

5.7. Case Study

To intuitively reveal the superiority of CSCLog, we perform several case studies.

In Fig. 6, we use the t-SNE method (Van der Maaten and Hinton, 2008) to visualize the embeddings of the output of the final MLP (before the Softmax activation function) on the OpenStack dataset. The log sequence samples whose log templates of the next log message are “T1”, “T2”, “T3”, “T4”, and “T5” are selected. We can find that the sample embeddings of CSCLog are more clustered than those of DeepLog, and the sample embeddings of different log templates are more distantly distributed. It means that CSCLog can better capture the complex log patterns by modeling the implicit correlations of subsequences.

(a) DeepLog
(b) CSCLog
Figure 6. Visualization of the embeddings of the output of the final MLP on the OpenStack dataset using t-SNE.
Figure 7. The distribution of the correlation weights.
Table 9. Top-5 ranked templates of two samples predicted by DeepLog and CSCLog. The results are ranked by the predicted probability and the true log templates are bolded.
     Model      DeepLog      CSCLog
     Normal      Template T3      Template T2
     Template T2      Template T3
     Template T5      Template T6
     Template T6      Template T5
     Template T4      Template T4
     Anomaly      Template T3      Template T2
     Template T2      Template T3
     Template T5      Template T1
     Template T6      Template T4
     Template T4      Template T5

Table 9 shows the predicted top-5 ranked templates of two log sequence samples on the template prediction task on the OpenStack dataset, where the two samples are labeled as normal and anomalous, respectively, and the order of log templates in both is “T1, T2, T1, T2, T3, T4, T3, T4, T5, T3”. We can find that for the normal sample, both DeepLog and CSCLog predict the log template of the next log message correctly, and CSCLog ranks the true log template higher. For the anomalous sample, the true log template is in the predicted top-5 ranked templates of DeepLog but not in those of CSCLog, which means CSCLog can detect the anomaly while DeepLog cannot. This indicates that CSCLog can improve the ability to capture anomalous information by introducing the implicit correlation encoder and modeling the correlations of subsequences.

Fig. 7 shows the distribution of the correlation weights of the subsequences of the above two log sequence samples. These two samples contain the components “nova.scheduler.host.manage”, “nova.meta-data.wsgi.server”, and “nova.osapi_compute.wsgi.server”, whose subsequences are labeled as 0, 1, and 2, respectively. For the normal sample, subsequences 0, 1, and 2 are “T1, T2, T1, T2”, “T3, T4, T3”, and “T3, T4, T5”, respectively. For the anomalous sample, subsequences 0, 1, and 2 are “T1, T2, T1, T2”, “T3, T4, T3”, and “T4, T5, T3”, respectively. We can find that in the normal sample, subsequences 1 and 2 are strongly correlated, and subsequences 0 and 1 are weakly correlated. In contrast, in the anomalous sample, the correlation between subsequences 0 and 1 is extremely strong, and the correlations between the remaining two subsequence pairs are weak. This indicates that CSCLog can capture the complex log patterns of components.

6. Conclusions and Future Work

Anomaly detection based on system logs plays an important role in intelligent operations. In this paper, we propose CSCLog, a Component Subsequence Correlation-Aware Log anomaly detection method, which not only captures the sequential dependencies in subsequences, but also models the implicit correlations of subsequences. Specifically, subsequences are extracted from log sequences based on components and the sequential dependencies in subsequences are captured by LSTMs. The implicit correlation encoder is introduced to model the implicit correlations of subsequences adaptively. In addition, GCNs are employed to accomplish the information interactions of subsequences. We conduct comprehensive experiments on four log datasets and experimental results demonstrate the superiority of CSCLog.

In the future, we will extend this work in the following directions. On the one hand, we will try to introduce hypergraphs or hierarchical graphs to model the higher-order correlations of subsequences, which can improve the representation ability of the model. On the other hand, we will try to introduce the self-attention mechanism with low complexity to improve the parallelism of the model in capturing the sequential dependencies.

References

  • Bodik et al. (2010) Peter Bodik, Moises Goldszmidt, Armando Fox, Dawn B Woodard, and Hans Andersen. 2010. Fingerprinting the datacenter: Automated classification of performance crises. In Proceedings of the 5th European Conference on Computer systems. 111–124.
  • Chen et al. (2021) Donghui Chen, Ling Chen, Youdong Zhang, Bo Wen, and Chenghu Yang. 2021. A multiscale interactive recurrent network for time-series forecasting. IEEE Transactions on Cybernetics 52, 9 (2021), 8793–8803.
  • Chen et al. (2004) Mike Chen, Alice X Zheng, Jim Lloyd, Michael I Jordan, and Eric Brewer. 2004. Failure diagnosis using decision trees. In Proceedings of the 1st IEEE International Conference on Autonomic Computing. 36–43.
  • Du et al. (2017) Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. DeepLog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 24th ACM Conference on Computer and Communications Security. 1285–1298.
  • Han and Yuan (2021) Xiao Han and Shuhan Yuan. 2021. Unsupervised cross-system log anomaly detection via domain adaptation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 3068–3072.
  • He et al. (2017) Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R Lyu. 2017. Drain: An online log parsing approach with fixed depth tree. In Proceedings of the 24th IEEE International Conference on Web Services. 33–40.
  • Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. International Conference on Learning Representations 9 (2015), 1–15.
  • Kipf et al. (2018) Thomas Kipf, Ethan Fetaya, Kuan Chieh Wang, Max Welling, and Richard Zemel. 2018. Neural relational inference for interacting systems. In Proceedings of the 38th International Conference on Machine Learning. 2688–2697.
  • Le and Zhang (2022) Van Hoang Le and Hongyu Zhang. 2022. Log-based anomaly detection with deep learning: How far are we. In Proceedings of the 44th International Conference on Software Engineering. 1356–1367.
  • Li et al. (2020) Xiaoyun Li, Pengfei Chen, Linxiao Jing, Zilong He, and Guangba Yu. 2020. SwissLog: Robust and unified deep learning based log anomaly detection for diverse faults. In Proceedings of the 31st IEEE International Symposium on Software Reliability Engineering. 92–103.
  • Liang et al. (2007) Yinglung Liang, Yanyong Zhang, Hui Xiong, and Ramendra Sahoo. 2007. Failure prediction in IBM BlueGene/L event logs. In Proceedings of the 7th IEEE International Conference on Data Mining. 583–588.
  • Liao et al. (2013) Hung Jen Liao, Chun Hung Richard Lin, Ying Chih Lin, and Kuang Yuan Tung. 2013. Intrusion detection system: A comprehensive review. Journal of Network and Computer Applications 36, 1 (2013), 16–24.
  • Liao et al. (2017) Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. 2017. TextBoxes: A fast text detector with a single deep neural network. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. 4161–4167.
  • Lin et al. (2016) Qingwei Lin, Hongyu Zhang, Jian Guang Lou, Yu Zhang, and Xuewei Chen. 2016. Log clustering based problem identification for online service systems. In Proceedings of the 38th International Conference on Software Engineering Companion. 102–111.
  • Lou et al. (2010) Jian Guang Lou, Qiang Fu, Shengqi Yang, Ye Xu, and Jiang Li. 2010. Mining invariants from console logs for system problem detection. In Proceedings of the 19th USENIX Annual Technical Conference. 1–14.
  • Meng et al. (2019) Weibin Meng, Ying Liu, Yichen Zhu, Shenglin Zhang, Dan Pei, Yuqing Liu, Yihao Chen, Ruizhi Zhang, Shimin Tao, Pei Sun, et al. 2019. LogAnomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. 4739–4745.
  • Nedelkoski et al. (2020) Sasho Nedelkoski, Jasmin Bogatinovski, Alexander Acker, Jorge Cardoso, and Odej Kao. 2020. Self-attentive classification-based anomaly detection in unstructured logs. In Proceedings of the 20th IEEE International Conference on Data Mining. 1196–1201.
  • Oliner and Stearley (2007) Adam Oliner and Jon Stearley. 2007. What supercomputers say: A study of five system logs. In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 575–584.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Neural Information Processing Systems 32 (2019), 1–12.
  • Rozemberczki et al. (2021) Benedek Rozemberczki, Paul Scherer, Yixuan He, George Panagopoulos, Alexander Riedel, Maria Astefanoaei, Oliver Kiss, Ferenc Beres, Guzmán López, Nicolas Collignon, et al. 2021. PyTorch geometric temporal: Spatiotemporal signal processing with neural machine learning models. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4564–4573.
  • Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1441–1450.
  • Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 11 (2008), 2579–2605.
  • Wan et al. (2021) Yi Wan, Yilin Liu, Dong Wang, and Yujin Wen. 2021. GLAD-PAW: Graph-based log anomaly detection by position aware weighted graph attention network. In Proceedings of the 25th Pacific-Asia Conference on Knowledge Discovery and Data Mining. 66–77.
  • Wang et al. (2021) Zhiwei Wang, Zhengzhang Chen, Jingchao Ni, Hui Liu, Haifeng Chen, and Jiliang Tang. 2021. Multi-scale one-class recurrent neural networks for discrete event sequence anomaly detection. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 3726–3734.
  • Wang et al. (2022) Zumin Wang, Jiyu Tian, Hui Fang, Liming Chen, and Jing Qin. 2022. LightLog: A lightweight temporal convolutional network for log anomaly detection on the edge. Computer Networks 203 (2022), 108616–108642.
  • Xia et al. (2021) Bin Xia, Yuxuan Bai, Junjie Yin, Yun Li, and Jian Xu. 2021. LogGAN: A log-level generative adversarial network for anomaly detection using permutation event modeling. Information Systems Frontiers 23 (2021), 285–298.
  • Xu et al. (2009) Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael I Jordan. 2009. Detecting large-scale system problems by mining console logs. In Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles. 117–132.
  • Yao et al. (2007) Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. 2007. On early stopping in gradient descent learning. Constructive Approximation 26, 2 (2007), 289–315.
  • Yin et al. (2020) Kun Yin, Meng Yan, Ling Xu, Zhou Xu, Zhao Li, Dan Yang, and Xiaohong Zhang. 2020. Improving log-based anomaly detection with component-aware analysis. In Proceedings of the 27th IEEE International Conference on Software Maintenance and Evolution. 667–671.
  • Zhang et al. (2020) Bo Zhang, Hongyu Zhang, Pablo Moscato, and Aozhong Zhang. 2020. Anomaly detection via mining numerical workflow relations from logs. In Proceedings of the 39th IEEE International Symposium on Reliable Distributed Systems. 195–204.
  • Zhang et al. (2021) Linming Zhang, Wenzhong Li, Zhijie Zhang, Qingning Lu, Ce Hou, Peng Hu, Tong Gui, and Sanglu Lu. 2021. LogAttn: Unsupervised log anomaly detection with an autoencoder based attention mechanism. In Proceedings of the 14th International Conference on Knowledge Science, Engineering and Management. 222–235.
  • Zhang et al. (2019) Xu Zhang, Yong Xu, Qingwei Lin, Bo Qiao, Hongyu Zhang, Yingnong Dang, Chunyu Xie, Xinsheng Yang, Qian Cheng, Ze Li, et al. 2019. Robust log-based anomaly detection on unstable log data. In Proceedings of the 27th ACM Software Engineering Conference and Symposium on the Foundations of Software Engineering. 807–817.