
1 School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China.
Email: {isfanhy, zhangfengbin}@hrbust.edu.cn, [email protected], [email protected]
2 Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, Minjiang University, Fuzhou 350121, China.
Email: [email protected]
Corresponding authors: Fengbin Zhang, Zuoyong Li.

Correlation-aware Deep Generative Model for Unsupervised Anomaly Detection

Haoyi Fan¹, Fengbin Zhang¹, Ruidong Wang¹, Liang Xi¹, Zuoyong Li²
Abstract

Unsupervised anomaly detection aims to identify anomalous samples from highly complex and unstructured data, which is pervasive in both fundamental research and industrial applications. However, most existing methods neglect the complex correlation among data samples, which is important for capturing the normal patterns from which the abnormal ones deviate. In this paper, we propose a method of Correlation aware unsupervised Anomaly detection via Deep Gaussian Mixture Model (CADGMM), which captures the complex correlation among data points for high-quality low-dimensional representation learning. More specifically, the relations among data samples are first captured in the form of a graph structure, in which each node denotes a sample and each edge denotes the correlation between two samples in the feature space. Then, a dual-encoder consisting of a graph encoder and a feature encoder is employed to jointly encode both the feature and correlation information of samples into the low-dimensional latent space, followed by a decoder for data reconstruction. Finally, a separate estimation network, acting as a Gaussian Mixture Model, is utilized to estimate the density of the learned latent vectors, and anomalies are detected by measuring the energy of the samples. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed method.

Keywords:
Anomaly Detection · Graph Attention · Gaussian Mixture Model · Data Correlation.

1 Introduction

Anomaly detection aims at identifying abnormal patterns that deviate significantly from normal behavior, which is ubiquitous in a multitude of application domains, such as cyber-security [15], medical care [19], and surveillance video profiling [14]. Formally, the anomaly detection problem can be viewed as density estimation from the data distribution [23]: anomalies tend to reside in low probability density areas. Although anomaly detection has been well-studied in the machine learning community, how to effectively conduct unsupervised anomaly detection on highly complex and unstructured data remains a challenge.

Unsupervised anomaly detection aims to detect outliers without labeled data, for scenarios in which only a small amount of labeled anomalous data together with plenty of unlabeled data is available, which is common in real-world applications. Existing methods for unsupervised anomaly detection can be divided into three categories: reconstruction based methods, clustering based methods, and one-class classification based methods. Reconstruction based methods, such as PCA [5] based approaches [18, 10] and autoencoder based approaches [21, 22, 23, 20], assume that outliers cannot be effectively reconstructed from compressed low-dimensional projections. Clustering based methods [17, 6] aim at density estimation of data points and usually adopt a two-step strategy [3] that first performs dimensionality reduction and then clustering. Different from the previous two categories, one-class classification based methods [7, 11, 1] make the effort to learn a discriminative boundary between the normal and abnormal instances.

Figure 1: Correlation-aware feature learning for anomaly detection.

Although the above-mentioned methods have had their fair share of success in anomaly detection, most of them neglect the complex correlation among data samples. As shown in Fig. 1, conventional methods conduct feature learning on the original observed feature space of data samples, while the correlation among similar samples is ignored; this correlation can be exploited during feature learning by propagating more representative features from the neighbors to generate high-quality embeddings for anomaly detection. However, modeling the correlation among samples is far different from conventional feature learning, since a highly non-linear structure needs to be captured. Therefore, how to effectively incorporate both the original features and the relation structure of samples into an integrated feature learning framework for anomaly detection is still an open problem.

To alleviate the above-mentioned problems, in this paper, we propose a method of Correlation aware unsupervised Anomaly detection via Deep Gaussian Mixture Model (CADGMM), which considers both the original features and the complex correlation among data samples for feature learning. Specifically, the relations among data samples are first captured in the form of a graph structure, in which each node denotes a sample and each edge denotes the correlation between two samples in the feature space. Then, a dual-encoder consisting of a graph encoder and a feature encoder is employed in CADGMM to jointly encode both the features and the correlation of samples into the low-dimensional latent space, followed by a decoder for data reconstruction. Finally, a separate estimation network, acting as a Gaussian Mixture Model, is utilized to estimate the density of the learned latent embedding. To verify the effectiveness of our algorithm, we conduct experiments on multiple real-world datasets. Our experimental results demonstrate that, by considering the correlation among data samples, CADGMM significantly outperforms the state-of-the-art on unsupervised anomaly detection tasks.

2 Notations and Problem Statement

In this section, we formally define the frequently-used notations and the studied problem.

Definition 1

A graph is denoted as $\mathcal{G}=\{\mathcal{V},\mathcal{E},\mathbf{X}\}$ with $N$ nodes and $E$ edges, in which $\mathcal{V}=\{v_{i}\,|\,i=1,2,\dots,N\}$ is the set of nodes, $\mathcal{E}=\{e_{i}\,|\,i=1,2,\dots,E\}$ is the set of edges, and $e_{i}=(v_{i_{1}},v_{i_{2}})$ represents an edge between node $v_{i_{1}}$ and node $v_{i_{2}}$. $\mathbf{X}\in\mathbb{R}^{N\times F}$ is a feature matrix with each row corresponding to the content feature of a node, where $F$ indicates the dimension of features. The adjacency matrix of a graph is denoted as $\mathbf{A}\in\mathbb{R}^{N\times N}$, which represents the topology of the graph: the scalar element $\mathbf{A}_{i,j}=1$ if there exists an edge between node $v_{i}$ and node $v_{j}$, and $\mathbf{A}_{i,j}=0$ otherwise.

Problem 1

Anomaly detection: Given a set of input samples $\mathcal{X}=\{x_{i}\,|\,i=1,\dots,N\}$, each of which is associated with an $F$-dimensional feature $\mathbf{X}_{i}\in\mathbb{R}^{F}$, we aim to learn a score function $u(\mathbf{X}_{i}):\mathbb{R}^{F}\mapsto\mathbb{R}$ to classify sample $x_{i}$ based on the threshold $\lambda$:

$y_{i}=\begin{cases}1, & \text{if } u(\mathbf{X}_{i})\geq\lambda,\\ 0, & \text{otherwise},\end{cases}$ (1)

where $y_{i}$ denotes the label of sample $x_{i}$, with 0 being the normal class and 1 the anomalous class.

Figure 2: The framework of the proposed method.

3 Method

In this section, we introduce the proposed CADGMM in detail. CADGMM is an end-to-end joint representation learning framework for unsupervised anomaly detection. As shown in Fig. 2, CADGMM consists of three modules: a dual-encoder, a feature decoder, and an estimation network. Specifically, the relations among data samples in the original feature space are first captured in the form of a graph structure, in which each node denotes a sample and each edge denotes the correlation between two samples in the feature space. Then, a dual-encoder consisting of a graph encoder and a feature encoder is employed to jointly encode both the feature and correlation information of samples into the low-dimensional latent space, followed by a feature decoder for sample reconstruction. Finally, a separate estimation network is utilized to estimate the density of the learned latent embedding in the framework of a Gaussian Mixture Model, and anomalies are detected by measuring the energy of the samples with respect to a given threshold.

3.1 Graph Construction

To explore the correlation among unstructured data samples for feature learning, we explicitly construct a graph structure to correlate similar samples in the feature space. More specifically, given a set of input samples $\mathcal{X}=\{x_{i}\,|\,i=1,\dots,N\}$, we employ the $K$-NN algorithm on sample $x_{i}$ to determine its $K$ nearest neighbors $\mathcal{N}_{i}=\{x_{i_{k}}\,|\,k=1,\dots,K\}$ in the feature space. Then, an undirected edge is assigned between $x_{i}$ and each of its neighbors $x_{i_{k}}$. Finally, an undirected graph $\mathcal{G}=\{\mathcal{V},\mathcal{E},\mathbf{X}\}$ is constructed, with $\mathcal{V}=\{v_{i}=x_{i}\,|\,i=1,\dots,N\}$ being the node set, $\mathcal{E}=\{e_{i_{k}}=(v_{i},v_{i_{k}})\,|\,v_{i_{k}}\in\mathcal{N}_{i}\}$ being the edge set, and $\mathbf{X}\in\mathbb{R}^{N\times F}$ being the feature matrix of nodes. Based on the constructed graph, the feature affinities among samples are captured explicitly, which can be exploited during feature learning by performing message propagation on the graph.
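To make the construction concrete, the following is a minimal sketch of this step, assuming a NumPy feature matrix and scikit-learn's NearestNeighbors; the function name and the dense adjacency matrix are illustrative choices, not the authors' exact implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_knn_graph(X, k=15):
    """Build a symmetric adjacency matrix A for the K-NN graph over the rows of X."""
    n = X.shape[0]
    # k + 1 neighbors are queried because each sample is its own nearest neighbor.
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)            # idx: (n, k + 1) neighbor indices
    A = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in idx[i, 1:]:               # skip the sample itself
            A[i, j] = 1.0
            A[j, i] = 1.0                  # undirected edge
    return A

# Example: a mini-batch graph as in Sec. 4.3, e.g. A = build_knn_graph(X_batch, k=15).
```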

3.2 Dual-Encoder

In order to obtain sufficiently representative high-level sample embeddings, the dual-encoder consists of a feature encoder and a graph encoder, which encode the original features of the samples and the correlation among them, respectively.

To encode the original sample features $\mathbf{X}$, the feature encoder employs an $L_{\mathrm{X}}$-layer Multi-Layer Perceptron (MLP) to conduct a non-linear feature transform, which is as follows:

$\mathbf{Z}^{\mathrm{X}(l_{\mathrm{X}})}=\sigma(\mathbf{Z}^{\mathrm{X}(l_{\mathrm{X}}-1)}\mathbf{W}^{\mathrm{X}(l_{\mathrm{X}}-1)}+\mathbf{b}^{\mathrm{X}(l_{\mathrm{X}}-1)})$ (2)

where $\mathbf{Z}^{\mathrm{X}(l_{\mathrm{X}}-1)}$, $\mathbf{Z}^{\mathrm{X}(l_{\mathrm{X}})}$, $\mathbf{W}^{\mathrm{X}(l_{\mathrm{X}}-1)}$ and $\mathbf{b}^{\mathrm{X}(l_{\mathrm{X}}-1)}$ are the input, output, trainable weight and bias matrix of the $(l_{\mathrm{X}}-1)$-th layer respectively, $l_{\mathrm{X}}\in\{1,2,\dots,L_{\mathrm{X}}\}$, and $\mathbf{Z}^{\mathrm{X}(0)}=\mathbf{X}$ is the initial input of the encoder. $\sigma(\bullet)$ denotes an activation function such as ReLU or Tanh. The final feature embedding $\mathbf{Z}^{\mathrm{X}}=\mathbf{Z}^{\mathrm{X}(L_{\mathrm{X}})}$ is obtained from the output of the last layer of the MLP.
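As a reference point, a minimal PyTorch sketch of such a feature encoder is given below, using the KDD99 layer sizes from Table 2 (120 -> 64 -> 32) and Tanh activations; the class name and shapes are illustrative.

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """L_X-layer MLP implementing Eq. (2)."""
    def __init__(self, dims=(120, 64, 32)):
        super().__init__()
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):          # x: (N, F)
        return self.net(x)         # Z^X: (N, D)

# z_x = FeatureEncoder()(torch.randn(1024, 120))  # -> (1024, 32)
```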

To encode the correlation among the samples, a graph attention layer [16] is employed to adaptively aggregate representations from neighboring nodes by performing a shared attention mechanism on the nodes:

$w_{i,j}=attn(\mathbf{X}_{i},\mathbf{X}_{j})=\sigma(\mathbf{a}^{\mathrm{T}}\cdot[\mathbf{W}^{c}\mathbf{X}_{i}\,\|\,\mathbf{W}^{c}\mathbf{X}_{j}])$ (3)

where $w_{i,j}$ indicates the importance weight of node $v_{i}$ to node $v_{j}$, $attn(\bullet)$ denotes the neural network parametrized by the weights $\mathbf{a}\in\mathbb{R}^{D^{c}}$ and $\mathbf{W}^{c}\in\mathbb{R}^{\frac{D^{c}}{2}\times F}$ shared by all nodes, $D^{c}$ is the number of hidden neurons in $attn(\bullet)$, and $\|$ denotes the concatenation operation. Then, the final importance weight $\alpha_{i,j}$ is normalized through the softmax function:

$\alpha_{i,j}=\frac{\exp(w_{i,j})}{\sum_{k\in\mathcal{N}_{i}}\exp(w_{i,k})}$ (4)

where $\mathcal{N}_{i}$ denotes the neighbors of node $v_{i}$, which are provided by the adjacency matrix $\mathbf{A}$, and the final node embedding $\mathbf{Z}^{\mathcal{V}}=\{\mathbf{Z}_{i}^{\mathcal{V}}\}$ can be obtained as a weighted sum based on the learned importance weights:

$\mathbf{Z}_{i}^{\mathcal{V}}=\sum_{k\in\mathcal{N}_{i}}\alpha_{i,k}\cdot\mathbf{X}_{k}$ (5)
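A minimal PyTorch sketch of this single-head attention layer is given below, assuming a dense adjacency matrix from the graph construction step; LeakyReLU is used for the activation as in the original GAT [16], and all names are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttention(nn.Module):
    """Single-head graph attention implementing Eqs. (3)-(5)."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.Wc = nn.Linear(in_dim, hidden_dim, bias=False)   # W^c
        self.a = nn.Linear(2 * hidden_dim, 1, bias=False)     # attention vector a
        self.act = nn.LeakyReLU(0.2)

    def forward(self, X, A):                # X: (N, F), A: (N, N) adjacency
        H = self.Wc(X)                      # (N, D^c / 2)
        n = X.size(0)
        # Pairwise concatenation [W^c X_i || W^c X_j] for all node pairs, Eq. (3).
        Hi = H.unsqueeze(1).expand(n, n, -1)
        Hj = H.unsqueeze(0).expand(n, n, -1)
        w = self.act(self.a(torch.cat([Hi, Hj], dim=-1))).squeeze(-1)   # (N, N)
        # Restrict the softmax in Eq. (4) to the neighbors given by A.
        w = w.masked_fill(A == 0, float('-inf'))
        alpha = F.softmax(w, dim=-1)        # rows sum to 1 over neighbors
        return alpha @ X                    # Z^V: weighted neighbor aggregation, Eq. (5)
```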

Given the learned embeddings $\mathbf{Z}^{\mathrm{X}}$ and $\mathbf{Z}^{\mathcal{V}}$, a fusion module is designed to fuse the embeddings from the heterogeneous sources into a shared latent space, followed by a fully connected layer to obtain the final sample embedding $\mathbf{Z}^{f}\in\mathbb{R}^{N\times D}$:

$\tilde{\mathbf{Z}}^{f}=\text{Fusion}(\mathbf{Z}^{\mathrm{X}},\mathbf{Z}^{\mathcal{V}})=\mathbf{Z}^{\mathrm{X}}\oplus\mathbf{Z}^{\mathcal{V}}$ (6)
$\mathbf{Z}^{f}=\tilde{\mathbf{Z}}^{f}\mathbf{W}+\mathbf{b}$ (7)

where $\mathbf{W}$ and $\mathbf{b}$ are the trainable weight and bias matrices, and $\oplus$ indicates element-wise addition of two matrices.
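A small sketch of this fusion step, assuming both embeddings have already been mapped to the same width (e.g. 32 for KDD99 in Table 2) and following the KDD99 setting where the fully connected layer maps 32 to 8 dimensions; names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class Fusion(nn.Module):
    """Element-wise addition followed by a fully connected layer, Eqs. (6)-(7)."""
    def __init__(self, dim_in=32, dim_out=8):
        super().__init__()
        self.fc = nn.Linear(dim_in, dim_out)

    def forward(self, z_x, z_v):            # both: (N, dim_in)
        return self.fc(z_x + z_v)           # Z^f = (Z^X + Z^V) W + b
```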

3.3 Feature Decoder

The feature decoder aims at reconstructing the sample features from the latent embedding $\mathbf{Z}^{f}$:

$\mathbf{Z}^{\hat{\mathrm{X}}(l_{\hat{\mathrm{X}}})}=\sigma(\mathbf{Z}^{\hat{\mathrm{X}}(l_{\hat{\mathrm{X}}}-1)}\mathbf{W}^{\hat{\mathrm{X}}(l_{\hat{\mathrm{X}}}-1)}+\mathbf{b}^{\hat{\mathrm{X}}(l_{\hat{\mathrm{X}}}-1)})$ (8)

where $\mathbf{Z}^{\hat{\mathrm{X}}(l_{\hat{\mathrm{X}}}-1)}$, $\mathbf{Z}^{\hat{\mathrm{X}}(l_{\hat{\mathrm{X}}})}$, $\mathbf{W}^{\hat{\mathrm{X}}(l_{\hat{\mathrm{X}}}-1)}$ and $\mathbf{b}^{\hat{\mathrm{X}}(l_{\hat{\mathrm{X}}}-1)}$ are the input, output, trainable weight and bias matrix of the $(l_{\hat{\mathrm{X}}}-1)$-th layer of the decoder respectively, $l_{\hat{\mathrm{X}}}\in\{1,2,\dots,L_{\hat{\mathrm{X}}}\}$, and $\mathbf{Z}^{\hat{\mathrm{X}}(0)}=\mathbf{Z}^{f}$ is the initial input of the decoder. Finally, the reconstruction $\hat{\mathbf{X}}$ is obtained from the last layer of the decoder:

$\hat{\mathbf{X}}=\mathbf{Z}^{\hat{\mathrm{X}}(L_{\hat{\mathrm{X}}})}$ (9)
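A minimal sketch of the decoder, mirroring the encoder MLP above with the KDD99 layer sizes from Table 2 (8 -> 32 -> 64 -> 120); again the names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class FeatureDecoder(nn.Module):
    """MLP implementing Eqs. (8)-(9), mapping Z^f back to the input feature space."""
    def __init__(self, dims=(8, 32, 64, 120)):
        super().__init__()
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, z_f):        # Z^f: (N, D)
        return self.net(z_f)       # reconstruction X_hat: (N, F)
```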

3.4 Estimation Network

To estimate the density of the input samples, a Gaussian Mixture Model is leveraged in CADGMM over the learned latent embedding. Inspired by DAGMM [23], a sub-network consisting of several fully connected layers is utilized, which takes the reconstruction-error-preserved low-dimensional embedding as input to estimate the mixture membership for each sample. The reconstruction-error-preserved low-dimensional embedding $\mathbf{Z}$ is obtained as follows:

$\mathbf{Z}=[\mathbf{Z}^{f}\,\|\,\mathbf{Z}^{r}],\quad \mathbf{Z}^{r}=\text{Dist}(\mathbf{X},\hat{\mathbf{X}})$ (10)

where $\mathbf{Z}^{r}$ is the reconstruction error embedding and $\text{Dist}(\bullet)$ denotes a distance metric such as the Euclidean distance or the cosine distance. Given the final embedding $\mathbf{Z}$ as input, the estimation network conducts membership prediction as follows:

$\mathbf{Z}^{\mathcal{M}(l_{\mathcal{M}})}=\sigma(\mathbf{Z}^{\mathcal{M}(l_{\mathcal{M}}-1)}\mathbf{W}^{\mathcal{M}(l_{\mathcal{M}}-1)}+\mathbf{b}^{\mathcal{M}(l_{\mathcal{M}}-1)})$ (11)

where $\mathbf{Z}^{\mathcal{M}(l_{\mathcal{M}}-1)}$, $\mathbf{Z}^{\mathcal{M}(l_{\mathcal{M}})}$, $\mathbf{W}^{\mathcal{M}(l_{\mathcal{M}}-1)}$ and $\mathbf{b}^{\mathcal{M}(l_{\mathcal{M}}-1)}$ are the input, output, trainable weight and bias matrix of the $(l_{\mathcal{M}}-1)$-th layer of the estimation network respectively, $l_{\mathcal{M}}\in\{1,2,\dots,L_{\mathcal{M}}\}$, $\mathbf{Z}^{\mathcal{M}(0)}=\mathbf{Z}$, and the mixture-component membership $\mathcal{M}$ is calculated by:

$\mathcal{M}=\text{Softmax}(\mathbf{Z}^{\mathcal{M}(L_{\mathcal{M}})})$ (12)

where $\mathcal{M}\in\mathbb{R}^{N\times M}$ is the predicted membership of $M$ mixture components for the $N$ samples. With the predicted sample memberships, the parameters of the GMM can be calculated to facilitate the evaluation of the energy/likelihood of the input samples, which is as follows:

$\boldsymbol{\mu}_{m}=\frac{\sum_{i=1}^{N}\mathcal{M}_{i,m}\mathbf{Z}_{i}}{\sum_{i=1}^{N}\mathcal{M}_{i,m}},\quad \boldsymbol{\Sigma}_{m}=\frac{\sum_{i=1}^{N}\mathcal{M}_{i,m}(\mathbf{Z}_{i}-\boldsymbol{\mu}_{m})(\mathbf{Z}_{i}-\boldsymbol{\mu}_{m})^{\mathrm{T}}}{\sum_{i=1}^{N}\mathcal{M}_{i,m}}$ (13)

where $\boldsymbol{\mu}_{m}$ and $\boldsymbol{\Sigma}_{m}$ are the mean and covariance of the $m$-th component distribution respectively, and the energy of the samples is as follows:

$\mathrm{E}_{\mathbf{Z}}=-\log\left(\sum_{m=1}^{M}\sum_{i=1}^{N}\frac{\mathcal{M}_{i,m}}{N}\frac{\exp\left(-\frac{1}{2}(\mathbf{Z}-\boldsymbol{\mu}_{m})^{\mathrm{T}}\boldsymbol{\Sigma}_{m}^{-1}(\mathbf{Z}-\boldsymbol{\mu}_{m})\right)}{|2\pi\boldsymbol{\Sigma}_{m}|^{\frac{1}{2}}}\right)$ (14)

where $|\bullet|$ indicates the determinant of a matrix.
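The following is a minimal PyTorch sketch of the estimation network and the GMM energy of Eqs. (10)-(14), in the spirit of DAGMM [23]; the layer sizes, the helper names, and the small diagonal jitter added for numerical stability are assumptions for illustration, not the authors' exact implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class EstimationNet(nn.Module):
    """Predicts the mixture membership M in Eqs. (11)-(12)."""
    def __init__(self, in_dim=10, hidden_dim=20, n_components=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Tanh(),
                                 nn.Linear(hidden_dim, n_components))

    def forward(self, z):
        return F.softmax(self.net(z), dim=-1)          # membership: (N, M)

def gmm_energy(z, gamma, eps=1e-6):
    """z: (N, D) embeddings, gamma: (N, M) memberships -> per-sample energy, Eq. (14)."""
    n, d = z.shape
    phi = gamma.mean(dim=0)                                        # mixture weights (M,)
    mu = (gamma.t() @ z) / gamma.sum(dim=0).unsqueeze(-1)          # means (M, D), Eq. (13)
    diff = z.unsqueeze(1) - mu.unsqueeze(0)                        # (N, M, D)
    sigma = torch.einsum('nm,nmd,nme->mde', gamma, diff, diff)     # covariances (M, D, D)
    sigma = sigma / gamma.sum(dim=0).view(-1, 1, 1)
    sigma = sigma + eps * torch.eye(d).unsqueeze(0)                # stabilise the inversion
    inv = torch.inverse(sigma)
    # Mahalanobis term -1/2 (z - mu)^T Sigma^{-1} (z - mu) for every (sample, component).
    maha = -0.5 * torch.einsum('nmd,mde,nme->nm', diff, inv, diff)
    log_det = torch.logdet(2 * math.pi * sigma)                    # log |2*pi*Sigma_m|
    log_prob = maha - 0.5 * log_det.unsqueeze(0) + torch.log(phi).unsqueeze(0)
    return -torch.logsumexp(log_prob, dim=-1)                      # energy E_Z: (N,)
```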

3.5 Loss Function and Anomaly Score

The training objective of CADGMM is defined as follows:

$\mathcal{L}=\|\mathbf{X}-\hat{\mathbf{X}}\|_{2}^{2}+\lambda_{1}\mathrm{E}_{\mathbf{Z}}+\lambda_{2}\sum_{m=1}^{M}\sum_{i=1}^{N}\frac{1}{(\boldsymbol{\Sigma}_{m})_{ii}}+\lambda_{3}\|\mathbf{Z}\|_{2}^{2}$ (15)

where the first term is the reconstruction error used for feature reconstruction; the second is the sample energy, which aims to maximize the likelihood of observed samples; the third is the covariance penalization, used for solving the singularity problem in GMM [23] by penalizing small values on the diagonal entries of the covariance matrices; and the last is the embedding penalization, which serves as a regularizer that keeps the magnitude of normal samples as small as possible in the latent space, so that normal samples deviate from the abnormal ones. $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are three parameters that control the trade-off between the different terms.
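A small sketch of this objective is given below, assuming the per-component covariances from the GMM step are available alongside the per-sample energies; the function signature is an illustrative assumption rather than the authors' interface.

```python
import torch

def cadgmm_loss(x, x_hat, energy, sigma, z, lam1=0.1, lam2=0.005, lam3=0.005):
    """Training objective of Eq. (15), averaged over the batch."""
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()                        # reconstruction error
    cov_penalty = (1.0 / torch.diagonal(sigma, dim1=-2, dim2=-1)).sum()  # singularity penalty
    emb_penalty = (z ** 2).sum(dim=-1).mean()                            # embedding magnitude
    return recon + lam1 * energy.mean() + lam2 * cov_penalty + lam3 * emb_penalty
```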

The anomaly score is the sample energy $\mathrm{E}_{\mathbf{Z}}$, and based on the measured anomaly scores, the threshold $\lambda$ in Eq. 1 can be determined according to the distribution of scores, e.g., the samples with the top-$k$ highest scores are classified as anomalous.
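As an example of this thresholding rule, the short sketch below flags the samples with the highest energies as anomalous, with the contamination ratio as an assumed input.

```python
import numpy as np

def label_by_energy(energies, contamination=0.2):
    """Return 1 for anomalies (top contamination fraction by energy), 0 otherwise."""
    lam = np.quantile(energies, 1.0 - contamination)   # threshold lambda in Eq. (1)
    return (energies >= lam).astype(int)
```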

4 Experiments

In this section, we describe the experimental details, including the datasets, baseline methods, and parameter settings.

Table 1: Statistics of the public benchmark datasets.
Dataset | # Dimensions | # Instances | Anomaly ratio
KDD99 | 120 | 494,021 | 0.2
Arrhythmia | 274 | 452 | 0.15
Satellite | 36 | 6,435 | 0.32

4.1 Dataset

Three benchmark datasets are used in this paper to evaluate the proposed method, including KDD99, Arrhythmia, and Satellite. The statistics of datasets are shown in Table 1.

  • KDD99 The KDD99 10 percent dataset [2] contains 494,021 samples with 41-dimensional features, of which 34 are continuous and 7 are categorical. One-hot representation is used to encode the categorical features, resulting in a 120-dimensional feature for each sample.

  • Arrhythmia The Arrhythmia dataset [2] contains 452 samples with 274-dimensional features. We combine the smallest classes, including 3, 4, 5, 7, 8, 9, 14, and 15, to form the outlier class, and the remaining classes form the inlier class.

  • Satellite The Satellite dataset [2] has 6,435 samples with 36-dimensional features. The three smallest classes, namely 2, 4, and 5, are combined to form the outlier class, and the rest form the inlier class.

4.2 Baseline Methods

  • One Class Support Vector Machines (OC-SVM) [4] is a classic kernel method for anomaly detection, which learns a decision boundary between the inliers and outliers.

  • Isolation Forests (IF) [8] conducts anomaly detection by building trees using randomly selected split values across sample features, and defining the anomaly score as the average path length from a specific sample to the root.

  • Deep Structured Energy Based Models (DSEBM) [21] is a deep energy-based model, which aims to accumulate the energy across the layers. DSEBM-r and DSEBM-e are utilized in [21] by taking the reconstruction error and the energy as the anomaly score, respectively.

  • Deep Autoencoding Gaussian Mixture Model (DAGMM) [23] is an autoencoder-based method for anomaly detection, which consists of a compression network for dimension reduction and an estimation network that performs density estimation under the Gaussian Mixture Model.

  • AnoGAN [13] is an anomaly detection algorithm based on GAN, which trains a DCGAN [12] to recover the representation of each data sample in the latent space during prediction.

  • ALAD [20] is based on bi-directional GANs for anomaly detection by deriving adversarially learned features and uses reconstruction errors based on the learned features to determine if a data sample is anomalous.

4.3 Parameter Settings

The parameter settings in the experiment for different datasets are as follows:

  • KDD99 For KDD99, CADGMM is trained for 300 iterations with $N$=1024 (the training batch size, also used for graph construction), $K$=15, $M$=4, $\lambda_{1}$=0.1, $\lambda_{2}$=0.005, and $\lambda_{3}$=10.

  • Arrhythmia For Arrhythmia, CADGMM is trained for 20,000 iterations with $N$=128 (the training batch size, also used for graph construction), $K$=5, $M$=2, $\lambda_{1}$=0.1, $\lambda_{2}$=0.005, and $\lambda_{3}$=0.001.

  • Satellite For Satellite, CADGMM is trained for 3,000 iterations with $N$=512 (the training batch size, also used for graph construction), $K$=13, $M$=4, $\lambda_{1}$=0.1, $\lambda_{2}$=0.005, and $\lambda_{3}$=0.005.

The architecture details of CADGMM for the different datasets are shown in Table 2, in which $\text{FC}(D_{in},D_{out})$ denotes a fully connected layer with $D_{in}$ input neurons and $D_{out}$ output neurons. Similarly, $\text{GAT}(D_{in},D_{out})$ denotes a graph attention layer with $D_{in}$-dimensional input and $D_{out}$-dimensional output. The activation function $\sigma(\bullet)$ is set to Tanh for all datasets. For the baseline methods, we set the parameters by grid search. We independently run each experiment 10 times and report the mean values as the final results.

Table 2: Architecture details of CADGMM for different datasets.
Dataset | Dual-Enc. (Feature Trans.) | Dual-Enc. (Graph Attn.) | Dual-Enc. (MLP) | Feature Dec. | Estimation Net.
KDD99 | FC(120,64), FC(64,32) | GAT(120,32) | FC(32,8) | FC(8,32), FC(32,64), FC(64,120) | FC(10,20), FC(20,8), FC(8,4)
Arrhythmia | FC(274,32) | GAT(274,32) | FC(32,2) | FC(2,10), FC(10,274) | FC(4,10), FC(10,2)
Satellite | FC(36,16) | GAT(36,16) | FC(16,2) | FC(2,16), FC(16,36) | FC(4,10), FC(10,4)
Table 3: Anomaly Detection Performance on KDD99, Arrhythmia, and Satellite datasets. Better results are marked in bold.
Method | KDD99 (Precision / Recall / F1) | Arrhythmia (Precision / Recall / F1) | Satellite (Precision / Recall / F1)
OC-SVM [4] | 74.57 / 85.23 / 79.54 | 53.97 / 40.82 / 45.81 | 52.42 / 59.99 / 61.07
IF [8] | 92.16 / 93.73 / 92.94 | 51.47 / 54.69 / 53.03 | 60.81 / 94.89 / 75.40
DSEBM-r [21] | 85.21 / 64.72 / 73.28 | 15.15 / 15.13 / 15.10 | 67.84 / 68.61 / 68.22
DSEBM-e [21] | 86.19 / 64.66 / 73.99 | 46.67 / 45.65 / 46.01 | 67.79 / 68.56 / 68.18
DAGMM [23] | 92.97 / 94.42 / 93.69 | 49.09 / 50.78 / 49.83 | 80.77 / 81.60 / 81.19
AnoGAN [13] | 87.86 / 82.97 / 88.65 | 41.18 / 43.75 / 42.42 | 71.19 / 72.03 / 71.59
ALAD [20] | 94.27 / 95.77 / 95.01 | 50.00 / 53.13 / 51.52 | 79.41 / 80.32 / 79.85
CADGMM | 96.01 / 97.53 / 96.71 | 56.41 / 57.89 / 57.14 | 81.99 / 82.75 / 82.37

5 Results and Analysis

In this section, we demonstrate the effectiveness of the proposed method by presenting the results of our model on the anomaly detection task and providing a comparison with state-of-the-art methods.

5.1 Anomaly Detection

As in previous works [21, 23, 20], Precision, Recall, and F1 score are employed as the evaluation metrics; higher values indicate better performance. Samples with high energy are classified as abnormal, and the threshold is determined based on the ratio of anomalies in the dataset. Following the settings in [21, 23], the training and test sets are split 1:1 and only normal samples are used for training the model.

The experimental results shown in Table 3 demonstrate that the proposed CADGMM significantly outperforms all baselines on all datasets. The performance of CADGMM is much higher than that of traditional anomaly detection methods such as OC-SVM and IF, which suffer from limited feature learning capability or the curse of dimensionality. Moreover, CADGMM also significantly outperforms all other deep learning based methods, such as DSEBM, DAGMM, AnoGAN, and ALAD, which demonstrates that the additional correlation among data samples facilitates feature learning for anomaly detection. For small datasets such as Arrhythmia, traditional methods such as IF are competitive with conventional deep learning based methods such as DSEBM, DAGMM, AnoGAN, and ALAD, which might be because the lack of sufficient training data degrades the performance of data-hungry deep learning based methods, while CADGMM is capable of extracting more from the limited data by considering the correlation among data samples.

Table 4: Anomaly Detection Performance on KDD99 with different ratios of anomalies during training.
Ratio | CADGMM (Precision / Recall / F1) | DAGMM (Precision / Recall / F1) | OC-SVM (Precision / Recall / F1)
1% | 95.53 / 97.04 / 96.28 | 92.01 / 93.37 / 92.68 | 71.29 / 67.85 / 69.53
2% | 95.32 / 96.82 / 96.06 | 91.86 / 93.40 / 92.62 | 66.68 / 52.07 / 58.47
3% | 94.83 / 96.33 / 95.58 | 91.32 / 92.72 / 92.01 | 63.93 / 44.70 / 52.61
4% | 94.62 / 96.12 / 95.36 | 88.37 / 89.89 / 89.12 | 59.91 / 37.19 / 45.89
5% | 94.35 / 96.04 / 95.30 | 85.04 / 86.43 / 85.73 | 11.55 / 33.69 / 17.20

5.2 Impact of noise data

In this section, we study the impact of noise data on the training of CADGMM. Specifically, a randomly selected 50% of the data samples are used for testing, while the remaining 50%, combined with 1% to 5% anomalies, are used for training.

As shown in Table 4, as the amount of noise data increases, the performance of all baselines degrades significantly, especially for OC-SVM, which tends to be more sensitive to noise data because of its poor feature learning ability on high-dimensional data. However, CADGMM performs stably under different noise ratios and achieves state-of-the-art performance even when 5% anomalies are injected into the training data, which demonstrates the robustness of the proposed method.

Figure 3: Impact of different $K$ values of the K-NN algorithm in graph construction.
Figure 4: Embedding visualization of (a) DAGMM and (b) CADGMM (blue indicates normal samples and orange the anomalies).

5.3 Impact of KK values

In this section, we evaluate the impact of different $K$ values during graph construction on CADGMM.

More specifically, we conduct experiments on all three datasets by varying $K$ from 5 to 19, and the experimental results are illustrated in Fig. 3. During training, the batch sizes are set to 1024, 128, and 512 for KDD99, Arrhythmia, and Satellite, respectively. The results show that changing the $K$ value causes only slight fluctuations in performance on all datasets, which demonstrates that CADGMM is insensitive to the $K$ value and easy to use.

5.4 Embedding Visualization

In order to explore the quality of the learned embedding, we compare the visualizations of the sample representations learned by different methods in Fig. 4. Specifically, we take the low-dimensional embeddings of samples learned by DAGMM and CADGMM as the inputs to the t-SNE tool [9]. Here, we randomly choose 40,000 data samples from the test set of KDD99 and visualize the sample embeddings in a two-dimensional space, in which blue corresponds to the normal class and orange to the abnormal class. We can see that CADGMM achieves more compact and better-separated clusters than DAGMM. These results also explain why our approach achieves better performance on the anomaly detection task.

6 Conclusion

In this paper, we study the problem of correlation aware unsupervised anomaly detection, which considers the correlation among data samples from the feature space. To cope with this problem, we propose a method named CADGMM to model the complex correlation among data points to generate high-quality low-dimensional embeddings for anomaly detection. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed method.

Acknowledgement

This work was supported in part by National Natural Science Foundation of China (No. 61172168, 61972187).

References

  • [1] Amer, M., Goldstein, M., Abdennadher, S.: Enhancing one-class support vector machines for unsupervised anomaly detection. In: SIGKDD. pp. 8–15 (2013)
  • [2] Bache, K., Lichman, M.: UCI machine learning repository (2013), http://archive.ics.uci.edu/ml
  • [3] Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: A survey. ACM Computing Surveys 41(3) (7 2009). https://doi.org/10.1145/1541880.1541882
  • [4] Chen, Y., Zhou, X.S., Huang, T.S.: One-class svm for learning in image retrieval. In: ICIP. pp. 34–37 (2001)
  • [5] Jolliffe, I.: Principal component analysis. Technometrics 45(3),  276 (2003)
  • [6] Kim, J., Scott, C.D.: Robust kernel density estimation. Journal of Machine Learning Research 13(Sep), 2529–2565 (2012)
  • [7] Li, K.L., Huang, H.K., Tian, S.F., Xu, W.: Improving one-class svm for anomaly detection. In: Proceedings of the 2003 International Conference on Machine Learning and Cybernetics. vol. 5, pp. 3077–3081 (2003)
  • [8] Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: ICDM. pp. 413–422 (2008)
  • [9] Maaten, L.v.d., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov), 2579–2605 (2008)
  • [10] Pascoal, C., De Oliveira, M.R., Valadas, R., Filzmoser, P., Salvador, P., Pacheco, A.: Robust feature selection and robust pca for internet traffic anomaly detection. In: IEEE INFOCOM. pp. 1755–1763 (2012)
  • [11] Perdisci, R., Gu, G., Lee, W., et al.: Using an ensemble of one-class svm classifiers to harden payload-based anomaly detection systems. In: ICDM. vol. 6, pp. 488–498 (2006)
  • [12] Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR (2016)
  • [13] Schlegl, T., Seeböck, P., Waldstein, S.M., Schmidt-Erfurth, U., Langs, G.: Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In: IPMI. pp. 146–157 (2017)
  • [14] Sultani, W., Chen, C., Shah, M.: Real-world anomaly detection in surveillance videos. In: CVPR. pp. 6479–6488 (2018)
  • [15] Tan, S.C., Ting, K.M., Liu, T.F.: Fast anomaly detection for streaming data. In: IJCAI (2011)
  • [16] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. In: ICLR (2018)
  • [17] Xiong, L., Póczos, B., Schneider, J.G.: Group anomaly detection using flexible genre models. In: NIPS. pp. 1071–1079 (2011)
  • [18] Xu, H., Caramanis, C., Sanghavi, S.: Robust pca via outlier pursuit. In: NIPS. pp. 2496–2504 (2010)
  • [19] Xu, S., Wu, H., Bie, R.: Cxnet-m1: Anomaly detection on chest x-rays with image-based deep learning. IEEE Access 7, 4466–4477 (2018)
  • [20] Zenati, H., Romain, M., Foo, C.S., Lecouat, B., Chandrasekhar, V.: Adversarially learned anomaly detection. In: ICDM. pp. 727–736 (2018)
  • [21] Zhai, S., Cheng, Y., Lu, W., Zhang, Z.: Deep structured energy based models for anomaly detection. In: ICML. pp. 1100–1109 (2016)
  • [22] Zhou, C., Paffenroth, R.C.: Anomaly detection with robust deep autoencoders. In: SIGKDD. pp. 665–674 (2017)
  • [23] Zong, B., Song, Q., Min, M.R., Cheng, W., Lumezanu, C., Cho, D., Chen, H.: Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In: ICLR (2018)