Active Anomaly Detection based on Deep One-class Classification
Abstract
Active learning has been utilized as an efficient tool in building anomaly detection models by leveraging expert feedback. In an active learning framework, a model queries samples to be labeled by experts and is re-trained with the labeled data samples. This reduces the burden of obtaining annotated datasets while improving anomaly detection performance. However, most existing studies focus on helping experts identify as many abnormal data samples as possible, which is a sub-optimal approach for one-class classification-based deep anomaly detection. In this paper, we tackle two essential problems of active learning for Deep SVDD: the query strategy and the semi-supervised learning method. First, rather than solely identifying anomalies, our query strategy selects uncertain samples according to an adaptive boundary. Second, we apply noise contrastive estimation in training a one-class classification model to incorporate both labeled normal and abnormal data effectively. We show on seven anomaly detection datasets that the proposed query strategy and semi-supervised loss individually improve the active learning process of anomaly detection and improve it further when combined.
Keywords: Deep anomaly detection, One-class classification, Deep SVDD, Active learning, Noise-contrastive estimation

1 Introduction
Anomaly detection (AD) techniques constitute a fundamental resource in various applications for identifying observations that deviate considerably from what is considered normal. In response to increasingly complex data at a large scale, deep learning-based AD has been actively researched and has shown high capability [12]. Since deep learning is based on a representation learned from a given dataset, most deep AD models aim to learn normality, assuming that a dataset consisting of only normal samples is available. Accordingly, one-class classification (OCC)-based approaches are among the representative approaches to deep AD [16]. In practice, however, it is expensive to prepare a training dataset consisting only of normal samples, as it requires per-sample inspection. When unlabeled data is abundant but manual labeling is expensive, active learning has been utilized as an efficient tool in building AD models, unburdening the labeling process and improving AD performance by leveraging feedback from domain experts [2, 4, 10, 11, 22, 14, 25]. There are two significant building blocks in active learning: a query strategy for determining which data samples should be labeled, and a semi-supervised learning method for re-training AD models with the labeled samples.
Following common AD scenarios that identify the top instances of a list ranked by the anomalousness determined by an AD model, most existing AD methods utilizing active learning select the top-1 or top-K instances as queries to be labeled [2, 4, 5, 10, 11]. However, this common but greedy query strategy is sub-optimal for training OCC-based AD, where samples with high uncertainty may be more informative than samples with high confidence. For an OCC-based AD model that learns an explicit decision boundary, data near the boundary are selected as queries to be labeled [7, 22], i.e., uncertainty sampling. However, OCC-based AD models usually rely on hyper-parameters to learn the boundary, and it is not easy to find optimal values in an active learning framework, since model re-training is repeated with the newly labeled data and the remaining unlabeled data.
Another important part of active learning is incorporating labeled and unlabeled samples into training, i.e., semi-supervised learning. A simple semi-supervised learning method for OCC-based AD is to exclude labeled abnormal samples while regarding both labeled normal and unlabeled samples as normal during training [2]. However, completely excluding labeled abnormal samples from training is a loss of information. Later works [18, 11] incorporate labeled abnormal samples into semi-supervised learning to learn discriminative characteristics between normal and abnormal samples. However, they still do not discriminate between labeled normal and unlabeled samples in the training phase, under the assumption that most of the unlabeled data are also normal. This indiscriminate use of samples is another loss of information, as labeled normal samples should have less uncertainty than unlabeled samples.
In this work, we tackle two essential problems of active learning for OCC-based deep AD. First, we propose an uncertainty-based query strategy that does not require learning an explicit decision boundary. To this end, we search for uncertain regions in a feature space where the densities of normal and abnormal samples are expected to be similar, and we query samples from the searched regions. In OCC-based approaches, the anomaly score is computed as the distance from a fixed coordinate in latent space, called the center. Therefore, we search for a boundary near which a similar number of normal and abnormal samples are located. The searched boundary, called the adaptive boundary, is adjusted iteratively by the ratio of abnormal samples among the queries of the previous active learning stage. In other words, instead of learning an explicit boundary with hyper-parameters, we utilize labeling information as feedback for the adaptive boundary. For the semi-supervised learning method, unlike previous work that discriminates only labeled abnormal samples, we apply noise contrastive estimation (NCE) [8] to labeled normal samples to contrast them against labeled abnormal samples. We empirically validate on seven anomaly detection datasets that the proposed query strategy with an adaptive boundary and the NCE-based semi-supervised learning method are effective when applied individually but even stronger when combined.
2 Related work
2.1 OCC-based anomaly detection
One-class classification (OCC) identifies objects of a specific class amongst all objects by primarily learning from a training dataset containing only the objects of the target class [13]. It detects whether new instances conform to the training dataset or not. On the other hand, the objective of anomaly detection (AD) is to separate abnormal data from a given dataset, which is a mixture of normal and abnormal data without ground truth annotations. Despite this difference, OCC has been widely applied to AD to learn the normality of a given dataset.
The most widely known shallow methods for OCC-based AD are One-Class SVM (OC-SVM) [20] and Support Vector Data Description (SVDD) [21]. They find a hyperplane and a hypersphere enclosing most of the training data, respectively. They deal with abnormal samples through hyper-parameters that allow some points to lie outside the estimated region. On the other hand, the works [7, 21] incorporate labeled data into SVDD to refine a boundary in a semi-supervised way. Tax et al. [21] enforce labeled abnormal samples to lie outside the hypersphere. Görnitz et al. [7] add one more constraint for labeled normal samples to be inside the hypersphere (see [22] for more details).
The shallow methods have been a solid foundation for OCC-based deep AD (OCC-DAD) [17, 3, 18, 26]. Most of them replace the shallow models with deep neural networks while using the same objective functions as in OC-SVM or SVDD. However, they commonly assume that all training data are normal. We consider them semi-supervised learning methods that exploit only labeled normal data while excluding labeled abnormal data. On the other hand, Ruff et al. [18] propose a method to incorporate labeled anomalies into One-Class Deep SVDD [17], a prevailing method in OCC-DAD. In addition, another line of OCC-DAD utilizes a generative adversarial network to exploit distorted anomalies adversarially [19]. However, it is difficult to characterize the generated samples and to guarantee that they resemble true anomalies.
2.2 Active learning for OCC-based AD
According to the survey [22], the existing query strategies (QS) for OCC-based AD can be classified into three categories. First, data-based QSs [6] utilize data statistics such as posterior probabilities of samples; therefore, they require labeled observations even at the initial stage of active learning. The second category is model-based QSs [2, 11, 7], to which the following two QSs belong. High-Confidence [2, 11] selects the samples that least match the normal class. It shares the view of common anomaly detection scenarios that identify the top instances from a ranked list of anomaly scores; thereby, it is a common and widely used QS in active learning-based AD. However, a recent study [25] shows that this greedy choice can be sub-optimal, since some low-ranked samples may be more informative. The other QS in this category, Decision-Boundary [7], selects the samples closest to a decision boundary. However, OCC-based AD models rely on hyper-parameters to learn an explicit boundary. The third category is hybrid QSs combining the former two. Since we tackle datasets without any initial annotations in this work, we only consider the model-based QSs. Queried samples are utilized according to the semi-supervised learning methods described in Section 2.1.
3 Preliminary
We provide a brief introduction to Deep SVDD (DSVDD) [17], which is used as a base model in this work. Given a training dataset $\mathcal{D} = \{x_1, \ldots, x_n\}$, where $x_i \in \mathbb{R}^d$, DSVDD maps data to a feature space through a neural network $\phi(\cdot\,; \mathcal{W})$ while gathering data around the center $c$ of a hypersphere in the feature space, where $\mathcal{W}$ denotes the set of weights of the neural network $\phi$. Then, the sample loss for DSVDD is defined as follows:
$\ell_{\mathrm{SB}}(x_i; \mathcal{W}, R) = R^2 + \frac{1}{\nu} \max\{0, \lVert \phi(x_i; \mathcal{W}) - c \rVert^2 - R^2\}$  (1)
For simplicity, we omit regularizers in the rest of this section. The hyper-parameter $\nu \in (0, 1]$ controls the trade-off between the radius $R$ and the amount of data outside the hypersphere. In the case that most of the training data are normal, the above sample loss can be simplified as follows:
$\ell_{\mathrm{OC}}(x_i; \mathcal{W}) = \lVert \phi(x_i; \mathcal{W}) - c \rVert^2$  (2)
These two versions of DSVDD are called soft-boundary DSVDD and One-Class DSVDD, respectively. The anomaly score in both models is measured by the distance between the center and a data point (Eq. (3)).
$s(x) = \lVert \phi(x; \mathcal{W}) - c \rVert^2$  (3)
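To make the preliminaries concrete, the following is a minimal PyTorch sketch of the per-sample losses in Eqs. (1)-(2) and the anomaly score in Eq. (3); the embeddings `z` (i.e., $\phi(x; \mathcal{W})$), center `c`, radius `R`, and trade-off `nu` are assumed to be produced and maintained as in [17].

```python
import torch

def soft_boundary_loss(z, c, R, nu):
    """Per-sample soft-boundary DSVDD loss (Eq. (1)).
    z: (batch, feat) embeddings phi(x; W); c: (feat,) center;
    R: scalar radius; nu: trade-off hyper-parameter in (0, 1]."""
    dist = torch.sum((z - c) ** 2, dim=1)            # ||phi(x) - c||^2
    return R ** 2 + (1.0 / nu) * torch.clamp(dist - R ** 2, min=0)

def one_class_loss(z, c):
    """Per-sample One-Class DSVDD loss (Eq. (2))."""
    return torch.sum((z - c) ** 2, dim=1)

def anomaly_score(z, c):
    """Anomaly score (Eq. (3)): squared distance from the center."""
    return torch.sum((z - c) ** 2, dim=1)
```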
4 Method
Since we consider a dataset without any annotations, a base model, Deep SVDD (DSVDD), is trained with the average of sample losses (Eq. (1) or Eq. (2)) across all the training data at the initial stage of active learning. In subsequent stages, querying, labeling, and semi-supervised learning are repeated until an annotation budget is exhausted (Fig. 1). We assume that an oracle labels the queried samples. We denote the number of queried samples in each stage as $k$, the number of active learning stages as $T$, the sets of queried normal and abnormal samples at stage $t$ as $\mathcal{Q}^{\mathrm{n}}_t$ and $\mathcal{Q}^{\mathrm{a}}_t$, the sets of all labeled normal and abnormal samples as $\mathcal{X}^{\mathrm{n}}$ and $\mathcal{X}^{\mathrm{a}}$, and the set of remaining unlabeled samples as $\mathcal{X}^{\mathrm{u}}$.
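A minimal sketch of this loop is shown below; `fit_unsupervised`, `anomaly_scores`, `fit_ssl`, and `label_by_oracle` are hypothetical placeholders for the base-model training, scoring, the SSL objective of Section 4.2, and the oracle, while `query_with_ab` and `update_boundary` are sketched in Section 4.1.

```python
import numpy as np

def active_learning(model, X, k, T, gamma=0.8):
    """Sketch of one run of the framework in Fig. 1.
    gamma is the initial adaptive-boundary ratio (Section 5.1)."""
    unlabeled = set(range(len(X)))
    normal_idx, abnormal_idx = set(), set()
    model.fit_unsupervised(X)                    # initial stage, Eq. (1) or (2)
    for t in range(T):
        scores = model.anomaly_scores(X)         # Eq. (3)
        queried = query_with_ab(scores, np.array(sorted(unlabeled)), gamma, k)
        q_abnormal = label_by_oracle(queried)    # indices the oracle marks abnormal
        abnormal_idx |= set(q_abnormal)
        normal_idx |= set(queried) - set(q_abnormal)
        unlabeled -= set(queried)
        r = len(q_abnormal) / k                  # abnormal ratio of this stage's queries
        gamma = update_boundary(gamma, r)        # Eqs. (4)-(5), Section 4.1
        model.fit_ssl(X, unlabeled, normal_idx, abnormal_idx)  # Section 4.2
    return model
```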
4.1 Uncertainty sampling with adaptive boundary
Our query strategy with an adaptive boundary (AB) selects the data samples closest to an adaptively searched boundary. It is similar to Decision-Boundary in that it performs uncertainty sampling. AB is motivated by the contaminated normality of DSVDD. Although DSVDD learns normality in feature space as a compact hypersphere enclosing normal samples, abnormal samples mixed into a dataset easily contaminate the normality (the left side of each figure in Fig. 2). This happens in both versions of DSVDD. However, despite the contamination, we empirically observe that relatively more normal samples are located near the sphere's center (the right side of each figure in Fig. 2). This phenomenon can be explained by the effective gradient [24], which measures how much the error is reduced for a single datum, i.e., the projection of the overall gradient $\nabla_{\mathcal{W}} L$ on the direction of a single datum's gradient $\nabla_{\mathcal{W}} \ell(x_i)$:
$g_{\mathrm{eff}}(x_i) = \lVert \nabla_{\mathcal{W}} L \rVert \cos\theta_i$,
where $\theta_i$ is the angle between the two vectors.
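For illustration, a minimal sketch of this quantity, assuming the per-sample gradients have already been computed elsewhere (e.g., with `torch.func`):

```python
import torch

def effective_gradient(per_sample_grads, overall_grad):
    """Scalar projection of the overall gradient g onto each datum's
    gradient direction g_i: <g, g_i> / ||g_i|| = ||g|| * cos(theta_i)."""
    g = overall_grad.flatten()                          # (p,)
    gi = per_sample_grads.flatten(start_dim=1)          # (n, p)
    return (gi @ g) / gi.norm(dim=1).clamp(min=1e-12)   # (n,)
```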
Following this observation, we assume that there exists a boundary (concentric with the hypersphere) such that, among the data closest to it, the ratio of truly abnormal samples is one half; we consider those samples close to the boundary to have high uncertainty of being judged as normal or abnormal by the model. Our goal is to find the boundary near which highly uncertain samples lie, which we call AB. For searching AB, we use the ratio of abnormal samples among the samples queried at the previous active learning stage, denoted $r_t$ for stage $t$. The idea is that when the ratio is high (low), the AB moves towards (away from) the center (Fig. 3). For the degree of movement, we utilize the difference between $r_t$ and its expected ratio of 0.5.
We denote the ratio of data enclosed by AB at stage $t$ as $\gamma_t$. Suppose $\gamma_t$ is 0.8 at active learning stage $t$. In that case, AB is the boundary of the hypersphere enclosing 80% of the total data. We denote the ratio of queried abnormal samples at stage $t$ as $r_t$, i.e., $r_t = |\mathcal{Q}^{\mathrm{a}}_t| / k$. The AB is updated as follows:
$\gamma_{t+1} = \gamma_t + \delta_t$  (4)
where $\delta_t$ is the degree of movement derived from the proportional equation (Eq. (5)). We let $|\delta_t|$ decrease as the AB widens, i.e., as $\gamma_t$ increases, by assuming that the ratio of abnormal samples increases linearly from $r_t$ at the current AB to 1 at the maximum boundary, i.e., $\gamma = 1$.
$\delta_t : (0.5 - r_t) = (1 - \gamma_t) : (1 - r_t), \quad \text{i.e.,} \quad \delta_t = \frac{(0.5 - r_t)(1 - \gamma_t)}{1 - r_t}$  (5)
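A minimal sketch of the boundary update (Eqs. (4)-(5)) and the query selection follows; the function names and the quantile-based realization of "the hypersphere enclosing a $\gamma_t$ fraction of the data" are our assumptions:

```python
import numpy as np

def update_boundary(gamma, r, eps=1e-8):
    """Eqs. (4)-(5): move gamma by delta = (0.5 - r)(1 - gamma)/(1 - r),
    assuming the abnormal ratio grows linearly to 1 at gamma = 1.
    (If r = 1, the step is capped by clipping gamma to [0, 1].)"""
    delta = (0.5 - r) * (1.0 - gamma) / max(1.0 - r, eps)
    return float(np.clip(gamma + delta, 0.0, 1.0))

def query_with_ab(scores, unlabeled_idx, gamma, k):
    """Select the k unlabeled samples whose anomaly scores are closest
    to the adaptive boundary, realized here as the gamma-quantile of
    the anomaly scores over all data."""
    ab_radius = np.quantile(scores, gamma)   # boundary enclosing gamma of the data
    gap = np.abs(scores[unlabeled_idx] - ab_radius)
    return unlabeled_idx[np.argsort(gap)[:k]]
```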

4.2 Joint training with NCE
We propose to use noise contrastive estimation (NCE) as an effective semi-supervised learning (SSL) method. In utilizing NCE, we pay attention to the fundamental role of an anomaly score function, which assigns lower scores to normal data than to abnormal data. We consider this role analogous to NCE, which estimates a model describing observed data while contrasting them with noise data. Furthermore, an anomaly score function can be regarded as an unnormalized and parameterized statistical model. Therefore, we propose the NCE-based SSL method by setting labeled normal data as the estimation target and labeled abnormal data as the noise data, with the anomaly score as the model. Although the noise data leveraged by the original NCE is artificially sampled from a known distribution, we assume labeled abnormal data are sampled from the true distribution of abnormal data while querying. The total loss for NCE-based SSL is as follows:
$L_{\mathrm{total}} = \frac{1}{|\mathcal{X}^{\mathrm{u}} \cup \mathcal{X}^{\mathrm{n}}|} \sum_{x \in \mathcal{X}^{\mathrm{u}} \cup \mathcal{X}^{\mathrm{n}}} \ell_{\mathrm{OC}}(x) + \eta L_{\mathrm{NCE}}$  (6)

$L_{\mathrm{NCE}} = -\frac{1}{|\mathcal{X}^{\mathrm{n}}|} \sum_{x \in \mathcal{X}^{\mathrm{n}}} \ln \sigma(-s(x)) - \frac{1}{|\tilde{\mathcal{X}}^{\mathrm{a}}|} \sum_{\tilde{x} \in \tilde{\mathcal{X}}^{\mathrm{a}}} \ln\big(1 - \sigma(-s(\tilde{x}))\big)$  (7)
where $\sigma(\cdot)$ is the logistic sigmoid, $\eta$ weights the NCE term, and $\tilde{x}$ denotes pseudo-abnormal samples. The pseudo-abnormal set $\tilde{\mathcal{X}}^{\mathrm{a}}$ is defined by labeled abnormal samples as follows:
$\tilde{\mathcal{X}}^{\mathrm{a}} = \{\tilde{x}_j\}_{j=1}^{|\mathcal{X}^{\mathrm{n}}|}, \quad \tilde{x}_j \sim \mathrm{Uniform}(\mathcal{X}^{\mathrm{a}})$  (8)
For brevity, we denote $\ell_{\mathrm{OC}}(x; \mathcal{W})$ as $\ell_{\mathrm{OC}}(x)$ in Eq. (6). The total loss is described using One-Class DSVDD as a base model, but soft-boundary DSVDD can also be applied by replacing the sample loss. We outline the active learning process in Algorithm 1.
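The following is a minimal PyTorch sketch of Eqs. (6)-(8) under the reconstruction above: the negated anomaly score is used as the logit of a logistic classifier that contrasts labeled normal embeddings against pseudo-abnormal embeddings resampled from the labeled abnormal pool; the weight `eta` and the batch-level resampling are assumptions.

```python
import torch
import torch.nn.functional as F

def nce_ssl_loss(z_u, z_n, z_a, c, eta=1.0):
    """Joint loss of Eq. (6): One-Class DSVDD term over unlabeled (z_u)
    and labeled normal (z_n) embeddings, plus the NCE term of Eq. (7)
    contrasting labeled normals against pseudo-abnormal samples (Eq. (8))."""
    def score(z):                                # anomaly score, Eq. (3)
        return torch.sum((z - c) ** 2, dim=1)

    # Eq. (8): resample the labeled abnormal embeddings (z_a) with
    # replacement to match the number of labeled normal samples.
    idx = torch.randint(len(z_a), (len(z_n),))
    z_tilde = z_a[idx]

    oc = torch.cat([score(z_u), score(z_n)]).mean()

    # -s(x) acts as the log-odds of being normal: labeled normals get
    # target 1, pseudo-abnormal samples get target 0.
    logits = torch.cat([-score(z_n), -score(z_tilde)])
    targets = torch.cat([torch.ones(len(z_n)), torch.zeros(len(z_tilde))])
    nce = F.binary_cross_entropy_with_logits(logits, targets.to(logits.device))
    return oc + eta * nce
```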
5 Experiment
5.1 Experiment settings
Datasets and evaluation metrics. Anomaly detection benchmarks [15] and NSL-KDD [1] for network intrusion detection are multivariate tabular datasets. As shown in Table 1, each dataset has $n$ data samples with $d$ attributes. The datasets used in this paper cover a wide range of anomaly ratios, from a minimum of 2.9% to a maximum of 46.5%. We use the area under the receiver operating characteristic curve (AUC), measured from anomaly scores, as the evaluation metric (a minimal example follows Table 1).
Dataset | n | d | # anomalies (%)
---|---|---|---
NSL-KDD | 125,973 | 41 | 46.5
Ionosphere | 351 | 33 | 35.9
Arrhythmia | 452 | 274 | 14.6
Cardio | 1,831 | 21 | 9.6
Mnist | 7,603 | 100 | 9.2
Glass | 214 | 9 | 4.2
Optdigits | 5,216 | 64 | 2.9
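A minimal example of the evaluation protocol with scikit-learn, assuming ground-truth labels are available only for evaluation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

scores = np.array([0.2, 1.5, 0.1, 2.3])  # anomaly scores from Eq. (3)
labels = np.array([0, 1, 0, 1])          # ground truth: 1 = anomaly
print(roc_auc_score(labels, scores))     # 1.0 on this toy example
```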
Competing methods. We apply our methods to two versions of Deep SVDD (DSVDD) [17]: soft-boundary DSVDD (SB) and One-Class DSVDD (OC). Our query strategy (QS), Adaptive Boundary (AB), is compared with High-Confidence (HC) [2, 11], Decision-Boundary (DB) [7], and Random (RD). Note that DB can be applied only to SB because OC does not learn a decision boundary. Our noise contrastive estimation (NCE)-based semi-supervised learning (SSL) method is compared with SB, OC, and DSAD [18]. We consider SB and OC the SSL methods that exclude labeled abnormal samples, and DSAD is the state-of-the-art SSL method based on OC. We also compare our methods to a recent active learning approach for anomaly detection, named Unsupervised to Active Inference (UAI) [14], which transforms an unsupervised deep AD model into an active learning model by adding an anomaly classifier.
Implementation details. To implement the base model, DSVDD, we use the source code released by the authors (https://github.com/lukasruff/Deep-SVDD-PyTorch) and adjust the backbone architectures for each dataset. A 3-layer MLP with 32-16-8 units is used on the Ionosphere, Cardio, Glass, Optdigits, and NSL-KDD datasets; a 3-layer MLP with 128-64-32 units is used on the Arrhythmia dataset; and a 3-layer MLP with 64-32-16 units is used on the Mnist dataset. The model is pre-trained with a reconstruction loss from an autoencoder for 100 epochs and then fine-tuned with an anomaly detection loss for 50 epochs. The decoder of the autoencoder is implemented as a symmetric counterpart of the 3-layer MLP encoder. We use the Adam optimizer [9] with a batch size of 128 and a learning rate of . Data samples are standardized to have zero mean and unit variance. We set $k$ to 1% of each dataset and $T$ to five. For small datasets with fewer than 500 samples, e.g., Ionosphere, Arrhythmia, and Glass, 6 data samples are queried per stage instead of 1%. The initial query boundary $\gamma_1$ is set to 0.8. The hyper-parameter $\nu$ of SB is fixed to 0.5 for all the experiments. For the hyper-parameter of DSAD, we use 1, as suggested by the work [18]. Following the structure of UAI [14], the UAI layer is attached to the last layer of an encoder. For a fair comparison, we apply the same training scheme to UAI by pre-training the encoder with the autoencoder loss and then fine-tuning it. The experiments are performed with an Intel Xeon Silver 4210 CPU and a GeForce GTX 1080Ti GPU.
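For reference, a sketch of the 32-16-8 backbone and its symmetric autoencoder used for pre-training; bias terms are removed, as in the released DSVDD code, to avoid a trivial hypersphere collapse, while the activation choice (ReLU) here is an assumption.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """3-layer MLP encoder, e.g., 32-16-8 units for the tabular datasets.
    Bias terms are removed, as in [17], to avoid a trivial solution where
    all inputs collapse onto the hypersphere center."""
    def __init__(self, d_in, units=(32, 16, 8)):
        super().__init__()
        dims = [d_in, *units]
        layers = []
        for i in range(len(dims) - 1):
            layers += [nn.Linear(dims[i], dims[i + 1], bias=False), nn.ReLU()]
        self.net = nn.Sequential(*layers[:-1])  # linear output layer
    def forward(self, x):
        return self.net(x)

class AutoEncoder(nn.Module):
    """Encoder plus a symmetric decoder, used only for the 100-epoch
    reconstruction pre-training before DSVDD fine-tuning."""
    def __init__(self, d_in, units=(32, 16, 8)):
        super().__init__()
        self.encoder = Encoder(d_in, units)
        dims = [*reversed(units), d_in]
        layers = []
        for i in range(len(dims) - 1):
            layers += [nn.Linear(dims[i], dims[i + 1], bias=False), nn.ReLU()]
        self.decoder = nn.Sequential(*layers[:-1])
    def forward(self, x):
        return self.decoder(self.encoder(x))
```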
5.2 Experiment results
Aggregated results. To comprehensively validate the effectiveness of the proposed QS and SSL method, we first show the aggregated results over all the datasets (Fig. 4 and Fig. 5), with OC (Fig. 4) and SB (Fig. 5) as the base model, respectively. AB shows the highest performance in most cases when used with SSL methods (Fig. 4(a), 4(c), and Fig. 5(a), 5(b)). In particular, even when labeled abnormal samples are excluded during SSL, AB obtains higher performance than HC (Fig. 4(a) and Fig. 5(a)). This suggests that excluding highly uncertain abnormal samples, as AB does, can be more conducive to normality learning than excluding highly confident abnormal samples, as HC does. Similarly, AB shows higher performance than HC with SSL methods that use labeled abnormal samples for training (Fig. 4(b), 4(c), and Fig. 5(b)). We observe that RD shows higher performance than AB when combined with DSAD (Fig. 4(b)), while AB is more effective when combined with OC-NCE (Fig. 4(c)). This is due to the different use of labeled normal samples in DSAD and OC-NCE. AB searches for a set of uncertain samples, including both normal and abnormal ones. However, DSAD only repels labeled abnormal samples from the hypersphere center and ignores labeled normal samples. On the other hand, OC-NCE uses both labeled normal and abnormal samples by contrasting them, making the samples searched by AB more effective. Furthermore, among the SB variants, AB shows higher performance than DB. Since DB relies on hyper-parameters and the best value may change at every active learning stage, it may fail to select informative samples. In addition, the proposed NCE-based SSL method obtains the highest performance in most cases when used with any QS (Fig. 4(d), 4(e), 4(f), and Fig. 5(c), 5(d), 5(e), 5(f)), especially when combined with AB.
Results by dataset. The anomaly detection performance with OC as the base model is shown for each dataset (Fig. 6). Random QS, the worst baseline among the competitors, is excluded for better visualization. First, the same trend as in the aggregated results is observed: AB (solid lines) achieves higher performance in most cases for each SSL method, and the NCE-based SSL method (red lines) obtains higher performance than the other SSL methods. In addition, the differences among the SSL methods are apparent in Fig. 6. OC, which excludes labeled abnormal samples, shows smaller performance improvements, or its performance fluctuates, even as the number of labeled samples increases over the stages. This indicates that the appropriate use of labeled abnormal samples is more effective for normality learning. Furthermore, compared to DSAD, which considers only labeled abnormal samples, OC-NCE effectively improves performance by contrasting labeled normal and abnormal samples. Meanwhile, UAI does not show stable performance improvements despite using both labeled normal and abnormal samples. UAI queries samples with high anomaly scores computed from the UAI layer; similar to HC, the samples queried by UAI are biased towards abnormal samples and are less effective than those of the proposed method.
Lastly, we compare the combination of OC-NCE with the two QSs, AB (solid red lines) and HC (dashed red lines). AB compares favorably with HC in the early stages in most cases. However, the performance of the two QSs becomes similar as the active learning stages are repeated. This can be explained by AB gradually moving outward from the hypersphere's center as the model learns robust normality over the stages.
Analysis on queried samples. We analyze $\gamma_t$ and $r_t$ as they are updated at each active learning stage by OC-NCE+AB (Fig. 7) to show how uncertainty sampling works without learning a parameter-dependent boundary. The ratio $r_1$ obtained from the model trained at the initial stage tends to be proportional to the anomaly ratio of each dataset (Fig. 7(a)). This result empirically shows that the model learns normality contaminated by abnormal samples. As the active learning stages are repeated, the $r_t$ values move toward 50%. The latter stages tend to show a high abnormal sample ratio $r_t$; this is due to the improved model performance, which results in more abnormal samples being located in high anomaly-score regions. Even if $r_t$ does not get close to 50% as expected, $\gamma_t$ adaptively moves in the direction where $r_{t+1}$ is expected to be close to 50%. In a case with a large anomaly ratio, e.g., NSL-KDD, $\gamma_t$ changes with large steps because the $r_t$ values are far from 50%. Although each $r_t$ is not close to 50%, the accumulated labeled set becomes balanced as the active learning progresses.
6 Conclusion
In this work, we present a method for active anomaly detection based on deep one-class classification. The proposed query strategy selects the samples closest to a boundary that is adjusted by the ratio of queried abnormal samples. Unlike previous uncertainty sampling based on a decision boundary, our query strategy does not require learning an explicit boundary, which is controlled by hyper-parameters in one-class classification. In addition, joint learning with noise contrastive estimation (NCE) is proposed as a semi-supervised learning method to effectively utilize both labeled normal and abnormal data. We show that the proposed query strategy and joint learning with NCE mutually enhance each other, resulting in favorable performance against the competing methods, including state-of-the-art semi-supervised methods.
Acknowledgments
This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00833, A study of 5G based Intelligent IoT Trust Enabler). Part of this work was done while Junsik Kim was at KAIST, supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2020R1A6A3A01100087).
References
- [1] NSL-KDD dataset. https://www.unb.ca/cic/datasets/nsl.html, 2018.
- [2] Vincent Barnabé-Lortie, Colin Bellinger, and Nathalie Japkowicz. Active learning for one-class classification. In IEEE International Conference on Machine Learning and Applications, 2015.
- [3] Raghavendra Chalapathy, Aditya Krishna Menon, and Sanjay Chawla. Anomaly detection using one-class neural networks. arXiv preprint arXiv:1802.06360, 2018.
- [4] Shubhomoy Das, Weng-Keen Wong, Thomas Dietterich, Alan Fern, and Andrew Emmott. Incorporating expert feedback into active anomaly discovery. In IEEE International Conference on Data Mining, 2016.
- [5] Shubhomoy Das, Weng-Keen Wong, Alan Fern, Thomas G Dietterich, and Md Amran Siddiqui. Incorporating feedback into tree-based anomaly detection. arXiv preprint arXiv:1708.09441, 2017.
- [6] Alireza Ghasemi, Hamid R Rabiee, Mohsen Fadaee, Mohammad T Manzuri, and Mohammad H Rohban. Active learning from positive and unlabeled data. In IEEE International Conference on Data Mining Workshops, 2011.
- [7] Nico Görnitz, Marius Kloft, Konrad Rieck, and Ulf Brefeld. Toward supervised anomaly detection. Journal of Artificial Intelligence Research, 46:235–262, 2013.
- [8] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In International Conference on Artificial Intelligence and Statistics, 2010.
- [9] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- [10] Hemank Lamba and Leman Akoglu. Learning on-the-job to re-rank anomalies from top-1 feedback. In SIAM International Conference on Data Mining, 2019.
- [11] Julien Lesouple and Jean-Yves Tourneret. Incorporating user feedback into one-class support vector machines for anomaly detection. In IEEE European Signal Processing Conference, 2021.
- [12] Guansong Pang, Chunhua Shen, Longbing Cao, and Anton Van Den Hengel. Deep learning for anomaly detection: A review. ACM Computing Surveys, 54(2):1–38, 2021.
- [13] Pramuditha Perera, Poojan Oza, and Vishal M Patel. One-class classification: A survey. arXiv preprint arXiv:2101.03064, 2021.
- [14] Tiago Pimentel, Marianne Monteiro, Adriano Veloso, and Nivio Ziviani. Deep active learning for anomaly detection. In International Joint Conference on Neural Networks, 2020.
- [15] Shebuti Rayana. ODDS library. http://odds.cs.stonybrook.edu, 2016.
- [16] Lukas Ruff, Jacob R. Kauffmann, Robert A. Vandermeulen, Grégoire Montavon, Wojciech Samek, Marius Kloft, Thomas G. Dietterich, and Klaus-Robert Müller. A unifying review of deep and shallow anomaly detection. Proceedings of the IEEE, 109(5):756–795, 2021.
- [17] Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In International Conference on Machine Learning, 2018.
- [18] Lukas Ruff, Robert A. Vandermeulen, Nico Görnitz, Alexander Binder, Emmanuel Müller, Klaus-Robert Müller, and Marius Kloft. Deep semi-supervised anomaly detection. In International Conference on Learning Representations, 2020.
- [19] Mohammad Sabokrou, Mohammad Khalooei, Mahmood Fathy, and Ehsan Adeli. Adversarially learned one-class classifier for novelty detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- [20] Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distribution. Neural computation, 13(7):1443–1471, 2001.
- [21] David MJ Tax and Robert PW Duin. Support vector data description. Machine learning, 54(1):45–66, 2004.
- [22] Holger Trittenbach, Adrian Englhardt, and Klemens Böhm. An overview and a benchmark of active learning for outlier detection with one-class classifiers. Expert Systems with Applications, 168:114372, 2020.
- [23] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86):2579–2605, 2008.
- [24] Yan Xia, Xudong Cao, Fang Wen, Gang Hua, and Jian Sun. Learning discriminative reconstructions for unsupervised outlier removal. In IEEE International Conference on Computer Vision, 2015.
- [25] Daochen Zha, Kwei-Herng Lai, Mingyang Wan, and Xia Hu. Meta-aad: Active anomaly detection with deep reinforcement learning. In IEEE International Conference on Data Mining, 2020.
- [26] Zheng Zhang and Xiaogang Deng. Anomaly detection using improved deep svdd model with data structure preservation. Pattern Recognition Letters, 148:1–6, 2021.