
Unsupervised Deep One-Class Classification with
Adaptive Threshold based on Training Dynamics

Minkyung Kim1, Junsik Kim2, Jongmin Yu3 and Jun Kyun Choi1 1School of Electrical Engineering, KAIST, Republic of Korea
2School of Engineering and Applied Sciences, Harvard University, U.S.A.
3Department of Engineering, King’s College London, United Kingdom
[email protected], [email protected], [email protected], [email protected]
Abstract

One-class classification has been a prevailing method for building deep anomaly detection models under the assumption that a dataset consisting of normal samples is available. In practice, however, abnormal samples are often mixed into a training dataset, and they detrimentally affect the training of deep models, which limits their applicability. For robust normality learning in practical deep models, we propose an unsupervised deep one-class classification that learns normality from pseudo-labeled normal samples, i.e., outlier detection in single-cluster scenarios. To this end, we propose a pseudo-labeling method based on an adaptive threshold selected from ranking-based training dynamics. The experiments on 10 anomaly detection benchmarks show that our method effectively improves anomaly detection performance by sizable margins.

Index Terms:
unsupervised one-class classification, anomaly detection, pseudo-labeling, threshold, training dynamics

I Introduction

Anomaly detection (AD) is widely utilized to identify observations that deviate significantly from what is considered normal in each application domain [1]. By leveraging recent successes in deep learning, deep learning-based AD has been actively researched and has shown high capabilities across various applications, such as intrusion detection [2], fraud detection [3], defect detection [4, 5], and surveillance systems [6]. In training deep AD models, normal samples are considered far easier to obtain than abnormal samples, which supports the common assumption that all the training data are normal samples [7, 8, 9, 10, 11, 12, 13]. The aim of model training is then to describe normality accurately [8]. An abnormal sample is defined as an observation deviating considerably from the learned normality. From the viewpoint of learning normality from normal samples, one-class classification (OCC) has been a prevailing method for building deep AD models [14].

OCC-based deep AD models define the concept of normality as a discriminative decision boundary by mapping normal data to a compact representation. However, in practice, abnormal samples are often mixed into a training dataset due to the nature of real data distributions, which include anomalous tails. Labeling all the data to train a deep one-class classifier is an expensive process in terms of the time and effort of domain experts. This requirement for normal samples limits the applicability of deep models. Furthermore, abnormal samples mixed into a training dataset have a detrimental effect on the normality that a model learns [15, 16]. Thus, a practical deep AD model needs to learn normality robustly from a mixed dataset of normal and abnormal samples without labels, i.e., unsupervised AD.

While deep AD with a training dataset consisting of only normal samples, a.k.a. semi-supervised AD, has been extensively studied, research dealing with unlabeled datasets, a.k.a. unsupervised AD, has been less explored. Unsupervised AD can be grouped into two categories: robust methods [17, 18, 19] and iterative methods [20, 16, 21, 22]. Robust methods modify model architectures or loss functions to handle anomalies in a dataset implicitly. Iterative methods explicitly select pseudo-normal and pseudo-abnormal samples and use them in an iterative learning process. The amount of pseudo-labeled samples is a critical hyper-parameter for iterative approaches. However, in unsupervised AD, a validation set is often not available, so tuning hyper-parameters is not a viable option. Previous studies [16, 21, 22] therefore set them manually to fixed values, even though this is suboptimal because the ratio of abnormal samples in a dataset, i.e., the anomaly ratio, varies across tasks.

In this paper, we propose an unsupervised deep OCC that learns normality from pseudo-labeled normal samples. Unlike previous pseudo-labeling approaches that rely on hyper-parameters correlated with the anomaly ratio, our method implicitly estimates the anomaly ratio of a dataset, thereby removing the dependency on such hyper-parameters. In unsupervised AD scenarios, a learned representation is unreliable because abnormal samples detrimentally affect the training. Moreover, we observe that the anomaly ranking of each sample fluctuates during training even after convergence. Inspired by this observation, we propose to exploit ranking-based training dynamics of samples to find a threshold that captures anomalies with high precision and recall. Our key idea stems from a fundamental principle in AD, the concentration assumption [14]: while the data space is unbounded, normal samples lie in high-density regions, and this normality region can be bounded, whereas abnormal samples need not be concentrated and lie in low-density regions. We hypothesize that normal samples are more likely to fluctuate within the normality region while abnormal samples fluctuate outside it during model training. Therefore, an effective threshold can be found by measuring the fluctuations across the two regions, i.e., within and outside the normality region. We then utilize the pseudo-normal samples separated by the threshold for normality learning. The proposed method is free from hyper-parameters for pseudo-labeling and applies to training datasets with unknown anomaly ratios. We evaluate the effectiveness of the proposed method on 10 anomaly detection benchmarks with different anomaly ratios. We further show that the pseudo-labeling by our method captures anomalies with higher precision and recall than the previous methods [20, 8].

II Related Work

One-class classification-based anomaly detection. One-class classification (OCC) identifies objects of a specific class amongst all objects by primarily learning from a training dataset consisting of only objects of the target class [23]. OCC has been widely applied to anomaly detection (AD) by describing normality. The most studied methods, which form a solid foundation for OCC-based AD, are One-Class SVM (OC-SVM) [24] and Support Vector Data Description (SVDD) [25]. They find a hyperplane and a hypersphere enclosing most of the training data, respectively. By leveraging recent successes in deep learning, many studies have integrated the concept of OC-SVM or SVDD into a deep model for AD [7, 8, 9, 10, 13]. However, they commonly assume that the training dataset consists of only normal samples. When AD is formulated as OCC, this fundamental assumption limits the model's applicability in the real world.

Unsupervised anomaly detection. Unsupervised AD approaches have been proposed to tackle learning normality from anomaly-mixed datasets, aiming to reduce the negative learning effects of abnormal data. Robust methods adopt ideas analogous to RPCA [26], e.g., by convex relaxation [17] or by projecting latent features into a lower-dimensional subspace [18] to detect anomalies. NCAE [19] uses adversarial learning to calibrate a latent distribution robust to outliers. Although these methods do not explicitly use pseudo-labeling, loss and regularizer weighting play an analogous role. On the other hand, iterative methods [16, 20, 21, 22] explicitly define pseudo-normal and pseudo-abnormal samples and use them in an iterative learning process. However, because the optimal ratio of pseudo-normal and pseudo-abnormal samples is unknown, most of the previous works [16, 21, 22] use fixed ratios when selecting pseudo-normal and pseudo-abnormal samples. Unlike the other iterative approaches, Xia et al. [20] propose to find a threshold by minimizing the intra-class variances of pseudo-normal and pseudo-abnormal samples.

Training dynamics. Recently, training dynamics, i.e., traces of SGD or logits during training, have been used for analyzing catastrophic forgetting [27], measuring the importance of data samples on a learning task [28], large dataset analysis [29], and identifying noisy labels [30]. Toneva et al. [27] use training dynamics to identify forgetting events over the course of training. Ghorbani et al. [28] propose Data Shapley, which quantifies the contribution of each training sample to predictor performance and can identify potentially corrupted samples. Dataset Cartography [29] maps a large-scale dataset with training dynamics to identify ambiguous-, easy-, and hard-to-learn data regions in a feature space. In noisy-label learning [30], the margin between the top two logit values during training is used as training dynamics to identify mislabeled samples. Unlike previous approaches using SGD or logits for training dynamics, we introduce new ranking-based training dynamics and apply them to unsupervised AD.

III Preliminary

We provide a brief introduction to Deep SVDD [8], which is a prevailing one-class classification-based deep anomaly detection model. We utilize Deep SVDD as the base model in this paper. Given a training dataset ${\mathcal{D}}=\{\boldsymbol{x}_{1},\cdots,\boldsymbol{x}_{n}\}$, where $\boldsymbol{x}_{i}\in\mathbb{R}^{d}$, Deep SVDD maps data $\boldsymbol{x}\in{\mathcal{X}}$ to a feature space ${\mathcal{F}}$ through $\phi(\cdot;{\mathcal{W}}):{\mathcal{X}}\rightarrow{\mathcal{F}}$, while gathering data around the center $\boldsymbol{c}$ of a hypersphere in the feature space, where ${\mathcal{W}}$ denotes the set of weights of the neural network $\phi$. Then, the sample loss $l$ for Deep SVDD is defined as follows:

l(\boldsymbol{x}_{i})=\nu\cdot R^{2}+\max\left(0,\;\|\phi(\boldsymbol{x}_{i};{\mathcal{W}})-\boldsymbol{c}\|^{2}-R^{2}\right). \quad (1)

For simplicity, we omit regularizers in the rest of this section. The hyper-parameter $\nu$ controls the trade-off between the radius $R$ and the amount of data outside the hypersphere. When most of the training data are normal, the sample loss above can be simplified as follows:

l(\boldsymbol{x}_{i})=\|\phi(\boldsymbol{x}_{i};{\mathcal{W}})-\boldsymbol{c}\|^{2}. \quad (2)

These two versions of Deep SVDD are called soft-boundary Deep SVDD and One-Class Deep SVDD, respectively. In both models, the anomaly score is measured by the squared distance between a data point and the center (Eq. (3)).

s(\boldsymbol{x}_{i};{\mathcal{W}})=\|\phi(\boldsymbol{x}_{i};{\mathcal{W}})-\boldsymbol{c}\|^{2}. \quad (3)
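To make the loss and score concrete, the following PyTorch-style sketch computes the anomaly score of Eq. (3) and the two loss variants of Eqs. (1) and (2). The network phi, the center c, and the function names are illustrative placeholders under our reading of the equations, not the released Deep SVDD implementation.

```python
import torch

def anomaly_score(phi, x, c):
    """Squared distance to the hypersphere center (Eq. (3))."""
    return torch.sum((phi(x) - c) ** 2, dim=1)

def one_class_loss(phi, x, c):
    """One-Class Deep SVDD objective (Eq. (2)), averaged over a batch."""
    return anomaly_score(phi, x, c).mean()

def soft_boundary_loss(phi, x, c, radius, nu):
    """Soft-boundary Deep SVDD objective (Eq. (1)), averaged over a batch."""
    dist = anomaly_score(phi, x, c)
    return nu * radius ** 2 + torch.clamp(dist - radius ** 2, min=0).mean()
```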

IV Unsupervised Deep One-class Classification

In unsupervised anomaly detection (AD) scenarios, a deep one-class classification (OCC) model that assumes a training dataset consisting only of normal samples learns a normality contaminated by the abnormal samples mixed into the training dataset. For more robust normality learning, we propose to add one more step, pseudo-labeling, to each training epoch. To this end, our unsupervised deep OCC model tracks changes in the ranking of anomaly scores of each data sample, i.e., ranking-based training dynamics, and identifies the adaptive threshold across which significant ranking changes rarely happen. The data samples within the adaptive threshold are then utilized as pseudo-normal samples in the following training epoch. Our pseudo-labeling step requires neither hyper-parameters nor prior knowledge of the dataset, such as its anomaly ratio.

IV-A Ranking-based Training Dynamics and Threshold Selection

We define the ranking-based training dynamics of each data sample as a two-dimensional vector whose elements are the rankings of its anomaly scores in adjacent epochs. We denote the anomaly score of data $\boldsymbol{x}_{i}$ at epoch $e$ as $s_{i}^{e}\in\mathbb{R}$, measured by Eq. (3), where $i\in\{1,\cdots,n\}$, $e\in\{1,\cdots,E\}$, $n$ is the number of data in the training dataset, and $E$ is the number of epochs. Then, we obtain a sorted list of $s_{i}^{e}$ in ascending order and denote the ranking of $s_{i}^{e}$ in the sorted list as $r_{i}^{e}\in\mathbb{N}$, where $r_{i}^{e}\in\{1,\cdots,n\}$. The ranking-based training dynamics of $\boldsymbol{x}_{i}$ at epoch $e$ is defined as $(r_{i}^{e-1},r_{i}^{e})$. The dots in Fig. 1 denote the ranking-based training dynamics of each data sample at epoch $e$.
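As a minimal NumPy sketch (the function name is illustrative, not the authors' code), the per-epoch rankings $r_{i}^{e}$ can be obtained from the anomaly scores as follows:

```python
import numpy as np

def rank_scores(scores):
    """Ascending ranking of anomaly scores: the lowest score gets ranking 1,
    the highest gets ranking n (Section IV-A)."""
    scores = np.asarray(scores)
    ranks = np.empty(len(scores), dtype=int)
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    return ranks

# The ranking-based training dynamics of sample i at epoch e is the pair
# (r_i^{e-1}, r_i^e), e.g.:
# dynamics = np.stack([rank_scores(scores_prev), rank_scores(scores_curr)], axis=1)
```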

Figure 1: Concept of ranking-based training dynamics and samples with significant ranking change. Dots on the two-dimensional plane denote the ranking-based training dynamics of a training dataset at epoch $e$. Samples in the colored area denote the samples with significant ranking change according to the threshold $\delta$.
Figure 2: Concept of threshold selection. The degree of significant ranking changes between epochs $e-1$ and $e$ is calculated for every possible threshold $\delta$. Then, the threshold $\delta^{e*}$ with the minimum changes is used for pseudo-normal sample selection. (Middle) Example of the cardinality of the sample set with significant ranking changes according to threshold $\delta$. (Bottom) Example of the degree of significant ranking changes, i.e., the scaled cardinality of the same sample set. Both graphs are obtained with the Satellite dataset.

Starting from the second epoch, $e=2$, we adaptively select a threshold $\delta$ separating the normality region based on the ranking-based training dynamics. To this end, we define significant ranking changes, motivated by the following observation and hypothesis: when model parameters converge during training, the anomaly scores are also expected to converge, i.e., the rankings of anomaly scores are expected to no longer change. In practice, due to a deep model's stochastic training and high complexity, the rankings of anomaly scores frequently change over epochs, even after a sufficient number of epochs.

However, not all ranking changes are important. Samples with low anomaly scores are likely to be normal, and samples with high anomaly scores are likely to be abnormal, with high confidence. Assuming we have an ideal threshold, local ranking changes among normal samples (low anomaly scores) or among abnormal samples (high anomaly scores) do not affect the capability of the threshold to separate the normality region. Significant ranking changes occur when rankings below the threshold, i.e., samples in the normality region, move to rankings above the threshold, i.e., outside the normality region, or vice versa. Our hypothesis is that significant ranking changes across the ideal threshold are less likely to occur.

We divide the two-dimensional space in which the ranking-based training dynamics lie into four regions to examine significant ranking changes according to a threshold $\delta$ (Eq. (4)). Examples of the four regions are shown in Fig. 1. The subscripts $N$ and $A$ in Eq. (4) denote the sets of pseudo-normal and pseudo-abnormal samples, respectively:

N=\{\boldsymbol{x}_{i}\ |\ r_{i}<\delta\},\quad A=\{\boldsymbol{x}_{i}\ |\ r_{i}\geq\delta\}.

A set of samples with significant ranking changes based on threshold $\delta$ is defined as the union of ${\mathcal{D}}_{N\rightarrow A}^{e,\delta}$ and ${\mathcal{D}}_{A\rightarrow N}^{e,\delta}$ (areas colored red in Fig. 1). Finally, the degree of significant ranking changes is calculated for all possible $\delta\in\{2,\cdots,n\}$, as in Otsu's method [31], and the $\delta$ with the minimum degree of significant ranking changes is set as the adaptive threshold $\delta^{e*}$ at epoch $e$ (Eq. (5)). One important thing to note is that, since the cardinality of the set ${\mathcal{D}}_{N\rightarrow A}^{e,\delta}\cup{\mathcal{D}}_{A\rightarrow N}^{e,\delta}$ is naturally proportional to the area corresponding to the sets in the two-dimensional space, the minimum of significant ranking changes would occur when the threshold is close to either end, regardless of the task, unless proper scaling is applied. To deal with this problem, we scale the cardinality by the corresponding area $\delta\times(n-\delta)$ as in Eq. (5). An example of threshold selection is shown in Fig. 2. Then, the pseudo-normal set ${\mathcal{D}}_{N}^{e+1}$ for model training in the next epoch $e+1$ is determined as in Eq. (6). For better stability during training, all threshold values selected so far are averaged before performing pseudo-labeling.

{\mathcal{D}}_{N\rightarrow N}^{e,\delta}=\{\boldsymbol{x}_{i}\ |\ r_{i}^{e-1}<\delta,\ r_{i}^{e}<\delta\},\quad
{\mathcal{D}}_{N\rightarrow A}^{e,\delta}=\{\boldsymbol{x}_{i}\ |\ r_{i}^{e-1}<\delta,\ r_{i}^{e}\geq\delta\},\quad
{\mathcal{D}}_{A\rightarrow N}^{e,\delta}=\{\boldsymbol{x}_{i}\ |\ r_{i}^{e-1}\geq\delta,\ r_{i}^{e}<\delta\},\quad
{\mathcal{D}}_{A\rightarrow A}^{e,\delta}=\{\boldsymbol{x}_{i}\ |\ r_{i}^{e-1}\geq\delta,\ r_{i}^{e}\geq\delta\}, \quad (4)

\delta^{e*}=\operatorname*{arg\,min}_{\delta}\ \frac{1}{\delta\times(n-\delta)}\left(|{\mathcal{D}}_{N\rightarrow A}^{e,\delta}|+|{\mathcal{D}}_{A\rightarrow N}^{e,\delta}|\right), \quad (5)

{\mathcal{D}}_{N}^{e+1}=\{\boldsymbol{x}_{i}\ |\ r_{i}^{e-1}<\bar{\delta}^{e*},\ r_{i}^{e}<\bar{\delta}^{e*}\},\quad \bar{\delta}^{e*}=\frac{1}{|\Delta^{e}|}\sum_{\delta\in\Delta^{e}}\delta,\quad \text{where}\ \Delta^{e}=\{\delta^{e^{\prime}*}\ |\ e^{\prime}\leq e\}. \quad (6)
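The following NumPy sketch illustrates Eqs. (4)-(6): it counts significant ranking changes for each candidate threshold, scales by the corresponding area, selects the minimizer, and forms the pseudo-normal set from the running average of selected thresholds. The function names are hypothetical, and the candidate $\delta=n$ is skipped in this sketch because the scaling area vanishes there.

```python
import numpy as np

def select_threshold(r_prev, r_curr):
    """Adaptive threshold (Eq. (5)): the candidate delta with the fewest
    significant ranking changes after scaling by the area delta*(n - delta)."""
    n = len(r_curr)
    best_delta, best_val = 2, np.inf
    for delta in range(2, n):          # delta = n skipped: scaling area would be zero
        n_to_a = np.sum((r_prev < delta) & (r_curr >= delta))   # |D_{N->A}|
        a_to_n = np.sum((r_prev >= delta) & (r_curr < delta))   # |D_{A->N}|
        val = (n_to_a + a_to_n) / (delta * (n - delta))
        if val < best_val:
            best_val, best_delta = val, delta
    return best_delta

def pseudo_normal_indices(r_prev, r_curr, selected_thresholds):
    """Pseudo-normal set for the next epoch (Eq. (6)), using the average of
    all thresholds selected so far for stability."""
    delta_bar = np.mean(selected_thresholds)
    return np.where((r_prev < delta_bar) & (r_curr < delta_bar))[0]
```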
Algorithm 1 Unsupervised Deep OCC with training dynamics

Input: Unlabeled dataset ${\mathcal{D}}=\{\boldsymbol{x}_{1},\cdots,\boldsymbol{x}_{n}\}$,
         Maximum number of epochs $E$
Output: Anomaly scores, Threshold

1:  Initialize the training set: ${\mathcal{D}}_{N}^{1}={\mathcal{D}}_{N}^{2}={\mathcal{D}}$
2:  for $e=1$ to $E$ do
3:     Train model with ${\mathcal{D}}_{N}^{e}$ and ${\mathcal{L}}^{e}$ (Eq. (7))
4:     if $e\geq 2$ then
5:        Calculate ranking-based training dynamics $(r_{i}^{e-1},r_{i}^{e})$
6:        Search threshold $\delta^{e*}$ by ranking changes (Eq. (5))
7:        Update the pseudo-normal set ${\mathcal{D}}_{N}^{e+1}$ (Eq. (6))
8:     end if
8:     end if
9:  end for

IV-B Unsupervised Deep One-class Classification

Our proposed unsupervised deep one-class classification utilizes the pseudo-normal samples updated by ranking-based training dynamics at every epoch, except for the first two epochs, in which all the training data are used. The total loss at epoch $e$, ${\mathcal{L}}^{e}$, is defined as follows:

{\mathcal{L}}^{e}=\begin{cases}\dfrac{1}{n}\sum_{\boldsymbol{x}_{i}\in{\mathcal{D}}}\|\phi(\boldsymbol{x}_{i};{\mathcal{W}})-\boldsymbol{c}\|^{2}, & e\leq 2,\\[4pt] \dfrac{1}{|{\mathcal{D}}_{N}^{e}|}\sum_{\boldsymbol{x}_{i}\in{\mathcal{D}}_{N}^{e}}\|\phi(\boldsymbol{x}_{i};{\mathcal{W}})-\boldsymbol{c}\|^{2}, & e\geq 3.\end{cases} \quad (7)

We outline the learning process of unsupervised deep one-class classification in Algorithm 1.
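A compact sketch of Algorithm 1 is given below. It reuses the helper functions sketched earlier (rank_scores, select_threshold, pseudo_normal_indices), trains with full-batch updates for brevity, and omits autoencoder pre-training, so it should be read as an assumption-laden illustration rather than the exact released training code.

```python
import torch

def train_unsupervised_occ(phi, c, x_all, epochs=50, lr=1e-3):
    """Sketch of Algorithm 1: One-Class Deep SVDD trained on pseudo-normal
    samples refreshed every epoch from ranking-based training dynamics."""
    optimizer = torch.optim.Adam(phi.parameters(), lr=lr)
    normal_idx = torch.arange(x_all.shape[0])      # D_N^1 = D_N^2 = D
    r_prev, selected_thresholds = None, []

    for epoch in range(1, epochs + 1):
        # Eq. (7): one-class loss over the current pseudo-normal set
        optimizer.zero_grad()
        loss = torch.sum((phi(x_all[normal_idx]) - c) ** 2, dim=1).mean()
        loss.backward()
        optimizer.step()

        # Anomaly scores (Eq. (3)) and rankings for the whole training set
        with torch.no_grad():
            scores = torch.sum((phi(x_all) - c) ** 2, dim=1).cpu().numpy()
        r_curr = rank_scores(scores)

        if epoch >= 2:                             # Eqs. (5) and (6)
            selected_thresholds.append(select_threshold(r_prev, r_curr))
            normal_idx = torch.as_tensor(
                pseudo_normal_indices(r_prev, r_curr, selected_thresholds))
        r_prev = r_curr
    return phi
```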

V Experiment

V-A Experiment Settings

V-A1 Datasets and evaluation metrics

Anomaly detection benchmarks [32] consist of multivariate tabular datasets. As shown in Table I, each dataset has $n$ data samples with $d$ attributes. The benchmarks cover a wide range of anomaly ratios, from 1.2% to 34.9%.

We use three performance metrics for binary classifiers as evaluation metrics: the area under the receiver operating characteristic curve (ROCAUC), the area under the precision-recall curve (PRAUC), and the F1-Score. ROCAUC and PRAUC are computed on the predicted anomaly scores by varying a decision threshold. ROCAUC is known to be over-confident when the classes are highly imbalanced, which is often the case in anomaly detection, where the anomaly ratio is usually low. Therefore, we also report PRAUC, which scales performance according to the anomaly ratio. The F1-Score measures the discriminativeness of the selected thresholds.
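For reference, the three metrics can be computed with scikit-learn as in the following sketch, where y_true marks anomalies as 1, scores are predicted anomaly scores, and the binary prediction used for the F1-Score comes from the selected threshold; the function name and interface are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc, f1_score

def evaluate(y_true, scores, pred_abnormal):
    """y_true: 1 = anomaly, 0 = normal.  scores: predicted anomaly scores.
    pred_abnormal: binary prediction obtained from the selected threshold."""
    rocauc = roc_auc_score(y_true, scores)                         # ROCAUC
    precision, recall, _ = precision_recall_curve(y_true, scores)
    prauc = auc(recall, precision)                                 # PRAUC
    f1 = f1_score(y_true, np.asarray(pred_abnormal).astype(int))   # F1-Score
    return rocauc, prauc, f1
```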

TABLE I: Anomaly detection benchmarks.
Dataset Points Dim. Anomalies Anomaly ratio (%)
Pima 768 8 268 34.9
Satellite 6435 36 2036 31.6
Arrhythmia 452 274 66 14.6
Cardio 1831 21 176 9.6
Mnist 7603 100 700 9.2
Wbc 378 30 21 5.6
Glass 214 9 9 4.2
Thyroid 3772 6 93 2.5
Pendigits 6870 16 156 2.3
Satimage-2 5803 36 71 1.2

V-A2 Competing methods

The proposed method is compared with the following deep one-class classification methods:

  • soft-boundary Deep SVDD [8]: A deep one-class classifier that explicitly learns a decision boundary while enforcing a fraction of data to lie outside the boundary through a hyper-parameter $\nu$. For this hyper-parameter, we use the true anomaly ratio according to the $\nu$-property [24, 8].

  • One-Class Deep SVDD [8]: A deep one-class classifier designed under the assumption that all the training data is normal.

  • One-Class Deep SVDD + Otsu's method: One-Class Deep SVDD trained with pseudo-normal samples selected by a threshold. Otsu's method [31] is applied to the anomaly scores, as in the study [20], to search for the threshold at each epoch: the dataset is divided into two groups by each candidate threshold, and the threshold that minimizes the intra-class variance is selected (a sketch is given after this list).

  • One-Class Deep SVDD + Pre-defined threshold: One-Class Deep SVDD trained with pseudo-normal samples selected according to a pre-defined threshold. We use each dataset’s true anomaly ratio (TAR) as the threshold.

We denote each competing method as SB, OC, OC+Otsu, and OC+TAR, respectively.
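The Otsu-based baseline can be sketched as follows: the split of the sorted anomaly scores that minimizes the weighted intra-class variance is taken as the threshold. This is an illustrative O(n^2) implementation under our reading of [20, 31], not the exact code of those works.

```python
import numpy as np

def otsu_threshold(scores):
    """Return the score threshold minimizing the weighted intra-class
    variance of the two groups it creates (Otsu's criterion on raw scores)."""
    s = np.sort(np.asarray(scores, dtype=float))
    n = len(s)
    best_t, best_val = s[-1], np.inf
    for k in range(1, n):                      # split into s[:k] and s[k:]
        within = (k / n) * s[:k].var() + ((n - k) / n) * s[k:].var()
        if within < best_val:
            best_val, best_t = within, s[k]    # scores >= best_t are pseudo-abnormal
    return best_t
```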

TABLE II: Performance results measured by ROCAUC and PRAUC. We report the average AUC with a standard deviation computed over 10 seeds. The highest performance among the OCC-based deep AD models in each group is marked with an asterisk (*).
Dataset | ROCAUC: SB, OC, OC+Otsu, Proposed | PRAUC: SB, OC, OC+Otsu, Proposed
Pima | 54.9±3.8, 56.6±4.7, 57.1±5.1, 58.6±4.7* | 38.7±2.6, 39.7±3.5, 40.5±3.8, 42.4±3.7*
Satellite | 65.6±5.5, 68.7±4.4, 71.8±5.6, 77.6±3.7* | 58.6±7.2, 63.3±6.6, 67.6±6.1, 76.0±3.0*
Arrhythmia | 69.6±4.0, 68.1±3.6, 70.4±3.2, 71.4±3.3* | 26.4±3.1, 27.6±3.8, 33.2±2.9, 35.3±2.9*
Cardio | 72.4±6.2, 80.6±4.4, 82.0±4.0, 85.2±4.5* | 23.6±5.9, 31.8±7.2, 35.7±6.6, 44.5±12.0*
Mnist | 69.8±3.4, 77.4±4.4, 77.4±5.4, 79.4±4.7* | 21.2±2.2, 28.7±3.3, 29.8±4.5, 32.5±4.2*
Wbc | 75.2±6.4, 82.5±6.6, 82.6±5.0, 85.4±3.1* | 20.2±10.2, 26.7±12.9, 33.3±10.9, 34.5±11.3*
Glass | 66.3±8.8, 77.9±9.3, 79.2±10.1*, 79.2±9.6* | 12.0±7.0, 14.5±6.0, 14.2±6.6, 16.2±6.8*
Thyroid | 78.2±5.3, 90.9±4.0, 90.7±3.3, 91.4±3.7* | 14.7±5.4, 25.3±6.9, 27.9±9.7, 36.0±11.7*
Pendigits | 64.6±11.0, 71.1±8.5, 75.5±7.3, 76.8±10.4* | 5.3±1.4, 6.2±2.6, 8.7±4.4, 13.6±13.5*
Satimage-2 | 77.1±8.2, 93.0±3.9, 94.4±3.1, 95.1±2.4* | 8.3±4.0, 15.1±6.7, 19.5±9.3, 22.3±15.9*
TABLE III: Performance results measured by F1-Score. We report the average score with a standard deviation computed over 10 seeds. The highest performance in each row is marked with an asterisk (*).
Dataset | F1-Score: SB, OC+Otsu, Proposed
Pima | 38.4±3.9*, 5.5±2.9, 18.2±5.7
Satellite | 50.8±5.4, 5.8±5.2, 65.7±3.4*
Arrhythmia | 29.0±5.9, 17.6±4.3, 30.1±8.4*
Cardio | 24.1±6.7, 8.8±6.6, 44.8±9.6*
Mnist | 26.2±4.4, 4.4±3.2, 33.8±2.8*
Wbc | 23.5±11.9, 27.1±8.0, 32.7±11.4*
Glass | 17.5±7.5*, 16.1±3.8, 13.8±2.8
Thyroid | 22.8±6.7, 8.4±3.6, 36.7±9.7*
Pendigits | 7.3±2.9, 8.3±10.2, 16.3±16.3*
Satimage-2 | 13.6±6.0, 14.3±10.2, 17.8±18.0*

V-A3 Implementation details

To implement the base model, Deep SVDD (https://github.com/lukasruff/Deep-SVDD-PyTorch), we use the source code released by the authors and adjust the backbone architectures for each dataset. A 3-layer MLP with 128-64-32 units is used on the Arrhythmia dataset; a 3-layer MLP with 64-32-16 units on the Mnist dataset; a 3-layer MLP with 32-16-4 units on the Pima and Thyroid datasets; and a 3-layer MLP with 32-16-8 units on the remaining 6 datasets. The model is pre-trained with a reconstruction loss as an autoencoder for 100 epochs, and the pre-trained encoder is then fine-tuned with the anomaly detection loss for 50 epochs, i.e., $E$ is set to 50. For the autoencoder, we utilize the aforementioned architectures as the encoder network and implement the decoder network symmetrically. We use the Adam optimizer [33] with a batch size of 128 and a learning rate of $10^{-3}$. Data samples are standardized to have zero mean and unit variance. The experiments are performed with an Intel Xeon Silver 4210 CPU and a GeForce GTX 1080Ti GPU.
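As an illustration of the backbone (assuming bias-free linear layers with LeakyReLU activations, as recommended for Deep SVDD to avoid trivial collapsed solutions; the exact released architectures may differ), the encoder and the symmetric decoder used for autoencoder pre-training could look like the following sketch:

```python
import torch.nn as nn

class MLPEncoder(nn.Module):
    """3-layer MLP encoder, e.g. 32-16-8 units for most datasets."""
    def __init__(self, in_dim, hidden=(32, 16, 8)):
        super().__init__()
        dims = (in_dim,) + tuple(hidden)
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out, bias=False), nn.LeakyReLU()]
        layers.pop()                      # no activation on the final embedding
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class MLPDecoder(nn.Module):
    """Decoder mirroring the encoder, used only for autoencoder pre-training."""
    def __init__(self, out_dim, hidden=(32, 16, 8)):
        super().__init__()
        dims = tuple(reversed(hidden)) + (out_dim,)
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out, bias=False), nn.LeakyReLU()]
        layers.pop()                      # linear output for reconstruction
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)
```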

V-B Experiment Results

V-B1 Performance

We report unsupervised anomaly detection accuracy measured by ROCAUC and PRAUC in Table II.

When the naïve OC is compared with OC+Otsu and our method, both of which utilize pseudo-labeling, performance improvement is observed in most cases, and the proposed method shows the most favorable performance. The pseudo-normal samples selected by the proposed threshold enable OC to learn a more robust normality. We conjecture that even though Otsu's method finds the threshold that maximizes the separability between the anomaly scores of the two groups it divides, this threshold may not represent the boundary between normal and abnormal data samples in an unsupervised AD scenario.

ROCAUC measures performance over-confidently on a dataset with a small anomaly ratio, i.e., a heavily imbalanced dataset. Therefore, an improvement in ROCAUC alone cannot guarantee improved anomaly detection on datasets with low anomaly ratios. However, as mentioned above, our method also improves PRAUC, which accounts for precision adjusted by the anomaly ratio, and the improvement of our method is even more pronounced in PRAUC. These results show that our adaptive threshold applied to unsupervised one-class classification improves the robustness of the learned normality on datasets with various anomaly ratios.

Figure 3: Average precision (blue bar) and recall (orange bar) computed over 10 seeds for each method and each dataset: (a) SB, (b) OC+Otsu, (c) Proposed. We omit standard deviations for better visibility.
Figure 4: Changes in (a) ROCAUC, (b) PRAUC, and (c) F1-Score measured at each epoch during model training with the Satellite dataset. The average scores computed over 10 seeds are shown for each evaluation metric, with the standard deviation represented as a shaded area. For SB, the model is trained with the same loss function as OC for 10 training epochs before the radius $R$ is updated, following the released source code.
TABLE IV: Performance comparison with OC trained with pseudo-normal samples selected according to a true anomaly ratio (TAR). We report the average performance with the standard deviation computed over 10 seeds.
Dataset | ROCAUC: Proposed, OC+TAR | PRAUC: Proposed, OC+TAR | F1-Score: Proposed, OC+TAR
Pima | 58.6±4.7, 61.5±5.1 | 42.4±3.7, 44.0±3.9 | 18.2±5.7, 45.4±4.7
Satellite | 77.6±3.7, 77.4±3.6 | 76.0±3.0, 76.0±2.9 | 65.7±3.4, 65.1±3.2
Arrhythmia | 71.4±3.3, 74.2±2.7 | 35.3±2.9, 39.8±3.6 | 30.1±8.4, 43.9±3.5
Cardio | 85.2±4.5, 85.6±4.3 | 44.5±12.0, 44.4±11.5 | 44.8±9.6, 46.0±8.7
Mnist | 79.4±4.7, 79.2±4.9 | 32.5±4.2, 32.1±4.1 | 33.8±2.8, 35.3±3.1
Wbc | 85.4±3.1, 86.0±3.6 | 34.5±11.3, 37.4±14.0 | 32.7±11.4, 40.0±13.0
Glass | 79.2±9.6, 79.6±10.8 | 16.2±6.8, 14.8±5.8 | 13.8±2.8, 14.3±5.0
Thyroid | 91.4±3.7, 91.8±3.8 | 36.0±11.7, 40.2±12.3 | 36.7±9.7, 44.2±10.3
Pendigits | 76.8±10.4, 78.2±9.1 | 13.6±13.5, 15.5±16.8 | 16.3±16.3, 21.2±17.4
Satimage-2 | 95.1±2.4, 94.9±2.5 | 22.3±15.9, 23.6±17.9 | 17.8±18.0, 21.8±20.8

Furthermore, we evaluate the thresholds using the F1-Score (Table III). For each competing method, the training dataset is classified by the explicitly learned decision boundary in SB, and by the selected threshold in OC+Otsu and our method. The threshold selected by our method also shows robust performance in F1-Score in most cases. On the other hand, the F1-Scores of OC+Otsu are remarkably low despite its high ROCAUC and PRAUC. To clarify the differences in classification performance, we compare the precision and recall of each method on each dataset (Fig. 3). Notably, OC+Otsu shows high precision but low recall, leading to a low F1-Score. Based on these results and the concept of Otsu's method, we conjecture that the threshold selected in OC+Otsu lies far outside the normality region, caused by a few abnormal samples with extremely high anomaly scores. This is because the output of OCC is not normalized and anomaly scores have no upper bound. Proper filtering may alleviate this problem, but it introduces yet another threshold to set. In contrast, ranking, as a quantization of anomaly scores, is not affected by their scale. For SB, precision and recall are almost similar; in particular, on the Pima and Glass data, its F1-Score is higher than that of the proposed method. Lastly, we show the changes in the three evaluation metrics during model training (Fig. 4). The proposed method improves and maintains performance, whereas the others degrade due to the normality contaminated by abnormal samples.

Figure 5: The degree of significant ranking changes according to threshold $\delta$, obtained from all 50 epochs, for (a) Satellite (# of data: 6435, anomaly ratio: 31.6%), (b) Cardio (# of data: 1831, anomaly ratio: 9.6%), and (c) Thyroid (# of data: 3772, anomaly ratio: 2.5%). The red dashed line represents the location of the true anomaly ratio in each dataset.
Figure 6: The average and standard deviation of the selected threshold at each epoch during model training, shown in blue (computed over 10 seeds), for (a) Satellite (# of data: 6435, anomaly ratio: 31.6%), (b) Cardio (# of data: 1831, anomaly ratio: 9.6%), and (c) Thyroid (# of data: 3772, anomaly ratio: 2.5%). The red dashed line represents the location of the true anomaly ratio in each dataset.

V-B2 Comparison with oracle

We compare our method with OC+TAR, which uses the true anomaly ratio as the hyper-parameter for pseudo-labeling (Table IV). In practice, the anomaly ratio of a task is unknown, so we consider OC+TAR an oracle method. Our method shows competitive performance in ROCAUC and PRAUC. In a few cases, e.g., Satellite, Cardio, Mnist, Glass, and Satimage-2, our method is on par with or marginally better than OC+TAR in one of the two metrics. This is possible because the true anomaly ratio is only used as a hyper-parameter for model training and therefore does not guarantee upper-bound performance. In F1-Score, the performance gap is larger than in ROCAUC or PRAUC: for ROCAUC and PRAUC the anomaly ratio is used only as a training hyper-parameter, whereas for F1-Score it is explicitly used as the evaluation threshold, resulting in a larger gap.

V-B3 Threshold analysis

We visualize and analyze the threshold selected by our method in comparison with the true anomaly ratio. The proposed threshold selection method is designed under the hypothesis that significant ranking changes across the ideal threshold are less likely to occur. For a clearer illustration, we plot the degree of significant ranking changes over all possible thresholds for 50 epochs (Fig. 5), where the red dashed line represents the true anomaly ratio. The degree of significant ranking changes is minimized around the true anomaly ratio, which aligns with our design principle. We also visualize the selected thresholds over the 50 epochs to analyze how close the selected threshold stays to the true anomaly ratio during training (Fig. 6). The graph shows that the threshold selected by our method remains close to the true anomaly ratio throughout training.

VI Conclusion

In this paper, we introduce ranking-based training dynamics, tracked during model training, to search for an effective threshold that reflects the fundamental concentration assumption in anomaly detection. The proposed threshold selection method is applied to a one-class classification-based method to tackle unsupervised AD problems. While many previous studies utilizing pseudo-labeling rely on a hyper-parameter to control the amount of pseudo-labeled data, our method requires no hyper-parameter tuning. Moreover, the threshold analysis shows that the selected threshold is close to the true anomaly ratio. The experiments on various datasets with different anomaly ratios validate that our method effectively improves anomaly detection performance.

Acknowledgment

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2020-0-00833, A study of 5G based Intelligent IoT Trust Enabler).

References

  • [1] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM computing surveys (CSUR), vol. 41, no. 3, pp. 1–58, 2009.
  • [2] J. Yu, H. Oh, M. Kim, and S. Jung, “Unusual Insider Behavior Detection Framework on Enterprise Resource Planning Systems Using Adversarial Recurrent Autoencoder,” IEEE Transactions on Industrial Informatics, vol. 18, no. 3, pp. 1541–1551, 2021.
  • [3] G. Zhang, J. Wu, J. Yang, A. Beheshti, S. Xue, C. Zhou, and Q. Z. Sheng, “FRAUDRE: Fraud Detection Dual-Resistant to Graph Inconsistency and Imbalance,” in IEEE International Conference on Data Mining, 2021.
  • [4] P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, “MVTec AD–A comprehensive real-world dataset for unsupervised anomaly detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [5] W. Du, H. Shen, J. Fu, G. Zhang, and Q. He, “Approaches for improvement of the X-ray image defect detection of automobile casting aluminum parts based on deep learning,” NDT & E International, vol. 107, p. 102144, 2019.
  • [6] A. Singh, D. Patil, and S. Omkar, “Eye in the Sky: Real-Time Drone Surveillance System (DSS) for Violent Individuals Identification Using ScatterNet Hybrid Deep Learning Network,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018.
  • [7] P. Oza and V. M. Patel, “One-class convolutional neural network,” IEEE Signal Processing Letters, vol. 26, no. 2, pp. 277–281, 2018.
  • [8] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft, “Deep one-class classification,” in International Conference on Machine Learning, 2018.
  • [9] P. Perera, R. Nallapati, and B. Xiang, “Ocgan: One-class novelty detection using gans with constrained latent representations,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [10] P. Wu, J. Liu, and F. Shen, “A deep one-class neural network for anomalous event detection in complex scenes,” IEEE transactions on neural networks and learning systems, vol. 31, no. 7, pp. 2609–2622, 2019.
  • [11] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel, “Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection,” in IEEE International Conference on Computer Vision, 2019, pp. 1705–1714.
  • [12] L. Bergman and Y. Hoshen, “Classification-Based Anomaly Detection for General Data,” in International Conference on Learning Representations, 2020.
  • [13] S. Goyal, A. Raghunathan, M. Jain, H. V. Simhadri, and P. Jain, “DROCC: Deep robust one-class classification,” in International Conference on Machine Learning, 2020.
  • [14] L. Ruff, J. R. Kauffmann, R. A. Vandermeulen, G. Montavon, W. Samek, M. Kloft, T. G. Dietterich, and K.-R. Müller, “A Unifying Review of Deep and Shallow Anomaly Detection,” Proceedings of the IEEE, vol. 109, no. 5, pp. 756–795, 2021.
  • [15] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen, “Deep autoencoding gaussian mixture model for unsupervised anomaly detection,” in International Conference on Learning Representations, 2018.
  • [16] L. Beggel, M. Pfeiffer, and B. Bischl, “Robust anomaly detection in images using adversarial autoencoders,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2019.
  • [17] C. Zhou and R. C. Paffenroth, “Anomaly detection with robust deep autoencoders,” in International Conference on Knowledge Discovery & Data Mining, 2017.
  • [18] C.-H. Lai, D. Zou, and G. Lerman, “Robust Subspace Recovery Layer for Unsupervised Anomaly Detection,” in International Conference on Learning Representations, 2020.
  • [19] J. Yu, H. Oh, M. Kim, and J. Kim, “Normality-Calibrated Autoencoder for Unsupervised Anomaly Detection on Data Contamination,” in NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  • [20] Y. Xia, X. Cao, F. Wen, G. Hua, and J. Sun, “Learning discriminative reconstructions for unsupervised outlier removal,” in IEEE International Conference on Computer Vision, 2015.
  • [21] J. Fan, Q. Zhang, J. Zhu, M. Zhang, Z. Yang, and H. Cao, “Robust deep auto-encoding Gaussian process regression for unsupervised anomaly detection,” Neurocomputing, vol. 376, pp. 180–190, 2020.
  • [22] G. Pang, C. Yan, C. Shen, A. v. d. Hengel, and X. Bai, “Self-trained deep ordinal regression for end-to-end video anomaly detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  • [23] P. Perera, P. Oza, and V. M. Patel, “One-class classification: A survey,” arXiv preprint arXiv:2101.03064, 2021.
  • [24] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, “Estimating the support of a high-dimensional distribution,” Neural computation, vol. 13, no. 7, pp. 1443–1471, 2001.
  • [25] D. M. Tax and R. P. Duin, “Support vector data description,” Machine learning, vol. 54, no. 1, pp. 45–66, 2004.
  • [26] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM (JACM), vol. 58, no. 3, pp. 1–37, 2011.
  • [27] M. Toneva, A. Sordoni, R. T. des Combes, A. Trischler, Y. Bengio, and G. J. Gordon, “An empirical study of example forgetting during deep neural network learning,” in International Conference on Learning Representations, 2019.
  • [28] A. Ghorbani and J. Zou, “Data shapley: Equitable valuation of data for machine learning,” in International Conference on Machine Learning, 2019.
  • [29] S. Swayamdipta, R. Schwartz, N. Lourie, Y. Wang, H. Hajishirzi, N. A. Smith, and Y. Choi, “Dataset cartography: Mapping and diagnosing datasets with training dynamics,” in Conference on Empirical Methods in Natural Language Processing, 2020.
  • [30] G. Pleiss, T. Zhang, E. Elenberg, and K. Q. Weinberger, “Identifying mislabeled data using the area under the margin ranking,” in Advances in Neural Information Processing Systems, 2020.
  • [31] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE transactions on systems, man, and cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
  • [32] S. Rayana, “ODDS Library,” http://odds.cs.stonybrook.edu, 2016.
  • [33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015.