Restricted Generative Projection for One-Class Classification and Anomaly Detection
Abstract
We present a simple framework for one-class classification and anomaly detection. The core idea is to learn a mapping to transform the unknown distribution of training (normal) data to a known target distribution. Crucially, the target distribution should be sufficiently simple, compact, and informative. The simplicity is to ensure that we can sample from the distribution easily, the compactness is to ensure that the decision boundary between normal data and abnormal data is clear and reliable, and the informativeness is to ensure that the transformed data preserve the important information of the original data. Therefore, we propose to use truncated Gaussian, uniform in hypersphere, uniform on hypersphere, or uniform between hyperspheres, as the target distribution. We then minimize the distance between the transformed data distribution and the target distribution while keeping the reconstruction error for the original data small enough. Comparative studies on multiple benchmark datasets verify the effectiveness of our methods in comparison to baselines.
Index Terms:
Anomaly Detection, One-class Classification, Generative Projection.
I Introduction
Anomaly detection (AD) under the setting of one-class classification aims to distinguish normal data and abnormal data using a model trained on only normal data [1, 2, 3]. AD is useful in numerous real problems such as intrusion detection for video surveillance, fraud detection in finance, and fault detection for sensors. Many AD methods have been proposed in the past decades [4, 5, 6, 7, 8]. For instance, Schölkopf et al.[5] proposed the one-class support vector machine (OC-SVM) that finds, in a high-dimensional kernel feature space, a hyperplane yielding a large distance between the normal training data and the origin. Tax et al.[6] presented the support vector data description (SVDD), which obtains a spherically shaped boundary (with minimum volume) around the normal training data to identify abnormal samples. Hu et al.[8] propose a new kernel function to estimate samples’ local densities and propose a weighted neighborhood density estimation to increase the robustness to changes in the neighborhood size. There are also many deep learning based AD methods including unsupervised AD methods [9, 10, 11, 12, 13, 14, 15] and semi-supervised AD methods [16, 17, 18, 19].
Deep learning based AD methods may be organized into three categories. The first category is based on compression and reconstruction. These methods usually use an autoencoder [20, 21] to learn a low-dimensional representation to reconstruct the high-dimensional data [22, 23]. The autoencoder learned from the normal training data is expected to have a much higher reconstruction error on unknown abnormal data than on normal data. The second category is based on the combination of classical one-class classification [6, 11] and deep learning [10, 17, 19, 24, 25, 26, 27]. For instance, Ruff et al.[10] proposed a method called deep one-class SVDD. The main idea is to use deep learning to construct a minimum-radius hypersphere to include all the training data, while the unknown abnormal data are expected to fall outside. The last category is based on generative learning or adversarial learning [28, 29, 30, 31, 32, 33, 34, 35, 36]. For example, Perera et al. [32] proposed to use the generative adversarial network (GAN) [37] with constrained latent representation to detect anomalies for image data. Goyal et al.[33] presented a method called deep robust one-class classification (DROCC) and the method aims to find a low-dimensional manifold to accommodate the normal data via an adversarial optimization approach.
Although deep learning based AD methods have shown promising performance on various datasets, they still have limitations. For instance, one-class classification methods such as Deep SVDD [10] only ensure that a hypersphere can enclose the normal data but cannot guarantee that the normal data are distributed evenly in the hypersphere, which may lead to large empty regions in the hypersphere and hence yield an incorrect decision boundary (see Fig.1). Moreover, the popular hypersphere assumption may not be the best one for providing a compact decision boundary (see Fig.2 and Tab.I). The adversarial learning methods such as [31, 32, 33, 38] may suffer from instability in optimization.
In this work, we present a restricted generative projection (RGP) framework for one-class classification and anomaly detection. The main idea is to train a deep neural network to convert the distribution of normal training data to a target distribution that is simple, compact, and informative, which will provide a reliable decision boundary to identify abnormal data from normal data. There are many choices for the target distribution, such as truncated Gaussian and uniform on hypersphere. Our contributions are summarized as follows.
• We present a novel framework called RGP for one-class classification and anomaly detection. It aims to transform the data distribution into a target distribution that is easily violated by unknown abnormal data.
• We provide four simple, compact, and informative target distributions, analyze their properties theoretically, and show how to sample from them efficiently.
• We propose two extensions for our original RGP method.
We conduct extensive experiments (on eight benchmark datasets) to compare the performance of different target distributions and compare our method with state-of-the-art baselines. The results verify the effectiveness of our methods. The rest of this paper is organized as follows. Section II introduces the related work. Section III details our proposed methods. Section IV presents two extensions of the proposed method. Section V shows the experiments. Section VI draws conclusions for this paper.
II Related Work
Before elaborating on our method, this section briefly reviews deep one-class classification, autoencoder-based AD methods, and maximum mean discrepancy (MMD) [39]. We also discuss the connections and differences between our method and these related works.
II-A Deep One-Class Classification
The Deep SVDD proposed by [10] uses a neural network to learn a minimum-radius hypersphere to enclose the normal training data, i.e.,
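$$\min_{\mathcal{W}}\ \frac{1}{n}\sum_{i=1}^{n}\left\|\phi(\mathbf{x}_i;\mathcal{W})-\mathbf{c}\right\|^2+\frac{\lambda}{2}\sum_{l=1}^{L}\left\|\mathbf{W}_l\right\|_F^2 \tag{1}$$
where $\mathbf{c}$ is a predefined centroid, $\mathcal{W}=\{\mathbf{W}_1,\ldots,\mathbf{W}_L\}$ denotes the parameters of the $L$-layer neural network $\phi$, and $\lambda$ is a regularization hyperparameter. In (1), to avoid model collapse, bias terms should not be used and activation functions should be bounded [10]. There are also a few variants of Deep SVDD proposed for semi-supervised one-class classification and anomaly detection [17, 19].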

Both our method and Deep SVDD, as well as its variants, aim to project the normal training data into some space such that a decision boundary between normal data and unknown abnormal data can be found easily. However, the sum-of-squares minimization in Deep SVDD and its variants only ensures that the projected data are sufficiently close to the centroid in the sense of Euclidean distance and does not guarantee that the data are sufficiently or evenly distributed in the hypersphere centered at the centroid $\mathbf{c}$. Thus, in the hypersphere, there could be holes or large empty regions that contain no normal data, and hence it is not suitable to assume that the whole space enclosed by the hypersphere is a normal space. In other words, the optimal decision boundary between normal data and abnormal data can be very different from the hypersphere. An intuitive example is shown in Fig.1. We see that there is a large empty space in the hypersphere learned by Deep SVDD. In contrast, the transformed data of our method are sufficiently distributed.
II-B Autoencoder-based AD Methods
Our method is related to, but quite different from, the variational autoencoder (VAE) [21]. Although our model is an autoencoder, the main goal is not to represent or generate data; instead, our model aims to convert the data distribution in order to find a reliable decision boundary for anomaly detection. More importantly, the latent distribution in VAE is often Gaussian and not bounded, while the latent distribution in our model is more general and bounded, which is essential for anomaly detection. In addition, the optimizations of VAE and our method are also different: VAE involves the KL-divergence while our method involves maximum mean discrepancy [39].
It is worth noting that similar to our method, Perera et al.[32] also considered bounded latent distribution in autoencoder for anomaly detection. They proposed to train a denoising autoencoder with a hyper-cube supported latent space, via adversarial training. The latent distribution and optimization are different from ours. In addition, the latent distributions of our method, such as uniform on hypersphere, are more compact than the multi-dimensional uniform latent distribution of their method.
Compared with the autoencoder based anomaly detection method NAE [40] that uses reconstruction error to normalize autoencoder, our method pays more attention to learning a mapping that can transform the unknown data distribution into a simple and compact target distribution. The ideas are orthogonal.
II-C Maximum Mean Discrepancy
In statistics, maximum mean discrepancy (MMD) [39] is often used for two-sample testing, and its principle is to find a function that takes different expectations under two different distributions:
$$\mathrm{MMD}[\mathcal{F},p,q]=\sup_{\|f\|_{\mathcal{H}}\le 1}\left(\mathbb{E}_{\mathbf{x}\sim p}[f(\mathbf{x})]-\mathbb{E}_{\mathbf{y}\sim q}[f(\mathbf{y})]\right) \tag{2}$$
where $p,q$ are probability distributions, $\mathcal{F}$ is a class of functions, and $\mathcal{H}$ denotes a reproducing kernel Hilbert space. Using the kernel trick, MMD can be represented as a simple loss function that measures the discrepancy between two distributions from finite samples, which is easy to apply to deep learning and can be efficiently trained by gradient descent. Based on these advantages of MMD, Li et al. [41] proposed generative moment matching networks (GMMNs), which lead to a simpler optimization objective compared to the min-max optimization of GAN [37].
Although both our method and GMMNs [41] minimize the MMD between data distribution and prior distribution, our goal is not generating new data but detecting anomalies. In addition, we consider a few bounded target distributions and analyze their sampling properties. More importantly, our method has very competitive performance when compared with SOTA methods of anomaly detection and one-class classification.
III Restricted Generative Projection
In this section, we introduce our RGP framework, bounded target distributions, and the computation of anomaly scores.
III-A Restricted Distribution Projection
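Suppose we have a set of $m$-dimensional training data $\mathbf{X}=\{\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_n\}$ drawn from an unknown bounded distribution $\mathcal{D}_{\mathbf{x}}$, and any samples drawn from $\mathcal{D}_{\mathbf{x}}$ are normal data. We want to train a model on $\mathbf{X}$ to determine whether a test sample $\mathbf{x}_{\text{new}}$ is drawn from $\mathcal{D}_{\mathbf{x}}$ or not. One may consider estimating the density function (denoted by $p_{\mathbf{x}}$) of $\mathcal{D}_{\mathbf{x}}$ using techniques such as kernel density estimation [42]. If the estimate $\hat{p}_{\mathbf{x}}$ is good enough, one can determine whether $\mathbf{x}_{\text{new}}$ is normal according to the value of $\hat{p}_{\mathbf{x}}(\mathbf{x}_{\text{new}})$: if it is zero or close to zero, $\mathbf{x}_{\text{new}}$ is an abnormal data point; otherwise, $\mathbf{x}_{\text{new}}$ is a normal data point. (Footnote 1: Here we assume that the distributions of normal data and abnormal data do not overlap. Otherwise, it is difficult to determine whether a single point is normal or not.) However, the dimensionality of the data is often high and hence it is very difficult to obtain a good estimate $\hat{p}_{\mathbf{x}}$.
We propose to learn a mapping $\mathcal{T}:\mathbb{R}^m\to\mathbb{R}^d$ to transform the unknown bounded distribution $\mathcal{D}_{\mathbf{x}}$ to a known distribution $\mathcal{D}_{\mathbf{z}}$ while there still exists a mapping $\mathcal{T}':\mathbb{R}^d\to\mathbb{R}^m$ that can recover $\mathcal{D}_{\mathbf{x}}$ from $\mathcal{D}_{\mathbf{z}}$ approximately. Let $p_{\mathbf{z}}$ be the density function of $\mathcal{D}_{\mathbf{z}}$. Then we can determine whether $\mathbf{x}_{\text{new}}$ is normal or not according to the value of $p_{\mathbf{z}}(\mathcal{T}(\mathbf{x}_{\text{new}}))$. To be more precise, we want to solve the following problem
$$\min_{\mathcal{T},\mathcal{T}'}\ \mathcal{M}\big(\mathcal{T}(\mathcal{D}_{\mathbf{x}}),\mathcal{D}_{\mathbf{z}}\big)+\lambda\,\mathcal{M}\big(\mathcal{T}'(\mathcal{T}(\mathcal{D}_{\mathbf{x}})),\mathcal{D}_{\mathbf{x}}\big) \tag{3}$$
where $\mathcal{M}(\cdot,\cdot)$ denotes some distance metric between two distributions and $\lambda$ is a trade-off parameter for the two terms. Note that if $\lambda=0$, $\mathcal{T}$ may convert any distribution to $\mathcal{D}_{\mathbf{z}}$ and lose the ability to distinguish normal data from abnormal data. Based on the universal approximation theorems [43, 44] and the substantial success of neural networks, we use deep neural networks (DNNs) to model $\mathcal{T}$ and $\mathcal{T}'$ respectively. Let $f_\theta$ and $g_\phi$ be two DNNs with parameters $\theta$ and $\phi$ respectively. We solve
$$\min_{\theta,\phi}\ \mathcal{M}\big(\mathcal{D}_{f_\theta(\mathbf{x})},\mathcal{D}_{\mathbf{z}}\big)+\lambda\,\mathcal{M}\big(\mathcal{D}_{g_\phi(f_\theta(\mathbf{x}))},\mathcal{D}_{\mathbf{x}}\big) \tag{4}$$
where $f_\theta$ and $g_\phi$ serve as encoder and decoder respectively. However, problem (4) is intractable because $\mathcal{D}_{\mathbf{x}}$ is unknown and $\mathcal{D}_{f_\theta(\mathbf{x})}$, $\mathcal{D}_{g_\phi(f_\theta(\mathbf{x}))}$ cannot be computed analytically. Note that the samples of $\mathcal{D}_{\mathbf{x}}$ and $\mathcal{D}_{g_\phi(f_\theta(\mathbf{x}))}$ are given and paired. Then the second term in the objective of (4) can be replaced by a sample reconstruction error such as $\frac{1}{n}\sum_{i=1}^{n}\|\mathbf{x}_i-g_\phi(f_\theta(\mathbf{x}_i))\|^2$. On the other hand, we can also sample from $\mathcal{D}_{f_\theta(\mathbf{x})}$ and $\mathcal{D}_{\mathbf{z}}$ easily, but their samples are not paired. Hence, the metric $\mathcal{M}$ in the first term of the objective of (4) should be able to measure the distance between two distributions using their finite samples. To this end, we propose to use the kernel maximum mean discrepancy (MMD) [39] to measure the distance between $\mathcal{D}_{f_\theta(\mathbf{x})}$ and $\mathcal{D}_{\mathbf{z}}$. Its empirical estimate is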
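$$\mathrm{MMD}^2[\mathcal{F},\mathbf{X},\mathbf{Y}]=\frac{1}{n_1(n_1-1)}\sum_{i=1}^{n_1}\sum_{j\neq i}k(\mathbf{x}_i,\mathbf{x}_j)+\frac{1}{n_2(n_2-1)}\sum_{i=1}^{n_2}\sum_{j\neq i}k(\mathbf{y}_i,\mathbf{y}_j)-\frac{2}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}k(\mathbf{x}_i,\mathbf{y}_j) \tag{5}$$
where $\mathbf{X}=\{\mathbf{x}_1,\ldots,\mathbf{x}_{n_1}\}$ and $\mathbf{Y}=\{\mathbf{y}_1,\ldots,\mathbf{y}_{n_2}\}$ are samples consisting of i.i.d. observations drawn from $p$ and $q$, respectively. $k(\cdot,\cdot)$ denotes a kernel function, e.g., $k(\mathbf{x},\mathbf{y})=\exp(-\gamma\|\mathbf{x}-\mathbf{y}\|^2)$, a Gaussian kernel.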
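As an illustration of the empirical estimate (5), the following NumPy sketch computes the unbiased squared MMD with a Gaussian kernel. The function names and the bandwidth parameter `gamma` are hypothetical choices made for this example and are not taken from the original implementation.

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    """Kernel matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def mmd2_unbiased(x, y, gamma=1.0):
    """Unbiased estimate of MMD^2 between samples x (n1, d) and y (n2, d)."""
    n1, n2 = len(x), len(y)
    kxx = gaussian_kernel(x, x, gamma)
    kyy = gaussian_kernel(y, y, gamma)
    kxy = gaussian_kernel(x, y, gamma)
    # Within-sample terms exclude the diagonal (i == j).
    term_xx = (kxx.sum() - np.trace(kxx)) / (n1 * (n1 - 1))
    term_yy = (kyy.sum() - np.trace(kyy)) / (n2 * (n2 - 1))
    return term_xx + term_yy - 2.0 * kxy.mean()
```

For two samples from the same distribution the estimate fluctuates around zero (and may be slightly negative), whereas it is clearly positive when the distributions differ.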
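Based on the above analysis, we obtain an approximation for (4) as
$$\min_{\theta,\phi}\ \mathrm{MMD}^2[\mathcal{F},\mathbf{Z}_\theta,\mathbf{Z}_T]+\frac{\lambda}{n}\sum_{i=1}^{n}\left\|\mathbf{x}_i-g_\phi(f_\theta(\mathbf{x}_i))\right\|^2 \tag{6}$$
where $\mathbf{Z}_\theta=\{\mathbf{z}_i:\mathbf{z}_i=f_\theta(\mathbf{x}_i),\ \mathbf{x}_i\in\mathbf{X}\}$ and $\mathbf{Z}_T=\{\mathbf{z}_i:\mathbf{z}_i\sim\mathcal{D}_{\mathbf{z}}\}$. The first term of the objective function in (6) makes $f_\theta$ learn the mapping from the data distribution $\mathcal{D}_{\mathbf{x}}$ to the target distribution $\mathcal{D}_{\mathbf{z}}$, and the second term ensures that $f_\theta$ preserves the main information of the observations provided that $\lambda$ is sufficiently large.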
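To make the structure of the objective concrete, the sketch below evaluates an MMD term plus a weighted reconstruction error for a toy linear encoder and decoder. It is only an illustration under stated assumptions: the linear maps, the helper names, `lam`, and `gamma` are hypothetical, and the actual method trains deep networks by gradient descent, which is not shown here.

```python
import numpy as np

def _gram(a, b, gamma):
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(a, b, gamma=1.0):
    """Unbiased squared MMD with a Gaussian kernel."""
    n1, n2 = len(a), len(b)
    kaa, kbb = _gram(a, a, gamma), _gram(b, b, gamma)
    return ((kaa.sum() - np.trace(kaa)) / (n1 * (n1 - 1))
            + (kbb.sum() - np.trace(kbb)) / (n2 * (n2 - 1))
            - 2.0 * _gram(a, b, gamma).mean())

def rgp_objective(x, encode, decode, z_target, lam=1.0, gamma=1.0):
    """MMD between projected data and target samples + lam * reconstruction."""
    z = encode(x)
    recon = np.sum((x - decode(z)) ** 2) / len(x)
    return mmd2(z, z_target, gamma) + lam * recon

# Toy example (assumption): linear encoder/decoder, target uniform on the unit circle.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
w = rng.normal(size=(8, 2)) / np.sqrt(8)
encode = lambda a: a @ w
decode = lambda z: z @ np.linalg.pinv(w)
g = rng.normal(size=(256, 2))
z_target = g / np.linalg.norm(g, axis=1, keepdims=True)
loss = rgp_objective(x, encode, decode, z_target, lam=0.1)
```

Setting `lam=0.0` leaves only the distribution-matching term, which illustrates why a positive trade-off weight is needed to keep the projection informative.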
III-B Bounded Target Distributions
Now we introduce four examples of simple and compact for (6). The four distributions are Gaussian in Hypersphere (GiHS), Uniform in Hypersphere (UiHS), Uniform between Hyperspheres (UbHS), and Uniform on Hypersphere (UoHS). Their 2-dimensional examples are visualized in Fig.2.

GiHS (Fig.2.a) is a truncated Gaussian. Suppose we want to draw $n$ samples from GiHS. A simple approach is to draw more than $n$ samples from a standard $d$-dimensional Gaussian $\mathcal{N}(\mathbf{0},\mathbf{I}_d)$ and discard the samples with the largest norms. The maximum norm of the remaining points is the radius of the hypersphere. One may also use the inverse transform method of [45]. We have the following results.
Proposition III.1.
Suppose are sampled from independently. Then for any , we have
(7) |
and
(8) |
where .
Inequality (8) means a hypersphere of radius can include all the samples with a high probability if is sufficiently large. On the other hand, according to (7), if we expect to get samples in a hypersphere of radius , we need to sample about points from . If is larger, we need to sample more points.
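The discard-by-norm procedure for GiHS can be sketched as follows. The fraction of retained samples (`keep_fraction`) is a hypothetical parameter introduced for this example; the paper's actual setting is not assumed.

```python
import numpy as np

def sample_gihs(n, d, keep_fraction=0.9, rng=None):
    """Draw extra Gaussian samples and keep the n with the smallest norms.

    Returns the kept samples and the radius (largest kept norm).
    keep_fraction is an illustrative assumption, not a value from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_draw = int(np.ceil(n / keep_fraction))
    z = rng.standard_normal((n_draw, d))
    norms = np.linalg.norm(z, axis=1)
    keep = np.argsort(norms)[:n]          # indices of the n smallest norms
    return z[keep], norms[keep].max()
```

The returned radius depends only on the sampled Gaussian points, not on the training data.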
UiHS (Fig.2.b) is a hyperball in which all the samples are distributed uniformly. To sample from UiHS, we first sample from the uniform distribution $\mathcal{U}(-r,r)^d$ on the enclosing hypercube. Then we discard all the data points outside the radius-$r$ hyperball centered at the origin. The following proposition (the proof is in the Appendix) gives probability bounds for sampling from a $d$-dimensional uniform distribution.
Proposition III.2.
Suppose are sampled from independently. Then for any , we have
(9) |
and
(10) |
Inequality (10) means a hypersphere of radius can include all the samples with probability at least . On the other hand, inequality (10) indicates that if we draw samples from , the expected number of samples falling into a hypersphere of radius is at least . Sampling from UiHS is closely related to the curse of dimensionality: we need to sample a large number of points from $\mathcal{U}(-r,r)^d$ if $d$ is large, because only a small fraction of the volume of the hypercube lies inside the hyperball. To be more precise, letting $V_{\text{cube}}$ be the volume of a hypercube with side length $2r$ and $V_{\text{ball}}$ be the volume of a hyperball with radius $r$, we have
$$\rho_d:=\frac{V_{\text{ball}}}{V_{\text{cube}}}=\frac{\pi^{d/2}r^d}{\Gamma(d/2+1)\,(2r)^d}=\frac{\pi^{d/2}}{2^d\,\Gamma(d/2+1)} \tag{11}$$
where $\Gamma$ is the gamma function. Therefore, we need to draw $n/\rho_d$ samples from $\mathcal{U}(-r,r)^d$ to ensure that the expected number of samples included in the hyperball is $n$, where $\rho_d$ is small if $d$ is large.
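The volume ratio between the hyperball and its enclosing hypercube, and the corresponding rejection sampler for UiHS, can be sketched as below. The function names and the `batch` size are hypothetical; the ratio is the standard formula $\pi^{d/2}/(2^d\,\Gamma(d/2+1))$.

```python
import math
import numpy as np

def ball_to_cube_ratio(d):
    """Volume of a radius-r ball divided by the volume of the side-2r cube."""
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))

def sample_uihs(n, d, r=1.0, rng=None, batch=4096):
    """Rejection sampling: draw from the cube, keep points inside the ball."""
    rng = np.random.default_rng() if rng is None else rng
    kept, count = [], 0
    while count < n:
        z = rng.uniform(-r, r, size=(batch, d))
        inside = z[np.linalg.norm(z, axis=1) <= r]
        kept.append(inside)
        count += len(inside)
    return np.concatenate(kept)[:n]
```

For $d=2$ the acceptance ratio is $\pi/4\approx 0.785$, while for $d=10$ it is already about $0.0025$, so plain rejection sampling becomes expensive as the latent dimension grows.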
UbHS (Fig.2.c) can be obtained via UiHS. We first sample from UiHS and then remove all samples enclosed by a smaller hypersphere. Since the volume ratio of two hyperballs with radii $r'$ and $r$ is $(r'/r)^d$, where $r'<r$, we need to draw $n/\big(1-(r'/r)^d\big)$ samples from UiHS to ensure that the expected number of samples between the two hyperspheres is $n$. Compared with GiHS and UiHS, UbHS is more compact and hence provides a larger abnormal space for abnormal data to fall in.
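A minimal sketch of sampling between two hyperspheres by rejection is given below. The inner and outer radii (`r_inner`, `r_outer`) and the `batch` size are hypothetical parameters chosen for illustration.

```python
import numpy as np

def shell_fraction(d, r_inner, r_outer):
    """Fraction of the outer ball's volume lying between the two spheres."""
    return 1.0 - (r_inner / r_outer) ** d

def sample_ubhs(n, d, r_inner=0.5, r_outer=1.0, rng=None, batch=4096):
    """Draw from the enclosing cube and keep points with r_inner <= norm <= r_outer."""
    rng = np.random.default_rng() if rng is None else rng
    kept, count = [], 0
    while count < n:
        z = rng.uniform(-r_outer, r_outer, size=(batch, d))
        norms = np.linalg.norm(z, axis=1)
        shell = z[(norms >= r_inner) & (norms <= r_outer)]
        kept.append(shell)
        count += len(shell)
    return np.concatenate(kept)[:n]
```

In high dimensions most of the ball's volume lies near its surface, so `shell_fraction` approaches one and removing the inner ball discards few samples.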
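UoHS (Fig.2.d) can be easily obtained by sampling from $\mathcal{N}(\mathbf{0},\mathbf{I}_d)$. Specifically, every $\mathbf{z}$ drawn from $\mathcal{N}(\mathbf{0},\mathbf{I}_d)$ is normalized as $\mathbf{z}\leftarrow r\,\mathbf{z}/\|\mathbf{z}\|$, where $r$ is the predefined radius of the hypersphere. UoHS is a special case of UbHS when $r'=r$.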
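Sampling uniformly on a hypersphere by normalizing Gaussian draws takes only a few lines. The function name and default radius below are illustrative.

```python
import numpy as np

def sample_uohs(n, d, r=1.0, rng=None):
    """Normalize standard Gaussian samples so that they lie on a radius-r sphere."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal((n, d))
    return r * z / np.linalg.norm(z, axis=1, keepdims=True)
```

Because the standard Gaussian is rotation invariant, the normalized directions are uniformly distributed on the sphere, and no samples are rejected regardless of the dimension.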
To quantify the compactness of the four target distributions, we define density as the number of data points per unit volume, i.e., $\rho=n/V$, where $V$ is the volume of the support. The densities of the four target distributions are reported in Table I. Since UoHS is more compact than UbHS as well as GiHS and UiHS, it should have better performance in anomaly detection. Indeed, our numerical results show that UoHS outperforms the others in most cases.
GiHS | UiHS | UbHS | UoHS | |
III-C Anomaly Scores
In the test stage, we only use the trained to calculate anomaly scores. For a given test sample , we define anomaly score for each target distribution by
(12) |
There are clear decision boundaries according to (12) and they can be regarded as ‘hard boundaries’ between normal samples and abnormal samples. However, these ‘hard boundaries’ only work in ideal cases where the projected data exactly match the target distributions. In real cases, due to the noise of data or the non-optimality of optimization, the projected data do not exactly match the target distributions. Therefore, we further propose a ‘soft boundary’ for calculating anomaly scores. Specifically, for a given test sample , we define anomaly score for all four target distributions as
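$$s(\mathbf{x}_{\text{new}})=\frac{1}{k}\sum_{i\in\mathcal{N}_k}\left\|f_\theta(\mathbf{x}_{\text{new}})-f_\theta(\mathbf{x}_i)\right\| \tag{13}$$
where $\mathbf{x}_i$ denotes a single sample with index $i$ in the training data and $\mathcal{N}_k$ denotes the index set of the $k$ nearest training (projected) samples to $f_\theta(\mathbf{x}_{\text{new}})$.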
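The soft-boundary score can be illustrated with a small nearest-neighbor computation in the latent space. The sketch assumes the score is the mean Euclidean distance to the $k$ nearest projected training samples; the exact aggregation (mean versus sum) and the value of `k` are assumptions of this example.

```python
import numpy as np

def knn_anomaly_score(z_train, z_test, k=5):
    """Mean distance from each projected test point to its k nearest
    projected training points (larger means more anomalous)."""
    diff = z_test[:, None, :] - z_train[None, :, :]
    dist = np.linalg.norm(diff, axis=2)              # shape (n_test, n_train)
    nearest = np.partition(dist, k - 1, axis=1)[:, :k]  # k smallest per row
    return nearest.mean(axis=1)
```

Here `z_train` and `z_test` stand for encoder outputs. A test point that lands far from every projected training sample receives a high score even if it happens to fall inside the support of the target distribution.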
Empirically, we found that (13) performs better than (12) in most cases. Tables II, III, and VI report only the results from (13). The comparison between (12) and (13) is provided in Section V-E.
We call our method Restricted Generative Projection (RGP), which has four variants, denoted by RGP-GiHS, RGP-UiHS, RGP-UbHS, and RGP-UoHS respectively, though any bounded target distribution applies.
IV Extensions of RGP
In this section, based on the general objective in (4), we provide two variants of RGP.
IV-A Double-MMD based RGP
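In the objective function of RGP defined by (6), the second term is the reconstruction error for $\mathbf{X}$, which is only one particular approximation of the second term in the objective function of (4), i.e., $\mathcal{M}\big(\mathcal{D}_{g_\phi(f_\theta(\mathbf{x}))},\mathcal{D}_{\mathbf{x}}\big)$. Alternatively, we can use MMD to approximate $\mathcal{M}\big(\mathcal{D}_{g_\phi(f_\theta(\mathbf{x}))},\mathcal{D}_{\mathbf{x}}\big)$, which yields the following Double-MMD RGP:
$$\min_{\theta,\phi}\ \mathrm{MMD}^2[\mathcal{F},\mathbf{Z}_\theta,\mathbf{Z}_T]+\lambda\,\mathrm{MMD}^2[\mathcal{F},g_\phi(f_\theta(\mathbf{X})),\mathbf{X}] \tag{14}$$
where $\mathbf{Z}_\theta$ denotes the projected training data and $\mathbf{Z}_T$ the samples from the target distribution, as in (6). Compared to the sum-of-squares reconstruction error used in (6), $\mathrm{MMD}^2[\mathcal{F},g_\phi(f_\theta(\mathbf{X})),\mathbf{X}]$ is a weaker approximation of $\mathcal{M}\big(\mathcal{D}_{g_\phi(f_\theta(\mathbf{x}))},\mathcal{D}_{\mathbf{x}}\big)$, because it does not exploit the fact that the samples in $\mathbf{X}$ and $g_\phi(f_\theta(\mathbf{X}))$ are paired. Thus, the projection of Double-MMD RGP cannot preserve sufficient information about $\mathbf{X}$, which reduces the detection accuracy. Indeed, as shown by the experimental results in Section V-F, our original RGP outperforms Double-MMD RGP.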
IV-B Sinkhorn Distance based RGP
Besides MMD, the optimal transport theory can also be used to construct a notion of distance between pairs of probability distributions. In particular, the Wasserstein distance [46], also known as “Earth Mover’s Distance”, has appealing theoretical properties and a very intuitive formulation
(15) |
where denotes a metric cost matrix and is the optimal transport plan. Finding the optimal transport plan might appear to be a really hard problem. Especially, the computation cost of Wasserstein distance can quickly become prohibitive when the data dimension increases. In order to speed up the calculation of Wasserstein distance, Cuturi [47] proposed Sinkhorn distance that regularizes the optimal transport problem with an entropic penalty and uses Sinkhorn’s algorithm [48] to approximately calculate Wasserstein distance.
Replacing the first term in (6) with the Sinkhorn distance [47], we obtain a new optimization objective
(16) | ||||
subject to |
where denotes the metric cost matrix between and , is the coefficient of entropic regularization term, and are two probability vectors and satisfy and respectively. We call this method Sinkhorn RGP.
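For reference, the entropy-regularized optimal transport cost can be computed with Sinkhorn's matrix-scaling iterations as sketched below. This is a generic textbook version under stated assumptions (fixed iteration count, regularization weight `eps`, uniform weights in the usage example) and is not claimed to match the exact formulation or hyper-parameters of Sinkhorn RGP.

```python
import numpy as np

def sinkhorn_cost(cost, a, b, eps=0.5, n_iter=500):
    """Entropic OT via Sinkhorn scaling.

    cost: (n1, n2) cost matrix; a, b: probability vectors of length n1, n2.
    Returns the transport cost <plan, cost> and the transport plan.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)            # match column marginals
        u = a / (K @ v)              # match row marginals
    plan = u[:, None] * K * v[None, :]
    return np.sum(plan * cost), plan

# Usage: squared Euclidean cost between two small point clouds (toy data).
rng = np.random.default_rng(0)
x = rng.uniform(size=(30, 2))
y = rng.uniform(size=(40, 2))
cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
a = np.full(30, 1 / 30)
b = np.full(40, 1 / 40)
value, plan = sinkhorn_cost(cost, a, b)
```

For very small `eps` the kernel `K` may underflow, in which case a log-domain implementation is preferable.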
V Experiments
V-A Datasets and Baselines
We compare the proposed method with several state-of-the-art methods of anomaly detection on five tabular datasets and three widely-used image datasets for one-class classification. The datasets are detailed as follows.
• Abalone [49] (http://archive.ics.uci.edu/ml/datasets/Abalone) is a dataset of physical measurements of abalone used to predict age. It contains 1,920 instances with 8 attributes.
• Arrhythmia [50] (http://odds.cs.stonybrook.edu/arrhythmia-dataset/) is an ECG dataset. It was used to identify arrhythmic samples in five classes and contains 452 instances with 279 attributes.
• Thyroid [50] (http://odds.cs.stonybrook.edu/thyroid-disease-dataset/) is a hypothyroid disease dataset that contains 3,772 instances with 6 attributes.
• KDD [51] (https://kdd.ics.uci.edu/databases/kddcup99/) is the KDDCUP99 10 percent dataset from the UCI repository and contains 34 continuous attributes and 7 categorical attributes. The attack samples are regarded as normal data, and the non-attack samples are regarded as abnormal data.
• KDDRev is derived from the KDDCUP99 10 percent dataset. The non-attack samples are regarded as normal data, and the attack samples are regarded as abnormal data.
• MNIST [52] (http://yann.lecun.com/exdb/mnist/) is a well-known dataset of handwritten digits and contains a total of 70,000 grey-scale images in 10 classes (digits 0-9).
• Fashion-MNIST [53] (https://www.kaggle.com/datasets/zalando-research/fashionmnist) contains 70,000 grey-scale fashion images (e.g., T-shirt and bag) in 10 classes.
• CIFAR-10 [54] (https://www.cs.toronto.edu/~kriz/cifar.html) is a widely-used benchmark for image anomaly detection. It contains 60,000 color images in 10 classes.
We compare our method with three classic shallow models, four deep autoencoder based methods, three deep generative model based methods, and some latest anomaly detection methods.
Table II: AUC (%) on Fashion-MNIST for each normal class.
Normal Class | T-shirt | Trouser | Pullover | Dress | Coat | Sandal | Shirt | Sneaker | Bag | Ankle-boot |
OC-SVM[5] | 86.10 | 93.90 | 85.60 | 85.90 | 84.60 | 81.30 | 78.60 | 97.60 | 79.50 | 97.80 |
IF[7] | 91.00 | 97.80 | 87.20 | 93.20 | 90.50 | 93.00 | 80.20 | 98.20 | 88.70 | 95.40 |
DAE[22] | 86.70 | 97.80 | 80.80 | 91.40 | 86.50 | 92.10 | 73.80 | 97.70 | 78.20 | 96.30 |
DAGMM[12] | 42.10 | 55.10 | 50.40 | 57.00 | 26.90 | 70.50 | 48.30 | 83.50 | 49.90 | 34.00 |
ADGAN[29] | 89.90 | 81.90 | 87.60 | 91.20 | 86.50 | 89.60 | 74.30 | 97.20 | 89.00 | 97.10 |
OCGAN[32] | 85.50 | 93.40 | 85.00 | 88.10 | 85.80 | 88.50 | 77.50 | 93.90 | 82.70 | 97.80 |
DeepSVDD[10] | 79.10 | 94.00 | 83.00 | 82.90 | 87.00 | 80.30 | 74.90 | 94.20 | 79.10 | 93.20 |
DROCC∗[33] | 88.32 | 97.94 | 87.31 | 87.89 | 86.53 | 91.80 | 77.64 | 95.37 | 81.35 | 94.75 |
HRN[60] | 92.70 | 98.50 | 88.50 | 93.10 | 92.10 | 91.30 | 79.80 | 99.00 | 94.60 | 98.80 |
PLAD[61] | 93.10 | 98.60 | 90.20 | 93.70 | 92.80 | 96.00 | 82.00 | 98.60 | 90.90 | 99.10 |
RGP-GiHS (Ours) | 92.79 (0.40) | 98.10 (0.27) | 90.45 (1.28) | 94.30 (0.57) | 91.71 (0.30) | 96.09 (0.67) | 85.91 (0.39) | 98.58 (0.08) | 92.67 (1.10) | 97.11 (0.23) |
RGP-UiHS (Ours) | 92.48 (0.78) | 98.31 (0.19) | 89.81 (1.19) | 94.81 (0.74) | 89.30 (1.95) | 95.75 (0.24) | 85.95 (0.59) | 98.54 (0.08) | 92.25 (0.79) | 94.00 (1.10) |
RGP-UbHS (Ours) | 92.83 (0.68) | 97.88 (0.61) | 90.19 (1.02) | 94.87 (0.34) | 91.97 (0.78) | 96.32 (0.18) | 85.76 (0.48) | 98.67 (0.13) | 91.32 (1.05) | 94.93 (1.00) |
RGP-UoHS (Ours) | 94.85 (0.18) | 98.94 (0.09) | 92.39 (0.24) | 95.71 (0.33) | 93.12 (0.39) | 94.71 (0.65) | 86.98 (0.33) | 99.16 (0.11) | 94.16 (0.25) | 97.45 (0.71) |
Table III: AUC (%) on CIFAR-10 for each normal class.
Normal Class | Airplane | Automobile | Bird | Cat | Deer | Dog | Frog | Horse | Ship | Truck |
OC-SVM[5] | 61.10 | 63.80 | 50.00 | 55.90 | 66.00 | 62.40 | 74.70 | 62.60 | 74.90 | 75.90 |
IF[7] | 66.10 | 43.70 | 64.30 | 50.50 | 74.30 | 52.30 | 70.70 | 53.00 | 69.10 | 53.20 |
DCAE[56] | 59.10 | 57.40 | 48.90 | 58.40 | 54.00 | 62.20 | 51.20 | 58.60 | 76.80 | 67.30 |
DAE[22] | 41.10 | 47.80 | 61.60 | 56.20 | 72.80 | 51.30 | 68.80 | 49.70 | 48.70 | 37.80 |
DAGMM[12] | 41.40 | 57.10 | 53.80 | 51.20 | 52.20 | 49.30 | 64.90 | 55.30 | 51.90 | 54.20 |
AnoGAN[58] | 67.10 | 54.70 | 52.90 | 54.50 | 65.10 | 60.30 | 58.50 | 62.50 | 75.80 | 66.50 |
ADGAN[29] | 63.20 | 52.90 | 58.00 | 60.60 | 60.70 | 65.90 | 61.10 | 63.00 | 74.40 | 64.20 |
OCGAN[32] | 75.70 | 53.10 | 64.00 | 62.00 | 72.30 | 62.00 | 72.30 | 57.50 | 82.00 | 55.40 |
DeepSVDD[10] | 61.70 | 65.90 | 50.80 | 59.10 | 60.90 | 65.70 | 67.70 | 67.30 | 75.90 | 73.10 |
DROCC[33] | 80.10 | 73.41 | 68.78 | 63.36 | 70.81 | 65.01 | 68.83 | 71.13 | 63.81 | 75.49 |
HRN[60] | 77.30 | 69.90 | 60.60 | 64.40 | 71.50 | 67.40 | 77.40 | 64.90 | 82.50 | 77.30 |
[62] | 66.00 | 70.50 | 52.40 | 60.10 | 60.90 | 68.40 | 67.10 | 68.50 | 79.20 | 75.80 |
[62] | 62.60 | 74.60 | 57.50 | 57.80 | 61.50 | 66.30 | 67.40 | 72.10 | 79.10 | 77.30 |
RGP-GiHS (Ours) | 77.01 (0.61) | 68.56 (0.34) | 62.57 (0.82) | 63.06 (0.29) | 70.72 (1.28) | 68.78 (0.76) | 80.51 (0.95) | 67.92 (0.61) | 80.50 (1.13) | 73.06 (1.30) |
RGP-UiHS (Ours) | 76.07 (1.92) | 70.66 (0.23) | 67.20 (0.34) | 64.72 (2.67) | 70.38 (0.51) | 67.63 (0.39) | 80.25 (0.94) | 69.44 (0.82) | 81.19 (0.96) | 74.89 (0.24) |
RGP-UbHS (Ours) | 77.66 (0.37) | 68.76 (1.23) | 65.29 (0.32) | 64.40 (1.56) | 69.89 (1.16) | 68.00 (0.95) | 80.75 (0.18) | 68.79 (0.75) | 82.17 (0.56) | 73.87 (0.81) |
RGP-UoHS (Ours) | 78.09 (0.98) | 67.71 (0.64) | 61.07 (0.95) | 66.48 (0.30) | 69.70 (0.22) | 68.37 (0.66) | 80.14 (0.66) | 70.90 (0.37) | 83.27 (0.28) | 74.10 (0.46) |
V-B Implementation Details and Evaluation Metrics
In this section, we introduce the implementation details of the proposed method RGP and describe experimental settings for image and tabular datasets. Note that our method neither uses any abnormal data during the training process nor utilizes any pre-trained feature extractors.
For the five tabular datasets (Abalone, Arrhythmia, Thyroid, KDD, KDDRev), the encoder and decoder in our method are both MLPs. We follow the dataset preparation of [12] to preprocess the tabular datasets for the one-class classification task. The hyper-parameter $\lambda$ is set to 1.0 for Abalone, Arrhythmia, and Thyroid. For KDD and KDDRev, $\lambda$ is set to 0.0001.
For the three image datasets (MNIST, Fashion-MNIST, CIFAR-10), the encoder and decoder in our method are both CNNs. Since each of the three image datasets contains 10 different classes, we conduct 10 independent one-class classification tasks on each dataset: one class is regarded as normal data and the remaining nine classes are regarded as abnormal data. In each task on MNIST, there are about 6,000 training samples and 10,000 testing samples. In each task on CIFAR-10, there are 5,000 training samples and 10,000 testing samples. In each task on Fashion-MNIST, there are 6,000 training samples and 10,000 testing samples. The hyper-parameter is chosen from and varies for different classes.
In our method, regarding the radius of GiHS and UiHS, we first generate a large number (denoted by ) of samples from Gaussian or uniform, sort the samples according to their norms, and set to be the -th smallest norm, where . For UbHS, we need to use the aforementioned method to determine an with and a with . Note that are not related to the actual data; they are determined purely by the target distribution. In each iteration (mini-batch) of the optimization for all four target distributions, we resample according to . For UoHS, we draw samples from a Gaussian and normalize them to have unit norm, so that they lie uniformly on a unit hypersphere. The procedure is repeated in each iteration (mini-batch) of the optimization. For hyper-parameter on the testing stage, we select for Thyroid, Arrhythmia, KDD, KDDRev, and select for Abalone dataset. For three image datasets, the hyper-parameter is chosen from and varies for different classes. We use Adam [63] as the optimizer in our method. For MNIST, Fashion-MNIST, CIFAR-10, Arrhythmia, and KDD, the learning rate is set to 0.0001. For Abalone, Thyroid, and KDDRev, the learning rate is set to 0.001. Table IV shows the detailed implementation settings of RGP on all datasets. All experiments were run on an AMD EPYC CPU with 64 cores and an NVIDIA Tesla A100 GPU with CUDA 11.6.
Table IV: Implementation settings of RGP on all datasets.
Datasets | features | latent dimension | learning rate |
Thyroid | 6 | 4 | 0.001 |
Abalone | 8 | 4 | 0.001 |
KDD | 121 | 64 | 0.0001 |
KDDRev | 121 | 64 | 0.001 |
Arrhythmia | 279 | 128 | 0.0001 |
MNIST | 28×28×1 | 128 | 0.0001 |
Fashion-MNIST | 28×28×1 | 128 | 0.0001 |
CIFAR-10 | 32×32×3 | 128 | 0.0001 |
To evaluate the performance of all methods, we follow the previous works such as [10] and [12] to use AUC (Area Under the ROC curve) for image datasets and F1-score for tabular datasets. Note that when conducting experiments on the tabular datasets, we found that most of the strong baselines, like DROCC [33], NeuTral AD [14], GOCC [26], used the F1-score and we just followed this convention. In our method, we get the threshold via simply calculating the dispersion of training data in latent space. Specifically, we first calculated the scores on training data using (12) or (13), and then sorted in ascending order and set the threshold to be the -th smallest score, where is a probability varying for different datasets.
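The threshold selection and F1 evaluation described above can be sketched as follows. The quantile level `p` and the convention that label 1 marks an anomaly are assumptions made for this example; the paper's per-dataset values are not reproduced here.

```python
import numpy as np

def threshold_from_training_scores(train_scores, p=0.95):
    """Return the ceil(p * n)-th smallest training score as the threshold."""
    s = np.sort(np.asarray(train_scores))
    idx = min(len(s) - 1, max(0, int(np.ceil(p * len(s))) - 1))
    return s[idx]

def f1_score_binary(y_true, y_pred):
    """F1-score with label 1 (anomaly) as the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A test sample would then be flagged as abnormal when its anomaly score exceeds the threshold, e.g. `y_pred = (test_scores > threshold).astype(int)`.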
V-C Results on Image Datasets
Tables II and III show the comparison results on Fashion-MNIST and CIFAR-10 respectively. We have the following observations.
-
•
Firstly, in contrast to classic shallow methods such as OC-SVM [5] and IF [7], our RGP has significantly higher AUC scores on all classes of Fashion-MNIST and most classes of CIFAR-10. An interesting phenomenon is that most deep learning based methods have inferior performance compared to IF [7] on class ‘Sandal’ of Fashion-MNIST and IF [7] outperforms all deep learning based methods including ours on class ‘Deer’ of CIFAR-10.
-
•
Our methods outperformed the deep autoencoder based methods and generative model based methods in most cases and have competitive performance compared to the state-of-the-art in all cases.
-
•
RGP has superior performance on most classes of Fashion-MNIST and CIFAR-10 under the setting of UoHS (uniform distribution on hypersphere).
Table V: Average AUC (%) over all 10 classes on MNIST, Fashion-MNIST, and CIFAR-10.
Methods | MNIST | Fashion-MNIST | CIFAR-10 |
OC-SVM | 91.28 | 87.09 | 64.72 |
IF | 92.29 | 91.52 | 59.72 |
DAE | - | 88.13 | 53.57 |
DAGMM | - | 51.77 | 53.13 |
AnoGAN | 91.27 | - | 61.79 |
Deep SVDD | 94.79 | 84.77 | 64.81 |
DROCC | - | 88.89 | 70.07 |
HRN | 97.59 | 92.84 | 71.32 |
[62] | - | - | 66.90 |
[62] | - | - | 67.60 |
RGP-GiHS | 93.75 | 93.77 | 71.26 |
RGP-UiHS | 94.02 | 93.12 | 72.24 |
RGP-UbHS | 93.60 | 93.47 | 71.97 |
RGP-UoHS | 95.81 | 94.74 | 71.98 |
Table V shows the average performance on MNIST, Fashion-MNIST, and CIFAR-10 over all 10 classes to provide an overall comparison. We see that RGP achieves the best average AUC on Fashion-MNIST and CIFAR-10 among all compared methods. The four variants of RGP have relatively close average performance on all three image datasets. The per-class results on MNIST are reported in the Appendix.
Table VI: F1-scores (%) on the five tabular datasets.
Methods | Abalone | Arrhythmia | Thyroid | KDD | KDDRev |
OC-SVM [5] | 48.00 ± 0.00 | 46.00 ± 0.00 | 39.00 ± 1.00 | 79.50 | 83.20 |
LOF [55] | 33.00 ± 1.00 | 51.00 ± 1.00 | 54.00 ± 1.00 | 83.80 | 90.60 |
DCN [57] | 40.00 ± 1.00 | 38.00 ± 3.00 | 33.00 ± 3.00 | - | - |
E2E-AE [12] | 33.00 ± 3.00 | 45.00 ± 3.00 | 13.00 ± 4.00 | - | - |
DAGMM [12] | 20.00 ± 3.00 | 49.00 ± 3.00 | 49.00 ± 4.00 | 93.70 | 93.80 |
DeepSVDD [10] | 62.00 ± 1.00 | 54.00 ± 1.00 | 73.00 ± 0.00 | 99.00 ± 0.10 | 98.60 ± 0.20 |
GoAD [59] | 61.00 ± 2.00 | 51.00 ± 2.00 | 72.00 ± 1.00 | 98.40 ± 0.20 | 98.90 ± 0.30 |
DROCC [33] | 68.00 ± 2.00 | 69.00 ± 2.00 | 78.00 ± 3.00 | - | - |
NeuTral AD∗ [14] | 62.07 ± 2.81 | 60.30 ± 1.10 | 76.80 ± 1.90 | 99.30 ± 0.10 | 99.10 ± 0.10 |
GOCC [26] | - | 61.80 ± 1.80 | 76.80 ± 1.20 | 99.40 ± 0.10 | 99.20 ± 0.30 |
RGP-GiHS (Ours) | 91.25 ± 1.92 | 81.22 ± 0.50 | 97.58 ± 0.48 | 99.29 ± 0.10 | 98.99 ± 0.02 |
RGP-UiHS (Ours) | 90.38 ± 1.87 | 81.02 ± 0.81 | 97.09 ± 0.27 | 99.28 ± 0.19 | 98.96 ± 0.07 |
RGP-UbHS (Ours) | 90.20 ± 2.32 | 81.00 ± 0.67 | 97.17 ± 0.55 | 99.13 ± 0.31 | 98.99 ± 0.03 |
RGP-UoHS (Ours) | 89.59 ± 1.52 | 80.97 ± 0.62 | 97.38 ± 0.36 | 99.43 ± 0.01 | 99.07 ± 0.03 |
V-D Results on Tabular Datasets
In Table VI, we report the F1-scores of our methods in comparison to ten baselines on the five tabular datasets. Our four variants of RGP significantly outperform all baseline methods on Arrhythmia, Thyroid, and Abalone. Particularly, RGP-GiHS improves the F1-score by 12.22, 19.58, and 23.25 percentage points on the three datasets, respectively, compared to the runner-up (DROCC [33] in Table VI). It is worth mentioning that NeuTral AD [14] and GOCC [26] are both specially designed for non-image data but are outperformed by our methods in most cases. Compared with the image datasets, the performance improvements of RGP on the three tabular datasets are more significant. One possible reason is that, compared to image data, it is easier to convert tabular data to a compact target distribution. Furthermore, we also report the AUC scores on the Abalone, Thyroid, and Arrhythmia datasets; the results are provided in the Appendix.
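The margins over the runner-up can be rechecked directly from the mean F1-scores in Table VI. The short script below is only an arithmetic sketch: the dictionary and function names are illustrative, the standard deviations are ignored, and the missing GOCC entry on Abalone is represented by None.

```python
# Mean F1-scores (%) transcribed from Table VI; None marks a missing entry.
DATASETS = ("Abalone", "Arrhythmia", "Thyroid")
BASELINES = {
    "OC-SVM": (48.00, 46.00, 39.00),
    "LOF": (33.00, 51.00, 54.00),
    "DCN": (40.00, 38.00, 33.00),
    "E2E-AE": (33.00, 45.00, 13.00),
    "DAGMM": (20.00, 49.00, 49.00),
    "DeepSVDD": (62.00, 54.00, 73.00),
    "GoAD": (61.00, 51.00, 72.00),
    "DROCC": (68.00, 69.00, 78.00),
    "NeuTral AD": (62.07, 60.30, 76.80),
    "GOCC": (None, 61.80, 76.80),
}
RGP_GIHS = (91.25, 81.22, 97.58)


def margin_over_best_baseline(col):
    """Return (best baseline name, RGP-GiHS gain in percentage points)."""
    scored = [
        (vals[col], name)
        for name, vals in BASELINES.items()
        if vals[col] is not None
    ]
    best_score, best_name = max(scored)
    return best_name, round(RGP_GIHS[col] - best_score, 2)


margins = {name: margin_over_best_baseline(i) for i, name in enumerate(DATASETS)}
for dataset, (runner_up, gain) in margins.items():
    print(f"{dataset}: +{gain:.2f} points over {runner_up}")
```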
In addition to the quantitative results, we choose Thyroid (with 6 attributes) as an example and transform the data distribution to 2-dimensional target distributions, which are visualized in Figure 3. Plots (a), (b), (c), and (d) in Figure 3 refer to GiHS, UiHS, UbHS, and UoHS, respectively. The blue, orange, green, and red points denote samples from the target distribution, samples from the training data, normal samples from the test set, and abnormal samples from the test set, respectively. For clarity, the left figure in each plot of Figure 3 shows all four kinds of instances, and the right figure shows only the normal and abnormal samples from the test set. We see that RGP is effective at transforming the data distribution to the restricted target distributions, though the transformed data do not exactly match the target distributions (which also demonstrates the necessity of using the ‘soft boundary’ defined by (13)).
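To make the four target distributions concrete, the following is a minimal NumPy sketch of how samples such as the blue points in Figure 3 could be drawn. It is not the authors' sampling code: the function names, the radii, and the use of rejection sampling for the truncated Gaussian are assumptions made for illustration only.

```python
import numpy as np


def sample_on_sphere(n, d, r=1.0, rng=None):
    """UoHS: uniform on the hypersphere of radius r (normalized Gaussian draws)."""
    rng = np.random.default_rng() if rng is None else rng
    g = rng.standard_normal((n, d))
    return r * g / np.linalg.norm(g, axis=1, keepdims=True)


def sample_in_ball(n, d, r=1.0, rng=None):
    """UiHS: uniform inside the hypersphere; radius follows r * U^(1/d)."""
    rng = np.random.default_rng() if rng is None else rng
    directions = sample_on_sphere(n, d, 1.0, rng)
    radii = r * rng.random((n, 1)) ** (1.0 / d)
    return directions * radii


def sample_between_spheres(n, d, r_in=0.5, r_out=1.0, rng=None):
    """UbHS: uniform in the shell r_in <= ||z|| <= r_out (inverse-CDF radius)."""
    rng = np.random.default_rng() if rng is None else rng
    directions = sample_on_sphere(n, d, 1.0, rng)
    u = rng.random((n, 1))
    radii = (r_in**d + u * (r_out**d - r_in**d)) ** (1.0 / d)
    return directions * radii


def sample_truncated_gaussian(n, d, r=1.0, rng=None):
    """GiHS (assumed form): standard Gaussian restricted to ||z|| <= r,
    drawn here by simple rejection sampling."""
    rng = np.random.default_rng() if rng is None else rng
    kept, count = [], 0
    while count < n:
        g = rng.standard_normal((2 * n, d))
        accepted = g[np.linalg.norm(g, axis=1) <= r]
        kept.append(accepted)
        count += len(accepted)
    return np.concatenate(kept)[:n]


rng = np.random.default_rng(0)
targets = {
    "GiHS": sample_truncated_gaussian(500, 2, 1.0, rng),
    "UiHS": sample_in_ball(500, 2, 1.0, rng),
    "UbHS": sample_between_spheres(500, 2, 0.5, 1.0, rng),
    "UoHS": sample_on_sphere(500, 2, 1.0, rng),
}
```

Rejection sampling is convenient in two dimensions but becomes slow when the radius is small relative to the dimension; a dedicated truncated sampler would be preferable in that regime.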

V-E Comparison between ‘soft’ and ‘hard’ boundary
We further explore the performance of two different anomaly scores. Specifically, we compare the ‘hard boundaries’ (12) and the ‘soft boundary’ (13) as anomaly scores during the test stage on image datasets and tabular datasets. The results are shown in Figures 4, 5, and 6. It can be observed that using the ‘soft boundary’ (13) to calculate the anomaly score performs better than using the ‘hard boundaries’ (12) on most classes of the image and tabular datasets. Nevertheless, using the ‘hard boundaries’ to calculate anomaly scores still achieves remarkable performance on some classes. For example, on the class ‘Ankle-boot’ of Fashion-MNIST and the class ‘Truck’ of CIFAR-10, the best two results are both from RGP using the ‘hard boundaries’ (12) to calculate the anomaly score.



V-F Experiments of Double-MMD RGP and Sinkhorn RGP
We use Double-MMD RGP (14) to conduct experiments, and the results are reported in Tables VII and VIII. On image datasets, we only consider the target distribution UoHS (Uniform on HyperSphere) for simplicity. On tabular datasets, we conduct experiments on all four proposed target distributions.
From the experimental results in Tables VII and VIII, we found that Double-MMD RGP and the original RGP have similar performance on the three tabular datasets, whereas on the image datasets Fashion-MNIST and CIFAR-10 there is an apparent performance gap, even though the trade-off coefficient of Double-MMD RGP (14) was tuned over a wide range. Note that Table VII reports the average AUC (%) over all classes of Fashion-MNIST and CIFAR-10; the per-class results are provided in the Appendix.
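Method | Coefficient in (14) | Fashion-MNIST | CIFAR-10 |
Double-MMD RGP (14) | 10.0 | 80.34 | 65.45 |
Double-MMD RGP (14) | 5.0 | 77.23 | 66.34 |
Double-MMD RGP (14) | 1.0 | 79.95 | 66.60 |
Double-MMD RGP (14) | 0.5 | 79.68 | 66.10 |
Double-MMD RGP (14) | 0.1 | 79.08 | 69.08 |
Double-MMD RGP (14) | 0.01 | 77.47 | 67.19 |
Original RGP (6) | - | 94.74 | 71.98 |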
We attribute this phenomenon to two factors. First, the tabular datasets in our implementation have fewer features (no more than 279) than the image datasets, and the second term of (14) is a much weaker constraint for preserving data information than that of (6). As a consequence, Double-MMD RGP (14) is able to preserve enough key information on the tabular data but loses much more important information on the image data than the original RGP (6). Second, the generalization error of MMD for high-dimensional samples or distributions is often larger than that for low-dimensional ones. To ensure that MMD is able to accurately measure the distance between two high-dimensional distributions, the sample sizes should be sufficiently large.
We use Sinkhorn RGP (16) to conduct experiments on the Abalone, Arrhythmia, and Thyroid datasets, and the results are reported in Table VIII. In all implementations, is set to and the a, b are uniform. In keeping with our expectation, the performance of Sinkhorn RGP (16) is similar to or better than that of the original RGP (6) for all four target distributions, whereas the time cost of Sinkhorn RGP (16) is much higher. We do not experiment with Sinkhorn RGP on the image datasets since the time cost is too high.
Method | Variant | Abalone | Arrhythmia | Thyroid |
Original RGP | RGP-GiHS | 93.65 | 82.79 | 98.95 |
Original RGP | RGP-UiHS | 95.64 | 82.90 | 99.06 |
Original RGP | RGP-UbHS | 94.93 | 82.70 | 98.92 |
Original RGP | RGP-UoHS | 94.95 | 82.89 | 98.93 |
Sinkhorn RGP | RGP-GiHS | 95.19 | 81.51 | 98.94 |
Sinkhorn RGP | RGP-UiHS | 94.72 | 82.37 | 98.85 |
Sinkhorn RGP | RGP-UbHS | 95.41 | 83.31 | 98.97 |
Sinkhorn RGP | RGP-UoHS | 95.17 | 83.20 | 98.99 |
Double-MMD RGP | RGP-GiHS | 94.91 | 82.26 | 98.53 |
Double-MMD RGP | RGP-UiHS | 94.83 | 82.19 | 98.69 |
Double-MMD RGP | RGP-UbHS | 93.88 | 82.28 | 98.73 |
Double-MMD RGP | RGP-UoHS | 92.60 | 80.73 | 98.89 |
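To illustrate why the Sinkhorn variant is more expensive, the following NumPy sketch runs the standard Sinkhorn iterations [47] between two sample sets with uniform weights a and b. It is not the objective (16) itself: the squared Euclidean cost, the relative regularization `reg`, the iteration count, and the function name are assumptions chosen for illustration.

```python
import numpy as np


def sinkhorn_cost(x, y, reg=0.05, n_iters=200):
    """Entropy-regularized transport cost between samples x and y.

    Uses uniform marginals a and b. `reg` is expressed relative to the
    largest pairwise cost so that the kernel matrix does not underflow
    (an illustrative choice, not a setting from the paper).
    """
    n, m = len(x), len(y)
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    # Pairwise squared Euclidean cost matrix, shape (n, m).
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-cost / (reg * cost.max()))
    u = np.ones(n)
    for _ in range(n_iters):
        # Alternately rescale columns and rows to match the marginals.
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum()), plan


rng = np.random.default_rng(0)
z = rng.standard_normal((50, 2))          # stand-in for transformed data
cost_same, plan_same = sinkhorn_cost(z, z)
cost_shift, _ = sinkhorn_cost(z, z + 5.0)  # shifted copy is farther away
```

Each iteration costs two matrix-vector products with an n-by-m kernel matrix, and the loop is repeated for every mini-batch, which is consistent with the higher time cost reported for Sinkhorn RGP.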
V-G Ablation Study
V-G1 The Gaussian Kernel Function for MMD
We use the Gaussian kernel for MMD in optimization objective and set in all experiments, where denotes the mean Euclidean distance among all training samples.
Normal Class | T-shirt | Trouser | Pullover | Dress | Coat | Sandal | Shirt | Sneaker | Bag | Ankle- boot | Avg | |
90.24 | 96.68 | 88.33 | 93.20 | 90.42 | 97.09 | 86.06 | 97.32 | 88.44 | 93.83 | 92.16 | ||
GiHS | 90.73 | 98.22 | 89.08 | 92.90 | 88.12 | 94.70 | 87.15 | 98.24 | 90.24 | 98.40 | 92.77 | |
89.43 | 99.01 | 85.96 | 93.54 | 87.92 | 94.90 | 83.30 | 97.71 | 91.84 | 92.79 | 91.64 | ||
92.84 | 98.26 | 84.80 | 95.50 | 86.69 | 95.16 | 86.36 | 98.75 | 86.78 | 95.83 | 92.09 | ||
88.14 | 98.25 | 88.55 | 93.86 | 91.79 | 94.93 | 87.40 | 97.46 | 86.14 | 91.46 | 91.79 | ||
UiHS | 90.49 | 98.48 | 90.05 | 92.77 | 92.57 | 95.07 | 85.11 | 98.17 | 88.23 | 94.60 | 92.55 | |
88.62 | 98.50 | 88.77 | 94.08 | 86.29 | 93.97 | 87.27 | 98.36 | 94.70 | 90.53 | 92.10 | ||
88.62 | 98.50 | 88.77 | 94.08 | 86.29 | 93.97 | 87.27 | 98.36 | 94.70 | 90.53 | 92.10 |
To show the influence of the Gaussian kernel parameter, we fix it to several different values and run experiments on Fashion-MNIST. As shown in Table IX, the results differ on individual classes, but the gaps in the average results are not significant. This demonstrates that our methods are not sensitive to the kernel parameter.
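As a rough illustration of the quantity being tuned here, the sketch below computes a (biased) empirical MMD² with a Gaussian kernel and sweeps the bandwidth around the mean pairwise Euclidean distance of the samples. The kernel parameterization exp(-||x-y||²/(2σ²)), the multipliers in the sweep, and all function names are assumptions; the exact setting used in the experiments is not reproduced here.

```python
import numpy as np


def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)


def mean_pairwise_distance(x):
    """Mean Euclidean distance over all distinct pairs of rows in x."""
    n = len(x)
    d = np.sqrt(pairwise_sq_dists(x, x))
    return d.sum() / (n * (n - 1))  # the diagonal contributes zero


def gaussian_mmd2(x, y, sigma):
    """Biased (V-statistic) estimate of MMD^2 with a Gaussian kernel."""
    gamma = 1.0 / (2.0 * sigma**2)
    k_xx = np.exp(-gamma * pairwise_sq_dists(x, x))
    k_yy = np.exp(-gamma * pairwise_sq_dists(y, y))
    k_xy = np.exp(-gamma * pairwise_sq_dists(x, y))
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


rng = np.random.default_rng(0)
x = rng.standard_normal((200, 3))
y = rng.standard_normal((200, 3))        # same distribution as x
z = rng.standard_normal((200, 3)) + 2.0  # shifted distribution

base = mean_pairwise_distance(x)
for scale in (0.1, 1.0, 10.0):  # hypothetical multipliers for the sweep
    sigma = scale * base
    print(scale, gaussian_mmd2(x, y, sigma), gaussian_mmd2(x, z, sigma))
```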
V-G2 The Coefficient of Reconstruction Term in Optimization Objective
The coefficient of the reconstruction term is a key hyperparameter in problem (6). We now explore its influence on model performance. Figures 7 and 8 show the F1-scores of our methods on the tabular datasets with the coefficient varying from 0 to 1000. It can be observed that a coefficient that is too small or too large lowers the performance of RGP. When the coefficient is very small, the reconstruction term of (6) has little impact on the training objective, and the model can easily transform the training data to the target distribution but ignores the original data distribution (see Figure 9). On the other hand, when the coefficient is very large, the MMD term becomes negligible in the overall objective, and the model, under the constraint of the reconstruction term, concentrates on the original data distribution and cannot learn a good mapping from the data distribution to the target distribution. Figure 9 illustrates the influence of this hyperparameter on the training set of the Thyroid dataset. We see that the training data are transformed to the target distribution better as the coefficient decreases. The blue points and orange points in Figure 9 denote samples from the target distribution and samples from the training data, respectively.
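The trade-off described above can be seen with simple arithmetic on an objective of the form "MMD term plus coefficient times reconstruction term". The numbers below are hypothetical loss values chosen only to illustrate how the balance shifts; they are not measurements from our experiments, and the function name is illustrative.

```python
def term_shares(mmd_value, recon_value, coeff):
    """Fraction of the total objective contributed by each term,
    for an objective of the form mmd_value + coeff * recon_value."""
    total = mmd_value + coeff * recon_value
    if total == 0:
        return 0.0, 0.0
    return mmd_value / total, coeff * recon_value / total


# Hypothetical loss magnitudes for one mini-batch.
mmd_value, recon_value = 0.05, 0.02
shares = {c: term_shares(mmd_value, recon_value, c) for c in (0, 0.1, 1, 10, 1000)}
for coeff, (mmd_share, recon_share) in shares.items():
    print(f"coeff={coeff}: MMD share={mmd_share:.3f}, reconstruction share={recon_share:.3f}")
```

With a coefficient of 0 the objective is driven entirely by the distribution-matching term, while at 1000 the reconstruction term accounts for almost the whole objective, which matches the two failure modes discussed above.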



VI Conclusion
We have presented a novel and simple framework for one-class classification and anomaly detection. Our method aims to convert the data distribution to a simple, compact, and informative target distribution that can be easily violated by abnormal data. We presented four target distributions, and the numerical results showed that the four target distributions have relatively close performance and that uniform on hypersphere is more effective than the other distributions in most cases. Furthermore, we explored two extensions of the original RGP and analyzed the performance differences among them. Importantly, our methods are competitive with state-of-the-art AD methods on all benchmark datasets considered in this paper, and the improvements are remarkable on the tabular datasets.
References
- [1] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Computing Surveys (CSUR), vol. 41, no. 3, pp. 1–58, 2009.
- [2] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel, “Deep learning for anomaly detection: A review,” ACM Computing Surveys (CSUR), vol. 54, no. 2, pp. 1–38, 2021.
- [3] L. Ruff, J. R. Kauffmann, R. A. Vandermeulen, G. Montavon, W. Samek, M. Kloft, T. G. Dietterich, and K.-R. Müller, “A unifying review of deep and shallow anomaly detection,” Proceedings of the IEEE, 2021.
- [4] B. Schölkopf, R. C. Williamson, A. Smola, J. Shawe-Taylor, and J. Platt, “Support vector method for novelty detection,” Advances in Neural Information Processing Systems, vol. 12, 1999.
- [5] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, “Estimating the support of a high-dimensional distribution,” Neural Computation, vol. 13, no. 7, pp. 1443–1471, 2001.
- [6] D. M. Tax and R. P. Duin, “Support vector data description,” Machine Learning, vol. 54, no. 1, pp. 45–66, 2004.
- [7] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in Proceedings of the IEEE International Conference on Data Mining. IEEE, 2008, pp. 413–422.
- [8] W. Hu, J. Gao, B. Li, O. Wu, J. Du, and S. Maybank, “Anomaly detection using local kernel density estimation and context-based regression,” IEEE Transactions on Knowledge and Data Engineering, vol. 32, no. 2, pp. 218–233, 2018.
- [9] Y. Liu, S. Pan, Y. G. Wang, F. Xiong, L. Wang, Q. Chen, and V. C. Lee, “Anomaly detection in dynamic graphs via transformer,” IEEE Transactions on Knowledge and Data Engineering, 2021.
- [10] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft, “Deep one-class classification,” in Proceedings of the International Conference on Machine Learning. PMLR, 2018, pp. 4393–4402.
- [11] I. Golan and R. El-Yaniv, “Deep anomaly detection using geometric transformations,” Advances in neural information processing systems, vol. 31, 2018.
- [12] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen, “Deep autoencoding gaussian mixture model for unsupervised anomaly detection,” in Proceedings of the International Conference on Learning Representations, 2018.
- [13] J. Wang, S. Sun, and Y. Yu, “Multivariate triangular quantile maps for novelty detection,” Advances in Neural Information Processing Systems, vol. 32, pp. 5060–5071, 2019.
- [14] C. Qiu, T. Pfrommer, M. Kloft, S. Mandt, and M. Rudolph, “Neural transformation learning for deep anomaly detection beyond images,” in Proceedings of the International Conference on Machine Learning. PMLR, 2021, pp. 8703–8714.
- [15] L. Huang, Y. Zhu, Y. Gao, T. Liu, C. Chang, C. Liu, Y. Tang, and C.-D. Wang, “Hybrid-order anomaly detection on attributed networks,” IEEE Transactions on Knowledge and Data Engineering, 2021.
- [16] D. Hendrycks, M. Mazeika, and T. Dietterich, “Deep anomaly detection with outlier exposure,” arXiv preprint arXiv:1812.04606, 2018.
- [17] L. Ruff, R. A. Vandermeulen, N. Görnitz, A. Binder, E. Müller, K.-R. Müller, and M. Kloft, “Deep semi-supervised anomaly detection,” in Proceedings of the International Conference on Learning Representations, 2020.
- [18] P. Liznerski, L. Ruff, R. A. Vandermeulen, B. J. Franks, M. Kloft, and K.-R. Müller, “Explainable deep one-class classification,” arXiv preprint arXiv:2007.01760, 2021.
- [19] L. Ruff, R. A. Vandermeulen, B. J. Franks, K.-R. Müller, and M. Kloft, “Rethinking assumptions in deep anomaly detection,” arXiv preprint arXiv:2006.00339, 2021.
- [20] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
- [21] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
- [22] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1096–1103.
- [23] S. Wang, X. Wang, L. Zhang, and Y. Zhong, “Auto-ad: Autonomous hyperspectral anomaly detection network based on fully convolutional autoencoder,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2021.
- [24] P. Perera and V. M. Patel, “Learning deep features for one-class classification,” IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5450–5463, 2019.
- [25] A. Bhattacharya, S. Varambally, A. Bagchi, and S. Bedathur, “Fast one-class classification using class boundary-preserving random projections,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 66–74.
- [26] T. Shenkar and L. Wolf, “Anomaly detection for tabular data with internal contrastive learning,” in Proceedings of the International Conference on Learning Representations, 2022.
- [27] Y. Chen, Y. Tian, G. Pang, and G. Carneiro, “Deep one-class classification via interpolated gaussian descriptor,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
- [28] P. Malhotra, A. Ramakrishnan, G. Anand, L. Vig, P. Agarwal, and G. Shroff, “Lstm-based encoder-decoder for multi-sensor anomaly detection,” arXiv preprint arXiv:1607.00148, 2016.
- [29] L. Deecke, R. Vandermeulen, L. Ruff, S. Mandt, and M. Kloft, “Image anomaly detection with generative adversarial networks,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2018, pp. 3–17.
- [30] S. Pidhorskyi, R. Almohsen, and G. Doretto, “Generative probabilistic novelty detection with adversarial autoencoders,” Advances in Neural Information Processing Systems, vol. 31, 2018.
- [31] D. T. Nguyen, Z. Lou, M. Klar, and T. Brox, “Anomaly detection with multiple-hypotheses predictions,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 97. PMLR, 09–15 Jun 2019, pp. 4800–4809.
- [32] P. Perera, R. Nallapati, and B. Xiang, “Ocgan: One-class novelty detection using gans with constrained latent representations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2898–2906.
- [33] S. Goyal, A. Raghunathan, M. Jain, H. V. Simhadri, and P. Jain, “Drocc: Deep robust one-class classification,” in Proceedings of the International Conference on Machine Learning. PMLR, 2020, pp. 3711–3721.
- [34] J. Raghuram, V. Chandrasekaran, S. Jha, and S. Banerjee, “A general framework for detecting anomalous inputs to dnn classifiers,” in Proceedings of the International Conference on Machine Learning. PMLR, 2021, pp. 8764–8775.
- [35] X. Yan, H. Zhang, X. Xu, X. Hu, and P.-A. Heng, “Learning semantic context from normal samples for unsupervised anomaly detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 3110–3118.
- [36] Y. Zheng, M. Jin, Y. Liu, L. Chi, K. T. Phan, and Y.-P. P. Chen, “Generative and contrastive self-supervised learning for graph anomaly detection,” IEEE Transactions on Knowledge and Data Engineering, 2021.
- [37] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in Neural Information Processing Systems, vol. 27, 2014.
- [38] B. Du, X. Sun, J. Ye, K. Cheng, J. Wang, and L. Sun, “Gan-based anomaly detection for multivariate time series using polluted training set,” IEEE Transactions on Knowledge and Data Engineering, 2021.
- [39] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 723–773, 2012.
- [40] S. Yoon, Y.-K. Noh, and F. Park, “Autoencoding under normalization constraints,” in International Conference on Machine Learning. PMLR, 2021, pp. 12 087–12 097.
- [41] Y. Li, K. Swersky, and R. Zemel, “Generative moment matching networks,” in International conference on machine learning. PMLR, 2015, pp. 1718–1727.
- [42] M. Rosenblatt, “Remarks on some nonparametric estimates of a density function,” The annals of mathematical statistics, pp. 832–837, 1956.
- [43] A. Pinkus, “Approximation theory of the mlp model in neural networks,” Acta numerica, vol. 8, pp. 143–195, 1999.
- [44] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang, “The expressive power of neural networks: A view from the width,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6232–6240.
- [45] G. Marsaglia, “Generating a variable from the tail of the normal distribution,” BOEING SCIENTIFIC RESEARCH LABS SEATTLE WA, Tech. Rep., 1963.
- [46] L. V. Kantorovich, “Mathematical methods of organizing and planning production,” Management science, vol. 6, no. 4, pp. 366–422, 1960.
- [47] M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” Advances in neural information processing systems, vol. 26, 2013.
- [48] R. Sinkhorn and P. Knopp, “Concerning nonnegative matrices and doubly stochastic matrices,” Pacific Journal of Mathematics, vol. 21, no. 2, pp. 343–348, 1967.
- [49] D. Dua and C. Graff. (2017) UCI machine learning repository. [Online]. Available: http://archive.ics.uci.edu/ml
- [50] S. Rayana. (2016) ODDS library. [Online]. Available: http://odds.cs.stonybrook.edu.
- [51] M. Lichman. (2013) UCI machine learning repository. [Online]. Available: http://archive.ics.uci.edu/ml
- [52] Y. LeCun, C. Cortes, and C. Burges. (2010) MNIST handwritten digit database, AT&T Labs. [Online]. Available: http://yann.lecun.com/exdb/mnist/
- [53] H. Xiao, K. Rasul, and R. Vollgraf. (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms.
- [54] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
- [55] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “Lof: identifying density-based local outliers,” in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, pp. 93–104.
- [56] P. Seeböck, S. Waldstein, S. Klimscha, B. S. Gerendas, R. Donner, T. Schlegl, U. Schmidt-Erfurth, and G. Langs, “Identifying and categorizing anomalies in retinal imaging data,” arXiv preprint arXiv:1612.00686, 2016.
- [57] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 132–149.
- [58] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs, “Unsupervised anomaly detection with generative adversarial networks to guide marker discovery,” in International conference on information processing in medical imaging. Springer, 2017, pp. 146–157.
- [59] L. Bergman and Y. Hoshen, “Classification-based anomaly detection for general data,” in Proceedings of the International Conference on Learning Representations, 2020.
- [60] W. Hu, M. Wang, Q. Qin, J. Ma, and B. Liu, “Hrn: A holistic approach to one class learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 19 111–19 124, 2020.
- [61] J. Cai and J. Fan, “Perturbation learning based anomaly detection,” CoRR, vol. abs/2206.02704, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2206.02704
- [62] Massoli et al., “Mocca: Multilayer one-class classification for anomaly detection,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 6, pp. 2313–2323, 2022.
- [63] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.