Restricted Generative Projection for One-Class Classification and Anomaly Detection

Feng Xiao, Ruoyu Sun, Jicong Fan

The authors are with the School of Data Science, The Chinese University of Hong Kong, Shenzhen, and Shenzhen Research Institute of Big Data. E-mail: [email protected]. Received April 19, 2021; revised August 16, 2021.

Abstract

We present a simple framework for one-class classification and anomaly detection. The core idea is to learn a mapping to transform the unknown distribution of training (normal) data to a known target distribution. Crucially, the target distribution should be sufficiently simple, compact, and informative. The simplicity is to ensure that we can sample from the distribution easily, the compactness is to ensure that the decision boundary between normal data and abnormal data is clear and reliable, and the informativeness is to ensure that the transformed data preserve the important information of the original data. Therefore, we propose to use truncated Gaussian, uniform in hypersphere, uniform on hypersphere, or uniform between hyperspheres, as the target distribution. We then minimize the distance between the transformed data distribution and the target distribution while keeping the reconstruction error for the original data small enough. Comparative studies on multiple benchmark datasets verify the effectiveness of our methods in comparison to baselines.

Index Terms:

Anomaly Detection, One-class Classification, Generative Projection.

I Introduction

Anomaly detection (AD) under the setting of one-class classification aims to distinguish normal data from abnormal data using a model trained on only normal data [1, 2, 3]. AD is useful in numerous real problems such as intrusion detection for video surveillance, fraud detection in finance, and fault detection for sensors. Many AD methods have been proposed in the past decades [4, 5, 6, 7, 8]. For instance, Schölkopf et al. [5] proposed the one-class support vector machine (OC-SVM) that finds, in a high-dimensional kernel feature space, a hyperplane yielding a large distance between the normal training data and the origin. Tax et al. [6] presented the support vector data description (SVDD), which obtains a spherically shaped boundary (with minimum volume) around the normal training data to identify abnormal samples. Hu et al. [8] proposed a new kernel function to estimate samples’ local densities and a weighted neighborhood density estimation to increase the robustness to changes in the neighborhood size. There are also many deep learning based AD methods, including unsupervised AD methods [9, 10, 11, 12, 13, 14, 15] and semi-supervised AD methods [16, 17, 18, 19].

Deep learning based AD methods may be organized into three categories. The first category is based on compression and reconstruction. These methods usually use an autoencoder [20, 21] to learn a low-dimensional representation to reconstruct the high-dimensional data [22, 23]. The autoencoder learned from the normal training data is expected to have a much higher reconstruction error on unknown abnormal data than on normal data. The second category is based on the combination of classical one-class classification [6, 11] and deep learning [10, 17, 19, 24, 25, 26, 27]. For instance, Ruff et al.[10] proposed a method called deep one-class SVDD. The main idea is to use deep learning to construct a minimum-radius hypersphere to include all the training data, while the unknown abnormal data are expected to fall outside. The last category is based on generative learning or adversarial learning [28, 29, 30, 31, 32, 33, 34, 35, 36]. For example, Perera et al. [32] proposed to use the generative adversarial network (GAN) [37] with constrained latent representation to detect anomalies for image data. Goyal et al.[33] presented a method called deep robust one-class classification (DROCC) and the method aims to find a low-dimensional manifold to accommodate the normal data via an adversarial optimization approach.

Although deep learning based AD methods have shown promising performance on various datasets, they still have limitations. For instance, one-class classification methods such as Deep SVDD [10] only ensure that a hypersphere can enclose the normal data but cannot guarantee that the normal data are distributed evenly in the hypersphere, which may leave large empty regions in the hypersphere and hence yield an incorrect decision boundary (see Fig.1). Moreover, the popular hypersphere assumption may not be the best one for providing a compact decision boundary (see Fig.2 and Tab.I). Adversarial learning methods such as [31, 32, 33, 38] may suffer from instability in optimization.

In this work, we present a restricted generative projection (RGP) framework for one-class classification and anomaly detection. The main idea is to train a deep neural network to convert the distribution of normal training data to a target distribution that is simple, compact, and informative, which will provide a reliable decision boundary to identify abnormal data from normal data. There are many choices for the target distribution, such as truncated Gaussian and uniform on hypersphere. Our contributions are summarized as follows.

• We present a novel framework called RGP for one-class classification and anomaly detection. It aims to transform the data distribution to a target distribution that unknown abnormal data are likely to violate.
• We provide four simple, compact, and informative target distributions, analyze their properties theoretically, and show how to sample from them efficiently.
• We propose two extensions for our original RGP method.

We conduct extensive experiments (on eight benchmark datasets) to compare the performance of different target distributions and compare our method with state-of-the-art baselines. The results verify the effectiveness of our methods.

The rest of this paper is organized as follows. Section II introduces the related work. Section III details our proposed methods. Section IV presents two extensions of the proposed method. Section V shows the experiments. Section VI draws conclusions for this paper.

II Related Work

Before elaborating on our method, we briefly review deep one-class classification, autoencoder-based AD methods, and maximum mean discrepancy (MMD) [39] in this section. We also discuss the connections and differences between our method and these related works.

II-A Deep One-Class Classification

The Deep SVDD proposed by [10] uses a neural network to learn a minimum-radius hypersphere to enclose the normal training data, i.e.,

\mathop{\text{minimize}}_{\mathcal{W}}\frac{1}{n}\sum^{n}_{i=1}\|\phi(\mathbf{x}_{i};\mathcal{W})-\mathbf{c}\|^{2}+\frac{\lambda}{2}\sum^{L}_{l=1}\|\mathbf{W}_{l}\|^{2}_{F}

(1)

where $\mathbf{c}\in\mathbb{R}^{d}$ is a predefined centroid and $\mathcal{W}=\{\mathbf{W}_{1},\ldots,\mathbf{W}_{L}\}$ denotes the parameters of the $L$ -layer neural network $\phi$ , and $\lambda$ is a regularization hyperparameter. In (1), to avoid model collapse, bias terms should not be used and activation functions should be bounded [10]. There are also a few variants of Deep SVDD proposed for semi-supervised one-class classification and anomaly detection [17, 19].

Figure 1: Visualization of transforming the training data of Thyroid dataset (detailed in Section V-A) to a 2-dimensional space. Plots (a) and (b) correspond to Deep SVDD and our method respectively. The orange points denote the transformed normal training data and the purple circles denote the decision boundaries. The blue points in plot (b) denote the samples drawn from a truncated Gaussian. In plot (a), the hypersphere is not a good boundary to describe the distribution of the normal data, while in plot (b), the hypersphere is good enough to describe the distribution of the normal data. In particular, in plot (a), if we reduce the radius of the hypersphere, many normal data points will fall outside the decision boundary, which contradicts the assumption that all or at least most of the training data are normal.

Both our method and Deep SVDD as well as its variants aim to project the normal training data into some space such that a decision boundary between normal data and unknown abnormal data can be found easily. However, the sum-of-squares minimization in Deep SVDD and its variants only ensures that the projected data are sufficiently close to the centroid $\mathbf{c}$ in the sense of Euclidean distance and does not guarantee that the data are sufficiently or evenly distributed in the hypersphere centered at $\mathbf{c}$ . Thus, in the hypersphere, there could be holes or big empty regions that contain no normal data, and hence it is not suitable to assume that the whole space enclosed by the hypersphere is a normal space. In other words, the optimal decision boundary between normal data and abnormal data can be very different from the hypersphere. An intuitive example is shown in Fig.1. We see that there is a large empty space in the hypersphere learned by Deep SVDD. In contrast, the transformed data of our method are sufficiently distributed.

II-B Autoencoder-based AD Methods

Our method resembles the variational autoencoder (VAE) [21] in architecture but differs from it in several respects. Although our model is an autoencoder, the main goal is not to represent or generate data; instead, our model aims to transform the data distribution so as to find a reliable decision boundary for anomaly detection. More importantly, the latent distribution in VAE is often Gaussian and not bounded while the latent distribution in our model is more general and bounded, which is essential for anomaly detection. In addition, the optimizations of VAE and our method are also different: VAE involves KL-divergence while our method involves maximum mean discrepancy [39].

It is worth noting that similar to our method, Perera et al.[32] also considered bounded latent distribution in autoencoder for anomaly detection. They proposed to train a denoising autoencoder with a hyper-cube supported latent space, via adversarial training. The latent distribution and optimization are different from ours. In addition, the latent distributions of our method, such as uniform on hypersphere, are more compact than the multi-dimensional uniform latent distribution of their method.

Compared with the autoencoder based anomaly detection method NAE [40] that uses reconstruction error to normalize autoencoder, our method pays more attention to learning a mapping that can transform the unknown data distribution into a simple and compact target distribution. The ideas are orthogonal.

II-C Maximum Mean Discrepancy

In statistics, maximum mean discrepancy (MMD) [39] is often used for two-sample tests; its principle is to find a function whose expectations differ as much as possible under the two distributions:

\text{MMD}[\mathcal{F},p,q]=\underset{\|f\|_{\mathcal{H}}\leq 1}{\sup}\left(\mathbb{E}_{p}[f(\mathbf{x})]-\mathbb{E}_{q}[f(\mathbf{y})]\right),

(2)

where $p,q$ are probability distributions, $\mathcal{F}$ is a class of functions $f:\mathbb{X}\rightarrow\mathbb{R}$ and $\mathcal{H}$ denotes a reproducing kernel Hilbert space. Using the kernel trick, MMD can be represented as a simple loss function to measure the discrepancy between two distributions by finite samples, which is easy to apply to deep learning and can be efficiently trained by gradient descent. Based on the aforementioned advantages of MMD, Li et al.[41] proposed generative moment matching networks (GMMNs), which leads to a simpler optimization objective compared to the min-max optimization of GAN [37].

Although both our method and GMMNs [41] minimize the MMD between data distribution and prior distribution, our goal is not generating new data but detecting anomalies. In addition, we consider a few bounded target distributions and analyze their sampling properties. More importantly, our method has very competitive performance when compared with SOTA methods of anomaly detection and one-class classification.

III Restricted Generative Projection

In this section, we introduce our RGP framework, bounded target distributions, and the computation of anomaly scores.

III-A Restricted Distribution Projection

Suppose we have a set of $m$ -dimensional training data $\mathbf{X}=\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n}\}$ drawn from an unknown bounded distribution $\mathcal{D}_{\mathbf{x}}$ and any samples drawn from $\mathcal{D}_{\mathbf{x}}$ are normal data. We want to train a model $\mathcal{M}$ on $\mathbf{X}$ to determine whether a test sample $\mathbf{x}_{\text{new}}$ is drawn from $\mathcal{D}_{\mathbf{x}}$ or not. One may consider estimating the density function (denoted by ${p}_{\mathbf{x}}$ ) of $\mathcal{D}_{\mathbf{x}}$ using some techniques such as kernel density estimation [42]. Suppose the estimation $\hat{p}_{\mathbf{x}}$ is good enough, then one can determine whether $\mathbf{x}_{\text{new}}$ is normal or not according to the value of $\hat{p}_{\mathbf{x}}(\mathbf{x}_{\text{new}})$ : if $\hat{p}_{\mathbf{x}}(\mathbf{x}_{\text{new}})$ is zero or close to zero, $\mathbf{x}_{\text{new}}$ is an abnormal data point; otherwise, $\mathbf{x}_{\text{new}}$ is a normal data point. (Here we assume that the distributions of normal data and abnormal data do not overlap; otherwise, it is difficult to determine whether a single point is normal or not.) However, the dimensionality of the data is often high and hence it is very difficult to obtain a good estimation $\hat{p}_{\mathbf{x}}$ .

We propose to learn a mapping $\mathcal{T}:\mathbb{R}^{m}\rightarrow\mathbb{R}^{d}$ to transform the unknown bounded distribution $\mathcal{D}_{\mathbf{x}}$ to a known distribution $\mathcal{D}_{\mathbf{z}}$ while there still exists a mapping $\mathcal{T}^{\prime}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{m}$ that can recover $\mathcal{D}_{\mathbf{x}}$ from $\mathcal{D}_{\mathbf{z}}$ approximately. Let $p_{\mathbf{z}}$ be the density function of $\mathcal{D}_{\mathbf{z}}$ . Then we can determine whether $\mathbf{x}_{\text{new}}$ is normal or not according to the value of $p_{\mathbf{z}}(\mathcal{T}(\mathbf{x}_{\text{new}}))$ . To be more precise, we want to solve the following problem

\underset{\mathcal{T},~{}\mathcal{T^{\prime}}}{\text{minimize}}~{}\mathcal{M}\left(\mathcal{T}(\mathcal{D}_{\mathbf{x}}),\mathcal{D}_{\mathbf{z}}\right)+\lambda\mathcal{M}\left(\mathcal{T}^{\prime}(\mathcal{T}(\mathcal{D}_{\mathbf{x}})),\mathcal{D}_{\mathbf{x}}\right),

(3)

where $\mathcal{M}(\cdot,\cdot)$ denotes some distance metric between two distributions and $\lambda$ is a trade-off parameter for the two terms. Note that if $\lambda=0$ , $\mathcal{T}$ may convert any distribution to $\mathcal{D}_{\mathbf{z}}$ and lose the ability of distinguishing normal data and abnormal data. Based on the universal approximation theorems [43, 44] and substantial success of neural networks, we use deep neural networks (DNN) to model $\mathcal{T}$ and $\mathcal{T}^{\prime}$ respectively. Let $f_{\theta}$ and $g_{\phi}$ be two DNNs with parameters $\theta$ and $\phi$ respectively. We solve

\underset{\theta,~{}\phi}{\text{minimize}}~{}\mathcal{M}\left(\mathcal{D}_{f_{\theta}(\mathbf{x})},\mathcal{D}_{\mathbf{z}}\right)+\lambda\mathcal{M}\left(\mathcal{D}_{g_{\phi}(f_{\theta}(\mathbf{x}))},\mathcal{D}_{\mathbf{x}}\right),

(4)

where $f_{\theta}$ and $g_{\phi}$ serve as encoder and decoder respectively. However, problem (4) is intractable because $\mathcal{D}_{\mathbf{x}}$ is unknown and $\mathcal{D}_{f_{\theta}(\mathbf{x})}$ , $\mathcal{D}_{g_{\phi}(f_{\theta}(\mathbf{x}))}$ cannot be computed analytically. Note that the samples of $\mathcal{D}_{\mathbf{x}}$ and $\mathcal{D}_{g_{\phi}(f_{\theta}(\mathbf{x}))}$ are given and paired. Then the second term in the objective of (4) can be replaced by sample reconstruction error such as $\tfrac{1}{n}\sum_{i=1}^{n}\|\mathbf{x}_{i}-g_{\phi}(f_{\theta}(\mathbf{x}_{i}))\|^{2}$ . On the other hand, we can also sample from $\mathcal{D}_{f_{\theta}(\mathbf{x})}$ and $\mathcal{D}_{\mathbf{z}}$ easily but their samples are not paired. Hence, the metric $\mathcal{M}$ in the first term of the objective of (4) should be able to measure the distance between two distributions using their finite samples. To this end, we propose to use the kernel maximum mean discrepancy (MMD)[39] to measure the distance between $\mathcal{D}_{f_{\theta}(\mathbf{x})}$ and $\mathcal{D}_{\mathbf{z}}$ . Its empirical estimate is

\text{MMD}^{2}[\mathcal{F},X,Y]=\frac{1}{m(m-1)}\sum_{i=1}^{m}\sum_{j\neq i}^{m}k(\mathbf{x}_{i},\mathbf{x}_{j})+\frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}^{n}k(\mathbf{y}_{i},\mathbf{y}_{j})-\frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}k(\mathbf{x}_{i},\mathbf{y}_{j}),

(5)

where $X=\{\mathbf{x}_{1},\dots,\mathbf{x}_{m}\}$ and $Y=\{\mathbf{y}_{1},\dots,\mathbf{y}_{n}\}$ are samples consisting of i.i.d observations drawn from $p$ and $q$ , respectively. $k(\cdot,\cdot)$ denotes a kernel function, e.g., $k(\mathbf{x},\mathbf{y})=\exp(-\gamma\|\mathbf{x}-\mathbf{y}\|^{2})$ , a Gaussian kernel.
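As an illustration of (5), the following minimal sketch computes the unbiased estimate with a Gaussian kernel in plain Python. The function names `gaussian_kernel` and `mmd2_unbiased` and the default `gamma=1.0` are assumptions made for this example and are not taken from the paper; a practical implementation would use a tensor library so that gradients can flow back to $\theta$.

```python
import math
import random


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased empirical MMD^2 between samples X (size m) and Y (size n)."""
    m, n = len(X), len(Y)
    if m < 2 or n < 2:
        raise ValueError("each sample needs at least two points")
    # Within-sample terms skip the diagonal (j != i), as in (5).
    xx = sum(gaussian_kernel(X[i], X[j], gamma)
             for i in range(m) for j in range(m) if i != j)
    yy = sum(gaussian_kernel(Y[i], Y[j], gamma)
             for i in range(n) for j in range(n) if i != j)
    # Cross term uses all m * n pairs.
    xy = sum(gaussian_kernel(x, y, gamma) for x in X for y in Y)
    return xx / (m * (m - 1)) + yy / (n * (n - 1)) - 2.0 * xy / (m * n)


if __name__ == "__main__":
    rng = random.Random(0)
    same_a = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
    same_b = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
    shifted = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(200)]
    print("same distribution:   ", mmd2_unbiased(same_a, same_b))
    print("shifted distribution:", mmd2_unbiased(same_a, shifted))
```

Because the estimate is unbiased, it can be slightly negative when the two samples come from the same distribution.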

Based on the above analysis, we obtain an approximation for (4) as

\mathop{\text{minimize}}_{\theta,~{}\phi}~{}\text{MMD}^{2}(\mathbf{Z}_{\theta},\mathbf{Z}_{T})+\frac{\lambda}{n}\sum_{i=1}^{n}\|\mathbf{x}_{i}-g_{\phi}(f_{\theta}(\mathbf{x}_{i}))\|^{2},

(6)

where $\mathbf{Z}_{\theta}=\{f_{\theta}(\mathbf{x}_{1}),f_{\theta}(\mathbf{x}_{2}),\ldots,f_{\theta}(\mathbf{x}_{n})\}$ and $\mathbf{Z}_{T}=\{\mathbf{z}_{i}:\mathbf{z}_{i}\sim\mathcal{D}_{\mathbf{z}},~{}i=1,\ldots,n\}$ . The first term of the objective function in (6) makes $f_{\theta}$ learn the mapping $\mathcal{T}$ from data distribution $\mathcal{D}_{\mathbf{x}}$ to target distribution $\mathcal{D}_{\mathbf{z}}$ and the second term ensures that $f_{\theta}$ can preserve the main information of observations provided that $\lambda$ is sufficiently large.

III-B Bounded Target Distributions

Now we introduce four examples of simple and compact $\mathcal{D}_{\mathbf{z}}$ for (6). The four distributions are Gaussian in Hypersphere (GiHS), Uniform in Hypersphere (UiHS), Uniform between Hyperspheres (UbHS), and Uniform on Hypersphere (UoHS). Their 2-dimensional examples are visualized in Fig.2.

GiHS (Fig.2.a) is actually a truncated Gaussian. Suppose we want to draw $n$ samples from GiHS. A simple approach is to draw $(1+\rho)n$ samples from a standard $d$ -dimensional Gaussian and discard the $\rho n$ samples with the largest $\ell_{2}$ norms. The maximum $\ell_{2}$ norm of the remaining $n$ points is the radius of the hypersphere. One may also use the inverse transform method of [45]. We have the following results.
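The draw-and-discard procedure for GiHS can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `sample_gihs`, the default `rho=0.1`, and the rounding of $(1+\rho)n$ up to an integer are choices made here, not details given in the text.

```python
import math
import random


def sample_gihs(n, d, rho=0.1, seed=None):
    """Draw n points from a truncated standard Gaussian (GiHS).

    Draws ceil((1 + rho) * n) points from N(0, I_d), keeps the n points
    with the smallest l2 norms, and returns (points, radius), where
    radius is the largest norm among the kept points.
    """
    if n < 1 or d < 1:
        raise ValueError("n and d must be positive")
    rng = random.Random(seed)
    total = math.ceil((1.0 + rho) * n)
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(total)]
    # Sort by l2 norm so that the points to discard sit at the end.
    points.sort(key=lambda z: math.sqrt(sum(v * v for v in z)))
    kept = points[:n]
    radius = math.sqrt(sum(v * v for v in kept[-1]))
    return kept, radius


if __name__ == "__main__":
    samples, radius = sample_gihs(n=1000, d=8, rho=0.1, seed=0)
    print(len(samples), radius)
```

Here the radius is determined by the data rather than fixed in advance, which matches the description above.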

Proposition III.1.

Suppose $\mathbf{z}_{1},\mathbf{z}_{2},\ldots,\mathbf{z}_{n}$ are sampled from $\mathcal{N}(\mathbf{0},\mathbf{I}_{d})$ independently. Then for any $r>\sqrt{d}$ , we have

\operatorname{Pr}\left(\|\mathbf{z}_{j}\|\geq r\right)\leq\exp\left(-0.5\alpha\right),\quad j\in[n],

(7)

and

\operatorname{Pr}\left(\max_{1\leq j\leq n}\|\mathbf{z}_{j}\|\leq r\right)\geq 1-n\exp\left(-0.5\alpha\right),

(8)

where $\alpha=\sqrt{d+2r^{2}}-\sqrt{d}$ .

Inequality (8) means a hypersphere of radius $r$ can include all the $n$ samples with a high probability if $r$ is sufficiently large. On the other hand, according to (7), if we expect to get $n$ samples in a hypersphere of radius $r$ , we need to sample about $n/(1-\exp(-0.5\alpha))$ points from $\mathcal{N}(\mathbf{0},\mathbf{I}_{d})$ . If $d$ is larger, we need to sample more points.

UiHS (Fig.2.b) is a hyperball in which all the samples are distributed uniformly. To sample from UiHS, we first sample from $\mathcal{U}(-r,r)^{d}$ and then discard all the data points outside the radius- $r$ hyperball centered at the origin. The following proposition (proved in the Appendix) gives probability bounds for sampling from a $d$ -dimensional uniform distribution.

Proposition III.2.

Suppose $\mathbf{z}_{1},\mathbf{z}_{2},\ldots,\mathbf{z}_{n}$ are sampled from $\mathcal{U}(-r,r)^{d}$ independently. Then for any $t>0$ , we have

\operatorname{Pr}\left(\|\mathbf{z}_{j}\|\geq{rt}\right)\leq\frac{d}{3t^{2}},\quad j\in[n],

(9)

and

\operatorname{Pr}\left(\max_{1\leq j\leq n}\|\mathbf{z}_{j}\|\leq rt\right)\geq 1-\frac{nd}{3t^{2}}.

(10)
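One way to see where (9) and (10) come from is a second-moment argument: Markov's inequality applied to $\|\mathbf{z}_{j}\|^{2}$, followed by a union bound over the $n$ samples. This is only a sketch and may differ from the route taken in the Appendix; it uses the fact that each coordinate $z_{jk}\sim\mathcal{U}(-r,r)$ has $\mathbb{E}[z_{jk}^{2}]=r^{2}/3$.

```latex
\begin{aligned}
\mathbb{E}\|\mathbf{z}_{j}\|^{2}
  &=\sum_{k=1}^{d}\mathbb{E}[z_{jk}^{2}]=\frac{dr^{2}}{3},\\
\operatorname{Pr}\left(\|\mathbf{z}_{j}\|\geq rt\right)
  &=\operatorname{Pr}\left(\|\mathbf{z}_{j}\|^{2}\geq r^{2}t^{2}\right)
   \leq\frac{\mathbb{E}\|\mathbf{z}_{j}\|^{2}}{r^{2}t^{2}}=\frac{d}{3t^{2}},\\
\operatorname{Pr}\left(\max_{1\leq j\leq n}\|\mathbf{z}_{j}\|>rt\right)
  &\leq\sum_{j=1}^{n}\operatorname{Pr}\left(\|\mathbf{z}_{j}\|\geq rt\right)
   \leq\frac{nd}{3t^{2}}.
\end{aligned}
```

Taking the complement of the last event gives the lower bound in (10).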

Inequality (10) means a hypersphere of radius $rt$ can include all the $n$ samples with probability at least $1-nd/(3t^{2})$ . On the other hand, inequality (9) indicates that if we draw $n/(1-d/(3t^{2}))$ samples from $\mathcal{U}(-r,r)^{d}$ , the expected number of samples falling into a hypersphere of radius $rt$ is at least $n$ . Sampling from UiHS is closely related to the curse of dimensionality: we need to sample a large number of points from $\mathcal{U}(-r,r)^{d}$ if $d$ is large because only a small volume of the hypercube is inside the hyperball. To be more precise, letting $V_{\mathrm{hypercube}}$ be the volume of a hypercube with side length $2r$ and $V_{\mathrm{hyperball}}$ be the volume of a hyperball with radius $r$ , we have

\frac{V_{\mathrm{hyperball}}}{V_{\mathrm{hypercube}}}={\frac{\pi^{d/2}}{d2^{d-1}\Gamma(d/2)}}\triangleq\eta,

(11)

where $\Gamma$ is the gamma function. Therefore, we need to draw $n/\eta$ samples from $\mathcal{U}(-r,r)^{d}$ to ensure that the expected number of samples included in the hyperball is $n$ , where $\eta$ is small if $d$ is large.
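To make the effect of $d$ on $\eta$ concrete, the sketch below evaluates (11) and performs the rejection sampling described above. The names `ball_to_cube_ratio` and `sample_uihs`, as well as the `max_draws` safety cap, are assumptions introduced for this example.

```python
import math
import random


def ball_to_cube_ratio(d):
    """eta from (11): volume of the radius-r ball over the side-2r cube in R^d."""
    return math.pi ** (d / 2) / (d * 2 ** (d - 1) * math.gamma(d / 2))


def sample_uihs(n, d, r=1.0, seed=None, max_draws=1_000_000):
    """Rejection-sample n points uniformly from the radius-r ball in R^d.

    Returns (points, draws), where draws counts the cube samples used.
    """
    rng = random.Random(seed)
    kept, draws = [], 0
    while len(kept) < n:
        if draws >= max_draws:
            raise RuntimeError("acceptance rate too low; d is probably too large")
        z = [rng.uniform(-r, r) for _ in range(d)]
        draws += 1
        # Keep the point only if it lies inside the ball.
        if sum(v * v for v in z) <= r * r:
            kept.append(z)
    return kept, draws


if __name__ == "__main__":
    for d in (2, 5, 10, 20):
        eta = ball_to_cube_ratio(d)
        print(f"d={d:2d}  eta={eta:.3e}  expected draws for n=1000: {1000 / eta:.3e}")
    points, draws = sample_uihs(n=1000, d=5, seed=0)
    print(len(points), draws)
```

The acceptance rate of this scheme equals $\eta$ in expectation, so plain rejection sampling from the cube is only practical for small $d$.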

UbHS (Fig.2.c) can be obtained via UiHS. We first sample from UiHS and then remove all samples included by a smaller hypersphere. Since the volume ratio of two hyperballs with radius $r$ and $r^{\prime}$ is $(\frac{r}{r^{\prime}})^{d}$ , where $r^{\prime}<r$ , we need to draw $n/(1-(r^{\prime}/r)^{d})$ samples from UiHS to ensure that the expected number of samples between the two hyperspheres is $n$ . Compared with GiHS and UiHS, UbHS is more compact and hence provides larger abnormal space for abnormal data to fall in.

UoHS (Fig.2.d) can be easily obtained via sampling from $\mathcal{N}(\mathbf{0},\mathbf{I}_{d})$ . Specifically, for every $\mathbf{z}_{i}$ drawn from $\mathcal{N}(\mathbf{0},\mathbf{I}_{d})$ , we normalize it as $\mathbf{z}_{i}\leftarrow{r\mathbf{z}_{i}}/{\|\mathbf{z}_{i}\|}$ , where $r$ is the predefined radius of the hypersphere. UoHS is a special case of UbHS when $r^{\prime}=r$ .
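The two remaining samplers can be sketched in the same style. In this illustration `sample_ubhs` filters cube samples directly to the shell between the two hyperspheres, which is equivalent to sampling UiHS and then removing the points inside the smaller hypersphere; the function names, the parameter name `r_inner` (standing for $r^{\prime}$), and the `max_draws` cap are assumptions made here.

```python
import math
import random


def sample_uohs(n, d, r=1.0, seed=None):
    """Sample n points uniformly on the radius-r hypersphere in R^d."""
    rng = random.Random(seed)
    points = []
    while len(points) < n:
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in z))
        if norm > 0.0:  # guard against the probability-zero all-zero draw
            points.append([r * v / norm for v in z])
    return points


def sample_ubhs(n, d, r=1.0, r_inner=0.5, seed=None, max_draws=1_000_000):
    """Sample n points uniformly between hyperspheres of radius r_inner and r."""
    if not 0.0 <= r_inner < r:
        raise ValueError("require 0 <= r_inner < r")
    rng = random.Random(seed)
    points, draws = [], 0
    while len(points) < n:
        if draws >= max_draws:
            raise RuntimeError("acceptance rate too low for these settings")
        z = [rng.uniform(-r, r) for _ in range(d)]
        draws += 1
        sq_norm = sum(v * v for v in z)
        # Keep points inside the outer ball but outside the inner ball.
        if r_inner * r_inner <= sq_norm <= r * r:
            points.append(z)
    return points


if __name__ == "__main__":
    on_sphere = sample_uohs(n=5, d=3, r=2.0, seed=0)
    in_shell = sample_ubhs(n=5, d=3, r=1.0, r_inner=0.8, seed=0)
    print([round(math.sqrt(sum(v * v for v in z)), 6) for z in on_sphere])
    print([round(math.sqrt(sum(v * v for v in z)), 6) for z in in_shell])
```

Unlike the rejection-based samplers, `sample_uohs` never discards points, so its cost does not grow with the volume ratio.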

To quantify the compactness of the four target distributions, we define density $\rho$ as the number of data points in unit volume, i.e., $\rho=n/V$ . The densities of the four target distributions are reported in Table I. Since UoHS is more compact than UbHS as well as GiHS and UiHS, we expect it to perform better in anomaly detection. Indeed, our numerical results show that UoHS outperforms the others in most cases.

TABLE I: Densities of the four target distributions.

	GiHS	UiHS	UbHS	UoHS
$\rho$	$\dfrac{n\Gamma(d/2+1)}{\pi^{d/2}r^{d}}$	$\dfrac{n\Gamma(d/2+1)}{\pi^{d/2}r^{d}}$	$\dfrac{n\Gamma(d/2+1)}{\pi^{d/2}(r^{d}-r^{\prime d})}$	$\infty$

III-C Anomaly Scores

In the test stage, we only use the trained $f_{\theta}^{*}$ to calculate anomaly scores. For a given test sample $\mathbf{x}_{\text{new}}$ , we define anomaly score $s$ for each target distribution by

s(\mathbf{x}_{\text{new}})=\begin{cases}\left|\|f_{\theta}^{*}(\mathbf{x}_{\text{new}})\|-r\right|, & \text{for UoHS}\\ \|f_{\theta}^{*}(\mathbf{x}_{\text{new}})\|, & \text{for GiHS or UiHS}\\ \left(\|f_{\theta}^{*}(\mathbf{x}_{\text{new}})\|-r\right)\cdot\left(\|f_{\theta}^{*}(\mathbf{x}_{\text{new}})\|-r^{\prime}\right), & \text{for UbHS}\end{cases}

(12)

There are clear decision boundaries according to (12) and they can be regarded as ‘hard boundaries’ between normal samples and abnormal samples. However, these ‘hard boundaries’ only work in ideal cases where the projected data exactly match the target distributions. In real cases, due to the noise of data or the non-optimality of optimization, the projected data do not exactly match the target distributions. Therefore, we further propose a ‘soft boundary’ for calculating anomaly scores. Specifically, for a given test sample $\mathbf{x}_{\text{new}}$ , we define anomaly score $s$ for all four target distributions as

s(\mathbf{x}_{\text{new}})=\frac{1}{k}\sum_{i\in N_{k}}\|f_{\theta}^{*}(\mathbf{x}_{\text{new}})-f_{\theta}^{*}(\mathbf{x}_{i})\|

(13)

where $\mathbf{x}_{i}$ denotes a single sample with index $i$ in the training data and $N_{k}$ denotes the index set of the $k$ nearest training (projected) samples to $f_{\theta}^{*}(\mathbf{x}_{\text{new}})$ .
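The two scoring rules can be sketched as below, operating on already-projected vectors $f_{\theta}^{*}(\mathbf{x})$ given as plain Python lists. The function names `hard_score` and `soft_score`, the string labels for the target distributions, and the default `k=5` are assumptions for this illustration; the encoder itself is not shown.

```python
import math


def l2_norm(z):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in z))


def hard_score(z, target, r, r_inner=None):
    """Hard-boundary anomaly score (12) for a projected point z."""
    norm = l2_norm(z)
    if target == "UoHS":
        return abs(norm - r)
    if target in ("GiHS", "UiHS"):
        return norm
    if target == "UbHS":
        if r_inner is None:
            raise ValueError("UbHS needs the inner radius r_inner")
        # Negative between the two hyperspheres, positive elsewhere.
        return (norm - r) * (norm - r_inner)
    raise ValueError(f"unknown target distribution: {target}")


def soft_score(z_new, Z_train, k=5):
    """Soft-boundary anomaly score (13): mean distance to the k nearest
    projected training points."""
    if not Z_train:
        raise ValueError("Z_train must not be empty")
    dists = sorted(math.dist(z_new, z) for z in Z_train)
    k = min(k, len(dists))
    return sum(dists[:k]) / k


if __name__ == "__main__":
    train = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    print(hard_score([0.9, 0.1], "UoHS", r=1.0))
    print(soft_score([0.9, 0.1], train, k=2))
    print(soft_score([3.0, 3.0], train, k=2))
```

A brute-force sort is used for the nearest-neighbor search to keep the sketch short; for large training sets a spatial index would be the usual choice.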

Empirically, we found that (13) performs better than (12) in most cases. Tables II, III, and VI report only the results from (13). The comparison between (12) and (13) is provided in Section V-E.

We call our method Restricted Generative Projection (RGP), which has four variants, denoted by RGP-GiHS, RGP-UiHS, RGP-UbHS, and RGP-UoHS respectively, though any bounded target distribution applies.

IV Extensions of RGP

In this section, based on the general objective in (4), we provide two variants of RGP.

IV-A Double-MMD based RGP

In the objective function of RGP defined by (6), the second term is the reconstruction error for $\mathbf{X}$ , which is only a special example of approximation for the second term in the objective function of (4), i.e., $\mathcal{M}\left(\mathcal{D}_{g_{\phi}(f_{\theta}(\mathbf{x}))},\mathcal{D}_{\mathbf{x}}\right)$ . Alternatively, we can use MMD to approximate $\mathcal{M}\left(\mathcal{D}_{g_{\phi}(f_{\theta}(\mathbf{x}))},\mathcal{D}_{\mathbf{x}}\right)$ , which yields the following Double-MMD RGP:

\mathop{\text{minimize}}_{\theta,~{}\phi}~{}\text{MMD}^{2}(\mathbf{Z}_{\theta},\mathbf{Z}_{T})+\lambda\text{MMD}^{2}(g_{\phi}(\mathbf{Z}_{\theta}),\mathbf{X}).

(14)

Compared to the sum of squares reconstruction error used in (6), $\text{MMD}^{2}(g_{\phi}(\mathbf{Z}_{\theta}),\mathbf{X})$ is a weaker approximation for $\mathcal{M}\left(\mathcal{D}_{g_{\phi}(f_{\theta}(\mathbf{x}))},\mathcal{D}_{\mathbf{x}}\right)$ , because it does not exploit the fact that the samples in $\mathbf{Z}_{\theta}$ and $\mathbf{X}$ are paired. Thus, the projection of Double-MMD RGP cannot preserve sufficient information of $\mathbf{X}$ , which will reduce the detection accuracy. Indeed, as shown by the experimental results in Section V-F, our original RGP outperforms Double-MMD RGP.

IV-B Sinkhorn Distance based RGP

Besides MMD, the optimal transport theory can also be used to construct a notion of distance between pairs of probability distributions. In particular, the Wasserstein distance [46], also known as “Earth Mover’s Distance”, has appealing theoretical properties and a very intuitive formulation

\mathcal{W}=\langle\gamma^{*},\mathbf{C}\rangle_{F}

(15)

where $\mathbf{C}$ denotes a metric cost matrix and $\gamma^{*}$ is the optimal transport plan. Finding the optimal transport plan $\gamma^{*}$ can be computationally hard; in particular, the cost of computing the Wasserstein distance can quickly become prohibitive as the data dimension increases. To speed up the calculation, Cuturi [47] proposed the Sinkhorn distance, which regularizes the optimal transport problem with an entropic penalty and uses Sinkhorn’s algorithm [48] to approximate the Wasserstein distance.

Now, if we replace the first term in (6) with the Sinkhorn distance [47], we obtain a new optimization objective

\begin{aligned}\mathop{\text{minimize}}_{\theta,\phi}\quad&\langle\gamma,\mathcal{M}(\mathbf{Z}_{\theta},\mathbf{Z}_{T})\rangle_{F}+\epsilon\sum_{i,j}\gamma_{ij}\log(\gamma_{ij})+\frac{\lambda}{n}\sum_{i=1}^{n}\|\mathbf{x}_{i}-g_{\phi}(f_{\theta}(\mathbf{x}_{i}))\|^{2}\\ \text{subject to}\quad&\gamma\mathbf{1}=\mathbf{a},~\gamma^{T}\mathbf{1}=\mathbf{b},~\gamma\geq 0\end{aligned}

(16)

where $\mathcal{M}(\mathbf{Z}_{\theta},\mathbf{Z}_{T})$ denotes the metric cost matrix between $\mathbf{Z}_{\theta}$ and $\mathbf{Z}_{T}$ , $\epsilon$ is the coefficient of entropic regularization term, $\mathbf{a}$ and $\mathbf{b}$ are two probability vectors and satisfy $\mathbf{a}^{T}\mathbf{1}=1$ and $\mathbf{b}^{T}\mathbf{1}=1$ respectively. We call this method Sinkhorn RGP.
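For a fixed cost matrix, the entropy-regularized plan $\gamma$ in (16) is commonly computed with Sinkhorn's matrix-scaling iterations. The sketch below shows these standard iterations in plain Python; it is not the authors' implementation, and the names `sinkhorn_plan` and `transport_cost` as well as the defaults `eps=0.1` and `n_iters=200` are assumptions. A practical version would work in the log domain to avoid underflow when the costs are large relative to `eps`.

```python
import math


def sinkhorn_plan(C, a, b, eps=0.1, n_iters=200):
    """Approximate the entropic optimal transport plan gamma.

    C is an n-by-m cost matrix (list of lists); a and b are probability
    vectors of lengths n and m. The plan has the form
    diag(u) K diag(v) with K = exp(-C / eps).
    """
    n, m = len(a), len(b)
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows to match a, then columns to match b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(C, gamma):
    """Frobenius inner product <gamma, C>."""
    return sum(C[i][j] * gamma[i][j]
               for i in range(len(C)) for j in range(len(C[0])))


if __name__ == "__main__":
    # Squared distances between two small 1-D point sets.
    xs, ys = [0.0, 1.0, 2.0], [0.1, 1.1, 2.1]
    C = [[(x - y) ** 2 for y in ys] for x in xs]
    a = [1.0 / 3] * 3
    b = [1.0 / 3] * 3
    gamma = sinkhorn_plan(C, a, b, eps=0.1)
    print(transport_cost(C, gamma))
```

In a training loop, the cost matrix would be recomputed from $\mathbf{Z}_{\theta}$ and $\mathbf{Z}_{T}$ at every step, with gradients passed through the iterations or through the resulting plan.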

Compared to MMD, Sinkhorn distance is more effective in quantifying the difference between two distributions using their finite samples. Therefore, the Sinkhorn RGP usually has better performance than our original RGP (6), which will be shown by the experimental results in Section V-F.

V Experiments

V-A Datasets and Baselines

We compare the proposed method with several state-of-the-art methods of anomaly detection on five tabular datasets and three widely-used image datasets for one-class classification. The datasets are detailed as follows.

• Abalone [49] (http://archive.ics.uci.edu/ml/datasets/Abalone) is a dataset of physical measurements of abalone used to predict age. It contains 1,920 instances with 8 attributes.
• Arrhythmia [50] (http://odds.cs.stonybrook.edu/arrhythmia-dataset/) is an ECG dataset. It was used to identify arrhythmic samples in five classes and contains 452 instances with 279 attributes.
• Thyroid [50] (http://odds.cs.stonybrook.edu/thyroid-disease-dataset/) is a hypothyroid disease dataset that contains 3,772 instances with 6 attributes.
• KDD [51] (https://kdd.ics.uci.edu/databases/kddcup99/) is the KDDCUP99 10 percent dataset from the UCI repository and contains 34 continuous attributes and 7 categorical attributes. The attack samples are regarded as normal data, and the non-attack samples are regarded as abnormal data.
• KDDRev is derived from the KDDCUP99 10 percent dataset. The non-attack samples are regarded as normal data, and the attack samples are regarded as abnormal data.
• MNIST [52] (http://yann.lecun.com/exdb/mnist/) is a well-known dataset of handwritten digits and contains a total of 70,000 grey-scale images in 10 classes, one for each digit from 0 to 9.
• Fashion-MNIST [53] (https://www.kaggle.com/datasets/zalando-research/fashionmnist) contains 70,000 grey-scale fashion images (e.g. T-shirt and bag) in 10 classes.
• CIFAR-10 [54] (https://www.cs.toronto.edu/~kriz/cifar.html) is a widely-used benchmark for image anomaly detection. It contains 60,000 color images in 10 classes.

We compare our method with three classic shallow models, five deep autoencoder based methods, three deep generative model based methods, and several recent anomaly detection methods.

• Classic shallow models: local outlier factor (LOF) [55], one-class support vector machine (OC-SVM) [5], isolation forest (IF) [7].
• Deep autoencoder based methods: denoising autoencoder (DAE) [22], DCAE [56], E2E-AE, DAGMM [12], DCN [57].
• Deep generative model based methods: AnoGAN [58], ADGAN [29], OCGAN [32].
• Recent AD methods: DeepSVDD [10], GOAD [59], DROCC [33], HRN [60], SCADN [35], NeuTraL AD [14], GOCC [26], PLAD [61], MOCCA [62].

TABLE II: Average AUC(%) of one-class anomaly detection on Fashion-MNIST. For the competing methods we report only their mean performance due to the space limit, while for the proposed methods we also report the standard deviation (in parentheses). ‘*’ denotes results obtained by running the officially released code, and the best two results are marked in bold.

Normal Class	T-shirt	Trouser	Pullover	Dress	Coat	Sandal	Shirt	Sneaker	Bag	Ankle-boot
OC-SVM[5]	86.10	93.90	85.60	85.90	84.60	81.30	78.60	97.60	79.50	97.80
IF[7]	91.00	97.80	87.20	93.20	90.50	93.00	80.20	98.20	88.70	95.40
DAE[22]	86.70	97.80	80.80	91.40	86.50	92.10	73.80	97.70	78.20	96.30
DAGMM[12]	42.10	55.10	50.40	57.00	26.90	70.50	48.30	83.50	49.90	34.00
ADGAN[29]	89.90	81.90	87.60	91.20	86.50	89.60	74.30	97.20	89.00	97.10
OCGAN[32]	85.50	93.40	85.00	88.10	85.80	88.50	77.50	93.90	82.70	97.80
DeepSVDD[10]	79.10	94.00	83.00	82.90	87.00	80.30	74.90	94.20	79.10	93.20
$\text{DROCC}^{*}$ [33]	88.32	97.94	87.31	87.89	86.53	91.80	77.64	95.37	81.35	94.75
HRN[60]	92.70	98.50	88.50	93.10	92.10	91.30	79.80	99.00	94.60	98.80
PLAD[61]	93.10	98.60	90.20	93.70	92.80	96.00	82.00	98.60	90.90	99.10
RGP-GiHS (Ours)	92.79 (0.40)	98.10 (0.27)	90.45 (1.28)	94.30 (0.57)	91.71 (0.30)	96.09 (0.67)	85.91 (0.39)	98.58 (0.08)	92.67 (1.10)	97.11 (0.23)
RGP-UiHS (Ours)	92.48 (0.78)	98.31 (0.19)	89.81 (1.19)	94.81 (0.74)	89.30 (1.95)	95.75 (0.24)	85.95 (0.59)	98.54 (0.08)	92.25 (0.79)	94.00 (1.10)
RGP-UbHS (Ours)	92.83 (0.68)	97.88 (0.61)	90.19 (1.02)	94.87 (0.34)	91.97 (0.78)	96.32 (0.18)	85.76 (0.48)	98.67 (0.13)	91.32 (1.05)	94.93 (1.00)
RGP-UoHS (Ours)	94.85 (0.18)	98.94 (0.09)	92.39 (0.24)	95.71 (0.33)	93.12 (0.39)	94.71 (0.65)	86.98 (0.33)	99.16 (0.11)	94.16 (0.25)	97.45 (0.71)

TABLE III: Average AUC (%) of one-class anomaly detection on CIFAR-10. The notation is the same as in Table II.

Normal Class	Airplane	Automobile	Bird	Cat	Deer	Dog	Frog	Horse	Ship	Truck
OC-SVM[5]	61.10	63.80	50.00	55.90	66.00	62.40	74.70	62.60	74.90	75.90
IF[7]	66.10	43.70	64.30	50.50	74.30	52.30	70.70	53.00	69.10	53.20
DCAE[56]	59.10	57.40	48.90	58.40	54.00	62.20	51.20	58.60	76.80	67.30
DAE[22]	41.10	47.80	61.60	56.20	72.80	51.30	68.80	49.70	48.70	37.80
DAGMM[12]	41.40	57.10	53.80	51.20	52.20	49.30	64.90	55.30	51.90	54.20
AnoGAN[58]	67.10	54.70	52.90	54.50	65.10	60.30	58.50	62.50	75.80	66.50
ADGAN[29]	63.20	52.90	58.00	60.60	60.70	65.90	61.10	63.00	74.40	64.20
OCGAN[32]	75.70	53.10	64.00	62.00	72.30	62.00	72.30	57.50	82.00	55.40
DeepSVDD[10]	61.70	65.90	50.80	59.10	60.90	65.70	67.70	67.30	75.90	73.10
$\text{DROCC}^{*}$ [33]	80.10	73.41	68.78	63.36	70.81	65.01	68.83	71.13	63.81	75.49
HRN[60]	77.30	69.90	60.60	64.40	71.50	67.40	77.40	64.90	82.50	77.30
$\text{MOCCA}_{(h)}$ [62]	66.00	70.50	52.40	60.10	60.90	68.40	67.10	68.50	79.20	75.80
$\text{MOCCA}_{(s)}$ [62]	62.60	74.60	57.50	57.80	61.50	66.30	67.40	72.10	79.10	77.30
RGP-GiHS (Ours)	77.01 (0.61)	68.56 (0.34)	62.57 (0.82)	63.06 (0.29)	70.72 (1.28)	68.78 (0.76)	80.51 (0.95)	67.92 (0.61)	80.50 (1.13)	73.06 (1.30)
RGP-UiHS (Ours)	76.07 (1.92)	70.66 (0.23)	67.20 (0.34)	64.72 (2.67)	70.38 (0.51)	67.63 (0.39)	80.25 (0.94)	69.44 (0.82)	81.19 (0.96)	74.89 (0.24)
RGP-UbHS (Ours)	77.66 (0.37)	68.76 (1.23)	65.29 (0.32)	64.40 (1.56)	69.89 (1.16)	68.00 (0.95)	80.75 (0.18)	68.79 (0.75)	82.17 (0.56)	73.87 (0.81)
RGP-UoHS (Ours)	78.09 (0.98)	67.71 (0.64)	61.07 (0.95)	66.48 (0.30)	69.70 (0.22)	68.37 (0.66)	80.14 (0.66)	70.9 (0.37)	83.27 (0.28)	74.10 (0.46)

V-B Implementation Details and Evaluation Metrics

In this section, we introduce the implementation details of the proposed method RGP and describe the experimental settings for the image and tabular datasets. Note that our method neither uses any abnormal data during training nor relies on any pre-trained feature extractor.

For the five tabular datasets (Abalone, Arrhythmia, Thyroid, KDD, KDDRev), $f_{\theta}$ and $g_{\phi}$ are both MLPs. We follow the dataset preparation of [12] to preprocess the tabular datasets for the one-class classification task. The hyper-parameter $\lambda$ is set to 1.0 for Abalone, Arrhythmia, and Thyroid, and to 0.0001 for KDD and KDDRev.

For the three image datasets (MNIST, Fashion-MNIST, CIFAR-10), $f_{\theta}$ and $g_{\phi}$ are both CNNs. Since each of the three image datasets contains 10 different classes, we conduct 10 independent one-class classification tasks on each dataset: one class is regarded as normal data and the remaining nine classes are regarded as abnormal data. In each task on MNIST, there are about 6,000 training samples and 10,000 testing samples. In each task on CIFAR-10, there are 5,000 training samples and 10,000 testing samples. In each task on Fashion-MNIST, there are 6,000 training samples and 10,000 testing samples. The hyper-parameter $\lambda$ is chosen from $\{1.0,0.5,0.1,0.01,0.001,0.0001\}$ and varies across classes.
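The one-vs-rest protocol described above can be sketched as follows. This is a minimal illustration that assumes the images and labels are already available as NumPy arrays. The function name `make_one_class_task`, the label convention (0 for normal, 1 for abnormal), and the random toy data are assumptions made here for illustration and are not taken from the original implementation.

```python
import numpy as np

def make_one_class_task(x_train, y_train, x_test, y_test, normal_class):
    """Build one one-class task: train on a single class, test on all classes."""
    # Training uses only samples of the normal class (no abnormal data).
    train_normal = x_train[y_train == normal_class]
    # The full test set is kept; labels become 0 (normal) or 1 (abnormal).
    test_labels = (y_test != normal_class).astype(int)
    return train_normal, x_test, test_labels

# Toy stand-in for a 10-class image dataset (random data, illustration only).
rng = np.random.default_rng(0)
x_train = rng.random((1000, 28, 28))
y_train = np.arange(1000) % 10
x_test = rng.random((200, 28, 28))
y_test = np.arange(200) % 10

# Ten independent tasks, one per normal class.
tasks = [make_one_class_task(x_train, y_train, x_test, y_test, c)
         for c in range(10)]
```

Each entry of `tasks` holds the normal-only training set, the unchanged test set, and the binary test labels for one choice of normal class.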

For the radius $r$ of GiHS and UiHS, we first generate a large number $N$ of samples from the Gaussian or uniform distribution, sort the samples by their $\ell_{2}$ norms, and set $r$ to the $pN$-th smallest $\ell_{2}$ norm, where $p=0.9$. For UbHS, we use the same procedure to determine an outer radius $r$ with $p=0.95$ and an inner radius $r^{\prime}$ with $p=0.05$. Note that $\{r,r^{\prime}\}$ do not depend on the actual data; they are determined purely by the target distribution. In each iteration (mini-batch) of the optimization for all four target distributions, we resample $\mathbf{Z}_{T}$ according to $r$. For UoHS, we draw samples from a Gaussian and normalize them to unit $\ell_{2}$ norm, so that they are uniformly distributed on the unit hypersphere. This procedure is repeated in each iteration (mini-batch) of the optimization.

For the hyper-parameter $k$ used at the testing stage, we select $k=3$ for Thyroid, Arrhythmia, KDD, and KDDRev, and $k=5$ for Abalone. For the three image datasets, $k$ is chosen from $\{1,3,5,10\}$ and varies across classes.

We use Adam [63] as the optimizer. For MNIST, Fashion-MNIST, CIFAR-10, Arrhythmia, and KDD, the learning rate is set to $0.0001$. For Abalone, Thyroid, and KDDRev, the learning rate is set to $0.001$. Table IV shows the detailed implementation settings of RGP on all datasets. All experiments were run on an AMD EPYC CPU with 64 cores and an NVIDIA Tesla A100 GPU with CUDA 11.6.
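The radius selection and target sampling described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the text does not specify how $\mathbf{Z}_{T}$ is resampled given $r$, so rejection sampling is assumed here; the base uniform distribution is assumed to be uniform on $[-1,1]^{d}$; and all function names, the latent dimension, and the batch size are hypothetical.

```python
import numpy as np

def radius_from_quantile(samples, p):
    """Return the pN-th smallest l2 norm among the N given samples."""
    norms = np.sort(np.linalg.norm(samples, axis=1))
    # 1-indexed rank pN; the small offset guards against float round-up.
    k = int(np.ceil(p * len(norms) - 1e-9))
    return norms[min(max(k, 1), len(norms)) - 1]

def sample_within(base_sampler, n, r_outer, r_inner=0.0):
    """Assumed rejection sampling: keep draws with r_inner <= ||z|| <= r_outer."""
    kept, count = [], 0
    while count < n:
        z = base_sampler(2 * n)
        norms = np.linalg.norm(z, axis=1)
        z = z[(norms >= r_inner) & (norms <= r_outer)]
        kept.append(z)
        count += len(z)
    return np.concatenate(kept)[:n]

def sample_on_sphere(n, dim, rng):
    """UoHS: normalize Gaussian draws to unit l2 norm."""
    z = rng.standard_normal((n, dim))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

rng = np.random.default_rng(0)
dim, big_n, batch = 4, 100_000, 256          # hypothetical sizes
gauss = lambda n: rng.standard_normal((n, dim))
unif = lambda n: rng.uniform(-1.0, 1.0, size=(n, dim))  # assumed base uniform

# Radii depend only on the target distribution, not on the data.
r_gihs = radius_from_quantile(gauss(big_n), p=0.9)
pool = unif(big_n)
r_uihs = radius_from_quantile(pool, p=0.9)
r_out = radius_from_quantile(pool, p=0.95)
r_in = radius_from_quantile(pool, p=0.05)

# One mini-batch of target samples per variant (redrawn every iteration).
z_gihs = sample_within(gauss, batch, r_outer=r_gihs)
z_uihs = sample_within(unif, batch, r_outer=r_uihs)
z_ubhs = sample_within(unif, batch, r_outer=r_out, r_inner=r_in)
z_uohs = sample_on_sphere(batch, dim, rng)
```

Because the radii are computed once from the target distribution alone, only the sampling calls need to be repeated in each training iteration.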

TABLE IV: The detailed implementation settings of RGP on all datasets.

Datasets	features	latent dimension	learning rate
Thyroid	6	4	0.001
Abalone	8	4	0.001
KDD	121	64	0.0001
KDDRev	121	64	0.001
Arrhythmia	279	128	0.0001
MNIST	28 $\times$ 28 $\times$ 1	128	0.0001
Fashion-MNIST	28 $\times$ 28 $\times$ 1	128	0.0001
CIFAR-10	32 $\times$ 32 $\times$ 3	128	0.0001

To evaluate all methods, we follow previous works such as [10] and [12] and use AUC (area under the ROC curve) for image datasets and F1-score for tabular datasets. On the tabular datasets, most of the strong baselines, such as DROCC [33], NeuTraL AD [14], and GOCC [26], report the F1-score, and we follow this convention. In our method, we obtain the threshold by calculating the dispersion of the training data in the latent space. Specifically, we first calculate the scores $s(\mathbf{X})$ on the training data $\mathbf{X}$ using (12) or (13), then sort $s(\mathbf{X})$ in ascending order and set the threshold to the $pN$-th smallest score, where $p$ is a probability that varies across datasets.
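The thresholding rule above can be sketched as follows. This is a minimal illustration, assuming that a larger score means a more anomalous sample and that test labels use 1 for abnormal. The function names, the synthetic scores, and the value $p=0.95$ are hypothetical; the scores themselves would come from (12) or (13).

```python
import numpy as np

def threshold_from_training_scores(train_scores, p):
    """Sort training scores ascending and return the pN-th smallest one."""
    scores = np.sort(np.asarray(train_scores, dtype=float))
    # 1-indexed rank pN; the small offset guards against float round-up.
    k = int(np.ceil(p * len(scores) - 1e-9))
    return scores[min(max(k, 1), len(scores)) - 1]

def f1_score_binary(y_true, y_pred):
    """F1-score with the abnormal class (label 1) as the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Synthetic scores standing in for s(X); illustration only.
rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, size=1000)           # normal training data
test_scores = np.concatenate([rng.normal(0.0, 1.0, 300),  # normal test data
                              rng.normal(4.0, 1.0, 100)]) # abnormal test data
test_labels = np.concatenate([np.zeros(300, int), np.ones(100, int)])

tau = threshold_from_training_scores(train_scores, p=0.95)  # hypothetical p
pred = (test_scores > tau).astype(int)  # flag scores above the threshold
f1 = f1_score_binary(test_labels, pred)
```

The threshold is computed from normal training data only, so no abnormal sample is needed to set it.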

V-C Results on Image Datasets

Tables II and III show the comparison results on Fashion-MNIST and CIFAR-10, respectively. We have the following observations.

•

First, compared with classic shallow methods such as OC-SVM [5] and IF [7], RGP has significantly higher AUC scores on almost all classes of Fashion-MNIST (the exception is ‘Ankle-boot’, where OC-SVM reaches 97.80) and on most classes of CIFAR-10. Interestingly, most deep learning based methods are inferior to IF [7] on class ‘Sandal’ of Fashion-MNIST, and IF [7] outperforms all deep learning based methods, including ours, on class ‘Deer’ of CIFAR-10.
•

Our methods outperform the deep autoencoder based methods and generative model based methods in most cases and are competitive with the state-of-the-art in all cases.
•

Among the four variants, RGP with the UoHS target (uniform distribution on hypersphere) performs best on most classes of Fashion-MNIST and CIFAR-10.

TABLE V: Average AUC (%) over all 10 classes of each image dataset. The best two results in each case are marked in bold.

Methods	MNIST	Fashion-MNIST	CIFAR-10
OC-SVM	91.28	87.09	64.72
IF	92.29	91.52	59.72
DAE	-	88.13	53.57
DAGMM	-	51.77	53.13
AnoGAN	91.27	-	61.79
DeepSVDD	94.79	84.77	64.81
DROCC	-	88.89	70.07
HRN	97.59	92.84	71.32
$\text{MOCCA}_{(h)}$	-	-	66.90
$\text{MOCCA}_{(s)}$	-	-	67.60
RGP-GiHS	93.75	93.77	71.26
RGP-UiHS	94.02	93.12	72.24
RGP-UbHS	93.60	93.47	71.97
RGP-UoHS	95.81	94.74	71.98

Table V shows the average performance on MNIST, Fashion-MNIST, and CIFAR-10 over all 10 classes to provide an overall comparison. We see that RGP achieves the best average AUC on Fashion-MNIST and CIFAR-10 among all competing methods. The four variants of RGP have relatively close average performance on all three image datasets. Per-class results on MNIST are reported in the Appendix.

TABLE VI: Average F1-scores (%) with standard deviation on five tabular datasets. ‘*’ denotes that we ran the officially released code of NeuTraL AD to obtain the result on Abalone; its results on Arrhythmia and Thyroid are taken from the original paper [14]. The best two results are marked in bold.

Methods	Abalone	Arrhythmia	Thyroid	KDD	KDDRev
OC-SVM [5]	48.00 $\pm$ 0.00	46.00 $\pm$ 0.00	39.00 $\pm$ 1.00	79.50	83.20
LOF [55]	33.00 $\pm$ 1.00	51.00 $\pm$ 1.00	54.00 $\pm$ 1.00	83.80	90.60
DCN [57]	40.00 $\pm$ 1.00	38.00 $\pm$ 3.00	33.00 $\pm$ 3.00	-	-
E2E-AE [12]	33.00 $\pm$ 3.00	45.00 $\pm$ 3.00	13.00 $\pm$ 4.00	-	-
DAGMM [12]	20.00 $\pm$ 3.00	49.00 $\pm$ 3.00	49.00 $\pm$ 4.00	93.70	93.80
DeepSVDD [10]	62.00 $\pm$ 1.00	54.00 $\pm$ 1.00	73.00 $\pm$ 0.00	99.00 $\pm$ 0.10	98.60 $\pm$ 0.20
GOAD [59]	61.00 $\pm$ 2.00	51.00 $\pm$ 2.00	72.00 $\pm$ 1.00	98.40 $\pm$ 0.20	98.90 $\pm$ 0.30
DROCC [33]	68.00 $\pm$ 2.00	69.00 $\pm$ 2.00	78.00 $\pm$ 3.00	-	-
$\text{NeuTraL AD}^{*}$ [14]	62.07 $\pm$ 2.81	60.30 $\pm$ 1.10	76.80 $\pm$ 1.90	99.30 $\pm$ 0.10	99.10 $\pm$ 0.10
GOCC [26]	-	61.80 $\pm$ 1.80	76.80 $\pm$ 1.20	99.40 $\pm$ 0.10	99.20 $\pm$ 0.30
RGP-GiHS (Ours)	91.25 $\pm$ 1.92	81.22 $\pm$ 0.50	97.58 $\pm$ 0.48	99.29 $\pm$ 0.10	98.99 $\pm$ 0.02
RGP-UiHS (Ours)	90.38 $\pm$ 1.87	81.02 $\pm$ 0.81	97.09 $\pm$ 0.27	99.28 $\pm$ 0.19	98.96 $\pm$ 0.07
RGP-UbHS (Ours)	90.20 $\pm$ 2.32	81.00 $\pm$ 0.67	97.17 $\pm$ 0.55	99.13 $\pm$ 0.31	98.99 $\pm$ 0.03
RGP-UoHS (Ours)	89.59 $\pm$ 1.52	80.97 $\pm$ 0.62	97.38 $\pm$ 0.36	99.43 $\pm$ 0.01	99.07 $\pm$ 0.03

V-D Results on Tabular Datasets

In Table VI, we report the F1-scores of our methods in comparison to ten baselines on the five tabular datasets. Our four variants of RGP significantly outperform all baseline methods on Abalone, Arrhythmia, and Thyroid. In particular, RGP-GiHS improves the F1-score by $23.25$, $12.22$, and $19.58$ percentage points on these three datasets, respectively, compared to the runner-up (DROCC [33]). It is worth mentioning that NeuTraL AD [14] and GOCC [26] are both specially designed for non-image data but are outperformed by our methods in most cases. Compared with the image datasets, the performance improvements of RGP on these three tabular datasets are more pronounced. One possible reason is that, compared to image data, tabular data are easier to convert to a compact target distribution. We also report the AUC scores on the Abalone, Thyroid, and Arrhythmia datasets in the Appendix.

In addition to the quantitative results, we choose Thyroid (with 6 attributes) as an example and transform the data distribution to 2-dimensional target distributions, which are visualized in Figure 3. Plots (a), (b), (c), and (d) in Figure 3 correspond to GiHS, UiHS, UbHS, and UoHS, respectively. The blue, orange, green, and red points denote samples from the target distribution, samples from the training data, normal samples from the test set, and abnormal samples from the test set, respectively. For clearer illustration, the left figure in each plot shows all four kinds of instances, and the right figure shows only the normal and abnormal samples from the test set. We see that RGP is effective at transforming the data distribution to the restricted target distributions, although the transformed data do not exactly match the target distributions (which also demonstrates the necessity of using the ‘soft boundary’ defined by (13)).

V-E Comparison between ‘soft’ and ‘hard’ boundary

We further explore the performance of the two different anomaly scores. Specifically, we compare the ‘hard boundary’ (12) and the ‘soft boundary’ (13) as anomaly scores during the test stage on the image and tabular datasets. The results are shown in Figures 4, 5, and 6. Using the ‘soft boundary’ (13) to calculate the anomaly score performs better than using the ‘hard boundary’ (12) on most classes of the image and tabular datasets. Nevertheless, the ‘hard boundary’ still achieves remarkable performance on some classes. For example, on the class ‘Ankle-boot’ of Fashion-MNIST and the class ‘Truck’ of CIFAR-10, the best two results are both from RGP variants that use the ‘hard boundary’ (12) to calculate the anomaly score.

V-F Experiments of Double-MMD RGP and Sinkhorn RGP

We use Double-MMD RGP (14) to conduct experiments and report the results in Tables VII and VIII. On the image datasets, we consider only the target distribution UoHS (Uniform on HyperSphere) for simplicity. On the tabular datasets, we conduct experiments with all four proposed target distributions.

From the results in Tables VII and VIII, we find that Double-MMD RGP and the original RGP have similar performance on the three tabular datasets, whereas on the image datasets Fashion-MNIST and CIFAR-10 there is an apparent performance gap, despite tuning $\lambda$ over a wide range $\{10.0,5.0,1.0,0.5,0.1,0.01\}$ for Double-MMD RGP (14). Note that Table VII reports the average AUC (%) over all classes of Fashion-MNIST and CIFAR-10; the per-class results are provided in the Appendix.

TABLE VII: Average AUC (%) over all classes of Fashion-MNIST and CIFAR-10 using two different optimization objectives under UoHS.

Objective	$\lambda$	Fashion-MNIST	CIFAR-10
Double-MMD RGP (14)	10.0	80.34	65.45
Double-MMD RGP (14)	5.0	77.23	66.34
Double-MMD RGP (14)	1.0	79.95	66.60
Double-MMD RGP (14)	0.5	79.68	66.10
Double-MMD RGP (14)	0.1	79.08	69.08
Double-MMD RGP (14)	0.01	77.47	67.19
Original RGP (6)	-	94.74	71.98

We attribute this phenomenon to two factors. First, the tabular datasets in our experiments have far fewer features (no more than 279) than the image datasets, and the second term of (14) is a much weaker constraint for preserving data information than that of (6). As a consequence, Double-MMD RGP (14) preserves enough key information on the tabular data but loses much more important information on the image data than the original RGP (6). Second, the generalization error of MMD for high-dimensional samples or distributions is often larger than that for low-dimensional ones. For MMD to accurately measure the distance between two high-dimensional distributions, the sample sizes should be sufficiently large.

We also use Sinkhorn RGP (16) to conduct experiments on the Abalone, Arrhythmia, and Thyroid datasets; the results are reported in Table VIII. In all implementations, $\epsilon$ is set to $0.01$ and the marginals $\mathbf{a}$ and $\mathbf{b}$ are uniform. As expected, the performance of Sinkhorn RGP (16) is similar to or better than that of the original RGP (6) for all four target distributions, whereas its time cost is much higher. We do not experiment with Sinkhorn RGP on the image datasets because the time cost is too high.
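To make the setting $\epsilon=0.01$ with uniform marginals concrete, the following is a minimal log-domain Sinkhorn sketch in NumPy. It is not the authors' implementation of (16): the squared Euclidean cost, the fixed number of iterations, the helper names, and the toy data are assumptions made here for illustration.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def sinkhorn_cost(x, y, eps=0.01, n_iters=200):
    """Entropy-regularized OT between two point sets with uniform marginals."""
    n, m = len(x), len(y)
    # Assumed cost: squared Euclidean distance between all pairs.
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=2)
    log_a = np.full(n, -np.log(n))   # uniform marginal a
    log_b = np.full(m, -np.log(m))   # uniform marginal b
    f, g = np.zeros(n), np.zeros(m)  # dual potentials
    for _ in range(n_iters):
        # Alternating updates that enforce the row and column marginals.
        f = -eps * logsumexp((g[None, :] - cost) / eps + log_b[None, :], axis=1)
        g = -eps * logsumexp((f[:, None] - cost) / eps + log_a[:, None], axis=0)
    log_plan = (f[:, None] + g[None, :] - cost) / eps \
        + log_a[:, None] + log_b[None, :]
    plan = np.exp(log_plan)
    return float(np.sum(plan * cost)), plan

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 2))
cost_same, plan_same = sinkhorn_cost(x, x)
cost_shift, plan_shift = sinkhorn_cost(x, x + 5.0)
```

The updates are carried out in the log domain because, with $\epsilon=0.01$, directly exponentiating $-C/\epsilon$ underflows for moderately large costs. Each iteration works on the full pairwise cost matrix, which is one reason such a procedure becomes expensive on larger problems.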

TABLE VIII: Average AUC (%) of Double-MMD RGP, Sinkhorn RGP, and the original RGP on three tabular datasets. The best result of each optimization objective is marked in bold.

Objective	Variant	Abalone	Arrhythmia	Thyroid
Original RGP	RGP-GiHS	93.65	82.79	98.95
Original RGP	RGP-UiHS	95.64	82.90	99.06
Original RGP	RGP-UbHS	94.93	82.70	98.92
Original RGP	RGP-UoHS	94.95	82.89	98.93
Sinkhorn RGP	RGP-GiHS	95.19	81.51	98.94
Sinkhorn RGP	RGP-UiHS	94.72	82.37	98.85
Sinkhorn RGP	RGP-UbHS	95.41	83.31	98.97
Sinkhorn RGP	RGP-UoHS	95.17	83.20	98.99
Double-MMD RGP	RGP-GiHS	94.91	82.26	98.53
Double-MMD RGP	RGP-UiHS	94.83	82.19	98.69
Double-MMD RGP	RGP-UbHS	93.88	82.28	98.73
Double-MMD RGP	RGP-UoHS	92.60	80.73	98.89

V-G Ablation Study

V-G1 The Gaussian Kernel Function for MMD

We use the Gaussian kernel $\exp(-\gamma\|\mathbf{x}-\mathbf{y}\|^{2})$ for MMD in the optimization objective and set $\gamma=\frac{1}{d^{2}}$ in all experiments, where $d=\frac{1}{n(n-1)}\sum^{n}_{i=1}\sum^{n}_{j=1}\|\mathbf{x}_{i}-\mathbf{x}_{j}\|$ denotes the mean Euclidean distance among all training samples.
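The bandwidth rule $\gamma=1/d^{2}$ and a Gaussian-kernel MMD can be sketched as follows. This is a minimal illustration: the biased (V-statistic) estimator of the squared MMD is an assumption made here and may differ from the estimator used in the optimization objective, and all function names and the toy data are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(x, y):
    """Squared Euclidean distances between all rows of x and all rows of y."""
    sq = np.sum(x ** 2, axis=1)[:, None] + np.sum(y ** 2, axis=1)[None, :]
    return np.maximum(sq - 2.0 * x @ y.T, 0.0)

def mean_pairwise_distance(x):
    """d = 1/(n(n-1)) * sum_{i,j} ||x_i - x_j|| (diagonal terms are zero)."""
    n = len(x)
    d2 = pairwise_sq_dists(x, x)
    np.fill_diagonal(d2, 0.0)  # remove round-off on the diagonal
    return np.sqrt(d2).sum() / (n * (n - 1))

def gaussian_kernel(x, y, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2) for all pairs."""
    return np.exp(-gamma * pairwise_sq_dists(x, y))

def mmd2_biased(x, y, gamma):
    """Biased estimate of squared MMD (assumed estimator, see lead-in)."""
    return (gaussian_kernel(x, x, gamma).mean()
            + gaussian_kernel(y, y, gamma).mean()
            - 2.0 * gaussian_kernel(x, y, gamma).mean())

rng = np.random.default_rng(0)
a = rng.standard_normal((200, 4))
b = rng.standard_normal((200, 4))          # same distribution as a
c = rng.standard_normal((200, 4)) + 3.0    # shifted distribution

gamma = 1.0 / mean_pairwise_distance(a) ** 2
mmd_same = mmd2_biased(a, b, gamma)
mmd_diff = mmd2_biased(a, c, gamma)
```

Tying $\gamma$ to the mean pairwise distance keeps the kernel on a scale comparable to the data, so the same rule can be applied across datasets with different feature ranges.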

TABLE IX: Comparison among different $\gamma$ in the Gaussian kernel for MMD. ‘Avg’ denotes the average performance over all ten classes.

Target	$\gamma$	T-shirt	Trouser	Pullover	Dress	Coat	Sandal	Shirt	Sneaker	Bag	Ankle-boot	Avg
GiHS	0.1	90.24	96.68	88.33	93.20	90.42	97.09	86.06	97.32	88.44	93.83	92.16
GiHS	1	90.73	98.22	89.08	92.90	88.12	94.70	87.15	98.24	90.24	98.40	92.77
GiHS	10	89.43	99.01	85.96	93.54	87.92	94.90	83.30	97.71	91.84	92.79	91.64
GiHS	100	92.84	98.26	84.80	95.50	86.69	95.16	86.36	98.75	86.78	95.83	92.09
UiHS	0.1	88.14	98.25	88.55	93.86	91.79	94.93	87.40	97.46	86.14	91.46	91.79
UiHS	1	90.49	98.48	90.05	92.77	92.57	95.07	85.11	98.17	88.23	94.60	92.55
UiHS	10	88.62	98.50	88.77	94.08	86.29	93.97	87.27	98.36	94.70	90.53	92.10
UiHS	100	88.62	98.50	88.77	94.08	86.29	93.97	87.27	98.36	94.70	90.53	92.10

To show the influence of $\gamma$, we fix $\gamma$ to each value in $\{0.1,1,10,100\}$ and run experiments on Fashion-MNIST. As shown in Table IX, the results differ in individual cases, but the gaps in the average results are small. This suggests that our methods are not sensitive to $\gamma$.

V-G2 The Coefficient $\lambda$ of Reconstruction Term in Optimization Objective

The coefficient $\lambda$ is a key hyper-parameter in problem (6). We now explore the influence of $\lambda$ on model performance. Figures 7 and 8 show the F1-scores of our methods on the tabular datasets with $\lambda$ varying from 0 to 1000. A $\lambda$ that is too small or too large lowers the performance of RGP. When $\lambda$ is very small, the reconstruction term of (6) has little impact on the training objective, so $f_{\theta}$ can easily transform the training data to the target distribution but ignores the original data distribution (see Figure 9). On the other hand, when $\lambda$ is very large, the MMD term contributes little to the overall objective, and $f_{\theta}$, constrained by the reconstruction term, concentrates on the original data distribution and cannot learn a good mapping from the data distribution to the target distribution. Figure 9 illustrates the influence of $\lambda$ on the training set of the Thyroid dataset. We see that $f_{\theta}$ transforms the training data to the target distribution better as $\lambda$ decreases. The blue and orange points in Figure 9 denote samples from the target distribution and samples from the training data, respectively.

VI Conclusion

We have presented a novel and simple framework for one-class classification and anomaly detection. Our method aims to convert the data distribution to a simple, compact, and informative target distribution that can be easily violated by abnormal data. We presented four target distributions, and the numerical results showed that they have relatively close performance, with uniform on hypersphere being more effective than the others in most cases. Furthermore, we explored two extensions of the original RGP and analyzed the performance differences among them. Importantly, our methods are competitive with state-of-the-art AD methods on all benchmark datasets considered in this paper, and the improvements are remarkable on the tabular datasets.

References

[1] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Computing Surveys (CSUR), vol. 41, no. 3, pp. 1–58, 2009.
[2] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel, “Deep learning for anomaly detection: A review,” ACM Computing Surveys (CSUR), vol. 54, no. 2, pp. 1–38, 2021.
[3] L. Ruff, J. R. Kauffmann, R. A. Vandermeulen, G. Montavon, W. Samek, M. Kloft, T. G. Dietterich, and K.-R. Müller, “A unifying review of deep and shallow anomaly detection,” Proceedings of the IEEE, 2021.
[4] B. Schölkopf, R. C. Williamson, A. Smola, J. Shawe-Taylor, and J. Platt, “Support vector method for novelty detection,” Advances in Neural Information Processing Systems, vol. 12, 1999.
[5] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, “Estimating the support of a high-dimensional distribution,” Neural Computation, vol. 13, no. 7, pp. 1443–1471, 2001.
[6] D. M. Tax and R. P. Duin, “Support vector data description,” Machine Learning, vol. 54, no. 1, pp. 45–66, 2004.
[7] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in Proceedings of the IEEE International Conference on Data Mining. IEEE, 2008, pp. 413–422.
[8] W. Hu, J. Gao, B. Li, O. Wu, J. Du, and S. Maybank, “Anomaly detection using local kernel density estimation and context-based regression,” IEEE Transactions on Knowledge and Data Engineering, vol. 32, no. 2, pp. 218–233, 2018.
[9] Y. Liu, S. Pan, Y. G. Wang, F. Xiong, L. Wang, Q. Chen, and V. C. Lee, “Anomaly detection in dynamic graphs via transformer,” IEEE Transactions on Knowledge and Data Engineering, 2021.
[10] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft, “Deep one-class classification,” in Proceedings of the International Conference on Machine Learning. PMLR, 2018, pp. 4393–4402.
[11] I. Golan and R. El-Yaniv, “Deep anomaly detection using geometric transformations,” Advances in neural information processing systems, vol. 31, 2018.
[12] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen, “Deep autoencoding gaussian mixture model for unsupervised anomaly detection,” in Proceedings of the International Conference on Learning Representations, 2018.
[13] J. Wang, S. Sun, and Y. Yu, “Multivariate triangular quantile maps for novelty detection,” Advances in Neural Information Processing Systems, vol. 32, pp. 5060–5071, 2019.
[14] C. Qiu, T. Pfrommer, M. Kloft, S. Mandt, and M. Rudolph, “Neural transformation learning for deep anomaly detection beyond images,” in Proceedings of the International Conference on Machine Learning. PMLR, 2021, pp. 8703–8714.
[15] L. Huang, Y. Zhu, Y. Gao, T. Liu, C. Chang, C. Liu, Y. Tang, and C.-D. Wang, “Hybrid-order anomaly detection on attributed networks,” IEEE Transactions on Knowledge and Data Engineering, 2021.
[16] D. Hendrycks, M. Mazeika, and T. Dietterich, “Deep anomaly detection with outlier exposure,” arXiv preprint arXiv:1812.04606, 2018.
[17] L. Ruff, R. A. Vandermeulen, N. Görnitz, A. Binder, E. Müller, K.-R. Müller, and M. Kloft, “Deep semi-supervised anomaly detection,” in Proceedings of the International Conference on Learning Representations, 2020.
[18] P. Liznerski, L. Ruff, R. A. Vandermeulen, B. J. Franks, M. Kloft, and K.-R. Müller, “Explainable deep one-class classification,” arXiv preprint arXiv:2007.01760, 2021.
[19] L. Ruff, R. A. Vandermeulen, B. J. Franks, K.-R. Müller, and M. Kloft, “Rethinking assumptions in deep anomaly detection,” arXiv preprint arXiv:2006.00339, 2021.
[20] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[21] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
[22] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1096–1103.
[23] S. Wang, X. Wang, L. Zhang, and Y. Zhong, “Auto-ad: Autonomous hyperspectral anomaly detection network based on fully convolutional autoencoder,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2021.
[24] P. Perera and V. M. Patel, “Learning deep features for one-class classification,” IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5450–5463, 2019.
[25] A. Bhattacharya, S. Varambally, A. Bagchi, and S. Bedathur, “Fast one-class classification using class boundary-preserving random projections,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 66–74.
[26] T. Shenkar and L. Wolf, “Anomaly detection for tabular data with internal contrastive learning,” in Proceedings of the International Conference on Learning Representations, 2022.
[27] Y. Chen, Y. Tian, G. Pang, and G. Carneiro, “Deep one-class classification via interpolated gaussian descriptor,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
[28] P. Malhotra, A. Ramakrishnan, G. Anand, L. Vig, P. Agarwal, and G. Shroff, “Lstm-based encoder-decoder for multi-sensor anomaly detection,” arXiv preprint arXiv:1607.00148, 2016.
[29] L. Deecke, R. Vandermeulen, L. Ruff, S. Mandt, and M. Kloft, “Image anomaly detection with generative adversarial networks,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2018, pp. 3–17.
[30] S. Pidhorskyi, R. Almohsen, and G. Doretto, “Generative probabilistic novelty detection with adversarial autoencoders,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[31] D. T. Nguyen, Z. Lou, M. Klar, and T. Brox, “Anomaly detection with multiple-hypotheses predictions,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 97. PMLR, 09–15 Jun 2019, pp. 4800–4809.
[32] P. Perera, R. Nallapati, and B. Xiang, “Ocgan: One-class novelty detection using gans with constrained latent representations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2898–2906.
[33] S. Goyal, A. Raghunathan, M. Jain, H. V. Simhadri, and P. Jain, “Drocc: Deep robust one-class classification,” in Proceedings of the International Conference on Machine Learning. PMLR, 2020, pp. 3711–3721.
[34] J. Raghuram, V. Chandrasekaran, S. Jha, and S. Banerjee, “A general framework for detecting anomalous inputs to dnn classifiers,” in Proceedings of the International Conference on Machine Learning. PMLR, 2021, pp. 8764–8775.
[35] X. Yan, H. Zhang, X. Xu, X. Hu, and P.-A. Heng, “Learning semantic context from normal samples for unsupervised anomaly detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 3110–3118.
[36] Y. Zheng, M. Jin, Y. Liu, L. Chi, K. T. Phan, and Y.-P. P. Chen, “Generative and contrastive self-supervised learning for graph anomaly detection,” IEEE Transactions on Knowledge and Data Engineering, 2021.
[37] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in Neural Information Processing Systems, vol. 27, 2014.
[38] B. Du, X. Sun, J. Ye, K. Cheng, J. Wang, and L. Sun, “Gan-based anomaly detection for multivariate time series using polluted training set,” IEEE Transactions on Knowledge and Data Engineering, 2021.
[39] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 723–773, 2012.
[40] S. Yoon, Y.-K. Noh, and F. Park, “Autoencoding under normalization constraints,” in International Conference on Machine Learning. PMLR, 2021, pp. 12 087–12 097.
[41] Y. Li, K. Swersky, and R. Zemel, “Generative moment matching networks,” in International conference on machine learning. PMLR, 2015, pp. 1718–1727.
[42] M. Rosenblatt, “Remarks on some nonparametric estimates of a density function,” The annals of mathematical statistics, pp. 832–837, 1956.
[43] A. Pinkus, “Approximation theory of the mlp model in neural networks,” Acta numerica, vol. 8, pp. 143–195, 1999.
[44] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang, “The expressive power of neural networks: A view from the width,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6232–6240.
[45] G. Marsaglia, “Generating a variable from the tail of the normal distribution,” Boeing Scientific Research Labs, Seattle, WA, Tech. Rep., 1963.
[46] L. V. Kantorovich, “Mathematical methods of organizing and planning production,” Management science, vol. 6, no. 4, pp. 366–422, 1960.
[47] M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” Advances in neural information processing systems, vol. 26, 2013.
[48] R. Sinkhorn and P. Knopp, “Concerning nonnegative matrices and doubly stochastic matrices,” Pacific Journal of Mathematics, vol. 21, no. 2, pp. 343–348, 1967.
[49] D. Dua and C. Graff. (2017) UCI machine learning repository. [Online]. Available: http://archive.ics.uci.edu/ml
[50] S. Rayana. (2016) ODDS library. [Online]. Available: http://odds.cs.stonybrook.edu
[51] M. Lichman. (2013) UCI machine learning repository. [Online]. Available: http://archive.ics.uci.edu/ml
[52] Y. LeCun, C. Cortes, and C. Burges. (2010) MNIST handwritten digit database, AT&T Labs. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[53] H. Xiao, K. Rasul, and R. Vollgraf. (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms.
[54] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
[55] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “Lof: identifying density-based local outliers,” in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, pp. 93–104.
[56] P. Seeböck, S. Waldstein, S. Klimscha, B. S. Gerendas, R. Donner, T. Schlegl, U. Schmidt-Erfurth, and G. Langs, “Identifying and categorizing anomalies in retinal imaging data,” arXiv preprint arXiv:1612.00686, 2016.
[57] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 132–149.
[58] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs, “Unsupervised anomaly detection with generative adversarial networks to guide marker discovery,” in International conference on information processing in medical imaging. Springer, 2017, pp. 146–157.
[59] L. Bergman and Y. Hoshen, “Classification-based anomaly detection for general data,” in Proceedings of the International Conference on Learning Representations, 2020.
[60] W. Hu, M. Wang, Q. Qin, J. Ma, and B. Liu, “Hrn: A holistic approach to one class learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 19 111–19 124, 2020.
[61] J. Cai and J. Fan, “Perturbation learning based anomaly detection,” CoRR, vol. abs/2206.02704, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2206.02704
[62] Massoli et al., “Mocca: Multilayer one-class classification for anomaly detection,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 6, pp. 2313–2323, 2022.
[63] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.