
FROCC: Fast Random projection-based One-Class Classification

Arindam Bhattacharya    Sumanth Varambally    Amitabha Bagchi    Srikanta Bedathur
Abstract

We present Fast Random projection-based One-Class Classification (FROCC), an extremely efficient method for one-class classification. Our method is based on a simple idea: transform the training data by projecting it onto a set of random unit vectors chosen uniformly and independently from the unit sphere, and bound the regions based on the separation of the data. FROCC can be naturally extended with kernels. We provide a new theoretical framework to prove that FROCC generalizes well, in the sense that it is stable and has low bias for some parameter settings. FROCC achieves up to 3.1 percentage points better AUC-ROC, with a 1.2–67.8× speedup in training and test times, over a range of state-of-the-art benchmarks including SVM-based and deep learning based models for the OCC task.

1 Introduction

One-class classification (OCC) has attracted a lot of attention over the years under various names such as novelty detection [30], anomaly or outlier detection [34, 29], concept learning [15], and so on. Unlike the classical binary (or multi-class) classification problem, the problem addressed in OCC is stated more simply: identify all instances of a class by using only a (small) set of examples from that class. It is therefore applicable in settings where only “normal” data is known at training time, and the model is expected to accurately describe this normal data. After training, any data which deviates from this normality is treated as an anomaly or a novel item. Real-world applications include intrusion detection [11, 5], medical diagnosis [14], fraud detection [37], etc., where the requirement is to offer not only high precision, but also the ability to scale training and testing time as the volume of (training) data increases.

Primary approaches to OCC include one-class variants of support vector machines (SVM) [30], probabilistic density estimators describing the support of the normal class [6, 27], and, more recently, deep-learning based methods such as auto-encoders (AE) and Generative Adversarial Networks (GAN), which try to learn the distribution of normal data samples during training [29, 24, 28]. However, almost all of these methods are either extremely slow to train or require significant hyper-parameter tuning, or both.

We present an alternative approach based on the idea of using a collection of random projections onto a number of randomly chosen 1-dimensional subspaces (vectors), resulting in a simple yet surprisingly powerful one-class classifier. Random projections down to a single dimension or to a small number of dimensions have been used in the past to speed up approximate nearest neighbor search [17, 2, 13] and other problems that rely on the distance-preserving property guaranteed by the Johnson-Lindenstrauss Lemma, but our problem space is different. We use random projections in an entirely novel fashion based on the following key insight: for the one-class classification problem, the preservation of distances is not important; we only require the class boundary to be preserved. The preservation of distances is a stronger property which implies that the class boundary is preserved, but the converse need not hold. All we need to do is “view” the training data from a “direction” that shows that an outlier is separated from the data.

Formalizing this intuition, we develop an algorithm for one-class classification called FROCC (Fast Random projection-based One-Class Classification) that essentially applies a simplified envelope method to random projections of the training data to provide a very strong classifier. FROCC also comes with a theoretically provable generalization property: the method is stable and has low bias.

At a high level, FROCC works as follows: given only positively labelled data points, we select a set of random vectors from the unit sphere and project all the training data points onto these random vectors. We compactly represent the training data projections along each vector using one (or more) intervals, based on the separation parameter $\varepsilon$. If a test data point’s projection onto every random vector falls within the intervals spanned by the training data projections along that vector, we say it is part of the normal class; otherwise it is an outlier. The parameter $\varepsilon$ breaks the single interval used to represent the projections along a vector into dense clusters with a density lower bound of $1/\varepsilon$, resulting in a collection of intervals along each vector.

The simplicity of the proposed model provides us with a massive advantage in terms of computational cost – during both training and testing – over all the existing methods, ranging from SVMs to deep neural network models. We achieve 1–16× speedup over non-deep baselines like One Class SVM (OCSVM) [30] and Isolation Forest. For state-of-the-art deep learning based approaches like Deep SVDD and Auto Encoder we achieve speedups to the tune of 45–68×. At the same time, the average area under the ROC curve is 0.3–3.1 percentage points better than the state-of-the-art baseline methods on six of the eleven benchmark datasets we used in our empirical study.

Like SVMs, our methods can be seamlessly extended to use kernels. These extensions can take advantage of kernels that may be suitable for certain data distributions, without changing the overall one-class decision framework of FROCC.

Figure 1: Classification boundaries for different variants of FROCC on two toy datasets (Multi-modal Gaussian data and Moon data). Blue-colored points are the “normal” class. Panels: (a) $\varepsilon=1$, linear kernel; (b) $\varepsilon=1$, RBF kernel; (c) $\varepsilon=0.1$, linear kernel; (d) $\varepsilon=0.1$, RBF kernel. ROC on the Multi-modal Gaussian data: (a) 0.816, (b) 0.824, (c) 0.968, (d) 0.836. ROC on the Moon data: (a) 0.734, (b) 0.783, (c) 0.823, (d) 0.911.

To understand the power of our method, consider the one-class decision boundaries of the various FROCC variants on the two toy datasets visualized in Figure 1. We see that FROCC with $\varepsilon=0.1$ computes a very desirable decision boundary for the Multi-modal Gaussian data, a case which is known to be difficult for SVMs to handle. In this case incorporating a standard Gaussian RBF kernel doesn’t improve the ROC for FROCC. In the case of the Moon-shaped data, however, this kernel is able to direct FROCC to better decision boundaries, which is reflected in better ROC scores.

Through a range of experiments we show that in most cases FROCC is able to achieve high scores on the ROC as well as Precision@n measures with up to two orders of magnitude improvement in speed.

Our contributions.
  • We define a powerful and efficient random-projection based one class classifier: FROCC.

  • We show theoretically that FROCC is a stable classification algorithm.

  • We perform an extensive experimental evaluation on several data sets that shows that our methods outperform traditional methods in both quality of results and speed, and that our methods are comparable in quality to recent neural methods while outperforming them in terms of speed by up to two orders of magnitude.

Organization.

We discuss related literature in Sec. 2. Formal definitions of our methods are presented in Sec. 3. In Sec. 4 we provide an analysis of the stability of our methods. The experimental setup and the results of an extensive comparison with several baselines are discussed in Sec. 5. We make some concluding remarks in Sec. 6.

2 Related Work

One Class Classification.

One-class classification has been studied extensively, so we do not attempt to survey the literature in full detail, referring the reader instead to the survey of traditional methods by [16]. A more recent development is the application of deep learning-based methods to this problem. A partial survey of these methods is available in [28].

Unlike deep learning-based methods, our work takes a more traditional geometric view of the one-class problem in a manner similar to [34] and [30], but we do not rely on optimization as a subroutine like their methods do. This gives us an efficiency benefit over these methods in addition to the improvement we get in terms of quality of results. An additional efficiency advantage of our FROCC methods is that we do not need a preprocessing step to re-center the data like these two methods do.

Our methods operate in the paradigm where the training data consists of a fully labeled set of examples only from the normal class. We do not use even a small number of examples from the negative class as done in, e.g., [3]. Neither does our method use both labeled and unlabelled data as in the PAC model [35]. Our methods are randomized in nature, which puts them in the same category as probabilistic methods that attempt to fit a distribution on the positive data. One example of such a method is learning using Parzen window density estimation [6]. We consider one such method as a baseline: [10] apply the Minimum Covariance Determinant Estimator [27] to find the region whose covariance matrix has the lowest determinant, effectively creating an elliptical envelope around the positive data.

We also compare our algorithms to deep learning-based methods. Specifically we compare our methods to the Auto-encoder-based one class classification method proposed by [29].

Another baseline we compare against is the deep learning extension of Support Vector Data Description (SVDD) proposed by [28], where a neural network learns a transformation from input space to a useful space. In both cases our results are comparable and often better in terms of ROC while achieving a high speedup.

Random Projection-based Algorithms.

Random projection methods have been productive in nearest-neighbor contexts due to the fact that it is possible to preserve distances between a set of points if the number of random dimensions chosen is large enough, the so-called Johnson-Lindenstrauss property; cf., e.g., locality-sensitive hashing [12, 2]. Our methods do not look to preserve distances, only the boundary, and, besides, we are not necessarily looking for dimensionality reduction, which is a key goal of the cited methods. A smaller number of projection dimensions may preserve distances between the training points, but it does not make the task of discriminating outliers any easier. It may even be the case that for the classifier to have good accuracy the dimensionality of the projection space must be higher than that of the data space. As such, our problem space is different from the space of problems that look for dimensionality reduction through random projection. [17] uses projections onto a collection of randomly chosen vectors to take a simpler approach than locality-sensitive hashing, but his problem of interest is also approximate nearest neighbor search. Some research has also focused on using random projections to learn mixtures of Gaussians, starting with the work of [7], but again this is a different problem space.

The closest work to ours in flavor is that of [36], who attempts to find $k$ half-spaces whose intersection fully contains a given positively labelled data set. The key difference here is that our methods work with intersections of hyper-rectangles that are unbounded in one dimension, and not half-spaces. While the solution in [36] can be a general convex body in the projection space, ours is specifically limited to cuboids. As such, it is not possible to refine Vempala’s method by taking a parameter $\varepsilon$ and subdividing intervals (cf. Section 3), nor is it clear how it could be regularized for use in real-world applications.

3 Fast Random-Projection based OCC

Definition 1 (FROCC$(S,m,\varepsilon,K)$).

Assume that we are given a set of training points $S=\{\bm{x}_1,\ldots,\bm{x}_n\}\subseteq\mathbb{R}^d$. Then, given an integer parameter $m>0$, a real parameter $\varepsilon\in(0,1]$ and a kernel function $K(\cdot,\cdot)$, the $\varepsilon$-separated Fast Random-projection based One-Class Classifier FROCC$(S,m,\varepsilon,K)$ comprises, for each $i$, $1\leq i\leq m$,

  1. a classifying direction $\bm{w}_i$ that is a unit vector chosen uniformly at random from $\bm{1}_d$, the set of all unit vectors of $\mathbb{R}^d$, independently of all other classifying directions, and

  2. a set of intervals $R_i$ defined as follows: let $S'_i=\{K(\bm{w}_i,\bm{x}_j):1\leq j\leq n\}$ and let $S_i=\{y_1,\ldots,y_n\}$ be a shifted and scaled version of $S'_i$ such that $y_j\in(0,1)$ for $1\leq j\leq n$. Assume $y_1\leq\cdots\leq y_n$. Then each interval of $R_i$ has the property that it is of the form $[y_j,y_{j+k}]$ for some $j\geq 1$, $k\geq 0$ such that

     (a) for all $t$ such that $0\leq t<k$, $y_{j+t+1}-y_{j+t}<\varepsilon$,

     (b) $y_j-y_{j-1}>\varepsilon$ whenever $j-1>0$, and

     (c) $y_{j+k+1}-y_{j+k}>\varepsilon$ whenever $j+k+1\leq n$.

Given a query point $\bm{y}\in\mathbb{R}^d$, FROCC$(S,m,\varepsilon,K)$ returns YES if for every $i$, $1\leq i\leq m$, $K(\bm{w}_i,\bm{y})$ (shifted and scaled in the same way as $S'_i$) lies within some interval of $R_i$.
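As a small worked example of the interval construction, suppose that along some direction $\bm{w}_i$ the shifted and scaled projections are $S_i=\{0.10,0.15,0.22,0.70,0.76\}$ and $\varepsilon=0.3$. The consecutive gaps are $0.05$, $0.07$, $0.48$ and $0.06$; only the gap of $0.48$ exceeds $\varepsilon$, so the points split into two groups and $R_i=\{[0.10,0.22],[0.70,0.76]\}$. A query whose scaled projection along $\bm{w}_i$ is $0.45$ lies in neither interval and is rejected by this direction, whereas with $\varepsilon=1$ the single interval $[0.10,0.76]$ would accept it. (The numbers are purely illustrative.)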

In the simplest setting the kernel function $K(\cdot,\cdot)$ will be just the usual dot product associated with $\mathbb{R}^d$.
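To make the role of the kernel concrete, the following numpy sketch computes the projections $K(\bm{w}_i,\bm{x}_j)$ for the default dot product and for one possible RBF variant; the function name kernel_projection and the RBF parameterization are our illustrative assumptions, not the authors' released implementation.

import numpy as np

def kernel_projection(w, X, kernel="linear", gamma=1.0):
    # Project every row of X onto direction w through a kernel K(w, x).
    # kernel="linear" reproduces the plain dot product of the basic FROCC
    # definition; kernel="rbf" is one possible kernelized variant whose
    # parameterization (gamma) is assumed here for illustration.
    if kernel == "linear":
        return X @ w                           # <w, x_j> for every row x_j
    if kernel == "rbf":
        d2 = np.sum((X - w) ** 2, axis=1)      # squared distances ||x_j - w||^2
        return np.exp(-gamma * d2)             # K(w, x_j) = exp(-gamma * ||x_j - w||^2)
    raise ValueError("unknown kernel: " + kernel)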

Figure 2: The test point $q$ is correctly classified as an outlier by $\bm{w}_2$ using FROCC since the projection lies outside the discriminating boundaries given by $e$ and $f$. However, $\bm{w}_1$ classifies $q$ incorrectly for $\varepsilon>\varepsilon_d$, since the discriminating boundary (blue region) is formed by the interval $[a,d]$. If we choose $\varepsilon\leq\varepsilon_d$, $\bm{w}_1$ correctly classifies $q$, since the discriminating boundaries (green regions) are given by the intervals $[a,b]$ and $[c,d]$.

We show a small example in Figure 2. The green points are training points. Two classifying directions $\bm{w}_1$ and $\bm{w}_2$ are displayed. The test point $\bm{q}$ has the property that $\langle\bm{q},\bm{w}_1\rangle$ lies in $[a,d]$, but since $\langle\bm{q},\bm{w}_2\rangle$ does not lie in $[e,f]$, $\bm{q}$ is said to be outside the class. Noting that the projections along $\bm{w}_1$ of the two clusters of the training set are well separated, a choice of $\varepsilon<\varepsilon_d$ leads to splitting of the interval $[a,d]$ into $[a,b]$ and $[c,d]$, thus correctly classifying $\bm{q}$ as an outlier.

In simple terms, the intervals of $R_i$ have the property that the points of $S_i$ are densely scattered within each interval and there are wide gaps between the intervals that are empty of points of $S_i$. The density of the intervals is lower bounded by $1/\varepsilon$ and the width of the gap between successive intervals is lower bounded by $\varepsilon$. If we now look at Figure 2 we note that since the interval $[b,c]$ has length more than $\varepsilon$, $R_1$ becomes $[a,b]\cup[c,d]$ and so the projection of $\bm{q}$ along $\bm{w}_1$ is also classified as being outside the class.

Computing FROCC.

Algorithm 1 outlines the training procedure of FROCC. The function $f$ used to compute the projections is any valid kernel, which defaults to the inner product $\langle\cdot,\cdot\rangle$. Algorithm 2 shows how the interval tree is created for each classification direction. Unlike many optimization based learning methods, such as neural networks, our algorithm needs only a single pass through the data.

Data: $S=\{\bm{x}_1,\ldots,\bm{x}_n\}\subseteq\mathbb{R}^d$: training set
Input: $m$, $\varepsilon$
Output: $[R_i]_{i=1}^m$: list of $m$ interval trees
Initialize result $\leftarrow$ empty list
for $i\leftarrow 1$ to $m$ do
      $\bm{w}_i\sim\mathrm{Uniform}(\bm{1}_d)$   // sample a $d$-dimensional unit vector uniformly at random
      Initialize projectionList $\leftarrow$ empty list
      for $j\leftarrow 1$ to $n$ do
            projection $\leftarrow f(\bm{x}_j,\bm{w}_i)$   // kernel projection, by default $\langle\bm{x}_j,\bm{w}_i\rangle$
            projectionList.append(projection)
      end for
      sort(projectionList)
      $R_i\leftarrow$ IntervalTree(projectionList, $\varepsilon$)
      result.append($R_i$)
end for
return result
Algorithm 1: FROCC Training
Input: projectionList (sorted), $\varepsilon$
Output: $R$: interval tree
Initialize $R\leftarrow$ empty interval tree
Initialize start $\leftarrow$ projectionList[1], end $\leftarrow$ projectionList[1]
margin $\leftarrow(\max(\text{projectionList})-\min(\text{projectionList}))\cdot\varepsilon$   // $\varepsilon$ relative to the projection range
for $i\leftarrow 2$ to $n$ do
      if projectionList[$i$] $<$ end $+$ margin then
            end $\leftarrow$ projectionList[$i$]   // the point extends the current interval
      else
            // the point is outside the margin: close the current interval
            R.insert([start, end])
            start $\leftarrow$ projectionList[$i$]
            end $\leftarrow$ projectionList[$i$]
      end if
end for
R.insert([start, end])   // insert the final interval
return $R$
Algorithm 2: IntervalTree subroutine for $\varepsilon\in(0,1)$
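For concreteness, the following is a minimal self-contained numpy sketch of Algorithms 1 and 2. It assumes the default dot-product kernel, stores plain sorted interval lists rather than interval trees, and takes the $\varepsilon$-margin relative to the projection range exactly as Algorithm 2 does; the function names train_frocc and predict_frocc are ours and not part of the authors' released code.

import numpy as np

def train_frocc(X, m=100, epsilon=0.1, rng=None):
    # Minimal FROCC training: m random unit directions, and for each
    # direction the list of epsilon-separated intervals of projections.
    rng = np.random.default_rng(rng)
    n, d = X.shape
    W = rng.standard_normal((m, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)       # m random unit vectors
    intervals = []
    for i in range(m):
        proj = np.sort(X @ W[i])                         # sorted projections along w_i
        margin = epsilon * (proj[-1] - proj[0])          # epsilon relative to the range
        splits = np.where(np.diff(proj) > margin)[0]     # gaps wider than the margin
        starts = np.r_[proj[0], proj[splits + 1]]
        ends = np.r_[proj[splits], proj[-1]]
        intervals.append(np.stack([starts, ends], axis=1))
    return W, intervals

def predict_frocc(model, Y):
    # A point is accepted (YES) only if it falls inside some interval
    # along every one of the m classifying directions.
    W, intervals = model
    accepted = np.ones(len(Y), dtype=bool)
    for w, ivals in zip(W, intervals):
        p = Y @ w
        inside = np.zeros(len(Y), dtype=bool)
        for lo, hi in ivals:
            inside |= (p >= lo) & (p <= hi)
        accepted &= inside
    return accepted

A call such as model = train_frocc(X_train, m=1000, epsilon=0.1) followed by predict_frocc(model, X_test) returns a boolean inlier mask over the test points.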

The standard method for choosing a uniformly random unit vector in $\mathbb{R}^d$ is to sample from a spherical $d$-dimensional Gaussian centered at the origin and then normalize the sample so that its length is 1. Folklore and a quick calculation suggest that independently picking $d$ random variables with distribution $N(0,1)$ is good enough for this purpose. The spherical symmetry of the Gaussian distribution ensures that the random vector is picked uniformly from all possible directions.
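In numpy this sampling amounts to the following (the dimension below is an illustrative value):

import numpy as np

rng = np.random.default_rng(0)
d = 784                        # feature dimension (illustrative)
w = rng.standard_normal(d)     # sample from a spherical d-dimensional Gaussian
w = w / np.linalg.norm(w)      # normalize onto the unit sphere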

Time complexity.

The time taken to compute the boundaries of FROCC is $O(mnd)$ for $\varepsilon=1$ (assuming the kernel is the dot product, which takes $O(d)$ time per projection). For $\varepsilon<1$ an additional $\Theta(mn\log n)$ term is incurred to sort the projections and create the $\varepsilon$-separated intervals along each classifying direction.

4 Theoretical Analysis

In this section we state and prove certain properties of FROCC which show that FROCC is a stable classifier whose performance can be reliably tweaked by modifying the number of classification dimensions $m$ and the separation parameter $\varepsilon$. In particular, we show that:

  1. the bias of FROCC decreases exponentially with the size of the training set if the training set is drawn from a spatially divisible distribution,

  2. the precision and model size of FROCC increase monotonically with $m$ and decrease monotonically with $\varepsilon$, and

  3. the probability of FROCC making an error decays exponentially with $m$.

4.1 Stability of FROCC

Following the analysis of [4] we prove that FROCC generalizes in a certain sense. Specifically, we will show that under some mild assumptions the bias of FROCC with $\varepsilon=1$ goes to 0 at an exponential rate as the size of the training set increases. We first show a general theorem which should be viewed as a variant of Theorem 17 of [4], more suited to our situation since we address a one-class setting and our algorithm is randomized. Our Theorem 4 is of general interest and may have applications beyond the analysis of FROCC.

4.1.1 Terminology and notation

Let $\mathcal{S}$ denote the set of all finite labelled point sets of $\mathbb{R}^d$ where each point is labelled either 0 or 1. We assume we have a training set $S\in\mathcal{S}$ chosen randomly as follows: we are given an unknown distribution $D$ over $\mathbb{R}^d\times\{0,1\}$ and $S$ comprises $k$ samples with label 1 drawn i.i.d. from $D$. For a point $z=(x,y)\in\mathbb{R}^d\times\{0,1\}$ we use $dat(z)$ to denote $x$ and $lab(z)$ to denote $y$.

Let $\mathcal{R}$ be the set of all finite strings on some finite alphabet, and let us call the elements of $\mathcal{R}$ decision strings. Further, let $\mathcal{F}$ be the set of all classification functions from $\mathbb{R}^d$ to $\{0,1\}$. Then we say a classification map $\Phi:\mathcal{S}\times\mathcal{R}\rightarrow\mathcal{F}$ maps a training set and a decision string to a classification function.

With this setup we say that a randomized classification algorithm $A$ takes a training set $S\in\mathcal{S}$ as input, picks a random decision string $r\in\mathcal{R}$ and returns the classification function $\Phi(S,r)$, which we denote $A(S,r)$. $A(S,r)$ is a randomly chosen element of $\mathcal{F}$ which we call a classifier. Given an $r$, $A(S,r)$ is fixed, but we will use $A_S$ to denote the randomized classifier which has been given its training set $S$ but is yet to pick a random decision string. Now, given a $z\in\mathbb{R}^d\times\{0,1\}$, the loss function of a randomized classifier $A_S$ is given by

$$V(A_S,z)=\mathrm{E}_r\left\{\mathbf{1}_{A(S,r)(dat(z))\neq lab(z)}\right\},$$

where the expectation is over the randomness of the decision string $r$. In other words the loss is equal to $\mathrm{Pr}_r\left\{A(S,r)(dat(z))\neq lab(z)\right\}$.

We define the risk of $A_S$ as

$$R(A_S)=\mathrm{E}_{S,z}\left\{V(A_S,z)\right\},$$

where the expectation is over the random choice of $S$ and of the point $z$. Note that $V(\cdot,\cdot)$ already contains an expectation over the randomness associated with the decision string, which is inherent in $A_S$.

The empirical risk of $A_S$ is defined as

$$R_e(A_S)=\frac{1}{|S|}\sum_{z\in S}V(A_S,z).$$

We introduce notation for two modifications of the training set $S$. For $1\leq i\leq|S|$ we write $S^{\setminus i}=S\setminus\{z_i\}$ and $S^i=S^{\setminus i}\cup\{z'_i\}$, where $z'_i$ is chosen from $D$ independently of all previous choices. We now define a notion of stability.

The following notion of stability is a modification of the definition of stability for classification algorithms given by [4].

Definition 2 (0-1 uniform classification stability).

Suppose we have a randomized classification algorithm $A$ and a loss function $V(\cdot,\cdot)$ whose co-domain is $[0,1]$. Then $A$ is said to have 0-1 uniform classification stability $(\beta,\eta)$ if for all $S\subseteq\mathbb{R}^d$, for every $i$ such that $1\leq i\leq|S|$, and for every $z\in\mathbb{R}^d\times\{0,1\}$,

$$|V(A_S,z)-V(A_{S^{\setminus i}},z)|\leq\beta$$

with probability at least $1-\eta$.

If $\beta$ is $O(1/|S|)$ and $\eta$ is $O(e^{-c|S|})$ for some constant $c>0$, we say that $A$ is 0-1 uniform classification stable.

We prove that a 0-1 uniform classification stable algorithm has low bias and converges exponentially fast in the size of $S$ to its expected behavior. This is a general result that may be applicable to a large class of randomized classification algorithms. We also prove that FROCC is 0-1 uniform classification stable under mild conditions on the unknown distribution.

We also introduce a condition that the unknown distribution $D$ has to satisfy in order for us to show the desired stability property.

Definition 3 (Spatial divisibility).

Suppose that $\mu:\mathbb{R}^d\rightarrow[0,1]$ is a probability density function. For any set $T$ containing $d$ points $x_1,\ldots,x_d\in\mathbb{R}^d$ let $H_{T+}$ and $H_{T-}$ be the two half-spaces defined by the hyper-plane containing all the points of $T$. We say that $\mu$ is spatially divisible if for any set of $d+1$ randomly chosen points $X=\{X_1,\ldots,X_d\}$ and $Y$ chosen independently according to density $\mu$ and any $A,B\subset\mathbb{R}^d$ such that $\mu(A),\mu(B)>0$,

$$\mbox{Pr}\left\{Y\in H_{X+}\,|\,X_i\in A,Y\in B\right\}>0,\quad\mathrm{and}\quad\mbox{Pr}\left\{Y\in H_{X-}\,|\,X_i\in A,Y\in B\right\}>0.$$

Note that the set $X$ defines a hyper-plane in $d$ dimensions with probability 1 for any distribution defined on a non-empty volume of $\mathbb{R}^d$. Most standard distributions have the spatial divisibility property. For example, it is easy to see that a multi-variate Gaussian distribution has this property. If we pick a point uniformly at random from within a convex $d$-polytope then too the spatial divisibility property is satisfied. To see this we note that the only way it could fail is if the points $X_i$ and $Y$ are picked from the surface of the $d$-polytope, which is an event of probability 0.

4.1.2 Results

With the definition of stability given above we can prove the following theorem, which shows that the deviation of the empirical risk from the expected risk tends to 0 exponentially as $|S|$ increases for a class of randomized algorithms that includes FROCC.

Theorem 4.

Suppose we have a randomized classification algorithm $A$ which has 0-1 uniform classification stability $(\beta,\eta)$ with $0<\beta,\eta<1$, and suppose this algorithm is trained on a set $S$ drawn i.i.d. from a hidden distribution in such a way that the random choices made by $A$ are independent of $S$. Then, for any $\varepsilon>0$,

$$\mbox{Pr}\left\{|R(A,S)-R_e(A,S)|>\varepsilon\right\}\leq 2\exp\left\{-\frac{2\varepsilon^2}{|S|\beta^2}\right\}+2|S|\eta. \qquad (1)$$

Moreover, if $A$ is 0-1 uniform classification stable then the RHS of (1) tends to 0 at a rate exponential in $|S|$.

Proof.

As in [4] we will use McDiarmid’s inequality [22] to bound the deviation of the empirical risk from the risk. For completeness we state the result:

Lemma 5 ([22]).

Given a set of $m$ independent random variables $Y_i$, $1\leq i\leq m$, and a function $f$ such that $|f(x_1,\ldots,x_m)-f(y_1,\ldots,y_m)|<c_i$ whenever $x_j=y_j$ for all $1\leq j\leq m$, $j\neq i$, and $x_i\neq y_i$, then

$$\mbox{Pr}\left\{|\mathrm{E}\left\{f(Y_1,\ldots,Y_m)\right\}-f(Y_1,\ldots,Y_m)|>\varepsilon\right\}\leq\exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^m c_i^2}\right).$$

To use Lemma 5 we have to show that $R_e(A,S)$ is a $2\beta$-Lipschitz function. However, this property is only true with a certain probability in our case. In particular we will show that for all $i$, $1\leq i\leq|S|$, $|R_e(A,S)-R_e(A,S^i)|\leq 2\beta$ with probability at least $1-2|S|\eta$. To see this we note that for any $i$ and any $z_j\in S$ (including $j=i$),

$$|V(A_S,z_j)-V(A_{S^i},z_j)|=|V(A_S,z_j)-V(A_{S^{\setminus i}},z_j)+V(A_{S^{\setminus i}},z_j)-V(A_{S^i},z_j)|.$$

Using the definition of $(\beta,\eta)$ 0-1 uniform classification stability and the triangle inequality, we get that the RHS is bounded by $2\beta$ with probability at least $1-2\eta$. If $A_j$ is the event that the RHS above is bounded by $2\beta$ then we want to bound the probability of the event that this is true for all $j$, $1\leq j\leq|S|$, i.e., the event $F=\cap_{j=1}^{|S|}A_j$. Since $\mbox{Pr}\{\overline{A_j}\}\leq 2\eta$, we can say that $\mbox{Pr}\{F\}$ is at least $1-2|S|\eta$.

Now, let $E$ be the event that $|R(A,S)-R_e(A,S)|>\varepsilon$ and let $F$, defined above, be the event that $R_e(A,S)$, which is a function of the vector of $|S|$ elements chosen independently to form $S$, is $2\beta$-Lipschitz. We know that

$$\mbox{Pr}\left\{E\right\}=\mbox{Pr}\left\{E\cap F\right\}+\mbox{Pr}\left\{E\cap\overline{F}\right\},$$

and so we can say that

$$\mbox{Pr}\left\{E\right\}\leq\mbox{Pr}\left\{E|F\right\}\mbox{Pr}\left\{F\right\}+\mbox{Pr}\left\{\overline{F}\right\}. \qquad (2)$$

We have already argued that the second term on the RHS is upper bounded by $2|S|\eta$, so we turn to the first term. To apply Lemma 5 to the first term we note that $E$ depends on the random selection of the elements of $S$, which are selected independently of each other and, by assumption, independently of the random choices made by $A$. Therefore the random collection $\{V(A_S,z):z\in S\}$ is independent even when conditioned on $F$, which is determined purely by the random choices of $A$ and holds for every choice of set $S$. Once this is noted, we can use McDiarmid's inequality to bound the first term of Equation (2). We ignore $\mbox{Pr}\left\{F\right\}$ by upper bounding it by 1. ∎

We now show that Theorem 4 applies to FROCC with $\varepsilon=1$ if the unknown distribution $D$ is spatially divisible.

Proposition 6.

If the unknown distribution $D$ is spatially divisible then FROCC with $\varepsilon=1$ is 0-1 uniform classification stable.

Proof.

For any $i$, $1\leq i\leq n$, where $n=|S|$, we note that when $\varepsilon=1$ the entire interval spanned by the projections in any direction is said to contain inliers, and so if $x_i$ lies strictly within the convex hull of $S$ then the removal of $x_i$ from $S$ does not affect the classifier. This is because the boundaries in any direction are determined by the points that lie on the convex hull; let us denote this set of points by $\mbox{conv}(S)$. Therefore, since $V(\cdot,\cdot)$ takes value at most 1, we deduce the following upper bound:

$$|V(A_S,z)-V(A_{S^{\setminus i}},z)|\leq\mbox{Pr}\left\{x_i\in\mbox{conv}(S)\right\}.$$

To bound the probability that a point $x_i$ lies in $\mbox{conv}(S)$ we use an argument that Efron [9] attributes to Rényi and Sulanke [25, 26]. Given a region $B\subset\mathbb{R}^d$, let us denote by $D(B)$ the probability that a point drawn from the unknown distribution $D$ lies in $B$. Now, given any $d$ points, say $x_1,\ldots,x_d\in S$, we divide $\mathbb{R}^d$ into two regions: $A_{X+}$ that lies on one side of the hyper-plane defined by $x_1,\ldots,x_d$ and $A_{X-}$ that lies on the other side. Then the probability that $x_1,\ldots,x_d$ all lie in $\mbox{conv}(S)$ is equal to the probability that all the remaining points of $S$ are either in $A_{X+}$ or in $A_{X-}$, i.e.,

$$\mbox{Pr}\left\{x_1,\ldots,x_d\in\mbox{conv}(S)\right\}=D(A_{X+})^{(n-d)}+D(A_{X-})^{(n-d)}.$$

From this, using the union bound we can say that

$$\mbox{Pr}\left\{x_i\in\mbox{conv}(S)\right\}\leq\sum_{Y\in\mathcal{P}(S\setminus\{x_i\},d)}\left(D(A_{Y+})^{(n-d)}+D(A_{Y-})^{(n-d)}\right),$$

where $\mathcal{P}(A,d)$ is the set of all subsets of $A$ of size $d$. This can further be bounded as

$$\mathrm{LHS}\leq 2\binom{n-1}{d}\max_{Y\in\mathcal{P}(S\setminus\{x_i\},d)}\max\left\{D(A_{Y+})^{(n-d)},D(A_{Y-})^{(n-d)}\right\}.$$

Since we have assumed that $D(A_{Y\pm})<1$ for all $Y\subset S$ such that $|Y|=d$ and $x_i\notin Y$, using the fact that $\binom{n}{k}\leq(en/k)^k$, we can say that there is an $\alpha$ such that $0<\alpha<1$ and a constant $C>0$ such that

$$|V(A_S,z)-V(A_{S^{\setminus i}},z)|\leq C\cdot(n-1)^d\alpha^n$$

almost surely. Since the RHS goes to 0 exponentially fast in $n$ with probability 1, we can say that FROCC is 0-1 uniform classification stable when $\varepsilon=1$. ∎

From Theorem 4 and Proposition 6 we get

Theorem 7.

For a training set $S$ chosen i.i.d. from a spatially divisible unknown distribution, if $\varepsilon$ is set to 1 then $R_e(\mathrm{FROCC},S)$ converges in probability to $R(\mathrm{FROCC},S)$ exponentially fast in $|S|$.

We feel that it should be possible to use the framework provided by Theorem 4 to show that FROCC is stable even when $\varepsilon$ lies in $(0,1)$; however, this problem remains open.

4.2 Monotonicity properties of FROCC

We now show that the precision and the model size of FROCC are monotonic in the two parameters $m$ and $\varepsilon$, but in opposite directions: the precision and model size increase as $m$ increases and decrease as $\varepsilon$ increases. This result is formalized as Proposition 8 below and follows by a coupling argument.

We first introduce some terms. On the $\sigma$-algebra induced by the Borel sets of $\cup_{i\geq 1}\mathbb{R}^i$ we define the probability measure $\mbox{P}_{m,\varepsilon}(\cdot)$ induced by FROCC with parameters $\varepsilon,m>0$. Note that for any $\varepsilon,m>0$, FROCC is a random collection of $m$-dimensional cuboids in $\mathbb{R}^m$, induced by taking products of intervals in each of the $m$ dimensions. Each dimension has between 1 and $|S|$ intervals when trained on $S$. We define the model size of FROCC to be the total number of interval endpoints stored, i.e., if there are $k_i$ intervals along dimension $i$, $1\leq i\leq m$, the model size is $\sum_{i=1}^m 2k_i$. We will use $M_{m,\varepsilon}$ to denote the model size of FROCC with parameters $m,\varepsilon$. Note that this is a random variable.

Proposition 8.

Given FROCC trained on a finite set $S\subset\mathbb{R}^d$, for $\varepsilon,\varepsilon'>0$, $m,m'>0$ and any $x\in\mathbb{R}^d$ we have that

  1. whenever $\varepsilon<\varepsilon'$,

     (a) $\mbox{P}_{m,\varepsilon}(x\mbox{ is classified YES})\leq\mbox{P}_{m,\varepsilon'}(x\mbox{ is classified YES})$, and

     (b) $\mbox{E}_{m,\varepsilon}(M_{m,\varepsilon})\geq\mbox{E}_{m,\varepsilon'}(M_{m,\varepsilon'})$;

  2. whenever $m<m'$,

     (a) $\mbox{P}_{m,\varepsilon}(x\mbox{ is classified YES})\geq\mbox{P}_{m',\varepsilon}(x\mbox{ is classified YES})$, and

     (b) $\mbox{E}_{m,\varepsilon}(M_{m,\varepsilon})\leq\mbox{E}_{m',\varepsilon}(M_{m',\varepsilon})$.

Proof.

For 1(a) and 1(b) we couple the probability distributions $\mbox{P}_{m,\varepsilon}(\cdot)$ and $\mbox{P}_{m,\varepsilon'}(\cdot)$ by choosing $m$ random unit vectors $\bm{w}_1,\ldots,\bm{w}_m$ and then constructing one instance of FROCC with separation $\varepsilon$ and one instance with separation $\varepsilon'$ using the same $m$ random unit vectors.

Now, consider some vector $\bm{w}_i$ and let $x^i_j$, $1\leq j\leq|S|$, be the projections of the training points on $\bm{w}_i$. Let the intervals formed by the $\varepsilon$-separated FROCC be $\{[a_{ij},b_{ij}]:1\leq j\leq k_i\}$ and those of the $\varepsilon'$-separated FROCC be $\{[a'_{ij},b'_{ij}]:1\leq j\leq k'_i\}$. We know that $a'_{i1}=a_{i1}=\min_j x^i_j$. Clearly each $[a_{ij},b_{ij}]$ is contained within some $[a'_{ij'},b'_{ij'}]$ since the distance between subsequent points within $[a_{ij},b_{ij}]$ is less than $\varepsilon$, which is strictly less than $\varepsilon'$. This proves 1(a) and also implies that $k'_i\leq k_i$, which proves 1(b).

For 2(a) and 2(b) we couple the probability distributions $\mbox{P}_{m,\varepsilon}(\cdot)$ and $\mbox{P}_{m',\varepsilon}(\cdot)$ by choosing $m$ random unit vectors $\bm{w}_1,\ldots,\bm{w}_m$ and then constructing two instances of FROCC. Then we choose an additional $m'-m$ vectors $\bm{w}_{m+1},\ldots,\bm{w}_{m'}$, compute intervals corresponding to these, and add them to the second instance of FROCC. 2(b) follows immediately. For 2(a) we note that any $x\in\mathbb{R}^d$ classified as YES by the second model must also be classified as YES by the first model, since it is classified as YES along the first $m$ dimensions of the second model, which are common to both models. ∎

From Proposition 8 it follows that expected model size and false positive probability vary inversely, i.e., the larger the model size the smaller the false positive probability and vice versa.

Proposition 8 is also consistent with the following upper bound, which is easy to prove.

Fact 9.

For FROCC constructed on a training set $S=\{\bm{x}_1,\ldots,\bm{x}_n\}$,

$$2m\leq M_{m,\varepsilon}\leq m\cdot\max\left\{2,\frac{\delta_{\max}}{\varepsilon}\right\}$$

with probability 1, where $\delta_{\max}=\max_{1\leq i,j\leq n}\|\bm{x}_i-\bm{x}_j\|_2$.

The first inequality is trivial. The second follows by observing that along any dimension the gaps between intervals must be at least $\varepsilon$ and the length of the interval between the maximum and minimum projection is at most $\delta_{\max}$.

4.3 FROCC's error decays exponentially with $m$

Definition 10.

Given a finite set of points $S=\{\bm{x}_1,\ldots,\bm{x}_n\}\subseteq\mathbb{R}^d$ and any $\bm{w}\in\bm{1}_d$, let $\ell_{\bm{w}}=\min_{j=1}^n\langle\bm{x}_j,\bm{w}\rangle$ and $u_{\bm{w}}=\max_{j=1}^n\langle\bm{x}_j,\bm{w}\rangle$. Now, given a point $\bm{y}\in\mathbb{R}^d$ and $\bm{w}\in\bm{1}_d$, the distance of $\bm{y}$ from $S$ along the direction $\bm{w}$ is defined as

$$\alpha(\bm{y},S,\bm{w})=\begin{cases}\ell_{\bm{w}}-\langle\bm{y},\bm{w}\rangle & \text{if }\langle\bm{y},\bm{w}\rangle<\ell_{\bm{w}},\\ 0 & \text{if }\ell_{\bm{w}}\leq\langle\bm{y},\bm{w}\rangle\leq u_{\bm{w}},\\ \langle\bm{y},\bm{w}\rangle-u_{\bm{w}} & \text{otherwise.}\end{cases}$$

Further, we say that $\bm{w}\in\bm{1}_d$ is a discriminator of $\bm{y}$ from $S$ if $\alpha(\bm{y},S,\bm{w})>0$. Finally, we denote by $C(\bm{y},S)$ the subset of unit vectors of $\bm{1}_d$ that discriminate $\bm{y}$ from $S$.
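As an illustration, a direct transcription of Definition 10 for the dot-product case might look as follows; this is a sketch under that assumption, not the authors' code.

import numpy as np

def alpha(y, S, w):
    # Distance of y from the training set S along direction w (Definition 10).
    proj = S @ w
    lo, hi = proj.min(), proj.max()    # l_w and u_w
    p = y @ w
    if p < lo:
        return lo - p
    if p > hi:
        return p - hi
    return 0.0                         # projection falls inside [l_w, u_w]

# w discriminates y from S exactly when alpha(y, S, w) > 0.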

FROCC has the property that the probability of misclassifying a point outside the class decreases exponentially with the parameter $m$. We state this formally. Given $A\subset\bm{1}_d$ we denote the measure of $A$ by $\mu(A)$.

Proposition 11.

Given a set $S$ and a point $\bm{y}$ with a set $C(\bm{y},S)$ of non-empty measure such that for any $\bm{w}\in C(\bm{y},S)$, $\langle\bm{y},\bm{w}\rangle\notin[\min_{j=1}^n\langle\bm{x}_j,\bm{w}\rangle,\max_{j=1}^n\langle\bm{x}_j,\bm{w}\rangle]$, the probability that FROCC with projection dimension $m$ returns a YES answer is

$$\mbox{Pr}\left\{\bm{y},S,m\right\}=\left(1-\frac{\mu(C(\bm{y},S))}{\mu(\bm{1}_d)}\right)^m.$$

Hence for the classifier to give an erroneous YES answer on $\bm{y}$ with probability at most $\delta$ we need

$$m\geq\frac{\mu(\bm{1}_d)}{\mu(C(\bm{y},S))}\log\frac{1}{\delta} \qquad (3)$$

projection dimensions.
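As an illustrative calculation, if the discriminating directions occupy a fraction $\mu(C(\bm{y},S))/\mu(\bm{1}_d)=0.05$ of the unit sphere and we want the error probability on $\bm{y}$ to be at most $\delta=0.01$, then (3) asks for $m\geq 20\ln(100)\approx 92.1$, so $m=93$ projection dimensions suffice for this point. (The fraction 0.05 is assumed purely for illustration.)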

Figure 3: Probability contours of $\mbox{Pr}\{\bm{y},S,m\}$ for different values of $m$ ((a) $m=1$, (b) $m=2$, (c) $m=8$). The region boundary gets more distinct with increasing $m$.

The proof of (3) is just a simple calculation. Figure 3 presents the result of Proposition 11 visually.

5 Experiments and Results

We conducted a comprehensive set of experiments to evaluate the performance of FROCC on synthetic and real-world benchmark datasets and to compare it with various existing state-of-the-art baselines. In this section we first present the details of the datasets, baselines, and metrics we used in our empirical evaluation, and then present the results of the comparison.

5.1 Methods Compared

We compare FROCC with the following state-of-the-art algorithms for OCC:

  (i) OC-SVM: One-class SVM based on the formulation of Schölkopf et al [30],

  (ii) Isolation Forest: the isolation forest algorithm [21],

  (iii) AE: an auto-encoder based anomaly detection algorithm based on Aggarwal [1],

  (iv) Deep SVDD: a deep neural-network based formulation of Ruff et al [28]. We used the implementation and the optimal model parameters provided by the authors (https://github.com/lukasruff/Deep-SVDD-PyTorch).

We implemented FROCC using Python 3.8 with the numpy library. The IntervalTree subroutine (Algorithm 2) was optimized by quantizing the projection range into discrete bins of size based on $\varepsilon$. This allows for a vectorized implementation of the subroutine. All experiments were conducted on a server with an Intel Xeon (2.60GHz) CPU and an NVIDIA GeForce GTX 1080 Ti GPU. It is noteworthy that unlike deep learning based methods (i.e., AE and Deep SVDD), FROCC does not require a GPU during training or testing.
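One possible vectorized realization of this binning idea is sketched below: projections are quantized into bins whose width is proportional to $\varepsilon$ times the projection range, occupied bins are recorded in a boolean mask, and membership testing reduces to an index lookup. The exact bookkeeping (function names, bin width, boundary handling) is our assumption; the text only states that the intervals are quantized into $\varepsilon$-based bins.

import numpy as np

def fit_bins(proj, epsilon):
    # Record which epsilon-wide bins of the projection range contain
    # at least one training projection.
    lo, hi = proj.min(), proj.max()
    width = max(epsilon * (hi - lo), 1e-12)           # bin width proportional to epsilon
    nbins = int(np.ceil((hi - lo) / width)) + 1
    occupied = np.zeros(nbins, dtype=bool)
    occupied[((proj - lo) / width).astype(int)] = True
    return lo, width, occupied

def in_bins(p, lo, width, occupied):
    # Vectorized membership test for an array of query projections p.
    idx = ((p - lo) / width).astype(int)
    ok = (idx >= 0) & (idx < len(occupied))
    ok[ok] = occupied[idx[ok]]
    return ok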

For OC-SVM and FROCC, for each dataset we chose the best performing (in ROC) kernel among the following: (a) linear (dot product), (b) radial basis function (RBF), (c) polynomial (degree 3) and (d) sigmoid.

5.2 Datasets

We evaluate FROCC and the baseline methods on a set of well-known benchmark datasets summarized in Table 1. The set of benchmarks we have chosen covers a large spectrum of data characteristics, such as the number of samples (from a few hundred to millions) as well as the number of features (from a handful to tens of thousands). Thus it broadly covers the range of scenarios where the OCC task appears.

To adapt the multi-class datasets for evaluation in the OCC setting, we perform processing similar to Ruff et al [28]: given a dataset with multiple labels, we assign one class the positive label and all others the negative label. Specifically, we select the positive classes for (i) MNIST: randomly from 0, 1 and 4, (ii) CIFAR-10: airplane, automobile and deer, and (iii) CIFAR-100: beaver, motorcycle and fox. We then split the positive class instances into train and test sets. Using labelled data allows us to evaluate the performance using true labels; we do not use any labels for training. We double the test set size by adding an equal number of instances from the negatively labeled instances. We train the OCC on the train set, which consists of only positive examples. Metrics are calculated on the test set, which consists of an equal number of positive and negative examples.
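For illustration, the split described above can be reproduced with a few lines of array manipulation. The sketch below assumes a generic labelled dataset (X, y), a chosen positive class and enough negative instances; the 50/50 train/test ratio for the positives is an assumption since the exact ratio is not stated here.

import numpy as np

def make_occ_split(X, y, positive_class, train_frac=0.5, rng=None):
    # Train on positives only; test on the held-out positives plus an
    # equal number of randomly chosen negatives.
    rng = np.random.default_rng(rng)
    pos = rng.permutation(np.flatnonzero(y == positive_class))
    neg = np.flatnonzero(y != positive_class)
    n_train = int(train_frac * len(pos))
    train_idx = pos[:n_train]
    test_pos = pos[n_train:]
    test_neg = rng.choice(neg, size=len(test_pos), replace=False)
    test_idx = np.concatenate([test_pos, test_neg])
    y_test = np.concatenate([np.ones(len(test_pos)), np.zeros(len(test_neg))])
    return X[train_idx], X[test_idx], y_test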

ID Name Features Training Samples Test Samples (Equal +/– Split)
A1 Diabetes [8] 8 268 508
A2 MAGIC Telescope [8] 10 12332 20870
A3 MiniBooNE [8] 50 18250 30650
A4 Cardiotocography [8] 35 147 294
B1 SATLog-Vehicle [31] 100 24623 36864
B2 Kitsune Network Attack [23] 115 3018972 3006490
B3 MNIST [20] 784 3457 5390
B4 CIFAR-10 [18] 3072 3000 5896
B5 CIFAR-100 [18] 3072 300 594
B6 GTSRB [32] 3072 284 562
B7 Omniglot [19] 33075 250 400
Table 1: Dataset statistics. The datasets are divided into two groups: Group A (A1–A4) has datasets with fewer than 100 features and Group B (B1–B7) contains datasets with 100 or more features. Datasets A2, A3, B1 and B2 have more than 10,000 training samples.

5.3 Metrics

Classification Performance: For measuring the classification performance, we use the following two commonly used metrics:

  • ROC AUC to measure the classification performance of all methods, and

  • Adjusted Precision@$n$, that is, the fraction of true outliers among the top $n$ points in the test set, normalized across datasets [33].

Computational Scalability: For measuring the computational performance of all the methods, we use the speedup of their training and testing time over the fastest baseline method for the dataset. In other words, for an algorithm $A$ its speedup is calculated as the ratio of the training (or testing) time of the fastest baseline method to the training (or testing) time of $A$. We found Isolation Forest to be the fastest across all scenarios, making it the reference method for computing the speedup metric. All measurements are reported as an average over 5 runs for each method.

5.4 Results and Observations

ID OC-SVM IsoForest Deep SVDD Auto Encoder FROCC
A1 53.35 ± 0.95 63.48 ± 0.46 73.24 ± 0.65 62.43 ± 0.26 73.18 ± 0.26
A2 68.12 ± 0.85 77.09 ± 0.84 61.39 ± 0.87 75.96 ± 0.44 72.30 ± 0.33
A3 54.22 ± 0.45 81.21 ± 0.14 78.21 ± 0.10 56.27 ± 0.22 69.32 ± 0.30
A4 57.93 ± 0.73 85.10 ± 0.33 56.95 ± 0.53 59.43 ± 0.60 83.57 ± 0.44
B1 73.51 ± 0.41 83.19 ± 0.44 68.36 ± 0.64 76.42 ± 0.76 85.14 ± 0.74
B2 86.01 ± 0.60 80.05 ± 0.43 84.73 ± 0.37 79.23 ± 0.34 86.64 ± 0.51
B3 98.99 ± 0.11 98.86 ± 0.31 97.04 ± 0.08 99.38 ± 0.13 99.61 ± 0.35
B4 52.21 ± 0.39 51.47 ± 0.59 59.32 ± 0.62 56.73 ± 0.47 61.93 ± 0.64
B5 49.62 ± 0.12 54.13 ± 0.17 67.81 ± 0.39 55.73 ± 0.53 71.04 ± 0.37
B6 63.55 ± 0.12 61.23 ± 0.35 71.88 ± 0.36 66.73 ± 0.44 67.12 ± 0.60
B7 90.21 ± 0.63 70.83 ± 0.35 72.70 ± 0.07 62.38 ± 0.12 92.46 ± 0.21
Table 2: Area under the ROC curve. ID corresponds to the dataset ID in Table 1. Bold values represent the best scores and italics represent the second best scores.
ID OC-SVM IsoForest Deep SVDD Auto Encoder FROCC
A1 59.72 ± 0.95 61.64 ± 0.05 72.74 ± 0.13 63.54 ± 0.14 72.83 ± 0.27
A2 78.17 ± 0.15 87.09 ± 0.35 72.23 ± 0.47 78.19 ± 0.32 84.31 ± 0.50
A3 61.05 ± 0.56 79.77 ± 0.73 78.32 ± 0.43 66.17 ± 0.48 71.91 ± 0.36
A4 63.21 ± 0.94 82.83 ± 0.18 60.03 ± 0.35 57.73 ± 0.54 85.91 ± 0.33
B1 74.22 ± 0.44 81.94 ± 0.62 68.36 ± 0.68 76.42 ± 0.60 89.21 ± 0.35
B2 81.35 ± 0.60 87.24 ± 0.65 81.21 ± 0.63 72.94 ± 0.56 83.21 ± 0.65
B3 92.34 ± 0.47 91.65 ± 0.65 93.25 ± 0.45 74.32 ± 0.37 99.80 ± 0.58
B4 50.02 ± 0.75 51.42 ± 0.86 57.32 ± 0.09 51.63 ± 0.18 57.92 ± 0.41
B5 50.08 ± 0.32 50.13 ± 0.43 55.21 ± 0.64 53.40 ± 0.77 55.62 ± 0.38
B6 61.63 ± 0.01 59.13 ± 0.11 69.38 ± 0.30 64.71 ± 0.16 62.46 ± 0.36
B7 88.32 ± 0.10 72.83 ± 0.26 82.70 ± 0.51 65.90 ± 0.47 79.34 ± 0.40
Table 3: Adjusted Precision@$n$. For each dataset, $n$ is the number of positive examples in the test set of that dataset. ID corresponds to the dataset ID in Table 1. Bold values represent the best scores and italics represent the second best scores.
ID OC-SVM Deep SVDD Auto Encoder FROCC
Train Test Train Test Train Test Train Test
A1 0.055 0.862 0.034 0.038 0.089 0.090 1.111(2) 1.111
A2 0.057 0.847 0.032 0.038 0.085 0.088 1.333(3) 1.075
A3 0.100 0.840 0.031 0.039 0.082 0.087 1.282(3) 1.075
A4 0.064 0.870 0.038 0.038 0.091 0.091 1.190(2) 1.099
B1 0.074 0.855 0.032 0.038 0.087 0.089 1.075(1) 1.087
B2 0.043 0.524 0.028 0.038 0.062 0.078 1.786(1) 1.176
B3 0.099 0.794 0.043 0.039 0.050 0.080 1.299(1) 1.149
B4 0.099 0.820 0.040 0.039 0.069 0.083 1.389(1) 1.099
B5 0.100 0.826 0.039 0.039 0.074 0.085 1.370(1) 1.087
B6 0.098 0.833 0.039 0.039 0.079 0.086 1.351(2) 1.087
B7 0.100 0.806 0.035 0.039 0.061 0.082 1.449(1) 1.111
Table 4: Training and test speedups for baselines and FROCC, calculated as $speedup(M)=\frac{time(\mathrm{IF})}{time(M)}$, where $time(\mathrm{IF})$ is the time for Isolation Forest, the fastest baseline. Higher is better. A number greater than 1 indicates the method is faster than the fastest baseline. The superscript on FROCC's training speedup shows its rank on that dataset based on ROC-AUC scores.

Table 2 compares the ROC scores of FROCC and the baselines on the datasets. From the table we note that FROCC outperforms the baselines in six of the eleven datasets. Further, we note that for high-dimensional datasets, i.e., datasets with 100 or more features (B1–B7), FROCC outperforms the baselines in six out of seven datasets while ranking second in the remaining one. This demonstrates the superiority of FROCC for high-dimensional data, which is known to be challenging for most OCC methods. Table 3 summarizes the Adjusted Precision@$n$ (for a given dataset, $n$ is the number of positive examples in the test data). Here too we observe that FROCC outperforms the baselines in six datasets. In the high-dimensional datasets, it is superior in four out of seven and is the second best in one. Even in datasets where the performance of FROCC is not the highest, it does not drop as low as the traditional shallow baselines, which on two occasions fail to learn altogether.

Next, we turn our attention to the computational performance of all the methods, summarized in Table 4, where we present the training and testing speedups. These results particularly highlight the scalability of FROCC over all other baselines, including traditionally efficient shallow methods such as Isolation Forest. Against deep learning based methods such as Deep SVDD and Auto Encoder, FROCC shows nearly one to two orders of magnitude speedup. Overall, we notice that FROCC achieves up to 68× speedup while beating the baselines, and up to 35× speedup while ranking second or third.

5.5 Discussion of Results

Here we discuss the results from two broad perspectives:

ROC and Speed trade-offs:

The speedups show that FROCC is the fastest method. From Table 2, we see that in six datasets FROCC outperforms the baselines as well. Of the cases where it does not outperform the baselines, it ranks second three times out of five (datasets A1, A4 and B6), and in these cases it is between 1.19 and 35 times faster than the best performing method. Thus, depending on the application, the speed vs ROC trade-off may make FROCC the method of choice even when it doesn't outperform other algorithms.

Scalability:

While we are faster than the baselines on all datasets, large datasets like MiniBooNE (A3), SATLog-Vehicle (B1) and especially Kitsune network attack (B2) – where FROCC is 68 times faster than the second best method – can render the best performing deep baselines infeasible, leaving scalable methods like Isolation Forest and FROCC as the only option. Given that FROCC consistently ranks among the top three (first 6 times, second 3 times, third 2 times), whereas Isolation Forest's rank varies inconsistently and drops as low as 5 (out of 5), FROCC is the algorithm of choice when scalability is a concern.

5.6 Parameter Sensitivity

In the previous section we mentioned that the parameters were chosen based on grid search. In this section we present a study of the sensitivity of FROCC to these parameters.

Sensitivity to dimensions.

From the analysis in Section 4, we know that the performance of FROCC improves with an increasing number of classification dimensions. Figure 4 shows this variation for a few selected datasets.

Figure 4: Variation of the area under the ROC curve with the number of classification dimensions for the MNIST, SATLog-Vehicle, Omniglot and Kitsune network attack datasets. We observe that the metric improves monotonically, confirming the theoretical claim.
Sensitivity to $\varepsilon$.

The separation parameter $\varepsilon$ is a dataset-specific parameter and relates to the minimum distance of separation between positive examples and potential negative examples. In practice this comes out to be of the order of 0.1 or 0.01, as shown in Figure 5.

Figure 5: Variation of the area under the ROC curve with $\varepsilon$ for the MNIST and MAGIC Telescope datasets.

6 Conclusion

In this paper, we presented a surprisingly simple and highly effective one-class classification strategy based on projecting data onto randomly selected vectors from the unit sphere. By representing these projections as collections of intervals we developed FROCC. Our experiments over real and synthetic datasets reveal that despite its extreme simplicity – and the resulting computational efficiency – FROCC is not just better than the traditional methods for one-class classification but also better than the highly sophisticated (and expensive) models that use deep neural networks.

As part of our future work, we plan to explore extending the FROCC algorithms to the classification task under highly imbalanced data, and to application settings where high-performance OCC is crucial.

References

  • Aggarwal [2015] Aggarwal CC (2015) Outlier analysis. In: Data mining, Springer, pp 237–263
  • Ailon and Chazelle [2006] Ailon N, Chazelle B (2006) Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In: ACM STOC, pp 557–563
  • Borisyak et al [2019] Borisyak M, Ryzhikov A, Ustyuzhanin A, Derkach D, Ratnikov F, Mineeva O (2019) $(1+\varepsilon)$-class classification: an anomaly detection method for highly imbalanced or incomplete data sets, arXiv:1906.06096
  • Bousquet and Elisseeff [2002] Bousquet O, Elisseeff A (2002) Stability and generalization. J Mach Learn Res 2:499–526
  • Chandola et al [2009] Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: A survey. ACM computing surveys (CSUR) 41(3):1–58
  • Cohen et al [2008] Cohen G, Sax H, Geissbuhler A, et al (2008) Novelty detection using one-class parzen density estimator. an application to surveillance of nosocomial infections. In: MIE, pp 21–26
  • Dasgupta [1999] Dasgupta S (1999) Learning mixtures of Gaussians. In: Proc. IEEE FOCS, pp 634–644
  • Dua and Graff [2017] Dua D, Graff C (2017) UCI machine learning repository. URL http://archive.ics.uci.edu/ml
  • Efron [1965] Efron B (1965) The convex hull of a random set of points. Biometrika 52(3-4):331–343
  • Fauconnier and Haesbroeck [2009] Fauconnier C, Haesbroeck G (2009) Outliers detection with the minimum covariance determinant estimator in practice. Statistical Methodology 6(4):363–379
  • Garcia-Teodoro et al [2009] Garcia-Teodoro P, Díaz-Verdejo JE, Maciá-Fernández G, Vázquez E (2009) Anomaly-based network intrusion detection: Techniques, systems and challenges. Computers & Security 28(1-2):18–28
  • Indyk and Motwani [1998] Indyk P, Motwani R (1998) Approximate nearest neighbors: Towards removing the curse of dimensionality. In: Proc. ACM STOC, pp 604–613
  • Indyk and Naor [2007] Indyk P, Naor A (2007) Nearest-neighbor-preserving embeddings. ACM Transactions on Algorithms (TALG) 3(3):31–es
  • Irigoien et al [2014] Irigoien I, Sierra B, Arenas C (2014) Towards application of one-class classification methods to medical data. In: TheScientificWorldJournal
  • Japkowicz et al [1995] Japkowicz N, Myers C, Gluck MA (1995) A novelty detection approach to classification. In: IJCAI, pp 518–523
  • Khan and Madden [2009] Khan SS, Madden MG (2009) A survey of recent trends in one class classification. In: Irish conference on artificial intelligence and cognitive science, Springer, pp 188–197
  • Kleinberg [1997] Kleinberg JM (1997) Two algorithms for nearest-neighbor search in high dimensions. In: Proc. 29th Annu. ACM STOC, pp 599–608
  • Krizhevsky [2009] Krizhevsky A (2009) Learning multiple layers of features from tiny images
  • Lake et al [2015] Lake BM, Salakhutdinov R, Tenenbaum JB (2015) Human-level concept learning through probabilistic program induction. Science 350(6266):1332–1338
  • LeCun and Cortes [2005] LeCun Y, Cortes C (2005) The mnist database of handwritten digits
  • Liu et al [2008] Liu FT, Ting KM, Zhou ZH (2008) Isolation forest. In: 2008 Eighth IEEE International Conference on Data Mining, IEEE, pp 413–422
  • McDiarmid [1989] McDiarmid C (1989) On the method of bounded differences. In: Siemons J (ed) Surv. Comb., Cambridge University Press, London Mathematical Society Lecture Note Series
  • Mirsky et al [2018] Mirsky Y, Doitshman T, Elovici Y, Shabtai A (2018) Kitsune: an ensemble of autoencoders for online network intrusion detection. Network and Distributed System Security Symposium (NDSS)
  • Perera and Patel [2018] Perera P, Patel VM (2018) Learning deep features for one-class classification. IEEE Transactions on Image Processing 28:5450–5463
  • Rényi and Sulanke [1963] Rényi A, Sulanke R (1963) Über die konvexe Hülle von n zufällig gewählten Punkten I. Z Wahr Verw Geb 2:75–84
  • Rényi and Sulanke [1964] Rényi A, Sulanke R (1964) Über die konvexe Hülle von n zufällig gewählten Punkten II. Z Wahr Verw Geb 3:138–147
  • Rousseeuw and Driessen [1999] Rousseeuw PJ, Driessen KV (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41(3):212–223, 10.1080/00401706.1999.10485670
  • Ruff et al [2018] Ruff L, Vandermeulen R, Goernitz N, Deecke L, Siddiqui SA, Binder A, Müller E, Kloft M (2018) Deep one-class classification. In: International conference on machine learning, pp 4393–4402
  • Sakurada and Yairi [2014] Sakurada M, Yairi T (2014) Anomaly detection using autoencoders with nonlinear dimensionality reduction. In: Proceedings of the MLSDA 2014 2Nd Workshop on Machine Learning for Sensory Data Analysis, ACM, MLSDA’14, p 4:4–4:11
  • Schölkopf et al [2000] Schölkopf B, Williamson R, Smola A, Shawe-Taylor J, Piatt J (2000) Support vector method for novelty detection. In: Advances in Neural Information Processing Systems, pp 582–588
  • Siebert [1987] Siebert JP (1987) Vehicle recognition using rule based methods
  • Stallkamp et al [2012] Stallkamp J, Schlipsing M, Salmen J, Igel C (2012) Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks
  • Swersky et al [2016] Swersky L, Marques HO, Sander J, Campello RJ, Zimek A (2016) On the evaluation of outlier detection and one-class classification methods. In: 2016 IEEE international conference on data science and advanced analytics (DSAA), IEEE, pp 1–10
  • Tax and Duin [2004] Tax DM, Duin RP (2004) Support Vector Data Description. Machine Learning 54(1):45–66
  • Valiant [1984] Valiant L (1984) Theory of the learnable. Commun ACM 27(11):1134–1142
  • Vempala [2010] Vempala SS (2010) A random-sampling-based algorithm for learning intersections of halfspaces. J ACM 57(6):32:1–32:14
  • Zheng et al [2018] Zheng P, Yuan S, Wu X, Li JY, Lu A (2018) One-class adversarial nets for fraud detection. In: AAAI