
Self-Supervised Representation Learning for Visual Anomaly Detection

Rabia Ali KAIST
   Muhammad Umar Karim Khan KAIST
   Chong Min Kyung KAIST
Abstract

Self-supervised learning allows for better utilization of unlabelled data. The feature representation obtained by self-supervision can be used in downstream tasks such as classification, object detection, segmentation, and anomaly detection. While classification, object detection, and segmentation have been investigated with self-supervised learning, anomaly detection has received less attention. We consider the problem of anomaly detection in images and videos, and present a new visual anomaly detection technique for videos. Numerous seminal and state-of-the-art self-supervised methods are evaluated for anomaly detection on a variety of image datasets. The best performing image-based self-supervised representation learning method is then applied to video anomaly detection to assess the importance of spatial features in visual anomaly detection in videos. We also propose a simple self-supervision approach for learning temporal coherence across video frames without the use of any optical flow information. At its core, our method identifies the frame indices of a jumbled video sequence, which allows it to learn the spatiotemporal features of the video. This intuitive approach shows superior visual anomaly detection performance compared to numerous image- and video-based methods on the UCF101 and ILSVRC2015 video datasets.

I Introduction

Anomaly detection, also termed one-class classification, is a classic problem [1, 2, 3]. One-class classifiers are capable of identifying out-of-distribution (abnormal) instances by learning from instances of the normal (in-distribution) class, as shown in Fig. 1. We address the problem of anomaly detection for images and videos, which is useful in applications such as visual quality inspection in manufacturing [4], surveillance [5, 6], biomedical applications [7, 8], self-driving cars [9], and robotics [10, 11]. One-class classification is more general than binary classification because all unseen instances are treated as anomalies.

Anomalies in images and videos are generally defined as objects or events that are unusual and indicate irregular behaviour. Manually detecting these rare objects in images and unexpected events in videos is a very tiresome task. Automating this cumbersome job, e.g., through self-supervised representation learning, can ease the detection of faulty materials in manufacturing, malignant tumors or nodules in medical images such as mammograms, CT, or PET scans, and anomalous events such as traffic accidents, crimes, or illegal activities in video surveillance.

Deep Neural Networks (DNNs) learn different levels of visual features in images and videos, resulting in remarkable performance in classification [12, 13, 14], object detection [15, 16, 17], semantic segmentation [18, 19, 20], and anomaly detection [21, 22, 23]. However, the near-human or superhuman performance achieved by deep learning algorithms generally requires annotation, which is both time-consuming and expensive. To reduce or avoid the cumbersome job of data labelling, researchers have long focused on methods that require a minimal level of supervision. These efforts led to advancements in domain adaptation, transfer learning, meta learning, continual learning, semi-supervised learning, weakly-supervised learning, and unsupervised learning.

We focus on self-supervised learning, a promising subclass of unsupervised learning, which provides an opportunity to better utilize unlabeled data by deriving the learning objectives from internal cues in the data. In self-supervised learning, a pretext task such as context prediction [24], colorization [25], or predicting image rotations [26] can be formulated using only unlabeled data. While solving these pretext tasks, the network learns a useful feature representation which can be transferred to different downstream tasks of interest, such as classification, object detection, segmentation, and anomaly detection.

In the context of deep learning, different schemes have been proposed for unsupervised anomaly detection, which include autoencoders [27, 28, 29, 30], Generative Adversarial Networks (GANs) [31], and low-density rejection [32] and one-class SVMs [33, 34] over low-dimensional embeddings. However, despite the broader shift from supervised to unsupervised learning, anomaly detection using self-supervised learning remains relatively unexplored. Anomaly detection using self-supervised visual representation learning aims to train a model such that it better learns the features of in-distribution (normal) examples. The learned features should focus not only on low-level object characteristics such as color and texture, but also on high-level characteristics such as object parts, shapes, position, and orientation. By learning these features from normal instances, the network can detect out-of-distribution samples.

In order to better learn spatiotemporal features in videos for anomaly detection, we introduce the task of guessing the indices of randomly permuted video frames, as shown in Fig. 2. While doing so, the network implicitly reasons about object shape, position, and orientation in time without the use of any optical flow information. Thus, the learned feature representation carries rich semantic and structural information. Training for anomaly detection is done with normal videos only. At test time, the network is unable to predict the permutation of video frames from an unseen context. Results show that the proposed method achieves significant improvement in anomaly detection over the best available self-supervised visual representation learning methods.

Our main contributions are twofold. 1) We provide an overview of one-class classification (anomaly detection) using self-supervised visual representation learning. Such analyses exist for other downstream tasks but not for anomaly detection. 2) We propose a simple method for self-supervised learning of spatiotemporal features from videos without the use of any extra information. The proposed method outperforms numerous other methods in video anomaly detection that use spatial, temporal, or combined self-supervision.

The rest of the paper is structured as follows. Research related to our work is discussed in Section 2. In Section 3, we describe the task of permuting video frames for self-supervised visual representation learning. Experimental results and discussion are given in Section 4, and Section 5 concludes the paper.

Figure 1: In one-class classification (anomaly detection), the detector is expected to identify out-of-distribution (abnormal) instances. Here, given a set of normal apple instances, the detector learns to detect abnormal instances of non-apple objects
Figure 2: Frame Permutation Prediction Task: Given a raw video segment, the frames are permuted according to a permutation index randomly drawn from a predefined set of indexes and the network learns to predict this index

II Related Work

II-1 Self-Supervised Representation Learning from Images.

Numerous methods have been proposed for self-supervised representation learning on images, each exploring a different pretext task. Doersch et al. [24] perform self-supervised representation learning by predicting the relative position of image patches. Noroozi and Favaro [35] design a jigsaw puzzle game as a pretext task, where the model is trained to place nine shuffled patches back in their original locations. Other autoencoder-based methods try to recover part of the data itself, such as image inpainting [36], image colorization [25], and its improved variant, channel prediction [37]. Noroozi et al. [38] introduce a method for self-supervised representation learning that uses an artificial supervision signal based on counting visual primitives. Gidaris et al. [26] propose to randomly rotate an image by one of four possible angles and let the model predict the rotation. Caron et al. [39, 40] use a clustering-based approach to generate pseudo-labels.

II-2 Self-Supervised Representation Learning from Videos.

Video datasets contain raw spatiotemporal signals, which can be used for self-supervised representation learning. Vondrick et al. [41] use video colorization as a self-supervised learning problem. Wang and Gupta [42] propose a way of self-supervised representation learning by tracking moving objects in videos. Another method validates frame order [43], where the pretext task is to determine whether a sequence of frames from a video is in the correct temporal order. Wei et al. [44] show that predicting the arrow of time (whether the video is playing forward or backward) yields a useful latent representation. Lee et al. [45] use optical flow magnitude to select frames with high motion and then use a sorting task to learn a feature representation. Xu et al. [46] introduce clip order prediction as a self-supervised learning task. However, none of these pretext tasks has been used for the downstream task of visual anomaly detection in videos.

II-3 Anomaly Detection with Images.

Many researchers have used reconstruction-based methods for anomaly detection. Xia et al. [29] use a convolutional autoencoder with a regularizing term that produces a large reconstruction error for out-of-distribution samples. An and Cho [27] use a variational autoencoder to extract an anomaly score based on the reconstruction probability estimated through Monte Carlo sampling. Zhai et al. [47] investigate the use of deep structured energy-based models for anomaly detection, focusing in particular on two decision criteria: energy score and reconstruction error. Zong et al. [30] propose to jointly model the encoded features and the reconstruction error in a deep autoencoder for anomaly detection. A generative adversarial network-based anomaly detection method is used by Li et al. [31]. Ruff et al. [48] revisit classical one-class SVMs [33] with deep representations to improve anomaly detection results on complex data. The maximum softmax probability of a classifier is used by Hendrycks and Gimpel [49] for anomaly detection. Lee et al. [50] develop a training method for neural classification networks for better anomaly detection without losing the original classification accuracy. Golan and El-Yaniv [51] show that learning to discriminate between different geometric transformations applied to normal images encourages learning of features that are useful for anomaly detection.

II-4 Anomaly Detection with Videos.

Video anomaly detection has mostly been applied to detect anomalies in crowd behaviour, which is important for averting casualties. Hand-crafted methods [52, 53] propose novel features; a model is trained to learn normal features, and an anomaly is detected by identifying isolated clusters or outliers. In deep learning-based methods, a network is trained to reconstruct frames of normal instances. Since the trained network cannot reconstruct anomalous instances well, large reconstruction errors indicate anomalies. Hasan et al. [28] model regular frames using a 3-D convolutional autoencoder. The methods in [54, 55] use a Convolutional LSTM Autoencoder (ConvLSTM-AE) to model normal appearance and motion simultaneously for crowd anomaly detection. Future frame prediction [56] has also been proposed for anomaly detection [57]: a model is trained to predict future video frames of normal training data using the previous frames, and in the testing phase the prediction error is used to declare an anomaly.

III Frame Permutation Prediction

Our goal is to use the raw spatiotemporal signals in videos to learn a feature representation carrying rich semantic and structural meaning for the downstream task of visual anomaly detection in videos, which has not been addressed with self-supervision before. We learn this representation by solving a frame permutation prediction task. By solving this complex pretext task on normal videos without any motion information, the network learns both low-level and high-level features, enabling it to identify anomalous videos at test time.

III-A Problem Formulation

Given a video with $N$ frames $\{f_1,\dots,f_N\}$, we divide it into $Z=N-M+1$ sub-sequences, each of $M$ consecutive frames. These sub-sequences or video segments $S=\{s_i\}_{i=1}^{Z}$ are separated by one frame, with an overlap of $M-1$. We then permute the frames of each segment according to a permutation index $y_i\in[1,\dots,M!]$ $\forall i$, called the pseudo-label of the frame permutation. For each raw input video segment $s_i$, the permutation results in

S^{*}=\{s_{i}^{*}\}_{i=1}^{Z}=\{g(s_{i},y_{i})\}_{i=1}^{Z},  (1)

where $g(\cdot)$ is the permutation operator.

Our aim is to train a neural network $W$, with parameters $\psi$, such that it can predict the permutation of jumbled video frames. In other words, training is performed to obtain the optimal $\psi$ such that

\operatorname*{arg\,min}_{\psi}\sum_{i}\ell(W_{\psi}(s_{i}^{*}),y_{i}),  (2)

where $\ell$ is a general loss function.
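For concreteness, the following minimal Python sketch illustrates the segment generation and permutation pseudo-labeling described above; the helper name, frame format, and zero-based labels are assumptions for illustration, not the authors' implementation.

```python
import itertools
import random
import numpy as np

M = 5  # frames per segment (assumption; see Section IV for the choice of M)
PERMUTATIONS = list(itertools.permutations(range(M)))  # the M! possible orderings

def make_permuted_segments(frames):
    """Split a video (list of frames) into overlapping segments of M consecutive
    frames, permute each one, and return (permuted segment, pseudo-label) pairs."""
    Z = len(frames) - M + 1
    samples = []
    for i in range(Z):
        segment = np.stack(frames[i:i + M])        # raw segment s_i
        y = random.randrange(len(PERMUTATIONS))    # pseudo-label y_i (zero-based here)
        permuted = segment[list(PERMUTATIONS[y])]  # g(s_i, y_i)
        samples.append((permuted, y))
    return samples
```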

III-B $M$-Stream Siamese Convolutional Neural Network

For this work, we use an $M$-stream Siamese CNN as shown in Fig. 3, where each stream consists of a Base-CNN (BCNN). All $M$ BCNNs are composed of the conv1 to fc7 layers of the CaffeNet architecture [58] (the 1-GPU version of AlexNet [59]) with shared weights. Each BCNN model $B(\cdot)$ takes one frame $f_k$ of the permuted input video segment $s_i^{*}$ and returns a feature representation $r_k$ of that frame.

R=\{r_{k}\}_{k=1}^{M}=\{B(f_{k};\theta)\}_{k=1}^{M},  (3)

where $\theta$ are the shared learnable parameters of all BCNN models $B(\cdot)$.

The stacks of layers from conv1 to fc7 forming the $M$ BCNNs have shared weights; therefore, they have the same number of parameters as the AlexNet architecture. All $M$ representations are concatenated by a concatenation operator $O(\cdot)$:

x_{i}=O(r_{1},r_{2},\dots,r_{M}),  (4)

where $x_i$ is the concatenated feature representation of the frames of the permuted video segment $s_i^{*}$.

The layers following the concatenation layer form another CNN called the Top-CNN (TCNN). The TCNN model $T(\cdot)$ is a logistic classifier, which takes as input the concatenated representation $x_i$ and yields as output a vector $Y\in\mathbb{R}^{M!}$ with a probability value for each possible label.

T(x_{i})=p(Y|x_{i};\phi),  (5)

where $p(Y|x_i;\phi)$ is the predicted probability for the permutation labels $Y=\{y_k\}_{k=1}^{M!}$ and $\phi$ represents the learnable parameters of the model $T(\cdot)$.
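A minimal PyTorch sketch of this architecture is given below; it uses a small stand-in backbone instead of the CaffeNet conv1 to fc7 stack, and the layer sizes are assumptions for illustration only. The key structure is the shared Base-CNN applied to each of the $M$ frames, the concatenation operator, and the Top-CNN classifier over the $M!$ permutation labels.

```python
import math
import torch
import torch.nn as nn

class FramePermutationNet(nn.Module):
    """M-stream Siamese network: a shared Base-CNN per frame, feature
    concatenation, and a Top-CNN classifier over the M! permutation labels."""

    def __init__(self, M=5, feat_dim=256):
        super().__init__()
        self.M = M
        # Shared Base-CNN (a small stand-in for the CaffeNet conv1-fc7 stack).
        self.bcnn = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim), nn.ReLU(),
        )
        # Top-CNN: logistic classifier over the M! possible permutations.
        self.tcnn = nn.Sequential(
            nn.Linear(M * feat_dim, 512), nn.ReLU(),
            nn.Linear(512, math.factorial(M)),
        )

    def forward(self, segment):
        # segment: (batch, M, 3, H, W) permuted video segment s_i*
        feats = [self.bcnn(segment[:, k]) for k in range(self.M)]  # r_1 .. r_M
        x = torch.cat(feats, dim=1)                                # O(r_1, ..., r_M)
        return self.tcnn(x)                                        # logits over M! labels
```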

Figure 3: 5-Stream Siamese CNN: Our network architecture has five parallel streams. Each stream, from conv1 to fc7, forms the Base-CNN with shared weights. Each stream takes a frame as input and produces a feature representation. The concatenation operator $O(\cdot)$ concatenates all 5 feature representations and feeds the output to the Top-CNN, which produces a conditional probability distribution over the 120 possible permutation indexes of the frames. Technically, the output probability should be 1 at the 24-th location and 0 elsewhere. Further details can be found in the supplementary material

III-C Training

Our network is trained to learn richer and more detailed information by solving a complex task without the use of any optical flow information. Unlike previous self-supervised learning approaches, our method not only focuses on low-level features but also learns high-level features, which makes the task of visual anomaly detection easier and faster. We are not interested in the final performance of the frame permutation prediction task; rather, we are only interested in the learned intermediate representation. Note that during training on the frame permutation prediction task, we set the stride of the first layer (conv1) of our $M$-stream Siamese CNN to 2 instead of 4 (the CaffeNet model uses 4).
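As a sketch of this stride change, assuming torchvision's AlexNet as a stand-in for CaffeNet, the first convolution could be replaced as follows; this is illustrative, not the authors' Caffe configuration.

```python
import torch.nn as nn
from torchvision.models import alexnet

# Hypothetical sketch: reduce the first conv stride from 4 to 2 on an
# AlexNet-style backbone, as described for conv1 of the Base-CNN.
backbone = alexnet(weights=None).features
old = backbone[0]  # Conv2d(3, 64, kernel_size=11, stride=4, padding=2)
backbone[0] = nn.Conv2d(old.in_channels, old.out_channels,
                        kernel_size=old.kernel_size, stride=2,
                        padding=old.padding)
```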

Let the training video be divided into $Z$ permuted segments $S^{*}=\{s_i^{*}\}_{i=1}^{Z}$ with corresponding permutations $P=\{y_i\}_{i=1}^{Z}$. Note that $y_i\in[1,\dots,M!]$ $\forall i$. In order to train the network, the parameters of the $M$-stream Siamese CNN are updated such that

\theta=\operatorname*{arg\,min}_{\theta}\frac{1}{Z}\sum_{i=1}^{Z}L(s_{i},y_{i},\theta,\phi),  (6)
\phi=\operatorname*{arg\,min}_{\phi}\frac{1}{Z}\sum_{i=1}^{Z}L(s_{i},y_{i},\theta,\phi).  (7)

The loss function $L(\cdot)$ is the cross-entropy loss defined as

L(s_{i},y_{i},\theta,\phi)=-\frac{1}{M!}\sum_{k=1}^{M!}y_{k}\log\big(T(O(B(g(s_{i},y_{i});\theta));\phi)\big)_{k},  (8)

where $y_k$ is the $k$-th element of the one-hot encoding of the pseudo-label $y_i$, and $T(O(B(g(s_i,y_i))))$ is the output of the $M$-stream Siamese CNN. Both $\theta$ and $\phi$ are jointly optimized.
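A minimal training-loop sketch is given below, assuming the FramePermutationNet and segment generation helpers sketched earlier and a data loader that yields (segment, pseudo-label) batches from normal videos only; the optimizer settings mirror those reported in Section IV-D.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=100, lr=1e-3, device="cuda"):
    """Jointly optimize theta (Base-CNN) and phi (Top-CNN) with cross-entropy, Eq. (8)."""
    model = model.to(device).train()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for segments, labels in loader:    # segments: (B, M, 3, H, W), labels: (B,)
            segments, labels = segments.to(device), labels.to(device)
            loss = criterion(model(segments), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```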

III-D Anomaly Detection

If the training of the pretext task defined in the previous section is done with normal videos only, then the network will only be able to predict the permutation labels for them. For out-of-distribution or abnormal examples (videos not seen during training), the predicted probability diverges from the actual label, giving a higher value of the cross-entropy loss compared to in-distribution examples. We normalize this loss over all video segments in the test data to [0, 1] and calculate the anomaly score $A(s_i)$ for each video segment $\{s_i\}_{i=1}^{Z}$ by the following min-max normalization:

A(s_{i})=\frac{L(s_{i},y_{i},\theta,\phi)-\min_{i}L(s_{i},y_{i},\theta,\phi)}{\max_{i}L(s_{i},y_{i},\theta,\phi)-\min_{i}L(s_{i},y_{i},\theta,\phi)}.  (9)
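A sketch of computing the anomaly score of Eq. (9) over test segments, assuming the model and segment format from the earlier sketches, is shown below; the small epsilon guarding against a zero denominator is our addition.

```python
import numpy as np
import torch
import torch.nn.functional as F

@torch.no_grad()
def anomaly_scores(model, segments, labels, device="cuda"):
    """Per-segment cross-entropy loss, min-max normalized to [0, 1] as in Eq. (9)."""
    model.eval()
    losses = []
    for seg, y in zip(segments, labels):
        seg = torch.as_tensor(seg, dtype=torch.float32, device=device).unsqueeze(0)
        y = torch.tensor([y], device=device)
        losses.append(F.cross_entropy(model(seg), y).item())
    losses = np.array(losses)
    return (losses - losses.min()) / (losses.max() - losses.min() + 1e-12)
```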

IV Experimental Results

We first show the results of self-supervised learning for anomaly detection on images and follow up with videos. We also apply the best performing image-based method to videos to identify the role of spatial features in video anomaly detection. The results of our method and other video-based self-supervised methods are also given.

IV-A Datasets

We consider four image datasets: CIFAR-10, CIFAR-100 [60], fashion-MNIST [61], and ImageNet [62], and two video datasets: UCF101 [63] and ILSVRC2015 [64] in our experiments.

•  CIFAR-10 consists of 60,000 32×32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images, equally divided over the classes.

•  CIFAR-100 is similar to CIFAR-10 but with 100 classes containing 600 images for each class. This set has a fixed train/test partition with 500 training images and 100 test images per class. The 100 classes of this dataset are grouped into 20 superclasses, which we use in our experiments.

•  Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. There are 60,000 training and 10,000 testing images.

•  ImageNet consists of 1000 classes with more than 14 million images. We have used a subset of 20 classes for our anomaly detection experiments. We resize all the images to 256×256.

•  UCF101 is an action recognition dataset of 13,320 videos with 101 action categories. These 101 actions can be divided into 5 types: 1) human-object interaction, 2) body-motion only, 3) human-human interaction, 4) playing musical instruments, and 5) sports. We have used a subset of this dataset with 20 different action classes that include actions from all five types. There are 50 training and 10 testing videos per class. Training videos are taken from train split-1, while testing videos are taken from test split-1 of the UCF101 dataset.

•  ILSVRC2015 is the ImageNet VID dataset for object detection in videos. The final release of the VID dataset consists of three splits. The training set contains 3862 video snippets with 56 to 458 videos per category. The validation set contains 555 snippets, and the test set contains 937 snippets. We have used a subset of this dataset consisting of videos of 20 different object classes. There are 100 training and 15 testing videos per class. To create the dataset, the testing videos are taken from the test or validation set, whereas the training videos are taken from the train set.

Further details on the datasets can be found in the supplementary material.

IV-B Experimental Setup

We have used a one-vs-all evaluation scheme in our experiments. Suppose a dataset has $C$ classes and we train the self-supervised visual learning task on one class $c$, which we call the normal class. Samples from the remaining $C-1$ classes are treated as abnormal samples. We quantify the performance using the area under the ROC curve (AUROC), which is commonly used as a performance measure for anomaly detection models. The same experiment is repeated $C$ times, each time with a different normal class used to train the model. Results are averaged over all $C$ normal classes.
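A sketch of this one-vs-all protocol, assuming a generic anomaly scoring function and per-class test sets (both placeholders standing in for the trained models and data), is shown below using scikit-learn's roc_auc_score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def one_vs_all_auroc(score_fn, test_sets, normal_class):
    """score_fn(x) returns an anomaly score; test_sets maps class -> test samples."""
    scores, targets = [], []
    for c, samples in test_sets.items():
        for x in samples:
            scores.append(score_fn(x))
            targets.append(0 if c == normal_class else 1)  # 1 = abnormal
    return roc_auc_score(np.array(targets), np.array(scores))

# Repeated for every class c as the normal class, then averaged:
# aurocs = [one_vs_all_auroc(score_fns[c], test_sets, c) for c in range(C)]
# print(np.mean(aurocs))
```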

IV-C Anomaly Detection on Images

We have used solving the jigsaw puzzle game (Jigsaw) [35], image colorization (Color) [25], image inpainting (Inpaint) [36], counting visual primitives (Count) [38], split-brain autoencoders (Split-Brain) [37], and predicting image rotations (RotNet) [26] as self-supervised visual representation learning methods for the task of visual anomaly detection. With these simple tasks and no semantic labels, we can learn a powerful visual representation using a CNN, which can be used for the well-known problem of novelty detection. We also include the results of DeepSVDD [48], Geometric Transform (GT) [51], and InceptionCAE NN-QED [65] for CIFAR-10, CIFAR-100, and fashion-MNIST. The results of these three methods are taken from [51] and [65], as the evaluation protocol is the same as ours.

We trained each model for 100 epochs on the normal class. The batch size for all methods was set to 64. The results for CIFAR-10 and CIFAR-100 are given in Table I. Interestingly, all the self-supervised techniques perform better than the methods specifically designed for one-class classification (anomaly detection). The visual features learned by self-supervised methods are even better than those learned by supervised or unsupervised methods. This makes the task of visual anomaly detection easier, faster, and more accurate. In the case of fashion-MNIST, anomaly detection by solving the jigsaw puzzle game performs better than RotNet. However, the average AUROC of DeepSVDD is slightly higher than that of the jigsaw network, as given in Table II.

TABLE I: AUROC of anomaly detection on CIFAR-10 and CIFAR-100. The best performing method in each experiment is in bold. All values are percentages
Dataset $c_i$ DeepSVDD [48] GT [51] InceptionCAE [65] Count [38] Color [25] Inpaint [36] Split-Brain [37] Jigsaw [35] RotNet [26]
CIFAR-10 (32×32×3) 0 61.7 74.7 66.7 74.1 71.1 72.0 65.2 77.3 80.3
1 65.9 95.7 71.3 96.1 88.4 88.9 93.6 97.0 96.8
2 50.8 78.1 66.8 79.6 70.3 71.4 78.6 79.2 80.0
3 59.1 72.4 64.1 71.1 69.6 70.8 73.1 73.5 75.6
4 60.9 87.8 72.3 87.4 65.2 64.3 87.6 88.6 89.1
5 65.7 87.8 65.3 88.5 66.8 70.2 86.9 89.6 90.1
6 67.7 83.4 76.4 85.9 79.9 80.2 81.1 86.8 87.2
7 67.3 95.5 63.7 95.9 70.5 75.6 93.4 94.8 96.0
8 75.9 93.3 76.9 94.0 68.3 84.6 83.1 93.4 95.7
9 73.1 91.3 72.5 92.1 78.1 83.4 93.1 92.9 93.4
avg 64.8 86.0 69.6 86.5 73.4 76.1 83.6 87.3 88.4
CIFAR-100 (32×32×3) 0 57.4 74.7 66.0 79.2 70.1 75.4 81.6 80.4 82.8
1 63.0 68.5 60.1 71.1 54.2 58.4 61.3 73.3 75.2
2 70.0 74.0 59.2 75.3 68.7 71.2 74.9 75.6 77.4
3 55.8 81.0 58.7 81.6 65.3 74.6 69.0 80.2 85.6
4 69.0 78.4 60.9 76.4 62.1 65.3 68.9 78.9 80.1
5 51.0 59.1 54.2 66.5 51.1 50.6 52.3 64.3 67.4
6 59.9 81.8 63.7 82.9 75.4 78.7 83.6 84.2 87.1
7 53.0 65.0 66.1 66.4 61.9 63.3 65.1 68.2 66.3
8 51.6 85.5 74.8 87.5 75.5 81.2 87.8 86.3 89.4
9 72.9 90.6 78.3 86.9 72.1 79.9 85.4 89.1 90.8
10 81.5 87.6 80.4 86.2 68.3 72.5 75.6 88.2 88.3
11 53.6 83.9 68.3 81.1 74.2 78.4 82.9 84.6 85.2
12 50.6 83.2 75.6 77.5 66.5 74.6 76.4 79.2 80.1
13 44.0 58.0 61.0 56.3 53.2 56.2 55.2 58.1 60.3
14 57.2 92.1 64.3 90.7 78.4 89.1 93.8 92.9 94.9
15 47.7 68.3 66.3 69.9 62.1 65.8 66.2 70.4 73.6
16 54.3 73.5 72.0 73.2 57.8 62.9 65.3 74.8 76.4
17 74.7 93.8 75.9 96.3 70.4 65.4 60.2 96.0 97.8
18 52.1 90.7 67.4 89.4 71.1 78.1 89.6 91.5 92.1
19 57.9 85.0 65.8 85.7 76.2 80.9 85.4 86.3 90.6
avg 58.9 78.7 67.0 79.0 66.7 71.1 74.1 80.1 82.1
TABLE II: AUROC of anomaly detection on fashion-MNIST dataset. The best performing method in each experiment is in bold. All values are percentages
Dataset $c_i$ DeepSVDD [48] GT [51] InceptionCAE [65] Jigsaw [35] RotNet [26]
Fashion-MNIST (28×28×1) 0 98.8 99.4 92.4 98.4 97.4
1 99.7 97.6 98.8 97.9 96.6
2 93.5 91.1 90.0 91.0 92.7
3 94.9 89.9 95.0 90.7 90.4
4 95.1 92.1 92.0 90.9 87.9
5 90.4 93.4 93.4 88.4 92.0
6 98.0 83.3 85.5 97.2 98.0
7 96.0 98.9 98.6 96.5 98.2
8 95.4 90.8 95.1 95.7 98.4
9 97.6 99.2 97.7 91.2 89.5
avg 95.9 93.5 93.9 94.5 94.1

For the ImageNet dataset, we compare the results of anomaly detection using different self-supervised visual representation learning methods. Table III shows that the RotNet model gives the highest average AUROC for anomaly detection. This is because, in order to predict the rotation of an image, the network must also learn to model object shape, i.e., it should be able to localize salient objects in the image and recognize their orientation and object type. Learning low-level object features like color and texture alone is not enough to predict image rotations. Therefore, unlike the other self-supervised representation learning methods that mainly focus on low-level features, the RotNet model learns both low-level and high-level object characteristics, which better assist visual anomaly detection. However, for rotationally symmetric objects, the RotNet model fails to predict rotations and hence cannot learn a good feature representation.
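For reference, a minimal sketch of the rotation pretext labels of RotNet [26] is shown below; the batching helper is an assumption for illustration and not the original implementation.

```python
import torch

def rotation_batch(images):
    """images: (B, C, H, W) tensor; returns images rotated by 0/90/180/270 degrees
    together with the rotation class labels (0-3) the model must predict."""
    rotated, labels = [], []
    for k in range(4):  # k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)
```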

TABLE III: AUROC of anomaly detection on ImageNet dataset. The best performing method in each experiment is in bold. All values are percentages
Dataset $c_i$ Count [38] Color [25] Inpaint [36] Split-Brain [37] Jigsaw [35] RotNet [26]
ImageNet (256×256×3) 0 77.6 67.6 71.2 75.4 81.5 95.6
1 83.2 74.5 76.4 80.2 89.3 98.6
2 66.4 60.2 62.1 63.2 68.9 76.0
3 73.1 65.5 68.7 70.1 75.9 93.9
4 67.6 63.4 64.5 62.3 70.7 61.8
5 75.4 68.7 71.1 72.7 77.2 77.9
6 73.2 66.1 67.6 69.9 78.4 93.2
7 74.1 65.4 68.2 70.8 76.3 84.4
8 66.6 61.8 61.5 63.5 68.7 86.2
9 72.9 67.6 69.8 71.2 75.2 89.3
10 84.7 75.4 78.7 81.6 87.1 94.7
11 65.2 60.2 62.7 64.5 66.9 83.9
12 72.6 63.8 65.5 67.8 75.0 89.7
13 73.4 66.1 68.3 70.8 70.1 75.9
14 85.7 70.6 78.9 82.1 89.5 94.0
15 76.8 64.3 67.6 71.6 79.6 91.2
16 68.4 63.8 66.7 65.4 71.3 72.6
17 75.4 69.8 72.1 74.3 77.1 76.9
18 67.5 60.2 63.4 65.9 70.8 73.7
19 82.4 71.3 76.7 78.1 85.1 94.9
avg 74.1 66.3 69.1 71.1 76.7 85.2
TABLE IV: AUROC of anomaly detection using state-of-the-art self-supervised visual representation learning baselines and our approach on UCF101 and ILSVRC2015 datasets. AUROC values are an average of 20 AUROCs corresponding to 20 different models trained on exactly one of the 20 classes. Each model’s in-distribution examples are from one of the 20 classes, and the test out-of-distribution samples are from the remaining 19 classes. All values are percentages
Dataset Video Colorization [41] Tracking [42] Shuffle and Learn [43] AoT [44] RotNet [26] Sorting [45] Ours
UCF101 66.3 56.4 67.5 54.1 72.8 74.6 76.4
ILSVRC2015 69.2 70.1 70.7 61.0 74.4 73.8 75.5

IV-D Anomaly Detection on Videos

We have compared the anomaly detection results of our proposed approach (frame permutation prediction task) with other self-supervised representation learning methods on videos: video colorization [41], visual learning by tracking moving objects in videos (Tracking) [42], shuffle and learn [43], AoT [44], and sorting [45]. The overall best performing image-based self-supervised representation learning method, i.e., RotNet, is also used for video anomaly detection. This is done to see the importance of spatial features in video anomaly detection. With RotNet, we treat the video frames of the UCF101 and ILSVRC2015 datasets as independent images and perform visual anomaly detection as for the image datasets.

For video anomaly detection using the frame permutation prediction task, video segments are generated from both normal and abnormal class videos. A label of zero is assigned to a video segment if it belongs to the normal class and one if it belongs to the abnormal class. We use a stochastic gradient descent optimizer with a momentum of 0.9. Training uses colored frames of normal video segments, which are resized to 225×225 pixels. The batch size is kept at 10 for all anomaly detection experiments. Training is done for 100 epochs with a learning rate of $10^{-3}$.

The comparison is shown in Table IV. The spatiotemporal features learned by our proposed frame permutation prediction task are better than those of the prior methods. Unlike the previous methods, our network not only captures low-level video features but also focuses on high-level spatiotemporal features, which makes the task of visual anomaly detection much easier and faster. It is also observed that temporal features alone provide less meaningful information than spatial features for the task of visual anomaly detection in videos. This is the reason AoT gives poor video anomaly detection results. This suggests that learning good spatial features is more important than learning good temporal features for visual anomaly detection. However, our experimental results show that learning both spatial and temporal features from normal videos allows better detection of anomalous videos than learning spatial features alone, provided that the learned representation carries rich semantic and structural understanding of the video content. The results in Table IV show that our model better learns the semantic and structural meaning of the normal class (in-distribution examples), therefore making it easier to detect the abnormal class (out-of-distribution examples). The average AUROC obtained by our method exceeds the existing methods on both the UCF101 and ILSVRC2015 datasets.

It is also found that the number of frames $M$ in a video segment, and the corresponding number of possible permutations $M!$, is an important hyperparameter. Empirical results in Table V show that $M=5$ gives the best results. Fewer frames, and the correspondingly fewer permutations, do not provide enough information to learn, as the frames in a segment would be quite similar. Larger values of $M$ do not improve the performance.

TABLE V: AUROC of anomaly detection with different $M$
Dataset M=3 M=4 M=5 M=6
UCF101 69.4 72.4 76.4 76.1
ILSVRC2015 70.5 73.5 75.5 75.3

Experiments in Table VI also show that, for each video segment, instead of choosing a combination of five consecutive frames $\{f_1,f_2,f_3,f_4,f_5\}$, selecting a combination by skipping two frames $\{f_1,f_4,f_7,f_{10},f_{13}\}$ gives the best results. This is because consecutive frames tend to be very similar, not allowing the network to learn enough spatiotemporal information.

TABLE VI: AUROC of anomaly detection with different frame skips
Dataset Skip=0 Skip=1 Skip=2 Skip=3
UCF101 67.4 73.1 76.4 70.2
ILSVRC2015 72.5 73.3 75.5 71.6

The results in Tables I, II, III, and IV clearly show that image-based self-supervised representation learning achieves competitive performance. However, there is much room for improvement in self-supervised methods for visual anomaly detection in videos, which has not been addressed in the past.

V Conclusion

In this paper, we explore anomaly detection in images and videos using self-supervised visual representation learning. A number of pretext tasks have been proposed for representation learning, but they had not been evaluated for the downstream task of visual anomaly detection. We performed an extensive experimental comparison of anomaly detection in images using existing self-supervised visual representation methods and state-of-the-art algorithms specifically designed for anomaly detection. The best performing image-based self-supervised representation learning method is then used for video anomaly detection to assess the importance of spatial features. For visual anomaly detection in videos, we introduce an efficient frame permutation prediction task which learns better spatiotemporal features without the use of any additional information. The proposed method results in improved visual anomaly detection on two video datasets compared to state-of-the-art self-supervised representation learning methods.

References

  • [1] R. Chalapathy, A. K. Menon, and S. Chawla, “Anomaly detection using one-class neural networks,” arXiv preprint arXiv:1802.06360, 2018.
  • [2] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel, “Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection,” arXiv preprint arXiv:1904.02639, 2019.
  • [3] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs, “Unsupervised anomaly detection with generative adversarial networks to guide marker discovery,” in International Conference on Information Processing in Medical Imaging.   Springer, 2017, pp. 146–157.
  • [4] M. Haselmann, D. P. Gruber, and P. Tabatabai, “Anomaly detection using deep learning based image completion,” in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA).   IEEE, 2018, pp. 1237–1242.
  • [5] S. Shashikar and V. Upadhyaya, “Traffic surveillance and anomaly detection using image processing,” in 2017 Fourth International Conference on Image Information Processing (ICIIP).   IEEE, 2017, pp. 1–6.
  • [6] W. Sultani, C. Chen, and M. Shah, “Real-world anomaly detection in surveillance videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6479–6488.
  • [7] A. Taboada-Crispi, H. Sahli, D. Hernandez-Pacheco, and A. Falcon-Ruiz, “Anomaly detection in medical image analysis,” in Handbook of research on advanced techniques in diagnostic imaging and biomedical applications.   IGI Global, 2009, pp. 426–446.
  • [8] Q. Wei, Y. Ren, R. Hou, B. Shi, J. Y. Lo, and L. Carin, “Anomaly detection for medical images based on a one-class classification,” in Medical Imaging 2018: Computer-Aided Diagnosis, vol. 10575.   International Society for Optics and Photonics, 2018, p. 105751M.
  • [9] C. Creusot and A. Munawar, “Real-time small obstacle detection on highways using compressive rbm road reconstruction,” in 2015 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2015, pp. 162–167.
  • [10] P. Chakravarty, A. M. Zhang, R. Jarvis, and L. Kleeman, “Anomaly detection and tracking for a patrolling robot,” in Australasian Conference on Robotics and Automation (ACRA).   Citeseer, 2007.
  • [11] A. Munawar, P. Vinayavekhin, and G. De Magistris, “Spatio-temporal anomaly detection for industrial robots through prediction in unsupervised feature space,” in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).   IEEE, 2017, pp. 1017–1025.
  • [12] D. Cireşan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” arXiv preprint arXiv:1202.2745, 2012.
  • [13] D. C. Cireşan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, “High-performance neural networks for visual object classification,” arXiv preprint arXiv:1102.0183, 2011.
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [15] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
  • [16] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
  • [17] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  • [18] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
  • [19] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  • [20] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
  • [21] R. Chalapathy and S. Chawla, “Deep learning for anomaly detection: A survey,” arXiv preprint arXiv:1901.03407, 2019.
  • [22] B. Kiran, D. Thomas, and R. Parakkal, “An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos,” Journal of Imaging, vol. 4, no. 2, p. 36, 2018.
  • [23] D. Kwon, H. Kim, J. Kim, S. C. Suh, I. Kim, and K. J. Kim, “A survey of deep learning-based network anomaly detection,” Cluster Computing, pp. 1–13, 2017.
  • [24] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1422–1430.
  • [25] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in European conference on computer vision.   Springer, 2016, pp. 649–666.
  • [26] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018.
  • [27] J. An and S. Cho, “Variational autoencoder based anomaly detection using reconstruction probability,” Special Lecture on IE, vol. 2, no. 1, 2015.
  • [28] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, “Learning temporal regularity in video sequences,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 733–742.
  • [29] Y. Xia, X. Cao, F. Wen, G. Hua, and J. Sun, “Learning discriminative reconstructions for unsupervised outlier removal,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1511–1519.
  • [30] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen, “Deep autoencoding gaussian mixture model for unsupervised anomaly detection,” 2018.
  • [31] D. Li, D. Chen, J. Goh, and S.-k. Ng, “Anomaly detection with generative adversarial networks for multivariate time series,” arXiv preprint arXiv:1809.04758, 2018.
  • [32] R. El-Yaniv and M. Nisenson, “Optimal single-class classification strategies,” in Advances in Neural Information Processing Systems, 2007, pp. 377–384.
  • [33] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, “Support vector method for novelty detection,” in Advances in neural information processing systems, 2000, pp. 582–588.
  • [34] D. M. Tax and R. P. Duin, “Support vector data description,” Machine learning, vol. 54, no. 1, pp. 45–66, 2004.
  • [35] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in European Conference on Computer Vision.   Springer, 2016, pp. 69–84.
  • [36] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2536–2544.
  • [37] R. Zhang, P. Isola, and A. A. Efros, “Split-brain autoencoders: Unsupervised learning by cross-channel prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1058–1067.
  • [38] M. Noroozi, H. Pirsiavash, and P. Favaro, “Representation learning by learning to count,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5898–5906.
  • [39] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149.
  • [40] M. Caron, P. Bojanowski, J. Mairal, and A. Joulin, “Unsupervised pre-training of image features on non-curated data,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2959–2968.
  • [41] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy, “Tracking emerges by colorizing videos,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 391–408.
  • [42] X. Wang and A. Gupta, “Unsupervised learning of visual representations using videos,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2794–2802.
  • [43] I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and learn: unsupervised learning using temporal order verification,” in European Conference on Computer Vision.   Springer, 2016, pp. 527–544.
  • [44] D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman, “Learning and using the arrow of time,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8052–8060.
  • [45] H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang, “Unsupervised representation learning by sorting sequences,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 667–676.
  • [46] D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang, “Self-supervised spatiotemporal learning via video clip order prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 334–10 343.
  • [47] S. Zhai, Y. Cheng, W. Lu, and Z. Zhang, “Deep structured energy based models for anomaly detection,” arXiv preprint arXiv:1605.07717, 2016.
  • [48] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft, “Deep one-class classification,” in International conference on machine learning, 2018, pp. 4393–4402.
  • [49] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” arXiv preprint arXiv:1610.02136, 2016.
  • [50] K. Lee, H. Lee, K. Lee, and J. Shin, “Training confidence-calibrated classifiers for detecting out-of-distribution samples,” arXiv preprint arXiv:1711.09325, 2017.
  • [51] I. Golan and R. El-Yaniv, “Deep anomaly detection using geometric transformations,” in Advances in Neural Information Processing Systems, 2018, pp. 9758–9769.
  • [52] F. Tung, J. S. Zelek, and D. A. Clausi, “Goal-based trajectory analysis for unusual behaviour detection in intelligent surveillance,” Image and Vision Computing, vol. 29, no. 4, pp. 230–240, 2011.
  • [53] S. Wu, B. E. Moore, and M. Shah, “Chaotic invariants of lagrangian particle trajectories for anomaly detection in crowded scenes,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.   IEEE, 2010, pp. 2054–2060.
  • [54] Y. S. Chong and Y. H. Tay, “Abnormal event detection in videos using spatiotemporal autoencoder,” in International Symposium on Neural Networks.   Springer, 2017, pp. 189–196.
  • [55] W. Luo, W. Liu, and S. Gao, “Remembering history with convolutional lstm for anomaly detection,” in 2017 IEEE International Conference on Multimedia and Expo (ICME).   IEEE, 2017, pp. 439–444.
  • [56] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” arXiv preprint arXiv:1511.05440, 2015.
  • [57] W. Liu, W. Luo, D. Lian, and S. Gao, “Future frame prediction for anomaly detection–a new baseline,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6536–6545.
  • [58] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 675–678.
  • [59] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [60] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
  • [61] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
  • [62] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   Ieee, 2009, pp. 248–255.
  • [63] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
  • [64] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [65] N. Sarafijanovic-Djukic and J. Davis, “Fast distance-based anomaly detection in images using an inception-like autoencoder,” in International Conference on Discovery Science.   Springer, 2019, pp. 493–508.