
RandomSEMO: Normality Learning of Moving Objects for Video Anomaly Detection

Abstract

Recent anomaly detection algorithms have shown powerful performance by adopting frame-predicting autoencoders. However, these methods face two challenging circumstances. First, they are likely to be trained to be excessively powerful, generating even abnormal frames well, which leads to failure in detecting anomalies. Second, they are distracted by the large number of objects captured in both the foreground and background. To solve these problems, we propose a novel superpixel-based video data transformation technique named Random Superpixel Erasing on Moving Objects (RandomSEMO) and a Moving Object Loss (MOLoss), built on top of a simple lightweight autoencoder. RandomSEMO is applied to the moving object regions by randomly erasing their superpixels. It forces the network to pay attention to the foreground objects and learn normal features more effectively, rather than simply predicting the future frame. Moreover, MOLoss urges the model to focus on learning the normal objects captured within RandomSEMO by amplifying the loss on pixels near the moving objects. The experimental results show that our model outperforms state-of-the-art methods on three benchmarks.

Index Terms—  Video Anomaly Detection, Data Transformation, Frame Prediction, Superpixel, Moving Object

1 Introduction

Video anomaly detection (VAD) is a computer vision task that discriminates abnormal behaviors within captured scenes. It is gaining interest due to the high cost of manually monitoring surveillance videos. Because abnormal events occur far less frequently than normal ones, which is known as the data imbalance problem, the datasets used to develop anomaly detection models consist of a normal training set and a test set containing both normal and abnormal frames [1, 2, 3]. Accordingly, most deep-learning-based VAD models [4, 5, 6, 7, 8, 9, 10] are trained to recognize normal patterns during training and to detect frames with outlying patterns during testing. Some methods [4, 5, 6, 9] utilize frame-predicting autoencoders (AEs), which generate the successive frame from the input frames based on the learned normal patterns. These approaches discriminate abnormal scenes by the quality of the predicted output, assuming that an unusual frame is likely to be generated poorly.

Fig. 1: Process of the proposed RandomSEMO. (a) shows the procedure applied to each frame. Using the optical flow computed between $\mathbf{F_t}$ and $\mathbf{F_{t+1}}$ ((2)), we estimate the region of the moving objects ((3)). Then, within that region, we group pixels with similar characteristics into superpixels ((4)). Finally, we randomly erase some of the superpixels ((5)). (b) illustrates the application on three successive frames.
Fig. 2: Overall framework of our method. During training, the AE predicts the future frame $\mathbf{P'_t}$ from the RandomSEMO-transformed input frames $\mathbf{C'_t}$. The training proceeds with the help of the proposed MOLoss. During testing, RandomSEMO is detached, and the normality score $S_t$, obtained from the quality of $\mathbf{P'_t}$, determines the normality of each frame. The values in brackets indicate [channel, temporal, height, width] of each feature and (depth, height, width) of each kernel, in order.

However, these approaches face two major challenges. First, the AEs tend to perform excessively well, generating even abnormal frames with high quality, which is detrimental to detecting anomalies. Second, the large number of objects in the foreground and background distracts the AE from focusing on the prominent moving objects, even though abnormal behaviors generally take place in the moving foreground rather than among the background objects.

To cope with these problems, inspired by recently proposed data transformation algorithms [9, 12, 13], we propose a novel video transformation technique named RandomSEMO. It is applied in the data pre-processing stage, before feeding the input to the frame-predicting AE, during the training phase. RandomSEMO erases random superpixels found on the moving object regions. It works under the assumption that the model is likely to learn the most prominent objects' normal patterns when it is given partially insufficient data for the frame prediction task.

We also propose a Moving Object Loss (MOLoss) to maximize the effect of RandomSEMO. MOLoss amplifies the loss of the pixels on the moving objects by generating a weight map. By utilizing the proposed loss function, our model is designed to focus on the foreground regions, effectively extracting crucial features with less distraction from background objects.

Our contributions are summarized as follows: (1) We propose a superpixel-based video transformation method called RandomSEMO. (2) We maximize the advantage of RandomSEMO with the proposed loss function, MOLoss. (3) Our model surpasses or performs comparably to other methods.

2 Related Works

Various data transformation techniques, such as augmentations, have been proposed to boost feature learning for computer vision tasks. At the image level, Zhong et al. [14] introduced Random Erasing, which occludes a random patch in the image. DeVries and Taylor [15] suggested CutOut, a method that deletes a box at a random location. Both methods have been experimentally shown to enhance the performance of various tasks.

A few VAD methods based on data transformation have been proposed. Georgescu et al. [16] used sequence-reversed frames to embed motion feature learning. Park et al. [9] devised SRT and TMT, which rotate random rectangular patches in the spatial dimension and shuffle the sequence of random patches, respectively. Our work is closely related to this method. However, compared to the work of Park et al. [9], our proposed RandomSEMO is fully aware of the moving objects and is applied only to the corresponding regions. It transforms the superpixels of those regions, which is more sophisticated and foreground-aware than applying transformations to random rectangular patches. Therefore, our method has superior localization ability, which is discussed in Sec. 4.2 and Fig. 5.

3 Method

The overall framework is described in Fig. 2. We resize $N_f$ consecutive frames $\mathbf{F_{t-N_f}}, \mathbf{F_{t-N_f+1}}, \dots, \mathbf{F_{t-1}}$ to $3\times 240\times 360$ and stack them into a 4D cuboid $\mathbf{C_t}\in\mathbb{R}^{3\times N_f\times 240\times 360}$. During training, RandomSEMO transforms $\mathbf{C_t}$ into $\mathbf{C'_t}$. Then, $\mathbf{C'_t}$ is fed to the AE, which generates a prediction of the future frame, $\mathbf{P'_t}$. The AE is trained with the proposed MOLoss between $\mathbf{P'_t}$ and the ground truth $\mathbf{P_t}$. During inference, RandomSEMO is detached: $\mathbf{C_t}$ is fed to the AE, and the normality score $S_t$ is calculated from $\mathbf{P'_t}$ to distinguish abnormal frames.
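To make the data flow concrete, the following is a minimal PyTorch sketch of one training step under the framework above. Here `random_semo`, `autoencoder`, and `mo_loss` are hypothetical stand-ins for the components detailed in the following subsections, and batching is limited to a single sample for brevity.

```python
import torch

def training_step(frames, autoencoder, optimizer, random_semo, mo_loss, n_f=5):
    """One training step on N_f consecutive frames plus the ground-truth future frame.

    frames: tensor of shape [3, n_f + 1, 240, 360] (already resized), where
    frames[:, :n_f] form the input cuboid C_t and frames[:, n_f] is the target P_t.
    """
    cuboid = frames[:, :n_f]                 # C_t: [3, N_f, 240, 360]
    target = frames[:, n_f]                  # P_t: [3, 240, 360]

    # RandomSEMO transforms C_t into C'_t and also returns the summed
    # moving-object mask, which MOLoss reuses as a weight map.
    cuboid_aug, sbm = random_semo(cuboid, target)

    pred = autoencoder(cuboid_aug.unsqueeze(0)).squeeze(0)   # P'_t
    loss = mo_loss(pred, target, sbm)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```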

3.1 RandomSEMO

We propose RandomSEMO, a transformation method that eliminates the details of the foreground by randomly erasing the superpixels of the moving objects. The purpose of RandomSEMO is to train the AE to extract prototypical normal features of the foreground objects rather than simply learning to predict frames well.

To apply RandomSEMO, we first localize the moving objects. Given a sequence of frames $\mathbf{F_{t-N_f}}, \mathbf{F_{t-N_f+1}}, \dots, \mathbf{F_{t-1}}, \mathbf{P_t}$, we compute the optical flow between each pair of successive frames (Fig. 1 (2)). Then, we generate moving object masks $\mathbf{MoMask}$ (Fig. 1 (3)) by thresholding the magnitude of the optical flow. Next, we apply SLIC [17] to the masked regions of the former frames to generate superpixel maps (Fig. 1 (4)) with at most $N_{sp}$ components. Finally, we randomly erase superpixels by replacing their values with zero (Fig. 1 (5)); each superpixel is erased with probability $T_{sp}$.
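As an illustration, below is a minimal per-frame sketch of RandomSEMO in Python. OpenCV's Farneback flow is substituted for the pre-trained FlowNet2 used in the paper, and `flow_thresh` is an assumed magnitude threshold whose value the paper does not report.

```python
import cv2
import numpy as np
from skimage.segmentation import slic

def random_semo_frame(frame, next_frame, n_sp=10, t_sp=0.3, flow_thresh=1.0, rng=None):
    """Apply RandomSEMO to one frame of shape (H, W, 3), values in [0, 255]."""
    rng = rng if rng is not None else np.random.default_rng()

    # (2) Optical flow between F_t and F_{t+1} (Farneback as a FlowNet2 stand-in).
    gray_t = cv2.cvtColor(frame.astype(np.uint8), cv2.COLOR_BGR2GRAY)
    gray_t1 = cv2.cvtColor(next_frame.astype(np.uint8), cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(gray_t, gray_t1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    # (3) Moving-object mask MoMask from the flow magnitude.
    mo_mask = np.linalg.norm(flow, axis=-1) > flow_thresh

    out = frame.copy()
    if mo_mask.any():
        # (4) SLIC superpixels computed only inside the moving-object region;
        # pixels outside the mask receive label 0.
        segments = slic(frame, n_segments=n_sp, mask=mo_mask, start_label=1)

        # (5) Erase each superpixel independently with probability T_sp.
        for label in np.unique(segments):
            if label > 0 and rng.random() < t_sp:
                out[segments == label] = 0
    return out, mo_mask
```

In practice, this routine would be applied independently to each of the $N_f$ input frames of the cuboid.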

By utilizing RandomSEMO, the network is forced to learn the appearance and motion features of the moving objects to infer the future frame despite the randomly erased information, as seen in Fig. 1 (b). This solves the aforementioned excessive predicting power problem of previous methods, because our method makes the model focus on normal feature learning rather than simply extrapolating the upcoming frame from cues in the input frames. Also, the AE effectively extracts normal features because RandomSEMO nudges the model to pay attention to the moving objects, which are more potentially abnormal than the background.

Fig. 3: The MOWM generation process.

3.2 MOLoss

To maximize the effect of RandomSEMO, we optimize our model with the newly proposed MOLoss. MOLoss is a weighted sum of a moving-object-weighted L1 loss $L_m$ and an SSIM loss [18] $L_s$, both computed between $\mathbf{P'_t}$ and $\mathbf{P_t}$. $L_m$ amplifies the loss on pixels near the moving objects by using a Moving Object Weight Map ($\mathbf{MOWM}$). Fig. 3 shows an example of $\mathbf{MOWM}$ and its generation process. We sum all the $\mathbf{MoMask}$s obtained during RandomSEMO to generate a summed binary mask ($\mathbf{SBM}$) (Fig. 3 (b)), which represents the moving object region of the input frame cuboid $\mathbf{C'_t}$. $\mathbf{SBM}$ is expressed as:

$\mathbf{SBM_t} = \min\left(\sum_{k=0}^{t-1}\mathbf{MoMask_k},\, 255\right)$ (1)

Then, we apply a distance function to $\mathbf{SBM}$ to generate $\mathbf{MOWM}$, which encodes the inverse distance from each pixel to the nearest foreground pixel. Thereby, the values near the foreground are large and vice versa. $\mathbf{MOWM}$ is defined as:

$\mathbf{MOWM^t} = 255 - \begin{cases} \min_q \sqrt{(p_x-q_x)^2+(p_y-q_y)^2}, & p\in\mathbf{SBM}_{bg}\\ 0, & p\in\mathbf{SBM}_{fg} \end{cases}$ (2)

where $p$ and $q$ indicate a pixel in $\mathbf{SBM_t}$ and its nearest foreground pixel, respectively. Evidently, pixels near the foreground have relatively large values at the corresponding locations in $\mathbf{MOWM}$. $\mathbf{MOWM}$ is then multiplied during the L1 loss calculation, amplifying the loss values near the moving objects. At this stage, we add a small value $\eta$ to the tensorized $\mathbf{MOWM}$ to avoid zero values. Because $L_m$ is a pixel-level loss function, we combine it with $L_s$ to guarantee the similarity of $\mathbf{P'_t}$ and $\mathbf{P_t}$ at the feature level. The equations are as follows:

$L_m(\mathbf{P'_t},\mathbf{P_t}) = \frac{1}{M\times N}\sum_{x=0}^{M-1}\sum_{y=0}^{N-1}\left[\left(\mathbf{MOWM}^t_{(x,y)}+\eta\right)\times\left|\mathbf{P'}_{(x,y)}-\mathbf{P}_{(x,y)}\right|\right]$ (3)

$L_s(\mathbf{P'_t},\mathbf{P_t}) = 1-\frac{(2\mu_{\mathbf{P'_t}}\mu_{\mathbf{P_t}}+c_1)(2\sigma_{\mathbf{P'_t}\mathbf{P_t}}+c_2)}{(\mu_{\mathbf{P'_t}}^2+\mu_{\mathbf{P_t}}^2+c_1)(\sigma_{\mathbf{P'_t}}^2+\sigma_{\mathbf{P_t}}^2+c_2)}$ (4)

In Eq. 3, $M$ and $N$ denote the number of pixels along the width and height axes, respectively, and $x$ and $y$ index the pixel location along those axes. In Eq. 4, $\mu$ and $\sigma^2$ denote the mean and variance of each frame, respectively, and $\sigma_{\mathbf{P'_t}\mathbf{P_t}}$ denotes their covariance. $c_1$ and $c_2$ are constants that stabilize the division. Finally, MOLoss is defined as:

$MOLoss(\mathbf{P'_t},\mathbf{P_t}) = w_m L_m(\mathbf{P'_t},\mathbf{P_t}) + w_s L_s(\mathbf{P'_t},\mathbf{P_t})$ (5)

where $w_m$ and $w_s$ denote the weights that control the contributions of $L_m$ and $L_s$, respectively.
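A compact sketch of MOLoss is given below, assuming frames scaled to $[0,1]$ and using SciPy's Euclidean distance transform to realize the distance function of Eq. 2. Clipping the weight map to $[0, 255]$ and the particular SSIM constants $c_1$, $c_2$ are assumptions not specified in the paper.

```python
import numpy as np
import torch
from scipy.ndimage import distance_transform_edt

def moving_object_weight_map(sbm):
    """MOWM from the summed binary mask SBM (H, W), foreground where SBM > 0.

    Background pixels receive 255 minus their Euclidean distance to the nearest
    foreground pixel (Eq. 2); foreground pixels receive 255. Clipping at 0 is an
    assumption for pixels farther than 255 px from any moving object.
    """
    fg = np.asarray(sbm) > 0
    dist = distance_transform_edt(~fg)            # 0 on foreground pixels
    return torch.from_numpy(np.clip(255.0 - dist, 0.0, 255.0)).float()

def mo_loss(pred, target, sbm, eta=1.0, w_m=0.25, w_s=0.75,
            c1=0.01 ** 2, c2=0.03 ** 2):
    """MOLoss = w_m * L_m + w_s * L_s (Eqs. 3-5); pred, target: [C, H, W] in [0, 1]."""
    # L_m: MOWM-weighted L1 loss (Eq. 3).
    weight = moving_object_weight_map(sbm).unsqueeze(0) + eta   # broadcast over channels
    l_m = (weight * (pred - target).abs()).mean()

    # L_s: global SSIM loss over the whole frame (Eq. 4).
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))
    l_s = 1.0 - ssim

    return w_m * l_m + w_s * l_s
```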

3.3 AE Architecture

The AE used in our method is trained to learn the likelihood of normal frames and to predict the upcoming frame $\mathbf{P'_t}$ from the transformed input frame cuboid $\mathbf{C'_t}$ using the learned normal features. To avoid excessive generalization, we design a lightweight yet effective AE. As shown in Fig. 2, our AE is composed of a three-layer encoder and a three-layer decoder, with skip connections in between to supplement the features lost during downsampling. Each encoder layer consists of a 3D convolution, batch normalization, and LeakyReLU activation. After the last encoder layer, an Atrous Spatial Pyramid Pooling (ASPP) [21] layer follows to enlarge the receptive field. The decoder layers are all made of a 3D deconvolution, batch normalization, and ReLU activation.
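A minimal PyTorch sketch of such an AE follows. The channel widths, kernel sizes, and the exact ASPP configuration are illustrative assumptions (the concrete values appear in Fig. 2); only the overall layout, three Conv3d + BN + LeakyReLU encoder layers, an ASPP-style block of parallel dilated convolutions, and three deconvolution + BN + ReLU decoder layers with skip connections, follows the description above.

```python
import torch
import torch.nn as nn

class LightweightAE(nn.Module):
    """Frame-predicting 3D-conv AE sketch: input [B, 3, N_f, 240, 360],
    output the predicted future frame [B, 3, 240, 360]."""

    def __init__(self, n_f=5, widths=(32, 64, 128)):
        super().__init__()
        c1, c2, c3 = widths

        def enc(cin, cout):  # 3D conv, spatial stride 2, batch norm, LeakyReLU
            return nn.Sequential(
                nn.Conv3d(cin, cout, kernel_size=3, stride=(1, 2, 2), padding=1),
                nn.BatchNorm3d(cout), nn.LeakyReLU(0.2, inplace=True))

        def dec(cin, cout):  # 3D deconv, spatial stride 2, batch norm, ReLU
            return nn.Sequential(
                nn.ConvTranspose3d(cin, cout, kernel_size=(3, 4, 4),
                                   stride=(1, 2, 2), padding=1),
                nn.BatchNorm3d(cout), nn.ReLU(inplace=True))

        self.enc1, self.enc2, self.enc3 = enc(3, c1), enc(c1, c2), enc(c2, c3)

        # ASPP-like block: parallel spatially dilated convs fused by a 1x1x1 conv.
        self.aspp = nn.ModuleList([
            nn.Conv3d(c3, c3, kernel_size=3, padding=(1, d, d), dilation=(1, d, d))
            for d in (1, 2, 4)])
        self.fuse = nn.Conv3d(3 * c3, c3, kernel_size=1)

        self.dec1, self.dec2, self.dec3 = dec(c3, c2), dec(c2, c1), dec(c1, 3)
        # Collapse the temporal dimension to a single predicted frame.
        self.head = nn.Conv3d(3, 3, kernel_size=(n_f, 3, 3), padding=(0, 1, 1))

    def forward(self, x):
        e1 = self.enc1(x)            # [B, c1, N_f, 120, 180]
        e2 = self.enc2(e1)           # [B, c2, N_f,  60,  90]
        e3 = self.enc3(e2)           # [B, c3, N_f,  30,  45]
        a = self.fuse(torch.cat([branch(e3) for branch in self.aspp], dim=1))
        d1 = self.dec1(a) + e2       # skip connection
        d2 = self.dec2(d1) + e1      # skip connection
        d3 = self.dec3(d2)           # [B, 3, N_f, 240, 360]
        return self.head(d3).squeeze(2)
```

For example, `LightweightAE()(torch.randn(1, 3, 5, 240, 360))` returns a tensor of shape `[1, 3, 240, 360]`.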

3.4 Normality Score

During inference, we adopt the peak signal-to-noise ratio (PSNR) to estimate the abnormality of each frame. It is defined as $PSNR(\mathbf{P'_t},\mathbf{P_t}) = 10\log_{10}\frac{\max(\mathbf{P'_t})}{\|\mathbf{P'_t}-\mathbf{P_t}\|_2^2/N}$, where $N$ is the number of pixels in the frame. When $\mathbf{P_t}$ contains anomalies that our network has never seen during training, the network cannot predict a clear $\mathbf{P'_t}$, resulting in a low PSNR, and vice versa. Following [2, 11, 4, 7, 5, 10, 9, 8], we normalize $PSNR(\mathbf{P'_t},\mathbf{P_t})$ within each video clip to the range $[0,1]$ to obtain the final normality score $S_t$:

$S_t = \frac{PSNR(\mathbf{P'_t},\mathbf{P_t}) - \min PSNR(\mathbf{P'_t},\mathbf{P_t})}{\max PSNR(\mathbf{P'_t},\mathbf{P_t}) - \min PSNR(\mathbf{P'_t},\mathbf{P_t})}$ (6)
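For reference, a small sketch of the scoring step, assuming the predicted and ground-truth frames of one test clip are stacked into tensors:

```python
import torch

@torch.no_grad()
def normality_scores(preds, targets):
    """Per-clip normality scores S_t (Eq. 6) from the PSNR defined in Sec. 3.4.

    preds, targets: tensors of shape [T, 3, H, W] for one test clip.
    Returns T scores in [0, 1]; higher values indicate more normal frames.
    """
    psnr = torch.empty(preds.shape[0])
    for i in range(preds.shape[0]):
        mse = torch.mean((preds[i] - targets[i]) ** 2)   # ||P'_t - P_t||^2 / N
        psnr[i] = 10.0 * torch.log10(preds[i].max() / mse)
    # Min-max normalization within the clip (Eq. 6).
    return (psnr - psnr.min()) / (psnr.max() - psnr.min())
```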
| Method | Avenue [1] | ST [2] | Ped2 [3] | FPS |
|---|---|---|---|---|
| STAN [10] | 81.7 | - | 92.2 | 50 |
| AbnGAN [25] | - | - | 93.5 | - |
| HyAE [7] | 82.8 | - | 84.3 | - |
| MemAE [11] | 83.3 | 71.2 | 94.1 | 38 |
| AD [26] | - | - | 95.5 | 2 |
| IntAE [6] | 83.7 | 71.5 | 96.2 | 30 |
| MNAD-Recon [5] | 82.8 | 69.8 | 90.2 | 67 |
| FastAno [9] | 85.3 | 72.2 | 96.3 | 195 |
| Base | 84.1 | 71.6 | 93.7 | 127 |
| Base+R | 83.9 | 72.2 | 94.2 | 127 |
| Base+R+M (Ours) | 85.4 | 72.4 | 95.8 | 127 |

Table 1: Frame-level AUC (%) comparison. All figures and FPS values are copied from the corresponding papers. 'Base' indicates our baseline framework without RandomSEMO and MOLoss, trained with a single $L_1$ loss instead. R and M denote RandomSEMO and MOLoss, respectively. The top two results for each category are marked in red and blue.

4 Experiments

Dataset. We validate our network on three popular benchmarks: CUHK Avenue [1], ShanghaiTech Campus (ST) [2], and UCSD Ped2 [3]. These datasets were all acquired with fixed cameras in real-world environments. Among the three, ST [2] is the largest and most complex, containing 330 training videos and 107 testing videos captured from 13 different scenes.

Evaluation metric. For the evaluation, we adopt the area under curve (AUC) obtained from the frame-level scores and the ground truth labels. This metric is used in most studies [4, 7, 6, 9, 11, 12] on VAD. Since the first five frames of each clip cannot be predicted, they are ignored in the evaluation, following other prediction-based methods [4, 6, 5, 9].

Settings. We implement our experiments using PyTorch and an NVIDIA RTX A6000. We use a pre-trained FlowNet2 [23] to generate optical flow maps. Training uses a batch size of 30 and the Adam optimizer with an initial learning rate of 2e-4, $\beta_1=0.5$, and $\beta_2=0.9$. The network is trained for 40 epochs on Ped2 [3] and Avenue [1] and 10 epochs on ST [2]. $\eta$, $w_m$, $w_s$, $N_f$, $N_{sp}$, and $T_{sp}$ are empirically set to 1, 0.25, 0.75, 5, 10, and 0.3, respectively.
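As a small configuration sketch, the reported optimizer settings and hyperparameters could be wired up as follows; `model` is a stand-in module, since the actual AE is the one described in Sec. 3.3.

```python
import torch
import torch.nn as nn

# Stand-in module; in practice this would be the frame-predicting AE of Sec. 3.3.
model = nn.Conv3d(3, 3, kernel_size=3, padding=1)

# Optimizer and hyperparameters as reported above.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.9))
hparams = dict(eta=1.0, w_m=0.25, w_s=0.75, n_f=5, n_sp=10, t_sp=0.3,
               batch_size=30, epochs={"ped2": 40, "avenue": 40, "st": 10})
```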

4.1 Quantitative Results

The quantitative results are shown in Table 1. For a fair comparison, we compare our network with eight VAD networks that do not require any pre-trained network (such as an object detector or an optical flow network) during inference. The accuracy of our model is superior to most of the compared methods on the three datasets. Furthermore, our proposed network is capable of detecting anomalies at 127 frames per second (FPS), faster than most methods. Because both accuracy and speed are essential in real-world applications, our network is more practical than other methods that show low accuracy or slow speed.

| $T_{sp}$ | Avenue [1] | Ped2 [3] |
|---|---|---|
| 0.1 | 84.1 | 94.9 |
| 0.2 | 85.0 | 95.9 |
| 0.3 | 85.4 | 95.8 |
| 0.4 | 84.5 | 95.0 |
| 0.5 | 84.4 | 94.1 |
| 0.6 | 84.4 | 94.7 |

Table 2: Impact of the RandomSEMO probability $T_{sp}$ (frame-level AUC, %).

Fig. 4: Graph of Table 2.
| Loss | Avenue [1] | ST [2] | Ped2 [3] |
|---|---|---|---|
| $L_1$ | 83.9 | 72.2 | 94.2 |
| SSIM [18] | 84.3 | 72.0 | 94.2 |
| $w_m L_1 + w_s$ SSIM [18] | 84.8 | 72.3 | 94.2 |
| MOLoss | 85.4 | 72.4 | 95.8 |

Table 3: Impact of MOLoss. Comparison with $L_1$, SSIM [18], and a weighted sum of $L_1$ and SSIM [18] (frame-level AUC, %).

4.2 Qualitative Results

Fig. 5 shows the output and difference map of our network compared to those of FastAno [9]. The difference map shows that our network localizes the abnormal object more precisely. Only the biker is saturated in the output of our proposed model, whereas the background pixels near the biker are unnecessarily highlighted by FastAno [9], mirroring the effect of our moving-object-aware RandomSEMO and MOLoss. Furthermore, the difference between foreground and background is more pronounced in our results.

4.3 Impact of RandomSEMO Probability $T_{sp}$

We searched for the optimal value of $T_{sp}$ by varying it from 0.1 to 0.6. Table 2 and Fig. 4 show the results of the experiments conducted on Avenue [1] and Ped2 [3]. For Avenue [1], the accuracy is highest when $T_{sp}$ is 0.3. For Ped2 [3], the model trained with $T_{sp}=0.2$ shows the best performance, and it is nearly as accurate when trained with $T_{sp}=0.3$. We also observe that a low $T_{sp}$ leads to relatively low performance because only a few superpixels are obscured, leaving practically no significant difference from the original image. Similarly, the accuracy drops when $T_{sp}$ is high because too much information is lost.

4.4 Impact of MOLoss

We conducted ablation studies to validate the contribution of MOLoss. The results are compared in Table 3. We experimented by changing the loss function to an $L_1$ loss, an SSIM loss, and a weighted sum of $L_1$ and SSIM whose weights are equivalent to those of MOLoss. From the results, it is observed that using MOLoss strongly boosts the performance, especially compared to using a single $L_1$ loss.

Fig. 5: Qualitative results on Ped2 [3] compared with those of FastAno [9]. The biker on the pedestrian street is the anomaly in this scene.

5 Conclusion

We propose a novel superpixel-based video transformation method called RandomSEMO for anomaly detection. We also propose a loss function named MOLoss, which boosts the effect of RandomSEMO by amplifying the loss values on pixels near the moving objects. The experimental results show that our network performs competitively with or surpasses previous methods. Furthermore, we validated the impact of RandomSEMO and MOLoss with ablation studies and output comparisons. As future work, we will further verify the localization ability of our model for broader usage.

References

  • [1] Cewu Lu, Jianping Shi, and Jiaya Jia, “Abnormal event detection at 150 fps in matlab,” in ICCV, 2013, pp. 2720–2727.
  • [2] Weixin Luo, Wen Liu, and Shenghua Gao, “A revisit of sparse coding based anomaly detection in stacked rnn framework,” in ICCV, 2017, pp. 341–349.
  • [3] Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos, “Anomaly detection in crowded scenes,” in CVPR. IEEE, 2010, pp. 1975–1981.
  • [4] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao, “Future frame prediction for anomaly detection–a new baseline,” in CVPR, 2018, pp. 6536–6545.
  • [5] Hyunjong Park, Jongyoun Noh, and Bumsub Ham, “Learning memory-guided normality for anomaly detection,” in CVPR, 2020, pp. 14372–14381.
  • [6] Yao Tang, Lin Zhao, Shanshan Zhang, Chen Gong, Guangyu Li, and Jian Yang, “Integrating prediction and reconstruction for anomaly detection,” PR Letters, vol. 129, pp. 123–130, 2020.
  • [7] Trong Nguyen Nguyen and Jean Meunier, “Hybrid deep network for anomaly detection,” BMVC, 2019.
  • [8] MyeongAh Cho, Taeoh Kim, Ig-Jae Kim, and Sangyoun Lee, “Unsupervised video anomaly detection via normalizing flows with implicit latent features,” arXiv preprint arXiv:2010.07524, 2020.
  • [9] Chaewon Park, MyeongAh Cho, Minhyeok Lee, and Sangyoun Lee, “Fastano: Fast anomaly detection via spatio-temporal patch transformation,” in WACV, 2022, pp. 2249–2259.
  • [10] Sangmin Lee, Hak Gu Kim, and Yong Man Ro, “Stan: Spatio-temporal adversarial networks for abnormal event detection,” in ICASSP. IEEE, 2018, pp. 1323–1327.
  • [11] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel, “Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection,” in ICCV, 2019, pp. 1705–1714.
  • [12] Muhammad Zaigham Zaheer, Jin-ha Lee, Marcella Astrid, and Seung-Ik Lee, “Old is gold: Redefining the adversarially learned one-class classifier training paradigm,” in CVPR, 2020, pp. 14183–14193.
  • [13] Abhishek Joshi and Vinay P Namboodiri, “Unsupervised synthesis of anomalies in videos: transforming the normal,” in IJCNN. IEEE, 2019, pp. 1–8.
  • [14] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang, “Random erasing data augmentation,” in AAAI, 2020, vol. 34, pp. 13001–13008.
  • [15] Terrance DeVries and Graham W Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017.
  • [16] Mariana-Iuliana Georgescu, Antonio Barbalau, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah, “Anomaly detection in video via self-supervised and multi-task learning,” in CVPR, 2021, pp. 12742–12752.
  • [17] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk, “Slic superpixels,” Tech. Rep., 2010.
  • [18] Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz, “Loss functions for image restoration with neural networks,” IEEE Transactions on computational imaging, vol. 3, no. 1, pp. 47–57, 2016.
  • [19] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in ICCV, 2015, pp. 4489–4497.
  • [20] Andrew L Maas, Awni Y Hannun, Andrew Y Ng, et al., “Rectifier nonlinearities improve neural network acoustic models,” in ICML. Citeseer, 2013, vol. 30, p. 3.
  • [21] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” TPAMI, vol. 40, no. 4, pp. 834–848, 2017.
  • [22] Vinod Nair and Geoffrey E Hinton, “Rectified linear units improve restricted boltzmann machines,” in ICML, 2010.
  • [23] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in CVPR, 2017, pp. 2462–2470.
  • [24] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [25] Mahdyar Ravanbakhsh, Moin Nabi, Enver Sangineto, Lucio Marcenaro, Carlo Regazzoni, and Nicu Sebe, “Abnormal event detection in videos using generative adversarial nets,” in ICIP. IEEE, 2017, pp. 1577–1581.
  • [26] Mahdyar Ravanbakhsh, Enver Sangineto, Moin Nabi, and Nicu Sebe, “Training adversarial discriminators for cross-channel abnormal event detection in crowds,” in WACV. IEEE, 2019, pp. 1896–1904.
  • [27] Fei Dong, Yu Zhang, and Xiushan Nie, “Dual discriminator generative adversarial network for video anomaly detection,” IEEE Access, vol. 8, pp. 88170–88176, 2020.