Abnormal Behavior Detection Based on Target Analysis
Abstract
Abnormal behavior detection in surveillance video is a pivotal part of the intelligent city. Most existing methods only consider how to detect anomalies and pay little attention to explaining the reasons behind them. We investigate an orthogonal perspective based on the reasons for these abnormal behaviors. To this end, we propose a multivariate fusion method that analyzes each target through three branches: object, action and motion. The object branch focuses on appearance information, the motion branch focuses on the distribution of motion features, and the action branch focuses on the action category of the target. The three branches focus on different information, so they complement each other and jointly detect abnormal behavior. The final abnormal score is then obtained by combining the abnormal scores of the three branches.
Index Terms— Abnormal behavior detection, Multivariate fusion
1 Introduction
Abnormal behavior detection in surveillance video is a challenging task in computer vision. In practical applications, the definition of abnormality in video varies. For example, it may be an abnormal event for a pedestrian to ride a bicycle in a square, but cycling in a non-motorized vehicle lane is usually regarded as normal behavior. Therefore, it is difficult to define abnormality. Generally speaking, a video event is considered an anomaly if it is not very likely to occur in the video [1]. Unlike other tasks, the training set usually contains only normal samples, and anomalies are defined relative to these normal samples.
With the development of intelligent monitoring, we not only need to accurately detect the abnormal areas in surveillance video, but also need to analyze the reasons for the anomalies.

If we can give the reasons for the anomalies when we detect them, observers can quickly judge whether they are false alarms. Many different behaviors may be considered abnormal, such as holding a gun or running in a panic. Therefore, to explain abnormal behavior, we need to analyze each target from multiple perspectives. By fusing the results of these perspectives, we can jointly detect anomalies and explain them.
Here we propose a multivariate fusion method to detect and explain anomalies. We analyze each target through three perspectives: object, action and motion. The object branch focuses on appearance information, the motion branch focuses on the distribution of motion features, and the action branch focuses on the action category of the target. Both the motion and action perspectives use video information. The three perspectives focus on different aspects but complement each other, so when an anomaly is detected, the reason for the decision can be explained. As shown in Fig.1, the red area on the left is a person riding with abnormal motion, which is regarded as an abnormal event in the UCSD Ped2 dataset. On the right is the detection result of the target, which is explained as 'person', 'riding', 'abnormal motion'.
However, detecting the action category of each target in a video is a critical task. Most existing action recognition algorithms classify whole videos rather than individual targets: a few people continuously perform one action and the entire video is assigned a single action category, which we call single-target, single-action recognition. In the anomaly detection setting, each frame contains multiple targets and every target may perform a different action. Therefore, these algorithms cannot solve the multi-target, multi-action recognition problem.
We propose an action recognition module to solve this problem. First, each frame is sent to the object detection network to obtain the category and coordinates of every target. Then, each target is tracked by a visual tracking algorithm to obtain its location in the subsequent frames. By combining object detection and visual tracking, we can obtain a short video that contains only one target.

These videos are then fed into the action recognition network to obtain the action category and the confidence score of each target. The main contributions of our work are as follows:
(1) We propose a multivariate fusion method to detect and explain anomalies through three branches: object, action and motion. The performance of the fused method is not only higher than the performance of each branch, but also outperforms the state-of-the-art methods.
(2) We propose an action recognition module to solve the multi-target, multi-action recognition problem in surveillance video. To the best of our knowledge, this is the first time that action recognition using inter-frame information has been applied to anomaly detection.
2 Related Works
In early studies, traditional model-driven methods played a dominant role in anomaly detection. These methods usually extract hand-crafted features, which are generally divided into two types: motion and appearance. Motion-based features are often classified into three categories. The first is based on optical flow, such as Histograms of Oriented Optical Flow (HOF) [2], Multi-scale Histogram of Optical Flow (MHOF) [1] and Histogram of Magnitude Optical Flow (HMOF). The second is based on trajectories [3], which focus on learning the normal trajectories of targets in videos. The last category is based on energy, which considers crowd density and energy distribution, such as the social force model [4] and the pedestrian loss model [5]. Appearance-based features generally include RGB information, such as Histograms of Oriented Gradients (HOG) [6] and spatio-temporal gradients [7].
In recent years, anomaly detection algorithms based on deep learning have also been developed and achieve good results. These methods are data-driven and usually do not extract hand-crafted features. Instead, they use neural networks to extract high-level features from video sequences [8]. For example, Sabokrou et al. [9] used Fully Convolutional Networks (FCNs) to extract deep features to distinguish anomalies. However, these methods did not exploit inter-frame information, even though action is a continuous behavior for which inter-frame information is particularly important. Although MT-FRCN [24] analyzed each target from three perspectives, it only used single-frame information, so the redundancy among the perspectives is high: its fused performance was lower than that of the individual perspectives, which indicates the limitations of that method.
3 Methodology
The proposed method is shown in Fig.2. First, each frame is sent to the object detection network to obtain the label and coordinates of every target. Then, we analyze each target through three branches: object, action and motion. Finally, the final abnormal score is obtained by combining the three branches. When an anomaly is detected, the reason for the decision can be explained.
3.1 Object Branch
The first step is to detect every target in the video. We use the well-known Yolo v3 [10] algorithm for object detection. Each video frame is fed into Yolo v3 to obtain the label, confidence score, and coordinates of each target, and we only keep the label with the highest confidence score.
In anomaly detection, the training set usually contains only normal samples, so we do not know in advance which labels correspond to anomalies. According to the definition above, an event is considered an anomaly if it is not very likely to occur in the video. Therefore, if the label of an object has appeared in the training set, high object confidence implies a low probability of anomaly; otherwise, high object confidence means that the target is more likely to be an anomaly. Considering that the network may produce some false detections on the training set, we only retain object labels with high confidence.
We take the training set of the anomaly detection dataset as the input of Yolo v3 and obtain the label and confidence of each target. Labels whose confidence is higher than a threshold $\theta$ are stored in a list denoted as $L_o$. The abnormal score of the $i$-th target is defined as:

$$s^o_i = \begin{cases} 1 - c_i, & l_i \in L_o \\ c_i, & l_i \notin L_o \end{cases} \qquad (1)$$

where $l_i$ and $c_i$ are the label and the confidence of the target, respectively. The scores $s^o_i$ of all targets in the test set are stored in a list denoted as $S_o$, and then each abnormal score is normalized to [0,1]:

$$\hat{s}^o_i = \frac{s^o_i - \min(S_o)}{\max(S_o) - \min(S_o)} \qquad (2)$$
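To make the scoring concrete, the following is a minimal sketch of the object-branch computation in Eqs. (1)-(2). The function and variable names are ours, not the authors' code, and the small epsilon guarding the normalization is an added safeguard.

```python
import numpy as np

def object_abnormal_scores(detections, normal_labels):
    """Sketch of the object-branch scoring (Eqs. 1-2), hypothetical helper.

    detections: list of (label, confidence) pairs, one per detected target
                in the test set (e.g. from a YOLOv3-style detector).
    normal_labels: set of labels seen with high confidence in the training set.
    """
    scores = []
    for label, conf in detections:
        if label in normal_labels:
            # Known object: high detector confidence means it is likely normal.
            scores.append(1.0 - conf)
        else:
            # Unseen object: high detector confidence means it is likely abnormal.
            scores.append(conf)
    scores = np.asarray(scores, dtype=np.float32)
    # Min-max normalize the scores of all test targets to [0, 1].
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
```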
3.2 Action Branch
The difficulty of action recognition in the anomaly detection field is how to solve the multi-target, multi-action recognition problem in a surveillance video. We propose an action recognition module to solve this problem. Based on the object detection results, we use the well-known KCF tracker [11] to track each target. Each target is tracked for $n$ frames, so we obtain the coordinates of the target in $n$ consecutive frames. After clipping, we get a short RGB video containing only that target. At the same time, the optical flow of each target is extracted from the previous and current frames using the target coordinates, yielding an optical flow video. The RGB video and the optical flow video are taken as the input of the action recognition network to obtain the action category and confidence of the target, which solves the multi-target, multi-action recognition problem in anomaly detection.
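As an illustration of this module, here is a minimal sketch of the tracking-and-clipping step using OpenCV's KCF tracker. The function name is hypothetical, and the exact OpenCV call (cv2.TrackerKCF_create, from opencv-contrib; it may sit under cv2.legacy in some builds) is an assumption about tooling, not the authors' implementation.

```python
import cv2

def track_and_clip(frames, init_box, n_track=5):
    """Crop a short single-target RGB clip by tracking a detected box with KCF.

    frames:   list of consecutive BGR frames (numpy arrays)
    init_box: (x, y, w, h) box from the object detector in frames[0]
    n_track:  number of tracked frames (the paper uses 5)
    """
    x, y, w, h = [int(v) for v in init_box]
    tracker = cv2.TrackerKCF_create()        # requires opencv-contrib; API may vary by version
    tracker.init(frames[0], (x, y, w, h))

    clip, boxes = [frames[0][y:y + h, x:x + w]], [(x, y, w, h)]
    for frame in frames[1:n_track]:
        ok, box = tracker.update(frame)
        if not ok:
            break                            # tracking failure: stop the clip early
        x, y, w, h = [int(v) for v in box]
        boxes.append((x, y, w, h))
        clip.append(frame[y:y + h, x:x + w])
    return clip, boxes                       # cropped RGB patches and their coordinates
```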
As for the action recognition network, we borrow the idea from the TSN algorithm [12]. The TSN algorithm combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The video used in the TSN algorithm usually contains much redundant information. However, in the anomaly detection tasks, the video used for the action recognition is a short video obtained from the target tracking, which contains less redundant information. Therefore, we remove the sparse temporal sampling strategy.
The process of the action branch is shown in Fig.3. We use a two-stream ConvNet architecture that incorporates spatial and temporal networks for action recognition; the ConvNet used in our work is the BN-Inception network [13]. Each RGB image and optical flow image of the short video is sent into the spatial network and the temporal network respectively, and the category scores of different frames are averaged. Finally, the category scores of the RGB and optical flow streams are fused with a weight. The weight and the training process are similar to those in the TSN algorithm.
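A minimal sketch of this two-stream score fusion is given below. The function name, the use of a softmax to turn fused scores into a confidence, and the default flow weight are our assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def two_stream_scores(rgb_scores, flow_scores, flow_weight=1.5):
    """Sketch of the two-stream fusion in the action branch (hypothetical helper).

    rgb_scores:  (num_frames, num_classes) class scores from the spatial net
    flow_scores: (num_frames, num_classes) class scores from the temporal net
    flow_weight: assumed relative weight of the optical-flow stream
    """
    rgb = np.mean(rgb_scores, axis=0)          # average class scores over frames
    flow = np.mean(flow_scores, axis=0)
    fused = rgb + flow_weight * flow           # weighted fusion of the two streams
    probs = np.exp(fused - fused.max())
    probs /= probs.sum()                       # softmax to obtain a per-action confidence
    action = int(np.argmax(probs))
    return action, float(probs[action])        # predicted action category and confidence
```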
Similar to the object branch, if the action category of the target has appeared in the training set, high action confidence means a low probability of anomaly. We take the video frame sequences of the targets tracked in the training set as the input of the action recognition network and obtain the action category and confidence. Action categories whose confidence is higher than the threshold $\theta$ are stored in a list denoted as $L_a$. The action abnormal score of the $i$-th target is defined as:

$$s^a_i = \begin{cases} 1 - c^a_i, & a_i \in L_a \\ c^a_i, & a_i \notin L_a \end{cases} \qquad (3)$$

where $a_i$ and $c^a_i$ are the action category and the confidence of the target, respectively. The action abnormal scores of all targets in the test set are stored in a list denoted as $S_a$, and then each abnormal score is normalized to [0,1]:

$$\hat{s}^a_i = \frac{s^a_i - \min(S_a)}{\max(S_a) - \min(S_a)} \qquad (4)$$
3.3 Motion Branch
We use Histogram of Magnitude Optical Flow (HMOF) features as the motion features. We extract the HMOF features of each target in the video and then send them to an auto-encoder network for reconstruction in order to further enlarge the difference between normal and abnormal features. All features of the training set are used to train a Gaussian Mixture Model (GMM) classifier, and the trained classifier is then applied to the features of the test set. Each feature gets a score $s^m_i$ after passing through the classifier. Finally, the motion abnormal scores of all targets in the test set are stored in a list denoted as $S_m$, and each abnormal score is normalized to [0,1]:

$$\hat{s}^m_i = \frac{s^m_i - \min(S_m)}{\max(S_m) - \min(S_m)} \qquad (5)$$

The larger $\hat{s}^m_i$ is, the less the motion feature of the target matches the distribution of normal motion features, so the target is more likely to be an anomaly.
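The paper does not spell out how the GMM output is mapped to a score. The sketch below assumes the negative log-likelihood under the trained GMM as the raw motion score, uses scikit-learn's GaussianMixture, and picks the number of mixture components arbitrarily; it is an illustration under those assumptions, not the authors' exact pipeline (the auto-encoder reconstruction step is also left out).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def motion_abnormal_scores(train_feats, test_feats, n_components=8):
    """Sketch of the motion-branch scoring (Eq. 5), hypothetical helper.

    train_feats / test_feats: arrays of shape (num_targets, feat_dim),
    e.g. HMOF features (optionally passed through an auto-encoder first).
    n_components is an assumed value, not taken from the paper.
    """
    gmm = GaussianMixture(n_components=n_components, covariance_type='full')
    gmm.fit(train_feats)                       # model the distribution of normal motion

    # Assumed mapping: the worse a feature fits the normal distribution,
    # the larger its negative log-likelihood, i.e. its raw abnormal score.
    raw = -gmm.score_samples(test_feats)
    # Min-max normalize all test scores to [0, 1].
    return (raw - raw.min()) / (raw.max() - raw.min() + 1e-8)
```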
3.4 Fusion
After obtaining the abnormal scores of the three branches, we fuse them to get the final abnormal score. However, different branches play different roles in different scenes. For example, in the UMN dataset the anomalies are mainly crowds running in panic, so the weights of the motion and action branches should be higher than that of the object branch. Thus, the abnormal score of each branch is weighted by a factor, and the final abnormal score of the target is determined by the three weighted abnormal scores.
In addition, anomaly detection differs from other tasks. In practical applications, the monitoring system notifies the staff to handle an abnormality after detecting it. A few false detections only slightly burden the staff, but a few missed detections mean that abnormal events cannot be handled in time, which may eventually lead to serious consequences. Considering this practical significance, we select the maximum of the weighted branch scores as the final fused score $s_i$:

$$s_i = \max\left(w_o \hat{s}^o_i,\; w_a \hat{s}^a_i,\; w_m \hat{s}^m_i\right) \qquad (6)$$

where $w_o$, $w_a$ and $w_m$ are the weights of the object, action and motion branches, respectively. The scores $s_i$ of all targets in the test set are stored in a list denoted as $S_f$, and then the final abnormal score of each target is normalized to [0,1]:

$$\hat{s}_i = \frac{s_i - \min(S_f)}{\max(S_f) - \min(S_f)} \qquad (7)$$

If $\hat{s}_i$ is greater than the threshold $T$, the target is considered an anomaly:

$$y_i = \begin{cases} 1 \ \text{(abnormal)}, & \hat{s}_i > T \\ 0 \ \text{(normal)}, & \hat{s}_i \le T \end{cases} \qquad (8)$$
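Putting the three branches together, a minimal sketch of Eqs. (6)-(8) could look as follows. The function name is ours; the default weights follow the UMN setting reported in Sec. 4.1, and the default threshold value is only illustrative.

```python
import numpy as np

def fuse_and_decide(s_obj, s_act, s_mot, w=(1.0, 1.5, 1.5), threshold=0.99):
    """Sketch of the fusion step (Eqs. 6-8), hypothetical helper.

    s_obj, s_act, s_mot: normalized abnormal scores of all test targets,
                         one array per branch.
    w: branch weights (w_o, w_a, w_m); threshold: anomaly threshold T
       (default values chosen for illustration).
    """
    stacked = np.stack([w[0] * s_obj, w[1] * s_act, w[2] * s_mot], axis=0)
    fused = stacked.max(axis=0)                          # maximum of the weighted scores
    fused = (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)
    return fused > threshold                             # True where a target is abnormal
```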

4 Experiments
We provide representative experimental results on two datasets: the UMN dataset and the UCSD Ped2 dataset. For the action branch, we select about 8,000 videos covering 16 common action categories from several common action recognition datasets for training.
As for the evaluation criteria, we adopt the frame-level criterion for anomaly detection and the pixel-level criterion for anomaly localization [14]. At the frame-level, if any pixel in a frame is detected as anomalous, the whole frame is considered anomalous. At the pixel-level, a detection is treated as a true positive only if more than 40% of the truly abnormal pixels are detected [14]. Furthermore, two criteria are used to evaluate the ROC curves: Area Under Curve (AUC) and Equal Error Rate (EER). A higher AUC and a lower EER indicate better performance.
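For reference, a minimal sketch of the frame-level evaluation (AUC and EER from the ROC curve) is shown below. The helper name and the use of scikit-learn's roc_curve are our choices, not part of the original evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def frame_level_auc_eer(frame_scores, frame_labels):
    """Sketch of the frame-level evaluation, hypothetical helper.

    frame_scores: per-frame abnormal score, e.g. the maximum target score in each frame
    frame_labels: 1 if the ground truth marks any pixel of the frame as abnormal, else 0
    """
    fpr, tpr, _ = roc_curve(frame_labels, frame_scores)
    roc_auc = auc(fpr, tpr)
    # EER: the operating point where the false positive rate equals the miss rate (1 - TPR).
    idx = np.nanargmin(np.abs(fpr - (1.0 - tpr)))
    eer = (fpr[idx] + (1.0 - tpr[idx])) / 2.0
    return roc_auc, eer
```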
4.1 UMN Dataset
Because the UMN dataset does not have pixel-level annotations, we only evaluate it at the frame-level.
The hyperparameters are empirically set as follows: $w_o$, $w_a$ and $w_m$ are set to 1, 1.5 and 1.5, the confidence threshold $\theta$ is 0.95, and the anomaly threshold $T$ is 0.99. The threshold of HMOF is 1.8, the interval is 8, and the number of tracking frames $n$ is 5.
Table.1 shows the performance comparison between the proposed algorithm and other state-of-the-art algorithms. The object branch performs very poorly, since the anomalies are people in panic, which are difficult for the object branch to detect. The motion branch is sensitive to panicking people and performs much better. The performance of the proposed method is not reduced after fusion, which indicates the effectiveness of the method. Compared with the other state-of-the-art algorithms, the proposed algorithm achieves the best performance on the UMN dataset.
Method | AUC | Method | AUC |
ZH et al.[15] | 99.3% | Leyva et al.[16] | 88.3% |
SR[1] | 97.5% | Ours-object | 48.6% |
MIP[17] | 94.4% | Ours-action | 84.8% |
Sabokrou et al.[14] | 99.6% | Ours-motion | 99.8% |
DeepCascade[18] | 99.6% | Ours-fusion | 99.8% |
4.2 UCSD Dataset
The UCSD Ped2 dataset is a standard benchmark for anomaly detection whose scenes differ from those of UMN. The hyperparameters are empirically set as follows: $w_o$, $w_a$ and $w_m$ are set to 1, the confidence threshold $\theta$ is 0.95, and the anomaly threshold $T$ is 0.99. The threshold of HMOF is 2.4, the interval is 8, and the number of tracking frames $n$ is 5.
Fig.4 shows the ROC curves on the UCSD Ped2 dataset, and the EER comparison with state-of-the-art methods is shown in Table.2. From Fig.4 and Table.2, we can see that:
1. The object, motion and action branches each have their own limitations, but the fused performance is better than that of each individual branch at both the frame-level and the pixel-level.
2. The performance of our method is significantly improved compared with MT-FRCN: the fused EER decreases by 11.6 percentage points at the frame-level, from 17.1% to 5.5%, and by 4.9 percentage points at the pixel-level, from 19.4% to 14.5%.
3. Compared with other state-of-the-art algorithms, our method achieves the lowest EER at both the frame-level and the pixel-level.
Method | Frame-level | Pixel-level |
MDT[19] | 24% | 54% |
Zhang et al.[20] | 22% | 33% |
Xu et al.[2] | 20% | 42% |
IBC[21] | 13% | 26% |
Li et al.[22] | 18.5% | 29.9% |
Leyva et al.[16] | 19.2% | 36.6% |
Sabokrou et al.[14] | 19% | 24% |
Xiao et al.[23] | 10% | 17% |
Deep-anomaly[9] | 11% | 15% |
MT-FRCN[24] | 17.1% | 19.4% |
Ours-object | 16.0% | 33.2% |
Ours-action | 11.1% | 31.5% |
Ours-motion | 10.6% | 25.3% |
Ours-fusion | 5.5% | 14.5% |
5 Conclusion
This paper proposes a multivariate fusion method based on target analysis, which not only focuses on how to detect anomalies but also tries to explain the reasons for the anomalies. We analyze each target through three branches: object, action and motion. These branches focus on different information and jointly detect and explain anomalies. Furthermore, in the action branch, we propose an action recognition module that uses inter-frame information to solve the multi-target, multi-action recognition problem, which had not been utilized in the anomaly detection field before.
References
- [1] Yang Cong, Junsong Yuan, and Ji Liu, “Sparse reconstruction cost for abnormal event detection,” in CVPR. IEEE, 2011, pp. 3449–3456.
- [2] Dan Xu, Rui Song, Xinyu Wu, Nannan Li, Wei Feng, and Huihuan Qian, “Video anomaly detection based on a hierarchical activity discovery within spatio-temporal contexts,” Neurocomputing, vol. 143, pp. 144–152, 2014.
- [3] Fan Jiang, Junsong Yuan, Sotirios A Tsaftaris, and Aggelos K Katsaggelos, “Anomalous video event detection using spatiotemporal context,” Computer Vision and Image Understanding, vol. 115, no. 3, pp. 323–333, 2011.
- [4] Ramin Mehran, Alexis Oyama, and Mubarak Shah, “Abnormal crowd behavior detection using social force model,” in CVPR. IEEE, 2009, pp. 935–942.
- [5] Paul Scovanner and Marshall F Tappen, “Learning pedestrian dynamics from the real world,” in ICCV. IEEE, 2009, pp. 381–388.
- [6] Navneet Dalal and Bill Triggs, “Histograms of oriented gradients for human detection,” in CVPR. IEEE Computer Society, 2005, vol. 1, pp. 886–893.
- [7] Louis Kratz and Ko Nishino, “Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models,” in CVPR. IEEE, 2009, pp. 1446–1453.
- [8] Yachuang Feng, Yuan Yuan, and Xiaoqiang Lu, “Deep representation for abnormal event detection in crowded scenes,” in Multimedia. ACM, 2016, pp. 591–595.
- [9] Mohammad Sabokrou, Mohsen Fayyaz, Mahmood Fathy, Zahra Moayed, and Reinhard Klette, “Deep-anomaly: Fully convolutional neural network for fast anomaly detection in crowded scenes,” Computer Vision and Image Understanding, vol. 172, pp. 88–97, 2018.
- [10] Joseph Redmon and Ali Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
- [11] João F Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista, “High-speed tracking with kernelized correlation filters,” IEEE TPAMI, vol. 37, no. 3, pp. 583–596, 2015.
- [12] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in ECCV. Springer, 2016, pp. 20–36.
- [13] Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
- [14] Mohammad Sabokrou, Mahmood Fathy, Mojtaba Hoseini, and Reinhard Klette, “Real-time anomaly detection and localization in crowded scenes,” in CVPR, 2015, pp. 56–62.
- [15] Yang Liu, Yibo Li, and Xiaofei Ji, “Abnormal event detection in nature settings,” International Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 7, no. 4, pp. 115–126, 2014.
- [16] Roberto Leyva, Victor Sanchez, and Chang-Tsun Li, “Video anomaly detection with compact feature sets for online performance,” IEEE TIP, vol. 26, no. 7, pp. 3463–3478, 2017.
- [17] Dawei Du, Honggang Qi, Qingming Huang, Wei Zeng, and Changhua Zhang, “Abnormal event detection in crowded scenes based on structural multi-scale motion interrelated patterns,” in ICME. IEEE, 2013, pp. 1–6.
- [18] Mohammad Sabokrou, Mohsen Fayyaz, Mahmood Fathy, and Reinhard Klette, “Deep-cascade: Cascading 3d deep neural networks for fast anomaly detection and localization in crowded scenes,” IEEE TIP, vol. 26, no. 4, pp. 1992–2004, 2017.
- [19] Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos, “Anomaly detection in crowded scenes,” in CVPR. IEEE, 2010, pp. 1975–1981.
- [20] Ying Zhang, Huchuan Lu, Lihe Zhang, and Xiang Ruan, “Combining motion and appearance cues for anomaly detection,” Pattern Recognition, vol. 51, pp. 443–452, 2016.
- [21] Oren Boiman and Michal Irani, “Detecting irregularities in images and in video,” International journal of computer vision, vol. 74, no. 1, pp. 17–31, 2007.
- [22] Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos, “Anomaly detection and localization in crowded scenes,” IEEE TPAMI, vol. 36, no. 1, pp. 18–32, 2014.
- [23] Tan Xiao, Chao Zhang, and Hongbin Zha, “Learning to detect anomalies in surveillance video,” IEEE Signal Processing Letters, vol. 22, no. 9, pp. 1477–1481, 2015.
- [24] Ryota Hinami, Tao Mei, and Shin’ichi Satoh, “Joint detection and recounting of abnormal events by learning deep generic knowledge,” in ICCV, 2017, pp. 3619–3627.