¹¹institutetext: Department of Computer Science and Engineering
Islamic University of Technology
¹¹email: {raianrahman, zadidbinazad, bakhtiarhasan}@iut-dhaka.edu

Densely-Populated Traffic Detection using YOLOv5 and Non-Maximum Suppression Ensembling

Raian Rahman Zadid Bin Azad Md. Bakhtiar Hasan

Abstract

Vehicular object detection is the heart of any intelligent traffic system. It is essential for urban traffic management. R-CNN, Fast R-CNN, Faster R-CNN and YOLO were some of the earlier state-of-the-art models. Region based CNN methods have the problem of higher inference time which makes it unrealistic to use the model in real-time. YOLO on the other hand struggles to detect small objects that appear in groups. In this paper, we propose a method that can locate and classify vehicular objects from a given densely crowded image using YOLOv5. The shortcoming of YOLO was solved my ensembling 4 different models. Our proposed model performs well on images taken from both top view and side view of the street in both day and night. The performance of our proposed model was measured on Dhaka AI dataset which contains densely crowded vehicular images. Our experiment shows that our model achieved [email protected] of $0.458$ with inference time of $0.75$ sec which outperforms other state-of-the-art models on performance. Hence, the model can be implemented in the street for real-time traffic detection which can be used for traffic control and data collection.

Keywords:

Real-time object detection, Ensemble learning, YOLOv5, Non-Maximum Suppression

⁰⁰footnotetext: Accepted for Springer Lecture Notes on Data Engineering and Communications Technologies as a part of BIM 2021

1 Introduction

An increasing number of vehicle types in urban areas pose many problems like traffic congestion, long queue in toll and parking sites. To solve traffic problems in mega-cities and to monetize traffics in areas like toll booths, parking lots, and analyzing types of vehicles in a city more efficiently and effectively, an intelligent system is required. As an indispensable part of the intelligent traffic monitoring system, accurate vehicle detection and real-time performance is the most challenging part which is gaining the attention of researchers all over the world. Efficient vehicle detection and classification in densely populated areas can facilitate automated toll collection, smart parking systems, and identification of vehicles related to crimes.

The task of vehicle detection can be formulated as a multi-object detection problem. In simple terms, object detection is the task of locating the objects in an image with a bounding box and detecting the class of that object. For this, convolutional neural network (CNN) based methods have been widely used in the recent past. The prominent state-of-the-art methods utilize R-CNN [1], Fast R-CNN [2], Faster R-CNN [3] to achieve this task. But the problem with these two-stage-based models is that training happens in multiple phases and the network is too slow at inference time, which impedes real-time detection of vehicles. To solve this problem, recently You Only Look Once (YOLO) [4] introduced a faster way of real-time object detection making it usable in real-life applications. However, this architecture struggles to detect small objects that appear in groups [4].

To solve this issue, we trained 4 separate models that utilize the ensemble technique to aggregate the separate predictions using Non-Maximum Suppression. Our contributions are as follows:

•

Trained a total of 4 YOLOv5 [5] models using different image sizes and hyper-parameters.
•

Aggregated the prediction of 4 models using ensemble model that facilitates faster detection of vehicles.
•

Introduced additional difficulty by adding low-light nighttime images and top-view images with densely crowded vehicles to training samples to improve the accuracy and robustness of the model.

These steps resulted in a solution that can be used in real-time and low light situations even in densely populated streets. Besides it also ensured that our solution outputs a result with acceptable accuracy which makes our model usable in congested and complex scenes.

2 Related Work

The traditional approaches[6, 7] for vehicle detection apply common machine learning algorithms like the histogram of oriented gradient (HOG) to extract features from vehicle images. After extracting the features, the vehicles are then classified using Support Vector Machine (SVM). Other approaches use Deformable Part Model (DPM)[8] to detect vehicles. Even though these approaches provide comparable accuracy, they involve handcrafted feature designing that requires human intervention.

Recent advances in deep learning facilitated by the availability of large datasets and big compute have made them a viable option for vehicle detection. Earlier approaches [9, 10, 11] utilize Convolutional Neural Network (CNN) to perform feature extraction and softmax function for classification. Later, more efficient models like R-CNN [1] and fast R-CNN[2] and Faster R-CNN[3] models were proposed. All these models utilize a Region-based Convolutional Neural Network, which uses a technique called Selective Search [12] to select a small number of candidate regions among all possible regions. As a result, the model requires running an image classification algorithm for a smaller amount of region making the model run faster. R-CNN is comparatively slower among all three models as it generates lots of candidate regions. Fast R-CNN [2] addressed this issue by feeding the input image to a CNN to generate a convolutional feature map. Then the candidate regions are proposed using an RoI pooling layer and feeding it into a fully connected network. The number of candidate regions proposed by Fast R-CNN is less than that of R-CNN, hence it requires less time for inference. But the Selective Search algorithm, used by Fast R-CNN, is not a machine learning algorithm, so it cannot learn from the context, and often proposes a bad candidate for the region. Later, Faster R-CNN [3] was proposed with the idea of replacing selective search as it is a time-consuming process. Faster R-CNN provides the fastest running time compared with R-CNN and Fast R-CNN. However, it is still not fast enough to detect objects in real-time. Additionally, all these three models require huge computation due to having a complex model containing a large number of parameters.

Recently, YOLO is being used for vehicle detection [13, 14, 15]. Instead of using the region selection method, YOLO uses Convolutional Neural Network that predicts the bounding boxes as well as the class for these boxes. It divides the image into an $S\times S$ grid where $S$ is a constant value. For each grid, YOLO generates a constant number of bounding boxes. Then if a bounding box has confidence greater than a certain threshold, the bounding box is selected to locate the object within the image. YOLO is by far the fastest algorithm for vehicle detection and its speed is helpful to implement real-time vehicle detection systems.

3 Proposed Methodology

Refer to caption — Figure 1: The pipeline of our proposed solution: First, we acquired the training data. During preprocessing, images were resized, relabelled before creating four different folds of the dataset. Different augmentation technique was applied to these folds. During training, these folds were trained independently with YOLOv5 model with different setups. All four of our trained models were then ensembled using Non Maximum Suppression. The last state of our work is the images with bounding box surrounding the vehicle objects of an image with its class

3.1 Overview

Our proposed method consists of 3 main modules. First, we acquired and preprocessed the dataset. During preprocessing, we applied augmentation, resized the images into uniform shapes, and created training and testing folds. Then, four different models were trained with these different training folds. After training, we ensembled the models using Non-Maximum Suppression [16] for final inference. A complete pipeline of our proposed methodology is illustrated in Figure 1.

3.2 Dataset Acquisition

For this experiment, we used the “DhakaAI” [17] dataset developed under the “Dhaka AI 2020 challenge”. The dataset consists of 3000 annotated images of traffic objects. The training dataset consisted of 21 classes. The most challenging part about the dataset is it contains images of vehicles from a different point of view. There were images from the front view, back view, side view and most importantly top view of streets. We also added around 200 new images for training to increase the sample of rare class vehicles. These new images were hand-annotated using labelImg tool [18]. Most of these images were top-view nighttime images.

3.3 Preprocessing

For generalizing a model for object detection using deep learning architecture, a prerequisite is to have enough training examples for each class so that the model can learn properly. But, after exploring the DhakaAI dataset [17], we found that it has a huge class imbalance. The number of labels for each class is shown in Table 1. Here, some of the classes have less than 50 samples in the training dataset.

Table 1: Sample Distribution per Class

Class Name	Label Count	Class Name	Label Count
Ambulance	76	Pickup	1178
Army Vehicle	25	Police Car	33
Auto Rickshaw	465	Rickshaw	3495
Bicycle	465	Scooter	30
Bus	3340	SUV	667
Car	5574	Taxi	59
Garbage Van	8	Three Wheeler (CNG)	2982
Human Hauler	170	Truck	1475
Minibus	100	Van	682
Minivan	815	Wheelbarrow	251
Motorbike	2252

To resolve this issue, we used augmentation using tools from Roboflow ¹¹1Available at: https://roboflow.com/ and Albumentations library [19] for image augmentation. Although augmentation did not provide a very good result in the case of densely populated images, it improved the result in the case of night images.

During the exploration of the dataset, we found that there was a lot of mislabelling in the DhakaAI dataset training data. We also found that two images had different labeling of class for the same car in the same frame (illustrated in Figure 2). So, we hand-annotated all 3000 images and labeled all the mislabelled objects as well as fixed labeling of wrongly labeled objects in the image.

Table 2: Image resolution and applied augmentation for different folds of training dataset. All images had

1024\times 1024

resolution.

Fold No.	Train Set Image Count	Validation Set Image Count	Augmentation
1	$2506$	$600$	Sharpened
2	$2321$	$785$	Sharpened
3	$2400$	$706$	Sharpened
4	$1200$	$400$	Darkened and Sharpened

Another challenge in the dataset is it does not have uniform image quality. Some of the images are in landscape mode while some of the images are in portrait mode. So, we resized the images to $1024\times 1024$ pixels.

For the train and validation set split, we used the k-fold Cross-Validation technique so that our model could learn from the complete dataset. While creating the fold, we tried to make sure that images from the same frame in the train split do not occur in the validation split. Count of train-validation split for each fold is given in Table 2.

3.4 Model Selection

Although the key priority of our work was to localize and classify the vehicular objects on a street image, we also had to look into the inference speed so that it could be implemented real time. We had to discard R-CNN, Fast R-CNN and Faster R-CNN as it could not compete with YOLO models in both performance and inference time. YOLO on the other hand, YOLO had a much less inference time with better accuracy. Among different versions of YOLO we chose YOLOv5[5] due to its simple architecture compared to R-CNN based models. Even YOLOv5 is faster and more robust than other members of YOLO family.

Even if the author of YOLOv4[20] got official approval of YOLO, the version of YOLOv5 [5] developed by Ultralytics LLC team did not get any acknowledgement from the original author of YOLO. Still YOLOv5 provides much better performance compared to other models of YOLO family[21]. YOLOv5 inherits the advantages of YOLOv4 [20] by adding SPP-NET [22] along with some enhancement techniques. YOLOv5 has become the new state of the art for object detection[23]. YOLOv5 was mainly developed to balance between real-time performance and detection accuracy.

YOLOv5s [5] , YOLOv5m [5], YOLOv5l [5], YOLOv5x [5] are the four versions of YOLOv5 where YOLOv5s being the lightest model and YOLOv5x being the heaviest model respectively. Among all these four version, there is a trade off between the detection speed and real-time performance. The key difference among these versions are the number of feature extraction modules and convolution kernel in specific location of the network.

The network consists of three networks. These are: backbone network, neck network and detect network. Backbone network is a convolutional neural network for aggregating the fine-grained images and forming image features. Neck network is responsible for combining the image features collected by backbone network and transmitting the feature map to the detect network. The detect network is responsible for detection and classification part of the model. It applies anchor boxes on the feature map from the neck network. It also contains a softmax layer which predicts the probability of the class the bounding box surrounding the object.

For image enhancement, YOLOv5 uses mosaic data augmentation to solve the low dataset problem. It applies operations like random inversion, zooming, cropping on four images and then combines them into a single image.

In traffic detection, the core priority was to improve the performance, so we chose YOLOv5x for our training model. It contains $607$ layers with $88,568,234$ trainable parameters. The model was pre-trained using Common Object in Context (COCO) dataset [24] to detect $80$ classes. For our task, we changed the final layer to detect only $21$ classes corresponding to the $21$ vehicle classes available in the DhakaAI dataset.

3.5 Ensemble Learning

To ensure the robustness and accuracy of our model, we trained 4 separate models using different sets of images. Each of the models proposes multiple bounding boxes to specify candidate regions for vehicle detection. We used Non-Maximum Suppression [16] to aggregate these bounding boxes to select the ones having the most confidence. The way it works is the system takes all the bounding boxes proposed by all four models and puts them in a priority queue sorted based on the confidence of the models predicting them. It then selects the box with the highest confidence from the queue and calculates the Intersection over Union (IoU) with the rest of the boxes. If the IoU value exceeds a certain threshold for any of the remaining boxes, that box is discarded. It then removes the bounding box with the highest confidence from the queue and adds it to the selected box list. This process is repeated until there is no bounding box remaining in the priority queue. Finally, the boxes in the selected box list are returned.

4 Result and Analysis

4.1 Experimental Setup

During training we had to train four different models and ensembled these four models for final output. All four of our model was trained on Google Colab[25]. Google Colab provides cloud based training utility with free GPU access for limited amount of time. For each fold of our dataset, we trained a model. The first three models were trained with image resolution of $1024\times 1024$ pixels. But the fourth model was trained with image resolution of $640\times 640$ pixels. First three fold of our dataset contained all the images while the fourth fold contained only the night images. On the dataset, it was seen that the night images itself were quite distorted noisy. So. we decided to train the night images on lower resolution as it might then focus on the the larger objects of the night images. Also thus we could train our model for longer time.

All four of our models were trained with Tesla T4 GPU which comes with $16$ GB of video memory. All four of our model was trained for around $12$ hours each.

For training, we used a YOLOv5 implementation by the Ultralytics ²²2Available at: https://github.com/ultralytics/yolov5. We used Stochastic Gradient Descent as our optimizer. Image augmentation parameters used for each of the models are given in Table 4.

Table 3: Training Specification for Each Model. All 4 models had Stochastic Gradient Descent optimizer with a learning rate

0.01

and momentum

0.937

Model	Training Data	Image Size	Number of Epochs	Batch Size
1	Fold 1	$1024\times 1024$	80	4
2	Fold 2	$1024\times 1024$	80	4
3	Fold 3	$1024\times 1024$	80	4
4	Fold 4	$640\times 640$	120	16

Table 4: Image augmentation parameters during training.

Hyperparameter	Value
Image HSV - Hue augmentation	$0.015$
Image HSV - Saturation augmentation	$0.7$
Image HSV - Value augmentation	$0.4$
Image Rotation	$5.0$
Image Translation	$0.1$
Image Scale	$0.5$
Image Flip Left-Right - Probability	$0.5$
Image Mosaic - Probability	$1.0$
Image Mixup - Probability	$0.0$

4.2 Evaluation Metrics

To evaluate our performance, we used mean Average Precision (mAP) over training epochs following . The formula for calculating mean average precision for object detection is

\displaystyle maP=\frac{1}{n}\sum_{k=1}^{k=n}AP_{k}

(1)

where $n$ is the number of classes and $AP_{k}$ is the average precision for class $k$ . Average precision (AP) is a way of summarizing the precision-recall curve into a single value representing average of all precision. The formula for calculating AP is

\displaystyle AP@n=\sum_{k=0}^{k=n-1}[Recall(k)-Recall(k+1)]\times Precision(k)

(2)

where $n$ is the number of thresholds and $Recall(n)=0$ and $Precision(n)=1$ . We used the checkpoint where the model had most [email protected]

For inference, we ensembled the weight of all four models we trained. We used Non-Maximum Suppression during the ensemble and the confidence threshold for each predicted bounding box was set to $0.3$

4.3 Result Discussion

We used all four training model’s weights during the final inference. We ran an inference on test data - 2 provided by DhakaAI. We hand-annotated $450$ test images and executed inference. On that test, our model achieved $[email protected]$ value of $0.458$ . We also conducted inference on one of our validation sets. During validation set inference, our model achieved $[email protected]$ value of $0.883$ and $[email protected]$ value of $0.677$ .

We compared the result of our model with other models of YOLO family as well as Faster R-CNN model. Table 5 shows the comparison between these models. For comparison we compared our model’s performance as well as the inference time on a single image with YOLOv3, YOLOv4 and Faster R-CNN. We trained each of these models for $12$ hours on google colab in a similar environment. The table shows that, our model has achieved the most $[email protected]$ . As our proposed method ensembles $4$ different models during inference, the inference time of our solution is a little bit higher compared to other models. Still the precision performance of our model outperforms the other models.

Table 5: Comparison of performance and inference time with other models. Here Faster R-CNN. YOLOv3. YOLOv4 and YOLOv5x show performance trained on a single fold of train dataset while YOLOv5x with NMS ensembling model shows the result of our four combined models.

Model Name	[email protected]	Inference Time (s)
Faster R-CNN	0.356	0.39
YOLOv3	0.266	0.18
YOLOv4	0.313	0.28
YOLOv5x	0.372	0.14
YOLOv5 with NMS ensembling (ours)	0.458	0.75

Output of our model for different scenario is illustrated in Figure 3 and 4. Our model was able to localize and detect most of the vehicular objects for a given image taken from a different view of the street. It also performed well in the case of night images. Figure 3 illustrates the performance of our model on night images. It could locate most of the vehicles as well as properly classify those vehicles in both densely populated images and in less populated images. However, as seen in Figure 3c, our model could not detect most of the vehicles in a very low-light noisy image.

In Figure 4, we illustrated our model’s performance on images taken from a different view of the streets which shows it can locate and detect the objects properly. As seen in the figure, the model also performs well in the case of occluded objects in the image.

And our model could run inference within $0.75$ second per image. So, it could also be implemented in real-time vehicular traffic detection applications.

5 Conclusion

This paper proposed a new method of traffic object detection using YOLOv5. To improve the performance and robustness of our method, we ensembled 4 different models using Non-Maximum Suppression ensembling. We also tried to incorporate dataset modification by adding night images from different view-angles. Our experiment compared the performance of our model with other state of the art models on Dhaka AI dataset. Result shows that our model had better precision. Due to limited resources, we couldn’t test our model’s performance on other baseline dataset. For further experimentation, our work could be expanded on how we can use better ensembling methods like weighted ensembling or voting mechanism for faster inference time.

6 Acknowledgement

We would like to thank Redwan Karim Sony, Department of Computer Science and Engineering, Islamic University of Technology and Mohammad Sabik Irbaz, Pioneer Alpha Limited for their continuous support and suggestions throughout the work. We would also like to thank the organizing committee of Dhaka AI 2020 for organizing the competition.

References

[1] Ross Girshick, Jeff Donahue, Trevor Darrell and Jitendra Malik “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation” In 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587 DOI: 10.1109/CVPR.2014.81
[2] Ross Girshick “Fast R-CNN” In 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448 DOI: 10.1109/ICCV.2015.169
[3] Shaoqing Ren, Kaiming He, Ross Girshick and Jian Sun “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks” In IEEE Transactions on Pattern Analysis and Machine Intelligence 39.6, 2017, pp. 1137–1149 DOI: 10.1109/TPAMI.2016.2577031
[4] Joseph Redmon, Santosh Divvala, Ross Girshick and Ali Farhadi “You Only Look Once: Unified, Real-Time Object Detection” In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788 DOI: 10.1109/CVPR.2016.91
[5] Glenn Jocher et al. “ultralytics/yolov5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations” Zenodo, 2021 DOI: 10.5281/zenodo.4679653
[6] Natthariya Laopracha and Khamron Sunat “Comparative Study of Computational Time that HOG-Based Features Used for Vehicle Detection” In Recent Advances in Information and Communication Technology 2017 Cham: Springer, 2018, pp. 275–284
[7] Xianbin Cao, Changxia Wu, Pingkun Yan and Xuelong Li “Linear SVM classification using boosting HOG features for vehicle detection in low-altitude airborne videos” In 2011 18th IEEE International Conference on Image Processing, 2011, pp. 2421–2424 IEEE DOI: 10.1109/ICIP.2011.6116132
[8] Chun Pan, Mingxia Sun and Zhiguo Yan “The Study on Vehicle Detection Based on DPM in Traffic Scenes” In Frontier Computing Singapore: Springer Singapore, 2018, pp. 19–27 URL: https://link.springer.com/chapter/10.1007/978-981-10-3187-8_3
[9] Yong Tang et al. “Vehicle detection and recognition for intelligent traffic surveillance system” In Multimedia tools and applications 76.4 Springer, 2017, pp. 5817–5832 DOI: 10.1007/s11042-015-2520-x
[10] Yang Gao et al. “Scale optimization for full-image-CNN vehicle detection” In 2017 IEEE Intelligent Vehicles Symposium (IV), 2017, pp. 785–791 IEEE DOI: 10.1109/IVS.2016.7535529
[11] Heikki Huttunen, Fatemeh Shokrollahi Yancheshmeh and Ke Chen “Car type recognition with Deep Neural Networks” In 2016 IEEE intelligent vehicles symposium (IV), 2016, pp. 1115–1120 IEEE DOI: 10.1109/IVS.2016.7535529
[12] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers and Arnold WM Smeulders “Selective Search for Object Recognition” In International Journal of Computer Vision 104.2 Springer, 2013, pp. 154–171 DOI: 10.1007/s11263-013-0620-5
[13] Margrit Kasper-Eulaers et al. “Short Communication: Detecting Heavy Goods Vehicles in Rest Areas in Winter Conditions Using YOLOv5” In Algorithms 14.4, 2021 DOI: 10.3390/a14040114
[14] Jun Sang et al. “An Improved YOLOv2 for Vehicle Detection” In Sensors 18.12, 2018 DOI: 10.3390/s18124272
[15] CS Asha and AV Narasimhadhan “Vehicle Counting for Traffic Management System using YOLO and Correlation Filter” In 2018 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), 2018, pp. 1–6 IEEE DOI: 10.1109/CONECCT.2018.8482380
[16] A. Neubeck and L. Van Gool “Efficient Non-Maximum Suppression” In 18th International Conference on Pattern Recognition (ICPR’06) 3, 2006, pp. 850–855 DOI: 10.1109/ICPR.2006.479
[17] ASM Shihavuddin and Mohammad Rifat Ahmmad Rashid “DhakaAI” Harvard Dataverse, 2020 DOI: 10.7910/DVN/POREXF
[18] Tzutalin “LabelImg”, Git Code, 2018 URL: https://github.com/tzutalin/labelImg
[19] Alexander Buslaev et al. “Albumentations: Fast and Flexible Image Augmentations” In Information 11.2, 2020 DOI: 10.3390/info11020125
[20] Alexey Bochkovskiy, Chien-Yao Wang and Hong-Yuan Mark Liao “YOLOv4: Optimal Speed and Accuracy of Object Detection” In Computing Research Repository (CoRR) abs/2004.10934, 2020 arXiv: https://arxiv.org/abs/2004.10934
[21] Yifan Liu, BingHang Lu, Jingyu Peng and Zihao Zhang “Research on the use of YOLOv5 object detection algorithm in mask wearing recognition” In World Scientific Research Journal, 2020, pp. 276–284
[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition” In IEEE Transactions on Pattern Analysis and Machine Intelligence 37.9, 2015, pp. 1904–1916 DOI: 10.1109/TPAMI.2015.2389824
[23] Bin Yan et al. “A Real-Time Apple Targets Detection Method for Picking Robot Based on Improved YOLOv5” In Remote Sensing 13.9, 2021 DOI: 10.3390/rs13091619
[24] Tsung-Yi Lin et al. “Microsoft COCO: Common Objects in Context” In Computer Vision – ECCV 2014 Cham: Springer International Publishing, 2014, pp. 740–755 URL: https://link.springer.com/chapter/10.1007/978-3-319-10602-1_48
[25] Ekaba Bisong “Google Colaboratory” In Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners Berkeley, CA: Apress, 2019, pp. 59–64 DOI: 10.1007/978-1-4842-4470-8˙7