Workshop on Autonomous Driving at CVPR 2021:
Technical Report for Streaming Perception Challenge
Abstract
In this report, we introduce our real-time 2D object detection system for the realistic autonomous driving scenario. Our detector is built on a newly designed YOLO model, called YOLOX. On the Argoverse-HD dataset, our system achieves 41.0 streaming AP, surpassing the second place by 7.8/6.1 on the detection-only track/full track, respectively. Moreover, equipped with TensorRT, our model achieves 30 FPS inference speed with a high-resolution input size (e.g., 1440×2304). Code and models will be available at https://github.com/Megvii-BaseDetection/YOLOX.
1 Overview
Our goal is to build a fast and accurate 2D detector for the autonomous driving scenario. We follow the YOLO series of models and introduce several improvements to build our detection system. Below, we first present the main idea of our method in Sec. 2, followed by the network architecture of the proposed detector in Sec. 3. Then, we describe the optimization strategy used to speed up inference in Sec. 4. Finally, we report the experiments and detailed ablation analysis in Sec. 5.
2 Our Approach
As inference speed is crucial for this challenge, we adopt an internal, newly designed YOLO model, named YOLOX [6], for this task.
Specifically, YOLOX [6] follows the advanced data augmentation strategies of YOLOv4 [1] and YOLOv5 [10], such as Mosaic and MixUp. We then replace the anchor-based YOLOv5 detection head with an anchor-free head and adopt a simplified version of OTA [5], an advanced label assignment strategy, for training.
These new designs simplify the current YOLO detectors and eliminate many hyperparameters such as anchor shapes and per-layer loss weights. Indeed, YOLOX achieves about 1.5% higher AP than the current YOLOv5 on COCO at the same speed.
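To make the anchor-free formulation concrete, the following is a minimal decoding sketch (an illustration of the general anchor-free scheme, not the exact YOLOX implementation; the tensor layout and log-space width/height parameterization are assumptions):

```python
import torch

def decode_anchor_free(reg_output, stride):
    """Decode per-location predictions (dx, dy, w, h) into absolute boxes.

    reg_output: tensor of shape (H, W, 4) from one FPN level.
    stride: down-sampling factor of that level (e.g., 8, 16, or 32).
    """
    h, w = reg_output.shape[:2]
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Box center = grid cell index + predicted offset, rescaled to input pixels.
    cx = (xs + reg_output[..., 0]) * stride
    cy = (ys + reg_output[..., 1]) * stride
    # Width/height assumed predicted in log space relative to the stride.
    bw = reg_output[..., 2].exp() * stride
    bh = reg_output[..., 3].exp() * stride
    return torch.stack([cx, cy, bw, bh], dim=-1)
```

Because each grid location predicts a box directly, no anchor shapes need to be tuned per dataset.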
3 Model Structure
We adopt the same C3 backbone as YOLOv5-L-P6. We refer the reader to YOLOX [6] for the details of the overall model structure.
4 Inference Optimization
To speed up inference for deployment, we adopt TensorRT to generate the final model. Specifically, we convert the plain model trained with the aforementioned strategy using Torch2TRT [9]. We fuse the image pre-processing operation and the prediction post-processing operation (NMS) into the model's forward function, which enables the whole inference process to be completed through a single function interface.
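A minimal conversion sketch with Torch2TRT [9] follows. The wrapper module is illustrative: `trained_model` is assumed to return `(boxes, scores)`, and the normalization, NMS threshold, and `fp16_mode` setting are assumptions rather than our exact configuration:

```python
import torch
import torchvision
from torch2trt import torch2trt

class EndToEndDetector(torch.nn.Module):
    """Folds pre-processing, the detector, and NMS into a single forward()."""

    def __init__(self, model, nms_thresh=0.65):
        super().__init__()
        self.model = model
        self.nms_thresh = nms_thresh

    def forward(self, img):
        x = img.float() / 255.0            # pre-processing fused into the graph
        boxes, scores = self.model(x)      # assumed head outputs
        keep = torchvision.ops.nms(boxes, scores, self.nms_thresh)
        return boxes[keep], scores[keep]   # post-processing fused as well

detector = EndToEndDetector(trained_model).eval().cuda()
dummy = torch.zeros(1, 3, 1440, 2304).cuda()
# Build a TensorRT engine from the wrapped module in one call.
model_trt = torch2trt(detector, [dummy], fp16_mode=True)
```

With this wrapper, a single `model_trt(img)` call covers the whole pipeline, which is what allows the simplified inference logic described next.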
Given the all-in-one inference interface, we simplify the inference logic by modifying the provided toolkit, and submit the generated predictions on the validation and test sets to the evaluation server.
5 Experiments and Analysis
5.1 Training Configuration
We implement the proposed YOLOX model in PyTorch [4] and train it on our internal deep learning computation platform. The overall training consists of two stages: a pre-training stage and a multi-dataset fine-tuning stage.
Pre-training Stage.
We pre-train the YOLOX model on the COCO 2017 dataset [8], which contains 118K images for training and 5K images for validation. We use SGD with a momentum of 0.9 and a cosine learning rate schedule of 300 epochs with a 5-epoch warmup. For COCO pre-training, we use 16 V100 GPUs (8 images/GPU); therefore, 128 images are utilized per mini-batch.
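The warmup-plus-cosine schedule can be written in a few lines (a sketch of the stated recipe of 5 warmup epochs out of 300; the base learning rate and the linear warmup shape are assumptions):

```python
import math

def lr_at_epoch(epoch, base_lr=0.01, total_epochs=300, warmup_epochs=5):
    """Linear warmup for the first epochs, cosine decay afterwards."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```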
Multi-dataset Fine-tuning Stage.
Due to the limited data provided in the Argoverse-HD [7] train set, we utilize several additional datasets to enhance the model's capacity.
Specifically, although there are 39384 images in the train set, they come from only 65 video sequences. In other words, most of the images share similar scene content, which makes the model prone to over-fitting. To tackle this issue and improve the detector, we introduce BDD100K [11], Cityscapes [3], and nuScenes [2] for training; all of them are collected from autonomous driving scenarios and have scene content/categories similar to Argoverse-HD. We report the detailed statistics of the extra datasets in Tab. 1.
We use the model pre-trained in stage 1 to initialize the detector and train it on Argoverse-HD, BDD100K, Cityscapes, and nuScenes jointly in this stage. For the multi-dataset fine-tuning, we train with a learning rate of 0.01, 5 epochs of warmup, and 20 epochs in total. We use 16 V100 GPUs (6 images/GPU), i.e., 96 images per mini-batch. A sketch of the joint data pipeline is given after Tab. 1.
Table 1: Statistics of the datasets used for training.

| Dataset | Type | Num Img (Train) | Num Img (Val) | Resolution | Num Class | Class Overlap |
|---|---|---|---|---|---|---|
| COCO [8] | Common | 118287 | 5000 | multi-scale | 80 | True |
| Argoverse-HD [7] | Road | 39384 | 15062 | 1200×1920 | 8 | True |
| BDD100K [11] | Road | 69863 | 10000 | 720×1280 | 10 | True |
| Cityscapes [3] | Road | 2975 | 500 | 1024×2048 | 19 | True |
| nuScenes [2] | Road | 67279 | 16445 | 900×1600 | 25 | True |
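Joint training over the four driving datasets can be set up by remapping each dataset's labels into a shared category space and concatenating them. The sketch below illustrates this; the dataset classes and label-mapping dictionaries are placeholders, not our actual loaders:

```python
from torch.utils.data import ConcatDataset, DataLoader

# Hypothetical wrappers that remap each dataset's class ids into the
# 8-class Argoverse-HD space before returning (image, targets) pairs.
datasets = [
    ArgoverseHDDataset(split="train"),
    RemappedDataset(BDD100KDataset(split="train"), bdd_to_argo),
    RemappedDataset(CityscapesDataset(split="train"), city_to_argo),
    RemappedDataset(NuScenesDataset(split="train"), nusc_to_argo),
]
train_set = ConcatDataset(datasets)
# 16 GPUs x 6 images/GPU = 96 images per mini-batch (6 per process here).
loader = DataLoader(train_set, batch_size=6, shuffle=True, num_workers=4)
```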
Table 2: Detection roadmap and online (streaming) results. † denotes fine-tuning with the validation set included (see Sec. 5.2).

| Method | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|
| Strong Baseline (Single Argoverse-HD Dataset) | 35.4 | 59.9 | 34.2 | 18.3 | 34.2 | 48.4 |
| + COCO Pre-train | 42.8 | 62.9 | 44.7 | 22.0 | 43.0 | 64.7 |
| + Multi-dataset | 48.7 | 72.0 | 50.1 | 25.5 | 54.1 | 71.3 |
| + Large-Scale Inference | 50.6 | 74.4 | 52.3 | 27.5 | 57.0 | 70.8 |
| Online (on Val) | 40.2 | 68.9 | 39.4 | 21.5 | 42.9 | 53.9 |
| Online (on Test) | 40.1 | 68.3 | 40.6 | 15.9 | 40.6 | 44.6 |
| Online† (on Test) | 41.0 | 69.5 | 41.2 | 15.1 | 41.8 | 47.7 |
5.2 Detection Roadmap
We report the detection performance in Tab. 2. Using the COCO dataset for pre-training significantly improves our strong baseline, with a gain of 7.4 AP. Training jointly with multiple datasets further achieves 48.7 AP. Moreover, evaluation with a large-scale input (1440×2304) improves the performance to 50.6 AP. Equipped with TensorRT, we achieve 40.2 AP on the validation set in terms of the online evaluation metric. By adding the validation set for fine-tuning the model, we achieve 41.0 AP (online) on the test set.
5.3 Latency vs Accuracy
We also report the latency and the performance for different image sizes used at inference in Tab. 3. There exists a trade-off between performance and latency, which depends on the specific requirements of different task scenarios. A simple timing loop for reproducing such measurements is sketched after the table.
Table 3: Latency vs. accuracy for different inference resolutions.

| Height | Width | Speed (ms) | AP |
|---|---|---|---|
| 1440 | 2304 | 28.1 | 50.6 |
| 1280 | 2048 | 21.4 | 49.9 |
| 1200 | 1920 | 20.5 | 49.7 |
| 1120 | 1792 | 19.7 | 48.7 |
| 960 | 1536 | 16.0 | 46.3 |
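The following is a minimal GPU timing sketch for such measurements (an assumption of a standard benchmarking loop, not our exact harness; warm-up iterations and explicit synchronization are the essential details, and `model` would be the converted engine from Sec. 4):

```python
import time
import torch

def measure_latency(model, height, width, iters=100, warmup=10):
    """Average forward latency in milliseconds for one input size."""
    x = torch.zeros(1, 3, height, width).cuda()
    with torch.no_grad():
        for _ in range(warmup):      # warm up kernels before timing
            model(x)
        torch.cuda.synchronize()     # flush queued GPU work
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()     # wait for all timed work to finish
    return (time.perf_counter() - start) / iters * 1000.0
```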
References
- [1] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
- [2] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
- [3] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
- [4] Adam Paszke et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
- [5] Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. Ota: Optimal transport assignment for object detection. In CVPR, 2021.
- [6] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.
- [7] Mengtian Li, Yu-Xiong Wang, and Deva Ramanan. Towards streaming perception. In ECCV. Springer, 2020.
- [8] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV. Springer, 2014.
- [9] NVIDIA. torch2trt. https://github.com/nvidia-ai-iot/torch2trt.
- [10] Ultralytics. YOLOv5. https://github.com/ultralytics/yolov5.
- [11] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In CVPR, 2020.