Workshop on Autonomous Driving at CVPR 2021:
Technical Report for Streaming Perception Challenge
Abstract
In this report, we introduce our real-time 2D object detection system for the realistic autonomous driving scenario. Our detector is built on a newly designed YOLO model, called YOLOX. On the Argoverse-HD dataset, our system achieves 41.0 streaming AP, surpassing the second place by 7.8/6.1 on the detection-only track/full track, respectively. Moreover, equipped with TensorRT, our model achieves 30 FPS inference speed with a high-resolution input size (e.g., 1440×2304). Code and models will be available at https://github.com/Megvii-BaseDetection/YOLOX.
1 Overview
Our goal is to build a fast and accurate 2D detector for the autonomous driving scenario. We follow the YOLO series of models and introduce several improvements to build our detection system. Below, we first present the main idea of our method in Sec. 2, followed by the network architecture of the proposed detector in Sec. 3. Then, we describe the optimization strategy used to speed up inference in Sec. 4. Finally, we report the experiments and detailed ablation analysis in Sec. 5.
2 Our Approach
As inference speed is crucial for this challenge, we adopt an internal, newly designed YOLO model, named YOLOX [6], for this task.
Specifically, YOLOX [6] follows the advanced data augmentation strategies of YOLOv4 [1] and YOLOv5 [10], such as Mosaic and MixUp. We then replace the anchor-based YOLOv5 detection head with an anchor-free head and adopt a simplified version of OTA [5], an advanced label assignment strategy, for training.
These new designs simplify the current YOLO detectors and eliminate many hyperparameters such as anchor shapes and per-layer loss weights. Indeed, YOLOX achieves about 1.5% higher AP than the current YOLOv5 on COCO at the same speed.
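To make the anchor-free formulation concrete, the following is a minimal decoding sketch (an illustration of the general anchor-free scheme, not the exact YOLOX implementation; the tensor layout and log-space width/height parameterization are assumptions):

```python
import torch

def decode_anchor_free(reg_output, stride):
    """Decode per-location predictions (dx, dy, w, h) into absolute boxes.

    reg_output: tensor of shape (H, W, 4) from one FPN level.
    stride: down-sampling factor of that level (e.g., 8, 16, or 32).
    """
    h, w = reg_output.shape[:2]
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Box center = grid cell index + predicted offset, rescaled to input pixels.
    cx = (xs + reg_output[..., 0]) * stride
    cy = (ys + reg_output[..., 1]) * stride
    # Width/height assumed predicted in log space relative to the stride.
    bw = reg_output[..., 2].exp() * stride
    bh = reg_output[..., 3].exp() * stride
    return torch.stack([cx, cy, bw, bh], dim=-1)
```

Because each grid location predicts a box directly, no anchor shapes need to be tuned per dataset.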
3 Model Structure
We adopt the same C3 backbone as YOLOv5-L-P6. We refer the reader to YOLOX [6] for the details of the overall model structure.
4 Inference Optimization
To speed up inference for deployment, we adopt TensorRT to generate the final model. Specifically, we convert the plain model trained with the aforementioned strategy using Torch2TRT [9]. We fuse the image pre-processing operation and the prediction post-processing operation (NMS) into the model's forward function, which enables the whole inference process to be completed through a single function interface.
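A minimal conversion sketch with Torch2TRT [9] follows. The wrapper module is illustrative: `trained_model` is assumed to return `(boxes, scores)`, and the normalization, NMS threshold, and `fp16_mode` setting are assumptions rather than our exact configuration:

```python
import torch
import torchvision
from torch2trt import torch2trt

class EndToEndDetector(torch.nn.Module):
    """Folds pre-processing, the detector, and NMS into a single forward()."""

    def __init__(self, model, nms_thresh=0.65):
        super().__init__()
        self.model = model
        self.nms_thresh = nms_thresh

    def forward(self, img):
        x = img.float() / 255.0            # pre-processing fused into the graph
        boxes, scores = self.model(x)      # assumed head outputs
        keep = torchvision.ops.nms(boxes, scores, self.nms_thresh)
        return boxes[keep], scores[keep]   # post-processing fused as well

detector = EndToEndDetector(trained_model).eval().cuda()
dummy = torch.zeros(1, 3, 1440, 2304).cuda()
# Build a TensorRT engine from the wrapped module in one call.
model_trt = torch2trt(detector, [dummy], fp16_mode=True)
```

With this wrapper, a single `model_trt(img)` call covers the whole pipeline, which is what allows the simplified inference logic described next.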
Given the all-in-one inference interface, we simplify the inference logic by modifying the provided toolkit, and submit the generated predictions on the validation and test sets to the evaluation server.
5 Experiments and Analysis
5.1 Training Configuration
We implement the proposed YOLOX model in PyTorch [4] and train it on our internal deep learning computation platform. The overall training consists of two stages: a pre-training stage and a multi-dataset fine-tuning stage.
Pre-training Stage.
We pre-train the YOLOX model on the COCO 2017 dataset [8], which contains 118K images for training and 5K images for validation. We use SGD with a momentum of 0.9 and a cosine learning rate schedule of 300 epochs with a 5-epoch warmup. For COCO pre-training, we use 16 V100 GPUs (8 images/GPU); therefore, 128 images are utilized per mini-batch.
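The warmup-plus-cosine schedule can be written in a few lines (a sketch of the stated recipe of 5 warmup epochs out of 300; the base learning rate and the linear warmup shape are assumptions):

```python
import math

def lr_at_epoch(epoch, base_lr=0.01, total_epochs=300, warmup_epochs=5):
    """Linear warmup for the first epochs, cosine decay afterwards."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```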
Multi-dataset Fine-tuning Stage.
Due to the limited data provided in the Argoverse-HD [7] train set, we utilize several additional datasets to enhance the model's capacity.
Specifically, although there are 39384 images in the train set, they come from only 65 video sequences. In other words, most of the images share similar scene content, which makes the model prone to over-fitting. To tackle this issue and improve the detector, we introduce BDD100K [11], Cityscapes [3], and nuScenes [2] for training; all of them are collected from autonomous driving scenarios and have scene content/categories similar to Argoverse-HD. We report the detailed statistics of the extra datasets in Tab. 1.
We use the model pre-trained in stage 1 to initialize the detector and train it on Argoverse-HD, BDD100K, Cityscapes, and nuScenes jointly in this stage. For the multi-dataset fine-tuning, we train with a learning rate of 0.01, 5 epochs of warmup, and 20 epochs in total. We use 16 V100 GPUs (6 images/GPU), i.e., 96 images per mini-batch. A sketch of the joint data pipeline is given after Tab. 1.
Table 1: Statistics of the datasets used for training.

| Dataset | Type | Num Img (Train) | Num Img (Val) | Resolution | Num Class | Class Overlap |
|---|---|---|---|---|---|---|
| COCO [8] | Common | 118287 | 5000 | multi-scale | 80 | True |
| Argoverse-HD [7] | Road | 39384 | 15062 | 1200×1920 | 8 | True |
| BDD100K [11] | Road | 69863 | 10000 | 720×1280 | 10 | True |
| Cityscapes [3] | Road | 2975 | 500 | 1024×2048 | 19 | True |
| nuScenes [2] | Road | 67279 | 16445 | 900×1600 | 25 | True |
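Joint training over the four driving datasets can be set up by remapping each dataset's labels into a shared category space and concatenating them. The sketch below illustrates this; the dataset classes and label-mapping dictionaries are placeholders, not our actual loaders:

```python
from torch.utils.data import ConcatDataset, DataLoader

# Hypothetical wrappers that remap each dataset's class ids into the
# 8-class Argoverse-HD space before returning (image, targets) pairs.
datasets = [
    ArgoverseHDDataset(split="train"),
    RemappedDataset(BDD100KDataset(split="train"), bdd_to_argo),
    RemappedDataset(CityscapesDataset(split="train"), city_to_argo),
    RemappedDataset(NuScenesDataset(split="train"), nusc_to_argo),
]
train_set = ConcatDataset(datasets)
# 16 GPUs x 6 images/GPU = 96 images per mini-batch (6 per process here).
loader = DataLoader(train_set, batch_size=6, shuffle=True, num_workers=4)
```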
Table 2: Detection roadmap and online (streaming) results. † denotes fine-tuning with the validation set included (see Sec. 5.2).

| Method | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|
| Strong Baseline (Single Argoverse-HD Dataset) | 35.4 | 59.9 | 34.2 | 18.3 | 34.2 | 48.4 |
| + COCO Pre-train | 42.8 | 62.9 | 44.7 | 22.0 | 43.0 | 64.7 |
| + Multi-dataset | 48.7 | 72.0 | 50.1 | 25.5 | 54.1 | 71.3 |
| + Large-Scale Inference | 50.6 | 74.4 | 52.3 | 27.5 | 57.0 | 70.8 |
| Online (on Val) | 40.2 | 68.9 | 39.4 | 21.5 | 42.9 | 53.9 |
| Online (on Test) | 40.1 | 68.3 | 40.6 | 15.9 | 40.6 | 44.6 |
| Online† (on Test) | 41.0 | 69.5 | 41.2 | 15.1 | 41.8 | 47.7 |
5.2 Detection Roadmap
We report the detection performance in Tab. 2. Using the COCO dataset for pre-training significantly improves our strong baseline, with a gain of 7.4 AP. Training jointly with multiple datasets further achieves 48.7 AP. Moreover, evaluation with a large-scale input (1440×2304) improves the performance to 50.6 AP. Equipped with TensorRT, we achieve 40.2 AP on the validation set in terms of the online evaluation metric. By adding the validation set for fine-tuning the model, we achieve 41.0 AP (online) on the test set.
5.3 Latency vs Accuracy
We also report the latency and the performance for different image sizes used at inference in Tab. 3. There exists a trade-off between performance and latency, which depends on the specific requirements of different task scenarios. A simple timing loop for reproducing such measurements is sketched after the table.
Table 3: Latency vs. accuracy for different inference resolutions.

| Height | Width | Speed (ms) | AP |
|---|---|---|---|
| 1440 | 2304 | 28.1 | 50.6 |
| 1280 | 2048 | 21.4 | 49.9 |
| 1200 | 1920 | 20.5 | 49.7 |
| 1120 | 1792 | 19.7 | 48.7 |
| 960 | 1536 | 16.0 | 46.3 |
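The following is a minimal GPU timing sketch for such measurements (an assumption of a standard benchmarking loop, not our exact harness; warm-up iterations and explicit synchronization are the essential details, and `model` would be the converted engine from Sec. 4):

```python
import time
import torch

def measure_latency(model, height, width, iters=100, warmup=10):
    """Average forward latency in milliseconds for one input size."""
    x = torch.zeros(1, 3, height, width).cuda()
    with torch.no_grad():
        for _ in range(warmup):      # warm up kernels before timing
            model(x)
        torch.cuda.synchronize()     # flush queued GPU work
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()     # wait for all timed work to finish
    return (time.perf_counter() - start) / iters * 1000.0
```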
References
- [1] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
- [2] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
- [3] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
- [4] Adam Paszke et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
- [5] Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. Ota: Optimal transport assignment for object detection. In CVPR, 2021.
- [6] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.
- [7] Mengtian Li, Yu-Xiong Wang, and Deva Ramanan. Towards streaming perception. In ECCV. Springer, 2020.
- [8] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV. Springer, 2014.
- [9] NVIDIA. torch2trt. https://github.com/nvidia-ai-iot/torch2trt.
- [10] Ultralytics. YOLOv5. https://github.com/ultralytics/yolov5.
- [11] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In CVPR, 2020.