
Workshop on Autonomous Driving at CVPR 2021:
Technical Report for Streaming Perception Challenge

Songyang Zhang1, Lin Song2,∗ Songtao Liu4,∗ Zheng Ge3  Zeming Li4 Xuming He1 Jian Sun4
1ShanghaiTech University 2Xi’an Jiaotong University 3Waseda University 4Megvii Technology
[email protected], [email protected], [email protected]
{liusongtao,gezheng,lizeming,sunjian}@megvii.com, [email protected]
∗ Equal contribution. This work was done while Songyang Zhang, Lin Song, and Zheng Ge were research interns at Megvii Technology.
Abstract

In this report, we introduce our real-time 2D object detection system for realistic autonomous driving scenarios. Our detector is built on a newly designed YOLO model, called YOLOX. On the Argoverse-HD dataset, our system achieves 41.0 streaming AP, surpassing the second place by 7.8 and 6.1 AP on the detection-only track and the full-stack track, respectively. Moreover, equipped with TensorRT, our model achieves 30 FPS inference speed with a high-resolution input size (e.g., 1440×2304). Code and models will be available at https://github.com/Megvii-BaseDetection/YOLOX.

1 Overview

Our goal is to build a fast and accurate 2D detector for autonomous driving scenarios. We follow the YOLO series of models and introduce several improvements to build our detection system. Below, we first present the main idea of our method in Sec. 2, followed by the network architecture of the proposed detector in Sec. 3. Then, we describe the optimization strategy used to speed up inference in Sec. 4. Finally, we report the experiments and detailed ablation analysis in Sec. 5.

2 Our Approach

As inference speed is critical for this challenge, we adopt a newly designed internal YOLO model, named YOLOX [6], for this task.

Specifically, YOLOX [6] follows the advanced data augmentation strategies of YOLOv4 [1] and YOLOv5 [10], such as Mosaic and MixUp. We then replace the anchor-based YOLOv5 detection head with an anchor-free head and adopt a simplified version of OTA [5], an advanced label assignment strategy, for training.
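To make the label assignment concrete, the snippet below sketches a SimOTA-style dynamic-k matching step in PyTorch. It is a minimal illustration under our own simplifications (no center prior, an assumed candidate number, and a schematic cost matrix), not the actual YOLOX implementation; see [5, 6] for the full procedure.

```python
import torch

def dynamic_k_matching(cost, ious, n_candidate_k=10):
    """Simplified SimOTA-style assignment (illustrative only).

    cost: (num_gt, num_anchors) pairwise cost, e.g. cls loss + lambda * IoU loss.
    ious: (num_gt, num_anchors) pairwise IoU between GT boxes and predictions.
    Returns a (num_gt, num_anchors) {0, 1} matching matrix.
    """
    num_gt, num_anchors = cost.shape
    matching = torch.zeros_like(cost, dtype=torch.uint8)

    # Each GT box gets a "dynamic k" estimated from the sum of its top-k IoUs.
    topk_ious, _ = torch.topk(ious, k=min(n_candidate_k, num_anchors), dim=1)
    dynamic_ks = torch.clamp(topk_ious.sum(dim=1).int(), min=1)

    # Assign each GT its dynamic_k lowest-cost anchor points.
    for gt_idx in range(num_gt):
        _, pos_idx = torch.topk(cost[gt_idx], k=int(dynamic_ks[gt_idx]), largest=False)
        matching[gt_idx, pos_idx] = 1

    # If an anchor point is matched to several GTs, keep only the lowest-cost one.
    ambiguous = matching.sum(dim=0) > 1
    if ambiguous.any():
        best_gt = cost[:, ambiguous].argmin(dim=0)
        matching[:, ambiguous] = 0
        matching[best_gt, ambiguous.nonzero(as_tuple=True)[0]] = 1
    return matching
```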

These new designs simplify the current YOLO detectors and remove many hyper-parameters, such as anchor shapes and per-layer loss weights. Indeed, YOLOX achieves about 1.5% higher AP than the current YOLOv5 on COCO at the same speed.

3 Model Structure

We adopt the same C3 backbone as YOLOv5-L-P6. We refer the reader to YOLOX [6] for the details of the overall model structure.
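For readers unfamiliar with the backbone, the sketch below shows the general structure of a C3 (CSP-style) block as used in YOLOv5. The channel widths and block counts of YOLOv5-L-P6 are not reproduced here; this is only a structural illustration, and we refer to [6] and [10] for the real implementation.

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """Conv + BatchNorm + SiLU, the basic unit of the backbone."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """1x1 -> 3x3 convolution with an optional residual connection."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBNAct(c, c, k=1)
        self.cv2 = ConvBNAct(c, c, k=3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C3(nn.Module):
    """CSP-style block: one branch stacks bottlenecks, the other is a plain
    1x1 projection; their outputs are concatenated and fused."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hidden = c_out // 2
        self.cv1 = ConvBNAct(c_in, c_hidden, k=1)
        self.cv2 = ConvBNAct(c_in, c_hidden, k=1)
        self.m = nn.Sequential(*[Bottleneck(c_hidden) for _ in range(n)])
        self.cv3 = ConvBNAct(2 * c_hidden, c_out, k=1)

    def forward(self, x):
        return self.cv3(torch.cat([self.m(self.cv1(x)), self.cv2(x)], dim=1))
```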

4 Inference Optimization

To speed up the inference process for deployment, we adopt TensorRT to generate the final model. Specifically, we convert the plain model trained with the aforementioned strategy using Torch2TRT [9]. We fuse the image pre-processing operation and the prediction post-processing operation (NMS) into the model forward function, so that the whole inference process can be completed through a single function interface.
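A minimal sketch of such a fused interface is shown below: pre-processing and class-aware NMS are folded into a single forward() call. The detector interface (returning flat boxes, scores, and class ids for one image), the normalization constants, and the thresholds are illustrative assumptions rather than the actual YOLOX code; the convolutional part is what we convert with Torch2TRT, e.g. torch2trt(detector, [example_input], fp16_mode=True).

```python
import torch
import torch.nn as nn
from torchvision.ops import batched_nms

class EndToEndDetector(nn.Module):
    """Wrap pre-processing, the detector, and NMS behind one forward() call
    (single-image inference; the interface names are hypothetical)."""

    def __init__(self, detector, mean, std, score_thr=0.3, nms_thr=0.65):
        super().__init__()
        self.detector = detector
        self.register_buffer("mean", torch.tensor(mean).view(1, 3, 1, 1))
        self.register_buffer("std", torch.tensor(std).view(1, 3, 1, 1))
        self.score_thr = score_thr
        self.nms_thr = nms_thr

    @torch.no_grad()
    def forward(self, image):
        # Pre-processing folded into the graph: normalize the raw input.
        x = (image.float() - self.mean) / self.std
        # Assumed detector output: boxes (N, 4), scores (N,), class_ids (N,).
        boxes, scores, class_ids = self.detector(x)
        # Post-processing: score filtering + class-aware NMS.
        keep = scores > self.score_thr
        boxes, scores, class_ids = boxes[keep], scores[keep], class_ids[keep]
        keep = batched_nms(boxes, scores, class_ids, self.nms_thr)
        return boxes[keep], scores[keep], class_ids[keep]
```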

Given this all-in-one inference interface, we simplify the inference logic by modifying the provided toolkit and submit the generated predictions on the validation set and test set to the evaluation server.

5 Experiments and Analysis

5.1 Training Configuration

We implement the proposed YOLOX model with PyTorch [4] and train it on our internal deep learning computation platform. The overall training consists of two stages: a pre-training stage and a multi-dataset fine-tuning stage.

Pre-training Stage.

We train the YOLOX model on the COCO 2017 dataset [8], which contains 118K images for training and 5K images for validation. We use SGD with a momentum of 0.9 and a cosine learning rate schedule over 300 epochs with a 5-epoch warmup. For COCO pre-training, we use 16 V100 GPUs (8 images/GPU), i.e., 128 images per mini-batch.
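As a concrete sketch, the function below computes the learning rate at a given epoch with a linear warmup followed by cosine decay. The base learning rate, the linear warmup form, and the zero final rate are assumptions for illustration, not the exact YOLOX hyper-parameters.

```python
import math

def lr_at_epoch(epoch, base_lr=0.01, total_epochs=300, warmup_epochs=5):
    """Warmup + cosine learning rate schedule (illustrative values)."""
    if epoch < warmup_epochs:
        # Linear warmup from 0 to base_lr over the first warmup_epochs epochs.
        return base_lr * epoch / warmup_epochs
    # Cosine decay from base_lr down to 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```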

Multi-dataset Fine-tuning Stage.

Due to the limited data provided in the Argoverse-HD [7] train set, we utilize several additional datasets to improve the generalization ability of the model.

Specifically, although there are 39,384 images in the train set, they come from only 65 video sequences. In other words, most of the images share similar scene content, which makes the model prone to over-fitting. To tackle this issue and improve the detector, we introduce BDD100K [11], Cityscapes [3], and nuScenes [2] for training; they are all collected from autonomous driving scenarios and have scene content and categories similar to Argoverse-HD. We report the detailed statistics of the extra datasets in Tab. 1.

We use the pre-trained model from stage one to initialize the detector and train it jointly on Argoverse-HD, BDD100K, Cityscapes, and nuScenes in this stage. For the multi-dataset fine-tuning, we train with a learning rate of 0.01, 5 epochs of warmup, and 20 epochs in total. We use 16 V100 GPUs (6 images/GPU), i.e., 96 images per mini-batch.
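A minimal sketch of how such joint training can be set up with torch.utils.data is given below: each source dataset is wrapped so that its labels are remapped onto the Argoverse-HD categories, and the wrapped datasets are concatenated. The wrapper class, the mapping dictionaries, and the dataset variable names are hypothetical; distributed sampling over the 16 GPUs is omitted for brevity.

```python
from torch.utils.data import ConcatDataset, Dataset

class RemappedDataset(Dataset):
    """Wrap a source detection dataset and remap its class ids onto the
    Argoverse-HD categories; boxes of unmapped classes are dropped."""

    def __init__(self, base, class_id_map):
        self.base = base                  # yields (image, boxes, labels)
        self.class_id_map = class_id_map  # e.g. {bdd_car_id: argo_car_id, ...}

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        image, boxes, labels = self.base[idx]
        keep = [i for i, c in enumerate(labels) if c in self.class_id_map]
        boxes = [boxes[i] for i in keep]
        labels = [self.class_id_map[labels[i]] for i in keep]
        return image, boxes, labels

# Hypothetical usage with per-source label maps:
# train_set = ConcatDataset([
#     RemappedDataset(argoverse_train, argoverse_map),   # identity mapping
#     RemappedDataset(bdd100k_train, bdd100k_map),
#     RemappedDataset(cityscapes_train, cityscapes_map),
#     RemappedDataset(nuscenes_train, nuscenes_map),
# ])
```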

Dataset | Type | #Train Img | #Val Img | Resolution | #Classes | Class Overlap
COCO [8] | Common | 118287 | 5000 | multi-scale | 80 | True
Argoverse-HD [7] | Road | 39384 | 15062 | 1200×1920 | 8 | True
BDD100K [11] | Road | 69863 | 10000 | 720×1280 | 10 | True
Cityscapes [3] | Road | 2975 | 500 | 1024×2048 | 19 | True
nuScenes [2] | Road | 67279 | 16445 | 900×1600 | 25 | True
Table 1: Statistics of the datasets used in the fine-tuning stage.
Method | AP | AP50 | AP75 | APS | APM | APL
Strong Baseline (Argoverse-HD only) | 35.4 | 59.9 | 34.2 | 18.3 | 34.2 | 48.4
+ COCO Pre-train | 42.8 | 62.9 | 44.7 | 22.0 | 43.0 | 64.7
+ Multi-dataset | 48.7 | 72.0 | 50.1 | 25.5 | 54.1 | 71.3
+ Large Scale Inference | 50.6 | 74.4 | 52.3 | 27.5 | 57.0 | 70.8
Online (on Val) | 40.2 | 68.9 | 39.4 | 21.5 | 42.9 | 53.9
Online (on Test) | 40.1 | 68.3 | 40.6 | 15.9 | 40.6 | 44.6
Online (on Test)† | 41.0 | 69.5 | 41.2 | 15.1 | 41.8 | 47.7
Table 2: Detection roadmap for the streaming perception challenge. We report the offline AP in the first block and the streaming AP in the second block. † means the validation set is also used for fine-tuning. We use the 1440×2304 input size for large-scale inference.

5.2 Detection Roadmap

We report the detection performance in Tab. 2. Using the COCO dataset for pre-training significantly improves our strong baseline, with a gain of 7.4 AP. Training jointly with multiple datasets further reaches 48.7 AP. Moreover, evaluating at a large scale (1440×2304) improves the performance to 50.6 AP. Equipped with TensorRT, we achieve 40.2 streaming AP on the validation set under the online evaluation metric. By adding the validation set for fine-tuning, we achieve 41.0 AP (online) on the test set.

5.3 Latency vs Accuracy

We also report the latency and the detection performance for different image sizes used for inference in Tab. 3. There is a trade-off between performance and latency, and the choice depends on the specific requirements of different task scenarios.

Height | Width | Speed (ms) | AP
1440 | 2304 | 28.1 | 50.6
1280 | 2048 | 21.4 | 49.9
1200 | 1920 | 20.5 | 49.7
1120 | 1792 | 19.7 | 48.7
960 | 1536 | 16.0 | 46.3
Table 3: Performance vs. inference latency for different input resolutions (height × width).
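Latency numbers of this kind can be obtained with a simple timing loop such as the sketch below, which assumes a CUDA device and a model callable on a single image; the warm-up and run counts are arbitrary choices.

```python
import time
import torch

@torch.no_grad()
def measure_latency_ms(model, height, width, n_warmup=10, n_runs=100, device="cuda"):
    """Average per-image inference latency (ms) at one input resolution."""
    x = torch.randn(1, 3, height, width, device=device)
    for _ in range(n_warmup):      # warm up kernels / cuDNN autotuning
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs * 1000.0

# e.g. for h, w in [(1440, 2304), (1280, 2048), (1200, 1920), (1120, 1792), (960, 1536)]:
#     print(h, w, measure_latency_ms(trt_model, h, w))
```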

References

  • [1] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
  • [2] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
  • [3] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [4] Adam Paszke, Sam Gross, Francisco Massa, et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
  • [5] Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. Ota: Optimal transport assignment for object detection. In CVPR, 2021.
  • [6] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.
  • [7] Mengtian Li, Yu-Xiong Wang, and Deva Ramanan. Towards streaming perception. In ECCV. Springer, 2020.
  • [8] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV. Springer, 2014.
  • [9] NVIDIA. https://github.com/nvidia-ai-iot/torch2trt.
  • [10] ultralytics. https://github.com/ultralytics/yolov5.
  • [11] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In CVPR, 2020.