1st Place Solution for ECCV 2022 OOD-CV Challenge Object Detection Track
Email: {zhaowei29, chenbinbin8, chenweijie5, yangshicai, xiedi, pushiliang.hri}@hikvision.com, {chenweijie, yzhuang}@zju.edu.cn
Abstract
The OOD-CV challenge (https://www.ood-cv.org/) is an out-of-distribution generalization task. To solve this problem in the object detection track, we propose a simple yet effective Generalize-then-Adapt (G&A) framework, composed of a two-stage domain generalization part and a one-stage domain adaptation part. The domain generalization part is implemented by a Supervised Model Pre-training stage that uses source data for model warm-up, followed by a Weakly Semi-Supervised Model Pre-training stage that uses both source data with box-level labels and auxiliary data (ImageNet-1K) with image-level labels to boost performance. The domain adaptation part follows a Source-Free Domain Adaptation paradigm, which uses only the pre-trained model and the unlabeled target data for further optimization in a self-supervised training manner. The proposed G&A framework helped us achieve first place on the object detection leaderboard of the OOD-CV challenge. Code will be released at https://github.com/hikvision-research/OOD-CV.
Keywords:
Weakly Semi-Supervised Object Detection, Out-of-Distribution Generalization, Test-Time Domain Adaptive Object Detection
1 Method

To improve model robustness in unknown target domains, we propose a simple yet effective Generalize-then-Adapt (G&A) framework that addresses the model degradation problem for object detection under domain shift. Specifically, the framework is composed of a two-stage domain generalization part and a one-stage domain adaptation part: 1) Supervised Model Pre-training. A strong baseline is built against domain shift by exploiting labeled source data with various strong data augmentation strategies that simulate potential out-of-distribution data. 2) Weakly Semi-Supervised Model Pre-training. Previous work demonstrates that extra auxiliary training data can further enhance out-of-distribution generalization [9], and ImageNet-1K [13] can be viewed as auxiliary training data with only image-level labels. The object detector pre-trained in the first stage is thus further optimized on the labeled source data (the ROBIN training set [18] with box-level labels) and the weakly labeled auxiliary data (ImageNet-1K with image-level labels); this setting is termed Weakly Semi-Supervised Object Detection and is implemented via a Class-Specific Pseudo-Labeling method in this report. 3) Source-Free Domain Adaptation is utilized for Test-Time Training, which adapts the model to the target domain by exploiting only the source pre-trained object detector and the unlabeled target data, without accessing the source data [1, 8]. In this challenge, it is implemented as a simple Mean-Teacher based Self-Training mechanism. An overview of the proposed method is shown in Fig. 2. After integrating Test-Time Augmentation and Model Ensemble strategies, our solution ranks first on the object detection leaderboard of the OOD-CV challenge (https://codalab.lisn.upsaclay.fr/competitions/6784#results).

Our thinking behind this framework: compared with conventional domain adaptation methods, which train on the source data and the target data jointly (a paradigm forbidden in this challenge anyway), the proposed G&A framework is more feasible in real-world scenarios: it decouples the joint-training paradigm into a domain generalization stage that exploits only source data and a test-time domain adaptation stage that exploits only target data, as shown in Fig. 1. G&A allows only pre-trained model transmission, without source data exchange, thereby avoiding expensive data transmission and data privacy leakage. The generalization step is usually carried out on the server side, while the adaptation step is usually conducted on the client side for model self-evolution; the two steps can be viewed as an upstream operation and a downstream operation, respectively. However, existing works in the OOD community usually focus on either the domain generalization step or the test-time domain adaptation step, without integrating the two. We hope our solution, with its superior performance on the leaderboard of this challenge, can inspire the community to focus on integrating these two steps to further resist model degradation under domain shift.
2 Implementation Details
2.1 Dependencies
The software and hardware dependencies are listed in Table 4.
2.2 Training Description
Table 1: Data augmentation settings used in the pre-training stages.
| AUG | sub-AUG | probability |
| --- | --- | --- |
| random resize | / | 1.0 |
| mask-level copy-paste | / | 0.3 |
| random-choice | mixup | 0.3 |
| | blur, noise | 0.3 |
| | color jitter | 0.3 |
| | random-erase | 0.3 |
| | foggy, snow | 0.3 |
| random horizontal flipping | / | 0.5 |
2.2.1 Strong baseline.
A ConvNext-Large based DDOD detection model is trained on the ROBIN training dataset with strong data augmentation. In this stage, the source data is enriched from the original source domain to a wider domain that covers both the original source domain and several novel domains, so as to resist domain shift.
2.2.2 Weakly semi-supervised object detection.
To further generalize the object detector, a subset of ImageNet-1K (covering the same 10 categories as the ROBIN task) is used as additional data to enrich the source data, via a weakly semi-supervised object detection method based on class-specific pseudo-labeling. First, pseudo-boxes for ImageNet-1K are generated by forwarding the weakly augmented ImageNet-1K images through the EMA (Exponential Moving Average) model. Exploiting the image-level class supervision as a prior, we keep only the pseudo-boxes whose classes match the corresponding image-level labels, as sketched below. Then, the model pre-trained in the first stage is further optimized on the ImageNet-1K subset and the ROBIN dataset [18] with strong data augmentation. Details of stage 2 are shown in Fig. 2 (stage 2).
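The class-specific filtering step can be sketched as follows. This is a minimal illustration rather than our exact training code: the detector interface, the per-image result format, and the confidence threshold score_thr are assumptions; the report only specifies that predicted boxes whose class is not in the image-level label set are discarded.

```python
import torch

@torch.no_grad()
def class_specific_pseudo_labels(ema_model, weak_images, image_labels,
                                 score_thr=0.5):
    """Generate pseudo-boxes for weakly (image-level) labeled data.

    image_labels[i] is the set of class ids present in image i.
    Assumed interface: the model returns, per image, boxes [N, 4],
    scores [N], and labels [N].
    """
    ema_model.eval()
    results = ema_model(weak_images)  # forward the weakly augmented view

    pseudo = []
    for (boxes, scores, labels), img_classes in zip(results, image_labels):
        # Keep only confident boxes whose class matches an image-level label.
        keep = (scores >= score_thr) & torch.tensor(
            [int(c) in img_classes for c in labels], dtype=torch.bool)
        pseudo.append((boxes[keep], labels[keep]))
    return pseudo
```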
2.2.3 Test-Time Training.
A source-free domain adaptation based test-time training strategy is used to adapt the model to the test domain, implemented as a simple mean-teacher based self-training mechanism. Weakly augmented test data is fed to the EMA (teacher) model to generate pseudo-labels; these labels, together with the corresponding strongly augmented data, are then used to train the (student) detector. The final EMA model is selected for testing.
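A minimal sketch of one such self-training step follows. The loss interface student.loss(...), the EMA momentum, and the confidence threshold are illustrative assumptions; only the weak-to-strong pseudo-labeling scheme and the EMA teacher come from the method description above.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher weights track an exponential moving average of the student's.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def test_time_train_step(student, teacher, weak_batch, strong_batch,
                         optimizer, score_thr=0.5):
    # 1) Pseudo-label the weakly augmented view with the EMA teacher.
    teacher.eval()
    with torch.no_grad():
        boxes, scores, labels = teacher(weak_batch)
    keep = scores >= score_thr            # confidence filtering (assumed thr)

    # 2) Train the student on the strongly augmented view of the same images.
    loss = student.loss(strong_batch, boxes[keep], labels[keep])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 3) Update the teacher from the student.
    ema_update(teacher, student)
```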
2.2.4 Technique details.
We build the detector with a ConvNext-Large backbone and a DDOD detection head with FPN [10]. For data augmentation, we apply mixup [17], color jitter, foggy [6], snow [6], resize, horizontal flipping, rotation, blur, noise, random erase [19], and mask-level copy-paste. To obtain more vivid occlusion effects, we use MCTformer [15] to generate pseudo masks of the salient objects in the ImageNet-1K subset; these masks are then used in the mask-level copy-paste strategy [5], being pasted onto training images to simulate object occlusion. The data augmentation settings are listed in Table 1. Note that both mixup and mask-level copy-paste use the ImageNet-1K dataset, and these two augmentations are not used in the test-time domain adaptation part. All three training stages described above use the AdamW optimizer [12] with a learning rate of 0.0001 and a batch size of 2 per GPU. A multi-scale training mechanism is used, with the image height randomly rescaled within [480, 800].
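Since our implementation builds on mmdetection (cf. Table 4), the stated settings map naturally onto a config file. The following is a minimal sketch under that assumption; values not given in the text (e.g. the weight decay and the long-side limit of 1333 during training) are assumptions.

```python
# Minimal mmdetection-style config sketch of the stated optimizer and
# multi-scale training settings; unstated field values are assumptions.
optimizer = dict(type='AdamW', lr=1e-4, weight_decay=0.05)  # weight decay assumed
data = dict(samples_per_gpu=2)                              # batch size 2 per GPU

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    # Multi-scale training: image height randomly sampled within [480, 800].
    dict(type='Resize', img_scale=[(1333, 480), (1333, 800)],
         multiscale_mode='range', keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', mean=[123.675, 116.28, 103.53],
         std=[58.395, 57.12, 57.375], to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
```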
2.3 Testing Description
In the normal test stage, we use only a single scale (width 1333, height 800). For test-time augmentation (TTA), we use multi-scale and horizontal flipping tricks. The multi-scale trick uses 11 scales, with the image height evenly spaced from 480 to 800.
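With heights evenly spaced from 480 to 800, the 11 TTA scales correspond to a step of 32 pixels (cf. the 480:800:32 entry in Table 6). A minimal mmdetection-style test pipeline under that assumption, with the width of 1333 taken from the single-scale setting:

```python
# 11 evenly spaced heights from 480 to 800 (step 32): 480, 512, ..., 800.
heights = list(range(480, 801, 32))
assert len(heights) == 11
img_scales = [(1333, h) for h in heights]   # (long side, height) pairs

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='MultiScaleFlipAug',
         img_scale=img_scales,
         flip=True,                         # horizontal flipping trick
         transforms=[
             dict(type='Resize', keep_ratio=True),
             dict(type='RandomFlip'),
             dict(type='Normalize', mean=[123.675, 116.28, 103.53],
                  std=[58.395, 57.12, 57.375], to_rgb=True),
             dict(type='Pad', size_divisor=32),
             dict(type='ImageToTensor', keys=['img']),
             dict(type='Collect', keys=['img']),
         ]),
]
```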
3 Experimental Results
3.1 Ablation Study
Table 2: Ablation study (OOD-AP50, %). S1–S3 denote the three training stages; TTA denotes test-time augmentation.
| Method | S1* | S1 | S2 | S3 | TTA | OOD-AP50 (Phase-1) | OOD-AP50 (Phase-2) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DDOD | ✓ | | | | | 74.43 | / |
| | | ✓ | | | | 83.24 | / |
| | | ✓ | ✓ | | | 85.34 | 68.26 |
| | | ✓ | ✓ | ✓ | | 85.94 | 71.34 |
| | | ✓ | ✓ | ✓ | ✓ | 86.21 | / |
Table 2 shows the ablation study of the proposed method on the phase-1 dataset. DDOD with a ConvNext backbone is the basic detection framework in this study. The results of S1 and S2 show that the generalization ability of the detection model is gradually enhanced as the training data is enriched. The result of S3 shows that test-time domain adaptation effectively shifts the model from the enriched source domain to the target domain. TTA further improves performance.
3.2 Ensembles And Fusion Strategies
Experiments show that model ensembling brings a further performance improvement. We ensemble the predicted boxes from 15 predictions (8 different models, 7 of which additionally use TTA) via weighted box fusion (WBF) [14]; a usage sketch is given after Table 3. ConvNext-Large [11] is the backbone for all models, and the training settings are exactly the same. Detailed results are shown in Table 3. After ensembling, the OOD accuracy (mAP50) improves from 71.34% to 73.89%.
Table 3: Results of the individual models and the ensemble (OOD-AP50, %).
| Model | OOD-AP50 (Phase-1) | OOD-AP50 (Phase-2) |
| --- | --- | --- |
| DDOD | 85.94 | 71.34 |
| +TTA | 86.21 | / |
| DDOD+Dyhead1 | 85.42 | / |
| +TTA | 85.92 | / |
| DDOD+Dyhead2 | 85.70 | / |
| +TTA | 86.08 | / |
| TOOD | 84.93 | / |
| +TTA | 85.92 | / |
| Auto-Assign | 83.96 | / |
| +TTA | 84.69 | / |
| VFNet | 83.69 | / |
| +TTA | 84.66 | / |
| DDOD | 85.04 | / |
| +TTA | 85.76 | / |
| PAA | 85.39 | / |
| Ensemble | 86.67 | 73.89 |
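For reference, a minimal WBF fusion sketch using the open-source ensemble_boxes implementation of [14]. The IoU threshold and the equal model weights are assumptions, since the text does not state the fusion hyper-parameters.

```python
import numpy as np
from ensemble_boxes import weighted_boxes_fusion  # pip install ensemble-boxes

def fuse_predictions(predictions, img_w, img_h, iou_thr=0.6):
    """predictions: one (boxes [N,4] in pixels, scores [N], labels [N]) tuple
    per prediction source (15 in our ensemble: 8 models + 7 TTA variants)."""
    scale = np.array([img_w, img_h, img_w, img_h], dtype=np.float32)
    boxes_list, scores_list, labels_list = [], [], []
    for boxes, scores, labels in predictions:
        boxes_list.append((boxes / scale).tolist())   # WBF expects [0, 1] coords
        scores_list.append(scores.tolist())
        labels_list.append(labels.tolist())
    boxes, scores, labels = weighted_boxes_fusion(
        boxes_list, scores_list, labels_list,
        weights=None,            # equal model weights (assumption)
        iou_thr=iou_thr,         # fusion IoU threshold (assumption)
        skip_box_thr=0.001)      # drop near-zero-score boxes
    return boxes * scale, scores, labels              # back to pixel coords
```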
3.3 Model Complexity Analysis
Table 4: Software and hardware environment.
| Environment | Descriptions |
| --- | --- |
| system and hardware | Ubuntu 20.04 + CUDA 11.3 + cuDNN 8 + Tesla V100 (32 GB) ×8 |
| python and pytorch | python=3.8.5 + pytorch=1.10 |
| python library | mmcv-full=1.6.1 + mmdet=2.25.1 + mmcls=0.23.2 |
Table 5: Training and testing time of the DDOD model.
| Stage | Time | Descriptions |
| --- | --- | --- |
| Stage-1 (training) | 13h45min13s | fp16, test interval = 4 epochs, 36 epochs in total |
| Stage-2 (training) | 9h44min26s | fp16, test interval = 1 epoch, 12 epochs in total |
| Stage-3 (training) | 5h36min03s | fp16, test interval = 1 epoch, 6 epochs in total |
| Testing | 26.3 fps | fp32 |
Table 6: Model complexity analysis.
| Model | Item | Value |
| --- | --- | --- |
| ConvNext+DDOD | Input height | 800 |
| | FLOPs | 834.61 GFLOPs |
| | Params | 204.71 M |
| ConvNext+DDOD (Multi-Scale) | Input height | 480:800:32 |
| | FLOPs | 14687.86 GFLOPs |
| | Params | 204.71 M |
The environment we use is given in Table 4. The training and testing times for the DDOD model are shown in Table 5. Because the proposed solution requires three training stages (Table 5), its training time is longer than that of the base detection framework; the test efficiency, however, is the same as the base framework. Model complexity is analyzed in Table 6.
4 Conclusion
We thank the organizing committee for providing data [18] that allowed us to study the robustness of object detectors under domain shifts.
References
- [1] Chen, W., Lin, L., Yang, S., Xie, D., Pu, S., Zhuang, Y.: Self-supervised noisy label learning for source-free unsupervised domain adaptation. In: IROS (2022)
- [2] Chen, Z., Yang, C., Li, Q., Zhao, F., Zha, Z.J., Wu, F.: Disentangle your dense object detector. In: Proceedings of the 29th ACM International Conference on Multimedia. pp. 4939–4948 (2021)
- [3] Dai, X., Chen, Y., Xiao, B., Chen, D., Liu, M., Yuan, L., Zhang, L.: Dynamic head: Unifying object detection heads with attentions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
- [4] Feng, C., Zhong, Y., Gao, Y., Scott, M.R., Huang, W.: TOOD: Task-aligned one-stage object detection. In: ICCV (2021)
- [5] Guo, Y., Shi, X., Chen, W., Yang, S., Xie, D., Pu, S., Zhuang, Y.: 1st place solution for ECCV 2022 OOD-CV challenge image classification track. arXiv preprint (2023)
- [6] Hendrycks, D., Dietterich, T.: Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations (2019)
- [7] Kim, K., Lee, H.S.: Probabilistic anchor assignment with IoU prediction for object detection. In: ECCV (2020)
- [8] Li, X., Chen, W., Xie, D., Yang, S., Yuan, P., Pu, S., Zhuang, Y.: A free lunch for unsupervised domain adaptive object detection without source data. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 8474–8481 (2021)
- [9] Lin, L., Xie, H., Yang, Z., Sun, Z., Liu, W., Yu, Y., Chen, W., Yang, S., Xie, D.: Semi-supervised domain generalization in real world: New benchmark and strong baseline. arXiv preprint arXiv:2111.10221 (2021)
- [10] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2117–2125 (2017)
- [11] Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. arXiv preprint arXiv:2201.03545 (2022)
- [12] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
- [13] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M.S., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 211–252 (2015)
- [14] Solovyev, R., Wang, W., Gabruseva, T.: Weighted boxes fusion: Ensembling boxes from different object detection models. Image and Vision Computing pp. 1–6 (2021)
- [15] Xu, L., Ouyang, W., Bennamoun, M., Boussaid, F., Xu, D.: Multi-class token transformer for weakly supervised semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4310–4319 (2022)
- [16] Zhang, H., Wang, Y., Dayoub, F., Sünderhauf, N.: VarifocalNet: An IoU-aware dense object detector. arXiv preprint arXiv:2008.13367 (2020)
- [17] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
- [18] Zhao, B., Yu, S., Ma, W., Yu, M., Mei, S., Wang, A., He, J., Yuille, A., Kortylewski, A.: ROBIN: A benchmark for robustness to individual nuisances in real-world out-of-distribution shifts. arXiv preprint arXiv:2111.14341 (2021)
- [19] Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. In: Proceedings of the AAAI conference on artificial intelligence. vol. 34, pp. 13001–13008 (2020)
- [20] Zhu, B., Wang, J., Jiang, Z., Zong, F., Liu, S., Li, Z., Sun, J.: AutoAssign: Differentiable label assignment for dense object detection. arXiv preprint arXiv:2007.03496 (2020)