
Quantifying Context Bias in Domain Adaptation for Object Detection

Hojun Son1 and Arpan Kusari1 1Hojun Son is a postdoctoral researcher at University of Michigan Transportation Research Institute, 2901 Baxter Rd, Ann Arbor, MI 48105, USA [email protected] 1Arpan Kusari is an assistant research scientist at University of Michigan Transportation Research Institute, 2901 Baxter Rd, Ann Arbor, MI 48105, USA [email protected]
Abstract

Domain adaptation for object detection (DAOD) aims to transfer a trained model from a source to a target domain. Various DAOD methods exist, some of which minimize context bias between foreground-background associations in various domains. However, no prior work has studied context bias in DAOD by analyzing changes in background features during adaptation and how context bias is represented in different domains. Our experiments highlight the potential usefulness of context bias in DAOD. We address the problem by varying activation values over different layers of trained models and by masking the background, both of which impact the number and quality of detections. We then use one synthetic dataset from CARLA and two different versions of real open-source data, Cityscapes and Cityscapes foggy, as separate domains to represent and quantify context bias. We utilize metrics such as Maximum Mean Discrepancy (MMD) and Maximum Variance Discrepancy (MVD) to obtain layer-specific conditional probability estimates of the foreground given manipulated background regions for separate domains. We demonstrate through detailed analysis that an understanding of context bias can inform the DAOD approach and that focusing solely on aligning foreground features is insufficient for effective DAOD.

I INTRODUCTION

Domain adaptation for object detection (DAOD) has been studied extensively [1, 2, 3, 4, 5] to enable object detectors to perform well on datasets with distribution shifts from the training data [6, 7]. It is well known that there is an entanglement between background and foreground features in object detection, leading to a phenomenon in DAOD called context bias: significant differences in background features between the source and target domains can cause a notable decline in the quality and number of detections, even when the foreground features remain unchanged. Recent studies in image classification [8] and segmentation [9, 1, 10] have attempted to mitigate context bias by minimizing this association. Oliva & Torralba [11] demonstrated that context bias could result in the corruption of foreground objects by contextually correlated backgrounds, substantially degrading detection quality. However, there has been no prior work specifically analyzing the impact of context bias in DAOD. This work aims to address this gap.

In the realm of human cognition, the brain can accurately and instantly recognize foreground-background associations without extensive training [12]. Several studies, including [13, 14, 12, 15], have investigated the processes of background suppression and foreground representation to understand the scene and temporal dynamics of foreground and background modulation in the brain. These insights can be applied to the field of computer vision for DAOD through comprehensive analysis of the representation of foreground-background associations.

Figure 1: The proportion of background pixels in Cityscapes [16] is the highest of all classes. The image is from the Cityscapes publication.
Figure 2: Loss of information, defined as the negative log of the complement of the performance drop. Red dots are the maximum loss of information, green dots are cases with a significant performance drop, and blue dots indicate that the road context does not affect detection.

To motivate our problem, we first look at the proportion of background features in autonomous driving datasets. For the Cityscapes dataset [16], the number of pixels belonging to built-up features (such as road and sidewalk) is much greater than the number of foreground object pixels (see Fig. 1). As an image feature, roads are simple and constant, which leads them to be learned more rapidly than other objects such as vehicles. By rapidly learning the simple and constant features of roads, the model can establish a foundational understanding that supports more complex learning tasks, such as detecting vehicles. Figure 2 shows the performance drop when activated features on road regions are zeroed out using semantic labels at the second backbone layer of Detectron2 [17], trained on the Cityscapes dataset for object detection. Compared to the performance drop ($\Delta D$, where $0\leq\Delta D\leq 1$) obtained with the sky label, the loss of information, defined as the negative log of the complement of $\Delta D$ ($-\log(1-\Delta D)$), tends to be larger. This indicates that the road context has a stronger contextual association with vehicles, particularly when the vehicle size is small.
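As a small worked example of this metric (the drop values below are hypothetical, not taken from our experiments), the loss of information grows much faster than the raw performance drop as the drop approaches one:

import math

def loss_of_information(delta_d: float) -> float:
    # Negative log of the complement of the performance drop, valid for 0 <= delta_d < 1.
    return -math.log(1.0 - delta_d)

# Hypothetical drops: zeroing road activations vs. zeroing sky activations.
print(loss_of_information(0.40))   # ~0.51 -> stronger contextual association
print(loss_of_information(0.05))   # ~0.05 -> weaker contextual association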

Figure 3: (a) 2D inference on the CARLA dataset using a YOLOv4 model; (b) CAM attention map of the inference.

Additionally, we train a YOLOv4 detection model [18] on a sample CARLA [19] dataset collected under sunny conditions and run inference on a separate CARLA dataset collected under cloudy conditions. Using Class Activation Mapping (CAM), we find that the model focuses on the road in front of the vehicles rather than the vehicles themselves (see Fig. 3). We also perform an analogous experiment in which we mask the road pixels and find that the newer YOLOv8 model is unable to detect most of the vehicles it otherwise detects in the unmodified image (see Fig. 4). This outcome suggests that the neural network may have implicitly learned to associate vehicles with road environments, leading to poor performance in detecting vehicles when a different background is present.

Figure 4: Three out of four vehicles are detected correctly in the original image, but with road masking only one vehicle is detected.
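A minimal sketch of this road-masking check, assuming per-pixel semantic labels are available and using the Ultralytics YOLOv8 interface; the file names and the road class id are illustrative.

import cv2
from ultralytics import YOLO

ROAD_ID = 0   # illustrative id of the "road" class in the semantic label map

image = cv2.imread("frame.png")                                   # hypothetical CARLA frame
semantic = cv2.imread("frame_semantic.png", cv2.IMREAD_GRAYSCALE)

# Zero out road pixels so only the background context changes, not the vehicles.
masked = image.copy()
masked[semantic == ROAD_ID] = 0

model = YOLO("yolov8n.pt")
for name, img in [("original", image), ("road-masked", masked)]:
    boxes = model(img, verbose=False)[0].boxes
    print(name, "detections:", len(boxes))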

Our fundamental questions are as follows: 1) Why does the context bias occur during training? 2) Can we evaluate the context bias in different domains? 3) Can we quantify the context bias across different domains?

Firstly, we investigate the underlying reasons for foreground-background associations. Then we set up hypotheses about how the association is represented in different domains and analyze it using feature embeddings. Next, we quantify the discrepancy between foreground-background representations across multiple domains. This quantitative assessment aims to provide insights into the extent of context bias in object detection and its implications for DAOD.

Our main contributions are as follows:

  • We examine the issue of DAOD in relation to context bias. Although context bias is a well-recognized challenge in computer vision, none of the current approaches investigate how context bias can manifest across various domains.

  • We highlight a crucial gap, suggesting that considering context bias is essential for enhancing the generalization and robustness of models across various environments.

  • We employ distance-based metrics to measure the association between foreground and background under domain shifts. Additionally, we calculate these metrics for each layer of the neural network and expand the background region across bins to assess the impact of background regions on object detection.

II Related Work

II-A Foreground-background associations and context bias

Background influence [20, 21, 22, 23, 24, 25, 26, 27] and context bias [28, 29, 30, 31] have been studied to improve performance in tasks such as classification, object recognition, and object localization. Xiao et al. [20] and Zhu et al. [24] studied the effect of background on classification accuracy by modifying images with different combinations of foreground and background. A graphical model [30] that expresses foreground-background associations as conditional probabilities serves as a methodological inspiration for us. Various studies have addressed context bias using techniques such as data augmentation to inject out-of-distribution samples into the background, combinations of naturally unmatched background and foreground (e.g., an elephant in a room), and background removal during training. Torralba [32] demonstrated that the background effect can be factorized into object priming, focus of attention, and scale selection by modeling foreground-background associations probabilistically. Liang et al. [21] studied background influence using fashion datasets [33, 34]. Other studies [35, 36] localize foreground objects better than CAM-based algorithms without using bounding box information, relying only on classification labels. These prior works focus on context bias within a single domain and use datasets with limited variation, such as centered or single objects, and therefore provide limited insight for DAOD.

II-B Domain Adaptation for Object Detection

Different variations of DAOD methods have been proposed using feature alignment, synthetic images, and self-training or self-distillation. Feature alignment finds transformations between the source and target domains to reduce distribution shift via adversarial training [37, 1, 38, 39], which helps extract common latent features from different domains. Progressive Domain Adaptation for Object Detection [40] synthesized a new dataset using CycleGAN [41] to bridge domain gaps, and Self-Adversarial Disentangling for Specific Domain Adaptation [42] achieved 45.2 mAP on Cityscapes to Cityscapes foggy using synthetic images. Gong et al. [43] utilized transformers to align features across the backbone and decoder networks. However, combining multiple sources into a single dataset and performing single-source domain adaptation for feature alignment does not guarantee better performance than using the best individual source domain [44].

Self-training uses a teacher model to predict pseudo labels on target domains in order to gradually accommodate domain shift [45, 46, 47, 2, 48]. MIC [4] employed masked images in a teacher-student model, and MRT [49] proposed a modified mask-based retraining approach for the teacher-student model. Align and Distill [50] performs alignment and distillation to enforce invariance across domains and reduce feature discrepancy.

Finding common features across multiple domains is critical for DAOD. These methods have demonstrated that foreground features in latent space can be aligned, using dimension reduction methods such as UMAP [51] and t-SNE [52] for visualization. However, they do not address how to manage context bias when adapting across different domains; instead, they propose and validate their methods within the DAOD framework using accuracy metrics. Thus, we focus on analyzing the root causes of domain discrepancy in object detection both qualitatively and quantitatively.

III Method

III-A Why does the context bias occur during training?

Prior studies [30, 32] investigated context bias for object classification. They pointed out that relying only on local features (foreground features in our case) has limitations, including degraded quality due to noise and ambiguity in the target search space. They extended the likelihood to incorporate context information surrounding the foreground, which enhances object classification by providing a stronger conditional probability. The conditional probability of the object ($O$) given the features ($f$) was given as:

P(O|f)=P(O|F,B)=\frac{P(F|O,B)P(O|B)}{P(F|B)} (1)

where $F$ and $B$ are the foreground and background features. The model can also be interpreted as:

\begin{split}P(O|B)=P(c|\sigma,Y,B)P(\sigma|Y,B)P(Y|B)\\=P(\sigma|Y,c,B)P(Y|c,B)P(c|B)\end{split} (2)

where the object is represented by scale ($\sigma$), location ($Y$), and category ($c$).

The challenge with using a convolutional neural network (CNN) to estimate the likelihood is the inability to explicitly teach the model to learn each factor in a specific order; the learned parameters of the CNN can differ depending on how it is trained. In CNNs, likelihood estimation amounts to finding the mean of a distribution, which necessitates more samples to accurately estimate the true mean. This aligns with the principle that a more extensive and refined dataset, achieved through data augmentation, is crucial for better performance [53].

In the context of a graphical causal model (F→Y←B), the joint distribution P(Y, F, B) can be decomposed as either $P(Y|F,B)P(F|B)P(B)$ or $P(Y|F,B)P(B|F)P(F)$. In the SUN 09 training set used by [30], the association between roads and cars is strong, and the number of road pixels surpasses the number of object pixels (similar to Cityscapes; Fig. 1). Consequently, the CNN is more likely to learn $P(F=car|B=road)$ than $P(B=road|F=car)$, inducing the factorization $P(Y|F,B)P(F|B)P(B)$ during training. Also, the factor $P(B=road)$ can be factorized as $P(Color|X)P(shape|X)$ and can be learned more easily and earlier than $P(F=car)=P(Color|X)P(shape|X)$. This may explain why the object detection model fails on a different domain, especially when the road is absent or changed.

Another reason relates to the receptive field of the CNN architecture. As the layers in a CNN go deeper, the receptive field increases, causing the pooled information after convolution operations to mix foreground and background features [54]. This mixing can create ambiguity between foreground and background features.

Given these considerations, we now delve into how we empirically provide evidence of context bias on different datasets.

III-B Evaluate the context bias in different domains

We utilized Detectron2 with three different datasets for which semantic labels are available:

  • Cityscapes - we used the training set of 2,975 images to extract foreground and background features and the validation set of 500 images.

  • Cityscapes foggy (beta 0.02) - a synthetic foggy version of Cityscapes with fog levels controlled by a hyperparameter; it is used only for evaluation.

  • CARLA - it contains 19 different driving scenarios; we sampled the same number of images as the Cityscapes training set.

We focused on three labels: Car, Truck, and Bus.

The Detectron2 model consists of a backbone network that extracts feature maps from the input image at five different scales, region proposal networks that extract object regions from the multi-scale features, and box heads that aggregate the region proposals. To train the Detectron2 model, we used a batch size of 8 and a step-LR scheduler with milestones at {10,000, 20,000, 30,000, 35,000, 37,500} iterations (about 100 epochs). The initial learning rate was 0.015, the image resolution was $1024\times 2048$, and we applied horizontal flip and brightness augmentation. We used the Adam optimizer, a regression loss for bounding boxes, and cross-entropy loss for label classification. After training, the model achieved an AP50 of 53.72 on the Cityscapes validation set and 88.62 on the CARLA validation set. While the result on CARLA indicates overfitting, it provides useful insights into feature alignment between the two domains.
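A hedged sketch of this training setup using Detectron2's config system is shown below; the dataset names are placeholder registrations, the scheduler milestones are our reading of the iteration list above, and since Detectron2's DefaultTrainer uses SGD by default, the Adam optimizer and the brightness augmentation would require a customized trainer that we omit here.

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("cityscapes_train",)        # placeholder registered dataset names
cfg.DATASETS.TEST = ("cityscapes_val",)
cfg.SOLVER.IMS_PER_BATCH = 8                      # batch size of 8
cfg.SOLVER.BASE_LR = 0.015                        # initial learning rate
cfg.SOLVER.STEPS = (10000, 20000, 30000, 35000)   # step-LR milestones
cfg.SOLVER.MAX_ITER = 37500                       # roughly 100 epochs
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3               # Car, Truck, Bus
cfg.INPUT.MIN_SIZE_TRAIN = (1024,)                # images are 1024 x 2048

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()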

III-C Foreground and background feature extraction

To separate foreground- and background-related features per target object, we utilized a combination of activation hooks and semantic pixel labels. To obtain the overall activation region of the foreground, we used Smooth GradCAM++ [55], which extracts spatial information from attention and semantic maps, distinguishing foreground from background features. Following this separation, we saved the feature maps as feature vectors per object across various layers. While Smooth GradCAM++ provides the entire activation area, we vary what percentage of the background activation region is included alongside the foreground object. We therefore expanded the area thresholds in terms of bins (where bin 1 represents almost no background and bin 9 represents "all" activation); in other words, the background region of interest is gradually extended as the bin parameter increases. Figure 5 shows visualized examples of foreground and background regions extracted from different layers and thresholds on images from Cityscapes and CARLA. Algorithm 1 shows the feature extraction process.

Algorithm 1 Feature extraction
1: Input: activation maps per object, semantic labels, trained model
2: Output: F and B per object, layer, and threshold
3: for 1 ≤ bin ≤ 10 do
4:     THS ← MAV * POWER(0.1, bin) ▷ MAV is the maximum activation value
5:     if THS is zero then
6:         return
7:     end if
8:     for Target_L in [res2, res3, res4, res5] do
9:         if Target_L is in activation hooks then
10:             Separate foreground and background activation maps with THS and semantic labels
11:         else
12:             Forward activation maps
13:         end if
14:         Compute average feature vectors (F and B)
15:         Save the feature vectors
16:     end for
17: end for
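A simplified, self-contained PyTorch sketch of this hook-based extraction is given below. It uses a torchvision ResNet-50 as a stand-in for the Detectron2 backbone (torchvision's layer1-layer4 correspond to res2-res5) and a random placeholder image and semantic mask, so it mirrors Algorithm 1 rather than reproducing our exact implementation.

import torch
from torchvision.models import resnet50

model = resnet50(weights=None).eval()
features = {}

def make_hook(name):
    def hook(module, inputs, output):
        features[name] = output.detach()   # store the stage's activation map
    return hook

# torchvision's layer1..layer4 play the role of res2..res5 in Algorithm 1.
for name in ("layer1", "layer2", "layer3", "layer4"):
    getattr(model, name).register_forward_hook(make_hook(name))

image = torch.rand(1, 3, 512, 1024)        # placeholder (downscaled) input image
fg_sem = torch.zeros(1, 1, 512, 1024)      # placeholder binary semantic mask of one object
fg_sem[..., 200:300, 450:600] = 1.0

with torch.no_grad():
    model(image)                           # forward pass fills `features` via the hooks

vectors = {}
for b in range(1, 10):
    for name, fmap in features.items():
        ths = fmap.max() * (0.1 ** b)                                   # THS = MAV * 0.1^bin
        active = (fmap > ths).float()                                   # thresholded activation region
        mask = torch.nn.functional.interpolate(fg_sem, size=fmap.shape[-2:])
        fg = fmap * active * mask                                       # foreground activations
        bg = fmap * active * (1.0 - mask)                               # background activations
        # Spatial averages give the per-object feature vectors F and B for this layer and bin.
        vectors[(b, name, "F")] = fg.flatten(2).mean(-1)
        vectors[(b, name, "B")] = bg.flatten(2).mean(-1)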
Figure 5: Each letter labels a visualized foreground or background area in the image. "A" is a color image on which the attention map is overlaid. "B" is the foreground region; it does not change with the bin. "C" and "D" are thresholded background regions: "C" uses bin threshold 4 and "D" uses bin 9, so "C" covers a narrower background region than "D".

III-D Quantify the context bias across different domains

MMD was proposed by [56] as a kernel-based statistical test of the distance between two given distributions. The MMD between two distributions $X$ and $Y$ can be mathematically defined as:

MMD^{2}(X,Y)=||\mu_{k}(X)-\mu_{k}(Y)||^{2}_{F} (3)

where $x_{i}\in X$ and $y_{i}\in Y$. Alternatively, a variance-based discrepancy has also been proposed by [57] as:

MVD^{2}(X,Y)=||\Sigma_{k}(X)-\Sigma_{k}(Y)||^{2}_{b} (4)

where $b$ represents the unit ball of the reproducing kernel Hilbert space $\mathcal{H}_{k}$. We compute MMD and MVD between foreground and background feature vectors from different datasets, using the separated, scaled feature vectors. Since the raw MMD values are too small to interpret easily, we scale them by 1000. UMAP clustering depends on hyperparameters such as the minimum distance used for visualization and the number of neighbors used to compute weights among samples; to avoid bias, we keep all hyperparameters constant across experiments.
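As an illustration of how we compute these quantities, the sketch below gives a plain NumPy estimate of the squared MMD with an RBF kernel between two sets of feature vectors, scaled by 1000 as described above; the feature arrays and the kernel bandwidth are placeholders, and the analogous MVD estimate (built from covariance operators rather than mean embeddings) is omitted for brevity.

import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    # Biased estimate of MMD^2 = ||mu_k(X) - mu_k(Y)||^2 in the RKHS.
    return (rbf_kernel(X, X, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean()
            - 2.0 * rbf_kernel(X, Y, sigma).mean())

# Example: source foreground features vs. target background features (placeholder data).
rng = np.random.default_rng(0)
F_src = rng.normal(0.0, 1.0, size=(200, 256))
B_tgt = rng.normal(0.5, 1.0, size=(200, 256))
print(1000.0 * mmd2(F_src, B_tgt))   # scaled by 1000 for readability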

Figure 6: 2D embedding results for different domains using features from the 4th ResNet-50 [58] layer. The red boxes represent detections where background features are entangled with the foreground.
Figure 7: Cityscapes-CARLA 3D embedding, with image patches annotated for each cluster. Red circles are Cityscapes foreground, red diamonds are CARLA foreground, and green stars and crosses are Cityscapes and CARLA background respectively. Features are extracted from the 5th ResNet-50 layer.

IV Experiments

IV-A UMAP visualization

We start our analysis by plotting the foreground and background feature distributions of different domains using UMAP. Figures 6 and 7 present the foreground and background features from different domains in 2D and 3D. In Figure 6, we use the features at the 4th ResNet layer from three different data pairings: (a) Cityscapes training set as source and validation set as target; (b) Cityscapes training set as source and Cityscapes foggy as target; and (c) Cityscapes training set as source and CARLA as target. The interesting finding is the difference in background alignment between the three comparisons: as the target domain shifts away from the source domain, the background becomes more separable than the foreground. For the Cityscapes training and validation sets, the foreground and background are separable from each other, but source and target are intermingled. In the second panel, the foreground features stay together while the background features are separable but overlapping. The third panel shows an extreme case in which the foreground features of Cityscapes and CARLA lie next to each other but do not overlap, while the background features are very distant from each other. The same pattern for Cityscapes-CARLA appears in Figure 7, which uses a 3D UMAP embedding; we visualize the corresponding image patches for qualitative analysis. Noisy and ambiguous foreground features align with the background. In addition, background features contain foreground features because objects are adjacent to each other, which can lead to alignment with foreground features.
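The embedding step itself can be sketched as follows, assuming the saved per-object feature vectors have been stacked into arrays; the group names, array shapes, and hyperparameter values are illustrative, the key point being that the UMAP hyperparameters stay fixed across all experiments.

import numpy as np
import umap
import matplotlib.pyplot as plt

# Placeholder stacks of saved feature vectors (rows = objects, columns = channels).
rng = np.random.default_rng(0)
groups = {
    "FG source": rng.normal(0.0, 1.0, (300, 1024)),
    "FG target": rng.normal(0.2, 1.0, (300, 1024)),
    "BG source": rng.normal(3.0, 1.0, (300, 1024)),
    "BG target": rng.normal(-3.0, 1.0, (300, 1024)),
}
X = np.vstack(list(groups.values()))
labels = np.repeat(list(groups.keys()), [len(v) for v in groups.values()])

# Hyperparameters are kept constant for every domain pairing to avoid visualization bias.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=0)
emb = reducer.fit_transform(X)

for name in groups:
    idx = labels == name
    plt.scatter(emb[idx, 0], emb[idx, 1], s=4, label=name)
plt.legend()
plt.show()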

IV-B MMD and MVD comparison

Given the qualitative analysis, we derive quantitative estimates of the difference in foreground and background feature distributions using MMD and MVD. Figures 8 and 9 show the MMD and MVD as violin plots across different layers and bins for different domains. We first split the objects into small objects in the image frame (400 to 6,000 sq. pixels) and medium-sized objects (6,000 to 52,000 sq. pixels). For Cityscapes train-val, the discrepancy is highest between foreground-to-background (FB) and background-to-foreground (BF), and lower between foreground-to-foreground (FF) and background-to-background (BB). No significant differences or discernible patterns were observed in the MVD.

Overall, for the Cityscapes-CARLA domains, bin 1 shows no significant difference while bin 9 shows significant differences across FF, FB, BF, and BB. As we go deeper (5th layer), the MMD of BB is higher than that observed in the shallower (2nd) layer. As expected, the variability in the overall metrics is larger in the second figure than in the first.

Figure 8: Source: Cityscapes train and target: Cityscapes val. A: MMD for bin 1 for small image size objects; B: MMD for bin 9 for small image size objects; C: MVD for bin 1 for small image size objects; D: MVD for bin 9 for small image size objects; E: MMD for bin 1 for medium image size objects; F: MMD for bin 9 for medium image size objects; G: MVD for bin 1 for medium image size objects; H: MVD for bin 9 for medium image size objects.
Figure 9: Source: Cityscapes and target: CARLA. A: MMD for bin 1 for small image size objects; B: MMD for bin 9 for small image size objects; C: MVD for bin 1 for small image size objects; D: MVD for bin 9 for small image size objects; E: MMD for bin 1 for medium image size objects; F: MMD for bin 9 for medium image size objects; G: MVD for bin 1 for medium image size objects; H: MVD for bin 9 for medium image size objects.

IV-C Observations

The Detectron2 model trained on the Cityscapes dataset achieves 53.72 mAP, with performance dropping to 41.06 mAP on the CARLA validation set and 37.77 mAP on the Cityscapes foggy beta 0.02 set. Despite the CARLA dataset being relatively easier, there is a significant drop of 47.56 mAP from the 88.62 mAP reported above, which implies that even when foreground features are well aligned, the difference in background associations contributes to the significant drop. This indicates that using a trained model without proper alignment of background context can lead to severe degradation in performance. The behavior of MMD across layers is consistent with CNNs learning finer, low-level features at shallower layers and progressively capturing more abstract, higher-level features at deeper layers. The minimal variance observed among features after embedding suggests that the CNN model captures backgrounds across various domains. The challenge of DAOD may stem from an insufficient understanding of the association between foreground and background elements.

IV-D Conclusion and limitations

Even though we demonstrate how the background and foreground features are aligned and separable, extracting the features separately and using them is computationally very expensive. Therefore, any contextual cue used in an object detection model needs to be computationally efficient to be useful. While CAM is a useful tool for obtaining attention maps, its effectiveness in analyzing activation maps for detection tasks on entire images is questionable. To address this concern, we used only true-positive detection cases and measured the correspondence between attention maps and the foreground regions of objects using instance segmentation labels; with models trained on Cityscapes and CARLA, this approach showed a complete match. Another limitation is that we need to examine additional class labels, such as pedestrians and bicyclists, to find the association between foreground and background. We also use only a single model, so the results may differ across models; for example, vision transformers, which are dominant models achieving better accuracy in multiple fields, could show different results. Finally, we leave methods to mitigate or leverage context bias during training as future work, because they are beyond the current scope.

ACKNOWLEDGMENT

Asma Almutairi contributed to this project by generating the CARLA dataset.

References

  • [1] Y. Chen, H. Wang, W. Li, C. Sakaridis, D. Dai, and L. Van Gool, “Scale-aware domain adaptive faster r-cnn,” International Journal of Computer Vision, vol. 129, no. 7, pp. 2223–2243, 2021.
  • [2] M. Chen, W. Chen, S. Yang, J. Song, X. Wang, L. Zhang, Y. Yan, D. Qi, Y. Zhuang, D. Xie, et al., “Learning domain adaptive object detection with probabilistic teacher,” arXiv preprint arXiv:2206.06293, 2022.
  • [3] J. Deng, W. Li, Y. Chen, and L. Duan, “Unbiased mean teacher for cross-domain object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4091–4101.
  • [4] L. Hoyer, D. Dai, H. Wang, and L. Van Gool, “Mic: Masked image consistency for context-enhanced domain adaptation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 11 721–11 732.
  • [5] Y.-J. Li, X. Dai, C.-Y. Ma, Y.-C. Liu, K. Chen, B. Wu, Z. He, K. Kitani, and P. Vajda, “Cross-domain adaptive teacher for object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7581–7590.
  • [6] P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga, R. L. Phillips, I. Gao, et al., “Wilds: A benchmark of in-the-wild distribution shifts,” in International conference on machine learning.   PMLR, 2021, pp. 5637–5664.
  • [7] T. Kalluri, W. Xu, and M. Chandraker, “Geonet: Benchmarking unsupervised adaptation across geographies,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 368–15 379.
  • [8] W. Li, J. Liu, B. Han, and Y. Yuan, “Adjustment and alignment for unbiased open set domain adaptation (supplementary material).”
  • [9] L. Zhu, T. Chen, J. Yin, S. See, and J. Liu, “Addressing background context bias in few-shot segmentation through iterative modulation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3370–3379.
  • [10] M. Dreyer, R. Achtibat, T. Wiegand, W. Samek, and S. Lapuschkin, “Revealing hidden context bias in segmentation and object detection through concept-specific explanations,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023, pp. 3829–3839.
  • [11] A. Oliva and A. Torralba, “The role of context in object recognition,” Trends in cognitive sciences, vol. 11, no. 12, pp. 520–527, 2007.
  • [12] P. Papale, A. Leo, L. Cecchetti, G. Handjaras, K. N. Kay, P. Pietrini, and E. Ricciardi, “Foreground-background segmentation revealed during natural image viewing,” eneuro, vol. 5, no. 3, 2018.
  • [13] B. Zhang, S. Hu, T. Zhang, M. Hai, Y. Wang, Y. Li, and Y. Wang, “Different patterns of foreground and background processing contribute to texture segregation in humans: an electrophysiological study,” PeerJ, vol. 11, p. e16139, 2023.
  • [14] J. Poort, M. W. Self, B. Van Vugt, H. Malkki, and P. R. Roelfsema, “Texture segregation causes early figure enhancement and later ground suppression in areas v1 and v4 of visual cortex,” Cerebral cortex, vol. 26, no. 10, pp. 3964–3976, 2016.
  • [15] L. Huang, L. Wang, W. Shen, M. Li, S. Wang, X. Wang, L. G. Ungerleider, and X. Zhang, “A source for awareness-dependent figure–ground segregation in human prefrontal cortex,” Proceedings of the National Academy of Sciences, vol. 117, no. 48, pp. 30 836–30 847, 2020.
  • [16] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.
  • [17] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, “Detectron2,” https://github.com/facebookresearch/detectron2, 2019.
  • [18] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020.
  • [19] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla: An open urban driving simulator,” in Conference on robot learning.   PMLR, 2017, pp. 1–16.
  • [20] K. Xiao, L. Engstrom, A. Ilyas, and A. Madry, “Noise or signal: The role of image backgrounds in object recognition,” arXiv preprint arXiv:2006.09994, 2020.
  • [21] J. Liang, Y. Liu, and V. Vlassov, “The impact of background removal on performance of neural networks for fashion image classification and segmentation,” in 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE).   IEEE, 2023, pp. 1960–1968.
  • [22] J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid, “Local features and kernels for classification of texture and object categories: A comprehensive study,” International journal of computer vision, vol. 73, pp. 213–238, 2007.
  • [23] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’: Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 1135–1144.
  • [24] Z. Zhu, L. Xie, and A. L. Yuille, “Object recognition with and without objects,” arXiv preprint arXiv:1611.06596, 2016.
  • [25] A. Rosenfeld, R. Zemel, and J. K. Tsotsos, “The elephant in the room,” arXiv preprint arXiv:1808.03305, 2018.
  • [26] A. Barbu, D. Mayo, J. Alverio, W. Luo, C. Wang, D. Gutfreund, J. Tenenbaum, and B. Katz, “Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models,” Advances in neural information processing systems, vol. 32, 2019.
  • [27] S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang, “Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization,” arXiv preprint arXiv:1911.08731, 2019.
  • [28] A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in CVPR 2011.   IEEE, 2011, pp. 1521–1528.
  • [29] A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba, “Undoing the damage of dataset bias,” in Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part I 12.   Springer, 2012, pp. 158–171.
  • [30] M. J. Choi, A. Torralba, and A. S. Willsky, “Context models and out-of-context objects,” Pattern Recognition Letters, vol. 33, no. 7, pp. 853–862, 2012.
  • [31] R. Shetty, B. Schiele, and M. Fritz, “Not using the car to see the sidewalk–quantifying and controlling the effects of context in classification and segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8218–8226.
  • [32] A. Torralba, “Contextual priming for object detection,” International journal of computer vision, vol. 53, pp. 169–191, 2003.
  • [33] M. Jia, M. Shi, M. Sirotenko, Y. Cui, C. Cardie, B. Hariharan, H. Adam, and S. Belongie, “Fashionpedia: Ontology, segmentation, and an attribute localization dataset,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16.   Springer, 2020, pp. 316–332.
  • [34] M. Takagi, E. Simo-Serra, S. Iizuka, and H. Ishikawa, “What makes a style: Experimental analysis of fashion prediction,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 2247–2253.
  • [35] W. Zhai, P. Wu, K. Zhu, Y. Cao, F. Wu, and Z.-J. Zha, “Background activation suppression for weakly supervised object localization and semantic segmentation,” International Journal of Computer Vision, vol. 132, no. 3, pp. 750–775, 2024.
  • [36] P. Wu, W. Zhai, and Y. Cao, “Background activation suppression for weakly supervised object localization,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE, 2022, pp. 14 228–14 237.
  • [37] Z. He and L. Zhang, “Multi-adversarial faster-rcnn for unrestricted object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6668–6677.
  • [38] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. March, and V. Lempitsky, “Domain-adversarial training of neural networks,” Journal of machine learning research, vol. 17, no. 59, pp. 1–35, 2016.
  • [39] X. Zhu, J. Pang, C. Yang, J. Shi, and D. Lin, “Adapting object detectors via selective cross-domain alignment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 687–696.
  • [40] H.-K. Hsu, C.-H. Yao, Y.-H. Tsai, W.-C. Hung, H.-Y. Tseng, M. Singh, and M.-H. Yang, “Progressive domain adaptation for object detection,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2020, pp. 749–757.
  • [41] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.
  • [42] Q. Zhou, Q. Gu, J. Pang, X. Lu, and L. Ma, “Self-adversarial disentangling for specific domain adaptation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 7, pp. 8954–8968, 2023.
  • [43] K. Gong, S. Li, S. Li, R. Zhang, C. H. Liu, and Q. Chen, “Improving transferability for domain adaptive detection transformers,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 1543–1551.
  • [44] S. Zhao, B. Li, P. Xu, and K. Keutzer, “Multi-source domain adaptation in the deep learning era: A systematic survey,” arXiv preprint arXiv:2002.12169, 2020.
  • [45] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9650–9660.
  • [46] M. Pham, M. Cho, A. Joshi, and C. Hegde, “Revisiting self-distillation,” arXiv preprint arXiv:2206.08491, 2022.
  • [47] Q. Cai, Y. Pan, C.-W. Ngo, X. Tian, L. Duan, and T. Yao, “Exploring object relation in mean teacher for cross-domain detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 457–11 466.
  • [48] S. Cao, D. Joshi, L.-Y. Gui, and Y.-X. Wang, “Contrastive mean teacher for domain adaptive object detectors,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 23 839–23 848.
  • [49] Z. Zhao, S. Wei, Q. Chen, D. Li, Y. Yang, Y. Peng, and Y. Liu, “Masked retraining teacher-student framework for domain adaptive object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19 039–19 049.
  • [50] J. Kay, T. Haucke, S. Stathatos, S. Deng, E. Young, P. Perona, S. Beery, and G. Van Horn, “Align and distill: Unifying and improving domain adaptive object detection,” arXiv preprint arXiv:2403.12029, 2024.
  • [51] L. McInnes, J. Healy, and J. Melville, “Umap: Uniform manifold approximation and projection for dimension reduction,” arXiv preprint arXiv:1802.03426, 2018.
  • [52] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, 2008.
  • [53] L. Taylor and G. Nitschke, “Improving deep learning with generic data augmentation,” in 2018 IEEE symposium series on computational intelligence (SSCI).   IEEE, 2018, pp. 1542–1547.
  • [54] H. Phan, P. Koch, L. Hertel, M. Maass, R. Mazur, and A. Mertins, “Cnn-lte: a class of 1-x pooling convolutional neural networks on label tree embeddings for audio scene classification,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2017, pp. 136–140.
  • [55] D. Omeiza, S. Speakman, C. Cintas, and K. Weldermariam, “Smooth grad-cam++: An enhanced inference level visualization technique for deep convolutional neural network models,” arXiv preprint arXiv:1908.01224, 2019.
  • [56] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 723–773, 2012.
  • [57] N. Makigusa, “Two-sample test based on maximum variance discrepancy,” Communications in Statistics-Theory and Methods, pp. 1–18, 2023.
  • [58] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.