
KAIST: {sb020518, jchoo}@kaist.ac.kr · Lunit Inc.: {junha.kim, taesoo.kim, ghnam, tkooi}@lunit.io

Is user feedback always informative?
Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data

Abstract

This paper aims to adapt the source model to the target environment, leveraging a small amount of user feedback (i.e., labeled target data) readily available in real-world applications. We find that existing semi-supervised domain adaptation (SemiSDA) methods often achieve only marginal performance gains when directly utilizing such feedback data, as shown in Figure 1. We analyze this phenomenon via a novel concept called Negatively Biased Feedback (NBF), which stems from the observation that user feedback is more likely for data points where the model produces incorrect predictions. To leverage this feedback while avoiding the issue, we propose a scalable adapting approach, Retrieval Latent Defending. This approach helps existing SemiSDA methods adapt the model with a balanced supervised signal by utilizing latent defending samples throughout the adaptation process. We demonstrate the problem caused by NBF and the efficacy of our approach across various benchmarks, including image classification, semantic segmentation, and a real-world medical imaging application. Our extensive experiments reveal that integrating our approach with multiple state-of-the-art SemiSDA methods leads to significant performance improvements.

Keywords:
Rethinking user-provided feedback · Semi-supervised & source-free domain adaptation · Medical image diagnosis

1 Introduction

While deep neural networks have demonstrated remarkable performance in the development domain (i.e., source domain) [23, 15], they often suffer from performance degradation in the deployed domain (i.e., target domain) due to domain shift [17, 78, 72]. To mitigate this issue, domain adaptation (DA) techniques have been introduced [70, 34, 58]. The most common DA tasks include semi-supervised domain adaptation (SemiSDA) and source-free domain adaptation (SFDA). SemiSDA aims to adapt the model given a small amount of labeled target data along with massive unlabeled target data [58, 6, 99, 66]. SFDA conducts adaptation with target data only, without accessing source data, in consideration of data privacy or memory constraints on edge devices [34, 92, 67].

Despite such advances in DA, adapting the model with user feedback still remains an open area for further research, even though practical machine learning (ML) products often allow users to provide feedback in order to further improve the model in the target environment. For example, facial recognition or medical image diagnosis applications enable users to give feedback correcting wrong model predictions, as depicted in Figure 1 (a). Since feedback can be modeled in this case as a small amount of labeled target data, it is anticipated that previous SemiSDA methods assuming the same setup would yield promising results. However, we observe that they show inferior adaptation performance on multiple DA benchmarks when using such user feedback in practice, as shown by the dark-gray bars in Figure 1 (b).

Figure 1: (a) User feedback. Users can provide feedback while interacting with an ML product, where feedback is likely to be biased towards misclassified samples, which we define as Negatively Biased Feedback (NBF). (b) Adaptation results. We adapt the source model with small user feedback and large unlabeled target data using previous semi-supervised domain adaptation (SemiSDA) algorithms. Compared to random feedback, which is the classical SemiSDA setup where labeled data is a random subset of target data, model adaptation with NBF leads to subpar performance. This paper analyzes this problem and introduces a scalable solution.

We introduce a novel concept called Negatively Biased Feedback (NBF) to explain this phenomenon. NBF is based on the observation that user feedback is more likely to be derived from incorrect model predictions. For example, a radiologist might log a chest X-ray misdiagnosed by the model, as its accuracy directly impacts the patient's survival. Interestingly, our observation aligns with findings from the cognitive psychology literature [3, 57] showing that humans are more likely to react and provide feedback in response to negative events (i.e., wrong model predictions). Since such an NBF scenario is feasible, we analyze its unexpected impact on SemiSDA observed above. We identify that the biased distribution of NBF within the overall data distribution leads to sub-optimal adaptation results, particularly compared to Random Feedback (RF). RF represents the classical SemiSDA setup, where labeled data is randomly selected from the target data.

To address the problem caused by NBF, we present a scalable approach named Retrieval Latent Defending, which can be seamlessly integrated with existing SemiSDA methods. Our approach allows them to adapt the model without a strong dependence on the biasedly distributed labeled data. Specifically, we balance the supervised adapting signal by appending latent defending samples to the mini-batch, helping the model maintain balanced class discriminability throughout the adapting iterations. We evaluate the unexpected influence of NBF using various benchmarks, including image classification, semantic segmentation, and medical image diagnosis. Building upon these evaluations, we demonstrate that our approach not only complements but significantly enhances the performance of multiple SemiSDA methods.

Figure 2: Adaptation with user feedback can be effective in alleviating performance degradation caused by domain shift. However, there are some challenges: (i) user feedback may be a biased sampling of the true target distribution due to the nature of feedback, (ii) the amount of ground-truth (GT) labels obtained through feedback is small, and (iii) only unlabeled target data is typically available, not source data.

The contributions of the paper are as follows:

  • We introduce the novel concept called Negatively Biased Feedback and uncover that it can lead to sub-optimal adaptation performance of existing SemiSDA methods.

  • We analyze this problem and present a scalable solution, Retrieval Latent Defending, that combines with SemiSDA methods and allows them to avoid the unexpected effect of NBF.

  • We show that our approach generalizes through diverse DA benchmarks and improves adaptation results of state-of-the-art SemiSDA methods.

  • We publicly release the code at https://github.com/junha1125/RLD-SemiSDA.

2 Related Work

Adaptation in the deployment environment.

Real-world ML products often encounter performance degradation caused by gaps between the source and target environment [17]. One solution is to adapt the model using unlabeled data observed in the target domain, referred to as unsupervised domain adaptation (UDA) [72, 59, 37]. Works on UDA use both source and target data to improve target performance, via methods such as domain discrepancy minimization with adversarial training [41, 70, 76, 18, 73, 72, 59] and self-training with pseudo labels [45, 98, 51, 97]. Source-free DA (SFDA) builds on UDA and imposes the additional constraint that the source data cannot be accessed during domain adaptation. This has practical implications for addressing data privacy concerns or barriers to data transmission to edge devices [34, 38, 77, 95]. The majority of recent SFDA works rely on strategies like domain clustering [34], nearest neighbors [92, 93, 91], and contrastive learning [8, 35, 101]. Nevertheless, SFDA does not consider the availability of small labeled data, which may be available in practical ML systems. Semi-supervised DA (SemiSDA) works mainly demonstrate that permitting small labeled data in the target domain can substantially enhance adaptation performance compared to traditional UDA [58]. Their primary strategies are domain alignment [58, 33, 20, 94], multi-view consistency [33, 6, 2, 89], and asymmetric co-training [36, 90].

Active domain adaptation

(ActiveDA) [55, 86, 25] envisions a scenario in which the machine selects specific target samples and instructs annotators to label them. The primary objective of ActiveDA is to strategically identify and select the most informative samples for annotation. These chosen samples (i.e., labeled target data) are subsequently utilized to update the source model using SemiSDA methods [58, 33], and the effectiveness of ActiveDA is assessed by evaluating the target performance of the adapted model.

Semi-supervised learning

(SemiSL) aims to reduce expensive human annotations, and propose methods to train a model from scratch using massive unlabeled data along with limited amounts of labeled data [74, 43]. The majority of SemiSL methods depend on consistency regularization [60, 66, 87, 5, 4, 16], which helps the model to make similar predictions for augmented versions of the same image. Moreover, adaptive thresholding [66, 81, 24, 88, 12, 9, 99] is also popularly utilized to produce reliable pseudo labels from unlabeled data.

SemiSDA and SemiSL setups mimic small labeled datasets by randomly selecting subsets of the target dataset, whereas ActiveDA involves selections instructed by the machine. In contrast, this paper posits that in real-world applications, labeled data is typically acquired through user intervention. Additionally, users often provide feedback on samples misclassified by the model (i.e., negatively biased feedback), a process detailed in the following section.

             UDA   SFDA   ActiveDA             SemiSDA             SemiSL              Our setup
Adaptation   ○     ○      ○                    ○                   ×                   ○
Source-free  ×     ○      ×                    ×                   -                   ○
Feedback     ×     ×      machine-instructed   randomly selected   randomly selected   user-provided

The table above summarizes the comparison of relevant studies to our setup. In the table, adaptation means fine-tuning the source pre-trained model (as opposed to training from scratch); feedback represents a small number of labeled target samples. Appendix A provides further comparisons with settings like class-imbalanced SemiSDA and test-time adaptation (TTA).

3 Negatively Biased Feedback

3.1 Adaptation with user feedback.

Our adaptation setup is illustrated in Figure 2. A model is pre-trained on the source data $D_s$. Next, the model is deployed to the target domain, such as a smartphone or a hospital, where we assume the transfer of $D_s$ is prohibited due to data privacy regulations or resource constraints (the same setup as SFDA [34]). While users utilize ML products in the target domain, the model provides prediction results for data observed in the target domain $D_t$ and occasionally obtains user feedback in the form of annotations $y$. We represent the target data as $D_t = X^{lb}_t \cup X^{ulb}_t$, where $X^{lb}_t = \{(x^n_{lb}, y^n_{lb}) : n \in [1..\,N_{lb}]\}$ and $X^{ulb}_t = \{x^n_{ulb} : n \in [1..\,N_{ulb}]\}$; here $x_{lb}$ and $x_{ulb}$ denote labeled and unlabeled data, and $N_{lb}$ and $N_{ulb}$ are their respective sample counts. Lastly, the model can utilize $D_t$ and SemiSDA algorithms for adaptation during its inactive phase (e.g., when users do not use the product, such as at nighttime) in order to alleviate performance degradation due to domain shift or to personalize the model based on user feedback.

Figure 3: Effect of negatively biased feedback. Our novel observations are that (a) user-provided feedback in practice has a biased distribution in each class cluster (bottom center sub-figure), in contrast to random feedback; (b) existing SemiSDA methods adapt the model in a way dominated by the labeled data points (right sub-figures) even though they are biasedly positioned; and (c) NBF prevents the model from forming a decision boundary around the true class clusters and leads to inferior adaptation performance (bottom right sub-figure).

Rethinking user-provided feedback.

Classical SemiSDA works simply assume that a random subset of the target data $D_t$ is labeled by users when building $X^{lb}_t$. However, as illustrated in Figure 2(i), we suggest that users are more likely to provide feedback on samples misclassified by the source model, which we name negatively biased feedback (NBF). This behavior can be understood from two perspectives: (a) users generally expect their feedback to serve as a basis for model improvement, motivating them to provide NBF, and (b) humans tend to react more strongly to negative experiences, such as receiving incorrect predictions, as observed in psychological studies [3, 57]. We note that the NBF assumption holds even more strongly for medical applications: it is reasonable to imagine that the user (i.e., a radiologist) logs the mistakes of the model while diagnosing a chest X-ray exam, because the diagnostic accuracy of the model is directly related to the patient's chances of survival. Furthermore, applications beyond the medical domain can also exhibit NBF. For instance, users of self-driving cars can report errors, such as object detection failures or navigation mistakes, to enhance the car's driving capabilities.

3.2 Influence of NBF on SemiSDA

Simulation study.

As shown in Figure 3, we conduct a simulation study to understand the effect of NBF on SemiSDA. We first use the blobs dataset [53] and construct the source and target data so that domain shift exists between them (left sub-figures). We pre-train a source model on the source data and compute the accuracy in the target domain, where a performance drop due to domain shift is observed (98.5% → 76.4%). Next, we simulate two types of feedback (i.e., labeled data): random feedback and negatively biased feedback, following the previous SemiSDA setup and our setup, respectively. Specifically, NBF is randomly selected among the samples misclassified by the source model. We find that random feedback (RF) points are evenly distributed, while NBF points are biasedly positioned within each class cluster (refer to the blue points in the dashed circles in the center sub-figures).

To alleviate the performance drop caused by domain shift, we adapt the model using the target data and a semi-supervised method, Pseudo-labeling [1]. This method iteratively optimizes the model with a cross-entropy loss computed from the ground truth of labeled data and the pseudo labels of unlabeled data in a mini-batch (pseudo labels are predicted by the current adapting model, so they change as the decision boundary updates; refer to Appendix B for details). The SemiSDA results are shown in the right sub-figures, where we make two interesting observations: (i) the distribution of labeled data can contribute significantly to the decision boundary of the adapted model (red arrows in the figure), and (ii) the model adapted under NBF improves less than the one adapted under RF (76.4% → 88.1% with NBF, but 76.4% → 91.7% with RF).
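
To make the simulation concrete, the following is a minimal sketch of how such a study can be set up. The shifted blob centers, the logistic-regression stand-in for the source model, and the helper names are our assumptions for illustration, so the exact numbers will differ from those reported above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Source and target blobs; target centers are shifted to mimic domain shift.
Xs, ys = make_blobs(n_samples=600, centers=[[0, 0], [4, 4], [0, 5]], random_state=0)
Xt, yt = make_blobs(n_samples=600, centers=[[1, -1], [5, 3], [1, 4]], random_state=1)

source_model = LogisticRegression().fit(Xs, ys)
pred_t = source_model.predict(Xt)
print("target accuracy before adaptation:", (pred_t == yt).mean())

def sample_feedback(kind, n_per_class=3):
    """RF: random target samples; NBF: random samples among misclassified ones."""
    idx = []
    for c in np.unique(yt):
        mask = (yt == c)
        if kind == "NBF":
            mask &= (pred_t != yt)  # restrict the pool to the model's mistakes
        idx.extend(rng.choice(np.where(mask)[0], size=n_per_class, replace=False))
    return np.array(idx)

rf_idx, nbf_idx = sample_feedback("RF"), sample_feedback("NBF")
```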

Unexpected influence of NBF.

Intuition suggests that NBF provides more information than RF, since it corrects more of the source model's deficiencies, and should thus lead to better adaptation performance. However, we empirically show that NBF can result in inferior adaptation performance due to its biased distribution within each class cluster, as illustrated in Figure 3. Surprisingly, we also show that this problem persists even with other state-of-the-art SemiSDA methods and large datasets across various DA benchmarks, including image classification, semantic segmentation, and medical image diagnosis. Our work highlights the importance of careful design when using user feedback in real-world scenarios and, to the best of our knowledge, is the first study to uncover and analyze this phenomenon.

4 Approach

4.1 Prerequisite: Previous SemiSDA method

Previous SemiSDA and SemiSL works typically construct a mini-batch with labeled data $\{(x^b_{lb}, y^b_{lb}) : b \in [1..\,B]\}$ and unlabeled data whose size is $\mu$ times larger than the labeled one, $\{x^b_{ulb} : b \in [1..\,\mu{\cdot}B]\}$, where $B$ is the mini-batch size for labeled data. To adapt the model iteratively, they compute the cross-entropy loss $\mathcal{H}(\cdot,\cdot)$ on labeled data and a consistency regularization on multiple views of unlabeled data, formulated as follows:

$$\mathcal{L}_{sup}=\frac{1}{B}\sum^{B}_{b=1}\mathcal{H}(y_{lb}^{b},f_{\theta}(x_{lb}^{b})),\qquad\mathcal{L}_{unsup}=\frac{1}{\mu\cdot B}\sum_{b=1}^{\mu\cdot B}\mathcal{H}(\hat{y}_{ulb}^{b},f_{\theta}(\Omega(x_{ulb}^{b}))),\tag{1}$$

where $f_{\theta}(\cdot)$ is the output probability of the model, $\hat{y}_{ulb}$ denotes a pseudo label obtained from $f_{\theta}(\omega(x_{ulb}))$, and $\omega(\cdot)$ and $\Omega(\cdot)$ represent weak and strong image augmentation, respectively. While sharing this core framework, each SemiSDA method employs distinct adapting strategies, especially to enhance the effective use of unlabeled data rather than labeled data [6, 96, 81].
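
As a concrete reference, below is a minimal PyTorch sketch of the two losses in Eq. (1). The FixMatch-style confidence mask, the `weak_aug`/`strong_aug` callables, and the threshold value are our assumptions, standing in for each baseline's own unlabeled-data strategy.

```python
import torch
import torch.nn.functional as F

def semisda_losses(model, x_lb, y_lb, x_ulb, weak_aug, strong_aug, tau=0.95):
    # Supervised term: cross-entropy on the labeled mini-batch.
    sup_loss = F.cross_entropy(model(x_lb), y_lb)

    # Pseudo labels y_hat come from the weakly augmented view (no gradient).
    with torch.no_grad():
        probs = torch.softmax(model(weak_aug(x_ulb)), dim=1)
        conf, y_hat = probs.max(dim=1)
        mask = (conf >= tau).float()  # keep only confident pseudo labels

    # Consistency term: the strong view must match the weak-view pseudo label.
    unsup_loss = (F.cross_entropy(model(strong_aug(x_ulb)), y_hat,
                                  reduction="none") * mask).mean()
    return sup_loss, unsup_loss
```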

Problem of previous works.

Since previous SemiSDA methods have overlooked the unexpected impact of NBF, they often suffer from sub-optimal performance under the NBF assumption (shown in Section 5). To address this problem, we focus on developing a scalable solution that (i) can easily combine with existing DA methods without modifying their core adapting strategies and (ii) can be applied to a wide range of benchmarks, including medical image diagnosis.

Figure 4: Even though labeled data $(x_{lb}, y_{lb})$ is biasedly positioned, the model needs to be adapted with balanced class discriminability (i.e., decision boundary). (i) However, previous SemiSDA methods have overlooked this fact and used the labeled data naively by applying a cross-entropy loss, leading to inadequate adaptation performance. (ii) To alleviate this problem, we propose a scalable adapting approach, retrieval latent defending, which allows the model to adjust the balance of a mini-batch at each iteration by using latent defending samples $x_{LD}$ together with labeled data $x_{lb}$.

4.2 Retrieval Latent Defending

Based on the observations in Figure 3, we illustrate the unintended effect of NBF when using an existing SemiSDA method in Figure 4 (top center): NBF is likely to exhibit a biased distribution, leading to undesirable adaptation results. To alleviate this issue, we propose Retrieval Latent Defending, as depicted in Figure 4 (bottom). (1) Prior to each epoch, we generate a candidate bank of data points, denoted as $x_{LD}$. (2)–(4) For each adapting iteration, we balance the mini-batch by retrieving latent defending samples $x_{LD}$ from the bank. (5)–(6) The model is then adapted using the reconfigured mini-batch, following the baseline SemiSDA approach. We hypothesize that the latent space progressively created by the $x_{LD}$ candidates throughout the adaptation process (bold dashed circles in Figure 4, top right) mitigates the issue caused by NBF, thereby allowing the SemiSDA baseline to adapt robustly against NBF.

Candidate bank generation.

The candidate bank serves as a repository of pseudo labels $\hat{Y}^{ulb}_t$ for a subset of the target unlabeled data $X^{ulb}_t$. Before each epoch, we freeze the model and use it to generate pseudo labels $\hat{Y}^{ulb}_t = \{\hat{y}^n_{ulb} : n \in [1..\,N_{ulb}]\}$, where $\hat{y}^n_{ulb}$ is assigned to $x^n_{ulb}$ as the predicted class with the highest softmax probability: $\hat{y}^n_{ulb} = \operatorname{argmax}_c [f_{\theta}(x^n_{ulb})]_c$. We then retain only the samples with the top $p$% highest probabilities within each class. This filtering step mitigates the inclusion of data with potentially inaccurate pseudo labels, as the model's predictions on $X^{ulb}_t$ might not always be correct.
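
A minimal sketch of this step is given below; the loader yielding (index, image) pairs and the helper names are our assumptions, while the per-class top-$p$% filtering follows the description above.

```python
import torch

@torch.no_grad()
def build_candidate_bank(model, unlabeled_loader, num_classes, p=0.4):
    model.eval()
    probs_all, idx_all = [], []
    for idx, x in unlabeled_loader:          # yields (sample indices, image batch)
        probs_all.append(torch.softmax(model(x), dim=1))
        idx_all.append(idx)
    probs, indices = torch.cat(probs_all), torch.cat(idx_all)
    conf, y_hat = probs.max(dim=1)           # pseudo label = argmax class

    bank = {}
    for c in range(num_classes):
        in_c = (y_hat == c).nonzero(as_tuple=True)[0]
        keep = int(p * len(in_c))            # retain the top-p% most confident
        top = in_c[conf[in_c].argsort(descending=True)[:keep]]
        bank[c] = indices[top].tolist()      # class -> unlabeled-sample indices
    return bank
```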

Defending sample selection.

We select $k$ latent defending samples $x_{LD}$ from the bank at random for each labeled data point $(x^b_{lb}, y^b_{lb})$. These selected samples share the same pseudo label as the ground-truth label of their associated counterpart (i.e., $\hat{y}_{LD} = y^b_{lb}$). By incorporating these defending samples, we balance the data distribution within the current mini-batch. For example, consider $x^1_{lb}$ and $x^2_{lb}$ in Figure 4 (top right). As these labeled samples are included in the current mini-batch alongside the selected defending samples $x^1_{LD}$ and $x^2_{LD}$, we expect to prevent the supervised adapting signal from becoming overly dependent on the labeled samples. We depict the latent space formed gradually by the $x_{LD}$ candidates throughout the adaptation process as bold dashed circles in Figure 4 (top right); a sketch of one full adapting iteration follows Eq. (2) below.

Consequently, the overall loss is the sum of the losses in Eq. (1) and the loss from our proposed method:

$$\mathcal{L}_{total}=\underbrace{\mathcal{L}_{sup}+\mathcal{L}_{unsup}}_{\text{baseline}}+\underbrace{\frac{1}{k\cdot B}\sum_{b=1}^{k\cdot B}\mathcal{H}(\hat{y}_{LD}^{b},f_{\theta}(x_{LD}^{b}))}_{\text{retrieval latent defending}}.\tag{2}$$
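
Putting the pieces together, here is a minimal sketch of one adapting iteration with retrieval latent defending ($k = 3$ as in Section 5.1). It reuses the `semisda_losses` and `build_candidate_bank` sketches above; the dataset indexing convention and the assumption that each bank entry holds at least $k$ candidates are ours.

```python
import random
import torch
import torch.nn.functional as F

def rld_step(model, x_lb, y_lb, x_ulb, bank, dataset, weak_aug, strong_aug, k=3):
    sup_loss, unsup_loss = semisda_losses(model, x_lb, y_lb, x_ulb,
                                          weak_aug, strong_aug)

    # For each labeled sample, retrieve k defending samples whose pseudo label
    # matches its ground-truth label, and append them to the mini-batch.
    ld_idx = [i for y in y_lb.tolist() for i in random.sample(bank[y], k)]
    x_ld = torch.stack([dataset[i] for i in ld_idx])  # dataset[i] -> image tensor
    y_ld = y_lb.repeat_interleave(k)                  # defending labels = matched GT

    ld_loss = F.cross_entropy(model(x_ld), y_ld)      # third term of Eq. (2)
    return sup_loss + unsup_loss + ld_loss            # L_total
```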

Importance of our method.

Understanding the impact of NBF on adaptation performance is crucial. For example, naively adapting a model for a medical application using radiologist-provided feedback can actually cause performance degradation (shown in Table 5), potentially posing significant risks to patients. We propose a scalable and simple approach to solve the problem caused by NBF, which cannot be addressed by existing methods. Given the practicality of the NBF problem and the scalability of our solution, we believe our work holds considerable potential for real-world applications.

5 Experiments

5.1 Experimental Setups

Our approach is simple enough to seamlessly combine with existing SemiSDA algorithms and also be applied to diverse benchmarks. This section describes our experimental setup for natural image classification tasks and a real-world medical application. Details for semantic segmentation experiments are in Appendix D.

Baselines. We validate our approach by combining it with various state-of-the-art algorithms for SemiSDA [58] (e.g., CDAC [33] and AdaMatch [6]) and SemiSL [66, 87] (e.g., FlexMatch [96] and FreeMatch [81]). Note that SemiSL methods have been demonstrated to be strong SemiSDA learners [99], so we can consider them SemiSDA methods as well. For medical experiments, we use Pseudo-labeling [1] as a baseline, since it is easily applicable to medical image adaptation.

Datasets. We utilize natural image datasets containing multiple kinds of domains (e.g., real and painting): DomainNet-126 [54, 58] with 142k images of 126 classes and OfficeHome [75] with 15k images of 65 classes.

To conduct medical experiments, we present a practical medical setting. We adopt the MIMIC-CXR-V2 dataset [27] and assume a multi-finding binary classification setup, where multiple radiographic findings, such as Pneumonia and Atelectasis, can coexist in a single chest X-ray (CXR) sample. Thus, the model predicts the presence or absence (binary classes) of each individual finding. We simulate domain shift by using Posterior-Anterior (PA)-view data as the source and Anterior-Posterior (AP)-view data as the target, capturing real-world variations in data acquisition. Typically, patients requiring an AP X-ray are those facing positioning challenges that prevent them from undergoing a PA X-ray. Therefore, this setup can be seen as a scenario where the target environment is the intensive care unit, which hospitalizes critically ill patients.

Following the recent SemiSDA [94] and SFDA [8] setups, we assume the model is pre-trained in the source domain and deployed in the target domain. Since the datasets above were not initially divided into training and test sets, we performed a random 8:2 split within each domain, designating them respectively for training and testing. The training set is used to adapt the model, while the test set is used to report the top-1 accuracy.

method           feedback   average       r→c    r→p    p→c    c→s    s→p    r→s    p→r
ResNet-50
  AdaMatch [6]   RF         67.6          66.6   68.5   68.5   60.3   69.2   58.7   81.5
  AdaMatch [6]   NBF        64.5 (-3.1)   64.3   66.1   65.6   56.9   65.6   54.2   78.9
  w/ ours        NBF        72.0 (+7.5)   74.5   72.7   73.9   65.5   70.0   64.3   83.2
ViT-S
  AdaMatch [6]   RF         74.7          75.3   76.9   73.8   68.0   76.3   67.1   85.5
  AdaMatch [6]   NBF        73.7 (-1.0)   74.7   76.2   74.7   65.7   74.0   66.8   84.0
  w/ ours        NBF        75.9 (+2.2)   76.9   77.8   77.8   68.5   76.6   68.3   85.1
Table 1: Adaptation results on DomainNet-126. We simulate seven domain-shift scenarios (i.e., source → target). The model is pre-trained on the source domain and then adapted to a training set of the target domain. The results on the test set of the target domain are reported as the top-1 accuracy (%). DomainNet-126 [54, 58] dataset includes real, painting, sketch, and clip-art domains. In this experiment, we assume that the 378 feedback samples (i.e., 3 labeled data per class) are obtained from users. A state-of-the-art SemiSDA method, AdaMatch [6], is used as a baseline.
method         feedback   average       a→c    a→p    a→r    c→a    c→p    c→r    p→a    p→c    p→r    r→a    r→c    r→p
AdaMatch [6]   RF         70.9          55.4   80.4   75.9   65.7   81.5   74.6   65.9   58.7   78.4   68.8   61.5   84.3
AdaMatch [6]   NBF        69.3 (-1.6)   54.2   76.6   75.3   65.9   79.3   75.5   63.7   57.4   75.9   66.7   56.8   84.2
w/ ours        NBF        73.8 (+4.5)   62.2   81.0   79.7   68.8   85.4   78.6   67.7   61.7   79.5   69.0   64.1   88.2
Table 2: Adaptation results on OfficeHome. The OfficeHome [75] dataset includes real, product, art, and clip-art domains. We assume that 195 feedback samples (i.e., 3 labeled data per class) are obtained. AdaMatch [6] and ResNet-50 [23] are used.

User feedback. Feedback given by users is modeled as annotations $\{y^n_{lb} : n \in [1..\,N_{lb}]\}$ on a small subset of the target's training set $D^{train}_t$, while the remaining samples are used as unlabeled target data. In our experiments, we consider two types of feedback: random feedback (RF) and negatively biased feedback (NBF). RF follows the classical SemiSDA and SemiSL setup, where randomly selected samples from $D^{train}_t$ form the small labeled set $X^{lb}_t$. For NBF, we randomly select samples in $D^{train}_t$ that are incorrectly predicted by the source model (i.e., the pre-trained model before adaptation). Note that we focus on the impact of a biased label distribution within each class, as shown in Figure 3, and thus take the same number of feedback samples per class. Further discussion of class-imbalanced feedback, as studied in [83, 49, 31], is provided in Appendix A.2.

Network architectures. We adopt commonly used networks: ResNet [23] and ViT [15] for the natural image tasks and DenseNet [26] for the medical task. We employ ResNet-50 with a last classification layer comprising a weight normalization layer and a bottleneck layer, following previous works [34, 8], and use the ViT-Small (i.e., ViT-S) introduced in [80]. For the medical task, we use the DenseNet-121 provided in TorchXrayVision [13], as in existing medical works [32, 44].

Implementation details. We implement our framework by extending the publicly available USB [80] repository. Both pre-training and adaptation are conducted with a mini-batch size of 128 and the SGD optimizer. Diverse baselines for SemiSDA and SemiSL are used to compute the losses in Eq. (1). The hyper-parameters for each baseline simply follow USB [80] or public code [58, 33]. For all experiments, our approach uses the same hyper-parameters: the number of appended defending samples $k = 3$ and the reliable filtering rate $p = 0.4$.

feed. amount           378 (3 labeled data per class)          630 (5 labeled data per class)
method                 RF     NBF           w/ ours            RF     NBF           w/ ours
ResNet-50 [23]
  Source model         56.5
  MME [58]             69.5   68.4 (-1.1)   70.8 (+2.4)        71.2   70.1 (-1.1)   72.5 (+2.4)
  CDAC [33]            68.3   64.6 (-3.7)   73.2 (+8.6)        71.7   68.1 (-3.6)   74.9 (+6.8)
  AdaMatch [6]         67.6   64.5 (-3.1)   72.0 (+7.5)        70.9   67.7 (-3.2)   74.3 (+6.6)
  FixMatch [66]        67.6   63.4 (-4.2)   73.2 (+9.8)        71.5   66.1 (-5.4)   75.1 (+9.0)
  UDA [87]             69.2   64.9 (-4.3)   73.4 (+8.5)        72.9   68.8 (-4.1)   75.3 (+6.5)
  FlexMatch [96]       73.3   71.4 (-1.9)   74.7 (+3.3)        75.3   73.9 (-1.4)   76.0 (+2.1)
  FreeMatch [81]       73.8   72.0 (-1.8)   74.8 (+2.8)        75.6   74.4 (-1.2)   76.1 (+1.7)
  Fully supervised     83.6
ViT-S [15]
  Source model         64.5
  MME [58]             73.2   72.7 (-0.5)   74.1 (+1.4)        74.5   74.0 (-0.5)   75.2 (+1.2)
  CDAC [33]            74.2   72.8 (-1.4)   75.4 (+2.6)        75.4   74.1 (-1.3)   76.2 (+2.1)
  AdaMatch [6]         74.7   73.7 (-1.0)   75.9 (+2.2)        75.9   75.1 (-0.8)   76.7 (+1.6)
  FixMatch [66]        74.6   73.0 (-1.6)   75.6 (+2.6)        75.7   74.3 (-1.4)   76.5 (+2.2)
  UDA [87]             74.8   73.3 (-1.5)   75.8 (+2.5)        75.9   74.5 (-1.4)   76.7 (+2.2)
  FlexMatch [96]       74.9   73.9 (-1.0)   75.8 (+1.9)        76.0   75.1 (-0.9)   76.9 (+1.8)
  FreeMatch [81]       74.9   73.9 (-1.0)   75.7 (+1.8)        76.0   75.1 (-0.9)   76.8 (+1.7)
  Fully supervised     85.4
Table 3: Comparisons on DomainNet-126. We evaluate our method by integrating it with SemiSDA and SemiSL methods. The average accuracy of seven domain-shift scenarios in Table 1 is reported. Source model represents the pre-trained model without adaptation. Fully supervised means the model is adapted with fully labeled target data.
feed. amount           195 (3 labeled data per class)          325 (5 labeled data per class)
method                 RF     NBF           w/ ours            RF     NBF           w/ ours
Source model           57.6
MME [58]               71.2   70.2 (-1.0)   73.4 (+3.2)        73.5   73.1 (-0.4)   75.6 (+2.5)
CDAC [33]              71.2   69.0 (-2.2)   74.3 (+5.3)        73.5   72.3 (-1.2)   75.7 (+3.4)
AdaMatch [6]           70.9   69.3 (-1.6)   73.8 (+4.5)        73.4   72.7 (-0.7)   75.5 (+2.8)
FixMatch [66]          71.4   68.6 (-2.8)   73.7 (+5.1)        73.9   72.2 (-1.7)   75.3 (+3.1)
UDA [87]               72.2   69.5 (-2.7)   74.1 (+4.6)        74.4   73.0 (-1.4)   76.0 (+3.0)
FlexMatch [96]         73.7   72.1 (-1.6)   74.7 (+2.6)        75.9   74.9 (-1.0)   76.6 (+1.7)
FreeMatch [81]         74.0   72.7 (-1.3)   74.8 (+2.1)        75.8   75.0 (-0.8)   76.6 (+1.6)
Fully supervised       87.4
Table 4: Comparisons on OfficeHome. The average accuracy of twelve domain-shift scenarios in Table 2 is reported. ResNet-50 is used.

5.2 Main Results

Natural image classification. Following recent DA works [8, 94], we conduct experiments on seven and twelve domain-shift scenarios provided with the DomainNet-126 and OfficeHome datasets, respectively. Table 1 and Table 2 show the results, where AdaMatch [6] is used as the baseline. We observe results consistent with Figure 3 even on large natural-image datasets: when simply applying the baseline under the NBF assumption, the adapted model shows inferior performance for most domain shifts compared to applying it under RF, e.g., 64.5 < 67.6. Combining our approach with the baseline mitigates this issue and achieves a performance increase, e.g., 64.5 → 72.0.

We also use other promising baselines and report the average accuracy over all domain shifts in Table 3 and Table 4 (all results can be found in Appendix F). While both feedback types bring performance improvements over the source model, lower performance is observed with NBF. Our method enables the baselines not only to address this problem but to surpass their performance under RF. These results suggest that the biased distribution of labeled samples, which has been overlooked in previous SemiSDA works, is actually problematic, and that our retrieval latent defending approach is effective.

method         feedback   average   atelect.   cardiom.   consol.   edema    enl. card.   fracture   lung les.   lung opac.   pl. effus.   pl. other   pneumonia   pneumoth.   supp. dev.
Source model   -          .7738     .7784      .7919      .8236     .8500    .7646        .6642      .7555       .7818        .8271        .8288       .7535       .6894       .7500
PseudoL [1]    RF         .7850     .7828      .7965      .8453     .8615    .7639        .6832      .7598       .7947        .8333        .8565       .7702       .6957       .7622
PseudoL [1]    NBF        .7691     .7719      .7851      .8202     .8468    .7403        .6934      .7446       .7809        .8070        .8260       .7521       .6979       .7324
  gap                     -.0159    -.0109     -.0114     -.0252    -.0147   -.0236       +.0102     -.0152      -.0138       -.0262       -.0304      -.0181      +.0022      -.0298
w/ ours        NBF        .7884     .7895      .7956      .8515     .8606    .7730        .6821      .7599       .7973        .8445        .8611       .7753       .6851       .7736
  gain                    +.0193    +.0176     +.0105     +.0313    +.0138   +.0326       -.0113     +.0153      +.0164       +.0375       +.0351      +.0232      -.0128      +.0412
PseudoL [1]    NBF-CE     .7639     .7682      .7834      .8124     .8418    .7403        .6808      .7472       .7744        .8005        .8199       .7469       .6879       .7277
  gap                     -.0211    -.0146     -.0131     -.0330    -.0198   -.0236       -.0024     -.0126      -.0203       -.0328       -.0366      -.0233      -.0079      -.0344
w/ ours        NBF-CE     .7875     .7895      .7956      .8515     .8606    .7730        .6731      .7599       .7973        .8445        .8611       .7753       .6831       .7736
  gain                    +.0236    +.0213     +.0122     +.0391    +.0189   +.0327       -.0077     +.0126      +.0229       +.0440       +.0412      +.0284      -.0048      +.0459
Fully super.   -          .8117     .8150      .8277      .8758     .8820    .7984        .6949      .7750       .8200        .8725        .8441       .8044       .7398       .8025
Table 5: Adaptation in a medical application. We use samples with PA view as the source data and samples with AP view as the target data in the MIMIC-CXR-V2 dataset [27]. NBF-CE represents a scenario in which NBF is composed of cases with confident errors. We use DenseNet-121 [26, 13] and assume 20 feedback samples each for the absence and presence of every finding.
method          labeling type   RF     NBF           NBF w/ ours   ENT [62]   ENT w/ ours
IAST [46, 63]   PA, 40 points   55.3   53.0 (-2.3)   56.3 (+3.3)   53.5       56.0 (+2.5)
RIPU [85]       PA, 40 points   57.6   54.5 (-3.1)   58.0 (+3.5)   54.6       57.7 (+3.1)
Table 6: Adaptation on semantic segmentation. The GTA5 [56] → Cityscapes [14] setup is used [72]. The target performance of the source model is 36.6 mIoU.

Medical image diagnosis. Table 5 shows the results (bottom) and also depicts the effect of NBF (top center). We report the AUROC [7] for each finding, following standard practice for evaluating computer-aided-diagnosis models [32, 44]. The baseline SemiSDA method under NBF exhibits inferior performance compared to the one under RF, but this issue can be mitigated by combining it with our approach.

In addition, we propose an interesting and practical scenario named NBF with more confident errors (NBF-CE). In this scenario, we assume that a radiologist is likely to give feedback when the model makes confidently wrong predictions. Imagine that the model predicts a 1% likelihood of cancer in a CXR image, but the person actually has cancer. Such failure to detect potential patients early on can significantly reduce the patient's chances of survival, so a radiologist may provide feedback to the model. To simulate NBF-CE, we select samples where the source model most confidently predicts a finding to be absent ($\hat{y} \approx 0$) although it is clearly visible in the radiograph ($y = 1$), and vice versa, i.e., samples with $\hat{y} \approx 1$ but $y = 0$. Table 5 also shows the results under the NBF-CE scenario, where the model's adaptation performance is further reduced compared with NBF (0.7691 for NBF → 0.7639 for NBF-CE). By combining our method, we observe performance improvements for both NBF variants, e.g., 0.7639 for NBF-CE → 0.7875 with ours. We illustrate the hypothesized impact of our method in Table 5.
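
For one binary finding, NBF-CE selection can be sketched as follows; the tensor names and the per-class budget of 20 follow the setup above, while the ranking-by-confidence implementation is our assumption.

```python
import torch

def select_nbf_ce(probs, labels, n_per_class=20):
    """probs: [N] predicted probability of 'present'; labels: [N] in {0, 1}."""
    wrong = probs.round().long() != labels
    # Missed findings (y = 1 but y_hat ~ 0): lowest predicted probability first.
    missed = torch.where(wrong & (labels == 1))[0]
    missed = missed[probs[missed].argsort()][:n_per_class]
    # False alarms (y = 0 but y_hat ~ 1): highest predicted probability first.
    alarms = torch.where(wrong & (labels == 0))[0]
    alarms = alarms[probs[alarms].argsort(descending=True)][:n_per_class]
    return torch.cat([missed, alarms])  # indices of confidently wrong samples
```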

Semantic segmentation. We evaluate the influence of NBF and our approach on a semantic segmentation task. We utilize the most common adaptation benchmark, GTA5 [56] to Cityscapes [14]. We use IAST [46, 63] and RIPU [85] as baseline DA algorithms in a source-free scenario, and adopt Pixel-based Annotation (PA), assuming 40 labeled pixels per image as in LabOR [63]. Table 6 shows results similar to those we observed in the classification and medical imaging tasks. The baselines under NBF exhibit inferior performance compared to those under RF (54.5 for NBF < 57.6 for RF), but this issue is addressed by combining our approach with them (+3.5 mIoU). Although out of our scope (refer to Appendix A.1), we also validate one active labeling strategy, ENT [62], which assigns highly uncertain (i.e., probably misclassified) pixels as feedback; a sketch of this selection is given below. Consequently, the feedback instructed by ENT is biasedly distributed in a manner similar to NBF. ENT also causes unexpected results (54.6 for ENT < 57.6 for RF), and our approach alleviates this issue (+3.1 mIoU).
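
Below is a minimal sketch of such entropy-based pixel selection for one image; the input layout ([C, H, W] softmax map) and the 40-point budget follow the PA setting above, and the function name is our own.

```python
import torch

def select_ent_pixels(prob_map, n_points=40):
    """prob_map: [C, H, W] per-pixel softmax; returns (row, col) pixels to label."""
    # Per-pixel entropy; the clamp avoids log(0) on saturated predictions.
    ent = -(prob_map * prob_map.clamp_min(1e-8).log()).sum(dim=0)  # [H, W]
    flat = ent.flatten().argsort(descending=True)[:n_points]       # most uncertain
    rows = torch.div(flat, ent.shape[1], rounding_mode="floor")
    cols = flat % ent.shape[1]
    return torch.stack([rows, cols], dim=1)
```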

5.3 Ablation Study

If not specified, we use ResNet-50 and report the average accuracy (%) of seven domain shift scenarios in Table 1 for ablation studies.

Positive vs. Negative feedback.

We study the role of feedback in the adaptation results by varying the feedback configuration. Let positive feedback (PF) be feedback obtained from samples that the source model predicts correctly, as opposed to negative feedback (NF); fully positive feedback is thus the counterpart of NBF, which we denote positively biased feedback (PBF). We adjust the ratio PF:NF while keeping the total number of labeled samples constant, as shown in Figure 5.

When using only FreeMatch (gray dot-dashed line), both fully biased feedback types (i.e., NBF and PBF) result in worse adaptation performance than balanced feedback, e.g., 72.6 at 378:0 (PBF) < 73.3 at 252:126. In contrast, when our method is applied (red line), NBF yields the best performance. PF and NF can be regarded as contributing, respectively, knowledge the model already has and new knowledge that complements the model's deficiencies. Hence, it is natural that NBF, which directly encodes the model's mistakes, contributes to favorable adaptation results.

Number of unlabeled samples in a mini-batch. Existing SemiSDA methods [6, 81] typically set the ratio $\mu$ between labeled and unlabeled samples in a mini-batch to 1:7. However, we observe that adhering to this ratio is not optimal for our approach, as shown in Table 7. Our method performs better when the ratio is changed to 1:4, i.e., with fewer unlabeled samples. This finding contradicts observations in several TTA works [48, 28, 67], where adaptation performance tends to increase with larger batch sizes. We speculate that it is beneficial to prioritize more reliable information, namely the labeled data and our defending samples selected from the filtered bank, during the adapting process. This result may align with previous works on curriculum learning [100, 40] and adaptive thresholding [96].

Figure 5: NBF leads to higher performance than PBF. We compare different user-feedback configurations when the total number of feedback samples is 378 (top) and 630 (bottom). Positive and negative feedback refer to feedback from correct and incorrect model predictions, respectively. We run experiments with three random seeds and report the average performance with the standard deviation in parentheses.
                                  negatively biased feedback (NBF)
# x_ulb / # x_LD / # x_lb         112 / 0 / 16   112 / 48 / 16   64 / 48 / 16
total batch size                  128            176             128
FreeMatch [81] (378 feedback)     72.0           74.2            74.8 (+0.6)
AdaMatch [6] (378 feedback)       64.5           71.3            72.0 (+0.7)
FreeMatch [81] (630 feedback)     74.4           75.5            76.1 (+0.6)
AdaMatch [6] (630 feedback)       67.7           73.4            74.3 (+0.9)
Table 7: In the mini-batch, diminishing the number of unlabeled samples and adding our defending samples achieves better performance with our approach. We ablate these by changing the ratio $\mu$ from Section 4.1 while keeping the number of labeled samples fixed.

Number of labeled data. We measure the impact of the feedback size (number of labeled samples) in Figure 6. The results show that the inferior performance under NBF persists even with an increased amount of feedback (gray → black line); however, our approach mitigates it and improves performance (black → red line). We make the interesting observation that the performance gap between the black and red lines grows as the amount of available feedback decreases. Since obtaining a large amount of feedback may be challenging in real-world applications, our method is expected to be all the more helpful in this practical case.

Figure 6: More reliable adaptation with NBF. In addition to Table 3, we conduct experiments with different amounts of feedback (1, 3, 5, 10, and 15 labeled samples per class) using FreeMatch [81]. The amount of available feedback is likely to be small in practice. In this case, our method achieves a large performance improvement, e.g., it increases the baseline performance by +4.9 when one feedback sample per class is available.
selection strategy       random   random (class-aware)   k-means   cosine   baseline only
FreeMatch [81], Res.     74.1     74.8                   74.6      74.0     72.0
FreeMatch [81], ViT      75.0     75.7                   75.6      75.1     73.9

filtering rate p         0.2      0.4      0.6      0.8    baseline only
FreeMatch [81], Res.     74.5     74.8     74.3     73.7   72.0
FreeMatch [81], ViT      75.5     75.7     75.9     75.5   73.9
Table 8: We ablate components of our approach with 378 feedback samples: the $x_{LD}$ selection strategy and the filtering rate $p$ for bank generation.

Data selection strategy. We explore various strategies for selecting defending samples to balance the mini-batch, as shown in Table 8 (top). Within the $x_{LD}$ candidate bank, the strategies are: (i) random selection regardless of the class of the labeled data, (ii) random selection within the same class as the labeled data (i.e., class-aware), (iii) selecting samples close to the cluster centers obtained by k-means clustering [21], and (iv) selecting samples whose embedded features are distant from the labeled data, measured by cosine distance (sketches of (iii) and (iv) follow below). While our approach consistently outperforms the baseline regardless of the chosen strategy, we empirically find that strategy (ii) achieves the best performance. Therefore, we adopt this strategy for our proposed method.
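
For reference, minimal sketches of strategies (iii) and (iv) are shown below. The embedding inputs, the number of clusters, and the function names are our assumptions; strategy (ii), which we adopt, is the `random.sample` retrieval already sketched after Eq. (2).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def select_near_kmeans_centers(feats, k, n_clusters=8):
    """Strategy (iii): pick the k samples closest to their k-means cluster center."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(feats)
    dists = np.linalg.norm(feats - km.cluster_centers_[km.labels_], axis=1)
    return np.argsort(dists)[:k]

def select_cosine_distant(feats, lb_feat, k):
    """Strategy (iv): pick the k samples most cosine-distant from the labeled one."""
    sims = normalize(feats) @ (lb_feat / np.linalg.norm(lb_feat))
    return np.argsort(sims)[:k]  # lowest similarity = largest cosine distance
```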

Further studies, such as extension to a TTA scenario, combination with SFDA methods, and different feedback configurations, are presented in Appendix C.

6 Conclusion & Discussion

User feedback can play an integral part in adapting a practical ML product to the target environment. However, we have shown that naive adaptation using existing SemiSDA methods leads to undesirable adaptation results. We explained this through the lens of Negatively Biased Feedback (NBF). In this paper, we uncovered the unexpected consequences of NBF and presented a scalable solution, Retrieval Latent Defending. This method prevents the mini-batch from becoming overly dependent on labeled samples that may have a biased distribution within the overall target distribution. Across diverse DA benchmarks, from the simulation study to the medical imaging task, we demonstrated the practical problem caused by NBF and the effectiveness of our approach by combining it with multiple SemiSDA baselines. We hope our efforts will inspire future DA works on leveraging user feedback to improve ML models in the deployment environment.

Broader impact. The proposed setup assumes that an ML product obtains feedback in the form of annotations (i.e., labeled data). In some cases, users may provide feedback in different forms, such as thumbs up/down or ratings of model predictions, or noisy feedback whose information differs from the ground truth. Further research considering these points will pave the way for developing safer and more reliable adapting strategies.

Acknowledgment. We sincerely appreciate the abundant support provided by Lunit Inc., and we would like to thank Donggeun Yoo, Seonwook Park, and Sérgio Pereira for their valuable feedback.

References

  • [1] Arazo, E., Ortego, D., Albert, P., O’Connor, N.E., McGuinness, K.: Pseudo-labeling and confirmation bias in deep semi-supervised learning. In: International Joint Conference on Neural Networks (IJCNN) (2020)
  • [2] Basak, H., Yin, Z.: Semi-supervised domain adaptive medical image segmentation through consistency regularized disentangled contrastive learning. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (2023)
  • [3] Baumeister, R.F., Bratslavsky, E., Finkenauer, C., Vohs, K.D.: Bad is stronger than good. Review of general psychology (2001)
  • [4] Berthelot, D., Carlini, N., Cubuk, E.D., Kurakin, A., Sohn, K., Zhang, H., Raffel, C.: Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. In: ICLR (2020)
  • [5] Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.A.: Mixmatch: A holistic approach to semi-supervised learning. In: NeurIPS (2019)
  • [6] Berthelot, D., Roelofs, R., Sohn, K., Carlini, N., Kurakin, A.: Adamatch: A unified approach to semi-supervised learning and domain adaptation. In: ICLR (2022)
  • [7] Bradley, A.P.: The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern recognition (1997)
  • [8] Chen, D., Wang, D., Darrell, T., Ebrahimi, S.: Contrastive test-time adaptation. In: CVPR (2022)
  • [9] Chen, H., Tao, R., Fan, Y., Wang, Y., Wang, J., Schiele, B., Xie, X., Raj, B., Savvides, M.: Softmatch: Addressing the quantity-quality trade-off in semi-supervised learning. In: ICLR (2023)
  • [10] Chen, W., Lin, L., Yang, S., Xie, D., Pu, S., Zhuang, Y.: Self-supervised noisy label learning for source-free unsupervised domain adaptation. In: IROS (2022)
  • [11] Chen, X., Zhao, Z., Zhang, Y., Duan, M., Qi, D., Zhao, H.: Focalclick: Towards practical interactive image segmentation. In: CVPR (2022)
  • [12] Chen, Y., Tan, X., Zhao, B., Chen, Z., Song, R., Liang, J., Lu, X.: Boosting semi-supervised learning by exploiting all unlabeled data. In: CVPR (2023)
  • [13] Cohen, J.P., Viviano, J.D., Bertin, P., Morrison, P., Torabian, P., Guarrera, M., Lungren, M.P., Chaudhari, A., Brooks, R., Hashir, M., et al.: Torchxrayvision: A library of chest x-ray datasets and models. In: International Conference on Medical Imaging with Deep Learning (2022)
  • [14] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)
  • [15] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
  • [16] Fini, E., Astolfi, P., Alahari, K., Alameda-Pineda, X., Mairal, J., Nabi, M., Ricci, E.: Semi-supervised learning made simple with self-supervised clustering. In: CVPR (2023)
  • [17] Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: ICML (2015)
  • [18] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of neural networks. The journal of machine learning research (2016)
  • [19] Gong, T., Jeong, J., Kim, T., Kim, Y., Shin, J., Lee, S.J.: Robust continual test-time adaptation: Instance-aware bn and prediction-balanced memory. In: NeurIPS (2023)
  • [20] Harada, S., Bise, R., Araki, K., Yoshizawa, A., Terada, K., Kurata, M., Nakajima, N., Abe, H., Ushiku, T., Uchida, S.: Cluster-guided semi-supervised domain adaptation for imbalanced medical image classification. arXiv preprint arXiv:2303.01283 (2023)
  • [21] Hartigan, J.A., Wong, M.A.: Algorithm as 136: A k-means clustering algorithm. Journal of the royal statistical society. series c (applied statistics) (1979)
  • [22] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: CVPR (2020)
  • [23] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
  • [24] Higuchi, Y., Moritz, N., Roux, J.L., Hori, T.: Momentum pseudo-labeling for semi-supervised speech recognition. In: Interspeech (2021)
  • [25] Huang, D., Li, J., Chen, W., Huang, J., Chai, Z., Li, G.: Divide and adapt: Active domain adaptation via customized learning. In: CVPR (2023)
  • [26] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR (2017)
  • [27] Johnson, A.E., Pollard, T.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Peng, Y., Lu, Z., Mark, R.G., Berkowitz, S.J., Horng, S.: Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042 (2019)
  • [28] Khurana, A., Paul, S., Rai, P., Biswas, S., Aggarwal, G.: Sita: Single image test-time adaptation. arXiv preprint arXiv:2112.02355 (2021)
  • [29] Kim, J., Hur, Y., Park, S., Yang, E., Hwang, S.J., Shin, J.: Distribution aligning refinery of pseudo-label for imbalanced semi-supervised learning. In: NeurIPS (2020)
  • [30] Knox, W.B., Stone, P.: Tamer: Training an agent manually via evaluative reinforcement. In: IEEE international conference on development and learning (2008)
  • [31] Lee, H., Shin, S., Kim, H.: Abc: Auxiliary balanced classifier for class-imbalanced semi-supervised learning. In: NeurIPS (2021)
  • [32] Lenga, M., Schulz, H., Saalbach, A.: Continual learning for domain adaptation in chest x-ray classification. In: Medical Imaging with Deep Learning (2020)
  • [33] Li, J., Li, G., Shi, Y., Yu, Y.: Cross-domain adaptive clustering for semi-supervised domain adaptation. In: CVPR (2021)
  • [34] Liang, J., Hu, D., Feng, J.: Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In: ICML (2020)
  • [35] Litrico, M., Del Bue, A., Morerio, P.: Guiding pseudo-labels with uncertainty estimation for source-free unsupervised domain adaptation. In: CVPR (2023)
  • [36] Liu, X., Xing, F., Shusharina, N., Lim, R., Jay Kuo, C.C., El Fakhri, G., Woo, J.: Act: Semi-supervised domain-adaptive medical image segmentation with asymmetric co-training. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (2022)
  • [37] Liu, X., Yoo, C., Xing, F., Oh, H., El Fakhri, G., Kang, J.W., Woo, J., et al.: Deep unsupervised domain adaptation: A review of recent advances and perspectives. APSIPA Transactions on Signal and Information Processing (2022)
  • [38] Liu, Y., Zhang, W., Wang, J.: Source-free domain adaptation for semantic segmentation. In: CVPR (2021)
  • [39] Liu, Y., Kothari, P., van Delft, B., Bellot-Gurlet, B., Mordan, T., Alahi, A.: Ttt++: When does self-supervised test-time training fail or thrive? In: NeurIPS (2021)
  • [40] Liu, Z., Miao, Z., Pan, X., Zhan, X., Lin, D., Yu, S.X., Gong, B.: Open compound domain adaptation. In: CVPR (2020)
  • [41] Long, M., Zhu, H., Wang, J., Jordan, M.I.: Deep transfer learning with joint adaptation networks. In: ICML (2017)
  • [42] MacGlashan, J., Ho, M.K., Loftin, R., Peng, B., Wang, G., Roberts, D.L., Taylor, M.E., Littman, M.L.: Interactive learning from policy-dependent human feedback. In: International conference on machine learning (2017)
  • [43] Madani, A., Moradi, M., Karargyris, A., Syeda-Mahmood, T.: Semi-supervised learning with generative adversarial networks for chest x-ray classification with ability of data domain adaptation. In: International symposium on biomedical imaging (2018)
  • [44] Mahapatra, D., Korevaar, S., Bozorgtabar, B., Tennakoon, R.: Unsupervised domain adaptation using feature disentanglement and gcns for medical image classification. In: ECCV (2022)
  • [45] Mei, K., Zhu, C., Zou, J., Zhang, S.: Instance adaptive self-training for unsupervised domain adaptation. In: ECCV (2020)
  • [46] Mei, K., Zhu, C., Zou, J., Zhang, S.: Instance adaptive self-training for unsupervised domain adaptation. In: ECCV (2020)
  • [47] Niu, S., Wu, J., Zhang, Y., Chen, Y., Zheng, S., Zhao, P., Tan, M.: Efficient test-time model adaptation without forgetting. In: ICML (2022)
  • [48] Niu, S., Wu, J., Zhang, Y., Wen, Z., Chen, Y., Zhao, P., Tan, M.: Towards stable test-time adaptation in dynamic wild world. In: ICLR (2023)
  • [49] Oh, Y., Kim, D.J., Kweon, I.S.: Daso: Distribution-aware semantics-oriented pseudo-label for imbalanced semi-supervised learning. In: CVPR (2022)
  • [50] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. In: NeurIPS (2022)
  • [51] Pan, Y., Yao, T., Li, Y., Wang, Y., Ngo, C.W., Mei, T.: Transferrable prototypical networks for unsupervised domain adaptation. In: CVPR (2019)
  • [52] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., et al.: Pytorch: An imperative style, high-performance deep learning library. In: NeurIPS (2019), https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py
  • [53] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research (2011)
  • [54] Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., Wang, B.: Moment matching for multi-source domain adaptation. In: ICCV (2019)
  • [55] Prabhu, V., Chandrasekaran, A., Saenko, K., Hoffman, J.: Active domain adaptation via clustering uncertainty-weighted embeddings. In: ICCV (2021)
  • [56] Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: Ground truth from computer games. In: ECCV (2016)
  • [57] Rozin, P., Royzman, E.B.: Negativity bias, negativity dominance, and contagion. Personality and social psychology review (2001)
  • [58] Saito, K., Kim, D., Sclaroff, S., Darrell, T., Saenko, K.: Semi-supervised domain adaptation via minimax entropy. In: ICCV (2019)
  • [59] Saito, K., Watanabe, K., Ushiku, Y., Harada, T.: Maximum classifier discrepancy for unsupervised domain adaptation. In: CVPR (2018)
  • [60] Sajjadi, M., Javanmardi, M., Tasdizen, T.: Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In: NeurIPS (2016)
  • [61] Schulman, J., Zoph, B., Kim, C., Hilton, J., Menick, J., Weng, J., Uribe, J.F.C., Fedus, L., Metz, L., Pokorny, M., et al.: Chatgpt: Optimizing language models for dialogue. OpenAI blog (2022)
  • [62] Shen, Y., Yun, H., Lipton, Z.C., Kronrod, Y., Anandkumar, A.: Deep active learning for named entity recognition. In: ICLR (2017)
  • [63] Shin, I., Kim, D.J., Cho, J.W., Woo, S., Park, K., Kweon, I.S.: Labor: Labeling only if required for domain adaptive semantic segmentation. In: ICCV (2021)
  • [64] Sofiiuk, K., Petrov, I., Barinova, O., Konushin, A.: f-brs: Rethinking backpropagating refinement for interactive segmentation. In: CVPR (2020)
  • [65] Sofiiuk, K., Petrov, I.A., Konushin, A.: Reviving iterative training with mask guidance for interactive segmentation. In: IEEE International Conference on Image Processing (ICIP) (2022)
  • [66] Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C.A., Cubuk, E.D., Kurakin, A., Li, C.L.: Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In: NeurIPS (2020)
  • [67] Song, J., Lee, J., Kweon, I.S., Choi, S.: Ecotta: Memory-efficient continual test-time adaptation via self-distilled regularization. In: CVPR (2023)
  • [68] Song, J., Park, K., Shin, I., Woo, S., Zhang, C., Kweon, I.S.: Test-time adaptation in the dynamic world with compound domain knowledge management. IEEE Robotics and Automation Letters (2023)
  • [69] Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., Christiano, P.F.: Learning to summarize with human feedback. In: NeurIPS (2020)
  • [70] Sun, B., Saenko, K.: Deep coral: Correlation alignment for deep domain adaptation. In: ECCV Workshops (2016)
  • [71] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
  • [72] Tsai, Y.H., Hung, W.C., Schulter, S., Sohn, K., Yang, M.H., Chandraker, M.: Learning to adapt structured output space for semantic segmentation. In: CVPR (2018)
  • [73] Tsai, Y.H., Hung, W.C., Schulter, S., Sohn, K., Yang, M.H., Chandraker, M.: Learning to adapt structured output space for semantic segmentation. In: CVPR (2018)
  • [74] Van Engelen, J.E., Hoos, H.H.: A survey on semi-supervised learning. Machine learning (2020)
  • [75] Venkateswara, H., Eusebio, J., Chakraborty, S., Panchanathan, S.: Deep hashing network for unsupervised domain adaptation. In: CVPR (2017)
  • [76] Vu, T.H., Jain, H., Bucher, M., Cord, M., Pérez, P.: Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In: CVPR (2019)
  • [77] Wang, D., Shelhamer, E., Liu, S., Olshausen, B., Darrell, T.: Tent: Fully test-time adaptation by entropy minimization. In: ICLR (2021)
  • [78] Wang, M., Deng, W.: Deep visual domain adaptation: A survey. Neurocomputing (2018)
  • [79] Wang, Q., Fink, O., Van Gool, L., Dai, D.: Continual test-time domain adaptation. In: CVPR (2022)
  • [80] Wang, Y., Chen, H., Fan, Y., Sun, W., Tao, R., Hou, W., Wang, R., Yang, L., Zhou, Z., Guo, L.Z., et al.: Usb: A unified semi-supervised learning benchmark for classification. In: NeurIPS (2022)
  • [81] Wang, Y., Chen, H., Heng, Q., Hou, W., Fan, Y., Wu, Z., Wang, J., Savvides, M., Shinozaki, T., Raj, B., et al.: Freematch: Self-adaptive thresholding for semi-supervised learning. In: ICLR (2023)
  • [82] Warnell, G., Waytowich, N., Lawhern, V., Stone, P.: Deep tamer: Interactive agent shaping in high-dimensional state spaces. In: AAAI (2018)
  • [83] Wei, C., Sohn, K., Mellina, C., Yuille, A., Yang, F.: Crest: A class-rebalancing self-training framework for imbalanced semi-supervised learning. In: CVPR (2021)
  • [84] Wirth, C., Akrour, R., Neumann, G., Fürnkranz, J., et al.: A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research (2017)
  • [85] Xie, B., Yuan, L., Li, S., Liu, C.H., Cheng, X.: Towards fewer annotations: Active learning via region impurity and prediction uncertainty for domain adaptive semantic segmentation. In: CVPR (2022)
  • [86] Xie, M., Li, Y., Wang, Y., Luo, Z., Gan, Z., Sun, Z., Chi, M., Wang, C., Wang, P.: Learning distinctive margin toward active domain adaptation. In: CVPR (2022)
  • [87] Xie, Q., Dai, Z., Hovy, E., Luong, T., Le, Q.: Unsupervised data augmentation for consistency training. In: NeurIPS (2020)
  • [88] Xu, Y., Shang, L., Ye, J., Qian, Q., Li, Y.F., Sun, B., Li, H., Jin, R.: Dash: Semi-supervised learning with dynamic thresholding. In: ICML (2021)
  • [89] Yan, Z., Wu, Y., Li, G., Qin, Y., Han, X., Cui, S.: Multi-level consistency learning for semi-supervised domain adaptation. In: IJCAI (2022)
  • [90] Yang, L., Wang, Y., Gao, M., Shrivastava, A., Weinberger, K.Q., Chao, W.L., Lim, S.N.: Deep co-training with task decomposition for semi-supervised domain adaptation. In: ICCV (2021)
  • [91] Yang, S., Jui, S., van de Weijer, J., et al.: Attracting and dispersing: A simple approach for source-free domain adaptation. In: NeurIPS (2022)
  • [92] Yang, S., Wang, Y., Van De Weijer, J., Herranz, L., Jui, S.: Generalized source-free domain adaptation. In: ICCV (2021)
  • [93] Yang, S., van de Weijer, J., Herranz, L., Jui, S., et al.: Exploiting the intrinsic neighborhood structure for source-free domain adaptation. In: NeurIPS (2021)
  • [94] Yu, Y.C., Lin, H.T.: Semi-supervised domain adaptation with source label adaptation. In: CVPR (2023)
  • [95] Yu, Z., Li, J., Du, Z., Zhu, L., Shen, H.T.: A comprehensive survey on source-free domain adaptation. arXiv preprint arXiv:2302.11803 (2023)
  • [96] Zhang, B., Wang, Y., Hou, W., Wu, H., Wang, J., Okumura, M., Shinozaki, T.: Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. In: NeurIPS (2021)
  • [97] Zhang, C., Miech, A., Shen, J., Alayrac, J.B., Luc, P.: Making the most of what you have: Adapting pre-trained visual language models in the low-data regime. arXiv preprint arXiv:2305.02297 (2023)
  • [98] Zhang, W., Ouyang, W., Li, W., Xu, D.: Collaborative and adversarial network for unsupervised domain adaptation. In: CVPR (2018)
  • [99] Zhang, Y., Zhang, H., Deng, B., Li, S., Jia, K., Zhang, L.: Semi-supervised models are strong unsupervised domain adaptation learners. arXiv preprint arXiv:2106.00417 (2021)
  • [100] Zhang, Y., David, P., Gong, B.: Curriculum domain adaptation for semantic segmentation of urban scenes. In: ICCV (2017)
  • [101] Zhang, Y., Wang, Z., He, W.: Class relationship embedded learning for source-free unsupervised domain adaptation. In: CVPR (2023)

Supplementary Material on Is user feedback always informative? Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data

Junha Song Tae Soo Kim Junha Kim Gunhee Nam Thijs Kooi Jaegul Choo

In this supplementary material, we provide:

  A. Comparison with Related Work
     A.1. Active Domain Adaptation
     A.2. Class-Imbalanced Semi-Supervised Learning
     A.3. Test-time Adaptation
     A.4. Learning with User Feedback
  B. Further understanding with Simulation Study
  C. Additional Ablation Study
  D. Additional Experimental Details
  E. Additional Discussion
     E.1. Technique novelty
     E.2. Computational overhead
     E.3. Limitations
  F. Results of All Domain Shifts

A Comparison with Related Work

A.1 Active Domain Adaptation

Active domain adaptation (ActiveDA) aims to select the most informative samples to be labeled by annotators, given a limited annotation budget. As shown in Figure 7, the machine selects samples using an ActiveDA method and instructs annotators to label them. Several ActiveDA methods have been proposed: CLUE [55] employs an entropy-based clustering algorithm to preserve the uncertainty and diversity of labeled data, while SDM-AG [86] and DiaNA [25] utilize margin functions between the source and target domains to identify informative samples. In contrast to this ActiveDA scenario, we present an NBF scenario in which there is no machine-instructed sample selection; instead, users directly provide feedback in response to prediction results. This may enable more flexible applications, since (1) users have the freedom to choose samples, and (2) individual users can impose different standards when selecting samples.

Figure 7: Comparison between labeling scenarios: random feedback (RF), active domain adaptation (ActiveDA), and negatively biased feedback (NBF).
Footnote 1: For DiaNA [25], we utilize their proposed 'informativeness scoring mechanism' to maintain a pretrained-model-agnostic property.
Footnote 2: If not specified, we use ResNet-50 and report the average accuracy (%) of the seven domain-shift scenarios in Table 1 for additional studies.

We note that ActiveDA methods target stage B of Figure 7, while our method targets stage C and is proposed to alleviate the problem caused by NBF. Although outside our scope, we evaluate our method under ActiveDA labeling scenarios, where CLUE and DiaNA (see Footnote 1) are employed. The results in Table 9 suggest two points. First, our method complements existing ActiveDA methods, consistently improving their performance. This highlights the importance of adapting the model with a balanced supervised signal throughout adaptation (i.e., stage C) using our method, even when ActiveDA methods like CLUE respect the diversity of labeled samples. Second, our method achieves significant performance gains regardless of the labeling scenario, showing that it can be applied for reliable adaptation even when the distribution of labeled data is unknown.

stage B (feed. amount):      378 (3 labeled data per class)                          | 1890      | 5040
stage C (labeling scenario): RF          | NBF         | Entropy [62] | CLUE [55]   | DiaNA [25]  | CLUE [55]   | CLUE [55]
AdaMatch [6]:                67.6        | 64.5        | 65.9         | 68.6        | 68.1        | 76.1        | 80.3
w/ ours:                     71.1 (+3.5) | 72.0 (+7.5) | 71.1 (+5.2)  | 71.5 (+2.9) | 71.3 (+3.2) | 78.0 (+1.9) | 81.4 (+1.1)
Table 9: We evaluate a SemiSDA method [6] and our method under diverse labeling scenarios, including our proposed NBF and ActiveDA scenarios [55, 25]. The difference between ActiveDA and our method is illustrated in Figure 7.

A.2 Class-Imbalanced Semi-Supervised Learning

SemiSDA and SemiSL methods often struggle with the different numbers of labeled data between classes, known as class imbalance [49]. To address this problem, class-imbalanced SemiSL works like CReST [83] propose to balance the quantity of labeled data by using pseudo labels [29, 31] in stage D (i.e., generation in CReST) of Figure 7. Recent advancements like DASO [49] further reduce the imbalance effect using both a similarity-based and linear classifier. Despite such advances in class-imbalanced SemiSL, the biased (i.e., imbalanced) label distribution within the same class has been overlooked in the SemiSDA, SemiSL, and class-imbalanced SemiSL works. Therefore, we introduce the new concept of biased labeled data called NBF and demonstrate its unexpected influence on adaptation performance.

Even though our focus in this paper is on the bias within the same class, accounting for the imbalance between classes can still be crucial for reliable domain adaptation. For example, in the medical domain, while radiologists are likely to log the model's mistakes, the amount of feedback from false-negative samples may be small compared to that from false-positive samples, given the natural prevalence of disease (e.g., fewer than 1 in 1,000 for lung cancer). We simulate this scenario and evaluate our method in Table 10.

method            | feedback | FP : FN  | average         | fracture        | pneumothorax
Source model      | -        | -        | .6768           | .6642           | .6894
80 feedback:
Pseudo-Label. [1] | RF       | -        | .7325           | .7541           | .7109
Pseudo-Label. [1] | NBF      | 40 : 40  | .7173 (-.0152)  | .7414 (-.0127)  | .6931 (-.0178)
  with ours       | NBF      | 40 : 40  | .7334 (+.0162)  | .7625 (+.0211)  | .7044 (+.0113)
Pseudo-Label. [1] | NBF      | 75 : 5   | .7248 (-.0077)  | .7494 (-.0047)  | .7002 (-.0107)
  with ours       | NBF      | 75 : 5   | .7361 (+.0113)  | .7653 (+.0159)  | .7070 (+.0068)
Pseudo-Label. [1] | NBF      | 5 : 75   | .7170 (-.0155)  | .7420 (-.0121)  | .6921 (-.0188)
  with ours       | NBF      | 5 : 75   | .7315 (+.0145)  | .7679 (+.0260)  | .6951 (+.0030)
160 feedback:
Pseudo-Label. [1] | RF       | -        | .7353           | .7565           | .7141
Pseudo-Label. [1] | NBF      | 80 : 80  | .7162 (-.0192)  | .7429 (-.0136)  | .6894 (-.0247)
  with ours       | NBF      | 80 : 80  | .7331 (+.0169)  | .7680 (+.0251)  | .6983 (+.0088)
Pseudo-Label. [1] | NBF      | 155 : 5  | .7237 (-.0117)  | .7559 (-.0007)  | .6915 (-.0227)
  with ours       | NBF      | 155 : 5  | .7358 (+.0121)  | .7665 (+.0106)  | .7051 (+.0136)
Pseudo-Label. [1] | NBF      | 5 : 155  | .7166 (-.0188)  | .7438 (-.0128)  | .6894 (-.0248)
  with ours       | NBF      | 5 : 155  | .7300 (+.0134)  | .7696 (+.0258)  | .6904 (+.0010)
Fully supervised  | -        | -        | .7744           | .8003           | .7486
Table 10: Adaptation with different feedback configurations on MIMIC-CXR-V2. These experiments extend Table 5: the same pre-trained model is utilized, and only two radiographic findings are considered for simplification. We compare different NBF configurations by varying the amount of feedback from false-positive (FP) and false-negative (FN) errors.

Under different feedback configurations.

We take various feedback configurations into account, as depicted in Table 10. Assuming the model acquires 80 or 160 feedback instances for each finding, we vary the feedback quantities from false-positive (FP) and false-negative (FN) errors, similar to the setup of class-imbalanced SemiSL [83, 29]. We only consider two radiographic findings for simplification. The results show that our method also mitigates the unexpected impact of NBF under this class-imbalanced scenario. Interestingly, we observe better performance when FP feedback outweighs FN feedback, which makes our method well suited to the medical domain, where radiographic findings are rarely present due to the natural prevalence of disease.
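For concreteness, the sketch below shows one way such FP/FN-biased feedback could be simulated from a model's binary predictions. The function name and the uniform random draw within each error pool are illustrative assumptions, not the exact sampling procedure of our experiments.

```python
import numpy as np

def sample_nbf_feedback(probs, labels, n_fp, n_fn, thr=0.5, seed=0):
    """Simulate negatively biased feedback (NBF) for one binary finding.

    probs:  (N,) predicted probabilities from the source model
    labels: (N,) ground-truth {0, 1} presence of the finding
    n_fp / n_fn: number of feedback points drawn from false-positive
                 and false-negative predictions, respectively.
    Returns indices of the labeled (feedback) samples.
    """
    rng = np.random.default_rng(seed)
    preds = (probs >= thr).astype(int)
    fp = np.flatnonzero((preds == 1) & (labels == 0))  # predicted present, actually absent
    fn = np.flatnonzero((preds == 0) & (labels == 1))  # model missed the finding
    picked_fp = rng.choice(fp, size=min(n_fp, fp.size), replace=False)
    picked_fn = rng.choice(fn, size=min(n_fn, fn.size), replace=False)
    return np.concatenate([picked_fp, picked_fn])

# e.g., the 75 : 5 configuration from Table 10
probs = np.random.rand(10_000)
labels = (np.random.rand(10_000) < 0.05).astype(int)  # rare finding
feedback_idx = sample_nbf_feedback(probs, labels, n_fp=75, n_fn=5)
```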

Combining with class-imbalanced SemiSL methods.

One straightforward way to adapt more reliably in this challenging scenario could be to combine our method with class-imbalanced SemiSL methods in stages C and D of Figure 7. To evaluate this approach, we conduct an additional simulation study in Figure 8. The simulation replicates the NBF scenario by selecting only misclassified samples within the same class, and we further introduce class imbalance by varying the number of feedback points between the blue and orange classes (leftmost sub-figure).

By adapting the model with different approaches, we find two interesting takeaways: (1) the approach proposed in CReST [83] was not designed to address the unexpected effect of NBF, so it struggles to adapt in this challenging scenario; (2) our method achieves better adaptation performance than CReST alone, and combining it with CReST outperforms all other configurations. These results highlight the importance of considering the NBF case alongside the class-imbalance problem, as well as the efficacy of our method. We hypothesize that defending the latent class space throughout the adaptation iterations makes the model robust to the effect of NBF, in contrast to the previous generation-based approach of CReST [83]. In addition, a discussion of zero feedback for certain classes is provided in Section C.

Figure 8: Our contribution focuses on introducing NBF and analyzing its effect on adaptation. However, real-world applications may require considering both an NBF and a class-imbalanced scenario. Hence, we first simulate this combined scenario and perform adaptation using i) a SemiSDA method (i.e., pseudo-labeling [1]) with our method and ii) a class-imbalanced SemiSL method (i.e., CReST [83]).

A.3 Test-time Adaptation

To mitigate performance degradation caused by domain shift, models deployed on edge devices such as smartphones and self-driving cars can be adapted to the target domain in an online manner, referred to as test-time adaptation (TTA). TTA makes two practical assumptions: i) adaptation proceeds without source data, and ii) only a limited amount of unlabeled target data can be stored. For instance, TENT [77] and subsequent works [47, 67, 68] leverage the current batch of unlabeled data to update the model's batch-normalization parameters. Alternatively, methods like NOTE [19] and ContraTTA [8] employ a target memory bank in which only a small amount of data (e.g., 16k image features in ContraTTA) can be stored and used for adaptation.

Extension to a TTA scenario.

Our setup, illustrated in Figure 2, also assumes a source-free setup, so it can easily be extended to a TTA scenario by employing a memory bank. In particular, adaptation is executed periodically, each time another 10% of the target training data has been encountered, following the TTA setup of TTT++ [39] and ContraTTA [8], where the unlabeled data in the memory bank and the labeled data are utilized. The memory bank size is set to 5k pseudo-labeled samples, and FreeMatch [81] is used as the SemiSDA baseline algorithm. Note that since previous TTA works do not consider the utilization of labeled data, we cannot use them as baselines or compare adaptation performance directly (but we attempt to alleviate this problem and implement comparisons in Section C). The results in Table 11 show that our method works well even with a smaller amount of unlabeled data in the memory bank. We find this result encouraging and intend to pursue this direction in future research.
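The sketch below outlines this periodic adaptation loop under the stated setup (a 5k-sample FIFO bank, adaptation after every additional 10% of the stream). The `PseudoLabelBank` class, `model.predict`, and `adapt_fn` (one SemiSDA round, e.g., FreeMatch with our method) are hypothetical names used only for illustration.

```python
from collections import deque

class PseudoLabelBank:
    """FIFO memory bank holding (image, pseudo_label) pairs, capped at 5k."""
    def __init__(self, capacity=5000):
        self.buf = deque(maxlen=capacity)

    def add(self, image, pseudo_label):
        self.buf.append((image, pseudo_label))

    def all(self):
        return list(self.buf)

def tta_stream(model, stream, labeled_data, adapt_fn, n_total, period=0.1):
    """Adapt periodically: every time another 10% of the target stream has
    been seen, run one SemiSDA round on the labeled feedback plus the
    unlabeled data currently stored in the bank."""
    bank = PseudoLabelBank()
    next_ckpt = period * n_total
    for i, x in enumerate(stream, start=1):
        y_hat = model.predict(x)      # online inference on the incoming sample
        bank.add(x, y_hat)
        if i >= next_ckpt:
            adapt_fn(model, labeled_data, bank.all())
            next_ckpt += period * n_total
    return model
```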

memory bank size = 5k; columns show the percentage of target data encountered in the target domain.
method         | feedback | feed. amount | 10%         | 40%         | 70%         | 100%
FreeMatch [81] | RF       | 368          | 68.4        | 71.4        | 73.0        | 73.4
FreeMatch [81] | NBF      | 368          | 66.9        | 69.5        | 71.0        | 71.5
w/ ours        | NBF      | 368          | 68.9 (+2.0) | 72.4 (+2.9) | 73.6 (+2.6) | 74.3 (+2.8)
FreeMatch [81] | RF       | 630          | 71.2        | 73.7        | 74.7        | 75.4
FreeMatch [81] | NBF      | 630          | 69.8        | 72.2        | 73.2        | 73.9
w/ ours        | NBF      | 630          | 71.5 (+1.7) | 74.1 (+1.9) | 75.0 (+1.8) | 75.5 (+1.6)
Table 11: We evaluate our approach in a TTA scenario, where only the labeled data and the unlabeled target data in a memory bank are available for adaptation, as in ContraTTA [8]. In the real, painting, sketch, and clipart domains of DomainNet-126, 10% of the data corresponds to 5.5k, 2.4k, 1.9k, and 1.5k images, respectively. In the table, 40% means that the model has encountered 40% of the unlabeled target training data.

A.4 Learning with User Feedback

Learning with User Feedback has garnered significant attention for its effectiveness in capturing users’ preferences or intentions [84, 69, 42, 50]. Reinforcement learning from human feedback is a powerful technique for model optimization based on human-provided rewards [30, 82, 61, 71]. Another application is interactive image segmentation [65, 64, 11], where users provide pixel-level annotations, enabling the model to enhance its understanding of user preferences over time.

B Further understanding with Simulation Study

In this section, we provide additional details and understanding about the simulation study in Figure 3.

Network architecture.

We build a model consisting of three fully connected layers with ReLU activation functions. The model takes a point's coordinates as input and returns a class label as output. Please refer to the example code in the 'sklearn.datasets.make_blobs' documentation [53].
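A minimal PyTorch sketch of this toy classifier follows; the hidden width of 32 and the two-class `make_blobs` configuration are illustrative assumptions rather than the exact settings of our simulation.

```python
import torch
import torch.nn as nn
from sklearn.datasets import make_blobs

class ToyNet(nn.Module):
    """Three fully connected layers with ReLU, mapping a 2-D point to class logits."""
    def __init__(self, n_classes=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):
        return self.net(x)

# toy data: 2-D blobs, one blob per class
X, y = make_blobs(n_samples=500, centers=2, random_state=0)
logits = ToyNet()(torch.tensor(X, dtype=torch.float32))
```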

Baseline.

One simple SemiSL method, pseudo-labeling [1], can easily be applied to the toy experiment. Given a mini-batch with labeled data $\{(x^b_{lb}, y^b_{lb}) : b \in [1..B]\}$ and unlabeled data $\{x^b_{ulb} : b \in [1..\mu B]\}$, we adapt the model with cross-entropy losses as follows:

$$\mathcal{L}_{sup}=\frac{1}{B}\sum_{b=1}^{B}\mathcal{H}\big(y_{lb}^{b},\,f_{\theta}(x_{lb}^{b})\big),\qquad \mathcal{L}_{unsup}=\frac{1}{\mu B}\sum_{b=1}^{\mu B}\mathcal{H}\Big(\operatorname*{argmax}_{c}\big[f_{\theta}(x_{ulb}^{b})\big]_{c},\,f_{\theta}(x_{ulb}^{b})\Big). \qquad (3)$$

Here, $f_{\theta}(\cdot)$ is the output probability of the model, and $\operatorname*{argmax}_{c}[f_{\theta}(x^{b}_{ulb})]_{c}$ is the pseudo label. As the equation shows, the updating model $f_{\theta}$ continuously predicts pseudo labels for the unlabeled data, so the pseudo labels can change as the decision boundary is updated. Figure 9 illustrates this phenomenon as the adaptation epochs progress.

Additional study on the two-moons dataset.

To better understand the unexpected influence of NBF on domain adaptation, we conduct additional simulations using the two-moons dataset from scikit-learn [53]. As shown in Figure 10, we generate source and target data such that there is a domain shift between them. After pre-training a model on the source data, we evaluate its performance on the target domain and observe a performance drop due to the shift (99.9% → 81.4%). We then simulate user-provided feedback under two scenarios (i.e., RF and NBF) and adapt the model to the target data in a semi-supervised manner [1]. The results confirm the observations of Section 3.2: the distribution of labeled data significantly impacts adaptation performance. Notably, a biased feedback distribution (NBF) leads to poorer performance than evenly distributed feedback (RF). In our main paper, we showed that this problem persists even with state-of-the-art SemiSDA methods and across different DA benchmarks.
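The following sketch shows one way the shifted two-moons data could be generated with scikit-learn; the 30-degree rotation standing in for the domain shift is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import make_moons

def rotate(X, deg):
    """Rotate 2-D points counter-clockwise by `deg` degrees."""
    t = np.deg2rad(deg)
    R = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    return X @ R.T

# Source and target share the two-moons structure but differ by a rotation,
# which plays the role of the domain shift.
Xs, ys = make_moons(n_samples=1000, noise=0.1, random_state=0)  # source
Xt, yt = make_moons(n_samples=1000, noise=0.1, random_state=1)
Xt = rotate(Xt, 30)                                             # target

# NBF would then label only target points the source model gets wrong, e.g.:
# nbf_idx = np.flatnonzero(preds_t != yt)[:20]   # preds_t: source-model predictions
```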

Figure 9: The model's decision boundary, updated throughout the adaptation process. Details can be found in Section 3.2.
Figure 10: Effect of negatively biased feedback. We conduct an additional simulation study with the two-moons dataset and make the same observations as in Figure 3, i.e., NBF is distributed in a biased manner, leading to inferior adaptation performance compared to RF. The experimental details are provided in Section 3.2 and Section B.

C Additional Ablation Study

Reliable sample filtering.

An important design choice in our approach is to retain only samples with reliable pseudo labels among $\{(x^n_{ulb}, \hat{y}^n_{ulb}) : n \in [1..N_{ulb}]\}$. We evaluate the adaptation performance while varying the filtering ratio $p\%$ in Table 8. A higher $p$ increases the likelihood of the bank being contaminated with samples carrying incorrect pseudo labels (i.e., $y_{ulb} \neq \hat{y}_{ulb}$), while a lower $p$ decreases the diversity of the defending samples. We observe that our approach is robust to the hyper-parameter $p$ and achieves reasonable performance with $p=0.4$.
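A sketch of this filtering step is given below, assuming softmax outputs and a per-class top-$p\%$ confidence cutoff; `build_defending_bank` is an illustrative name, not the identifier used in our implementation.

```python
import numpy as np

def build_defending_bank(probs, p=0.4):
    """Keep, per class, the top-p% most confident unlabeled samples as
    candidate defending samples.

    probs: (N, C) softmax outputs of the adapting model
    Returns: dict class -> indices of retained samples, most confident first
    """
    pseudo = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    bank = {}
    for c in np.unique(pseudo):
        idx = np.flatnonzero(pseudo == c)
        keep = max(1, int(p * idx.size))          # top-p% cutoff per class
        bank[c] = idx[np.argsort(-conf[idx])[:keep]]
    return bank

# toy usage with random probability vectors over 10 classes
bank = build_defending_bank(np.random.dirichlet(np.ones(10), size=1000), p=0.4)
```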

Combining with SFDA methods.

Recent SFDA methods [34, 35] have shown promise in computing the unsupervised loss $\mathcal{L}_{unsup}$. We therefore explore their potential as baselines within our framework. To construct the overall loss function $\mathcal{L}_{total}$ in Eq. (2), we simply combine their $\mathcal{L}_{unsup}$ with the supervised loss $\mathcal{L}_{sup}$ of FreeMatch [81], since SFDA methods do not take the utilization of a supervised loss into account. The results are presented in Table 12. Interestingly, some SFDA works [34, 8, 35] using sophisticated techniques, such as k-means clustering [21] and contrastive learning [22], appear less susceptible to NBF. However, the trend is not consistent across all methods: NRC [93], which uses a nearest-neighbor strategy, shows sub-optimal performance under the NBF assumption. Notably, all SFDA methods achieve their best adaptation performance when combined with our method. This suggests that even methods that partially mitigate NBF's unexpected effects can further benefit from our method.

feed. amount 378 (3 labeled data per class) 630 (5 labeled data per class)
method RF NBF w/ ours RF NBF w/ ours
SHOT [34] 69.6 70.7 (+1.1) 71.5 (+0.8) 71.1 72.3 (+1.2) 73.0 (+0.7)
NRC [93] 66.3 64.9 (-1.4) 69.3 (+4.4) 68.5 66.4 (-2.1) 69.6 (+3.2)
ContraTTA [8] 68.6 69.2 (+0.6) 71.6 (+2.4) 70.1 70.5 (+0.4) 72.4 (+1.9)
ResNet-50 GuidingSP [35] 69.7 70.2 (+0.5) 71.8 (+1.6) 70.5 71.0 (+0.5) 72.8 (+1.8)
SHOT [34] 73.4 73.7 (+0.3) 74.1 (+0.4) 74.4 74.8 (+0.4) 75.4 (+0.6)
NRC [93] 72.2 71.9 (-0.3) 72.9 (+1.0) 73.9 73.7 (-0.2) 74.6 (+0.9)
ContraTTA [8] 72.8 73.4 (+0.6) 74.9 (+1.5) 73.9 74.8 (+0.9) 76.4 (+1.6)
ViT-S GuidingSP [35] 73.3 73.7 (+0.4) 75.0 (+1.3) 74.1 74.9 (+0.8) 76.4 (+1.5)
Table 12: Comparisons on DomainNet-126. We combine our method and SFDA methods. The average accuracy (%) of seven domain-shift scenarios is reported. We use the same pre-trained model as in Table 3.

Number of appended defending samples.

As mentioned in Section 4.2, we incorporate $k$ defending samples for each labeled data point $(x^b_{lb}, y^b_{lb})$ to decrease the unexpected impact of NBF on the supervised signal. To understand how the value of $k$ affects performance, we conduct an ablation study in Table 13. We fix the number of labeled data points at 16 and maintain a total batch size of 128 by adjusting the ratio $\mu$ in Eq. (1). For instance, with $k{=}4$, the ratio $\mu$ is set to 3 (i.e., $16 + 16{\times}k + 16{\times}\mu = 128$). Our experiments across two different architectures reveal that $k{=}3$ generally yields good adaptation performance. Consequently, we adopt $k{=}3$ for all experiments; a minimal sketch of the resulting batch composition is given below.
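The sketch below illustrates the batch composition with $B{=}16$, $k{=}3$, and $\mu{=}4$. The retrieval rule shown, drawing defending samples from classes other than the feedback label, is a simplifying assumption for illustration rather than our exact retrieval procedure.

```python
import numpy as np

def compose_batch(lb_x, lb_y, bank, ulb_x, k=3, mu=4, rng=np.random):
    """Assemble one adaptation minibatch with retrieval latent defending.

    lb_x, lb_y: B labeled feedback points (possibly negatively biased)
    bank:       dict class -> (n_c, d) array of candidate defending samples
    ulb_x:      (N, d) pool of unlabeled target samples
    With B = 16 and k = 3, mu = 4 keeps the batch at 16 + 48 + 64 = 128.
    """
    B = len(lb_x)
    classes = np.array(sorted(bank.keys()))
    def_x, def_y = [], []
    for y in lb_y:
        # assumed rule: draw k defending samples from classes other than the
        # feedback label, re-balancing the supervised signal in the batch
        for c in rng.choice(classes[classes != y], size=k, replace=True):
            def_x.append(bank[c][rng.randint(len(bank[c]))])
            def_y.append(c)
    ulb = ulb_x[rng.choice(len(ulb_x), size=mu * B, replace=False)]
    x_sup = np.concatenate([lb_x, np.stack(def_x)])   # 16 + 48 supervised points
    y_sup = np.concatenate([lb_y, np.array(def_y)])
    return x_sup, y_sup, ulb                          # plus 64 unlabeled points
```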

method                    | k=1  | k=2  | k=3  | k=4  | baseline
FreeMatch [81], ResNet-50 | 74.0 | 74.6 | 74.8 | 74.4 | 72.0
FreeMatch [81], ViT-S     | 75.5 | 75.9 | 75.7 | 75.4 | 73.9
Table 13: We ablate the number of defending samples $k$ in Eq. (2). We also report the performance of the baseline without our approach (rightmost column).
                          | only $\mathcal{L}_{unsup}$ | the overall loss $\mathcal{L}_{total}$ in Eq. (2)
pseudo-feedback per class | 0    | 3    | w/ ours     | 5    | w/ ours
NRC [93]                  | 63.5 | 63.4 | 64.6 (+1.2) | 63.4 | 64.4 (+1.0)
ContrastiveTTA [8]        | 66.6 | 66.6 | 67.4 (+0.8) | 66.5 | 67.2 (+0.7)
Table 14: Although outside our scope, we consider a zero-feedback scenario in which a user does not provide any feedback. To evaluate our method in this scenario, we leverage unlabeled target data and their pseudo labels for semi-supervised adaptation.

Under a zero feedback scenario.

We note that, like previous SemiSDA [58, 6] and SemiSL [66, 81] works, we assume that a user provides a small amount of feedback (i.e., labeled data) during their interaction with an ML application. Nevertheless, a broader question arises: how can our method be used when no feedback is received? This scenario, while beyond the scope of our work, presents an intriguing area for further exploration, so we investigate the potential impact of our method under it. We first use the SFDA baselines of Table 12, which have demonstrated potential in the absence of labeled target data, and assess their performance in an SFDA setup (i.e., only $\mathcal{L}_{unsup}$ in Table 14). Then, pseudo-feedback is generated by randomly selecting a small set of unlabeled samples, together with their pseudo labels, from the samples with high predicted probabilities. With this pseudo-feedback and the unlabeled target data, we conduct SemiSDA and report the results (i.e., the overall loss $\mathcal{L}_{total}$ in Table 14). We find that i) simulating pseudo-feedback has only a minor influence on the SFDA baselines, yet ii) adaptation performance is enhanced by combining them with our method. Based on these results, we believe that even in the absence of feedback for certain classes, SemiSDA with our method can achieve good adaptation performance by leveraging pseudo-feedback.
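One possible realization of this pseudo-feedback generation is sketched below; the confidence threshold of 0.95 and the three samples per class are illustrative assumptions, and `make_pseudo_feedback` is a hypothetical helper name.

```python
import numpy as np

def make_pseudo_feedback(probs, n_per_class=3, conf_thr=0.95, rng=None):
    """Zero-feedback scenario: fabricate 'pseudo-feedback' by randomly picking
    a few high-confidence unlabeled samples per class and treating their
    pseudo labels as if they were user feedback.

    probs: (N, C) softmax outputs of the source (or adapting) model
    Returns: (indices of pseudo-feedback samples, all pseudo labels)
    """
    rng = rng or np.random.default_rng(0)
    pseudo = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    picked = []
    for c in np.unique(pseudo):
        idx = np.flatnonzero((pseudo == c) & (conf >= conf_thr))
        if idx.size:
            picked.extend(rng.choice(idx, size=min(n_per_class, idx.size),
                                     replace=False))
    return np.array(picked), pseudo
```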

D Additional Experimental Details

Details for medical experiments.

We use the DenseNet-121 [26] provided by the TorchXRayVision repository [13]. This architecture consists of a shared backbone and multiple classification heads for radiographic findings. Given a 256×256 image as input, it produces sigmoid outputs for thirteen different findings.

The majority of SemiSDA methods, such as AdaMatch [6] and FreeMatch [81], depend on consistency regularization, which requires image-augmentation strategies such as ColorJitter and GaussianBlur [52]. Unfortunately, applying these to medical images remains challenging, as most strategies were proposed specifically for natural images. As a result, we employ pseudo-labeling [1], a fundamental SemiSL algorithm that (i) obviates the need for image augmentations and (ii) can easily be implemented for a multi-finding binary classification setup. More specifically, we substitute the cross-entropy $\mathcal{H}(\cdot,\cdot)$ in Eq. (3) with the binary cross-entropy loss. To generate pseudo labels (i.e., presence or absence in Table 5 (top)), we use thresholds pre-calculated on the source domain. The hyper-parameters for model updates are as follows.

             | batch size | learning rate | optimizer | weight decay
pre-training | 128        | 1e-3          | Adam      | 1e-5
adaptation   | 128        | 1e-4          | Adam      | 1e-5
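A sketch of the resulting multi-finding objective is shown below, assuming the model returns logits (rather than sigmoid outputs) and that per-finding thresholds are given; `multi_finding_losses` is an illustrative name.

```python
import torch
import torch.nn.functional as F

def multi_finding_losses(model, x_lb, y_lb, x_ulb, thresholds):
    """Pseudo-labeling with binary cross-entropy for the multi-finding setup:
    Eq. (3) with H(.,.) replaced by BCE, and pseudo labels set by per-finding
    thresholds pre-computed on the source domain.

    y_lb:       (B, 13) binary presence labels from user feedback
    thresholds: (13,)   per-finding operating points
    """
    l_sup = F.binary_cross_entropy_with_logits(model(x_lb), y_lb.float())

    ulb_logits = model(x_ulb)
    # presence / absence pseudo labels from source-domain thresholds
    pseudo = (torch.sigmoid(ulb_logits) >= thresholds).float().detach()
    l_unsup = F.binary_cross_entropy_with_logits(ulb_logits, pseudo)
    return l_sup, l_unsup
```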

Details for semantic segmentation experiments.

Our experiments leverage the GTA5 [56] and Cityscapes [14] datasets as the source and target domains. To compute the supervised loss $\mathcal{L}_{sup}$ and unsupervised loss $\mathcal{L}_{unsup}$ in Eq. (1), we employ the baseline algorithms IAST [46] (as in LabOR [63]) and RIPU [85]. Following previous works [63, 85], we utilize ResNet-101 as the backbone architecture and DeepLab-v2 as the segmentation model. Further details regarding implementation and adaptation hyper-parameters can be found in the publicly available codebase of RIPU [85]. One of our method's key strengths is its simplicity, which makes it readily applicable to tasks like semantic segmentation. More specifically, we first identify the pixels in an image whose probabilities are in the top 40% for each class. Among them, we select three pixels (i.e., defending pixels) for each labeled pixel in order to balance the supervised signal (i.e., $\mathcal{L}_{total}$ in Eq. (2)) and obtain adaptation performance that is robust to the unexpected effect of NBF.
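A sketch of this defending-pixel selection is given below; sampling uniformly from the pooled top-40% pixels across classes, rather than conditioning on the class of each labeled pixel, is a simplifying assumption for illustration.

```python
import torch

def pick_defending_pixels(probs, labeled_mask, k=3, p=0.4, gen=None):
    """Sketch of defending-pixel selection for segmentation.

    probs:        (C, H, W) per-pixel class probabilities
    labeled_mask: (H, W) bool mask of user-labeled pixels
    For every labeled pixel we draw k pixels from the pool of points whose
    confidence is in the top p (here 40%) for their predicted class.
    """
    gen = gen or torch.Generator().manual_seed(0)
    conf, pred = probs.max(dim=0)                 # (H, W) confidence and class
    pool = []
    for c in pred.unique():
        m = (pred == c)
        thr = torch.quantile(conf[m], 1 - p)      # top-p% cutoff for class c
        pool.append(torch.nonzero(m & (conf >= thr)))
    pool = torch.cat(pool)                        # (M, 2) candidate pixel coords
    n = k * int(labeled_mask.sum())               # k defending pixels per label
    sel = torch.randint(len(pool), (n,), generator=gen)
    return pool[sel]                              # coords of defending pixels
```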

E Additional Discussion

Footnote 3: We evaluate SSNLL using the same experimental setup as in Table 12.

E.1 Technique novelty

Compared to previous works, our approach, retrieval latent defending, distinguishes itself in how balancing is applied to solve the novel NBF problem: (i) We initially anticipated that conventional tricks using confident pseudo labels or balancing strategies, such as CReST [83] for class imbalance, CLUE [55] and DiaNA [25] for ActiveDA, and GuidSP [35] and SSNLL [10] for noisy pseudo labels, would ameliorate the NBF issue. However, as shown in the table below (see Footnote 3), we found these methods fall short because they do not specifically target the novel problem posed by NBF, underscoring the need for our tailored approach. (ii) Our strategy diverges from the dataset-level balancing approaches in [83, 55, 10]. Instead, we focus on balancing the supervised signal within a minibatch through iterative retrieval of defending samples, which helps fortify the latent space against the unexpected issue caused by NBF, as illustrated in Figure 4 and Table 5. Surprisingly, this distinct approach not only effectively addresses the NBF problem but also leads to substantial improvements in adaptation performance.

method    | CReST (CVPR21) | CLUE (ICCV21) | DiaNA (CVPR23) | SSNLL (IROS22) | GuidSP (CVPR23)
reference | Figure 8       | Table 9       | Table 9        | -              | Table 12
accuracy  | 92.6           | 68.6          | 68.1           | 68.9           | 69.2
w/ ours   | 95.8 (+3.2)    | 71.5 (+2.9)   | 71.3 (+3.2)    | 71.4 (+2.5)    | 71.6 (+2.4)
Footnote 4: We specify the database size when the real domain of the DomainNet dataset serves as the target domain.

E.2 Computational overhead

Our method incurs only negligible overhead, since the only additional data that needs to be stored are pseudo labels. As shown in the following table (see Footnote 4), our method adds 0.1 MB of memory and a 3% increase in running time compared to existing SemiSDA [6, 81] and SFDA [35] methods; these modest costs enable significant performance gains. We adhere to the standard practice of SemiSDA and SFDA, which involves storing target images in a database (DB).

method    | AdaMatch (ICLR22) | w/ ours     | GuidSP (CVPR23) | w/ ours     | FreeMatch (ICLR23) | w/ ours
reference | Table 3           | Table 3     | Table 12        | Table 12    | Table 11           | Table 11
DB size   | 55k images        | 55k images  | 55k images      | 55k images  | 5k images          | 5k images
add. data | 0 MB              | 0.1 MB      | 53.8 MB         | 53.9 MB     | 0 MB               | 0.01 MB
run. time | 132 min           | 136 min     | 150 min         | 155 min     | 14 min             | 15 min
accuracy  | 64.5              | 72.0 (+7.5) | 70.2            | 71.8 (+1.6) | 66.9               | 68.9 (+2.0)

E.3 Limitations

Machine learning (ML)-powered products can collect target data in various ways. Beyond the unlabeled data encountered in the target environment (e.g., driving scenes from a self-driving car), feedback containing valuable target information can be collected from users. For example, a radiologist can log misdiagnosed chest X-ray images in a medical application. However, effectively leveraging such feedback to enhance the deployed model has not yet been well studied. This paper addressed this issue by proposing a framework, domain adaptation with user feedback, as illustrated in Figure 2. Moreover, we identified potential issues (i.e., the unexpected impact of NBF) and introduced a simple and scalable solution (i.e., retrieval latent defending).

However, a few more considerations must be addressed before this framework is applied in the real world. (1) Current SemiSDA and SemiSL works typically conduct a single adaptation round using all target training data. In practice, however, periodic adaptation may be required, since the model can continuously collect new data. According to CoTTA [79], EATA [47], and EcoTTA [67], which are studies that make initial TTA research [77, 39, 8] more realistic, long-term adaptation can lead to catastrophic forgetting and error accumulation. They address this problem with continual learning strategies, e.g., random parameter restoration and knowledge distillation. Repeated adaptation in our setup might raise similar issues, suggesting a potential connection between SemiSDA methods and continual learning techniques. (2) SemiSDA methods specializing in medical imaging still need to be developed. We employed the naive SemiSDA method, pseudo-labeling [1], in Table 5; developing SemiSDA methods specific to medical imaging has the potential to improve adaptation performance well beyond the results of Table 5 and is a promising direction for future research.

F Results of All Domain Shifts

In addition to Table 3, Table 5.1, and Table 12, we report the adaptation results for all domain shift scenarios in Table 15, Table 16, Table 17, Table 18, Table 19, and Table 20.

method feedback average real→clip. real→pain. pain.→clip. clip.→sket. sket.→pain. real→sket. pain.→real
Source model - 56.5 56.1 63.7 55.2 48.0 51.7 45.8 74.7
\cdashline2-11 FixMatch [66] RF 67.6 66.2 68.3 68.2 61.0 69.8 58.7 80.8
NBF 63.4 62.4 65.1 64.8 55.8 64.6 52.7 78.4
w/ ours NBF 73.2 75.0 74.3 74.7 66.9 71.8 65.4 84.1
UDA [87] RF 69.2 68.7 70.0 69.8 62.8 70.9 60.0 82.0
NBF 64.9 64.5 66.0 67.3 57.2 66.3 53.8 79.5
w/ ours NBF 73.4 76.2 74.0 74.7 67.4 71.9 65.7 84.1
FlexMatch [96] RF 73.3 76.7 74.0 75.6 66.9 73.2 64.4 82.5
NBF 71.4 74.8 72.2 74.5 63.8 71.1 61.7 81.4
w/ ours NBF 74.7 77.9 74.8 77.8 68.9 72.2 66.9 84.4
FreeMatch [81] RF 73.8 76.6 74.2 75.5 67.7 73.5 65.1 84.0
NBF 72.0 75.5 72.9 74.6 65.0 72.3 62.0 81.7
w/ ours NBF 74.8 78.1 74.5 77.1 68.8 72.4 67.3 85.0
\cdashline2-11 MME [58] RF 69.5 70.0 71.2 69.3 63.5 69.6 61.7 81.5
NBF 68.4 69.5 70.7 69.1 61.5 69.0 58.8 80.2
w/ ours NBF 70.8 72.9 71.6 72.9 64.0 68.4 62.1 83.5
CDAC [33] RF 68.3 67.1 69.0 68.9 62.6 69.9 59.5 81.1
NBF 64.6 64.5 66.2 66.3 56.9 65.8 53.6 78.6
w/ ours NBF 73.2 76.1 73.9 74.4 67.0 71.2 65.8 84.1
AdaMatch [6] RF 67.6 66.6 68.5 68.5 60.3 69.2 58.7 81.5
NBF 64.5 64.3 66.1 65.6 56.9 65.6 54.2 78.9
ResNet-50 [23] w/ ours NBF 72.0 74.5 72.7 73.9 65.5 70.0 64.3 83.2
\cdashline2-11 Fully sup. - 83.6 85.6 81.4 85.6 80.4 81.4 80.4 90.1
Source model - 64.5 63.6 70.2 61.6 56.7 65.5 53.5 80.5
\cdashline2-11 FixMatch [66] RF 74.6 75.5 77.1 73.8 67.7 75.9 67.1 85.1
NBF 73.0 73.8 75.4 74.0 65.1 72.8 66.1 83.8
w/ ours NBF 75.6 77.1 77.7 77.3 67.8 76.8 68.0 84.7
UDA [87] RF 74.8 75.5 77.1 74.0 67.9 76.1 67.4 85.4
NBF 73.3 74.1 75.6 74.3 65.4 73.2 66.3 83.9
w/ ours NBF 75.8 77.1 77.8 77.6 68.2 77.1 68.2 84.9
FlexMatch [96] RF 74.9 75.5 77.0 74.7 68.4 76.2 66.7 85.7
NBF 73.9 74.5 76.6 75.1 66.1 74.5 66.4 84.1
w/ ours NBF 75.8 77.2 77.5 77.9 68.3 77.0 67.9 85.0
FreeMatch [81] RF 74.9 75.3 76.8 74.5 68.1 76.5 67.0 86.0
NBF 73.9 74.6 76.4 75.0 66.0 74.5 66.5 84.1
w/ ours NBF 75.7 76.9 77.5 77.9 68.1 76.7 67.8 85.2
\cdashline2-11 MME [58] RF 73.2 74.0 74.8 73.0 66.5 74.6 65.2 84.3
NBF 72.7 73.2 74.8 73.8 65.3 73.0 64.8 83.8
w/ ours NBF 74.1 75.4 75.9 76.2 66.2 74.7 66.4 84.2
CDAC [33] RF 74.2 74.8 76.3 73.8 67.5 75.5 66.6 84.9
NBF 72.8 73.6 74.9 73.9 65.0 72.8 65.4 83.8
w/ ours NBF 75.4 76.7 77.6 77.2 67.6 76.2 67.9 84.6
AdaMatch [6] RF 74.7 75.3 76.9 73.8 68.0 76.3 67.1 85.5
NBF 73.7 74.7 76.2 74.7 65.7 74.0 66.8 84.0
ViT-S [15] w/ ours NBF 75.9 76.9 77.8 77.8 68.5 76.6 68.3 85.1
\cdashline2-11 Fully sup. - 85.4 87.8 83.4 87.8 81.3 83.4 81.3 92.7
Table 15: Adaptation results with SemiSL and SemiSDA methods on DomainNet-126. The adaptation performance on various domain shifts is reported, where the number of labeled data per class is 3. The details can be found in Table 1.
method feedback average real→clip. real→pain. pain.→clip. clip.→sket. sket.→pain. real→sket. pain.→real
Source model - 56.5 56.1 63.7 55.2 48.0 51.7 45.8 74.7
\cdashline2-11 FixMatch [66] RF 71.5 71.3 70.9 73.1 65.5 71.9 65.1 83.0
NBF 66.1 66.2 67.6 67.6 57.4 67.4 56.5 79.8
w/ ours NBF 75.1 77.2 75.7 77.2 69.8 73.9 68.0 84.1
UDA [87] RF 72.9 73.4 72.6 74.6 67.1 73.1 65.9 83.4
NBF 68.8 70.3 68.7 71.1 60.3 70.3 60.5 80.5
w/ ours NBF 75.3 78.3 75.2 77.9 69.6 73.9 68.1 84.3
FlexMatch [96] RF 75.3 78.5 74.6 77.5 70.3 73.8 68.7 83.8
NBF 73.9 77.3 74.0 76.3 66.2 73.8 67.2 82.6
w/ ours NBF 76.0 79.5 75.6 78.7 70.2 74.3 69.0 84.7
FreeMatch [81] RF 75.6 78.6 74.9 77.6 70.2 74.3 69.0 84.7
NBF 74.4 77.6 74.5 76.3 66.8 73.8 68.2 83.5
w/ ours NBF 76.1 79.6 75.5 78.6 70.4 74.5 69.3 84.9
\cdashline2-11 MME [58] RF 71.2 71.3 72.1 71.8 65.6 70.7 64.6 82.6
NBF 70.1 71.4 71.4 70.4 62.1 70.7 62.7 81.8
w/ ours NBF 72.5 74.5 72.7 74.9 66.4 70.7 64.6 83.8
CDAC [33] RF 71.7 71.5 71.7 73.0 66.1 72.0 64.8 82.9
NBF 68.1 69.5 68.9 69.3 59.8 69.4 59.7 80.0
w/ ours NBF 74.9 77.0 74.9 77.0 69.6 73.4 67.9 84.2
AdaMatch [6] RF 70.9 70.6 70.4 72.7 65.3 70.8 63.7 83.0
NBF 67.7 69.0 68.7 69.7 59.5 67.6 58.8 80.4
ResNet-50 [23] w/ ours NBF 74.3 76.7 74.4 76.8 68.8 72.8 66.2 84.1
\cdashline2-11 Fully sup. - 83.6 85.6 81.4 85.6 80.4 81.4 80.4 90.1
Source model - 64.5 63.6 70.2 61.6 56.7 65.5 53.5 80.5
\cdashline2-11 FixMatch [66] RF 75.7 76.5 77.4 76.2 69.6 76.9 67.9 85.8
NBF 74.3 75.7 75.6 75.6 67.4 74.2 66.7 84.7
w/ ours NBF 76.5 78.0 77.9 78.3 70.2 76.9 68.7 85.4
UDA [87] RF 75.9 76.7 77.4 76.4 69.8 76.9 68.1 85.9
NBF 74.5 75.9 76.0 76.0 67.6 74.4 67.0 84.9
w/ ours NBF 76.7 78.2 78.2 78.8 70.6 76.9 68.8 85.5
FlexMatch [96] RF 76.0 76.5 77.2 76.8 70.1 77.3 68.1 86.2
NBF 75.1 76.2 76.6 76.2 68.9 75.5 67.4 85.1
w/ ours NBF 76.9 78.9 77.9 79.1 70.4 77.6 68.6 86.0
FreeMatch [81] RF 76.0 76.7 77.1 76.6 69.9 77.1 68.0 86.3
NBF 75.1 76.2 76.4 76.3 69.0 75.6 67.4 85.1
w/ ours NBF 76.8 78.5 77.8 78.5 70.5 77.4 68.8 85.9
\cdashline2-11 MME [58] RF 74.5 75.3 75.4 75.2 68.2 75.4 66.5 85.1
NBF 74.0 74.9 75.2 75.2 67.3 74.1 66.3 84.7
w/ ours NBF 75.2 76.4 76.4 77.3 68.9 75.4 66.9 85.1
CDAC [33] RF 75.4 76.3 76.9 75.5 69.2 76.4 67.8 85.6
NBF 74.1 75.1 75.3 75.4 67.4 73.9 66.5 84.6
w/ ours NBF 76.2 77.8 77.4 78.3 70.0 76.4 68.5 85.3
AdaMatch [6] RF 75.9 76.6 77.1 76.6 70.0 76.9 68.2 86.1
NBF 75.1 76.2 76.7 76.3 68.1 75.5 67.5 85.2
ViT-S [15] w/ ours NBF 76.7 78.6 78.0 78.8 69.6 77.2 68.8 86.0
\cdashline2-11 Fully sup. - 85.4 87.8 83.4 87.8 81.3 83.4 81.3 92.7
Table 16: Adaptation results with SemiSL and SemiSDA methods on DomainNet-126. The adaptation performance on various domain shifts is reported, where the number of labeled data per class is 5. The details can be found in Table 1.
method feedback average a → c a → p a → r c → a c → p c → r p → a p → c p → r r → a r → c r → p
Source - 57.6 44.2 65.6 71.6 47.3 60.2 58.2 47.9 40.8 69.8 60.6 46.5 78.1
\hdashlineFreeMatch [81] RF 71.4 56.6 79.7 76.3 67.9 83.2 74.5 65.5 58.6 78.3 69.4 62.4 84.8
NBF 68.6 53.0 76.1 75.3 65.3 78.5 74.8 62.5 56.7 74.7 66.7 56.2 83.7
w/ ours NBF 73.7 60.8 80.3 80.5 69.2 84.0 78.6 67.7 62.3 80.1 70.0 64.1 87.2
UDA [87] RF 72.2 56.1 81.0 76.8 68.0 83.4 75.6 67.1 59.7 79.7 69.8 62.7 86.4
NBF 69.5 53.3 78.6 75.7 66.3 79.7 75.8 63.7 57.2 75.7 66.7 57.2 83.9
w/ ours NBF 74.1 61.1 80.7 80.3 69.0 85.9 79.2 68.0 62.3 80.7 70.4 63.9 87.4
FlexMatch [96] RF 73.7 58.0 84.6 79.3 68.4 84.7 78.8 68.4 62.8 79.8 70.6 62.9 86.3
NBF 72.1 56.1 79.0 77.8 68.4 83.4 77.6 67.5 60.1 79.2 68.8 60.5 86.2
w/ ours NBF 74.7 60.8 81.7 81.1 70.0 85.8 79.8 68.8 61.4 81.4 70.2 65.7 89.4
FreeMatch [81] RF 74.0 58.5 85.0 79.4 68.2 84.7 79.2 68.4 62.5 80.4 71.0 63.7 87.0
NBF 72.2 56.4 79.3 77.7 67.7 83.4 78.5 67.3 60.5 79.1 69.2 61.0 86.9
w/ ours NBF 74.8 60.6 81.4 81.5 70.8 86.7 80.0 68.6 61.6 81.7 69.8 66.2 89.2
\hdashlineMME [58] RF 71.2 56.2 80.4 75.7 65.1 81.0 76.7 64.5 59.0 79.8 69.0 62.0 85.1
NBF 70.2 55.0 77.6 76.8 65.1 82.2 77.7 61.1 57.1 77.1 68.8 58.1 85.4
w/ ours NBF 73.4 60.5 81.4 80.0 68.6 84.8 78.4 65.3 61.3 79.8 69.8 62.8 87.5
CDAC [33] RF 71.2 55.5 80.0 76.4 67.1 82.4 75.8 64.5 58.7 79.0 69.2 61.5 84.4
NBF 69.0 54.1 76.2 75.4 64.1 79.5 75.4 63.9 57.9 75.2 66.5 55.8 83.6
w/ ours NBF 74.3 63.7 81.3 80.4 70.0 85.4 79.0 67.9 62.2 80.3 69.6 65.1 86.9
AdaMatch [6] RF 70.9 55.4 80.4 75.9 65.7 81.5 74.6 65.9 58.7 78.4 68.8 61.5 84.3
NBF 69.3 54.2 76.6 75.3 65.9 79.3 75.5 63.7 57.4 75.9 66.7 56.8 84.2
w/ ours NBF 73.8 62.2 81.0 79.7 68.8 85.4 78.6 67.7 61.7 79.5 69.0 64.1 88.2
\hdashlineFully sup. - 87.4 84.5 95.1 89.0 80.9 95.1 89.0 80.9 84.5 89.0 80.9 84.5 95.1
Table 17: Adaptation results with SemiSL and SemiSDA methods on OfficeHome. The adaptation performance on various domain shifts is reported, where the number of labeled data per class is 3. The details can be found in Table 2.
method feedback average a → c a → p a → r c → a c → p c → r p → a p → c p → r r → a r → c r → p
Source - 57.6 44.2 65.6 71.6 47.3 60.2 58.2 47.9 40.8 69.8 60.6 46.5 78.1
\hdashlineFreeMatch [81] RF 73.9 59.8 83.9 80.0 69.2 84.6 77.8 66.7 63.7 80.5 71.2 63.3 85.8
NBF 72.2 57.6 82.4 76.3 68.2 82.0 76.6 65.3 61.4 78.0 71.4 60.9 86.6
w/ ours NBF 75.3 63.9 84.6 79.0 70.0 85.7 79.1 68.6 64.8 81.1 73.0 65.4 88.6
UDA [87] RF 74.4 60.2 84.5 79.8 68.4 85.1 79.8 66.5 64.4 80.7 72.2 64.4 86.1
NBF 73.0 58.7 82.6 77.4 68.6 82.5 77.3 66.9 61.9 78.8 71.4 62.0 87.4
w/ ours NBF 76.0 64.9 84.5 79.4 71.2 85.8 79.8 71.4 65.4 80.5 74.2 66.3 88.9
FlexMatch [96] RF 75.9 64.3 84.9 82.1 69.6 85.7 80.4 69.2 65.7 82.3 74.2 65.4 87.3
NBF 74.9 62.9 83.2 77.6 70.2 84.7 80.5 69.8 62.9 79.5 74.4 64.4 87.7
w/ ours NBF 76.6 63.3 86.7 79.5 71.6 86.9 81.0 72.0 65.7 81.3 75.0 67.5 88.9
FreeMatch [81] RF 75.8 63.2 85.2 81.8 70.0 86.3 80.6 69.0 65.8 82.1 73.2 65.6 87.0
NBF 75.0 63.2 83.6 77.4 70.0 84.9 80.5 70.4 62.6 79.8 74.6 63.9 88.9
w/ ours NBF 76.6 63.4 85.6 79.8 71.8 86.3 81.2 71.8 65.3 81.8 74.8 67.5 89.6
\hdashlineMME [58] RF 73.5 59.6 82.4 78.7 67.3 83.6 79.2 67.3 62.4 80.5 71.4 63.2 86.6
NBF 73.1 59.5 83.2 77.2 66.5 82.5 78.3 65.1 61.5 79.1 72.8 62.8 88.3
w/ ours NBF 75.6 63.6 84.2 77.3 69.8 85.5 80.3 70.8 65.2 80.5 74.6 66.6 88.9
CDAC [33] RF 73.5 59.7 83.4 79.3 68.6 84.5 78.1 66.3 63.4 80.5 69.8 63.4 85.1
NBF 72.3 59.5 81.7 76.6 67.7 81.9 76.7 65.9 62.4 77.4 70.8 60.2 86.4
w/ ours NBF 75.7 64.2 84.7 79.0 72.2 85.5 79.6 70.4 65.3 80.4 73.4 65.4 88.6
AdaMatch [6] RF 73.4 60.0 83.8 78.7 68.0 84.4 77.6 66.5 62.5 80.0 71.0 63.2 85.1
NBF 72.7 60.2 81.7 76.9 67.1 81.5 77.2 66.3 61.8 78.7 71.2 62.0 87.1
w/ ours NBF 75.5 63.4 84.4 78.8 70.0 86.0 79.4 70.2 65.3 80.6 72.8 66.6 88.4
\hdashlineFully sup. - 87.4 84.5 95.1 89.0 80.9 95.1 89.0 80.9 84.5 89.0 80.9 84.5 95.1
Table 18: Adaptation results with SemiSL and SemiSDA methods on OfficeHome. The adaptation performance on various domain shifts is reported, where the number of labeled data per class is 5. The details can be found in Table 2.
method feedback average real→clip. real→pain. pain.→clip. clip.→sket. sket.→pain. real→sket. pain.→real
Source model - 56.5 56.1 63.7 55.2 48.0 51.7 45.8 74.7
\cdashline2-11 SHOT [34] RF 69.6 70.2 70.9 69.6 63.4 69.1 61.4 82.8
NBF 70.7 71.7 72.7 71.0 64.1 69.7 62.0 83.6
w/ ours NBF 71.5 73.8 72.8 73.5 64.6 69.8 62.6 83.6
NRC [93] RF 66.3 66.1 69.3 64.8 58.0 67.9 57.6 80.6
NBF 64.9 63.1 68.4 63.6 56.9 67.1 55.1 80.4
w/ ours NBF 69.3 70.2 71.4 69.7 62.1 68.2 62.0 81.4
ContraTTA [8] RF 68.6 72.3 70.4 70.7 60.0 65.1 61.6 80.1
NBF 69.2 72.8 70.9 71.1 60.2 66.5 62.1 80.7
w/ ours NBF 71.6 74.6 72.1 75.3 64.1 69.7 62.7 82.7
GuidingSP [35] RF 69.7 66.6 68.5 68.5 60.3 69.2 58.7 81.5
NBF 70.2 64.3 66.1 65.6 56.9 65.6 54.2 78.9
ResNet-50 [23] w/ ours NBF 71.8 74.5 72.7 73.9 65.5 70.0 64.3 83.2
\cdashline2-11 Fully sup. - 83.6 85.6 81.4 85.6 80.4 81.4 80.4 90.1
Source model - 64.5 63.6 70.2 61.6 56.7 65.5 53.5 80.5
\cdashline2-11 SHOT [34] RF 73.4 73.9 74.9 73.2 66.8 74.8 65.4 84.7
NBF 73.7 74.6 75.6 74.2 67.0 74.4 65.4 84.6
w/ ours NBF 74.1 75.1 75.7 74.9 67.6 74.6 66.0 84.7
NRC [93] RF 72.2 73.0 73.9 72.3 65.6 73.6 63.8 83.0
NBF 71.9 73.1 73.8 72.1 65.2 73.0 64.1 82.3
w/ ours NBF 72.9 73.9 74.9 73.9 65.5 73.4 64.5 84.3
ContraTTA [8] RF 72.8 73.0 74.1 74.7 66.7 73.2 62.9 84.8
NBF 73.4 74.3 75.1 74.6 67.6 73.8 63.7 84.9
w/ ours NBF 74.9 75.4 75.8 76.7 69.2 75.6 66.6 85.0
GuidingSP [35] RF 73.3 73.9 74.5 75.0 66.9 73.7 63.4 85.1
NBF 73.7 74.8 75.5 74.6 67.8 73.9 63.9 85.1
ViT-S [15] w/ ours NBF 75.0 75.6 75.8 76.9 69.1 75.6 66.5 85.2
\cdashline2-11 Fully sup. - 85.4 87.8 83.4 87.8 81.3 83.4 81.3 92.7
Table 19: Adaptation results with SFDA methods on DomainNet-126. The adaptation performance on various domain shifts is reported, where the number of labeled data per class is 3. The details can be found in Table 12.
method feedback average real→clip. real→pain. pain.→clip. clip.→sket. sket.→pain. real→sket. pain.→real
Source model - 56.5 56.1 63.7 55.2 48.0 51.7 45.8 74.7
\cdashline2-11 SHOT [34] RF 71.1 71.9 72.6 70.8 65.3 70.1 63.6 83.1
NBF 72.3 73.3 74.0 73.1 65.8 71.5 64.4 84.2
w/ ours NBF 73.0 75.2 74.2 74.3 66.3 71.4 65.1 84.5
NRC [93] RF 68.5 68.6 70.1 68.3 61.1 68.6 61.5 81.2
NBF 66.4 65.4 69.0 65.7 58.9 67.0 58.6 80.7
w/ ours NBF 69.6 70.6 72.2 70.2 61.9 68.1 62.4 81.6
ContraTTA [8] RF 70.1 73.7 71.0 72.4 61.8 67.0 64.0 81.0
NBF 70.5 74.4 71.8 72.3 61.4 67.8 64.2 81.3
w/ ours NBF 72.4 76.0 73.3 73.1 64.8 71.3 65.0 83.2
GuidingSP [35] RF 70.5 70.9 70.6 70.4 72.7 65.3 70.8 63.7
NBF 71.0 67.7 69.0 68.7 69.7 59.5 67.6 58.8
ResNet-50 [23] w/ ours NBF 72.8 74.3 76.7 74.4 76.8 68.8 72.8 66.2
\cdashline2-11 Fully sup. - 83.6 85.6 81.4 85.6 80.4 81.4 80.4 90.1
Source model - 64.5 63.6 70.2 61.6 56.7 65.5 53.5 80.5
\cdashline2-11 SHOT [34] RF 74.4 75.1 75.6 74.6 68.5 75.2 67.0 85.0
NBF 74.8 75.9 76.3 75.1 68.7 75.8 66.7 85.3
w/ ours NBF 75.4 77.3 76.5 75.9 69.2 76.1 67.1 85.4
NRC [93] RF 73.9 75.1 75.1 73.8 67.4 74.2 66.3 85.5
NBF 73.7 74.8 74.9 73.8 67.2 73.8 66.2 85.0
w/ ours NBF 74.6 76.0 75.9 75.5 67.9 74.5 66.7 85.3
ContraTTA [8] RF 73.9 74.3 74.9 76.2 68.5 74.1 64.7 84.9
NBF 74.8 74.9 75.7 76.2 69.2 75.3 66.7 85.5
w/ ours NBF 76.4 77.2 76.4 79.0 70.9 76.8 67.8 86.5
GuidingSP [35] RF 74.1 74.2 75.0 76.5 68.9 74.2 64.9 85.0
NBF 74.9 74.9 75.8 76.3 69.1 75.2 66.8 85.9
ViT-S [15] w/ ours NBF 76.4 77.4 76.4 79.1 70.9 76.8 67.7 86.6
\cdashline2-11 Fully sup. - 85.4 87.8 83.4 87.8 81.3 83.4 81.3 92.7
Table 20: Adaptation results with SFDA methods on DomainNet-126. The adaptation performance on various domain shifts is reported, where the number of labeled data per class is 5. The details can be found in Table 12.