(eccv) Package eccv Warning: Package ‘hyperref’ is loaded with option ‘pagebackref’, which is *not* recommended for camera-ready version

¹¹institutetext: KAIST ¹¹email: {sb020518, jchoo}@kaist.ac.kr ²²institutetext: Lunit Inc. ²²email: {junha.kim, taesoo.kim, ghnam, tkooi}@lunit.io

Is user feedback always informative?
Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data

Abstract

This paper aims to adapt the source model to the target environment, leveraging small user feedback (i.e., labeled target data) readily available in real-world applications. We find that existing semi-supervised domain adaptation (SemiSDA) methods often suffer from poorly improved adaptation performance when directly utilizing such feedback data, as shown in Figure 1. We analyze this phenomenon via a novel concept called Negatively Biased Feedback (NBF), which stems from the observation that user feedback is more likely for data points where the model produces incorrect predictions. To leverage this feedback while avoiding the issue, we propose a scalable adapting approach, Retrieval Latent Defending. This approach helps existing SemiSDA methods to adapt the model with a balanced supervised signal by utilizing latent defending samples throughout the adaptation process. We demonstrate the problem caused by NBF and the efficacy of our approach across various benchmarks, including image classification, semantic segmentation, and a real-world medical imaging application. Our extensive experiments reveal that integrating our approach with multiple state-of-the-art SemiSDA methods leads to significant performance improvements.

Keywords:

Rethinking user-provided feedback Semi-supervised &
Source-free domain adaptation Medical image diagnosis

1 Introduction

While deep neural networks have demonstrated remarkable performance in the development domain (i.e., source domain) [23, 15], they often suffer from performance degradation in the deployed domain (i.e., target domain) due to domain shift [17, 78, 72]. To mitigate this issue, domain adaptation (DA) techniques have been introduced [70, 34, 58]. The most common DA tasks include semi-supervised domain adaptation (SemiSDA) and source-free domain adaptation (SFDA). SemiSDA aims to adapt the model given a small amount of labeled target data along with massive unlabeled target data [58, 6, 99, 66]. SFDA conducts adaptation with only target data without accessing source data considering data privacy or memory constraints in edge devices [34, 92, 67].

Despite such advances in DA, adapting the model with user feedback still remains an open area for further research, even though practical machine learning (ML) products often allow users to provide feedback in order to further improve the model in the target environment. For example, facial recognition or medical image diagnosis applications enable users to give feedback correcting wrong model predictions, as depicted in Figure 1 (a). Since feedback can be modeled in this case as a small amount of labeled target data, it is anticipated that previous SemiSDA methods assuming the same setup would yield promising results. However, we observe that they show inferior adaptation performance on multiple DA benchmarks when using such user feedback in practice, as shown in the dark-gray bar \makebox(4.0,4.0)[]{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}~{}~{}}} in Figure 1 (b).

Refer to caption — Figure 1: (a) User feedback. Users can provide feedback while interacting with an ML product, where feedback is likely to be biased towards misclassified samples, which we define as Negatively Biased Feedback (NBF). (b) Adaptation results. We adapt the source model with small user feedback and large unlabeled target data using previous semi-supervised domain adaptation (SemiSDA) algorithms. Compared to random feedback, which is the classical SemiSDA setup where labeled data is a random subset of target data, model adaptation with NBF leads to subpar performance. This paper analyzes this problem and introduces a scalable solution.

We introduce a novel concept called Negatively Biased Feedback (NBF) to explain this phenomenon. NBF is based on the observation that user feedback is more likely to be derived from incorrect model predictions. For example, a radiologist might log a misdiagnosed chest X-ray by the model, as its accuracy directly impacts the patient’s survival. Interestingly, our observation aligns with findings from cognitive psychology literature [3, 57] that proves that humans are more likely to react and provide feedback to negative events (i.e., wrong model predictions). Since such an NBF scenario is feasible, we analyze its unexpected impact on SemiSDA observed above. We identify that a biased distribution of NBF within the overall data distribution leads to sub-optimal adaptation results, particularly compared to Random Feedback (RF). RF represents the classical SemiSDA setup, where labeled data is randomly selected from the target data.

To address the problem caused by NBF, we present a scalable approach named Retrieval Latent Defending, which can be seamlessly integrated with existing SemiSDA methods. Our approach allows them to adapt the model without a strong dependence on the biasedly distributed labeled data. Specifically, we balance the supervised adapting signal by appending latent defending samples to the mini-batch and help to keep the model’s balanced class discriminability throughout adapting iterations. We evaluate the unexpected influence of NBF using various benchmarks, including image classification, semantic segmentation, and medical image diagnosis. Building upon these evaluations, we demonstrate that our approach not only complements, but significantly enhances the performance of multiple SemiSDA methods.

The contributions of the paper are as follows:

$\circ$

We introduce the novel concept called Negatively Biased Feedback and uncover that it can lead to sub-optimal adaptation performance of existing SemiSDA methods.
$\circ$

We analyze this problem and present a scalable solution, Retrieval Latent Defending, that combines with SemiSDA methods and allows them to avoid the unexpected effect of NBF.
$\circ$

We show that our approach generalizes through diverse DA benchmarks and improves adaptation results of state-of-the-art SemiSDA methods.
$\circ$

We publicly release the code on https://github.com/junha1125/RLD-SemiSDA.

2 Related Work

Adaptation in the deployment environment.

Real-world ML products often encounter performance degradation caused by gaps between the source and target environment [17]. One solution is to adapt the model using unlabeled data observed in the target domain, referred to as unsupervised domain adaptation (UDA) [72, 59, 37]. Works on UDA use both source and target data to improve the target performance by using methods such as domain discrepancy minimization by adversarial training [41, 70, 76, 18, 73, 72, 59], and self-training with pseudo labels [45, 98, 51, 97]. Source-free DA (SFDA) builds on UDA and imposes an additional constraint that the source data can not be accessed during domain adaptation. This has practical implications for addressing data privacy concerns or barriers in data transmission to edge devices [34, 38, 77, 95]. The majority of recent SFDA works rely on strategies like domain clustering [34], nearest neighbors [92, 93, 91], and contrastive learning [8, 35, 101]. Nevertheless, SFDA does not consider the availability of small labeled data, which may be available in practical ML systems. Semi-supervised DA (SemiSDA) works mainly demonstrate that permitting small labeled data in the target domain can substantially enhance adaptation performance compared to traditional UDA [58]. Their primary strategy is to use domain alignment [58, 33, 20, 94], multi-view consistency [33, 6, 2, 89], and asymmetric co-training [36, 90].

Active domain adaptation

(ActiveDA) [55, 86, 25] envisions a scenario in which the machine selects specific target samples and instructs annotators to label them. The primary objective of ActiveDA is to strategically identify and select the most informative samples for annotation. These chosen samples (i.e., labeled target data) are subsequently utilized to update the source model using SemiSDA methods [58, 33], and the effectiveness of ActiveDA is assessed by evaluating the target performance of the adapted model.

Semi-supervised learning

(SemiSL) aims to reduce expensive human annotations, and propose methods to train a model from scratch using massive unlabeled data along with limited amounts of labeled data [74, 43]. The majority of SemiSL methods depend on consistency regularization [60, 66, 87, 5, 4, 16], which helps the model to make similar predictions for augmented versions of the same image. Moreover, adaptive thresholding [66, 81, 24, 88, 12, 9, 99] is also popularly utilized to produce reliable pseudo labels from unlabeled data.

SemiSDA and SemiSL setups mimic small labeled datasets by randomly selecting subsets of the target dataset, whereas ActiveDA involves selections instructed by the machine. In contrast, this paper posits that in real-world applications, labeled data is typically acquired through user intervention. Additionally, users often provide feedback on samples misclassified by the model (i.e., negatively biased feedback), a process detailed in the following section.

	UDA	SFDA	ActiveDA	SemiSDA	SemiSL	Our setup
Adaptation	$\circ$	$\circ$	$\circ$	$\circ$	$\times$	$\circ$
Source-free	$\times$	$\circ$	$\times$	$\times$	-	$\circ$
Feedback	$\times$	$\times$	machine-instructed	randomly selected	randomly selected	user-provided

The table above summarizes the comparison of relevant studies to our setup. In the table, adaptation means fine-tuning the source pre-trained model (as opposed to training from scratch); feedback represents a small number of labeled target samples. Appendix A provides further comparisons with settings like class-imbalanced SemiSDA and test-time adaptation (TTA).

3 Negatively Biased Feedback

3.1 Adaptation with user feedback.

Our adaptation setup is illustrated in Figure 2. A model is pre-trained on the source data $D_{s}$ . Next, the model is deployed to the target domain, such as a smartphone or a hospital, where we assume the transfer of $D_{s}$ is prohibited due to data privacy regulations or resource constraints (same setup as SFDA [34]). While users utilize ML products on the target domain, the model provides prediction results for data observed in the target domain $D_{t}$ and occasionally obtains user feedback in the form of annotations $y$ . We represent the target data as $D_{t}=X^{lb}_{t}\cup X^{ulb}_{t}$ , where $X^{lb}_{t}=\{(x^{n}_{lb},y^{n}_{lb}):n\in[1..\,N_{lb}]\},X^{ulb}_{t}=\{(x^{n}_{ulb}):n\in[1..\,N_{ulb}]\}$ , $x_{lb}$ and $x_{ulb}$ denote labeled and unlabeled data and $N_{lb}$ and $N_{ulb}$ is their number of data. Lastly, the model can utilize $D_{t}$ and SemiSDA algorithms for adaptation during its inactive phase (e.g., when users do not use the product, like at nighttime) in order to alleviate performance degradation due to domain shift or to personalize the model based on user feedback.

Rethinking user-provided feedback.

Classical SemiSDA works simply assume that a random subset in target data $D_{t}$ is labeled by users when building $X^{lb}_{t}$ . However, as illustrated in Figure 2 , we suggest that users are more likely to provide feedback on misclassified samples by the source model, named negatively biased feedback (NBF). This behavior can be understood from two perspectives: (a) users generally expect their feedback to be used as a basis of model improvement, motivating them to provide NBF, and (b) humans tend to react more strongly to negative experiences, such as receiving incorrect predictions, as observed in psychological studies [3, 57]. We note that the NBF assumption holds more strongly for the medical application: it is reasonable to imagine that the user (i.e., radiologist) logs the mistakes of the model while diagnosing a chest X-ray exam because the diagnostic accuracy of the model is directly related to the patient’s chances of survival. Furthermore, applications beyond the medical domain can also exhibit NBF. For instance, users in self-driving cars can report errors, such as object detection failures or navigation mistakes, to enhance the car’s driving capabilities.

3.2 Influence of NBF on SemiSDA

Simulation study.

As shown in Figure 3, we conduct a simulation study to understand the effect of NBF on SemiSDA. We first use the blobs dataset [53] and construct the source and target data so that domain shift exists between them (left sub-figures). We pre-train a source model on the source data and compute the accuracy in the target domain, where the performance drop due to domain shift is observed (98.5%→76.4%). Next, we simulate two types of feedback (i.e., labeled data): random feedback and negatively biased feedback following a previous SemiSDA setup and our setup, respectively. Specifically, NBF is randomly selected among misclassified samples by the source model. We find that random feedback (RF) points are evenly distributed, while NBF points are biasedly positioned across each class cluster (refer to blue points in the dashed circle in the center sub-figures).

To alleviate the performance drop caused by domain shift, we adapt the model using the target data and a semi-supervised method, Pseudo-labeling [1]. This method iteratively optimizes the model by the cross-entropy loss computed by the ground truth of labeled data and pseudo labels of unlabeled data in a mini-batch (pseudo labels are predicted by the current adapting model so they can be changed according to an updating decision boundary. Further comprehension can be achieved by referring Appendix B.). The SemiSDA results are shown in the right sub-figures, where we make two interesting observations: (i) the distribution of labeled data can contribute significantly to a decision boundary of the adapted model (red arrows in the figure), and (ii) the adapted model under NBF has poorly improved performance compared with one under RF (76.4%→88.1% with NBF, but 76.4%→91.7% with RF).

Unexpected influence of NBF.

Our intuitive reasoning probably suggests that NBF provides more information than RF by correcting more source model deficiencies, and thus leads to better adaptation performance. However, we empirically show that NBF can result in inferior adaptation performance due to its biased distribution across each class cluster, as illustrated in Figure 3. Surprisingly, we also show that this problem persists, even with other state-of-the-art SemiSDA methods and large datasets for various DA benchmarks, including image classification, semantic segmentation, and medical image diagnosis. Our work highlights the importance of careful design when using user feedback in real-world scenarios and, to the best of our knowledge, is the first study to uncover and analyze this phenomenon.

4 Approach

4.1 Prerequisite: Previous SemiSDA method

Previous SemiSDA and SemiSL works typically construct a mini-batch with labeled data $\{(x^{b}_{lb},y^{b}_{lb}):b\in[1..\,B]\}$ , and unlabeled data whose size is $\mu$ times larger than labeled ones $\{(x^{b}_{ulb}):b\in[1..\,\mu{\cdot}B]\}$ , where $B$ is the mini-batch size for labeled data. To adapt the model iteratively, they compute the cross-entropy loss $\mathcal{H}(\cdot,\cdot)$ with labeled data and the consistency regularization to multi-view of unlabeled data, which are formulated as the following:

\footnotesize\mathcal{L}_{sup}=\frac{1}{B}\sum^{B}_{b=1}\mathcal{H}(y_{lb}^{b},f_{\theta}(x_{lb}^{b})),~{}~{}\mathcal{L}_{unsup}=\frac{1}{\mu\cdot B}\sum_{b=1}^{\mu\cdot B}\mathcal{H}(\hat{y}_{ulb}^{b},f_{\theta}(\Omega(x_{ulb}^{b}))),\vspace{-.5em}

(1)

where $f_{\theta}(\cdot)$ is the output probability from the model, $\hat{y}_{ulb}$ denotes a pseudo label obtained from $f_{\theta}(\omega(x_{ulb}))$ , and $\omega(\cdot)$ and $\Omega(\cdot)$ represent weak and strong image augmentation, respectively. While sharing the core framework, each SemiSDA method employs distinct adapting strategies, especially to enhance the effectiveness of the use of unlabeled data rather than labeled data [6, 96, 81].

Problem of previous works.

Since previous SemiSDA methods have overlooked the unexpected impact of NBF, they often suffer from sub-optimal performance under the NBF assumption (shown in Section 5). To address this problem, we focus on developing a scalable solution that (i) can easily combine with existing DA methods without modifying their core adapting strategies and (ii) can be applied to a wide range of benchmarks, including medical image diagnosis.

4.2 Retrieval Latent Defending

Based on the observations in Figure 3, we illustrate the unintended effect of NBF when using an existing SemiSDA method in Figure 4 (top center), where NBF is likely to exhibit a biased distribution, leading to undesirable adaptation results. To alleviate this issue, we propose Retrieval Latent Defending as depicted in Figure 4 (bottom). Prior to each epoch, we generate a candidate bank of data points, denoted as $x_{LD}$ . $\sim$ For each adapting iteration, we balance the mini-batch by retrieving latent defending samples $x_{LD}$ from the bank. $\sim$ The model is then adapted using the reconfigured mini-batch and following the baseline SemiSDA approach. We hypothesize that the latent space progressively created by the $x_{LD}$ candidates throughout the adaptation process (bold dashed circle in Figure 4 (top right)) mitigates the issue caused by NBF, thereby allowing the SemiSDA baseline to achieve robust adaptation against NBF.

Candidate bank generation.

The candidate bank serves as a repository of pseudo labels $\hat{Y}^{ulb}_{t}$ for a subset of the target unlabeled data $X^{ulb}_{t}$ . Before each epoch, we freeze the model and use it to generate pseudo labels $\hat{Y}^{ulb}_{t}{=}\{(\hat{y}^{n}_{ulb}):n\in[1..\,N_{ulb}]\}$ , where $\hat{y}^{n}_{ulb}$ is assigned to $x^{n}_{ulb}$ as the predicted class with the highest softmax probability: $\hat{y}^{n}_{ulb}=\operatorname*{argmax}_{c}\left[f_{\theta}(x^{n}_{ulb})\right]_{c}$ . We then retain only samples with the top $p$ % highest probabilities within each class. This filtering step helps mitigate the inclusion of data with potentially inaccurate pseudo labels, as the model’s predictions on $X^{ulb}_{t}$ might not always be perfect.

Defending sample selection.

We select $k$ latent defending samples $x_{LD}$ from the bank at random for each labeled data $(x^{b}_{lb},y^{b}_{lb})$ . These selected samples share the same pseudo label as the ground-truth label of their associated counterparts (i.e., $\hat{y}_{LD}\,{=}\,y^{b}_{lb}$ ). By incorporating these defending samples, we balance the data distribution within the current mini-batch. For example, consider $x^{1}_{lb}$ and $x^{2}_{lb}$ in Figure 4 (top right). As these labeled samples are included in the current mini-batch alongside the selected defending samples $x^{1}_{LD}$ and $x^{2}_{LD}$ , we expect to prevent the supervised adapting signal from becoming overly dependent on the labeled samples. We imagine the effect of the defending samples throughout the adaptation process and depict the latent space formed gradually by the $x_{LD}$ candidates as bold dashed circles in Figure 4 (top right).

Consequently, the overall loss consists of the sum of losses in Eq. (1) and a loss from our proposed method as,

\scriptsize\mathcal{L}_{total}=\underbrace{\mathcal{L}_{sup}+\mathcal{L}_{unsup}}_{\text{baseline}}+\underbrace{\frac{1}{k\cdot B}\sum_{b=1}^{k\cdot B}\mathcal{H}(\hat{y}_{LD}^{b},f_{\theta}(x_{LD}^{b}))}_{\text{retrieval latent defending}}.\vspace{-.5em}

(2)

Importance of our method.

Understanding the impact of NBF on adaptation performance is crucial. For example, naively adapting a model for a medical application using radiologist-provided feedback can actually cause performance degradation (shown in Table 5), potentially posing significant risks to patients. We propose a scalable and simple approach to solve the problem caused by NBF, which can not be addressed by existing methods. Given the practicality of the NBF problem and the scalability of our solution, we believe our work holds considerable potential for real-world applications.

5 Experiments

5.1 Experimental Setups

Our approach is simple enough to seamlessly combine with existing SemiSDA algorithms and also be applied to diverse benchmarks. This section describes our experimental setup for natural image classification tasks and a real-world medical application. Details for semantic segmentation experiments are in Appendix D.

Baselines. We validate our approach by combining various state-of-the-art algorithms for SemiSDA [58] (e.g., CDAC [33] and AdaMatch [6]) and SemiSL [66, 87] (e.g., FlexMatch [96] and FreeMatch [81]). Note that the SemiSL methods have been demonstrated to be strong SemiSDA learners [99], so we can consider them as SemiSDA methods. For medical experiments, we use Pseudo-labeling [1] as a baseline since it is easily applicable to medical image adaptation.

Datasets. We utilize natural image datasets containing multiple kinds of domains (e.g., real and painting). The datasets include DomainNet-126 [54, 58] with 142k images of 126 classes, and OfficeHome [75] with 15K images of 65 classes.

To conduct medical experiments, we present a practical medical setting. We adopt the MIMIC-CXR-V2 dataset [27]. It assumes a multi-finding binary classification setup, where multiple radiographic findings, like Pneumonia and Atelectasis, can coexist in a single chest X-ray (CXR) sample. Thus, the model predicts the presence or absence (binary classes) of each individual finding. We simulate domain shift by using Posterior-Anterior (PA)-view data as the source and AP-view data as the target, capturing real-world variations in data acquisition. Typically, patients requiring an AP X-ray are those facing positioning challenges that prevent them from undergoing a PA X-ray. Therefore, this setup can be seen as a scenario where the target environment is the intensive care unit, which hospitalizes critically ill patients.

Following the recent SemiSDA [94] and SFDA [8] setups, we assume the model is pre-trained in the source domain and deployed in the target domain. Since the datasets above were not initially divided into training and test sets, we performed a random 8:2 split within each domain, designating them respectively for training and testing. The training set is used to adapt the model, while the test set is used to report the top-1 accuracy.

	method	feedback	average	r→c	r→p	p→c	c→s	s→p	r→s	p→r
	AdaMatch [6]	RF	67.6	66.6	68.5	68.5	60.3	69.2	58.7	81.5
		NBF	64.5 (-3.1)	64.3	66.1	65.6	56.9	65.6	54.2	78.9
ResNet	w/ ours	NBF	72.0 (+7.5)	74.5	72.7	73.9	65.5	70.0	64.3	83.2
	AdaMatch [6]	RF	74.7	75.3	76.9	73.8	68.0	76.3	67.1	85.5
		NBF	73.7 (-1.0)	74.7	76.2	74.7	65.7	74.0	66.8	84.0
ViT	w/ ours	NBF	75.9 (+2.2)	76.9	77.8	77.8	68.5	76.6	68.3	85.1

Table 1: Adaptation results on DomainNet-126. We simulate seven domain-shift scenarios (i.e., source → target). The model is pre-trained on the source domain and then adapted to a training set of the target domain. The results on the test set of the target domain are reported as the top-1 accuracy (%). DomainNet-126 [54, 58] dataset includes real, painting, sketch, and clip-art domains. In this experiment, we assume that the 378 feedback samples (i.e., 3 labeled data per class) are obtained from users. A state-of-the-art SemiSDA method, AdaMatch [6], is used as a baseline.

method	feedback	average	a → c	a → p	a → r	c → a	c → p	c → r	p → a	p → c	p → r	r → a	r → c	r → p
AdaMatch [6]	RF	70.9	55.4	80.4	75.9	65.7	81.5	74.6	65.9	58.7	78.4	68.8	61.5	84.3
	NBF	69.3 (-1.6)	54.2	76.6	75.3	65.9	79.3	75.5	63.7	57.4	75.9	66.7	56.8	84.2
w/ ours	NBF	73.8 (+4.5)	62.2	81.0	79.7	68.8	85.4	78.6	67.7	61.7	79.5	69.0	64.1	88.2

Table 2: Adaptation results on OfficeHome. OfficeHome [75] dataset includes real, product, art, and clip-art domain. We assume that the 195 feedback samples (i.e., 3 labeled data per class) are obtained. AdaMatch [6] and ResNet-50 [23] are used.

User feedback. Feedback given by users is modeled as annotations $\{y^{n}_{lb}:n\in[1..\,N_{lb}]\}$ on a small subset of the target’s training set $D_{t}^{train}$ , while the remaining of them are used as unlabeled target data. In our experiments, we take into account two types of feedback: random feedback (RF) and negatively biased feedback (NBF). RF is the same setup of classical SemiSDA and SemiSL, where randomly selected samples from $D_{t}^{train}$ are used as small labeled set $X_{t}^{lb}$ . For NBF, we randomly select samples that are incorrectly predicted in $D_{t}^{train}$ by the source model (i.e., the pre-trained model before adaptation). Note that we focus on the impact of a biased label distribution within the same class, as shown in Figure 3, and thus take the same number of feedback for each class. Further discussion about the imbalance in the number of feedback between classes presented in [83, 49, 31] is provided in Appendix A.2.

Network architectures. We adopt commonly used networks, ResNet [23] and ViT [15] for natural image tasks and DenseNet [26] for a medical task. We employ ResNet-50 with the last classification layer comprising a weight normalization layer and a bottleneck layer, following previous works [34, 8] and use the ViT-Small (i.e., ViT-S) introduced in [80]. The DenseNet-121 is used, provided in TorchXrayVision [13], like existing medical works [32, 44].

Implementation details. We implement our framework by extending the publicly available USB [80] repository. Both pre-training and adaptation are conducted with a mini-batch size of 128 and the SGD optimizer. Diverse baselines for SemiSDA and SemiSL are used to compute the losses in Eq. (1). The hyper-parameters for each baseline simply follow USB [80] or public code [58, 33]. For all experiments, our approach uses the same hyper-parameters of the appended defending samples $k$ and reliable filtering rate $p$ as 3 and 0.4, respectively.

	feed. amount	378 (3 labeled data per class)			630 (5 labeled data per class)
	method	RF	NBF	w/ ours	RF	NBF	w/ ours
	Source model	56.5
\cdashline2-7	MME [58]	69.5	68.4 (-1.1)	70.8 (+2.4)	71.2	70.1 (-1.1)	72.5 (+2.4)
	CDAC [33]	68.3	64.6 (-3.7)	73.2 (+8.6)	71.7	68.1 (-3.6)	74.9 (+6.8)
	AdaMatch [6]	67.6	64.5 (-3.1)	72.0 (+7.5)	70.9	67.7 (-3.2)	74.3 (+6.6)
\cdashline2-7	FixMatch[66]	67.6	63.4 (-4.2)	73.2 (+9.8)	71.5	66.1 (-5.4)	75.1 (+9.0)
	UDA [87]	69.2	64.9 (-4.3)	73.4 (+8.5)	72.9	68.8 (-4.1)	75.3 (+6.5)
	FlexMatch [96]	73.3	71.4 (-1.9)	74.7 (+3.3)	75.3	73.9 (-1.4)	76.0 (+2.1)
	FreeMatch [81]	73.8	72.0 (-1.8)	74.8 (+2.8)	75.6	74.4 (-1.2)	76.1 (+1.7)
\cdashline2-7 ResNet-50 [23]	Fully supervised	83.6
	Source model	64.5
\cdashline2-7	MME [58]	73.2	72.7 (-0.5)	74.1 (+1.4)	74.5	74.0 (-0.5)	75.2 (+1.2)
	CDAC [33]	74.2	72.8 (-1.4)	75.4 (+2.6)	75.4	74.1 (-1.3)	76.2 (+2.1)
	AdaMatch [6]	74.7	73.7 (-1.0)	75.9 (+2.2)	75.9	75.1 (-0.8)	76.7 (+1.6)
\cdashline2-7	FixMatch [66]	74.6	73.0 (-1.6)	75.6 (+2.6)	75.7	74.3 (-1.4)	76.5 (+2.2)
	UDA [87]	74.8	73.3 (-1.5)	75.8 (+2.5)	75.9	74.5 (-1.4)	76.7 (+2.2)
	FlexMatch [96]	74.9	73.9 (-1.0)	75.8 (+1.9)	76.0	75.1 (-0.9)	76.9 (+1.8)
	FreeMatch [81]	74.9	73.9 (-1.0)	75.7 (+1.8)	76.0	75.1 (-0.9)	76.8 (+1.7)
\cdashline2-7 ViT-S [15]	Fully supervised	85.4

Table 3: Comparisons on DomainNet-126. We evaluate our method by integrating it with SemiSDA and SemiSL methods. The average accuracy of seven domain-shift scenarios in Table 1 is reported. Source model represents the pre-trained model without adaptation. Fully supervised means the model is adapted with fully labeled target data.

[Uncaptioned image] — Table 4: Comparisons on OfficeHome. The average accuracy of twelve domain-shift scenarios in Table 2 is reported. ResNet-50 is used.

feed. amount	195 (3 labeled data per class)			325 (5 labeled data per class)
method	RF	NBF	w/ ours	RF	NBF	w/ ours
Source model	57.6
\hdashlineMME [58]	71.2	70.2 (-1.0)	73.4 (+3.2)	73.5	73.1 (-0.4)	75.6 (+2.5)
CDAC [33]	71.2	69.0 (-2.2)	74.3 (+5.3)	73.5	72.3 (-1.2)	75.7 (+3.4)
AdaMatch [6]	70.9	69.3 (-1.6)	73.8 (+4.5)	73.4	72.7 (-0.7)	75.5 (+2.8)
\hdashlineFixMatch [66]	71.4	68.6 (-2.8)	73.7 (+5.1)	73.9	72.2 (-1.7)	75.3 (+3.1)
UDA [87]	72.2	69.5 (-2.7)	74.1 (+4.6)	74.4	73.0 (-1.4)	76.0 (+3.0)
FlexMatch [96]	73.7	72.1 (-1.6)	74.7 (+2.6)	75.9	74.9 (-1.0)	76.6 (+1.7)
FreeMatch [81]	74.0	72.7 (-1.3)	74.8 (+2.1)	75.8	75.0 (-0.8)	76.6 (+1.6)
\hdashline Fully supervised	87.4

method	feedback	average	atelectasis	cardiomegaly	consolidation	edema	enlarged cardio.	fracture	lung lesion	lung opacity	effusion	pleural	pneumonia	pneumothorax	support device
Source mo.		.7738	.7784	.7919	.8236	.8500	.7646	.6642	.7555	.7818	.8271	.8288	.7535	.6894	.7500
\hdashlinePseudoL [1]	RF	.7850	.7828	.7965	.8453	.8615	.7639	.6832	.7598	.7947	.8333	.8565	.7702	.6957	.7622
\hdashline	NBF	.7691	.7719	.7851	.8202	.8468	.7403	.6934	.7446	.7809	.8070	.8260	.7521	.6979	.7324
	gap	-.0159	-.0109	-.0114	-.0252	-.0147	-.0236	+.0102	-.0152	-.0138	-.0262	-.0304	-.0181	+.0022	-.0298
w/ ours	NBF	.7884	.7895	.7956	.8515	.8606	.7730	.6821	.7599	.7973	.8445	.8611	.7753	.6851	.7736
	gain	+.0193	+.0176	+.0105	+.0313	+.0138	+.0326	-.0113	+.0153	+.0164	+.0375	+.0351	+.0232	-.0128	+.0412
\hdashline	NBF-CE	.7639	.7682	.7834	.8124	.8418	.7403	.6808	.7472	.7744	.8005	.8199	.7469	.6879	.7277
	gap	-.0211	-.0146	-.0131	-.0330	-.0198	-.0236	-.0024	-.0126	-.0203	-.0328	-.0366	-.0233	-.0079	-.0344
w/ ours	NBF-CE	.7875	.7895	.7956	.8515	.8606	.7730	.6731	.7599	.7973	.8445	.8611	.7753	.6831	.7736
	gain	+.0236	+.0213	+.0122	+.0391	+.0189	+.0327	-.0077	+.0126	+.0229	+.0440	+.0412	+.0284	-.0048	+.0459
\hdashlineFully super.		.8117	.8150	.8277	.8758	.8820	.7984	.6949	.7750	.8200	.8725	.8441	.8044	.7398	.8025

labeling type	feed. amount	RF	NBF	NBF w/ ours	ENT [62]	ENT w/ ours
IAST [46, 63]	PA, 40 points	55.3	53.0 (-2.3)	56.3 (+3.3)	53.5	56.0 (+2.5)
RIPU [85]	PA, 40 points	57.6	54.5 (-3.1)	58.0 (+3.5)	54.6	57.7 (+3.1)

	state B
feed. amount	378 (3 labeled data per class)					1890	5040
stage C	RF	NBF	Entropy [62]	CLUE [55]	DiaNA [25]	CLUE [55]	CLUE [55]
AdaMatch [7]	67.6	64.5	65.9	68.6	68.1	76.1	80.3
w/ ours	71.1 (+3.5)	72.0 (+7.5)	71.1 (+5.2)	71.5 (+2.9)	71.3 (+3.2)	78.0 (+1.9)	81.4 (+1.1)

	method	feedback	FP : FN	average	fracture	pneumothorax
	Source model	-	-	.6768	.6642	.6894
	Pseudo-Label. [1]	RF	-	.7325	.7541	.7109
\hdashline		NBF	40 : 40	.7173 (-.0152)	.7414 (-.0127)	.6931 (-.0178)
	with ours	NBF	40 : 40	.7334(+.0162)	.7625 (+.0211)	.7044 (+.0113)
\cdashline2-7		NBF	75 : 5	.7248 (-.0077)	.7494 (-.0047)	.7002 (-.0107)
	with ours	NBF	75 : 5	.7361 (+.0113)	.7653 (+.0159)	.7070 (+.0068)
\cdashline2-7		NBF	5 : 75	.7170 (-.0155)	.7420 (-.0121)	.6921 (-.0188)
80 feedback	with ours	NBF	5 : 75	.7315 (+.0145)	.7679 (+.0260)	.6951 (+.0030)
	Pseudo-Label. [1]	RF	-	.7353	.7565	.7141
\hdashline		NBF	80 : 80	.7162 (-.0192)	.7429 (-.0136)	..6894 (-.0247)
	with ours	NBF	80 : 80	.7331 (+.0169)	.7680 (+.0251)	.6983 (+.0088)
\cdashline2-7		NBF	155 : 5	.7237 (-.0117)	.7559 (-.0007)	.6915 (-.0227)
	with ours	NBF	155 : 5	.7358 (+.0121)	.7665 (+.0106)	.7051 (+.0136)
\cdashline2-7		NBF	5 : 155	.7166 (-.0188)	.7438 (-.0128)	.6894 (-.0248)
160 feedback	with ours	NBF	5 : 155	.7300 (+.0134)	.7696 (+.0258)	.6904 (+.0010)
	Fully supervised	-	-	.7744	.8003	.7486

method	feed.	negatively biased feedback (NBF)
# $x_{ulb}$ , # $x_{LD}$ , # $x_{lb}$		112, 0, 16	112, 48, 16	64, 48, 16
total batch size		128	176	128
FreeMatch [81]		72.0	74.2	74.8 (+0.6)
AdaMatch [6]	368	64.5	71.3	72.0 (+0.7)
\hdashlineFreeMatch [81]		74.4	75.5	76.1 (+0.6)
AdaMatch [6]	630	67.7	73.4	74.3 (+0.9)

selection strategy		random	random	kmeans	cosine	baseline
class-aware		✗	✓	✓	✓	-
FreeMatch [81]	Res.	74.1	74.8	74.6	74.0	72.0
FreeMatch [81]	ViT.	75.0	75.7	75.6	75.1	73.9

filtering rate		0.2	0.4	0.6	0.8	baseline only
FreeMatch [81]	Res.	74.5	74.8	74.3	73.7	72.0
FreeMatch [81]	ViT.	75.5	75.7	75.9	75.5	73.9

memory bank size = 5k			percentage of target data encountered in target domain
method	feed.	amo.	10% $\xrightarrow{\hskip 21.00005pt}$ 40% $\xrightarrow{\hskip 21.00005pt}$ 70% $\xrightarrow{\hskip 21.00005pt}$ 100%
FreeMatch [81]	RF		68.4	71.4	73.0	73.4
	NBF		66.9	69.5	71.0	71.5
w/ ours	NBF	368	68.9 (+2.0)	72.4 (+2.9)	73.6 (+2.6)	74.3 (+2.8)
FreeMatch [81]	RF		71.2	73.7	74.7	75.4
	NBF		69.8	72.2	73.2	73.9
w/ ours	NBF	630	71.5 (+1.7)	74.1 (+1.9)	75.0 (+1.8)	75.5 (+1.6)

	only $\mathcal{L}_{unsup}$	the overall loss $\mathcal{L}_{total}$ in Eq. (LABEL:2)
pseudo-feedback per class	0	3	w/ ours	5	w/ ours
NRC [93]	63.5	63.4	64.6 (+1.2)	63.4	64.4 (+1.0)
ContrastiveTTA [8]	66.6	66.6	67.4 (+0.8)	66.5	67.2 (+0.7)

method	CReST (CVPR21)	CLUE (ICCV21)	DiaNA (CVPR23)	SSNLL (IROS22)	GuidSP (CVPR23)
reference	Figure 8	Table 9	Table 9	-	Table 12
accuracy	92.6	68.6	68.1	68.9	69.2
w/ ours	95.8 (+3.2)	71.5 (+2.9)	71.3 (+3.2)	71.4 (+2.5)	71.6 (+2.4)

method	AdaMatch (ICLR22)	w/ ours	GuidSP (CVPR23)	w/ ours	FreeMatch (ICLR23)	w/ ours
reference	Table 3	Table 3	Table 12	Table 12	Table 11	Table 11
DB size^‡	55k images	55k images	55k images	55k images	5k images	5k images
add. data^†	0 MB	0.1 MB	53.8 MB	53.9 MB	0 MB	0.01 MB
run. time	132 min	136 min	150 min	155 min	14 min	15 min
accuracy	64.5	72.0 (+7.5)	70.2	71.8 (+1.6)	66.9	68.9 (+2.0)

	method	feedback	average	real→clip.	real→pain.	pain.→clip.	clip.→scat.	scat.→pain.	real→scat.	pain.→real
	Source model	-	56.5	56.1	63.7	55.2	48.0	51.7	45.8	74.7
\cdashline2-11	FixMatch [66]	RF	67.6	66.2	68.3	68.2	61.0	69.8	58.7	80.8
		NBF	63.4	62.4	65.1	64.8	55.8	64.6	52.7	78.4
	w/ ours	NBF	73.2	75.0	74.3	74.7	66.9	71.8	65.4	84.1
	UDA [87]	RF	69.2	68.7	70.0	69.8	62.8	70.9	60.0	82.0
		NBF	64.9	64.5	66.0	67.3	57.2	66.3	53.8	79.5
	w/ ours	NBF	73.4	76.2	74.0	74.7	67.4	71.9	65.7	84.1
	FlexMatch [96]	RF	73.3	76.7	74.0	75.6	66.9	73.2	64.4	82.5
		NBF	71.4	74.8	72.2	74.5	63.8	71.1	61.7	81.4
	w/ ours	NBF	74.7	77.9	74.8	77.8	68.9	72.2	66.9	84.4
	FreeMatch [81]	RF	73.8	76.6	74.2	75.5	67.7	73.5	65.1	84.0
		NBF	72.0	75.5	72.9	74.6	65.0	72.3	62.0	81.7
	w/ ours	NBF	74.8	78.1	74.5	77.1	68.8	72.4	67.3	85.0
\cdashline2-11	MME [58]	RF	69.5	70.0	71.2	69.3	63.5	69.6	61.7	81.5
		NBF	68.4	69.5	70.7	69.1	61.5	69.0	58.8	80.2
	w/ ours	NBF	70.8	72.9	71.6	72.9	64.0	68.4	62.1	83.5
	CDAC [33]	RF	68.3	67.1	69.0	68.9	62.6	69.9	59.5	81.1
		NBF	64.6	64.5	66.2	66.3	56.9	65.8	53.6	78.6
	w/ ours	NBF	73.2	76.1	73.9	74.4	67.0	71.2	65.8	84.1
	AdaMatch [6]	RF	67.6	66.6	68.5	68.5	60.3	69.2	58.7	81.5
		NBF	64.5	64.3	66.1	65.6	56.9	65.6	54.2	78.9
ResNet-50 [23]	w/ ours	NBF	72.0	74.5	72.7	73.9	65.5	70.0	64.3	83.2
\cdashline2-11	Fully sup.	-	83.6	85.6	81.4	85.6	80.4	81.4	80.4	90.1
	Source model	-	64.5	63.6	70.2	61.6	56.7	65.5	53.5	80.5
\cdashline2-11	FixMatch [66]	RF	74.6	75.5	77.1	73.8	67.7	75.9	67.1	85.1
		NBF	73.0	73.8	75.4	74.0	65.1	72.8	66.1	83.8
	w/ ours	NBF	75.6	77.1	77.7	77.3	67.8	76.8	68.0	84.7
	UDA [87]	RF	74.8	75.5	77.1	74.0	67.9	76.1	67.4	85.4
		NBF	73.3	74.1	75.6	74.3	65.4	73.2	66.3	83.9
	w/ ours	NBF	75.8	77.1	77.8	77.6	68.2	77.1	68.2	84.9
	FlexMatch [96]	RF	74.9	75.5	77.0	74.7	68.4	76.2	66.7	85.7
		NBF	73.9	74.5	76.6	75.1	66.1	74.5	66.4	84.1
	w/ ours	NBF	75.8	77.2	77.5	77.9	68.3	77.0	67.9	85.0
	FreeMatch [81]	RF	74.9	75.3	76.8	74.5	68.1	76.5	67.0	86.0
		NBF	73.9	74.6	76.4	75.0	66.0	74.5	66.5	84.1
	w/ ours	NBF	75.7	76.9	77.5	77.9	68.1	76.7	67.8	85.2
\cdashline2-11	MME [58]	RF	73.2	74.0	74.8	73.0	66.5	74.6	65.2	84.3
		NBF	72.7	73.2	74.8	73.8	65.3	73.0	64.8	83.8
	w/ ours	NBF	74.1	75.4	75.9	76.2	66.2	74.7	66.4	84.2
	CDAC [33]	RF	74.2	74.8	76.3	73.8	67.5	75.5	66.6	84.9
		NBF	72.8	73.6	74.9	73.9	65.0	72.8	65.4	83.8
	w/ ours	NBF	75.4	76.7	77.6	77.2	67.6	76.2	67.9	84.6
	AdaMatch [6]	RF	74.7	75.3	76.9	73.8	68.0	76.3	67.1	85.5
		NBF	73.7	74.7	76.2	74.7	65.7	74.0	66.8	84.0
ViT-S [15]	w/ ours	NBF	75.9	76.9	77.8	77.8	68.5	76.6	68.3	85.1
\cdashline2-11	Fully sup.	-	85.4	87.8	83.4	87.8	81.3	83.4	81.3	92.7

method	feedback	average	a → c	a → p	a → r	c → a	c → p	c → r	p → a	p → c	p → r	r → a	r → c	r → p
Source	-	57.6	44.2	65.6	71.6	47.3	60.2	58.2	47.9	40.8	69.8	60.6	46.5	78.1
\hdashlineFreeMatch [81]	RF	71.4	56.6	79.7	76.3	67.9	83.2	74.5	65.5	58.6	78.3	69.4	62.4	84.8
	NBF	68.6	53.0	76.1	75.3	65.3	78.5	74.8	62.5	56.7	74.7	66.7	56.2	83.7
w/ ours	NBF	73.7	60.8	80.3	80.5	69.2	84.0	78.6	67.7	62.3	80.1	70.0	64.1	87.2
UDA [87]	RF	72.2	56.1	81.0	76.8	68.0	83.4	75.6	67.1	59.7	79.7	69.8	62.7	86.4
	NBF	69.5	53.3	78.6	75.7	66.3	79.7	75.8	63.7	57.2	75.7	66.7	57.2	83.9
w/ ours	NBF	74.1	61.1	80.7	80.3	69.0	85.9	79.2	68.0	62.3	80.7	70.4	63.9	87.4
FlexMatch [96]	RF	73.7	58.0	84.6	79.3	68.4	84.7	78.8	68.4	62.8	79.8	70.6	62.9	86.3
	NBF	72.1	56.1	79.0	77.8	68.4	83.4	77.6	67.5	60.1	79.2	68.8	60.5	86.2
w/ ours	NBF	74.7	60.8	81.7	81.1	70.0	85.8	79.8	68.8	61.4	81.4	70.2	65.7	89.4
FreeMatch [81]	RF	74.0	58.5	85.0	79.4	68.2	84.7	79.2	68.4	62.5	80.4	71.0	63.7	87.0
	NBF	72.2	56.4	79.3	77.7	67.7	83.4	78.5	67.3	60.5	79.1	69.2	61.0	86.9
w/ ours	NBF	74.8	60.6	81.4	81.5	70.8	86.7	80.0	68.6	61.6	81.7	69.8	66.2	89.2
\hdashlineMME [58]	RF	71.2	56.2	80.4	75.7	65.1	81.0	76.7	64.5	59.0	79.8	69.0	62.0	85.1
	NBF	70.2	55.0	77.6	76.8	65.1	82.2	77.7	61.1	57.1	77.1	68.8	58.1	85.4
w/ ours	NBF	73.4	60.5	81.4	80.0	68.6	84.8	78.4	65.3	61.3	79.8	69.8	62.8	87.5
CDAC [33]	RF	71.2	55.5	80.0	76.4	67.1	82.4	75.8	64.5	58.7	79.0	69.2	61.5	84.4
	NBF	69.0	54.1	76.2	75.4	64.1	79.5	75.4	63.9	57.9	75.2	66.5	55.8	83.6
w/ ours	NBF	74.3	63.7	81.3	80.4	70.0	85.4	79.0	67.9	62.2	80.3	69.6	65.1	86.9
AdaMatch [6]	RF	70.9	55.4	80.4	75.9	65.7	81.5	74.6	65.9	58.7	78.4	68.8	61.5	84.3
	NBF	69.3	54.2	76.6	75.3	65.9	79.3	75.5	63.7	57.4	75.9	66.7	56.8	84.2
w/ ours	NBF	73.8	62.2	81.0	79.7	68.8	85.4	78.6	67.7	61.7	79.5	69.0	64.1	88.2
\hdashlineFully sup.	-	87.4	84.5	95.1	89.0	80.9	95.1	89.0	80.9	84.5	89.0	80.9	84.5	95.1

Is user feedback always informative? Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data

Abstract

Keywords:

1 Introduction

2 Related Work

Adaptation in the deployment environment.

Active domain adaptation

Semi-supervised learning

3 Negatively Biased Feedback

3.1 Adaptation with user feedback.

Rethinking user-provided feedback.

3.2 Influence of NBF on SemiSDA

Simulation study.

Unexpected influence of NBF.

4 Approach

4.1 Prerequisite: Previous SemiSDA method

Problem of previous works.

4.2 Retrieval Latent Defending

Candidate bank generation.

Defending sample selection.

Importance of our method.

5 Experiments

5.1 Experimental Setups

5.2 Main Results

5.3 Ablation Study

Positive vs. Negative feedback.

6 Conclusion & Discussion

References

A Comparison with Related Work

A.1 Active Domain Adaptation

A.2 Class-Imbalanced Semi-Supervised Learning

Under different feedback configurations.

Combining with class-imbalanced SemiSL methods.

A.3 Test-time Adaptation

Extension to a TTA scenario.

A.4 Learning with User Feedback

B Further understanding with Simulation Study

Network architecture.

Baseline.

Additional study on two moon dataset.

C Additional Ablation Study

Reliable sample filtering.

Combining with SFDA methods.

Number of appended defending samples.

Under a zero feedback scenario.

D Additional Experimental Details

Details for medical experiments.

Details for semantic segmentation experiments.

E Additional Discussion

E.1 Technique novelty

E.2 Computational overhead

E.3 Limitations.

F Results of all domain shifts

Is user feedback always informative?
Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data