Prototypical Distillation and Debiased Tuning for Black-box Unsupervised Domain Adaptation

Jian Liang, Lijun Sheng, Hongmin Liu, and Ran He J. Liang and R. He are with the State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences and the School of Artificial Intelligence, University of Chinese Academy of Sciences. (Email: [email protected], [email protected]). L. Sheng is with the University of Science and Technology of China (Email: [email protected]). H. Liu is with the School of Intelligence Science and Technology, University of Science and Technology Beijing (Email: [email protected]).

Abstract

Unsupervised domain adaptation aims to transfer knowledge from a related, label-rich source domain to an unlabeled target domain, thereby circumventing the high costs associated with manual annotation. Recently, there has been growing interest in source-free domain adaptation, a paradigm in which only a pre-trained model, rather than the labeled source data, is provided to the target domain. Given the potential risk of source data leakage via model inversion attacks, this paper introduces a novel setting called black-box domain adaptation, where the source model is accessible only through an API that provides the predicted label along with the corresponding confidence value for each query. We develop a two-step framework named Prototypical Distillation and Debiased tuning (ProDDing). In the first step, ProDDing leverages both the raw predictions from the source model and prototypes derived from the target domain as teachers to distill a customized target model. In the second step, ProDDing keeps fine-tuning the distilled model by penalizing logits that are biased toward certain classes. Empirical results across multiple benchmarks demonstrate that ProDDing outperforms existing black-box domain adaptation methods. Moreover, in the case of hard-label black-box domain adaptation, where only predicted labels are available, ProDDing achieves significant improvements over these methods. Code will be available at https://github.com/tim-learn/ProDDing/.

Index Terms:

Domain adaptation, source-free, black-box, transfer learning, knowledge distillation, hard-label

1 Introduction

Deep neural networks have achieved remarkable success across various tasks with the help of massive labeled datasets. However, collecting sufficient labeled data for each new task is often expensive and inefficient. To address this challenge, transfer learning has garnered significant attention [1, 2], particularly in the realm of unsupervised domain adaptation (UDA) [3, 4]. UDA leverages one or more related but distinct labeled datasets (source domains) to assist in recognizing unlabeled instances in a new dataset, referred to as the target domain. In recent years, UDA methods have been extensively applied to various computer vision tasks, including image classification [5, 6, 7], semantic segmentation [8, 9, 10], and object detection [11, 12, 13].

Refer to caption — Figure 1: For black-box domain adaptation, the source vendor provides only black-box predictors (e.g., through a cloud API service) to the target user, who possesses certain unlabeled data. During adaptation, only the predicted labels and their associated confidence values are accessible for target queries. When confidence values are unavailable, we refer to this scenario as hard-label black-box domain adaptation.

Existing UDA methods typically require access to raw source data and rely on techniques such as domain adversarial training [6, 5] or maximum mean discrepancy minimization [14, 15] to align source and target features. However, in many cases, such as handling personal medical records, raw source data can not be shared due to privacy concerns. To address this limitation, recent studies [16, 17, 18, 19] have explored source-free unsupervised domain adaptation (SFUDA) by using trained source models as supervision instead of raw data, achieving promising adaptation results. Nevertheless, these SFUDA methods often require the source models to be carefully trained and fully disclosed to the target domain, raising two critical concerns. First, model inversion attacks [20, 21] can potentially reconstruct raw source data, risking individual privacy. Second, these approaches typically train source models with specialized techniques while assuming an identical target model architecture, which is especially impractical for resource-constrained users. Thus, this paper focuses on a realistic and challenging scenario for UDA, where the source model is trained without bells and whistles and provided to the unlabeled target domain as a black-box predictor.

To illustrate this process more clearly, as shown in Fig. 1, the target user accesses the API service provided by the source vendor to obtain the predicted label and its confidence value (i.e., the highest soft-max probability) for each instance (e.g., (‘dog’, 0.4)), using this information for knowledge adaptation in the unlabeled target domain. This UDA scenario offers greater flexibility for cross-domain knowledge transfer, as it does not necessitate any specialized design for the source model. To further mitigate privacy risks, this paper also considers the hard-label scenario, where the associated confidence value is unavailable for each query. To tackle this challenging black-box scenario, knowledge distillation [22, 23] offers a potential solution, wherein the target model (student) is traditionally trained by mimicking the comprehensive outputs of the source model (teacher) using labeled data. However, this becomes impractical in black-box UDA due to the simultaneous absence of labeled data and the inability to obtain complete teacher outputs.

In this paper, we propose a novel knowledge adaptation framework named Prototypical Distillation and Debiased Tuning (ProDDing). ProDDing follows a simple two-step pipeline: first, it primarily distills knowledge from the predictions of the source model, and then fine-tunes the distilled model using the unlabeled target data. To fully exploit the confidence value provided by the source vendor, we elegantly devise an adaptive label smoothing technique that combines one-hot training labels with uniform label vectors, weighted adaptively. Given the inherent noise in source-predicted labels, we further utilize representative prototypes—the feature centroids of classes in the target domain—and use the feature distance to these prototypes as a complementary source of supervision. In addition to these two point-wise supervisions, we introduce two new structural regularizations into distillation: interpolation consistency training [24]—which ensures that predictions for interpolated samples align with their interpolated labels, and mutual information maximization [16, 25]—which helps increase the diversity among target predictions.

To extract knowledge from unlabeled data, we fine-tune the distilled model in the second step, drawing inspiration from the semi-supervised method [26]. To alleviate class-sampling bias during the weak-to-strong consistency objective, we propose to adjust the logits for each class based on estimated label frequencies. Specifically, we introduce large offset values to the logits of dominant classes to reduce their influence, while reapplying mutual information maximization to the weakly augmented samples at the same time. Extensive results on standard benchmark datasets (e.g., Office, Office-Home, and DomainNet) verify that ProDDing consistently outperforms previous black-box UDA methods. Furthermore, even in the challenging hard-label adaptation scenario, ProDDing yields surprisingly promising performance.

Our contributions can be summarized as follows:

•

We introduce a realistic and challenging UDA setting, called black-box domain adaptation, where the source model is limited to providing only the predicted label and, optionally, its confidence score for each target query.
•

We propose ProDDing, a simple yet effective framework that first distills noisy knowledge by querying the source model and then performs self-tuning solely on unlabeled data.
•

We design an adaptive label-smoothing strategy for source predictions and develop a novel unsupervised distillation method by integrating structural regularizations into the distillation process
•

We present a novel unsupervised fine-tuning strategy that mitigates class bias through logit adjustment and mutual information maximization.
•

Empirical results across diverse benchmarks confirm the superiority of ProDDing over existing black-box UDA methods. Notably, when only limited information (i.e., hard labels) is available from the source vendor, ProDDing again achieves the best performance.

This paper extends our previous conference publication [27] mainly in five aspects: (1). In the distillation step, we incorporate a novel teacher supervision signal based on prototypes derived from the target data. (2). In the self-tuning step, we implement a logit-adjustable weak-to-strong augmentation strategy to reduce class bias. (3). In the introduced black-box setting, we are the first to investigate the hard-label case, where only the predicted labels are available from the source vendor. (4). We also broaden the experimental evaluation by including additional datasets, such as DomainNet (large-scale) and Office-Home-RSUT (reverse label shift). (5). We offer a more detailed analysis to evaluate the proposed approach, with a particular focus on the newly added components in the framework.

2 Related Work

2.1 Unsupervised Domain Adaptation

Domain adaptation, a common scenario in transfer learning [1], involves using labeled data from one or more source domains to address tasks in a related target domain with covariate shifts. Specifically, much of the research efforts have been devoted to unsupervised domain adaptation (UDA), where no labeled data is available in the target domain. At early times, researchers address this problem via instance weighting [28, 29], feature transformation [30, 31], and feature space [32, 33, 34]. Over the past decade, deep domain adaptation methods, driven by advances in representation learning, have become prevalent and achieved remarkable progress. To bridge the gap between features across different domains, deep UDA methods commonly employ domain adversarial learning [5, 6, 7, 35] and discrepancy minimization [14, 36, 37, 38]. Another branch of deep UDA methods [39, 40, 41, 42] focuses on the network outputs, introducing various regularization terms (e.g., entropy minimization) to achieve implicit domain alignment. Moreover, researchers explore other aspects of neural networks for deep UDA, such as domain-specific normalization methods [43, 44] and feature regularization techniques [45, 46].

Several UDA methods [47, 48] address class-level domain discrepancies by aligning class prototypes across domains. Alternatively, some approaches learn the domain-invariant features by promoting the alignment of target data with source class prototypes [49]. To ensure semantic consistency in adversarial alignment, class prototypes are incorporated as conditional signals into features before being fed to the discriminator [10]. Moreover, prototypes can serve as centroids in nearest-centroid classifiers to assign pseudo-labels to unlabeled target data [50, 51]. A closely related work to ours is [52], which utilizes distances to prototypes to iteratively refine pseudo-labels via a multiplication mechanism. In contrast, we introduce a straightforward weighted average of source predictions and derived prototypical pseudo-labels as the initial teacher signal.

2.2 Source-free Domain Adaptation

Motivated by hypothesis transfer learning [53, 54] and early parameter adaptation techniques [55, 56], several pioneering studies [57, 58] have proposed shallow domain adaptation methods in the absence of source data (features). More recently, several works [16, 19, 18, 17] introduce the source data-free setting for deep UDA, where the source domain merely offers a trained model rather than the raw source data. Specifically, [16, 19] freeze the classifier module in the provided model and fine-tune the feature module via information maximization and pseudo-labeling in the target domain. By contrast, [17] leverages a conditional generative adversarial net and incorporates generated images into the adaptation process. This source-free paradigm [59] is more privacy-preserving and flexible compared to the conventional UDA setting, rapidly gaining attention among researchers in various transfer learning applications, such as semantic segmentation [60, 61], object detection [62, 63], and medical image analysis [64, 65].

According to the taxonomy outlined in a recent survey [66], existing source-free domain adaptation methods can be broadly categorized into four groups: pseudo-labeling [16, 67, 68, 69], consistency training [70, 71, 72, 73], clustering-based training [74, 75, 76, 77], and source distribution estimation [17, 21, 78, 79, 71]. However, exposing the parameters of the trained source model in source-free domain adaptation can pose a risk due to model inversion attacks [20, 80], particularly in methods that rely on source distribution estimation. Typically, the target network architecture is assumed to be the same as the pre-trained source one, limiting the flexibility in resource-constrained scenarios. To thoroughly evaluate effectiveness under label shift, we also conduct experiments in partial-set and imbalanced cases, as used in previous source-free methods [16, 81].

2.3 Black-box Domain Adaptation

To enhance privacy-preserving capabilities in source-free domain adaptation, several recent works [27, 82, 83] propose using the source model as a black-box predictor, where only predictions for target queries are accessible, without any knowledge of the source model’s parameters. In contrast to methods [82, 83] that leverage the complete probability distribution from the source model, our previous work [27] proposes using truncated probabilities, such as the largest probability value and its associated label. Such incomplete source predictions are more realistic in real-world APIs and have been widely adopted in subsequent black-box UDA studies [84, 85, 86, 87, 88]. In addition to the predicted label and its probability as considered in [27], this extension further explores a more challenging scenario where only the predicted label is available. Note that, we focus exclusively on image classification and do not address black-box adaptation methods for other tasks [89, 90, 91, 92, 93, 94].

Faced with a black-box source model, a pioneering work [19] partitions the target dataset into two splits and employs semi-supervised learning to enhance the performance of the uncertain split. This divide-and-learn strategy is further employed in other methods [83, 85, 84, 87], where the domain gap between the two splits is addressed using adversarial training [84], discrepancy minimization [83], or graph alignment [87]. Additionally, [82] introduces an iterative noisy label learning approach to refine source predictions, while [95] develops a sophisticated memory mechanism to capture representative information during adaptation. Furthermore, [86] emphasizes consistency under both data and model variations. [96] takes a different approach by leveraging third-party data and adversarial training to train the target model. [97] proposes a black-box solution for prior shifts that relies on a hold-out source set to estimate the class confusion matrix, which is sometimes hard to satisfy in practice. Several recent studies [98, 99] even leverage additional vision-language models [100] to enhance the performance of black-box domain adaptation.

2.4 Knowledge Distillation

Knowledge distillation [101] is a well-studied technique aimed at transferring knowledge from one model (commonly referred to as the teacher) to another model (the student), typically from a larger model to a smaller one. A seminal work [22] shows that, augmenting the training of the student with a distillation loss, matching the predictions between teacher and student, is beneficial. Recently, [102] introduces self-knowledge distillation, showing that past predictions within the same neural network can serve as the teacher. Beyond supervised training, self-distillation can be effectively applied with unlabeled data in semi-supervised learning. For example, [103] proposes ensembling predictions during training by using outputs from a single network across different epochs as a teacher for the current epoch. In contrast to maintaining an exponential moving average (EMA) prediction [103], [104] utilizes an average of consecutive student models (past model weights) as a stronger teacher, though this approach is not suitable for black-box UDA. In this paper, we propose an adaptive label smoothing technique on source predictions and for the first time introduce structural regularizations [24, 105] into unsupervised distillation.

2.5 Semi-supervised Learning

A senior semi-supervised learning approach [106] assigns pseudo-labels based on model predictions for unlabeled data, which are then used alongside labeled data to retrain the model. Another classic method [107] minimizes the entropy of each unlabeled data point as a form of regularization. In the deep learning era, [108] unifies existing dominant approaches for semi-supervised learning into a holistic method, which has gained increasing popularity due to its superior performance. The holistic method is further enhanced in [109] by incorporating distribution alignment, which encourages the marginal distribution of predictions on unlabeled data to match that of labeled data, and augmentation anchoring, which promotes weak-to-strong consistency. In contrast, [26] presents a simple approach that also leverages weak-to-strong consistency, but uses a pre-defined threshold to filter out high-confidence samples. A recent notable work [110] enhances [26] by introducing a curriculum learning approach that flexibly adjusts thresholds for different classes. In this paper, we incorporate the marginal distribution of unlabeled data in weak-to-strong consistency by using logit adjustment and information maximization to mitigate class bias.

3 Methodology

In this paper, we focus on the $K$ -way cross-domain image classification task and address a realistic yet challenging UDA setting, where only the predictions of a black-box source model are accessible for unlabeled target domain data. For the single-source UDA scenario, the source domain $\{x_{s}^{i},y_{s}^{i}\}_{i=1}^{n_{s}}$ consists of $n_{s}$ labeled instances, where $x_{s}^{i}\in\mathcal{X}_{s},y_{s}^{i}\in\mathcal{Y}_{s}$ , and the target domain $\{x_{t}^{i},y_{t}^{i}\}_{i=1}^{n_{t}}$ consists of $n_{t}$ unlabeled instances, where $x_{t}^{i}\in\mathcal{X}_{t},y_{t}^{i}\in\mathcal{Y}_{t}$ , and the goal of UDA is typically to infer the values of $\{y_{t}^{i}\}_{i=1}^{n_{t}}$ . The label spaces are assumed to be identical across domains, i.e., $\mathcal{Y}_{s}=\mathcal{Y}_{t}$ , even when label shift occurs [111, 81]. By contrast, partial-set UDA [112, 113] assumes that some source classes do not exist in the target domain, i.e., $\mathcal{Y}_{s}\supset\mathcal{Y}_{t}$ . Concerning the black-box adaptation setting, only the trained source model is provided through an API service, without requiring access to the source data. It differs from prior source-free domain adaptation methods [16, 17] in requiring no details about the source model, e.g., backbone type and network parameters. In particular, only the predicted labels along with their associated probability values of the target instances $\mathcal{X}_{t}$ from the source model $f_{s}:\mathcal{X}_{s}\to\mathcal{Y}_{s}$ are utilized for knowledge adaptation in the target domain.

3.1 Source Model Generation

We elaborate on how to obtain the trained model from the source domain as follows. Unlike most source-free domain adaptation methods [16, 19] that elegantly design the source model with a bottleneck layer and weight normalization [114], we simply insert a single linear fully-connected (FC) layer after the backbone feature network and use label smoothing (LS) [115] to train $f_{s}$ ,

\mathcal{L}_{s}(f_{s};\mathcal{X}_{s},\mathcal{Y}_{s})=\mathbb{E}_{(x_{s},y_{s})\in\mathcal{X}_{s}\times\mathcal{Y}_{s}}\mathcal{H}(q_{s},\delta(f_{s}(x_{s}))),

(1)

where $q_{s}=(1-\epsilon)\mathbf{1}_{y_{s}}+\epsilon/K$ is the smoothed label vector and $\epsilon$ is empirically set to 0.1, and $\mathbf{1}_{j}$ denotes a $K$ -dimensional one-hot encoding with only the $j$ -th value being 1. Moreover, $\delta(\cdot)$ denotes the softmax function, and $\mathcal{H}(q,p)$ denotes the cross-entropy between $p$ and $q$ .

Remark #1. In contrast, for the self-defined target network $f_{t}:\mathcal{X}_{t}\to\mathcal{Y}_{t}$ , we adopt the common practice in source-free domain adaptation [16, 19, 116, 117, 118]. Specifically, the bottleneck layer consists of a batch normalization layer and an FC layer, while the classifier includes a weight normalization layer followed by an FC layer.

3.2 Prototypical and Adaptive Knowledge Distillation

To extract knowledge from a black-box model, a natural solution is knowledge distillation [22], which trains the target model (student) to replicate the predictions of the source model (teacher). However, existing knowledge distillation methods are primarily designed for supervised training tasks, with the consistency loss serving as a regularization term, as shown below:

\mathcal{L}_{kd}(f_{t};\mathcal{X}_{t},f_{s})=\mathbb{E}_{x_{t}\in\mathcal{X}_{t}}D_{kl}\left(\delta(f_{s}(x_{t}))\ ||\ \delta(f_{t}(x_{t}))\right),

(2)

where $D_{kl}$ denotes the Kullback-Leibler (KL) divergence loss. However, the source model $f_{s}$ ’s outputs for target instances are often inaccurate and sometimes even incomplete. For the studied black-box adaptation problem, highly relying on the teacher $f_{s}(x_{t})$ through a consistency loss is no longer desirable. Thus, we propose adaptively smoothing the teacher $p$ by focusing on the top- $r$ largest values as:

\text{AdaLS}(p,r)_{i}=\left\{\begin{aligned} p_{i},\qquad\qquad\qquad\qquad\qquad\;&i\in\mathcal{T}^{r}_{p},\\ (1-\sum\nolimits_{j\in\mathcal{T}^{r}_{p}}p_{j})/(K-r),\;&\text{otherwise}.\end{aligned}\right.

(3)

Here $\mathcal{T}^{r}_{p}$ represents the index set of the top- $r$ classes in $p$ . We refer to the transformation in Eq. (3) as adaptive label smoothing (Adaptive LS), as it retains instance-specific top- $r$ values, which vary across samples. Using the smoothed output ( $r=1$ ) means that we merely need the predicted label along with its maximum probability, which sounds more flexible when using an API service provided by profitable companies. For simplicity, we denote the refined output as

p_{s}(x_{t})=\text{AdaLS}\left(\delta(f_{s}(x_{t})),r=1\right)

(4)

throughout this paper. The refined output $p_{s}(x_{t})$ is expected to work better than the original output $p$ for several reasons: 1) it reduces redundant and noisy information by focusing on the pseudo label (the class associated with the largest value) and applying a uniform distribution to the other classes, similar to label smoothing [115]; 2) it does not rely solely on the noisy pseudo label, instead using the largest value as a measure of confidence, akin to self-weighted pseudo labeling [119].

Inspired by previous studies [16, 52] that leverage prototypes to denoise the pseudo labels for unlabeled target data, we further develop a prototypical pseudo-labeling strategy based on the source predictions. Once we have a pre-trained feature extractor $g_{t}$ , the prototype of the $k$ -th class in the target domain could be calculated as follows,

\mathcal{C}_{k}=\frac{\sum_{x_{t}\in\mathcal{X}_{t}}[p_{s}(x)]_{k}\,g_{t}(x_{t})}{\sum_{x\in\mathcal{X}_{t}}[p_{s}(x)]_{k}}.

(5)

We can then obtain the soft pseudo labels as the softmax over the distance between the target features and the prototypes as follows:

[p_{s}^{t}(x_{t})]_{k}=\frac{\exp\left(-d(g_{t}(x_{t}),\,\mathcal{C}_{k})/\tau\right)}{\sum_{k^{\prime}}\exp\left((-d(g_{t}(x_{t}),\,\mathcal{C}_{k^{\prime}})/\tau\right)},

(6)

where $\tau$ represents the temperature parameter, which is empirically set to 0.1, and $d(\cdot,\cdot)$ denotes the cosine distance. To integrate these two types of pseudo labels in Eq. (4) and Eq. (6), we present a simple weighted addition below,

P_{s}^{t}(x_{t})=\beta\,p_{s}(x_{t})+(1-\beta)\,p_{s}^{t}(x_{t}),

(7)

where the balancing parameter $\beta\in[0,1]$ is empirically set to 0.5. The prototypical pseudo label is omitted when $\beta=1$ .

To further alleviate the noise in the teacher prediction, we follow [103, 102] and adopt a self-distillation strategy, shown in Fig. 2, maintaining an EMA prediction by

P_{s}^{t}(x_{t})\leftarrow\gamma\,P_{s}^{t}(x_{t})+(1-\gamma)\,\delta(f_{t}(x_{t})),\ \forall x_{t}\in\mathcal{X}_{t},

(8)

where $\gamma$ is a momentum hyper-parameter, which is empirically set to 0.7. Following [103], we update teacher predictions after every training epoch. When $\gamma=1$ , there exists no temporal ensembling, i.e., the refined source predictions keep acting as a teacher throughout distillation.

Remark #2. To calculate the prototypical soft pseudo labels, features are reduced through principal component analysis and $l_{2}$ -normalized. To determine the feature extractor $g_{t}$ , we could use powerful large neural networks; however, for a fair comparison with other methods, we default to utilizing the feature encoder module in the distilled target network $f_{t}$ . The analysis is provided in the experiments.

3.3 Self-distillation with Structural Regularizations

As mentioned earlier, the teacher output from the source model is likely to be inaccurate and noisy due to domain shift. Even though we propose a promising solution in Eq. (7), which only considers point-wise information during the distillation process, it fails to account for the data structure in the target domain, making it insufficient for effective noisy knowledge distillation. To address this, we incorporate the structural information in the target domain to regularize the distillation process. On the one hand, we consider the pairwise structural information via MixUp [120], and employ the interpolation consistency training [24] technique as below,

	$\displaystyle\mathcal{L}_{mix}$	$\displaystyle(f_{t};\mathcal{X}_{t})=\mathbb{E}_{x^{t}_{i},x^{t}_{j}\in\mathcal{X}_{t}}\mathbb{E}_{\lambda\in\text{Beta}(\alpha,\alpha)}$		(9)
		$\displaystyle\mathcal{H}\left(\text{Mix}_{\lambda}\left(\delta(f^{\prime}_{t}(x^{t}_{i})),\delta(f^{\prime}_{t}(x^{t}_{j}))\right),\delta(f_{t}\left(\text{Mix}_{\lambda}(x^{t}_{i},x^{t}_{j})\right))\right),$		(9)

where the operation $\text{Mix}_{\lambda}(a,b)=\lambda\cdot a+(1-\lambda)\cdot b$ denotes the MixUp operation, $\lambda$ is sampled from a Beta distribution, and $\alpha$ is the hyper-parameter, empirically set to 0.3 according to [120]. Note that $f^{\prime}_{t}$ just offers the values of $f_{t}$ but requires no gradient optimization. Here we do not adopt the EMA update strategy in [24] for $f^{\prime}_{t}$ . Eq. (9) can be treated to augment the target domain with more interpolated samples, which is beneficial for better generalization ability.

On the other hand, we also consider the global structural information during distillation in the target domain. In fact, during distillation, the classes with a large number of instances are relatively easy to learn, which may wrongly recognize some confusing target instances as such classes in turn. To circumvent this problem, we attempt to encourage diversity among the predictions of all the target instances. Specifically, we try to maximize the widely-used mutual information objective [105, 25, 16, 121] in the following,

	$\displaystyle\mathcal{L}_{mi}(f_{t};\mathcal{X}_{t})$	$\displaystyle=\mathcal{H}(\mathcal{Y}_{t})-\mathcal{H}(\mathcal{Y}_{t}\|\mathcal{X}_{t})$		(10)
		$\displaystyle=h\left(\mathbb{E}_{x_{t}\in\mathcal{X}_{t}}\delta(f_{t}(x_{t}))\right)-\mathbb{E}_{x_{t}\in\mathcal{X}_{t}}\ h\left(\delta(f_{t}(x_{t}))\right),$		(10)

where $h(p)=-\sum_{i}p_{i}\log p_{i}$ represents the conditional entropy function. Note that, increasing the marginal entropy $\mathcal{H}(\mathcal{Y}_{t})$ encourages the label distribution to be uniform while decreasing the conditional entropy $\mathcal{H}(\mathcal{Y}_{t}|\mathcal{X}_{t})$ encourages unambiguous network predictions.

By combining the objectives defined in Eqs. (2), (9), and (10), the final loss function for the first distillation step of ProDDing is formulated as follows:

\mathcal{L}_{prod}=\mathcal{L}_{skd}+\mathcal{L}_{mix}-\mathcal{L}_{mi},\\

(11)

where $\mathcal{L}_{skd}(f_{t};\mathcal{X}_{t},f_{s})=\mathbb{E}_{x_{t}\in\mathcal{X}_{t}}D_{kl}\left(\delta(P_{s}^{t}(x_{t}))\ ||\ \delta(f_{t}(x_{t}))\right)$ , and both structural regularizations equally contribute to the Prototypical Distillation method (ProD). Unlike the closely related work [82], which iteratively refines pseudo labels and optimizes the target network, ProD adopts a unified approach to directly learn accurate predictions for the target data, which is more effective at capturing the inherent data structure of the target domain.

3.4 Debiased Fine-tuning

Through the proposed prototypical structural knowledge distillation method from black-box source predictors $f_{s}$ , it is expected to learn a well-performing white-box target model. However, the distilled model seems sub-optimal since it is mainly optimized via the point-wise knowledge distillation term in Eq. (2), which highly depends on the source predictions. Inspired by DIRT-T [122], we hypothesize that the network performance can be further improved by introducing a secondary training phase focused exclusively on minimizing violations of the target-side cluster assumption. Instead of using the parameter-sensitive virtual adversarial training [122], we refine the distilled target model by adopting the widely-used weak-to-strong consistency technique introduced in FixMatch [26], as illustrated below:

\mathcal{L}_{fm}=\mathbb{E}_{x_{t}\in\mathcal{X}_{t}}\ \mathbb{I}(max(\delta(f_{t}(x_{t})))\geq\eta)\ \mathcal{H}(\mathbf{1}_{\hat{y}_{t}},\delta(f_{t}(\mathcal{A}(x_{t})))),

(12)

where $\hat{y}_{t}=\arg\max(\delta(f_{t}(x_{t})))$ represents the hard pseudo-label based on weak augmentation, $\eta$ is the threshold (dashed in Fig. 3), and $\mathcal{A}(\cdot)$ denotes the augmentation sampled from AutoAugment [123].

As mentioned earlier, class bias is a common obstacle in the unsupervised learning process. Inspired by [124], which adjusts the logits per class for long-tail learning, we incorporate an estimate of the class prior into the consistency loss to mitigate the bias toward ‘easy’ classes. Firstly, the class prior is iteratively estimated per epoch by

\pi_{k}=\frac{\mathbb{E}_{x_{t}\in\mathcal{X}_{t}}\,\mathbb{I}(\hat{y}_{t}=k)}{n_{t}},\,k\in[1,\dots,K].

(13)

Secondly, we adjust the logits of samples under strong augmentations, as shown in Fig. 3. The adjustment is formulated as follows:

	$\displaystyle\mathcal{L}_{afm}=\mathbb{E}_{x_{t}\in\mathcal{X}_{t}}$	$\displaystyle\ \mathbb{I}(max(\delta(f_{t}(x_{t})))\geq\eta)\$		(14)
		$\displaystyle\mathcal{H}(\mathbf{1}_{\hat{y}_{t}},\delta(f_{t}(\mathcal{A}(x_{t}))+\rho\log\pi)),$		(14)

where $\rho$ denotes the adjustable parameter, empirically set to 0.5. At the same time, we can also apply mutual information maximization, as described in Eq. (10), to alleviate class bias in samples under weak augmentation. Finally, the overall loss for the second step of ProDDing (referred to as Debiased Fine-tuning, Ding) is given by:

\mathcal{L}_{ding}=\mathcal{L}_{afm}-\mathcal{L}_{mi}.\\

(15)

Algorithm 1 Pseudocode of ProDDing for black-box UDA.

1. Source Model Generation

Require:

\{x_{s}^{i},y_{s}^{i}\}_{i=1}^{n_{s}}

\triangleright

Train

f_{s}

via minimizing the objective in Eq. (1).

2. Prototypical Distillation

Require: Target data

\{x_{t}^{i}\}_{i=1}^{n_{t}}

, source predictions

\{p_{s}(x_{t})\}_{i=1}^{n_{t}}

, parameters

\beta=0.5,\tau=0.1

, the number of epochs

T_{m}

\triangleright

Obtain the smoothed source predictions via Eq. (4) (

r=1

\triangleright

Obtain the prototypical pseudo labels via Eq. (6).

\triangleright

Initialize the teacher output

P_{s}^{t}(x_{t})

via Eq. (7).

for

e=1

T_{m}

for

i=1

n_{b}

\triangleright

Sample a batch from target data.

\triangleright

Apply MixUp within the batch.

\triangleright

Update

f_{t}

via minimizing the objective in Eq. (11).

end for

\triangleright

Update the teacher output

P_{s}^{t}(x_{t})

via Eq. (8).

end for

3. Debiased Fine-tuning

Require: Target data

\{x_{t}^{i}\}_{i=1}^{n_{t}}

, parameters

\rho=0.5,\eta

, the distilled target network

f_{t}

, the number of epochs

T_{m}

for

e=1

T_{m}

\triangleright

Obtain the pseudo labels

\hat{y}_{t}

under weak augmentation.

\triangleright

Update the label prior estimate

\pi

using Eq. (13).

for

i=1

n_{b}

\triangleright

Sample a batch from target data with both weak and strong augmentations.

\triangleright

Update

f_{t}

via minimizing the objective in Eq. (15).

end for

So far, we have presented all the details of the two steps within the proposed framework (ProDDing). A full description of ProDDing can be found in Algorithm 1.

Remark #3. In a more challenging case, i.e., hard-label black-box UDA, where only the predicted label is available for each target query, we simply employ the conventional label smoothing technique instead of AdaLS in Eq. (4).

TABLE I: Accuracies (%) on the Office-Home dataset for UDA under two black-box scenarios. The ‘Hard’ setting indicates that only the predicted label is available for each target query. The best results are bolded and highlighted in different colors for each scenario.

Methods	Hard	Ar $\to$ Cl	Ar $\to$ Pr	Ar $\to$ Re	Cl $\to$ Ar	Cl $\to$ Pr	Cl $\to$ Re	Pr $\to$ Ar	Pr $\to$ Cl	Pr $\to$ Re	Re $\to$ Ar	Re $\to$ Cl	Re $\to$ Pr	Avg.
No Adapt.	-	44.7	68.3	75.2	54.4	63.4	66.7	52.3	40.3	73.5	66.5	46.4	78.0	60.8
NLL-OT [125]	✗	46.3	69.6	75.8	56.7	65.4	68.3	54.0	41.8	74.4	67.0	48.7	78.8	62.2
NLL-KL [82]	✗	47.1	70.1	76.1	57.1	65.9	68.5	54.2	42.4	74.5	67.1	48.9	79.0	62.6
NLL-MM [19]	✗	47.5	75.9	80.2	62.0	74.4	76.4	57.9	43.6	80.5	67.7	47.4	81.8	66.3
SHOT^† [16]	✗	53.2	77.5	80.3	66.3	77.2	77.7	62.3	48.6	81.0	70.8	54.5	82.5	69.3
DINE [27]	✗	54.7	78.9	81.7	64.4	75.2	78.4	62.5	51.0	81.8	70.9	57.1	84.9	70.1
BETA [85]	✗	56.2	79.7	82.8	66.5	76.3	79.1	64.3	52.2	82.9	72.4	58.4	85.0	71.3
SEAL [87]	✗	56.5	79.8	82.6	68.7	77.9	79.0	65.2	53.7	83.4	73.0	58.6	84.7	71.9
ProD	✗	53.5	80.0	81.2	67.8	78.0	79.3	63.7	51.1	82.3	70.8	55.8	84.1	70.6
ProDDing	✗	56.9	80.9	83.2	68.2	79.6	81.6	65.4	54.6	83.7	71.6	59.0	84.9	72.5
SHOT^† [16]	✓	52.0	76.9	79.2	64.4	76.1	75.7	60.4	47.4	79.3	69.4	52.8	81.4	67.9
DINE [27]	✓	52.4	75.6	79.3	62.8	74.6	75.1	59.5	48.0	79.0	69.5	55.4	82.9	67.8
BETA [85]	✓	51.6	75.1	79.4	62.1	74.3	75.6	59.1	48.5	79.1	69.4	55.1	82.6	67.7
SEAL [87]	✓	50.5	74.7	78.8	61.6	71.3	72.9	58.2	46.3	78.1	69.9	52.7	82.4	66.4
ProD	✓	51.5	78.9	80.7	66.6	77.0	78.5	62.8	49.1	81.0	70.7	54.5	83.3	69.5
ProDDing	✓	55.5	79.5	82.6	68.1	79.5	80.8	64.3	53.1	82.5	71.3	57.6	84.7	71.6

4 Experiments

4.1 Setup

a) Datasets. Office-Home [126] is a challenging medium-sized benchmark comprising four distinct domains: Artistic images (Ar), Clip Art (Cl), Product images (Pr), and Real-World images (Re). Each domain includes 65 categories of everyday objects. Office [127] is a widely-used UDA benchmark for cross-domain object recognition. It includes three domains: Amazon (A), DSLR (D), and Webcam (W), with each domain containing 31 object classes commonly found in office environments. DomainNet [128] is a large-scale dataset encompassing common objects across six diverse domains, each containing 345 categories, such as bracelets, planes, birds, and cellos. The six domains are: clipart-style illustrations (clp), infographic-style images (inf), paintings (pnt), simplistic drawings from the quick draw game (qdr), real-world photographs (rel), and hand-drawn sketches (skt). To investigate performance under label shift for different UDA methods, we also consider two variants of Office-Home: Office-Home-RSUT [111, 129], where the source and target label distributions are manually modified to be reversed versions of one another, and Office-Home-Partial [112, 113], which selects the first 25 categories (in alphabetical order) from the 65 classes in each domain as the partial target domain.

b) Baseline methods. No Adapt. is also known as ‘source only’ in this field that infers the class label from the source predictions. Throughout this paper, we compare ProDDing with three existing black-box UDA methods: DINE [27], BETA [85], and SEAL [87]. Besides, we extend a popular source-free UDA method, SHOT [16], to both black-box scenarios, denoted as SHOT^†. In particular, SHOT^† first learns a white-box target model by utilizing the source predictions with a weighted cross-entropy loss. Subsequently, SHOT^† applies the algorithm proposed in [16] to adapt the learned model to the target domain. Additionally, we construct two baselines using noisy label learning: NLL-OT and NLL-KL, which regularly update the pseudo labels during the training process with different optimization objectives. NLL-OT adopts the optimal transport (OT) technique [125], while NLL-KL adopts the diversity-promoting KL divergence [82] to refine the noisy pseudo labels. In contrast, NLL-MM employs the divide-to-learn strategy [19] based on confidence values and leverages the semi-supervised learning algorithm, MixMatch [109], to train the target model. For our methods, we also provide the results of Prod in each table. As the source model plays a crucial role in source-free UDA, all the results presented in the experiments were reproduced by us using the source code provided by the authors of each respective work. We attempted to reproduce other existing black-box UDA approaches [83, 86, 88], but were unable to match the results reported in their original papers.

TABLE II: Accuracies (%) on the Office dataset for UDA under two black-box scenarios.

Methods Hard A $\to$ D A $\to$ W D $\to$ A D $\to$ W W $\to$ A W $\to$ D Avg. No Adapt. - 80.3 77.9 61.5 94.5 63.5 98.4 79.3 NLL-OT [125] ✗ 85.9 83.4 63.5 96.1 65.1 98.4 82.1 NLL-KL [82] ✗ 88.4 83.8 63.9 96.4 65.4 98.4 82.7 NLL-MM [19] ✗ 84.7 85.5 69.4 95.8 72.4 96.3 84.0 SHOT^† [16] ✗ 93.0 92.1 72.8 95.7 74.0 97.3 87.5 DINE [27] ✗ 88.8 89.4 74.8 97.7 74.8 98.8 87.4 BETA [85] ✗ 91.3 88.8 75.4 98.3 76.6 98.7 88.2 SEAL [87] ✗ 88.4 88.1 75.6 98.0 76.7 98.9 87.6 ProD ✗ 94.0 91.2 74.8 96.0 75.3 97.2 88.1 ProDDing ✗ 94.6 92.0 75.7 96.2 76.8 97.7 88.8 SHOT^† [16] ✓ 92.3 91.5 71.7 95.0 72.8 97.5 86.8 DINE [27] ✓ 88.5 87.2 70.0 97.0 71.3 98.8 85.4 BETA [85] ✓ 88.4 87.1 69.8 96.7 71.2 99.0 85.4 SEAL [87] ✓ 86.8 85.7 68.5 96.3 70.0 98.6 84.3 ProD ✓ 93.2 91.3 72.5 97.1 73.8 97.9 87.6 ProDDing ✓ 94.1 92.8 75.3 97.7 76.6 97.9 89.1

TABLE III: Accuracies (%) on the DomainNet dataset for UDA under two black-box scenarios. ^∘ denotes results under the hard-label scenario.

No Adapt.	clp	inf	pnt	qdr	rel	skt	Avg.	NLL-OT [125]	clp	inf	pnt	qdr	rel	skt	Avg.	NLL-KL [82]	clp	inf	pnt	qdr	rel	skt	Avg.
clp $\to$	-	16.3	34.7	9.8	51.9	40.3	30.6	clp $\to$	-	14.8	37.0	17.0	60.4	40.7	34.0	clp $\to$	-	15.7	40.7	17.8	62.6	42.6	35.9
inf $\to$	31.2	-	30.8	2.3	47.3	24.7	27.2	inf $\to$	38.7	-	34.6	3.6	57.9	32.2	33.4	inf $\to$	40.6	-	38.1	4.4	60.0	33.5	35.3
pnt $\to$	40.1	16.3	-	2.6	57.2	33.8	30.0	pnt $\to$	46.3	15.4	-	5.6	61.4	38.5	33.4	pnt $\to$	48.0	16.4	-	6.5	63.5	39.8	34.9
qdr $\to$	9.6	1.0	1.5	-	3.6	7.8	4.7	qdr $\to$	17.4	1.2	3.2	-	10.0	13.2	9.0	qdr $\to$	17.4	1.2	3.4	-	10.7	13.8	9.3
rel $\to$	46.7	18.9	46.3	4.5	-	34.3	30.1	rel $\to$	50.1	17.3	42.0	7.5	-	37.1	30.8	rel $\to$	52.1	18.7	47.2	8.6	-	38.7	33.1
skt $\to$	47.9	12.7	33.6	11.6	45.9	-	30.4	skt $\to$	50.6	14.2	39.0	19.6	59.1	-	36.5	skt $\to$	53.8	15.2	43.1	20.1	61.2	-	38.7
Avg.	35.1	13.0	29.4	6.2	41.2	28.2	25.5	Avg.	40.6	12.6	31.2	10.7	49.8	32.4	29.5	Avg.	42.4	13.4	34.5	11.5	51.6	33.7	31.2
NLL-MM [19]	clp	inf	pnt	qdr	rel	skt	Avg.	SHOT^† [16]	clp	inf	pnt	qdr	rel	skt	Avg.	DINE [27]	clp	inf	pnt	qdr	rel	skt	Avg.
clp $\to$	-	13.5	38.4	10.7	58.3	40.7	32.3	clp $\to$	-	15.6	41.5	16.5	62.2	41.8	35.5	clp $\to$	-	15.9	42.6	13.8	60.9	43.1	35.3
inf $\to$	32.3	-	33.4	2.5	54.3	26.6	29.8	inf $\to$	41.5	-	38.9	4.7	58.7	33.4	35.4	inf $\to$	37.6	-	41.1	4.3	57.3	31.6	34.4
pnt $\to$	38.9	14.1	-	2.6	61.0	32.6	29.8	pnt $\to$	49.2	16.7	-	7.2	62.1	39.4	34.9	pnt $\to$	44.0	16.3	-	5.5	62.6	39.3	33.5
qdr $\to$	9.8	0.7	1.3	-	4.2	7.9	4.8	qdr $\to$	20.8	1.9	3.9	-	11.5	15.2	10.7	qdr $\to$	14.0	0.8	3.3	-	9.1	12.0	7.8
rel $\to$	47.0	15.0	49.3	4.9	-	35.1	30.2	rel $\to$	52.6	18.9	46.5	8.4	-	38.9	33.1	rel $\to$	51.9	18.5	52.0	7.5	-	39.8	33.9
skt $\to$	48.2	10.3	34.1	13.0	51.9	-	31.5	skt $\to$	54.3	15.1	43.6	18.1	60.3	-	38.3	skt $\to$	52.8	14.5	44.0	16.7	58.2	-	37.2
Avg.	35.2	10.7	31.3	6.7	45.9	28.6	26.4	Avg.	43.7	13.6	34.9	11.0	51.0	33.8	31.3	Avg.	40.1	13.2	36.6	9.5	49.6	33.2	30.4
BETA [85]	clp	inf	pnt	qdr	rel	skt	Avg.	SEAL [87]	clp	inf	pnt	qdr	rel	skt	Avg.	ProD	clp	inf	pnt	qdr	rel	skt	Avg.
clp $\to$	-	12.8	39.9	12.2	61.0	38.9	33.0	clp $\to$	-	15.4	42.6	12.0	63.6	43.9	35.5	clp $\to$	-	15.7	44.2	13.3	63.7	41.9	35.8
inf $\to$	34.0	-	39.2	3.4	57.6	27.8	32.4	inf $\to$	39.8	-	42.6	3.5	59.0	32.1	35.4	inf $\to$	38.4	-	43.7	4.0	60.7	31.7	35.7
pnt $\to$	38.5	14.2	-	3.0	62.6	36.5	30.9	pnt $\to$	45.1	16.6	-	3.8	64.0	41.0	34.1	pnt $\to$	44.2	15.9	-	5.0	64.3	38.3	33.5
qdr $\to$	10.4	0.7	1.5	-	7.5	8.7	5.7	qdr $\to$	16.1	1.2	3.5	-	9.1	12.8	8.5	qdr $\to$	18.6	0.9	7.1	-	16.8	13.4	11.3
rel $\to$	46.6	14.7	49.5	6.1	-	36.3	30.6	rel $\to$	54.1	19.0	52.7	6.0	-	41.8	34.7	rel $\to$	51.8	17.8	51.6	6.7	-	39.2	33.4
skt $\to$	46.4	11.1	40.9	15.0	58.2	-	34.3	skt $\to$	54.7	15.2	44.5	15.1	61.4	-	38.2	skt $\to$	53.0	14.8	45.5	15.6	62.7	-	38.3
Avg.	35.2	10.7	34.2	8.0	49.4	29.6	27.8	Avg.	42.0	13.5	37.2	8.1	51.4	34.3	31.1	Avg.	41.2	13.0	38.4	8.9	53.6	32.9	31.3
ProDDing	clp	inf	pnt	qdr	rel	skt	Avg.	SHOT^†^∘ [16]	clp	inf	pnt	qdr	rel	skt	Avg.	ProDDing^∘	clp	inf	pnt	qdr	rel	skt	Avg.
clp $\to$	-	15.2	45.5	14.2	65.9	43.2	36.8	clp $\to$	-	15.5	41.7	16.9	61.8	41.7	35.5	clp $\to$	-	15.8	46.0	17.2	66.3	44.7	38.0
inf $\to$	41.3	-	45.1	3.9	62.5	33.7	37.3	inf $\to$	41.8	-	38.4	4.5	58.6	32.4	35.1	inf $\to$	42.9	-	44.7	4.3	63.1	35.7	38.2
pnt $\to$	47.7	15.4	-	4.9	65.9	40.1	34.8	pnt $\to$	49.4	16.4	-	7.4	62.1	39.4	34.9	pnt $\to$	48.9	16.0	-	7.1	66.1	40.9	35.8
qdr $\to$	22.6	1.1	7.7	-	17.4	15.3	12.8	qdr $\to$	21.1	1.6	4.1	-	11.2	14.8	10.6	qdr $\to$	21.0	1.0	5.6	-	18.2	16.2	12.4
rel $\to$	54.8	17.4	52.0	6.7	-	41.0	34.4	rel $\to$	51.4	18.5	47.1	8.4	-	38.6	32.8	rel $\to$	55.4	17.7	52.3	8.7	-	41.9	35.2
skt $\to$	55.3	15.1	46.1	16.2	65.4	-	39.6	skt $\to$	54.3	14.7	43.3	17.9	60.1	-	38.1	skt $\to$	55.9	15.1	46.2	19.1	64.9	-	40.2
Avg.	44.3	12.8	39.3	9.2	55.4	34.7	32.6	Avg.	43.6	13.3	34.9	11.0	50.8	33.4	31.2	Avg.	44.8	13.1	39.0	11.3	55.7	35.9	33.3

c) Implementation details. For the source model $f_{s}$ , we train it using all samples from the source domain with a random seed of 1234 and select the checkpoint with the best performance on the source domain. In the case of DomainNet, the source model is trained on the training split and validated using the testing split. Throughout this paper, we primarily use ResNet-50 [130] as the backbone, as it is a widely adopted architecture in the UDA field. Following [16], mini-batch SGD is employed to learn the layers initialized from the ImageNet pre-trained model or last stage with the learning rate (1e-3), and new layers from scratch with the learning rate (1e-2). Besides, we use the suggested training settings in [7, 16], including learning rate scheduler, momentum (0.9), weight decay (1e-3), bottleneck size (256), and batch size (64). Concerning the parameters in ProDDing, we adopt the following values for all datasets: $r=1,\beta=0.5,\tau=0.1,\rho=0.5$ . Additionally, $T_{m}=30,\eta=0.95$ is used for all datasets, except for DomainNet, where $T_{m}=10,\eta=0.6$ . We randomly run all the methods three times with different random seeds (2024, 2025, 2026) using PyTorch and report the average accuracies.

4.2 Results on Standard UDA datasets

We provide the results on three standard UDA datasets on Tables I,II,III. As shown in Table I, ProDDing achieves the best average accuracy under the non-hard-label scenario, outperforming the second-best method, SEAL, by approximately 0.5%. Methods utilizing strong data augmentations (i.e., BETA, SEAL, and ProDDing) clearly have an advantage over the other methods. Without the use of strong data augmentations, ProD achieves the best performance, surpassing DINE.

In the hard-label scenario, we present results for only the well-performing methods and observe that ProDDing consistently achieves the best average accuracy. Although all methods experience a decline in accuracy when transitioning from non-hard-label to hard-label scenarios, it is noteworthy that the performance gap between ProDDing and the second-best method enlarges. Surprisingly, ProD significantly outperforms existing black-box counterparts such as BETA and SEAL. This suggests that both ProDDing and its distillation component, ProD, exhibit greater robustness to the quality of the source predictions.

TABLE IV: Per-class accuracies (%) on the Office-Home-RSUT dataset for UDA under two black-box scenarios.

Methods	Hard	Cl $\to$ Pr	Cl $\to$ Re	Pr $\to$ Cl	Pr $\to$ Re	Re $\to$ Cl	Re $\to$ Pr	Avg.
No Adapt.	-	51.5	51.4	37.1	67.2	39.6	71.2	53.0
NLL-OT [125]	✗	53.3	55.4	38.9	68.8	40.5	71.8	54.8
NLL-KL [82]	✗	54.0	57.0	39.7	69.8	42.3	72.1	55.8
NLL-MM [19]	✗	60.8	54.8	34.6	69.7	36.3	74.5	55.1
SHOT^† [16]	✗	61.3	64.9	43.2	73.7	44.8	76.2	60.7
DINE [27]	✗	61.6	56.0	27.5	69.3	35.3	75.9	54.3
BETA [85]	✗	63.5	59.9	33.7	71.2	39.8	77.6	57.6
SEAL [87]	✗	62.8	62.1	39.3	74.3	44.1	77.1	59.9
ProD	✗	64.1	63.3	36.1	71.1	41.0	76.4	58.7
ProDDing	✗	65.2	66.3	39.5	73.2	44.4	78.1	61.1
SHOT^† [16]	✓	59.9	64.1	41.7	73.0	44.1	74.6	59.6
DINE [27]	✓	61.2	61.0	39.2	72.6	41.1	75.4	58.4
BETA [85]	✓	59.9	61.0	40.4	71.5	42.5	74.5	58.3
SEAL [87]	✓	60.1	60.0	40.0	72.2	43.1	74.7	58.3
ProD	✓	63.5	63.4	39.0	72.1	41.9	75.9	59.3
ProDDing	✓	65.0	65.7	42.6	74.6	45.2	77.9	61.8

As shown in Table II, the results on the Office dataset further demonstrate the effectiveness of ProDDing, with notable performance gains over existing methods. In the non-hard-label scenario, ProDDing outperforms the second-best method, BETA, achieving the highest average accuracy. Additionally, ProD surpasses both DINE and SEAL, further demonstrating the effectiveness of prototypical distillation. In the hard-label scenario, the performance advantage of ProDDing over other methods becomes even more obvious, surpassing its own performance in the non-hard-label scenario, particularly when the target domain is W. On the DomainNet dataset, similar conclusions can be drawn: ProDDing achieves the best performance under both scenarios, with the distillation component, ProD, also delivering competitive results compared to other methods.

TABLE V: Accuracies (%) on the Office-Home-Partial dataset for UDA under two black-box scenarios.

Methods	Hard	Ar $\to$ Cl	Ar $\to$ Pr	Ar $\to$ Re	Cl $\to$ Ar	Cl $\to$ Pr	Cl $\to$ Re	Pr $\to$ Ar	Pr $\to$ Cl	Pr $\to$ Re	Re $\to$ Ar	Re $\to$ Cl	Re $\to$ Pr	Avg.
No Adapt.	-	46.5	71.3	80.8	56.0	60.6	66.8	59.2	40.2	76.5	71.4	48.7	76.9	62.9
NLL-OT [125]	✗	55.4	78.4	86.5	67.3	73.1	78.1	70.3	49.1	83.8	76.6	57.0	82.0	71.5
NLL-KL [82]	✗	50.0	68.9	73.8	58.7	62.8	67.5	61.8	45.4	74.9	68.0	50.2	71.4	62.8
NLL-MM [19]	✗	50.4	82.1	87.2	64.8	73.0	80.6	67.1	45.2	85.0	75.4	52.2	83.6	70.5
SHOT^† [16]	✗	53.9	76.3	84.4	69.6	66.4	76.1	66.2	47.5	82.2	78.6	56.4	82.0	70.0
DINE [27]	✗	59.8	84.7	89.5	68.1	79.4	81.5	71.6	54.2	87.8	77.8	62.2	86.7	75.3
BETA [85]	✗	59.3	83.8	90.3	74.1	76.6	81.6	72.3	56.0	85.8	79.9	63.8	86.0	75.8
SEAL [87]	✗	58.0	80.0	83.4	73.6	73.7	77.5	71.5	56.9	84.3	80.2	61.4	81.1	73.5
ProD	✗	60.6	85.2	89.3	71.3	76.5	84.9	70.8	56.9	85.9	77.5	62.6	85.7	75.6
ProDDing	✗	65.2	84.7	90.5	76.2	75.7	84.3	74.5	60.0	86.6	80.0	64.7	85.2	77.3
SHOT^† [16]	✓	51.2	72.0	79.3	62.1	64.2	71.6	63.4	47.6	77.8	72.9	53.2	77.5	66.0
DINE [27]	✓	52.0	74.0	79.6	66.0	64.8	70.4	65.8	45.0	77.5	75.0	54.5	79.4	67.0
BETA [85]	✓	51.5	75.6	83.4	62.5	67.3	73.7	64.3	45.2	80.5	74.4	53.8	80.5	67.7
SEAL [87]	✓	48.7	72.7	79.5	60.9	64.8	69.4	61.3	45.9	78.3	72.9	52.2	77.1	65.3
ProD	✓	59.5	83.6	90.3	73.7	76.5	85.5	72.8	57.0	87.2	79.0	60.6	84.0	75.8
ProDDing	✓	62.6	81.6	90.3	77.6	74.6	83.5	73.0	60.3	87.1	81.2	62.8	83.2	76.5

TABLE VI: Accuracies (%) on eight representative UDA tasks across three datasets. Using ‘proto.’ denotes that

\beta=0.5

, whereas in other cases,

\beta=1.0

ProD in Eq. (11)				Ding in Eq. (15)			Office		Office-Home		DomainNet
$\mathcal{L}_{skd}$	$\mathcal{L}_{mix}$	$\mathcal{L}_{mi}$	proto.	$\mathcal{L}_{fm}$	$\mathcal{L}_{afm}$	$\mathcal{L}_{mi}$	A $\to$ D	W $\to$ A	Ar $\to$ Cl	Pr $\to$ Re	clp $\to$ pnt	pnt $\to$ rel	rel $\to$ skt	skt $\to$ clp	Avg.
							80.3	63.5	44.7	73.5	34.7	57.2	34.3	47.9	54.5
✓							82.1	65.1	45.5	74.8	37.0	60.1	36.0	49.1	56.2
✓	✓						84.5	65.7	46.6	75.5	39.8	60.4	37.6	50.4	57.6
✓		✓					85.7	67.4	47.6	76.3	40.2	62.2	38.7	52.5	58.8
✓	✓	✓					87.6	68.3	49.1	77.4	42.9	62.6	39.9	53.7	60.2
✓	✓	✓	✓				93.4	74.0	52.1	81.2	43.6	63.8	39.5	53.4	62.6
✓	✓	✓	✓	✓			93.8	76.3	54.2	79.5	43.8	62.8	38.3	53.7	62.8
✓	✓	✓	✓		✓		94.0	75.8	55.8	80.8	46.6	65.4	42.3	56.1	64.6
✓	✓	✓	✓			✓	94.4	75.9	54.4	81.9	39.9	58.8	34.5	51.8	61.4
✓	✓	✓	✓	✓		✓	94.2	76.1	55.8	82.4	46.4	65.3	41.0	55.5	64.6
✓	✓	✓	✓		✓	✓	94.8	76.3	56.0	82.5	46.5	65.9	41.7	56.4	65.0

4.3 Results on UDA datasets with Label Shifts

As mentioned above, we also study the effectiveness of different black-box UDA methods for label shifts. As shown in Table IV, SHOT^†, which was a weak baseline in previous tables, achieves better performance than both BETA and SEAL. ProDDing still achieves the best performance under the non-hard-label scenario, winning 3 out of 6 tasks. In the challenging hard-label scenario, different methods exhibit varying behaviors. DINE and BETA show slight improvements, while SEAL performs worse. ProDDing again achieves the best performance under the hard-label scenario, winning all 6 tasks. This may be because the source predictions become more unpredictable, which perturbs the gap between these scenarios and affects the performance of the methods. Besides, we adopt per-class accuracy, following existing methods [111, 129], which may highlight differences in performance across different runs.

Partial-set domain adaptation [112, 113] can be considered a special case of label shift, where some classes are absent in the target domain. As shown in Table V, ProDDing achieves the best average accuracy under both scenarios. The performance of SHOT^† is relatively worse compared to other baseline methods. While other baseline methods (e.g., DINE, BETA, and SEAL) experience significant performance drops, our methods (i.e., ProD and ProDDing) remain more stable.

TABLE VII: Accuracies (%) on the Office and Office-Home dataset for UDA under two black-box scenarios (ResNet-34 used as source model).

Methods	Hard	A $\to$ D	A $\to$ W	D $\to$ A	W $\to$ A	Avg.	Ar $\to$ Cl	Ar $\to$ Pr	Ar $\to$ Re	Cl $\to$ Ar	Cl $\to$ Pr	Cl $\to$ Re	Pr $\to$ Ar	Pr $\to$ Cl	Pr $\to$ Re	Re $\to$ Ar	Re $\to$ Cl	Re $\to$ Pr	Avg.
No Adapt.	-	68.9	70.4	47.2	51.7	59.5	41.7	59.2	68.7	45.0	56.9	60.3	45.7	37.1	68.35	60.3	44.4	75.1	55.2
NLL-OT [125]	✗	79.3	77.4	51.0	55.9	65.9	43.9	61.8	69.8	48.3	59.9	63.3	47.9	39.5	70.0	61.5	46.9	76.0	57.4
NLL-KL [95]	✗	80.7	78.2	53.1	57.4	67.3	45.1	62.5	70.2	48.7	60.7	64.1	47.8	40.5	70.4	62.0	48.3	76.0	58.0
NLL-MM [19]	✗	78.0	78.1	61.2	63.7	70.2	46.6	70.5	75.9	57.8	70.2	74.0	57.7	42.2	77.2	67.1	44.8	82.5	63.9
SHOT^† [16]	✗	86.2	85.0	71.2	73.1	78.9	52.2	75.2	79.2	63.9	75.1	78.0	61.2	51.0	80.6	71.1	55.6	82.3	68.8
DINE [27]	✗	78.9	85.0	70.0	71.3	76.3	53.3	75.3	80.2	60.6	74.0	77.2	63.3	51.9	80.6	70.7	56.5	83.7	69.0
BETA [85]	✗	85.3	85.7	74.7	74.0	79.9	55.8	76.3	80.8	61.9	75.3	78.8	64.6	53.9	82.1	72.6	58.2	84.9	70.4
SEAL [87]	✗	86.3	83.2	72.8	73.9	79.0	55.2	76.7	81.2	65.5	75.5	79.5	65.8	54.3	82.1	73.3	59.0	84.9	71.1
ProD	✗	89.3	85.7	72.2	73.1	80.1	52.7	77.0	79.6	64.1	76.9	78.6	64.1	52.6	81.3	70.2	55.5	83.1	69.6
ProDDing	✗	90.9	86.8	75.5	74.8	82.0	55.8	77.5	81.9	65.1	78.7	79.7	65.7	54.7	82.6	70.5	58.0	84.3	71.2
SHOT^† [16]	✓	88.3	84.7	71.5	72.6	79.3	51.6	74.2	77.9	61.7	74.4	77.0	60.0	50.0	79.6	69.1	54.9	80.9	67.6
DINE [27]	✓	81.5	83.7	68.4	71.0	76.1	52.7	72.7	77.3	59.3	72.3	75.1	58.5	50.2	78.3	68.2	55.4	81.5	66.8
BETA [85]	✓	82.9	83.1	67.3	69.5	75.7	52.4	72.3	77.2	58.8	71.8	75.0	57.5	49.4	78.7	67.8	56.0	82.4	66.6
SEAL [87]	✓	81.2	80.2	65.6	68.0	73.7	49.4	70.1	75.5	57.1	69.6	72.9	57.8	47.9	77.4	66.8	53.9	81.0	65.0
ProD	✓	86.9	85.9	71.2	73.2	79.3	51.6	75.3	78.9	62.5	75.8	77.3	62.6	50.0	80.3	70.1	55.2	82.8	68.5
ProDDing	✓	88.5	88.6	74.9	76.0	82.0	55.9	75.8	81.1	64.1	78.7	79.0	64.7	54.7	81.9	71.2	58.3	84.5	70.8

4.4 Ablation Study

To validate the effectiveness of each component within the proposed ProDDing, we conduct extensive ablation experiments, as shown in Table VI. The first line shows the results of ‘No Adapt.’ With the introduction of the three objectives in Eq. (11), the average accuracy clearly increases. The proposed prototypical pseudo-labeling in the initialized teacher also plays a crucial role, contributing to an improvement in accuracy by 2.4%. Regarding the second step, we find that using $\mathcal{L}_{afm}$ significantly outperforms $\mathcal{L}_{fm}$ , which highlights the effectiveness of logit adjustment during the weak-to-strong consistency phase. A similar improvement is observed with the $\mathcal{L}_{mi}$ objective. When both debiased terms are combined, we achieve the best result in terms of average accuracy across 8 UDA tasks. DINE [27] merely utilizes the $\mathcal{L}_{mi}$ in the fine-tuning step, but we find that incorporating the adjusted consistency term significantly helps boost performance.

4.5 Analysis

To study the effectiveness of ProDDing, we conduct experiments under the hard-label scenario unless stated otherwise.

$\rhd$ Source architecture. We further adopt the ResNet-34 backbone as the source model and present the results on the Office and Office-Home datasets in Table VII. Our method, ProDDing, consistently outperforms all baseline methods or achieves competitive performance under both non-hard-label and hard-label scenarios. It is worth noting that under the hard-label scenario, ProDDing achieves an accuracy of 70.8% in Office-Home, which is close to the non-hard-label performance and 3.2% higher than the second best method, SHOT^†. These results suggest that ProDDing remains effective even under challenging conditions, such as when only limited information is available from the source domain due to either a weak source model or the use of hard labels.

$\rhd$ Sensitivity. We conduct a sensitivity analysis on the temperature parameter $\tau$ in Eq. (6) and the balancing parameter $\beta$ in Eq. (7). Results across four adaptation tasks are shown in Fig. 4 where $\tau$ is in the range of $[0.001,0.01,0.05,0.1,1.0]$ , and Fig. 5 where $\beta$ is in the range of $[0.1,0.3,0.5,0.7,0.9]$ . For each parameter setting, we provide the results for both ProD and ProDDing. On the medium-sized datasets Office and Office-Home, the performance of our methods changes little across different temperature values. Moreover, the results on the large-scale dataset DomainNet in Fig. 4(c-d) show that larger values of $\tau$ yield better results, with $\tau$ =0.1 outperforming $\tau$ =0.001 by approximately 4.7% for ProD. This can be attributed to the fact that a small $\tau$ leads to a sharper pseudo-prediction distribution, which results in overconfidence and a negative effect on the adaptation process. Balancing parameter $\beta$ controls the initialization distribution ensembling ratio of the prediction obtained from the source model and through target domain feature clustering. For tasks in the Office and Office-Home datasets, a smaller $\beta$ (favoring target domain clustering) performs better, while for the more challenging DomainNet dataset (Fig.5(c-d)), a larger $\beta$ (relying more on source domain predictions) yields superior results. The uniform ensemble strikes a balance between these two strategies, ensuring ProDDing adapts effectively to a wide range of scenarios.

$\rhd$ Adaptive label smoothing. We investigate the impact of adaptive label smoothing and present a comparative analysis of ProDDing, alongside four baseline approaches. We evaluate the performance of these methods under different smoothing techniques on two widely-used datasets, Office and Office-Home, as shown in Fig. 6. Our results indicate that nearly all methods benefit from the confidence scores provided by the source model’s interface. Specifically, on the Office-Home dataset, the accuracy of SEAL increases from 83.7% (with hard labels) to 87.9% (with AdaLS, $r$ =1). In contrast, ProDDing achieves an improvement of 0.50%. This observation highlights a key limitation of baseline methods, such as SEAL, which rely heavily on the richness of the source domain information. These methods struggle to maintain stable adaptation performance when the source model only provides hard labels. Further analysis of vanilla label smoothing reveals that it outperforms the use of hard labels, confirming the positive impact of label smoothing on adaptation performance. We also explore the scenario when the source model provides top-3 prediction with the confidence scores (AdaLS, $r$ =3), which is also proved to provide a stable increase for almost all methods. In conclusion, our study demonstrates the effectiveness of the label smoothing technique in enhancing adaptation performance. Notably, ProDDing outperforms all baseline methods and exhibits remarkable stability across different smoothing techniques, making it a robust solution for various scenarios.

$\rhd$ Different pre-trained feature extractor architectures. To investigate the impact of various feature extractor architectures on the calculation of pseudo-predictions through feature clustering, we evaluate four additional network architectures: ResNet-101 [130], ViT-B [131], Swin-B [132], and ConvNeXt-B [133]. The average results across two datasets, Office and Office-Home, are presented in Fig. 7. Note that all experiments use ResNet-50 as the target model and query pre-trained feature extractors in a black-box manner. It is shown that a stronger pre-trained feature encoder $g_{t}$ yields more accurate initial pseudo-predictions (Ensemble). For example, on the Office-Home dataset, compared to ResNet-50, pseudo-prediction accuracy using clustering with the other four stronger feature extractors improves by 0.7%, 1.9%, 2.5%, and 2.5%, respectively. This trend is also reflected in the accuracy of the final target predictions. Our method achieves an accuracy of 73.8% on Office-Home when using ViT-B as the feature extractor, which is an improvement over the 71.9% achieved with ResNet-50. This accuracy further increases to 74.4% when combined with the Swin-B backbone, which benefits from a stronger ImageNet-1k classification ability.

$\rhd$ Convergence. We provide the accuracy convergence curves for DINE [27] and ProDDing in the distillation step and their accuracy curves based on two checkpoints in the fine-tuning step on four tasks. In the distillation step, ProDDing consistently outperforms DINE in the common tasks from Office and Office-Home and achieves competitive performance in challenging tasks from DomainNet. Notably, ProDDing achieves an accuracy of 73.9% on W $\to$ A task, which is 5.7% higher than DINE. As for the fine-tuning step, ProDDing always improves accuracies and becomes convergent on all four tasks, while DINE suffers from negative transferring in complicated tasks from DomainNet, which indicates that it can not be deployed in challenging scenarios. For example, in rel $\to$ skt tasks, ProDDing improves ProD from 39.6% to 42.6%, while DINE (Ding w/o $\mathcal{L}_{afm}$ ) drops to 33.6% using the same checkpoint.

5 Conclusion

We explore a novel yet realistic UDA setting where the source vendor only provides its black-box predictor to the target domain, enabling the use of different networks for each domain while maintaining privacy. Thereafter, we propose a simple yet effective two-step framework called Prototypical Distillation and Debiased tuning (ProDDing). Built on self-distillation, ProDDing elegantly refines the noisy teacher output through adaptive smoothing and prototypical pseudo-labeling, while fully considering the data structure in the target domain during the distillation process. To further mitigate potential class bias, ProDDing continues fine-tuning the distilled model by penalizing logits that exhibit bias toward certain classes. Experiments across multiple datasets confirm the superiority of ProDDing over existing approaches for various UDA tasks. Remarkably, even in the hard-label scenario, where only predicted labels are available, ProDDing achieves surprisingly better results.

References

[1] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
[2] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, “A comprehensive survey on transfer learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, 2020.
[3] S. Ben-David, J. Blitzer, K. Crammer, F. Pereira et al., “Analysis of representations for domain adaptation,” in Proc. NeurIPS, 2007, pp. 137–144.
[4] G. Csurka, “Domain adaptation for visual applications: A comprehensive survey,” arXiv preprint arXiv:1702.05374, 2017.
[5] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
[6] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in Proc. CVPR, 2017, pp. 7167–7176.
[7] M. Long, Z. Cao, J. Wang, and M. I. Jordan, “Conditional adversarial domain adaptation,” in Proc. NeurIPS, 2018, pp. 1647–1657.
[8] Y. Zhang, P. David, and B. Gong, “Curriculum domain adaptation for semantic segmentation of urban scenes,” in Proc. ICCV, 2017, pp. 7472–7481.
[9] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker, “Learning to adapt structured output space for semantic segmentation,” in Proc. CVPR, 2018, pp. 480–490.
[10] D. Hu, J. Liang, Q. Hou, H. Yan, and Y. Chen, “Adversarial domain adaptation with prototype-based normalized output conditioner,” IEEE Transactions on Image Processing, vol. 30, pp. 9359–9371, 2021.
[11] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool, “Domain adaptive faster r-cnn for object detection in the wild,” in Proc. CVPR, 2018, pp. 3339–3348.
[12] M. Khodabandeh, A. Vahdat, M. Ranjbar, and W. G. Macready, “A robust learning approach to domain adaptive object detection,” in Proc. ICCV, 2019, pp. 480–490.
[13] K. Saito, Y. Ushiku, T. Harada, and K. Saenko, “Strong-weak distribution alignment for adaptive object detection,” in Proc. CVPR, 2019, pp. 6956–6965.
[14] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion: Maximizing for domain invariance,” arXiv preprint arXiv:1412.3474, 2014.
[15] M. Long, Y. Cao, J. Wang, and M. Jordan, “Learning transferable features with deep adaptation networks,” in Proc. ICML, 2015, pp. 97–105.
[16] J. Liang, D. Hu, and J. Feng, “Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation,” in Proc. ICML, 2020, pp. 6028–6039.
[17] R. Li, Q. Jiao, W. Cao, H.-S. Wong, and S. Wu, “Model adaptation: Unsupervised domain adaptation without source data,” in Proc. CVPR, 2020, pp. 9641–9650.
[18] J. N. Kundu, N. Venkat, R. V. Babu et al., “Universal source-free domain adaptation,” in Proc. CVPR, 2020, pp. 4544–4553.
[19] J. Liang, D. Hu, Y. Wang, R. He, and J. Feng, “Source data-absent unsupervised domain adaptation through hypothesis transfer and labeling transfer,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 8602–8617, 2021.
[20] M. Fredrikson, S. Jha, and T. Ristenpart, “Model inversion attacks that exploit confidence information and basic countermeasures,” in Proc. CCS, 2015, pp. 1322–1333.
[21] V. K. Kurmi, V. K. Subramanian, and V. P. Namboodiri, “Domain impression: A source data free domain adaptation method,” in Proc. WACV, 2021, pp. 615–625.
[22] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in Proc. NeurIPS Workshops, 2015.
[23] B. Zhao, Q. Cui, R. Song, Y. Qiu, and J. Liang, “Decoupled knowledge distillation,” in Proc. CVPR, 2022, pp. 11 953–11 962.
[24] V. Verma, K. Kawaguchi, A. Lamb, J. Kannala, Y. Bengio, and D. Lopez-Paz, “Interpolation consistency training for semi-supervised learning,” in Proc. IJCAI, 2019, pp. 3635–3641.
[25] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama, “Learning discrete representations via information maximizing self-augmented training,” in Proc. ICML, 2017, pp. 1558–1567.
[26] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C.-L. Li, “Fixmatch: Simplifying semi-supervised learning with consistency and confidence,” in Proc. NeurIPS, 2020, pp. 596–608.
[27] J. Liang, D. Hu, J. Feng, and R. He, “Dine: Domain adaptation from single and multiple black-box predictors,” in Proc. CVPR, 2022, pp. 8003–8013.
[28] J. Huang, A. Gretton, K. Borgwardt, B. Schölkopf, and A. Smola, “Correcting sample selection bias by unlabeled data,” in Proc. NeurIPS, 2006, pp. 601–608.
[29] M. Sugiyama, S. Nakajima, H. Kashima, P. v. Bünau, and M. Kawanabe, “Direct importance estimation with model selection and its application to covariate shift adaptation,” in Proc. NeurIPS, 2007, pp. 1433–1440.
[30] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199–210, 2010.
[31] J. Liang, R. He, Z. Sun, and T. Tan, “Aggregating randomized clustering-promoting invariant projections for domain adaptation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 5, pp. 1027–1042, 2018.
[32] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in Proc. CVPR, 2012, pp. 2066–2073.
[33] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, “Unsupervised visual domain adaptation using subspace alignment,” in Proc. ICCV, 2013, pp. 2960–2967.
[34] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” in Proc. AAAI, 2016, pp. 2058–2065.
[35] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell, “Cycada: Cycle-consistent adversarial domain adaptation,” in Proc. ICML, 2018, pp. 1989–1998.
[36] M. Long, H. Zhu, J. Wang, and M. I. Jordan, “Deep transfer learning with joint adaptation networks,” in Proc. ICML, 2017, pp. 2208–2217.
[37] P. Koniusz, Y. Tas, and F. Porikli, “Domain adaptation by mixture of alignments of second-or higher-order scatter tensors,” in Proc. CVPR, 2017, pp. 4478–4487.
[38] G. Kang, L. Jiang, Y. Yang, and A. G. Hauptmann, “Contrastive adaptation network for unsupervised domain adaptation,” in Proc. CVPR, 2019, pp. 3723–3732.
[39] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada, “Maximum classifier discrepancy for unsupervised domain adaptation,” in Proc. CVPR, 2018, pp. 3941–3950.
[40] M. Chen, H. Xue, and D. Cai, “Domain adaptation for semantic segmentation with maximum squares loss,” in Proc. ICCV, 2019, pp. 2090–2099.
[41] S. Cui, S. Wang, J. Zhuo, L. Li, Q. Huang, and Q. Tian, “Towards discriminability and diversity: Batch nuclear-norm maximization under label insufficient situations,” in Proc. CVPR, 2020, pp. 3941–3950.
[42] Y. Jin, X. Wang, M. Long, and J. Wang, “Minimum class confusion for versatile domain adaptation,” in Proc. ECCV, 2020, pp. 464–480.
[43] F. Maria Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. Rota Bulo, “Autodial: Automatic domain alignment layers,” in Proc. ICCV, 2017, pp. 5067–5075.
[44] W.-G. Chang, T. You, S. Seo, S. Kwak, and B. Han, “Domain-specific batch normalization for unsupervised domain adaptation,” in Proc. CVPR, 2019, pp. 7354–7362.
[45] R. Xu, G. Li, J. Yang, and L. Lin, “Larger norm more transferable: An adaptive feature norm approach for unsupervised domain adaptation,” in Proc. ICCV, 2019, pp. 1426–1435.
[46] X. Chen, S. Wang, M. Long, and J. Wang, “Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation,” in Proc. ICML, 2019, pp. 1081–1090.
[47] S. Xie, Z. Zheng, L. Chen, and C. Chen, “Learning semantic representations for unsupervised domain adaptation,” in Proc. ICML, 2018, pp. 5423–5432.
[48] Y. Pan, T. Yao, Y. Li, Y. Wang, C.-W. Ngo, and T. Mei, “Transferrable prototypical networks for unsupervised domain adaptation,” in Proc. CVPR, 2019, pp. 2239–2247.
[49] K. Tanwisuth, X. Fan, H. Zheng, S. Zhang, H. Zhang, B. Chen, and M. Zhou, “A prototype-oriented framework for unsupervised domain adaptation,” in Proc. NeurIPS, 2021, pp. 17 194–17 208.
[50] Q. Zhang, J. Zhang, W. Liu, and D. Tao, “Category anchor-guided unsupervised domain adaptation for semantic segmentation,” in Proc. NeurIPS, 2019, pp. 435–445.
[51] J. Liang, D. Hu, and J. Feng, “Domain adaptation with auxiliary target domain-oriented classifier,” in Proc. CVPR, 2021, pp. 16 632–16 642.
[52] P. Zhang, B. Zhang, T. Zhang, D. Chen, Y. Wang, and F. Wen, “Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation,” in Proc. CVPR, 2021, pp. 12 414–12 424.
[53] I. Kuzborskij and F. Orabona, “Stability and hypothesis transfer learning,” in Proc. ICML, 2013, pp. 942–950.
[54] Y.-X. Wang and M. Hebert, “Learning by transferring from unsupervised universal sources,” in Proc. AAAI, 2016, pp. 2187–2193.
[55] T. Joachims et al., “Transductive inference for text classification using support vector machines,” in Proc. ICML, 1999, pp. 200–209.
[56] L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank, “Domain transfer svm for video concept detection,” in Proc. CVPR, 2009, pp. 1375–1381.
[57] B. Chidlovskii, S. Clinchant, and G. Csurka, “Domain adaptation in the absence of source domain data,” in Proc. KDD, 2016, pp. 451–460.
[58] J. Liang, R. He, Z. Sun, and T. Tan, “Distant supervised centroid shift: A simple and efficient approach to visual domain adaptation,” in Proc. CVPR, 2019, pp. 2975–2984.
[59] J. Li, Z. Yu, Z. Du, L. Zhu, and H. T. Shen, “A comprehensive survey on source-free domain adaptation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 8, pp. 5743–5762, 2024.
[60] Y. Liu, W. Zhang, and J. Wang, “Source-free domain adaptation for semantic segmentation,” in Proc. CVPR, 2021, pp. 1215–1224.
[61] J. N. Kundu, A. Kulkarni, A. Singh, V. Jampani, and R. V. Babu, “Generalize then adapt: Source-free domain adaptive semantic segmentation,” in Proc. ICCV, 2021, pp. 7046–7056.
[62] X. Li, W. Chen, D. Xie, S. Yang, P. Yuan, S. Pu, and Y. Zhuang, “A free lunch for unsupervised domain adaptive object detection without source data,” in Proc. AAAI, 2021, pp. 8474–8481.
[63] S. Li, M. Ye, X. Zhu, L. Zhou, and L. Xiong, “Source-free object detection by learning to overlook domain style,” in Proc. CVPR, 2022, pp. 8014–8023.
[64] M. Bateson, H. Kervadec, J. Dolz, H. Lombaert, and I. B. Ayed, “Source-free domain adaptation for image segmentation,” Medical Image Analysis, vol. 82, p. 102617, 2022.
[65] C. Yang, X. Guo, Z. Chen, and Y. Yuan, “Source free domain adaptation for medical image segmentation with fourier style mining,” Medical Image Analysis, vol. 79, p. 102457, 2022.
[66] J. Liang, R. He, and T. Tan, “A comprehensive survey on test-time adaptation under distribution shifts,” International Journal of Computer Vision, pp. 1–34, 2024.
[67] F. Wang, Z. Han, Y. Gong, and Y. Yin, “Exploring domain-invariant parameters for source free domain adaptation,” in Proc. CVPR, 2022, pp. 7151–7160.
[68] Y. Ding, L. Sheng, J. Liang, A. Zheng, and R. He, “Proxymix: Proxy-based mixup training with label refinery for source-free domain adaptation,” Neural Networks, vol. 167, pp. 92–103, 2023.
[69] M. Litrico, A. Del Bue, and P. Morerio, “Guiding pseudo-labels with uncertainty estimation for source-free unsupervised domain adaptation,” in Proc. CVPR, 2023, pp. 7640–7650.
[70] F. Fleuret et al., “Uncertainty reduction for model adaptation in semantic segmentation,” in Proc. CVPR, 2021, pp. 9613–9623.
[71] Z. Zhang, W. Chen, H. Cheng, Z. Li, S. Li, L. Lin, and G. Li, “Divide and contrast: source-free domain adaptation via adaptive contrastive learning,” in Proc. NeurIPS, 2022, pp. 5137–5149.
[72] D. Chen, D. Wang, T. Darrell, and S. Ebrahimi, “Contrastive test-time adaptation,” in Proc. CVPR, 2022, pp. 295–305.
[73] J. Lee and G. Lee, “Feature alignment by uncertainty and self-training for source-free unsupervised domain adaptation,” Neural Networks, vol. 161, pp. 682–692, 2023.
[74] Z. Qiu, Y. Zhang, H. Lin, S. Niu, Y. Liu, Q. Du, and M. Tan, “Source-free domain adaptation via avatar prototype generation and adaptation,” in Proc. IJCAI, 2021, pp. 2921–2927.
[75] H. Xia, H. Zhao, and Z. Ding, “Adaptive adversarial network for source-free domain adaptation,” in Proc. ICCV, 2021, pp. 9010–9019.
[76] K. Xia, L. Deng, W. Duch, and D. Wu, “Privacy-preserving domain adaptation for motor imagery-based brain-computer interfaces,” IEEE Transactions on Biomedical Engineering, vol. 69, no. 11, pp. 3365–3376, 2022.
[77] S. Roy, M. Trapp, A. Pilzer, J. Kannala, N. Sebe, E. Ricci, and A. Solin, “Uncertainty-guided source-free domain adaptation,” in Proc. ECCV, 2022, pp. 537–555.
[78] G. K. Nayak, K. R. Mopuri, S. Jain, and A. Chakraborty, “Mining data impressions from deep models as substitute for the unavailable training data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 8465–8481, 2021.
[79] Y. Hou and L. Zheng, “Visualizing adapted knowledge in domain transfer,” in Proc. ICCV, 2021, pp. 13 824–13 833.
[80] Y. Zhang, R. Jia, H. Pei, W. Wang, B. Li, and D. Song, “The secret revealer: Generative model-inversion attacks against deep neural networks,” in Proc. CVPR, 2020, pp. 253–261.
[81] X. Li, J. Li, L. Zhu, G. Wang, and Z. Huang, “Imbalanced source-free domain adaptation,” in Proc. ACM-MM, 2021, pp. 3330–3339.
[82] H. Zhang, Y. Zhang, K. Jia, and L. Zhang, “Unsupervised domain adaptation of black-box source models,” arXiv preprint arXiv:2101.02839v1, 2021.
[83] C. Liu, L. Zhou, M. Ye, and X. Li, “Self-alignment for black-box domain adaptation of image classification,” IEEE Signal Processing Letters, vol. 29, pp. 1709–1713, 2022.
[84] X. Chen, Y. Shen, X. Luo, Y. Zhang, K. Li, and S. Lin, “Classifier decoupled training for black-box unsupervised domain adaptation,” in Proc. PRCV, 2023, pp. 16–30.
[85] J. Yang, X. Peng, K. Wang, Z. Zhu, J. Feng, L. Xie, and Y. You, “Divide to adapt: Mitigating confirmation bias for domain adaptation of black-box predictors,” in Proc. ICLR, 2023.
[86] Q. Peng, Z. Ding, L. Lyu, L. Sun, and C. Chen, “Rain: Regularization on input and network for black-box domain adaptation,” in Proc. IJCAI, 2023, pp. 4118–4126.
[87] M. Xia, J. Zhao, G. Lyu, Z. Huang, T. Hu, G. Chen, and H. Wang, “A separation and alignment framework for black-box domain adaptation,” in Proc. AAAI, 2024, pp. 16 005–16 013.
[88] S. Zhang, C. Shen, S. Lü, and Z. Zhang, “Reviewing the forgotten classes for domain adaptation of black-box predictors,” in Proc. AAAI, 2024, pp. 16 830–16 837.
[89] Y. Xu, J. Yang, H. Cao, M. Wu, X. Li, L. Xie, and Z. Chen, “Leveraging endo-and exo-temporal regularization for black-box video domain adaptation,” arXiv preprint arXiv:2208.05187, 2022.
[90] S. Wang, D. Zhang, Z. Yan, S. Shao, and R. Li, “Black-box source-free domain adaptation via two-stage knowledge distillation,” arXiv preprint arXiv:2305.07881, 2023.
[91] C. Cuttano, A. Tavera, F. Cermelli, G. Averta, and B. Caputo, “Cross-domain transfer learning with corte: Consistent and reliable transfer from black-box to lightweight segmentation model,” in Proc. ICCV Workshops, 2023.
[92] Y. Wang, J. Liang, and Z. Zhang, “A curriculum-style self-training approach for source-free semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 9890–9907, 2024.
[93] X. Luo, W. Chen, Z. Liang, L. Yang, S. Wang, and C. Li, “Crots: Cross-domain teacher–student learning for source-free domain adaptive semantic segmentation,” International Journal of Computer Vision, vol. 132, no. 1, pp. 20–39, 2024.
[94] L. Ren and X. Cheng, “Single/multi-source black-box domain adaption for sensor time series data,” IEEE Transactions on Cybernetics, vol. 54, no. 8, pp. 4712–4723, 2024.
[95] J. Zhang, J. Huang, X. Jiang, and S. Lu, “Black-box unsupervised domain adaptation with bi-directional atkinson-shiffrin memory,” in Proc. ICCV, 2023, pp. 11 771–11 782.
[96] Y. Shi, K. Wu, Y. Han, Y. Shao, B. Li, and F. Wu, “Source-free and black-box domain adaptation via distributionally adversarial training,” Pattern Recognition, p. 109750, 2023.
[97] Z. Lipton, Y.-X. Wang, and A. Smola, “Detecting and correcting for label shift with black box predictors,” in Proc. ICML, 2018, pp. 3122–3130.
[98] S. Xiao, M. Ye, Q. He, S. Li, S. Tang, and X. Zhu, “Adversarial experts model for black-box domain adaptation,” in Proc. ACM MM, 2024, pp. 8982–8991.
[99] L. Tian, M. Ye, L. Zhou, and Q. He, “Clip-guided black-box domain adaptation of image classification,” Signal, Image and Video Processing, vol. 18, no. 5, pp. 4637–4646, 2024.
[100] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Proc. ICML, 2021, pp. 8748–8763.
[101] J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, no. 6, pp. 1789–1819, 2021.
[102] K. Kim, B. Ji, D. Yoon, and S. Hwang, “Self-knowledge distillation with progressive refinement of targets,” in Proc. ICCV, 2021, pp. 6567–6576.
[103] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learning,” in Proc. ICLR, 2017.
[104] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in Proc. NeurIPS, 2017, pp. 1195–1204.
[105] R. Gomes, A. Krause, and P. Perona, “Discriminative clustering by regularized information maximization,” in Proc. NeurIPS, 2010, pp. 775–783.
[106] D.-H. Lee et al., “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in Proc. ICML Workshops, 2013.
[107] Y. Grandvalet and Y. Bengio, “Semi-supervised learning by entropy minimization,” in Proc. NeurIPS, 2004, pp. 529–536.
[108] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel, “Mixmatch: A holistic approach to semi-supervised learning,” in Proc. NeurIPS, 2019, pp. 5049–5059.
[109] D. Berthelot, N. Carlini, E. D. Cubuk, A. Kurakin, K. Sohn, H. Zhang, and C. Raffel, “Remixmatch: Semi-supervised learning with distribution matching and augmentation anchoring,” in Proc. ICLR, 2020.
[110] B. Zhang, Y. Wang, W. Hou, H. Wu, J. Wang, M. Okumura, and T. Shinozaki, “Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling,” in Proc. NeurIPS, 2021, pp. 18 408–18 419.
[111] S. Tan, X. Peng, and K. Saenko, “Class-imbalanced domain adaptation: An empirical odyssey,” in Proc. ECCV Workshops, 2020, pp. 585–602.
[112] Z. Cao, L. Ma, M. Long, and J. Wang, “Partial adversarial domain adaptation,” in Proc. ECCV, 2018, pp. 135–150.
[113] J. Liang, Y. Wang, D. Hu, R. He, and J. Feng, “A balanced and uncertainty-aware approach for partial domain adaptation,” in Proc. ECCV, 2020, pp. 123–140.
[114] T. Salimans and D. P. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” in Proc. NeurIPS, 2016, pp. 901–909.
[115] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. CVPR, 2016, pp. 2818–2826.
[116] S. Qu, G. Chen, J. Zhang, Z. Li, W. He, and D. Tao, “Bmd: A general class-balanced multicentric dynamic prototype strategy for source-free domain adaptation,” in Proc. ECCV, 2022, pp. 165–182.
[117] S. Yang, Y. Wang, J. Van de Weijer, L. Herranz, S. Jui, and J. Yang, “Trust your good friends: Source-free domain adaptation by reciprocal neighborhood clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp. 15 883–15 895, 2023.
[118] Y. Mitsuzumi, A. Kimura, and H. Kashima, “Understanding and improving source-free domain adaptation from a theoretical perspective,” in Proc. CVPR, 2024, pp. 28 515–28 524.
[119] A. Iscen, G. Tolias, Y. Avrithis, and O. Chum, “Label propagation for deep semi-supervised learning,” in Proc. CVPR, 2019, pp. 5070–5079.
[120] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” in Proc. ICLR, 2017.
[121] H. Xie, M. E. Hussein, A. Galstyan, and W. Abd-Almageed, “Muscle: strengthening semi-supervised learning via concurrent unsupervised learning using mutual information maximization,” in Proc. WACV, 2021, pp. 2586–2595.
[122] R. Shu, H. H. Bui, H. Narui, and S. Ermon, “A dirt-t approach to unsupervised domain adaptation,” in Proc. ICLR, 2018.
[123] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation strategies from data,” in Proc. CVPR, 2019, pp. 113–123.
[124] A. K. Menon, S. Jayasumana, A. S. Rawat, H. Jain, A. Veit, and S. Kumar, “Long-tail learning via logit adjustment,” in Proc. ICLR, 2021.
[125] Y. Asano, C. Rupprecht, and A. Vedaldi, “Self-labelling via simultaneous clustering and representation learning,” in Proc. ICLR, 2019.
[126] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan, “Deep hashing network for unsupervised domain adaptation,” in Proc. CVPR, 2017, pp. 5018–5027.
[127] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models to new domains,” in Proc. ECCV, 2010, pp. 213–226.
[128] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang, “Moment matching for multi-source domain adaptation,” in Proc. ICCV, 2019, pp. 1406–1415.
[129] V. Prabhu, S. Khare, D. Kartik, and J. Hoffman, “Sentry: Selective entropy optimization via committee consistency for unsupervised domain adaptation,” in Proc. ICCV, 2021, pp. 8558–8567.
[130] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016, pp. 770–778.
[131] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. ICLR, 2021.
[132] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. ICCV, 2021, pp. 10 012–10 022.
[133] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in Proc. CVPR, 2022, pp. 11 976–11 986.


(a) W $\to$ A	(b) Ar $\to$ Cl	(c) rel $\to$ skt	(d) skt $\to$ clp


(a) W $\to$ A	(b) Ar $\to$ Cl	(c) rel $\to$ skt	(d) skt $\to$ clp


(a) W $\to$ A	(b) Ar $\to$ Cl

(c) rel $\to$ skt	(d) skt $\to$ clp