
Chaos to Order: A Label Propagation Perspective on Source-Free Domain Adaptation

Chunwei Wu, East China Normal University, Shanghai, China, [email protected]; Guitao Cao, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai, China, [email protected]; Yan Li, Shanghai Normal University, Shanghai, China, [email protected]; Xidong Xi, East China Normal University, Shanghai, China, [email protected]; Wenming Cao, Shenzhen University, Shenzhen, China, [email protected]; and Hong Wang, Shanghai Research Institute of Microwave Equipment, Shanghai, China, [email protected]
(2023)
Abstract.

Source-free domain adaptation (SFDA), where only a pre-trained source model is used to adapt to the target distribution, is a more general approach to achieving domain adaptation in the real world. However, it can be challenging to capture the inherent structure of the target features accurately due to the lack of supervised information on the target domain. By analyzing the clustering performance of the target features, we show that they still contain core features related to discriminative attributes but lack the collation of semantic information. Inspired by this insight, we present Chaos to Order (CtO), a novel approach for SFDA that strives to constrain semantic credibility and propagate label information among target subpopulations. CtO divides the target data into inner and outlier samples based on an adaptive threshold over the learning state, customizing the learning strategy to best fit the data properties. Specifically, inner samples are utilized for learning intra-class structure thanks to their relatively well-clustered properties, while the low-density outlier samples are regularized by input consistency to achieve high accuracy with respect to the ground-truth labels. By employing different learning strategies to propagate labels from inner to outlier instances, CtO clusters the global samples from chaos to order. We further adaptively regulate the neighborhood affinity of the inner samples to constrain local semantic credibility. In theoretical and empirical analyses, we demonstrate that our algorithm not only propagates label information from inner to outlier samples but also prevents local clustering from forming spurious clusters. Empirical evidence demonstrates that CtO outperforms the state of the art on three public benchmarks: Office-31, Office-Home, and VisDA.

transfer learning; source-free domain adaptation; label propagation; cluster analysis
journalyear: 2023; doi: 10.1145/3581783.3613821; copyright: acmlicensed; booktitle: Proceedings of the 31st ACM International Conference on Multimedia (MM '23), October 29-November 3, 2023, Ottawa, ON, Canada; price: 15.00; isbn: 979-8-4007-0108-5/23/10; submission id: 2405; ccs: Computing methodologies, Transfer learning; ccs: Computing methodologies, Regularization

1. Introduction

Figure 1. A toy illustration of our approach on label propagation. The target samples can be divided into two subsets: the inner set and the outlier set. CtO achieves label propagation through Adaptive Local-consistency Regularization and Adaptive Input-consistency Regularization. Among them, the outlier samples may move to the correct clusters with the help of their augmented versions. Maintaining this trend, the samples gradually move from chaos to order.

The excellent performance of deep learning relies heavily on a large amount of high-quality labeled data. Obtaining large amounts of manually labeled data for specific learning tasks is often time-consuming and expensive, making these tasks challenging to implement in practical applications. To alleviate this dependency, Unsupervised Domain Adaptation (UDA) has been developed to improve performance in the unlabeled target domain by exploiting the labeled source domain. Two popular practices for modern UDA design are learning domain-invariant features (Ganin et al., 2016; Long et al., 2018; Kang et al., 2019; Tang et al., 2020) and generating dummy samples to match the target domain distribution (Wu et al., 2020; Li et al., 2020a; Zhong et al., 2021; Na et al., 2021). However, due to data privacy and security issues, the source domain training data required by most existing UDA methods is usually unavailable in real-world applications. In response, Source-Free Domain Adaptation (SFDA) emerged, which attempted to adapt a trained source model to the target domain without using any source data.

Due to the lack of source data, it is impossible to estimate source-target domain differences. Existing theoretical work usually provides learning guarantees on the target domain by further assuming that the source domain covers the support of the target domain. In the seminal work by (Yang et al., 2021a), the authors point out that the target features from the source model have formed some semantic structures. Inspired by this intuition, we can preserve the important clustering structure in the target domain by matching similar features in the high-dimensional space. However, the nearest-neighbor consistency of points in high-dimensional space may be wrong, such as when forcing the local consistency of points in low-density regions. As shown in Table 1, when the source and target domains have significant differences (i.e., Pr\rightarrowCl and Rw\rightarrowCl), numerous features gather in low-density regions, with only about one-third of the neighbors having the correct labels.

To address this issue, we propose Chaos to Order (CtO) (Fig. 1), an effective method to achieve more robust clustering of unlabeled data from the perspective of label propagation. To achieve flexible adaptation for different data properties and exploit the target domain structure information, our work introduces a novel data division strategy and then designs different regularization strategies to achieve label propagation.

Firstly, our approach treats mining the intrinsic structure information of the target domain as a clustering problem. Although existing local consistency-based methods aim to preserve the local structure, Table 1 illustrates why neighbors can be unreliable: in distance-based neighbor discrimination, neighbors are similar points in a high-dimensional space, and since points in low-density regions are scattered far apart, the label information of the K-nearest neighbors is not consistent there. In CtO, we utilize the model's learning state to dynamically divide the target data into inner and outlier sets. The intrinsic reason is that a sample can be considered an inner sample if it obtains high predictive values from the classifier; otherwise, it is an outlier. We regularize the input consistency of outliers and encourage local consistency for the inner samples, which effectively improves the mining of intrinsic structural information.

Secondly, we assume a minimum overlap between the subpopulations of the inner and outlier sets, and extend the subsets using the simple but realistic expansion assumption of (Wei et al., 2021). For the inner set, the local-consistency regularizer connects similar points in the high-dimensional space, allowing SFDA training to proceed stably. Enlightening experiments on Office-Home show that: (1) the pre-trained source model can extract rich semantic information from the target data; (2) what is lacking in domain adaptation is the filtering and permutation of high-dimensional semantic information. Due to the lack of supervised information, we preserve the core features associated with the discriminative attributes by enforcing the local consistency of points in the latent space. Moreover, we propose a re-weighted clustering strategy called Adaptive Local-consistency Regularization (ALR), which explicitly constrains local semantic credibility to filter out spurious clustering information. To advance further along this line, we propose Adaptive Input-consistency Regularization (AIR) for the outlier set. Generally, requiring the model to be invariant to input perturbations can improve generalizability. Furthermore, as (Wei et al., 2021) discuss, a low-probability subset of the data can be expanded to a neighborhood with a large probability relative to that subset. We show that label information can be propagated among subpopulations by minimizing the consistency regularization term on unlabeled data. In Theorem 3.2, we give an upper bound on the task risk of the target model. As a result, by customizing the learning strategy for different data properties, CtO can propagate structural information from the inner set to the outlier set while enhancing the clustering of the inner set.

Table 1. Ratio (%) of K-nearest neighbors that have the correct predicted label, for different values of K (on Office-Home).
K Ar\rightarrowCl Ar\rightarrowPr Cl\rightarrowAr Pr\rightarrowCl Pr\rightarrowRw Rw\rightarrowCl
1 42.0 66.2 47.3 33.6 70.0 41.2
2 36.8 62.7 40.7 28.6 66.1 36.9
3 33.8 59.6 37.4 24.7 63.0 33.1
4 30.4 57.1 34.3 22.0 60.4 30.7
5 28.5 55.1 31.2 20.0 58.2 28.0
6 26.8 53.0 29.1 18.1 56.4 26.3
7 25.2 51.6 27.6 16.7 54.9 24.3
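To make the statistic in Table 1 concrete, the sketch below shows one plausible way such a ratio could be computed from L2-normalized target features and labels, assuming PyTorch; the exact definition used for the table is not spelled out here, and the function name `knn_label_agreement` and the "all K neighbors agree" reading are our assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_label_agreement(feats: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    """Fraction of samples whose k nearest neighbours (cosine similarity,
    excluding the sample itself) all carry the sample's own label."""
    feats = F.normalize(feats, dim=1)          # L2-normalize the features
    sim = feats @ feats.t()                    # pairwise cosine similarities
    sim.fill_diagonal_(-1.0)                   # exclude self-matches
    _, idx = sim.topk(k, dim=1)                # indices of the k nearest neighbours
    agree = (labels[idx] == labels.unsqueeze(1)).all(dim=1)
    return agree.float().mean().item()
```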

In summary, the contributions of this paper are summarized as follows: (1) We introduce CtO, a dynamical clustering approach for SFDA. Such an approach customizes the learning strategy for data subsets by using dynamic data splits, allowing label information to propagate among subpopulations. (2) To combat spurious clustering, we propose a novel Adaptive Local-consistency Regularization (ALR) strategy that estimates ground-truth structural information by re-weighting the neighbors. (3) To utilize unlabeled data more effectively, we propose Adaptive Input-consistency Regularization (AIR) from the perspective of label propagation. By collaborating with ALR, structural information can be propagated from the inner to the outlier sets, significantly improving clustering performance. (4) Empirical evidence demonstrates that the proposed method outperforms the state of the art on three domain adaptation benchmark datasets.

Table 2. Comparison with different bottleneck layers on Office-Home.
Methods Ar\rightarrowCl Ar\rightarrowPr Ar\rightarrowRw Cl\rightarrowAr Cl\rightarrowPr Cl\rightarrowRw Pr\rightarrowAr Pr\rightarrowCl Pr\rightarrowRw Rw\rightarrowAr Rw\rightarrowCl Rw\rightarrowPr Avg.
AaD (w/ Source Bottleneck Layer) 59.3 79.3 82.1 68.9 79.8 79.5 67.2 57.4 83.1 72.1 58.5 85.4 72.7
AaD (w/ Target Bottleneck Layer) 69.3 85.7 91.4 82.4 86.2 87.4 84.5 67.5 90.5 89.1 68.9 92.1 82.9

2. Related Work

Source-free Domain Adaptation (SFDA)

SFDA aims to adapt to unlabeled target domains using only the pre-trained source model. Existing approaches try to refine the solution of SFDA by pseudo-labeling (Liang et al., 2020; Qiu et al., 2021; Huang et al., 2021; Ding et al., 2022; Qu et al., 2022; Lee et al., 2022), generating transition domains (Li et al., 2020b, 2022; Kundu et al., 2021, 2022), or local consistency (Yang et al., 2021b, a, 2022). However, due to domain differences, pseudo-labels may contain noise and cause confirmation bias. Additionally, task-discriminative information and domain-related information are highly non-linearly entangled, so directly constructing an ideal generic domain from the source model may be difficult. Most closely related to our work is AaD (Yang et al., 2022), which introduced a simple and efficient optimization upper bound for feature clustering of unlabeled data, i.e., aggregating (scattering) similar (dissimilar) features in the feature space. However, AaD uses K-nearest neighbors directly, which suffer from source bias due to domain shift. In contrast to the above methods, we explore the idea of label propagation to assign regularization strategies that better suit the properties of the unlabeled data, achieving source-free model adaptation.

Label Propagation

Label propagation has been widely used in semi-supervised learning. (Douze et al., 2018) show that label propagation on large image sets outperforms state-of-the-art few-shot learning when few labels are available. (Iscen et al., 2019) employ a transductive label propagation method based on the manifold assumption to predict labels for the entire dataset. (Wei et al., 2021) introduce the "expansion" assumption to analyze label propagation and show learning guarantees for unsupervised and semi-supervised learning. (Cai et al., 2021) extend the expansion assumption to domain adaptation and propose a provably effective framework for domain adaptation based on label propagation. Considering label propagation for SFDA and leveraging the advantages of the expansion assumption, we design a novel dynamic clustering strategy for SFDA that propagates structural information from high-density regions to low-density regions.

Figure 2. Cosine similarity within the same class and across classes on Office-Home.

3. Preliminaries and Analysis

In this section, we consider the clustering performance of the target domain and the label propagation for SFDA. Correspondingly, we first introduce some notations and then perform an empirical and theoretical analysis to better understand the role of different learning strategies in CtO. In Section 3.2, we study an Oracle setup that beats the original AaD (Yang et al., 2022) by a large margin, confirming that the features from the source domain model are already rich in semantic information, which requires us to reduce the redundant information in the features. Finally, in Section 3.3, we claim that if the learning state of a model is superior, then the target sample has consistency with its neighbors (Claim 3.1). Furthermore, we present upper bounds on the target error in Theorem 3.2.

3.1. Preliminary

For source-free domain adaptation (SFDA), consider an unlabeled target dataset \mathcal{D}_{T}=\left\{x_{i}:x_{i}\in\mathcal{X}_{t}\right\}_{i=1}^{n_{t}} on the input space \mathcal{X}_{t}. The task is to adapt a well-trained source model to the target domain without source data, where the target domain shares the same |C| classes with the source domain. Following (Yang et al., 2021a, 2022), we use a feature extractor h:\mathcal{X}_{t}\rightarrow\mathcal{Z} and a classifier g_{c}:\mathcal{Z}\rightarrow\mathcal{C}. The output of the network is denoted as p(x)=\delta(g_{c}(h(x)))\in\mathbb{R}^{C}, where \delta is the softmax function. Specifically, we retrieve the nearest neighbors for each mini-batch of target features. Let \boldsymbol{F}\in\mathbb{R}^{n_{t}\times d} denote a memory bank that stores all target features and \boldsymbol{P}\in\mathbb{R}^{n_{t}\times C} denote the corresponding prediction scores in the memory bank, where d is the feature dimension of the last linear layer:

(1) \boldsymbol{F}=\left[\boldsymbol{z}_{1},\boldsymbol{z}_{2},\ldots,\boldsymbol{z}_{n_{t}}\right],\quad\boldsymbol{P}=\left[\boldsymbol{p}_{1},\boldsymbol{p}_{2},\ldots,\boldsymbol{p}_{n_{t}}\right],

where \boldsymbol{z}_{i} is L2-normalized and \boldsymbol{p}_{i} denotes the output softmax probability for \boldsymbol{z}_{i}.
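As a concrete illustration of the memory bank above, here is a minimal sketch assuming PyTorch; the class name FeatureBank and the update rule are illustrative rather than taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

class FeatureBank:
    """Memory bank holding L2-normalized features F (n_t x d) and softmax scores P (n_t x C)."""
    def __init__(self, n_t: int, d: int, num_classes: int):
        self.F = torch.zeros(n_t, d)
        self.P = torch.ones(n_t, num_classes) / num_classes

    @torch.no_grad()
    def update(self, idx: torch.Tensor, z: torch.Tensor, logits: torch.Tensor):
        self.F[idx] = F.normalize(z, dim=1)       # z_i is L2-normalized before storage
        self.P[idx] = F.softmax(logits, dim=1)    # p_i = delta(g_c(h(x_i)))

    @torch.no_grad()
    def nearest_neighbors(self, z: torch.Tensor, k: int):
        sim = F.normalize(z, dim=1) @ self.F.t()  # cosine similarity to all stored features
        return sim.topk(k + 1, dim=1)             # +1 because the query itself is in the bank
```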

3.2. Empirical analysis

Most clustering-based SFDA methods suffer from spurious clustering; under extreme domain shift, the problem worsens. To address this issue, we investigate the local consistency of feature representations on the source and target domain models. We carry out the experiments on Office-Home since it exhibits different degrees of domain shift, e.g., Rw vs. Pr and Pr vs. Cl. In this experiment, we study the feature properties of different layers in the model: (1) Backbone: the last layer of the backbone network, with 2048 feature dimensions; (2) Bottleneck: only the bottleneck layer in the source model is replaced, with 256 feature dimensions.

It is worth noting that most existing clustering-based methods are distance-based. The key idea is the smoothness assumption: the model should produce similar predictions for similar unlabeled data. Therefore, a good feature representation should have intra-class compactness and inter-class separability. Without loss of generality, we use cosine similarity as the metric. Unexpectedly, the same-class and across-class similarities of the source and target domain models are close at the Backbone, while a large difference appears at the Bottleneck (see Fig. 2). According to the light blue bars in Fig. 2, which visualize the across-class similarity at different network structures, the features from the bottleneck layer clearly have better inter-class separability. This means that adding a bottleneck layer to the model helps reduce redundant features, which improves discriminability and generalizability.
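As a reference for how the quantities in Fig. 2 could be measured, the following is a small sketch assuming PyTorch and a labeled feature set; the function name class_similarity is ours.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def class_similarity(feats: torch.Tensor, labels: torch.Tensor):
    """Mean cosine similarity within the same class and across different classes."""
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t()                              # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)    # same-class indicator matrix
    eye = torch.eye(len(labels), dtype=torch.bool)
    same_cls = sim[same & ~eye].mean()                   # same class, excluding self-pairs
    across_cls = sim[~same].mean()                       # different classes
    return same_cls.item(), across_cls.item()
```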

We further investigated the impact of the bottleneck layer on clustering-based SFDA approaches in another study. Table 2 shows the learning effect of AaD (Yang et al., 2022) with only the bottleneck layer replaced. Note that the bottleneck layer of the target model is only used for the analysis of this experiment. We observe that replacing the target domain bottleneck layer improves the AaD model dramatically, from 72.7% to 82.9%. This indicates that the high-dimensional features from the backbone network of the source model already contain rich semantic information, whereas the generalization of the features is more reflected in the filtering and permutation of that semantic information. Additionally, on the results of AaD (w/ Source Bottleneck Layer), there was a very strong correlation between prediction accuracy and the ratio of same-class similarity to across-class similarity, as indicated by a Spearman rank correlation of 0.92. This observation hints that we can use the correlation between similarity and test accuracy to improve the clustering effect.

We also evaluated the feature similarity of different metric functions. The different metric functions all point out that Backbone features are rich in semantic information (see Appendix C).

3.3. Theoretical analysis

Following the expansion assumption in (Wei et al., 2021; Cai et al., 2021), we first define the set of suitable input transformations \mathcal{B}(\cdot), which takes the general form \mathcal{B}(x)\triangleq\{x^{\prime}:\exists A\in\mathcal{A}\text{ such that }\|x^{\prime}-A(x)\|\leq r\} for a small radius r>0, where \mathcal{A} can be understood as a distance-based neighborhood or a set of data augmentations. Then, we define the neighborhood function \mathcal{N} as

(2) \mathcal{N}(x)=\left\{x^{\prime}\mid\mathcal{B}(x)\cap\mathcal{B}\left(x^{\prime}\right)\neq\emptyset\right\},

and the neighborhood of a set S\subset\mathcal{D}_{T} as

(3) \mathcal{N}(S)\triangleq\cup_{x\in S}\mathcal{N}(x).

The regularizer of G=g_{c}\circ h is defined as:

(4) R_{\mathcal{B}}(G)=\mathbb{E}_{\mathcal{D}_{T}}\left[\max_{\text{neighbor }x^{\prime}}\mathbf{1}\left(G(x)\neq G(x^{\prime})\right)\right]

Our setting for Source-free Domain Adaptation is formulated in the following assumption.

Assumption 3.1.

Let I_{i} and O_{i} denote the conditional distributions of the target distribution \mathcal{D}_{T} on the sets I and O, respectively. Assume there exists a constant \kappa\geq 1 such that the measures I_{i} and O_{i} are bounded by \kappa U_{i}. That is, for any S\subset\mathcal{X}_{t},

P_{I_{i}}(S)\leq\kappa P_{U_{i}}(S)\quad\text{and}\quad P_{O_{i}}(S)\leq\kappa P_{U_{i}}(S).

The expansion property on the target domain is defined as follows:

Definition 3.0 (Constant Expansion (Wei et al., 2021)).

We say that distribution Q satisfies (q,\xi)-constant expansion for some constant q,\xi\in(0,1), if for all S\subset Q satisfying \mathbb{P}_{Q}(S)\geq q, we have \mathbb{P}_{Q}[\mathcal{N}(S)\backslash S]\geq\min\left\{\xi,\mathbb{P}_{Q}[S]\right\}.

Based on the model's learning state, our CtO method divides the target data into the inner set (I) and the outlier set (O). The intuition is that different datasets and classes should determine their division thresholds based on the model's learning state, so that the division boundary is more reasonable. Specifically, we set a global threshold \rho to utilize unlabeled data at early training stages. As the adaptation progresses, we estimate the model's learning state \tau_{i} from its predictions to determine a dynamic local (class-specific) threshold. The following claim guarantees the consistency robustness of inner samples:

Claim 3.1.

Suppose G satisfies a Lipschitz condition; then there exists a global threshold \rho\in(0,1) and a scale of the model's learning status \tau_{i} such that the inner set I is consistency robust, i.e., R_{\mathcal{B}}(G)=0. More specifically,

r\leq(2\max\{\tau_{i}\}\rho-1)\frac{2}{L},\quad\forall i\in[\mathcal{C}].
Remark 1.

This claim illustrates that, when the global threshold \rho is fixed, the inner set can achieve a consistency error close to zero as long as the model's learning state is good enough. In clustering-based methods, r usually corresponds to the number of selectable nearest neighbors, which allows us to select more reliable neighbors for inner samples. Moreover, we do not need to set specific data division thresholds for different datasets or classes based on experience, making the method more general and realistic.

With the above preparation, we can investigate how to propagate label information from the ordered inner set to the chaotic outlier set. The following theorem establishes a bound on the target risk and indicates that, as long as there is minimal overlap between the inner and outlier sets, label information will propagate within the target subpopulations.

Theorem 3.2.

Suppose Assumption 3.1 and Claim 3.1 hold and I,O satisfy (q,\mu)-constant expansion. Then the expected error of model G is bounded:

\epsilon_{\mathcal{D}_{T}}(G)\leq 4\max(q,\mu)\kappa+\mu(1+\kappa).
Remark 2.

This theorem states that the target risk is bounded by the consistency regularization R_{\mathcal{B}} (equivalently, \mu). Our analysis explains why consistency regularization is important for SFDA methods: assuming the data satisfies expansion, it encourages representations to maintain important semantic structures by enforcing local consistency within the representation space. As a result, we can effectively capture the global structure of the target domain by considering all local neighborhoods together, which strongly enforces label propagation.

The proofs of the claim and the theorem are given in Appendix A.

4. Method

In this section, we introduce CtO, which aims to achieve efficient feature clustering from the perspective of label propagation.

There are two aspects involved in achieving CtO. First, how to divide the dataset reasonably? The ideal inner set should have well-clustered properties, so the effectiveness of the division boundary in distinguishing high- and low-density regions is crucial when choosing a division threshold. The model’s prediction probability can measure this. According to the previous analysis, by considering the model’s learning state, our proposed adaptive division threshold can better guarantee local consistency among inner samples.

Second, how can we customize learning strategies to achieve label propagation among target subpopulations? Inspired by self-supervised learning, we apply two regularization strategies - local consistency and input consistency - on the inner and outlier sets, respectively. Our theoretical analysis demonstrates that these strategies can extend low-density subsets to high-density ones when the subpopulations overlap, which leads to the clustering from chaos to order.

4.1. Dynamic Data Grouping

As analyzed before, we employ the model's learning state to adaptively divide the data in \mathcal{D}_{T} into the inner set I and the outlier set O. As argued in (Zhang et al., 2021), the learning effect of the model can be reflected by the class-level hit rate. Therefore, our principle is that the data division in CtO should be related to the prediction confidence of the unlabeled data on different classes, so as to reflect the class-level learning status. Namely, classes with fewer samples reaching a prediction-confidence threshold are considered to have difficulty in learning local structural information. Moreover, the threshold should increase steadily as the model improves during training. We set the global threshold as the exponential moving average (EMA) of the highest confidence at each training time step:

(5) \rho_{t}=\begin{cases}1/|C|,&\text{ if }t=0\\ \alpha\rho_{t-1}+(1-\alpha)\max(p),&\text{ otherwise }\end{cases}

where \alpha\in(0,1) is the momentum decay of the EMA and t denotes the t-th iteration. Combining these flexible thresholds, the learning effect of class c at time step t is defined as:

(6) \tau_{t}(c)=\sum_{n=1}^{N_{t}}\mathbf{1}\left(\max\left(p\right)>\rho_{t}\right)\cdot\mathbf{1}\left(\arg\max\left(p\right)=c\right).

Then we formulate the adaptive data division weights:

(7) \mathcal{T}_{t}(c)=\frac{1}{|C|}\left(1-\frac{\beta_{t}(c)}{\log\beta_{t}(c)}\right),\quad\text{where }\beta_{t}(c)=\frac{\tau_{t}(c)}{\max_{c}\tau_{t}}.

Finally, the samples are dynamically grouped into the outlier set in the t-th iteration:

(8) O^{t}=\left\{x_{i}\mid\max\left(p_{i}\right)\geq\mathcal{T}_{t}({\arg\max}\left(p_{i}\right)),\;x_{i}\in\mathcal{D}_{T}\right\},

and the inner samples are the remaining target data, i.e., I=\mathcal{D}_{T}\backslash O. In this way, we customize learning strategies for different data properties and connect the two sets through the expansion assumption.
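A minimal sketch of the dynamic data grouping in Eqs. 5-8, assuming PyTorch; the class name DynamicGrouper, the batch-wise update, and the clamping of \beta_{t}(c) away from 0 and 1 for numerical stability are our assumptions.

```python
import torch

class DynamicGrouper:
    """Tracks the EMA global threshold rho_t and class thresholds T_t(c) (Eqs. 5-8)."""
    def __init__(self, num_classes: int, alpha: float = 0.9):
        self.C = num_classes
        self.alpha = alpha
        self.rho = 1.0 / num_classes                        # Eq. 5: rho_0 = 1/|C|

    def update_and_split(self, probs: torch.Tensor):
        conf, cls = probs.max(dim=1)                        # max(p) and argmax(p) per sample
        # Eq. 5: EMA of the highest confidence at this time step
        self.rho = self.alpha * self.rho + (1 - self.alpha) * conf.max().item()
        # Eq. 6: per-class count of predictions exceeding the global threshold
        tau = torch.zeros(self.C)
        for c in range(self.C):
            tau[c] = ((conf > self.rho) & (cls == c)).sum()
        # Eq. 7: adaptive division weights (beta clamped to avoid log(0) and log(1))
        beta = (tau / tau.max().clamp(min=1.0)).clamp(1e-6, 1.0 - 1e-6)
        T = (1.0 - beta / torch.log(beta)) / self.C
        # Eq. 8: samples whose confidence reaches their class threshold form O^t, the rest form I
        outlier = conf >= T[cls]
        return ~outlier, outlier
```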

4.2. Label Propagation with Different Regularizations in SFDA

In the theoretical analysis (Section 3.3), we show that performing consistency regularization on the unlabeled target data can propagate semantic information across different subpopulations. However, it also has some limitations. First, the neighbors are not always correct: due to the domain shift, the model may incorrectly focus on the objects of some target features, leading to noisy labels. Here, the misalignment of neighbors forms spurious clusters instead of helping label propagation. Second, when dealing with low-density regions (i.e., outlier samples), applying the same local-consistency regularization further exacerbates the cross-label risk. To address these limitations, we flexibly customize learning strategies for different data properties with the help of dynamic data grouping.

Adaptive Local-consistency Regularization

In Adaptive Local-consistency Regularization (ALR), inspired by the fact that the target features from the source model have formed some semantic structures, we can capture the intra-class structure by local-consistency regularization. However, in the source-free domain adaptation problem, the features extracted by the pre-trained source model are typically influenced by the source bias. This may lead to neighbors containing incorrect semantic information. To mitigate incorrect alignment, we propose identifying the clustering weights of each sample.

As observed in Fig. 2, the same-class cosine similarity is generally higher than the across-class similarity. Building on this finding, we can measure neighbor affinity based on cosine similarity and then re-weight the neighbors to approximate the ground-truth structural information. Re-weighting with similarity-based adaptive weights promotes positive clustering while combating spurious clustering. The Adaptive Local-consistency Regularization is as follows:

(9) \mathcal{L}_{alr}=-\sum_{i}^{N_{I}}\sum_{j}^{K_{i}}\boldsymbol{w}_{ij}\boldsymbol{p}_{i}^{T}\boldsymbol{p}_{j}

where K_{i} denotes the K-nearest neighbor set of \boldsymbol{z}_{i}. The similarity weight \boldsymbol{w}_{ij} in Eq. 9 is the cosine similarity of \boldsymbol{z}_{i} to its neighbor \boldsymbol{z}_{j}, computed via the memory bank \boldsymbol{F}. Optimizing \mathcal{L}_{alr} improves the reliability of clustering, which stabilizes the intra-class structure. In addition, relaxing the ranking of samples in low-density regions helps reduce incorrect local semantic alignment.

Additionally, to improve separability between clusters, we employ the separation strategy proposed by (Yang et al., 2022) to disperse the prediction of potentially dissimilar features.

(10) \mathcal{L}_{sep}=\sum_{i}^{N_{I}}\sum_{m}^{N_{B_{i}}}\boldsymbol{p}_{i}^{T}\boldsymbol{p}_{m}

where B_{i} denotes the other features in the mini-batch except \boldsymbol{z}_{i}.
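The ALR and separation terms of Eqs. 9-10 can be sketched as follows, assuming PyTorch, the memory bank of Section 3.1, and neighbors retrieved by cosine similarity; the batch averaging and the helper name are our choices, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def alr_and_sep_losses(p_batch, z_batch, bank_F, bank_P, k=3):
    """Adaptive local-consistency loss (Eq. 9) and separation loss (Eq. 10) on inner samples."""
    z = F.normalize(z_batch, dim=1)
    sim = z @ bank_F.t()                           # cosine similarity to the memory bank
    w, idx = sim.topk(k + 1, dim=1)                # +1: the query itself is stored in the bank
    w, idx = w[:, 1:], idx[:, 1:]                  # drop the self-match
    p_neighbors = bank_P[idx]                      # (B, k, C) predictions of the K neighbours
    # Eq. 9: similarity-weighted agreement with neighbour predictions (attraction)
    loss_alr = -(w.unsqueeze(-1) * p_batch.unsqueeze(1) * p_neighbors).sum(dim=(1, 2)).mean()
    # Eq. 10: push apart predictions of the other features in the mini-batch (dispersion),
    # averaged per sample for scale (an implementation choice)
    dots = p_batch @ p_batch.t()
    loss_sep = (dots.sum() - dots.diagonal().sum()) / p_batch.size(0)
    return loss_alr, loss_sep
```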

Table 3. Accuracy (%) on Office-Home (ResNet-50).
Methods Source-free Ar\rightarrowCl Ar\rightarrowPr Ar\rightarrowRw Cl\rightarrowAr Cl\rightarrowPr Cl\rightarrowRw Pr\rightarrowAr Pr\rightarrowCl Pr\rightarrowRw Rw\rightarrowAr Rw\rightarrowCl Rw\rightarrowPr Avg.
ResNet-50 (He et al., 2016) 34.9 50.0 58.0 37.4 41.9 46.2 38.5 31.2 60.4 53.9 41.2 59.9 46.1
CDAN (Long et al., 2018) 50.7 70.6 76.0 57.6 70.0 70.0 57.4 50.9 77.3 70.9 56.7 81.6 65.8
MDD (Zhang et al., 2019) 54.9 73.7 77.8 60.0 71.4 71.8 61.2 53.6 78.1 72.5 60.2 82.3 68.1
SRDC (Tang et al., 2020) 52.3 76.3 81.0 69.5 76.2 78.0 68.7 53.8 81.7 76.3 57.1 85.0 71.3
FixBi (Na et al., 2021) 58.1 77.3 80.4 67.7 79.5 78.1 65.8 57.9 81.7 76.4 62.9 86.7 72.7
SHOT (Liang et al., 2020) 56.9 78.1 81.0 67.9 78.4 78.1 67.0 54.6 81.8 73.4 58.1 84.5 71.6
A^{2}Net (Xia et al., 2021) 58.4 79.0 82.4 67.5 79.3 78.9 68.0 56.2 82.9 74.1 60.5 85.0 72.8
NRC (Yang et al., 2021a) 57.7 80.3 82.0 68.1 79.8 78.6 65.3 56.4 83.0 71.0 58.6 85.6 72.2
SFDA-DE (Ding et al., 2022) 59.7 79.5 82.4 69.7 78.6 79.2 66.1 57.2 82.6 73.9 60.8 85.5 72.9
feat-mixup (Kundu et al., 2022) 61.8 81.2 83.0 68.5 80.6 79.4 67.8 61.5 85.1 73.7 64.1 86.5 74.5
AaD (Yang et al., 2022) 59.3 79.3 82.1 68.9 79.8 79.5 67.2 57.4 83.1 72.1 58.5 85.4 72.7
DaC (Zhang et al., 2022) 59.1 79.5 81.2 69.3 78.9 79.2 67.4 56.4 82.4 74.0 61.4 84.4 72.8
NRC+ELR (Yi et al., 2023) 58.4 78.7 81.5 69.2 79.5 79.3 66.3 58.0 82.6 73.4 59.8 85.1 72.6
SFUDA (Pei et al., 2023) 59.9 81.4 83.0 68.9 80.1 80.3 67.5 56.9 83.7 74.3 60.8 86.3 73.7
CtO 58.5 79.8 85.5 74.8 82.5 83.1 73.8 58.4 85.0 78.2 63.3 89.6 76.1
Table 4. Accuracy (%) on Office-31 (ResNet-50).
Methods Source-free A\rightarrowD A\rightarrowW D\rightarrowW W\rightarrowD D\rightarrowA W\rightarrowA Avg.
ResNet-50 (He et al., 2016) 68.9 68.4 96.7 99.3 62.5 60.7 76.1
CDAN (Long et al., 2018) 92.9 94.1 98.6 100.0 71.0 69.3 87.7
MDD (Zhang et al., 2019) 90.4 90.4 98.7 99.9 75.0 73.7 88.0
SRDC (Tang et al., 2020) 95.8 95.7 99.2 100.0 76.7 77.1 90.8
FixBi (Na et al., 2021) 95.0 96.1 99.3 100.0 78.7 79.4 91.4
SHOT (Liang et al., 2020) 93.1 90.9 98.8 99.9 74.5 74.8 88.7
A^{2}Net (Xia et al., 2021) 94.5 94.0 99.2 100.0 76.7 76.1 90.1
NRC (Yang et al., 2021a) 96.0 90.8 99.0 100.0 75.3 75.0 89.4
HCL (Huang et al., 2021) 94.7 92.5 98.2 100.0 75.9 77.7 89.8
SFDA-DE (Ding et al., 2022) 96.0 94.2 98.5 99.8 76.6 75.5 90.1
AaD (Yang et al., 2022) 96.4 92.1 99.1 100.0 75.0 76.5 89.9
feat-mixup (Kundu et al., 2022) 94.6 93.2 98.9 100.0 78.3 78.9 90.7
NRC+ELR (Yi et al., 2023) 93.8 93.3 98.0 100.0 76.2 76.9 89.6
SFUDA (Pei et al., 2023) 96.2 94.0 99.1 99.9 76.9 79.2 90.7
CtO 96.4 95.1 99.0 100.0 80.0 78.2 91.5

Adaptive Input-consistency Regularization

In Adaptive Input-consistency Regularization (AIR), we propagate the structural information from the inner set to the outlier set, as discussed in Remark 2. Since the outliers in low-density regions are far away from all other points, i.e., they have no nearest-neighbor support, we instead seek support from the outliers themselves. Specifically, we use a weakly augmented version of x_{i}, denoted \omega(x_{i}), to generate the pseudo-label \hat{p}_{i}=P(y\mid\omega(x_{i})) and enforce consistency with its strongly augmented version \Omega(x_{i}). To encourage the model to make diverse predictions, we combine this regularization with the aforementioned class-level confidence thresholds. The Adaptive Input-consistency Regularization is as follows:

(11) \mathcal{L}_{air}=\frac{1}{N_{O}}\sum_{i=1}^{N_{O}}\mathcal{H}(\hat{p}_{i},q_{i}),

where \mathcal{H}(\cdot,\cdot) refers to the cross-entropy loss, and q_{i}=P(y\mid\Omega(x_{i})) denotes the pseudo label of \Omega(x_{i}).
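A hedged sketch of the AIR loss in Eq. 11, assuming PyTorch; whether the pseudo-label from the weak view is used as a soft or hard target is not specified above, and the soft cross-entropy form below is our assumption.

```python
import torch
import torch.nn.functional as F

def air_loss(model, x_weak: torch.Tensor, x_strong: torch.Tensor) -> torch.Tensor:
    """Adaptive input-consistency loss (Eq. 11) on a batch of outlier samples."""
    with torch.no_grad():
        p_hat = F.softmax(model(x_weak), dim=1)        # pseudo-label p_hat_i = P(y | w(x_i))
    logits_strong = model(x_strong)                    # prediction for the strong view Omega(x_i)
    # H(p_hat_i, q_i): cross-entropy between the pseudo-label and the strong-view prediction
    return -(p_hat * F.log_softmax(logits_strong, dim=1)).sum(dim=1).mean()
```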

During training, the clustering processes on the inner and outlier sets facilitate each other. By implementing different regularization strategies, labels are propagated among subsets to enable under- or hard-to-learn samples to find suitable neighbors. The inclusion of these new members in clusters provides additional information for learning the intra-class structure, which adjusts the feature space and enhances the power of clustering.

4.3. Overall Objective

As described above, the overall objective of CtO can be summarized as follows:

(12) \mathcal{L}=\mathcal{L}_{alr}+\mathcal{L}_{air}+\lambda\mathcal{L}_{sep},

where \lambda is a trade-off parameter. With \mathcal{L}_{alr} and \mathcal{L}_{air}, CtO preserves local and input consistency, allowing label information to be propagated. The training process is described in Appendix B.
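Putting the pieces together, one adaptation step under Eq. 12 could look like the sketch below, reusing the illustrative helpers sketched earlier (FeatureBank, DynamicGrouper, alr_and_sep_losses, air_loss); the model.backbone/model.classifier split and the optimizer handling are assumptions, not the paper's code.

```python
import torch

def cto_step(model, grouper, bank, x, x_weak, x_strong, idx, lam, optimizer):
    """One CtO adaptation step: group the batch, then apply ALR/sep on I and AIR on O (Eq. 12)."""
    z = model.backbone(x)                             # features h(x)
    logits = model.classifier(z)                      # g_c(h(x))
    probs = torch.softmax(logits, dim=1)
    bank.update(idx, z.detach(), logits.detach())     # refresh memory bank entries

    inner, outlier = grouper.update_and_split(probs.detach())
    # In practice an empty subset should simply contribute zero loss.
    loss_alr, loss_sep = alr_and_sep_losses(probs[inner], z[inner], bank.F, bank.P)
    loss_air = air_loss(model, x_weak[outlier], x_strong[outlier])

    loss = loss_alr + loss_air + lam * loss_sep       # Eq. 12
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```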

Table 5. Accuracy (%) on VisDA (ResNet-101).
Methods Source-free plane bicycle bus car horse knife mcycl person plant sktbrd train truck Per-class
ResNet-101 (He et al., 2016) 55.1 53.3 61.9 59.1 80.6 17.9 79.7 31.2 81.0 26.5 73.5 8.5 52.4
CDAN (Long et al., 2018) 85.2 66.9 83.0 50.8 84.2 74.9 88.1 74.5 83.4 76.0 81.9 38.0 73.9
SAFN (Xu et al., 2019) 93.6 61.3 84.1 70.6 94.1 79.0 91.8 79.6 89.9 55.6 89.0 24.4 76.1
MCC (Jin et al., 2020) 88.7 80.3 80.5 71.5 90.1 93.2 85.0 71.6 89.4 73.8 85.0 36.9 78.8
FixBi (Na et al., 2021) 96.1 87.8 90.5 90.3 96.8 95.3 92.8 88.7 97.2 94.2 90.9 25.7 87.2
SHOT (Liang et al., 2020) 92.6 81.1 80.1 58.5 89.7 86.1 81.5 77.8 89.5 84.9 84.3 49.3 79.6
A^{2}Net (Xia et al., 2021) 94.0 87.8 85.6 66.8 93.7 95.1 85.8 81.2 91.6 88.2 86.5 56.0 84.3
NRC (Yang et al., 2021a) 96.8 91.3 82.4 62.4 96.2 95.9 86.1 80.6 94.8 94.1 90.4 59.7 85.9
HCL (Huang et al., 2021) 93.3 85.4 80.7 68.5 91.0 88.1 86.0 78.6 86.6 88.8 80.0 74.7 83.5
CPGA (Qiu et al., 2021) 94.8 83.6 79.7 65.1 92.5 94.7 90.1 82.4 88.8 88.0 88.9 60.1 84.1
SFDA-DE (Ding et al., 2022) 95.3 91.2 77.5 72.1 95.7 97.8 85.5 86.1 95.5 93.0 86.3 61.6 86.5
AaD (Yang et al., 2022) 97.4 90.5 80.8 76.2 97.3 96.1 89.8 82.9 95.5 93.0 92.0 64.7 88.0
DaC (Zhang et al., 2022) 96.6 86.8 86.4 78.4 96.4 96.2 93.6 83.8 96.8 95.1 89.6 50.0 87.3
CtO 98.2 91.0 86.4 78.0 97.6 98.8 91.8 84.8 96.6 94.7 93.7 53.3 88.7
Table 6. Ablation study on Office-31.
\mathcal{L}_{sep} \mathcal{L}_{lr} \mathcal{L}_{alr} \mathcal{L}_{air} A\rightarrowD A\rightarrowW D\rightarrowA W\rightarrowA Avg.
96.4 92.1 75.0 76.5 85.0
95.4 93.3 77.9 77.6 86.1
95.8 94.7 79.4 77.8 86.9
96.4 95.1 80.0 78.2 87.4

5. Experiments

5.1. Setup

Datasets

We conduct experiments on three public domain adaptation benchmarks. (i) Office-31 (Saenko et al., 2010) is a commonly used dataset for domain adaptation that consists of three domains: Amazon (A), Webcam (W), and DSLR (D), each containing 31 categories of items in an office environment. (ii) Office-Home (Venkateswara et al., 2017) is a standard domain adaptation dataset collected in office and home environments. It consists of four domains, Art (Ar), Clipart (Cl), Product (Pr), and RealWorld (Rw), each covering 65 object categories. (iii) VisDA (Peng et al., 2017) is one of the large benchmark datasets for domain adaptation. It contains 12 categories of images from two subsets: a synthetic image domain and a real image domain.

Implementation details

Following the standard protocol for SFDA, we use all labeled source data to obtain the pre-trained models. For Office-31 and Office-Home, the backbone network is ResNet-50 (He et al., 2016); for VisDA, it is ResNet-101. For a fair comparison, we use the same network structure as SHOT (Liang et al., 2020), NRC (Yang et al., 2021a), and AaD (Yang et al., 2022). All network parameters are updated by Stochastic Gradient Descent (SGD) with a momentum of 0.9, an initial learning rate of 0.001, and a weight decay of 0.005. The learning rate of the additional layer is 10 times smaller than that of the backbone. We follow G-SFDA (Yang et al., 2021b), NRC (Yang et al., 2021a), and AaD (Yang et al., 2022) for the number of nearest neighbors K: 3 for Office-31 and Office-Home, and 5 for VisDA. To ensure a fair comparison, we set the hyperparameter \lambda as in previous work (Yang et al., 2022), i.e., \lambda=\left(1+10\cdot\frac{iter}{max\_iter}\right)^{-\beta}, with \beta set to 0 on Office-Home, 2 on Office-31, and 5 on VisDA. The strong augmentation function used in our experiments is RandAugment (Cubuk et al., 2020).
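For reference, the decay schedule for \lambda can be written as a one-line helper (a sketch; the function name is ours):

```python
def lambda_schedule(iteration: int, max_iter: int, beta: float) -> float:
    """lambda = (1 + 10 * iter / max_iter) ** (-beta); beta = 0 keeps lambda fixed at 1."""
    return (1.0 + 10.0 * iteration / max_iter) ** (-beta)
```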

(a) Source-only (A\rightarrowW); (b) CtO (A\rightarrowW); (c) Source-only (D\rightarrowA); (d) CtO (D\rightarrowA); (e) Source-only (D\rightarrowA); (f) CtO (D\rightarrowA)
Figure 3. The t-SNE and Confusion Matrix visualization. (a-d): t-SNE visualization of the final prediction layer activation for source model and CtO, where red and blue points denote the source and target domains, respectively. Note that the source samples are only used to plot the t-SNE. (e) and (f): Confusion Matrix visualization for source model and CtO.

5.2. Results and Analysis

In this section, we present our results and compare them with other methods, summarized in Tables 3, 4, and 5, respectively. For a fair comparison, all baseline results were obtained from their original papers or follow-up work.

Comparison with state-of-the-art methods

For Office-31, as shown in Table 4, the proposed CtO yields state-of-the-art performance on 4 out of 6 tasks. Note that our CtO produces competitive results even when compared to source-present methods such as FixBi (Na et al., 2021) (91.5% vs. 91.4%). For Office-Home, Table 3 shows that the proposed CtO method achieves the best average classification accuracy (76.1%) and the highest results on 7 out of 12 tasks. As is well known, in clustering-based methods the clustering error increases with the number of object classes, so it is difficult for local consistency-based SFDA methods to accurately capture the target structure information. However, our CtO employs adaptive input-consistency regularization to efficiently utilize unlabeled data through label propagation, which is the primary reason for our success on Office-Home. Moreover, CtO beats several source-present DA methods, such as SRDC (Tang et al., 2020) and FixBi (Na et al., 2021), by a large margin, which means that even without access to the source data, our method can still exploit the target structure information to achieve better adaptation. Similar observations on VisDA can be found in Table 5. The reported results sufficiently demonstrate the superiority of our method.

Comparison with clustering-based Method

As discussed in related work, NRC (Yang et al., 2021a) uses reciprocal nearest neighbors to measure clustering affinity. On the hard tasks of Office-Home, our approach outperforms NRC by a considerable margin, especially on Pr\rightarrowAr (73.8% vs. 65.3%). This improvement indicates the importance of our adaptive input-consistency regularization for capturing intra-class structural information. Compared with AaD (Yang et al., 2022), our CtO improves the accuracy by 1.6% on Office-31 and by 3.1% on Office-Home, indicating that the co-training of the adaptive local-consistency regularizer and the adaptive input-consistency regularizer performs reliable label propagation through the subpopulations of unlabeled data. Moreover, on VisDA, CtO exhibits higher recognition accuracy than AaD for several confusing objects, which indicates that the adaptive input-consistency regularizer can enhance model discrimination by providing more comprehensive intra-class information.

Ablation Study

To evaluate the contribution of the different components of our work, we conduct ablation studies for CtO on Office-31. We investigate different combinations of the two parts: Adaptive Local-consistency Regularization (ALR) and Adaptive Input-consistency Regularization (AIR). Compared to our method, AaD (i.e., only \mathcal{L}_{sep} and \mathcal{L}_{lr} are used) can be regarded as the baseline. As shown in Table 6, each part of our method contributes to improving performance. It is not difficult to find that AIR contributes the most to the improvement in accuracy, with performance increasing from 85.0% to 86.9%, which shows the effectiveness of label propagation. ALR also improves the average performance by 1.1% compared to the base model, confirming that the distance-based re-weighting improves the quality of the neighbors. For easy transfer tasks, target features from pre-trained source models naturally have good clustering performance; in this case, ALR dominates the loss optimization, with AIR helping to improve model training for under-learned categories. When the target feature distribution is scattered, the model benefits from AIR to ensure its smoothness, while the expansion property amplifies it to global consistency within the same class, allowing the limited structural information captured by ALR to be propagated among subpopulations. According to the comparison of results, we conclude that employing a regularization strategy suitable for the data properties is important for capturing semantic information. Without local consistency, outlier samples are difficult to learn, which makes the model heavily dependent on the transferability of source knowledge. Similarly, removing input consistency makes it difficult to effectively facilitate the learning of global semantic information. Overall, CtO improves the baseline AaD by 2.4% on average, which shows that ALR and AIR are complementary.

Visualization

To demonstrate the superiority of our method, we show the t-SNE (Van der Maaten and Hinton, 2008) feature visualization and confusion matrices on Office-31 (see Fig. 3). From Fig. 3(a-d), we can observe that the clustering of the target features is more compact after adaptation by CtO. Fig. 3(b) and (d) illustrate that CtO can achieve good model adaptation whether the model is pre-trained on a large-scale or a small-scale source domain. When the source domain is knowledge-rich, as shown in Fig. 3(a), the target domain features already possess considerable semantics; in such cases, adaptive local-consistency regularization can effectively capture the intra-class structure. However, when significant domain differences exist (as shown in Fig. 3(c)), abundant target features are jumbled together, so the model has difficulty capturing the local structure. The flexible data division of our method thus customizes the learning strategy for different data properties, which facilitates the estimation of ground-truth structural information instead of only adjusting the neighbor weights as in NRC (Yang et al., 2021a). Benefiting from the adaptive input-consistency regularization, we can capture semantic structures with rich intra-class variations while dissimilar samples are naturally separated in the representation space. More importantly, as training iterates, outlier samples gradually join the clustering, and locally informed clusters propagate label information to these outliers. The comparison of Fig. 3(e) and (f) further demonstrates that our method increases prediction diversity by adaptively adjusting the training on under-learned or hard-to-learn samples (i.e., outliers).

6. Conclusions

In this paper, we propose a novel approach called Chaos to Order (CtO), which tries to achieve efficient feature clustering from the perspective of label propagation. CtO divides the target data into inner and outlier samples based on an adaptive threshold over the learning state, and applies a customized learning strategy to best fit the data properties. To mitigate source bias, on the one hand, considering the clustering affinity, we propose Adaptive Local-consistency Regularization (ALR) to reduce spurious clustering by re-weighting neighbors; on the other hand, Adaptive Input-consistency Regularization (AIR) is applied to outlier points to propagate structural information from high-density to low-density regions, thus achieving high accuracy with respect to the ground-truth labels. Moreover, this co-training process encourages positive clustering and combats spurious clustering. The experimental results on three popular benchmarks verify that our proposed model outperforms the state of the art on various SFDA tasks. For future work, we plan to extend our CtO method to source-free open-set and partial-set domain adaptation.

Acknowledgements.
This work was supported by the National Natural Science Foundation of China under Grant 61871186 and 61771322.

References

  • Cai et al. (2021) Tianle Cai, Ruiqi Gao, Jason D. Lee, and Qi Lei. 2021. A Theory of Label Propagation for Subpopulation Shift. In ICML (Proceedings of Machine Learning Research, Vol. 139). PMLR, 1170–1182.
  • Cubuk et al. (2020) Ekin Dogus Cubuk, Barret Zoph, Jonathon Shlens, and Quoc Le. 2020. RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. In NeurIPS.
  • Ding et al. (2022) Ning Ding, Yixing Xu, Yehui Tang, Chao Xu, Yunhe Wang, and Dacheng Tao. 2022. Source-Free Domain Adaptation via Distribution Estimation. In CVPR. IEEE, 7202–7212.
  • Douze et al. (2018) Matthijs Douze, Arthur Szlam, Bharath Hariharan, and Hervé Jégou. 2018. Low-Shot Learning With Large-Scale Diffusion. In CVPR. Computer Vision Foundation / IEEE Computer Society, 3349–3358.
  • Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor S. Lempitsky. 2016. Domain-Adversarial Training of Neural Networks. J. Mach. Learn. Res. 17 (2016), 59:1–59:35.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR. IEEE Computer Society, 770–778.
  • Huang et al. (2021) Jiaxing Huang, Dayan Guan, Aoran Xiao, and Shijian Lu. 2021. Model Adaptation: Historical Contrastive Learning for Unsupervised Domain Adaptation without Source Data. In NeurIPS. 3635–3649.
  • Iscen et al. (2019) Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. 2019. Label Propagation for Deep Semi-Supervised Learning. In CVPR. Computer Vision Foundation / IEEE, 5070–5079.
  • Jin et al. (2020) Ying Jin, Ximei Wang, Mingsheng Long, and Jianmin Wang. 2020. Minimum Class Confusion for Versatile Domain Adaptation. In ECCV (21) (Lecture Notes in Computer Science, Vol. 12366). Springer, 464–480.
  • Kang et al. (2019) Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G. Hauptmann. 2019. Contrastive Adaptation Network for Unsupervised Domain Adaptation. In CVPR. Computer Vision Foundation / IEEE, 4893–4902.
  • Kundu et al. (2022) Jogendra Nath Kundu, Akshay R. Kulkarni, Suvaansh Bhambri, Deepesh Mehta, Shreyas Anand Kulkarni, Varun Jampani, and Venkatesh Babu Radhakrishnan. 2022. Balancing Discriminability and Transferability for Source-Free Domain Adaptation. In ICML (Proceedings of Machine Learning Research, Vol. 162). PMLR, 11710–11728.
  • Kundu et al. (2021) Jogendra Nath Kundu, Akshay R. Kulkarni, Amit Singh, Varun Jampani, and R. Venkatesh Babu. 2021. Generalize then Adapt: Source-Free Domain Adaptive Semantic Segmentation. In ICCV. IEEE, 7026–7036.
  • Lee et al. (2022) Jonghyun Lee, Dahuin Jung, Junho Yim, and Sungroh Yoon. 2022. Confidence Score for Source-Free Unsupervised Domain Adaptation. In ICML (Proceedings of Machine Learning Research, Vol. 162). PMLR, 12365–12377.
  • Li et al. (2022) Jingjing Li, Zhekai Du, Lei Zhu, Zhengming Ding, Ke Lu, and Heng Tao Shen. 2022. Divergence-Agnostic Unsupervised Domain Adaptation by Adversarial Attacks. IEEE Trans. Pattern Anal. Mach. Intell. 44, 11 (2022), 8196–8211.
  • Li et al. (2020a) Rui Li, Wenming Cao, Si Wu, and Hau-San Wong. 2020a. Generating Target Image-Label Pairs for Unsupervised Domain Adaptation. IEEE Trans. Image Process. 29 (2020), 7997–8011.
  • Li et al. (2020b) Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, and Si Wu. 2020b. Model Adaptation: Unsupervised Domain Adaptation Without Source Data. In CVPR. Computer Vision Foundation / IEEE, 9638–9647.
  • Liang et al. (2020) Jian Liang, Dapeng Hu, and Jiashi Feng. 2020. Do We Really Need to Access the Source Data? Source Hypothesis Transfer for Unsupervised Domain Adaptation. In ICML (Proceedings of Machine Learning Research, Vol. 119). PMLR, 6028–6039.
  • Long et al. (2018) Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I. Jordan. 2018. Conditional Adversarial Domain Adaptation. In NeurIPS. 1647–1657.
  • Na et al. (2021) Jaemin Na, Heechul Jung, Hyung Jin Chang, and Wonjun Hwang. 2021. FixBi: Bridging Domain Spaces for Unsupervised Domain Adaptation. In CVPR. Computer Vision Foundation / IEEE, 1094–1103.
  • Pei et al. (2023) Jiangbo Pei, Zhuqing Jiang, Aidong Men, Liang Chen, Yang Liu, and Qingchao Chen. 2023. Uncertainty-Induced Transferability Representation for Source-Free Unsupervised Domain Adaptation. IEEE Trans. Image Process. 32 (2023), 2033–2048.
  • Peng et al. (2017) Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. 2017. VisDA: The Visual Domain Adaptation Challenge. CoRR abs/1710.06924 (2017).
  • Qiu et al. (2021) Zhen Qiu, Yifan Zhang, Hongbin Lin, Shuaicheng Niu, Yanxia Liu, Qing Du, and Mingkui Tan. 2021. Source-free Domain Adaptation via Avatar Prototype Generation and Adaptation. In IJCAI. ijcai.org, 2921–2927.
  • Qu et al. (2022) Sanqing Qu, Guang Chen, Jing Zhang, Zhijun Li, Wei He, and Dacheng Tao. 2022. BMD: A General Class-Balanced Multicentric Dynamic Prototype Strategy for Source-Free Domain Adaptation. In ECCV (34) (Lecture Notes in Computer Science, Vol. 13694). Springer, 165–182.
  • Saenko et al. (2010) Kate Saenko, Brian Kulis, et al. 2010. Adapting Visual Category Models to New Domains. In ECCV.
  • Tang et al. (2020) Hui Tang, Ke Chen, and Kui Jia. 2020. Unsupervised Domain Adaptation via Structurally Regularized Deep Clustering. In CVPR. Computer Vision Foundation / IEEE, 8722–8732.
  • Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
  • Venkateswara et al. (2017) Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. 2017. Deep Hashing Network for Unsupervised Domain Adaptation. In CVPR. IEEE Computer Society, 5385–5394.
  • Wei et al. (2021) Colin Wei, Kendrick Shen, Yining Chen, and Tengyu Ma. 2021. Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data. In ICLR. OpenReview.net.
  • Wu et al. (2020) Yuan Wu, Diana Inkpen, and Ahmed El-Roby. 2020. Dual Mixup Regularized Learning for Adversarial Domain Adaptation. In ECCV (29) (Lecture Notes in Computer Science, Vol. 12374). Springer, 540–555.
  • Xia et al. (2021) Haifeng Xia, Handong Zhao, and Zhengming Ding. 2021. Adaptive Adversarial Network for Source-free Domain Adaptation. In ICCV. IEEE, 8990–8999.
  • Xu et al. (2019) Ruijia Xu, Guanbin Li, Jihan Yang, and Liang Lin. 2019. Larger Norm More Transferable: An Adaptive Feature Norm Approach for Unsupervised Domain Adaptation. In ICCV. IEEE, 1426–1435.
  • Yang et al. (2021a) Shiqi Yang, Yaxing Wang, Joost van de Weijer, Luis Herranz, and Shangling Jui. 2021a. Exploiting the Intrinsic Neighborhood Structure for Source-free Domain Adaptation. In NeurIPS. 29393–29405.
  • Yang et al. (2021b) Shiqi Yang, Yaxing Wang, Joost van de Weijer, Luis Herranz, and Shangling Jui. 2021b. Generalized Source-free Domain Adaptation. In ICCV. IEEE, 8958–8967.
  • Yang et al. (2022) Shiqi Yang, Yaxing Wang, Kai Wang, Shangling Jui, et al. 2022. Attracting and dispersing: A simple approach for source-free domain adaptation. In Advances in Neural Information Processing Systems.
  • Yi et al. (2023) Li Yi, Gezheng Xu, Pengcheng Xu, Jiaqi Li, Ruizhi Pu, Charles Ling, A. Ian McLeod, and Boyu Wang. 2023. When Source-Free Domain Adaptation Meets Learning with Noisy Labels. In ICLR. OpenReview.net.
  • Zhang et al. (2021) Bowen Zhang, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura, and Takahiro Shinozaki. 2021. FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling. In NeurIPS. 18408–18419.
  • Zhang et al. (2019) Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael I. Jordan. 2019. Bridging Theory and Algorithm for Domain Adaptation. In ICML (Proceedings of Machine Learning Research, Vol. 97). PMLR, 7404–7413.
  • Zhang et al. (2022) Ziyi Zhang, Weikai Chen, Hui Cheng, Zhen Li, Siyuan Li, Liang Lin, and Guanbin Li. 2022. Divide and Contrast: Source-free Domain Adaptation via Adaptive Contrastive Learning. In Advances in Neural Information Processing Systems.
  • Zhong et al. (2021) Li Zhong, Zhen Fang, Feng Liu, Jie Lu, Bo Yuan, and Guangquan Zhang. 2021. How Does the Combined Risk Affect the Performance of Unsupervised Domain Adaptation Approaches?. In AAAI. AAAI Press, 11079–11087.

Appendix A Proof of Claim 3.1 and Theorem 3.2

A.1. Claim 3.1 and Proof

Claim 3.1. Suppose G satisfies a Lipschitz condition; then there exists a global threshold \rho\in(0,1) and a scale of the model's learning status \tau_{i} such that the inner set I is consistency robust, i.e., R_{\mathcal{B}}(G)=0. More specifically,

r\leq(2\max\{\tau_{i}\}\rho-1)\frac{2}{L},\quad\forall i\in[\mathcal{C}].

Proof

Let G(x)_{[k]} denote the predicted probability of the model on class k. Integrating the dynamic thresholds, there exists a constant \gamma such that G(x)_{[j]}\in[\tau_{j}\rho-\gamma,1-\tau_{i}\rho], where j\neq\arg\max(G(x)). Suppose I is defined by the model's learning state; denote

I\triangleq\{x:\max(G(x))\geq\tau_{i}\rho\;\text{s.t.}\;i=\arg\max(G(x))\}.

Suppose, for contradiction, that there exist x,x^{\prime}\in I with x^{\prime}\in\mathcal{B}(x)\cap I such that G(x)\neq G(x^{\prime}). Specifically, define

\arg\max G(x)=i,\quad\arg\max G(x^{\prime})=j,

where i\neq j. Along with the above inequality, we know that

|G(x)_{[i]}-G(x^{\prime})_{[j]}|\geq 2\tau_{i}\rho-\gamma-1.

By the definition of Lipschitz constant, we have:

(13) L\left\|x-x^{\prime}\right\|\geq\left\|G(x)-G\left(x^{\prime}\right)\right\|
\geq\left|G(x)_{[i]}+G(x)_{[j]}-G(x^{\prime})_{[j]}-G(x^{\prime})_{[i]}\right|
\geq\left|G(x)_{[i]}-G(x^{\prime})_{[j]}\right|+\left|G(x)_{[j]}-G(x^{\prime})_{[i]}\right|
\geq 4(\tau_{i}+\tau_{j})\rho-2-2\gamma
\geq 4(\tau_{i}+\tau_{j})\rho-2.

As a result,

L\left\|x-x^{\prime}\right\|\geq 4\max(\tau_{i})\rho-2,\quad\forall i\in[\mathcal{C}].

Since by the definition of x^{\prime}\in\mathcal{B}(x), we have \|x-x^{\prime}\|\leq r. Combining Eq. 13 and the Lipschitz constant L>0, we know that this forms a contradiction with L\left\|x-x^{\prime}\right\|\geq Lr. Thus, \forall x\in I, x^{\prime}\in\mathcal{B}(x)\cap I, the model predictions are consistent, i.e., R_{\mathcal{B}}(G)=0.

A.2. Proof Sketch for Theorem 3.2

Theorem 3.2. Suppose Assumption 3.1 and Claim 3.1 hold and I,O satisfy (q,\mu)-constant expansion. Then the expected error of model G is bounded:

ϵ𝒟T(G)4max(q,μ)κ+μ(1+κ).\epsilon_{\mathcal{D}_{T}}(G)\leq 4\max(q,\mu)\kappa+\mu(1+\kappa).
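For concreteness, a purely hypothetical instantiation (values chosen for exposition only, with $\kappa$ the constant from Assumption 3.1): if $q = 0.05$, $\mu = 0.02$, and $\kappa = 1.5$, the bound evaluates to
$$\epsilon_{\mathcal{D}_T}(G) \leq 4 \times 0.05 \times 1.5 + 0.02 \times (1 + 1.5) = 0.30 + 0.05 = 0.35.$$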

To prove Theorem 3.2, we introduce some concepts and notation following (Cai et al., 2021): (i) the robust set of $G$, $RS(G)$; (ii) the minority robust set of $G$ on $U$, $M$.

For a given model $G$, define the robust set to be the set on which $G$ is robust under input transformations:

$$RS(G) := \{x \mid G(x) = G(x'),\; \forall x' \in \mathcal{B}(x)\}.$$

Let $A_{ik} \triangleq RS(G) \cap U_i \cap \{x \mid G(x) = k\}$ for $i, k \in [\mathcal{C}]$, where $U_i$ denotes the conditional distribution of $U$ on class $i$. To define the minority robust set $M$ on $U$, we consider the majority class label of $G$:

$$y_i^{\text{Maj}} \triangleq \arg\max_{k \in [\mathcal{C}]} \mathbb{P}_U[A_{ik}].$$

Thus, we define

$$M \triangleq \bigcup_{i \in [\mathcal{C}]} \bigcup_{k \in [\mathcal{C}] \backslash \{y_i^{\text{Maj}}\}} A_{ik}$$

to be the minority robust set of $G$. In addition, let

$$\widetilde{M} \triangleq \bigcup_{i \in [\mathcal{C}]} \left(U_i \cap \{x \mid G(x) \neq y_i^{\text{Maj}}\}\right)$$

be the minority set of $G$.

By Lemma A.1 in (Cai et al., 2021), under $(q,\mu)$-constant expansion we have

$$\begin{aligned}
\mathbb{P}_U[M] &\leq 2\max(q,\mu), \\
\mathbb{P}_U[\widetilde{M}] &\leq 2\max(q,\mu) + \mu.
\end{aligned}$$
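As a quick sanity check using the same illustrative values as for Theorem 3.2 above ($q = 0.05$, $\mu = 0.02$), these bounds give
$$\mathbb{P}_U[M] \leq 2\max(0.05, 0.02) = 0.10, \qquad \mathbb{P}_U[\widetilde{M}] \leq 0.10 + 0.02 = 0.12.$$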
Lemma A.1 (Upper bound on the inner set $I$).

Suppose the conditions of Claim 3.1 hold; then

$$\epsilon_I(G) \leq \mathbb{P}_I[M] + R_{\mathcal{B}}(G).$$

Proof

Based on the definition of the minority robust set $M$, the misclassified inner samples satisfy $\{x : G(x) \neq G^*(x),\; x \in I\} \subseteq M \cup \{x : G(x) \neq G(x'),\; x \in I,\; \text{and}\; x' \in \mathcal{B}(x) \cap O\}$. Therefore, we can write:

(14)
$$\begin{aligned}
\epsilon_I(G) &= \mathbb{P}_I[G(x) \neq G^*(x)] \\
&= \mathbb{P}_I[M \cap (G(x) \neq G^*(x))] + \mathbb{P}_I[(G(x) \neq G(x')) \cap (G(x) \neq G^*(x))] \\
&\leq \mathbb{P}_I[M] + \mathbb{P}_I[\overline{RS(G)}] \\
&\leq \mathbb{P}_I[M] + R_{\mathcal{B}}(G).
\end{aligned}$$
Lemma A.2 (Upper bound on the outlier set $O$).

Let $O = \mathcal{D}_T \backslash I$; then

$$\epsilon_O(G) \leq \mathbb{P}_O[M] + \mathbb{P}_O[\widetilde{M}] + R_{\mathcal{B}}(G).$$

Proof

By the definition of the outlier set $O$, we note that $\{x : G(x) \neq G^*(x),\; \text{and}\; x \in O\} \subseteq M \cup \widetilde{M} \cup (\overline{RS(G)} \backslash \widetilde{M})$. Thus, we obtain

(15)
$$\epsilon_O(G) \leq \mathbb{P}_O[M] + \mathbb{P}_O[\widetilde{M}] + R_{\mathcal{B}}(G).$$

Based on the above results, we can now apply Lemma A.1 and Lemma A.2 to bound the target error $\epsilon_{\mathcal{D}_T}(G)$. Under the conditions of Theorem 3.2, we have:

(16)
$$\begin{aligned}
\epsilon_{\mathcal{D}_T}(G) &= \mathbb{P}_{\mathcal{D}_T}[I]\,\epsilon_I(G) + \mathbb{P}_{\mathcal{D}_T}[O]\,\epsilon_O(G) \\
&\leq \mathbb{P}_{\mathcal{D}_T}[I]\left(\mathbb{P}_I[M] + R_{\mathcal{B}}(G)\right) + \mathbb{P}_{\mathcal{D}_T}[O]\left(\mathbb{P}_O[M] + \mathbb{P}_O[\widetilde{M}] + R_{\mathcal{B}}(G)\right) \qquad \text{(Lemma A.1 and Lemma A.2)} \\
&\leq \mathbb{P}_{\mathcal{D}_T}[I]\left(\kappa\,\mathbb{P}_U[M] + R_{\mathcal{B}}(G)\right) + \mathbb{P}_{\mathcal{D}_T}[O]\left(\kappa\,\mathbb{P}_U[M] + \kappa\,\mathbb{P}_U[\widetilde{M}] + R_{\mathcal{B}}(G)\right) \qquad \text{(Assumption 3.1)} \\
&\leq \kappa\,\mathbb{P}_U[M] + R_{\mathcal{B}}(G) + \kappa\,\mathbb{P}_{\mathcal{D}_T}[O]\,\mathbb{P}_U[\widetilde{M}] \\
&\leq 4\max(q,\mu)\kappa + \mu(1+\kappa),
\end{aligned}$$

which completes the proof of Theorem 3.2.
| Layer | Distance | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg. |
| Layer 4 (source) | same class | 14.848 | 12.155 | 12.918 | 13.849 | 11.914 | 12.650 | 16.183 | 17.240 | 14.453 | 14.183 | 13.989 | 11.680 | 13.839 |
| Layer 4 (source) | across classes | 16.674 | 15.556 | 15.548 | 15.449 | 14.317 | 14.597 | 18.152 | 18.730 | 16.924 | 15.978 | 15.410 | 14.548 | 15.990 |
| Layer 4 (target) | same class | 12.800 | 12.961 | 12.275 | 14.183 | 12.961 | 12.275 | 14.183 | 12.800 | 12.275 | 14.183 | 12.800 | 12.961 | 13.055 |
| Layer 4 (target) | across classes | 14.422 | 16.174 | 14.750 | 16.396 | 16.174 | 14.750 | 16.396 | 14.422 | 14.750 | 16.396 | 14.422 | 16.174 | 15.435 |
| Bottleneck (source) | same class | 19.041 | 15.302 | 16.117 | 18.614 | 16.453 | 17.260 | 18.656 | 20.286 | 16.868 | 16.799 | 17.428 | 14.102 | 17.244 |
| Bottleneck (source) | across classes | 21.975 | 20.965 | 21.160 | 21.529 | 20.717 | 20.926 | 21.778 | 22.301 | 21.182 | 21.033 | 20.275 | 20.274 | 21.176 |
| Bottleneck (target) | same class | 19.813 | 13.601 | 14.845 | 16.137 | 13.223 | 14.362 | 18.703 | 22.772 | 16.631 | 16.223 | 18.673 | 12.934 | 16.493 |
| Bottleneck (target) | across classes | 24.252 | 20.307 | 22.235 | 20.172 | 18.489 | 19.838 | 23.969 | 26.605 | 23.761 | 21.410 | 22.489 | 19.170 | 21.891 |

Table 7. Euclidean distance within the same class and across classes in each task on Office-Home.

Appendix B Algorithm for CtO

Our method, as described in Algorithm 1, involves dynamic data grouping together with adaptive input- and local-consistency regularization. Using an adaptive threshold on the learning state, we dynamically divide the target data into inner and outlier samples, which allows us to apply a different learning strategy to each subset; a minimal illustrative sketch of this split is given below.
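The following NumPy sketch illustrates only the inner/outlier partition step, under stated assumptions: the per-class scales tau and the global threshold rho are taken as given, and the exact update rules (Eqs. 5, 6 and 8 in the main text) are not reproduced here. Names and shapes are placeholders, not the authors' implementation.

```python
import numpy as np

def split_inner_outlier(probs: np.ndarray, tau: np.ndarray, rho: float):
    """Partition target samples into inner / outlier sets.

    probs: (N, C) softmax outputs of the adapted model on target samples.
    tau:   (C,) per-class learning-status scales.
    rho:   global threshold in (0, 1).
    A sample is 'inner' when its top predicted probability reaches the
    adaptive threshold tau[argmax] * rho; otherwise it is an outlier.
    """
    pred = probs.argmax(axis=1)                      # i = argmax G(x)
    conf = probs[np.arange(len(probs)), pred]        # max G(x)
    inner_mask = conf >= tau[pred] * rho
    return inner_mask, ~inner_mask

# Toy usage with random probabilities (illustration only).
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
inner, outlier = split_inner_outlier(probs, tau=np.full(5, 0.8), rho=0.5)
```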

Algorithm 1 Training Algorithm of CtO
Input: Unlabeled target data $\mathcal{X}_t$, number of classes $C$, a pre-trained source model $G = g_c \circ h$, global threshold $\rho$.
Initialization: Build the feature bank $\boldsymbol{F}$ and score bank $\boldsymbol{P}$ by forward computation; initialize the global threshold to $1/C$.
1:  while $t < \text{MaxIterations}$ do
2:     Update $\boldsymbol{F}$ and $\boldsymbol{P}$ for the current mini-batch $B$;
3:     Update the global threshold based on Eq. 5;
4:     Compute the per-class learning effect based on Eq. 6;
5:     Update the inner and outlier samples based on Eq. 8;
6:     Compute $\mathcal{L}_{alr}$ on inner samples by Eq. 9;
7:     Compute $\mathcal{L}_{sep}$ on all samples by Eq. 10;
8:     Compute $\mathcal{L}_{air}$ on outlier samples by Eq. 11;
9:     Compute the overall loss and update the model $G$.
10: end while
Output: The adapted model $G$.
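To make the bookkeeping above concrete, here is a minimal, self-contained NumPy sketch of the loop structure in Algorithm 1. It is a sketch under explicit assumptions, not the authors' implementation: the "model" is a random linear map, the threshold and learning-status updates only mimic the roles of Eqs. 5, 6 and 8, entropy stands in for the losses of Eqs. 9-11, and the parameter update is a placeholder because plain NumPy has no autograd.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 64, 16, 5                                      # samples, feature dim, classes
X_t = rng.normal(size=(N, D)).astype(np.float32)         # unlabeled target data
W = 0.1 * rng.normal(size=(D, C)).astype(np.float32)     # stand-in for the source model G
F = np.zeros((N, D), dtype=np.float32)                   # feature bank
P = np.full((N, C), 1.0 / C, dtype=np.float32)           # score bank
tau = np.ones(C)                                         # per-class learning status
rho = 1.0 / C                                            # global threshold, initialized to 1/C

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def entropy(p):
    """Generic stand-in for L_alr / L_sep / L_air (Eqs. 9-11)."""
    return float(-(p * np.log(p + 1e-8)).sum(axis=1).mean()) if len(p) else 0.0

for step in range(3):                                    # while t < MaxIterations
    idx = rng.choice(N, size=16, replace=False)          # mini-batch B
    feats = X_t[idx]
    probs = softmax(feats @ W)
    F[idx], P[idx] = feats, probs                        # line 2: refresh the banks
    rho = 0.9 * rho + 0.1 * probs.max(axis=1).mean()     # line 3: mimic Eq. 5
    pred = probs.argmax(axis=1)
    for c in range(C):                                   # line 4: mimic Eq. 6
        tau[c] = max((pred == c).mean(), 1e-3)
    inner = probs.max(axis=1) >= (tau / tau.max())[pred] * rho          # line 5: mimic Eq. 8
    loss = entropy(probs[inner]) + entropy(probs) + entropy(probs[~inner])  # lines 6-8
    W -= 0.01 * rng.normal(size=W.shape)                 # line 9: placeholder parameter update
```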

Appendix C Analysis of different similarity measures

To examine the effect of different metric functions on target feature clustering, we compare cosine similarity and Euclidean distance as feature-similarity measures on Office-Home. Fig. 4 shows the average Euclidean distance for all tasks on Office-Home, where a smaller value indicates higher similarity. The Euclidean distance also indicates the presence of rich semantic information in the high-dimensional features of the source model's backbone network. However, compared to Fig. 2, the differences in similarity based on Euclidean distance are insignificant. The Euclidean distance reflects absolute differences in values, while the cosine distance reflects relative differences in direction. Cosine similarity therefore keeps a fixed interpretation in high-dimensional space: 1 for identical directions, 0 for orthogonal, and -1 for opposite. Euclidean distance, in contrast, is influenced by the feature dimensionality and scale, and its range is unbounded. Particularly under distribution shift, sample-wise fluctuations have large variance, which leads to poor performance of the Euclidean distance.
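This scale sensitivity can be illustrated with a tiny synthetic example (random vectors, not the paper's features): rescaling one vector leaves the cosine similarity essentially unchanged but inflates the Euclidean distance.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=512)
b = a + 0.1 * rng.normal(size=512)            # nearly the same direction as a

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean(u, v):
    return float(np.linalg.norm(u - v))

print(cosine(a, b), euclidean(a, b))          # cosine close to 1, small distance
print(cosine(a, 5 * b), euclidean(a, 5 * b))  # cosine unchanged, distance much larger
```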

We also report feature similarities among samples within the same class and across classes for each transfer task. As shown in Table 7, the gap between the within-class and across-class Euclidean distances becomes unclear when the distributions differ significantly (e.g., the Ar→Cl, Pr→Cl, and Rw→Cl tasks). Such weak inter-category discrimination exacerbates spurious clustering, which biases the adaptation process.
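For reference, the following sketch shows one straightforward way to compute such within-class and across-class distance statistics from a feature matrix and labels. This is an assumed procedure for illustration only, not necessarily the exact protocol behind Table 7.

```python
import numpy as np

def class_distance_stats(features: np.ndarray, labels: np.ndarray):
    """Mean pairwise Euclidean distance within the same class and across classes."""
    diffs = features[:, None, :] - features[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)            # (N, N) pairwise distances
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)      # exclude self-distances
    within = dist[same & off_diag].mean()
    across = dist[~same].mean()
    return within, across

# Toy usage with random features and labels (illustration only).
rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 64))
labels = rng.integers(0, 5, size=50)
print(class_distance_stats(feats, labels))
```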

From these experiments, we observe two things: 1) the source model already contains sufficient inductive bias; and 2) under domain shift, the choice of similarity metric is strongly correlated with the quality of target feature clustering. A possible direction for future work is to study how different metrics affect target clustering performance.

Figure 4. Histogram of the Euclidean distance within the same class and across classes on Office-Home.