
Chaos to Order: A Label Propagation Perspective on Source-Free Domain Adaptation

Chunwei Wu, East China Normal University, Shanghai, China, [email protected]; Guitao Cao, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai, China, [email protected]; Yan Li, Shanghai Normal University, Shanghai, China, [email protected]; Xidong Xi, East China Normal University, Shanghai, China, [email protected]; Wenming Cao, Shenzhen University, Shenzhen, China, [email protected]; and Hong Wang, Shanghai Research Institute of Microwave Equipment, Shanghai, China, [email protected]
(2023)
Abstract.

Source-free domain adaptation (SFDA), where only a pre-trained source model is used to adapt to the target distribution, is a more general approach to achieving domain adaptation in the real world. However, it can be challenging to capture the inherent structure of the target features accurately due to the lack of supervised information on the target domain. By analyzing the clustering performance of the target features, we show that they still contain core features related to discriminative attributes but lack the collation of semantic information. Inspired by this insight, we present Chaos to Order (CtO), a novel approach for SFDA that strives to constrain semantic credibility and propagate label information among target subpopulations. CtO divides the target data into inner and outlier samples based on an adaptive threshold over the learning state, customizing the learning strategy to best fit the data properties. Specifically, inner samples are utilized for learning intra-class structure thanks to their relatively well-clustered properties, while the low-density outlier samples are regularized by input consistency to achieve high accuracy with respect to the ground-truth labels. By employing different learning strategies to propagate labels from inner to outlier instances, CtO clusters the global samples from chaos to order. We further adaptively regulate the neighborhood affinity of the inner samples to constrain local semantic credibility. In theoretical and empirical analyses, we demonstrate that our algorithm not only propagates label information from inner to outlier samples but also prevents local clustering from forming spurious clusters. Empirical evidence demonstrates that CtO outperforms the state of the art on three public benchmarks: Office-31, Office-Home, and VisDA.

transfer learning; source-free domain adaptation; label propagation; cluster analysis
journalyear: 2023; doi: 10.1145/3581783.3613821; copyright: acmlicensed; booktitle: Proceedings of the 31st ACM International Conference on Multimedia (MM '23), October 29-November 3, 2023, Ottawa, ON, Canada; price: 15.00; isbn: 979-8-4007-0108-5/23/10; submission id: 2405; ccs: Computing methodologies, Transfer learning; ccs: Computing methodologies, Regularization

1. Introduction

Figure 1. A toy illustration of our approach on label propagation. The target samples can be divided into two subsets: the inner set and the outlier set. CtO achieves label propagation through Adaptive Local-consistency Regularization and Adaptive Input-consistency Regularization. Among them, the outlier samples may move to the correct clusters with the help of their augmented versions. Maintaining this trend, the samples gradually move from chaos to order.

The excellent performance of deep learning relies heavily on a large amount of high-quality labeled data. Obtaining large amounts of manually labeled data for specific learning tasks is often time-consuming and expensive, making these tasks challenging to implement in practical applications. To alleviate this dependency, Unsupervised Domain Adaptation (UDA) has been developed to improve performance in the unlabeled target domain by exploiting the labeled source domain. Two popular practices for modern UDA design are learning domain-invariant features (Ganin et al., 2016; Long et al., 2018; Kang et al., 2019; Tang et al., 2020) and generating dummy samples to match the target domain distribution (Wu et al., 2020; Li et al., 2020a; Zhong et al., 2021; Na et al., 2021). However, due to data privacy and security issues, the source domain training data required by most existing UDA methods is usually unavailable in real-world applications. In response, Source-Free Domain Adaptation (SFDA) emerged, which attempted to adapt a trained source model to the target domain without using any source data.

Due to the lack of source data, it is impossible to estimate source-target domain differences. Existing theoretical work usually provides learning guarantees on the target domain by further assuming that the source domain covers the support of the target domain. In the seminal work by (Yang et al., 2021a), the authors point out that the target features from the source model have formed some semantic structures. Inspired by this intuition, we can preserve the important clustering structure in the target domain by matching similar features in the high-dimensional space. However, the nearest-neighbor consistency of points in high-dimensional space may be wrong, such as when forcing the local consistency of points in low-density regions. As shown in Table 1, when the source and target domains have significant differences (i.e., Pr\rightarrowCl and Rw\rightarrowCl), numerous features gather in low-density regions, with only about one-third of the neighbors having the correct labels.

To address this issue, we propose Chaos to Order (CtO) (Fig. 1), an effective method to achieve more robust clustering of unlabeled data from the perspective of label propagation. To achieve flexible adaptation for different data properties and exploit the target domain structure information, our work introduces a novel data division strategy and then designs different regularization strategies to achieve label propagation.

Firstly, our approach treats mining the intrinsic structure information of the target domain as a clustering problem. Although existing local consistency-based methods aim to preserve the local structure, Table 1 illustrates why neighbors can be unreliable: in distance-based neighbor discrimination, neighbors are similar points in a high-dimensional space, and since points in low-density regions are scattered far apart, the label information of the K-nearest neighbors is not consistent there. In CtO, we utilize the model's learning state to dynamically divide the target data into inner and outlier sets. The intrinsic reason is that a sample can be considered an inner sample if it obtains high predictive values from the classifier; otherwise, it is an outlier. We regularize the input consistency of outliers and encourage local consistency for the inner samples, which effectively improves the mining of intrinsic structural information.

Secondly, we assume a minimum overlap between the subpopulations of the inner and outlier sets, and extend the subsets using the simple but realistic expansion assumption of (Wei et al., 2021). For the inner set, the local-consistency regularizer connects similar points in the high-dimensional space, allowing SFDA training to proceed stably. Enlightening experiments on Office-Home show that: (1) the pre-trained source model can extract rich semantic information from the target data; (2) what is lacking in domain adaptation is the filtering and permutation of high-dimensional semantic information. Due to the lack of supervised information, we preserve the core features associated with the discriminative attributes by enforcing the local consistency of points in the latent space. Moreover, we propose a re-weighted clustering strategy called Adaptive Local-consistency Regularization (ALR), which explicitly constrains local semantic credibility to filter out spurious clustering information. To advance further along this line, we propose Adaptive Input-consistency Regularization (AIR) for the outlier set. Generally, requiring the model to be invariant to input perturbations can improve generalizability. Furthermore, as (Wei et al., 2021) discuss, a low-probability subset of the data can be expanded to a neighborhood with a large probability relative to that subset. We show that label information can be propagated among subpopulations by minimizing the consistency regularization term on unlabeled data. In Theorem 3.2, we give an upper bound on the task risk of the target model. As a result, by customizing the learning strategy for different data properties, CtO can propagate structural information from the inner set to the outlier set while enhancing the clustering of the inner set.

Table 1. Ratio (%) of K-nearest neighbors that have the correct predicted label, for different values of K (on Office-Home).
K Ar\rightarrowCl Ar\rightarrowPr Cl\rightarrowAr Pr\rightarrowCl Pr\rightarrowRw Rw\rightarrowCl
1 42.0 66.2 47.3 33.6 70.0 41.2
2 36.8 62.7 40.7 28.6 66.1 36.9
3 33.8 59.6 37.4 24.7 63.0 33.1
4 30.4 57.1 34.3 22.0 60.4 30.7
5 28.5 55.1 31.2 20.0 58.2 28.0
6 26.8 53.0 29.1 18.1 56.4 26.3
7 25.2 51.6 27.6 16.7 54.9 24.3
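To make the statistic in Table 1 concrete, the sketch below shows one plausible way such a ratio could be computed from L2-normalized target features and labels, assuming PyTorch; the exact definition used for the table is not spelled out here, and the function name `knn_label_agreement` and the "all K neighbors agree" reading are our assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_label_agreement(feats: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    """Fraction of samples whose k nearest neighbours (cosine similarity,
    excluding the sample itself) all carry the sample's own label."""
    feats = F.normalize(feats, dim=1)          # L2-normalize the features
    sim = feats @ feats.t()                    # pairwise cosine similarities
    sim.fill_diagonal_(-1.0)                   # exclude self-matches
    _, idx = sim.topk(k, dim=1)                # indices of the k nearest neighbours
    agree = (labels[idx] == labels.unsqueeze(1)).all(dim=1)
    return agree.float().mean().item()
```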

In summary, the contributions of this paper are summarized as follows: (1) We introduce CtO, a dynamical clustering approach for SFDA. Such an approach customizes the learning strategy for data subsets by using dynamic data splits, allowing label information to propagate among subpopulations. (2) To combat spurious clustering, we propose a novel Adaptive Local-consistency Regularization (ALR) strategy that estimates ground-truth structural information by re-weighting the neighbors. (3) To utilize unlabeled data more effectively, we propose Adaptive Input-consistency Regularization (AIR) from the perspective of label propagation. By collaborating with ALR, structural information can be propagated from the inner to the outlier sets, significantly improving clustering performance. (4) Empirical evidence demonstrates that the proposed method outperforms the state of the art on three domain adaptation benchmark datasets.

Table 2. Comparison with different bottleneck layers on Office-Home.
Methods Ar\rightarrowCl Ar\rightarrowPr Ar\rightarrowRw Cl\rightarrowAr Cl\rightarrowPr Cl\rightarrowRw Pr\rightarrowAr Pr\rightarrowCl Pr\rightarrowRw Rw\rightarrowAr Rw\rightarrowCl Rw\rightarrowPr Avg.
AaD (w/ Source Bottleneck Layer) 59.3 79.3 82.1 68.9 79.8 79.5 67.2 57.4 83.1 72.1 58.5 85.4 72.7
AaD (w/ Target Bottleneck Layer) 69.3 85.7 91.4 82.4 86.2 87.4 84.5 67.5 90.5 89.1 68.9 92.1 82.9

2. Related Work

Source-free Domain Adaptation (SFDA)

SFDA aims to adapt to unlabeled target domains using only the pre-trained source model. Existing approaches try to refine the solution of SFDA by pseudo-labeling (Liang et al., 2020; Qiu et al., 2021; Huang et al., 2021; Ding et al., 2022; Qu et al., 2022; Lee et al., 2022), generating transition domains (Li et al., 2020b, 2022; Kundu et al., 2021, 2022), or local consistency (Yang et al., 2021b, a, 2022). However, due to domain differences, pseudo-labels may contain noise and cause confirmation bias. Additionally, task-discriminative information and domain-related information are highly non-linearly entangled, so directly constructing an ideal generic domain from the source model may be difficult. Most closely related to our work is AaD (Yang et al., 2022), which introduced a simple and efficient optimization upper bound for feature clustering of unlabeled data, i.e., aggregating (scattering) similar (dissimilar) features in the feature space. However, AaD uses K-nearest neighbors directly, which suffer from source bias due to domain shift. In contrast to the above methods, we explore the idea of label propagation to assign regularization strategies that better suit the properties of the unlabeled data, achieving source-free model adaptation.

Label Propagation

Label propagation has been widely used in semi-supervised learning. (Douze et al., 2018) show that label propagation on large image sets outperforms state-of-the-art few-shot learning when few labels are available. (Iscen et al., 2019) employ a transductive label propagation method based on the manifold assumption to predict labels for the entire dataset. (Wei et al., 2021) introduce the "expansion" assumption to analyze label propagation and show learning guarantees for unsupervised and semi-supervised learning. (Cai et al., 2021) extend the expansion assumption to domain adaptation and propose a provably effective framework for domain adaptation based on label propagation. Considering label propagation for SFDA and leveraging the advantages of the expansion assumption, we design a novel dynamic clustering strategy for SFDA that propagates structural information from high-density regions to low-density regions.

Figure 2. Cosine similarity within the same class and across classes on Office-Home.

3. Preliminaries and Analysis

In this section, we consider the clustering performance of the target domain and the label propagation for SFDA. Correspondingly, we first introduce some notations and then perform an empirical and theoretical analysis to better understand the role of different learning strategies in CtO. In Section 3.2, we study an Oracle setup that beats the original AaD (Yang et al., 2022) by a large margin, confirming that the features from the source domain model are already rich in semantic information, which requires us to reduce the redundant information in the features. Finally, in Section 3.3, we claim that if the learning state of a model is superior, then the target sample has consistency with its neighbors (Claim 3.1). Furthermore, we present upper bounds on the target error in Theorem 3.2.

3.1. Preliminary

For source-free domain adaptation (SFDA), consider an unlabeled target dataset \mathcal{D}_{T}=\left\{x_{i}:x_{i}\in\mathcal{X}_{t}\right\}_{i=1}^{n_{t}} on the input space \mathcal{X}_{t}. The task is to adapt a well-trained source model to the target domain without source data, where the target domain shares the same |C| classes with the source domain. Following (Yang et al., 2021a, 2022), we use a feature extractor h:\mathcal{X}_{t}\rightarrow\mathcal{Z} and a classifier g_{c}:\mathcal{Z}\rightarrow\mathcal{C}. The output of the network is denoted as p(x)=\delta(g_{c}(h(x)))\in\mathbb{R}^{C}, where \delta is the softmax function. Specifically, we retrieve the nearest neighbors for each mini-batch of target features. Let \boldsymbol{F}\in\mathbb{R}^{n_{t}\times d} denote a memory bank that stores all target features and \boldsymbol{P}\in\mathbb{R}^{n_{t}\times C} denote the corresponding prediction scores in the memory bank, where d is the feature dimension of the last linear layer:

(1) \boldsymbol{F}=\left[\boldsymbol{z}_{1},\boldsymbol{z}_{2},\ldots,\boldsymbol{z}_{n_{t}}\right],\quad\boldsymbol{P}=\left[\boldsymbol{p}_{1},\boldsymbol{p}_{2},\ldots,\boldsymbol{p}_{n_{t}}\right],

where \boldsymbol{z}_{i} is L2-normalized and \boldsymbol{p}_{i} denotes the output softmax probability for \boldsymbol{z}_{i}.
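As a concrete illustration of the memory bank above, here is a minimal sketch assuming PyTorch; the class name FeatureBank and the update rule are illustrative rather than taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

class FeatureBank:
    """Memory bank holding L2-normalized features F (n_t x d) and softmax scores P (n_t x C)."""
    def __init__(self, n_t: int, d: int, num_classes: int):
        self.F = torch.zeros(n_t, d)
        self.P = torch.ones(n_t, num_classes) / num_classes

    @torch.no_grad()
    def update(self, idx: torch.Tensor, z: torch.Tensor, logits: torch.Tensor):
        self.F[idx] = F.normalize(z, dim=1)       # z_i is L2-normalized before storage
        self.P[idx] = F.softmax(logits, dim=1)    # p_i = delta(g_c(h(x_i)))

    @torch.no_grad()
    def nearest_neighbors(self, z: torch.Tensor, k: int):
        sim = F.normalize(z, dim=1) @ self.F.t()  # cosine similarity to all stored features
        return sim.topk(k + 1, dim=1)             # +1 because the query itself is in the bank
```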

3.2. Empirical analysis

Most clustering-based SFDA methods suffer from spurious clustering; under extreme domain shift, the problem worsens. To address this issue, we investigate the local consistency of feature representations on the source and target domain models. We carry out the experiments on Office-Home since it exhibits different degrees of domain shift, e.g., Rw vs. Pr and Pr vs. Cl. In this experiment, we study the feature properties of different layers in the model: (1) Backbone: the last layer of the backbone network, with 2048 feature dimensions; (2) Bottleneck: only the bottleneck layer in the source model is replaced, with 256 feature dimensions.

It is worth noting that most existing clustering-based methods are distance-based. The key idea is the smoothness assumption: the model should produce similar predictions for similar unlabeled data. Therefore, a good feature representation should have intra-class compactness and inter-class separability. Without loss of generality, we use cosine similarity as the metric. Unexpectedly, the same-class and across-class similarities of the source and target domain models are close at the Backbone, while a large difference appears at the Bottleneck (see Fig. 2). According to the light blue bars in Fig. 2, which visualize the across-class similarity at different network structures, the features from the bottleneck layer clearly have better inter-class separability. This means that adding a bottleneck layer to the model helps reduce redundant features, which improves discriminability and generalizability.
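As a reference for how the quantities in Fig. 2 could be measured, the following is a small sketch assuming PyTorch and a labeled feature set; the function name class_similarity is ours.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def class_similarity(feats: torch.Tensor, labels: torch.Tensor):
    """Mean cosine similarity within the same class and across different classes."""
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t()                              # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)    # same-class indicator matrix
    eye = torch.eye(len(labels), dtype=torch.bool)
    same_cls = sim[same & ~eye].mean()                   # same class, excluding self-pairs
    across_cls = sim[~same].mean()                       # different classes
    return same_cls.item(), across_cls.item()
```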

We further investigated the impact of the bottleneck layer on clustering-based SFDA approaches in another study. Table 2 shows the learning effect of AaD (Yang et al., 2022) with only the bottleneck layer replaced. Note that the bottleneck layer of the target model is only used for the analysis of this experiment. We observe that replacing the target domain bottleneck layer improves the AaD model dramatically, from 72.7% to 82.9%. This indicates that the high-dimensional features from the backbone network of the source model already contain rich semantic information, whereas the generalization of the features is more reflected in the filtering and permutation of that semantic information. Additionally, on the results of AaD (w/ Source Bottleneck Layer), there was a very strong correlation between prediction accuracy and the ratio of same-class similarity to across-class similarity, as indicated by a Spearman rank correlation of 0.92. This observation hints that we can use the correlation between similarity and test accuracy to improve the clustering effect.

We also evaluated the feature similarity of different metric functions. The different metric functions all point out that Backbone features are rich in semantic information (see Appendix C).

3.3. Theoretical analysis

Following the expansion assumption in (Wei et al., 2021; Cai et al., 2021), we first define the set of suitable input transformations \mathcal{B}(\cdot), which takes the general form \mathcal{B}(x)\triangleq\{x^{\prime}:\exists A\in\mathcal{A}\text{ such that }\|x^{\prime}-A(x)\|\leq r\} for a small radius r>0, where \mathcal{A} can be understood as a distance-based neighborhood or a set of data augmentations. Then, we define the neighborhood function \mathcal{N} as

(2) \mathcal{N}(x)=\left\{x^{\prime}\mid\mathcal{B}(x)\cap\mathcal{B}\left(x^{\prime}\right)\neq\emptyset\right\},

and the neighborhood of a set S\subset\mathcal{D}_{T} as

(3) \mathcal{N}(S)\triangleq\cup_{x\in S}\mathcal{N}(x).

The regularizer of G=g_{c}\circ h is defined as:

(4) R_{\mathcal{B}}(G)=\mathbb{E}_{\mathcal{D}_{T}}\left[\max_{\text{neighbor }x^{\prime}}\mathbf{1}\left(G(x)\neq G(x^{\prime})\right)\right]

Our setting for Source-free Domain Adaptation is formulated in the following assumption.

Assumption 3.1.

Let I_{i} and O_{i} denote the conditional distributions of the target distribution \mathcal{D}_{T} on the sets I and O, respectively. Assume there exists a constant \kappa\geq 1 such that the measures I_{i} and O_{i} are bounded by \kappa U_{i}. That is, for any S\subset\mathcal{X}_{t},

P_{I_{i}}(S)\leq\kappa P_{U_{i}}(S)\quad\text{and}\quad P_{O_{i}}(S)\leq\kappa P_{U_{i}}(S).

The expansion property on the target domain is defined as follows:

Definition 3.0 (Constant Expansion (Wei et al., 2021)).

We say that distribution Q satisfies (q,\xi)-constant expansion for some constant q,\xi\in(0,1), if for all S\subset Q satisfying \mathbb{P}_{Q}(S)\geq q, we have \mathbb{P}_{Q}[\mathcal{N}(S)\backslash S]\geq\min\left\{\xi,\mathbb{P}_{Q}[S]\right\}.

Based on the model's learning state, our CtO method divides the target data into the inner set (I) and the outlier set (O). The intuition is that different datasets and classes should determine their division thresholds based on the model's learning state, so that the division boundary is more reasonable. Specifically, we set a global threshold \rho to utilize unlabeled data at early training stages. As the adaptation progresses, we estimate the model's learning state \tau_{i} from its predictions to determine a dynamic local (class-specific) threshold. The following claim guarantees the consistency robustness of inner samples:

Claim 3.1.

Suppose G satisfies a Lipschitz condition; then there exists a global threshold \rho\in(0,1) and a scale of the model's learning status \tau_{i} such that the inner set I is consistency robust, i.e., R_{\mathcal{B}}(G)=0. More specifically,

r\leq(2\max\{\tau_{i}\}\rho-1)\frac{2}{L},\quad\forall i\in[\mathcal{C}].
Remark 1.

This claim illustrates that, when the global threshold \rho is fixed, the inner set can achieve a consistency error close to zero as long as the model's learning state is good enough. In clustering-based methods, r usually corresponds to the number of selectable nearest neighbors, which allows us to select more reliable neighbors for inner samples. Moreover, we do not need to set specific data division thresholds for different datasets or classes based on experience, making the method more general and realistic.

With the above preparation, we can investigate how to propagate label information from the ordered inner set to the chaotic outlier set. The following theorem establishes a bound on the target risk and indicates that, as long as there is minimal overlap between the inner and outlier sets, label information will propagate within the target subpopulations.

Theorem 3.2.

Suppose Assumption 3.1 and Claim 3.1 hold and I,O satisfy (q,\mu)-constant expansion. Then the expected error of model G is bounded:

\epsilon_{\mathcal{D}_{T}}(G)\leq 4\max(q,\mu)\kappa+\mu(1+\kappa).
Remark 2.

This theorem states that the target risk is bounded by the consistency regularization R_{\mathcal{B}} (equivalently, \mu). Our analysis explains why consistency regularization is important for SFDA methods: assuming the data satisfies expansion, it encourages representations to maintain important semantic structures by enforcing local consistency within the representation space. As a result, we can effectively capture the global structure of the target domain by considering all local neighborhoods together, which strongly enforces label propagation.

The proofs of the claim and the theorem are given in Appendix A.

4. Method

In this section, we introduce CtO, which aims to achieve efficient feature clustering from the perspective of label propagation.

There are two aspects involved in achieving CtO. First, how to divide the dataset reasonably? The ideal inner set should have well-clustered properties, so the effectiveness of the division boundary in distinguishing high- and low-density regions is crucial when choosing a division threshold. The model’s prediction probability can measure this. According to the previous analysis, by considering the model’s learning state, our proposed adaptive division threshold can better guarantee local consistency among inner samples.

Second, how can we customize learning strategies to achieve label propagation among target subpopulations? Inspired by self-supervised learning, we apply two regularization strategies - local consistency and input consistency - on the inner and outlier sets, respectively. Our theoretical analysis demonstrates that these strategies can extend low-density subsets to high-density ones when the subpopulations overlap, which leads to the clustering from chaos to order.

4.1. Dynamic Data Grouping

As analyzed before, we employ the model's learning state to adaptively divide the data in \mathcal{D}_{T} into the inner set I and the outlier set O. As argued in (Zhang et al., 2021), the learning effect of the model can be reflected by the class-level hit rate. Therefore, our principle is that the data division in CtO should be related to the prediction confidence of the unlabeled data on different classes, so as to reflect the class-level learning status. Namely, classes with fewer samples reaching a prediction-confidence threshold are considered to have difficulty in learning local structural information. Moreover, the threshold should increase steadily as the model improves during training. We set the global threshold as the exponential moving average (EMA) of the highest confidence at each training time step:

(5) \rho_{t}=\begin{cases}1/|C|,&\text{ if }t=0\\ \alpha\rho_{t-1}+(1-\alpha)\max(p),&\text{ otherwise }\end{cases}

where \alpha\in(0,1) is the momentum decay of the EMA and t denotes the t-th iteration. Combining these flexible thresholds, the learning effect of class c at time step t is defined as:

(6) \tau_{t}(c)=\sum_{n=1}^{N_{t}}\mathbf{1}\left(\max\left(p\right)>\rho_{t}\right)\cdot\mathbf{1}\left(\arg\max\left(p\right)=c\right).

Then we formulate the adaptive data division weights:

(7) \mathcal{T}_{t}(c)=\frac{1}{|C|}\left(1-\frac{\beta_{t}(c)}{\log\beta_{t}(c)}\right),\quad\text{where }\beta_{t}(c)=\frac{\tau_{t}(c)}{\max_{c}\tau_{t}}.

Finally, the samples are dynamically grouped into the outlier set in the t-th iteration:

(8) O^{t}=\left\{x_{i}\mid\max\left(p_{i}\right)\geq\mathcal{T}_{t}({\arg\max}\left(p_{i}\right)),\;x_{i}\in\mathcal{D}_{T}\right\},

and the inner samples are the remaining target data, i.e., I=\mathcal{D}_{T}\backslash O. In this way, we customize learning strategies for different data properties and connect the two sets through the expansion assumption.
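A minimal sketch of the dynamic data grouping in Eqs. 5-8, assuming PyTorch; the class name DynamicGrouper, the batch-wise update, and the clamping of \beta_{t}(c) away from 0 and 1 for numerical stability are our assumptions.

```python
import torch

class DynamicGrouper:
    """Tracks the EMA global threshold rho_t and class thresholds T_t(c) (Eqs. 5-8)."""
    def __init__(self, num_classes: int, alpha: float = 0.9):
        self.C = num_classes
        self.alpha = alpha
        self.rho = 1.0 / num_classes                        # Eq. 5: rho_0 = 1/|C|

    def update_and_split(self, probs: torch.Tensor):
        conf, cls = probs.max(dim=1)                        # max(p) and argmax(p) per sample
        # Eq. 5: EMA of the highest confidence at this time step
        self.rho = self.alpha * self.rho + (1 - self.alpha) * conf.max().item()
        # Eq. 6: per-class count of predictions exceeding the global threshold
        tau = torch.zeros(self.C)
        for c in range(self.C):
            tau[c] = ((conf > self.rho) & (cls == c)).sum()
        # Eq. 7: adaptive division weights (beta clamped to avoid log(0) and log(1))
        beta = (tau / tau.max().clamp(min=1.0)).clamp(1e-6, 1.0 - 1e-6)
        T = (1.0 - beta / torch.log(beta)) / self.C
        # Eq. 8: samples whose confidence reaches their class threshold form O^t, the rest form I
        outlier = conf >= T[cls]
        return ~outlier, outlier
```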

4.2. Label Propagation with Different Regularizations in SFDA

In the theoretical analysis (Section 3.3), we show that performing consistency regularization on the unlabeled target data can propagate semantic information across different subpopulations. However, it also has some limitations. First, the neighbors are not always correct: due to the domain shift, the model may incorrectly focus on the objects of some target features, leading to noisy labels. Here, the misalignment of neighbors forms spurious clusters instead of helping label propagation. Second, when dealing with low-density regions (i.e., outlier samples), applying the same local-consistency regularization further exacerbates the cross-label risk. To address these limitations, we flexibly customize learning strategies for different data properties with the help of dynamic data grouping.

Adaptive Local-consistency Regularization

In Adaptive Local-consistency Regularization (ALR), inspired by the fact that the target features from the source model have formed some semantic structures, we can capture the intra-class structure by local-consistency regularization. However, in the source-free domain adaptation problem, the features extracted by the pre-trained source model are typically influenced by the source bias. This may lead to neighbors containing incorrect semantic information. To mitigate incorrect alignment, we propose identifying the clustering weights of each sample.

As observed in Fig. 2, the same-class cosine similarity is generally higher than the across-class similarity. Building on this finding, we can measure neighbor affinity based on cosine similarity and then re-weight the neighbors to approximate the ground-truth structural information. Re-weighting with similarity-based adaptive weights promotes positive clustering while combating spurious clustering. The Adaptive Local-consistency Regularization is as follows:

(9) \mathcal{L}_{alr}=-\sum_{i}^{N_{I}}\sum_{j}^{K_{i}}\boldsymbol{w}_{ij}\boldsymbol{p}_{i}^{T}\boldsymbol{p}_{j}

where K_{i} denotes the K-nearest neighbor set of \boldsymbol{z}_{i}. The similarity weight \boldsymbol{w}_{ij} in Eq. 9 is the cosine similarity of \boldsymbol{z}_{i} to its neighbor \boldsymbol{z}_{j}, computed via the memory bank \boldsymbol{F}. Optimizing \mathcal{L}_{alr} improves the reliability of clustering, which stabilizes the intra-class structure. In addition, relaxing the ranking of samples in low-density regions helps reduce incorrect local semantic alignment.

Additionally, to improve separability between clusters, we employ the separation strategy proposed by (Yang et al., 2022) to disperse the prediction of potentially dissimilar features.

(10) \mathcal{L}_{sep}=\sum_{i}^{N_{I}}\sum_{m}^{N_{B_{i}}}\boldsymbol{p}_{i}^{T}\boldsymbol{p}_{m}

where B_{i} denotes the other features in the mini-batch except \boldsymbol{z}_{i}.
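The ALR and separation terms of Eqs. 9-10 can be sketched as follows, assuming PyTorch, the memory bank of Section 3.1, and neighbors retrieved by cosine similarity; the batch averaging and the helper name are our choices, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def alr_and_sep_losses(p_batch, z_batch, bank_F, bank_P, k=3):
    """Adaptive local-consistency loss (Eq. 9) and separation loss (Eq. 10) on inner samples."""
    z = F.normalize(z_batch, dim=1)
    sim = z @ bank_F.t()                           # cosine similarity to the memory bank
    w, idx = sim.topk(k + 1, dim=1)                # +1: the query itself is stored in the bank
    w, idx = w[:, 1:], idx[:, 1:]                  # drop the self-match
    p_neighbors = bank_P[idx]                      # (B, k, C) predictions of the K neighbours
    # Eq. 9: similarity-weighted agreement with neighbour predictions (attraction)
    loss_alr = -(w.unsqueeze(-1) * p_batch.unsqueeze(1) * p_neighbors).sum(dim=(1, 2)).mean()
    # Eq. 10: push apart predictions of the other features in the mini-batch (dispersion),
    # averaged per sample for scale (an implementation choice)
    dots = p_batch @ p_batch.t()
    loss_sep = (dots.sum() - dots.diagonal().sum()) / p_batch.size(0)
    return loss_alr, loss_sep
```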

Table 3. Accuracy (%) on Office-Home (ResNet-50).
Methods Source-free Ar\rightarrowCl Ar\rightarrowPr Ar\rightarrowRw Cl\rightarrowAr Cl\rightarrowPr Cl\rightarrowRw Pr\rightarrowAr Pr\rightarrowCl Pr\rightarrowRw Rw\rightarrowAr Rw\rightarrowCl Rw\rightarrowPr Avg.
ResNet-50 (He et al., 2016) 34.9 50.0 58.0 37.4 41.9 46.2 38.5 31.2 60.4 53.9 41.2 59.9 46.1
CDAN (Long et al., 2018) 50.7 70.6 76.0 57.6 70.0 70.0 57.4 50.9 77.3 70.9 56.7 81.6 65.8
MDD (Zhang et al., 2019) 54.9 73.7 77.8 60.0 71.4 71.8 61.2 53.6 78.1 72.5 60.2 82.3 68.1
SRDC (Tang et al., 2020) 52.3 76.3 81.0 69.5 76.2 78.0 68.7 53.8 81.7 76.3 57.1 85.0 71.3
FixBi (Na et al., 2021) 58.1 77.3 80.4 67.7 79.5 78.1 65.8 57.9 81.7 76.4 62.9 86.7 72.7
SHOT (Liang et al., 2020) 56.9 78.1 81.0 67.9 78.4 78.1 67.0 54.6 81.8 73.4 58.1 84.5 71.6
A^{2}Net (Xia et al., 2021) 58.4 79.0 82.4 67.5 79.3 78.9 68.0 56.2 82.9 74.1 60.5 85.0 72.8
NRC (Yang et al., 2021a) 57.7 80.3 82.0 68.1 79.8 78.6 65.3 56.4 83.0 71.0 58.6 85.6 72.2
SFDA-DE (Ding et al., 2022) 59.7 79.5 82.4 69.7 78.6 79.2 66.1 57.2 82.6 73.9 60.8 85.5 72.9
feat-mixup (Kundu et al., 2022) 61.8 81.2 83.0 68.5 80.6 79.4 67.8 61.5 85.1 73.7 64.1 86.5 74.5
AaD (Yang et al., 2022) 59.3 79.3 82.1 68.9 79.8 79.5 67.2 57.4 83.1 72.1 58.5 85.4 72.7
DaC (Zhang et al., 2022) 59.1 79.5 81.2 69.3 78.9 79.2 67.4 56.4 82.4 74.0 61.4 84.4 72.8
NRC+ELR (Yi et al., 2023) 58.4 78.7 81.5 69.2 79.5 79.3 66.3 58.0 82.6 73.4 59.8 85.1 72.6
SFUDA (Pei et al., 2023) 59.9 81.4 83.0 68.9 80.1 80.3 67.5 56.9 83.7 74.3 60.8 86.3 73.7
CtO 58.5 79.8 85.5 74.8 82.5 83.1 73.8 58.4 85.0 78.2 63.3 89.6 76.1
Table 4. Accuracy (%) on Office-31 (ResNet-50).
Methods Source-free A\rightarrowD A\rightarrowW D\rightarrowW W\rightarrowD D\rightarrowA W\rightarrowA Avg.
ResNet-50 (He et al., 2016) 68.9 68.4 96.7 99.3 62.5 60.7 76.1
CDAN (Long et al., 2018) 92.9 94.1 98.6 100.0 71.0 69.3 87.7
MDD (Zhang et al., 2019) 90.4 90.4 98.7 99.9 75.0 73.7 88.0
SRDC (Tang et al., 2020) 95.8 95.7 99.2 100.0 76.7 77.1 90.8
FixBi (Na et al., 2021) 95.0 96.1 99.3 100.0 78.7 79.4 91.4
SHOT (Liang et al., 2020) 93.1 90.9 98.8 99.9 74.5 74.8 88.7
A^{2}Net (Xia et al., 2021) 94.5 94.0 99.2 100.0 76.7 76.1 90.1
NRC (Yang et al., 2021a) 96.0 90.8 99.0 100.0 75.3 75.0 89.4
HCL (Huang et al., 2021) 94.7 92.5 98.2 100.0 75.9 77.7 89.8
SFDA-DE (Ding et al., 2022) 96.0 94.2 98.5 99.8 76.6 75.5 90.1
AaD (Yang et al., 2022) 96.4 92.1 99.1 100.0 75.0 76.5 89.9
feat-mixup (Kundu et al., 2022) 94.6 93.2 98.9 100.0 78.3 78.9 90.7
NRC+ELR (Yi et al., 2023) 93.8 93.3 98.0 100.0 76.2 76.9 89.6
SFUDA (Pei et al., 2023) 96.2 94.0 99.1 99.9 76.9 79.2 90.7
CtO 96.4 95.1 99.0 100.0 80.0 78.2 91.5

Adaptive Input-consistency Regularization

In Adaptive Input-consistency Regularization (AIR), we propagate the structural information from the inner set to the outlier set, as discussed in Remark 2. Since the outliers in low-density regions are far away from all other points, i.e., they have no nearest-neighbor support, we instead seek support from the outliers themselves. Specifically, we use a weakly augmented version of x_{i}, denoted \omega(x_{i}), to generate the pseudo-label \hat{p}_{i}=P(y\mid\omega(x_{i})) and enforce consistency with its strongly augmented version \Omega(x_{i}). To encourage the model to make diverse predictions, we combine this regularization with the aforementioned class-level confidence thresholds. The Adaptive Input-consistency Regularization is as follows:

(11) \mathcal{L}_{air}=\frac{1}{N_{O}}\sum_{i=1}^{N_{O}}\mathcal{H}(\hat{p}_{i},q_{i}),

where \mathcal{H}(\cdot,\cdot) refers to the cross-entropy loss, and q_{i}=P(y\mid\Omega(x_{i})) denotes the pseudo label of \Omega(x_{i}).
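A hedged sketch of the AIR loss in Eq. 11, assuming PyTorch; whether the pseudo-label from the weak view is used as a soft or hard target is not specified above, and the soft cross-entropy form below is our assumption.

```python
import torch
import torch.nn.functional as F

def air_loss(model, x_weak: torch.Tensor, x_strong: torch.Tensor) -> torch.Tensor:
    """Adaptive input-consistency loss (Eq. 11) on a batch of outlier samples."""
    with torch.no_grad():
        p_hat = F.softmax(model(x_weak), dim=1)        # pseudo-label p_hat_i = P(y | w(x_i))
    logits_strong = model(x_strong)                    # prediction for the strong view Omega(x_i)
    # H(p_hat_i, q_i): cross-entropy between the pseudo-label and the strong-view prediction
    return -(p_hat * F.log_softmax(logits_strong, dim=1)).sum(dim=1).mean()
```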

During training, the clustering processes on the inner and outlier sets facilitate each other. By implementing different regularization strategies, labels are propagated among subsets to enable under- or hard-to-learn samples to find suitable neighbors. The inclusion of these new members in clusters provides additional information for learning the intra-class structure, which adjusts the feature space and enhances the power of clustering.

4.3. Overall Objective

As described above, the overall objective of CtO can be summarized as follows:

(12) \mathcal{L}=\mathcal{L}_{alr}+\mathcal{L}_{air}+\lambda\mathcal{L}_{sep},

where \lambda is a trade-off parameter. With \mathcal{L}_{alr} and \mathcal{L}_{air}, CtO preserves local and input consistency, allowing label information to be propagated. The training process is described in Appendix B.
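Putting the pieces together, one adaptation step under Eq. 12 could look like the sketch below, reusing the illustrative helpers sketched earlier (FeatureBank, DynamicGrouper, alr_and_sep_losses, air_loss); the model.backbone/model.classifier split and the optimizer handling are assumptions, not the paper's code.

```python
import torch

def cto_step(model, grouper, bank, x, x_weak, x_strong, idx, lam, optimizer):
    """One CtO adaptation step: group the batch, then apply ALR/sep on I and AIR on O (Eq. 12)."""
    z = model.backbone(x)                             # features h(x)
    logits = model.classifier(z)                      # g_c(h(x))
    probs = torch.softmax(logits, dim=1)
    bank.update(idx, z.detach(), logits.detach())     # refresh memory bank entries

    inner, outlier = grouper.update_and_split(probs.detach())
    # In practice an empty subset should simply contribute zero loss.
    loss_alr, loss_sep = alr_and_sep_losses(probs[inner], z[inner], bank.F, bank.P)
    loss_air = air_loss(model, x_weak[outlier], x_strong[outlier])

    loss = loss_alr + loss_air + lam * loss_sep       # Eq. 12
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```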

Table 5. Accuracy (%) on VisDA (ResNet-101).
Methods Source-free plane bicycle bus car horse knife mcycl person plant sktbrd train truck Per-class
ResNet-101 (He et al., 2016) 55.1 53.3 61.9 59.1 80.6 17.9 79.7 31.2 81.0 26.5 73.5 8.5 52.4
CDAN (Long et al., 2018) 85.2 66.9 83.0 50.8 84.2 74.9 88.1 74.5 83.4 76.0 81.9 38.0 73.9
SAFN (Xu et al., 2019) 93.6 61.3 84.1 70.6 94.1 79.0 91.8 79.6 89.9 55.6 89.0 24.4 76.1
MCC (Jin et al., 2020) 88.7 80.3 80.5 71.5 90.1 93.2 85.0 71.6 89.4 73.8 85.0 36.9 78.8
FixBi (Na et al., 2021) 96.1 87.8 90.5 90.3 96.8 95.3 92.8 88.7 97.2 94.2 90.9 25.7 87.2
SHOT (Liang et al., 2020) 92.6 81.1 80.1 58.5 89.7 86.1 81.5 77.8 89.5 84.9 84.3 49.3 79.6
A^{2}Net (Xia et al., 2021) 94.0 87.8 85.6 66.8 93.7 95.1 85.8 81.2 91.6 88.2 86.5 56.0 84.3
NRC (Yang et al., 2021a) 96.8 91.3 82.4 62.4 96.2 95.9 86.1 80.6 94.8 94.1 90.4 59.7 85.9
HCL (Huang et al., 2021) 93.3 85.4 80.7 68.5 91.0 88.1 86.0 78.6 86.6 88.8 80.0 74.7 83.5
CPGA (Qiu et al., 2021) 94.8 83.6 79.7 65.1 92.5 94.7 90.1 82.4 88.8 88.0 88.9 60.1 84.1
SFDA-DE (Ding et al., 2022) 95.3 91.2 77.5 72.1 95.7 97.8 85.5 86.1 95.5 93.0 86.3 61.6 86.5
AaD (Yang et al., 2022) 97.4 90.5 80.8 76.2 97.3 96.1 89.8 82.9 95.5 93.0 92.0 64.7 88.0
DaC (Zhang et al., 2022) 96.6 86.8 86.4 78.4 96.4 96.2 93.6 83.8 96.8 95.1 89.6 50.0 87.3
CtO 98.2 91.0 86.4 78.0 97.6 98.8 91.8 84.8 96.6 94.7 93.7 53.3 88.7
Table 6. Ablation study on Office-31.
\mathcal{L}_{sep} \mathcal{L}_{lr} \mathcal{L}_{alr} \mathcal{L}_{air} A\rightarrowD A\rightarrowW D\rightarrowA W\rightarrowA Avg.
96.4 92.1 75.0 76.5 85.0
95.4 93.3 77.9 77.6 86.1
95.8 94.7 79.4 77.8 86.9
96.4 95.1 80.0 78.2 87.4

5. Experiments

5.1. Setup

Datasets

We conduct experiments on three public domain adaptation benchmarks. (i) Office-31 (Saenko et al., 2010) is a commonly used dataset for domain adaptation that consists of three domains: Amazon (A), Webcam (W), and DSLR (D), each containing 31 categories of items in an office environment. (ii) Office-Home (Venkateswara et al., 2017) is a standard domain adaptation dataset collected in office and home environments. It consists of four domains, Art (Ar), Clipart (Cl), Product (Pr), and RealWorld (Rw), each covering 65 object categories. (iii) VisDA (Peng et al., 2017) is one of the large benchmark datasets for domain adaptation. It contains 12 categories of images from two subsets: a synthetic image domain and a real image domain.

Implementation details

Following the standard protocol for SFDA, we use all labeled source data to obtain the pre-trained models. For Office-31 and Office-Home, the backbone network is ResNet-50 (He et al., 2016); for VisDA, it is ResNet-101. For a fair comparison, we use the same network structure as SHOT (Liang et al., 2020), NRC (Yang et al., 2021a), and AaD (Yang et al., 2022). All network parameters are updated by Stochastic Gradient Descent (SGD) with a momentum of 0.9, an initial learning rate of 0.001, and a weight decay of 0.005. The learning rate of the additional layer is 10 times smaller than that of the backbone. We follow G-SFDA (Yang et al., 2021b), NRC (Yang et al., 2021a), and AaD (Yang et al., 2022) for the number of nearest neighbors K: 3 for Office-31 and Office-Home, and 5 for VisDA. To ensure a fair comparison, we set the hyperparameter \lambda as in previous work (Yang et al., 2022), i.e., \lambda=\left(1+10\cdot\frac{iter}{max\_iter}\right)^{-\beta}, with \beta set to 0 on Office-Home, 2 on Office-31, and 5 on VisDA. The strong augmentation function used in our experiments is RandAugment (Cubuk et al., 2020).
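For reference, the decay schedule for \lambda can be written as a one-line helper (a sketch; the function name is ours):

```python
def lambda_schedule(iteration: int, max_iter: int, beta: float) -> float:
    """lambda = (1 + 10 * iter / max_iter) ** (-beta); beta = 0 keeps lambda fixed at 1."""
    return (1.0 + 10.0 * iteration / max_iter) ** (-beta)
```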

(a) Source-only (A\rightarrowW); (b) CtO (A\rightarrowW); (c) Source-only (D\rightarrowA); (d) CtO (D\rightarrowA); (e) Source-only (D\rightarrowA); (f) CtO (D\rightarrowA)
Figure 3. The t-SNE and Confusion Matrix visualization. (a-d): t-SNE visualization of the final prediction layer activation for source model and CtO, where red and blue points denote the source and target domains, respectively. Note that the source samples are only used to plot the t-SNE. (e) and (f): Confusion Matrix visualization for source model and CtO.

5.2. Results and Analysis

In this section, we present our results and compare them with other methods, summarized in Tables 3, 4, and 5, respectively. For a fair comparison, all baseline results were obtained from their original papers or follow-up work.

Comparison with state-of-the-art methods

For Office-31, as shown in Table 4, the proposed CtO yields state-of-the-art performance on 4 out of 6 tasks. Note that our CtO produces competitive results even when compared to source-present methods such as FixBi (Na et al., 2021) (91.5% vs. 91.4%). For Office-Home, Table 3 shows that the proposed CtO method achieves the best average classification accuracy (76.1%) and the highest results on 7 out of 12 tasks. As is well known, in clustering-based methods the clustering error increases with the number of object classes, so it is difficult for local consistency-based SFDA methods to accurately capture the target structure information. However, our CtO employs adaptive input-consistency regularization to efficiently utilize unlabeled data through label propagation, which is the primary reason for our success on Office-Home. Moreover, CtO beats several source-present DA methods, such as SRDC (Tang et al., 2020) and FixBi (Na et al., 2021), by a large margin, which means that even without access to the source data, our method can still exploit the target structure information to achieve better adaptation. Similar observations on VisDA can be found in Table 5. The reported results sufficiently demonstrate the superiority of our method.

Comparison with clustering-based Method

As discussed in related work, NRC (Yang et al., 2021a) uses reciprocal nearest neighbors to measure clustering affinity. On the hard tasks of Office-Home, our approach outperforms NRC by a considerable margin, especially on Pr\rightarrowAr (73.8% vs. 65.3%). This improvement indicates the importance of our adaptive input-consistency regularization for capturing intra-class structural information. Compared with AaD (Yang et al., 2022), our CtO improves the accuracy by 1.6% on Office-31 and by 3.1% on Office-Home, indicating that the co-training of the adaptive local-consistency regularizer and the adaptive input-consistency regularizer performs reliable label propagation through the subpopulations of unlabeled data. Moreover, on VisDA, CtO exhibits higher recognition accuracy than AaD for several confusing objects, which indicates that the adaptive input-consistency regularizer can enhance model discrimination by providing more comprehensive intra-class information.

Ablation Study

To evaluate the contribution of the different components of our work, we conduct ablation studies for CtO on Office-31. We investigate different combinations of the two parts: Adaptive Local-consistency Regularization (ALR) and Adaptive Input-consistency Regularization (AIR). Compared to our method, AaD (i.e., only \mathcal{L}_{sep} and \mathcal{L}_{lr} are used) can be regarded as the baseline. As shown in Table 6, each part of our method contributes to improving performance. It is not difficult to find that AIR contributes the most to the improvement in accuracy, with performance increasing from 85.0% to 86.9%, which shows the effectiveness of label propagation. ALR also improves the average performance by 1.1% compared to the base model, confirming that the distance-based re-weighting improves the quality of the neighbors. For easy transfer tasks, target features from pre-trained source models naturally have good clustering performance; in this case, ALR dominates the loss optimization, with AIR helping to improve model training for under-learned categories. When the target feature distribution is scattered, the model benefits from AIR to ensure its smoothness, while the expansion property amplifies it to global consistency within the same class, allowing the limited structural information captured by ALR to be propagated among subpopulations. According to the comparison of results, we conclude that employing a regularization strategy suitable for the data properties is important for capturing semantic information. Without local consistency, outlier samples are difficult to learn, which makes the model heavily dependent on the transferability of source knowledge. Similarly, removing input consistency makes it difficult to effectively facilitate the learning of global semantic information. Overall, CtO improves the baseline AaD by 2.4% on average, which shows that ALR and AIR are complementary.

Visualization

To demonstrate the superiority of our method, we show the t-SNE (Van der Maaten and Hinton, 2008) feature visualization and confusion matrices on Office-31 (see Fig. 3). From Fig. 3(a-d), we can observe that the clustering of the target features is more compact after adaptation by CtO. Fig. 3(b) and (d) illustrate that CtO can achieve good model adaptation whether the model is pre-trained on a large-scale or a small-scale source domain. When the source domain is knowledge-rich, as shown in Fig. 3(a), the target domain features already possess considerable semantics; in such cases, adaptive local-consistency regularization can effectively capture the intra-class structure. However, when significant domain differences exist (as shown in Fig. 3(c)), abundant target features are jumbled together, so the model has difficulty capturing the local structure. The flexible data division of our method thus customizes the learning strategy for different data properties, which facilitates the estimation of ground-truth structural information instead of only adjusting the neighbor weights as in NRC (Yang et al., 2021a). Benefiting from the adaptive input-consistency regularization, we can capture semantic structures with rich intra-class variations while dissimilar samples are naturally separated in the representation space. More importantly, as training iterates, outlier samples gradually join the clustering, and locally informed clusters propagate label information to these outliers. The comparison of Fig. 3(e) and (f) further demonstrates that our method increases prediction diversity by adaptively adjusting the training on under-learned or hard-to-learn samples (i.e., outliers).

6. Conclusions

In this paper, we propose a novel approach called Chaos to Order (CtO), which tries to achieve efficient feature clustering from the perspective of label propagation. CtO divides the target data into inner and outlier samples based on an adaptive threshold over the learning state, and applies a customized learning strategy to best fit the data properties. To mitigate source bias, on the one hand, considering the clustering affinity, we propose Adaptive Local-consistency Regularization (ALR) to reduce spurious clustering by re-weighting neighbors; on the other hand, Adaptive Input-consistency Regularization (AIR) is applied to outlier points to propagate structural information from high-density to low-density regions, thus achieving high accuracy with respect to the ground-truth labels. Moreover, this co-training process encourages positive clustering and combats spurious clustering. The experimental results on three popular benchmarks verify that our proposed model outperforms the state of the art on various SFDA tasks. For future work, we plan to extend our CtO method to source-free open-set and partial-set domain adaptation.

Acknowledgements.
This work was supported by the National Natural Science Foundation of China under Grant 61871186 and 61771322.

References

  • Cai et al. (2021) Tianle Cai, Ruiqi Gao, Jason D. Lee, and Qi Lei. 2021. A Theory of Label Propagation for Subpopulation Shift. In ICML (Proceedings of Machine Learning Research, Vol. 139). PMLR, 1170–1182.
  • Cubuk et al. (2020) Ekin Dogus Cubuk, Barret Zoph, Jonathon Shlens, and Quoc Le. 2020. RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. In NeurIPS.
  • Ding et al. (2022) Ning Ding, Yixing Xu, Yehui Tang, Chao Xu, Yunhe Wang, and Dacheng Tao. 2022. Source-Free Domain Adaptation via Distribution Estimation. In CVPR. IEEE, 7202–7212.
  • Douze et al. (2018) Matthijs Douze, Arthur Szlam, Bharath Hariharan, and Hervé Jégou. 2018. Low-Shot Learning With Large-Scale Diffusion. In CVPR. Computer Vision Foundation / IEEE Computer Society, 3349–3358.
  • Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor S. Lempitsky. 2016. Domain-Adversarial Training of Neural Networks. J. Mach. Learn. Res. 17 (2016), 59:1–59:35.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR. IEEE Computer Society, 770–778.
  • Huang et al. (2021) Jiaxing Huang, Dayan Guan, Aoran Xiao, and Shijian Lu. 2021. Model Adaptation: Historical Contrastive Learning for Unsupervised Domain Adaptation without Source Data. In NeurIPS. 3635–3649.
  • Iscen et al. (2019) Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. 2019. Label Propagation for Deep Semi-Supervised Learning. In CVPR. Computer Vision Foundation / IEEE, 5070–5079.
  • Jin et al. (2020) Ying Jin, Ximei Wang, Mingsheng Long, and Jianmin Wang. 2020. Minimum Class Confusion for Versatile Domain Adaptation. In ECCV (21) (Lecture Notes in Computer Science, Vol. 12366). Springer, 464–480.
  • Kang et al. (2019) Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G. Hauptmann. 2019. Contrastive Adaptation Network for Unsupervised Domain Adaptation. In CVPR. Computer Vision Foundation / IEEE, 4893–4902.
  • Kundu et al. (2022) Jogendra Nath Kundu, Akshay R. Kulkarni, Suvaansh Bhambri, Deepesh Mehta, Shreyas Anand Kulkarni, Varun Jampani, and Venkatesh Babu Radhakrishnan. 2022. Balancing Discriminability and Transferability for Source-Free Domain Adaptation. In ICML (Proceedings of Machine Learning Research, Vol. 162). PMLR, 11710–11728.
  • Kundu et al. (2021) Jogendra Nath Kundu, Akshay R. Kulkarni, Amit Singh, Varun Jampani, and R. Venkatesh Babu. 2021. Generalize then Adapt: Source-Free Domain Adaptive Semantic Segmentation. In ICCV. IEEE, 7026–7036.
  • Lee et al. (2022) Jonghyun Lee, Dahuin Jung, Junho Yim, and Sungroh Yoon. 2022. Confidence Score for Source-Free Unsupervised Domain Adaptation. In ICML (Proceedings of Machine Learning Research, Vol. 162). PMLR, 12365–12377.
  • Li et al. (2022) Jingjing Li, Zhekai Du, Lei Zhu, Zhengming Ding, Ke Lu, and Heng Tao Shen. 2022. Divergence-Agnostic Unsupervised Domain Adaptation by Adversarial Attacks. IEEE Trans. Pattern Anal. Mach. Intell. 44, 11 (2022), 8196–8211.
  • Li et al. (2020a) Rui Li, Wenming Cao, Si Wu, and Hau-San Wong. 2020a. Generating Target Image-Label Pairs for Unsupervised Domain Adaptation. IEEE Trans. Image Process. 29 (2020), 7997–8011.
  • Li et al. (2020b) Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, and Si Wu. 2020b. Model Adaptation: Unsupervised Domain Adaptation Without Source Data. In CVPR. Computer Vision Foundation / IEEE, 9638–9647.
  • Liang et al. (2020) Jian Liang, Dapeng Hu, and Jiashi Feng. 2020. Do We Really Need to Access the Source Data? Source Hypothesis Transfer for Unsupervised Domain Adaptation. In ICML (Proceedings of Machine Learning Research, Vol. 119). PMLR, 6028–6039.
  • Long et al. (2018) Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I. Jordan. 2018. Conditional Adversarial Domain Adaptation. In NeurIPS. 1647–1657.
  • Na et al. (2021) Jaemin Na, Heechul Jung, Hyung Jin Chang, and Wonjun Hwang. 2021. FixBi: Bridging Domain Spaces for Unsupervised Domain Adaptation. In CVPR. Computer Vision Foundation / IEEE, 1094–1103.
  • Pei et al. (2023) Jiangbo Pei, Zhuqing Jiang, Aidong Men, Liang Chen, Yang Liu, and Qingchao Chen. 2023. Uncertainty-Induced Transferability Representation for Source-Free Unsupervised Domain Adaptation. IEEE Trans. Image Process. 32 (2023), 2033–2048.
  • Peng et al. (2017) Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. 2017. VisDA: The Visual Domain Adaptation Challenge. CoRR abs/1710.06924 (2017).
  • Qiu et al. (2021) Zhen Qiu, Yifan Zhang, Hongbin Lin, Shuaicheng Niu, Yanxia Liu, Qing Du, and Mingkui Tan. 2021. Source-free Domain Adaptation via Avatar Prototype Generation and Adaptation. In IJCAI. ijcai.org, 2921–2927.
  • Qu et al. (2022) Sanqing Qu, Guang Chen, Jing Zhang, Zhijun Li, Wei He, and Dacheng Tao. 2022. BMD: A General Class-Balanced Multicentric Dynamic Prototype Strategy for Source-Free Domain Adaptation. In ECCV (34) (Lecture Notes in Computer Science, Vol. 13694). Springer, 165–182.
  • Saenko et al. (2010) Kate Saenko, Brian Kulis, et al. 2010. Adapting Visual Category Models to New Domains. In ECCV.
  • Tang et al. (2020) Hui Tang, Ke Chen, and Kui Jia. 2020. Unsupervised Domain Adaptation via Structurally Regularized Deep Clustering. In CVPR. Computer Vision Foundation / IEEE, 8722–8732.
  • Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
  • Venkateswara et al. (2017) Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. 2017. Deep Hashing Network for Unsupervised Domain Adaptation. In CVPR. IEEE Computer Society, 5385–5394.
  • Wei et al. (2021) Colin Wei, Kendrick Shen, Yining Chen, and Tengyu Ma. 2021. Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data. In ICLR. OpenReview.net.
  • Wu et al. (2020) Yuan Wu, Diana Inkpen, and Ahmed El-Roby. 2020. Dual Mixup Regularized Learning for Adversarial Domain Adaptation. In ECCV (29) (Lecture Notes in Computer Science, Vol. 12374). Springer, 540–555.
  • Xia et al. (2021) Haifeng Xia, Handong Zhao, and Zhengming Ding. 2021. Adaptive Adversarial Network for Source-free Domain Adaptation. In ICCV. IEEE, 8990–8999.
  • Xu et al. (2019) Ruijia Xu, Guanbin Li, Jihan Yang, and Liang Lin. 2019. Larger Norm More Transferable: An Adaptive Feature Norm Approach for Unsupervised Domain Adaptation. In ICCV. IEEE, 1426–1435.
  • Yang et al. (2021a) Shiqi Yang, Yaxing Wang, Joost van de Weijer, Luis Herranz, and Shangling Jui. 2021a. Exploiting the Intrinsic Neighborhood Structure for Source-free Domain Adaptation. In NeurIPS. 29393–29405.
  • Yang et al. (2021b) Shiqi Yang, Yaxing Wang, Joost van de Weijer, Luis Herranz, and Shangling Jui. 2021b. Generalized Source-free Domain Adaptation. In ICCV. IEEE, 8958–8967.
  • Yang et al. (2022) Shiqi Yang, Yaxing Wang, Kai Wang, Shangling Jui, et al. 2022. Attracting and dispersing: A simple approach for source-free domain adaptation. In Advances in Neural Information Processing Systems.
  • Yi et al. (2023) Li Yi, Gezheng Xu, Pengcheng Xu, Jiaqi Li, Ruizhi Pu, Charles Ling, A. Ian McLeod, and Boyu Wang. 2023. When Source-Free Domain Adaptation Meets Learning with Noisy Labels. In ICLR. OpenReview.net.
  • Zhang et al. (2021) Bowen Zhang, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura, and Takahiro Shinozaki. 2021. FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling. In NeurIPS. 18408–18419.
  • Zhang et al. (2019) Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael I. Jordan. 2019. Bridging Theory and Algorithm for Domain Adaptation. In ICML (Proceedings of Machine Learning Research, Vol. 97). PMLR, 7404–7413.
  • Zhang et al. (2022) Ziyi Zhang, Weikai Chen, Hui Cheng, Zhen Li, Siyuan Li, Liang Lin, and Guanbin Li. 2022. Divide and Contrast: Source-free Domain Adaptation via Adaptive Contrastive Learning. In Advances in Neural Information Processing Systems.
  • Zhong et al. (2021) Li Zhong, Zhen Fang, Feng Liu, Jie Lu, Bo Yuan, and Guangquan Zhang. 2021. How Does the Combined Risk Affect the Performance of Unsupervised Domain Adaptation Approaches?. In AAAI. AAAI Press, 11079–11087.

Appendix A Proof of Claim 3.1 and Theorem 3.2

A.1. Claim 3.1 and Proof

Claim 3.1. Suppose G satisfies a Lipschitz condition; then there exists a global threshold \rho\in(0,1) and a scale of the model's learning status \tau_{i} such that the inner set I is consistency robust, i.e., R_{\mathcal{B}}(G)=0. More specifically,

r\leq(2\max\{\tau_{i}\}\rho-1)\frac{2}{L},\quad\forall i\in[\mathcal{C}].

Proof

Let G(x)_{[k]} denote the predicted probability of the model on class k. Integrating the dynamic thresholds, there exists a constant \gamma such that G(x)_{[j]}\in[\tau_{j}\rho-\gamma,1-\tau_{i}\rho], where j\neq\arg\max(G(x)). Suppose I is defined by the model's learning state; denote

I\triangleq\{x:\max(G(x))\geq\tau_{i}\rho\;\text{s.t.}\;i=\arg\max(G(x))\}.

Suppose, for contradiction, that there exist x,x^{\prime}\in I with x^{\prime}\in\mathcal{B}(x)\cap I such that G(x)\neq G(x^{\prime}). Specifically, define

\arg\max G(x)=i,\quad\arg\max G(x^{\prime})=j,

where i\neq j. Along with the above inequality, we know that

|G(x)_{[i]}-G(x^{\prime})_{[j]}|\geq 2\tau_{i}\rho-\gamma-1.

By the definition of Lipschitz constant, we have:

(13) L\left\|x-x^{\prime}\right\|\geq\left\|G(x)-G\left(x^{\prime}\right)\right\|
\geq\left|G(x)_{[i]}+G(x)_{[j]}-G(x^{\prime})_{[j]}-G(x^{\prime})_{[i]}\right|
\geq\left|G(x)_{[i]}-G(x^{\prime})_{[j]}\right|+\left|G(x)_{[j]}-G(x^{\prime})_{[i]}\right|
\geq 4(\tau_{i}+\tau_{j})\rho-2-2\gamma
\geq 4(\tau_{i}+\tau_{j})\rho-2.

As a result,

L\left\|x-x^{\prime}\right\|\geq 4\max(\tau_{i})\rho-2,\quad\forall i\in[\mathcal{C}].

Since by the definition of x^{\prime}\in\mathcal{B}(x), we have \|x-x^{\prime}\|\leq r. Combining Eq. 13 and the Lipschitz constant L>0, we know that this forms a contradiction with L\left\|x-x^{\prime}\right\|\geq Lr. Thus, \forall x\in I, x^{\prime}\in\mathcal{B}(x)\cap I, the model predictions are consistent, i.e., R_{\mathcal{B}}(G)=0.

A.2. Proof Sketch for Theorem 3.2

Theorem 3.2. Suppose Assumption 3.1 and Claim 3.1 hold and I,O satisfy (q,\mu)-constant expansion. Then the expected error of model G is bounded:

ϵ𝒟T(G)4max(q,μ)κ+μ(1+κ).\epsilon_{\mathcal{D}_{T}}(G)\leq 4\max(q,\mu)\kappa+\mu(1+\kappa).
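For concreteness, a purely hypothetical instantiation (values chosen for exposition only, with $\kappa$ the constant from Assumption 3.1): if $q = 0.05$, $\mu = 0.02$, and $\kappa = 1.5$, the bound evaluates to
$$\epsilon_{\mathcal{D}_T}(G) \leq 4 \times 0.05 \times 1.5 + 0.02 \times (1 + 1.5) = 0.30 + 0.05 = 0.35.$$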

To prove Theorem 3.2, we introduce some concepts and notation following (Cai et al., 2021): (i) the robust set of $G$, $RS(G)$; (ii) the minority robust set of $G$ on $U$, $M$.

For a given model $G$, define the robust set to be the set on which $G$ is robust under input transformations:

$$RS(G) := \{x \mid G(x) = G(x'),\; \forall x' \in \mathcal{B}(x)\}.$$

Let $A_{ik} \triangleq RS(G) \cap U_i \cap \{x \mid G(x) = k\}$ for $i, k \in [\mathcal{C}]$, where $U_i$ denotes the conditional distribution of $U$ on class $i$. To define the minority robust set $M$ on $U$, we consider the majority class label of $G$:

$$y_i^{\text{Maj}} \triangleq \arg\max_{k \in [\mathcal{C}]} \mathbb{P}_U[A_{ik}].$$

Thus, we define

$$M \triangleq \bigcup_{i \in [\mathcal{C}]} \bigcup_{k \in [\mathcal{C}] \backslash \{y_i^{\text{Maj}}\}} A_{ik}$$

to be the minority robust set of $G$. In addition, let

$$\widetilde{M} \triangleq \bigcup_{i \in [\mathcal{C}]} \left(U_i \cap \{x \mid G(x) \neq y_i^{\text{Maj}}\}\right)$$

be the minority set of $G$.

By Lemma A.1 in (Cai et al., 2021), under $(q,\mu)$-constant expansion we have

$$\begin{aligned}
\mathbb{P}_U[M] &\leq 2\max(q,\mu), \\
\mathbb{P}_U[\widetilde{M}] &\leq 2\max(q,\mu) + \mu.
\end{aligned}$$
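As a quick sanity check using the same illustrative values as for Theorem 3.2 above ($q = 0.05$, $\mu = 0.02$), these bounds give
$$\mathbb{P}_U[M] \leq 2\max(0.05, 0.02) = 0.10, \qquad \mathbb{P}_U[\widetilde{M}] \leq 0.10 + 0.02 = 0.12.$$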
Lemma A.1 (Upper bound on the inner set $I$).

Suppose the conditions of Claim 3.1 hold; then

$$\epsilon_I(G) \leq \mathbb{P}_I[M] + R_{\mathcal{B}}(G).$$

Proof

Based on the definition of the minority robust set $M$, the misclassified inner samples satisfy $\{x : G(x) \neq G^*(x),\; x \in I\} \subseteq M \cup \{x : G(x) \neq G(x'),\; x \in I,\; \text{and}\; x' \in \mathcal{B}(x) \cap O\}$. Therefore, we can write:

(14)
$$\begin{aligned}
\epsilon_I(G) &= \mathbb{P}_I[G(x) \neq G^*(x)] \\
&= \mathbb{P}_I[M \cap (G(x) \neq G^*(x))] + \mathbb{P}_I[(G(x) \neq G(x')) \cap (G(x) \neq G^*(x))] \\
&\leq \mathbb{P}_I[M] + \mathbb{P}_I[\overline{RS(G)}] \\
&\leq \mathbb{P}_I[M] + R_{\mathcal{B}}(G).
\end{aligned}$$
Lemma A.2 (Upper bound on the outlier set $O$).

Let $O = \mathcal{D}_T \backslash I$; then

$$\epsilon_O(G) \leq \mathbb{P}_O[M] + \mathbb{P}_O[\widetilde{M}] + R_{\mathcal{B}}(G).$$

Proof

By the definition of the outlier set $O$, we note that $\{x : G(x) \neq G^*(x),\; \text{and}\; x \in O\} \subseteq M \cup \widetilde{M} \cup (\overline{RS(G)} \backslash \widetilde{M})$. Thus, we obtain

(15)
$$\epsilon_O(G) \leq \mathbb{P}_O[M] + \mathbb{P}_O[\widetilde{M}] + R_{\mathcal{B}}(G).$$

Based on the above results, we can now apply Lemma A.1 and Lemma A.2 to bound the target error $\epsilon_{\mathcal{D}_T}(G)$. Under the conditions of Theorem 3.2, we have:

(16)
$$\begin{aligned}
\epsilon_{\mathcal{D}_T}(G) &= \mathbb{P}_{\mathcal{D}_T}[I]\,\epsilon_I(G) + \mathbb{P}_{\mathcal{D}_T}[O]\,\epsilon_O(G) \\
&\leq \mathbb{P}_{\mathcal{D}_T}[I]\left(\mathbb{P}_I[M] + R_{\mathcal{B}}(G)\right) + \mathbb{P}_{\mathcal{D}_T}[O]\left(\mathbb{P}_O[M] + \mathbb{P}_O[\widetilde{M}] + R_{\mathcal{B}}(G)\right) \qquad \text{(Lemma A.1 and Lemma A.2)} \\
&\leq \mathbb{P}_{\mathcal{D}_T}[I]\left(\kappa\,\mathbb{P}_U[M] + R_{\mathcal{B}}(G)\right) + \mathbb{P}_{\mathcal{D}_T}[O]\left(\kappa\,\mathbb{P}_U[M] + \kappa\,\mathbb{P}_U[\widetilde{M}] + R_{\mathcal{B}}(G)\right) \qquad \text{(Assumption 3.1)} \\
&\leq \kappa\,\mathbb{P}_U[M] + R_{\mathcal{B}}(G) + \kappa\,\mathbb{P}_{\mathcal{D}_T}[O]\,\mathbb{P}_U[\widetilde{M}] \\
&\leq 4\max(q,\mu)\kappa + \mu(1+\kappa),
\end{aligned}$$

which completes the proof of Theorem 3.2.
| Layer | Distance | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg. |
| Layer 4 (source) | same class | 14.848 | 12.155 | 12.918 | 13.849 | 11.914 | 12.650 | 16.183 | 17.240 | 14.453 | 14.183 | 13.989 | 11.680 | 13.839 |
| Layer 4 (source) | across classes | 16.674 | 15.556 | 15.548 | 15.449 | 14.317 | 14.597 | 18.152 | 18.730 | 16.924 | 15.978 | 15.410 | 14.548 | 15.990 |
| Layer 4 (target) | same class | 12.800 | 12.961 | 12.275 | 14.183 | 12.961 | 12.275 | 14.183 | 12.800 | 12.275 | 14.183 | 12.800 | 12.961 | 13.055 |
| Layer 4 (target) | across classes | 14.422 | 16.174 | 14.750 | 16.396 | 16.174 | 14.750 | 16.396 | 14.422 | 14.750 | 16.396 | 14.422 | 16.174 | 15.435 |
| Bottleneck (source) | same class | 19.041 | 15.302 | 16.117 | 18.614 | 16.453 | 17.260 | 18.656 | 20.286 | 16.868 | 16.799 | 17.428 | 14.102 | 17.244 |
| Bottleneck (source) | across classes | 21.975 | 20.965 | 21.160 | 21.529 | 20.717 | 20.926 | 21.778 | 22.301 | 21.182 | 21.033 | 20.275 | 20.274 | 21.176 |
| Bottleneck (target) | same class | 19.813 | 13.601 | 14.845 | 16.137 | 13.223 | 14.362 | 18.703 | 22.772 | 16.631 | 16.223 | 18.673 | 12.934 | 16.493 |
| Bottleneck (target) | across classes | 24.252 | 20.307 | 22.235 | 20.172 | 18.489 | 19.838 | 23.969 | 26.605 | 23.761 | 21.410 | 22.489 | 19.170 | 21.891 |

Table 7. Euclidean distance within the same class and across classes in each task on Office-Home.

Appendix B Algorithm for CtO

Our method, as described in Algorithm 1, involves dynamic data grouping together with adaptive input- and local-consistency regularization. Using an adaptive threshold on the learning state, we dynamically divide the target data into inner and outlier samples, which allows us to apply a different learning strategy to each subset; a minimal illustrative sketch of this split is given below.
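The following NumPy sketch illustrates only the inner/outlier partition step, under stated assumptions: the per-class scales tau and the global threshold rho are taken as given, and the exact update rules (Eqs. 5, 6 and 8 in the main text) are not reproduced here. Names and shapes are placeholders, not the authors' implementation.

```python
import numpy as np

def split_inner_outlier(probs: np.ndarray, tau: np.ndarray, rho: float):
    """Partition target samples into inner / outlier sets.

    probs: (N, C) softmax outputs of the adapted model on target samples.
    tau:   (C,) per-class learning-status scales.
    rho:   global threshold in (0, 1).
    A sample is 'inner' when its top predicted probability reaches the
    adaptive threshold tau[argmax] * rho; otherwise it is an outlier.
    """
    pred = probs.argmax(axis=1)                      # i = argmax G(x)
    conf = probs[np.arange(len(probs)), pred]        # max G(x)
    inner_mask = conf >= tau[pred] * rho
    return inner_mask, ~inner_mask

# Toy usage with random probabilities (illustration only).
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
inner, outlier = split_inner_outlier(probs, tau=np.full(5, 0.8), rho=0.5)
```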

Algorithm 1 Training Algorithm of CtO
Input: Unlabeled target data $\mathcal{X}_t$, number of classes $C$, a pre-trained source model $G = g_c \circ h$, global threshold $\rho$.
Initialization: Build the feature bank $\boldsymbol{F}$ and score bank $\boldsymbol{P}$ by forward computation; initialize the global threshold to $1/C$.
1:  while $t < \text{MaxIterations}$ do
2:     Update $\boldsymbol{F}$ and $\boldsymbol{P}$ for the current mini-batch $B$;
3:     Update the global threshold based on Eq. 5;
4:     Compute the per-class learning effect based on Eq. 6;
5:     Update the inner and outlier samples based on Eq. 8;
6:     Compute $\mathcal{L}_{alr}$ on inner samples by Eq. 9;
7:     Compute $\mathcal{L}_{sep}$ on all samples by Eq. 10;
8:     Compute $\mathcal{L}_{air}$ on outlier samples by Eq. 11;
9:     Compute the overall loss and update the model $G$.
10: end while
Output: The adapted model $G$.
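To make the bookkeeping above concrete, here is a minimal, self-contained NumPy sketch of the loop structure in Algorithm 1. It is a sketch under explicit assumptions, not the authors' implementation: the "model" is a random linear map, the threshold and learning-status updates only mimic the roles of Eqs. 5, 6 and 8, entropy stands in for the losses of Eqs. 9-11, and the parameter update is a placeholder because plain NumPy has no autograd.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 64, 16, 5                                      # samples, feature dim, classes
X_t = rng.normal(size=(N, D)).astype(np.float32)         # unlabeled target data
W = 0.1 * rng.normal(size=(D, C)).astype(np.float32)     # stand-in for the source model G
F = np.zeros((N, D), dtype=np.float32)                   # feature bank
P = np.full((N, C), 1.0 / C, dtype=np.float32)           # score bank
tau = np.ones(C)                                         # per-class learning status
rho = 1.0 / C                                            # global threshold, initialized to 1/C

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def entropy(p):
    """Generic stand-in for L_alr / L_sep / L_air (Eqs. 9-11)."""
    return float(-(p * np.log(p + 1e-8)).sum(axis=1).mean()) if len(p) else 0.0

for step in range(3):                                    # while t < MaxIterations
    idx = rng.choice(N, size=16, replace=False)          # mini-batch B
    feats = X_t[idx]
    probs = softmax(feats @ W)
    F[idx], P[idx] = feats, probs                        # line 2: refresh the banks
    rho = 0.9 * rho + 0.1 * probs.max(axis=1).mean()     # line 3: mimic Eq. 5
    pred = probs.argmax(axis=1)
    for c in range(C):                                   # line 4: mimic Eq. 6
        tau[c] = max((pred == c).mean(), 1e-3)
    inner = probs.max(axis=1) >= (tau / tau.max())[pred] * rho          # line 5: mimic Eq. 8
    loss = entropy(probs[inner]) + entropy(probs) + entropy(probs[~inner])  # lines 6-8
    W -= 0.01 * rng.normal(size=W.shape)                 # line 9: placeholder parameter update
```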

Appendix C Analysis of different similarity measures

To examine the effect of different metric functions on target feature clustering, we compare cosine similarity and Euclidean distance as feature-similarity measures on Office-Home. Fig. 4 shows the average Euclidean distance for all tasks on Office-Home, where a smaller value indicates higher similarity. The Euclidean distance also indicates the presence of rich semantic information in the high-dimensional features of the source model's backbone network. However, compared to Fig. 2, the differences in similarity based on Euclidean distance are insignificant. The Euclidean distance reflects absolute differences in values, while the cosine distance reflects relative differences in direction. Cosine similarity therefore keeps a fixed interpretation in high-dimensional space: 1 for identical directions, 0 for orthogonal, and -1 for opposite. Euclidean distance, in contrast, is influenced by the feature dimensionality and scale, and its range is unbounded. Particularly under distribution shift, sample-wise fluctuations have large variance, which leads to poor performance of the Euclidean distance.
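This scale sensitivity can be illustrated with a tiny synthetic example (random vectors, not the paper's features): rescaling one vector leaves the cosine similarity essentially unchanged but inflates the Euclidean distance.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=512)
b = a + 0.1 * rng.normal(size=512)            # nearly the same direction as a

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean(u, v):
    return float(np.linalg.norm(u - v))

print(cosine(a, b), euclidean(a, b))          # cosine close to 1, small distance
print(cosine(a, 5 * b), euclidean(a, 5 * b))  # cosine unchanged, distance much larger
```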

We also report feature similarities among samples within the same class and across classes for each transfer task. As shown in Table 7, the gap between the within-class and across-class Euclidean distances becomes unclear when the distributions differ significantly (e.g., the Ar→Cl, Pr→Cl, and Rw→Cl tasks). Such weak inter-category discrimination exacerbates spurious clustering, which biases the adaptation process.
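For reference, the following sketch shows one straightforward way to compute such within-class and across-class distance statistics from a feature matrix and labels. This is an assumed procedure for illustration only, not necessarily the exact protocol behind Table 7.

```python
import numpy as np

def class_distance_stats(features: np.ndarray, labels: np.ndarray):
    """Mean pairwise Euclidean distance within the same class and across classes."""
    diffs = features[:, None, :] - features[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)            # (N, N) pairwise distances
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)      # exclude self-distances
    within = dist[same & off_diag].mean()
    across = dist[~same].mean()
    return within, across

# Toy usage with random features and labels (illustration only).
rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 64))
labels = rng.integers(0, 5, size=50)
print(class_distance_stats(feats, labels))
```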

From these experiments, we observe two things: 1) the source model already contains sufficient inductive bias; and 2) under domain shift, the choice of similarity metric is strongly correlated with the quality of target feature clustering. A possible direction for future work is to study how different metrics affect target clustering performance.

Figure 4. Histogram of the Euclidean distance within the same class and across classes on Office-Home.