Chaos to Order: A Label Propagation Perspective on Source-Free Domain Adaptation
Abstract.
Source-free domain adaptation (SFDA), where only a pre-trained source model is used to adapt to the target distribution, is a more general approach to achieving domain adaptation in the real world. However, it can be challenging to capture the inherent structure of the target features accurately due to the lack of supervised information on the target domain. By analyzing the clustering performance of the target features, we show that they still contain core features related to discriminative attributes but lack the collation of semantic information. Inspired by this insight, we present Chaos to Order (CtO), a novel approach for SFDA that strives to constrain semantic credibility and propagate label information among target subpopulations. CtO divides the target data into inner and outlier samples based on an adaptive threshold over the learning state, customizing the learning strategy to best fit the data properties. Specifically, inner samples are utilized for learning intra-class structure thanks to their relatively well-clustered properties. The low-density outlier samples are regularized by input consistency to achieve high accuracy with respect to the ground-truth labels. By employing different learning strategies to propagate labels from the inner samples to outlier instances, CtO clusters the global samples from chaos to order. We further adaptively regulate the neighborhood affinity of the inner samples to constrain local semantic credibility. In theoretical and empirical analyses, we demonstrate that our algorithm not only propagates label information from inner to outlier samples but also prevents local clustering from forming spurious clusters. Empirical evidence demonstrates that CtO outperforms the state of the art on three public benchmarks: Office-31, Office-Home, and VisDA.
1. Introduction

The excellent performance of deep learning relies heavily on a large amount of high-quality labeled data. Obtaining large amounts of manually labeled data for specific learning tasks is often time-consuming and expensive, making these tasks challenging to implement in practical applications. To alleviate this dependency, Unsupervised Domain Adaptation (UDA) has been developed to improve performance in the unlabeled target domain by exploiting the labeled source domain. Two popular practices for modern UDA design are learning domain-invariant features (Ganin et al., 2016; Long et al., 2018; Kang et al., 2019; Tang et al., 2020) and generating dummy samples to match the target domain distribution (Wu et al., 2020; Li et al., 2020a; Zhong et al., 2021; Na et al., 2021). However, due to data privacy and security issues, the source training data required by most existing UDA methods is usually unavailable in real-world applications. In response, Source-Free Domain Adaptation (SFDA) emerged, which attempts to adapt a trained source model to the target domain without using any source data.
Due to the lack of source data, it is impossible to estimate source-target domain differences. Existing theoretical work usually provides learning guarantees on the target domain by further assuming that the source domain covers the support of the target domain. In the seminal work of (Yang et al., 2021a), the authors point out that the target features produced by the source model have already formed some semantic structures. Inspired by this intuition, we can preserve the important clustering structure in the target domain by matching similar features in the high-dimensional space. However, the nearest-neighbor consistency of points in high-dimensional space may be wrong, for example when forcing the local consistency of points in low-density regions. As shown in Table 1, when the source and target domains differ significantly (e.g., Pr→Cl and Rw→Cl), numerous features gather in low-density regions, with only about one-third of the neighbors having the correct labels.
Along with such a question, we propose Chaos to Order (CtO) (Fig. 1), an effective method to achieve more robust clustering of unlabeled data from the perspective of label propagation. To achieve flexible adaptation for different data properties and exploit the target domain structure information, our work introduces a novel data division strategy and then designs different regularization strategies to achieve label propagation.
Firstly, our approach treats the mining of the target domain's intrinsic structure information as a clustering problem. Although existing local-consistency-based methods aim to preserve the local structure, Table 1 illustrates why neighbors are unreliable: in distance-based neighbor discrimination, neighbors are similar points in a high-dimensional space, and since the points in low-density regions are scattered far apart, the label information in the k-nearest neighbors is not consistent for such points. In CtO, we utilize the model's learning state to dynamically divide the target data into inner and outlier sets. The intrinsic reason is that a sample can be considered an inner sample if it obtains high predictive values from the classifier; otherwise, it is an outlier. We regularize the input consistency of outliers and encourage local consistency for the inner samples, which effectively improves the mining of intrinsic structural information.
Secondly, we assume a minimum overlap between the subpopulations of the inner and outlier sets, and extend the subset using the simple but realistic expansion assumption of (Wei et al., 2021). For the inner set, the local-consistency regularizer connects similar points in the high-dimensional space, allowing SFDA training to proceed stably. Motivating experiments on Office-Home show that: (1) the pre-trained source model can extract rich semantic information from the target data; (2) what is lacking in domain adaptation is the filtering and permutation of high-dimensional semantic information. Due to the lack of supervised information, we preserve the core features associated with the discriminative attributes by enforcing the local consistency of points in the latent space. Moreover, we propose a re-weighted clustering strategy called Adaptive Local-Consistency Regularization (ALR), which explicitly constrains local semantic credibility to filter out spurious clustering information. To advance further along this line, we propose Adaptive Input-Consistency Regularization (AIR) for the outlier set. Generally, requiring the model to be invariant to input perturbations can improve generalizability. Furthermore, as (Wei et al., 2021) discussed, a low-probability subset of data can be extended to a neighborhood with a large probability relative to that subset. We show that label information can be propagated among subpopulations by minimizing the consistency regularization term on unlabeled data. In Theorem 3.2, we give an upper bound on the task risk of the target model. As a result, by customizing the learning strategy for different data properties, CtO can propagate structural information from the inner to the outlier set while enhancing the clustering of the inner set.
k | Ar→Cl | Ar→Pr | Cl→Ar | Pr→Cl | Pr→Rw | Rw→Cl
---|---|---|---|---|---|---
1 | 42.0 | 66.2 | 47.3 | 33.6 | 70.0 | 41.2
2 | 36.8 | 62.7 | 40.7 | 28.6 | 66.1 | 36.9
3 | 33.8 | 59.6 | 37.4 | 24.7 | 63.0 | 33.1
4 | 30.4 | 57.1 | 34.3 | 22.0 | 60.4 | 30.7
5 | 28.5 | 55.1 | 31.2 | 20.0 | 58.2 | 28.0
6 | 26.8 | 53.0 | 29.1 | 18.1 | 56.4 | 26.3
7 | 25.2 | 51.6 | 27.6 | 16.7 | 54.9 | 24.3
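The neighbor statistics in Table 1 can be reproduced with a short diagnostic. The sketch below, written in PyTorch, assumes pre-extracted target features and ground-truth labels (the names `feats` and `labels` are ours) and reports the fraction of each sample's k nearest neighbors, measured by cosine similarity, that share the sample's label.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_label_accuracy(feats: torch.Tensor, labels: torch.Tensor, k: int = 3) -> float:
    """Fraction of k-nearest neighbors (by cosine similarity) sharing the query's label."""
    z = F.normalize(feats, dim=1)                 # L2-normalize features
    sim = z @ z.t()                               # pairwise cosine similarity
    sim.fill_diagonal_(-float("inf"))             # exclude the sample itself
    _, idx = sim.topk(k, dim=1)                   # indices of the k nearest neighbors
    neighbor_labels = labels[idx]                 # shape (N, k)
    hits = (neighbor_labels == labels.unsqueeze(1)).float()
    return hits.mean().item()

# Example with random placeholder data in place of real target features.
feats = torch.randn(512, 256)
labels = torch.randint(0, 65, (512,))
print(knn_label_accuracy(feats, labels, k=3))
```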
In summary, the contributions of this paper are as follows: (1) We introduce CtO, a dynamic clustering approach for SFDA. Such an approach customizes the learning strategy for data subsets by using dynamic data splits, allowing label information to propagate among subpopulations. (2) To combat spurious clustering, we propose a novel Adaptive Local-consistency Regularization (ALR) strategy that estimates ground-truth structural information by re-weighting the neighbors. (3) To utilize unlabeled data more effectively, we propose Adaptive Input-consistency Regularization (AIR) from the perspective of label propagation. By collaborating with ALR, structural information can be propagated from the inner to the outlier set, significantly improving clustering performance. (4) Empirical evidence demonstrates that the proposed method outperforms the state of the art on three domain adaptation benchmark datasets.
Methods | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AaD (w/ Source Bottleneck Layer) | 59.3 | 79.3 | 82.1 | 68.9 | 79.8 | 79.5 | 67.2 | 57.4 | 83.1 | 72.1 | 58.5 | 85.4 | 72.7 |
AaD (w/ Target Bottleneck Layer) | 69.3 | 85.7 | 91.4 | 82.4 | 86.2 | 87.4 | 84.5 | 67.5 | 90.5 | 89.1 | 68.9 | 92.1 | 82.9 |
2. Related Work
Source-free Domain Adaptation (SFDA)
SFDA aims to adapt to unlabeled target domains using only the pre-trained source model. Existing approaches try to refine the solution of SFDA by pseudo-labeling (Liang et al., 2020; Qiu et al., 2021; Huang et al., 2021; Ding et al., 2022; Qu et al., 2022; Lee et al., 2022), generating transition domains (Li et al., 2020b, 2022; Kundu et al., 2021, 2022), or local consistency (Yang et al., 2021b, a, 2022). However, due to the domain differences, pseudo-labels that may contain noise cause confirmation bias. Additionally, task-discriminative information and domain-related information are highly non-linearly entangled, so directly constructing an ideal generic domain from the source model may be difficult. Most closely related to our work is AaD (Yang et al., 2022), which introduced a simple and efficient optimization upper bound for feature clustering of unlabeled data, i.e., aggregating (scattering) similar (dissimilar) features in the feature space. However, AaD uses k-nearest neighbors directly, which suffer from source bias due to domain shift. In contrast to the above methods, we explore the idea of label propagation to assign regularization strategies that are better suited to the data properties of the unlabeled data, thereby achieving source-free model adaptation.
Label Propagation
Label propagation has been widely used in semi-supervised learning. (Douze et al., 2018) show that label propagation on large image sets outperforms state-of-the-art few-shot learning when few labels are available. (Iscen et al., 2019) employ a transductive label propagation method based on the manifold assumption to predict labels for the entire dataset. (Wei et al., 2021) introduce the "expansion" assumption to analyze label propagation and show learning guarantees for unsupervised and semi-supervised learning. (Cai et al., 2021) extend the expansion assumption to domain adaptation and propose a provably effective framework for domain adaptation based on label propagation. Considering label propagation for SFDA and leveraging the advantages of the expansion assumption, we design a novel, dynamic clustering strategy for SFDA that propagates structural information from high-density regions to low-density regions.

3. Preliminaries and Analysis
In this section, we consider the clustering performance of the target domain and the label propagation for SFDA. Correspondingly, we first introduce some notations and then perform an empirical and theoretical analysis to better understand the role of different learning strategies in CtO. In Section 3.2, we study an Oracle setup that beats the original AaD (Yang et al., 2022) by a large margin, confirming that the features from the source domain model are already rich in semantic information, which requires us to reduce the redundant information in the features. Finally, in Section 3.3, we claim that if the learning state of a model is superior, then the target sample has consistency with its neighbors (Claim 3.1). Furthermore, we present upper bounds on the target error in Theorem 3.2.
3.1. Preliminary
For source-free domain adaptation (SFDA), consider an unlabeled target dataset $\mathcal{D}_t=\{x_i\}_{i=1}^{n_t}$ on the input space $\mathcal{X}$. The task is to adapt a well-trained source model to the target domain without source data, where the target domain shares the same classes as the source domain. Following (Yang et al., 2021a, 2022), the model consists of a feature extractor $f$ and a classifier $g$. The output of the network is denoted as $p(x)=\sigma(g(f(x)))$, where $\sigma$ is the softmax function. Specifically, we retrieve the nearest neighbors for each mini-batch of target features. Let $\mathcal{F}=[z_1,\dots,z_{n_t}]\in\mathbb{R}^{n_t\times d}$ denote a memory bank that stores all target features and $\mathcal{P}=[p_1,\dots,p_{n_t}]$ denote the corresponding prediction scores in the memory bank, where $d$ is the feature dimension in the last linear layer:

(1)  $z_i = f(x_i)/\lVert f(x_i)\rVert_2, \qquad p_i = \sigma\big(g(f(x_i))\big),$

where $z_i$ is the L2-normalized feature and $p_i$ denotes the output softmax probability for $x_i$.
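For concreteness, a minimal sketch of this memory bank and the neighbor retrieval is given below; the class and method names (`FeatureBank`, `update`, `knn`) are ours, and the details are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

class FeatureBank:
    """Memory bank of L2-normalized target features and softmax predictions."""

    def __init__(self, num_samples: int, feat_dim: int, num_classes: int):
        # Placeholder initialization; in practice filled by an initial forward pass.
        self.feats = F.normalize(torch.randn(num_samples, feat_dim), dim=1)
        self.probs = torch.full((num_samples, num_classes), 1.0 / num_classes)

    @torch.no_grad()
    def update(self, idx: torch.Tensor, z: torch.Tensor, p: torch.Tensor):
        # Overwrite the entries of the current mini-batch.
        self.feats[idx] = F.normalize(z, dim=1)
        self.probs[idx] = p

    @torch.no_grad()
    def knn(self, z: torch.Tensor, k: int = 3):
        # Cosine similarity of the batch against the whole bank.
        sim = F.normalize(z, dim=1) @ self.feats.t()
        sim_k, nn_idx = sim.topk(k + 1, dim=1)   # +1: the nearest hit is the sample itself
        return sim_k[:, 1:], nn_idx[:, 1:]
```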
3.2. Empirical analysis
Most clustering-based SFDA methods suffer from spurious clustering, and the problem worsens under extreme domain shifts. To investigate this issue, we examine the local consistency of feature representations of the source and target domain models. We carry out the experiments on Office-Home since it exhibits different degrees of domain shift, e.g., Rw vs. Pr and Pr vs. Cl. In this experiment, we study the feature properties at different layers of the model: (1) Backbone: the last layer of the backbone network, with 2048 feature dimensions; (2) Bottleneck: the bottleneck layer, with 256 feature dimensions, which we optionally replace in the source model.
It is worth noting that most existing clustering-based methods are distance-based. The key idea is the smoothness assumption that the model should produce similar predictions for similar unlabeled data. Therefore, a good feature representation should have intra-class compactness and inter-class separability. Without loss of generality, we use cosine similarity as the metric. Surprisingly, the same-class and across-class similarities of the source model and the target model are close at the Backbone, while a large difference appears at the Bottleneck (see Fig. 2). According to the light blue bars in Fig. 2, which visualize the across-class similarity of samples at different network structures, we can easily observe that the features from the bottleneck layer have better inter-class separability. This means that adding a bottleneck layer to the model helps reduce redundant features, which improves discriminability and generalizability.
We further investigate the impact of the bottleneck layer on clustering-based SFDA approaches in another study. Table 2 shows the performance of AaD (Yang et al., 2022) when only the bottleneck layer is replaced. Note that the bottleneck layer of the target model is used only for this analysis. We observe that replacing the source bottleneck layer with the target one improves AaD dramatically, from 72.7% to 82.9%. This indicates that the high-dimensional features from the backbone network of the source model already contain rich semantic information, whereas the generalization of the features is more reflected in the filtering and permutation of that semantic information. Additionally, on the results of AaD (w/ Source Bottleneck Layer), there is a very strong correlation between prediction accuracy and the ratio of same-class similarity to across-class similarity, as indicated by a Spearman rank correlation of 0.92. This observation suggests that we can use the correlation between similarity and test accuracy to improve the clustering effect.
We also evaluate the feature similarity under different metric functions. All of them indicate that the Backbone features are rich in semantic information (see Appendix C).
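For reference, the same-class and across-class similarity statistics discussed in this section can be computed with the following sketch; `feats` and `labels` are hypothetical names for features extracted at a given layer (Backbone or Bottleneck) and the corresponding ground-truth labels.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def class_similarity_stats(feats: torch.Tensor, labels: torch.Tensor):
    """Mean cosine similarity within the same class and across different classes."""
    z = F.normalize(feats, dim=1)
    sim = z @ z.t()
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # (N, N) same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool)
    same_mean = sim[same & ~eye].mean().item()          # exclude self-pairs
    across_mean = sim[~same].mean().item()
    return same_mean, across_mean
```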
3.3. Theoretical analysis
Following the expansion assumption in (Wei et al., 2021; Cai et al., 2021), we first define the set of suitable input transformations of an input $x$ as $\mathcal{B}(x) = \{x' : \exists\, T \in \mathcal{T} \text{ s.t. } \lVert x' - T(x)\rVert \le r\}$ for a small radius $r$, where $\mathcal{B}(x)$ can be understood as a distance-based neighborhood or a set of data augmentations of $x$. Then, we define the neighborhood function as

(2)  $\mathcal{N}(x) = \{x' : \mathcal{B}(x) \cap \mathcal{B}(x') \neq \emptyset\},$

and the neighborhood of a set $V \subseteq \mathcal{X}$ as

(3)  $\mathcal{N}(V) = \bigcup_{x \in V} \mathcal{N}(x).$

The consistency regularizer of the model $p$ is defined as:

(4)  $R_{\mathcal{B}}(p) = \mathbb{E}_{x \sim P}\Big[\mathbf{1}\big(\exists\, x' \in \mathcal{B}(x):\ \arg\max_c p(c \mid x') \neq \arg\max_c p(c \mid x)\big)\Big],$

where $P$ denotes the target distribution.
Our setting for Source-free Domain Adaptation is formulated in the following assumption.
Assumption 3.1.
Let $P_{in}$ and $P_{out}$ denote the conditional distributions of the target distribution $P$ on the inner set $\mathcal{D}_{in}$ and the outlier set $\mathcal{D}_{out}$, respectively. Assume there exists a constant $\kappa \ge 1$ such that the measures $P_{in}$ and $P_{out}$ are bounded by $\kappa P$. That is, for any measurable set $V \subseteq \mathcal{X}$, $P_{in}(V) \le \kappa P(V)$ and $P_{out}(V) \le \kappa P(V)$.
The expansion property on the target domain is defined as follows:
Definition 3.0 (Constant Expansion (Wei et al., 2021)).
We say that a distribution $P$ satisfies $(q,\epsilon)$-constant expansion for some constants $q,\epsilon \in (0,1)$, if for all $V \subseteq \mathcal{X}$ satisfying $P(V) \ge q$, we have $P(\mathcal{N}(V)\setminus V) \ge \min\{\epsilon, P(V)\}$.
Based on the model's learning state, our CtO method divides the target data into the inner set ($\mathcal{D}_{in}$) and the outlier set ($\mathcal{D}_{out}$). The intuition lies in the fact that different datasets and classes should determine their division thresholds based on the model's learning state, so that the division boundary is more reasonable. Specifically, we set a global threshold $\tau$ to utilize unlabeled data at early training stages. As the adaptation progresses, we estimate the model's learning state from its predictions to determine a dynamic local (class-specific) threshold. The following claim guarantees the consistency robustness of inner samples:
Claim 3.1.
Suppose the model satisfies a Lipschitz condition; then there exists a global threshold $\tau$ and a scale of the model's learning status such that the inner set $\mathcal{D}_{in}$ is consistency robust, i.e., the consistency error of the model on $\mathcal{D}_{in}$ is zero.
Remark 1.
The claim illustrates that when the global threshold $\tau$ is fixed, as long as the model's learning state is good enough, the inner set can achieve a consistency error close to zero. In clustering-based methods, the scale of the learning status usually corresponds to the number of selectable nearest neighbors, which allows us to select more reliable neighbors for inner samples. Moreover, we do not need to set specific data division thresholds for different datasets or classes based on experience, which makes the method more general and realistic.
With the above preparation, we can investigate how to propagate label information from the ordered inner set to the chaotic outlier set. The following theorem establishes bounds on the target risks and indicates that as long as there is minimal overlap between the inner and outlier sets, label information will propagate in the target subpopulation.
Theorem 3.2.
Suppose Assumption 3.1 and Claim 3.1 hold and the target distribution satisfies $(q,\epsilon)$-constant expansion. Then the expected error of the model on the target domain is bounded in terms of the consistency regularizer on the unlabeled target data (see Appendix A).
Remark 2.
This theorem states that the target risk is bounded by the consistency regularization term (equivalently, the regularizer $R_{\mathcal{B}}$). Our analysis explains why consistency regularization is important for SFDA methods: assuming the data satisfies expansion, it encourages representations to maintain important semantic structures by enforcing local consistency within the representation space. As a result, we can effectively capture the global structure of the target domain by considering all local neighborhoods together, which strongly enforces label propagation.
The proofs of the claim and the theorem are given in Appendix A.
4. Method
In this section, we introduce CtO, which aims to achieve efficient feature clustering from the perspective of label propagation.
There are two aspects involved in achieving CtO. First, how to divide the dataset reasonably? The ideal inner set should have well-clustered properties, so the effectiveness of the division boundary in distinguishing high- and low-density regions is crucial when choosing a division threshold. The model’s prediction probability can measure this. According to the previous analysis, by considering the model’s learning state, our proposed adaptive division threshold can better guarantee local consistency among inner samples.
Second, how can we customize learning strategies to achieve label propagation among target subpopulations? Inspired by self-supervised learning, we apply two regularization strategies, local consistency and input consistency, on the inner and outlier sets, respectively. Our theoretical analysis demonstrates that these strategies can extend low-density subsets to high-density ones when the subpopulations overlap, which leads to clustering from chaos to order.
4.1. Dynamic Data Grouping
As analyzed above, we employ the model's learning state to adaptively divide the data in $\mathcal{D}_t$ into the inner set $\mathcal{D}_{in}$ and the outlier set $\mathcal{D}_{out}$. As believed in (Zhang et al., 2021), the learning effect of the model can be reflected by the class-level hit rate. Therefore, our principle is that the data division in CtO should be related to the prediction confidence of the unlabeled data on different classes so as to reflect the class-level learning status. Namely, classes with fewer samples reaching a confidence threshold are considered to have difficulty in learning local structural information. Moreover, the threshold should be increased steadily as the model improves during training. We set the global threshold $\tau_t$ as the exponential moving average (EMA) of the highest confidence level at each training time step:
(5)  $\tau_t = \lambda\,\tau_{t-1} + (1-\lambda)\,\max_{x \in B_t}\max_{c} p_t(c \mid x),$

where $\lambda$ is the momentum decay of the EMA, $t$ denotes the $t$-th iteration, and $B_t$ is the current mini-batch. Combining this flexible threshold, the learning effect of class $c$ at time step $t$ is defined as:

(6)  $\sigma_t(c) = \sum_{x \in \mathcal{D}_t} \mathbf{1}\big(\max_{c'} p_t(c' \mid x) > \tau_t\big)\cdot\mathbf{1}\big(\arg\max_{c'} p_t(c' \mid x) = c\big).$

Then we formulate the adaptive data division weights:

(7)  $\beta_t(c) = \frac{\sigma_t(c)}{\max_{c'}\sigma_t(c')}, \qquad \tau_t(c) = \beta_t(c)\cdot\tau_t.$

Finally, the samples are dynamically grouped into the outlier set in the $t$-th iteration:

(8)  $\mathcal{D}_{out}^{t} = \big\{x \in \mathcal{D}_t : \max_{c} p_t(c \mid x) < \tau_t\big(\arg\max_{c} p_t(c\mid x)\big)\big\},$
and the inner samples are the remaining target data, i.e., $\mathcal{D}_{in}^{t} = \mathcal{D}_t \setminus \mathcal{D}_{out}^{t}$. To this end, we customize learning strategies for different data properties and connect both sets via the expansion assumption.
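The grouping rule can be sketched as follows under a FlexMatch-style reading of Eqs. 5-8; the exact formulation in the paper may differ, and all names (`DynamicGrouper`, `split`) and default values here are our assumptions.

```python
import torch

class DynamicGrouper:
    """Sketch of dynamic data grouping: EMA global threshold plus class-wise learning status."""

    def __init__(self, num_classes: int, momentum: float = 0.99, init_tau: float = 0.5):
        self.m = momentum
        self.tau = init_tau                                     # global threshold (EMA)
        self.class_tau = torch.full((num_classes,), init_tau)   # class-specific thresholds
        self.num_classes = num_classes

    @torch.no_grad()
    def split(self, probs: torch.Tensor):
        """probs: (N, C) softmax predictions on the target data at the current step."""
        conf, pred = probs.max(dim=1)
        # Eq. 5 (sketch): EMA of the highest confidence observed at this step.
        self.tau = self.m * self.tau + (1.0 - self.m) * conf.max().item()
        # Eq. 6 (sketch): class-level learning effect = confident hits per class.
        sigma = torch.zeros(self.num_classes)
        for c in range(self.num_classes):
            sigma[c] = ((pred == c) & (conf > self.tau)).sum()
        # Eq. 7 (sketch): normalize to adaptive, class-specific thresholds.
        beta = sigma / sigma.max().clamp(min=1.0)
        self.class_tau = beta * self.tau
        # Eq. 8 (sketch): samples below their class threshold form the outlier set.
        outlier = conf < self.class_tau[pred]
        return ~outlier, outlier                                 # inner mask, outlier mask
```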
4.2. Label Propagation with Different Regularizations in SFDA
In the theoretical analysis (Section 3.3), we show that performing consistency regularization on the unlabeled target data can propagate semantic information across different subpopulations. However, it also has some limitations. First, the neighbors are not always correct. Due to the domain shift, the model may focus on the wrong object when extracting some target features, leading to noisy labels. Here, the misalignment of neighbors forms spurious clusters instead of helping label propagation. Second, when dealing with low-density regions (i.e., outlier samples), applying the same local-consistency regularization will further exacerbate the cross-label risk. To address these limitations, we flexibly customize learning strategies for different data properties with the help of dynamic data grouping.
Adaptive Local-consistency Regularization
In Adaptive Local-consistency Regularization (ALR), inspired by the fact that the target features from the source model have formed some semantic structures, we can capture the intra-class structure by local-consistency regularization. However, in the source-free domain adaptation problem, the features extracted by the pre-trained source model are typically influenced by the source bias. This may lead to neighbors containing incorrect semantic information. To mitigate incorrect alignment, we propose identifying the clustering weights of each sample.
As observed in Fig. 2, the same-class cosine similarity is generally higher than the across-class similarity. Building on this finding, we can measure neighbor affinity by cosine similarity and then re-weight the neighbors to approximate the ground-truth structural information. By re-weighting with similarity-based adaptive weights, positive clustering can be promoted while spurious clustering is suppressed. The Adaptive Local-consistency Regularization is as follows:
(9)  $\mathcal{L}_{ALR} = -\frac{1}{n_B}\sum_{i=1}^{n_B} \sum_{j \in \mathcal{N}_i} w_{ij}\, p_i^{\top} p_j,$

where $\mathcal{N}_i$ denotes the $k$-nearest-neighbor set of $x_i$. The similarity weight $w_{ij}$ in Eq. 9 is the cosine similarity of $z_i$ to its neighbor $z_j$, which is calculated via the memory bank $\mathcal{F}$. Optimizing $\mathcal{L}_{ALR}$ improves the reliability of clustering, which stabilizes the intra-class structure. In addition, relaxing the ranking of samples in low-density regions helps reduce incorrect local semantic alignment.
Additionally, to improve separability between clusters, we employ the separation strategy proposed by (Yang et al., 2022) to disperse the prediction of potentially dissimilar features.
(10)  $\mathcal{L}_{sep} = \frac{1}{n_B}\sum_{i=1}^{n_B} \sum_{z_j \in B_i} p_i^{\top} p_j,$

where $B_i$ denotes the set of features in the mini-batch other than $z_i$.
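A sketch of the two terms above, following our reading of Eqs. 9 and 10: similarity-weighted attraction to the k nearest neighbors and dispersion against the other predictions in the mini-batch (the dot-product form follows AaD; the exact weighting and normalization are assumptions).

```python
import torch

def alr_loss(p: torch.Tensor, nn_probs: torch.Tensor, nn_sim: torch.Tensor) -> torch.Tensor:
    """Adaptive local consistency: similarity-weighted attraction to neighbor predictions.

    p:        (B, C) softmax predictions of the batch
    nn_probs: (B, k, C) bank predictions of each sample's k nearest neighbors
    nn_sim:   (B, k) cosine similarities used as adaptive weights
    """
    dots = torch.einsum("bc,bkc->bk", p, nn_probs)   # prediction agreement with neighbors
    return -(nn_sim * dots).sum(dim=1).mean()

def dispersion_loss(p: torch.Tensor) -> torch.Tensor:
    """Separation term: push apart predictions of (potentially dissimilar) batch mates."""
    dots = p @ p.t()
    off_diag = dots - torch.diag(torch.diag(dots))   # drop self-agreement
    return off_diag.sum(dim=1).mean()
```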
Methods | Source-free | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 (He et al., 2016) | ✗ | 34.9 | 50.0 | 58.0 | 37.4 | 41.9 | 46.2 | 38.5 | 31.2 | 60.4 | 53.9 | 41.2 | 59.9 | 46.1 |
CDAN (Long et al., 2018) | ✗ | 50.7 | 70.6 | 76.0 | 57.6 | 70.0 | 70.0 | 57.4 | 50.9 | 77.3 | 70.9 | 56.7 | 81.6 | 65.8 |
MDD (Zhang et al., 2019) | ✗ | 54.9 | 73.7 | 77.8 | 60.0 | 71.4 | 71.8 | 61.2 | 53.6 | 78.1 | 72.5 | 60.2 | 82.3 | 68.1 |
SRDC (Tang et al., 2020) | ✗ | 52.3 | 76.3 | 81.0 | 69.5 | 76.2 | 78.0 | 68.7 | 53.8 | 81.7 | 76.3 | 57.1 | 85.0 | 71.3 |
FixBi (Na et al., 2021) | ✗ | 58.1 | 77.3 | 80.4 | 67.7 | 79.5 | 78.1 | 65.8 | 57.9 | 81.7 | 76.4 | 62.9 | 86.7 | 72.7 |
SHOT (Liang et al., 2020) | ✓ | 56.9 | 78.1 | 81.0 | 67.9 | 78.4 | 78.1 | 67.0 | 54.6 | 81.8 | 73.4 | 58.1 | 84.5 | 71.6 |
A²Net (Xia et al., 2021) | ✓ | 58.4 | 79.0 | 82.4 | 67.5 | 79.3 | 78.9 | 68.0 | 56.2 | 82.9 | 74.1 | 60.5 | 85.0 | 72.8 |
NRC (Yang et al., 2021a) | ✓ | 57.7 | 80.3 | 82.0 | 68.1 | 79.8 | 78.6 | 65.3 | 56.4 | 83.0 | 71.0 | 58.6 | 85.6 | 72.2 |
SFDA-DE (Ding et al., 2022) | ✓ | 59.7 | 79.5 | 82.4 | 69.7 | 78.6 | 79.2 | 66.1 | 57.2 | 82.6 | 73.9 | 60.8 | 85.5 | 72.9 |
feat-mixup (Kundu et al., 2022) | ✓ | 61.8 | 81.2 | 83.0 | 68.5 | 80.6 | 79.4 | 67.8 | 61.5 | 85.1 | 73.7 | 64.1 | 86.5 | 74.5 |
AaD (Yang et al., 2022) | ✓ | 59.3 | 79.3 | 82.1 | 68.9 | 79.8 | 79.5 | 67.2 | 57.4 | 83.1 | 72.1 | 58.5 | 85.4 | 72.7 |
DaC (Zhang et al., 2022) | ✓ | 59.1 | 79.5 | 81.2 | 69.3 | 78.9 | 79.2 | 67.4 | 56.4 | 82.4 | 74.0 | 61.4 | 84.4 | 72.8 |
NRC+ELR (Yi et al., 2023) | ✓ | 58.4 | 78.7 | 81.5 | 69.2 | 79.5 | 79.3 | 66.3 | 58.0 | 82.6 | 73.4 | 59.8 | 85.1 | 72.6 |
SFUDA (Pei et al., 2023) | ✓ | 59.9 | 81.4 | 83.0 | 68.9 | 80.1 | 80.3 | 67.5 | 56.9 | 83.7 | 74.3 | 60.8 | 86.3 | 73.7 |
CtO | ✓ | 58.5 | 79.8 | 85.5 | 74.8 | 82.5 | 83.1 | 73.8 | 58.4 | 85.0 | 78.2 | 63.3 | 89.6 | 76.1 |
Methods | Source-free | A→D | A→W | D→W | W→D | D→A | W→A | Avg. |
---|---|---|---|---|---|---|---|---|
ResNet-50 (He et al., 2016) | ✗ | 68.9 | 68.4 | 96.7 | 99.3 | 62.5 | 60.7 | 76.1 |
CDAN (Long et al., 2018) | ✗ | 92.9 | 94.1 | 98.6 | 100.0 | 71.0 | 69.3 | 87.7 |
MDD (Zhang et al., 2019) | ✗ | 90.4 | 90.4 | 98.7 | 99.9 | 75.0 | 73.7 | 88.0 |
SRDC (Tang et al., 2020) | ✗ | 95.8 | 95.7 | 99.2 | 100.0 | 76.7 | 77.1 | 90.8 |
FixBi (Na et al., 2021) | ✗ | 95.0 | 96.1 | 99.3 | 100.0 | 78.7 | 79.4 | 91.4 |
SHOT (Liang et al., 2020) | ✓ | 93.1 | 90.9 | 98.8 | 99.9 | 74.5 | 74.8 | 88.7 |
A²Net (Xia et al., 2021) | ✓ | 94.5 | 94.0 | 99.2 | 100.0 | 76.7 | 76.1 | 90.1 |
NRC (Yang et al., 2021a) | ✓ | 96.0 | 90.8 | 99.0 | 100.0 | 75.3 | 75.0 | 89.4 |
HCL (Huang et al., 2021) | ✓ | 94.7 | 92.5 | 98.2 | 100.0 | 75.9 | 77.7 | 89.8 |
SFDA-DE (Ding et al., 2022) | ✓ | 96.0 | 94.2 | 98.5 | 99.8 | 76.6 | 75.5 | 90.1 |
AaD (Yang et al., 2022) | ✓ | 96.4 | 92.1 | 99.1 | 100.0 | 75.0 | 76.5 | 89.9 |
feat-mixup (Kundu et al., 2022) | ✓ | 94.6 | 93.2 | 98.9 | 100.0 | 78.3 | 78.9 | 90.7 |
NRC+ELR (Yi et al., 2023) | ✓ | 93.8 | 93.3 | 98.0 | 100.0 | 76.2 | 76.9 | 89.6 |
SFUDA (Pei et al., 2023) | ✓ | 96.2 | 94.0 | 99.1 | 99.9 | 76.9 | 79.2 | 90.7 |
CtO | ✓ | 96.4 | 95.1 | 99.0 | 100.0 | 80.0 | 78.2 | 91.5 |
Adaptive Input-consistency Regularization
In Adaptive Input-consistency Regularization, we propagate the structural information from the inner set to the outlier set as discussed in Remark 2. Since the outliers in low-density regions are far away from all other points, meaning there is no nearest-neighbor support, we turn to seek support from the outliers themselves. Specifically, we use a weakly augmented version of $x_i$, denoted as $\alpha(x_i)$, to generate the pseudo-label $\hat{y}_i$ and enforce consistency against its strongly augmented version $\mathcal{A}(x_i)$. To encourage the model to make diverse predictions, we combine this regularization with the aforementioned class-level confidence thresholds. The Adaptive Input-consistency Regularization is as follows:
(11)  $\mathcal{L}_{AIR} = \frac{1}{|\mathcal{D}_{out}|}\sum_{x_i \in \mathcal{D}_{out}} \mathbf{1}\big(\max_{c} p(c \mid \alpha(x_i)) > \tau_t(\hat{y}_i)\big)\, H\big(\hat{y}_i,\, p(\cdot \mid \mathcal{A}(x_i))\big),$

where $H(\cdot,\cdot)$ refers to the cross-entropy loss, and $\hat{y}_i = \arg\max_{c} p(c \mid \alpha(x_i))$ denotes the pseudo-label of $x_i$.
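The input-consistency term can be sketched as a FixMatch-style masked cross-entropy with the class-level thresholds of Section 4.1; `weak_probs` and `strong_logits` are hypothetical names for the model outputs on the two augmented views, and the exact masking may differ from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def air_loss(weak_probs: torch.Tensor, strong_logits: torch.Tensor,
             class_tau: torch.Tensor) -> torch.Tensor:
    """Adaptive input consistency on outlier samples.

    weak_probs:    (B, C) softmax predictions on weakly augmented outliers
    strong_logits: (B, C) logits on the strongly augmented versions
    class_tau:     (C,)  class-specific confidence thresholds
    """
    conf, pseudo = weak_probs.max(dim=1)            # pseudo-labels from the weak view
    mask = (conf > class_tau[pseudo]).float()       # keep only confident pseudo-labels
    ce = F.cross_entropy(strong_logits, pseudo, reduction="none")
    return (mask * ce).mean()
```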
During training, the clustering processes on the inner and outlier sets facilitate each other. By implementing different regularization strategies, labels are propagated among subsets to enable under- or hard-to-learn samples to find suitable neighbors. The inclusion of these new members in clusters provides additional information for learning the intra-class structure, which adjusts the feature space and enhances the power of clustering.
4.3. Overall Objective
As described above, the overall objective of CtO can be summarized as follows:
(12)  $\mathcal{L} = \mathcal{L}_{ALR} + \lambda\,\mathcal{L}_{sep} + \mathcal{L}_{AIR},$

where $\lambda$ is a trade-off parameter. With $\mathcal{L}_{ALR}$ and $\mathcal{L}_{AIR}$, CtO preserves local and input consistency, allowing label information to be propagated. The training process is described in Appendix B.
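For reference, the overall loss can be assembled as in the sketch below, which reuses the `alr_loss`, `dispersion_loss`, and `air_loss` helpers sketched earlier; the weighting is simplified and `lambda_sep` stands in for the trade-off parameter.

```python
import torch

def cto_objective(p_inner: torch.Tensor, nn_probs: torch.Tensor, nn_sim: torch.Tensor,
                  weak_probs: torch.Tensor, strong_logits: torch.Tensor,
                  class_tau: torch.Tensor, lambda_sep: float = 1.0) -> torch.Tensor:
    """Total CtO loss: neighbor attraction and dispersion on inner samples,
    input consistency on outlier samples."""
    loss = alr_loss(p_inner, nn_probs, nn_sim)
    loss = loss + lambda_sep * dispersion_loss(p_inner)
    loss = loss + air_loss(weak_probs, strong_logits, class_tau)
    return loss
```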
Methods | Source-free | plane | bicycle | bus | car | horse | knife | mcycl | person | plant | sktbrd | train | truck | Per-class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-101 (He et al., 2016) | ✗ | 55.1 | 53.3 | 61.9 | 59.1 | 80.6 | 17.9 | 79.7 | 31.2 | 81.0 | 26.5 | 73.5 | 8.5 | 52.4 |
CDAN (Long et al., 2018) | ✗ | 85.2 | 66.9 | 83.0 | 50.8 | 84.2 | 74.9 | 88.1 | 74.5 | 83.4 | 76.0 | 81.9 | 38.0 | 73.9 |
SAFN (Xu et al., 2019) | ✗ | 93.6 | 61.3 | 84.1 | 70.6 | 94.1 | 79.0 | 91.8 | 79.6 | 89.9 | 55.6 | 89.0 | 24.4 | 76.1 |
MCC (Jin et al., 2020) | ✗ | 88.7 | 80.3 | 80.5 | 71.5 | 90.1 | 93.2 | 85.0 | 71.6 | 89.4 | 73.8 | 85.0 | 36.9 | 78.8 |
FixBi (Na et al., 2021) | ✗ | 96.1 | 87.8 | 90.5 | 90.3 | 96.8 | 95.3 | 92.8 | 88.7 | 97.2 | 94.2 | 90.9 | 25.7 | 87.2 |
SHOT (Liang et al., 2020) | ✓ | 92.6 | 81.1 | 80.1 | 58.5 | 89.7 | 86.1 | 81.5 | 77.8 | 89.5 | 84.9 | 84.3 | 49.3 | 79.6 |
A²Net (Xia et al., 2021) | ✓ | 94.0 | 87.8 | 85.6 | 66.8 | 93.7 | 95.1 | 85.8 | 81.2 | 91.6 | 88.2 | 86.5 | 56.0 | 84.3 |
NRC (Yang et al., 2021a) | ✓ | 96.8 | 91.3 | 82.4 | 62.4 | 96.2 | 95.9 | 86.1 | 80.6 | 94.8 | 94.1 | 90.4 | 59.7 | 85.9 |
HCL (Huang et al., 2021) | ✓ | 93.3 | 85.4 | 80.7 | 68.5 | 91.0 | 88.1 | 86.0 | 78.6 | 86.6 | 88.8 | 80.0 | 74.7 | 83.5 |
CPGA (Qiu et al., 2021) | ✓ | 94.8 | 83.6 | 79.7 | 65.1 | 92.5 | 94.7 | 90.1 | 82.4 | 88.8 | 88.0 | 88.9 | 60.1 | 84.1 |
SFDA-DE (Ding et al., 2022) | ✓ | 95.3 | 91.2 | 77.5 | 72.1 | 95.7 | 97.8 | 85.5 | 86.1 | 95.5 | 93.0 | 86.3 | 61.6 | 86.5 |
AaD (Yang et al., 2022) | ✓ | 97.4 | 90.5 | 80.8 | 76.2 | 97.3 | 96.1 | 89.8 | 82.9 | 95.5 | 93.0 | 92.0 | 64.7 | 88.0 |
DaC (Zhang et al., 2022) | ✓ | 96.6 | 86.8 | 86.4 | 78.4 | 96.4 | 96.2 | 93.6 | 83.8 | 96.8 | 95.1 | 89.6 | 50.0 | 87.3 |
CtO | ✓ | 98.2 | 91.0 | 86.4 | 78.0 | 97.6 | 98.8 | 91.8 | 84.8 | 96.6 | 94.7 | 93.7 | 53.3 | 88.7 |
AD | AW | DA | WA | Avg. | ||||
---|---|---|---|---|---|---|---|---|
✓ | ✓ | 96.4 | 92.1 | 75.0 | 76.5 | 85.0 | ||
✓ | ✓ | 95.4 | 93.3 | 77.9 | 77.6 | 86.1 | ||
✓ | ✓ | 95.8 | 94.7 | 79.4 | 77.8 | 86.9 | ||
✓ | ✓ | ✓ | 96.4 | 95.1 | 80.0 | 78.2 | 87.4 |
5. Experiments
5.1. Setup
Datasets
We conduct experiments on three public domain adaptation benchmarks. (i) Office-31 (Saenko et al., 2010) is a commonly used dataset for domain adaptation that consists of three domains: Amazon (A), Webcam (W), and DSLR (D), each containing 31 categories of items in an office environment. (ii) Office-Home (Venkateswara et al., 2017) is a standard domain adaptation dataset collected in office and home environments. It consists of four domains, Art (Ar), Clipart (Cl), Product (Pr), and RealWorld (Rw), each covering 65 object categories. (iii) VisDA (Peng et al., 2017) is one of the largest benchmarks for domain adaptation. It contains 12 categories of images from two subsets: a synthetic image domain and a real image domain.
Implementation details
Following the standard protocol for SFDA, we use all labeled source data to obtain pre-trained models. For Office-31 and Office-Home, the backbone network is ResNet-50 (He et al., 2016). For VisDA, the backbone network is ResNet-101. For a fair comparison, we use the same network structure as SHOT (Liang et al., 2020), NRC (Yang et al., 2021a), and AaD (Yang et al., 2022). All network parameters are updated by Stochastic Gradient Descent (SGD) with a momentum of 0.9, an initial learning rate of 0.001, and a weight decay of 0.005. The learning rate of the additional layer is 10 times smaller than that of the backbone layer. We follow G-SFDA (Yang et al., 2021b), NRC (Yang et al., 2021a), and AaD (Yang et al., 2022) for the number of nearest neighbors ($k$): 3 for Office-31 and Office-Home, and 5 for VisDA. To ensure a fair comparison, we set the hyperparameters to be the same as in the previous work (Yang et al., 2022); in particular, the decay exponent of the trade-off parameter $\lambda$ is set to 0 on Office-Home, 2 on Office-31, and 5 on VisDA. The strong augmentation function used in our experiments is RandAugment (Cubuk et al., 2020).
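The weak/strong augmentation pair used by AIR could, for instance, be built with torchvision as sketched below; the crop size and RandAugment settings here are our assumptions, not values reported in the paper.

```python
from torchvision import transforms

# Weak view: resize, crop, and flip; strong view: additionally applies RandAugment
# (Cubuk et al., 2020) before converting to a tensor.
weak_aug = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
strong_aug = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
])
```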






5.2. Results and Analysis
In this section, we present our results and compare them with other methods, as summarized in Tables 3, 4, and 5. For a fair comparison, all baseline results are taken from their original papers or follow-up work.
Comparison with state-of-the-art methods
For Office-31, as shown in Table 4, the proposed CtO yields state-of-the-art performance on 4 out of 6 tasks. Note that CtO produces competitive results even when compared to source-present methods such as FixBi (Na et al., 2021) (91.5% vs. 91.4%). For Office-Home, Table 3 shows that the proposed CtO method achieves the best average classification accuracy (76.1%) and the highest results on 7 out of 12 tasks. In clustering-based methods, the clustering error increases with the number of object classes, so it is difficult for local-consistency-based SFDA methods to accurately capture the target structure information. However, our CtO employs adaptive input-consistency regularization to efficiently utilize unlabeled data through label propagation, which is the primary reason for our success on Office-Home. Moreover, CtO beats several source-present DA methods, such as SRDC (Tang et al., 2020) and FixBi (Na et al., 2021), by a large margin, which means that even without access to the source data, our method can still exploit the target structure information to achieve better adaptation. Similar observations on VisDA can be found in Table 5. The reported results sufficiently demonstrate the superiority of our method.
Comparison with clustering-based Method
As discussed in the related work, NRC (Yang et al., 2021a) uses reciprocal nearest neighbors to measure clustering affinity. On the hard tasks of Office-Home, our approach outperforms NRC by a considerable margin, especially on the task Pr→Ar (73.8% vs. 65.3%). This improvement indicates the importance of our adaptive input-consistency regularization for capturing intra-class structural information. Compared with AaD (Yang et al., 2022), our CtO improves the accuracy by 1.6% on Office-31 and by 3.1% on Office-Home, indicating that the co-training of the adaptive local-consistency regularizer and the adaptive input-consistency regularizer performs reliable label propagation through the subpopulations of unlabeled data. Moreover, on VisDA, CtO exhibits higher recognition accuracy for several confusing objects than AaD, which indicates that the adaptive input-consistency regularizer can enhance model discrimination by providing more comprehensive intra-class information.
Ablation Study
To evaluate the contribution of the different components of our work, we conduct ablation studies for CtO on Office-31. We investigate different combinations of the two parts: Adaptive Local-consistency Regularization (ALR) and Adaptive Input-consistency Regularization (AIR). Compared to our method, AaD (i.e., using only the original attraction and dispersion terms) can be regarded as the baseline. As shown in Table 6, each part of our method contributes to improving performance. AIR contributes the most to the improvement in accuracy, with the performance increasing from 85.0% to 86.9%, which shows the effectiveness of label propagation. ALR also improves the average performance by 1.1% compared to the base model, confirming that the distance-based re-weighting improves the quality of the neighbors. For easy transfer tasks, target features from pre-trained source models naturally have good clustering performance. In this case, ALR dominates the loss optimization, with AIR helping to improve training on under-learned categories. When the target feature distribution is scattered, the model benefits from AIR to ensure smoothness, while the expansion property amplifies it to global consistency within the same class, allowing the limited structural information captured by ALR to be propagated among subpopulations. From this comparison, we conclude that employing a regularization strategy suited to the data properties is important for capturing semantic information. Without local consistency, outlier samples are difficult to learn, which makes the model heavily dependent on the transferability of source knowledge. Similarly, removing input consistency makes it difficult to effectively facilitate the learning of global semantic information. Overall, CtO improves the AaD baseline by an average of 2.4%, which shows the complementarity between ALR and AIR.
Visualization
To demonstrate the superiority of our method, we show the t-SNE (Van der Maaten and Hinton, 2008) feature visualization and confusion matrix on Office-31 (see Fig. 3). From Fig. 3(a-d), we can observe that the clustering of the target features is more compact after adaptation by CtO. Fig. 3(b) and (d) illustrate that CtO can achieve good model adaptation whether the model is pre-trained on a large-scale or small-scale source domain. When the source domain is knowledge-rich, as shown in Fig. 3(a), the target domain features already possess considerable semantics. In such cases, adaptive local-consistency regularization can effectively capture the intra-class structure. However, when significant domain differences exist (as shown in Fig. 3(c)), abundant target features are jumbled together, so that the model has difficulty capturing the local structure. The flexible data division of our method thus customizes the learning strategy for different data properties, which facilitates the estimation of ground-truth structural information instead of only adjusting the neighbor weights as in NRC (Yang et al., 2021a). Benefiting from the adaptive input-consistency regularization, we can capture semantic structures with rich intra-class variations while dissimilar samples are naturally separated in the representation space. More importantly, as training iterates, outlier samples gradually join the clustering, and locally informed clusters propagate label information to these outliers. The comparison of Fig. 3(e) and (f) further demonstrates that our method increases prediction diversity by adaptively adjusting the training on under-learned or hard-to-learn samples (i.e., outliers).
6. Conclusions
In this paper, we propose a novel approach called Chaos to Order (CtO), which tries to achieve efficient feature clustering from the perspective of label propagation. CtO divides the target data into inner and outlier samples based on an adaptive threshold over the learning state, and applies a customized learning strategy to best fit the data properties. To mitigate the source bias, on the one hand, considering the clustering affinity, we propose Adaptive Local-consistency Regularization (ALR) to reduce spurious clustering by re-weighting neighbors. On the other hand, Adaptive Input-consistency Regularization (AIR) is used at outlier points to propagate structural information from high-density to low-density regions, thus achieving high accuracy with respect to the ground-truth labels. Moreover, this co-training process encourages positive clustering and combats spurious clustering. The experimental results on three popular benchmarks verify that our proposed model outperforms the state of the art on various SFDA tasks. For future work, we plan to extend our CtO method to source-free open-set and partial-set domain adaptation.
Acknowledgements.
This work was supported by the National Natural Science Foundation of China under Grants 61871186 and 61771322.
References
- Cai et al. (2021) Tianle Cai, Ruiqi Gao, Jason D. Lee, and Qi Lei. 2021. A Theory of Label Propagation for Subpopulation Shift. In ICML (Proceedings of Machine Learning Research, Vol. 139). PMLR, 1170–1182.
- Cubuk et al. (2020) Ekin Dogus Cubuk, Barret Zoph, Jonathon Shlens, and Quoc Le. 2020. RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. In NeurIPS.
- Ding et al. (2022) Ning Ding, Yixing Xu, Yehui Tang, Chao Xu, Yunhe Wang, and Dacheng Tao. 2022. Source-Free Domain Adaptation via Distribution Estimation. In CVPR. IEEE, 7202–7212.
- Douze et al. (2018) Matthijs Douze, Arthur Szlam, Bharath Hariharan, and Hervé Jégou. 2018. Low-Shot Learning With Large-Scale Diffusion. In CVPR. Computer Vision Foundation / IEEE Computer Society, 3349–3358.
- Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor S. Lempitsky. 2016. Domain-Adversarial Training of Neural Networks. J. Mach. Learn. Res. 17 (2016), 59:1–59:35.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR. IEEE Computer Society, 770–778.
- Huang et al. (2021) Jiaxing Huang, Dayan Guan, Aoran Xiao, and Shijian Lu. 2021. Model Adaptation: Historical Contrastive Learning for Unsupervised Domain Adaptation without Source Data. In NeurIPS. 3635–3649.
- Iscen et al. (2019) Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. 2019. Label Propagation for Deep Semi-Supervised Learning. In CVPR. Computer Vision Foundation / IEEE, 5070–5079.
- Jin et al. (2020) Ying Jin, Ximei Wang, Mingsheng Long, and Jianmin Wang. 2020. Minimum Class Confusion for Versatile Domain Adaptation. In ECCV (21) (Lecture Notes in Computer Science, Vol. 12366). Springer, 464–480.
- Kang et al. (2019) Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G. Hauptmann. 2019. Contrastive Adaptation Network for Unsupervised Domain Adaptation. In CVPR. Computer Vision Foundation / IEEE, 4893–4902.
- Kundu et al. (2022) Jogendra Nath Kundu, Akshay R. Kulkarni, Suvaansh Bhambri, Deepesh Mehta, Shreyas Anand Kulkarni, Varun Jampani, and Venkatesh Babu Radhakrishnan. 2022. Balancing Discriminability and Transferability for Source-Free Domain Adaptation. In ICML (Proceedings of Machine Learning Research, Vol. 162). PMLR, 11710–11728.
- Kundu et al. (2021) Jogendra Nath Kundu, Akshay R. Kulkarni, Amit Singh, Varun Jampani, and R. Venkatesh Babu. 2021. Generalize then Adapt: Source-Free Domain Adaptive Semantic Segmentation. In ICCV. IEEE, 7026–7036.
- Lee et al. (2022) Jonghyun Lee, Dahuin Jung, Junho Yim, and Sungroh Yoon. 2022. Confidence Score for Source-Free Unsupervised Domain Adaptation. In ICML (Proceedings of Machine Learning Research, Vol. 162). PMLR, 12365–12377.
- Li et al. (2022) Jingjing Li, Zhekai Du, Lei Zhu, Zhengming Ding, Ke Lu, and Heng Tao Shen. 2022. Divergence-Agnostic Unsupervised Domain Adaptation by Adversarial Attacks. IEEE Trans. Pattern Anal. Mach. Intell. 44, 11 (2022), 8196–8211.
- Li et al. (2020a) Rui Li, Wenming Cao, Si Wu, and Hau-San Wong. 2020a. Generating Target Image-Label Pairs for Unsupervised Domain Adaptation. IEEE Trans. Image Process. 29 (2020), 7997–8011.
- Li et al. (2020b) Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, and Si Wu. 2020b. Model Adaptation: Unsupervised Domain Adaptation Without Source Data. In CVPR. Computer Vision Foundation / IEEE, 9638–9647.
- Liang et al. (2020) Jian Liang, Dapeng Hu, and Jiashi Feng. 2020. Do We Really Need to Access the Source Data? Source Hypothesis Transfer for Unsupervised Domain Adaptation. In ICML (Proceedings of Machine Learning Research, Vol. 119). PMLR, 6028–6039.
- Long et al. (2018) Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I. Jordan. 2018. Conditional Adversarial Domain Adaptation. In NeurIPS. 1647–1657.
- Na et al. (2021) Jaemin Na, Heechul Jung, Hyung Jin Chang, and Wonjun Hwang. 2021. FixBi: Bridging Domain Spaces for Unsupervised Domain Adaptation. In CVPR. Computer Vision Foundation / IEEE, 1094–1103.
- Pei et al. (2023) Jiangbo Pei, Zhuqing Jiang, Aidong Men, Liang Chen, Yang Liu, and Qingchao Chen. 2023. Uncertainty-Induced Transferability Representation for Source-Free Unsupervised Domain Adaptation. IEEE Trans. Image Process. 32 (2023), 2033–2048.
- Peng et al. (2017) Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. 2017. VisDA: The Visual Domain Adaptation Challenge. CoRR abs/1710.06924 (2017).
- Qiu et al. (2021) Zhen Qiu, Yifan Zhang, Hongbin Lin, Shuaicheng Niu, Yanxia Liu, Qing Du, and Mingkui Tan. 2021. Source-free Domain Adaptation via Avatar Prototype Generation and Adaptation. In IJCAI. ijcai.org, 2921–2927.
- Qu et al. (2022) Sanqing Qu, Guang Chen, Jing Zhang, Zhijun Li, Wei He, and Dacheng Tao. 2022. BMD: A General Class-Balanced Multicentric Dynamic Prototype Strategy for Source-Free Domain Adaptation. In ECCV (34) (Lecture Notes in Computer Science, Vol. 13694). Springer, 165–182.
- Saenko et al. (2010) Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. 2010. Adapting Visual Category Models to New Domains. In ECCV.
- Tang et al. (2020) Hui Tang, Ke Chen, and Kui Jia. 2020. Unsupervised Domain Adaptation via Structurally Regularized Deep Clustering. In CVPR. Computer Vision Foundation / IEEE, 8722–8732.
- Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
- Venkateswara et al. (2017) Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. 2017. Deep Hashing Network for Unsupervised Domain Adaptation. In CVPR. IEEE Computer Society, 5385–5394.
- Wei et al. (2021) Colin Wei, Kendrick Shen, Yining Chen, and Tengyu Ma. 2021. Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data. In ICLR. OpenReview.net.
- Wu et al. (2020) Yuan Wu, Diana Inkpen, and Ahmed El-Roby. 2020. Dual Mixup Regularized Learning for Adversarial Domain Adaptation. In ECCV (29) (Lecture Notes in Computer Science, Vol. 12374). Springer, 540–555.
- Xia et al. (2021) Haifeng Xia, Handong Zhao, and Zhengming Ding. 2021. Adaptive Adversarial Network for Source-free Domain Adaptation. In ICCV. IEEE, 8990–8999.
- Xu et al. (2019) Ruijia Xu, Guanbin Li, Jihan Yang, and Liang Lin. 2019. Larger Norm More Transferable: An Adaptive Feature Norm Approach for Unsupervised Domain Adaptation. In ICCV. IEEE, 1426–1435.
- Yang et al. (2021a) Shiqi Yang, Yaxing Wang, Joost van de Weijer, Luis Herranz, and Shangling Jui. 2021a. Exploiting the Intrinsic Neighborhood Structure for Source-free Domain Adaptation. In NeurIPS. 29393–29405.
- Yang et al. (2021b) Shiqi Yang, Yaxing Wang, Joost van de Weijer, Luis Herranz, and Shangling Jui. 2021b. Generalized Source-free Domain Adaptation. In ICCV. IEEE, 8958–8967.
- Yang et al. (2022) Shiqi Yang, Yaxing Wang, Kai Wang, Shangling Jui, and Joost van de Weijer. 2022. Attracting and Dispersing: A Simple Approach for Source-free Domain Adaptation. In NeurIPS.
- Yi et al. (2023) Li Yi, Gezheng Xu, Pengcheng Xu, Jiaqi Li, Ruizhi Pu, Charles Ling, A. Ian McLeod, and Boyu Wang. 2023. When Source-Free Domain Adaptation Meets Learning with Noisy Labels. In ICLR. OpenReview.net.
- Zhang et al. (2021) Bowen Zhang, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura, and Takahiro Shinozaki. 2021. FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling. In NeurIPS. 18408–18419.
- Zhang et al. (2019) Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael I. Jordan. 2019. Bridging Theory and Algorithm for Domain Adaptation. In ICML (Proceedings of Machine Learning Research, Vol. 97). PMLR, 7404–7413.
- Zhang et al. (2022) Ziyi Zhang, Weikai Chen, Hui Cheng, Zhen Li, Siyuan Li, Liang Lin, and Guanbin Li. 2022. Divide and Contrast: Source-free Domain Adaptation via Adaptive Contrastive Learning. In Advances in Neural Information Processing Systems.
- Zhong et al. (2021) Li Zhong, Zhen Fang, Feng Liu, Jie Lu, Bo Yuan, and Guangquan Zhang. 2021. How Does the Combined Risk Affect the Performance of Unsupervised Domain Adaptation Approaches?. In AAAI. AAAI Press, 11079–11087.
Appendix A Proof of Claim 3.1 and Theorem 3.2
A.1. Claim 3.1 and Proof
Claim 3.1. Suppose the model satisfies a Lipschitz condition; then there exists a global threshold $\tau$ and a scale of the model's learning status such that the inner set $\mathcal{D}_{in}$ is consistency robust, i.e., the consistency error of the model on $\mathcal{D}_{in}$ is zero.
Proof
Let denote the predicted probability of the model on class . Integrating the dynamic thresholds, we say that there exists a constant such that , where . Suppose is defined by model’s learning state, denote
If s.t. . Specifically, define
where . Along with the inequality we know that
By the definition of Lipschitz constant, we have:
(13) | ||||
As a result,
Since by definition of , we have . Combining Eq. 13 and the Lipschitz constant , we know that this forms a contradiction with . Thus, , the model predictions are consistent, i.e., .
A.2. Proof Sketch for Theorem 3.2
Theorem 3.2. Suppose Assumption 3.1 and Claim 3.1 hold and the target distribution satisfies $(q,\epsilon)$-constant expansion. Then the expected error of the model on the target domain is bounded.
To prove Theorem 3.2, we introduce some concepts and notation following (Cai et al., 2021): (i) the robust set of the model; (ii) the minority robust set on a given subset.
For a given model , define the robust set to be the set for which is robust under input transformations:
Let , where denote the conditional distribution of . Towards define the minority robust set on , we consider the majority class label of :
Thus, we denote
be the minority robust set of . In addition, let
be the minority set of .
By Lemma A.1 in (Cai et al., 2021), under the $(q,\epsilon)$-constant expansion, we have
Lemma A.1 (Upper Bound on the Inner Set $\mathcal{D}_{in}$).
Suppose the condition of Claim 3.1 holds, then
Proof
Based on the definition of the minority robust set , we know that . Therefore, we can write:
(14) | ||||
Lemma A.2 (Upper Bound on the Outlier Set $\mathcal{D}_{out}$).
Let , then
Proof
By the definition of the outlier set , we note that . Thus, we obtain
(15) |
Based on the above results, we can now apply Lemma A.1 and Lemma A.2 to bound the target error . Under the conditions of Theorem 3.2, we have:
Layer | Similarity | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Layer 4 (source) | same class | 14.848 | 12.155 | 12.918 | 13.849 | 11.914 | 12.650 | 16.183 | 17.240 | 14.453 | 14.183 | 13.989 | 11.680 | 13.839 |
across classes | 16.674 | 15.556 | 15.548 | 15.449 | 14.317 | 14.597 | 18.152 | 18.730 | 16.924 | 15.978 | 15.410 | 14.548 | 15.990 | |
Layer 4 (target) | same class | 12.800 | 12.961 | 12.275 | 14.183 | 12.961 | 12.275 | 14.183 | 12.800 | 12.275 | 14.183 | 12.800 | 12.961 | 13.055 |
across classes | 14.422 | 16.174 | 14.750 | 16.396 | 16.174 | 14.750 | 16.396 | 14.422 | 14.750 | 16.396 | 14.422 | 16.174 | 15.435 | |
Bottleneck (source) | same class | 19.041 | 15.302 | 16.117 | 18.614 | 16.453 | 17.260 | 18.656 | 20.286 | 16.868 | 16.799 | 17.428 | 14.102 | 17.244 |
across classes | 21.975 | 20.965 | 21.160 | 21.529 | 20.717 | 20.926 | 21.778 | 22.301 | 21.182 | 21.033 | 20.275 | 20.274 | 21.176 | |
Bottleneck (target) | same class | 19.813 | 13.601 | 14.845 | 16.137 | 13.223 | 14.362 | 18.703 | 22.772 | 16.631 | 16.223 | 18.673 | 12.934 | 16.493 |
across classes | 24.252 | 20.307 | 22.235 | 20.172 | 18.489 | 19.838 | 23.969 | 26.605 | 23.761 | 21.410 | 22.489 | 19.170 | 21.891 |
Appendix B Algorithm for CtO
Our method, as described in Algorithm 1, involves dynamic data grouping and adaptive input- and local-consistency regularization. By using an adaptive threshold over the learning state, we dynamically divide the target data into inner and outlier data, which allows us to apply different learning strategies to each subset.
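Since the pseudo-code itself is deferred to Algorithm 1, the following is only an illustrative outline of one training epoch, reusing the `FeatureBank`, `DynamicGrouper`, `alr_loss`, `dispersion_loss`, and `air_loss` sketches from earlier sections; the attribute names `model.backbone` and `model.classifier` and the loader format are assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, bank, grouper, optimizer, k=3, lambda_sep=1.0):
    """Illustrative CtO epoch: group the data, attract inner samples to their neighbors,
    enforce input consistency on outliers, and refresh the memory bank."""
    inner_mask, _ = grouper.split(bank.probs)                 # dynamic data grouping
    for idx, weak_x, strong_x in loader:                      # loader yields indices + two views
        z = model.backbone(weak_x)                            # model.backbone / model.classifier
        p = F.softmax(model.classifier(z), dim=1)             # are assumed attribute names
        inner = inner_mask[idx]
        loss = torch.zeros((), requires_grad=True)
        if inner.any():                                       # ALR + dispersion on inner samples
            nn_sim, nn_idx = bank.knn(z[inner], k=k)
            loss = loss + alr_loss(p[inner], bank.probs[nn_idx], nn_sim)
            loss = loss + lambda_sep * dispersion_loss(p[inner])
        if (~inner).any():                                    # AIR on outlier samples
            strong_logits = model.classifier(model.backbone(strong_x[~inner]))
            loss = loss + air_loss(p[~inner].detach(), strong_logits, grouper.class_tau)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        bank.update(idx, z.detach(), p.detach())              # refresh the memory bank
```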
Appendix C Analysis of different similarity measures
To check the effect of different metric functions on target feature clustering, we compare cosine similarity and Euclidean distance for feature similarity on Office-Home. Fig. 4 shows the average Euclidean distance for all tasks on Office-Home; a smaller value indicates greater similarity. It can be seen that the Euclidean distance also indicates the presence of rich semantic information in the high-dimensional features of the source model backbone. However, compared to Fig. 2, the differences in similarity based on Euclidean distance are insignificant. As is well known, Euclidean distance reflects absolute differences in values, while cosine distance reflects relative differences in direction. Therefore, cosine similarity maintains "1 for identical, 0 for orthogonal, -1 for opposite" in high-dimensional space. Euclidean distance, in contrast, is influenced by the feature dimensionality, and its numerical range is not stable. Particularly in the case of distribution shift, large sample variance leads to poor Euclidean-distance performance.
We also show feature similarities among samples within the same class and across classes for each transfer task. As shown in Table 7, the differences between the within-class and across-class Euclidean distances are not clear when the distributions are significantly different (e.g., the Ar→Cl, Pr→Cl, and Rw→Cl tasks). Weak inter-category discrimination exacerbates spurious clustering, which biases the adaptation process.
Through these experiments, we notice two things: 1) the source model contains sufficient inductive bias; and 2) under domain shift, there is a strong correlation between the choice of metric and target feature clustering. In future work, one possible direction is to study how different metrics affect the performance of target clustering.
