
footnotetext: * Corresponding Authors: Yinghuan Shi () and Lei Qi ().
1 State Key Laboratory for Novel Software Technology, Nanjing University, China
2 National Institute of Healthcare Data Science, Nanjing University, China
3 School of Computer Science and Engineering, Key Lab of Computer Network and Information Integration (Ministry of Education), Southeast University, China

Learn to Preserve and Diversify: Parameter-Efficient Group with Orthogonal Regularization for Domain Generalization

Jiajun Hu 1,2    Jian Zhang 1,2    Lei Qi 3,*    Yinghuan Shi 1,2,*    Yang Gao 1,2
[email protected] [email protected]
Abstract

Domain generalization (DG) aims to prevent performance degradation when a distribution shift occurs between the limited training data and unseen test data. Recently, foundation models with enormous parameters have been pre-trained on huge datasets, demonstrating strong generalization ability and offering a promising direction for solving the DG problem. However, fully Fine-Tuning (FT) the foundation models results in unsatisfactory out-of-distribution accuracy due to the destruction of pre-trained generalized features. Parameter-Efficient Fine-Tuning (PEFT) alleviates this problem by fine-tuning a small portion of the model parameters while keeping the rest frozen, achieving better generalization performance than FT. Nevertheless, PEFT still suffers from overfitting to the training domains. To address this issue, we propose Parameter-Efficient Group with Orthogonal regularization (PEGO) for vision transformers, which effectively preserves the generalization ability of the pre-trained network and learns more diverse knowledge than conventional PEFT. Specifically, we inject a group of trainable Low-Rank Adaptation (LoRA) modules into the pre-trained model and propose an orthogonal regularization loss to enhance the generalization ability of the model. Our framework achieves SOTA performance on five DG benchmarks, while requiring training only a small number of parameters and adding no additional testing cost.

1 Introduction

Traditional machine learning algorithms assume that training data and test data are drawn from the same distribution [54]. However, a trained model suffers significant performance degradation when the distribution discrepancy (a.k.a. domain gap) between the training and test data is large. To address this issue, Domain Generalization (DG) is proposed, which assumes that a model trained with only source domains can generalize well to unseen target domains. Previous DG works [2, 1, 26, 24, 47] mostly aim to improve model generalizability either by extracting domain-invariant features that are applicable across domains or by augmenting the training source domains with manually defined transformations, e.g., style transfer [80, 66], Fourier transformation [69, 27], etc.

However, most DG works follow the same training strategy of directly fine-tuning a pre-trained model (e.g., ResNet [28] pre-trained on ImageNet [18]) without considering the influence of the initial parameters on the generalization performance of the final trained model. For example, we find that when using a randomly initialized ResNet-50 as the backbone, the average performance of Empirical Risk Minimization (ERM, which trains a model by simply aggregating all source domains without any other techniques) on the PACS [44] dataset is only 35.8%. This result is much lower than the 84.2% obtained with ResNet-50 pre-trained on ImageNet. In particular, on the sketch domain of PACS, whose distribution is far from ImageNet, fine-tuning the pre-trained ResNet-50 yields a 55.3% improvement (24.0% → 79.3%) over fine-tuning the randomly initialized ResNet-50. This indicates that the generalizable knowledge in pre-trained models should be fully exploited to achieve better generalization on downstream DG tasks.

Recently, owing to the flourishing development of deep learning, both the parameters and training data of deep models have grown substantially, and everyone can easily access pre-trained large foundation models [8] based on the vision transformer (ViT) architecture [19]. Previous works [35, 72, 42] have shown that vision transformers are more robust to unknown distributions than CNNs [39], and the CLIP [55] model trained on 400M image-text pairs has demonstrated strong zero-shot generalization ability. However, a critical question arises: does directly fine-tuning stronger models lead to better results on DG tasks? The answer is NO. Cha et al. [14] found that on the DomainBed [25] benchmark, fine-tuning ViT-B [19] pre-trained by CLIP [55] performs worse than fine-tuning ResNet-50 pre-trained on ImageNet (ViT-B: 61.1% vs. ResNet-50: 64.2%). This is because direct fine-tuning distorts the generalizable features that originally reside in the pre-trained model [38]. Compared to small models, the large number of parameters in foundation models causes more severe overfitting when training with limited source domain data. Previous DG methods [2, 1, 26, 24, 47] mainly focus on extracting domain-invariant features from limited source domains or performing data augmentation to generate more training data, ignoring how to preserve and exploit the generalization ability of the pre-trained models themselves to improve out-of-distribution performance. Furthermore, with a huge number of parameters in foundation models, fine-tuning all of them incurs high training overhead in both GPU memory and training time, which significantly increases the difficulty of successfully fine-tuning a foundation model for users with limited resources.

To address the two issues mentioned above, Parameter-Efficient Fine-Tuning (PEFT) [29, 30, 33] has attracted significant interest in various language and vision tasks. Compared to full Fine-Tuning (FT), PEFT methods inject lightweight trainable modules into the pre-trained model and freeze all of its original parameters. This approach reduces training overhead and achieves comparable or better performance than FT on downstream tasks [29]. Low-Rank Adaptation (LoRA) [30] is one of the most commonly employed PEFT implementations, injecting trainable rank-decomposition matrices into every layer of the transformer [60]. Moreover, we find that LoRA yields a substantial performance improvement on out-of-distribution tasks compared to FT and even outperforms some conventional DG algorithms (see Sec. 4.3).

Despite the advantages of low computational overhead and overfitting alleviation, applying LoRA to DG has two limitations. First, although LoRA injects only a small number of parameters into the foundation model, which helps alleviate the feature distortion problem, it still suffers from it: the features learned by the LoRA module may conflict with the features of the pre-trained model, resulting in knowledge forgetting. Second, LoRA employs only a single low-rank module in each layer, which also easily overfits the training domains and cannot handle various unseen domains, further limiting its generalization performance.

To address the above limitations of LoRA, we propose a novel Parameter-Efficient Group with Orthogonal regularization (PEGO) framework to fully exploit the potential of pre-trained foundation models for the DG problem. First, we Preserve the generalization ability learned from large-scale pre-training by imposing an orthogonal regularization loss between the pre-trained weights and the weights of the LoRA layers. In this way, we effectively minimize the distortion of the pre-trained generalized features. Second, we employ a group of LoRA modules for each layer to Diversify feature representations during training. With the learned abundant features, the model can better handle various unseen domains at test time. To further encourage diversity, orthogonal constraints are also added between the weights of these LoRA modules. We summarize the contributions of this work as follows:

  1. We propose a novel PEFT framework named PEGO that can effectively alleviate the overfitting issue with little computational overhead.

  2. We design an orthogonal regularization loss to facilitate knowledge preservation of the pre-trained model.

  3. We design the LoRA group with diversity constraints to learn diverse features to handle various unseen domains.

  4. On five DomainBed benchmarks, PEGO achieves state-of-the-art performance compared to previous DG algorithms. Moreover, our method outperforms other PEFT methods and methods that exploit pre-trained models.

2 Related Work

2.1 Domain Generalization

Domain generalization (DG) aims to ensure that a model trained on source domains can generalize well to any unseen target domain. There are various methods for the DG problem, mainly including data augmentation [78, 77, 69, 53, 80, 65], meta-learning [43, 20, 5, 74, 64], ensemble learning [79, 13, 3, 56], self-supervised learning [12, 62], adversarial learning [24, 45, 47], causal learning [2, 1, 49], test-time adaptation [32, 15, 73, 75], etc.

Most of the previous DG works choose small models (e.g., AlexNet [37] and ResNet [28]) as the pre-trained backbone. Different from them, several methods exploit large pre-trained models for out-of-distribution generalization. MIRO [14] proposes mutual information regularization by assuming the pre-trained model as the oracle model. GESTUR [41] designs a gradient estimation to reduce potential risks in unseen domains utilizing a large pre-trained model. These two methods require significant training costs due to optimizing all the parameters of the pre-trained model. Moreover, there are some recent works [17, 50, 58, 31] that utilize the text information from vision-language models to enhance the generalization ability of the fine-tuned model, but these methods rely on the jointly trained text encoder and visual encoder.

2.2 Parameter-Efficient Fine-Tuning

Parameter-Efficient Fine-Tuning (PEFT) [29] is first proposed to fine-tune large pre-trained transformers in natural language processing tasks, effectively reducing the computational and storage cost. PEFT methods inject some lightweight modules into the foundation model and only optimize a small portion of the model parameters to achieve a similar or higher performance compared with FT on downstream tasks. PEFT methods are also applied to deal with visual tasks. For example, VPT [33] introduces trainable prompt tokens in the input space for ViT [19]. One of the most influential PEFT methods is LoRA [30], which inserts trainable low-rank decomposition matrices into the transformer block while freezing the pre-trained model weights. It does not introduce additional inference latency and works with any neural network with dense layers.

2.3 Orthogonal Regularization

Orthogonality in neural networks [68, 11, 10, 6] has been widely studied to improve training stability, training speed, and model performance. For example, Xie et al. [68] utilize orthonormality among different filter banks to mitigate gradient vanishing when training deep convolutional neural networks (CNNs). Bansal et al. [6] develop novel orthogonality regularizations, which achieve better accuracy and faster, more stable convergence on several CNNs. In the DG literature, some works [9, 4, 16, 34] apply an orthogonal loss to disentangle domain-invariant and domain-specific representations, but these methods usually require designing an additional network to generate the two decoupled representations. Orthogonality is also extensively used in continual learning to prevent catastrophic forgetting of past tasks [71, 23, 57, 63]. The most relevant to our work is O-LoRA [63], which utilizes orthogonal subspace learning for continual learning in language models. Different from their task, this paper aims to solve the DG problem when fine-tuning a foundation model, and we additionally consider the relationship between the pre-trained weights and the injected LoRA layers.

3 Methods

3.1 Problem Formulation

For the classification task in the DG setting, the training data $D_{tr}$ usually contains $n_{d}$ domains $\{D_{i}\}_{i=1}^{n_{d}}$ ($n_{d}>1$). The test data $D_{te}$ is not accessible during training but shares the same category space as $D_{tr}$. We define $x_{i}$ as the $i$-th sample in $D_{tr}$ with category label $y_{i}$. The distribution of the training data, $P_{tr}$, is the joint distribution over the image and label spaces in $D_{tr}$, and there is a distribution gap between $P_{tr}$ and the distribution of the test data $P_{te}$.

The entire model includes a feature extractor $F(\cdot;\theta)$ parameterized by $\theta$ and a classifier $H(\cdot;\psi)$ parameterized by $\psi$. Previous DG works usually choose a pre-trained model and fine-tune it on the training distribution $P_{tr}$. The goal of DG is that $F$ and $H$ trained on $D_{tr}$ generalize well to $D_{te}$.

3.2 Revisit Previous Methods

3.2.1 Fine-Tuning.

Fine-Tuning (FT, updating all the parameters of the model) is the most common training manner in previous DG works. The optimization objective of FT can be defined as the following formula:

$$\min_{\theta,\psi}\quad\mathcal{L}_{cls}=\mathcal{L}_{CE}(H(F(x;\theta);\psi),y), \qquad (1)$$

where $(x,y)\sim P_{tr}$ and $\mathcal{L}_{CE}$ is the cross-entropy loss. Gulrajani et al. [25] claim that ERM based on FT has competitive performance compared to most DG algorithms. However, Kumar et al. [38] argue that FT distorts pre-trained features, which leads to poor out-of-distribution performance. In Sec. 4.3, we also find that the performance of ERM is much lower than that of other methods when a foundation model is chosen as the backbone.

3.2.2 Parameter-Efficient Fine-Tuning.

Compared with FT, Parameter-Efficient Fine-Tuning (PEFT) only updates a small number of parameters. For any PEFT method, the network with PEFT modules can be defined as $G(\cdot;\Phi)$ parameterized by $\Phi$, where the trainable parameters are $\phi\subseteq\Phi$ (the number of parameters in $\phi$ is much smaller than that in $\Phi$). The general optimization objective of any PEFT method can be formulated as follows:

$$\min_{\phi,\psi}\quad\mathcal{L}_{cls}=\mathcal{L}_{CE}(H(G(x;\Phi);\psi),y). \qquad (2)$$

Because it does not update any parameters of the foundation model, PEFT effectively inherits the model's generalization ability. However, in the DG setting, the trainable parameters of PEFT still overfit to the source domains, resulting in a performance decrease on the unseen test domains. At the same time, the injected PEFT modules also partially distort the generalized features produced by the foundation model during training.
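To make the PEFT setting of Eq. (2) concrete, the following is a minimal PyTorch-style sketch: all parameters of the foundation model are frozen, and only the injected modules and the classifier head are handed to the optimizer. The names `build_peft_optimizer`, `backbone`, `injected_modules`, and `head` are illustrative placeholders, not part of any specific library.

```python
import torch


def build_peft_optimizer(backbone, injected_modules, head, lr=5e-4):
    # Freeze every parameter of the foundation model (Phi minus phi stays fixed).
    for p in backbone.parameters():
        p.requires_grad_(False)
    # Only the injected PEFT modules (phi) and the classifier head (psi) are trained.
    trainable = list(injected_modules.parameters()) + list(head.parameters())
    return torch.optim.Adam(trainable, lr=lr)
```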

Figure 1: Illustration of our method: Parameter Efficient Group with Orthogonal regularization (PEGO). Different from previous DG work updating all the parameters of the pre-trained model, we freeze the parameters of the model and inject a group of trainable parameter-efficient modules into it. Moreover, we apply an orthogonal regularization loss between the pre-trained weights and the LoRA modules to preserve the generalization ability of the pre-trained model (Learn to Preserve) and employ another orthogonal regularization loss on different LoRA modules within the group to encourage them to learn diverse knowledge during training (Learn to Diversify).

3.3 Parameter-Efficient Group with Orthogonal Regularization

To address the issue of FT and PEFT in DG, we propose Parameter Efficient Group with Orthogonal regularization (PEGO) based on LoRA [30], which is a classic PEFT method for pre-trained transformer. In this section, we first review the key technologies of LoRA and then introduce the details of our method.

LoRA. Hu et al. [30] propose Low-Rank Adaptation (LoRA) based on the hypothesis that the weight updates during adaptation have a low intrinsic rank. Specifically, for a linear layer in the pre-trained model parameterized by $W\in\mathbb{R}^{d\times k}$, there is a low-rank decomposition $W+\Delta W=W+BA$, where $B\in\mathbb{R}^{d\times r}$, $A\in\mathbb{R}^{r\times k}$, and $r\ll\min(d,k)$. During training, LoRA only updates the injected low-rank weights $A$ and $B$ while keeping $W$ frozen. We define the input feature of the linear layer as $z_{in}$, and the output feature of the linear layer can be calculated through the following forward process:

$$z_{out}=(W+\Delta W)z_{in}=(W+BA)z_{in}=Wz_{in}+BAz_{in}. \qquad (3)$$

In the original paper, the LoRA layers are only injected into $W^{q}$ and $W^{v}$, the query and value projection matrices of the self-attention modules. After training, we can directly compute $W_{final}=W+BA$ as the final weight. Therefore, LoRA introduces no additional inference latency.
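As a concrete reference, the following is a minimal PyTorch-style sketch of a LoRA-augmented linear layer implementing Eq. (3); the class name `LoRALinear` and its initialization details are illustrative assumptions rather than the official LoRA implementation.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear weight W augmented with one trainable low-rank branch BA."""

    def __init__(self, d: int, k: int, r: int = 4):
        super().__init__()
        # Frozen W (in practice loaded from the pre-trained model).
        self.weight = nn.Parameter(torch.empty(d, k), requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # trainable, r x k
        self.B = nn.Parameter(torch.zeros(d, r))         # trainable, d x r (zero init)

    def forward(self, z_in: torch.Tensor) -> torch.Tensor:
        # Eq. (3): z_out = W z_in + B A z_in, with z_in of shape (..., k).
        return z_in @ self.weight.T + z_in @ (self.B @ self.A).T

    @torch.no_grad()
    def merge(self) -> None:
        # Fold BA into W after training, so inference adds no extra latency.
        self.weight += self.B @ self.A
```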

PEGO. To further improve the out-of-distribution performance of LoRA, we propose the idea of learning to Preserve and Diversify. The former aims to preserve the generalization performance of the pre-trained model during fine-tuning, while the latter aims to learn more diverse knowledge from source domains. Specifically, we propose to inject a group of LoRA modules into the pre-trained network and apply an orthogonal regularization loss to achieve the above two goals simultaneously. Fig. 1 provides an illustration of our method.
•  Learn to Preserve

Inspired by the orthogonal gradient updates [71, 23, 57] used to prevent catastrophic forgetting, we strive to constrain the gradient subspace of fine-tuning to be orthogonal to the gradient subspace of the large-scale pre-training task. This enables the model to learn useful information from the source domains while preserving the generalization ability of the pre-trained model. However, we may not be able to access the pre-training dataset, and it is not practical to compute the full gradient of the large-scale pre-training task. Following Wang et al. [63], who regard the weights of a LoRA layer as the gradient subspace of a certain task, we propose to consider the original pre-trained weights as the gradient subspace of the pre-training task. Similarly, the fine-tuned LoRA weights can be considered as the gradient subspace updated on the source domains. With these two gradient subspaces, we propose an orthogonal regularization loss that constrains the pre-trained weights $W$ to be orthogonal to the weights of the injected module $BA$. Specifically, the loss can be formulated as follows:

$$\mathcal{L}_{preserve}=\left\|W^{T}(BA)\right\|_{1}, \qquad (4)$$

where $\|\cdot\|_{1}$ is the $L_{1}$ norm. As in LoRA, we only fine-tune the low-rank weight matrices $A$ and $B$ while keeping the remaining parameters frozen during training.
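For clarity, the loss in Eq. (4) can be computed directly from the weight matrices. The sketch below assumes PyTorch tensors $W\in\mathbb{R}^{d\times k}$, $B\in\mathbb{R}^{d\times r}$, and $A\in\mathbb{R}^{r\times k}$; the function name is chosen for illustration only.

```python
import torch


def preserve_loss(W: torch.Tensor, B: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    # Eq. (4): element-wise L1 norm of W^T (B A); W is d x k, B is d x r, A is r x k.
    return (W.T @ (B @ A)).abs().sum()
```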

Furthermore, we analyze the above loss from a feature-level perspective. According to the forward process in Eq. 3, we define the output feature of the pre-trained weight as $z_{init}=Wz_{in}$ and the output feature of the LoRA layer as $z_{new}=BAz_{in}$. While the loss constrains $BA$ to be orthogonal to $W$, it indirectly constrains $z_{init}$ to be orthogonal to $z_{new}$, as demonstrated by the following transformation:

$$z_{init}^{T}z_{new}=(Wz_{in})^{T}BAz_{in}=z_{in}^{T}(W^{T}BA)z_{in}. \qquad (5)$$

Since the features generated by fine-tuning are encouraged to be orthogonal to the pre-trained features, the generalization ability of the pre-trained model is well preserved. Moreover, during implementation, we found that enforcing orthogonality on the weights requires fewer computational resources and yields better performance than enforcing orthogonality on the features.
•  Learn to Diversify

In the DG literature, increasing the diversity of the training trajectories is used to improve the generalization performance of the model. For example, Zhang et al. [74] propose a multi-view algorithm for employing multiple optimization trajectories; Arpit et al. [3] perform the ensemble of multiple independently trained models. However, these methods require additional multi-step gradient updates or training multiple models. Benefiting from the lightweight and easily scalable characteristics of LoRA, we propose to introduce multiple LoRA modules and apply orthogonal regularization to facilitate the model learning diverse knowledge.

Different from the original LoRA, where the pre-trained weight $W$ is accompanied by only one trainable low-rank matrix $BA$, we employ a parameter-efficient group of LoRA layers $g=\{A_{i},B_{i}\}_{i=1}^{N}$ in our framework, where $N$ is the number of LoRA layers. Moreover, we adopt a pairwise orthogonal regularization loss to enhance the diversity of the knowledge learned by each LoRA layer. Specifically, for a given LoRA layer $g_{i}=\{A_{i},B_{i}\}$, its weight matrix $B_{i}A_{i}$ is encouraged to be orthogonal to the weight matrices of the other LoRA layers $\{B_{j}A_{j}\}_{j\neq i}$. Formally, the diversification loss is defined as follows:

$$\mathcal{L}_{diversify}=\sum_{i=1}^{N}\sum_{j=i+1}^{N}\left\|(B_{i}A_{i})^{T}(B_{j}A_{j})\right\|_{1}. \qquad (6)$$

The above loss promotes orthogonality between the weights of different LoRA modules, so the output features of different LoRA modules are also encouraged to be orthogonal. Our method learns more diverse optimization trajectories than the original LoRA while adding only a little training cost.
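A minimal sketch of the pairwise regularizer in Eq. (6) is given below, assuming the LoRA group is available as a list of $(A_{i},B_{i})$ tensor pairs; the function name and data layout are illustrative.

```python
from typing import List, Tuple

import torch


def diversify_loss(group: List[Tuple[torch.Tensor, torch.Tensor]]) -> torch.Tensor:
    # Eq. (6): pairwise L1 orthogonality penalty between the N low-rank updates B_i A_i.
    # `group` holds the factors of one LoRA group as [(A_1, B_1), ..., (A_N, B_N)].
    deltas = [B @ A for A, B in group]
    loss = torch.zeros((), device=deltas[0].device)
    for i in range(len(deltas)):
        for j in range(i + 1, len(deltas)):
            loss = loss + (deltas[i].T @ deltas[j]).abs().sum()
    return loss
```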
•  Final Objective

We combine $\mathcal{L}_{preserve}$ and $\mathcal{L}_{diversify}$ as the optimization objective of orthogonal regularization, which takes the following form:

$$\mathcal{L}_{O}(W)=\sum_{i=1}^{N}\left\|W^{T}(B_{i}A_{i})\right\|_{1}+\sum_{i=1}^{N}\sum_{j=i+1}^{N}\left\|(B_{i}A_{i})^{T}(B_{j}A_{j})\right\|_{1}. \qquad (7)$$

In line with LoRA, we only apply the above loss to $W^{q}$ and $W^{v}$ in each block of the pre-trained transformer. During training, we only fine-tune the LoRA group and the classification head while keeping the remaining parameters frozen. The final orthogonal regularization loss is given by:

$$\mathcal{L}_{OR}=\sum_{b=1}^{B}\left(\mathcal{L}_{O}(W^{q}_{b})+\mathcal{L}_{O}(W^{v}_{b})\right), \qquad (8)$$

where $B$ is the number of blocks in the ViT, and $W^{q}_{b}$ and $W^{v}_{b}$ denote the query and value projection matrices in the $b$-th block. Finally, we combine $\mathcal{L}_{cls}$ with $\mathcal{L}_{OR}$ as the final optimization objective of the model:

$$\mathcal{L}_{final}=\mathcal{L}_{cls}+\alpha\mathcal{L}_{OR}, \qquad (9)$$

where $\alpha$ is the coefficient balancing the two losses. When the model is deployed to the test environment after training, we merge the group of LoRA layers into the pre-trained weight:

$$W_{final}=W+\sum_{i=1}^{N}B_{i}A_{i}. \qquad (10)$$

As with LoRA, there is no additional testing latency in our method.
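For reference, the merging step in Eq. (10) can be sketched as follows; `merge_lora_group` is a hypothetical helper name, and the group is again assumed to be a list of $(A_{i},B_{i})$ pairs.

```python
import torch


@torch.no_grad()
def merge_lora_group(W: torch.Tensor, group) -> torch.Tensor:
    # Eq. (10): W_final = W + sum_i B_i A_i, computed once before deployment.
    W_final = W.clone()
    for A, B in group:
        W_final += B @ A
    return W_final
```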

4 Experiments

4.1 DataSets

We use five common datasets from the DomainBed [25] evaluation benchmark to verify the effectiveness of our method: 1. PACS [44] includes 9,991 images, 7 categories, and 4 domains; the domain shift between domains is large (e.g., photo and sketch). 2. VLCS [22] includes 10,729 images, 5 categories, and 4 domains; the domain shift mainly comes from different viewpoints. 3. OfficeHome [61] includes 15,500 images, 65 categories, and 4 domains; it contains more categories and smaller domain shifts than PACS. 4. TerraIncognita [7] includes 24,788 images, 10 categories, and 4 domains; the images are taken at four different wild locations, making it a challenging dataset. 5. DomainNet [52] includes 586,575 images, 345 categories, and 6 domains; its numbers of images and categories far exceed those of the above datasets.

4.2 Implementation Details

Different from DomainBed [25], which uses ResNet-50 [28] as the backbone, we utilize ViT-B/16 [19] pre-trained by CLIP [55] as the default foundation model. During training, we employ the Adam [36] optimizer with a learning rate of 5e-4 for 5000 iterations, except on DomainNet, which requires 15000 iterations for convergence. Each batch contains 32 images from every source domain. Following DomainBed, our data augmentation includes random horizontal flips, color jittering, and random grayscaling. We set the rank of LoRA $r$ to 4 and the balancing coefficient $\alpha$ to 1e-3. For the number of LoRA layers $N$ in each group, we define the hyperparameter search space as $N\in\{2,4,6\}$. More details about the evaluation protocol and hyperparameter search can be found in the Supplementary.

4.3 Main Results

Table 1: Performance comparison with DG methods. Leave-one-domain-out accuracy (%) on five DomainBed benchmarks. In addition to the results of our method, other results come from Lew et al. [41]. OH, TI and DN indicate OfficeHome, TerraIncognita, and DomainNet, respectively (similarly hereinafter).
Algorithm PACS VLCS OH TI DN Avg
ERM (FT) 83.4±0.5 75.9±1.3 66.4±0.5 35.3±0.8 44.4±0.6 61.1
SWAD [13] 91.3±0.1 79.4±0.4 76.9±0.1 45.4±0.5 51.7±0.8 68.9
SMA [3] 92.1±0.2 79.7±0.2 78.1±0.1 48.3±0.7 55.9±0.2 70.8
MIRO [14] 95.6±0.8 82.2±0.3 82.5±0.1 54.3±0.4 54.0±0.3 73.7
GESTUR [41] 96.0±0.0 82.8±0.1 84.2±0.1 55.7±0.2 58.9±0.1 75.5
Ours 96.5±0.1 83.2±0.3 84.2±0.1 57.3±0.3 59.3±0.1 76.1

Comparison with DG Methods. We first compare PEGO with the baseline ERM (i.e., FT) and state-of-the-art (SOTA) DG methods, including: SWAD [13], SMA [3], MIRO [14] and GESTUR [41]. SWAD and SMA are both ensemble methods that show significant improvement compared to ERM on DomainBed using ResNet-50 as the backbone; MIRO and GESTUR both aim to preserve and exploit the generalization ability of the pre-trained network.

The results of the performance comparison are shown in Tab. 1. Compared to ERM, all DG approaches show significant improvements, indicating much potential for further enhancement when using foundation models for the DG problem. Furthermore, PEGO outperforms all previous methods on all five datasets, demonstrating the superiority of our method. Especially on the challenging TerraIncognita dataset, PEGO achieves a remarkable improvement of more than 1.6% over other methods (55.7% → 57.3%).

Table 2: Performance comparison with PEFT methods. Leave-one-domain-out accuracy (%) on five DomainBed benchmarks.
Algorithm PACS VLCS OH TI DN Avg
ERM (FT) 83.4±0.5 75.9±1.3 66.4±0.5 35.3±0.8 44.4±0.6 61.1
Adapter [29] 92.0±0.5 79.8±0.4 72.9±0.4 44.4±0.8 56.2±0.1 69.1
LoRA [30] 96.0±0.1 82.7±0.0 83.4±0.1 54.8±0.6 58.1±0.1 75.0
VPT [33] 96.2±0.3 82.9±0.3 83.4±0.3 54.2±0.7 58.9±0.1 75.1
Ours 96.5±0.1 83.2±0.3 84.2±0.1 57.3±0.3 59.3±0.1 76.1

Comparison with PEFT Methods. In this subsection, we compare PEGO with several PEFT methods, including Adapter [29] and LoRA [30], which are widely used in language tasks, and VPT [33], which is designed specifically for vision transformers.

As shown in Tab. 2, the performances of all PEFT methods on the five DomainBed benchmarks are significantly higher than those of FT. This indicates that full fine-tuning considerably harms the generalization performance of the pre-trained model, while PEFT can effectively address this issue. Moreover, our method achieves state-of-the-art performance on all five datasets and a 1.1% improvement in average performance over LoRA, the basis of our method (75.0% → 76.1%).

Table 3: Performance comparison with methods exploiting pre-trained models. Leave-one-domain-out accuracy (%) on four DomainBed benchmarks. The results of WiSE-FT come from Lew et al. [41] and we report the rest results. The best and second-best accuracy are bolded and underlined, respectively.
Algorithm PACS VLCS OH TI Avg
ERM (FT) 83.4±0.5 75.9±1.3 66.4±0.5 35.3±0.8 65.3
L2-SP [46] 92.2±0.7 81.0±0.2 68.2±0.5 39.4±1.6 70.2
LP-FT [38] 94.2±0.7 77.5±0.4 72.0±0.4 39.0±1.5 70.7
LwF [48] 93.6±0.6 81.9±0.4 80.7±0.4 39.4±0.6 73.9
WiSE-FT [67] 94.5±0.0 83.9±0.3 83.9±0.2 47.5±1.2 77.5
Ours 96.5±0.1 83.2±0.3 84.2±0.1 57.3±0.3 80.3

Comparison with Methods Exploiting Pre-trained Models. Some works in other fields utilize pre-trained models to improve the generalization ability of the fine-tuned model. Following Cha et al. [14], we select several classic methods for a performance comparison with PEGO. Specifically, L2-SP [46] employs an $L^{2}$ penalty between the pre-trained model and the fine-tuned model during training; LP-FT [38] proposes a strategy of first linear probing and then full fine-tuning; LwF [48] constrains the outputs of the fine-tuned model for old tasks to be similar to those of the pre-trained network; WiSE-FT [67] ensembles the weights of the zero-shot and fine-tuned networks to realize robust fine-tuning.

We report the results of all methods on four DomainBed benchmarks in Tab. 3. Our method achieves the best performance on PACS, OfficeHome, and TerraIncognita; on VLCS, it is only 0.7% lower than WiSE-FT. Furthermore, PEGO outperforms the previous best method by 2.8% in average accuracy over the four benchmarks (77.5% → 80.3%). This is mainly because PEGO obtains over 9.8% improvement on the TerraIncognita dataset (47.5% → 57.3%), with a significantly smaller standard error. These results indicate that although previous methods achieve some improvement over FT and aim to preserve the generalization ability of pre-trained models, they still exhibit significant overfitting. Our method effectively alleviates this issue.

5 Further Analysis

Table 4: Ablation study on two orthogonal regularization losses. Leave-one-domain-out accuracy (%) on PACS and OfficeHome.
$\mathcal{L}_{preserve}$ $\mathcal{L}_{diversify}$ PACS OH
PEGO 96.55±0.11 84.21±0.10
96.14±0.25 82.95±0.02
96.37±0.20 83.76±0.08
95.34±0.30 82.85±0.07
LoRA [30] 95.96±0.12 83.41±0.11

5.1 Ablation Study

Figure 2: Leave-one-domain-out accuracy (%) on PACS and OfficeHome when choosing different numbers of LoRA modules $N$, balancing coefficients $\alpha$, and ranks of LoRA $r$. Baseline (blue line) indicates injecting a group of LoRA layers into the pre-trained model without applying $\mathcal{L}_{preserve}$ and $\mathcal{L}_{diversify}$ (i.e., the balancing coefficient is zero).

Effectiveness of Two Orthogonal Losses. To verify the improvement in the model's generalization performance brought by our proposed losses, we conduct ablation experiments on $\mathcal{L}_{preserve}$ and $\mathcal{L}_{diversify}$ on the PACS and OfficeHome benchmarks. As shown in Tab. 4, when $\mathcal{L}_{preserve}$ and $\mathcal{L}_{diversify}$ are applied simultaneously, the model achieves the best performance (blue row). We notice that injecting a group of LoRA layers into the pre-trained model without applying any regularization loss (4th row) performs worse than the original LoRA. This indicates that increasing the number of training parameters without regularization cannot improve the model's generalization performance, while our losses can effectively utilize the additional parameters.
Effects of the Number of LoRA Modules. Intuitively, the more LoRA modules in our method, the higher the probability of learning diverse knowledge. However, excessive modules complicate loss optimization and increase training overhead. The first column of Fig. 2 shows the performance of PEGO and the Baseline with different numbers of modules. PEGO achieves higher accuracy and is more stable than the Baseline.
Effects of Balancing Coefficient. The second column of Fig. 2 shows the performances of Baseline and PEGO with different balancing coefficients. In a wide range of coefficients from 1e-4 to 1e-1, PEGO outperforms Baseline (balancing coefficient is zero), demonstrating the effectiveness of our proposed loss.
Effects of Rank of LoRA. As shown in the third column of Fig. 2, when the rank of the LoRA module is too high (greater than 8), the accuracy of Baseline significantly decreases, while the accuracy of our method remains stable.

Table 5: Performance comparison with Zero-shot CLIP. Leave-one-domain-out accuracy (%) on four DomainBed benchmarks. The results of Zero-shot come from Lew et al. [41]. The best and second-best accuracy are bolded and underlined, respectively.
Algorithm PACS VLCS OH TI Avg
ERM (FT) 83.4±0.5 75.9±1.3 66.4±0.5 35.3±0.8 65.3
Zero-shot [55] 96.8±0.0 81.7±0.3 83.0±0.3 31.3±0.2 73.2
Ours 96.5±0.1 83.2±0.3 84.2±0.1 57.3±0.3 80.3

5.2 Comparison with the Zero-shot Baseline

In Sec. 4.3, we choose the CLIP [55] pre-trained model as the default backbone. CLIP learns representations on 400 million image-text pairs and has demonstrated strong zero-shot ability on plenty of visual datasets.

Tab. 5 shows the performance comparison between our method and Zero-shot CLIP on four DomainBed benchmarks. Apart from being 0.3% lower than Zero-shot on the simple PACS benchmark, PEGO achieves the best performance on the other three benchmarks. It outperforms Zero-shot by 7.1% in average performance (73.2% → 80.3%). Besides, we notice that the accuracy of Zero-shot on TerraIncognita is worse than that of ERM, and our method surpasses Zero-shot by 26% (31.3% → 57.3%). This result is consistent with the finding of Cho et al. [17]. Although CLIP can leverage text information to achieve zero-shot classification without training data, we argue that source domain data is crucial for enhancing the generalization ability of the model, and the key factor is whether robust fine-tuning can be accomplished.

Table 6: Leave-one-domain-out accuracy (%) on four DomainBed benchmarks when using ViT-L/14 pre-trained by CLIP as the backbone.
Algorithm PACS VLCS OH TI Avg
ERM 88.0±4.1 77.5±0.6 53.0±3.2 43.3±0.7 65.5
LoRA [30] 98.1±0.0 83.7±0.3 87.9±0.1 52.7±0.8 80.6
PEGO 98.0±0.1 83.7±0.2 88.6±0.1 57.2±0.5 81.9
Figure 3: The visualization of the feature space (before the classifier) extracted by FT model, pre-trained model, LoRA model, and our model when training the PACS dataset and the test domain is art painting.

5.3 Experiment Using ViT-L as the Backbone

In Sec. 4.3, we choose ViT-B/16 [19] pre-trained by CLIP [55] as the backbone for all the experiments. To verify the effectiveness of our method on larger models, we conduct an experiment using ViT-L/14 pre-trained by CLIP as the backbone. Tab. 6 provides the performances of ERM, LoRA [30], and PEGO on four DomainBed benchmarks. Both LoRA and PEGO outperform ERM significantly on all the benchmarks and achieve similar accuracy on PACS and VLCS. However, on the other two benchmarks, PEGO shows a significant improvement over LoRA, especially on TerraIncognita (52.7% → 57.2%).

5.4 Visualization of Feature Space

To understand whether our method can “Learn to Preserve”, we visualize the difference in feature space between our method and the pre-trained model with PCA [51] and compare with full fine-tuning (FT) and LoRA. As shown in Fig. 3, fine-tuning all layers largely distorts the feature distribution, and LoRA can only partially alleviate this. In contrast, our method successfully preserves the pre-trained features by further applying our orthogonal loss $\mathcal{L}_{preserve}$.

Figure 4: Left: Explained Variance Ratio of the top-10 PCs in LoRA weight and the top-10 PCs in PEGO weight. Right: Cosine similarity between the top-10 PCs of pre-trained weight and the top-8 PCs of PEGO weight.

5.5 Principal Component Analysis on Weights

To confirm that our method achieves orthogonal regularization, we decompose the model weights to obtain their principal components (PCs). Specifically, to verify the effect of “Learn to Preserve”, we expect the learned PCs to be orthogonal to the PCs of the pre-trained weights, and to verify “Learn to Diversify”, we expect more learned PCs than in the original LoRA. As shown in Fig. 4, the weight of our model has more PCs than LoRA (8 vs. 4) and is also orthogonal (zero cosine similarity) to the PCs of the pre-trained weights, validating the effectiveness of our two proposed losses.

6 Conclusion

In this paper, we address the problem of using foundation models in DG from a novel perspective of Learning to Preserve and Diversify. Specifically, we propose Parameter-Efficient Group with Orthogonal regularization (PEGO), which effectively preserves the generalization ability of pre-trained models and learns diverse knowledge. We conduct comparative experiments and ablation experiments to demonstrate the effectiveness and stability of PEGO. Our simple method can be applied to any neural network architecture with linear layers and is training-friendly without additional testing costs.

Acknowledgements

The work is supported by the NSFC Project (62222604, 62206052, 62192783), Jiangsu Natural Science Foundation Project (BK20210224), China Postdoctoral Science Foundation (2024M750424), the Fundamental Research Funds for the Central Universities (020214380120), and the State Key Laboratory Funds for Key Project (ZZKT2024A14).

References

  • [1] Ahuja, K., Caballero, E., Zhang, D., Gagnon-Audet, J.C., Bengio, Y., Mitliagkas, I., Rish, I.: Invariance principle meets information bottleneck for out-of-distribution generalization. In: Conference on Neural Information Processing Systems (NeurIPS). pp. 3438–3450 (2021)
  • [2] Arjovsky, M., Bottou, L., Gulrajani, I., Lopez-Paz, D.: Invariant risk minimization. arXiv:1907.02893 (2019)
  • [3] Arpit, D., Wang, H., Zhou, Y., Xiong, C.: Ensemble of averages: Improving model selection and boosting performance in domain generalization. In: Conference on Neural Information Processing Systems (NeurIPS). pp. 8265–8277 (2022)
  • [4] Bai, H., Sun, R., Hong, L., Zhou, F., Ye, N., Ye, H.J., Chan, S.H.G., Li, Z.: DecAug: Out-of-distribution generalization via decomposed feature representation and semantic augmentation. In: AAAI Conference on Artificial Intelligence (AAAI). pp. 6705–6713 (2021)
  • [5] Balaji, Y., Sankaranarayanan, S., Chellappa, R.: Metareg: Towards domain generalization using meta-regularization. In: Conference on Neural Information Processing Systems (NeurIPS) (2018)
  • [6] Bansal, N., Chen, X., Wang, Z.: Can we gain more from orthogonality regularizations in training deep networks? In: Conference on Neural Information Processing Systems (NeurIPS) (2018)
  • [7] Beery, S., Van Horn, G., Perona, P.: Recognition in terra incognita. In: European Conference on Computer Vision (ECCV). pp. 456–473 (2018)
  • [8] Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv:2108.07258 (2021)
  • [9] Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., Erhan, D.: Domain separation networks. In: Conference on Neural Information Processing Systems (NeurIPS) (2016)
  • [10] Brock, A., Donahue, J., Simonyan, K.: Large scale gan training for high fidelity natural image synthesis. In: International Conference on Learning Representations (ICLR) (2019)
  • [11] Brock, A., Lim, T., Ritchie, J., Weston, N.: Neural photo editing with introspective adversarial networks. In: International Conference on Learning Representations (ICLR) (2017)
  • [12] Carlucci, F.M., D’Innocente, A., Bucci, S., Caputo, B., Tommasi, T.: Domain generalization by solving jigsaw puzzles. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2229–2238 (2019)
  • [13] Cha, J., Chun, S., Lee, K., Cho, H.C., Park, S., Lee, Y., Park, S.: SWAD: Domain generalization by seeking flat minima. In: Conference on Neural Information Processing Systems (NeurIPS). pp. 22405–22418 (2021)
  • [14] Cha, J., Lee, K., Park, S., Chun, S.: Domain generalization by mutual-information regularization with pre-trained models. In: European Conference on Computer Vision (ECCV). pp. 440–457 (2022)
  • [15] Chen, L., Zhang, Y., Song, Y., Shan, Y., Liu, L.: Improved test-time adaptation for domain generalization. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 24172–24182 (2023)
  • [16] Chen, Y., Wang, Y., Pan, Y., Yao, T., Tian, X., Mei, T.: A style and semantic memory mechanism for domain generalization. In: International Conference on Computer Vision (ICCV). pp. 9164–9173 (2021)
  • [17] Cho, J., Nam, G., Kim, S., Yang, H., Kwak, S.: Promptstyler: Prompt-driven style generation for source-free domain generalization. In: International Conference on Computer Vision (ICCV). pp. 15702–15712 (2023)
  • [18] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 248–255 (2009)
  • [19] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR) (2021)
  • [20] Dou, Q., Coelho de Castro, D., Kamnitsas, K., Glocker, B.: Domain generalization via model-agnostic learning of semantic features. In: Conference on Neural Information Processing Systems (NeurIPS) (2019)
  • [21] d’Ascoli, S., Touvron, H., Leavitt, M.L., Morcos, A.S., Biroli, G., Sagun, L.: ConVit: Improving vision transformers with soft convolutional inductive biases. In: International Conference on Machine Learning (ICML). pp. 2286–2296 (2021)
  • [22] Fang, C., Xu, Y., Rockmore, D.N.: Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In: International Conference on Computer Vision (ICCV). pp. 1657–1664 (2013)
  • [23] Farajtabar, M., Azizan, N., Mott, A., Li, A.: Orthogonal gradient descent for continual learning. In: International Conference on Artificial Intelligence and Statistics (AISTATS). pp. 3762–3773 (2020)
  • [24] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of neural networks. Journal of Machine Learning Research (JMLR) pp. 2096–2030 (2016)
  • [25] Gulrajani, I., Lopez-Paz, D.: In search of lost domain generalization. In: International Conference on Learning Representations (ICLR) (2021)
  • [26] Guo, J., Qi, L., Shi, Y.: Domaindrop: Suppressing domain-sensitive channels for domain generalization. In: International Conference on Computer Vision (ICCV). pp. 19114–19124 (2023)
  • [27] Guo, J., Wang, N., Qi, L., Shi, Y.: ALOFT: A lightweight mlp-like architecture with dynamic low-frequency transform for domain generalization. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 24132–24141 (2023)
  • [28] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (2016)
  • [29] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for nlp. In: International Conference on Machine Learning (ICML). pp. 2790–2799 (2019)
  • [30] Hu, E.J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (ICLR) (2022)
  • [31] Huang, Z., Zhou, A., Ling, Z., Cai, M., Wang, H., Lee, Y.J.: A sentence speaks a thousand images: Domain generalization through distilling clip with language guidance. In: International Conference on Computer Vision (ICCV). pp. 11685–11695 (2023)
  • [32] Iwasawa, Y., Matsuo, Y.: Test-time classifier adjustment module for model-agnostic domain generalization. In: Conference on Neural Information Processing Systems (NeurIPS). pp. 2427–2440 (2021)
  • [33] Jia, M., Tang, L., Chen, B.C., Cardie, C., Belongie, S., Hariharan, B., Lim, S.N.: Visual prompt tuning. In: European Conference on Computer Vision (ECCV). pp. 709–727 (2022)
  • [34] Jo, S.Y., Yoon, S.W.: Poem: Polarization of embeddings for domain-invariant representations. AAAI Conference on Artificial Intelligence (AAAI) pp. 8150–8158 (2023)
  • [35] Kim, D., Wang, K., Sclaroff, S., Saenko, K.: A broad study of pre-training for domain generalization and adaptation. In: European Conference on Computer Vision (ECCV). pp. 621–638 (2022)
  • [36] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015)
  • [37] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Conference on Neural Information Processing Systems (NeurIPS) (2012)
  • [38] Kumar, A., Raghunathan, A., Jones, R.M., Ma, T., Liang, P.: Fine-tuning can distort pretrained features and underperform out-of-distribution. In: International Conference on Learning Representations (ICLR) (2022)
  • [39] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE pp. 2278–2324 (1998)
  • [40] Lee, Y., Chen, A.S., Tajwar, F., Kumar, A., Yao, H., Liang, P., Finn, C.: Surgical fine-tuning improves adaptation to distribution shifts. In: International Conference on Learning Representations (ICLR) (2023)
  • [41] Lew, B., Son, D., Chang, B.: Gradient estimation for unseen domain risk minimization with pre-trained models. In: International Conference on Computer Vision Workshops (ICCVW). pp. 4436–4446 (2023)
  • [42] Li, B., Shen, Y., Yang, J., Wang, Y., Ren, J., Che, T., Zhang, J., Liu, Z.: Sparse mixture-of-experts are domain generalizable learners. In: International Conference on Learning Representations (ICLR) (2023)
  • [43] Li, D., Yang, Y., Song, Y.Z., Hospedales, T.: Learning to generalize: Meta-learning for domain generalization. In: AAAI Conference on Artificial Intelligence (AAAI). pp. 3490–3497 (2018)
  • [44] Li, D., Yang, Y., Song, Y.Z., Hospedales, T.M.: Deeper, broader and artier domain generalization. In: International Conference on Computer Vision (ICCV). pp. 5542–5550 (2017)
  • [45] Li, H., Pan, S.J., Wang, S., Kot, A.C.: Domain generalization with adversarial feature learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5400–5409 (2018)
  • [46] LI, X., Grandvalet, Y., Davoine, F.: Explicit inductive bias for transfer learning with convolutional networks. In: International Conference on Machine Learning (ICML). pp. 2825–2834 (2018)
  • [47] Li, Y., Tian, X., Gong, M., Liu, Y., Liu, T., Zhang, K., Tao, D.: Deep domain generalization via conditional invariant adversarial networks. In: European Conference on Computer Vision (ECCV). pp. 624–639 (2018)
  • [48] Li, Z., Hoiem, D.: Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2017)
  • [49] Mahajan, D., Tople, S., Sharma, A.: Domain generalization using causal matching. In: International Conference on Machine Learning (ICML). pp. 7313–7324 (2021)
  • [50] Mao, X., Chen, Y., Jia, X., Zhang, R., Xue, H., Li, Z.: Context-aware robust fine-tuning. International Journal of Computer Vision (IJCV) pp. 1–16 (2023)
  • [51] Pearson, K.: LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin philosophical magazine and journal of science pp. 559–572 (1901)
  • [52] Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., Wang, B.: Moment matching for multi-source domain adaptation. In: International Conference on Computer Vision (ICCV). pp. 1406–1415 (2019)
  • [53] Qi, L., Yang, H., Shi, Y., Geng, X.: Normaug: Normalization-guided augmentation for domain generalization. IEEE Transactions on Image Processing (TIP) pp. 1419–1431 (2024)
  • [54] Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset shift in machine learning. Mit Press (2008)
  • [55] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML). pp. 8748–8763 (2021)
  • [56] Rame, A., Kirchmeyer, M., Rahier, T., Rakotomamonjy, A., Gallinari, P., Cord, M.: Diverse weight averaging for out-of-distribution generalization. In: Conference on Neural Information Processing Systems (NeurIPS). pp. 10821–10836 (2022)
  • [57] Saha, G., Garg, I., Roy, K.: Gradient projection memory for continual learning. In: International Conference on Learning Representations (ICLR) (2021)
  • [58] Shu, Y., Guo, X., Wu, J., Wang, X., Wang, J., Long, M.: CLIPood: Generalizing clip to out-of-distributions. In: International Conference on Machine Learning (ICML) (2023)
  • [59] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International conference on machine learning (ICML). pp. 10347–10357 (2021)
  • [60] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Conference on Neural Information Processing Systems (NeurIPS) (2017)
  • [61] Venkateswara, H., Eusebio, J., Chakraborty, S., Panchanathan, S.: Deep hashing network for unsupervised domain adaptation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5018–5027 (2017)
  • [62] Wang, S., Yu, L., Li, C., Fu, C.W., Heng, P.A.: Learning from extrinsic and intrinsic supervisions for domain generalization. In: European Conference on Computer Vision (ECCV). pp. 159–176 (2020)
  • [63] Wang, X., Chen, T., Ge, Q., Xia, H., Bao, R., Zheng, R., Zhang, Q., Gui, T., Huang, X.J.: Orthogonal subspace learning for language model continual learning. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 10658–10671 (2023)
  • [64] Wang, X., Zhang, J., Qi, L., Shi, Y.: Generalizable decision boundaries: Dualistic meta-learning for open set domain generalization. In: International Conference on Computer Vision (ICCV). pp. 11564–11573 (2023)
  • [65] Wang, Y., Qi, L., Shi, Y., Gao, Y.: Feature-based style randomization for domain generalization. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) pp. 5495–5509 (2022)
  • [66] Wang, Z., Luo, Y., Qiu, R., Huang, Z., Baktashmotlagh, M.: Learning to diversify for single domain generalization. In: International Conference on Computer Vision (ICCV). pp. 834–843 (2021)
  • [67] Wortsman, M., Ilharco, G., Kim, J.W., Li, M., Kornblith, S., Roelofs, R., Lopes, R.G., Hajishirzi, H., Farhadi, A., Namkoong, H., et al.: Robust fine-tuning of zero-shot models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7959–7971 (2022)
  • [68] Xie, D., Xiong, J., Pu, S.: All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6176–6185 (2017)
  • [69] Xu, Q., Zhang, R., Zhang, Y., Wang, Y., Tian, Q.: A fourier-based framework for domain generalization. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14383–14392 (2021)
  • [70] Zang, Y., Li, W., Zhou, K., Huang, C., Loy, C.C.: Unified vision and language prompt learning. arXiv:2210.07225 (2022)
  • [71] Zeng, G., Chen, Y., Cui, B., Yu, S.: Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence (MNI) pp. 364–372 (2019)
  • [72] Zhang, C., Zhang, M., Zhang, S., Jin, D., Zhou, Q., Cai, Z., Zhao, H., Liu, X., Liu, Z.: Delving deep into the generalization of vision transformers under distribution shifts. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7277–7286 (2022)
  • [73] Zhang, J., Qi, L., Shi, Y., Gao, Y.: Generalizable model-agnostic semantic segmentation via target-specific normalization. Pattern Recognition (PR) p. 108292 (2022)
  • [74] Zhang, J., Qi, L., Shi, Y., Gao, Y.: MVDG: A unified multi-view framework for domain generalization. In: European Conference on Computer Vision (ECCV). pp. 161–177 (2022)
  • [75] Zhang, J., Qi, L., Shi, Y., Gao, Y.: Domainadaptor: A novel approach to test-time adaptation. In: International Conference on Computer Vision (ICCV). pp. 18971–18981 (2023)
  • [76] Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. International Journal of Computer Vision (IJCV) pp. 2337–2348 (2022)
  • [77] Zhou, K., Yang, Y., Hospedales, T., Xiang, T.: Deep domain-adversarial image generation for domain generalisation. In: AAAI Conference on Artificial Intelligence (AAAI). pp. 13025–13032 (2020)
  • [78] Zhou, K., Yang, Y., Hospedales, T., Xiang, T.: Learning to generate novel domains for domain generalization. In: European Conference on Computer Vision (ECCV). pp. 561–578 (2020)
  • [79] Zhou, K., Yang, Y., Qiao, Y., Xiang, T.: Domain adaptive ensemble learning. IEEE Transactions on Image Processing (TIP) pp. 8008–8018 (2021)
  • [80] Zhou, K., Yang, Y., Qiao, Y., Xiang, T.: Domain generalization with mixstyle. In: International Conference on Learning Representations (ICLR) (2021)

Learn to Preserve and Diversify: Parameter-Efficient Group with Orthogonal Regularization for Domain Generalization
—Supplementary—

Jiajun Hu, Jian Zhang, Lei Qi, Yinghuan Shi, Yang Gao

Appendix A Algorithm

Algorithm 1 Parameter-Efficient Group with Orthogonal Regularization
1: Input: training data $D_{tr}$, pre-trained vision transformer $F(\cdot;\theta)$ with $B$ blocks, classification head $H(\cdot;\psi)$, group of LoRA modules $g(\cdot;\phi)$, balancing coefficient $\alpha$, number of iterations $T$
2: Initialization: inject $g(\cdot;\phi)$ into $F(\cdot;\theta)$ to obtain the pre-trained model equipped with the LoRA group, $G(\cdot;\Phi)$, and freeze the pre-trained weights
3: for $t = 1, 2, \dots, T$ do
4:     sample a batch $(x, y)$ from $D_{tr}$
5:     $\mathcal{L}_{cls} \leftarrow \mathcal{L}_{CE}(H(G(x;\Phi);\psi), y)$ ▷ Eq. (2)
6:     $\mathcal{L}_{OR} \leftarrow 0$
7:     for $b = 1, 2, \dots, B$ do
8:         $\mathcal{L}_{OR} \leftarrow \mathcal{L}_{OR} + \mathcal{L}_{O}(W^{q}_{b}) + \mathcal{L}_{O}(W^{v}_{b})$ ▷ Eq. (8)
9:     end for
10:    $\mathcal{L}_{final} \leftarrow \mathcal{L}_{cls} + \alpha\mathcal{L}_{OR}$ ▷ Eq. (9)
11:    update $g(\cdot;\phi)$ and $H(\cdot;\psi)$ to minimize $\mathcal{L}_{final}$
12: end for
13: Merge the LoRA group into the pre-trained weights ▷ Eq. (10)
14: return $G$, $H$
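
The sketch below illustrates one iteration of Algorithm 1 in PyTorch-like code. The module layout (model.blocks, lora_q, lora_v, .A, .B, .base_weight) is hypothetical, and the two orthogonality terms are only one plausible reading of the preserve-and-diversify objective; the exact definitions of Eqs. (2), (8) and (9) are those given in the main paper.

import torch.nn.functional as F

def pairwise_orthogonality(mats):
    """Sum of squared Frobenius inner products over all pairs of matrices.

    Hypothetical stand-in for the per-layer penalty L_O in Eq. (8): it is zero
    when the matrices in `mats` are mutually orthogonal under the Frobenius
    inner product.
    """
    loss = mats[0].new_zeros(())
    for i in range(len(mats)):
        for j in range(i + 1, len(mats)):
            loss = loss + (mats[i].flatten() @ mats[j].flatten()) ** 2
    return loss

def pego_step(model, head, batch, optimizer, alpha=1e-3):
    """One training iteration of Algorithm 1 (sketch, not the released code).

    Assumes every block exposes `lora_q` and `lora_v`: lists of N LoRA modules
    with trainable factors `.A` (r x d) and `.B` (d x r), attached to a frozen
    base projection `.base_weight` (d x d).
    """
    x, y = batch
    loss_cls = F.cross_entropy(head(model(x)), y)          # Eq. (2)

    loss_or = loss_cls.new_zeros(())
    for block in model.blocks:                             # the B blocks
        for group in (block.lora_q, block.lora_v):
            updates = [m.B @ m.A for m in group]           # low-rank updates
            # diversify: keep the N LoRA updates mutually orthogonal
            loss_or = loss_or + pairwise_orthogonality(updates)
            # preserve: keep each update orthogonal to the frozen weight
            base = group[0].base_weight.detach().flatten()
            for u in updates:
                loss_or = loss_or + (base @ u.flatten()) ** 2

    loss = loss_cls + alpha * loss_or                      # Eq. (9)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()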

Appendix B Evaluation Protocol and Hyperparameter Search

In this section, we provide a detailed description of our evaluation protocol and hyperparameter (HP) search. In line with prior research in DG, we designate one domain within each dataset as the unseen test domain, while the remaining domains serve as source domains. The final results are obtained by averaging the accuracies across all test domains. To maintain consistency with DomainBed [25], 20% of the samples from each source domain are held out for validation, and we adopt the training-domain validation strategy for hyperparameter search and model selection. Furthermore, all experiments are conducted with three different random seeds to ensure the reliability and reproducibility of our results.
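
As a concrete illustration of this protocol, the following sketch enumerates the leave-one-domain-out splits together with the 80/20 training-domain validation split; the data structures are hypothetical and the snippet is not taken from our released code.

import numpy as np

def leave_one_domain_out_splits(domains, holdout_fraction=0.2, seed=0):
    """Enumerate DG evaluation splits as described above.

    `domains` is a hypothetical dict mapping a domain name to its list of
    samples. For each choice of test domain, every remaining (source) domain
    is split into 80% training / 20% validation, matching the DomainBed
    training-domain validation strategy.
    """
    rng = np.random.RandomState(seed)
    for test_domain in domains:
        train, val = [], []
        for name, samples in domains.items():
            if name == test_domain:
                continue
            idx = rng.permutation(len(samples))
            n_val = int(holdout_fraction * len(samples))
            val += [samples[i] for i in idx[:n_val]]
            train += [samples[i] for i in idx[n_val:]]
        yield test_domain, train, val, domains[test_domain]

# Usage: select hyperparameters by validation accuracy on `val`, report
# accuracy on the held-out test domain, then average over all test domains
# and over three random seeds.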

To reduce the training overhead of HP search, we do not tune the algorithm-agnostic HPs in DomainBed (e.g., learning rate, dropout, weight decay): for all experiments, the learning rate, dropout, and weight decay are fixed to 5e-4, 0, and 0, respectively. For the algorithm-specific HPs, we fix the LoRA [30] rank r to 4 and the balancing coefficient α to 1e-3 in all experiments, and only search the number of LoRA modules N over {2, 4, 6}. Tab. 7 summarizes the selected value of N on the five DomainBed benchmarks.

Table 7: The hyperparameter N used on the five DomainBed benchmarks (OH: OfficeHome, TI: TerraIncognita, DN: DomainNet) in our experiments.
Hyperparameter PACS VLCS OH TI DN
N 2 4 4 4 4

As shown in the ablation experiments of the main body (Sec. 5.1, Pages 12-13), the performance of our method is not sensitive to the algorithm-specific HPs. In addition, to save GPU memory, we use half-precision (FP16) during training and inference for all experiments.
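
For completeness, the fixed training settings described above can be summarized as a single configuration; the field names below are illustrative rather than taken from our code release, while the values follow this section and Tab. 7.

# Illustrative PEGO training configuration (hypothetical field names).
PEGO_CONFIG = {
    "backbone": "CLIP ViT-B/16",
    "learning_rate": 5e-4,     # algorithm-agnostic HPs, fixed (not tuned)
    "dropout": 0.0,
    "weight_decay": 0.0,
    "lora_rank": 4,            # r, fixed for all experiments
    "alpha": 1e-3,             # balancing coefficient for the OR loss
    "num_lora_modules": {"PACS": 2, "VLCS": 4, "OfficeHome": 4,
                         "TerraIncognita": 4, "DomainNet": 4},  # searched over {2, 4, 6}
    "precision": "fp16",       # half precision for training and inference
    "num_seeds": 3,
}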

Table 8: Performance comparison with more methods. Leave-one-domain-out accuracy (%) on five DomainBed benchmarks.
Algorithm PACS VLCS OH TI DN Avg
Auto-RGN [40] 90.3±0.5 80.7±0.3 76.7±0.5 48.5±0.6 51.2±0.7 69.5
CoOp [76] 96.1±0.2 80.5±0.6 84.2±0.1 49.4±0.6 59.3±0.1 73.9
UPT [70] 96.5±0.2 82.7±0.1 84.4±0.2 54.9±0.9 60.2±0.1 75.7
PEGO 96.5±0.1 83.2±0.3 84.2±0.1 57.3±0.3 59.3±0.1 76.1

Appendix C Comparisons with More Methods

In this section, we compare PEGO with additional methods, including Auto-RGN [40], CoOp [76], and UPT [70]. Auto-RGN measures the Relative Gradient Norm (RGN) of each transformer layer and assigns each layer a different learning rate according to its RGN (a rough sketch is given below). CoOp and UPT are prompt-learning methods that introduce learnable textual or visual prompts for fine-tuning. As shown in Tab. 8, our method achieves the best average performance among these methods, benefiting from the proposed preserving and diversifying losses.
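
The snippet below is only a rough illustration of the Auto-RGN idea: it assumes RGN is the ratio of a layer's gradient norm to its parameter norm and rescales per-layer learning rates proportionally; the precise definition and scaling rule are those of [40], and the model/optimizer layout here is hypothetical.

def relative_gradient_norms(model):
    """Per-layer ratio of gradient norm to parameter norm (call after backward()).

    Assumes `model.blocks` iterates over transformer layers; this is an
    approximation of the RGN used in [40].
    """
    rgns = []
    for block in model.blocks:
        grad_sq, param_sq = 0.0, 0.0
        for p in block.parameters():
            if p.grad is not None:
                grad_sq += p.grad.float().pow(2).sum().item()
                param_sq += p.float().pow(2).sum().item()
        rgns.append((grad_sq ** 0.5) / (param_sq ** 0.5 + 1e-12))
    return rgns

def scale_layer_lrs(param_groups, rgns):
    """Rescale each layer's learning rate by its normalized RGN.

    Assumes one optimizer parameter group per layer, each created with a
    stored "base_lr" entry.
    """
    max_rgn = max(rgns) + 1e-12
    for group, rgn in zip(param_groups, rgns):
        group["lr"] = group["base_lr"] * (rgn / max_rgn)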

Table 9: Trainable parameters of different methods on the PACS dataset.
Method FT Adapter [29] LoRA [30] VPT [33] CoOp [76] UPT [70] Auto-RGN [40] PEGO
Parameters 86M 0.16M 0.15M 0.10M 2048 0.57M 86M 0.29M

Appendix D Trainable Parameters of Different Methods

The number of trainable parameters differs slightly across datasets because the output dimension of the classification head depends on the number of classes; we therefore report the trainable parameters of all methods on the PACS dataset. As shown in Tab. 9, our method is far more parameter-efficient than FT (0.29M vs. 86M trainable parameters); a rough breakdown of this count is given below.
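
As a sanity check on the PEGO entry in Tab. 9, the following back-of-the-envelope count assumes a ViT-B/16 backbone (12 blocks, hidden dimension 768), LoRA of rank 4 injected into the query and value projections of every block, and N = 2 LoRA modules per projection on PACS (Tab. 7); the small, dataset-dependent classification head is excluded.

# Approximate trainable-parameter count for PEGO on PACS (assumptions above).
blocks, dim, rank, n_modules = 12, 768, 4, 2
params_per_lora = 2 * dim * rank              # A (rank x dim) + B (dim x rank) = 6,144
per_block = 2 * n_modules * params_per_lora   # query and value projections
total = blocks * per_block
print(total)                                  # 294,912 ≈ 0.29M, matching Tab. 9
print(blocks * 2 * params_per_lora)           # plain LoRA (N = 1): 147,456 ≈ 0.15M

Both numbers are consistent with the 0.29M and 0.15M entries reported in Tab. 9.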

Appendix E Detailed Results for Each Domain

In this section, Tabs. 10, 11, 12, 13 and 14 report the per-domain accuracy of the evaluated algorithms on the five DomainBed [25] benchmarks: PACS [44], VLCS [22], OfficeHome [61], TerraIncognita [7] and DomainNet [52]. Since SWAD [13], SMA [3], and GESTUR [41] do not report per-domain results in their papers, we only present the results of ERM, MIRO [14], Adapter [29], LoRA [30], VPT [33], L2-SP [46], LP-FT [38], LwF [48] and PEGO.

Table 10: Leave-one-domain-out accuracy (%) of each domain on PACS when using ViT-B/16 pre-trained by CLIP as the backbone.
Algorithm A C P S Avg
ERM (FT) 80.5±3.4 86.4±0.6 93.4±1.0 73.2±3.9 83.4±0.4
MIRO [14] 95.6±0.6 96.6±0.2 99.7±0.1 90.7±2.5 95.6±0.6
Adapter [29] 91.8±0.2 93.1±0.4 98.8±0.1 84.4±1.6 92.0±0.5
LoRA [30] 97.4±0.3 97.5±0.1 99.7±0.1 89.2±0.4 96.0±0.1
VPT [33] 97.1±0.4 97.8±0.1 99.9±0.0 90.1±0.9 96.2±0.3
L2-SP [46] 93.9±1.0 94.3±0.6 97.8±0.3 83.1±2.3 92.2±0.7
LwF [48] 93.2±1.4 94.2±0.7 98.5±0.2 88.8±0.4 93.6±0.6
LP-FT [38] 89.1±2.8 97.8±0.1 99.8±0.0 89.9±0.2 94.2±0.7
PEGO 97.1±0.1 98.5±0.2 99.7±0.1 90.9±0.2 96.5±0.1
Table 11: Leave-one-domain-out accuracy (%) of each domain on VLCS when using ViT-B/16 pre-trained by CLIP as the backbone.
Algorithm C L S V Avg
ERM (FT) 95.4±0.6 65.6±0.9 72.9±2.2 69.9±2.2 75.9±1.1
MIRO [14] 98.9±0.5 67.1±1.0 81.9±0.4 81.2±0.2 82.3±0.2
Adapter [29] 95.7±0.2 65.9±0.9 79.5±0.7 78.0±0.7 79.8±0.4
LoRA [30] 96.1±0.4 68.1±0.2 83.5±0.3 83.1±0.4 82.7±0.0
VPT [33] 96.8±0.5 67.2±0.3 84.9±0.2 82.6±0.4 82.9±0.3
LP-FT [38] 94.5±0.3 62.0±0.3 76.4±1.3 77.0±2.9 77.5±0.4
L2-SP [46] 96.8±0.9 66.2±1.0 78.5±1.6 82.5±0.2 81.0±0.2
LwF [48] 99.1±0.3 65.5±1.4 80.4±1.2 82.6±0.2 81.9±0.4
PEGO 96.4±0.1 67.8±0.5 83.3±0.3 85.2±1.0 83.2±0.3
Table 12: Leave-one-domain-out accuracy (%) of each domain on OfficeHome when using ViT-B/16 pre-trained by CLIP as the backbone.
Algorithm A C P R Avg
ERM (FT) 59.2±1.3 56.1±0.6 74.8±0.1 75.4±0.8 66.4±0.4
MIRO [14] 80.8±0.1 72.2±0.5 88.6±0.3 88.5±0.2 82.5±0.1
Adapter [29] 67.1±1.2 61.7±0.4 81.5±0.5 81.3±0.6 72.9±0.4
LoRA [30] 83.2±0.2 71.8±0.4 89.1±0.2 89.5±0.2 83.4±0.1
VPT [33] 82.9±0.6 71.5±0.6 89.7±0.1 89.5±0.3 83.4±0.3
L2-SP [46] 62.6±1.3 57.1±0.4 76.4±0.8 76.6±0.2 68.2±0.5
LP-FT [38] 64.5±1.4 68.0±0.4 76.7±0.3 79.0±0.2 72.0±0.4
LwF [48] 79.0±1.7 70.4±0.7 86.8±0.3 86.7±0.4 80.7±0.4
PEGO 83.7±0.3 73.3±0.4 90.3±0.3 89.5±0.3 84.2±0.1
Table 13: Leave-one-domain-out accuracy (%) of each domain on TerraIncognita when using ViT-B/16 pre-trained by CLIP as the backbone.
Algorithm L100 L38 L43 L46 Avg
ERM (FT) 38.1±0.3 26.7±2.5 41.9±1.3 34.4±1.8 35.3±0.6
MIRO [14] 65.0±0.6 46.7±0.7 60.8±1.3 44.9±0.1 54.3±0.3
Adapter [29] 38.8±5.1 44.9±2.0 56.2±0.3 37.8±1.3 44.4±0.8
VPT [33] 55.0±3.9 52.6±1.3 61.3±0.4 47.8±0.4 54.2±0.7
LoRA [30] 54.6±2.4 52.7±1.2 61.2±0.8 50.5±0.5 54.8±0.6
LP-FT [38] 42.8±4.2 33.2±3.3 46.7±1.1 33.2±1.1 39.0±1.5
L2-SP [46] 45.6±5.5 27.2±3.5 49.9±1.3 34.8±0.3 39.4±1.6
LwF [48] 44.4±1.8 34.9±2.6 47.5±1.3 30.9±3.8 39.4±0.6
PEGO 63.2±0.3 56.4±0.3 61.8±1.0 47.9±0.5 57.3±0.3
Table 14: Leave-one-domain-out accuracy (%) of each domain on DomainNet when using ViT-B/16 pre-trained by CLIP as the backbone.
Algorithm clipart infograph painting quickdraw real sketch Avg
ERM (FT) 68.0±0.1 22.5±0.4 46.5±2.4 18.5±0.6 58.7±1.6 52.5±0.7 44.4±0.5
MIRO [14] 74.9±0.1 37.1±0.2 59.8±0.4 18.7±0.8 72.2±0.1 61.2±0.6 54.0±0.2
Adapter [29] 75.6±0.2 37.6±0.2 63.1±0.2 19.4±0.3 77.2±0.1 64.2±0.3 56.2±0.1
LoRA [30] 76.4±0.1 43.3±0.3 63.6±0.3 19.5±0.3 79.2±0.1 66.4±0.1 58.1±0.1
VPT [33] 76.7±0.0 43.1±0.3 66.6±0.1 19.4±0.2 80.3±0.0 67.4±0.1 58.9±0.1
LP-FT [38] 70.9±0.2 26.7±0.3 55.8±0.3 17.1±0.5 66.3±0.4 57.5±0.4 49.1±0.3
L2-SP [46] 70.6±0.1 28.4±0.3 55.6±0.5 18.3±0.5 68.5±0.4 58.4±0.1 50.0±0.2
LwF [48] 73.2±0.1 30.6±0.3 58.0±0.5 18.6±0.4 69.1±0.2 60.8±0.0 51.7±0.1
PEGO 76.8±0.1 44.6±0.2 67.1±0.3 18.8±0.2 80.5±0.1 67.7±0.1 59.3±0.1

Appendix F Limitation

Like LoRA, our method cannot be directly applied to traditional convolutional neural networks that do not contain the required linear layers (e.g., ResNet [28]); however, it can be applied to any Transformer [60] architecture. With the increasing number of Transformer-based architectures being proposed (e.g., ViT [19], ConViT [21], DeiT [59]), our method therefore remains applicable to a wide and growing range of networks.