
KIND: Knowledge Integration and Diversion in Diffusion Models

Yucheng Xie1,2, Fu Feng1,2, Jing Wang1,2, Xin Geng1,2 (co-corresponding author), Yong Rui3
Abstract

As model parameters grow, pre-trained models have become the preferred backbone, and Parameter-Efficient Fine-Tuning (PEFT) techniques typically keep their parameters fixed. However, pre-trained models may not always be optimal, especially when there are discrepancies between training tasks and target tasks, potentially resulting in negative transfer. To address this, we introduce KIND, which performs Knowledge INtegration and Diversion in diffusion models. KIND first integrates knowledge by decomposing the parameter matrices of models into the product of $U$, $\Sigma$, and $V$ matrices, formally inspired by singular value decomposition (SVD). It then explicitly partitions the components of these matrices into learngenes and tailors, which condense common and class-specific knowledge respectively, through a class gate. In this way, KIND redefines traditional pre-training by shifting the training objective from maximizing model performance on the current task to condensing transferable common knowledge, leveraging the Learngene framework. We conduct experiments on ImageNet-1K and compare KIND with PEFT and other learngene methods. The results indicate that KIND achieves state-of-the-art performance: the images generated by KIND on DiT-L/2 achieve decreases of more than 6.54 in FID and 1.07 in sFID, while using only 45.4M trainable parameters and saving at least 35.4G FLOPs in computational cost.

Introduction

The increase in model size entails higher computational costs, making direct fine-tuning of pre-trained models a common approach in model training (Qiu et al. 2020; Han et al. 2021). However, this training paradigm still poses challenges, especially when training large models or when training data are limited, which greatly increases the risk of overfitting. To address this, parameter-efficient fine-tuning (PEFT) techniques such as adapters (Hu et al. 2023; Chen et al. 2022), LoRA (Hu et al. 2022; Hayou, Ghosh, and Yu 2024), and their variants (Zhang et al. 2023; Valipour et al. 2023; Liu et al. 2024a) have been developed. These methods fix the parameters of pre-trained models and create a relatively compact parameter space by adding a small number of parameters. By fine-tuning only these additional parameters, the model can quickly adapt to new tasks, and such techniques have been widely used in tasks such as image segmentation (Sun et al. 2022; Peng et al. 2024), image editing (Zhang et al. 2024), style transfer (Chen, Tennent, and Hsu 2024), and personalization (Zhang and Pilanci 2024).

Figure 1: (a) PEFT techniques typically fix the parameters of models pre-trained with traditional objectives and add parameters (e.g., LoRA) for fine-tuning on downstream tasks. (b) KIND first integrates and diverts knowledge during model pre-training, so that it can condense common knowledge into the learngenes and capture task-specific knowledge with the tailors. Then, the learngenes serve as the backbone for adapting to novel tasks by adding new tailors adaptively to learn the knowledge specific to these tasks.

However, traditional pre-trained models are often trained on large datasets (e.g., ImageNet-21K (Ridnik et al. 2021)) with a primary focus on maximizing performance, without considering their transferability to downstream tasks. This may lead to issues such as negative transfer (Wang et al. 2019; Rosenstein et al. 2005), particularly when the downstream tasks are not sufficiently similar to the training datasets, suggesting that these models may not always be the optimal choice for backbones in diverse applications, as shown in Figure 1. This raises the question: Can we filter knowledge during the pre-training process to identify more suitable common knowledge for transfer, thus constructing a backbone with enhanced transferability?

Recently, a novel knowledge transfer framework called Learngene has been introduced (Wang et al. 2023), drawing inspiration from the efficient transfer of genetic information in biological evolution. Unlike traditional transfer learning methods, Learngene emphasizes condensing common knowledge into network fragments (Feng et al. 2023) known as learngenes. The learngenes can enable networks inheriting them to adaptively learn task-specific knowledge, significantly enhancing the efficiency of knowledge transfer and the adaptability of networks to downstream tasks. However, current implementations of Learngene (Feng et al. 2024; Xia et al. 2024) have primarily focused on model initialization across various sizes, without fully exploring the extraction of common knowledge from training data. This limitation is particularly evident in diffusion models for image generation tasks, where the problem of effective knowledge transfer remains unresolved.

To address the aforementioned issues, we propose KIND, a novel method within the Learngene framework that performs Knowledge INtegration and Diversion in diffusion models. Inspired by the factorization form of singular value decomposition (SVD), we integrate the primary weight matrices in diffusion models as the product of $U$, $\Sigma$, and $V$ matrices. Unlike other SVD-based PEFT methods (Han et al. 2023; Zhang and Pilanci 2024; Robb et al. 2020), we do not directly apply SVD to the pre-trained model weights. Instead, we explicitly integrate knowledge during the training process by updating the $U$, $\Sigma$, and $V$ matrices, allowing us to separate knowledge into common and class-specific categories. We divide the $U$, $\Sigma$, and $V$ matrices into learngenes and tailors, with row or column vectors as units, to condense common and class-specific knowledge, respectively. This process is facilitated by a class gate, which selectively updates only the learngenes and the tailors corresponding to the class of each training sample, thus achieving the diversion of common and class-specific knowledge.

We employ Diffusion Transformers (DiTs) (Peebles and Xie 2023) as the foundational structure of our diffusion models, and further group the classes of ImageNet-1K into superclasses, partitioning them into training classes and novel classes. The training classes are used to integrate and divert knowledge, thereby extracting learngenes, while the novel classes simulate diverse downstream tasks to evaluate the adaptability of KIND and other PEFT methods. Our results demonstrate that employing the extracted learngenes as the backbone significantly outperforms full-parameter fine-tuning and maintains a considerable advantage over other PEFT methods. This indicates that the learngenes extracted by KIND effectively condense substantial common knowledge, while the tailors offer flexibility, reducing the risk of overfitting and quickly adapting to new tasks.

Our main contributions are as follows: (1) We introduce a novel learngene method called KIND, which integrates and diverts knowledge during the pre-training of diffusion models. This is the first application of learngenes to diffusion models, and the first instance in PEFT of fine-tuning with learngenes, rather than traditional pre-trained models, as the backbone. (2) We propose to modify the pre-training objective from maximizing model performance to condensing as much common knowledge as possible, thereby developing a backbone more conducive to transfer learning. (3) We further categorize and partition ImageNet-1K to create a benchmark suitable for training and evaluating PEFT and learngene methods on diffusion models. Detailed experiments demonstrate that KIND achieves state-of-the-art performance compared to other PEFT methods, while also reducing storage and computational overhead.

Related Work

Parameter Efficient Fine-Tuning (PEFT)

The increase in model parameters makes fine-tuning all parameters of pre-trained models resource-intensive and time-consuming (Touvron et al. 2021; Achiam et al. 2023). To address this, PEFT techniques have been developed to adapt large pre-trained models to new tasks by fine-tuning only a small set of additional parameters, known as adapters (Houlsby et al. 2019; Hu et al. 2023; Chen et al. 2022). Techniques such as LoRA (Hu et al. 2022; Hayou, Ghosh, and Yu 2024) further reduce the number of trainable parameters under a low-rank hypothesis, while Orthogonal Fine-Tuning (OFT) (Liu et al. 2024b; Qiu et al. 2023) orthogonalizes the added parameters to better preserve the characteristics of pre-trained models. Recent approaches avoid adding parameters altogether by applying SVD to pre-trained weight matrices and adjusting the singular values, known as spectral shift (Han et al. 2023; Robb et al. 2020; Sun et al. 2022), or by fine-tuning the singular vectors (Zhang et al. 2024; Zhang and Pilanci 2024). However, existing PEFT methods directly use models pre-trained with traditional objectives, without considering their suitability as a universal backbone. To address this, KIND separates knowledge into common and class-specific knowledge during pre-training. We then transfer the weights condensing common knowledge (i.e., the learngenes) to novel tasks and attach trainable parameters (i.e., the tailors) to learn class-specific knowledge, enhancing both efficiency and adaptability.

Learngene

Learngene is an innovative knowledge transfer approach inspired by the transfer of genetic information in nature (Wang et al. 2023; Feng et al. 2023). In biological evolution, core information is compressed into “genes” through genetic bottlenecks (Bohacek and Mansuy 2015; Waddington 1942) and then transferred to descendants, equipping them with instincts to quickly acquire environment-specific skills (Wong and Candolin 2015; Sih, Ferrari, and Harris 2011). Similarly, in artificial neural networks, learngenes compress core common knowledge into network fragments, which are then transferred to descendant networks to facilitate the acquisition of task-specific knowledge (Feng, Wang, and Geng 2024). Current learngene methods, such as Heur-LG (Wang et al. 2022) and Auto-LG (Wang et al. 2023), directly transfer selected layers from pre-trained models. Other methods, like TLEG (Xia et al. 2024) and WAVE (Feng et al. 2024), employ auxiliary networks to condense knowledge into the learngenes under specific rules. However, these approaches primarily focus on image classification tasks and fail to effectively distinguish between common and class-specific knowledge. In contrast, KIND explores the application of learngenes to diffusion models and image generation tasks. By constructing tailors to hold the class-specific knowledge of each class, KIND extracts more transferable learngenes that condense more common knowledge through knowledge integration and diversion.

Methods

Preliminary

Latent Diffusion Models

Latent diffusion models transfer the diffusion process from the high-resolution pixel space to the latent space by employing an autoencoder $\mathcal{E}$, which encodes an image $x$ into a latent code $z=\mathcal{E}(x)$. A diffusion model is then trained to generate the corresponding latent code in a denoising process, with the goal of minimizing the following objective:

$\mathcal{L}=\mathbb{E}_{z,c,\varepsilon,t}\big[\|\varepsilon-\varepsilon_{\theta}(z_{t}|c,t)\|^{2}_{2}\big]$ (1)

Here, $\varepsilon_{\theta}$ is a noise prediction network, which is trained to predict the noise $\varepsilon$ added to $z_{t}$ at timestep $t$ under the condition vector $c$.
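To make the objective concrete, the following is a minimal PyTorch sketch of one training step under Eq. (1), assuming a DDPM-style forward noising process and a generic noise-prediction network `eps_theta(z_t, c, t)`; the function and argument names are illustrative, not the paper's implementation.

```python
import torch

def diffusion_loss(eps_theta, z, c, alphas_cumprod):
    """Sketch of Eq. (1): predict the noise added to a clean latent z.

    eps_theta:      noise-prediction network taking (z_t, c, t)  (assumed interface)
    z:              clean latent codes, shape (B, ...)
    c:              class-condition labels / vectors
    alphas_cumprod: cumulative product of the noise schedule, shape (num_timesteps,)
    """
    B = z.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=z.device)   # random timesteps
    eps = torch.randn_like(z)                                          # target noise
    a_bar = alphas_cumprod[t].view(B, *([1] * (z.dim() - 1)))
    z_t = a_bar.sqrt() * z + (1 - a_bar).sqrt() * eps                  # forward (noising) process
    eps_pred = eps_theta(z_t, c, t)
    return torch.mean((eps - eps_pred) ** 2)                           # || eps - eps_theta ||_2^2
```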

Diffusion Transformers (DiTs)

Diffusion Transformers (DiTs) introduce a transformer-based architecture for noise prediction in place of the traditional UNet. Given an image $x\in\mathbb{R}^{H_{1}\times H_{2}\times C}$ and its latent code $z\in\mathbb{R}^{h_{1}\times h_{2}\times c}$ encoded by $\mathcal{E}$, DiT first divides the latent code $z$ into $T$ patches and maps these patches into $d$-dimensional patch embeddings with added position embeddings. The number of tokens $T$ is determined by the patch-size hyperparameter $p$, where $T=\frac{h_{1}\cdot h_{2}}{p^{2}}$.

The structure of DiTs resembles that of Vision Transformers (ViTs), comprising $L$ stacked layers, each containing a Multi-Head Self-Attention (MSA) mechanism and a Pointwise Feedforward (PFF) layer. In each layer, a self-attention head $A_{i}$ performs self-attention using a query $Q_{i}$, key $K_{i}$, and value $V_{i}\in\mathbb{R}^{T\times d}$, computed with parameter matrices $W_{q}^{i}$, $W_{k}^{i}$, and $W_{v}^{i}\in\mathbb{R}^{D\times d}$, which is defined as:

$A_{i}=\text{softmax}\big(\frac{Q_{i}K_{i}^{\top}}{\sqrt{d}}\big)V_{i}\,,\;A_{i}\in\mathbb{R}^{T\times d}$ (2)

The MSA mechanism integrates $n_{h}$ self-attention heads $A_{i}$ and projects the concatenated outputs using a weight matrix $W_{o}$:

$\text{MSA}=\text{concat}(A_{1},A_{2},\ldots,A_{n_{h}})W_{o}\,,\;W_{o}\in\mathbb{R}^{hd\times D}$ (3)

In the implementation of MSA, the matrices $W_{q}^{i}$, $W_{k}^{i}$, and $W_{v}^{i}\in\mathbb{R}^{D\times d}$ of the $n_{h}$ attention heads are combined into three parameter matrices $W_{q}$, $W_{k}$, and $W_{v}\in\mathbb{R}^{D\times hd}$.

The PFF layer comprises two linear transformations $W_{in}\in\mathbb{R}^{D\times D^{\prime}}$ and $W_{out}\in\mathbb{R}^{D^{\prime}\times D}$ with a GELU (Hendrycks and Gimpel 2016) activation function:

$\text{PFF}(x)=\text{GELU}(xW_{in}+b_{1})W_{out}+b_{2}$ (4)

where $b_{1}$ and $b_{2}$ are the biases of the linear transformations, and $D^{\prime}$ denotes the hidden-layer dimension.
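For reference, the following is a simplified PyTorch sketch of a single transformer block following Eqs. (2)-(4); it omits DiT-specific details such as adaptive layer-norm conditioning on timestep and class, and the module and variable names are our own.

```python
import torch
import torch.nn as nn

class SimpleDiTBlock(nn.Module):
    """Simplified block with MSA (Eqs. 2-3) and PFF (Eq. 4); DiT's adaLN conditioning is omitted."""
    def __init__(self, D, n_heads, D_hidden):
        super().__init__()
        self.n_heads, self.d = n_heads, D // n_heads   # assumes D divisible by n_heads
        self.W_q = nn.Linear(D, D, bias=False)         # combined W_q for all heads
        self.W_k = nn.Linear(D, D, bias=False)
        self.W_v = nn.Linear(D, D, bias=False)
        self.W_o = nn.Linear(D, D, bias=False)
        self.W_in = nn.Linear(D, D_hidden)
        self.W_out = nn.Linear(D_hidden, D)

    def forward(self, x):                              # x: (B, T, D) patch embeddings
        B, T, D = x.shape
        split = lambda m: m(x).view(B, T, self.n_heads, self.d).transpose(1, 2)
        Q, K, V = split(self.W_q), split(self.W_k), split(self.W_v)
        A = torch.softmax(Q @ K.transpose(-2, -1) / self.d ** 0.5, dim=-1) @ V   # Eq. (2)
        msa = self.W_o(A.transpose(1, 2).reshape(B, T, D))                       # Eq. (3)
        x = x + msa                                                              # residual connection
        return x + self.W_out(nn.functional.gelu(self.W_in(x)))                  # Eq. (4)
```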

Knowledge Integration in Weight Matrices

Figure 2: (a) For each weight matrix in DiTs, we integrate it as the product of matrices $U$, $\Sigma$ and $V$, formally inspired by SVD. The components of these matrices are then explicitly partitioned into learngenes and tailors, which condense common and class-specific knowledge, respectively. (b) Knowledge is diverted through a class gate ensuring that each training image updates only the learngenes and its class-related tailors, so that common knowledge is condensed into the learngenes, while knowledge specific to each class is diverted into the corresponding tailors.

FSGAN (Robb et al. 2020) introduces spectral shifts by directly applying SVD to pre-trained model parameters and fine-tuning the singular values for model adaptation. Similar approaches are used by (Sun et al. 2022) and (Han et al. 2023) for segmentation and image generation, respectively. The success of these methods demonstrates that SVD can create a compact parameter space in pre-trained models, facilitating efficient fine-tuning.

However, directly applying SVD to pre-trained parameter matrices decomposes them according to fixed orthogonalization rules, which reduces the interpretability of the singular vectors and makes it difficult to determine which of them contain common knowledge suitable for transfer. To address this issue, we perform knowledge integration by constructing weight matrices from the SVD-like factors $U$, $\Sigma$ and $V$, rather than applying SVD to pre-trained parameter matrices.

For the DiT architecture, the main weight matrices in an $L$-layer DiT are $\mathcal{W}=\{W_{q}^{(1\thicksim L)},W_{k}^{(1\thicksim L)},W_{v}^{(1\thicksim L)},W_{o}^{(1\thicksim L)},W_{in}^{(1\thicksim L)},W_{out}^{(1\thicksim L)}\}$. Let $W_{\star}^{(l)}$ denote any weight matrix in layer $l$, where $\star\in\mathcal{S}$ and $\mathcal{S}=\{q,k,v,o,in,out\}$ is the set of subscripts. The matrices $U_{\star}^{(l)}$, $\Sigma_{\star}^{(l)}$, $V_{\star}^{(l)}$ are the corresponding components that constitute $W_{\star}^{(l)}$, which is calculated as:

$W_{\star}^{(l)}=U_{\star}^{(l)}\Sigma_{\star}^{(l)}{V_{\star}^{(l)}}^{\top}$ (5)

where $\Sigma=\text{diag}(\boldsymbol{\sigma})$ with $\boldsymbol{\sigma}=[\sigma_{1},\sigma_{2},\ldots,\sigma_{r}]$. Here, $W_{\star}^{(l)}\in\mathbb{R}^{m_{1}\times m_{2}}$, $U_{\star}^{(l)}=[u_{1},u_{2},\ldots,u_{r}]\in\mathbb{R}^{m_{1}\times r}$, $\Sigma_{\star}^{(l)}\in\mathbb{R}^{r\times r}$ and $V_{\star}^{(l)}=[v_{1},v_{2},\ldots,v_{r}]^{\top}\in\mathbb{R}^{r\times m_{2}}$. The rank $r$ and dimensions $m_{1}$ and $m_{2}$ define the size of $W_{\star}^{(l)}$. By updating $U_{\star}^{(l)}$, $\Sigma_{\star}^{(l)}$ and $V_{\star}^{(l)}$, $W_{\star}^{(l)}$ is updated accordingly.
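Below is a minimal sketch of this parametrization: a linear layer whose weight is stored as the trainable factors $U$, $\boldsymbol{\sigma}$ and $V$ of Eq. (5) rather than as $W$ itself. The initialization choices are our assumptions, since KIND trains these factors during pre-training rather than obtaining them from an SVD of pre-trained weights.

```python
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """Parametrize W = U diag(sigma) V^T (Eq. 5) and train U, sigma, V instead of W."""
    def __init__(self, m1, m2, rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(m1, rank) * 0.02)    # (m1, r)
        self.sigma = nn.Parameter(torch.ones(rank))             # diagonal of Sigma, (r,)
        self.V = nn.Parameter(torch.randn(m2, rank) * 0.02)    # (m2, r)

    def weight(self):
        # Reconstruct the full weight matrix from its rank-one components.
        return self.U @ torch.diag(self.sigma) @ self.V.T       # (m1, m2)

    def forward(self, x):                                        # x: (..., m2)
        return x @ self.weight().T                               # (..., m1)
```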

Knowledge Diversion by Class Labels

Given a dataset $\mathcal{D}$ with $N_{cls}$ categories, our objective is to divert knowledge during the training of DiTs. We refer to the components of $U$, $\Sigma$, and $V$ (i.e., the row/column vectors $u_{i}$, $\sigma_{i}$, and $v_{i}$) that condense common knowledge as learngenes, while those representing class-specific knowledge are termed tailors. Specifically, we partition the components of $U$, $\Sigma$, and $V$ based on the number of categories $N_{cls}$ and the matrix rank $r$, which satisfies $r=N_{G}+N_{cls}\cdot N_{T}$. Here, $N_{G}$ is the number of components condensing common knowledge, which make up the learngenes $\mathcal{G}=\{\mathcal{G}_{\star}^{(l)}\,|\,\star\in\mathcal{S}\;\text{and}\;l\in[1,L]\}$, with:

$\mathcal{G}_{\star}^{(l)}=\{U_{G}^{(l)},\Sigma_{G}^{(l)},V_{G}^{(l)}\}$ (6)

Here $U_{G}=\{u_{1},u_{2},\ldots,u_{N_{G}}\}$ ($U_{G}=u_{1\thicksim N_{G}}$ for short), $\Sigma_{G}=\sigma_{1\thicksim N_{G}}$, and $V_{G}=v_{1\thicksim N_{G}}$.

$N_{T}$ denotes the number of components assigned to each category, forming the tailors $\mathcal{T}=\{\mathcal{T}_{i,\star}^{(l)}\,|\,\star\in\mathcal{S},\,l\in[1,L]\;\text{and}\;i\in[1,N_{cls}]\}$, with:

$\mathcal{T}_{i,\star}^{(l)}=\{U^{(l)}_{T_{i}},\Sigma^{(l)}_{T_{i}},V^{(l)}_{T_{i}}\}$ (7)

where $U_{T_{i}}=u_{t_{i}\thicksim t_{i}+N_{T}}$, $\Sigma_{T_{i}}=\sigma_{t_{i}\thicksim t_{i}+N_{T}}$ and $V_{T_{i}}=v_{t_{i}\thicksim t_{i}+N_{T}}$, with the offset $t_{i}=N_{G}+(i-1)\cdot N_{T}$. Thus, each of $U$, $\Sigma$, and $V$ is decomposed into a learngene $\mathcal{G}$ and $N_{cls}$ tailors $\mathcal{T}$.
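The partition of the $r$ rank-one components amounts to simple index bookkeeping, sketched below under the (0-based) indexing reconstruction above: the first $N_{G}$ indices form the learngene, and each class owns the next $N_{T}$ consecutive indices. The helper name is hypothetical.

```python
def component_indices(n_gene, n_tailor, n_cls):
    """Partition r = n_gene + n_cls * n_tailor rank-one components (Eqs. 6-7).

    Returns the learngene indices and a per-class list of tailor indices.
    """
    gene_idx = list(range(n_gene))
    tailor_idx = [list(range(n_gene + i * n_tailor, n_gene + (i + 1) * n_tailor))
                  for i in range(n_cls)]
    return gene_idx, tailor_idx

# e.g. n_gene=4, n_tailor=2, n_cls=3 -> gene [0,1,2,3], tailors [[4,5],[6,7],[8,9]]
```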

During the training of DiTs, we introduce a class gate $G=[0,\ldots,0,1,0,\ldots,0]\in\mathbb{R}^{N_{cls}}$, where only the element corresponding to the class index is set to 1. This mechanism ensures that for each training class, only the weight parameters of the learngenes and the relevant tailors are updated, facilitating targeted knowledge diversion. The optimization objective is defined as:

$\underset{\mathcal{G},\mathcal{T}}{\arg\min}\;\mathcal{L}(f(G\cdot\theta,x),y),\;\;\text{s.t.}\;W_{\star}^{(l)}=U_{\star}^{(l)}\Sigma_{\star}^{(l)}{V_{\star}^{(l)}}^{\top}$ (8)

where the loss function \mathcal{L} is as defined in Eq.1.
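One way to realize the class gate of Eq. (8) is to assemble, for a batch drawn from a single class, the weight from only the learngene components and that class's tailor components, so gradients flow nowhere else; the selection-based masking below is our assumption of how such gating can be implemented on top of the factorized weights sketched earlier.

```python
import torch

def gated_weight(U, sigma, V, gene_idx, tailor_idx, cls):
    """Assemble W from the learngene and the tailors of class `cls` only (class gate, Eq. 8).

    Components belonging to other classes are excluded from the product, so
    back-propagation updates only the learngene and the class-matched tailor.
    """
    active = gene_idx + tailor_idx[cls]                    # indices selected by the class gate
    idx = torch.tensor(active, device=U.device)
    U_a, s_a, V_a = U[:, idx], sigma[idx], V[:, idx]
    return U_a @ torch.diag(s_a) @ V_a.T
```

In such a setup, the assembled weight replaces $W_{\star}^{(l)}$ in the forward pass and the diffusion loss of Eq. (1) is back-propagated only into the selected components.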

Figure 3: Upon completing knowledge integration and diversion, we inherit only learngenes as the backbone with randomly initialized tailors based on task difficulty. These tailors are then fine-tuned to adapt to downstream tasks.

Inheritance of the Learngenes

After diverting the knowledge, we obtain the learngenes and tailors, which condense common and class-specific knowledge, respectively. Thus, when transferring pre-trained models to novel tasks, only the learngenes need to be transferred, significantly improving transfer efficiency. Then, based on the difficulty of the downstream task (i.e., the number of classes), a corresponding number of tailors are randomly initialized and concatenated with the learngenes to form the weight matrices $\hat{W}_{\star}^{(l)}$ of the descendant models as:

$\hat{W}_{\star}^{(l)}=[U_{G}^{(l)},\hat{U}^{(l)}_{T_{1}\thicksim T_{n}}]\,[\Sigma_{G}^{(l)},\hat{\Sigma}^{(l)}_{T_{1}\thicksim T_{n}}]\,[V^{(l)}_{G},\hat{V}^{(l)}_{T_{1}\thicksim T_{n}}]^{\top}$ (9)

where $\hat{U}_{T_{1}\thicksim T_{n}}$, $\hat{\Sigma}_{T_{1}\thicksim T_{n}}$, and $\hat{V}_{T_{1}\thicksim T_{n}}$ are composed of the $n$ randomly initialized tailors built for the specific task.

During fine-tuning, we freeze the parameters of learngenes and only update tailors, allowing them to learn class-specific knowledge from downstream tasks, thereby achieving more efficient fine-tuning.
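A sketch of this inheritance step for one weight matrix, following Eq. (9): the learngene factors are loaded and frozen as buffers, while $n$ new tailor components are randomly initialized as trainable parameters. Class names and initialization scales here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InheritedFactorizedLinear(nn.Module):
    """Descendant weight W_hat = [U_G, U_hat_T] [Sigma_G, Sigma_hat_T] [V_G, V_hat_T]^T (Eq. 9)."""
    def __init__(self, U_G, sigma_G, V_G, n_new_tailor_components):
        super().__init__()
        # Frozen learngene components (condensed common knowledge)
        self.register_buffer("U_G", U_G)
        self.register_buffer("sigma_G", sigma_G)
        self.register_buffer("V_G", V_G)
        m1, m2 = U_G.shape[0], V_G.shape[0]
        # Trainable, randomly initialized tailor components for the downstream classes
        self.U_T = nn.Parameter(torch.randn(m1, n_new_tailor_components) * 0.02)
        self.sigma_T = nn.Parameter(torch.zeros(n_new_tailor_components))
        self.V_T = nn.Parameter(torch.randn(m2, n_new_tailor_components) * 0.02)

    def forward(self, x):
        U = torch.cat([self.U_G, self.U_T], dim=1)
        sigma = torch.cat([self.sigma_G, self.sigma_T], dim=0)
        V = torch.cat([self.V_G, self.V_T], dim=1)
        return x @ (U @ torch.diag(sigma) @ V.T).T
```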

Experiments

Datasets

To better integrate and divert knowledge, we conduct experiments on ImageNet-1K, which comprises 1,000 classes with 1.2M training images and 50K validation images. To minimize inter-class similarity, we merge certain similar classes based on the superclasses in WordNet (Miller 1995), resulting in a final total of 611 classes. Among them, 150 classes are used for pre-training the diffusion models and condensing learngenes, while the remaining 461 classes serve as novel classes for evaluating the performance of learngenes and other PEFT methods. Further details are provided in the Appendix.

Basic Setting

For pre-training DiTs and extracting learngenes, we train class-conditional latent DiTs of sizes -B and -L, with a latent patch size of $p=2$ at a $256\times 256$ image resolution on the training classes. All models are trained using AdamW with a batch size of 256 and a constant learning rate of $1\times 10^{-4}$ over 200K steps. When fine-tuning on novel classes, we randomly divide the 461 novel classes into 18 image generation tasks, each generating images for $c$ classes, where $c\in[7,35]$. All PEFT and learngene methods are fine-tuned with a constant learning rate of $1\times 10^{-3}$ over 50K steps. We use an exponential moving average (EMA) of the DiT weights with a decay rate of 0.9999, and results are reported using the EMA model. During image generation, a classifier-free guidance (cfg) scale of 1.5 is applied. Performance is evaluated using Fréchet Inception Distance (FID) (Heusel et al. 2017), sFID (Nash et al. 2021), Inception Score (IS) (Salimans et al. 2016) and Precision/Recall (Kynkäänniemi et al. 2019).

Baselines

We first compare KIND with several PEFT methods, namely: 1) OFT (Qiu et al. 2023): Multiplies pre-trained weights by a trainable orthogonal matrix. 2) LoRA (Hu et al. 2022): Replaces adapters (Hu et al. 2023) with two low-rank matrices to further reduce parameters. 3) PiSSA (Meng, Wang, and Zhang 2024): Fine-tunes only the principal components of the original matrix, building on LoRA. 4) SVDiff (Han et al. 2023): Applies SVD to pre-trained weight matrices and fine-tunes only the singular values. Additionally, we adapt two learngene methods for DiTs. 5) Heur-LG (Wang et al. 2022): Selects layers to be transferred based on gradients during continual learning. 6) Auto-LG (Wang et al. 2023): Uses meta-networks to select layers that are similar to downstream tasks. Lastly, we provide results for full fine-tuning. 7) Full FT: Directly fine-tunes all parameters of pre-trained models.

Results

Performance of KIND on Novel Classes

Table 1: Performance of various PEFT and learngene methods on novel classes. All methods are fine-tuned for 50K steps on 18 downstream tasks involving novel classes. “Para.” denotes the average number of trainable parameters, while “FLOPs” represents the average total floating-point operations required during fine-tuning.
Methods DiT-B/2 DiT-L/2
Para.(M) FLOPs(G) FID\downarrow sFID\downarrow IS\uparrow Prec.\uparrow Recall\uparrow Para.(M) FLOPs(G) FID\downarrow sFID\downarrow IS\uparrow Prec.\uparrow Recall\uparrow
PEFT SVDiff 0.1 43.6 55.01 18.12 19.6 0.35 0.55 0.2 155.0 49.59 16.81 20.8 0.38 0.56
OFT 14.2 119.7 36.19 17.79 32.0 0.48 0.50 50.5 425.6 24.81 18.27 44.1 0.59 0.47
LoRA 12.8 50.1 36.70 16.28 31.6 0.44 0.57 45.3 178.2 22.55 14.00 46.3 0.55 0.56
PiSSA 12.8 50.1 33.16 15.51 34.6 0.49 0.52 45.3 178.2 19.41 14.72 53.7 0.63 0.50
LoHa 12.7 87.1 42.38 17.37 27.3 0.40 0.58 45.3 309.6 29.79 15.17 35.8 0.49 0.59
DoRA 12.8 129.5 35.87 16.40 32.3 0.45 0.56 45.6 503.0 21.28 14.16 48.3 0.57 0.55
LG Heur-LG 129.6 43.6 55.45 22.14 24.4 0.33 0.48 456.8 155.0 41.83 19.23 30.9 0.40 0.51
Auto-LG 129.6 43.6 56.38 21.39 25.5 0.30 0.49 456.8 155.0 31.78 18.71 41.7 0.46 0.54
KIND 12.8 33.7 20.94 14.75 62.4 0.53 0.50 45.4 119.6 12.87 12.93 86.1 0.65 0.51
FT Full FT 129.6 43.6 26.49 15.08 45.1 0.51 0.55 456.8 155.0 14.51 13.16 69.1 0.63 0.55

To evaluate the adaptability of KIND, we compare it with several PEFT and learngene methods on novel tasks under a fair setting, ensuring that the pre-trained models and learngenes used by these methods are trained with the same configuration. As shown in Table 1, our proposed KIND achieves state-of-the-art results on DiT-B/2 and DiT-L/2, with significant improvements on DiT-L/2: decreases of more than 6.54 in FID and 1.07 in sFID, using only 45.4M trainable parameters and saving at least 35.4G FLOPs in computational cost. The improvements in IS (\uparrow38.8) and precision (\uparrow0.02) further underscore the superiority of KIND.

We observe a clear performance gap between all PEFT methods and Full FT, despite the computational efficiency and reduced trainable parameters offered by PEFT methods. This disparity underscores a substantial difference between the novel tasks and the training tasks. Consequently, directly fixing the parameters of pre-trained models, as done in PEFT methods, might not be an optimal strategy in such scenarios. As illustrated in Figure 5, the images generated by PEFT methods fail to adequately capture the knowledge of corresponding categories due to the limited number of trainable parameters and the significant gap between the knowledge in pre-trained models and the target tasks.

Existing learngene methods, such as Heur-LG and Auto-LG, have not achieved the same level of success in image generation tasks as they have in image classification (Wang et al. 2022, 2023). These methods selectively transfer knowledge in pre-trained models while retaining some flexibility to acquire knowledge from novel tasks. However, they still heavily rely on pre-trained models trained with traditional objectives and do not intervene in the process of knowledge acquisition of pre-trained models, leading to sub-optimal results. Additionally, these methods introduce too many randomly initialized parameters. While this can be beneficial for parameter transfer, it hinders the effective learning of useful knowledge from a limited number of images.

Conversely, our KIND shifts the learning objective from simply maximizing model performance on the training dataset to condensing as much transferable common knowledge as possible. KIND emphasizes the extraction of highly transferable knowledge during the pre-training stage, using knowledge diversion to condense common knowledge into learngenes, thereby making the learngenes a stronger backbone for task adaptation. Additionally, we apply low-rank assumptions to the tailors, making them class-specific, with their rank flexibly set based on task difficulty, further ensuring structural adaptability. As shown in Figure 5 and Table 1, the images generated by KIND are significantly better than those produced by other PEFT methods, both visually and in terms of performance metrics. Surprisingly, KIND even surpasses Full FT with clear visual improvements, while saving 411.4M trainable parameters and reducing computational cost by 35.4G FLOPs.

Strong Learning Ability Brought by Learngenes

Figure 4: Visualization of convergence speed of KIND and other methods on downstream tasks. Each image is sampled every 10K steps to illustrate progress more clearly.
Figure 5: Selected samples from DiT-L/2 models of various PEFT and learngene methods, at a resolution of $256\times 256$. All images are generated using a classifier-free guidance (cfg) scale of 4.0 and an EMA VAE decoder.

As noted in (Wang et al. 2022; Xia et al. 2024), learngenes can accelerate the adaptation of descendant models to novel tasks by transferring common knowledge, providing a significant advantage over training from scratch. Beyond this, our proposed KIND further enhances the convergence speed compared to PEFT methods. Figure 4 illustrates the convergence speed of KIND and other PEFT methods, showcasing images generated by the models at every 10K training steps.

The convergence speed is typically influenced by the number of trainable parameters in fine-tuning, and PEFT methods primarily aim to reduce this number through techniques like orthogonalization and low-rank constraints. However, these methods often overlook the transferability of the knowledge in pre-trained models, as they tend to fix the pre-trained parameters of the backbone network. In contrast, KIND uses the learngenes that condense common knowledge as the backbone, providing superior transferability while remaining lightweight. Meanwhile, the tailors ensure that the models acquire task-specific knowledge, allowing KIND to achieve faster convergence and better performance on downstream tasks.

Ablation Experiments

We validate the effectiveness of the learngenes, tailors, and class gate through ablation experiments. #1 performs SVD on the pre-trained weights, randomly selects $N_{G}$ singular vectors to form its backbone, and then fine-tunes it with LoRA. #2 uses the learngenes extracted by KIND as the backbone on top of #1. #3 uses tailors instead of LoRA to fine-tune the models, without the class gate.

As shown in Table 2, the knowledge in the learngenes, which have undergone knowledge diversion, is more common and thus better suited for adapting to downstream tasks, especially when these tasks differ significantly from the training tasks. Furthermore, the tailors themselves function as a PEFT method by utilizing a low-rank assumption to inject class-specific knowledge into the pre-trained models or learngenes, effectively augmenting the backbone network with additional components. This combination helps the model better acquire new knowledge for downstream tasks. The class gate leverages category information to help the model distinguish class-specific knowledge during learning, thereby enhancing the effectiveness of the tailors.

Table 2: Ablation study on different components of KIND.
LG Tailor Gate FID\downarrow sFID\downarrow IS\uparrow Prec.\uparrow Recall\uparrow
DiT-B/2 #1 ✗ ✗ ✗ 60.28 19.96 20.4 0.30 0.49
#2 ✓ ✗ ✗ 49.54 18.08 23.2 0.34 0.56
#3 ✓ ✓ ✗ 21.60 14.84 59.7 0.54 0.50
KIND ✓ ✓ ✓ 20.94 14.75 62.4 0.53 0.50
DiT-L/2 #1 ✗ ✗ ✗ 42.04 18.07 28.0 0.41 0.54
#2 ✓ ✗ ✗ 33.53 15.55 32.2 0.46 0.59
#3 ✓ ✓ ✗ 13.03 12.93 85.1 0.64 0.51
KIND ✓ ✓ ✓ 12.87 12.93 86.1 0.65 0.51

Analysis on Common Knowledge in Learngenes

Figure 6: Visualization of KIND with and without tailors (i.e., learngene only) across 14 superclasses for 2 different seeds.
Table 3: Comparison of pre-trained models and learngenes when serving as backbones on training tasks.
Entropy\uparrow Variance\downarrow Kurtosis\downarrow
Raw Images of ImageNet 1.458 6.414e-4 884.3
Pretrained Model 2.387 4.516e-4 780.1
Learngene 4.046 1.495e-4 544.9

As discussed earlier, learngenes serve as a superior backbone compared to pre-trained models due to their condensation of common knowledge. To explore this further, we analyze the properties of the common knowledge condensed in learngenes. Table 3 compares learngenes (w/o tailors) with pre-trained models on training tasks. The results reveal that learngenes demonstrate higher entropy, along with lower variance and kurtosis across different categories, indicating that the common knowledge they condense is largely class-agnostic. Such stability underscores that learngenes, as a backbone, provide superior adaptability to unfamiliar classes compared to pre-trained models.

We also visualize the learngenes with and without tailors in Figure 6. The visualizations show that the learngenes are not sensitive to category variations, consistently generating similar images across different class conditions. Although these images may lack detailed semantic information on their own, combining them with category-specific information (i.e., tailors) allows for the generation of images corresponding to specific categories, which further highlights the inherent commonality of knowledge within the learngenes.

Conclusion

In this study, we explore knowledge transfer in diffusion models. Traditional methods fix the parameters of pre-trained models as a backbone without assessing their suitability. To address this, we introduce KIND, which integrates and diverts knowledge within the Learngene framework. Leveraging KIND, we extract learngenes that condense common knowledge, making them more effective as a backbone than traditional pre-trained models. Additionally, we introduce tailors for learning class-specific knowledge. Our proposed KIND achieves state-of-the-art results compared to existing PEFT and learngene methods. Detailed analysis shows that the common knowledge embedded in learngenes is class-agnostic, underscoring its broad applicability across various tasks.

Acknowledgement

We sincerely thank Freepik for providing some images. This research is supported by the National Key Research & Development Plan of China (No. 2018AAA0100104), the National Science Foundation of China (62125602, 62076063) and Xplorer Prize.

References

  • Achiam et al. (2023) Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Bohacek and Mansuy (2015) Bohacek, J.; and Mansuy, I. M. 2015. Molecular insights into transgenerational non-genetic inheritance of acquired behaviours. Nature Reviews Genetics, 16(11): 641–652.
  • Chen, Tennent, and Hsu (2024) Chen, D.-Y.; Tennent, H.; and Hsu, C.-W. 2024. ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8619–8628.
  • Chen et al. (2022) Chen, S.; Ge, C.; Tong, Z.; Wang, J.; Song, Y.; Wang, J.; and Luo, P. 2022. Adaptformer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems, 35: 16664–16678.
  • Feng, Wang, and Geng (2024) Feng, F.; Wang, J.; and Geng, X. 2024. Transferring Core Knowledge via Learngenes. arXiv preprint arXiv:2401.08139.
  • Feng et al. (2023) Feng, F.; Wang, J.; Zhang, C.; Li, W.; Yang, X.; and Geng, X. 2023. Genes in Intelligent Agents. arXiv preprint arXiv:2306.10225.
  • Feng et al. (2024) Feng, F.; Xie, Y.; Wang, J.; and Geng, X. 2024. WAVE: Weight Template for Adaptive Initialization of Variable-sized Models. arXiv preprint arXiv:2406.17503.
  • Han et al. (2023) Han, L.; Li, Y.; Zhang, H.; Milanfar, P.; Metaxas, D.; and Yang, F. 2023. Svdiff: Compact parameter space for diffusion fine-tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 7323–7334.
  • Han et al. (2021) Han, X.; Zhang, Z.; Ding, N.; Gu, Y.; Liu, X.; Huo, Y.; Qiu, J.; Yao, Y.; Zhang, A.; Zhang, L.; et al. 2021. Pre-trained models: Past, present and future. AI Open, 2: 225–250.
  • Hayou, Ghosh, and Yu (2024) Hayou, S.; Ghosh, N.; and Yu, B. 2024. LoRA+: Efficient Low Rank Adaptation of Large Models. In Forty-first International Conference on Machine Learning.
  • Hendrycks and Gimpel (2016) Hendrycks, D.; and Gimpel, K. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
  • Heusel et al. (2017) Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30.
  • Houlsby et al. (2019) Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; and Gelly, S. 2019. Parameter-efficient transfer learning for NLP. In International conference on machine learning, 2790–2799. PMLR.
  • Hu et al. (2022) Hu, E. J.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W.; et al. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.
  • Hu et al. (2023) Hu, Z.; Wang, L.; Lan, Y.; Xu, W.; Lim, E.-P.; Bing, L.; Xu, X.; Poria, S.; and Lee, R. 2023. LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 5254–5276.
  • Kynkäänniemi et al. (2019) Kynkäänniemi, T.; Karras, T.; Laine, S.; Lehtinen, J.; and Aila, T. 2019. Improved precision and recall metric for assessing generative models. Advances in neural information processing systems, 32.
  • Liu et al. (2024a) Liu, S.-y.; Wang, C.-Y.; Yin, H.; Molchanov, P.; Wang, Y.-C. F.; Cheng, K.-T.; and Chen, M.-H. 2024a. DoRA: Weight-Decomposed Low-Rank Adaptation. In Forty-first International Conference on Machine Learning.
  • Liu et al. (2024b) Liu, W.; Qiu, Z.; Feng, Y.; Xiu, Y.; Xue, Y.; Yu, L.; Feng, H.; Liu, Z.; Heo, J.; Peng, S.; et al. 2024b. Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization. In The Twelfth International Conference on Learning Representations.
  • Meng, Wang, and Zhang (2024) Meng, F.; Wang, Z.; and Zhang, M. 2024. Pissa: Principal singular values and singular vectors adaptation of large language models. arXiv preprint arXiv:2404.02948.
  • Miller (1995) Miller, G. A. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11): 39–41.
  • Nash et al. (2021) Nash, C.; Menick, J.; Dieleman, S.; and Battaglia, P. 2021. Generating images with sparse representations. In International Conference on Machine Learning, 7958–7968. PMLR.
  • Peebles and Xie (2023) Peebles, W.; and Xie, S. 2023. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4195–4205.
  • Peng et al. (2024) Peng, Z.; Xu, Z.; Zeng, Z.; Yang, X.; and Shen, W. 2024. Sam-parser: Fine-tuning sam efficiently by parameter space reconstruction. In Proceedings of the AAAI Conference on Artificial Intelligence, 4515–4523.
  • Qiu et al. (2020) Qiu, X.; Sun, T.; Xu, Y.; Shao, Y.; Dai, N.; and Huang, X. 2020. Pre-trained models for natural language processing: A survey. Science China technological sciences, 63(10): 1872–1897.
  • Qiu et al. (2023) Qiu, Z.; Liu, W.; Feng, H.; Xue, Y.; Feng, Y.; Liu, Z.; Zhang, D.; Weller, A.; and Schölkopf, B. 2023. Controlling text-to-image diffusion by orthogonal finetuning. Advances in Neural Information Processing Systems, 36: 79320–79362.
  • Ridnik et al. (2021) Ridnik, T.; Ben-Baruch, E.; Noy, A.; and Zelnik-Manor, L. 2021. ImageNet-21K Pretraining for the Masses. In Proceedings of Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 1–12.
  • Robb et al. (2020) Robb, E.; Chu, W.-S.; Kumar, A.; and Huang, J.-B. 2020. Few-shot adaptation of generative adversarial networks. arXiv preprint arXiv:2010.11943.
  • Rosenstein et al. (2005) Rosenstein, M. T.; Marx, Z.; Kaelbling, L. P.; and Dietterich, T. G. 2005. To transfer or not to transfer. In NIPS 2005 Workshop on Transfer Learning, 1–4.
  • Salimans et al. (2016) Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; and Chen, X. 2016. Improved techniques for training gans. Advances in neural information processing systems, 29.
  • Sih, Ferrari, and Harris (2011) Sih, A.; Ferrari, M. C.; and Harris, D. J. 2011. Evolution and behavioural responses to human-induced rapid environmental change. Evolutionary Applications, 4(2): 367–387.
  • Sun et al. (2022) Sun, Y.; Chen, Q.; He, X.; Wang, J.; Feng, H.; Han, J.; Ding, E.; Cheng, J.; Li, Z.; and Wang, J. 2022. Singular value fine-tuning: Few-shot segmentation requires few-parameters fine-tuning. In Proceedings of Advances in Neural Information Processing Systems, 37484–37496.
  • Touvron et al. (2021) Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; and Jégou, H. 2021. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, 10347–10357.
  • Valipour et al. (2023) Valipour, M.; Rezagholizadeh, M.; Kobyzev, I.; and Ghodsi, A. 2023. DyLoRA: Parameter-Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 3274–3287.
  • Waddington (1942) Waddington, C. H. 1942. Canalization of development and the inheritance of acquired characters. Nature, 150(3811): 563–565.
  • Wang et al. (2022) Wang, Q.; Geng, X.; Lin, S.; Xia, S.-Y.; Qi, L.; and Xu, N. 2022. Learngene: From open-world to your learning task. In Proceedings of the AAAI Conference on Artificial Intelligence, 8557–8565.
  • Wang et al. (2023) Wang, Q.; Yang, X.; Lin, S.; and Geng, X. 2023. Learngene: Inheriting Condensed Knowledge from the Ancestry Model to Descendant Models. arXiv preprint arXiv:2305.02279.
  • Wang et al. (2019) Wang, Z.; Dai, Z.; Póczos, B.; and Carbonell, J. 2019. Characterizing and avoiding negative transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11293–11302.
  • Wong and Candolin (2015) Wong, B. B.; and Candolin, U. 2015. Behavioral responses to changing environments. Behavioral Ecology, 26(3): 665–673.
  • Xia et al. (2024) Xia, S.; Zhang, M.; Yang, X.; Chen, R.; Chen, H.; and Geng, X. 2024. Transformer as Linear Expansion of Learngene. In Proceedings of the AAAI Conference on Artificial Intelligence, 16014–16022.
  • Zhang et al. (2023) Zhang, F.; Li, L.; Chen, J.; Jiang, Z.; Wang, B.; and Qian, Y. 2023. Increlora: Incremental parameter allocation method for parameter-efficient fine-tuning. arXiv preprint arXiv:2308.12043.
  • Zhang and Pilanci (2024) Zhang, F.; and Pilanci, M. 2024. Spectral Adapter: Fine-Tuning in Spectral Space. arXiv preprint arXiv:2405.13952.
  • Zhang et al. (2024) Zhang, X.; Wen, S.; Han, L.; Juefei-Xu, F.; Srivastava, A.; Huang, J.; Wang, H.; Tao, M.; and Metaxas, D. N. 2024. Spectrum-Aware Parameter Efficient Fine-Tuning for Diffusion Models. arXiv preprint arXiv:2405.21050.