LEMON: Lossless model expansion
Abstract
Scaling of deep neural networks, especially Transformers, is pivotal for their surging performance and has further led to the emergence of sophisticated reasoning capabilities in foundation models. Such scaling generally requires training large models from scratch with random initialization, failing to leverage the knowledge acquired by their smaller counterparts, which are already resource-intensive to obtain. To tackle this inefficiency, we present LosslEss MOdel ExpansioN (LEMON), a recipe to initialize scaled models using the weights of their smaller but pre-trained counterparts. This is followed by model training with an optimized learning rate scheduler tailored explicitly for the scaled models, substantially reducing the training time compared to training from scratch. Notably, LEMON is versatile, ensuring compatibility with various network structures, including models like Vision Transformers and BERT. Our empirical results demonstrate that LEMON reduces computational costs by 56.7% for Vision Transformers and 33.2% for BERT when compared to training from scratch.
1 Introduction
Deep neural networks (DNNs) have become increasingly popular, showcasing their adaptability across domains from natural language processing to computer vision. Recent advances in architectural design, especially Transformers, have further enhanced the scalability of DNNs. However, it is common practice to train larger versions of these models from scratch, discarding the valuable knowledge acquired by their smaller counterparts. Such an approach can be highly inefficient, especially given the intensive computational resources required to train large language models such as the Generative Pre-trained Transformer (GPT) (Brown et al., 2020), and the resulting huge carbon footprint. For instance, training GPT-3 is estimated to cost around $4.6M (Li, 2020). Given these challenges, researchers are keenly exploring ways to leverage the prior knowledge of smaller models for more efficient scaling.
Knowledge inheritance and model expansion are two primary methodologies to achieve this goal. Knowledge inheritance (Qin et al., 2021), the reverse of knowledge distillation (Hinton et al., 2015), allows the large model to learn the predictions of a smaller pre-trained model. However, this method often necessitates additional computational resources and modifications to the training pipeline due to the involvement of a ‘teacher network.’ In contrast, model expansion directly utilizes the weights from the pre-trained small source network, either without training (Chen et al., 2015; 2021a; Yang et al., 2020; Shen et al., 2022) or with negligible training (Wang et al., 2023a). Hence, our work mainly focuses on model expansion due to its minimal impact on the training pipeline and negligible computational overhead.
A compelling requirement for model expansion is to ensure it is ‘lossless,’ meaning no information from the source model is lost. Specifically, the goal is for the larger target model to inherit the exact functional mapping as the smaller source model, thus preserving the performance. Net2Net (Chen et al., 2015) represents a foundational study of lossless model expansion for convolutional networks (CNNs) and multi-layer perceptrons (MLPs) where it duplicates neurons and averages their fan-out weights. However, a challenge arises with the ‘symmetry breaking’ issue. This problem occurs when duplicated neurons in expanded layers introduce redundancy, which persists during subsequent training. In this sense, the expanded model will never gain more capacity than the source model. To counteract this problem, previous researchers introduced additional noise into the expansion process, leading to a shift away from a genuine lossless expansion.
Transformers, despite their rising popularity in modern deep learning, introduce additional complexities in achieving lossless expansion that go beyond traditional issues like symmetry breaking. One key obstacle arises from the intricacy of LayerNorm, which became evident when bert2BERT (Chen et al., 2021a) tried extending the Net2Net approach to Transformers, leading to lossy outcomes. Staged training (Shen et al., 2022) demonstrated the feasibility of lossless model expansion, but with a specific constraint: doubling the width during expansion, and only for a variant of Transformers known as Pre-Layer Normalization (Pre-LN) Transformers. However, real-world applications often require width increases that are indivisible by the smaller source model's width, highlighting a limitation of existing methodologies. A typical scenario involves expanding the hidden dimension from 512 to 768.
In exploring the possibilities of lossless model expansion, our research focuses on the ability to break symmetry, handle indivisible width and depth increments, and remain compatible with almost all Transformer varieties. We have discovered affirmative answers, revealing that multiple solutions exist, enabling the selection of an optimal candidate to break the symmetry or find an initialization point with specific properties. Specifically, we break the symmetry of replicated neurons by setting their fan-out weights to be unequal, and we introduce average expansion to deal with LayerNorm for indivisible width increment.
In addition to lossless model expansion techniques, our study also delves into training recipes for the expanded models. It is often overlooked whether applying the original training recipe remains optimal or whether the expanded models necessitate tailored approaches. Our empirical studies reveal two key insights: expanded models can benefit from utilizing a default maximum learning rate and, intriguingly, a learning rate scheduler that decays more rapidly.
Our contributions are summarized as follows:
1. We propose LEMON, a suite of algorithms designed for lossless model expansion across a variety of architectures, ensuring compatibility with indivisible width and depth increments.
2. Drawing inspiration from our empirical results, we propose an optimized learning rate scheduler for the expanded models. This scheduler maintains the maximum learning rate used by training from scratch, but features accelerated decay rates.
3. LEMON reduces the computational costs by up to 56.7% for Vision Transformers and 33.2% for BERT compared to training from scratch, thereby setting a new benchmark in performance.
2 Related works
Table 1: Comparison of related methods.

Method | Depth | Width (divisible) | Width (indivisible) | Free parameters | Data-free
---|---|---|---|---|---
KI (Qin et al., 2021) | ✗ | ✗ | ✗ | No | No
StackBERT (Gong et al., 2019) | ✗ | N/A | N/A | No | Yes |
MSLT (Yang et al., 2020) | ✗ | N/A | N/A | No | Yes |
bert2BERT (Chen et al., 2021a) | ✗ | ✓ | ✗ | No | Yes |
Staged Training (Shen et al., 2022) | ✓ | ✓ | N/A | No | Yes |
LiGO (Wang et al., 2023a) | ✗ | ✗ | ✗ | Yes | No |
LEMON (Ours) | ✓ | ✓ | ✓ | Yes | Yes |
From small models to larger models. There are two main approaches to transferring the knowledge of smaller models to larger models: knowledge inheritance and model expansion. Knowledge inheritance (Qin et al., 2021) enables a student network to learn the logits provided by a teacher network. Net2Net (Chen et al., 2015) was the first work to explore the idea of model expansion. It involves randomly duplicating neurons while preserving the output values through proper normalization, and increasing depth by adding identity layers. However, Net2Net resorts to introducing weight perturbations to overcome symmetry, resulting in performance deterioration. The follow-up work bert2BERT (Chen et al., 2021a) extends Net2Net to Transformers, while others study depth growth (Gong et al., 2019; Yang et al., 2020; Chang et al., 2017; Dong et al., 2020). Staged training (Shen et al., 2022) made significant progress by proposing a lossless model expansion method for Pre-LN Transformers, but with the constraint of width doubling. LiGO (Wang et al., 2023a) suggests employing multiple training steps to find an appropriate linear combination of weights from the source networks. Despite these advancements, all existing methods either suffer a performance drop or impose strict restrictions on the model width. Table 1 compares the related methods.
Network initialization. Numerous studies aim to seek optimal initialization methods for neural networks, primarily focusing on regulating the norm of network parameters (Glorot & Bengio, 2010; He et al., 2015). Theoretical works try to study these methods through dynamical isometry (Saxe et al., 2013) or mean field theory (Poole et al., 2016). Orthogonal initialization, which supports layer-wise dynamical isometry in fully-connected layers, has been extended to CNNs via Delta orthogonal initialization (Xiao et al., 2018). However, there has been limited research on initialization methods specifically for Transformers. Most of these works focus on theoretical approaches to train Transformers without skip connections or normalization layers (Noci et al., 2022; He et al., 2023). Mimetic initialization (Trockman & Kolter, 2023) seeks to initialize attention based on the principles of pre-trained Transformers.
Continual pre-training. Recent research explores adapting pre-trained networks for new or improved datasets. While some target datasets from different domains (Scialom et al., 2022; Ke et al., 2022; Gupta et al., 2023; Qin et al., 2022), others focus on datasets that evolve over time (Han et al., 2020; Jang et al., 2021; Loureiro et al., 2022). Model expansion is similar to continual pre-training, with the distinction being a change in the model size rather than the data distribution.
3 Preliminaries
Model expansion aims to initialize a large model with the weights of its smaller pre-trained counterpart. Concretely, suppose we have pre-trained weights $\theta_S$ of a source network $f_S$; our goal is to design a mapping $\mathcal{E}$ such that the expanded weights $\theta_T = \mathcal{E}(\theta_S)$ initialize the target network $f_T$. Since these expanded weights contain knowledge acquired by the small pre-trained model, they should accelerate the training of $f_T$ compared to random initialization. Moreover, we call a model expansion algorithm lossless if $f_T(\mathbf{x}; \mathcal{E}(\theta_S)) = f_S(\mathbf{x}; \theta_S)$ for all inputs $\mathbf{x}$.
An example of model expansion is to use a pre-trained ResNet-18 (He et al., 2016) or BERT-Small to facilitate the training of ResNet-50 or BERT-Base, respectively. Instead of training the larger models from scratch, the idea is to initialize them with the weights of their smaller pre-trained counterparts, i.e., ResNet-18 or BERT-Small, respectively.
Transformer architecture, introduced by Vaswani et al. (2017), consists of a stack of Transformer blocks, where each block contains two modules: a multi-head attention (MHA) and a two-layer MLP. Depending on the location of LayerNorm (LN), Transformer blocks can be categorized as (1) Post-Norm blocks used by the original BERT (Devlin et al., 2019), where LN is applied after the residual connection, i.e., $\mathbf{x} \leftarrow \mathrm{LN}(\mathbf{x} + \mathrm{Module}(\mathbf{x}))$; (2) Pre-Norm blocks used by GPT (Brown et al., 2020), Pre-LN BERT, Vision Transformers (Dosovitskiy et al., 2021), and Swin Transformer (Liu et al., 2021b), where LN is applied inside the residual connection and before all other transformations, i.e., $\mathbf{x} \leftarrow \mathbf{x} + \mathrm{Module}(\mathrm{LN}(\mathbf{x}))$; and (3) Res-Post-Norm blocks used by Swin Transformer V2 (Liu et al., 2022), where LN is applied inside the residual connection and after all other transformations, i.e., $\mathbf{x} \leftarrow \mathbf{x} + \mathrm{LN}(\mathrm{Module}(\mathbf{x}))$. See Figure 2 for an illustration.
[Figure 2: Illustration of Post-Norm, Pre-Norm, and Res-Post-Norm Transformer blocks.]
Multi-head attention (MHA) uses multiple self-attention heads to attend to information from different representation subspaces of the input. Given an input sequence $\mathbf{X} \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length and $d$ is the embedding dimension, each head projects the inputs into different subspaces using linear transformations. For the $i$-th head, its query is defined as $\mathbf{Q}_i = \mathbf{X}\mathbf{W}^Q_i$, its key as $\mathbf{K}_i = \mathbf{X}\mathbf{W}^K_i$, and its values as $\mathbf{V}_i = \mathbf{X}\mathbf{W}^V_i$, where $\mathbf{W}^Q_i, \mathbf{W}^K_i \in \mathbb{R}^{d \times d_k}$ and $\mathbf{W}^V_i \in \mathbb{R}^{d \times d_v}$. Here, $d_k$ and $d_v$ represent the dimensions of the key and value, respectively. Each head then computes the attention $\mathrm{head}_i = \mathrm{softmax}\!\left(\mathbf{Q}_i \mathbf{K}_i^\top / \sqrt{d_k}\right)\mathbf{V}_i$. The outputs from all heads are concatenated and linearly transformed to yield the final output:
$$\mathrm{MHA}(\mathbf{X}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\,\mathbf{W}^O,$$
where $\mathbf{W}^O \in \mathbb{R}^{H d_v \times d}$ is the projection weight matrix. Please refer to Vaswani et al. (2017) for more details.
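To make the computation concrete, the following is a minimal PyTorch sketch of MHA, assuming $d_k = d_v = d/H$ and omitting biases and masking; the function and variable names are illustrative and not taken from any released code.

```python
import torch
import torch.nn.functional as F

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Minimal MHA: X is (n, d); Wq/Wk/Wv/Wo are (d, d); d must be divisible by num_heads."""
    n, d = X.shape
    d_head = d // num_heads
    # Project and split into heads: (num_heads, n, d_head).
    Q = (X @ Wq).view(n, num_heads, d_head).transpose(0, 1)
    K = (X @ Wk).view(n, num_heads, d_head).transpose(0, 1)
    V = (X @ Wv).view(n, num_heads, d_head).transpose(0, 1)
    # Scaled dot-product attention per head.
    scores = Q @ K.transpose(-2, -1) / d_head ** 0.5      # (num_heads, n, n)
    attn = F.softmax(scores, dim=-1)
    heads = attn @ V                                      # (num_heads, n, d_head)
    # Concatenate heads and apply the output projection.
    concat = heads.transpose(0, 1).reshape(n, d)
    return concat @ Wo

if __name__ == "__main__":
    torch.manual_seed(0)
    n, d, H = 4, 8, 2
    X = torch.randn(n, d)
    Wq, Wk, Wv, Wo = (torch.randn(d, d) for _ in range(4))
    print(multi_head_attention(X, Wq, Wk, Wv, Wo, H).shape)  # torch.Size([4, 8])
```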
Weight symmetry. Consider a two-layer MLP with two hidden neurons of the form $f(\mathbf{x}) = v_1\,\sigma(\mathbf{w}_1^\top \mathbf{x}) + v_2\,\sigma(\mathbf{w}_2^\top \mathbf{x})$, where $\sigma$ is the nonlinear activation and $(\mathbf{w}_1, v_1)$, $(\mathbf{w}_2, v_2)$ are the weights associated with the two hidden neurons. If the weights are initialized such that $\mathbf{w}_1 = \mathbf{w}_2$ and $v_1 = v_2$, the two neurons will always compute identical values throughout training. This symmetry results from the fact that, at each iteration, the gradients for the corresponding weights are the same, i.e., $\nabla_{\mathbf{w}_1}\mathcal{L} = \nabla_{\mathbf{w}_2}\mathcal{L}$ and $\nabla_{v_1}\mathcal{L} = \nabla_{v_2}\mathcal{L}$. Weight symmetry is significant as it implies that the two symmetric neurons do not contribute independently to the model's learning, potentially harming the model's expressive power and learning capability.
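The symmetry argument can be checked numerically. The sketch below (our own illustration, not code from the paper) initializes two hidden neurons identically and confirms that their gradients coincide, so gradient descent alone can never separate them.

```python
import torch

torch.manual_seed(0)
d = 5
x = torch.randn(d)
y = torch.tensor(1.0)

# Two hidden neurons with identical initialization: w1 == w2, v1 == v2.
w = torch.randn(d)
w1, w2 = w.clone().requires_grad_(), w.clone().requires_grad_()
v1 = torch.tensor(0.3, requires_grad=True)
v2 = torch.tensor(0.3, requires_grad=True)

# f(x) = v1*relu(w1^T x) + v2*relu(w2^T x), squared loss.
out = v1 * torch.relu(w1 @ x) + v2 * torch.relu(w2 @ x)
loss = (out - y) ** 2
loss.backward()

# Gradients of the duplicated neurons are identical -> they stay identical forever.
print(torch.allclose(w1.grad, w2.grad), torch.allclose(v1.grad, v2.grad))  # True True
```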
4 Our method: Lossless model expansion
We decompose the expansion operator into two operators, i.e., a depth expansion operator and a width expansion operator, each applied to individual layers.
Our expansion method consists of three main components: (1) general lossless width expansion with symmetry breaking, (2) average width expansion for LayerNorm, and (3) lossless depth expansion. In the expansion process, each layer is independently subjected to these methods, ensuring a layer-level lossless expansion. Each layer receives a losslessly expanded version of its original input and, in turn, guarantees that its output is a lossless expansion of the original output.
4.1 General lossless width expansion with symmetry breaking
We first show how to apply lossless expansion with symmetry breaking for (1) fully-connected layers (FC-layers) and (2) multi-head attention (MHA).
Lossless width expansion for FC-layers. Transformers consist of a set of FC-layers. We first use MLP as an example to show the basic width expansion operator for the FC-layers.
For width expansion, we create copies of neurons similarly to Net2Net and bert2BERT; this step is necessary due to the nonlinear activation used in the MLP. However, the essential difference is that we do NOT set the fan-out weights of replicated neurons to be equal. For simplicity, we use a single-hidden-layer MLP for illustration, shown on the left half of Figure 3(a). We first replicate the hidden neurons in a circular pattern. Consider a neuron and its replica in the plot with original fan-out weight $w$; we can set the expanded fan-out weights to be $\alpha w$ and $(1-\alpha) w$, where $\alpha \in \mathbb{R}$, to ensure lossless expansion.
The selection of $\alpha$ corresponds to a specific lossless model expansion algorithm, and our method can be considered a generalization of existing model expansion methods. Specifically, Net2Net and bert2BERT perform width expansion by setting $\alpha = 1/2$. However, such a choice causes weight symmetry problems, where the two neurons learn exactly the same representations at initialization and throughout the subsequent training. We introduce a simple modification to fix the issue: setting $\alpha \neq 1/2$ is enough to break symmetry for commonly used nonlinear activations. This concept extends to cases where neurons are replicated more than twice, as illustrated on the right half of Figure 3(a). In such cases, we set coefficients $\alpha_1, \ldots, \alpha_k$ for the $k$ copies of a neuron such that $\sum_{i=1}^{k} \alpha_i = 1$ and the $\alpha_i$ are not all equal.
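The following NumPy sketch illustrates this width expansion on a single-hidden-layer MLP, assuming ReLU activation and an illustrative split coefficient $\alpha = 0.3$; it replicates hidden neurons circularly, splits each replicated neuron's fan-out weight with unequal coefficients that sum to one, and verifies that the expanded network computes exactly the same function.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_hidden_new, d_out = 4, 3, 5, 2   # indivisible width increase: 3 -> 5

W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out)
relu = lambda z: np.maximum(z, 0.0)
mlp = lambda x, W1, b1, W2, b2: W2 @ relu(W1 @ x + b1) + b2

# 1) Replicate hidden neurons in a circular pattern.
idx = np.arange(d_hidden_new) % d_hidden                 # [0, 1, 2, 0, 1]
W1_new, b1_new = W1[idx], b1[idx]

# 2) Split each replicated neuron's fan-out weight with unequal coefficients
#    that sum to one (alpha != 1/2 breaks symmetry; alpha = 1/2 recovers Net2Net).
alpha = 0.3
W2_new = np.zeros((d_out, d_hidden_new))
for j in range(d_hidden):
    replicas = np.where(idx == j)[0]
    coeffs = [1.0] if len(replicas) == 1 else [alpha, 1.0 - alpha]
    for i, c in zip(replicas, coeffs):
        W2_new[:, i] = c * W2[:, j]

x = rng.normal(size=d_in)
print(np.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1_new, b1_new, W2_new, b2)))  # True
```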
MHA expansion. We directly copy entire heads in a circular pattern, similar to the FC-layers described in the previous section, and then perform width expansion for the corresponding key, query, and value matrices. The problem then reduces to a case similar to the MLP because of the subsequent projection matrix: symmetry breaking is realized by setting the corresponding fan-out weights in the projection matrix differently. We illustrate the process in Figure 3(b).
4.2 Average width expansion for LayerNorm
When dealing with indivisible width increments, we need to design a specific expansion method for the LayerNorm layer. In this section, we demonstrate that achieving a lossless expansion is feasible provided that FC-layers are positioned before the LayerNorm layer.
Average width expansion. We first show that it is easy to perform the average expansion such that the output of an FC-layer is padded with its average. We do so by adding neurons whose weights are the average of the existing neurons. Specifically, we pad the original weight matrix with rows equal to the average of the existing rows, and pad the bias with the average of the existing bias entries (the input dimension should be expanded as well, depending on how the inputs are expanded). See Figure 4 for an illustration.
LayerNorm layer. We now show that if the input of LayerNorm is average expanded, lossless width expansion is possible. Specifically, consider a LayerNorm layer with element-wise affine transformation of the form $\mathrm{LN}_{\mu,\beta}(\mathbf{x}) = \frac{\mathbf{x} - \mathrm{mean}(\mathbf{x})}{\mathrm{std}(\mathbf{x})} \odot \mu + \beta$, where $\mu, \beta \in \mathbb{R}^{D}$. Define the average expanded $\mathbf{x}^* \in \mathbb{R}^{D^*}$ of $\mathbf{x} \in \mathbb{R}^{D}$ to be $\mathbf{x}$ padded with its average. It can be shown that $\mathrm{LN}_{\mu^*,\beta^*}(\mathbf{x}^*) = [\mathrm{LN}_{\mu,\beta}(\mathbf{x})^\top, \mathbf{0}^\top]^\top$ if $\mu^* = \sqrt{D/D^*}\,[\mu^\top, \mathbf{v}^\top]^\top$ and $\beta^* = [\beta^\top, \mathbf{0}^\top]^\top$, where $\mathbf{0}$ is a zero vector, $\mathbf{v}$ is an arbitrary vector, and $\sqrt{D/D^*}$ is a scalar rescaling factor. See section E.1 for the formal statement and proof in a more general case.
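A quick numerical check of the average expansion and LayerNorm claims above (our own sketch with illustrative names): pad the LayerNorm input with its average, rescale the original LayerNorm weight by $\sqrt{D/D^*}$, pad the weight arbitrarily and the bias with zeros, and the output equals the original output padded with zeros.

```python
import numpy as np

rng = np.random.default_rng(0)
D, D_new = 4, 6

def layer_norm(x, gamma, beta, eps=0.0):
    mean, std = x.mean(), x.std()           # std uses the biased (1/D) estimate
    return (x - mean) / np.sqrt(std**2 + eps) * gamma + beta

x = rng.normal(size=D)
gamma, beta = rng.normal(size=D), rng.normal(size=D)

# Average expansion of the LayerNorm input (e.g., an FC output padded with its mean).
x_exp = np.concatenate([x, np.full(D_new - D, x.mean())])

# Expanded LayerNorm parameters: rescale the weight, pad it arbitrarily, pad the bias with zeros.
gamma_exp = np.sqrt(D / D_new) * np.concatenate([gamma, rng.normal(size=D_new - D)])
beta_exp = np.concatenate([beta, np.zeros(D_new - D)])

out = layer_norm(x, gamma, beta)
out_exp = layer_norm(x_exp, gamma_exp, beta_exp)
print(np.allclose(out_exp, np.concatenate([out, np.zeros(D_new - D)])))  # True
```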
4.3 Lossless depth expansion
[Figure 5: Illustration of lossless depth expansion.]
In this section, we detail our approach for increasing model depth in a lossless manner.
Arrangement of added layers. Similar to how Chang et al. (2017) and Dong et al. (2020) deal with ResNets, we put each added layer directly next to its source layer. For example, when expanding a two-block network $\mathbf{x} \mapsto f_2(f_1(\mathbf{x}))$, the expanded model interleaves the new blocks with the original ones, e.g., $f_2^{\mathrm{new}} \circ f_2 \circ f_1^{\mathrm{new}} \circ f_1$. See Figure 5(a) for an illustration.
Lossless depth expansion. We now provide two ways to perform lossless depth expansion.
First, we can simply set the output of each added module (MLP or MHA) to be zero, i.e., $\mathrm{Module}(\mathbf{x}) = \mathbf{0}$ for all $\mathbf{x}$, e.g., by zeroing out the module's output projection. Hence, the residual branch does not contribute to the output, and the added block computes the identity. This choice gives great flexibility to the rest of the parameters, since we can (1) copy weights from other layers or (2) randomly initialize the weights. See Figure 5(b) for an illustration.
Second, we can enforce the output to be zero by setting the fan-out weights of replicated neurons so that they sum to zero. With the example shown in Figure 3(a), we can set the fan-out weights of a replicated neuron pair to be $\mathbf{w}'$ and $-\mathbf{w}'$ to ensure all outputs are zero (if a neuron is not replicated, its fan-out weights have to be set to zero). See Figure 5(c) for an illustration.
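As a concrete illustration of the first option, the hedged PyTorch sketch below zero-initializes the output projection of a newly inserted Pre-LN MLP block, so the inserted block computes the identity at initialization while the remaining weights stay free; it is a sketch of the idea, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PreLNMLPBlock(nn.Module):
    """A Pre-LN MLP residual block: x + W2 * act(W1 * LN(x))."""
    def __init__(self, d, d_ff):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.fc1 = nn.Linear(d, d_ff)
        self.fc2 = nn.Linear(d_ff, d)

    def forward(self, x):
        return x + self.fc2(torch.relu(self.fc1(self.norm(x))))

def make_identity_block(d, d_ff):
    """New block for depth expansion: zero the output projection so the
    residual branch contributes nothing; norm/fc1 can be arbitrary."""
    block = PreLNMLPBlock(d, d_ff)
    nn.init.zeros_(block.fc2.weight)
    nn.init.zeros_(block.fc2.bias)
    return block

x = torch.randn(2, 8)
new_block = make_identity_block(d=8, d_ff=32)
print(torch.allclose(new_block(x), x))  # True: the inserted block is the identity
```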
4.4 A summary of implementing model expansion
We summarize the procedure of model expansion for the Pre-LN Transformer architecture with both depth and width increments. We first average expand the embedding weights. We then expand each Transformer block so that its output remains an average expansion of the original output. Consequently, after the last LayerNorm, the input to the decoder layer is the original output padded with zeros. We provide a detailed description of our expansion method in section C.1. Furthermore, we explain how to use our method for Post-LN and Res-Post-Norm architectures in Appendix D.
5 How to train the expanded models
In this section, we delve into the influence of different factors in the training recipe, in particular the maximum learning rate and the learning rate scheduler, when training expanded models.
Experiment setup. Throughout this study, we adopt ViT (Dosovitskiy et al., 2021) as our exemplary model and train it on the standard ImageNet-1k dataset. In particular, we expand a smaller ViT, denoted ViT$(D_s, d_s)$, to a larger ViT$(D_t, d_t)$, where $D$ represents the number of attention blocks and $d$ denotes the hidden dimension. When training these models from scratch, we apply the default maximum learning rate and run the training for 300 epochs with a batch size of 1024. We use a cosine learning rate scheduler that decays to the minimum learning rate. However, we will modify this training recipe for the continual training of the expanded model.
[Figure 6: (a) Training loss and (b) validation accuracy under different maximum learning rates; (c) learning rate schedulers with different decay epochs and (d) the corresponding validation accuracy.]
5.1 The effects of maximum learning rate
Suppose we have an expanded model that maintains the same accuracy as its smaller source model. One might naturally opt for a smaller learning rate, expecting the validation accuracy of the expanded model to continue improving smoothly. If this were the case, we could smooth the transition between the training processes of the small model and the expanded model. However, our investigations reveal that the relationship is more complex than it initially seems.
We conducted experiments with three different maximum learning rates, the default one and two smaller values, maintaining a consistent minimum learning rate across all cases. The results are shown in Figure 6(b). We summarize our findings in the following paragraphs.
Performance drop early in training. An interesting observation is the immediate decrease in validation accuracy experienced by all three expanded models early during the learning rate warm-up (we tried changing the number of warm-up steps, but the results were not greatly affected). This performance drop is correlated with the magnitude of the learning rate: the larger it is, the more pronounced the drop. This aligns with our expectation, as smaller learning rates are critical for model convergence, especially when the source model is already near a local optimum. Adopting a larger learning rate can displace the weights from this local minimum, leading to an increase in training loss.
Maximum learning rate and model generalization. We observe that maintaining the default maximum learning rate is pivotal to recovering the performance of the large model. To investigate whether adopting smaller learning rates hinders model learning, we also examine the training loss in all cases, as illustrated in Figure 6(a). The results show that models trained with reduced learning rates incur smaller training losses compared to training from scratch. Hence, we postulate that the performance deterioration induced by a smaller maximum learning rate stems from degraded generalization of the expanded networks rather than impaired optimization. This concept is also theoretically examined by Li et al. (2020), who illustrate how the learning rate can influence the order in which different patterns are learned, thereby affecting generalization.
5.2 How fast the learning rate should decay
After settling the maximum learning rate, the next important parameter to consider is the total number of epochs. Most works use the default learning rate scheduler (Wang et al., 2023a; Chen et al., 2021a), maintaining the same number of epochs as if the model were trained from scratch. We, however, note that the expanded model, having inherited knowledge from the source model, starts with a small training loss; this holds true even after accounting for the temporary performance drop during warm-up. This indicates that the expanded model is closer to the local optimum and requires a smaller learning rate for continued loss reduction. Thus, we should adopt a learning rate scheduler in which the learning rate decays faster.
We examine four different total epoch counts: 130, 150, 200, and 300, with the corresponding learning rate schedulers illustrated in Figure 6(c). Experiment results are shown in Figure 6(d).
Expanded models necessitate faster learning rate decay. As depicted in Figure 6(d), a notable observation is that employing a learning rate scheduler with faster decay enables the expanded model to quickly attain the performance of the corresponding large target model. Remarkably, the expanded model requires only 130 epochs of training to match the performance of the target model trained from scratch for 300 epochs, translating to a computational cost saving of up to 56.7%. This corroborates our earlier conjecture that expanded models need a learning rate scheduler that decays faster.
In summary, we recommend employing the same maximum learning rate as is used for training from scratch but with accelerated decay.
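In code, this recipe amounts to keeping the warm-up and maximum learning rate of the from-scratch schedule while shrinking the decay horizon. The sketch below uses illustrative learning-rate values (the exact values of our setup are not reproduced here) and compares a 300-epoch schedule with a 130-epoch schedule.

```python
import math

def cosine_lr(step, total_steps, warmup_steps, lr_max, lr_min):
    """Linear warm-up to lr_max, then cosine decay to lr_min over total_steps."""
    if step < warmup_steps:
        return lr_max * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Training from scratch: decay over 300 epochs.  Expanded model (LEMON):
# same lr_max and warm-up, but decay over only 130 epochs.
steps_per_epoch = 1000                      # illustrative
lr_max, lr_min = 1e-3, 1e-5                 # illustrative values, not the paper's
scratch  = [cosine_lr(s, 300 * steps_per_epoch, 5 * steps_per_epoch, lr_max, lr_min)
            for s in range(300 * steps_per_epoch)]
expanded = [cosine_lr(s, 130 * steps_per_epoch, 5 * steps_per_epoch, lr_max, lr_min)
            for s in range(130 * steps_per_epoch)]
print(scratch[50_000], expanded[50_000])    # the expanded schedule has decayed further
```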
6 Main experiments
[Figure 7: Expansion results for Vision Transformers (a, b) and BERT (c, d).]
In this section, we compare our method with existing model expansion algorithms on Vision Transformers and BERT. We name our method LosslEss MOdel ExpansioN (LEMON), which uses the expansion algorithm explained in section 4 with an optimized learning rate scheduler that decays faster, as suggested in section 5.
Baselines. We consider several baselines to compare with our proposed method: (1) training the target model from scratch, (2) bert2BERT-FPI (Chen et al., 2015), a generalization of Net2Net, (3) bert2BERT-AKI (Chen et al., 2021a), which uses advanced knowledge initialization (AKI) to break symmetry, (4) soft KI (Qin et al., 2021), which learns the output of the source model by minimizing the KL-divergence of the two output distributions, and (5) hard KI, which learns the predictions of the source model. We do not include StackBERT (Gong et al., 2019), MSLT (Yang et al., 2020), and staged training (Shen et al., 2022), as they are not compatible with indivisible width expansion. LiGO (Wang et al., 2023a) is unavailable for direct comparison due to the absence of open-source code; hence, comparisons are made using the values reported for ViT(12,512) to ViT(12,768) in section F.1.
6.1 Vision Transformers
Experiment setting. We adopt the default experimental setup described in section 5 unless stated otherwise. For LEMON, the learning rate is decayed to its minimum value over 130 epochs in both experiments. Parameter choices of LEMON are discussed in section C.4.
Experiment results. As demonstrated in Figures 7(a) and 7(b), LEMON achieves lossless model expansion. In both experiment settings, LEMON recovers the performance of the target model within 130 epochs, outperforming the other baselines.
Several additional observations were made during the study. First, both bert2BERT-FPI and bert2BERT-AKI exhibited performance inferior to training from scratch. Second, consistent with the observations in Chen et al. (2021a) and Wang et al. (2023a), soft KI did not enhance the training speed of the target model, while hard KI did, possibly by functioning akin to curriculum learning and filtering out challenging training samples for the target model early in training.
6.2 Language models
Experiment setting. For our experiments, we train Pre-LN BERT (Xiong et al., 2020) on the masked language modeling task. The model is trained on the English Wiki corpus as per the methods in Tan & Bansal (2020) for 220k iterations with 5k warmup steps and a batch size of 256. We use a cosine learning rate scheduler that decays the learning rate from its maximum to its minimum value. Following Liu et al. (2019), we remove the next sentence prediction task and use a fixed sequence length of 128 for model pre-training.
We consider the following expansion procedures: (1) BERT(6,384) to BERT(12,768), and (2) BERT(6,512) to BERT(12,768). We remove KI from our baselines. For LEMON, we decay the learning rate to its minimum value in 165k and 132k iterations for the models expanded from BERT(6,384) and BERT(6,512), respectively. Parameter choices of LEMON are discussed in section C.4. We report the number of iterations needed to achieve a log validation MLM loss of 1.64.
Experiment results. As shown in Figures 7(c) and 7(d), LEMON successfully expands smaller models without incurring loss. It outperforms the baselines and achieves computational cost savings of 25.5% and 33.2% for the two expansion settings, respectively.
Downstream tasks. We also present the downstream performance of BERT(12,768) trained by LEMON on the GLUE (Wang et al., 2018) benchmark. We report correlation for the STS-B dataset and the Matthews correlation coefficient for the CoLA dataset. Accuracy is reported for the remaining datasets. The results reveal that BERT(12,768) exhibits superior downstream performance when expanded from BERT(6,384) as opposed to being trained from scratch or being expanded from BERT(6,512). This likely stems from its more extensive training duration (165k iterations) compared to the model expanded from BERT(6,512) (132k iterations).
Dataset | STS-B | MRPC | CoLA | SST-2 | QNLI | MNLI | MNLI-mm | QQP |
---|---|---|---|---|---|---|---|---|
(Metric) | (Corr.) | (Acc.) | (Mcc.) | (Acc.) | (Acc.) | (Acc.) | (Acc.) | (Acc.) |
Train from scratch | 0.744 | 83.33 | 0.19 | 88.88 | 87.80 | 80.28 | 81.17 | 89.62 |
LEMON (Ours), from BERT(6,512) | 0.848 | 83.82 | 0.36 | 90.14 | 88.76 | 80.92 | 81.57 | 89.91
LEMON (Ours), from BERT(6,384) | 0.866 | 85.54 | 0.38 | 90.94 | 89.33 | 81.81 | 81.81 | 90.40
6.3 Ablation studies: the effects of the training recipe
To study the effects of our proposed training recipe on the baselines, we conduct an ablation study in which we apply our training recipe to them. The results are shown in Figure 8(a). They confirm that expanded models indeed require faster learning rate decay. Additionally, LEMON continues to outperform the other baselines under the same modified training recipe.
7 Conclusion
In this paper, we propose LEMON, a method that combines lossless model expansion and optimized learning rate scheduler, showing compatibility and significant performance improvements for a variety of Transformer architectures. However, LEMON does have its limitations, including the need for tuning the total number of training epochs, and our evaluation scale was constrained by available computational resources. Looking ahead, we are working on extending the application of LEMON to larger models and on developing methodologies for selecting optimal free parameters when initializing LEMON.
References
- Abdelfattah et al. (2021) Mohamed S Abdelfattah, Abhinav Mehrotra, Łukasz Dudziak, and Nicholas D Lane. Zero-cost proxies for lightweight nas. arXiv preprint arXiv:2101.08134, 2021.
- Baker et al. (2016) Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
- Bellec et al. (2017) Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training very sparse deep networks. arXiv preprint arXiv:1711.05136, 2017.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Chang et al. (2017) Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.
- Chen et al. (2021a) Cheng Chen, Yichun Yin, Lifeng Shang, Xin Jiang, Yujia Qin, Fengyu Wang, Zhi Wang, Xiao Chen, Zhiyuan Liu, and Qun Liu. bert2bert: Towards reusable pretrained language models. arXiv preprint arXiv:2110.07143, 2021a.
- Chen et al. (2015) Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015.
- Chen et al. (2021b) Wuyang Chen, Xinyu Gong, and Zhangyang Wang. Neural architecture search on imagenet in four {gpu} hours: A theoretically inspired perspective. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=Cnon5ezMHtu.
- de Jorge et al. (2020) Pau de Jorge, Amartya Sanyal, Harkirat S Behl, Philip HS Torr, Gregory Rogez, and Puneet K Dokania. Progressive skeletonization: Trimming more fat from a network at initialization. arXiv preprint arXiv:2006.09081, 2020.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009. doi: 10.1109/CVPR.2009.5206848.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
- Dong et al. (2020) Chengyu Dong, Liyuan Liu, Zichao Li, and Jingbo Shang. Towards adaptive residual network training: A neural-ode perspective. In International conference on machine learning, pp. 2616–2626. PMLR, 2020.
- Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
- Evci et al. (2020) Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, pp. 2943–2952. PMLR, 2020.
- Frankle & Carbin (2018) Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
- Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. JMLR Workshop and Conference Proceedings, 2010.
- Gong et al. (2021) Chengyue Gong, Dilin Wang, Meng Li, Vikas Chandra, and Qiang Liu. Keepaugment: A simple information-preserving data augmentation approach. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1055–1064, 2021.
- Gong et al. (2019) Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tieyan Liu. Efficient training of bert by progressively stacking. In International conference on machine learning, pp. 2337–2346. PMLR, 2019.
- Gupta et al. (2023) Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort. Continual pre-training of large language models: How to (re) warm your model? arXiv preprint arXiv:2308.04014, 2023.
- Han et al. (2020) Rujun Han, Xiang Ren, and Nanyun Peng. Econet: Effective continual pretraining of language models for event temporal reasoning. arXiv preprint arXiv:2012.15283, 2020.
- Han et al. (2015) Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. Advances in neural information processing systems, 28, 2015.
- Hassibi et al. (1993) Babak Hassibi, David G Stork, and Gregory J Wolff. Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pp. 293–299. IEEE, 1993.
- He et al. (2023) Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andrew Brock, Samuel L Smith, and Yee Whye Teh. Deep transformers without shortcuts: Modifying self-attention for faithful signal propagation. arXiv preprint arXiv:2302.10322, 2023.
- He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Hubara et al. (2017) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
- Jang et al. (2021) Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Stanley Jungkyu Choi, and Minjoon Seo. Towards continual knowledge learning of language models. arXiv preprint arXiv:2110.03215, 2021.
- Ke et al. (2022) Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. Continual pre-training of language models. In The Eleventh International Conference on Learning Representations, 2022.
- LeCun et al. (1989) Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989.
- Lee et al. (2018) Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018.
- Lee et al. (2019) Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip HS Torr. A signal propagation perspective for pruning neural networks at initialization. arXiv preprint arXiv:1906.06307, 2019.
- Li (2020) Chuan Li. Openai’s gpt-3 language model: A technical overview. https://lambdalabs.com/blog/demystifying-gpt-3, 2020. Accessed: 2023-09-22.
- Li et al. (2020) Yuanzhi Li, Colin Wei, and Tengyu Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks, 2020.
- Liu et al. (2018) Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
- Liu et al. (2021a) Shiwei Liu, Lu Yin, Decebal Constantin Mocanu, and Mykola Pechenizkiy. Do we actually need dense over-parameterization? in-time over-parameterization in sparse training. In International Conference on Machine Learning, pp. 6989–7000. PMLR, 2021a.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019.
- Liu et al. (2021b) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022, 2021b.
- Liu et al. (2022) Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12009–12019, 2022.
- Louizos et al. (2017) Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through regularization. arXiv preprint arXiv:1712.01312, 2017.
- Loureiro et al. (2022) Daniel Loureiro, Francesco Barbieri, Leonardo Neves, Luis Espinosa Anke, and Jose Camacho-Collados. Timelms: Diachronic language models from twitter. arXiv preprint arXiv:2202.03829, 2022.
- Mellor et al. (2021) Joe Mellor, Jack Turner, Amos Storkey, and Elliot J Crowley. Neural architecture search without training. In International Conference on Machine Learning, pp. 7588–7598. PMLR, 2021.
- Mocanu et al. (2018) Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications, 9(1):1–12, 2018.
- Noci et al. (2022) Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh, and Aurelien Lucchi. Signal propagation in transformers: Theoretical perspectives and the role of rank collapse. arXiv preprint arXiv:2206.03126, 2022.
- Poole et al. (2016) Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. Advances in neural information processing systems, 29, 2016.
- Qin et al. (2021) Yujia Qin, Yankai Lin, Jing Yi, Jiajie Zhang, Xu Han, Zhengyan Zhang, Yusheng Su, Zhiyuan Liu, Peng Li, Maosong Sun, et al. Knowledge inheritance for pre-trained language models. arXiv preprint arXiv:2105.13880, 2021.
- Qin et al. (2022) Yujia Qin, Jiajie Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. Elle: Efficient lifelong pre-training for emerging data. arXiv preprint arXiv:2203.06311, 2022.
- Rastegari et al. (2016) Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, pp. 525–542. Springer, 2016.
- Real et al. (2019) Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Aging evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence, volume 2, 2019.
- Renda et al. (2020) Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389, 2020.
- Saxe et al. (2013) Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
- Scialom et al. (2022) Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. Fine-tuned language models are continual learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 6107–6122, 2022.
- Shen et al. (2022) Sheng Shen, Pete Walsh, Kurt Keutzer, Jesse Dodge, Matthew Peters, and Iz Beltagy. Staged training for transformer language models, 2022.
- Tan & Bansal (2020) Hao Tan and Mohit Bansal. Vokenization: Improving language understanding with contextualized, visual-grounded supervision, 2020.
- Tanaka et al. (2020) Hidenori Tanaka, Daniel Kunin, Daniel L Yamins, and Surya Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in Neural Information Processing Systems, 33:6377–6389, 2020.
- Touvron et al. (2021) Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention, 2021.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Trockman & Kolter (2023) Asher Trockman and J Zico Kolter. Mimetic initialization of self-attention layers. arXiv preprint arXiv:2305.09828, 2023.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
- Wang et al. (2020a) Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. arXiv preprint arXiv:2002.07376, 2020a.
- Wang et al. (2022a) Haoxiang Wang, Yite Wang, Ruoyu Sun, and Bo Li. Global convergence of maml and theory-inspired neural architecture search for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9797–9808, 2022a.
- Wang et al. (2023a) Peihao Wang, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard, Leonid Karlinsky, Rogerio Feris, David Daniel Cox, Zhangyang Wang, and Yoon Kim. Learning to grow pretrained models for efficient transformer training. arXiv preprint arXiv:2303.00980, 2023a.
- Wang et al. (2020b) Ruochen Wang, Minhao Cheng, Xiangning Chen, Xiaocheng Tang, and Cho-Jui Hsieh. Rethinking architecture selection in differentiable nas. In International Conference on Learning Representations, 2020b.
- Wang et al. (2022b) Yite Wang, Dawei Li, and Ruoyu Sun. Ntk-sap: Improving neural network pruning by aligning training dynamics. In The Eleventh International Conference on Learning Representations, 2022b.
- Wang et al. (2023b) Yite Wang, Jing Wu, Naira Hovakimyan, and Ruoyu Sun. Double dynamic sparse training for gans, 2023b.
- Wen et al. (2016) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. Advances in neural information processing systems, 29, 2016.
- Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
- Wu et al. (2023a) Jing Wu, Jennifer Hobbs, and Naira Hovakimyan. Hallucination improves the performance of unsupervised visual representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16132–16143, 2023a.
- Wu et al. (2023b) Jing Wu, Naira Hovakimyan, and Jennifer Hobbs. Genco: An auxiliary generator from contrastive learning for enhanced few-shot learning in remote sensing. arXiv preprint arXiv:2307.14612, 2023b.
- Xiao et al. (2018) Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, pp. 5393–5402. PMLR, 2018.
- Xiong et al. (2020) Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pp. 10524–10533. PMLR, 2020.
- Xu et al. (2021) Jingjing Xu, Liang Zhao, Junyang Lin, Rundong Gao, Xu Sun, and Hongxia Yang. Knas: green neural architecture search. In International Conference on Machine Learning, pp. 11613–11625. PMLR, 2021.
- Yang et al. (2023) Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.
- Yang et al. (2020) Cheng Yang, Shengnan Wang, Chao Yang, Yuechuan Li, Ru He, and Jingqiao Zhang. Progressively stacking 2.0: A multi-stage layerwise training method for bert training speedup. arXiv preprint arXiv:2011.13635, 2020.
- Yu et al. (2019) Kaicheng Yu, Christian Sciuto, Martin Jaggi, Claudiu Musat, and Mathieu Salzmann. Evaluating the search phase of neural architecture search. In International Conference on Learning Representations, 2019.
- Zhang & Sennrich (2019) Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
- Zoph & Le (2016) Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
Overview of the Appendix
The Appendix is organized as follows:
- Appendix A introduces the general experiment setup.
- Appendix B provides background and notations for model expansion.
- Appendix C shows details for applying LEMON on Pre-LN Transformers.
- Appendix D shows details for applying LEMON on other architectures.
- Appendix E provides related proofs.
- Appendix F provides additional ablation studies for the experiments.
- Appendix G provides additional related works for efficient deep learning.
Appendix A Experiment setup
We conduct all experiments with NVIDIA V100 and NVIDIA A100 GPUs. We use the official code base of DeiT (Touvron et al., 2021) (https://github.com/facebookresearch/deit/tree/main) for training Vision Transformers and the code base of VLM (Tan & Bansal, 2020) (https://github.com/airsplay/vokenization) for training BERT.
A.1 Network architecture
For Vision Transformers, we use the default network architecture adopted in Touvron et al. (2021). For BERT, we implemented Pre-LN BERT in Huggingface’s Transformers package (Wolf et al., 2019) such that:
- Within the residual branch of each Transformer block, we position LayerNorm to precede both the multi-head attention (MHA) and multi-layer perceptron (MLP) modules.
- For the MLM classification head, we use only one fully-connected layer (shared with the embedding).
A.2 Detailed training configurations
Vision Transformers. We train Vision Transformers on the ImageNet-1k (Deng et al., 2009) dataset. When training Vision Transformers from scratch, we apply the default maximum learning rate and run the training for 300 epochs with a batch size of 1024. We use a cosine learning rate scheduler that decays to the minimum learning rate, with 5 warm-up epochs.
BERT pre-training. We train Pre-LN BERT (Xiong et al., 2020) on the masked language modeling task. The model is trained on the English Wiki corpus as per the methods in Tan & Bansal (2020) for 220k iterations with 5k warmup steps and a batch size of 256. We use a cosine learning rate scheduler that decays the learning rate from its maximum to its minimum value. Following Liu et al. (2019), we remove the next sentence prediction task and use a fixed sequence length of 128 for model pre-training.
BERT fine-tuning. For fine-tuning BERT on the GLUE (Wang et al., 2018) benchmark, we train for 3 epochs with a fixed learning rate and a batch size of 32 for all tasks. We report correlation for the STS-B dataset and the Matthews correlation coefficient for the CoLA dataset. Accuracy is reported for the remaining datasets.
A.3 Details of baselines
We provide our implementation details of knowledge inheritance (KI) (Qin et al., 2021) in this section. Given a training dataset denoted as $\mathcal{D} = \{(x_i, y_i)\}_i$, we define the total loss as:
$$\mathcal{L}_{\mathrm{total}} = \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ (1 - \alpha)\, \mathcal{L}\big(f_L(x), y\big) + \alpha\, \mathcal{L}_{\mathrm{KI}}\big(f_L(x), f_S(x)\big) \Big],$$
where $\alpha$ is a scalar controlling the strength of KI; the functions $f_S$ and $f_L$ respectively represent the small source model and the large target model; and the loss function $\mathcal{L}$ computes the standard training loss, such as cross-entropy, between the prediction and the actual label $y$. For soft KI, we set $\mathcal{L}_{\mathrm{KI}}(f_L(x), f_S(x)) = \mathrm{KL}\big(f_L(x) \,\|\, f_S(x)\big)$. For hard KI, we set $\mathcal{L}_{\mathrm{KI}}(f_L(x), f_S(x)) = \mathcal{L}\big(f_L(x), e_{\arg\max f_S(x)}\big)$, where KL stands for Kullback–Leibler divergence, and $e_i$ is the standard basis vector.
During the KI process, we start with an initial value of $\alpha = 0.5$ and linearly decrease it to zero.
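A possible implementation of this KI objective is sketched below; it reflects our reading of the loss above (soft KI as a KL term on output distributions, hard KI as cross-entropy against the teacher's predicted labels), and the exact form used in the original KI work may differ.

```python
import torch
import torch.nn.functional as F

def ki_loss(student_logits, teacher_logits, labels, alpha, hard=False):
    """(1 - alpha) * task loss + alpha * knowledge-inheritance loss."""
    task = F.cross_entropy(student_logits, labels)
    if hard:
        # Hard KI: learn the teacher's predicted labels.
        ki = F.cross_entropy(student_logits, teacher_logits.argmax(dim=-1))
    else:
        # Soft KI: KL divergence between student and teacher output distributions.
        ki = F.kl_div(F.log_softmax(student_logits, dim=-1),
                      F.softmax(teacher_logits, dim=-1), reduction="batchmean")
    return (1 - alpha) * task + alpha * ki

# alpha starts at 0.5 and is linearly decayed to zero during training.
student_logits, teacher_logits = torch.randn(8, 10), torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(ki_loss(student_logits, teacher_logits, labels, alpha=0.5).item())
```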
Appendix B Notations and backgrounds
In this section, we introduce basic notations in section B.1, the definition of some normalization layers in section B.2, lossless expansion in vector space in section B.3, lossless expansion for operators (layers) in section B.4, and the rule of consecutive application of lossless expansion methods for consecutive layers in section B.4.3.
B.1 Notations
All vectors are assumed to be column vectors. We define $\mathbf{0}_D$ to be a zero vector of dimension $D$. We use bold-faced letters for vectors, matrices, and tensors. For a vector $\mathbf{x}$, let $\mathbf{x}[i]$ be its $i$-th entry and $\mathbf{x}[{:}n]$ be its first $n$ entries. For a matrix $\mathbf{M}$, let $\mathbf{M}[i,j]$, $\mathbf{M}[i,:]$, and $\mathbf{M}[:,j]$ be its $(i,j)$-th entry, $i$-th row, and $j$-th column, respectively. Moreover, let $\mathbf{M}[{:}m,:]$ and $\mathbf{M}[:,{:}n]$ be its first $m$ rows and first $n$ columns, respectively. We use $\mathbf{M}^\top$ to denote the matrix transpose of $\mathbf{M}$. We use $[n]$, where $n \in \mathbb{N}$, to denote $\{1, \ldots, n\}$. We use Id to denote the identity mapping. We use $[\cdot \,\|\, \cdot]$ to denote horizontal concatenation.
B.2 Model layers
In this section, we give the formal definitions of LayerNorm and RMSNorm.
Definition 1 (LayerNorm).
LayerNorm $\mathrm{LN}_{\mu, \beta}$ of dimension $D$ is defined as:
$$\mathrm{LN}_{\mu, \beta}(\mathbf{x}) = \frac{\mathbf{x} - \mathrm{mean}(\mathbf{x})}{\mathrm{std}(\mathbf{x})} \odot \mu + \beta,$$
where $\mu, \beta \in \mathbb{R}^{D}$.
Definition 2 (RMSNorm).
RMSNorm $\mathrm{RMS}_{\mu}$ of dimension $D$ is defined as:
$$\mathrm{RMS}_{\mu}(\mathbf{x}) = \frac{\mathbf{x}}{\mathrm{RMS}(\mathbf{x})} \odot \mu, \qquad \mathrm{RMS}(\mathbf{x}) = \sqrt{\tfrac{1}{D} \textstyle\sum_{i=1}^{D} \mathbf{x}[i]^2},$$
where $\mu \in \mathbb{R}^{D}$.
Remark.
In neural networks, the inputs of normalization layers are usually high-dimensional tensors. In this case, LayerNorm and RMSNorm are normally applied to the last dimension separately.
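For reference, the two definitions translate directly into a short NumPy sketch; following the remark, normalization is applied separately to the last dimension.

```python
import numpy as np

def layer_norm(x, mu, beta):
    """Definition 1: normalize by mean/std over the last dimension, then affine."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / std * mu + beta

def rms_norm(x, mu):
    """Definition 2: divide by the root-mean-square over the last dimension, then scale."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True))
    return x / rms * mu

x = np.random.default_rng(0).normal(size=(2, 3, 8))   # normalization applies to the last dim
print(layer_norm(x, np.ones(8), np.zeros(8)).shape, rms_norm(x, np.ones(8)).shape)
```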
B.3 Lossless expansion in vector space
In this section, we first give the general definition of lossless expansion in vector space.
Definition 3 (Lossless expansion in vector space).
Given two vector spaces $\mathcal{V} = \mathbb{R}^{D}$ and $\mathcal{V}^* = \mathbb{R}^{D^*}$ whose dimensions satisfy $D^* \geq D$, a vector space expansion $V: \mathcal{V} \to \mathcal{V}^*$ is said to be lossless if it is invertible.
Remark.
Note that the identity function Id is lossless with its inverse being itself.
Then we give a few examples of lossless vector space expansions. These examples will also be used in LEMON.
Example B.3.1 (Vector average expansion $V_{\mathrm{avg}}$).
Let $\mathbf{x}$ be a vector of dimension $D$ and $\bar{x} = \frac{1}{D}\sum_{i=1}^{D}\mathbf{x}[i]$ its average. $V_{\mathrm{avg}}(\mathbf{x})$ is called the average expanded $\mathbf{x}$ of dimension $D^*$ with $D^* \geq D$ if
$$V_{\mathrm{avg}}(\mathbf{x}) = [\mathbf{x}^\top, \bar{x}, \ldots, \bar{x}]^\top \in \mathbb{R}^{D^*}.$$
Example B.3.2 (Vector zero expansion $V_{\mathrm{zero}}$).
Let $\mathbf{x}$ be a vector of dimension $D$. $V_{\mathrm{zero}}(\mathbf{x})$ is called the zero expanded $\mathbf{x}$ of dimension $D^*$ with $D^* \geq D$ if
$$V_{\mathrm{zero}}(\mathbf{x}) = [\mathbf{x}^\top, \mathbf{0}_{D^*-D}^\top]^\top \in \mathbb{R}^{D^*}.$$
Example B.3.3 (Vector circular expansion $V_{\mathrm{circ}}$).
Let $\mathbf{x}$ be a vector of dimension $D$. $V_{\mathrm{circ}}(\mathbf{x})$ is called the circular expanded $\mathbf{x}$ of dimension $D^*$ with $D^* \geq D$ if
$$V_{\mathrm{circ}}(\mathbf{x})[i] = \mathbf{x}[((i-1) \bmod D) + 1], \quad i \in [D^*].$$
Example B.3.4 (Vector random expansion $V_{\mathrm{rand}}$).
Let $\mathbf{x}$ be a vector of dimension $D$. $V_{\mathrm{rand}}(\mathbf{x})$ is called the random expanded $\mathbf{x}$ of dimension $D^*$ with $D^* \geq D$ if
$$V_{\mathrm{rand}}(\mathbf{x}) = [\mathbf{x}^\top, \mathbf{v}^\top]^\top \in \mathbb{R}^{D^*},$$
where $\mathbf{v} \in \mathbb{R}^{D^*-D}$ is an arbitrary vector.
Remark.
(1) All vector expansion examples above follow the same pattern: when expanding from dimension $D$ to $D^*$, each method keeps the original $D$ entries in the first $D$ positions and differs only in how it fills the remaining $D^* - D$ entries. (2) The random vector $\mathbf{v}$ in vector random expansion is arbitrary, so $V_{\mathrm{avg}}$ and $V_{\mathrm{zero}}$ can be viewed as special cases of $V_{\mathrm{rand}}$. (3) These examples are expansion methods for vectors. In practice, neural networks like Transformers deal with high-dimensional tensors. These tensors can essentially be thought of as collections of vectors. In such scenarios, we can apply the expansion methods separately to the last dimension of these tensors.
In the following claim, we show that vectors expanded by these operators are lossless.
Claim 1.
Vector average expansion $V_{\mathrm{avg}}$, vector zero expansion $V_{\mathrm{zero}}$, vector circular expansion $V_{\mathrm{circ}}$, and vector random expansion $V_{\mathrm{rand}}$ are all lossless expansions for vectors.
Proof.
The inverse function of all these vector expansion methods is the map that keeps the first $D$ entries, i.e., $V^{-1}(\mathbf{x}^*) = \mathbf{x}^*[{:}D]$.
∎
Remark.
In practice, we want the inverse mappings of expansion methods to be easily computed, just like in the example above.
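The four vector expansions and their common inverse can be written compactly as follows (a NumPy sketch of our reading of the definitions; the inverse simply keeps the first $D$ entries).

```python
import numpy as np

def expand_avg(x, d_new):      # pad with the average of x
    return np.concatenate([x, np.full(d_new - len(x), x.mean())])

def expand_zero(x, d_new):     # pad with zeros
    return np.concatenate([x, np.zeros(d_new - len(x))])

def expand_circular(x, d_new): # repeat entries in a circular pattern
    return x[np.arange(d_new) % len(x)]

def expand_random(x, d_new, rng=np.random.default_rng(0)):  # pad with arbitrary entries
    return np.concatenate([x, rng.normal(size=d_new - len(x))])

def inverse(x_expanded, d):    # common inverse: keep the first d entries
    return x_expanded[:d]

x = np.array([1.0, 2.0, 3.0])
for expand in (expand_avg, expand_zero, expand_circular, expand_random):
    assert np.allclose(inverse(expand(x, 5), 3), x)
print("all expansions are invertible on the first D entries")
```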
B.4 Lossless expansion for operators
We then give the definition of lossless expansion for operators. These operators apply to tensors; hence, our definition of lossless operator expansion is based on lossless expansion in vector space. These operators can be different layers used in Transformer architectures, including LayerNorm, convolutional layers, and fully-connected layers.
Definition 4 (Lossless expansion for operators).
Consider vector spaces $\mathcal{V}_{\mathrm{in}}, \mathcal{V}_{\mathrm{in}}^*$ and $\mathcal{V}_{\mathrm{out}}, \mathcal{V}_{\mathrm{out}}^*$ such that $\dim(\mathcal{V}_{\mathrm{in}}^*) \geq \dim(\mathcal{V}_{\mathrm{in}})$ and $\dim(\mathcal{V}_{\mathrm{out}}^*) \geq \dim(\mathcal{V}_{\mathrm{out}})$. Moreover, suppose an operator $g: \mathcal{V}_{\mathrm{in}} \to \mathcal{V}_{\mathrm{out}}$ is expanded to $\mathcal{E}(g): \mathcal{V}_{\mathrm{in}}^* \to \mathcal{V}_{\mathrm{out}}^*$. We say the operator expansion $\mathcal{E}$ is $(V_{\mathrm{in}}, V_{\mathrm{out}})$-lossless for $g$ if there exist a lossless input vector space expansion $V_{\mathrm{in}}$ and a lossless output vector space expansion $V_{\mathrm{out}}$ such that $\mathcal{E}(g)(V_{\mathrm{in}}(\mathbf{x})) = V_{\mathrm{out}}(g(\mathbf{x}))$ for all $\mathbf{x} \in \mathcal{V}_{\mathrm{in}}$.
Remark.
(1) Intuitively, a lossless operator expansion can be understood as follows: when given a losslessly expanded input, the expanded operator produces a losslessly expanded version of the original output. (2) For conciseness, we use "$\mathcal{E}(g)$ is $(V_{\mathrm{in}}, V_{\mathrm{out}})$-lossless" and "$\mathcal{E}$ is $(V_{\mathrm{in}}, V_{\mathrm{out}})$-lossless for $g$" interchangeably. (3) We only require the vector expansions $V_{\mathrm{in}}$ and $V_{\mathrm{out}}$ to be invertible; we do not place restrictions on the operator expansion $\mathcal{E}$ itself.
B.4.1 Lossless expansion for matrix multiplication
Then we give a few examples of lossless expansion for operators. We give examples for matrix multiplication, since fully-connected layers are building blocks of Transformers. We start by introducing the following three lossless operator expansion methods for matrix multiplication, assuming that the input dimension is unchanged, i.e., $D^*_{\mathrm{in}} = D_{\mathrm{in}}$.
Example B.4.1 (Matrix row-average expansion $\mathcal{E}^{\mathrm{row}}_{\mathrm{avg}}$).
Let $\mathbf{W}$ be a matrix of dimension $D_{\mathrm{out}} \times D_{\mathrm{in}}$ and $\bar{\mathbf{w}} = \frac{1}{D_{\mathrm{out}}}\sum_{i=1}^{D_{\mathrm{out}}} \mathbf{W}[i, :]$ its row average. $\mathcal{E}^{\mathrm{row}}_{\mathrm{avg}}(\mathbf{W})$ is called the row-average expanded $\mathbf{W}$ of dimension $D^*_{\mathrm{out}} \times D_{\mathrm{in}}$ if it is obtained by appending $D^*_{\mathrm{out}} - D_{\mathrm{out}}$ copies of $\bar{\mathbf{w}}$ as additional rows of $\mathbf{W}$.
Moreover, $\mathcal{E}^{\mathrm{row}}_{\mathrm{avg}}$ is $(\mathrm{Id}, V_{\mathrm{avg}})$-lossless for the matrix multiplication $\mathbf{x} \mapsto \mathbf{W}\mathbf{x}$.
Example B.4.2 (Matrix row-zero expansion $\mathcal{E}^{\mathrm{row}}_{\mathrm{zero}}$).
Let $\mathbf{W}$ be a matrix of dimension $D_{\mathrm{out}} \times D_{\mathrm{in}}$. $\mathcal{E}^{\mathrm{row}}_{\mathrm{zero}}(\mathbf{W})$ is called the row-zero expanded $\mathbf{W}$ of dimension $D^*_{\mathrm{out}} \times D_{\mathrm{in}}$ if it is obtained by appending $D^*_{\mathrm{out}} - D_{\mathrm{out}}$ zero rows to $\mathbf{W}$.
Moreover, $\mathcal{E}^{\mathrm{row}}_{\mathrm{zero}}$ is $(\mathrm{Id}, V_{\mathrm{zero}})$-lossless for $\mathbf{x} \mapsto \mathbf{W}\mathbf{x}$.
Example B.4.3 (Matrix row-circular expansion $\mathcal{E}^{\mathrm{row}}_{\mathrm{circ}}$).
Let $\mathbf{W}$ be a matrix of dimension $D_{\mathrm{out}} \times D_{\mathrm{in}}$. $\mathcal{E}^{\mathrm{row}}_{\mathrm{circ}}(\mathbf{W})$ is called the row-circular expanded $\mathbf{W}$ of dimension $D^*_{\mathrm{out}} \times D_{\mathrm{in}}$ if its rows repeat the rows of $\mathbf{W}$ in a circular pattern, i.e., $\mathcal{E}^{\mathrm{row}}_{\mathrm{circ}}(\mathbf{W})[i, :] = \mathbf{W}[((i-1) \bmod D_{\mathrm{out}}) + 1, :]$ for $i \in [D^*_{\mathrm{out}}]$.
Moreover, $\mathcal{E}^{\mathrm{row}}_{\mathrm{circ}}$ is $(\mathrm{Id}, V_{\mathrm{circ}})$-lossless for $\mathbf{x} \mapsto \mathbf{W}\mathbf{x}$.
Remark.
Similar to the vector expansion examples, these matrix row-expansion methods follow the same pattern: when expanding the number of rows from $D_{\mathrm{out}}$ to $D^*_{\mathrm{out}}$, they keep the original rows in the first $D_{\mathrm{out}}$ positions and differ only in how they fill the remaining $D^*_{\mathrm{out}} - D_{\mathrm{out}}$ rows.
The following two lossless operator expansion methods assume that the output dimension is unchanged, i.e., $D^*_{\mathrm{out}} = D_{\mathrm{out}}$.
Example B.4.4 (Matrix column-random expansion $\mathcal{E}^{\mathrm{col}}_{\mathrm{rand}}$).
Let $\mathbf{W}$ be a matrix of dimension $D_{\mathrm{out}} \times D_{\mathrm{in}}$ and $\mathbf{W}_r$ an arbitrary matrix of dimension $D_{\mathrm{out}} \times (D^*_{\mathrm{in}} - D_{\mathrm{in}})$. $\mathcal{E}^{\mathrm{col}}_{\mathrm{rand}}(\mathbf{W})$ is called the column-random expanded $\mathbf{W}$ of dimension $D_{\mathrm{out}} \times D^*_{\mathrm{in}}$ if
$$\mathcal{E}^{\mathrm{col}}_{\mathrm{rand}}(\mathbf{W}) = [\mathbf{W} \,\|\, \mathbf{W}_r],$$
where $\|$ denotes horizontal concatenation.
Moreover, $\mathcal{E}^{\mathrm{col}}_{\mathrm{rand}}$ is $(V_{\mathrm{zero}}, \mathrm{Id})$-lossless for $\mathbf{x} \mapsto \mathbf{W}\mathbf{x}$, since the appended columns only multiply the zero-padded entries of the input.
Example B.4.5 (Matrix column-circular expansion $\mathcal{E}^{\mathrm{col}}_{\mathrm{circ}}$).
Let $\mathbf{W}$ be a matrix of dimension $D_{\mathrm{out}} \times D_{\mathrm{in}}$ and $D_{\mathrm{in}} \leq D^*_{\mathrm{in}}$. $\mathcal{E}^{\mathrm{col}}_{\mathrm{circ}}(\mathbf{W})$ is called the column-circular expanded $\mathbf{W}$ of dimension $D_{\mathrm{out}} \times D^*_{\mathrm{in}}$ if
$$\mathcal{E}^{\mathrm{col}}_{\mathrm{circ}}(\mathbf{W})[:, j^*] = c_{j^*}\, \mathbf{W}[:, j], \quad j = ((j^*-1) \bmod D_{\mathrm{in}}) + 1,$$
where the free coefficients $c_{j^*}$ satisfy
$$\sum_{j^*:\, ((j^*-1) \bmod D_{\mathrm{in}}) + 1 = j} c_{j^*} = 1 \quad \text{for every } j \in [D_{\mathrm{in}}].$$
Moreover, $\mathcal{E}^{\mathrm{col}}_{\mathrm{circ}}$ is $(V_{\mathrm{circ}}, \mathrm{Id})$-lossless for $\mathbf{x} \mapsto \mathbf{W}\mathbf{x}$.
Note that lossless matrix row expansion and lossless matrix column expansion can be used together with the following claim.
Claim 2.
Consider a matrix column expansion $\mathcal{E}^{\mathrm{col}}$ that is $(V_{\mathrm{in}}, \mathrm{Id})$-lossless for $\mathbf{x} \mapsto \mathbf{W}\mathbf{x}$, and a matrix row expansion $\mathcal{E}^{\mathrm{row}}$ that is $(\mathrm{Id}, V_{\mathrm{out}})$-lossless for $\mathbf{x} \mapsto \mathbf{W}\mathbf{x}$. Then $\mathcal{E}^{\mathrm{row}} \circ \mathcal{E}^{\mathrm{col}}$ and $\mathcal{E}^{\mathrm{col}} \circ \mathcal{E}^{\mathrm{row}}$ are both $(V_{\mathrm{in}}, V_{\mathrm{out}})$-lossless for $\mathbf{x} \mapsto \mathbf{W}\mathbf{x}$.
The claim is easy to prove since rows and columns are expanded independently.
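The sketch below combines a column-random expansion (Example B.4.4) with a row-average expansion (Example B.4.1) on a single weight matrix, as allowed by Claim 2, and checks that a zero-expanded input yields an average-expanded output; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_in_new, d_out, d_out_new = 3, 5, 4, 6
W = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)

# Column-random expansion (input dimension 3 -> 5): append arbitrary columns.
W_col = np.concatenate([W, rng.normal(size=(d_out, d_in_new - d_in))], axis=1)
# Row-average expansion (output dimension 4 -> 6): append rows equal to the row average.
row_avg = W_col.mean(axis=0, keepdims=True)
W_exp = np.concatenate([W_col, np.repeat(row_avg, d_out_new - d_out, axis=0)], axis=0)

# Losslessness check: feed a zero-expanded input; the output is the
# average-expanded version of the original output.
x_zero = np.concatenate([x, np.zeros(d_in_new - d_in)])
y, y_exp = W @ x, W_exp @ x_zero
print(np.allclose(y_exp, np.concatenate([y, np.full(d_out_new - d_out, y.mean())])))  # True
```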
B.4.2 Lossless expansion for bias
Note that a fully-connected layer consists of a matrix multiplication followed by a bias operator. We now give examples for the bias operator $\mathbf{x} \mapsto \mathbf{x} + \mathbf{b}$.
Example B.4.6 (Bias average expansion).
Consider the bias operator $\mathbf{x} \mapsto \mathbf{x} + \mathbf{b}$, where $\mathbf{b} \in \mathbb{R}^{D_{\mathrm{out}}}$. The bias is average expanded to dimension $D^*_{\mathrm{out}}$ by setting the expanded bias to $V_{\mathrm{avg}}(\mathbf{b})$. Moreover, this expansion is $(V_{\mathrm{avg}}, V_{\mathrm{avg}})$-lossless for the bias operator.
Remark.
Note that we can analogously define bias zero expansion and bias circular expansion by expanding $\mathbf{b}$ to $V_{\mathrm{zero}}(\mathbf{b})$ and $V_{\mathrm{circ}}(\mathbf{b})$, respectively. Moreover, they are $(V_{\mathrm{zero}}, V_{\mathrm{zero}})$-lossless and $(V_{\mathrm{circ}}, V_{\mathrm{circ}})$-lossless for the bias operator, respectively.
B.4.3 Consecutive application of lossless expansion for operators
In previous sections we give examples of lossless expansion methods for single operators. Now, to ensure lossless when applying expansion methods to consecutive layers/operators, we introduce the following claim:
Claim 3 (Losslessness of consecutive application).
If $\mathcal{E}_1$ is $(V_1, V_2)$-lossless for $g_1$ and $\mathcal{E}_2$ is $(V_2, V_3)$-lossless for $g_2$, then the composition $\mathcal{E}_2(g_2) \circ \mathcal{E}_1(g_1)$ is $(V_1, V_3)$-lossless for $g_2 \circ g_1$.
Proof.
If the input is expanded by $V_1$, then by definition the output of $\mathcal{E}_1(g_1)$ is the $V_2$-expanded output of $g_1$. Using the fact that $\mathcal{E}_2$ is $(V_2, V_3)$-lossless and that its input is $V_2$-expanded, we conclude the proof. ∎
Remark.
By leveraging Claim 3, we can separately apply lossless expansion methods to various layers/operators in a larger network. The only requirement is that the output vector space expansion of one expansion method matches the input vector space expansion of the subsequent expansion method.
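A tiny sketch of Claim 3 with illustrative matrices: the first operator is expanded so that its output is circularly expanded, and the second operator is expanded so that it expects exactly that circular input expansion; the composition then remains lossless, as required by the matching condition in the remark.

```python
import numpy as np

rng = np.random.default_rng(0)
D_S, D_T = 4, 6
r = D_T - D_S

A = rng.normal(size=(D_S, D_S))                       # first operator (input dim unchanged)
B = rng.normal(size=(D_S, D_S))                       # second operator

A_exp = np.concatenate([A, A[:r]], axis=0)            # row-circular: output is circularly expanded
B_cols = np.concatenate([B, 0.5 * B[:, :r]], axis=1)  # column-circular: expects a circular input
B_cols[:, :r] *= 0.5
B_exp = np.concatenate([B_cols, B_cols[:r]], axis=0)  # row-circular output again

x = rng.normal(size=D_S)
y = B @ (A @ x)
y_exp = B_exp @ (A_exp @ x)
assert np.allclose(y_exp, np.concatenate([y, y[:r]]))   # the composition stays lossless
```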
Appendix C Details of LEMON for Pre-LN Transformers
In this section, we provide a detailed explanation of how LEMON is applied to the Pre-LN Transformer architecture. By Claim 3, we can treat different modules separately. In the following sections, we delve into the details of applying expansion methods to these modules.


C.1 Width expansion for Pre-LN Transformer blocks
We first recap the Pre-LN Transformer architecture. It usually consists of (1) the embedding layer, (2) several Pre-LN Transformer blocks, (3) the final LayerNorm layer, and (4) a decoder layer.
Suppose that the hidden dimension of the Transformer is increased from to . The head dimension is unchanged during expansion; hence, the number of heads is increased from to . We use to denote the key, query, and value weight matrices for the -th head in the MHA module, and to denote the projection matrix.
We use to denote the width expansion of Pre-LN Transformer blocks. can be decomposed into (1) LayerNorm expansion , (2) MHA module expansion , and (3) MLP module expansion . We introduce these expansion methods in the following paragraphs. We provide an illustration of and in Figure 9.
(1) LayerNorm expansion with . We define the expansion procedure for LN as follows. We use where , , and with to expand the original LayerNorm layer . The expansion is lossless and the proof is given in Proposition 1. Moreover, is -lossless for . In Figure 9, we omit and for better illustration.
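As a sanity check of this LayerNorm expansion, the sketch below uses one concrete parameter assignment that is consistent with the description (the exact assignment appears in Proposition 1): the original weight is rescaled by sqrt(D_S / D_T), the new bias entries are zero, and the new weight entries are arbitrary; the epsilon term is omitted and all names are illustrative. With an average-expanded input, the expanded LayerNorm produces a zero-expanded output.

```python
import numpy as np

rng = np.random.default_rng(0)
D_S, D_T = 6, 8
r = D_T - D_S

def layernorm(x, gamma, beta):
    # plain LayerNorm; the numerical-stability epsilon is omitted in this sketch
    return gamma * (x - x.mean()) / np.sqrt(x.var()) + beta

gamma, beta = rng.normal(size=D_S), rng.normal(size=D_S)

# assumed expanded parameters: rescale gamma, zero-pad beta, arbitrary new gamma entries
gamma_exp = np.concatenate([gamma * np.sqrt(D_S / D_T), rng.normal(size=r)])
beta_exp  = np.concatenate([beta, np.zeros(r)])

x = rng.normal(size=D_S)
x_avg = np.concatenate([x, np.full(r, x.mean())])      # average-expanded input

assert np.allclose(layernorm(x_avg, gamma_exp, beta_exp),
                   np.concatenate([layernorm(x, gamma, beta), np.zeros(r)]))
```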
(2) MHA expansion with . We explain how to expand the MHA module as follows (a code sketch follows this list):
• Key, query, and value weights in self-attention. We consider the affine transformations applied to a single token in a sequence in the form of , , and , where and . (In the formulation of MHA in section 3, are right matrix multiplied with the input sequence matrix ; here we use the form of for better illustration.) During expansion, we increase the dimension of from to while keeping unchanged. Since the number of rows of is unchanged, we only increase the number of columns by applying the column-random expansion defined in Example B.4.4 to its transpose for each head, i.e., we use , , and for the expanded weights of and , where are random matrices. Biases are unchanged.
• Heads in self-attention. We increase the number of heads in a circular pattern; see Figure 3(b) for an illustration. Note that (1) when , we can set differently for replicated heads to break symmetry; (2) additionally, when , the random matrices can be chosen differently for replicated heads to break symmetry. Please see Example B.4.4 for the definitions of and .
• Projection matrix in self-attention. For the projection transformation in the form of , where and , we use and defined in Example B.4.5 and Example B.4.1 to expand the weights and biases. Specifically, we use for the expanded weight of . We then use for the expanded bias of .
Moreover, is -lossless for .
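Below is a NumPy sketch of the whole MHA expansion under the same assumptions as above: zero-expanded token inputs (as produced by the expanded LayerNorm), key/query/value weights padded with random rows, heads replicated circularly with fresh random padding, the projection weights of replicated heads split with a factor of 1/2, and the projection output columns and bias expanded by their average. All shapes, names, and the splitting factor are illustrative; the final check confirms that the expanded MHA maps zero-expanded inputs to average-expanded outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_head, H_S, H_T = 4, 3, 4                 # head dim (unchanged), heads: source -> target
D_S, D_T = H_S * d_head, H_T * d_head      # model dims
r = D_T - D_S

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mha(X, WQ, WK, WV, WO, bO):
    # X: (seq, D_in); WQ/WK/WV: per-head (D_in, d_head); WO: (num_heads * d_head, D_out)
    heads = [softmax((X @ q) @ (X @ k).T / np.sqrt(d_head)) @ (X @ v)
             for q, k, v in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO + bO

# source (small) MHA parameters
WQ = [rng.normal(size=(D_S, d_head)) for _ in range(H_S)]
WK = [rng.normal(size=(D_S, d_head)) for _ in range(H_S)]
WV = [rng.normal(size=(D_S, d_head)) for _ in range(H_S)]
WO, bO = rng.normal(size=(D_S, D_S)), rng.normal(size=D_S)

def pad_random(w):
    # append r random rows: the new input dimensions carry zeros, so they do not matter
    return np.concatenate([w, rng.normal(size=(r, d_head))], axis=0)

# key/query/value: pad each per-head weight; replicated heads get fresh random padding
WQ_e = [pad_random(WQ[i % H_S]) for i in range(H_T)]
WK_e = [pad_random(WK[i % H_S]) for i in range(H_T)]
WV_e = [pad_random(WV[i % H_S]) for i in range(H_T)]

# projection: split the rows of replicated heads so they sum to the original rows,
# then expand the output columns by their average; the bias is average-expanded
n_rep = H_T - H_S
WO_rows = np.concatenate([WO, 0.5 * WO[:n_rep * d_head]], axis=0)
WO_rows[:n_rep * d_head] *= 0.5
WO_e = np.concatenate([WO_rows, np.tile(WO_rows.mean(axis=1, keepdims=True), (1, r))], axis=1)
bO_e = np.concatenate([bO, np.full(r, bO.mean())])

X = rng.normal(size=(5, D_S))                               # 5 tokens
X_zero = np.concatenate([X, np.zeros((5, r))], axis=1)      # zero-expanded input

out_small = mha(X, WQ, WK, WV, WO, bO)
out_large = mha(X_zero, WQ_e, WK_e, WV_e, WO_e, bO_e)
out_avg = np.concatenate([out_small, np.tile(out_small.mean(axis=1, keepdims=True), (1, r))],
                         axis=1)
assert np.allclose(out_large, out_avg)                      # average-expanded output
```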
(3) MLP expansion with . Consider the MLP in the form of , where is the non-linear activation. We explain how to expand the MLP as follows (a code sketch follows this list):
• For the first fully-connected layer, we increase the columns by random expansion and the rows by circular expansion. Specifically, we use and for the expanded weight and bias.
• For the second fully-connected layer, we increase the columns by circular expansion and the rows by average expansion. Specifically, we use and for the expanded weight and bias.
Moreover, is -lossless for .
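The sketch below assembles the MLP expansion end to end under the same assumptions (zero-expanded inputs from the expanded LayerNorm, a 1/2 split for repeated intermediate units, a GELU-style activation, and illustrative sizes); the check confirms that the expanded MLP maps a zero-expanded input to the average-expanded original output.

```python
import numpy as np

rng = np.random.default_rng(0)
D_S, D_T = 6, 8                      # hidden dims (assume D_S <= D_T < 2 * D_S)
H_S, H_T = 4 * D_S, 4 * D_T          # MLP intermediate dims
r_in, r_mid = D_T - D_S, H_T - H_S

def gelu(z):
    # tanh approximation of GELU; any element-wise activation works for this check
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z ** 3)))

# source MLP: y = W2 @ gelu(W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(H_S, D_S)), rng.normal(size=H_S)
W2, b2 = rng.normal(size=(D_S, H_S)), rng.normal(size=D_S)

# first FC: random new columns (inputs are zero-expanded), circular new rows and bias
W1_cols = np.concatenate([W1, rng.normal(size=(H_S, r_in))], axis=1)
W1_exp = np.concatenate([W1_cols, W1_cols[:r_mid]], axis=0)
b1_exp = np.concatenate([b1, b1[:r_mid]])

# second FC: split the columns of repeated intermediate units, average-expand rows and bias
W2_cols = np.concatenate([W2, 0.5 * W2[:, :r_mid]], axis=1)
W2_cols[:, :r_mid] *= 0.5
W2_exp = np.concatenate([W2_cols, np.tile(W2_cols.mean(axis=0), (r_in, 1))], axis=0)
b2_exp = np.concatenate([b2, np.full(r_in, b2.mean())])

x = rng.normal(size=D_S)
y_small = W2 @ gelu(W1 @ x + b1) + b2
x_zero = np.concatenate([x, np.zeros(r_in)])                     # zero-expanded input
y_large = W2_exp @ gelu(W1_exp @ x_zero + b1_exp) + b2_exp

# lossless: the expanded MLP outputs the average-expanded original output
assert np.allclose(y_large, np.concatenate([y_small, np.full(r_in, y_small.mean())]))
```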
C.2 Width expansion of other layers
In this section, we explain how to expand the rest of the layers, i.e., embedding layers and decoder layers.
Embeddings expansion with . We average-expand the embedding of each token by appending its average, i.e., with . For Vision Transformers, we do so by adding averaged channels to the patch embeddings.
Decoder layer expansion with . For Vision Transformers, the decoder layer is a fully-connected layer with the form . We increase the rows of the matrix by applying column-random expansion to its transpose, i.e., we use for the expanded weights. The bias is unchanged.
For language models, the decoder layer is shared with the embedding layer. So we have to instead scale the weight and bias of the LayerNorm before the decoder layer by . Moreover, is -lossless for Dec.
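For the Vision Transformer decoder, a brief sketch with hypothetical sizes: because the decoder input is the zero-expanded output of the final LayerNorm, padding the weight with arbitrary new columns and leaving the bias unchanged keeps the logits identical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, D_S, D_T = 10, 6, 8
W, b = rng.normal(size=(num_classes, D_S)), rng.normal(size=num_classes)

W_exp = np.concatenate([W, rng.normal(size=(num_classes, D_T - D_S))], axis=1)  # random new columns

z = rng.normal(size=D_S)                                # final-LayerNorm output of the small model
z_zero = np.concatenate([z, np.zeros(D_T - D_S)])       # zero-expanded in the large model
assert np.allclose(W_exp @ z_zero + b, W @ z + b)       # logits are unchanged
```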
C.3 Depth expansion
Depth expansion is explained in section 4.
C.4 Parameter choices
We consider the case for better illustration (in fact, we only need to deal with such cases in our experiments). There are mainly the following parameters to choose for LEMON. For the indivisible case, we set the random parameter in the LayerNorm such that . When using matrix column-random expansion for the indivisible case, we use .
Vision transformers. For the width expansion parameters of Vision Transformers, we set for the indivisible case and for the divisible case to be , where is randomly initialized and .
For the depth expansion parameters, we set the free parameters that are used to cancel out replicated neurons following the distribution .
Language models. For the width expansion parameters of BERT, we set for the indivisible case and for the divisible case to , where is randomly initialized and .
For the depth expansion parameters, we set the projection matrix of the MHA block and the second fully-connected layer of the MLP block to be zero matrices. Moreover, inspired by advanced knowledge initialization (AKI) (Chen et al., 2021a), we append heads/neurons from the next adjacent layer (this is still lossless since the last layer is left-multiplied by a zero matrix, followed by the addition of a zero vector).
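To see why zero-initializing the last linear map of each added residual branch keeps depth expansion lossless, here is a minimal sketch of an added Pre-LN MLP sub-block whose second fully-connected layer is a zero matrix with a zero bias (the MHA projection is handled analogously); the added block acts as the identity on the residual stream. Names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 32

def layernorm(x, g, b):
    return g * (x - x.mean()) / np.sqrt(x.var() + 1e-5) + b

def pre_ln_mlp_block(x, g, b, W1, b1, W2, b2):
    # residual branch: LN -> FC -> ReLU -> FC
    return x + W2 @ np.maximum(W1 @ layernorm(x, g, b) + b1, 0) + b2

g, b = rng.normal(size=D), rng.normal(size=D)
W1, b1 = rng.normal(size=(H, D)), rng.normal(size=H)
W2, b2 = np.zeros((D, H)), np.zeros(D)        # zero-initialized second FC layer

x = rng.normal(size=D)
assert np.allclose(pre_ln_mlp_block(x, g, b, W1, b1, W2, b2), x)   # the added block is the identity
```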
Appendix D LEMON for other architectures
Though Post-Res-Norm and Post-LN blocks are not included in our main experiments, we show that LEMON can perform lossless model expansion in these scenarios as well. We then briefly discuss how to handle the RMS Norm (Zhang & Sennrich, 2019) used in LLaMa (Touvron et al., 2023), and how to apply LEMON to convolutional neural networks.
D.1 Post-Res-Norm Transformers
We consider the Transformer with the following architecture: (1) an embedding layer, (2) several Post-Res-Norm blocks, and (3) the final decoder layer (we assume there is no final LayerNorm before the decoder layer).
D.1.1 Width expansion
The only difference between the expansion methods for Post-Res-Norm Transformers and Pre-LN Transformers is that we zero-expand the embedding vector of each token with .
For the MHA and MLP modules, we use exactly the same expansion introduced in section C.1, which is -lossless for MHA and MLP. Consequently, our expansion is -lossless for Post-Res-Norm Transformer blocks. Since the decoder expansion is -lossless for Dec, our expansion method is strictly lossless.
D.1.2 Depth expansion
To increase the depth, we only need to set the weights and biases of the LayerNorm in each added layer to all zeros.
D.2 Post-LN Transformers
For Post-LN Transformers, we can only deal with the divisible case, i.e., . Suppose ; in this case, the embeddings and the outputs of all modules (MLP and MHA) are duplicated times, and the expansion is hence lossless. The remaining difficulty is depth expansion.
Depth expansion. Suppose we are given a pre-trained Post-LN Transformer block . First, we expand to so that it outputs zeros. Then we create two expanded layers where and . It is easy to show that is lossless, using the fact that .
D.3 Transformers with RMS Norm
RMS Norm (Zhang & Sennrich, 2019) is used by foundation models like LLaMa (Touvron et al., 2023) and Baichuan (Yang et al., 2023); see Definition 2 for its definition. To expand the RMS Norm from dimension to , we use the following expansion.
RMS Norm expansion with . We use where , and with to expand the original RMS Norm layer . The expansion is -lossless for . The proof is provided in Proposition 4.
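A sketch analogous to the LayerNorm case and consistent with Proposition 4: with a zero-expanded input, rescaling the original RMS Norm weight by sqrt(D_S / D_T) and filling the new weight entries arbitrarily yields a zero-expanded output; the epsilon term is again omitted and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D_S, D_T = 6, 8
r = D_T - D_S

def rmsnorm(x, gamma):
    # plain RMS Norm without the numerical-stability epsilon
    return gamma * x / np.sqrt((x ** 2).mean())

gamma = rng.normal(size=D_S)
gamma_exp = np.concatenate([gamma * np.sqrt(D_S / D_T), rng.normal(size=r)])

x = rng.normal(size=D_S)
x_zero = np.concatenate([x, np.zeros(r)])          # zero-expanded input

assert np.allclose(rmsnorm(x_zero, gamma_exp),
                   np.concatenate([rmsnorm(x, gamma), np.zeros(r)]))
```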
D.4 Convolutional neural networks
We use to denote convolutional layer with in-channels, out-channels, and kernel size . We assume the kernel weight is and bias . We use BN and ReLU to denote BatchNorm and ReLU, respectively. ResNet and WideResNet with more than 50 layers consist of multiple Bottleneck blocks, where there are 3 sub-blocks: (1) -BN-ReLU, (2) -BN-ReLU, and (3) -BN in the residual branch.
We consider expanding ResNet to WideResNet with the same depth (a depth increase can also be applied). During expansion, we increase the number of channels from to . To apply the expansion, we do the following (a code sketch follows these steps):
(1) For the first sub-block, increase the number of out-channels of the first convolutional layer from to . Specifically, the expanded weight satisfies , and . The output of the convolutional layer will then also follow a circular pattern in the channel dimension. This still holds after applying BatchNorm and ReLU, since BatchNorm statistics are computed per channel.
(2) For the second sub-block, increase the number of out-channels and in-channels of the first convolutional layer from to . We apply the same operation to the out-channel dimension as in (1). For the in-channel dimension, we need to ensure that the weights of replicated channels sum up to the original weight. Specifically, suppose the indices of the replicated channels are denoted . Then we need to set for a lossless expansion. Moreover, we need to ensure that for symmetry breaking.
(3) For the last sub-block, increase the number of in-channels of the first convolutional layer from to similar to (2).
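As a toy illustration of steps (1)–(3), the following PyTorch sketch uses made-up channel counts and plain 3×3 convolutions in place of the full bottleneck (BatchNorm and ReLU are omitted since, as noted above, they preserve the circular channel pattern). It expands the out-channels of one layer circularly and splits the weights of the replicated in-channels of the next layer so that they sum to the originals; the 1/2 split is for simplicity, whereas the text only requires that the split weights sum to the original and differ for symmetry breaking.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C_S, C_T, k = 4, 6, 3                      # channels: source -> target, kernel size
r = C_T - C_S

W1 = torch.randn(C_S, 3, k, k)             # first conv: 3 -> C_S channels
W2 = torch.randn(C_S, C_S, k, k)           # next conv: C_S -> C_S channels

# (1) expand out-channels circularly: the new filters repeat the leading ones
W1_exp = torch.cat([W1, W1[:r]], dim=0)

# (2) expand in-channels: weights of replicated input channels must sum to the original
W2_in = torch.cat([W2, 0.5 * W2[:, :r]], dim=1)
W2_in[:, :r] *= 0.5
W2_exp = torch.cat([W2_in, W2_in[:r]], dim=0)     # out-channels again expanded circularly

x = torch.randn(1, 3, 8, 8)
h_small, h_large = F.conv2d(x, W1, padding=1), F.conv2d(x, W1_exp, padding=1)
assert torch.allclose(h_large[:, :C_S], h_small, atol=1e-5)        # original channels intact
assert torch.allclose(h_large[:, C_S:], h_small[:, :r], atol=1e-5) # new channels are circular copies

y_small, y_large = F.conv2d(h_small, W2, padding=1), F.conv2d(h_large, W2_exp, padding=1)
assert torch.allclose(y_large[:, :C_S], y_small, atol=1e-5)
assert torch.allclose(y_large[:, C_S:], y_small[:, :r], atol=1e-5)
```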
Appendix E Proofs
E.1 Proofs for Transformers with LayerNorm
In this section, we first show that three main components , , and are lossless. Then, we prove that LEMON defined in Appendix C is lossless.
We first start by showing that our LayerNorm expansion defined in section C.1 is lossless.
Proposition 1 (Lossless expansion for LayerNorm ).
Consider of dimension where . Define average expanded of of dimension to be , where . If , , and , where , then
Proof.
Since and :
• For :
• For :
Hence, . ∎
Remark.
When is divisible by , then . This explains why simply circularly expanding the LayerNorm is lossless in that scenario.
Proposition 1 naturally leads to the following corollary.
Corollary 1.
introduced in Definition 1 is -lossless for .
Using Claim 3, we are ready to prove that and are lossless. We first show that is lossless in Proposition 2.
Proposition 2 (Lossless of ).
defined in section C.1 is -lossless for MHA.
Proof.
Consider a sequence input that is losslessly expanded by to . We expand the source small MHA such that the target large model is .
We first check the keys, queries, and values of each head of the large model ; we denote them as . Note that the biases are not expanded. These outputs are identical to those of the small source model, since are expanded by , which is -lossless. Consequently, is identical to the output of the -th head of the MHA in the source small model, which is .
Since the heads are circularly expanded, the concatenated output of is also losslessly expanded.
Finally, is expanded by and , which is -lossless. Together with the fact that the bias is not expanded (unchanged), we conclude that is -lossless for MHA. ∎
We then show that is lossless in Proposition 3.
Proposition 3 (Lossless of ).
defined in section C.1 is -lossless for MLP.
Proof.
This is easily obtained: since the first fully-connected layer is -lossless, its output is losslessly expanded. After applying the element-wise nonlinear activation, the output is still losslessly expanded. Since the second fully-connected layer is -lossless, we conclude that is -lossless for MLP. ∎
Hence, using Proposition 2 and Proposition 3 along with Claim 3, we obtain the following Corollary 2 and Corollary 3.
Corollary 2.
The expanded Pre-LN MHA module is -lossless for .
Proof.
Since is -lossless for LN and is -lossless for MHA, the result follows from Claim 3. ∎
Corollary 3.
The expanded Pre-LN MLP module is -lossless for .
By incorporating the residual connection, we obtain the following corollary.
Corollary 4.
The expanded Pre-LN modules (Pre-LN MHA/MLP) with residual connections are -lossless for the original Pre-LN modules with residual connections.
Once again using Claim 3, we naturally obtain the following corollary.
Corollary 5.
The width-expanded Pre-LN Transformer layer is -lossless for .
Finally, by considering the embedding and decoder layers, we show that LEMON is lossless.
Corollary 6.
LEMON introduced in section C.1 is -lossless for Pre-LN Transformers, i.e., strictly lossless (identical).
Proof.
Since the embeddings are average expanded, the outputs of the Pre-LN Transformer blocks are average expanded. Hence, the output of the final LN before the decoder is zero expanded. Since the decoder layer expansion is -lossless for , we conclude that LEMON is -lossless. ∎
E.2 Proofs for Transformers with RMS Norm
In this section, we show that defined in section D.3 is lossless.
Proposition 4 (Lossless expansion for RMS Norm ).
Consider of dimension where . Define zero expanded of of dimension to be , where . If , and , where , then
Proof.
• For :
• For :
Hence, . ∎
Proposition 4 naturally leads to the following corollary.
Corollary 7.
introduced in section D.3 is -lossless for .
Appendix F More ablation studies
F.1 Comparison with LiGO
LiGO (Wang et al., 2023a) is unavailable for direct comparison due to the absence of open-source code; hence, we compare against its reported values. Note that our method is lossless only for the Pre-LN Transformer architecture, while LiGO reports its language-model results mainly on Post-LN BERT and RoBERTa. As a consequence, we compare our results with LiGO on expanding ViT (ViT-Small) to ViT (ViT-Base); note that DeiT without distillation is exactly ViT. The result is shown in Figure 10.
Our method recovers the performance of the target model within 85 epochs, leading to a 71.67% computational saving, which is higher than the 55.40% reported for LiGO (note that DeiT-Base (ViT-Base) reaches a final validation accuracy of 81.00% for LiGO, lower than the 81.70% reported by the official DeiT and by our implementation).
Appendix G More related works
Efficiency in deep learning can be achieved in multiple ways. In this section we provide a brief overview of efficient deep learning regarding model training and inference, distinguishing it from methods addressing data efficiency (Gong et al., 2021; Wu et al., 2023a; b).
Efficient deep learning. In the realm of deep learning, the drive for efficiency has led researchers to develop a multitude of methods aimed at optimizing model efficiency. Techniques such as neural architecture search (NAS) (Zoph & Le, 2016; Liu et al., 2018) have been employed to automate the discovery of optimal network architecture. Quantization (Rastegari et al., 2016; Hubara et al., 2017) refines the numeric precision of model parameters to boost computational speed. Knowledge distillation (Hinton et al., 2015) and knowledge inheritance (Qin et al., 2021) allow target models to inherit the knowledge of their source counterparts. Neural network pruning (LeCun et al., 1989) involves removing unnecessary connections to accelerate model training or inference. Finally, model growth methods (Chen et al., 2015) directly use the weights of source models to initialize the large target models.
Neural architecture search (NAS) has emerged as a promising solution for automating the process of neural architecture design, eliminating the need for labor-intensive manual designs across various deep learning tasks. Initial methodologies leveraged reinforcement learning (Zoph & Le, 2016; Baker et al., 2016) and evolutionary algorithms (Real et al., 2019) to identify high-performing architectures. Despite their success, a significant drawback was their computational demands. Addressing this, DARTS (Liu et al., 2018) introduced a continuous relaxation of architectural representation, allowing for search via gradient descent. However, DARTS can be challenging to optimize, and its weight-sharing approach has been criticized for potential performance degradation (Yu et al., 2019; Wang et al., 2020b). Seeking further efficiency, Mellor et al. (2021) introduced a training-free NAS, which evaluates randomly initialized architectures, thus fully eliminating neural network training during the search phase. Subsequent training-free methods explored searches using the Neural Tangent Kernel (NTK) (Xu et al., 2021; Chen et al., 2021b; Wang et al., 2022a), linear regions (Chen et al., 2021b), and criteria related to pruning (Abdelfattah et al., 2021).
When considered alongside model expansion, NAS holds potential for determining the optimal number of layers and hidden dimension of the large target model.
Neural network pruning. Pruning techniques can be broadly classified by their timing into three categories: post-hoc pruning, pruning-at-initialization, and pruning-during-training. (1) Post-hoc pruning removes certain weights of a fully-trained neural network; it was initially proposed to accelerate model inference (LeCun et al., 1989; Hassibi et al., 1993; Han et al., 2015), while lottery-ticket works (Frankle & Carbin, 2018; Renda et al., 2020) shifted the focus towards uncovering trainable sub-networks. (2) SNIP (Lee et al., 2018) is one of the pioneering pruning-at-initialization methods, which aim to find trainable sub-networks without any training. Subsequent research (Wang et al., 2020a; Tanaka et al., 2020; de Jorge et al., 2020; Lee et al., 2019; Wang et al., 2022b) introduced varying metrics for pruning at the network initialization stage. (3) Finally, pruning-during-training methods prune or adjust DNNs throughout training. Early works incorporate explicit (Louizos et al., 2017) or (Wen et al., 2016) regularization terms to encourage sparsity, mitigating the performance degradation commonly associated with post-hoc pruning. More recent techniques such as DST methods (Bellec et al., 2017; Mocanu et al., 2018; Evci et al., 2020; Liu et al., 2021a; Wang et al., 2023b) allow adaptive mask modifications during training while adhering to specified parameter constraints.
Neural network pruning has potential synergies with model expansion, akin to the dynamics of DST. A combined approach could involve iterative increases and decreases in hidden dimensions or layers during training, potentially accelerating training speed.