
LEMON: Lossless model expansion

Yite Wang1, Jiahao Su2, Hanlin Lu2, Cong Xie2, Tianyi Liu2, Jianbo Yuan2,
Haibin Lin2, Ruoyu Sun3,4, Hongxia Yang2
1University of Illinois Urbana-Champaign, USA  2ByteDance Inc.
3Shenzhen International Center for Industrial and Applied Mathematics,
  Shenzhen Research Institute of Big Data
4School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
[email protected]
{jiahao.su, hanlin.lu, cong.xie, tianyi.liu, jianbo.yuan,
haibin.lin, hx.yang}@bytedance.com
[email protected]
Work done during internship at ByteDance. Corresponding author.
Abstract

Scaling of deep neural networks, especially Transformers, is pivotal for their surging performance and has further led to the emergence of sophisticated reasoning capabilities in foundation models. Such scaling generally requires training large models from scratch with random initialization, failing to leverage the knowledge acquired by their smaller counterparts, which are already resource-intensive to obtain. To tackle this inefficiency, we present LosslEss MOdel ExpansioN (LEMON), a recipe to initialize scaled models using the weights of their smaller but pre-trained counterparts. This is followed by model training with an optimized learning rate scheduler tailored explicitly for the scaled models, substantially reducing the training time compared to training from scratch. Notably, LEMON is versatile, ensuring compatibility with various network structures, including models like Vision Transformers and BERT. Our empirical results demonstrate that LEMON reduces computational costs by 56.7% for Vision Transformers and 33.2% for BERT when compared to training from scratch.

1 Introduction

Figure 1: Comparison between training from scratch and model expansion. Our method expands a smaller pre-trained model into a larger model without any performance drop. The continual training of the large model requires significantly less training time than training from scratch.

Deep neural networks (DNNs) have become increasingly popular, showcasing their adaptability across domains ranging from natural language processing to computer vision. Recent advances in architectural design, especially Transformers, have further enhanced the scalability of DNNs. However, it is common practice to train larger versions of these models from scratch, discarding the valuable knowledge acquired by their smaller counterparts. Such an approach can be highly inefficient, especially given the intensive computational resources required to train large language models such as the Generative Pre-trained Transformer (GPT) (Brown et al., 2020), and the resultant huge carbon footprint. For instance, training GPT-3 incurs costs of around $4.6M (Li, 2020). Given these challenges, researchers are keenly exploring ways to leverage the prior knowledge of smaller models for more efficient scaling.

Knowledge inheritance and model expansion are two primary methodologies to achieve this goal. Knowledge inheritance (Qin et al., 2021), the reverse of knowledge distillation (Hinton et al., 2015), allows the large model to learn the predictions of a smaller pre-trained model. However, this method often necessitates additional computational resources and modifications to the training pipeline due to the involvement of a ‘teacher network.’ In contrast, model expansion directly utilizes the weights from the pre-trained small source network, either without training (Chen et al., 2015; 2021a; Yang et al., 2020; Shen et al., 2022) or with negligible training (Wang et al., 2023a). Hence, our work mainly focuses on model expansion due to its minimal impact on the training pipeline and negligible computational overhead.

A compelling requirement for model expansion is to ensure it is ‘lossless,’ meaning no information from the source model is lost. Specifically, the goal is for the larger target model to inherit the exact functional mapping as the smaller source model, thus preserving the performance. Net2Net (Chen et al., 2015) represents a foundational study of lossless model expansion for convolutional networks (CNNs) and multi-layer perceptrons (MLPs) where it duplicates neurons and averages their fan-out weights. However, a challenge arises with the ‘symmetry breaking’ issue. This problem occurs when duplicated neurons in expanded layers introduce redundancy, which persists during subsequent training. In this sense, the expanded model will never gain more capacity than the source model. To counteract this problem, previous researchers introduced additional noise into the expansion process, leading to a shift away from a genuine lossless expansion.

Transformers, despite their rising popularity in modern deep learning, introduce additional complexities in achieving lossless expansion that go beyond traditional issues like symmetry breaking. One key obstacle arises from the intricacy of LayerNorm, which became evident when bert2BERT (Chen et al., 2021a) tried extending the Net2Net approach to Transformers, leading to lossy outcomes. Staged training (Shen et al., 2022) demonstrated the feasibility of lossless model expansion, but with a specific constraint: doubling the width during expansion, and only for a variant of Transformers known as Pre-Layer Normalization (Pre-LN) Transformers. However, real-world applications often require width increases in the expanded model that are not multiples of the smaller source model's width, highlighting a limitation in existing methodologies. A typical scenario involves expanding the hidden dimension from 512 to 768.

In exploring the possibilities of lossless model expansion, our research focuses on the ability to break symmetry, handle indivisible width and depth increments, and remain compatible with almost all Transformer varieties. We have discovered affirmative answers, revealing that multiple solutions exist, enabling the selection of an optimal candidate to break the symmetry or find an initialization point with specific properties. Specifically, we break the symmetry of replicated neurons by setting their fan-out weights to be unequal, and we introduce average expansion to deal with LayerNorm for indivisible width increment.

In addition to lossless model expansion techniques, our study also delves into training recipes for the expanded models. It is often overlooked whether applying the original training recipe remains optimal or whether the expanded models necessitate tailored approaches. Our empirical studies reveal two key insights: expanded models can benefit from utilizing a default maximum learning rate and, intriguingly, a learning rate scheduler that decays more rapidly.

Our contributions are summarized as follows:

  1. We propose LEMON, a suite of algorithms designed for lossless model expansion across a variety of architectures, ensuring compatibility with indivisible width and depth increments.

  2. Drawing inspiration from our empirical results, we propose an optimized learning rate scheduler for the expanded models. This scheduler maintains the maximum learning rate used when training from scratch but features accelerated decay rates.

  3. LEMON reduces the computational costs by up to 56.7% for Vision Transformers and 33.2% for BERT compared to training from scratch, thereby setting a new benchmark in performance.

2 Related works

Table 1: Overview of model expansion or knowledge inheritance methods. In the first three columns, we use the symbols ✓, ✗, and N/A to denote whether the method is (1) lossless, (2) non-lossless, or (3) not applicable in the given scenario. Here, 'Depth' represents the scenario where the large model has more layers than the smaller model, and 'Width (divisible/indivisible)' denotes whether the large model's hidden dimension is a multiple of the smaller model's. In the subsequent columns, 'Free parameters' denotes whether the method allows for choosing different target models (e.g., to avoid the symmetry breaking issue), and 'Data-free' specifies whether the algorithm can be applied without training data. LEMON is the most versatile method compared to previous methods.

| Method | Depth | Width (divisible) | Width (indivisible) | Free parameters | Data-free |
| --- | --- | --- | --- | --- | --- |
| KI (Qin et al., 2021) | ✗ | ✗ | ✗ | No | No |
| StackBERT (Gong et al., 2019) | ✗ | N/A | N/A | No | Yes |
| MSLT (Yang et al., 2020) | ✗ | N/A | N/A | No | Yes |
| bert2BERT (Chen et al., 2021a) | ✗ | ✗ | ✗ | No | Yes |
| Staged Training (Shen et al., 2022) | ✓ | ✓ | N/A | No | Yes |
| LiGO (Wang et al., 2023a) | ✗ | ✗ | ✗ | Yes | No |
| LEMON (Ours) | ✓ | ✓ | ✓ | Yes | Yes |

From small models to larger models. There are two main approaches to transferring the knowledge of smaller models to larger models: knowledge inheritance and model expansion. Knowledge inheritance (Qin et al., 2021) enables a student network to learn the logits provided by a teacher network. Net2Net (Chen et al., 2015) was the first work to explore the idea of model expansion. It involves randomly duplicating neurons while preserving the output values through proper normalization and increasing depth by adding identity layers. However, Net2Net resorts to introducing weight perturbations to overcome symmetry, resulting in performance deterioration. The follow-up work bert2BERT (Chen et al., 2021a) extends Net2Net to Transformers, while others study depth growth (Gong et al., 2019; Yang et al., 2020; Chang et al., 2017; Dong et al., 2020). Staged training (Shen et al., 2022) made significant progress by proposing a lossless model expansion method for Pre-LN Transformers, but with the constraint of width doubling. LiGO (Wang et al., 2023a) suggests employing multiple training steps to find an appropriate linear combination of weights from the source network. Despite these advancements, all existing methods still face either a performance drop or strict restrictions on the model width. Table 1 compares the related methods.

Network initialization. Numerous studies aim to seek optimal initialization methods for neural networks, primarily focusing on regulating the norm of network parameters (Glorot & Bengio, 2010; He et al., 2015). Theoretical works try to study these methods through dynamical isometry (Saxe et al., 2013) or mean field theory (Poole et al., 2016). Orthogonal initialization, which supports layer-wise dynamical isometry in fully-connected layers, has been extended to CNNs via Delta orthogonal initialization (Xiao et al., 2018). However, there has been limited research on initialization methods specifically for Transformers. Most of these works focus on theoretical approaches to train Transformers without skip connections or normalization layers (Noci et al., 2022; He et al., 2023). Mimetic initialization (Trockman & Kolter, 2023) seeks to initialize attention based on the principles of pre-trained Transformers.

Continual pre-training. Recent research explores adapting pre-trained networks for new or improved datasets. While some target datasets from different domains (Scialom et al., 2022; Ke et al., 2022; Gupta et al., 2023; Qin et al., 2022), others focus on datasets that evolve over time (Han et al., 2020; Jang et al., 2021; Loureiro et al., 2022). Model expansion is similar to continual pre-training, with the distinction being a change in the model size rather than the data distribution.

3 Preliminaries

Model expansion aims to initialize a large model with the weights of its smaller pre-trained counterpart. Concretely, suppose we have pre-trained weights $\theta^{\text{trained}}_{S}$ of a source network $f_{S}(\cdot;\theta^{\text{trained}}_{S})$; our goal is to design a mapping $\theta^{\text{expanded}}_{T}=\mathcal{M}(\theta^{\text{trained}}_{S})$, where the expanded weights initialize the target network $f_{T}(\cdot;\theta^{\text{expanded}}_{T})$. Since these expanded weights contain knowledge acquired by the small pre-trained model, they should accelerate the training of $f_{T}$ compared to random initialization. Moreover, we call a model expansion algorithm lossless if $f_{T}(\mathbf{x};\theta^{\text{expanded}}_{T})=f_{S}(\mathbf{x};\theta^{\text{trained}}_{S})$ for all $\mathbf{x}$.

An example of model expansion is to use a pre-trained ResNet-18 (He et al., 2016) or BERT-Small ($f_{S}$) to facilitate the training of ResNet-50 or BERT-Base ($f_{T}$), respectively. Instead of training the larger models from scratch, the idea is to initialize them with the weights of their smaller pre-trained counterparts, i.e., ResNet-18 or BERT-Small, respectively.
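To make the losslessness criterion concrete, the following minimal sketch (not the authors' code; `f_S`, `f_T`, and `probe_inputs` are placeholder names) compares the two networks on a set of probe inputs:

```python
import torch

def is_lossless(f_S, f_T, probe_inputs, atol=1e-5):
    """Check f_T(x; expanded weights) == f_S(x; trained weights) on probe inputs."""
    f_S.eval()
    f_T.eval()
    with torch.no_grad():
        return all(torch.allclose(f_S(x), f_T(x), atol=atol) for x in probe_inputs)
```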

Transformer architecture, introduced by Vaswani et al. (2017), consists of multiple Transformer blocks $g(\cdot)$, where each block is a stack of two modules, a multi-head attention (MHA) and a two-layer MLP. Depending on the location of LayerNorm (LN), Transformer blocks can be categorized as (1) Post-Norm blocks used by the original BERT (Devlin et al., 2019), where LN is applied after the residual block, i.e., $g(x)=\texttt{LN}(\texttt{Module}(x)+x)$; (2) Pre-Norm blocks used by GPT (Brown et al., 2020), Pre-LN BERT, Vision Transformers (Dosovitskiy et al., 2021), and Swin Transformer (Liu et al., 2021b), where LN is applied inside the residual connection and before all other transformations, i.e., $g(x)=x+\texttt{Module}(\texttt{LN}(x))$; and (3) Res-Post-Norm blocks used by Swin Transformer V2 (Liu et al., 2022), where LN is applied inside the residual connection and after all other transformations, i.e., $g(x)=x+\texttt{LN}(\texttt{Module}(x))$. See Figure 2 for an illustration.
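The three variants can be written schematically as follows (a sketch; `module` stands for an MHA or MLP module and `ln` for a LayerNorm, both placeholders):

```python
def post_norm_block(x, module, ln):      # original BERT
    return ln(module(x) + x)

def pre_norm_block(x, module, ln):       # GPT, ViT, Swin
    return x + module(ln(x))

def res_post_norm_block(x, module, ln):  # Swin V2
    return x + ln(module(x))
```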

Figure 2: Varieties of attention blocks. (a) Post-Norm block. (b) Pre-Norm block. (c) Res-Post-Norm block.

Multi-head attention (MHA) uses multiple self-attention heads to attend to information from different representation subspaces of the input. Given an input sequence $\mathbf{X}\in\mathbb{R}^{E\times D}$, where $E$ is the sequence length and $D$ is the embedding dimension, each head projects the inputs into different subspaces using linear transformations. For the $i$-th head, its query is defined as $\mathbf{Q}_{i}=\mathbf{X}\mathbf{W}^{Q}_{i}$, its key as $\mathbf{K}_{i}=\mathbf{X}\mathbf{W}^{K}_{i}$, and its values as $\mathbf{V}_{i}=\mathbf{X}\mathbf{W}^{V}_{i}$, where $\mathbf{W}_{i}^{Q},\mathbf{W}_{i}^{K}\in\mathbb{R}^{D\times d_{K}}$ and $\mathbf{W}_{i}^{V}\in\mathbb{R}^{D\times d_{V}}$. Here, $d_{K}$ and $d_{V}$ represent the dimensions of the key and value, respectively. Each head then computes the attention as $\text{Head}_{i}=\text{Attention}(\mathbf{Q}_{i},\mathbf{K}_{i},\mathbf{V}_{i})=\texttt{softmax}\left(\mathbf{Q}_{i}\mathbf{K}_{i}^{\intercal}/\sqrt{d_{K}}\right)\mathbf{V}_{i}$. The outputs from all $H$ heads are concatenated and linearly transformed to yield the final output:

$$\text{MHA}(\mathbf{X})=\texttt{Concat}\left[\text{Head}_{1},\cdots,\text{Head}_{H}\right]\mathbf{W}^{O},$$

where $\mathbf{W}^{O}\in\mathbb{R}^{Hd_{V}\times D}$ is the weight matrix. Please refer to Vaswani et al. (2017) for more details.
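A compact sketch of this computation (not the paper's implementation; tensor shapes and argument names are illustrative, with per-head projection matrices passed as lists):

```python
import torch

def mha(X, W_Q, W_K, W_V, W_O):
    """X: (E, D); W_Q/W_K/W_V: H matrices of shape (D, d_K) or (D, d_V); W_O: (H*d_V, D)."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        attn = torch.softmax(Q @ K.t() / K.shape[-1] ** 0.5, dim=-1)
        heads.append(attn @ V)                  # (E, d_V) per head
    return torch.cat(heads, dim=-1) @ W_O       # concatenate heads, project back to (E, D)
```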

Weight symmetry. Consider a two-layer MLP with two hidden neurons of the form $\texttt{MLP}(\mathbf{x})=\mathbf{v}^{\intercal}\sigma(\mathbf{W}_{1}\mathbf{x})=v_{1}\sigma(w_{1,1}x_{1}+w_{1,2}x_{2})+v_{2}\sigma(w_{2,1}x_{1}+w_{2,2}x_{2})$, where $\sigma$ is the nonlinear activation and $v_{1},v_{2}$ are the weights associated with the hidden neurons. If the weights are initialized such that $v_{1}=v_{2}$, $w_{1,1}=w_{2,1}$, and $w_{1,2}=w_{2,2}$, the two neurons will always compute identical values throughout training. This symmetry results from the fact that, at each iteration, the gradients of the corresponding weights are the same, i.e., $\dot{w}_{1,1}=\dot{w}_{2,1}$ and $\dot{w}_{1,2}=\dot{w}_{2,2}$. Weight symmetry is significant because the two symmetric neurons do not contribute independently to the model's learning, potentially harming the model's expressive power and learning capability.
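A toy sketch of this effect (values and shapes are illustrative): duplicating a hidden neuron and giving the two copies equal fan-out weights produces identical gradients for the two weight rows, so the copies never diverge during training.

```python
import torch

torch.manual_seed(0)
W1 = torch.randn(1, 2)                                    # one "source" hidden neuron
W = torch.cat([W1, W1.clone()]).requires_grad_()          # duplicated hidden neuron
v = torch.tensor([[0.5], [0.5]], requires_grad=True)      # equal fan-out weights

x = torch.randn(4, 2)
loss = (torch.relu(x @ W.t()) @ v).pow(2).mean()
loss.backward()
print(torch.allclose(W.grad[0], W.grad[1]))  # True: the symmetric neurons get identical gradients
```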

4 Our method: Lossless model expansion

(a) Width expansion of MLP from 2 to 4 (left) or 5 (right). (b) Expand the number of heads in MHA from 2 to 3.
Figure 3: Lossless width expansion with symmetry breaking for multi-layer perceptron (MLP) and multi-head attention (MHA). (a) Left: MLP expansion with divisible width. We replicate neurons $h_{1}$/$h_{2}$ to $h_{1}^{\ast}$/$h_{2}^{\ast}$ and set $\alpha+\beta=1$ with $\alpha\neq\beta$. Right: MLP expansion with indivisible width. We further replicate the neuron $h_{1}$ to $h_{1}^{\ast\ast}$ and set $\alpha+\beta+\gamma=1$ with $\alpha\neq\beta\neq\gamma$. (b) MHA expansion with head dimension unchanged. We duplicate $\text{Head}_{1}$ to $\text{Head}^{\ast}_{1}$ (i.e., duplicate key/query/value projections) and expand the projection layer as in an MLP module.

We decompose the expansion operator $\mathcal{M}$ into two operators, i.e., the depth expansion operator $\mathcal{D}$ and the width expansion operator $\mathcal{W}$, each applied to individual layers.

Our expansion method consists of three main components: (1) general lossless width expansion with symmetry breaking, (2) average width expansion for LayerNorm, and (3) lossless depth expansion. In the expansion process, each layer is independently subjected to these methods, ensuring a layer-level lossless expansion: each layer receives a losslessly expanded version of its original input and, in turn, guarantees that its output is a lossless expansion of the original output, so the property propagates recursively through the network.

4.1 General lossless width expansion with symmetry breaking

We first show how to apply lossless expansion with symmetry breaking for (1) fully-connected layers (FC-layers) and (2) multi-head attention (MHA).

Lossless width expansion for FC-layers. Transformers consist of a set of FC-layers. We first use MLP as an example to show the basic width expansion operator for the FC-layers.

For width expansion, we create copies of neurons similar to Net2Net and bert2BERT; this step is necessary due to the nonlinear activation used in the MLP. The essential difference, however, is that we do NOT set the fan-out weights of replicated neurons to be equal. For simplicity, we use a single-hidden-layer MLP for illustration, shown on the left half of 3(a). We first replicate neurons $h_{1},h_{2}$ to $h^{*}_{1},h^{*}_{2}$ in a circular pattern. Consider the identical neurons $h_{1}$ and $h_{1}^{*}$ in the plot, whose original fan-out weight is $v_{1,1}$; we can set the expanded fan-out weights to $\alpha v_{1,1}$ and $\beta v_{1,1}$, where $\alpha+\beta=1$, to ensure lossless expansion.

The selection of $(\alpha,\beta)$ corresponds to a specific lossless model expansion algorithm, and our method can be considered a generalization of existing model expansion methods. Specifically, Net2Net and bert2BERT perform width expansion by setting $\alpha=\beta=1/2$. However, such a choice causes the weight symmetry problem, where the two neurons learn exactly the same representations at initialization and throughout subsequent training. We introduce a simple modification to fix this issue: setting $\alpha\neq\beta$ is enough to break symmetry for commonly used nonlinear activations $\sigma$. This concept extends to cases where neurons are replicated more than twice, as illustrated on the right half of 3(a); in such cases, we set coefficients such that $\alpha+\beta+\gamma=1$ and $\alpha\neq\beta\neq\gamma$. A code sketch of this construction is given below.
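A minimal sketch of this construction for a single-hidden-layer MLP without biases (the helper name and the specific choice of unequal coefficients are illustrative; any distinct coefficients summing to one work):

```python
import torch

def expand_mlp_width(W_in, W_out, new_hidden):
    """W_in: (hidden, d_in), W_out: (d_out, hidden) -> expanded weight pair."""
    hidden = W_in.shape[0]
    idx = torch.arange(new_hidden) % hidden          # circular neuron replication
    W_in_new = W_in[idx]                             # copy fan-in weights
    W_out_new = torch.zeros(W_out.shape[0], new_hidden)
    for h in range(hidden):
        copies = (idx == h).nonzero().flatten()
        k = len(copies)
        # unequal coefficients summing to one, e.g. proportional to 1, 2, ..., k
        coef = torch.arange(1, k + 1, dtype=W_out.dtype)
        coef = coef / coef.sum()
        for c, a in zip(copies, coef):
            W_out_new[:, c] = a * W_out[:, h]
    return W_in_new, W_out_new

# losslessness check on random weights
W_in, W_out = torch.randn(3, 8), torch.randn(5, 3)
W_in_new, W_out_new = expand_mlp_width(W_in, W_out, new_hidden=5)
x = torch.randn(2, 8)
y_small = torch.relu(x @ W_in.t()) @ W_out.t()
y_large = torch.relu(x @ W_in_new.t()) @ W_out_new.t()
print(torch.allclose(y_small, y_large, atol=1e-5))   # True
```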

MHA expansion. We directly copy entire heads in a circular pattern, similar to the FC-layer case described above, and perform width expansion for the corresponding key, query, and value matrices. The problem then reduces to a case similar to the MLP because of the projection matrix that follows: symmetry breaking is realized by setting the corresponding fan-out weights in the projection matrix to be different. We illustrate the process in 3(b).

4.2 Average width expansion for LayerNorm

When dealing with indivisible width increments, we need to design a specific expansion method for the LayerNorm layer. In this section, we demonstrate that achieving a lossless expansion is feasible provided that FC-layers are positioned before the LayerNorm layer.

Figure 4: Lossless average expansion. When the fully-connected layer right before LayerNorm is average expanded, the output of LayerNorm is expanded with zeros.

Average width expansion. We first show that it is easy to perform average expansion such that the output of an FC-layer is padded with its average. We do so by adding neurons whose weights are the average of the existing neurons. Specifically, we pad the original weight $\mathbf{W}\in\mathbb{R}^{D_{\text{out}}\times D_{\text{in}}}$ with rows $\frac{1}{D_{\text{out}}}\sum_{i=1}^{D_{\text{out}}}\mathbf{W}[i,:]$ and pad the bias $\mathbf{b}\in\mathbb{R}^{D_{\text{out}}}$ with $\frac{1}{D_{\text{out}}}\sum_{i=1}^{D_{\text{out}}}\mathbf{b}[i]$. (The input dimension should be expanded as well, depending on how the inputs are expanded.) See Figure 4 for an illustration.
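A minimal sketch of this padding for a bare weight/bias pair, assuming the nn.Linear convention of a (D_out, D_in) weight; expansion of the input dimension is omitted here:

```python
import torch

def average_expand_fc(W, b, new_d_out):
    """Pad W (d_out, d_in) and b (d_out,) with average rows/entries."""
    d_out = W.shape[0]
    n_pad = new_d_out - d_out
    W_pad = W.mean(dim=0, keepdim=True).expand(n_pad, -1)   # average row repeated
    b_pad = b.mean().expand(n_pad)                           # average bias repeated
    return torch.cat([W, W_pad], dim=0), torch.cat([b, b_pad], dim=0)
```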

LayerNorm layer. We now show that if the input of LayerNorm is average expanded, lossless width expansion is possible. Specifically, consider a LayerNorm layer with element-wise affine transformation of the form $\texttt{LN}(\cdot;\bm{\mu},\mathbf{b})=\bm{\mu}\odot\texttt{Norm}(\cdot)+\mathbf{b}$, where $\bm{\mu},\mathbf{b}\in\mathbb{R}^{D_{S}}$ and $D_{T}\leq 2D_{S}$. Define the average expansion of $\mathbf{x}\in\mathbb{R}^{D_{S}}$ to be $\mathbf{x}^{*}\in\mathbb{R}^{D_{T}}$. It can be shown that $\texttt{LN}(\mathbf{x}^{*};\bm{\mu}^{*},\mathbf{b}^{*})=\texttt{Concat}\left[\texttt{LN}(\mathbf{x};\bm{\mu},\mathbf{b}),\mathbf{0}\right]$ if $\bm{\mu}^{*}=\texttt{Concat}\left[\eta\bm{\mu},\bm{\zeta}\right]$ and $\mathbf{b}^{*}=\texttt{Concat}\left[\mathbf{b},\mathbf{0}\right]$, where $\mathbf{0}\in\mathbb{R}^{D_{T}-D_{S}}$ is a zero vector, $\bm{\zeta}\in\mathbb{R}^{D_{T}-D_{S}}$ is an arbitrary vector, and $\eta=\sqrt{D_{S}/D_{T}}$ is a scalar. See section E.1 for results and proof in the more general case where $D_{T}\geq D_{S}$.
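This identity can be checked numerically; the sketch below uses torch.nn.functional.layer_norm with a tiny eps, since the equality holds exactly only up to the eps term inside the normalization:

```python
import torch
import torch.nn.functional as F

D_S, D_T = 4, 6                                               # D_T <= 2 * D_S
x = torch.randn(D_S)
mu, b = torch.randn(D_S), torch.randn(D_S)

x_star = torch.cat([x, x.mean().expand(D_T - D_S)])           # average expansion of the input
eta = (D_S / D_T) ** 0.5
mu_star = torch.cat([eta * mu, torch.randn(D_T - D_S)])       # zeta is arbitrary
b_star = torch.cat([b, torch.zeros(D_T - D_S)])

out_small = F.layer_norm(x, (D_S,), mu, b, eps=1e-12)
out_large = F.layer_norm(x_star, (D_T,), mu_star, b_star, eps=1e-12)
print(torch.allclose(out_large,
                     torch.cat([out_small, torch.zeros(D_T - D_S)]),
                     atol=1e-5))                               # True: output padded with zeros
```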

4.3 Lossless depth expansion

(a) Arrangement of block stacking. (b) Type-1 depth expansion. (c) Type-2 depth expansion.
Figure 5: Lossless depth expansion. (a) We place a new block next to the block where it originates. (b) For type-1 depth expansion, we set the weights of the last fully-connected layer to zeros. (c) For type-2 depth expansion, we specify the weights of the last fully-connected layer so that the contributions from replicated neurons cancel each other. For example, assuming $h^{\ast}_{1}$ is a duplicate of $h_{1}$, we set their fan-out weights to $\alpha v_{1,1}$ and $-\alpha v_{1,1}$ to enforce zero output.

In this section, we detail our approach for increasing model depth in a lossless manner.

Arrangement of added layers. Similar to how Chang et al. (2017); Dong et al. (2020) deal with ResNet, we put added layers directly next to the source layers. For example, when expanding a two-layer network with blocks $\{g_{1},g_{2}\}$, we perform depth expansion so that the resulting model is $\left\{\mathcal{W}[g_{1}],\mathcal{D}[\mathcal{W}[g_{1}]],\mathcal{W}[g_{2}],\mathcal{D}[\mathcal{W}[g_{2}]]\right\}$. See 5(a) for an illustration.

Lossless depth expansion. We now provide two ways to perform lossless depth expansion.

Firstly, we can simply set the output of each module (MLP or MHA) to be zero, i.e., by setting $\alpha=\beta=0$, so that the residual branch does not contribute to the output. This choice gives great flexibility to the rest of the parameters, since we can (1) copy weights from other layers or (2) randomly initialize the weights. See 5(b) for an illustration.

Secondly, we can enforce the output to be zero by setting the summation of the fan-out weights of replicated neurons to zero. With the example shown in 3(a), we can set the fan-out weights of the replicated neurons using $\alpha=-\beta\neq 0$ to ensure all outputs are zero. (If a neuron is not replicated, its fan-out weights have to be set to zero.) See 5(c) for an illustration.
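A minimal sketch of this second, cancellation-based construction for one residual MLP block (shapes and the value of alpha are illustrative; the LayerNorm inside the block is omitted):

```python
import torch

W_in = torch.randn(3, 8)                       # (hidden, d_in)
W_out = torch.randn(8, 3)                      # (d_out, hidden)

W_in_new = torch.cat([W_in, W_in.clone()])     # duplicate every hidden neuron
alpha = 0.7
W_out_new = torch.cat([alpha * W_out, -alpha * W_out], dim=1)   # replica contributions cancel

x = torch.randn(2, 8)
branch = torch.relu(x @ W_in_new.t()) @ W_out_new.t()
print(torch.allclose(x + branch, x, atol=1e-5))   # residual block reduces to the identity
```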

4.4 A summary of implementing model expansion

We summarize the procedure of model expansion for the Pre-LN Transformer architecture with both depth and width increments. We first average expand the embedding weights and then make sure the output of each layer is average expanded; hence, the input to the decoder layer is the original output padded with zeros after the last LayerNorm. We provide a detailed description of our expansion method in section C.1. Furthermore, we explain how to use our method for Post-LN and Res-Post-Norm architectures in Appendix D.

5 How to train the expanded models

In this section, we delve into the influence of different factors in the training recipe, in particular the maximum learning rate and the learning rate scheduler, when training expanded models.

Experiment setup. Throughout this study, we adopt ViT (Dosovitskiy et al., 2021) as our exemplary model and train it on the standard ImageNet-1k dataset. In particular, we choose to expand ViT(6,512) to ViT(12,768), where 6/12 represent the number of attention blocks and 512/768 denote the hidden dimensions. When training these models from scratch, we apply a default maximum learning rate of $1\times 10^{-3}$ and run the training for 300 epochs with a batch size of 1024. We use a cosine learning rate scheduler that decays to a minimum learning rate of $10^{-5}$. However, we will modify this training recipe for continual training of the expanded model ViT(12,768).

Figure 6: Influence of maximum learning rate (LR; a,b) and learning rate scheduler (Sched; c,d) for training expanded Vision Transformers. Dashed and solid horizontal lines represent the validation accuracy of the small and large models when trained from scratch. (a) Train loss when changing maximum LR, (b) validation accuracy when changing maximum LR, (c) different LR schedulers used in experiments, (d) validation accuracy when changing the LR scheduler. We find that (1) using a smaller maximum LR results in smaller training loss but yields worse validation accuracy; (2) expanded models require significantly fewer epochs to match the performance of the larger model.

5.1 The effects of maximum learning rate

Suppose we have an expanded model $f_{T}$ that maintains the same accuracy $\mathcal{A}(f_{S})$ as the smaller source model. One might naturally opt for a smaller learning rate, expecting the loss of the expanded model to continue to decrease. If this were the case, we could smooth the transition between the training processes of the small model and the expanded model. However, our investigations reveal that the relationship is more complex than it initially seems.

We conducted experiments with three different maximum learning rates: $1\times 10^{-3}$ (default), $2\times 10^{-4}$, and $1\times 10^{-4}$, maintaining a consistent minimum learning rate of $1\times 10^{-5}$ across all cases. The results are shown in 6(b). We summarize our findings in the following paragraphs.

Performance drop early in training. An interesting observation is the immediate decrease in validation accuracy experienced by all three expanded models early during the learning rate warm-up. (We tried changing the number of warm-up steps, but the results were not greatly affected.) This performance drop is correlated with the magnitude of the learning rate: the larger it is, the more pronounced the drop. This aligns with our expectation, as small learning rates are critical for model convergence, especially when the source model is already near a local optimum; adopting a larger learning rate can displace the weights from this local minimum, leading to an increase in training loss.

Maximum learning rate and model generalization. We observe that maintaining the default maximum learning rate is pivotal for recovering the performance of the large model. To investigate whether adopting smaller learning rates hinders model learning, we also examine the training loss in all cases, as illustrated in 6(a). The results show that models trained with reduced learning rates incur smaller training losses than training from scratch. Hence, we postulate that the performance deterioration induced by a smaller maximum learning rate reflects a loss in the generalization capability of the expanded networks rather than in their optimization capability. This concept is also theoretically examined by Li et al. (2020), who illustrate how the learning rate can influence the order in which varied patterns are learned, thereby affecting generalization capacities.

5.2 How fast the learning rate should decay

After settling the maximum learning rate, the next important parameter to consider is the total number of epochs. Most works use the default learning rate scheduler (Wang et al., 2023a; Chen et al., 2021a), maintaining the same number of epochs as if the model were training from scratch. We, however, note that the expanded model, having inherited knowledge from the source model, starts with a small training loss — this holds true even when accounting for the significant performance drop during warm-up. This indicates the expanded model is closer to the local optimum, requiring a smaller learning rate for continued loss reduction. Thus, we should adopt a learning rate scheduler where the learning rate decays faster.

We examine four different epoch totals $T_{\text{total}}$: 130, 150, 200, and 300, with the corresponding learning rate schedulers illustrated in 6(c). Experiment results are shown in 6(d).

Expanded model necessitates faster learning rate decay. As depicted in 6(d), a notable observation is that employing a learning rate scheduler with faster decay enables the expanded model to quickly attain the performance of the corresponding large target model. Remarkably, the expanded model requires only 130 epochs of training to match the performance of the target model that was trained from scratch, translating to a computational cost saving of up to 56.67%. This corroborates our earlier conjecture that expanded models need a learning rate scheduler that decays faster.

In summary, we recommend employing the same maximum learning rate as is used for training from scratch but with accelerated decay.
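A sketch of this recipe as a per-epoch schedule; the warm-up length, maximum/minimum learning rates, and the 130-epoch decay horizon follow the ViT setting above, while the function itself is illustrative:

```python
import math

def lr_at_epoch(epoch, max_lr=1e-3, min_lr=1e-5, warmup=5, decay_epochs=130):
    """Keep the default maximum LR, but decay the cosine schedule over fewer epochs."""
    if epoch < warmup:                                   # linear warm-up
        return max_lr * (epoch + 1) / warmup
    if epoch >= decay_epochs:                            # hold the minimum LR afterwards
        return min_lr
    progress = (epoch - warmup) / (decay_epochs - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```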

6 Main experiments

(a) From ViT(6,384) to ViT(12,768). (b) From ViT(6,512) to ViT(12,768). (c) From BERT(6,384) to BERT(12,768). (d) From BERT(6,512) to BERT(12,768).
Figure 7: Results of ViT on ImageNet (a,b) and BERT on English Wiki (c,d). Dashed and solid horizontal lines represent the validation accuracy/MLM loss of the trained small model and target model, respectively. LEMON outperforms baselines, yielding computational savings of 56.7%, 56.7%, 25.5%, and 33.2% in panels (a), (b), (c), and (d) compared to training from scratch, respectively.

In this section, we compare our method with existing model expansion algorithms on Vision Transformers and BERT. We name our method LosslEss MOdel ExpansioN (LEMON), which uses the expansion algorithm explained in section 4 with an optimized learning rate scheduler that decays faster, as suggested in section 5.

Baselines. We consider several baselines to compare with our proposed method: (1) training the target model from scratch; (2) bert2BERT-FPI (Chen et al., 2015), a generalization of Net2Net; (3) bert2BERT-AKI (Chen et al., 2021a), which uses advanced knowledge initialization (AKI) to break symmetry; (4) soft KI (Qin et al., 2021), which learns the output of the source model by minimizing the KL-divergence of the two output distributions; and (5) hard KI, which learns the predictions of the source model. We do not include StackBERT (Gong et al., 2019), Yang et al. (2020), and Staged training (Shen et al., 2022), as they are not compatible with indivisible width expansion. LiGO (Wang et al., 2023a) is unavailable for direct comparison due to the absence of open-source code; hence, comparisons are made using values reported for expanding ViT(12,512) to ViT(12,768) in section F.1.

6.1 Vision Transformers

Experiment setting. We adopt the default experimental setup described in section 5 unless stated otherwise. For LEMON, the learning rate is decayed to its minimum value over $T_{\text{total}}=130$ epochs in both experiments. Parameter choices of LEMON are discussed in section C.4.

Experiment results. As demonstrated in 7(a) and 7(b), LEMON achieves lossless model expansion. In both experiment settings, LEMON recovers the performance of the target model within 130 epochs, outperforming the other baselines.

Several additional observations were made during the study. First, both bert2BERT-FPI and bert2BERT-AKI exhibited performance inferior to training from scratch. Second, consistent with the observations in Chen et al. (2021a) and Wang et al. (2023a), soft KI did not enhance the training speed of the target model, while hard KI did, possibly by functioning akin to curriculum learning and filtering out challenging training samples for the target model early in training.

6.2 Language models

Experiment setting. For our experiments, we train Pre-LN BERT (Xiong et al., 2020) on the masked language modeling task. The model is trained on the English Wiki corpus as per the methods in Tan & Bansal (2020) for 220k iterations with 5k warmup steps and a batch size of 256. We use a maximum learning rate of $2\times 10^{-4}$ and a cosine learning rate scheduler that decreases the learning rate to $2\times 10^{-5}$. Following Liu et al. (2019), we remove the next sentence prediction task and use a fixed sequence length of 128 for model pre-training.

We consider the following expansion procedures: (1) BERT(6,384) to BERT(12,768), and (2) BERT(6,512) to BERT(12,768). We exclude KI from our baselines. For LEMON, we decay the learning rate to its minimum value over 165k and 132k iterations for BERT(6,384) and BERT(6,512), respectively. Parameter choices of LEMON are discussed in section C.4. We report the number of iterations needed to achieve a log validation MLM loss of 1.64.

(a) ViT(6,384) $\rightarrow$ ViT(12,768). (b) ViT(6,512) $\rightarrow$ ViT(12,768).
Figure 8: LEMON outperforms other baselines even when they employ the same optimized learning rate schedulers.

Experiment results. As shown in 7(c) and 7(d), LEMON successfully expands smaller models without incurring loss. It outperforms the baselines and achieves computational cost savings of 25.5% and 33.2% for BERT(6,384) and BERT(6,512), respectively.

Downstream task. We also present downstream performance of BERT trained by LEMON on the GLUE (Wang et al., 2018) dataset. We report correlation for the STS-B dataset and Matthews correlation coefficient for the CoLA dataset. Accuracy is reported for the remaining datasets. The results reveal that BERT(12,768) exhibits superior downstream performance when expanded from BERT(6,384) as opposed to being trained from scratch or being expanded from BERT(6,512). This likely stems from its more extensive training duration (165k iterations) compared to the model expanded from BERT(6,512) (132k iterations).

Table 2: Downstream performance of BERT(12,768) on the GLUE dataset. The large model expanded from BERT(6,384) achieves the best downstream performance. A potential reason for this may be its longer training duration (165k iterations) compared to the model expanded from BERT(6,512) (132k iterations).

| Method | STS-B (Corr.) | MRPC (Acc.) | CoLA (Mcc.) | SST-2 (Acc.) | QNLI (Acc.) | MNLI (Acc.) | MNLI-mm (Acc.) | QQP (Acc.) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Train from scratch | 0.744 | 83.33 | 0.19 | 88.88 | 87.80 | 80.28 | 81.17 | 89.62 |
| LEMON (Ours), from BERT(6,512) | 0.848 | 83.82 | 0.36 | 90.14 | 88.76 | 80.92 | 81.57 | 89.91 |
| LEMON (Ours), from BERT(6,384) | 0.866 | 85.54 | 0.38 | 90.94 | 89.33 | 81.81 | 81.81 | 90.40 |

6.3 Ablation studies: the effects of the training recipe

To study the effects of our proposed training recipe on baselines, we conduct an ablation study where we apply our training recipe on them. The results are shown in 8(a). It is shown that expanded models indeed require faster learning rate decay. Additionally, LEMON continues to outperform other baselines under the same modified training recipe.

7 Conclusion

In this paper, we propose LEMON, a method that combines lossless model expansion with an optimized learning rate scheduler, showing compatibility with, and significant performance improvements for, a variety of Transformer architectures. However, LEMON does have its limitations, including the need to tune the total number of training epochs and an evaluation scale constrained by the available computational resources. Looking ahead, we are working on extending the application of LEMON to larger models and on developing methodologies for selecting optimal free parameters when initializing LEMON.

References

  • Abdelfattah et al. (2021) Mohamed S Abdelfattah, Abhinav Mehrotra, Łukasz Dudziak, and Nicholas D Lane. Zero-cost proxies for lightweight nas. arXiv preprint arXiv:2101.08134, 2021.
  • Baker et al. (2016) Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
  • Bellec et al. (2017) Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training very sparse deep networks. arXiv preprint arXiv:1711.05136, 2017.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chang et al. (2017) Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.
  • Chen et al. (2021a) Cheng Chen, Yichun Yin, Lifeng Shang, Xin Jiang, Yujia Qin, Fengyu Wang, Zhi Wang, Xiao Chen, Zhiyuan Liu, and Qun Liu. bert2bert: Towards reusable pretrained language models. arXiv preprint arXiv:2110.07143, 2021a.
  • Chen et al. (2015) Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015.
  • Chen et al. (2021b) Wuyang Chen, Xinyu Gong, and Zhangyang Wang. Neural architecture search on imagenet in four {gpu} hours: A theoretically inspired perspective. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=Cnon5ezMHtu.
  • de Jorge et al. (2020) Pau de Jorge, Amartya Sanyal, Harkirat S Behl, Philip HS Torr, Gregory Rogez, and Puneet K Dokania. Progressive skeletonization: Trimming more fat from a network at initialization. arXiv preprint arXiv:2006.09081, 2020.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp.  248–255, 2009. doi: 10.1109/CVPR.2009.5206848.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
  • Dong et al. (2020) Chengyu Dong, Liyuan Liu, Zichao Li, and Jingbo Shang. Towards adaptive residual network training: A neural-ode perspective. In International conference on machine learning, pp.  2616–2626. PMLR, 2020.
  • Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
  • Evci et al. (2020) Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, pp.  2943–2952. PMLR, 2020.
  • Frankle & Carbin (2018) Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
  • Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp.  249–256. JMLR Workshop and Conference Proceedings, 2010.
  • Gong et al. (2021) Chengyue Gong, Dilin Wang, Meng Li, Vikas Chandra, and Qiang Liu. Keepaugment: A simple information-preserving data augmentation approach. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  1055–1064, 2021.
  • Gong et al. (2019) Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tieyan Liu. Efficient training of bert by progressively stacking. In International conference on machine learning, pp.  2337–2346. PMLR, 2019.
  • Gupta et al. (2023) Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort. Continual pre-training of large language models: How to (re) warm your model? arXiv preprint arXiv:2308.04014, 2023.
  • Han et al. (2020) Rujun Han, Xiang Ren, and Nanyun Peng. Econet: Effective continual pretraining of language models for event temporal reasoning. arXiv preprint arXiv:2012.15283, 2020.
  • Han et al. (2015) Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. Advances in neural information processing systems, 28, 2015.
  • Hassibi et al. (1993) Babak Hassibi, David G Stork, and Gregory J Wolff. Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pp.  293–299. IEEE, 1993.
  • He et al. (2023) Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andrew Brock, Samuel L Smith, and Yee Whye Teh. Deep transformers without shortcuts: Modifying self-attention for faithful signal propagation. arXiv preprint arXiv:2302.10322, 2023.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp.  1026–1034, 2015.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Hubara et al. (2017) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
  • Jang et al. (2021) Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Stanley Jungkyu Choi, and Minjoon Seo. Towards continual knowledge learning of language models. arXiv preprint arXiv:2110.03215, 2021.
  • Ke et al. (2022) Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. Continual pre-training of language models. In The Eleventh International Conference on Learning Representations, 2022.
  • LeCun et al. (1989) Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989.
  • Lee et al. (2018) Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018.
  • Lee et al. (2019) Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip HS Torr. A signal propagation perspective for pruning neural networks at initialization. arXiv preprint arXiv:1906.06307, 2019.
  • Li (2020) Chuan Li. Openai’s gpt-3 language model: A technical overview. https://lambdalabs.com/blog/demystifying-gpt-3, 2020. Accessed: 2023-09-22.
  • Li et al. (2020) Yuanzhi Li, Colin Wei, and Tengyu Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks, 2020.
  • Liu et al. (2018) Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
  • Liu et al. (2021a) Shiwei Liu, Lu Yin, Decebal Constantin Mocanu, and Mykola Pechenizkiy. Do we actually need dense over-parameterization? in-time over-parameterization in sparse training. In International Conference on Machine Learning, pp.  6989–7000. PMLR, 2021a.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019.
  • Liu et al. (2021b) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  10012–10022, 2021b.
  • Liu et al. (2022) Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  12009–12019, 2022.
  • Louizos et al. (2017) Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through l_0l\_0 regularization. arXiv preprint arXiv:1712.01312, 2017.
  • Loureiro et al. (2022) Daniel Loureiro, Francesco Barbieri, Leonardo Neves, Luis Espinosa Anke, and Jose Camacho-Collados. Timelms: Diachronic language models from twitter. arXiv preprint arXiv:2202.03829, 2022.
  • Mellor et al. (2021) Joe Mellor, Jack Turner, Amos Storkey, and Elliot J Crowley. Neural architecture search without training. In International Conference on Machine Learning, pp.  7588–7598. PMLR, 2021.
  • Mocanu et al. (2018) Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications, 9(1):1–12, 2018.
  • Noci et al. (2022) Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh, and Aurelien Lucchi. Signal propagation in transformers: Theoretical perspectives and the role of rank collapse. arXiv preprint arXiv:2206.03126, 2022.
  • Poole et al. (2016) Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. Advances in neural information processing systems, 29, 2016.
  • Qin et al. (2021) Yujia Qin, Yankai Lin, Jing Yi, Jiajie Zhang, Xu Han, Zhengyan Zhang, Yusheng Su, Zhiyuan Liu, Peng Li, Maosong Sun, et al. Knowledge inheritance for pre-trained language models. arXiv preprint arXiv:2105.13880, 2021.
  • Qin et al. (2022) Yujia Qin, Jiajie Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. Elle: Efficient lifelong pre-training for emerging data. arXiv preprint arXiv:2203.06311, 2022.
  • Rastegari et al. (2016) Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, pp.  525–542. Springer, 2016.
  • Real et al. (2019) Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Aging evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence, volume 2, 2019.
  • Renda et al. (2020) Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389, 2020.
  • Saxe et al. (2013) Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
  • Scialom et al. (2022) Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. Fine-tuned language models are continual learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  6107–6122, 2022.
  • Shen et al. (2022) Sheng Shen, Pete Walsh, Kurt Keutzer, Jesse Dodge, Matthew Peters, and Iz Beltagy. Staged training for transformer language models, 2022.
  • Tan & Bansal (2020) Hao Tan and Mohit Bansal. Vokenization: Improving language understanding with contextualized, visual-grounded supervision, 2020.
  • Tanaka et al. (2020) Hidenori Tanaka, Daniel Kunin, Daniel L Yamins, and Surya Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in Neural Information Processing Systems, 33:6377–6389, 2020.
  • Touvron et al. (2021) Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention, 2021.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Trockman & Kolter (2023) Asher Trockman and J Zico Kolter. Mimetic initialization of self-attention layers. arXiv preprint arXiv:2305.09828, 2023.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
  • Wang et al. (2020a) Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. arXiv preprint arXiv:2002.07376, 2020a.
  • Wang et al. (2022a) Haoxiang Wang, Yite Wang, Ruoyu Sun, and Bo Li. Global convergence of maml and theory-inspired neural architecture search for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  9797–9808, 2022a.
  • Wang et al. (2023a) Peihao Wang, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard, Leonid Karlinsky, Rogerio Feris, David Daniel Cox, Zhangyang Wang, and Yoon Kim. Learning to grow pretrained models for efficient transformer training. arXiv preprint arXiv:2303.00980, 2023a.
  • Wang et al. (2020b) Ruochen Wang, Minhao Cheng, Xiangning Chen, Xiaocheng Tang, and Cho-Jui Hsieh. Rethinking architecture selection in differentiable nas. In International Conference on Learning Representations, 2020b.
  • Wang et al. (2022b) Yite Wang, Dawei Li, and Ruoyu Sun. Ntk-sap: Improving neural network pruning by aligning training dynamics. In The Eleventh International Conference on Learning Representations, 2022b.
  • Wang et al. (2023b) Yite Wang, Jing Wu, Naira Hovakimyan, and Ruoyu Sun. Double dynamic sparse training for gans, 2023b.
  • Wen et al. (2016) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. Advances in neural information processing systems, 29, 2016.
  • Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
  • Wu et al. (2023a) Jing Wu, Jennifer Hobbs, and Naira Hovakimyan. Hallucination improves the performance of unsupervised visual representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  16132–16143, 2023a.
  • Wu et al. (2023b) Jing Wu, Naira Hovakimyan, and Jennifer Hobbs. Genco: An auxiliary generator from contrastive learning for enhanced few-shot learning in remote sensing. arXiv preprint arXiv:2307.14612, 2023b.
  • Xiao et al. (2018) Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, pp.  5393–5402. PMLR, 2018.
  • Xiong et al. (2020) Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pp.  10524–10533. PMLR, 2020.
  • Xu et al. (2021) Jingjing Xu, Liang Zhao, Junyang Lin, Rundong Gao, Xu Sun, and Hongxia Yang. Knas: green neural architecture search. In International Conference on Machine Learning, pp.  11613–11625. PMLR, 2021.
  • Yang et al. (2023) Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.
  • Yang et al. (2020) Cheng Yang, Shengnan Wang, Chao Yang, Yuechuan Li, Ru He, and Jingqiao Zhang. Progressively stacking 2.0: A multi-stage layerwise training method for bert training speedup. arXiv preprint arXiv:2011.13635, 2020.
  • Yu et al. (2019) Kaicheng Yu, Christian Sciuto, Martin Jaggi, Claudiu Musat, and Mathieu Salzmann. Evaluating the search phase of neural architecture search. In International Conference on Learning Representations, 2019.
  • Zhang & Sennrich (2019) Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
  • Zoph & Le (2016) Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

Overview of the Appendix

The Appendix is organized as follows:

  • Appendix A introduces the general experiment setup.

  • Appendix B provides backgrounds and notations for model expansion.

  • Appendix C shows details for applying LEMON on Pre-LN Transformers.

  • Appendix D shows details for applying LEMON on other architectures.

  • Appendix E provides related proofs.

  • Appendix F provides additional ablation studies for the experiments.

  • Appendix G provides additional related works for efficient deep learning.

Appendix A Experiment setup

We conduct all experiments with NVIDIA-V100 and NVIDIA-A100 GPUs. We use the official code base of DeiT (https://github.com/facebookresearch/deit/tree/main; Touvron et al., 2021) for training Vision Transformers and the code base of VLM (https://github.com/airsplay/vokenization; Tan & Bansal, 2020) for training BERT.

A.1 Network architecture

For Vision Transformers, we use the default network architecture adopted in Touvron et al. (2021). For BERT, we implemented Pre-LN BERT in Huggingface’s Transformers package (Wolf et al., 2019) such that:

  • Within the residual branch of each Transformer block, we positioned LayerNorm to precede both the multi-head attention (MHA) and multi-layer perceptron (MLP) modules.

  • For the MLM classification head, we use only one fully-connected layer (shared with the embedding).

A.2 Detailed training configurations

Vision Transformers. We train Vision Transformers on the ImageNet-1k (Deng et al., 2009) dataset. When training Vision Transformers from scratch, we apply a maximum learning rate of $1\times 10^{-3}$ and run the training for 300 epochs with a batch size of 1024. We use a cosine learning rate scheduler that decays to a minimum learning rate of $10^{-5}$ with 5 warm-up epochs.

BERT pre-training. We train Pre-LN BERT (Xiong et al., 2020) on the masked language modeling task. The model is trained on the English Wiki corpus as per the methods in Tan & Bansal (2020) for 220k iterations with 5k warmup steps and a batch size of 256. We use a maximum learning rate of $2\times 10^{-4}$ and a cosine learning rate scheduler that decreases the learning rate to $2\times 10^{-5}$. Following Liu et al. (2019), we remove the next sentence prediction task and use a fixed sequence length of 128 for pre-training.

BERT fine-tuning. For fine-tuning BERT on the GLUE (Wang et al., 2018) benchmark, we train for 3 epochs with a learning rate of $1\times 10^{-4}$ and a batch size of 32 for all tasks. We report correlation for the STS-B dataset and the Matthews correlation coefficient for the CoLA dataset. Accuracy is reported for the remaining datasets.

A.3 Details of baselines

We provide the implementation details of knowledge inheritance (KI) (Qin et al., 2021) in this section. Given a training dataset $\mathcal{D}=\{(\mathbf{x}_{i},\mathbf{y}_{i})\}_{i=1}^{n}$, we define the total loss $\mathcal{L}_{\text{Total}}$ as:

\mathcal{L}_{\text{Total}}(f_{L};f_{S},\mathcal{D})=\sum_{(\mathbf{x}_{i},\mathbf{y}_{i})\in\mathcal{D}}\left[(1-\alpha)\mathcal{L}_{\text{self}}(f_{L}(\mathbf{x}_{i}),\mathbf{y}_{i})+\alpha\mathcal{L}_{\text{KI}}(f_{L},f_{S},\mathbf{x}_{i})\right],

where $\alpha$ is a scalar controlling the strength of KI; the functions $f_{S}$ and $f_{L}$ represent the small source model and the large target model, respectively; and the loss function $\mathcal{L}_{\text{self}}$ computes the standard training loss, such as cross-entropy, between the prediction $f_{L}(\mathbf{x}_{i})$ and the label $\mathbf{y}_{i}$. For soft KI, we set $\mathcal{L}_{\text{KI}}=\text{KL}(f_{L}(\mathbf{x}_{i})\,\|\,f_{S}(\mathbf{x}_{i}))$. For hard KI, we set $\mathcal{L}_{\text{KI}}=\text{KL}(f_{L}(\mathbf{x}_{i})\,\|\,\mathbf{e}_{\operatorname{arg\,max}f_{S}(\mathbf{x}_{i})})$, where KL stands for the Kullback–Leibler divergence and $\mathbf{e}$ is the standard basis vector.

During the KI process, we start with an initial α\alpha value of 0.5 and linearly decrease it to zero.
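A minimal PyTorch sketch of this objective for one mini-batch is shown below. The helper name is ours; for hard KI we use the small model's argmax as a pseudo-label, which is the practical counterpart of the formula above, and the small model's logits are detached so that gradients only flow through the large model.

```python
import torch
import torch.nn.functional as F

def ki_total_loss(logits_large, logits_small, labels, alpha, hard=False):
    """(1 - alpha) * L_self + alpha * L_KI for one mini-batch of logits."""
    loss_self = F.cross_entropy(logits_large, labels)
    if hard:
        # Hard KI: fit the small model's argmax prediction as a pseudo-label.
        pseudo_labels = logits_small.detach().argmax(dim=-1)
        loss_ki = F.cross_entropy(logits_large, pseudo_labels)
    else:
        # Soft KI: KL(f_L || f_S); kl_div(input=log q, target=p) returns KL(p || q).
        p_large = F.softmax(logits_large, dim=-1)
        log_q_small = F.log_softmax(logits_small.detach(), dim=-1)
        loss_ki = F.kl_div(log_q_small, p_large, reduction="batchmean")
    return (1.0 - alpha) * loss_self + alpha * loss_ki
```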

Appendix B Notations and backgrounds

In this section, we introduce basic notations in section B.1, the definition of some normalization layers in section B.2, lossless expansion in vector space in section B.3, lossless expansion for operators (layers) in section B.4, and the rule of consecutive application of lossless expansion methods for consecutive layers in section B.4.3.

B.1 Notations

All vectors are assumed to be column vectors. We define 𝟎d\mathbf{0}_{d} to be a zero vector of dimension dd. We use bold-faced letters for vectors, matrices, and tensors. For a vector 𝐯\mathbf{v}, let 𝐯[i]\mathbf{v}[i] be its ii-th entry and 𝐯[:i]\mathbf{v}[:i] be its first ii entries. For a matrix 𝐌\mathbf{M}, let 𝐌[i,j]\mathbf{M}[i,j], 𝐌[i,:]\mathbf{M}[i,:], and 𝐌[:,j]\mathbf{M}[:,j] be its (i,j)(i,j)-th entry, ii-th row, and jj-th column, respectively. Moreover, let 𝐌[:i,:]\mathbf{M}[:i,:] and 𝐌[:,:j]\mathbf{M}[:,:j] be its first ii rows and first jj columns, respectively. We use 𝐌\mathbf{M}^{\intercal} to denote the matrix transpose of 𝐌\mathbf{M}. We use [n][n] where n+n\in\mathbb{Z}_{+} to denote {1,,n}\{1,\cdots,n\}. We use Id to denote identity mapping. We use Concat[]\texttt{Concat}\left[\cdot\right] to denote horizontal concatenation.

B.2 Model layers

In this section, we give the formal definition of LayerNorm LN()\texttt{LN}(\cdot) and RMS Norm RMS()\texttt{RMS}(\cdot).

Definition 1 (LayerNorm).

LayerNorm LN(;𝛍,𝛃,ϵ)\texttt{LN}(\cdot;\bm{\mu},\bm{\beta},\epsilon) of dimension DD is defined as:

\texttt{LN}(\mathbf{x};\bm{\mu},\bm{\beta},\epsilon)=\frac{\mathbf{x}-\mathbb{E}[\mathbf{x}]}{\sqrt{\mathrm{Var}[\mathbf{x}]+\epsilon}}\odot\bm{\mu}+\bm{\beta},

where 𝐱,𝛍,𝛃D\mathbf{x},\bm{\mu},\bm{\beta}\in\mathbb{R}^{D}.

Definition 2 (RMSNorm).

RMS Norm RMS(;𝛍,ϵ)\texttt{RMS}(\cdot;\bm{\mu},\epsilon) of dimension DD is defined as:

\texttt{RMS}(\mathbf{x};\bm{\mu},\epsilon)=\frac{\mathbf{x}}{\sqrt{\frac{1}{D}\sum_{i=1}^{D}(\mathbf{x}[i])^{2}+\epsilon}}\odot\bm{\mu},

where 𝐱,𝛍D\mathbf{x},\bm{\mu}\in\mathbb{R}^{D}.

Remark.

In neural networks, the inputs of normalization layers are usually high-dimensional tensors. In this case, LayerNorm and RMSNorm are applied separately along the last dimension.

B.3 Lossless expansion in vector space

In this section, we first give the general definition of lossless expansion in vector space.

Definition 3 (Lossless expansion in vector space).

Let $\mathcal{S}$ and $\mathcal{T}$ be two vector spaces whose dimensions satisfy $\text{dim}(\mathcal{T})\geq\text{dim}(\mathcal{S})$. A vector space expansion $\mathcal{V}:\mathcal{S}\rightarrow\mathcal{T}$ is said to be lossless if it is invertible.

Remark.

Note that the identity function Id is lossless with its inverse being itself.

Then we give a few examples of lossless vector space expansions. These examples will also be used in LEMON.

Example B.3.1 (Vector average expansion 𝒱avg\mathcal{V}_{\text{avg}}).

Let 𝐱DS\mathbf{x}\in\mathbb{R}^{D_{S}} be a vector of dimension DSD_{S} and its average Avg(𝐱)=𝔼[𝐱]=1DSiDS𝐱[i]\texttt{Avg}(\mathbf{x})=\mathbb{E}[\mathbf{x}]=\frac{1}{D_{S}}\sum_{i}^{D_{S}}\mathbf{x}[i]. 𝐱avg\mathbf{x}^{*}_{\text{avg}} is called the average expanded 𝐱\mathbf{x} of dimension DTD_{T} with DTDSD_{T}\geq D_{S} if

\mathbf{x}^{*}_{\text{avg}}=\mathcal{V}_{\text{avg}}(\mathbf{x})=\texttt{Concat}\left[\underbrace{\mathbf{x}^{\intercal},\cdots,\mathbf{x}^{\intercal}}_{\lfloor D_{T}/D_{S}\rfloor},\underbrace{\texttt{Avg}(\mathbf{x}),\cdots,\texttt{Avg}(\mathbf{x})}_{D_{T}\bmod D_{S}}\right]^{\intercal}\in\mathbb{R}^{D_{T}}.
Example B.3.2 (Vector zero expansion 𝒱zero\mathcal{V}_{\text{zero}}).

Let 𝐱DS\mathbf{x}\in\mathbb{R}^{D_{S}} be a vector of dimension DSD_{S}. 𝐱zero\mathbf{x}^{*}_{\text{zero}} is called the zero expanded 𝐱\mathbf{x} of dimension DTD_{T} with DTDSD_{T}\geq D_{S} if

\mathbf{x}^{*}_{\text{zero}}=\mathcal{V}_{\text{zero}}(\mathbf{x})=\texttt{Concat}\left[\underbrace{\mathbf{x}^{\intercal},\cdots,\mathbf{x}^{\intercal}}_{\lfloor D_{T}/D_{S}\rfloor},\underbrace{0,\cdots,0}_{D_{T}\bmod D_{S}}\right]^{\intercal}\in\mathbb{R}^{D_{T}}.
Example B.3.3 (Vector circular expansion 𝒱circ\mathcal{V}_{\text{circ}}).

Let 𝐱DS\mathbf{x}\in\mathbb{R}^{D_{S}} be a vector of dimension DSD_{S}. 𝐱circ\mathbf{x}^{*}_{\text{circ}} is called the circular expanded 𝐱\mathbf{x} of dimension DTD_{T} with DTDSD_{T}\geq D_{S} if

\mathbf{x}^{*}_{\text{circ}}=\mathcal{V}_{\text{circ}}(\mathbf{x})=\texttt{Concat}\left[\underbrace{\mathbf{x}^{\intercal},\cdots,\mathbf{x}^{\intercal}}_{\lfloor D_{T}/D_{S}\rfloor},\mathbf{x}^{\intercal}[:D_{T}\bmod D_{S}]\right]^{\intercal}\in\mathbb{R}^{D_{T}}.
Example B.3.4 (Vector random expansion 𝒱rand\mathcal{V}_{\text{rand}}).

Let 𝐱DS\mathbf{x}\in\mathbb{R}^{D_{S}} be a vector of dimension DSD_{S}. 𝐱rand\mathbf{x}^{*}_{\text{rand}} is called the random expanded 𝐱\mathbf{x} of dimension DTD_{T} with DTDSD_{T}\geq D_{S} if

\mathbf{x}^{*}_{\text{rand}}=\mathcal{V}_{\text{rand}}(\mathbf{x};\bm{\zeta})=\texttt{Concat}\left[\underbrace{\mathbf{x}^{\intercal},\cdots,\mathbf{x}^{\intercal}}_{\lfloor D_{T}/D_{S}\rfloor},\bm{\zeta}^{\intercal}\right]^{\intercal}\in\mathbb{R}^{D_{T}},

where 𝛇DTmodDS\bm{\zeta}\in\mathbb{R}^{D_{T}\bmod D_{S}} is an arbitrary vector.

Remark.

(1) All vector expansion examples above follow the same pattern: when expanding from dimension $D_{S}$ to $D_{T}$, each method fills the first $\lfloor D_{T}/D_{S}\rfloor D_{S}$ entries by repeating $\mathbf{x}$ $\lfloor D_{T}/D_{S}\rfloor$ times; the methods differ only in how they fill the remaining $D_{T}\bmod D_{S}$ entries. (2) The vector $\bm{\zeta}$ in vector random expansion is arbitrary, so $\mathcal{V}_{\text{avg}}$, $\mathcal{V}_{\text{zero}}$, and $\mathcal{V}_{\text{circ}}$ are special cases of $\mathcal{V}_{\text{rand}}$. (3) The examples above are expansion methods for vectors. In practice, neural networks such as Transformers operate on high-dimensional tensors, which can be viewed as collections of vectors; in such scenarios, we apply the expansion methods separately along the last dimension of these tensors.

In the following claim, we show that vectors expanded by these operators are lossless.

Claim 1.

Vector average expansion $\mathcal{V}_{\text{avg}}$, vector zero expansion $\mathcal{V}_{\text{zero}}$, vector circular expansion $\mathcal{V}_{\text{circ}}$, and vector random expansion $\mathcal{V}_{\text{rand}}$ are all lossless expansions for vectors.

Proof.

The inverse function 𝒱1:DTDS\mathcal{V}^{-1}:\mathbb{R}^{D_{T}}\rightarrow\mathbb{R}^{D_{S}} of these vector expansion methods is

𝒱1(𝐱)=𝐱[:DS].\displaystyle\mathcal{V}^{-1}(\mathbf{x})=\mathbf{x}[:D_{S}].

Remark.

In practice, we want the inverse mappings of expansion methods to be easy to compute, as in the example above.
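To make the expansions and their shared inverse concrete, the following is a minimal NumPy sketch with toy dimensions; the function names are ours and are not part of the LEMON code base.

```python
import numpy as np

def expand_vector(x, d_target, mode="avg", rng=None):
    """V_avg / V_zero / V_circ / V_rand from Examples B.3.1-B.3.4."""
    d_source = x.shape[0]
    reps, rem = divmod(d_target, d_source)
    tail = {
        "avg": np.full(rem, x.mean()),
        "zero": np.zeros(rem),
        "circ": x[:rem],
        "rand": (rng or np.random.default_rng()).standard_normal(rem),  # arbitrary zeta
    }[mode]
    return np.concatenate([np.tile(x, reps), tail])

def invert_expansion(x_star, d_source):
    """Common inverse of all four expansions: keep the first D_S entries."""
    return x_star[:d_source]

x = np.random.default_rng(0).standard_normal(5)
for mode in ["avg", "zero", "circ", "rand"]:
    assert np.allclose(invert_expansion(expand_vector(x, 8, mode), 5), x)
```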

B.4 Lossless expansion for operators

We then give the definition of lossless expansion for operators. These operators act on tensors, so our definition of lossless operator expansion builds on lossless expansion in vector space. Such operators can be the different layers used in Transformer architectures, including LayerNorm, convolutional layers, and fully-connected layers.

Definition 4 (Lossless expansion for operators).

Consider vector spaces 𝒮in,𝒮out,𝒯in\mathcal{S}^{\text{in}},\mathcal{S}^{\text{out}},\mathcal{T}^{\text{in}} and 𝒯out\mathcal{T}^{\text{out}} such that dim(𝒮in)dim(𝒯in)\text{dim}(\mathcal{S}^{\text{in}})\leq\text{dim}(\mathcal{T}^{\text{in}}) and dim(𝒮out)dim(𝒯out)\text{dim}(\mathcal{S}^{\text{out}})\leq\text{dim}(\mathcal{T}^{\text{out}}). Moreover, suppose the operator is denoted with g():𝒮in𝒮outg(\cdot):\mathcal{S}^{\text{in}}\rightarrow\mathcal{S}^{\text{out}}. We say the operator expansion \mathcal{E} is (𝒱in,𝒱out)(\mathcal{V}_{\text{in}},\mathcal{V}_{\text{out}})-lossless for g()g(\cdot) if there exist lossless input vector space expansion 𝒱in:𝒮in𝒯in\mathcal{V}_{\text{in}}:\mathcal{S}^{\text{in}}\rightarrow\mathcal{T}^{\text{in}} and lossless output vector space expansion 𝒱out:𝒮out𝒯out\mathcal{V}_{\text{out}}:\mathcal{S}^{\text{out}}\rightarrow\mathcal{T}^{\text{out}} such that 𝒱out(g(𝐱))=[g](𝒱in(𝐱)),𝐱𝒮in\mathcal{V}_{\text{out}}(g(\mathbf{x}))=\mathcal{E}[g](\mathcal{V}_{\text{in}}(\mathbf{x})),\forall\mathbf{x}\in\mathcal{S^{\text{in}}}.

Remark.

(1) Intuitively, a lossless operator expansion can be understood as follows: given a $\mathcal{V}_{\text{in}}$-losslessly expanded input, the output of the $\mathcal{E}$-expanded operator is the $\mathcal{V}_{\text{out}}$-losslessly expanded version of the original output. (2) For conciseness, we use '$\mathcal{E}[g]$ is $(\mathcal{V}_{\text{in}},\mathcal{V}_{\text{out}})$-lossless' and '$\mathcal{E}$ is $(\mathcal{V}_{\text{in}},\mathcal{V}_{\text{out}})$-lossless for $g(\cdot)$' interchangeably. (3) We only require the vector expansions $\mathcal{V}_{\text{in}}$ and $\mathcal{V}_{\text{out}}$ to be invertible; we place no restrictions on the operator expansion $\mathcal{E}$ itself.

B.4.1 Lossless expansion for matrix multiplication

Then we give a few examples of lossless expansion for operators. We focus on matrix multiplication, since fully-connected layers are the building blocks of Transformers. We start by introducing the following three lossless operator expansion methods for matrix multiplication, assuming that the input dimension is unchanged, i.e., $\mathcal{V}_{\text{in}}=\texttt{Id}$.

Example B.4.1 (Matrix row-average expansion row,avg\mathcal{E}_{\text{row,avg}}).

Let 𝐌DS×P\mathbf{M}\in\mathbb{R}^{D_{S}\times P} be a matrix of dimension DS×PD_{S}\times P and its row average 𝐦=1DSiDS𝐌[i,:]\mathbf{m}=\frac{1}{D_{S}}\sum_{i}^{D_{S}}\mathbf{M}[i,:]. 𝐌row,avg\mathbf{M}^{*}_{\text{row,avg}} is called the row-average expanded 𝐌\mathbf{M} of dimension DT×PD_{T}\times P with DTDSD_{T}\geq D_{S} if

𝐌row,avg=row,avg(𝐌)=Concat[𝐌,,𝐌DT/DS,𝐦,,𝐦DTmodDS]DT×P.\displaystyle\mathbf{M}^{*}_{\text{row,avg}}=\mathcal{E}_{\text{row,avg}}(\mathbf{M})=\texttt{Concat}\left[\underbrace{\mathbf{M}^{\intercal},\cdots,\mathbf{M}^{\intercal}}_{\lfloor D_{T}/D_{S}\rfloor},\underbrace{\mathbf{m},\cdots,\mathbf{m}}_{D_{T}\bmod D_{S}}\right]^{\intercal}\in\mathbb{R}^{D_{T}\times P}.

Moreover, row,avg\mathcal{E}_{\text{row,avg}} is (Id,𝒱avg)(\texttt{Id},\mathcal{V}_{\text{avg}})-lossless for 𝐌\mathbf{M}.

Example B.4.2 (Matrix row-zero expansion row,zero\mathcal{E}_{\text{row,zero}}).

Let 𝐌DS×P\mathbf{M}\in\mathbb{R}^{D_{S}\times P} be a matrix of dimension DS×PD_{S}\times P. 𝐌row,zero\mathbf{M}^{*}_{\text{row,zero}} is called the row-zero expanded 𝐌\mathbf{M} of dimension DT×PD_{T}\times P with DTDSD_{T}\geq D_{S} if

𝐌row,zero=row,zero(𝐌)=Concat[𝐌,,𝐌DT/DS,𝟎P,,𝟎PDTmodDS]DT×P.\displaystyle\mathbf{M}^{*}_{\text{row,zero}}=\mathcal{E}_{\text{row,zero}}(\mathbf{M})=\texttt{Concat}\left[\underbrace{\mathbf{M}^{\intercal},\cdots,\mathbf{M}^{\intercal}}_{\lfloor D_{T}/D_{S}\rfloor},\underbrace{\mathbf{0}_{P},\cdots,\mathbf{0}_{P}}_{D_{T}\bmod D_{S}}\right]^{\intercal}\in\mathbb{R}^{D_{T}\times P}.

Moreover, row,zero\mathcal{E}_{\text{row,zero}} is (Id,𝒱zero)(\texttt{Id},\mathcal{V}_{\text{zero}})-lossless for 𝐌\mathbf{M}.

Example B.4.3 (Matrix row-circular expansion row,circ\mathcal{E}_{\text{row,circ}}).

Let 𝐌DS×P\mathbf{M}\in\mathbb{R}^{D_{S}\times P} be a matrix of dimension DS×PD_{S}\times P. 𝐌row,circ\mathbf{M}^{*}_{\text{row,circ}} is called the row-circular expanded 𝐌\mathbf{M} of dimension DT×PD_{T}\times P with DTDSD_{T}\geq D_{S} if

𝐌row,circ=row,circ(𝐌)=Concat[𝐌,,𝐌DT/DS,(𝐌[:DTmodDS,:])]DT×P.\displaystyle\mathbf{M}^{*}_{\text{row,circ}}=\mathcal{E}_{\text{row,circ}}(\mathbf{M})=\texttt{Concat}\left[\underbrace{\mathbf{M}^{\intercal},\cdots,\mathbf{M}^{\intercal}}_{\lfloor D_{T}/D_{S}\rfloor},\left(\mathbf{M}[:D_{T}\bmod D_{S},:]\right)^{\intercal}\right]^{\intercal}\in\mathbb{R}^{D_{T}\times P}.

Moreover, $\mathcal{E}_{\text{row,circ}}$ is $(\texttt{Id},\mathcal{V}_{\text{circ}})$-lossless for $\mathbf{M}$.

Remark.

Similar to the vector expansion examples, these matrix row-expansion methods follow the same pattern: when expanding the number of rows from $D_{S}$ to $D_{T}$, each method fills the first $\lfloor D_{T}/D_{S}\rfloor D_{S}$ rows by repeating $\mathbf{M}$ $\lfloor D_{T}/D_{S}\rfloor$ times; the methods differ only in how they fill the remaining $D_{T}\bmod D_{S}$ rows.

The following two lossless operator expansion methods assume that the output dimension is unchanged so 𝒱out=Id\mathcal{V}_{\text{out}}=\texttt{Id}.

Example B.4.4 (Matrix column-random expansion col,rand\mathcal{E}_{\text{col,rand}}).

Let 𝐌P×DS\mathbf{M}\in\mathbb{R}^{P\times D_{S}} be a matrix of dimension P×DSP\times D_{S} and 𝛇P×(DTmodDS)\bm{\zeta}\in\mathbb{R}^{P\times(D_{T}\bmod D_{S})} is an arbitrary matrix. 𝐌col,rand\mathbf{M}^{*}_{\text{col,rand}} is called the column-random expanded 𝐌\mathbf{M} of dimension P×DTP\times D_{T} with DTDSD_{T}\geq D_{S} if

𝐌col,rand=col,rand(𝐌;𝜻)=Concat[𝐌1,,𝐌DT/DSDT/DS,𝜻]P×DT,\displaystyle\mathbf{M}^{*}_{\text{col,rand}}=\mathcal{E}_{\text{col,rand}}(\mathbf{M};\bm{\zeta})=\texttt{Concat}\left[\underbrace{\mathbf{M}^{1},\cdots,\mathbf{M}^{\lfloor D_{T}/D_{S}\rfloor}}_{\lfloor D_{T}/D_{S}\rfloor},\bm{\zeta}\right]\in\mathbb{R}^{P\times D_{T}},

where

iDT/DS𝐌i=𝐌.\displaystyle\sum_{i}^{\lfloor D_{T}/D_{S}\rfloor}\mathbf{M}^{i}=\mathbf{M}.

Moreover, col,rand\mathcal{E}_{\text{col,rand}} is (𝒱zero,Id)(\mathcal{V}_{\text{zero}},\texttt{Id})-lossless for 𝐌\mathbf{M}.

Example B.4.5 (Matrix column-circular expansion col,circ\mathcal{E}_{\text{col,circ}}).

Let 𝐌P×DS\mathbf{M}\in\mathbb{R}^{P\times D_{S}} be a matrix of dimension P×DSP\times D_{S} and 𝐌res=𝐌[:,:DTmodDS]P×(DTmodDS)\mathbf{M}^{\text{res}}=\mathbf{M}[:,:D_{T}\bmod D_{S}]\in\mathbb{R}^{P\times(D_{T}\bmod D_{S})}. 𝐌col,circ\mathbf{M}^{*}_{\text{col,circ}} is called the column-circular expanded 𝐌\mathbf{M} of dimension P×DTP\times D_{T} with DTDSD_{T}\geq D_{S} if

𝐌col,circ=col,circ(𝐌)=Concat[𝐌1,,𝐌DT/DSDT/DS,𝐌res]P×DT,\displaystyle\mathbf{M}^{*}_{\text{col,circ}}=\mathcal{E}_{\text{col,circ}}(\mathbf{M})=\texttt{Concat}\left[\underbrace{\mathbf{M}^{1},\cdots,\mathbf{M}^{\lfloor D_{T}/D_{S}\rfloor}}_{\lfloor D_{T}/D_{S}\rfloor},\mathbf{M}^{\text{res}}\right]\in\mathbb{R}^{P\times D_{T}},

where

𝐌res+i=1DT/DS𝐌i[:,:DTmodDS]=𝐌[:,:DTmodDS],\displaystyle\mathbf{M}^{\text{res}}+\sum_{i=1}^{\lfloor D_{T}/D_{S}\rfloor}\mathbf{M}^{i}[:,:D_{T}\bmod D_{S}]=\mathbf{M}[:,:D_{T}\bmod D_{S}],

and

i=1DT/DS𝐌i[:,DTmodDS:]=𝐌[:,DTmodDS:].\displaystyle\sum_{i=1}^{\lfloor D_{T}/D_{S}\rfloor}\mathbf{M}^{i}[:,D_{T}\bmod D_{S}:]=\mathbf{M}[:,D_{T}\bmod D_{S}:].

Moreover, $\mathcal{E}_{\text{col,circ}}$ is $(\mathcal{V}_{\text{circ}},\texttt{Id})$-lossless for $\mathbf{M}$.

Note that lossless matrix row expansion and lossless matrix column expansion can be combined, as formalized in the following claim.

Claim 2.

Consider a matrix column expansion $\mathcal{E}_{\text{col}}$ that is $(\mathcal{V}_{\text{col}},\texttt{Id})$-lossless for $\mathbf{M}$, and a matrix row expansion $\mathcal{E}_{\text{row}}$ that is $(\texttt{Id},\mathcal{V}_{\text{row}})$-lossless for $\mathbf{M}$. Then $\mathcal{E}_{\text{col}}\circ\mathcal{E}_{\text{row}}$ and $\mathcal{E}_{\text{row}}\circ\mathcal{E}_{\text{col}}$ are both $(\mathcal{V}_{\text{col}},\mathcal{V}_{\text{row}})$-lossless for $\mathbf{M}$.

The claim is easy to prove since rows and columns are expanded independently.
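As a sanity check, the following NumPy sketch (toy sizes, helper names of our own) verifies numerically that row-average expansion is $(\texttt{Id},\mathcal{V}_{\text{avg}})$-lossless and that column-random expansion is $(\mathcal{V}_{\text{zero}},\texttt{Id})$-lossless for matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)

def row_avg_expand(M, d_target):
    """Example B.4.1: repeat the rows, then pad with the row average."""
    d_source, p = M.shape
    reps, rem = divmod(d_target, d_source)
    row_mean = M.mean(axis=0, keepdims=True)
    return np.vstack([np.tile(M, (reps, 1)), np.repeat(row_mean, rem, axis=0)])

def col_rand_expand(M, d_target):
    """Example B.4.4: column blocks that sum to M, plus an arbitrary tail."""
    p, d_source = M.shape
    reps, rem = divmod(d_target, d_source)
    pieces = [rng.standard_normal((p, d_source)) for _ in range(reps - 1)]
    pieces.append(M - sum(pieces, np.zeros((p, d_source))))   # pieces sum to M
    return np.hstack(pieces + [rng.standard_normal((p, rem))])

# (Id, V_avg): same input, average-expanded output.
M = rng.standard_normal((4, 3)); x = rng.standard_normal(3)
y = M @ x
y_star = row_avg_expand(M, 10) @ x
assert np.allclose(y_star, np.concatenate([np.tile(y, 2), np.full(2, y.mean())]))

# (V_zero, Id): zero-expanded input, identical output.
N = rng.standard_normal((3, 4)); z = rng.standard_normal(4)
z_zero = np.concatenate([np.tile(z, 2), np.zeros(2)])
assert np.allclose(col_rand_expand(N, 10) @ z_zero, N @ z)
```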

B.4.2 Lossless expansion for bias

Note that the fully-connected layer consists of a matrix multiplication followed by a bias operator. We now give examples for the bias operator (𝐱;𝐛)=𝐱+𝐛\mathcal{B}(\mathbf{x};\mathbf{b})=\mathbf{x}+\mathbf{b}.

Example B.4.6 (Bias average expansion bias,avg\mathcal{E}_{\text{bias,avg}}).

Consider the bias operator (𝐱;𝐛)=𝐱+𝐛\mathcal{B}(\mathbf{x};\mathbf{b})=\mathbf{x}+\mathbf{b} where 𝐛DS\mathbf{b}\in\mathbb{R}^{D_{S}}. bias,avg(;𝐛bias,avg)=bias,avg[(;𝐛)]\mathcal{B}^{*}_{\text{bias,avg}}(\cdot;\mathbf{b}^{*}_{\text{bias,avg}})=\mathcal{E}_{\text{bias,avg}}[\mathcal{B}(\cdot;\mathbf{b})] is called the average expanded \mathcal{B} of dimension DTD_{T} with DTDSD_{T}\geq D_{S} if 𝐛bias,avg=𝒱avg(𝐛)\mathbf{b}^{*}_{\text{bias,avg}}=\mathcal{V}_{\text{avg}}(\mathbf{b}). Moreover, bias,avg\mathcal{E}_{\text{bias,avg}} is (𝒱avg,𝒱avg)(\mathcal{V}_{\text{avg}},\mathcal{V}_{\text{avg}})-lossless for \mathcal{B}.

Remark.

Note that we can easily extend bias,avg\mathcal{E}_{\text{bias,avg}} to bias,circ\mathcal{E}_{\text{bias,circ}} and bias,zero\mathcal{E}_{\text{bias,zero}} by expanding 𝐛\mathbf{b} to 𝒱circ(𝐛)\mathcal{V}_{\text{circ}}(\mathbf{b}) and 𝒱zero(𝐛)\mathcal{V}_{\text{zero}}(\mathbf{b}), respectively. Moreover, bias,circ\mathcal{E}_{\text{bias,circ}} and bias,zero\mathcal{E}_{\text{bias,zero}} are (𝒱circ,𝒱circ)(\mathcal{V}_{\text{circ}},\mathcal{V}_{\text{circ}})-lossless and (𝒱zero,𝒱zero)(\mathcal{V}_{\text{zero}},\mathcal{V}_{\text{zero}})-lossless for \mathcal{B}, respectively.

B.4.3 Consecutive application of lossless expansion for operators

In the previous sections we gave examples of lossless expansion methods for single operators. To ensure losslessness when applying expansion methods to consecutive layers/operators, we introduce the following claim:

Claim 3 (Lossless of consecutive application).

If $\mathcal{E}_{1}$ is $(\mathcal{V}_{a},\mathcal{V}_{b})$-lossless for $g_{1}$ and $\mathcal{E}_{2}$ is $(\mathcal{V}_{b},\mathcal{V}_{c})$-lossless for $g_{2}$, then $\mathcal{E}_{2}[g_{2}]\circ\mathcal{E}_{1}[g_{1}]$ is $(\mathcal{V}_{a},\mathcal{V}_{c})$-lossless for $g_{2}\circ g_{1}$.

Proof.

If the input $\mathbf{x}$ is $\mathcal{V}_{a}$-losslessly expanded, then by definition the output of $\mathcal{E}_{1}[g_{1}](\cdot)$, namely $\mathbf{x}_{\text{mid}}=\mathcal{E}_{1}[g_{1}](\mathcal{V}_{a}(\mathbf{x}))$, is $\mathcal{V}_{b}$-losslessly expanded. Using the fact that $\mathcal{E}_{2}[g_{2}](\cdot)$ is $(\mathcal{V}_{b},\mathcal{V}_{c})$-lossless and that its input $\mathbf{x}_{\text{mid}}$ is $\mathcal{V}_{b}$-losslessly expanded, we conclude the proof. ∎

Remark.

By leveraging Claim 3, we can separately apply lossless expansion methods to various layers/operators in a larger network. The only requirement is that the output vector space expansion of one expansion method matches the input vector space expansion of the subsequent expansion method.

Appendix C Details of LEMON for Pre-LN Transformers

In this section, we provide a detailed explanation of applying LEMON to the Pre-LN Transformer architecture. By Claim 3, we can deal with different modules separately. In the following sections, we delve into the details of applying expansion methods to these modules.

Figure 9: Illustration of LayerNorm expansion $\mathcal{E}_{\text{LN}}$ and MHA expansion $\mathcal{E}_{\text{MHA}}$ for (a) the small source model and (b) the large target model. We assume $d=d_{K}=d_{V}$. Weight matrices are transposed so that they can be viewed as left-multiplying vectors. Vectors in black indicate intermediate input values, while matrices in white indicate the parameters of the module. Biases are omitted for better illustration.

C.1 Width expansion for Pre-LN Transformer blocks

We first recap the Pre-LN Transformer architecture. It usually consists of (1) the embedding layer, (2) several Pre-LN Transformer blocks, (3) the final LayerNorm layer, and (4) a decoder layer.

Suppose that the hidden dimension $D$ of the Transformer is increased from $D_{S}$ to $D_{T}$. The head dimension $d$ is unchanged during expansion; hence, the number of heads increases from $D_{S}/d$ to $D_{T}/d$. We use $\mathbf{W}_{i}^{K},\mathbf{W}_{i}^{Q},\mathbf{W}_{i}^{V}$ to denote the key, query, and value weight matrices of the $i$-th head $\text{Head}_{i}$ in the MHA module, and $\mathbf{W}_{O}$ to denote the projection matrix.

We use block\mathcal{E}_{\text{block}} to denote the width expansion of Pre-LN Transformer blocks. block\mathcal{E}_{\text{block}} can be decomposed into (1) LayerNorm expansion LN\mathcal{E}_{\text{LN}}, (2) MHA module expansion MHA\mathcal{E}_{\text{MHA}}, and (3) MLP module expansion MLP\mathcal{E}_{\text{MLP}}. We introduce these expansion methods in the following paragraphs. We provide an illustration of LN\mathcal{E}_{\text{LN}} and MHA\mathcal{E}_{\text{MHA}} in Figure 9.

(1) LayerNorm expansion with $\mathcal{E}_{\text{LN}}$. We define the expansion procedure for LN as follows. We use $\texttt{LN}(\cdot;\bm{\mu}_{\text{rand}}^{*},\bm{\beta}_{\text{zero}}^{*},\epsilon^{*})$, where $\bm{\mu}_{\text{rand}}^{*}=\eta\mathcal{V}_{\text{rand}}(\bm{\mu})\in\mathbb{R}^{D_{T}}$, $\bm{\beta}_{\text{zero}}^{*}=\mathcal{V}_{\text{zero}}(\bm{\beta})\in\mathbb{R}^{D_{T}}$, and $\epsilon^{*}=\eta^{2}\epsilon$ with $\eta=\sqrt{\lfloor D_{T}/D_{S}\rfloor\cdot(D_{S}/D_{T})}$, to expand the original LayerNorm layer $\texttt{LN}(\cdot;\bm{\mu},\bm{\beta},\epsilon)$. The expansion is lossless; the proof is given in Proposition 1. Moreover, $\mathcal{E}_{\text{LN}}$ is $(\mathcal{V}_{\text{avg}},\mathcal{V}_{\text{zero}})$-lossless for $\texttt{LN}(\cdot)$. In Figure 9, we omit $\epsilon$ and $\bm{\beta}$ for better illustration.

(2) MHA expansion with $\mathcal{E}_{\text{MHA}}$. We explain how to expand MHA as follows:

  • $\mathbf{W}_{i}^{K},\mathbf{W}_{i}^{Q},\mathbf{W}_{i}^{V}$ in self-attention. We consider the affine transformations applied to a single token $\mathbf{x}\in\mathbb{R}^{D_{S}}$ of a sequence in self-attention, namely $\mathbf{k}_{i}(\mathbf{x};\mathbf{W}_{i}^{K},\mathbf{b}_{i}^{K})=(\mathbf{W}_{i}^{K})^{\intercal}\mathbf{x}+\mathbf{b}_{i}^{K}$, $\mathbf{q}_{i}(\mathbf{x};\mathbf{W}_{i}^{Q},\mathbf{b}_{i}^{Q})=(\mathbf{W}_{i}^{Q})^{\intercal}\mathbf{x}+\mathbf{b}_{i}^{Q}$, and $\mathbf{v}_{i}(\mathbf{x};\mathbf{W}_{i}^{V},\mathbf{b}_{i}^{V})=(\mathbf{W}_{i}^{V})^{\intercal}\mathbf{x}+\mathbf{b}_{i}^{V}$, where $(\mathbf{W}_{i}^{K})^{\intercal},(\mathbf{W}_{i}^{Q})^{\intercal},(\mathbf{W}_{i}^{V})^{\intercal}\in\mathbb{R}^{d_{K}\times D_{S}}$ and $\mathbf{b}_{i}^{K},\mathbf{b}_{i}^{Q},\mathbf{b}_{i}^{V}\in\mathbb{R}^{d_{K}}$. (In the formulation of MHA in section 3, $\mathbf{W}_{i}^{K},\mathbf{W}_{i}^{Q},\mathbf{W}_{i}^{V}$ are right matrix multiplied with the input sequence matrix $\mathbf{X}\in\mathbb{R}^{E\times D_{S}}$; here we use the form $\mathbf{W}_{i}\mathbf{x}$ for better illustration.)

    During expansion, we increase the dimension of (𝐖iK),(𝐖iQ),(𝐖iV)(\mathbf{W}_{i}^{K})^{\intercal},(\mathbf{W}_{i}^{Q})^{\intercal},(\mathbf{W}_{i}^{V})^{\intercal} from dK×DS\mathbb{R}^{d_{K}\times D_{S}} to dK×DT\mathbb{R}^{d_{K}\times D_{T}}, and 𝐛iK,𝐛iQ,𝐛iV\mathbf{b}_{i}^{K},\mathbf{b}_{i}^{Q},\mathbf{b}_{i}^{V} unchanged. Since the number of rows for (𝐖iK),(𝐖iQ),(𝐖iV)(\mathbf{W}_{i}^{K})^{\intercal},(\mathbf{W}_{i}^{Q})^{\intercal},(\mathbf{W}_{i}^{V})^{\intercal} is unchanged, we only increase the number of columns by applying column-random expansion col,rand\mathcal{E}_{\text{col,rand}} defined in Example B.4.4 to its transpose for each head, i.e., we use {col,rand[(𝐖iK);𝜻iK]}\left\{\mathcal{E}_{\text{col,rand}}\left[(\mathbf{W}_{i}^{K})^{\intercal};\bm{\zeta}_{i}^{K}\right]\right\}^{\intercal}, {col,rand[(𝐖iQ);𝜻iQ]}\left\{\mathcal{E}_{\text{col,rand}}\left[(\mathbf{W}_{i}^{Q})^{\intercal};\bm{\zeta}_{i}^{Q}\right]\right\}^{\intercal}, and {col,rand[(𝐖iV);𝜻iV]}\left\{\mathcal{E}_{\text{col,rand}}\left[(\mathbf{W}_{i}^{V})^{\intercal};\bm{\zeta}_{i}^{V}\right]\right\}^{\intercal} for the expanded weights of 𝐖iK,𝐖iQ\mathbf{W}_{i}^{K},\mathbf{W}_{i}^{Q} and 𝐖iV\mathbf{W}_{i}^{V}, where 𝜻iK,𝜻iQ,𝜻iVdk×(DTmodDS)\bm{\zeta}_{i}^{K},\bm{\zeta}_{i}^{Q},\bm{\zeta}_{i}^{V}\in\mathbb{R}^{d_{k}\times(D_{T}\bmod D_{S})} are random matrices. Biases are unchanged.

  • Heads in self-attention. We increase the number of heads in a circular pattern; see Figure 3(b) for an illustration. Note that (1) when $\lfloor D_{T}/D_{S}\rfloor>1$, we can set $\mathbf{W}^{1},\cdots,\mathbf{W}^{\lfloor D_{T}/D_{S}\rfloor}$ differently for replicated heads to break symmetry; and (2) when $D_{T}\bmod D_{S}\neq 0$, the random matrices $\bm{\zeta}_{i}^{K},\bm{\zeta}_{i}^{Q},\bm{\zeta}_{i}^{V}$ can additionally be chosen differently for replicated heads to break symmetry. Please see Example B.4.4 for the definitions of $\mathbf{W}^{1},\cdots,\mathbf{W}^{\lfloor D_{T}/D_{S}\rfloor}$ and $\bm{\zeta}_{i}^{K},\bm{\zeta}_{i}^{Q},\bm{\zeta}_{i}^{V}$.

  • Projection matrix in self attention. For the projection transformation in the form of 𝐖O𝐱+𝐛O\mathbf{W}_{O}^{\intercal}\mathbf{x}+\mathbf{b}_{O} where 𝐖ODS×DS\mathbf{W}_{O}^{\intercal}\in\mathbb{R}^{D_{S}\times D_{S}} and 𝐛ODS\mathbf{b}_{O}\in\mathbb{R}^{D_{S}}, we use col,circ\mathcal{E}_{\text{col,circ}} and row,avg\mathcal{E}_{\text{row,avg}} defined in Example B.4.5 and Example B.4.1 to expand the weights and biases. Specifically, we use {col,circ[row,avg(𝐖O)]}DT×DT\left\{\mathcal{E}_{\text{col,circ}}\left[\mathcal{E}_{\text{row,avg}}(\mathbf{W}_{O}^{\intercal})\right]\right\}^{\intercal}\in\mathbb{R}^{D_{T}\times D_{T}} for the expanded weight of 𝐖O\mathbf{W}_{O}. We then use 𝒱avg(𝐛O)DT\mathcal{V}_{\text{avg}}(\mathbf{b}_{O})\in\mathbb{R}^{D_{T}} for the expanded bias of 𝐛O\mathbf{b}_{O}.

Moreover, MHA\mathcal{E}_{\text{MHA}} is (𝒱zero,𝒱avg)(\mathcal{V}_{\text{zero}},\mathcal{V}_{\text{avg}})-lossless for MHA()\texttt{MHA}(\cdot).

(3) MLP expansion with $\mathcal{E}_{\text{MLP}}$. Consider the MLP in the form $\texttt{MLP}(\mathbf{x})=\mathbf{W}_{\text{fc2}}\sigma(\mathbf{W}_{\text{fc1}}\mathbf{x}+\mathbf{b}_{\text{fc1}})+\mathbf{b}_{\text{fc2}}$, where $\sigma$ is the non-linear activation. We explain how to expand the MLP as follows:

  • For the first fully-connected layer, we increase the columns by random expansion and increase the rows by circular expansion. Specifically, we use col,rand[row,circ(𝐖fc1)]\mathcal{E}_{\text{col,rand}}\left[\mathcal{E}_{\text{row,circ}}\left(\mathbf{W}_{\text{fc1}}\right)\right] and 𝒱circ(𝐛fc1)\mathcal{V}_{\text{circ}}(\mathbf{b}_{\text{fc1}}) for the expanded weight and bias.

  • For the second fully-connected layer, we increase the columns by circular expansion and increase the rows by average expansion. Specifically, we use col,circ[row,avg(𝐖fc2)]\mathcal{E}_{\text{col,circ}}\left[\mathcal{E}_{\text{row,avg}}\left(\mathbf{W}_{\text{fc2}}\right)\right] and 𝒱avg(𝐛fc2)\mathcal{V}_{\text{avg}}(\mathbf{b}_{\text{fc2}}) for the expanded weight and bias.

Moreover, MLP\mathcal{E}_{\text{MLP}} is (𝒱zero,𝒱avg)(\mathcal{V}_{\text{zero}},\mathcal{V}_{\text{avg}})-lossless for MLP()\texttt{MLP}(\cdot).
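To make the MLP expansion concrete, the following self-contained NumPy sketch (toy widths, helper names of our own) checks the $(\mathcal{V}_{\text{zero}},\mathcal{V}_{\text{avg}})$-losslessness numerically. The random splits below are one valid choice among many.

```python
import numpy as np

rng = np.random.default_rng(0)
D_S, D_T, H_S, H_T = 4, 6, 8, 12        # toy model widths / hidden widths

def v_expand(x, d, mode):
    """Vector expansions V_zero / V_avg / V_circ from Appendix B.3."""
    reps, rem = divmod(d, len(x))
    tail = {"zero": np.zeros(rem), "avg": np.full(rem, x.mean()), "circ": x[:rem]}[mode]
    return np.concatenate([np.tile(x, reps), tail])

def row_expand(M, d, mode):
    """Row expansions (Examples B.4.1 / B.4.3): repeat rows, pad with avg or circular rows."""
    reps, rem = divmod(d, M.shape[0])
    tail = {"avg": np.tile(M.mean(axis=0), (rem, 1)), "circ": M[:rem]}[mode]
    return np.vstack([np.tile(M, (reps, 1)), tail])

def col_expand(M, d, mode):
    """Column expansions (Examples B.4.4 / B.4.5): column blocks summing to M plus a tail."""
    reps, rem = divmod(d, M.shape[1])
    parts = [rng.standard_normal(M.shape) for _ in range(reps - 1)]
    parts.append(M - sum(parts, np.zeros_like(M)))            # blocks sum to M
    # 'circ' uses a zero residual block (one valid choice for Example B.4.5);
    # 'rand' appends arbitrary columns, absorbed by the zeros of a V_zero input.
    tail = np.zeros((M.shape[0], rem)) if mode == "circ" else rng.standard_normal((M.shape[0], rem))
    return np.hstack(parts + [tail])

# Small source MLP: W1 in R^{H_S x D_S}, W2 in R^{D_S x H_S}.
W1, b1 = rng.standard_normal((H_S, D_S)), rng.standard_normal(H_S)
W2, b2 = rng.standard_normal((D_S, H_S)), rng.standard_normal(D_S)
relu = lambda t: np.maximum(t, 0.0)
mlp = lambda x: W2 @ relu(W1 @ x + b1) + b2

# Expanded target MLP, following E_MLP above.
W1s = col_expand(row_expand(W1, H_T, "circ"), D_T, "rand"); b1s = v_expand(b1, H_T, "circ")
W2s = col_expand(row_expand(W2, D_T, "avg"), H_T, "circ");  b2s = v_expand(b2, D_T, "avg")
mlp_star = lambda x: W2s @ relu(W1s @ x + b1s) + b2s

# (V_zero, V_avg)-losslessness: zero-expanded input gives average-expanded output.
x = rng.standard_normal(D_S)
assert np.allclose(mlp_star(v_expand(x, D_T, "zero")), v_expand(mlp(x), D_T, "avg"))
```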

C.2 Width expansion of other layers

In this section, we explain how to expand the rest of the layers, i.e., embedding layers and decoder layers.

Embedding expansion with $\mathcal{V}_{\text{avg}}$. We average-expand the embedding of each token $\mathbf{x}$, i.e., we apply $\mathcal{V}_{\text{avg}}$ and append the average of its entries. For Vision Transformers, we do so by adding averaged channels to the patch embeddings.

Decoder layer expansion with $\mathcal{E}_{\text{dec}}$. For Vision Transformers, the decoder layer is a fully-connected layer of the form $\texttt{Dec}(\mathbf{x})=\mathbf{W}_{\text{dec}}\mathbf{x}+\mathbf{b}$. We increase the number of columns of the matrix with column-random expansion, i.e., we use $\mathcal{E}_{\text{col,rand}}(\mathbf{W}_{\text{dec}})$ for the expanded weight. The bias is unchanged.

For language models, the decoder weight is shared with the embedding layer, so we instead scale the weight and bias of the LayerNorm before the decoder layer by $1/\lfloor D_{T}/D_{S}\rfloor$. Moreover, $\mathcal{E}_{\text{dec}}$ is $(\mathcal{V}_{\text{zero}},\texttt{Id})$-lossless for Dec.
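The tied-decoder case can be checked with a few lines of NumPy. The sketch below assumes that the decoder simply computes logits as the product of the (average-expanded) embedding matrix with the zero-expanded, rescaled LayerNorm output; biases and toy sizes are ours.

```python
import numpy as np

D_S, D_T, vocab = 4, 10, 12
rng = np.random.default_rng(0)
reps, rem = divmod(D_T, D_S)

E = rng.standard_normal((vocab, D_S))      # tied embedding / decoder weight
h = rng.standard_normal(D_S)               # output of the final LayerNorm (small model)

# Each token embedding is average-expanded; the LayerNorm output is zero-expanded
# and rescaled by 1 / floor(D_T / D_S), as described above.
E_star = np.hstack([np.tile(E, (1, reps)), np.tile(E.mean(axis=1, keepdims=True), (1, rem))])
h_star = np.concatenate([np.tile(h, reps), np.zeros(rem)]) / reps

assert np.allclose(E_star @ h_star, E @ h)   # logits are unchanged
```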

C.3 Depth expansion

Depth expansion is explained in section 4.

C.4 Parameter choices

We consider the case $D_{T}\leq 2D_{S}$ for better illustration. (In fact, we only need to deal with such cases in our experiments.) There are mainly the following parameters to choose for LEMON. For the indivisible case, we set the random parameter $\bm{\zeta}$ in the LayerNorm such that $\bm{\zeta}\sim\text{Unif}(-1,1)$. When using matrix column-random expansion $\mathcal{E}_{\text{col,rand}}$ for the indivisible case, we use $\bm{\zeta}_{i,j}\overset{\mathrm{iid}}{\sim}\mathcal{N}(0,0.02^{2})$.

Vision transformers. For the width expansion parameters of the Vision Transformers, we set 𝐖res\mathbf{W}^{\text{res}} for indivisible case and 𝐖2\mathbf{W}^{2} for divisible case to be 12𝐖O+Φ\frac{1}{2}\mathbf{W}_{O}^{\intercal}+\Phi, where ΦDS×(DTDS)\Phi\in\mathbb{R}^{D_{S}\times(D_{T}-D_{S})} is randomly initialized and Φi,jiid𝒩(0,0.022)\Phi_{i,j}\overset{\mathrm{iid}}{\sim}\mathcal{N}(0,0.02^{2}).

For the depth expansion parameters, we set the free parameters that are used to cancel out replicated neurons following the distribution 𝒩(0,0.022)\mathcal{N}(0,0.02^{2}).

Language models. For the width expansion parameters of BERT, we set 𝐖res\mathbf{W}^{\text{res}} for indivisible case and 𝐖2\mathbf{W}^{2} for divisible case to Φ\Phi, where ΦDS×(DTDS)\Phi\in\mathbb{R}^{D_{S}\times(D_{T}-D_{S})} is randomly initialized and Φi,jiid𝒩(0,0.0022)\Phi_{i,j}\overset{\mathrm{iid}}{\sim}\mathcal{N}(0,0.002^{2}).

For the depth expansion parameters, we set the projection matrix of the MHA block and the second fully-connected layer of the MLP block to be zero matrices. Moreover, inspired by advanced knowledge initialization (AKI) (Chen et al., 2021a), we append heads/neurons from the next adjacent layer. (This is still lossless, since the appended part is left-multiplied by a zero matrix and followed by the addition of a zero vector.)

Appendix D LEMON for other architectures

Though our main experiments do not include Post-Res-Norm or Post-LN blocks, we show that LEMON is able to perform lossless model expansion in these scenarios as well. We then briefly discuss how to handle RMS Norm (Zhang & Sennrich, 2019), which is used in LLaMa (Touvron et al., 2023). We also discuss how to apply LEMON to convolutional neural networks.

D.1 Post-Res-Norm Transformers

We consider Transformers with the following architecture: (1) an embedding layer, (2) several Post-Res-Norm blocks, and (3) the final decoder layer. (We assume there is no final LayerNorm before the final decoder layer.)

D.1.1 Width expansion

The only difference between the expansion methods for Post-Res-Norm Transformers and Pre-LN Transformers is that we zero-expand the embedding vector of each token with $\mathcal{V}_{\text{zero}}$.

For the MHA and MLP modules, we use exactly the same expansion introduced in section C.1, which is $(\mathcal{V}_{\text{zero}},\mathcal{V}_{\text{avg}})$-lossless for MHA and MLP. Consequently, our expansion is $(\mathcal{V}_{\text{zero}},\mathcal{V}_{\text{zero}})$-lossless for Post-Res-Norm Transformer blocks. Since the final decoder expansion is $(\mathcal{V}_{\text{zero}},\texttt{Id})$-lossless for Dec, our expansion method is strictly lossless.

D.1.2 Depth expansion

For increasing depth, we only need to set the weights and bias of the LayerNorm for each added layer to be all zeros.

D.2 Post-LN Transformers

For Post-LN Transformers, we can only handle divisible cases, i.e., $D_{T}\bmod D_{S}=0$. Suppose $D_{T}/D_{S}=n$; in this case, all the embeddings and the outputs of the modules (MHA and MLP) are duplicated $n$ times, and the expansion is hence lossless. The only remaining difficulty is depth expansion.

Depth expansion. Suppose we are given a pre-trained Post-LN Transformer block g1(𝐱)=LN1(Module1(𝐱)+𝐱)=𝝁1Norm(Module1(𝐱)+𝐱)+𝐛1g_{1}(\mathbf{x})=\texttt{LN}_{1}(\texttt{Module}_{1}(\mathbf{x})+\mathbf{x})=\bm{\mu}_{1}\odot\texttt{Norm}(\texttt{Module}_{1}(\mathbf{x})+\mathbf{x})+\mathbf{b}_{1}. First we expand Module1\texttt{Module}_{1} to Module10,\texttt{Module}_{1}^{0,*} so that it outputs zeros. Then we can create two expanded layers g1,g2g^{*}_{1},g^{*}_{2} where g1(𝐱)=𝟏Norm(Module10,(𝐱)+𝐱)+𝟎=Norm(𝐱)g_{1}^{*}(\mathbf{x}^{*})=\bm{1}\odot\texttt{Norm}(\texttt{Module}^{0,*}_{1}(\mathbf{x}^{*})+\mathbf{x}^{*})+\mathbf{0}=\texttt{Norm}(\mathbf{x}^{*}) and g2(𝐱)=𝝁1Norm(Module1(𝐱)+𝐱)+𝐛1g_{2}^{*}(\mathbf{x}^{*})=\bm{\mu}_{1}^{*}\odot\texttt{Norm}(\texttt{Module}^{*}_{1}(\mathbf{x}^{*})+\mathbf{x}^{*})+\mathbf{b}_{1}^{*}. It is easy to show that g2g1g_{2}^{*}\circ g_{1}^{*} is lossless where we use the fact that Norm(Norm(𝐱))=Norm(𝐱)\texttt{Norm}(\texttt{Norm}(\mathbf{x}))=\texttt{Norm}(\mathbf{x}).

D.3 Transformers with RMS Norm

RMS Norm (Zhang & Sennrich, 2019) is used by foundation models such as LLaMa (Touvron et al., 2023) and Baichuan (Yang et al., 2023); see Definition 2 for its definition. To expand an RMS Norm layer from dimension $D_{S}$ to $D_{T}$, we use the following expansion.

RMS Norm expansion with $\mathcal{E}_{\text{RMS}}$. We use $\texttt{RMS}(\cdot;\bm{\mu}_{\text{rand}}^{*},\epsilon^{*})$, where $\bm{\mu}_{\text{rand}}^{*}=\eta\mathcal{V}_{\text{rand}}(\bm{\mu})\in\mathbb{R}^{D_{T}}$ and $\epsilon^{*}=\eta^{2}\epsilon$ with $\eta=\sqrt{\lfloor D_{T}/D_{S}\rfloor\cdot(D_{S}/D_{T})}$, to expand the original RMS Norm layer $\texttt{RMS}(\cdot;\bm{\mu},\epsilon)$. The expansion is $(\mathcal{V}_{\text{zero}},\mathcal{V}_{\text{zero}})$-lossless for $\texttt{RMS}(\cdot)$. The proof is provided in Proposition 4.
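A quick numerical check of this RMS Norm expansion (NumPy, toy dimensions, variable names of our own):

```python
import numpy as np

D_S, D_T, eps = 5, 8, 1e-6
rng = np.random.default_rng(1)
rms = lambda x, mu, eps: x / np.sqrt(np.mean(x**2) + eps) * mu

x, mu = rng.standard_normal(D_S), rng.standard_normal(D_S)
reps, rem = divmod(D_T, D_S)
eta = np.sqrt(reps * D_S / D_T)

x_zero  = np.concatenate([np.tile(x, reps), np.zeros(rem)])             # V_zero input
mu_star = eta * np.concatenate([np.tile(mu, reps), rng.standard_normal(rem)])  # tail is arbitrary

# Zero-expanded input maps to the zero-expanded output.
assert np.allclose(rms(x_zero, mu_star, eta**2 * eps),
                   np.concatenate([np.tile(rms(x, mu, eps), reps), np.zeros(rem)]))
```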

D.4 Convolutional neural networks

We use $\texttt{Conv}(C_{\text{in}},C_{\text{out}},k\times k)$ to denote a convolutional layer with $C_{\text{in}}$ in-channels, $C_{\text{out}}$ out-channels, and kernel size $k\times k$. We assume the kernel weight is $\mathbf{W}\in\mathbb{R}^{C_{\text{out}}\times C_{\text{in}}\times k\times k}$ and the bias is $\mathbf{b}\in\mathbb{R}^{C_{\text{out}}}$. We use BN and ReLU to denote BatchNorm and ReLU, respectively. ResNet and WideResNet with more than 50 layers consist of multiple Bottleneck blocks, each containing three sub-blocks in the residual branch: (1) $\texttt{Conv}(D,D_{S},1\times 1)$-BN-ReLU, (2) $\texttt{Conv}(D_{S},D_{S},3\times 3)$-BN-ReLU, and (3) $\texttt{Conv}(D_{S},D,1\times 1)$-BN.

We consider expanding a ResNet to a WideResNet of the same depth. (Depth increase can also be applied.) During expansion, we increase the number of channels from $D_{S}$ to $D_{T}$. To apply the expansion, we do the following:

(1) For the first sub-block, increase the number of out-channels of its convolutional layer from $D_{S}$ to $D_{T}$. Specifically, the expanded weight satisfies $\mathbf{W}^{*}[i,:,:,:]=\mathbf{W}[i\bmod D_{S},:,:,:],\forall i\in[D_{T}]$, and $\mathbf{b}^{*}[i]=\mathbf{b}[i\bmod D_{S}],\forall i\in[D_{T}]$. The output of the convolutional layer then follows a circular pattern along the channel dimension. This still holds after applying BatchNorm and ReLU, since BatchNorm statistics are computed per channel.

(2) For the second sub-block, increase both the number of out-channels and in-channels of its convolutional layer from $D_{S}$ to $D_{T}$. The out-channel dimension is treated exactly as in (1). For the in-channel dimension, we need to make sure that the weights of replicated channels sum to the original weight. Specifically, let the replicated channel indices be $\mathcal{C}_{z}=\{i\mid i\bmod D_{S}=z\}$. Then we set $\sum_{i\in\mathcal{C}_{z}}\mathbf{W}^{*}[:,i,:,:]=\mathbf{W}[:,z,:,:]$ for lossless expansion. Moreover, we make sure that $\mathbf{W}^{*}[a,i,b,c]\neq\mathbf{W}^{*}[a,j,b,c]$ for all distinct $i,j\in\mathcal{C}_{z}$, $a\in[C_{\text{out}}]$, $b,c\in[k]$, and every $z$, for symmetry breaking.

(3) For the last sub-block, increase the number of in-channels of its convolutional layer from $D_{S}$ to $D_{T}$, similarly to (2).
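The in-channel splitting in step (2) is the only subtle part. The following PyTorch sketch (toy channel counts and variable names of our own; the out-channel dimension is left untouched here, since its circular expansion is as in step (1)) checks that a circularly channel-replicated input produces exactly the original output.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C_S, C_T, C_out, k = 4, 6, 5, 3
W = torch.randn(C_out, C_S, k, k)                 # original kernel
x = torch.randn(1, C_S, 8, 8)                     # original input feature map

# Circularly replicate input channels (what step (1) produces downstream).
x_star = x[:, torch.arange(C_T) % C_S]

# Split each original in-channel kernel across its replicas so the replicas sum to it.
W_star = torch.zeros(C_out, C_T, k, k)
for z in range(C_S):
    replicas = [i for i in range(C_T) if i % C_S == z]
    splits = torch.randn(len(replicas) - 1, C_out, k, k)          # symmetry breaking
    pieces = torch.cat([splits, (W[:, z] - splits.sum(0)).unsqueeze(0)])
    for piece, i in zip(pieces, replicas):
        W_star[:, i] = piece

out = F.conv2d(x, W, padding=1)
out_star = F.conv2d(x_star, W_star, padding=1)
assert torch.allclose(out, out_star, atol=1e-4)   # identical up to float round-off
```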

Appendix E Proofs

E.1 Proofs for Transformers with LayerNorm

In this section, we first show that three main components LN\mathcal{E}_{\text{LN}}, MHA\mathcal{E}_{\text{MHA}}, and MLP\mathcal{E}_{\text{MLP}} are lossless. Then, we prove that LEMON defined in Appendix C is lossless.

We first start by showing that our LayerNorm expansion LN\mathcal{E}_{\text{LN}} defined in section C.1 is lossless.

Proposition 1 (Lossless expansion for LayerNorm LN\mathcal{E}_{\text{LN}}).

Consider $\texttt{LN}(\cdot;\bm{\mu},\bm{\beta},\epsilon)$ of dimension $D_{S}$, where $\bm{\mu},\bm{\beta}\in\mathbb{R}^{D_{S}}$. Define the average expansion of $\mathbf{x}\in\mathbb{R}^{D_{S}}$ to dimension $D_{T}$ as $\mathbf{x}^{*}_{\text{avg}}=\mathcal{V}_{\text{avg}}(\mathbf{x})\in\mathbb{R}^{D_{T}}$, where $D_{T}\geq D_{S}$. If $\bm{\mu}_{\text{rand}}^{*}=\eta\mathcal{V}_{\text{rand}}(\bm{\mu})\in\mathbb{R}^{D_{T}}$, $\bm{\beta}_{\text{zero}}^{*}=\mathcal{V}_{\text{zero}}(\bm{\beta})\in\mathbb{R}^{D_{T}}$, and $\epsilon^{*}=\eta^{2}\epsilon$, where $\eta=\sqrt{\lfloor D_{T}/D_{S}\rfloor\cdot(D_{S}/D_{T})}$, then

\texttt{LN}(\mathbf{x}^{*}_{\text{avg}};\bm{\mu}_{\text{rand}}^{*},\bm{\beta}_{\text{zero}}^{*},\epsilon^{*})=\mathcal{V}_{\text{zero}}(\texttt{LN}(\mathbf{x};\bm{\mu},\bm{\beta},\epsilon)).
Proof.

Since $\mathbb{E}[\mathbf{x}^{*}_{\text{avg}}]=\frac{1}{D_{T}}\sum_{i}\mathbf{x}_{\text{avg}}^{*}[i]=\frac{1}{D_{T}}\left(\lfloor D_{T}/D_{S}\rfloor\sum_{i=1}^{D_{S}}\mathbf{x}[i]+(D_{T}\bmod D_{S})\,\mathbb{E}[\mathbf{x}]\right)=\mathbb{E}[\mathbf{x}]$ and $\mathrm{Var}[\mathbf{x}^{*}_{\text{avg}}]=\frac{1}{D_{T}}\left(\lfloor D_{T}/D_{S}\rfloor D_{S}\,\mathrm{Var}[\mathbf{x}]+(D_{T}\bmod D_{S})\cdot 0\right)=\eta^{2}\mathrm{Var}[\mathbf{x}]$, we have the following two cases.

  • For $1\leq i\leq\lfloor D_{T}/D_{S}\rfloor D_{S}$:

    \texttt{LN}(\mathbf{x}^{*}_{\text{avg}};\bm{\mu}_{\text{rand}}^{*},\bm{\beta}_{\text{zero}}^{*},\epsilon^{*})[i]=\frac{\mathbf{x}_{\text{avg}}^{*}[i]-\mathbb{E}[\mathbf{x}_{\text{avg}}^{*}]}{\sqrt{\mathrm{Var}[\mathbf{x}_{\text{avg}}^{*}]+\epsilon^{*}}}\odot\bm{\mu}_{\text{rand}}^{*}[i]+\bm{\beta}_{\text{zero}}^{*}[i]
    =\frac{\mathbf{x}[i\bmod D_{S}]-\mathbb{E}[\mathbf{x}]}{\eta\sqrt{\mathrm{Var}[\mathbf{x}]+\epsilon}}\odot\eta\bm{\mu}[i\bmod D_{S}]+\bm{\beta}[i\bmod D_{S}]
    =\mathcal{V}_{\text{zero}}(\texttt{LN}(\mathbf{x};\bm{\mu},\bm{\beta},\epsilon))[i].

  • For $\lfloor D_{T}/D_{S}\rfloor D_{S}<i\leq D_{T}$:

    \texttt{LN}(\mathbf{x}^{*}_{\text{avg}};\bm{\mu}_{\text{rand}}^{*},\bm{\beta}_{\text{zero}}^{*},\epsilon^{*})[i]=\frac{\mathbf{x}_{\text{avg}}^{*}[i]-\mathbb{E}[\mathbf{x}_{\text{avg}}^{*}]}{\sqrt{\mathrm{Var}[\mathbf{x}_{\text{avg}}^{*}]+\epsilon^{*}}}\odot\bm{\mu}_{\text{rand}}^{*}[i]+\bm{\beta}_{\text{zero}}^{*}[i]
    =\frac{\mathbb{E}[\mathbf{x}]-\mathbb{E}[\mathbf{x}]}{\eta\sqrt{\mathrm{Var}[\mathbf{x}]+\epsilon}}\odot\eta\bm{\zeta}[i\bmod D_{S}]+0
    =0
    =\mathcal{V}_{\text{zero}}(\texttt{LN}(\mathbf{x};\bm{\mu},\bm{\beta},\epsilon))[i].

Hence, $\texttt{LN}(\mathbf{x}^{*}_{\text{avg}};\bm{\mu}_{\text{rand}}^{*},\bm{\beta}_{\text{zero}}^{*},\epsilon^{*})=\mathcal{V}_{\text{zero}}(\texttt{LN}(\mathbf{x};\bm{\mu},\bm{\beta},\epsilon))$. ∎

Remark.

When $D_{T}$ is divisible by $D_{S}$, we have $\eta=1$. This explains why simply circularly expanding LayerNorm is lossless in that scenario.

Proposition 1 naturally leads to the following corollary.

Corollary 1.

$\mathcal{E}_{\text{LN}}$ introduced in section C.1 is $(\mathcal{V}_{\text{avg}},\mathcal{V}_{\text{zero}})$-lossless for $\texttt{LN}(\cdot)$.
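Proposition 1 can also be verified numerically. The following NumPy sketch (toy dimensions and variable names of our own) checks the statement for one random input:

```python
import numpy as np

D_S, D_T, eps = 5, 8, 1e-5
rng = np.random.default_rng(0)

def layernorm(x, mu, beta, eps):
    return (x - x.mean()) / np.sqrt(x.var() + eps) * mu + beta

x, mu, beta = rng.standard_normal(D_S), rng.standard_normal(D_S), rng.standard_normal(D_S)
reps, rem = divmod(D_T, D_S)
eta = np.sqrt(reps * D_S / D_T)

x_avg     = np.concatenate([np.tile(x, reps), np.full(rem, x.mean())])        # V_avg input
mu_star   = eta * np.concatenate([np.tile(mu, reps), rng.standard_normal(rem)])  # zeta is arbitrary
beta_star = np.concatenate([np.tile(beta, reps), np.zeros(rem)])

out_small = layernorm(x, mu, beta, eps)
out_large = layernorm(x_avg, mu_star, beta_star, eta**2 * eps)
assert np.allclose(out_large, np.concatenate([np.tile(out_small, reps), np.zeros(rem)]))
```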

Using Claim 3, we are ready to prove that MHA\mathcal{E}_{\text{MHA}} and MLP\mathcal{E}_{\text{MLP}} are lossless. We first show that MHA\mathcal{E}_{\text{MHA}} is lossless in Proposition 2.

Proposition 2 (Lossless of MHA\mathcal{E}_{\text{MHA}}).

MHA\mathcal{E}_{\text{MHA}} defined in section C.1 is (𝒱zero,𝒱avg)(\mathcal{V}_{\text{zero}},\mathcal{V}_{\text{avg}})-lossless for MHA.

Proof.

Consider a sequence input 𝐗E×DS\mathbf{X}\in\mathbb{R}^{E\times D_{S}} is expanded losslessly by 𝒱zero\mathcal{V}_{\text{zero}} to 𝐗zeroE×DT\mathbf{X}^{*}_{\text{zero}}\in\mathbb{R}^{E\times D_{T}}. We expand the source small MHA such that the target large model is MHA=MHA(MHA)\texttt{MHA}^{*}=\mathcal{E}_{\text{MHA}}(\texttt{MHA}).

We first check the key, query, and value of each head Headi\text{Head}^{*}_{i} such that iH=Ds/di\leq H=D_{s}/d for the large model MHA\texttt{MHA}^{*}. We denote them as 𝐊i,𝐐i,𝐕iE×dK\mathbf{K}^{*}_{i},\mathbf{Q}^{*}_{i},\mathbf{V}^{*}_{i}\in\mathbb{R}^{E\times d_{K}}. Note that biases 𝐛iK,𝐛iQ,𝐛iVdK\mathbf{b}_{i}^{K},\mathbf{b}_{i}^{Q},\mathbf{b}_{i}^{V}\in\mathbb{R}^{d_{K}} are not expanded. Hence, these outputs are identical to the output of the small source model 𝐊i,𝐐i,𝐕iE×dK\mathbf{K}_{i},\mathbf{Q}_{i},\mathbf{V}_{i}\in\mathbb{R}^{E\times d_{K}} since (𝐖iK),(𝐖iQ),(𝐖iV)(\mathbf{W}_{i}^{K})^{\intercal},(\mathbf{W}_{i}^{Q})^{\intercal},(\mathbf{W}_{i}^{V})^{\intercal} are expanded by C, rand\mathcal{E}_{\text{C, rand}}, which is (𝒱zero,Id)(\mathcal{V}_{\text{zero}},\texttt{Id})-lossless. Consequently, Headi=Attention(𝐐i,𝐊i,𝐕i)=softmax(𝐐i(𝐊i)/dK)𝐕i\text{Head}^{*}_{i}=\text{Attention}(\mathbf{Q}^{*}_{i},\mathbf{K}^{*}_{i},\mathbf{V}^{*}_{i})=\texttt{softmax}\left(\mathbf{Q}^{*}_{i}(\mathbf{K}^{*}_{i})^{\intercal}/\sqrt{d_{K}}\right)\mathbf{V}^{*}_{i} is identical to the output of ii-th head of the MHA in the source small model, which is Headi\text{Head}_{i}.

Since the heads are circularly expanded, the concatenated head output of $\texttt{MHA}^{*}$ is the $\mathcal{V}_{\text{circ}}$-losslessly expanded version of the original concatenation.

Finally, $\mathbf{W}_{O}^{\intercal}$ is expanded by $\mathcal{E}_{\text{col,circ}}$ and $\mathcal{E}_{\text{row,avg}}$, which is $(\mathcal{V}_{\text{circ}},\mathcal{V}_{\text{avg}})$-lossless. Together with the fact that the bias $\mathbf{b}_{O}$ is average expanded to $\mathcal{V}_{\text{avg}}(\mathbf{b}_{O})$, we obtain the result that $\mathcal{E}_{\text{MHA}}$ is $(\mathcal{V}_{\text{zero}},\mathcal{V}_{\text{avg}})$-lossless for MHA. ∎

We then show that MLP\mathcal{E}_{\text{MLP}} is lossless in Proposition 3.

Proposition 3 (Lossless of MLP\mathcal{E}_{\text{MLP}}).

$\mathcal{E}_{\text{MLP}}$ defined in section C.1 is $(\mathcal{V}_{\text{zero}},\mathcal{V}_{\text{avg}})$-lossless for MLP. This is easily obtained: the first fully-connected layer is $(\mathcal{V}_{\text{zero}},\mathcal{V}_{\text{circ}})$-lossless, so its output is $\mathcal{V}_{\text{circ}}$-losslessly expanded; applying the element-wise non-linear activation keeps the output $\mathcal{V}_{\text{circ}}$-losslessly expanded; and the second fully-connected layer is $(\mathcal{V}_{\text{circ}},\mathcal{V}_{\text{avg}})$-lossless. This concludes the proof that $\mathcal{E}_{\text{MLP}}$ is $(\mathcal{V}_{\text{zero}},\mathcal{V}_{\text{avg}})$-lossless for MLP.

Hence, using Proposition 2 and Proposition 3 along with Claim 3, we obtain the following Corollary 2 and Corollary 3.

Corollary 2.

The expanded Pre-LN MHA module MHA(MHA)LN(LN)\mathcal{E}_{\text{MHA}}(\texttt{MHA})\circ\mathcal{E}_{\text{LN}}(\texttt{LN}) is (𝒱avg,𝒱avg)(\mathcal{V}_{\text{avg}},\mathcal{V}_{\text{avg}})-lossless for MHALN\texttt{MHA}\circ\texttt{LN}.

Proof.

Since LN\mathcal{E}_{\text{LN}} is (𝒱avg,𝒱zero)(\mathcal{V}_{\text{avg}},\mathcal{V}_{\text{zero}})-lossless for LN, and MHA\mathcal{E}_{\text{MHA}} is (𝒱zero,𝒱avg)(\mathcal{V}_{\text{zero}},\mathcal{V}_{\text{avg}})-lossless for MHA. The result is obtained by Claim 3. ∎

Corollary 3.

The expanded Pre-LN MLP module MLP(MLP)LN(LN)\mathcal{E}_{\text{MLP}}(\texttt{MLP})\circ\mathcal{E}_{\text{LN}}(\texttt{LN}) is (𝒱avg,𝒱avg)(\mathcal{V}_{\text{avg}},\mathcal{V}_{\text{avg}})-lossless for MLPLN\texttt{MLP}\circ\texttt{LN}.

By incorporating the residual connection, we obtain the following corollary.

Corollary 4.

The expanded Pre-LN modules (Pre-LN MHA/MLP) with residual connections are (𝒱avg,𝒱avg)(\mathcal{V}_{\text{avg}},\mathcal{V}_{\text{avg}})-lossless for the original Pre-LN modules with residual connections.

Once again using Claim 3, we naturally obtain the following corollary.

Corollary 5.

The width expansion $\mathcal{E}_{\text{block}}$ is $(\mathcal{V}_{\text{avg}},\mathcal{V}_{\text{avg}})$-lossless for the original Pre-LN Transformer block $g$.

Finally, by considering the embedding and decoder layers, we show that LEMON is lossless.

Corollary 6.

LEMON introduced in section C.1 is $(\texttt{Id},\texttt{Id})$-lossless for Pre-LN Transformers, i.e., strictly lossless (identical).

Proof.

Since the embeddings are average expanded, the outputs of the Pre-LN Transformer blocks are average expanded. Hence, the output of the final LayerNorm before the decoder is zero expanded. Since the decoder layer expansion is $(\mathcal{V}_{\text{zero}},\texttt{Id})$-lossless for $\texttt{Dec}(\cdot)$, we obtain the result that LEMON is $(\texttt{Id},\texttt{Id})$-lossless. ∎

E.2 Proofs for Transformers with RMS Norm

In this section, we show that RMS\mathcal{E}_{\text{RMS}} defined in section D.3 is lossless.

Proposition 4 (Lossless expansion for RMS Norm RMS\mathcal{E}_{\text{RMS}}).

Consider $\texttt{RMS}(\cdot;\bm{\mu},\epsilon)$ of dimension $D_{S}$, where $\bm{\mu}\in\mathbb{R}^{D_{S}}$. Define the zero expansion of $\mathbf{x}\in\mathbb{R}^{D_{S}}$ to dimension $D_{T}$ as $\mathbf{x}^{*}_{\text{zero}}=\mathcal{V}_{\text{zero}}(\mathbf{x})\in\mathbb{R}^{D_{T}}$, where $D_{T}\geq D_{S}$. If $\bm{\mu}_{\text{rand}}^{*}=\eta\mathcal{V}_{\text{rand}}(\bm{\mu})\in\mathbb{R}^{D_{T}}$ and $\epsilon^{*}=\eta^{2}\epsilon$, where $\eta=\sqrt{\lfloor D_{T}/D_{S}\rfloor\cdot(D_{S}/D_{T})}$, then

\texttt{RMS}(\mathbf{x}^{*}_{\text{zero}};\bm{\mu}_{\text{rand}}^{*},\epsilon^{*})=\mathcal{V}_{\text{zero}}(\texttt{RMS}(\mathbf{x};\bm{\mu},\epsilon)).
Proof.
  • For $1\leq i\leq\lfloor D_{T}/D_{S}\rfloor D_{S}$:

    \texttt{RMS}(\mathbf{x}^{*}_{\text{zero}};\bm{\mu}_{\text{rand}}^{*},\epsilon^{*})[i]=\frac{\mathbf{x}_{\text{zero}}^{*}[i]}{\sqrt{\frac{1}{D_{T}}\sum_{j=1}^{D_{T}}(\mathbf{x}^{*}_{\text{zero}}[j])^{2}+\epsilon^{*}}}\odot\bm{\mu}_{\text{rand}}^{*}[i]
    =\frac{\mathbf{x}[i\bmod D_{S}]}{\sqrt{\frac{D_{S}\lfloor D_{T}/D_{S}\rfloor}{D_{T}}\cdot\frac{1}{D_{S}}\sum_{j=1}^{D_{S}}(\mathbf{x}[j])^{2}+\eta^{2}\epsilon}}\odot\eta\bm{\mu}[i\bmod D_{S}]
    =\frac{\mathbf{x}[i\bmod D_{S}]}{\eta\sqrt{\frac{1}{D_{S}}\sum_{j=1}^{D_{S}}(\mathbf{x}[j])^{2}+\epsilon}}\odot\eta\bm{\mu}[i\bmod D_{S}]
    =\mathcal{V}_{\text{zero}}(\texttt{RMS}(\mathbf{x};\bm{\mu},\epsilon))[i].

  • For $\lfloor D_{T}/D_{S}\rfloor D_{S}<i\leq D_{T}$:

    \texttt{RMS}(\mathbf{x}^{*}_{\text{zero}};\bm{\mu}_{\text{rand}}^{*},\epsilon^{*})[i]=\frac{\mathbf{x}_{\text{zero}}^{*}[i]}{\sqrt{\frac{1}{D_{T}}\sum_{j=1}^{D_{T}}(\mathbf{x}^{*}_{\text{zero}}[j])^{2}+\epsilon^{*}}}\odot\bm{\mu}_{\text{rand}}^{*}[i]
    =\frac{0}{\sqrt{\frac{1}{D_{T}}\sum_{j=1}^{D_{T}}(\mathbf{x}^{*}_{\text{zero}}[j])^{2}+\epsilon^{*}}}\odot\bm{\mu}_{\text{rand}}^{*}[i]
    =0
    =\mathcal{V}_{\text{zero}}(\texttt{RMS}(\mathbf{x};\bm{\mu},\epsilon))[i].

Hence, $\texttt{RMS}(\mathbf{x}^{*}_{\text{zero}};\bm{\mu}_{\text{rand}}^{*},\epsilon^{*})=\mathcal{V}_{\text{zero}}(\texttt{RMS}(\mathbf{x};\bm{\mu},\epsilon))$. ∎

Proposition 4 naturally leads to the following corollary.

Corollary 7.

RMS\mathcal{E}_{\text{RMS}} introduced in section D.3 is (𝒱zero,𝒱zero)(\mathcal{V}_{\text{zero}},\mathcal{V}_{\text{zero}})-lossless for RMS()\texttt{RMS}(\cdot).

Appendix F More ablation studies

F.1 Comparison with LiGO

LiGO (Wang et al., 2023a) is unavailable for direct comparison due to the absence of open-source code. Hence, we compare with the values reported in their paper. Note that our method is lossless only for the Pre-LN Transformer architecture, while LiGO reports results for language models mainly on Post-LN BERT and RoBERTa. As a consequence, we compare our results with LiGO on ViT$(12,384)$ (ViT-Small) $\rightarrow$ ViT$(12,768)$ (ViT-Base); note that DeiT without distillation is exactly ViT. The result is shown in Figure 10.

Our method recovers the performance of the target model within 85 epochs, leading to a 71.67% computational saving, which is higher than the 55.40% saving reported for LiGO. (Note that DeiT-Base (ViT-Base) reaches a final validation accuracy of 81.00% in the LiGO paper, which is lower than the ~81.70% of the official DeiT implementation and of ours.)


Figure 10: We expand ViT$(12,384)$ to ViT$(12,768)$. Our expanded model recovers the performance of the target model within 85 epochs (28.3% of the epochs used when training from scratch).

Appendix G More related works

Efficiency in deep learning can be achieved in multiple ways. In this section we provide a brief overview of efficient deep learning regarding model training and inference, distinguishing it from methods addressing data efficiency (Gong et al., 2021; Wu et al., 2023a; b).

Efficient deep learning. In the realm of deep learning, the drive for efficiency has led researchers to develop a multitude of methods aimed at optimizing model efficiency. Techniques such as neural architecture search (NAS) (Zoph & Le, 2016; Liu et al., 2018) have been employed to automate the discovery of optimal network architectures. Quantization (Rastegari et al., 2016; Hubara et al., 2017) reduces the numeric precision of model parameters to boost computational speed. Knowledge distillation (Hinton et al., 2015) and knowledge inheritance (Qin et al., 2021) allow target models to inherit the knowledge of their source counterparts. Neural network pruning (LeCun et al., 1989) removes unnecessary connections to accelerate model training or inference. Finally, model growth methods (Chen et al., 2015) directly use the weights of source models to initialize larger target models.

Neural architecture search (NAS) has emerged as a promising solution for automating the process of neural architecture design, eliminating the need for labor-intensive manual designs across various deep learning tasks. Initial methodologies leveraged reinforcement learning (Zoph & Le, 2016; Baker et al., 2016) and evolutionary algorithms (Real et al., 2019) to identify high-performing architectures. Despite their success, a significant drawback was their computational demands. Addressing this, DARTS (Liu et al., 2018) introduced a continuous relaxation of architectural representation, allowing for search via gradient descent. However, DARTS can be challenging to optimize, and its weight-sharing approach has been criticized for potential performance degradation (Yu et al., 2019; Wang et al., 2020b). Seeking further efficiency, Mellor et al. (Mellor et al., 2021) introduced a training-free NAS, which evaluates randomly initialized architectures, thus fully eliminating neural network training during the search phase. Subsequent training-free methods explored searches using Neural Tangent Kernel (NTK) (Xu et al., 2021; Chen et al., 2021b; Wang et al., 2022a), linear regions (Chen et al., 2021b), and criteria related to pruning (Abdelfattah et al., 2021).

When considered alongside model expansion, NAS holds potential for determining the optimal number of layers and hidden dimension of the large target model.

Neural network pruning. Pruning techniques can be broadly classified by their timing into three categories: post-hoc pruning, pruning-at-initialization methods, and pruning-during-training methods. (1) Post-hoc pruning removes certain weights of a fully-trained neural network. It was initially proposed to accelerate model inference (LeCun et al., 1989; Hassibi et al., 1993; Han et al., 2015), while lottery-ticket works (Frankle & Carbin, 2018; Renda et al., 2020) shifted the focus towards uncovering trainable sub-networks. (2) SNIP (Lee et al., 2018) is one of the pioneering pruning-at-initialization methods, which aim to find trainable sub-networks without any training. Subsequent research (Wang et al., 2020a; Tanaka et al., 2020; de Jorge et al., 2020; Lee et al., 2019; Wang et al., 2022b) introduced varying metrics for pruning at the network initialization stage. (3) Finally, pruning-during-training methods prune or adjust DNNs throughout training. Early works incorporate explicit $\ell_{0}$ (Louizos et al., 2017) or $\ell_{1}$ (Wen et al., 2016) regularization terms to encourage sparsity, mitigating the performance degradation commonly associated with post-hoc pruning. More recent techniques such as dynamic sparse training (DST) methods (Bellec et al., 2017; Mocanu et al., 2018; Evci et al., 2020; Liu et al., 2021a; Wang et al., 2023b) allow adaptive mask modifications during training while adhering to specified parameter constraints.

Neural network pruning has potential synergies with model expansion, akin to the dynamics of DST. A combined approach could involve iterative increases and decreases in hidden dimensions or layers during training, potentially accelerating training speed.