
One Network, Many Masks:
Towards More Parameter-Efficient Transfer Learning

Guangtao Zeng  Peiyuan Zhang  Wei Lu
StatNLP Research Group
Singapore University of Technology and Design
[email protected], {peiyuan_zhang, luwei}@sutd.edu.sg
Abstract

Fine-tuning pre-trained language models for multiple tasks tends to be expensive in terms of storage. To mitigate this issue, parameter-efficient transfer learning (PETL) methods have been proposed, but they still require a significant number of parameters and considerable storage when applied to broader ranges of tasks. To achieve even greater storage reduction, we propose ProPetl, a novel method that enables efficient sharing of a single PETL module, which we call the prototype network (e.g., adapter, LoRA, or prefix-tuning), across layers and tasks. We then learn binary masks to select different sub-networks from the shared prototype network and apply them as PETL modules in different layers. We find that the binary masks can encode crucial structural information of the network, which is often ignored in previous studies. Our work can also be seen as a type of pruning method, where we find that overparameterization also exists in the seemingly small PETL modules. We evaluate ProPetl on various downstream tasks and show that it can outperform other PETL methods with approximately 10% of the parameter storage required by the latter. Our code is available at https://github.com/ChaosCodes/ProPETL.

The first two authors contributed equally.

1 Introduction

Figure 1: An illustration of our ProPetlAdapter model. Note that ProPetl is orthogonal to the specific PETL architectures; LoRA and prefix-tuning are also implemented in our framework.

With the release and wide application of numerous pre-trained language models (PLMs) Devlin et al. (2019); Liu et al. (2019), pre-training and subsequently fine-tuning them has become prevalent in natural language processing (NLP), yielding good performance on many downstream tasks. However, such a paradigm requires the entire model to be updated and saved after fine-tuning. As PLMs grow in size, traditional fine-tuning becomes costly in storage, limiting its application in multi-task scenarios. To ameliorate this issue, many Parameter-Efficient Transfer Learning (PETL) methods have been proposed Houlsby et al. (2019); Li and Liang (2021); Hu et al. (2022). Rather than fine-tuning the entire model, they introduce new parameters and only fine-tune those additional parameters on downstream tasks, which drastically reduces the parameter storage required for each task. However, they still require a significant number of parameters when more tasks are considered.

In this paper, we continue this line of research and target using even less storage per task. We observe that recent advances in PETL focus on finding better ways to apply the additional parameters, such as the adapter module after each feed-forward layer Houlsby et al. (2019); Pfeiffer et al. (2021), or the low-rank matrices in the query and value projections of the self-attention networks Hu et al. (2022). However, few works examine the impact of sub-network structure or integrate pruning methods with PETL methods. In fact, studies in network pruning have shown that the modeling ability of neural networks relies not only on the parameters but also on the sub-network structures that are determined by the pruning masks. For instance, Zhou et al. (2019) discovered that a sub-network of an untrained model can yield good performance without any parameter updates. In light of this, we seek to incorporate the structural information of sub-networks into PETL. We believe that when enough structural information is injected into the network, far fewer parameters are needed in the PETL modules, which further improves parameter efficiency.

To this end, we propose a novel PETL method dubbed ProPetl that enables efficient sharing of a single prototype adapter, prefix, or LoRA across layers and tasks. When sharing the prototype network, ProPetl learns binary masks to prune different sub-networks in different layers and tasks (Figure 1). The connections of the prototype network pruned in one layer can be used in another with a different pruning mask. In this way, a parameter can be used multiple times across different modules, achieving higher parameter efficiency. Previous methods He et al. (2022b); Ma et al. (2022) only consider simply discarding (pruning) the useless parameters, while we focus on the structural information in masks by strategically dispatching the parameters in the single prototype network to different modules. We evaluate ProPetl on various downstream tasks, including GLUE Wang et al. (2018), XSum Narayan et al. (2018), and WMT16 Ro-En Bojar et al. (2016). Experiments show that ProPetl achieves better performance than other PETL methods while using significantly fewer parameters.

Our contributions are summarized as follows:

  • We propose ProPetl, a highly parameter-efficient transfer learning method that injects structural information into PETL and allows for efficient sharing of a single prototype network across layers and tasks.

  • Experiments show ProPetl is able to dramatically reduce the storage of the parameters while achieving better performance than conventional PETL methods.

  • ProPetl offers an alternative view for network pruning and sharing, where we use binary masks to decide when to discard or share the parameters. We hope to inspire more intriguing explorations in this direction.

2 Related Work

In this section, we briefly survey ideas that are related to our work from three fields: parameter-efficient transfer learning, pruning methods, and multi-task learning.

2.1 Parameter-Efficient Transfer Learning

Recently, as pre-trained language models have grown larger and larger, parameter-efficient transfer learning methods that only update a few extra parameters while freezing the PLM backbone have been proposed. Adapter-tuning Houlsby et al. (2019) fine-tunes adapter modules inserted after each attention and feed-forward layer. Prefix-tuning Li and Liang (2021) prepends additional trainable prefixes to the key and value matrices in the attention module. LoRA Hu et al. (2022) injects tunable rank decomposition matrices into each Transformer layer. Building on these parameter-efficient transfer learning methods, He et al. (2022a) proposed a unified framework that allows for the transfer of design elements across various PETL approaches. However, when applied to larger PLMs and a broader range of tasks, these methods still require a large storage space because the number of extra parameters is directly proportional to the number of layers and tasks. Inspired by the parameter-sharing techniques of ALBERT Lan et al. (2020a), we propose sharing the additional parameters in PETL modules across layers and tasks. Our method can thus obtain higher parameter efficiency with a significantly smaller portion of additional storage than existing PETL methods.

2.2 Pruning Methods

Pruning is one of the most popular methods for removing unnecessary weights from over-parameterized neural networks while maintaining comparable performance. Recently, Frankle and Carbin (2019) proposed the Lottery Ticket Hypothesis, stating that a randomly initialized dense model contains a sparse sub-network that, when trained in isolation, can achieve performance comparable to the dense model. Following this hypothesis, many pruning-before-training methods have emerged Lee et al. (2019); Bai et al. (2022); Sreenivasan et al. (2022). Xu et al. (2021) further proposed a method that prunes the backward gradient of the neural network, as opposed to pruning the network parameters themselves. Based on these methods, some works He et al. (2022b); Ma et al. (2022) also proposed combining pruning algorithms with parameter-efficient methods to further decrease the additional module size. However, those methods only focus on discarding redundant parameters: a parameter is either discarded or retained without any sharing. They fail to make full use of the additional parameters and cannot achieve highly sparse sub-networks without significantly compromising accuracy.

Figure 2: Overview of our ProPetl method. In the left part, the grey neural network indicates the prototype network. Using various binary masks, we can derive sub-networks by pruning certain connections, as depicted by the green (top right), red (bottom left), and blue (bottom right) networks. In the right part, we learn layer masks and task masks under multi-task learning. Given a specific Transformer Vaswani et al. (2017) layer and task to handle, ProPetl generates a hybrid mask by performing a logical OR operation on the layer mask and the task mask. It then uses the hybrid masks to generate different sub-networks from the prototype.

2.3 Multi-Task Learning

Multi-task learning Zhang and Yang (2022), which involves training a single model to perform well on multiple tasks, has gained popularity as a research direction in machine learning. However, this approach can be hindered by catastrophic forgetting Kirkpatrick et al. (2016) and data imbalance among tasks, which will result in overfitting on low-resource tasks and underfitting on high-resource tasks Arivazhagan et al. (2019). Houlsby et al. (2019) propose adapter tuning that only introduces and updates small additional parameters for each task while freezing the pre-trained model. Based on such a parameter-efficient method, Mahabadi et al. (2021) train a hyper-network named Hyperformer, which generates task-specific weights for the adapter modules when fed with different task embeddings.

3 Methods

In this section, we first give an introduction to parameter-efficient transfer learning (PETL). We then present our method ProPetl, depicted in Figure 2, which combines the techniques of parameter-sharing and pruning to further improve the parameter efficiency compared to existing PETL methods.

3.1 Preliminaries

In parameter-efficient transfer learning, we freeze the parameters $\theta_{lm}$ of the pre-trained language model and introduce additional fine-tunable parameters denoted as $\theta_{t}$. Given a dataset $\{X_{i},Y_{i}\}_{i=1}^{N}$, the goal of parameter-efficient fine-tuning is to maximize the following likelihood of the labels $Y$ by updating only the additional parameters $\theta_{t}$:

\max_{\theta_{t}}\sum_{i=1}^{N}\log P(Y_{i}|X_{i};\theta_{lm},\theta_{t}) \qquad (1)

Such parameter-efficient methods suggest a more effective way to adapt pre-trained language models to downstream tasks than full fine-tuning. We give a brief introduction to the three most widely used PETL modules, namely adapter Houlsby et al. (2019), prefix-tuning Li and Liang (2021), and LoRA Hu et al. (2022), in Appendix A. However, there is still a storage limitation when we handle a large range of tasks using these methods. In this paper, we investigate the potential for further enhancing parameter efficiency by reducing storage requirements. While previous PETL methods have primarily focused on decreasing the number of parameters to improve efficiency, our approach posits that employing varying bit lengths (e.g., 1-bit, 8-bit, 32-bit) during storage can lead to significant improvements in parameter efficiency by reducing the overall number of bits used by the parameters. To this end, we measure storage in bits, which we call Bit-Level Storage (BLS), to take into account the fact that different parameters may have different bit lengths. Consider a neural model in which each parameter has a specific bit length. We divide these parameters into $N$ distinct groups based on their respective bit lengths. Let $\{\rho_{i}\}_{i=1}^{N}$ denote the number of parameters within group $i$, with corresponding bit lengths $\{b_{i}\}_{i=1}^{N}$. The BLS for these parameters can subsequently be determined as follows:

\text{Bit-Level Storage}=\sum_{i=1}^{N}\rho_{i}b_{i} \qquad (2)
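As a concrete illustration of Equation 2, the following is a minimal sketch (not taken from our released code) that computes the BLS of a model whose parameters are grouped by bit length; the group sizes in the example are hypothetical.

# Minimal sketch of Equation 2: BLS sums (number of parameters x bit length)
# over groups of parameters that share the same bit length.
def bit_level_storage(groups):
    """groups: iterable of (num_params, bit_length) pairs."""
    return sum(num_params * bits for num_params, bits in groups)

# Hypothetical example: a 32-bit prototype module with 100k parameters shared
# across 12 layers, plus one 1-bit binary mask of the same size per layer.
prototype = (100_000, 32)     # 32-bit floats
masks = (12 * 100_000, 1)     # 1-bit binary masks
print(bit_level_storage([prototype, masks]))  # 4400000 bits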

3.2 Shared Prototype Network

Parameter-efficient methods like adapter and prefix-tuning introduce an additional module in each Transformer layer. Assuming the Transformer has $L$ layers, we can split the parameters $\theta_{t}$ into $[\theta_{t,1},\ldots,\theta_{t,L}]$ according to their layer indexes. Therefore, we can rewrite Equation 1 as:

\max_{\theta_{t,1},\ldots,\theta_{t,L}}\sum_{i=1}^{N}\log P(Y_{i}|X_{i};\theta_{lm},[\theta_{t,1},\ldots,\theta_{t,L}]) \qquad (3)

Inspired by ALBERT Lan et al. (2020b), in our method, we first introduce additional parameters for a single PETL module as our prototype network, denoted as $\theta_{pro}$. Then, we share the prototype network across different layers. Assuming that the number of parameters in a single PETL module is $n$, we decrease the total number of additional parameters from $nL$ to only $n$, which significantly improves the parameter efficiency. Therefore, we convert the objective function to a more concise one:

\max_{\theta_{pro}}\sum_{i=1}^{N}\log P(Y_{i}|X_{i};\theta_{lm},\theta_{pro}) \qquad (4)
Input: Dataset $\mathcal{D}=\{(x_{i},y_{i})\}$, prototype network learning rate $\lambda_{p}$, mask learning rate $\lambda_{m}$, number of layers $L$, sparsity ratio $k\%\in[0,1]$, pre-trained parameters $\theta_{lm}$
Output: Prototype network parameters $\theta_{pro}\in\mathbb{R}^{n}$, binary masks across layers $m_{1},\ldots,m_{L}\in\{0,1\}^{n}$

$\theta_{pro}\leftarrow$ randomly initialized in $\mathbb{R}^{n}$
$s_{1},\ldots,s_{L}\leftarrow$ randomly initialized in $\mathbb{R}^{n}$
for $(x_{i},y_{i})$ in $\mathcal{D}$ do
    /* Apply masks in each layer */
    for $l$ in $1,2,\ldots,L$ do
        $m_{l}\leftarrow h(s_{l},k)$
        $\theta_{sub,l}\leftarrow\theta_{pro}\odot m_{l}$
    $\theta_{pro}\leftarrow\theta_{pro}-\lambda_{p}\nabla_{\theta_{pro}}\log P(y_{i}|x_{i};\theta_{lm},\theta_{sub})$
    $s\leftarrow s-\lambda_{m}\nabla_{s}\log P(y_{i}|x_{i};\theta_{lm},\theta_{sub})$
for $l$ in $1,2,\ldots,L$ do
    $m_{l}\leftarrow h(s_{l},k)$
return $\theta_{pro},m$

Algorithm 1: ProPetl training algorithm
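For readers who prefer code, below is a condensed PyTorch-style sketch of the training loop above. The variable names and the placeholder loss are hypothetical; see https://github.com/ChaosCodes/ProPETL for the actual implementation.

# A condensed PyTorch-style sketch of Algorithm 1 (hypothetical names).
import torch

n, L, k = 1024, 12, 0.5                          # prototype size, layers, sparsity ratio k%
theta_pro = torch.randn(n, requires_grad=True)   # shared prototype parameters
scores = torch.randn(L, n, requires_grad=True)   # one score vector s_l per layer

def topk_mask(s, k):
    # h(s, k): keep the top k% of scores by magnitude; the straight-through
    # estimator passes the gradient to s as if h were the identity
    num_keep = int(k * s.numel())
    threshold = s.abs().flatten().kthvalue(s.numel() - num_keep + 1).values
    hard = (s.abs() >= threshold).float()
    return hard + s - s.detach()

optimizer = torch.optim.Adam([
    {"params": [theta_pro], "lr": 1e-4},         # lambda_p
    {"params": [scores], "lr": 3e-3},            # lambda_m
])

# One training step; in practice each theta_sub[l] parameterizes the PETL
# module of layer l inside the frozen PLM, and the loss is -log P(y|x).
theta_sub = [theta_pro * topk_mask(scores[l], k) for l in range(L)]
loss = sum(p.pow(2).sum() for p in theta_sub)    # dummy stand-in loss
loss.backward()
optimizer.step()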

3.3 Masked Sub-Networks

Sharing the parameters alone reduces the model's capacity to capture meaningful representations in different layers, leading to suboptimal results. Inspired by Zhou et al. (2019) and Ramanujan et al. (2020), we believe that parameters and network structures are both crucial contributing factors to the model's representational capacity. To this end, we introduce a different binary mask $m_{l}\in\{0,1\}^{n}$ in each Transformer layer $l$ (left part of Figure 2), where $n$ denotes the number of parameters in a single PETL module. Each mask represents a corresponding sub-network of the shared prototype network. Even though the prototype is shared among all layers, we can use different masks to create a different sub-network for each layer $l$, whose parameters are $\theta_{sub,l}=\theta_{pro}\odot m_{l}$, where $\odot$ denotes the element-wise product. With this, our final objective function becomes:

\max_{\theta_{pro},m_{1},m_{2},\ldots,m_{L}}\sum_{i=1}^{N}\log P(Y_{i}|X_{i};\theta_{lm},\theta_{sub}) \qquad (5)

where $\theta_{sub}=[\theta_{pro}\odot m_{1},\theta_{pro}\odot m_{2},\ldots,\theta_{pro}\odot m_{L}]$.

To learn such masks, we develop our training algorithm based on the edge-popup approach Ramanujan et al. (2020). Specifically, for each binary mask $m_{l}$, we introduce floating-point scores $s_{l}\in\mathbb{R}^{n}$. In the forward pass, we generate the binary mask $m_{l}$ by setting the top $k\%$ of entries with the largest absolute values in $s_{l}$ to 1 and the rest to 0. We denote this top-$k\%$ thresholding function as $h$, so that $m_{l}=h(s_{l})$. We refer to the hyperparameter $k\%$ as the sparsity ratio in the subsequent sections of this paper. During backpropagation, we use the straight-through gradient estimator Bengio et al. (2013) to approximate the gradient of the scores, where the function $h(\cdot)$ is treated as the identity function. In addition to training the masks, we also jointly optimize the prototype network.

Our approach employs learnable floating-point scores for each binary mask during fine-tuning, leading to $nL$ score parameters. When integrated with the prototype network, ProPetl updates a total of $nL+n$ parameters during the training phase, which is marginally more than conventional PETL methods with $nL$ extra parameters. A comprehensive overview of the ProPetl algorithm can be found in Algorithm 1. After training, we discard the floating-point scores and retain only the binary masks (1-bit) together with the shared prototype network (32-bit). Assuming that the 32-bit prototype network requires $p$ bits of storage and a binary mask of the same dimension demands $p/32$, ProPetl achieves a substantial decrease in storage, from around $pL$ to $p(1+L/32)$. To reconstruct the network during inference, we adhere to the following steps: (1) load the unpruned, shared 32-bit prototype network, (2) load the binary masks (1-bit) for each layer/task, and (3) extract and use the pruned sub-networks from the shared prototype network based on the specific binary masks during inference.
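The storage arithmetic above can be made concrete with a small serialization sketch; the helpers below are an assumption for illustration (NumPy bit-packing), not the storage code of our released implementation.

import numpy as np
import torch

def save_task(path, theta_pro, layer_masks):
    # prototype stays 32-bit; the L binary masks are packed to 1 bit per entry
    np.savez(path,
             prototype=theta_pro.detach().numpy().astype(np.float32),
             masks=np.packbits(torch.stack(layer_masks).bool().numpy(), axis=-1))

def load_task(path, n):
    data = np.load(path)
    theta_pro = torch.from_numpy(data["prototype"])
    masks = torch.from_numpy(
        np.unpackbits(data["masks"], axis=-1)[:, :n].copy()).float()
    # the sub-network of layer l is rebuilt on the fly as theta_pro * masks[l]
    return theta_pro, masks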

3.4 Hybrid Masks for Multi-Task Learning

Rather than just sharing a PETL module across layers under single-task learning, we can also allow for efficient sharing of the prototype network across multiple tasks. In our approach, we leverage layer masks, as introduced in the previous section, to support parameter sharing within the model. Additionally, we introduce task masks to support parameter sharing across multiple tasks. By performing a logical OR operation on these masks (we provide ablations regarding the choice of mask combining methods in Appendix D.1), we can obtain a hybrid mask for a specific layer in a specific task, as shown on the right side of Figure 2.

m_{hybrid}=m_{layer}\lor m_{task} \qquad (6)

With the design of the hybrid mask, given $T$ tasks and $L$ layers in the pre-trained language model, we only require one ProPetl module, $L$ layer masks, and $T$ task masks, further reducing the BLS compared to the single-task scenario (e.g., 0.011% BLS as in Table 2). In addition, the layer masks and task masks, which are combined into hybrid masks, can potentially help infuse layer- and task-specific knowledge into the shared prototype network.
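A short sketch of the hybrid mask in Equation 6; the mask size is illustrative, and the 30% layer/task density follows the setting adopted in Section 4.

import torch

n = 100_000
layer_mask = torch.rand(n) < 0.3          # layer mask with ~30% ones
task_mask = torch.rand(n) < 0.3           # task mask with ~30% ones
hybrid_mask = layer_mask | task_mask      # logical OR, Equation 6
# Expected density: 0.3 + 0.3 - 0.3 * 0.3 = 0.51
print(hybrid_mask.float().mean())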

4 Experimental Setup

We briefly summarize the experimental setup in this section. More details can be found in Appendix C.

Datasets

We evaluate ProPetl on a wide range of benchmarks, including language understanding (GLUE Wang et al. (2018)), text summarization (XSum Narayan et al. (2018)), and machine translation (WMT16 Ro-En Bojar et al. (2016)).

Backbones

We use RoBERTaBASE Liu et al. (2019) for single-task learning on GLUE. During fine-tuning, we only tune our ProPetl module and the text classification head. For the generation and multi-task learning benchmarks, we use T5BASE Raffel et al. (2020) and only tune the ProPetl module. Note that some previous works also tune the layer norms Houlsby et al. (2019); Mahabadi et al. (2021), while we keep them frozen during fine-tuning.

PETL Modules and ProPetl

We use the Pfeiffer adapter Pfeiffer et al. (2021) as our adapter module and set the bottleneck dimension to 64 by default. For prefix-tuning, we follow Li and Liang (2021) and choose a prefix length of 64. For LoRA tuning Hu et al. (2022), the bottleneck dimension and the scaling factor $\alpha$ are both set to 32. In ProPetl, we increase the value of $\alpha$ to 48 to scale the representation, as the sparse network decreases the norm of the output representation. We give a brief summary of these PETL modules and how ProPetl is implemented on top of them in Appendix A. Following Ramanujan et al. (2020), we choose the sparsity ratio $k\%$ of ProPetl as 0.5, which we further discuss in Section 6. In multi-task learning, we aim to maintain an expected $k\%$ of around 0.5 for the hybrid mask, so we set $k\%$ to 0.3 for both the layer and task masks. (Two random and independent binary masks whose elements have a 30% probability of being one will produce a resulting mask with about 50% ones after the OR operation, which follows from $P(A\cup B)=P(A)+P(B)-P(A)P(B)$.)

Model % FT Params %BLS CoLA SST-2 MRPC QQP STS-B MNLI QNLI RTE Avg
RoBERTaBASE 100.00% 100.00% 60.94 94.04 87.25/90.76 91.34/88.53 90.96/90.70 87.57 92.53 73.14 86.16
Prefix (l=64) 0.95% 0.95% 63.27 94.42 89.05/92.01 88.86/85.18 90.46/90.39 85.76 91.46 63.79 84.97
LoRA (bn=32) 0.95% 0.95% 62.24 93.81 86.76/90.48 88.79/85.15 90.73/90.49 86.59 91.85 67.63 84.96
Adapter (bn=64) 0.95% 0.95% 63.15 94.00 86.93/90.49 89.78/86.52 90.84/90.65 87.10 92.23 70.50 85.65
ProPetlPrefix (l=64) 1.03% 0.11% 61.81 94.00 87.42/91.00 88.85/85.22 90.48/90.47 85.73 91.05 63.79 84.53
ProPetlLoRA (bn=32) 1.04% 0.11% 62.16 93.62 88.73/91.80 87.59/83.71 90.92/90.83 85.30 91.75 72.66 85.37
ProPetlAdapter (bn=64) 1.04% 0.11% 65.43 94.15 88.24/91.41 89.40/86.04 91.34/90.95 86.53 92.58 76.50 86.60
Table 1: Performance of all models based on RoBERTa on the GLUE tasks under single-task settings. Bold fonts indicate the best results. "bn" stands for the bottleneck dimension and "l" refers to the number of prefixes. % FT Params refers to the percentage of fine-tunable parameters during training (including the underlying floating-point scores of each pruning mask). %BLS indicates the task-specific Bit-Level Storage (defined in Section 3.1) calculated against the fully fine-tuned counterpart when saving the model weights and during inference.

Evaluation

For text generation, we report ROUGE-2 Lin (2004) on the XSum test set and the BLEU Papineni et al. (2002) score on the Ro-En test set. Since the test sets of GLUE are not publicly released, following Zhang et al. (2021) and Mao et al. (2022), when a dataset has fewer than 10k samples (RTE, MRPC, STS-B, CoLA), we divide the original validation set into halves, using the first half for validation and the second for testing. For the other datasets in GLUE, we randomly choose 1k samples from the training set as our validation data and test on the original validation set. We report both accuracy and F1 for MRPC and QQP in GLUE. For STS-B, we report both Pearson and Spearman correlation coefficients. For CoLA, we report the Matthews correlation. For all remaining sub-tasks in GLUE, we report accuracy. Due to the high training overhead of the generation tasks, we report experimental results with a single run for XSum and Ro-En. For GLUE, we report the mean of three different random runs.

5 Results

5.1 Single-Task Learning

Results in Language Understanding

In Table 1, we report the performance of ProPetl and various baselines on the GLUE benchmark. Both ProPetlAdapter and ProPetlLoRA demonstrate superior performance compared to their respective counterparts (adapter and LoRA). Despite having slightly more parameters during the fine-tuning process, ProPetl requires only 1/9 (0.11% vs. 0.95%) of the bit-level storage during inference, making it the more efficient option. Specifically, ProPetl increases the average score of the adapter by 0.95 and improves the score of LoRA by 0.41. Moreover, ProPetlAdapter remarkably outperforms the fully fine-tuned model (86.60 vs. 86.16) while using only 0.11% of its storage. These results indicate that, despite having fewer parameters, ProPetl, injected with structural information from the masks, can make better use of the single prototype network and achieve better performance than its counterparts. However, we also found that ProPetlPrefix did not outperform prefix-tuning, which we believe is caused by the reparameterization in prefix-tuning that harms mask learning (see more details and explanation in Appendix A.3). Overall, ProPetl improves the adapter to the greatest extent, and ProPetlAdapter achieves the highest performance among the three ProPetl variants. We therefore stick to ProPetlAdapter for the rest of the experiments.

Results in Language Generation

To verify that ProPetl can also be applied to harder tasks, we evaluate our method on two language generation datasets, XSum and WMT16 Ro-En. The results are presented in Figure 3 (a) and (b). We find that ProPetlAdapter performs just as well as the regular adapter method while using significantly less bit-level storage. Additionally, when consuming more than 1.6% of the storage, ProPetlAdapter achieves competitive performance on the XSum dataset compared with the fully fine-tuned T5. However, both adapter tuning and ProPetlAdapter do not reach the level of the fully fine-tuned model on Ro-En. One potential explanation is that Ro-En is harder because translation knowledge is not well covered during the pre-training of T5. To perform well on such tasks, the model needs to learn additional knowledge and requires more tunable parameters during fine-tuning. We note that such sub-optimal performance on hard generation tasks is not unique to ProPetl but generally exists in all PETL methods. Similar findings are also presented in Raffel et al. (2020) and He et al. (2022a). Overall, these experiments show that ProPetl is also more parameter-efficient than existing PETL methods on text generation benchmarks.

5.2 Multi-Task Learning

Model %FT Params per task %BLS per task CoLA SST-2 MRPC QQP STS-B MNLI QNLI RTE Avg
Single-Task Learning
T5BASE 100.0% 100.000% 54.85 92.19 88.18/91.61 91.46/88.61 89.55/89.41 86.49 91.60 67.39 84.67
Adapter (bn=64) 1.070% 1.070% 62.64 94.07 87.36/91.06 90.25/87.28 89.88/89.55 85.76 92.85 71.01 85.61
Multi-Task Hypernetworks
Hyperformer++ 0.290% 0.290% 63.73 94.03 89.66/92.63 90.28/87.20 90.00/89.66 85.74 93.02 75.36 86.48
Multi-Task Training
T5BASE 12.500% 12.500% 54.88 92.54 90.15/93.01 91.13/88.07 88.84/88.53 85.66 92.04 75.36 85.47
Adapter (bn=64) 0.130% 0.130% 62.08 93.57 89.49/92.64 90.25/87.13 87.54/87.41 85.14 92.80 72.22 85.48
Adapter (bn=6) 0.013% 0.013% 58.34 93.61 86.20/90.44 90.10/86.98 86.96/86.66 84.02 92.38 67.63 83.94
ProPetlAdapter (bn=64) 0.156% 0.011% 61.43 94.22 87.36/90.97 90.13/87.14 90.32/90.12 85.34 93.01 75.60 85.97
ProPetlAdapter (bn=6) 0.016% 0.001% 54.59 93.53 87.36/91.02 90.15/87.04 90.70/90.50 85.08 92.79 75.86 85.32
Table 2: GLUE Results on T5. Under single-task learning, we train each task with different model copies. As for multi-task training, we train a unified model or adapter. Results marked with † are from the implementation of Mahabadi et al. (2021). Bold fonts suggest the best results in the block.

We present the results of multi-task learning, along with baseline results from single-task learning using T5BASE, in Table 2. Our best-performing model, ProPetlAdapter with a bottleneck dimension of 64, surpasses the fully fine-tuned T5. We also compare ProPetl with Hyperformer++ Mahabadi et al. (2021), a hyper-network that is specifically designed to transfer knowledge across tasks. The latter uses significantly more task-specific bit-level storage (26x: 0.011% vs. 0.29% per task), while only increasing the average score by 0.51. Compared to the vanilla adapter, ProPetlAdapter is marginally better under a similar fine-tuning parameter budget but with 1/9 of the original storage. Besides, we experiment with an extreme case, where we set the bottleneck dimension to 6. Our results show that the accuracy of adapter tuning decreases from 85.48 to 83.94, while ProPetlAdapter still maintains performance comparable to the fully fine-tuned model (85.32 vs. 85.47) with a remarkably small percentage (0.001%) of bit-level storage per task. This demonstrates that normal adapter tuning cannot make full use of the parameters and may fail to perform well with a relatively small bottleneck dimension. In contrast, ProPetlAdapter can still achieve a reasonable result even with a bottleneck dimension of only 6. To further validate that ProPetl is effective for larger models, we also carry out experiments on the T5 3B variant and present the findings in Table 3. The outcomes align with our conclusions drawn from the T5 base model. We believe that, in ProPetlAdapter, the structural information learned by the masks can, to a certain extent, compensate for the performance drop caused by shrinking the bottleneck dimension. We further compare adapter tuning with ProPetlAdapter at different percentages of bit-level storage by varying the bottleneck dimension.

Model %FT Params per task %BLS per task Avg
T53B 0.0025% 0.0025% 88.92
Adapter (bn=64) 0.0278% 0.0278% 88.31
ProPetlAdapter (bn=64) 0.0283% 0.0016% 89.02
Table 3: Multi-task learning results on the GLUE using T53B as the backbone.

The results are presented in Figure 3 (c). It shows that ProPetlAdapter is able to reach the fine-tuned performance with as few as 0.002% task-specific bit-level storage. We can also see that the curve shares a similar trend with those in Figure 3 (a) and (b), which we will further discuss in the next section.

6 Discussion

(a) XSum  (b) Ro-En  (c) GLUE
Figure 3: Performance of adapter and ProPetlAdapter on XSum (left), Ro-En (middle), and GLUE (right) with the T5BASE model. We train on XSum and Ro-En under single-task settings for 1 run. We train GLUE under multi-task learning and report the average score over 3 runs. We additionally provide these results in table format in Appendix E.

How does ProPetl Scale Differently to Adapter?

Figure 3 presents the results when we adjust the size of the adapter and ProPetlAdapter on three different datasets. Despite the differences in tasks and methods, we discover that the adapter and ProPetlAdapter show similar scaling trends on all three datasets when we increase the proportion of task-specific bit-level storage. Their performance initially increases linearly with respect to the log scale of the extra storage. When the adapter and ProPetlAdapter get close to the performance of the fully fine-tuned model, their performance gradually converges. Even though ProPetlAdapter and adapter tuning can slightly exceed the fully fine-tuned performance on some datasets, their performance is still bounded to the same level and cannot outperform the fully fine-tuned model by a large margin. For instance, the performance of both the adapter and ProPetl starts to drop when the task-specific storage exceeds 0.4% per task on GLUE (Figure 3 (c)). However, ProPetlAdapter reaches the fully fine-tuned level much earlier in the scaling curves. Given a fixed amount of task-specific storage, ProPetlAdapter also achieves better results than the adapter. These results indicate that our share-and-mask method is overall more efficient than the adapter across almost all scales.

Which is More Important, Sharing or Masking?

In this section, we discuss the effects of masking and sharing in the prototype network by comparing ProPetl with a random mask baseline and two alternative settings (only mask and only share). For the random mask setting, we randomly select the sub-network of the prototype module during the forward pass rather than relying on the updated mask scores. Only mask does not share the module across layers but only learns masks to prune different PETL modules into sparse sub-networks. Only share shares a single PETL module among layers without using masks for pruning. Masking without sharing would drastically increase the number of fine-tunable parameters if we kept the same bottleneck dimension and sparsity ratio. To keep the number of fine-tunable parameters at the same level, we either keep the bottleneck dimension the same and reduce the sparsity ratio $k\%$, or use the same sparsity ratio with a reduced bottleneck dimension for the only mask setting. We also slightly increase the bottleneck dimension of the only share setup to compensate for the parameters of the masks. Our results, as presented in Table 4, indicate that the random mask setting yields the poorest performance. Moreover, neither only mask nor only share reaches the same performance as ProPetl, highlighting the necessity of using both masking and sharing to achieve higher parameter efficiency (detailed setup can be found in Appendix D.2). We believe masking injects crucial structural information into the sub-networks, while sharing is necessary to expand the size of each sub-network when the number of fine-tunable parameters is fixed. Therefore, our method uses the parameters more efficiently.

Method ProPetl Random Only Mask (Same bn) Only Mask (Same k%) Only Share
GLUE
Adapter (0.11%) 86.60 82.29 85.40 84.70 84.32
Prefix (0.11%) 84.53 79.56 84.18 84.23 81.57
LoRA (0.11%) 85.37 82.48 83.46 84.75 82.53
Ro-En
Adapter (0.46%) 32.63 30.02 31.58 30.68 31.30
Table 4: Ablation studies of the shared network and masks. We report the average score on GLUE based on RoBERTaBASE under single-task learning. For Ro-En, we report the BLEU score with T5BASE as the backbone. Numbers in parenthesis indicate the percentage of task-specific bit-level storage calculated against the fully-finetuned model.

How do Sub-Networks’ Size and Structure Affect Each Other?

The sparsity ratio $k\%$ is an important hyperparameter in ProPetl. We study the impact of such sub-network sparsity and present the results in Figure 4. The performance of ProPetl improves as $k\%$ increases from 0.1 to 0.5 but then declines as $k\%$ continues to grow from 0.5 to 1.0. Additionally, all these PETL methods achieve their best accuracy when $k\%$ is set to 0.5. This is likely because, as the network becomes denser from 0.1 to 0.5, the sub-networks get to use more parameters and thus obtain better modeling ability. However, beyond 0.5, the networks in different layers become more homogeneous as the sub-networks overlap more, leading to less distinctive structural information in each layer. The absence of enough structural information starts to harm the performance, which even outweighs the benefits of potential knowledge sharing across the layers. We also find that sharing the PETL module without any pruning ($k\%$ = 1.0) results in the worst performance among all sparsity levels. These results suggest that, given a fixed percentage of tunable parameters, it is crucial to find a good balance between the distinctive structural information of each sub-network and the number of parameters used in each sub-network.

How is ProPetl Conceptually Related to PETL?

Beyond the view of prototype network sharing and masking, our proposed ProPetl can also be considered a PETL-on-PETL approach, which we refer to as PETL². In other words, the mask is to the prototype network (in our approach) as the PETL module is to the PLM (in conventional PETL approaches). Vanilla PETL methods, such as adapter and prefix-tuning, update specific additional parameters for each layer and downstream task while sharing only the parameters of the PLM. In contrast, ProPetl extends this approach by sharing not only the PLM but also the prototype PETL module among layers and tasks, resulting in a higher degree of parameter sharing. Our method uses binary masks that function like PETL modules on top of the prototype PETL module to prune different structures in different sub-networks. These task-specific tunable parameters are thus an order of magnitude smaller than conventional PETL modules.

Figure 4: Average score of ProPetl on GLUE with different sparsity ratios under single-task learning on RoBERTaBASE.

7 Conclusion and Future Work

In this paper, we introduce ProPetl, a method for sharing prototype PETL modules across different layers and tasks. Our method significantly improves the parameter efficiency by utilizing the prototype network and maintaining a binary mask for each layer and task. Extensive experiments show that our method achieves comparable performance to fully fine-tuned models and conventional PETL methods with a much smaller fraction of storage. For future works, we aim to study the interpretability of the masks in different layers and explore their potential relationships. We also intend to apply our method to the pre-training process of large language models to reduce the overall number of parameters.

Limitations

Although our masks in different layers are binary and require significantly less storage than other PETL networks, we still need the underlying 32-bit scores for each mask during training. Therefore, ProPetl consumes slightly more memory at training time than existing PETL methods. Fine-tuning ProPetl takes a similar amount of training time to conventional PETL modules, which means our method will normally take longer to converge than the fully fine-tuned model.

Acknowledgements

We would like to thank the anonymous reviewers, our meta-reviewer, and senior area chairs for their constructive comments and support on this work. This research/project is supported by the National Research Foundation Singapore and DSO National Laboratories under the AI Singapore Program (AISG Award No: AISG2-RP-2020-016), and AI Singapore Programme (AISG Award No: AISG2-PhD-2021-08-007).

References

Appendix A Implementation Details of the Three Variants of ProPetl

A.1 ProPetlAdapter

An adapter module modifies a model’s hidden representation through a down-sampling projection and an up-sampling projection with a non-linear layer in between:

\boldsymbol{h}\leftarrow\boldsymbol{h}+f(\boldsymbol{h}W_{\text{down}}+b_{\text{down}})W_{\text{up}}+b_{\text{up}} \qquad (7)

where $W$ represents a weight matrix, $f$ denotes the non-linear layer, and $b$ is a bias term. In ProPetlAdapter, we apply our binary pruning masks to the up-sampling and down-sampling weights, respectively:

\boldsymbol{h}\leftarrow\boldsymbol{h}+f(\boldsymbol{h}\widetilde{W}_{\text{down}}+b_{\text{down}})\widetilde{W}_{\text{up}}+b_{\text{up}} \qquad (8)

where $\widetilde{W}_{\text{down}}=W_{\text{down}}\odot m_{\text{down}}$ and $\widetilde{W}_{\text{up}}=W_{\text{up}}\odot m_{\text{up}}$.

The original Houlsby adapter Houlsby et al. (2019) introduces the adapter module after each multi-head attention and feed-forward layer. Pfeiffer et al. (2021) later propose a more efficient variant of the adapter that is only inserted after the feed-forward layer.
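A minimal sketch of the masked adapter computation in Equation 8, with ReLU standing in for the non-linearity $f$; the function and its arguments are hypothetical, and the shared prototype weights with per-layer masks would be supplied from outside.

import torch
import torch.nn.functional as F

def propetl_adapter(h, W_down, b_down, W_up, b_up, m_down, m_up):
    # h: (batch, seq, d); W_down: (d, bn); W_up: (bn, d); masks match the weights
    z = F.relu(h @ (W_down * m_down) + b_down)   # masked down-projection
    return h + z @ (W_up * m_up) + b_up          # masked up-projection + residual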

A.2 ProPetlLoRA

LoRA Hu et al. (2022), like adapter tuning, also consists of a down-sampling projection and an up-sampling projection. The difference is that LoRA does not have any non-linear layer, but it has an additional scaling factor $\alpha\geq 1$:

\boldsymbol{h}\leftarrow\boldsymbol{h}+\alpha\cdot\boldsymbol{x}W_{\text{down}}W_{\text{up}} \qquad (9)

To modify PLMs’ hidden representation, LoRA is applied to the query and value representations of the attention modules in Transformers. In ProPetlLoRA, the network is pruned with binary masks:

\boldsymbol{h}\leftarrow\boldsymbol{h}+\alpha\cdot\boldsymbol{x}(W_{\text{down}}\odot m_{\text{down}})(W_{\text{up}}\odot m_{\text{up}}) \qquad (10)

The scaling factor $\alpha$ is an important hyperparameter for LoRA, and Hu et al. (2022) use different values of $\alpha$ for different datasets in their released code. In ProPetlLoRA, applying the pruning masks $m$ reduces the norm of the output of the LoRA module. We find that applying a larger $\alpha$ in ProPetlLoRA than that used in LoRA remedies the issue of the reduced feature norm and results in better performance.
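Analogously, a sketch of the masked LoRA update in Equation 10 with the enlarged scaling factor; the function is a hypothetical illustration rather than the released implementation.

import torch

def propetl_lora_delta(x, W_down, W_up, m_down, m_up, alpha=48.0):
    # x: (batch, seq, d); W_down: (d, r); W_up: (r, d); alpha is scaled up
    # (32 -> 48 in our setup) to offset the reduced norm caused by pruning
    return alpha * (x @ (W_down * m_down)) @ (W_up * m_up)

# Applied to the query/value projections: h <- h + propetl_lora_delta(x, ...)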

A.3 ProPetlPrefix

Li and Liang (2021) propose to insert tunable matrices, which they call prefixes, into the key-value pair of the Transformers’ attention modules:

H=\operatorname{Attention}(Q,[P_{k};K],[P_{v};V])=\operatorname{softmax}\left(\frac{Q[P_{k};K]^{T}}{\sqrt{d}}\right)[P_{v};V] \qquad (11)

where $[\cdot;\cdot]$ denotes matrix concatenation. They also find that directly updating the $P$ parameters leads to unstable training and a slight drop in performance. To ameliorate this problem, they propose to reparametrize the matrix:

P=P^{\prime}W \qquad (12)

where $P^{\prime}$ is a smaller learnable matrix and $W$ is the weight of an up-projecting feed-forward neural network. In ProPetlPrefix, we apply our binary pruning masks to the reparametrized prefixes:

H=\operatorname{Attention}(Q,[P_{k}\odot m_{k};K],[P_{v}\odot m_{v};V]) \qquad (13)

Note that, different from adapter and LoRA tuning, the pruning masks in ProPetlPrefix do not directly operate on the parameters of the network. Instead, the masks are applied to $P_{k}$ and $P_{v}$, which are output activations that depend collectively on $W$ and $P^{\prime}$. Thus, it might be hard for the mask training process to identify good structures from $P_{k}$ and $P_{v}$, which potentially explains the sub-optimal results of ProPetlPrefix in Table 1. To verify this claim, we further compare the 3 PETL modules against their counterparts pruned with a sparsity of 0.5 in Table 5. We find that both adapter and LoRA tuning improve their performance when 50% of the parameters are pruned, which substantiates the idea that the structural information of sub-networks is important. However, for prefix-tuning, which uses reparameterization, performance drops when we learn pruning masks on top of it. This shows that the masks fail to locate suitable structures for prefix-tuning.

Module Vanilla Prune 50%
Prefix (l=64) 84.97 84.50
LoRA (bn=32) 84.96 85.44
Adapter (bn=64) 85.65 86.56
Table 5: Average GLUE score of conventional PETL methods with and without 50% pruning, based on RoBERTaBASE under single-task learning.

Appendix B Parameter Efficiency in ProPetl

B.1 Overview

Learning Setup Bit-Level Storage of ProPetl
Single-task Learning $p+pL/32$
Multi-task Learning $p+p(L+T)/32$
Table 6: Storage calculation.

We present an approximate calculation of the bits (space) required by ProPetl during inference in Table 6, in which we assume that a single shared PETL network consumes $p$ bits of BLS and the model has $L$ layers. In addition, the storage space required for a binary mask makes up $1/32$ ($\approx 0.031$) of that of the 32-bit PETL module. Depending on the specific PETL module used, the calculations may vary slightly, and we detail them in the following sections.

B.2 ProPetlAdapter

Given the number of layers ($L$) and hidden dimension ($d$) of the pre-trained language model, and the bottleneck dimension ($bn$) of the adapter module, the BLS consumed by ProPetlAdapter is:

\underbrace{32\cdot(2\cdot bn\cdot d+bn+d)}_{\text{Prototype adapter}}+\underbrace{2\cdot bn\cdot d\cdot L}_{\text{Mask}} \qquad (14)

Note that our method does not apply any masks to the bias terms in the prototype adapter. Since ProPetl reuses the prototype network, we do not consider the pruning ratio $k\%$ when calculating the storage. Here we also discuss how to calculate the storage cost of parameters used under the only mask setting in Table 4, in which we do not share the PETL module across layers but only learn masks to prune it. In that case, we do not count the pruned parameters because they will never be reused. Therefore, the formula for the BLS required by only mask is:

32\cdot(2\cdot k\%\cdot bn\cdot d+bn+d)\cdot L \qquad (15)
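As a worked example of Equation 14, the snippet below plugs in the RoBERTaBASE dimensions assumed in our single-task setup (L = 12 layers, d = 768) with the default bottleneck dimension bn = 64.

L, d, bn = 12, 768, 64

prototype_bits = 32 * (2 * bn * d + bn + d)   # shared 32-bit prototype adapter
mask_bits = 2 * bn * d * L                    # one 1-bit mask per layer over both matrices
print(prototype_bits + mask_bits)             # 4352000 bits in total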
Datasets XSum Ro-En CoLA SST-2 MRPC QQP STS-B MNLI QNLI RTE
#Train 204k 997k 8.6k 66k 3.7k 363k 5.7k 392k 104k 2.5k
#Valid 11k 2.6k 0.5k 1k 0.2k 1k 0.8k 1k 1k 0.1k
#Test 11k 3k 0.5k 0.9k 0.2k 40k 0.8k 10k 5k 0.1k
Table 7: Dataset statistics

B.3 ProPetlLoRA

Similarly, given the number of layers ($L$) and hidden dimension ($d$) of the pre-trained language model, and the bottleneck dimension ($bn$) of the LoRA module, the formula for the storage needed by ProPetlLoRA is:

\underbrace{32\cdot 4\cdot bn\cdot d}_{\text{Prototype LoRA}}+\underbrace{4\cdot bn\cdot d\cdot L}_{\text{Mask}} \qquad (16)

Note that there is no bias term in LoRA. Under the only mask setup, given the sparsity ratio $k\%$, the storage of parameters is:

32\cdot 4\cdot k\%\cdot bn\cdot d\cdot L \qquad (17)

B.4 ProPetlPrefix

Given the number of layers ($L$) and hidden dimension ($d$) of the pre-trained language model, and the prefix length ($l$) of the prefix module, the formula to calculate the BLS of ProPetlPrefix is:

\underbrace{32\cdot 2\cdot l\cdot d}_{\text{Prototype prefix}}+\underbrace{2\cdot l\cdot d\cdot L}_{\text{Mask}} \qquad (18)

Under the only mask setting, given the sparsity ratio $k\%$, the approximate required storage is:

32\cdot 2\cdot k\%\cdot l\cdot d\cdot L \qquad (19)

Appendix C Experimental Details

We briefly introduce the benchmark datasets used in this work. Their statistics can be found in Table 7.

C.1 Datasets

GLUE

The General Language Understanding Evaluation (GLUE) benchmark Wang et al. (2018) is widely used to benchmark models’ language understanding ability. It consists of a broad range of sentence-level tasks, including natural language inference (MNLI, QNLI, RTE), sentence similarity (MRPC), paraphrase detection (QQP, STS-B), and single sentence classification (SST-2, CoLA).

XSum

The Extreme Summarization dataset (XSum) Narayan et al. (2018) is designed to evaluate systems’ abstractive summarization ability of a single document. It is collected from the online articles of the British Broadcasting Corporation (BBC). The input is a single document with an average token count of 431.07, and the model is expected to generate a short, single-sentence summarization of this document.

WMT16 Ro-En

WMT16 is the 2016 edition of the Workshop on Machine Translation (WMT) dataset Bojar et al. (2016). This dataset is widely used to evaluate machine translation systems. To benchmark on WMT16 Ro-En, the model is supposed to take in a Romanian sentence and output translation in English.

Model Mask learning rate
ProPetlAdapter 3e-3
ProPetlLoRA 3e-2
ProPetlPrefix 1e-4
Table 8: Mask learning rates on GLUE under single-task settings
Datasets Training time
MNLI 90min
QQP 81min
QNLI 24min
SST-2 15min
CoLA 04min
STS-B 03min
MRPC 02min
RTE 01min
Table 9: The approximate training time in GLUE with ProPetl under single-task training

C.2 Implementation Details

Single-Task Learning on RoBERTaBASE

Our models are implemented based on the AdapterHub package Pfeiffer et al. (2020). We use the datasets Lhoest et al. (2021) library to calculate each sub-task’s scores in GLUE.

The learning rate of the PETL module is set to 1e-4, and we detail the learning rates of the pruning masks in Table 8. We find that it is important to set a higher learning rate for the masks than for the PETL network. The batch size is set to 128 and the weight decay to 0.1 in all experiments with RoBERTa. When conducting experiments on the GLUE datasets, we train for 10 epochs when the dataset is large (MNLI, QNLI, QQP, SST-2), and for 20 epochs on the small datasets (RTE, MRPC, STS-B, CoLA). Table 9 shows the training time taken on a single A100 GPU.

Single and Multi-Task Learning on T5BASE

We implement T5 based on the transformers library Wolf et al. (2020). We use the ROUGE package Lin (2004) for ROUGE-2 calculation and sacrebleu Post (2018) for the BLEU score. Table 10 shows the training time on a single A100 GPU and the detailed hyperparameters used. We mainly follow He et al. (2022a) and Mahabadi et al. (2021) to select the hyperparameters and do not perform an exhaustive search. The same set of hyperparameters is used across the fully fine-tuned, adapter tuning, and ProPetl models. For GLUE, we follow Mahabadi et al. (2021) and sample data from each GLUE sub-task with a temperature of 10. Specifically, a sub-task is sampled with probability proportional to $p_{\tau}^{1/T}$, where $p_{\tau}=\frac{N_{\tau}}{\sum_{i=1}^{T}N_{i}}$ and $T=10$.
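A small sketch of this temperature-scaled sampling, using the approximate GLUE training-set sizes from Table 7 (normalizing the sampling weights to probabilities is our assumption):

sizes = {"CoLA": 8_600, "SST-2": 66_000, "MRPC": 3_700, "QQP": 363_000,
         "STS-B": 5_700, "MNLI": 392_000, "QNLI": 104_000, "RTE": 2_500}
T = 10

total = sum(sizes.values())
p = {task: n / total for task, n in sizes.items()}      # p_tau
w = {task: v ** (1 / T) for task, v in p.items()}       # p_tau^(1/T)
z = sum(w.values())
probs = {task: v / z for task, v in w.items()}          # normalized sampling probabilities
print(probs)  # large tasks are down-weighted, small tasks up-weighted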

Hyperparameters XSum Ro-En GLUE
Batch size 64 64 128
Total steps 100k 60k 20k
Learning rate 1e-4 3e-4 3e-4
Mask’s learning rate 1e-3 3e-3 3e-3
Learning rate schedule linear linear linear
Label smoothing 0.1 0.1 0.1
Weight decay 0.01 0.01 0.0
Sampling temperature n.a. n.a. 10
Training time 1 day 6 hours 2 hours
Table 10: Hyperparameters on XSum, Ro-En, and GLUE when T5BASE is used as the backbone.
Combining Method Avg. GLUE Score
OR 85.97
ADD 85.66
AND 85.57
Table 11: Results of different mask combining methods based on T5 multi-task learning on GLUE. We set the bottleneck dimension to 64.
Method ProPetl Only Mask (Same bn) Only Mask (Same k%) Only Mask (Same k% and bn) Only Share
GLUE
Adapter 64/50% 64/11% 14/50% 64/50% 88/100%
Prefix 64/50% 64/15% 15/50% 64/50% 88/100%
LoRA 32/50% 32/15% 8/50% 32/50% 44/100%
% BLS (0.11%) (0.11%) (0.11%) (0.48%) (0.11%)
Ro-En
Adapter 384/50% 384/8% 55/50% 384/50% 673/100%
% BLS (0.46%) (0.46%) (0.46%) (3.2%) (0.46%)
Table 12: The bottleneck dimension/sparsity ratio used in Table 13.
Method ProPetl Only Mask (Same bn) Only Mask (Same k%) Only Mask (Same k% and bn) Only Share
GLUE
Adapter 86.60 85.40 84.70 86.56 84.32
Prefix 84.53 84.18 84.23 84.50 81.57
LoRA 85.37 83.46 84.75 85.44 82.53
% BLS (0.11%) (0.11%) (0.11%) (0.48%) (0.11%)
Ro-En
Adapter 32.63 31.58 30.68 33.28 31.30
% BLS (0.46%) (0.46%) (0.46%) (3.2%) (0.46%)
Table 13: Ablation studies of the shared network and masks. We report the average score on GLUE based on RoBERTaBASE under single-task learning. For Ro-En, we report the BLEU score with T5BASE as the backbone.

Appendix D Additional Ablation Studies

D.1 Choice of Mask Combining Methods

As shown in Table 11, using the OR logical operation to combine the layer mask and the task mask achieves the best performance. This is intuitive because given a specific task and a Transformer layer, parameters that contain the layer information and parameters that contain the task information should be both used in the forward pass.

Adapter
Bottleneck dimension 12 24 48 96 192 384 768
% Bit-Level Storage 0.207% 0.405% 0.803% 1.597% 3.186% 6.363% 12.72%
ROUGE-2 14.33 15.05 15.98 16.66 16.97 18.29 18.58
ProPetlAdapter
Bottleneck dimension 96 192 384 768 1536 3072 6144
% Bit-Level Storage 0.116% 0.232% 0.464% 0.927% 1.853% 3.706% 7.412%
ROUGE-2 15.52 16.42 16.96 17.91 18.52 18.87 18.96
Table 14: Performance of adapter and ProPetlAdapter on XSum under single-task learning when varying the bottleneck dimension. We report ROUGE-2 and the proportion of bit-level storage.
Adapter
Bottleneck dimension 6 12 24 48 96 192 384 768
% Bit-Level Storage 0.108% 0.207% 0.405% 0.803% 1.597% 3.186% 6.363% 12.72%
BLEU 26.95 29.18 30.20 31.20 32.52 32.83 33.56 33.63
ProPetlAdapter
Bottleneck dimension 96 192 384 768 1536 3072 6144
% Bit-Level Storage 0.116% 0.232% 0.464% 0.927% 1.853% 3.706% 7.412%
BLEU 30.82 32.72 32.63 33.16 33.62 33.79 33.83
Table 15: Performance of adapter and ProPetlAdapter on Ro-En under single-task learning when varying the bottleneck dimension. We report BLEU and the proportion of bit-level storage.
Adapter
Bottleneck dimension 1 2 3 6 12 24 48 64 96 192 384
% BLS per task 0.003% 0.005% 0.007% 0.013% 0.026% 0.051% 0.100% 0.133% 0.200% 0.398% 0.795%
Avg. Score 80.88 82.61 82.89 83.94 84.92 85.00 85.49 85.48 85.78 86.01 85.50
ProPetlAdapter
Bottleneck dimension 1 2 3 6 12 24 48 64 96 192 384 768 1536 3072
% BLS per task 0.0002% 0.0004% 0.0006% 0.001% 0.002% 0.004% 0.008% 0.011% 0.017% 0.033% 0.066% 0.132% 0.265% 0.529%
Avg. Score 79.88 83.45 84.46 85.32 85.77 85.85 85.93 85.97 85.76 86.05 85.73 86.3 86.21 85.71
Table 16: Performance of adapter and ProPetlAdapter on GLUE under multi-task learning when varying the bottleneck dimension. We report the average score and the proportion of the bit-level storage per task.

D.2 Choice of Sharing and Masking

We provide additional details of the experiments in Table 4 in this section. Using the equations shown in Appendix B, we calculate the bottleneck dimension ($bn$) and sparsity ratio ($k\%$) to ensure all the settings have a similar BLS to ProPetl. We also add one more setup, only mask with the same $k\%$ and $bn$ as ProPetl. Note that this new setup results in more storage than the other settings. The hyperparameters are listed in Table 12 and the results are detailed in Table 13.

We find that on the GLUE datasets, ProPetl matches the performance of this new setup, even though the latter uses more than 4 times the bit-level storage (0.11% vs. 0.48%). We believe that, with masks, models with only 0.11% of storage can be enough to achieve good performance, while more storage does not bring significant improvement and may lead to overfitting on simple tasks like GLUE. However, on the more challenging Ro-En dataset, when we keep $k\%$ and $bn$ unchanged and only mask the adapter modules, the model indeed improves its performance by 0.65, but this comes at a cost of around 7x more parameter storage.

Appendix E Additional Results

We additionally present the experimental results from Figure 3 in table format in Tables 14, 15, and 16.