HyperPELT: Unified Parameter-Efficient Language Model Tuning for Both Language and Vision-and-Language Tasks

Zhengkun Zhang Wenya Guo Xiaojun Meng Yasheng Wang Yadao Wang Xin Jiang Qun Liu Zhenglu Yang

Abstract

The workflow of pretraining and fine-tuning has emerged as a popular paradigm for solving various NLP and V&L (Vision-and-Language) downstream tasks. With the capacity of pretrained models growing rapidly, how to perform parameter-efficient fine-tuning has become fairly important for quick transfer learning and deployment. In this paper, we design a novel unified parameter-efficient transfer learning framework that works effectively on both pure language and V&L tasks. In particular, we use a shared hypernetwork that takes trainable hyper-embeddings as input, and outputs weights for fine-tuning different small modules in a pretrained language model, such as tuning the parameters inserted into multi-head attention blocks (i.e., prefix-tuning) and feed-forward blocks (i.e., adapter-tuning). We define a set of embeddings (e.g., layer, block, task and visual embeddings) as the key components to calculate hyper-embeddings, which thus can support both pure language and V&L tasks. Our proposed framework adds fewer trainable parameters in multi-task learning while achieves superior performances and transfer ability compared to state-of-the-art methods. Empirical results on the GLUE benchmark and multiple V&L tasks confirm the effectiveness of our framework on both textual and visual modalities. ¹¹1We will release our code to facilitate future work.

Machine Learning, ICML

1 Introduction

Pretraining and fine-tuning are now the prevalent paradigm in natural language processing, yielding state-of-the-art performances on a variety of downsteam tasks (Devlin et al., 2019). With pre-trained language models (PLMs) growing rapidly in size, it becomes increasingly infeasible to perform conventional fine-tuning on all model parameters, i.e., full fine-tuning. It is even more time & space-consuming for multi-tasking if separate replicas of model parameters are updated and saved per single task.

To mitigate these issues, there has recently been one line of research on Parameter-Efficient Language model Tuning (PELT). A few lightweight transfer learning methods have been proposed and they only update a subset of model parameters while freeze the remaining most parameters (Liu et al., 2021b). Extra trainable task-specific model parameters can also be newly introduced to PLMs, such as the widely used adapter-tuning (Houlsby et al., 2019) and prefix-tuning (Li & Liang, 2021) methods. The former adapter-tuning adds new parameters between transformer layers, while the later prepends tunable prefix vectors to the keys and values of multi-head attention at each layer. Although the number of parameters in the introduced adapter or prefix is much fewer than the original PLM, training these new parameters still requires a lot of resources due to the complex structure of PLMs.

Apart from traditional NLP tasks, fine-tuning language models pretrained on pure text corpora to perform various V&L tasks, has merged as a upward trend. Previous methods (e.g., VL-T5 from Cho et al. (2021)) often concatenate visual patch tokens and textual tokens as input to a pretrained language model (e.g., T5 from Raffel et al. (2020)), and then finetune the whole model on V&L tasks. This tuning towards vision-and-language has achieved a noticeable improvement to V&L tasks (Cho et al., 2021). The key advantage therein is that language models with large capacity and semantic interpretation serve as a cornerstone to benefit visual language alignment and modelling in a wide range of V&L tasks.

Similarly, training all the parameters of PLMs for handling visual input is time-consuming. It is crucial to explore how a small number of trainable parameters can equip a language model with the ability of handling visual input and V&L tasks. Existing methods typically handle the visual input via a prompt-tuning form, and prepend visual patch tokens (i.e., visual prefix of Frozen in Tsimpoukelli et al. (2021)) to the textual sequence. To reduce the trainable parameters, VL-adapter (Sung et al., 2021) adopts the adapter-tuning technique from NLP to the frozen model VL-T5, which can match the performance of full fine-tuning.

Inspired by the recent progress of parameter-efficient tuning, we are motivated to unify a transfer learning framework that supports both language and V&L models in tackling with multitasks. We use a shared hypernetwork (Mahabadi et al., 2021) that is able to take multi-task and multi-modal information as input, and generate weights for tuning different task-specific modules of PLMs in transfer learning. As shown in Figure 1, when finetuning on multitasks, only the shared hypernetwork and its input embedding (namely, hyper-embedding) consisting of layer, block, task and visual embeddings, along with layer normalization, are trained. Such unified parameter-efficient tuning reduces a great number of trainable parameters.

We experiment with two task-specific modules that use the weights output by our hypernetwork. They are respectively multi-head attention modules (Li & Liang, 2021) and task-specific adapter (Houlsby et al., 2019). Different from previous methods using visual input in a prompt-tuning manner, we present a novel perspective of adopting visual input to the above prefix-tuning and adapter-tuning modules. Empirical results on GLUE benchmark and multiple V&L tasks confirm the effectiveness of our unified framework.

In summary, we make the following contributions:

•

We propose an unified parameter-efficient framework for vision and language transfer learning, which supports tuning both language and V&L models on multitasks.
•

We present a novel method of leveraging visual modality as input for a shared hypernetwork, which generates weights for prefix-tuning and adapter-tuning modules.
•

We demonstrate that our framework scales more efficiently than prior work. Empirical results on GLUE benchmark show the effectiveness of our proposed framework in multi-task learning. Empirical results on multiple vision-and-language tasks evidence its feasibility and usefulness in multi-modal transfer learning.
•

We also perform extensive experiments on few-shot domain transfer in pure language and V&L scenarios, and results reveal that the learned shared knowledge across multitasks in our framework is able to positively transfer to unseen domain tasks.

2 Related Work

In this section, we review recent research on parameter-efficient tuning for pure language and V&L tasks, as well as the corresponding work for multi-task learning.

2.1 Parameter-efficient tuning

As recent models grow rapidly in size, how to finetune pretrained models with a small number of trainable parameters becomes more crucial. Existing research (He et al., 2021; Lester et al., 2021; Liu et al., 2021a; Mao et al., 2021) have explored a large amount of methods on parameter-efficient tuning. These methods generally include two categories according to whether new trainable parameters are introduced. One category is that only a subset of model parameters can be updated while freezing the remain (Liu et al., 2021b; Lee et al., 2019). The other is introducing a few task-specific new parameters to different parts of pretrained models, such as before multi-head attention (Li & Liang, 2021), after feedforward layers (Houlsby et al., 2019) or Mixed-and-Match methods (MAM adapter) proposed by He et al. (2021).

2.2 Tuning towards Vision-and-Language

In addition, fine-tuning language models pretrained on pure large text corpora have led to noticeable improvements to V&L tasks. This line of research such as VL-T5 (Cho et al., 2021) and Frozen (Tsimpoukelli et al., 2021) attempts to tune large language models (e.g. T5; GPT-3) to achieve transfer learning for V&L tasks. For example, Frozen aligns the image representation into the word representation space of frozen GPT-3 model which thus is able to generate captions for those images. PICa (Yang et al., 2021) utilizes a pretrained image captioner to convert the image into captions that GPT-3 can understand, and then adapt GPT-3 to solve the VQA tasks in a few-shot manner. Sung et al. (2021) introduces a limited set of new trainable parameters to VL-T5 via a adapter-based method that can match the performance of fine-tuning the entire model.

2.3 Multi-task Learning

Learning a unified model to perform well on multiple different tasks (i.e., multi-task learning) is a challenging problem in both NLP and V&L domains. It has to address many challenges such as catastrophic forgetting, and model overfitting in low-resource tasks while underfitting in high-resource tasks (Aharoni et al., 2019). Radford et al. (2019) highlights the ability of language models to perform a wide range of multitasks in a zero-shot setting. As mentioned above, involving task-specific new parameters such as adapter (Houlsby et al., 2019), can be trained for each task separately while keeping the model fixed. von Oswald et al. (2020) propose a task-conditioned hypernetwork to generate all the weights for the targeted model, while Mahabadi et al. (2021) use a shared hypernetwork to only generate weights for a small number of parameters in adapter modules, to allow the model to adapt to each individual task efficiently.

Our motivation.

Different from mainstream V&L models that append image tokens to the input sequence, we present a novel perspective of merging textual and visual modalities, by using image embedding and task-specific type embedding of multitasks as input to a shared hypernetwork, which generates weights for prefix-tuning and adpater-tuning modules of PLMs. At the same time, we notice a recent paper (He et al., 2022) that was publicly available days ago. This concurrent work shares the similar motivation like us on generating weights for prefix-tuning modules via a hypernetwork, but their method is only targeted at pure language tasks. Our unified framework is able to improve transfer learning in both pure text and vision-to-language multitasks, in a very parameter-efficient manner.

Refer to caption — Figure 1: The model structure of the proposed unified pure language and V&L multi-task framework (left), and illustration of computing the hyper-embedding (right). We use green color to fill the trainable layers and blue color for the frozen ones. And the dashed parts denote the modules for processing visual modality. Conditioned on the input hyper-embedding $I_{\tau}$ or $I_{v}$ , the adapter hypernetwork $h^{t}_{A}$ or $h^{v}_{A}$ parallelly produce the weights ( $W_{up}$ , $b_{up}$ , $W_{down}$ , $b_{down}$ ) for task-specific and visual-specific adapter-tuning modules. Similarly, the prefix hypernetwork $h^{t}_{P}$ or $h^{v}_{P}$ produce the weights ( $P_{k}$ , $P_{v}$ ) as the prefix-tuning vectors in multi-head attention modules. During training, we only update layer normalization in T5, the hypernetworks, and the used input embeddings (i.e., Layer, Block, Task and Visual).

3 Preliminaries

3.1 Pretrained Language Models

All of our models are built on top of the state-of-the-art language model, T5 (Raffel et al., 2020), consisting of an encoder-decoder Transformer (Vaswani et al., 2017) with minor modifications. It frames language tasks as sequence-to-sequence generation, and is trained simultaneously on multiple task datasets. This large-scale T5 model achieves state-of-the-art performances across a diverse set of tasks. We use the T5 backbone as it enables training a universal model that interfaces with many downstream language tasks.

3.2 Mutli-task Learning Problem formulation

Our paper targets at a general multi-task learning problem, where we are given the data from a set of tasks $\{\mathcal{D}_{\tau}\}^{T}_{\tau=1}$ . $T$ is the total number of tasks and $\mathcal{D}_{\tau}=\{(x^{i}_{\tau},y^{i}_{\tau})\}^{N_{\tau}}_{i=1}$ is the training data of the $\tau$ -th task with $N_{\tau}$ samples. We are also given a large-scale pretrained language model, i.e., T5, parameterized by $\theta$ , which generates the output $y^{i}_{\tau}$ for input $x^{i}_{\tau}$ . The standard multi-task finetuning minimizes the following loss on the training set:

\mathcal{L}_{\text{total}}=\sum^{T}_{\tau=1}\sum_{(x^{i}_{\tau},y^{i}_{\tau})\in\mathcal{D}_{\tau}}\mathcal{L}_{\text{task}}(\theta,x^{i}_{\tau},y^{i}_{\tau}),

(1)

where $\mathcal{L}_{\text{task}}$ is the loss function of the tasks that is usually defined as the cross-entropy loss. Our goal is to efficiently finetune the given model in this multi-task learning setting, allowing knowledge sharing across tasks, and at the same time, enabling the model to adapt to each individual task.

3.3 Hypernetworks

The key idea of hypernetwork (Ha et al., 2017; von Oswald et al., 2020) is to learn a parametric task-specific hyper-embedding $\{I_{\tau}\}^{T}_{\tau=1}$ for each task. The hyper-embedding is fed to a hypernetwork $h(.)$ parameterized by $\theta_{h}$ , which generates task-specific parameters $\Delta\theta=h(\theta_{h},I_{\tau})$ for other networks. In the multitask training, the hypernetwork is trained to capture the shared knowledge across tasks. It enables positive transfer among related domains and tasks, and mitigates the catastrophic forgetting problems to some extent. In this paper, we use a simple linear projection layer as the hypernetwork that takes hyper-embeddings $I_{\tau}$ as input, and outputs weights for subset networks of PLMs.

4 Proposed Method

We aim to integrate a unified hypernetwork-based parameter-efficient transfer learning method into a multi-task transformer model. In other word, we insert the parameters generated by the hypernetworks $\Delta\theta$ into the layer and attention blocks of PLMs. During training, we only update the hypernetwork parameters $\theta_{h}$ with hyper-embeddings $\{I_{\tau}\}^{T}_{\tau=1}$ and parameters in layer normalization, while the remaining model parameters in $\theta$ are fixed as in the Equation 2.

\begin{split}\mathcal{L}_{\text{total}}&=\sum^{T}_{\tau=1}\sum_{(x^{i}_{\tau},y^{i}_{\tau})\in\mathcal{D}_{\tau}}\mathcal{L}_{\text{task}}(\Delta\theta,\theta,x^{i}_{\tau},y^{i}_{\tau})\\ &=\sum^{T}_{\tau=1}\sum_{(x^{i}_{\tau},y^{i}_{\tau})\in\mathcal{D}_{\tau}}\mathcal{L}_{\text{task}}(I_{\tau},\theta_{h},\theta,x^{i}_{\tau},y^{i}_{\tau})\end{split},

(2)

We next describe the detailed hyper-embedding $I_{\tau}$ and which modules of PLMs to insert the parameters $\Delta\theta$ generated by hypernetworks, to achieve PELT. In our methods, the hyper-embeddings $I_{\tau}$ consists of two: $I^{t}_{\tau}$ is computed by a task projector network $h^{t}_{I}(.)$ for each individual task, and $I^{v}_{\tau}$ is computed by a visual projector network $h^{v}_{I}(.)$ for each image. We will mostly introduce the hyper-embedding $I^{t}_{\tau}$ , and $I^{v}_{\tau}$ is used in a similar parallel manner.

4.1 Hyper-Embeddings for PELT

Considering a flexible parameterization of task-specific parameters for $L$ layers of transformer, we introduce a set of layer id embeddings $\mathcal{I}=\{l_{i}\}^{L}_{i=1}$ , and block type embeddings $\mathcal{B}=\{b_{j}\}^{5}_{j=1}$ , which specify the position where the parameters $\Delta\theta$ are inserted to. The five block positions to insert $\Delta\theta$ include the self-attention, feed-forward block of an encoder, and the self-attention, cross-attention, and feed-forward block in a decoder. The layer id and block type embeddings together determine the exact targeted transformer block, and are also used as input to the hypernetwork.

Then, we compute a hyper-embedding $I^{t}_{\tau}\in\mathbb{R}^{d_{I}}$ for each individual task via a task projector network $h^{t}_{I}(.)$ . It is a multi-layer perceptron consisting of two feed-forward layers and a ReLU non-linearity:

I^{t}_{\tau}=h^{t}_{I}(z_{\tau},l_{i},b_{j}),

(3)

where the task projector network $h^{t}_{I}(.)$ learns a suitable compressed hyper-embedding from a concatenation of task embeddings $z_{\tau}\in\mathbb{R}^{d_{t}}$ , layer id embeddings $l_{i}\in\mathbb{R}^{d_{t}}$ , and block type embeddings $b_{j}\in\mathbb{R}^{d_{t}}$ . In this way, this hypernetwork is able to produce distinct weights for tuning each task, and each transformer block at each layer.

4.2 HyperPrefix: Incorporate with Prefix-tuning

Prefix-tuning (Li & Liang, 2021) prepends a number of task-specific trainable prefix vectors to the parameters of multi-head attention (i.e., keys and values) at each transformer layer. In the original implementation, the prefix vectors of each attention block are reparameterized by a two-layer feed-forward network:

P=W_{\text{up}}\phi(W_{\text{down}}E),

(4)

where $P\in\mathbb{R}^{d\times N}$ , $W_{\text{down}}\in\mathbb{R}^{d_{\text{mid}}\times d}$ , $W_{\text{up}}\in\mathbb{R}^{d\times L\times 2\times d_{\text{mid}}}$ , $N$ denotes the prefix length, $d$ denotes the dimension size of PLMs and $E\in\mathbb{R}^{d\times N}$ is a randomly initialized embedding matrix. For prefix-tuning, the block type $\mathcal{B}$ has three: the self-attention block of encoder, the self-attention block and cross-attention block of decoder. Therefore, this original method indeed introduces quite a large number of parameters in the finetuning phase, due to the separate embedding matrix $E$ and feed-forward networks at each block.

In our method, we extend the dimension for different embeddings to match the prefix length $N$ , i.e., $z_{\tau}\in\mathbb{R}^{N\times d_{t}}$ , $l_{i}\in\mathbb{R}^{N\times d_{t}}$ , $b_{j}\in\mathbb{R}^{N\times d_{t}}$ , and then compute the hyper-embedding $I^{t}_{\tau}\in\mathbb{R}^{N\times d_{I}}$ . We finally employ a hypernetwork $h^{t}_{P}(.)$ with trainable parameters $\theta_{h^{t}_{P}}$ , to project $I^{t}_{\tau}$ to prefix vectors $P^{t}\in\mathbb{R}^{N\times d}$ , named HyperPrefix:

P^{t}=h^{t}_{P}(\theta_{h^{t}_{P}},I^{t}_{\tau}).

(5)

In this way, since the hyper-embedding has already perceived information of which block at which layer, the number of hypernetwork parameters, i.e., $\theta_{h^{t}_{P}}$ , can be largely reduced to $\mathbb{R}^{d_{I}\times d}$ . We further explain the relationship between prefix-tuning and hypernetwork in the Appendix A.

4.3 HyperPELT: Incorporate with Adapter

To further capture knowledge across tasks and transfer to others, we follow the adapter-tuning (He et al., 2021), and input the hyper-embedding to a hypernetwork for generating the weights in adapters. As depicted in Figure 1, we introduce a hypernetwork-based adapter layer with a trainable scaled parameter $\lambda$ , which is inserted parallelly with feed-forward blocks, named HyperPELT. This task-conditioned adapter layer $A_{\tau}$ consists of a down-projection, $W^{\tau}_{\text{down}}\in\mathbb{R}^{d_{\text{mid}}\times d}$ , GeLU non-linearity, and up-projection $W^{\tau}_{\text{up}}\in\mathbb{R}^{d\times d_{\text{mid}}}$ , where $h$ is the input dimension, $d_{\text{mid}}$ is the bottleneck dimension for the adapter layer, and $x$ is the input hidden state, mathematically defined as:

A^{t}_{\tau}(x)=\lambda\ \text{LN}(W^{\tau}_{\text{up}}\mathrm{GeLU}(W^{\tau}_{\text{down}}x))+x,

(6)

We generate adapter weights $(W^{\tau}_{\text{up}},W^{\tau}_{\text{down}})$ through a hypernetwork $h^{t}_{A}(.)$ :

(W^{\tau}_{\text{up}},W^{\tau}_{\text{down}}):=h^{t}_{A}(\theta_{h^{t}_{A}},I^{t}_{\tau}).

(7)

Note that in Section 4.2, we use the prefix length $N$ as the dimension for hyper-embeddings. We utilize an adaptive pooling operation on hyper-embeddings to adjust the dimension for adapter hypernetwork. Note that due to we extend the dimension of the components of hyper-embeddings in the last section, we utilize an adaptive pooling operation for hyper-embeddings to adjust the dimension for adapter hypernetwork.

4.4 VL-HyperPELT: Incorporate with Visual Modality

Transferring pretrained language models to V&L tasks (Cho et al., 2021), with updating the whole parameters is inefficient. Thus, it motivates us to employ parameter-efficient tuning approaches to extend language models for V&L tasks. We follow Cho et al. (2021) to unify V&L tasks as a text generation problem. As illustrated in Fig. 1, we use CLIP (Radford et al., 2021) (parameterized by $\theta^{v}$ ) with a trainable visual projection layer (parameterized by $\theta^{v\rightarrow I_{v}}$ ), which projects the visual representation to the identical dimension of task embedding, i.e., $z_{\tau}\in\mathbb{R}^{N\times d_{t}}$ . Then we feed this visual representation $v$ to a visual projector network $h^{v}_{I}(.)$ , whose architecture is same to the mentioned task projector network $h^{t}_{I}(.)$ . In this way, we learn the visual hyper-embeddings $I^{v}_{\tau}\in\mathbb{R}^{d_{I}}$ . Finally, taking this visual-specific hyper-embeddings as input, we use a visual hypernetwork $h^{v}(.)$ to generate new visual-specific parameters to different modules in PLMs.

Similar to the Section 4.2 & 4.3, the incorporation of visual-specific parameters to PLMs are the same as task-specific ones, e.g., used as prefix vectors via a prefix hypernetwork $h^{v}_{P}(.)$ and adapter weights via an adapter hypernetwork $h^{v}_{A}(.)$ . We name it VL-HyperPELT. Note that $I^{v}_{\tau}$ is also calculated as the Equation 3 via replacing task embedding $z_{\tau}$ by visual representation $v$ . As illustrated in Fig. 1, the hyper-embeddings $I^{t}$ and $I^{v}$ separately produce weights for parallel adapters. In addition, their produced prefix vectors are appended together for multi-head attention modules.

Model

#Total

params

#Trained

params/task

CoLA

SST-2

MRPC

QQP

STS-B

MNLI

QNLI

RTE

Avg

Single-Task Training

{}_{\text{BASE}}

\dagger

8.0

\times

100%

54.85

92.19

88.18/91.61

91.46/88.61

89.55/89.41

86.49

91.60

67.39

84.67

Adapters

\dagger

1+8

\times

0.01

0.87%

59.49

93.46

88.18/91.55

90.94/88.01

87.44/87.18

86.38

92.26

68.84

84.88

Multi-Task Training

{}_{\text{BASE}}

\dagger

1.0

\times

12.5%

54.88

92.54

90.15/93.01

91.13/88.07

88.84/88.53

85.66

92.04

75.36

85.47

Adapters

\dagger

1.07

\times

0.82%

61.53

93.00

90.15/92.91

90.47/87.26

89.86/89.44

86.09

93.17

70.29

85.83

HYPERFORMER++

\dagger

1.02

\times

0.29%

63.73

94.03

89.66/92.63

90.28/87.20

90.00/89.66

85.74

93.02

75.36

86.48

{}_{\text{BASE}}

\spadesuit

1.0

\times

12.5%

55.58

93.00

87.50/90.97

88.96/85.54

88.96/88.40

83.62

91.47

81.25

85.02

Prefix-tuning

\clubsuit

1.14

\times

1.72%

56.67

93.92

89.42/92.57

90.59/87.37

89.49/89.34

85.23

93.17

79.17

86.09

MAMAdapters

\clubsuit

1.15

\times

2.96%

56.53

93.58

91.35/93.96

90.58/87.53

88.89/88.76

85.98

92.77

81.94

86.53

HYPERFORMER++

\spadesuit

1.02

\times

0.23%

58.02

93.69

91.34/93.84

90.42/87.25

88.94/88.68

85.41

92.51

83.33

86.68

HyperPrefix

1.01

\times

0.15%

63.01

93.46

90.38/93.10

90.49/87.27

89.88/89.71

85.21

92.88

77.78

86.65

HyperPELT

1.02

\times

0.24%

65.96

93.23

89.42/92.31

90.48/87.54

89.15/89.07

85.35

92.79

82.64

87.09

Table 1: Performance of all models on the GLUE tasks. For each method, we report the total number of parameters across all tasks and the number of parameters that are trained for each task as a multiple and proportion respectively of the baseline single-task T5 model. For MNLI, we report accuracy on the matched validation set. For MRPC and QQP, we report accuracy and F1. For STS-B, we report Pearson and Spearman correlation coefficients. For CoLA, we report Matthews correlation. For all other tasks, we report accuracy.

\dagger

: Results from the implementation of Mahabadi et al. (2021),

\spadesuit

: Our re-implementation of (Mahabadi et al., 2021),

\clubsuit

: We implement the methods of Li & Liang (2021) and He et al. (2021) on top of T5.

Dataset	# Samples	T5 ${}_{\text{BASE}}\dagger$	HYPERFORMER++ $\dagger$	HyperPELT	Prompt-tuning	HyperPELT TaskEmbed
Natural Language Inference
CB	4	57.78 ${}_{\pm\text{10.9}}$	60.74 ${}_{\pm\text{16.66}}$	77.86 ${}_{\pm\text{6.9}}$	81.43 ${}_{\pm\text{5.2}}$	85.71 ${}_{\pm\text{3.2}}$
	16	77.04 ${}_{\pm\text{7.2}}$	76.29 ${}_{\pm\text{4.45}}$	82.14 ${}_{\pm\text{3.9}}$	84.29 ${}_{\pm\text{4.3}}$	86.43 ${}_{\pm\text{1.4}}$
	32	80.00 ${}_{\pm\text{7.6}}$	81.48 ${}_{\pm\text{6.2}}$	83.57 ${}_{\pm\text{3.6}}$	85.71 ${}_{\pm\text{3.2}}$	87.14 ${}_{\pm\text{1.7}}$
	100	85.93 ${}_{\pm\text{5.4}}$	87.41 ${}_{\pm\text{2.96}}$	90.71 ${}_{\pm\text{2.9}}$	87.14 ${}_{\pm\text{1.7}}$	87.86 ${}_{\pm\text{1.7}}$
	250	85.19 ${}_{\pm\text{4.7}}$	89.63 ${}_{\pm\text{4.32}}$	91.43 ${}_{\pm\text{2.8}}$	88.57 ${}_{\pm\text{3.5}}$	89.57 ${}_{\pm\text{1.4}}$
Question Classification
TREC	4	28.11 ${}_{\pm\text{5.9}}$	28.85 ${}_{\pm\text{6.9}}$	24.08 ${}_{\pm\text{1.7}}$	21.76 ${}_{\pm\text{1.7}}$	18.88 ${}_{\pm\text{0.6}}$
	16	40.08 ${}_{\pm\text{12.6}}$	49.40 ${}_{\pm\text{9.5}}$	48.96 ${}_{\pm\text{5.0}}$	34.00 ${}_{\pm\text{13.8}}$	19.20 ${}_{\pm\text{0.7}}$
	32	62.49 ${}_{\pm\text{6.2}}$	68.94 ${}_{\pm\text{7.5}}$	69.28 ${}_{\pm\text{3.8}}$	58.24 ${}_{\pm\text{18.1}}$	20.88 ${}_{\pm\text{1.8}}$
	100	87.79 ${}_{\pm\text{0.7}}$	88.42 ${}_{\pm\text{1.7}}$	88.96 ${}_{\pm\text{1.3}}$	84.96 ${}_{\pm\text{3.8}}$	22.64 ${}_{\pm\text{2.8}}$
	500	93.57 ${}_{\pm\text{1.3}}$	94.78 ${}_{\pm\text{1.4}}$	96.16 ${}_{\pm\text{0.6}}$	87.52 ${}_{\pm\text{3.1}}$	23.76 ${}_{\pm\text{1.6}}$
	1000	95.5 ${}_{\pm\text{0.9}}$	96.72 ${}_{\pm\text{1.3}}$	97.04 ${}_{\pm\text{0.6}}$	88.32 ${}_{\pm\text{3.0}}$	24.40 ${}_{\pm\text{0.8}}$
	2000	96.87 ${}_{\pm\text{1.3}}$	96.92 ${}_{\pm\text{0.9}}$	97.20 ${}_{\pm\text{0.4}}$	91.68 ${}_{\pm\text{3.1}}$	24.64 ${}_{\pm\text{0.8}}$
Question Answering
BoolQ	4	50.49 ${}_{\pm\text{11.1}}$	48.03 ${}_{\pm\text{4.8}}$	71.52 ${}_{\pm\text{0.1}}$	64.69 ${}_{\pm\text{8.8}}$	69.28 ${}_{\pm\text{4.1}}$
	16	56.50 ${}_{\pm\text{7.1}}$	50.21 ${}_{\pm\text{7.9}}$	73.66 ${}_{\pm\text{1.8}}$	73.44 ${}_{\pm\text{1.8}}$	71.52 ${}_{\pm\text{0.2}}$
	32	58.43 ${}_{\pm\text{4.9}}$	58.37 ${}_{\pm\text{3.7}}$	73.80 ${}_{\pm\text{0.6}}$	75.16 ${}_{\pm\text{0.8}}$	72.64 ${}_{\pm\text{0.4}}$
	100	60.10 ${}_{\pm\text{2.4}}$	62.03 ${}_{\pm\text{2.0}}$	75.70 ${}_{\pm\text{0.8}}$	75.57 ${}_{\pm\text{1.0}}$	73.59 ${}_{\pm\text{0.4}}$
	500	66.49 ${}_{\pm\text{1.2}}$	70.04 ${}_{\pm\text{1.4}}$	76.37 ${}_{\pm\text{0.5}}$	75.82 ${}_{\pm\text{1.1}}$	73.73 ${}_{\pm\text{0.7}}$
	1000	69.01 ${}_{\pm\text{1.1}}$	72.35 ${}_{\pm\text{1.7}}$	76.71 ${}_{\pm\text{1.6}}$	76.91 ${}_{\pm\text{0.3}}$	74.33 ${}_{\pm\text{0.6}}$
	2000	71.58 ${}_{\pm\text{0.8}}$	74.94 ${}_{\pm\text{0.6}}$	77.81 ${}_{\pm\text{0.8}}$	77.11 ${}_{\pm\text{0.6}}$	74.53 ${}_{\pm\text{0.1}}$
Sentiment Analysis
IMDB	4	77.23 ${}_{\pm\text{3.0}}$	81.77 ${}_{\pm\text{1.8}}$	80.68 ${}_{\pm\text{0.4}}$	50.68 ${}_{\pm\text{0.9}}$	56.41 ${}_{\pm\text{0.3}}$
	16	82.74 ${}_{\pm\text{1.7}}$	84.06 ${}_{\pm\text{0.7}}$	84.86 ${}_{\pm\text{3.8}}$	54.83 ${}_{\pm\text{1.8}}$	59.57 ${}_{\pm\text{1.3}}$
	32	83.42 ${}_{\pm\text{1.0}}$	84.64 ${}_{\pm\text{0.4}}$	85.40 ${}_{\pm\text{0.4}}$	58.70 ${}_{\pm\text{7.9}}$	62.53 ${}_{\pm\text{1.5}}$
	100	84.58 ${}_{\pm\text{0.6}}$	84.74 ${}_{\pm\text{0.4}}$	90.50 ${}_{\pm\text{0.4}}$	76.31 ${}_{\pm\text{11.1}}$	86.68 ${}_{\pm\text{0.2}}$
	500	84.99 ${}_{\pm\text{0.3}}$	86.00 ${}_{\pm\text{0.2}}$	91.78 ${}_{\pm\text{0.3}}$	89.36 ${}_{\pm\text{0.4}}$	89.41 ${}_{\pm\text{0.3}}$
	1000	85.50 ${}_{\pm\text{0.1}}$	86.37 ${}_{\pm\text{0.4}}$	92.19 ${}_{\pm\text{0.3}}$	89.68 ${}_{\pm\text{0.7}}$	90.15 ${}_{\pm\text{0.2}}$
	2000	86.01 ${}_{\pm\text{0.2}}$	86.60 ${}_{\pm\text{0.1}}$	92.54 ${}_{\pm\text{0.1}}$	90.08 ${}_{\pm\text{0.7}}$	91.78 ${}_{\pm\text{0.1}}$
Paraphrase Detection
PAWS	4	53.89 ${}_{\pm\text{3.6}}$	55.58 ${}_{\pm\text{7.5}}$	59.58 ${}_{\pm\text{0.4}}$	54.54 ${}_{\pm\text{4.1}}$	56.47 ${}_{\pm\text{0.2}}$
	16	54.18 ${}_{\pm\text{1.0}}$	72.71 ${}_{\pm\text{1.1}}$	73.66 ${}_{\pm\text{4.4}}$	55.49 ${}_{\pm\text{0.6}}$	56.92 ${}_{\pm\text{0.4}}$
	32	55.23 ${}_{\pm\text{3.2}}$	73.39 ${}_{\pm\text{2.1}}$	74.41 ${}_{\pm\text{1.2}}$	55.93 ${}_{\pm\text{0.8}}$	58.99 ${}_{\pm\text{0.3}}$
	100	71.51 ${}_{\pm\text{2.4}}$	78.24 ${}_{\pm\text{2.1}}$	79.84 ${}_{\pm\text{1.3}}$	57.56 ${}_{\pm\text{0.9}}$	59.35 ${}_{\pm\text{1.4}}$
	500	82.81 ${}_{\pm\text{1.0}}$	86.30 ${}_{\pm\text{1.1}}$	87.31 ${}_{\pm\text{1.1}}$	63.94 ${}_{\pm\text{0.9}}$	64.71 ${}_{\pm\text{0.3}}$
	1000	85.67 ${}_{\pm\text{0.7}}$	89.12 ${}_{\pm\text{0.5}}$	90.42 ${}_{\pm\text{0.3}}$	65.55 ${}_{\pm\text{7.6}}$	66.80 ${}_{\pm\text{0.4}}$
	2000	88.33 ${}_{\pm\text{0.6}}$	90.87 ${}_{\pm\text{0.3}}$	91.79 ${}_{\pm\text{0.6}}$	67.17 ${}_{\pm\text{1.4}}$	68.67 ${}_{\pm\text{0.6}}$

Table 2: Few-shot domain transfer results of five different tasks averaged across 5 seeds. We compute accuracy for all tasks and datasets.

\dagger

: Results from the paper of Mahabadi et al. (2021). HyperPELT and HyperPELT TaskEmbed are respectively fine-tuning hypernetworks with all hyper-embeddings and only task embedding in the few-shot learning.

	Trained Params (%)	VQAv2	VQA Karpathy test			GQA	CoCo Caption
	Trained Params (%)	test-std	in-domain	out-domain	overall	test-dev	B@4	M	C	S
Single-Task Training
VL-T5 $\dagger$	100%	70.3	71.4	13.1	67.9	60.0	34.6	28.8	116.1	21.9
Multi-Task Training
VL-T5 $\dagger$	100%	-	-	-	67.2	58.9	-	-	110.8	-
CLIP-T5 $\dagger$	100%	-	-	-	67.3	56.5	-	-	113.1	-
VL-Adapter $\dagger$	7.98%	-	-	-	67.6	56.2	-	-	111.8	-
CLIP-T5 $\spadesuit$	100%	69.8	70.8	17.4	66.8	59.6	32.4	27.1	108.5	20.4
VL-Adapter $\spadesuit$	7.16%	69.4	70.0	16.4	65.9	57.6	31.4	27.2	105.6	20.1
VL-HyperPELT	6.62%	69.6	70.3	16.8	66.3	57.9	32.1	27.0	108.2	20.1

Table 3: Experimental results on popular Vision-Language banchmarks. We report vqa-score for VQA, gqa-score for GQA and various metrics for image captioning (B@4: BLEU@4, M: METEOR, C: CIDEr, S: SPICE).

\dagger

: Results from the paper of Cho et al. (2021); Sung et al. (2021),

\spadesuit

: Our re-implementation of Sung et al. (2021). Following Cho et al. (2021), we use VQA Karpathy split, which splits the VQA dataset into 605,102 / 26,729 / 26,280 image and question pairs separately as the train/validation/test set to evaluate VQA tasks in a generative manner.

Dataset	# Samples	CLIP-T5	VL-Adapter	VL-HyperPELT
Knowledge-based VQA
OKVQA	4	32.65 ${}_{\pm\text{0.4}}$	31.83 ${}_{\pm\text{1.1}}$	33.25 ${}_{\pm\text{0.5}}$
	16	33.68 ${}_{\pm\text{1.0}}$	31.86 ${}_{\pm\text{0.6}}$	34.72 ${}_{\pm\text{0.6}}$
	32	33.87 ${}_{\pm\text{1.1}}$	32.07 ${}_{\pm\text{0.9}}$	34.86 ${}_{\pm\text{0.3}}$
	100	34.27 ${}_{\pm\text{0.1}}$	33.03 ${}_{\pm\text{1.6}}$	34.99 ${}_{\pm\text{0.9}}$
	500	34.43 ${}_{\pm\text{2.1}}$	33.35 ${}_{\pm\text{0.8}}$	35.56 ${}_{\pm\text{1.1}}$
	1000	34.59 ${}_{\pm\text{1.1}}$	34.57 ${}_{\pm\text{0.2}}$	35.72 ${}_{\pm\text{0.4}}$
	2000	34.62 ${}_{\pm\text{0.9}}$	34.87 ${}_{\pm\text{0.6}}$	35.86 ${}_{\pm\text{0.6}}$
Visual Entailment
SNLI-VE	4	34.77 ${}_{\pm\text{2.2}}$	38.75 ${}_{\pm\text{4.3}}$	39.94 ${}_{\pm\text{0.9}}$
	16	38.55 ${}_{\pm\text{2.6}}$	46.67 ${}_{\pm\text{1.7}}$	47.86 ${}_{\pm\text{1.2}}$
	32	39.89 ${}_{\pm\text{3.7}}$	51.69 ${}_{\pm\text{2.4}}$	53.40 ${}_{\pm\text{0.9}}$
	100	49.97 ${}_{\pm\text{1.8}}$	55.05 ${}_{\pm\text{1.9}}$	57.69 ${}_{\pm\text{0.8}}$
	500	59.01 ${}_{\pm\text{1.6}}$	60.58 ${}_{\pm\text{1.1}}$	61.97 ${}_{\pm\text{1.1}}$
	1000	60.49 ${}_{\pm\text{0.6}}$	62.83 ${}_{\pm\text{0.4}}$	63.01 ${}_{\pm\text{0.6}}$
	2000	62.29 ${}_{\pm\text{0.9}}$	64.22 ${}_{\pm\text{1.1}}$	65.67 ${}_{\pm\text{0.7}}$

Table 4: Few-shot domain transfer results of two different V&L tasks averaged across 5 seeds. We report the vqa-score on OKVQA validation split, and the accuracy on SNLI-VE test-P split.

5 Experimental Setup

5.1 Datasets

Our framework is evaluated on the GLUE benchmark (Wang et al., 2019b) in terms of natural language understanding. This benchmark covers multiple tasks of paraphrase detection (MRPC, QQP), sentiment classification (SST-2), natural language inference (MNLI, RTE, QNLI), and linguistic acceptability (CoLA). The original test sets are not publicly available, and following Zhang et al. (2021), for datasets fewer than 10K samples (RTE, MRPC, STS-B, CoLA), we split the original validation set into two halves, one for validation and the other for testing. For other larger datasets, we randomly split 1K samples from the training set as our validation data and test on the original validation set.

In addition, we evaluate the few-shot domain transfer performance on four tasks and datasets: 1) the natural language inference (NLI) datasets CB and 2) the question answering (QA) dataset BoolQ from SuperGLUE (Wang et al., 2019a); 3) the sentiment analysis datasets IMDB (Maas et al., 2011); and 4) the paraphrase detection dataset PAWS (Zhang et al., 2019). For CB and BoolQ, since the test set is not available, we split the validation set into two halves, one for validation and the other for testing. For IMDB, since the validation set is not available, we similarly split the test set to form validation. For PAWS, we report on the original test set.

To evaluate our framework on V&L tasks, we experiment on four datasets COCO (Lin et al., 2014), VQA (Goyal et al., 2017), VG-QA (Krishna et al., 2017) and GQA (Hudson & Manning, 2019). We further evaluate our framework on three datasets for multi-modal few-shot transfer learning: OKVQA (Marino et al., 2019); SNLI-VE (Xie et al., 2018).

5.2 Implementation Details

Our models are built on T5 ${}_{\text{BASE}}$ (Raffel et al., 2020) ²²2https://huggingface.co/t5-base and use the same tokenizer of T5 to tokenize text inputs. We set $N=49$ , $d=d_{t}=768$ , $d^{\text{mid}}_{I}=128$ , $d_{I}=64$ for all the experiments. Following the training strategies from Raffel et al. (2020), we fine-tune all models with a constant learning rate of 0.001, use $2^{18}=262144$ steps in all experiments with batch size of 128 and sample tasks via the conventional temperature-based sampler, with temperature $T=2$ , i.e., sample corresponding task proportional to $p^{1/T}_{\tau}$ , where $p_{\tau}=\frac{N_{\tau}}{\sum^{T}_{i=1}N_{\tau}}$ and $N_{\tau}$ is the number of training samples for the $\tau$ -th task. We did not experiment with other complex sampling strategies or tuning of $T$ . For the experiments under multi-task training settings, we save a checkpoint every 1000 steps and report results on a single checkpoint with the highest average validation performance across all tasks.

For evaluating our framework on vision-language scenarios, we follow Cho et al. (2021) to convert V&L tasks to a text generation format. We use ResNet101 as our vision encoder, and initialize it with CLIP (Radford et al., 2021) ³³3https://github.com/openai/CLIP pretrained weights. Input images are resized to 224 $\times$ 224 for the memory efficiency. We extract the 7 $\times$ 7 grid features produced by the last convolutional layer. The percentage of updated parameters is also reported as one metric for approach efficiency, and we do not take visual encoder into computation since it is frozen in our experiments. We count the number of tunable parameters and list the input-output formats of each task in the Appendix B and C.

6 Results and Analysis

We design a series of experiments for pure language and V&L tasks on both multi-tasking and few-shot scenarios to verify the effectiveness of our proposed framework compared to existing methods.

6.1 Results on the GLUE Benchmark

We conduct experiments on GLUE for both single- and multi-task settings. As shown in Table 1, we also report the total number of parameters and trainable parameters for all models. Our methods are built on T5 ${}_{\text{BASE}}$ which contains 12 layers and 222M parameters. For adapter-related methods, we experiment with reduction factors of $r=32$ (Mahabadi et al., 2021), where $r=\frac{d}{d_{\text{mid}}}$ . We follow the original T5 implementation (Raffel et al., 2020), which is slightly different from HYPERFORMER++ (Mahabadi et al., 2021). In our re-implementation of T5 ${}_{\text{BASE}}$ and HYPERFORMER++, we prepend task-prefix tokens to the textual input sequence as the original T5 does. We also experiment with generating weights for layer normalization via a hypernetwork. However, we find that it makes no difference on performances.

For Prefix-tuning (Li & Liang, 2021) and MAMAdapter (He et al., 2021), their original implementation is single-task training on BART (Lewis et al., 2020). To make a fair comparison to other baselines, we apply their methods to T5 in a multi-task training setting. ⁴⁴4For adapting Prefix-tuning from BART to T5, a noteworthy point is that since they use different position embedding, i.e., absolute position embedding for BART and relative position embedding for T5, it is necessary to manually concatenate all-zero vectors on the relative position bias of each layer in T5. For each model, we share the parameters of both prefix vectors and adapter weights across multitasks.

Overall, our HyperPELT method obtains the best performance with less trainable parameters. Compared to the single-task Adapters that finetunes all the introduced parameters in adapters, our method yields a significant improvement by 2.21% with much fewer trainable parameters, which illustrates the effectiveness of our proposed multi-task training framework.

In multi-task training, the proposed hypernetwork-based prefix-tuning strategy, e.g., HyperPrefix, decreases the number of trainable parameters (e.g., 1.01 $\times$ of HyperPrefix vs. 1.14 $\times$ of Prefix-tuning), while achieves a better performance at the same time (e.g., 86.65% of HyperPrefix vs. 86.09% of Prefix-tuning). It is noticeable that the number of trainable parameters per task is 11 $\times$ fewer than Prefix-tuning.

HyperPELT obtains a superior performance over HyperPrefix, and the main reason lies in that we further combine the hypernetwork-based adapters and add them to the feedforward layers in a parallel manner. In this way, the average performance is further enhanced (+0.44%) by involving a small number of parameters (0.09% parameters per task). The comparison to MAMAdapter shows that using hypernetwork to tune each transformer block and learn the shared knowledge across multitasks leads to an improvement.

6.2 Few-shot Domain Transfer

We use the above models trained on GLUE as reported in Table 1, and evaluate them on the test set of five different tasks after being few-shot finetuned on each target training data. Following Mahabadi et al. (2021), we use the task embedding respectively trained on the most similar GLUE task for initialization, i.e., MNLI for CB, QNLI for QA, SST-2 for sentiment analysis, and QQP for paraphrase detection. As suggested by Perez et al. (2021) and Zhao & Schütze (2021), we randomly select the same number of samples from training and validation set, making it a reasonable few-shot scenario. Checkpoints are selected via early stopping on the selected validation set, and the stopping metric is the default metric for each task.

In the first three columns of Table 2, we show the results of full fine-tuning of T5 ${}_{\text{BASE}}$ , HYPERFORMER++ (finetuning both hypernetworks and task embeddings) and our proposed HyperPELT. Overall, our method achieves the best performance in the few-shot tasks.

For the tasks of CB and BoolQ from SuperGLUE, even though the backbone T5 was previously trained on the train sets of these two, the performance of all methods differs a lot. The two baselines still do not work with very few samples, like 4 and 16 samples, while our method is significantly better than them. Therefore, we assume that the two baselines suffer from catastrophic forgetting problems to some degree during multi-task training. In contrast, our proposed HyperPELT works effectively on these two tasks. We speculate that the reason might be the use of hypernetworks on both prefix-tuning and adapter-tuning modules of transformer. We leave this exploration to our future work.

Besides, in the last two columns of Table 2, we show the results of Prompt-tuning (Lester et al., 2021) and fine-tuning only the task embedding in our HyperPELT. Note that in this comparison, we keep the same trainable parameters between these two methods, i.e., $\mathbb{R}^{N\times d_{t}}$ , where $N$ denotes the prompt length in Prompt-tuning method. Our HyperPELT TaskEmbed mostly achieves a comparable or even better performance than Prompt-tuning except TREC task, where the task format and verbalizer are quite different from previous tasks. In this case, only tuning the very limit parameters of our model with very few samples is unable to quickly transfer. In comparison, HYPERFORMER++ (Mahabadi et al., 2021) claims only fine-tuning the task embedding in their model achieves low performance in the few-shot learning and they have not reported any result. Our framework with only fine-tuning the task embedding, though not perfect on all tasks, may still present a new promising parameter-efficient way towards few-shot learning with an extreme low number of tunable parameters (Sanh et al., 2021).

6.3 Results on the Vision-Language Benchmarks

Next, we move to the experiments of applying the proposed hypernetwork-based parameter-efficient training framework to V&L tasks. We compare to the pre-trained and full fine-tuning VL-T5 (Cho et al., 2021), and other adapter-based methods built on top of T5, i.e., CLIP-T5 and VL-Adapter (Sung et al., 2021) in the multi-task training setting.

The results and the number of trainable parameters are reported in Table 3. To match the same amount of trainable parameters as VL-Adapter, we experiment with the reduction factors $r=16$ , where $r=\frac{d}{d_{mid}}$ . Since the test dataset is slightly different from Sung et al. (2021) and their checkpoint is not avaliable at this time, we re-implement CLIP-T5 and VL-Adpater. Compared to which, our method achieves a comparable performance with fewer number of trainable parameters (e.g., 7.16% of VL-Adapter vs. 6.62% of VL-HyperPELT).

To our best knowledge, we are the first to employ the visual modality to tune the very few parameters of different transformer blocks, instead of normally inserting image patch tokens to the input sequence. Experimental results evidence the effectiveness of our novel approach, thus providing a new perspective on how to extend the multi-modality capability on top of PLMs. It is to use the features from different modalities as the input of hypernetwork to generate parameters for modules in PLMs, instead of as a part of input sequence to accomplish the multimodal tasks. One advantage in our approach is still keeping the original maximum text input length, since no other modalities such as visual and audio features occupy it. It is promisingly useful in document-level and text-heavy tasks such as multimodal summarization (Zhang et al., 2022).

We believe the resulting performance might be even better with a more complex design combination of methods across tuning task-specific and visual-specific parameters in PLMs, but we leave this exploration in future work.

6.4 Multimodal Few-shot Learning

We further use the models trained on V&L tasks as reported in Table 4 and evaluate them on the test set after few-shot fine-tuning on OKVQA (Marino et al., 2019) and SNLI-VE (Xie et al., 2018). For OKVQA, since there is no test set, we split its original validation set into two halves, one for validating and the other for testing. For SNLI-VE, we use its validation set for validating, and test-P set for testing and reporting results. We follow the methods in Section 6.2 to select samples, and report results in Table 4.

Compared with the full parameter fine-tuning, i.e., CLIP-T5, and the other baseline VL-Adapter, our method achieves the best performance with smaller variances in this multimodal few-shot learning setting. We find that VL-Adapter is inferior to CLIP-T5 when with fewer samples (e.g., fewer than 500) on the OKVQA dataset. The reason may be that there exists a lot of out-domain knowledge and complex image content in OKVQA, which makes it more challenging for parameter-efficient VL-Adapter to achieve accurate prediction. In other words, the small number of samples are not enough to train the introduced randomly initialized parameters in VL-Adapter.

However, our approach can still tackle with fewer samples. We use the hypernetwork to generate trainable parameters in adapters and multi-head attention, as well as directly integrating image features into attention modules in the form of prefix tuning vectors. We believe such method, though training less parameters, can still capture knowledge across tasks and transfer them in a few-shot setting. It is also worth noting that for the used five random seeds, the variance of our method is generally smaller than VL-Adapter, which indicates that our method is more robust in this few-shot learning scenario.

7 Discussion and Conclusion

In this paper, we propose a unified parameter-efficient tuning framework for multitasks, particularly on both pure language and vision-and-language (i.e., V&L) tasks. On the one hand, we use a hypernetwork to reduce the scale of trainable parameters of existing adapter-tuning and prefix-tuning modules. On the other hand, for the V&L tasks, we directly integrate the image features into the multi-head attention in the form of prefix vectors, which further reduces the number of trainable parameters for processing visual input. Extensive experiments on pure language and V&L tasks demonstrate the superiority of our proposed framework in both multi-tasking and few-shot settings. In the future, we plan to explore more combination of methods across tuning task-specific and visual-specific parameters for different modules in a pretrained language model.

References

Aharoni et al. (2019) Aharoni, R., Johnson, M., and Firat, O. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3874–3884, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1388. URL https://aclanthology.org/N19-1388.
Cho et al. (2021) Cho, J., Lei, J., Tan, H., and Bansal, M. Unifying vision-and-language tasks via text generation. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 1931–1942. PMLR, 2021. URL http://proceedings.mlr.press/v139/cho21a.html.
Devlin et al. (2019) Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, 2019. doi: 10.18653/v1/n19-1423. URL https://doi.org/10.18653/v1/n19-1423.
Goyal et al. (2017) Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 6325–6334. IEEE Computer Society, 2017. doi: 10.1109/CVPR.2017.670. URL https://doi.org/10.1109/CVPR.2017.670.
Ha et al. (2017) Ha, D., Dai, A. M., and Le, Q. V. Hypernetworks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=rkpACe1lx.
He et al. (2021) He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T., and Neubig, G. Towards a unified view of parameter-efficient transfer learning. CoRR, abs/2110.04366, 2021. URL https://arxiv.org/abs/2110.04366.
He et al. (2022) He, Y., Zheng, H. S., Tay, Y., Gupta, J. P., Du, Y., Aribandi, V., Zhao, Z., Li, Y., Chen, Z., Metzler, D., Cheng, H., and Chi, E. H. Hyperprompt: Prompt-based task-conditioning of transformers. CoRR, abs/2203.00759, 2022. URL https://arxiv.org/abs/2203.00759.
Houlsby et al. (2019) Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for NLP. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 2790–2799. PMLR, 2019. URL http://proceedings.mlr.press/v97/houlsby19a.html.
Hudson & Manning (2019) Hudson, D. A. and Manning, C. D. GQA: A new dataset for real-world visual reasoning and compositional question answering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 6700–6709. Computer Vision Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019.00686. URL http://openaccess.thecvf.com/content_CVPR_2019/html/Hudson_GQA_A_New_Dataset_for_Real-World_Visual_Reasoning_and_Compositional_CVPR_2019_paper.html.
Krishna et al. (2017) Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L., Shamma, D. A., Bernstein, M. S., and Fei-Fei, L. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis., 123(1):32–73, 2017. doi: 10.1007/s11263-016-0981-7. URL https://doi.org/10.1007/s11263-016-0981-7.
Lee et al. (2019) Lee, J., Tang, R., and Lin, J. What would elsa do? freezing layers during transformer fine-tuning. arXiv preprint arXiv:1911.03090, 2019.
Lester et al. (2021) Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. CoRR, abs/2104.08691, 2021. URL https://arxiv.org/abs/2104.08691.
Lewis et al. (2020) Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. R. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 7871–7880. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.703. URL https://doi.org/10.18653/v1/2020.acl-main.703.
Li & Liang (2021) Li, X. L. and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp. 4582–4597. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.acl-long.353. URL https://doi.org/10.18653/v1/2021.acl-long.353.
Lin et al. (2014) Lin, T., Maire, M., Belongie, S. J., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: common objects in context. In Fleet, D. J., Pajdla, T., Schiele, B., and Tuytelaars, T. (eds.), Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, volume 8693 of Lecture Notes in Computer Science, pp. 740–755. Springer, 2014. doi: 10.1007/978-3-319-10602-1“˙48. URL https://doi.org/10.1007/978-3-319-10602-1_48.
Liu et al. (2021a) Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. CoRR, abs/2107.13586, 2021a. URL https://arxiv.org/abs/2107.13586.
Liu et al. (2021b) Liu, Y., Agarwal, S., and Venkataraman, S. Autofreeze: Automatically freezing model blocks to accelerate fine-tuning. arXiv preprint arXiv:2102.01386, 2021b.
Maas et al. (2011) Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Lin, D., Matsumoto, Y., and Mihalcea, R. (eds.), The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, pp. 142–150. The Association for Computer Linguistics, 2011. URL https://aclanthology.org/P11-1015/.
Mahabadi et al. (2021) Mahabadi, R. K., Ruder, S., Dehghani, M., and Henderson, J. Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp. 565–576. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.acl-long.47. URL https://doi.org/10.18653/v1/2021.acl-long.47.
Mao et al. (2021) Mao, Y., Mathias, L., Hou, R., Almahairi, A., Ma, H., Han, J., Yih, W., and Khabsa, M. Unipelt: A unified framework for parameter-efficient language model tuning. CoRR, abs/2110.07577, 2021. URL https://arxiv.org/abs/2110.07577.
Marino et al. (2019) Marino, K., Rastegari, M., Farhadi, A., and Mottaghi, R. OK-VQA: A visual question answering benchmark requiring external knowledge. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 3195–3204. Computer Vision Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019.00331. URL http://openaccess.thecvf.com/content_CVPR_2019/html/Marino_OK-VQA_A_Visual_Question_Answering_Benchmark_Requiring_External_Knowledge_CVPR_2019_paper.html.
Perez et al. (2021) Perez, E., Kiela, D., and Cho, K. True few-shot learning with language models. CoRR, abs/2105.11447, 2021. URL https://arxiv.org/abs/2105.11447.
Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Radford et al. (2021) Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 8748–8763. PMLR, 2021. URL http://proceedings.mlr.press/v139/radford21a.html.
Raffel et al. (2020) Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
Sanh et al. (2021) Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., Dey, M., Bari, M. S., Xu, C., Thakker, U., Sharma, S., Szczechla, E., Kim, T., Chhablani, G., Nayak, N. V., Datta, D., Chang, J., Jiang, M. T., Wang, H., Manica, M., Shen, S., Yong, Z. X., Pandey, H., Bawden, R., Wang, T., Neeraj, T., Rozen, J., Sharma, A., Santilli, A., Févry, T., Fries, J. A., Teehan, R., Biderman, S., Gao, L., Bers, T., Wolf, T., and Rush, A. M. Multitask prompted training enables zero-shot task generalization. CoRR, abs/2110.08207, 2021. URL https://arxiv.org/abs/2110.08207.
Sung et al. (2021) Sung, Y.-L., Cho, J., and Bansal, M. Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks. 2021.
Tsimpoukelli et al. (2021) Tsimpoukelli, M., Menick, J., Cabi, S., Eslami, S. M. A., Vinyals, O., and Hill, F. Multimodal few-shot learning with frozen language models. CoRR, abs/2106.13884, 2021. URL https://arxiv.org/abs/2106.13884.
Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
von Oswald et al. (2020) von Oswald, J., Henning, C., Sacramento, J., and Grewe, B. F. Continual learning with hypernetworks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=SJgwNerKvB.
Wang et al. (2019a) Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. Superglue: A stickier benchmark for general-purpose language understanding systems. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 3261–3275, 2019a. URL https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.html.
Wang et al. (2019b) Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019b. URL https://openreview.net/forum?id=rJ4km2R5t7.
Xie et al. (2018) Xie, N., Lai, F., Doran, D., and Kadav, A. Visual entailment task for visually-grounded language learning. CoRR, abs/1811.10582, 2018. URL http://arxiv.org/abs/1811.10582.
Yang et al. (2021) Yang, Z., Gan, Z., Wang, J., Hu, X., Lu, Y., Liu, Z., and Wang, L. An empirical study of GPT-3 for few-shot knowledge-based VQA. CoRR, abs/2109.05014, 2021. URL https://arxiv.org/abs/2109.05014.
Zhang et al. (2021) Zhang, T., Wu, F., Katiyar, A., Weinberger, K. Q., and Artzi, Y. Revisiting few-sample BERT fine-tuning. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=cO1IH43yUF.
Zhang et al. (2019) Zhang, Y., Baldridge, J., and He, L. PAWS: paraphrase adversaries from word scrambling. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 1298–1308. Association for Computational Linguistics, 2019. doi: 10.18653/v1/n19-1131. URL https://doi.org/10.18653/v1/n19-1131.
Zhang et al. (2022) Zhang, Z., Meng, X., Wang, Y., Jiang, X., Liu, Q., and Yang, Z. Unims: A unified framework for multimodal summarization with knowledge distillation. AAAI, 2022.
Zhao & Schütze (2021) Zhao, M. and Schütze, H. Discrete and soft prompting for multilingual models. In Moens, M., Huang, X., Specia, L., and Yih, S. W. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pp. 8547–8555. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.emnlp-main.672. URL https://doi.org/10.18653/v1/2021.emnlp-main.672.

Appendix A The Connection Between Prefix-tuning and Hypernetwork

Prefix tuning prepends $N$ tunable prefix vectors to the keys and values of the multi-head attention at every layer. Specifically, two sets of prefix vectors $P_{k},P_{v}\in\mathbb{R}^{N\times d}$ are concatenated with the original key $K$ and value $V$ . Then multi-head attention is performed on the new prefixed keys and values. The computation of head_i:

\mathrm{head}_{i}=\mathrm{Attn}(xW^{l_{i}}_{q},\mathrm{concat}(P^{l_{i}}_{k},CW^{l_{i}}_{k}),\mathrm{concat}(P^{l_{i}}_{v},CW^{l_{i}})).

(8)

(He et al., 2021) derives an equivalent form of Eq. 8 and provide an alternative view of prefix tuning:

\begin{split}\mathrm{head}&=\mathrm{Attn}(xW_{q},\mathrm{concat}(P_{k},CW_{k}),\mathrm{concat}(P_{v};CW_{v}))\\ &=\mathrm{softmax}(xW_{q}\mathrm{concat}(P_{k},CW_{k})^{\top}[\begin{array}[]{cc}&P_{v}\\ &CW_{v}\end{array}]\\ &=(1-\lambda(x))\mathrm{softmax}(xW_{q}W_{k}^{\top}C^{\top})CW_{v}+\lambda(x)\mathrm{softmax}(xW_{q}P_{k})P_{v}\\ &=\underbrace{(1-\lambda(x))\mathrm{Attn}(xW_{q},CW_{k},CW_{v})}_{\text{standard attention}}+\underbrace{\lambda(x)\mathrm{Attn}(xW_{q},P_{k},P_{v})}_{\text{indendent of C}},\end{split}

(9)

where $\lambda(x)$ is a scalar that represents the sum of normalized attention weights on the prefixes:

\lambda(x)=\frac{\sum_{i}\mathrm{exp}(xW_{q}P_{k}^{\top})_{i}}{\sum_{i}\mathrm{exp}(xW_{q}P_{k}^{\top})_{i}+\sum_{j}\mathrm{exp}(xW_{q}W_{k}^{\top}C^{\top})_{j}}

(10)

Eq. 9 gives an alternative view of prefix tuning that essentially applies a position-wise modification to the original head attention output $h$ through linear interpolation. Eq. 9 can be rewrited by define $W_{1}=W_{q}P_{k}^{\top}$ , $W_{2}=P_{v}$ , $f=\mathrm{softmax}$ :

h\leftarrow(1-\lambda(x))h+\lambda(x)f(xW_{1})W_{2},

(11)

which reaches a very similar form to the adapter function in Eq. 6, except that prefix tuning is performing weighted addition while the adapter one is unweighted. This view of prefix tuning allows for abstraction of prefix tuning as a plug-in module like adapters. Therefore, we can approximately regard the prefix vectors as the weights of the model, which can be generated when combined with hypernetworks.

Method	Number of Tunable Parameters
Prompt Tuning	$N\times d$
Prefix Tuning	$N\times d+(1+2\times L)\times d_{\text{mid}}\times d\times B_{\text{attn}}$
Adapter	$2\times d_{\text{mid}}\times d\times(B_{\text{attn}}+B_{\text{ffn}})\times L$
MAM Adapter	$N\times d+(1+2\times L)\times d_{\text{mid}}\times d\times B_{\text{attn}}+2\times d_{\text{mid}}\times d\times B_{\text{ffn}}\times L$
HYPERFORMER++	$(N+B_{\text{attn}}+B_{\text{ffn}}+L)\times d_{t}+d_{t}\times d^{\text{mid}}_{I}+d^{\text{mid}}_{I}\times d_{I}+2\times d_{I}\times(d_{\text{mid}}\times d)$
HyperPrefix	$(N+B_{\text{attn}}+L)\times d_{t}+d_{t}\times d^{\text{mid}}_{I}+d^{\text{mid}}_{I}\times d_{I}\times d_{I}\times d$
HyperPELT	$(N+B_{\text{attn}}+B_{\text{ffn}}+L)\times d_{t}+d_{t}\times d^{\text{mid}}_{I}+d^{\text{mid}}_{I}\times d_{I}+2\times d_{I}\times d+2\times d_{I}\times(d_{\text{mid}}\times d)$

Table 5: Number of tunable parameters of various parameter-efficient tuning methods with T5 models.

Appendix B Number of Tunable Parameters

Following He et al. (2021), to simplify the computation of tunable parameters, we compute the sum of parameter used in one encoder layer and one decoder layer as the parameter overhead of one single layer of the pre-trained encoder-decoder model. T5 has an encoder-decoder structure that has $L$ layers. Each layer has $B_{\text{attn}}$ blocks and $B_{\text{ffn}}$ blocks. For the encoder-decoder models like T5, $B_{\text{attn}}=3$ : the encoder self-attention block, the decoder self-attention block and the decoder cross-attention block and $B_{\text{ffn}}=2$ : encoder feed-forward block and decoder feed-forward block. For modifications applied at the attention blocks, the number of tunable parameters is computed by $\theta_{\text{attn}}=\theta^{\text{attn}}_{\text{W}}\times B_{\text{attn}}\times L$ , where $\theta^{\text{attn}}_{\text{W}}$ denotes the number of parameters used for one attention sub-layer. Similarly, the number of tunable parameters for the FFN sub-layers is computed by $\theta_{\text{ffn}}=\theta^{\text{ffn}}_{\text{W}}\times B_{\text{ffn}}\times L$ . Finally, the total number of tunable parameters for prefix tuning and adapter variants is $\theta=\theta_{\text{attn}}+\theta_{\text{ffn}}$ as applicable. Using T5 as an example, we present the number of parameters used by several representative methods throughout our paper in Tab. 5.

Appendix C Input-output formats

Task	Input Text	Target Text
GLUE Tasks
CoLA	cola sentence: [sentence]	acceptable/unacceptable
SST-2	sst2 sentence: [sentence]	positive/negative
MRPC	mrpc sentence1: [sentence1] sentence2: [sentence2]	equivalent/not_equivalent
QQP	qqp question1: [question1] question2: [question2]	duplicate/not_duplicate
STS-B	stsb sentence1: [sentence1] sentence2: [sentence2]	0.0 - 5.0
MNLI	mnli hypothesis: [hypothesis] premise: [premise]	entailment/neutral/contradiction
QNLI	qnli question: [question] sentence: [sentence]	entailment/not_entailment
RTE	rte sentence1: [sentence1] sentence2: [sentence2]	entailment/not_entailment
Few-shot Tasks
CB	cb hypothesis: [hypothesis] premise: [premise]	entailment/neutral/contradiction
TREC	trec question: [question]	DESC/ENTY/ABBR/HUM/NUM/LOC
BoolQ	boolq question: [question] context: [context]	True/False
IMDB	imdb sentence: [sentence]	positive/negative
PAWS	paws sentence1: [sentence1] sentence2: [sentence2]	equivalent/not_equivalent
Vision-Language Tasks
COCO	caption:	[caption]
VQA	vqa question: [question]	[answer]
GQA	gqa question: [question]	[answer]
Vision-Language Few-shot Tasks
OKVQA	okvqa question: [question]	[answer]
SNLI-VE	snli-ve premise: [premise]	entailment/neutral/contradiction

Table 6: Input-output formats for NLU and Vision-Language tasks. Following Raffel et al. (2020); Cho et al. (2021), we use different prefixes (such as “cola sentence:”, “vqa question:”) for questions from different datasets.

As shown in Tab. 6, we formulate the input text and labels from each task to the corresponding text, and we learn these different tasks by predicting target text with the language modeling objective in Eq. 2.