PSP: Pre-trained Soft Prompts for Few-Shot Abstractive Summarization
Abstract
Few-shot abstractive summarization is a challenging task in natural language generation. To support it, we developed a novel soft-prompt architecture coupled with a prompt pre-training plus prompt fine-tuning paradigm, which is effective and tunes only an extremely small number of parameters. To match the structure of generation models, the soft prompts comprise continuous input embeddings across the encoder and the decoder. Importantly, a new inner-prompt placed in the text is introduced to capture document-level information; the aim is to devote more attention to understanding the document so that the model is better prompted to generate document-related content. In the training process, prompt pre-training with self-supervised pseudo-data first teaches the model basic summarization capability. Then, with few-shot examples, only the designed lightweight soft prompts are fine-tuned. Experimental results on the CNN/DailyMail and XSum datasets show that our method, with only 0.1% of the parameters, outperforms full-model tuning, where all model parameters are tuned. It also surpasses Prompt Tuning by a large margin and delivers competitive results against Prefix-Tuning with 3% of the parameters.
1 Introduction
Given the high labor cost of obtaining quality abstractive summaries, few-shot abstractive summarization is in high demand yet highly challenging. A widely accepted paradigm for almost all NLP tasks is to fine-tune the entire set of parameters of a large pre-trained language model to suit the target task (Liu and Lapata, 2019; Liu et al., 2020).
However, fine-tuning with few-shot examples usually leads to disappointing results, especially for generation tasks like abstractive summarization (Fabbri et al., 2020; Yu et al., 2021); the likely outcome is an overfit model. Further, for every specific task, a large number of pre-trained parameters must be updated and stored, which is inefficient.
Pre-trained language models can be few-shot learners: GPT-3 (Brown et al., 2020), for instance, surprisingly performs generation tasks from a few examples without any further gradient updates. Although it lacks a rigorous theoretical proof, prompt learning inherits this few-shot property (Li and Liang, 2021; Schick and Schütze, 2020; Jin et al., 2021; Liu et al., 2021). Commonly, this type of learning is considered to retrieve relevant knowledge from frozen language models, tuning only continuous prompts to quickly adapt to new tasks with very few examples.
More recently, Prompt Tuning (Lester et al., 2021) has received much attention. With large frozen language models (say, 10 billion parameters), Prompt Tuning simply adds a tunable soft prompt to the input of the encoder, achieving results comparable to full-model tuning. Yet our empirical results in Section 2 demonstrate that Prompt Tuning yields abysmal performance for abstractive summarization. Prefix-Tuning (Li and Liang, 2021) extends prompt learning to the natural language generation area. With this technique, continuous prompts are applied to every layer of the pre-trained model, and it even shows gains over fine-tuning on few-shot generation tasks. Yet the training process is not stable, and the per-layer updates add to the memory and training costs. (See more related work in Section 5.)

Given the shortcomings of these two methods, we have developed a soft-prompt tuning method specifically designed for summarization. The structure is given in Figure 1. The method performs few-shot language generation (i.e., abstractive summarization) with an efficient number of training parameters. Prompt tokens are added before the decoder input tokens to guide the generation process toward the target summary. Moreover, we have designed three kinds of inner prompts – interval, sequential, and fixed-length – one of which is placed among the source input tokens. The aim is to capture the structure of the source document and aid in understanding its semantics, so as to better prompt the model to generate document-related content. Each kind of inner prompt focuses on different semantic units (e.g., phrases, sentences, etc.), differentiating important units from non-informative ones. To bolster the summarization ability of the model and help the prompts understand the documents, prompt pre-training on self-supervised pseudo-data is performed before the tuning process. As a last step, all the prompts are fine-tuned with few-shot training examples. Experiments conducted on two commonly used datasets – CNNDM (See et al., 2017) and XSum (Narayan et al., 2018) – demonstrate that our method outperforms full-model tuning under few-shot settings with only 0.1% of the parameters. It also surpasses naive Prompt Tuning by a large margin and yields performance competitive with Prefix-Tuning using 3% of its trainable parameters. A detailed analysis shows that the designed prompt pre-training phase and the inner prompts are effective for few-shot text summarization.

The major contributions of this work are: 1) a novel soft-prompt architecture for few-shot abstractive summarization, whose well-designed prompts in the embedding layer let the model fulfill the task effectively and efficiently; 2) a prompt pre-training strategy that benefits the soft-prompt model for few-shot summarization and shows excellent zero-shot capabilities; and 3) experiments that investigate the effect of different prompts by probing the attention weights, showing that our model is able to extract knowledge from the encoder language model, understand the discourse in the document, and guide the decoder language model to generate fluent summaries.
2 Pilot Experiments
In a pilot study, we experimented with Prompt Tuning under a 300-shot setting to find clues as to how to design summarization prompts for the task. Our findings follow.
Consider an encoder-decoder language model based on the Transformer architecture (Vaswani et al., 2017) (e.g., BART (Lewis et al., 2020)), parameterized by $\theta$. To conduct a few-shot summarization task, we have a small number of training pairs, each consisting of a document $X$ and a corresponding summary $Y$. Specifically, we divided $X$ into subsets with sentences as our unit (note that, throughout this work, a "sentence" can be an arbitrary span of contiguous text, e.g., a fixed length of 10 tokens, or an actual linguistic sentence), i.e., $X = \{S_1, S_2, \dots, S_m\}$, where $x^i_j$ denotes the $j$-th token in the $i$-th sentence $S_i$.
First, original Prompt Tuning is applied by concatenating a series of prompt tokens $P_E$, parameterized by $\theta_E$, to the encoder input $\mathrm{emb}(X)$, where $\mathrm{emb}(\cdot)$ represents the embedding of each token (the leftmost structure in Figure 1). The gradients are backpropagated only through the prompts, and the weights of the language model are kept frozen (Lester et al., 2021). In this way, the model maximizes the likelihood of the output $Y$:
$\max_{\theta_E} \; \log p_{\theta, \theta_E}\big(Y \mid [P_E;\ \mathrm{emb}(X)]\big)$  (1)
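As an illustration of this setup, the sketch below shows one way such encoder-side soft prompts can be realized on top of a frozen BART checkpoint with Hugging Face Transformers; it is a minimal sketch under our own naming (e.g., `num_prompt_tokens`, `encoder_inputs_with_prompt`), not the authors' implementation.

```python
# Minimal sketch of vanilla Prompt Tuning on the encoder side (Eq. 1), assuming a
# Hugging Face BART checkpoint; all identifiers below are illustrative.
import torch
import torch.nn as nn
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
for p in model.parameters():          # freeze every language-model parameter (theta)
    p.requires_grad = False

num_prompt_tokens = 100               # prompt length used in the paper
d_model = model.config.d_model        # 768 for BART-base

# Initialize the soft prompt P_E from randomly drawn vocabulary embeddings (Lester et al., 2021).
init_ids = torch.randint(0, model.config.vocab_size, (num_prompt_tokens,))
encoder_prompt = nn.Parameter(model.get_input_embeddings()(init_ids).detach().clone())

def encoder_inputs_with_prompt(input_ids: torch.Tensor) -> torch.Tensor:
    """Return [P_E; emb(X)]: the prompt embeddings concatenated before the token embeddings."""
    token_embeds = model.get_input_embeddings()(input_ids)                  # (B, L, d_model)
    prompt = encoder_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)  # (B, 100, d_model)
    return torch.cat([prompt, token_embeds], dim=1)                         # (B, 100 + L, d_model)
```

At forward time, the resulting tensor would be passed through `inputs_embeds` instead of `input_ids`, with the attention mask extended by the 100 prompt positions.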
The result of original Prompt Tuning is shown on the first line of Table 1, where we see that it severely underperforms full-model tuning. In a further experiment, we instead added a series of prompts $P_D$, parameterized by $\theta_D$, to the decoder inputs to guide the generation of $Y$; here, we found the results to be even worse than the encoder-only variant (second line of Table 1).
Necessary Prompts for Generation
For generation-based tasks, prompts in both the encoder and the decoder are equally important. Therefore, our model employs a combination of the two series of prompts mentioned above and generates $Y$ conditioned on $P_E$, $\mathrm{emb}(X)$, and $P_D$:
$\max_{\theta_E, \theta_D} \; \log p_{\theta, \theta_E, \theta_D}\big(Y \mid [P_E;\ \mathrm{emb}(X)],\ P_D\big)$  (2)
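Continuing the sketch above, conditioning on both $P_E$ and $P_D$ (Eq. 2) can be wired up as follows; the manual right-shift of the labels and the `-100` masking of the decoder-prompt positions are our assumptions about one reasonable implementation, not the authors' code.

```python
# Sketch of Eq. 2: encoder prompt P_E plus decoder prompt P_D, reusing `model`,
# `encoder_prompt`, `num_prompt_tokens`, and `encoder_inputs_with_prompt` from above.
decoder_init_ids = torch.randint(0, model.config.vocab_size, (num_prompt_tokens,))
decoder_prompt = nn.Parameter(model.get_input_embeddings()(decoder_init_ids).detach().clone())

def forward_with_prompts(input_ids, attention_mask, labels):
    """Compute the MLE loss with prompts prepended to both encoder and decoder inputs.
    For brevity, `labels` is assumed to hold plain target token ids (no -100 entries)."""
    batch = input_ids.size(0)
    enc_embeds = encoder_inputs_with_prompt(input_ids)                          # [P_E; emb(X)]
    enc_mask = torch.cat([attention_mask.new_ones(batch, num_prompt_tokens),
                          attention_mask], dim=1)
    # Teacher forcing: decoder inputs are the target tokens shifted right by one position.
    start = labels.new_full((batch, 1), model.config.decoder_start_token_id)
    decoder_input_ids = torch.cat([start, labels[:, :-1]], dim=1)
    dec_embeds = torch.cat([decoder_prompt.unsqueeze(0).expand(batch, -1, -1),
                            model.get_input_embeddings()(decoder_input_ids)], dim=1)
    # The decoder prompt positions carry no target tokens, so they are ignored in the loss.
    padded_labels = torch.cat([labels.new_full((batch, num_prompt_tokens), -100), labels], dim=1)
    out = model(inputs_embeds=enc_embeds, attention_mask=enc_mask,
                decoder_inputs_embeds=dec_embeds, labels=padded_labels)
    return out.loss
```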
Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---
Prompt in encoder | 32.87 | 11.92 | 21.73 |
Prompt in decoder | 26.77 | 11.73 | 16.71 |
Prompt in en.&de. | 36.37 | 14.41 | 24.46 |
Full-Model Tuning | 37.01 | 14.49 | 23.91 |
The results on the third line of Table 1 again verify our hypothesis: prompts across the encoder and decoder achieve results comparable to full-model tuning under few-shot settings. This verifies two things for us. First, prepending simple prompts to only the input embedding layer is effective and efficient for few-shot abstractive summarization. Second, prompts across the encoder and decoder are both necessary for generation tasks.

Lack of Attention on the Document
We further explored the encoder-decoder attention to investigate the effect of the prompts and of freezing the language model. From Figure 2, we find that the generated output mainly attends to the soft prompts, with little attention given to the document itself. This outcome is detrimental to summarization, which requires understanding the semantics and inner discourse structure of documents (Wang et al., 2019). Without associations between target summaries and source documents, it is impossible to obtain high-quality summaries using the current prompt architectures.
From Figure 2, we can also observe that the prompts in the encoder and those in the decoder are consistently and directly associated with each other. We speculate that the encoder prompts retrieve relevant knowledge from the frozen encoder language model as a document representation, while the decoder prompts copy the encoder prompts' behaviour and guide the decoder language model to generate text.

3 Method
In light of our findings about the current architectures, we developed a new pre-trained soft-prompt architecture for few-shot abstractive summarization, called PSP. The framework includes continuous prompts across the encoder and decoder inputs, as well as inner-prompts that capture the dependencies between documents and target summaries. To better understand a given document, we add a prompt pre-training process before few-shot tuning, which also provides a good initialization for the prompts. The overall architecture and training scheme are illustrated in Figure 3.
3.1 Encoder-Decoder Basic Prompts
As mentioned in Section 2, in the training phase of current architectures, the encoder prompts $P_E$ are responsible for extracting knowledge from the frozen encoder language model as a document representation. Meanwhile, the decoder prompts $P_D$ mostly copy the behavior of $P_E$ and guide the frozen decoder language model to generate fluent text as a summary.
To strengthen the model's ability to understand a document, the dependencies on and the attention given to the source document need to be embodied in the prompt architecture.
3.2 Inner-Prompts for Document Understanding
To achieve this goal, we propose adding inner-prompts within the source document, denoted as $P_I$ with parameters $\theta_I$ to be updated, where each sentence $S_i$ is assigned an inner-prompt token $I_{S_i} \in P_I$. These inner-prompts are added to the corresponding token embeddings, which gives rise to a new embedding $\mathrm{emb}'(X)$:
$\mathrm{emb}'(x^i_j) = \mathrm{emb}(x^i_j) + I_{S_i}$  (3)
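A minimal sketch of this addition is given below; the `sentence_ids` tensor, which maps every token position to the index of its assigned inner-prompt token, is assumed to come from a pre-processing step, and the initialization scale is illustrative.

```python
# Sketch of Eq. 3: each token embedding is summed with the inner-prompt vector of its unit.
import torch
import torch.nn as nn

num_inner_tokens, d_model = 61, 768   # e.g., 61 inner-prompt tokens for CNNDM; BART-base hidden size
inner_prompt = nn.Parameter(torch.randn(num_inner_tokens, d_model) * 0.02)   # theta_I

def add_inner_prompts(token_embeds: torch.Tensor, sentence_ids: torch.Tensor) -> torch.Tensor:
    """token_embeds: (B, L, d); sentence_ids: (B, L) with values in [0, num_inner_tokens)."""
    return token_embeds + inner_prompt[sentence_ids]   # emb'(x) = emb(x) + I_{S_i}
```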
We believe that by prompting different semantic units (e.g., sentences, phrases, etc.), more attention can be given to understanding the document's discourse. Furthermore, the inner-prompts help the model quickly interpret the document by strengthening the associations between outputs and documents. What follows are three different strategies for assigning the inner-prompts; there is more discussion on this point in Section 4.2.

Interval
Following Liu and Lapata (2019), the interval strategy uses two inner-prompt tokens, one of which is assigned to each sentence $S_i$ depending on whether $i$ is odd. Specifically,
$I_{S_i} = \begin{cases} I_{\mathrm{odd}}, & i \bmod 2 = 1 \\ I_{\mathrm{even}}, & i \bmod 2 = 0 \end{cases}$  (4)
In this way, the model can identify important sentences and encode the document at the sentence level.
Sequential
To highlight the complex discourse structure of documents, sentence positions need to be considered. Therefore, a distinct inner-prompt token is assigned to each sentence according to its position in the sequence, formulated as:
$I_{S_i} = I_i, \quad i = 1, 2, \dots, n$  (5)
where $I_1, I_2, \dots, I_n$ are distinct inner-prompt tokens.
Fixed-length
To discover more fine-grained semantic units, text spans with a fixed length $k$ are treated as new "sentences", and a corresponding sequential token is assigned to each of them. That is, prompts are assigned to the newly divided spans $[S'_1, S'_2, \dots]$ in the same way as the sequential strategy, $I_{S'_i} = I_i$. Figure 4 illustrates examples of the three strategies.
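To make the three strategies concrete, the hedged sketch below computes, for one document, the inner-prompt index of every token; the helper names are ours, and the reuse of the last token for overflowing documents follows the setup described later in Section 4.

```python
# Hedged sketch of the interval, sequential, and fixed-length assignment strategies.
# `sent_lengths` holds the token count of each sentence in one document, `k` the fixed
# span length, and `n_max` the number of distinct inner-prompt tokens (e.g., 61 for CNNDM).
from typing import List

def interval_ids(sent_lengths: List[int]) -> List[int]:
    # Two alternating tokens: odd-numbered sentences get index 0, even-numbered ones index 1.
    return [i % 2 for i, length in enumerate(sent_lengths) for _ in range(length)]

def sequential_ids(sent_lengths: List[int], n_max: int) -> List[int]:
    # A distinct token per sentence position; text beyond position n_max shares the last token.
    return [min(i, n_max - 1) for i, length in enumerate(sent_lengths) for _ in range(length)]

def fixed_length_ids(num_tokens: int, k: int, n_max: int) -> List[int]:
    # Re-segment the document into spans of k tokens and assign sequential tokens to the spans.
    return [min(j // k, n_max - 1) for j in range(num_tokens)]
```

These token-level indices are exactly what the `sentence_ids` tensor in the Eq. 3 sketch would contain.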
3.3 Self-supervised Prompt Pre-training
To improve the ability of the prompts to understand documents and to help the model adapt to summarization tasks, the soft prompts are further pre-trained on a corpus using summarization-oriented self-supervised objectives. This also means the prompts are well initialized for few-shot tuning.
We tested two strategies for constructing the self-supervised data, each designed to suit a particular type of writing bias in the documents: "Lead" and "gap sentences generation (GSG)".
Lead
Lead bias is common in news articles, which usually follow an inverted-pyramid structure where the first few sentences contain the most salient information (See et al., 2017; Yang et al., 2020). For documents with this type of bias, we selected the first three sentences as the target summary and treated the rest of the document as the source text. With this type of prompt pre-training, the model learns to infer the salient information from the remaining text.
GSG
Gap sentences generation applies to documents that do not follow the lead-bias structure (e.g., XSum (Narayan et al., 2018)). The strategy follows Zhang et al. (2020): we used the ROUGE-1 F1 (Lin, 2004) between each sentence and the rest of the document as a proxy for a principal score $s_i$. The top-$m$ most important sentences were selected according to $s_i$ and removed from the document; these sentences were then concatenated, in the same order as in the original text, to form a pseudo summary. The remainder of the text was treated as the pseudo document.
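A hedged sketch of this GSG-style construction is shown below, scoring sentences with the `rouge_score` package; the sentence splitting, the `top_m` parameter, and the function names are our assumptions rather than the authors' exact pipeline.

```python
# Sketch of gap-sentences generation (GSG) pseudo-data construction.
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def gsg_pseudo_pair(sentences, top_m=1):
    """Select the top-m sentences by ROUGE-1 F1 against the rest of the document
    as the pseudo summary; the remaining sentences form the pseudo document."""
    scores = []
    for i, sent in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scores.append(_scorer.score(rest, sent)["rouge1"].fmeasure)
    top = sorted(sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_m])
    pseudo_summary = " ".join(sentences[i] for i in top)          # keep original sentence order
    pseudo_document = " ".join(s for i, s in enumerate(sentences) if i not in set(top))
    return pseudo_document, pseudo_summary
```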
With the constructed data, our designed prompts can be pre-trained and further tuned with few-shot examples.
3.4 Training Objective
The model is trained with maximum likelihood estimation (MLE). Given a ground-truth summary $Y = (y_1, y_2, \dots, y_T)$ for an input passage $X$, the objective is to minimize the negative log-likelihood of the target word sequence:
$\mathcal{L} = -\sum_{t=1}^{T} \log p_{\theta, \theta_E, \theta_D, \theta_I}\big(y_t \mid y_{<t}, [P_E;\ \mathrm{emb}'(X)],\ P_D\big)$  (6)
Note that only the prepended-prompt parameters ($\theta_E$, $\theta_D$) and the inner-prompt parameters ($\theta_I$) are optimized; the language model parameters ($\theta$) are kept frozen.
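Putting the pieces together, a minimal training-step sketch for Eq. 6 could look as follows, building on the earlier sketches; the Adam hyper-parameters and peak learning rate mirror those reported in Appendix A.2, while the warm-up/decay schedule is omitted for brevity.

```python
# Only the prompt parameters (theta_E, theta_D, theta_I) receive gradients; theta stays frozen.
# In the full PSP setup, emb(X) inside `encoder_inputs_with_prompt` would be replaced by
# emb'(X) from Eq. 3, i.e., token embeddings with the inner prompts added.
prompt_params = [encoder_prompt, decoder_prompt, inner_prompt]
optimizer = torch.optim.Adam(prompt_params, lr=3e-4, betas=(0.9, 0.998))

def train_step(batch):
    loss = forward_with_prompts(batch["input_ids"], batch["attention_mask"], batch["labels"])
    loss.backward()               # gradients flow only into the three prompt tensors
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```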
4 Experiments
Datasets
We experimented with the CNN/DailyMail (CNNDM) dataset (Hermann et al., 2015) and the XSum dataset (Narayan et al., 2018). We chose these datasets because they differ in abstraction level and text length, which helps to show the generalization ability of our results.
We constructed the self-supervised pre-training data for CNNDM with Lead, and for XSum with GSG; details are given in Section A.1 in the appendix. Given that the lead-bias structure exists only in some domain-specific datasets, we also conducted experiments to demonstrate the universality of GSG for constructing pseudo-data; the results are shown in Section A.3 in the appendix. Our few-shot training set contained 300 document-summary pairs randomly sampled from the original training data. To tune the hyper-parameters and select the best checkpoint, we composed a validation set of 300 examples from the original validation data. Here, we were careful to ensure that the validation set was no larger than the training set, so that the setup fits a true few-shot learning setting, following Perez et al. (2021). Since few-shot learning may have high variance, we sampled the examples with 5 different random seeds. We used the original test set to report our results, including the mean value and the standard deviation. Table 2 shows the statistics of the pre-processed corpus.
Datasets | CNNDM train | CNNDM dev | CNNDM test | XSum train | XSum dev | XSum test
---|---|---|---|---|---|---
Avg. Passage | 697.45 | 676.64 | 717.92 | 396.53 | 387.62 | 380.55
Avg. Summary | 55.91 | 51.97 | 58.62 | 22.90 | 23.29 | 22.11
Labeled data | 300 | 300 | 11,490 | 300 | 300 | 11,333
Setup
The base version of BART was used in our work. Following Lester et al. (2021), we used 100 prompt tokens for both the encoder inputs and the decoder inputs; these prompts were randomly initialized from vocabulary embeddings. The sequential and fixed-length inner-prompts require a maximum number of tokens $n$. Hence, we counted the number of sentences in each document and divided the documents into two groups – the 85% with the fewest sentences (Group A) and the 15% with the most sentences (Group B). (We made the division at 85% so that all inner-prompt token embeddings could be fully trained, because sentences after the $n$-th only exist in 15% of the data.) We then set $n$ to the largest number of sentences in Group A plus one: 61 for CNNDM and 33 for XSum. In this way, one inner-prompt token was assigned to each sentence position up to $n$. For the excessively long documents in Group B, all text after the first $n-1$ sentences was assigned the $n$-th token. Further, we drew from a normal distribution to initialize the inner-prompt embeddings. (More implementation details are given in Section A.2 in the appendix.) Taking CNNDM as an example, all the tunable parameters that need to be stored amount to only around 0.1% of the parameters of full-model tuning; the ratio is similar for XSum.
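The choice of the inner-prompt vocabulary size described above can be approximated with a short percentile computation; using `numpy.percentile` for the 85%/15% split is our simplification.

```python
# Hedged sketch: derive the number of distinct inner-prompt tokens from the corpus by
# taking the 85th percentile of per-document sentence counts (Group A's maximum) plus one
# shared token for the overflow text of longer documents (Group B).
import numpy as np

def num_inner_prompt_tokens(sentence_counts, percentile=85):
    group_a_max = int(np.percentile(sentence_counts, percentile))
    return group_a_max + 1    # 61 for CNNDM and 33 for XSum with the paper's statistics
```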
Evaluation Metrics
We adopted ROUGE Lin (2004) to measure the quality of the summaries produced in our experiments. The F1 scores for ROUGE-1, ROUGE-2, and ROUGE-L between the ground-truth and the generated summaries are each reported.
Baseline Models
We compared PSP to: Prompt Tuning (Lester et al., 2021), which only concatenates soft prompts into the encoder input; Prefix Tuning (Li and Liang, 2021), which adds a prefix to all the encoder layers, cross-attention layers, and the decoder layers; and Full-Model Tuning, which does not have any prompts and fine-tunes all the parameters of the pre-trained language model.
Model | CNNDM R-1 | CNNDM R-2 | CNNDM R-L | CNNDM PPL | XSum R-1 | XSum R-2 | XSum R-L | XSum PPL
---|---|---|---|---|---|---|---|---
Prompt Tuning | | | | | | | |
Prefix-Tuning | | | | | | | |
Full-Model Tuning | | | | | | | |
PSP-Interval | | | | | | | |
PSP-Sequential | | | | | | | |
PSP-Fixed-k | | | | | | | |
4.1 Experimental Results of Our Method
Table 3 presents the results of all PSP variants and the baselines on the CNNDM and XSum datasets. With the exception of the ROUGE-2 and ROUGE-L scores of Prefix-Tuning on CNNDM, our proposed PSP outperforms all baselines; even in those cases, PSP delivers competitive results with only 3% of Prefix-Tuning's parameters, which is an acceptable trade-off. To our surprise, half of PSP's scores surpass full-model tuning, especially on XSum, as underlined in the table. Besides, the PPL results show that PSP generates more fluent summaries than the other models. These results indicate that fine-tuning large language models is not necessarily a good or efficient idea for few-shot generation, and that soft prompts with frozen language models are effective for few-shot abstractive summarization. Moreover, the results verify that PSP is effective with each of its three inner-prompt strategies.
Efficiency vs. effectiveness.
We give an overall comparison to the baseline models on effectiveness and memory efficiency, evaluated by ROUGE and the number of parameters, respectively. The results are shown in Table 4. Prompt Tuning has the fewest parameters, but its capacity is correspondingly limited and it lacks control over the decoder side, so it cannot perform natural language generation tasks well. We can see that substantial gains are made when going from vanilla Prompt Tuning to PSP. However, even though Prefix-Tuning uses nearly thirty times more parameters than our method, it offers at best marginal improvements and even degrades on some metrics. Besides, Prefix-Tuning relies on a reparameterization trick to stabilize training, i.e., an MLP with a large number of parameters added during the training stage. Our method provides the best effectiveness-efficiency trade-off: it outperforms full-model tuning with only 0.1% of the parameters and presents competitive results against Prefix-Tuning with 3% of its parameters.
Model | # Train | # Store | ROUGE-1 (CNNDM) | ROUGE-1 (XSum)
---|---|---|---|---
PSP | | | 38.32 | 32.86
Prefix-Tuning | | | 37.12 | 32.18
Prompt Tuning | | | 30.58 | 29.63
Full-Model Tuning | | | 38.03 | 32.85
Human Evaluation
We conducted a human evaluation study. To this end, we randomly selected 20 instances from the test set of each dataset. Ten graduate students with a high level of fluency in English were asked to assess the generated summaries and the gold summaries from independent perspectives (Wang et al., 2021): Informativeness (how much useful information does the summary provide?), Relevance (how well does the summary reflect the input document?), and Fluency (how grammatically correct are the summary sentences and how easy are they to read?). Scoring followed the Best-Worst Scaling method (Kiritchenko and Mohammad, 2017): participants were asked to select the best and worst summaries from each perspective, and the score of a system was computed as the percentage of times it was chosen as the best minus the percentage of times it was selected as the worst, ranging from -1 (worst) to 1 (best). Results are shown in Table 5. Qualitatively, we show several examples generated by the different models, together with the references, in Table 14 and Table 15 in the appendix. Compared with all baselines, the summaries generated by PSP are consistently more fluent and more relevant to the source document, consistent with the results of the human evaluation. Furthermore, we found that summaries generated by PSP and Prefix-Tuning are often similar in sentence patterns and expressions; however, Prefix-Tuning tends to generate shorter texts than PSP, which often leads to a loss of information.
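For reference, the Best-Worst Scaling score described above can be computed as in the short sketch below; the assumed data layout is one best pick and one worst pick per annotation.

```python
# Best-Worst Scaling: fraction of annotations in which a system is picked best,
# minus the fraction in which it is picked worst; the result lies in [-1, 1].
from collections import Counter
from typing import Dict, List

def bws_scores(best_picks: List[str], worst_picks: List[str]) -> Dict[str, float]:
    n = len(best_picks)                       # one (best, worst) pair per annotation
    best, worst = Counter(best_picks), Counter(worst_picks)
    return {s: (best[s] - worst[s]) / n for s in set(best) | set(worst)}
```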
Methods | CNNDM IF | CNNDM RL | CNNDM FL | XSum IF | XSum RL | XSum FL
---|---|---|---|---|---|---
PSP | 0.500 | 0.708 | 0.667 | 0.217 | 0.275 | 0.492
Prompt Tuning | -0.317 | -0.758 | -0.975 | -0.336 | -0.400 | -0.867
Prefix-Tuning | -0.233 | 0.067 | 0.158 | 0.017 | -0.008 | 0.292
Full-Model Tuning | 0.067 | -0.025 | 0.075 | 0.117 | 0.092 | 0.075
Selection of the fixed length $k$.
As shown in Table 3, PSP-Fixed-k performs consistently well on both datasets. We therefore further explored the influence of different fixed lengths $k$, i.e., $k \in \{5, 10, 15, 30\}$, for the inner-prompt tokens of PSP-Fixed-k. (The average number of tokens per sentence in both datasets is about 18, so we did not consider a fixed length of 20, given its similarity to PSP-Sequential.) Table 6 presents the results of these variants on XSum. We observe that segmented spans of 10 tokens achieve the best performance. Interestingly, this suggests that, to understand a document, it can be reorganized into several semantic units of about 10 tokens each. We also report the results of different $k$ on our validation set in Table 6; the ranking is consistent with the test set. From a practical perspective, when applying PSP to a new dataset, we can choose the best $k$ based on the validation set.
k | R-1 | R-2 | R-L | R-1 | R-2 | R-L
---|---|---|---|---|---|---
5 | 34.27 | 11.90 | 26.41 | 31.90 | 10.28 | 24.20
10 | 35.31 | 12.88 | 26.85 | 32.89 | 11.13 | 25.51
15 | 34.98 | 11.68 | 26.45 | 32.11 | 10.46 | 24.72
30 | 34.48 | 12.57 | 26.55 | 32.20 | 11.03 | 25.30
4.2 Analyses on Soft Prompts
Does our model attend to understanding documents?
Following the analysis of Figure 2, we further present the encoder-decoder attention distribution of PSP; the comparison is visualized in Figure 5. We find the following enhancements from introducing the inner prompts. First, PSP strengthens the associations between the encoder prompts and the decoder prompts compared to the original model. Second, the soft prompts are more strongly related to the output, indicating semantic relations between them. Third, the output assigns more attention to the source document. This suggests that the hidden structure of the document is emphasized, increasing the model's capability to understand its semantics. As such, the prompts can properly select salient information from the document and prompt the model to generate the output.

Do inner prompts help the model understand the content of documents, or do they simply increase the model's capacity?
Instead of using inner-prompts, we prepended additional tunable tokens (150 tokens in total) in front of the encoder and decoder inputs. Comparison results are shown in Table 7. Despite the larger capacity, the soft prompts with 150 tunable tokens before the input, denoted as Soft prompts (en.&de., 150), performed the worst. This suggests that the inner-prompts, with only a few parameters, do help the model understand the document by prompting its structure, rather than simply adding more trainable parameters to increase capacity.
Model | CNNDM R-1 | CNNDM R-2 | CNNDM R-L | XSum R-1 | XSum R-2 | XSum R-L
---|---|---|---|---|---|---
Soft prompts (en.&de., 100) | 36.89 | 14.96 | 24.63 | 29.36 | 9.90 | 22.92
Soft prompts (en.&de., 150) | 35.71 | 14.86 | 23.97 | 28.94 | 9.52 | 22.24
Soft prompts (en.&de.&ip., 100) | 37.87 | 15.83 | 25.37 | 31.95 | 10.52 | 24.80
Model | ROUGE-1 | ROUGE-2 | ROUGE-L
---|---|---|---
Soft prompts (en.&de., shared) | 36.06 | 14.30 | 24.24
Soft prompts (en.&de., separate) | 36.37 | 14.41 | 24.46
Further insight on soft prompts across the encoder and the decoder.
To verify our hypothesis that the decoder prompts largely copy the behaviour of the encoder prompts, we shared the embeddings of the soft prompts prepended to the encoder and the decoder. In Table 8, we observe that Soft prompts (en.&de., shared) and Soft prompts (en.&de., separate) perform almost identically. Although the shared variant has only half the prompt parameters of the original model, its performance remains competitive. This shows that shared prompts can extract important information from the document and guide the language model to generate consistently good summaries more efficiently.

Model | ROUGE-1 | ROUGE-2 | ROUGE-L
---|---|---|---
Full-Model Tuning | 11.69 | 2.67 | 7.74
Prefix-Tuning | 11.76 | 2.63 | 7.93
Prompt Tuning | 9.40 | 1.86 | 6.19
PSP-Interval | 17.16 | 3.36 | 12.65
4.3 Analysis on Few-shot and Zero-shot Summarization
To examine the performance of the different methods under fewer shots, we further randomly sampled {50, 100, 200} examples as additional settings. Figure 6 gives a more detailed overview of all models' performance across this range of shot counts. The ROUGE scores of our model generally outperform the other baselines and remain steady across the different scenarios. In particular, PSP with only 50 examples obtains the most significant improvements, while Prefix-Tuning (tuned on BART-base) barely works, possibly due to its training instability. Moreover, we report zero-shot results on XSum in Table 9. Benefiting from the knowledge gained in the pre-training phase, our model shows a significant advantage in zero-shot adaptation, generating quality summaries.
4.4 The Performance of Pre-training on Prefix-Tuning
A crucial strategy for PSP is the pre-training of the soft prompts. For a fair comparison, we performed prefix pre-training for Prefix-Tuning in the same way as for PSP. The results are shown in Table 10. The Prefix model improves on the XSum dataset after adopting the pre-training strategy, but underperforms the original model on CNNDM. This indicates that Prefix-Tuning has limited potential compared to our model. We conjecture that pre-training for Prefix-Tuning raises the risk of over-fitting due to its sensitivity to different data and parameter settings.
Method | CNNDM R-1 | CNNDM R-2 | CNNDM R-L | XSum R-1 | XSum R-2 | XSum R-L
---|---|---|---|---|---|---
Prefix-Tuning | | | | | |
Prefix-Tuning w/ Pre. | | | | | |
4.5 Ablation Study
We conducted experiments to examine the effectiveness of the major components of our model; Table 11 shows the ablation results on the two datasets. We observe that both the prompt pre-training operation and the inner-prompts component contribute to the main model. Notably, with the removal of either component, the model becomes considerably less stable, as indicated by the variance in the ablation results. Comparatively, prompt pre-training matters more on the XSum dataset, whose summaries are more abstractive (and, we assume, more "difficult") than those of CNNDM. In sum, these two components support the performance and stability of our model in terms of adaptation to summarization (via prompt pre-training) and structural document understanding (via inner-prompts).
Method | CNNDM R-1 | CNNDM R-2 | CNNDM R-L | XSum R-1 | XSum R-2 | XSum R-L
---|---|---|---|---|---|---
PSP-Fixed-k | | | | | |
w/o PP | | | | | |
w/o IP | | | | | |
w/o PP & IP | | | | | |
5 Related Work
Few-Shot Abstractive Summarization
In practical application scenarios, the lack of manually constructed document-summary pairs or labeled data makes data-driven neural models perform poorly (Hu et al., 2021, 2020). Fabbri et al. (2020) condense characteristics of the target dataset into Wikipedia data to construct pseudo-summaries. Bražinskas et al. (2020) introduce plug-in networks to reproduce characteristics of the target dataset with only a small set of labeled examples. Bai et al. (2021) conduct cross-lingual summarization in a low-resource setting. Yu et al. (2021) design a second phase of pre-training on large-scale generative models before fine-tuning. In this paper, we construct a pseudo-summary corpus with heuristic rules, providing a better parameter initialization for soft prompts under few-shot settings. More importantly, we design summarization-oriented soft prompts to help the model produce few-shot summaries.
Prompt Learning
The emergence of GPT-3 (Brown et al., 2020) introduced the concept of "prompting": one only needs to assemble a task description and a few examples into a prompt and prepend it to the task input; with its large-scale frozen parameters, the pre-trained model can then generate the output without any task-specific tuning. However, writing task descriptions is error-prone, and there is no unified, explicit, and effective way to build such hard prompts manually (Logan IV et al., 2021). Hence, several works (Gao et al., 2020; Jiang et al., 2020; Shin et al., 2020) propose to generate prompts automatically, but they all restrict prompts to discrete spaces; these discrete prompts are less expressive and sub-optimal. To overcome the shortcomings of hard prompts, Li and Liang (2021) propose Prefix-Tuning, which only tunes prefix activations prepended to all transformer layers and keeps the LM parameters frozen. To simplify further, Prompt Tuning (Lester et al., 2021) only prepends tunable tokens to the encoder input and keeps all other parameters frozen. Logan IV et al. (2021) and Gu et al. (2021) propose using pre-training to boost the low performance of Prompt Tuning for few-shot learning. In this work, we fit the structure of Prompt Tuning to text generation models, proposing encoder prompts, decoder prompts, and inner prompts, and successfully apply prompt tuning to the few-shot abstractive summarization task.
6 Conclusion
In this paper, we present a novel pre-trained soft-prompt architecture (PSP) specifically designed for few-shot abstractive summarization. We design continuous input embeddings across an encoder and a decoder, alongside several kinds of inner-prompts placed in the text, helping the model better understand documents and guiding accurate generation. Empirical results show the necessity of prompt pre-training for few-shot and zero-shot abstractive summarization. Extensive experiments and analyses show that the proposed PSP provides the best effectiveness-efficiency trade-off among all the baseline methods.
7 Acknowledgments
The research presented in this publication was sponsored by CCF Fund For Young Scholars, and Joint Funds of the National Natural Science Foundation of China (Grant No. U21B2009).
References
- Bai et al. (2021) Yu Bai, Yang Gao, and Heyan Huang. 2021. Cross-lingual abstractive summarization with limited parallel resources. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6910–6924, Online. Association for Computational Linguistics.
- Bražinskas et al. (2020) Arthur Bražinskas, Mirella Lapata, and Ivan Titov. 2020. Few-shot learning for opinion summarization. arXiv preprint arXiv:2004.14884.
- Brown et al. (2020) Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- Fabbri et al. (2020) Alexander R Fabbri, Simeng Han, Haoyuan Li, Haoran Li, Marjan Ghazvininejad, Shafiq Joty, Dragomir Radev, and Yashar Mehdad. 2020. Improving zero and few-shot abstractive summarization with intermediate fine-tuning and data augmentation. arXiv preprint arXiv:2010.12836.
- Gao et al. (2020) Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.
- Gu et al. (2021) Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. 2021. Ppt: Pre-trained prompt tuning for few-shot learning. arXiv preprint arXiv:2109.04332.
- Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Advances in neural information processing systems, 28:1693–1701.
- Hu et al. (2020) Xuming Hu, Lijie Wen, Yusong Xu, Chenwei Zhang, and S Yu Philip. 2020. Selfore: Self-supervised relational feature learning for open relation extraction. In Proc. of EMNLP, pages 3673–3682.
- Hu et al. (2021) Xuming Hu, Chenwei Zhang, Yawen Yang, Xiaohe Li, Li Lin, Lijie Wen, and S Yu Philip. 2021. Gradient imitation reinforcement learning for low resource relation extraction. In Proc. of EMNLP, pages 2737–2746.
- Jiang et al. (2020) Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.
- Jin et al. (2021) Woojeong Jin, Yu Cheng, Yelong Shen, Weizhu Chen, and Xiang Ren. 2021. A good prompt is worth millions of parameters? low-resource prompt-based learning for vision-language models. arXiv preprint arXiv:2110.08484.
- Kiritchenko and Mohammad (2017) Svetlana Kiritchenko and Saif M Mohammad. 2017. Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. arXiv preprint arXiv:1712.01765.
- Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.
- Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.
- Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
- Liu et al. (2021) Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. Gpt understands, too. arXiv preprint arXiv:2103.10385.
- Liu and Lapata (2019) Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740.
- Liu et al. (2020) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
- Logan IV et al. (2021) Robert L Logan IV, Ivana Balažević, Eric Wallace, Fabio Petroni, Sameer Singh, and Sebastian Riedel. 2021. Cutting down on prompts and parameters: Simple few-shot learning with language models. arXiv preprint arXiv:2106.13353.
- Manning et al. (2014) Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pages 55–60.
- Narayan et al. (2018) Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.
- Perez et al. (2021) Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. arXiv preprint arXiv:2105.11447.
- Schick and Schütze (2020) Timo Schick and Hinrich Schütze. 2020. Few-shot text generation with pattern-exploiting training. arXiv preprint arXiv:2012.11926.
- See et al. (2017) Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.
- Shin et al. (2020) Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. 2020. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980.
- Vaswani et al. (2017) Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. ArXiv, abs/1706.03762.
- Wang et al. (2021) Haonan Wang, Yang Gao, Yu Bai, Mirella Lapata, and Heyan Huang. 2021. Exploring explainable selection to control abstractive summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13933–13941.
- Wang et al. (2019) Wenbo Wang, Yang Gao, He-Yan Huang, and Yuxiang Zhou. 2019. Concept pointer network for abstractive summarization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3076–3085.
- Wolf et al. (2020) Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davison, Sam Shleifer, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
- Yang et al. (2020) Ziyi Yang, Chenguang Zhu, Robert Gmyr, Michael Zeng, Xuedong Huang, and Eric Darve. 2020. Ted: A pretrained unsupervised summarization model with theme modeling and denoising. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 1865–1874.
- Yu et al. (2021) Tiezheng Yu, Zihan Liu, and Pascale Fung. 2021. Adaptsum: Towards low-resource domain adaptation for abstractive summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5892–5904.
- Zhang et al. (2020) Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328–11339. PMLR.
Appendix A Appendix
A.1 Constructing Pseudo Data for Pre-training
We constructed the pseudo data for CNNDM with Lead and applied a simple data-cleaning procedure to the self-supervised pre-training corpus. First, we removed irrelevant information, such as media names, reporter names, and dates, from the summaries. Second, for summaries with fewer than 50 tokens, we iteratively appended the first sentence of the remaining text to the pseudo summary until its length reached 70 tokens; this prevents the target text from being too short to form a meaningful summary. Third, we filtered out samples in which the source document was shorter than its summary.
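A hedged sketch of this Lead construction is given below; whitespace-based token counting and the helper name are simplifications, and the cleaning of media names, reporter names, and dates is omitted.

```python
# Sketch of the Lead pseudo-data construction for CNNDM.
from typing import List, Optional, Tuple

def lead_pseudo_pair(sentences: List[str], lead_n: int = 3,
                     min_len: int = 50, target_len: int = 70) -> Optional[Tuple[str, str]]:
    summary, rest = list(sentences[:lead_n]), list(sentences[lead_n:])

    def n_tokens(sents):
        return sum(len(s.split()) for s in sents)

    # If the lead summary has fewer than 50 tokens, keep moving the first remaining
    # sentence into it until it reaches roughly 70 tokens (or the document runs out).
    if n_tokens(summary) < min_len:
        while n_tokens(summary) < target_len and rest:
            summary.append(rest.pop(0))
    pseudo_summary, pseudo_document = " ".join(summary), " ".join(rest)
    if len(pseudo_document.split()) <= len(pseudo_summary.split()):
        return None           # drop samples whose source is shorter than the pseudo summary
    return pseudo_document, pseudo_summary
```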
For XSum, we constructed the pseudo data for pre-training following GSG. The top-1 most important sentence was selected as the pseudo summary. We then filtered out pseudo summaries that were not relevant enough to their pseudo passages. In particular, we leveraged the hand-written summaries in the few-shot dataset to determine the filtering threshold: we calculated the ROUGE-1 F1 between each ground-truth summary and its corresponding passage, and used the mean and variance of these scores to derive a lower-bound threshold. Pseudo samples whose ROUGE-1 F1 between the pseudo summary and the pseudo passage fell below this threshold were filtered out. Finally, we pre-trained our soft prompts on the filtered pseudo-data. Table 12 shows the statistics of the pre-training corpus.
 | CNNDM Pseudo Corpus | XSum Pseudo Corpus
---|---|---
# of Original Passages | 287,113 | 204,017
# of Pre-training Data | 284,177 | 158,499
A.2 Implementation Details
We first split sentences with the Stanford CoreNLP toolkit (Manning et al., 2014), and input documents were truncated to 1024 BPE tokens. We adopted BART-base for all the experiments. Our implementation was based on the Hugging Face Transformer models (Wolf et al., 2020). We used a mini-batch size of 8 with gradient accumulation over 10 iterations. We used the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.998$ and Noam decay. In the pre-training stage, the peak learning rate was 1e-3 and the warm-up ratio was 10%. During fine-tuning, the peak learning rate was 3e-4, with 100 warm-up steps and 400 epochs. In the decoding stage, we used beam search with a beam size of 4. Decoding stopped when an end-of-sequence (EOS) token was emitted or the generated summary reached 256 tokens. All models were trained on 4 TITAN RTX GPUs.
A.3 The Universality of GSG to Construct Pseudo-data
To demonstrate the universality of using the GSG method to construct pseudo-data for prompt pre-training, we conducted a complementary experiment to test its effect on CNNDM. (We do not conduct this ablation on XSum, as there is no lead bias in that dataset, so it would be inappropriate to take the first sentences of the passage as the pseudo summary.) Specifically, we selected the most important sentences following the GSG procedure. Results in Table 13 indicate that the PSP model pre-trained with GSG is as effective as the original PSP-Lead, showing that GSG can be universally employed to pre-train soft prompts for abstractive summarization.
Model | ROUGE-1 | ROUGE-2 | ROUGE-L
---|---|---|---
PSP-Lead (w/o inner-prompts) | 37.66 | 15.07 | 24.52
PSP-GSG (w/o inner-prompts) | 37.04 | 15.04 | 25.20
Source | Manchester City are monitoring Juventus striker Alvaro Morata. The Spain international was a target for Arsenal a year ago when he left Real Madrid but opted for a move to Italy instead. Real have a buy-back option set at around £15million but are unlikely to take that up. Manchester City are keeping tabs on Juventus striker Alvaro Morata ahead of possible summer bid . Morata closes down Lazio goalkeeper Federico Marchetti during Juve’s Serie A win on Saturday . City also remain keen on Juventus midfielder Paul Pogba but face competition from Paris Saint Germain . Morata has been a success at Juventus but the Italians are always susceptible to offers at the right price and are making plans to bring in Edinson Cavani from Paris Saint Germain despite the French club’s resistance. PSG also are one of the frontrunners to sign Juve midfielder Paul Pogba. City remain one of the clubs pushing to sign the France international too and Juventus have City’s Edin Dzeko, Aleksandar Kolarov and Stevan Jovetic on their list of potential targets for the summer. |
---|---|
PSP | Manchester City are keeping tabs on Juventus striker Alvaro Morata . The Spain international was a target for Arsenal a year ago when he left Real Madrid but opted for a move to Italy instead . Juventus have a buy-back option set at around £15million but are unlikely to take that up . |
Prefix-Tuning | Manchester City are keeping tabs on Juventus striker Alvaro Morata ahead of possible summer bid. Morata closes down Lazio goalkeeper Federico Marchetti during Juve’s Serie A win on Saturday. |
Prompt Tuning | LMorata is a target for Real Madrid, who have a buy - back option set at around £ 15million . He has been a success at Real Madrid but opted for a move to Italy instead . The Italian club are unlikely to take that up . . .. . . ." . .." . ." ." ." . ."." . .". . ."". . ." . . " . . … . . [and] . . (He] had a great season at Real but who more had been about” . . , but who else had been close enough to sign him? . . He had her all after." . .) . . She didn’t year when when when two over people A lot more time than two theret. . ." —LastJustInOther |
Full-Model Tuning | NEW City are monitoring Juventus striker Alvaro Morata . The Spain international was a target for Arsenal a year ago when he left Real Madrid . Morata has a buy-back option set at around £15million but is unlikely to be offered by the French side . |
Reference | Alvaro Morata had attracted interest from Arsenal before joining Juventus . Spain international made move to Italy in £15million deal from Real Madrid . Manchester City are monitoring the 22-year-old after impressive season . |
Source | Nairobi, Kenya (CNN)University of Nairobi students were terrified Sunday morning when they heard explosions – caused by a faulty electrical cable – and believed it was a terror attack, the school said. Students on the Kikuyu campus stampeded down the halls of the Kimberly dormitory, and some jumped from its fifth floor, the university said. Hundreds were injured and were taken to hospitals. One person died, according to the school. The confusion and panic came less than two weeks after Al-Shabaab slaughtered 147 people at a college in Garissa, Kenya. Kenyan teachers and students have said they fear being targeted by the Somalia-based terrorists. On Sunday, as many as 108 students from the University of Nairobi were admitted to Kenyatta National Hospital. Among them, at least 63 students have been discharged, and at least four are slated for surgery, the school said. Almost all of the 54 students being treated at PCEA Kikuyu Hospital have been released, the university said. Kenya Power authorities and its CEO are at the school and looking into the electrical issue. Normal power supply will resume after repairs, the university said. "As we mourn the unfortunate loss of the departed student, we are also praying for the quick recovery of those who were injured," said Vice Chancellor Peter M.F. Mbithi in a statement. He called on the students, staff and public to remain calm. CNN’s Lillian Leposo reported from Nairobi and Ashley Fantz wrote this story in Atlanta. |
PSP | University of Nairobi students were terrified Sunday morning when they heard explosions – caused by a faulty electrical cable – and believed it was a terror attack . Hundreds were injured and were taken to hospitals . One person died, according to the school . The confusion and panic came less than two weeks after Al- Shabaab slaughtered 147 people at a college in Garissa . |
Prefix-Tuning | University of Nairobi students were terrified Sunday morning when they heard explosions – caused by a faulty electrical cable – and believed it was a terror attack. |
Prompt Tuning | L Hundreds were injured and taken to hospitals . Hundreds of students were evacuated from the dormitory . The school said they feared being targeted by al-Shabaab . Hundreds were evacuated and treated in hospitals . The university said it is investigating the cause of the explosion . . . The explosion was caused by a faulty electrical cable. . .. . ." . . ." ." ." . ."." . .." . .""People were terrified," said the school’s vice chancellor . "People were screaming, but who more had been were about” . "We had no idea what was going on but who else had been about to blow her all after." … .. ." .."." ..""They were terrified at the time than two overtakes" —LastJustIn3 |
Full-Model Tuning | NEW students panicked when they heard explosions – caused by a faulty electrical cable – and believed it was a terror attack, university says . As many as 108 students from University of Nairobi were admitted to Kenyatta National Hospital . One person died, according to the school . |
Reference | Students stampeded; some jumped from a fifth story at a dorm; one student died, school officials say . The blasts were caused by faulty electrical cable, and Kenya Power is at the school . The panic came less than two weeks after terrorists attacked Kenya’s Garissa University . |
Source | Brunon Kwiecien, 48, was convicted of planning a terrorist attack, illegal weapons possession and inciting two students to carry out an attack.He suggested he had been manipulated by Polish intelligence agents.Kwiecien was said to be fascinated with Norwegian mass killer Anders Behring Breivik.Right-wing extremist Breivik killed 77 people in a bombing and shooting rampage in Norway in July 2011.Kwiecien, a former professor at Krakow’s University of Agriculture, was arrested in 2012.Investigators believe he wanted to target parliament with four tonnes of explosives while then-President Bronislaw Komorowski and former Prime Minister Donald Tusk were due to attend, the trial heard."If Brunon Kwiecien hadn’t been stopped, we would be talking amid the ruins of the state today," said judge Aleksandra Almert, according to the AFP agency.While admitting he planned to carry out the attack, he also said he was subject to "provocation" by the intelligence services.Kwiecien is the first Pole to be accused of terrorism, Polish media reported. He has no known links to established extremist groups. |
---|---|
PSP | A Pole has been convicted of planning a terrorist attack in Poland, a court heard. |
Prefix-Tuning | A Pole has been convicted of planning to carry out a terrorist attack in Poland. |
Prompt Tuning | AA Polish man has been convicted of planning a terrorist attack in the Polish capital, Warsaw, on Thursday. |
Full-Model Tuning | A Pole has been found guilty of planning a terrorist attack in the Polish state of Krakow. |
Reference | A Polish university lecturer has been sentenced to 13 years in jail for plotting to ram a car packed with explosives into parliament. |
Source | Schmidt was sent off by the referee for insulting Hoffenheim’s Julian Nagelsmann in Saturday’s 3-0 home loss."That was nothing, what sort of a nutcase are you? Just shut your mouth," Schmidt shouted after going 2-0 down.The 49-year-old has been banned for two games and handed a 15,000 euros (£13,373) fine.The German was sanctioned after triggering a suspended sentence from February this year.He had been banned for three games, with a further two in the event of a repeat offence before June 2017, for refusing a referee’s order to leave the sidelines during a 1-0 defeat to Borussia Dortmund.Schmidt will be unable to have any contact with the team for half an hour before, during and after Tuesday’s German Cup second-round match against Lotte and Saturday’s league match against Wolfsburg.Leverkusen’s director of sport Rudi Voller has sought a meeting with the head of the disciplinary committee. |
PSP | Leverkusen defender Christian Schmidt has been banned for two games for insulting the referee. |
Prefix-Tuning | Leverkusen midfielder Matthias Schmidt has been banned for two games after refusing to leave the sidelines during a match against Wolfsburg. |
Prompt Tuning | ALeverkusen midfielder Christian Schmidt has been banned for two games for insulting the referee in a game against Hoffenheim on Saturday..’ |
Full-Model Tuning | Aeverkusen manager Gerhard Schmidt has been banned for two games for insulting the head of the German national team. |
Reference | Bayer Leverkusen head coach Roger Schmidt has been banned and fined for calling an opposing manager "a nutcase" during a Bundesliga game. |