PSP: Pre-trained Soft Prompts for Few-Shot Abstractive Summarization
Abstract
Few-shot abstractive summarization is a challenging task in natural language generation. To support it, we developed a novel soft-prompt architecture coupled with a prompt pre-training plus prompt fine-tuning paradigm, which is effective and tunes only an extremely small number of parameters. To match the structure of generation models, the soft prompts comprise continuous input embeddings across the encoder and the decoder. Importantly, a new inner-prompt placed in the text is introduced to capture document-level information; the aim is to devote more attention to understanding the document so that the model is better prompted to generate document-related content. In the training process, prompt pre-training with self-supervised pseudo-data first teaches the model basic summarization capability. Then, with few-shot examples, only the designed lightweight soft prompts are fine-tuned. Experimental results on the CNN/DailyMail and XSum datasets show that our method, with only 0.1% of the parameters, outperforms full-model tuning, where all model parameters are tuned. It also surpasses Prompt Tuning by a large margin and delivers competitive results against Prefix-Tuning with 3% of the parameters.
1 Introduction
Given the high labor cost of obtaining quality abstractive summaries, few-shot abstractive summarization is in high demand yet highly challenging. A widely accepted paradigm for almost all NLP tasks is to fine-tune the entire set of parameters of a large pre-trained language model to suit the target task (Liu and Lapata, 2019; Liu et al., 2020).
However, fine-tuning with few-shot examples usually leads to disappointing results, especially for generation tasks like abstractive summarization (Fabbri et al., 2020; Yu et al., 2021); the likely outcome is an overfit model. Further, for every specific task, a large number of pre-trained parameters must be updated and stored, which is inefficient.
Pre-trained language models can be few-shot learners: GPT-3 (Brown et al., 2020), for instance, surprisingly performs generation tasks from a few examples without any further gradient updates. Although it lacks a rigorous theoretical proof, prompt learning inherits this few-shot property (Li and Liang, 2021; Schick and Schütze, 2020; Jin et al., 2021; Liu et al., 2021). Commonly, this type of learning is considered to retrieve relevant knowledge from frozen language models, tuning only continuous prompts to quickly adapt to new tasks with very few examples.
More recently, Prompt Tuning (Lester et al., 2021) has received much attention. With large frozen language models (say, 10 billion parameters), Prompt Tuning simply adds a tunable soft prompt to the input of the encoder, achieving results comparable to full-model tuning. Yet our empirical results in Section 2 demonstrate that Prompt Tuning yields abysmal performance for abstractive summarization. Prefix-Tuning (Li and Liang, 2021) extends prompt learning to the natural language generation area. With this technique, continuous prompts are applied to every layer of the pre-trained model, and it even shows gains over fine-tuning on few-shot generation tasks. Yet the training process is not stable, and the per-layer updates add to the memory and training costs. (See more related work in Section 5.)

Given the shortcomings of these two methods, we have developed a soft-prompt tuning method specifically designed for summarization. The structure is given in Figure 1. The method performs few-shot language generation (i.e., abstractive summarization) with an efficient number of training parameters. Prompt tokens are added before the decoder input tokens to guide the generation process toward the target summary. Moreover, we have designed three kinds of inner prompts – interval, sequential, and fixed-length – one of which is placed among the source input tokens. The aim is to capture the structure of the source document and aid in understanding its semantics, so as to better prompt the model to generate document-related content. Each kind of inner prompt focuses on different semantic units (e.g., phrases, sentences, etc.), differentiating important units from non-informative ones. To bolster the summarization ability of the model and help the prompts understand the documents, prompt pre-training on self-supervised pseudo-data is performed before the tuning process. As a last step, all the prompts are fine-tuned with few-shot training examples. Experiments conducted on two commonly used datasets – CNNDM (See et al., 2017) and XSum (Narayan et al., 2018) – demonstrate that our method outperforms full-model tuning under few-shot settings with only 0.1% of the parameters. It also surpasses naive Prompt Tuning by a large margin and yields performance competitive with Prefix-Tuning using 3% of its trainable parameters. A detailed analysis shows that the designed prompt pre-training phase and the inner prompts are effective for few-shot text summarization.

The major contributions of this work are: 1) a novel soft-prompt architecture for few-shot abstractive summarization, whose well-designed prompts in the embedding layer let the model fulfill the task effectively and efficiently; 2) a prompt pre-training strategy that benefits the soft-prompt model for few-shot summarization and shows excellent zero-shot capabilities; and 3) experiments that investigate the effect of different prompts by probing the attention weights, showing that our model is able to extract knowledge from the encoder language model, understand the discourse in the document, and guide the decoder language model to generate fluent summaries.
2 Pilot Experiments
In a pilot study, we experimented with Prompt Tuning under a 300-shot setting to find clues as to how to design summarization prompts for the task. Our findings follow.
Consider an encoder-decoder language model based on the Transformer architecture (Vaswani et al., 2017) (e.g., BART (Lewis et al., 2020)), parameterized by $\theta$. To conduct a few-shot summarization task, we have a small number of training pairs, each consisting of a document $X$ and a corresponding summary $Y$. Specifically, we divided $X$ into subsets with sentences as our unit (note that, throughout this work, a "sentence" can be an arbitrary span of contiguous text, e.g., a fixed length of 10 tokens, or an actual linguistic sentence), i.e., $X = \{S_1, S_2, \dots, S_m\}$, where $x^i_j$ denotes the $j$-th token in the $i$-th sentence $S_i$.
First, original Prompt Tuning is applied by concatenating a series of prompt tokens $P_E$, parameterized by $\theta_E$, to the encoder input $\mathrm{emb}(X)$, where $\mathrm{emb}(\cdot)$ represents the embedding of each token (the leftmost structure in Figure 1). The gradients are backpropagated only through the prompts, and the weights of the language model are kept frozen (Lester et al., 2021). In this way, the model maximizes the likelihood of the output $Y$:
$\max_{\theta_E} \; \log p_{\theta, \theta_E}\big(Y \mid [P_E;\ \mathrm{emb}(X)]\big)$  (1)
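As an illustration of this setup, the sketch below shows one way such encoder-side soft prompts can be realized on top of a frozen BART checkpoint with Hugging Face Transformers; it is a minimal sketch under our own naming (e.g., `num_prompt_tokens`, `encoder_inputs_with_prompt`), not the authors' implementation.

```python
# Minimal sketch of vanilla Prompt Tuning on the encoder side (Eq. 1), assuming a
# Hugging Face BART checkpoint; all identifiers below are illustrative.
import torch
import torch.nn as nn
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
for p in model.parameters():          # freeze every language-model parameter (theta)
    p.requires_grad = False

num_prompt_tokens = 100               # prompt length used in the paper
d_model = model.config.d_model        # 768 for BART-base

# Initialize the soft prompt P_E from randomly drawn vocabulary embeddings (Lester et al., 2021).
init_ids = torch.randint(0, model.config.vocab_size, (num_prompt_tokens,))
encoder_prompt = nn.Parameter(model.get_input_embeddings()(init_ids).detach().clone())

def encoder_inputs_with_prompt(input_ids: torch.Tensor) -> torch.Tensor:
    """Return [P_E; emb(X)]: the prompt embeddings concatenated before the token embeddings."""
    token_embeds = model.get_input_embeddings()(input_ids)                  # (B, L, d_model)
    prompt = encoder_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)  # (B, 100, d_model)
    return torch.cat([prompt, token_embeds], dim=1)                         # (B, 100 + L, d_model)
```

At forward time, the resulting tensor would be passed through `inputs_embeds` instead of `input_ids`, with the attention mask extended by the 100 prompt positions.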
The result of original Prompt Tuning is shown on the first line of Table 1, where we see that it severely underperforms full-model tuning. In a further experiment, we instead added a series of prompts $P_D$, parameterized by $\theta_D$, to the decoder inputs to guide the generation of $Y$; here, we found the results to be even worse than the encoder-only variant (second line of Table 1).
Necessary Prompts for Generation
For generation-based tasks, prompts in both the encoder and the decoder are equally important. Therefore, our model employs a combination of the two series of prompts mentioned above and generates $Y$ conditioned on $P_E$, $\mathrm{emb}(X)$, and $P_D$:
$\max_{\theta_E, \theta_D} \; \log p_{\theta, \theta_E, \theta_D}\big(Y \mid [P_E;\ \mathrm{emb}(X)],\ P_D\big)$  (2)
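Continuing the sketch above, conditioning on both $P_E$ and $P_D$ (Eq. 2) can be wired up as follows; the manual right-shift of the labels and the `-100` masking of the decoder-prompt positions are our assumptions about one reasonable implementation, not the authors' code.

```python
# Sketch of Eq. 2: encoder prompt P_E plus decoder prompt P_D, reusing `model`,
# `encoder_prompt`, `num_prompt_tokens`, and `encoder_inputs_with_prompt` from above.
decoder_init_ids = torch.randint(0, model.config.vocab_size, (num_prompt_tokens,))
decoder_prompt = nn.Parameter(model.get_input_embeddings()(decoder_init_ids).detach().clone())

def forward_with_prompts(input_ids, attention_mask, labels):
    """Compute the MLE loss with prompts prepended to both encoder and decoder inputs.
    For brevity, `labels` is assumed to hold plain target token ids (no -100 entries)."""
    batch = input_ids.size(0)
    enc_embeds = encoder_inputs_with_prompt(input_ids)                          # [P_E; emb(X)]
    enc_mask = torch.cat([attention_mask.new_ones(batch, num_prompt_tokens),
                          attention_mask], dim=1)
    # Teacher forcing: decoder inputs are the target tokens shifted right by one position.
    start = labels.new_full((batch, 1), model.config.decoder_start_token_id)
    decoder_input_ids = torch.cat([start, labels[:, :-1]], dim=1)
    dec_embeds = torch.cat([decoder_prompt.unsqueeze(0).expand(batch, -1, -1),
                            model.get_input_embeddings()(decoder_input_ids)], dim=1)
    # The decoder prompt positions carry no target tokens, so they are ignored in the loss.
    padded_labels = torch.cat([labels.new_full((batch, num_prompt_tokens), -100), labels], dim=1)
    out = model(inputs_embeds=enc_embeds, attention_mask=enc_mask,
                decoder_inputs_embeds=dec_embeds, labels=padded_labels)
    return out.loss
```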
Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---
Prompt in encoder | 32.87 | 11.92 | 21.73 |
Prompt in decoder | 26.77 | 11.73 | 16.71 |
Prompt in en.&de. | 36.37 | 14.41 | 24.46 |
Full-Model Tuning | 37.01 | 14.49 | 23.91 |
The results on the third line of Table 1 again verify our hypothesis: prompts across the encoder and decoder achieve results comparable to full-model tuning under few-shot settings. This verifies two things for us. First, prepending simple prompts to only the input embedding layer is effective and efficient for few-shot abstractive summarization. Second, prompts across the encoder and decoder are both necessary for generation tasks.

Lack of Attention on the Document
We further explored the encoder-decoder attention to investigate the effect of the prompts and of freezing the language model. From Figure 2, we find that the generated output mainly attends to the soft prompts, with little attention given to the document itself. This outcome is detrimental to summarization, which requires understanding the semantics and inner discourse structure of documents (Wang et al., 2019). Without associations between target summaries and source documents, it is impossible to obtain high-quality summaries using the current prompt architectures.
From Figure 2, we can also observe that the prompts in the encoder and those in the decoder are consistently and directly associated with each other. We speculate that the encoder prompts retrieve relevant knowledge from the frozen encoder language model as a document representation, while the decoder prompts copy the encoder prompts' behaviour and guide the decoder language model to generate text.

3 Method
In light of our findings about the current architectures, we developed a new pre-trained soft-prompt architecture for few-shot abstractive summarization, called PSP. The framework includes continuous prompts across the encoder and decoder inputs, as well as inner-prompts that capture the dependencies between documents and target summaries. To better understand a given document, we add a prompt pre-training process before few-shot tuning, which also provides a good initialization for the prompts. The overall architecture and training scheme are illustrated in Figure 3.
3.1 Encoder-Decoder Basic Prompts
As mentioned in Section 2, in the training phase of current architectures, the encoder prompts $P_E$ are responsible for extracting knowledge from the frozen encoder language model as a document representation. Meanwhile, the decoder prompts $P_D$ mostly copy the behavior of $P_E$ and guide the frozen decoder language model to generate fluent text as a summary.
To strengthen the model's ability to understand a document, the dependencies on and the attention given to the source document need to be embodied in the prompt architecture.
3.2 Inner-Prompts for Document Understanding
To achieve this goal, we propose adding inner-prompts within the source document, denoted as $P_I$ with parameters $\theta_I$ to be updated, where each sentence $S_i$ is assigned an inner-prompt token $I_{S_i} \in P_I$. These inner-prompts are added to the corresponding token embeddings, which gives rise to a new embedding $\mathrm{emb}'(X)$:
$\mathrm{emb}'(x^i_j) = \mathrm{emb}(x^i_j) + I_{S_i}$  (3)
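A minimal sketch of this addition is given below; the `sentence_ids` tensor, which maps every token position to the index of its assigned inner-prompt token, is assumed to come from a pre-processing step, and the initialization scale is illustrative.

```python
# Sketch of Eq. 3: each token embedding is summed with the inner-prompt vector of its unit.
import torch
import torch.nn as nn

num_inner_tokens, d_model = 61, 768   # e.g., 61 inner-prompt tokens for CNNDM; BART-base hidden size
inner_prompt = nn.Parameter(torch.randn(num_inner_tokens, d_model) * 0.02)   # theta_I

def add_inner_prompts(token_embeds: torch.Tensor, sentence_ids: torch.Tensor) -> torch.Tensor:
    """token_embeds: (B, L, d); sentence_ids: (B, L) with values in [0, num_inner_tokens)."""
    return token_embeds + inner_prompt[sentence_ids]   # emb'(x) = emb(x) + I_{S_i}
```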
We believe that by prompting different semantic units (e.g., sentences, phrases, etc.), more attention can be given to understanding the document's discourse. Furthermore, the inner-prompts help the model quickly interpret the document by strengthening the associations between outputs and documents. What follows are three different strategies for assigning the inner-prompts; there is more discussion on this point in Section 4.2.

Interval
Following Liu and Lapata (2019), the interval strategy uses two inner-prompt tokens, one of which is assigned to each sentence $S_i$ depending on whether $i$ is odd. Specifically,
$I_{S_i} = \begin{cases} I_{\mathrm{odd}}, & i \bmod 2 = 1 \\ I_{\mathrm{even}}, & i \bmod 2 = 0 \end{cases}$  (4)
In this way, the model can identify important sentences and encode the document at the sentence level.
Sequential
To highlight the complex discourse structure of documents, sentence positions need to be considered. Therefore, a distinct inner-prompt token is assigned to each sentence according to its position in the sequence, formulated as:
$I_{S_i} = I_i, \quad i = 1, 2, \dots, n$  (5)
where $I_1, I_2, \dots, I_n$ are distinct inner-prompt tokens.
Fixed-length
To discover more fine-grained semantic units, text spans with a fixed length $k$ are treated as new "sentences", and a corresponding sequential token is assigned to each of them. That is, prompts are assigned to the newly divided spans $[S'_1, S'_2, \dots]$ in the same way as the sequential strategy, $I_{S'_i} = I_i$. Figure 4 illustrates examples of the three strategies.
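To make the three strategies concrete, the hedged sketch below computes, for one document, the inner-prompt index of every token; the helper names are ours, and the reuse of the last token for overflowing documents follows the setup described later in Section 4.

```python
# Hedged sketch of the interval, sequential, and fixed-length assignment strategies.
# `sent_lengths` holds the token count of each sentence in one document, `k` the fixed
# span length, and `n_max` the number of distinct inner-prompt tokens (e.g., 61 for CNNDM).
from typing import List

def interval_ids(sent_lengths: List[int]) -> List[int]:
    # Two alternating tokens: odd-numbered sentences get index 0, even-numbered ones index 1.
    return [i % 2 for i, length in enumerate(sent_lengths) for _ in range(length)]

def sequential_ids(sent_lengths: List[int], n_max: int) -> List[int]:
    # A distinct token per sentence position; text beyond position n_max shares the last token.
    return [min(i, n_max - 1) for i, length in enumerate(sent_lengths) for _ in range(length)]

def fixed_length_ids(num_tokens: int, k: int, n_max: int) -> List[int]:
    # Re-segment the document into spans of k tokens and assign sequential tokens to the spans.
    return [min(j // k, n_max - 1) for j in range(num_tokens)]
```

These token-level indices are exactly what the `sentence_ids` tensor in the Eq. 3 sketch would contain.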
3.3 Self-supervised Prompt Pre-training
To improve the ability of the prompts to understand documents and to help the model adapt to summarization tasks, the soft prompts are further pre-trained on a corpus using summarization-oriented self-supervised objectives. This also means the prompts are well initialized for few-shot tuning.
We tested two strategies for constructing the self-supervised data, each designed to suit a particular type of writing bias in the documents: "Lead" and "gap sentences generation (GSG)".
Lead
Lead bias is common in news articles, which usually follow an inverted-pyramid structure where the first few sentences contain the most salient information (See et al., 2017; Yang et al., 2020). For documents with this type of bias, we selected the first three sentences as the target summary and treated the rest of the document as the source text. With this type of prompt pre-training, the model learns to infer the salient information from the remaining text.
GSG
Gap sentences generation applies to documents that do not follow the lead-bias structure (e.g., XSum (Narayan et al., 2018)). The strategy follows Zhang et al. (2020): we used the ROUGE-1 F1 (Lin, 2004) between each sentence and the rest of the document as a proxy for a principal score $s_i$. The top-$m$ most important sentences were selected according to $s_i$ and removed from the document; these sentences were then concatenated, in the same order as in the original text, to form a pseudo summary. The remainder of the text was treated as the pseudo document.
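A hedged sketch of this GSG-style construction is shown below, scoring sentences with the `rouge_score` package; the sentence splitting, the `top_m` parameter, and the function names are our assumptions rather than the authors' exact pipeline.

```python
# Sketch of gap-sentences generation (GSG) pseudo-data construction.
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def gsg_pseudo_pair(sentences, top_m=1):
    """Select the top-m sentences by ROUGE-1 F1 against the rest of the document
    as the pseudo summary; the remaining sentences form the pseudo document."""
    scores = []
    for i, sent in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scores.append(_scorer.score(rest, sent)["rouge1"].fmeasure)
    top = sorted(sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_m])
    pseudo_summary = " ".join(sentences[i] for i in top)          # keep original sentence order
    pseudo_document = " ".join(s for i, s in enumerate(sentences) if i not in set(top))
    return pseudo_document, pseudo_summary
```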
With the constructed data, our designed prompts can be pre-trained and further tuned with few-shot examples.
3.4 Training Objective
The model is trained with maximum likelihood estimation (MLE). Given a ground-truth summary $Y = (y_1, y_2, \dots, y_T)$ for an input passage $X$, the objective is to minimize the negative log-likelihood of the target word sequence:
$\mathcal{L} = -\sum_{t=1}^{T} \log p_{\theta, \theta_E, \theta_D, \theta_I}\big(y_t \mid y_{<t}, [P_E;\ \mathrm{emb}'(X)],\ P_D\big)$  (6)
Note that only the prepended-prompt parameters ($\theta_E$, $\theta_D$) and the inner-prompt parameters ($\theta_I$) are optimized; the language model parameters ($\theta$) are kept frozen.
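Putting the pieces together, a minimal training-step sketch for Eq. 6 could look as follows, building on the earlier sketches; the Adam hyper-parameters and peak learning rate mirror those reported in Appendix A.2, while the warm-up/decay schedule is omitted for brevity.

```python
# Only the prompt parameters (theta_E, theta_D, theta_I) receive gradients; theta stays frozen.
# In the full PSP setup, emb(X) inside `encoder_inputs_with_prompt` would be replaced by
# emb'(X) from Eq. 3, i.e., token embeddings with the inner prompts added.
prompt_params = [encoder_prompt, decoder_prompt, inner_prompt]
optimizer = torch.optim.Adam(prompt_params, lr=3e-4, betas=(0.9, 0.998))

def train_step(batch):
    loss = forward_with_prompts(batch["input_ids"], batch["attention_mask"], batch["labels"])
    loss.backward()               # gradients flow only into the three prompt tensors
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```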
4 Experiments
Datasets
We experimented with the CNN/DailyMail (CNNDM) dataset (Hermann et al., 2015) and the XSum dataset (Narayan et al., 2018). We chose these datasets because they differ in abstraction level and text length, which helps to show the generalization ability of our results.
We constructed the self-supervised pre-training data for CNNDM with Lead, and for XSum with GSG; details are given in Section A.1 in the appendix. Given that the lead-bias structure exists only in some domain-specific datasets, we also conducted experiments to demonstrate the universality of GSG for constructing pseudo-data; the results are shown in Section A.3 in the appendix. Our few-shot training set contained 300 document-summary pairs randomly sampled from the original training data. To tune the hyper-parameters and select the best checkpoint, we composed a validation set of 300 examples from the original validation data. Here, we were careful to ensure that the validation set was no larger than the training set, so that the setup fits a true few-shot learning setting, following Perez et al. (2021). Since few-shot learning may have high variance, we sampled the examples with 5 different random seeds. We used the original test set to report our results, including the mean value and the standard deviation. Table 2 shows the statistics of the pre-processed corpus.
Datasets | CNNDM train | CNNDM dev | CNNDM test | XSum train | XSum dev | XSum test
---|---|---|---|---|---|---
Avg. Passage | 697.45 | 676.64 | 717.92 | 396.53 | 387.62 | 380.55
Avg. Summary | 55.91 | 51.97 | 58.62 | 22.90 | 23.29 | 22.11
Labeled data | 300 | 300 | 11,490 | 300 | 300 | 11,333
Setup
The base version of BART was used in our work. Following Lester et al. (2021), we used 100 prompt tokens for both the encoder inputs and the decoder inputs; these prompts were randomly initialized from vocabulary embeddings. The sequential and fixed-length inner-prompts require a maximum number of tokens $n$. Hence, we counted the number of sentences in each document and divided the documents into two groups – the 85% with the fewest sentences (Group A) and the 15% with the most sentences (Group B). (We made the division at 85% so that all inner-prompt token embeddings could be fully trained, because sentences after the $n$-th only exist in 15% of the data.) We then set $n$ to the largest number of sentences in Group A plus one: 61 for CNNDM and 33 for XSum. In this way, one inner-prompt token was assigned to each sentence position up to $n$. For the excessively long documents in Group B, all text after the first $n-1$ sentences was assigned the $n$-th token. Further, we drew from a normal distribution to initialize the inner-prompt embeddings. (More implementation details are given in Section A.2 in the appendix.) Taking CNNDM as an example, all the tunable parameters that need to be stored amount to only around 0.1% of the parameters of full-model tuning; the ratio is similar for XSum.
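The choice of the inner-prompt vocabulary size described above can be approximated with a short percentile computation; using `numpy.percentile` for the 85%/15% split is our simplification.

```python
# Hedged sketch: derive the number of distinct inner-prompt tokens from the corpus by
# taking the 85th percentile of per-document sentence counts (Group A's maximum) plus one
# shared token for the overflow text of longer documents (Group B).
import numpy as np

def num_inner_prompt_tokens(sentence_counts, percentile=85):
    group_a_max = int(np.percentile(sentence_counts, percentile))
    return group_a_max + 1    # 61 for CNNDM and 33 for XSum with the paper's statistics
```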
Evaluation Metrics
We adopted ROUGE Lin (2004) to measure the quality of the summaries produced in our experiments. The F1 scores for ROUGE-1, ROUGE-2, and ROUGE-L between the ground-truth and the generated summaries are each reported.
Baseline Models
We compared PSP to: Prompt Tuning (Lester et al., 2021), which only concatenates soft prompts into the encoder input; Prefix Tuning (Li and Liang, 2021), which adds a prefix to all the encoder layers, cross-attention layers, and the decoder layers; and Full-Model Tuning, which does not have any prompts and fine-tunes all the parameters of the pre-trained language model.
Model | CNNDM R-1 | CNNDM R-2 | CNNDM R-L | CNNDM PPL | XSum R-1 | XSum R-2 | XSum R-L | XSum PPL
---|---|---|---|---|---|---|---|---
Prompt Tuning | | | | | | | |
Prefix-Tuning | | | | | | | |
Full-Model Tuning | | | | | | | |
PSP-Interval | | | | | | | |
PSP-Sequential | | | | | | | |
PSP-Fixed-k | | | | | | | |
4.1 Experimental Results of Our Method
Table 3 presents the results of all PSP variants and the baselines on the CNNDM and XSum datasets. With the exception of the ROUGE-2 and ROUGE-L scores of Prefix-Tuning on CNNDM, our proposed PSP outperforms all baselines; even in those cases, PSP delivers competitive results with only 3% of Prefix-Tuning's parameters, which is an acceptable trade-off. To our surprise, half of PSP's scores surpass full-model tuning, especially on XSum, as underlined in the table. Besides, the PPL results show that PSP generates more fluent summaries than the other models. These results indicate that fine-tuning large language models is not necessarily a good or efficient idea for few-shot generation, and that soft prompts with frozen language models are effective for few-shot abstractive summarization. Moreover, the results verify that PSP is effective with each of its three inner-prompt strategies.
Efficiency vs. effectiveness.
We give an overall comparison to the baseline models on effectiveness and memory efficiency, evaluated by ROUGE and the number of parameters, respectively. The results are shown in Table 4. Prompt Tuning has the fewest parameters, but its capacity is correspondingly limited and it lacks control over the decoder side, so it cannot perform natural language generation tasks well. We can see that substantial gains are made when going from vanilla Prompt Tuning to PSP. However, even though Prefix-Tuning uses nearly thirty times more parameters than our method, it offers at best marginal improvements and even degrades on some metrics. Besides, Prefix-Tuning relies on a reparameterization trick to stabilize training, i.e., an MLP with a large number of parameters added during the training stage. Our method provides the best effectiveness-efficiency trade-off: it outperforms full-model tuning with only 0.1% of the parameters and presents competitive results against Prefix-Tuning with 3% of its parameters.
Model | # Train | # Store | ROUGE-1 (CNNDM) | ROUGE-1 (XSum)
---|---|---|---|---
PSP | | | 38.32 | 32.86
Prefix-Tuning | | | 37.12 | 32.18
Prompt Tuning | | | 30.58 | 29.63
Full-Model Tuning | | | 38.03 | 32.85
Human Evaluation
We conducted a human evaluation study. To this end, we randomly selected 20 instances from the test set of each dataset. Ten graduate students with a high level of fluency in English were asked to assess the generated summaries and the gold summaries from independent perspectives (Wang et al., 2021): Informativeness (how much useful information does the summary provide?), Relevance (how well does the summary reflect the input document?), and Fluency (how grammatically correct are the summary sentences and how easy are they to read?). Scoring followed the Best-Worst Scaling method (Kiritchenko and Mohammad, 2017): participants were asked to select the best and worst summaries from each perspective, and the score of a system was computed as the percentage of times it was chosen as the best minus the percentage of times it was selected as the worst, ranging from -1 (worst) to 1 (best). Results are shown in Table 5. Qualitatively, we show several examples generated by the different models, together with the references, in Table 14 and Table 15 in the appendix. Compared with all baselines, the summaries generated by PSP are consistently more fluent and more relevant to the source document, consistent with the results of the human evaluation. Furthermore, we found that summaries generated by PSP and Prefix-Tuning are often similar in sentence patterns and expressions; however, Prefix-Tuning tends to generate shorter texts than PSP, which often leads to a loss of information.
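For reference, the Best-Worst Scaling score described above can be computed as in the short sketch below; the assumed data layout is one best pick and one worst pick per annotation.

```python
# Best-Worst Scaling: fraction of annotations in which a system is picked best,
# minus the fraction in which it is picked worst; the result lies in [-1, 1].
from collections import Counter
from typing import Dict, List

def bws_scores(best_picks: List[str], worst_picks: List[str]) -> Dict[str, float]:
    n = len(best_picks)                       # one (best, worst) pair per annotation
    best, worst = Counter(best_picks), Counter(worst_picks)
    return {s: (best[s] - worst[s]) / n for s in set(best) | set(worst)}
```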
Methods | CNNDM IF | CNNDM RL | CNNDM FL | XSum IF | XSum RL | XSum FL
---|---|---|---|---|---|---
PSP | 0.500 | 0.708 | 0.667 | 0.217 | 0.275 | 0.492
Prompt Tuning | -0.317 | -0.758 | -0.975 | -0.336 | -0.400 | -0.867
Prefix-Tuning | -0.233 | 0.067 | 0.158 | 0.017 | -0.008 | 0.292
Full-Model Tuning | 0.067 | -0.025 | 0.075 | 0.117 | 0.092 | 0.075
Selection of the fixed length $k$.
As shown in Table 3, PSP-Fixed-k performs consistently well on both datasets. We therefore further explored the influence of different fixed lengths $k$, i.e., $k \in \{5, 10, 15, 30\}$, for the inner-prompt tokens of PSP-Fixed-k. (The average number of tokens per sentence in both datasets is about 18, so we did not consider a fixed length of 20, given its similarity to PSP-Sequential.) Table 6 presents the results of these variants on XSum. We observe that segmented spans of 10 tokens achieve the best performance. Interestingly, this suggests that, to understand a document, it can be reorganized into several semantic units of about 10 tokens each. We also report the results of different $k$ on our validation set in Table 6; the ranking is consistent with the test set. From a practical perspective, when applying PSP to a new dataset, we can choose the best $k$ based on the validation set.
k | R-1 | R-2 | R-L | R-1 | R-2 | R-L
---|---|---|---|---|---|---
5 | 34.27 | 11.90 | 26.41 | 31.90 | 10.28 | 24.20
10 | 35.31 | 12.88 | 26.85 | 32.89 | 11.13 | 25.51
15 | 34.98 | 11.68 | 26.45 | 32.11 | 10.46 | 24.72
30 | 34.48 | 12.57 | 26.55 | 32.20 | 11.03 | 25.30
4.2 Analyses on Soft Prompts
Does our model attend to understanding documents?
Following the analysis of Figure 2, we further present the encoder-decoder attention distribution of PSP; the comparison is visualized in Figure 5. We find the following enhancements from introducing the inner prompts. First, PSP strengthens the associations between the encoder prompts and the decoder prompts compared to the original model. Second, the soft prompts are more strongly related to the output, indicating semantic relations between them. Third, the output assigns more attention to the source document. This suggests that the hidden structure of the document is emphasized, increasing the model's capability to understand its semantics. As such, the prompts can properly select salient information from the document and prompt the model to generate the output.

Do inner prompts help the model understand the content of documents, or do they simply increase the model's capacity?
Instead of using inner-prompts, we prepended additional tunable tokens (150 tokens in total) in front of the encoder and decoder inputs. Comparison results are shown in Table 7. Despite the larger capacity, the soft prompts with 150 tunable tokens before the input, denoted as Soft prompts (en.&de., 150), performed the worst. This suggests that the inner-prompts, with only a few parameters, do help the model understand the document by prompting its structure, rather than simply adding more trainable parameters to increase capacity.
Model | CNNDM R-1 | CNNDM R-2 | CNNDM R-L | XSum R-1 | XSum R-2 | XSum R-L
---|---|---|---|---|---|---
Soft prompts (en.&de., 100) | 36.89 | 14.96 | 24.63 | 29.36 | 9.90 | 22.92
Soft prompts (en.&de., 150) | 35.71 | 14.86 | 23.97 | 28.94 | 9.52 | 22.24
Soft prompts (en.&de.&ip., 100) | 37.87 | 15.83 | 25.37 | 31.95 | 10.52 | 24.80
Model | ROUGE-1 | ROUGE-2 | ROUGE-L
---|---|---|---
Soft prompts (en.&de., shared) | 36.06 | 14.30 | 24.24
Soft prompts (en.&de., separate) | 36.37 | 14.41 | 24.46
Further insight on soft prompts across the encoder and the decoder.
To verify our hypothesis that the decoder prompts largely copy the behaviour of the encoder prompts, we shared the embeddings of the soft prompts prepended to the encoder and the decoder. In Table 8, we observe that Soft prompts (en.&de., shared) and Soft prompts (en.&de., separate) perform almost identically. Although the shared variant has only half the prompt parameters of the original model, its performance remains competitive. This shows that shared prompts can extract important information from the document and guide the language model to generate consistently good summaries more efficiently.

Model | ROUGE-1 | ROUGE-2 | ROUGE-L
---|---|---|---
Full-Model Tuning | 11.69 | 2.67 | 7.74
Prefix-Tuning | 11.76 | 2.63 | 7.93
Prompt Tuning | 9.40 | 1.86 | 6.19
PSP-Interval | 17.16 | 3.36 | 12.65
4.3 Analysis on Few-shot and Zero-shot Summarization
To examine the performance of the different methods under fewer shots, we further randomly sampled {50, 100, 200} examples as additional settings. Figure 6 gives a more detailed overview of all models' performance across this range of shot counts. The ROUGE scores of our model generally outperform the other baselines and remain steady across the different scenarios. In particular, PSP with only 50 examples obtains the most significant improvements, while Prefix-Tuning (tuned on BART-base) barely works, possibly due to its training instability. Moreover, we report zero-shot results on XSum in Table 9. Benefiting from the knowledge gained in the pre-training phase, our model shows a significant advantage in zero-shot adaptation, generating quality summaries.
4.4 The Performance of Pre-training on Prefix-Tuning
A crucial strategy for PSP is the pre-training of the soft prompts. For a fair comparison, we performed prefix pre-training for Prefix-Tuning in the same way as for PSP. The results are shown in Table 10. The Prefix model improves on the XSum dataset after adopting the pre-training strategy, but underperforms the original model on CNNDM. This indicates that Prefix-Tuning has limited potential compared to our model. We conjecture that pre-training for Prefix-Tuning raises the risk of over-fitting due to its sensitivity to different data and parameter settings.
Method | CNNDM R-1 | CNNDM R-2 | CNNDM R-L | XSum R-1 | XSum R-2 | XSum R-L
---|---|---|---|---|---|---
Prefix-Tuning | | | | | |
Prefix-Tuning w/ Pre. | | | | | |
4.5 Ablation Study
We conducted experiments to examine the effectiveness of the major components of our model; Table 11 shows the ablation results on the two datasets. We observe that both the prompt pre-training operation and the inner-prompts component contribute to the main model. Notably, with the removal of either component, the model becomes considerably less stable, as indicated by the variance in the ablation results. Comparatively, prompt pre-training matters more on the XSum dataset, whose summaries are more abstractive (and, we assume, more "difficult") than those of CNNDM. In sum, these two components support the performance and stability of our model in terms of adaptation to summarization (via prompt pre-training) and structural document understanding (via inner-prompts).
Method | CNNDM R-1 | CNNDM R-2 | CNNDM R-L | XSum R-1 | XSum R-2 | XSum R-L
---|---|---|---|---|---|---
PSP-Fixed-k | | | | | |
w/o PP | | | | | |
w/o IP | | | | | |
w/o PP & IP | | | | | |
5 Related Work
Few-Shot Abstractive Summarization
In practical application scenarios, the lack of manually constructed document-summary pairs or labeled data makes data-driven neural models perform poorly (Hu et al., 2021, 2020). Fabbri et al. (2020) condense characteristics of the target dataset into Wikipedia data to construct pseudo-summaries. Bražinskas et al. (2020) introduce plug-in networks to reproduce characteristics of the target dataset with only a small set of labeled examples. Bai et al. (2021) conduct cross-lingual summarization in a low-resource setting. Yu et al. (2021) design a second phase of pre-training on large-scale generative models before fine-tuning. In this paper, we construct a pseudo-summary corpus with heuristic rules, providing a better parameter initialization for soft prompts under few-shot settings. More importantly, we design summarization-oriented soft prompts to help the model produce few-shot summaries.
Prompt Learning
The emergence of GPT-3 (Brown et al., 2020) introduced the concept of "prompting": one only needs to assemble a task description and a few examples into a prompt and prepend it to the task input; with its large-scale frozen parameters, the pre-trained model can then generate the output without any task-specific tuning. However, writing task descriptions is error-prone, and there is no unified, explicit, and effective way to build such hard prompts manually (Logan IV et al., 2021). Hence, several works (Gao et al., 2020; Jiang et al., 2020; Shin et al., 2020) propose to generate prompts automatically, but they all restrict prompts to discrete spaces; these discrete prompts are less expressive and sub-optimal. To overcome the shortcomings of hard prompts, Li and Liang (2021) propose Prefix-Tuning, which only tunes prefix activations prepended to all transformer layers and keeps the LM parameters frozen. To simplify further, Prompt Tuning (Lester et al., 2021) only prepends tunable tokens to the encoder input and keeps all other parameters frozen. Logan IV et al. (2021) and Gu et al. (2021) propose using pre-training to boost the low performance of Prompt Tuning for few-shot learning. In this work, we fit the structure of Prompt Tuning to text generation models, proposing encoder prompts, decoder prompts, and inner prompts, and successfully apply prompt tuning to the few-shot abstractive summarization task.
6 Conclusion
In this paper, we present a novel pre-trained soft-prompt architecture (PSP) specifically designed for few-shot abstractive summarization. We design continuous input embeddings across an encoder and a decoder, alongside several kinds of inner-prompts placed in the text, helping the model better understand documents and guiding accurate generation. Empirical results show the necessity of prompt pre-training for few-shot and zero-shot abstractive summarization. Extensive experiments and analyses show that the proposed PSP provides the best effectiveness-efficiency trade-off among all the baseline methods.
7 Acknowledgments
The research presented in this publication was sponsored by CCF Fund For Young Scholars, and Joint Funds of the National Natural Science Foundation of China (Grant No. U21B2009).
References
- Bai et al. (2021) Yu Bai, Yang Gao, and Heyan Huang. 2021. Cross-lingual abstractive summarization with limited parallel resources. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6910–6924, Online. Association for Computational Linguistics.
- Bražinskas et al. (2020) Arthur Bražinskas, Mirella Lapata, and Ivan Titov. 2020. Few-shot learning for opinion summarization. arXiv preprint arXiv:2004.14884.
- Brown et al. (2020) Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- Fabbri et al. (2020) Alexander R Fabbri, Simeng Han, Haoyuan Li, Haoran Li, Marjan Ghazvininejad, Shafiq Joty, Dragomir Radev, and Yashar Mehdad. 2020. Improving zero and few-shot abstractive summarization with intermediate fine-tuning and data augmentation. arXiv preprint arXiv:2010.12836.
- Gao et al. (2020) Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.
- Gu et al. (2021) Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. 2021. Ppt: Pre-trained prompt tuning for few-shot learning. arXiv preprint arXiv:2109.04332.
- Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Advances in neural information processing systems, 28:1693–1701.
- Hu et al. (2020) Xuming Hu, Lijie Wen, Yusong Xu, Chenwei Zhang, and S Yu Philip. 2020. Selfore: Self-supervised relational feature learning for open relation extraction. In Proc. of EMNLP, pages 3673–3682.
- Hu et al. (2021) Xuming Hu, Chenwei Zhang, Yawen Yang, Xiaohe Li, Li Lin, Lijie Wen, and S Yu Philip. 2021. Gradient imitation reinforcement learning for low resource relation extraction. In Proc. of EMNLP, pages 2737–2746.
- Jiang et al. (2020) Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.
- Jin et al. (2021) Woojeong Jin, Yu Cheng, Yelong Shen, Weizhu Chen, and Xiang Ren. 2021. A good prompt is worth millions of parameters? low-resource prompt-based learning for vision-language models. arXiv preprint arXiv:2110.08484.
- Kiritchenko and Mohammad (2017) Svetlana Kiritchenko and Saif M Mohammad. 2017. Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. arXiv preprint arXiv:1712.01765.
- Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.
- Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.
- Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
- Liu et al. (2021) Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. Gpt understands, too. arXiv preprint arXiv:2103.10385.
- Liu and Lapata (2019) Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740.
- Liu et al. (2020) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
- Logan IV et al. (2021) Robert L Logan IV, Ivana Balažević, Eric Wallace, Fabio Petroni, Sameer Singh, and Sebastian Riedel. 2021. Cutting down on prompts and parameters: Simple few-shot learning with language models. arXiv preprint arXiv:2106.13353.
- Manning et al. (2014) Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pages 55–60.
- Narayan et al. (2018) Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.
- Perez et al. (2021) Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. arXiv preprint arXiv:2105.11447.
- Schick and Schütze (2020) Timo Schick and Hinrich Schütze. 2020. Few-shot text generation with pattern-exploiting training. arXiv preprint arXiv:2012.11926.
- See et al. (2017) Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.
- Shin et al. (2020) Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. 2020. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980.
- Vaswani et al. (2017) Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. ArXiv, abs/1706.03762.
- Wang et al. (2021) Haonan Wang, Yang Gao, Yu Bai, Mirella Lapata, and Heyan Huang. 2021. Exploring explainable selection to control abstractive summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13933–13941.
- Wang et al. (2019) Wenbo Wang, Yang Gao, He-Yan Huang, and Yuxiang Zhou. 2019. Concept pointer network for abstractive summarization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3076–3085.
- Wolf et al. (2020) Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davison, Sam Shleifer, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
- Yang et al. (2020) Ziyi Yang, Chenguang Zhu, Robert Gmyr, Michael Zeng, Xuedong Huang, and Eric Darve. 2020. Ted: A pretrained unsupervised summarization model with theme modeling and denoising. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 1865–1874.
- Yu et al. (2021) Tiezheng Yu, Zihan Liu, and Pascale Fung. 2021. Adaptsum: Towards low-resource domain adaptation for abstractive summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5892–5904.
- Zhang et al. (2020) Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328–11339. PMLR.
Appendix A Appendix
A.1 Constructing Pseudo Data for Pre-training
We constructed the pseudo data for CNNDM with Lead and applied a simple data-cleaning procedure to the self-supervised pre-training corpus. First, we removed irrelevant information, such as media names, reporter names, and dates, from the summaries. Second, for summaries with fewer than 50 tokens, we iteratively appended the first sentence of the remaining text to the pseudo summary until its length reached 70 tokens; this prevents the target text from being too short to form a meaningful summary. Third, we filtered out samples in which the source document was shorter than its summary.
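A hedged sketch of this Lead construction is given below; whitespace-based token counting and the helper name are simplifications, and the cleaning of media names, reporter names, and dates is omitted.

```python
# Sketch of the Lead pseudo-data construction for CNNDM.
from typing import List, Optional, Tuple

def lead_pseudo_pair(sentences: List[str], lead_n: int = 3,
                     min_len: int = 50, target_len: int = 70) -> Optional[Tuple[str, str]]:
    summary, rest = list(sentences[:lead_n]), list(sentences[lead_n:])

    def n_tokens(sents):
        return sum(len(s.split()) for s in sents)

    # If the lead summary has fewer than 50 tokens, keep moving the first remaining
    # sentence into it until it reaches roughly 70 tokens (or the document runs out).
    if n_tokens(summary) < min_len:
        while n_tokens(summary) < target_len and rest:
            summary.append(rest.pop(0))
    pseudo_summary, pseudo_document = " ".join(summary), " ".join(rest)
    if len(pseudo_document.split()) <= len(pseudo_summary.split()):
        return None           # drop samples whose source is shorter than the pseudo summary
    return pseudo_document, pseudo_summary
```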
For XSum, we constructed the pseudo data for pre-training following GSG. The top-1 most important sentence was selected as the pseudo summary. We then filtered out pseudo summaries that were not relevant enough to their pseudo passages. In particular, we leveraged the hand-written summaries in the few-shot dataset to determine the filtering threshold: we calculated the ROUGE-1 F1 between each ground-truth summary and its corresponding passage, and used the mean and variance of these scores to derive a lower-bound threshold. Pseudo samples whose ROUGE-1 F1 between the pseudo summary and the pseudo passage fell below this threshold were filtered out. Finally, we pre-trained our soft prompts on the filtered pseudo-data. Table 12 shows the statistics of the pre-training corpus.
 | CNNDM Pseudo Corpus | XSum Pseudo Corpus
---|---|---
# of Original Passages | 287,113 | 204,017
# of Pre-training Data | 284,177 | 158,499
A.2 Implementation Details
We first split sentences with the Stanford CoreNLP toolkit (Manning et al., 2014), and input documents were truncated to 1024 BPE tokens. We adopted BART-base for all the experiments. Our implementation was based on the Hugging Face Transformer models (Wolf et al., 2020). We used a mini-batch size of 8 with gradient accumulation over 10 iterations. We used the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.998$ and Noam decay. In the pre-training stage, the peak learning rate was 1e-3 and the warm-up ratio was 10%. During fine-tuning, the peak learning rate was 3e-4, with 100 warm-up steps and 400 epochs. In the decoding stage, we used beam search with a beam size of 4. Decoding stopped when an end-of-sequence (EOS) token was emitted or the generated summary reached 256 tokens. All models were trained on 4 TITAN RTX GPUs.
A.3 The Universality of GSG to Construct Pseudo-data
To demonstrate the universality of using the GSG method to construct pseudo-data for prompt pre-training, we conducted a complementary experiment to test its effect on CNNDM. (We do not conduct this ablation on XSum, as there is no lead bias in that dataset, so it would be inappropriate to take the first sentences of the passage as the pseudo summary.) Specifically, we selected the most important sentences following the GSG procedure. Results in Table 13 indicate that the PSP model pre-trained with GSG is as effective as the original PSP-Lead, showing that GSG can be universally employed to pre-train soft prompts for abstractive summarization.
Model | ROUGE-1 | ROUGE-2 | ROUGE-L
---|---|---|---
PSP-Lead (w/o inner-prompts) | 37.66 | 15.07 | 24.52
PSP-GSG (w/o inner-prompts) | 37.04 | 15.04 | 25.20
Source | Manchester City are monitoring Juventus striker Alvaro Morata. The Spain international was a target for Arsenal a year ago when he left Real Madrid but opted for a move to Italy instead. Real have a buy-back option set at around £15million but are unlikely to take that up. Manchester City are keeping tabs on Juventus striker Alvaro Morata ahead of possible summer bid . Morata closes down Lazio goalkeeper Federico Marchetti during Juve’s Serie A win on Saturday . City also remain keen on Juventus midfielder Paul Pogba but face competition from Paris Saint Germain . Morata has been a success at Juventus but the Italians are always susceptible to offers at the right price and are making plans to bring in Edinson Cavani from Paris Saint Germain despite the French club’s resistance. PSG also are one of the frontrunners to sign Juve midfielder Paul Pogba. City remain one of the clubs pushing to sign the France international too and Juventus have City’s Edin Dzeko, Aleksandar Kolarov and Stevan Jovetic on their list of potential targets for the summer. |
---|---|
PSP | Manchester City are keeping tabs on Juventus striker Alvaro Morata . The Spain international was a target for Arsenal a year ago when he left Real Madrid but opted for a move to Italy instead . Juventus have a buy-back option set at around £15million but are unlikely to take that up . |
Prefix-Tuning | Manchester City are keeping tabs on Juventus striker Alvaro Morata ahead of possible summer bid. Morata closes down Lazio goalkeeper Federico Marchetti during Juve’s Serie A win on Saturday. |
Prompt Tuning | LMorata is a target for Real Madrid, who have a buy - back option set at around £ 15million . He has been a success at Real Madrid but opted for a move to Italy instead . The Italian club are unlikely to take that up . . .. . . ." . .." . ." ." ." . ."." . .". . ."". . ." . . " . . … . . [and] . . (He] had a great season at Real but who more had been about” . . , but who else had been close enough to sign him? . . He had her all after." . .) . . She didn’t year when when when two over people A lot more time than two theret. . ." —LastJustInOther |
Full-Model Tuning | NEW City are monitoring Juventus striker Alvaro Morata . The Spain international was a target for Arsenal a year ago when he left Real Madrid . Morata has a buy-back option set at around £15million but is unlikely to be offered by the French side . |
Reference | Alvaro Morata had attracted interest from Arsenal before joining Juventus . Spain international made move to Italy in £15million deal from Real Madrid . Manchester City are monitoring the 22-year-old after impressive season . |
Source | Nairobi, Kenya (CNN)University of Nairobi students were terrified Sunday morning when they heard explosions – caused by a faulty electrical cable – and believed it was a terror attack, the school said. Students on the Kikuyu campus stampeded down the halls of the Kimberly dormitory, and some jumped from its fifth floor, the university said. Hundreds were injured and were taken to hospitals. One person died, according to the school. The confusion and panic came less than two weeks after Al-Shabaab slaughtered 147 people at a college in Garissa, Kenya. Kenyan teachers and students have said they fear being targeted by the Somalia-based terrorists. On Sunday, as many as 108 students from the University of Nairobi were admitted to Kenyatta National Hospital. Among them, at least 63 students have been discharged, and at least four are slated for surgery, the school said. Almost all of the 54 students being treated at PCEA Kikuyu Hospital have been released, the university said. Kenya Power authorities and its CEO are at the school and looking into the electrical issue. Normal power supply will resume after repairs, the university said. "As we mourn the unfortunate loss of the departed student, we are also praying for the quick recovery of those who were injured," said Vice Chancellor Peter M.F. Mbithi in a statement. He called on the students, staff and public to remain calm. CNN’s Lillian Leposo reported from Nairobi and Ashley Fantz wrote this story in Atlanta. |
PSP | University of Nairobi students were terrified Sunday morning when they heard explosions – caused by a faulty electrical cable – and believed it was a terror attack . Hundreds were injured and were taken to hospitals . One person died, according to the school . The confusion and panic came less than two weeks after Al- Shabaab slaughtered 147 people at a college in Garissa . |
Prefix-Tuning | University of Nairobi students were terrified Sunday morning when they heard explosions – caused by a faulty electrical cable – and believed it was a terror attack. |
Prompt Tuning | L Hundreds were injured and taken to hospitals . Hundreds of students were evacuated from the dormitory . The school said they feared being targeted by al-Shabaab . Hundreds were evacuated and treated in hospitals . The university said it is investigating the cause of the explosion . . . The explosion was caused by a faulty electrical cable. . .. . ." . . ." ." ." . ."." . .." . .""People were terrified," said the school’s vice chancellor . "People were screaming, but who more had been were about” . "We had no idea what was going on but who else had been about to blow her all after." … .. ." .."." ..""They were terrified at the time than two overtakes" —LastJustIn3 |
Full-Model Tuning | NEW students panicked when they heard explosions – caused by a faulty electrical cable – and believed it was a terror attack, university says . As many as 108 students from University of Nairobi were admitted to Kenyatta National Hospital . One person died, according to the school . |
Reference | Students stampeded; some jumped from a fifth story at a dorm; one student died, school officials say . The blasts were caused by faulty electrical cable, and Kenya Power is at the school . The panic came less than two weeks after terrorists attacked Kenya’s Garissa University . |
Source | Brunon Kwiecien, 48, was convicted of planning a terrorist attack, illegal weapons possession and inciting two students to carry out an attack.He suggested he had been manipulated by Polish intelligence agents.Kwiecien was said to be fascinated with Norwegian mass killer Anders Behring Breivik.Right-wing extremist Breivik killed 77 people in a bombing and shooting rampage in Norway in July 2011.Kwiecien, a former professor at Krakow’s University of Agriculture, was arrested in 2012.Investigators believe he wanted to target parliament with four tonnes of explosives while then-President Bronislaw Komorowski and former Prime Minister Donald Tusk were due to attend, the trial heard."If Brunon Kwiecien hadn’t been stopped, we would be talking amid the ruins of the state today," said judge Aleksandra Almert, according to the AFP agency.While admitting he planned to carry out the attack, he also said he was subject to "provocation" by the intelligence services.Kwiecien is the first Pole to be accused of terrorism, Polish media reported. He has no known links to established extremist groups. |
---|---|
PSP | A Pole has been convicted of planning a terrorist attack in Poland, a court heard. |
Prefix-Tuning | A Pole has been convicted of planning to carry out a terrorist attack in Poland. |
Prompt Tuning | AA Polish man has been convicted of planning a terrorist attack in the Polish capital, Warsaw, on Thursday. |
Full-Model Tuning | A Pole has been found guilty of planning a terrorist attack in the Polish state of Krakow. |
Reference | A Polish university lecturer has been sentenced to 13 years in jail for plotting to ram a car packed with explosives into parliament. |
Source | Schmidt was sent off by the referee for insulting Hoffenheim’s Julian Nagelsmann in Saturday’s 3-0 home loss."That was nothing, what sort of a nutcase are you? Just shut your mouth," Schmidt shouted after going 2-0 down.The 49-year-old has been banned for two games and handed a 15,000 euros (£13,373) fine.The German was sanctioned after triggering a suspended sentence from February this year.He had been banned for three games, with a further two in the event of a repeat offence before June 2017, for refusing a referee’s order to leave the sidelines during a 1-0 defeat to Borussia Dortmund.Schmidt will be unable to have any contact with the team for half an hour before, during and after Tuesday’s German Cup second-round match against Lotte and Saturday’s league match against Wolfsburg.Leverkusen’s director of sport Rudi Voller has sought a meeting with the head of the disciplinary committee. |
PSP | Leverkusen defender Christian Schmidt has been banned for two games for insulting the referee. |
Prefix-Tuning | Leverkusen midfielder Matthias Schmidt has been banned for two games after refusing to leave the sidelines during a match against Wolfsburg. |
Prompt Tuning | ALeverkusen midfielder Christian Schmidt has been banned for two games for insulting the referee in a game against Hoffenheim on Saturday..’ |
Full-Model Tuning | Aeverkusen manager Gerhard Schmidt has been banned for two games for insulting the head of the German national team. |
Reference | Bayer Leverkusen head coach Roger Schmidt has been banned and fined for calling an opposing manager "a nutcase" during a Bundesliga game. |