Pre-training for Abstractive Document Summarization by
Reinstating Source Text
Abstract
Abstractive document summarization is usually modeled as a sequence-to-sequence (seq2seq) learning problem. Unfortunately, training large seq2seq based summarization models on limited supervised summarization data is challenging. This paper presents three sequence-to-sequence pre-training (in shorthand, STEP) objectives which allow us to pre-train a seq2seq based abstractive summarization model on unlabeled text. The main idea is that, given an input text artificially constructed from a document, a model is pre-trained to reinstate the original document. These objectives include sentence reordering, next sentence generation and masked document generation, all of which are closely related to the abstractive document summarization task. Experiments on two benchmark summarization datasets (i.e., CNN/DailyMail and New York Times) show that all three objectives can improve performance upon baselines. Compared to models pre-trained on large-scale data (160GB), our method, with only 19GB of text for pre-training, achieves comparable results, which demonstrates its effectiveness. Code and models are publicly available at https://github.com/zoezou2015/abs_pretraining.
1 Introduction
Automatic document summarization is the task of condensing a document into a shorter form that preserves the important content; it requires a wide-coverage understanding of the document rather than of specific words or phrases. The task is typically classified into two categories: extractive and abstractive document summarization. Extractive summarization Cheng and Lapata (2016); Nallapati et al. (2017); Narayan et al. (2018a) aims to extract important sentences from the input document and concatenates the extracted sentences as the output summary. Thus, the relative order of the selected sentences in the summary is the same as their relative order in the input document. In contrast, abstractive summarization Nallapati et al. (2016); See et al. (2017); Paulus et al. (2018) rewrites the source text and generates a summary that may contain novel words and phrases not featured in the input. The output summary is closely related to the input document. Moreover, summary sentences, paraphrased from the input by abstractive summarizers, might appear in a different relative order than in the source text. In other words, the content of the original document may be reordered in its summary. This phenomenon is defined as content reordering (see Section 3.2 for the detailed definition). Statistically, we observed that around 40% of the instances in the training split of our summarization dataset exhibit content reordering. Therefore, it is necessary to design a model that is capable of reordering content. However, as far as we know, little prior work has studied this for abstractive summarization.
Abstractive summarization is usually framed as a sequence-to-sequence (seq2seq) learning problem Nallapati et al. (2016); See et al. (2017). In this paper, we adopt the seq2seq Transformer Vaswani et al. (2017), which has been demonstrated to be the state of the art for seq2seq modeling Vaswani et al. (2017); Ott et al. (2019). Recent studies Song et al. (2019); Dong et al. (2019); Lewis et al. (2019); Zhang et al. (2019a); Raffel et al. (2019) have proven the effectiveness of pre-trained seq2seq Transformer models on natural language generation tasks such as abstractive summarization.
Based on the above observations, this work proposes three sequence-to-sequence pre-training (in shorthand, STEP) objectives for abstractive summarization, which can be used to pre-train a seq2seq model on unlabeled text: Sentence Reordering (SR), Next Sentence Generation (NSG), and Masked Document Generation (MDG). All three objectives are designed to reinstate the original source text. SR learns to recover a document whose sentences have been randomly shuffled. Given the first segment of a document, NSG generates the next segment of the original document. MDG learns to recover a masked document to its original form.
After pre-training a model with our proposed objective(s) on unlabeled documents, we fine-tune it on supervised summarization datasets (i.e., CNN/DailyMail and New York Times). Experiments show that, even when pre-training only on documents from the training split of a summarization dataset, our method improves by a large margin upon a heavily tuned large seq2seq Transformer model that already includes a strong pre-trained encoder. Involving more data (19GB) for pre-training further improves the performance. Compared to models pre-trained with much more data (160GB), we still achieve comparable or even higher ROUGE scores.
2 Related Work
Extractive Summarization This task aims to find the informative sentences in a document and use them as its summary. It is usually viewed as a sentence ranking problem Kupiec et al. (1995); Conroy and O’leary (2001) using scores from a binary (sequence) classification model, which predicts whether a sentence belongs in the summary or not. Extractive neural models Cheng and Lapata (2016); Nallapati et al. (2017); Narayan et al. (2018b); Zhang et al. (2018) employ hierarchical LSTMs/CNNs as the feature learning part of the binary (sequence) classifier and largely outperform discrete feature based models Radev et al. (2004); Filatova and Hatzivassiloglou (2004); Nenkova et al. (2006). Very recently, the feature learning part was replaced with pre-trained Transformer encoders Zhang et al. (2019b); Liu and Lapata (2019), leading to another large performance gain. However, extractive models have their own limitations. For example, the extracted sentences might be too long and redundant. Besides, manually written summaries are abstractive in nature. Therefore, we focus on abstractive summarization in this paper.
Abstractive Summarization This task aims to generate a summary by rewriting a document, which is a seq2seq learning problem. Attentive seq2seq LSTMs Hochreiter and Schmidhuber (1997); Bahdanau et al. (2015) were employed by Nallapati et al. (2016) and have been extended with a copy mechanism Gu et al. (2016), a coverage model See et al. (2017) and reinforcement learning Paulus et al. (2018). Liu and Lapata (2019) used a seq2seq Transformer model with only its encoder initialized with a pre-trained Transformer encoder (i.e., BERT; Devlin et al. 2019). This work proposes to pre-train the decoder together with the encoder and then initialize both the encoder and decoder of a summarization model with the pre-trained Transformer model.
There is also a line of work that bridges extractive and abstractive models with attention mechanisms Gehrmann et al. (2018); Hsu et al. (2018) and reinforcement learning Chen and Bansal (2018), while our model is simpler.
Pre-training Pre-training methods have drawn a lot of attention recently. Peters et al. (2018) and Radford et al. (2019) pre-trained LSTM and Transformer language models using language modeling objectives. To leverage context in both directions, BERT Devlin et al. (2019) is trained with the masked language modeling and next sentence prediction objectives. SpanBERT Joshi et al. (2020) applies only the masked language modeling objective, masking contiguous random spans rather than random tokens. XLNet Yang et al. (2019) proposed a permutation language modeling objective that removes the independence assumption between masked tokens in BERT. RoBERTa Liu et al. (2019) extends BERT with more training data and better training strategies. The above models focus on pre-training an encoder or a decoder, while we propose methods to pre-train a seq2seq model (i.e., the encoder together with the decoder) for abstractive summarization.
Dong et al. (2019) (UniLM) proposed a unified language model that can be used for both natural language understanding and generation tasks, pre-trained with masked, unidirectional and seq2seq language modeling objectives; its encoder and decoder parameters are shared. By contrast, we pre-train a seq2seq Transformer with separate parameters for the encoder and decoder. Song et al. (2019) (MASS) proposed to pre-train a seq2seq Transformer by masking a span of text and then predicting the masked tokens. Their pre-training task is similar to our MDG task, but we apply a different masking strategy and predict the original text. Song et al. (2019) tested their model on sentence-level tasks (e.g., machine translation and sentence compression), while we aim to solve document-level tasks (e.g., abstractive document summarization). Lewis et al. (2019) (BART) adopted the combination of text infilling and sentence permutation as a single objective for seq2seq Transformer pre-training. Differently, we propose three objectives and use them individually; moreover, MDG masks each selected token in the input sequence individually rather than collapsing a span into a single mask token. Raffel et al. (2019) (T5) studied different pre-training objectives, model architectures, and unlabeled datasets. ProphetNet Yan et al. (2020) predicts the next several tokens (future n-grams) simultaneously. Zhang et al. (2019a) (PEGASUS) proposed to remove/mask sentences from an input document and learn to generate the removed/masked sentences for pre-training, while NSG predicts the text following the input sequence and MDG masks a randomly selected span of tokens.
3 Proposed Method

3.1 Sequence-to-Sequence Learning
In this work, the task of abstractive document summarization is modeled as a seq2seq learning problem. We adopt the seq2seq Transformer architecture Vaswani et al. (2017). Given a document $X = (x_1, x_2, \dots, x_{|X|})$ paired with its summary $Y = (y_1, y_2, \dots, y_{|Y|})$, we aim to learn the model parameters $\theta$ and estimate the conditional probability:

$$P(Y \mid X; \theta) = \prod_{t=1}^{|Y|} p(y_t \mid y_{<t}, X; \theta) \quad (1)$$

where $y_{<t}$ stands for all tokens before position $t$ (i.e., $y_{<t} = (y_1, y_2, \dots, y_{t-1})$). Given the whole training set $\mathcal{D} = \{(X^{(i)}, Y^{(i)})\}_{i=1}^{|\mathcal{D}|}$, this model can be trained by maximizing the log-likelihood of the training document-summary pairs:

$$\mathcal{L}(\theta) = \sum_{i=1}^{|\mathcal{D}|} \log P\big(Y^{(i)} \mid X^{(i)}; \theta\big) \quad (2)$$
We first pre-train the seq2seq Transformer model on the unlabeled text using our proposed pre-training objectives (see Section 3.2) and then fine-tune it on the document-summary dataset.
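For concreteness, the sketch below (not the released implementation) shows how the log-likelihood of Eq. (2) can be computed for a batch of document-summary pairs with teacher forcing; `model` is a hypothetical seq2seq network that returns per-position vocabulary logits.

```python
import torch
import torch.nn.functional as F

def seq2seq_nll(model, src_ids, tgt_ids, pad_id):
    """Negative log-likelihood (to be minimized) for one batch of (X, Y) pairs.

    src_ids: LongTensor [batch, src_len]  -- tokenized documents X
    tgt_ids: LongTensor [batch, tgt_len]  -- tokenized summaries Y (with BOS/EOS)
    """
    # Teacher forcing: predict y_t from y_<t and the encoded document.
    decoder_input = tgt_ids[:, :-1]         # y_1 ... y_{T-1}
    target = tgt_ids[:, 1:]                 # y_2 ... y_T
    logits = model(src_ids, decoder_input)  # [batch, tgt_len-1, vocab]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target.reshape(-1),
        ignore_index=pad_id,                # padding positions carry no loss
    )
```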
3.2 Pre-training Objectives
Automatic abstractive summarization requires a comprehensive understanding of the input document and rewrites the source text into a shorter form, where the summary is closely related to the input and retains its important content. Moreover, rewriting the document may result in content reordering.
We define content reordering as follows. For each document-summary pair, we first map each sentence in the summary to its corresponding sentence in the document by maximizing the ROUGE score (see Appendix A for more details). If the relative orders of sentences in the summary are different from the relative orders of their mapped sentences in the original document, we count this as one case of content reordering. According to statistics on the training split of our summarization dataset, the content of the original documents is reordered in their summaries in approximately 40% of cases.
The above observations motivate us to propose sequence-to-sequence pre-training objectives that are capable of pre-training a seq2seq model serving the abstractive summarization task.
Sentence Reordering
In sentence reordering (SR), we first divide an unlabeled document into multiple sentences based on full stops. Let us change the notation of a document slightly in this paragraph. Let $X = S_1 \,\|\, S_2 \,\|\, \dots \,\|\, S_m$ denote a document, where $S_i$ is a sentence, $m$ is the number of sentences, and $\|$ refers to sentence concatenation. The sentence index order in $X$ can be represented as $O = (1, 2, \dots, m)$. We then shuffle the document by sentences. In other words, the items in the order $O$ are rearranged and we obtain a shuffled order $\hat{O} = (a_1, a_2, \dots, a_m)$, where $a_i \in \{1, 2, \dots, m\}$, $\hat{O} \neq O$, and $a_i \neq a_j$ for any $i \neq j$. Concatenating sentences following $\hat{O}$, we obtain a shuffled document $\hat{X} = S_{a_1} \,\|\, S_{a_2} \,\|\, \dots \,\|\, S_{a_m}$. A seq2seq model takes as input the shuffled document $\hat{X}$ and is pre-trained to reinstate the original one $X$, as demonstrated in Figure 1. The training objective is calculated as:

$$\mathcal{L}_{\text{SR}}(\theta) = \log P(X \mid \hat{X}; \theta)$$
There are several reasons why we design this objective. First, a summary of a document usually consists of multiple sentences, and we expect the model to be pre-trained to generate long and coherent output across sentences; the output of this objective (i.e., the original document) also contains multiple sentences. Second, as we discussed earlier, sentence reordering (or content reordering) is necessary for summarization. Third, abstractive summarization requires reproducing factual details (e.g., named entities, figures) from the source document, and we also expect the model to learn to copy tokens.
Note that document rotation (a document is randomly divided into two fragments $F_1$ and $F_2$ using full stops; the rotated document is $\hat{X} = F_2 \,\|\, F_1$, and document rotation recovers $X = F_1 \,\|\, F_2$ from $\hat{X}$) is a special case of sentence reordering with a significant amount of partially ordered sentences, which we believe is a simpler objective. In this work, we only consider the general case of sentence reordering.
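A minimal sketch of how an SR training pair could be constructed from an unlabeled document; the sentence splitting on full stops is deliberately simplified, and the function name and interface are illustrative rather than taken from the released code.

```python
import random

def make_sr_pair(sentences, rng=None):
    """Build a (shuffled input, original target) pair for Sentence Reordering.

    `sentences` is the list S_1, ..., S_m obtained by splitting a document on
    full stops; the target is the original order, the source a shuffled one.
    """
    rng = rng or random.Random(0)
    order = list(range(len(sentences)))
    shuffled = order[:]
    while shuffled == order and len(order) > 1:  # make sure the order changes
        rng.shuffle(shuffled)
    source = " ".join(sentences[i] for i in shuffled)  # shuffled document
    target = " ".join(sentences)                       # original document
    return source, target
```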
Next Sentence Generation
Next Sentence Generation (NSG) uses one span of text in a document to predict its next span of text, which leverages the natural order of text, as shown in Figure 1. Specifically, we split a document into two segments, $X = G_1 \,\|\, G_2$. Note that each segment might contain multiple sentences. Intuitively, sentences in a document are highly correlated with their preceding sentences due to the context-dependent nature of documents and language. Our intention is to learn to generate multiple sentences and to learn to focus on the input text, which fits the document summarization task, since both a document and its summary usually consist of multiple sentences that are closely related. The training objective is calculated as:

$$\mathcal{L}_{\text{NSG}}(\theta) = \log P(G_2 \mid G_1; \theta)$$
We do not constrain the split point to be the position right after a full-stop symbol, which would guarantee complete sentences in each segment. Instead, the split point can be at any position within the document, which may lead to incomplete sentences in the segments. We intend to force the model to understand input text even without complete information. Similarly, as is common practice in abstractive summarization, input documents are truncated to a fixed number of tokens, which may also leave incomplete sentences. This setting thus reduces the mismatch between the pre-training and fine-tuning inputs.
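A sketch of constructing an NSG pair under the setup of Section 4.1: the split point is sampled anywhere in the token sequence, so segments may contain incomplete sentences. The length limits are parameters here and the names are illustrative.

```python
import random

def make_nsg_pair(tokens, max_src_len=512, max_tgt_len=256, rng=None):
    """Split a tokenized document into (first segment G1, next segment G2)."""
    rng = rng or random.Random(0)
    # The split point may fall inside a sentence on purpose (see above).
    split = rng.randint(1, len(tokens) - 1)
    source = tokens[max(0, split - max_src_len):split]  # G1: text before the split
    target = tokens[split:split + max_tgt_len]          # G2: text following the split
    return source, target
```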
Masked Document Generation
The third objective is Masked Document Generation (MDG) that learns to reinstate a document with a masked span of tokens (see Figure 1). A document is denoted as $X = (x_1, x_2, \dots, x_{|X|})$. We randomly sample the length of the span $l$ from a discrete uniform distribution $\mathcal{U}(a, b)$ ($a$ and $b$ are distribution parameters) and the span starting position $k$ from another discrete uniform distribution $\mathcal{U}(1, |X| - l + 1)$. Thus, $(x_k, x_{k+1}, \dots, x_{k+l-1})$ is the text span to be masked. Let $\hat{X}$ denote the document after the application of our masking strategy. The training objective is calculated as:

$$\mathcal{L}_{\text{MDG}}(\theta) = \log P(X \mid \hat{X}; \theta)$$
One straightforward masking strategy is to replace each token residing in the selected span with a special [MASK] token. However, we refrain from doing so for two reasons. First, [MASK] tokens will not appear in downstream tasks. Second, similar to SR, avoiding replacing every token with [MASK] helps our model learn to copy tokens from the input while preserving the ability to generate novel tokens. Thus, each token in the sub-sequence $(x_k, x_{k+1}, \dots, x_{k+l-1})$ is processed with one of three strategies: 1) replaced with the [MASK] token; 2) replaced with a random token; 3) left unchanged. Inspired by BERT Devlin et al. (2019), we follow strategy 1) for 80% of the selected tokens, strategy 2) in 10% of cases, and strategy 3) for the remaining 10%.
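A sketch of the masking strategy described above. The span-length bounds are placeholders (the paper's exact distribution parameters are not reproduced here), and `vocab` is any list of tokens to sample random replacements from.

```python
import random

def make_mdg_pair(tokens, vocab, mask_token="[MASK]",
                  min_len=64, max_len=128, rng=None):
    """Mask a contiguous span with the 80/10/10 strategy and return
    (masked input, original target).

    min_len / max_len are illustrative bounds for the uniform span-length
    distribution, not the values used in the paper.
    """
    rng = rng or random.Random(0)
    length = rng.randint(min_len, min(max_len, len(tokens) - 1))
    start = rng.randint(0, len(tokens) - length)
    masked = list(tokens)
    for i in range(start, start + length):
        p = rng.random()
        if p < 0.8:                    # 80%: replace with [MASK]
            masked[i] = mask_token
        elif p < 0.9:                  # 10%: replace with a random token
            masked[i] = rng.choice(vocab)
        # else: 10% of the selected tokens are left unchanged
    return masked, list(tokens)
```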
During pre-training, we consider two settings. Setting one: pre-training a model with one single objective, i.e., SR, NSG or MDG, resulting in three different pre-trained models. Setting two: employing all three objectives. For each training batch, we randomly choose one objective, and each objective is used for 1/3 of the training time, obtaining one model (i.e., ALL, see Section 5).
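The per-batch objective selection for the two settings can be summarized as below; the function is a simplified illustration, not the training code.

```python
import random

rng = random.Random(0)

def choose_objective(setting):
    """Pick the objective for the next training batch (Section 3.2).

    Setting one fixes a single objective ("SR", "NSG" or "MDG"); setting two
    ("ALL") draws one of the three uniformly, so each is used for 1/3 of the
    training time.
    """
    if setting == "ALL":
        return rng.choice(("SR", "NSG", "MDG"))
    return setting
```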
For better reference, we name our model STEP (i.e., sequence-to-sequence pre-training), which denotes a seq2seq model pre-trained using our proposed objective(s).
3.3 Fine-tuning
After a seq2seq model is pre-trained, we fine-tune the model on abstractive document summarization datasets. In other words, we continue to train the model on the document-summary pairs.
4 Experimental Setup
4.1 Datasets
CNNDM
The CNNDM dataset contains news articles and the associated highlights (i.e., summaries) collected from the CNN and Daily Mail Online websites (https://edition.cnn.com and https://dailymail.co.uk). Articles were collected starting in April 2007 for CNN and June 2010 for Daily Mail, both until the end of April 2015. The validation data is from March 2015, and the test data from April 2015 Hermann et al. (2015). Following previous work See et al. (2017); Liu and Lapata (2019), we use the non-anonymized version of CNNDM. Specifically, we preprocessed the dataset with the publicly available scripts (https://github.com/abisee/cnn-dailymail) provided by See et al. (2017) and obtained 287,226 document-summary pairs for training, 13,368 for validation and 11,490 for test.
NYT
The NYT dataset Sandhaus (2008) is a collection of articles along with multi-sentence summaries written by library scientists. Following the preprocessing procedures described in Durrett et al. (2016); Liu and Lapata (2019), the test set is constructed by including all articles published on January 1, 2007 or later, which yields 9,076 articles. The remaining 100,834 articles are split into a training set of 96,834 examples and a validation set of 4,000 examples. Following Durrett et al. (2016), we also removed articles whose summaries contain fewer than 50 words from the test set; the resulting test set contains 3,452 examples.
GIGA-CM
To pre-train our model with the objectives introduced in Section 3.2, following the procedures in Zhang et al. (2019b), we created the GIGA-CM dataset, which contains only unlabeled documents. The training set of GIGA-CM is composed of 6,521,658 documents sampled from the English Gigaword dataset (https://catalog.ldc.upenn.edu/LDC2012T21) and the training documents in CNNDM, resulting in 19GB of text for pre-training. We used the 13,368 documents in the validation split of CNNDM as the validation set. Note that the Gigaword dataset overlaps with the NYT dataset and we therefore excluded the test set of NYT from the training set of GIGA-CM.
Dataset | Training | Validation | Test |
---|---|---|---|
CNNDM | 287,226 | 13,368 | 11,490 |
NYT | 96,834 | 4,000 | 9,076 |
GIGA-CM | 6,521,658 | 13,368 | - |
Table 1 lists the number of document-summary pairs (for CNNDM and NYT) and unlabeled documents (for GIGA-CM). For the CNNDM, NYT and GIGA-CM datasets, we segmented and tokenized documents and/or summaries (GIGA-CM only contains documents) using the Stanford CoreNLP toolkit Manning et al. (2014). We further applied the UTF-8 based BPE Sennrich et al. (2016); Radford et al. (2019) to reduce the vocabulary size. As is common practice in abstractive summarization, documents and summaries in CNNDM and NYT are truncated to 512 and 256 tokens, respectively.
We leverage unlabeled documents differently for different pre-training objectives. We first split each document into 512-token pieces if it contains more than 512 tokens (pieces or documents with fewer than 512 tokens are removed). In SR and MDG, we use the piece after transformation to predict its original form; the minimum and maximum masked span lengths in MDG (i.e., the parameters of the uniform length distribution) are set to fixed values. In NSG, each piece is used to predict its next 256 tokens.
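A sketch of how unlabeled documents could be cut into fixed-length pieces before applying the objectives; the exact piece construction in the released code may differ.

```python
def split_into_pieces(token_ids, piece_len=512):
    """Split a tokenized document into 512-token pieces for pre-training.

    Pieces (or documents) shorter than piece_len are dropped, as described above.
    """
    pieces = [token_ids[i:i + piece_len]
              for i in range(0, len(token_ids), piece_len)]
    return [p for p in pieces if len(p) == piece_len]
```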
4.2 Implementation Details
As mentioned in Section 3, we adopt the seq2seq Transformer model Vaswani et al. (2017) as our backbone architecture. The purpose of releasing large pre-trained models is to enable reuse so that the community can avoid high computational costs. Hence, similar to previous work Liu and Lapata (2019), our encoder is initialized with the pre-trained RoBERTa model Liu et al. (2019) (we also tried initializing with an alternative pre-trained encoder and obtained inferior results), and therefore the two share the same architecture. Specifically, the encoder is a 24-layer Transformer. Each layer has 16 attention heads, and its hidden size and feed-forward filter size are 1,024 and 4,096, respectively. The decoder is shallower with 6 layers and is randomly initialized. The total number of trainable model parameters is 585M. The hidden size and number of attention heads of the decoder are identical to those of the encoder, but the feed-forward filter size is 2,048. We use a smaller filter size in the decoder to reduce the computational and memory cost. The dropout rates of all layers in the encoder are set to 0.1, and all dropout rates in the decoder are set to 0.3. Our models are optimized using Adam Kingma and Ba (2015). The other optimization hyper-parameters for pre-training and fine-tuning are different. In the pre-training stage, the encoder is initialized with a pre-trained model while the decoder is randomly initialized. Therefore, similar to Liu and Lapata (2019), we used two separate optimizers for the encoder and decoder, each with its own peak learning rate and 10,000 warmup steps. We also adopted the same learning rate schedule strategies as in Vaswani et al. (2017). We used smaller batch sizes for datasets with fewer examples (i.e., 1,024 for GIGA-CM, 256 for CNNDM and 128 for NYT) to ensure each epoch has a sufficient number of model updates. We trained our models until their validation perplexities converged (around 30 epochs on GIGA-CM, 60 epochs on CNNDM and 40 epochs on NYT). One epoch on GIGA-CM takes around 24 hours with 8 Nvidia Tesla V100 GPUs. The time costs of the different pre-training objectives are close.
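The two-optimizer setup with warmup can be sketched as follows in PyTorch. The peak learning rates are placeholders (their exact values are not reproduced above), the `encoder.` parameter-name prefix is an assumption about the model definition, and the schedule is the inverse-square-root (Noam) schedule of Vaswani et al. (2017).

```python
import torch

def build_optimizers(model, enc_peak_lr, dec_peak_lr, warmup=10000):
    """Separate Adam optimizers for the pre-trained encoder and the fresh decoder."""
    enc_params = [p for n, p in model.named_parameters() if n.startswith("encoder.")]
    dec_params = [p for n, p in model.named_parameters() if not n.startswith("encoder.")]
    enc_opt = torch.optim.Adam(enc_params, lr=enc_peak_lr)
    dec_opt = torch.optim.Adam(dec_params, lr=dec_peak_lr)

    def scale(step):
        # Linear warmup to the peak learning rate, then inverse-square-root decay.
        step = max(step, 1)
        return (warmup ** 0.5) * min(step ** -0.5, step * warmup ** -1.5)

    enc_sched = torch.optim.lr_scheduler.LambdaLR(enc_opt, scale)
    dec_sched = torch.optim.lr_scheduler.LambdaLR(dec_opt, scale)
    return (enc_opt, enc_sched), (dec_opt, dec_sched)
```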
We highlight the parameters used in the fine-tuning stage that differ from the pre-training stage; the others remain the same. The learning rates for both the encoder and decoder are set to 2e-5 with 4,000 warmup steps, since both the encoder and decoder are already pre-trained. We trained our models for 30 epochs on CNNDM and 50 epochs on NYT, respectively. We selected the best model with regard to the ROUGE score on the validation set. During decoding, similar to Liu and Lapata (2019); Dong et al. (2019), we applied beam search with a beam size of 5. We also conducted experiments on the validation set of CNNDM with different beam sizes (i.e., 1 to 10); according to ROUGE-L, a beam size of 5 is indeed optimal, and detailed results with different beam sizes are included in Appendix B. Following Paulus et al. (2018), we also blocked repeated trigrams during beam search and tuned the minimum summary length on the validation set. The search range of the minimum summary length was set empirically according to the summaries in the training split of CNNDM, where the average and median lengths are both around 55, and we used a step size of 5 to get quick feedback. Similar to the pre-training process, the datasets with fewer instances were fine-tuned with smaller batch sizes (i.e., 64 for NYT and 768 for CNNDM).
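The trigram blocking used during decoding can be sketched as a simple filter on beam-search expansions; this is an illustration of the heuristic from Paulus et al. (2018), not the decoding code used in the paper.

```python
def repeats_trigram(prefix_tokens, candidate_token):
    """Return True if appending candidate_token to the hypothesis would
    create a trigram that already occurs earlier in the hypothesis."""
    seq = list(prefix_tokens) + [candidate_token]
    if len(seq) < 4:          # fewer than two trigrams: no repeat possible
        return False
    new_trigram = tuple(seq[-3:])
    earlier = {tuple(seq[i:i + 3]) for i in range(len(seq) - 3)}
    return new_trigram in earlier
```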
5 Results
5.1 Automatic Evaluation
We used ROUGE Lin (2004) to measure the quality of different summarization model outputs. We reported full-length F1-based ROUGE-1, ROUGE-2 and ROUGE-L scores on CNNDM, while we used the limited-length recall-based ROUGE-1, ROUGE-2 and ROUGE-L on NYT, following Durrett et al. (2016). The ROUGE scores are computed using the ROUGE-1.5.5.pl script (https://github.com/bheinzerling/pyrouge.git).
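For reference, scores of a similar flavor can be computed with the `rouge_score` Python package, used here only as a convenient stand-in; the reported numbers in the paper come from the official ROUGE-1.5.5 Perl script and may differ slightly.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

def f1_rouge(reference: str, hypothesis: str):
    """Full-length F1 ROUGE-1/2/L, in the spirit of the CNNDM evaluation."""
    scores = scorer.score(reference, hypothesis)
    return {name: s.fmeasure for name, s in scores.items()}
```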
Models in Comparison
Lead3 is a baseline that simply takes the first three sentences of a document as its summary. BERTExt Liu and Lapata (2019) is an extractive model fine-tuned on BERT Devlin et al. (2019) that outperforms other extractive systems. PTGen See et al. (2017), DRM Paulus et al. (2018), and DCA Celikyilmaz et al. (2018) are seq2seq learning based models extended with a copy and coverage mechanism, reinforcement learning, and deep communicating agents, respectively. BottomUp Gehrmann et al. (2018) assisted summary generation with a word prediction model. BERTAbs Liu and Lapata (2019) and UniLM Dong et al. (2019) are both pre-training based models built on BERT Devlin et al. (2019). We also implemented four abstractive models as our baselines. Transformer-S2S is a 12-layer seq2seq Transformer with random initialization. Replacing the encoder of Transformer-S2S with the base or large version of RoBERTa Liu et al. (2019) gives two further baselines, RoBERTa (base)-S2S and RoBERTa-S2S, respectively. Following Liu et al. (2019), we further train RoBERTa on the documents of the training split of CNNDM for 60 epochs, the same number of epochs as for our models (indicated as “In-domain”). Replacing the encoder of Transformer-S2S with this further trained model results in RoBERTa (In-domain)-S2S.
Model | R-1 | R-2 | R-L | |
---|---|---|---|---|
Extractive | ||||
Lead3 | 40.34 | 17.70 | 36.57 | |
BERTExt Liu and Lapata (2019) | 43.85 | 20.34 | 39.90 | |
Abstractive | ||||
PTGen See et al. (2017) | 39.53 | 17.28 | 36.38 | |
DRM Paulus et al. (2018) | 39.87 | 15.82 | 36.90 | |
BottomUp Gehrmann et al. (2018) | 41.22 | 18.68 | 38.34 | |
DCA Celikyilmaz et al. (2018) | 41.69 | 19.47 | 37.92 | |
BERTAbs Liu and Lapata (2019) | 42.13 | 19.60 | 39.18 | |
UniLM Dong et al. (2019) | 43.47 | 20.30 | 40.63 | |
Transformer-S2S | 40.43 | 17.66 | 37.44 | |
RoBERTa (base)-S2S | 42.30 | 19.29 | 39.54 |
RoBERTa-S2S | 43.06 | 19.70 | 40.16 | |
RoBERTa (In-domain)-S2S | 42.29 | 19.27 | 39.56 |
Ours | ||||
STEP (In-domain) | SR | 43.77∗ | 20.78∗ | 40.92∗ |
NSG | 43.48∗ | 20.70∗ | 40.72∗ | |
MDG | 43.72∗ | 20.77∗ | 40.88∗ | |
ALL | 43.75∗ | 20.81∗ | 40.99∗ | |
STEP (GIGA-CM) | SR | 44.03∗ | 21.13∗ | 41.20∗
NSG | 44.03∗ | 21.02∗ | 41.18∗ | |
MDG | 44.07∗ | 20.97∗ | 41.22∗ | |
ALL | 44.06∗ | 21.07∗ | 41.24∗ |
Results on CNNDM
The results on CNNDM are listed in Table 2. The first and second blocks show results of previous extractive and abstractive models, respectively. Results of our models are all listed in the third block. RoBERTa (base)-S2S outperforms Transformer-S2S by nearly 2 ROUGE points, and RoBERTa-S2S further improves the performance. This shows the effectiveness of the pre-trained encoders.
Next, we study the effects of different pre-training objectives (see Section 3.2). We first pre-train a seq2seq Transformer model (the sizes of our model and RoBERTa-S2S are identical) on the unlabeled documents of the CNNDM training split to get quick feedback (one epoch takes around 3 hours on CNNDM and 0.5 hours on NYT), denoted as “STEP (In-domain)”. From the top part of the third block in Table 2, we can see that Sentence Reordering (SR), Next Sentence Generation (NSG) and Masked Document Generation (MDG) all improve over RoBERTa (base)-S2S and RoBERTa-S2S significantly, according to the significance test of the ROUGE script. Interestingly, even though we merely use the in-domain training split (around 1GB), our method still significantly outperforms UniLM Dong et al. (2019), which is pre-trained on 16GB of data. Compared to RoBERTa (In-domain)-S2S, STEP (In-domain) (e.g., pre-trained with SR) achieves better performance, although the encoders of the two models are pre-trained on the same corpus for the same number of epochs. This shows that the performance gains mainly result from our proposed objectives, which pre-train the decoder together with the encoder. Training RoBERTa longer may improve understanding tasks Liu et al. (2019), but there is no evidence that longer training of RoBERTa improves generation performance.
When we pre-train the seq2seq model on an even larger dataset (i.e., GIGA-CM, 19GB of text), indicated as STEP (GIGA-CM), the results are further improved and our method outperforms all models under comparison, as listed in the bottom part of Table 2.
Model | R-1 | R-2 | R-L | |
---|---|---|---|---|
Extractive | ||||
Lead3 | 39.58∗ | 20.11∗ | 35.78∗ | |
BERTExt Liu and Lapata (2019) | 46.66∗ | 26.35∗ | 42.62∗ | |
Abstractive | ||||
PTGen See et al. (2017) | 43.71∗ | 26.40∗ | - | |
DRM Paulus et al. (2018) | 42.94∗ | 26.02∗ | - | |
BERTAbs Liu and Lapata (2019) | 49.02∗ | 31.02∗ | 45.55∗ | |
Transformer-S2S | 35.75∗ | 17.23∗ | 31.41∗ | |
RoBERTa-S2S | 45.92 ∗ | 29.48∗ | 42.73∗ | |
Ours | ||||
STEP (In-domain) | SR | 48.57∗ | 30.81∗ | 45.00∗ |
NSG | 48.28∗ | 30.33∗ | 44.79∗ | |
MDG | 48.44∗ | 30.74∗ | 45.01∗ | |
ALL | 48.70∗ | 30.93∗ | 45.12∗ | |
STEP (GIGA-CM) | SR | 50.03∗ | 32.12∗ | 46.25∗
NSG | 49.67∗ | 31.82∗ | 45.97∗ | |
MDG | 49.40∗ | 31.45∗ | 45.60∗ | |
ALL | 49.57∗ | 31.81∗ | 45.87∗ |
Results on NYT
Table 3 presents results on the NYT dataset. Following the same evaluation protocol as Durrett et al. (2016), we adopted the limited-length recall-based ROUGE, where we truncate the predicted summaries to the length of the gold ones. Again, the first and second blocks show results of previous extractive and abstractive models, respectively. Results of our models are listed in the third block. Similar to the trends on CNNDM, our method leads to significant performance gains.
Comparisons among Objectives
Among the three pre-training objectives, SR works slightly better than the other two (i.e., NSG and MDG). We also tried randomly using all three objectives during training with probability 1/3 each (indicated as ALL). Interestingly, we observed that, in general, ALL outperforms each individual objective when employing only the unlabeled documents of the training splits of CNNDM or NYT, which might be due to the limited number of unlabeled documents in those training splits. After adding more data (i.e., GIGA-CM) for pre-training, SR consistently achieves the highest ROUGE-2 on both CNNDM and NYT. We conclude that SR is the most effective pre-training objective for abstractive summarization, since sentence reordering matches the content reordering phenomenon and requires a wide-coverage understanding of the document, going beyond individual words and sentences, which is close to the essence of abstractive document summarization.
The performance of our models on the validation splits of CNNDM and NYT is reported in Appendix B.
Model | Corpus Size | R-1 | R-2 | R-L |
---|---|---|---|---|
T5 | 750GB | 43.52∗ | 21.55∗ | 40.69∗ |
PEGASUS (C4) | 750GB | 43.90∗ | 21.20∗ | 40.76∗ |
PEGASUS (HugeNews) | 3,800GB | 44.17∗ | 21.47∗ | 41.11∗ |
BART | 160GB | 44.16∗ | 21.28∗ | 40.90∗ |
ProphetNet (160GB) | 160GB | 44.20∗ | 21.17∗ | 41.30∗ |
ProphetNet (16GB) | 16GB | 43.68∗ | 20.64∗ | 40.72∗
UniLM | 16GB | 43.47∗ | 20.30∗ | 40.63∗ |
STEP | 19GB | 44.03∗ | 21.13∗ | 41.20∗ |
Comparison to Models Pre-trained with Large-scale Corpora
It is worth noting that several models have been released recently that are pre-trained on corpora much larger than ours, as listed in Table 4 (top part). T5 Raffel et al. (2019) introduced C4 (750GB) as its pre-training corpus. PEGASUS has two versions that are pre-trained on C4 and HugeNews (3,800GB), respectively. Both BART Lewis et al. (2019) and ProphetNet (160GB) Yan et al. (2020) are pre-trained on a 160GB corpus introduced by Liu et al. (2019). We compare our best performing model STEP (i.e., pre-trained on the GIGA-CM dataset using the SR objective) with these models and focus on the performance on CNNDM, which is a well-known benchmark for abstractive summarization. We highlight the highest ROUGE scores in Table 4 in bold font and use the symbol ∗ to indicate models that perform significantly differently from STEP. Both T5 and PEGASUS (HugeNews) achieve significantly higher ROUGE-2 scores than our model. However, we still obtain comparable or higher ROUGE-1 and ROUGE-L scores. On the other hand, we also consider models pre-trained on relatively small-scale corpora. Following BERT Devlin et al. (2019), both ProphetNet (16GB) Yan et al. (2020) and UniLM Dong et al. (2019) use the same 16GB of text for pre-training. As listed in Table 4 (bottom part), our model significantly outperforms these two models.
5.2 Human Evaluation
Since abstractive models may produce disfluent or ungrammatical outputs, we also evaluated the abstractive systems by eliciting human judgements. We compared our best performing model (i.e., pre-trained on the GIGA-CM dataset using the SR objective) with the human references (denoted as Gold), as well as several strong baselines whose system outputs are available to us, including RoBERTa-S2S and two pre-training based models, i.e., BERTAbs Liu and Lapata (2019) and UniLM Dong et al. (2019) (outputs of BERTAbs and UniLM are publicly available at https://github.com/nlpyang/PreSumm and https://github.com/microsoft/unilm). 50 documents are randomly sampled from the test split of CNNDM. 10 participants are presented with a document and a list of outputs generated by different abstractive summarization systems. They are then asked to rank the outputs of these systems from best to worst according to informativeness (does the summary capture the informative part of the document?), fluency (is the summary grammatical?), and succinctness (does the summary express the document clearly in a few words?). We report the proportions of system rankings and the mean rank (lower is better) in Table 5. The output of STEP is selected as the best in 23% of cases, and we obtained a lower mean rank than all systems except Gold, which shows the participants’ preference for our model. We further converted ranking numbers into ratings (a lower rank maps to a higher rating) and applied Student’s t-test on the ratings (a sketch of this analysis follows Table 5). Ours is significantly better than all other systems in comparison except Gold. However, it still lags behind the human references. One possible reason is that our system (as well as the other systems) only takes the first 512 tokens of a long document as input and thus may lose information residing in the remaining tokens.
Systems | 1st | 2nd | 3rd | 4th | 5th | MR |
---|---|---|---|---|---|---|
BERTAbs | 0.11 | 0.15 | 0.17 | 0.26 | 0.31 | 3.50 |
UniLM | 0.12 | 0.16 | 0.20 | 0.24 | 0.29 | 3.43 |
RoBERTa-S2S | 0.17 | 0.21 | 0.20 | 0.20 | 0.21 | 3.07 |
STEP | 0.23 | 0.23 | 0.23 | 0.18 | 0.14 | 2.77 |
Gold | 0.37 | 0.25 | 0.20 | 0.12 | 0.05 | 2.12
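A sketch of the rank-to-rating conversion and significance test described above; the exact rating mapping is assumed to be a simple flip of the rank, and `scipy.stats.ttest_ind` stands in for the Student t-test.

```python
from scipy import stats

def ranks_to_ratings(ranks, n_systems=5):
    # A lower rank (better) maps to a higher rating; a linear flip is assumed here.
    return [n_systems + 1 - r for r in ranks]

def significantly_better(ratings_a, ratings_b, alpha=0.05):
    """Two-sided Student t-test on the per-judgement ratings of two systems."""
    _, p_value = stats.ttest_ind(ratings_a, ratings_b)
    return p_value < alpha and sum(ratings_a) > sum(ratings_b)
```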
Qualitative analysis with generated examples is provided in Appendix C.
6 Conclusion
We proposed three sequence-to-sequence pre-training objectives: sentence reordering, next sentence generation, and masked document generation. All of these objectives are closely related to the abstractive summarization task and are designed around reinstating the source text. A seq2seq model for abstractive document summarization can be pre-trained with these objectives and then fine-tuned on a summarization dataset. Compared to models pre-trained on much larger corpora (160GB), our method, with only 19GB of text for pre-training, still achieves comparable or even better performance. In the future, we would like to investigate other objectives for pre-training seq2seq models for abstractive summarization.
Acknowledgments
We would like to thank the anonymous reviewers for their thoughtful and constructive comments. Yanyan Zou and Wei Lu were supported by Singapore Ministry of Education Academic Research Fund (AcRF) Tier 2 Project MOE2017-T2-1-156.
References
- Bahdanau et al. (2015) Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR.
- Celikyilmaz et al. (2018) Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proc. of NAACL.
- Chen and Bansal (2018) Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proc. of ACL.
- Cheng and Lapata (2016) Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proc. of ACL.
- Conroy and O’leary (2001) John M Conroy and Dianne P O’leary. 2001. Text summarization via hidden markov models. In Proc. of SIGIR.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL.
- Dong et al. (2019) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Proc. of NIPS.
- Durrett et al. (2016) Greg Durrett, Taylor Berg-Kirkpatrick, and Dan Klein. 2016. Learning-based single-document summarization with compression and anaphoricity constraints. In Proc. of ACL.
- Filatova and Hatzivassiloglou (2004) Elena Filatova and Vasileios Hatzivassiloglou. 2004. Event-based extractive summarization. In Text Summarization Branches Out.
- Gehrmann et al. (2018) Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proc. of EMNLP.
- Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proc. of ACL.
- Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proc. of NIPS.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- Hsu et al. (2018) Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, and Min Sun. 2018. A unified model for extractive and abstractive summarization using inconsistency loss. In Proc. of ACL.
- Joshi et al. (2020) Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. 2020. Spanbert: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.
- Kingma and Ba (2015) Diederik P Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In Proc. of ICLR.
- Kupiec et al. (1995) Julian Kupiec, Jan Pedersen, and Francine Chen. 1995. A trainable document summarizer. In Proc. of SIGIR.
- Lewis et al. (2019) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
- Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Proc. of ACL-Workshop.
- Liu and Lapata (2019) Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proc. of EMNLP.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Manning et al. (2014) Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The stanford corenlp natural language processing toolkit. In Proc. of ACL: System Demonstrations.
- Nallapati et al. (2017) Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. In Proc. of AAAI.
- Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. In Proc. of SIGNLL.
- Narayan et al. (2018a) Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018a. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Proc. of EMNLP.
- Narayan et al. (2018b) Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018b. Ranking sentences for extractive summarization with reinforcement learning. In Proc. of NAACL.
- Nenkova et al. (2006) Ani Nenkova, Lucy Vanderwende, and Kathleen McKeown. 2006. A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization. In Proc. of SIGIR.
- Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proc. of NAACL: Demonstrations.
- Paulus et al. (2018) Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In Proc. of ICLR.
- Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
- Radev et al. (2004) Dragomir Radev, Timothy Allison, Sasha Blair-Goldensohn, John Blitzer, Arda Çelebi, Stanko Dimitrov, Elliott Drabek, Ali Hakim, Wai Lam, Danyu Liu, Jahna Otterbacher, Hong Qi, Horacio Saggion, Simone Teufel, Michael Topper, Adam Winkel, and Zhu Zhang. 2004. MEAD - a platform for multidocument multilingual text summarization. In Proc. of LREC.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
- Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
- Sandhaus (2008) Evan Sandhaus. 2008. The new york times annotated corpus. Linguistic Data Consortium, Philadelphia, 6(12):e26752.
- See et al. (2017) Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proc. of ACL.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proc. of ACL.
- Song et al. (2019) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. Mass: Masked sequence to sequence pre-training for language generation. In Proc. of ICML.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. of NIPS.
- Yan et al. (2020) Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou. 2020. Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training. arXiv preprint arXiv:2001.04063.
- Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
- Zhang et al. (2019a) Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J Liu. 2019a. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. arXiv preprint arXiv:1912.08777.
- Zhang et al. (2018) Xingxing Zhang, Mirella Lapata, Furu Wei, and Ming Zhou. 2018. Neural latent extractive document summarization. In Proc. of ACL.
- Zhang et al. (2019b) Xingxing Zhang, Furu Wei, and Ming Zhou. 2019b. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In Proc. of ACL.
- Zhou et al. (2018) Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou, and Tiejun Zhao. 2018. Neural document summarization by jointly learning to score and select sentences. In Proc. of ACL.
Appendix A Additional Setup Details
Statistics for Content Reordering
Recall that it is not unusual for a human to rewrite a document into a summary of its most important information without following the order in which that information is presented in the document. We define this phenomenon, content reordering, as follows. For each document-summary pair, we first map each sentence in the summary to one sentence in the document by maximizing the ROUGE-2 score. If the relative orders of sentences in the summary are different from the relative orders of their mapped sentences in the original document, we count this as one case of content reordering.
We computed statistics of such cases over the training and validation splits of the CNNDM dataset. To be specific, we borrow the sentence annotations from extractive summarization Zhang et al. (2018); Zhou et al. (2018); Zhang et al. (2019b), which treat extractive summarization as a sentence classification task. The sentences in a document that maximize the ROUGE-2 score Lin (2004) against the human references are labeled True, while the other sentences are assigned False. Like previous extractive summarization systems Zhang et al. (2018, 2019b), we concatenate the sentences labeled True in a document as its associated summary. For each sentence in a summary, we search for its closest sentence in the associated document according to the count of overlapping bigrams. We found that, for some instances, the relative orders of sentences in the summary are not consistent with the relative orders of their closest sentences in the document. In practice, 38.1% of instances in the training split and 40.5% of instances in the validation split exhibit this phenomenon.
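A sketch of the content-reordering check described above: each summary sentence is matched to the document sentence with the largest bigram overlap, and the instance is counted as reordered when the matched positions are not monotonically non-decreasing. The helper names are illustrative.

```python
from collections import Counter

def bigrams(sentence):
    tokens = sentence.split()
    return Counter(zip(tokens, tokens[1:]))

def is_content_reordered(summary_sents, doc_sents):
    """True if the summary sentences appear in a different relative order than
    their closest (bigram-overlap) document sentences."""
    doc_bigrams = [bigrams(d) for d in doc_sents]
    positions = []
    for sent in summary_sents:
        sb = bigrams(sent)
        overlaps = [sum((sb & db).values()) for db in doc_bigrams]
        positions.append(max(range(len(doc_sents)), key=overlaps.__getitem__))
    return any(a > b for a, b in zip(positions, positions[1:]))
```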
Beam Size | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
ROUGE-L | 41.16 | 41.78 | 41.85 | 41.87 | 41.88 | 41.83 | 41.84 | 41.83 | 41.82 | 41.83 |
Model | MNLI | SST | QQP | QNLI | STS | RTE | MRPC | CoLA |
---|---|---|---|---|---|---|---|---|
RoBERTa | 90.2 | 96.4 | 92.2 | 94.7 | 92.4 | 86.6 | 90.9 | 68.0 |
STEP Encoder | 89.6 | 95.0 | 86.8 | 91.6 | 91.5 | 81.9 | 92.4 | 65.5 |
Model | R-1 | R-2 | R-L | |
---|---|---|---|---|
STEP (In-domain) | SR | 44.37 | 21.30 | 41.60 |
NSG | 44.12 | 21.29 | 41.39 | |
MDG | 44.36 | 21.37 | 41.59 | |
ALL | 44.28 | 21.33 | 41.54 | |
STEP (GIGA-CM) | SR | 44.63 | 21.59 | 41.88
NSG | 44.61 | 21.58 | 41.89 | |
MDG | 44.59 | 21.48 | 41.81 | |
ALL | 44.46 | 21.47 | 41.77 |
Model | R-1 | R-2 | R-L | |
---|---|---|---|---|
STEP (In-domain) | SR | 46.58 | 28.12 | 42.62 |
NSG | 46.61 | 27.95 | 42.71 | |
MDG | 46.64 | 28.19 | 42.78 | |
ALL | 47.04 | 28.39 | 43.11 | |
STEP (GIGA-CM) | SR | 47.81 | 29.12 | 43.71
NSG | 47.60 | 29.02 | 43.51 | |
MDG | 47.61 | 29.08 | 43.56 | |
ALL | 47.68 | 29.13 | 43.50 |
Appendix B Additional Results
Results on Validation Set
Results on Validation Set of CNNDM with different beam sizes
Table 6 lists the ROUGE-L results on the validation set of CNNDM with different beam sizes for beam search during decoding. A beam size of 5 gives the highest ROUGE-L score. Thus, we use a beam size of 5 in this work.
Results on XSum
Different from CNNDM and NYT, XSum consists of 226,711 online news articles from the British Broadcasting Corporation (BBC), each annotated with a short, one-sentence summary answering the question “What is the article about?”. We adopt the same split (204,045/11,332/11,334 for training/validation/testing) and preprocessing procedures described in the work of Narayan et al. (2018a) to allow direct comparisons. Table 10 lists results on XSum. Here, we report our model pre-trained using the SR objective on the in-domain pre-training corpus, indicated as “STEP”. As we can see, pre-training does not give STEP a performance gain on this dataset. One possible reason is that each XSum summary contains only one sentence, so the SR objective might not be helpful here.
Model | R-1 | R-2 | R-L | |
---|---|---|---|---|
Extractive | ||||
Lead3 | 16.30∗ | 1.60∗ | 11.95∗ | |
Abstractive | ||||
PTGen See et al. (2017) | 28.10∗ | 8.02∗ | 21.72∗ | |
TConvS2S Narayan et al. (2018a) | 31.89∗ | 11.54∗ | 25.75∗ | |
BERTAbs Liu and Lapata (2019) | 38.81∗ | 16.50∗ | 31.27∗ | |
Transformer-S2S | 29.41∗ | 9.77∗ | 23.01∗ | |
RoBERTa-S2S | 43.54∗ | 20.49∗ | 35.75∗ | |
Ours | ||||
STEP | 43.02∗ | 20.11∗ | 35.34∗ |
Results on GLUE
We also apply the encoder of our best performing model to the GLUE tasks Wang et al. (2018), as listed in Table 7. Compared to RoBERTa Liu et al. (2019), the encoder of our best performing model does not consistently achieve higher results, which demonstrates that the improvements of our models on the abstractive summarization task do not come from a better encoder.
Appendix C Examples of System Outputs
Tables 11 and 12 show output examples from various systems, including BERTAbs Liu and Lapata (2019), UniLM Dong et al. (2019), the gold standard summaries (human references, denoted as Gold), the RoBERTa-S2S baseline, and our best performing model. Table 11 shows an example in which the outputs of BERTAbs and UniLM copy a sentence from the input article, while our model generates the summary by rewriting sentences. Table 12 lists an instance where the summary generated by UniLM contains an incomplete sentence.
Article | (CNN) In response to reports of big banks threatening to withhold campaign funds from Senate Democrats, Sen. Elizabeth Warren last week offered a defiant response: ”Bring it on.” Warren said she isn’t going to slack off on her calls for breaking up banks and other measures to rein in Wall Street. As Hillary Clinton prepares to officially launch her presidential campaign this month, she will need to make a choice about how much to highlight issues relating to economic inequality. Former Maryland Gov. Martin O’Malley, who is also running for the Democratic nomination, is trying to steal Clinton’s thunder by talking about the problems of disproportionate wealth. In other words, there are many signs that Democrats are planning to take on the big issue of economic inequality. But in other recent news , the likelihood that New York’s Chuck Schumer will replace Harry Reid as leader of the Senate Democrats means the dreams of a more economically leftward party are crashing into political reality. While Schumer has been a very effective Democrat and skilled legislative leader, he is also a Wall Street Democrat who has spent much of his time courting and protecting powerful financial interests who run one of the dominant industries in his state. He is not alone. Even at his most progressive moments, President Barack Obama relied on Wall Street donations for both of his campaigns. Despite all the talk from conservatives about left-wing ”socialism” in the White House, the financial community has been willing to open its coffers to Democrats without much concern, even in the 2012 election. Democratic populism can’t really work within the current campaign finance system. The enormous pressures for parties to raise funds in campaigns has for many decades created pressure on Democrats, despite their political base, to court big donors. During the 1980s, California Democrat Tony Coelho, serving as the chairman of the Democratic Congressional Campaign Committee and then as majority whip, made a strong appeal to savings and loans executives before the crash of the industry to catch up to Republicans who had been outflanking them in raising money. The Democrats were, and have continued to, losing their traditional base of campaign support – organized labor – which had been a central source of campaign muscle since the 1930s, providing money and campaign assistance during campaigns. Without organized labor to serve as their foundation and with the pressure for raising private funds increasing, many Democrats concluded they needed business by their side. Democrats running for president have made the same kind of choices. In 2008, Obama disappointed many supporters upon becoming the first president to abandon the post-Watergate public finance system for campaigns altogether, preferring to raise money himself for the general campaign. While small donors were enormously important to his victories, so too were business and Wall Street executives. At the height of the financial crash, when public sentiment had clearly turned against Wall Street, the administration agreed to a financial regulation bill (Dodd-Frank) that was structured in such a way as to give powerful interests more than enough opportunity to limit the bite over the coming years. Wall Street, with an army of counsel, succeeded in eroding the impact of the legislation. Not only does the acceptance of our campaign finance system limit the policy choices Democrats can make, but it also greatly damages the party’s brand name. 
As The Washington Post reported, the scandal that might bring down New Jersey Democratic Sen. Robert Menendez is the first involving large scale super PAC donations. At the heart of the story is almost $600,000 that physician Salomon Melgen gave to Senate Majority PAC, possibly in exchange for favors. This is not simply some sort of accommodation of Democrats to the corporate system. They don’t have much of a choice. Without these funds, they won’t be able to compete. In this election cycle, independent campaign donors are causing a huge stir. In conservative circles, the Koch brothers and their allies are throwing around enormous amounts of money to candidates who will support their deregulatory agenda. Individual donors such as Las Vegas gambling magnate Sheldon Adelson are causing ripples every time candidates speak, pressuring them to adjust their agenda. Democrats have found their own magnates for political support, such as Tom Steyer and George Soros. This is why campaign finance reform is so important, Without Congress changing the fundamental dynamics, there won’t be much room for populism to thrive. Even if Democrats select someone like a Elizabeth Warren as their candidate or Hillary Clinton decides to move sharply to the left on economic policy, there won’t be much room for reform when the time of governance actually starts. The Democratic Party needs Wall Street more than it needs to take a stand against Wall Street. Those are the facts on the ground. If Democrats really want to take on Wall Street and tackle economic inequality, they first have to bring about reform of the campaign finance system. If campaigns were publicly funded or there were more stringent limits on independent expenditures, Wall Street would have much more trouble achieving disproportionate influence. Reform could level the playing field. More often than not, campaign finance reform is an issue that gets sidetracked with little more than some pro forma words of support. A more populist economic agenda that revolved around progressive taxation and substantial public assistance to strengthen the middle class can only work in a different kind of political system. If things stay the same, Democrats can only continue to win elections by turning to their corporate and financial base of support. |
---|---|
BERTAbs | julian zelizer: sen.elizabeth warren said she isn’t going to slack off on her call for breaking up banks. he says the likelihood that new york’s chuck schumer will replace harry reid as leader of the senate democrats. zelizer: democratic populism can’t really work within the current campaign finance system |
UniLM | Julian Zelizer: Democratic populism can’t really work within the current campaign finance system. He says the pressure for parties to raise funds in campaigns has created pressure on Democrats to court big donors. He says even at his most progressive moments , President Barack Obama relied on Wall Street donations for both campaigns. He says Obama ’s decision to abandon the post-Watergate public finance system |
RoBERTa-S2S | Julian Zelizer: Sen. Elizabeth Warren isn’t going to slack off on Wall Street. Zelizer : Democrats are planning to take on the big issue of economic inequality. He says Democrats have lost their traditional base of campaign support – organized labor – and money – in their campaigns. |
STEP | Julian Zelizer: Democrats are planning to take on the big issue of economic inequality. Zelizer: Democratic populism can’t work within the current campaign finance system. He says Democrats have lost traditional base of campaign support. |
Gold | Julian Zelizer: Elizabeth Warren was defiant about Wall Street, but Hillary Clinton likely won’t be. Zelizer: The Democrats need Wall Street’s campaign donations to be competitive in 2016. |
Article | (CNN) Hillary Clinton is finally announcing her candidacy for the 2016 presidential election. Although she has watched her standing in the polls sag in recent months, there is likely to be a boost in the days that follow the announcement. For Democrats, there is ample reason to be excited about Clinton’s run for the presidency. She is certainly one of the strongest candidates in many decades. She brings to the table extensive political and policy experience, a combination of skills that is often lacking. She has been through some of the roughest partisan wars and emerged stronger than ever before. She has a keen sense about the nature of the modern news media, how to use it to her advantage and how to survive scandal frenzies. She is a hardened, tough partisan who will not shy away from Republican attack. Americans have many positive memories of the Clinton name, given the booming economy of the late 1990s during Bill Clinton’s presidency. If Hillary Clinton puts together an effective campaign, she could be unbeatable in the Democratic primaries as well as in the general election. However, during the buildup to her final decision, some of her weaknesses have also been exposed. Clinton doesn’t want to end up like Vice President Al Gore in 2000. Although he did relatively well in the final election (with many Americans believing that he did actually defeat George W. Bush), he didn’t generate much energy once the campaign started. Although he too was touted as a “perfect” candidate who was the ideal person for the job, something seemed stiff and inauthentic when he actually hit the trail. He seemed to freeze when the television cameras were rolling. Gore had trouble connecting with voters, and he seemed to remake his image constantly. His biggest asset ended up being that he was viewed as the inevitable nominee, rather than what he actually stood for. Clinton must avoid following Gore’s path. She suffered this fate in the 2008 primaries and can’t afford to do so again. She needs to do more than rest on the perception that her candidacy is inevitable and on her record of experience. That is not enough. More important is for her to put forth an exciting vision about what she would stand for in the White House. Voters thirst for signs of greatness when they pick their presidents, even if they are savvy enough to understand that the reality of a polarized Washington will probably limit her ability to achieve bold change. A recent story in The Washington Post suggests that her advisers are aware of this potential liability. After the announcement, they are going to avoid big rallies and events and instead concentrate on smaller events where she will meet with voters directly in states such as Iowa and New Hampshire. Clinton also will have to contend with doubts about her authenticity. In his first day on the campaign trail, Sen. Rand Paul immediately tapped into these concerns by raising questions about whether she could be trusted. That question has dogged the Clintons ever since they came onto the national political scene in the late 1980s. Their greatest virtue, their immense skills as politicians, has often come back to haunt them. Bill Clinton was attacked as “slick Willie” by members of both parties for the perception that he would say anything to win, and Hillary Clinton has faced similar criticism. When she tried to distance herself from her vote for the use of force in Iraq, many Democrats didn’t buy her critique of President George W. Bush’s foreign policies and went for Barack Obama instead. When she conducted her “listening tour” of New York before running for the Senate, many voters saw it as a manufactured effort to hide the fact she was running for office as an outsider. When she explained that there was nothing to the recent stories about her use of a private email server rather than her State Department email, some felt that even if the story was relatively minor, it indicated that she wasn’t always telling us what she was really about. Even if she isn’t hiding anything, she often gives that appearance. During the next few months, Clinton will also have to connect with her party’s base. The ongoing speculation about Sen. Elizabeth Warren of Massachusetts has suggested that the most active part of the Democratic Party is not that enthused with Clinton’s candidacy. While they will probably vote for her, they are not very motivated and don’t trust that she will stand for Democratic values. She will need to address these concerns, not through her style but through her agenda. Voters will want to hear her talking about issues such as tougher financial regulation and policies to diminish economic inequality, as well as her positions on race and policing. She will also need to make clear that she has heard voters on being too hawkish about going to war and give clear indications about how she would handle a nuclear agreement with Iran. Clinton will also have to contend with the gender bias that still exists in the electorate at large. Without any doubt she will be subject to questions and comments – about her appearance, for instance – that won’t be aimed at male candidates. Part of her candidacy is itself an effort to break down these remaining vestiges of political sexism. But the struggle will be tough. Finally, and this relates to the last challenge, Clinton will have to contend with her husband. To be sure, he can be an immense force on the campaign trail, one of the most compelling Democrats of our generation. But he can also be a liability. As she learned in 2008, Bill Clinton is not always easy to control. When he speaks his mind, as he did in dismissive comments about Obama’s candidacy, it can often work against her. The fund-raising records of the Clinton Foundation will also raise questions about conflict of interest, and ongoing stories about his personal life, as was the case when Monica Lewinsky returned to the media a few months ago, could re-emerge on the campaign trail. Whether that is fair or not is beside the point: Everything is fair game on the modern campaign trail. Hillary Clinton has the potential to be a hugely successful presidential candidate. But she and her campaign team will need to address the multiple questions and weaknesses that have become clear in recent months.
BERTAbs | julian zelizer: hillary clinton is finally announcing her candidacy for the 2016 presidential election. zelizer: she has been through some of the roughest partisan wars and emerged stronger than ever before. he says she is a hardened, tough partisan who will not shy away from republican attack |
UniLM | Julian Zelizer: Hillary Clinton is finally announcing her candidacy for the 2016 presidential election. He says she has extensive political and policy experience, a combination of skills often lacking. He says Clinton doesn’t want to end up like Vice President Al Gore in 2000; he didn’t generate much energy once the campaign started. Clinton must avoid following Gore’s path, he |
RoBERTa-S2S | Julian Zelizer: For Democrats, there is plenty of reason to be excited about Hillary Clinton’s run. Zelizer: If Clinton puts together an effective campaign, she could easily win the general election. He says Clinton needs to put forth on what she would stand for in the White House.
STEP | Julian Zelizer: For Democrats, there is ample reason to be excited about Hillary Clinton’s run for president. Zelizer: Clinton needs to put forth an exciting vision about what she would stand for. |
Gold | Julian Zelizer: Hillary Clinton has immense political and governmental experience. He says she needs to make stronger connection to her party’s base. Clinton also needs to convince voters of her authenticity, Zelizer says.