
CoLo: A Contrastive Learning based Re-ranking Framework for One-Stage Summarization

Chenxin An1, Ming Zhong2, Zhiyong Wu3 ,Qin Zhu1, Xuanjing Huang1, Xipeng Qiu1
1School of Computer Science, Fudan University
2 University of Illinois at Urbana-Champaign
3 Shanghai AI Lab
{cxan20, qzhu18, xjhuang, xpqiu}@fudan.edu.cn
[email protected], [email protected]
   Corresponding author.
Abstract

Traditional training paradigms for extractive and abstractive summarization systems use only token-level or sentence-level training objectives. However, the output summary is always evaluated at the summary level, which leads to an inconsistency between training and evaluation. In this paper, we propose CoLo, a contrastive learning based re-ranking framework for one-stage summarization. By modeling a contrastive objective, we show that the summarization model is able to directly generate summaries according to a summary-level score without additional modules or parameters. Extensive experiments demonstrate that CoLo boosts the extractive and abstractive results of one-stage systems on the CNN/DailyMail benchmark to 44.58 and 46.33 ROUGE-1, respectively, while preserving parameter and inference efficiency. Compared with state-of-the-art multi-stage systems, we save more than 100 GPU training hours and obtain a 3×~8× speed-up ratio during inference while maintaining comparable results (code available at https://github.com/ChenxinAn-fdu/CoLo).

1 Introduction

In general, there are two main paradigms for text summarization: abstractive Rush et al. (2015); Nallapati et al. (2016); Gehrmann et al. (2018) and extractive Cheng and Lapata (2016); Narayan et al. (2018b); Zhong et al. (2019, 2022) methods.

For extractive summarization, previous studies Nallapati et al. (2017); Liu and Lapata (2019) formulate it as a sentence-level sequence labeling task. However, there is an inherent gap between sentence-level scoring and summary-level evaluation Zhong et al. (2020). This means that some high-scoring sentences may share the same meaning, so they do not form a qualified summary when combined. Similarly, the previous training paradigm for abstractive summarization models can be viewed as a token-level scoring process upon the decoder of a sequence-to-sequence model. There also exists the issue of exposure bias Bengio et al. (2015); Paulus et al. (2017) in the teacher-forcing framework, leading to error accumulation during auto-regressive decoding. Therefore, previous frameworks for both extractive and abstractive methods do not perform summary-level optimization.

To tackle this problem, state-of-the-art summarization systems Zhong et al. (2020); Liu and Liu (2021) are enhanced with an additional module (called a re-ranker) and follow a two-stage paradigm. They first train a summarizer to model the conditional distribution p(Y|X), where X is the document and Y is the output summary. Then the re-ranker is trained in a second stage to re-score candidates sampled from the pre-trained summarizer. However, this paradigm trades efficiency for accuracy: the auxiliary re-ranking greatly harms inference efficiency, especially for highly efficient extractive systems. Experimentally, the decoding speed of two-stage re-ranking models is only ~7.0 samples/s, while removing the re-ranker module boosts the decoding speed to ~42.0 samples/s (we run both models on the test set of CNN/DailyMail three times on a single GeForce GTX TITAN XP GPU and report the average speed). This makes two-stage summarization systems potentially unacceptable in real-world scenarios that require timely feedback.

The limitations of existing work motivate us to build a one-stage summarization system that can 1) replace the previous naive sentence/token-level score with a summary-level score and 2) not sacrifice parameter and inference efficiency. In this paper, we propose CoLo, a contrastive learning based re-ranking framework for one-stage summarization, for both extractive and abstractive approaches. Contrastive learning has been explored in summarization Sun and Li (2021); An et al. (2021b) and generation Lee et al. (2020); An et al. CoLo uses a contrastive re-ranking training objective.

We first present a novel sampling method that can be applied to any one-stage summarization system so that it can re-score candidates without a second stage.

Existing two-stage models use offline sampling to preprocess training samples for the re-ranker, where candidate samples are drawn from a fixed model distribution.

This is a huge obstacle to turning the summarize-then-rerank two-stage framework into an efficient end-to-end model. To solve this issue, we propose an online sampling approach. Concretely, instead of sampling from a fixed distribution, we draw positive and negative samples from a dynamic distribution of model outputs during training, which ultimately eliminates the need for additional modules in the overall framework.

We then introduce a summary-level optimization strategy in addition to the traditional sentence-level (for extractive systems) or token-level loss (for abstractive systems). As a result, as a one-stage model, CoLo achieves comparable performance to two-stage systems, and greatly improves decoding speed to meet the needs of real-world applications.

We summarize our contributions as follows:

  • We are the first to propose a one-stage re-ranking framework CoLo for both extractive and abstractive summarization systems.

  • Results on the popular CNN/DailyMail benchmark show that both the extractive and abstractive versions of CoLo outperform previous state-of-the-art one-stage systems by a large margin. Compared to two-stage systems, CoLo achieves comparable performance without an additional pre-trained model. More importantly, CoLo does not sacrifice inference speed and thus can be more widely used in real-world scenarios.

2 Background

2.1 Preliminary about Two-Stage Systems

Two-stage paradigms Zhong et al. (2020); Liu and Liu (2021) improve summarization quality by re-ranking and selecting a candidate from a given set of candidates. MatchSum Zhong et al. (2020) forms a contrastive learning based re-ranking framework: a set of candidate summaries is first generated by an extractive summarization model and then fed to a re-ranker. The re-ranker is trained to optimize a summary-level score, so it can evaluate candidate summaries holistically. SimCLS Liu and Liu (2021) is the abstractive counterpart, which replaces the extractive summarizer in Zhong et al. (2020) with an abstractive summarizer.

The training objective of summarization models is to estimate a conditional probability distribution p(Y|X), where X is the document and Y is the output summary. Given a summarization model \mathcal{M} that has already been tuned under the conventional framework with a loss function \mathcal{L}_{sum}, where \mathcal{L}_{sum} can be the binary cross entropy loss (BCELoss) or the negative log likelihood loss (NLLLoss), a two-stage system first uses a sampling algorithm, e.g., beam search, to sample a candidate set \mathcal{C}=\{C_{1},C_{2},\ldots,C_{m}\} of size m from the fixed model distribution C_{i}\sim p_{\mathcal{M}}(Y|X). Candidates in \mathcal{C} are sorted by their ROUGE scores in descending order. A separate re-ranker, e.g., BERT, is then trained with a contrastive-style ranking loss \mathcal{L}_{rank} to select the best candidate from \mathcal{C} as the final output. The ranking loss used in the best re-ranking systems for summarization is the triplet margin loss. For a candidate pair (C_{i},C_{j}) with i<j, C_{i} has the higher ROUGE score and is treated as the positive sample:

\mathcal{L}_{i,j}=\max\{0,\cos(\mathbf{z}_{X},\mathbf{z}_{C_{j}})-\cos(\mathbf{z}_{X},\mathbf{z}_{C_{i}})+\rho\}, \qquad (1)

where \mathbf{z}_{X}, \mathbf{z}_{C_{i}}, \mathbf{z}_{C_{j}} are the vector representations of X, C_{i}, C_{j} output by the re-ranker, and \rho is the margin value. The final ranking loss is obtained by summing over all pairs: \mathcal{L}_{rank}=\sum_{j}\sum_{i<j}\mathcal{L}_{i,j}. The ranking loss ensures that candidates with higher ROUGE scores are closer to the document in the embedding space.
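A minimal PyTorch sketch of the pairwise margin loss in Eq. 1, summed over all candidate pairs, is shown below. It assumes precomputed feature vectors; the function and tensor names are illustrative and not taken from any released implementation (some implementations also scale the margin with the rank gap between i and j, which we omit for simplicity).

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(z_doc, z_cands, margin=0.01):
    """Pairwise margin loss of Eq. 1, summed over all candidate pairs.

    z_doc:   (d,)   document representation z_X
    z_cands: (m, d) candidate representations, sorted by ROUGE in
             descending order (index i < j means C_i is the positive).
    """
    sims = F.cosine_similarity(z_cands, z_doc.unsqueeze(0), dim=-1)  # (m,)
    loss = z_doc.new_zeros(())
    m = sims.size(0)
    for j in range(m):
        for i in range(j):
            # hinge: the positive C_i should score at least `margin` above C_j
            loss = loss + torch.clamp(sims[j] - sims[i] + margin, min=0)
    return loss

# toy usage with random features
print(pairwise_ranking_loss(torch.randn(768), torch.randn(20, 768)).item())
```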

2.2 A Comparison between Two-Stage Systems and CoLo

Figure 1 illustrates the difference between the architecture of two-stage systems and CoLo. Although MatchSum and SimCLS significantly outperform all one-stage models, they suffer from three drawbacks that strongly emphasize the necessity of designing a one-stage model:

(1) Training/inference inefficiency. Building the training set for the re-ranker and running the second training stage consume large amounts of GPU and CPU time (see details in Section 5.3). Moreover, the need to re-feed generation results to another module also requires unaffordable computational resources.

(2) Coupling between the summarizer and the re-ranker. Each improvement to one of these modules requires simultaneously updating or retraining the other module, which limits the use of such systems in the real world. For example, to try a larger candidate set or a different decoding method, we have to prepare the training set for the second stage again. In addition, how to tune the hyperparameters to be optimal in both modules at the same time is another tricky issue. Compared with two-stage systems, our one-stage system has a simple and clean implementation.

(3) Two-stage systems also face difficulties in long document summarization, because the input length of the re-ranker drastically increases as the length of the candidates increases (see detailed analysis in Appendix A). In contrast, CoLo is not easily affected by length variance.

Figure 1: A comparison between two-stage models and CoLo. (a) Two-stage models: MatchSum and SimCLS; (b) CoLo (this work). The two-stage models include two training stages and a time-consuming preprocessing step, while CoLo is trained in an end-to-end fashion (GPU and CPU hours spent in each stage are shown in Table 6). Two-stage models use offline sampling to build positive-negative pairs, while CoLo builds these pairs with online sampling, drawing them directly from a changing model distribution.

3 Method

3.1 A Naive One-Stage Re-ranking Model

The goal of a one-stage re-ranking system is to score candidate summaries holistically during both training and inference without requiring a second stage of computation by a separate model. Ideally, a one-stage summarization model should function both as a summarizer and as a re-ranker. A straightforward solution is multi-task learning. The naive training pipeline can be formulated as follows: (i) tuning \mathcal{M} with \mathcal{L}_{sum}; (ii) getting positive and negative samples from p_{\mathcal{M}}(Y|X) via offline sampling for each datapoint X in the training set; (iii) building the ranking loss with these candidates and further tuning \mathcal{M} with \mathcal{L}_{rank}+\mathcal{L}_{sum}. However, in practice, this training method is consistently suboptimal compared to the state-of-the-art two-stage models. We denote the model after multi-task learning as \mathcal{M}'. There is a serious generalization error in the naive method: through multi-task learning, \mathcal{M}' is only able to rank candidates drawn from the original model distribution p_{\mathcal{M}}(Y|X), not candidates from the new distribution p_{\mathcal{M}'}(Y|X). This error prevents the naive approach from directly selecting a good summary, at the summary level, among outputs generated by itself.

3.2 Our approach: CoLo

The first step of CoLo is also to train the summarization model with \mathcal{L}_{sum}, as in the naive approach.

In CoLo, we discard positive-negative samples drawn from a fixed model distribution; instead, we sample these candidates from a constantly shifting model distribution during multi-task learning. By doing so, we mitigate the above-mentioned generalization error as much as possible, because the candidates change dynamically as the parameters of the model distribution p_{\mathcal{M}}(Y|X) are updated by gradient descent. To implement this process, at each training step we sample the newest candidates along with their feature representations from the summarization model and calculate the ranking loss. We give a detailed description of how we perform online sampling for mainstream extractive and abstractive summarization models in the following parts.

Online Sampling for Extractive Model

The task of extractive summarization is to assign a label y_{i}\in\{0,1\} to each sentence sent_{i} in the source document X=(sent_{1},sent_{2},\ldots,sent_{n}) consisting of n sentences. Figure 2 gives an example of our one-stage extractive summarization model. Extractive candidates can be viewed as subsets of sentences from the document. In this figure, we sample sent_{1} and sent_{2} to form the first candidate C_{1}=\{sent_{1},sent_{2}\}, and C_{2} consists of \{sent_{2},sent_{3}\}. After constructing these candidates, the next step is to represent them in the embedding space. In our one-stage model, we employ a heuristic way to obtain the feature representations of candidates: pooling the sentence embeddings from the extractive model. Concretely, we denote the sentence embedding of the i-th sentence as \mathbf{h}_{i}. The hidden representation of a candidate is created by pooling the representations of the sentences belonging to it; for example, \mathbf{z}_{C_{1}} is the average pooling result of \mathbf{h}_{1} and \mathbf{h}_{2}. Suppose C_{2} has a higher ROUGE score than C_{1}; then C_{2} is treated as the positive sample and C_{1} as the negative sample of this pair. Finally, the whole system is trained with the sum of \mathcal{L}_{rank} and \mathcal{L}_{sum}.

Sampling informative candidates is essential in re-ranking systems. The first step of our sampling method is to determine \mathcal{N}, the set of possible numbers of candidate sentences. \mathcal{N} is set depending on the number of summary sentences in the downstream dataset. Taking CNN/DailyMail as an example, we set \mathcal{N} to \{2,3\} because most gold summaries consist of 2~3 sentences. At each training step, we iterate over \mathcal{N} by combination and form m different candidates \mathcal{C}=\{C_{1},C_{2},\ldots,C_{m}\}, where m=\sum_{i}C_{n}^{num_{i}}, num_{i} is the i-th element of \mathcal{N}, and n is the number of sentences in the document. For CNN/DailyMail, whose \mathcal{N} is set to \{2,3\}, we can sample C_{n}^{2}+C_{n}^{3} different candidates.
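To make the size of this candidate space concrete, the count m=\sum_{i}C_{n}^{num_{i}} can be checked directly; the 30-sentence document below is an illustrative example, not a dataset statistic.

```python
from math import comb

# With N = {2, 3}, the raw candidate space for an n-sentence document is
# m = C(n, 2) + C(n, 3); for a 30-sentence article this already gives
print(comb(30, 2) + comb(30, 3))  # 4495 candidates before any clipping
```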

However, in practice, we face a combination explosion problem as the number of sentences n grows. The two-stage system of Zhong et al. (2020) pre-trains an extractive model to clip the original number of sentences to an acceptable size. Since our extractive summarizer is also supervised with the BCELoss, we can clip the sampling space to n' (a hyperparameter) sentences using the output distribution over sentences at each training step. The total size of the final candidate set then decreases to m'=\sum_{i}C_{n'}^{num_{i}}. For CNN/DailyMail, n' is set to 5, so we obtain C_{5}^{2}+C_{5}^{3}=20 different extractive candidates. Details on the settings of \mathcal{N} and n' can be found in Table 1.
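The sketch below illustrates how one online-sampling training step of the extractive variant could look, combining the BCELoss with the pairwise ranking loss over candidates built from the current model's top-n' sentences. It is a simplified reconstruction from the description above, not the released implementation; the function and argument names, and the `rouge_fn` callback, are our own.

```python
import itertools
import torch
import torch.nn.functional as F

def colo_ext_step(doc_emb, sent_embs, sent_logits, sent_labels,
                  rouge_fn, n_clip=5, sizes=(2, 3), margin=0.01):
    """One online-sampling training step for the extractive variant (sketch).

    doc_emb:     (d,)   document representation from the current model
    sent_embs:   (n, d) sentence representations from the current model
    sent_logits: (n,)   per-sentence extraction scores (before sigmoid)
    sent_labels: (n,)   0/1 oracle labels for the BCE loss
    rouge_fn:    callable mapping a tuple of sentence indices -> ROUGE score
    """
    # 1) sentence-level loss (the conventional objective)
    bce = F.binary_cross_entropy_with_logits(sent_logits, sent_labels.float())

    # 2) clip the sampling space to the top-n' sentences of the *current* model
    top = torch.topk(sent_logits, k=min(n_clip, sent_logits.size(0))).indices.tolist()

    # 3) enumerate candidates and build their embeddings by average pooling
    cands = [c for k in sizes for c in itertools.combinations(sorted(top), k)]
    cand_embs = torch.stack([sent_embs[list(c)].mean(dim=0) for c in cands])

    # 4) sort candidates by ROUGE (descending) and apply the pairwise margin loss
    order = sorted(range(len(cands)), key=lambda i: rouge_fn(cands[i]), reverse=True)
    sims = F.cosine_similarity(cand_embs[order], doc_emb.unsqueeze(0), dim=-1)
    rank = doc_emb.new_zeros(())
    for j in range(len(order)):
        for i in range(j):
            rank = rank + torch.clamp(sims[j] - sims[i] + margin, min=0)

    return bce + rank
```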

Notably, offline sampling needs to feed each candidate into the pre-trained encoder. In real-life settings, when summarizing long documents, the number of sentences in the input document and the output summary increases significantly, which brings a polynomial increase in the computation and GPU overhead of the two-stage model. Our one-stage system with online sampling, however, is robust to this length variance.

Inference Stage of Extractive Model

Since we have modeled a summary-level score during training, it is easy to directly generate summaries according to this summary-level semantic score. Concretely, given a candidate set \mathcal{C} built by the combination strategy, we calculate the cosine similarity between each candidate representation \mathbf{z}_{C_{i}} and the document representation \mathbf{z}_{X}:

\hat{C}=\operatorname*{arg\,max}_{C_{i}\in\mathcal{C}}\cos(\mathbf{z}_{X},\mathbf{z}_{C_{i}}). \qquad (2)

The final output is the candidate with the highest cosine similarity score.
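A minimal sketch of this selection step (Eq. 2), assuming the candidate and document representations have already been computed; the names are illustrative.

```python
import torch
import torch.nn.functional as F

def select_candidate(doc_emb, cand_embs, candidates):
    """Pick the candidate whose representation is closest to the document.

    doc_emb:    (d,)   document representation z_X
    cand_embs:  (m, d) candidate representations z_{C_i}
    candidates: list of m candidates (e.g., tuples of sentence indices)
    """
    sims = F.cosine_similarity(cand_embs, doc_emb.unsqueeze(0), dim=-1)
    return candidates[int(torch.argmax(sims))]

# toy usage
cands = [(0, 1), (1, 2), (0, 2)]
print(select_candidate(torch.randn(768), torch.randn(3, 768), cands))
```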

Figure 2: Architecture of our extractive model. In the input sequence, the ‘[doc]’ token is used to obtain the vector representation \mathbf{z}_{X} of the document X, and ‘[sep]’ is used as the separator between sentences. We omit the classifier and the BCELoss. \mathbf{h}_{i} is the sentence embedding of the i-th sentence in X, and \mathbf{z}_{C_{i}} is the feature representation of the i-th candidate.

Online Sampling for Abstractive Model

Our method can also be easily adapted to abstractive summarization. Selecting the generated summary by maximum a posteriori (MAP) decoding usually results in poor performance Stahlberg and Byrne (2019), so most state-of-the-art generation models use the beam search algorithm at inference time. Online sampling for the abstractive version is much simpler than for the extractive version: we use beam search as the sampling algorithm and obtain the feature representations from the encoder/decoder outputs. We denote the encoder output of the source document X as H_{enc} and the decoder hidden states of the target summary as H_{dec}. We take the document representation from the encoder output at the 0-th token, \mathbf{z}_{X}=H_{enc}^{0}. The feature representation of the i-th candidate C_{i} with length |C_{i}| is taken from the last step of the decoder output, \mathbf{z}_{C_{i}}=H_{dec}^{|C_{i}|-1}. Hidden states of other steps cannot represent the entire sequence because of the sequence mask in the Transformer decoder. Finally, we formulate the ranking loss following Eq. 1.
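The following sketch shows how the summary-level score of each beam could be read off standard encoder-decoder hidden states as described above (z_X from the 0-th encoder position, z_{C_i} from the last decoder step); the shapes and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def abstractive_rank_scores(enc_hidden, dec_hidden, cand_lens):
    """Score beam-search candidates against the source document (sketch).

    enc_hidden: (1, L_src, d)  encoder hidden states of the document
    dec_hidden: (m, L_tgt, d)  decoder hidden states of m padded candidates
    cand_lens:  (m,)           true length |C_i| of each candidate
    """
    z_x = enc_hidden[:, 0]                         # document: 0-th encoder state
    last = (cand_lens - 1).view(-1, 1, 1).expand(-1, 1, dec_hidden.size(-1))
    z_c = dec_hidden.gather(1, last).squeeze(1)    # last decoder state per beam
    return F.cosine_similarity(z_c, z_x, dim=-1)   # (m,) summary-level scores

# toy shapes: 16 beams, up to 60 decoded tokens, hidden size 768
scores = abstractive_rank_scores(torch.randn(1, 60, 768),
                                 torch.randn(16, 60, 768),
                                 torch.randint(5, 60, (16,)))
print(scores.argmax().item())  # index of the selected beam
```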

Inference Stage of Abstractive Model

The inference stage of our abstractive version is similar to that of the extractive version. We save the feature representations of the document and of each beam during beam search. The final output is determined by the cosine similarity between \mathbf{z}_{X} and \mathbf{z}_{C_{i}}.

4 Experimental Setup

                CNN/DM   Reddit   XSum    SSN    PubMed
n'              5        5        5       8      8
\mathcal{N}     {2,3}    {1,2}    {1,2}   {6}    {6,7}
|\mathcal{C}|   20       15       15      28     36
Table 1: Candidate set size |\mathcal{C}| for each dataset (extractive). n' is the clipped number of candidate sentences, and \mathcal{N} is the set of possible numbers of summary sentences.

4.1 Datasets

We conduct experiments on five mainstream datasets to evaluate the effectiveness of our approach.
CNN/DailyMail Hermann et al. (2015) is a classic benchmark containing articles from the CNN and Daily Mail newspapers. We use the cased version from the datasets library (https://github.com/huggingface/datasets).
XSum Narayan et al. (2018a) is a one-sentence summary dataset from BBC News. Gold summaries are professionally written by the authors of the documents.
Reddit Kim et al. (2019) is collected from the social media platform; we use the TIFU-long version.
PubMed Cohan et al. (2018) is a long document summarization dataset from the scientific domain whose average summary length is about 4 times longer than that of CNN/DM.
SSN An et al. (2021a) consists of papers mainly from math, physics, and computer science, with the abstract section as the gold reference.

4.2 Implementation Details

For simplicity of the experimental settings, both the extractive and abstractive models are based on BART. We use the encoder of BART (170M) as the backbone and a 3-layer MLP as the classifier to implement the extractor. We add two special tokens: ‘<cls>’ to generate the sentence representations and ‘<sep>’ as the sentence separator; the ‘<doc>’ token is used to generate the document feature representation. The candidate size for each dataset can be found in Table 1. We use the Adam optimizer Kingma and Ba (2014), and the learning rate schedule follows the setting in the Transformer Vaswani et al. (2017). We train our model for 15000 steps with the BCELoss and 32000 steps with the BCELoss plus the ranking loss, where each step has a batch size of 36. The margin parameter \rho is set to 0.01. The size of the generated candidate set |\mathcal{C}| is set to 20 for CNN/DM. Other settings follow the defaults in Liu and Lapata (2019). Our extractive model is trained on a single GeForce RTX 3090 GPU for 8 hours. Both our abstractive and extractive models are trained on 24G GeForce RTX 3090 GPUs, and inference is run on 12G GeForce GTX TITAN XP GPUs.

For the abstractive model, we choose BART initialized from facebook/bart-large-cnn in transformers (https://github.com/huggingface/transformers) as the basic summarizer. We further fine-tune this model with the NLLLoss and the ranking loss for 15000 steps with a batch size of 8 per step. Other settings are the same as in our extractive version. To encourage diversity, we use diverse beam search Vijayakumar et al. (2016) to generate the candidates, with the beam size set to 16 and the diversity penalty set to 1.0. Our abstractive model is trained on 8 GeForce RTX 3090 GPUs for about 18 hours.
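For reference, candidate generation with diverse beam search can be done with the HuggingFace transformers generate API roughly as follows; the checkpoint name matches the one above, while the max_length and truncation settings are illustrative assumptions, not the paper's exact configuration.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

doc = "..."  # source document
inputs = tok(doc, return_tensors="pt", truncation=True, max_length=1024)
beams = model.generate(
    **inputs,
    num_beams=16,
    num_beam_groups=16,        # diverse beam search
    diversity_penalty=1.0,
    num_return_sequences=16,   # keep all beams as candidates
    max_length=142,            # illustrative decoding length
)
candidates = tok.batch_decode(beams, skip_special_tokens=True)
```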

4.3 Evaluation Metrics

We examine our approach with 4 metrics that measure the closeness of the generated summaries to the gold reference. ROUGE Lin (2004): R-1 and R-2 measure informativeness based on n-gram overlap, and R-L reflects fluency. JS-2 divergence Louis and Nenkova (2013) measures the Jensen-Shannon divergence between the bigram distributions of two input texts. BERTScore Zhang et al. (2019) measures the soft overlap between BERT embeddings of two texts instead of using lexical matching. MoverScore Zhao et al. (2019) is also based on a neural model but applies an earth mover's distance measure to contextualized BERT embeddings.
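As an illustration of how the lexical and embedding-based metrics can be computed, here is a small sketch using the third-party rouge-score and bert-score packages; these are assumptions for the example and not necessarily the exact toolkits used in the paper.

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score

ref = "the cat sat on the mat ."
hyp = "a cat was sitting on the mat ."

# lexical matching: ROUGE-1/2/L F-measures
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(ref, hyp)
print(rouge["rouge1"].fmeasure, rouge["rouge2"].fmeasure, rouge["rougeL"].fmeasure)

# embedding-based matching: BERTScore F1
P, R, F1 = bert_score([hyp], [ref], lang="en")
print(F1.mean().item())
```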

5 Results

We denote the models trained without contrastive learning as the baseline systems. Since the backbone of our extractive model is the BART encoder, we call the extractive baseline BartExt; the baseline for the abstractive system is BART. Our extractive model is called CoLoExt, and its abstractive version is denoted as CoLoAbs.

5.1 Extractive Results

In this section, we compare our models with baselines that have a similar number of parameters and similar decoding speed. Our extractive results on CNN/DM are shown in Table 2. We compare our model with previous strong extractive baselines built on pre-trained models Zhong et al. (2019); Bae et al. (2019); Liu and Lapata (2019) and strong multi-stage systems Zhong et al. (2020). From the third section of Table 2, we can see that CoLoExt beats the baseline model by 1.49 ROUGE-1 and achieves the state of the art among all end-to-end systems when the input length is set to 512; the results can be further improved by extending the input length to 1024. Even compared with BertSum-large (340M) Liu and Lapata (2019), which is built on a large PTM, we still obtain an improvement of 0.42 with only half the number of parameters. Although RL-based methods are motivated by optimizing towards the evaluation metric, they do not gain much performance improvement in practice.

To verify whether our model is effective on datasets of various lengths, we also evaluate it on datasets with short summaries (Reddit and XSum) and on the long document dataset PubMed; results are shown in Table 3. On Reddit and XSum, we achieve an advantage of more than 1.0 ROUGE-1 point over the baseline systems and performance close to the upper bound ORACLE. We also obtain improvements on the long document summarization dataset PubMed. Detailed results on long document datasets can be found in Appendix A.

Model R-1 R-2 R-L
LEAD 40.43 17.62 36.67
ORACLE 52.59 31.23 48.87
Transformer Vaswani et al. (2017) 40.90 18.02 37.17
Bert-Ext Bae et al. (2019) 42.29 19.38 38.63
Bert-Ext + RL 42.76 19.87 39.11
BertSum Liu and Lapata (2019) 42.57 19.96 39.04
BertSum-large 43.85 20.34 39.90
BartExt 42.78 20.24 39.24
BartExt (len=1024) 43.65 20.88 40.19
Naive one-stage 43.53 20.54 39.62
CoLoExt 44.10 20.97 40.19
CoLoExt + BERTScore 44.27 21.01 40.34
CoLoExt (len=1024) 44.58 21.25 40.65
Table 2: Extractive results on the CNN/DM test set. len denotes the input length of the document; results without this marker use 512 tokens as input. +RL denotes the addition of reinforcement learning. +BERTScore means we use BERTScore to determine positive-negative samples. CoLo clearly outperforms all previous one-stage summarization systems. The best results are in bold and the second best are underlined.
Model Reddit XSum PubMed
R-1 R-2 R-L R-1 R-2 R-L R-1 R-2 R-L
LEAD 12.38 2.17 10.12 14.40 1.46 10.59 37.58 12.22 33.44
ORACLE 29.10 11.08 23.10 25.62 7.62 18.72 45.12 20.33 40.19
BertSum 23.86 5.85 19.11 22.86 4.48 17.16 41.05 14.88 36.57
BartExt 23.97 5.68 19.24 22.96 4.70 17.29 41.40 16.18 37.89
CoLoExt 25.06 5.90 19.52 24.51 5.04 18.21 41.93 16.51 38.28
Table 3: Results on the test sets of Reddit, XSum, and PubMed. Our model achieves significant improvements over the baseline model BartExt. LEAD selects the first k sentences of the source document as the output summary, and ORACLE is the upper bound of extractive methods.

5.2 Abstractive results

Early work has also successfully applied reinforcement learning to abstractive summarization Paulus et al. (2017); Li et al. (2019), but we did not find related work that successfully combines reinforcement learning with strong pre-trained models. Therefore, most of our baselines are strong pre-trained models fine-tuned with the NLLLoss. Our results are shown in Table 4. Due to the huge cost of using a large pre-trained model with the input length set to 1024, we also report results with 512 input tokens, which already significantly outperform other baselines with a longer input length (1024). CoLoAbs improves over the very strong BART-large baseline by 2.17 R-1 without adding parameters or modules. Additionally, our method outperforms all one-stage baseline systems by a large margin. We also conduct experiments on long document summarization datasets (see Table 11 in the Appendix).

Model R-1 R-2 R-L
BertSumAbs Liu and Lapata (2019) 41.72 19.39 38.76
Pegasus Zhang et al. (2020) 44.17 21.47 41.11
BART Lewis et al. (2020) 44.16 21.28 40.90
BART+R3F Aghajanyan et al. (2020) 44.38 21.53 41.17
BART (len=512) 43.82 20.96 40.63
ConSum Sun and Li (2021) 44.53 21.54 41.57
SeqCo Xu et al. (2021) 45.02 21.80 41.75
Naive one-stage (ROUGE, len=512) 43.90 20.88 40.69
CoLoAbs (ROUGE, len=512) 45.45 21.53 42.35
CoLoAbs (ROUGE) 46.33 22.15 43.08
Table 4: Abstractive results on the CNN/DM test set. len denotes the maximum input length of the encoder; results without this marker use 1024 tokens as input. ConSum Sun and Li (2021) and SeqCo Xu et al. (2021) in the second block are previous contrastive learning based methods without re-ranking.

5.3 Comparison with Multi-stage Systems

Apart from one-stage systems, we also compare our model with powerful multi-stage systems: CTRLSum and the multi-stage re-ranking models. CTRLSum requires other systems to produce a control signal in advance.

Model R-1 R-2 R-L
extractive systems
CoLoExt 44.27 21.01 40.34
BERT+BERTR Zhong et al. (2020) 44.22 20.62 40.38
BERT+RoBERTaR  Zhong et al. (2020) 44.41 20.86 40.55
CoLoExt +RoBERTaR 44.70 21.03 40.74
abstractive systems
CoLoAbs 46.33 22.15 43.08
CTRLSum He et al. (2020) 45.65 22.35 42.50
BART+RoBERTaR Liu and Liu (2021) 46.67 22.15 43.54
Table 5: Comparison with multi-stage systems. RoBERTaR means a RoBERTa re-ranker is added to the summarization model.

Performance

The addition of another pre-trained model implicitly introduces more parameters and knowledge, so directly comparing one-stage systems with two-stage systems is usually unfair. Nevertheless, we show that CoLo achieves comparable performance with the multi-stage systems. As shown in the first part of Table 5, compared with the multi-stage models that ensemble another pre-trained encoder as a re-ranker, CoLoExt still performs better than their BERT+BERTR version without the need to re-feed the generated candidates to another model, and we obtain a ~5× speed-up over the multi-stage systems. We also try concatenating a RoBERTa re-ranker to our model; the results show that CoLoExt can be further improved by combining it with another pre-trained re-ranker, reaching a new extractive SOTA on the CNN/DM test set. For abstractive models, our end-to-end model still lags behind multi-stage systems, but we do not need to train another model and we keep the inference speed similar to that of the baseline models.

Inference Efficiency

Despite the fact that multi-stage models outperform all end-to-end systems, they frequently suffer from inefficiency. In this part, we mainly focus on analysing the efficiency of 3 kinds of systems: 1) baseline, which is trained only with the BCELoss or NLLLoss; 2) CoLo, our end-to-end contrastive learning framework; 3) Rerank, the multi-stage re-ranking systems, which have 110M more parameters than the baseline model and CoLo. The efficiency experiments for training and inference are conducted on 24G RTX 3090 GPUs and 12G TITAN XP GPUs, respectively. For extractive summarization, Figures 3(a) and 3(b) give a detailed comparison of the inference speed of the three models; the Y-axis represents the number of samples processed per second. To give a fair comparison, we test inference efficiency in two settings: i) all models are tested with the batch size fixed to 1; ii) all models are tested with the maximum batch size allowed by the GPU. While the candidate size varies from 4 to 32, our model has a 3×~8× speed-up ratio over the multi-stage re-ranking model. When the candidate size is set to 20, the baseline model processes ~31.2/41.9 samples per second (batch=1/MAX), the decoding speed of CoLoExt is ~30.4/38.9 samples/s (batch=1/MAX), and the decoding speed of the multi-stage re-ranking model is only ~4.9/7.0 samples/s (batch=1/MAX). Our model does almost no harm to inference speed while the candidate size |\mathcal{C}| is less than 16; when the candidate size grows larger, more time is spent on generating the representations of the candidates. Figure 4 shows the comparison of inference time for the abstractive models, where the bottleneck is the auto-regressive generation process. Our abstractive model generally saves ~0.5 GPU hours compared to the re-ranking model.
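For clarity about what the samples/s figures mean, here is a minimal timing sketch of the kind of measurement involved; it is an illustration, not the authors' benchmarking script.

```python
import time
import torch

@torch.no_grad()
def samples_per_second(summarize_batch, batches):
    """summarize_batch: callable decoding one batch; batches: iterable of batches."""
    total = 0
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for batch in batches:
        summarize_batch(batch)          # run decoding for this batch
        total += len(batch)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return total / (time.perf_counter() - start)
```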

Figure 3: Inference speed on CNN/DM (extractive): (a) batch=1, (b) batch=MAX. The X-axis is the candidate size |\mathcal{C}|, and the Y-axis is the number of samples processed per second. batch=MAX means we use the maximum batch size allowed by GPU memory.
Systems         Stage1   Preprocess   Stage2   Total hours
Ext+RoBERTaR    4        5 (+20)      128      137 (+20)
CoLoExt         7        –            –        7 (↓130)
Abs+RoBERTaR    80       132 (+18)    128      340 (+18)
CoLoAbs         224      –            –        224 (↓116)
Table 6: GPU hours spent on each training stage on the CNN/DM training set (reported numbers are rounded down to integers). Ext+RoBERTaR/Abs+RoBERTaR denote the multi-stage re-ranking systems with an extractive/abstractive summarizer. (+18)/(+20) means 18/20 CPU hours are spent calculating the ROUGE score of each candidate with 32 threads.

Training Efficiency

Table 6 gives an overview of the training time of our system and of the multi-stage models on the CNN/DM training set. The general pipeline of the multi-stage models is: i) training a generator (Stage1), ii) preprocessing, iii) training a re-ranker (Stage2). The preprocessing includes generating the training/dev/test sets for the re-ranker and sorting candidates by ROUGE. For the extractive system, we save 130 GPU hours compared to the multi-stage systems, whose bottleneck is training the re-ranking model. For the abstractive model, apart from the 128 GPU hours spent on training the re-ranker, using beam search to generate the training set for the re-ranker is also very time-consuming; overall, we save 116 GPU hours and 18 CPU hours.

Figure 4: Inference time on the test set as a function of beam size for the abstractive models. We use the maximum batch size allowed by GPU memory.

5.4 Ablation for Different Discriminators

In addition to ROUGE, we also use other metrics as the discriminator (shown in Table 7). ROUGE and JS-2 are based on lexical matching, while BERTScore and MoverScore are based on contextualized embeddings from BERT. Our model generally obtains the best results on the metric used in training. Because these metrics are not independent, using any one of them as the discriminator also brings significant improvements on the others. Overall, the neural evaluation metrics BERTScore and MoverScore bring larger improvements than the metrics based on lexical matching, but incorporating neural metrics in training noticeably increases the training time.

Metric Used R-1 R-2 R-L JS-2 BS MS
Baseline 42.78 20.24 39.23 54.24 43.52 58.27
ROUGE-1,2 44.10 20.97 40.19 54.07 44.26 58.63
ROUGE-L 44.09 20.93 40.34 54.06 44.32 58.60
JS-2 43.85 21.13 39.98 53.92 44.19 58.60
BERTScore 44.27 21.01 40.34 54.08 44.85 58.71
MoverScore 44.21 20.81 40.25 54.33 44.47 58.78
Table 7: Extractive results of using different evaluation metrics as the discriminator on CNN/DM test set.

5.5 Visualization Experiment

We conduct a visualization experiment on our extractive model to take a close look at the distribution of candidates in the semantic space. We randomly sample 100 documents with more than 10 sentences from the CNN/DM test set. We first select the top 10 sentences based on the predicted scores from the classifier. We set the possible numbers of sentences to \{2,3\}, resulting in a candidate size of C_{10}^{2}+C_{10}^{3}=165 for each sample. We visualize the learned embeddings of these candidates and the anchor in a two-dimensional space using the t-SNE algorithm. As shown in Figure 5, there is an obvious cluster of the top 50 candidates (colored in purple), and candidates with higher scores are closer to the anchor, while the distribution of uninformative candidates (gray and cyan points) is relatively random.
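A minimal sketch of this visualization procedure with scikit-learn's t-SNE; the candidate and anchor embeddings are random placeholders standing in for the model's learned representations.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# placeholder features: 165 candidate vectors plus one anchor (document) vector
cand_embs = np.random.randn(165, 768)
doc_emb = np.random.randn(1, 768)

# project anchor + candidates jointly into 2D
points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(
    np.concatenate([doc_emb, cand_embs]))

# color by ROUGE rank group: anchor red, top-50 purple, 51-100 cyan, rest gray
colors = ["red"] + ["purple" if r < 50 else "cyan" if r < 100 else "gray"
                    for r in range(165)]
plt.scatter(points[:, 0], points[:, 1], c=colors, s=8)
plt.savefig("tsne_candidates.png")
```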

Figure 5: t-SNE visualization of two examples from the CNN/DM test set. We divide the candidates into 3 groups based on ROUGE score: candidates ranked 1~50, candidates ranked 51~100, and candidates ranked 101~150. The red point denotes the anchor, and the purple/cyan/gray points denote the top 50/100/150 candidates, respectively.

5.6 Human Evaluation

We also conduct a human evaluation of our models to obtain more accurate results. We randomly select 30 articles from the CNN/DM test set; each article has 5 candidate summaries, 4 from automatic systems and 1 gold reference. We recruit 2 PhD students majoring in computer science and ask them to rank the candidate summaries based on fluency and informativeness. If two of these systems generate the same summary for a source document, the sample is filtered out. As we can see from Table 8, CoLoExt with BERTScore as the discriminator achieves the best result among all automatic systems. However, using BERTScore considerably increases training time. We also suggest taking JS-2 divergence as the discriminator, which also performs well in the human evaluation.

Metric Used 1st 2nd 3rd 4th 5th Avg R.
Baseline 0% 8.3% 8.3% 23.3% 60% 4.33
JS2 6.7% 25% 33.3% 21.7% 13.3% 3.10
R1+R2 5% 20% 28.3% 30.3% 16.7% 3.35
BERTScore 10% 35% 20% 25% 10% 2.90
Gold label 78.3% 11.7% 10% 0% 0% 1.32
Table 8: Human evaluation results. Baseline means the BartExt model, and Gold label means the human-written summary. Avg R. denotes the average ranking of the system.

6 Limitations and Future Work

The most well-known contrastive learning framework, SimCLR Chen et al. (2020), constructs positive and negative pairs from training samples within the same batch; in comparison, drawing positive-negative pairs from the summarization model requires more training time. Ideally, providing more positive and negative samples would benefit the performance of CoLo. However, decoding with a very large beam size in training mode costs more GPU memory and training time. Future work can search for an efficient way to construct these positive-negative pairs for re-ranking during training.

7 Conclusion

We introduce CoLo, a contrastive learning based framework for one-stage summarization where positive-negative pairs are generated directly from the summarizer via online sampling. CoLo can be easily applied to both extractive and abstractive methods. Results show that we greatly exceed previous state-of-the-art one-stage systems with no additional parameters and no obvious decline in inference efficiency.

Acknowledgement

We would like to thank Yixin Liu and the anonymous reviewers for their valuable advice. This work was supported by the National Key Research and Development Program of China (No.2020AAA0106702) and National Natural Science Foundation of China (No.62022027).

References

  • Aghajanyan et al. (2020) Armen Aghajanyan, Akshat Shrivastava, Anchit Gupta, Naman Goyal, Luke Zettlemoyer, and Sonal Gupta. 2020. Better fine-tuning by reducing representational collapse. arXiv preprint arXiv:2008.03156.
  • (2) Chenxin An, Jiangtao Feng, Kai Lv, Lingpeng Kong, Xipeng Qiu, and Xuanjing Huang. Cont: Contrastive neural text generation. In Advances in Neural Information Processing Systems.
  • An et al. (2021a) Chenxin An, Ming Zhong, Yiran Chen, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2021a. Enhancing scientific papers summarization with citation graph. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 12498–12506.
  • An et al. (2021b) Chenxin An, Ming Zhong, Zhichao Geng, Jianqiang Yang, and Xipeng Qiu. 2021b. Retrievalsum: A retrieval enhanced framework for abstractive summarization. arXiv preprint arXiv:2109.07943.
  • Bae et al. (2019) Sanghwan Bae, Taeuk Kim, Jihoon Kim, and Sang-goo Lee. 2019. Summary level training of sentence rewriting for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 10–20.
  • Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. arXiv preprint arXiv:1506.03099.
  • Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR.
  • Cheng and Lapata (2016) Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 484–494.
  • Cohan et al. (2018) Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), volume 2, pages 615–621.
  • Gehrmann et al. (2018) Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109.
  • He et al. (2020) Junxian He, Wojciech Kryściński, Bryan McCann, Nazneen Rajani, and Caiming Xiong. 2020. Ctrlsum: Towards generic controllable text summarization. arXiv preprint arXiv:2012.04281.
  • Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1684–1692.
  • Kim et al. (2019) Byeongchang Kim, Hyunwoo Kim, and Gunhee Kim. 2019. Abstractive summarization of reddit posts with multi-level memory networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2519–2531.
  • Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Lee et al. (2020) Seanie Lee, Dong Bok Lee, and Sung Ju Hwang. 2020. Contrastive learning with adversarial perturbations for conditional text generation. arXiv preprint arXiv:2012.07280.
  • Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.
  • Li et al. (2019) Siyao Li, Deren Lei, Pengda Qin, and William Yang Wang. 2019. Deep reinforcement learning with distributional semantic rewards for abstractive summarization. arXiv preprint arXiv:1909.00141.
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
  • Liu and Lapata (2019) Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3721–3731.
  • Liu and Liu (2021) Yixin Liu and Pengfei Liu. 2021. Simcls: A simple framework for contrastive learning of abstractive summarization. arXiv preprint arXiv:2106.01890.
  • Louis and Nenkova (2013) Annie Louis and Ani Nenkova. 2013. Automatically assessing machine summary content without a gold standard. Computational Linguistics, 39(2):267–300.
  • Nallapati et al. (2017) Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. In Thirty-First AAAI Conference on Artificial Intelligence.
  • Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023.
  • Narayan et al. (2018a) Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018a. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807.
  • Narayan et al. (2018b) Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018b. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1747–1759.
  • Paulus et al. (2017) Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
  • Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389.
  • Stahlberg and Byrne (2019) Felix Stahlberg and Bill Byrne. 2019. On nmt search errors and model errors: Cat got your tongue? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3356–3362.
  • Sun and Li (2021) Shichao Sun and Wenjie Li. 2021. Alleviating exposure bias via contrastive learning for abstractive text summarization. arXiv preprint arXiv:2108.11846.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Vijayakumar et al. (2016) Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
  • Xu et al. (2021) Shusheng Xu, Xingxing Zhang, Yi Wu, and Furu Wei. 2021. Sequence level contrastive learning for text summarization. arXiv preprint arXiv:2109.03481.
  • Zhang et al. (2020) Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328–11339. PMLR.
  • Zhang et al. (2019) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.
  • Zhao et al. (2019) Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M Meyer, and Steffen Eger. 2019. Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578.
  • Zhong et al. (2020) Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2020. Extractive summarization as text matching. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 6197–6208. Association for Computational Linguistics.
  • Zhong et al. (2019) Ming Zhong, Pengfei Liu, Danqing Wang, Xipeng Qiu, and Xuan-Jing Huang. 2019. Searching for effective neural extractive summarization: What works and what’s next. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1049–1058.
  • Zhong et al. (2022) Ming Zhong, Yang Liu, Suyu Ge, Yuning Mao, Yizhu Jiao, Xingxing Zhang, Yichong Xu, Chenguang Zhu, Michael Zeng, and Jiawei Han. 2022. Unsupervised summarization with customized granularities. arXiv preprint arXiv:2201.12502.

Appendix

Appendix A Results on Long Document Summarization

We test our method on two long document datasets, PubMed and SSN. A crucial problem for the two-stage model is that it faces great difficulty when running on long document summarization datasets. For PubMed, gold summaries contain an average of 7.6 sentences and 260 tokens after conversion to ids. If we use a candidate size |\mathcal{C}|=C_{7}^{6}+C_{8}^{7} and limit the maximum candidate length to 300, the total number of input tokens for the re-ranker goes up to (300*15)*batch. This causes an out-of-memory problem on a 12G GPU during training even when the batch size is set to 1.
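The re-ranker input size quoted above can be verified with a quick calculation; the 300-token limit and the candidate counts are taken from the text.

```python
import math

candidates = math.comb(7, 6) + math.comb(8, 7)  # |C| = 7 + 8 = 15 candidates
tokens_per_example = 300 * candidates           # 4500 re-ranker input tokens per document
print(candidates, tokens_per_example)
```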

The two-stage model Rerank in Table 9 is implemented with BartExt as the generator and RoBERTa as the re-ranker, and we report the maximum GPU memory footprint of each system. Compared with the baseline model, our approach does not need more GPU memory during either training or inference, while the pipeline model needs 9.7G of memory even with a very small candidate size of 8. Since our learning approach does not suffer from the out-of-memory problem, we are able to experiment with larger candidate sizes and obtain further performance improvements.

Model |𝒞||\mathcal{C}| Mem(T) Mem(I)
BartExt 5.5G 1.5G
multi-stage systems
Rerank C76C_{7}^{6} 8.8G 1.9G
Rerank C87C_{8}^{7} 9.7G 2.1G
Rerank C76C_{7}^{6} + C87C_{8}^{7} OOM 2.4G
ours
CoLoExt C76C_{7}^{6} 5.6G 1.5G
CoLoExt C87C_{8}^{7} 5.6G 1.5G
CoLoExt C76C_{7}^{6} + C87C_{8}^{7} 5.6G 1.5G
Table 9: GPU memory test on the PubMed test set. Mem(T)/Mem(I) denotes the maximum GPU memory used during training/inference with the batch size set to 1. Our model hardly needs more GPU memory. All these experiments are run on a single 12G GeForce TITAN XP GPU.

As we can see in Tables 10 and 11, both the extractive and abstractive models outperform the baseline models on long documents. Introducing more samples in contrastive learning helps performance, while the previous two-stage framework is limited in expanding the candidate size due to its huge memory consumption.

Model R-1 R-2 R-L
SSN
ORACLE 51.04 23.34 45.88
BertSum 42.41 13.10 37.97
BartExt 43.53 13.45 38.00
CoLoExt 42.78 14.21 42.18
PubMed
ORACLE 45.12 20.33 40.19
BartExt 41.40 16.18 37.89
Rerank (two-stage)
    |𝒞||\mathcal{C}|= C76C_{7}^{6} 41.80 16.28 38.20
    |𝒞||\mathcal{C}|=C87C_{8}^{7} 41.78 16.33 38.27
CoLoExt
    |𝒞||\mathcal{C}|= C76C_{7}^{6} 41.74 16.28 38.20
    |𝒞||\mathcal{C}|=C87C_{8}^{7} 41.78 16.33 38.27
    |𝒞||\mathcal{C}|= C76C_{7}^{6} + C87C_{8}^{7} 41.93 16.51 38.28
Table 10: Extractive results on the test sets of the long document datasets PubMed and SSN. Introducing more samples in contrastive learning helps performance, but the two-stage model has difficulty training with a large candidate size.
Model R-1 R-2 R-L
SSN
BertSumAbs 45.23 16.56 41.25
Bart-base 45.89 16.75 41.51
CoLoAbs(BERTScore) 46.23 17.09 42.10
CoLoAbs(ROUGE) 46.57 17.21 42.18
PubMed
BertSumAbs 42.90 17.35 38.88
Bart-base 43.30 17.01 39.34
CoLoAbs(BERTScore) 43.63 17.32 40.01
CoLoAbs(ROUGE) 43.98 17.39 40.35
Table 11: Abstractive results on the test sets of the long document datasets PubMed and SSN. Due to the longer generation process for long documents, we use Bart-base as the backbone.